---
title: "Dump of old ML notes"
author: "Vahram Poghosyan"
date: "2025-11-03"
categories: ["Machine Learning", "Large Language Models", "Deep Learning"]
format:
  html:
    toc: true
    toc-depth: 5
    code-fold: true
jupyter: python3
include-after-body:
  text: |
    <script type="application/javascript" src="../../javascript/light-dark.js"></script>
---

# Logistic Regression



So far, we've seen Linear Regression (which is a *regression* problem), and Classification Trees as well as Perceptron (which are *classification* problems). On a high level, regression problems are those that *fit* a continuous-valued output to a discrete set of inputs. For instance, Linear Regression fits an $n$-dimensional line (a continuous output) to the data set of $n$-dimensional points (a discrete input). Logistic Regression is a *regression* problem (as its name correctly implies) whose continuous output is within the probabilistic range $(0,1)$ as opposed to Linear Regression's unbounded output in $\mathbb{R}$ (or $\mathbb{R^n}$ in the $n$-dimensional case). Specifically, Linear Regression hopes that the data fits the model $\mathbb{E}[Y|X] = w^TX$ (so that it may fit a line), whereas Logistic Regression hopes it fits the model $\mathbb{P}[Y = y|X = x]=\sigma(yw^Tx)$ for a label $y \in \{-1,+1\}$, where $\sigma$ is the *sigmoid function* we will define in time (so that it may fit a *sigmoid*). The fact that Logistic Regression outputs a value in $(0,1)$ makes it practical for use in classification problems since we can use its output (which is just a probability) to decide a label. 



**Recall Perceptron**

What happens when the data supplied to Perceptron not only lacks the margin $\rho$ needed for the $O(R^2/\rho^2)$ mistake-bound, but is not even linearly separable at all? Perceptron could theoretically still be run with some error, and in minimizing this error we should get a somewhat acceptable classification. 

To that end, we define the following *0-1 Loss Function*.

$$
\Phi_{0-1}(z) =
\begin{cases}
  1 & \text{if } z \le 0 \\
  0 & \text{if } z > 0
\end{cases}
$$
Now, note that Perceptron's guess for a label $y^i \in \{-1,+1\}$ is $sign(w^Tx^i)$.
When $y^i(w^Tx^i)>0$, the label and the guess have the same sign $\implies$ no penalty should apply. 
When $y^i(w^Tx^i)<0$, the label and the guess have opposite signs $\implies$ a penalty should apply. 

Then clearly  $\Phi_{0-1}(y^iw^Tx^i)$ is the correct loss function and, if there are a total of $m$ data points, then $\frac{1}{m}\sum_{i=1}^m \Phi_{0-1}(y^iw^Tx^i)$ is the average loss over the entire dataset. This is the *objective function* to minimize and so we're faced with the following optimization problem:

$$ \min_{w} (\frac{1}{m}\sum_{i=1}^m \Phi_{0-1}(y^iw^Tx^i)) \ \ ^{\dagger} $$

Recall that the loss associated to linear regression was *Least Squares Loss* which is the quantity inside the summation of its objective function $\min_w (\frac{1}{m} \sum_{i=1}^{m}(w^Tx^i - y^i)^2)$. 

Unlike *Least Squares Loss*, *0-1 Loss* is neither convex, nor differentiable. This, unfortunately, means that we cannot minimize it using *gradient descent*. In fact there's more bad news... the problem of minimizing $\dagger$ turns out to be NP-hard. 
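
The objective $\dagger$ is easy to *evaluate* even though it is hard to minimize. Here is a minimal sketch in plain Python; the toy data set and the helper names (`dot`, `zero_one_loss`, `average_zero_one_loss`) are made up for illustration and are not part of the original notes.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def zero_one_loss(z):
    """0-1 loss: 1 if z <= 0 (a mistake, or a point on the boundary), else 0."""
    return 1 if z <= 0 else 0


def average_zero_one_loss(w, xs, ys):
    """Average 0-1 loss of the classifier sign(w^T x) over the dataset."""
    return sum(zero_one_loss(y * dot(w, x)) for x, y in zip(xs, ys)) / len(xs)


# Hypothetical toy data: four points in R^2 with labels in {-1, +1}.
xs = [(1.0, 2.0), (2.0, -1.0), (-1.0, -1.5), (-2.0, 0.5)]
ys = [+1, +1, -1, +1]
w = (1.0, 0.0)  # classify by the sign of the first coordinate

# Only the last point is misclassified, so the average loss is 1/4.
print(average_zero_one_loss(w, xs, ys))
```

Note that nudging `w` slightly almost never changes this value, which is one way to see why gradient-based methods get no signal from *0-1 Loss*.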



**Surrogate Loss Functions**

What if we relaxed *0-1 Loss* to a similarly behaved function which *is* convex and differentiable? Then minimizing this new *surrogate loss function* would closely approximate minimizing the *0-1 Loss*. 

To this end, we define *Logistic Loss* as: $\Phi_{log}(z) = \log(1+e^{-z})$.

Note what happens when we put $z = y^iw^Tx^i$ as in *0-1 Loss*...

- When $y^iw^Tx^i \ll 0$, the point $x^i$ lies far on the wrong side of the hyperplane with normal $w$ (i.e. $sign(w^Tx^i)$ disagrees with $y^i$ by a wide margin), so the guess is *very* wrong. 

  In this case note that $\Phi_{log}(y^iw^Tx^i)$ is very large, applying a large penalty. 

- When $y^iw^Tx^i \gg 0$, the point $x^i$ lies far on the correct side of the hyperplane with normal $w$, so the guess is *very* correct. 

  In this case note that $\Phi_{log}(y^iw^Tx^i)$ is very small, applying almost no penalty.

In both cases, we no longer have a perfect measure for the loss in terms of the number of mistakes made (which *0-1 Loss* provided) since we're still applying some penalty to correct classifications ($\log(1+e^{-z})$ is never $0$). But some sloppiness is to be expected from a *relaxation* procedure. Overall, *Logistic Loss* has the desired properties of a loss function.

So, our new optimization problem is:  $\min_w L(w)$ where $L(w) = \frac{1}{m}\sum_{i=1}^m \log(1 + e^{-y^iw^Tx^i})$. 
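
To get a feel for the behavior described above, here is a small sketch that evaluates *Logistic Loss* at a few values of $z$ and defines the averaged objective $L(w)$. The function names are illustrative, and the branch on the sign of $z$ is only a numerical-stability choice on my part (it avoids overflow in `exp`), not something the notes prescribe.

```python
import math


def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def logistic_loss(z):
    """log(1 + e^{-z}), arranged so that exp never overflows."""
    if z >= 0:
        return math.log1p(math.exp(-z))
    # For z < 0: log(1 + e^{-z}) = -z + log(1 + e^{z}).
    return -z + math.log1p(math.exp(z))


def average_logistic_loss(w, xs, ys):
    """L(w) = (1/m) * sum_i log(1 + exp(-y^i w^T x^i))."""
    return sum(logistic_loss(y * dot(w, x)) for x, y in zip(xs, ys)) / len(xs)


# Large penalty for very wrong guesses, almost none for very correct ones.
for z in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(z, logistic_loss(z))
```

At $z = 0$ the loss is $\log 2$, and for very negative $z$ it grows roughly like $-z$.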



**Where does Logistic Loss Come From? Enter the Sigmoid Function...**

The *sigmoid function* is defined as $\sigma(z) = \frac{1}{1+e^{-z}}$. Crucially, it maps $\mathbb{R}$ to the probability range $(0,1)$ and has the property $\sigma(z) + \sigma(-z) = 1$. 

Recall that there are only two events for the label, $y^i = -1$, and $y^i = +1$ each of which corresponds to $y^iw^Tx^i = -w^Tx^i$, and $y^iw^Tx^i = +w^Tx^i$. And by the property above, $\sigma(+w^Tx^i) + \sigma(-w^Tx^i) = 1$ (i.e. the probability on the entire event space is 1). So $\sigma(y^iw^Tx^i)$ satisfies the probability laws (namely *nonnegativity*, *additivity*, and *normalization*). 
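
The two properties of the sigmoid used here (values in $(0,1)$, and $\sigma(z) + \sigma(-z) = 1$) can be checked numerically. This is just a sanity-check sketch; the stable two-branch form of `sigmoid` is my own implementation choice.

```python
import math


def sigmoid(z):
    """Numerically stable 1 / (1 + e^{-z})."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


for z in (-5.0, -1.0, 0.0, 0.5, 3.0):
    # The two columns on the right always sum to 1 (up to rounding).
    print(z, sigmoid(z), sigmoid(-z))
```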

So, we can model the *problem of classifying linearly inseparable data* as follows.

Let $Y$ be a random variable corresponding to the labels, $X$ a random variable corresponding to the samples. Furthermore, fix a normal vector $w$ that best classifies the samples. 

Set $\mathbb{P}[Y = y^i \ | \ X=x^i;w] = \sigma(y^iw^Tx^i)$

This means that as $y^iw^Tx^i > 0$ (which indicates that the point would *certainly* have been classified correctly in the simple, linearly separable, case) grows more positive, the probability that $Y$ actually *does* take on the value of the corresponding label $y^i$ (and thus that the point *is* correctly classified in the actual, linearly inseparable, case) grows large. Conversely, the probability of misclassification is $1-\sigma(y^iw^Tx^i) = \sigma(-y^iw^Tx^i)$, which is small. 

So, how do we get from this model to minimizing *Logistic Loss*? 

We fixed parameter $w$ above to define the model. We can now ask ourselves about its *likelihood* (i.e. the joint probability of seeing a set of observations given $w$). That is, how likely it is to be the one we seek, given a set of observations. This question falls under the broad category of *Maximum Likelihood Estimation* in which we use a set of observations (in this case *not experimental* observations, but *desired* observations $\{y^i\}$), to decide the unknown parameter $\hat{w}$ of a distribution. In other words, given a set of observations, we work back to surmise the probability distribution they could've been drawn from. 

$\text{Likelihood}(w) = \prod_{i=1}^m \mathbb{P}[Y = y^i \ | \ X=x^i;w] = \prod_{i=1}^m \sigma(y^iw^Tx^i)$

That is, $w$ has more likelihood the larger the probabilities of guessing each point correctly in the training set are. Some choice of $w$ (denoted $\hat{w}$) maximizes the product; the question is which.

Instead of maximizing the *likelihood* of $w$, we can maximize its *log-likelihood* in order to turn the product into a sum.

$\text{Log-Likelihood}(w) =  \sum_{i=1}^m \log(\sigma(y^iw^Tx^i)) = - \sum_{i=1}^m \log(1+e^{-y^iw^Tx^i}) = -mL(w)$ where $L(w)$ is the average *Logistic Loss* over the dataset as seen before. 

But $\max_w -mL(w)$ amounts to $\min_w mL(w)$ which is equivalent to just $\min_w L(w)$. So, from simply modeling the problem of classification of a linearly inseparable dataset we've arrived at the problem of minimizing *Logistic Loss*. 
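
For completeness, the middle equality in the log-likelihood computation follows directly from the definition of the sigmoid. For any $z \in \mathbb{R}$:

$$
\log \sigma(z) = \log \frac{1}{1+e^{-z}} = -\log(1+e^{-z}) = -\Phi_{log}(z)
$$

Substituting $z = y^iw^Tx^i$ and summing over $i = 1, ..., m$ gives $\sum_{i=1}^m \log \sigma(y^iw^Tx^i) = -\sum_{i=1}^m \Phi_{log}(y^iw^Tx^i) = -mL(w)$.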



**Gradient Descent for Logistic Regression**

Note that for $z \in \mathbb{R}$, the derivative of *Logistic Loss* is: 

$\Phi'_{log}(z) = \frac{d \Phi_{log}(z)}{dz} = \frac{-e^{-z}}{1+e^{-z}} = \frac{e^z}{e^z} \cdot \frac{-e^{-z}}{1+e^{-z}} = - \frac{1}{1+e^z} = -\sigma(-z)$.

Then, for a fixed $k \leq m$, the partial derivatives w.r.t. the coordinates of $w$ of $\Phi_{log}(y^kw^Tx^k)$ are:

$$
\frac{\partial \Phi_{log}(y^kw^Tx^k)}{\partial w_i} = \frac{d \Phi_{log}(y^kw^Tx^k)}{d (y^kw^Tx^k)} \cdot \frac{\partial(y^kw^Tx^k)}{\partial w_i} = -\sigma(-y^kw^Tx^k) \cdot (y^kx^k_i)
$$

These are the coordinates of the gradient:

$$
\nabla \Phi_{log}(y^kw^Tx^k) = \langle -\sigma(-y^kw^Tx^k) \cdot (y^kx^k_1) \ , ... , -\sigma(-y^kw^Tx^k) \cdot (y^kx^k_n) \rangle ^T
$$

Then $\nabla L(w)$ is the average of these gradients over the dataset: $\nabla L(w) = \frac{1}{m}\sum_{k=1}^m \nabla \Phi_{log}(y^kw^Tx^k)$. 
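
Putting the pieces together, the gradient formula above plugs into plain gradient descent. The following is a minimal sketch rather than a tuned implementation: the toy data set (which is not linearly separable through the origin), the fixed step size of `0.5`, the number of steps, and all function names are assumptions chosen for illustration.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sigmoid(z):
    """Numerically stable 1 / (1 + e^{-z})."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def average_logistic_loss(w, xs, ys):
    """L(w) = (1/m) * sum_k log(1 + exp(-y^k w^T x^k)) = -(1/m) * sum_k log(sigmoid(...))."""
    return -sum(math.log(sigmoid(y * dot(w, x))) for x, y in zip(xs, ys)) / len(xs)


def logistic_loss_gradient(w, xs, ys):
    """Average over k of -sigmoid(-y^k w^T x^k) * y^k * x^k."""
    m = len(xs)
    grad = [0.0] * len(w)
    for x, y in zip(xs, ys):
        coeff = -sigmoid(-y * dot(w, x)) * y
        for i, x_i in enumerate(x):
            grad[i] += coeff * x_i / m
    return grad


def gradient_descent(xs, ys, step_size=0.5, num_steps=200):
    """Plain gradient descent on L(w), starting from the zero vector."""
    w = [0.0] * len(xs[0])
    for _ in range(num_steps):
        g = logistic_loss_gradient(w, xs, ys)
        w = [w_i - step_size * g_i for w_i, g_i in zip(w, g)]
    return w


# Hypothetical toy data in R^2 with labels in {-1, +1}.
xs = [(1.0, 2.0), (2.0, -1.0), (-1.0, -1.5), (-2.0, 0.5), (1.5, 0.5), (-1.5, -0.5)]
ys = [+1, +1, -1, +1, +1, -1]

w_hat = gradient_descent(xs, ys)
print("loss at w = 0:", average_logistic_loss([0.0, 0.0], xs, ys))
print("loss at w_hat:", average_logistic_loss(w_hat, xs, ys))
```

Because $L(w)$ is convex and differentiable, a small enough fixed step size decreases the loss at every step; in practice the step size would be tuned or chosen by a line search.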



# **Mistake Bounded Learning**



> **Definition:**
> We say that a learner has *mistake-bound* $t$ if for every sequence of challenges the learner makes at most $t$ mistakes.

**Example: Learning Monotone Disjunctions**

Suppose we want to come up with a learner for the function class $C = \{monotone \ \ disjunctions \ \ on \ \ n \ \ literals\}$. 

First, let's look at some examples of functions that inhabit this class. *Monotone* refers to the fact that negation of literals is not allowed, and *disjunction* refers to the Boolean $\lor$ (the 'OR' operator). So $C$ is a class of Boolean functions which receive Boolean strings of length $n$ as input (i.e. $x \in \{0,1\}^n$). For example, $f(x) = x_1 \lor x_2 \lor x_3$ is a monotone disjunction on $3$ literals (i.e. $x \in \{0,1\}^3$).

Let $c \in C$ be the true function the learner is trying to guess. The hypothesis $h \in C$ is the current state of the learner (i.e. its current best guess for $c$). Set the initial state to $h_0(x) = x_1 \lor x_2 \lor ... \lor x_n$. Now suppose the learner is given the challenge $0110 ...0$. Its classification will be $h_0(0110...0) = 1$. If this is correct, the learner moves on to the next challenge without updating its state. If it's incorrect (i.e. $c(0110...0) = 0$), then the learner knows that the literals $x_2$ and $x_3$ are not in $c$. The new state is $h_1(x) = x_1 \lor x_4 \lor ... \lor x_n$.

Note that two literals were eliminated from consideration. In fact, every time the learner makes a mistake at least one literal must be eliminated. Since there are $n$ literals in any given challenge, the learner can make at most $n$ mistakes.
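
The elimination strategy described above can be sketched in a few lines. This is an illustrative implementation, not code from the original notes: the class name, the 0-indexed literals, and the hypothetical target $c(x) = x_0 \lor x_3$ on $n = 4$ literals are all assumptions.

```python
from itertools import product


class MonotoneDisjunctionLearner:
    """Elimination learner for monotone disjunctions on n literals (0-indexed)."""

    def __init__(self, n):
        self.n = n
        self.literals = set(range(n))  # h_0 = x_0 OR x_1 OR ... OR x_{n-1}
        self.mistakes = 0

    def predict(self, x):
        return int(any(x[i] == 1 for i in self.literals))

    def observe(self, x, label):
        """Classify the challenge x, then update the state if the guess was wrong."""
        guess = self.predict(x)
        if guess != label:
            self.mistakes += 1
            if label == 0:
                # Any literal set to 1 in x cannot appear in the target c.
                self.literals -= {i for i in range(self.n) if x[i] == 1}
        return guess


def target(x):
    """Hypothetical true function c(x) = x_0 OR x_3."""
    return int(x[0] == 1 or x[3] == 1)


n = 4
learner = MonotoneDisjunctionLearner(n)
for x in product((0, 1), repeat=n):  # present every challenge once
    learner.observe(x, target(x))

print(sorted(learner.literals), learner.mistakes)
```

Since the hypothesis always contains every literal of $c$, the only possible mistake is guessing $1$ when the label is $0$, and each such mistake removes at least one literal.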


**Example: Learning Disjunctions** 

Now suppose the goal is to learn the function class $C' = \{disjunctions \ \ on \ \ n \ \ literals\}$. With some extra setup we can re-use the learner in the previous example to learn this larger class. 

First, here's an example of a function in $C'$: $f(x) = \neg x_1 \lor x_2 \lor \neg x_3$.

Suppose the learner again receives the challenge $0110...0$ and that its initial state is, again, $h_0$. If the learner's classification is incorrect, there's no way to distinguish between the following cases: 

- Case 1: The actual function $c$ contains neither the literals $x_2$ nor $x_3$
- Case 2: The actual function $c$ contains both the literals $\neg{x_2}$ and $\neg{x_3}$ 
- Case 3: The actual function $c$ contains one of the literals $\neg x_2$ or $\neg x_3$ and does not contain the other 

In short, the strategy outlined in the previous example yields non-deterministic results. 

Note that each literal in a function belonging to $C'$ is encoded using $2$ bits of information as opposed to the $1$ bit required to encode a literal in a function belonging to $C$. We can ask two Boolean questions: the literal either exists in the function or not ($1$ bit), and it's negated or not (the $2$nd bit). So, it seems that in a sense $C'$ is twice the size of $C$. Motivated by this insight, we come up with a bijection from $C'$ to $\{monotone \ \ disjunctions \ \ on \ \ 2n \ \ literals\}$. We append $n$ extra $y$-literals at the end of the input such that each $y$-literal is the negation of its corresponding $x$-literal. That is, an input of the form $x = (x_1, x_2, ...,  x_n)$  is mapped to $\hat x = (x_1, x_2, ... ,x_n,y_1, y_2, ..., y_n)$ with $y_i = \neg x_i \ \ \forall i$. 

Now we are back to learning monotone disjunctions, except on $2n$ literals instead of $n$. We can do this with mistake bound $2n$ as shown in the previous example. 
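
The reduction can be sketched as follows: map each input to its $2n$-bit expansion and run the same elimination learner on the expanded inputs. The helper names, the 0-indexing, and the hypothetical target $c(x) = \neg x_0 \lor x_1$ on $n = 3$ literals are assumptions made for this example.

```python
from itertools import product


def expand(x):
    """Map (x_1, ..., x_n) to (x_1, ..., x_n, y_1, ..., y_n) with y_i = NOT x_i."""
    return tuple(x) + tuple(1 - b for b in x)


def learn_monotone_disjunction(challenges, num_literals):
    """Elimination learner from the previous example; returns (literals, mistakes)."""
    literals = set(range(num_literals))
    mistakes = 0
    for x, label in challenges:
        guess = int(any(x[i] == 1 for i in literals))
        if guess != label:
            mistakes += 1
            if label == 0:
                literals -= {i for i in range(num_literals) if x[i] == 1}
    return literals, mistakes


def target(x):
    """Hypothetical true function c(x) = (NOT x_0) OR x_1."""
    return int(x[0] == 0 or x[1] == 1)


n = 3
challenges = [(expand(x), target(x)) for x in product((0, 1), repeat=n)]
literals, mistakes = learn_monotone_disjunction(challenges, 2 * n)

# In the expanded space, index 1 is x_1 and index n + 0 = 3 is y_0 = NOT x_0.
print(sorted(literals), mistakes)
```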


# **Holdout Sets**


Holdout sets provide a naive strategy to test the *true error* of a classifier. Recall that if a classifier has a small *training error* on a training set $S$, we can't infer how well it generalizes to a new training set. That is, it might've been overfitted to $S$, in which case its *true error*, the probability of misclassifying a point drawn from a new training set, could be large.

A holdout set $H$ is a subset of $S$ which we've 'stashed away' for testing. We do not present it to the classifier until it's been trained on the set $S \setminus H$. We then test the classifier on the holdout set $H$ to approximate its *true error*.


**Objective** 

We would like to find a size $|H|$ for this holdout set such that the *training error* on $H$ (finding which is simply a matter of counting the number of mistakes the classifier makes on $H$) closely approximates the *true error* of the classifier.


**So, How Big of a Holdout Set Should We Put Aside?** 

Suppose $h$ is our classifier and suppose it has *true error* $\epsilon$. Let $|H| = n$.
We start by defining an indicator random variable $X^i \in \{1,0\}$, $X^i = \begin{cases} 1 & \text{if $h$ is incorrect on the $i^{th}$ point of $H$} \\ 0 & \text{otherwise} \end{cases}$ 
Consider $\mathbb{E}[X^i] = 1 \cdot \mathbb{P}[X^i = 1] + 0 \cdot \mathbb{P}[X^i=0] = \mathbb{P}[X^i = 1]$.  But $X^i = 1$ is the event in which $h$ misclassifies $x^i \in H$ so, by definition, it's the *true error* $\epsilon$ of $h$.$^\dagger$ Put simply, the way we've defined the indicator variables $X^i$ makes for $\mathbb{E}[X^i] = \epsilon$, the *true error* of $h$.

> **$^\dagger$ True Error**:
>    $true \ error \coloneqq  \mathbb{P}_{x \sim D}[h(x)\ne c(x)]$ where $c$ is the true classifier and $D$ is the distribution from which the data points (including those in $H$) are drawn. 

Now, consider $S = \sum_{i=1}^{n} X^i$ which is the *total number of mistakes* $h$ makes on $H$.
By linearity of expectation, $\mathbb{E}[S] = \sum_{i=1}^{n} \mathbb{E}[X^i] = n\mathbb{E}[X^i] =n\epsilon$  .

We can now apply a famous probabilistic bound called the *Chernoff Bound*. We will apply the version of the Chernoff Bound which applies to sums of independent, identically distributed indicator random variables (also known as *Bernoulli Random Variables*). Here, the $X^i$s are indeed *i.i.d.* since the draws of the data points in $H$ are independent events and the probability distribution on each $X^i$ is the same: namely $\mathbb{P}[X^i = 1] = \epsilon$ (i.e. the *true error*) $\ \forall i$.

> **Chernoff Bound:** 
> If $X$ is the sum of $n$  *i.i.d.* indicator random variables $X^1, ... , X^n$, $\mu$ is its mean (i.e. $\mu = \mathbb{E}[X]$), and $\delta \in (0,1)$, then 
> $\mathbb{P}[ \ |X-\mu| > n \delta \ ] \leq 2e^{-2n\delta^2}$

Intuitively, the Chernoff Bound says that the probability of the sum of $n$  *i.i.d.* indicator random variables deviating from its mean by more than $n\delta$ (a $\delta$ fraction of the number of variables) is *exponentially small* in $n$.

Applying the Chernoff Bound to $S$, we get:
$\mathbb{P}[ \ |S - n\epsilon| > n\delta \ ] \leq 2e^{-2n\delta^2}$, which implies that $\mathbb{P}[|\frac{S}{n} - \epsilon| > \delta] \leq 2e^{-2n\delta^2}$.

Since $S$ is the *total number of mistakes* $h$ makes on $H$ and $n = |H|$, $\frac{S}{n}$ is the *training error* of $h$ on the holdout set $H$. And as we recall $\epsilon$ was the *true error* of $h$. So we can make the *true error* arbitrarily close to the *training error* by choosing $\delta$.

Fix $\delta \in (0,1)$ to be small. Now fix a small *confidence threshold* $\alpha$ and set $2e^{-2n\delta^2} < \alpha$. We end up with $\mathbb{P}[|\frac{S}{n} - \epsilon| > \delta] < \alpha$ which means that we've bounded the probability of the *true error* deviating from the *training error* by more than $\delta$ by $\alpha$. Solving for $n$, the required size of the holdout set $H$, we get $n > \frac{\ln(2/\alpha)}{2\delta^2}$.

In summary, if we want to be $1-\alpha$ confident that the *true error* and the *training error* on $H$ are within $\delta$ of each other, we choose a holdout set of more than $\frac{\ln(2/\alpha)}{2\delta^2}$ data points.
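
As a quick worked example of the bound $n > \frac{\ln(2/\alpha)}{2\delta^2}$, the sketch below computes the smallest integer holdout size for given $\delta$ and $\alpha$. The function name and the sample values $\delta = \alpha = 0.05$ are arbitrary choices for illustration.

```python
import math


def holdout_size(delta, alpha):
    """Smallest integer n with n > ln(2 / alpha) / (2 * delta^2)."""
    bound = math.log(2 / alpha) / (2 * delta ** 2)
    return math.floor(bound) + 1


# Estimate within 0.05 of the true error, with failure probability below 0.05.
n = holdout_size(delta=0.05, alpha=0.05)
print(n)  # 738
```

Note the asymmetry: halving $\delta$ quadruples the required size, while shrinking $\alpha$ only costs a logarithmic factor.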


### **Cross-Validation**


The holdout set method is impractical for two reasons.

1. Labeled data is expensive. When we stash a holdout set, we're wasting valuable labeled data that we could've used to train the model.
2. Probabilities of failure add up if we want to try multiple algorithms in building our classifier, making us quickly lose confidence in its *true error*. Suppose that with confidence $1-\alpha$ a classifier we've created has *training error* closely approximating *true error*. Now suppose we build another classifier and run it over the holdout set again. By the union bound, the probability that at least one of the two estimates is off can be as large as $2\alpha$, so we are now only $1-2\alpha$ confident in the estimates since we've done it twice... 

*Cross-validation* is a method to overcome these limitations which is not currently backed by theory but works really well in practice. 

The process is simple: we divide the training set $S$ into $k$ 'folds,' collections of training points. At each iteration $i$ we stash the $i^{th}$ fold away (as a holdout set), train on the rest of the folds $1,...,i-1,i+1,...,k$ and get an estimate of the *true error* as the *training error* on the $i^{th}$ fold. We keep track of all $k$ *true error* estimates and average them up at the end. 

In practice, we typically choose $k = 5$ to $10$ folds.

This may seem counterintuitive because there's no independence in this procedure. Every fold serves as training data in $k-1$ of the iterations and as the holdout set in the remaining one, so the $k$ error estimates are computed from heavily overlapping training sets. Yet, this method works very well in practice...
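
The $k$-fold procedure can be sketched generically, with the training routine and the error measure passed in as functions. Everything named here is hypothetical: `k_fold_error_estimate`, the trivial majority-label "classifier", and the toy labels are only meant to show the mechanics. A real run would typically shuffle the data before splitting it into folds.

```python
from collections import Counter


def k_fold_error_estimate(points, k, train, error):
    """Average, over k folds, of the error on the held-out fold."""
    folds = [points[i::k] for i in range(k)]  # fold i takes every k-th point
    estimates = []
    for i in range(k):
        held_out = folds[i]
        training = [p for j, fold in enumerate(folds) if j != i for p in fold]
        model = train(training)
        estimates.append(error(model, held_out))
    return sum(estimates) / k


def train_majority(points):
    """Toy 'classifier': always predict the most common training label."""
    return Counter(y for _, y in points).most_common(1)[0][0]


def error_rate(model, points):
    """Fraction of points whose label differs from the constant prediction."""
    return sum(1 for _, y in points if y != model) / len(points)


# Hypothetical data: 20 points, every fourth one labeled 1, the rest labeled 0.
points = [(i, 1 if i % 4 == 0 else 0) for i in range(20)]

print(k_fold_error_estimate(points, k=5, train=train_majority, error=error_rate))
```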


# **Perceptron Learning**

Perceptron *classification* is yet another algorithm for dividing the input space into decision boundaries (which are *halfspaces* in this case).

Consider the set of halfspaces where $w^* \in \mathbb{Z^n}$ (the *normal* vector of the associated *hyperplane*) and $\theta^* \in \mathbb{Z}$ (its *offset* from the origin) are in some bounded range in $\mathbb{Z^n}$ and $\mathbb{Z}$ respectively. This is a simplifying assumption we make so that the set is finite. Perceptron will be trying to learn the following class of Boolean functions $C = \{h(x) = sign(w^Tx - \theta)\}$ for different values of $w$ and $\theta$. In doing so, the data points will end up being classified based on whether or not they fall inside or outside the true halfspace, since $sign(w^{*^{T}}x -\theta^*)$ outputs either $-1$ or $1$ based on where $x$ lies relative to the dividing hyperplane.

Suppose the actual function that Perceptron is trying to learn is $f \in C$. This comes down to learning $f$'s $w^*$ and $\theta^*$.
Any given data point $(x, f(x))$ in the training set $S$, corresponds to a linear inequality $x_1w_1^* + ... + x_nw_n^* - \theta^* > 0$ or $x_1w_1^* + ... + x_nw_n^* - \theta^* < 0$ depending on the value of the label $f(x)$. So the training set is a system of linear inequalities in the unknowns $w^*_i$ and $\theta^*$ which can be solved in polynomial time using *Linear Programming*. However, Perceptron is a much simpler algorithm than LP and it's suited specifically to learning halfspaces. There is also an associated *kernel trick* in a variation of Perceptron called *Kernel Perceptron* (which we will discuss in some detail in a different post) that dramatically cuts down on time complexity.



**The Perceptron Algorithm**

For simplicity, assume $\theta = 0$. This assumption is made without loss of generality since we can get rid of it simply by extending the dimension of the problem to $x \in \mathbb{R^{n+1}}, \ \ w^* \in \mathbb{Z^{n+1}}$ with $x = (x_1,  ..., x_n, 1)$, and $w^* = (w_1^*, ...,w_n^*,-\theta^*)$.

The algorithm starts with an initial guess $w_0$ for the normal vector $w^*$. By convention, it's either the *zero vector* $w_0 = 0^n \in \mathbb{Z^n}$ or the unit vector $w_0 = (\frac{1}{\sqrt{n}},...,\frac{1}{\sqrt{n}}) \in \mathbb{R^n}$. For the rest of this discussion, we'll adopt the first convention.

The algorithm then receives a challenge $(x,y)$ where $y = f(x)$ (recall that $f \in C$ is the actual function Perceptron must learn) and evaluates it based on its current hypothesis $h(x)=sign(w_0^Tx)$. 

- <u>Case 1:</u> The guess was correct, i.e. $h(x) = y$

  No update is needed

- <u>Case 2:</u> The guess was incorrect, i.e. $h(x) \ne y$

  Perceptron updates its state to $w_{new} = w_{old} + yx$.

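  Geometrically this update rule nudges the old normal vector in a direction which brings it closer to $w^*$. The update increases $y \cdot w^Tx$ by $||x||_2^2$, so $x$ moves toward the correct side of the hypothesis (although a single update does not guarantee that $x$ ends up on the correct side).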
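
The two cases above translate almost line for line into code. This is a minimal sketch of a single online pass; the function names, the convention `sign(0) = -1`, and the four toy challenges are assumptions made for the example.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sign(z):
    """Sign with the (assumed) convention sign(0) = -1."""
    return 1 if z > 0 else -1


def perceptron(challenges, n):
    """One online pass of Perceptron starting from the zero vector."""
    w = [0.0] * n
    mistakes = 0
    for x, y in challenges:
        if sign(dot(w, x)) != y:
            # Case 2: wrong guess, so update w_new = w_old + y * x.
            w = [w_i + y * x_i for w_i, x_i in zip(w, x)]
            mistakes += 1
        # Case 1: correct guess, no update needed.
    return w, mistakes


# Hypothetical challenges in R^2, labeled by the sign of the first coordinate.
challenges = [
    ((1.0, 0.5), +1),
    ((0.8, -0.6), +1),
    ((-1.0, 0.2), -1),
    ((-0.6, -0.8), -1),
]
w, mistakes = perceptron(challenges, n=2)
print(w, mistakes)
```

On this toy data a single update on the first challenge is already enough to classify the remaining three correctly; in general several passes over the data may be needed.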



**Perceptron Convergence Theorem**

Perceptron turns out to be a mistake bounded algorithm. In order to show this, we first lay out a number of simplifying assumptions.

<u>Assumptions</u>

1. The true normal vector exists (i.e. $\exists w^*$)

2. The true normal vector is a unit vector (i.e. $||w^*||_2 = 1$)

3. Every input $x \in \mathbb{R^n}$ is also a unit vector (i.e. $\forall x, \ \ ||x||_2=1$)

4. The offset is zero (i.e. $\theta = 0$)

5. There is a margin $\rho > 0$ s.t. all inputs $x \in \mathbb{R^n}$ are at least a distance $\rho$ away from the separating hyperplane.
   That is, the magnitude of the projection of $x$ onto $w^*$ is at least $\rho$ (i.e. $|x^Tw^*| \geq \rho$)

   <u>Side note:</u> the reason that the inner product $x^Tw^*$ is the projection of $x$ onto $w^*$ is that at least one of the vectors (in fact both in this case) are unit vectors.

Assumption 1 simply means that the data is *linearly separable*, which is required for Perceptron to even stand a chance... 
Assumption 2 is *WLOG* since $w^{*^{T}}x = ||w^*||_2||x||_2\cos(\alpha)$ where $\alpha$ is the angle between the two. Since the *2-norms* of nonzero vectors are always positive, the sign of $w^{*^{T}}x$, which is all $sign$ cares about, depends only on $\cos(\alpha)$. 
Assumption 3 is not *WLOG* but we'll consider the case of inputs with larger norm separately.
Assumption 4 is *WLOG* since, as we saw earlier, we can include $\theta$ in the mix simply by increasing the dimension of the problem.
Assumption 5 goes hand-in-hand with assumption 1. It strengthens assumption 1 by requiring not only that the data be linearly separable, but that there also be a comfortable margin $\rho$ between the data points nearest to the hyperplane on either side. 

Given these assumptions, we can state the following convergence theorem.

> **Perceptron Convergence Theorem:** 
> The *mistake-bound* of the Perceptron algorithm is $O(1/ \rho^2)$

Remarkably, the mistake-bound of Perceptron is independent of the dimension of the problem. It only depends on the margin $\rho$.



**Proof of Convergence Theorem**

Let's take a leap forward and prove the convergence theorem given two underlying claims which, for now, we'll take for granted.

> **Claim 1:** On every mistake $w^Tw^*$ increases by at least $\rho$

> **Claim 2:** On every mistake $||w||_2^2$ increases by at most $1$

First, notice the significance of each claim.

Claim 1 deals with the quantity $w^Tw^*$ which is a measure of how much the guess vector $w$ and the true normal vector $w^*$ are pointing in the same direction. It says that $w$ gets more and more aligned with $w^*$ since $w^Tw^*$ increases by at least the positive quantity $\rho$. Claim 2 deals with the magnitude of the guess vector. It says that the magnitude of the guess vector does not blow up. Since $sign$ doesn't care about magnitude, this matters not for the classification itself but because it rules out the possibility that $w^Tw^*$ grows merely because $w$ is getting longer. Putting together claims 1 and 2, we have that $w$ approaches $w^*$ in direction without blowing up in magnitude, which is certainly a nice framework for convergence. 

<u>Proof of Convergence Given the Claims:</u>

Suppose $t$ is the number of mistakes made by Perceptron at some point during its execution. 

Then $t\rho \leq w^Tw^* \leq ||w||_2||w^*||_2$ . The first inequality is due to claim 1 and the fact that $w^Tw^*$ starts at $0$ (since $w_0 = 0^n$) and grows only when Perceptron makes an update (which happens only when a mistake is made). The second is the *Cauchy–Schwarz inequality*. But since $||w^*||_2 = 1$ by assumption 2, we have $t\rho \leq w^Tw^* \leq ||w||_2 \ \ \dagger$.

By claim 2, the squared norm $||w||_2^2$ starts at $0$ and grows by at most $1$ per mistake, so $||w||_2^2 \leq t$. 
Recall that the squared *2-norm* is the inner product $||w||_2^2 = w^Tw = w_1^2 + ... + w_n^2$, so taking square roots gives $||w||_2 \leq \sqrt{t} \ \ \dagger \dagger$. 

Putting together $\dagger$ and $\dagger \dagger$, we have $t\rho \leq \sqrt{t}$. Dividing both sides by $\rho\sqrt{t}$ gives $\sqrt{t} \leq 1/\rho$, and squaring gives $t \leq 1/\rho^2$ which is the claimed mistake bound $^\square$.

But we aren't done here, we still have to prove claims 1 and 2. 

<u>Proof of Claim 1:</u>

Suppose Perceptron has made a mistake, prompting an update.
Consider the update rule $w_{new} = w_{old} + yx$.
Taking the inner product with $w^*$ on both sides yields $w_{new} \cdot w^* = (w_{old} + yx) \cdot w^*$.
Distributing yields, $w_{new}^Tw^* = w_{old}^Tw^* + yx^Tw^*$.
By assumption 5, $|x^Tw^*| \geq \rho$. But $y = f(x) = sign(x^Tw^*)$ (inner product is commutative), so $y$ and $x^Tw^*$ are either both positive or both negative.  This means that $yx^Tw^* = |x^Tw^*| \geq \rho$ where the first equality is by definition of absolute value.
Thus, $w_{new}^Tw^* \geq w_{old}^Tw^* + \rho$  which proves the claim $^\square$.

<u>Proof of Claim 2:</u>

Suppose again that Perceptron has made a mistake, prompting an update.
Consider $||w_{new}||_2^2 = w_{new}^Tw_{new} = (w_{old}+yx)^T(w_{old}+yx) = ||w_{old}||_2^2 + 2yx^Tw_{old} + y^2||x||_2^2$.
But $y^2 = 1$ and $||x||_2 = 1$ by assumption 3, so $||w_{new}||_2^2 = ||w_{old}||_2^2 + 2yx^Tw_{old} + 1$.
But $y = sign(x^Tw^*)$ is, as we recall, the true label. Since Perceptron has made a mistake, $x^Tw_{old}$ and $y$ do not have the same sign.
Then $2yx^Tw_{old} \leq 0$ and $||w_{new}||_2^2 \leq ||w_{old}||_2^2 + 1$, that is, the squared norm grows by at most $1$, proving the claim $^\square$.


**Getting Rid of the Simplifying Assumptions** 

As we saw earlier, getting rid of assumption 4 is easy. 
Getting rid of assumption 3 simply yields a mistake bound of $O(R^2/\rho^2)$ where $R$ is s.t. $\forall x, \ \ ||x||_2 \leq R$.  That is, $R$ is the distance of the furthest data point from the origin. So, in getting rid of assumption 3 we pay quadratically in $R$, which is okay.
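
The mistake bound can also be checked empirically. The sketch below generates random unit vectors in $\mathbb{R}^2$ (so $R = 1$) that keep a margin $\rho$ from a hypothetical true hyperplane, runs Perceptron over the data repeatedly until a full pass makes no mistakes, and compares the total number of mistakes with $1/\rho^2$. The seed, the sample size, the value of $\rho$, the choice $w^* = (1, 0)$ and the convention `sign(0) = -1` are all assumptions of this example.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sign(z):
    """Sign with the (assumed) convention sign(0) = -1."""
    return 1 if z > 0 else -1


def count_perceptron_mistakes(points, max_passes=1000):
    """Cycle through the data until a pass makes no mistakes; count all mistakes."""
    w = [0.0] * len(points[0][0])
    mistakes = 0
    for _ in range(max_passes):
        mistakes_this_pass = 0
        for x, y in points:
            if sign(dot(w, x)) != y:
                w = [w_i + y * x_i for w_i, x_i in zip(w, x)]
                mistakes_this_pass += 1
        mistakes += mistakes_this_pass
        if mistakes_this_pass == 0:
            break
    return w, mistakes


rng = random.Random(0)
rho = 0.1
w_star = (1.0, 0.0)  # hypothetical true unit normal vector

# Sample unit vectors and keep only those at distance >= rho from the hyperplane.
points = []
while len(points) < 200:
    theta = rng.uniform(0.0, 2.0 * math.pi)
    x = (math.cos(theta), math.sin(theta))
    if abs(dot(x, w_star)) >= rho:
        points.append((x, sign(dot(x, w_star))))

w, mistakes = count_perceptron_mistakes(points)
print("mistakes:", mistakes, "bound:", 1 / rho ** 2)
```

The bound is a worst-case guarantee, so the observed number of mistakes is usually well below it.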


# **Decision Trees (WIP)**


We start our discussion of decision trees with a definition of *classification* and *classifiers*.

> **Definition:** *Classification* is the process of grouping data into discrete categories (*class labels*).

A common example is the sorting of emails into the binary categories of *'spam'* and *'not spam'*. The labels in a classification problem need not be binary, they may belong to any discrete set.

> **Definition:** A *classifier* is any algorithm that performs classification.

*Decision trees* are one type of classifier among many. Other notable examples include *logistic regression classifiers* (not to be confused with *linear regression* which solves a problem of *line-fitting*, not classification), *perceptron classifiers*, etc. 


In the following discussion, for simplicity, we assume binary input and binary output for decision trees. That is, the training set is $S = \{(x^1,y^1), ... ,(x^k, y^k)\}$ with $x^i \in \{0,1\}^n$ and $y^i \in \{0,1\} \ \ \forall i$. The nodes of a decision tree correspond to the *features* (or *literals*) of the input and its leaves correspond to the class labels. Its paths correspond to the conjunction of features that lead to those class labels.


Let's look at two key attributes of decision trees.

> **Definition:** The *size* of a decision tree is its number of nodes.

> **Definition:** The *depth* of a decision tree is the length of its longest root-to-leaf path.


Given a training set $S$ of size $|S|$, it's easy to come up with a decision tree of size $\geq |S|$ that classifies all the points in $S$ correctly. We can simply include a path for each conjunction of features leading up to its correct class label. But, in a sense, this tree is not a *learner*. It has simply memorized the training set $S$ and would perform poorly on a different training set. This phenomenon is referred to as *overfitting*.  The challenge is to come up with a way to build a decision tree that's not overfitted to any particular training set. 



**Preliminary Setup**

First, we define the following *potential* function: $\Phi(a) = \min(a,1-a)$. Note a curious property of the potential function: if $a > 1/2$ is the probability of an event occurring, $\Phi(a)$ is the probability of the event not occurring (i.e. the probability of the complementary event occurring).

We also define *training error* of a decision tree $T$ as: $E^T_S = \frac{\# \ \ of \ \ mistakes \ \ T \ \ makes \ \ on \ \ S}{|S|}$. Note that for an overfitted decision tree, as in the discussion above, the training error is $0$. Training error is a measure of a classifier's performance on a specific training set. Contrast this with the *true error* of a classifier which will be defined later on...


**The Simplest Tree**

The simplest decision tree is the one with a single node: a leaf which, as we recall, represents a class label. In this case, it would be wise to choose the label that has the highest incidence in the training set. For instance, suppose a training set $S$ has $5$ labels that are $1$'s and $10$ labels that are $0$'s. The leaf would be chosen to represent the $0$ label, so that it is correct on most of the training set. 

Assuming uniform distribution on the training set $S$, the probability of drawing a point with label $0$ is $2/3$.
Then $\Phi(\mathbb{P}[y = 0]) = \Phi(2/3) = \min(2/3,1-2/3) = 1/3$. This is the probability of the complementary event (i.e. when the label is actually $1$), in which case the decision tree has made an error. In fact, since uniform distribution was assumed, note that $\Phi(\mathbb{P}[y = 0]) = \mathbb{P}[y = 1]$ *is* $E^T_S$ for this simple decision tree.
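
The numbers in this example can be reproduced directly. The sketch uses exact fractions to avoid rounding; the variable and function names are made up for illustration.

```python
from collections import Counter
from fractions import Fraction


def potential(a):
    """Phi(a) = min(a, 1 - a)."""
    return min(a, 1 - a)


# The training labels from the example: five 1's and ten 0's.
labels = [1] * 5 + [0] * 10

# The single leaf predicts the most common label.
majority_label, majority_count = Counter(labels).most_common(1)[0]
p_majority = Fraction(majority_count, len(labels))

# Training error of the single-leaf tree: fraction of labels it gets wrong.
training_error = Fraction(sum(1 for y in labels if y != majority_label), len(labels))

print(majority_label, p_majority, potential(p_majority), training_error)
```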


**Depth-1 Decision Tree**

Suppose we have a feature $x_i$ at the root of the decision tree. Which labels should its left and right leaves correspond to? As before it would make sense to go with the highest incidence of a label in the training set. But the highest incidence is now conditioned upon the value of $x_i$. So, if $x_i = 0$, we look at the subset $S|_{x_i = 0} \subseteq S$ and choose the label with the highest incidence. Similarly, for the case of $x_i = 1$.

But which feature should be at the root? It would make sense to choose the feature which has the most impact on the accuracy of the classification so that even a shallow tree would be a decent classifier. In order to quantify this we define the *gain* in training error as $Gain = E^{T_{old}}_S - E^{T_{new}}_S$ where $E^{T_{old}}_S$ is the training error of the simplest tree.

We fix each feature $x_i$ at the root and compute the associated training error $E^{T_{new}}_S$. Now the training error is a weighted average of probabilities. Suppose $y=0$ is the label with highest incidence in both $S|_{x_i = 0}$ and $S|_{x_i = 1}$, then the new training error would be:

$E^{T_{new}}_S = \mathbb{P}_{x,y \sim S}[x_i = 0] \Phi(\mathbb{P}_{x,y \sim S}[y = 0|x_i = 0]) + \mathbb{P}_{x,y \sim S}[x_i = 1] \Phi(\mathbb{P}_{x,y \sim S}[y = 0|x_i = 1])$

That is, the weighted sum of the probabilities that $y=0$ was the wrong guess. 


**Deeper Decision Trees**

Doing the above procedure for each of the features $x_i$ we choose the feature which results in the biggest gain. This feature is placed at the root of the tree. Then we do the same procedure recursively on the subsets $S|_{x_i = 0}$ and $S|_{x_i = 1}$ for the subtrees.
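
The recursive procedure can be sketched as follows. This is one possible reading of the description above, not a reference implementation: trees are represented as nested tuples `(feature, subtree_if_0, subtree_if_1)` with plain labels as leaves, and the stopping rules (a depth limit, a pure subset, or no features left) are my own assumptions since the notes do not specify them. The toy target $y = x_0 \lor (x_1 \land x_2)$ is also hypothetical.

```python
from collections import Counter
from itertools import product


def potential(a):
    return min(a, 1 - a)


def leaf_error(points):
    """Training error of the best single-leaf tree on `points`."""
    p0 = sum(1 for _, y in points if y == 0) / len(points)
    return potential(p0)


def split_error(points, i):
    """Weighted training error after splitting on feature i with majority leaves."""
    err = 0.0
    for v in (0, 1):
        subset = [(x, y) for x, y in points if x[i] == v]
        if subset:
            err += len(subset) / len(points) * leaf_error(subset)
    return err


def majority(points):
    return Counter(y for _, y in points).most_common(1)[0][0]


def build_tree(points, features, max_depth):
    """Greedily place the feature with the biggest gain at the root, then recurse."""
    if max_depth == 0 or not features or leaf_error(points) == 0:
        return majority(points)
    gains = {i: leaf_error(points) - split_error(points, i) for i in features}
    best = max(gains, key=gains.get)
    rest = [f for f in features if f != best]
    children = []
    for v in (0, 1):
        subset = [(x, y) for x, y in points if x[best] == v]
        # An empty side falls back to the majority label of the parent.
        children.append(build_tree(subset, rest, max_depth - 1) if subset else majority(points))
    return (best, children[0], children[1])


def classify(tree, x):
    while isinstance(tree, tuple):
        feature, if_zero, if_one = tree
        tree = if_one if x[feature] == 1 else if_zero
    return tree


# Hypothetical training set: all inputs in {0,1}^3 labeled by y = x_0 OR (x_1 AND x_2).
points = [(x, int(x[0] == 1 or (x[1] == 1 and x[2] == 1))) for x in product((0, 1), repeat=3)]
tree = build_tree(points, features=[0, 1, 2], max_depth=3)
print(tree)
```

On this data the feature $x_0$ has the largest gain at the root, which matches the intuition that it is the most informative single feature.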



**The Gini Function**

The structure of a decision tree is determined by the choice of a potential function. Previously we've been using $\Phi(a) = \min(a, 1-a)$ which corresponded to training error. Another popular choice, inspired by the aforementioned, is the *Gini Function* $\Phi(a) = 2a(1-a)$.
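
The two potentials can be compared numerically. The sketch below tabulates both on a grid of probabilities and then computes the gain of one particular split under each. The split example (labels given by $y = x_0 \land x_1$ over all four inputs, split on $x_0$) is a hypothetical illustration of mine; it shows a case where the training-error potential reports zero gain while the Gini function reports a positive one.

```python
from fractions import Fraction


def training_error_potential(a):
    """Phi(a) = min(a, 1 - a)."""
    return min(a, 1 - a)


def gini(a):
    """Phi(a) = 2a(1 - a)."""
    return 2 * a * (1 - a)


# Both vanish at a = 0 and a = 1 and peak at a = 1/2.
grid = [Fraction(k, 10) for k in range(11)]
for a in grid:
    print(a, training_error_potential(a), gini(a))


def split_gain(potential):
    """Gain from splitting on x_0 when y = x_0 AND x_1 over all four inputs."""
    before = potential(Fraction(1, 4))  # P[y = 1] = 1/4 at the root
    # The x_0 = 0 side is pure; on the x_0 = 1 side half the labels are 1.
    after = Fraction(1, 2) * potential(Fraction(0)) + Fraction(1, 2) * potential(Fraction(1, 2))
    return before - after


print(split_gain(training_error_potential), split_gain(gini))
```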



Let's examine the graphs of both:

...

# **Principal Component Analysis (WIP)**

In this post we'll depart from classification problems (those that ultimately divide up a space into *decision boundaries*) and look at *dimensionality reduction*.  PCA is a simple yet effective way to reduce the dimensionality of a given dataset without the loss of crucial information (i.e. the *variation* between data points). The hope is to *learn* the dataset (i.e. make some statistical inferences) based on its most important features — the *principal components*.

...