# Binary Classification

In binary classification the goal is to correctly assign data $x$ to one of two discrete classes $y \in \{0,1\}$. We refer to $y$ as the $\textcolor{lightblue}{label}$. Examples include:
- Predicting positive $(y = 1)$ or negative $(y = 0)$ reviews from **text data** $x$.
- Predicting whether a tumor is present $(y = 1)$ or absent $(y = 0)$ from an MRI/image scan $x$.
- Predicting whether it will be rainy $(y = 1)$ or sunny $(y = 0)$ from historical events data $x$.
  
We'll build up to our final loss function, derived from the general method learnt earlier, by first looking at other, non-probabilistic loss functions and understanding what they characterise.

### Set-up 

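**Data**: Our training set is $$\{(x^{(i)}, y^{(i)})\}_{i=1}^N \quad | \quad y^{(i)} \in \{-1, +1\}$$

Note that for the threshold-based losses in this part the labels are encoded as $\{-1, +1\}$ rather than $\{0, 1\}$; the $\{0, 1\}$ encoding returns with the binary cross entropy loss.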

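**Model**: A linear function of $x$

$$z = f[x, \phi] = \phi_0 + \sum_{j=1}^d \phi_j x_j = \phi^Tx$$

where the compact form $\phi^Tx$ assumes that $x$ has been prepended with a constant $1$, so that $\phi_0$ acts as the bias.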

The result is fed forward to a **threshold (piecewise) function**:

$$
\hat{y} = \operatorname{sign}(z) = 
\begin{cases}
+1, & z \ge 0 \\
-1, & z < 0
\end{cases}
$$
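As a minimal sketch of the set-up above, the linear model and the threshold can be written in a few lines of plain Python. The parameter values and the input below are made up purely for illustration, and the convention that `phi[0]` holds the bias $\phi_0$ is an assumption of this sketch.

```python
def linear_model(x, phi):
    """Return z = phi_0 + sum_j phi_j * x_j, where phi[0] is the bias."""
    return phi[0] + sum(p * xj for p, xj in zip(phi[1:], x))


def sign(z):
    """Threshold function: +1 when z >= 0, otherwise -1."""
    return 1 if z >= 0 else -1


# Hypothetical parameters: bias first, then one weight per feature (d = 2).
phi = [-1.0, 2.0, 0.5]
# Hypothetical input point.
x = [1.0, -3.0]

z = linear_model(x, phi)  # -1.0 + 2.0 * 1.0 + 0.5 * (-3.0) = -0.5
y_hat = sign(z)           # z is negative, so the predicted label is -1
print(z, y_hat)
```

Note that a point with $z = 0$ lies exactly on the decision boundary and is assigned to the $+1$ class by this definition.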

### Geometric View

The linear function $f[x, \phi]$ describes a line (or, in higher dimensions, a hyperplane) made up of the points where $f[x, \phi] = 0$. The implicit form of this hyperplane is typically used to determine which side of it a point lies on and how far away it is.

We do this by:
1.  Defining the normal vector to the hyperplane classifier.
2.  Given a point in the data space (the space the classifier is defined over), computing the vector to that point.
3.  Computing the angle between the normal vector and this vector (or, equivalently, their dot product).
4.  Using the result of (3) to determine how this data point should be classified.
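The steps above can be sketched with the signed distance of a point to the hyperplane, $(\phi_0 + w^Tx) / \lVert w \rVert$, where $w = (\phi_1, \dots, \phi_d)$ is the normal vector. This is one common way to express the geometric view and is an assumption here, not necessarily the exact construction used in the linked notebook. The hyperplane and points are hypothetical.

```python
import math


def signed_distance(x, phi):
    """Signed distance from x to the hyperplane phi_0 + w . x = 0.

    phi[0] is the bias and phi[1:] is the normal vector w (assumed layout).
    The sign tells us which side of the hyperplane x lies on.
    """
    w = phi[1:]
    z = phi[0] + sum(wj * xj for wj, xj in zip(w, x))
    norm = math.sqrt(sum(wj * wj for wj in w))
    return z / norm


def classify(x, phi):
    """Assign +1 on the side the normal points to (or on the plane), else -1."""
    return 1 if signed_distance(x, phi) >= 0 else -1


# Hypothetical hyperplane: 3*x1 + 4*x2 - 5 = 0, with normal vector (3, 4).
phi = [-5.0, 3.0, 4.0]

print(signed_distance([3.0, 4.0], phi), classify([3.0, 4.0], phi))  # (9 + 16 - 5) / 5 = 4.0, class +1
print(signed_distance([0.0, 0.0], phi), classify([0.0, 0.0], phi))  # -5 / 5 = -1.0, class -1
```

Dividing by $\lVert w \rVert$ does not change the sign, so the predicted class is the same as $\operatorname{sign}(f[x, \phi])$; the normalisation only turns the score into a geometric distance.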


For more insight, see the following notebook:

https://github.com/yossefPartouche/Machine_Learning/blob/51c0e17053834fb4064f2fd37efa4cb7e092ffca/ML_Lessons/Unit3%20-%20Linear_Classification/3.2_LinearClassifier_LinearAlgebra.ipynb


## The 0-1 Loss (The Examiner)


Note that the output space of our method is discrete, so it is plausible to use a discrete loss function:

- For every wrong classification, add a point.
- For every correct classification, do nothing.

This is the essential idea of the 0-1 loss. 


$$\boxed{
L_{0-1} (y, \hat{y})= 
\begin{cases}
0, & y = \hat{y} \\
1, & \text{otherwise}
\end{cases}
}
$$

We then average this over all training instances:

$$\boxed{L(\phi) = \frac{1}{N}\sum_{i=1}^N L_{0-1} (y_i, \operatorname{sign}(f[x_i, \phi]))}$$

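We won't discuss the training process just yet, but keep in mind that this loss function doesn't tell us how to improve: it is piecewise constant in $\phi$, so a small change in the parameters usually leaves it unchanged. <br>
It's just a testing function, i.e. a measure of how good the model is.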
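A small sketch of the averaged 0-1 loss, assuming labels in $\{-1, +1\}$ and the `phi[0]`-is-bias layout (both assumptions of this example). The toy data and parameters are made up; the second evaluation nudges the bias slightly to illustrate that the loss does not move.

```python
def predict(x, phi):
    """sign(phi_0 + sum_j phi_j * x_j), with ties at z = 0 going to +1."""
    z = phi[0] + sum(p * xj for p, xj in zip(phi[1:], x))
    return 1 if z >= 0 else -1


def zero_one_loss(y, y_hat):
    """0 for a correct prediction, 1 for a wrong one."""
    return 0 if y == y_hat else 1


def mean_zero_one_loss(data, phi):
    """Average 0-1 loss over (x, y) pairs, i.e. the error rate."""
    return sum(zero_one_loss(y, predict(x, phi)) for x, y in data) / len(data)


# Hypothetical toy dataset of (x, y) pairs.
data = [([2.0, 1.0], 1), ([-1.0, 0.5], -1), ([0.5, -2.0], 1), ([-0.5, -1.0], 1)]

phi = [0.0, 1.0, 1.0]
phi_nudged = [0.01, 1.0, 1.0]  # tiny change to the bias

loss = mean_zero_one_loss(data, phi)
loss_nudged = mean_zero_one_loss(data, phi_nudged)
print(loss, loss_nudged)  # both are 0.5: two of the four points are misclassified
```

Because the nudge is too small to flip any prediction, both losses are identical, which is why this loss gives no direction in which to adjust $\phi$.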


## The Perceptron Loss (Surrogate loss function)

This loss function is defined by:

$$\boxed{L_p(\phi, x, y) = \max(0, -y(\phi^Tx))}$$

This function provides a loss proportional to how far a misclassified point lies on the wrong side of the classifier. <br>
The worse the misclassification, the bigger the loss on this instance. <br>
On the other hand, if the point is classified correctly, or lies exactly on the decision boundary, it adds nothing to the loss.<br>


We then average this over all training instances:

$$\boxed{L(\phi) = \frac{1}{N}\sum_{i=1}^N L_{p}(\phi, x_i, y_i)}$$
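To make the behaviour concrete, here is a minimal sketch of the perceptron loss on three hypothetical points: one correctly classified, one misclassified, and one lying exactly on the decision boundary. Labels are assumed to be in $\{-1, +1\}$ and `phi[0]` is assumed to be the bias.

```python
def perceptron_loss(phi, x, y):
    """max(0, -y * z) with z = phi_0 + sum_j phi_j * x_j and y in {-1, +1}."""
    z = phi[0] + sum(p * xj for p, xj in zip(phi[1:], x))
    return max(0.0, -y * z)


# Hypothetical parameters and toy data (x, y).
phi = [0.0, 1.0, -1.0]
data = [
    ([2.0, 0.0], 1),   # z = 2: correct side, loss 0
    ([0.0, 3.0], 1),   # z = -3: wrong side, loss 3
    ([1.0, 1.0], -1),  # z = 0: on the boundary, loss 0
]

per_example = [perceptron_loss(phi, x, y) for x, y in data]
mean_loss = sum(per_example) / len(data)
print(per_example, mean_loss)  # [0.0, 3.0, 0.0] and 1.0
```

Unlike the 0-1 loss, the value for the misclassified point shrinks continuously as the boundary moves towards it, which is what makes this loss usable as a training signal.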

In Chapters 5 and 6 we'll come back to this loss function. <br> For now, know that this **is** an improvement over the 0-1 loss, since it does provide a signal for improving. <br> On the other hand, this function does not aim to generalise well, only to classify the training data correctly.<br>
So although this is an improvement over the 0-1 loss, we can do better.



## The Binary Cross Entropy Loss

#### Reminder on the General Method

$$\boxed{\begin{aligned}
&1. \text{ Given the output choose a suitable probability distribution } Pr(y | \theta) \text{ defined over the domain of predictions} \\
\\
&2. \text{ Set the ML model } f[x, \phi] \text{ to predict all independent parameters } \\
&\quad \text{(and compute the rest of the parameters based on what's learnt) so } \theta = f[x, \phi] \text{ and } Pr(y | \theta) = Pr(y | f[x, \phi]) \\
\\
&3. \text{ We train the model to find the network parameters } \hat{\phi} \text{ that minimize the negative log-likelihood} \\
&\quad \text{over the training dataset } \{x_i, y_i\}_{i=1}^N \\
\\
&4. \text{ To perform inference, we take the argmax of the distribution } Pr(y | f[x, \hat{\phi}])
\end{aligned}}$$

### Loss Construction

1. $\text{We choose a probability distribution over the output space: } \quad y \in \{0, 1\}$

    - Since the output is binary, we can choose the $\text{Bernoulli distribution}$
    - This has a single parameter $\lambda \in [0,1]$
    - $Pr(y | \lambda) = (1-\lambda)^{1-y}(\lambda)^y$
  
2. $\text{We set the ML model } f[x, \phi] \text{ to predict the parameter } \lambda$
    - With the current construction of the ML model we **can't** guarantee $f[x, \phi] \in [0,1]$
    - We pass the output of the model through the $\textcolor{lightblue}{logistic \ sigmoid} \quad \text{sig}[z] : \mathbb{R} \to (0,1)$ $$\boxed{\displaystyle \text{sig}[z] = \frac{1}{1 + e^{-z}}}$$
    - Now, since $\lambda = \text{sig}[f[x, \phi]]$ we have $Pr(y | \lambda) = Pr(y | x) = (1-\text{sig}[f[x, \phi]])^{1-y}(\text{sig}[f[x, \phi]])^y$
3. $\text{We now train the model to find the parameters } \hat{\phi} \text{ that minimize the loss}$ $$ L[\phi] = \sum_{i=1}^N -(1-y_i)\log\big[1-\text{sig}[f[x_i, \phi]]\big] - y_i\log\big[\text{sig}[f[x_i, \phi]]\big]$$
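The construction above can be sketched in plain Python as follows. This is an illustrative sketch rather than a reference implementation: the one-feature dataset and the parameters are hypothetical, `phi[0]` is assumed to be the bias, labels are in $\{0, 1\}$, and no clipping is applied, so an extremely confident wrong prediction could still hit $\log 0$. The `predict` helper corresponds to taking the argmax of the Bernoulli distribution at inference time.

```python
import math


def sigmoid(z):
    """Logistic sigmoid 1 / (1 + exp(-z)), written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def model(x, phi):
    """Linear model f[x, phi] = phi_0 + sum_j phi_j * x_j."""
    return phi[0] + sum(p * xj for p, xj in zip(phi[1:], x))


def bce_loss(phi, xs, ys):
    """Sum over examples of the Bernoulli negative log-likelihood, y in {0, 1}."""
    total = 0.0
    for x, y in zip(xs, ys):
        lam = sigmoid(model(x, phi))  # predicted Pr(y = 1 | x)
        total += -(1 - y) * math.log(1 - lam) - y * math.log(lam)
    return total


def predict(x, phi):
    """Argmax of the Bernoulli: class 1 when Pr(y = 1 | x) >= 0.5, else class 0."""
    return 1 if sigmoid(model(x, phi)) >= 0.5 else 0


# Hypothetical one-feature dataset and parameters.
xs = [[2.0], [-2.0], [0.0]]
ys = [1, 0, 1]
phi = [0.0, 1.0]

loss = bce_loss(phi, xs, ys)
print(loss, [predict(x, phi) for x in xs])
```

In this toy case the third point sits at $z = 0$, so the model assigns it $\lambda = 0.5$ and it contributes $\log 2$ to the loss even though the thresholded prediction is correct; the loss keeps rewarding more confident correct predictions.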