# Linear models for classification

Outline:

1. Tf-idf features
1. Encoding categorical features
1. Logistic Regression
1. Multiclass classification
1. SVM
1. Kernels
1. Evaluation

Readings:

1. Bishop. Pattern Recognition and Machine Learning. pp 178-220
1. Sokolov. Lectures on Machine Learning

## 1 Tf-Idf features

$$x_{ij} = tf_{ij} \cdot idf_j$$
where $tf_{ij}$ is the frequency of term $j$ in sample $i$ and  
$$idf_j = \log \left(\frac n {df_j + 1}\right) + 1$$  
$df_j$ is the number of samples in which term $j$ is present  
$n$ is the total number of samples  

Usually we also normalize each row (for example, to unit L2 norm) to discount sample length.
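The formulas above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: tokens are obtained by a whitespace split, $tf_{ij}$ is taken to be the raw term count, and the normalization is L2. The function name `tfidf` and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return (vocabulary, rows) where rows[i][j] = tf_ij * idf_j, L2-normalized."""
    tokenized = [doc.lower().split() for doc in docs]  # assumption: whitespace tokens
    vocab = sorted({term for doc in tokenized for term in doc})
    n = len(tokenized)
    # df_j: number of samples in which term j is present
    df = Counter(term for doc in tokenized for term in set(doc))
    # idf_j = log(n / (df_j + 1)) + 1, as in the formula above
    idf = {term: math.log(n / (df[term] + 1)) + 1 for term in vocab}
    rows = []
    for doc in tokenized:
        counts = Counter(doc)  # assumption: tf_ij is the raw count
        row = [counts[term] * idf[term] for term in vocab]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        rows.append([v / norm for v in row])
    return vocab, rows

docs = ["the cat sat", "the dog sat on the mat", "cats and dogs"]
vocab, X = tfidf(docs)
```

In the first document the rare term `cat` receives a larger weight than the frequent term `the`, which is the intended effect of the $idf$ factor.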

## 2 Encoding categorical features

1. One-hot encoding

<img src="images/onehot.png" style="height:200px">

1. Label encoding 

<img src="images/label.webp" style="height:200px">

1. Target-mean encoding

<img src="images/target_mean.png" style="height:200px">

1. Embeddings (in neural nets)
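The first three encodings in the list can be sketched on a toy column. The feature values and targets below are hypothetical, and the snippet is a minimal illustration rather than a production encoder.

```python
from collections import defaultdict

colors = ["red", "green", "blue", "green", "red"]  # hypothetical categorical feature
target = [1, 0, 1, 0, 0]                           # hypothetical binary target

categories = sorted(set(colors))                   # ['blue', 'green', 'red']

# Label encoding: each category becomes an integer index.
label_index = {c: i for i, c in enumerate(categories)}
label_encoded = [label_index[c] for c in colors]

# One-hot encoding: each category becomes its own 0/1 column.
one_hot = [[int(c == cat) for cat in categories] for c in colors]

# Target-mean encoding: each category becomes the mean target over its rows.
sums, counts = defaultdict(float), defaultdict(int)
for c, t in zip(colors, target):
    sums[c] += t
    counts[c] += 1
target_mean = {c: sums[c] / counts[c] for c in categories}
target_encoded = [target_mean[c] for c in colors]
```

In practice the target means should be estimated on training data only (often out-of-fold), otherwise the encoded feature leaks the target.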

## 3 Logistic regression 

Suppose $\hat y = f(x)$ and $y \in \{0, 1\}$. Show that the optimal $f(x_i)$ is $p(y=1 | x_i)$.   
The probability of generating the observed samples according to $f(x)$ is

$$likelihood = \prod_{i=1}^N f(x_i)^{[y_i=1]} (1 - f(x_i))^{[y_i=0]} \rightarrow \max_{f}$$
$$\log likelihood = \sum_{i=1}^N [y_i=1] \log f(x_i) + [y_i=0] \log (1 - f(x_i)) \rightarrow \max_{f}$$
$$ L = - \frac 1 N \sum_{i=1}^N \left( [y_i=1] \log f(x_i) + [y_i=0] \log (1 - f(x_i)) \right) \rightarrow \min_{f}$$
Taking the expectation over $y$ at a fixed point $x$:
$$ E[L|x] = -p(y=1 |x) \log f(x) - p(y=0 |x) \log (1 - f(x)) \rightarrow \min_{f} $$    
$$ \frac {\partial E[L|x]} {\partial f} = - \frac {p(y=1|x)}{f(x)} + \frac {p(y=0 |x)} {1 - f(x)} = 0$$

Since $p(y=0|x) = 1 - p(y=1|x)$, we conclude
$$f(x) = p(y=1|x)$$
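The conclusion can be checked numerically. The sketch below fixes an assumed value of $p(y=1|x)$, evaluates the expected log loss on a grid of candidate predictions, and picks the minimizer. The value `p_true = 0.3` and the grid resolution are arbitrary choices for illustration.

```python
import math

def expected_log_loss(f, p):
    """E[L | x] for prediction f when the true p(y=1|x) equals p."""
    return -p * math.log(f) - (1 - p) * math.log(1 - f)

p_true = 0.3                               # assumed p(y=1|x), illustration only
grid = [k / 1000 for k in range(1, 1000)]  # candidate predictions in (0, 1)
best_f = min(grid, key=lambda f: expected_log_loss(f, p_true))
print(best_f)  # the minimizer coincides with p_true
```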


You can find two different formulae for the logistic loss.  
First, in terms of margins, with labels $y_i \in \{-1, +1\}$:
$$ L(y_i,x_i) = \log (1 + e^{-y_i w^T x_i})$$

<img src="images/logmargin.png" style="height:300px">

Second, as cross-entropy, with labels $y_i \in \{0, 1\}$ and $p_i = p(y=1|x_i)$:
$$ L(y_i, p_i) = - y_i \log p_i - (1-y_i) \log(1 - p_i) $$


Decision function = probability of class $y=1$ given sample $x_i$:
$$ p(y=1 | x_i) = \sigma(w^T x_i) = \frac 1 {1 + e^{-w^Tx_i}}$$

<img src="images/sigmoid.png" style="height:200px">
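The two formulae for the logistic loss describe the same quantity once the labels are mapped between $\{-1, +1\}$ and $\{0, 1\}$ and $p$ is the sigmoid of the score. A small numerical check, with hypothetical function names and arbitrary test scores:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def margin_loss(y_pm, z):
    """log(1 + exp(-y z)) for a label y in {-1, +1} and score z = w^T x."""
    return math.log(1.0 + math.exp(-y_pm * z))

def cross_entropy(y01, p):
    """-y log p - (1 - y) log(1 - p) for a label y in {0, 1}."""
    return -y01 * math.log(p) - (1 - y01) * math.log(1.0 - p)

max_diff = 0.0
for z in (-2.0, -0.5, 0.0, 1.5):
    for y01 in (0, 1):
        y_pm = 2 * y01 - 1  # map {0, 1} to {-1, +1}
        diff = abs(margin_loss(y_pm, z) - cross_entropy(y01, sigmoid(z)))
        max_diff = max(max_diff, diff)
```

`max_diff` stays at the level of floating-point rounding, so the choice between the two forms is a matter of label convention.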

## 4 Multiclass classification

### Multinomial

Cross-entropy loss:
$$ L(y_i, p_i) = - \sum_{k=1}^K y_{ik} \log p_{ik} $$
where $K$ is the number of classes,  
$y_{ik} = [y_i = k]$ is the one-hot encoded label, and  
$p_{ik}$ is the predicted probability of the $k$-th class for the $i$-th sample.


Decision function = probability of class $j$ (in vector form):  
$$ p(y_i = j | x_i) = softmax(W x_i)_j$$
where $W \in R^{K \times D}$, $K$ = number of classes, $D$ = number of features
$$ softmax(z)_j = \frac {e^{z_j}} {\sum_k e^{z_k}}$$

<img src="images/softmax.jpg" style="height:300px">



<img src="images/multinomial.png" style="height:300px">
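A minimal sketch of the multinomial decision function and its loss, assuming $W$ is stored as $K$ rows of $D$ weights. The weights and the sample are hypothetical, and subtracting the maximum score before exponentiating is a standard stability trick that does not change the result.

```python
import math

def softmax(z):
    """softmax(z)_j = exp(z_j) / sum_k exp(z_k), shifted by max(z) for stability."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def class_probabilities(W, x):
    """W holds one row of D weights per class; returns K probabilities."""
    scores = [sum(w_d * x_d for w_d, x_d in zip(row, x)) for row in W]
    return softmax(scores)

def cross_entropy(y_onehot, p):
    """-sum_k y_k log p_k for a one-hot label vector."""
    return -sum(y * math.log(q) for y, q in zip(y_onehot, p) if y)

W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]  # hypothetical weights, K=3, D=2
x = [2.0, 1.0]
p = class_probabilities(W, x)               # scores are [2, 1, -3]
loss = cross_entropy([1, 0, 0], p)
```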

### One-vs-rest
Idea: build a multiclass classifier from several binary classifiers.  
Train $K$ binary classifiers $b_k$, where $b_k$ separates class $k$ from all the other classes.  

$$ \hat y = \arg \max_k b_k(x) $$

Notes:
1. the training set of each $b_k$ is imbalanced even if the initial problem was balanced
2. the scale of the confidence values may differ between the binary classifiers $b_k$

<img src="images/ovr.png" style="height:300px">
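The one-vs-rest decision rule is just an argmax over the binary confidences. The sketch below assumes the $K$ scorers are already trained; the linear scorers and their weights are hypothetical stand-ins.

```python
def one_vs_rest_predict(binary_scorers, x):
    """binary_scorers[k](x) is the confidence b_k(x) that x belongs to class k."""
    scores = [b(x) for b in binary_scorers]
    return max(range(len(scores)), key=scores.__getitem__)

# Hypothetical linear scorers b_k(x) = <w_k, x> for K = 3 classes.
weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
scorers = [lambda x, w=w: sum(wi * xi for wi, xi in zip(w, x)) for w in weights]

print(one_vs_rest_predict(scorers, [0.2, 0.9]))    # scores 0.2, 0.9, -1.1 -> class 1
print(one_vs_rest_predict(scorers, [-1.0, -2.0]))  # scores -1, -2, 3 -> class 2
```

Because the argmax compares raw confidences, it is only meaningful when the scorers produce values on comparable scales.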

### One-vs-one
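Idea: build a multiclass classifier from several binary classifiers.  
Train $K(K-1)/2$ binary classifiers, one for each pair of classes. 

$$ \hat y = \arg \max_k \sum_{i \neq k} b_{ik}(x) $$
where $b_{ik}(x)$ is the vote for class $k$ from the classifier trained on classes $i$ and $k$.

Notes:
1. One-vs-one is less prone to class imbalance in the dataset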

thick lines = one-vs-one  
thin lines = one-vs-rest  

<img src="images/1vs1.jpg" style="height:300px">
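One-vs-one prediction can be sketched as majority voting over all pairwise classifiers. The pairwise classifiers below are hypothetical (a one-dimensional nearest-centre rule), chosen only so that the example is self-contained.

```python
from itertools import combinations

def one_vs_one_predict(pair_classifiers, n_classes, x):
    """pair_classifiers[(i, k)](x) returns the winning class, i or k, for i < k."""
    votes = [0] * n_classes
    for i, k in combinations(range(n_classes), 2):
        votes[pair_classifiers[(i, k)](x)] += 1
    return max(range(n_classes), key=votes.__getitem__)

# Hypothetical 1-D problem: class centres at 0, 5 and 10; each pairwise
# classifier picks the nearer of its two centres.
centres = [0.0, 5.0, 10.0]
pairs = {
    (i, k): (lambda x, i=i, k=k: i if abs(x - centres[i]) <= abs(x - centres[k]) else k)
    for i, k in combinations(range(3), 2)
}

print(len(pairs))                         # K(K-1)/2 = 3 classifiers
print(one_vs_one_predict(pairs, 3, 4.0))  # votes [1, 2, 0] -> class 1
print(one_vs_one_predict(pairs, 3, 9.0))  # votes [0, 1, 2] -> class 2
```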

## 5 SVM

Suppose we have a linear decision surface
$$f(x) = sign(\langle w,x \rangle)$$
where $\langle w,x \rangle$ is a scalar product.  
For the linear kernel, $\langle w,x \rangle = x^T w$.   
Then the distance from a point $x_0 \in X^D$ to the decision surface is 
$$\rho(x_0, f) = \frac {|\langle w,x_0 \rangle|} {||w||}$$
Choose the scale of $w$ such that, over the training samples, $$\min_{x} |\langle w,x \rangle| = 1$$
Then the distance from the decision surface to the nearest object is
$$\min_{x} \frac {|\langle w,x \rangle|} {||w||} = \frac 1 {||w||} \min_{x} |\langle w,x \rangle| = \frac 1 {||w||}$$

For the linearly separable case, maximizing this distance gives the optimization problem:
$$\frac 1 2 ||w||^2 \rightarrow \min_w$$
$$ s.t. \; y_i \langle w,x_i \rangle \;\geq 1 $$

For the linearly inseparable case we introduce a slack variable $\xi_i$ for each object:

$$\frac 1 2 ||w||^2 + C \sum_{i=1}^N \xi_i \rightarrow \min_{w, \xi}$$
$$ s.t. \; y_i \langle w,x_i \rangle \; \geq 1 - \xi_i $$
$$ s.t. \; \xi_i \geq 0 $$

At the optimum each slack takes its smallest feasible value:
$$ \xi_i = \max(0, 1 - y_i \langle w,x_i \rangle)$$

And finally,

$$\frac 1 2 ||w||^2 + C \sum_{i=1}^N \max(0, 1 - y_i \langle w,x_i \rangle) \rightarrow \min_w$$

**Note:** unlike logistic regression, the weight norm penalty is already built into the model.

<img src="images/svm.png" style="height:400px">
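The unconstrained hinge-loss objective can be minimized with plain subgradient descent. The sketch below is only an illustration of the objective, not the solver used by real SVM libraries. It has no bias term, matching $f(x) = sign(\langle w, x \rangle)$ above, and the toy data, step size, and iteration count are arbitrary assumptions.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def svm_objective(w, X, y, C):
    """1/2 ||w||^2 + C * sum_i max(0, 1 - y_i <w, x_i>)."""
    hinge = sum(max(0.0, 1.0 - yi * dot(w, xi)) for xi, yi in zip(X, y))
    return 0.5 * dot(w, w) + C * hinge

def subgradient_step(w, X, y, C, lr):
    grad = list(w)                 # gradient of 1/2 ||w||^2
    for xi, yi in zip(X, y):
        if yi * dot(w, xi) < 1.0:  # margin violated: hinge term is active
            grad = [g - C * yi * xd for g, xd in zip(grad, xi)]
    return [wd - lr * g for wd, g in zip(w, grad)]

# Hypothetical toy data, separable by a hyperplane through the origin.
X = [[2.0, 1.0], [1.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]]
y = [1, 1, -1, -1]

w = [0.0, 0.0]
initial = svm_objective(w, X, y, C=1.0)
for _ in range(500):
    w = subgradient_step(w, X, y, C=1.0, lr=0.01)
final = svm_objective(w, X, y, C=1.0)
predictions = [1 if dot(w, xi) >= 0 else -1 for xi in X]
```

With a constant step size the iterates hover near the optimum rather than converging exactly; a decaying step size is the usual remedy.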

## 6 Kernels
  
* aka feature engineering  

Introduce  
$$\phi: X \rightarrow H$$  
$$f(x) = \langle w, \phi(x) \rangle $$
where $H$ is some Reproducing Kernel Hilbert space (basically, a Euclidean space with a scalar product)  

But how to choose $\phi$?

* aka similarity function  

Actually, you don't even need to write $\phi$ explicitly  

Def: Kernel
$$K(x,z) = \langle \phi(x), \phi(z) \rangle $$
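A small worked example of why $\phi$ need not be written explicitly: for two-dimensional inputs, the kernel $K(x,z) = \langle x, z \rangle^2$ corresponds to the feature map $\phi(x) = (x_1^2, \sqrt 2 x_1 x_2, x_2^2)$. The sketch computes the scalar product both ways; the input vectors are arbitrary.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def phi(x):
    """Explicit feature map for 2-D inputs such that <phi(x), phi(z)> = <x, z>^2."""
    x1, x2 = x
    return [x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2]

def kernel(x, z):
    """The same quantity computed without ever building phi."""
    return dot(x, z) ** 2

x, z = [1.0, 2.0], [3.0, -1.0]
explicit = dot(phi(x), phi(z))  # 9 - 12 + 4
implicit = kernel(x, z)         # (3 - 2)^2
```

Both values equal 1 up to rounding, while the kernel version never leaves the original two-dimensional space.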

Th: Mercer:
$K(x,z)$ is a kernel function $\Leftrightarrow$
1. $K(x,z) = K(z,x)$
2. for any finite $\{x_i\}_{i=1}^N$ the matrix $K(x_i, x_j)$ is positive semi-definite.  

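Th: Representer Theorem

The minimizer of a regularized empirical risk over $H$ can be written as a combination of kernels centred at the training samples $x_i$:
$$ f(x) = \sum_{i=1}^N w_i K(x, x_i) $$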

Pros:  
1. domain knowledge can be used to construct $\phi$ or $K$  

Cons:  
1. $O(N^2)$ memory for the kernel matrix during training and up to $O(N)$ kernel evaluations per prediction.  

**RBF kernel**

$$ K(X,Y) = \exp( - \gamma || X - Y ||_2^2) $$

<img src="images/rbf.png" style="height:200px">

**Polynomial kernel**

$$ K(X,Y) = (\gamma \langle X, Y \rangle + \epsilon)^d $$

<img src="images/poly.png" style="height:200px">
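The two kernels and the representer form $f(x) = \sum_i w_i K(x, x_i)$ can be sketched as follows. The default parameter values, training points, and coefficients are hypothetical and serve only to make the example runnable.

```python
import math

def rbf_kernel(x, z, gamma=1.0):
    """exp(-gamma * ||x - z||_2^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

def poly_kernel(x, z, gamma=1.0, eps=1.0, d=2):
    """(gamma * <x, z> + eps)^d."""
    return (gamma * sum(a * b for a, b in zip(x, z)) + eps) ** d

def kernel_predict(x, train_points, coefs, kernel):
    """f(x) = sum_i w_i K(x, x_i)."""
    return sum(w * kernel(x, xi) for w, xi in zip(coefs, train_points))

train_points = [[0.0, 0.0], [1.0, 1.0]]  # hypothetical training samples
coefs = [1.0, -1.0]                      # hypothetical coefficients w_i

# Gram matrix K(x_i, x_j): symmetric, with ones on the diagonal for RBF.
gram = [[rbf_kernel(a, b) for b in train_points] for a in train_points]
score = kernel_predict([0.0, 0.0], train_points, coefs, rbf_kernel)
```

Every prediction touches all stored training points, which is where the cost of kernel methods comes from.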

## 7 Evaluation 

### For binary classification:

Accuracy $ = \frac 1 N \sum_{i=1}^N [\hat y_i = y_i]$

TP = true positive  
FP = false positive  
TN = true negative  
FN = false negative 

$$precision = \frac {TP} {TP + FP}$$  
$$recall = \frac {TP}  {TP + FN}$$

<img src="images/f1.png" style="height:600px">

$$ F1 = \frac {2 * precision * recall } {precision + recall} $$
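As a quick illustration, the counts and the metrics above can be computed directly from hard labels. The label vectors are hypothetical, and the helper returns 0 when a denominator is empty, which is one common convention rather than part of the definitions.

```python
def binary_counts(y_true, y_pred):
    """Return (TP, FP, TN, FN) for labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn

def safe_div(num, den):
    return num / den if den else 0.0  # convention for empty denominators

y_true = [1, 1, 1, 0, 0, 0, 0, 1]  # hypothetical ground truth
y_pred = [1, 0, 0, 0, 1, 0, 0, 1]  # hypothetical hard predictions

tp, fp, tn, fn = binary_counts(y_true, y_pred)  # (2, 1, 3, 2)
accuracy = (tp + tn) / len(y_true)              # 5/8
precision = safe_div(tp, tp + fp)               # 2/3
recall = safe_div(tp, tp + fn)                  # 1/2
f1 = safe_div(2 * precision * recall, precision + recall)  # 4/7
```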

$FPR = \frac {FP} {FP + TN} $ false positive rate    
$TPR = \frac {TP} {TP + FN} $ true positive rate  

AUC = area under the ROC curve  
$ROC(t) = (FPR(t), TPR(t))$ is parametrized by the threshold $t$ on the probability $p(y=1|x)$ 

<img src="images/auc.png" style="height:200px">
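A minimal sketch of how the ROC curve and its area are obtained from scores: sweep the threshold over the observed scores, record the point with FPR on the horizontal axis and TPR on the vertical axis, and integrate with the trapezoidal rule. The labels and scores are a hypothetical toy example.

```python
def roc_points(y_true, scores):
    """(FPR, TPR) at every threshold t, predicting 1 when score >= t."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the piecewise-linear curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]  # hypothetical predicted p(y=1|x)

points = roc_points(y_true, scores)
print(points)       # [(0.0, 0.0), (0.0, 0.5), (0.5, 0.5), (0.5, 1.0), (1.0, 1.0)]
print(auc(points))  # 0.75
```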

note:
* you use hard labels for Accuracy and F1
* you use class probabilities for AUC

### For multiclass:

Precision, recall, and F1 have no direct multiclass counterpart; they are assembled from the metrics of $K$ binary (one-vs-rest) classification problems.

$$macro\_precision = \frac 1 K \sum_k \frac {TP_k} {TP_k + FP_k}$$  
$$macro\_recall = \frac 1 K \sum_k \frac {TP_k} {TP_k + FN_k}$$  
$$ macro\_f1 = \frac {2 * macro\_precision * macro\_recall } {macro\_precision + macro\_recall} $$

note: macro averaging gives every class the same weight regardless of its size, so large classes do not dominate the score

$$micro\_precision =  \frac { \sum_k TP_k} { \sum_k TP_k + \sum_k FP_k}$$  
$$micro\_recall = \frac {\sum_k TP_k} {\sum_k TP_k + \sum_k FN_k}$$  
$$ micro\_f1 = \frac {2 * micro\_precision * micro\_recall } {micro\_precision + micro\_recall} $$

weighted:

$$weighted\_precision = \sum_k \frac {N_k} {N} \frac {TP_k} {TP_k + FP_k}$$  
$$weighted\_recall = \sum_k \frac {N_k} {N} \frac {TP_k} {TP_k + FN_k}$$  
$$weighted\_f1 = \frac {2 * weighted\_precision * weighted\_recall } {weighted\_precision + weighted\_recall} $$

where $N_k$ is the number of samples of class $k$, so each class is weighted in proportion to its size
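The three averaging schemes can be compared on a toy example. The sketch computes one-vs-rest counts per class and then macro, micro, and weighted precision; the weighted average here assumes each class's weight is its share of the samples, $N_k / N$, so the weights sum to one. The label vectors are hypothetical.

```python
def per_class_counts(y_true, y_pred, classes):
    """One-vs-rest (TP, FP, FN, N_k) for every class k."""
    stats = {}
    for k in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == k and p == k)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != k and p == k)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == k and p != k)
        stats[k] = (tp, fp, fn, tp + fn)  # N_k = TP_k + FN_k
    return stats

def safe_div(num, den):
    return num / den if den else 0.0

y_true = [0, 0, 0, 0, 1, 1, 2, 2]  # hypothetical ground truth
y_pred = [0, 0, 0, 1, 1, 2, 2, 2]  # hypothetical predictions
classes = [0, 1, 2]
stats = per_class_counts(y_true, y_pred, classes)
n = len(y_true)

precisions = {k: safe_div(tp, tp + fp) for k, (tp, fp, fn, nk) in stats.items()}
macro_precision = sum(precisions.values()) / len(classes)
micro_precision = safe_div(sum(s[0] for s in stats.values()),
                           sum(s[0] + s[1] for s in stats.values()))
weighted_precision = sum(stats[k][3] / n * precisions[k] for k in classes)
```

Here per-class precision is 1, 1/2, and 2/3, giving macro 13/18, micro 3/4, and weighted 19/24. Libraries may define the averaged F1 differently; scikit-learn's macro F1, for instance, averages the per-class F1 scores instead of combining macro precision and macro recall.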