# Linear models

Outline:

1. Logistic Regression
2. Multiclass classification
3. SVM
4. Kernels
5. String Kernels

Reading:

1. Bishop. Pattern Recognition and Machine Learning, pp. 178-220
2. Sokolov. Lectures on Machine Learning

## Logistic regression 

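Logistic loss for labels $y \in \{-1, +1\}$ and score $z = \langle w, x \rangle$:  
$$ L(y,z) = \log (1 + e^{-yz})$$

Log loss for labels $y \in \{0, 1\}$ and predicted probability $p$:  
$$ L(y, p) = -\left( y \log p + (1-y) \log(1 - p) \right) $$

Decision function (predicted probability of the positive class):  
$$ \sigma(x) = \frac 1 {1 + e^{-\langle w, x \rangle}}$$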
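
The two losses above describe the same quantity under different label encodings, provided the log loss is taken with a leading minus sign (as a value to minimize) and $p = \sigma(x)$. A minimal sketch that checks this numerically; the score `z` and all function names are illustrative assumptions, not part of the lecture:

```python
import math

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def logistic_loss(y_pm, z):
    """log(1 + exp(-y z)) for a label y in {-1, +1} and a score z = <w, x>."""
    return math.log1p(math.exp(-y_pm * z))

def log_loss(y01, p):
    """Negative log-likelihood for a label y in {0, 1} and a probability p."""
    return -(y01 * math.log(p) + (1 - y01) * math.log(1 - p))

z = 0.7          # hypothetical score <w, x>
p = sigmoid(z)   # predicted probability of the positive class

# Positive example: y = +1 in one encoding, y = 1 in the other
pos_gap = abs(logistic_loss(+1, z) - log_loss(1, p))
# Negative example: y = -1 in one encoding, y = 0 in the other
neg_gap = abs(logistic_loss(-1, z) - log_loss(0, p))
```

Both gaps should be zero up to floating-point error, since $-\log \sigma(z) = \log(1 + e^{-z})$ and $-\log(1 - \sigma(z)) = \log(1 + e^{z})$.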

## Multiclass classification

### Multinomial

Probability of class $j$:  
$$ \mathrm{softmax}(x)_j = \frac {e^{\langle w_j, x \rangle}} {\sum_k e^{\langle w_k, x \rangle}}$$

Loss (cross-entropy), with $y$ one-hot encoded:
$$ L(y, p) = - \sum_{k=1}^K y_k \log p_k $$

<img src="images/multinomial.png" style="height:300px">
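
A small NumPy sketch of the softmax probabilities and the cross-entropy loss above. The weight matrix `W`, the input `x`, and the one-hot label are made-up values for illustration; subtracting the maximum score is a common numerical-stability trick and does not change the result:

```python
import numpy as np

def softmax(scores):
    """Softmax over the last axis, shifted by the max score for stability."""
    shifted = scores - np.max(scores, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

def cross_entropy(y_onehot, p):
    """-sum_k y_k log p_k for a one-hot label vector y."""
    return -np.sum(y_onehot * np.log(p), axis=-1)

# Hypothetical weights: K = 3 classes, 2 features (one row w_k per class)
W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [-1.0, -1.0]])
x = np.array([2.0, 1.0])

p = softmax(W @ x)              # class probabilities, sums to 1
y = np.array([1.0, 0.0, 0.0])   # true class is 0
loss = cross_entropy(y, p)      # equals -log p[0]
```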

### One-vs-rest

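Train $K$ binary classifiers $b_k$, each separating class $k$ from all other classes, and predict the class with the highest score:  

$$ a(x) = \arg \max_k b_k(x) $$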

<img src="images/ovr.png" style="height:300px">
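
The one-vs-rest rule can be sketched with linear scorers $b_k(x) = \langle w_k, x \rangle$. The weights below are hypothetical and hand-picked, standing in for $K$ separately trained binary classifiers:

```python
import numpy as np

# Hypothetical weights of K = 3 binary linear classifiers, one row per class
W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [-1.0, -1.0]])

def predict_ovr(W, X):
    """Return argmax_k b_k(x) for every row x of X."""
    scores = X @ W.T              # shape (n_samples, K)
    return np.argmax(scores, axis=1)

X_demo = np.array([[2.0, 0.5],
                   [0.1, 3.0],
                   [-2.0, -1.0]])
labels = predict_ovr(W, X_demo)
```

Each sample goes to the class whose classifier is most confident, so the scores of the $K$ classifiers need to be on comparable scales.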

### One-vs-one

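Train $K(K-1)/2$ binary classifiers, one per pair of classes. Here $b_{ik}(x)$ is the vote for class $k$ from the classifier trained on classes $i$ and $k$; predict the class with the most votes:

$$ a(x) = \arg \max_k \sum_{i \neq k} b_{ik}(x) $$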

In the figure below:  
thick lines = one-vs-one  
thin lines = one-vs-rest  

<img src="images/1vs1.jpg" style="height:300px">
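
One-vs-one voting can be sketched as follows. The pairwise rule `nearest_center` and the class centers are hypothetical stand-ins for trained pairwise classifiers; only the voting logic is the point here:

```python
from itertools import combinations

def predict_ovo(pair_winner, n_classes, x):
    """Let every pairwise classifier vote and return (winning class, votes)."""
    votes = [0] * n_classes
    for i, k in combinations(range(n_classes), 2):   # K(K-1)/2 pairs
        votes[pair_winner(i, k, x)] += 1
    best = max(range(n_classes), key=votes.__getitem__)
    return best, votes

# Hypothetical pairwise rule: each class has a center on the real line,
# and the classifier for the pair (i, k) picks the nearer center.
centers = [0.0, 5.0, 10.0]

def nearest_center(i, k, x):
    return i if abs(x - centers[i]) <= abs(x - centers[k]) else k

label, votes = predict_ovo(nearest_center, 3, 6.0)
```

With $K = 3$ there are three pairwise classifiers, so exactly three votes are cast in total.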

## SVM

Hinge loss  
$L(y,z) = (1 - yz)_{+} = \max(0, 1 - yz)$

Decision function  
$a(x) = \operatorname{sign}(\langle w, x \rangle)$

<img src="images/svm.png" style="height:400px">
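
A minimal sketch of the hinge loss and the linear decision function, using plain Python. The weight vector and the sample are made-up values, and mapping a score of exactly zero to $+1$ is an arbitrary choice made here:

```python
def hinge_loss(y, z):
    """(1 - y z)_+ for a label y in {-1, +1} and a score z = <w, x>."""
    return max(0.0, 1.0 - y * z)

def decide(w, x):
    """sign(<w, x>); a zero score is mapped to +1 in this sketch."""
    score = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if score >= 0 else -1

w = [1.0, -1.0]   # hypothetical weights
x = [3.0, 1.0]    # hypothetical sample, score <w, x> = 2.0

prediction = decide(w, x)
losses = [
    hinge_loss(+1, 2.0),   # correct with margin >= 1: no loss
    hinge_loss(+1, 0.5),   # correct but inside the margin: small loss
    hinge_loss(-1, 0.5),   # wrong side: loss grows linearly
]
```

The loss is zero only when the sample is classified correctly with a margin of at least 1, which is the source of the margin-maximizing behavior of SVM.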

## Kernels

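Def: Reproducing Kernel Hilbert Space (RKHS): functions are linear in a feature map $\phi$  

$$f(x) = \langle w, \phi(x) \rangle $$

Th: Mercer: a symmetric positive semi-definite kernel is an inner product of feature maps

$$K(x,y) = \langle \phi(x), \phi(y) \rangle $$

Th: Representer theorem: the regularized solution is a kernel expansion over the training points $x_i$

$$ f(x) = \sum_i w_i K(x, x_i) $$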


<img src="images/rkhs.png" style="height:300px">
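
The identity $K(x,y) = \langle \phi(x), \phi(y) \rangle$ can be checked on a concrete case. This sketch assumes the homogeneous quadratic kernel $K(x,y) = \langle x, y \rangle^2$ on 2-dimensional inputs, whose explicit feature map is $\phi(x) = (x_1^2, \sqrt{2} x_1 x_2, x_2^2)$. The kernel and the inputs are chosen for illustration only:

```python
import numpy as np

def phi(x):
    """Explicit feature map of the quadratic kernel for 2-d inputs."""
    return np.array([x[0] ** 2, np.sqrt(2.0) * x[0] * x[1], x[1] ** 2])

def k_quad(x, y):
    """Quadratic kernel <x, y>^2, computed without building phi."""
    return float(np.dot(x, y)) ** 2

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

via_kernel = k_quad(x, y)                    # (1*3 + 2*(-1))^2 = 1
via_features = float(np.dot(phi(x), phi(y))) # 9 - 12 + 4 = 1
```

The kernel evaluates the inner product in the 3-dimensional feature space while only touching the original 2-dimensional vectors, which is what makes kernel methods practical for very large or infinite feature spaces.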

### RBF kernel

$$ K(X,Y) = \exp( - \gamma \| X - Y \|_2^2) $$

<img src="images/rbf.png" style="height:300px">
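
A sketch of the RBF Gram matrix in NumPy, written directly from the formula above. The function name, the sample points, and the value of `gamma` are illustrative assumptions:

```python
import numpy as np

def rbf_kernel_matrix(A, B, gamma):
    """K[i, j] = exp(-gamma * ||A[i] - B[j]||^2) for rows of A and B."""
    sq_dist = (np.sum(A ** 2, axis=1)[:, None]
               + np.sum(B ** 2, axis=1)[None, :]
               - 2.0 * (A @ B.T))
    sq_dist = np.maximum(sq_dist, 0.0)   # guard against tiny negative round-off
    return np.exp(-gamma * sq_dist)

A = np.array([[0.0, 0.0],
              [1.0, 0.0],
              [0.0, 2.0]])
K = rbf_kernel_matrix(A, A, gamma=0.5)
```

The diagonal is always 1 because every point is at distance zero from itself, and larger `gamma` makes the kernel decay faster with distance.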

### Polynomial kernel

$$ K(X,Y) = (\gamma \langle X, Y \rangle + \epsilon)^d $$

<img src="images/poly.png" style="height:300px">
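
A matching sketch for the polynomial kernel. The parameter names `gamma`, `coef0`, and `degree` follow the naming used by scikit-learn's `SVC`, where `coef0` plays the role of $\epsilon$ in the formula above; the data and parameter values are made up:

```python
import numpy as np

def polynomial_kernel_matrix(A, B, gamma=1.0, coef0=1.0, degree=3):
    """K[i, j] = (gamma * <A[i], B[j]> + coef0) ** degree."""
    return (gamma * (A @ B.T) + coef0) ** degree

A = np.array([[1.0, 2.0],
              [0.0, 1.0]])
K = polynomial_kernel_matrix(A, A, gamma=0.5, coef0=1.0, degree=2)
```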

## String kernels

In [1]:
from sklearn import neighbors
import pandas as pd
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn import metrics

SEED = 1
np.random.seed(SEED)

In [18]:
# toy dataset: one string per row
X = np.array([['AAAAAAA'], ['AAATTAA'], ['TTTTTTT'], ['ATTAAAA'], ['TTTTAAT'], ['TATTTAT']])
# class 0 if #A > #T, class 1 if #A < #T
y = np.array([0, 0, 1, 0, 1, 1])

In [19]:
def equal_elements(s1, s2):
    """Count the positions at which the two strings hold the same character."""
    score = 0
    for i in range(min(len(s1), len(s2))):
        score += float(s1[i] == s2[i])
    return score

# Gram matrix of the string kernel over the whole dataset
full_string_kernel = np.zeros((X.shape[0], X.shape[0]))
for i in range(X.shape[0]):
    for j in range(X.shape[0]):
        full_string_kernel[i, j] = equal_elements(X[i, 0], X[j, 0])
full_string_kernel

array([[7., 5., 0., 5., 2., 2.],
       [5., 7., 2., 3., 2., 4.],
       [0., 2., 7., 2., 5., 5.],
       [5., 3., 2., 7., 4., 2.],
       [2., 2., 5., 4., 7., 5.],
       [2., 4., 5., 2., 5., 7.]])
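
A quick sanity check, as a sketch: a valid kernel must produce a symmetric positive semi-definite Gram matrix. The block below rebuilds the same matrix from the toy strings so that it runs on its own; the helper is a compact re-implementation of the position-matching score, and the tolerance `1e-9` is an arbitrary choice:

```python
import numpy as np

strings = ['AAAAAAA', 'AAATTAA', 'TTTTTTT', 'ATTAAAA', 'TTTTAAT', 'TATTTAT']

def equal_elements(s1, s2):
    """Count the positions at which the two strings hold the same character."""
    return float(sum(a == b for a, b in zip(s1, s2)))

gram = np.array([[equal_elements(a, b) for b in strings] for a in strings])

# eigvalsh is meant for symmetric matrices and returns real eigenvalues
eigenvalues = np.linalg.eigvalsh(gram)
is_symmetric = bool(np.allclose(gram, gram.T))
is_psd = bool(eigenvalues.min() >= -1e-9)
```

This kernel is a sum over positions of indicator kernels "same character at position $i$", and each of those is positive semi-definite, so the sum is as well.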

In [20]:
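def compose_kernel(row_idxs, col_idxs):
    # Samples are passed as row indices into the precomputed Gram matrix.
    # The builtin int replaces np.int, which was removed in NumPy 1.24.
    row_idxs = np.asarray(row_idxs, dtype=int).ravel()
    col_idxs = np.asarray(col_idxs, dtype=int).ravel()
    return full_string_kernel[row_idxs, :][:, col_idxs]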

In [21]:
indices = np.arange(X.shape[0], dtype=int)
idx_train, idx_test = train_test_split(indices, stratify=y, test_size=2, random_state=SEED)
idx_train, idx_test

(array([0, 4, 3, 5]), array([1, 2]))

In [22]:
y[idx_train], y[idx_test]

(array([0, 1, 0, 1]), array([0, 1]))

In [23]:
compose_kernel(idx_train, idx_train)

array([[7., 2., 5., 2.],
       [2., 7., 4., 5.],
       [5., 4., 7., 2.],
       [2., 5., 2., 7.]])

In [24]:
compose_kernel(idx_train, idx_test)

array([[5., 0.],
       [2., 5.],
       [3., 2.],
       [4., 5.]])

In [25]:
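# SVC calls the kernel with the index columns passed as X,
# so compose_kernel looks the values up in the Gram matrix
clf = SVC(kernel=compose_kernel)

clf.fit(idx_train.reshape(-1, 1), y[idx_train])
y_pred = clf.predict(idx_test.reshape(-1, 1))
metrics.accuracy_score(y[idx_test], y_pred)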

1.0
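
An alternative sketch of the same experiment uses `SVC(kernel='precomputed')`, which takes Gram matrices directly instead of a callable over sample indices. The block is self-contained: it rebuilds the Gram matrix from the toy strings and hard-codes the train/test indices shown in the split output above. Variable names are illustrative:

```python
import numpy as np
from sklearn.svm import SVC

strings = ['AAAAAAA', 'AAATTAA', 'TTTTTTT', 'ATTAAAA', 'TTTTAAT', 'TATTTAT']
labels = np.array([0, 0, 1, 0, 1, 1])

def equal_elements(s1, s2):
    """Count the positions at which the two strings hold the same character."""
    return float(sum(a == b for a, b in zip(s1, s2)))

gram = np.array([[equal_elements(a, b) for b in strings] for a in strings])

# Same split as printed above (copied by hand, not recomputed)
idx_train = np.array([0, 4, 3, 5])
idx_test = np.array([1, 2])

# fit expects K(train, train); predict expects K(test, train)
K_train = gram[np.ix_(idx_train, idx_train)]   # shape (4, 4)
K_test = gram[np.ix_(idx_test, idx_train)]     # shape (2, 4)

clf_pre = SVC(kernel='precomputed')
clf_pre.fit(K_train, labels[idx_train])
y_pred_pre = clf_pre.predict(K_test)
```

Note the orientation of the test matrix: its rows are test samples and its columns are training samples. Since the optimization problem is the same as with the callable kernel, the predictions are expected to agree with the run above.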