## Logistic Regression



### Sigmoid

- Sigmoid function:
$\sigma(x) = \frac{1}{1 + e^{-x}}$
- Sigmoid derivative:
$\sigma'(x) = \sigma(x) (1 - \sigma(x))$


### Hàm mục tiêu

Là bài toán phân Binary Classification, mong muốn output sẽ là [0, 1]<br>
ta có thể giả sử rằng xác suất để một điểm dữ liệu $x$ rơi vào class 1 là $f(Wx)$ và rơi vào class 0 là $1−f(Wx)$.<br>
Đặt $z_i = f(Wx_i)$ là dự đoán của $x_i$
- nếu $y_i = 1 \rightarrow P(y_i = 1|W,x_i) = z_i$. 
- nếu $y_i = 0 \rightarrow P(y_i = 0|W,x_i) = 1-z_i$. 
- Hay nói cách khác: $P(y_i|W,x_i) = z_i^{y_i}({1-z_i})^{1-y_i}$

Vì vậy, ta giả sử các điểm dữ liệu sinh ra độc lập với nhau:<br>
    $ \Rightarrow$ Với dataset, hàm mục tiêu: $P(Y|W,X) = \prod\limits_{i = 1}^{n}  z_i^{y_i}({1-z_i})^{1-y_i}$ <br>
    $ \Rightarrow$ Khi đó ta mong muốn $P(Y|W,X)$ đạt max, khi đó bài toán sẽ trở thành $\argmax\limits_{W}P(Y|W,X)$  tức tìm W để hàm $P(Y|W,X)$ tối ưu (Maximum Likelihood Estimator)
  - tối ưu $~~P(Y|W,X)$ phức tạp, ta sẽ đi tối ưu min $~-\log(P(Y|W,X)) = -\sum\limits_{i = 1}^{n}  {y_i}\log({z_i})+({1-y_i})\log({1-z_i})$, rõ ràng đây là môt hàm [cross entropy](###Cross-Entropy)
  - Đặt $J(W,X,Y) = -\log(P(Y|W,X))$

### Cập nhật 

Với hàm activation là [sigmoid](###Sigmoid) ta cũng có được

$$\frac{\partial J(\mathbf{w}; \mathbf{x}_i, y_i)}{\partial \mathbf{w}} = (z_i - y_i)\mathbf{x}_i$$

SGD cho từng điểm dữ liệu: 

$$\mathbf{w} = \mathbf{w} + \eta(y_i - z_i)\mathbf{x}_i$$

### Implement

Dùng phương pháp xấp xỉ để kiểm tra đạo hàm
$$f’(x) = \lim_{\varepsilon \rightarrow 0}\frac{f(x + \varepsilon) - f(x)}{\varepsilon}\newline$$

$$f’(x) \approx \frac{f(x + \varepsilon) - f(x - \varepsilon)}{2\varepsilon} ~~~~ (2)$$

In [1]:
import numpy as np

class LogisticRegression():
    def __init__(self, lr=0.01, epochs=5, batch_size=20):
        self.lr = lr
        self.epochs = epochs
        self.w = None
        self.b = None
        self.batch_size = batch_size

    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.w = np.random.rand(n_features)
        self.b = 0
        w_latest = self.w.copy()
        stop_condtion = False
        # train
        for epoch in range(self.epochs):
            if stop_condtion == False: break

            idxs = np.random.permutation(n_samples)
            # mini-batch gradient descent
            for i in range(0,n_samples,self.batch_size):
                idx = idxs[i:min(i + self.batch_size, n_samples)]
                X_batch,y_batch = X[idx],y[idx]

                z_batch = self.sigmoid(X_batch@self.w + self.b)
                gradient = (1/self.batch_size)*(y_batch - z_batch)
                self.w += self.lr*X_batch.T@gradient
                self.b -= self.lr*np.sum(gradient)
                # stop condtion
                if np.allclose(w_latest, self.w, atol=0.001):
                    stop_condtion = True
                    break
                else:
                    w_latest = self.w.copy()
                
    def predict(self,X):
        y_pred = X@self.w + self.b
        return [0 if y<=0.7 else 1 for y in y_pred]

    @staticmethod
    def sigmoid(input):
        return 1/(1+ np.exp(-input))
    

In [6]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import datasets
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler

bc = datasets.load_breast_cancer()
X, y = bc.data, bc.target
# X = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1234)

clf = LogisticRegression(lr=0.001,batch_size=5,epochs=10)
clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)

def accuracy(y_pred, y_test):
    return np.sum(y_pred==y_test)/len(y_test)

acc = accuracy(y_pred, y_test)
print(acc)

0.6052631578947368


## Softmax Regression

### Softmax
$Softmax(X) = \frac{e^{x_i}}{\sum\limits_{i=1}^{n} e^{x_i}}$

### Cross Entropy 

$H(P,Q) = {\sum\limits_{i=1}^C p_i \log(q_i)}$
Using Cross Entropy instead of Distance function because of properties: 
 - Reach minimal value when $p_i$ = $q_i$
 - but when $q_i$ too far away from $p_i$, the Cross entropy get really large -> benefit when optimize loss

<u>Note</u>: $H(P,Q) ≠ H(Q,P)$, điều này dễ nhận ra bởi vì nhìn vào công thức, $P$ có thể nhận giá trị 0, trong khi $Q$ bên trong hàm $\log$ không thể  nhận giá trị 0 <br>=> In Practical, $P$ dùng làm *giá trị thực* (0,0,1,...,0), $Q$ dùng *giá trị dự đoán*.<br>
Cross Entropy dùng để đo khoảng cách giữa hai phân phối



### Loss for Softmax Regression

- Loss for one point $(x_i, y_i)$: 
  - $L(W,x_i,y_i) = {y_i}\log({a_i}) = \sum\limits_{j=1}^{C}{y_{ji}}\log({a_{ji}})$<br>
  Trong đó, $x_i, y_i$ là một cặp điểm dữ liệu; $~~y_j$ đã được one-hot encode; $~~~a_i = Softmax(y_i)$;   $~~~y_{ji}, a_{ji}$ tương ứng là class thứ j trong vector $y_j, a_j$
- Gradient for one point $(x_i, y_i)$:
  - We defined: $J_i(\mathbf{W}) \triangleq J(\mathbf{W}; \mathbf{x}_i, \mathbf{y}_i)$
  - $\frac{\partial J_i(W)}{\partial w_j} = e_{ji}x_i ~(\text{where}~ e_{ji} = a_{ji} - y_{ji}) ~~$
  - => $\frac{\partial J_i(W)}{\partial W} = x_i[e_{1i},e_{2i},...,e_{Ci}] = x_ie_i^T$
  
- Loss and Gradient for Training dataset:<br>
  - $L(W,X,Y) = Y\log(A) = \sum\limits_{i=1}^{N}{y_i}\log({a_i}) = \sum\limits_{i=1}^{N}\sum\limits_{j=1}^{C}{y_{ji}}\log({a_{ji}})$
  - $\frac{\partial J(W)}{\partial W} = XE^T$

### Update Weight

Batch Grident Decents: 
$\mathbf{W} = \mathbf{W} +\eta \mathbf{x}_{i}(\mathbf{y}_i - \mathbf{a}_i)^T$

### Recall

Công thức cập nhật này giống hệt với Logistic Regression

### Implement

In [None]:
import numpy as np
from sklearn.preprocessing import OneHotEncoder

class SoftmaxRegression():
    def __init__(self, lr = 0.0001, epochs=100):
        self.lr = lr
        self.epochs = epochs
        self.w = None
        self.b = None

    def fit(self, X, y):
        #transform first
        encoder = OneHotEncoder(sparse=False)
        y = encoder.fit_transform(y.reshape(-1, 1))
        # some parameters
        n_samples, n_features = X.shape
        _, n_class = y.shape
        self.w = np.zeros([n_features,n_class])
        self.b = 0
        w_latest = self.w.copy()
        # train
        for epoch in range(self.epochs):
            y_pred = self.softmax_stable(X@self.w)
            gradient = y-y_pred
            self.w += 1/n_samples*self.lr*X.T@gradient
            self.b -= 1/n_samples*self.lr*np.sum(gradient)
            if np.allclose(w_latest, self.w):
                break
            else: w_latest = self.w.copy()
        
    @staticmethod
    def softmax(input):
        return np.apply_along_axis(lambda row: np.exp(row)/(np.sum(np.exp(row))), 1, input)

    # To avoid run into numerical overflow issues bc computing the exponential of large numbers,.
    @staticmethod
    def softmax_stable(input):
        input_shifted = input - np.max(input, axis=1, keepdims=True)
        return np.apply_along_axis(lambda row: np.exp(row) / np.sum(np.exp(row)), 1, input_shifted)

    def predict(self,X):
        y_pred = self.softmax(X@self.w)
        return np.argmax(y_pred,axis=1)

In [None]:
import numpy as np
from sklearn.datasets import load_digits, load_breast_cancer, load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression

data = load_digits()
X,y = data.data,data.target
# with some dataset like iris,breast, we need to scale the data to achive better result
# X = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=24)

# Our model 
clf = SoftmaxRegression()
clf.fit(X_train,y_train)
y_pred_man = clf.predict(X_test)

# sklearn model
clf = LogisticRegression(multi_class='multinomial',solver='lbfgs')
clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)

def accuracy(y_pred, y_test):
    return np.sum(y_pred==y_test)/len(y_test)

print(accuracy(y_pred_man,y_test),accuracy(y_pred,y_test))


## Take away

### Regression??

as we can see, both Softmax and Logistic Regression are really useful on Classification Problems. 

Logistic Regression apply Sigmoid function on output, while Softmax Regression apply Softmax function. The objective of they are trying to *find hyperplane* that can seperate the data

They are both having very nice gradient, which can be generalized as $$gradient = y-output$$
which `output` is the prediction: $output = activation\_function(W.X)$; and `y` is the origin class. <br>
After that, we can update weights and bias as: 
```
    weight += learning_rate*X@gradient
    bias += learning_rate*gradient
```

Take a look back how Linear Regression using Gradient Descent to optimization. We can reallize the way Softmax and Logistic Regression calculate gradient and update parameters is exactly the same as Linear Regression thanks to apply flexible Loss and activation funciton. I think that how they called *REGRESSION* 

### Why dont we just using Loss function MSE like Linear Regression instead of Cross Entropy

The figure is ilustrated the comparation with MSE with Cross Entropy, they both get the minimal value at $q=p$, but more important, Corss Entropy 'fine' the mis-classification data point very much

<img src="utils/img/sigmoid-softmax/cross-entropy-vs-mse.png" alt="cross entropy vs mse" width="800" height="300"/>

### Cheatsheet

|                   | Logistic Regression | Softmax Regression |
| ------------      | ------------------- | ------------------ |
|**Using for**| Binary Classification | Multi-Class Classification |
|**Activation Function** | Sigmoid | Softmax |
|**Loss Function**| Negative Log Likelihood (Suprising, it has the form same as Cross Entropy) | Cross Entropy | 