# Basic Classification Methods


_Summarized by QH_  
_First version: 2022-11-22_  
_Last updated on : 2022-11-22_  

## What is Classification?
Linear regression assumes that the response variable $Y$ is _quantitative_. In some cases, however, the response is _qualitative_ (or _categorical_), for example, the type of a flower. Linear regression might not be the best approach in that setting, for the following reasons:
1. It is difficult to encode a qualitative response with more than two classes as a quantitative measure.
    * If the qualitative variable is nominal, like different diagnoses (stroke, drug overdose and epileptic seizure), the three categories have no natural order, so encoding them as (1, 2, 3) or (2, 1, 3) is equally reasonable. However, different encodings produce fundamentally different linear relationships between the response and the independent variables, and thus different predictions.
    * If the qualitative variable is ordinal, like (mild, moderate and severe), the order is fixed but the spacing is not: (1, 2, 3) assumes the gap between mild and moderate equals the gap between moderate and severe, while (1, 1.5, 3) assumes the gap between moderate and severe is larger. The choice is arbitrary, yet it changes the fitted model.
2. For a binary qualitative response, it is reasonable to introduce a _dummy variable_ (0/1) and predict 1 when $\hat{Y} > 0.5$ using linear regression. However, this is not ideal:
    * The estimate may fall outside $[0, 1]$, which makes it hard to interpret as a probability.
    * A regression method will not provide meaningful estimates of $\text{Pr}(Y|X)$.
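
The second point can be checked with a small sketch. The toy data below is invented for illustration, and the closed-form simple linear regression is used only to keep the example dependency-free.

```python
# Least-squares fit of a 0/1 response on a single feature (toy data, assumed).
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ys = [0, 0, 0, 0, 1, 1, 1, 1]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Closed-form simple linear regression: slope = S_xy / S_xx.
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar


def predict(x):
    """Fitted value of the linear model at x."""
    return intercept + slope * x


# The fitted values at the extremes fall below 0 and above 1,
# so they cannot be read as probabilities.
print([round(predict(x), 3) for x in xs])
```

Even on this well-behaved data, the fitted line leaves the unit interval at both ends of the feature range.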

We therefore use a different family of methods for qualitative response variables, called _classification_ methods.

This document summarizes commonly used classification methods:
* Logistic Regression
* Naive Bayes
* K-nearest neighbor

## Logistic Regression
One of the most commonly used binary classification methods is logistic regression. Rather than modeling the response $Y$ (0 or 1) directly, logistic regression models the probability of the event given the independent variables, $p(X) = \text{Pr}(Y=1|X)$. We can then set a threshold $\delta$: when $\Pr(Y=1|X) > \delta$, the outcome is predicted to be 1, otherwise 0. 

How do we model the probability? A linear model $p(X) = \beta_0 + \beta_1X_1 + \cdots + \beta_p X_p$ can produce values below 0 or above 1, so it is not a sensible choice. Instead, we use the _logistic function_, which guarantees $p(X) \in (0, 1)$:
$$p(X) = \frac{e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}}{1 + e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}}$$
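
As a minimal sketch, the logistic function and the threshold rule can be written in a few lines. The single-feature setup, the function names and the coefficient values are assumptions made for illustration.

```python
import math


def logistic(eta):
    """Map a linear predictor eta to a value between 0 and 1."""
    # Branch on the sign so that math.exp never receives a large positive argument.
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    z = math.exp(eta)
    return z / (1.0 + z)


def predict_class(x, beta0, beta1, threshold=0.5):
    """Predict 1 when the modeled probability exceeds the threshold, else 0."""
    p = logistic(beta0 + beta1 * x)
    return 1 if p > threshold else 0


# Hypothetical coefficients, chosen only for illustration.
print(logistic(0.0))
print(predict_class(2.0, beta0=-1.0, beta1=1.0))
print(predict_class(0.0, beta0=-1.0, beta1=1.0))
```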

Rearranging shows that this is equivalent to modeling the _odds_:
$$\frac{p(X)}{1 - p(X)} = e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p} \rightarrow \log \bigg( \frac{p(X)}{1 - p(X)} \bigg) = \beta_0 + \beta_1X_1 + \cdots + \beta_p X_p$$

That is, the _log odds_ (or _logit_) is linear in $X$.
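
The step from the logistic function to the odds can be spelled out as follows. Here $\eta$ is shorthand introduced only for this derivation, $\eta = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p$:

$$1 - p(X) = 1 - \frac{e^{\eta}}{1 + e^{\eta}} = \frac{1}{1 + e^{\eta}}$$

$$\frac{p(X)}{1 - p(X)} = \frac{e^{\eta}}{1 + e^{\eta}} \cdot \big(1 + e^{\eta}\big) = e^{\eta}$$

Taking logarithms on both sides gives $\log \big( p(X) / (1 - p(X)) \big) = \eta$. A one-unit increase in $X_j$ therefore adds $\beta_j$ to the log odds, or equivalently multiplies the odds by $e^{\beta_j}$.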

The coefficients $\beta$ are estimated by maximizing the likelihood function:
$$l(\beta) = \prod_{i:y_i =1} p(x_i) \prod_{j:y_j = 0} (1 - p(x_j))$$
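
To make the estimation concrete, the sketch below maximizes the log of this likelihood for a single feature using plain gradient ascent. Gradient ascent is chosen here only for simplicity (library solvers use other optimizers). The toy data, learning rate and iteration count are assumptions.

```python
import math


def logistic(eta):
    return 1.0 / (1.0 + math.exp(-eta))


def log_likelihood(beta0, beta1, xs, ys):
    """Log of l(beta): sum of log p(x_i) over y_i = 1 and log(1 - p(x_j)) over y_j = 0."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = logistic(beta0 + beta1 * x)
        total += math.log(p) if y == 1 else math.log(1.0 - p)
    return total


def fit(xs, ys, lr=0.1, n_iter=2000):
    """Gradient ascent on the average log-likelihood (illustrative settings)."""
    beta0, beta1 = 0.0, 0.0
    n = len(xs)
    for _ in range(n_iter):
        residuals = [y - logistic(beta0 + beta1 * x) for x, y in zip(xs, ys)]
        grad0 = sum(residuals) / n
        grad1 = sum(r * x for r, x in zip(residuals, xs)) / n
        beta0 += lr * grad0
        beta1 += lr * grad1
    return beta0, beta1


# Toy data with overlapping classes, so the maximizer is finite.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0, 0, 0, 1, 0, 1, 1, 1]

beta0, beta1 = fit(xs, ys)
print(beta0, beta1)
print(log_likelihood(0.0, 0.0, xs, ys), log_likelihood(beta0, beta1, xs, ys))
```

Working with the log-likelihood rather than the product itself avoids numerical underflow and does not change the maximizer.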

### Multinomial Logistic Regression
When there are more than 2 classes, we can still use logistic regression, but in this case we model the odds of each class versus a baseline class (here, class $K$). Specifically,
$$\Pr(Y=k | X=x) = \frac{e^{\beta_{k0} + \beta_{k1} x_1 + \cdots + \beta_{kp} x_p}}{1 + \sum_{l = 1}^{K - 1}e^{\beta_{l0} + \beta_{l1} x_1 + \cdots + \beta_{lp} x_p}}$$ 
for $k = 1, \cdots, K-1$, and

$$\Pr(Y=K | X=x) = \frac{1}{1 + \sum_{l = 1}^{K - 1} e^{\beta_{l0} + \beta_{l1} x_1 + \cdots + \beta_{lp} x_p}} $$

$$\rightarrow \log \bigg( \frac{\Pr(Y=k | X=x)}{\Pr(Y=K | X=x)} \bigg) = \beta_{k0} + \beta_{k1} x_1 + \cdots + \beta_{kp} x_p $$
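
The baseline-class formulation can be checked numerically. In the sketch below, the function name and the coefficient values are hypothetical. The last class plays the role of the baseline $K$.

```python
import math


def class_probabilities(x, coefs):
    """Probabilities for K classes from K-1 coefficient sets; the last class is the baseline.

    coefs is a list of (intercept, slopes) pairs, one per non-baseline class.
    """
    etas = [b0 + sum(b * xj for b, xj in zip(bs, x)) for b0, bs in coefs]
    denom = 1.0 + sum(math.exp(eta) for eta in etas)
    probs = [math.exp(eta) / denom for eta in etas]
    probs.append(1.0 / denom)  # baseline class K
    return probs


# Hypothetical coefficients for K = 3 classes and p = 2 features.
coefs = [(0.5, [1.0, -2.0]), (-1.0, [0.3, 0.8])]
x = [1.0, 0.5]

probs = class_probabilities(x, coefs)
print(probs, sum(probs))
# The log odds versus the baseline recover the linear predictors.
print(math.log(probs[0] / probs[2]), math.log(probs[1] / probs[2]))
```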

### Important Parameters of Logistic Regression in Scikit-Learn
The logistic regression implemented in the scikit-learn library adds an "L2" regularization term by default, with regularization strength parameter $C = 1$. The following lists the parameters that deserve attention when fitting a model:
* penalty:
    * `'none'`: no penalty is added (scikit-learn 1.2 and later use `None` instead of the string);
    * `'l2'`: add an L2 penalty term; this is the default choice;
    * `'l1'`: add an L1 penalty term;
    * `'elasticnet'`: both L1 and L2 penalty terms are added.
* C: default=1.0. Inverse of regularization strength, i.e. the higher the value, the weaker the regularization.
* max_iter: default=100. Maximum number of iterations taken for the solvers to converge.
* multi_class: {'auto', 'ovr', 'multinomial'}, default='auto' (deprecated since scikit-learn 1.5)
    * 'ovr': one-versus-rest, a binary problem is fit for each label.
    * 'multinomial': the loss minimised is the multinomial loss fit across the entire probability distribution, even when the data is binary.
    * 'auto': selects 'ovr' if the data is binary or if solver='liblinear', and otherwise selects 'multinomial'.
* solver: {'newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'}, default='lbfgs'

| Criteria | ‘liblinear’ | ‘lbfgs’ |‘newton-cg’|‘sag’|‘saga’|
| :--|:-- | :-- |:--|:-- | :-- |
|__Penalties__|
|Multinomial + L2 penalty|no|yes|yes|yes|yes|
|OVR + L2 penalty|yes|yes|yes|yes|yes|
|Multinomial + L1 penalty|no|no|no|no|yes|
|OVR + L1 penalty|yes|no|no|no|yes|
|Elastic-Net|no|no|no|no|yes|
|No penalty (‘none’)|no|yes|yes|yes|yes|
|__Behaviors__|
|Penalize the intercept (bad)|yes|no|no|no|no|
|Faster for large datasets|no|no|no|yes|yes|
|Robust to unscaled datasets|yes|yes|yes|no|no|

* random_state: Used when `solver` == 'sag', 'saga' or 'liblinear' to shuffle the data.
* l1_ratio: Only used if penalty='elasticnet'. Setting l1_ratio=0 is equivalent to using penalty='l2', while setting l1_ratio=1 is equivalent to using penalty='l1'. For 0 < l1_ratio <1, the penalty is a combination of L1 and L2.

In [2]:
# Simple pipeline of logistic regression in scikit-learn
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
import numpy as np

X, y = load_iris(return_X_y=True)
print(f"Total {len(np.unique(y))} classes")
print(f"Number of features: {X.shape[1]}")
# Unpenalized fit. penalty=None needs scikit-learn >= 1.2 (older versions: penalty='none').
# With solver='lbfgs' and a multi-class target, the multinomial model is fit by default,
# so the deprecated multi_class argument is omitted.
clf = LogisticRegression(random_state=0, penalty=None, solver='lbfgs').fit(X, y)
# Print out the coefficients for each class
print(f"The coefficients for each class: \n{clf.coef_}")
# Print intercept for each class
print(f"The intercept for each class: \n{clf.intercept_}")
# Make a prediction
print(clf.predict(X[:2, :]))
# Make a prediction on the probabilities of each class
print(clf.predict_proba(X[:2, :]))
# Print the accuracy (on the training data)
print(clf.score(X, y))

Total 3 classes
Number of features: 4
The coefficients for each class: 
[[  7.35275466  20.39784579 -30.26354695 -14.14340745]
 [ -2.44378438  -6.85846875  10.41707167  -2.07137781]
 [ -4.90897028 -13.53937704  19.84647528  16.21478526]]
The intercept for each class: 
[  3.97751891  19.33028473 -23.30780364]
[0 0]
[[1.00000000e+00 2.09745715e-31 3.23880813e-58]
 [1.00000000e+00 1.23379546e-24 8.80642052e-50]]
0.9866666666666667


Without regularization, logistic regression has no regularization strength to tune, so hyper-parameter tuning is not needed for it. With a penalty, the strength $C$ should be tuned. The following pipeline performs hyper-parameter tuning on the regularization strength $C$ (and the solver) using `GridSearchCV`.

In [3]:
from sklearn.datasets import make_blobs
from sklearn.model_selection import GridSearchCV, StratifiedKFold, RepeatedStratifiedKFold
from sklearn.linear_model import LogisticRegression

# Create the dataset - binary classification
X_train, y_train = make_blobs(n_samples=2000, centers=2, n_features=100, cluster_std=18)

# Define the parameter grid
solver = ['lbfgs', 'liblinear']
penalty = ['l2']
c = [100, 10, 1.0, 0.3, 0.1, 0.01]
max_iter = [1000]
params = {'solver': solver, 'penalty': penalty, 'C': c, 'max_iter': max_iter}


# Create the logistic regression model
clf = LogisticRegression()

# Cross validation generator
#cv = StratifiedKFold(n_splits = 10, shuffle=True, random_state = 0)
cv_repeat = RepeatedStratifiedKFold(n_splits = 10, n_repeats=3, random_state = 0)

# Grid Search the best parameter for the solver and c
grid_search = GridSearchCV(estimator=clf, param_grid=params, n_jobs=-1, cv=cv_repeat, scoring='roc_auc')
grid_search_result = grid_search.fit(X_train, y_train)

# Output results
print("Best: %f using %s" % (grid_search_result.best_score_, grid_search_result.best_params_))
means = grid_search_result.cv_results_['mean_test_score']
stds = grid_search_result.cv_results_['std_test_score']
params = grid_search_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

Best: 0.997467 using {'C': 0.01, 'max_iter': 1000, 'penalty': 'l2', 'solver': 'liblinear'}
0.996447 (0.002268) with: {'C': 100, 'max_iter': 1000, 'penalty': 'l2', 'solver': 'lbfgs'}
0.996777 (0.002107) with: {'C': 100, 'max_iter': 1000, 'penalty': 'l2', 'solver': 'liblinear'}
0.996527 (0.002208) with: {'C': 10, 'max_iter': 1000, 'penalty': 'l2', 'solver': 'lbfgs'}
0.996790 (0.002102) with: {'C': 10, 'max_iter': 1000, 'penalty': 'l2', 'solver': 'liblinear'}
0.996650 (0.002153) with: {'C': 1.0, 'max_iter': 1000, 'penalty': 'l2', 'solver': 'lbfgs'}
0.996863 (0.002068) with: {'C': 1.0, 'max_iter': 1000, 'penalty': 'l2', 'solver': 'liblinear'}
0.996790 (0.002065) with: {'C': 0.3, 'max_iter': 1000, 'penalty': 'l2', 'solver': 'lbfgs'}
0.996933 (0.002020) with: {'C': 0.3, 'max_iter': 1000, 'penalty': 'l2', 'solver': 'liblinear'}
0.996907 (0.002013) with: {'C': 0.1, 'max_iter': 1000, 'penalty': 'l2', 'solver': 'lbfgs'}
0.997037 (0.001983) with: {'C': 0.1, 'max_iter': 1000, 'penalty': 'l2', 'sol

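Note that `StratifiedKFold` is a variation of k-fold that returns stratified folds: each fold contains approximately the same percentage of samples of each target class as the complete set. `RepeatedStratifiedKFold`, used above, repeats this procedure `n_repeats` times with a different randomization in each repetition.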
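
The idea of stratification can be illustrated with a short sketch. This is not scikit-learn's implementation: the helper name `stratified_folds` is hypothetical, there is no shuffling, and samples of each class are simply dealt round-robin across the folds.

```python
from collections import Counter, defaultdict


def stratified_folds(labels, n_splits):
    """Assign sample indices to folds so that class proportions stay roughly equal."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        # Deal the samples of each class out round-robin across the folds.
        for pos, idx in enumerate(indices):
            folds[pos % n_splits].append(idx)
    return folds


# Imbalanced toy labels: 80% class 0, 20% class 1.
labels = [0] * 80 + [1] * 20
folds = stratified_folds(labels, n_splits=5)
for fold in folds:
    print(Counter(labels[i] for i in fold))
```

Every fold keeps the 80/20 class ratio of the full set, which matters for metrics such as ROC AUC when one class is rare.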

## Bayes Classifiers
We can also use Bayes' theorem to calculate $\Pr(Y = k|X)$, which gives rise to a family of classifiers: _Linear Discriminant Analysis_, _Quadratic Discriminant Analysis_ and _Naive Bayes_. Specifically,
$$p_k(x) = \Pr(Y = k|X = x) = \frac{\Pr(X=x, Y=k)}{\Pr(X=x)} =\frac{\Pr(X=x, Y=k)}{\sum_{l=1}^K\Pr(X=x, Y=l)} = \frac{\pi_k \cdot f_k(x)}{\sum_{l = 1}^K \pi_l \cdot f_l(x)} $$
where $\pi_k$ represents the overall (or prior) probability that a randomly chosen observation comes from the $k$th class, and $f_k(x) =\Pr(X = x|Y=k)$ is the density function of $X$ for an observation that comes from the $k$th class. We then assign an observation to the class with the highest $p_k(x)$.

Two quantities need to be estimated:
* $\pi_k$ is easy to estimate if we have a random sample from the population: it is the fraction of the training observations that belong to the $k$th class.
* Estimating $f_k(x)$ requires further assumptions, and different assumptions lead to the three classifiers: _Linear Discriminant Analysis_, _Quadratic Discriminant Analysis_ and _Naive Bayes_.
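
A minimal sketch of the posterior computation follows. The priors are estimated as class fractions, while the two one-dimensional Gaussian densities are simply assumed for illustration (their means and standard deviations are made up).

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def posteriors(x, priors, densities):
    """Bayes' theorem: p_k(x) = pi_k f_k(x) / sum_l pi_l f_l(x)."""
    joint = [pi_k * f_k(x) for pi_k, f_k in zip(priors, densities)]
    total = sum(joint)
    return [j / total for j in joint]


# Priors: fraction of training labels in each class.
labels = [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
priors = [labels.count(k) / len(labels) for k in range(2)]

# Assumed class-conditional densities (illustrative parameters).
densities = [
    lambda x: normal_pdf(x, 0.0, 1.0),
    lambda x: normal_pdf(x, 2.0, 1.0),
]

# x = 1.0 is equally far from both means, so the densities cancel
# and the posterior equals the prior.
post = posteriors(1.0, priors, densities)
print(priors, post)
```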

### _Linear Discriminant Analysis_
We assume that $f_k(x)$ is a multivariate Gaussian density with mean $\mu_k$ for the $k$th class and a covariance matrix $\Sigma$ shared across all classes. Since the denominator of $p_k(x)$ is the same for every class, finding the largest $p_k(x)$ means finding the largest
$$\pi_k \cdot f_k(x) = \pi_k \cdot \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \exp \bigg( -\frac{1}{2} (x-\mu_k) ^{T} \Sigma ^{-1} (x - \mu_k)\bigg) \rightarrow \log (\pi_k \cdot f_k(x)) = \delta_k(x) + \text{constant}, \quad \delta_k(x) = x^T \Sigma^{-1} \mu_k - \frac{1}{2}\mu_k^T \Sigma^{-1} \mu_k + \log \pi_k $$ 
where the constant collects the terms that do not depend on $k$ (including $-\frac{1}{2} x^T \Sigma^{-1} x$). $\delta_k(x)$ is linear in $x$, so the decision boundary between any two classes is linear in $x$.
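
For a single predictor ($p = 1$) the discriminant reduces to $\delta_k(x) = x \mu_k / \sigma^2 - \mu_k^2 / (2\sigma^2) + \log \pi_k$. The sketch below evaluates it with assumed parameter values. In practice $\mu_k$, $\sigma^2$ and $\pi_k$ would be estimated from training data.

```python
import math


def lda_discriminant(x, mu_k, sigma2, pi_k):
    """One-dimensional LDA discriminant with a variance shared by all classes."""
    return x * mu_k / sigma2 - mu_k ** 2 / (2.0 * sigma2) + math.log(pi_k)


def lda_classify(x, mus, sigma2, priors):
    """Return the index of the class with the largest discriminant."""
    scores = [lda_discriminant(x, mu, sigma2, pi) for mu, pi in zip(mus, priors)]
    return max(range(len(scores)), key=scores.__getitem__)


# Assumed parameters: two classes, shared variance, equal priors.
mus = [0.0, 2.0]
sigma2 = 1.0
priors = [0.5, 0.5]

# With equal priors the boundary sits at the midpoint of the means, x = 1.
print(lda_classify(0.9, mus, sigma2, priors))
print(lda_classify(1.1, mus, sigma2, priors))
```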

### _Quadratic Discriminant Analysis_
We assume that $f_k(x)$ is a multivariate Gaussian density with mean $\mu_k$ and a class-specific covariance matrix $\Sigma_k$ for the $k$th class. Finding the largest $p_k(x)$ means finding the largest
$$\pi_k \cdot f_k(x) = \pi_k \cdot \frac{1}{(2\pi)^{p/2} |\Sigma_k|^{1/2}} \exp \bigg( -\frac{1}{2} (x-\mu_k) ^{T} \Sigma_k ^{-1} (x - \mu_k)\bigg) $$

$$\rightarrow \log (\pi_k \cdot f_k(x)) = \delta_k(x) + \text{constant}, \quad \delta_k(x) = - \frac{1}{2} x^T \Sigma_k^{-1} x + x^T \Sigma_k^{-1} \mu_k - \frac{1}{2}\mu_k^T \Sigma_k^{-1} \mu_k -\frac{1}{2} \log|\Sigma_k|+ \log \pi_k$$ 
where the constant $-\frac{p}{2}\log(2\pi)$ does not depend on $k$. $\delta_k(x)$ is a quadratic function of $x$, so the decision boundary is quadratic in $x$.
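
In one dimension the quadratic discriminant becomes $\delta_k(x) = -x^2/(2\sigma_k^2) + x\mu_k/\sigma_k^2 - \mu_k^2/(2\sigma_k^2) - \frac{1}{2}\log \sigma_k^2 + \log \pi_k$. The sketch below uses assumed parameters in which both classes share a mean but differ in variance, a case that a linear boundary cannot separate.

```python
import math


def qda_discriminant(x, mu_k, sigma2_k, pi_k):
    """One-dimensional QDA discriminant with a class-specific variance."""
    return (
        -x ** 2 / (2.0 * sigma2_k)
        + x * mu_k / sigma2_k
        - mu_k ** 2 / (2.0 * sigma2_k)
        - 0.5 * math.log(sigma2_k)
        + math.log(pi_k)
    )


def qda_classify(x, mus, sigma2s, priors):
    """Return the index of the class with the largest discriminant."""
    scores = [
        qda_discriminant(x, mu, s2, pi) for mu, s2, pi in zip(mus, sigma2s, priors)
    ]
    return max(range(len(scores)), key=scores.__getitem__)


# Assumed parameters: same mean, different variances, equal priors.
mus = [0.0, 0.0]
sigma2s = [1.0, 9.0]
priors = [0.5, 0.5]

# The narrow class wins near the mean, the wide class wins in both tails.
for x in (-3.0, 0.0, 3.0):
    print(x, qda_classify(x, mus, sigma2s, priors))
```

The resulting decision region for class 0 is a bounded interval around the mean, which only a quadratic rule can produce in one dimension.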

### Naive Bayes
Instead of assuming a particular joint distribution for $f_k(x)$, Naive Bayes assumes that:  
_Within the $k$th class, the $p$ predictors are independent_. That means: $f_k(x) = f_{k1}(x_1) \times \cdots \times f_{kp}(x_p)$. To estimate each one-dimensional density $f_{kj}$, we have several options:
* If $X_j$ is quantitative, we can assume $X_j|Y=k \sim N(\mu_{jk}, \sigma_{jk}^2)$. This amounts to QDA with the additional assumption that the class-specific covariance matrix is diagonal.
* If $X_j$ is quantitative, we can use a non-parametric kernel density estimator to estimate $f_{kj}$.
* If $X_j$ is qualitative, we can simply count the proportion of training observations in the $k$th class that fall into each category of the $j$th predictor.
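
The first option (Gaussian densities per predictor) can be sketched as follows. The function names, the toy data and the small variance floor `1e-9` are assumptions made for illustration, not part of the method as described above.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y):
    """Estimate the prior, per-feature means and per-feature variances for each class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)

    model = {}
    for label, rows in by_class.items():
        n_k = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n_k for col in columns]
        # Variance floor (assumed) avoids division by zero for constant features.
        variances = [
            sum((v - m) ** 2 for v in col) / n_k + 1e-9
            for col, m in zip(columns, means)
        ]
        model[label] = (n_k / len(X), means, variances)
    return model


def predict_gaussian_nb(model, x):
    """Pick the class maximizing log pi_k + sum_j log f_kj(x_j)."""
    best_label, best_score = None, -math.inf
    for label, (prior, means, variances) in model.items():
        score = math.log(prior)
        for xj, m, v in zip(x, means, variances):
            score += -0.5 * math.log(2.0 * math.pi * v) - (xj - m) ** 2 / (2.0 * v)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy data: two well-separated classes with two features.
X = [[1.0, 2.1], [1.2, 1.9], [0.8, 2.0], [3.0, 0.5], [3.2, 0.4], [2.9, 0.7]]
y = [0, 0, 0, 1, 1, 1]

model = fit_gaussian_nb(X, y)
print(predict_gaussian_nb(model, [1.1, 2.0]))
print(predict_gaussian_nb(model, [3.1, 0.5]))
```

Summing log densities across features is exactly the independence assumption: the joint density within a class is the product of the per-feature densities.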


