<a href="https://colab.research.google.com/github/sanjeesi/Notes-Notebooks/blob/master/Data%20Science%20IITM/MLP/Week%205/Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Classification Functions in Scikit Learn
Specific classification algorithms:
- Least squares classification
- Perceptron
- Logistic regression
  - With regularization
  - multiclass, multilabel and multi-output setting

**Cross-validation** and **hyperparameter search** for classification work exactly as they do in the regression setting.
- However, a few CV strategies are specific to classification, such as stratified splitting (`StratifiedKFold`), which preserves the class proportions in every fold.
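
One classification-specific strategy is stratified splitting. The sketch below is only an illustration on made-up, imbalanced labels (the data and the helper `positives_per_fold` are assumptions, not part of the original notes); it compares how many positive samples land in each test fold with `KFold` and with `StratifiedKFold`.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Illustrative imbalanced labels: 90 negatives followed by 10 positives.
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((100, 1))  # feature values do not matter for the split itself


def positives_per_fold(cv):
    """Count the positive labels in each test fold produced by the splitter."""
    return [int(y[test_idx].sum()) for _, test_idx in cv.split(X, y)]


plain_counts = positives_per_fold(KFold(n_splits=5))
stratified_counts = positives_per_fold(StratifiedKFold(n_splits=5))

print("KFold:          ", plain_counts)
print("StratifiedKFold:", stratified_counts)
```

With unshuffled `KFold`, all positives fall into the last fold, whereas `StratifiedKFold` places two positives in every fold.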


Two types of APIs:

|Generic|Specific|
|---|---|
|SGD classifier|Logistic regression|
||Perceptron|
||Ridge classifier (for LSC)|
||K-nearest neighbours (KNNs)|
||Support vector machines (SVMs)|
||Naive Bayes|
|||
|Uses **gradient descent** for optimization|Use **specialized solvers** for optimization|

# Ridge classifier
- classifier variant of the **Ridge** regressor.  

Binary classification:
- classifier first converts binary targets to {-1, 1} and then treats the problem as a regression task.
- sklearn provides different **solvers** for the optimization
- predicted class corresponds to the sign of the regressor's prediction  

Multiclass classification:
- treated as multi-output regression
- predicted class corresponds to the output with the highest value.  
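
The two prediction rules above can be checked with a short sketch. The synthetic datasets from `make_classification` and all variable names are assumptions made for illustration; the sketch rebuilds the predictions from `decision_function` and compares them with `predict`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import RidgeClassifier

# Binary case: the predicted class follows the sign of the regression output.
X_bin, y_bin = make_classification(n_samples=200, n_features=10, random_state=0)
clf_bin = RidgeClassifier().fit(X_bin, y_bin)
scores_bin = clf_bin.decision_function(X_bin)  # shape (n_samples,)
pred_from_sign = clf_bin.classes_[(scores_bin > 0).astype(int)]

# Multiclass case: one regression output per class, the largest one wins.
X_multi, y_multi = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=0
)
clf_multi = RidgeClassifier().fit(X_multi, y_multi)
scores_multi = clf_multi.decision_function(X_multi)  # shape (n_samples, 3)
pred_from_argmax = clf_multi.classes_[scores_multi.argmax(axis=1)]

print(np.array_equal(pred_from_sign, clf_bin.predict(X_bin)))
print(np.array_equal(pred_from_argmax, clf_multi.predict(X_multi)))
```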

Use one of the following **solvers**:
- `svd` : uses a Singular Value Decomposition of the feature matrix to compute the Ridge coefficients.
- `cholesky` : uses `scipy.linalg.solve` function to obtain the closed-form solution
- `sparse_cg` : uses the conjugate gradient solver of `scipy.sparse.linalg.cg`.
- `lsqr` : uses the dedicated regularized least-squares routine `scipy.sparse.linalg.lsqr` and **it is the fastest**.
- `sag`, `saga` : use the Stochastic Average Gradient descent iterative procedure. `saga` is an unbiased and more flexible version of `sag`.
- `lbfgs` : uses the L-BFGS-B algorithm implemented in `scipy.optimize.minimize`. It can be used only when the coefficients are forced to be positive (`positive=True`).  

Choice of **solver** in RidgeClassifier:
- For large scale data, use `sparse_cg` solver.
- When both `n_samples` and `n_features` are large, use the `sag` or `saga` solver.
  - Note that fast convergence is only guaranteed on features with approximately the same scale.  
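
Because `sag` and `saga` converge quickly only on similarly scaled features, one reasonable pattern is to standardize inside a pipeline. This is a minimal sketch on synthetic data; the dataset size and the choice of `StandardScaler` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import RidgeClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for a larger problem.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Scale the features first so that the iterative 'sag' solver sees
# features of comparable magnitude.
model = make_pipeline(
    StandardScaler(),
    RidgeClassifier(solver="sag", random_state=0),
)
model.fit(X, y)
train_accuracy = model.score(X, y)
print(f"training accuracy: {train_accuracy:.3f}")
```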

### How to make RidgeClassifier select the solver automatically?
```ridge_classifier = RidgeClassifier(solver="auto")```  
`auto`: chooses the solver automatically based on the type of data.  
The default value of `solver` is `auto`.

# Perceptron
- It is a simple classification algorithm suitable for **large-scale learning**.
- Shares the same underlying implementation with `SGDClassifier`

|Estimator|Equivalent to|
|---|---|
|`Perceptron()`|```SGDClassifier(loss="perceptron", eta0=1, learning_rate="constant", penalty=None)```|

Perceptron uses SGD for training
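
The equivalence can be checked empirically. The sketch below assumes synthetic data and fixes `random_state=0` on both estimators so that they shuffle the samples identically; `penalty=None` is passed to `SGDClassifier` because `Perceptron` applies no penalty by default.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import Perceptron, SGDClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

perceptron = Perceptron(random_state=0).fit(X, y)
sgd = SGDClassifier(
    loss="perceptron",
    eta0=1,
    learning_rate="constant",
    penalty=None,  # Perceptron uses no regularization by default
    random_state=0,
).fit(X, y)

# Compare the two fitted models on the training data.
same_predictions = np.array_equal(perceptron.predict(X), sgd.predict(X))
max_coef_gap = float(np.max(np.abs(perceptron.coef_ - sgd.coef_)))
print(same_predictions, max_coef_gap)
```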

# Logistic Regression
- a.k.a. logit regression, maximum entropy classifier (maxent) and log-linear classifier.
$$
\underset{w,\,c}{\arg\min}\ \ \text{regularization penalty} + C \times \text{cross-entropy loss}
$$  

C: **inverse of the regularization strength**

- This implementation can fit
  - binary classification
  - one-vs-rest (OVR)
  - multinomial logistic regression
- Provision for **l1, l2** or **elastic-net regularization**  

### How to select **solvers** for Logistic Regression classifier?
The choice of solver depends on the problem setup, such as the **size of the dataset** and the **number of features and labels**.
- `newton-cg`
- `lbfgs`: default solver
- `liblinear`
- `sag`
- `saga`

- For **small datasets**, '**liblinear**' is a good choice, whereas '**sag**' and '**saga**' are faster for **large** ones.
- For **unscaled datasets**, 'liblinear', 'lbfgs' and 'newton-cg' are robust.
- For **multiclass problems**, only 'newton-cg', 'sag', 'saga' and 'lbfgs' handle multinomial loss.
- 'liblinear' is limited to one-versus-rest schemes

> By default, Logistic regression uses **L2 penalty**.

- **Not all solvers support all penalties**.
  - **L2 penalty** is supported by all solvers.
  - **L1 penalty** is supported only by a few solvers (`liblinear` and `saga`).  
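
To make the solver and penalty pairing concrete, the sketch below fits an L1-penalized model with `liblinear` and then tries the same penalty with `lbfgs`. The synthetic data and the small `C=0.05` are assumptions chosen only to make the sparsity visible; the `lbfgs` attempt is wrapped in `try`/`except` because it is expected to be rejected with a `ValueError`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# L1 penalty with a solver that supports it; strong regularization (small C)
# drives some coefficients exactly to zero.
l1_clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.05).fit(X, y)
n_zero_coefficients = int(np.sum(l1_clf.coef_ == 0))
print("zero coefficients:", n_zero_coefficients, "of", l1_clf.coef_.size)

# The default solver (lbfgs) is expected to refuse the L1 penalty.
try:
    LogisticRegression(penalty="l1", solver="lbfgs").fit(X, y)
    lbfgs_accepts_l1 = True
except ValueError:
    lbfgs_accepts_l1 = False
print("lbfgs accepts L1:", lbfgs_accepts_l1)
```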

- C is specified in the constructor and must be positive
  - **smaller value** leads to **stronger** regularization.
  - **Larger value** leads to **weaker** regularization.  
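
The effect of `C` can be observed through the size of the learned coefficients. This is a rough sketch on synthetic data; the three values of `C` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Fit the same model with increasingly weak regularization and record
# the L2 norm of the coefficient vector.
coef_norms = {}
for C in (0.01, 1.0, 100.0):
    clf = LogisticRegression(C=C, max_iter=1000).fit(X, y)
    coef_norms[C] = float(np.linalg.norm(clf.coef_))

for C, norm in coef_norms.items():
    print(f"C={C:<6} ||w||_2 = {norm:.3f}")
```

A smaller `C` shrinks the coefficients more strongly, so the norm should grow as `C` increases.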

> `class_weight` parameter in the constructor of classifier estimators handles **class imbalance**.
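
As an illustration of `class_weight`, the sketch below uses made-up imbalanced labels and shows the per-class weights that the `"balanced"` setting corresponds to, computed here with `compute_class_weight`. The data and variable names are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# Illustrative imbalanced problem: 90 negatives, 10 positives.
rng = np.random.default_rng(0)
y = np.array([0] * 90 + [1] * 10)
X = rng.normal(size=(100, 3)) + y[:, None]  # shift the positive class

# "balanced" weights are n_samples / (n_classes * count of each class).
balanced_weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y
)
print(balanced_weights)

# The same setting can be passed directly to the classifier.
clf = LogisticRegression(class_weight="balanced").fit(X, y)
```

The minority class receives the larger weight, so its errors count more in the loss.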

> **LogisticRegressionCV** implements logistic regression with built-in **cross-validation support** to find the **best values** of **C** and **l1_ratio** according to the specified **scoring** parameter.
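
A minimal sketch of `LogisticRegressionCV` follows. The candidate grid `candidate_Cs`, the 5-fold split and the accuracy scoring are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Candidate values of C to evaluate by cross-validation (illustrative grid).
candidate_Cs = [0.01, 0.1, 1.0, 10.0]
clf = LogisticRegressionCV(
    Cs=candidate_Cs, cv=5, scoring="accuracy", max_iter=1000
).fit(X, y)

# C_ holds the selected value; ravel() keeps this robust to its exact shape.
best_C = float(np.ravel(clf.C_)[0])
print("selected C:", best_C)
```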

# SGDClassifier
- **SGD** is a simple yet very efficient **approach to fitting linear classifiers** under convex loss functions
- It supports **multi-class classification** by combining multiple binary classifiers in a "**one versus all**" (OVA) scheme.
- **Easily scales up to large-scale problems** with more than $10^5$ training examples and $10^5$ features. It also works well with **sparse** machine learning problems, such as those often encountered in text classification and natural language processing.  

We need to set the **`loss` parameter** appropriately to train the classifier of our interest with **SGDClassifier**. Values of the `loss` parameter:
- `hinge`: (soft-margin) linear Support Vector Machine [By DEFAULT]
- `log_loss`: logistic regression (this value was named `log` before scikit-learn 1.1)
- `modified_huber`: smoothed hinge loss that brings tolerance to outliers as well as probability estimates
- `squared_hinge`: like hinge but is quadratically penalized
- `perceptron`: linear loss used by the perceptron algorithm
- `squared_error` (least squares classification), `huber`, `epsilon_insensitive`, or `squared_epsilon_insensitive`: regression losses
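
The choice of `loss` changes what the fitted model can do. As a small sketch on synthetic data (the dataset and variable names are assumptions), the default `hinge` loss gives a linear SVM without probability estimates, while `modified_huber` exposes `predict_proba`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Default loss: linear SVM, no probability estimates.
hinge_clf = SGDClassifier(loss="hinge", random_state=0).fit(X, y)
hinge_has_proba = hasattr(hinge_clf, "predict_proba")

# modified_huber: tolerant to outliers and provides probability estimates.
huber_clf = SGDClassifier(loss="modified_huber", random_state=0).fit(X, y)
proba = huber_clf.predict_proba(X)  # shape (n_samples, n_classes)

print("hinge exposes predict_proba:", hinge_has_proba)
print("modified_huber probabilities shape:", proba.shape)
```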

|Advantages:|Disadvantages:|
|---|---|
|Efficiency|Requires a number of hyperparameters.|
|Ease of implementation|Sensitive to feature scaling.|

It is important to
- **permute** (shuffle) the training data before fitting the model.
- standardize the features.
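
Both recommendations can be combined in one pipeline. The sketch below assumes synthetic data; `shuffle=True` (already the default) reshuffles the training data after each epoch, and `StandardScaler` standardizes the features. It also reads a few default hyperparameters from `get_params()` rather than hard-coding them.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Inspect a few defaults of SGDClassifier.
defaults = SGDClassifier().get_params()
default_summary = {k: defaults[k] for k in ("penalty", "alpha", "max_iter", "shuffle")}
print(default_summary)

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Standardize, then fit with shuffling enabled.
model = make_pipeline(
    StandardScaler(),
    SGDClassifier(shuffle=True, random_state=0),
)
model.fit(X, y)
train_accuracy = model.score(X, y)
print(f"training accuracy: {train_accuracy:.3f}")
```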

> By default: 
> - penalty='l2' with regularization strength alpha=0.0001
> - max_iter = 1000