# <a id='toc1_'></a>[ML Regression and Classification Review](#toc0_)

**Table of contents**<a id='toc0_'></a>    
- [ML Regression and Classification Review](#toc1_)    
  - [Logistic Regression](#toc1_1_)    
  - [Linear Regression](#toc1_2_)    
  - [K-Nearest Neighbors](#toc1_3_)    
  - [Decision Trees](#toc1_4_)    
  - [Support Vector Machines (SVMs)](#toc1_5_)    
    - [Supplemental Stuff](#toc1_5_1_)    
  - [Naive Bayes](#toc1_6_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

In [1]:
from sklearn.model_selection import train_test_split
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
import pandas as pd
import numpy as np

iris = datasets.load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

#Transform data
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

## <a id='toc1_1_'></a>[Logistic Regression](#toc0_)

- **Intuition**: Logistic regression is a statistical model that uses the logistic (sigmoid) function to model the probability of a binary dependent variable. Because it predicts the probability of an event, it is a natural fit for binary classification problems.

- **Use case**: Logistic regression is used when the dependent variable is binary. It's widely used for binary classification problems like spam detection, churn prediction, or health diagnosis. scikit-learn also extends the model to more than two classes, which is how it handles the three-class iris data used in this notebook.

- **Classification Intuition**: Logistic regression outputs probabilities via the logistic sigmoid function. If the predicted probability is greater than 0.5, the instance is assigned to the positive class; otherwise it is assigned to the negative class.

- **Probability Formula**: The logistic regression function can be written as: 
    $$ P(Y=1|X) = \frac{1}{1+e^{-(\beta_0 + \beta_1X)}} $$
- **Cost Function**: The cost function in logistic regression is the log loss, which can be written as:
    $$ J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} [y^{(i)}\log(h_\theta(x^{(i)})) + (1 - y^{(i)})\log(1 - h_\theta(x^{(i)}))] $$

- **How to code it**:



In [2]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train, y_train)
model.score(X_test, y_test)


0.9777777777777777
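
The probability formula and the log loss above can be checked by hand on a few numbers. The sketch below is a minimal pure-Python illustration for the binary case; the coefficients `beta0` and `beta1` and the data points are hypothetical values chosen for the example, not fitted to iris.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(y_true, p_pred):
    """Average log loss J for labels in {0, 1} and predicted P(Y=1|X)."""
    m = len(y_true)
    total = 0.0
    for y, p in zip(y_true, p_pred):
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / m

# Hypothetical coefficients and inputs, chosen only for illustration.
beta0, beta1 = -1.0, 2.0
xs = [0.0, 0.5, 1.0, 2.0]
ys = [0, 0, 1, 1]

probs = [sigmoid(beta0 + beta1 * x) for x in xs]   # P(Y=1|X) per point
labels = [1 if p > 0.5 else 0 for p in probs]      # threshold at 0.5
loss = log_loss(ys, probs)

print(probs)
print(labels)
print(loss)
```

A point whose linear score is exactly zero gets probability 0.5 and, under the "greater than 0.5" rule, falls into the negative class.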


- **The most important hyperparameters to iterate through**:
    - `C`: Inverse of regularization strength; must be a positive float. Smaller values specify stronger regularization.
    - `penalty`: The norm used in the penalization (`'l1'`, `'l2'`, `'elasticnet'`, or no penalty). Not every solver supports every penalty.
    - `solver`: Algorithm to use in the optimization problem. The default `'lbfgs'` supports only `'l2'` or no penalty; `'saga'` supports `'l1'`, `'l2'`, and `'elasticnet'`.

- **Code for iterating through one example of a hyperparameter**:


In [3]:
from sklearn.model_selection import GridSearchCV

# The default 'lbfgs' solver does not support the 'l1' penalty, which is why the
# original run of this cell failed 35 of its 70 fits (every 'l1' candidate) and
# also hit the iteration limit. 'saga' supports both penalties, and a larger
# max_iter gives the solver more room to converge.
param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000], 'penalty': ['l1', 'l2']}
grid = GridSearchCV(LogisticRegression(solver='saga', max_iter=5000), param_grid)
grid.fit(X_train, y_train)


- **Assumptions**:
    - Binary logistic regression requires the dependent variable to be binary.
    - Requires the observations to be independent of each other.
    - Requires little or no multicollinearity among the independent variables.
    - Requires the independent variables to be linearly related to the log odds.

- **Interpretation of Coefficients**: The coefficients of the logistic regression algorithm can be interpreted as the change in the log odds of the output variable for a one unit change in the input variable. For example, a coefficient of 0.5 would mean that a one unit change in the input variable would result in a 0.5 unit change in the log odds of the output variable.
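
Log odds are hard to read directly, so a coefficient is often exponentiated to get an odds ratio. The sketch below works through the 0.5 example from the text; the starting probability `p_before` is a hypothetical value used only to show the conversion.

```python
import math

coef = 0.5                      # example coefficient from the text
odds_ratio = math.exp(coef)     # multiplicative change in the odds per unit of x

def prob_to_odds(p):
    return p / (1.0 - p)

def odds_to_prob(odds):
    return odds / (1.0 + odds)

# Hypothetical starting probability before the input increases by one unit.
p_before = 0.30
p_after = odds_to_prob(prob_to_odds(p_before) * odds_ratio)

print(odds_ratio)
print(p_after)
```

The change in probability depends on the starting point, while the change in log odds (0.5) and the odds ratio stay the same for every one-unit increase.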


## <a id='toc1_2_'></a>[Linear Regression](#toc0_)

- **Intuition**: Linear regression is a statistical model that examines the linear relationship between a dependent variable and one (Simple Linear Regression) or more (Multiple Linear Regression) independent variables. A linear relationship means that a change in an independent variable is associated with a proportional change in the dependent variable, which may be an increase or a decrease depending on the sign of the coefficient.
- **Use case**: Linear regression is used when we want to predict the value of a variable based on the value of another variable. For example, you could use linear regression to predict sales based on advertising spend.
- **Regression Intuition**: Linear regression algorithm finds the best fit line through the data by finding the line that minimizes the sum of squares of residuals.
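- **Formula**: The model predicts the dependent variable as a weighted sum of the independent variables:
    $$ h_\theta(x) = \theta_0 + \theta_1 x_1 + \dots + \theta_n x_n $$
- **Cost Function**: The cost function in linear regression is the mean squared error, written here with a conventional factor of $\frac{1}{2}$; it is the Residual Sum of Squares (RSS) divided by $2m$: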
    $$ J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2 $$

- **How to code it**:


In [4]:
from sklearn.linear_model import LinearRegression

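# Note: the iris targets are class labels (0, 1, 2); they are treated as numeric
# values here only to illustrate the API.
model = LinearRegression()
model.fit(X_train, y_train)
model.score(X_test, y_test)  # for regressors, score() returns R^2, not accuracy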


0.9474498984024573
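
The "best fit line" idea can be made concrete on a tiny data set. The sketch below uses NumPy's least-squares solver on made-up points and evaluates the cost $J(\theta)$ from above; the data and the comparison line are assumptions for illustration only.

```python
import numpy as np

# Made-up data lying roughly on a straight line.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 7.1, 8.8])

# Design matrix with a column of ones for the intercept theta_0.
X = np.column_stack([np.ones_like(x), x])

# Least-squares solution: the theta that minimizes the sum of squared residuals.
theta, *_ = np.linalg.lstsq(X, y, rcond=None)

def cost(theta, X, y):
    """J(theta) = 1/(2m) * sum of squared residuals."""
    m = len(y)
    residuals = X @ theta - y
    return (residuals @ residuals) / (2 * m)

best_cost = cost(theta, X, y)
other_cost = cost(np.array([1.0, 2.0]), X, y)  # a plausible but non-optimal line

print(theta)
print(best_cost, other_cost)
```

Any other line, such as intercept 1 and slope 2, gives a cost at least as large as the least-squares solution.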

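- **Important Hyperparameters**: 
    - `fit_intercept`: Whether to calculate the intercept for this model. If set to False, no intercept will be used in calculations.
    - `normalize`: Only available in older scikit-learn releases (deprecated in 1.0 and removed in 1.2). It was ignored when `fit_intercept` was False; if True, the regressors X were normalized before regression by subtracting the mean and dividing by the l2-norm. In current versions, preprocess with a scaler such as `StandardScaler` before fitting instead.
- **Code for Hyperparameter Tuning**: Ordinary least squares has no regularization strength or similar hyperparameter to tune for better performance; the model learns its parameters (the coefficients) directly from the data. Regularized variants such as `Ridge` and `Lasso` add an `alpha` parameter that can be tuned.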
- **Assumptions**:
    - Linearity: The relationship between X and the mean of Y is linear.
    - Homoscedasticity: The variance of residual is the same for any value of X.
    - Independence: Observations are independent of each other.
    - Normality: For any fixed value of X, Y is normally distributed.
- **Interpretation of Coefficients**: The sign of a regression coefficient tells you whether there is a positive or negative relationship between each independent variable and the dependent variable. A positive coefficient indicates that as the value of the independent variable increases, the mean of the dependent variable also tends to increase. A negative coefficient suggests that as the independent variable increases, the dependent variable tends to decrease. The coefficient value signifies how much the mean of the dependent variable changes given a one-unit shift in the independent variable while holding other variables in the model constant.


## <a id='toc1_3_'></a>[K-Nearest Neighbors](#toc0_)

- **Intuition**: KNN is a type of instance-based learning, or lazy learning, where the function is only approximated locally and all computation is deferred until function evaluation. It works by finding a predetermined number of training samples closest in distance to the new point and predicting the label from them.
- **Use case**: KNN can be used in both classification and regression predictive problems. However, it is more widely used in classification problems in the industry. K-nearest neighbors algorithm is used for simple classification tasks, where the dataset is small and well-labeled.
- **Classification Intuition**: In KNN classification, an object is classified by a majority vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors.
- **Regression Intuition**: In KNN regression, the output is the property value for the object. This value is the average (or median) of the values of k nearest neighbors.
- **Probability Formula**: KNN has no parametric probability model. A class probability can still be estimated as the fraction of the k nearest neighbors that belong to each class (this is what `predict_proba` returns with uniform weights). The prediction itself is the majority vote (for classification) or the average/median (for regression) of the k closest points.
- **Cost Function**: KNN does not have a cost function as it does not learn a function from the training data. The 'cost' is essentially the computation of distances to all points in the training set.
- **How to code it**:


In [5]:
from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)
model.score(X_test, y_test)


0.9777777777777777
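
To make the majority vote concrete, here is a minimal from-scratch sketch of KNN classification in pure Python. It is not how scikit-learn implements the search internally; the points, labels, and the helper name `knn_predict` are made up for illustration.

```python
from collections import Counter
import math

def knn_predict(train_points, train_labels, query, k=3):
    """Return (majority label, vote fractions) among the k nearest neighbors."""
    # Euclidean distance from the query to every training point.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest = [label for _, label in distances[:k]]
    votes = Counter(nearest)
    predicted = votes.most_common(1)[0][0]
    fractions = {label: count / k for label, count in votes.items()}
    return predicted, fractions

# Two made-up clusters labeled 'a' and 'b'.
points = [(1, 1), (1, 2), (2, 1), (5, 5), (6, 5), (5, 6)]
labels = ['a', 'a', 'a', 'b', 'b', 'b']

predicted, fractions = knn_predict(points, labels, query=(3, 3), k=5)
print(predicted, fractions)
```

With `k=5` the query's neighborhood contains three `'a'` points and two `'b'` points, so the vote goes to `'a'` with an estimated class probability of 3/5.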


- **Important Hyperparameters**: 
    - `n_neighbors`: Number of neighbors to use (k).
    - `weights`: Weight function used in prediction. Possible values: 'uniform', 'distance'.
    - `p`: Power parameter for the Minkowski metric (`p=1` is Manhattan distance, `p=2` is Euclidean distance).

- **Code for Hyperparameter Tuning**:



In [6]:
from sklearn.model_selection import GridSearchCV
param_grid = {'n_neighbors': [3, 5, 7, 9], 'weights': ['uniform', 'distance']}
grid = GridSearchCV(KNeighborsClassifier(), param_grid)
grid.fit(X_train, y_train)
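
A fitted `GridSearchCV` object stores the outcome of the search. The sketch below repeats the KNN search in a self-contained form and reads back the winning combination; the `random_state=0` split and the explicit `cv=5` are assumptions added for reproducibility and are not part of the original cells.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Rebuild the data so the sketch runs on its own (random_state is an assumption).
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

param_grid = {'n_neighbors': [3, 5, 7, 9], 'weights': ['uniform', 'distance']}
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
grid.fit(X_train, y_train)

print(grid.best_params_)            # best combination found by cross-validation
print(grid.best_score_)             # mean cross-validated accuracy of that combination
print(grid.score(X_test, y_test))   # accuracy of the refitted best model on held-out data
```

The same attributes are available on the grid searches for the other models in this notebook.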


- **Assumptions**:
    - KNN assumes that similar things exist in close proximity. In other words, similar things are near to each other.
- **Interpretation of Coefficients**: KNN does not provide coefficients as it does not learn a mathematical function from the data.


## <a id='toc1_4_'></a>[Decision Trees](#toc0_)
- **Intuition**: Decision Trees are a type of supervised machine learning model in which the data is repeatedly split according to the value of a feature. A tree has two kinds of elements: decision nodes, where the data is split, and leaves, which hold the decisions or final outcomes.
- **Use case**: Decision Trees are commonly used in operations research, specifically in decision analysis, to help identify a strategy most likely to reach a goal. They are a powerful prediction method and extremely popular.
- **Classification Intuition**: In the context of classification, the decision tree algorithm is used to identify the class or category of an observation based on the values of its features. The decision tree makes decisions by splitting data based on the values of the input features, and each split is made in a way that maximizes the separation of the classes in the output variable.
- **Regression Intuition**: In the context of regression, the decision tree algorithm is used to predict a continuous value based on the values of its features. The decision tree makes decisions by splitting data based on the values of the input features, and each split is made in a way that minimizes the variance or other measure of dispersion in the output variable.
- **Probability Formula**: Decision Trees do not directly provide a formula for probability. However, the proportion of training instances of a class in each node can give a probability for the Decision Tree's prediction.
- **Cost Function**: The cost function that is minimized to choose split points is the Gini cost function in the case of classification trees, and it is the sum of squared residuals in the case of regression trees.
- **Code Example**:



In [7]:
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier()
model.fit(X_train, y_train)
model.score(X_test, y_test)


0.9777777777777777
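
The Gini cost mentioned above can be computed by hand for a candidate split. The following is a minimal sketch, assuming the usual CART-style cost (the size-weighted average of the child impurities); the label lists and helper names are made up for illustration.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_cost(left, right):
    """Weighted average impurity of the two children of a split."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

parent = [0, 0, 0, 1, 1, 1]
pure_split = split_cost([0, 0, 0], [1, 1, 1])    # classes perfectly separated
mixed_split = split_cost([0, 0, 1], [0, 1, 1])   # classes still mixed

print(gini(parent), pure_split, mixed_split)
```

The tree-building algorithm prefers the split with the lower cost, which here is the one that separates the two classes completely.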

- **Important Hyperparameters**: 
    - `max_depth`: The maximum depth of the tree. This parameter is a stopping condition. The deeper the tree, the more complex the model will be.
    - `min_samples_split`: The minimum number of samples required to split an internal node.
    - `min_samples_leaf`: The minimum number of samples required to be at a leaf node.
- **Code for Hyperparameter Tuning**:

In [8]:
from sklearn.model_selection import GridSearchCV
param_grid = {'max_depth': [1, 2, 3, 4, 5], 'min_samples_split': [2, 3, 4], 'min_samples_leaf': [1, 2, 3]}
grid = GridSearchCV(DecisionTreeClassifier(), param_grid)
grid.fit(X_train, y_train)
# grid.score(X_test, y_test)


- **Assumptions**:
    - Decision tree does not assume any distribution of the data. It can handle any distribution of data.
    - It does not require any assumptions of linearity in the data.
- **Interpretation of Coefficients**: Decision Trees do not have coefficients like linear or logistic regression. Instead, the interpretation of a Decision Tree model is the structure of the tree itself, which can be visualized and used to make decisions based on the input features. Each path in the tree from the root to a leaf represents a decision path that ends in a predicted outcome.
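
Since the tree structure is the interpretation, it helps to print it. The sketch below uses scikit-learn's `export_text` on a shallow tree; `max_depth=2` and `random_state=0` are assumptions chosen to keep the printed rules short and repeatable.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A deliberately shallow tree so the printed rules stay readable.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# Each indented line is a decision node; lines with "class:" are leaves.
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```

Each root-to-leaf path in the printout reads as an if/else rule on the feature values.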

## <a id='toc1_5_'></a>[Support Vector Machines (SVMs)](#toc0_)

- **Intuition**: SVM is a supervised machine learning algorithm that can be used for classification, regression, and outlier detection, but is mostly used for classification. Each data item is treated as a point in n-dimensional space (where n is the number of features), with the value of each feature being the value of a particular coordinate. The idea is simple: the algorithm finds a line or hyperplane that separates the data into classes.

- **Use case**: SVMs are helpful in text and hypertext categorization, classification of images, and in the biological and other sciences.

- **Classification Intuition**: SVMs are based on the idea of finding the hyperplane that best separates the classes, namely the one with the largest margin to the nearest training points (the support vectors).

- **Regression Intuition**: In regression (SVR), the model fits a function such that most data points lie within a tube of width ε around it. Deviations smaller than ε are ignored, and deviations beyond ε are penalized and minimized.

- **Non-linear Intuition**: SVM can efficiently perform a non-linear classification using what is called the kernel trick, implicitly mapping their inputs into high-dimensional feature spaces. When data are not linearly separable, SVM uses a higher dimension to separate the data, which is not possible using simple logistic regression.

- **Cost Function**: SVM aims to minimize the structural risk, defined as the sum of a term that penalizes model complexity (the squared norm of the weight vector, which is inversely related to the margin width) and the training error (the slack variables $\xi_i$, weighted by C). The cost function can be written as:
    $$ \min_{w,b,\xi} \frac{1}{2}w^Tw + C\sum_{i=1}^{n}\xi_i $$
    subject to the constraints:
    $$ y_i(w^Tx_i - b) \geq 1 - \xi_i, \xi_i \geq 0 $$

- **How to code it**:


In [9]:

from sklearn import svm

model = svm.SVC()
model.fit(X_train, y_train)
model.score(X_test, y_test)


0.9777777777777777
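
The soft-margin objective above can be evaluated directly for a given hyperplane. The sketch below uses the same sign convention as the constraint ($w^Tx_i - b$) and the fact that, for fixed $w$ and $b$, the smallest feasible slack is $\xi_i = \max(0, 1 - y_i(w^Tx_i - b))$. The points, `w`, `b`, and `C` are made-up values; this evaluates the objective and does not solve the optimization problem.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def svm_objective(w, b, X, y, C):
    """Return (objective value, slack variables) for labels in {-1, +1}."""
    slacks = []
    for x_i, y_i in zip(X, y):
        margin = y_i * (dot(w, x_i) - b)
        slacks.append(max(0.0, 1.0 - margin))  # zero when the margin constraint holds
    objective = 0.5 * dot(w, w) + C * sum(slacks)
    return objective, slacks

# Made-up 2-D points; the last one sits on the decision boundary.
X = [(2.0, 2.0), (3.0, 3.0), (0.0, 0.0), (1.5, 1.5)]
y = [1, 1, -1, -1]
w, b, C = (1.0, 1.0), 3.0, 10.0

objective, slacks = svm_objective(w, b, X, y, C)
print(objective, slacks)
```

Only the point that violates the margin contributes slack, so a larger `C` makes that single violation more expensive relative to the margin term.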


- **Important Hyperparameters**: 
    - `C`: Regularization parameter. The strength of the regularization is inversely proportional to C. Must be strictly positive.
    - `kernel`: Specifies the kernel type to be used in the algorithm. It could be 'linear', 'poly', 'rbf', 'sigmoid', 'precomputed' or a callable.
    - `degree`: Degree of the polynomial kernel function ('poly'). Ignored by all other kernels.
    - `gamma`: Kernel coefficient for 'rbf', 'poly' and 'sigmoid'.
- **Code for Hyperparameter Tuning**:


In [10]:
from sklearn.model_selection import GridSearchCV

param_grid = {'C': [0.1, 1, 10, 100], 'gamma': [1, 0.1, 0.01, 0.001], 'kernel': ['rbf', 'poly', 'sigmoid']}
grid = GridSearchCV(svm.SVC(), param_grid, refit=True)
grid.fit(X_train, y_train)
grid.score(X_test, y_test)


0.9777777777777777
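
The kernel trick and the role of `gamma` can be illustrated without any library. The sketch below shows that a degree-2 polynomial kernel equals an inner product in an explicit higher-dimensional feature space, and how `gamma` controls how quickly RBF similarity decays with distance. The kernels are simplified textbook forms (scikit-learn's `'poly'` kernel also includes `gamma` and `coef0` terms), and the points are made up.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def poly2_kernel(x, z):
    """Homogeneous degree-2 polynomial kernel: K(x, z) = (x . z)^2."""
    return dot(x, z) ** 2

def poly2_features(x):
    """Explicit 3-D feature map for 2-D inputs that reproduces poly2_kernel."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def rbf(x, z, gamma):
    """RBF kernel: K(x, z) = exp(-gamma * ||x - z||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

x, z = (1.0, 2.0), (3.0, 4.0)

implicit = poly2_kernel(x, z)                           # computed in the input space
explicit = dot(poly2_features(x), poly2_features(z))    # computed in the feature space
print(implicit, explicit)

# Larger gamma means similarity drops off faster with distance.
print(rbf(x, z, gamma=0.1), rbf(x, z, gamma=1.0))
```

The kernel gives the same value as the explicit mapping without ever constructing the higher-dimensional vectors, which is what makes non-linear SVMs practical.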


- **Assumptions**:
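    - SVM makes no distributional assumptions about the data. It does expect numeric input as a matrix where rows represent the samples and columns represent the attributes of the samples. Because it relies on distances and inner products, it is sensitive to feature scale, so features are usually standardized first.
- **Interpretation of Coefficients**: The coefficients in SVM are not easily interpretable. They do not represent the change in the output variable for a unit change in an input variable, as in linear regression. With a linear kernel they are the feature weights that define the separating hyperplane (`coef_` in scikit-learn); with non-linear kernels there are no per-feature coefficients in the original input space.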

### <a id='toc1_5_1_'></a>[Supplemental Stuff](#toc0_)

- **Computing a Probability of Class Membership**
  - Support Vector Machines (SVMs) are primarily designed for binary classification and do not directly provide probability estimates for the classes. They output a decision function that is used to separate the classes. However, it is possible to obtain class probabilities using an additional method known as Platt Scaling.
  - Platt Scaling is a method that fits a logistic regression model to the SVM's scores, which are then transformed into probabilities. This is done by applying the logistic function to the output of the SVM's decision function.
  - In scikit-learn, you can get probability estimates for SVM by setting the `probability` parameter to `True` when creating the SVM object. After fitting the SVM, you can call the `predict_proba` method to get the class probabilities.

Here is an example:



In [11]:
from sklearn import svm

# Create a SVM classifier with probability estimates
clf = svm.SVC(probability=True)

# Fit the SVM model
clf.fit(X_train, y_train)

# Get class probabilities
probabilities = clf.predict_proba(X_test)




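In this example, `probabilities` is a 2D array with one row per test sample and one column per class, ordered as in `clf.classes_`. For the three-class iris data it has three columns; in a binary problem, the first column would be the probability of the negative class (class 0) and the second column the probability of the positive class (class 1).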

Please note that enabling probability estimates in SVMs involves an internal cross-validation and can be computationally expensive.
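
The Platt scaling step described above amounts to passing the SVM decision value $f$ through a fitted sigmoid, $P(y=1|f) = \frac{1}{1 + e^{Af + B}}$. The sketch below shows only that mapping for the binary case; `A` and `B` are hypothetical constants here, whereas in practice they are fitted to held-out decision values.

```python
import math

def platt_probability(decision_value, A, B):
    """Platt scaling: P(y=1 | f) = 1 / (1 + exp(A * f + B))."""
    return 1.0 / (1.0 + math.exp(A * decision_value + B))

# Hypothetical sigmoid parameters; a negative A makes larger decision
# values map to higher probabilities.
A, B = -2.0, 0.0

scores = [-2.0, -0.5, 0.0, 0.5, 2.0]   # made-up SVM decision values
probs = [platt_probability(f, A, B) for f in scores]
print(probs)
```

Points on the decision boundary (`f = 0`) map to 0.5 when `B` is zero, and points far from the boundary map to probabilities near 0 or 1.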

## <a id='toc1_6_'></a>[Naive Bayes](#toc0_)

- **Intuition**: Naive Bayes classifiers are a collection of classification algorithms based on Bayes’ Theorem. It is not a single algorithm but a family of algorithms that share a common principle: every pair of features is assumed to be independent of each other, given the class.

- **Use case**: Naive Bayes is a simple technique for constructing classifiers: models that assign class labels to problem instances, represented as vectors of feature values, where the class labels are drawn from some finite set. It is not a single algorithm for training such classifiers, but a family of algorithms based on a common principle: all naive Bayes classifiers assume that the value of a particular feature is independent of the value of any other feature, given the class variable. For example, a fruit may be considered to be an apple if it is red, round, and about 3 inches in diameter. A naive Bayes classifier considers each of these features to contribute independently to the probability that this fruit is an apple, regardless of any possible correlations between the color, roundness, and diameter features.

- **Classification Intuition**: Naive Bayes is a probabilistic classifier: it uses Bayes' Theorem to compute the probability of each class given the observed features and predicts the class with the highest probability. It assumes the features are independent of each other given the class, which is why it is called 'naive'.

- **Regression Intuition**: Naive Bayes is not typically used for regression tasks as it is a probabilistic classifier and is mostly used for classification tasks.

- **Probability Formula**: The Naive Bayes formula is based on Bayes' Theorem, which is:
    $$ P(A|B) = \frac{P(B|A) * P(A)}{P(B)} $$
  In the context of Naive Bayes, it can be written as:
    $$ P(y|X) = \frac{P(X|y) * P(y)}{P(X)} $$
  where:
    - $P(y|X)$ is the posterior probability of class (target) given predictor (attribute).
    - $P(y)$ is the prior probability of class.
    - $P(X|y)$ is the likelihood which is the probability of predictor given class.
    - $P(X)$ is the prior probability of predictor.

- **Cost Function**: Naive Bayes doesn't have a traditional cost function like other algorithms. Instead, it uses the probabilities of each attribute belonging to each class to make a prediction.

- **Code Example**:


In [12]:

from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
model.fit(X_train, y_train)
model.score(X_test, y_test)


0.9777777777777777
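
The Bayes' Theorem calculation can be traced by hand for the Gaussian case. The sketch below is a minimal from-scratch version, assuming each feature follows a normal distribution within each class and that the likelihood factorizes over features (the 'naive' assumption). It omits the `var_smoothing` term that scikit-learn adds, and the data and helper names are made up.

```python
import math

def gaussian_pdf(x, mean, var):
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def fit(X, y):
    """Estimate the prior P(y) and per-feature mean/variance for each class."""
    stats = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        prior = len(rows) / len(X)
        means = [sum(col) / len(col) for col in zip(*rows)]
        variances = [
            sum((v - m) ** 2 for v in col) / len(col)
            for col, m in zip(zip(*rows), means)
        ]
        stats[label] = (prior, means, variances)
    return stats

def posterior(stats, x):
    """P(y|X) = P(X|y) * P(y) / P(X), with P(X|y) a product over features."""
    joint = {}
    for label, (prior, means, variances) in stats.items():
        likelihood = 1.0
        for v, m, s2 in zip(x, means, variances):
            likelihood *= gaussian_pdf(v, m, s2)
        joint[label] = likelihood * prior
    evidence = sum(joint.values())  # P(X), the normalizing constant
    return {label: value / evidence for label, value in joint.items()}

# Made-up two-feature data with two well-separated classes.
X = [(1.0, 2.0), (1.2, 1.8), (0.8, 2.2), (3.0, 4.0), (3.2, 3.8), (2.8, 4.2)]
y = [0, 0, 0, 1, 1, 1]

stats = fit(X, y)
post = posterior(stats, (1.1, 2.1))
predicted = max(post, key=post.get)
print(post, predicted)
```

The predicted class is the one with the largest posterior; because $P(X)$ is the same for every class, it only rescales the scores so that they sum to 1.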


- **Important Hyperparameters**: 
    - `priors`: Prior probabilities of the classes. If specified the priors are not adjusted according to the data.
    - `var_smoothing`: Portion of the largest variance of all features that is added to variances for calculation stability.

- **Code for Hyperparameter Tuning**:


In [13]:

from sklearn.model_selection import GridSearchCV
param_grid = {'var_smoothing': np.logspace(0,-9, num=100)}
grid = GridSearchCV(GaussianNB(), param_grid, cv=5)
grid.fit(X_train, y_train)
grid.score(X_test, y_test)

0.9777777777777777


- **Assumptions**:
    - Independence between every pair of features: Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature.
    - Features are equally important: Another assumption made by the Naive Bayes Classifier is all features
