# ML Regression and Classification Review

In [2]:
from sklearn.model_selection import train_test_split
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
import pandas as pd
import numpy as np

iris = datasets.load_iris()

X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.3)

#Transform data
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

**Table of contents**<a id='toc0_'></a>    
- [Logistic Regression](#toc1_1_)    
- [K-Nearest Neighbors](#toc1_2_)    
- [Decision Trees](#toc1_3_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

## <a id='toc1_1_'></a>[Logistic Regression](#toc0_)

- **Intuition behind test**: Logistic regression is a statistical model that uses a logistic function to model a binary dependent variable. It's a way to predict the probability of a certain event happening, which makes it a very good fit for binary classification problems.

- **Use case for test**: Logistic regression is used when the dependent variable is binary. It's widely used for binary classification problems like spam detection, churn prediction, or health diagnosis.

- **Classification Intuition**: Logistic regression outputs probabilities. If the probability is greater than 0.5, it assigns the instance to the positive class; otherwise it assigns it to the negative class. Logistic regression uses the logistic sigmoid function to return a probability value.

- **Probability Formula**: The logistic regression function can be written as: 
    $$ P(Y=1|X) = \frac{1}{1+e^{-(\beta_0 + \beta_1X)}} $$
- **Cost Function**: The cost function in logistic regression is the log loss, which can be written as:
    $$ J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} [y^{(i)}\log(h_\theta(x^{(i)})) + (1 - y^{(i)})\log(1 - h_\theta(x^{(i)}))] $$

- **How to code it**:



In [5]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train, y_train)
model.score(X_test, y_test)


0.9333333333333333
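
The probability formula and the log loss above can be checked numerically. The following is a minimal sketch, not part of the original notebook: the labels `y_true` and linear scores `z` are hypothetical toy values, and the hand-written loss is compared with `sklearn.metrics.log_loss`.

```python
import numpy as np
from sklearn.metrics import log_loss


def sigmoid(z):
    """Logistic function: maps any real score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


# Hypothetical toy data: true labels and linear scores beta_0 + beta_1 * x
y_true = np.array([1, 0, 1, 1, 0])
z = np.array([2.0, -1.0, 0.5, -0.3, -2.5])

p = sigmoid(z)  # P(Y=1 | X) for each sample

# Log loss written term by term as in J(theta) above
manual_loss = -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

# scikit-learn's implementation should give the same number
sklearn_loss = log_loss(y_true, p)

print(manual_loss, sklearn_loss)
```

Confident wrong predictions (a high probability for the wrong class) dominate the loss, which is why the fourth sample, with a negative score but a positive label, contributes more than the others.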


- **The most important hyperparameters to iterate through**:
    - `C`: Inverse of regularization strength; must be a positive float. Smaller values specify stronger regularization.
    - `penalty`: The norm used in the penalization ('l1', 'l2', 'elasticnet', or no penalty, written `None` in recent scikit-learn releases and 'none' in older ones).
    - `solver`: Algorithm to use in the optimization problem. Not every solver supports every penalty.

- **Code for iterating through one example of a hyperparameter**:


In [6]:
from sklearn.model_selection import GridSearchCV

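# The default 'lbfgs' solver does not support 'l1', so use 'saga', which handles both penalties
param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000], 'penalty': ['l1', 'l2']}
grid = GridSearchCV(LogisticRegression(solver='saga', max_iter=5000), param_grid)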
grid.fit(X_train, y_train)



- **Assumptions**:
    - Binary logistic regression requires the dependent variable to be binary.
    - Requires the observations to be independent of each other.
    - Requires little or no multicollinearity among the independent variables.
    - Requires the independent variables to be linearly related to the log odds.

- **Interpretation of Coefficients**: The coefficients of the logistic regression algorithm can be interpreted as the change in the log odds of the output variable for a one-unit change in the input variable. For example, a coefficient of 0.5 means that a one-unit increase in the input variable raises the log odds of the output by 0.5, which multiplies the odds by $e^{0.5} \approx 1.65$.
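
To make the log-odds interpretation concrete, here is a hedged sketch. It assumes a binary problem built from two of the iris classes (versicolor vs. virginica), which is an illustrative choice and not something the notebook above does. It exponentiates the coefficients to get odds ratios and checks that raising one feature by one unit shifts the log odds by exactly that feature's coefficient.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression

iris = datasets.load_iris()

# Assumption for illustration: keep two classes only so the model is binary
mask = iris.target > 0
X = iris.data[mask]
y = (iris.target[mask] == 2).astype(int)  # 1 = virginica, 0 = versicolor

model = LogisticRegression(max_iter=1000).fit(X, y)

coefs = model.coef_[0]       # change in log odds per one-unit increase
odds_ratios = np.exp(coefs)  # multiplicative change in the odds

# Check on the first sample: raise feature 0 by one unit
x0 = X[:1].copy()
x1 = x0.copy()
x1[0, 0] += 1.0
# For a binary model, decision_function returns the log odds
delta_log_odds = model.decision_function(x1)[0] - model.decision_function(x0)[0]

for name, b, r in zip(iris.feature_names, coefs, odds_ratios):
    print(f"{name}: coef={b:.3f}, odds ratio={r:.3f}")
print("shift in log odds:", delta_log_odds, "coefficient:", coefs[0])
```

The coefficients are on the scale of the input features, so they are only comparable across features when the features have been standardized first.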



## <a id='toc1_2_'></a>[K-Nearest Neighbors](#toc0_)

- **Intuition**: KNN is a type of instance-based (lazy) learning: the function is only approximated locally, and all computation is deferred until prediction time. It works by finding a predetermined number of training samples closest in distance to the new point and predicting the label from them.
- **Use case**: KNN can be used for both classification and regression, although in practice it is used more often for classification. It suits simple tasks where the dataset is small and well labeled.
- **Classification Intuition**: In KNN classification, an object is classified by a majority vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors.
- **Regression Intuition**: In KNN regression, the output is the property value for the object. This value is the average (or median) of the values of k nearest neighbors.
- **Probability Formula**: KNN has no parametric probability model. It simply calculates the distances to all points and takes the majority vote (for classification) or average/median (for regression) of the k closest points. A class probability can still be estimated as the fraction of the k neighbors that belong to each class, which is what `predict_proba` returns with uniform weights.
- **Cost Function**: KNN does not have a cost function as it does not learn a function from the training data. The 'cost' is essentially the computation of distances to all points in the training set.
- **How to code it**:


In [7]:
from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)
model.score(X_test, y_test)


0.9333333333333333
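
The majority-vote rule described above is short enough to write by hand. The sketch below is an illustration under stated assumptions: `knn_predict` is a hypothetical helper, it uses Euclidean distance, and it breaks ties in favor of the smallest class label. It is compared against `KNeighborsClassifier` on unscaled iris data with a fixed `random_state`.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier


def knn_predict(X_train, y_train, x_new, k=3):
    """Classify one point by majority vote among its k nearest training points."""
    distances = np.linalg.norm(X_train - x_new, axis=1)  # Euclidean distances
    nearest = np.argsort(distances)[:k]                  # indices of the k closest
    return np.bincount(y_train[nearest]).argmax()        # most common label


iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0
)

manual_pred = np.array([knn_predict(X_train, y_train, x, k=3) for x in X_test])
sklearn_pred = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train).predict(X_test)

agreement = np.mean(manual_pred == sklearn_pred)
print(f"agreement with scikit-learn: {agreement:.2%}")
```

Small disagreements can appear when several training points are exactly equidistant from the query, because the two implementations may pick different neighbors among the tied points.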


- **Important Hyperparameters**: 
    - `n_neighbors`: Number of neighbors to use by default for kneighbors queries.
    - `weights`: Weight function used in prediction. Possible values: 'uniform', 'distance'.
    - `p`: Power parameter for the Minkowski metric (1 gives Manhattan distance, 2 gives Euclidean distance).

- **Code for Hyperparameter Tuning**:



In [8]:
from sklearn.model_selection import GridSearchCV
param_grid = {'n_neighbors': [3, 5, 7, 9], 'weights': ['uniform', 'distance']}
grid = GridSearchCV(KNeighborsClassifier(), param_grid)
grid.fit(X_train, y_train)
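
After fitting, the grid search object holds the winning combination and its cross-validated score. The following is a self-contained sketch of how one might inspect it; the fixed `random_state` and the explicit `cv=5` are assumptions added for reproducibility and are not in the cell above.

```python
from sklearn import datasets
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0
)

param_grid = {'n_neighbors': [3, 5, 7, 9], 'weights': ['uniform', 'distance']}
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
grid.fit(X_train, y_train)

print("best parameters:", grid.best_params_)           # winning combination
print("mean CV accuracy:", grid.best_score_)           # averaged over the 5 folds
print("test accuracy:", grid.score(X_test, y_test))    # refit best model on held-out data
```

The cross-validated score is computed on the training data only, so the held-out test accuracy is the fairer estimate of how the tuned model generalizes.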


- **Assumptions**:
    - KNN assumes that similar things exist in close proximity. In other words, similar things are near to each other.
- **Interpretation of Coefficients**: KNN does not provide coefficients as it does not learn a mathematical function from the data.



## <a id='toc1_3_'></a>[Decision Trees](#toc0_)

- **Intuition behind test**: A decision tree is an algorithm that makes predictions by applying a sequence of conditions to the features. It's like playing a game of 20 questions to predict the class or value of the target variable.

- **Use case for test**: Decision Trees are used for both classification and regression tasks. They are widely used in customer segmentation, detection of fraudulent transactions, or prediction of diseases.

- **Intuition for using it for classification**: The tree is constructed in a way that the most important features appear at the top of the tree. It splits the data into subsets based on the feature that provides the most information gain. This process is repeated recursively until it makes a prediction for every subset.

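- **Intuition for using it for regression**: A regression tree splits the data in the same recursive way, but chooses the splits that most reduce the spread of the target (typically the squared error) within each subset. The prediction for a new sample is the mean target value of the training samples in the leaf it falls into, so the output is a continuous value.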

- **How to code it**:



In [None]:
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
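
The information gain mentioned above can be computed directly for a candidate split. This is a minimal sketch with hypothetical helper names (`entropy`, `information_gain`) and a toy feature. Note that `DecisionTreeClassifier` uses Gini impurity by default; passing `criterion="entropy"` selects the entropy-based measure sketched here.

```python
import numpy as np


def entropy(labels):
    """Shannon entropy (in bits) of an array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))


def information_gain(feature, labels, threshold):
    """Entropy reduction from splitting the samples at feature <= threshold."""
    left = labels[feature <= threshold]
    right = labels[feature > threshold]
    if len(left) == 0 or len(right) == 0:
        return 0.0  # a split that leaves one side empty separates nothing
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
    return entropy(labels) - weighted


# Hypothetical toy data: the classes separate cleanly at feature = 3.5
feature = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
labels = np.array([0, 0, 0, 1, 1, 1])

gain_good = information_gain(feature, labels, threshold=3.5)
gain_poor = information_gain(feature, labels, threshold=1.5)
print(gain_good, gain_poor)
```

A tree builder evaluates many such thresholds across all features and keeps the split with the highest gain, then repeats the search inside each resulting subset.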



- **The most important hyperparameters to iterate through**: `max_depth` (The maximum depth of the tree), `min_samples_split` (The minimum number of samples required to split an internal node), `min_samples_leaf` (The minimum number of samples required to be at a leaf node)

- **Code for iterating through one example of a hyperparameter**:



In [None]:

from sklearn.model_selection import GridSearchCV

param_grid = {'max_depth': [3, 5, 7, 9, 11]}
clf = GridSearchCV(DecisionTreeClassifier(), param_grid)
clf.fit(X_train, y_train)



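- **Assumptions of the algorithm**: Decision trees are non-parametric and make few formal assumptions: they need neither a linear relationship nor scaled features. The most crucial assumption is that the training set is a representative sample of the actual population. Trees also work best when the training data contains little noise, because an unconstrained tree will fit noise exactly, and when any missing values are missing at random.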

## Linear Regression

- **Intuition**: Linear regression is a statistical model that examines the linear relationship between two (simple linear regression) or more (multiple linear regression) variables: a dependent variable and one or more independent variables. A linear relationship means that a one-unit change in an independent variable shifts the dependent variable by a constant amount, upward if the coefficient is positive and downward if it is negative.
- **Use case**: Linear regression is used when we want to predict the value of a variable based on the value of another variable. For example, you could use linear regression to predict sales based on advertising spend.
- **Regression Intuition**: Linear regression algorithm finds the best fit line through the data by finding the line that minimizes the sum of squares of residuals.
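- **Formula**: The model predicts a weighted sum of the features:
    $$ h_\theta(x) = \theta_0 + \theta_1 x_1 + \dots + \theta_n x_n $$
- **Cost Function**: The cost function in linear regression is based on the Residual Sum of Squares (RSS). Dividing the RSS by $2m$ gives half the mean squared error, which can be written as: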
    $$ J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2 $$

- **How to code it**:


In [None]:

from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
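
Minimizing the squared-error cost has a direct least-squares solution, which makes it easy to check what `LinearRegression` computes. The sketch below uses hypothetical synthetic data (the true coefficients 3, 2 and -1 are arbitrary choices for illustration) and compares the fitted model with `np.linalg.lstsq`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical synthetic data: y = 3 + 2*x1 - 1*x2 + small noise
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

model = LinearRegression().fit(X, y)

# Same fit by ordinary least squares: a column of ones plays the role of the intercept
X_design = np.column_stack([np.ones(len(X)), X])
theta, *_ = np.linalg.lstsq(X_design, y, rcond=None)

# Cost J(theta) = RSS / (2m), evaluated at the fitted parameters
residuals = model.predict(X) - y
cost = np.sum(residuals ** 2) / (2 * len(y))

print("sklearn:", model.intercept_, model.coef_)
print("lstsq:  ", theta)
print("cost:   ", cost)
```

Both approaches recover values close to the coefficients used to generate the data, and the remaining cost reflects the noise that no line can explain.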



- **The most important hyperparameters to iterate through**: `fit_intercept` (Whether to calculate the intercept for this model), `normalize` (deprecated in scikit-learn 1.0 and removed in 1.2; scale the features beforehand instead, for example with `StandardScaler`)

- **Code for iterating through one example of a hyperparameter**: Linear regression does not typically require hyperparameter tuning.

- **Assumptions of the algorithm**: Linear regression assumes that there is a linear relationship between the dependent and independent variables, the residuals are normally distributed and have constant variance, and there is no multicollinearity among independent variables.


## Support Vector Machines (SVMs)

- **Intuition behind test**: SVM is a supervised machine learning algorithm that can be used for either classification or regression, although it is mostly used for classification. In the SVM algorithm, we plot each data item as a point in n-dimensional space (where n is the number of features), with the value of each feature being the value of a particular coordinate.

- **Use case for test**: SVMs are helpful in text and hypertext categorization, classification of images, and in the biological and other sciences.

- **Intuition for using it for classification**: SVMs are based on the idea of finding a hyperplane that best separates the features into different classes.

- **Intuition for using it for regression**: In regression (SVR), the SVM fits a function that deviates from most of the data points by no more than a set tolerance (epsilon). Points inside this tolerance incur no penalty, and for the remaining points the model tries to minimize the deviation.

- **How to code it**:


In [None]:

from sklearn import svm

model = svm.SVC()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
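
The regression intuition can be illustrated with scikit-learn's `SVR`. The following is a sketch on hypothetical synthetic data (a noisy sine curve); the values of `C` and `epsilon` are arbitrary choices for illustration. It measures how many training points end up within the tolerance `epsilon` of the fitted function.

```python
import numpy as np
from sklearn.svm import SVR

# Hypothetical synthetic data: noisy samples of sin(x)
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, size=(80, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=80)

epsilon = 0.2  # half-width of the tolerance tube around the prediction
model = SVR(kernel="rbf", C=10.0, epsilon=epsilon).fit(X, y)

abs_err = np.abs(model.predict(X) - y)
inside_tube = np.mean(abs_err <= epsilon)  # fraction of points with no penalty

print(f"points inside the epsilon tube: {inside_tube:.0%}")
print("number of support vectors:", len(model.support_))
```

Widening `epsilon` lets more points sit inside the tube, which usually leaves fewer support vectors and a smoother fit.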



- **The most important hyperparameters to iterate through**: `C` (Penalty parameter C of the error term), `kernel` (Specifies the kernel type to be used in the algorithm), `gamma` (Kernel coefficient for 'rbf', 'poly' and 'sigmoid')

- **Code for iterating through one example of a hyperparameter**:


In [None]:

from sklearn.model_selection import GridSearchCV

param_grid = {'C': [0.1, 1, 10, 100], 'gamma': [1, 0.1, 0.01, 0.001], 'kernel': ['rbf']} 
clf = GridSearchCV(svm.SVC(), param_grid)
clf.fit(X_train, y_train)



- **Assumptions of the algorithm**: SVMs assume that the data it works with is in a specific format. Namely, that all of the input features



## Naive Bayes

- **Intuition behind test**: Naive Bayes classifiers are a family of simple "probabilistic classifiers" based on applying Bayes' theorem with strong (naive) independence assumptions between the features.

- **Use case for test**: Naive Bayes classifiers are highly scalable, requiring a number of parameters linear in the number of variables (features/predictors) in a learning problem. They are often used for text classification, spam filtering, and recommendation systems.

- **Intuition for using it for classification**: Naive Bayes is a probabilistic classifier, meaning it predicts on the basis of the probability of an object. It uses Bayes' Theorem, which is based on the concept of conditional probability.

- **Intuition for using it for regression**: Naive Bayes is not typically used for regression tasks as it's a probabilistic classifier and works based on the assumption of independence among predictors.

- **How to code it**:


In [None]:

from sklearn.naive_bayes import GaussianNB

model = GaussianNB()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
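
Bayes' theorem with the independence assumption reduces to adding per-feature log-likelihoods to the log prior of each class. The sketch below is a simplified illustration, not scikit-learn's exact implementation: it omits the small variance smoothing that `GaussianNB` applies and uses a fixed `random_state`. Its predictions are compared with those of `GaussianNB`.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0
)

classes = np.unique(y_train)
log_posteriors = []
for c in classes:
    Xc = X_train[y_train == c]
    prior = len(Xc) / len(X_train)            # P(class)
    mean, var = Xc.mean(axis=0), Xc.var(axis=0)
    # Naive assumption: per-feature Gaussian log-likelihoods simply add up
    log_lik = -0.5 * np.sum(
        np.log(2 * np.pi * var) + (X_test - mean) ** 2 / var, axis=1
    )
    log_posteriors.append(np.log(prior) + log_lik)  # unnormalized log P(class | x)

manual_pred = classes[np.argmax(np.column_stack(log_posteriors), axis=1)]
sklearn_pred = GaussianNB().fit(X_train, y_train).predict(X_test)

agreement = np.mean(manual_pred == sklearn_pred)
print(f"agreement with GaussianNB: {agreement:.2%}")
```

The normalizing constant of Bayes' theorem is the same for every class, so it can be dropped when only the most probable class is needed.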



- **The most important hyperparameters to iterate through**: Naive Bayes has few hyperparameters that need tuning. `GaussianNB` exposes `var_smoothing`, and implementations such as `BernoulliNB` and `MultinomialNB` have an `alpha` parameter, which is a smoothing parameter.

- **Code for iterating through one example of a hyperparameter**: Naive Bayes typically doesn't require hyperparameter tuning.

- **Assumptions of the algorithm**: Naive Bayes assumes that all features are independent of each other given the class, so each one contributes independently to the probability of the outcome. This is a 'naive' assumption because it's rarely true in real-world scenarios.


## Linear Regression
- **Intuition**: Linear regression is a statistical model that examines the linear relationship between two (simple linear regression) or more (multiple linear regression) variables: a dependent variable and one or more independent variables. A linear relationship means that a one-unit change in an independent variable shifts the dependent variable by a constant amount, upward if the coefficient is positive and downward if it is negative.
- **Use case**: Linear regression is used when we want to predict the value of a variable based on the value of another variable. For example, you could use linear regression to predict sales based on advertising spend.
- **Regression Intuition**: Linear regression algorithm finds the best fit line through the data by finding the line that minimizes the sum of squares of residuals.
- **Probability Formula**: Not applicable for Linear Regression.
- **Cost Function**: The cost function in linear regression is based on the Residual Sum of Squares (RSS). Dividing the RSS by $2m$ gives half the mean squared error, which can be written as:
    $$ J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2 $$
- **Code Example**:
    ```python
    from sklearn.linear_model import LinearRegression
    model = LinearRegression()
    model.fit(X_train, y_train)
    ```
- **Important Hyperparameters**: 
    - `fit_intercept`: Whether to calculate the intercept for this model. If set to False, no intercept will be used in calculations.
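    - `normalize`: Deprecated in scikit-learn 1.0 and removed in 1.2. It used to normalize the regressors X before regression by subtracting the mean and dividing by the l2-norm, and was ignored when `fit_intercept` was False. Scale the features beforehand instead, for example with `StandardScaler`.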
- **Code for Hyperparameter Tuning**: Linear Regression does not have hyperparameters that we can tune to improve the performance of the model. The model learns the parameters from the data.
- **Assumptions**:
    - Linearity: The relationship between X and the mean of Y is linear.
    - Homoscedasticity: The variance of residual is the same for any value of X.
    - Independence: Observations are independent of each other.
    - Normality: For any fixed value of X, Y is normally distributed.
- **Interpretation of Coefficients**: The sign of a regression coefficient tells you whether the relationship between each independent variable and the dependent variable is positive or negative. A positive coefficient indicates that as the value of the independent variable increases, the mean of the dependent variable also tends to increase. A negative coefficient suggests that as the independent variable increases, the dependent variable tends to decrease. The coefficient value signifies how much the mean of the dependent variable changes given a one-unit shift in the independent variable while holding other variables in the model constant.

## Support Vector Machines (SVM)
- **Intuition**: Support Vector Machines are a set of supervised learning methods used for classification, regression, and outlier detection. The core idea of SVM is simple: the algorithm finds a line or hyperplane that separates the data into classes.
- **Use case**: SVM is used in a variety of applications such as face detection, intrusion detection, classification of emails, and handwriting recognition.
- **Classification Intuition**: SVM can efficiently perform non-linear classification using the kernel trick, which implicitly maps the inputs into a high-dimensional feature space. When the data are not linearly separable in the original space, SVM separates them in this higher-dimensional space, which plain logistic regression on the raw features cannot do.
- **Regression Intuition**: The same kernel trick lets SVMs perform non-linear regression as well as linear regression, again by implicitly mapping the inputs into a high-dimensional feature space.
- **Probability Formula**: SVM doesn't directly provide probability estimates. In scikit-learn's `SVC` they are computed only when `probability=True`, and are then calculated using an expensive five-fold cross-validation.
- **Cost Function**: SVM aims to minimize the structural risk, defined as the sum of a term that penalizes model complexity (the squared norm of the weights, where a larger norm means a narrower margin) and the training error (the slack variables $\xi_i$, weighted by `C`). The cost function can be written as:
    $$ \min_{w,b,\xi} \frac{1}{2}w^Tw + C\sum_{i=1}^{n}\xi_i $$
    subject to the constraints:
    $$ y_i(w^Tx_i - b) \geq 1 - \xi_i, \xi_i \geq 0 $$
- **Code Example**:
  


In [None]:
from sklearn import svm
model = svm.SVC()
model.fit(X_train, y_train)
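
The cost function above can be evaluated for a fitted model when the kernel is linear. The sketch below makes several assumptions for illustration: it uses a binary subset of iris (versicolor vs. virginica) with labels recoded to -1 and +1, and a linear kernel so that `coef_` is available. scikit-learn writes the decision function as $w^Tx + b$, so its `intercept_` corresponds to $-b$ in the constraint above.

```python
import numpy as np
from sklearn import datasets, svm

iris = datasets.load_iris()

# Assumption for illustration: two classes only, labels recoded to -1 / +1
mask = iris.target > 0
X = iris.data[mask]
y = np.where(iris.target[mask] == 2, 1, -1)

C = 1.0
model = svm.SVC(kernel="linear", C=C).fit(X, y)

w = model.coef_[0]                          # weight vector (linear kernel only)
margins = y * model.decision_function(X)    # y_i * f(x_i)
xi = np.maximum(0.0, 1.0 - margins)         # slack: how far each point violates the margin

objective = 0.5 * w @ w + C * xi.sum()      # value of the primal cost function

print("objective:", objective)
print("points with nonzero slack:", int(np.sum(xi > 1e-2)))
print("support vectors:", len(model.support_))
```

Every point that violates the margin is a support vector, while points comfortably on the correct side have zero slack and do not influence the solution.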



- **Important Hyperparameters**: 
    - `C`: Regularization parameter. The strength of the regularization is inversely proportional to C. Must be strictly positive.
    - `kernel`: Specifies the kernel type to be used in the algorithm. It could be 'linear', 'poly', 'rbf', 'sigmoid', 'precomputed' or a callable.
    - `degree`: Degree of the polynomial kernel function ('poly'). Ignored by all other kernels.
    - `gamma`: Kernel coefficient for 'rbf', 'poly' and 'sigmoid'.
- **Code for Hyperparameter Tuning**:


In [None]:
from sklearn.model_selection import GridSearchCV
param_grid = {'C': [0.1, 1, 10, 100], 'gamma': [1, 0.1, 0.01, 0.001], 'kernel': ['rbf', 'poly', 'sigmoid']}
grid = GridSearchCV(svm.SVC(), param_grid, refit=True, verbose=2)
grid.fit(X_train, y_train)



- **Assumptions**:
    - SVM assumes that the data it works with is in a specific format. Namely, a matrix where rows represent the samples and columns represent the attributes of the samples.
- **Interpretation of Coefficients**: The coefficients in SVM are not easily interpretable. They do not represent the change in the output variable for a unit change in an input variable, as in linear regression. With a linear kernel they are the weights that define the separating hyperplane; with non-linear kernels there are no per-feature coefficients at all (scikit-learn exposes `coef_` only for the linear kernel).