## Machine Learning: Detailed Analysis of Concepts and Algorithms

In [99]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

### Linear Regression

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.datasets import load_boston  # removed in scikit-learn 1.2; this cell needs an older release

In [None]:
boston = load_boston()
features = boston.data[:, 0:2]
target = boston.target

In [None]:
features.shape, target.shape

In [None]:
lr = LinearRegression()
model = lr.fit(features, target)
model

### Reducing Variance and Regularization

Problem
`You want to reduce the variance of your linear regression model`

Solution
`Use a learning algorithm that includes a shrinkage penalty (also called regularization) like ridge regression and lasso regression:`

In [14]:
from sklearn.linear_model import Ridge
from sklearn.datasets import load_boston
from sklearn.preprocessing import StandardScaler

In [15]:
boston = load_boston()
features = boston.data[:, 0:2]
target = boston.target


In [16]:
features[:5]

array([[6.320e-03, 1.800e+01],
       [2.731e-02, 0.000e+00],
       [2.729e-02, 0.000e+00],
       [3.237e-02, 0.000e+00],
       [6.905e-02, 0.000e+00]])

In [17]:
target[:5]

array([24. , 21.6, 34.7, 33.4, 36.2])

In [18]:
scaler = StandardScaler()
scaled = scaler.fit_transform(features)
scaled

array([[-0.41978194,  0.28482986],
       [-0.41733926, -0.48772236],
       [-0.41734159, -0.48772236],
       ...,
       [-0.41344658, -0.48772236],
       [-0.40776407, -0.48772236],
       [-0.41500016, -0.48772236]])

In [19]:
reg_ridge = Ridge(alpha = 0.5)
model_ridge = reg_ridge.fit(scaled, target)
model_ridge

Ridge(alpha=0.5)

- In standard linear regression, the model trains to minimize the sum of squared errors between the true ($y_i$) and predicted ($\hat{y}_i$) target values, or residual sum of squares (RSS):
$$
RSS = \sum_{i=1}^{n}{(y_i - \hat{y}_i)^2}
$$

- Regularized regression learners are similar, except they attempt to minimize RSS and some penalty for the total size of the coefficient values, called a shrinkage penalty because it attempts to "shrink" the model. 

- There are two common types of regularized learners for linear regression: ridge regression and the lasso. 
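To make the objectives in the list above concrete, here is a minimal sketch in plain Python. The toy data, the coefficient values, and the helper names (`predict`, `rss`, `l2_penalty`, `l1_penalty`) are assumptions made up for illustration and are not part of scikit-learn. Note also that scikit-learn's `Lasso` rescales the RSS term by `1 / (2 * n_samples)`, which is left out here to keep the arithmetic easy to follow.

```python
# Toy data (made up): three observations with two features each.
X = [[1.0, 2.0], [2.0, 0.0], [3.0, 1.0]]
y = [5.0, 4.0, 8.0]

def predict(X, coefs, intercept=0.0):
    """Linear prediction: intercept + sum of coefficient * feature."""
    return [intercept + sum(b * x for b, x in zip(coefs, row)) for row in X]

def rss(y_true, y_pred):
    """Residual sum of squares."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred))

def l2_penalty(coefs, alpha):
    """Ridge-style shrinkage penalty: alpha * sum of squared coefficients."""
    return alpha * sum(b ** 2 for b in coefs)

def l1_penalty(coefs, alpha):
    """Lasso-style shrinkage penalty: alpha * sum of absolute coefficients."""
    return alpha * sum(abs(b) for b in coefs)

coefs = [2.0, 1.0]   # hypothetical coefficient values
alpha = 0.5

y_hat = predict(X, coefs)                            # [4.0, 4.0, 7.0]
plain_loss = rss(y, y_hat)                           # 1 + 0 + 1 = 2.0
ridge_loss = plain_loss + l2_penalty(coefs, alpha)   # 2.0 + 0.5 * 5 = 4.5
lasso_loss = plain_loss + l1_penalty(coefs, alpha)   # 2.0 + 0.5 * 3 = 3.5
```

Both penalized losses grow with the size of the coefficients, which is why minimizing them pushes the fit toward smaller coefficient values.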

### So which one should we use?

As a very general rule of thumb, `ridge regression often produces slightly better predictions than lasso, but lasso produces more interpretable models.` If we want a `balance between ridge and lasso's penalty functions, we can use elastic net, which is simply a regression model with both penalties included.` Regardless of which one we use, both ridge and lasso regressions can penalize large or complex models by including coefficient values in the loss function we are trying to minimize.

`The hyperparameter 𝛼 lets us control how much we penalize the coefficients, with higher values of 𝛼 creating simpler models. The ideal value of 𝛼 should be tuned like any other hyperparameter. In scikit-learn, 𝛼 is set using the alpha parameter.`
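The effect of 𝛼 can be seen directly in the simplest possible case: one feature and no intercept, where the ridge objective $\sum_i (y_i - \beta x_i)^2 + \alpha \beta^2$ has the closed-form minimizer $\beta = \sum_i x_i y_i / (\sum_i x_i^2 + \alpha)$. The sketch below uses made-up data and a hypothetical helper, `ridge_slope`, to show the slope shrinking as 𝛼 grows.

```python
# Toy single-feature data with no intercept (made up for illustration).
x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]   # exactly y = 2x, so the unpenalized slope is 2.0

def ridge_slope(x, y, alpha):
    """Minimizer of sum((y_i - b * x_i)^2) + alpha * b^2 for one feature."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / (sxx + alpha)

# Larger alpha means a stronger penalty and a smaller (simpler) coefficient.
slopes = {alpha: ridge_slope(x, y, alpha) for alpha in (0.0, 1.0, 14.0, 100.0)}
for alpha, slope in slopes.items():
    print(f"alpha={alpha:6.1f}  slope={slope:.4f}")
```

With 𝛼 = 0 the ordinary least-squares slope is recovered; as 𝛼 increases, the slope moves toward zero but never reaches it exactly, which is the characteristic behavior of the ridge penalty.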

### Selecting best Alpha parameter

#### `scikit-learn includes a RidgeCV class that allows us to select the ideal value for 𝛼:`

In [20]:
from sklearn.linear_model import RidgeCV
reg_cv = RidgeCV([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0])
model_cv = reg_cv.fit(scaled, target)
model_cv

RidgeCV(alphas=array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ]))

In [21]:
model_cv.alpha_

1.0
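`RidgeCV` scores each candidate 𝛼 by cross-validation (by default, an efficient form of leave-one-out) and keeps the best one. The brute-force sketch below mimics that idea for a single feature with no intercept. The data and the helper names (`ridge_slope`, `loo_mse`) are assumptions for illustration only, and scikit-learn's actual implementation is far more efficient than this loop.

```python
# Made-up noisy data, roughly y = 2x.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
alphas = [0.1, 1.0, 10.0]   # candidate regularization strengths

def ridge_slope(x, y, alpha):
    """Closed-form ridge slope for one feature, no intercept."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / (sxx + alpha)

def loo_mse(x, y, alpha):
    """Leave-one-out mean squared error for a given alpha."""
    errors = []
    for i in range(len(x)):
        # Fit on every observation except i, then predict observation i.
        x_train = x[:i] + x[i + 1:]
        y_train = y[:i] + y[i + 1:]
        slope = ridge_slope(x_train, y_train, alpha)
        errors.append((y[i] - slope * x[i]) ** 2)
    return sum(errors) / len(errors)

cv_errors = {alpha: loo_mse(x, y, alpha) for alpha in alphas}
best_alpha = min(cv_errors, key=cv_errors.get)
print(cv_errors, best_alpha)
```

When the selected value sits at the edge of the candidate grid, as 1.0 does in the cell above, it is often worth extending the grid in that direction and refitting.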

`Because in linear regression the value of the coefficients is partially determined by the scale of the feature, and in regularized models all coefficients are summed together, we must make sure to standardize the features prior to training.`
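As a rough illustration of what standardization does, the sketch below rescales two made-up features that live on very different scales. The helper `standardize` is hypothetical; it divides by the population standard deviation (dividing by n), which is also what `StandardScaler` uses, and it assumes no column is constant.

```python
import math

# Two made-up features on very different scales.
rows = [[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]]

def standardize(rows):
    """Rescale each column to mean 0 and (population) standard deviation 1."""
    n = len(rows)
    columns = list(zip(*rows))
    means = [sum(col) / n for col in columns]
    # Population standard deviation: divide by n, not n - 1.
    stds = [
        math.sqrt(sum((v - m) ** 2 for v in col) / n)
        for col, m in zip(columns, means)
    ]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

scaled_rows = standardize(rows)
for row in scaled_rows:
    print(row)
```

After scaling, both columns take the same values, so a penalty on coefficient size treats the two features on equal terms instead of favoring the one measured in larger units.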

### Reducing Features with Lasso Regression

Problem
`You want to simplify your linear regression model by reducing the number of features.`

Solution
`Use a lasso regression`

In [30]:
from sklearn.linear_model import Lasso
from sklearn.datasets import load_boston
from sklearn.preprocessing import StandardScaler

In [31]:
boston = load_boston()
features = boston.data
target = boston.target

In [32]:
scale = StandardScaler()
scaled = scale.fit_transform(features)
scaled

array([[-0.41978194,  0.28482986, -1.2879095 , ..., -1.45900038,
         0.44105193, -1.0755623 ],
       [-0.41733926, -0.48772236, -0.59338101, ..., -0.30309415,
         0.44105193, -0.49243937],
       [-0.41734159, -0.48772236, -0.59338101, ..., -0.30309415,
         0.39642699, -1.2087274 ],
       ...,
       [-0.41344658, -0.48772236,  0.11573841, ...,  1.17646583,
         0.44105193, -0.98304761],
       [-0.40776407, -0.48772236,  0.11573841, ...,  1.17646583,
         0.4032249 , -0.86530163],
       [-0.41500016, -0.48772236,  0.11573841, ...,  1.17646583,
         0.44105193, -0.66905833]])

In [33]:
reg_lasso = Lasso(alpha = 0.5)
model_lasso = reg_lasso.fit(scaled, target)
model_lasso

Lasso(alpha=0.5)

`One interesting characteristic of lasso regression's penalty is that it can shrink the coefficients of a model to zero, effectively reducing the number of features in the model. For example, in our solution we set alpha to 0.5 and we can see that many of the coefficients are 0, meaning their corresponding features are not used in the model:`

In [34]:
model_lasso.coef_

array([-0.11526463,  0.        , -0.        ,  0.39707879, -0.        ,
        2.97425861, -0.        , -0.17056942, -0.        , -0.        ,
       -1.59844856,  0.54313871, -3.66614361])

`However, if we increase 𝛼 to a much higher value, we see that literally none of the features are being used:`

In [35]:
reg_lasso_10 = Lasso(alpha = 10)
model_lasso_10 = reg_lasso_10.fit(scaled, target)
model_lasso_10.coef_

array([-0.,  0., -0.,  0., -0.,  0., -0.,  0., -0., -0., -0.,  0., -0.])

`The practical benefit of this effect is that we could include 100 features in our feature matrix and then, by adjusting lasso's 𝛼 hyperparameter, produce a model that uses only 10 (for instance) of the most important features. This lets us reduce variance while improving the interpretability of our model (since fewer features are easier to explain).`
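Why lasso produces exact zeros can be sketched with the soft-thresholding operator. Under the assumption of a single standardized feature (or several uncorrelated standardized features) and scikit-learn's scaling of the squared-error term by `1 / (2 * n_samples)`, the lasso coefficient is the unpenalized coefficient moved toward zero by 𝛼 and clipped at zero. The coefficient values below are hypothetical; with correlated features, as in real data, the fit is iterative (coordinate descent), but each update applies this same operator.

```python
def soft_threshold(rho, alpha):
    """Shrink rho toward zero by alpha; return exactly 0.0 when |rho| <= alpha."""
    if rho > alpha:
        return rho - alpha
    if rho < -alpha:
        return rho + alpha
    return 0.0

# Hypothetical unpenalized coefficients for three standardized features.
unpenalized = [3.0, -0.4, 1.0]

# A moderate penalty zeroes out only the weakest feature.
lasso_half = [soft_threshold(r, 0.5) for r in unpenalized]   # [2.5, 0.0, 0.5]

# A penalty larger than every |coefficient| zeroes out all of them.
lasso_ten = [soft_threshold(r, 10.0) for r in unpenalized]   # [0.0, 0.0, 0.0]
```

This mirrors the two fits above: a moderate 𝛼 removes some features, while a very large 𝛼 removes all of them.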

## Logistic Regression

`Despite being called a regression, logistic regression is actually a widely used supervised classification technique. It allows us to predict the probability that an observation belongs to a certain class.`

### Binary Classifier

In [48]:
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

In [50]:
iris = load_iris()
features = iris.data[:100, :]
target = iris.target[:100]

scale = StandardScaler()
scaled = scale.fit_transform(features)
scaled

array([[-5.81065904e-01,  8.41837140e-01, -1.01297765e+00,
        -1.04211089e+00],
       [-8.94308978e-01, -2.07835104e-01, -1.01297765e+00,
        -1.04211089e+00],
       [-1.20755205e+00,  2.12033793e-01, -1.08231219e+00,
        -1.04211089e+00],
       [-1.36417359e+00,  2.09934449e-03, -9.43643106e-01,
        -1.04211089e+00],
       [-7.37687441e-01,  1.05177159e+00, -1.01297765e+00,
        -1.04211089e+00],
       [-1.11201292e-01,  1.68157493e+00, -8.04974023e-01,
        -6.86441647e-01],
       [-1.36417359e+00,  6.31902691e-01, -1.01297765e+00,
        -8.64276271e-01],
       [-7.37687441e-01,  6.31902691e-01, -9.43643106e-01,
        -1.04211089e+00],
       [-1.67741667e+00, -4.17769553e-01, -1.01297765e+00,
        -1.04211089e+00],
       [-8.94308978e-01,  2.09934449e-03, -9.43643106e-01,
        -1.21994552e+00],
       [-1.11201292e-01,  1.26170604e+00, -9.43643106e-01,
        -1.04211089e+00],
       ...

In [55]:
target

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

In [51]:
lr = LogisticRegression()
model = lr.fit(scaled, target)
model

LogisticRegression()

In [52]:
new_observation = [[0.5, 0.4, 0.2, 0.7]]


In [53]:
pred = model.predict(new_observation)
pred

array([1])

In [54]:
prob = model.predict_proba(new_observation)
prob

array([[0.18259902, 0.81740098]])

Despite having "regression" in its name, logistic regression is actually a widely used binary classifier (i.e., the target vector can only take two values). In a logistic regression, a linear model (e.g., $\beta_0 + \beta_1 x$) is included in a logistic (also called sigmoid) function, $\frac{1}{1+e^{-z}}$, such that:
$$
P(y_i = 1 | X) = \frac{1}{1+e^{-(\beta_0 + \beta_1x)}}
$$
where $P(y_i = 1 | X)$ is the probability of the ith observation's target value, $y_i$, being class 1, X is the training data, $\beta_0$ and $\beta_1$ are the parameters to be learned, and e is Euler's number. The effect of the logistic function is to constrain the value of the function's output to between 0 and 1 so that it can be interpreted as a probability. If $P(y_i = 1 | X)$ is greater than 0.5, class 1 is predicted; otherwise, class 0 is predicted.
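The formula and the 0.5 decision rule above can be sketched in a few lines of plain Python. The parameter values `beta_0` and `beta_1` are hypothetical and chosen only to make the arithmetic easy to check; `predict_class` is an illustrative helper, not a scikit-learn function.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_class(x, beta_0, beta_1, threshold=0.5):
    """Return (predicted class, P(y = 1)) for a single feature value x."""
    p = sigmoid(beta_0 + beta_1 * x)
    return (1 if p > threshold else 0), p

# Hypothetical learned parameters.
beta_0, beta_1 = -1.0, 2.0

label_a, prob_a = predict_class(0.5, beta_0, beta_1)   # z = 0 -> p = 0.5 -> class 0
label_b, prob_b = predict_class(2.0, beta_0, beta_1)   # z = 3 -> p is about 0.953 -> class 1
```

A probability of exactly 0.5 is not greater than the threshold, so the first observation falls into class 0, matching the rule stated above.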

### Training a Multiclass Classifier

In [56]:
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

In [64]:
iris = load_iris()
features = iris.data
target = iris.target


In [65]:
scale = StandardScaler()
scaled = scale.fit_transform(features)

In [66]:
# OVR one-vs-rest Logistic Regression
ovr = LogisticRegression(random_state = 42, multi_class = 'ovr')
model = ovr.fit(scaled, target)
model


LogisticRegression(multi_class='ovr', random_state=42)

On their own, logistic regressions are only binary classifiers, meaning they cannot handle target vectors with more than two classes. However, two clever extensions to logistic regression do just that. 

`First, in one-vs-rest logistic regression (OVR)` a separate model is trained for each class to predict whether an observation is that class or not (thus making it a binary classification problem). It assumes that each classification problem (e.g., class 0 or not) is independent.
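The one-vs-rest idea can be sketched as follows: each class gets its own binary score, each score is passed through the sigmoid independently, and the results are rescaled so they sum to 1. The decision scores here are hypothetical, and the final normalization step is an assumption about how one-vs-rest probabilities are usually reported; check the documentation of your scikit-learn version for the exact behavior.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical decision scores from three independent binary models,
# one per class ("class k or not").
scores = {0: -3.0, 1: -0.5, 2: 0.5}

# Each binary model produces its own probability; these need not sum to 1.
raw = {k: sigmoid(z) for k, z in scores.items()}

# Rescale so the per-class probabilities sum to 1 (assumed normalization).
total = sum(raw.values())
ovr_probs = {k: p / total for k, p in raw.items()}

predicted = max(ovr_probs, key=ovr_probs.get)
print(ovr_probs, predicted)
```

Because the three binary problems are fit separately, the raw probabilities are not guaranteed to be mutually consistent, which is one reason the normalized values can be less trustworthy than those of a jointly fitted model.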

In [67]:
new_observation = [[0.7, 0.5, 0.2, 0.7]]


In [68]:
pred = model.predict(new_observation)
pred

array([2])

In [69]:
prob = model.predict_proba(new_observation)
prob

array([[0.04538717, 0.34272702, 0.61188581]])

In [70]:
#Multinomial Logistic regression
#logistic_regression_MNL = LogisticRegression(random_state=0, multi_class="multinomial")

`Alternatively, in multinomial logistic regression (MLR)` the logistic function is replaced with a softmax function:
$$
P(y_i = k | X) = \frac{e^{\beta_k x_i}}{\sum_{j=1}^{K}{e^{\beta_j x_i}}}
$$
where $P(y_i = k | X)$ is the probability of the ith observation's target value, $y_i$, being class k, and K is the total number of classes. One practical advantage of MLR is that the probabilities returned by its `predict_proba` method tend to be more reliable (i.e., better calibrated).
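A minimal sketch of the softmax function follows, using hypothetical per-class scores in place of the $\beta_k x_i$ terms. Subtracting the maximum score before exponentiating is a common numerical-stability trick and does not change the result.

```python
import math

def softmax(scores):
    """Convert a list of real-valued class scores into probabilities."""
    shift = max(scores)                       # for numerical stability only
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical linear scores for K = 3 classes.
class_scores = [1.0, 2.0, 3.0]
probs = softmax(class_scores)
predicted_class = probs.index(max(probs))
print(probs, predicted_class)
```

Unlike the one-vs-rest approach, the probabilities here come from a single joint model, so they sum to 1 by construction.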

We can switch to an MLR by setting `multi_class='multinomial'`:

In [71]:
mlr = LogisticRegression(random_state = 42, multi_class = 'multinomial')
model = mlr.fit(scaled, target)
model

LogisticRegression(multi_class='multinomial', random_state=42)

In [72]:
new_observation = [[0.7, 0.5, 0.2, 0.7]]

In [73]:
pred = model.predict(new_observation)
pred

array([1])

In [74]:
prob = model.predict_proba(new_observation)
prob

array([[0.01923294, 0.76695828, 0.21380879]])

### Reducing Variance Through Regularization

In [76]:
from sklearn.linear_model import LogisticRegressionCV
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

In [77]:
iris = load_iris()
features = iris.data
target = iris.target


In [78]:
scale = StandardScaler()
scaled = scale.fit_transform(features)


In [79]:
lr_cv = LogisticRegressionCV(penalty = 'l2',
                            Cs = 10,
                            random_state = 42,
                            n_jobs = -1)
model = lr_cv.fit(scaled, target)
model

LogisticRegressionCV(n_jobs=-1, random_state=42)

In [87]:
new_observation = [[0.7, 0.5, 0.2, 0.7]]

In [88]:
model.predict(new_observation)

array([1])

In [90]:
model.predict_proba(new_observation)

array([[3.86627251e-04, 9.91050788e-01, 8.56258465e-03]])

In [91]:
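# The top predicted probability is much higher here than in the earlier models.
# Those models already used an L2 penalty (scikit-learn's default, with C=1.0),
# so the change most likely reflects the value of C chosen by cross-validation.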

Regularization is a method of penalizing complex models to reduce their variance. Specifically, a penalty term is added to the loss function we are trying to minimize, typically the L1 or L2 penalty.

In the L1 penalty:
$$
\alpha \sum_{j=1}^{p}{|\hat\beta_j|}
$$
where $\hat\beta_j$ is the parameter of the jth of p features being learned and $\alpha$ is a hyperparameter denoting the regularization strength.

With the L2 penalty:
$$
\alpha \sum_{j=1}^{p}{\hat\beta_j^2}
$$
Higher values of $\alpha$ increase the penalty for larger parameter values (i.e., more complex models). scikit-learn follows the common method of using C instead of $\alpha$, where C is the inverse of the regularization strength: $C = \frac{1}{\alpha}$. To reduce variance while using logistic regression, we can treat C as a hyperparameter to be tuned to find the value of C that creates the best model. In scikit-learn we can use the `LogisticRegressionCV` class to efficiently tune C.
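When `Cs` is given as an integer, as in `Cs = 10` above, scikit-learn documents that `LogisticRegressionCV` builds a grid of that many values on a logarithmic scale between 1e-4 and 1e4. The sketch below reproduces such a grid in plain Python and converts each C back to the corresponding $\alpha$; the helper `log_spaced_grid` is hypothetical.

```python
def log_spaced_grid(low_exp=-4.0, high_exp=4.0, num=10):
    """Return `num` values evenly spaced on a log10 scale."""
    step = (high_exp - low_exp) / (num - 1)
    return [10.0 ** (low_exp + i * step) for i in range(num)]

Cs = log_spaced_grid()              # candidate values of C, small to large
alphas = [1.0 / C for C in Cs]      # equivalent regularization strengths

for C, alpha in zip(Cs, alphas):
    print(f"C={C:12.4f}  alpha={alpha:12.4f}")
```

Small C corresponds to large $\alpha$ (strong regularization), and large C corresponds to small $\alpha$ (weak regularization), so the two lists run in opposite directions.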

### Training a Classifier on Very Large Data

In [93]:
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

iris = load_iris()
features = iris.data
target = iris.target

scaler = StandardScaler()
features_standardized = scaler.fit_transform(features)

logistic_regression = LogisticRegression(random_state = 42, solver="sag") # stochastic average gradient (SAG) solver
model = logistic_regression.fit(features_standardized, target)
model

LogisticRegression(random_state=42, solver='sag')

scikit-learn's `LogisticRegression` offers a number of techniques for training a logistic regression, called solvers. Most of the time scikit-learn will select the best solver automatically for us or warn us we cannot do something with that solver.

`Stochastic average gradient descent` allows us to train a model much faster than other solvers when our data is very large. However, it is also very sensitive to feature scaling, so standardizing our features is particularly important.

### Handling Imbalanced Classes

In [94]:
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

iris = load_iris()
features = iris.data[40:, :]
target = iris.target[40:]

In [96]:
target = np.where((target == 0), 0 ,1)


In [97]:
scale = StandardScaler()
scaled = scale.fit_transform(features)


In [98]:
logistic_regression = LogisticRegression(random_state = 42, class_weight = "balanced")
model = logistic_regression.fit(scaled, target)
model

LogisticRegression(class_weight='balanced', random_state=42)

`LogisticRegression` comes with a built-in method of handling imbalanced classes.
`class_weight="balanced"` will automatically weight classes inversely proportional to their frequency:
$$
w_j = \frac{n}{kn_j}
$$
where $w_j$ is the weight for class j, n is the number of observations, $n_j$ is the number of observations in class j, and k is the total number of classes.
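Applying the formula to the target vector built in this section gives a feel for the numbers. Dropping the first 40 iris observations leaves 10 observations of class 0 and 100 of the merged class 1. The helper `balanced_weights` is a hypothetical re-implementation of the formula above, not a scikit-learn function.

```python
from collections import Counter

# Labels as constructed in the cells above: iris observations 40-149,
# with class 0 kept as 0 and classes 1 and 2 merged into 1.
target = [0] * 10 + [1] * 100

def balanced_weights(labels):
    """Compute w_j = n / (k * n_j) for every class j."""
    counts = Counter(labels)
    n = len(labels)          # number of observations
    k = len(counts)          # number of classes
    return {cls: n / (k * n_j) for cls, n_j in counts.items()}

weights = balanced_weights(target)
print(weights)   # class 0 gets weight 5.5, class 1 gets weight 0.55
```

Each observation of the rare class therefore counts ten times as much as an observation of the common class, so the two classes contribute equally to the loss overall.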

## K-Nearest Neighbors

`An observation is predicted to belong to the class that makes up the largest proportion of its k nearest observations.`
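The rule above can be sketched as a brute-force classifier in plain Python. The training points, labels, and the helper `knn_predict` are made up for illustration; scikit-learn's `KNeighborsClassifier` uses more efficient neighbor searches.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k):
    """Predict the majority class among the k training points nearest to query."""
    # Sort training points by Euclidean distance to the query.
    by_distance = sorted(
        (math.dist(row, query), label) for row, label in zip(train_X, train_y)
    )
    # Count the class labels of the k closest points and take the most common.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Two well-separated toy clusters (made up).
train_X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]]
train_y = [0, 0, 0, 1, 1, 1]

pred_near_origin = knn_predict(train_X, train_y, [0.5, 0.5], k=3)
pred_near_five = knn_predict(train_X, train_y, [5.0, 5.5], k=3)
print(pred_near_origin, pred_near_five)
```

Using an odd k with two classes avoids tied votes; with more classes, ties are possible and need an explicit tie-breaking rule.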

### Finding an Observation's Nearest Neighbors

In [100]:
from sklearn.neighbors import NearestNeighbors
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

In [101]:
iris = load_iris()
features = iris.data

In [102]:
scale = StandardScaler()
scaled = scale.fit_transform(features)

In [103]:
nearest_neighbors = NearestNeighbors(n_neighbors = 2).fit(scaled)
nearest_neighbors

NearestNeighbors(n_neighbors=2)

In [105]:
new_observation = [[1, 1, 1, 1]]

In [106]:
distances, indices = nearest_neighbors.kneighbors(new_observation)
(distances, indices)

(array([[0.49140089, 0.74294782]]), array([[124, 110]], dtype=int64))

In [107]:
distances

array([[0.49140089, 0.74294782]])

In [108]:
indices

array([[124, 110]], dtype=int64)

In [109]:
scaled[indices]

array([[[1.03800476, 0.55861082, 1.10378283, 1.18556721],
        [0.79566902, 0.32841405, 0.76275827, 1.05393502]]])

How do we measure distance?

* Euclidean
$$
d_{euclidean} = \sqrt{\sum_{i=1}^{n}{(x_i - y_i)^2}}
$$

* Manhattan
$$
d_{manhattan} = \sum_{i=1}^{n}{|x_i - y_i|}
$$

* Minkowski (the scikit-learn default, with p = 2, which equals Euclidean distance)
$$
d_{minkowski} = (\sum_{i=1}^{n}{|x_i - y_i|^p})^{\frac{1}{p}}
$$
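The three formulas above are closely related: Minkowski distance with p = 1 is Manhattan distance, and with p = 2 it is Euclidean distance. The sketch below implements them in plain Python (the function names are illustrative) and recomputes the distance between the new observation and its nearest neighbor, `scaled[124]`, using the values printed earlier in this section.

```python
def minkowski(a, b, p):
    """Minkowski distance of order p between two equal-length vectors."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

def manhattan(a, b):
    return minkowski(a, b, 1)

def euclidean(a, b):
    return minkowski(a, b, 2)

new_observation = [1.0, 1.0, 1.0, 1.0]
# Values of scaled[124], copied from the output shown above.
nearest = [1.03800476, 0.55861082, 1.10378283, 1.18556721]

d_euclidean = euclidean(new_observation, nearest)   # about 0.4914
d_manhattan = manhattan(new_observation, nearest)   # about 0.7687
print(d_euclidean, d_manhattan)
```

The Euclidean result agrees with the first distance returned by `kneighbors` above, which is consistent with the default Minkowski metric being evaluated at p = 2.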

### Creating a K-Nearest Neighbor Classifier

In [111]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

iris = load_iris()
features = iris.data
target = iris.target