# Classification Homework

This notebook contains exercises that give you more experience with the classification techniques we've introduced and present material we didn't have time to cover in class.

In [None]:
## For data handling
import pandas as pd
import numpy as np

## For plotting
import matplotlib.pyplot as plt
import seaborn as sns

## This sets the plot style
## to have a grid on a white background
sns.set_style("whitegrid")

## Additional Techniques in Train Test Splits

Work through the Oversampling and Undersampling notebook in the Classification Folder from Lectures.

## Other KNN Measures

KNN can use other distance measures. Read through the `sklearn` documentation to see what the `p` and `metric` arguments do, <a href="https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html">https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html</a>. Then look at the Wikipedia entry for Minkowski distance, <a href="https://en.wikipedia.org/wiki/Minkowski_distance">https://en.wikipedia.org/wiki/Minkowski_distance</a>.

In "The distance function effect on k-nearest neighbor classification for medical datasets", Hu et al. demonstrate that the Minkowski distance can lead to poor performance on various medical data sets, <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4978658/">https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4978658/</a>.

See how performance is affected on the iris data set when you use different distance metrics.
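As a warm-up before the exercise, here is a minimal sketch of what the `p` argument controls. The helper `minkowski_distance` and the two points are made up for illustration; only `KNeighborsClassifier` and its `n_neighbors`, `metric`, and `p` arguments come from `sklearn`.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier


def minkowski_distance(a, b, p):
    """Minkowski distance of order p between two 1-D arrays (illustrative helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return np.sum(np.abs(a - b) ** p) ** (1 / p)


# Two made-up points
a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])

# p = 1 is the Manhattan distance, p = 2 is the Euclidean distance
d_manhattan = minkowski_distance(a, b, p=1)
d_euclidean = minkowski_distance(a, b, p=2)

# The same choice is passed to sklearn through the p argument
knn_l1 = KNeighborsClassifier(n_neighbors=5, metric="minkowski", p=1)
knn_l2 = KNeighborsClassifier(n_neighbors=5, metric="minkowski", p=2)
```

Fitting `knn_l1` and `knn_l2` on the same training data and comparing their test accuracy is one way to approach the exercise.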

In [None]:
## Code here







In [None]:
## Code here







In [None]:
## Code here







In [None]:
## Code here







In [None]:
## Code here







## Precision-Recall Trade-Off

Return to the logistic regression example from notebook 2.

Plot the precision and the recall as a function of the probability cutoff. Notice that as one increases the other tends to decrease.
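The following is a rough sketch of how precision and recall can be computed over a grid of cutoffs. It uses made-up synthetic data (`X_demo`, `y_demo`) rather than `random_binary.csv`, and the cutoff grid is an arbitrary choice, so treat it as a template and not as the solution.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score

# Made-up one-feature binary data (an assumption, not the course data)
rng = np.random.default_rng(614)
X_demo = rng.normal(size=(400, 1))
p_true = 1 / (1 + np.exp(-3 * X_demo[:, 0]))
y_demo = (rng.random(400) < p_true).astype(int)

log_reg = LogisticRegression()
log_reg.fit(X_demo, y_demo)

# Predicted probability of class 1
probs = log_reg.predict_proba(X_demo)[:, 1]

cutoffs = np.arange(0.1, 0.91, 0.1)
precisions = []
recalls = []

for cutoff in cutoffs:
    # Classify as 1 when the predicted probability reaches the cutoff
    preds = (probs >= cutoff).astype(int)
    precisions.append(precision_score(y_demo, preds, zero_division=0))
    recalls.append(recall_score(y_demo, preds))
```

Plotting `precisions` and `recalls` against `cutoffs` with `plt.plot` shows the trade-off. Recall can never increase as the cutoff rises, because raising the cutoff only removes predicted positives.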

In [None]:
data = np.loadtxt("random_binary.csv",delimiter = ",")
X = data[:,0]
y = data[:,1]

In [None]:
# Perform a stratified test train split
# Practice, write the code to do that in these two blocks
# First import the package
from sklearn.model_selection import train_test_split

In [None]:
# Now split the data
# Have 20% for testing
# Set 614 as the random state
# and stratify the split
X_train,X_test,y_train,y_test = train_test_split(X,y,
                                                 test_size = .2,
                                                 random_state = 614,
                                                 shuffle = True,
                                                 stratify = y)

In [None]:
## Code here







In [None]:
## Code here







In [None]:
## Code here







In [None]:
## Code here







In [None]:
## Code here







## Using Classification Algorithms for Regression

Virtually all of our classification techniques can also be used for regression. This is useful because it gives you a diverse array of regression models, each of which makes different types of errors on the data. For instance, while linear regression models may have issues with multicollinearity, random forest regression models are typically fine with it because the decision tree structure doesn't rely on a linear relationship between the target and the features.

We briefly introduce regression versions of the classification algorithms and then refer you to their documentation. Work through these and answer the questions as you go along.

In all of the following, suppose $y$ is the target and $X$ is the feature matrix.

### KNN Regression

KNN regression works as follows. For an observation you'd like to predict on, find its $k$ closest training neighbors in the feature space and take their target values. The prediction for the observation is then the average of those $k$ target values. This average can be the arithmetic mean, a mean weighted by the inverse of the distance to the observation, or any custom weight function you'd like.
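To make the averaging step concrete, here is a small by-hand sketch for a single feature. The function `knn_regress` and the toy arrays are hypothetical and only illustrate the idea; they are not how `sklearn` implements it.

```python
import numpy as np


def knn_regress(x_query, x_tr, y_tr, k=5, weighted=False):
    """Predict at x_query from the k nearest 1-D training points (illustrative)."""
    x_tr = np.asarray(x_tr, dtype=float)
    y_tr = np.asarray(y_tr, dtype=float)

    # Distance from the query to every training point
    dists = np.abs(x_tr - x_query)

    # Indices of the k closest training points
    nearest = np.argsort(dists)[:k]

    if not weighted:
        # Plain arithmetic mean of the neighbors' targets
        return y_tr[nearest].mean()

    # Inverse-distance weights; the small constant avoids division by zero
    weights = 1.0 / (dists[nearest] + 1e-12)
    return np.sum(weights * y_tr[nearest]) / np.sum(weights)


# Made-up training data
x_demo = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_demo = np.array([0.0, 1.0, 4.0, 9.0, 16.0])

pred_mean = knn_regress(1.9, x_demo, y_demo, k=2)
pred_weighted = knn_regress(1.9, x_demo, y_demo, k=2, weighted=True)
```

With $k=2$ the neighbors of $1.9$ are the points at $2$ and $1$, so the unweighted prediction is the mean of $4$ and $1$, while the weighted prediction is pulled toward $4$ because that neighbor is much closer.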

Go to section 1.6.3 in the `sklearn` docs, <a href="https://scikit-learn.org/stable/modules/neighbors.html">https://scikit-learn.org/stable/modules/neighbors.html</a>, to read what `sklearn` has to say about KNN regression and see how to implement it in Python.

Test your understanding of the documentation by fitting a KNN model to the following data and predicting with it. Use `np.linspace` to plot the model fit along with the training data. Compare the fit to a standard linear regression.

In [None]:
## Generate the data
x_train = np.linspace(-5,5,100)
y_train = x_train*(x_train-1)*(x_train+2) + .5*np.random.randn(100)

x_test = np.linspace(-5,5,100)
y_test = x_test*(x_test-1)*(x_test+2) + .5*np.random.randn(100)

In [None]:
## Make the model and fit it




In [None]:
## Additional Code here




In [None]:
## Additional Code here



In [None]:
## Additional Code here




### Tree Regression Methods

Both decision trees and random forests can be used in regression problems. One nice feature here is that you don't need to go through the process of exploring important features by hand as you did with linear regression.

The process is the same as with tree-based classification: the algorithm goes through all the possible features and makes a cut based on how much the split improves the training error. Once a split is made, the prediction is the average target value of the training instances in each node of the split. For example, if the test observation you want to predict falls into node $A$ of the tree, then the prediction is the average target value of the training instances in node $A$.
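The node-average idea can be checked on a tiny made-up data set. This is only a sketch: the arrays below are invented, and a depth-one tree is used so that there is a single split to inspect.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Made-up data with two well-separated groups
x_demo = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0]).reshape(-1, 1)
y_demo = np.array([1.0, 2.0, 3.0, 20.0, 22.0, 24.0])

# A depth-one tree makes exactly one split
stump = DecisionTreeRegressor(max_depth=1)
stump.fit(x_demo, y_demo)

# apply() returns the index of the leaf each observation lands in
x_new = np.array([[2.5]])
train_leaves = stump.apply(x_demo)
new_leaf = stump.apply(x_new)[0]

# The prediction should match the mean target of the training points
# that share a leaf with the new observation
pred_new = stump.predict(x_new)[0]
leaf_mean = y_demo[train_leaves == new_leaf].mean()
```

Here `pred_new` and `leaf_mean` agree, since the new point falls in the leaf holding the three smallest training observations.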

You can fit a single tree with `DecisionTreeRegressor` or, as is usually preferable, a forest with `RandomForestRegressor`.

You can read the documentation here, <a href="https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html">https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html</a>, and here <a href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html">https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html</a>. 

As discussed in Regression Notebook 5, these regression methods are able to avoid some of the issues of multicollinearity.

Read in the following hitters data.

Then make a train test split.

Then using tree regression methods build a model to predict `Salary`.

In [None]:
hitters = pd.read_csv("Hitters.csv", index_col = 0)

In [None]:
hitters.sample(5)

In [None]:
## Code here







In [None]:
## Code here







In [None]:
## Code here







In [None]:
## Code here







### SVM Regression

Regression can also be done using support vector machines.

It is implemented using `LinearSVR`, <a href="https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVR.html">https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVR.html</a>, and `SVR`, <a href = "https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html">https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html</a>.
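As a quick illustration of the two classes, the sketch below fits both to made-up sinusoidal data. The data and the hyperparameter values (`C`, `epsilon`, `max_iter`) are arbitrary assumptions chosen for the demonstration, not recommended settings.

```python
import numpy as np
from sklearn.svm import SVR, LinearSVR

# Made-up nonlinear data
rng = np.random.default_rng(0)
x_demo = np.linspace(-3, 3, 80)
y_demo = np.sin(x_demo) + 0.1 * rng.normal(size=80)
X_demo = x_demo.reshape(-1, 1)

# Kernelized support vector regression with an RBF kernel
svr_rbf = SVR(kernel="rbf", C=10, epsilon=0.1)
svr_rbf.fit(X_demo, y_demo)

# Linear support vector regression
lin_svr = LinearSVR(epsilon=0.1, max_iter=10000, random_state=0)
lin_svr.fit(X_demo, y_demo)

# Training mean squared error of each model
rbf_mse = np.mean((y_demo - svr_rbf.predict(X_demo)) ** 2)
lin_mse = np.mean((y_demo - lin_svr.predict(X_demo)) ** 2)
```

On this data the RBF kernel follows the curve while the linear model cannot, so `rbf_mse` comes out smaller than `lin_mse`.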

### Ensemble Regression

Once you've fit a number of good regression models you can create an ensemble regressor. All of the ensemble methods we've discussed have regression versions, `VotingRegressor`, <a href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingRegressor.html">https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingRegressor.html</a>, `BaggingRegressor` <a href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingRegressor.html">https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingRegressor.html</a>, and `AdaBoostRegressor` <a href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostRegressor.html">https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostRegressor.html</a>.
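For orientation, here is a minimal `VotingRegressor` sketch on made-up data. The choice of base models and their settings is an assumption for illustration only.

```python
import numpy as np
from sklearn.ensemble import VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Made-up quadratic data
rng = np.random.default_rng(1)
X_demo = rng.uniform(-2, 2, size=(200, 1))
y_demo = X_demo[:, 0] ** 2 + 0.1 * rng.normal(size=200)

# Each entry is a (name, estimator) pair
voter = VotingRegressor([
    ("lin", LinearRegression()),
    ("knn", KNeighborsRegressor(n_neighbors=5)),
    ("tree", DecisionTreeRegressor(max_depth=3, random_state=0)),
])
voter.fit(X_demo, y_demo)

X_new = np.array([[0.0], [1.5]])
ensemble_pred = voter.predict(X_new)

# With no weights given, the ensemble prediction is the plain
# average of the fitted base models' predictions
manual_avg = np.mean([est.predict(X_new) for est in voter.estimators_], axis=0)
```

Comparing `ensemble_pred` with `manual_avg` confirms that the voting regressor simply averages its fitted members.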

Return to any of the regression problems from the regression portion of the course. Build an ensemble of regression models.

In [None]:
## Code here







In [None]:
## Code here







In [None]:
## Code here







In [None]:
## Code here







In [None]:
## Code here







In [None]:
## Code here







In [None]:
## Code here







In [None]:
## Code here







In [None]:
## Code here







### How to Fit a Logistic Regression Model

We return to Maximum Likelihood Estimation from the Regression Notebook.

Recall that in logistic regression we are interested in $P(y=1|X)$; let's call this $p(X;\beta)$. We model this probability as:
$$
p(X;\beta) = \frac{1}{1 + e^{-X\beta}}.
$$

Because the training targets are binary, we can't rely on the same procedure we used for linear regression. We instead use maximum likelihood estimation, which first requires writing out the likelihood function.

First, attempt to set up the $\log$-likelihood for the logistic regression model. Hint: we can think of $y_i$ as a Bernoulli random variable with probability parameter $p_i=p(X_i;\beta)$.


After you've accomplished that, read through this reference starting at page 5 to see the derivation of the maximum likelihood estimate for logistic regression, <a href="https://cseweb.ucsd.edu/~elkan/250B/logreg.pdf">https://cseweb.ucsd.edu/~elkan/250B/logreg.pdf</a>.
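Once you have attempted the derivation yourself, a numerical check can help. The sketch below evaluates the Bernoulli log-likelihood $\sum_i \left[ y_i \log p_i + (1-y_i)\log(1-p_i) \right]$ for a given $\beta$; the function names and the four-point data set are made up for illustration.

```python
import numpy as np


def sigmoid(z):
    """The logistic function 1 / (1 + exp(-z))."""
    return 1 / (1 + np.exp(-z))


def log_likelihood(beta, X, y):
    """Bernoulli log-likelihood of beta for features X and binary targets y."""
    p = sigmoid(X @ beta)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))


# Made-up data; the first column of ones plays the role of the intercept
X_demo = np.array([[1.0, -2.0],
                   [1.0, -1.0],
                   [1.0, 1.0],
                   [1.0, 2.0]])
y_demo = np.array([0, 0, 1, 1])

# beta = 0 predicts probability 0.5 for every observation
ll_zero = log_likelihood(np.zeros(2), X_demo, y_demo)

# A positive slope matches the pattern in the data much better
ll_good = log_likelihood(np.array([0.0, 2.0]), X_demo, y_demo)
```

Maximum likelihood estimation searches for the $\beta$ that makes this quantity as large as possible, which is why `ll_good` exceeds `ll_zero` here.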

### Multiple Classes

All of the algorithms we've discussed in the classification section (with the exception of KNN) have been limited to two output classes. We now briefly demonstrate how to extend each model to multiple classes with `sklearn` and leave it to you to pursue the theory on your own time.

#### Logistic Regression -- Linear Discriminant Analysis

Learn about the theory behind linear discriminant analysis here, <a href="https://web.stanford.edu/class/stats202/content/lec9.pdf">https://web.stanford.edu/class/stats202/content/lec9.pdf</a>.

In `sklearn` it can be implemented with `LinearDiscriminantAnalysis`, <a href="https://scikit-learn.org/stable/modules/generated/sklearn.discriminant_analysis.LinearDiscriminantAnalysis.html">https://scikit-learn.org/stable/modules/generated/sklearn.discriminant_analysis.LinearDiscriminantAnalysis.html</a>.

Use it to classify the toy data set below. Plot the decision boundary.

In [None]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

In [None]:
X_train = np.concatenate([2*np.random.randn(50,2) + [2,2],
                    2*np.random.randn(50,2) + [-1,-1],
                    1*np.random.randn(50,2) + [-3,4]])

y_train = np.concatenate([np.zeros(50),np.ones(50),2*np.ones(50)])


X_test = np.concatenate([2*np.random.randn(50,2) + [2,2],
                    2*np.random.randn(50,2) + [-1,-1],
                    1*np.random.randn(50,2) + [-3,4]])

y_test = np.concatenate([np.zeros(50),np.ones(50),2*np.ones(50)])

In [None]:
plt.figure(figsize=(6,6))

plt.scatter(X_train[:50,0],X_train[:50,1],c='r',label="0")
plt.scatter(X_train[50:100,0],X_train[50:100,1],c='b',label="1")
plt.scatter(X_train[100:,0],X_train[100:,1],c='g',label="2")

plt.xlabel("$x_1$",fontsize=14)
plt.ylabel("$x_2$",fontsize=14)

plt.legend()
plt.show()

In [None]:
## Make and fit the model






In [None]:
## Code here







In [None]:
## Code here







In [None]:
## Code here







### Decision Trees and Random Forests

These methods naturally accept multiple classes with no additional work.

See the example below.

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree

In [None]:
tree_clf = DecisionTreeClassifier(max_depth=2)

In [None]:
tree.plot_tree(tree_clf.fit(X_train,y_train),filled=True)

plt.show()

The random forest classifier can be imported like so.

In [None]:
from sklearn.ensemble import RandomForestClassifier

### SVMs

SVMs also naturally allow for multiple output classes.

In [None]:
from sklearn.svm import SVC

In [None]:
svc = SVC(kernel="linear")

In [None]:
svc.fit(X_train,y_train)

In [None]:
svc.predict(X_test)

In [None]:
# Your Code here
x1 = np.linspace(-10,10,100)
x2 = np.linspace(-10,10,100)

x1v, x2v = np.meshgrid(x1,x2)

X_grid = np.concatenate([x1v.reshape(-1,1),x2v.reshape(-1,1)],axis=1)

pred_grid = pd.DataFrame(np.concatenate([X_grid,svc.predict(X_grid).reshape(-1,1)],axis=1),
                         columns = ['x1','x2','y'])

plt.figure(figsize=(10,10))

plt.scatter(pred_grid.loc[pred_grid.y == 1,'x1'],pred_grid.loc[pred_grid.y == 1,'x2'],
           c='antiquewhite', label="Predicted 1")
plt.scatter(pred_grid.loc[pred_grid.y == 0,'x1'],pred_grid.loc[pred_grid.y == 0,'x2'],
           c='black', label="Predicted 0")
plt.scatter(pred_grid.loc[pred_grid.y == 2,'x1'],pred_grid.loc[pred_grid.y == 2,'x2'],
           c='grey', label="Predicted 2")

plt.scatter(X_train[y_train == 1,0],X_train[y_train == 1,1],c = "cyan",label="Training 1")
plt.scatter(X_train[y_train == 0,0],X_train[y_train == 0,1],c = "orange",label="Training 0")
plt.scatter(X_train[y_train == 2,0],X_train[y_train == 2,1],c = "red",label="Training 2")

plt.legend(fontsize=14)

plt.xlabel("$x_1$", fontsize=16)
plt.ylabel("$x_2$", fontsize=16)


plt.show()

### Ensemble Methods

Ensemble Methods work in exactly the same way.


Practice by building an ensemble model to predict on the iris data.

In [None]:
## Code here







In [None]:
## Code here







In [None]:
## Code here







In [None]:
## Code here







In [None]:
## Code here







In [None]:
## Code here







## Na&iuml;ve Bayes Classifier

Na&iuml;ve Bayes is a popular classification technique that is often used in email <a href="https://www.spam.com/">spam</a> filter examples, <a href="https://en.wikipedia.org/wiki/Naive_Bayes_spam_filtering">https://en.wikipedia.org/wiki/Naive_Bayes_spam_filtering</a>.

Read up on the algorithm in the sklearn docs <a href="https://scikit-learn.org/stable/modules/naive_bayes.html">https://scikit-learn.org/stable/modules/naive_bayes.html</a>, and this pdf presentation from UPenn Engineering, <a href="https://www.seas.upenn.edu/~cis391/Lectures/naive-bayes-spam-2015.pdf">https://www.seas.upenn.edu/~cis391/Lectures/naive-bayes-spam-2015.pdf</a>.



## Gradient Boosting

In addition to the popular AdaBoost, there is the Gradient Boosting algorithm. Just like AdaBoost, Gradient Boosting sequentially adds a new weak learner to the model in order to improve the predictions at each step.

While AdaBoost learns from the difficult-to-predict instances by weighting them differently, Gradient Boosting works by having the next weak learner predict the residual errors of the previous weak learner.

Let's show how we can implement this by hand using a Gradient Boosted Regression Tree.

In [None]:
from sklearn.tree import DecisionTreeRegressor

In [None]:
# make some toy data
x = np.linspace(-.5,.5,50)
y = x**2 + .025*np.random.randn(50)

In [None]:
# We'll fit a series of regression trees
tree_reg1 = DecisionTreeRegressor(max_depth=2)
tree_reg2 = DecisionTreeRegressor(max_depth=2)
tree_reg3 = DecisionTreeRegressor(max_depth=2)
tree_reg4 = DecisionTreeRegressor(max_depth=2)
tree_reg5 = DecisionTreeRegressor(max_depth=2)

In [None]:
tree_reg1.fit(x.reshape(-1,1),y)

# get the residuals
y2 = y - tree_reg1.predict(x.reshape(-1,1))

tree_reg2.fit(x.reshape(-1,1),y2)

# get the next round of residuals
y3 = y2 - tree_reg2.predict(x.reshape(-1,1))

tree_reg3.fit(x.reshape(-1,1),y3)

# get the next round
y4 = y3 - tree_reg3.predict(x.reshape(-1,1))

tree_reg4.fit(x.reshape(-1,1),y4)

# get next round of residuals
y5 = y4 - tree_reg4.predict(x.reshape(-1,1))

tree_reg5.fit(x.reshape(-1,1),y5)

# get last round of residuals.
y6 = y5 - tree_reg5.predict(x.reshape(-1,1))

In [None]:
# make a plot to demonstrate
fig, ax = plt.subplots(5,2,figsize=(14,20), sharex=True, sharey=True)

# The first row has two of the same
# plot since we have no residuals yet
ax[0,0].scatter(x,y,c='b')
ax[0,1].scatter(x,y,c='b')

# now add in the fitted regression
ax[0,0].plot(x,tree_reg1.predict(x.reshape(-1,1)),'-r')
ax[0,1].plot(x,tree_reg1.predict(x.reshape(-1,1)),'-r')

## second row
ax[1,0].scatter(x,y2,c='b')
ax[1,1].scatter(x,y,c='b')

# now add in the fitted regression
ax[1,0].plot(x,tree_reg2.predict(x.reshape(-1,1)),'-r')
ax[1,1].plot(x,tree_reg1.predict(x.reshape(-1,1)) + 
                 tree_reg2.predict(x.reshape(-1,1)),'-r')

## third row
ax[2,0].scatter(x,y3,c='b')
ax[2,1].scatter(x,y,c='b')

# now add in the fitted regression
ax[2,0].plot(x,tree_reg3.predict(x.reshape(-1,1)),'-r')
ax[2,1].plot(x,tree_reg1.predict(x.reshape(-1,1)) + 
                 tree_reg2.predict(x.reshape(-1,1)) +
                 tree_reg3.predict(x.reshape(-1,1)),'-r')

## fourth row
ax[3,0].scatter(x,y4,c='b')
ax[3,1].scatter(x,y,c='b')

# now add in the fitted regression
ax[3,0].plot(x,tree_reg4.predict(x.reshape(-1,1)),'-r')
ax[3,1].plot(x,tree_reg1.predict(x.reshape(-1,1)) + 
                 tree_reg2.predict(x.reshape(-1,1)) +
                 tree_reg3.predict(x.reshape(-1,1)) +
                 tree_reg4.predict(x.reshape(-1,1)),'-r')


## fifth row
ax[4,0].scatter(x,y5,c='b')
ax[4,1].scatter(x,y,c='b')

# now add in the fitted regression
ax[4,0].plot(x,tree_reg5.predict(x.reshape(-1,1)),'-r')
ax[4,1].plot(x,tree_reg1.predict(x.reshape(-1,1)) + 
                 tree_reg2.predict(x.reshape(-1,1)) +
                 tree_reg3.predict(x.reshape(-1,1)) +
                 tree_reg4.predict(x.reshape(-1,1)) +
                 tree_reg5.predict(x.reshape(-1,1)),'-r')

ax[0,0].set_title("Residual Plots", fontsize=16)
ax[0,1].set_title("Original Data Plot", fontsize=16)

ax[0,0].set_ylabel("Model 1", fontsize=14)
ax[1,0].set_ylabel("Model 2", fontsize=14)
ax[2,0].set_ylabel("Model 3", fontsize=14)
ax[3,0].set_ylabel("Model 4", fontsize=14)
ax[4,0].set_ylabel("Model 5", fontsize=14)

plt.show()

We can do all of this at the same time with the `GradientBoostingRegressor` Class in `sklearn`. <a href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html">https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html</a>.

In [None]:
from sklearn.ensemble import GradientBoostingRegressor

In [None]:
# The larger the learning rate, the faster the algorithm learns,
# but be careful: you can easily overshoot the optimal solution this way.
# See what happens if learning_rate is 10
gbr = GradientBoostingRegressor(max_depth = 2,
                                n_estimators = 5,
                                learning_rate=.5)

gbr.fit(x.reshape(-1,1),y)

In [None]:
# We can plot the gradient boosted regressor prediction.
plt.figure()

plt.scatter(x,y,c='b')

plt.plot(x,gbr.predict(x.reshape(-1,1)),'-r')

plt.show()
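To see what the learning rate does, the residual-fitting procedure can be written as a loop in which each tree's contribution is scaled before it is added. This is a simplified sketch and not `sklearn`'s implementation: the function `boost_by_hand` is hypothetical, and starting from the mean of the target is an assumption of this sketch.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Made-up data of the same form as the toy data above
rng = np.random.default_rng(0)
x_demo = np.linspace(-0.5, 0.5, 50).reshape(-1, 1)
y_demo = x_demo[:, 0] ** 2 + 0.025 * rng.normal(size=50)


def boost_by_hand(X, y, n_trees=5, learning_rate=0.5, max_depth=2):
    """Fit trees to successive residuals, shrinking each one by learning_rate."""
    # Assumption of this sketch: start from the mean of the target
    prediction = np.full(len(y), y.mean())
    train_mse = [np.mean((y - prediction) ** 2)]
    trees = []

    for _ in range(n_trees):
        # The next tree is trained on what the current model gets wrong
        residuals = y - prediction
        tree_reg = DecisionTreeRegressor(max_depth=max_depth)
        tree_reg.fit(X, residuals)

        # Only a fraction of the tree's prediction is added
        prediction = prediction + learning_rate * tree_reg.predict(X)
        trees.append(tree_reg)
        train_mse.append(np.mean((y - prediction) ** 2))

    return trees, train_mse


_, mse_slow = boost_by_hand(x_demo, y_demo, learning_rate=0.5)
_, mse_full = boost_by_hand(x_demo, y_demo, learning_rate=1.0)
```

For learning rates between 0 and 2 the training error in this sketch cannot increase from one stage to the next. Rerunning `boost_by_hand` with `learning_rate=10` shows how larger values can overshoot.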

Now go through and make a test set for this synthetic data. Find the `max_depth` that produces the lowest test error. Plot the test error as a function of the `max_depth`.

In [None]:
## Code here







In [None]:
## Code here







In [None]:
## Code here







In [None]:
## Code here





