### Thursday April 3 Lecture 

**Sklearn Essentials**

In [2]:
# dataset loader
from sklearn import datasets

# model training and evaluation utilities
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_validate
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold # this is one way to generate folds
from sklearn.model_selection import KFold

# models
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LinearRegression
from sklearn import linear_model

# toy data
X, y = datasets.load_iris(return_X_y=True)
X.shape, y.shape


((150, 4), (150,))

### What you should learn/be aware of based on this lecture

Key sklearn functions:
- train_test_split
- cross_validate
- Fold generators: KFold and StratifiedKFold
- Scoring functions from last lecture and how to pass them to cross_validate
- How to compare different models by looping over them with cross_validate, GridSearchCV, or RandomizedSearchCV
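
GridSearchCV is on the list above but is not shown elsewhere in these notes, so here is a minimal sketch of what a search might look like. The names `X_iris`, `y_iris`, and `param_grid`, and the candidate values for `n_neighbors`, are assumptions chosen only for illustration.

```python
from sklearn import datasets
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X_iris, y_iris = datasets.load_iris(return_X_y=True)

# hypothetical grid: these candidate values are for illustration only
param_grid = {"n_neighbors": [1, 3, 5, 7]}

# each candidate is scored with 5-fold cross-validation
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy")
search.fit(X_iris, y_iris)

print(search.best_params_)           # the winning parameter combination
print(round(search.best_score_, 3))  # its mean cross-validated accuracy
```

RandomizedSearchCV follows the same fit-then-inspect pattern but samples candidates instead of trying every combination.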

Not covered today but you should check out:

- confusion_matrix and classification_report (helpful to evaluate models)
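
If you want a head start on those two, here is a rough sketch. It reuses the same split and model as the "Split, train, evaluate" example; the variable names (`X_iris`, `X_train`, `cm`, and so on) are my own.

```python
from sklearn import datasets
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X_iris, y_iris = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X_iris, y_iris, random_state=0, train_size=0.5
)

model = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
y_pred = model.predict(X_test)

# rows are the true classes, columns are the predicted classes
cm = confusion_matrix(y_test, y_pred)
print(cm)

# precision, recall, and F1 for each class, plus averages
print(classification_report(y_test, y_pred))
```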

## Split, train, evaluate example

In [15]:
# split the data with 50% in each set
X1, X2, y1, y2 = train_test_split(X, y, random_state=0,
                                  train_size=0.5)

# fit the model on one set of data
# ignore which model I chose here; it's not important
model = KNeighborsClassifier(n_neighbors=1)
model.fit(X1, y1) # fit on the "training data" X1 and  y1

# evaluate the model on the second set of data
y2_model = model.predict(X2) # using X2 (out-of-sample data), predict y2
accuracy_score(y2, y2_model)  # fraction of predictions that exactly match y2


0.9066666666666666

## Cross_validate Example 

In [24]:
from sklearn import svm
from sklearn.model_selection import cross_val_score
clf = svm.SVC(kernel='linear', C=1)
scores = cross_val_score(clf, X, y, cv=5)
print(scores)
# array([0.96..., 1.  ..., 0.96..., 0.96..., 1.        ])

print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
#Accuracy: 0.98 (+/- 0.03)

[0.96666667 1.         0.96666667 0.96666667 1.        ]
Accuracy: 0.98 (+/- 0.03)


## K-fold example

RepeatedKFold & StratifiedKFold are two variations of k-fold.

- **RepeatedKFold** repeats K-Fold n times. Use it when you need to run KFold n times with different splits in each repetition.

- **StratifiedKFold** is a variation of k-fold which returns stratified folds: each set contains approximately the same percentage of samples of each target class as the complete set.
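
RepeatedKFold has no example in these notes, so here is a small sketch of it on toy data. `X_demo` is a made-up array, and the choices of 2 folds, 3 repeats, and `random_state=0` are arbitrary.

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold

X_demo = np.arange(8).reshape(4, 2)  # four toy observations

# 2 folds, repeated 3 times, gives 2 * 3 = 6 train/test splits
rkf = RepeatedKFold(n_splits=2, n_repeats=3, random_state=0)
splits = list(rkf.split(X_demo))

for train_idx, test_idx in splits:
    print(train_idx, test_idx)
```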

![image](https://scikit-learn.org/stable/_images/sphx_glr_plot_cv_indices_0041.png)

In [26]:
import numpy as np
from sklearn.model_selection import KFold

X = ["a", "b", "c", "d"]
kf = KFold(n_splits=2)
for train, test in kf.split(X):
    print("%s %s" % (train, test))
# [2 3] [0 1]
# [0 1] [2 3]
# train [2 3] means "c" and "d" form the training set


[2 3] [0 1]
[0 1] [2 3]


In [38]:
from sklearn.model_selection import StratifiedKFold, KFold
import numpy as np
X, y = np.ones((50, 1)), np.hstack(([0] * 45, [1] * 5))
skf = StratifiedKFold(n_splits=3)
for train, test in skf.split(X, y):
    print('train -  {}   |   test -  {}'.format(np.bincount(y[train]), np.bincount(y[test])))

print("\n")
kf = KFold(n_splits=3)
for train, test in kf.split(X, y):
    print('train -  {}   |   test -  {}'.format(np.bincount(y[train]), np.bincount(y[test])))

# StratifiedKFold approximately preserves the class mix (5 of the 50 samples are class 1)
# in both the train and test sets. Plain KFold does not: some of its folds contain no class 1 at all.

train -  [30  3]   |   test -  [15  2]
train -  [30  3]   |   test -  [15  2]
train -  [30  4]   |   test -  [15  1]


train -  [28  5]   |   test -  [17]
train -  [28  5]   |   test -  [17]
train -  [34]   |   test -  [11  5]


 ![image](https://scikit-learn.org/stable/_images/sphx_glr_plot_cv_indices_0071.png)

### Want to do k-fold? It's like repeating the above. In pseudocode, it looks like:

1. Break the X and y data into $k$ subsamples
2. For each subsample, fit the model on the other $k-1$ subsamples, predict the held-out subsample (OOS), score those predictions, and save the scores
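
The steps above can be sketched as an explicit loop. This is one possible version, not necessarily the code written live in lecture: the model, the choice of 5 folds, and the names `X_iris`, `y_iris`, and `fold_scores` are assumptions.

```python
import numpy as np
from sklearn import datasets
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

X_iris, y_iris = datasets.load_iris(return_X_y=True)

# step 1: break the data into k subsamples
# (shuffle because the iris rows are sorted by class)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

# step 2: for each held-out subsample, fit on the rest, predict, score, save
fold_scores = []
for train_idx, test_idx in kf.split(X_iris):
    model = KNeighborsClassifier(n_neighbors=1)
    model.fit(X_iris[train_idx], y_iris[train_idx])
    y_pred = model.predict(X_iris[test_idx])
    fold_scores.append(accuracy_score(y_iris[test_idx], y_pred))

print(np.mean(fold_scores), np.std(fold_scores))
```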


### K-Fold in Python: The explicit way, and the wrapped way


In [None]:
# you can take quick notes here, but I'm not going to write this code slowly enough to copy
# the point here is to illustrate

Now try the wrapper below! We are going to see how to use that function to:
- try multiple models
- try different sets of X variables
- try different ways to specify folds

In [None]:
# try the function here

In [None]:
# try here with diff scores

All the metrics it can compute out of the box are here: https://scikit-learn.org/stable/modules/model_evaluation.html

Notice that many of these were discussed in our last lecture!

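**Warning/Note:** the function names on that page (e.g. `accuracy_score`) are not what you put in the scoring dictionary. The scoring dictionary takes the scoring string names (e.g. `'accuracy'`), which are listed in a table on that page.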
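
To make the naming issue concrete, here is a minimal sketch of passing a scoring dictionary to cross_validate. The dictionary keys (`acc`, `f1`) are labels I made up; the values are sklearn's scoring strings. The model matches the SVC from the earlier example.

```python
from sklearn import datasets
from sklearn.model_selection import cross_validate
from sklearn.svm import SVC

X_iris, y_iris = datasets.load_iris(return_X_y=True)

# keys are labels of our choosing; values are sklearn's scoring strings
scoring = {"acc": "accuracy", "f1": "f1_macro"}

results = cross_validate(
    SVC(kernel="linear", C=1), X_iris, y_iris, cv=5, scoring=scoring
)

# the output has one "test_<key>" entry per metric, plus timing info
print(sorted(results.keys()))
print(results["test_acc"].mean(), results["test_f1"].mean())
```

Note that the results are keyed by your own labels (`test_acc`, `test_f1`), not by the scoring strings.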

## question:


In [1]:
# answer here

## Exploring the cross_validate parameters

### The Model Parameter 

In [None]:
# change the model

## question:


In [1]:
# answer here

The linear_model submodule contains lots of useful alternative options

In [None]:
# for example:
linear_model.Lasso
linear_model.Ridge
linear_model.LogisticRegression

linear_model.LassoCV() # Lasso (L1 regularization) linear model that picks the best model by cross-validation
linear_model.RidgeCV() # Ridge (L2 regularization) linear model that picks the best model by cross-validation
linear_model.LogisticRegressionCV() # logit model that picks the best model by cross-validation

Looping over models

In [None]:
# set up models to try
models = []
models.append(('svc_1', SVC(gamma='auto') ))
models.append(('svc_2', SVC(C=5) ))
models.append(('neighbor',  KNeighborsClassifier(n_neighbors=1)))

# loop and print
for name, model in models:
    scores = cross_validate(model, X, y, scoring='accuracy')
    print('%s: %.3f (%.3f)' % (name.ljust(10), 
                                   scores['test_score'].mean(), 
                                   scores['test_score'].std()
                                   )
         )


### The X parameter

You can loop over Xs

In [None]:
# define a smaller X and a bigger X
X_small = X[:,:2] # just first two columns

from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=3, include_bias=False)
X3 = poly.fit_transform(X)

# set up Xs to try
# right here!

# loop and print
# right here!
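
One way the placeholders above might be filled in, mirroring the "Looping over models" cell. This is a sketch rather than the code from lecture: the list name `feature_sets`, the labels, and the fixed KNeighborsClassifier are my assumptions, and the iris data is reloaded as `X_iris`, `y_iris` so the cell runs on its own.

```python
from sklearn import datasets
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import PolynomialFeatures

X_iris, y_iris = datasets.load_iris(return_X_y=True)

# a smaller X and a bigger X, as in the cell above
X_small = X_iris[:, :2]
X3 = PolynomialFeatures(degree=3, include_bias=False).fit_transform(X_iris)

# set up Xs to try (labels are made up for printing)
feature_sets = [("small", X_small), ("full", X_iris), ("poly3", X3)]

# loop and print, holding the model fixed
mean_scores = {}
for name, X_try in feature_sets:
    scores = cross_validate(
        KNeighborsClassifier(n_neighbors=1), X_try, y_iris, scoring="accuracy"
    )
    mean_scores[name] = scores["test_score"].mean()
    print("%s: %.3f (%.3f)" % (name.ljust(10),
                               scores["test_score"].mean(),
                               scores["test_score"].std()))
```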

### Xs and Models
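
Nothing was written down for this part, so here is a guess at what it could look like: a loop over every combination of feature set and model. The dictionaries `feature_sets` and `models` and their labels are hypothetical.

```python
from itertools import product

from sklearn import datasets
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X_iris, y_iris = datasets.load_iris(return_X_y=True)

feature_sets = {"small": X_iris[:, :2], "full": X_iris}
models = {"svc": SVC(gamma="auto"),
          "neighbor": KNeighborsClassifier(n_neighbors=1)}

# try every (X, model) pair and store the mean test accuracy
results = {}
for (x_name, X_try), (m_name, model) in product(feature_sets.items(),
                                                models.items()):
    scores = cross_validate(model, X_try, y_iris, scoring="accuracy")
    results[(x_name, m_name)] = scores["test_score"].mean()
    print(x_name, m_name, round(results[(x_name, m_name)], 3))
```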

### CV parameter and folds
Just watch.
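
In case you want something to refer back to, here is a sketch of a few values the `cv` parameter accepts. The specific fold counts and `random_state` values are arbitrary, and `cv_options` and `n_scores` are names I made up.

```python
from sklearn import datasets
from sklearn.model_selection import KFold, StratifiedKFold, cross_validate
from sklearn.svm import SVC

X_iris, y_iris = datasets.load_iris(return_X_y=True)
model = SVC(kernel="linear", C=1)

cv_options = {
    # for a classifier, an integer means StratifiedKFold with that many folds
    "int 5": 5,
    # or pass a fold generator object to control the splitting yourself
    "KFold shuffled": KFold(n_splits=5, shuffle=True, random_state=0),
    "StratifiedKFold": StratifiedKFold(n_splits=10, shuffle=True, random_state=0),
}

n_scores = {}
for name, cv in cv_options.items():
    scores = cross_validate(model, X_iris, y_iris, cv=cv, scoring="accuracy")
    n_scores[name] = len(scores["test_score"])  # one score per fold
    print(name, scores["test_score"].mean())
```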

# Links, resources, and next week
Only two resources needed

- Sklearn **Documentation** [here](https://scikit-learn.org/stable/user_guide.html)
- Python Data Science Handbook (note some module calls are obsolete, so you might need to update code) https://jakevdp.github.io/PythonDataScienceHandbook/index.html

Next week:

- preprocessing
- data transformations
- feature selection