In [None]:
# dataset loader
from sklearn import datasets

# model training and evaluation utilities
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_validate
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold # this is one way to generate folds
from sklearn.model_selection import KFold

# models
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LinearRegression
from sklearn import linear_model

# toy data
X, y = datasets.load_iris(return_X_y=True)
X.shape, y.shape

## What you should learn/be aware of based on this lecture 

Key `sklearn` functions:
- `train_test_split`
- `cross_validate`
- Fold generators: `KFold` and `StratifiedKFold`
- Scoring functions from last lecture, and how to pass them to `cross_validate`
- How to compare different models by looping over them with `cross_validate`, `GridSearchCV`, or `RandomizedSearchCV` 

Not covered today but you should check out:
- `confusion_matrix` and `classification_report` (helpful to evaluate models)


## A simple "split, train, evaluate" example

In [None]:
# split the data with 50% in each set
X1, X2, y1, y2 = train_test_split(X, y, random_state=0,
                                  train_size=0.5)

# fit the model on one set of data
# ignore the model I choose here; which one it is isn't important
model = KNeighborsClassifier(n_neighbors=1)
model.fit(X1, y1) # fit on the "training data" X1 and  y1

# evaluate the model on the second set of data
y2_model = model.predict(X2) # using X2 (out-of-sample data), predict y2
accuracy_score(y2, y2_model) # see how close y2 is to prediction (fraction of all pred that are exactly right)
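
Accuracy is a single number. The `confusion_matrix` and `classification_report` functions listed above as "not covered today" break the same predictions down by class. A minimal sketch, assuming the same split and model as the cell above (the variable name `cm` is just for illustration):

```python
from sklearn import datasets
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = datasets.load_iris(return_X_y=True)
X1, X2, y1, y2 = train_test_split(X, y, random_state=0, train_size=0.5)

# same model as above: fit on the first half, predict the second half
model = KNeighborsClassifier(n_neighbors=1)
model.fit(X1, y1)
y2_model = model.predict(X2)

# rows are the true classes, columns are the predicted classes
cm = confusion_matrix(y2, y2_model)
print(cm)

# per-class precision, recall, and f1
print(classification_report(y2, y2_model))
```

Off-diagonal entries of `cm` count the observations the model put in the wrong class.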

## Want to do k-fold? It's like repeating the above

In pseudocode, it looks like:
1. Break the X and y data into $k$ subsamples (folds)
2. For each fold, fit the model on the other $k-1$ folds, predict the held-out fold (OOS), score those predictions, and save the score

Ok?


## K-Fold in Python: The explicit way, and the wrapped way

Watch me do the explicit way

In [None]:
# you can take quick notes here, but I'm not going to write this code slowly enough for you to copy
# the point here is to illustrate
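
For reference after class, here is one way the explicit version could look. This is a sketch, not necessarily what was typed live: the choice of `KFold` with `shuffle=True`, the seed, and the name `fold_scores` are all assumptions.

```python
import numpy as np
from sklearn import datasets
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

X, y = datasets.load_iris(return_X_y=True)

# step 1: break the data into k folds (shuffled, because iris is sorted by class)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

# step 2: for each fold, fit on the other folds, predict the held-out fold, score, save
fold_scores = []
for train_idx, test_idx in kf.split(X):
    model = KNeighborsClassifier(n_neighbors=1)
    model.fit(X[train_idx], y[train_idx])
    y_pred = model.predict(X[test_idx])
    fold_scores.append(accuracy_score(y[test_idx], y_pred))

print(np.round(fold_scores, 3))
print("mean accuracy:", np.mean(fold_scores))
```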

Now try the wrapper below! We are going to see how to use that function to:
- try multiple models
- try different sets of X variables
- try different ways to specify folds

In [None]:
# try the function here
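
A minimal sketch of the wrapper, assuming the same 1-nearest-neighbor model used earlier and 5 folds. `cross_validate` does the split, fit, predict, and score steps for you and returns a dictionary of arrays with one entry per fold.

```python
from sklearn import datasets
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier

X, y = datasets.load_iris(return_X_y=True)

model = KNeighborsClassifier(n_neighbors=1)
scores = cross_validate(model, X, y, cv=5)

print(sorted(scores.keys()))       # timing info plus 'test_score'
print(scores['test_score'])        # one score per fold
print(scores['test_score'].mean())
```

With no `scoring` argument, the score is whatever the model's own `.score()` method returns, which is accuracy for classifiers.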

In [None]:
# try here with diff scores
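
To compute several metrics at once, `scoring` can be a dictionary that maps a label you choose to a scorer name. In this sketch the labels `acc` and `f1` are arbitrary choices, and `return_train_score=True` is added only to show the in-sample scores alongside the out-of-sample ones.

```python
from sklearn import datasets
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier

X, y = datasets.load_iris(return_X_y=True)
model = KNeighborsClassifier(n_neighbors=1)

# keys are labels you pick; values are sklearn scorer names
scoring = {'acc': 'accuracy', 'f1': 'f1_macro'}

scores = cross_validate(model, X, y, cv=5,
                        scoring=scoring,
                        return_train_score=True)

# results come back as 'test_<label>' and 'train_<label>'
for key in ['test_acc', 'train_acc', 'test_f1', 'train_f1']:
    print(key.ljust(10), scores[key].mean().round(3))
```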

All the metrics it can compute out of the box are here: https://scikit-learn.org/stable/modules/model_evaluation.html

Notice that many of these were discussed in our last lecture!

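_**Warning/Note:**_ the metric function names on that page (e.g. `accuracy_score`) are not the strings you pass to `scoring` (e.g. `'accuracy'`). The table near the top of that page lists the valid `scoring` strings next to the function each one wraps.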

## question:

In [None]:
# answer here

## Exploring the `cross_validate` parameters

### The model parameter 

In [None]:
# change the model
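
The first argument to `cross_validate` can be any `sklearn` estimator, so changing the model is a one-line swap. A sketch, using `SVC` with default settings purely as an example:

```python
from sklearn import datasets
from sklearn.model_selection import cross_validate
from sklearn.svm import SVC

X, y = datasets.load_iris(return_X_y=True)

# only this line changes; everything else stays the same
model = SVC()

scores = cross_validate(model, X, y, cv=5, scoring='accuracy')
print(scores['test_score'].mean())
```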

### question:


In [None]:
# answer here

The `linear_model` submodule contains many useful alternative models

In [None]:
# for example:
linear_model.Lasso
linear_model.Ridge
linear_model.LogisticRegression

linear_model.LassoCV() # Lasso (L1 regularization) linear model that picks its regularization strength by cross-validation
linear_model.RidgeCV() # Ridge (L2 regularization) linear model that picks its regularization strength by cross-validation
linear_model.LogisticRegressionCV() # logit model that picks its regularization strength by cross-validation
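
The `...CV` estimators run the cross-validation inside `.fit()`. A hedged sketch with `LogisticRegressionCV` on the iris data: the values `Cs=10`, `cv=5`, and `max_iter=1000` are illustrative choices, not recommendations.

```python
from sklearn import datasets, linear_model

X, y = datasets.load_iris(return_X_y=True)

# tries 10 candidate values of C, scoring each with 5-fold CV
logit_cv = linear_model.LogisticRegressionCV(Cs=10, cv=5, max_iter=1000)
logit_cv.fit(X, y)

print(logit_cv.C_)           # the regularization strength(s) it selected
print(logit_cv.score(X, y))  # in-sample accuracy of the refit model
```

Note that the final `.score(X, y)` here is in-sample, so it is not a substitute for the out-of-sample scores from `cross_validate`.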


Looping over models

In [None]:
# set up models to try
models = []
models.append(('svc_1', SVC(gamma='auto') ))
models.append(('svc_2', SVC(C=5) ))
models.append(('neighbor',  KNeighborsClassifier(n_neighbors=1)))

# loop and print
for name, model in models:
    scores = cross_validate(model, X, y, scoring='accuracy')
    print('%s: %.3f (%.3f)' % (name.ljust(10), 
                                   scores['test_score'].mean(), 
                                   scores['test_score'].std()
                                   )
         )
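
When the "models" differ only in their parameter values, `GridSearchCV` (named in the list of key functions at the top) can do the looping for you. A minimal sketch, where the grid of `n_neighbors` values is an arbitrary assumption:

```python
from sklearn import datasets
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = datasets.load_iris(return_X_y=True)

# candidate values to try for each parameter
param_grid = {'n_neighbors': [1, 3, 5, 7]}

search = GridSearchCV(KNeighborsClassifier(), param_grid,
                      scoring='accuracy', cv=5)
search.fit(X, y)  # runs 5-fold CV for every candidate

print(search.best_params_)
print(round(search.best_score_, 3))
print(search.cv_results_['mean_test_score'])  # one mean score per candidate
```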


### The X parameter

You can loop over Xs

In [None]:
# define a smaller X and a bigger X
X_small = X[:,:2] # just first two columns

from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=3, include_bias=False)
X3 = poly.fit_transform(X)

# set up Xs to try
# (your code here)

# loop and print
# (your code here)
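
One way to fill in the loop, mirroring the "looping over models" cell. This is a sketch: the labels, the `results` dictionary, and the choice of a single 1-nearest-neighbor model are assumptions.

```python
from sklearn import datasets
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import PolynomialFeatures

X, y = datasets.load_iris(return_X_y=True)

# a smaller X and a bigger X, as defined above
X_small = X[:, :2]
poly = PolynomialFeatures(degree=3, include_bias=False)
X3 = poly.fit_transform(X)

# set up Xs to try
Xs = [('small', X_small), ('full', X), ('poly3', X3)]

# loop and print: same model, different feature sets
model = KNeighborsClassifier(n_neighbors=1)
results = {}
for name, X_try in Xs:
    scores = cross_validate(model, X_try, y, scoring='accuracy')
    results[name] = scores['test_score'].mean()
    print('%s: %d columns, %.3f (%.3f)' % (name.ljust(10), X_try.shape[1],
                                           scores['test_score'].mean(),
                                           scores['test_score'].std()))
```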

### Xs and Models
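
Combining the two ideas is just a nested loop. A sketch under the same assumptions as the cells above (the `grid` dictionary is a hypothetical name for storing results):

```python
from sklearn import datasets
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import PolynomialFeatures
from sklearn.svm import SVC

X, y = datasets.load_iris(return_X_y=True)
X_small = X[:, :2]
X3 = PolynomialFeatures(degree=3, include_bias=False).fit_transform(X)

Xs = [('small', X_small), ('full', X), ('poly3', X3)]
models = [('svc_1', SVC(gamma='auto')),
          ('svc_2', SVC(C=5)),
          ('neighbor', KNeighborsClassifier(n_neighbors=1))]

# every combination of feature set and model
grid = {}
for x_name, X_try in Xs:
    for m_name, model in models:
        scores = cross_validate(model, X_try, y, scoring='accuracy')
        grid[(x_name, m_name)] = scores['test_score'].mean()
        print('%s %s: %.3f' % (x_name.ljust(8), m_name.ljust(10),
                               grid[(x_name, m_name)]))
```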

### CV parameter and folds

Just watch.
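
For later reference, a sketch of how the `cv` argument changes the folds. It can be an integer or a fold generator; for classifiers, an integer means stratified folds. The iris data is sorted by class, which makes the difference between the generators easy to see. The labels and seeds below are illustrative assumptions.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import KFold, StratifiedKFold, cross_validate
from sklearn.neighbors import KNeighborsClassifier

X, y = datasets.load_iris(return_X_y=True)
model = KNeighborsClassifier(n_neighbors=1)

folds = {
    'cv=5': 5,
    'KFold': KFold(n_splits=5),  # no shuffle: folds follow row order
    'KFold shuffled': KFold(n_splits=5, shuffle=True, random_state=0),
    'StratifiedKFold': StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
}

for name, cv in folds.items():
    scores = cross_validate(model, X, y, scoring='accuracy', cv=cv)
    print('%s: %.3f (%.3f)' % (name.ljust(16),
                               scores['test_score'].mean(),
                               scores['test_score'].std()))

# count how many observations of each class land in each test fold
kf_counts = [np.bincount(y[test], minlength=3).tolist()
             for _, test in KFold(n_splits=5).split(X)]
strat_counts = [np.bincount(y[test], minlength=3).tolist()
                for _, test in folds['StratifiedKFold'].split(X, y)]
print('KFold test folds:          ', kf_counts)
print('StratifiedKFold test folds:', strat_counts)
```

Unshuffled `KFold` produces test folds dominated by one class, while `StratifiedKFold` keeps the class proportions the same in every fold.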

# Links, resources, and next week

Only two resources needed
- sklearn docs are GREAT https://scikit-learn.org/stable/user_guide.html 
- Python Data Science Handbook (note some module calls are obsolete, so you might need to update code) https://jakevdp.github.io/PythonDataScienceHandbook/index.html

Next week:
- preprocessing
- data transformations
- feature selection
