In [None]:
#!pip install -U scikit-learn

In [None]:
import pandas as pd
import seaborn as sns
from sklearn import datasets
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder
from sklearn.linear_model import LinearRegression

# Train Test Splits

### The Problem

When building a predictive model we can run into a problem of overfitting. 

**OVERFITTING WILL BE A MAJOR TOPIC WE WILL DEAL WITH FOR AS LONG AS WE MAKE PREDICTIVE MODELS**

Overfitting is when a model does *too well* on the data we've given it, so much so that it makes the model *worse* at handling new data.

Take this data for example:

![Clearly linear scatterplot](Images/justdot.png)

Our data is pretty clearly linear. A linear model wouldn't perfectly predict every point, but it would be pretty good.

![Simple linear model](Images/propermodel.png)

Buuuuuuuut I could totally write an algorithm or function that perfectly hits every point here.

!['Better' model](Images/overfitmodel.png)


This second model will perform better on any metrics we feed into it... FOR THESE SPECIFIC DATA POINTS. 

But, we want models that make predictions. That's 99% of the point of data science. 

If we tried to feed our models new information (e.g. use the model in the real world) the 'worse' model will perform much better.

![New data comparison](Images/withnewdata.png)



So how do we know if our model is overfit? We can't magically create new data, but we can HIDE data from the model, only to reveal it later. This is the train-test split. We train the model on one set of data (the training set), then verify its performance on data it has never seen (the test set).
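To make the idea concrete, here's a rough sketch on made-up data (a noisy straight line that I'm generating here, NOT the data in the images above). The degree-12 polynomial stands in for the 'better' model that chases every point; the names `line`, `wiggly`, and `mse` are just for this illustration.

```python
import numpy as np
from numpy.polynomial import Polynomial
from sklearn.model_selection import train_test_split

# Made-up data: a straight line plus noise (an assumption for this sketch).
rng = np.random.default_rng(14)
x = np.linspace(0, 10, 40)
y = 2 * x + 1 + rng.normal(scale=2.0, size=x.size)

# Hide a quarter of the points from both models.
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=14)

def mse(model, x_vals, y_vals):
    """Mean squared error of a fitted polynomial on the given points."""
    return float(np.mean((model(x_vals) - y_vals) ** 2))

line = Polynomial.fit(x_train, y_train, deg=1)     # the simple 'worse' model
wiggly = Polynomial.fit(x_train, y_train, deg=12)  # flexible enough to chase the noise

for name, model in [("line", line), ("wiggly", wiggly)]:
    print(f"{name:>6}  train MSE: {mse(model, x_train, y_train):10.2f}"
          f"  test MSE: {mse(model, x_test, y_test):10.2f}")
```

The wiggly model can never do worse than the line on the training points, since it has strictly more freedom. The interesting comparison is the test column: that's the number that tells you whether the extra flexibility actually helped.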



#### Getting an example dataset

Let's grab a dataset and see how this works. I'm just grabbing the diabetes dataset from sklearn.

In [None]:
#first we load the dataset. load_diabetes has an option to load directly as a pandas df

diabetes_df = datasets.load_diabetes(as_frame=True, scaled = False).frame
# choosing 'as_frame = True' makes the dataset a pandas DF, setting scaled = False
# means that we get the raw data, and I'll talk about why I did that later.
# NOTE! That's not an option in the version of sklearn y'all have.

diabetes_df

Now the FIRST thing we want to do is the train-test-split. 

In [None]:
# This is the part of the lecture where I verbally explain all the options for train_test_split
train_test_split

### The options to input into train_test_split:

##### arrays
First, we MUST pass in arrays. They must be a "sequence of indexables with same length / shape". That is, they need the same number of rows, and those rows must match up, so row 1 in one array needs to map to row 1 in the second array. In this case, that means our X variables (the predictors) and our y variable (the target).

##### test_size
Next, we can choose the test_size. We can set this manually or use the default, which is 0.25 (25% of the data). If we pass an int, the test set will contain exactly that many rows; if we pass a float between 0 and 1, the test set will be that fraction of the overall dataset. The default here is fine for most use cases.

##### train_size
Or we can set the training size instead of the test size. But usually we want to use all the data we don't put into the test set, so by default we leave this untouched.

##### random_state
So, we want our tts to be randomized, otherwise we'll bias the samples and blahblahblah. Buuuuut we want to be able to open and close the notebook and have our results be repeatable. So here we usually want to pass in some arbitrary random seed, so that each time the TTS is run *on the same dataset* you end up with the same random result. Best practice is to pick a number and stick with it. Because programmers are nerds, 42 is a fairly widely used arbitrary seed. I do 14 because that's my lucky number. You can do whatever, it just forces the split to be the same each time you run it.

##### shuffle 
Shuffle is a boolean option that defaults to True. It shuffles the rows before splitting, which makes sure that you're actually grabbing data points randomly, and that's what we almost always want to do.

##### stratify
You can pass in an array to have the split try to ensure that it balances the split based on the contents of that array. Usually we don't do this, and if we do, usually we stratify by the target. 

For an example of when we might use this, if we have a classification dataset that has some very rare values in the target, we might want to force the t-t-s to make sure there's the same proportion of each possible target value in the train and the test sets. This is a problem that mostly goes away with bigger datasets thanks to the Law of Large Numbers.
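Here's a small sketch of a few of these options working together. The toy arrays `X_toy` and `y_toy` are made up for this illustration: 100 rows where only 10 belong to the rare class.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy classification data: 90 rows of class 0 and only 10 rows of class 1.
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([0] * 90 + [1] * 10)

# test_size=0.2 holds out 20 rows; stratify=y_toy keeps the 90/10 class mix in both pieces.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=14, stratify=y_toy
)
print(len(y_tr), len(y_te))     # 80 20
print(y_tr.sum(), y_te.sum())   # 8 2, so 10% rare class in each piece

# Same data + same random_state = same split, every time.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=14, stratify=y_toy
)
print(np.array_equal(X_te, X_te2))  # True
```

Without `stratify`, a split of a dataset this lopsided could easily put 0 or 5 of the rare rows in the test set just by chance.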

### Our train_test_split

##### arrays

Our arrays will be the predictors on one hand and the target on the other. In this case, we'll use diabetes_df.drop('target', axis = 1) and diabetes_df.target. (axis = 1 drops a column; axis = 0 would try to drop a row labeled 'target' and fail.)

##### test_size

We can leave this as default.

##### train_size

We can leave this as default.

##### random_state

I'm going to set this at 42 for this notebook. Any fixed number works; what matters is that it stays the same between runs.

##### shuffle

We can leave this as default.

##### stratify

We can leave this as default.


In [None]:
train_test_split(diabetes_df.drop('target', axis = 1), diabetes_df.target, random_state = 42)

# This output is strange. train_test_split returns four arrays, so we need to set it to four different variables.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(diabetes_df.drop('target', axis = 1), diabetes_df.target, random_state = 42)



In [None]:
type(X_train.index[0])

In [None]:
X_test.head()

In [None]:
y_train.head()

In [None]:
y_test.head()

Note the way the indices line up. This is why indices are important!

Note also how the indices have been randomized. 

Since our target was just a single column, y_train and y_test came out as pandas Series, while X_train and X_test came out as full DataFrames.
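If you'd rather check the index alignment than eyeball it, something like this sketch works. It reloads the dataset under the hypothetical name `df` and uses the default (scaled) version, since scaling doesn't matter for an index check.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

df = datasets.load_diabetes(as_frame=True).frame  # 442 rows
X_tr, X_te, y_tr, y_te = train_test_split(
    df.drop('target', axis=1), df['target'], random_state=42
)

# Predictors and target carry identical indices, in the same shuffled order.
print(X_tr.index.equals(y_tr.index))      # True
print(X_te.index.equals(y_te.index))      # True

# No row shows up on both sides of the split.
print(set(X_tr.index) & set(X_te.index))  # set()

# Default test_size of 0.25, rounded up for the test set.
print(len(X_tr), len(X_te))               # 331 111
```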

#### Data leaking

Splitting the data opens the door to data leakage: unless we're careful, information from the test set can 'leak' its way into our training data. This can happen in a lot of ways, and we need to keep an eye out for it.

For instance, check out [this paper](https://reproducible.cs.princeton.edu/), in which some data scientists call out other seasoned, experienced academic data scientists for data leakage mistakes. Note that, kinda like the example we gave above, fixing the data leaks often left the fancy-schmancy algos performing only about as well as the simpler models.

**We now need to take X_test and y_test and hide them away. Lock the door, hide the key**

We now want to do our pre-processing steps on X_train and y_train, and then be ready to, *once we've already built the model*, be able to reapply those steps to the test data.

So what's some preprocessing we need to do?
.

.

.

.

.

.

.

.

.Insert image of Dora the Explorer

.

.

.

.

.

.

.

.

.

.

That's right! We might want to scale this data, and we for sure need to do some one hot encoding for our categorical data.
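The pattern to keep in mind for every preprocessing step is: fit on the training data only, then apply that same fitted object to both train and test. Here's a minimal sketch with a scaler on made-up numbers (`X_toy` is an assumption for this illustration, not the diabetes data).

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(14)
X_toy = rng.normal(loc=50, scale=10, size=(200, 3))  # made-up predictors

X_tr, X_te = train_test_split(X_toy, random_state=14)

scaler = StandardScaler()
scaler.fit(X_tr)                       # learn mean and std from the training rows ONLY
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)   # reuse the training statistics; never refit on test

print(np.allclose(scaler.mean_, X_tr.mean(axis=0)))  # True: the scaler only knows the train set
print(np.allclose(X_tr_scaled.mean(axis=0), 0))      # True: train columns are centered at 0
print(X_te_scaled.mean(axis=0))                      # close to 0, but not exactly 0
```

The test columns not landing exactly on 0 is the point: the test set got scaled with numbers it had no say in, which is exactly what will happen to real-world data later.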

In [None]:
# Here we are, as Jelly would say it 'in-STAUNCH-ee-a-ting' our ohe and scaler
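# sparse_output = False gives us a regular array back.
# (This argument was called `sparse` before scikit-learn 1.2.)
ohe1 = OneHotEncoder(sparse_output = False, drop = 'first', handle_unknown = 'ignore')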
scaler1 = StandardScaler()

In [None]:
# Since for the life of me I can't track down what sex is supposed to map to, 
# I'm going with an arbitrary replacement.
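# Reassigning (instead of replace(..., inplace = True) on X_train.sex) avoids
# pandas' chained-assignment pitfalls, where the change might not stick.
X_train = X_train.assign(sex = X_train['sex'].replace({1: 'Jock', 2: 'Nerd'}))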

X_train

In [None]:
#this fits the encoder on the column we need
sex_ohe = ohe1.fit(X_train[['sex']])

# this creates a new object by putting the encoder to the column,
# then creates a DataFrame from that object, the label for the undropped column, 
# and the index from the training dataset

sex_encoded = pd.DataFrame(sex_ohe.transform(X_train[['sex']]), columns=['Nerd'], index = X_train.index)

In [None]:
sex_encoded.head()

In [None]:
# Now we need to add a column to our X_train dataset. 

X_train_encoded = pd.concat([X_train, sex_encoded['Nerd']], axis = 1)

In [None]:
# I'm calling this just to spot check that the 'Nerd' column = 1 when sex = 'Nerd' and
# Nerd = 0 when sex = 'Jock'
# I could do this more rigorously but this works for now.
X_train_encoded

In [None]:
X_train_encoded.drop(['sex'], axis = 1, inplace= True)

In [None]:
# since I know I'll need to repeat this later, I'll make a function for everything EXCEPT the fit step.

def ohe_for_diabetes(data_df, fit_encoder):
    # assign() returns a new DataFrame, so the caller's data is not modified in place
    data_df = data_df.assign(sex = data_df['sex'].replace({1: 'Jock', 2: 'Nerd'}))
    data_sex_encoded = pd.DataFrame(fit_encoder.transform(data_df[['sex']]), columns=['Nerd'], index = data_df.index)
    data_df_encoded = pd.concat([data_df, data_sex_encoded['Nerd']], axis = 1)
    # I am including this display so we can sanity check that Nerd = 1 when sex = 'Nerd'
    # and Nerd = 0 when sex = 'Jock', before the sex column gets dropped
    display(data_df_encoded)
    return data_df_encoded.drop(['sex'], axis = 1)



In [None]:
# Don't do this but I want to prove a point:

X_test_encoded = ohe_for_diabetes(data_df = X_test, fit_encoder = sex_ohe)

Whew, that was kind of a pain, but we're set up for categorical columns now. On to scaling!

### Scaling

In [None]:
X_train_encoded

In [None]:
# Back to some preprocessing
# I got lazy here, can you spot it? It won't affect the model but it's still bad practice

scaler1.fit(X_train_encoded)
X_train_scaled = pd.DataFrame(scaler1.transform(X_train_encoded), 
                              columns=X_train_encoded.columns,
                              index = X_train_encoded.index)

In [None]:
X_train_scaled

In [None]:
model = LinearRegression()

model.fit(X_train_scaled, y_train)

In [None]:
X_test_scaled = pd.DataFrame(scaler1.transform(X_test_encoded), 
                              columns=X_test_encoded.columns,
                              index = X_test_encoded.index)


In [None]:
display(model.score(X_test_scaled, y_test))

In [None]:
y_pred = model.predict(X_test_scaled)

sns.scatterplot(x = y_pred, y = y_test)

In [None]:
display(model.score(X_train_scaled, y_train))

In [None]:
cross_val_score(model, X_train_scaled, y_train)
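One more thing about that cross_val_score call: the scaler above was fit on ALL of X_train, so each cross-validation fold has already 'seen' a little of its own validation rows. A common way around that, which is NOT what this notebook does above, is to bundle the preprocessing and the model into a scikit-learn Pipeline so everything gets refit inside each fold. A rough sketch, assuming the default (scaled) diabetes data and names like `pipe` and `preprocess` that I'm making up here:

```python
from sklearn import datasets
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = datasets.load_diabetes(as_frame=True).frame
X, y = df.drop('target', axis=1), df['target']
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

numeric_cols = [col for col in X.columns if col != 'sex']
preprocess = ColumnTransformer([
    ('ohe', OneHotEncoder(drop='first'), ['sex']),  # encode the one categorical column
    ('scale', StandardScaler(), numeric_cols),      # scale everything else
])
pipe = Pipeline([('prep', preprocess), ('model', LinearRegression())])

# Each fold refits the encoder and scaler on that fold's training rows only.
scores = cross_val_score(pipe, X_tr, y_tr)  # 5 folds by default, R^2 for regressors
print(scores)

# Only at the very end do we unlock the test set.
pipe.fit(X_tr, y_tr)
print(pipe.score(X_te, y_te))
```

This also happens to avoid the lazy bit from the scaling cell, since the encoded column and the numeric columns are handled separately.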