## Train/Test Split

### Why split the data?
#### To estimate how well an ML algorithm performs when it makes predictions on 'out of sample' data
#### To detect overfitting and underfitting
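
As a rough illustration of why 'out of sample' evaluation matters, the sketch below trains a model that simply memorizes its training data (a 1-nearest-neighbor lookup) and scores it on both seen and unseen points. The synthetic data and the helper names `make_data`, `predict_1nn` and `mse` are assumptions made for this example only; they are not part of the notebook.

```python
import random

random.seed(0)

def make_data(n):
    """Noisy samples of y = 2x + noise (synthetic, for illustration only)."""
    xs = [random.uniform(0, 10) for _ in range(n)]
    return [(x, 2 * x + random.gauss(0, 2)) for x in xs]

def predict_1nn(train, x):
    """Return the y of the training point whose x is closest: pure memorization."""
    return min(train, key=lambda pair: abs(pair[0] - x))[1]

def mse(train, data):
    """Mean squared error of the 1-NN memorizer on `data`."""
    return sum((predict_1nn(train, x) - y) ** 2 for x, y in data) / len(data)

train = make_data(80)
test = make_data(20)

train_mse = mse(train, train)  # scored on data the model has already seen
test_mse = mse(train, test)    # scored on 'out of sample' data

print(f"train MSE: {train_mse:.2f}")
print(f"test MSE:  {test_mse:.2f}")
```

The training error is exactly zero because every training point is its own nearest neighbor, while the test error is not. Only the held-out score says anything about how the model handles new data.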


![Overfitting illustration](overfit.png)

------

![Train/test data split](Train-Test-Data-Split.png)

### The training dataset: used to train the algorithm
### The test dataset: used to evaluate the algorithm on 'out of sample' data



In [1]:
import pandas as pd
import numpy as np
from pydataset import data
from sklearn.model_selection import train_test_split



In [2]:
# import the mpg dataset
mpg = data('mpg')

In [3]:
mpg.head()

Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,cty,hwy,fl,class
1,audi,a4,1.8,1999,4,auto(l5),f,18,29,p,compact
2,audi,a4,1.8,1999,4,manual(m5),f,21,29,p,compact
3,audi,a4,2.0,2008,4,manual(m6),f,20,31,p,compact
4,audi,a4,2.0,2008,4,auto(av),f,21,30,p,compact
5,audi,a4,2.8,1999,6,auto(l5),f,16,26,p,compact


In [4]:
mpg.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 234 entries, 1 to 234
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   manufacturer  234 non-null    object 
 1   model         234 non-null    object 
 2   displ         234 non-null    float64
 3   year          234 non-null    int64  
 4   cyl           234 non-null    int64  
 5   trans         234 non-null    object 
 6   drv           234 non-null    object 
 7   cty           234 non-null    int64  
 8   hwy           234 non-null    int64  
 9   fl            234 non-null    object 
 10  class         234 non-null    object 
dtypes: float64(1), int64(4), object(6)
memory usage: 21.9+ KB


In [5]:
# Define the independent ('X') and dependent ('y') variables
X = mpg[['displ', 'cyl', 'trans', 'drv']]
y = mpg[['cty']]

In [6]:
mpg_shuffled = mpg.sample(frac = 1)

In [7]:
mpg_shuffled

Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,cty,hwy,fl,class
93,ford,mustang,4.0,2008,6,manual(m5),r,17,26,r,subcompact
183,toyota,camry,2.4,2008,4,auto(l5),f,21,31,r,midsize
66,dodge,ram 1500 pickup 4wd,4.7,2008,8,auto(l5),4,9,12,e,pickup
180,toyota,camry,2.2,1999,4,manual(m5),f,21,29,r,midsize
146,nissan,altima,3.5,2008,6,manual(m6),f,19,27,p,midsize
...,...,...,...,...,...,...,...,...,...,...,...
21,chevrolet,c1500 suburban 2wd,5.3,2008,8,auto(l4),r,14,20,r,suv
44,dodge,caravan 2wd,3.3,2008,6,auto(l4),f,11,17,e,minivan
77,ford,expedition 2wd,5.4,2008,8,auto(l6),r,12,18,r,suv
70,dodge,ram 1500 pickup 4wd,4.7,2008,8,manual(m6),4,9,12,e,pickup


In [8]:
# manual 80/20 split: slice the shuffled rows at the 80% mark
split_at = int(len(mpg_shuffled) * 0.8)
train_data = mpg_shuffled[:split_at]
test_data = mpg_shuffled[split_at:]
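
The manual shuffle-and-slice approach above can be sketched without pandas to show two properties worth checking: the two slices never share a row, and a fixed seed makes the split repeatable. The function `shuffle_split` and the list `row_labels` are hypothetical stand-ins (a plain list of labels 1 to 234 in place of the DataFrame rows); this is not how scikit-learn implements `train_test_split`.

```python
import random

def shuffle_split(rows, train_fraction=0.8, seed=None):
    """Shuffle a copy of `rows` and slice it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded RNG gives a repeatable order
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

row_labels = list(range(1, 235))  # stand-in for the 234 row labels of mpg

train_a, test_a = shuffle_split(row_labels, seed=123)
train_b, test_b = shuffle_split(row_labels, seed=123)

print(len(train_a), len(test_a))          # sizes of the two slices
print(train_a == train_b)                 # same seed, same split
print(set(train_a).isdisjoint(test_a))    # no row appears in both slices
```

Without a seed, every run produces a different split, so results are hard to compare between runs. That is the same role `random_state` plays in `sample` and `train_test_split`.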

In [9]:
# use sklearn's train_test_split function
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

In [10]:
# split the train dataset again into train and validate datasets
X_train, X_validate, y_train, y_validate = train_test_split(X_train, y_train, test_size = 0.2, random_state = 123)

In [11]:
# check shape of train dataset
X_train.shape, y_train.shape

((149, 4), (149, 1))

In [12]:
# check shape of test dataset
X_test.shape, y_test.shape

((47, 4), (47, 1))

In [13]:
X_validate.shape, y_validate.shape

((38, 4), (38, 1))
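
The row counts above (149, 38 and 47) can be reproduced with simple arithmetic. The sketch below assumes that a fractional `test_size` is rounded up and a fractional `train_size` is rounded down, which matches the shapes printed in this notebook; `split_sizes` is a hypothetical helper written for this example, not a scikit-learn function.

```python
from math import ceil, floor

def split_sizes(n_samples, test_size=None, train_size=None):
    """Return (n_train, n_test) for a fractional test_size or train_size."""
    if test_size is not None:
        n_test = ceil(test_size * n_samples)   # test fraction rounds up
        return n_samples - n_test, n_test
    n_train = floor(train_size * n_samples)    # train fraction rounds down
    return n_train, n_samples - n_train

n_total = 234
n_rest, n_test = split_sizes(n_total, test_size=0.2)       # first split
n_train, n_validate = split_sizes(n_rest, test_size=0.2)   # second split

for name, n in [("train", n_train), ("validate", n_validate), ("test", n_test)]:
    print(f"{name:>8}: {n:3d} rows ({n / n_total:.1%} of the data)")
```

Two nested 80/20 splits therefore leave roughly 64% of the rows for training, 16% for validation and 20% for testing. Under the same rounding assumption, `train_size=0.8` on 187 rows gives the same 149/38 division as `test_size=0.2`.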

When should you not use the train/test split above?

## TL;DR – The `train_test_split` function splits a single dataset into two subsets for two different purposes: training and testing.
### The training subset is for building your model.
### The testing subset is for evaluating the performance of the model.

In [14]:
mpg = data('mpg')

In [15]:
mpg.head()

Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,cty,hwy,fl,class
1,audi,a4,1.8,1999,4,auto(l5),f,18,29,p,compact
2,audi,a4,1.8,1999,4,manual(m5),f,21,29,p,compact
3,audi,a4,2.0,2008,4,manual(m6),f,20,31,p,compact
4,audi,a4,2.0,2008,4,auto(av),f,21,30,p,compact
5,audi,a4,2.8,1999,6,auto(l5),f,16,26,p,compact


In [16]:
mpg.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 234 entries, 1 to 234
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   manufacturer  234 non-null    object 
 1   model         234 non-null    object 
 2   displ         234 non-null    float64
 3   year          234 non-null    int64  
 4   cyl           234 non-null    int64  
 5   trans         234 non-null    object 
 6   drv           234 non-null    object 
 7   cty           234 non-null    int64  
 8   hwy           234 non-null    int64  
 9   fl            234 non-null    object 
 10  class         234 non-null    object 
dtypes: float64(1), int64(4), object(6)
memory usage: 21.9+ KB


In [17]:
mpg_shuffled = mpg.sample(frac = 1, random_state = 123)

In [18]:
mpg_shuffled

Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,cty,hwy,fl,class
92,ford,mustang,3.8,1999,6,auto(l4),r,18,25,r,subcompact
152,nissan,pathfinder 4wd,3.3,1999,6,manual(m5),4,15,17,r,suv
166,subaru,impreza awd,2.2,1999,4,auto(l4),4,21,26,r,subcompact
157,pontiac,grand prix,3.8,1999,6,auto(l4),f,17,27,r,midsize
134,land rover,range rover,4.6,1999,8,auto(l4),4,11,15,p,suv
...,...,...,...,...,...,...,...,...,...,...,...
99,ford,mustang,5.4,2008,8,manual(m6),r,14,20,p,subcompact
221,volkswagen,jetta,2.8,1999,6,manual(m5),f,17,24,r,compact
67,dodge,ram 1500 pickup 4wd,4.7,2008,8,auto(l5),4,13,17,r,pickup
127,jeep,grand cherokee 4wd,4.7,2008,8,auto(l5),4,9,12,e,suv


In [19]:
# manual split: first 200 shuffled rows for training, remaining 34 for testing
train_data = mpg_shuffled[:200]
test_data = mpg_shuffled[200:]

In [20]:
# Use the sklearn library to split the data:
# X holds every column except the target; y holds the target column 'cty'
X = mpg.drop(columns = 'cty')
y = mpg[['cty']]

In [21]:
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.2, random_state = 123)

In [22]:
X_train.shape, y_train.shape

((187, 10), (187, 1))

In [23]:
X_test.shape, y_test.shape

((47, 10), (47, 1))

In [24]:
# split the train dataset again into train and validate datasets

X_train, X_validate, y_train, y_validate = train_test_split(X_train, y_train, train_size = 0.8, random_state = 123)

In [25]:
X_train.shape, y_train.shape

((149, 10), (149, 1))

In [26]:
X_validate.shape, y_validate.shape

((38, 10), (38, 1))
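
The notebook stops after creating the three subsets. One common way to use them, sketched here as an assumption rather than as the author's workflow, is to fit on train, compare candidate settings on validate, and touch test only once at the end. The synthetic data, the k-nearest-neighbor model and the names `make_data`, `predict_knn` and `mse` are all illustrative choices made for this example.

```python
import random

random.seed(1)

def make_data(n):
    """Noisy samples of y = 2x + noise (synthetic, for illustration only)."""
    xs = [random.uniform(0, 10) for _ in range(n)]
    return [(x, 2 * x + random.gauss(0, 2)) for x in xs]

def predict_knn(train, x, k):
    """Average the y values of the k training points closest to x."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def mse(train, data, k):
    """Mean squared error of the k-NN model on `data`."""
    return sum((predict_knn(train, x, k) - y) ** 2 for x, y in data) / len(data)

# same sizes as the notebook's train, validate and test subsets
train, validate, test = make_data(149), make_data(38), make_data(47)

# choose the setting using the validate subset only
candidates = [1, 3, 5, 9, 15]
validate_mse = {k: mse(train, validate, k) for k in candidates}
best_k = min(validate_mse, key=validate_mse.get)

# report the final score on the untouched test subset
test_mse = mse(train, test, best_k)
print(f"best k on validate: {best_k}")
print(f"test MSE with k={best_k}: {test_mse:.2f}")
```

Because the test subset plays no part in choosing `k`, its score stays an honest estimate of 'out of sample' performance.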