## Train/Test Split

### Why split the data?
#### Estimate how well an ML algorithm performs when it makes predictions on 'out of sample' data
#### Detect overfitting and underfitting


![Overfitting illustration](overfit.png "Overfitting")

------

![Train/test data split](Train-Test-Data-Split.png "Train/test data split")

### The training dataset: used to train the algorithm
### The test dataset: used to test the algorithm on 'out of sample' data
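
To make the two roles concrete, here is a minimal sketch that fits polynomials of increasing degree to a training subset and scores them on a held-out test subset. The noisy sine data, the every-fourth-point holdout, and the `mse` helper are all assumptions made for illustration; none of them come from the mpg example in this notebook.

```python
import numpy as np

# Synthetic data (an assumption for illustration): a noisy sine curve.
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 40))
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, size=x.size)

# Hold out every fourth point as the test set; train on the rest.
test_mask = np.arange(x.size) % 4 == 0
x_train, y_train = x[~test_mask], y[~test_mask]
x_test, y_test = x[test_mask], y[test_mask]


def mse(coeffs, xs, ys):
    """Mean squared error of a fitted polynomial on (xs, ys)."""
    return float(np.mean((np.polyval(coeffs, xs) - ys) ** 2))


# Fit on the training subset only, then score on both subsets.
results = {}
for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    results[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_test, y_test))
    print(f"degree={degree}  train MSE={results[degree][0]:.3f}  test MSE={results[degree][1]:.3f}")
```

The training error can only shrink as the polynomial degree grows, so it cannot reveal overfitting by itself. The test error is measured on points the fit never saw, so comparing the two columns is what indicates whether a model is too simple or too flexible.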



In [25]:
import pandas as pd
import numpy as np
from pydataset import data
from sklearn.model_selection import train_test_split



In [26]:
# Import the mpg dataset
mpg = data('mpg')

In [27]:
mpg.head()

Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,cty,hwy,fl,class
1,audi,a4,1.8,1999,4,auto(l5),f,18,29,p,compact
2,audi,a4,1.8,1999,4,manual(m5),f,21,29,p,compact
3,audi,a4,2.0,2008,4,manual(m6),f,20,31,p,compact
4,audi,a4,2.0,2008,4,auto(av),f,21,30,p,compact
5,audi,a4,2.8,1999,6,auto(l5),f,16,26,p,compact


In [28]:
mpg.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 234 entries, 1 to 234
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   manufacturer  234 non-null    object 
 1   model         234 non-null    object 
 2   displ         234 non-null    float64
 3   year          234 non-null    int64  
 4   cyl           234 non-null    int64  
 5   trans         234 non-null    object 
 6   drv           234 non-null    object 
 7   cty           234 non-null    int64  
 8   hwy           234 non-null    int64  
 9   fl            234 non-null    object 
 10  class         234 non-null    object 
dtypes: float64(1), int64(4), object(6)
memory usage: 21.9+ KB


In [29]:
# Define the independent ('X') and dependent ('y') variables
X = mpg[['displ', 'cyl', 'trans', 'drv']]
y = mpg[['cty']]

In [69]:
# Shuffle all rows: frac=1 samples 100% of the data without replacement
mpg_shuffled = mpg.sample(frac=1)

In [70]:
mpg_shuffled

Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,cty,hwy,fl,class
29,chevrolet,k1500 tahoe 4wd,5.3,2008,8,auto(l4),4,14,19,r,suv
210,volkswagen,gti,2.0,2008,4,manual(m6),f,21,29,p,compact
65,dodge,ram 1500 pickup 4wd,4.7,2008,8,manual(m6),4,12,16,r,pickup
127,jeep,grand cherokee 4wd,4.7,2008,8,auto(l5),4,9,12,e,suv
166,subaru,impreza awd,2.2,1999,4,auto(l4),4,21,26,r,subcompact
...,...,...,...,...,...,...,...,...,...,...,...
51,dodge,dakota pickup 4wd,3.9,1999,6,auto(l4),4,13,17,r,pickup
136,lincoln,navigator 2wd,5.4,1999,8,auto(l4),r,11,16,p,suv
158,pontiac,grand prix,3.8,2008,6,auto(l4),f,18,28,r,midsize
231,volkswagen,passat,2.0,2008,4,manual(m6),f,21,29,p,midsize


In [71]:
# Slice the shuffled rows: first 80% for training, remaining 20% for testing
split_index = int(len(mpg_shuffled) * 0.8)
train_data = mpg_shuffled[:split_index]
test_data = mpg_shuffled[split_index:]

In [72]:
train_data

Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,cty,hwy,fl,class
29,chevrolet,k1500 tahoe 4wd,5.3,2008,8,auto(l4),4,14,19,r,suv
210,volkswagen,gti,2.0,2008,4,manual(m6),f,21,29,p,compact
65,dodge,ram 1500 pickup 4wd,4.7,2008,8,manual(m6),4,12,16,r,pickup
127,jeep,grand cherokee 4wd,4.7,2008,8,auto(l5),4,9,12,e,suv
166,subaru,impreza awd,2.2,1999,4,auto(l4),4,21,26,r,subcompact
...,...,...,...,...,...,...,...,...,...,...,...
67,dodge,ram 1500 pickup 4wd,4.7,2008,8,auto(l5),4,13,17,r,pickup
9,audi,a4 quattro,1.8,1999,4,auto(l5),4,16,25,p,compact
165,subaru,forester awd,2.5,2008,4,auto(l4),4,18,23,p,suv
140,mercury,mountaineer 4wd,4.6,2008,8,auto(l6),4,13,19,r,suv


In [73]:
test_data

Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,cty,hwy,fl,class
8,audi,a4 quattro,1.8,1999,4,manual(m5),4,18,26,p,compact
20,chevrolet,c1500 suburban 2wd,5.3,2008,8,auto(l4),r,11,15,e,suv
167,subaru,impreza awd,2.2,1999,4,manual(m5),4,19,26,r,subcompact
120,hyundai,tiburon,2.7,2008,6,auto(l4),f,17,24,r,subcompact
139,mercury,mountaineer 4wd,4.0,2008,6,auto(l5),4,13,19,r,suv
45,dodge,caravan 2wd,3.8,1999,6,auto(l4),f,15,22,r,minivan
90,ford,f150 pickup 4wd,5.4,2008,8,auto(l4),4,13,17,r,pickup
37,chevrolet,malibu,3.6,2008,6,auto(s6),f,17,26,r,midsize
74,dodge,ram 1500 pickup 4wd,5.9,1999,8,auto(l4),4,11,15,r,pickup
96,ford,mustang,4.6,1999,8,manual(m5),r,15,22,r,subcompact
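
The manual shuffle-and-slice steps above can be wrapped into a small reusable function. This is only a sketch: `manual_train_test_split` is a hypothetical helper name, the `seed` argument is an addition that makes the shuffle repeatable, and the toy frame stands in for `mpg` with made-up values.

```python
import pandas as pd


def manual_train_test_split(df, train_frac=0.8, seed=123):
    """Shuffle df reproducibly, then slice it into train and test parts."""
    shuffled = df.sample(frac=1, random_state=seed)
    cutoff = int(len(shuffled) * train_frac)
    return shuffled.iloc[:cutoff], shuffled.iloc[cutoff:]


# Toy frame standing in for mpg (values are made up for illustration).
toy = pd.DataFrame({
    "displ": [1.8, 2.0, 2.8, 3.1, 4.7, 5.3, 2.2, 3.8, 4.6, 5.4],
    "cty": [18, 21, 16, 15, 12, 14, 21, 18, 13, 11],
})

train_part, test_part = manual_train_test_split(toy)
print(len(train_part), len(test_part))  # 8 2
```

Computing the cutoff once and using it for both slices guarantees that every row lands in exactly one of the two parts.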


In [77]:
# Use sklearn's train_test_split function:
# test_size=0.2 holds out 20% of the rows; random_state makes the shuffle reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)
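
The effect of `random_state` can be checked with a quick sketch. The toy arrays `X_toy` and `y_toy` are assumptions standing in for `X` and `y`; only `train_test_split` and its `test_size` and `random_state` parameters come from the cell above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy arrays standing in for X and y (assumed data, not the mpg dataset).
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# The same random_state gives an identical split on every call.
a_train, a_test, _, _ = train_test_split(X_toy, y_toy, test_size=0.2, random_state=123)
b_train, b_test, _, _ = train_test_split(X_toy, y_toy, test_size=0.2, random_state=123)
same_split = np.array_equal(a_train, b_train) and np.array_equal(a_test, b_test)

# Without random_state, the shuffle depends on NumPy's global random state,
# so the selected rows can differ from run to run (the sizes stay the same).
c_train, c_test, _, _ = train_test_split(X_toy, y_toy, test_size=0.2)

print(same_split, a_test.shape, c_test.shape)
```

Fixing `random_state` is mainly useful when results need to be compared across runs or shared with others.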

In [75]:
# Check the shape of the training set
X_train.shape, y_train.shape

((187, 4), (187, 1))

In [66]:
# Check the shape of the test set
X_test.shape, y_test.shape

((47, 4), (47, 1))
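
The 187/47 split follows from simple arithmetic on the 234 rows. The sketch below assumes, based on my understanding of scikit-learn's behavior for a float `test_size`, that the test count is rounded up and the training set receives the remainder; the variable names are illustrative.

```python
import math

n_samples = 234  # rows in mpg, per the mpg.info() output
test_size = 0.2

# Assumed scikit-learn rule: round the test count up, give the rest to train.
n_test = math.ceil(test_size * n_samples)  # 46.8 rounds up to 47
n_train = n_samples - n_test
print(n_train, n_test)  # 187 47

# The manual slice used earlier: int() truncates 234 * 0.8 = 187.2 to 187.
manual_train = int(n_samples * 0.8)
manual_test = n_samples - manual_train
print(manual_train, manual_test)  # 187 47
```

Both approaches arrive at the same sizes here, although they round in different directions and could differ by one row on other datasets.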

When should you not use the train/test split above?

## TL;DR – The train_test_split function splits a single dataset into two subsets with different purposes: training and testing.
### The training subset is for building your model.
### The testing subset is for evaluating the performance of the model.