## Train-Test Split

### Why split the data?
#### Estimate the performance of ML algorithms when they are used to make predictions on 'out of sample' data
#### Avoid overfitting and underfitting
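
The gap between in-sample and out-of-sample performance can be sketched as follows. This is a minimal illustration, not part of the lesson's data: it assumes synthetic data from `make_regression` and an unconstrained `DecisionTreeRegressor`, both chosen here only because such a tree tends to memorize its training rows.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic, noisy regression data (illustrative only; not the mpg data).
X, y = make_regression(n_samples=200, n_features=5, noise=25.0, random_state=123)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123
)

# An unconstrained tree can memorize the training rows, noise included.
tree = DecisionTreeRegressor(random_state=123).fit(X_train, y_train)

train_r2 = tree.score(X_train, y_train)  # in-sample R^2
test_r2 = tree.score(X_test, y_test)     # out-of-sample R^2

print(f"train R^2: {train_r2:.3f}")
print(f"test  R^2: {test_r2:.3f}")
```

A training score well above the test score is the usual symptom of overfitting. Without a held-out test set, only the optimistic training score would be visible.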


![alt text](overfit.png "Title")

------

![alt text](Train-Test-Data-Split.png "Title")

In [27]:
import pandas as pd
import numpy as np
from pydataset import data
from sklearn.model_selection import train_test_split

In [2]:
# import mpg dataset
mpg = data('mpg')

In [3]:
mpg.head()

Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,cty,hwy,fl,class
1,audi,a4,1.8,1999,4,auto(l5),f,18,29,p,compact
2,audi,a4,1.8,1999,4,manual(m5),f,21,29,p,compact
3,audi,a4,2.0,2008,4,manual(m6),f,20,31,p,compact
4,audi,a4,2.0,2008,4,auto(av),f,21,30,p,compact
5,audi,a4,2.8,1999,6,auto(l5),f,16,26,p,compact


In [4]:
mpg.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 234 entries, 1 to 234
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   manufacturer  234 non-null    object 
 1   model         234 non-null    object 
 2   displ         234 non-null    float64
 3   year          234 non-null    int64  
 4   cyl           234 non-null    int64  
 5   trans         234 non-null    object 
 6   drv           234 non-null    object 
 7   cty           234 non-null    int64  
 8   hwy           234 non-null    int64  
 9   fl            234 non-null    object 
 10  class         234 non-null    object 
dtypes: float64(1), int64(4), object(6)
memory usage: 21.9+ KB


In [5]:
# Define the independent ('X') and dependent ('y') variables
X = mpg[['displ', 'cyl', 'trans', 'drv']]
y = mpg[['cty']]

In [6]:
mpg_shuffled = mpg.sample(frac = 1)

In [7]:
mpg_shuffled

Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,cty,hwy,fl,class
8,audi,a4 quattro,1.8,1999,4,manual(m5),4,18,26,p,compact
231,volkswagen,passat,2.0,2008,4,manual(m6),f,21,29,p,midsize
81,ford,explorer 4wd,4.0,2008,6,auto(l5),4,13,19,r,suv
70,dodge,ram 1500 pickup 4wd,4.7,2008,8,manual(m6),4,9,12,e,pickup
164,subaru,forester awd,2.5,2008,4,auto(l4),4,20,26,r,suv
...,...,...,...,...,...,...,...,...,...,...,...
47,dodge,caravan 2wd,3.8,2008,6,auto(l6),f,16,23,r,minivan
223,volkswagen,new beetle,1.9,1999,4,auto(l4),f,29,41,d,subcompact
143,nissan,altima,2.4,1999,4,auto(l4),f,19,27,r,compact
111,hyundai,sonata,2.4,2008,4,auto(l4),f,21,30,r,midsize


In [8]:
# cut the shuffled rows once: first 80% for training, the rest for testing
split_at = int(len(mpg_shuffled) * 0.8)
train_data = mpg_shuffled[:split_at]
test_data = mpg_shuffled[split_at:]
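
The manual shuffle-and-slice approach can be wrapped in a small function. This is a sketch under stated assumptions: `manual_split` is a hypothetical helper name, and `demo` is a made-up stand-in frame with the same number of rows as `mpg`, so the block runs without `pydataset`.

```python
import numpy as np
import pandas as pd


def manual_split(df, train_frac=0.8, seed=123):
    """Shuffle the rows of df, then cut once at train_frac (hypothetical helper)."""
    shuffled = df.sample(frac=1, random_state=seed)
    cut = int(len(shuffled) * train_frac)
    return shuffled.iloc[:cut], shuffled.iloc[cut:]


# Stand-in frame with 234 rows, the same length as mpg (values are made up).
demo = pd.DataFrame({"displ": np.linspace(1.6, 7.0, 234), "cty": np.arange(234)})

train_data, test_data = manual_split(demo)
print(train_data.shape, test_data.shape)

# Every row lands in exactly one of the two subsets.
overlap = train_data.index.intersection(test_data.index)
print("rows in both subsets:", len(overlap))
```

Passing a fixed `seed` makes the shuffle repeatable, which the unseeded `mpg.sample(frac = 1)` call is not.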

In [9]:
# use sklearn train test split function
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

In [10]:
X_train, X_validate, y_train, y_validate = train_test_split(X_train, y_train, test_size = 0.2, random_state = 123)

In [11]:
# check shape of train dataset
X_train.shape, y_train.shape

((149, 4), (149, 1))

In [12]:
# check shape of test dataset
X_test.shape, y_test.shape

((47, 4), (47, 1))

In [13]:
X_validate.shape, y_validate.shape

((38, 4), (38, 1))
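
The row counts above (149, 38, and 47) follow from simple arithmetic. The sketch below assumes that a fractional `test_size` is rounded up to a whole number of rows and that the remainder goes to the training side. That assumption is consistent with the shapes printed above, but `split_sizes` is a hypothetical helper, not a scikit-learn function.

```python
import math


def split_sizes(n_samples, test_size):
    """Row counts for a fractional test_size, rounding the test share up."""
    n_test = math.ceil(n_samples * test_size)
    return n_samples - n_test, n_test


# First split: (train + validate) vs. test on the 234 mpg rows.
n_train_full, n_test = split_sizes(234, 0.2)

# Second split: train vs. validate on what is left.
n_train, n_validate = split_sizes(n_train_full, 0.2)

print(n_train, n_validate, n_test)
```

Because the second split takes 20% of the remaining 187 rows rather than of the original 234, the final proportions are roughly 64% train, 16% validate, and 20% test.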

When should you not use the train/test split above?

## TL;DR – The train_test_split function is for splitting a single dataset for two different purposes: training and testing. 
### The training subset is for building your model.
### The testing subset is to evaluate the performance of the model.

In [14]:
mpg = data('mpg')

In [15]:
mpg.head()

Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,cty,hwy,fl,class
1,audi,a4,1.8,1999,4,auto(l5),f,18,29,p,compact
2,audi,a4,1.8,1999,4,manual(m5),f,21,29,p,compact
3,audi,a4,2.0,2008,4,manual(m6),f,20,31,p,compact
4,audi,a4,2.0,2008,4,auto(av),f,21,30,p,compact
5,audi,a4,2.8,1999,6,auto(l5),f,16,26,p,compact


In [16]:
mpg.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 234 entries, 1 to 234
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   manufacturer  234 non-null    object 
 1   model         234 non-null    object 
 2   displ         234 non-null    float64
 3   year          234 non-null    int64  
 4   cyl           234 non-null    int64  
 5   trans         234 non-null    object 
 6   drv           234 non-null    object 
 7   cty           234 non-null    int64  
 8   hwy           234 non-null    int64  
 9   fl            234 non-null    object 
 10  class         234 non-null    object 
dtypes: float64(1), int64(4), object(6)
memory usage: 21.9+ KB


In [17]:
mpg_shuffled = mpg.sample(frac = 1, random_state = 123)

In [18]:
mpg_shuffled

Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,cty,hwy,fl,class
92,ford,mustang,3.8,1999,6,auto(l4),r,18,25,r,subcompact
152,nissan,pathfinder 4wd,3.3,1999,6,manual(m5),4,15,17,r,suv
166,subaru,impreza awd,2.2,1999,4,auto(l4),4,21,26,r,subcompact
157,pontiac,grand prix,3.8,1999,6,auto(l4),f,17,27,r,midsize
134,land rover,range rover,4.6,1999,8,auto(l4),4,11,15,p,suv
...,...,...,...,...,...,...,...,...,...,...,...
99,ford,mustang,5.4,2008,8,manual(m6),r,14,20,p,subcompact
221,volkswagen,jetta,2.8,1999,6,manual(m5),f,17,24,r,compact
67,dodge,ram 1500 pickup 4wd,4.7,2008,8,auto(l5),4,13,17,r,pickup
127,jeep,grand cherokee 4wd,4.7,2008,8,auto(l5),4,9,12,e,suv


In [19]:
# manual split by rows: first 200 shuffled rows for training, the rest for testing
train = mpg_shuffled[:200]
test = mpg_shuffled[200:]

In [20]:
# Use sklearn library to split the data:

X = mpg.drop(columns = 'cty')
y = mpg[['cty']]

In [21]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 123)

In [22]:
X_train.shape, y_train.shape

((187, 10), (187, 1))

In [23]:
X_test.shape, y_test.shape

((47, 10), (47, 1))

In [24]:
# split the train dataset again into train and validate dataset

X_train, X_validate, y_train, y_validate = train_test_split(X_train, y_train, train_size = 0.8, random_state = 123)

In [25]:
X_train.shape, y_train.shape

((149, 10), (149, 1))

In [26]:
X_validate.shape, y_validate.shape

((38, 10), (38, 1))
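
The two chained calls above can be folded into one function. This is a minimal sketch: `train_validate_test_split` is a hypothetical helper name, the second split uses `test_size=0.2` (which gives the same row counts here as `train_size = 0.8`), and `demo` is a made-up frame standing in for `mpg`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def train_validate_test_split(df, target, seed=123):
    """Hypothetical helper: two chained 80/20 splits on features and target."""
    X = df.drop(columns=target)
    y = df[[target]]
    # Hold out the test set first so it is never touched during model selection.
    X_rest, X_test, y_rest, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    # Split the remainder into train and validate.
    X_train, X_validate, y_train, y_validate = train_test_split(
        X_rest, y_rest, test_size=0.2, random_state=seed
    )
    return X_train, X_validate, X_test, y_train, y_validate, y_test


# Made-up data with 234 rows, the same length as mpg.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "displ": rng.uniform(1.6, 7.0, size=234),
    "cyl": rng.choice([4, 6, 8], size=234),
    "cty": rng.integers(9, 36, size=234),
})

X_train, X_validate, X_test, y_train, y_validate, y_test = train_validate_test_split(
    demo, target="cty"
)
print(X_train.shape, X_validate.shape, X_test.shape)
```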

In [32]:
tips = data('tips')

In [33]:
tips.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
1,16.99,1.01,Female,No,Sun,Dinner,2
2,10.34,1.66,Male,No,Sun,Dinner,3
3,21.01,3.5,Male,No,Sun,Dinner,3
4,23.68,3.31,Male,No,Sun,Dinner,2
5,24.59,3.61,Female,No,Sun,Dinner,4


In [34]:
tips.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 244 entries, 1 to 244
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   total_bill  244 non-null    float64
 1   tip         244 non-null    float64
 2   sex         244 non-null    object 
 3   smoker      244 non-null    object 
 4   day         244 non-null    object 
 5   time        244 non-null    object 
 6   size        244 non-null    int64  
dtypes: float64(2), int64(1), object(4)
memory usage: 15.2+ KB


In [36]:
tips_shuffled = tips.sample(frac = 1)

In [37]:
tips_shuffled

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
146,8.35,1.50,Female,No,Thur,Lunch,2
216,12.90,1.10,Female,Yes,Sat,Dinner,2
131,19.08,1.50,Male,No,Thur,Lunch,2
193,28.44,2.56,Male,Yes,Thur,Lunch,2
144,27.05,5.00,Female,No,Thur,Lunch,6
...,...,...,...,...,...,...,...
11,10.27,1.71,Male,No,Sun,Dinner,2
239,35.83,4.67,Female,No,Sat,Dinner,3
128,14.52,2.00,Female,No,Thur,Lunch,2
226,16.27,2.50,Female,Yes,Fri,Lunch,2


Why and when to shuffle/randomize observations?
    - To avoid introducing bias due to the order of measurement
    - By default, it is good practice to randomize, except in special cases
When not to shuffle/randomize observations?
    - Time series (more in the time series lessons)
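
For ordered data such as a time series, a split that keeps row order can be sketched with the `shuffle=False` option of `train_test_split`. The daily series below is made up for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical daily series; the dates and values are made up.
ts = pd.DataFrame(
    {"value": range(10)},
    index=pd.date_range("2024-01-01", periods=10, freq="D"),
)

# shuffle=False keeps row order: the earliest 80% train, the latest 20% test.
train, test = train_test_split(ts, test_size=0.2, shuffle=False)

print("last training date:", train.index.max().date())
print("first test date:   ", test.index.min().date())
```

Every training row precedes every test row, so the model is never evaluated on dates earlier than the ones it learned from.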

In [38]:
X = tips_shuffled[['total_bill', 'size']]
y = tips_shuffled[['tip']]

In [39]:
X_train = X[:200]
y_train = y[:200]

X_test = X[200:]
y_test = y[200:]

In [41]:
X_train.shape, y_train.shape

((200, 2), (200, 1))

In [42]:
X_test.shape, y_test.shape

((44, 2), (44, 1))

In [43]:
tips

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
1,16.99,1.01,Female,No,Sun,Dinner,2
2,10.34,1.66,Male,No,Sun,Dinner,3
3,21.01,3.50,Male,No,Sun,Dinner,3
4,23.68,3.31,Male,No,Sun,Dinner,2
5,24.59,3.61,Female,No,Sun,Dinner,4
...,...,...,...,...,...,...,...
240,29.03,5.92,Male,No,Sat,Dinner,3
241,27.18,2.00,Female,Yes,Sat,Dinner,2
242,22.67,2.00,Male,Yes,Sat,Dinner,2
243,17.82,1.75,Male,No,Sat,Dinner,2


In [44]:
X = tips[['total_bill', 'size']]
y = tips[['tip']]

In [45]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 123)

In [46]:
X_train.shape, y_train.shape

((195, 2), (195, 1))

In [47]:
X_test.shape, y_test.shape

((49, 2), (49, 1))

In [48]:
train, test = train_test_split(tips, test_size = 0.2, random_state = 123)

Caution: Use the test data only once, to check your best model.  
To find the best model or tune hyperparameters, use validation or cross-validation.
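
One way to follow this advice is to run cross-validation on the training rows and score the test set a single time at the end. The sketch below assumes synthetic data from `make_regression` and a plain `LinearRegression`; both are stand-ins chosen for illustration, not the lesson's model.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic data with 244 rows, the same length as tips (values are made up).
X, y = make_regression(n_samples=244, n_features=2, noise=10.0, random_state=123)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123
)

# Model selection: 5-fold cross-validation on the training rows only.
model = LinearRegression()
cv_scores = cross_val_score(model, X_train, y_train, cv=5)
print("mean CV R^2:", round(cv_scores.mean(), 3))

# Final check: fit on all training rows, then score the test set once.
final_r2 = model.fit(X_train, y_train).score(X_test, y_test)
print("test R^2:", round(final_r2, 3))
```

Comparing candidate models on `cv_scores` keeps the test set untouched until the final line.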

In [53]:
X_train, X_validate, y_train, y_validate = train_test_split(X_train, y_train, random_state = 123, test_size = 0.2)

In [54]:
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
# create dataset
X, y = make_blobs(n_samples=100)
# split into train test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# summarize first 5 rows
print(X_train[:5, :])
# split again, and we should see the same split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# summarize first 5 rows
print(X_train[:5, :])

[[ 8.27071406  2.76801556]
 [ 7.08789543  0.82942164]
 [ 7.42136722  1.5508787 ]
 [ 2.35602981 -4.10446947]
 [ 0.21283006 -4.40158273]]
[[ 8.27071406  2.76801556]
 [ 7.08789543  0.82942164]
 [ 7.42136722  1.5508787 ]
 [ 2.35602981 -4.10446947]
 [ 0.21283006 -4.40158273]]
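
The same check can be done without comparing printed rows by eye. This sketch adds a third split with a different seed (`random_state=2`, an arbitrary choice) for contrast. It also seeds `make_blobs` so the data itself is repeatable, which the cell above does not do.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split

# Seed the data generation too, so the whole example is repeatable.
X, y = make_blobs(n_samples=100, random_state=0)

# Passing a single array returns just its train and test parts.
first, _ = train_test_split(X, test_size=0.33, random_state=1)
second, _ = train_test_split(X, test_size=0.33, random_state=1)
other, _ = train_test_split(X, test_size=0.33, random_state=2)

same_seed_matches = np.array_equal(first, second)
other_seed_matches = np.array_equal(first, other)

print("same random_state, same split:     ", same_seed_matches)
print("different random_state, same split:", other_seed_matches)
```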
