# Train Test Split

## Import the relevant libraries

#In this lesson we will explore the train_test_split function
#Therefore we need nothing more than that function and NumPy

In [193]:
import numpy as np
from sklearn.model_selection import train_test_split

#Underfitting means the model has not captured the underlying logic of the data, so it lacks predictive power. Underfitted models are clumsy and have low accuracy: either there are no relationships to be found, or we need a different model. An underfitted model has low train accuracy and low test accuracy.

Underfitting is easy to spot: the model has almost no accuracy whatsoever. Overfitting is much harder to spot, because the accuracy of the model seems outstanding.
Overfitting means our regression has focused on the particular data set so much that it has missed the point. An overfitted model fits the data so closely that it passes through, or very near, each observation. The problem is that it also captures the random noise in the data. An overfitted model has high train accuracy and low test accuracy.

One way to deal with overfitting is to split our initial data set into two parts, training and test. Splits such as 90% training and 10% test, or 80/20, are common. We create the regression on the training data. Once we have the coefficients, we assess the model's accuracy on the test data. The whole point is that the model has never seen the test data set, so it cannot overfit on it. We are trying to avoid the scenario where the model learns to predict the training data very well but fails miserably when given new samples.
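#The workflow described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the synthetic data (y = 3x + 2 plus noise), the variable names, and the use of LinearRegression are not part of the lesson, and test_size and random_state are train_test_split arguments chosen here for convenience.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for illustration): y = 3x + 2 plus Gaussian noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(200, 1))
y = 3 * x[:, 0] + 2 + rng.normal(0, 1, size=200)

# Hold out 20% of the observations; the model never sees them while fitting
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42
)

# Create the regression on the training data only
reg = LinearRegression().fit(x_train, y_train)

# Compare R-squared on the data used for fitting with R-squared on unseen data
train_r2 = reg.score(x_train, y_train)
test_r2 = reg.score(x_test, y_test)
print(train_r2, test_r2)
```

#A large gap between the two scores (high on train, low on test) would be the sign of overfitting described above. With a plain linear model on linear data, the two scores should be close.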



## Generate some data we are going to split

#Let's generate a new ndarray 'a' which will contain all integers from 1 to 100
#The function np.arange works like the built-in 'range', with the difference that it returns a NumPy array instead of a range object. The stop value 101 is excluded, so the array contains the values from 1 to 100 inclusive.


In [194]:
a = np.arange(1,101)

#Let's check it out. It is very important that the values inside are in order, so that we can follow what the splitting process has done.

In [195]:
a

array([  1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,  13,
        14,  15,  16,  17,  18,  19,  20,  21,  22,  23,  24,  25,  26,
        27,  28,  29,  30,  31,  32,  33,  34,  35,  36,  37,  38,  39,
        40,  41,  42,  43,  44,  45,  46,  47,  48,  49,  50,  51,  52,
        53,  54,  55,  56,  57,  58,  59,  60,  61,  62,  63,  64,  65,
        66,  67,  68,  69,  70,  71,  72,  73,  74,  75,  76,  77,  78,
        79,  80,  81,  82,  83,  84,  85,  86,  87,  88,  89,  90,  91,
        92,  93,  94,  95,  96,  97,  98,  99, 100])

#Similarly, let's create another ndarray 'b', which will contain integers from 501 to 600
#We have intentionally picked these numbers so we can easily compare the two:
#the difference between any two corresponding elements of the arrays is 500.

In [196]:
b = np.arange(501,601)
b

array([501, 502, 503, 504, 505, 506, 507, 508, 509, 510, 511, 512, 513,
       514, 515, 516, 517, 518, 519, 520, 521, 522, 523, 524, 525, 526,
       527, 528, 529, 530, 531, 532, 533, 534, 535, 536, 537, 538, 539,
       540, 541, 542, 543, 544, 545, 546, 547, 548, 549, 550, 551, 552,
       553, 554, 555, 556, 557, 558, 559, 560, 561, 562, 563, 564, 565,
       566, 567, 568, 569, 570, 571, 572, 573, 574, 575, 576, 577, 578,
       579, 580, 581, 582, 583, 584, 585, 586, 587, 588, 589, 590, 591,
       592, 593, 594, 595, 596, 597, 598, 599, 600])

## Split the data

#train_test_split works by taking an array and splitting it into two arrays. Let's check out how this works.

In [197]:
train_test_split(a)

[array([ 33,   8,   5,  92,  31,  85,  29,  60,  12,  80,  99,   4,  17,
         81,  25,  63,   7,  11,  18,   3,  32,  16,  21,  55,  67,  40,
        100,  34,  14,  78,  50,  89,  57,  42,  28,  58,  51,  53,  37,
         86,  54,  98,  22,  19,  74,  20,  82,   2,  24,  44,  87,  56,
          9,  93,  70,  10,  68,  65,  23,  30,  13,  48,  47,  27,  91,
         36,  88,  43,  83,  59,  75,  94,  66,  73,  49]),
 array([45, 71, 96, 84,  1, 90, 77, 46, 38,  6, 26, 39, 72, 69, 79, 35, 76,
        97, 95, 61, 64, 41, 62, 52, 15])]

#We would prefer to store the result in dedicated variables, so we write a_train, a_test = train_test_split(a). The first array is always the training array and the second is the testing one.

In [198]:
a_train, a_test = train_test_split(a) 

#First, let's check the shapes of the two variables. a_train has a length of 75 while a_test has a length of 25, which means the default split is 75/25. Both arrays are also shuffled by train_test_split: the original values were ordered from 1 to 100, but now they appear in random order.

In [199]:
a_train.shape, a_test.shape

((75,), (25,))

In [200]:
a_train

array([ 63,  21,  46,  12,  68,  13,  71,  26,  96,  75,  99,  18,  72,
        91,  38,  36,   1,  77,  57,  55,  30,   3,   7,  88,  39,  60,
        74,  95,  27,  97,  28,  11,  19,  79,  25,   6,  33,  93,  78,
        14,  29,  50,   4, 100,  66,  58,  48,  15,  84,  41,  87,  65,
        34,  73,  37,  85,  17,  45,  52,  81,  42,  62,  35,  69,  82,
        70,  31,  54,  98,  92,   9,  51,  94,  23,  44])

In [201]:
a_test

array([24, 86, 76,  8, 47, 16, 83, 53, 64, 22, 90, 56, 32, 43, 49, 40, 59,
       61, 20, 89,  2, 10,  5, 67, 80])
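#As a quick sanity check, the sketch below confirms that the default split only partitions 'a': nothing is lost and nothing appears in both arrays. The helper names recombined and overlap are introduced here for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

a = np.arange(1, 101)
a_train, a_test = train_test_split(a)

# With the default settings, 25% of the observations go to the test array
print(len(a_train), len(a_test))

# Putting the two pieces back together and sorting recovers the original array
recombined = np.sort(np.concatenate([a_train, a_test]))
print(np.array_equal(recombined, a))

# No value appears in both the training and the testing array
overlap = np.intersect1d(a_train, a_test)
print(overlap.size)
```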

#A 75/25 split is the default setting of train_test_split, but it might dedicate too much data to testing. To change this, we can use an argument called test_size and set it to a float between zero and one. To achieve an 80/20 split, we simply include test_size = 0.2.

In [202]:
a_train, a_test = train_test_split(a, test_size = 0.2) 

In [203]:
a_train.shape, a_test.shape

((80,), (20,))

In [204]:
a_train

array([52, 10, 44, 19, 75,  1, 69, 78, 50, 22, 84, 95, 62, 73, 63, 39,  8,
       24, 80, 97,  3, 83, 26, 54, 87, 68, 23, 20, 47,  6, 51, 18, 71, 72,
       42, 56, 12, 92, 74, 40, 17, 96, 28, 13,  4, 61, 48, 31, 91, 16, 99,
       94, 82, 30, 86, 34, 76, 29, 32, 43, 98, 53, 41, 37, 35, 14, 65, 25,
       79, 46, 55, 88,  5, 89, 77, 11, 58,  7, 59, 67])

In [205]:
a_test

array([100,  27,  36,  85,  90,  33,  60,  21,  93,  38,  45,  70,  64,
        49,  66,  81,   9,  57,  15,   2])
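#Besides a float, test_size also accepts an integer, which is then read as an absolute number of test observations, and there is a complementary train_size argument. The lesson itself only uses the float form; the sketch below shows the other two forms giving the same 80/20 sizes.

```python
import numpy as np
from sklearn.model_selection import train_test_split

a = np.arange(1, 101)

# A float test_size is a proportion of the data
frac_train, frac_test = train_test_split(a, test_size=0.2)

# An integer test_size is an absolute number of test observations
int_train, int_test = train_test_split(a, test_size=20)

# train_size can be given instead; the remainder becomes the test set
ts_train, ts_test = train_test_split(a, train_size=0.8)

print(frac_train.shape, frac_test.shape)
print(int_train.shape, int_test.shape)
print(ts_train.shape, ts_test.shape)
```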

#There's an argument called shuffle, which is set to True by default. Changing it to False places the first 80 observations of a in a_train and the last 20 in a_test.

In [206]:
a_train, a_test = train_test_split(a, test_size = 0.2, shuffle = False) 

In [207]:
a_train

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34,
       35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51,
       52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68,
       69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80])

In [208]:
a_test

array([ 81,  82,  83,  84,  85,  86,  87,  88,  89,  90,  91,  92,  93,
        94,  95,  96,  97,  98,  99, 100])
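#With shuffle = False the result is the same as slicing the array by hand. A minimal check, assuming the same 'a' and test_size as above:

```python
import numpy as np
from sklearn.model_selection import train_test_split

a = np.arange(1, 101)
a_train, a_test = train_test_split(a, test_size=0.2, shuffle=False)

# Without shuffling, the split is just the first 80 and the last 20 elements
same_train = np.array_equal(a_train, a[:80])
same_test = np.array_equal(a_test, a[80:])
print(same_train, same_test)
```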

#Most of the time, we prefer to shuffle the data. This removes ordering effects such as time dependencies or day-of-the-week effects. Let's remove the shuffle argument to get randomized values in a_train and a_test again.

In [209]:
a_train, a_test = train_test_split(a, test_size = 0.2) 

In [210]:
a_train

array([ 6, 17, 73, 93, 39, 77, 19, 68, 91, 12, 78, 49, 13, 25, 54, 96, 22,
       94, 63, 33,  8, 16, 34, 31, 60, 84,  1, 65, 23, 52, 89, 15, 10, 24,
       56,  3, 11, 29, 32, 44,  4, 43, 21, 14, 30, 99, 88, 62, 51, 72, 86,
       58, 42, 70, 45, 67, 66, 85,  2, 59, 81, 38, 98,  7, 46, 95, 18, 53,
       80, 57, 48, 64,  9, 74, 47, 92, 28, 97, 35, 36])

In [211]:
a_test

array([ 40,  75,  41,  82, 100,  37,  71,  27,  55,  69,  90,  26,  20,
        50,  87,  79,  61,   5,  83,  76])

#Each time we run the code, we get a different shuffle, so the data is rearranged in a random manner. This could be an issue for modeling: each time we split the data we get different training and testing datasets, so we would be fitting a different regression on different data every time we run the cells. Training a model on different training data generally has little impact, but it does have some. The R-squared is likely to change by one or two percentage points just because of the split. If we are trying to improve the model with many tiny tweaks, each of which brings 1 or 2% of additional explanatory power, a different shuffle every time would prevent an objective assessment of the changes. Ideally, we would like the data to be shuffled, but shuffled in the same way every time. For that, sklearn has a random_state argument. We'll add the random_state argument and set it to 42.

In [212]:
a_train, a_test = train_test_split(a, test_size = 0.2, random_state = 42) 

In [213]:
a_train

array([ 56,  89,  27,  43,  70,  16,  41,  97,  10,  73,  12,  48,  86,
        29,  94,   6,  67,  66,  36,  17,  50,  35,   8,  96,  28,  20,
        82,  26,  63,  14,  25,   4,  18,  39,   9,  79,   7,  65,  37,
        90,  57, 100,  55,  44,  51,  68,  47,  69,  62,  98,  80,  42,
        59,  49,  99,  58,  76,  33,  95,  60,  64,  85,  38,  30,   2,
        53,  22,   3,  24,  88,  92,  75,  87,  83,  21,  61,  72,  15,
        93,  52])

In [214]:
a_test

array([84, 54, 71, 46, 45, 40, 23, 81, 11,  1, 19, 31, 74, 34, 91,  5, 77,
       78, 13, 32])
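#The effect of a fixed random_state can be checked directly by splitting twice and comparing the results. The variable names first_*, second_* and same are introduced here only for the comparison.

```python
import numpy as np
from sklearn.model_selection import train_test_split

a = np.arange(1, 101)

# Two independent calls with the same random_state
first_train, first_test = train_test_split(a, test_size=0.2, random_state=42)
second_train, second_test = train_test_split(a, test_size=0.2, random_state=42)

# Both calls shuffle the data in exactly the same way
same = (np.array_equal(first_train, second_train)
        and np.array_equal(first_test, second_test))
print(same)
```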

#If we rerun the code again and again, we always get exactly the same shuffled split. If we want a different shuffled split, we simply change random_state to a different number, such as 365.

In [235]:
a_train, a_test = train_test_split(a, test_size = 0.2, random_state = 365) 

In [236]:
a_train

array([ 25,  32,  99,  73,  91,  66,   3,  59,  94,   1,   8,  15,  90,
        54,  31,  20,  77,  82,  30,  35,  95,  42,  38,   7,  11,  50,
        21,  48,   2,  17,  10,  58,  68,  43,  41,  16,  88,  72,  79,
       100,  80,  39,  24,  86,  22,  23,  62,  76,  18,  47,  55,  26,
        60,  19,  71,  64,  51,  63,  65,  28,  12,  78,  13,  44,  75,
        87,  40,   4,  29,  49,  37,  57,  27,  74,   6,  45,  92,  34,
        53,  83])

In [217]:
a_test

array([ 9, 69, 81, 56, 33, 93, 84, 61, 46, 89, 85, 67, 97,  5, 70, 36, 98,
       96, 14, 52])

#We got a different shuffled split, and no matter how many times we rerun the cells, the split doesn't change while the numbers are still randomized. train_test_split can also split more than one array at the same time. We can simply add b as an argument and include two new variables to store the returned arrays. We'll call them b_train and b_test.

#There are several different arguments we can set when we call this function
#Most often, we have inputs and targets, so we have to split 2 different arrays
#We are simulating this situation by splitting 'a' and 'b'


#We can specify the 'test_size'. Common splits are 75-25, 80-20, 85-15, 90-10. Finally, we should always set a 'random_state'. This ensures that whenever we split the data we get the SAME random shuffle

#Note that 2 arrays will be split into 4. The order is train1, test1, train2, test2. It is very useful to store them in 4 variables so we can use them later.

In [249]:
a_train, a_test, b_train, b_test = train_test_split(a, b, test_size=0.2, random_state=365)
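#A short sketch of what the call returns when it is given two arrays, and of what happens when their lengths differ. The truncated array b[:50] and the flag raised are made up here purely to trigger and record the error.

```python
import numpy as np
from sklearn.model_selection import train_test_split

a = np.arange(1, 101)
b = np.arange(501, 601)

# Two input arrays come back as a list of four: train1, test1, train2, test2
result = train_test_split(a, b, test_size=0.2, random_state=365)
print(type(result), len(result))

# All arrays must have the same number of observations
raised = False
try:
    train_test_split(a, b[:50], test_size=0.2, random_state=365)
except ValueError as err:
    raised = True
    print("ValueError:", err)
```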

## Explore the result

#Let's check the shapes
#Basically, we are checking how 'test_size' works

In [250]:
a_train.shape, a_test.shape

((80,), (20,))

#Explore manually

In [220]:
a_train

array([ 25,  32,  99,  73,  91,  66,   3,  59,  94,   1,   8,  15,  90,
        54,  31,  20,  77,  82,  30,  35,  95,  42,  38,   7,  11,  50,
        21,  48,   2,  17,  10,  58,  68,  43,  41,  16,  88,  72,  79,
       100,  80,  39,  24,  86,  22,  23,  62,  76,  18,  47,  55,  26,
        60,  19,  71,  64,  51,  63,  65,  28,  12,  78,  13,  44,  75,
        87,  40,   4,  29,  49,  37,  57,  27,  74,   6,  45,  92,  34,
        53,  83])

#Explore manually

In [251]:
a_test

array([ 9, 69, 81, 56, 33, 93, 84, 61, 46, 89, 85, 67, 97,  5, 70, 36, 98,
       96, 14, 52])

In [252]:
b_train.shape, b_test.shape

((80,), (20,))

In [254]:
b_train

array([525, 532, 599, 573, 591, 566, 503, 559, 594, 501, 508, 515, 590,
       554, 531, 520, 577, 582, 530, 535, 595, 542, 538, 507, 511, 550,
       521, 548, 502, 517, 510, 558, 568, 543, 541, 516, 588, 572, 579,
       600, 580, 539, 524, 586, 522, 523, 562, 576, 518, 547, 555, 526,
       560, 519, 571, 564, 551, 563, 565, 528, 512, 578, 513, 544, 575,
       587, 540, 504, 529, 549, 537, 557, 527, 574, 506, 545, 592, 534,
       553, 583])

In [258]:
b_test

array([509, 569, 581, 556, 533, 593, 584, 561, 546, 589, 585, 567, 597,
       505, 570, 536, 598, 596, 514, 552])

#a consists of the ordered sequence of numbers from 1 to 100, while b runs from 501 to 600. We can say that the number 1 from a matches 501 from b, 25 from a matches 525 from b, and so on. Comparing the outputs above, each element of b_train and b_test is exactly 500 more than the element at the same position in a_train and a_test, so the pairs stayed together. This is extremely important for regressions because we want each observation's inputs to match its target even after shuffling.
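#Rather than comparing the printed arrays by eye, the matching can be verified in code. A minimal sketch, reusing the same arrays and arguments as above (train_aligned and test_aligned are names introduced here):

```python
import numpy as np
from sklearn.model_selection import train_test_split

a = np.arange(1, 101)
b = np.arange(501, 601)

a_train, a_test, b_train, b_test = train_test_split(
    a, b, test_size=0.2, random_state=365
)

# Corresponding elements of a and b differ by 500 before the split,
# so they must still differ by 500 position by position after it
train_aligned = np.all(b_train - a_train == 500)
test_aligned = np.all(b_test - a_test == 500)
print(train_aligned, test_aligned)
```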