# Lesson 10: Train and test split


A remark on underfitting and overfitting (this is why the train and test method is needed):

Underfitted model: low train accuracy and low test accuracy. The model is too simple to capture the underlying 
pattern and has little predictive power (e.g. a straight line fitted to data that follows a higher-degree polynomial).
    
Overfitted model: high train accuracy and low test accuracy. The model fits the training data so closely that it 
captures the noise and misses the underlying pattern (e.g. a polynomial that passes through every data point).
    
Good model: high train accuracy and high test accuracy. The model captures the underlying pattern, and the fit 
passes close to the points without following the noise.
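
The three cases above can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the synthetic cubic data, the noise level, the polynomial degrees 1, 3 and 12, and the helper name `fit_errors`. Because this is a regression example, it reports mean squared error instead of accuracy, so lower is better.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for illustration): a cubic signal plus noise.
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 60)
y = 4 * x**3 - 3 * x + rng.normal(scale=0.2, size=x.size)

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=0
)

def fit_errors(degree):
    """Fit a polynomial on the training part; return (train MSE, test MSE)."""
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train_err, test_err

# Degree 1 is too simple, degree 3 matches the signal, degree 12 is very flexible.
errors = {degree: fit_errors(degree) for degree in (1, 3, 12)}
for degree, (train_err, test_err) in errors.items():
    print(f"degree {degree:2d}: train MSE = {train_err:.3f}, test MSE = {test_err:.3f}")
```

The straight line should show a large error on both parts (underfitting). The degree-12 fit always has a training error at least as low as the cubic, but its test error is usually no better and is often worse, which is the sign of overfitting. The exact numbers depend on the random draw.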

## Train - Test split

## Import libraries 

In [1]:
import numpy as np
from sklearn.model_selection import train_test_split

## Generate some data that will be split 

In [8]:
a = np.arange(1,101)      # "arange" generates an array with the integers from 1 to 100 (the stop value 101 is excluded). 
                          # Note that for lists the built-in "range" is used instead of "arange".
a

array([  1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,  13,
        14,  15,  16,  17,  18,  19,  20,  21,  22,  23,  24,  25,  26,
        27,  28,  29,  30,  31,  32,  33,  34,  35,  36,  37,  38,  39,
        40,  41,  42,  43,  44,  45,  46,  47,  48,  49,  50,  51,  52,
        53,  54,  55,  56,  57,  58,  59,  60,  61,  62,  63,  64,  65,
        66,  67,  68,  69,  70,  71,  72,  73,  74,  75,  76,  77,  78,
        79,  80,  81,  82,  83,  84,  85,  86,  87,  88,  89,  90,  91,
        92,  93,  94,  95,  96,  97,  98,  99, 100])

In [9]:
b = np.arange(501,601)
b

array([501, 502, 503, 504, 505, 506, 507, 508, 509, 510, 511, 512, 513,
       514, 515, 516, 517, 518, 519, 520, 521, 522, 523, 524, 525, 526,
       527, 528, 529, 530, 531, 532, 533, 534, 535, 536, 537, 538, 539,
       540, 541, 542, 543, 544, 545, 546, 547, 548, 549, 550, 551, 552,
       553, 554, 555, 556, 557, 558, 559, 560, 561, 562, 563, 564, 565,
       566, 567, 568, 569, 570, 571, 572, 573, 574, 575, 576, 577, 578,
       579, 580, 581, 582, 583, 584, 585, 586, 587, 588, 589, 590, 591,
       592, 593, 594, 595, 596, 597, 598, 599, 600])

## Split the data

In [27]:
train_test_split(a, test_size=0.2) # This function divides the original array a into two: one for training 
                                   # and one for testing. By default, it chooses 75% of the elements for 
                                   # training and 25% for testing. Setting test_size=0.2 changes this
                                   # proportion to 80% and 20%. The data are shuffled before the split, so the
                                   # 80 training elements are a random selection, not the first 80.

[array([ 69,  48,  70,  11,  95,  20,  56,  50,  82,  89,   5,  57,  39,
         83,  85,  80,  41,  30,  98,  94,  38,  15,  81,  78,  10,  16,
          1,  42,  49,  75,  58,  87,   4,  66,  74,  32,  45,   9,  35,
         44,  61,  63,  92,  51,  67,  28,   6,  24,  99,  90,  55,  62,
         84,  23,  60,  31,  33,   7,  26,  12,  91,  25,  18,  54, 100,
         68,  86,   8,  71,  29,  64,  77,  47,  37,  53,  40,  52,  88,
         22,   3]),
 array([36, 79, 17, 46, 43, 65, 27, 34, 13, 76, 14, 72, 21, 93, 97, 59, 96,
         2, 19, 73])]
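
As a quick check of the proportions described above, the following sketch compares the default split with `test_size=0.2`. The variable names are chosen here for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

a = np.arange(1, 101)

# No test_size given: the default puts 25% of the elements in the test part.
default_train, default_test = train_test_split(a)

# test_size=0.2 puts 20% of the elements in the test part.
custom_train, custom_test = train_test_split(a, test_size=0.2)

print(len(default_train), len(default_test))  # 75 25
print(len(custom_train), len(custom_test))    # 80 20

# Every element ends up in exactly one of the two parts.
recombined = np.sort(np.concatenate([custom_train, custom_test]))
print(np.array_equal(recombined, a))          # True
```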

In [31]:
a_train, a_test = train_test_split(a, test_size=0.2, random_state=42)     

# By using "shuffle=False" the data are not shuffled: the first 80 elements go to training and the last 20 go 
# to testing, both in their original order.

# If we want to have the data shuffled, we leave the call as it is, because shuffling is the default.

# For shuffled data we can use "random_state=42", which fixes the seed of the random number generator, so each 
# time we execute the code the data are shuffled in the same way. If this option is not used, each execution 
# generates a different shuffle.
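
The two options can be checked directly. This is a minimal sketch with variable names chosen for illustration; it rebuilds the array `a` so that it runs on its own.

```python
import numpy as np
from sklearn.model_selection import train_test_split

a = np.arange(1, 101)

# shuffle=False keeps the original order: training gets 1..80, testing gets 81..100.
ordered_train, ordered_test = train_test_split(a, test_size=0.2, shuffle=False)
print(ordered_train[:5], ordered_test[:5])

# The same random_state gives the same shuffle on every call.
first_train, first_test = train_test_split(a, test_size=0.2, random_state=42)
second_train, second_test = train_test_split(a, test_size=0.2, random_state=42)
print(np.array_equal(first_train, second_train))  # True
```

The value 42 has no special meaning. Any fixed integer makes the shuffle reproducible.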

In [38]:
a_train, a_test, b_train, b_test = train_test_split(a, b, test_size=0.2, random_state=42)  

# The function train_test_split accepts several arrays at once and splits all of them at the same positions

## Explore the result

In [39]:
a_train.shape, a_test.shape

((80,), (20,))

In [40]:
a_train

array([ 56,  89,  27,  43,  70,  16,  41,  97,  10,  73,  12,  48,  86,
        29,  94,   6,  67,  66,  36,  17,  50,  35,   8,  96,  28,  20,
        82,  26,  63,  14,  25,   4,  18,  39,   9,  79,   7,  65,  37,
        90,  57, 100,  55,  44,  51,  68,  47,  69,  62,  98,  80,  42,
        59,  49,  99,  58,  76,  33,  95,  60,  64,  85,  38,  30,   2,
        53,  22,   3,  24,  88,  92,  75,  87,  83,  21,  61,  72,  15,
        93,  52])

In [41]:
b_train.shape, b_test.shape

((80,), (20,))

In [37]:
b_train

array([556, 589, 527, 543, 570, 516, 541, 597, 510, 573, 512, 548, 586,
       529, 594, 506, 567, 566, 536, 517, 550, 535, 508, 596, 528, 520,
       582, 526, 563, 514, 525, 504, 518, 539, 509, 579, 507, 565, 537,
       590, 557, 600, 555, 544, 551, 568, 547, 569, 562, 598, 580, 542,
       559, 549, 599, 558, 576, 533, 595, 560, 564, 585, 538, 530, 502,
       553, 522, 503, 524, 588, 592, 575, 587, 583, 521, 561, 572, 515,
       593, 552])

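Note that a and b are shuffled in the same way because they are passed to the same train_test_split call: each element of b_train corresponds to the element of a_train at the same position. The option "random_state=42" only makes this shuffle reproducible between runs.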
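
The pairing between the two arrays can be verified with a short sketch. It relies on the fact that in this lesson every element of b equals the matching element of a plus 500, and it deliberately leaves out `random_state` to show that the pairing does not depend on it.

```python
import numpy as np
from sklearn.model_selection import train_test_split

a = np.arange(1, 101)
b = np.arange(501, 601)

# No random_state here: the shuffle differs between runs,
# but a and b are still split at the same positions.
a_train, a_test, b_train, b_test = train_test_split(a, b, test_size=0.2)

# In this data b[i] == a[i] + 500, so the pairing is easy to check.
print(np.all(b_train - a_train == 500))  # True
print(np.all(b_test - a_test == 500))    # True
```

This is what keeps each row of features aligned with its target when the two are stored in separate arrays.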