# Train-test split
[Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)

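Partitions data into two subsets, one for training the model and the other for testing. By default the split is a random shuffle, so class balance is preserved only approximately; pass `stratify` to keep the class proportions the same in both subsets.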

In [1]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import datasets

iris = datasets.load_iris(as_frame=True)
X, y = iris['data'], iris['target']

In [3]:
X.head()

,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2


In [4]:
y.head()

0    0
1    0
2    0
3    0
4    0
Name: target, dtype: int64

`test_size`: float or int, default=None

If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. 
If int, represents the absolute number of test samples. If None, the value is set to the complement of the train size. If train_size is also None, it will be set to 0.25.

`train_size`: float or int, default=None

If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the train split. 
If int, represents the absolute number of train samples. If None, the value is automatically set to the complement of the test size.

`random_state`: int, RandomState instance or None, default=None

Controls the shuffling applied to the data before applying the split. Pass an int for reproducible output across multiple function calls.
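
As a minimal sketch of how these three parameters interact, the cell below splits a small toy array (the `X_toy` and `y_toy` names and values are made up for illustration and are not part of the iris example) and prints the resulting subset sizes.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 20 samples, 2 features (hypothetical, for illustration only)
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.arange(20)

# Float test_size: a proportion of the data (0.25 * 20 = 5 test samples)
a_train, a_test, _, _ = train_test_split(X_toy, y_toy, test_size=0.25, random_state=0)

# Int test_size: an absolute number of test samples
b_train, b_test, _, _ = train_test_split(X_toy, y_toy, test_size=4, random_state=0)

# Only train_size given: the test set is the complement of the train set
c_train, c_test, _, _ = train_test_split(X_toy, y_toy, train_size=0.5, random_state=0)

# Neither size given: test_size falls back to 0.25
d_train, d_test, _, _ = train_test_split(X_toy, y_toy, random_state=0)

# Repeating a call with the same int random_state reproduces the same split
e_train, e_test, _, _ = train_test_split(X_toy, y_toy, test_size=0.25, random_state=0)

print(len(a_train), len(a_test))       # 15 5
print(len(b_train), len(b_test))       # 16 4
print(len(c_train), len(c_test))       # 10 10
print(len(d_train), len(d_test))       # 15 5
print(np.array_equal(a_test, e_test))  # True
```

Leaving `random_state=None` would give a different shuffle on each run, which is usually not what you want when comparing models.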



In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [7]:
X_train.head()

,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
96,5.7,2.9,4.2,1.3
105,7.6,3.0,6.6,2.1
66,5.6,3.0,4.5,1.5
0,5.1,3.5,1.4,0.2
122,7.7,2.8,6.7,2.0


In [8]:
X_test.head()

,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
73,6.1,2.8,4.7,1.2
18,5.7,3.8,1.7,0.3
118,7.7,2.6,6.9,2.3
78,6.0,2.9,4.5,1.5
76,6.8,2.8,4.8,1.4


In [9]:
y_train.head()

96     1
105    2
66     1
0      0
122    2
Name: target, dtype: int64

In [10]:
y_test.head()

73     1
18     0
118    2
78     1
76     1
Name: target, dtype: int64
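
The split above does not pass `stratify`, so the class proportions in each subset only approximate those of the full dataset. The sketch below, which reloads iris so that it runs on its own, compares the test-set class counts with and without `stratify=y`; the variable names `y_test_plain` and `y_test_strat` are chosen here for illustration.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris(as_frame=True)
X, y = iris['data'], iris['target']

# Plain random split: class counts in the test set depend on the shuffle
_, _, _, y_test_plain = train_test_split(X, y, test_size=0.33, random_state=42)

# Stratified split: each class keeps (as closely as possible) its share of the data
_, _, _, y_test_strat = train_test_split(
    X, y, test_size=0.33, random_state=42, stratify=y
)

plain_counts = y_test_plain.value_counts().sort_index()
strat_counts = y_test_strat.value_counts().sort_index()

print(plain_counts)
print(strat_counts)
```

Since iris has 50 samples of each class, the stratified test set should contain nearly equal counts per class, differing by at most one sample.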

### Dataset subsampling

To subsample a dataset while preserving its balance, we can call `train_test_split` with `stratify` and keep only one of the two subsets. Take the iris dataset as an example. We first inspect the dataset and how its values are distributed.

In [3]:
import pandas as pd
df_iris = pd.DataFrame(iris.data, columns=iris.feature_names)
df_iris.head()

,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2


In [5]:
df_iris.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 4 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   sepal length (cm)  150 non-null    float64
 1   sepal width (cm)   150 non-null    float64
 2   petal length (cm)  150 non-null    float64
 3   petal width (cm)   150 non-null    float64
dtypes: float64(4)
memory usage: 4.8 KB

In [6]:
print(df_iris['petal length (cm)'].value_counts(normalize=True)*100, "\n")
print(df_iris['petal width (cm)'].value_counts(normalize=True)*100)

petal length (cm)
1.4    8.666667
1.5    8.666667
5.1    5.333333
4.5    5.333333
1.6    4.666667
1.3    4.666667
5.6    4.000000
4.7    3.333333
4.9    3.333333
4.0    3.333333
4.2    2.666667
5.0    2.666667
4.4    2.666667
4.8    2.666667
1.7    2.666667
3.9    2.000000
4.6    2.000000
5.7    2.000000
4.1    2.000000
5.5    2.000000
6.1    2.000000
5.8    2.000000
3.3    1.333333
5.4    1.333333
6.7    1.333333
5.3    1.333333
5.9    1.333333
6.0    1.333333
1.2    1.333333
4.3    1.333333
1.9    1.333333
3.5    1.333333
5.2    1.333333
3.0    0.666667
1.1    0.666667
3.7    0.666667
3.8    0.666667
6.6    0.666667
6.3    0.666667
1.0    0.666667
6.9    0.666667
3.6    0.666667
6.4    0.666667
Name: proportion, dtype: float64 

petal width (cm)
0.2    19.333333
1.3     8.666667
1.8     8.000000
1.5     8.000000
1.4     5.333333
2.3     5.333333
1.0     4.666667
0.4     4.666667
0.3     4.666667
2.1     4.000000
2.0     4.000000
0.1     3.333333
1.2     3.333333
1.9     3.333333
1.6 

In [None]:
# Build one stratum label per (petal length, petal width) combination
df_iris['strata'] = df_iris['petal length (cm)'].astype(str) + "_" + df_iris['petal width (cm)'].astype(str)

# On iris this raises a ValueError: some combinations occur only once,
# and stratify needs at least 2 samples in every stratum.
subsample, _ = train_test_split(df_iris, stratify=df_iris['strata'], train_size=0.9, random_state=42)
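
One possible workaround, sketched here as an assumption and not as the approach taken above, is to stratify on a column with enough samples per value. The iris class label (`target`) has 50 samples per class, so a 90% subsample stratified on it keeps the three classes balanced. The cell reloads the data so it runs on its own; `df_full` and `counts` are names introduced for this example.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris(as_frame=True)
df_full = iris['frame']  # the four feature columns plus a 'target' column

# Keep 90% of the rows, stratified on the class label; discard the rest
subsample, _ = train_test_split(
    df_full, stratify=df_full['target'], train_size=0.9, random_state=42
)

counts = subsample['target'].value_counts().sort_index()
print(len(subsample))  # 135
print(counts)          # 45 rows for each of the three classes
```

For continuous features such as petal length, binning the values first (for example with `pd.cut`) would give coarser strata, though every bin would still need at least two samples.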