## Train and test sets

The aim of machine learning is to predict unseen data from known data. When building a model, we want to be sure that the final solution will work well on new data samples, so that we can trust its predictions and base decisions on them. Train-test splitting is a vital step toward this: part of the data is held out as a test set, the model is trained on the rest, and its quality is then measured on samples it has never seen.


### The `sklearn` tool

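First, we load the wine dataset and separate the features from the target:

```python
from sklearn.datasets import load_wine

# Load the dataset as a single DataFrame; the last column is "target"
data = load_wine(as_frame=True)["frame"]
# Every column except the last is a feature; "target" holds the class labels
X, y = data.iloc[:, :-1], data["target"]
```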

Now `X` is a DataFrame of features and `y` is the target variable. Next, we make the split using `train_test_split()` imported from the `sklearn.model_selection` module:

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=42)
```

The function splits each given array, DataFrame, or list into two parts, so it returns twice as many objects as were passed in: a train part and a test part for every input, in the order the inputs were given.
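To make the "twice as many outputs" rule concrete, here is a small sketch on the same wine data. The third input, `row_ids`, is a hypothetical list of row numbers added only for illustration; it is not part of the original example.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

data = load_wine(as_frame=True)["frame"]
X, y = data.iloc[:, :-1], data["target"]

# Two inputs (X, y) produce four outputs
parts = train_test_split(X, y, train_size=0.8, random_state=42)
X_train, X_test, y_train, y_test = parts

# A third input of the same length produces six outputs.
# row_ids is a hypothetical extra array used only for this demonstration.
row_ids = list(range(len(X)))
parts3 = train_test_split(X, y, row_ids, train_size=0.8, random_state=42)

print(len(parts), len(parts3))      # 4 6
print(X_train.shape, X_test.shape)  # (142, 13) (36, 13)
```

All inputs are split with the same row permutation, so the i-th row of `X_train` still corresponds to the i-th element of `y_train`.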

Let's discuss the available parameters:

- `*arrays` are the arrays to be split. You can pass several array-like structures of the same length; the accepted types are lists, numpy arrays, scipy.sparse matrices, and pandas DataFrames. In our case, we pass only `X` and `y`.

- `train_size` is the proportion of the data to put in the train set. `test_size` is the other way around: it sets the proportion of the data to put in the test set. You usually set only one of them, and the other is taken as its complement, so the two proportions add up to 1. If neither is given, `test_size` defaults to 0.25. Both parameters also accept an integer, which is read as an absolute number of samples.

- `random_state` controls the random shuffling of the rows before the split. Pass an integer if you need reproducible output.

- `shuffle` controls whether the data is shuffled before splitting. It is `True` by default.

- Lastly, there is another important parameter, `stratify`, which makes the train and test sets representative of the class distribution in the original dataset. This becomes crucial when the classes are imbalanced. If `stratify` is left at its default of `None` instead of being set to an array of labels, a rare class might end up only in the train set or only in the test set. The following line ensures that both the train and the test sets have the same ratio of classes as the original set:

```python
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0, stratify=y)
```
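One way to check the effect of `stratify` is to compare the class shares before and after the split. The sketch below does this with pandas `value_counts(normalize=True)`; the variable names `full_ratio`, `train_ratio`, and `test_ratio` are chosen here for illustration.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

data = load_wine(as_frame=True)["frame"]
X, y = data.iloc[:, :-1], data["target"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Share of each class in the full, train, and test targets
full_ratio = y.value_counts(normalize=True).sort_index()
train_ratio = y_train.value_counts(normalize=True).sort_index()
test_ratio = y_test.value_counts(normalize=True).sort_index()

print(full_ratio.round(3).to_dict())
print(train_ratio.round(3).to_dict())
print(test_ratio.round(3).to_dict())
```

The three printed distributions should be close to each other. They cannot match exactly, because each class has to be split into whole numbers of rows.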


### A note on fit_transform()

`.fit()`, `.fit_transform()`, and `.transform()` are methods of the transformer API in sklearn.
- A transformer is fitted with `.fit()` only on the train set, so that no information from the test set leaks into it, while `.transform()` can be applied to both the train and the test sets.
- `.fit_transform()` is an optimized combination of the two methods, equivalent to `fit(X_train).transform(X_train)`. Like `.fit()`, it should be called only on the train set.
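
As a minimal sketch of this pattern, the example below uses `StandardScaler` from `sklearn.preprocessing`. The choice of scaler is an assumption made for illustration; any sklearn transformer follows the same fit-on-train, transform-both workflow.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

scaler = StandardScaler()
# Learn the mean and standard deviation from the train set, then scale it
X_train_scaled = scaler.fit_transform(X_train)
# Reuse the train statistics on the test set; the scaler is not refitted
X_test_scaled = scaler.transform(X_test)

# fit_transform gives the same result as calling fit and transform in turn
two_step = StandardScaler().fit(X_train).transform(X_train)
same = np.allclose(X_train_scaled, two_step)
print(same)
```

After scaling, the train features have zero mean by construction, while the test features are only approximately centered, because they were scaled with statistics learned from the train set.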