## Dataset splitting

Supervised machine learning is about creating models that accurately map the given inputs (independent variables, or **predictors**) to the given outputs (dependent variables, or **responses**).

It's important to understand that you can't use the same data to train and to test the model: the model has already seen that data, so the evaluation would be optimistically biased. To properly assess whether the model performs as well on new data as it does on the training data, you have to test it on fresh data.

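On the road to creating your model, you'll have to fit it (aka train it). Another advantage of splitting the data into two datasets is that it lets you detect whether your model is **underfitting** (the model is too simple to capture the relationships in your data, for example a straight line fitted to a non-linear relationship, and it typically scores poorly on both sets) or **overfitting** (the model has so much freedom that it learns noise and correlations that should not exist, and it typically scores well on the training set but noticeably worse on the test set).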
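To make the two failure modes concrete, here is a minimal sketch. The synthetic data, the noise level, and the choice of models are assumptions made for illustration only: a `DummyRegressor` that always predicts the training mean stands in for an underfitting model, and an unpruned `DecisionTreeRegressor` stands in for an overfitting one. The split uses scikit-learn's `train_test_split`, which is covered later in this section.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (assumed for illustration): a linear trend plus noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200).reshape(-1, 1)
y = 3.0 * x.ravel() + rng.normal(scale=4.0, size=200)

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=0
)

models = {
    "underfit (predicts the mean)": DummyRegressor(strategy="mean"),
    "reasonable (linear regression)": LinearRegression(),
    "overfit (unpruned decision tree)": DecisionTreeRegressor(random_state=0),
}

# Fit each model on the training set and score it on both sets
scores = {}
for name, model in models.items():
    model.fit(x_train, y_train)
    scores[name] = (model.score(x_train, y_train), model.score(x_test, y_test))
    print(f"{name}: train R^2 = {scores[name][0]:.3f}, test R^2 = {scores[name][1]:.3f}")
```

The pattern to look for is the gap between the two columns: the mean predictor scores around zero on both sets, while the unpruned tree reproduces the training set almost perfectly and does worse on the test set.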

Preferably, there should be three datasets: **training** (used to train or fit your model), **validation** (used for hyperparameter tuning of your model, i.e. experimenting with different options to find the best outcome possible) and **testing** (as the name implies, to finally test your model with fresh data).
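One possible way to obtain the three datasets is to call scikit-learn's `train_test_split` twice (its options are described further down). This is only a sketch: the 60/20/20 proportions and the toy arrays are assumptions chosen for illustration, not a recommendation.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (assumed for illustration): 100 observations
x = np.arange(100).reshape(-1, 1)
y = np.arange(100)

# First split: hold out 20% of the data as the final test set
x_rest, x_test, y_rest, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0
)

# Second split: 25% of the remaining 80% (20% of the original) for validation
x_train, x_val, y_train, y_val = train_test_split(
    x_rest, y_rest, test_size=0.25, random_state=0
)

print(len(x_train), len(x_val), len(x_test))  # 60 20 20
```

The test set is split off first so that it stays untouched while you tune hyperparameters against the validation set.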

One of the ways to do this is by splitting the original dataset, and one particular package can help you with that task: scikit-learn, imported as `sklearn`.

In [None]:
import numpy as np
from sklearn.model_selection import train_test_split

The `train_test_split` function has the following options:

* `train_size` is the number that defines the size of the training set. If you provide a float, then it must be between 0.0 and 1.0 and will define the proportion of the dataset used for training. If you provide an int, then it will represent the total number of training samples. The default value is None.

* `test_size` is the number that defines the size of the test set. It works the same way as `train_size`: a float is the proportion of the dataset used for testing, and an int is the number of test samples. You should provide either `train_size` or `test_size`. If neither is given, then the default share of the dataset that will be used for testing is 0.25, or 25 percent.

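* `random_state` is the object that controls randomization during splitting. It can be an int (a seed that makes the split reproducible across calls), a NumPy `RandomState` instance, or None. The default value is None.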

* `shuffle` is a Boolean (True by default) that determines whether to shuffle the dataset before applying the split.
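The options above can be tried out on a small array. The following is a minimal sketch on toy data (the arrays and the chosen sizes are assumptions for illustration); it only prints set sizes and checks properties that follow from the option descriptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (assumed for illustration): 10 observations
x = np.arange(10).reshape(-1, 1)
y = np.arange(10)

# Neither size given: 25% of the data goes to the test set (rounded up to 3 of 10)
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0)
print(len(x_train), len(x_test))  # 7 3

# Float train_size: proportion of the data used for training
half_train, half_test, _, _ = train_test_split(x, y, train_size=0.5, random_state=0)
print(len(half_train), len(half_test))  # 5 5

# Int test_size: absolute number of test samples
int_train, int_test, _, _ = train_test_split(x, y, test_size=3, random_state=0)
print(len(int_train), len(int_test))  # 7 3

# shuffle=False keeps the original order: the last 3 rows become the test set
_, _, ordered_train, ordered_test = train_test_split(x, y, test_size=3, shuffle=False)
print(ordered_train, ordered_test)  # [0 1 2 3 4 5 6] [7 8 9]

# The same integer random_state reproduces the same shuffled split
_, _, _, first = train_test_split(x, y, test_size=3, random_state=42)
_, _, _, second = train_test_split(x, y, test_size=3, random_state=42)
print(np.array_equal(first, second))  # True
```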

### Simple example with a linear regression

We'll create two datasets, x and y, create train and test sets for both, and then apply a linear regression between them.

In [None]:
from sklearn.linear_model import LinearRegression

# creating the datasets
x = np.arange(20).reshape(-1, 1)
y = np.array([5, 12, 11, 19, 30, 29, 23, 40, 51, 54, 74, 62, 68, 73, 89, 84, 89, 101, 99, 106])

In [None]:
x

In [None]:
y

In [None]:
# creating the train and test sets with a test size of 8 observations (leaving 12 for training)
# the data is still shuffled; random_state=0 only makes that shuffle reproducible
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=8, random_state=0)

print(x_train)
print()
print(x_test)

In [None]:
# creating and fitting the model using the train datasets 
model = LinearRegression().fit(x_train, y_train)

# looking at the best intercept and slope of the regression line
print(model.intercept_)
print(model.coef_)
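To connect these two numbers to the model's predictions, the sketch below rebuilds the same data and split as above and checks that `intercept_ + coef_[0] * x` gives the same values as `predict()`. The variable name `manual_pred` is introduced here for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Same data and split as in the example above
x = np.arange(20).reshape(-1, 1)
y = np.array([5, 12, 11, 19, 30, 29, 23, 40, 51, 54,
              74, 62, 68, 73, 89, 84, 89, 101, 99, 106])
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=8, random_state=0
)
model = LinearRegression().fit(x_train, y_train)

# A fitted simple linear regression is the line y = intercept + slope * x
manual_pred = model.intercept_ + model.coef_[0] * x_test.ravel()

# The hand-computed values should agree with predict()
print(np.allclose(manual_pred, model.predict(x_test)))
```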

#### Using the test set to assess the performance of the model

Let's check out the model scores when using both sets:

In [None]:
print(model.score(x_train, y_train))
print(model.score(x_test, y_test))

`.score()` returns the coefficient of determination, or R², for the data passed. Its maximum is 1. The higher the R² value, the better the fit. In this case, the training data yields a slightly higher coefficient.
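As a rough check of what `.score()` computes, R² can be reproduced by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares. The helper `r_squared` below is a hypothetical function written for this illustration, not part of scikit-learn; the data and split are the same as above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Same data and split as in the example above
x = np.arange(20).reshape(-1, 1)
y = np.array([5, 12, 11, 19, 30, 29, 23, 40, 51, 54,
              74, 62, 68, 73, 89, 84, 89, 101, 99, 106])
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=8, random_state=0
)
model = LinearRegression().fit(x_train, y_train)


def r_squared(y_true, y_pred):
    """Hypothetical helper: R^2 = 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)         # error left after the fit
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # spread around the mean
    return 1 - ss_res / ss_tot


train_r2 = r_squared(y_train, model.predict(x_train))
test_r2 = r_squared(y_test, model.predict(x_test))

# Both values should match what model.score() reports
print(np.isclose(train_r2, model.score(x_train, y_train)))
print(np.isclose(test_r2, model.score(x_test, y_test)))
```

Note that the test-set R² measures spread around the mean of `y_test`, not the mean of `y_train`, which is why a model can score below zero on data it fits worse than a constant.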

Let's build a visualization of this difference to better understand it:

In [None]:
# the predict() method gives us the values to build the regression line
y_pred = model.predict(x_test)

# building the visualization with matplotlib
import matplotlib.pyplot as plt

plt.figure(figsize=(10,6))
plt.scatter(x_train, y_train, c='coral', label='Train')
plt.scatter(x_test, y_test, c='lightblue', label='Test')
plt.plot(x_test, y_pred, color='grey')

plt.legend()
plt.title('Train/Test sets comparison')
plt.xlabel('x')
plt.ylabel('y')

plt.show()