# Exercise: Dataset Principles

Since models are nothing without data, it's important to make sure the fundamentals are strong when creating and shaping your datasets. Here we'll create a regression dataset and split it into the three core dataset types: train, validation, and test.

Your tasks for this exercise are:
1. Create a dataframe with your features and target arrays from `make_regression`.
2. Create a 60% Train / 20% Validation / 20% Test dataset group using the `train_test_split` function.
3. Confirm the datasets are the correct size by outputting their shapes.
4. Save the three datasets to CSV.

In [1]:
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

In [3]:
# Create a regression dataset with 1000 samples and 5 feature columns (only 2 of which are informative), plus 1 target column
regression_dataset = make_regression(
    n_samples=1000, n_features=5, n_informative=2, n_targets=1, random_state=0
)
regression_dataset

(array([[ 0.23622549, -0.32328864, -0.01842905, -1.54847105,  1.31142713],
        [-0.80149689,  0.27117018, -0.52564059, -0.88778014,  0.93639854],
        [ 0.68788139,  0.4170435 , -1.20373519,  0.49872696, -0.73793178],
        ...,
        [-0.02852887, -0.30937759,  0.2847906 ,  0.88649154, -0.27213176],
        [-0.9379476 ,  0.71178527, -0.28813145, -1.41082331,  0.16491273],
        [-0.65646367,  0.86740741,  0.7553957 , -0.59140267,  1.12441918]]),
 array([ 7.06180828e+01,  5.27578700e+01, -4.37284556e+01,  1.56835125e+02,
         1.02748706e+02, -7.75631136e+01,  6.08954729e+01, -4.11511041e+01,
        -1.42269605e+02, -1.30975306e+01,  2.72960140e+01, -9.92255546e+01,
         6.25446072e+01,  5.14841840e+01,  1.63891608e+01, -7.11464079e+01,
        -3.31515101e+01, -8.33005251e+00, -1.40969095e+01, -4.99334125e+01,
         6.63694045e+01, -2.72640723e+01, -2.47879318e+01,  2.70317047e+00,
         3.67718023e+01, -1.93936019e+01,  8.36402580e+00,  7.44409550e+01,
   

In [4]:
df = pd.DataFrame(data=regression_dataset[0], columns=[f"feature_{i}" for i in range(1, 6)])
df["target"] = regression_dataset[1]
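The `n_informative=2` argument means only two of the five feature columns actually contribute to the target. As a minimal sketch of how that could be checked, `make_regression` accepts `coef=True` and then also returns the coefficients of the underlying linear model. The variable names `X`, `y`, `coef`, and `informative` below are illustrative and not part of the exercise.

```python
import numpy as np
from sklearn.datasets import make_regression

# Same arguments as the exercise, plus coef=True to get the ground-truth coefficients.
X, y, coef = make_regression(
    n_samples=1000, n_features=5, n_informative=2, n_targets=1, random_state=0, coef=True
)

# Non-informative features have a coefficient of exactly zero.
informative = [f"feature_{i + 1}" for i in np.flatnonzero(coef)]
print(f"Coefficients: {coef}")
print(f"Informative features: {informative}")
```

Knowing which columns carry signal is useful later as a sanity check on whatever model is trained on these splits.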

In [5]:
df.head()

| | feature_1 | feature_2 | feature_3 | feature_4 | feature_5 | target |
|---|---|---|---|---|---|---|
| 0 | 0.236225 | -0.323289 | -0.018429 | -1.548471 | 1.311427 | 70.618083 |
| 1 | -0.801497 | 0.27117 | -0.525641 | -0.88778 | 0.936399 | 52.75787 |
| 2 | 0.687881 | 0.417044 | -1.203735 | 0.498727 | -0.737932 | -43.728456 |
| 3 | -0.679593 | -1.063433 | -1.797456 | 0.913202 | 2.211304 | 156.835125 |
| 4 | 0.096479 | -0.50706 | 0.522083 | 0.155794 | 1.520004 | 102.748706 |


In [6]:
# Hold out 20% of the full dataset as the test set (remainder: 0.8 | test: 0.2)
df_train, df_test = train_test_split(df, test_size=0.2, random_state=0)

# Split the remaining 80% again: 25% of it (0.25 * 0.8 = 0.2 of the full dataset) becomes validation
df_train, df_val = train_test_split(df_train, test_size=0.25, random_state=0)

# Final dataset sizes: train: 0.6 | validation: 0.2 | test: 0.2
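The second call uses `test_size=0.25` because it is applied to the remaining 80% of the rows, not to the full dataset. A rough sketch of how that rescaling could be wrapped in a reusable helper is shown below. The function name `three_way_split` and its parameters are hypothetical and not part of scikit-learn.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split


def three_way_split(df, val_size=0.2, test_size=0.2, random_state=0):
    """Hypothetical helper: val_size and test_size are fractions of the full frame."""
    # First carve off the test set as a fraction of the whole frame.
    df_rest, df_test = train_test_split(df, test_size=test_size, random_state=random_state)
    # Rescale the validation fraction to the rows that remain,
    # e.g. 0.2 / (1 - 0.2) = 0.25 for a 60/20/20 split.
    relative_val = val_size / (1.0 - test_size)
    df_train, df_val = train_test_split(df_rest, test_size=relative_val, random_state=random_state)
    return df_train, df_val, df_test


# Rebuild the exercise dataframe so this block runs on its own.
X, y = make_regression(n_samples=1000, n_features=5, n_informative=2, n_targets=1, random_state=0)
df = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(1, 6)])
df["target"] = y

df_train, df_val, df_test = three_way_split(df)
print(len(df_train), len(df_val), len(df_test))
```

Expressing both fractions relative to the full dataset keeps the intent (60/20/20) visible at the call site instead of hiding it behind the 0.25 figure.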

In [7]:
# Output each shape to confirm the size of train/validation/test
print(f"Train: {df_train.shape}")
print(f"Validation: {df_val.shape}")
print(f"Test: {df_test.shape}")

Train: (600, 6)
Validation: (200, 6)
Test: (200, 6)
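Matching shapes confirm the sizes, but not that the three sets are disjoint. One possible extra check, sketched here under the assumption that the dataframe keeps its default integer index through `train_test_split`, compares the row indices of each split. The names `idx_train`, `idx_val`, `idx_test`, `no_overlap`, and `covers_all` are illustrative.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

# Rebuild the exercise dataframe and splits so this block runs on its own.
X, y = make_regression(n_samples=1000, n_features=5, n_informative=2, n_targets=1, random_state=0)
df = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(1, 6)])
df["target"] = y
df_train, df_test = train_test_split(df, test_size=0.2, random_state=0)
df_train, df_val = train_test_split(df_train, test_size=0.25, random_state=0)

# train_test_split keeps the original row labels, so the indices identify each row.
idx_train, idx_val, idx_test = set(df_train.index), set(df_val.index), set(df_test.index)

# No row should appear in more than one split.
no_overlap = (
    idx_train.isdisjoint(idx_val)
    and idx_train.isdisjoint(idx_test)
    and idx_val.isdisjoint(idx_test)
)
# Together, the splits should account for every row of the original frame.
covers_all = (idx_train | idx_val | idx_test) == set(df.index)

print(f"No overlap between splits: {no_overlap}")
print(f"Splits cover the full dataset: {covers_all}")
```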


In [8]:
# Output all datasets to csv
df_train.to_csv('train.csv', index=False)
df_val.to_csv('validation.csv', index=False)
df_test.to_csv('test.csv', index=False)
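Passing `index=False` matters when the files are read back: without it, pandas writes the row index as an extra unnamed column, which `read_csv` then loads as `Unnamed: 0`. The sketch below illustrates the difference on a small made-up frame written to a temporary directory. The frame `df_part` and the file names are placeholders, not the exercise data.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd

# Hypothetical stand-in for one of the split frames, with a shuffled index.
df_part = pd.DataFrame(
    {"feature_1": [0.25, -0.5, 1.75], "target": [10.0, -20.5, 30.25]},
    index=[7, 42, 3],
)

with tempfile.TemporaryDirectory() as tmp:
    clean_path = Path(tmp) / "clean.csv"
    indexed_path = Path(tmp) / "indexed.csv"

    df_part.to_csv(clean_path, index=False)  # as in the exercise
    df_part.to_csv(indexed_path)             # default: index is written too

    clean = pd.read_csv(clean_path)
    indexed = pd.read_csv(indexed_path)

print(list(clean.columns))    # only the original columns
print(list(indexed.columns))  # includes an extra 'Unnamed: 0' column

# The values written with index=False survive the round trip.
same_values = np.allclose(clean.to_numpy(), df_part.to_numpy())
print(f"Values preserved: {same_values}")
```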