<font color=red>This is a draft version and the notebook is due to be changed and finalized soon.</font>

# Splitting the Data

This is a mini Jupyter notebook in which you will load your data, examine it, convert it to NumPy arrays, generate numerical labels from string labels and call the data splitting function (from scikit-learn) twice to split your data into training, validation and test datasets.

## Before you start

- In order for the notebooks to function as intended, modify only the code between the lines marked "### begin your code here (__ lines)." and "### end your code here."

- The line count is a suggestion of how many lines of code you need to accomplish what is asked.

- You should execute the cells (the boxes that a notebook is composed of) in order.

- You can execute a cell by pressing Shift and Enter (or Return) simultaneously.

- You should have completed the previous Jupyter notebooks before attempting this one as the concepts covered there are not repeated, for the sake of brevity.

## Loading the appropriate packages

We will import the required libraries.

In [None]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import pandas as pd
import plotly.express as px

Let's turn off scientific notation.

In [None]:
np.set_printoptions(suppress=True)
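
To see what this option changes, here is a minimal sketch on a made-up array (the values and variable names are illustrative only and are not part of the dataset). It renders the same array with and without `suppress=True`:

```python
import numpy as np

# Made-up values with very different magnitudes.
values = np.array([0.00001234, 123456.789, 1.5])

# Without suppression, NumPy switches to scientific notation for this array.
np.set_printoptions(suppress=False)
default_repr = np.array2string(values)

# With suppress=True, the same values are shown in fixed-point notation.
np.set_printoptions(suppress=True)
suppressed_repr = np.array2string(values)

print(default_repr)
print(suppressed_repr)
```

The setting only affects how arrays are displayed; the stored values are unchanged.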

## Loading and examining the data

We will load our data from a CSV file and store it in a pandas `DataFrame` object.

This is the Iris flower dataset, a very famous dataset published in 1936 that covers three species of the Iris flower: _Iris setosa_, _Iris versicolor_ and _Iris virginica_. The original dataset has 150 examples, which correspond to 150 different Iris flowers that were measured (50 flowers from each species). The dataset has 4 features: the sepal length, sepal width, petal length and petal width of each flower, in centimeters.

* This was taken and modified from the Machine Learning dataset repository of the School of Information and Computer Science of the University of California, Irvine (UCI):
 
> _Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science._

In [None]:
df = pd.read_csv('data_splitting_the_data.csv')

Let's take a look at the data.

In [None]:
df
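
Besides displaying the whole table, a few quick checks can be useful. The sketch below uses a small stand-in `DataFrame` (`df_demo`, with made-up rows and hypothetical feature column names) so that it runs on its own; the same calls could be applied to `df`:

```python
import pandas as pd

# Small stand-in for the real data; rows and feature names are made up.
df_demo = pd.DataFrame({
    'Petal length': [1.4, 4.7, 6.0, 1.3],
    'Petal width': [0.2, 1.4, 2.5, 0.2],
    'Species': ['setosa', 'versicolor', 'virginica', 'setosa'],
})

print(df_demo.shape)                      # (number of rows, number of columns)
print(df_demo.dtypes)                     # data type of each column
print(df_demo['Species'].value_counts())  # number of examples per species
print(df_demo.isna().sum())               # missing values per column
```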

We have more than 3 features, so let's use a scatter matrix to visualize our data:

In [None]:
data_dimensions = df.columns[:-1].to_list()

fig = px.scatter_matrix(df, dimensions=data_dimensions, color='Species')
fig.show()

Let's convert our data to NumPy arrays. We use `LabelEncoder` from scikit-learn to convert our string labels into numbers:

In [None]:
X = df.drop('Species', axis=1).to_numpy()
y_text = df['Species'].to_numpy()
y = LabelEncoder().fit_transform(y_text)
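
To make the label conversion concrete, here is a small sketch on toy labels. The names `encoder`, `y_text_demo` and `y_demo` are hypothetical. Keeping the fitted encoder in a variable is optional, but it allows the integer codes to be mapped back to strings later:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Toy string labels standing in for the 'Species' column.
y_text_demo = np.array(['setosa', 'virginica', 'versicolor', 'setosa'])

# Keep a reference to the fitted encoder so the codes can be decoded later.
encoder = LabelEncoder()
y_demo = encoder.fit_transform(y_text_demo)

# classes_ holds the sorted unique labels; the position of each label is its code.
print(encoder.classes_)
print(y_demo)

# inverse_transform maps the integer codes back to the original strings.
print(encoder.inverse_transform(y_demo))
```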

Now we can check the resulting NumPy arrays and their shapes:

In [None]:
X

In [None]:
X.shape

In [None]:
y_text

In [None]:
y_text.shape

In [None]:
y

In [None]:
y.shape

## Splitting data

Now use `train_test_split` twice. The first call splits the data into two sets: one for training (`X_train` and `y_train`) and one for validation and test combined (`X_vt` and `y_vt`). We want 70% of our data to be training data. No other arguments need to be specified. You can find the documentation for `train_test_split` here:

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
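
As a rough illustration of the call pattern (on toy data and with a different proportion, so it is not the answer to the exercise), the sketch below splits 8 made-up examples. All variable names are hypothetical, and `random_state` is an extra argument used here only to make the shuffle repeatable:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 8 examples with 2 features each, and one label per example.
X_demo = np.arange(16).reshape(8, 2)
y_demo = np.arange(8)

# train_size=0.75 puts 75% of the rows in the first output of each pair.
# The outputs come back in the order: X first, X second, y first, y second.
X_a, X_b, y_a, y_b = train_test_split(
    X_demo, y_demo, train_size=0.75, random_state=0
)

print(X_a.shape, X_b.shape)
print(y_a.shape, y_b.shape)
```

The rows of `X` and the entries of `y` are shuffled together, so each example keeps its own label after the split.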

Go ahead and do that function call here:

In [None]:
### begin your code here (1 line).

### end your code here.

Now, split your combined validation and test data (`X_vt` and `y_vt`) into two separate datasets: validation (`X_validation` and `y_validation`) and test (`X_test` and `y_test`). We want 50% of this data to be used for validation and the other 50% for testing (which reserves 15% of the total data points for each):

In [None]:
### begin your code here (1 line).

### end your code here.
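
The 15% figure quoted above follows from taking half of the remaining 30%. The sketch below works through the arithmetic, assuming 150 examples as in the original Iris dataset (the modified CSV used in this notebook may contain a different number):

```python
from fractions import Fraction

n_total = 150                        # assumption: size of the original dataset
train_fraction = Fraction(7, 10)     # 70% for training
vt_fraction = 1 - train_fraction     # remaining 30% for validation + test

# Half of the remaining 30% goes to each of validation and test.
validation_fraction = vt_fraction * Fraction(1, 2)
test_fraction = vt_fraction * Fraction(1, 2)

n_train = int(train_fraction * n_total)
n_vt = n_total - n_train

print(float(validation_fraction), float(test_fraction))
print(n_train, n_vt)
```

Under this assumption the combined validation and test set has an odd number of examples, so the two halves cannot be exactly equal; one of them ends up with one more example than the other.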

If everything went well, we are done! You can now use these datasets in a supervised learning algorithm. But first, let's check the resulting data:

In [None]:
X_train

In [None]:
X_train.shape

In [None]:
y_train

In [None]:
y_train.shape

In [None]:
X_validation

In [None]:
X_validation.shape

In [None]:
y_validation

In [None]:
y_validation.shape

In [None]:
X_test

In [None]:
X_test.shape

In [None]:
y_test

In [None]:
y_test.shape
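
Beyond the shapes, it can be informative to see how many examples of each class ended up in each split. Here is a minimal sketch using a hypothetical helper, `class_counts`, on toy label arrays that stand in for `y_train`, `y_validation` and `y_test`:

```python
import numpy as np

def class_counts(labels):
    """Return a {label: count} dict for a 1-D array of integer labels."""
    values, counts = np.unique(labels, return_counts=True)
    return {int(v): int(c) for v, c in zip(values, counts)}

# Toy label arrays standing in for y_train, y_validation and y_test.
y_train_demo = np.array([0, 1, 2, 0, 1, 2, 0])
y_validation_demo = np.array([1, 2])
y_test_demo = np.array([0, 0])

splits = [
    ('train', y_train_demo),
    ('validation', y_validation_demo),
    ('test', y_test_demo),
]
for name, labels in splits:
    print(name, len(labels), class_counts(labels))
```

With a purely random split, small sets such as the validation and test sets may not contain every class in equal proportion, as the toy arrays here illustrate.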

Let's plot the three datasets as well:

In [None]:
df_train = pd.DataFrame(np.c_[X_train, y_train], columns=df.columns)
fig2 = px.scatter_matrix(df_train, dimensions=data_dimensions, color='Species')
fig2.show()

In [None]:
df_validation = pd.DataFrame(np.c_[X_validation, y_validation], columns=df.columns)
fig3 = px.scatter_matrix(df_validation, dimensions=data_dimensions, color='Species')
fig3.show()

In [None]:
df_test = pd.DataFrame(np.c_[X_test, y_test], columns=df.columns)
fig4 = px.scatter_matrix(df_test, dimensions=data_dimensions, color='Species')
fig4.show()
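
Note that `np.c_` stacks the features and the numeric labels into a single float array, so the `Species` column in the frames above holds numbers such as `0.0` and `1.0`, which a plotting library may treat as a continuous quantity. One possible alternative, sketched below on toy data, is to decode the labels back to species names before plotting. This assumes the fitted encoder was kept in a variable (hypothetically named `encoder` here); the feature column names are also made up:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

columns = ['Sepal length', 'Sepal width', 'Species']  # hypothetical column names
X_demo = np.array([[5.1, 3.5], [6.3, 3.3], [7.0, 3.2]])
y_text_demo = np.array(['setosa', 'virginica', 'versicolor'])

encoder = LabelEncoder()
y_demo = encoder.fit_transform(y_text_demo)

# np.c_ produces one float array, so the labels become 0.0, 2.0, 1.0.
df_numeric = pd.DataFrame(np.c_[X_demo, y_demo], columns=columns)

# Alternative: build the frame from the features, then add the decoded labels.
df_named = pd.DataFrame(X_demo, columns=columns[:-1])
df_named['Species'] = encoder.inverse_transform(y_demo)

print(df_numeric['Species'].tolist())
print(df_named['Species'].tolist())
```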

We are done!