# Define training and testing sets in practice


## Activity Overview

This activity is designed to consolidate your knowledge of the differences between training and testing sets and to teach you how to define them in `Python` using the `sklearn` library.

In this activity, we'll use the `wine` dataset, one of the toy datasets made available in `sklearn`.


This assignment is designed to help you apply the machine learning algorithms you've learned using the packages in `Python`. `Python` concepts, instructions, and starter code are embedded within this Jupyter Notebook to help guide you as you progress through the activity. Remember to run the code of each code cell prior to submitting the assignment. Upon completing the activity, we encourage you to compare your work against the solution file to perform a self-assessment.

## Define Training and Testing Sets in Practice

We have seen that it is standard practice to split the available data into training and testing sets to guard against overfitting, choose the best model, and minimize errors.

### Importing the Dataset and Exploratory Data Analysis (EDA)

We begin by using the libraries `sklearn` and `pandas` to import and read the dataset. Let's have a closer look at each of them:

- `pandas` is a software library written for the `Python` programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series.

- `Scikit-learn` (also known as `sklearn`) is a free software machine learning library for the `Python` programming language. It features various classification, regression, and clustering algorithms.

In the code cell below, we start by importing `pandas` and reading the dataset from the file `wine.csv`.


In [1]:
# Import the pandas library
import pandas as pd

# Read the wine dataset from a local CSV file into a DataFrame
df = pd.read_csv("wine.csv")
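
The cell above reads the data from a local `wine.csv` file. If that file is not at hand, a similar table can probably be obtained from `sklearn` itself. The sketch below assumes scikit-learn 0.23 or later (for the `as_frame` argument). The column names it returns (for example `alcohol` and `target`) differ from those in `wine.csv`, so treat it as an alternative rather than a drop-in replacement.

```python
from sklearn.datasets import load_wine

# Load the bundled wine toy dataset as a pandas DataFrame
# (assumption: scikit-learn >= 0.23, which added as_frame)
wine = load_wine(as_frame=True)

# .frame holds the 13 feature columns plus a "target" column with the class label
wine_df = wine.frame

print(wine_df.shape)
print(wine_df.columns.tolist())
```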

In [2]:
df.head()

Unnamed: 0,Wine,Alcohol,Malic.acid,Ash,Acl,Mg,Phenols,Flavanoids,Nonflavanoid.phenols,Proanth,Color.int,Hue,OD,Proline
0,1,14.23,1.71,2.43,15.6,127,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065
1,1,13.2,1.78,2.14,11.2,100,2.65,2.76,0.26,1.28,4.38,1.05,3.4,1050
2,1,13.16,2.36,2.67,18.6,101,2.8,3.24,0.3,2.81,5.68,1.03,3.17,1185
3,1,14.37,1.95,2.5,16.8,113,3.85,3.49,0.24,2.18,7.8,0.86,3.45,1480
4,1,13.24,2.59,2.87,21.0,118,2.8,2.69,0.39,1.82,4.32,1.04,2.93,735


Before applying any algorithm to the DataFrame, it's always good practice to perform exploratory data analysis.

We begin by displaying the first several rows of the DataFrame `df` using the method `.head()`. By default, `.head()` displays the first five rows of a DataFrame; this can be changed by passing the desired number of rows to `.head()` as an integer.

Complete the code below to display the first ten rows of `df`.

In [3]:
df.head( )

Unnamed: 0,Wine,Alcohol,Malic.acid,Ash,Acl,Mg,Phenols,Flavanoids,Nonflavanoid.phenols,Proanth,Color.int,Hue,OD,Proline
0,1,14.23,1.71,2.43,15.6,127,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065
1,1,13.2,1.78,2.14,11.2,100,2.65,2.76,0.26,1.28,4.38,1.05,3.4,1050
2,1,13.16,2.36,2.67,18.6,101,2.8,3.24,0.3,2.81,5.68,1.03,3.17,1185
3,1,14.37,1.95,2.5,16.8,113,3.85,3.49,0.24,2.18,7.8,0.86,3.45,1480
4,1,13.24,2.59,2.87,21.0,118,2.8,2.69,0.39,1.82,4.32,1.04,2.93,735
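
To illustrate how the integer argument of `.head()` changes the number of rows returned, here is a minimal sketch on a small hypothetical DataFrame. The name `demo` and its columns are made up for illustration and are not part of the wine data.

```python
import pandas as pd

# Hypothetical 12-row DataFrame used only for illustration (not the wine data)
demo = pd.DataFrame({"row_id": range(12), "value": [i * 0.5 for i in range(12)]})

default_rows = demo.head()   # no argument: first five rows
first_ten = demo.head(10)    # integer argument: first ten rows

print(len(default_rows), len(first_ten))
```

The same call pattern applies to `df`: passing `10` requests ten rows instead of the default five.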


Next, we retrieve some more information about our DataFrame by using the attributes `.shape` and `.columns`. The method `.describe()` is also useful at this stage.

Here's a brief description of what each of them does:

- `.shape`: Returns a tuple representing the dimensionality of the DataFrame.
- `.columns`: Returns the column labels of the DataFrame.
- `.describe()`: Computes and shows summary statistics related to the DataFrame.


Run the cells below to get information about the DataFrame.

In [4]:
df.shape

(178, 14)

In [5]:
df.columns

Index(['Wine', 'Alcohol', 'Malic.acid', 'Ash', 'Acl', 'Mg', 'Phenols',
       'Flavanoids', 'Nonflavanoid.phenols', 'Proanth', 'Color.int', 'Hue',
       'OD', 'Proline'],
      dtype='object')
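
The `.describe()` method described above is not run in the cells of this notebook. As a minimal sketch of what it returns, the example below builds a small DataFrame from two columns of the five rows displayed earlier; the name `sample` is introduced here for illustration only.

```python
import pandas as pd

# Two columns copied from the five rows displayed earlier in this notebook
sample = pd.DataFrame({
    "Alcohol": [14.23, 13.20, 13.16, 14.37, 13.24],
    "Malic.acid": [1.71, 1.78, 2.36, 1.95, 2.59],
})

# describe() returns a DataFrame of summary statistics for numeric columns:
# count, mean, std, min, 25%, 50%, 75%, max
summary = sample.describe()
print(summary)
```

Calling `df.describe()` on the full dataset would produce the same eight rows of statistics, one column per numeric column of `df`.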

### Separating the Dataset Into a Training and Testing Dataset


Before separating our original data into training and testing sets in `Python` using `sklearn`, provide a short description of the differences between the two sets in the text cell below.

**DOUBLE CLICK ON THIS CELL TO TYPE YOUR ANSWER**

**YOUR ANSWER HERE:** The training set is used to fit the model. The testing set is held out during fitting and is used only to evaluate how well the fitted model performs on data it has not seen.

As you have seen in Video 3 for this week, it is important to split the data into *training* and *testing* sets.

To split the data into training and testing datasets, we can use the function [`train_test_split`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) from `sklearn`. This function randomly splits data arrays or matrices into train and test subsets and returns a list containing the train-test split of the inputs.

In our case, the function `train_test_split` takes four arguments:

- `X`: Input dataframe
- `y`: Output dataframe
- `test_size`: Should be between 0.0 and 1.0 and should represent the proportion of the dataset to include in the test split
- `random_state`: Controls the shuffling applied to the data before applying the split. Ensures the reproducibility of the results across multiple function calls
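
As a hedged illustration of the `random_state` argument described above, the sketch below calls `train_test_split` twice with the same seed on hypothetical toy arrays (`X` and `y` here are made up for the example and are not the wine data) and checks that both calls select the same rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples with 2 features each, and 10 labels
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Two calls with the same random_state produce the same shuffle and split
X_train_a, X_test_a, y_train_a, y_test_a = train_test_split(
    X, y, test_size=0.3, random_state=123
)
X_train_b, X_test_b, y_train_b, y_test_b = train_test_split(
    X, y, test_size=0.3, random_state=123
)

print(np.array_equal(X_test_a, X_test_b))  # same rows selected in both calls
print(len(y_train_a), len(y_test_a))       # sizes of the two subsets
```

Changing the seed, or omitting `random_state`, would generally give a different shuffle and therefore a different split.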

In the code cell below, fill in the ellipsis to set the argument `test_size` equal to `0.3` and `random_state` equal to `123`.

In [6]:
from sklearn.model_selection import train_test_split

# The split into inputs and outputs comes later, so `df` is passed as both X and y here
X_train, X_test, y_train, y_test = train_test_split(df, df, test_size=0.3, random_state=123)


You can see the size of the resulting train and test subsets using `.shape`:

In [7]:
X_train.shape

(124, 14)

In [8]:
X_test.shape

(54, 14)
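
The row counts above follow from simple arithmetic on the 178 rows and the requested `test_size` of `0.3`. The sketch below reproduces them under the assumption that `train_test_split` rounds the test count up and gives the remaining rows to the training set, which is consistent with the shapes shown above.

```python
import math

n_samples = 178   # number of rows reported by df.shape
test_size = 0.3   # proportion requested for the test set

# Assumption: with a float test_size and no train_size, the test count is
# rounded up and the training set receives the remaining rows
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)  # 124 54
```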

We will learn how to separate the data into inputs and outputs and how to implement algorithms in the next segments of this week of the course.

Stay tuned!