### Learning Objectives

At the end of the experiment, you will be able to:

*  Apply a train-test split to a dataset


### Dataset

## Train-Test Split Evaluation

The train-test split is a technique for assessing a machine learning algorithm's performance.


The procedure involves dividing a dataset into two subsets. The first subset, the training dataset, is used to fit the model. The second subset, the test dataset, is not used for training; instead, its input features are fed to the fitted model, and the resulting predictions are compared to the expected values.

*   Train Dataset: Used to fit the machine learning model.
*   Test Dataset: Used to evaluate the fitted machine learning model.

The goal is to estimate the machine learning model's performance on new data that was not used to train the model.
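The procedure described above can be sketched end to end as follows. This is a minimal illustration, not the experiment's own code: the data is synthetic (generated with `make_classification`), and the choice of `LogisticRegression` and `accuracy_score` is an assumption made only to have something to fit and score.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; not the dataset used in this experiment)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Hold out 25% of the rows; they are never shown to the model during fitting
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit on the training subset only
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Predict on the test inputs and compare with the expected values
predictions = model.predict(X_test)
test_accuracy = accuracy_score(y_test, predictions)
print(f"Test accuracy: {test_accuracy:.3f}")
```

The score printed at the end is an estimate of how the model might perform on data it has not seen.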



### How to set up

The size of the train and test sets is the procedure's main configuration parameter. It is most commonly expressed as a fraction between 0 and 1 for either the train or the test dataset. For example, a training set with a size of 0.67 (67%) means that the remaining 0.33 (33%) is assigned to the test set.
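To make the arithmetic concrete, the sketch below converts a fractional size into row counts. The helper `split_sizes` is hypothetical, and its rounding rule (test count rounded up, train count rounded down) is an assumption chosen to match how scikit-learn's `train_test_split` is understood to behave.

```python
import math

def split_sizes(n_samples, test_size=None, train_size=None):
    """Return (n_train, n_test) for fractional split sizes (hypothetical helper)."""
    if test_size is None and train_size is None:
        raise ValueError("Give at least one of test_size or train_size")
    if test_size is not None:
        # Assumed rounding: the test count is rounded up
        n_test = math.ceil(test_size * n_samples)
        if train_size is None:
            n_train = n_samples - n_test
        else:
            n_train = math.floor(train_size * n_samples)
    else:
        # Assumed rounding: the train count is rounded down
        n_train = math.floor(train_size * n_samples)
        n_test = n_samples - n_train
    return n_train, n_test

# A 768-row dataset with a 0.67 training fraction
print(split_sizes(768, train_size=0.67))  # (514, 254)
```

Because row counts must be whole numbers, the realised split is only approximately 67/33.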

There is no optimal split percentage; you must select one that meets the goals of your project. Common split percentages include:


*   Train: 80%, Test: 20%
*   Train: 67%, Test: 33%
*   Train: 50%, Test: 50%


Now that we are familiar with the train-test split model evaluation procedure, let’s look at how we can use this procedure in Python.







In [1]:
import pandas as pd

In [2]:
dataframe  = pd.read_csv('diabetes.csv')
dataframe.head()

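|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |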


In [7]:
from sklearn.model_selection import train_test_split
# Input features (x) and target label (y)
x = dataframe[['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']]
y = dataframe['Outcome']
# Hold out 20% of the rows as the test set; the remaining 80% form the train set
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
x_train.shape, x_test.shape, y_train.shape, y_test.shape



((614, 8), (154, 8), (614,), (154,))

In [8]:
# Use 67% of the rows for training; the remaining 33% become the test set
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.67)
x_train.shape, x_test.shape, y_train.shape, y_test.shape

((514, 8), (254, 8), (514,), (254,))

In [9]:
# Set both sizes explicitly for a 50:50 split
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.5, test_size=0.5)
x_train.shape, x_test.shape, y_train.shape, y_test.shape

((384, 8), (384, 8), (384,), (384,))
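The calls above shuffle the rows randomly, so each run places different rows in each subset. As a hedged sketch, passing a fixed `random_state` to `train_test_split` should make the split repeatable. The small arrays here are toy data invented for the illustration, and the seed value 42 is arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (an assumption for illustration): 10 rows, 2 feature columns
data = np.arange(20).reshape(10, 2)
labels = np.arange(10)

# Two calls with the same seed
a_train, a_test, _, _ = train_test_split(data, labels, test_size=0.2, random_state=42)
b_train, b_test, _, _ = train_test_split(data, labels, test_size=0.2, random_state=42)

# The same rows land in the test set both times
same_split = np.array_equal(a_test, b_test)
print(same_split)
```

A repeatable split makes it easier to compare models, since every model is then scored on the same held-out rows.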

##### Train-test split without the sklearn library

In [10]:
# Shuffle the dataset
shuffle_df = dataframe.sample(frac=1)

# Define a size for your train set
train_size = int(0.7 * len(dataframe))

# Split your dataset
train_set = shuffle_df[:train_size]

test_set = shuffle_df[train_size:]

In [11]:
print("train data :",len(train_set), ", test data :",len(test_set))

train data : 537 , test data : 231
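The same shuffle-then-slice idea can be sketched without pandas. The function `manual_split` and its `seed` parameter are hypothetical names introduced for this illustration, and the row indices 0 to 767 stand in for the 768 rows of the dataset.

```python
import random

def manual_split(rows, train_fraction=0.7, seed=None):
    """Shuffle rows, then cut them into train and test lists (hypothetical helper)."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    # Same rule as above: truncate the fractional train size to an integer
    cut = int(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

train_rows, test_rows = manual_split(range(768), train_fraction=0.7, seed=0)
print(len(train_rows), len(test_rows))  # 537 231
```

Shuffling before slicing matters: if the rows were stored in some meaningful order, slicing without a shuffle would give train and test sets drawn from different parts of the data.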


In [13]:
train_set.groupby(['Outcome']).count()

| Outcome | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age |
|---|---|---|---|---|---|---|---|---|
| 0 | 353 | 353 | 353 | 353 | 353 | 353 | 353 | 353 |
| 1 | 184 | 184 | 184 | 184 | 184 | 184 | 184 | 184 |


In [14]:
test_set.groupby(['Outcome']).count()

| Outcome | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age |
|---|---|---|---|---|---|---|---|---|
| 0 | 147 | 147 | 147 | 147 | 147 | 147 | 147 | 147 |
| 1 | 84 | 84 | 84 | 84 | 84 | 84 | 84 | 84 |
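The per-class counts can be turned into proportions to check how evenly the random shuffle distributed the two outcomes. The counts below are copied from the two outputs above; `positive_share` is a hypothetical helper name.

```python
# Rows per Outcome value, copied from the groupby outputs above
train_counts = {0: 353, 1: 184}
test_counts = {0: 147, 1: 84}

def positive_share(counts):
    """Fraction of rows with Outcome == 1 (hypothetical helper)."""
    return counts[1] / sum(counts.values())

train_share = positive_share(train_counts)
test_share = positive_share(test_counts)
print(f"train: {train_share:.3f}, test: {test_share:.3f}")  # train: 0.343, test: 0.364
```

For this particular shuffle the two shares are close but not identical, which is expected when rows are assigned at random.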


##### Train-test split with the sklearn library

In [15]:
features = dataframe.iloc[:,:8]
target = dataframe['Outcome']

In [17]:
from sklearn.model_selection import train_test_split

In [21]:
# Split the data into train and test sets in a 70:30 ratio,
# i.e., 70% of the data is the train set and 30% is the test set
x_train, x_test, y_train, y_test = train_test_split(features, target, test_size = 0.3)

In [20]:
x_train.shape, x_test.shape, y_train.shape, y_test.shape

((537, 8), (231, 8), (537,), (231,))

Here, the dataset is divided so that 70% of the rows (537) are used to train the model and 30% (231) are used to evaluate it.
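When the class proportions should be the same in both subsets, `train_test_split` also accepts a `stratify` argument. The sketch below is an optional variation, not part of the experiment above; the toy arrays (100 rows, 30% positive) and the seed are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (an assumption for illustration): 100 rows, 30 positive and 70 negative
toy_features = np.arange(200).reshape(100, 2)
toy_target = np.array([1] * 30 + [0] * 70)

# stratify=toy_target asks for the class ratio to be preserved in both subsets
xs_train, xs_test, ys_train, ys_test = train_test_split(
    toy_features, toy_target, test_size=0.3, stratify=toy_target, random_state=0
)

print(xs_train.shape, xs_test.shape)
print(ys_train.mean(), ys_test.mean())
```

Both printed means should sit at or very near 0.3, the positive share of the full toy dataset.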