# Data Pre-processing | Dataset Splitting
🦊 `Notebook by` [Md.Samiul Alim](https://github.com/sami0055)

😋  `Machine Learning Source Codes` [GitHub](https://github.com/sami0055/Machine-Learning)

# **Dataset Splitting**


The right ratio for splitting a dataset into train, validation, and test sets depends on the total size of the dataset, the complexity of the model, and the nature of the problem. There is no fixed ratio, but the following ranges are common starting points:


**Small Datasets (Less than 1,000 samples):**
* Training Set: 60-70%
* Validation Set: 15-20%
* Test Set: 15-20%

**Medium Datasets (1,000 to 10,000 samples):**
* Training Set: 60-80%
* Validation Set: 10-20%
* Test Set: 10-20%

**Large Datasets (More than 10,000 samples):**
* Training Set: 70-90%
* Validation Set: 5-15%
* Test Set: 5-15%

These are rough guidelines, and the actual ratios can vary depending on the specific characteristics of your data and problem.
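
To make the ratios concrete, the sketch below turns a set of fractions into row counts. `split_counts` is a hypothetical helper written for this note, not part of any library, and the rounding rule (the remainder goes to the training set) is an assumption.

```python
def split_counts(n_samples, train=0.70, val=0.15, test=0.15):
    """Return (n_train, n_val, n_test) for the given fractions."""
    if abs(train + val + test - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    n_val = round(n_samples * val)
    n_test = round(n_samples * test)
    n_train = n_samples - n_val - n_test  # remainder goes to training
    return n_train, n_val, n_test


print(split_counts(100_000))              # 70/15/15 split of 100,000 rows
print(split_counts(800, 0.6, 0.2, 0.2))   # small dataset, 60/20/20
```

For a dataset of 100,000 rows, a 70/15/15 split leaves 15,000 rows each for validation and testing, which is why large datasets can afford smaller held-out fractions.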

## Import Libraries 

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
df=pd.read_csv('diabetes_prediction_dataset.csv')

## EDA

In [3]:
df.shape

(100000, 9)

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 9 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   gender               100000 non-null  object 
 1   age                  100000 non-null  float64
 2   hypertension         100000 non-null  int64  
 3   heart_disease        100000 non-null  int64  
 4   smoking_history      100000 non-null  object 
 5   bmi                  100000 non-null  float64
 6   HbA1c_level          100000 non-null  float64
 7   blood_glucose_level  100000 non-null  int64  
 8   diabetes             100000 non-null  int64  
dtypes: float64(3), int64(4), object(2)
memory usage: 6.9+ MB


In [5]:
df.head()

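| | gender | age | hypertension | heart_disease | smoking_history | bmi | HbA1c_level | blood_glucose_level | diabetes |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Female | 80.0 | 0 | 1 | never | 25.19 | 6.6 | 140 | 0 |
| 1 | Female | 54.0 | 0 | 0 | No Info | 27.32 | 6.6 | 80 | 0 |
| 2 | Male | 28.0 | 0 | 0 | never | 27.32 | 5.7 | 158 | 0 |
| 3 | Female | 36.0 | 0 | 0 | current | 23.45 | 5.0 | 155 | 0 |
| 4 | Male | 76.0 | 1 | 1 | current | 20.14 | 4.8 | 155 | 0 |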


In [6]:
df['diabetes'].value_counts()
# The classes are imbalanced (far more 0s than 1s), so the splits should be stratified on the target

0    91500
1     8500
Name: diabetes, dtype: int64
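
The degree of imbalance can be quantified directly from these counts. The snippet below is a small sketch that hard-codes the two counts shown above rather than reading the DataFrame; the variable names are illustrative.

```python
# Class counts copied from the value_counts() output above.
counts = {0: 91_500, 1: 8_500}

total = sum(counts.values())
proportions = {label: n / total for label, n in counts.items()}
imbalance_ratio = max(counts.values()) / min(counts.values())

print(proportions)                 # share of each class
print(round(imbalance_ratio, 2))   # majority-to-minority ratio
```

Only 8.5% of the samples are positive, so a model that always predicts class 0 would already reach 91.5% accuracy. This is one reason to stratify the splits and to look at metrics other than accuracy.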

## Label Encoding

In [7]:
from sklearn.preprocessing import OrdinalEncoder
# Category order follows first appearance in each column, so the integer codes carry no ranking
encoder = OrdinalEncoder(categories=[list(df['gender'].unique()), list(df['smoking_history'].unique())])

In [8]:
cat_col = ['gender', 'smoking_history']
# Replace the string categories with their integer codes (returned as floats)
df[cat_col] = encoder.fit_transform(df[cat_col])

In [9]:
df.head()

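| | gender | age | hypertension | heart_disease | smoking_history | bmi | HbA1c_level | blood_glucose_level | diabetes |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 80.0 | 0 | 1 | 0.0 | 25.19 | 6.6 | 140 | 0 |
| 1 | 0.0 | 54.0 | 0 | 0 | 1.0 | 27.32 | 6.6 | 80 | 0 |
| 2 | 1.0 | 28.0 | 0 | 0 | 0.0 | 27.32 | 5.7 | 158 | 0 |
| 3 | 0.0 | 36.0 | 0 | 0 | 2.0 | 23.45 | 5.0 | 155 | 0 |
| 4 | 1.0 | 76.0 | 1 | 1 | 2.0 | 20.14 | 4.8 | 155 | 0 |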
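
The codes in this output follow from the order in which each category first appears in its column, because the encoder was given `df[col].unique()` as its category list. The pure-Python sketch below reproduces that mapping for the five rows shown. `first_seen_codes` is a hypothetical helper used only for illustration; it is not how `OrdinalEncoder` is implemented.

```python
def first_seen_codes(values):
    """Map each distinct value to an integer in order of first appearance."""
    codes = {}
    for v in values:
        if v not in codes:
            codes[v] = len(codes)
    return codes


# First five rows of the two categorical columns, copied from the first df.head().
gender = ["Female", "Female", "Male", "Female", "Male"]
smoking = ["never", "No Info", "never", "current", "current"]

gender_codes = first_seen_codes(gender)
smoking_codes = first_seen_codes(smoking)

print([gender_codes[g] for g in gender])     # [0, 0, 1, 0, 1]
print([smoking_codes[s] for s in smoking])   # [0, 1, 0, 2, 2]
```

Because the order is just order of appearance, a code of 2 for `current` does not mean it ranks above `never` or `No Info`. Models that treat the codes as magnitudes may read an ordering into them that is not there.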


## Dataset Split

In [10]:
from sklearn.model_selection import train_test_split
y=df['diabetes']
x=df.drop(columns='diabetes')

In [12]:
# Split the dataset into train (80%) and test (20%), stratified on the target
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42, stratify=y)


In [13]:
print(x_train.shape)
print(y_train.shape)

(80000, 8)
(80000,)
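
The `stratify=y` argument keeps the class proportions the same in each part of the split. As a simplified illustration of that idea (not scikit-learn's actual implementation), the sketch below splits each class separately at the same rate. `stratified_split` and the toy labels are assumptions made for this example.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size, seed=42):
    """Return (train_idx, test_idx) with each class split at the same rate."""
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                       # shuffle within the class
        n_test = round(len(indices) * test_size)   # same rate for every class
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx


# Toy labels with the same 91.5% / 8.5% class balance as the dataset.
labels = [0] * 915 + [1] * 85
train_idx, test_idx = stratified_split(labels, test_size=0.2)

print(len(train_idx), len(test_idx))       # 800 200
print(sum(labels[i] for i in test_idx))    # 17 positives in the test part
```

With a purely random split, the number of positives in a 200-row test part would vary from run to run. Splitting per class fixes it at 8.5% of the part.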


In [16]:
# Split the dataset into train (70%) and temp (30%); temp is divided into validation and test below
x_train, x_temp, y_train, y_temp = train_test_split(x, y, test_size=0.3, random_state=42, stratify=y)


In [17]:
print(x_train.shape)
print(y_train.shape)

(70000, 8)
(70000,)


In [18]:
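# Split temp (30%) in half: validation (15%) and test (15%); train stays at 70%
# No random_state is set here, so the rows in each half change between runs
x_val, x_test, y_val, y_test = train_test_split(x_temp, y_temp, test_size=0.5, stratify=y_temp)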

In [19]:
print(x_val.shape)
print(y_val.shape)

(15000, 8)
(15000,)


In [20]:
print(x_test.shape)
print(y_test.shape)

(15000, 8)
(15000,)
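
The 70/15/15 split comes from two successive calls: the first holds out 30% of the data, and the second divides that held-out part in half. The sketch below shows the arithmetic for an arbitrary target ratio. `two_stage_test_sizes` is a hypothetical helper, not a scikit-learn function.

```python
def two_stage_test_sizes(train, val, test):
    """Return the test_size values for the two successive splits."""
    if abs(train + val + test - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    first = val + test      # fraction held out from the full dataset
    second = test / first   # fraction of the held-out part that becomes test
    return first, second


first, second = two_stage_test_sizes(0.70, 0.15, 0.15)
print(round(first, 2), round(second, 2))   # 0.3 0.5
```

For an uneven target such as 80/15/5, the same arithmetic gives a first `test_size` of 0.2 and a second of 0.25.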


In [21]:
y_train.value_counts()

0    64050
1     5950
Name: diabetes, dtype: int64

In [22]:
y_test.value_counts()

0    13725
1     1275
Name: diabetes, dtype: int64

In [23]:
y_val.value_counts()

0    13725
1     1275
Name: diabetes, dtype: int64
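
These counts can be used to check that stratification preserved the class balance. The sketch below hard-codes the counts printed above and computes the positive share in each split; the dictionary layout is just one convenient way to hold them.

```python
# Class counts copied from the three value_counts() outputs above.
splits = {
    "train": {0: 64_050, 1: 5_950},
    "val":   {0: 13_725, 1: 1_275},
    "test":  {0: 13_725, 1: 1_275},
}

shares = {
    name: counts[1] / (counts[0] + counts[1])
    for name, counts in splits.items()
}

for name, share in shares.items():
    print(f"{name}: {share:.3f}")   # positive share in each split
```

All three splits contain 8.5% positive samples, the same share as the full dataset (8,500 of 100,000).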

In [24]:
x_train.head()

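| | gender | age | hypertension | heart_disease | smoking_history | bmi | HbA1c_level | blood_glucose_level |
|---|---|---|---|---|---|---|---|---|
| 54534 | 0.0 | 56.0 | 0 | 1 | 0.0 | 39.51 | 4.0 | 100 |
| 7286 | 1.0 | 64.0 | 0 | 0 | 3.0 | 27.32 | 6.6 | 90 |
| 44496 | 1.0 | 49.0 | 0 | 0 | 0.0 | 42.26 | 6.0 | 240 |
| 86010 | 1.0 | 68.0 | 0 | 0 | 1.0 | 27.32 | 4.8 | 85 |
| 44954 | 1.0 | 62.0 | 0 | 0 | 5.0 | 30.64 | 5.7 | 145 |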


In [25]:
x_val.head()

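| | gender | age | hypertension | heart_disease | smoking_history | bmi | HbA1c_level | blood_glucose_level |
|---|---|---|---|---|---|---|---|---|
| 98412 | 0.0 | 47.0 | 0 | 0 | 4.0 | 25.92 | 6.1 | 155 |
| 31148 | 1.0 | 8.0 | 0 | 0 | 1.0 | 27.32 | 6.6 | 160 |
| 93556 | 1.0 | 75.0 | 1 | 0 | 0.0 | 36.51 | 9.0 | 159 |
| 86450 | 1.0 | 80.0 | 0 | 0 | 1.0 | 27.32 | 5.8 | 145 |
| 19655 | 1.0 | 15.0 | 0 | 0 | 0.0 | 29.75 | 6.2 | 200 |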


In [26]:
x_test.head()

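| | gender | age | hypertension | heart_disease | smoking_history | bmi | HbA1c_level | blood_glucose_level |
|---|---|---|---|---|---|---|---|---|
| 99390 | 1.0 | 41.0 | 0 | 0 | 0.0 | 29.57 | 6.5 | 90 |
| 43657 | 0.0 | 68.0 | 0 | 0 | 1.0 | 27.32 | 6.1 | 158 |
| 20867 | 0.0 | 53.0 | 0 | 0 | 0.0 | 32.89 | 6.2 | 159 |
| 79228 | 0.0 | 80.0 | 0 | 1 | 0.0 | 34.7 | 3.5 | 155 |
| 65979 | 0.0 | 15.0 | 0 | 0 | 1.0 | 21.18 | 4.8 | 200 |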


In [27]:
y_train.head()

54534    0
7286     0
44496    1
86010    0
44954    0
Name: diabetes, dtype: int64

In [28]:
y_val.head()

98412    0
31148    0
93556    1
86450    0
19655    0
Name: diabetes, dtype: int64

In [29]:
y_test.head()

99390    0
43657    0
20867    0
79228    0
65979    0
Name: diabetes, dtype: int64

## Thank you