## Train-Test Split:

Train-test split is a data preparation technique in which we divide the dataset into two parts:

- Training data: used to train the ML model
- Testing data: used to evaluate how well the model performs on unseen data

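## Why do we need Train-Test Split?

If we train and test on the same data, the model:
- Looks more accurate than it really is, because it is scored on examples it has already seen
- May still fail on real-world data it has never seen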

Train-test split helps to:
- Detect overfitting
- Estimate how the model performs on data it was not trained on
- Simulate future/unseen data
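
To make the overfitting point concrete, here is a minimal sketch that scores one model on both parts of a split. The data is synthetic and the names (`X_noise`, `y_noise`, `tree`) are made up for this illustration; the decision tree is just one convenient model that can memorise its training set.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration): a noisy linear relation
rng = np.random.default_rng(0)
X_noise = rng.uniform(0, 10, size=(200, 1))
y_noise = 3 * X_noise[:, 0] + rng.normal(0, 5, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_noise, y_noise, test_size=0.2, random_state=0
)

# An unrestricted tree can memorise every training point, noise included
tree = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)

train_r2 = tree.score(X_tr, y_tr)  # scored on rows the model has already seen
test_r2 = tree.score(X_te, y_te)   # scored on held-out rows
print(f"train R^2 = {train_r2:.3f}, test R^2 = {test_r2:.3f}")
```

A large gap between the two scores is the usual sign of overfitting; without the held-out rows only the flattering training score would be visible.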

### Real-life analogy:

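Studying from a book is training; sitting an exam with questions you have not seen before is testing. An exam made only of questions you have already seen would not show whether you actually learned the material.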

## Example
If N = 100 and the split is 80:20:

- train = 80 samples
- test = 20 samples
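
The arithmetic above can be checked directly. This is a small sketch using a stand-in array of 100 values (`samples` is a made-up name, not part of the notebook). It also shows what happens when N is not a multiple of five: for a fractional `test_size`, scikit-learn rounds the test count up and gives the remaining rows to training.

```python
import math
import numpy as np
from sklearn.model_selection import train_test_split

samples = np.arange(100)  # stand-in for a dataset with N = 100 rows
train_part, test_part = train_test_split(samples, test_size=0.20, random_state=0)
print(len(train_part), len(test_part))  # 80 20

# With N = 8, 20% of the rows is 1.6, which is rounded up to 2 test rows
n_small = 8
n_test_small = math.ceil(0.20 * n_small)
n_train_small = n_small - n_test_small
print(n_train_small, n_test_small)  # 6 2
```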

![WhatsApp Image 2025-12-20 at 11.10.33 AM.jpeg](attachment:18d7de35-a041-436c-ac86-eeef20cfeebb.jpeg)

In [8]:
from sklearn.model_selection import train_test_split
import pandas as pd

In [2]:
data = pd.DataFrame({
    "study_hours": [1, 2, 3, 4, 5, 6, 7, 8],
    "marks": [35, 40, 50, 55, 65, 70, 80, 85],
})

In [3]:
data

index,study_hours,marks
0,1,35
1,2,40
2,3,50
3,4,55
4,5,65
5,6,70
6,7,80
7,8,85


In [5]:
X = data[["study_hours"]]  # feature column(s), kept as a DataFrame
Y = data[["marks"]]        # target column

In [10]:
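# Hold out 20% of the rows for testing; random_state fixes the shuffle so the split is reproducible
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.20, random_state=40)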

In [11]:
X_train

index,study_hours
2,3
4,5
0,1
5,6
3,4
6,7


In [12]:
X_test

index,study_hours
7,8
1,2


In [13]:
Y_train

index,marks
2,50
4,65
0,35
5,70
3,55
6,80


In [16]:
Y_test

index,marks
7,85
1,40
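
Two properties of the split above are worth checking: the same `random_state` always selects the same rows, and the feature rows stay aligned with their target rows. The sketch below rebuilds the small table under made-up names (`demo`, `hours`, `scores`) so that it runs on its own.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

demo = pd.DataFrame({
    "study_hours": [1, 2, 3, 4, 5, 6, 7, 8],
    "marks": [35, 40, 50, 55, 65, 70, 80, 85],
})
hours = demo[["study_hours"]]
scores = demo[["marks"]]

# Two calls with the same seed
first = train_test_split(hours, scores, test_size=0.20, random_state=40)
second = train_test_split(hours, scores, test_size=0.20, random_state=40)

# Same seed, same rows in every part of the split
same_split = all(a.equals(b) for a, b in zip(first, second))
print(same_split)  # True

# Features and target are shuffled together, so their indexes match
hours_test, scores_test = first[1], first[3]
aligned = list(hours_test.index) == list(scores_test.index)
print(aligned)  # True
```

Leaving `random_state` unset gives a different shuffle on each run, which is fine for experimentation but makes results harder to reproduce.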


# Real-world dataset

In [17]:
diabetes = pd.read_csv("https://github.com/YBIFoundation/Dataset/raw/main/Diabetes.csv")

In [18]:
diabetes

index,pregnancies,glucose,diastolic,triceps,insulin,bmi,dpf,age,diabetes
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1
...,...,...,...,...,...,...,...,...,...
763,10,101,76,48,180,32.9,0.171,63,0
764,2,122,70,27,0,36.8,0.340,27,0
765,5,121,72,23,112,26.2,0.245,30,0
766,1,126,60,0,0,30.1,0.349,47,1


In [19]:
y = diabetes['diabetes']

In [24]:
X = diabetes[['pregnancies', 'glucose', 'diastolic', 'triceps', 'insulin', 'bmi', 'dpf', 'age']]

In [25]:
X

index,pregnancies,glucose,diastolic,triceps,insulin,bmi,dpf,age
0,6,148,72,35,0,33.6,0.627,50
1,1,85,66,29,0,26.6,0.351,31
2,8,183,64,0,0,23.3,0.672,32
3,1,89,66,23,94,28.1,0.167,21
4,0,137,40,35,168,43.1,2.288,33
...,...,...,...,...,...,...,...,...
763,10,101,76,48,180,32.9,0.171,63
764,2,122,70,27,0,36.8,0.340,27
765,5,121,72,23,112,26.2,0.245,30
766,1,126,60,0,0,30.1,0.349,47


In [26]:
y

0      1
1      0
2      1
3      0
4      1
      ..
763    0
764    0
765    0
766    1
767    0
Name: diabetes, Length: 768, dtype: int64

In [29]:
# 80:20 split of the 768 rows; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2529)

In [28]:
X_train.shape,X_test.shape,y_train.shape,y_test.shape

((614, 8), (154, 8), (614,), (154,))
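
For a classification target such as this one, it can also help to keep the class proportions similar in both parts. One option is the `stratify` parameter of `train_test_split`. The sketch below is not part of the notebook: it uses synthetic labels (`labels`, `features` are made-up names, and the 500/268 class counts are an assumption chosen only to give 768 rows) so that it runs without downloading the CSV.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in (assumption): 768 binary labels, about one third positive
labels = np.array([0] * 500 + [1] * 268)
features = np.arange(len(labels)).reshape(-1, 1)

f_train, f_test, l_train, l_test = train_test_split(
    features, labels, test_size=0.2, random_state=2529, stratify=labels
)

print(f_train.shape, f_test.shape)  # (614, 1) (154, 1)

# Share of positive labels overall, in the training part, and in the test part
share_all = labels.mean()
share_train = l_train.mean()
share_test = l_test.mean()
print(round(share_all, 3), round(share_train, 3), round(share_test, 3))
```

With the real data the equivalent call would pass `stratify=y`; whether that is preferable to a plain random split depends on how imbalanced the classes are.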