# Model Validation

Model validation means testing your model to check how well it performs on new, unseen data.

Goal:

✅ Detect and avoid overfitting
✅ Ensure the model generalizes well
✅ Estimate real-world performance

✅ 1️⃣ Train-Test Split (Simple)

Split data into:
	•	Train (80%)
	•	Test (20%)
              
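✅ 2️⃣ Cross Validation (Best practice)

Data is split into multiple folds. The model is trained and evaluated several times, each time holding out a different fold.

✅ 3️⃣ K-Fold Cross Validation

Data is divided into K parts (folds). Each fold serves as the test set exactly once while the model trains on the remaining K-1 folds. The K scores are then averaged.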
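
To make the fold mechanics concrete, here is a minimal sketch that prints which rows land in the test set of each fold. The 8-row array `X_toy` is made up for illustration and is not the notebook's dataset.

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical toy data: 8 samples, one feature.
X_toy = np.arange(8).reshape(-1, 1)

# No shuffling, so each test fold is a contiguous block of rows.
kf = KFold(n_splits=4)

test_folds = []
for fold, (train_idx, test_idx) in enumerate(kf.split(X_toy), start=1):
    test_folds.append(test_idx.tolist())
    print(f"Fold {fold}: train={train_idx.tolist()} test={test_idx.tolist()}")

# Every row index appears in exactly one test fold.
all_test_rows = sorted(i for fold_rows in test_folds for i in fold_rows)
```

Each row is used for testing once and for training in the other three folds.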


✅ 4️⃣ Stratified K-Fold

Used when: 👉 Classification with imbalanced classes. Each fold keeps roughly the same class proportions as the full dataset.
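
A small sketch of what stratification does, assuming a made-up label vector with 20% positives (`y_strat` and `X_strat` are illustrative names, not part of the notebook's data). It counts the positives that end up in each test fold.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical imbalanced labels: 16 negatives, 4 positives (20% positive).
y_strat = np.array([0] * 16 + [1] * 4)
X_strat = np.zeros((20, 1))  # feature values do not affect how the folds are built

skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)

positives_per_fold = []
fold_sizes = []
for train_idx, test_idx in skf.split(X_strat, y_strat):
    positives_per_fold.append(int(y_strat[test_idx].sum()))
    fold_sizes.append(len(test_idx))

print(positives_per_fold, fold_sizes)
```

With plain K-Fold, a fold could end up with no positives at all, which makes metrics such as recall undefined or misleading for that fold.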


⭐ Model validation metrics

The right metric depends on the problem type.

Regression
	•	R² score
	•	MAE (mean absolute error)
	•	RMSE (root mean squared error)
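
As a rough illustration, the three regression metrics can be computed with `sklearn.metrics` as follows. The four true and predicted values are invented for the example. RMSE is taken as the square root of the mean squared error.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical true and predicted values.
y_true_reg = np.array([3.0, 5.0, 7.0, 9.0])
y_pred_reg = np.array([2.0, 5.0, 8.0, 9.0])

mae = mean_absolute_error(y_true_reg, y_pred_reg)          # average absolute error
rmse = np.sqrt(mean_squared_error(y_true_reg, y_pred_reg))  # penalizes large errors more
r2 = r2_score(y_true_reg, y_pred_reg)                       # share of variance explained

print(f"MAE={mae:.3f} RMSE={rmse:.3f} R2={r2:.3f}")
```

MAE and RMSE are in the units of the target, while R² is unitless.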


Classification
	•	Accuracy
	•	Precision
	•	Recall
	•	F1 Score
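
A minimal sketch of the four classification metrics, using invented labels with 3 true positives, 1 false negative, 1 false positive, and 5 true negatives.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical binary labels (1 = positive class).
y_true_cls = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_cls = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

accuracy = accuracy_score(y_true_cls, y_pred_cls)    # correct predictions / all predictions
precision = precision_score(y_true_cls, y_pred_cls)  # TP / (TP + FP)
recall = recall_score(y_true_cls, y_pred_cls)        # TP / (TP + FN)
f1 = f1_score(y_true_cls, y_pred_cls)                # harmonic mean of precision and recall

print(accuracy, precision, recall, f1)
```

On imbalanced data, accuracy alone can look good even when the minority class is mostly missed, so precision, recall, and F1 are usually reported alongside it.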


⭐ Why validation is critical

Without validation: ❌ The model may fail in production without any warning.

With validation: ✅ A reliable performance estimate before deployment.

In [1]:
import pandas as pd
my_df = pd.read_csv("feature_selection_sample_data.csv")

# Test/Train Split

In [2]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Features: every column except the target
X = my_df.drop(["output"], axis = 1)
# Target: the column we want to predict
y = my_df["output"]

# Regression Model

🔹 1️⃣ Split data into training and testing

Purpose: 👉 Divide the dataset into two parts with train_test_split(X, y, test_size=0.2, random_state=42)

X_train, y_train 👉 Used to train the model.

X_test, y_test 👉 Used to evaluate model performance.

test_size = 0.2
80% → training
20% → testing

random_state = 42
Ensures: 👉 The same shuffle, and therefore the same split, on every run.
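
The effect of `test_size` and `random_state` can be checked on a tiny made-up array (`X_demo` and `y_demo` are illustrative, not the notebook's data): two calls with the same seed should return identical splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 samples, 2 features.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Two independent calls with the same random_state.
a_train, a_test, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
b_train, b_test, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

same_split = np.array_equal(a_test, b_test) and np.array_equal(a_train, b_train)
print(a_train.shape, a_test.shape, same_split)
```

Leaving `random_state` unset gives a different split on each run, which makes scores harder to compare between experiments.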



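What is the R² score?

Measures: 👉 How well predictions match actual values, expressed as the share of the target's variance that the model explains.

Range:
1.0 → Perfect model
0.0 → No better than always predicting the mean of y
Negative → Worse than that mean-only baseline

Example: R² = 0.85 → The model explains 85% of the variance.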
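
The three reference points of the range can be reproduced by hand. The sketch below uses the standard definition R² = 1 - SS_res / SS_tot. The helper `r2_manual` and the sample values are made up for illustration.

```python
import numpy as np

def r2_manual(y_true, y_pred):
    """R² = 1 - (residual sum of squares) / (total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # error left after the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # error of the mean-only baseline
    return 1.0 - ss_res / ss_tot

y_obs = [3.0, 5.0, 7.0, 9.0]

r2_perfect = r2_manual(y_obs, y_obs)                 # predictions equal the truth
r2_mean = r2_manual(y_obs, [6.0, 6.0, 6.0, 6.0])     # always predict the mean
r2_bad = r2_manual(y_obs, [9.0, 7.0, 5.0, 3.0])      # predictions in reverse order

print(r2_perfect, r2_mean, r2_bad)
```

The reversed predictions score -3.0, which shows that R² has no lower bound.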

In [3]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

regressor = LinearRegression()
# Train the model on the training split
regressor.fit(X_train, y_train)
# Make predictions for the unseen test data
y_pred = regressor.predict(X_test)
# Evaluate the model: R² between actual and predicted test targets
r2_score(y_test, y_pred)



0.8305710774942844

In [4]:
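# Classification Model
# For a classification target, pass stratify=y so that the train and test sets
# keep the same class proportions as the full dataset.
#X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42, stratify = y)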
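
A quick sketch of what `stratify` changes, assuming an invented label vector with 20% positives (`X_lbl` and `y_lbl` are illustrative names). It compares the positive rate in the train and test sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels: 80 negatives, 20 positives.
y_lbl = np.array([0] * 80 + [1] * 20)
X_lbl = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_lbl, y_lbl, test_size=0.2, random_state=42, stratify=y_lbl
)

# Both splits keep the 20% positive rate of the full dataset.
train_pos_rate = y_tr.mean()
test_pos_rate = y_te.mean()
print(train_pos_rate, test_pos_rate)
```

Without `stratify`, the positive rate in a small test set can drift noticeably from the overall rate just by chance.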


# Cross Validation
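KFold - for regression

StratifiedKFold - for classification

cv = 4
Means: 👉 4-fold cross validation. When cv is an integer, scikit-learn uses KFold without shuffling for regressors and StratifiedKFold for classifiers.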


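⭐ Why we use cross validation

1. More reliable evaluation

Single split: the score depends on which rows happen to land in the test set.

Cross-validation: 👉 Tests the model multiple times and averages the scores.

2. Detect overfitting

The model is tested on different subsets, so a model that only looks good on one particular split is exposed. Cross-validation does not prevent overfitting by itself. It makes it visible.

3. Better generalization estimate

Shows how the model performs on unseen data. The spread of the fold scores shows how stable that estimate is.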
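
The first point can be seen on synthetic data. The sketch below assumes a small noisy dataset from `make_regression` (not the notebook's CSV). It compares the R² from five different single splits with the four fold scores from cross-validation.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split

# Hypothetical small, noisy regression problem.
X_syn, y_syn = make_regression(n_samples=60, n_features=5, noise=30.0, random_state=0)

# Single train/test splits: the score changes with the seed.
single_split_scores = []
for seed in range(5):
    X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=seed)
    model = LinearRegression().fit(X_tr, y_tr)
    single_split_scores.append(model.score(X_te, y_te))  # .score() returns R² for regressors

# Cross-validation: four scores, summarized by mean and standard deviation.
cv = KFold(n_splits=4, shuffle=True, random_state=42)
fold_scores = cross_val_score(LinearRegression(), X_syn, y_syn, cv=cv, scoring="r2")

print("single splits:", np.round(single_split_scores, 3))
print("CV folds:     ", fold_scores.round(3))
print("CV mean / std:", fold_scores.mean().round(3), fold_scores.std().round(3))
```

Reporting the standard deviation next to the mean gives a sense of how much the estimate could move with a different split.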



In [5]:
from sklearn.model_selection import cross_val_score, KFold, StratifiedKFold

cv_scores = cross_val_score(regressor, X, y, cv = 4, scoring = "r2")

cv_scores.mean()

np.float64(0.6379038172153191)

# Regression


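n_splits = 4

Means: 👉 Data divided into 4 parts.
So: 👉 Model trained and tested 4 times.

shuffle=True
Important: 👉 Randomly shuffles data before splitting. This matters when the rows are stored in a meaningful order (for example sorted by the target), because unshuffled folds would then not be representative.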
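
The only difference between the `cv = 4` call and the `KFold(..., shuffle=True)` call in this notebook is the splitter, so differing means are expected. The sketch below illustrates this on synthetic data that is deliberately sorted by target, which is an assumption made for the demo and not a claim about the CSV used here.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Hypothetical data, sorted by target to mimic an ordered file.
X_raw, y_raw = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
order = np.argsort(y_raw)
X_sorted, y_sorted = X_raw[order], y_raw[order]

model = LinearRegression()

# An integer cv on a regressor means KFold without shuffling.
scores_int = cross_val_score(model, X_sorted, y_sorted, cv=4, scoring="r2")
scores_plain = cross_val_score(model, X_sorted, y_sorted, cv=KFold(n_splits=4), scoring="r2")

# Shuffling first mixes low and high targets into every fold.
shuffled_cv = KFold(n_splits=4, shuffle=True, random_state=42)
scores_shuffled = cross_val_score(model, X_sorted, y_sorted, cv=shuffled_cv, scoring="r2")

print("cv=4:              ", scores_int.round(3))
print("KFold, no shuffle: ", scores_plain.round(3))
print("KFold, shuffled:   ", scores_shuffled.round(3))
```

The first two lines print identical scores, while the shuffled folds usually differ because each test fold then covers the full range of target values.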



In [6]:
cv = KFold(n_splits = 4, shuffle = True, random_state = 42)
cv_scores = cross_val_score(regressor, X, y, cv = cv, scoring = "r2")
cv_scores.mean()

np.float64(0.7078051873514348)

In [None]:
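# Classification
# Note: this cell assumes X and y come from a classification dataset,
# i.e. y holds discrete class labels rather than a continuous target.
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression()
# StratifiedKFold keeps the class proportions the same in every fold
cv = StratifiedKFold(n_splits = 4, shuffle = True, random_state = 42)
cv_scores = cross_val_score(clf, X, y, cv = cv, scoring = "accuracy")
cv_scores.mean()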
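
Since the cell above was not executed, here is a self-contained sketch of the same pattern on synthetic data. The dataset from `make_classification` and the `max_iter=1000` setting are assumptions for the demo, not part of the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Hypothetical imbalanced binary problem: roughly 80% class 0, 20% class 1.
X_cls, y_cls = make_classification(
    n_samples=200, n_features=5, n_informative=3, n_redundant=0,
    weights=[0.8, 0.2], random_state=0,
)

clf_demo = LogisticRegression(max_iter=1000)  # higher max_iter to help the solver converge
cv_demo = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)

acc_scores = cross_val_score(clf_demo, X_cls, y_cls, cv=cv_demo, scoring="accuracy")
print(acc_scores.round(3), acc_scores.mean().round(3))
```

For imbalanced labels like these, swapping `scoring="accuracy"` for `scoring="f1"` is often more informative.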