<a href="https://colab.research.google.com/github/tanishavaishya18/python-basics/blob/main/Day12_CV_Leakage_Tuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
## CROSS VALIDATION, DATA LEAKAGE AND HYPERPARAMETER TUNING

A single train/test split often gives an unreliable estimate of performance.
The measured accuracy depends heavily on which samples happen to land in the test set.


*   One lucky split - overly optimistic accuracy
*   One unlucky split - overly pessimistic accuracy
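
A minimal sketch of this effect, reusing the wine dataset and logistic regression that this notebook uses further down: the same model is scored on ten different random splits. The loop over `random_state` values and the name `accuracies` are illustrative choices, not part of the original notebook, and the exact numbers depend on the scikit-learn version.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

x, y = load_wine(return_X_y=True)

accuracies = []
for seed in range(10):  # ten different splits of the same data
    x_train, x_test, y_train, y_test = train_test_split(
        x, y, test_size=0.2, random_state=seed
    )
    model = LogisticRegression(max_iter=10000)
    model.fit(x_train, y_train)
    accuracies.append(model.score(x_test, y_test))

# The spread between these two values comes from the split alone.
print("lowest accuracy :", min(accuracies))
print("highest accuracy:", max(accuracies))
```

If the lowest and highest values differ noticeably, a single split is not enough to judge the model.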

**Cross Validation**

The data is split into k folds. The model is then trained k times, and each fold acts as the test set exactly once while the remaining k-1 folds are used for training. The reported accuracy is the average over all k folds.
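
The splitting step can be sketched in plain Python. This is only an illustration of the idea: `k_fold_indices` is a hypothetical helper, it uses contiguous folds without shuffling, and scikit-learn's own splitters are more careful (for classifiers, `cross_val_score` with an integer `cv` uses stratified folds).

```python
def k_fold_indices(n_samples, k):
    """Split indices 0..n_samples-1 into k contiguous folds of near-equal size."""
    # The first (n_samples % k) folds get one extra sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


folds = k_fold_indices(n_samples=10, k=5)

for i, test_idx in enumerate(folds):
    # Every fold except the current one forms the training set.
    train_idx = [j for fold in folds if fold is not test_idx for j in fold]
    print(f"fold {i}: train={train_idx} test={test_idx}")
```

Each index appears in exactly one test fold, so every sample is used for testing once and for training k-1 times.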

**Data Leakage**

Data leakage happens when information from the test data leaks into training.
For example:

*   Scaling before splitting
*   Feature engineering using the full dataset
*   Using future data to predict the past

Leakage produces artificially high accuracy that does not hold up on new data.
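
For the third case, one way to avoid training on the future is to split time-ordered data by position instead of at random. A minimal sketch, assuming the rows are already sorted by time; the `records` list and the 80/20 cut are made up for illustration.

```python
# Hypothetical time-ordered records: (day, value), already sorted by day.
records = [(day, day * 2) for day in range(1, 11)]

cut = int(len(records) * 0.8)  # first 80% for training, last 20% for testing
train, test = records[:cut], records[cut:]

# Every training day precedes every test day, so no future data is used.
print("last train day:", train[-1][0])
print("first test day:", test[0][0])
```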


In [5]:
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split


data = load_wine()
x= data.data
y=data.target

In [6]:
model= LogisticRegression(max_iter = 10000)
scores = cross_val_score(model, x, y, cv=5)  # 5-fold cross validation
scores  # NumPy array with the accuracy score of each fold

array([0.97222222, 0.91666667, 0.91666667, 1.        , 1.        ])

In [7]:
print("CV scores:", scores)
print("mean CV score:", scores.mean())
print("Std Dev:", scores.std())   #mean+-std dev is how papers report performance

CV scores: [0.97222222 0.91666667 0.91666667 1.         1.        ]
mean CV score: 0.961111111111111
Std Dev: 0.03767961101736262


In [8]:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
x_train, x_test, y_train, y_test = train_test_split(
    x, y,
    test_size=0.2,
    random_state=42
)

scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train)
x_test_scaled = scaler.transform(x_test)

**Wrong code that creates leakage (the cell above is correct)**

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

scaler = StandardScaler()
x_scaled = scaler.fit_transform(x)  # WRONG: the scaler sees the test rows too

x_train, x_test, y_train, y_test = train_test_split(x_scaled, y, test_size=0.2)


---



**Golden rule**

Fit on train, transform on test.

Why does it matter?

*   fit = learn from data; the test data must not teach the model anything
*   prevents data leakage, thus giving an honest performance estimate

scaler.fit(x_train) - the scaler learns the mean and standard deviation of the training data

scaler.transform(x_test) - does not learn anything new; it reuses the same mean and standard deviation

transform standardises the data using parameters that were already learned during fit.
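
The split between fit and transform can be sketched with a tiny stand-in class. `SimpleScaler` is a hypothetical one-feature scaler written for illustration, not scikit-learn's implementation; it only mirrors the idea that parameters are learned in `fit` and reused in `transform`.

```python
import statistics


class SimpleScaler:
    """Hypothetical one-feature scaler that mimics the fit/transform split."""

    def fit(self, values):
        # Learn the parameters from the data passed in (training data only).
        self.mean_ = statistics.fmean(values)
        self.std_ = statistics.pstdev(values)  # population std (divides by n)
        return self

    def transform(self, values):
        # Apply the stored parameters; nothing new is learned here.
        return [(v - self.mean_) / self.std_ for v in values]


train = [1.0, 2.0, 3.0]
test = [2.0, 10.0]

scaler = SimpleScaler().fit(train)  # mean and std come from train only
print(scaler.transform(test))       # test is scaled with the train statistics
```

Note that the test value 10.0 has no influence on the stored mean or standard deviation, which is exactly what the golden rule asks for.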


In [10]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
pipe = Pipeline(
    [("Scaler", StandardScaler()),
     ("model", LogisticRegression(max_iter=500))]
)
scores = cross_val_score(pipe,x,y,cv=5)
scores.mean()

np.float64(0.9831746031746033)

**PIPELINES**

Pipelines are a structured way to connect multiple steps of an ML workflow, such as preprocessing and modelling, so that they act as a single unit.

Why are pipelines important?

*   prevent data leakage (fit on training data, transform on test data)
*   work correctly with cross validation (for each fold, preprocessing is fit only on the training folds and then applied to the validation fold)
*   cleaner and safer code
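
What a pipeline does inside cross validation can be written out by hand. The sketch below assumes that `cross_val_score` with `cv=5` on a classifier uses unshuffled stratified folds, so the manual loop and the pipeline should see the same splits; `manual_scores` and `pipe_scores` are names chosen for this example.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

x, y = load_wine(return_X_y=True)

manual_scores = []
for train_idx, val_idx in StratifiedKFold(n_splits=5).split(x, y):
    # Fit the scaler on the training folds only.
    scaler = StandardScaler().fit(x[train_idx])
    model = LogisticRegression(max_iter=500)
    model.fit(scaler.transform(x[train_idx]), y[train_idx])
    # The validation fold is only transformed, never fitted on.
    manual_scores.append(model.score(scaler.transform(x[val_idx]), y[val_idx]))

pipe = Pipeline([("scaler", StandardScaler()),
                 ("model", LogisticRegression(max_iter=500))])
pipe_scores = cross_val_score(pipe, x, y, cv=5)

print("manual  :", np.round(manual_scores, 4))
print("pipeline:", np.round(pipe_scores, 4))
```

The pipeline version produces the same per-fold bookkeeping in two lines, which is why it is the safer habit.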






In [12]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    'model__C': [0.01, 0.1, 1, 10, 100]
}  # tells the grid search which hyperparameter values to try
# model__C - the double underscore means "parameter C of the pipeline step named model"

grid = GridSearchCV(
    pipe,
    param_grid,  # the dictionary of hyperparameters to search
    cv=5,
    scoring='accuracy'
)  # scoring='accuracy' - metric used to evaluate each combination


grid.fit(x,y)

In [13]:
print('Best parameters:', grid.best_params_)
print("Best CV score:", grid.best_score_)

Best parameters: {'model__C': 0.1}
Best CV score: 0.9833333333333332


**What are hyperparameters?**

Hyperparameters are settings you choose before training a model. They control how the model learns, but are not learned from the data itself.

For example, in LogisticRegression(C=0.1, max_iter=500), both C (the inverse regularisation strength tuned above) and max_iter are hyperparameters.
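
The difference between hyperparameters and learned parameters can be checked directly on an estimator. A small sketch on the wine data; the value `C=0.1` is just an example setting.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression

x, y = load_wine(return_X_y=True)

model = LogisticRegression(C=0.1, max_iter=10000)

# Hyperparameters: chosen by us, available before fit() is ever called.
params = model.get_params()
print("C:", params["C"], "max_iter:", params["max_iter"])

model.fit(x, y)

# Learned parameters: exist only after fit(), estimated from the data.
# One row of weights per class, one column per feature.
print("coef_ shape:", model.coef_.shape)
```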


**Grid Search**

Grid search helps to find the best hyperparameters for an ML model by

*  trying every combination of the hyperparameter values listed in the grid
*  evaluating each combination with cross validation
*  selecting the combination that performs best on average
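
These three steps can be written as an explicit loop. This is a simplified sketch of the idea, not how `GridSearchCV` is implemented; `mean_scores` and `best_C` are names chosen for this example, and the grid has a single hyperparameter so "every combination" is just every value of `C`.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

x, y = load_wine(return_X_y=True)

mean_scores = {}
for C in [0.01, 0.1, 1, 10, 100]:  # step 1: try every value in the grid
    pipe = Pipeline([
        ("scaler", StandardScaler()),
        ("model", LogisticRegression(C=C, max_iter=500)),
    ])
    # step 2: evaluate this setting with 5-fold cross validation
    mean_scores[C] = cross_val_score(pipe, x, y, cv=5, scoring="accuracy").mean()

# step 3: keep the setting with the best average score
best_C = max(mean_scores, key=mean_scores.get)
print("mean CV accuracy per C:", mean_scores)
print("best C:", best_C)
```

In addition to this loop, `GridSearchCV` by default refits the best setting on the full data it was given, so the fitted `grid` object can be used for prediction directly.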



In [None]:
#Reflection

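* CV gives more reliable performance estimates than a single split
* Leakage causes artificially high accuracy
* Pipelines are safer because preprocessing is fit only on training data
* Hyperparameters matter and should be tuned with cross validation
* ML is about process, not just algorithms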