# Data Splitting and Hyperparameter Tuning in Machine Learning

Proper data splitting is essential before performing any model evaluation or hyperparameter tuning. This notebook demonstrates how to split data, tune hyperparameters using `GridSearchCV`, and evaluate model performance correctly.

> For information on how to handle data imbalance, refer to:  
> [data_balancing.ipynb](./data_balancing.ipynb)


## Order of Operations

1. Split full dataset into training and test sets
2. Use only the training set for cross-validation & tuning
3. Use the test set strictly for final performance evaluation

In [1]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target

# Stratified train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

- **train_test_split**: A function from `sklearn.model_selection` used to split the dataset into training and test subsets.
- **X**: The input features (independent variables).
- **y**: The target labels (dependent variable).
- **test_size=0.2**: Specifies that 20% of the data will be used as the test set, and 80% as the training set.
- **stratify=y**: Ensures that the class distribution of the target variable `y` is maintained proportionally in both training and test sets (important for classification).
- **random_state=42**: Sets the random seed for reproducibility, so the split is the same every time this code is run.

This split prepares the data for training models on the training set and evaluating performance on the unseen test set.
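To see what `stratify=y` does in practice, the class counts in each subset can be checked directly. The following is a minimal sketch that repeats the split from the cell above; the names `train_counts` and `test_counts` are introduced here for illustration only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Same data and split settings as in the cell above
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Count how many samples of each class ended up in each subset
train_counts = np.bincount(y_train)
test_counts = np.bincount(y_test)

print("Train class counts:", train_counts)
print("Test class counts:", test_counts)
```

Iris contains 50 samples per class, so a stratified 80/20 split should place 40 samples of each class in the training set and 10 in the test set.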

## Hyperparameter Tuning Using GridSearchCV

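Once the data is split, we apply `GridSearchCV` on the **training set only**. For each candidate in the parameter grid, it runs cross-validation within the training data and selects the candidate with the best mean validation score. By default (`refit=True`), it then refits that candidate on the whole training set.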

In [2]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Define parameter grid
param_grid = {'C': [0.1, 1, 10]}

# Set up GridSearchCV on training set
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid=param_grid,
    cv=cv,
    scoring='accuracy'
)

# Fit only on training data
grid.fit(X_train, y_train)

print("Best parameters:", grid.best_params_)

Best parameters: {'C': 10}
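Beyond `best_params_`, the fitted search object keeps the cross-validation scores of every candidate in its `cv_results_` attribute. The sketch below repeats the setup from the cells above so that it runs on its own, then prints the mean and standard deviation of the validation accuracy for each value of `C`. The variable name `results` is illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

# Same data, split, and search settings as in the cells above
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={'C': [0.1, 1, 10]},
    cv=cv,
    scoring='accuracy',
)
grid.fit(X_train, y_train)

# cv_results_ holds one entry per candidate in the grid
results = grid.cv_results_
for params, mean, std in zip(
    results['params'], results['mean_test_score'], results['std_test_score']
):
    print(f"{params}: mean accuracy = {mean:.3f} (std = {std:.3f})")
```

Note that `mean_test_score` refers to the held-out fold inside cross-validation, not to the final test set. Comparing the means against their standard deviations gives a rough sense of whether the winning candidate is clearly better or only marginally so.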


**Regularization** is a technique used to prevent a model from overfitting the training data.
The parameter **C** is the inverse of the regularization strength.

- A smaller **C** means stronger regularization, which reduces model complexity.  
- A larger **C** means weaker regularization, allowing the model to fit the data more closely.

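`GridSearchCV` fits the model with each value of **C** in `param_grid` and selects the value with the highest mean cross-validated accuracy. Here the selected value, `C=10`, is the largest in the grid, so it may be worth extending the grid to larger values.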
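One way to observe the effect of **C** is to compare the size of the learned coefficients across the grid. The sketch below assumes the default L2 penalty of `LogisticRegression` and uses the coefficient norm as a rough proxy for model complexity; `coef_norms` is a name introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Same data and split settings as in the cells above
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Fit one model per C and record the overall norm of its coefficient matrix
coef_norms = {}
for C in [0.1, 1, 10]:
    model = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
    coef_norms[C] = float(np.linalg.norm(model.coef_))
    print(f"C={C}: coefficient norm = {coef_norms[C]:.3f}")
```

The norm should grow as `C` increases, since weaker regularization penalizes large coefficients less.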

## Final Evaluation on Test Set

Evaluate the best model using the **test set**, which has not been seen during training or tuning.


In [3]:
from sklearn.metrics import accuracy_score

# Predict on the test set
best_model = grid.best_estimator_
y_pred = best_model.predict(X_test)

# Accuracy score
print("Test accuracy:", accuracy_score(y_test, y_pred))

Test accuracy: 1.0
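A single accuracy figure hides which classes are confused with each other, and the test set here holds only 30 samples. As a complement, the sketch below prints a confusion matrix for the tuned model. It repeats the earlier setup so that it runs on its own; `cm` is an illustrative name.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

# Same data, split, and search settings as in the cells above
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={'C': [0.1, 1, 10]},
    cv=cv,
    scoring='accuracy',
)
grid.fit(X_train, y_train)

# Rows are true classes, columns are predicted classes
y_pred = grid.best_estimator_.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
print(cm)
print("Number of test samples:", len(y_test))
```

With so few test samples, each misclassification shifts accuracy by several percentage points, so the score is best read as a rough estimate.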


## Summary

| Step             | Purpose                                    |
|------------------|--------------------------------------------|
| **Train/Test Split** | Isolate unseen data for final evaluation   |
| **GridSearchCV**     | Tune hyperparameters using cross-validation |
| **Test Evaluation**  | Measure generalisation performance (no tuning) |

**Never use the test set during model tuning. Doing so leaks information from the test set into model selection, which makes the reported performance optimistically biased and unreliable.**