# 🔷 Part A: Train and Test ML Models

We’ll learn:

- Why we split data into train and test sets

- Use train_test_split() from sklearn.model_selection

- Understand random_state and why it matters

- Train a model on training data

- Test the model on test data

- Calculate accuracy (and later precision/recall/F1 if needed)

- Understand what data leakage is and how to avoid it

# 🔷 Part B: Model Validation Techniques

We’ll cover slightly more advanced validation tools:

- k-Fold Cross-Validation

Use cross_val_score() to train/test model multiple times

- GridSearchCV (Intro)

Automatically try different model parameters and find the best

- classification_report

See precision, recall, F1-score all at once, class-wise

## 🔷 Part A: Train and Test ML Models

### 🧠 Why Split the Data?

When training an ML model, we want to test if it can predict on unseen data.

So we divide our data into:

- Training Set (e.g., 70%): The model learns from this.

- Test Set (e.g., 30%): Used only to evaluate how well the model performs on new data.

This avoids overfitting, which happens when the model memorizes the training data instead of learning patterns.

### 🧠 Understand random_state and why it matters

When we split data using train_test_split, it selects random samples for training and testing.

- If you don’t use random_state, the split will be different each time you run the code.

- If you set random_state to a number (like 42), you make the randomness consistent and reproducible.

📌 This is important for reproducibility. For example, during debugging, testing, or sharing notebooks with others, we want the results to be exactly the same every time.

### 🔍 Breast Cancer Dataset (Overview)

- 🎯 **Goal**: Predict if a tumor is malignant (cancerous) or benign (non-cancerous)

- 🔢 **Target**: Binary (0 = malignant, 1 = benign)

- 📊 **Features**: 30 numerical features (e.g., mean radius, texture, perimeter)

**🔷 Step 1: Import Libraries and Load the Dataset**

In [1]:
# 1. Import required libraries
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# load_breast_cancer – loads the dataset
# train_test_split – used to split the data into training and testing sets

**🔷 Step 2: Load the Dataset**

In [2]:
# 2. Load the dataset
data = load_breast_cancer()
x = data.data # 30 features (inputs)
y = data.target # Target labels: 0 (malignant), 1 (benign)

# X: feature matrix of shape (569, 30)
# y: label vector of shape (569,)

**🔷 Step 3: Split into Train/Test Sets**

In [3]:
# 3. Split data (70% train, 30% test)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)

# test_size=0.3: 30% for testing, 70% for training
# random_state=42: keeps results consistent every time

**🔷 Step 4: Confirm the Split**

In [4]:
# 4. Check the shape of each split
print("x_train Shape:", x_train.shape)
print("x_test Shape:", x_test.shape)
print("y_train Shape:", y_train.shape)
print("y_test Shape:", y_test.shape)

# This confirms that data was split correctly.

x_train Shape: (398, 30)
x_test Shape: (171, 30)
y_train Shape: (398,)
y_test Shape: (171,)


**✅ Step 5: Train a Model on Training Data**

We’ll use the RandomForestClassifier here — a very good and reliable model for classification.

In [5]:
# 5. Train a Model
from sklearn.ensemble import RandomForestClassifier #RandomForestClassifier: A tree-based ensemble model that combines many decision trees to make a stronger model.

# Create the model
rf_model = RandomForestClassifier(random_state=42) #random_state=42: Ensures results are reproducible — you'll get the same model every time.

# Train (fit) the model on training data
rf_model.fit(x_train, y_train) #.fit(x_train, y_train): Trains the model using the training data.

**✅ Step 6: Test the Model on Test Data and Calculate Accuracy**

In [6]:
# Step 6: Make Predictions and Evaluate Accuracy
from sklearn.metrics import accuracy_score

# Predict on test set
y_pred = rf_model.predict(x_test) #y_pred = rf_model.predict(x_test): Makes predictions for the unseen test data.

# Compare predicted vs actual
accuracy = accuracy_score(y_test, y_pred) #accuracy_score(y_test, y_pred): Compares predicted labels to actual labels and returns the percentage of correct predictions.

print(f"Accuracy on Test Set: {accuracy:.2f}")

Accuracy on Test Set: 0.97


**Test Accuracy: 0.97**

📌 This means 97% of the predictions on the test set were correct.
Very high accuracy — shows the model is performing very well overall.

### ✅ When do we need more than just accuracy?

**Accuracy is enough:**

- When the dataset is balanced (roughly equal number of each class)

- When all errors matter equally

✅ For example, Breast Cancer dataset is fairly balanced, so accuracy alone gives a decent idea of performance.

**But... Accuracy is not enough:**
    
We also calculate Precision, Recall, and F1-score when:

- The dataset is imbalanced (e.g., 90% Class A, 10% Class B)

- We care about certain types of errors more than others
e.g., Missing a cancer case (false negative) is worse than a false alarm (false positive)

That’s why we include them for better model insight even on balanced data — it's best practice. ✅

### ✅ Let's add Precision, Recall, and F1-Score

In [7]:
from sklearn.metrics import classification_report # classification_report: Combines precision, recall, F1-score, and support in one clean summary.

# Generate precision, recall, f1-score, and support
report = classification_report(y_test, y_pred, target_names=["Malignant", "Benign"]) # target_names: Names for class 0 and 1 — in breast cancer data: 0 = Malignant (dangerous), 1 = Benign (not dangerous)

print("Classification Report:")
print(report)

Classification Report:
              precision    recall  f1-score   support

   Malignant       0.98      0.94      0.96        63
      Benign       0.96      0.99      0.98       108

    accuracy                           0.97       171
   macro avg       0.97      0.96      0.97       171
weighted avg       0.97      0.97      0.97       171



 
### 🧾 Let’s decode your **Classification Report:**

| Class             | Precision | Recall | F1-score | Support |
| ----------------- | --------- | ------ | -------- | ------- |
| **Malignant (0)** | 0.98      | 0.94   | 0.96     | 63      |
| **Benign (1)**    | 0.96      | 0.99   | 0.98     | 108     |

### What each part means:

**🔹 Precision**

***- Malignant: 0.98***

Out of all the times model predicted "malignant", 98% were actually malignant.
Great for reducing false positives (labeling someone as sick when they’re not).

***- Benign: 0.96 — Also great!***

**🔹 Recall**

***- Malignant: 0.94***

Out of all the actual malignant cases, it caught 94%.
This is very important in medical problems — you don’t want to miss a real cancer case.

***- Benign: 0.99 — Excellent!***

**🔹 F1-score**

This balances precision and recall.

***- Malignant: 0.96***

***- Benign: 0.98***

Both are high — meaning your model is very balanced and strong.

**🔍 What are:**

***📊 Macro avg***

Just the average of all classes (treats all classes equally).

***📊 Weighted avg***

Average, but takes class frequency into account (Benign has more examples than Malignant).

### ✅ Summary (in simple terms):

- Model is doing a great job at both detecting cancer and avoiding false alarms.

- Recall for malignant is 94%, which is strong — but in critical fields like cancer detection, we often want even higher recall (maybe tune it later using GridSearchCV or better preprocessing).

- You're now using complete evaluation, not just accuracy — this makes your analysis professional and trustworthy 💼✔️

### 🛑 Part A (Final Step): Understanding and Preventing Data Leakage

**🔍 What is Data Leakage?**

Data Leakage happens when information from outside the training dataset (usually the test set or future data) is used to train the model.

This gives your model an unfair advantage — like cheating in an exam!

**🚨 Why is it bad?**

- The model looks like it performs extremely well during training.

- But when it's deployed on real unseen data, it fails badly.

- It gives you a false sense of accuracy.

**💡 Real-Life Example of Data Leakage:**

Imagine you’re predicting if a patient has cancer.

- You accidentally include a column like "biopsy_result" in the training features — which already reveals if the patient has cancer.

- Model learns this shortcut and gets 100% accuracy.

- But in real hospital use, you won’t have that result before prediction, so your model fails.

**✅ Common Causes of Data Leakage:**

| Mistake                                                        | Why it leaks                    |
| -------------------------------------------------------------- | ------------------------------- |
| Using **test data** during training                            | Model "sees" answers early      |
| Applying **scaling or encoding** to full data before splitting | Info from test leaks into train |
| Including **target-related features**                          | Model learns from future        |

**✅ How to Avoid Data Leakage:**

- Always split your data first (did this ✅).

- Only fit scalers/encoders on training data, then apply to test data.

- Be careful with feature selection — don’t include future or label-based columns.

- In pipelines (learn later), scikit-learn helps prevent leakage automatically.

**✅ So did we prevent leakage today?**

Yes! ✅

Let’s double-check what we did:

| Step                                           | Safe? | Why?                       |
| ---------------------------------------------- | ----- | -------------------------- |
| Used `train_test_split()` **before training**  | ✅     | Prevented future info leak |
| Didn’t apply scaling or encoding yet           | ✅     | No leakage risk            |
| Trained model only on `x_train`, not full data | ✅     | Proper practice            |
| Evaluated on untouched `x_test`                | ✅     | Realistic test             |

Perfect!

**🎯 Summary:**

- Data leakage ruins your model’s real-world usefulness.

- But by following good practices like splitting early, training only on training data, and avoiding future info, you keep your models honest and reliable.

## 🔷 Part B: Model Validation Techniques

**✅ Step 1: k-Fold Cross-Validation using cross_val_score()**

- What it is: Instead of training on 1 fixed train-test split, the model is trained/tested on k different folds.

- Why: It gives you a more reliable estimate of model performance.

**✅ Step 2: GridSearchCV (Intro)**

- What it does: Tries many combinations of model parameters automatically.

- Why: Helps you tune hyperparameters and find the best version of your model.

**✅ Step 3: classification_report**

- We already used this once 👍

- We'll review how it shows precision, recall, F1-score, and support class-wise.

### ✅ Step 1: k-Fold Cross-Validation using cross_val_score()

**🔍 What is k-Fold Cross-Validation?**

Normally, we split data once into training and test sets. But that gives performance on just one split.

In k-Fold Cross-Validation, the data is split into k equal parts (“folds”):

- Train the model on k-1 folds

- Test on the remaining 1 fold

- Repeat the process k times, each time using a different fold as test set

- Take the average score

✅ This gives a better estimate of how the model will perform on unseen data.

In [8]:
# 1. Import cross_val_score
from sklearn.model_selection import cross_val_score #from sklearn.model_selection import cross_val_score → Imports the function that does k-fold cross-validation for you.
from sklearn.ensemble import RandomForestClassifier # already did this before

# 2. Create the model again
model = RandomForestClassifier(random_state=42) # already did this before
# model = RandomForestClassifier(random_state=42) → You define the model you want to validate.

# 3. Apply k-fold cross-validation (default k=5)
# cross_val_score does 5-fold cross-validation by default
cv_scores = cross_val_score(model, x, y, cv=5) # full data (x, y), not just train or test
# cross_val_score(model, x, y, cv=5)
# → Tells scikit-learn to:
# - Use the model stored in model veriable i.e. RandomForestClassifier
# - Use the whole dataset x and y
# - Split it into 5 parts (folds)
# - Train/test 5 times
# - Return the score (accuracy) for each fold

# 4. Print individual scores and average score
print("Cross-Validation Scores (Each Fold):", cv_scores)
print("Average Cross-Validation Score:", cv_scores.mean())
# cv_scores.mean()
# → Calculates the average accuracy across all 5 folds.

Cross-Validation Scores (Each Fold): [0.92105263 0.93859649 0.98245614 0.96491228 0.97345133]
Average Cross-Validation Score: 0.9560937742586555


### ✅ What We’ve Achieved Just Now:

**🔁 k-Fold Cross-Validation Recap:**

- It splits the entire dataset into k parts (you chose cv=5 → 5 parts).

- The model is trained and tested 5 times, each time using a different fold for testing.

- This gives a better, more reliable performance estimate than just one train/test split.

### ✅ Step 3: GridSearchCV 

**🔷 What is GridSearchCV?**

GridSearchCV helps us:

- Try out different values for hyperparameters (like number of trees, max depth, etc.)

- Evaluate each combination using cross-validation

- Pick the best model based on performance (like highest accuracy)

In [9]:
#Step 1: Import necessary tools
from sklearn.model_selection import GridSearchCV # GridSearchCV: This is the main tool we’ll use to automatically try different parameter values and find the best model
from sklearn.ensemble import RandomForestClassifier # The model we want to tune (a powerful ensemble of decision trees).

#step 2: Create the model
model = RandomForestClassifier(random_state=42) #random_state=42 ensures we get the same result every time we run it (for reproducibility).

#Step 3: Define the grid of parameters
# We are giving multiple values to test for each hyperparameter:
param_grid = {
    "n_estimators" : [10,50,100], #'n_estimators': Try models with 10, 50, and 100 trees.
    "max_depth" : [None,5,10], # Try no limit, depth 5, and depth 10 
    "min_samples_split" : [2,5], # Try splitting a node at 2 or 5 samples
} # We’re giving GridSearchCV 3 options for n_estimators, 3 for max_depth, and 2 for min_samples_split — so it will try 3 × 3 × 2 = 18 combinations of these settings.

#Step 4: Set up GridSearchCV
grid_search = GridSearchCV(
    estimator=model, #estimator=model: The model to tune. (model=RandomForestClassifier(random_state=42)).
    param_grid=param_grid, #param_grid=...: The dictionary of parameters to try that we made earlier.
    cv=5, #cv=5: Perform 5-fold cross-validation.
    scoring="accuracy", #scoring='accuracy': Evaluate how accurate the model is for each combination.
)
# What this tells the computer:
# Use the RandomForestClassifier as our model.
# Try every combination from param_grid.
# For each combination, perform 5-fold cross-validation.
# Use accuracy to evaluate how good the model is.

#Step 5: Fit the model
grid_search.fit(x,y) 
# It trains and evaluates the model 18 times (once for each parameter combination), using 5-fold cross-validation for each — so 90 model fits in total.
# It remembers which combination gave the best accuracy.

#Step 6: Print the results
print("Best Parameters:", grid_search.best_params_) # .best_params_: Show which combination gave the best results.
print("Best Score:", grid_search.best_score_) # .best_score_: Show the average accuracy from cross-validation for the best combination.
# Tells us which combination of parameters gave the best result.
# Shows the best accuracy score achieved during the search.

Best Parameters: {'max_depth': None, 'min_samples_split': 5, 'n_estimators': 50}
Best Score: 0.9631268436578171


**✅ What This Means:**

1)n_estimators = 50

- Your model performs best when it builds a forest of 50 trees.

2)max_depth = None

- This allows each tree to grow fully (no depth limit). This likely helps capture more patterns in the data.

3)min_samples_split = 5

- A node must have at least 5 samples to be split further. This helps prevent overfitting.

4)Best Score ≈ 96.3%

- This is the average accuracy across all 5 folds, which means your model generalizes well.

**🧠 Why GridSearchCV Matters:**

- You tried 18 different combinations (3×3×2) and picked the one with the best balance of performance.

- This removes the guesswork and helps build the most reliable and optimized model.

### ✅ Step 4: classification_report

**🧠 Why are we doing this?**

***✅ Goal:***

Now that we’ve fine-tuned our model using GridSearchCV, we want to evaluate the final best-performing model using detailed classification metrics:

- Precision – Out of all the positive predictions, how many were correct?

- Recall – Out of all actual positives, how many were correctly predicted?

- F1-score – Balance between precision and recall.

- Support – Number of true instances for each class.

This gives us a complete picture of how well our optimized model performs.

In [10]:
from sklearn.metrics import classification_report

# Step 1: Use the best model selected by GridSearchCV
best_model = grid_search.best_estimator_  #grid_search.best_estimator_	Retrieves the best model found during GridSearchCV tuning

# Step 2: Predict on the test set
y_pred_best = best_model.predict(x_test) #.predict(x_test)	Uses the best model to make predictions on the test set

# Step 3: Generate a detailed classification report
report = classification_report(
    y_pred,
    y_pred_best,
    target_names=["Malignant", "Benign"]
) #classification_report()	Calculates precision, recall, F1-score, and support for each class

print("Classification Report for Final Tuned Model:\n")
print(report)

Classification Report for Final Tuned Model:

              precision    recall  f1-score   support

   Malignant       0.94      0.98      0.96        60
      Benign       0.99      0.96      0.98       111

    accuracy                           0.97       171
   macro avg       0.96      0.97      0.97       171
weighted avg       0.97      0.97      0.97       171



**💡 Key Takeaways:**
    
- Malignant (Cancerous):

  Precision: 94% – Out of all predicted "Malignant", 94% were correct.

  Recall: 98% – Out of all actual "Malignant", 98% were caught. 🔥

  F1-score: Balanced and high at 96%.

- Benign (Non-Cancerous):

  Precision: 99% – Very few false positives.

  Recall: 96% – Almost all actual benign cases were correctly detected.

  F1-score: 98% – Very strong.

- Macro avg treats both classes equally.

- Weighted avg accounts for class imbalance (more benign cases).