
# Hands-On Machine Learning: Core Topics

This notebook mirrors the style of the attached hands-on template and walks you through **six core topics** in Machine Learning:

1. Supervised Learning – Regression  
2. Supervised Learning – Classification  
3. Unsupervised Learning – Clustering  
4. Dimensionality Reduction  
5. Cross-Validation  
6. Bias–Variance and Overfitting

> **How to use this notebook**
> - Run each cell in order (Shift+Enter).
> - Read the short theory blocks, then experiment by changing parameters.
> - Complete the **Try it / Exercises** prompts sprinkled throughout.



## Learning Objectives

By the end of this notebook, you will be able to:

- Train and evaluate simple **regression** and **classification** models.
- Apply **K-Means** and **DBSCAN** for clustering and reason about their differences.
- Use **PCA** to reduce dimensionality and visualize high-dimensional datasets.
- Perform **K-fold cross-validation** and **grid search** for model selection.
- Interpret **learning curves** and **validation curves** to diagnose **under/overfitting**.



## Setup

We use common scientific Python libraries. If something is missing in your environment, install it via pip (uncomment the line below).
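
The setup cell itself is not shown in this text, so the following is only a minimal sketch of what it might contain. The package list in the commented install line and the `RANDOM_STATE` seed are assumptions, not part of the original notebook.

```python
# Uncomment the next line if a package is missing (package list is an assumption):
# !pip install numpy matplotlib scikit-learn

import numpy as np
import matplotlib.pyplot as plt
import sklearn

RANDOM_STATE = 42  # hypothetical seed, reused so results are reproducible
np.random.seed(RANDOM_STATE)

print("NumPy:", np.__version__, "| scikit-learn:", sklearn.__version__)
```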



## 1. Supervised Learning – Regression


### Goal:
Predict a continuous target (e.g., house price).  
We start with a synthetic dataset and compare **Linear Regression** and **Ridge Regression**.

**Why Ridge?** It adds an L2 penalty that shrinks coefficients, which often improves test performance when features are noisy or correlated.

**Metrics:** **MSE** (↓ better), **R²** (↑ better; can be < 0 on test).
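
For reference, the L2 penalty and the two metrics can be written out as follows. The notation ($w$ for the coefficients, $\hat{y}_i$ for predictions, $\bar{y}$ for the mean of the true targets in the evaluated set) is introduced only for this sketch, and the Ridge objective is shown without the intercept for brevity.

$$
\min_{w} \; \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_2^2
$$

$$
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \qquad
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
$$

Setting $\alpha = 0$ recovers ordinary least squares. $R^2$ becomes negative whenever the model's squared error exceeds that of always predicting $\bar{y}$, which is why it can drop below 0 on test data.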

### Steps
- Generate data with `make_regression`.
- Split with `train_test_split`.
- Fit `LinearRegression()` and `Ridge(alpha=1.0)`.
- Compute test **MSE** and **R²**.
- Plot test scatter + both prediction lines.
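
The steps above could look roughly like the sketch below. The sample count, single feature, noise level, and split ratio are assumed values chosen for illustration; plotting is left out to keep the example short.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data; n_samples, n_features, and noise are assumed values.
X, y = make_regression(n_samples=200, n_features=1, noise=15.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

models = {"Linear": LinearRegression(), "Ridge": Ridge(alpha=1.0)}
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    # Both metrics are computed on the held-out test split only.
    results[name] = {
        "mse": mean_squared_error(y_test, y_pred),
        "r2": r2_score(y_test, y_pred),
    }
    print(f"{name:>6}: MSE={results[name]['mse']:.2f}  R2={results[name]['r2']:.3f}")
```

With a single clean feature the two models usually score almost identically; the gap tends to appear with more features, fewer samples, or correlated inputs.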

### Interpret
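- If Ridge has lower test MSE and higher test R² than plain Linear Regression, the L2 penalty is reducing overfitting.
- If Ridge underperforms, reduce `alpha`; an `alpha` that is too large shrinks the coefficients too far and underfits.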

### Try
- Change `noise` in `make_regression`. How do MSE and R² respond?
- Vary `alpha` in `Ridge(alpha=...)`. When does Ridge help vs hurt?




## 2. Supervised Learning – Classification

### Goal:
Predict a discrete class (e.g., species).  
We use a 2-class synthetic dataset to compare **Logistic Regression** and **SVM**.

### Steps
1. Generate a binary dataset with `make_classification` (tune `class_sep`).
2. Do a `train_test_split`.
3. Build two pipelines: `StandardScaler()` → `LogisticRegression()` and `StandardScaler()` → `SVC(kernel='rbf')`.
4. Evaluate **accuracy**; plot **confusion matrices**; print `classification_report`.
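
A minimal sketch of these steps follows. The dataset sizes, `class_sep=1.0`, the stratified split, and `max_iter=1000` are assumptions made for illustration, and the confusion matrices are printed rather than plotted.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Binary synthetic dataset; all sizes here are assumed values.
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=4, n_redundant=2,
    class_sep=1.0, random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Both models get the same scaling step so the comparison is fair.
pipelines = {
    "LogReg": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "SVM (RBF)": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
}

accuracies = {}
for name, pipe in pipelines.items():
    pipe.fit(X_train, y_train)
    y_pred = pipe.predict(X_test)
    accuracies[name] = accuracy_score(y_test, y_pred)
    print(f"== {name}: accuracy={accuracies[name]:.3f}")
    print(confusion_matrix(y_test, y_pred))  # rows: true class, columns: predicted
    print(classification_report(y_test, y_pred))
```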

### Interpretation
- **SVM (RBF)** often wins when the decision boundary is non-linear.  
- **LogReg** is fast and interpretable for near-linear problems.  
- Use confusion matrices to see **which** classes are confused.

### Try / Experiments
- Vary `class_sep` and `n_informative`.  
- Change SVM kernel (`'linear'`, `'poly'`) and tune `C`, `gamma`.  
- Use `load_iris()` (3 classes) and adapt the code to multiclass classification.



## 3. Unsupervised Learning – Clustering

### Goal:
Group similar points without labels.  


### Algorithms

- K-Means: centroid-based, needs k, struggles with non-spherical shapes/outliers.
- DBSCAN: density-based, finds arbitrary shapes, flags noise (label = −1), no k.

### Steps

1. Generate blobs + add uniform noise.
2. Fit `KMeans(n_clusters=4)` → `labels_km`.
3. Fit `DBSCAN(eps=0.9, min_samples=10)` → `labels_db`.
4. Scatter-plot points colored by labels for both methods.
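
One way to sketch these steps is shown below. The blob parameters, the 50 uniform noise points, and `n_init=10` are assumptions; the scatter plots are replaced by printed summaries to keep the block short.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs

rng = np.random.default_rng(42)

# Four Gaussian blobs (sizes and spread are assumed values).
X_blobs, _ = make_blobs(n_samples=600, centers=4, cluster_std=0.8, random_state=42)

# Uniform noise drawn over the bounding box of the blobs.
noise = rng.uniform(X_blobs.min(axis=0), X_blobs.max(axis=0), size=(50, 2))
X = np.vstack([X_blobs, noise])

labels_km = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)
labels_db = DBSCAN(eps=0.9, min_samples=10).fit_predict(X)

# DBSCAN marks noise with -1, so exclude it when counting clusters.
n_clusters_db = len(set(labels_db)) - (1 if -1 in labels_db else 0)
n_noise_db = int(np.sum(labels_db == -1))

print("K-Means cluster sizes:", np.bincount(labels_km))
print(f"DBSCAN: {n_clusters_db} clusters, {n_noise_db} points labelled noise (-1)")
```

Note that K-Means assigns every point, including the uniform noise, to one of the four clusters, while DBSCAN is free to leave points unassigned.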

### Interpret

- K-Means: clean spherical clusters; noisy points get forced into nearest cluster.
- DBSCAN: can separate irregular clusters; noisy points → −1. Too small eps → over-fragmentation; too large → merges clusters.

### Try

- Vary `cluster_std` (data overlap).
- Sweep `n_clusters` for K-Means; plot inertia (elbow).
- Grid `eps` × `min_samples` for DBSCAN; count noise and clusters.
- Standardize features if scales differ.




## 4. Dimensionality Reduction

### Goal:
Compress features while preserving structure.  
We use **PCA** to project to 2D for visualization.


### Steps

1. Load data (iris), set `X`, `y`.
2. Build a pipeline: `StandardScaler()` → `PCA(n_components=2)`.
3. Call `fit_transform(X)` → `X_pca`.
4. Scatter plot by class (PC1 vs PC2).
5. Print `explained_variance_ratio_` and its sum.
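
The steps above can be sketched as follows. The step names `"scaler"` and `"pca"` are arbitrary labels chosen for this example, and the scatter plot is omitted.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Standardize first so each feature contributes on the same scale.
pca_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=2)),
])
X_pca = pca_pipe.fit_transform(X)

# The fitted PCA step exposes the share of variance captured per component.
evr = pca_pipe.named_steps["pca"].explained_variance_ratio_
print("Projected shape:", X_pca.shape)
print("Explained variance ratio:", evr.round(3))
print("Cumulative:", round(float(evr.sum()), 3))
```

`X_pca[:, 0]` and `X_pca[:, 1]` are the PC1 and PC2 coordinates to pass to a scatter plot, colored by `y`.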

### Interpret

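- Classes separating in PC space ⇒ the leading components capture variance that is relevant to the class labels.
- Explained variance ratio (EVR): higher cumulative EVR ⇒ the 2D plot is a more faithful summary of the data.
- If classes overlap, 2 PCs may be insufficient, or the directions of highest variance may not be the ones that separate the classes (PCA never sees the labels).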

### Try

- Change `n_components` (e.g., 3) and compare cumulative EVR.
- Remove scaling and observe impact.
- Use `whiten=True` and compare plots.



## 5. Cross-Validation

We estimate generalization performance by splitting the data into multiple folds.

### Goal:
Use **K-Fold** and **`cross_val_score`** on classification with Logistic Regression.


### Method:
Use K-Fold CV with a scaled Logistic Regression pipeline. Report fold scores and the mean. Then run `GridSearchCV` to pick the best `C`.

### Steps

1. Create data with `make_classification`.
2. Build `Pipeline(StandardScaler(), LogisticRegression(...))`.
3. Define `KFold(n_splits=5, shuffle=True, random_state=...)`.
4. `cross_val_score(..., scoring="accuracy")` → fold scores + mean.
5. `GridSearchCV` over `clf__C` (and solver/penalty), fit, read `best_params_`, `best_score_`.
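
A minimal sketch of this workflow is given below. The dataset sizes, the seed, and the `C` grid are assumed values; only `C` is searched here, so solver and penalty stay at their defaults. The step name `"clf"` is what makes the `clf__C` parameter key work.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumed synthetic binary dataset.
X, y = make_classification(
    n_samples=400, n_features=10, n_informative=5, random_state=42
)

# Scaling lives inside the pipeline, so it is re-fit on each training fold
# and never sees the corresponding validation fold.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
cv = KFold(n_splits=5, shuffle=True, random_state=42)

scores = cross_val_score(pipe, X, y, cv=cv, scoring="accuracy")
print("Fold scores:", scores.round(3))
print(f"Mean: {scores.mean():.3f}  Std: {scores.std():.3f}")

# Grid over the inverse regularization strength C (grid values are assumed).
param_grid = {"clf__C": [0.01, 0.1, 1, 10, 100]}
grid = GridSearchCV(pipe, param_grid, cv=cv, scoring="accuracy")
grid.fit(X, y)
print("Best params:", grid.best_params_)
print(f"Best mean CV accuracy: {grid.best_score_:.3f}")
```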

### Interpret

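- High mean score with low standard deviation across folds ⇒ performance is stable across splits.
- `GridSearchCV` picks the `C` with the best mean CV score. Small `C` means stronger regularization (more bias); large `C` means weaker regularization (more variance).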

### Try

- Try `SVC` instead of `LogisticRegression` and grid-search `C` and `gamma`.
- Change `n_splits` to 3 or 10. How does the variance of the fold scores change?
- Use `scoring='roc_auc'` on a binary dataset and compare with accuracy.


## 6. Bias–Variance and Overfitting

### Goal:
Understand model capacity vs data size/complexity.

We will:
- Plot a **learning curve** (train size vs. score) to see if more data would help.
- Plot a **validation curve** (hyperparameter vs. score) to see under/overfitting.



### Steps

1. Build a `DecisionTreeClassifier` pipeline.
2. Use `learning_curve` with `train_sizes=np.linspace(0.1, 1.0, 6)`.
3. Plot training vs CV accuracy.
4. Use `validation_curve` over `max_depth` = 1..20.
5. Plot train vs CV accuracy vs `max_depth`.
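
The following sketch covers both curves under some assumptions: the synthetic dataset (with label noise via `flip_y`), 5-fold CV, and the step names are illustrative choices. The mean scores are printed as small tables instead of being plotted.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve, validation_curve
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Assumed dataset; flip_y adds label noise so overfitting becomes visible.
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=5, flip_y=0.05, random_state=42
)

# Trees do not need scaling; the scaler is kept only to mirror the pipeline step.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", DecisionTreeClassifier(random_state=42)),
])

# Learning curve: accuracy as a function of training-set size.
sizes, lc_train, lc_cv = learning_curve(
    pipe, X, y, train_sizes=np.linspace(0.1, 1.0, 6), cv=5, scoring="accuracy"
)
for n, tr, va in zip(sizes, lc_train.mean(axis=1), lc_cv.mean(axis=1)):
    print(f"n_train={int(n):4d}  train={tr:.3f}  cv={va:.3f}")

# Validation curve: accuracy as a function of tree depth.
depths = np.arange(1, 21)
vc_train, vc_cv = validation_curve(
    pipe, X, y, param_name="clf__max_depth", param_range=depths,
    cv=5, scoring="accuracy",
)
for d, tr, va in zip(depths, vc_train.mean(axis=1), vc_cv.mean(axis=1)):
    print(f"max_depth={int(d):2d}  train={tr:.3f}  cv={va:.3f}")

best_depth = int(depths[vc_cv.mean(axis=1).argmax()])
print("Depth with best mean CV accuracy:", best_depth)
```

To plot, pass `sizes` (or `depths`) as the x-axis and the row-wise means of the train and CV score arrays as two lines.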

### Interpret

- High train, low CV ⇒ overfitting (reduce complexity / more data).
- Low train and CV ⇒ underfitting (increase complexity / features).
- Large gap with the CV score still rising as training size grows ⇒ more data may help; both curves plateaued at a low score ⇒ more data will not help much, so increase model capacity.

### Try

- Repeat with `DecisionTreeClassifier(min_samples_leaf=...)`. How does it affect overfitting?
- Swap in `Ridge` (regression) and use `validation_curve` over `alpha`.
- Change dataset size and noise; re-plot learning curves to see effects.


## Wrap-up

In this hands-on you practiced:
- Building and evaluating **regression** and **classification** models
- Applying **K-Means** and **DBSCAN** for clustering
- Using **PCA** to reduce dimensions and visualize data
- Doing **K-fold cross-validation** and **grid search**
- Reading **learning** and **validation curves** to reason about **bias–variance**

