# Module 1 — Supervised Learning

**Created:** 2025-12-04 14:06:54 UTC

## Overview

Supervised learning is a machine learning paradigm where we train a model using labeled data—data that includes both input features and known output labels (targets). The goal is to learn a mapping from inputs to outputs so that the model can predict outcomes for new, unseen data. This is analogous to learning with a teacher who provides correct answers.

### Why Supervised Learning?
Supervised learning is used when we have access to labeled data and want to predict specific outcomes. Unlike unsupervised learning, where patterns are discovered from unlabeled data, supervised learning focuses on prediction tasks where the "correct" answers are known for the training examples (e.g., whether an email is spam, or what a patient's diagnosis is).

### Where It's Applied
- **Classification:** Predicting categorical labels.
  - Spam detection (classify emails as spam or not).
  - Medical diagnosis (classify tumors as benign or malignant).
  - Sentiment analysis (classify text as positive or negative).
- **Regression:** Predicting continuous numerical values.
  - House price prediction (predict price based on features like size, location).
  - Sales forecasting (predict sales volume based on historical data).
  - Stock price prediction.
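
The practical difference between the two task types is the kind of target and the metric used to score predictions. As a minimal sketch of the regression side, the example below fits a linear model on scikit-learn's bundled diabetes dataset. The dataset, the model, and the variable names (`X_reg`, `y_reg`, and so on) are illustrative choices, not part of this module's later examples.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Regression: the target is a continuous number (a disease-progression score).
X_reg, y_reg = load_diabetes(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_reg, y_reg, test_size=0.2, random_state=42
)

reg = LinearRegression()
reg.fit(X_tr, y_tr)
reg_preds = reg.predict(X_te)

# Regression is scored with error metrics rather than accuracy.
mae = mean_absolute_error(y_te, reg_preds)
r2 = r2_score(y_te, reg_preds)
print('Mean absolute error:', mae)
print('R^2:', r2)
```

A classifier on the same workflow would swap in a classification model and a metric such as accuracy; the split, fit, and predict steps stay the same.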

### When to Choose Supervised Learning
Choose supervised learning when:
- Labeled data is available (or can be obtained).
- The task is to map inputs to known outputs (prediction-oriented).
- You need predictions whose quality can be measured against known answers before they inform decisions.

Typical cases include recommendation systems, fraud detection, and automated grading, where historical labeled examples exist.

## Learning objectives
- Understand the difference between classification and regression.
- Learn a simple beginner example using scikit-learn.
- Step up to intermediate model selection and evaluation.
- See advanced notes on bias-variance and regularization, plus an example with pipelines.


## Beginner — Concept + Simple Example

**Concept (2 sentences):** In supervised learning you provide examples (features) and labels (targets). The model learns a mapping so you can predict labels for new data.

**Simple code:** We'll load the Iris dataset and train a small decision tree classifier.


In [None]:
# Beginner example (do not run here — included for learners)
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
clf = DecisionTreeClassifier(max_depth=3, random_state=42)
clf.fit(X_train, y_train)
preds = clf.predict(X_test)
print('Accuracy:', accuracy_score(y_test, preds))

# Explanation:
# - load_iris(): small labeled dataset for classification
# - DecisionTreeClassifier: easy-to-visualize model
# - We split into train/test and compute accuracy
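
Because a shallow decision tree is easy to inspect, it can help to look at the rules it learned. The sketch below refits the same tree and prints it with `sklearn.tree.export_text`. It repeats the data loading so that it runs on its own, and the variable `rules` is only an illustrative name.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)
clf = DecisionTreeClassifier(max_depth=3, random_state=42)
clf.fit(X_train, y_train)

# export_text renders the fitted tree as indented threshold rules,
# one branch per line, ending in the predicted class index.
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)
```

Each leaf line shows a class index (0, 1, or 2), which maps to the species names in `iris.target_names`.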


## Intermediate — Model selection & evaluation

**What to learn:** Cross-validation, hyperparameter tuning, precision/recall, confusion matrix.

**Code idea:** Use `GridSearchCV` to tune an SVM and evaluate with cross-validation.


In [None]:
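# Intermediate example (runnable on your machine)
# Reuses X_train, X_test, y_train, y_test from the beginner example.
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Scaling sits inside the pipeline so each CV fold is scaled from its own training part.
pipeline = Pipeline([('scaler', StandardScaler()), ('svc', SVC())])
param_grid = {'svc__C': [0.1, 1, 10], 'svc__kernel': ['rbf', 'linear']}
search = GridSearchCV(pipeline, param_grid, cv=5)
# Tune on the training split only, so the test split stays unseen.
search.fit(X_train, y_train)
print('Best params:', search.best_params_)
print('Best CV score:', search.best_score_)
print('Held-out test score:', search.score(X_test, y_test))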
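
Accuracy alone hides which classes get confused with each other. The precision/recall and confusion-matrix topics listed above can be sketched as follows. This is a self-contained illustration that assumes fixed SVM settings (`C=1`, `kernel='rbf'`) rather than the tuned ones, and `model`, `test_preds`, and `cm` are illustrative names.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

# Assumed fixed settings for illustration; a tuned estimator could be used instead.
model = Pipeline([('scaler', StandardScaler()), ('svc', SVC(C=1, kernel='rbf'))])
model.fit(X_train, y_train)
test_preds = model.predict(X_test)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, test_preds)
print(cm)

# Per-class precision, recall, and F1 on the held-out split.
print(classification_report(y_test, test_preds, target_names=iris.target_names))
```

Off-diagonal entries in the matrix count misclassifications, so a perfect classifier would produce a purely diagonal matrix.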


## Advanced — Theory and pipeline example

**Topics:** Bias vs variance, regularization, learning curves, feature engineering, pipelines for reproducible workflows.
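
One way to make the bias-variance and regularization topics concrete is to sweep the regularization strength and compare training and validation scores. The sketch below uses `sklearn.model_selection.validation_curve` on Iris with a scaled logistic regression. The range of `C` values and all variable names are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, validation_curve
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_vc, y_vc = load_iris(return_X_y=True)
model = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(max_iter=1000)),
])

# Smaller C means stronger regularization in LogisticRegression.
C_values = np.logspace(-3, 2, 6)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
vc_train, vc_val = validation_curve(
    model, X_vc, y_vc, param_name='clf__C', param_range=C_values, cv=cv
)

# Average over folds: one row of scores per C value.
for C, tr, va in zip(C_values, vc_train.mean(axis=1), vc_val.mean(axis=1)):
    print(f'C={C:g}  train={tr:.3f}  validation={va:.3f}')
```

Low scores on both training and validation data suggest underfitting (high bias), while a training score well above the validation score suggests overfitting (high variance).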

**Advanced code sketch:** Example showing a pipeline with feature selection and a regularized model.


In [None]:
# Advanced example (sketch)
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('select', SelectKBest(f_classif, k=2)),
    ('clf', LogisticRegression(penalty='l2', C=1.0, max_iter=1000))
])
pipeline.fit(X_train, y_train)
print('Train score:', pipeline.score(X_train, y_train))
print('Test score:', pipeline.score(X_test, y_test))

# Final notes:
# - Regularization limits overfitting: penalty picks the type, and a smaller C means a stronger penalty
# - Use learning curves (sklearn.model_selection.learning_curve) to diagnose under- or overfitting
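
The learning-curve diagnostic mentioned in the notes can be sketched as below. This is a minimal, self-contained example, assuming a scaled logistic regression on Iris; the training-size grid and variable names are illustrative. `shuffle=True` is passed because Iris is stored sorted by class, and without it the smallest training subsets could contain a single class.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, learning_curve
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_lc, y_lc = load_iris(return_X_y=True)
model = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(C=1.0, max_iter=1000)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
train_sizes, train_scores, val_scores = learning_curve(
    model, X_lc, y_lc,
    cv=cv,
    train_sizes=np.linspace(0.3, 1.0, 5),
    shuffle=True,          # shuffle before taking training subsets
    random_state=42,
)

# Average over folds: one row of scores per training-set size.
for n, tr, va in zip(train_sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f'n={int(n):3d}  train={tr:.3f}  validation={va:.3f}')
```

If the validation score is still rising at the largest training size, more data may help; if both curves have flattened at a low score, a more flexible model or better features are the more likely fix.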
