# Day 6: Introduction to Machine Learning

**Duration:** 90 minutes  
**Dataset:** Titanic Passenger Data

## Learning Objectives
- Define machine learning and understand it as an optimization task
- Differentiate supervised, unsupervised, and reinforcement learning
- Apply train-test split and cross-validation
- Understand classification tasks (K-NN, Decision Trees, Logistic Regression)
- Understand regression for continuous predictions
- Apply K-means clustering
- Understand neural network basics (layers, activation functions, hyperparameters)
- Understand loss functions

---

## Part 1: Setup and Data Loading (5 mins)

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import seaborn as sns
import matplotlib.pyplot as plt

# Scikit-learn imports
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.svm import SVC
from sklearn.cluster import KMeans
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, classification_report, 
    mean_squared_error, r2_score, silhouette_score
)

# Set random seed for reproducibility
np.random.seed(42)

print("Libraries imported successfully!")

In [None]:
# Load Titanic dataset
df = sns.load_dataset('titanic')
print(f"Dataset loaded: {df.shape[0]} rows, {df.shape[1]} columns")
df.head()

---
## Part 2: What is Machine Learning? (10 mins)

### Definition

**Machine Learning** is a subset of artificial intelligence that enables computers to learn patterns from data without being explicitly programmed.

### Machine Learning as an Optimization Task

At its core, machine learning is about **optimization**:
1. Define a **model** (function) with parameters
2. Define a **loss function** (measures how wrong predictions are)
3. Find parameters that **minimize the loss** (optimization)

```
minimize: Loss(predictions, actual_values)
by adjusting: model parameters
```
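
The three steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not how scikit-learn fits models: the data, the one-parameter model `y_hat = w * x`, the learning rate, and the helper names `toy_loss` and `toy_gradient` are all assumptions made for the example.

```python
# Toy data generated by y = 2x (made up for illustration)
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

def toy_loss(w):
    """Mean squared error of the one-parameter model y_hat = w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def toy_gradient(w):
    """Derivative of toy_loss with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

w = 0.0                # initial guess for the parameter
learning_rate = 0.05   # step size (an arbitrary choice for this sketch)
for step in range(200):
    # Move w a small step in the direction that lowers the loss
    w -= learning_rate * toy_gradient(w)

print(f"learned w = {w:.4f}, loss = {toy_loss(w):.6f}")
```

The loop recovers a weight close to 2, the value used to generate the toy data.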

### Three Types of Machine Learning

#### 1. Supervised Learning
- **Learn from labeled data** (we have both input features and correct answers)
- **Goal:** Predict output for new, unseen inputs
- **Examples:**
  - Classification: Predict categories (survived/died, spam/not spam)
  - Regression: Predict continuous values (house price, temperature)

#### 2. Unsupervised Learning
- **Learn from unlabeled data** (we only have input features, no correct answers)
- **Goal:** Discover hidden patterns or structure in data
- **Examples:**
  - Clustering: Group similar items together
  - Dimensionality reduction: Compress data while preserving important information

#### 3. Reinforcement Learning
- **Learn by trial and error** through interaction with an environment
- **Goal:** Maximize cumulative reward
- **Examples:** Game playing (AlphaGo), robotics, self-driving cars
- *Note: We won't cover this today*

### Exercise 2.1: Identify the Learning Type

For each scenario, identify whether it's **supervised** or **unsupervised** learning:

1. Predicting whether a Titanic passenger survived (we have survival labels): _______________
2. Grouping passengers into clusters based on age and fare (no labels): _______________
3. Predicting the fare a passenger paid based on their class and age: _______________
4. Finding natural groups in customer purchase behavior: _______________

---
## Part 3: Data Preparation for Machine Learning (10 mins)

Before we can train models, we need to prepare our data!

In [None]:
# Create a clean copy for machine learning
df_ml = df.copy()

# Check for missing values
print("Missing values:")
print(df_ml.isnull().sum())
print("\nTotal missing:", df_ml.isnull().sum().sum())

In [None]:
# TODO: Handle missing values
# Fill missing 'age' with median age
df_ml['age'] = df_ml['age'].fillna(df_ml['age'].median())

# TODO: Fill missing 'embarked' with most common value (mode)
df_ml['embarked'] = # YOUR CODE HERE (use fillna with mode)

# TODO: Fill missing 'fare' with median fare
# (the seaborn Titanic data has no missing fares, so this is a safeguard)
df_ml['fare'] = # YOUR CODE HERE

print("Missing values after imputation:")
print(df_ml[['age', 'embarked', 'fare']].isnull().sum())

In [None]:
# Encode categorical variables
# TODO: Convert 'sex' to binary: 1 for male, 0 for female
df_ml['sex_encoded'] = # YOUR CODE HERE (use map: {'male': 1, 'female': 0})

# One-hot encode 'embarked'
embarked_dummies = pd.get_dummies(df_ml['embarked'], prefix='embarked', drop_first=True)
df_ml = pd.concat([df_ml, embarked_dummies], axis=1)

print("\nEncoded features:")
print(df_ml[['sex', 'sex_encoded', 'embarked', 'embarked_Q', 'embarked_S']].head())

In [None]:
# Feature engineering: create family_size
# TODO: Create family_size = siblings/spouses + parents/children + 1 (the passenger)
df_ml['family_size'] = # YOUR CODE HERE (sibsp + parch + 1)

print(f"Family size range: {df_ml['family_size'].min()} to {df_ml['family_size'].max()}")
print(f"Average family size: {df_ml['family_size'].mean():.2f}")

---
## Part 4: Train-Test Split (10 mins)

### Why Split Data?

We split data into **training** and **testing** sets to:
1. **Train** the model on one portion of data
2. **Evaluate** the model on unseen data (testing set)
3. **Detect overfitting**: When a model memorizes training data but fails on new data

```
Original Data (100%)
    |
    ├── Training Set (70-80%): Used to train the model
    └── Test Set (20-30%): Used to evaluate the model
```

### Important Concepts

- **Training Set**: Data used to train (fit) the model
- **Test Set**: Data used to evaluate model performance (simulates real-world predictions)
- **Never** use test data for training!
- Common split ratios: 80/20, 70/30, 75/25
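
To make the idea concrete, here is a rough sketch of what a random 80/20 split does under the hood. The helper `simple_train_test_split` is hypothetical and written only for this illustration; in the notebook we use scikit-learn's `train_test_split` instead.

```python
import random

def simple_train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle row indices, then cut them into a train part and a test part."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)          # reproducible shuffle
    n_test = int(round(len(rows) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]

rows = list(range(100))   # stand-in for 100 passengers
train_rows, test_rows = simple_train_test_split(rows)

print(len(train_rows), len(test_rows))
print(set(train_rows) & set(test_rows))   # empty set: no row appears in both parts
```

The key property is that the two parts never overlap, so the test set really is unseen data.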

In [None]:
# Select features for our first model
feature_columns = ['pclass', 'sex_encoded', 'age', 'fare', 'family_size', 
                   'embarked_Q', 'embarked_S']

# TODO: Create X (features) and y (target)
X = # YOUR CODE HERE (select feature_columns from df_ml)
y = # YOUR CODE HERE (select 'survived' from df_ml)

print(f"Features (X) shape: {X.shape}")
print(f"Target (y) shape: {y.shape}")
print(f"\nFeatures: {feature_columns}")

In [None]:
# TODO: Split data into training and testing sets (80/20 split)
# Use random_state=42 for reproducibility
X_train, X_test, y_train, y_test = # YOUR CODE HERE (use train_test_split)

print(f"Training set size: {len(X_train)} samples ({len(X_train)/len(X)*100:.1f}%)")
print(f"Test set size: {len(X_test)} samples ({len(X_test)/len(X)*100:.1f}%)")
print(f"\nTraining set survival rate: {y_train.mean()*100:.1f}%")
print(f"Test set survival rate: {y_test.mean()*100:.1f}%")

### Feature Scaling

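Distance-based and gradient-based algorithms (K-NN, SVM, K-Means, logistic regression) usually perform better when features are on a comparable scale. Tree-based models are largely unaffected by scaling.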

**Important**: Fit scaler on training data only, then transform both train and test!
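
As a sketch of what "fit on train, transform both" means, the snippet below standardizes a single feature by hand. The fare values are made up; the point is that the mean and standard deviation come from the training data only and are then reused for the test data.

```python
import numpy as np

train_fares = np.array([7.25, 8.05, 13.0, 26.55, 71.28])   # made-up training values
test_fares = np.array([10.5, 80.0])                        # made-up test values

# "Fit": compute the statistics from the training data only
mean = train_fares.mean()
std = train_fares.std()   # population std (ddof=0), the same convention StandardScaler uses

# "Transform": apply the same training statistics to both sets
train_scaled = (train_fares - mean) / std
test_scaled = (test_fares - mean) / std

print(f"train mean = {train_scaled.mean():.6f}, train std = {train_scaled.std():.6f}")
print("scaled test values:", test_scaled)
```

The scaled training data has mean 0 and standard deviation 1; the test data generally does not, and that is expected.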

In [None]:
# TODO: Scale features using StandardScaler
scaler = StandardScaler()

# Fit on training data and transform both train and test
X_train_scaled = # YOUR CODE HERE (fit_transform on X_train)
X_test_scaled = # YOUR CODE HERE (transform on X_test - don't fit!)

print("Original training data (first 3 rows):")
print(X_train.head(3))
print("\nScaled training data (first 3 rows):")
print(X_train_scaled[:3])

---
## Part 5: Classification - Predicting Survival (25 mins)

### What is Classification?

**Classification** is a supervised learning task where we predict **discrete categories** (classes).

- **Binary Classification**: 2 classes (e.g., survived/died, spam/not spam)
- **Multi-class Classification**: 3+ classes (e.g., passenger class 1/2/3)

Today we'll explore three popular classification algorithms!

### 5.1: K-Nearest Neighbors (K-NN)

**How it works:**
1. Find the K nearest data points to the new observation
2. Take a "vote" among those K neighbors
3. Assign the most common class

**Analogy:** "You are like the majority of your 5 closest friends"

**Key hyperparameter:** K (number of neighbors)
- Small K: More sensitive to noise (overfitting)
- Large K: Smoother decision boundary (underfitting)
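
The three steps can be written from scratch in a few lines. This is a minimal sketch with made-up, already-scaled points; `knn_predict` is a hypothetical helper for illustration and not scikit-learn's implementation.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Return the majority label among the k training points nearest to query."""
    # Step 1: sort training points by Euclidean distance to the query
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    # Step 2: collect the labels of the k nearest neighbors
    nearest_labels = [label for _, label in distances[:k]]
    # Step 3: assign the most common label
    return Counter(nearest_labels).most_common(1)[0][0]

# Made-up (age, fare) pairs on a 0-1 scale, with survival labels
points = [(0.1, 0.2), (0.2, 0.1), (0.9, 0.8), (0.8, 0.9), (0.85, 0.75)]
labels = [0, 0, 1, 1, 1]

pred_low = knn_predict(points, labels, query=(0.15, 0.15), k=3)
pred_high = knn_predict(points, labels, query=(0.80, 0.80), k=3)
print(pred_low, pred_high)
```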

In [None]:
# TODO: Train a K-NN classifier with K=5
knn = # YOUR CODE HERE (create KNeighborsClassifier with n_neighbors=5)
knn.fit(X_train_scaled, y_train)

# Make predictions
y_pred_knn = knn.predict(X_test_scaled)

# Evaluate
accuracy_knn = accuracy_score(y_test, y_pred_knn)
print(f"K-NN Accuracy: {accuracy_knn*100:.2f}%")

In [None]:
# TODO: Experiment with different K values
k_values = [1, 3, 5, 7, 9, 15, 25, 50]
accuracies = []

for k in k_values:
    # YOUR CODE HERE
    # Train KNN with n_neighbors=k
    # Predict on test set
    # Calculate accuracy and append to accuracies list
    pass

# Plot results
fig = px.line(x=k_values, y=accuracies, markers=True,
              title='K-NN Performance vs K Value',
              labels={'x': 'K (Number of Neighbors)', 'y': 'Accuracy'})
fig.update_layout(yaxis_tickformat='.0%')
fig.show()

print(f"Best K value: {k_values[np.argmax(accuracies)]} with accuracy: {max(accuracies)*100:.2f}%")

### 5.2: Decision Trees

**How it works:**
1. Split data based on features to create "decision rules"
2. Keep splitting until reaching pure leaf nodes or stopping criteria
3. Make predictions by following the decision path

**Analogy:** A flowchart of yes/no questions

```
Is sex == female?
  ├─ Yes → Is pclass <= 2?
  │         ├─ Yes → Survived
  │         └─ No → Check age...
  └─ No → Is age < 16?
            ├─ Yes → Survived
            └─ No → Died
```

**Pros:** Easy to interpret, handles non-linear relationships  
**Cons:** Can overfit easily
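
How does a tree decide which question to ask? One common criterion (the default in scikit-learn's `DecisionTreeClassifier`) is Gini impurity: a good split produces child nodes that are purer than the parent. The sketch below uses made-up labels, and the helpers `gini` and `weighted_gini` are hypothetical names for this illustration.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0.0 for a pure node, 0.5 for a 50/50 binary node."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def weighted_gini(left, right):
    """Impurity of a split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Made-up survival labels (1 = survived) for 8 passengers
parent = [1, 1, 1, 0, 0, 0, 0, 1]
# A candidate split "is sex == female?" that happens to separate this toy sample perfectly
female = [1, 1, 1, 1]
male = [0, 0, 0, 0]

print("impurity before split:", gini(parent))
print("impurity after split: ", weighted_gini(female, male))
```

The tree would prefer this split over any alternative that leaves the children more mixed.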

In [None]:
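# Train a Decision Tree classifier
# Trees do not need feature scaling; the scaled data is reused here for consistency,
# so the split thresholds in the plot below are in standardized units
dt = DecisionTreeClassifier(max_depth=4, random_state=42)
dt.fit(X_train_scaled, y_train)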

# Make predictions
y_pred_dt = # YOUR CODE HERE

# Evaluate
accuracy_dt = # YOUR CODE HERE (calculate accuracy)
print(f"Decision Tree Accuracy: {accuracy_dt*100:.2f}%")

In [None]:
# Visualize the decision tree
plt.figure(figsize=(20, 10))
plot_tree(dt, 
          feature_names=feature_columns,
          class_names=['Died', 'Survived'],
          filled=True,
          rounded=True,
          fontsize=10)
plt.title('Decision Tree for Titanic Survival Prediction', fontsize=16)
plt.show()

In [None]:
# Feature importance
feature_importance = pd.DataFrame({
    'feature': feature_columns,
    'importance': dt.feature_importances_
}).sort_values('importance', ascending=False)

fig = px.bar(feature_importance, x='importance', y='feature', orientation='h',
             title='Feature Importance in Decision Tree')
fig.show()

print("\nMost important features:")
print(feature_importance)

### 5.3: Logistic Regression

**How it works:**
1. Creates a linear combination of features
2. Applies sigmoid function to get probability (0 to 1)
3. Classifies based on threshold (typically 0.5)

**Formula:** `P(survived=1) = 1 / (1 + e^(-z))` where `z = w₁x₁ + w₂x₂ + ... + b`

**Note:** Despite the name, it's a **classification** algorithm, not regression!

**Pros:** Fast, interpretable, provides probabilities  
**Cons:** Assumes linear decision boundary
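
The formula can be traced by hand for a single passenger. In this sketch the weights and bias are hypothetical values chosen for illustration, not coefficients fitted to the Titanic data, and `predict_survival` is a helper invented for the example.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_survival(features, weights, bias, threshold=0.5):
    """Linear combination -> sigmoid -> threshold, as in the three steps above."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    probability = sigmoid(z)
    return probability, int(probability >= threshold)

# Hypothetical weights for (sex_encoded, pclass); not fitted to real data
weights = [-2.5, -1.0]
bias = 3.0

prob_a, class_a = predict_survival([0, 1], weights, bias)   # female, first class
prob_b, class_b = predict_survival([1, 3], weights, bias)   # male, third class
print(f"female, 1st class: p = {prob_a:.3f}, predicted class = {class_a}")
print(f"male, 3rd class:   p = {prob_b:.3f}, predicted class = {class_b}")
```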

In [None]:
# TODO: Train a Logistic Regression classifier
lr = LogisticRegression(max_iter=1000, random_state=42)
lr.fit(X_train_scaled, y_train)

# Make predictions
y_pred_lr = # YOUR CODE HERE

# Get probability predictions
y_pred_proba_lr = lr.predict_proba(X_test_scaled)[:, 1]  # Probability of survival

# Evaluate
accuracy_lr = # YOUR CODE HERE
print(f"Logistic Regression Accuracy: {accuracy_lr*100:.2f}%")
print(f"\nFirst 5 probability predictions: {y_pred_proba_lr[:5]}")

In [None]:
# Model coefficients (weights)
coefficients = pd.DataFrame({
    'feature': feature_columns,
    'coefficient': lr.coef_[0]
}).sort_values('coefficient', key=abs, ascending=False)

fig = px.bar(coefficients, x='coefficient', y='feature', orientation='h',
             title='Logistic Regression Coefficients',
             color='coefficient',
             color_continuous_scale='RdBu_r')
fig.show()

print("\nFeature coefficients:")
print(coefficients)
print("\nPositive coefficient = increases survival probability")
print("Negative coefficient = decreases survival probability")

### 5.4: Comparing Classification Models

Let's compare all three models using multiple metrics!

In [None]:
# TODO: Calculate metrics for all models
models = {
    'K-NN': y_pred_knn,
    'Decision Tree': y_pred_dt,
    'Logistic Regression': y_pred_lr
}

results = []
for model_name, y_pred in models.items():
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    
    results.append({
        'Model': model_name,
        'Accuracy': accuracy,
        'Precision': precision,
        'Recall': recall,
        'F1-Score': f1
    })

results_df = pd.DataFrame(results)
print("Model Comparison:")
print(results_df.to_string(index=False))

# Visualize comparison
results_melted = results_df.melt(id_vars='Model', var_name='Metric', value_name='Score')
fig = px.bar(results_melted, x='Model', y='Score', color='Metric', barmode='group',
             title='Classification Model Comparison')
fig.update_layout(yaxis_tickformat='.0%')
fig.show()

### Understanding Classification Metrics

- **Accuracy**: % of correct predictions (both survived and died)
- **Precision**: Of all predicted survivors, what % actually survived? (avoid false alarms)
- **Recall**: Of all actual survivors, what % did we predict correctly? (avoid missing survivors)
- **F1-Score**: Harmonic mean of precision and recall (balanced metric)
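
To see how the four metrics relate, it can help to compute them by hand from the counts of true and false positives and negatives. The labels below are made up, and `classification_metrics` is a hypothetical helper; it assumes at least one predicted positive and one actual positive, otherwise the divisions would fail.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)   # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)   # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)   # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)   # false negatives
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Made-up labels: 4 survivors and 6 non-survivors
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_demo = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

accuracy, precision, recall, f1 = classification_metrics(y_true_demo, y_pred_demo)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```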

In [None]:
# Confusion matrix for Logistic Regression
# (swap in another model's predictions if it scored higher in your comparison)
cm = confusion_matrix(y_test, y_pred_lr)

fig = px.imshow(cm, 
                labels=dict(x="Predicted", y="Actual", color="Count"),
                x=['Died', 'Survived'],
                y=['Died', 'Survived'],
                text_auto=True,
                title='Confusion Matrix - Logistic Regression',
                color_continuous_scale='Blues')
fig.show()

print("Confusion Matrix:")
print(f"True Negatives (correctly predicted died): {cm[0, 0]}")
print(f"False Positives (incorrectly predicted survived): {cm[0, 1]}")
print(f"False Negatives (incorrectly predicted died): {cm[1, 0]}")
print(f"True Positives (correctly predicted survived): {cm[1, 1]}")

### 5.5: Support Vector Machines (SVM) - Conceptual Overview

**How it works:**
1. Find the hyperplane (decision boundary) that best separates classes
2. Maximize the margin (distance) between the boundary and nearest points
3. Support vectors are the critical points closest to the boundary

**Key concept:** The "kernel trick" allows SVM to handle non-linear boundaries

**Pros:** Effective in high dimensions, memory efficient  
**Cons:** Slow for large datasets, requires feature scaling

In [None]:
# Optional: train an SVM classifier
# (SVMs can be slow on large datasets; the Titanic data is small enough to train quickly)
# SVC was already imported in Part 1

svm = SVC(kernel='rbf', random_state=42)
svm.fit(X_train_scaled, y_train)

y_pred_svm = svm.predict(X_test_scaled)
accuracy_svm = accuracy_score(y_test, y_pred_svm)

print(f"SVM Accuracy: {accuracy_svm*100:.2f}%")
print(f"Number of support vectors: {len(svm.support_)}")

---
## Part 6: Cross-Validation (8 mins)

### The Problem with Single Train-Test Split

A single split might not be representative. What if the test set happens to be unusually easy or unusually hard?

### Solution: K-Fold Cross-Validation

1. Split data into K folds (e.g., 5 folds)
2. Train K times, each time using a different fold as test set
3. Average the results to get a more reliable estimate

```
Fold 1: [Test][Train][Train][Train][Train]
Fold 2: [Train][Test][Train][Train][Train]
Fold 3: [Train][Train][Test][Train][Train]
Fold 4: [Train][Train][Train][Test][Train]
Fold 5: [Train][Train][Train][Train][Test]
```
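
The fold diagram can be reproduced with a short index generator. This is a simplified sketch: `k_fold_indices` is a hypothetical helper that cuts the data into contiguous folds without shuffling or stratifying. For classifiers, scikit-learn's `cross_val_score` with an integer `cv` uses stratified folds, which additionally keep the class proportions similar in each fold.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most 1
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(10, k=5))
for fold_number, (train_idx, test_idx) in enumerate(folds, start=1):
    print(f"Fold {fold_number}: test={test_idx}, train size={len(train_idx)}")
```

Every sample lands in the test fold exactly once, which is what makes the averaged score more reliable than a single split.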

In [None]:
# TODO: Perform 5-fold cross-validation on Logistic Regression
lr_cv = LogisticRegression(max_iter=1000, random_state=42)

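# Use cross_val_score with cv=5
# Note: the scaler was fit on all of X_train, so each validation fold has already
# influenced the scaling; wrapping scaler and model in a Pipeline avoids this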
cv_scores = cross_val_score(lr_cv, X_train_scaled, y_train, cv=5, scoring='accuracy')

print("Cross-Validation Scores:")
for i, score in enumerate(cv_scores, 1):
    print(f"Fold {i}: {score*100:.2f}%")

print(f"\nMean CV Score: {cv_scores.mean()*100:.2f}%")
print(f"Standard Deviation: {cv_scores.std()*100:.2f}%")
# Rough spread of the fold scores (mean ± 2 std); with only 5 folds this is not a formal confidence interval
print(f"Mean ± 2 std: [{(cv_scores.mean() - 2*cv_scores.std())*100:.2f}%, {(cv_scores.mean() + 2*cv_scores.std())*100:.2f}%]")

In [None]:
# TODO: Compare CV scores for all models
models_cv = {
    'K-NN': KNeighborsClassifier(n_neighbors=5),
    'Decision Tree': DecisionTreeClassifier(max_depth=4, random_state=42),
    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42)
}

cv_results = []
for model_name, model in models_cv.items():
    # Perform 5-fold cross-validation for this model
    scores = cross_val_score(model, X_train_scaled, y_train, cv=5, scoring='accuracy')
    cv_results.append({
        'Model': model_name,
        'Mean CV Score': scores.mean(),
        'Std Dev': scores.std()
    })

cv_results_df = pd.DataFrame(cv_results)
print("\nCross-Validation Comparison:")
print(cv_results_df)

---
## Part 7: Regression - Predicting Continuous Values (12 mins)

### What is Regression?

**Regression** is a supervised learning task where we predict **continuous numerical values**.

Examples:
- Predicting house prices
- Forecasting temperature
- Estimating ticket fare

Today we'll predict the **fare** a passenger paid based on their characteristics!

In [None]:
# Prepare data for regression
# TODO: Create features (X) and target (y) for fare prediction
regression_features = ['pclass', 'sex_encoded', 'age', 'family_size', 
                       'embarked_Q', 'embarked_S']

X_reg = df_ml[regression_features]
y_reg = df_ml['fare']  # Target is now 'fare' (continuous)

print(f"Target (fare) statistics:")
print(y_reg.describe())

# Split data
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(
    X_reg, y_reg, test_size=0.2, random_state=42
)

# Scale features
scaler_reg = StandardScaler()
X_train_reg_scaled = scaler_reg.fit_transform(X_train_reg)
X_test_reg_scaled = scaler_reg.transform(X_test_reg)

### Linear Regression

**How it works:**
- Finds the best-fit line (or hyperplane) through the data
- Minimizes the sum of squared errors (distance from points to line)

**Formula:** `y = w₁x₁ + w₂x₂ + ... + b`

Where:
- `y` = predicted fare
- `x₁, x₂, ...` = features (pclass, age, etc.)
- `w₁, w₂, ...` = weights (coefficients)
- `b` = bias (intercept)
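
For a single feature, the best-fit line has a closed-form solution, which makes the idea easy to check by hand. The sketch below uses made-up fares and a hypothetical helper `fit_line`; scikit-learn's `LinearRegression` solves the same least-squares problem for many features at once.

```python
def fit_line(xs, ys):
    """Least-squares slope w and intercept b for y = w*x + b (one feature)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    w = covariance / variance
    b = mean_y - w * mean_x
    return w, b

# Made-up data: fare tends to fall as the class number rises
pclass_demo = [1, 1, 2, 2, 3, 3]
fare_demo = [80.0, 90.0, 20.0, 30.0, 5.0, 15.0]

w, b = fit_line(pclass_demo, fare_demo)
print(f"fare ≈ {w:.1f} * pclass + {b:.1f}")
```

A negative weight here means the predicted fare drops as `pclass` increases.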

In [None]:
# TODO: Train a Linear Regression model
lin_reg = LinearRegression()
lin_reg.fit(X_train_reg_scaled, y_train_reg)

# Make predictions
y_pred_reg = lin_reg.predict(X_test_reg_scaled)

# Evaluate using regression metrics
mse = mean_squared_error(y_test_reg, y_pred_reg)
rmse = np.sqrt(mse)
r2 = r2_score(y_test_reg, y_pred_reg)

print("Linear Regression Results:")
print(f"Mean Squared Error (MSE): {mse:.2f}")
print(f"Root Mean Squared Error (RMSE): {rmse:.2f}")
print(f"R² Score: {r2:.4f}")
print(f"\nInterpretation: a typical prediction error is about £{rmse:.2f} (RMSE weights large errors more heavily than a plain average)")

### Understanding Regression Metrics

- **MSE (Mean Squared Error)**: Average squared difference between predictions and actual values
  - Lower is better
  - Units are squared (hard to interpret)

- **RMSE (Root Mean Squared Error)**: Square root of MSE
  - Lower is better
  - Same units as target variable (easier to interpret)

- **R² (R-squared)**: Proportion of variance explained by the model
  - Range: at most 1; can be negative on test data when the model does worse than always predicting the mean
  - 1.0 = perfect predictions
  - 0.0 = model is no better than predicting the mean
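
The three metrics can be computed by hand on a tiny example to see how they connect. The actual and predicted values below are made up for illustration.

```python
import math

actual = [10.0, 20.0, 30.0, 40.0]      # made-up true fares
predicted = [12.0, 18.0, 33.0, 37.0]   # made-up model predictions

n = len(actual)
squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]

mse_value = sum(squared_errors) / n    # average squared error
rmse_value = math.sqrt(mse_value)      # back in the units of the target

# R² compares the model's squared error with that of always predicting the mean
mean_actual = sum(actual) / n
ss_res = sum(squared_errors)
ss_tot = sum((a - mean_actual) ** 2 for a in actual)
r2_value = 1 - ss_res / ss_tot

print(f"MSE = {mse_value:.2f}, RMSE = {rmse_value:.2f}, R² = {r2_value:.3f}")
```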

In [None]:
# Visualize predictions vs actual values
fig = go.Figure()

# Perfect predictions line
max_fare = max(y_test_reg.max(), y_pred_reg.max())
fig.add_trace(go.Scatter(x=[0, max_fare], y=[0, max_fare], 
                         mode='lines', name='Perfect Prediction',
                         line=dict(color='red', dash='dash')))

# Actual predictions
fig.add_trace(go.Scatter(x=y_test_reg, y=y_pred_reg, 
                         mode='markers', name='Predictions',
                         marker=dict(size=8, opacity=0.6)))

fig.update_layout(
    title=f'Predicted vs Actual Fare (R² = {r2:.3f})',
    xaxis_title='Actual Fare (£)',
    yaxis_title='Predicted Fare (£)',
    showlegend=True
)
fig.show()

In [None]:
# Feature coefficients for regression
reg_coefficients = pd.DataFrame({
    'feature': regression_features,
    'coefficient': lin_reg.coef_
}).sort_values('coefficient', key=abs, ascending=False)

fig = px.bar(reg_coefficients, x='coefficient', y='feature', orientation='h',
             title='Linear Regression Coefficients for Fare Prediction',
             color='coefficient',
             color_continuous_scale='RdBu_r')
fig.show()

print("\nFeature coefficients:")
print(reg_coefficients)

---
## Part 8: Unsupervised Learning - K-Means Clustering (10 mins)

### What is Clustering?

**Clustering** is an unsupervised learning task where we group similar data points together **without labels**.

Use cases:
- Customer segmentation
- Anomaly detection
- Data exploration

### K-Means Algorithm

**How it works:**
1. Choose K (number of clusters)
2. Randomly initialize K cluster centers
3. Assign each point to nearest center
4. Update centers to mean of assigned points
5. Repeat steps 3-4 until convergence

**Key hyperparameter:** K (number of clusters)
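
The five steps map almost line by line onto code. The following is a bare-bones sketch in NumPy with made-up points; the function `k_means` is a hypothetical illustration and lacks the refinements of scikit-learn's `KMeans` (such as smarter initialization and multiple restarts).

```python
import numpy as np

def k_means(points, k, n_iters=100, seed=42):
    """Basic K-Means: returns (labels, centers)."""
    rng = np.random.default_rng(seed)
    # Step 2: pick k distinct data points as the initial centers
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 3: assign each point to its nearest center
        distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 4: move each center to the mean of its assigned points
        # (a center with no points stays where it is)
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Step 5: stop once the centers no longer move
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Two well-separated made-up groups of points
points = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
                   [5.0, 5.0], [5.2, 4.9], [4.9, 5.3]])
labels, centers = k_means(points, k=2)
print("labels: ", labels)
print("centers:", centers)
```

Which group gets label 0 and which gets label 1 depends on the random initialization; only the grouping itself is meaningful.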

In [None]:
# Prepare data for clustering
# Select features for clustering (age, fare, and passenger class)
cluster_features = ['age', 'fare', 'pclass']
X_cluster = df_ml[cluster_features]

# Scale features (important for K-Means!)
scaler_cluster = StandardScaler()
X_cluster_scaled = scaler_cluster.fit_transform(X_cluster)

print(f"Clustering data shape: {X_cluster_scaled.shape}")

In [None]:
# TODO: Apply K-Means with K=3
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
clusters = kmeans.fit_predict(X_cluster_scaled)

# Add cluster labels to dataframe
df_ml['cluster'] = clusters

print("Cluster distribution:")
print(df_ml['cluster'].value_counts().sort_index())

# Cluster characteristics
print("\nCluster characteristics:")
print(df_ml.groupby('cluster')[cluster_features].mean())

In [None]:
# Visualize clusters in 2D (Age vs Fare)
fig = px.scatter(df_ml, x='age', y='fare', color='cluster',
                 title='K-Means Clustering of Titanic Passengers',
                 labels={'cluster': 'Cluster'},
                 hover_data=['pclass', 'sex'])

# Add cluster centers
centers = scaler_cluster.inverse_transform(kmeans.cluster_centers_)
fig.add_trace(go.Scatter(x=centers[:, 0], y=centers[:, 1],
                         mode='markers',
                         marker=dict(symbol='x', size=15, color='black', line=dict(width=2)),
                         name='Cluster Centers'))

fig.show()

In [None]:
# Find optimal K using the Elbow Method
inertias = []
silhouette_scores = []
K_range = range(2, 11)

for k in K_range:
    # Fit K-Means with k clusters, then store its inertia_ and silhouette score
    kmeans_temp = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans_temp.fit(X_cluster_scaled)
    
    inertias.append(kmeans_temp.inertia_)
    silhouette_scores.append(silhouette_score(X_cluster_scaled, kmeans_temp.labels_))

# Plot elbow curve
fig = go.Figure()
fig.add_trace(go.Scatter(x=list(K_range), y=inertias, mode='lines+markers',
                         name='Inertia'))
fig.update_layout(title='Elbow Method for Optimal K',
                  xaxis_title='Number of Clusters (K)',
                  yaxis_title='Inertia (Within-Cluster Sum of Squares)')
fig.show()

# Plot silhouette scores
fig2 = px.line(x=list(K_range), y=silhouette_scores, markers=True,
               title='Silhouette Score vs K',
               labels={'x': 'Number of Clusters (K)', 'y': 'Silhouette Score'})
fig2.show()

print(f"Best K based on silhouette score: {K_range[np.argmax(silhouette_scores)]}")

### Clustering Evaluation Metrics

- **Inertia**: Sum of squared distances to nearest cluster center
  - Lower is better
  - Always decreases as K increases
  - Look for "elbow" in the curve

- **Silhouette Score**: Measures how similar points are to their own cluster vs other clusters
  - Range: -1 to 1
  - Higher is better
  - Rule of thumb: values above about 0.5 suggest well-separated clusters

---
## Part 9: Introduction to Neural Networks (10 mins)

### What are Neural Networks?

**Neural networks** are computing systems inspired by biological neural networks in animal brains.

### Basic Structure

```
Input Layer → Hidden Layer(s) → Output Layer
```

**Components:**
1. **Neurons (Nodes)**: Processing units that receive inputs, apply weights, and produce outputs
2. **Layers**: Groups of neurons
   - Input Layer: Receives raw data
   - Hidden Layer(s): Performs computations
   - Output Layer: Produces final prediction
3. **Weights**: Learned parameters that connect neurons
4. **Biases**: Offset values for each neuron

### Activation Functions

**Activation functions** introduce non-linearity, allowing networks to learn complex patterns.

**Common activation functions:**

1. **ReLU (Rectified Linear Unit)**
   - Formula: `f(x) = max(0, x)`
   - Most popular for hidden layers
   - Fast and effective

2. **Sigmoid**
   - Formula: `f(x) = 1 / (1 + e^(-x))`
   - Output: 0 to 1
   - Used for binary classification output

3. **Tanh (Hyperbolic Tangent)**
   - Formula: `f(x) = (e^x - e^(-x)) / (e^x + e^(-x))`
   - Output: -1 to 1
   - Similar to sigmoid but centered at 0

4. **Softmax**
   - Converts outputs to probabilities (sum to 1)
   - Used for multi-class classification output
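
Softmax is the one function in this list without a simple one-line formula, so a small sketch may help. The raw scores below are hypothetical; subtracting the maximum before exponentiating is a common numerical-stability trick and does not change the result.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]   # shift by the max for stability
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for passenger classes 1, 2, and 3
probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs], "sum =", round(sum(probs), 6))
```

The largest score gets the largest probability, and the ordering of the scores is preserved.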

In [None]:
# Visualize activation functions
x = np.linspace(-5, 5, 100)

# ReLU
relu = np.maximum(0, x)

# Sigmoid
sigmoid = 1 / (1 + np.exp(-x))

# Tanh
tanh = np.tanh(x)

# Plot
fig = go.Figure()
fig.add_trace(go.Scatter(x=x, y=relu, mode='lines', name='ReLU'))
fig.add_trace(go.Scatter(x=x, y=sigmoid, mode='lines', name='Sigmoid'))
fig.add_trace(go.Scatter(x=x, y=tanh, mode='lines', name='Tanh'))

fig.update_layout(title='Common Activation Functions',
                  xaxis_title='Input',
                  yaxis_title='Output',
                  showlegend=True)
fig.show()

### Neural Network Hyperparameters

**Hyperparameters** are settings we choose before training (not learned from data).

**Key hyperparameters:**

1. **Number of layers**: How deep is the network?
   - More layers = can learn more complex patterns
   - But: harder to train, risk of overfitting

2. **Number of neurons per layer**: How wide is each layer?
   - More neurons = more capacity to learn
   - But: more computation, risk of overfitting

3. **Learning rate**: How big are the update steps during training?
   - Too high: training unstable, might not converge
   - Too low: training very slow
   - Typical values: 0.001 to 0.1

4. **Batch size**: How many examples to process before updating weights?
   - Small batch: noisy updates, but more frequent
   - Large batch: stable updates, but less frequent
   - Common values: 16, 32, 64, 128

5. **Epochs**: How many times to go through entire dataset?
   - More epochs = more training
   - But: risk of overfitting if too many

### Example Neural Network Architecture

For Titanic survival prediction:

```
Input Layer (7 neurons)
    ↓
Hidden Layer 1 (16 neurons, ReLU activation)
    ↓
Hidden Layer 2 (8 neurons, ReLU activation)
    ↓
Output Layer (1 neuron, Sigmoid activation)
```

**Total parameters to learn:**
- Layer 1: 7 × 16 + 16 = 128
- Layer 2: 16 × 8 + 8 = 136
- Output: 8 × 1 + 1 = 9
- **Total: 273 parameters**
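
The same count can be checked with a short function: each fully connected layer has one weight per input-output pair plus one bias per output neuron. `count_parameters` is a hypothetical helper written for this check.

```python
def count_parameters(layer_sizes):
    """Total weights and biases in a fully connected network."""
    return sum(
        n_in * n_out + n_out   # weights + biases for one layer
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )

total_params = count_parameters([7, 16, 8, 1])
print(total_params)
```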

### Forward Propagation (Conceptual)

How a neural network makes a prediction:

1. **Input**: Start with features (age, sex, pclass, etc.)
2. **Layer 1**: 
   - Multiply inputs by weights
   - Add biases
   - Apply activation function (ReLU)
3. **Layer 2**: 
   - Multiply layer 1 outputs by weights
   - Add biases
   - Apply activation function (ReLU)
4. **Output Layer**: 
   - Multiply layer 2 outputs by weights
   - Add bias
   - Apply activation function (Sigmoid)
   - Get probability of survival
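
These steps translate directly into matrix operations. The sketch below runs one forward pass through the 7-16-8-1 architecture using random, untrained weights and a made-up input, so the output probability is meaningless; it only shows the mechanics.

```python
import numpy as np

rng = np.random.default_rng(42)

def relu(x):
    return np.maximum(0, x)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Random, untrained parameters (for illustration only)
W1, b1 = rng.normal(scale=0.5, size=(7, 16)), np.zeros(16)
W2, b2 = rng.normal(scale=0.5, size=(16, 8)), np.zeros(8)
W3, b3 = rng.normal(scale=0.5, size=(8, 1)), np.zeros(1)

x = rng.normal(size=(1, 7))          # one made-up passenger with 7 scaled features

h1 = relu(x @ W1 + b1)               # Layer 1: weights, biases, ReLU
h2 = relu(h1 @ W2 + b2)              # Layer 2: weights, biases, ReLU
p_survive = sigmoid(h2 @ W3 + b3)    # Output: weights, bias, sigmoid

print("hidden shapes:", h1.shape, h2.shape)
print("predicted probability:", float(p_survive[0, 0]))
```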

### Loss Functions

**Loss function** measures how wrong our predictions are. Goal: minimize loss!

#### For Classification:

**1. Binary Cross-Entropy Loss** (for binary classification)
- Formula: `-[y log(ŷ) + (1-y) log(1-ŷ)]`
- Where y = actual (0 or 1), ŷ = predicted probability
- Penalizes confident wrong predictions heavily

**2. Categorical Cross-Entropy Loss** (for multi-class)
- Extension of binary cross-entropy for multiple classes

#### For Regression:

**1. Mean Squared Error (MSE)**
- Formula: `average of (actual - predicted)²`
- Penalizes large errors more heavily

**2. Mean Absolute Error (MAE)**
- Formula: `average of |actual - predicted|`
- Less sensitive to outliers than MSE
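
Two of the claims above are easy to verify numerically: cross-entropy punishes confident wrong predictions, and MSE reacts more strongly to an outlier than MAE. The numbers are made up, and the helper names `bce_loss`, `mse_loss`, and `mae_loss` are chosen for this sketch.

```python
import math

def bce_loss(y_true, y_prob):
    """Mean binary cross-entropy over a list of examples."""
    return -sum(
        y * math.log(p) + (1 - y) * math.log(1 - p)
        for y, p in zip(y_true, y_prob)
    ) / len(y_true)

def mse_loss(actual, predicted):
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def mae_loss(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# The passenger survived (y = 1): compare a confident-right and a confident-wrong prediction
loss_right = bce_loss([1], [0.9])
loss_wrong = bce_loss([1], [0.1])
print(f"BCE, predicted 0.9: {loss_right:.3f}   BCE, predicted 0.1: {loss_wrong:.3f}")

# Regression with one outlier (the last value)
actual_demo = [10.0, 12.0, 11.0, 50.0]
predicted_demo = [11.0, 11.0, 11.0, 11.0]
print(f"MSE = {mse_loss(actual_demo, predicted_demo):.2f}, "
      f"MAE = {mae_loss(actual_demo, predicted_demo):.2f}")
```

The single outlier dominates the MSE because its error is squared, while it contributes only linearly to the MAE.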

In [None]:
# Visualize Binary Cross-Entropy Loss
# When actual value is 1 (survived)
y_pred_prob = np.linspace(0.01, 0.99, 100)

# Loss when actual = 1
loss_when_1 = -np.log(y_pred_prob)

# Loss when actual = 0
loss_when_0 = -np.log(1 - y_pred_prob)

fig = go.Figure()
fig.add_trace(go.Scatter(x=y_pred_prob, y=loss_when_1, 
                         mode='lines', name='Actual = 1 (Survived)'))
fig.add_trace(go.Scatter(x=y_pred_prob, y=loss_when_0, 
                         mode='lines', name='Actual = 0 (Died)'))

fig.update_layout(title='Binary Cross-Entropy Loss',
                  xaxis_title='Predicted Probability',
                  yaxis_title='Loss',
                  showlegend=True)
fig.show()

print("Interpretation:")
print("- If actual = 1, we want predicted probability close to 1 (low loss)")
print("- If actual = 0, we want predicted probability close to 0 (low loss)")
print("- Being confident and wrong results in very high loss!")

### Training Neural Networks (Conceptual)

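Neural networks are trained by combining **backpropagation** (which computes the gradients) with an optimizer such as gradient descent (which uses them to update the weights):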

1. **Forward pass**: Make predictions
2. **Calculate loss**: How wrong are we?
3. **Backward pass**: Calculate gradient of loss with respect to each weight
4. **Update weights**: Adjust weights in direction that reduces loss
5. **Repeat**: Go through dataset multiple times (epochs)

**Gradient Descent** optimization:
```
new_weight = old_weight - learning_rate × gradient
```
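
The training loop can be illustrated on the smallest possible "network": a single sigmoid neuron with one input. This is only a sketch with a made-up example and an arbitrary learning rate. For a sigmoid output with binary cross-entropy loss, the gradient of the loss with respect to the pre-activation `z` simplifies to `y_hat - y`, which keeps the backward pass to one line.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x, y = 2.0, 1.0        # one made-up training example: feature and true label
w, b = 0.0, 0.0        # initial weight and bias
learning_rate = 0.5    # arbitrary choice for this sketch

losses = []
for step in range(5):
    # 1. Forward pass
    y_hat = sigmoid(w * x + b)
    # 2. Calculate loss (binary cross-entropy)
    loss = -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))
    losses.append(loss)
    # 3. Backward pass: gradients via the chain rule
    grad_z = y_hat - y
    grad_w, grad_b = grad_z * x, grad_z
    # 4. Update: new_weight = old_weight - learning_rate * gradient
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print([round(value, 4) for value in losses])
```

The loss shrinks at every step as the weight and bias move toward values that predict the correct label.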

**Note**: We won't implement neural networks from scratch today, but libraries like TensorFlow and PyTorch handle this for us!

---
## Part 10: Exercises and Practice (5 mins)

### Exercise 10.1: Improve Classification Performance

Try to improve model accuracy by:
1. Adding more features
2. Engineering new features
3. Trying different hyperparameters

In [None]:
# TODO: Your turn! Try to improve the model
# Ideas:
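# - Add 'deck_known' feature (binary: 1 if 'deck' is not null; the seaborn data has 'deck' instead of 'cabin')
# - Add 'is_alone' feature (1 if family_size == 1)
# - Use the 'who' column (man, woman, child); the seaborn data has no 'name' column to extract titles from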
# - Try different train-test split ratios
# - Experiment with different K values for K-NN

# YOUR CODE HERE


### Exercise 10.2: Predict Passenger Class

Change the problem: Predict `pclass` (1, 2, or 3) instead of survival.

In [None]:
# TODO: Build a multi-class classifier
# 1. Select appropriate features (don't include pclass!)
# 2. Split data
# 3. Train a model (Logistic Regression works for multi-class)
# 4. Evaluate using accuracy

# YOUR CODE HERE


### Exercise 10.3: Clustering Analysis

Analyze the clusters you created:
1. What are the characteristics of each cluster?
2. How do survival rates differ across clusters?
3. Can you give meaningful names to each cluster?

In [None]:
# TODO: Analyze clusters
# Calculate survival rate for each cluster
# Look at other characteristics (sex, embarked, etc.)

# YOUR CODE HERE


---
## Summary & Reflection

### Key Takeaways

Today we learned:

#### 1. Machine Learning Fundamentals
- Machine learning is an optimization task: minimize loss by adjusting parameters
- Three types: supervised (labeled data), unsupervised (unlabeled), reinforcement (rewards)

#### 2. Supervised Learning: Classification
- **K-NN**: Classify based on nearest neighbors (simple, intuitive)
- **Decision Trees**: Create decision rules (interpretable)
- **Logistic Regression**: Linear model with sigmoid activation (fast, probabilistic)
- **SVM**: Find optimal decision boundary (effective in high dimensions)

#### 3. Supervised Learning: Regression
- **Linear Regression**: Predict continuous values with linear relationship
- Metrics: MSE, RMSE (lower is better), R² (higher is better)

#### 4. Model Evaluation
- **Train-test split**: Essential for detecting overfitting
- **Cross-validation**: More robust evaluation using multiple splits
- **Metrics**: Accuracy, precision, recall, F1-score (classification); MSE, RMSE, R² (regression)

#### 5. Unsupervised Learning: Clustering
- **K-Means**: Group similar data points without labels
- **Elbow method**: Find optimal number of clusters
- **Silhouette score**: Measure clustering quality

#### 6. Neural Networks (Conceptual)
- **Architecture**: Input → Hidden Layer(s) → Output
- **Activation functions**: ReLU, sigmoid, tanh, softmax
- **Hyperparameters**: layers, neurons, learning rate, batch size, epochs
- **Loss functions**: Cross-entropy (classification), MSE/MAE (regression)
- **Training**: Forward pass + backpropagation + gradient descent

### Machine Learning Workflow

```
1. Define Problem (classification, regression, clustering)
   ↓
2. Prepare Data (cleaning, encoding, feature engineering)
   ↓
3. Split Data (train/test or cross-validation)
   ↓
4. Choose Algorithm (based on problem type)
   ↓
5. Train Model (fit on training data)
   ↓
6. Evaluate Model (test on unseen data)
   ↓
7. Tune Hyperparameters (improve performance)
   ↓
8. Deploy Model (use for predictions)
```
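
Steps 1 to 6 of this workflow can be sketched in a few lines of scikit-learn. The data here is synthetic (generated just for the example), and the choice of `LogisticRegression` inside a pipeline is one reasonable option among many rather than a recommendation for every problem.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Steps 1-2: a binary classification problem on synthetic, already-clean data
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# Step 3: hold out a test set that is not touched until the final evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Step 4: choose an algorithm; the pipeline keeps scaling inside each CV fold
model = make_pipeline(StandardScaler(), LogisticRegression())

# Cross-validate on the training data only, to estimate performance
cv_scores = cross_val_score(model, X_train, y_train, cv=5)

# Step 5: train on the full training set
model.fit(X_train, y_train)

# Step 6: evaluate once on the unseen test set
test_accuracy = model.score(X_test, y_test)
print(f"CV accuracy: {cv_scores.mean():.3f}, test accuracy: {test_accuracy:.3f}")
```

Putting the scaler inside the pipeline means it is fitted only on the training portion of each fold, which avoids leaking information from validation data.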

### Reflection Questions

1. **What's the difference between classification and regression?**

   Your answer: ___________________________________

2. **Why do we need to split data into training and testing sets?**

   Your answer: ___________________________________

3. **Which classification model performed best on the Titanic dataset? Why do you think that is?**

   Your answer: ___________________________________

4. **What are the three most important features for predicting survival?**

   Your answer: ___________________________________

5. **When would you use unsupervised learning instead of supervised learning?**

   Your answer: ___________________________________

6. **What is the purpose of activation functions in neural networks?**

   Your answer: ___________________________________

7. **What's the relationship between loss functions and optimization?**

   Your answer: ___________________________________

---
## Bonus Challenges (Optional)

### Challenge 1: Feature Engineering
Create a new feature called `fare_category` that bins fare into 4 categories (cheap, medium, expensive, luxury). Does this improve model performance?
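
As a hint, `pd.qcut` splits a column into bins that each hold roughly the same number of rows. The sketch below uses eight made-up fares (the frame `toy` is hypothetical); equal-frequency bins are one choice, and `pd.cut` with hand-picked price thresholds would be an equally valid alternative.

```python
import pandas as pd

# Hypothetical fares, already sorted to make the result easy to read
toy = pd.DataFrame({"fare": [5, 7, 10, 15, 30, 50, 100, 500]})

# Four quantile-based bins, labelled from cheapest to most expensive
toy["fare_category"] = pd.qcut(
    toy["fare"], q=4, labels=["cheap", "medium", "expensive", "luxury"]
)

print(toy)
```

Because the result is categorical, encode it (for example with `pd.get_dummies`) before passing it to a scikit-learn model.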

In [None]:
# YOUR CODE HERE


### Challenge 2: Ensemble Methods
Research and try a Random Forest classifier (ensemble of decision trees). How does it compare to individual models?
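
To show the shape of such a comparison, here is a sketch that cross-validates a single decision tree and a random forest on the same synthetic data. The data and the settings (`n_estimators=100`, 5 folds) are assumptions for the example, and which model wins on the Titanic data is for you to find out.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary problem with an interaction between two features and 10% label noise
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
y = ((X[:, 0] * X[:, 1] > 0) ^ (rng.random(300) < 0.1)).astype(int)

models = {
    "decision_tree": DecisionTreeClassifier(random_state=42),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

# Same data and same folds for both models, so the mean accuracies are comparable
scores = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in models.items()
}

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

A forest averages many trees trained on resampled data, which usually reduces the variance of a single deep tree.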

In [None]:
# Hint: from sklearn.ensemble import RandomForestClassifier
# YOUR CODE HERE


### Challenge 3: Hyperparameter Tuning
Use GridSearchCV to find the best hyperparameters for your favorite model.
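
The sketch below shows one way to set this up for K-NN. The data is synthetic, and the step names `scale` and `knn` as well as the grid values are arbitrary choices made for the example.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary problem
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Scaling matters for distance-based models, so it goes inside the pipeline
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Parameters of a pipeline step are addressed as "<step name>__<parameter>"
param_grid = {
    "knn__n_neighbors": [3, 5, 7, 9],
    "knn__weights": ["uniform", "distance"],
}

# Every combination is scored with 5-fold cross-validation
search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print("Best parameters:", search.best_params_)
print(f"Best CV accuracy: {search.best_score_:.3f}")
```

Run the search on the training split only, and keep the test set for a single final check of `search.best_estimator_`.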

In [None]:
# Hint: from sklearn.model_selection import GridSearchCV
# YOUR CODE HERE


---
## Resources for Further Learning

### Documentation
- **Scikit-learn User Guide**: https://scikit-learn.org/stable/user_guide.html
- **Scikit-learn Cheat Sheet**: https://scikit-learn.org/stable/tutorial/machine_learning_map/

### Tutorials
- **Machine Learning Crash Course (Google)**: https://developers.google.com/machine-learning/crash-course
- **Kaggle Learn**: https://www.kaggle.com/learn
- **Neural Networks Explained**: https://www.youtube.com/watch?v=aircAruvnKk

### Books
- *Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow* by Aurélien Géron
- *An Introduction to Statistical Learning* (free PDF): https://www.statlearning.com/

### Practice
- **Kaggle Competitions**: https://www.kaggle.com/competitions
- **UCI ML Repository**: https://archive.ics.uci.edu/ml/

---

**Great job today! You've taken your first steps into machine learning!**

Next session: Advanced machine learning techniques and deep learning with neural networks!