# Day 6 - Introduction to Machine Learning: Complete Solutions

**Comprehensive Solutions Notebook**

This notebook provides complete solutions for all Day 6 exercises including:
- Train-test split implementation with best practices
- Multiple classification algorithms (KNN, Decision Trees, Logistic Regression)
- Regression analysis (Linear Regression for fare prediction)
- Unsupervised learning (K-means clustering)
- Neural network concepts and examples
- Comprehensive visualizations and model comparisons
- Performance metrics and evaluation

---

## Setup and Imports

First, let's import all necessary libraries and configure our environment.

In [None]:
# Installation commands (uncomment if needed)
# !pip install pandas numpy seaborn plotly scikit-learn matplotlib

# Core data manipulation
import pandas as pd
import numpy as np

# Visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Scikit-learn imports
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder
from sklearn.impute import SimpleImputer

# Classification and regression algorithms
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.ensemble import RandomForestClassifier

# Clustering
from sklearn.cluster import KMeans

# Metrics
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, classification_report,
    roc_auc_score, roc_curve, auc,
    mean_squared_error, mean_absolute_error, r2_score,
    silhouette_score
)

# Neural network (for conceptual examples)
from sklearn.neural_network import MLPClassifier

# Configuration
sns.set_theme(style='whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)
np.random.seed(42)

# Suppress warnings for cleaner output
import warnings
warnings.filterwarnings('ignore')

print("All libraries imported successfully!")
print(f"NumPy version: {np.__version__}")
print(f"Pandas version: {pd.__version__}")

## 1. Data Loading and Initial Exploration

We'll use the famous Titanic dataset to demonstrate machine learning concepts.

In [None]:
# Load the Titanic dataset
df = sns.load_dataset('titanic')
df_original = df.copy()  # Keep pristine copy for reference

print(f"Dataset shape: {df.shape}")
print(f"\nColumns: {df.columns.tolist()}")
print(f"\nFirst few rows:")
df.head()

In [None]:
# Get detailed information about the dataset
print("Dataset Information:")
print("="*60)
df.info()

print("\n" + "="*60)
print("Missing Values Summary:")
print("="*60)
missing_data = pd.DataFrame({
    'Column': df.columns,
    'Missing Count': df.isnull().sum(),
    'Missing Percentage': (df.isnull().sum() / len(df) * 100).round(2)
})
print(missing_data[missing_data['Missing Count'] > 0].to_string(index=False))

print("\n" + "="*60)
print("Target Variable Distribution (Survived):")
print("="*60)
print(df['survived'].value_counts())
print(f"\nSurvival Rate: {df['survived'].mean():.2%}")

## 2. Machine Learning Concepts

### Supervised vs. Unsupervised Learning

**Supervised Learning:**
- We have labeled data (input features + target variable)
- Goal: Learn a mapping from inputs to outputs
- Examples: Classification (predicting survival), Regression (predicting fare)

**Unsupervised Learning:**
- We have unlabeled data (only input features)
- Goal: Discover patterns or structure in the data
- Examples: Clustering (grouping similar passengers), Dimensionality Reduction

### The ML Workflow

1. **Data Preparation**: Clean, transform, and engineer features
2. **Train-Test Split**: Separate data for training and evaluation
3. **Model Training**: Fit the model on training data
4. **Model Evaluation**: Assess performance on test data
5. **Model Comparison**: Compare multiple models
6. **Hyperparameter Tuning**: Optimize model parameters
7. **Deployment**: Use the model for predictions

## 3. Data Preprocessing for Supervised Learning

**Best Practices:**
- Handle missing values systematically
- Encode categorical variables appropriately
- Scale numerical features when necessary
- Create meaningful features from raw data
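
To apply these practices without leaking information, one option is to learn the imputation and scaling statistics from the training rows only and reuse them on the test rows. The sketch below uses only the standard library. The helper names `fit_stats` and `transform` and the toy ages are illustrative assumptions, not part of this notebook's pipeline.

```python
import statistics

def fit_stats(values):
    """Learn median, mean, and std from training values, ignoring missing entries."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    filled = [median if v is None else v for v in values]
    mean = statistics.fmean(filled)
    std = statistics.pstdev(filled) or 1.0  # guard against zero variance
    return {"median": median, "mean": mean, "std": std}

def transform(values, stats):
    """Impute with the training median, then standardize with training mean/std."""
    filled = [stats["median"] if v is None else v for v in values]
    return [(v - stats["mean"]) / stats["std"] for v in filled]

train_ages = [22.0, None, 30.0, 38.0]  # toy training column with one missing value
test_ages = [None, 26.0]               # toy test column

stats = fit_stats(train_ages)          # statistics come from the training data only
train_scaled = transform(train_ages, stats)
test_scaled = transform(test_ages, stats)
print(stats["median"], [round(v, 3) for v in test_scaled])
```

The test column is never used to compute a statistic. It is only transformed with values learned from the training column, which mirrors the `fit_transform` / `transform` pattern used with `StandardScaler` in this notebook.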

In [None]:
# Create a working copy for machine learning
df_ml = df.copy()

print("Data Preprocessing Steps:")
print("="*60)

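# Step 1: Handle missing values
# Note: for simplicity, the fill values below are computed on the full dataset.
# In a strict workflow, compute them on the training split only to avoid leakage.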
print("\n1. Handling Missing Values:")
print("   - Age: Filling with median")
df_ml['age'] = df_ml['age'].fillna(df_ml['age'].median())

print("   - Fare: Filling with median")
df_ml['fare'] = df_ml['fare'].fillna(df_ml['fare'].median())

print("   - Embarked: Filling with mode (most common value)")
df_ml['embarked'] = df_ml['embarked'].fillna(df_ml['embarked'].mode()[0])

# Step 2: Feature engineering
print("\n2. Feature Engineering:")
print("   - Creating 'family_size' from sibsp and parch")
df_ml['family_size'] = df_ml['sibsp'] + df_ml['parch'] + 1

print("   - Creating 'is_alone' indicator")
df_ml['is_alone'] = (df_ml['family_size'] == 1).astype(int)

print("   - Creating age groups")
df_ml['age_group'] = pd.cut(df_ml['age'], 
                              bins=[0, 12, 18, 35, 60, 100], 
                              labels=['Child', 'Teen', 'Adult', 'Middle_Age', 'Senior'])

# Step 3: Encode categorical variables
print("\n3. Encoding Categorical Variables:")
print("   - Using one-hot encoding for: sex, embarked, class, age_group")
df_ml = pd.get_dummies(df_ml, columns=['sex', 'embarked', 'class', 'age_group'], drop_first=True)

# Step 4: Select features for modeling
print("\n4. Feature Selection:")
feature_cols = ['age', 'fare', 'family_size', 'is_alone'] + \
               [c for c in df_ml.columns if c.startswith(('sex_', 'embarked_', 'class_', 'age_group_'))]

X = df_ml[feature_cols]
y = df_ml['survived']

print(f"\nFinal feature set: {len(feature_cols)} features")
print(f"Features: {feature_cols}")
print(f"\nX shape: {X.shape}")
print(f"y shape: {y.shape}")
print(f"\nClass distribution:")
print(y.value_counts())

## 4. Train-Test Split Implementation

**Why Split Data?**
- Training set: Used to train the model
- Test set: Used to evaluate model performance on unseen data
- Helps detect overfitting and gives a realistic estimate of performance on new data

**Best Practices:**
- Typical split: 80/20 or 70/30 (train/test)
- Use stratification for imbalanced datasets
- Set random_state for reproducibility
- Never look at test data during training!
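
To see what stratification does, here is a minimal standard-library sketch that splits each class separately so both subsets keep the same class ratio. The function `stratified_split` and the toy labels are hypothetical illustrations. This is a simplified stand-in for the idea behind `train_test_split(..., stratify=y)`, not scikit-learn's actual implementation.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) with each class split in the same proportion."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

labels = [0] * 60 + [1] * 40  # toy labels: 40% positive class
train_idx, test_idx = stratified_split(labels, test_size=0.2)

train_rate = sum(labels[i] for i in train_idx) / len(train_idx)
test_rate = sum(labels[i] for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx), train_rate, test_rate)
```

Because every class is shuffled and cut independently, the positive rate is identical in both subsets here. A plain random split would only match it on average.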

In [None]:
# Perform train-test split with stratification
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.2,        # 20% for testing
    random_state=42,      # For reproducibility
    stratify=y            # Maintain class distribution
)

print("Train-Test Split Results:")
print("="*60)
print(f"Training set size: {X_train.shape[0]} samples ({X_train.shape[0]/len(X)*100:.1f}%)")
print(f"Test set size: {X_test.shape[0]} samples ({X_test.shape[0]/len(X)*100:.1f}%)")

print("\nClass distribution in training set:")
print(y_train.value_counts())
print(f"Survival rate: {y_train.mean():.2%}")

print("\nClass distribution in test set:")
print(y_test.value_counts())
print(f"Survival rate: {y_test.mean():.2%}")

# Optional: Scale features (important for some algorithms like KNN)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Convert back to DataFrame for easier handling
X_train_scaled = pd.DataFrame(X_train_scaled, columns=X_train.columns, index=X_train.index)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=X_test.columns, index=X_test.index)

print("\nFeatures scaled using StandardScaler (mean=0, std=1)")
print("\nTraining Features Summary (before scaling):")
print(X_train.describe())

## 5. Classification Models

We'll implement and compare four classification algorithms:
1. **Logistic Regression**: Linear model for binary classification
2. **K-Nearest Neighbors (KNN)**: Instance-based learning
3. **Decision Tree**: Tree-based model with interpretable rules
4. **Neural Network**: Multi-layer perceptron

### 5.1 Logistic Regression

**When to use:**
- Binary or multi-class classification
- Need interpretable coefficients
- Assume linear relationship between features and log-odds of target

**Advantages:**
- Fast training and prediction
- Probabilistic predictions
- Good baseline model

**Disadvantages:**
- Assumes linear decision boundary
- May underfit complex patterns
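
The link between coefficients, log-odds, and probabilities can be sketched in a few lines. The weights and bias below are made-up values chosen for illustration, not coefficients fitted on the Titanic data.

```python
import math

def sigmoid(z):
    """Map a log-odds value to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, weights, bias):
    """Probability of the positive class for one sample."""
    log_odds = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(log_odds)

# Hypothetical coefficients for two scaled features (not fitted values)
weights = [1.2, -0.8]
bias = -0.3

p = predict_proba([0.5, 1.0], weights, bias)
label = int(p >= 0.5)  # default 0.5 decision threshold
print(round(p, 4), label)
```

A positive coefficient pushes the log-odds (and therefore the probability) up as its feature grows, and a negative one pushes it down. This is why the sign and size of each coefficient can be read directly.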

In [None]:
# Train Logistic Regression model
print("Training Logistic Regression Model...")
print("="*60)

lr_model = LogisticRegression(
    max_iter=1000,
    random_state=42,
    solver='lbfgs'
)

lr_model.fit(X_train_scaled, y_train)

# Make predictions
y_pred_lr = lr_model.predict(X_test_scaled)
y_pred_proba_lr = lr_model.predict_proba(X_test_scaled)[:, 1]

# Evaluate model
accuracy_lr = accuracy_score(y_test, y_pred_lr)
precision_lr = precision_score(y_test, y_pred_lr)
recall_lr = recall_score(y_test, y_pred_lr)
f1_lr = f1_score(y_test, y_pred_lr)
roc_auc_lr = roc_auc_score(y_test, y_pred_proba_lr)

print(f"\nLogistic Regression Performance:")
print(f"  Accuracy:  {accuracy_lr:.4f}")
print(f"  Precision: {precision_lr:.4f}")
print(f"  Recall:    {recall_lr:.4f}")
print(f"  F1 Score:  {f1_lr:.4f}")
print(f"  ROC AUC:   {roc_auc_lr:.4f}")

print("\nConfusion Matrix:")
cm_lr = confusion_matrix(y_test, y_pred_lr)
print(cm_lr)

print("\nClassification Report:")
print(classification_report(y_test, y_pred_lr, target_names=['Not Survived', 'Survived']))

# Feature importance (coefficients)
print("\nTop 10 Most Important Features (by absolute coefficient):")
feature_importance_lr = pd.DataFrame({
    'Feature': X_train.columns,
    'Coefficient': lr_model.coef_[0]
}).sort_values('Coefficient', key=abs, ascending=False)
print(feature_importance_lr.head(10).to_string(index=False))

### 5.2 K-Nearest Neighbors (KNN)

**How it works:**
- Finds K nearest training examples to a test point
- Predicts based on majority vote of neighbors
- Distance metric: Usually Euclidean distance

**When to use:**
- Non-linear decision boundaries
- Small to medium datasets
- Data is properly scaled

**Advantages:**
- Simple and intuitive
- No training phase (lazy learning)
- Can handle complex patterns

**Disadvantages:**
- Slow prediction on large datasets
- Sensitive to feature scaling
- Curse of dimensionality
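
The voting rule described above fits in a few lines of plain Python. This is a minimal sketch on toy points that are assumed to be already scaled. The function name `knn_predict` is illustrative, and `KNeighborsClassifier` uses faster data structures internally.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    distances = [
        (math.dist(x, query), label)  # Euclidean distance to each training point
        for x, label in zip(train_X, train_y)
    ]
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy, already-scaled points forming two loose groups
train_X = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (1.0, 1.1), (0.9, 1.0), (1.2, 0.8)]
train_y = [0, 0, 0, 1, 1, 1]

print(knn_predict(train_X, train_y, (0.1, 0.1), k=3))
print(knn_predict(train_X, train_y, (1.0, 0.9), k=3))
```

Every prediction scans the whole training set, which is why KNN gets slow on large datasets. The reliance on raw distances is also why unscaled features with large ranges would dominate the vote.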

In [None]:
# Train KNN model
print("Training K-Nearest Neighbors Model...")
print("="*60)

knn_model = KNeighborsClassifier(
    n_neighbors=5,
    weights='uniform',  # All neighbors weighted equally
    metric='euclidean'
)

knn_model.fit(X_train_scaled, y_train)

# Make predictions
y_pred_knn = knn_model.predict(X_test_scaled)
y_pred_proba_knn = knn_model.predict_proba(X_test_scaled)[:, 1]

# Evaluate model
accuracy_knn = accuracy_score(y_test, y_pred_knn)
precision_knn = precision_score(y_test, y_pred_knn)
recall_knn = recall_score(y_test, y_pred_knn)
f1_knn = f1_score(y_test, y_pred_knn)
roc_auc_knn = roc_auc_score(y_test, y_pred_proba_knn)

print(f"\nK-Nearest Neighbors Performance (k=5):")
print(f"  Accuracy:  {accuracy_knn:.4f}")
print(f"  Precision: {precision_knn:.4f}")
print(f"  Recall:    {recall_knn:.4f}")
print(f"  F1 Score:  {f1_knn:.4f}")
print(f"  ROC AUC:   {roc_auc_knn:.4f}")

print("\nConfusion Matrix:")
cm_knn = confusion_matrix(y_test, y_pred_knn)
print(cm_knn)

print("\nClassification Report:")
print(classification_report(y_test, y_pred_knn, target_names=['Not Survived', 'Survived']))

In [None]:
# Find optimal K value
print("Finding Optimal K Value...")
print("="*60)

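k_values = range(1, 31)
train_scores = []
cv_scores = []
test_scores = []

for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    # Select K with 5-fold cross-validation on the training set only,
    # so the test set stays untouched for the final evaluation
    cv_scores.append(cross_val_score(knn, X_train_scaled, y_train, cv=5).mean())

    knn.fit(X_train_scaled, y_train)
    train_scores.append(knn.score(X_train_scaled, y_train))
    test_scores.append(knn.score(X_test_scaled, y_test))  # Plotted for reference only

# Find best K from the cross-validation scores
best_k = k_values[int(np.argmax(cv_scores))]
best_score = max(cv_scores)

print(f"\nOptimal K: {best_k}")
print(f"Best Cross-Validated Accuracy: {best_score:.4f}")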

# Create visualization
fig = go.Figure()
fig.add_trace(go.Scatter(x=list(k_values), y=train_scores, mode='lines+markers', name='Training Accuracy'))
fig.add_trace(go.Scatter(x=list(k_values), y=test_scores, mode='lines+markers', name='Test Accuracy'))
fig.add_vline(x=best_k, line_dash="dash", line_color="red", annotation_text=f"Optimal K={best_k}")

fig.update_layout(
    title='KNN Performance vs K Value',
    xaxis_title='K (Number of Neighbors)',
    yaxis_title='Accuracy',
    hovermode='x unified',
    template='plotly_white'
)

fig.show()

# Retrain with optimal K
knn_optimal = KNeighborsClassifier(n_neighbors=best_k)
knn_optimal.fit(X_train_scaled, y_train)
y_pred_knn_optimal = knn_optimal.predict(X_test_scaled)
accuracy_knn_optimal = accuracy_score(y_test, y_pred_knn_optimal)

print(f"\nOptimized KNN Accuracy: {accuracy_knn_optimal:.4f}")

### 5.3 Decision Tree Classifier

**How it works:**
- Creates a tree of if-then-else decision rules
- Splits data based on feature values
- Each leaf node represents a class prediction

**When to use:**
- Need interpretable model
- Non-linear relationships
- Mixed feature types (numerical and categorical)

**Advantages:**
- Highly interpretable
- Handles non-linear patterns
- No need for feature scaling
- Can handle missing values in some implementations (recent scikit-learn versions included)

**Disadvantages:**
- Prone to overfitting
- Unstable (small data changes can cause large tree changes)
- Biased toward features with many levels
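
To make "splits data based on feature values" concrete, the sketch below scores candidate thresholds on a single numeric feature with Gini impurity and keeps the best one. The helper names and the toy ages and labels are assumptions for illustration. Real implementations differ in details such as where thresholds are placed and which criterion is used.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0 for a pure node, 0.5 for a 50/50 binary node."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_threshold(values, labels):
    """Find the threshold on one numeric feature that minimizes weighted Gini."""
    best = (None, float("inf"))
    for threshold in sorted(set(values))[:-1]:  # the largest value cannot split
        left = [y for v, y in zip(values, labels) if v <= threshold]
        right = [y for v, y in zip(values, labels) if v > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (threshold, score)
    return best

ages = [4, 8, 25, 30, 45, 60]   # toy feature values
survived = [1, 1, 0, 0, 0, 1]   # toy labels

threshold, impurity = best_threshold(ages, survived)
print(threshold, round(impurity, 3))
```

A full tree repeats this search over every feature at every node, then recurses into the two child nodes. Limits such as `max_depth` and `min_samples_leaf` stop that recursion before it memorizes the training data.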

In [None]:
# Train Decision Tree model
print("Training Decision Tree Classifier...")
print("="*60)

dt_model = DecisionTreeClassifier(
    max_depth=5,           # Limit depth to prevent overfitting
    min_samples_split=20,  # Minimum samples required to split
    min_samples_leaf=10,   # Minimum samples required at leaf node
    random_state=42
)

dt_model.fit(X_train, y_train)  # No scaling needed for trees

# Make predictions
y_pred_dt = dt_model.predict(X_test)
y_pred_proba_dt = dt_model.predict_proba(X_test)[:, 1]

# Evaluate model
accuracy_dt = accuracy_score(y_test, y_pred_dt)
precision_dt = precision_score(y_test, y_pred_dt)
recall_dt = recall_score(y_test, y_pred_dt)
f1_dt = f1_score(y_test, y_pred_dt)
roc_auc_dt = roc_auc_score(y_test, y_pred_proba_dt)

print(f"\nDecision Tree Performance:")
print(f"  Accuracy:  {accuracy_dt:.4f}")
print(f"  Precision: {precision_dt:.4f}")
print(f"  Recall:    {recall_dt:.4f}")
print(f"  F1 Score:  {f1_dt:.4f}")
print(f"  ROC AUC:   {roc_auc_dt:.4f}")

print("\nConfusion Matrix:")
cm_dt = confusion_matrix(y_test, y_pred_dt)
print(cm_dt)

print("\nClassification Report:")
print(classification_report(y_test, y_pred_dt, target_names=['Not Survived', 'Survived']))

# Feature importance
print("\nTop 10 Most Important Features:")
feature_importance_dt = pd.DataFrame({
    'Feature': X_train.columns,
    'Importance': dt_model.feature_importances_
}).sort_values('Importance', ascending=False)
print(feature_importance_dt.head(10).to_string(index=False))

In [None]:
# Visualize Decision Tree
print("Decision Tree Visualization:")
print("="*60)

plt.figure(figsize=(20, 10))
plot_tree(
    dt_model, 
    feature_names=list(X_train.columns),  # A plain list works across scikit-learn versions
    class_names=['Not Survived', 'Survived'],
    filled=True,
    rounded=True,
    fontsize=10
)
plt.title('Decision Tree Structure', fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()

# Feature importance bar chart
fig = px.bar(
    feature_importance_dt.head(10),
    x='Importance',
    y='Feature',
    orientation='h',
    title='Top 10 Feature Importances - Decision Tree',
    labels={'Importance': 'Feature Importance', 'Feature': ''},
    template='plotly_white'
)
fig.update_layout(yaxis={'categoryorder': 'total ascending'})
fig.show()

### 5.4 Neural Network (Multi-Layer Perceptron)

**How it works:**
- Multiple layers of interconnected neurons
- Each connection has a weight that's learned during training
- Uses backpropagation to update weights

**When to use:**
- Complex non-linear patterns
- Large datasets
- High-dimensional data

**Advantages:**
- Can learn very complex patterns
- Flexible architecture
- State-of-the-art for many problems

**Disadvantages:**
- Requires large amounts of data
- Computationally expensive
- Black box (hard to interpret)
- Many hyperparameters to tune
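
As a rough picture of what "layers of interconnected neurons" means, here is a forward pass through a tiny network with one ReLU hidden layer and a sigmoid output. The weights are hand-picked, hypothetical values rather than trained ones. Training, meaning backpropagation that adjusts these weights, is not shown.

```python
import math

def relu(values):
    """ReLU activation: negative inputs become 0."""
    return [max(0.0, v) for v in values]

def dense(inputs, weights, biases):
    """One fully connected layer: each row of `weights` feeds one neuron."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]

def forward(inputs, hidden_w, hidden_b, out_w, out_b):
    """Forward pass: input -> ReLU hidden layer -> sigmoid output probability."""
    hidden = relu(dense(inputs, hidden_w, hidden_b))
    logit = dense(hidden, [out_w], [out_b])[0]
    return 1.0 / (1.0 + math.exp(-logit))

# Hypothetical weights for 2 inputs, 2 hidden neurons, 1 output (not trained)
hidden_w = [[1.0, -1.0], [0.5, 0.5]]
hidden_b = [0.0, -0.5]
out_w = [2.0, -1.0]
out_b = 0.0

prob = forward([1.0, 0.0], hidden_w, hidden_b, out_w, out_b)
print(round(prob, 4))
```

The `MLPClassifier` used in this notebook follows the same pattern with larger layers (100 and 50 hidden neurons) and learns the weights from the training data.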

In [None]:
# Train Neural Network model
print("Training Neural Network (Multi-Layer Perceptron)...")
print("="*60)

nn_model = MLPClassifier(
    hidden_layer_sizes=(100, 50),  # Two hidden layers with 100 and 50 neurons
    activation='relu',              # ReLU activation function
    solver='adam',                  # Adam optimizer
    max_iter=1000,
    random_state=42,
    early_stopping=True,            # Stop when validation score stops improving
    validation_fraction=0.1
)

nn_model.fit(X_train_scaled, y_train)

# Make predictions
y_pred_nn = nn_model.predict(X_test_scaled)
y_pred_proba_nn = nn_model.predict_proba(X_test_scaled)[:, 1]

# Evaluate model
accuracy_nn = accuracy_score(y_test, y_pred_nn)
precision_nn = precision_score(y_test, y_pred_nn)
recall_nn = recall_score(y_test, y_pred_nn)
f1_nn = f1_score(y_test, y_pred_nn)
roc_auc_nn = roc_auc_score(y_test, y_pred_proba_nn)

print(f"\nNeural Network Performance:")
print(f"  Accuracy:  {accuracy_nn:.4f}")
print(f"  Precision: {precision_nn:.4f}")
print(f"  Recall:    {recall_nn:.4f}")
print(f"  F1 Score:  {f1_nn:.4f}")
print(f"  ROC AUC:   {roc_auc_nn:.4f}")

print(f"\nNetwork Architecture:")
print(f"  Input Layer: {X_train.shape[1]} features")
print(f"  Hidden Layer 1: 100 neurons")
print(f"  Hidden Layer 2: 50 neurons")
print(f"  Output Layer: 2 classes (binary)")
print(f"  Total Iterations: {nn_model.n_iter_}")

print("\nConfusion Matrix:")
cm_nn = confusion_matrix(y_test, y_pred_nn)
print(cm_nn)

print("\nClassification Report:")
print(classification_report(y_test, y_pred_nn, target_names=['Not Survived', 'Survived']))

## 6. Model Comparison and Visualization

Let's compare all classification models side by side.

In [None]:
# Create comprehensive comparison DataFrame
model_comparison = pd.DataFrame({
    'Model': ['Logistic Regression', 'KNN (k=5)', f'KNN (k={best_k})', 'Decision Tree', 'Neural Network'],
    'Accuracy': [accuracy_lr, accuracy_knn, accuracy_knn_optimal, accuracy_dt, accuracy_nn],
    'Precision': [precision_lr, precision_knn, precision_score(y_test, y_pred_knn_optimal), precision_dt, precision_nn],
    'Recall': [recall_lr, recall_knn, recall_score(y_test, y_pred_knn_optimal), recall_dt, recall_nn],
    'F1 Score': [f1_lr, f1_knn, f1_score(y_test, y_pred_knn_optimal), f1_dt, f1_nn],
    'ROC AUC': [roc_auc_lr, roc_auc_knn, roc_auc_score(y_test, knn_optimal.predict_proba(X_test_scaled)[:, 1]), roc_auc_dt, roc_auc_nn]
})

print("Model Performance Comparison:")
print("="*80)
print(model_comparison.to_string(index=False))

# Identify best model for each metric
print("\n" + "="*80)
print("Best Model by Metric:")
print("="*80)
for metric in ['Accuracy', 'Precision', 'Recall', 'F1 Score', 'ROC AUC']:
    best_idx = model_comparison[metric].idxmax()
    best_model = model_comparison.loc[best_idx, 'Model']
    best_value = model_comparison.loc[best_idx, metric]
    print(f"{metric:12s}: {best_model:25s} ({best_value:.4f})")

In [None]:
# Create visual comparison
metrics_for_plot = ['Accuracy', 'Precision', 'Recall', 'F1 Score', 'ROC AUC']

fig = go.Figure()

for idx, model_name in enumerate(model_comparison['Model']):
    values = model_comparison.iloc[idx][metrics_for_plot].values
    fig.add_trace(go.Scatterpolar(
        r=values,
        theta=metrics_for_plot,
        name=model_name,
        fill='toself'
    ))

fig.update_layout(
    polar=dict(
        radialaxis=dict(
            visible=True,
            range=[0.5, 1.0]
        )
    ),
    title='Model Performance Comparison - All Metrics',
    showlegend=True,
    template='plotly_white'
)

fig.show()

# Bar chart comparison
fig = px.bar(
    model_comparison,
    x='Model',
    y=['Accuracy', 'Precision', 'Recall', 'F1 Score'],
    title='Model Performance Metrics Comparison',
    barmode='group',
    template='plotly_white'
)
fig.update_layout(yaxis_title='Score', legend_title='Metric')
fig.show()

In [None]:
# Visualize confusion matrices for all models
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
fig.suptitle('Confusion Matrices - All Models', fontsize=16, fontweight='bold')

confusion_matrices = [
    (cm_lr, 'Logistic Regression', accuracy_lr),
    (cm_knn, 'KNN (k=5)', accuracy_knn),
    (confusion_matrix(y_test, y_pred_knn_optimal), f'KNN (k={best_k})', accuracy_knn_optimal),
    (cm_dt, 'Decision Tree', accuracy_dt),
    (cm_nn, 'Neural Network', accuracy_nn)
]

for idx, (cm, title, acc) in enumerate(confusion_matrices):
    ax = axes[idx // 3, idx % 3]
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=ax, 
                xticklabels=['Not Survived', 'Survived'],
                yticklabels=['Not Survived', 'Survived'])
    ax.set_title(f'{title}\nAccuracy: {acc:.4f}', fontweight='bold')
    ax.set_ylabel('True Label')
    ax.set_xlabel('Predicted Label')

# Hide unused subplot
axes[1, 2].axis('off')

plt.tight_layout()
plt.show()

In [None]:
# ROC Curves for all models
fig = go.Figure()

# Calculate ROC curve for each model
models_roc = [
    ('Logistic Regression', y_pred_proba_lr, roc_auc_lr),
    ('KNN (k=5)', y_pred_proba_knn, roc_auc_knn),
    (f'KNN (k={best_k})', knn_optimal.predict_proba(X_test_scaled)[:, 1], roc_auc_score(y_test, knn_optimal.predict_proba(X_test_scaled)[:, 1])),
    ('Decision Tree', y_pred_proba_dt, roc_auc_dt),
    ('Neural Network', y_pred_proba_nn, roc_auc_nn)
]

for model_name, y_proba, auc_score in models_roc:
    fpr, tpr, _ = roc_curve(y_test, y_proba)
    fig.add_trace(go.Scatter(
        x=fpr, y=tpr,
        mode='lines',
        name=f'{model_name} (AUC={auc_score:.3f})'
    ))

# Add diagonal reference line
fig.add_trace(go.Scatter(
    x=[0, 1], y=[0, 1],
    mode='lines',
    name='Random Classifier',
    line=dict(dash='dash', color='gray')
))

fig.update_layout(
    title='ROC Curves - All Models',
    xaxis_title='False Positive Rate',
    yaxis_title='True Positive Rate',
    template='plotly_white',
    hovermode='x unified'
)

fig.show()

## 7. Regression Analysis: Predicting Fare

**Regression vs Classification:**
- Classification: Predict discrete categories (survived/not survived)
- Regression: Predict continuous values (fare price)

**Linear Regression:**
- Assumes linear relationship between features and target
- Fits a line (or hyperplane) to minimize squared errors
- Fast and interpretable

**Evaluation Metrics:**
- **MSE (Mean Squared Error)**: Average squared difference between predictions and actual values
- **MAE (Mean Absolute Error)**: Average absolute difference
- **R² Score**: Proportion of variance explained by the model (at most 1, higher is better; it can be negative when the model does worse than always predicting the mean)
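
These three metrics are simple enough to compute by hand. The sketch below does so on toy fares and includes a deliberately bad set of predictions to show that R² can drop below zero. The function `regression_metrics` is an illustrative helper, not a scikit-learn function.

```python
def regression_metrics(y_true, y_pred):
    """Return (MSE, MAE, R^2) for two equal-length sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e ** 2 for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    mean_true = sum(y_true) / n
    ss_res = sum(e ** 2 for e in errors)                 # unexplained variation
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total variation
    r2 = 1.0 - ss_res / ss_tot
    return mse, mae, r2

y_true = [10.0, 20.0, 30.0, 40.0]  # toy fares

good = regression_metrics(y_true, [12.0, 18.0, 33.0, 39.0])
bad = regression_metrics(y_true, [40.0, 30.0, 20.0, 10.0])
print(good)
print(bad)  # R^2 is negative here: worse than always predicting the mean
```

MSE punishes large errors more heavily than MAE because errors are squared. Taking the square root of MSE (RMSE) brings it back to the units of the target.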

In [None]:
# Prepare data for regression (predicting fare)
print("Preparing Data for Fare Prediction (Regression)...")
print("="*60)

# Use original dataframe
df_reg = df_original.copy()

# Remove rows with missing fare
df_reg = df_reg.dropna(subset=['fare'])

# Feature engineering
df_reg['age'] = df_reg['age'].fillna(df_reg['age'].median())
df_reg['family_size'] = df_reg['sibsp'] + df_reg['parch'] + 1
df_reg['is_alone'] = (df_reg['family_size'] == 1).astype(int)

# Encode categorical variables
df_reg = pd.get_dummies(df_reg, columns=['sex', 'embarked', 'class'], drop_first=True)

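# Select features (exclude fare). 'family_size' is left out of the model because it equals
# sibsp + parch + 1, which would make the linear coefficients non-unique.
feature_cols_reg = ['age', 'sibsp', 'parch', 'is_alone', 'survived'] + \
                   [c for c in df_reg.columns if c.startswith(('sex_', 'embarked_', 'class_'))]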

X_reg = df_reg[feature_cols_reg]
y_reg = df_reg['fare']

print(f"Dataset shape: {X_reg.shape}")
print(f"\nTarget variable (Fare) statistics:")
print(y_reg.describe())

# Train-test split
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(
    X_reg, y_reg, test_size=0.2, random_state=42
)

# Scale features
scaler_reg = StandardScaler()
X_train_reg_scaled = scaler_reg.fit_transform(X_train_reg)
X_test_reg_scaled = scaler_reg.transform(X_test_reg)

print(f"\nTraining set size: {X_train_reg.shape[0]}")
print(f"Test set size: {X_test_reg.shape[0]}")

In [None]:
# Train Linear Regression model
print("Training Linear Regression Model...")
print("="*60)

lr_reg_model = LinearRegression()
lr_reg_model.fit(X_train_reg_scaled, y_train_reg)

# Make predictions
y_pred_train_reg = lr_reg_model.predict(X_train_reg_scaled)
y_pred_test_reg = lr_reg_model.predict(X_test_reg_scaled)

# Calculate metrics
train_mse = mean_squared_error(y_train_reg, y_pred_train_reg)
test_mse = mean_squared_error(y_test_reg, y_pred_test_reg)
train_mae = mean_absolute_error(y_train_reg, y_pred_train_reg)
test_mae = mean_absolute_error(y_test_reg, y_pred_test_reg)
train_r2 = r2_score(y_train_reg, y_pred_train_reg)
test_r2 = r2_score(y_test_reg, y_pred_test_reg)
train_rmse = np.sqrt(train_mse)
test_rmse = np.sqrt(test_mse)

print("\nLinear Regression Performance:")
print("="*60)
print("Training Set:")
print(f"  RMSE (Root Mean Squared Error): ${train_rmse:.2f}")
print(f"  MAE (Mean Absolute Error):      ${train_mae:.2f}")
print(f"  R² Score:                       {train_r2:.4f}")

print("\nTest Set:")
print(f"  RMSE (Root Mean Squared Error): ${test_rmse:.2f}")
print(f"  MAE (Mean Absolute Error):      ${test_mae:.2f}")
print(f"  R² Score:                       {test_r2:.4f}")

print("\nInterpretation:")
print(f"  - On average, predictions are off by ${test_mae:.2f} (MAE)")
print(f"  - Model explains {test_r2:.1%} of variance in fare prices (R²)")

# Feature coefficients
print("\nTop 10 Features Affecting Fare (by absolute coefficient):")
feature_coef = pd.DataFrame({
    'Feature': X_reg.columns,
    'Coefficient': lr_reg_model.coef_
}).sort_values('Coefficient', key=abs, ascending=False)
print(feature_coef.head(10).to_string(index=False))

In [None]:
# Visualize regression results

# 1. Predicted vs Actual values
fig = make_subplots(
    rows=1, cols=2,
    subplot_titles=('Training Set', 'Test Set')
)

# Training set
fig.add_trace(
    go.Scatter(
        x=y_train_reg, y=y_pred_train_reg,
        mode='markers',
        name='Training',
        marker=dict(size=6, opacity=0.6)
    ),
    row=1, col=1
)

# Test set
fig.add_trace(
    go.Scatter(
        x=y_test_reg, y=y_pred_test_reg,
        mode='markers',
        name='Test',
        marker=dict(size=6, opacity=0.6)
    ),
    row=1, col=2
)

# Perfect prediction line
max_val = max(y_train_reg.max(), y_test_reg.max())
for col in [1, 2]:
    fig.add_trace(
        go.Scatter(
            x=[0, max_val], y=[0, max_val],
            mode='lines',
            name='Perfect Prediction',
            line=dict(dash='dash', color='red'),
            showlegend=(col == 1)
        ),
        row=1, col=col
    )

fig.update_xaxes(title_text='Actual Fare ($)', row=1, col=1)
fig.update_xaxes(title_text='Actual Fare ($)', row=1, col=2)
fig.update_yaxes(title_text='Predicted Fare ($)', row=1, col=1)
fig.update_yaxes(title_text='Predicted Fare ($)', row=1, col=2)

fig.update_layout(
    title='Linear Regression: Predicted vs Actual Fare',
    template='plotly_white',
    showlegend=True,
    height=500
)

fig.show()

# 2. Residual plot
residuals_test = y_test_reg - y_pred_test_reg

fig = go.Figure()
fig.add_trace(go.Scatter(
    x=y_pred_test_reg,
    y=residuals_test,
    mode='markers',
    marker=dict(size=6, opacity=0.6)
))

fig.add_hline(y=0, line_dash="dash", line_color="red")

fig.update_layout(
    title='Residual Plot - Test Set',
    xaxis_title='Predicted Fare ($)',
    yaxis_title='Residuals (Actual - Predicted)',
    template='plotly_white'
)

fig.show()

# 3. Feature importance
fig = px.bar(
    feature_coef.head(10),
    x='Coefficient',
    y='Feature',
    orientation='h',
    title='Top 10 Features Affecting Fare Price',
    labels={'Coefficient': 'Coefficient Value', 'Feature': ''},
    template='plotly_white'
)
fig.update_layout(yaxis={'categoryorder': 'total ascending'})
fig.show()

## 8. Unsupervised Learning: K-Means Clustering

**Clustering:**
- Group similar data points together
- No target variable (unsupervised)
- Discover natural groupings in data

**K-Means Algorithm:**
1. Initialize K cluster centroids (randomly, or with k-means++ as scikit-learn does by default)
2. Assign each point to nearest centroid
3. Update centroids to mean of assigned points
4. Repeat steps 2-3 until convergence
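
The four steps above can be sketched in plain Python. This minimal version picks the initial centroids at random from the data points. The function name `kmeans` and the toy points are illustrative assumptions, and scikit-learn's `KMeans` adds refinements such as multiple restarts (`n_init`).

```python
import math
import random

def kmeans(points, k, n_iter=100, seed=42):
    """Plain Lloyd's algorithm; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # Step 1: random initial centroids
    labels = []
    for _ in range(n_iter):
        # Step 2: assign each point to its nearest centroid
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:               # Step 4: stop when assignments settle
            break
        labels = new_labels
        # Step 3: move each centroid to the mean of its assigned points
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:                        # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, labels

# Toy data: two well-separated groups
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1),
          (8.0, 8.0), (8.2, 7.9), (7.9, 8.3), (8.1, 8.2)]
centroids, labels = kmeans(points, k=2)
print(labels)
```

Cluster numbers are arbitrary: which group ends up as cluster 0 depends on the initial centroids, so only the grouping itself is meaningful.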

**When to use:**
- Customer segmentation
- Document clustering
- Image compression
- Anomaly detection

**Evaluation Metrics:**
- **Inertia**: Sum of squared distances from each point to its nearest centroid (lower is better for a fixed K; it always shrinks as K grows, hence the elbow heuristic)
- **Silhouette Score**: How similar points are to their cluster vs other clusters (-1 to 1, higher is better)
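
Both metrics can be computed directly from their definitions. The sketch below uses four toy points, and the helper names `inertia` and `silhouette` are illustrative. The silhouette helper assumes at least two clusters with at least two points each, and `silhouette_score` from scikit-learn handles the general case.

```python
import math

def inertia(points, labels, centroids):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))

def silhouette(points, labels):
    """Mean silhouette coefficient; assumes every cluster has at least 2 points."""
    clusters = sorted(set(labels))
    scores = []
    for i, (p, lab) in enumerate(zip(points, labels)):
        mean_dist = {}
        for c in clusters:
            others = [q for j, (q, l) in enumerate(zip(points, labels))
                      if l == c and j != i]
            mean_dist[c] = sum(math.dist(p, q) for q in others) / len(others)
        a = mean_dist[lab]                                    # cohesion: own cluster
        b = min(d for c, d in mean_dist.items() if c != lab)  # separation: nearest other
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
labels = [0, 0, 1, 1]
centroids = [(0.0, 0.5), (5.0, 0.5)]

print(inertia(points, labels, centroids))
print(round(silhouette(points, labels), 4))
print(round(silhouette(points, [0, 1, 0, 1]), 4))  # a poor grouping scores lower
```

Inertia needs the centroids and only measures compactness. Silhouette needs only the labels and also rewards separation, which makes it more useful for comparing different values of K.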

In [None]:
# Prepare data for clustering
print("Preparing Data for K-Means Clustering...")
print("="*60)

# Select numerical features for clustering
df_cluster = df_original.copy()
df_cluster['age'] = df_cluster['age'].fillna(df_cluster['age'].median())
df_cluster['fare'] = df_cluster['fare'].fillna(df_cluster['fare'].median())

# Select features for clustering
cluster_features = ['age', 'fare', 'sibsp', 'parch']
X_cluster = df_cluster[cluster_features].copy()

print(f"Clustering features: {cluster_features}")
print(f"Dataset shape: {X_cluster.shape}")
print("\nFeature statistics:")
print(X_cluster.describe())

# Scale features (important for K-means)
scaler_cluster = StandardScaler()
X_cluster_scaled = scaler_cluster.fit_transform(X_cluster)

print("\nFeatures scaled for clustering.")

In [None]:
# Find optimal number of clusters using the Elbow Method and silhouette scores
print("Finding Optimal Number of Clusters...")
print("="*60)

k_range = range(2, 11)
inertias = []
silhouette_scores = []

for k in k_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(X_cluster_scaled)
    inertias.append(kmeans.inertia_)
    silhouette_scores.append(silhouette_score(X_cluster_scaled, kmeans.labels_))

# Create elbow plot
fig = make_subplots(
    rows=1, cols=2,
    subplot_titles=('Elbow Method (Inertia)', 'Silhouette Score')
)

fig.add_trace(
    go.Scatter(x=list(k_range), y=inertias, mode='lines+markers', name='Inertia'),
    row=1, col=1
)

fig.add_trace(
    go.Scatter(x=list(k_range), y=silhouette_scores, mode='lines+markers', name='Silhouette Score'),
    row=1, col=2
)

fig.update_xaxes(title_text='Number of Clusters (K)', row=1, col=1)
fig.update_xaxes(title_text='Number of Clusters (K)', row=1, col=2)
fig.update_yaxes(title_text='Inertia', row=1, col=1)
fig.update_yaxes(title_text='Silhouette Score', row=1, col=2)

fig.update_layout(
    title='Determining Optimal Number of Clusters',
    template='plotly_white',
    showlegend=False,
    height=400
)

fig.show()

# Find optimal K based on silhouette score
optimal_k = k_range[np.argmax(silhouette_scores)]
print(f"\nOptimal number of clusters: {optimal_k}")
print(f"Silhouette score at K={optimal_k}: {max(silhouette_scores):.4f}")

In [None]:
# Train K-Means with optimal K
print(f"Training K-Means Clustering (K={optimal_k})...")
print("="*60)

kmeans = KMeans(n_clusters=optimal_k, random_state=42, n_init=10)
cluster_labels = kmeans.fit_predict(X_cluster_scaled)

# Add cluster labels to original dataframe
df_cluster['cluster'] = cluster_labels

# Calculate metrics
inertia = kmeans.inertia_
silhouette = silhouette_score(X_cluster_scaled, cluster_labels)

print(f"\nClustering Results:")
print(f"  Number of clusters: {optimal_k}")
print(f"  Inertia: {inertia:.2f}")
print(f"  Silhouette Score: {silhouette:.4f}")

print("\nCluster Sizes:")
print(df_cluster['cluster'].value_counts().sort_index())

# Analyze cluster characteristics
print("\nCluster Characteristics (Mean Values):")
print("="*60)
cluster_summary = df_cluster.groupby('cluster')[cluster_features + ['survived']].mean()
print(cluster_summary.round(2))

# Survival rate by cluster
print("\nSurvival Rate by Cluster:")
for cluster in range(optimal_k):
    survival_rate = df_cluster[df_cluster['cluster'] == cluster]['survived'].mean()
    count = (df_cluster['cluster'] == cluster).sum()
    print(f"  Cluster {cluster}: {survival_rate:.2%} (n={count})")

In [None]:
# Visualize clusters

# 3D scatter plot
fig = px.scatter_3d(
    df_cluster,
    x='age',
    y='fare',
    z='sibsp',
    color='cluster',
    symbol='survived',
    title=f'K-Means Clustering Results (K={optimal_k})',
    labels={'cluster': 'Cluster', 'survived': 'Survived'},
    color_continuous_scale='Viridis'
)
fig.update_layout(template='plotly_white')
fig.show()

# 2D projections
fig = make_subplots(
    rows=1, cols=2,
    subplot_titles=('Age vs Fare', 'Sibsp vs Parch')
)

for cluster in range(optimal_k):
    cluster_data = df_cluster[df_cluster['cluster'] == cluster]
    
    fig.add_trace(
        go.Scatter(
            x=cluster_data['age'],
            y=cluster_data['fare'],
            mode='markers',
            name=f'Cluster {cluster}',
            marker=dict(size=6, opacity=0.6)
        ),
        row=1, col=1
    )
    
    fig.add_trace(
        go.Scatter(
            x=cluster_data['sibsp'],
            y=cluster_data['parch'],
            mode='markers',
            name=f'Cluster {cluster}',
            marker=dict(size=6, opacity=0.6),
            showlegend=False
        ),
        row=1, col=2
    )

fig.update_xaxes(title_text='Age', row=1, col=1)
fig.update_xaxes(title_text='Siblings/Spouses', row=1, col=2)
fig.update_yaxes(title_text='Fare (£)', row=1, col=1)
fig.update_yaxes(title_text='Parents/Children', row=1, col=2)

fig.update_layout(
    title=f'K-Means Clustering - 2D Projections (K={optimal_k})',
    template='plotly_white',
    height=500
)

fig.show()

# Cluster characteristics heatmap
fig = px.imshow(
    cluster_summary.T,
    labels=dict(x="Cluster", y="Feature", color="Mean Value"),
    title="Cluster Characteristics Heatmap",
    aspect="auto",
    color_continuous_scale='RdBu_r',
    text_auto='.2f'
)
fig.update_layout(template='plotly_white')
fig.show()

## 9. Key Insights and Interpretations

### Classification Model Insights

**Model Performance Summary:**
- All models achieved roughly 75-82% accuracy on survival prediction
- Neural networks and optimized KNN tend to perform slightly better
- Decision trees offer the best interpretability

**Most Important Features for Survival:**
1. **Sex**: Being female significantly increased survival chances
2. **Class**: First-class passengers had higher survival rates
3. **Age**: Children had better survival rates
4. **Fare**: Higher fare (proxy for class/cabin quality) correlated with survival

### Regression Model Insights

**Fare Prediction:**
- Linear regression achieved R² ≈ 0.5-0.6
- Class is the strongest predictor of fare
- Age and family size also contribute to fare prediction
- Model has higher errors for very high fares (outliers)

### Clustering Insights

**Passenger Segments:**
- Clusters naturally separate by:
  - Age groups (children, adults, elderly)
  - Family status (traveling alone vs with family)
  - Economic status (fare level)
- Different clusters have different survival rates
- Clustering can help identify passenger profiles

## 10. Machine Learning Best Practices

### Data Preparation
1. **Handle missing values systematically**
   - Understand why data is missing (MCAR, MAR, MNAR)
   - Use appropriate imputation methods
   - Consider creating 'missing' indicator features

2. **Feature engineering**
   - Create meaningful features from raw data
   - Combine related features
   - Transform skewed distributions

3. **Feature scaling**
   - Crucial for distance-based algorithms (KNN, SVM) and gradient-trained models (neural networks)
   - Less important for tree-based methods
   - Always fit scaler on training data only
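
The preparation steps above can be sketched with a scikit-learn `Pipeline`. This is a minimal illustration, not the notebook's own preprocessing: the column names `age` and `fare` and all values are synthetic assumptions. The point is that the imputation medians and scaling statistics are learned from the training split only and then reused on the test split.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative data only: two numeric features, with 20% of ages missing.
rng = np.random.default_rng(42)
X = pd.DataFrame({
    "age": rng.normal(30, 12, size=200),
    "fare": rng.lognormal(3, 1, size=200),
})
X.loc[X.sample(frac=0.2, random_state=0).index, "age"] = np.nan

X_train, X_test = train_test_split(X, test_size=0.25, random_state=42)

preprocess = Pipeline([
    # add_indicator=True appends a 0/1 "was missing" column for imputed features.
    ("impute", SimpleImputer(strategy="median", add_indicator=True)),
    ("scale", StandardScaler()),
])

# Medians, means, and standard deviations come from the training split only.
X_train_prepared = preprocess.fit_transform(X_train)
# The test split is transformed with those same training statistics.
X_test_prepared = preprocess.transform(X_test)

print(X_train_prepared.shape, X_test_prepared.shape)
```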

### Model Training
1. **Always use train-test split**
   - Exposes overfitting that training-set scores would hide
   - Provides realistic performance estimates
   - Use stratification for imbalanced datasets

2. **Start simple**
   - Begin with baseline models (logistic regression, simple trees)
   - Establish baseline performance
   - Gradually increase complexity

3. **Cross-validation**
   - More robust than single train-test split
   - Gives a mean and spread of scores across folds
   - Helps detect overfitting
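
As a rough sketch of the cross-validation point, the snippet below scores a logistic regression with 5-fold stratified cross-validation. The data comes from `make_classification` and is only a stand-in for a prepared feature matrix; the fold count and model are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for a prepared feature matrix and binary target.
X, y = make_classification(n_samples=400, n_features=6, random_state=42)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="accuracy"
)

# Report the mean and the fold-to-fold spread rather than a single number.
print(f"Accuracy per fold: {scores.round(3)}")
print(f"Mean: {scores.mean():.3f}, std: {scores.std():.3f}")
```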

### Model Evaluation
1. **Use multiple metrics**
   - Accuracy alone is often insufficient
   - Consider precision, recall, F1, ROC AUC
   - Choose metrics based on business objectives

2. **Confusion matrix analysis**
   - Understand types of errors
   - Identify class-specific performance
   - Inform model improvements

3. **Compare multiple models**
   - Different algorithms have different strengths
   - Ensemble methods often perform best
   - Consider interpretability vs performance tradeoff
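
To make the confusion matrix analysis concrete, here is a small sketch with hand-written labels (1 = survived, 0 = did not survive). The labels are invented for illustration and are not model output from this notebook.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hand-written labels for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 1, 0]

# For binary labels, ravel() returns counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")

# Per-class precision, recall, and F1 in one table.
print(classification_report(y_true, y_pred, target_names=["died", "survived"]))
```

Here the model misses half of the survivors (2 false negatives out of 4 positives) while rarely mislabeling non-survivors, which a single accuracy figure would not reveal.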

### Common Pitfalls to Avoid
1. **Data leakage**
   - Never use test data during training
   - Fit preprocessors on training data only
   - Be careful with time-series data

2. **Overfitting**
   - High training accuracy but low test accuracy
   - Use regularization, cross-validation
   - Simplify model or get more data

3. **Underfitting**
   - Poor performance on both training and test sets
   - Model too simple for the problem
   - Add features or increase model complexity

4. **Ignoring class imbalance**
   - Can lead to biased models
   - Use stratified sampling, resampling, or class weights
   - Focus on appropriate metrics (F1, ROC AUC)
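
As one example of guarding against the class imbalance pitfall, the sketch below uses `stratify=y` so that both splits keep the original class ratio. The 10% positive rate is an assumed, illustrative value.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative imbalanced target: 10 positives out of 100 samples.
y = np.array([1] * 10 + [0] * 90)
X = np.arange(100).reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# With stratify=y, both splits preserve the 10% positive rate.
print(f"Positive rate, full:  {y.mean():.2f}")
print(f"Positive rate, train: {y_train.mean():.2f}")
print(f"Positive rate, test:  {y_test.mean():.2f}")
```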

### Model Selection Guidelines

**Logistic Regression:**
- ✅ Need interpretability
- ✅ Linear relationships
- ✅ Fast training/prediction
- ❌ Complex non-linear patterns

**K-Nearest Neighbors:**
- ✅ Non-linear patterns
- ✅ No assumptions about the data distribution (non-parametric)
- ✅ Simple to understand
- ❌ Large datasets
- ❌ High-dimensional data

**Decision Trees:**
- ✅ Interpretability
- ✅ Mixed feature types
- ✅ Non-linear patterns
- ❌ Stability (high variance)
- ❌ Overfitting tendency

**Neural Networks:**
- ✅ Complex patterns
- ✅ Large datasets
- ✅ High-dimensional data
- ❌ Interpretability
- ❌ Computational cost
- ❌ Need lots of data

## 11. Question and Answer Key

### Conceptual Questions

**Q1: What is the difference between supervised and unsupervised learning?**

**A1:** 
- **Supervised Learning**: We have labeled data (input features + target variable). The algorithm learns to map inputs to outputs. Examples: Classification (predicting survival), Regression (predicting fare).
- **Unsupervised Learning**: We have unlabeled data (only input features). The algorithm discovers patterns or structure in the data. Examples: Clustering (grouping similar passengers), Dimensionality Reduction.
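
In scikit-learn the difference shows up directly in the `fit` call. The sketch below uses synthetic blobs as placeholder data: the classifier receives features and labels, while K-means receives features only.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# Synthetic data: X holds the features, y holds the known labels.
X, y = make_blobs(n_samples=150, centers=2, random_state=42)

# Supervised: the model sees both features and labels.
clf = LogisticRegression().fit(X, y)

# Unsupervised: the model sees features only and assigns its own group ids.
km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)

print("Predicted labels:", clf.predict(X[:5]))
print("Cluster ids:     ", km.labels_[:5])
```

The cluster ids are arbitrary: cluster 0 need not correspond to class 0.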

---

**Q2: Why do we split data into training and test sets?**

**A2:** 
- To evaluate model performance on unseen data
- To detect overfitting (memorizing training data)
- To get realistic estimates of model performance in production
- Training set: Used to fit the model parameters
- Test set: Used to evaluate the final model (never used during training)

---

**Q3: What is overfitting and how can we prevent it?**

**A3:**
- **Overfitting**: Model performs well on training data but poorly on test data. The model has memorized the training data instead of learning general patterns.
- **Prevention strategies**:
  - Use a train-test split or cross-validation to detect it early
  - Regularization (L1, L2, dropout)
  - Reduce model complexity (fewer features, shallower trees, etc.)
  - Get more training data
  - Early stopping (for iterative algorithms)
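
A quick way to see overfitting is to compare training and test accuracy for an unconstrained model and a simpler one. The sketch below does this with decision trees on synthetic, deliberately noisy data (`flip_y=0.2` randomly reassigns about 20% of the labels); the depths chosen are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Noisy synthetic data makes memorization visible.
X, y = make_classification(
    n_samples=600, n_features=10, flip_y=0.2, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

results = {}
for depth in (None, 3):  # None = grow until every leaf is pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    results[depth] = (tree.score(X_train, y_train), tree.score(X_test, y_test))
    print(f"max_depth={depth}: train={results[depth][0]:.3f}, "
          f"test={results[depth][1]:.3f}")
```

A large gap between the two scores for the unconstrained tree is the signature of overfitting; limiting depth is one form of reducing model complexity.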

---

**Q4: When should you use KNN vs Logistic Regression?**

**A4:**
- **Use KNN when**:
  - Decision boundary is non-linear
  - No assumptions about data distribution
  - Small to medium dataset
  - Computational cost of prediction is acceptable

- **Use Logistic Regression when**:
  - Need interpretable coefficients
  - Decision boundary is approximately linear
  - Fast training and prediction needed
  - Want probabilistic predictions
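
The trade-off can be explored on a dataset with a curved class boundary. This sketch uses `make_moons` as an assumed stand-in (it is not the Titanic data) and fits both models with default-style settings.

```python
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Two interleaving half-circles: a non-linear decision boundary.
X, y = make_moons(n_samples=500, noise=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
logreg = LogisticRegression().fit(X_train, y_train)

knn_acc = knn.score(X_test, y_test)
logreg_acc = logreg.score(X_test, y_test)
print(f"KNN accuracy:                 {knn_acc:.3f}")
print(f"Logistic Regression accuracy: {logreg_acc:.3f}")

# Logistic regression exposes one coefficient per feature and class probabilities.
print("Coefficients:", logreg.coef_.round(2))
print("P(class 1), first test row:", round(float(logreg.predict_proba(X_test[:1])[0, 1]), 3))
```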

---

**Q5: What metrics should you use for imbalanced classification?**

**A5:**
- **Don't rely on accuracy alone** - it can be misleading
- **Better metrics**:
  - **Precision**: Of predicted positives, how many are correct? (important when false positives are costly)
  - **Recall**: Of actual positives, how many did we catch? (important when false negatives are costly)
  - **F1 Score**: Harmonic mean of precision and recall (balanced metric)
  - **ROC AUC**: Overall performance across all thresholds
  - **Confusion Matrix**: See all types of errors
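
To see why accuracy alone misleads, consider a "model" that always predicts the majority class. The labels below are invented for illustration (5 positives out of 100).

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Illustrative imbalanced labels: 5 positives out of 100.
y_true = np.array([1] * 5 + [0] * 95)
# A useless baseline that always predicts the majority class.
y_pred = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred)
# zero_division=0 returns 0 instead of warning when no positives are predicted.
prec = precision_score(y_true, y_pred, zero_division=0)
rec = recall_score(y_true, y_pred, zero_division=0)
f1 = f1_score(y_true, y_pred, zero_division=0)

print(f"accuracy={acc:.2f}, precision={prec:.2f}, recall={rec:.2f}, f1={f1:.2f}")
```

The baseline scores 95% accuracy while finding none of the positive cases, so recall and F1 are both zero.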

---

**Q6: How do you choose the number of clusters (K) in K-means?**

**A6:**
- **Elbow Method**: Plot inertia vs K, look for "elbow" where improvement slows
- **Silhouette Score**: Measures how similar points are to their own cluster vs other clusters (higher is better)
- **Domain Knowledge**: Sometimes you know how many groups make sense
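- **Dendrogram**: Applies to hierarchical clustering rather than K-means, but can suggest a plausible K
- **Gap Statistic**: Compares within-cluster dispersion to that expected from reference data with no cluster structure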

---

**Q7: Why is feature scaling important?**

**A7:**
- **Distance-based algorithms** (KNN, SVM, K-means) are sensitive to feature scales
- Features with larger scales can dominate distance calculations
- Example: Age (0-100) vs Fare (0-500) - fare would dominate without scaling
- **Not needed for tree-based methods** (they use splits, not distances)
- **Always fit scaler on training data only** to prevent data leakage
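
The effect on distances can be checked with a few made-up passengers, each given as `[age, fare]`. The values are illustrative assumptions; the point is that the nearest neighbor of the first passenger changes once both features are standardized.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up passengers as [age, fare].
X = np.array([
    [5.0, 50.0],    # row 0: child, low fare
    [70.0, 60.0],   # row 1: much older, similar fare
    [6.0, 150.0],   # row 2: similar age, higher fare
    [35.0, 300.0],
    [50.0, 500.0],
])

def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return float(np.linalg.norm(a - b))

# Raw units: the fare gap dominates, so the 70-year-old looks "closer" to row 0.
raw_ab, raw_ac = dist(X[0], X[1]), dist(X[0], X[2])
print(f"raw:    d(0,1)={raw_ab:.2f}, d(0,2)={raw_ac:.2f}")

# Standardized units: each feature is measured in its own standard deviations.
X_scaled = StandardScaler().fit_transform(X)
scaled_ab, scaled_ac = dist(X_scaled[0], X_scaled[1]), dist(X_scaled[0], X_scaled[2])
print(f"scaled: d(0,1)={scaled_ab:.2f}, d(0,2)={scaled_ac:.2f}")
```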

---

**Q8: What are the advantages of Decision Trees?**

**A8:**
- **Highly interpretable**: Can visualize and explain decisions
- **No feature scaling needed**: Uses splits, not distances
- **Handles non-linear relationships**: Can capture complex patterns
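- **Handles mixed features**: Numerical and categorical (scikit-learn trees still require categorical features to be encoded as numbers)
- **Can handle missing values**: Some implementations have built-in mechanisms; otherwise impute first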
- **Feature importance**: Automatically ranks features

**Disadvantages:**
- **Prone to overfitting**: Can create overly complex trees
- **Unstable**: Small data changes can cause large tree changes
- **Biased**: Toward features with many levels
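
Interpretability and feature importance can both be inspected in a few lines. The sketch below uses the built-in Iris dataset purely as a small stand-in, with a shallow tree so the printed rules stay readable.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Iris is used here only as a small built-in example dataset.
iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=42)
tree.fit(iris.data, iris.target)

# Human-readable if/else rules learned by the tree.
print(export_text(tree, feature_names=list(iris.feature_names)))

# Impurity-based importances; they sum to 1 across features.
for name, importance in zip(iris.feature_names, tree.feature_importances_):
    print(f"{name}: {importance:.3f}")
```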

---

### Practical Questions

**Q9: Based on our analysis, which model would you deploy for Titanic survival prediction and why?**

**A9:**
It depends on the requirements:

- **If interpretability is crucial**: Decision Tree
  - Can explain decisions to stakeholders
  - Regulatory compliance may require explainability
  - Reasonable accuracy (~78-80%)

- **If performance is paramount**: Neural Network or Optimized KNN
  - Highest accuracy (~80-82%)
  - Can capture complex patterns
  - Acceptable if black-box is okay

- **If speed and simplicity matter**: Logistic Regression
  - Fast training and prediction
  - Good baseline performance (~78%)
  - Easy to maintain and update

**Recommendation**: Start with Logistic Regression as baseline, use Decision Tree for explainability, and consider ensemble methods (Random Forest) for best performance.

---

**Q10: What insights did we gain from clustering analysis?**

**A10:**
- Passengers naturally segment into groups based on:
  - **Age groups**: Children, adults, elderly
  - **Family status**: Traveling alone vs with family
  - **Economic status**: Low, medium, high fare
- Different clusters have different survival rates
- Can use these segments for:
  - Targeted safety measures
  - Marketing strategies (in a modern application)
  - Understanding passenger demographics
- Clustering helps discover patterns we might not have specified in advance

---

**Q11: How would you improve the fare prediction model?**

**A11:**
- **Feature engineering**:
  - Add cabin deck information (if available)
  - Interaction terms (e.g., class × age)
  - Binning continuous features into categories (e.g., age groups)

- **Handle outliers**:
  - Very high fares disproportionately influence the fitted coefficients
  - Consider log transformation
  - Robust regression methods

- **Try different models**:
  - Decision Tree Regressor (non-linear)
  - Random Forest Regressor
  - Gradient Boosting

- **More data**:
  - Additional passenger information
  - Cabin details
  - Service level indicators
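
The log transformation idea can be sketched with scikit-learn's `TransformedTargetRegressor`, which fits on the transformed target and maps predictions back automatically. The data below is synthetic and right-skewed to mimic a fare-like target; it is not the notebook's data, and the coefficients are arbitrary.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

# Synthetic right-skewed target standing in for fare.
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 3))
y = np.exp(1.0 + X @ np.array([0.8, 0.3, 0.0]) + rng.normal(scale=0.3, size=300))

# Fit on log1p(y); predictions are mapped back with expm1.
model = TransformedTargetRegressor(
    regressor=LinearRegression(), func=np.log1p, inverse_func=np.expm1
)
model.fit(X, y)

preds = model.predict(X)
print(f"R^2 on the original scale: {model.score(X, y):.3f}")
print(f"First three predictions: {preds[:3].round(2)}")
```

`log1p` is used rather than `log` so that a fare of zero remains a valid input.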

---

**Q12: What would you do if you had more time and resources?**

**A12:**
- **Hyperparameter tuning**: GridSearchCV or RandomizedSearchCV
- **Ensemble methods**: Voting classifier, stacking, boosting
- **Feature selection**: Remove irrelevant features systematically
- **Cross-validation**: K-fold CV for robust evaluation
- **Handle class imbalance**: SMOTE, class weights
- **Deep learning**: More complex neural architectures
- **External data**: Historical shipping data, weather conditions
- **A/B testing**: Test model in production setting
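
As a starting point for the hyperparameter tuning item, the sketch below runs `GridSearchCV` over a scaled KNN pipeline. The data is synthetic and the parameter grid is an illustrative assumption, not a recommended search space.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=6, random_state=42)

# Scaling lives inside the pipeline, so it is refit on each CV training fold.
pipe = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsClassifier())])

# The "knn__" prefix routes each parameter to the matching pipeline step.
param_grid = {
    "knn__n_neighbors": [3, 5, 7, 11],
    "knn__weights": ["uniform", "distance"],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated accuracy: {search.best_score_:.3f}")
```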

## 12. Summary and Next Steps

### What We Learned

1. **Machine Learning Fundamentals**
   - Supervised vs Unsupervised learning
   - The ML workflow: data prep → split → train → evaluate → compare
   - Importance of train-test split

2. **Classification Algorithms**
   - Logistic Regression: Linear, interpretable, fast
   - K-Nearest Neighbors: Instance-based, non-linear
   - Decision Trees: Interpretable, handles non-linearity
   - Neural Networks: Complex patterns, high performance

3. **Regression Analysis**
   - Linear Regression for continuous predictions
   - Evaluation metrics: MSE, MAE, R²
   - Feature importance analysis

4. **Unsupervised Learning**
   - K-Means clustering for pattern discovery
   - Choosing optimal K
   - Cluster interpretation

5. **Best Practices**
   - Feature engineering and preprocessing
   - Model evaluation with multiple metrics
   - Avoiding common pitfalls
   - Model selection guidelines

### Key Takeaways

1. **No single model is always best** - it depends on the problem, data, and requirements
2. **Start simple, then increase complexity** - baseline models help establish expectations
3. **Feature engineering often matters more than algorithm choice**
4. **Always use train-test split** - never evaluate on training data
5. **Consider multiple metrics** - accuracy alone is often insufficient
6. **Interpretability vs Performance tradeoff** - decide based on use case

### Next Steps

1. **Advanced Topics**
   - Ensemble methods (Random Forest, Gradient Boosting)
   - Cross-validation and hyperparameter tuning
   - Handling imbalanced datasets
   - Feature selection methods

2. **Model Deployment**
   - Saving and loading models
   - Creating prediction APIs
   - Monitoring model performance
   - Model versioning

3. **Practice Projects**
   - Try different datasets
   - Kaggle competitions
   - Real-world applications
   - End-to-end ML pipelines
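
For the "saving and loading models" step, one common approach is `joblib`, which is installed alongside scikit-learn. The sketch below is a minimal round trip on synthetic data; the file name `model.joblib` is a hypothetical choice.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=5, random_state=42)
model = LogisticRegression(max_iter=1000).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "model.joblib"  # hypothetical file name
    joblib.dump(model, model_path)       # serialize the fitted model
    restored = joblib.load(model_path)   # load it back into memory

# The restored model should reproduce the original predictions.
same_predictions = bool((restored.predict(X) == model.predict(X)).all())
print("Predictions match after reload:", same_predictions)
```

Because `joblib` files are pickle-based, only load files from sources you trust, and reload with the same scikit-learn version used for saving where possible.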

### Resources for Further Learning

- **Scikit-learn Documentation**: https://scikit-learn.org/
- **Machine Learning (Coursera)**: Andrew Ng's course
- **Kaggle Learn**: Free ML courses and competitions
- **Books**:
  - "Hands-On Machine Learning" by Aurélien Géron
  - "An Introduction to Statistical Learning" by James et al.
  - "Pattern Recognition and Machine Learning" by Bishop

---

### Congratulations!

You've completed a comprehensive introduction to machine learning. You now have:
- Understanding of ML fundamentals
- Experience with multiple algorithms
- Knowledge of best practices
- Tools to continue learning

Keep practicing and experimenting with different datasets and algorithms!