# Comprehensive Supervised Learning on Titanic (2 Hours)

Welcome to this 2-hour workshop that integrates our previous notebooks into one cohesive flow, using the Titanic dataset as a real-world example. We'll cover:

1. **Preprocessing**:
   - **Scaling**: StandardScaler, MinMaxScaler, RobustScaler, and a mention of normalization
   - **Encoding**: Ordinal, Target, OneHot, LabelEncoder
   - Summaries & comparison tables for each
2. **Embeddings**:
   - Text features with Bag-of-Words/TF-IDF
   - Dimensionality reduction (PCA vs. t-SNE) for visualization
   - Mention of advanced word embeddings (Word2Vec)
3. **Optimization Techniques**:
   - Compare logistic regression solvers (lbfgs, sag) & SGDClassifier
   - Hyperparameter tuning of a RandomForest: GridSearchCV vs. RandomizedSearchCV
4. **Evaluation Metrics**:
   - Accuracy, Precision, Recall, confusion matrix
5. **Plotting & Comparison**:
   - Confusion matrix heatmap
   - Bar chart of model performance

We'll highlight why we might choose each scaler or encoder, incorporate text columns, demonstrate embedding visuals, and compare optimization strategies. Let's begin!

## 1. Loading and Cleaning the Titanic Dataset

We load the Titanic dataset from seaborn. It contains numeric and categorical columns, plus embark_town, which we will treat as a short text column.

In [None]:
import seaborn as sns
import pandas as pd
import numpy as np

titanic = sns.load_dataset('titanic').copy()
print('Initial shape:', titanic.shape)
titanic.head()

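We want to predict 'survived' (0 or 1). We drop deck because most of its values are missing, and alive, alone, who, and adult_male because they repeat information held in other columns (alive is the target itself as yes/no, so keeping it would leak the label). For simplicity, we also drop rows missing age or embarked.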

In [None]:
titanic.drop(columns=['deck','alive','alone','who','adult_male'], inplace=True)
titanic.dropna(subset=['age','embarked'], inplace=True)
print('After cleanup, shape:', titanic.shape)
titanic.head()
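Before dropping anything, it can help to measure how much is actually missing. The sketch below uses a small made-up frame rather than the real Titanic data, and the 50% cutoff is an assumed threshold chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (not the real Titanic data) with gaps in two columns
toy = pd.DataFrame({
    'age': [22.0, np.nan, 26.0, 35.0],
    'deck': [np.nan, 'C', np.nan, np.nan],
    'fare': [7.25, 71.28, 7.92, 53.10],
})

# Fraction of missing values per column, largest first
missing_frac = toy.isna().mean().sort_values(ascending=False)
print(missing_frac)

# Columns above an assumed 50% threshold are candidates for dropping
to_drop = missing_frac[missing_frac > 0.5].index.tolist()
print('drop candidates:', to_drop)
```

The same two lines applied to the titanic frame would show which columns are sparse enough to drop and which are better handled by dropping or imputing rows.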

## 2. Preprocessing

### 2.1 Scaling: StandardScaler, MinMaxScaler, RobustScaler, and a Note on Normalization

1. **StandardScaler**: transforms each feature to have mean=0, std=1.
2. **MinMaxScaler**: rescales features to [0,1].
3. **RobustScaler**: uses median and IQR, more robust to outliers.
4. **Normalization** (L1 or L2 norm): rescales each sample (row) to unit norm instead of rescaling each feature; often used for text vectors.

| Scaler         | Pros                                 | Cons                             |
|----------------|--------------------------------------|----------------------------------|
| StandardScaler | Mean=0, std=1, widely used           | Outliers shift mean/std          |
| MinMaxScaler   | Maps to [0,1], intuitive range        | Outliers skew min, max           |
| RobustScaler   | Less outlier impact (median, IQR)     | Not in [0,1], no strict bound     |
| Normalization  | Good for text features (unit norm)    | Less common for numeric features |

We'll apply each scaler to the numeric columns age, sibsp, parch, and fare and compare the resulting statistics.

In [None]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

numeric_cols = ['age','sibsp','parch','fare']
titanic_num = titanic[numeric_cols].copy()

std_scaler = StandardScaler()
mm_scaler = MinMaxScaler()
rb_scaler = RobustScaler()

X_std = std_scaler.fit_transform(titanic_num)
X_mm  = mm_scaler.fit_transform(titanic_num)
X_rb  = rb_scaler.fit_transform(titanic_num)

df_std = pd.DataFrame(X_std, columns=numeric_cols)
df_mm  = pd.DataFrame(X_mm,  columns=numeric_cols)
df_rb  = pd.DataFrame(X_rb,  columns=numeric_cols)

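# describe() reports the sample std (ddof=1), so the StandardScaler columns show std slightly above 1
print('StandardScaler stats:\n', df_std.describe().loc[['mean','std','min','max']])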
print('\nMinMaxScaler stats:\n', df_mm.describe().loc[['min','max']])
print('\nRobustScaler stats:\n', df_rb.describe().loc[['mean','std','min','max']])
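To make the three formulas concrete, here is a minimal NumPy sketch that computes each scaling by hand on a made-up, fare-like vector with one extreme value. The numbers are hypothetical, and the hand-written formulas are meant to mirror the scalers' default behaviour rather than replace them.

```python
import numpy as np

# Hypothetical fare-like values with one extreme outlier (not real Titanic rows)
x = np.array([7.0, 8.0, 10.0, 13.0, 26.0, 512.0])

# StandardScaler equivalent: subtract the mean, divide by the population std (ddof=0)
x_std = (x - x.mean()) / x.std()

# MinMaxScaler equivalent: map the observed min to 0 and the max to 1
x_mm = (x - x.min()) / (x.max() - x.min())

# RobustScaler equivalent: subtract the median, divide by the interquartile range
q1, q3 = np.percentile(x, [25, 75])
x_rb = (x - np.median(x)) / (q3 - q1)

print('standard:', np.round(x_std, 2))
print('min-max :', np.round(x_mm, 2))
print('robust  :', np.round(x_rb, 2))
```

In the printed output, the single outlier squeezes the other min-max values close to 0, while the robust version keeps the typical values spread around 0 and leaves the outlier far away.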

### 2.2 Encoders: Ordinal, Target, OneHot, Label

We'll demonstrate them on sex, embarked, class.

| Encoder     | Pros                                         | Cons                                      |
|-------------|----------------------------------------------|-------------------------------------------|
| Ordinal     | Minimal dimension, single integer per cat    | Implies an order that might not exist     |
| Target      | Uses mean of target in each category         | Risk of leakage if not careful            |
| OneHot      | Standard, no ordinal assumption              | Dimensions can blow up for large cat sets |
| Label       | Meant for encoding the target y              | Codes follow sorted label order; on inputs this imposes a false order |


In [None]:
!pip install category_encoders --quiet
from category_encoders import OrdinalEncoder, TargetEncoder
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

cat_cols = ['sex','embarked','class']
df_cat = titanic[cat_cols].copy()
y_surv = titanic['survived'].astype(int).values

ord_enc = OrdinalEncoder(cols=cat_cols)
df_ord = ord_enc.fit_transform(df_cat, y_surv)

tgt_enc = TargetEncoder(cols=cat_cols)
df_tgt = tgt_enc.fit_transform(df_cat, y_surv)

# scikit-learn >= 1.2 uses sparse_output; the older `sparse` argument was removed in 1.4
ohe = OneHotEncoder(sparse_output=False, drop=None)
df_ohe_arr = ohe.fit_transform(df_cat)
df_ohe_cols = ohe.get_feature_names_out(cat_cols)
df_ohe = pd.DataFrame(df_ohe_arr, columns=df_ohe_cols)

lbl_enc = LabelEncoder()
sex_le = lbl_enc.fit_transform(df_cat['sex'])

print('OrdinalEncoder head:\n', df_ord.head())
print('\nTargetEncoder head:\n', df_tgt.head())
print('\nOneHotEncoder head:\n', df_ohe.head())
print('\nLabelEncoder for sex:', sex_le[:10], '... classes:', lbl_enc.classes_)
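The leakage risk noted for target encoding comes from encoding a row with a category mean that includes that row's own label. One common way to reduce it is out-of-fold encoding, sketched below with plain pandas on made-up data. This is a simplified illustration: it uses raw per-category means with no smoothing, and the column name embarked_te is a hypothetical name introduced here.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

# Hypothetical toy data (not Titanic rows): one categorical feature and a binary target
df = pd.DataFrame({
    'embarked': ['S', 'S', 'C', 'C', 'Q', 'S', 'C', 'Q', 'S', 'C'],
    'survived': [0, 1, 1, 1, 0, 0, 1, 1, 0, 0],
})

encoded = pd.Series(np.nan, index=df.index, dtype=float)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
for fit_idx, enc_idx in kf.split(df):
    fit_rows = df.iloc[fit_idx]
    # Category means come only from rows outside the fold being encoded
    fold_means = fit_rows.groupby('embarked')['survived'].mean()
    codes = df.iloc[enc_idx]['embarked'].map(fold_means)
    # Categories unseen in the fitting rows fall back to the fitting rows' overall mean
    encoded.iloc[enc_idx] = codes.fillna(fit_rows['survived'].mean()).to_numpy()

df['embarked_te'] = encoded
print(df)
```

Each row's code is computed without looking at its own label, which is the property that fitting the encoder on the full data (as in the cell above) does not have.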

## 3. Embeddings

We'll treat embark_town as a text column and vectorize it with Bag-of-Words and TF-IDF. Then we'll project the numeric data to 2-D with PCA and t-SNE.

### 3.1 Bag-of-Words / TF-IDF


In [None]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

town_data = titanic['embark_town'].fillna('')
cv = CountVectorizer()
bow_mat = cv.fit_transform(town_data)
print('CountVectorizer shape:', bow_mat.shape)
print('Vocabulary:', cv.get_feature_names_out())

tfidf = TfidfVectorizer()
tfidf_mat = tfidf.fit_transform(town_data)
print('TF-IDF shape:', tfidf_mat.shape)
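Because every embark_town value is a single word, Bag-of-Words on this column behaves like a one-hot encoding of the town, and TF-IDF with its default L2 row normalization produces the same matrix. The sketch below checks this on a small made-up sample of town names.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical sample of single-word town names, mimicking embark_town
towns = ['Southampton', 'Cherbourg', 'Queenstown', 'Southampton']

bow = CountVectorizer().fit_transform(towns).toarray()
tfidf = TfidfVectorizer().fit_transform(towns).toarray()

# Each row has exactly one nonzero entry, i.e. a one-hot code of the town
print(bow)

# After L2 row normalization the single TF-IDF weight per row becomes 1.0
print('identical matrices:', np.allclose(bow, tfidf))
```

The difference between counts and TF-IDF weights only shows up once documents contain several tokens with different frequencies.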

### 3.2 PCA vs. t-SNE for numeric columns


In [None]:
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
%matplotlib inline

# We'll do PCA and t-SNE on standard-scaled numeric data
X_numeric_std = X_std  # from earlier (StandardScaler on age,sibsp,parch,fare)
y_survived = titanic['survived'].astype(int).values

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_numeric_std)

tsne = TSNE(n_components=2, perplexity=30, learning_rate='auto', init='pca', random_state=42)
X_tsne = tsne.fit_transform(X_numeric_std)

def plot_embed(X_emb, y, title):
    plt.figure(figsize=(6,5))
    plt.scatter(X_emb[:,0], X_emb[:,1], c=y, cmap='viridis', alpha=0.7)
    plt.title(title)
    plt.colorbar(label='survived')
    plt.show()

plot_embed(X_pca, y_survived, 'PCA on Titanic numeric')
plot_embed(X_tsne, y_survived, 't-SNE on Titanic numeric')
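A 2-D PCA plot is easier to interpret when you know how much variance the two components retain. The following sketch reads explained_variance_ratio_ from a fitted PCA; it runs on synthetic data generated here as a stand-in for four numeric columns, not on the Titanic values.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for four numeric columns (not the Titanic data)
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 2))
X = np.column_stack([
    base[:, 0],
    base[:, 0] + 0.1 * rng.normal(size=200),  # nearly duplicates the first column
    base[:, 1],
    rng.normal(size=200),
])

X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=2).fit(X_scaled)

# Fraction of total variance captured by each of the two components
print(pca.explained_variance_ratio_)
print('kept:', pca.explained_variance_ratio_.sum())
```

t-SNE has no comparable quantity, which is one reason its plots are usually read for local grouping only.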

### 3.3 Advanced Word Embeddings
For large textual data, Word2Vec or GloVe can produce dense vectors capturing semantic meaning. Here, embark_town holds only three distinct single-word values, so dense embeddings would add nothing. In practice, you'd do something like:
```python
# from gensim.models import Word2Vec
# w2v_model = Word2Vec(list_of_tokenized_sentences, vector_size=100, ...)
```
Then incorporate word vectors into your ML pipeline. We'll skip the full demonstration here.
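One simple way to turn word vectors into a fixed-length feature row is to average the vectors of the tokens in each document. The sketch below illustrates that idea only: the tiny word_vectors dictionary and the embed helper are hypothetical stand-ins for what a trained model would provide.

```python
import numpy as np

# Hypothetical 3-dimensional word vectors; a trained model would supply real ones
word_vectors = {
    'port': np.array([0.9, 0.1, 0.0]),
    'ship': np.array([0.8, 0.2, 0.1]),
    'ticket': np.array([0.1, 0.9, 0.3]),
}

def embed(tokens, vectors, dim=3):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    return np.mean(known, axis=0) if known else np.zeros(dim)

docs = [['ship', 'port'], ['ticket'], ['unknown']]

# One row per document, one column per embedding dimension
X_text = np.vstack([embed(d, word_vectors) for d in docs])
print(X_text)
```

The resulting array can be stacked next to scaled numeric and encoded categorical features like any other numeric block.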

## 4. Optimization Techniques

We'll build a pipeline for classification. We'll do a train/test split on the entire dataset, then compare:
- LogisticRegression with lbfgs vs. sag
- SGDClassifier
- RandomForest with GridSearch & RandomizedSearch


In [None]:
from sklearn.model_selection import train_test_split

train_df = titanic.copy()
y = train_df['survived'].astype(int)
train_df.drop(columns=['survived'], inplace=True)

X_train_df, X_test_df, y_train, y_test = train_test_split(train_df, y, test_size=0.3, random_state=42, stratify=y)
print('Train shape:', X_train_df.shape, 'Test shape:', X_test_df.shape)

numeric_cols = ['age','sibsp','parch','fare']
cat_cols     = ['sex','embarked','class']
text_col     = 'embark_town'

### 4.1 ColumnTransformer Pipeline


In [None]:
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin

class TextSelector(BaseEstimator, TransformerMixin):
    def __init__(self, key):
        self.key = key
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.key].fillna('').values

numeric_transformer = Pipeline([
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline([
    # scikit-learn >= 1.2 uses sparse_output; the older `sparse` argument was removed in 1.4
    ('ohe', OneHotEncoder(drop=None, sparse_output=False, handle_unknown='ignore'))
])

text_transformer = Pipeline([
    ('selector', TextSelector(text_col)),
    ('bow', CountVectorizer())
])

preprocessor = ColumnTransformer([
    ('num', numeric_transformer, numeric_cols),
    ('cat', categorical_transformer, cat_cols),
    ('text', text_transformer, [text_col])
])
preprocessor

### 4.2 Logistic Regression solvers vs. SGD


In [None]:
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.metrics import accuracy_score

lr_lbfgs = Pipeline([
    ('preprocess', preprocessor),
    ('clf', LogisticRegression(solver='lbfgs', max_iter=1000, random_state=42))
])

lr_sag = Pipeline([
    ('preprocess', preprocessor),
    ('clf', LogisticRegression(solver='sag', max_iter=1000, random_state=42))
])

sgd_clf = Pipeline([
    ('preprocess', preprocessor),
    # loss='log' was renamed to 'log_loss' in scikit-learn 1.1 and removed in 1.3
    ('clf', SGDClassifier(loss='log_loss', random_state=42, max_iter=1000, tol=1e-3))
])

pipelines = {
    'LogReg_lbfgs': lr_lbfgs,
    'LogReg_sag': lr_sag,
    'SGDClassifier': sgd_clf
}

results = {}
for name, pipe in pipelines.items():
    pipe.fit(X_train_df, y_train)
    acc = pipe.score(X_test_df, y_test)
    results[name] = acc

results
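A single train/test split gives one accuracy number per model, and small differences between solvers can be noise. A hedged alternative is to compare them with cross-validation, as in the sketch below. It uses synthetic data from make_classification as a stand-in for the Titanic features, and cv_scores is a name introduced here.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data as a stand-in for the Titanic features
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

cv_scores = {}
for solver in ['lbfgs', 'sag']:
    model = make_pipeline(
        StandardScaler(),
        LogisticRegression(solver=solver, max_iter=1000, random_state=0),
    )
    # Five stratified folds give a mean and a spread instead of one number
    scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
    cv_scores[solver] = (scores.mean(), scores.std())

for solver, (mean, std) in cv_scores.items():
    print(f'{solver}: {mean:.3f} +/- {std:.3f}')
```

If the spread across folds is larger than the gap between two solvers, the split-based ranking should not be read as a real difference.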

### 4.3 RandomForest: GridSearchCV & RandomizedSearchCV

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from scipy.stats import randint

rf_pipeline = Pipeline([
    ('preprocess', preprocessor),
    ('clf', RandomForestClassifier(random_state=42))
])

param_grid = {
    'clf__n_estimators': [50, 100],
    'clf__max_depth': [None, 3, 5]
}

grid_search = GridSearchCV(rf_pipeline, param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid_search.fit(X_train_df, y_train)
print('GridSearch best params:', grid_search.best_params_)
print('GridSearch best CV score:', grid_search.best_score_)
rf_grid_best = grid_search.best_estimator_
acc_rf_grid = rf_grid_best.score(X_test_df, y_test)
print('Test Accuracy (RF GridSearch):', acc_rf_grid)

param_dist = {
    'clf__n_estimators': randint(10,200),
    'clf__max_depth': [None,3,5,7]
}

rand_search = RandomizedSearchCV(rf_pipeline, param_dist, n_iter=5, cv=5, scoring='accuracy', n_jobs=-1, random_state=42)
rand_search.fit(X_train_df, y_train)
print('\nRandomSearch best params:', rand_search.best_params_)
print('RandomSearch best CV score:', rand_search.best_score_)
rf_rand_best = rand_search.best_estimator_
acc_rf_rand = rf_rand_best.score(X_test_df, y_test)
print('Test Accuracy (RF RandomSearch):', acc_rf_rand)
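The cost difference between the two searches comes down to how many candidates each one fits. The sketch below counts them for the grid and the n_iter value used above; the variable names are introduced here for illustration.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    'clf__n_estimators': [50, 100],
    'clf__max_depth': [None, 3, 5],
}
cv_folds = 5
n_iter = 5  # number of sampled candidates in the randomized search

# Grid search tries every combination: 2 values * 3 values
grid_candidates = len(ParameterGrid(param_grid))
grid_fits = grid_candidates * cv_folds

# Randomized search fits exactly n_iter candidates, however large the space is
random_fits = n_iter * cv_folds

print('grid search  :', grid_candidates, 'candidates,', grid_fits, 'fits')
print('random search:', n_iter, 'candidates,', random_fits, 'fits')
```

Each search also performs one extra fit on the full training set when refit is left at its default. Adding a third hyperparameter multiplies the grid's cost but leaves the randomized budget unchanged.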

## 5. Evaluation & Plotting
We'll compare these final models in terms of Accuracy, Precision, and Recall, plus a confusion matrix.

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

def evaluate_model(name, model, X, y):
    y_pred = model.predict(X)
    acc = accuracy_score(y, y_pred)
    prec = precision_score(y, y_pred)
    rec = recall_score(y, y_pred)
    cm = confusion_matrix(y, y_pred)
    print(f'---- {name} ----')
    print(f'Accuracy: {acc:.3f}')
    print(f'Precision: {prec:.3f}')
    print(f'Recall: {rec:.3f}')
    print('Confusion Matrix:')
    print(cm)
    return (acc, prec, rec, cm)

model_scores = {}

for name, pipe in pipelines.items():
    acc, prec, rec, cm = evaluate_model(name, pipe, X_test_df, y_test)
    model_scores[name] = acc

acc_rf_g, prec_rf_g, rec_rf_g, cm_rf_g = evaluate_model('RF_Grid', rf_grid_best, X_test_df, y_test)
model_scores['RF_Grid'] = acc_rf_g

acc_rf_r, prec_rf_r, rec_rf_r, cm_rf_r = evaluate_model('RF_Rand', rf_rand_best, X_test_df, y_test)
model_scores['RF_Rand'] = acc_rf_r

### 5.1 Confusion Matrix Heatmap

In [None]:
best_model_name = max(model_scores, key=model_scores.get)
print('Best model by accuracy:', best_model_name, '->', model_scores[best_model_name])

# Collect confusion matrices from the fitted models without re-printing every metric
all_cm = {
    name: confusion_matrix(y_test, pipe.predict(X_test_df))
    for name, pipe in pipelines.items()
}
all_cm['RF_Grid'] = cm_rf_g
all_cm['RF_Rand'] = cm_rf_r

best_cm = all_cm[best_model_name]

plt.figure(figsize=(5,4))
sns.heatmap(best_cm, annot=True, fmt='d', cmap='Blues')
plt.title(f'{best_model_name} - Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.show()

### 5.2 Bar Chart of Accuracies

In [None]:
names = list(model_scores.keys())
accs = [model_scores[n] for n in names]

plt.figure(figsize=(6,4))
bars = plt.bar(names, accs, color=['orange','green','purple','blue','red'])
plt.ylim([0,1])
plt.title('Model Accuracy Comparison')
for bar in bars:
    plt.text(bar.get_x() + bar.get_width()/2, bar.get_height()+0.01,
             f'{bar.get_height():.3f}', ha='center', va='bottom')
plt.show()

## 6. Comparison Tables & Final Takeaways

### 6.1 Scaling Methods

| Scaler         | Pros                                  | Cons                             |
|----------------|---------------------------------------|----------------------------------|
| StandardScaler | Mean=0, std=1, widely used            | Outliers shift mean/std          |
| MinMaxScaler   | Maps to [0,1], intuitive range         | Outliers skew min, max           |
| RobustScaler   | Less outlier impact (median, IQR)      | Not in [0,1], no strict bound     |
| Normalization  | Good for text features (unit norm)     | Less common for numeric features |

### 6.2 Encoding Methods

| Encoder     | Pros                                   | Cons                                                          |
|-------------|----------------------------------------|---------------------------------------------------------------|
| Ordinal     | Minimal dimension, single integer cat   | Implies ordering that might not exist                         |
| Target      | Uses mean of target in each category    | Potential leakage if not cross-validation aware               |
| OneHot      | Standard, no ordinal assumption         | Dimension blow-up for high-cardinality features              |
| Label       | Meant for encoding the target y         | Codes follow sorted label order; on inputs this imposes a false numeric ordering |

### 6.3 Embeddings

| Method       | Description                                   | Use Case                                                 |
|--------------|-----------------------------------------------|----------------------------------------------------------|
| Bag-of-Words / TF-IDF | Basic textual representation, token counts/frequencies  | Smaller text fields, easy to interpret                   |
| Word2Vec / GloVe  | Dense vectors capturing semantic meaning  | Larger text corpora, advanced NLP tasks                 |
| PCA         | Linear dimension reduction for numeric data     | Quick embedding, captures variance linearly             |
| t-SNE       | Nonlinear, preserves local neighborhoods        | Clearer cluster visuals, but can be slow or tricky to tune |

### 6.4 Optimization / Tuning

| Approach             | Pros                                    | Cons                                        |
|----------------------|-----------------------------------------|---------------------------------------------|
| SGDClassifier        | Can handle large data in streaming mode | Sensitive to learning-rate schedule and feature scaling |
| lbfgs / sag (LogReg) | Often stable for moderate data sizes; sag scales to many samples | lbfgs needs full passes over the data; sag converges well only on scaled features |
| GridSearchCV         | Exhaustive param search                 | Slow if param space is large               |
| RandomizedSearchCV   | Samples param space for speed           | Might skip best combos                     |

## Final Observations
- The best model often depends on the dataset's nature. RandomForest might do well if tuned.
- For text, advanced embeddings (Word2Vec) can help. For small text fields, one-hot or bag-of-words might suffice.
- Always consider outliers when choosing a scaler.
- Check cardinality when choosing an encoder.
- Evaluate with cross-validation, especially if using target encoding.

### End of 2-Hour Integrated Notebook
We combined the basics from the previous notebooks on the Titanic dataset: scaling (Standard, MinMax, Robust), encoding (Ordinal, Target, OneHot, Label), text features (Bag-of-Words and TF-IDF, with a mention of Word2Vec), 2-D projections of numeric data (PCA vs. t-SNE), optimization (logistic regression solvers, SGD, random forest tuning), and evaluation (accuracy, precision, recall) with plots.

Thank you for following this integrated session!

In [None]:
print('End of the integrated 2-hour Titanic supervised learning notebook. Thank you!')