# Titanic Survival Prediction: Beginner Template

This notebook is a beginner-friendly template covering EDA, feature engineering, model building, evaluation, and conclusions. Follow the comments and TODOs to learn and adapt.

## Objectives
- Explore the Titanic dataset
- Engineer useful features
- Train baseline and improved ML models
- Evaluate with clear metrics and plots
- Provide a structure you can fork and extend

In [None]:
# Setup: imports and settings
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, ConfusionMatrixDisplay, roc_auc_score, RocCurveDisplay
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
sns.set(style="whitegrid", context="notebook")
pd.set_option('display.max_columns', None)

## Data Loading
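You can either:
1) Download Kaggle Titanic train.csv/test.csv and place them locally, or
2) Load seaborn's bundled example with sns.load_dataset('titanic') (lowercase column names and fewer columns, so not identical), or
3) Mount Google Drive in Colab and read the files from there.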

Below we include a simple Kaggle-style loader (expects train.csv in the working directory).

In [None]:
# TODO: Provide your path to Kaggle Titanic 'train.csv'
# e.g., path = '/content/train.csv' in Colab if uploaded
path = 'train.csv'  # change if needed
try:
    df = pd.read_csv(path)
    print(df.shape)
    display(df.head())
except FileNotFoundError:
    print("train.csv not found. Upload it or set the correct path.")
    df = pd.DataFrame()
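
If you go with option 2, the seaborn copy needs its columns renamed before the rest of this notebook will run. The sketch below is one way to do that. The mapping and the helper name `to_kaggle_columns` are assumptions made for illustration, not part of the original template. The seaborn copy has no Name, Ticket, Cabin, or PassengerId columns, so Title cannot be derived from it and HasCabin would be 0 for every row.

```python
import pandas as pd

# Assumed mapping from seaborn's lowercase names to the Kaggle names used here.
# seaborn-only columns (e.g. 'alone', 'who', 'deck') are dropped.
SEABORN_TO_KAGGLE = {
    'survived': 'Survived', 'pclass': 'Pclass', 'sex': 'Sex', 'age': 'Age',
    'sibsp': 'SibSp', 'parch': 'Parch', 'fare': 'Fare', 'embarked': 'Embarked',
}

def to_kaggle_columns(sb_df):
    """Keep the columns shared with Kaggle's train.csv and rename them."""
    shared = [c for c in SEABORN_TO_KAGGLE if c in sb_df.columns]
    return sb_df[shared].rename(columns=SEABORN_TO_KAGGLE)

# Tiny stand-in frame so the helper can be checked without a download.
sample = pd.DataFrame({
    'survived': [0, 1], 'pclass': [3, 1], 'sex': ['male', 'female'],
    'age': [22.0, 38.0], 'alone': [False, False],
})
print(to_kaggle_columns(sample).columns.tolist())

# With internet access (sns.load_dataset downloads the file on first use):
# import seaborn as sns
# df = to_kaggle_columns(sns.load_dataset('titanic'))
```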

## Exploratory Data Analysis (EDA)
Understand basic structure, missingness, and relationships with Survived.

In [None]:
if not df.empty:
    display(df.describe(include='all'))
    print('Missing values per column:')
    print(df.isna().sum().sort_values(ascending=False))
    # Target distribution
    if 'Survived' in df:
        sns.countplot(x='Survived', data=df)
        plt.title('Target Distribution: Survived')
        plt.show()
    # Sex vs Survived
    if set(['Sex','Survived']).issubset(df.columns):
        sns.barplot(x='Sex', y='Survived', data=df, estimator=np.mean)
        plt.title('Average Survival Rate by Sex')
        plt.show()
    # Pclass vs Survived
    if set(['Pclass','Survived']).issubset(df.columns):
        sns.barplot(x='Pclass', y='Survived', data=df, estimator=np.mean)
        plt.title('Average Survival Rate by Passenger Class')
        plt.show()
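
The bar plots above show one variable at a time. A small table can show two at once. The sketch below uses a made-up six-row frame (not real Titanic rows) to show how `pivot_table` gives the survival rate per Sex and Pclass, and how `isna().mean()` turns missing counts into fractions.

```python
import pandas as pd

# Illustrative toy data only; swap in df once train.csv is loaded.
toy = pd.DataFrame({
    'Sex': ['male', 'female', 'female', 'male', 'female', 'male'],
    'Pclass': [3, 1, 3, 1, 1, 3],
    'Survived': [0, 1, 1, 0, 1, 0],
    'Age': [22.0, 38.0, None, 54.0, 35.0, None],
})

# The mean of a 0/1 target is the survival rate within each group.
rates = toy.pivot_table(index='Sex', columns='Pclass', values='Survived', aggfunc='mean')
print(rates)

# Fraction of missing values per column, highest first.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```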

## Feature Engineering
Create helpful features such as Title, FamilySize, and HasCabin. Missing values are imputed later, inside the preprocessing pipeline.

In [None]:
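def add_engineered_features(df):
    """Return a copy of df with Title, FamilySize, and HasCabin columns added."""
    df = df.copy()
    # Title extraction from Name: the text between the comma and the first period
    if 'Name' in df.columns:
        df['Title'] = df['Name'].str.extract(r',\s*([^.]*)\.', expand=False).str.strip()
    # Family size: siblings/spouses + parents/children + the passenger
    for col in ['SibSp', 'Parch']:
        if col not in df.columns:
            df[col] = 0
    df['FamilySize'] = df['SibSp'] + df['Parch'] + 1
    # Cabin present indicator (0 for every row if the Cabin column is absent)
    if 'Cabin' in df.columns:
        df['HasCabin'] = df['Cabin'].notna().astype(int)
    else:
        df['HasCabin'] = 0
    return df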

if not df.empty:
    df = add_engineered_features(df)
    display(df.head())
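
Title extraction usually yields a few common titles and a long tail of titles that occur only once or twice. One common follow-up, sketched below, is to collapse the infrequent ones into a single bucket. The helper `group_rare_titles` and its `min_count` threshold are hypothetical and are not part of the function above.

```python
import pandas as pd

names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Allen, Mr. William Henry',
    'Heikkinen, Miss. Laina',
    'Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)',
])

# Same pattern as add_engineered_features: text between the comma and the first period.
titles = names.str.extract(r',\s*([^.]*)\.', expand=False).str.strip()

def group_rare_titles(titles, min_count=2):
    """Replace titles seen fewer than min_count times with 'Rare' (assumed threshold)."""
    counts = titles.value_counts()
    rare = counts[counts < min_count].index
    return titles.where(~titles.isin(rare), 'Rare')

grouped = group_rare_titles(titles)
print(grouped.tolist())
```

On the full dataset a larger threshold is probably more sensible; the value 2 is only chosen so that the four-row example shows both outcomes.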

## Train/Test Split and Preprocessing Pipelines
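We hold out 20% of the rows as a stratified test set, then build a preprocessing pipeline: numeric columns get median imputation and standardization, and categorical columns get most-frequent imputation and one-hot encoding.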

In [None]:
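target = 'Survived'
# Exclude identifier-like, high-cardinality columns: one-hot encoding Name, Ticket,
# or Cabin creates a near-unique column per passenger and invites overfitting.
drop_cols = [target, 'PassengerId', 'Name', 'Ticket', 'Cabin']
feature_cols = [c for c in df.columns if c not in drop_cols] if not df.empty else []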
if not df.empty and target in df:
    X = df[feature_cols]
    y = df[target]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
    # Identify numeric & categorical columns (any integer/float width counts as numeric)
    numeric_features = X.select_dtypes(include='number').columns.tolist()
    categorical_features = X.select_dtypes(exclude='number').columns.tolist()
    # Defaults: median imputation + scaling for numeric, mode imputation + one-hot for categorical
    numeric_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler())
    ])
    categorical_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('onehot', OneHotEncoder(handle_unknown='ignore'))
    ])
    preprocessor = ColumnTransformer(
        transformers=[
            ('num', numeric_transformer, numeric_features),
            ('cat', categorical_transformer, categorical_features)
        ]
    )
else:
    X_train = X_test = y_train = y_test = None
    preprocessor = None
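
To see what the preprocessor actually produces, it can help to run the same kind of `ColumnTransformer` on a few hand-made rows. The sketch below uses invented values and a hypothetical `toy_prep` object. It sets `sparse_threshold=0` (unlike the cell above) only so the result is a dense array that is easy to inspect.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

toy_train = pd.DataFrame({
    'Age': [22.0, np.nan, 35.0, 58.0],
    'Fare': [7.25, 71.28, 8.05, 26.55],
    'Sex': ['male', 'female', 'female', 'male'],
    'Embarked': ['S', 'C', np.nan, 'S'],
})
# A new row whose Embarked value ('Q') never appeared during fitting.
toy_new = pd.DataFrame({'Age': [30.0], 'Fare': [10.0], 'Sex': ['male'], 'Embarked': ['Q']})

toy_prep = ColumnTransformer(
    transformers=[
        ('num', Pipeline([('imputer', SimpleImputer(strategy='median')),
                          ('scaler', StandardScaler())]), ['Age', 'Fare']),
        ('cat', Pipeline([('imputer', SimpleImputer(strategy='most_frequent')),
                          ('onehot', OneHotEncoder(handle_unknown='ignore'))]), ['Sex', 'Embarked']),
    ],
    sparse_threshold=0,  # always return a dense array (for inspection only)
)

train_matrix = toy_prep.fit_transform(toy_train)
new_matrix = toy_prep.transform(toy_new)

# 2 scaled numeric columns + 2 Sex columns + 2 Embarked columns (C, S)
print(train_matrix.shape)
# The unseen 'Q' is encoded as all zeros in the Embarked columns instead of raising.
print(new_matrix[0, 4:])
```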

## Baseline Model: Logistic Regression
Start simple, then iterate.

In [None]:
def evaluate(model, X_test, y_test, name="Model"):
    y_pred = model.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    prec = precision_score(y_test, y_pred)
    rec = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    print(f'{name} -> Acc: {acc:.3f} | Prec: {prec:.3f} | Rec: {rec:.3f} | F1: {f1:.3f}')
    cm = confusion_matrix(y_test, y_pred)
    ConfusionMatrixDisplay(cm).plot(values_format='d')
    plt.title(f'{name} Confusion Matrix')
    plt.show()
    # AUC for probabilistic models
    if hasattr(model, 'predict_proba'):
        y_prob = model.predict_proba(X_test)[:,1]
        try:
            auc = roc_auc_score(y_test, y_prob)
            RocCurveDisplay.from_predictions(y_test, y_prob)
            plt.title(f'{name} ROC Curve (AUC={auc:.3f})')
            plt.show()
        except Exception as e:
            print('AUC not available:', e)

if preprocessor is not None:
    log_reg = Pipeline(steps=[('prep', preprocessor), ('clf', LogisticRegression(max_iter=1000))])
    log_reg.fit(X_train, y_train)
    evaluate(log_reg, X_test, y_test, name='Logistic Regression')
else:
    print('Data not loaded; skipping model training.')
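
The four numbers printed by `evaluate` all come from the confusion matrix counts. As a worked example on made-up labels (not Titanic results), the arithmetic looks like this:

```python
# Hypothetical labels and predictions for eight passengers (1 = survived).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 0, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # predicted survived, did survive
fp = sum(t == 0 and p == 1 for t, p in pairs)  # predicted survived, did not
fn = sum(t == 1 and p == 0 for t, p in pairs)  # predicted died, did survive
tn = sum(t == 0 and p == 0 for t, p in pairs)  # predicted died, did die

accuracy = (tp + tn) / len(pairs)                    # 5 / 8
precision = tp / (tp + fp)                           # 2 / 3
recall = tp / (tp + fn)                              # 2 / 4
f1 = 2 * precision * recall / (precision + recall)   # 4 / 7

print(f'Acc: {accuracy:.3f} | Prec: {precision:.3f} | Rec: {recall:.3f} | F1: {f1:.3f}')
```

Precision asks how many predicted survivors really survived; recall asks how many real survivors the model found. With an imbalanced target, these can diverge noticeably from accuracy.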

## Model Improvements: Random Forest and SVM
Try more flexible, non-linear models and compare them with the baseline on the same test split.

In [None]:
if preprocessor is not None:
    rf = Pipeline(steps=[('prep', preprocessor), ('clf', RandomForestClassifier(n_estimators=300, random_state=42))])
    rf.fit(X_train, y_train)
    evaluate(rf, X_test, y_test, name='Random Forest')

    svm_clf = Pipeline(steps=[('prep', preprocessor), ('clf', SVC(kernel='rbf', probability=True, C=2.0, gamma='scale', random_state=42))])
    svm_clf.fit(X_train, y_train)
    evaluate(svm_clf, X_test, y_test, name='SVM (RBF)')
else:
    print('Data not loaded; skipping model training.')

## Hyperparameter Tuning (Optional)
Use GridSearchCV to try every combination in a small parameter grid, scoring each with 3-fold cross-validation on the training set.

In [None]:
# Example grid for RandomForest
if preprocessor is not None:
    param_grid = {
        'clf__n_estimators': [200, 400],
        'clf__max_depth': [None, 5, 10],
        'clf__min_samples_split': [2, 5]
    }
    rf_pipeline = Pipeline([('prep', preprocessor), ('clf', RandomForestClassifier(random_state=42))])
    grid = GridSearchCV(rf_pipeline, param_grid, cv=3, n_jobs=-1)
    grid.fit(X_train, y_train)
    print('Best params:', grid.best_params_)
    evaluate(grid.best_estimator_, X_test, y_test, name='Random Forest (Tuned)')
else:
    print('Data not loaded; skipping tuning.')
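
Grid search cost grows multiplicatively with each parameter you add, so it is worth counting the fits before running it. A minimal sketch using scikit-learn's `ParameterGrid` on the same grid as above (the variable names `n_candidates` and `total_fits` are just for illustration):

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    'clf__n_estimators': [200, 400],
    'clf__max_depth': [None, 5, 10],
    'clf__min_samples_split': [2, 5],
}

n_candidates = len(ParameterGrid(param_grid))  # 2 * 3 * 2 combinations
cv_folds = 3
# Each candidate is fit once per fold; GridSearchCV then refits the best
# candidate on the whole training set (refit=True is the default).
total_fits = n_candidates * cv_folds + 1

print(n_candidates, total_fits)
```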

## Conclusions
- Logistic Regression provides a solid, interpretable baseline.
- Tree-based models like Random Forest often perform better on tabular data; confirm this against the metrics printed above.
- Engineered features (Title, FamilySize, HasCabin) often help; compare scores with and without them to verify.
- Try ensembling and additional tuning for further gains.

Next steps: add more features (e.g., ticket groupings), try gradient boosting (XGBoost/LightGBM), and replace the single train/test split with stratified k-fold cross-validation for more reliable estimates.
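
As a starting point for the cross-validation step, the sketch below compares three models with stratified 5-fold cross-validation. It runs on synthetic data from `make_classification` so that it works without train.csv, and it uses scikit-learn's own `GradientBoostingClassifier` as a stand-in for XGBoost/LightGBM. In the notebook you would pass the full pipelines (preprocessor plus classifier) and `X_train, y_train` instead.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data; not the Titanic dataset.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

# Shuffled, stratified folds keep the class ratio similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

candidates = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(n_estimators=50, random_state=42),
}

cv_summary = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring='accuracy')
    cv_summary[name] = (scores.mean(), scores.std())
    print(f'{name}: {scores.mean():.3f} +/- {scores.std():.3f}')
```

Reporting the spread across folds alongside the mean makes it easier to tell a real improvement from noise.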