<a href="https://colab.research.google.com/github/samipn/crisp-dm_semma_and_kdd/blob/main/CRISP_DM_Titanic.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CRISP-DM: Titanic — End-to-End

Check off each phase as you complete it. Run all cells top-to-bottom in Colab.

In [36]:
#@title Setup
!pip -q install imbalanced-learn fastapi uvicorn joblib plotly
import pandas as pd, numpy as np, matplotlib.pyplot as plt, seaborn as sns, joblib, os, json, plotly.express as px
from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score, roc_auc_score, roc_curve, classification_report, ConfusionMatrixDisplay
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline
RANDOM_STATE = 42
os.makedirs('data', exist_ok=True)


## Business Understanding
- **Goal**: Predict passenger survival, either to support prioritization of rescue resources (a hypothetical use case) or simply to demonstrate an end-to-end ML pipeline.
- **Primary KPI**: ROC AUC ≥ 0.86 on validation. Secondary: F1 ≥ 0.78.
- **Constraints**: Model interpretability and reproducibility.

In [37]:
# Acceptance tests (to be validated at the end)
TARGET_KPI = {"auc": 0.86, "f1": 0.78}
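
The thresholds above can be checked metric by metric rather than as a single boolean, which makes it easier to see which target was missed. The helper below is a minimal sketch: `meets_kpi` is a hypothetical name, and the metric values passed to it are made up for illustration only.

```python
# Hypothetical helper (not part of the notebook): compare measured metrics
# against the KPI thresholds and report each check separately.
TARGET_KPI = {"auc": 0.86, "f1": 0.78}

def meets_kpi(metrics, targets=TARGET_KPI):
    """Return per-metric pass/fail flags and an overall verdict."""
    checks = {name: metrics[name] >= threshold for name, threshold in targets.items()}
    return checks, all(checks.values())

# Illustrative values only; real values come from the Evaluation phase.
checks, ok = meets_kpi({"auc": 0.90, "f1": 0.75})
print(checks, ok)
```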


### Upload `kaggle.json` using the file browser

Run the following cell to upload your `kaggle.json` file.

In [38]:
from google.colab import files

uploaded = files.upload()

for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))

Saving kaggle.json to kaggle.json
User uploaded file "kaggle.json" with length 68 bytes


After uploading, proceed to the next steps to configure the Kaggle API and download the dataset.

## Data Understanding
Load the data, then inspect its schema, missing values, and target balance.

In [39]:
import os
import shutil

# Get the user's home directory
home_dir = os.path.expanduser("~")

# Create the .kaggle directory if it doesn't exist
kaggle_dir = os.path.join(home_dir, ".kaggle")
os.makedirs(kaggle_dir, exist_ok=True)

# Define the source and destination paths
source_path = "/content/kaggle.json"  # Where Colab stores the uploaded file
destination_path = os.path.join(kaggle_dir, "kaggle.json")

# Move the kaggle.json file to the .kaggle directory
# Use shutil.move to handle potential overwriting if the file already exists
if os.path.exists(source_path):
    shutil.move(source_path, destination_path)
    print(f"Moved {source_path} to {destination_path}")
else:
    print(f"{source_path} not found. Please ensure it's uploaded to the Colab environment.")

# Set the appropriate permissions for the kaggle.json file
if os.path.exists(destination_path):
    os.chmod(destination_path, 0o600)
    print(f"Set permissions for {destination_path} to 0o600")
else:
    print(f"{destination_path} not found, cannot set permissions.")

Moved /content/kaggle.json to /root/.kaggle/kaggle.json
Set permissions for /root/.kaggle/kaggle.json to 0o600


In [40]:
!kaggle competitions download -c titanic -f train.csv -p data


train.csv: Skipping, found more recently modified local copy (use --force to force download)


In [41]:
train_path = '/content/data/train.csv'  # CSV downloaded by the Kaggle CLI into data/
if not os.path.exists(train_path):
    print(f"Error: File not found at {train_path}")
    print(f"Current working directory: {os.getcwd()}")
else:
    print(f"File found at {train_path}. Proceeding to load.")
    df = pd.read_csv(train_path)
    display(df.head())

File found at /content/data/train.csv. Proceeding to load.


| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [42]:
# Quick profile
display(df.describe(include='all').T)
df.isna().mean().sort_values(ascending=False).head(10)

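| | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| PassengerId | 891.0 | NaN | NaN | NaN | 446.0 | 257.353842 | 1.0 | 223.5 | 446.0 | 668.5 | 891.0 |
| Survived | 891.0 | NaN | NaN | NaN | 0.383838 | 0.486592 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| Pclass | 891.0 | NaN | NaN | NaN | 2.308642 | 0.836071 | 1.0 | 2.0 | 3.0 | 3.0 | 3.0 |
| Name | 891.0 | 891.0 | Dooley, Mr. Patrick | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Sex | 891.0 | 2.0 | male | 577.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Age | 714.0 | NaN | NaN | NaN | 29.699118 | 14.526497 | 0.42 | 20.125 | 28.0 | 38.0 | 80.0 |
| SibSp | 891.0 | NaN | NaN | NaN | 0.523008 | 1.102743 | 0.0 | 0.0 | 0.0 | 1.0 | 8.0 |
| Parch | 891.0 | NaN | NaN | NaN | 0.381594 | 0.806057 | 0.0 | 0.0 | 0.0 | 0.0 | 6.0 |
| Ticket | 891.0 | 681.0 | 347082 | 7.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Fare | 891.0 | NaN | NaN | NaN | 32.204208 | 49.693429 | 0.0 | 7.9104 | 14.4542 | 31.0 | 512.3292 |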


| Column | Fraction missing |
|---|---|
| Cabin | 0.771044 |
| Age | 0.198653 |
| Embarked | 0.002245 |
| PassengerId | 0.0 |
| Name | 0.0 |
| Pclass | 0.0 |
| Survived | 0.0 |
| Sex | 0.0 |
| Parch | 0.0 |
| SibSp | 0.0 |
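
The profile above covers schema and missingness, and the `Survived` mean of 0.383838 already implies that about 38% of passengers survived. An explicit target-balance check could look like the sketch below. It uses a small toy frame (an assumption made so the snippet runs on its own); in the notebook the same call would be applied to `df`.

```python
import pandas as pd

# Toy stand-in for the Titanic frame; in the notebook, use `df` directly.
toy = pd.DataFrame({"Survived": [0, 0, 0, 1, 1]})

# Share of each class; normalize=True returns fractions instead of counts.
balance = toy["Survived"].value_counts(normalize=True).sort_index()
print(balance)
```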


## Data Preparation
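Feature engineering: title extraction, simple imputations, encoding. The cell below implements only the imputation, scaling, and encoding steps. It does not extract titles, so `Name`, `Ticket`, and `Cabin` are one-hot encoded as raw strings and `PassengerId` is treated as a numeric feature.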

In [43]:
feat_df = df.copy()  # Work on a copy so the raw dataframe stays untouched

y = feat_df['Survived']  # Target: 1 = survived, 0 = did not survive
X = feat_df.drop(columns=['Survived'])  # All remaining columns are used as features

numeric = X.select_dtypes(include=['int64','float64']).columns.tolist()
categorical = X.select_dtypes(include=['object','category','bool']).columns.tolist()

num_pipe = Pipeline([('imputer', SimpleImputer(strategy='median')), ('scaler', StandardScaler())])
cat_pipe = Pipeline([('imputer', SimpleImputer(strategy='most_frequent')), ('onehot', OneHotEncoder(handle_unknown='ignore'))])
pre = ColumnTransformer([('num', num_pipe, numeric), ('cat', cat_pipe, categorical)])
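
The title extraction mentioned for this phase could be sketched as follows. The names are taken from the `df.head()` output, while the regular expression and the `titles` variable are assumptions rather than the notebook's own method: the pattern captures the token between the comma and the first period.

```python
import pandas as pd

# Sample names copied from the first rows of train.csv.
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
])

# Capture the text between the comma and the first period, e.g. "Mr".
titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
print(titles.tolist())
```

In the notebook, the result could be assigned to a new column such as `feat_df['Title']` before `X` is built, so that the categorical pipeline encodes it in place of the raw name.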

## Modeling
We compare Logistic Regression and Random Forest using 5-fold stratified cross-validation, scored by ROC AUC.

In [44]:
models = {
    "log_reg": LogisticRegression(max_iter=1000, n_jobs=None, random_state=RANDOM_STATE),
    "rf": RandomForestClassifier(n_estimators=400, random_state=RANDOM_STATE)
}

from sklearn.model_selection import cross_val_score  # StratifiedKFold is already imported in Setup
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)

for name, clf in models.items():
    pipe = Pipeline([('pre', pre), ('clf', clf)])
    auc = cross_val_score(pipe, X, y, scoring='roc_auc', cv=skf).mean()
    print(name, "cv auc:", round(auc, 4))


log_reg cv auc: 0.8725
rf cv auc: 0.8795
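
The two CV means above differ by about 0.007, so it may help to look at the spread across folds before declaring a winner. The sketch below reports the mean and standard deviation of the fold scores. It runs on synthetic data from `make_classification` (an assumption made to keep it self-contained), and the `_demo` names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic binary classification data, standing in for the Titanic features.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)

skf_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_demo, y_demo, scoring="roc_auc", cv=skf_demo
)

# One AUC per fold; the std gives a rough sense of fold-to-fold variation.
print(f"auc mean={scores.mean():.4f} std={scores.std():.4f}")
```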


In [45]:
# Fit the best CV model (random forest) on the full training set and save it.
# Note: this uses n_estimators=500, whereas the CV comparison above used 400.
best = Pipeline([('pre', pre), ('clf', RandomForestClassifier(n_estimators=500, random_state=RANDOM_STATE))])
best.fit(X, y)
os.makedirs('deployment', exist_ok=True)
joblib.dump(best, 'deployment/model.joblib')
print("Saved to deployment/model.joblib")

Saved to deployment/model.joblib


## Evaluation
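Evaluate on a stratified 20% holdout. The cell refits `best` on the remaining 80%, so after it runs the in-memory pipeline no longer matches the full-data model saved to `deployment/model.joblib`.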

In [46]:
from sklearn.metrics import f1_score

# Stratified 80/20 split; refit on the training portion only
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=RANDOM_STATE)
best.fit(X_tr, y_tr)

# AUC uses the positive-class probabilities; F1 uses a 0.5 decision threshold
probs = best.predict_proba(X_te)[:, 1]
auc = roc_auc_score(y_te, probs)
pred = (probs >= 0.5).astype(int)
f1 = f1_score(y_te, pred)
print("AUC:", auc, "F1:", f1)
print("Meets KPI?", (auc >= TARGET_KPI['auc']) and (f1 >= TARGET_KPI['f1']))

AUC: 0.8442028985507245 F1: 0.7213114754098361
Meets KPI? False
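
The two metrics above behave differently: AUC is computed from the ranking of the predicted probabilities, while F1 depends on the 0.5 cutoff used to turn them into labels. The small worked example below uses hand-made labels and probabilities (illustrative values, not notebook results) to show both calculations.

```python
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score

# Hand-made example: 2 negatives, 3 positives.
y_true = np.array([0, 0, 1, 1, 1])
probs_demo = np.array([0.2, 0.6, 0.4, 0.7, 0.9])

# AUC: fraction of (positive, negative) pairs ranked correctly, here 5 of 6.
auc_demo = roc_auc_score(y_true, probs_demo)

# F1 at a 0.5 threshold: TP=2, FP=1, FN=1, so precision = recall = 2/3.
pred_demo = (probs_demo >= 0.5).astype(int)
f1_demo = f1_score(y_true, pred_demo)

print(round(auc_demo, 4), round(f1_demo, 4))
```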


## Deployment
The exported model is loaded by the FastAPI app at `deployment/api/app.py`. See `deployment/README.md`.
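
The contents of `app.py` are not shown here, so the following is only a sketch of the load-and-predict step such an app would presumably perform. It trains a toy pipeline, saves it with `joblib`, reloads it, and scores one new row. The toy data, column choice, and temporary path are assumptions made for illustration.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy pipeline standing in for the saved Titanic model.
train = pd.DataFrame({"Age": [22.0, 38.0, 26.0, 35.0], "Fare": [7.25, 71.28, 7.93, 53.10]})
labels = [0, 1, 1, 1]
pipe = Pipeline([("scaler", StandardScaler()), ("clf", LogisticRegression())]).fit(train, labels)

# Round trip through disk, as a serving process would do at startup.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.joblib")
    joblib.dump(pipe, path)
    loaded = joblib.load(path)

# New rows must use the same column names as the training frame.
new_rows = pd.DataFrame({"Age": [30.0], "Fare": [10.0]})
proba = loaded.predict_proba(new_rows)[:, 1]
print(proba)
```

Because the notebook's pipeline selects columns by name in its `ColumnTransformer`, requests to the real model would need to be converted into a DataFrame with the original `train.csv` feature columns before calling `predict_proba`.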