# Task 2 - Predictive Analysis using Machine Learning

Deliverable: A notebook demonstrating **feature selection, model training, and evaluation**.

Created: 2025-08-29 04:59

---
We will build a classification model on the Titanic dataset to predict passenger survival. The dataset is well known, small enough to run locally, and demonstrates an end-to-end ML workflow.

Steps:
1. Load dataset
2. Preprocess features (column selection, imputation, encoding)
3. Train/test split
4. Train Logistic Regression
5. Train Random Forest
6. Evaluation (accuracy, confusion matrix, classification report)
7. Feature importance (Random Forest)
8. Insights

## 1) Load dataset

In [None]:
import seaborn as sns
import pandas as pd
titanic = sns.load_dataset('titanic')
print(titanic.shape)
titanic.head()

## 2) Preprocess features

In [None]:
df = titanic.copy()
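# Drop columns that are mostly missing, leak the target, or duplicate other columns:
# - 'deck' is missing for most passengers
# - 'alive' is the target ('survived') as a yes/no string, so keeping it would leak the label
# - 'embark_town' and 'class' repeat 'embarked' and 'pclass'
# - 'who' and 'adult_male' are derived from 'sex' and 'age'
df = df.drop(columns=['deck','embark_town','alive','class','who','adult_male'])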

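# Fill missing values (median/mode are computed on the full dataset here for simplicity;
# strictly, they should be learned from the training split only)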
df['age'] = df['age'].fillna(df['age'].median())
df['embarked'] = df['embarked'].fillna(df['embarked'].mode()[0])

# Encode categorical variables
df = pd.get_dummies(df, drop_first=True)

print(df.shape)
df.head()
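
The cell above computes the median and mode on the full dataset before the split, so test rows influence the imputed values. One way to avoid this, sketched below, is to put imputation and encoding inside a scikit-learn `Pipeline` so they are fitted on training rows only. The `toy` frame, its values, and the column groupings are made up for illustration and are not the real Titanic data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical miniature frame that mimics a few Titanic columns
toy = pd.DataFrame({
    'age': [22.0, np.nan, 35.0, 54.0, np.nan, 2.0, 27.0, 40.0],
    'fare': [7.25, 71.28, 8.05, 51.86, 8.46, 21.08, 11.13, 30.07],
    'sex': ['male', 'female', 'female', 'male', 'male', 'female', 'female', 'male'],
    'embarked': ['S', 'C', 'S', np.nan, 'Q', 'S', 'S', 'C'],
    'survived': [0, 1, 1, 0, 0, 1, 1, 0],
})

numeric_cols = ['age', 'fare']
categorical_cols = ['sex', 'embarked']

# Median imputation for numeric columns; mode imputation + one-hot for categoricals
preprocess = ColumnTransformer([
    ('num', SimpleImputer(strategy='median'), numeric_cols),
    ('cat', Pipeline([
        ('impute', SimpleImputer(strategy='most_frequent')),
        ('onehot', OneHotEncoder(handle_unknown='ignore')),
    ]), categorical_cols),
])

clf = Pipeline([
    ('preprocess', preprocess),
    ('model', LogisticRegression(max_iter=1000)),
])

# Fit on the first six rows only; the last two rows play the role of a test set
train, test = toy.iloc[:6], toy.iloc[6:]
clf.fit(train.drop(columns='survived'), train['survived'])

# The imputer's statistics come from the training rows alone
train_age_median = clf.named_steps['preprocess'].named_transformers_['num'].statistics_[0]
preds = clf.predict(test.drop(columns='survived'))
print('Median age learned from training rows only:', train_age_median)
print('Predictions for held-out rows:', preds)
```

With this layout, `clf.fit(X_train, y_train)` and `clf.score(X_test, y_test)` could replace the manual `fillna` and `get_dummies` steps, at the cost of a slightly longer setup.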

## 3) Train/Test split

In [None]:
from sklearn.model_selection import train_test_split
X = df.drop('survived', axis=1)
y = df['survived']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
X_train.shape, X_test.shape
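
The `stratify=y` argument keeps the share of survivors roughly equal in the train and test sets. The sketch below shows the effect on made-up labels with about the same class balance as the Titanic data (roughly 38% positive); `y_demo` and `X_demo` are placeholders, not the notebook's variables.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels: 38 positives and 62 negatives
y_demo = np.array([1] * 38 + [0] * 62)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature column

# Plain random split: the positive rate in the test set can drift
_, _, y_tr_plain, y_te_plain = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)

# Stratified split: the positive rate is preserved as closely as the sizes allow
_, _, y_tr_strat, y_te_strat = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo)

print('Overall positive rate:   ', y_demo.mean())
print('Plain split test rate:   ', y_te_plain.mean())
print('Stratified test rate:    ', y_te_strat.mean())
print('Stratified train rate:   ', y_tr_strat.mean())
```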

## 4) Train Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression
model_lr = LogisticRegression(max_iter=1000)
model_lr.fit(X_train, y_train)
y_pred_lr = model_lr.predict(X_test)
print('Train acc:', model_lr.score(X_train, y_train))
print('Test acc:', model_lr.score(X_test, y_test))

## 5) Train Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier
model_rf = RandomForestClassifier(n_estimators=200, random_state=42)
model_rf.fit(X_train, y_train)
y_pred_rf = model_rf.predict(X_test)
print('Train acc:', model_rf.score(X_train, y_train))
print('Test acc:', model_rf.score(X_test, y_test))

## 6) Evaluation

In [None]:
from sklearn.metrics import classification_report, confusion_matrix
print('Logistic Regression Report')
print(classification_report(y_test, y_pred_lr))
print('Random Forest Report')
print(classification_report(y_test, y_pred_rf))

# Rows are true labels (0 = died, 1 = survived); columns are predicted labels
print('Confusion Matrix (Random Forest)')
print(confusion_matrix(y_test, y_pred_rf))
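
To connect the confusion matrix to the numbers in the classification report, the sketch below derives accuracy, precision, recall, and F1 for the positive class by hand. The helper `binary_metrics` and the counts in `example_cm` are hypothetical and are not results from this notebook.

```python
def binary_metrics(cm):
    """Compute metrics for class 1 from a 2x2 matrix laid out as
    [[tn, fp], [fn, tp]], which is the layout confusion_matrix returns
    for labels 0 and 1 (rows = true, columns = predicted)."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    # Guard against division by zero when a class is never predicted or never present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {'accuracy': accuracy, 'precision': precision, 'recall': recall, 'f1': f1}

# Hypothetical counts for illustration only
example_cm = [[98, 12], [22, 47]]
metrics = binary_metrics(example_cm)
for name, value in metrics.items():
    print(f'{name}: {value:.3f}')
```

Passing `confusion_matrix(y_test, y_pred_rf)` to the same helper should reproduce the class-1 row of the Random Forest report.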

## 7) Feature Importance (Random Forest)

In [None]:
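import matplotlib.pyplot as plt
# Impurity-based importances from the fitted forest, labelled with feature names
feat_imp = pd.Series(model_rf.feature_importances_, index=X.columns)
# Sort ascending so the most important feature is drawn at the top of the chart
feat_imp.nlargest(10).sort_values().plot(kind='barh')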
plt.title('Top 10 Important Features')
plt.show()
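
The `feature_importances_` attribute is computed from impurity decreases on the training data. As a cross-check, the sketch below compares it with permutation importance measured on held-out data, a different technique available as `sklearn.inspection.permutation_importance`. The data comes from `make_classification` and the feature names `f0` to `f5` are placeholders, so the rankings say nothing about the Titanic features.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 6 features, 3 of which carry signal
cols = [f'f{i}' for i in range(6)]
X_arr, y_demo = make_classification(
    n_samples=400, n_features=6, n_informative=3, n_redundant=0, random_state=0)
X_demo = pd.DataFrame(X_arr, columns=cols)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Impurity-based importances (sum to 1, derived from the training data)
impurity_imp = pd.Series(rf.feature_importances_, index=cols).sort_values(ascending=False)

# Permutation importances: mean drop in test accuracy when one column is shuffled
result = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=0)
perm_imp = pd.Series(result.importances_mean, index=cols).sort_values(ascending=False)

print(pd.DataFrame({'impurity': impurity_imp, 'permutation': perm_imp}))
```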

## 8) Insights

After running the models, summarize:
- Accuracy of Logistic Regression vs Random Forest
- Key features influencing survival (e.g., sex, age, passenger class `pclass`)
- Whether each model generalizes well (compare train and test accuracy; a large gap suggests overfitting)
- Potential improvements (hyperparameter tuning, cross-validation, etc.)
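
As a starting point for the last item, here is a minimal sketch of k-fold cross-validation and a small grid search. It runs on synthetic data from `make_classification` rather than the Titanic features, and the grid values are arbitrary choices for illustration, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

# Synthetic stand-in for the feature matrix and target
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=42)

# 5-fold cross-validation gives a spread of accuracies instead of a single test score
rf = RandomForestClassifier(n_estimators=50, random_state=42)
cv_scores = cross_val_score(rf, X_demo, y_demo, cv=5, scoring='accuracy')
print('CV accuracy: %.3f +/- %.3f' % (cv_scores.mean(), cv_scores.std()))

# Small, arbitrary grid over two tree-size parameters
param_grid = {'max_depth': [3, 5, None], 'min_samples_leaf': [1, 5]}
search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=42),
    param_grid, cv=5, scoring='accuracy')
search.fit(X_demo, y_demo)
print('Best params:', search.best_params_)
print('Best CV accuracy: %.3f' % search.best_score_)
```

In the notebook itself, the search would be fitted on `X_train` and `y_train` only, with `X_test` kept aside for a final check.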