# 🧪 Assignment: EDA Impact on Model Accuracy – Titanic Dataset

## 🎯 Objective
In this assignment, you will:
- Understand how Exploratory Data Analysis (EDA) can improve machine learning models.
- Train a model with minimal preprocessing.
- Perform detailed EDA and feature engineering.
- Retrain the model and compare performance.


In [None]:
# Step 1: Load Libraries and Dataset
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Load the dataset: download 'train.csv' from the Kaggle Titanic competition
# (https://www.kaggle.com/c/titanic) and upload it to Colab.

# Adjust the path if the file is stored elsewhere
df = pd.read_csv('train.csv')
df.head()

In [None]:
# Step 2: Baseline Model (Minimal preprocessing)
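# Keep three features plus the target and drop rows with missing values;
# .copy() can avoid pandas' SettingWithCopyWarning on the assignment below
df_baseline = df[['Pclass', 'Sex', 'Age', 'Survived']].dropna().copy()
# Encode Sex as integers (LabelEncoder sorts labels: female -> 0, male -> 1)
df_baseline['Sex'] = LabelEncoder().fit_transform(df_baseline['Sex'])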

X = df_baseline[['Pclass', 'Sex', 'Age']]
y = df_baseline['Survived']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print("Accuracy (Baseline):", accuracy_score(y_test, y_pred))

## 🔍 Step 3: EDA
Explore the dataset and look for:
- Missing values (`df.isnull().sum()`)
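- Correlations between numeric columns (`sns.heatmap(df.corr(numeric_only=True), annot=True)`; recent pandas versions raise an error on text columns such as `Name` without `numeric_only=True`)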
- Survival rate across features:
  - `sns.barplot(x='Sex', y='Survived', data=df)`
  - `sns.histplot(data=df, x='Age', hue='Survived', bins=20)`
- Distributions of `Pclass`, `Fare`, `Embarked`, and other features
- Feature interactions (e.g., `FamilySize = SibSp + Parch`)
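
The checks listed above can also be done numerically, without plots. The following is a minimal sketch that assumes the column names of the Kaggle `train.csv` (`Survived`, `SibSp`, `Parch`); the helper names `missing_report`, `numeric_correlations`, `survival_rate_by`, and `add_family_size` are hypothetical and not part of the assignment.

```python
import pandas as pd


def missing_report(df: pd.DataFrame) -> pd.Series:
    """Count missing values per column, largest first, omitting complete columns."""
    counts = df.isnull().sum()
    return counts[counts > 0].sort_values(ascending=False)


def numeric_correlations(df: pd.DataFrame) -> pd.DataFrame:
    """Correlation matrix over numeric columns only (text columns are skipped)."""
    return df.select_dtypes(include="number").corr()


def survival_rate_by(df: pd.DataFrame, column: str) -> pd.Series:
    """Mean of 'Survived' (the survival rate) for each value of `column`."""
    return df.groupby(column)["Survived"].mean()


def add_family_size(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with FamilySize = SibSp + Parch, as suggested above."""
    out = df.copy()
    out["FamilySize"] = out["SibSp"] + out["Parch"]
    return out
```

For example, `survival_rate_by(df, 'Sex')` gives the same numbers that the `Sex` bar plot displays, and `missing_report(df)` shows which columns need imputation before modeling.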


In [None]:
# Step 4: Feature Engineering based on EDA
# Impute some features

# Encode categorical variables

# Select relevant features


# Feature Scaling
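
One possible way to fill in the placeholders above is sketched below. It is not the expected solution: the median and mode imputation, the one-hot encoding of `Embarked`, the feature list, and the function names `engineer_features` and `split_and_scale` are all assumptions made for illustration. The column names are those of the Kaggle `train.csv`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def engineer_features(df: pd.DataFrame):
    """Build (X, y) from the raw Titanic frame. Every choice here is illustrative."""
    data = df.copy()

    # Impute: median for Age, most frequent value for Embarked (assumed choices)
    data["Age"] = data["Age"].fillna(data["Age"].median())
    data["Embarked"] = data["Embarked"].fillna(data["Embarked"].mode()[0])

    # Feature interaction suggested in Step 3
    data["FamilySize"] = data["SibSp"] + data["Parch"]

    # Encode categorical variables
    data["Sex"] = data["Sex"].map({"male": 0, "female": 1})
    data = pd.get_dummies(data, columns=["Embarked"], dtype=int)

    # Select relevant features (an assumed list; adjust it based on your EDA)
    embarked_cols = [c for c in data.columns if c.startswith("Embarked_")]
    features = ["Pclass", "Sex", "Age", "Fare", "FamilySize"] + embarked_cols
    return data[features], data["Survived"]


def split_and_scale(X, y, test_size=0.2, random_state=42):
    """Split first, then fit the scaler on the training rows only to avoid leakage."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)
    return X_train, X_test, y_train, y_test
```

With these helpers, `X, y = engineer_features(df)` followed by `X_train, X_test, y_train, y_test = split_and_scale(X, y)` would replace the baseline split, so the next cell trains on the engineered features.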


In [None]:
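# Step 5: Train Model After EDA
# X_train, X_test, y_train, y_test must be re-created from the Step 4 features;
# otherwise this cell retrains on the baseline split from Step 2.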
model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print("Accuracy (After EDA):", accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

# Step 6: Try other models, such as random forest or XGBoost
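
A minimal sketch of such a comparison follows, assuming the train/test split from the earlier steps is reused. The function name `compare_models` and the hyperparameters (`max_iter=1000`, `n_estimators=200`) are illustrative assumptions, not tuned values. XGBoost is left out because it requires a separate install; its `XGBClassifier` follows the same fit/predict pattern and could be added to the dictionary.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score


def compare_models(X_train, X_test, y_train, y_test, random_state=42):
    """Fit each candidate on the training split and return its test accuracy by name."""
    candidates = {
        "LogisticRegression": LogisticRegression(max_iter=1000),
        "RandomForest": RandomForestClassifier(
            n_estimators=200, random_state=random_state
        ),
    }
    scores = {}
    for name, model in candidates.items():
        model.fit(X_train, y_train)
        scores[name] = accuracy_score(y_test, model.predict(X_test))
    return scores
```

Calling `compare_models(X_train, X_test, y_train, y_test)` returns a dictionary of accuracies. Because every model is scored on the same split, the numbers are directly comparable to the accuracy printed in Step 5.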


## 🧠 Final Reflection
1. Compare the accuracy before and after EDA.
2. What features or cleaning steps made the biggest difference?
3. Why is skipping EDA risky in real-world machine learning projects?
