
# Titanic Dataset — Data Cleaning & Exploratory Data Analysis (EDA)

This notebook performs **data cleaning** and **EDA** on the Titanic dataset (Kaggle format).  
You can run it in **Google Colab** or **VS Code/Jupyter**.

> If you're using Colab, uncomment the upload lines in the next cell and run it to upload `titanic.csv`.


In [None]:
# Default path to the CSV; change this if your file has a different name/path
csv_name = "titanic.csv"

# If using Google Colab, uncomment the next 3 lines to upload the CSV.
# They run after the default above, so the uploaded filename takes precedence.
# from google.colab import files
# uploaded = files.upload()
# csv_name = list(uploaded.keys())[0]  # use the uploaded filename


In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# Visualization settings
plt.rcParams['figure.figsize'] = (8, 5)
pd.set_option('display.max_columns', None)

In [None]:
# Load Titanic dataset
df = pd.read_csv(csv_name)
print("Shape:", df.shape)
df.head()

In [None]:
print("Data Types / Info:")
print(df.dtypes)
print("\nMissing Values per Column:")
print(df.isna().sum())
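
Raw missing counts are easier to act on next to percentages, since the cleaning step below drops `Cabin` based on its missing fraction. The following is a minimal sketch of such a summary; `missing_summary` is a hypothetical helper, and the `toy` frame is made-up data standing in for the real CSV.

```python
import numpy as np
import pandas as pd

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return missing count and percentage per column, worst column first."""
    summary = pd.DataFrame({
        "missing": frame.isna().sum(),
        "percent": (frame.isna().mean() * 100).round(1),
    })
    return summary.sort_values("missing", ascending=False)

# Made-up rows for illustration only (not taken from the Titanic CSV)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

summary = missing_summary(toy)
print(summary)
```

On the real data you would call `missing_summary(df)` instead of passing the toy frame.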

## 🧹 Data Cleaning

In [None]:
# 1) Fill missing 'Age' with median (robust to outliers)
if 'Age' in df.columns:
    df['Age'] = df['Age'].fillna(df['Age'].median())

# 2) Fill missing 'Embarked' with mode (most frequent)
if 'Embarked' in df.columns:
    df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])

# 3) Drop 'Cabin' due to high missingness (optional)
if 'Cabin' in df.columns and df['Cabin'].isna().mean() > 0.6:
    df = df.drop(columns=['Cabin'])

# 4) Drop any remaining fully-empty columns
df = df.dropna(axis=1, how='all')

print("After cleaning — shape:", df.shape)
print("Remaining missing values:")
print(df.isna().sum())
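
The same cleaning steps can be wrapped in a function that works on a copy, which makes them easy to re-run and to check on a small example. This is only a sketch: `clean_titanic` and `cabin_threshold` are hypothetical names, the `toy` frame is made-up data, and the empty-mode guard is an addition that the cell above does not have.

```python
import numpy as np
import pandas as pd

def clean_titanic(frame: pd.DataFrame, cabin_threshold: float = 0.6) -> pd.DataFrame:
    """Apply the notebook's cleaning steps to a copy of `frame`."""
    out = frame.copy()
    # 1) Median imputation for Age
    if "Age" in out.columns:
        out["Age"] = out["Age"].fillna(out["Age"].median())
    # 2) Mode imputation for Embarked (skipped if the column is entirely empty)
    if "Embarked" in out.columns:
        modes = out["Embarked"].mode()
        if not modes.empty:
            out["Embarked"] = out["Embarked"].fillna(modes.iloc[0])
    # 3) Drop Cabin when its missing fraction exceeds the threshold
    if "Cabin" in out.columns and out["Cabin"].isna().mean() > cabin_threshold:
        out = out.drop(columns=["Cabin"])
    # 4) Drop columns that are entirely empty
    return out.dropna(axis=1, how="all")

# Made-up rows for illustration only
toy = pd.DataFrame({
    "Age": [20.0, np.nan, 40.0, 30.0],
    "Embarked": ["S", "S", None, "C"],
    "Cabin": [None, None, None, "C85"],
})

cleaned = clean_titanic(toy)
print(cleaned)
```

Because the function copies its input, the original frame keeps its missing values and can be compared against the cleaned result.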

## 🔎 Exploratory Data Analysis (EDA)

In [None]:
# Helper to add value labels on bars
def add_labels(ax):
    for p in ax.patches:
        height = p.get_height()
        ax.annotate(f'{int(height)}', (p.get_x()+p.get_width()/2., height),
                    ha='center', va='bottom', xytext=(0,3), textcoords='offset points')

# 1) Survival distribution
if 'Survived' in df.columns:
    counts = df['Survived'].value_counts().sort_index()
    ax = counts.plot(kind='bar')
    ax.set_title('Survival Distribution (0 = Not Survived, 1 = Survived)')
    ax.set_xlabel('Survived')
    ax.set_ylabel('Count')
    add_labels(ax)
    plt.tight_layout()
    plt.show()

In [None]:
# 2) Survival by Gender
if {'Sex', 'Survived'}.issubset(df.columns):
    ctab = pd.crosstab(df['Sex'], df['Survived'])
    ax = ctab.plot(kind='bar', stacked=False)
    ax.set_title('Survival by Gender')
    ax.set_xlabel('Sex')
    ax.set_ylabel('Count')
    plt.xticks(rotation=0)
    plt.tight_layout()
    plt.show()

In [None]:
# 3) Survival by Passenger Class
if {'Pclass', 'Survived'}.issubset(df.columns):
    ctab = pd.crosstab(df['Pclass'], df['Survived'])
    ax = ctab.plot(kind='bar')
    ax.set_title('Survival by Passenger Class')
    ax.set_xlabel('Pclass')
    ax.set_ylabel('Count')
    plt.tight_layout()
    plt.show()
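
The bar charts above show raw counts, which can hide differences when groups have different sizes. A minimal sketch of computing survival rates instead is shown here, using `pd.crosstab(..., normalize="index")` and a grouped mean. The `toy` frame is invented for illustration and does not reflect the real Titanic numbers.

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "female", "male"],
    "Pclass": [3, 1, 3, 1, 2, 3],
    "Survived": [0, 1, 1, 0, 1, 0],
})

# Row-normalized crosstab: each row sums to 1, so column 1 is the survival rate
rate_by_sex = pd.crosstab(toy["Sex"], toy["Survived"], normalize="index")

# Survived is coded 0/1, so its mean per group is also the survival rate
rate_by_class = toy.groupby("Pclass")["Survived"].mean()

print(rate_by_sex)
print(rate_by_class)
```

With the real data, replace `toy` with `df`; the rate table can be plotted with `.plot(kind='bar')` in the same way as the count tables.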

In [None]:
# 4) Age distribution (ages imputed with the median above will show up as a spike at that value)
if 'Age' in df.columns:
    plt.hist(df['Age'].dropna(), bins=30, edgecolor='black')
    plt.title('Age Distribution of Passengers')
    plt.xlabel('Age')
    plt.ylabel('Frequency')
    plt.tight_layout()
    plt.show()

In [None]:
# 5) Age vs Survival (box plot using matplotlib)
if {'Age', 'Survived'}.issubset(df.columns):
    data0 = df.loc[df['Survived'] == 0, 'Age'].dropna()
    data1 = df.loc[df['Survived'] == 1, 'Age'].dropna()
    # Label the boxes via xticks; boxplot's `labels` keyword is deprecated in Matplotlib 3.9+
    plt.boxplot([data0, data1], showmeans=True)
    plt.xticks([1, 2], ['Not Survived', 'Survived'])
    plt.title('Age vs Survival')
    plt.ylabel('Age')
    plt.tight_layout()
    plt.show()
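
A box plot compares the two age distributions, but it does not directly show the survival rate per age group. One possible way to get that is to bin ages with `pd.cut` and average `Survived` within each bin. The bin edges, the band labels, the `AgeBand` column name, and the `toy` rows are all assumptions chosen for illustration.

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    "Age": [4, 15, 28, 35, 62, 70],
    "Survived": [1, 1, 0, 1, 0, 0],
})

# Hypothetical age bands; intervals are right-inclusive: (0, 12], (12, 18], ...
bins = [0, 12, 18, 60, 120]
labels = ["child", "teen", "adult", "senior"]
toy["AgeBand"] = pd.cut(toy["Age"], bins=bins, labels=labels)

# Mean of the 0/1 Survived column per band is the survival rate
rate_by_band = toy.groupby("AgeBand", observed=True)["Survived"].mean()
print(rate_by_band)
```

Keep in mind that rows whose age was imputed with the median all land in the same band, which can distort that band's rate.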

In [None]:
# 6) Survival by Embarked
if {'Embarked', 'Survived'}.issubset(df.columns):
    ctab = pd.crosstab(df['Embarked'], df['Survived'])
    ax = ctab.plot(kind='bar')
    ax.set_title('Survival by Embarkation Port')
    ax.set_xlabel('Embarked')
    ax.set_ylabel('Count')
    plt.tight_layout()
    plt.show()

In [None]:
# 7) Correlation (numeric columns)
numeric_df = df.select_dtypes(include='number')
corr = numeric_df.corr()  # already numeric-only; numeric_only=True would fail on pandas < 1.5

# Simple heatmap using matplotlib
fig, ax = plt.subplots()
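# Fix the color scale to [-1, 1] and use a diverging colormap so the sign of each correlation is visible
cax = ax.imshow(corr, interpolation='nearest', cmap='coolwarm', vmin=-1, vmax=1)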
ax.set_title('Correlation Heatmap (Numeric Columns)')
fig.colorbar(cax)
ax.set_xticks(range(len(corr.columns)))
ax.set_yticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=90)
ax.set_yticklabels(corr.columns)
plt.tight_layout()
plt.show()
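
A heatmap gives the overall picture, but the column of most interest is usually the one for `Survived`. As a rough sketch, that column can be pulled out of the correlation matrix and sorted. The `toy` frame and the `corr_with_target` name are assumptions for illustration, and the values are made up.

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Pclass": [3, 1, 2, 3, 1, 3],
    "Fare": [7.3, 71.3, 13.0, 8.1, 53.1, 7.9],
    "Name": ["a", "b", "c", "d", "e", "f"],  # non-numeric, excluded below
})

numeric = toy.select_dtypes(include="number")

# Pearson correlation of every numeric column with Survived, most negative first
corr_with_target = (
    numeric.corr()["Survived"]
    .drop("Survived")  # the self-correlation is always 1
    .sort_values()
)
print(corr_with_target)
```

On the real data, identifier-like columns such as `PassengerId` are numeric too and will appear in this ranking, so you may want to drop them first.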


## 📌 Quick Insights (Typical Patterns)
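- **Female passengers** generally show **higher survival rates** than male passengers.
- **1st class** passengers had a higher survival rate than 2nd- and 3rd-class passengers.
- **Younger passengers**, children in particular, tend to have better odds of survival than older passengers. Filling missing ages with the median can make this pattern look weaker.
- `Fare` tends to correlate positively with `Survived`, and `Pclass` negatively (a lower `Pclass` number means a higher class).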

> Your exact results may vary depending on the Titanic CSV version you use.
