# 01 - Exploratory Data Analysis (EDA)

Welcome to the first notebook of the Titanic Survival Prediction project! In this notebook, we will perform Exploratory Data Analysis (EDA) on the `train.csv` dataset. The goal of EDA is to understand the dataset's structure, identify patterns, detect anomalies, and gain insights that will guide our feature engineering and model selection processes.

## 1. Load Data

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set display options for better viewing
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000)

# Load the training dataset
try:
    df = pd.read_csv('../data/train.csv')
    print("Dataset loaded successfully.")
except FileNotFoundError:
    print("Error: train.csv not found. Please make sure it's in the '../data/' directory.")
    df = pd.DataFrame() # Create an empty DataFrame to avoid errors later


## 2. Basic Information

Let's start by looking at the first few rows of the dataset, its general information, and descriptive statistics.

### 2.1. Display First 5 Rows (`.head()`)

In [None]:
if not df.empty:
    print("\nFirst 5 rows of the dataset:")
    display(df.head())

### 2.2. Dataset Information (`.info()`)

In [None]:
if not df.empty:
    print("\nDataset Information:")
    df.info()

### 2.3. Descriptive Statistics (`.describe()`)

In [None]:
if not df.empty:
    print("\nDescriptive Statistics for Numerical Features:")
    display(df.describe())

## 3. Identify Missing Values

Missing values can impact model performance. Let's identify which columns have missing data and the extent of those missing values.

In [None]:
if not df.empty:
    print("\nMissing Values:")
    # Count missing entries per column and keep only columns that have any
    missing_values = df.isnull().sum()
    missing_values = missing_values[missing_values > 0].sort_values(ascending=False)
    # Express the same counts as a percentage of all rows
    missing_percent = missing_values / len(df) * 100

    missing_df = pd.DataFrame({'Missing Count': missing_values, 'Missing Percent': missing_percent})

    if missing_df.empty:
        print("No missing values found in the dataset.")
    else:
        display(missing_df)


**Observations on Missing Values:**
- **`Cabin`**: Has a very high percentage of missing values. This column might need to be dropped or handled carefully (e.g., creating a binary feature indicating if Cabin info is available).
- **`Age`**: Has a significant number of missing values. Imputation (e.g., with median or mean) will be necessary.
- **`Embarked`**: Has very few missing values. These can likely be imputed with the mode.
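
The handling options listed above could look like the following minimal sketch. The helper name `impute_basic` and the `Has_Cabin` column name are illustrative assumptions, and the toy rows are made up rather than taken from `train.csv`.

```python
import numpy as np
import pandas as pd


def impute_basic(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with Cabin flagged, Age median-filled, Embarked mode-filled."""
    out = frame.copy()
    # Binary flag: 1 if a cabin value was recorded, 0 otherwise
    out['Has_Cabin'] = out['Cabin'].notna().astype(int)
    # The median is less sensitive to extreme ages than the mean
    out['Age'] = out['Age'].fillna(out['Age'].median())
    # mode() returns a Series; take its first (most frequent) value
    out['Embarked'] = out['Embarked'].fillna(out['Embarked'].mode().iloc[0])
    # The raw Cabin column is dropped once the flag exists
    return out.drop(columns='Cabin')


# Made-up rows for illustration only (not real passengers)
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 30.0, 40.0],
    'Cabin': [np.nan, 'C85', np.nan, np.nan],
    'Embarked': ['S', 'C', None, 'S'],
})
clean = impute_basic(toy)
print(clean)
```

Working on a copy keeps the original `df` untouched, so the EDA cells that follow still see the raw missing values.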

## 4. Analyze Data Distribution

Understanding the distribution of individual features is crucial. We'll visualize this using histograms for numerical data and count plots for categorical data.

### 4.1. Numerical Features: `Age`, `Fare`, `SibSp`, `Parch`

In [None]:
if not df.empty:
    numerical_features = ['Age', 'Fare', 'SibSp', 'Parch']
    plt.figure(figsize=(15, 10))

    for i, feature in enumerate(numerical_features):
        plt.subplot(2, 2, i + 1)
        sns.histplot(df[feature].dropna(), kde=True)
        plt.title(f'Distribution of {feature}')
        plt.xlabel(feature)
        plt.ylabel('Frequency')
    plt.tight_layout()
    plt.show()

**Observations on Numerical Features:**
- **`Age`**: Appears somewhat normally distributed, but with a tail towards older ages. There are peaks at younger ages.
- **`Fare`**: Highly skewed to the right, indicating most fares are low, with a few very high fares.
- **`SibSp`** and **`Parch`**: Most passengers traveled alone or with very few siblings/spouses/parents/children. These distributions are heavily skewed towards 0.
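
To put a number on the skew described above, one option is the sample skewness from pandas. This is a sketch only: `skew_report` is a hypothetical helper and the values below are made up for illustration.

```python
import pandas as pd


def skew_report(frame: pd.DataFrame, columns: list) -> pd.Series:
    """Sample skewness per column; values well above 0 indicate a right tail."""
    return frame[columns].skew().sort_values(ascending=False)


# Made-up values for illustration only
toy = pd.DataFrame({
    'Fare': [7.0, 8.0, 8.0, 10.0, 13.0, 26.0, 80.0, 512.0],
    'SibSp': [0, 0, 0, 0, 1, 1, 0, 3],
})
report = skew_report(toy, ['Fare', 'SibSp'])
print(report)
```

On the real data the same call would be `skew_report(df, numerical_features)`; `DataFrame.skew()` skips missing values by default, so `Age` needs no special handling.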

### 4.2. Categorical Features: `Sex`, `Pclass`, `Embarked`, `Survived`

In [None]:
if not df.empty:
    categorical_features = ['Sex', 'Pclass', 'Embarked', 'Survived']
    plt.figure(figsize=(15, 5))

    for i, feature in enumerate(categorical_features):
        plt.subplot(1, 4, i + 1)
        sns.countplot(x=feature, data=df, palette='viridis')
        plt.title(f'Distribution of {feature}')
        plt.xlabel(feature)
        plt.ylabel('Count')
    plt.tight_layout()
    plt.show()

**Observations on Categorical Features:**
- **`Sex`**: More male passengers than female passengers.
- **`Pclass`**: The majority of passengers were in 3rd class, followed by 1st and then 2nd class.
- **`Embarked`**: Most passengers embarked from 'S' (Southampton), followed by 'C' (Cherbourg) and 'Q' (Queenstown).
- **`Survived`**: More passengers did not survive (0) than survived (1), indicating a moderately imbalanced target variable.
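
The count plot shows the imbalance visually; the exact split can be read off with `value_counts`. A minimal sketch, assuming a 0/1 target column (the helper name `class_balance` and the labels below are made up):

```python
import pandas as pd


def class_balance(target: pd.Series) -> pd.DataFrame:
    """Count and share of each class in a target column."""
    counts = target.value_counts()
    shares = target.value_counts(normalize=True)
    # Both Series share the class labels as index, so they align row by row
    return pd.DataFrame({'count': counts, 'share': shares.round(3)})


# Made-up labels for illustration only
toy_target = pd.Series([0, 0, 0, 1, 0, 1, 0, 1, 0, 0], name='Survived')
balance = class_balance(toy_target)
print(balance)
```

For the notebook's data the call would be `class_balance(df['Survived'])`.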

## 5. Explore Relationships with `Survived` (Target Variable)

Now, let's investigate how different features correlate with our target variable, `Survived`.

### 5.1. Survival Rate by `Sex`

In [None]:
if not df.empty:
    plt.figure(figsize=(6, 5))
    sns.barplot(x='Sex', y='Survived', data=df, palette='pastel')
    plt.title('Survival Rate by Sex')
    plt.ylabel('Survival Rate')
    plt.show()

**Observation:** Females had a significantly higher survival rate than males. This is a very strong indicator.
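
Each bar in `sns.barplot` is the mean of `Survived` within a group, which for a 0/1 column is the survival rate. The same numbers, together with group sizes, can be tabulated directly. A sketch with a hypothetical helper `survival_by` and made-up rows:

```python
import pandas as pd


def survival_by(frame: pd.DataFrame, column: str) -> pd.DataFrame:
    """Survival rate and group size for each level of `column`."""
    return (
        frame.groupby(column)['Survived']
        .agg(rate='mean', n='size')
        .sort_values('rate', ascending=False)
    )


# Made-up rows for illustration only
toy = pd.DataFrame({
    'Sex': ['female', 'male', 'female', 'male', 'male', 'female'],
    'Survived': [1, 0, 1, 0, 1, 0],
})
by_sex = survival_by(toy, 'Sex')
print(by_sex)
```

The helper is column-agnostic, so `survival_by(df, 'Pclass')` or `survival_by(df, 'Embarked')` would give the tables behind the other bar plots in this section.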

### 5.2. Survival Rate by `Pclass`

In [None]:
if not df.empty:
    plt.figure(figsize=(7, 5))
    sns.barplot(x='Pclass', y='Survived', data=df, palette='coolwarm')
    plt.title('Survival Rate by Passenger Class')
    plt.ylabel('Survival Rate')
    plt.show()

**Observation:** Passengers in 1st class had a much higher survival rate compared to 2nd and especially 3rd class. This suggests `Pclass` is an important feature.

### 5.3. Survival Rate by `Embarked`

In [None]:
if not df.empty:
    plt.figure(figsize=(7, 5))
    sns.barplot(x='Embarked', y='Survived', data=df, palette='rocket')
    plt.title('Survival Rate by Embarkation Point')
    plt.ylabel('Survival Rate')
    plt.show()

**Observation:** Passengers who embarked at Cherbourg ('C') had a higher survival rate than those who embarked at Queenstown ('Q') or Southampton ('S').
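
A difference between ports may partly reflect which passenger classes boarded where. One possible check, sketched here with a hypothetical helper `class_mix_by_port` and made-up rows, is to tabulate the class mix per port:

```python
import pandas as pd


def class_mix_by_port(frame: pd.DataFrame) -> pd.DataFrame:
    """Share of each passenger class within each embarkation port (rows sum to 1)."""
    return pd.crosstab(frame['Embarked'], frame['Pclass'], normalize='index')


# Made-up rows for illustration only
toy = pd.DataFrame({
    'Embarked': ['C', 'C', 'S', 'S', 'S', 'Q'],
    'Pclass': [1, 1, 3, 3, 2, 3],
})
mix = class_mix_by_port(toy)
print(mix)
```

If one port turns out to be dominated by 1st class on the real data, its higher survival rate should be read together with the `Pclass` result rather than as an effect of the port alone.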

### 5.4. Survival Rate by `Age` Bins (Visualizing Continuous Features)

In [None]:
if not df.empty:
    # Create age bins for better visualization
    df['Age_Bin'] = pd.cut(df['Age'], bins=[0, 12, 18, 60, 100], labels=['Child', 'Teenager', 'Adult', 'Elderly'])
    plt.figure(figsize=(8, 5))
    sns.barplot(x='Age_Bin', y='Survived', data=df, palette='flare', order=['Child', 'Teenager', 'Adult', 'Elderly'])
    plt.title('Survival Rate by Age Group')
    plt.ylabel('Survival Rate')
    plt.xlabel('Age Group')
    plt.show()
    df.drop('Age_Bin', axis=1, inplace=True) # Drop the temporary bin column


**Observation:** Children (Age ≤ 12, since `pd.cut` bins include their right edge by default) generally had a higher survival rate compared to other age groups.
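
How the bin edges are applied matters when reading the age groups. The sketch below reuses the same edges and labels as the cell above on a few made-up ages to show where boundary values land:

```python
import pandas as pd

AGE_EDGES = [0, 12, 18, 60, 100]
AGE_LABELS = ['Child', 'Teenager', 'Adult', 'Elderly']

# Made-up ages chosen to sit on or near the bin edges; None stands for a missing age
ages = pd.Series([0.42, 12.0, 12.5, 18.0, 60.0, 61.0, None])

# right=True is the default, so each bin is (left, right]:
# exactly 12 is a 'Child', exactly 18 is a 'Teenager', exactly 60 is an 'Adult'
groups = pd.cut(ages, bins=AGE_EDGES, labels=AGE_LABELS)
print(pd.DataFrame({'Age': ages, 'Age_Bin': groups}))
```

Missing ages stay missing after binning, so passengers without an `Age` are silently left out of the bar plot. Passing `right=False` to `pd.cut` would instead give `[left, right)` bins, which changes where the boundary ages fall.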

### 5.5. Survival Rate by `Fare` Bins

In [None]:
if not df.empty:
    # Create fare bins for better visualization
    df['Fare_Bin'] = pd.qcut(df['Fare'], q=4, labels=['Very Low', 'Low', 'Medium', 'High'])
    plt.figure(figsize=(8, 5))
    sns.barplot(x='Fare_Bin', y='Survived', data=df, palette='crest', order=['Very Low', 'Low', 'Medium', 'High'])
    plt.title('Survival Rate by Fare Group')
    plt.ylabel('Survival Rate')
    plt.xlabel('Fare Group')
    plt.show()
    df.drop('Fare_Bin', axis=1, inplace=True) # Drop the temporary bin column


**Observation:** Passengers who paid higher fares tended to have a higher survival rate, which aligns with `Pclass` observations.
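
The link between fare groups and `Pclass` can be checked with a cross-tabulation. This is a sketch under assumptions: `fare_group_vs_class` is a hypothetical helper, the quartile labels mirror the cell above, and the rows are made up.

```python
import pandas as pd

FARE_LABELS = ['Very Low', 'Low', 'Medium', 'High']


def fare_group_vs_class(frame: pd.DataFrame) -> pd.DataFrame:
    """Passenger counts per fare quartile (rows) and passenger class (columns)."""
    # qcut splits on quantiles, so each group holds roughly the same number of rows
    fare_bin = pd.qcut(frame['Fare'], q=4, labels=FARE_LABELS).rename('Fare_Bin')
    return pd.crosstab(fare_bin, frame['Pclass'])


# Made-up rows for illustration only
toy = pd.DataFrame({
    'Fare': [7.0, 8.0, 9.0, 13.0, 15.0, 26.0, 50.0, 80.0],
    'Pclass': [3, 3, 3, 2, 3, 2, 1, 1],
})
table = fare_group_vs_class(toy)
print(table)
```

If most 1st class passengers sit in the 'High' row on the real data, `Fare` and `Pclass` carry overlapping information, which is worth keeping in mind during feature selection.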

### 5.6. Correlation Matrix (for numerical features)

In [None]:
if not df.empty:
    plt.figure(figsize=(10, 8))
    # Select only numerical columns for the correlation matrix
    numerical_cols = df.select_dtypes(include=np.number).columns.tolist()
    # Drop 'PassengerId' because it is only an identifier. The temporary 'Age_Bin'
    # and 'Fare_Bin' columns were already dropped above and are categorical anyway.
    if 'PassengerId' in numerical_cols:
        numerical_cols.remove('PassengerId')

    correlation_matrix = df[numerical_cols].corr()
    sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
    plt.title('Correlation Matrix of Numerical Features')
    plt.show()

**Observations on Correlation:**
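- `Survived` is negatively correlated with `Pclass`: 1st class is coded as 1 and 3rd class as 3, so a lower `Pclass` value (a higher class) goes with better survival. `Survived` is positively correlated with `Fare` (higher fare, better survival). Both results align with the previous observations.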
- `SibSp` and `Parch` are positively correlated with each other, which is expected as they both relate to family members.
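
The heatmap shows every pair of features; for modeling, the column that matters most is the one for `Survived`. A minimal sketch that extracts and ranks it (the helper `corr_with_target` is hypothetical and the rows are made up):

```python
import pandas as pd


def corr_with_target(frame: pd.DataFrame, target: str = 'Survived') -> pd.Series:
    """Pearson correlation of each numeric column with the target, strongest first."""
    numeric = frame.select_dtypes(include='number')
    corr = numeric.corr()[target].drop(target)
    # Rank by absolute value so strong negative correlations are not buried
    order = corr.abs().sort_values(ascending=False).index
    return corr.loc[order]


# Made-up rows for illustration only
toy = pd.DataFrame({
    'Survived': [1, 1, 0, 0, 1, 0],
    'Pclass': [1, 1, 3, 3, 2, 3],
    'Fare': [80.0, 60.0, 8.0, 7.0, 20.0, 9.0],
    'Sex': ['female', 'female', 'male', 'male', 'female', 'male'],
})
ranked = corr_with_target(toy)
print(ranked)
```

Pearson correlation only captures linear association, and `Sex` is excluded here simply because it is not numeric, so this ranking complements the bar plots above rather than replacing them.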

## 6. Insights and Observations from EDA

Based on our exploratory data analysis, we've gathered several key insights that will inform our next steps in feature engineering and model building:

1.  **Missing Values:**
    -   `Cabin` has too many missing values and might be best dropped or converted into a simplified binary feature (e.g., 'Has_Cabin').
    -   `Age` has a significant number of missing values, which will require imputation (e.g., median or mean).
    -   `Embarked` has very few missing values, easily imputable with the mode.

2.  **Feature Importance (Preliminary):**
    -   `Sex` is a strong predictor of survival (females survived more).
    -   `Pclass` is a strong predictor (higher class survived more).
    -   `Age` and `Fare` also show relationships with survival: younger passengers and passengers who paid higher fares tended to survive more often.

3.  **Feature Engineering Opportunities:**
    -   Combining `SibSp` and `Parch` into a `FamilySize` feature, and then `IsAlone` from `FamilySize`, could be beneficial.
    -   Extracting `Title` from `Name` might reveal interesting patterns related to social status or respect, which could influence survival.
    -   Binning `Age` and `Fare` might help capture non-linear relationships and handle outliers, as seen in our visualizations.

4.  **Categorical Feature Handling:**
    -   `Sex`, `Embarked`, `Pclass` (which can be treated as categorical), and any newly engineered categorical features (like `Title`, `Age_Bin`, `Fare_Bin`) will need to be encoded (e.g., One-Hot Encoding) before feeding them into machine learning models.

5.  **Numerical Feature Scaling:**
    -   Numerical features like `Age` and `Fare` have different scales and distributions, suggesting the need for scaling (e.g., Standard Scaling) before model training, especially for models sensitive to feature scales (like Logistic Regression or SVMs).
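
The feature engineering ideas in item 3 could be sketched as follows. The helper name `add_family_and_title` is hypothetical, the names are made up, and the regular expression assumes `Name` follows the pattern "Surname, Title. Given names".

```python
import pandas as pd


def add_family_and_title(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with FamilySize, IsAlone, and Title columns added."""
    out = frame.copy()
    # +1 counts the passenger themselves
    out['FamilySize'] = out['SibSp'] + out['Parch'] + 1
    out['IsAlone'] = (out['FamilySize'] == 1).astype(int)
    # Capture the text between the first ", " and the following "."
    out['Title'] = out['Name'].str.extract(r',\s*([^.]+)\.', expand=False).str.strip()
    return out


# Made-up passengers for illustration only
toy = pd.DataFrame({
    'Name': ['Doe, Mr. John', 'Roe, Miss. Jane', 'Poe, Mrs. Ann (Mary Smith)'],
    'SibSp': [1, 0, 1],
    'Parch': [0, 0, 2],
})
engineered = add_family_and_title(toy)
print(engineered[['FamilySize', 'IsAlone', 'Title']])
```

Names that do not match the assumed pattern yield a missing `Title`, so it is worth checking `engineered['Title'].isna().sum()` on the real data before relying on the column.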

These insights will directly guide the next steps in our machine learning pipeline, focusing on data cleaning, feature engineering, and preparing the data for model training.
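
As a preview of the encoding and scaling steps from items 4 and 5, here is a pandas-only sketch. The helper `encode_and_scale` is hypothetical and the rows are made up; a real pipeline would more likely use dedicated preprocessing tools, which this sketch does not attempt to reproduce.

```python
import pandas as pd


def encode_and_scale(frame: pd.DataFrame, categorical: list, numerical: list) -> pd.DataFrame:
    """One-hot encode the categorical columns and standardize the numerical ones."""
    # get_dummies replaces each listed column with one 0/1 column per category
    out = pd.get_dummies(frame, columns=categorical, dtype=int)
    for col in numerical:
        mean = out[col].mean()
        std = out[col].std(ddof=0)  # population standard deviation
        out[col] = (out[col] - mean) / std
    return out


# Made-up rows for illustration only
toy = pd.DataFrame({
    'Sex': ['male', 'female', 'female', 'male'],
    'Pclass': [3, 1, 2, 3],
    'Fare': [7.0, 80.0, 13.0, 8.0],
})
encoded = encode_and_scale(toy, categorical=['Sex', 'Pclass'], numerical=['Fare'])
print(encoded)
```

In an actual train/test workflow the mean and standard deviation should be computed on the training split only and then reused on the test split, so that no information leaks from the test data.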