# Titanic Dataset - Exploratory Data Analysis (EDA)
#### Author: Wojciech Domino
#### Date: 2025
---

This notebook performs an Exploratory Data Analysis (EDA) on the Titanic dataset.  
We will investigate the key factors that influenced the survival of passengers, perform data cleaning, and create visualizations to better understand the dataset.

**Objectives:**
- Understand the structure of the dataset
- Handle missing values
- Explore relationships between features
- Generate meaningful insights

---

## Step 1: Importing Libraries and Loading the Dataset

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Set plot style
sns.set_style('whitegrid')

# Load the dataset
df = pd.read_csv('../data/train.csv')

# Basic Information
print("Dataset Info:")
df.info()  # info() prints its summary directly and returns None, so no print() wrapper

# Dataset Description
print("\nDataset Description:")
print(df.describe())

# Checking Missing Values
print("\nMissing Values:")
print(df.isnull().sum())

FileNotFoundError: [Errno 2] No such file or directory: '../data/train.csv'
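The `FileNotFoundError` above indicates that the relative path did not resolve from the directory the notebook was run in. One way to make the load step less fragile is to try several locations and report exactly what was tried. The sketch below is a minimal illustration: the helper name `find_dataset` and every candidate path other than `'../data/train.csv'` are assumptions, not part of the original notebook.

```python
from pathlib import Path

# '../data/train.csv' comes from the notebook; the other entries are
# hypothetical fallbacks added only for illustration.
CANDIDATE_PATHS = ['../data/train.csv', 'data/train.csv', 'train.csv']


def find_dataset(candidates):
    """Return the first candidate that is an existing file, else raise a descriptive error."""
    for candidate in candidates:
        path = Path(candidate)
        if path.is_file():
            return path
    # Resolve to absolute paths so the message shows where the lookup really happened
    tried = ", ".join(str(Path(c).resolve()) for c in candidates)
    raise FileNotFoundError(f"No dataset file found; tried: {tried}")


# Intended use in the load cell (left commented out because it needs the data file):
# df = pd.read_csv(find_dataset(CANDIDATE_PATHS))
```

Because the error message lists absolute paths, it shows at a glance which working directory the notebook kernel was started in.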

## Step 2: Data Cleaning

In [2]:
# Fill missing Age values with median
# Assign the result back: inplace fillna on a selected column is deprecated in pandas 2.x
df['Age'] = df['Age'].fillna(df['Age'].median())

# Drop Cabin column due to excessive missing values
df.drop('Cabin', axis=1, inplace=True)

# Fill missing Embarked values with the most frequent one
# mode() returns a Series; [0] takes the most frequent value
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])

# Verify no missing values left
print("\nMissing Values After Cleaning:")
print(df.isnull().sum())

NameError: name 'df' is not defined
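Since this cell could not run without `df`, the cleaning steps can still be checked in isolation on a small hand-made frame. The following is a sketch under stated assumptions: the function name `clean_titanic` is hypothetical, and the toy rows are invented for illustration and are not taken from `train.csv`.

```python
import numpy as np
import pandas as pd


def clean_titanic(frame):
    """Return a cleaned copy: median-impute Age, mode-impute Embarked, drop Cabin."""
    out = frame.copy()  # leave the caller's frame untouched
    out['Age'] = out['Age'].fillna(out['Age'].median())
    out['Embarked'] = out['Embarked'].fillna(out['Embarked'].mode()[0])
    # errors='ignore' makes the function safe to call again after Cabin is gone
    return out.drop(columns='Cabin', errors='ignore')


# Hypothetical toy rows, for illustration only
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 30.0],
    'Cabin': [np.nan, 'C85', np.nan],
    'Embarked': ['S', np.nan, 'S'],
})

cleaned = clean_titanic(toy)
print(cleaned.isnull().sum())
```

Working on a copy keeps the raw frame available for comparison, which is convenient when checking how many values each imputation step filled.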

## Step 3: Data Visualization

In [3]:
# Survival Count
sns.countplot(data=df, x='Survived')
plt.title('Survival Count')
plt.xlabel('Survived (0 = No, 1 = Yes)')
plt.ylabel('Count')
plt.show()

# Survival Rate by Gender
sns.barplot(data=df, x='Sex', y='Survived')
plt.title('Survival Rate by Gender')
plt.ylabel('Survival Rate')
plt.show()

# Survival Rate by Passenger Class
sns.barplot(data=df, x='Pclass', y='Survived')
plt.title('Survival Rate by Passenger Class')
plt.ylabel('Survival Rate')
plt.show()

# Age Distribution
sns.histplot(data=df, x='Age', bins=30, kde=True)
plt.title('Age Distribution of Passengers')
plt.xlabel('Age')
plt.show()

# Fare Distribution
sns.histplot(data=df, x='Fare', bins=30, kde=True)
plt.title('Fare Distribution')
plt.xlabel('Fare')
plt.show()

# Correlation Heatmap
plt.figure(figsize=(10, 8))
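# Restrict to numeric columns: corr() raises on string columns (Name, Sex, ...) in pandas 2.0+
sns.heatmap(df.select_dtypes(include='number').corr(), annot=True, cmap='coolwarm')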
plt.title('Correlation Heatmap')
plt.show()

NameError: name 'df' is not defined
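The heatmap is drawn from a correlation matrix, which is only defined for numeric columns. As a minimal sketch of that step on its own, the snippet below builds a tiny frame that mixes numeric and string columns and computes the matrix after filtering. The rows are invented for illustration and are not taken from `train.csv`.

```python
import pandas as pd

# Hypothetical toy rows, for illustration only
toy = pd.DataFrame({
    'Survived': [0, 1, 1, 0],
    'Fare': [7.25, 71.28, 53.10, 8.05],
    'Sex': ['male', 'female', 'female', 'male'],  # non-numeric, excluded below
})

# Keep only numeric dtypes, then compute pairwise Pearson correlations
numeric = toy.select_dtypes(include='number')
corr = numeric.corr()
print(corr)
```

To include a categorical feature such as `Sex` in the matrix, it would first have to be encoded numerically; that is a modelling choice this notebook does not make.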

In [4]:
# --- Step 4: Key Insights ---
print("""
Key Insights:
- Female passengers had significantly higher survival rates than male passengers.
- Passengers from higher classes (Pclass 1) were more likely to survive.
- Younger passengers had slightly better chances of survival.
- Higher ticket fares correlated positively with survival probability.
""")
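The insights listed above are stated as text only. A group-wise mean of the 0/1 `Survived` column is one simple way to back such statements with numbers. The sketch below assumes a helper named `survival_rate_by` (hypothetical) and uses invented toy rows, so its figures say nothing about the real dataset.

```python
import pandas as pd

# Hypothetical toy rows, for illustration only; not taken from train.csv
toy = pd.DataFrame({
    'Sex': ['female', 'female', 'male', 'male', 'male', 'female'],
    'Pclass': [1, 3, 3, 1, 3, 2],
    'Survived': [1, 1, 0, 1, 0, 0],
})


def survival_rate_by(frame, column):
    """Mean of the 0/1 Survived flag per group, i.e. the survival rate."""
    return frame.groupby(column)['Survived'].mean()


rate_by_sex = survival_rate_by(toy, 'Sex')
rate_by_class = survival_rate_by(toy, 'Pclass')
print(rate_by_sex)
print(rate_by_class)
```

Applied to the cleaned `df`, the same call would give the per-group rates that the bar plots in Step 3 display.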



