**Exploratory Data Analysis (EDA) on the Titanic dataset.**

### 🎯 Problem Statement

The objective of this analysis is to understand the factors that influenced the survival of passengers aboard the Titanic.
Using the Titanic dataset, we explore demographic and travel-related variables such as age, gender, passenger class, fare, and family size to identify patterns in survival outcomes.

The goal is to:
* perform exploratory data analysis  
* visualize key survival patterns  
* build a simple prediction model to classify passengers as survived or not  

This notebook focuses on **EDA first and prediction second**.


In this notebook, I:

* performed data cleaning
* handled missing values
* created age groups
* visualized feature distributions
* analyzed survival by gender, class, age group, and embarkation port
* generated key insights from the data

# **Step 1: Import libraries**

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use("default")

*  plt.style.use("default") tells Matplotlib to use its default plot style.
*  Matplotlib ships with many built-in visual styles (similar to themes), for example:

1. "seaborn-v0_8" (called "seaborn" before Matplotlib 3.6)
2. "ggplot"
3. "dark_background"
4. "bmh"
5. "default" ← the normal Matplotlib look

**You can try other themes**
* plt.style.use("seaborn-v0_8")
* plt.style.use("ggplot")
* plt.style.use("dark_background")
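Style names vary between Matplotlib versions, so it can help to list what is installed before picking one. The sketch below is only an illustration: it prints the available styles and applies "ggplot" temporarily with plt.style.context, so the global style stays unchanged. The plotted numbers are made up.

```python
import matplotlib.pyplot as plt

# List every style name this Matplotlib installation knows about.
available_styles = sorted(plt.style.available)
print(available_styles)

# Apply a style only inside the `with` block; the global style is untouched.
with plt.style.context("ggplot"):
    fig, ax = plt.subplots(figsize=(4, 3))
    ax.plot([1, 2, 3], [2, 4, 8])  # made-up numbers, just to draw something
    ax.set_title("Temporary ggplot style")

plt.close(fig)  # free the figure once it is no longer needed
```

Using the context manager is a convenient way to compare themes on a single chart without changing every plot that follows.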


# **Step 2: Load dataset**

**Titanic - Machine Learning from Disaster**

In [None]:
df = pd.read_csv("/kaggle/input/titanic/train.csv")
df.head()


# **Step 3: Understanding the dataset**

**Shape (rows & columns)**

Understanding the number of rows and columns helps estimate data volume and plan EDA accordingly.

In [None]:
df.shape

**Column names**

Listing column names helps us quickly understand the structure of the dataset and identify important variables.

In [None]:
df.columns

**Data types, missing values, and non-null counts**

df.info() reports each column's data type and non-null count, which reveals where values are missing. This guides cleaning decisions such as type conversion and imputation.

In [None]:
df.info()

**Quick statistics for numeric columns**

Descriptive statistics give an overall summary of numerical features and help us identify skewness and outliers.

In [None]:
df.describe()

# **Step 4: Missing values & Duplicates**

**Check missing values in each column**

Checking for missing values is necessary to avoid errors in analysis and modeling. Columns with a high share of missing values may need imputation or removal.

In [None]:
df.isnull().sum()
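Raw counts are easier to judge as a share of all rows. The sketch below shows one way to get the percentage of missing values per column. It uses a small made-up DataFrame (toy) instead of the real Titanic file, so the numbers are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic data (values are made up for illustration).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, np.nan],
    "Cabin": [np.nan, np.nan, np.nan, "C85"],
    "Fare": [7.25, 71.28, 8.05, 53.10],
})

# isnull() gives True/False per cell; the mean of booleans is the missing fraction.
missing_pct = (toy.isnull().mean() * 100).sort_values(ascending=False)
print(missing_pct)
```

On the real data, the same expression applied to df ranks the columns by how much cleaning they need.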

**Check duplicates**

Duplicate rows may distort insights, so we check and handle duplicates to maintain data quality.

In [None]:
df.duplicated().sum()
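If the count above were greater than zero, the duplicates could be removed with drop_duplicates(). Here is a minimal sketch on a made-up DataFrame (toy). The column values are assumptions for illustration, not rows from the dataset.

```python
import pandas as pd

# Toy data with one repeated row (made up for illustration).
toy = pd.DataFrame({
    "Name": ["Ann", "Bob", "Ann", "Cy"],
    "Pclass": [1, 3, 1, 2],
})

# duplicated() marks rows that are identical to an earlier row.
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each row by default.
deduped = toy.drop_duplicates().reset_index(drop=True)
print(n_duplicates, len(deduped))
```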

# **Step 5: Understand the target variable**

**Our prediction column is Survived**

*   0 = did not survive
*   1 = survived



**Check value counts**

We analyze the target variable distribution to understand class imbalance issues and overall survival rate.

In [None]:
df['Survived'].value_counts()

**Percentage form**

Converting target counts into percentages makes the survival distribution easier to interpret and communicate.

In [None]:
df['Survived'].value_counts(normalize=True) * 100

# **Summary of above steps**

*   Understand dataset size and structure
*   Identify datatypes
*   Detect missing values
*   Detect duplicates
*   Study distribution of target variable
*   Plan cleaning & visualization steps
*   All these steps together ensure:

    “Data is clean, understood, and ready before deeper analysis.”

# **Step 6: Handle missing values**

**From the steps above, we know that:**

*   Age → missing
*   Cabin → many missing
*   Embarked → few missing

**Drop Cabin column**

In [None]:
df = df.drop(columns=['Cabin'])

**Why did we drop Cabin?**

* more than 75% of its values are missing

* filling in so many values would add noise

* Cabin is not essential for basic EDA

**What the function does:**

* drop() removes the selected rows/columns

* columns=['Cabin'] tells it to drop a column

**Handle missing Age values**

**We will fill Age using median, not mean.**

**Why median?**

* Age has outliers

* the median is robust to extreme values

* the mean would be pulled toward very old or very young passengers

**What the function does:**

* fillna() fills missing values

* df['Age'].median() calculates median age

In [None]:
df['Age'] = df['Age'].fillna(df['Age'].median())
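To see why the median is the safer fill value, the sketch below compares the mean and the median on a tiny made-up series (ages) with one extreme value. The numbers are assumptions chosen to make the effect obvious.

```python
import numpy as np
import pandas as pd

# Toy ages with one extreme value and one missing value (made up).
ages = pd.Series([20.0, 22.0, 24.0, 26.0, 80.0, np.nan])

mean_age = ages.mean()      # pulled upward by the single 80
median_age = ages.median()  # stays at the middle of the typical values

# Fill the gap with the median, as done for df['Age'] above.
filled = ages.fillna(median_age)
print(mean_age, median_age)
```

Both mean() and median() skip missing values by default, so neither result is affected by the NaN itself.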

**Handle missing Embarked values**

**We fill with most frequent category (mode)**

**Why mode?**

* Embarked is categorical

* mean/median don’t make sense for categories

* the most common port is a reasonable assumption

**What the function does:**

* mode() returns the most frequent value(s)

* [0] selects the first mode value

In [None]:
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])
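The [0] is needed because mode() returns a Series, not a single value, and that Series can hold several entries when there is a tie. A small sketch on made-up port codes (ports) shows this. The tie is constructed on purpose.

```python
import numpy as np
import pandas as pd

# Toy port codes where "S" and "C" tie for most frequent (made up).
ports = pd.Series(["S", "C", "S", np.nan, "Q", "C"])

modes = ports.mode()    # Series of all modes, sorted; NaN is ignored by default
most_common = modes[0]  # first mode in sorted order

filled_ports = ports.fillna(most_common)
print(list(modes), most_common)
```

When there is a tie, [0] picks whichever mode sorts first, so it is worth checking how many modes the column has.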

**Check again whether any missing values remain**

In [None]:
df.isnull().sum()

# **Step 7: Univariate Analysis**

**Distribution of Age**

In [None]:
plt.figure(figsize=(6,4))
sns.histplot(df['Age'], kde=True)
plt.title("Age Distribution of Passengers")
plt.xlabel("Age")
plt.ylabel("Count")
plt.show()

**What this does**

* histplot → histogram of Age

* kde=True → smooth curve showing distribution

* figure() → controls graph size

* title/xlabel/ylabel → add labels

**Why it is important**

* shows age spread

* detects skewness

* helps decide age groups later

**Distribution of Fare**

In [None]:
plt.figure(figsize=(6,4))
sns.histplot(df['Fare'], kde=True)
plt.title("Fare Distribution")
plt.xlabel("Fare")
plt.ylabel("Count")
plt.show()

**What the plot above reflects**
* whether fare has extreme values

* whether fares are skewed

* the gap between rich and poor passengers
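Skewness can also be measured instead of judged by eye. The sketch below computes skew() on a made-up fare series (fares) and on its log1p transform, a common way to compress a long right tail. The log transform is an extra technique shown for illustration and is not part of this notebook's pipeline.

```python
import numpy as np
import pandas as pd

# Toy fares with one very expensive ticket (made up for illustration).
fares = pd.Series([7.25, 8.05, 7.90, 13.00, 26.00, 512.33])

raw_skew = fares.skew()            # positive value = long right tail
log_skew = np.log1p(fares).skew()  # log1p(x) = log(1 + x), safe for zero fares

print(raw_skew, log_skew)
```

A positive skew on the real Fare column would confirm what the histogram suggests.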

**Count of Survival (0 vs 1)**

In [None]:
plt.figure(figsize=(5,4))
sns.countplot(x='Survived', data=df)
plt.title("Count of Survival")
plt.xlabel("Survived (0 = No, 1 = Yes)")
plt.ylabel("Number of Passengers")
plt.show()

**What the plot above reflects**

* shows class imbalance

* tells how many survived vs died

# **Step 8: Survival Analysis**

**Relationship with other features**

**Survival by Gender**

In [None]:
plt.figure(figsize=(6,4))
sns.countplot(x='Sex', hue='Survived', data=df)
plt.title("Survival Count by Gender")
plt.xlabel("Gender")
plt.ylabel("Count")
plt.show()

**What this plot shows**

* gender on x-axis

* different colors = survived vs not

* height of bars = number of passengers

**Why it is important**

* Titanic survival is strongly related to gender

* helps see the “women and children first” effect
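Bar heights show counts, but the survival rate per group is often easier to compare. Because Survived is coded 0/1, the mean within each group equals that group's survival rate. The sketch below uses a made-up DataFrame (toy), so the rates are not the real Titanic figures.

```python
import pandas as pd

# Toy passengers (made up for illustration).
toy = pd.DataFrame({
    "Sex": ["female", "female", "female", "male", "male", "male", "male"],
    "Survived": [1, 1, 0, 0, 0, 1, 0],
})

# Mean of a 0/1 column within each group = share of survivors in that group.
rate_by_sex = toy.groupby("Sex")["Survived"].mean()
print(rate_by_sex)
```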

**Survival by Passenger Class (Pclass)**

In [None]:
plt.figure(figsize=(6,4))
sns.countplot(x='Pclass', hue='Survived', data=df)
plt.title("Survival Count by Passenger Class")
plt.xlabel("Passenger Class")
plt.ylabel("Count")
plt.show()

**What it shows**

* 1st / 2nd / 3rd class vs survival

**Why it is important**

* higher-class passengers had better cabins and lifeboat access

* highlights how economic inequality affected survival
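The classes differ in size, so raw counts can be misleading. One option is to report group size and survival rate side by side using named aggregation. The DataFrame below (toy) is made up for illustration, and the column names passengers and survival_rate are my own labels.

```python
import pandas as pd

# Toy passengers with unequal class sizes (made up for illustration).
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 0, 0, 1, 0],
})

# Named aggregation: one row per class with its size and its survival rate.
summary = toy.groupby("Pclass").agg(
    passengers=("Survived", "size"),
    survival_rate=("Survived", "mean"),
)
print(summary)
```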

**Survival by Age Group**

**Step i: create age groups**

In [None]:
df['Age_group'] = pd.cut(df['Age'],
                         bins=[0, 12, 18, 30, 50, 80],
                         labels=['Child', 'Teen', 'Young Adult', 'Adult', 'Senior'])

**Step ii: plot survival by age group**

In [None]:
plt.figure(figsize=(7,4))
sns.countplot(x='Age_group', hue='Survived', data=df)
plt.title("Survival Count by Age Group")
plt.xlabel("Age Group")
plt.ylabel("Count")
plt.show()

**What pd.cut() does**

* divides continuous Age column into ranges

* assigns labels (Child, Teen, etc.)

**Why this step is important**

* raw age is hard to interpret

* age groups show survival pattern clearly

* helps check the “women and children first” rule
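One detail of pd.cut() is easy to miss: by default each bin includes its right edge but not its left edge, so the first bin here is (0, 12]. The sketch below uses made-up ages to show where the edge values land. Passing include_lowest=True is one way to keep an age of exactly 0 from becoming NaN.

```python
import pandas as pd

bins = [0, 12, 18, 30, 50, 80]
labels = ["Child", "Teen", "Young Adult", "Adult", "Senior"]

# Toy ages sitting on or near the bin edges (made up for illustration).
ages = pd.Series([0.0, 0.42, 12.0, 12.5, 18.0, 30.0, 80.0])

groups = pd.cut(ages, bins=bins, labels=labels)  # bins are right-closed
groups_incl = pd.cut(ages, bins=bins, labels=labels, include_lowest=True)

print(pd.DataFrame({"age": ages, "default": groups, "include_lowest": groups_incl}))
```

With the default settings, an age of exactly 12 is a Child and 12.5 is a Teen.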

**Survival by Embarkation Port**

In [None]:
plt.figure(figsize=(6,4))
sns.countplot(x='Embarked', hue='Survived', data=df)
plt.title("Survival Count by Embarked Port")
plt.xlabel("Port (C = Cherbourg, Q = Queenstown, S = Southampton)")
plt.ylabel("Count")
plt.show()

**Why it is important**

* passengers from different ports had different:

    * socio-economic backgrounds

    * travel purposes

* these differences affect survival indirectly
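One way to probe the indirect effect is to look at the class mix at each port: if one port has mostly 1st class passengers, its higher survival may reflect class rather than the port itself. The sketch below is only an illustration on a made-up DataFrame (toy) and does not reproduce the real proportions.

```python
import pandas as pd

# Toy passengers (made up for illustration).
toy = pd.DataFrame({
    "Embarked": ["C", "C", "C", "S", "S", "S", "S", "Q"],
    "Pclass": [1, 1, 2, 3, 3, 2, 3, 3],
})

# normalize="index" turns each row (port) into proportions that sum to 1.
class_mix = pd.crosstab(toy["Embarked"], toy["Pclass"], normalize="index")
print(class_mix)
```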

# **Step 9: Correlation Heatmap (numeric relationships)**

In [None]:
numeric_df = df.select_dtypes(include='number')
plt.figure(figsize=(8,5))
sns.heatmap(numeric_df.corr(), annot=True, cmap='coolwarm')
plt.title("Correlation Heatmap")
plt.show()

* The heatmap shows the pairwise Pearson correlation between numerical features

* Values range from −1 to +1

* +1 = perfect positive linear relationship

* −1 = perfect negative linear relationship

* 0 = no linear relationship

**Useful to see:**

* Fare vs Pclass

* Survived vs other features
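The heatmap can be hard to read for a single column. The sketch below pulls out the correlations with Survived and sorts them. It uses a made-up numeric DataFrame (toy), so the values are illustrative only.

```python
import pandas as pd

# Toy numeric data (made up for illustration).
toy = pd.DataFrame({
    "Survived": [1, 1, 1, 0, 0, 0],
    "Pclass": [1, 1, 2, 3, 3, 2],
    "Fare": [80.0, 60.0, 30.0, 8.0, 7.5, 13.0],
})

corr = toy.corr()  # Pearson correlation by default

# Correlation of every other feature with the target, most negative first.
target_corr = corr["Survived"].drop("Survived").sort_values()
print(target_corr)
print(corr.loc["Fare", "Pclass"])
```

On the real data, the same two lines applied to numeric_df give a ranked view of the heatmap's Survived row.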

# **Step 10: Final insights**

**✔ Key Insights from Titanic EDA**

* The majority of passengers did not survive

* Females had a significantly higher survival rate than males

* 1st class passengers had a higher survival rate than 2nd and 3rd class passengers

* Children had better survival chances than adults

* Higher fares are associated with higher survival (wealth effect)

* Passengers who embarked at Cherbourg had better survival chances