# Capstone Project 1: Titanic Dataset - Complete EDA

This project guides you through a complete exploratory data analysis of the Titanic dataset.

**Objectives:**
- Load and understand the data
- Handle missing values
- Explore relationships between features
- Create visualizations to tell a story
- Draw insights about survival factors

**Problems:** 15, plus a setup step (Problem 0), in order of increasing difficulty

In [None]:
# ============================================
# SETUP - Run this cell first!
# ============================================
import sys
sys.path.insert(0, '..')
from utils.checks import capstone_titanic_eda as verify

# Dataset path (provided for convenience)
TITANIC_PATH = '../datasets/titanic/train.csv'

print("Checker loaded!")
print(f"Dataset path: {TITANIC_PATH}")
print("\nNow import the libraries you need and load the dataset.")

---
## Problem 0: Import Libraries and Setup
**Difficulty:** Easy

### Concept
Before starting EDA, import all necessary libraries for data manipulation and visualization.

### Syntax
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
np.random.seed(42)
sns.set_style('whitegrid')
```

### Task
Import the required libraries and set up the plotting environment.

### Expected Properties
- All libraries should be importable
- Plotting should work inline

In [None]:
# Your solution:


In [None]:
# Verification
verify.p0(globals())

---
## Problem 1: Load the Titanic Dataset
**Difficulty:** Easy

### Concept
Loading data is the first step in any analysis. Pandas provides `read_csv()` to load CSV files into DataFrames. The Titanic dataset contains passenger information including demographics and survival status.

### Syntax
```python
df = pd.read_csv('filepath.csv')  # Load CSV into DataFrame
df.head()                          # Display first 5 rows
df.info()                          # Display data types and non-null counts
df.shape                           # Returns (rows, columns)
```

### Example
```python
>>> data = pd.read_csv('sales.csv')
>>> data.head()
   id  price  quantity
0   1   9.99         3
1   2  14.99         1
>>> data.shape
(100, 3)
```
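
To try these calls without a file on disk, `read_csv()` can also read from an in-memory text buffer. The sketch below is illustrative only: the CSV text is made up, and it mirrors the `sales.csv` columns from the example rather than the Titanic data.

```python
import io

import pandas as pd

# Hypothetical CSV content held in memory so the example needs no file on disk
csv_text = """id,price,quantity
1,9.99,3
2,14.99,1
3,,2
"""

data = pd.read_csv(io.StringIO(csv_text))  # read_csv also accepts file-like objects

print(data.shape)           # (rows, columns)
print(data.dtypes)          # column types inferred by pandas
print(data.isnull().sum())  # the empty price field is read as NaN
```

Checking `shape`, `dtypes`, and the null counts right after loading is a quick way to confirm the file was parsed as intended.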

### Task
Load the Titanic dataset from `../datasets/titanic/train.csv` into a DataFrame called `df`.

### Expected Properties
- `df` should be a pandas DataFrame
- DataFrame should have more than 800 rows
- Should contain columns like 'Survived', 'Pclass', 'Sex', 'Age'

In [None]:
# Your solution:
df = None

In [None]:
# Verification
verify.p1(df)

---
## Problem 2: Analyze Missing Data
**Difficulty:** Easy

### Concept
Missing data is common in real-world datasets. Before analysis, you need to identify which columns have missing values and how many. This helps decide on an appropriate strategy for handling them.

### Syntax
```python
df.isnull().sum()                    # Count nulls per column
df.isnull().sum() / len(df) * 100    # Percentage of nulls

# Create summary DataFrame
missing_df = pd.DataFrame({
    'missing_count': df.isnull().sum(),
    'missing_pct': df.isnull().sum() / len(df) * 100
})
```

### Example
```python
>>> data = pd.DataFrame({'A': [1, 2, None], 'B': [4, None, 6]})
>>> data.isnull().sum()
A    1
B    1
```
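
The summary pattern from the Syntax section can be tried on a small made-up frame first. This is a sketch under stated assumptions: the `toy` frame and its columns are invented, and sorting by `missing_pct` is an optional extra, not something the checker is known to require.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the Titanic dataset
toy = pd.DataFrame({
    'age':  [22.0, np.nan, 35.0, np.nan],
    'fare': [7.25, 71.28, np.nan, 8.05],
    'name': ['A', 'B', 'C', 'D'],
})

toy_missing = pd.DataFrame({
    'missing_count': toy.isnull().sum(),
    'missing_pct': toy.isnull().mean() * 100,  # mean of True/False = fraction missing
})

# Optional: show the most incomplete columns first
print(toy_missing.sort_values('missing_pct', ascending=False))
```

Because both columns are built from the same `isnull()` result, the index of the summary is the list of column names of the original frame.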

### Task
Create a DataFrame called `missing_df` with two columns:
- `'missing_count'`: number of missing values per column
- `'missing_pct'`: percentage of missing values per column

### Expected Properties
- `missing_df` should be a DataFrame
- Should have columns 'missing_count' and 'missing_pct'
- Index should match df columns
- 'Age' column should have missing values

In [None]:
# Your solution:
missing_df = None

In [None]:
# Verification
verify.p2(missing_df, df)

---
## Problem 3: Calculate Overall Survival Rate
**Difficulty:** Easy

### Concept
The overall survival rate tells us what percentage of the passengers in this dataset survived the disaster. It is the mean of the binary 'Survived' column (where 1=survived, 0=died), multiplied by 100.

### Syntax
```python
df['binary_column'].mean()       # Mean of 0/1 values = proportion
df['binary_column'].mean() * 100 # Convert to percentage
```

### Example
```python
>>> outcomes = pd.Series([1, 0, 1, 1, 0])  # 3 successes out of 5
>>> success_rate = outcomes.mean() * 100
>>> success_rate
60.0
```

### Task
Calculate the overall survival rate as a percentage and store it in `survival_rate`.

### Expected Properties
- `survival_rate` should be a number (int or float)
- Should be between 0 and 100
- Should fall between 35 and 42 (percent)

In [None]:
# Your solution:
survival_rate = None

In [None]:
# Verification
verify.p3(survival_rate)

---
## Problem 4: Survival Rate by Gender
**Difficulty:** Easy

### Concept
GroupBy operations let you calculate statistics for different groups in your data. Here, we want to know whether gender affected survival chances, as the "women and children first" evacuation protocol would suggest.

### Syntax
```python
df.groupby('column')['target'].mean()  # Mean of target for each group
df.groupby('column')['target'].mean() * 100  # As percentage
```

### Example
```python
>>> data = pd.DataFrame({'Gender': ['M', 'F', 'M', 'F'], 
...                      'Passed': [1, 1, 0, 1]})
>>> data.groupby('Gender')['Passed'].mean() * 100
Gender
F    100.0
M     50.0
```

### Task
Calculate survival rates (as percentages) by gender. Store the result in `survival_by_gender`.

### Expected Properties
- Should be a pandas Series
- Index should contain 'male' and 'female'
- Female survival rate should be higher than male
- Female survival rate should be above 70%

In [None]:
# Your solution:
survival_by_gender = None

In [None]:
# Verification
verify.p4(survival_by_gender)

---
## Problem 5: Survival Rate by Passenger Class
**Difficulty:** Easy

### Concept
Passenger class (1st, 2nd, 3rd) likely affected survival chances, as higher-class passengers had better cabin locations and priority access to lifeboats.

### Syntax
```python
df.groupby('Pclass')['Survived'].mean() * 100
series.idxmax()  # Returns the index label of the maximum value
```

### Example
```python
>>> scores = pd.Series([85, 92, 78], index=['A', 'B', 'C'])
>>> scores.idxmax()
'B'
```
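
The two calls from the Syntax section can be chained: group, take the mean, then ask which group has the highest rate. A minimal sketch on invented data, where `tier` and `won` are hypothetical stand-ins for a class column and a binary outcome:

```python
import pandas as pd

# Hypothetical toy data: 'tier' plays the role of a class column
toy = pd.DataFrame({
    'tier': [1, 1, 2, 2, 3, 3, 3],
    'won':  [1, 1, 1, 0, 0, 0, 1],
})

rates = toy.groupby('tier')['won'].mean() * 100  # percentage per tier
best_tier = rates.idxmax()                       # index label with the highest rate

print(rates.round(1))
print('Highest rate in tier:', best_tier)
```

Note that `idxmax()` returns the group label (here a tier number), not the rate itself; use `rates.max()` for the value.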

### Task
Calculate survival rates (as percentages) by passenger class. Store in `survival_by_class`.

### Expected Properties
- Should be a pandas Series
- Should have 3 elements (for classes 1, 2, 3)
- First class should have the highest survival rate
- First class survival should be above 60%

In [None]:
# Your solution:
survival_by_class = None

In [None]:
# Verification
verify.p5(survival_by_class)

---
## Problem 6: Visualize Survival by Gender
**Difficulty:** Medium

### Concept
Visualizations make patterns more obvious. A grouped bar chart can show the counts of both survivors and non-survivors for each gender, making the effect of the "women and children first" protocol visible.

### Syntax
```python
# Create crosstab for grouped data
ct = pd.crosstab(df['category'], df['outcome'])
ct.plot(kind='bar')

# Or create the figure first, then draw onto its axes
fig, ax = plt.subplots()
ct.plot(kind='bar', ax=ax)
```

### Example
```python
>>> ct = pd.crosstab(df['Gender'], df['Passed'])
>>> ct.plot(kind='bar', title='Pass Rate by Gender')
```
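
When both the figure and the axes need to be kept in variables, one option is to create them with `plt.subplots()` and pass `ax=` to the pandas plot call. The sketch below uses made-up data; the `toy` frame, its values, and the labels are assumptions for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    'Gender': ['M', 'F', 'M', 'F', 'F', 'M'],
    'Passed': [0, 1, 0, 1, 0, 1],
})

ct = pd.crosstab(toy['Gender'], toy['Passed'])  # rows: Gender, columns: Passed

fig, ax = plt.subplots(figsize=(6, 4))
ct.plot(kind='bar', ax=ax)   # one bar per (Gender, Passed) combination
ax.set_ylabel('Count')
ax.set_title('Outcome counts by gender (toy data)')
ax.legend(title='Passed')
```

Each crosstab row becomes a group on the x-axis and each crosstab column becomes a bar color, which is what makes the chart "grouped".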

### Task
Create a grouped bar chart showing survival counts by gender. Use `pd.crosstab()` and create the plot. Store the figure in `fig` and axes in `ax`.

### Expected Properties
- `fig` should not be None
- Should create a matplotlib figure
- Chart should show both survived and not survived counts

In [None]:
# Your solution:
fig = None
ax = None

In [None]:
# Verification
verify.p6(fig)

---
## Problem 7: Age Distribution Analysis
**Difficulty:** Medium

### Concept
Comparing distributions helps identify patterns. By creating separate histograms for survivors and non-survivors, we can see if age was a factor in survival.

### Syntax
```python
fig, axes = plt.subplots(1, 2, figsize=(12, 5))  # 1 row, 2 columns
axes[0].hist(data1, bins=20)
axes[1].hist(data2, bins=20)
```

### Example
```python
>>> survived_ages = df[df['Survived']==1]['Age'].dropna()
>>> not_survived_ages = df[df['Survived']==0]['Age'].dropna()
>>> fig, ax = plt.subplots()
>>> ax.hist(survived_ages, bins=15, alpha=0.5, label='Survived')
>>> ax.hist(not_survived_ages, bins=15, alpha=0.5, label='Did not survive')
>>> ax.legend()
```
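
Side-by-side panels are easier to compare when both use the same bin edges and the same y-axis. The following is a sketch on invented numbers, not the Titanic ages; `group_a`, `group_b`, and the 10-unit bin width are assumptions chosen for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical toy values with some missing entries, dropped before plotting
group_a = pd.Series([4, 18, 22, 25, np.nan, 31, 40, 58]).dropna()
group_b = pd.Series([19, 21, np.nan, 28, 30, 33, 45, 47, 62, 70]).dropna()

bin_edges = np.arange(0, 81, 10)  # shared edges: 0, 10, ..., 80

fig, axes = plt.subplots(1, 2, figsize=(12, 5), sharey=True)
counts_a, _, _ = axes[0].hist(group_a, bins=bin_edges)
counts_b, _, _ = axes[1].hist(group_b, bins=bin_edges)

axes[0].set_title('Group A')
axes[1].set_title('Group B')
axes[0].set_ylabel('Count')
for panel in axes:
    panel.set_xlabel('Value')
```

With `sharey=True`, a taller bar in one panel really does mean more observations than a shorter bar in the other.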

### Task
Create two side-by-side histograms comparing age distribution of survivors vs non-survivors. Remove NaN values before plotting. Store figure in `fig` and axes array in `axes`.

### Expected Properties
- `axes` should be an array/list with 2 elements
- Each subplot should show age distributions
- Should handle missing age values

In [None]:
# Your solution:
fig = None
axes = None

In [None]:
# Verification
verify.p7(axes)

---
## Problem 8: Handle Missing Age Values
**Difficulty:** Medium

### Concept
Missing age values can be filled strategically. Using the median age within each passenger class and gender is usually a better choice than a single global median, because these groups are likely to have different age distributions.

### Syntax
```python
# Fill with group-specific values
df['column'] = df.groupby(['group1', 'group2'])['column'].transform(
    lambda x: x.fillna(x.median())
)
```

### Example
```python
>>> df['Salary'] = df.groupby('Department')['Salary'].transform(
...     lambda x: x.fillna(x.median())
... )
```
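
To see why the group-specific fill matters, it helps to run it on a tiny frame where the groups clearly differ. This is a minimal sketch with invented data; `dept` and `salary` are hypothetical columns matching the example above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the two departments have very different salaries
toy = pd.DataFrame({
    'dept':   ['eng', 'eng', 'eng', 'ops', 'ops', 'ops'],
    'salary': [100.0, np.nan, 120.0, 50.0, 60.0, np.nan],
})

toy_clean = toy.copy()  # work on a copy so the original stays untouched
toy_clean['salary'] = toy_clean.groupby('dept')['salary'].transform(
    lambda s: s.fillna(s.median())  # median is computed within each group
)

remaining = toy_clean['salary'].isnull().sum()
print(toy_clean)
print('Remaining missing values:', remaining)
```

The missing 'eng' value becomes 110 and the missing 'ops' value becomes 55, whereas a global median would have put 80 in both. One caveat: if every value in a group is missing, that group's median is NaN and its gaps stay unfilled.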

### Task
Create a copy of df called `df_clean`. Fill missing Age values with the median age by Pclass and Sex. Store the count of remaining missing ages in `missing_ages_after`.

### Expected Properties
- `df_clean` should be a DataFrame
- `missing_ages_after` should be 0
- All ages should be positive numbers

In [None]:
# Your solution:
df_clean = None
missing_ages_after = None

In [None]:
# Verification
verify.p8(df_clean, missing_ages_after)

---
## Problem 9: Create Age Groups
**Difficulty:** Medium

### Concept
Binning continuous variables into categories makes analysis easier. Age groups help us see patterns like "children had better survival rates" more clearly than raw age values.

### Syntax
```python
pd.cut(series, bins=[0, 18, 65, 100], labels=['Young', 'Adult', 'Senior'])
```

### Example
```python
>>> ages = pd.Series([5, 25, 45, 70])
>>> pd.cut(ages, bins=[0, 18, 65, 100], labels=['Child', 'Adult', 'Senior'])
0     Child
1     Adult
2     Adult
3    Senior
```
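
By default `pd.cut` builds right-closed intervals: each bin includes its upper edge and excludes its lower edge. The sketch below probes values on and around the edges, using the bin edges and labels from this problem's starter cell; the sample ages are made up.

```python
import pandas as pd

# Hypothetical ages chosen to sit on or near the bin edges
ages = pd.Series([0, 12, 12.5, 19, 60, 61])
bins = [0, 12, 19, 60, 100]
labels = ['Child', 'Teen', 'Adult', 'Senior']

groups = pd.cut(ages, bins=bins, labels=labels)
print(groups.tolist())

# include_lowest=True makes the first interval include its lower edge (0)
with_lowest = pd.cut(ages, bins=bins, labels=labels, include_lowest=True)
print(with_lowest.tolist())
```

With the defaults, 12 falls in 'Child', 12.5 and 19 in 'Teen', 60 in 'Adult', and 61 in 'Senior', while an age of exactly 0 falls outside every bin and becomes NaN unless `include_lowest=True` is passed.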

### Task
Create a new column `'age_group'` in df_clean with categories:
- 'Child' (ages up to 12)
- 'Teen' (over 12, up to 19)
- 'Adult' (over 19, up to 60)
- 'Senior' (over 60)

### Expected Properties
- df_clean should have an 'age_group' column
- Column should have exactly 4 unique categories
- Should be of categorical dtype

In [None]:
# Your solution:
bins = [0, 12, 19, 60, 100]
labels = ['Child', 'Teen', 'Adult', 'Senior']

In [None]:
# Verification
verify.p9(df_clean)

---
## Problem 10: Survival by Age Group
**Difficulty:** Medium

### Concept
After creating age groups, we can analyze survival rates for each group and visualize the results. This shows whether the survival rates are consistent with a "children first" evacuation protocol.

### Syntax
```python
rates = df.groupby('category')['outcome'].mean() * 100
rates.plot(kind='bar')
```

### Example
```python
>>> survival = df.groupby('age_group')['Survived'].mean() * 100
>>> survival.plot(kind='bar', ylabel='Survival Rate (%)')
```
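
When the grouping column is categorical, as the output of `pd.cut` is, `groupby` takes an `observed` argument that controls whether categories with no rows appear in the result. A minimal sketch on made-up data; the `toy` frame, the `band` column, and the bin edges are invented for illustration.

```python
import pandas as pd

# Hypothetical toy data: no score falls into the 'top' band
toy = pd.DataFrame({
    'score':  [5, 15, 25, 8],
    'passed': [1, 0, 1, 1],
})
toy['band'] = pd.cut(toy['score'], bins=[0, 10, 20, 30, 40],
                     labels=['low', 'mid', 'high', 'top'])

# observed=False keeps every category, even empty ones (their rate is NaN)
rates_all = toy.groupby('band', observed=False)['passed'].mean() * 100

# observed=True keeps only categories that actually occur in the data
rates_seen = toy.groupby('band', observed=True)['passed'].mean() * 100

print(rates_all)
print(rates_seen)
```

Passing `observed` explicitly makes the intent clear and, in some recent pandas versions, avoids a FutureWarning about the default changing.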

### Task
1. Calculate survival rates by age group (as percentages), store in `survival_by_age`
2. Create a bar chart visualization

### Expected Properties
- `survival_by_age` should be a Series
- Should have 4 elements (one per age group)
- Children should have survival rate above 50%

In [None]:
# Your solution:
survival_by_age = None
fig = None

In [None]:
# Verification
verify.p10(survival_by_age)

---
## Problem 11: Fare Analysis
**Difficulty:** Medium

### Concept
Fare paid is a proxy for socioeconomic status. Analyzing fare by survival status can reveal whether wealthier passengers had better chances.

### Syntax
```python
df.groupby('binary_column')['numeric_column'].mean()
```

### Example
```python
>>> df.groupby('Passed')['StudyHours'].mean()
Passed
0    3.2
1    5.8
```
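
A group mean can be pulled upward by a few very large values, so it can be informative to look at the median next to it. The sketch below is an optional extension on invented data, not part of the task; it reuses the hypothetical `Passed` and `StudyHours` columns from the example.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    'Passed':     [0, 0, 0, 1, 1, 1],
    'StudyHours': [2, 3, 4, 5, 6, 40],  # 40 is a deliberate outlier
})

# agg() with a list of function names returns one column per statistic
summary = toy.groupby('Passed')['StudyHours'].agg(['mean', 'median'])
print(summary)
```

Here the outlier lifts the mean for `Passed == 1` to 17 while the median stays at 6, a gap that signals a skewed distribution.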

### Task
Calculate mean fare for survivors (1) vs non-survivors (0). Store in `fare_by_survival`.

### Expected Properties
- Should be a pandas Series with 2 elements
- Survivors should have higher average fare
- Both values should be positive

In [None]:
# Your solution:
fare_by_survival = None

In [None]:
# Verification
verify.p11(fare_by_survival)

---
## Problem 12: Create Family Size Feature
**Difficulty:** Medium

### Concept
Feature engineering creates new variables from existing ones. Family size (combining siblings/spouses and parents/children) might affect survival: both traveling alone and traveling in a very large family could be disadvantageous.

### Syntax
```python
df['new_column'] = df['col1'] + df['col2'] + 1  # +1 for the person themselves
```

### Example
```python
>>> df['total_score'] = df['quiz1'] + df['quiz2'] + df['exam']
```

### Task
Create a 'family_size' column by adding SibSp (siblings/spouses) + Parch (parents/children) + 1 (the passenger themselves).

### Expected Properties
- df_clean should have 'family_size' column
- Minimum value should be 1 (person traveling alone)
- Maximum value should be greater than 5

In [None]:
# Your solution:


In [None]:
# Verification
verify.p12(df_clean)

---
## Problem 13: Survival by Family Size
**Difficulty:** Medium

### Concept
Analyzing survival by family size can reveal interesting patterns. Small families might have stuck together, while very large families might have had trouble coordinating evacuation.

### Syntax
```python
df.groupby('family_size')['Survived'].mean() * 100
```

### Example
```python
>>> survival = df.groupby('group_size')['success'].mean() * 100
>>> survival
group_size
1    25.0
2    50.0
3    45.0
```

### Task
Calculate survival rates by family size. Store in `survival_by_family`.

### Expected Properties
- Should be a pandas Series
- Should have multiple elements (different family sizes)
- All values should be between 0 and 100

In [None]:
# Your solution:
survival_by_family = None

In [None]:
# Verification
verify.p13(survival_by_family)

---
## Problem 14: Correlation Analysis
**Difficulty:** Medium

### Concept
Pearson correlation (the default method of `df.corr()`) measures the strength of linear relationships between variables, on a scale from -1 to 1. For survival prediction, we want to know which numeric features correlate most strongly with survival.

### Syntax
```python
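# Correlation matrix (numeric_only=True skips text columns;
# pandas 2.0+ raises an error on them otherwise)
df.corr(numeric_only=True)

# Correlations with specific column
df.corr(numeric_only=True)['target_column']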
```

### Example
```python
>>> df[['feature1', 'feature2', 'target']].corr()['target']
feature1    0.45
feature2   -0.32
target      1.00
```
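
Correlation is only defined for numeric columns, so text columns have to be left out before calling `corr()`. One way that does not depend on the pandas version is to select the numeric columns first with `select_dtypes`. A sketch on made-up data, where every column name is hypothetical:

```python
import pandas as pd

# Hypothetical toy data with one non-numeric column
toy = pd.DataFrame({
    'hours':  [1, 2, 3, 4, 5],
    'errors': [9, 7, 6, 3, 1],
    'name':   ['a', 'b', 'c', 'd', 'e'],
    'target': [0, 0, 1, 1, 1],
})

numeric = toy.select_dtypes(include='number')  # drops the 'name' column
corr_matrix = numeric.corr()                   # Pearson correlation by default

# Keep the target's column, remove its correlation with itself, then sort
target_corr = corr_matrix['target'].drop('target').sort_values()
print(target_corr)
```

Dropping the target's own label removes the trivial 1.0 entry, and sorting puts the strongest negative and positive relationships at the two ends of the Series.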

### Task
Calculate correlations between all numeric features and 'Survived'. Store correlations with 'Survived' in `survival_corr` (excluding the 1.0 correlation with itself).

### Expected Properties
- Should be a pandas Series
- Should contain correlations for numeric columns
- All values should be between -1 and 1

In [None]:
# Your solution:
survival_corr = None

In [None]:
# Verification
verify.p14(survival_corr)

---
## Problem 15: Create Summary Dashboard
**Difficulty:** Hard

### Concept
A dashboard presents multiple related visualizations together, telling a complete story about the data. For Titanic, key factors include class, gender, age, and fare.

### Syntax
```python
fig, axes = plt.subplots(2, 2, figsize=(14, 12))
# Access subplots: axes[0,0], axes[0,1], axes[1,0], axes[1,1]
axes[0,0].plot(...)  # Top-left
axes[0,1].bar(...)   # Top-right
```

### Example
```python
>>> fig, axes = plt.subplots(2, 2)
>>> axes[0,0].hist(data1)
>>> axes[0,1].scatter(x, y)
>>> axes[1,0].bar(categories, values)
>>> axes[1,1].boxplot(groups)
```
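
A complete version of the 2x2 pattern, with a title per panel and one overall title, might look like the sketch below. All data here is randomly generated or invented for illustration, and the panel contents are placeholders rather than the Titanic charts the task asks for.

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical data for the placeholder panels
rng = np.random.default_rng(0)
values = rng.normal(size=200)

fig, axes = plt.subplots(2, 2, figsize=(14, 12))

axes[0, 0].bar(['A', 'B', 'C'], [3, 7, 5])
axes[0, 0].set_title('Bar chart 1')

axes[0, 1].bar(['X', 'Y'], [40, 60])
axes[0, 1].set_title('Bar chart 2')

axes[1, 0].hist(values, bins=20)
axes[1, 0].set_title('Histogram')

# boxplot takes a list of arrays, one per box, drawn at x positions 1, 2, ...
axes[1, 1].boxplot([values[:100], values[100:] + 1])
axes[1, 1].set_xticks([1, 2])
axes[1, 1].set_xticklabels(['Group 0', 'Group 1'])
axes[1, 1].set_title('Box plot')

title = fig.suptitle('Toy dashboard', fontsize=16)
fig.tight_layout(rect=[0, 0, 1, 0.96])  # leave room at the top for the suptitle
```

`fig.suptitle()` sets the figure-level title, while `set_title()` labels individual panels; the `rect` argument keeps the two from overlapping.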

### Task
Create a 2x2 dashboard with:
- Top-left: Survival rate by class (bar chart)
- Top-right: Survival rate by gender (bar chart)
- Bottom-left: Age distribution (histogram)
- Bottom-right: Fare by survival status (box plot)

### Expected Properties
- `axes` should be a 2x2 array
- Figure should have a title
- All 4 subplots should be populated

In [None]:
# Your solution:
fig = None
axes = None

In [None]:
# Verification
verify.p15(axes)

---
## Summary and Key Insights

Based on your EDA, document the key findings:

1. **Gender Impact**: ...
2. **Class Differences**: ...
3. **Age Patterns**: ...
4. **Family Size**: ...
5. **Economic Factors**: ...

In [None]:
from utils.checker import check
check.summary()