# Day 2: Introduction to Data Science - SOLUTIONS

**Duration:** 90 minutes  
**Dataset:** Titanic Passenger Data

---

## Part 1: Setting Up Our Environment

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import seaborn as sns

print("✓ Libraries imported successfully!")

In [None]:
# Load the Titanic dataset
df = sns.load_dataset('titanic')

print("First 5 rows of the Titanic dataset:")
df.head()

In [None]:
# Get basic information
print("Dataset shape (rows, columns):", df.shape)
print("\nColumn data types:")
df.info()

---
## Part 2: Understanding Data Science

### Exercise 2.1: Identifying the Three Pillars - SOLUTION

**Examples for Titanic dataset:**
- **Domain Expertise:** Understanding maritime history, social class structure in 1912, lifeboat protocols
- **Statistics/Mathematics:** Calculating survival rates, correlations between variables, probability distributions
- **Computer Science:** Data loading, manipulation with pandas, creating visualizations, writing efficient code

### Exercise 2.2: Big Data Characteristics - SOLUTION

In [None]:
# SOLUTION: Calculate the volume of our dataset
num_rows = df.shape[0]
num_cols = df.shape[1]

print(f"Volume: {num_rows} rows × {num_cols} columns = {num_rows * num_cols} data points")
print(f"\nInterpretation: This is a small dataset ({num_rows} passengers), not 'Big Data' by modern standards.")
print("But it's perfect for learning!")

In [None]:
# SOLUTION: Identify the variety in our dataset
print("Data types in our dataset:")
print(df.dtypes)
print("\n" + "="*50)
print("Variety Analysis:")
print(f"- Numerical: {df.select_dtypes(include=[np.number]).columns.tolist()}")
print(f"- Categorical: {df.select_dtypes(include=['object', 'category']).columns.tolist()}")
print("\nWe have STRUCTURED data with both numerical and categorical variables.")

**The 5 V's for Titanic Dataset:**
- **Volume:** 891 passengers in this sample (small dataset)
- **Velocity:** Static/historical data (collected once, over 100 years ago)
- **Variety:** Structured data with numerical + categorical features
- **Veracity:** Generally trustworthy (historical records), but with missing values, notably in `age` and `deck`
- **Value:** High value for understanding survival patterns and social dynamics
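
The "missing values" caveat under Veracity can be quantified directly. Below is a minimal sketch on a small made-up frame (the `toy` rows are invented for illustration and are not taken from the Titanic data); the same two calls should work unchanged on `df`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic frame: all values are invented for illustration.
toy = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, np.nan],
    "deck": [np.nan, "C", np.nan, np.nan],
    "fare": [7.25, 71.28, 8.05, 53.10],
})

# Number and share of missing values per column.
missing_count = toy.isna().sum()
missing_share = toy.isna().mean()

print(pd.DataFrame({"missing": missing_count, "share": missing_share}))
```

Columns with a large share of missing values deserve extra caution before drawing conclusions from them.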

---
## Part 3: Data Visualization & Storytelling

In [None]:
# SOLUTION: Bar chart showing survival counts
fig = px.histogram(df, x='survived', 
                   color='survived',
                   labels={'survived': 'Survived (0=No, 1=Yes)'},
                   category_orders={'survived': [0, 1]})
fig.update_layout(title='Survival Distribution on the Titanic',
                  xaxis_title='Survived (0=No, 1=Yes)',
                  yaxis_title='Number of Passengers',
                  showlegend=False)
fig.show()

# Calculate exact numbers
survival_counts = df['survived'].value_counts()
print(f"\nDied: {survival_counts[0]} passengers")
print(f"Survived: {survival_counts[1]} passengers")
print(f"Survival Rate: {survival_counts[1]/len(df)*100:.1f}%")

**Story:** Tragically, more passengers died than survived. Only about 38% survived the disaster.

In [None]:
# SOLUTION: Survival rates by passenger class
fig = px.histogram(df, x='pclass', color='survived',
                   barmode='group',
                   labels={'pclass': 'Passenger Class', 'survived': 'Survived'},
                   category_orders={'survived': [0, 1]})
fig.update_layout(title='Survival by Passenger Class',
                  xaxis_title='Passenger Class (1=1st, 2=2nd, 3=3rd)',
                  yaxis_title='Count')
fig.show()

# Calculate survival rates by class
print("\nSurvival rates by class:")
for pclass in sorted(df['pclass'].unique()):
    rate = df[df['pclass']==pclass]['survived'].mean()
    print(f"Class {pclass}: {rate*100:.1f}%")

**Answer:** 1st class had the highest survival rate (~63%). This likely reflects better access to lifeboats and the "women and children first" protocol being applied more effectively in upper-class areas, although the data alone cannot confirm the cause.
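
The loop above can also be written as a single `groupby`. This is a minimal sketch on invented toy rows (not the real dataset), relying on the fact that the mean of a 0/1 column equals the share of 1s.

```python
import pandas as pd

# Toy data (invented): 1 = survived, 0 = died.
toy = pd.DataFrame({
    "pclass":   [1, 1, 1, 2, 2, 3, 3, 3, 3],
    "survived": [1, 1, 0, 1, 0, 0, 0, 1, 0],
})

# Mean of a 0/1 column per group is that group's survival rate.
rates = toy.groupby("pclass")["survived"].mean()
print((rates * 100).round(1))
```

On the real data, `df.groupby('pclass')['survived'].mean()` should give the same rates as the loop.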

### Exercise 3.2: Age Distribution - SOLUTION

In [None]:
# SOLUTION: Histogram of passenger ages
fig = px.histogram(df, x='age', nbins=30, 
                   marginal='box',  # Add a box plot on top
                   color_discrete_sequence=['steelblue'])
fig.update_layout(title='Age Distribution of Titanic Passengers',
                  xaxis_title='Age (years)',
                  yaxis_title='Number of Passengers')
fig.show()

# Note: pandas skips missing ages (NaN) by default when computing these statistics
print("\nAge Statistics:")
print(f"Mean age: {df['age'].mean():.1f} years")
print(f"Median age: {df['age'].median():.1f} years")
print(f"Youngest: {df['age'].min():.1f} years")
print(f"Oldest: {df['age'].max():.1f} years")

### Exercise 3.3: Multiple Variables - SOLUTION

In [None]:
# SOLUTION: Survival by sex and class
survival_by_sex_class = df.groupby(['sex', 'pclass'])['survived'].mean().reset_index()

fig = px.bar(survival_by_sex_class, 
             x='pclass', 
             y='survived', 
             color='sex',
             barmode='group',
             title='Survival Rate by Gender and Passenger Class',
             labels={'survived': 'Survival Rate', 'pclass': 'Passenger Class'})
fig.update_layout(yaxis_tickformat='.0%')
fig.show()

print("\nSurvival rates by gender and class:")
print(survival_by_sex_class.pivot(index='pclass', columns='sex', values='survived'))

**Patterns observed:**
- Women had much higher survival rates than men across all classes
- 1st class women had the highest survival rate (~97%)
- 3rd class men had the lowest survival rate (~14%)
- The gender gap is consistent with the "women and children first" protocol (age is not shown in this chart)
- Social class AND gender both mattered for survival
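
The gender gap within each class can be read off a pivot table. A minimal sketch, assuming invented toy rows rather than the real data; `gap` is a name introduced here for illustration.

```python
import pandas as pd

# Toy data (invented): two classes, both sexes.
toy = pd.DataFrame({
    "sex":      ["female", "female", "male", "male", "female", "male", "male", "female"],
    "pclass":   [1, 1, 1, 1, 3, 3, 3, 3],
    "survived": [1, 1, 1, 0, 1, 0, 0, 0],
})

# Rows = class, columns = sex, cells = survival rate.
table = toy.pivot_table(index="pclass", columns="sex",
                        values="survived", aggfunc="mean")

# Difference in survival rate between women and men within each class.
gap = table["female"] - table["male"]
print(table)
print(gap)
```

A positive gap in every class would support the first bullet; comparing rows shows the additional effect of class.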

### Exercise 3.4: Scatter Plot - SOLUTION

In [None]:
# SOLUTION: Scatter plot of Age vs Fare
fig = px.scatter(df, x='age', y='fare', color='survived',
                 hover_data=['pclass', 'sex'],
                 labels={'survived': 'Survived'},
                 opacity=0.6)
fig.update_layout(title='Relationship between Age and Fare',
                  xaxis_title='Age (years)',
                  yaxis_title='Fare (£)')
fig.show()

print("\nInsights:")
print("- Higher fares generally correlate with better survival")
print("- Some outliers with very high fares")
print("- Age alone doesn't show a strong pattern")

---
## Part 4: Correlation vs Causation

In [None]:
# SOLUTION: Calculate correlation matrix
correlation_matrix = df.select_dtypes(include=[np.number]).corr()
print("Correlation Matrix:")
print(correlation_matrix.round(3))

In [None]:
# SOLUTION: Create heatmap
fig = px.imshow(correlation_matrix,
                text_auto='.2f',
                aspect='auto',
                title='Correlation Heatmap',
                color_continuous_scale='RdBu_r',
                zmin=-1, zmax=1)
fig.update_layout(width=800, height=700)
fig.show()

### Exercise 4.2: Correlation vs Causation - SOLUTION

In [None]:
# SOLUTION: Correlation between fare and survival
fare_survival_corr = df['fare'].corr(df['survived'])
print(f"Correlation between Fare and Survival: {fare_survival_corr:.3f}")

# Additional analysis
pclass_survival_corr = df['pclass'].corr(df['survived'])
print(f"Correlation between Passenger Class and Survival: {pclass_survival_corr:.3f}")
print(f"\nNote: Negative correlation for pclass means lower class numbers (1st class) → higher survival")
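
To see where the sign of a correlation comes from, the Pearson coefficient can be computed by hand and compared with pandas. This is a sketch on invented toy values (not the real data): survival is more common at low class numbers, so the products of the deviations are mostly negative.

```python
import numpy as np
import pandas as pd

# Toy data (invented): survivors cluster at lower class numbers.
pclass = pd.Series([1, 1, 2, 2, 3, 3, 3, 3])
survived = pd.Series([1, 1, 1, 0, 0, 0, 1, 0])

# Pearson r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2)),
# where dx and dy are deviations from each variable's mean.
dx = pclass - pclass.mean()
dy = survived - survived.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# Series.corr uses the Pearson method by default.
r_pandas = pclass.corr(survived)
print(f"manual: {r_manual:.3f}, pandas: {r_pandas:.3f}")
```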

**Discussion Answers:**

1. **Does higher fare CAUSE better survival?** 
   No! Paying more money doesn't physically improve your chances of survival.

2. **What might be the real reason?** 
   Higher fares indicate higher passenger class (1st class). First class passengers had:
   - Cabins closer to the boat deck
   - Better access to lifeboats
   - More crew assistance
   - Priority in evacuation

3. **Confounding variable:** 
   **Passenger Class (pclass)** is the confounding variable. It influences both fare (1st class = expensive) and survival (1st class = better access to lifeboats).
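
The confounding argument can be illustrated with a small simulation. In the sketch below, all passengers are simulated (not real data), and the fare levels and survival probabilities per class are assumed values chosen for illustration. Class drives both fare and survival, while fare has no direct effect on survival. The overall correlation should still come out clearly positive, while the correlation within each class stays close to zero.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 6000

# Assumed values for this simulation only.
mean_fare = {1: 80.0, 2: 20.0, 3: 13.0}
p_survive = {1: 0.63, 2: 0.47, 3: 0.24}

# Class is drawn at random; fare and survival each depend on class only.
sim = pd.DataFrame({"pclass": rng.choice([1, 2, 3], size=n)})
sim["fare"] = sim["pclass"].map(mean_fare) * rng.lognormal(0.0, 0.3, size=n)
sim["survived"] = (rng.random(n) < sim["pclass"].map(p_survive)).astype(int)

# Correlation over everyone vs. correlation inside each class.
overall = sim["fare"].corr(sim["survived"])
within = {int(k): g["fare"].corr(g["survived"])
          for k, g in sim.groupby("pclass")}

print(f"Overall fare-survival correlation: {overall:.3f}")
for k, r in within.items():
    print(f"  within class {k}: {r:.3f}")
```

Holding the suspected confounder fixed and rechecking the relationship is a simple first test; the same within-class comparison could be tried on `df`, where the result may differ because real fares vary for other reasons too.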

In [None]:
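# Visualizing the confounding relationship
# Note: trendline='ols' requires the statsmodels package.
# pclass is cast to string so Plotly treats it as a category and
# fits one trendline per class instead of a single overall line.
plot_df = df.assign(pclass=df['pclass'].astype(str))
fig = px.scatter(plot_df, x='fare', y='survived', color='pclass',
                 trendline='ols',
                 category_orders={'pclass': ['1', '2', '3']},
                 title='Fare vs Survival: Passenger Class as Confounding Variable',
                 labels={'pclass': 'Passenger Class'})
fig.update_layout(xaxis_title='Fare (£)', yaxis_title='Survived (0=No, 1=Yes)')
fig.show()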

---
## Part 5: Recognizing Poor Visualizations

In [None]:
# Poor visualization example
age_groups = pd.cut(df['age'].dropna(), bins=20)
age_counts = age_groups.value_counts()  # computed once, reused for values and names
fig = px.pie(values=age_counts.values,
             names=age_counts.index.astype(str),
             title='Age Distribution (Poor Visualization - Too Many Slices!)')
fig.show()

**What makes this poor:**
1. Too many slices (20) - impossible to distinguish colors
2. Can't easily compare sizes
3. Labels overlap and are unreadable
4. Pie charts are poor for showing distributions
5. Hard to see patterns or trends
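
If a categorical view of age is really needed, one way to address the first point is to group ages into a handful of labeled bands before plotting. This is a sketch only: the ages are invented, and the band edges and labels are hypothetical choices, not part of the exercise.

```python
import pandas as pd

# Invented ages for illustration.
ages = pd.Series([0.42, 4, 17, 22, 29, 35, 41, 58, 71])

# Hypothetical band edges and labels; intervals are right-inclusive, e.g. (0, 12].
bins = [0, 12, 18, 40, 60, 120]
labels = ["child", "teen", "adult", "middle-aged", "senior"]

bands = pd.cut(ages, bins=bins, labels=labels)

# sort=False keeps the bands in age order rather than ordering by count.
counts = bands.value_counts(sort=False)
print(counts)
```

Five ordered categories are far easier to compare than twenty slices, although a histogram remains the more natural choice for a distribution.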

In [None]:
# SOLUTION: Better visualization
fig = px.histogram(df, x='age', nbins=20,
                   title='Age Distribution (Better Visualization)',
                   labels={'age': 'Age (years)'},
                   color_discrete_sequence=['steelblue'])
fig.update_layout(yaxis_title='Number of Passengers',
                  bargap=0.1)
fig.show()

print("\nWhy this is better:")
print("✓ Clear x-axis shows age ranges")
print("✓ Easy to compare bar heights")
print("✓ Pattern is immediately visible (peak in 20-30 age range)")
print("✓ Professional and clean appearance")

---
## Part 6: Summary & Reflection

### Sample Reflection Answers

1. **Most interesting insight:**
   The stark difference between men's and women's survival rates, and how this interacted with social class. The "women and children first" protocol was clearly followed, but your class still mattered.

2. **Most useful visualization:**
   The grouped bar chart showing survival by gender and class, because it revealed multiple relationships simultaneously and told a clear story about the disaster.

3. **Real-world application:**
   Healthcare: Analyzing patient outcomes based on treatments, demographics, and other factors while being careful about correlation vs causation (e.g., does a treatment work, or are healthier patients simply more likely to receive it?).

---
## Bonus Challenges - SOLUTIONS

In [None]:
# Bonus 1 SOLUTION: Box plot for fare by class
fig = px.box(df, x='pclass', y='fare', color='pclass',
             title='Fare Distribution by Passenger Class',
             labels={'pclass': 'Passenger Class', 'fare': 'Fare (£)'})
fig.update_layout(showlegend=False)
fig.show()

print("\nInsights:")
print("- 1st class has the highest median fare and the most outliers")
print("- 3rd class has the lowest and most consistent fares")
print("- Clear separation between classes")
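
The "most outliers" observation can be checked numerically with the conventional 1.5 × IQR rule that box plots are based on. A minimal sketch on invented fares; `count_outliers` is a helper defined here for illustration, and Plotly's own quartile calculation may differ slightly from pandas' default.

```python
import pandas as pd

# Toy fares (invented): one extreme value in 1st class.
toy = pd.DataFrame({
    "pclass": [1] * 6 + [3] * 6,
    "fare": [30, 50, 60, 80, 90, 500, 7, 7.5, 8, 8, 8.5, 9],
})

def count_outliers(s: pd.Series) -> int:
    """Count values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    outside = (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)
    return int(outside.sum())

outliers = {int(k): count_outliers(g["fare"]) for k, g in toy.groupby("pclass")}
print(outliers)
```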

In [None]:
# Bonus 2 SOLUTION: Survival by embarkation port
survival_by_port = df.groupby('embarked')['survived'].agg(['mean', 'count']).reset_index()
survival_by_port.columns = ['embarked', 'survival_rate', 'count']

fig = px.bar(survival_by_port, x='embarked', y='survival_rate',
             text='count',
             title='Survival Rate by Embarkation Port',
             labels={'embarked': 'Port (C=Cherbourg, Q=Queenstown, S=Southampton)',
                    'survival_rate': 'Survival Rate'})
fig.update_traces(texttemplate='n=%{text}', textposition='outside')
fig.update_layout(yaxis_tickformat='.0%')
fig.show()

print("\nInterpretation:")
print("Cherbourg (C) passengers had the highest survival rate.")
print("This correlates with more 1st class passengers boarding there.")
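
The claim about class mix at each port can be checked with a normalized cross-tabulation. The sketch below uses invented toy rows, not the real data; on the notebook's frame the equivalent call would be `pd.crosstab(df['embarked'], df['pclass'], normalize='index')`.

```python
import pandas as pd

# Toy data (invented): port of embarkation and passenger class.
toy = pd.DataFrame({
    "embarked": ["C", "C", "C", "C", "S", "S", "S", "S", "S", "Q", "Q"],
    "pclass":   [1, 1, 1, 3, 1, 2, 3, 3, 3, 3, 3],
})

# normalize="index" turns each row into shares that sum to 1,
# i.e. the class mix of passengers boarding at each port.
mix = pd.crosstab(toy["embarked"], toy["pclass"], normalize="index")
print(mix.round(2))
```

If one port has a much larger 1st class share, port is probably acting as a proxy for class rather than influencing survival itself.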

In [None]:
# Bonus 3 SOLUTION: Family size analysis
df['family_size'] = df['sibsp'] + df['parch'] + 1

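# Note: the seaborn version of the dataset has no passenger ID column,
# so rows are counted via the 'survived' column instead.
family_survival = (df.groupby('family_size')['survived']
                     .agg(['mean', 'count'])
                     .reset_index())
family_survival.columns = ['family_size', 'survival_rate', 'count']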

fig = px.line(family_survival, x='family_size', y='survival_rate',
              markers=True,
              title='Survival Rate by Family Size',
              labels={'family_size': 'Family Size', 'survival_rate': 'Survival Rate'})
fig.update_layout(yaxis_tickformat='.0%')
fig.show()

print("\nInsights:")
print("- Small families (2-4 people) had better survival rates")
print("- Solo travelers had lower survival rates than small families")
print("- Large families (5+) generally had low survival rates, but those groups are small")
print("- Hypothesis: Small families could coordinate evacuation, large families struggled")
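
Rates for rare family sizes rest on very few passengers, so it helps to keep the group size next to each rate. A minimal sketch on invented toy rows; `MIN_COUNT` is an arbitrary threshold chosen for this example, not a statistical rule.

```python
import pandas as pd

# Toy data (invented): one very large family with a single row.
toy = pd.DataFrame({
    "family_size": [1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 8],
    "survived":    [0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0],
})

# Named aggregation: survival rate and group size side by side.
summary = toy.groupby("family_size")["survived"].agg(
    survival_rate="mean", count="count"
)

# Flag groups too small to say much about (threshold is an assumption).
MIN_COUNT = 3
summary["reliable"] = summary["count"] >= MIN_COUNT
print(summary)
```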

---
## Additional Insights

### Key Statistics Summary

In [None]:
print("="*60)
print("TITANIC DATASET: KEY STATISTICS")
print("="*60)
print(f"\nTotal Passengers in Dataset: {len(df)}")
print(f"Survived: {df['survived'].sum()} ({df['survived'].mean()*100:.1f}%)")
print(f"Died: {len(df) - df['survived'].sum()} ({(1-df['survived'].mean())*100:.1f}%)")
print("\n" + "-"*60)
print("By Gender:")
print(df.groupby('sex')['survived'].agg(['count', 'sum', 'mean']))
print("\n" + "-"*60)
print("By Class:")
print(df.groupby('pclass')['survived'].agg(['count', 'sum', 'mean']))
print("\n" + "="*60)

---
**End of Day 2 Solutions**