# Day 2: Introduction to Data Science

**Duration:** 90 minutes  
**Dataset:** Titanic Passenger Data

## Learning Objectives
- Understand what data science is and its three pillars
- Explore real-world applications of data science
- Learn about Big Data characteristics
- Practice effective data visualization
- Distinguish correlation from causation

---

## Part 1: Setting Up Our Environment (10 mins)

### Introduction to Python Libraries for Data Science

Before we start, let's install and import the key libraries we'll use:

#### **Pandas** (https://pandas.pydata.org/)
- Used for data manipulation and analysis
- Think of it as Excel on steroids!
- Alternatives: Polars, Dask

#### **NumPy** (https://numpy.org/)
- Fundamental package for numerical computing
- Provides powerful array operations
- Alternatives: JAX (for advanced users)

#### **Plotly** (https://plotly.com/python/)
- Interactive visualization library
- Creates beautiful, interactive charts
- Alternatives: Matplotlib, Seaborn, Altair
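
To give a feel for how pandas and NumPy work together before the real dataset is loaded, here is a minimal sketch. The small `toy` table and all of its values are made up for illustration; they are not taken from the Titanic data.

```python
import numpy as np
import pandas as pd

# Made-up toy table (not Titanic data), just to show the basic moves
toy = pd.DataFrame({
    "name": ["Ana", "Ben", "Cho", "Dev"],
    "age": [22, 35, 58, 41],
    "ticket_price": [7.25, 71.28, 26.55, 8.05],
})

# pandas: keep only the rows that meet a condition, like a spreadsheet filter
over_30 = toy[toy["age"] > 30]

# NumPy: do math on a whole column at once
mean_price = np.mean(toy["ticket_price"].to_numpy())

print(over_30)
print(f"Mean ticket price: {mean_price:.2f}")
```

The same filtering and averaging pattern applies to any DataFrame, whatever its size.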

In [None]:
# Install required packages (run only once)
# Uncomment the line below if you need to install packages
# (%pip installs into the environment this notebook's kernel is using)
# %pip install pandas numpy plotly seaborn

# Import libraries
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

print("✓ Libraries imported successfully!")

### Load the Titanic Dataset

The Titanic dataset contains information about passengers aboard the RMS Titanic. We'll use this throughout the course to learn data science concepts.

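**Dataset Features** (column names as they appear in seaborn's version of the dataset):
- survived: Survival (0 = No, 1 = Yes)
- pclass: Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
- sex: Gender
- age: Age in years
- sibsp: Number of siblings/spouses aboard
- parch: Number of parents/children aboard
- fare: Passenger fare
- embarked: Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)
- class: Ticket class as text (First, Second, Third)
- who: Man, woman, or child
- adult_male: True for adult men
- deck: Cabin deck letter
- embark_town: Port of embarkation by name
- alive: Survival as text (yes / no)
- alone: True if the passenger had no family aboard

The Kaggle version of this dataset uses capitalized column names and adds PassengerId, Name, Ticket, and Cabin. Those four columns are not in the seaborn version we load here.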

In [None]:
# Load the Titanic dataset with seaborn's load_dataset()
# (downloaded on first use, then cached locally, so you need an internet connection the first time)
import seaborn as sns
df = sns.load_dataset('titanic')

# Display first few rows
print("First 5 rows of the Titanic dataset:")
df.head()

In [None]:
# Get basic information about the dataset
print("Dataset shape (rows, columns):", df.shape)
print("\nColumn data types:")
df.info()

---
## Part 2: Understanding Data Science (15 mins)

### Exercise 2.1: Identifying the Three Pillars

Data Science sits at the intersection of three domains:
1. **Domain Expertise** - Understanding the problem context
2. **Statistics/Mathematics** - Analyzing and interpreting data
3. **Computer Science** - Programming and computational thinking

**Task:** For the Titanic dataset, write down one example for each pillar:
- Domain Expertise: ___________________________________
- Statistics: ___________________________________
- Computer Science: ___________________________________

### Exercise 2.2: Big Data Characteristics (The 5 V's)

Let's analyze our dataset using the Big Data framework:

**The 5 V's:**
- **Volume:** How much data?
- **Velocity:** How fast is data generated?
- **Variety:** What types of data?
- **Veracity:** How trustworthy is the data?
- **Value:** What insights can we extract?
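
Of the five, veracity is one of the easiest to start checking in code: a common first step is to count missing values per column. The sketch below uses a small made-up table (`toy`, with hypothetical values) so that it runs on its own; the same `isna()` calls should work on `df`.

```python
import numpy as np
import pandas as pd

# Made-up table with some gaps (np.nan and None mark missing values)
toy = pd.DataFrame({
    "age": [22.0, np.nan, 58.0, np.nan],
    "fare": [7.25, 71.28, 26.55, 8.05],
    "deck": [None, "C", None, None],
})

# Count missing values in each column
missing_counts = toy.isna().sum()

# Share of missing values per column, as a percentage
missing_pct = (toy.isna().mean() * 100).round(1)

print(missing_counts)
print(missing_pct)
```

A column with many gaps is not useless, but any conclusion drawn from it rests on fewer passengers than the row count suggests.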

In [None]:
# TODO: Calculate the volume of our dataset
# Hint: Use df.shape to get number of rows and columns
num_rows = # YOUR CODE HERE
num_cols = # YOUR CODE HERE

print(f"Volume: {num_rows} rows × {num_cols} columns = {num_rows * num_cols} data points")

In [None]:
# TODO: Identify the variety in our dataset
# Hint: Use df.dtypes to see different data types
print("Data types in our dataset:")
# YOUR CODE HERE

---
## Part 3: Data Visualization & Storytelling (35 mins)

### Exercise 3.1: Creating Your First Visualizations

Good visualizations tell a story. Let's explore the Titanic data visually!

In [None]:
# TODO: Create a bar chart showing survival counts
# Hint: Use px.histogram() with x='survived'

fig = # YOUR CODE HERE
fig.update_layout(title='Survival Distribution on the Titanic',
                  xaxis_title='Survived (0=No, 1=Yes)',
                  yaxis_title='Number of Passengers')
fig.show()

**Question:** What story does this chart tell? ___________________________________

In [None]:
# TODO: Create a visualization comparing survivors and non-survivors by passenger class
# Hint: Use px.histogram() with x='pclass' and color='survived'

fig = # YOUR CODE HERE
fig.update_layout(title='Survival by Passenger Class',
                  xaxis_title='Passenger Class',
                  yaxis_title='Count',
                  barmode='group')
fig.show()

**Question:** Which passenger class had the highest survival rate? Why might this be? 

Your answer: ___________________________________

### Exercise 3.2: Age Distribution

Let's explore the age distribution of passengers using a histogram.
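
A histogram sorts values into equal-width bins and counts how many fall into each one. The sketch below does that counting by hand with `np.histogram` on a few made-up ages (not from the dataset), which is roughly the calculation behind the chart.

```python
import numpy as np

# Made-up ages for illustration
ages = np.array([2, 5, 19, 22, 24, 31, 35, 38, 47, 62])

# Four equal-width bins covering 0 to 80 years
counts, bin_edges = np.histogram(ages, bins=4, range=(0, 80))

# Print each bin's range next to the number of values that fell into it
for left, right, count in zip(bin_edges[:-1], bin_edges[1:], counts):
    print(f"{left:.0f}-{right:.0f} years: {count} passengers")
```

Try changing `bins` and notice how the shape changes: too few bins hide detail, too many make the chart noisy.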

In [None]:
# TODO: Create a histogram of passenger ages
# Hint: Use px.histogram() with x='age' and nbins=30

fig = # YOUR CODE HERE
fig.update_layout(title='Age Distribution of Titanic Passengers',
                  xaxis_title='Age (years)',
                  yaxis_title='Number of Passengers')
fig.show()

### Exercise 3.3: Multiple Variables - Gender, Class, and Survival

In [None]:
# Worked example: grouped bar chart of survival rate by sex and class
# Step 1: group the data. The mean of a 0/1 column is the share of 1s,
# so the mean of 'survived' is the survival rate for each group.

survival_by_sex_class = df.groupby(['sex', 'pclass'])['survived'].mean().reset_index()

fig = px.bar(survival_by_sex_class, 
             x='pclass', 
             y='survived', 
             color='sex',
             barmode='group',
             title='Survival Rate by Gender and Passenger Class',
             labels={'survived': 'Survival Rate', 'pclass': 'Passenger Class'})
fig.show()

**Question:** What patterns do you observe? ___________________________________

### Exercise 3.4: Scatter Plot - Fare vs Age

In [None]:
# TODO: Create a scatter plot of Age vs Fare, colored by survival
# Hint: Use px.scatter() with x='age', y='fare', color='survived'

fig = # YOUR CODE HERE
fig.update_layout(title='Relationship between Age and Fare',
                  xaxis_title='Age (years)',
                  yaxis_title='Fare (£)')
fig.show()

---
## Part 4: Correlation vs Causation (20 mins)

### Exercise 4.1: Computing Correlations

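**Important:** Correlation measures how strongly two variables move together, on a scale from -1 (they move in opposite directions) to +1 (they move in the same direction), with 0 meaning no linear relationship. It does NOT mean one causes the other!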
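
To make the idea concrete, here is a minimal sketch of the Pearson correlation coefficient, the default method used by pandas' `.corr()`, computed by hand and then checked against NumPy. The `x` and `y` values are made up for illustration.

```python
import numpy as np

# Made-up measurements for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Step 1: how far is each value from its own average?
x_dev = x - x.mean()
y_dev = y - y.mean()

# Step 2: Pearson r = sum of paired deviations, scaled by the spread of x and y
r_manual = (x_dev * y_dev).sum() / np.sqrt((x_dev ** 2).sum() * (y_dev ** 2).sum())

# Same number from NumPy's built-in function
r_numpy = np.corrcoef(x, y)[0, 1]

print(f"By hand: {r_manual:.3f}, NumPy: {r_numpy:.3f}")
```

Nothing in this formula knows which variable came first or why the two move together; it only measures that they do.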

In [None]:
# TODO: Calculate correlation between numerical variables
# Hint: Use df.select_dtypes(include=[np.number]).corr()

correlation_matrix = # YOUR CODE HERE
print("Correlation Matrix:")
print(correlation_matrix)

In [None]:
# Worked example: heatmap of the correlation matrix from the previous cell
# (run the previous cell first so that correlation_matrix exists)

fig = px.imshow(correlation_matrix,
                text_auto='.2f',
                aspect='auto',
                title='Correlation Heatmap',
                color_continuous_scale='RdBu_r')
fig.show()

### Exercise 4.2: Understanding Correlation vs Causation

Look at the correlation between `fare` and `survived`.

In [None]:
# TODO: Calculate the correlation between fare and survival
fare_survival_corr = # YOUR CODE HERE
print(f"Correlation between Fare and Survival: {fare_survival_corr:.3f}")

**Discussion Questions:**

1. Does higher fare CAUSE better survival? ___________________________________
2. What might be the real reason for this relationship? (Hint: Think about passenger class) ___________________________________
3. Can you think of a confounding variable? ___________________________________
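
A confounding variable is a hidden common cause that drives both variables. The simulation below is a sketch of the classic ice cream and sunburn example, not Titanic data: every number, along with the names `hot_day`, `ice_cream_sales`, and `sunburn_cases`, is invented for illustration. The two outcomes never influence each other, yet they end up strongly correlated.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)
n = 3000

# Hidden common cause (made-up simulation): 1 = hot day, 0 = cool day
hot_day = rng.integers(0, 2, size=n)

# Both outcomes depend on the weather, but not on each other
ice_cream_sales = 50 + 40 * hot_day + rng.normal(0, 5, size=n)
sunburn_cases = 5 + 10 * hot_day + rng.normal(0, 2, size=n)

sim = pd.DataFrame({
    "hot_day": hot_day,
    "ice_cream_sales": ice_cream_sales,
    "sunburn_cases": sunburn_cases,
})

# Correlation across all days mixes hot and cool days together
overall_r = sim["ice_cream_sales"].corr(sim["sunburn_cases"])

# Correlation within each weather group holds the confounder fixed
within_r = {
    int(flag): round(float(group["ice_cream_sales"].corr(group["sunburn_cases"])), 2)
    for flag, group in sim.groupby("hot_day")
}

print(f"Overall correlation: {overall_r:.2f}")
print(f"Within each weather group: {within_r}")
```

The overall correlation is strong, while the correlation inside each weather group is close to zero. Comparing a relationship overall and within groups is one simple way to test whether a third variable might explain it.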

---
## Part 5: Recognizing Poor Visualizations (10 mins)

### Exercise 5.1: Creating a Misleading Visualization

Let's intentionally create a poor visualization to understand what NOT to do!

In [None]:
# Poor visualization example: Pie chart with too many categories
age_groups = pd.cut(df['age'].dropna(), bins=20)
age_counts = age_groups.value_counts()  # number of passengers in each age bin
fig = px.pie(values=age_counts.values,
             names=age_counts.index.astype(str),
             title='Age Distribution (Poor Visualization - Too Many Slices!)')
fig.show()

**Question:** What makes this visualization poor? List at least 2 reasons:

1. ___________________________________
2. ___________________________________

In [None]:
# TODO: Create a BETTER version of the above visualization
# Hint: Use a histogram instead of a pie chart

# YOUR CODE HERE

---
## Part 6: Summary & Reflection (10 mins)

### Key Takeaways

Today we learned:
- ✓ Data science combines domain expertise, statistics, and computer science
- ✓ The 5 V's of Big Data: Volume, Velocity, Variety, Veracity, Value
- ✓ Different visualization types serve different purposes
- ✓ Correlation ≠ Causation!
- ✓ Good visualizations tell clear stories without misleading

### Reflection Questions

1. What was the most interesting insight you discovered about the Titanic data?

   Your answer: ___________________________________

2. Which type of visualization did you find most useful and why?

   Your answer: ___________________________________

3. Can you think of a real-world application where these data science skills would be valuable?

   Your answer: ___________________________________

---
## Bonus Challenges (Optional)

If you finish early, try these additional exercises:

### Bonus 1: Box Plot for Fare by Class

In [None]:
# TODO: Create a box plot showing fare distribution by passenger class
# Hint: Use px.box()

# YOUR CODE HERE

### Bonus 2: Survival Rate by Embarkation Port

In [None]:
# TODO: Calculate and visualize survival rates by embarkation port
# Hint: Group by 'embarked' and calculate mean of 'survived'

# YOUR CODE HERE

### Bonus 3: Family Size Analysis

In [None]:
# TODO: Create a new feature 'family_size' = sibsp + parch + 1
# Then visualize survival rate by family size

# YOUR CODE HERE

---
## Resources for Further Learning

- **Pandas Documentation:** https://pandas.pydata.org/docs/
- **Plotly Gallery:** https://plotly.com/python/
- **Data Visualization Best Practices:** https://www.storytellingwithdata.com/
- **Correlation vs Causation:** https://www.tylervigen.com/spurious-correlations

**See you on Day 4!** 🚀