# Day 2: Introduction to Data Science

**Duration:** 90 minutes  
**Dataset:** Titanic Passenger Data

## Learning Objectives
- Understand what data science is and its three pillars (domain expertise, statistics/mathematics, computer science)
- Explore real-world applications of data science (AI → Machine Learning → Deep Learning → GenAI)
- Learn about Big Data characteristics (Volume, Velocity, Variety, Veracity, Value)
- Identify and analyze different data sources (open-source, private, commercial)
- Practice effective data visualization and storytelling
- Distinguish correlation from causation and identify confounding variables

---

## Part 1: Setting Up Our Environment (10 mins)

### Introduction to Python Libraries for Data Science

Before we start, let's install and import the key libraries we'll use:

#### **Pandas** (https://pandas.pydata.org/)
- Used for data manipulation and analysis
- Think of it as Excel on steroids!
- Alternatives: Polars, Dask

#### **NumPy** (https://numpy.org/)
- Fundamental package for numerical computing
- Provides powerful array operations
- Alternatives: JAX (for advanced users)

#### **Plotly** (https://plotly.com/python/)
- Interactive visualization library
- Creates beautiful, interactive charts
- Alternatives: Matplotlib, Seaborn, Altair

In [None]:
# Install required packages (run only once)
# Uncomment the line below if you need to install packages
# !pip install pandas numpy plotly seaborn

# Import libraries
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

print("Libraries imported successfully!")

### Load the Titanic Dataset

The Titanic dataset contains information about passengers aboard the RMS Titanic. We'll use this throughout the course to learn data science concepts.

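**Dataset Features** (column names as provided by seaborn; they are lowercase and differ from the Kaggle version, which also includes PassengerId, Name, Ticket, and Cabin):
- survived: Survival (0 = No, 1 = Yes)
- pclass: Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
- sex: Gender
- age: Age in years
- sibsp: Number of siblings/spouses aboard
- parch: Number of parents/children aboard
- fare: Passenger fare
- embarked: Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)
- class: Ticket class as text (First, Second, Third)
- who: Man, woman, or child
- adult_male: Whether the passenger is an adult male (True/False)
- deck: Cabin deck letter (often missing)
- embark_town: Port of embarkation as a full name
- alive: Survival as text (yes/no)
- alone: Whether the passenger traveled without family (True/False)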

In [None]:
# Load the Titanic dataset from seaborn's built-in datasets
import seaborn as sns
df = sns.load_dataset('titanic')

# Display first few rows
print("First 5 rows of the Titanic dataset:")
df.head()

In [None]:
# Get basic information about the dataset
print("Dataset shape (rows, columns):", df.shape)
print("\nColumn data types:")
df.info()

---
## Part 2: Understanding Data Science - The Three Pillars (15 mins)

### What is Data Science?

Data Science is an **interdisciplinary field** that sits at the intersection of three domains: domain expertise, statistics and mathematics, and computer science.

### Exercise 2.1: Identifying the Three Pillars

Data Science combines three essential areas:

1. **Domain Expertise** - Understanding the problem context
   - Knowledge about the specific field or industry
   - Business understanding
   - Subject matter expertise

2. **Statistics & Mathematics** - Analyzing and interpreting data
   - Statistical inference
   - Probability theory
   - Linear algebra and calculus

3. **Computer Science** - Programming and computational thinking
   - Programming skills
   - Algorithms and data structures
   - Database management

**Task:** For the Titanic dataset, write down one example for each pillar:

**Your Answers:**

1. **Domain Expertise:** ___________________________________

2. **Statistics & Mathematics:** ___________________________________

3. **Computer Science:** ___________________________________

### Exercise 2.2: AI Hierarchy - From AI to Generative AI

Understanding the relationship between different AI concepts:

```
Artificial Intelligence (AI)
    ↓
Machine Learning (ML)
    ↓
Deep Learning (DL)
    ↓
Generative AI (GenAI) / LLMs
```

- **AI:** Broad field of making machines intelligent
- **Machine Learning:** AI systems that learn from data
- **Deep Learning:** ML using neural networks with many layers
- **Generative AI/LLMs:** DL models that can generate new content

**Real-World Applications:**
- Quality Control
- Predictive Maintenance
- Fraud Detection
- Autonomous Driving
- Recommendation Systems

**Discussion Question:** Can you think of an example for each level of the AI hierarchy?

Your answers: ___________________________________

### Exercise 2.3: Big Data Characteristics - The 5 V's

Big Data is characterized by five key properties:

**The 5 V's:**
- **Volume:** How much data? (Scale)
- **Velocity:** How fast is data generated? (Speed)
- **Variety:** What types of data? (Diversity)
- **Veracity:** How trustworthy is the data? (Quality)
- **Value:** What insights can we extract? (Utility)
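
Three of the V's can be measured directly on a table. The sketch below is a minimal illustration on a small made-up DataFrame (the `toy` table and its column names are hypothetical and not part of the Titanic data). Velocity and value depend on context and cannot be read off a static table.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table, invented purely for illustration
toy = pd.DataFrame({
    "ticket_price": [7.25, 71.28, np.nan, 8.05],
    "age": [22.0, 38.0, 26.0, np.nan],
    "port": ["S", "C", "S", None],
})

# Volume: number of rows and columns
n_rows, n_cols = toy.shape

# Variety: how many columns there are of each data type
dtype_counts = toy.dtypes.astype(str).value_counts()

# Veracity: number of missing values per column
missing_per_column = toy.isna().sum()

print(f"Volume: {n_rows} rows x {n_cols} columns")
print(dtype_counts)
print(missing_per_column)
```

The same three calls (`.shape`, `.dtypes`, `.isna().sum()`) work on any DataFrame, whatever its size.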

Let's analyze our Titanic dataset using this framework:

In [None]:
# TODO: Calculate the volume of our dataset
# Hint: Use df.shape to get number of rows and columns
num_rows = # YOUR CODE HERE
num_cols = # YOUR CODE HERE

print(f"Volume: {num_rows} rows × {num_cols} columns = {num_rows * num_cols} data points")
print("\nIs this 'Big Data'? What do you think?")

In [None]:
# TODO: Identify the variety in our dataset
# Hint: Use df.dtypes to see different data types
print("Data types in our dataset:")
# YOUR CODE HERE

# Count different types
print("\nVariety Analysis:")
print(f"Numerical columns: {len(df.select_dtypes(include=[np.number]).columns)}")
print(f"Categorical columns: {len(df.select_dtypes(include=['object', 'category']).columns)}")

In [None]:
# TODO: Check veracity (data quality) by identifying missing values
# Hint: Use df.isnull().sum()
print("Veracity - Missing Data Analysis:")
missing_values = # YOUR CODE HERE
print(missing_values[missing_values > 0])

**Reflection:** Analyze the Titanic dataset using all 5 V's:

1. **Volume:** ___________________________________
2. **Velocity:** ___________________________________
3. **Variety:** ___________________________________
4. **Veracity:** ___________________________________
5. **Value:** ___________________________________

---
## Part 3: Data Visualization & Storytelling (35 mins)

### Principles of Effective Visualization

Good visualizations tell a story! We can represent data using:
- **Length** - Bar charts
- **Position** - Scatter plots
- **Color** - Heatmaps, color-coded categories
- **Size** - Bubble charts
- **Shape** - Different markers

### Exercise 3.1: Creating Your First Visualizations

In [None]:
# TODO: Create a bar chart showing survival counts
# Hint: Use px.histogram() with x='survived'

fig = # YOUR CODE HERE
fig.update_layout(title='Survival Distribution on the Titanic',
                  xaxis_title='Survived (0=No, 1=Yes)',
                  yaxis_title='Number of Passengers')
fig.show()

**Question:** What story does this chart tell? 

Your answer: ___________________________________

In [None]:
# TODO: Create a visualization comparing survival counts by passenger class
# Hint: Use px.histogram() with x='pclass' and color='survived'

fig = # YOUR CODE HERE
fig.update_layout(title='Survival by Passenger Class',
                  xaxis_title='Passenger Class',
                  yaxis_title='Count',
                  barmode='group')
fig.show()

**Question:** Which passenger class had the highest survival rate? Why might this be? 

Your answer: ___________________________________

### Exercise 3.2: Age Distribution

In [None]:
# TODO: Create a histogram of passenger ages
# Hint: Use px.histogram() with x='age' and nbins=30

fig = # YOUR CODE HERE
fig.update_layout(title='Age Distribution of Titanic Passengers',
                  xaxis_title='Age (years)',
                  yaxis_title='Number of Passengers')
fig.show()

### Exercise 3.3: Multiple Variables - Gender, Class, and Survival

In [None]:
# TODO: Create a grouped bar chart showing survival by sex and class
# Hint: First group the data, then create a bar chart

survival_by_sex_class = df.groupby(['sex', 'pclass'])['survived'].mean().reset_index()

fig = px.bar(survival_by_sex_class, 
             x='pclass', 
             y='survived', 
             color='sex',
             barmode='group',
             title='Survival Rate by Gender and Passenger Class',
             labels={'survived': 'Survival Rate', 'pclass': 'Passenger Class'})
fig.update_layout(yaxis_tickformat='.0%')
fig.show()

**Question:** What patterns do you observe? 

Your answer: ___________________________________

### Exercise 3.4: Scatter Plot - Fare vs Age

In [None]:
# TODO: Create a scatter plot of Age vs Fare, colored by survival
# Hint: Use px.scatter() with x='age', y='fare', color='survived'

fig = # YOUR CODE HERE
fig.update_layout(title='Relationship between Age and Fare',
                  xaxis_title='Age (years)',
                  yaxis_title='Fare (£)')
fig.show()

---
## Part 4: Correlation vs Causation (20 mins)

### Understanding the Difference

**IMPORTANT:** Correlation measures how two variables move together, but it does NOT mean one causes the other!

**Correlation:** A statistical relationship between two variables
- Example: Ice cream sales and drowning deaths are correlated

**Causation:** One thing directly causes another
- Example: Smoking causes lung cancer

**Confounding Variable:** A third variable that influences both, creating a spurious correlation
- Example: Hot weather causes both more ice cream sales AND more swimming (drownings)
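
The ice cream example can be simulated. The sketch below uses invented numbers (the coefficients, noise levels, and variable names are all assumptions chosen for illustration): temperature drives both quantities, and neither affects the other. It then compares the raw correlation with the correlation left over after the linear effect of temperature is removed from both variables, which is one simple way to "control for" a confounder.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n_days = 2000

# Simulated confounder: daily temperature in degrees Celsius (invented numbers)
temperature = rng.normal(loc=20.0, scale=8.0, size=n_days)

# Both outcomes depend on temperature, but not on each other.
# They are kept continuous for simplicity.
ice_cream_sales = 50.0 + 3.0 * temperature + rng.normal(0.0, 10.0, n_days)
drownings = 2.0 + 0.2 * temperature + rng.normal(0.0, 1.5, n_days)


def residuals(y, x):
    """Return what is left of y after removing a straight-line fit on x."""
    slope, intercept = np.polyfit(x, y, deg=1)
    return y - (slope * x + intercept)


# Correlation without accounting for temperature
raw_corr = np.corrcoef(ice_cream_sales, drownings)[0, 1]

# Correlation after removing the temperature effect from both variables
partial_corr = np.corrcoef(
    residuals(ice_cream_sales, temperature),
    residuals(drownings, temperature),
)[0, 1]

print(f"Raw correlation:                   {raw_corr:.2f}")
print(f"After controlling for temperature: {partial_corr:.2f}")
```

In this simulation the raw correlation is clearly positive, while the controlled one is close to zero, because the only link between the two variables is the shared dependence on temperature.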

### Exercise 4.1: Computing Correlations

In [None]:
# TODO: Calculate correlation between numerical variables
# Hint: Use df.select_dtypes(include=[np.number]).corr()

correlation_matrix = # YOUR CODE HERE
print("Correlation Matrix:")
print(correlation_matrix)

In [None]:
# TODO: Create a heatmap of correlations
# Hint: Use px.imshow()

fig = px.imshow(correlation_matrix,
                text_auto='.2f',
                aspect='auto',
                title='Correlation Heatmap',
                color_continuous_scale='RdBu_r',
                zmin=-1, zmax=1)
fig.show()

### Exercise 4.2: Correlation vs Causation - The Fare and Survival Relationship

In [None]:
# TODO: Calculate the correlation between fare and survival
# Hint: Use df['fare'].corr(df['survived'])

fare_survival_corr = # YOUR CODE HERE
print(f"Correlation between Fare and Survival: {fare_survival_corr:.3f}")

**Critical Thinking Questions:**

1. Does paying a higher fare CAUSE better survival? 
   
   Your answer: ___________________________________

2. What might be the real reason for this relationship? (Think about what fare represents)
   
   Your answer: ___________________________________

3. Can you identify a **confounding variable**?
   
   Your answer: ___________________________________

In [None]:
# TODO: Investigate the confounding variable by examining passenger class
# Calculate correlation between pclass and survived, and between pclass and fare

pclass_survival_corr = # YOUR CODE HERE
pclass_fare_corr = # YOUR CODE HERE

print(f"Correlation between Passenger Class and Survival: {pclass_survival_corr:.3f}")
print(f"Correlation between Passenger Class and Fare: {pclass_fare_corr:.3f}")
print("\nNote: Negative correlation for pclass means lower class numbers (1st class) correlate with higher survival")
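
A common way to probe a suspected confounder is stratification: compute the correlation separately within each level of the confounder and compare it with the overall value. The sketch below uses synthetic data (the `toy` table, its column names, and all numbers are hypothetical), constructed so that price has no effect on the outcome within a class. The real Titanic data need not behave this cleanly.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=1)
n_per_class = 2000

# Synthetic data: the class sets both the typical price and the outcome
# probability; inside a class, price and outcome are independent.
frames = []
for ticket_class, mean_price, p_outcome in [(1, 80.0, 0.60), (2, 20.0, 0.45), (3, 10.0, 0.25)]:
    frames.append(pd.DataFrame({
        "ticket_class": ticket_class,
        "price": rng.normal(mean_price, 5.0, n_per_class),
        "outcome": rng.binomial(1, p_outcome, n_per_class),
    }))
toy = pd.concat(frames, ignore_index=True)

# Correlation across all rows, ignoring the class
overall_corr = toy["price"].corr(toy["outcome"])

# Correlation within each class (stratified analysis)
within_corr = {
    cls: group["price"].corr(group["outcome"])
    for cls, group in toy.groupby("ticket_class")
}

print(f"Overall correlation: {overall_corr:.2f}")
for cls, corr in within_corr.items():
    print(f"Within class {cls}: {corr:.2f}")
```

If the overall correlation is clearly non-zero but the within-group correlations are near zero, the grouping variable accounts for the relationship. If the within-group correlations stay sizeable, the original variable carries information beyond the grouping.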

### Exercise 4.3: Visualizing Confounding Variables

In [None]:
# Create a visualization showing how passenger class confounds the fare-survival relationship
fig = px.scatter(df, x='fare', y='survived', color='pclass',
                 title='Fare vs Survival: Passenger Class as Confounding Variable',
                 labels={'pclass': 'Passenger Class'},
                 hover_data=['sex', 'age'])
fig.update_layout(xaxis_title='Fare (£)', yaxis_title='Survived (0=No, 1=Yes)')
fig.show()

**Discussion:** How does this visualization help us see that passenger class, rather than fare itself, may be driving the relationship?

Your answer: ___________________________________

---
## Part 5: Recognizing Poor Visualizations (10 mins)

### What Makes a Bad Visualization?

Common problems:
- Too many categories (unreadable)
- Wrong chart type for the data
- Misleading scales or axes
- Poor color choices
- Missing labels or context

### Exercise 5.1: Creating a Misleading Visualization

Let's intentionally create a poor visualization to understand what NOT to do!

In [None]:
# Poor visualization example: Pie chart with too many categories
age_groups = pd.cut(df['age'].dropna(), bins=20)
age_counts = age_groups.value_counts()  # compute the counts once and reuse them
fig = px.pie(values=age_counts.values,
             names=age_counts.index.astype(str),
             title='Age Distribution (Poor Visualization - Too Many Slices!)')
fig.show()

**Question:** What makes this visualization poor? List at least 3 reasons:

1. ___________________________________
2. ___________________________________
3. ___________________________________

In [None]:
# TODO: Create a BETTER version of the above visualization
# Hint: Use a histogram instead of a pie chart

# YOUR CODE HERE

---
## Part 6: Summary & Reflection (10 mins)

### Key Takeaways

Today we learned:
- Data science combines three pillars: domain expertise, statistics/mathematics, and computer science
- The AI hierarchy: AI → Machine Learning → Deep Learning → Generative AI/LLMs
- The 5 V's of Big Data: Volume, Velocity, Variety, Veracity, Value
- Data sources: open-source (Kaggle, WHO), private, commercial
- Different visualization types serve different purposes (bar charts, scatter plots, heatmaps)
- **Correlation ≠ Causation!** Watch out for confounding variables
- Good visualizations tell clear stories without misleading

### Reflection Questions

1. What was the most interesting insight you discovered about the Titanic data?

   Your answer: ___________________________________

2. Which type of visualization did you find most useful and why?

   Your answer: ___________________________________

3. Can you think of a real-world application where understanding correlation vs causation is critical?

   Your answer: ___________________________________

4. Which of the three pillars (Domain, Statistics, Computer Science) do you feel most comfortable with? Which needs more work?

   Your answer: ___________________________________

---
## Bonus Challenges (Optional)

If you finish early, try these additional exercises:

### Bonus 1: Box Plot for Fare by Class

In [None]:
# TODO: Create a box plot showing fare distribution by passenger class
# Hint: Use px.box()

# YOUR CODE HERE

### Bonus 2: Survival Rate by Embarkation Port

In [None]:
# TODO: Calculate and visualize survival rates by embarkation port
# Hint: Group by 'embarked' and calculate mean of 'survived'

# YOUR CODE HERE

### Bonus 3: Family Size Analysis

This is an important feature we'll use later in the course!

In [None]:
# TODO: Create a new feature 'family_size' = sibsp + parch + 1
# Then visualize survival rate by family size
# Hint: Use groupby and mean, then create a line or bar chart

# YOUR CODE HERE

# What patterns do you notice? Do small or large families have better survival rates?

---
## Resources for Further Learning

- **Pandas Documentation:** https://pandas.pydata.org/docs/
- **Plotly Gallery:** https://plotly.com/python/
- **Data Visualization Best Practices:** https://www.storytellingwithdata.com/
- **Correlation vs Causation:** https://www.tylervigen.com/spurious-correlations
- **Kaggle Datasets:** https://www.kaggle.com/datasets
- **WHO Open Data:** https://www.who.int/data

**See you on Day 4 for Data Preparation & Feature Engineering!**