# Titanic Dataset - Comprehensive Practice Exercises

## 📊 Data Source & Quick Setup

### Download the Dataset
```python
# Method 1: Direct download from URL
import pandas as pd
url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
titanic_df = pd.read_csv(url)

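# Method 2: From seaborn (if installed). Use ONE method only; running this
# after Method 1 overwrites titanic_df.
# Note: seaborn's copy uses lowercase column names (survived, pclass, sex, ...)
# and a different column set, so rename columns before following the exercises.
# import seaborn as sns
# titanic_df = sns.load_dataset('titanic')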

# Method 3: Manual download
# Download from: https://www.kaggle.com/c/titanic/data
# titanic_df = pd.read_csv('titanic.csv')
```

### Dataset Overview
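The Titanic dataset contains information about 891 passengers on the Titanic with the following columns:
- **PassengerId**: Unique passenger identifier
- **Survived**: Survival (0 = No, 1 = Yes)
- **Pclass**: Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
- **Name**: Passenger name, including title (Mr, Mrs, Miss, etc.)
- **Sex**: Sex of the passenger (male / female)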
- **Age**: Age in years
- **SibSp**: # of siblings/spouses aboard
- **Parch**: # of parents/children aboard
- **Ticket**: Ticket number
- **Fare**: Passenger fare
- **Cabin**: Cabin number
- **Embarked**: Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

---

## 🎯 Problem Statements - Organized by Difficulty

## 🟢 Level 1: Basic Data Exploration & Cleaning

### Exercise 1.1: Initial Data Inspection
```python
# Problem Statement:
"""
1. Load the Titanic dataset and display the first 10 rows
2. Get basic information about the dataset (shape, columns, data types)
3. Display summary statistics for numerical columns
4. Check for missing values in each column
"""
```
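
As a starting point, the inspection steps above might look like the sketch below. It uses a tiny hand-made sample (hypothetical values, not real passenger records) so it runs without a download; `sample_df` and `missing_counts` are illustrative names. Swap in `titanic_df` once the dataset is loaded.

```python
import pandas as pd

# Hypothetical six-row sample that mimics a few Titanic columns (not real data).
sample_df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Pclass": [3, 1, 3, 2, 1, 3],
    "Sex": ["male", "female", "female", "male", "female", "male"],
    "Age": [22.0, 38.0, None, 35.0, 54.0, None],
    "Fare": [7.25, 71.28, 7.92, 13.0, 51.86, 8.05],
    "Embarked": ["S", "C", "S", None, "S", "Q"],
})

print(sample_df.head(10))      # first 10 rows (only 6 exist in this sample)
print(sample_df.shape)         # (number of rows, number of columns)
print(sample_df.dtypes)        # data type of each column
print(sample_df.describe())    # summary statistics for numerical columns

# Count missing values per column
missing_counts = sample_df.isna().sum()
print(missing_counts)
```

`sample_df.info()` prints shape, dtypes, and non-null counts in a single call and is a handy alternative for step 2.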

### Exercise 1.2: Data Cleaning Fundamentals
```python
"""
1. Handle missing values:
   - Fill missing 'Age' values with the median age
   - Fill missing 'Embarked' with the most frequent value
   - Drop the 'Cabin' column (too many missing values)
2. Create a new column 'FamilySize' = SibSp + Parch + 1
3. Convert 'Sex' to numerical values (male: 0, female: 1)
4. Create age groups: Child (0-12), Teen (13-19), Adult (20-59), Senior (60+)
"""
```
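
One possible way to chain the cleaning steps above, shown on a small hypothetical sample (made-up values; `sample_df` and `clean_df` are illustrative names). The age bins follow the ranges given in this exercise and are right-closed, so a fractional age such as 12.5 would land in "Teen"; adjust the edges if you prefer a different convention.

```python
import pandas as pd

# Hypothetical sample that mimics the relevant Titanic columns (not real data).
sample_df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "female"],
    "Age": [4.0, 38.0, None, 16.0, 65.0],
    "SibSp": [1, 1, 0, 0, 0],
    "Parch": [2, 0, 0, 0, 0],
    "Cabin": [None, "C85", None, None, "E46"],
    "Embarked": ["S", "C", "S", None, "S"],
})

clean_df = sample_df.copy()

# 1. Missing values: median for Age, most frequent value for Embarked, drop Cabin
clean_df["Age"] = clean_df["Age"].fillna(clean_df["Age"].median())
clean_df["Embarked"] = clean_df["Embarked"].fillna(clean_df["Embarked"].mode()[0])
clean_df = clean_df.drop(columns=["Cabin"])

# 2. Family size counts the passenger themselves (+1)
clean_df["FamilySize"] = clean_df["SibSp"] + clean_df["Parch"] + 1

# 3. Encode Sex as numbers
clean_df["Sex"] = clean_df["Sex"].map({"male": 0, "female": 1})

# 4. Age groups with right-closed bins: [0, 12], (12, 19], (19, 59], (59, inf)
clean_df["AgeGroup"] = pd.cut(
    clean_df["Age"],
    bins=[0, 12, 19, 59, float("inf")],
    labels=["Child", "Teen", "Adult", "Senior"],
    include_lowest=True,
)

print(clean_df)
```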

### Exercise 1.3: Basic Analysis
```python
"""
1. What is the overall survival rate?
2. How many passengers were in each class?
3. What is the average age of passengers?
4. Which embarkation port had the most passengers?
"""
```
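
Each of the four questions above maps to roughly one pandas call. The sketch below illustrates this on a hypothetical eight-row sample (made-up values, so the printed numbers say nothing about the real dataset); the variable names are illustrative.

```python
import pandas as pd

# Hypothetical sample (not real passenger data).
sample_df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0, 0, 1],
    "Pclass": [3, 1, 3, 2, 1, 3, 3, 2],
    "Age": [22.0, 38.0, 26.0, 35.0, 54.0, 2.0, 27.0, 14.0],
    "Embarked": ["S", "C", "S", "S", "S", "Q", "S", "C"],
})

# 1. Survived is coded 0/1, so its mean is the survival rate
survival_rate = sample_df["Survived"].mean()

# 2. Passengers per class, ordered by class number
class_counts = sample_df["Pclass"].value_counts().sort_index()

# 3. Average age (NaN values are skipped by default)
average_age = sample_df["Age"].mean()

# 4. Port with the most passengers
busiest_port = sample_df["Embarked"].value_counts().idxmax()

print(f"Survival rate: {survival_rate:.1%}")
print(class_counts)
print(f"Average age: {average_age:.2f}")
print(f"Busiest port: {busiest_port}")
```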

---

## 🟡 Level 2: Intermediate Analysis & Visualization

### Exercise 2.1: Demographic Analysis
```python
"""
1. Analyze survival by gender:
   - What percentage of women survived?
   - What percentage of men survived?
   - Create a bar chart comparing survival rates by gender

2. Analyze survival by passenger class:
   - Calculate survival rates for each class
   - Visualize with a stacked bar chart
   - Which class had the highest survival rate?

3. Age analysis:
   - What is the average age of survivors vs non-survivors?
   - Create a histogram of ages for survivors and non-survivors
"""
```
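
A minimal sketch of the grouping behind these questions, assuming a hypothetical eight-row sample (made-up values, not real passenger records). Plotting is left out so the block runs without matplotlib.

```python
import pandas as pd

# Hypothetical sample (not real passenger data).
sample_df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 0, 0, 1, 1],
    "Pclass": [3, 1, 2, 2, 1, 3, 3, 1],
    "Sex": ["male", "female", "female", "male", "female", "male", "male", "female"],
    "Age": [22.0, 38.0, 26.0, 35.0, 54.0, 2.0, 27.0, 14.0],
})

# 1. Percentage of each sex that survived
rate_by_sex = sample_df.groupby("Sex")["Survived"].mean() * 100

# 2. Survival rate per class, plus raw counts (rows: class, columns: 0 = died, 1 = survived)
rate_by_class = sample_df.groupby("Pclass")["Survived"].mean()
counts_by_class = pd.crosstab(sample_df["Pclass"], sample_df["Survived"])
best_class = rate_by_class.idxmax()

# 3. Average age of non-survivors (0) and survivors (1)
mean_age_by_outcome = sample_df.groupby("Survived")["Age"].mean()

print(rate_by_sex)
print(counts_by_class)
print(f"Highest survival rate in this sample: class {best_class}")
print(mean_age_by_outcome)
```

With matplotlib installed, `rate_by_sex.plot(kind="bar")` and `counts_by_class.plot(kind="bar", stacked=True)` produce the bar chart and stacked bar chart, and `sample_df.groupby("Survived")["Age"].plot(kind="hist", alpha=0.5)` overlays the two age histograms.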

### Exercise 2.2: Family & Fare Analysis
```python
"""
1. Family size impact:
   - Create family size categories: Alone, Small (2-4), Large (5+)
   - Calculate survival rate for each family size category
   - Does traveling with family increase survival chances?

2. Fare analysis:
   - What is the average fare for each passenger class?
   - Create fare categories: Low (<10), Medium (10-50), High (>50)
   - Analyze survival rates by fare categories

3. Combined factors:
   - Find survival rate for women in 1st class vs men in 3rd class
   - What was the survival rate for children in each class?
"""
```
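
The category definitions above can be sketched with `pd.cut` for family size and `np.select` for fares. This is one approach among several, shown on a hypothetical sample (made-up values). `np.select` is used for fares so that the boundaries match the text exactly: below 10 is Low, 10 through 50 is Medium, above 50 is High. It assumes `Fare` has no missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical sample (not real passenger data).
sample_df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "SibSp": [0, 1, 0, 3, 1, 4],
    "Parch": [0, 0, 0, 2, 2, 2],
    "Fare": [7.25, 71.28, 7.92, 31.28, 120.0, 46.9],
})

# Family size includes the passenger: 1 = Alone, 2-4 = Small, 5+ = Large
family_size = sample_df["SibSp"] + sample_df["Parch"] + 1
sample_df["FamilyCategory"] = pd.cut(
    family_size,
    bins=[0, 1, 4, float("inf")],
    labels=["Alone", "Small", "Large"],
)

# Fare categories; conditions are checked in order, first match wins
fare = sample_df["Fare"]
sample_df["FareCategory"] = np.select(
    [fare < 10, fare <= 50],
    ["Low", "Medium"],
    default="High",
)

# observed=True keeps only the categories that actually occur
rate_by_family = sample_df.groupby("FamilyCategory", observed=True)["Survived"].mean()
rate_by_fare = sample_df.groupby("FareCategory")["Survived"].mean()

print(rate_by_family)
print(rate_by_fare)
```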

### Exercise 2.3: Data Transformation
```python
"""
1. Create a new DataFrame that includes:
   - Passenger title (extracted from the Name column: Mr, Mrs, Miss, etc.)
   - Age group
   - Family size category
   - Fare category
   - Survival status

2. Calculate survival rates by title (Mr, Mrs, Miss, Master, etc.)
3. Find the most common ticket for each passenger class
"""
```
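
Names in the dataset follow the pattern `Surname, Title. Given names`, so a regular expression can pull out the title. The sketch below assumes that pattern and uses hypothetical names, tickets, and outcomes (none are real passengers).

```python
import pandas as pd

# Hypothetical sample in the "Surname, Title. Given names" format (not real data).
sample_df = pd.DataFrame({
    "Name": [
        "Doe, Mr. John",
        "Doe, Mrs. Jane (Mary Smith)",
        "Roe, Miss. Anna",
        "Poe, Master. Tim",
        "Moe, Mr. Sam",
    ],
    "Survived": [0, 1, 1, 1, 0],
    "Pclass": [3, 1, 3, 3, 1],
    "Ticket": ["A1", "B2", "A1", "C3", "B2"],
})

# Capture the text between the comma and the first period
sample_df["Title"] = (
    sample_df["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
)

# Survival rate per title
rate_by_title = sample_df.groupby("Title")["Survived"].mean()

# Most common ticket per class (mode() may return ties; take the first)
top_ticket_by_class = sample_df.groupby("Pclass")["Ticket"].agg(
    lambda tickets: tickets.mode().iloc[0]
)

print(sample_df[["Name", "Title"]])
print(rate_by_title)
print(top_ticket_by_class)
```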

---

## 🔴 Level 3: Advanced Analysis & Insights

### Exercise 3.1: Complex Grouping & Aggregation
```python
"""
1. Create a comprehensive summary table showing:
   - Count of passengers
   - Survival rate
   - Average age
   - Average fare
   For each combination of: Pclass × Sex × Embarked

2. Calculate the survival probability for:
   - A 30-year-old female in 1st class
   - A 45-year-old male in 3rd class
   - A child (under 12) in 2nd class

3. Find the 5 most expensive tickets and analyze their survival rate
"""
```
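
The summary table in step 1 fits pandas named aggregation well. Below is a sketch on a hypothetical six-row sample (made-up values); the output column names (`passengers`, `survival_rate`, and so on) are illustrative. Note that `groupby` drops rows whose group key is missing by default, which matters for the few passengers without an `Embarked` value.

```python
import pandas as pd

# Hypothetical sample (not real passenger data).
sample_df = pd.DataFrame({
    "Survived": [1, 1, 0, 0, 1, 0],
    "Pclass": [1, 1, 3, 3, 2, 3],
    "Sex": ["female", "female", "male", "male", "female", "male"],
    "Embarked": ["C", "C", "S", "S", "S", "S"],
    "Age": [30.0, 40.0, 20.0, 30.0, 25.0, None],
    "Fare": [80.0, 100.0, 8.0, 7.0, 20.0, 9.0],
})

# One row per Pclass x Sex x Embarked combination that occurs in the data
summary = (
    sample_df.groupby(["Pclass", "Sex", "Embarked"])
    .agg(
        passengers=("Survived", "count"),
        survival_rate=("Survived", "mean"),
        avg_age=("Age", "mean"),      # NaN ages are skipped
        avg_fare=("Fare", "mean"),
    )
    .reset_index()
)

# Step 3: the five highest fares and the survival rate among them
top_fares = sample_df.nlargest(5, "Fare")
top_fare_survival = top_fares["Survived"].mean()

print(summary)
print(f"Survival rate among the 5 highest fares: {top_fare_survival:.0%}")
```

For step 2, an empirical estimate can be read off by filtering to a comparable subgroup (for example, women in 1st class within an age band) and taking the mean of `Survived`; how wide to make the band is a judgment call.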

### Exercise 3.2: Correlation & Pattern Finding
```python
"""
1. Calculate correlation matrix for numerical features
2. Identify which factors are most strongly correlated with survival
3. Create a pivot table showing survival rates by:
   - Age groups vs Passenger class
   - Family size vs Gender
   - Embarkation port vs Passenger class

4. Find interesting patterns:
   - Were there any families where some members survived and others didn't?
   - What was the survival rate of passengers traveling alone?
   - Did having a higher fare within the same class increase survival chances?
"""
```
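
A short sketch of the correlation matrix and one of the pivot tables, again on a hypothetical sample (made-up values, so the coefficients themselves are meaningless). Text columns are excluded before calling `corr()`.

```python
import pandas as pd

# Hypothetical sample (not real passenger data).
sample_df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Pclass": [3, 1, 2, 3, 1, 2],
    "Sex": ["male", "female", "female", "male", "male", "female"],
    "Age": [22.0, 38.0, 26.0, 35.0, 54.0, 30.0],
    "Fare": [7.25, 71.28, 13.0, 8.05, 51.86, 10.5],
})

# 1. Correlation matrix over numerical columns only
corr = sample_df.select_dtypes(include="number").corr()

# 2. Rank features by absolute correlation with Survived
survival_corr = (
    corr["Survived"].drop("Survived").abs().sort_values(ascending=False)
)

# 3. Pivot table: survival rate by Sex (rows) and Pclass (columns)
pivot = sample_df.pivot_table(
    values="Survived", index="Sex", columns="Pclass", aggfunc="mean"
)

print(corr.round(2))
print(survival_corr)
print(pivot)
```

Combinations with no passengers appear as `NaN` in the pivot table. Correlation here is Pearson's by default, which only captures linear relationships.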

### Exercise 3.3: Feature Engineering
```python
"""
1. Extract titles from names (Mr, Mrs, Miss, Master, etc.)
2. Create an 'IsChild' column (Age < 12)
3. Create an 'IsAlone' column (FamilySize == 1)
4. Create fare per person (Fare / FamilySize)
5. Create age × class interaction term
6. Bin ages into quantiles (equal number of passengers in each age group)
"""
```
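
Steps 2 through 6 above can be sketched as follows on a hypothetical eight-row sample (made-up values). The column names `AgeClass` and `AgeQuartile` are illustrative, and four quantile bins is an arbitrary choice.

```python
import pandas as pd

# Hypothetical sample (not real passenger data).
sample_df = pd.DataFrame({
    "Pclass": [3, 2, 3, 3, 1, 1, 3, 1],
    "Age": [4.0, 11.0, 12.0, 30.0, 45.0, 60.0, 22.0, 35.0],
    "SibSp": [1, 0, 0, 0, 1, 0, 0, 1],
    "Parch": [1, 2, 0, 0, 0, 0, 0, 1],
    "Fare": [30.0, 15.0, 8.0, 7.0, 50.0, 26.0, 9.0, 60.0],
})

sample_df["FamilySize"] = sample_df["SibSp"] + sample_df["Parch"] + 1

# Binary flags stored as 0/1
sample_df["IsChild"] = (sample_df["Age"] < 12).astype(int)
sample_df["IsAlone"] = (sample_df["FamilySize"] == 1).astype(int)

# Fare split evenly across family members (FamilySize is always >= 1)
sample_df["FarePerPerson"] = sample_df["Fare"] / sample_df["FamilySize"]

# Age x class interaction term
sample_df["AgeClass"] = sample_df["Age"] * sample_df["Pclass"]

# Quantile bins: roughly equal passenger counts per bin
sample_df["AgeQuartile"] = pd.qcut(
    sample_df["Age"], q=4, labels=["Q1", "Q2", "Q3", "Q4"]
)

print(sample_df)
```

On real data, `pd.qcut` leaves missing ages as `NaN`, so fill or flag them first.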

---

## 🏆 Level 4: Real-World Scenarios

### Exercise 4.1: Complete Data Analysis Report
```python
"""
Create a comprehensive analysis that answers these business questions:

1. **Passenger Profile Analysis:**
   - What was the typical passenger profile?
   - How did passenger demographics vary by class?

2. **Survival Factors Investigation:**
   - What were the key factors that influenced survival?
   - Was the "women and children first" policy followed?
   - How did wealth (as indicated by class and fare) affect survival?

3. **Recommendations:**
   - If you were designing safety protocols based on this data, what would you recommend?
   - What passenger characteristics would you prioritize for lifeboat access?
"""
```

### Exercise 4.2: Predictive Features Preparation
```python
"""
Prepare the dataset for machine learning by:

1. Handling all missing values appropriately
2. Encoding categorical variables
3. Creating new meaningful features
4. Removing irrelevant columns
5. Normalizing numerical features
6. Creating a clean, analysis-ready dataset

Final dataset should have these features:
- Pclass, Sex, Age, SibSp, Parch, Fare, Embarked
- FamilySize, IsAlone, Title, AgeGroup, FarePerPerson
- (All properly encoded and scaled)
"""
```
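
A condensed sketch of the preparation pipeline, assuming a hypothetical four-row sample (made-up values) and only a subset of the listed features. Scaling here is a plain z-score computed with pandas; a library scaler such as scikit-learn's `StandardScaler` is a common alternative and uses the population standard deviation instead.

```python
import pandas as pd

# Hypothetical sample (not real passenger data).
sample_df = pd.DataFrame({
    "Pclass": [3, 1, 3, 1],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, None, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.92, 53.1],
    "Embarked": ["S", "C", None, "S"],
})

prepared = sample_df.copy()

# 1. Missing values
prepared["Age"] = prepared["Age"].fillna(prepared["Age"].median())
prepared["Embarked"] = prepared["Embarked"].fillna(prepared["Embarked"].mode()[0])

# 2. Encoding: binary map for Sex, one-hot columns for Embarked
prepared["Sex"] = prepared["Sex"].map({"male": 0, "female": 1})
prepared = pd.get_dummies(prepared, columns=["Embarked"], dtype=int)

# 5. Normalization: z-score (subtract mean, divide by sample standard deviation)
numeric_cols = ["Age", "Fare"]
prepared[numeric_cols] = (
    prepared[numeric_cols] - prepared[numeric_cols].mean()
) / prepared[numeric_cols].std()

print(prepared)
```

When a train/test split is involved, compute the median, mode, mean, and standard deviation on the training portion only and reuse them on the test portion to avoid leakage.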

### Exercise 4.3: Data Storytelling
```python
"""
Create a compelling data story with visualizations:

1. **The Tragedy in Numbers:** Overall statistics and key figures
2. **Survival Inequalities:** How different factors affected survival
3. **Passenger Stories:** Find and highlight interesting individual cases
4. **Lessons Learned:** Key takeaways from the data

Required visualizations:
- Survival rate by different demographics
- Distribution of passengers across different categories
- Correlation heatmap
- Age distribution of survivors vs non-survivors
"""
```

---
## 🎯 Expected Learning Outcomes

After completing these exercises, students should be able to:

### Technical Skills
- Handle real-world messy data with missing values
- Perform advanced data filtering and grouping
- Create meaningful data visualizations
- Extract insights from complex datasets
- Prepare data for machine learning

### Analytical Skills
- Formulate and test hypotheses
- Identify patterns and correlations
- Draw meaningful conclusions from data
- Communicate findings effectively
- Think critically about data quality and biases

### Business Understanding
- Translate business questions into data analysis
- Provide actionable recommendations
- Understand the ethical implications of data analysis
- Create compelling data stories

---

## 💡 Pro Tips for Students

1. **Always start with data understanding** - explore before analyzing
2. **Document your assumptions** - especially for handling missing data
3. **Validate your findings** - cross-check with multiple approaches
4. **Consider the context** - historical and social factors matter
5. **Visualize effectively** - choose the right chart for your message


## Sample Sales Dataset

```python
import pandas as pd
import numpy as np

# Set random seed for reproducibility
np.random.seed(42)

# Generate sample dataset
num_rows = 50
order_ids = np.arange(1001, 1001 + num_rows)
products = np.random.choice(
    ["Wireless Mouse", "Keyboard", "Laptop Stand", "USB Hub", "Headphones",
     "Webcam", "Monitor", "Smartphone", "Tablet", "Charger"],
    size=num_rows
)
quantities = np.random.randint(1, 10, size=num_rows)
unit_prices = np.random.choice([15, 25, 50, 75, 100, 150, 200, 250], size=num_rows)
total_revenue = quantities * unit_prices
locations = np.random.choice(
    ["Lagos", "Abuja", "Kano", "Port Harcourt", "Ibadan", "Enugu"], size=num_rows
)
categories = np.random.choice(
    ["Electronics", "Accessories", "Computers", "Mobile Devices"], size=num_rows
)

# Create DataFrame
df = pd.DataFrame({
    "Order ID": order_ids,
    "Product Name": products,
    "Quantity": quantities,
    "Unit Price": unit_prices,
    "Total Revenue": total_revenue,
    "Customer Location": locations,
    "Product Category": categories
})

# Display first few rows
print(df.head())

# Save as CSV
df.to_csv("sample_sales_dataset.csv", index=False)
print("✅ Dataset saved as 'sample_sales_dataset.csv'")
```

Output:

```
   Order ID Product Name  Quantity  Unit Price  Total Revenue  \
0      1001      Monitor         3          50            150   
1      1002      USB Hub         1          75             75   
2      1003   Smartphone         4         250           1000   
3      1004   Headphones         2         150            300   
4      1005      Monitor         8          50            400   

  Customer Location Product Category  
0             Enugu      Accessories  
1     Port Harcourt        Computers  
2             Abuja      Accessories  
3             Abuja        Computers  
4            Ibadan      Electronics  
✅ Dataset saved as 'sample_sales_dataset.csv'
```


A second version with underscore column names, saved for import into MySQL:

```python
import pandas as pd
import numpy as np

# Sample data (a missing comma after 'Monitor' previously merged two product names)
products = ['Laptop', 'Phone', 'Tablet', 'Headphones', 'Charger',
            'Monitor', 'Wireless Mouse', 'Keyboard', 'Laptop Stand', 'Webcam']
categories = ['Electronics', 'Accessories', 'Gadgets', 'Computers', 'Mobile Devices']
locations = ['Lagos', 'Abuja', 'Kano', 'Port Harcourt', 'Ibadan', 'Enugu']

# Build 50 random orders
data = {
    'Order_ID': [f"ORD{1000+i}" for i in range(50)],
    'Product_Name': np.random.choice(products, 50),
    'Quantity': np.random.randint(1, 10, 50),
    'Unit_Price': np.random.randint(5000, 200000, 50),
    'Customer_Location': np.random.choice(locations, 50),
    'Product_Category': np.random.choice(categories, 50)
}

df = pd.DataFrame(data)
# Total_Revenue is added last, so it is the final column in the CSV
df['Total_Revenue'] = df['Quantity'] * df['Unit_Price']

# Save to CSV for import into MySQL
df.to_csv("sample_sales_data.csv", index=False)
```


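Load the CSV into the `sales` table in MySQL:

```python
import os
import mysql.connector

conn = mysql.connector.connect(
    host="localhost",
    user="root",
    # Read the password from an environment variable instead of hardcoding it.
    # MYSQL_PASSWORD is an assumed variable name; set it before running.
    password=os.environ["MYSQL_PASSWORD"],
    database="sales_db",
    allow_local_infile=True  # 👈 Required for LOAD DATA LOCAL INFILE (the server must also enable local_infile)
)

cursor = conn.cursor()

# The column list matches the CSV column order (Total_Revenue is last).
# '\\n' sends the two-character escape \n to MySQL rather than a raw line break.
cursor.execute("""
LOAD DATA LOCAL INFILE 'C:/Users/USER/Desktop/python/DATA/sample_sales_data.csv'
INTO TABLE sales
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\\n'
IGNORE 1 ROWS
(Order_ID, Product_Name, Quantity, Unit_Price, Customer_Location, Product_Category, Total_Revenue)
""")

conn.commit()
cursor.close()
conn.close()
```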
