# Day 1: Data Loading & Exploratory Data Analysis (EDA)

## 🎯 Welcome to Your 7-Day ML Journey!

**Today's Focus**: Understanding our data before we do anything else

This is Day 1 of your structured ML learning path. Today we're laying the foundation by thoroughly exploring the Titanic dataset. This might seem "basic," but a common rule of thumb holds that **data scientists spend roughly 80% of their time understanding and preparing data, and only about 20% building models.**

### What We'll Accomplish Today:
1. 📊 Load and understand the Titanic dataset structure
2. 🔍 Identify missing values and data quality issues  
3. 📈 Explore distributions of all features
4. 🔗 Analyze relationships between variables
5. 🎯 Set up the foundation for tomorrow's data cleaning

### Why Start Here?
Every successful ML project starts with understanding the data. Today's exploration will directly inform:
- **Day 2**: What data cleaning strategies to use
- **Day 3**: Which features to include in our model
- **Day 4**: How to interpret our model's performance

Let's dive in! 🚀

## 1. Setting Up Our Environment

First, let's import all the libraries we'll need and understand what each one does:

In [1]:
# Data manipulation and analysis
import pandas as pd
import numpy as np

# Data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# For displaying all columns in pandas
pd.set_option('display.max_columns', None)

# Set up matplotlib for better plots
plt.style.use('default')
plt.rcParams['figure.figsize'] = (10, 6)
plt.rcParams['font.size'] = 12

# Set seaborn's default color palette
sns.set_palette("husl")

# Ignore warnings for cleaner output
import warnings
warnings.filterwarnings('ignore')

print("✅ All libraries imported successfully!")
print(f"📊 Pandas version: {pd.__version__}")
print(f"📈 Matplotlib version: {plt.matplotlib.__version__}")
print(f"🎨 Seaborn version: {sns.__version__}")

✅ All libraries imported successfully!
📊 Pandas version: 2.3.1
📈 Matplotlib version: 3.10.5
🎨 Seaborn version: 0.13.2


### 🧠 Remember This?
- **Pandas**: Your main tool for data manipulation (think Excel, but much more powerful)
- **NumPy**: Handles mathematical operations and arrays efficiently
- **Matplotlib**: The foundation plotting library (like the engine of a car)
- **Seaborn**: Makes Matplotlib prettier and easier to use (like the car's interior)

The configuration above ensures our plots will be readable and consistent throughout our analysis.

## 2. Loading the Titanic Dataset

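We'll use the Titanic dataset that Seaborn provides through `sns.load_dataset`, which downloads it from Seaborn's online example-data repository on first use and caches it locally. It covers the same 891 passengers as the Kaggle training set used in countless ML tutorials, but with renamed columns, a few derived ones (such as `who` and `alone`), and without the passenger name, ticket number, and full cabin fields.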

In [None]:
# Load the Titanic dataset
titanic = sns.load_dataset('titanic')

print("📋 Dataset loaded successfully!")
print(f"📏 Dataset shape: {titanic.shape}")
print(f"👥 Number of passengers: {titanic.shape[0]}")
print(f"📊 Number of features: {titanic.shape[1]}")

### 🎯 The ML Problem We're Solving

**Problem Type**: Binary Classification  
**Goal**: Predict whether a passenger survived the Titanic disaster  
**Target Variable**: `survived` (0 = died, 1 = survived)  
**Features**: Everything else (passenger class, age, gender, etc.), except `alive`, which restates the target and would leak the answer

This is a **supervised learning** problem because we have historical data with known outcomes (who survived and who didn't).
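
As a rough sketch of how this framing maps onto code, the snippet below separates a target column from the feature columns. The tiny DataFrame is made up for illustration and only mimics a few of the dataset's column names; `alive` is dropped as well because it restates the target.

```python
import pandas as pd

# Toy frame with made-up rows; only the column names mirror the real dataset.
toy = pd.DataFrame({
    "survived": [0, 1, 1, 0],
    "pclass": [3, 1, 3, 2],
    "sex": ["male", "female", "female", "male"],
    "alive": ["no", "yes", "yes", "no"],
})

target_col = "survived"
leaky_cols = ["alive"]  # restates the target, so it must not be a feature

y = toy[target_col]
X = toy.drop(columns=[target_col] + leaky_cols)

print(list(X.columns))  # feature names
print(y.tolist())       # known outcomes used for supervision
```

The same two lines should work on the full `titanic` frame once the redundant columns have been decided on.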

## 3. First Look at Our Data

Let's examine the structure and get our first glimpse of the data:

In [None]:
# Display the first few rows
print("🔍 First 5 rows of our dataset:")
titanic.head()

In [None]:
# Get detailed information about the dataset
print("📋 Dataset Information:")
titanic.info()

In [None]:
# Get basic statistics for numerical columns
print("📊 Basic Statistics for Numerical Features:")
titanic.describe()

### 🧠 Understanding Each Feature

Let's break down what each column means:

| Feature | Description | Type | ML Relevance |
|---------|-------------|------|-------------|
| **survived** | 0 = No, 1 = Yes | Target Variable | This is what we want to predict |
| **pclass** | Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd) | Categorical | Socioeconomic status indicator |
| **sex** | Gender | Categorical | Historically important for survival |
| **age** | Age in years | Numerical | May affect survival chances |
| **sibsp** | # of siblings/spouses aboard | Numerical | Family size indicator |
| **parch** | # of parents/children aboard | Numerical | Family size indicator |
| **fare** | Passenger fare | Numerical | Economic status indicator |
| **embarked** | Port of Embarkation (C = Cherbourg, Q = Queenstown, S = Southampton) | Categorical | Geographic/economic indicator |
| **class** | Same as pclass but as text | Categorical | Duplicate of pclass |
| **who** | man, woman, or child | Categorical | Derived from sex and age |
| **adult_male** | True/False | Categorical | Derived feature |
| **deck** | Deck level | Categorical | Ship location (lots of missing data) |
| **embark_town** | Full name of embarkation port | Categorical | Same as embarked but spelled out |
| **alive** | yes/no version of survived | Categorical | Duplicate of target variable |
| **alone** | True if travelling alone | Categorical | Derived from sibsp and parch |
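
Several rows in the table are described as duplicates of other columns. One way to check such a claim, sketched here on made-up rows, is to test whether two columns map one-to-one onto each other. The helper name `is_one_to_one` is hypothetical, not a pandas function.

```python
import pandas as pd

def is_one_to_one(df, col_a, col_b):
    """True if each value of col_a pairs with exactly one value of col_b and vice versa."""
    pairs = df[[col_a, col_b]].dropna().drop_duplicates()
    return pairs[col_a].is_unique and pairs[col_b].is_unique

# Made-up rows; only the column names mirror the real dataset.
toy = pd.DataFrame({
    "survived": [0, 1, 1, 0, 1],
    "alive": ["no", "yes", "yes", "no", "yes"],
    "pclass": [3, 1, 3, 2, 1],
    "sex": ["male", "female", "female", "male", "male"],
})

print(is_one_to_one(toy, "survived", "alive"))  # same information, different encoding
print(is_one_to_one(toy, "pclass", "sex"))      # unrelated columns
```

Running the same check on `titanic` for pairs such as `pclass`/`class` or `embarked`/`embark_town` would confirm which columns are safe to drop.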

## 4. Missing Data Analysis

Missing data is one of the biggest challenges in real-world ML. Let's see what we're dealing with:

In [None]:
# Calculate missing data statistics
missing_data = titanic.isnull().sum()
missing_percent = (missing_data / len(titanic)) * 100

# Create a summary dataframe
missing_summary = pd.DataFrame({
    'Missing Count': missing_data,
    'Missing Percentage': missing_percent
})

# Sort by missing percentage
missing_summary = missing_summary.sort_values('Missing Percentage', ascending=False)

# Only show columns with missing data
missing_summary = missing_summary[missing_summary['Missing Count'] > 0]

print("🚨 Missing Data Summary:")
print(missing_summary)

In [None]:
# Visualize missing data patterns
plt.figure(figsize=(12, 8))

# Create a heatmap of missing values
sns.heatmap(titanic.isnull(), 
            yticklabels=False, 
            cbar=True, 
            cmap='viridis',
            xticklabels=True)
plt.title('Missing Data Pattern\n(Yellow = Missing, Dark = Present)', fontsize=14, fontweight='bold')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

# Missing data bar chart
plt.figure(figsize=(10, 6))
missing_counts = titanic.isnull().sum().sort_values(ascending=True)
missing_counts = missing_counts[missing_counts > 0]  # Only show columns with missing data

missing_counts.plot(kind='barh', color='coral')
plt.title('Missing Values by Feature', fontsize=14, fontweight='bold')
plt.xlabel('Number of Missing Values')
plt.grid(axis='x', alpha=0.3)
plt.tight_layout()
plt.show()

### 🧠 Understanding Missing Data Patterns

**Key Observations:**
1. **deck**: 77% missing - This might be because deck information wasn't systematically recorded
2. **age**: 20% missing - Common in historical records
3. **embarked**: Only 2 missing values - Easy to handle

**Why This Matters for Tomorrow:**
- **deck**: Might need to drop this feature or create an "Unknown" category
- **age**: Will need imputation (filling with mean, median, or predicted values)
- **embarked**: Can fill with the most common value

### 🚨 Common Mistake Alert!
Never just delete rows with missing data without understanding WHY the data is missing. Sometimes the missingness itself is informative!
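
To see whether missingness carries signal, one option is to turn it into an explicit indicator column and compare the target rate across it. The rows below are made up for illustration, and the column name `deck_missing` is an assumption rather than something defined elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration; real deck values are letters such as "C" or "E".
toy = pd.DataFrame({
    "survived": [1, 0, 0, 1, 0, 0],
    "deck": ["C", np.nan, np.nan, "E", np.nan, "B"],
})

# Flag missingness (1 = missing) before imputing or dropping, so the signal is kept.
toy["deck_missing"] = toy["deck"].isna().astype(int)

# Compare the survival rate between rows with and without a recorded deck.
rate_by_missing = toy.groupby("deck_missing")["survived"].mean()
print(rate_by_missing)
```

If the two rates differ clearly on the real data, the indicator itself may be worth keeping as a feature even when the original column is dropped.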

## 5. Target Variable Analysis

Let's start by understanding our target variable - what we're trying to predict:

In [None]:
# Survival statistics
survival_counts = titanic['survived'].value_counts().sort_index()  # fixed order: 0 (died), 1 (survived)
survival_rate = titanic['survived'].mean()

print(f"📊 Survival Statistics:")
print(f"✅ Survived: {survival_counts[1]} passengers ({survival_counts[1]/len(titanic)*100:.1f}%)")
print(f"❌ Did not survive: {survival_counts[0]} passengers ({survival_counts[0]/len(titanic)*100:.1f}%)")
print(f"📈 Overall survival rate: {survival_rate:.3f} ({survival_rate*100:.1f}%)")

In [None]:
# Visualize survival distribution
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

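# Count plot (hue mirrors x so the palette applies without seaborn's deprecation warning)
sns.countplot(data=titanic, x='survived', hue='survived', legend=False, ax=ax1, palette='Set2')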
ax1.set_title('Survival Count', fontsize=14, fontweight='bold')
ax1.set_xlabel('Survived (0=No, 1=Yes)')
ax1.set_ylabel('Number of Passengers')

# Add count labels on bars
for i, v in enumerate(survival_counts):
    ax1.text(i, v + 10, str(v), ha='center', fontweight='bold')

# Pie chart
labels = ['Did not survive', 'Survived']
colors = ['lightcoral', 'lightgreen']
ax2.pie(survival_counts, labels=labels, autopct='%1.1f%%', colors=colors, startangle=90)
ax2.set_title('Survival Rate Distribution', fontsize=14, fontweight='bold')

plt.tight_layout()
plt.show()

### 🎯 Why This Matters

**Class Imbalance**: We have more deaths (61.6%) than survivals (38.4%). This is important because:
1. **Baseline Accuracy**: A "dumb" model that always predicts "death" would be 61.6% accurate
2. **Evaluation Strategy**: We'll need to look beyond just accuracy when evaluating our models
3. **Sampling Considerations**: We might need to balance our training data

This will become crucial on **Day 4** when we evaluate our model's performance!
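
The baseline figure can be reproduced with a few lines of arithmetic. The counts below (549 died, 342 survived) are assumed from the 891-row dataset and match the percentages quoted above; check them against your own `value_counts()` output.

```python
# Class counts assumed from the 891-row dataset: 549 died, 342 survived.
n_died, n_survived = 549, 342
n_total = n_died + n_survived

# A majority-class baseline predicts "died" (0) for every passenger.
baseline_accuracy = n_died / n_total

# Recall for survivors: survivors correctly identified / all survivors.
# The baseline never predicts "survived", so it identifies none of them.
baseline_recall_survived = 0 / n_survived

print(f"Baseline accuracy: {baseline_accuracy:.3f}")
print(f"Baseline recall for survivors: {baseline_recall_survived:.3f}")
```

A model that reaches this accuracy while finding no survivors is useless in practice, which is why accuracy alone is not enough.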

## 6. Categorical Features Exploration

Let's examine each categorical feature and see how it relates to survival:

In [None]:
# Define the main categorical features to analyze
categorical_features = ['sex', 'pclass', 'embarked', 'who', 'alone']

# Create subplots for each categorical feature
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
axes = axes.ravel()  # Flatten the array for easier indexing

for idx, feature in enumerate(categorical_features):
    # Count plot with survival hue
    sns.countplot(data=titanic, x=feature, hue='survived', ax=axes[idx], palette='Set1')
    axes[idx].set_title(f'Survival by {feature.title()}', fontsize=12, fontweight='bold')
    axes[idx].legend(['Did not survive', 'Survived'])
    
    # Rotate x-axis labels if needed
    if feature in ['embarked']:
        axes[idx].tick_params(axis='x', rotation=45)

# Remove the empty subplot
fig.delaxes(axes[5])

plt.tight_layout()
plt.show()

In [None]:
# Calculate survival rates by categorical features
print("📊 Survival Rates by Categorical Features:\n")

for feature in categorical_features:
    print(f"🔍 {feature.upper()}:")
    survival_by_feature = titanic.groupby(feature)['survived'].agg(['count', 'sum', 'mean'])
    survival_by_feature.columns = ['Total', 'Survived', 'Survival_Rate']
    survival_by_feature['Survival_Rate'] = survival_by_feature['Survival_Rate'].round(3)
    print(survival_by_feature)
    print("\n" + "-"*50 + "\n")

### 🧠 Key Insights from Categorical Features

**Gender (sex)**:
- Women had a **74.2%** survival rate vs men at **18.9%**
- Consistent with a "women and children first" evacuation policy

**Passenger Class (pclass)**:
- First class: **63.0%** survival rate
- Second class: **47.3%** survival rate  
- Third class: **24.2%** survival rate
- Strong correlation between wealth/status and survival

**Port of Embarkation (embarked)**:
- Cherbourg (C): **55.4%** survival rate
- Queenstown (Q): **39.0%** survival rate
- Southampton (S): **33.7%** survival rate
- May be correlated with passenger class/wealth

These patterns will be crucial for our ML model!

## 7. Numerical Features Exploration

Now let's examine the numerical features and their distributions:

In [None]:
# Define numerical features
numerical_features = ['age', 'fare', 'sibsp', 'parch']

# Create distribution plots
fig, axes = plt.subplots(2, 2, figsize=(15, 10))
axes = axes.ravel()

for idx, feature in enumerate(numerical_features):
    # Remove missing values for plotting
    data = titanic[feature].dropna()
    
    # Histogram with KDE
    sns.histplot(data=data, kde=True, ax=axes[idx], color='skyblue', alpha=0.7)
    axes[idx].set_title(f'Distribution of {feature.title()}', fontsize=12, fontweight='bold')
    axes[idx].grid(alpha=0.3)
    
    # Add statistics text
    mean_val = data.mean()
    median_val = data.median()
    axes[idx].axvline(mean_val, color='red', linestyle='--', label=f'Mean: {mean_val:.1f}')
    axes[idx].axvline(median_val, color='green', linestyle='--', label=f'Median: {median_val:.1f}')
    axes[idx].legend()

plt.tight_layout()
plt.show()

In [None]:
# Box plots to show survival patterns
fig, axes = plt.subplots(2, 2, figsize=(15, 10))
axes = axes.ravel()

for idx, feature in enumerate(numerical_features):
    sns.boxplot(data=titanic, x='survived', y=feature, hue='survived', legend=False, ax=axes[idx], palette='Set2')
    axes[idx].set_title(f'{feature.title()} by Survival Status', fontsize=12, fontweight='bold')
    axes[idx].set_xlabel('Survived (0=No, 1=Yes)')
    axes[idx].grid(alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
# Statistical summary by survival status
print("📊 Numerical Features by Survival Status:\n")

for feature in numerical_features:
    print(f"🔍 {feature.upper()}:")
    stats = titanic.groupby('survived')[feature].describe().round(2)
    print(stats)
    print("\n" + "-"*70 + "\n")

### 🧠 Key Insights from Numerical Features

**Age**:
- Survivors were slightly younger on average (28.3 vs 30.6 years)
- Children (under 16) had better survival chances
- Right-skewed distribution with most passengers between 20-40

**Fare**:
- Survivors paid significantly higher fares on average (48.4 vs 22.1)
- Highly right-skewed distribution (few very expensive tickets)
- Strong correlation with passenger class

**Family Size (SibSp + Parch)**:
- Most passengers traveled alone or with small families
- Medium family sizes (2-4 people) had better survival rates
- Very large families had poor survival rates

### 🚨 Data Quality Notes:
- **Fare** has extreme outliers (the maximum fare, about 512, is roughly 35 times the median of about 14.5)
- **Age** has a lot of missing values we'll need to handle
- **SibSp/Parch** are counts, so they're naturally discrete
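
One common way to tame a right-skewed feature like fare is a log transform. The sketch below uses a handful of illustrative fare values, not the full column, and compares skewness before and after `np.log1p`.

```python
import numpy as np
import pandas as pd

# Illustrative fares only: mostly low values plus one very expensive ticket.
fares = pd.Series([7.25, 7.90, 8.05, 13.00, 14.45, 26.00, 31.00, 71.28, 512.33])

# log1p(x) = log(1 + x), which stays defined for a fare of 0.
log_fares = np.log1p(fares)

print(f"Skewness before: {fares.skew():.2f}")
print(f"Skewness after:  {log_fares.skew():.2f}")
```

Whether to transform, cap, or keep the outliers is a modeling decision; tree-based models, for example, are usually insensitive to this kind of monotonic rescaling.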

## 8. Correlation Analysis

Let's examine how features relate to each other and to our target variable:

In [None]:
# Select numerical columns for correlation analysis
numerical_cols = ['survived', 'pclass', 'age', 'sibsp', 'parch', 'fare']
correlation_data = titanic[numerical_cols].corr()

# Create correlation heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_data, 
            annot=True, 
            cmap='RdBu_r', 
            center=0,
            square=True, 
            fmt='.3f',
            cbar_kws={'label': 'Correlation Coefficient'})
plt.title('Feature Correlation Matrix', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

# Show correlations with survival specifically
survival_correlations = correlation_data['survived'].drop('survived').sort_values(key=abs, ascending=False)
print("🎯 Correlations with Survival (strongest to weakest):\n")
for feature, corr in survival_correlations.items():
    direction = "📈 Positive" if corr > 0 else "📉 Negative"
    strength = "Strong" if abs(corr) > 0.3 else "Moderate" if abs(corr) > 0.1 else "Weak"
    print(f"{feature}: {corr:.3f} ({direction}, {strength})")

### 🧠 Understanding Correlations

**Strongest Predictors of Survival:**
1. **Pclass (-0.338)**: Lower class number (higher class) → Higher survival chance
2. **Fare (+0.257)**: Higher fare → Higher survival chance

**Interesting Relationships:**
- **Pclass & Fare (-0.549)**: Higher class passengers paid more (expected)
- **Age & Pclass (-0.369)**: Older passengers tended to be in higher classes (lower class numbers), hence the negative sign
- **SibSp & Parch (+0.415)**: People with siblings or spouses aboard often had parents or children aboard too

**Why This Matters:**
- These correlations guide feature selection for our model
- High correlations between features might cause multicollinearity
- The moderate correlations suggest we'll need multiple features to predict survival well
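
For reference, the coefficient that `DataFrame.corr()` reports by default is Pearson's r. The sketch below computes it by hand on made-up values and compares it with pandas; `pearson_r` is an illustrative helper, not a library function.

```python
import numpy as np
import pandas as pd

def pearson_r(x, y):
    """Pearson r: covariance of x and y divided by the product of their standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_centered = x - x.mean()
    y_centered = y - y.mean()
    return (x_centered * y_centered).sum() / np.sqrt(
        (x_centered ** 2).sum() * (y_centered ** 2).sum()
    )

# Made-up values: lower class numbers survive more often in this toy sample.
pclass = pd.Series([1, 1, 2, 2, 3, 3, 3, 3])
survived = pd.Series([1, 1, 1, 0, 0, 0, 1, 0])

manual = pearson_r(pclass, survived)
builtin = pclass.corr(survived)  # pandas uses Pearson by default
print(f"manual={manual:.4f}, pandas={builtin:.4f}")
```

Pearson's r only captures linear association, so a weak coefficient (as with age) does not rule out a non-linear relationship with survival.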

## 9. Advanced Relationship Analysis

Let's dig deeper into some interesting relationships:

In [None]:
# Create family size feature for analysis
titanic['family_size'] = titanic['sibsp'] + titanic['parch'] + 1  # +1 for the passenger themselves

# Survival by family size
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
family_survival = titanic.groupby('family_size')['survived'].agg(['count', 'mean'])
sns.barplot(x=family_survival.index, y=family_survival['mean'], hue=family_survival.index, legend=False, palette='viridis')
plt.title('Survival Rate by Family Size', fontsize=12, fontweight='bold')
plt.xlabel('Family Size')
plt.ylabel('Survival Rate')
plt.grid(axis='y', alpha=0.3)

# Add count labels
for i, (size, count) in enumerate(family_survival['count'].items()):
    plt.text(i, family_survival['mean'].iloc[i] + 0.02, f'n={count}', ha='center', fontsize=9)

plt.subplot(1, 2, 2)
# Age groups survival
titanic['age_group'] = pd.cut(titanic['age'], bins=[0, 12, 18, 35, 60, 100], 
                             labels=['Child', 'Teen', 'Young Adult', 'Adult', 'Senior'])
age_survival = titanic.groupby('age_group', observed=False)['survived'].mean()
sns.barplot(x=age_survival.index, y=age_survival.values, hue=age_survival.index, legend=False, palette='plasma')
plt.title('Survival Rate by Age Group', fontsize=12, fontweight='bold')
plt.xlabel('Age Group')
plt.ylabel('Survival Rate')
plt.xticks(rotation=45)
plt.grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
# Cross-tabulation: Gender vs Class vs Survival
print("🔍 Survival Rates by Gender and Class:\n")

# Cross-tabulate outcomes; normalize='index' makes each (sex, pclass) row sum to 1
survival_by_gender_class = pd.crosstab([titanic['sex'], titanic['pclass']],
                                      titanic['survived'], normalize='index')
survival_by_gender_class = survival_by_gender_class.round(3)
print(survival_by_gender_class)

# Visualize this relationship
plt.figure(figsize=(10, 6))
sns.heatmap(survival_by_gender_class, annot=True, fmt='.3f', cmap='RdYlGn',
            cbar_kws={'label': 'Share of group (column 1 = survival rate)'})
plt.title('Survival Rate by Gender and Passenger Class', fontsize=14, fontweight='bold')
plt.xlabel('Survived (0=No, 1=Yes)')
plt.ylabel('Gender and Class')
plt.tight_layout()
plt.show()

### 🧠 Advanced Insights

**Family Size Pattern:**
- Solo travelers: ~30% survival rate
- Small families (2-4): ~55-70% survival rate  
- Large families (5+): ~20% survival rate
- **Insight**: Medium-sized families had advantages, possibly helping each other while being small enough to move quickly

**Age Group Pattern:**
- Children: ~58% survival rate ("children first" policy)
- Adults: ~35-40% survival rate
- **Insight**: Consistent with age-based evacuation priorities

**Gender + Class Interaction:**
- First-class women: **96.8%** survival rate
- Third-class men: **13.5%** survival rate
- **Insight**: The combination of gender and class was crucial - privilege compounded advantages
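
The interaction can be made explicit by comparing the female-male gap within each class. The group counts below are assumed for the 891-row dataset and reproduce the 96.8% and 13.5% figures above; verify them against your own crosstab before relying on them.

```python
import pandas as pd

# Counts assumed for the 891-row dataset; verify against your own crosstab output.
counts = pd.DataFrame({
    "sex": ["female", "female", "female", "male", "male", "male"],
    "pclass": [1, 2, 3, 1, 2, 3],
    "total": [94, 76, 144, 122, 108, 347],
    "survived": [91, 70, 72, 45, 17, 47],
})

counts["survival_rate"] = counts["survived"] / counts["total"]

# One row per sex, one column per class, a single survival rate per cell.
rate_table = counts.pivot(index="sex", columns="pclass", values="survival_rate").round(3)
print(rate_table)

# If the effect of sex were the same in every class, this gap would be roughly constant.
gap_by_class = (rate_table.loc["female"] - rate_table.loc["male"]).round(3)
print(gap_by_class)
```

A gap that changes from class to class is what "interaction" means here: the effect of one feature depends on the value of another.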

## 10. Feature Engineering Preview

Let's identify some potential new features we could create (we'll do this properly on Day 2):

In [None]:
# Preview of feature engineering opportunities
print("🔧 Feature Engineering Opportunities Identified:\n")

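# 1. Title extraction from names
print("1. TITLE EXTRACTION:")
# Seaborn's copy of the dataset has no 'name' column, so guard the lookup;
# the preview only runs when a 'name' column is present.
if 'name' in titanic.columns:
    print("Sample names and potential titles:")
    for name in titanic['name'].head(10):
        # Simple title extraction (we'll improve this tomorrow)
        if 'Mr.' in name:
            title = 'Mr'
        elif 'Mrs.' in name:
            title = 'Mrs'
        elif 'Miss.' in name:
            title = 'Miss'
        elif 'Master.' in name:
            title = 'Master'
        else:
            title = 'Other'
        print(f"   {name[:30]}... → {title}")
else:
    print("   Skipped: this dataset has no 'name' column to extract titles from")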

print("\n" + "-"*60)

# 2. Family size categories
print("\n2. FAMILY SIZE CATEGORIES:")
titanic['family_category'] = titanic['family_size'].apply(
    lambda x: 'Solo' if x == 1 else 'Small' if x <= 4 else 'Large'
)
print(titanic['family_category'].value_counts())

print("\n" + "-"*60)

# 3. Fare per person
print("\n3. FARE PER PERSON:")
titanic['fare_per_person'] = titanic['fare'] / titanic['family_size']
print(f"Original fare range: ${titanic['fare'].min():.2f} - ${titanic['fare'].max():.2f}")
print(f"Fare per person range: ${titanic['fare_per_person'].min():.2f} - ${titanic['fare_per_person'].max():.2f}")

print("\n" + "-"*60)

# 4. Age groups
print("\n4. AGE GROUPS:")
print(titanic['age_group'].value_counts())  # value_counts() already excludes missing ages
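
Title extraction can also be sketched with a regular expression. The sample strings below follow the "Surname, Title. Given names" layout used in the Kaggle CSV; the pattern and the `extract_title` helper are illustrative assumptions, not part of any library.

```python
import re

# Captures the text between the comma and the first period, e.g. "Mr" in "Braund, Mr. Owen Harris".
TITLE_PATTERN = re.compile(r",\s*([^.,]+)\.")

def extract_title(name):
    """Return the title from a 'Surname, Title. Given names' string, or 'Other' if none is found."""
    match = TITLE_PATTERN.search(name)
    return match.group(1).strip() if match else "Other"

sample_names = [
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Palsson, Master. Gosta Leonard",
    "No title here",
]
for name in sample_names:
    print(f"{name} -> {extract_title(name)}")
```

Unlike a fixed chain of `if` checks, the pattern also picks up rarer titles without listing them in advance.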

## 11. Data Quality Assessment

Let's summarize the data quality issues we need to address tomorrow:

In [None]:
print("🔍 DATA QUALITY ASSESSMENT SUMMARY\n")
print("=" * 50)

# 1. Missing data summary
print("\n1. MISSING DATA ISSUES:")
missing_summary_clean = titanic.isnull().sum().sort_values(ascending=False)
missing_summary_clean = missing_summary_clean[missing_summary_clean > 0]
for col, count in missing_summary_clean.items():
    percentage = (count / len(titanic)) * 100
    severity = "🚨 Critical" if percentage > 50 else "⚠️ Moderate" if percentage > 10 else "✅ Minor"
    print(f"   {col}: {count} missing ({percentage:.1f}%) - {severity}")

# 2. Duplicate features
print("\n2. DUPLICATE/REDUNDANT FEATURES:")
print("   📋 pclass ≈ class (same information, different format)")
print("   📋 survived ≈ alive (same information, different format)")
print("   📋 embarked ≈ embark_town (same information, different format)")
print("   📋 who ≈ derived from sex + age")
print("   📋 alone ≈ derived from sibsp + parch")

# 3. Data type issues
print("\n3. DATA TYPE CONSIDERATIONS:")
print("   🔢 pclass: Currently numeric, but should be categorical")
print("   🔤 survived: Currently numeric, but represents categories")
print("   💰 fare: Has extreme outliers that might need handling")

# 4. Outliers
print("\n4. OUTLIER ANALYSIS:")
fare_q99 = titanic['fare'].quantile(0.99)
age_q99 = titanic['age'].quantile(0.99)
outlier_fares = (titanic['fare'] > fare_q99).sum()
print(f"   💰 Fare outliers (>99th percentile): {outlier_fares} passengers")
print(f"   👴 Age outliers (>99th percentile): {(titanic['age'] > age_q99).sum()} passengers")

print("\n" + "=" * 50)
print("📋 TOMORROW'S DATA CLEANING PRIORITIES:")
print("   1. Handle missing age values (imputation)")
print("   2. Handle missing embarked values (mode imputation)")
print("   3. Decide on deck feature (likely drop due to 77% missing)")
print("   4. Remove duplicate features")
print("   5. Convert data types appropriately")
print("   6. Handle fare outliers if necessary")
print("   7. Create new features (title, family_size, etc.)")

## 12. Key Insights Summary

Let's summarize the most important findings from today's analysis:

In [None]:
print("🎯 KEY INSIGHTS FROM TODAY'S ANALYSIS\n")
print("=" * 60)

print("\n📊 DATASET OVERVIEW:")
print(f"   • {len(titanic)} passengers, {titanic.shape[1]} columns (including helper columns added today)")
print(f"   • {titanic['survived'].mean():.1%} overall survival rate")
print(f"   • Binary classification problem (survived: yes/no)")

print("\n🔍 STRONGEST SURVIVAL PREDICTORS:")
print("   1. 👩 Gender: Women 74.2% vs Men 18.9% survival rate")
print("   2. 🎫 Passenger Class: 1st class 63.0% vs 3rd class 24.2%")
print("   3. 👨‍👩‍👧‍👦 Family Size: Medium families (2-4) had highest survival")
print("   4. 💰 Fare: Higher paying passengers survived more often")
print("   5. 👶 Age: Children had better survival chances")

print("\n🔗 IMPORTANT RELATIONSHIPS:")
print("   • Gender + Class interaction is crucial (1st class women: 96.8% survival)")
print("   • Fare correlates with passenger class (wealth indicator)")
print("   • Port of embarkation may proxy for passenger class")
print("   • Family size has optimal range (not solo, not too large)")

print("\n⚠️ DATA CHALLENGES:")
print("   • Age: 20% missing values")
print("   • Deck: 77% missing (likely unusable)")
print("   • Class imbalance: 61.6% deaths vs 38.4% survivors")
print("   • Fare has extreme outliers")
print("   • Several redundant features to clean up")

print("\n🎯 ML MODEL IMPLICATIONS:")
print("   • Multiple features needed for good prediction")
print("   • Feature interactions likely important")
print("   • Need to handle class imbalance in evaluation")
print("   • Feature engineering opportunities identified")

print("\n" + "=" * 60)
print("✅ DAY 1 COMPLETE! Ready for Day 2: Data Cleaning & Feature Engineering")

## 13. Tomorrow's Roadmap: Day 2 Preview

Based on today's analysis, here's what we'll tackle tomorrow:

### 🗓️ Day 2: Data Cleaning & Feature Engineering

**Morning Session (Data Cleaning):**
1. **Missing Value Strategy**
   - Age: Impute using median by passenger class and gender
   - Embarked: Fill with most common value ('S')
   - Deck: Drop due to 77% missing

2. **Feature Cleanup**
   - Remove duplicate features (class, alive, embark_town, etc.)
   - Convert data types (pclass to categorical)
   - Handle outliers in fare if needed

**Afternoon Session (Feature Engineering):**
3. **New Feature Creation**
   - Extract titles from names (Mr, Mrs, Miss, Master, etc.), which requires a source that keeps passenger names, since the Seaborn copy does not
   - Create family_size and family_category features
   - Calculate fare_per_person
   - Create age_group categories
   - Create is_child feature

4. **Feature Encoding**
   - One-hot encode categorical variables
   - Scale numerical features
   - Prepare final feature matrix
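
As a preview of the age strategy in step 1, group-wise median imputation might look like the sketch below. The rows are made up, and the `age_imputed` column name is an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Made-up rows; only the column names mirror the dataset.
toy = pd.DataFrame({
    "pclass": [1, 1, 1, 3, 3, 3],
    "sex": ["female", "female", "female", "male", "male", "male"],
    "age": [38.0, np.nan, 42.0, 20.0, 26.0, np.nan],
})

# Median age within each (pclass, sex) group, broadcast back to every row.
group_median = toy.groupby(["pclass", "sex"])["age"].transform("median")

# Fill only the gaps; fall back to the overall median if a whole group is missing.
toy["age_imputed"] = toy["age"].fillna(group_median).fillna(toy["age"].median())
print(toy)
```

When a train/test split is involved, the medians should be computed on the training portion only and then applied to the test portion.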

**What You'll Learn:**
- Different imputation strategies and when to use them
- Feature engineering techniques that improve model performance
- How to encode categorical variables for ML algorithms
- Why feature scaling matters for some algorithms
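
Encoding and scaling can be previewed with plain pandas. This is a minimal sketch on made-up rows; the `fare_scaled` column name is an assumption, and a library scaler could be used in its place.

```python
import pandas as pd

# Made-up rows; only the column names mirror the dataset.
toy = pd.DataFrame({
    "sex": ["male", "female", "female", "male"],
    "embarked": ["S", "C", "S", "Q"],
    "fare": [7.25, 71.28, 8.05, 13.00],
})

# One-hot encode categorical columns; drop_first removes one redundant dummy per feature.
encoded = pd.get_dummies(toy, columns=["sex", "embarked"], drop_first=True, dtype=int)

# Standardize fare to zero mean and unit variance (population std, ddof=0).
fare_mean = encoded["fare"].mean()
fare_std = encoded["fare"].std(ddof=0)
encoded["fare_scaled"] = (encoded["fare"] - fare_mean) / fare_std

print(encoded)
```

Scaling matters mostly for distance-based and gradient-based models; tree-based models generally do not need it.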

### 💡 Questions to Think About:
1. What other titles might be hidden in the name field?
2. Should we treat fare outliers, or do they contain important information?
3. How might we combine sibsp and parch more intelligently?
4. What age ranges make the most sense for grouping?

**See you tomorrow for Day 2! 🚀**

---

## 🎉 Congratulations on Completing Day 1!

You've successfully completed a thorough exploratory data analysis of the Titanic dataset. Today's work forms the foundation for everything we'll build this week.

### What You Accomplished:
✅ Loaded and understood the dataset structure  
✅ Identified missing data patterns and quality issues  
✅ Explored all features and their relationships with survival  
✅ Discovered key insights about survival patterns  
✅ Identified data cleaning priorities for tomorrow  
✅ Planned feature engineering opportunities  

### Key Takeaways:
- **EDA is crucial**: Understanding your data deeply informs every subsequent decision
- **Multiple factors matter**: No single feature perfectly predicts survival
- **Interactions are important**: Gender + class combinations tell a richer story
- **Data quality varies**: Real-world data always needs cleaning

**Ready for Day 2? Let's clean this data and engineer some powerful features! 💪**