# Lab 1: Data Exploration & Visualization

**Introduction to Data Science & Engineering - Day 1**

| Duration | Difficulty | Libraries | Exercises |
|---|---|---|---|
| 90 min | Beginner | pandas, matplotlib, seaborn | 11 (in 5 parts) |

In this lab, you'll practice:
- Loading and exploring datasets
- Identifying data quality issues
- Computing descriptive statistics
- Creating visualizations with matplotlib and seaborn
- Analyzing correlations
- Segmenting customers
- Exploring time series patterns

---

## Setup

First, let's import the necessary libraries.

In [None]:
# Core libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, timedelta

# Settings
pd.set_option('display.max_columns', None)
plt.style.use('seaborn-v0_8-whitegrid')
%matplotlib inline

print("Libraries loaded successfully!")

## Part 1: Create and Load the Dataset

We'll generate a synthetic e-commerce dataset of 2,000 orders, then deliberately inject quality issues into it: missing values, outliers, and 15 duplicated rows.

In [None]:
np.random.seed(42)
n_samples = 2000

# Generate dates over 2 years
start_date = datetime(2023, 1, 1)
dates = [start_date + timedelta(days=np.random.randint(0, 730)) for _ in range(n_samples)]

# Customer segments
segments = np.random.choice(['Premium', 'Standard', 'Basic', None], n_samples, p=[0.2, 0.4, 0.3, 0.1])

# Product categories
categories = np.random.choice(['Electronics', 'Clothing', 'Home & Garden', 'Books', 'Sports'], n_samples)

data = {
    'order_id': range(1, n_samples + 1),
    'customer_id': np.random.randint(100, 600, n_samples),
    'order_date': dates,
    'product_category': categories,
    'quantity': np.random.randint(1, 10, n_samples),
    'unit_price': np.round(np.random.uniform(5, 500, n_samples), 2),
    'customer_segment': segments,
    'customer_age': np.random.randint(18, 75, n_samples).astype(float),
    'satisfaction_score': np.random.choice([1, 2, 3, 4, 5, np.nan], n_samples, p=[0.05, 0.1, 0.2, 0.35, 0.2, 0.1]),
    'region': np.random.choice(['North', 'South', 'East', 'West'], n_samples),
    'is_returned': np.random.choice([0, 1], n_samples, p=[0.85, 0.15])
}

df = pd.DataFrame(data)

# Inject quality issues
# Missing values
df.loc[np.random.choice(df.index, 80, replace=False), 'customer_age'] = np.nan
df.loc[np.random.choice(df.index, 40, replace=False), 'unit_price'] = np.nan

# Outliers
df.loc[np.random.choice(df.index, 10, replace=False), 'unit_price'] = np.random.uniform(2000, 5000, 10)
df.loc[np.random.choice(df.index, 5, replace=False), 'quantity'] = np.random.randint(50, 100, 5)

# Duplicates
dup_indices = np.random.choice(df.index, 15, replace=False)
duplicates = df.loc[dup_indices].copy()
df = pd.concat([df, duplicates], ignore_index=True)

# Calculate total_amount
df['total_amount'] = df['quantity'] * df['unit_price']

print(f"Dataset shape: {df.shape}")
df.head(10)

### Exercise 1.1: Basic Exploration

Explore the dataset structure using pandas methods.

**Your Task:** Use pandas methods to examine the shape, data types, descriptive statistics, missing values, and duplicates in the dataset.
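
As a warm-up, the same checks can be sketched on a tiny frame where the answers are easy to verify by eye. The `toy` frame and its values are hypothetical and are not part of the lab dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (not the lab's df): one missing price, one repeated row.
toy = pd.DataFrame({
    "order_id": [1, 2, 3, 3],
    "unit_price": [10.0, np.nan, 7.5, 7.5],
    "region": ["North", "South", "East", "East"],
})

print(toy.shape)    # (rows, columns)
print(toy.dtypes)   # one dtype per column

# Share of missing values per column, expressed as a percentage
missing_pct = toy.isnull().sum() / len(toy) * 100
print(missing_pct.round(1))

# Fully identical rows versus repeated values in a key column
n_dup_rows = toy.duplicated().sum()
n_dup_ids = toy["order_id"].duplicated().sum()
print(n_dup_rows, n_dup_ids)
```

Note that `duplicated()` marks only the second and later occurrences, so a row that appears twice counts once.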

In [None]:
# TODO: Print the shape and data types of the dataset
# Hint: Use df.shape and df.dtypes
pass

In [None]:
# TODO: Get descriptive statistics for all numeric columns
# Hint: Use df.describe()
pass

In [None]:
# TODO: Check for missing values and compute the percentage missing per column
# Hint: Use df.isnull().sum() and divide by len(df)
pass

In [None]:
# TODO: Check for duplicate rows and duplicate order_ids
# Hint: Use df.duplicated().sum() and df['order_id'].duplicated().sum()
pass

## Part 2: Data Quality Assessment

### Exercise 2.1: Handle Missing Values

**Your Task:** Implement a function that fills missing values using median imputation for numeric columns and 'Unknown' for categorical columns. After imputation, recalculate the `total_amount` column.
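
A minimal sketch of the imputation strategy on a toy frame, assuming hypothetical column names (`segment`, `total`) rather than the lab's. It works on a copy so the original frame is left untouched.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing numeric and one missing categorical value
toy = pd.DataFrame({
    "quantity": [1, 2, 3, 4],
    "unit_price": [10.0, np.nan, 30.0, 50.0],
    "segment": ["Basic", None, "Premium", "Basic"],
})

cleaned = toy.copy()

# Numeric column: fill with the median of the observed values
cleaned["unit_price"] = cleaned["unit_price"].fillna(cleaned["unit_price"].median())

# Categorical column: fill with an explicit placeholder label
cleaned["segment"] = cleaned["segment"].fillna("Unknown")

# Derived columns must be recomputed after imputation
cleaned["total"] = cleaned["quantity"] * cleaned["unit_price"]

print(cleaned)
```

The median is less sensitive to extreme values than the mean, which matters here because the lab dataset contains injected price outliers.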

In [None]:
def handle_missing_values(df):
    """Handle missing values in the dataset.
    
    Strategy: median for numeric columns, 'Unknown' for categorical.
    After imputation, recalculate total_amount.
    
    Returns: cleaned DataFrame
    """
    # TODO: Fill customer_age with median
    # TODO: Fill unit_price with median
    # TODO: Fill satisfaction_score with median
    # TODO: Fill customer_segment with 'Unknown'
    # TODO: Recalculate total_amount = quantity * unit_price
    pass

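# Note: the skeleton returns None until the TODOs are implemented,
# so run this line only after completing the function.
df = handle_missing_values(df)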

### Exercise 2.2: Remove Duplicates

**Your Task:** Implement a function that removes duplicate orders based on `order_id`, keeping the first occurrence.
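
One way to express key-based deduplication is `drop_duplicates` with `subset` and `keep`. The sketch below uses a hypothetical toy frame in which the repeated `order_id` carries a different `amount`, to show that only the key column decides what counts as a duplicate.

```python
import pandas as pd

# Hypothetical toy frame: order_id 2 appears twice with different amounts
toy = pd.DataFrame({
    "order_id": [1, 2, 2, 3],
    "amount": [10.0, 20.0, 99.0, 30.0],
})

# Keep the first row for each order_id, then renumber the index
deduped = toy.drop_duplicates(subset="order_id", keep="first").reset_index(drop=True)

# The number of removed rows is the difference in length
removed = len(toy) - len(deduped)

print(deduped)
print(f"Removed {removed} row(s)")
```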

In [None]:
def remove_duplicates(df):
    """Remove duplicate orders, keeping the first occurrence.
    
    Returns: deduplicated DataFrame and count of removed rows
    """
    # TODO: Remove duplicates based on order_id, keep first
    # TODO: Return cleaned df and count of removed rows
    pass

# Note: unpacking fails with a TypeError until remove_duplicates returns a (df, count) tuple.
df, removed = remove_duplicates(df)

### Exercise 2.3: Detect Outliers

**Your Task:** Implement the IQR method for outlier detection. For a given numeric Series, calculate Q1, Q3, and IQR = Q3 - Q1, then identify values below Q1 - `factor` × IQR or above Q3 + `factor` × IQR.
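
To make the arithmetic concrete, here is the IQR rule worked through on a small hypothetical Series with one obvious outlier. It is a sketch of the calculation only, not the finished `detect_outliers_iqr` function.

```python
import pandas as pd

# Hypothetical values: six typical readings and one extreme one
values = pd.Series([10, 12, 11, 13, 12, 11, 100])

q1 = values.quantile(0.25)   # first quartile
q3 = values.quantile(0.75)   # third quartile
iqr = q3 - q1                # interquartile range

factor = 1.5
lower = q1 - factor * iqr
upper = q3 + factor * iqr

# Keep only the values that fall outside [lower, upper]
outliers = values[(values < lower) | (values > upper)]

print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"Bounds: [{lower}, {upper}]")
print(outliers)
```

Because the bounds are built from quartiles, the extreme value itself barely influences them, which is why this rule is often preferred over mean-and-standard-deviation cutoffs for skewed data.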

In [None]:
def detect_outliers_iqr(series, factor=1.5):
    """Detect outliers using the IQR method.
    
    Args:
        series: pandas Series to check
        factor: IQR multiplier (default 1.5)
    
    Returns: tuple of (outlier_series, lower_bound, upper_bound)
    """
    # TODO: Calculate Q1, Q3, and IQR
    # TODO: Compute lower and upper bounds
    # TODO: Filter series for values outside bounds
    pass

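# Test on unit_price, quantity, total_amount
for col in ['unit_price', 'quantity', 'total_amount']:
    result = detect_outliers_iqr(df[col])
    # The skeleton returns None until implemented, so skip unpacking in that case
    if result is not None:
        outliers, lower, upper = result
        print(f"{col}: {len(outliers)} outliers outside [{lower:.2f}, {upper:.2f}]")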

In [None]:
# Visualize outliers with box plots
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
for ax, col in zip(axes, ['unit_price', 'quantity', 'total_amount']):
    sns.boxplot(data=df, y=col, ax=ax, color='#3b82f6')
    ax.set_title(f'{col} Distribution')
plt.tight_layout()
plt.show()

## Part 3: Statistical Analysis

### Exercise 3.1: Distribution Analysis

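**Your Task:** Create a 2x2 subplot grid showing the distributions of `unit_price`, `customer_age`, `quantity`, and `satisfaction_score`. Use `sns.histplot`, adding `kde=True` for the continuous variables (`unit_price`, `customer_age`); `quantity` and `satisfaction_score` take only a few discrete values, so a plain histogram is enough for them.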
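
The grid layout itself can be sketched as follows. This example uses randomly generated, hypothetical features and plain matplotlib `ax.hist` so that it depends only on numpy and matplotlib; in the lab, a call such as `sns.histplot(values, kde=True, ax=ax)` would go in the same spot.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Hypothetical toy features, not the lab's columns
features = {
    "price": rng.uniform(5, 500, 200),
    "age": rng.integers(18, 75, 200),
    "qty": rng.integers(1, 10, 200),
    "score": rng.integers(1, 6, 200),
}

fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# axes is a 2x2 array; .flat walks it row by row, one Axes per feature
for ax, (name, values) in zip(axes.flat, features.items()):
    ax.hist(values, bins=20, color="#3b82f6", edgecolor="white")
    ax.set_title(f"{name} distribution")

fig.tight_layout()
```

Passing `ax=` explicitly is what directs each plot into its own panel instead of drawing everything on the most recent Axes.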

In [None]:
def plot_distributions(df):
    """Plot distributions of key numeric features.
    
    Create a 2x2 subplot with histograms for:
    - unit_price, customer_age, quantity, satisfaction_score
    Use sns.histplot with kde=True for continuous variables.
    """
    # TODO: Create 2x2 subplot figure (14, 10)
    # TODO: Plot histograms for each feature
    # TODO: Set titles and call plt.tight_layout()
    pass

plot_distributions(df)

### Exercise 3.2: Correlation Analysis

**Your Task:** Compute the correlation matrix for all numeric columns and visualize it as a heatmap using `sns.heatmap` with annotations.
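
A small sketch of the computation behind the heatmap, on hypothetical synthetic columns where the expected result is known in advance: `y` is built from `x`, so the two should correlate strongly, while `noise` should not.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.normal(size=200)

# Hypothetical toy frame, not the lab's df
toy = pd.DataFrame({
    "x": x,
    "y": 2 * x + rng.normal(scale=0.1, size=200),  # strongly tied to x
    "noise": rng.normal(size=200),                 # independent of x
    "label": ["a", "b"] * 100,                     # non-numeric, excluded below
})

# Keep only numeric columns, then compute pairwise Pearson correlations
numeric = toy.select_dtypes(include="number")
corr = numeric.corr()

print(corr.round(2))
```

The resulting square DataFrame is what you would pass to `sns.heatmap(corr, annot=True, cmap="coolwarm")`. In the lab dataset most columns are drawn independently, so expect most coefficients to sit near zero.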

In [None]:
def plot_correlation_matrix(df):
    """Compute and visualize the correlation matrix for numeric columns.
    
    Use sns.heatmap with annot=True, cmap='coolwarm'.
    """
    # TODO: Select numeric columns
    # TODO: Compute correlation matrix
    # TODO: Create heatmap visualization
    pass

plot_correlation_matrix(df)

## Part 4: Data Visualization

### Exercise 4.1: Category Analysis

**Your Task:** Analyze sales by product category. Create two side-by-side plots: a horizontal bar chart of total revenue by category, and a pie chart of each category's share of orders.
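
Both plots rest on two small aggregations, sketched here on a hypothetical toy frame (`category` and `amount` are stand-in column names).

```python
import pandas as pd

# Hypothetical toy orders
toy = pd.DataFrame({
    "category": ["Books", "Sports", "Books", "Clothing", "Sports", "Books"],
    "amount": [20.0, 150.0, 35.0, 60.0, 90.0, 15.0],
})

# Total revenue per category, sorted so the largest bar ends up on top in barh
revenue = toy.groupby("category")["amount"].sum().sort_values()

# Number of orders per category, for the pie chart
order_counts = toy["category"].value_counts()

print(revenue)
print(order_counts)
```

From here, `ax.barh(revenue.index, revenue.values)` and `ax.pie(order_counts.values, labels=order_counts.index)` would draw the two panels. Note that revenue and order count can rank categories differently, as they do in this toy data.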

In [None]:
def analyze_categories(df):
    """Analyze sales by product category.
    
    Create two plots side by side:
    1. Horizontal bar chart of total revenue by category
    2. Pie chart of order distribution by category
    """
    # TODO: Group by product_category and sum total_amount
    # TODO: Create 1x2 subplot with barh and pie chart
    pass

analyze_categories(df)

### Exercise 4.2: Customer Segmentation Analysis

**Your Task:** Analyze customer segments by creating two plots: average order value by segment (horizontal bar), and satisfaction score distribution by segment (box plot).
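
Segment statistics can be computed in one pass with named aggregation. The sketch below assumes a hypothetical toy frame; the output column names (`avg_order_value`, `median_score`, `orders`) are illustrative choices, not names the lab requires.

```python
import pandas as pd

# Hypothetical toy orders
toy = pd.DataFrame({
    "segment": ["Premium", "Basic", "Premium", "Standard", "Basic"],
    "amount": [300.0, 40.0, 500.0, 120.0, 60.0],
    "score": [5, 3, 4, 4, 2],
})

# Named aggregation: output_column=(input_column, function)
stats = toy.groupby("segment").agg(
    avg_order_value=("amount", "mean"),
    median_score=("score", "median"),
    orders=("amount", "count"),
)

print(stats)
```

Including the order count alongside the averages is a useful habit: an average over a handful of orders deserves less weight than one over hundreds.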

In [None]:
def analyze_segments(df):
    """Analyze customer segments.
    
    Create two plots:
    1. Average order value by segment (barh)
    2. Satisfaction score distribution by segment (boxplot)
    """
    # TODO: Compute segment statistics
    # TODO: Create 1x2 subplot with barh and boxplot
    pass

analyze_segments(df)

### Exercise 4.3: Time Series Analysis

**Your Task:** Analyze trends over time. Create two vertically stacked plots: a line plot of monthly revenue and a bar chart of monthly order count.
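
Monthly grouping with `dt.to_period('M')` can be sketched on a few hypothetical orders as follows. The `order_date` column must already be a datetime type for the `.dt` accessor to work.

```python
import pandas as pd

# Hypothetical toy orders spread over three months
toy = pd.DataFrame({
    "order_date": pd.to_datetime(
        ["2023-01-05", "2023-01-20", "2023-02-03", "2023-03-15", "2023-03-28"]
    ),
    "amount": [100.0, 50.0, 75.0, 20.0, 30.0],
})

# Collapse each date to its calendar month
toy["order_month"] = toy["order_date"].dt.to_period("M")

# Revenue and order count per month
monthly = toy.groupby("order_month")["amount"].agg(revenue="sum", orders="count")

# Period values are not plain strings; convert them for use as axis labels
labels = monthly.index.astype(str).tolist()

print(monthly)
print(labels)
```

Converting the period index to strings before plotting sidesteps axis-formatting issues that can arise when period objects are handed directly to matplotlib.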

In [None]:
def analyze_time_series(df):
    """Analyze trends over time.
    
    Create two plots stacked vertically:
    1. Monthly revenue trend (line plot)
    2. Monthly order count (bar chart)
    
    Hint: Use df['order_date'].dt.to_period('M') for monthly grouping.
    """
    # TODO: Create order_month from order_date
    # TODO: Aggregate revenue and order count by month
    # TODO: Create 2x1 subplot with line and bar charts
    pass

analyze_time_series(df)

### Exercise 4.4: Regional Analysis

**Your Task:** Analyze performance by region. Create two horizontal bar charts: total revenue by region and return rate by region.
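
Because `is_returned` is coded as 0 or 1, its mean within a group is that group's return rate. A minimal sketch on hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy orders
toy = pd.DataFrame({
    "region": ["North", "North", "South", "South", "South", "East"],
    "amount": [100.0, 200.0, 50.0, 50.0, 100.0, 80.0],
    "is_returned": [0, 1, 0, 0, 1, 0],
})

by_region = toy.groupby("region").agg(
    revenue=("amount", "sum"),
    return_rate=("is_returned", "mean"),  # mean of a 0/1 flag = proportion of 1s
)

# Express the rate as a percentage for easier reading on a chart
by_region["return_rate_pct"] = by_region["return_rate"] * 100

print(by_region.round(2))
```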

In [None]:
def analyze_regions(df):
    """Analyze performance by region.
    
    Create two plots:
    1. Total revenue by region (barh)
    2. Return rate by region (barh)
    """
    # TODO: Group by region for revenue and return rate
    # TODO: Create 1x2 subplot
    pass

analyze_regions(df)

## Part 5: Advanced Exploration

### Exercise 5.1: Multi-dimensional Analysis

**Your Task:** Create a comprehensive 2x2 visualization combining scatter plots, heatmaps, bar charts, and violin plots to explore relationships across multiple dimensions.
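
Of the four panels, the heatmap needs the most data preparation: a two-way table of averages. One way to build it is `pivot_table`, sketched here on a hypothetical toy frame.

```python
import pandas as pd

# Hypothetical toy orders
toy = pd.DataFrame({
    "category": ["Books", "Books", "Sports", "Sports", "Books"],
    "region": ["North", "South", "North", "South", "North"],
    "amount": [10.0, 30.0, 100.0, 200.0, 20.0],
})

# Rows = category, columns = region, cells = mean order amount
pivot = toy.pivot_table(
    index="category", columns="region", values="amount", aggfunc="mean"
)

print(pivot)
```

The resulting table can be passed to `sns.heatmap(pivot, annot=True, ax=ax)`. Any category and region combination with no orders shows up as NaN, which the heatmap leaves blank.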

In [None]:
def multi_dimensional_analysis(df):
    """Create a comprehensive 2x2 multi-dimensional visualization.
    
    Plots:
    1. Scatter: Age vs Total Amount, colored by customer_segment
    2. Heatmap: Average order value by Category x Region
    3. Bar: Return rate by product category
    4. Violin: Satisfaction score by return status
    """
    # TODO: Create 2x2 subplot figure (16, 12)
    # TODO: Implement each of the 4 visualizations
    pass

multi_dimensional_analysis(df)

## Summary

In this lab, you learned how to:

1. **Generate and load** synthetic datasets with realistic quality issues
2. **Assess data quality**: missing values, duplicates, outliers
3. **Clean data** using imputation, deduplication, and outlier detection
4. **Compute statistics** and analyze distributions
5. **Visualize data** with bar charts, histograms, scatter plots, heatmaps
6. **Analyze trends** over time using time series grouping
7. **Segment customers** and compare across dimensions

---

*Introduction to Data Science & Engineering | AI Elevate*