# EDA & Visualization Notebook

This notebook explores the cleaned dataset with matplotlib and seaborn, saves the resulting figures as SVG files to `docs/figures/`, and shows how to write pytest tests for plotting functions.

Run the cells in order. Along the way the notebook defines reusable plotting helpers; these are the functions the pytest tests exercise.

In [None]:
# Environment & Imports
import os
import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Set visual style
sns.set_theme(style="whitegrid")
plt.rcParams["figure.dpi"] = 100

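# Helper: ensure output dir exists.
# `__file__` is not defined inside a notebook kernel, so paths are resolved from the
# working directory. Assumption: the notebook runs from a folder one level below the
# project root; adjust the ".." if it lives elsewhere.
OUT_DIR = os.path.abspath(os.path.join(os.getcwd(), "..", "docs", "figures"))
os.makedirs(OUT_DIR, exist_ok=True)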

RNG = np.random.default_rng(42)
random.seed(42)
np.random.seed(42)

print("Environment ready. Figures will be saved to:", OUT_DIR)

## Load and Inspect Data
# Load the cleaned dataset produced by the pipeline and take a quick look.

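# `__file__` is undefined in a notebook, so resolve the path from the working directory
# (assumed to sit one level below the project root).
df = pd.read_csv(os.path.join(os.getcwd(), '..', 'data', 'cleaned_data.csv'))
print(f'Loaded {df.shape[0]} rows and {df.shape[1]} columns')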
df.head()

### Quick summaries: dtypes and missing values

df.dtypes

df.isna().sum()

df.describe(include='all').T

In [None]:
## Matplotlib: Basic Plots
Create a basic histogram and save it as SVG.

def plot_age_histogram(df, save_path=None):
    fig, ax = plt.subplots(figsize=(6,3))
    ax.hist(df['Age'].dropna(), bins=20, color='#2b7cff', alpha=0.8)
    ax.set_title('Age distribution')
    ax.set_xlabel('Age')
    ax.set_ylabel('Count')
    plt.tight_layout()
    if save_path:
        fig.savefig(save_path, format='svg')
    return fig, ax

fig, ax = plot_age_histogram(df, save_path=os.path.join(OUT_DIR, 'age_hist.svg'))
plt.show()

In [None]:
## Seaborn: Advanced Visualizations
Violin plot of Age by Survived and a countplot for Survived by Pclass & Sex.

def plot_age_violin(df, save_path=None):
    fig, ax = plt.subplots(figsize=(6,3))
    sns.violinplot(x='Survived', y='Age', data=df, inner='quartile', palette='muted', ax=ax)
    ax.set_title('Age by Survival')
    ax.set_xlabel('Survived')
    if save_path:
        fig.savefig(save_path, format='svg')
    return fig, ax

fig, ax = plot_age_violin(df, save_path=os.path.join(OUT_DIR, 'age_violin.svg'))
plt.show()

def plot_survival_by_pclass_sex(df, save_path=None):
    # catplot returns a seaborn FacetGrid (one panel per Survived value), not a Figure.
    g = sns.catplot(x='Pclass', hue='Sex', col='Survived', data=df, kind='count', height=3, aspect=1)
    # `.figure` replaces the deprecated `.fig` attribute (seaborn >= 0.11.2).
    g.figure.suptitle('Survival counts by Pclass and Sex', y=1.03)
    if save_path:
        g.savefig(save_path, format='svg')
    return g

g = plot_survival_by_pclass_sex(df, save_path=os.path.join(OUT_DIR, 'survival_pclass_sex.svg'))
plt.show()

In [None]:
## Correlation Heatmap
Show correlations between numeric variables.

def plot_corr_heatmap(df, save_path=None):
    num = df.select_dtypes(include=["number"])
    corr = num.corr()
    fig, ax = plt.subplots(figsize=(6,4))
    sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm', ax=ax)
    ax.set_title('Correlation matrix')
    if save_path:
        fig.savefig(save_path, format='svg')
    return fig, ax

fig, ax = plot_corr_heatmap(df, save_path=os.path.join(OUT_DIR, 'corr_heatmap.svg'))
plt.show()

In [None]:
## Plot Composition & Subplots
Example of creating a figure with subplots combining histogram and boxplot.

def combined_fare_plots(df, save_path=None):
    fig, axes = plt.subplots(1,2, figsize=(8,3), gridspec_kw={'width_ratios':[3,1]})
    sns.histplot(df['Fare'], bins=30, ax=axes[0], color='#2b7cff')
    axes[0].set_title('Fare distribution')
    sns.boxplot(x=df['Fare'], ax=axes[1], color='#ff6b6b')
    axes[1].set_title('Fare boxplot')
    plt.tight_layout()
    if save_path:
        fig.savefig(save_path, format='svg')
    return fig, axes

fig, axes = combined_fare_plots(df, save_path=os.path.join(OUT_DIR, 'fare_combined.svg'))
plt.show()

In [None]:
## Styling & Themes
Show how to switch themes and color palettes.

sns.set_style('darkgrid')
sns.set_context('talk')
plt.rcParams['figure.figsize'] = (6,3)
fig, ax = plot_age_histogram(df)
plt.show()
# reset
sns.set_theme(style='whitegrid')
plt.rcParams['figure.figsize'] = (6,4)

print('Styling demo complete')
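
The manual switch-and-reset pattern in this cell can also be scoped with context managers, so the previous settings come back even if a cell raises partway through. A minimal sketch, assuming a small synthetic `Age` column (`demo_df` is a hypothetical stand-in, not the notebook's dataset):

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical stand-in data; in the notebook you would pass `df` instead.
demo_df = pd.DataFrame({"Age": np.random.default_rng(42).integers(1, 80, size=200)})

facecolor_before = plt.rcParams["axes.facecolor"]

# Style, context and figure size apply only inside the `with` block.
with sns.axes_style("darkgrid"), sns.plotting_context("talk"), \
        plt.rc_context({"figure.figsize": (6, 3)}):
    fig, ax = plt.subplots()
    ax.hist(demo_df["Age"], bins=20)
    ax.set_title("Age distribution (darkgrid / talk)")
    facecolor_inside = plt.rcParams["axes.facecolor"]

# Outside the block the earlier rcParams are restored automatically.
facecolor_after = plt.rcParams["axes.facecolor"]
plt.close(fig)
```

This avoids having to remember which settings to restore, at the cost of one extra level of indentation around the plotting call.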

In [None]:
## Reusable Plotting Functions & Unit Testing
Reusable plotting functions go in `src/plot_utils.py`. The pytest tests import them from there and assert two things: that each function returns `(fig, ax)`, and that the axis labels and titles match what is expected.

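# Create module file to be used by tests.
# `__file__` is undefined in a notebook, so build the path from the working directory
# (assumed to sit one level below the project root).
module_path = os.path.abspath(os.path.join(os.getcwd(), '..', 'src', 'plot_utils.py'))
os.makedirs(os.path.dirname(module_path), exist_ok=True)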
module_code = '''import matplotlib.pyplot as plt
import seaborn as sns

sns.set_theme(style='whitegrid')

def plot_age_histogram(df, save_path=None):
    fig, ax = plt.subplots(figsize=(6,3))
    ax.hist(df['Age'].dropna(), bins=20, color='#2b7cff', alpha=0.8)
    ax.set_title('Age distribution')
    ax.set_xlabel('Age')
    ax.set_ylabel('Count')
    if save_path:
        fig.savefig(save_path, format='svg')
    return fig, ax
'''
with open(module_path, 'w', encoding='utf-8') as f:
    f.write(module_code)

print('Wrote', module_path)

## Unit Tests for Plots
Create pytest tests under `tests/` that import the plotting functions, assert that each returns a figure and axes, and check the expected labels and titles. The cell at the end of this section runs any such tests from inside the notebook.
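
As an illustration of such a test, here is a minimal sketch. It assumes the tests run headless on the Agg backend and uses a tiny hand-made frame (`sample_df`, hypothetical) rather than the real dataset. The plotting function is repeated inline so the block runs on its own; in an actual test file (say `tests/test_plot_utils.py`, a hypothetical name) it would be imported from `src/plot_utils.py` instead.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend: no window is opened during tests
import matplotlib.pyplot as plt
import pandas as pd
from matplotlib.axes import Axes
from matplotlib.figure import Figure


def plot_age_histogram(df, save_path=None):
    # Inline copy of the notebook helper; a real test would import it instead.
    fig, ax = plt.subplots(figsize=(6, 3))
    ax.hist(df["Age"].dropna(), bins=20, color="#2b7cff", alpha=0.8)
    ax.set_title("Age distribution")
    ax.set_xlabel("Age")
    if save_path:
        fig.savefig(save_path, format="svg")
    return fig, ax


def test_plot_age_histogram_returns_labelled_axes():
    # Small hypothetical sample, including a missing value the helper must drop.
    sample_df = pd.DataFrame({"Age": [22.0, 38.0, None, 35.0, 54.0]})
    fig, ax = plot_age_histogram(sample_df)
    try:
        assert isinstance(fig, Figure)
        assert isinstance(ax, Axes)
        assert ax.get_title() == "Age distribution"
        assert ax.get_xlabel() == "Age"
        # 20 bins were requested, so the axes should hold 20 bar patches.
        assert len(ax.patches) == 20
    finally:
        plt.close(fig)  # close the figure so tests do not accumulate open figures
```

Because the function name contains `plot`, a selection such as `pytest -q tests -k plot` would pick it up.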

In [None]:
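# Example of running the tests from within the notebook (optional).
# This cell runs pytest in a subprocess if pytest is installed.
import subprocess
import sys

try:
    import pytest  # noqa: F401  (imported only to check availability)
except ImportError as e:
    print('pytest not available in this environment:', e)
else:
    print('pytest is available; running tests in tests/ (this may create output)')
    # Use the kernel's interpreter so the tests see the same installed packages.
    # `tests` is resolved relative to the current working directory.
    subprocess.run([sys.executable, '-m', 'pytest', '-q', 'tests', '-k', 'plot'], check=False)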

## Saving & Reproducibility
Figures are saved as SVG files to `docs/figures/` whenever a `save_path` is passed to a plotting helper. To keep figures reproducible, fix the RNG seeds (as in the first cell) and avoid shuffling or sampling without a fixed seed.
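
Seeds alone do not make the SVG files byte-identical between runs: by default Matplotlib's SVG backend embeds a creation date and uses a random salt when generating element ids. A minimal sketch of one way to pin both, assuming byte-stable output matters (for example to keep diffs of `docs/figures/` quiet). The salt string and the helper name `render_svg_bytes` are arbitrary choices made here, not part of the notebook:

```python
import io

import matplotlib.pyplot as plt
import numpy as np

# Fixed salt so generated element ids (clip paths, markers) repeat across runs.
plt.rcParams["svg.hashsalt"] = "eda-notebook"  # arbitrary, hypothetical value


def render_svg_bytes(values):
    """Draw a histogram and return the SVG output as bytes (hypothetical helper)."""
    fig, ax = plt.subplots(figsize=(6, 3))
    ax.hist(values, bins=20)
    ax.set_title("Deterministic SVG demo")
    buf = io.BytesIO()
    # Date=None drops the timestamp that would otherwise differ on every save.
    fig.savefig(buf, format="svg", metadata={"Date": None})
    plt.close(fig)
    return buf.getvalue()


values = np.random.default_rng(42).normal(size=500)
first = render_svg_bytes(values)
second = render_svg_bytes(values)
```

With both settings in place, `first` and `second` should compare equal on recent Matplotlib versions; the same `metadata` argument can be passed wherever the helpers call `fig.savefig`.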

## Conclusion
- Observations about distributions, skewness, and potential transforms are written here.
- Use log-transforms for skewed features like `Fare` if needed, and verify after transform.
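
To make the second point concrete, here is a minimal sketch of checking skewness before and after a `log1p` transform. The fares are synthetic draws from a log-normal distribution, used only as a stand-in for a right-skewed `Fare` column; they are not the notebook's data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Synthetic, right-skewed stand-in for `Fare` (assumption, not real data).
fare = pd.Series(rng.lognormal(mean=3.0, sigma=1.0, size=1000), name="Fare")

# log1p computes log(1 + x), so zero fares stay finite.
fare_log = np.log1p(fare)

skew_before = fare.skew()
skew_after = fare_log.skew()
print(f"skew before: {skew_before:.2f}, after log1p: {skew_after:.2f}")
```

On the real data the same check would compare `df['Fare'].skew()` with `np.log1p(df['Fare']).skew()`, followed by a second look at the histogram of the transformed column.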

This notebook provides reusable plotting functions and demonstrates how to test them using pytest.