# 01 – Load, Clean & Exploratory Data Analysis (Telco Churn)

In this notebook we:

1. Load the original *Telco Customer Churn* CSV from Kaggle.
2. Clean it (type conversions, handling missing values, binary mapping …).
3. Perform a quick but thorough EDA – distributions, class imbalance, correlations,
   and visual insights that will guide feature engineering.
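The cleaning step in item 2 lives in `src.data.clean`, which is not shown in this notebook. As a rough illustration only, the sketch below shows what such a function might do on a toy frame. The function name `clean_sketch`, the choice to fill missing `TotalCharges` with 0, and the toy rows are all assumptions, not the project's actual code.

```python
import pandas as pd


def clean_sketch(raw: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for src.data.clean (illustration only)."""
    df = raw.copy()
    # Type conversion: non-numeric strings such as " " become NaN.
    df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")
    # Missing values: assumption that a missing total means nothing was billed yet.
    df["TotalCharges"] = df["TotalCharges"].fillna(0.0)
    # Binary mapping: Yes/No target becomes 1/0.
    df["Churn"] = df["Churn"].map({"Yes": 1, "No": 0}).astype("int64")
    # Multi-level text column becomes a categorical.
    df["Contract"] = df["Contract"].astype("category")
    return df


toy = pd.DataFrame({
    "Contract": ["Month-to-month", "Two year", "One year"],
    "TotalCharges": ["29.85", " ", "1889.5"],
    "Churn": ["Yes", "No", "No"],
})
cleaned = clean_sketch(toy)
print(cleaned.dtypes)
```

Mapping the target to 0/1 is what lets later cells call `df["Churn"].mean()` to get the churn proportion directly.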
# -------------------------------------------------------------------------
# Imports & global settings
# -------------------------------------------------------------------------
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

# Plot rendering and default style (SVG output, whitegrid theme)
%config InlineBackend.figure_format = 'svg'
sns.set(style="whitegrid", rc={'figure.figsize':(10,6)})

# -------------------------------------------------------------------------
# Paths
# -------------------------------------------------------------------------
PROJECT_ROOT = Path("..")  # notebook lives under notebooks/
RAW_PATH = PROJECT_ROOT / "data" / "raw" / "Telco-Customer-Churn.csv"
PROCESSED_PATH = PROJECT_ROOT / "data" / "processed" / "telco_processed.csv"
# -------------------------------------------------------------------------
# Load raw data
# -------------------------------------------------------------------------
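import sys

# Assumption: `src/` sits at the project root, one level above this notebook,
# so the root has to be on sys.path for the import below to resolve.
if str(PROJECT_ROOT.resolve()) not in sys.path:
    sys.path.insert(0, str(PROJECT_ROOT.resolve()))

from src.data import load_raw, clean, save_processed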

if not RAW_PATH.exists():
    raise FileNotFoundError(
        f"Raw CSV not found at {RAW_PATH}. "
        "Download the dataset from https://www.kaggle.com/datasets/blastchar/telco-customer-churn "
        "and place it under `data/raw/`."
    )

raw_df = load_raw(RAW_PATH)
raw_df.head()
# -------------------------------------------------------------------------
# Quick sanity check – shape & missing values
# -------------------------------------------------------------------------
print(f"Rows: {raw_df.shape[0]}, Columns: {raw_df.shape[1]}")
print("Missing values per column:")
print(raw_df.isna().sum())
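One caveat about the check above: `isna()` only counts real `NaN`/`None` values. If the loader leaves whitespace-only strings in a text column (the raw Telco CSV is known for blank `TotalCharges` cells), they will show up as zero missing values. The helper below is a minimal sketch for spotting them; `count_blank_strings` is a hypothetical name, and whether `load_raw` already handles this case is not known from this notebook.

```python
import pandas as pd


def count_blank_strings(frame: pd.DataFrame) -> pd.Series:
    """Count empty or whitespace-only string cells per column."""
    counts = {}
    for col in frame.columns:
        is_blank = frame[col].map(lambda v: isinstance(v, str) and v.strip() == "")
        counts[col] = int(is_blank.sum())
    return pd.Series(counts)


# Toy frame: the blank TotalCharges cell is invisible to isna().
toy = pd.DataFrame({"tenure": [0, 12], "TotalCharges": [" ", "350.5"]})
blank_counts = count_blank_strings(toy)
print(toy.isna().sum())
print(blank_counts)
```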
# -------------------------------------------------------------------------
# Clean data
# -------------------------------------------------------------------------
df = clean(raw_df)

# Save the processed version for later notebooks (so we don't re‑run clean every time)
save_processed(df, PROCESSED_PATH)

df.head()
# -------------------------------------------------------------------------
# Distribution of the target (Churn)
# -------------------------------------------------------------------------
sns.countplot(x="Churn", data=df)
plt.title("Class distribution – Churn")
plt.show()

print("Churn proportion: {:.2%}".format(df["Churn"].mean()))
# -------------------------------------------------------------------------
# Numerical features – histograms
# -------------------------------------------------------------------------
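num_cols = df.select_dtypes(include=["int64","float64"]).columns.tolist()
num_cols.remove("Churn")  # exclude target (assumes clean() mapped Churn to 0/1)

# Derive the grid from the column count so the layout always has enough axes
n_rows = -(-len(num_cols) // 3)  # ceiling division
df[num_cols].hist(bins=30, layout=(n_rows, 3), figsize=(12, 4 * n_rows), edgecolor='k')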
plt.suptitle("Histograms of numeric variables")
plt.show()
# -------------------------------------------------------------------------
# Categorical features – bar plots (frequency)
# -------------------------------------------------------------------------
cat_cols = df.select_dtypes(include="category").columns.tolist()
for col in cat_cols:
    plt.figure(figsize=(8,4))
    sns.countplot(y=col, hue="Churn", data=df, order=df[col].value_counts().index)
    plt.title(f"{col} (split by Churn)")
    plt.tight_layout()
    plt.show()
# -------------------------------------------------------------------------
# Correlation heatmap (numeric only) – includes target
# -------------------------------------------------------------------------
corr = df.select_dtypes(include=["int64","float64"]).corr()
plt.figure(figsize=(8,6))
sns.heatmap(corr, annot=True, cmap="coolwarm", fmt=".2f")
plt.title("Pearson correlation (numeric features)")
plt.show()
# -------------------------------------------------------------------------
# Pair‑wise relationship – MonthlyCharges vs. TotalCharges colored by churn
# -------------------------------------------------------------------------
plt.figure(figsize=(8,6))
sns.scatterplot(
    x="MonthlyCharges", y="TotalCharges", hue="Churn", data=df, alpha=0.7
)
plt.title("Monthly vs Total Charges")
plt.show()
## Immediate observations

* **Class imbalance** – only ~26% of customers churn.
* **Tenure** has a moderate negative correlation with churn (`-0.35`), i.e., newer customers churn more.
* **MonthlyCharges** is higher on average for churners.
* **Contract type** (`Month-to-month` vs. longer contracts) appears to be the most predictive categorical variable.
* Some categories (`StreamingTV`, `StreamingMovies`, etc.) have many “No internet service” entries – we’ll later collapse these into a single “No service” flag.
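The last observation, collapsing “No internet service” entries, can be sketched as follows. This is one possible approach on a toy frame, not the project's implementation: the flag name `NoInternetService` is an assumption, and the toy columns hold plain strings rather than the `category` dtype used in `df`.

```python
import pandas as pd

toy = pd.DataFrame({
    "StreamingTV": ["Yes", "No", "No internet service"],
    "StreamingMovies": ["No", "Yes", "No internet service"],
})
service_cols = ["StreamingTV", "StreamingMovies"]

# One flag records the absence of internet service ...
toy["NoInternetService"] = (
    toy[service_cols].eq("No internet service").any(axis=1).astype(int)
)
# ... so the individual service columns can be reduced to plain Yes/No.
toy[service_cols] = toy[service_cols].replace("No internet service", "No")
print(toy)
```

This removes a level that is perfectly redundant across all internet-dependent columns while keeping the information in a single feature.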

These insights will shape our feature engineering (e.g. tenure bins, contract duration in months, count of active services) and modelling decisions (balancing the classes, choosing tree-based models, etc.).
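To make the three feature ideas concrete, here is a minimal sketch on a toy frame. The bin edges, the new column names (`tenure_bin`, `contract_months`, `n_services`) and the choice of service columns are illustrative assumptions; the actual feature engineering may differ.

```python
import pandas as pd

toy = pd.DataFrame({
    "tenure": [1, 14, 30, 72],
    "Contract": ["Month-to-month", "One year", "Two year", "Two year"],
    "OnlineSecurity": ["No", "Yes", "Yes", "No internet service"],
    "StreamingTV": ["Yes", "Yes", "No", "No internet service"],
})

# Tenure bins: the edges (in months) are illustrative, not tuned.
toy["tenure_bin"] = pd.cut(
    toy["tenure"],
    bins=[0, 12, 24, 48, 72],
    labels=["0-12", "13-24", "25-48", "49-72"],
    include_lowest=True,
)

# Contract duration in months, mapped from the contract label.
toy["contract_months"] = toy["Contract"].map(
    {"Month-to-month": 1, "One year": 12, "Two year": 24}
)

# Count of active services: number of service columns equal to "Yes".
service_cols = ["OnlineSecurity", "StreamingTV"]
toy["n_services"] = toy[service_cols].eq("Yes").sum(axis=1)
print(toy[["tenure_bin", "contract_months", "n_services"]])
```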