# Data Analysis — Plan:
- Step 1 — Basic Churn Summary
- Step 2 — Churn by Demographics (gender, senior citizen, dependents)
- Step 3 — Churn by Subscription & Services
- Step 4 — pivot_table for Contract, PaymentMethod, InternetService
- Step 5 — Tenure Segmentation (cohort-style buckets)
- Step 6 — Revenue Metrics (ARPU, LTV proxy)
- Step 7 — Collect all tables → Save Day 3 outputs

## Imports

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os

In [None]:
pd.set_option('display.max_columns', None)

## Data Load & Scan

- Basic structure checks
- Summary stats
- Missing values
- Duplicate checks
- Unique values per column => find columns that need to be converted to 'category'
- Quick numerical distribution (sanity check)
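
The checks listed above can be bundled into one per-column overview. The following is a minimal sketch, not part of the original notebook: the `scan` helper and the `toy` frame are hypothetical names, and the data is made up purely to show the output shape.

```python
import pandas as pd

def scan(frame: pd.DataFrame) -> pd.DataFrame:
    """Per-column overview: dtype, missing count/percent, unique count."""
    return pd.DataFrame({
        'dtype': frame.dtypes.astype(str),
        'missing_count': frame.isna().sum(),
        'missing_pct': (frame.isna().mean() * 100).round(2),
        'unique_count': frame.nunique(),  # NaN is not counted as a value
    })

# Toy data for illustration only (NOT the Telco dataset)
toy = pd.DataFrame({
    'customerID': ['a', 'b', 'c', 'c'],
    'Contract': ['Month-to-month', 'Two year', None, None],
    'MonthlyCharges': [29.85, 56.95, 53.85, 53.85],
})

overview = scan(toy)
n_duplicates = int(toy.duplicated().sum())  # fully identical rows
print(overview)
print("Duplicate rows:", n_duplicates)
```

Columns with a small `unique_count` are the candidates for conversion to `category`.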

In [None]:
import kagglehub
path = kagglehub.dataset_download('blastchar/telco-customer-churn')

In [None]:
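# pick the CSV explicitly instead of relying on directory listing order
filename = next(f for f in os.listdir(path) if f.endswith('.csv'))
fp = os.path.join(path, filename)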

In [None]:
df = pd.read_csv(fp)

In [None]:
df.head()

In [None]:
# df.shape
# df.dtypes
# df.columns.tolist()
df.info()

In [None]:
df.describe(include='all').T

## Data Clean
- Missing values
- Duplicate values
- Fix TotalCharges column
  - object -> numeric
  - remove NaN values/rows
- Cardinality checks:
  - Convert columns with low cardinality to category (object -> category); define category threshold
- Standardize Categorical Values:
  - Replace “No internet service” & “No phone service” with simple “No”
- Final Validation Checks
- Export cleaned dataset

### Missing or Duplicate Values

In [None]:
missing = df.isna().sum().to_frame('missing_count')

In [None]:
missing['missing_pct'] = (missing['missing_count'] / len(df)) * 100

In [None]:
print("Missing values per column:\n", missing)

In [None]:
duplicates = df.duplicated().sum()

In [None]:
print("Duplicate rows:", duplicates)

### Fix TotalCharges Column
- convert object -> numeric/float
- remove NaN rows

In [None]:
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')

In [None]:
df['TotalCharges'].isna().sum() # 11

In [None]:
df = df[df['TotalCharges'].notna()].copy()

In [None]:
df['TotalCharges'].isna().sum() # 0
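
Before dropping the coerced rows, it can help to look at what failed to parse. The sketch below uses a made-up `toy` frame (not the Telco data) and assumes the unparseable values are blank strings, which is how they are commonly reported for this dataset on rows with `tenure` equal to 0.

```python
import pandas as pd

# Toy data for illustration only: one blank TotalCharges value
toy = pd.DataFrame({
    'tenure': [0, 1, 24],
    'TotalCharges': [' ', '29.85', '1400.5'],
})

# errors='coerce' turns anything non-numeric into NaN
parsed = pd.to_numeric(toy['TotalCharges'], errors='coerce')

# Rows that had a text value but could not be parsed as a number
failed = toy[parsed.isna() & toy['TotalCharges'].notna()]
print(failed)
```

If every failed row turns out to be a brand-new customer, dropping them (as done above) removes accounts with no billing history rather than random records.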

### Cardinality/Unique Check & Convert Categorical Columns

In [None]:
# unique_vals = df.nunique().to_frame("unique_count")
unique_vals = df.nunique()

In [None]:
print("Unique values:\n", unique_vals)

In [None]:
# convert column to category where unique values <= 4
cat_threshold = 4
low_cardinality_cols = unique_vals[unique_vals <= cat_threshold].index

In [None]:
len(low_cardinality_cols)

In [None]:
df[low_cardinality_cols] = df[low_cardinality_cols].astype('category')

In [None]:
df.dtypes

### Standardize Categorical Values
- Replace “No internet service” & “No phone service” with simple “No”

In [None]:
internet_related_cols = [
    'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies'
]

In [None]:
for col in internet_related_cols:
  df[col] = df[col].replace({'No internet service': 'No'})

In [None]:
df['MultipleLines'] = df['MultipleLines'].replace({'No phone service': 'No'})

In [None]:
# validation
for col in internet_related_cols + ['MultipleLines']:
  print(col, df[col].nunique())
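
The `replace` calls above run on columns that were already converted to `category`. Recent pandas releases may emit a deprecation warning when `replace` changes the set of categories, so one hedged alternative is to go through `object` dtype and convert back. This is a sketch on a made-up Series, not the notebook's method.

```python
import pandas as pd

# Toy categorical column for illustration only
toy = pd.Series(['Yes', 'No', 'No internet service', 'No'], dtype='category')

# Merge the label on plain objects, then rebuild the categorical
cleaned = (
    toy.astype(object)
       .replace({'No internet service': 'No'})
       .astype('category')
)
print(cleaned.cat.categories.tolist())
```

Running the standardization before the `astype('category')` step would avoid the round trip entirely.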

### Final Validation Checks

In [None]:
df.isna().sum()

In [None]:
df.dtypes

In [None]:
df.describe(include='all').T

### Export Cleaned Dataset

In [None]:
os.makedirs("data", exist_ok=True)

df.to_csv("data/cleaned_dataset_v1.csv", index=False)

## Data Analysis

### Step 1 — Basic Churn Summary

In [None]:
churn_summary = df['Churn'].value_counts().to_frame('count')
# churn_summary

In [None]:
churn_summary['percent'] = round((churn_summary['count'] / len(df)) * 100, 2)

In [None]:
churn_summary

Observation:
- Customers churned: 26.58%
- More than a quarter (>25%) => alarming
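
The count and percentage table can also be built in one step with `value_counts(normalize=True)`. A minimal sketch on a made-up Series (the values are illustrative, not the dataset's):

```python
import pandas as pd

# Toy churn labels for illustration only
toy_churn = pd.Series(['No', 'No', 'No', 'Yes'], name='Churn')

summary = pd.DataFrame({
    'count': toy_churn.value_counts(),
    # normalize=True returns shares that sum to 1
    'percent': (toy_churn.value_counts(normalize=True) * 100).round(2),
})
print(summary)
```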

### Step 2 — Churn by Demographics (gender, senior citizen, dependents)


In [None]:
gender_churn = pd.crosstab(df['gender'], df['Churn'], normalize='index') * 100
gender_churn

In [None]:
senior_churn = pd.crosstab(df['SeniorCitizen'], df['Churn'], normalize='index') * 100
senior_churn

In [None]:
dependents_churn = pd.crosstab(df['Dependents'], df['Churn'], normalize='index') * 100
dependents_churn

Observations:
- Gender → no meaningful difference in churn rate between groups
- Senior citizens → markedly higher churn than non-seniors
- Customers with dependents → noticeably lower churn than customers without
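
Each table in this step follows the same pattern: a row-normalized crosstab, read off at the `Yes` column. The sketch below wraps that in a hypothetical helper, `churn_rate_by` (not in the original notebook), and runs it on made-up data.

```python
import pandas as pd

def churn_rate_by(frame, col, target='Churn', positive='Yes'):
    """Churn rate (%) per category of `col`, highest first."""
    # normalize='index' makes every row sum to 1 (100 after scaling)
    table = pd.crosstab(frame[col], frame[target], normalize='index') * 100
    return table[positive].sort_values(ascending=False).round(2)

# Toy data for illustration only
toy = pd.DataFrame({
    'SeniorCitizen': [1, 1, 0, 0, 0, 0],
    'Churn': ['Yes', 'No', 'Yes', 'No', 'No', 'No'],
})

rates = churn_rate_by(toy, 'SeniorCitizen')
spread = rates.max() - rates.min()  # gap between best and worst group
print(rates)
print("Spread (percentage points):", spread)
```

The spread between the highest and lowest group gives a quick number behind statements such as "no effect" or "very high churn".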

### Step 3 — Churn by Subscription & Services

In [None]:
service_cols = [
    'PhoneService', 'MultipleLines', 'InternetService',
    'OnlineSecurity', 'OnlineBackup', 'DeviceProtection',
    'TechSupport', 'StreamingTV', 'StreamingMovies'
]

service_churn_tables = {}

for col in service_cols:
  table = pd.crosstab(df[col], df['Churn'], normalize='index') * 100
  service_churn_tables[col] = table
  print(f"--- {col} vs Churn (%) ---")
  print(table, "\n")

Observations:
- Fiber optic customers → very high churn / highest-risk segment
- Lack of TechSupport → high churn
- Lack of OnlineSecurity → high churn
- Lack of DeviceProtection → slightly higher churn
- Customers with streaming services (StreamingTV, StreamingMovies) → slightly higher churn
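
Printing one table per service makes it hard to compare features side by side. One option, sketched here on made-up data, is to stack the per-feature tables with `pd.concat` and rank the `Yes` column. The level names `feature` and `value` are assumptions chosen for this example.

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    'TechSupport': ['Yes', 'No', 'No', 'No'],
    'OnlineSecurity': ['Yes', 'Yes', 'No', 'No'],
    'Churn': ['No', 'No', 'Yes', 'Yes'],
})

tables = {
    col: pd.crosstab(toy[col], toy['Churn'], normalize='index') * 100
    for col in ['TechSupport', 'OnlineSecurity']
}

# Dict keys become the outer index level; name both levels explicitly
combined = pd.concat(tables).rename_axis(['feature', 'value'])

# Highest-churn (feature, value) pairs first
ranked = combined['Yes'].sort_values(ascending=False)
print(ranked)
```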

### Step 4 — pivot_table for Contract, PaymentMethod, InternetService

In [None]:
contract_pivot = pd.crosstab(df['Contract'], df['Churn'], normalize='index') * 100
contract_pivot

In [None]:
payment_pivot = pd.crosstab(df['PaymentMethod'], df['Churn'], normalize='index') * 100
payment_pivot

In [None]:
internet_pivot = pd.crosstab(df['InternetService'], df['Churn'], normalize='index') * 100
internet_pivot

Observations:
- Month-to-month customers → very high churn / highest-risk segment
- Electronic check → high churn
- Fiber optic customers → high churn
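
The heading for this step names `pivot_table`, while the cells use `pd.crosstab`. For reference, a `pivot_table` route might look like the sketch below. It assumes a hypothetical 0/1 `churn_flag` column (not in the original notebook) and uses made-up data.

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    'Contract': ['Month-to-month'] * 3 + ['Two year'] * 2,
    'InternetService': ['Fiber optic', 'Fiber optic', 'DSL', 'DSL', 'DSL'],
    'Churn': ['Yes', 'Yes', 'No', 'No', 'No'],
})

# The mean of a 0/1 flag is the churn rate
toy['churn_flag'] = (toy['Churn'] == 'Yes').astype(int)

# One-way: churn rate (%) per contract type
one_way = toy.pivot_table(index='Contract', values='churn_flag', aggfunc='mean') * 100

# Two-way: contract type by internet service (NaN where a combination is absent)
two_way = toy.pivot_table(
    index='Contract', columns='InternetService',
    values='churn_flag', aggfunc='mean',
) * 100

print(one_way)
print(two_way)
```

The two-way table is where `pivot_table` adds something over the single-feature crosstabs: it shows whether two risk factors compound.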

### Step 5 — Tenure Segmentation (cohort-style buckets)

In [None]:
bins = [0, 12, 24, 36, 48, 60, 72]

labels = ['0-12', '13-24', '25-36', '37-48', '49-60', '61-72']

df['tenure_group'] = pd.cut(df['tenure'], bins=bins, labels=labels, include_lowest=True)


In [None]:
tenure_churn = pd.crosstab(df['tenure_group'], df['Churn'], normalize='index') * 100
tenure_churn

Observations:
- New customers (0-12 months) → very high risk / highest-risk segment
- Churn drops after 24 months
- Long-term customers (61-72 months) → lowest churn / most loyal cohort
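
`pd.cut` uses right-closed intervals by default, so a tenure of 12 falls in `0-12` and 13 falls in `13-24`. `include_lowest=True` additionally places 0 in the first bucket. A quick boundary check on made-up tenure values:

```python
import pandas as pd

bins = [0, 12, 24, 36, 48, 60, 72]
labels = ['0-12', '13-24', '25-36', '37-48', '49-60', '61-72']

# Values chosen to sit on or next to the bin edges
toy_tenure = pd.Series([0, 1, 12, 13, 24, 25, 72])

groups = pd.cut(toy_tenure, bins=bins, labels=labels, include_lowest=True)
check = pd.DataFrame({'tenure': toy_tenure, 'tenure_group': groups})
print(check)
```

Any tenure above 72 would come back as NaN with these bins, so the last edge should be at least the maximum tenure in the data.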

### Step 6 — Revenue Metrics (ARPU, LTV proxy)
- ARPU by churn
- LTV by churn
- ARPU by contract type


In [None]:
# ARPU (average monthly revenue)
arpu_summary = df.groupby('Churn')['MonthlyCharges'].mean()
arpu_summary

In [None]:
# LTV proxy: current monthly charge x months of tenure (ignores past price changes)
df['LTV'] = df['MonthlyCharges'] * df['tenure']

In [None]:
ltv_summary = df.groupby('Churn')['LTV'].mean()
ltv_summary

In [None]:
arpu_by_contract = df.groupby('Contract')['MonthlyCharges'].mean()
arpu_by_contract

Observations:
- Churned customers pay significantly more per month (+21% ARPU vs. retained customers) → the highest-paying customers are the least loyal
- Retained customers generate 67% higher lifetime value (LTV proxy) than churned customers → loyalty = longevity = revenue
- Month-to-month customers have the highest ARPU (+5.5 per user per month) → but also the highest churn (42%)
  - Retaining month-to-month customers is financially high-impact
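
The percentage differences quoted above are relative uplifts between group means. The sketch below shows the arithmetic on made-up numbers. The variable names are hypothetical and the values are not the dataset's.

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    'Churn': ['Yes', 'Yes', 'No', 'No'],
    'MonthlyCharges': [80.0, 70.0, 50.0, 50.0],
    'tenure': [2, 4, 10, 20],
})
toy['LTV'] = toy['MonthlyCharges'] * toy['tenure']  # same proxy as above

means = toy.groupby('Churn')[['MonthlyCharges', 'LTV']].mean()

# Relative uplift in percent: (a / b - 1) * 100
arpu_uplift_pct = (means.loc['Yes', 'MonthlyCharges'] / means.loc['No', 'MonthlyCharges'] - 1) * 100
ltv_uplift_pct = (means.loc['No', 'LTV'] / means.loc['Yes', 'LTV'] - 1) * 100

print(means)
print(f"ARPU, churned vs retained: {arpu_uplift_pct:+.1f}%")
print(f"LTV, retained vs churned: {ltv_uplift_pct:+.1f}%")
```

Note the direction of each comparison: the ARPU uplift is churned relative to retained, while the LTV uplift is retained relative to churned.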

### Step 7 — Collect all tables → Save Day 3 outputs / todo

In [None]:
# os.makedirs("data/processed", exist_ok=True)
# export_tables = {'filename': df...}
# df.to_csv()
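
One possible shape for the export step, as a sketch only: the `save_tables` helper, the table name, and the temporary output directory are all assumptions for illustration, and the table content is made up.

```python
import os
import tempfile
import pandas as pd

def save_tables(tables, out_dir):
    """Write each named table to <out_dir>/<name>.csv and return the paths."""
    os.makedirs(out_dir, exist_ok=True)
    written = []
    for name, table in tables.items():
        fp = os.path.join(out_dir, f"{name}.csv")
        table.to_csv(fp)  # keep the index: it holds the category labels
        written.append(fp)
    return written

# Toy table for illustration only
toy_table = pd.DataFrame(
    {'No': [60.0], 'Yes': [40.0]},
    index=pd.Index(['Month-to-month'], name='Contract'),
)

out_dir = tempfile.mkdtemp()  # a throwaway directory for the example
paths = save_tables({'contract_churn': toy_table}, out_dir)
print(paths)
```

In the notebook, the dictionary would hold the tables built in Steps 1 to 6 and `out_dir` would be a project folder.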