<h1 style="color:#1f77b4; text-align:left; font-size:40px;">
    Data Cleaning
</h1>

<h3 style="color:#555; text-align:left;">
    <strong>Purpose:</strong><br><br>
    - Load the unified master dataset *telco_master* created in notebook 002.<br><br>
    - Perform <strong>light, transparent cleaning</strong> <em>before</em> EDA:
    <ul>
        <li>Remove clearly redundant or uninformative columns</li>
        <li>Standardize key types and strip whitespace</li>
        <li>Run basic structure and missing-value checks</li>
    </ul>
    - Use <strong>simple visualizations and summaries</strong> to justify why some columns are dropped.<br><br>
    - Save a "basic cleaned" dataset for EDA: *telco_master_basic_clean*
</h3>

<h2 style="color:#1f77b4; border-bottom: 3px solid #1f77b4; padding-bottom:4px;">
</h2>


# Imports

In [2]:
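import sys
from pathlib import Path

# Make the project's src/ folder importable (project root is the parent of the working directory).
project_root = Path().resolve().parent
src_path = project_root / "src"
if str(src_path) not in sys.path:
    sys.path.append(str(src_path))

# Project helpers for loading data and printing a quick summary.
from utils_data import load_df, quick_overview

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns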


# Load Interim Data

In [3]:
MASTER_NAME = "telco_master"

df = load_df(MASTER_NAME, folder="interim")
print(f"\nLoaded '{MASTER_NAME}' from interim with shape: {df.shape}")


📂 Loaded: /Users/dianagomes/Desktop/work/s2/EnterpriseDataScienceBootcamp_workgroup/data/interim/telco_master.csv

Loaded 'telco_master' from interim with shape: (7043, 56)


# Quick Overview

In [4]:
quick_overview(df, name="telco_master", show_head=True, n_head=5)


===== telco_master =====
Shape: 7043 rows × 56 columns

Data types:
demo_count                                 int64
demo_gender                               object
demo_age                                   int64
demo_under_30                             object
demo_senior_citizen                       object
demo_married                              object
demo_dependents                           object
demo_number_of_dependents                  int64
loc_count                                  int64
loc_country                               object
loc_state                                 object
loc_city                                  object
loc_zip_code                               int64
loc_lat_long                              object
loc_latitude                             float64
loc_longitude                            float64
svc_count                                  int64
svc_quarter                               object
svc_referred_a_friend                     object

Unnamed: 0,demo_count,demo_gender,demo_age,demo_under_30,demo_senior_citizen,demo_married,demo_dependents,demo_number_of_dependents,loc_count,loc_country,...,st_quarter,st_satisfaction_score,st_customer_status,st_churn_label,st_churn_value,st_churn_score,st_cltv,st_churn_category,st_churn_reason,zip_population
0,1,Male,78,No,Yes,No,No,0,1,United States,...,Q3,3,Churned,Yes,1,91,5433,Competitor,Competitor offered more data,68701
1,1,Female,74,No,Yes,Yes,Yes,1,1,United States,...,Q3,3,Churned,Yes,1,69,5302,Competitor,Competitor made better offer,55668
2,1,Male,71,No,Yes,No,Yes,3,1,United States,...,Q3,2,Churned,Yes,1,81,3179,Competitor,Competitor made better offer,47534
3,1,Female,78,No,Yes,Yes,Yes,1,1,United States,...,Q3,2,Churned,Yes,1,88,5337,Dissatisfaction,Limited range of services,27778
4,1,Female,80,No,Yes,Yes,Yes,1,1,United States,...,Q3,2,Churned,Yes,1,67,2793,Price,Extra data charges,26265


# Phase 1 - Structural & Integrity Checks

Verify the integrity of the merged dataset before applying any transformations.
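
As an illustration of what such checks might cover, here is a minimal sketch run on a small toy frame standing in for `df`. The helper name `integrity_report` and the toy values are assumptions made for this example; they are not part of the project's `utils_data` module.

```python
import pandas as pd


def integrity_report(frame: pd.DataFrame) -> dict:
    """Collect a few basic structural facts about a DataFrame."""
    return {
        "n_rows": len(frame),
        "n_cols": frame.shape[1],
        # Fully identical rows, which can appear after a faulty merge
        "n_duplicate_rows": int(frame.duplicated().sum()),
        # Repeated column names, another common merge side effect
        "n_duplicate_columns": int(frame.columns.duplicated().sum()),
        # Columns holding a single value carry no information
        "constant_columns": [
            c for c in frame.columns if frame[c].nunique(dropna=False) <= 1
        ],
    }


# Toy data (hypothetical values) standing in for the real master dataset
toy = pd.DataFrame({
    "demo_count": [1, 1, 1, 1],
    "demo_gender": ["Male", "Female", "Male", "Male"],
    "demo_age": [78, 74, 71, 78],
})

report = integrity_report(toy)
print(report)
```

On the real data, the same function could be called as `integrity_report(df)` and its row count compared against the shape printed when the file was loaded.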

## Inspect variables

In [5]:
dtype_summary = (
    df.dtypes
    .reset_index()
    .rename(columns={'index': 'column', 0: 'dtype'})
    .sort_values('dtype')
)
display(dtype_summary)

Unnamed: 0,column,dtype
0,demo_count,int64
52,st_cltv,int64
51,st_churn_score,int64
50,st_churn_value,int64
47,st_satisfaction_score,int64
45,st_count,int64
42,svc_total_extra_data_charges,int64
20,svc_tenure_in_months,int64
19,svc_number_of_referrals,int64
16,svc_count,int64


### Grouping the variables for analysis

In [6]:
demographic_vars = [c for c in df.columns if c.startswith("demo_")]
location_vars    = [c for c in df.columns if c.startswith("loc_")]
service_vars     = [c for c in df.columns if c.startswith("svc_")]
status_vars      = [c for c in df.columns if c.startswith("st_")]

In [7]:
for group, cols in {
    "Demographics": demographic_vars,
    "Location": location_vars,
    "Services": service_vars,
    "Status": status_vars
}.items():
    print(f"\n{group} ({len(cols)} vars)")
    print(cols)


Demographics (8 vars)
['demo_count', 'demo_gender', 'demo_age', 'demo_under_30', 'demo_senior_citizen', 'demo_married', 'demo_dependents', 'demo_number_of_dependents']

Location (8 vars)
['loc_count', 'loc_country', 'loc_state', 'loc_city', 'loc_zip_code', 'loc_lat_long', 'loc_latitude', 'loc_longitude']

Services (29 vars)
['svc_count', 'svc_quarter', 'svc_referred_a_friend', 'svc_number_of_referrals', 'svc_tenure_in_months', 'svc_offer', 'svc_phone_service', 'svc_avg_monthly_long_distance_charges', 'svc_multiple_lines', 'svc_internet_service', 'svc_internet_type', 'svc_avg_monthly_gb_download', 'svc_online_security', 'svc_online_backup', 'svc_device_protection_plan', 'svc_premium_tech_support', 'svc_streaming_tv', 'svc_streaming_movies', 'svc_streaming_music', 'svc_unlimited_data', 'svc_contract', 'svc_paperless_billing', 'svc_payment_method', 'svc_monthly_charge', 'svc_total_charges', 'svc_total_refunds', 'svc_total_extra_data_charges', 'svc_total_long_distance_charges', 'svc_total

---
**Services variables**

Each service variable was reviewed for business interpretability, temporal validity, and expected influence on churn. The variables fall into five behavioral dimensions, plus one metadata group:
- **Customer relationship & engagement**
  - svc_tenure_in_months, svc_offer
  - svc_referred_a_friend, svc_number_of_referrals, svc_contract

- **Service subscriptions**
  - svc_phone_service, svc_multiple_lines
  - svc_internet_service, svc_internet_type
  - svc_streaming_tv, svc_streaming_movies, svc_streaming_music
  - svc_unlimited_data
  - svc_online_security, svc_online_backup
  - svc_device_protection_plan, svc_premium_tech_support

- **Usage intensity**
  - svc_avg_monthly_long_distance_charges
  - svc_avg_monthly_gb_download

- **Billing and payment behavior**
  - svc_paperless_billing, svc_payment_method, svc_monthly_charge

- ⚠️ **Revenue and financial exposure**
  - High leakage risk: these cumulative totals are likely to encode information from **after** the churn event. For example:
    - Low total revenue → the customer probably churned early
    - High total revenue → the customer stayed longer
  - svc_total_charges, svc_total_refunds,
    svc_total_extra_data_charges, svc_total_long_distance_charges,
    svc_total_revenue

- **Time / metadata**
  - svc_quarter, svc_count
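
The leakage concern for the revenue totals can be sketched with toy numbers. The example below assumes, purely for illustration, that total charges accumulate as tenure times monthly charge; whether the real `svc_total_*` columns behave this way would need to be checked on `df`.

```python
import pandas as pd

# Toy data (hypothetical values), not taken from telco_master
toy = pd.DataFrame({
    "svc_tenure_in_months": [1, 5, 24, 60, 72],
    "svc_monthly_charge": [70.0, 50.0, 80.0, 100.0, 90.0],
})

# Assumption for illustration: totals accumulate as tenure x monthly charge
toy["svc_total_charges"] = (
    toy["svc_tenure_in_months"] * toy["svc_monthly_charge"]
)

# A cumulative total mostly tracks how long the customer stayed,
# which is itself determined by when (or whether) they churned.
corr = toy["svc_total_charges"].corr(toy["svc_tenure_in_months"])
print(f"corr(total_charges, tenure) = {corr:.3f}")
```

A correlation close to 1 on the real data would suggest that the totals add little beyond tenure while still carrying the leakage risk described above.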

---
**Target & Churn Outcome Variables**

Most variables in this group describe the churn outcome itself. Using them as features in training would cause data leakage and unrealistically high model performance.
- ✅ **st_churn_label** *(object)*  **will be used for validation**
  → Target (Yes/No churn) 

- ❌ **st_count** *(int64)*  
  → No variation (always 1)

- ❌ **st_quarter** *(object)*  
  → Time reference tied to churn outcome (Q3)

- ⚠️ **st_satisfaction_score** *(int64)*  
  → Customer satisfaction rating (1–5)  
  Potential leakage risk: proceed with caution

- ❌ **st_customer_status** *(object)*  
  → Current state (Active/Churned)

- ❌ **st_churn_value** *(int64)*  
  → Encoded churn status (binary)

- ❌ **st_churn_score** *(int64)*  
  → Likely a precomputed churn risk score

- ❌ **st_cltv** *(int64)*  
  → Customer lifetime value

- ❌ **st_churn_category** *(object)*  
  → Why they churned (Competitor, Price, Service, etc.)

- ❌ **st_churn_reason** *(object)*  
  → Detailed textual reason
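
One way to apply the list above is to separate the target from the features and drop the excluded status columns in a single step. This is a minimal sketch on toy data; the names `TARGET` and `EXCLUDED_STATUS_COLS` and all toy values are assumptions for illustration, and `st_satisfaction_score` is deliberately left out of the exclusion list because it is only flagged for caution.

```python
import pandas as pd

TARGET = "st_churn_label"

# Columns marked for exclusion in the list above
EXCLUDED_STATUS_COLS = [
    "st_count", "st_quarter", "st_customer_status", "st_churn_value",
    "st_churn_score", "st_cltv", "st_churn_category", "st_churn_reason",
]

# Toy data (hypothetical values) with a subset of the real columns
toy = pd.DataFrame({
    "demo_age": [78, 74, 35],
    "svc_contract": ["Month-to-Month", "One Year", "Two Year"],
    "st_churn_label": ["Yes", "Yes", "No"],
    "st_churn_value": [1, 1, 0],
    "st_churn_reason": ["Competitor made better offer", "Extra data charges", None],
})

# Encode the Yes/No label as 1/0
y = toy[TARGET].map({"Yes": 1, "No": 0})

# errors="ignore" lets the same list be reused on frames lacking some columns
X = toy.drop(columns=[TARGET, *EXCLUDED_STATUS_COLS], errors="ignore")

print(list(X.columns))
```

Keeping the exclusion list in one named constant makes it easier to audit later which status columns were withheld from the model.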


## Missing values snapshot

In [13]:
missing_summary = (
    df.isna()
    .sum()
    .reset_index()
    .rename(columns={"index": "column", 0: "n_missing"})
    .query("n_missing > 0")
    .sort_values("n_missing", ascending=False)
)

print(f"\nColumns with missing values: {missing_summary.shape[0]}")
missing_summary


Columns with missing values: 4


Unnamed: 0,column,n_missing
53,st_churn_category,5174
54,st_churn_reason,5174
21,svc_offer,3877
26,svc_internet_type,1526


### Missing Values - Interpretation & Handling

Only four variables have missing values, and in none of them does the missingness appear random.  
All missing values are **structurally induced** by business logic or outcome conditions.

In particular, the churn-related variables (st_churn_category and st_churn_reason) are only populated for customers who have churned. As a result, these variables must **not** be imputed and should be excluded from predictive modeling to avoid **target leakage**.
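
The claim that these two fields are empty exactly for non-churners can be checked directly. The sketch below uses toy data with hypothetical values; on the real dataset, `toy` would be replaced by `df`.

```python
import pandas as pd

# Toy data (hypothetical values) mirroring the three relevant columns
toy = pd.DataFrame({
    "st_churn_label": ["Yes", "No", "Yes", "No"],
    "st_churn_category": ["Competitor", None, "Price", None],
    "st_churn_reason": [
        "Competitor made better offer", None, "Extra data charges", None,
    ],
})

is_churned = toy["st_churn_label"].eq("Yes")

structural = {}
for col in ["st_churn_category", "st_churn_reason"]:
    # Structural missingness: the value is missing if and only if
    # the customer did not churn.
    structural[col] = bool(toy[col].isna().eq(~is_churned).all())

print(structural)
```

If any entry came back `False` on the real data, the missingness would not be purely structural and would deserve a closer look.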

For the service-related variables, missing values reflect the **absence of the corresponding service or offer**, rather than data collection errors. Missing entries will therefore be replaced with explicit category labels rather than statistical estimates:

- svc_internet_type → filled with "No Internet Service"
- svc_offer → filled with "No Offer" or "Unknown", depending on business interpretation
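
A minimal sketch of this label-based filling, using pandas `fillna` with a per-column mapping. The toy values are hypothetical, and "No Offer" is chosen here only for illustration; "Unknown" would work the same way.

```python
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "svc_internet_service": ["Yes", "No", "Yes"],
    "svc_internet_type": ["Fiber Optic", None, "DSL"],
    "svc_offer": [None, "Offer E", None],
})

# One explicit label per column instead of a statistical estimate
fill_values = {
    "svc_internet_type": "No Internet Service",
    "svc_offer": "No Offer",  # assumption: could also be "Unknown"
}

cleaned = toy.fillna(value=fill_values)
print(cleaned)
```

Passing a dictionary restricts the fill to the named columns, so missing values elsewhere in the frame are left untouched.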

# Phase 2 - Redundancy Analysis

Here we **justify** dropping some columns by showing that they are either:

- Constant (or almost constant), or
- Directly derivable from other fields (redundant information)
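
Both criteria can be tested programmatically. The sketch below is one possible approach on toy data: the helper `near_constant_columns`, its 0.99 threshold, and the reading of `demo_under_30` as "age below 30" are all assumptions for illustration and should be confirmed against the real data.

```python
import pandas as pd


def near_constant_columns(frame: pd.DataFrame, threshold: float = 0.99) -> dict:
    """Return columns whose most frequent value covers >= threshold of rows."""
    flagged = {}
    for col in frame.columns:
        shares = frame[col].value_counts(normalize=True, dropna=False)
        top_share = float(shares.iloc[0])  # value_counts sorts descending
        if top_share >= threshold:
            flagged[col] = top_share
    return flagged


# Toy data (hypothetical values)
toy = pd.DataFrame({
    "loc_count": [1, 1, 1, 1],
    "loc_country": ["United States"] * 4,
    "demo_age": [78, 25, 71, 29],
    "demo_under_30": ["No", "Yes", "No", "Yes"],
})

constant = near_constant_columns(toy)
print(constant)

# Derivability check (assumes under_30 means age < 30)
derived = toy["demo_age"].lt(30).map({True: "Yes", False: "No"})
under_30_is_redundant = bool((derived == toy["demo_under_30"]).all())
print(under_30_is_redundant)
```

A column that is reproduced exactly from another field adds no information, which is the justification for dropping it before EDA.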

