# =====================================
# 01. Data Cleaning & Merging
# =====================================

Notebook 01 Goals:
Clean and merge the 5 raw Telco datasets into a single customer-level table (df_master) with consistent column names and a standard key (customer_id), ready for EDA and modeling.

This includes: (delete once items are complete)
    - Loading all raw datasets
    - Normalizing identifiers (like CustomerID)
    - Merging the tables correctly
    - Handling missing values and data types
    - Saving the cleaned dataset to data/interim/

### Dataset Summary
| 
| **Demographics**  | 7,043 × 6 | Key: Customer ID  | Demographic attributes (gender, age, marital status). 
| **Location**      | 7,043 × 8 | Key: Customer ID  | Geographic features including country, city, and ZIP code. 
| **Services**      | 7,043 × 11| Key: Customer ID  | Subscribed telecom and streaming services. 
| **Status**        | 7,043 × 9 | Key: Customer ID  | Account details, tenure, churn label, and churn reason. 
| **Population**    | 1,671 × 3 | Key: Zip Code     | ZIP-level population counts; joined later via `Zip Code` from the location table. 

**Merge logic:**
Customer-level tables (`demographics`, `location`, `services`, `status`) are merged one-to-one on `Customer ID`.  
`Population` is an auxiliary dataset used for enrichment through `Zip Code`.

# --- 1. Imports ---

In [1]:
import sys
from pathlib import Path
import pandas as pd

# --- 2. Path Configuration ---

In [2]:
ROOT = Path.cwd().resolve().parent
sys.path.append(str(ROOT / "src"))

from config import RAW_DIR
from utils_data import save_df, quick_overview

# --- 3. Load raw tables ---

In [None]:
FILES = {
    "demographics": "Telco_customer_churn_demographics.xlsx",
    "location":     "Telco_customer_churn_location.xlsx",
    "population":   "Telco_customer_churn_population.xlsx",
    "services":     "Telco_customer_churn_services.xlsx",
    "status":       "Telco_customer_churn_status.xlsx",
}

dfs = {name: pd.read_excel(RAW_DIR / file) for name, file in FILES.items()}

for name, df in dfs.items():
    print(f"{name:12s}: {df.shape[0]} rows × {df.shape[1]} columns")

demographics = dfs["demographics"]
location     = dfs["location"]
population   = dfs["population"]
services     = dfs["services"]
status       = dfs["status"]


demographics: 7043 rows × 9 columns
location    : 7043 rows × 9 columns
population  : 1671 rows × 3 columns
services    : 7043 rows × 30 columns
status      : 7043 rows × 11 columns


# --- 4. Normalise column names and key ---

In [None]:
KEY = "customer_id"

def normalize(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    out.columns = (
        out.columns
        .str.strip()
        .str.replace(r"\s+", "_", regex=True)
        .str.lower()
    )
    # Map variations of the customer key
    for cand in ("customerid", "customer_id", "customer id"):
        if cand in out.columns and cand != KEY:
            out = out.rename(columns={cand: KEY})
    return out

demographics = normalize(demographics)
location     = normalize(location)
population   = normalize(population)
services     = normalize(services)
status       = normalize(status)

# Assert key presence only for customer-level tables
for name, df in {
    "demographics": demographics,
    "location":     location,
    "services":     services,
    "status":       status,
}.items():
    assert KEY in df.columns, f"{name} does not contain '{KEY}' after normalization."


# --- 5. Basic Cleaning and Consistency Checks ---

In [9]:
# Ensure unique customer IDs
for name, df in {
    "demographics": demographics,
    "location":     location,
    "services":     services,
    "status":       status,
}.items():
    assert df[KEY].is_unique, f"{name}: duplicate {KEY}s found."

# Handle whitespace or formatting issues if any
for name, df in dfs.items():
    if KEY in df.columns:
        df[KEY] = df[KEY].astype(str).str.strip()


# --- 6. Prefix Columns and Merge Customer-Level Tables ---

In [10]:
def add_prefix_except(df: pd.DataFrame, prefix: str, keep=(KEY,)) -> pd.DataFrame:
    return df.rename(columns={c: (prefix + c) if c not in keep else c for c in df.columns})

demo_ = add_prefix_except(demographics, "demo_")
loc_  = add_prefix_except(location,     "loc_")
svc_  = add_prefix_except(services,     "svc_")
st_   = add_prefix_except(status,       "st_")

df_master = (
    demo_
    .merge(loc_, on=KEY, how="inner", validate="one_to_one")
    .merge(svc_, on=KEY, how="inner", validate="one_to_one")
    .merge(st_,  on=KEY, how="inner", validate="one_to_one")
).set_index(KEY)

print(df_master.shape)
df_master.head()


(7043, 55)


Unnamed: 0_level_0,demo_count,demo_gender,demo_age,demo_under_30,demo_senior_citizen,demo_married,demo_dependents,demo_number_of_dependents,loc_count,loc_country,...,st_count,st_quarter,st_satisfaction_score,st_customer_status,st_churn_label,st_churn_value,st_churn_score,st_cltv,st_churn_category,st_churn_reason
customer_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
8779-QRDMV,1,Male,78,No,Yes,No,No,0,1,United States,...,1,Q3,3,Churned,Yes,1,91,5433,Competitor,Competitor offered more data
7495-OOKFY,1,Female,74,No,Yes,Yes,Yes,1,1,United States,...,1,Q3,3,Churned,Yes,1,69,5302,Competitor,Competitor made better offer
1658-BYGOY,1,Male,71,No,Yes,No,Yes,3,1,United States,...,1,Q3,2,Churned,Yes,1,81,3179,Competitor,Competitor made better offer
4598-XLKNJ,1,Female,78,No,Yes,Yes,Yes,1,1,United States,...,1,Q3,2,Churned,Yes,1,88,5337,Dissatisfaction,Limited range of services
4846-WHAFZ,1,Female,80,No,Yes,Yes,Yes,1,1,United States,...,1,Q3,2,Churned,Yes,1,67,2793,Price,Extra data charges


# --- Enrich with Zip Code ---

In [11]:
# If loc_zip_code exists, map ZIP population
if "loc_zip_code" in df_master.columns:
    df_master = df_master.merge(
        population.rename(columns={
            "zip_code": "loc_zip_code",
            "population": "zip_population"
        }),
        on="loc_zip_code",
        how="left"
    )

# --- Post-Merge Validation and Save ---

In [12]:
assert df_master.index.is_unique, "customer_id duplicated after merge."
assert df_master.index.notna().all(), "customer_id contains missing values."

quick_overview(df_master, "Merged Customer Master")
save_df(df_master, "telco_master_clean", folder="interim")

print("✅ Cleaned master dataset saved to data/interim/")



===== Merged Customer Master =====
Shape: 7043 rows × 57 columns

Data types:
demo_count                                 int64
demo_gender                               object
demo_age                                   int64
demo_under_30                             object
demo_senior_citizen                       object
demo_married                              object
demo_dependents                           object
demo_number_of_dependents                  int64
loc_count                                  int64
loc_country                               object
loc_state                                 object
loc_city                                  object
loc_zip_code                               int64
loc_lat_long                              object
loc_latitude                             float64
loc_longitude                            float64
svc_count                                  int64
svc_quarter                               object
svc_referred_a_friend                  

Unnamed: 0,demo_count,demo_gender,demo_age,demo_under_30,demo_senior_citizen,demo_married,demo_dependents,demo_number_of_dependents,loc_count,loc_country,...,st_satisfaction_score,st_customer_status,st_churn_label,st_churn_value,st_churn_score,st_cltv,st_churn_category,st_churn_reason,id,zip_population
0,1,Male,78,No,Yes,No,No,0,1,United States,...,3,Churned,Yes,1,91,5433,Competitor,Competitor offered more data,21,68701
1,1,Female,74,No,Yes,Yes,Yes,1,1,United States,...,3,Churned,Yes,1,69,5302,Competitor,Competitor made better offer,54,55668
2,1,Male,71,No,Yes,No,Yes,3,1,United States,...,2,Churned,Yes,1,81,3179,Competitor,Competitor made better offer,56,47534
3,1,Female,78,No,Yes,Yes,Yes,1,1,United States,...,2,Churned,Yes,1,88,5337,Dissatisfaction,Limited range of services,100,27778
4,1,Female,80,No,Yes,Yes,Yes,1,1,United States,...,2,Churned,Yes,1,67,2793,Price,Extra data charges,114,26265


✅ Guardado: /Users/dianagomes/Desktop/work/s2/EnterpriseDataScienceBootcamp_workgroup/EnterpriseDataScienceBootcamp_workgroup/data/interim/telco_master_clean.csv
✅ Cleaned master dataset saved to data/interim/
