<h1 style="color:#1f77b4; text-align:left; font-size:40px; margin-bottom:0;">
    00 – Data Loading & Master Table Creation
</h1>

<h3 style="color:#555; text-align:left; margin-top:5px;">
    Load raw Telco datasets, run basic sanity checks, and build a unified master dataset.
</h3>


---

# Imports

In [3]:
import sys
from pathlib import Path

import numpy as np
import pandas as pd

# Make the project's src/ folder importable before loading local modules
project_root = Path().resolve().parent
src_path = project_root / "src"
sys.path.append(str(src_path))

from config import RAW_DIR, INTERIM_DIR
from utils_data import quick_overview, save_df


# Path Configuration

In [16]:
ROOT = Path.cwd().resolve().parent
sys.path.append(str(ROOT / "src"))


# Load Data

In [17]:
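# openpyxl is the engine pandas uses to read .xlsx files;
# %pip installs into the environment of the running kernel
%pip install openpyxl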

files = {
    "demographics": RAW_DIR / "Telco_customer_churn_demographics.xlsx",
    "location": RAW_DIR / "Telco_customer_churn_location.xlsx",
    "population": RAW_DIR / "Telco_customer_churn_population.xlsx",
    "services": RAW_DIR / "Telco_customer_churn_services.xlsx",
    "status": RAW_DIR / "Telco_customer_churn_status.xlsx",
}

dfs = {}
for name, path in files.items():
    dfs[name] = pd.read_excel(path)
    print(f"Loaded {name:12s}: {dfs[name].shape[0]} rows × {dfs[name].shape[1]} cols")


demographics = dfs["demographics"]
location     = dfs["location"]
population   = dfs["population"]
services     = dfs["services"]
status       = dfs["status"]





Loaded demographics: 7043 rows × 9 cols
Loaded location    : 7043 rows × 9 cols
Loaded population  : 1671 rows × 3 cols
Loaded services    : 7043 rows × 30 cols
Loaded status      : 7043 rows × 11 cols


# Quick Overview

In [18]:
for name, df in dfs.items():
    print("=" * 80)
    print(f"QUICK OVERVIEW – {name.upper()}")
    print("=" * 80)
    quick_overview(df, name=name)


QUICK OVERVIEW – DEMOGRAPHICS

===== demographics =====
Shape: 7043 rows × 9 columns

Data types:
Customer ID             object
Count                    int64
Gender                  object
Age                      int64
Under 30                object
Senior Citizen          object
Married                 object
Dependents              object
Number of Dependents     int64
dtype: object

Missing values per column:
Customer ID             0
Count                   0
Gender                  0
Age                     0
Under 30                0
Senior Citizen          0
Married                 0
Dependents              0
Number of Dependents    0
dtype: int64

First 5 rows:


Unnamed: 0,Customer ID,Count,Gender,Age,Under 30,Senior Citizen,Married,Dependents,Number of Dependents
0,8779-QRDMV,1,Male,78,No,Yes,No,No,0
1,7495-OOKFY,1,Female,74,No,Yes,Yes,Yes,1
2,1658-BYGOY,1,Male,71,No,Yes,No,Yes,3
3,4598-XLKNJ,1,Female,78,No,Yes,Yes,Yes,1
4,4846-WHAFZ,1,Female,80,No,Yes,Yes,Yes,1


QUICK OVERVIEW – LOCATION

===== location =====
Shape: 7043 rows × 9 columns

Data types:
Customer ID     object
Count            int64
Country         object
State           object
City            object
Zip Code         int64
Lat Long        object
Latitude       float64
Longitude      float64
dtype: object

Missing values per column:
Customer ID    0
Count          0
Country        0
State          0
City           0
Zip Code       0
Lat Long       0
Latitude       0
Longitude      0
dtype: int64

First 5 rows:


Unnamed: 0,Customer ID,Count,Country,State,City,Zip Code,Lat Long,Latitude,Longitude
0,8779-QRDMV,1,United States,California,Los Angeles,90022,"34.02381, -118.156582",34.02381,-118.156582
1,7495-OOKFY,1,United States,California,Los Angeles,90063,"34.044271, -118.185237",34.044271,-118.185237
2,1658-BYGOY,1,United States,California,Los Angeles,90065,"34.108833, -118.229715",34.108833,-118.229715
3,4598-XLKNJ,1,United States,California,Inglewood,90303,"33.936291, -118.332639",33.936291,-118.332639
4,4846-WHAFZ,1,United States,California,Whittier,90602,"33.972119, -118.020188",33.972119,-118.020188


QUICK OVERVIEW – POPULATION

===== population =====
Shape: 1671 rows × 3 columns

Data types:
ID            int64
Zip Code      int64
Population    int64
dtype: object

Missing values per column:
ID            0
Zip Code      0
Population    0
dtype: int64

First 5 rows:


Unnamed: 0,ID,Zip Code,Population
0,1,90001,54492
1,2,90002,44586
2,3,90003,58198
3,4,90004,67852
4,5,90005,43019


QUICK OVERVIEW – SERVICES

===== services =====
Shape: 7043 rows × 30 columns

Data types:
Customer ID                           object
Count                                  int64
Quarter                               object
Referred a Friend                     object
Number of Referrals                    int64
Tenure in Months                       int64
Offer                                 object
Phone Service                         object
Avg Monthly Long Distance Charges    float64
Multiple Lines                        object
Internet Service                      object
Internet Type                         object
Avg Monthly GB Download                int64
Online Security                       object
Online Backup                         object
Device Protection Plan                object
Premium Tech Support                  object
Streaming TV                          object
Streaming Movies                      object
Streaming Music                       object
Unlimited

Unnamed: 0,Customer ID,Count,Quarter,Referred a Friend,Number of Referrals,Tenure in Months,Offer,Phone Service,Avg Monthly Long Distance Charges,Multiple Lines,...,Unlimited Data,Contract,Paperless Billing,Payment Method,Monthly Charge,Total Charges,Total Refunds,Total Extra Data Charges,Total Long Distance Charges,Total Revenue
0,8779-QRDMV,1,Q3,No,0,1,,No,0.0,No,...,No,Month-to-Month,Yes,Bank Withdrawal,39.65,39.65,0.0,20,0.0,59.65
1,7495-OOKFY,1,Q3,Yes,1,8,Offer E,Yes,48.85,Yes,...,Yes,Month-to-Month,Yes,Credit Card,80.65,633.3,0.0,0,390.8,1024.1
2,1658-BYGOY,1,Q3,No,0,18,Offer D,Yes,11.33,Yes,...,Yes,Month-to-Month,Yes,Bank Withdrawal,95.45,1752.55,45.61,0,203.94,1910.88
3,4598-XLKNJ,1,Q3,Yes,1,25,Offer C,Yes,19.76,No,...,Yes,Month-to-Month,Yes,Bank Withdrawal,98.5,2514.5,13.43,0,494.0,2995.07
4,4846-WHAFZ,1,Q3,Yes,1,37,Offer C,Yes,6.33,Yes,...,Yes,Month-to-Month,Yes,Bank Withdrawal,76.5,2868.15,0.0,0,234.21,3102.36


QUICK OVERVIEW – STATUS

===== status =====
Shape: 7043 rows × 11 columns

Data types:
Customer ID           object
Count                  int64
Quarter               object
Satisfaction Score     int64
Customer Status       object
Churn Label           object
Churn Value            int64
Churn Score            int64
CLTV                   int64
Churn Category        object
Churn Reason          object
dtype: object

Missing values per column:
Churn Category        5174
Churn Reason          5174
Customer ID              0
Count                    0
Quarter                  0
Satisfaction Score       0
Customer Status          0
Churn Label              0
Churn Value              0
Churn Score              0
CLTV                     0
dtype: int64

First 5 rows:


Unnamed: 0,Customer ID,Count,Quarter,Satisfaction Score,Customer Status,Churn Label,Churn Value,Churn Score,CLTV,Churn Category,Churn Reason
0,8779-QRDMV,1,Q3,3,Churned,Yes,1,91,5433,Competitor,Competitor offered more data
1,7495-OOKFY,1,Q3,3,Churned,Yes,1,69,5302,Competitor,Competitor made better offer
2,1658-BYGOY,1,Q3,2,Churned,Yes,1,81,3179,Competitor,Competitor made better offer
3,4598-XLKNJ,1,Q3,2,Churned,Yes,1,88,5337,Dissatisfaction,Limited range of services
4,4846-WHAFZ,1,Q3,2,Churned,Yes,1,67,2793,Price,Extra data charges


# Global Summary by Table

In [19]:
global_summary = []

for name, df in dfs.items():
    n_rows, n_cols = df.shape
    total_missing = df.isna().sum().sum()
    pct_missing = (
        total_missing / (n_rows * n_cols) * 100 if n_rows * n_cols > 0 else 0
    )

    num_numeric = df.select_dtypes(include=[np.number]).shape[1]
    num_categorical = df.select_dtypes(exclude=[np.number]).shape[1]

    global_summary.append(
        {
            "table": name,
            "n_rows": n_rows,
            "n_cols": n_cols,
            "total_missing": total_missing,
            "pct_missing": round(pct_missing, 2),
            "num_numeric_cols": num_numeric,
            "num_categorical_cols": num_categorical,
        }
    )

global_summary_df = pd.DataFrame(global_summary)
global_summary_df.sort_values("table")


Unnamed: 0,table,n_rows,n_cols,total_missing,pct_missing,num_numeric_cols,num_categorical_cols
0,demographics,7043,9,0,0.0,3,6
1,location,7043,9,0,0.0,4,5
2,population,1671,3,0,0.0,3,0
3,services,7043,30,5403,2.56,11,19
4,status,7043,11,10348,13.36,5,6


# Normalise key and consistency checks

In [20]:
KEY = "customer_id"

def normalize(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with snake_case column names and a unified customer key."""
    out = df.copy()
    out.columns = (
        out.columns
        .str.strip()
        .str.replace(r"\s+", "_", regex=True)
        .str.lower()
    )
    # Map variations of the customer key. Names are already lower-case with
    # underscores at this point, so "Customer ID" arrives as "customer_id".
    for cand in ("customerid", "customer_id"):
        if cand in out.columns and cand != KEY:
            out = out.rename(columns={cand: KEY})
    return out

demographics = normalize(demographics)
location     = normalize(location)
population   = normalize(population)
services     = normalize(services)
status       = normalize(status)

# Assert key presence only for customer-level tables
for name, df in {
    "demographics": demographics,
    "location":     location,
    "services":     services,
    "status":       status,
}.items():
    assert KEY in df.columns, f"{name} does not contain '{KEY}' after normalization."


# Merge tables

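Four customer-level datasets (demographics, location, services, and status) each hold one row per customer (7043 rows), so they can be merged one-to-one on `customer_id`, the normalised form of Customer ID. The population dataset is keyed by ZIP code rather than by customer, so it is treated as auxiliary and joined afterwards on Zip Code to enrich each customer row.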
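Before relying on a one-to-one merge, it can help to confirm that each table has a unique key and that all tables cover the same customers. The sketch below is one possible way to do this; `check_one_to_one` is a hypothetical helper (not part of `utils_data`), and the toy frames only stand in for the real tables.

```python
import pandas as pd


def check_one_to_one(tables: dict, key: str) -> pd.DataFrame:
    """Report, per table, whether the key is unique and matches the first table's key set."""
    # Use the first table's keys as the reference set
    reference = set(next(iter(tables.values()))[key])
    rows = []
    for name, df in tables.items():
        keys = df[key]
        rows.append(
            {
                "table": name,
                "n_rows": len(df),
                "key_is_unique": bool(keys.is_unique),
                "same_keys_as_first": set(keys) == reference,
            }
        )
    return pd.DataFrame(rows)


# Toy frames with made-up values, standing in for the normalised tables
toy = {
    "demographics": pd.DataFrame({"customer_id": ["a", "b", "c"], "age": [30, 41, 58]}),
    "status": pd.DataFrame({"customer_id": ["c", "a", "b"], "churn_value": [1, 0, 0]}),
}
report = check_one_to_one(toy, "customer_id")
print(report)
```

With the real data, the same call could be made on a dict of `demographics`, `location`, `services`, and `status`. The `validate="one_to_one"` argument used in the merge below enforces key uniqueness as well, but a report like this shows which table is at fault before the merge raises.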

## Prefix Columns and Merge Customer-Level Tables 

In [21]:
def add_prefix_except(df: pd.DataFrame, prefix: str, keep=(KEY,)) -> pd.DataFrame:
    return df.rename(columns={c: (prefix + c) if c not in keep else c for c in df.columns})

demo_ = add_prefix_except(demographics, "demo_")
loc_  = add_prefix_except(location,     "loc_")
svc_  = add_prefix_except(services,     "svc_")
st_   = add_prefix_except(status,       "st_")

df_master = (
    demo_
    .merge(loc_, on=KEY, how="inner", validate="one_to_one")
    .merge(svc_, on=KEY, how="inner", validate="one_to_one")
    .merge(st_,  on=KEY, how="inner", validate="one_to_one")
).set_index(KEY)

print(df_master.shape)
df_master.head()


(7043, 55)


customer_id,demo_count,demo_gender,demo_age,demo_under_30,demo_senior_citizen,demo_married,demo_dependents,demo_number_of_dependents,loc_count,loc_country,...,st_count,st_quarter,st_satisfaction_score,st_customer_status,st_churn_label,st_churn_value,st_churn_score,st_cltv,st_churn_category,st_churn_reason
8779-QRDMV,1,Male,78,No,Yes,No,No,0,1,United States,...,1,Q3,3,Churned,Yes,1,91,5433,Competitor,Competitor offered more data
7495-OOKFY,1,Female,74,No,Yes,Yes,Yes,1,1,United States,...,1,Q3,3,Churned,Yes,1,69,5302,Competitor,Competitor made better offer
1658-BYGOY,1,Male,71,No,Yes,No,Yes,3,1,United States,...,1,Q3,2,Churned,Yes,1,81,3179,Competitor,Competitor made better offer
4598-XLKNJ,1,Female,78,No,Yes,Yes,Yes,1,1,United States,...,1,Q3,2,Churned,Yes,1,88,5337,Dissatisfaction,Limited range of services
4846-WHAFZ,1,Female,80,No,Yes,Yes,Yes,1,1,United States,...,1,Q3,2,Churned,Yes,1,67,2793,Price,Extra data charges


## Enrich with ZIP Code

In [22]:
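if "loc_zip_code" in df_master.columns:
    # Keep only the join key and the population column, renamed to match the master table
    zip_population = population.rename(columns={
        "zip_code": "loc_zip_code",
        "population": "zip_population"
    })[["loc_zip_code", "zip_population"]]

    # Reset index temporarily for merge
    df_master = (
        df_master.reset_index()
        .merge(
            zip_population,
            on="loc_zip_code",
            how="left",               # keep every customer, even without a ZIP match
            validate="many_to_one",   # fail if a ZIP code appears twice in population
        )
        .set_index(KEY)  # restore index
    )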


print(df_master.shape)
df_master.head()

(7043, 56)


customer_id,demo_count,demo_gender,demo_age,demo_under_30,demo_senior_citizen,demo_married,demo_dependents,demo_number_of_dependents,loc_count,loc_country,...,st_quarter,st_satisfaction_score,st_customer_status,st_churn_label,st_churn_value,st_churn_score,st_cltv,st_churn_category,st_churn_reason,zip_population
8779-QRDMV,1,Male,78,No,Yes,No,No,0,1,United States,...,Q3,3,Churned,Yes,1,91,5433,Competitor,Competitor offered more data,68701
7495-OOKFY,1,Female,74,No,Yes,Yes,Yes,1,1,United States,...,Q3,3,Churned,Yes,1,69,5302,Competitor,Competitor made better offer,55668
1658-BYGOY,1,Male,71,No,Yes,No,Yes,3,1,United States,...,Q3,2,Churned,Yes,1,81,3179,Competitor,Competitor made better offer,47534
4598-XLKNJ,1,Female,78,No,Yes,Yes,Yes,1,1,United States,...,Q3,2,Churned,Yes,1,88,5337,Dissatisfaction,Limited range of services,27778
4846-WHAFZ,1,Female,80,No,Yes,Yes,Yes,1,1,United States,...,Q3,2,Churned,Yes,1,67,2793,Price,Extra data charges,26265


# Final Master Table Check

In [23]:
print("Final master dataset shape:", df_master.shape)
df_master.info()

Final master dataset shape: (7043, 56)
<class 'pandas.core.frame.DataFrame'>
Index: 7043 entries, 8779-QRDMV to 3186-AJIEK
Data columns (total 56 columns):
 #   Column                                 Non-Null Count  Dtype  
---  ------                                 --------------  -----  
 0   demo_count                             7043 non-null   int64  
 1   demo_gender                            7043 non-null   object 
 2   demo_age                               7043 non-null   int64  
 3   demo_under_30                          7043 non-null   object 
 4   demo_senior_citizen                    7043 non-null   object 
 5   demo_married                           7043 non-null   object 
 6   demo_dependents                        7043 non-null   object 
 7   demo_number_of_dependents              7043 non-null   int64  
 8   loc_count                              7043 non-null   int64  
 9   loc_country                            7043 non-null   object 
 10  loc_state              

# Save Master Dataset to data/interim

In [29]:
MASTER_NAME = "telco_master"

save_df(df_master, MASTER_NAME, folder="interim")


✅ DataFrame saved to: /Users/dianagomes/Desktop/work/s2/EnterpriseDataScienceBootcamp_workgroup/EnterpriseDataScienceBootcamp_workgroup-4/data/interim/telco_master.csv


PosixPath('/Users/dianagomes/Desktop/work/s2/EnterpriseDataScienceBootcamp_workgroup/EnterpriseDataScienceBootcamp_workgroup-4/data/interim/telco_master.csv')

# Summary

- All 5 raw Telco datasets were successfully loaded from `data/raw`.  
- Basic sanity checks and a global summary by table were performed.  
- Customer-level tables (demographics, location, services, status) were merged one-to-one on `customer_id`, giving 7043 rows × 55 columns.  
- Population data was added with a left join on ZIP code as the `zip_population` column, giving 56 columns.  
- The final master dataset **`telco_master`** was created and saved to the **interim** data folder.
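
Because the ZIP-code enrichment is a left join, customers whose ZIP code is absent from the population table would end up with a missing `zip_population`. A minimal sketch of how that coverage could be measured with the `indicator` option of `DataFrame.merge` follows. The frames are toy data: the two population figures come from the population preview above, while ZIP code 99999 and the customer IDs are made up for illustration.

```python
import pandas as pd

# Toy customer table; 99999 is a made-up ZIP code with no population entry
customers = pd.DataFrame({
    "customer_id": ["a", "b", "c"],
    "loc_zip_code": [90001, 90002, 99999],
})
population = pd.DataFrame({
    "loc_zip_code": [90001, 90002],
    "zip_population": [54492, 44586],
})

enriched = customers.merge(
    population,
    on="loc_zip_code",
    how="left",
    validate="many_to_one",  # each ZIP code may appear only once in population
    indicator=True,          # adds a _merge column: "both" or "left_only"
)

# Rows marked "left_only" had no matching ZIP code in the population table
unmatched = int((enriched["_merge"] == "left_only").sum())
coverage = 1 - unmatched / len(enriched)
print(f"{unmatched} unmatched rows, coverage {coverage:.1%}")

# Drop the helper column once the check is done
enriched = enriched.drop(columns="_merge")
```

Applied to `df_master`, the same pattern would report how many customers lack a population figure. Checking `df_master["zip_population"].isna().sum()` after the merge gives the same count more directly.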