<h1 style="color:#1f77b4; text-align:left; font-size:40px;">
    Data Understanding
</h1>

<h3 style="color:#555; text-align:left;">
    <strong>Purpose:</strong><br><br>
    - Validate the integrity of the unified customer dataset, remove unusable or high-leakage variables, and prepare a clean analytical base for exploratory analysis.

</h3>

<h2 style="color:#1f77b4; border-bottom: 3px solid #1f77b4; padding-bottom:4px;">
</h2>


# Imports

In [1]:
import sys
from pathlib import Path

project_root = Path().resolve().parent
src_path = project_root / "src"
sys.path.append(str(src_path))

from utils_data import load_df, quick_overview

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


# Load Interim Data

In [2]:
MASTER_NAME = "telco_master"

df = load_df(MASTER_NAME, folder="interim")
print(f"\nLoaded '{MASTER_NAME}' from interim with shape: {df.shape}")


📂 Loaded: C:\Users\sarac\rep_EDSB\EnterpriseDataScienceBootcamp_workgroup\data\interim\telco_master.csv

Loaded 'telco_master' from interim with shape: (7043, 56)


# Quick Overview

In [3]:
quick_overview(df, name="telco_master", show_head=True, n_head=5)


===== telco_master =====
Shape: 7043 rows × 56 columns

Data types:
demo_count                                 int64
demo_gender                               object
demo_age                                   int64
demo_under_30                             object
demo_senior_citizen                       object
demo_married                              object
demo_dependents                           object
demo_number_of_dependents                  int64
loc_count                                  int64
loc_country                               object
loc_state                                 object
loc_city                                  object
loc_zip_code                               int64
loc_lat_long                              object
loc_latitude                             float64
loc_longitude                            float64
svc_count                                  int64
svc_quarter                               object
svc_referred_a_friend                     object

Unnamed: 0,demo_count,demo_gender,demo_age,demo_under_30,demo_senior_citizen,demo_married,demo_dependents,demo_number_of_dependents,loc_count,loc_country,...,st_quarter,st_satisfaction_score,st_customer_status,st_churn_label,st_churn_value,st_churn_score,st_cltv,st_churn_category,st_churn_reason,zip_population
0,1,Male,78,No,Yes,No,No,0,1,United States,...,Q3,3,Churned,Yes,1,91,5433,Competitor,Competitor offered more data,68701
1,1,Female,74,No,Yes,Yes,Yes,1,1,United States,...,Q3,3,Churned,Yes,1,69,5302,Competitor,Competitor made better offer,55668
2,1,Male,71,No,Yes,No,Yes,3,1,United States,...,Q3,2,Churned,Yes,1,81,3179,Competitor,Competitor made better offer,47534
3,1,Female,78,No,Yes,Yes,Yes,1,1,United States,...,Q3,2,Churned,Yes,1,88,5337,Dissatisfaction,Limited range of services,27778
4,1,Female,80,No,Yes,Yes,Yes,1,1,United States,...,Q3,2,Churned,Yes,1,67,2793,Price,Extra data charges,26265


# Phase 1 – Structural & Integrity Checks

Verify the integrity of the merged dataset before any transformations.

## Inspect variables

In [4]:
dtype_summary = (
    df.dtypes
    .reset_index()
    .rename(columns={'index': 'column', 0: 'dtype'})
    .sort_values('dtype')
)
display(dtype_summary)

Unnamed: 0,column,dtype
0,demo_count,int64
52,st_cltv,int64
51,st_churn_score,int64
50,st_churn_value,int64
47,st_satisfaction_score,int64
45,st_count,int64
42,svc_total_extra_data_charges,int64
20,svc_tenure_in_months,int64
19,svc_number_of_referrals,int64
16,svc_count,int64


### Grouping the variables for analysis

In [5]:
demographic_vars = [c for c in df.columns if c.startswith("demo_")]
location_vars    = [c for c in df.columns if c.startswith("loc_")]
service_vars     = [c for c in df.columns if c.startswith("svc_")]
status_vars      = [c for c in df.columns if c.startswith("st_")]

In [6]:
for group, cols in {
    "Demographics": demographic_vars,
    "Location": location_vars,
    "Services": service_vars,
    "Status": status_vars
}.items():
    print(f"\n{group} ({len(cols)} vars)")
    print(cols)


Demographics (8 vars)
['demo_count', 'demo_gender', 'demo_age', 'demo_under_30', 'demo_senior_citizen', 'demo_married', 'demo_dependents', 'demo_number_of_dependents']

Location (8 vars)
['loc_count', 'loc_country', 'loc_state', 'loc_city', 'loc_zip_code', 'loc_lat_long', 'loc_latitude', 'loc_longitude']

Services (29 vars)
['svc_count', 'svc_quarter', 'svc_referred_a_friend', 'svc_number_of_referrals', 'svc_tenure_in_months', 'svc_offer', 'svc_phone_service', 'svc_avg_monthly_long_distance_charges', 'svc_multiple_lines', 'svc_internet_service', 'svc_internet_type', 'svc_avg_monthly_gb_download', 'svc_online_security', 'svc_online_backup', 'svc_device_protection_plan', 'svc_premium_tech_support', 'svc_streaming_tv', 'svc_streaming_movies', 'svc_streaming_music', 'svc_unlimited_data', 'svc_contract', 'svc_paperless_billing', 'svc_payment_method', 'svc_monthly_charge', 'svc_total_charges', 'svc_total_refunds', 'svc_total_extra_data_charges', 'svc_total_long_distance_charges', 'svc_total
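
Prefix-based grouping silently skips any column that carries none of the four prefixes; `zip_population`, visible in the overview above, is one such column. A minimal sketch of a coverage check follows. The helper name `ungrouped_columns` and the toy column list are illustrative assumptions, not part of the project code.

```python
# Hypothetical helper (not part of utils_data): find columns outside every prefix group.
PREFIXES = ("demo_", "loc_", "svc_", "st_")


def ungrouped_columns(columns, prefixes=PREFIXES):
    """Return the column names that start with none of the given prefixes."""
    # str.startswith accepts a tuple and returns True if any prefix matches.
    return [c for c in columns if not c.startswith(prefixes)]


# Toy column list mirroring a few names from telco_master.
sample_columns = ["demo_age", "loc_city", "svc_offer", "st_cltv", "zip_population"]
leftover = ungrouped_columns(sample_columns)
print(leftover)
```

On the real data the call would be `ungrouped_columns(df.columns)`; anything it returns needs an explicit keep-or-drop decision.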

---
**Demographics variables**

These variables describe life stage, family responsibilities, and customer profile, which may influence service needs, price sensitivity, and contract stability.

**Individual & life-stage characteristics**
  - ✅ demo_gender
  - ✅ demo_age
  - ❌ demo_under_30, demo_senior_citizen

    These variables are direct transformations of demo_age. To preserve the most informative representation of life stage, only demo_age is retained.


**Household structure**
  - ✅ demo_married, demo_number_of_dependents
  - ❌ demo_dependents

    This variable describes the same underlying concept as demo_number_of_dependents.
    Since the numeric feature carries richer information, it is the only one kept.

**Metadata**
  - ❌ demo_count: record/count indicator (non-behavioral)
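
The redundancy claims above can be checked rather than assumed. The sketch below runs on toy data. The age thresholds (under 30, 65 and over) are assumptions for illustration and should be confirmed against the real columns before relying on them.

```python
import pandas as pd

# Toy rows; the thresholds used below are ASSUMPTIONS, not taken from dataset documentation.
toy = pd.DataFrame({
    "demo_age": [19, 29, 30, 64, 65, 80],
    "demo_under_30": ["Yes", "Yes", "No", "No", "No", "No"],
    "demo_senior_citizen": ["No", "No", "No", "No", "Yes", "Yes"],
    "demo_dependents": ["No", "Yes", "No", "Yes", "No", "Yes"],
    "demo_number_of_dependents": [0, 1, 0, 3, 0, 2],
})


def as_yes_no(mask):
    """Convert a boolean Series to the Yes/No labels used in the dataset."""
    return mask.map({True: "Yes", False: "No"})


# Rebuild each flag from the numeric column that is being retained.
rebuilt_flags = {
    "demo_under_30": as_yes_no(toy["demo_age"] < 30),
    "demo_senior_citizen": as_yes_no(toy["demo_age"] >= 65),
    "demo_dependents": as_yes_no(toy["demo_number_of_dependents"] > 0),
}

# A flag is redundant if the rebuilt version matches the stored one on every row.
redundant = {col: bool((rebuilt == toy[col]).all()) for col, rebuilt in rebuilt_flags.items()}
print(redundant)
```

If any flag fails this check on the real data, it carries information beyond the numeric column and the decision to drop it should be revisited.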



In [7]:
# Checking for variability
df[[
    "demo_gender",
    "demo_age",
    "demo_married",
    "demo_number_of_dependents",
    "demo_count"
]].nunique()


demo_gender                   2
demo_age                     62
demo_married                  2
demo_number_of_dependents    10
demo_count                    1
dtype: int64

In [8]:
# Inspecting gender
df["demo_gender"].value_counts()


demo_gender
Male      3555
Female    3488
Name: count, dtype: int64

In [9]:
# Inspecting age
df["demo_age"].describe()

count    7043.000000
mean       46.509726
std        16.750352
min        19.000000
25%        32.000000
50%        46.000000
75%        60.000000
max        80.000000
Name: demo_age, dtype: float64

In [10]:
# Inspecting married
df["demo_married"].value_counts()

demo_married
No     3641
Yes    3402
Name: count, dtype: int64

In [11]:
# Inspecting number_of_dependents
df["demo_number_of_dependents"].describe()

count    7043.000000
mean        0.468692
std         0.962802
min         0.000000
25%         0.000000
50%         0.000000
75%         0.000000
max         9.000000
Name: demo_number_of_dependents, dtype: float64

In [12]:
# Work on a copy so the original df stays available for later checks
df_clean = df.copy()

df_clean = df_clean.drop(
    columns=['demo_under_30', 'demo_senior_citizen', 'demo_dependents']
)

print("✅ Dropped redundant demographic features.")
print("✅ New shape:", df_clean.shape)

✅ Dropped redundant demographic features.
✅ New shape: (7043, 53)


---
**Location variables**

These variables capture the regional context of each customer, enabling geographic segmentation.

**Geographic identifiers**
⚠️ Hierarchically redundant
- ❌ loc_country, loc_state
- ✅ loc_city, loc_zip_code


**Geographic coordinates**
- ✅ loc_latitude, loc_longitude
- ❌ loc_lat_long

    This variable is a concatenation of latitude and longitude, so keeping it would only duplicate information already held in the two numeric columns.

**Technical metadata**
- ❌ loc_count: location-related count indicator
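
Both location claims can be tested directly: parent levels are redundant if they never vary, and loc_lat_long is redundant if it parses back to the two coordinate columns. This is a sketch on toy data. The coordinate values, the state placeholder, and the "lat, long" string format are assumptions.

```python
import numpy as np
import pandas as pd

# Toy rows; values and the "lat, long" string format are ASSUMED for illustration.
toy = pd.DataFrame({
    "loc_country": ["United States"] * 3,
    "loc_state": ["Example State"] * 3,  # placeholder value
    "loc_lat_long": ["34.02, -118.15", "34.04, -118.18", "34.10, -118.22"],
    "loc_latitude": [34.02, 34.04, 34.10],
    "loc_longitude": [-118.15, -118.18, -118.22],
})

# Parent levels of the hierarchy add nothing if they hold a single value.
constant_parents = toy[["loc_country", "loc_state"]].nunique().eq(1).to_dict()

# Parse the combined string back into two floats and compare with the numeric columns.
pattern = r"^\s*(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)\s*$"
parsed = toy["loc_lat_long"].str.extract(pattern).astype(float)
lat_long_redundant = bool(
    np.allclose(parsed[0], toy["loc_latitude"])
    and np.allclose(parsed[1], toy["loc_longitude"])
)

print(constant_parents, lat_long_redundant)
```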

In [13]:
# Global check
df[[
    "loc_city",
    "loc_zip_code",
    "loc_latitude",
    "zip_population",
    "loc_longitude"
]].nunique()

loc_city          1106
loc_zip_code      1626
loc_latitude      1626
zip_population    1569
loc_longitude     1625
dtype: int64

In [14]:
df_clean = df_clean.drop(columns=['loc_country', 'loc_state', 'loc_lat_long','zip_population', 'loc_count'])

print("✅ Dropped columns: ['loc_country', 'loc_state', 'loc_lat_long','zip_population', 'loc_count']")
print("✅ New shape:", df_clean.shape)

✅ Dropped columns: ['loc_country', 'loc_state', 'loc_lat_long','zip_population', 'loc_count']
✅ New shape: (7043, 48)


---
**Services variables**

Each service variable was reviewed for business interpretability, temporal validity, and expected influence on churn. The variables capture five behavioral dimensions, plus one metadata group:

**Customer relationship & engagement**
  - ✅ svc_tenure_in_months, svc_offer
  - ✅ svc_referred_a_friend, svc_number_of_referrals, svc_contract

**Service subscriptions**
  - ✅ svc_phone_service, svc_multiple_lines
  - ✅ svc_internet_service, svc_internet_type
  - ✅ svc_streaming_tv, svc_streaming_movies, svc_streaming_music
  - ✅ svc_unlimited_data
  - ✅ svc_online_security, svc_online_backup
  - ✅ svc_device_protection_plan, svc_premium_tech_support

**Usage intensity**
  - ✅ svc_avg_monthly_long_distance_charges
  - ✅ svc_avg_monthly_gb_download

**Billing and payment behavior**
  - ✅ svc_paperless_billing, svc_payment_method, svc_monthly_charge

⚠️ **Revenue and financial exposure**
  - High leakage risk: these cumulative totals are likely to encode information from **after** the churn event.

    Example:

    (Low total revenue → customer must have churned early)

    (High total revenue → customer stayed longer)

  - ❌ svc_total_charges, svc_total_refunds,
    svc_total_extra_data_charges, svc_total_long_distance_charges,
    svc_total_revenue

**Time / metadata**
  - ❌ svc_quarter, svc_count

In [15]:
df_clean[service_vars].nunique()


svc_count                                   1
svc_quarter                                 1
svc_referred_a_friend                       2
svc_number_of_referrals                    12
svc_tenure_in_months                       72
svc_offer                                   5
svc_phone_service                           2
svc_avg_monthly_long_distance_charges    3584
svc_multiple_lines                          2
svc_internet_service                        2
svc_internet_type                           3
svc_avg_monthly_gb_download                50
svc_online_security                         2
svc_online_backup                           2
svc_device_protection_plan                  2
svc_premium_tech_support                    2
svc_streaming_tv                            2
svc_streaming_movies                        2
svc_streaming_music                         2
svc_unlimited_data                          2
svc_contract                                3
svc_paperless_billing             

In [16]:
# Inspecting binary vars
binary_vars = [
    "svc_referred_a_friend", "svc_phone_service", "svc_multiple_lines",
    "svc_internet_service", "svc_streaming_tv", "svc_streaming_movies",
    "svc_streaming_music", "svc_unlimited_data", "svc_online_security",
    "svc_online_backup", "svc_device_protection_plan",
    "svc_premium_tech_support", "svc_paperless_billing"
]

for col in binary_vars:
    print(df_clean[col].value_counts())
    print("-----")


svc_referred_a_friend
No     3821
Yes    3222
Name: count, dtype: int64
-----
svc_phone_service
Yes    6361
No      682
Name: count, dtype: int64
-----
svc_multiple_lines
No     4072
Yes    2971
Name: count, dtype: int64
-----
svc_internet_service
Yes    5517
No     1526
Name: count, dtype: int64
-----
svc_streaming_tv
No     4336
Yes    2707
Name: count, dtype: int64
-----
svc_streaming_movies
No     4311
Yes    2732
Name: count, dtype: int64
-----
svc_streaming_music
No     4555
Yes    2488
Name: count, dtype: int64
-----
svc_unlimited_data
Yes    4745
No     2298
Name: count, dtype: int64
-----
svc_online_security
No     5024
Yes    2019
Name: count, dtype: int64
-----
svc_online_backup
No     4614
Yes    2429
Name: count, dtype: int64
-----
svc_device_protection_plan
No     4621
Yes    2422
Name: count, dtype: int64
-----
svc_premium_tech_support
No     4999
Yes    2044
Name: count, dtype: int64
-----
svc_paperless_billing
Yes    4171
No     2872
Name: count, dtype: int64
-----


In [17]:
# Inspecting categorical vars

for col in ["svc_offer", "svc_internet_type", "svc_contract", "svc_payment_method"]:
    print(df_clean[col].value_counts())
    print("-----")


svc_offer
Offer B    824
Offer E    805
Offer D    602
Offer A    520
Offer C    415
Name: count, dtype: int64
-----
svc_internet_type
Fiber Optic    3035
DSL            1652
Cable           830
Name: count, dtype: int64
-----
svc_contract
Month-to-Month    3610
Two Year          1883
One Year          1550
Name: count, dtype: int64
-----
svc_payment_method
Bank Withdrawal    3909
Credit Card        2749
Mailed Check        385
Name: count, dtype: int64
-----


In [18]:
# Inspecting svc_number_of_referrals

df_clean["svc_number_of_referrals"].describe()


count    7043.000000
mean        1.951867
std         3.001199
min         0.000000
25%         0.000000
50%         0.000000
75%         3.000000
max        11.000000
Name: svc_number_of_referrals, dtype: float64

In [19]:
# Inspecting svc_tenure_in_months
df_clean["svc_tenure_in_months"].describe()


count    7043.000000
mean       32.386767
std        24.542061
min         1.000000
25%         9.000000
50%        29.000000
75%        55.000000
max        72.000000
Name: svc_tenure_in_months, dtype: float64

In [20]:
# Inspecting usage intensity
df_clean[[
    "svc_avg_monthly_long_distance_charges",
    "svc_avg_monthly_gb_download"
]].describe()

Unnamed: 0,svc_avg_monthly_long_distance_charges,svc_avg_monthly_gb_download
count,7043.0,7043.0
mean,22.958954,20.515405
std,15.448113,20.41894
min,0.0,0.0
25%,9.21,3.0
50%,22.89,17.0
75%,36.395,27.0
max,49.99,85.0


In [21]:
# Inspecting svc_monthly_charge
df_clean["svc_monthly_charge"].describe()


count    7043.000000
mean       64.761692
std        30.090047
min        18.250000
25%        35.500000
50%        70.350000
75%        89.850000
max       118.750000
Name: svc_monthly_charge, dtype: float64

In [22]:
leakage_candidates = [
    'svc_total_charges',
    'svc_total_revenue',
    'svc_total_extra_data_charges',
    'svc_total_long_distance_charges'
]

df[leakage_candidates + ['svc_tenure_in_months']].corr()

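# Most cumulative financial variables correlate strongly with customer tenure (0.67 to 0.85);
# svc_total_extra_data_charges is the exception (0.08). Because these totals accumulate over the
# customer's lifetime, they act as proxies for how long the customer stayed.
# All cumulative financial features will be excluded to prevent leakage and preserve temporal causality.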

Unnamed: 0,svc_total_charges,svc_total_revenue,svc_total_extra_data_charges,svc_total_long_distance_charges,svc_tenure_in_months
svc_total_charges,1.0,0.972212,0.121859,0.610185,0.826074
svc_total_revenue,0.972212,1.0,0.122496,0.778559,0.853146
svc_total_extra_data_charges,0.121859,0.122496,1.0,0.058871,0.082266
svc_total_long_distance_charges,0.610185,0.778559,0.058871,1.0,0.674149
svc_tenure_in_months,0.826074,0.853146,0.082266,0.674149,1.0


In [23]:
df.groupby('st_churn_value')[
    ['svc_total_charges', 'svc_total_revenue']
].mean()

# Group-wise averages point the same way: non-churned customers show
#   ≈67% higher svc_total_charges and
#   ≈73% higher svc_total_revenue than churned customers,
#   which suggests that these cumulative features largely encode customer lifetime rather than predictive risk.


Unnamed: 0_level_0,svc_total_charges,svc_total_revenue
st_churn_value,Unnamed: 1_level_1,Unnamed: 2_level_1
0,2550.792103,3418.374927
1,1531.796094,1971.353569
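
The relative differences quoted in the comments can be reproduced from the group means in the table above. This is a small worked calculation with the four means copied from that output; the dictionary names are illustrative.

```python
# Group means copied from the groupby output above (st_churn_value 0 = stayed, 1 = churned).
stayed = {"svc_total_charges": 2550.792103, "svc_total_revenue": 3418.374927}
churned = {"svc_total_charges": 1531.796094, "svc_total_revenue": 1971.353569}

# Relative uplift of the non-churned mean over the churned mean, in percent.
uplift_pct = {col: (stayed[col] / churned[col] - 1) * 100 for col in stayed}

for col, pct in uplift_pct.items():
    print(f"{col}: {pct:.1f}% higher for non-churned customers")
```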


In [24]:
df_clean = df_clean.drop(columns=[
    'svc_count',
    'svc_quarter',
    'svc_total_charges',
    'svc_total_refunds',
    'svc_total_revenue',
    'svc_total_extra_data_charges',
    'svc_total_long_distance_charges'
])

print("✅ Dropped columns: ['svc_count', 'svc_quarter', 'svc_total_charges', 'svc_total_refunds', 'svc_total_revenue', 'svc_total_extra_data_charges', 'svc_total_long_distance_charges']")
print("✅ New shape:", df_clean.shape)

✅ Dropped columns: ['svc_count', 'svc_quarter', 'svc_total_charges', 'svc_total_refunds', 'svc_total_revenue', 'svc_total_extra_data_charges', 'svc_total_long_distance_charges']
✅ New shape: (7043, 41)


---
**Target & Churn Outcome Variables**

Apart from the target itself, these variables describe the churn outcome or are only known after it. Using them in training would cause data leakage and unrealistically high model performance.
- ✅ **st_churn_label** *(object)*  **kept as the prediction target**
  → Target (Yes/No churn)

- ❌ **st_count** *(int64)*  
  → No variation (1)

- ❌ **st_quarter** *(object)*  
  → Time reference tied to churn outcome (Q3)

- ⚠️ **st_satisfaction_score** *(int64)*  
  → Customer satisfaction rating (1–5)  
  Potential leakage risk: dropped here as a precaution

- ❌ **st_customer_status** *(object)*  
  → Current state (Active/Churned)

- ❌ **st_churn_value** *(int64)*  
  → Encoded churn status (binary)

- ❌ **st_churn_score** *(int64)*  
  → Likely a precomputed churn risk score

- ❌ **st_cltv** *(int64)*  
  → Customer lifetime value

- ❌ **st_churn_category** *(object)*  
  → Why they churned (Competitor, Price, Service, etc.)

- ❌ **st_churn_reason** *(object)*  
  → Detailed textual reason


In [25]:
target = "st_churn_label"

leakage_target_cols = [
    "st_count",
    "st_quarter",
    "st_satisfaction_score",
    "st_customer_status",
    "st_churn_value",
    "st_churn_score",
    "st_cltv",
    "st_churn_category",
    "st_churn_reason",
]

df_clean = df_clean.drop(columns=leakage_target_cols)

print("✅ Dropped target-related leakage columns:", leakage_target_cols)
print("✅ Remaining shape:", df_clean.shape)
print("✅ Target distribution:")
print(df[target].value_counts())


✅ Dropped target-related leakage columns: ['st_count', 'st_quarter', 'st_satisfaction_score', 'st_customer_status', 'st_churn_value', 'st_churn_score', 'st_cltv', 'st_churn_category', 'st_churn_reason']
✅ Remaining shape: (7043, 32)
✅ Target distribution:
st_churn_label
No     5174
Yes    1869
Name: count, dtype: int64
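
The class balance implied by these counts is worth making explicit, since it affects how later models should be evaluated. This is a short worked calculation using the two counts printed above; the variable names are illustrative.

```python
# Counts copied from the target distribution printed above.
counts = {"No": 5174, "Yes": 1869}

total = sum(counts.values())               # should equal the number of rows
churn_rate = counts["Yes"] / total         # share of churned customers
imbalance = counts["No"] / counts["Yes"]   # non-churners per churner

print(f"rows={total}, churn rate={churn_rate:.1%}, ratio={imbalance:.2f}:1")
```

With roughly one churner for every 2.8 non-churners, plain accuracy would be a misleading metric on its own.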


# Phase 2 – Missing Values


In [26]:
missing_summary_before = (
    df_clean.isna()
    .sum()
    .reset_index()
    .rename(columns={"index": "column", 0: "n_missing"})
    .query("n_missing > 0")
    .sort_values("n_missing", ascending=False)
)

missing_summary_before

Unnamed: 0,column,n_missing
12,svc_offer,3877
17,svc_internet_type,1526


### Missing Values – Interpretation & Handling

Missing values reflect the **absence of the corresponding service or offer**, rather than data collection errors. Therefore, missing entries are replaced with explicit category labels:

- svc_internet_type → filled with "No Internet"
- svc_offer → filled with "No Offer" (assuming a missing value means no offer was taken; "Unknown" would be the alternative label if that assumption does not hold)
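
The counts support the structural reading for svc_internet_type: 1526 values are missing and 1526 customers have svc_internet_service equal to "No". Matching totals do not prove a row-level match, so a direct check is worthwhile. This is a sketch on toy data; on the real data the same comparison would run on `df_clean`.

```python
import numpy as np
import pandas as pd

# Toy rows standing in for df_clean; the values are made up for illustration.
toy = pd.DataFrame({
    "svc_internet_service": ["Yes", "No", "Yes", "No"],
    "svc_internet_type": ["DSL", np.nan, "Cable", np.nan],
})

missing_type = toy["svc_internet_type"].isna()
no_service = toy["svc_internet_service"].eq("No")

# Missingness is structural if it occurs exactly on the rows without internet service.
structural = bool((missing_type == no_service).all())
print("Missing internet type matches 'no internet service':", structural)
```

If the check returns False on the real data, some missing values are genuine gaps and a single "No Internet" label would mislabel them.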

In [27]:
fill_map = {}

if "svc_internet_type" in df_clean.columns:
    fill_map["svc_internet_type"] = "No Internet"

if "svc_offer" in df_clean.columns:
    fill_map["svc_offer"] = "No Offer"

df_clean = df_clean.fillna(value=fill_map)

# Phase 2 – Redundancy Analysis
Objective: Identify columns that are constant or near-constant.

In [28]:
n_unique = df_clean.nunique().sort_values()

constant_cols = n_unique[n_unique == 1].index.tolist()

print("✅ Constant columns detected:", constant_cols)

✅ Constant columns detected: ['demo_count']


In [29]:
df_clean = df_clean.drop(columns=constant_cols)

print("✅ Dropped constant columns:")
print(constant_cols)

print("\n✅ New DataFrame shape:")
print(df_clean.shape)

✅ Dropped constant columns:
['demo_count']

✅ New DataFrame shape:
(7043, 31)
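
The check above flags only strictly constant columns (`n_unique == 1`). For the near-constant part of the objective, one option is to look at the share of the most frequent value. In this sketch, the helper name `near_constant_columns` and the 0.9 threshold are assumptions; the threshold should be chosen for the use case.

```python
import pandas as pd


def near_constant_columns(frame, threshold=0.9):
    """Map each column whose most frequent value covers >= threshold of rows to that share."""
    flagged = {}
    for col in frame.columns:
        # value_counts sorts descending, so the first entry is the dominant value's share.
        top_share = frame[col].value_counts(normalize=True, dropna=False).iloc[0]
        if top_share >= threshold:
            flagged[col] = float(top_share)
    return flagged


# Toy frame: one constant, one near-constant, one balanced column.
toy = pd.DataFrame({
    "constant": [1] * 20,
    "near_constant": ["Yes"] * 19 + ["No"],
    "balanced": ["A", "B"] * 10,
})

flagged = near_constant_columns(toy)
print(flagged)
```

Near-constant columns are not necessarily useless, since a rare category can still be predictive. Treat the output as a list to review, not a list to drop.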


# Phase 3 – Saving Clean Dataset


In [30]:
df_clean.to_csv(project_root / "data" / "interim" / "telco_master_basic_clean.csv", index=False)
print("✅ Saved: telco_master_basic_clean.csv")


✅ Saved: telco_master_basic_clean.csv
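
A quick round-trip read is a cheap way to confirm that the saved file matches what was in memory. This sketch uses a toy frame and a temporary directory so it does not touch the project's data folder; the file name is illustrative.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame standing in for df_clean.
toy = pd.DataFrame({"demo_age": [19, 46, 80], "st_churn_label": ["No", "No", "Yes"]})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "toy_basic_clean.csv"
    toy.to_csv(out_path, index=False)   # index=False avoids an extra unnamed column on reload
    reloaded = pd.read_csv(out_path)

same_shape = reloaded.shape == toy.shape
same_columns = list(reloaded.columns) == list(toy.columns)
print(same_shape, same_columns)
```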
