# RetainX – Customer Revenue & Subscription Retention Intelligence System  
### Data Understanding & Raw Data Profiling  

**Client:** AirWave Communications  
**Domain:** Telecom / Subscription Analytics  
**Author:** Ujjwal Verma  

---

## Objective
This notebook performs **initial data understanding and profiling** on the **raw telecom customer dataset** (`telecom_churn.csv`).

The purpose of this notebook is strictly to:
- Understand dataset structure and schema
- Identify potential data quality issues
- Analyze churn distribution at a high level
- Inform cleaning, preprocessing, and modeling decisions

> **Important:**  
> This notebook does **not** perform any cleaning or transformation.  
> All data quality issues identified here are addressed in the subsequent **`02_data_cleaning.ipynb`** notebook before SQL ingestion.


## 1. Library Imports & Environment Setup

In this section, we import the core Python libraries required for data analysis and numerical operations.

- **Pandas** is used for data loading, manipulation, and tabular analysis  
- **NumPy** is used for numerical computations and array-based operations  

These libraries form the foundation of most analytics and data science workflows.


In [None]:
import pandas as pd
import numpy as np

## 2. Load Raw Telecom Customer Dataset

The raw telecom customer dataset is loaded from the project’s data directory.  
This dataset represents the **initial source of truth** before any cleaning, transformation, or feature engineering is applied.

At this stage:
- No modifications are made to the data
- The purpose is strictly inspection and understanding

A preview of the first few rows is displayed to confirm successful loading.


In [2]:
df = pd.read_csv("../02_Data/raw/telecom_churn.csv")
df.head()

|  | customer_id | telecom_partner | gender | age | state | city | pincode | date_of_registration | num_dependents | estimated_salary | calls_made | sms_sent | data_used | churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Reliance Jio | F | 25 | Karnataka | Kolkata | 755597 | 2020-01-01 | 4 | 124962 | 44 | 45 | -361 | 0 |
| 1 | 2 | Reliance Jio | F | 55 | Mizoram | Mumbai | 125926 | 2020-01-01 | 2 | 130556 | 62 | 39 | 5973 | 0 |
| 2 | 3 | Vodafone | F | 57 | Arunachal Pradesh | Delhi | 423976 | 2020-01-01 | 0 | 148828 | 49 | 24 | 193 | 1 |
| 3 | 4 | BSNL | M | 46 | Tamil Nadu | Kolkata | 522841 | 2020-01-01 | 1 | 38722 | 80 | 25 | 9377 | 1 |
| 4 | 5 | BSNL | F | 26 | Tripura | Delhi | 740247 | 2020-01-01 | 2 | 55098 | 78 | 15 | 1393 | 0 |


## 3. Dataset Structure & Schema Overview

This section examines the **overall structure of the dataset**, including:

- Total number of rows and columns
- Data types of each column
- Presence of missing or null values

Understanding schema and data types early helps identify:
- Columns requiring type correction
- Fields that may need cleaning or transformation
- Potential data inconsistencies
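
As a hedged illustration of the type checks described above, the sketch below runs a non-mutating dry run on a small sample built from the preview rows. It parses `date_of_registration` into a separate Series and counts values that fail to parse. The `sample` frame, the `parsed_dates` name, and the assumed `YYYY-MM-DD` format are illustrative assumptions, not part of this notebook's pipeline.

```python
import pandas as pd

# Illustrative sample copied from the first preview rows;
# in the notebook the same check would run against `df`.
sample = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "pincode": [755597, 125926, 423976],
    "date_of_registration": ["2020-01-01", "2020-01-01", "2020-01-01"],
})

# Parse into a separate Series so the source frame stays untouched.
# Assumption: dates follow the YYYY-MM-DD pattern seen in the preview.
parsed_dates = pd.to_datetime(
    sample["date_of_registration"], format="%Y-%m-%d", errors="coerce"
)

# Values that do not match the format become NaT and are counted here.
unparseable = int(parsed_dates.isna().sum())

print("Stored dtype:", sample["date_of_registration"].dtype)
print("Unparseable dates:", unparseable)
```

Because the parsed values live in their own Series, the check stays consistent with the inspection-only scope of this notebook.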


In [3]:
print("Rows, Columns:", df.shape)
df.info()

Rows, Columns: (243553, 14)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 243553 entries, 0 to 243552
Data columns (total 14 columns):
 #   Column                Non-Null Count   Dtype 
---  ------                --------------   ----- 
 0   customer_id           243553 non-null  int64 
 1   telecom_partner       243553 non-null  object
 2   gender                243553 non-null  object
 3   age                   243553 non-null  int64 
 4   state                 243553 non-null  object
 5   city                  243553 non-null  object
 6   pincode               243553 non-null  int64 
 7   date_of_registration  243553 non-null  object
 8   num_dependents        243553 non-null  int64 
 9   estimated_salary      243553 non-null  int64 
 10  calls_made            243553 non-null  int64 
 11  sms_sent              243553 non-null  int64 
 12  data_used             243553 non-null  int64 
 13  churn                 243553 non-null  int64 
dtypes: int64(9), object(5)
memory usage: 26.

## 4. Descriptive Statistics & Value Distribution

Here, we generate descriptive statistics for both **numerical and categorical variables**.

This step helps:
- Understand value ranges and central tendencies
- Identify potential outliers
- Review the diversity of categorical fields
- Validate whether values fall within expected business ranges

This analysis provides early signals for data cleaning and feature engineering decisions.
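
One way to validate business ranges is to count values that violate a simple rule. The sketch below uses the usage columns from the five preview rows and assumes that usage counters should never be negative. The rule and the `usage_cols` list are assumptions made for illustration.

```python
import pandas as pd

# Usage columns copied from the five preview rows shown earlier.
sample = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5],
    "calls_made": [44, 62, 49, 80, 78],
    "sms_sent": [45, 39, 24, 25, 15],
    "data_used": [-361, 5973, 193, 9377, 1393],
})

# Assumption: calls, SMS, and data usage cannot be negative.
usage_cols = ["calls_made", "sms_sent", "data_used"]

# Boolean mask of violations, summed per column.
negative_counts = (sample[usage_cols] < 0).sum()

print(negative_counts)
```

On this sample the check flags one negative `data_used` value, which is the kind of signal this step is meant to surface for the cleaning phase.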


In [None]:
df.describe(include="all").transpose()

|  | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| customer_id | 243553.0 |  |  |  | 121777.0 | 70307.839393 | 1.0 | 60889.0 | 121777.0 | 182665.0 | 243553.0 |
| telecom_partner | 243553.0 | 4.0 | Reliance Jio | 61123.0 |  |  |  |  |  |  |  |
| gender | 243553.0 | 2.0 | M | 145977.0 |  |  |  |  |  |  |  |
| age | 243553.0 |  |  |  | 46.077609 | 16.444029 | 18.0 | 32.0 | 46.0 | 60.0 | 74.0 |
| state | 243553.0 | 28.0 | Uttarakhand | 8856.0 |  |  |  |  |  |  |  |
| city | 243553.0 | 6.0 | Chennai | 40749.0 |  |  |  |  |  |  |  |
| pincode | 243553.0 |  |  |  | 549501.270541 | 259808.860574 | 100006.0 | 324586.0 | 548112.0 | 774994.0 | 999987.0 |
| date_of_registration | 243553.0 | 1220.0 | 2020-01-01 | 200.0 |  |  |  |  |  |  |  |
| num_dependents | 243553.0 |  |  |  | 1.9975 | 1.414941 | 0.0 | 1.0 | 2.0 | 3.0 | 4.0 |
| estimated_salary | 243553.0 |  |  |  | 85021.137839 | 37508.963233 | 20000.0 | 52585.0 | 84990.0 | 117488.0 | 149999.0 |


## 5. Missing Value Analysis

In this section, we analyze missing values across all columns to assess overall data quality.

For each column, we calculate:
- Absolute count of missing values
- Percentage of missing values relative to the dataset size

This helps determine:
- Which columns require imputation or cleaning
- Whether missing data is negligible or significant
- The potential impact of missing values on downstream analysis
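
A null count only catches explicit `NaN` values. As a complementary sketch, the example below also counts "disguised" missing values such as blank strings or placeholder tokens. The sample values and the `placeholders` set are hypothetical and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical values for illustration only; not taken from the dataset.
sample = pd.DataFrame({
    "state": ["Karnataka", " ", "NA", "Tripura"],
    "city": ["Kolkata", "Mumbai", "", "Delhi"],
})

# Assumed placeholder tokens; adjust to whatever the source system emits.
placeholders = {"", "na", "n/a", "null", "none", "unknown"}


def count_disguised_missing(series: pd.Series) -> int:
    """Count values that are blank or match a placeholder token."""
    normalized = series.astype(str).str.strip().str.lower()
    return int(normalized.isin(placeholders).sum())


explicit_nulls = sample.isnull().sum()
disguised = sample.apply(count_disguised_missing)

print(pd.DataFrame({"explicit_nulls": explicit_nulls, "disguised": disguised}))
```

In this sample `isnull()` reports nothing, while the placeholder check flags two `state` values and one `city` value.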


In [4]:
missing = df.isnull().sum().reset_index()
missing.columns = ["Column", "Missing_Count"]
missing["Missing_Percentage"] = (missing["Missing_Count"] / len(df)) * 100
missing.sort_values(by="Missing_Percentage", ascending=False)

|  | Column | Missing_Count | Missing_Percentage |
|---|---|---|---|
| 0 | customer_id | 0 | 0.0 |
| 1 | telecom_partner | 0 | 0.0 |
| 2 | gender | 0 | 0.0 |
| 3 | age | 0 | 0.0 |
| 4 | state | 0 | 0.0 |
| 5 | city | 0 | 0.0 |
| 6 | pincode | 0 | 0.0 |
| 7 | date_of_registration | 0 | 0.0 |
| 8 | num_dependents | 0 | 0.0 |
| 9 | estimated_salary | 0 | 0.0 |


## 6. Duplicate Record Identification

This step checks for **duplicate customer records** based on the unique customer identifier.

In real-world analytics systems:
- Each customer should be uniquely represented
- Duplicate records can distort churn metrics, revenue calculations, and segmentation

Identifying duplicates early ensures data integrity before further processing.
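
A unique identifier does not rule out the same customer being recorded twice under different IDs. The sketch below, using hypothetical records, compares duplicates on `customer_id` with duplicates across all other attributes. The sample rows and the `attribute_cols` name are assumptions for illustration.

```python
import pandas as pd

# Hypothetical records: IDs are unique, but rows 0 and 2 share every other attribute.
sample = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "telecom_partner": ["BSNL", "Vodafone", "BSNL"],
    "pincode": [522841, 423976, 522841],
    "date_of_registration": ["2020-01-01", "2020-01-01", "2020-01-01"],
})

# Duplicates on the identifier alone.
id_duplicates = int(sample.duplicated(subset=["customer_id"]).sum())

# Duplicates on every column except the identifier.
attribute_cols = [col for col in sample.columns if col != "customer_id"]
attribute_duplicates = int(sample.duplicated(subset=attribute_cols).sum())

print("Duplicate IDs:", id_duplicates)
print("Duplicate attribute rows:", attribute_duplicates)
```

Whether attribute-level matches are true duplicates is a business decision, so a check like this only flags candidates for review.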


In [None]:
duplicate_count = df.duplicated(subset=["customer_id"]).sum()
duplicate_count

np.int64(0)

## 7. Unique Value Distribution by Column

Here, we calculate the number of unique values for each column.

This analysis helps:
- Distinguish categorical vs continuous variables
- Identify columns with unexpectedly low or high cardinality
- Detect potential data quality issues (e.g., columns that should be unique but are not)

Understanding cardinality is also important for feature engineering and visualization planning.
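
To make the categorical-versus-continuous distinction concrete, the sketch below labels columns by cardinality, using the unique counts and row count this notebook reports for the dataset. The threshold `LOW_CARDINALITY_MAX` and the label names are assumptions chosen for illustration.

```python
import pandas as pd

# Row count and unique counts as reported in this notebook's outputs.
total_rows = 243553
unique_counts = pd.Series({
    "customer_id": 243553,
    "telecom_partner": 4,
    "gender": 2,
    "age": 57,
    "state": 28,
    "city": 6,
    "pincode": 213442,
    "date_of_registration": 1220,
    "num_dependents": 5,
    "estimated_salary": 110032,
})

# Assumed cut-off below which a column is treated as categorical-like.
LOW_CARDINALITY_MAX = 30


def label_cardinality(n_unique: int) -> str:
    """Assign a coarse cardinality label to a column."""
    if n_unique == total_rows:
        return "unique identifier"
    if n_unique <= LOW_CARDINALITY_MAX:
        return "low cardinality"
    return "high cardinality"


profile = pd.DataFrame({
    "unique_values": unique_counts,
    "unique_ratio": (unique_counts / total_rows).round(4),
    "label": unique_counts.map(label_cardinality),
})

print(profile)
```

The labels are only a starting point: `age`, for instance, lands above the assumed threshold even though it is a bounded discrete variable.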


In [None]:
unique_counts = df.nunique().reset_index()
unique_counts.columns = ["Column", "Unique_Values"]
unique_counts

|  | Column | Unique_Values |
|---|---|---|
| 0 | customer_id | 243553 |
| 1 | telecom_partner | 4 |
| 2 | gender | 2 |
| 3 | age | 57 |
| 4 | state | 28 |
| 5 | city | 6 |
| 6 | pincode | 213442 |
| 7 | date_of_registration | 1220 |
| 8 | num_dependents | 5 |
| 9 | estimated_salary | 110032 |


## 8. Churn Distribution & Class Balance

This section analyzes the distribution of the churn variable as a percentage.

Understanding churn distribution is critical because:
- Telecom churn datasets are often imbalanced
- Churn imbalance impacts analytical interpretation and modeling strategies
- It establishes a baseline for retention analysis

This insight helps guide segmentation, KPI design, and dashboard storytelling.
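
As a minimal sketch of how class balance can be quantified, the example below reports counts next to percentages and derives two reference figures: the share of the majority class (the accuracy a trivial "always predict the majority" rule would achieve) and the majority-to-minority ratio. The labels are hypothetical, and the names `majority_baseline` and `imbalance_ratio` are illustrative.

```python
import pandas as pd

# Hypothetical churn labels (8 retained, 2 churned) for illustration only.
churn = pd.Series([0] * 8 + [1] * 2, name="churn")

counts = churn.value_counts()
shares = churn.value_counts(normalize=True)

# Counts and percentages side by side.
distribution = pd.DataFrame({
    "count": counts,
    "percent": (shares * 100).round(2),
})

# Accuracy of always predicting the majority class.
majority_baseline = float(shares.max())

# How many majority-class rows exist per minority-class row.
imbalance_ratio = float(counts.max() / counts.min())

print(distribution)
print("Majority baseline:", majority_baseline)
print("Imbalance ratio:", imbalance_ratio)
```

Any later churn model needs to beat the majority baseline to add value, which is why the distribution is worth recording at this stage.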


In [None]:
df["churn"].value_counts(normalize=True) * 100

churn
0    79.952208
1    20.047792
Name: proportion, dtype: float64

## 9. Data Profiling Summary

In this final step, we consolidate key findings from the data understanding phase into a structured summary.

The summary includes:
- Dataset size
- Number of columns
- Duplicate record count
- Number of columns with missing values
- Churn distribution overview

This summary provides a **quick, executive-level snapshot** of dataset readiness before moving to the data cleaning phase.


In [None]:
analysis_summary = {
    "Total Rows": df.shape[0],
    "Total Columns": df.shape[1],
    "Duplicate Records": duplicate_count,
    "Columns with Missing Data": missing[missing["Missing_Count"] > 0].shape[0],
    "Churn Imbalance": df["churn"].value_counts(normalize=True).to_dict()
}

analysis_summary

{'Total Rows': 243553,
 'Total Columns': 14,
 'Duplicate Records': np.int64(0),
 'Columns with Missing Data': 0,
 'Churn Imbalance': {0: 0.7995220752772497, 1: 0.20047792472275028}}

## Key Observations & Data Readiness Assessment

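- The raw telecom customer dataset is structurally complete: 243,553 rows and 14 columns, with no missing values in any column.
- No duplicate `customer_id` values were found, confirming its integrity as a unique identifier.
- Data types are consistent and suitable for analytical processing. `date_of_registration` is stored as `object` and needs conversion to a datetime type, and categorical fields need minor standardization.
- The churn variable is a clearly defined binary flag and is moderately imbalanced (about 80% retained vs. 20% churned). This is workable for retention analysis but should be accounted for in any modeling.
- Structural quality is high, but value-level checks are still required: the preview already shows a negative `data_used` value (-361) and `state`/`city` pairs that do not match geographically (for example, Karnataka with Kolkata). These are deferred to the cleaning notebook.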

---

## Conclusion

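The dataset is **structurally complete at source** (no nulls and no duplicate customer IDs), which is a strong starting point for a real-world analytics project. Value-level anomalies, such as the negative `data_used` value seen in the preview, still need to be handled during cleaning.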

Preprocessing steps in the subsequent **`02_data_cleaning.ipynb`** notebook focus on:
- Data type standardization
- Outlier handling
- Feature preparation for analytics

The cleaned dataset is then ingested into PostgreSQL for SQL-based feature engineering and analytical modeling.

➡ Proceed to **`02_data_cleaning.ipynb`**
