# Telco Customer Churn EDA

## 1. Problem Statement

### What is Customer Churn?
Customer churn occurs when a customer stops doing business with a company. In the telecom industry, this means a user cancels their subscription.

### Why is it important?
Acquiring new customers is often more expensive than retaining existing ones. By understanding why customers leave, companies can improve their services and offer targeted retention plans.

### Our Goal
We use historical customer data to find patterns and drivers of churn. This analysis prepares us to build a prediction model later.


## 2. Basic Data Checks

In [2]:
# Basic libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

# Set background graph style
plt.style.use('dark_background')

# Load Dataset
data = pd.read_csv(r'D:\Data Science Projects\Customer Churn Risk Scoring System\NoteBook\Data\Telco-Customer-Churn.csv')

# 1. Check Shape
print('Data Shape:', data.shape)

# 2. View first few rows
display(data.head())

# 3. Info to see column types and non-null counts
print('\nData Info:')
data.info()

# 4. Summary statistics for numeric columns
print('\nNumeric Summary:')
display(data.describe())

# 5. Unique values per column
print('\nUnique Values Count:')
print(data.nunique())

Data Shape: (7043, 21)


| | customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | ... | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7590-VHVEG | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 1 | 5575-GNVDE | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | ... | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.5 | No |
| 2 | 3668-QPYBK | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | ... | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
| 3 | 7795-CFOCW | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes | ... | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.3 | 1840.75 | No |
| 4 | 9237-HQITU | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 70.7 | 151.65 | Yes |



Data Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-nul

| | SeniorCitizen | tenure | MonthlyCharges |
|---|---|---|---|
| count | 7043.0 | 7043.0 | 7043.0 |
| mean | 0.162147 | 32.371149 | 64.761692 |
| std | 0.368612 | 24.559481 | 30.090047 |
| min | 0.0 | 0.0 | 18.25 |
| 25% | 0.0 | 9.0 | 35.5 |
| 50% | 0.0 | 29.0 | 70.35 |
| 75% | 0.0 | 55.0 | 89.85 |
| max | 1.0 | 72.0 | 118.75 |



Unique Values Count:
customerID          7043
gender                 2
SeniorCitizen          2
Partner                2
Dependents             2
tenure                73
PhoneService           2
MultipleLines          3
InternetService        3
OnlineSecurity         3
OnlineBackup           3
DeviceProtection       3
TechSupport            3
StreamingTV            3
StreamingMovies        3
Contract               3
PaperlessBilling       2
PaymentMethod          4
MonthlyCharges      1585
TotalCharges        6531
Churn                  2
dtype: int64


### Insight
- The dataset contains 7043 rows and 21 columns.
- Most columns are categorical (`object` dtype). `SeniorCitizen`, `tenure`, and `MonthlyCharges` are numeric, although `SeniorCitizen` takes only the values 0 and 1, so it is effectively a binary flag.
- **Issue**: `TotalCharges` is currently an object (string) type, but it should be numeric. This needs to be fixed.
- `customerID` has 7043 unique values, confirming it acts as a primary key.
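
Before converting a string column such as `TotalCharges`, it can help to see which entries are preventing pandas from reading it as numeric. The following is a minimal sketch, not part of the original notebook: `find_non_numeric` is a hypothetical helper, and `sample` is a small made-up stand-in for `data['TotalCharges']`.

```python
import pandas as pd


def find_non_numeric(series: pd.Series) -> pd.Series:
    """Return the entries of a string column that cannot be parsed as numbers.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    parsed = pd.to_numeric(series, errors="coerce")
    # Keep entries that were present in the input but failed to parse
    mask = parsed.isna() & series.notna()
    return series[mask]


# Toy stand-in for data['TotalCharges'] (illustrative values only)
sample = pd.Series(["29.85", "1889.5", " ", "108.15", ""])

bad = find_non_numeric(sample)
print("Row positions with unparseable values:", bad.index.tolist())
print("Raw values:", bad.map(repr).tolist())
```

On the real dataset, the same call would be `find_non_numeric(data['TotalCharges'])`. Printing the values with `repr` makes blank or whitespace-only strings visible.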


## 3. Data Cleaning & Quality

In [4]:
# a) Convert TotalCharges to numeric
# 'coerce' turns non-numeric values (like empty strings) into NaN
data['TotalCharges'] = pd.to_numeric(data['TotalCharges'], errors='coerce')

# Check how many missing values were created
print('Missing TotalCharges:', data['TotalCharges'].isnull().sum())

# b) Duplicate Check
duplicates = data.duplicated().sum()
print('Duplicate Rows:', duplicates)

# Check for duplicate CustomerIDs
id_duplicates = data['customerID'].duplicated().sum()
print('Duplicate CustomerIDs:', id_duplicates)

# Drop duplicates if any
if duplicates > 0:
    data.drop_duplicates(inplace=True)
    print('Duplicates dropped.')


Missing TotalCharges: 11
Duplicate Rows: 0
Duplicate CustomerIDs: 0


### Insight
- Converting `TotalCharges` to numeric succeeded; 11 values could not be parsed and became NaN.
- There are no duplicate rows in the dataset.
- `customerID` is unique for every row, so the data is at the customer level.
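
The NaN values in `TotalCharges` still need a decision. One possible approach, sketched below under an assumption that should be verified on the real data first, is that the affected rows are brand-new customers with `tenure == 0` who have not been billed yet, in which case a total of 0 is a reasonable fill. `fill_new_customer_charges` is a hypothetical helper and `toy` is a made-up frame used only to show the behavior.

```python
import numpy as np
import pandas as pd


def fill_new_customer_charges(df: pd.DataFrame) -> pd.DataFrame:
    """Fill NaN TotalCharges with 0.0 only where tenure is 0.

    Hypothetical helper: rows with missing TotalCharges and tenure > 0
    are left as NaN so they can be reviewed separately.
    """
    out = df.copy()
    mask = out["TotalCharges"].isna() & (out["tenure"] == 0)
    out.loc[mask, "TotalCharges"] = 0.0
    return out


# Toy frame for illustration (not rows from the dataset)
toy = pd.DataFrame({
    "tenure": [0, 34, 2, 5],
    "MonthlyCharges": [52.55, 56.95, 53.85, 20.0],
    "TotalCharges": [np.nan, 1889.5, 108.15, np.nan],
})

# Step 1: inspect the rows with missing totals before changing anything
print(toy.loc[toy["TotalCharges"].isna(), ["tenure", "MonthlyCharges"]])

# Step 2: fill only the rows that match the tenure == 0 assumption
cleaned = fill_new_customer_charges(toy)
print("Remaining missing values:", cleaned["TotalCharges"].isna().sum())
```

The helper returns a copy rather than modifying the frame in place, so the original values stay available for comparison. If the inspection step shows missing totals for customers with non-zero tenure, dropping those rows or imputing from `tenure * MonthlyCharges` are alternatives worth considering.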
