# Customer Churn Prediction
## Project Overview
This project uses data from the telecom domain. Due to tough competition, customers tend to switch between telecommunication service providers. E.g., an Airtel customer might transition to Jio services and vice versa. This behaviour is known as churn.

## Objective
To predict whether a customer will churn or not, and then take the Next Best Action to prevent churn.

## Stages to be covered during the solution
- `Data Merging and Wrangling:` Combining multiple data sources and cleaning the data
- `Exploratory Data Analysis:` Understanding the relationship between features and with target
- `Data Preprocessing:` Data Encoding, Missing Value Treatment, Outlier Treatment, Feature Scaling
- `Model Building:` Train ML Model using the pre-processed data
- `Evaluation:` Assess the Model's performance

By the end of this project, you will have a complete workflow for predicting churn and/or creating classification models.
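The preprocessing stage listed above can be sketched on a toy frame. This is a minimal illustration, not the project's actual pipeline: the values are made up, only three of the four preprocessing steps are shown (outlier treatment is left out), and the choices of median imputation, one-hot encoding, and standardization are assumptions.

```python
import numpy as np
import pandas as pd

# Toy data; the values are invented for illustration only.
toy = pd.DataFrame({
    "monthly_charges": [20.0, 55.0, np.nan, 140.0],
    "partner": [True, False, True, False],
    "region": ["Ohio", "Texas", "Ohio", "Utah"],
    "churn": [False, True, False, True],
})

X = toy.drop(columns="churn")
y = toy["churn"].astype(int)  # target as 0/1

# Missing value treatment: fill numeric gaps with the column median.
X["monthly_charges"] = X["monthly_charges"].fillna(X["monthly_charges"].median())

# Data encoding: booleans become 0/1, categories become one-hot columns.
X["partner"] = X["partner"].astype(int)
X = pd.get_dummies(X, columns=["region"], dtype=int)

# Feature scaling: standardize to zero mean and unit variance.
mc = X["monthly_charges"]
X["monthly_charges"] = (mc - mc.mean()) / mc.std(ddof=0)

print(X)
```

In a real workflow the median, mean, and standard deviation would normally be computed on the training split only and then reused on the test split.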


## Domain Background (Telecom Churn Story)

I’m working as a data analyst at a telecommunications company, which I refer to as TeleComCo. TeleComCo provides phone and internet services to a wide range of customers, and, as with most telecoms, customer churn (when customers stop using our service) is a key concern. High churn rates lead to lost revenue and may signal customer dissatisfaction.

In the last quarter, we noticed a rise in customers leaving for competitors. In response, our management tasked my team with investigating the reasons behind this churn and developing a model to predict which customers are most at risk of leaving. The purpose is to identify these customers in advance and proactively offer them incentives to stay.

For this project, I’ve been given a dataset detailing account information for both past and current customers, along with data indicating whether or not each customer eventually churned. My primary responsibilities include:
- Analyzing this dataset to detect patterns and factors associated with customer churn.
- Building a predictive model (specifically using logistic regression) to estimate churn risk for individual customers.

Through this data exploration, I expect to identify patterns such as:
- Customers with longer tenures are generally less likely to churn, while newer customers may be at greater risk.
- Those with certain types of plans or higher monthly charges might be more inclined to leave, possibly due to the cost factor.
- Demographic details could influence churn; for instance, senior citizens may use our services differently or have specific needs.

Customer preferences, like opting for paperless billing or bundling phone and internet, might also relate to their likelihood of churning.
By thoroughly investigating these factors and building the predictive model, we aim to help TeleComCo understand why customers leave and reduce future churn through timely interventions.
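One way to test the tenure pattern described above is to bucket tenure and compare churn rates per bucket. The sketch below uses a made-up toy frame, and the band edges (12, 24, 48, 72 months) are an arbitrary assumption rather than something prescribed by the dataset.

```python
import pandas as pd

# Toy data; the values are invented for illustration only.
toy = pd.DataFrame({
    "tenure_months": [2, 5, 11, 14, 30, 40, 60, 70],
    "churn": [True, True, False, True, False, True, False, False],
})

# Hypothetical tenure bands; include_lowest=True keeps tenure 0 in the first band.
bins = [0, 12, 24, 48, 72]
labels = ["0-12", "13-24", "25-48", "49-72"]
toy["tenure_band"] = pd.cut(
    toy["tenure_months"], bins=bins, labels=labels, include_lowest=True
)

# The mean of a boolean column is the share of True values, i.e. the churn rate.
churn_rate = toy.groupby("tenure_band", observed=True)["churn"].mean()
print(churn_rate)
```

The same groupby pattern applies to any categorical or boolean feature, such as region or paperless_billing.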

## Dataset Description

The dataset consists of customer records, each with a variety of features describing the customer and their service usage. Below is an overview of each column in the data:
- `customer_id:` A unique identifier for each customer (e.g., a UUID). This is just an ID and not useful for prediction.
- `customer_email:` The email address of the customer. This is an identifier as well and not directly useful for the model.
- `age:` The age of the customer (in years). This could be related to churn if different age groups have different service preferences.
- `senior_citizen:` Whether the customer is a senior citizen or not (boolean: true/false). Typically, this might be derived from age (e.g., age > 65).
- `partner:` Whether the customer has a partner or not (boolean). This indicates if the customer is married or in a long-term partnership. In telecom, having a partner might mean family plans or shared services.
- `dependents:` Whether the customer has dependents (children or other dependents) or not (boolean). Customers with dependents might have different usage (e.g., family plans).
- `tenure_months:` The number of months the customer has been with the company. Higher tenure might indicate loyalty; low tenure customers are newer and might be more likely to churn if they haven't established loyalty.
- `phone_service:` Whether the customer has phone service with the company (boolean). Some customers might only have internet service; this feature tells if they also subscribed to phone.
- `paperless_billing:` Whether the customer has opted for paperless billing (boolean). This could be a proxy for tech-savvy behavior or convenience preference.
- `monthly_charges:` The amount `$` charged to the customer every month. This is like their monthly bill. Customers with higher bills might churn due to cost, or those with very low bills might churn if they are not using many services.
- `total_charges:` The total amount `$` the customer has been charged since joining (this is roughly monthly_charges * tenure, plus any extras). This can indicate the overall value of the customer; low total charges might mean the customer is relatively new or has a low-cost plan.
- `churn:` The target variable - whether the customer has churned (true = yes, the customer left; false = no, the customer is still with the company). This is what we want to
predict.
- `last_interaction_date:` The date of the last interaction with the customer (could be the last service use or last customer support call, etc.). This might give insight into how recently the customer was active. Customers with very old last interactions might have silently churned.
- `region:` The geographic region or state where the customer resides (e.g., Ohio, California, etc.). Different regions might have different market conditions or competitor
presence, possibly affecting churn.
- `signup_date:` The date when the customer originally signed up for service. (Note: This column is present in one of the source files. When we merge data, some records might not have a signup_date if it wasn't recorded for them.)



## Potential Questions and Considerations:

Based on the above features, here are some questions that might arise and that we will explore in this project:
- Do older customers or senior citizens tend to churn more or less than younger customers?
- Does having a partner or dependents influence churn? (For example, do single customers churn more often than those with family plans?)
- How does tenure relate to churn? Are newer customers more likely to leave compared to long-term customers?
- What about monthly charges? Are customers with high monthly charges more likely to churn (perhaps due to higher cost), or could it be that those with low charges churn because they might not be fully utilizing the service?
- Are there any regional trends in churn? (We might check if certain regions have higher churn rates.)
- How do features like phone service or paperless billing correlate with churn? (e.g., maybe paperless billing users are more engaged or maybe less personal interaction leads to higher churn?)
- Are there outliers or unusual values in charges or tenure that need special attention?
  
`Try to answer these questions step-by-step in the analysis below.`
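For the question about outliers in charges or tenure, a common starting point is the 1.5 × IQR rule. The sketch below applies it to a made-up series; the 1.5 multiplier is a convention, not a requirement of this dataset.

```python
import pandas as pd

# Toy charges; the values are invented, with one deliberately extreme entry.
charges = pd.Series(
    [20.0, 45.0, 60.0, 75.0, 90.0, 110.0, 400.0], name="monthly_charges"
)

# Interquartile range and the conventional 1.5 * IQR fences.
q1, q3 = charges.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside the fences are flagged for a closer look, not dropped automatically.
outliers = charges[(charges < lower) | (charges > upper)]
print(f"fences: [{lower}, {upper}]")
print(outliers)
```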

### Task 0: Import required libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# tell Jupyter to render matplotlib plots inline, in the output section of the cell
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")

### Task 1: Combine the two datasets
The customer data is provided in two CSV files (say, Customer_Churn_data.csv and Customer_Churn_data_2.csv). Load both files and combine them into a single pandas DataFrame for analysis. The two files have the same columns (one file may have an extra column signup_date). Ensure that after merging, all columns are aligned correctly.

In [2]:
file1 = pd.read_csv("Customer_Churn_data.csv")
print(file1.shape)
file1.head(2)

(100000, 14)


Unnamed: 0,customer_id,customer_email,age,senior_citizen,partner,dependents,tenure_months,phone_service,paperless_billing,monthly_charges,total_charges,churn,last_interaction_date,region
0,0f1eb305-e440-4576-9ab0-f8bdbf0bd17b,Preston.Cartwright54@hotmail.com,74,True,False,True,50,True,True,104.265791,1647.518754,False,2024-08-16T21:32:39.602Z,Ohio
1,0e0237a0-dc14-4610-9c74-5f50d72dd00a,Agustin_Treutel@yahoo.com,43,False,False,True,28,True,True,116.143274,4882.935552,False,2025-01-11T09:46:16.708Z,Oklahoma


In [3]:
file2 = pd.read_csv("Customer_Churn_data_2.csv")
print(file2.shape)
file2.head(2)

(100000, 15)


Unnamed: 0,customer_id,customer_email,age,senior_citizen,partner,dependents,tenure_months,phone_service,paperless_billing,monthly_charges,total_charges,churn,last_interaction_date,region,signup_date
0,132dcfb5-759a-4640-9156-79319d71b7f1,Jarod_Heidenreich@yahoo.com,62,False,True,True,28,True,False,129.60312,2285.999558,False,2024-09-19T04:01:17.383Z,Virginia,2023-04-23T21:52:19.052Z
1,c560d179-94a5-487a-9fd6-b1174fea339f,Julianne67@gmail.com,64,False,True,True,6,False,False,27.488638,5045.149417,False,2024-12-08T06:33:58.864Z,Pennsylvania,2022-01-03T16:36:13.193Z
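Because only the second file carries signup_date, it helps to see how `pd.concat` treats a column that exists in one frame but not the other. The sketch below uses two tiny made-up frames (the frame names `a` and `b` and their IDs are hypothetical).

```python
import pandas as pd

# Two toy frames: only the second one has a signup_date column.
a = pd.DataFrame({"customer_id": ["a1", "a2"], "region": ["Ohio", "Utah"]})
b = pd.DataFrame({
    "customer_id": ["b1"],
    "region": ["Texas"],
    "signup_date": ["2023-04-23T21:52:19.052Z"],
})

# concat aligns columns by name; rows from `a` get NaN in signup_date.
# ignore_index=True rebuilds a clean 0..n-1 index instead of repeating 0, 1, ...
combined = pd.concat([a, b], ignore_index=True)

print(combined)
print(combined["signup_date"].isna().sum())
```

Columns are matched by name rather than by position, so a different column order in the two files would not misalign the data.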


### Task 2: View the first few rows of the combined data

After merging, use the DataFrame's head() method to display the first 5 rows of the combined dataset. This will help verify that the data from both files has been concatenated correctly and that columns are as expected. (The cell below uses sample(5) instead, which draws random rows from anywhere in the combined data.)

In [9]:
data = pd.concat([file1, file2], ignore_index=True)
data.sample(5)

Unnamed: 0,customer_id,customer_email,age,senior_citizen,partner,dependents,tenure_months,phone_service,paperless_billing,monthly_charges,total_charges,churn,last_interaction_date,region,signup_date
119305,f767f472-de9e-4214-9cd0-f3b9bb65341c,Berniece_Schneider44@hotmail.com,24,True,True,True,53,False,False,45.476548,5078.339264,False,2025-02-23T17:38:55.227Z,Maryland,2024-07-19T04:22:43.517Z
164665,f10ba2ab-ed94-4228-a862-1adbc76db828,Marley.Torp@yahoo.com,72,True,False,True,36,True,False,132.869715,4515.274639,False,2024-11-02T20:01:05.272Z,New Mexico,2024-04-16T00:00:00.524Z
142383,c2809e13-038d-4d2a-babc-57ae852a046b,George90@yahoo.com,48,False,True,True,31,False,True,143.309371,5642.954727,True,2025-03-24T13:35:14.599Z,California,2023-10-18T07:40:07.897Z
172212,21aa4fa8-2a2c-4eaf-845d-1a71eb524b19,Citlalli36@yahoo.com,46,True,True,False,19,False,False,73.912599,3333.799924,True,2025-06-25T14:28:22.638Z,Minnesota,2022-07-29T20:17:29.103Z
39144,65c0a6fd-3dc5-4d3b-a09f-4768efdac0a5,Selmer67@hotmail.com,27,True,False,False,31,False,False,61.685801,1388.460767,False,2025-07-21T05:19:49.176Z,Ohio,
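The dataset description suggests senior_citizen might be derived from age (e.g., age > 65). A quick consistency check is sketched below on the five sampled rows shown above; the threshold of 65 comes from that description and is an assumption, not a documented rule of the data.

```python
import pandas as pd

# age and senior_citizen copied from the five sampled rows in the output above.
rows = pd.DataFrame({
    "age": [24, 72, 48, 46, 27],
    "senior_citizen": [True, True, False, True, True],
})

# Assumed rule from the dataset description: senior means age > 65.
rows["derived_senior"] = rows["age"] > 65
rows["mismatch"] = rows["derived_senior"] != rows["senior_citizen"]

print(rows)
print("mismatches:", rows["mismatch"].sum())
```

Three of these five rows disagree with the assumed rule, so the flag is worth checking against age on the full data before treating the two columns as redundant.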


### Task 3: Understand the dataset dimensions and dtypes
Determine the size of the combined dataset: find out how many rows and columns are present. This can be done using the DataFrame's .info() method, which also shows the data type of each column and whether there are any missing values (non-null counts) in each column. Verify that numeric columns are correctly recognized (e.g., age and tenure_months should be int or float, charges should be float, churn and other booleans might appear as bool).

In [10]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200000 entries, 0 to 199999
Data columns (total 15 columns):
 #   Column                 Non-Null Count   Dtype  
---  ------                 --------------   -----  
 0   customer_id            200000 non-null  object 
 1   customer_email         200000 non-null  object 
 2   age                    200000 non-null  int64  
 3   senior_citizen         200000 non-null  bool   
 4   partner                200000 non-null  bool   
 5   dependents             200000 non-null  bool   
 6   tenure_months          200000 non-null  int64  
 7   phone_service          200000 non-null  bool   
 8   paperless_billing      200000 non-null  bool   
 9   monthly_charges        200000 non-null  float64
 10  total_charges          200000 non-null  float64
 11  churn                  200000 non-null  bool   
 12  last_interaction_date  200000 non-null  object 
 13  region                 200000 non-null  object 
 14  signup_date            100000 non-nu
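The same checks can also be made programmatically instead of reading the `.info()` printout. The sketch below uses a small made-up frame and standard pandas attributes (`shape`, `dtypes`, `isna`).

```python
import numpy as np
import pandas as pd

# Toy frame; the values are invented for illustration only.
toy = pd.DataFrame({
    "age": [74, 43, 62],
    "monthly_charges": [104.27, 116.14, 129.60],
    "churn": [False, False, True],
    "signup_date": [np.nan, np.nan, "2023-04-23T21:52:19.052Z"],
})

# Dimensions of the frame.
n_rows, n_cols = toy.shape
print(n_rows, n_cols)

# Data type of every column.
print(toy.dtypes)

# Missing values per column, keeping only columns that have any.
missing = toy.isna().sum()
print(missing[missing > 0])
```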

#### Verify if the data has been loaded perfectly

In [16]:
# last_interaction_date --> convert to datetime
data['last_interaction_date'] = pd.to_datetime(data['last_interaction_date'])
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200000 entries, 0 to 199999
Data columns (total 15 columns):
 #   Column                 Non-Null Count   Dtype              
---  ------                 --------------   -----              
 0   customer_id            200000 non-null  object             
 1   customer_email         200000 non-null  object             
 2   age                    200000 non-null  int64              
 3   senior_citizen         200000 non-null  bool               
 4   partner                200000 non-null  bool               
 5   dependents             200000 non-null  bool               
 6   tenure_months          200000 non-null  int64              
 7   phone_service          200000 non-null  bool               
 8   paperless_billing      200000 non-null  bool               
 9   monthly_charges        200000 non-null  float64            
 10  total_charges          200000 non-null  float64            
 11  churn                  200000 non-null 

In [21]:
# signup_date --> convert into correct datatype (datetime)
data['signup_date'] = pd.to_datetime(data['signup_date'])  # for mixed formats in a single column, pandas >= 2.0 offers format='mixed' (infer_datetime_format is deprecated)
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200000 entries, 0 to 199999
Data columns (total 15 columns):
 #   Column                 Non-Null Count   Dtype              
---  ------                 --------------   -----              
 0   customer_id            200000 non-null  object             
 1   customer_email         200000 non-null  object             
 2   age                    200000 non-null  int64              
 3   senior_citizen         200000 non-null  bool               
 4   partner                200000 non-null  bool               
 5   dependents             200000 non-null  bool               
 6   tenure_months          200000 non-null  int64              
 7   phone_service          200000 non-null  bool               
 8   paperless_billing      200000 non-null  bool               
 9   monthly_charges        200000 non-null  float64            
 10  total_charges          200000 non-null  float64            
 11  churn                  200000 non-null 
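Since signup_date is missing for half of the rows, it is worth seeing how `pd.to_datetime` handles missing or malformed entries. The sketch below is an illustration on a made-up series; `errors="coerce"` and `utc=True` are optional choices, not what the cells above use.

```python
import pandas as pd

# One valid ISO 8601 timestamp, one missing value, one malformed string.
raw = pd.Series(["2023-04-23T21:52:19.052Z", None, "not a date"])

# errors="coerce" turns unparseable entries into NaT instead of raising;
# utc=True returns a timezone-aware (UTC) datetime column.
parsed = pd.to_datetime(raw, utc=True, errors="coerce")

print(parsed)
print(parsed.dt.year)  # the .dt accessor exposes date parts; NaT yields NaN
```

Coercion hides bad values silently, so counting `parsed.isna()` before and after conversion is a useful sanity check.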

### Task 4: Generate summary statistics
Use the .describe() method on the DataFrame to get summary statistics for the numeric columns (count, mean, std, min, quartiles, max). This will give an overview of the
distributions (e.g., average age, average tenure, min/max charges, etc.).

In [22]:
data.describe(include=object)

Unnamed: 0,customer_id,customer_email,region
count,200000,200000,200000
unique,200000,197332,50
top,0f1eb305-e440-4576-9ab0-f8bdbf0bd17b,Emerson28@gmail.com,North Dakota
freq,1,4,4180


In [24]:
data.describe(include=int)

Unnamed: 0,age,tenure_months
count,200000.0,200000.0
mean,54.03167,35.97615
std,21.048897,21.053301
min,18.0,0.0
25%,36.0,18.0
50%,54.0,36.0
75%,72.0,54.0
max,90.0,72.0


In [26]:
data.describe(include=float)
# hypothesis to verify: tenure_months * monthly_charges ≈ total_charges

Unnamed: 0,monthly_charges,total_charges
count,200000.0,200000.0
mean,76.618568,4006.213781
std,34.89114,2303.960157
min,18.000333,18.027625
25%,46.902328,2015.261182
50%,75.754153,3999.740393
75%,104.441355,6001.846181
max,149.998856,7999.994083
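The relationship between tenure, monthly charges, and total charges noted in the cell above can be checked directly. The sketch below uses the four rows printed by the two `head(2)` calls in Task 1; the 10% tolerance is an arbitrary assumption.

```python
import numpy as np
import pandas as pd

# Values copied from the four head(2) rows shown in Task 1.
rows = pd.DataFrame({
    "tenure_months": [50, 28, 28, 6],
    "monthly_charges": [104.265791, 116.143274, 129.60312, 27.488638],
    "total_charges": [1647.518754, 4882.935552, 2285.999558, 5045.149417],
})

# Expected total if the bill were simply tenure times the monthly charge.
expected = rows["tenure_months"] * rows["monthly_charges"]
rows["ratio"] = rows["total_charges"] / expected

# Flag rows where the actual total is within 10% of the expected total.
rows["within_10pct"] = np.isclose(rows["total_charges"], expected, rtol=0.10)

print(rows[["ratio", "within_10pct"]])
```

None of these four rows falls within the tolerance, so the relationship should be verified on the full dataset before it is relied on for feature engineering.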


In [27]:
data.describe(include=bool)

Unnamed: 0,senior_citizen,partner,dependents,phone_service,paperless_billing,churn
count,200000,200000,200000,200000,200000,200000
unique,2,2,2,2,2,2
top,True,False,False,False,True,True
freq,100054,100001,100202,100043,100117,100013


In [33]:
data.describe(include=['datetimetz'])

Unnamed: 0,last_interaction_date,signup_date
count,200000,100000
mean,2025-02-02 21:13:13.894031872+00:00,2023-02-03 04:16:50.602242048+00:00
min,2024-08-04 12:54:19.943000+00:00,2020-08-05 13:04:51.741000+00:00
25%,2024-11-03 17:43:17.303000064+00:00,2021-11-04 20:06:11.245750016+00:00
50%,2025-02-02 22:44:00.503000064+00:00,2023-02-01 14:36:04.458499840+00:00
75%,2025-05-05 08:50:22.581000192+00:00,2024-05-02 17:46:17.339500032+00:00
max,2025-08-04 12:49:38.839000+00:00,2025-08-04 11:36:35.778000+00:00


### Task 5: Check for duplicate entries
Ensure there are no duplicate customer records in the data. For instance, verify if customer_id is unique across the combined dataset. You can use pandas functions like
.duplicated() on the customer_id column to check for any duplicates.

In [36]:
data['customer_id'].duplicated().sum()  # number of repeated customer_id values

0

In [38]:
data.duplicated().sum() # number of duplicated rows in the given df

0
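The `describe(include=object)` output in Task 4 reports 197332 unique emails across 200000 rows, so some emails repeat even though customer_id is unique. A sketch of how such repeats could be inspected, on a made-up frame:

```python
import pandas as pd

# Toy frame; IDs are unique but one email appears twice.
toy = pd.DataFrame({
    "customer_id": ["id-1", "id-2", "id-3", "id-4"],
    "customer_email": [
        "a@example.com", "b@example.com", "a@example.com", "c@example.com",
    ],
})

# keep=False marks every occurrence of a repeated email, not just the later ones.
dup_mask = toy.duplicated(subset="customer_email", keep=False)
print(toy[dup_mask])

# Frequency of each email, most common first.
print(toy["customer_email"].value_counts().head())
```

Whether repeated emails represent the same person with several accounts or distinct customers is a question for the data owner; the check only surfaces the candidates.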
