# 1.0 Project Overview

### 1.1 Real Life Problem
Customer churn is a major concern for telecommunication companies, as acquiring new customers is significantly more expensive than retaining existing ones. Many telecommunications providers aim to identify customers who are likely to stop using their mobile network so that proactive retention strategies can be applied.


### 1.2 Invested stakeholders
The primary stakeholders of this project include SyriaTel’s business management, marketing and retention teams, customer service departments, and strategic decision-makers who are interested in reducing customer churn and improving long-term customer loyalty.

# 2.0 Business Understanding


### 2.1 Business Problem

SyriaTel is experiencing customer attrition and wants to reduce churn by identifying at-risk customers early.

Currently, SyriaTel lacks a data-driven method to:

1. Predict which customers are likely to churn.

2. Determine the estimated revenue at risk from customers who churn.

3. Understand the key factors driving customer dissatisfaction.



### 2.2 Business Questions


1. Can we accurately predict whether a customer will churn? 

2. What is the estimated revenue at risk from customers who churn?

3. Which customer behaviors and service features are most strongly associated with churn?



### 2.3 Business Objectives

1. Build a classification model to predict customer churn.

2. Estimate the potential revenue at risk when customers churn.

3. Identify key factors associated with customer churn and retention to support data-driven decision-making.
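
The second objective can be illustrated with a small sketch. It assumes that a customer's revenue is the sum of the four charge columns described in the data, which is a simplifying assumption rather than SyriaTel's actual billing logic. The rows below are made-up toy values, and `total charge`, `revenue_at_risk` and `share_at_risk` are hypothetical names.

```python
import pandas as pd

# Toy rows using the dataset's charge column names; values are illustrative only.
toy = pd.DataFrame({
    "total day charge":   [40.0, 30.0, 20.0, 50.0],
    "total eve charge":   [15.0, 10.0, 5.0, 20.0],
    "total night charge": [10.0, 5.0, 5.0, 10.0],
    "total intl charge":  [3.0, 2.0, 1.0, 4.0],
    "churn":              [False, True, False, True],
})

charge_cols = [
    "total day charge",
    "total eve charge",
    "total night charge",
    "total intl charge",
]

# Assumption: revenue per customer is the sum of the four charge columns.
toy["total charge"] = toy[charge_cols].sum(axis=1)

# Revenue at risk is the revenue attached to customers flagged as churners.
revenue_at_risk = toy.loc[toy["churn"], "total charge"].sum()
share_at_risk = revenue_at_risk / toy["total charge"].sum()

print(f"Revenue at risk: {revenue_at_risk:.2f} ({share_at_risk:.1%} of total)")
```

On real data the same calculation could be applied to predicted churners instead of observed ones, so that retention effort is directed at high-value customers first.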


### 2.4 Implications of the project findings for the real-world problem and stakeholders

The findings of this project will help SyriaTel to: 

1. Identify at-risk customers early, enabling proactive retention strategies.

2. Understand the potential revenue impact of customer churn and prioritize high-risk, high-value customers.

3. Gain insights into customer behaviors and service factors that influence churn, supporting targeted improvements in service quality and customer satisfaction.

# 3.0 Data Understanding

### 3.1 Data Description and Sources

+ The dataset contains customer-level data from SyriaTel, with each row representing a single customer. It suits this project because it includes usage metrics, service plans, and customer service interactions, which are key indicators of customer behavior and satisfaction. These variables show how customers use the mobile network, which services they subscribe to, and how much support they require, all of which are closely linked to churn behavior.

+ This dataset was sourced from Kaggle (https://www.kaggle.com/datasets/becksddf/churn-in-telecoms-dataset?resource=download). Kaggle is an online platform for data science and machine learning that allows users to learn, practice, and collaborate using real-world data and problems.

+ This dataset contains 21 columns (20 features plus the churn target) and 3333 customer records.

### 3.2 Target Variable

**churn (Boolean)** – indicates whether a customer has discontinued service with the company.

+ True → the customer has left the service

+ False → the customer has remained with the service

This makes the task a binary classification problem.
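
Before modelling a binary target it is worth checking how the two classes are balanced, since a rare positive class changes how accuracy should be read. The sketch below uses a made-up boolean series in place of the real `churn` column.

```python
import pandas as pd

# Toy target column; values are illustrative, not taken from the dataset.
churn = pd.Series(
    [False, False, False, True, False, False, True, False], name="churn"
)

# normalize=True returns proportions instead of raw counts.
class_share = churn.value_counts(normalize=True)

# Booleans average to the share of True values, i.e. the churn rate.
churn_rate = churn.mean()

print(class_share)
print(f"Churn rate: {churn_rate:.1%}")
```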

### 3.3 Feature Categories

#### A. Customer Usage - These features capture how customers use the service.

+ total day minutes – total daytime call duration

+ total day calls – total number of calls made during the day

+ total day charge – total charges for daytime calls

+ total eve minutes – total evening call duration

+ total eve calls – total number of calls made in the evening

+ total eve charge – total charges for evening calls

+ total night minutes – total night call duration

+ total night calls – total number of calls made at night

+ total night charge – total charges for night calls

+ total intl minutes – total international call duration

+ total intl calls – total number of international calls

+ total intl charge – total charges for international calls

#### B. Service Plans - These features describe the type of services customers subscribe to.

+ international plan – whether the customer has an international calling plan

+ voice mail plan – whether the customer subscribes to a voicemail plan

+ number vmail messages – number of voicemail messages


#### C. Customer Support - This is often a strong indicator of dissatisfaction, as customers typically contact customer service when they experience problems with the service.

+ customer service calls – number of calls the customer made to customer service


#### D. Administrative Features

+ state – the state/region where the customer lives

+ account length – time (in days) the customer has had the account

+ area code – geographic telephone area code

#### E. Identifiers - unique identifier (to be excluded from modeling)

+ phone number – the customer's phone number
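
Excluding the identifier can be sketched as a single `drop` call. The frame below is a toy stand-in with made-up phone numbers, and `model_df` is a hypothetical name.

```python
import pandas as pd

# Toy frame with a few of the dataset's column names; values are made up.
toy = pd.DataFrame({
    "state": ["KS", "OH"],
    "phone number": ["000-0001", "000-0002"],
    "total day minutes": [200.0, 150.0],
    "churn": [False, True],
})

# Drop the identifier: it is unique per customer and carries no general signal.
model_df = toy.drop(columns=["phone number"])

print(model_df.columns.tolist())
```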

# 4.0 Data Preparation and Exploratory Analysis

In [1]:
# We will start by importing all necessary libraries

# For data loading, cleaning and exploratory data analysis
import pandas as pd
import numpy as np

# For visualization
import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns

# For data pre-processing (Encoding and Scaling)
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, StandardScaler

# For data modelling 
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression, LassoCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn import tree

# For feature selection 
from sklearn.feature_selection import RFE, RFECV

# For model evaluation
from sklearn.metrics import confusion_matrix, accuracy_score, roc_curve, auc

#### 4.1 Descriptive Statistics of all Features in the Dataset

In [None]:
# Load the dataset and store it as customer_df

customer_df = pd.read_csv('customer.csv')

# Confirm that the dataset has loaded by displaying the first 5 rows
customer_df.head()


| Unnamed: 0 | state | account length | area code | phone number | international plan | voice mail plan | number vmail messages | total day minutes | total day calls | total day charge | ... | total eve calls | total eve charge | total night minutes | total night calls | total night charge | total intl minutes | total intl calls | total intl charge | customer service calls | churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | KS | 128 | 415 | 382-4657 | no | yes | 25 | 265.1 | 110 | 45.07 | ... | 99 | 16.78 | 244.7 | 91 | 11.01 | 10.0 | 3 | 2.7 | 1 | False |
| 1 | OH | 107 | 415 | 371-7191 | no | yes | 26 | 161.6 | 123 | 27.47 | ... | 103 | 16.62 | 254.4 | 103 | 11.45 | 13.7 | 3 | 3.7 | 1 | False |
| 2 | NJ | 137 | 415 | 358-1921 | no | no | 0 | 243.4 | 114 | 41.38 | ... | 110 | 10.3 | 162.6 | 104 | 7.32 | 12.2 | 5 | 3.29 | 0 | False |
| 3 | OH | 84 | 408 | 375-9999 | yes | no | 0 | 299.4 | 71 | 50.9 | ... | 88 | 5.26 | 196.9 | 89 | 8.86 | 6.6 | 7 | 1.78 | 2 | False |
| 4 | OK | 75 | 415 | 330-6626 | yes | no | 0 | 166.7 | 113 | 28.34 | ... | 122 | 12.61 | 186.9 | 121 | 8.41 | 10.1 | 3 | 2.73 | 3 | False |


#### 4.2 Selected features based on relevance

#### 4.3 Checking for missing data
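
A minimal sketch of a missing-value check with pandas, run here on a toy frame with gaps added deliberately. On the real data the same calls would be applied to `customer_df`.

```python
import numpy as np
import pandas as pd

# Toy frame with deliberately missing entries; values are illustrative.
toy = pd.DataFrame({
    "account length": [128, 107, np.nan, 84],
    "international plan": ["no", None, "no", "yes"],
    "churn": [False, False, True, False],
})

# Count and share of missing values per column.
missing_counts = toy.isna().sum()
missing_share = toy.isna().mean()

print(missing_counts)
print(missing_share)
```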

#### 4.4 Checking for duplicated data
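
A duplicate check could look at fully repeated rows and, separately, at repeated identifiers. The sketch below uses a toy frame with one repeated row, and treating `phone number` as the customer key is an assumption.

```python
import pandas as pd

# Toy frame where the third row repeats the first; values are made up.
toy = pd.DataFrame({
    "phone number": ["000-0001", "000-0002", "000-0001"],
    "account length": [128, 107, 128],
    "churn": [False, True, False],
})

# Rows that repeat an earlier row across every column.
n_duplicate_rows = int(toy.duplicated().sum())

# Rows that reuse an identifier already seen, even if other columns differ.
n_duplicate_ids = int(toy.duplicated(subset=["phone number"]).sum())

# Keep the first occurrence of each fully duplicated row.
deduped = toy.drop_duplicates()

print(n_duplicate_rows, n_duplicate_ids, len(deduped))
```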

#### 4.5 Data Analysis and Visualization
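
One analysis that follows from the feature categories above is the churn rate at each level of customer service calls. The sketch below computes it on toy rows. The values are invented to show the mechanics and say nothing about the real relationship.

```python
import pandas as pd

# Toy rows; values are illustrative and not drawn from the dataset.
toy = pd.DataFrame({
    "customer service calls": [0, 1, 1, 4, 4, 5, 0, 5],
    "churn": [False, False, False, True, False, True, False, True],
})

# The mean of a boolean column within each group is that group's churn rate.
churn_by_calls = toy.groupby("customer service calls")["churn"].mean()

print(churn_by_calls)
```

The resulting series can be passed to a bar plot to make the pattern easier to read.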

In [3]:
# Summary findings

# 5.0 Data Modelling

#### 5.1 Prepare the data for modelling

In [4]:
# Split the data into training and testing datasets




# Divide the categorical from numerical datasets




# Convert the categorical data using LabelEncoder
 



# Scale the numerical data




# Merge the datasets back together to form 1 full dataset
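
The steps outlined in the comments above could be sketched as follows. Two choices here are assumptions and not the notebook's confirmed approach: `OneHotEncoder` is swapped in for `LabelEncoder` (which scikit-learn intends for target labels), and `ColumnTransformer` handles the divide, transform and merge steps in one object. The data is synthetic and only borrows a few of the dataset's column names.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for customer_df; values are random, not real.
rng = np.random.default_rng(0)
n = 40
toy = pd.DataFrame({
    "international plan": rng.choice(["yes", "no"], size=n),
    "voice mail plan": rng.choice(["yes", "no"], size=n),
    "total day minutes": rng.normal(180, 50, size=n),
    "customer service calls": rng.integers(0, 6, size=n),
    "churn": np.tile([False, False, False, True], n // 4),
})

X = toy.drop(columns=["churn"])
y = toy["churn"]

# Split first so the encoder and scaler are fitted on training rows only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

categorical_cols = ["international plan", "voice mail plan"]
numeric_cols = ["total day minutes", "customer service calls"]

# Scale numeric columns, one-hot encode categorical ones, then concatenate.
preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), numeric_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ],
    sparse_threshold=0,  # always return a dense array
)

X_train_prepared = preprocessor.fit_transform(X_train)
X_test_prepared = preprocessor.transform(X_test)

print(X_train_prepared.shape, X_test_prepared.shape)
```

Fitting the transformers on the training split and only applying them to the test split keeps test-set statistics out of the model, which is the main reason to split before encoding and scaling.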



#### 5.2 Create a Baseline Model Using Logistic Regression 
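
A minimal sketch of a logistic regression baseline. It uses `make_classification` to generate an imbalanced synthetic problem in place of the prepared churn features, so the scores it prints say nothing about SyriaTel's data. The class weights and sample size are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the prepared features and churn target.
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=5,
    weights=[0.85, 0.15], random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

baseline = LogisticRegression(max_iter=1000, random_state=42)
baseline.fit(X_train, y_train)

train_acc = accuracy_score(y_train, baseline.predict(X_train))
test_acc = accuracy_score(y_test, baseline.predict(X_test))

# Accuracy of always predicting the majority class, for context.
majority_acc = max(y_test.mean(), 1 - y_test.mean())

print(f"train={train_acc:.3f} test={test_acc:.3f} majority={majority_acc:.3f}")
```

Comparing test accuracy with the majority-class rate shows whether the baseline learns anything beyond the class imbalance.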

#### 5.3 Create a Model Using Decision Tree Classifier
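
A decision tree sketch on synthetic data, with the depth capped so the learned rules stay readable. The `feature_i` names and `max_depth=3` are placeholders, not tuned choices for this project.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in for the prepared churn data.
X, y = make_classification(
    n_samples=1000, n_features=6, n_informative=4,
    weights=[0.85, 0.15], random_state=42,
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# A shallow tree is less likely to overfit and is easier to explain.
tree_clf = DecisionTreeClassifier(max_depth=3, random_state=42)
tree_clf.fit(X_train, y_train)

# Text rendering of the learned split rules.
rules = export_text(tree_clf, feature_names=feature_names)
print(rules)
print(f"test accuracy: {tree_clf.score(X_test, y_test):.3f}")
```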

#### 5.4 Create a Model Using Random Forest Classifier
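
A random forest sketch on synthetic data. It returns a churn probability per customer, which supports ranking customers by risk. `n_estimators=200` and `class_weight="balanced"` are assumed settings for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared churn data.
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=5,
    weights=[0.85, 0.15], random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# class_weight="balanced" upweights the rarer class during training.
forest = RandomForestClassifier(
    n_estimators=200, class_weight="balanced", random_state=42
)
forest.fit(X_train, y_train)

# Probability of the positive (churn) class for each test row.
churn_probability = forest.predict_proba(X_test)[:, 1]

# Indices of the ten test rows with the highest predicted risk.
highest_risk = np.argsort(churn_probability)[::-1][:10]
print(highest_risk)
```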

# 6.0 Evaluation and Predictive Findings

In [5]:
#Function to run evaluations



#For Logistic Regression



#For Decision Tree Classifier



#For Random Forest Classifier
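
One possible shape for the evaluation function, using the metrics imported at the top of the notebook. `evaluate_model` is a hypothetical helper name, and the demonstration fits a logistic regression on synthetic data only so that the function has something to score.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, auc, confusion_matrix, roc_curve
from sklearn.model_selection import train_test_split


def evaluate_model(model, X_test, y_test):
    """Return accuracy, confusion matrix and ROC AUC for a fitted binary classifier."""
    y_pred = model.predict(X_test)
    y_score = model.predict_proba(X_test)[:, 1]  # probability of the positive class
    fpr, tpr, _ = roc_curve(y_test, y_score)
    return {
        "accuracy": accuracy_score(y_test, y_pred),
        "confusion_matrix": confusion_matrix(y_test, y_pred),
        "roc_auc": auc(fpr, tpr),
    }


# Synthetic data and a throwaway model, used only to exercise the function.
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=4,
    weights=[0.85, 0.15], random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
demo_model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

results = evaluate_model(demo_model, X_test, y_test)
print(results)
```

The same function can be called once per fitted model so that all three classifiers are compared on identical metrics.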

#### 6.1 Selecting the best features from the best model
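
If a tree-based model turns out to be the best performer, which is an assumption at this stage, its impurity-based importances give a quick ranking of features. The sketch below uses synthetic data with placeholder `feature_i` names.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the prepared churn features.
X_arr, y = make_classification(
    n_samples=500, n_features=6, n_informative=3, random_state=42
)
feature_names = [f"feature_{i}" for i in range(X_arr.shape[1])]
X = pd.DataFrame(X_arr, columns=feature_names)

forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# Importances sum to 1; higher values mean the feature drove more splits.
importances = pd.Series(
    forest.feature_importances_, index=feature_names
).sort_values(ascending=False)

top_features = importances.head(3)
print(top_features)
```

Impurity-based importances can favour features with many distinct values, so the ranking is best treated as a starting point and not a final answer.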

# 7.0 Predictive Recommendation

# 8.0 Limitations of the Data Set

# 9.0 Suggestions for further analysis