# **SYRIAN TELECOM CUSTOMER CHURN ANALYTICS**

## **BUSINESS UNDERSTANDING** 



### **OVERVIEW AND BUSINESS UNDERSTANDING**
  In a world where means of communication are evolving rapidly, it is important for a company such as ours to keep up to date with our customers, as they are the true heroes of our journey. That is why we become concerned when customers stop using our products, a phenomenon known as churn: it reduces our market share and revenue and can even damage our brand name.

  To address this, our company decided to take a proactive, data-driven approach to find out why our customers have decided to stop using our products. We believe this is the right step toward resolving the issues that affect our customers and making a difference.

  Key Question for this project: Are there predictable patterns in customer behavior and account details that signal an impending decision to churn?

### **CHALLENGES**

***Predicting churn presents several challenges, including:***

**Volatile Market:** The telecom market is saturated, with competitors constantly offering aggressive promotions, making customer loyalty fragile.

**Multi-Dimensionality:** Churn is often influenced by multiple interacting factors, such as service quality, pricing, contract terms, technical support experiences, and specific calling/data habits, making it difficult to isolate the key drivers.

**Class Imbalance:** In a stable business, the number of customers who churn is typically small compared to those who remain loyal. This class imbalance can make training an accurate predictive model difficult.
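
As a minimal sketch of how class imbalance can be measured, the snippet below computes the share of each class in a churn column. The labels are toy values invented for illustration (17 loyal customers and 3 churners), not figures from the SyriaTel data.

```python
import pandas as pd

# Toy labels invented for illustration: 17 loyal customers (0) and 3 churners (1).
toy_churn = pd.Series([0] * 17 + [1] * 3, name="churn")

# normalize=True returns proportions instead of raw counts.
class_share = toy_churn.value_counts(normalize=True)
print(class_share)
```

A share far below 0.5 for the churn class is the signal that accuracy alone may be a misleading metric.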

### **PROPOSED SOLUTION**

*Our approach will involve predictive modeling on the SyriaTel dataset:*

**Performing Exploratory Data Analysis (EDA):** This helps to understand the distribution of customer features and identify initial correlations with churn.

**Applying Machine Learning techniques:** This builds a classifier that predicts the probability of an individual customer churning.

### **PROBLEM STATEMENT**
The company needs a reliable, data-driven system to predict which current customers are most likely to discontinue their service with SyriaTel in the near future. Specifically, which customer characteristics, usage patterns, or account details are the strongest predictors of churn, and how can a classification model be built to identify these "at-risk" customers?

### **OBJECTIVES**
To explore and summarize customer data to identify differences between churning and non-churning groups.

To train a classification model to accurately predict customer churn.

To provide actionable insights and a prioritized list of at-risk customers for the marketing and retention teams.
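
A prioritized list of at-risk customers could be produced roughly as sketched below, by sorting customers on the predicted churn probability. The data frame is a toy example invented for illustration, and the shallow `max_depth=1` tree is an assumption made so that the probabilities are easy to read; neither is taken from the analysis in this notebook.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Toy data invented for illustration only; not taken from the SyriaTel dataset.
toy = pd.DataFrame({
    "customer service calls": [0, 1, 1, 2, 3, 4, 4, 5, 5, 6],
    "total day minutes": [120.0, 150.0, 180.0, 200.0, 160.0,
                          300.0, 305.0, 310.0, 290.0, 320.0],
    "churn": [0, 0, 0, 0, 0, 1, 0, 1, 1, 1],
})

features = toy.drop(columns="churn")
target = toy["churn"]

# A shallow tree: each leaf's probability is the fraction of churners in that leaf.
model = DecisionTreeClassifier(max_depth=1, random_state=42)
model.fit(features, target)

# Column 1 of predict_proba is the estimated probability of the positive class (churn = 1).
ranked = (
    toy.assign(churn_probability=model.predict_proba(features)[:, 1])
       .sort_values("churn_probability", ascending=False)
)
print(ranked)
```

In practice the scores would be computed for current customers the model has not seen, and the top of the sorted list handed to the retention team.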

### **BRIEF CONCLUSION**
By successfully building and deploying this predictive churn model, we can shift from a reactive to a proactive retention strategy.

## **DATA UNDERSTANDING**
The goal in this step is to understand our dataset: its structure and content. We will also select the columns that will help us build models to address this issue effectively.

In [31]:
# We begin by loading the dataset and the necessary libraries; more libraries will be loaded as we proceed.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report






In [32]:
df=pd.read_csv('bigml_a.csv')
df.head()

Unnamed: 0,state,account length,area code,phone number,international plan,voice mail plan,number vmail messages,total day minutes,total day calls,total day charge,...,total eve calls,total eve charge,total night minutes,total night calls,total night charge,total intl minutes,total intl calls,total intl charge,customer service calls,churn
0,KS,128,415,382-4657,no,yes,25,265.1,110,45.07,...,99,16.78,244.7,91,11.01,10.0,3,2.7,1,False
1,OH,107,415,371-7191,no,yes,26,161.6,123,27.47,...,103,16.62,254.4,103,11.45,13.7,3,3.7,1,False
2,NJ,137,415,358-1921,no,no,0,243.4,114,41.38,...,110,10.3,162.6,104,7.32,12.2,5,3.29,0,False
3,OH,84,408,375-9999,yes,no,0,299.4,71,50.9,...,88,5.26,196.9,89,8.86,6.6,7,1.78,2,False
4,OK,75,415,330-6626,yes,no,0,166.7,113,28.34,...,122,12.61,186.9,121,8.41,10.1,3,2.73,3,False


In [33]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3333 entries, 0 to 3332
Data columns (total 21 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   state                   3333 non-null   object 
 1   account length          3333 non-null   int64  
 2   area code               3333 non-null   int64  
 3   phone number            3333 non-null   object 
 4   international plan      3333 non-null   object 
 5   voice mail plan         3333 non-null   object 
 6   number vmail messages   3333 non-null   int64  
 7   total day minutes       3333 non-null   float64
 8   total day calls         3333 non-null   int64  
 9   total day charge        3333 non-null   float64
 10  total eve minutes       3333 non-null   float64
 11  total eve calls         3333 non-null   int64  
 12  total eve charge        3333 non-null   float64
 13  total night minutes     3333 non-null   float64
 14  total night calls       3333 non-null   

In [34]:
df.duplicated().sum()

np.int64(0)

### **Based on our review of the data, we understood the following:**

***We have records for 3333 customers across 21 columns***

***We have no missing values in our data***

***We have a mixture of datatypes: floats, integers, and objects***

***Our dataset also does not have duplicate rows***

## **DATA PREPARATION AND ANALYSIS**

***For this project we have to define which columns we will keep and which ones we will drop, as described below:***

We will drop the following columns: **state, area code, phone number, and number vmail messages**. Phone number is a unique identifier, and the other three are treated as redundant for this analysis.

***The remaining columns, except churn (our target variable y), will be our independent variables X.***
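
One way to check whether a column is redundant before dropping it is to look at how closely it tracks another column. The sketch below does this for `total day minutes` and `total day charge`, using only the five rows visible in the `df.head()` output above; it is an illustrative check under that assumption, not a step performed in this notebook, and five rows are too few to draw conclusions about the full dataset.

```python
import pandas as pd

# First five rows of two columns, copied from the df.head() output above.
sample = pd.DataFrame({
    "total day minutes": [265.1, 161.6, 243.4, 299.4, 166.7],
    "total day charge": [45.07, 27.47, 41.38, 50.90, 28.34],
})

# Pearson correlation between the two columns; a value near 1 suggests redundancy.
corr = sample["total day minutes"].corr(sample["total day charge"])

# Ratio of charge to minutes per row; a constant ratio points to a fixed per-minute rate.
rate = sample["total day charge"] / sample["total day minutes"]

print(f"correlation: {corr:.4f}")
print(rate.round(3).tolist())
```

The same comparison can be run on the full `df` to decide whether any of the minutes and charge pairs carry duplicate information.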

In [35]:
# To drop those columns we use the drop method.
# Columns dropped: state, area code, phone number, number vmail messages
df_new = df.drop(['state', 'area code', 'phone number', 'number vmail messages'], axis=1)
df_new.head()

Unnamed: 0,account length,international plan,voice mail plan,total day minutes,total day calls,total day charge,total eve minutes,total eve calls,total eve charge,total night minutes,total night calls,total night charge,total intl minutes,total intl calls,total intl charge,customer service calls,churn
0,128,no,yes,265.1,110,45.07,197.4,99,16.78,244.7,91,11.01,10.0,3,2.7,1,False
1,107,no,yes,161.6,123,27.47,195.5,103,16.62,254.4,103,11.45,13.7,3,3.7,1,False
2,137,no,no,243.4,114,41.38,121.2,110,10.3,162.6,104,7.32,12.2,5,3.29,0,False
3,84,yes,no,299.4,71,50.9,61.9,88,5.26,196.9,89,8.86,6.6,7,1.78,2,False
4,75,yes,no,166.7,113,28.34,148.3,122,12.61,186.9,121,8.41,10.1,3,2.73,3,False


In [36]:
df_new.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3333 entries, 0 to 3332
Data columns (total 17 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   account length          3333 non-null   int64  
 1   international plan      3333 non-null   object 
 2   voice mail plan         3333 non-null   object 
 3   total day minutes       3333 non-null   float64
 4   total day calls         3333 non-null   int64  
 5   total day charge        3333 non-null   float64
 6   total eve minutes       3333 non-null   float64
 7   total eve calls         3333 non-null   int64  
 8   total eve charge        3333 non-null   float64
 9   total night minutes     3333 non-null   float64
 10  total night calls       3333 non-null   int64  
 11  total night charge      3333 non-null   float64
 12  total intl minutes      3333 non-null   float64
 13  total intl calls        3333 non-null   int64  
 14  total intl charge       3333 non-null   

A few of our columns are categorical, so we have to convert them into numeric form for the model to use them. This will be done via a method known as *one hot encoding*, as shown below.

For our churn column, we will convert it to the integer datatype; the sklearn library can handle the target variable directly as either booleans or integers.

In [37]:
# OneHotEncoder was imported earlier; here we encode the categorical columns
ohe = OneHotEncoder()
categorical_cols = ['international plan', 'voice mail plan']
# fit_transform returns a sparse matrix, so we convert it to a dense array
encoded_data = ohe.fit_transform(df_new[categorical_cols]).toarray()
encoded_df = pd.DataFrame(encoded_data, columns=ohe.get_feature_names_out(categorical_cols))
encoded_df.head()


Unnamed: 0,international plan_no,international plan_yes,voice mail plan_no,voice mail plan_yes
0,1.0,0.0,0.0,1.0
1,1.0,0.0,0.0,1.0
2,1.0,0.0,1.0,0.0
3,0.0,1.0,1.0,0.0
4,0.0,1.0,1.0,0.0


In [38]:
# Drop the original categorical columns, since the encoded columns replace them
df_new = df_new.drop(['international plan', 'voice mail plan'], axis=1)
df_new = pd.concat([df_new, encoded_df], axis=1)



In [39]:
df_new['churn']=df_new['churn'].astype(int)
df_new.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3333 entries, 0 to 3332
Data columns (total 19 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   account length          3333 non-null   int64  
 1   total day minutes       3333 non-null   float64
 2   total day calls         3333 non-null   int64  
 3   total day charge        3333 non-null   float64
 4   total eve minutes       3333 non-null   float64
 5   total eve calls         3333 non-null   int64  
 6   total eve charge        3333 non-null   float64
 7   total night minutes     3333 non-null   float64
 8   total night calls       3333 non-null   int64  
 9   total night charge      3333 non-null   float64
 10  total intl minutes      3333 non-null   float64
 11  total intl calls        3333 non-null   int64  
 12  total intl charge       3333 non-null   float64
 13  customer service calls  3333 non-null   int64  
 14  churn                   3333 non-null   

In [40]:
# next we separate X and y
X = df_new.drop('churn', axis=1)
y = df_new['churn']

## **MODELING**

**Now we can begin building our baseline model**

In [41]:
# our columns are already defined and libraries already imported
# we first split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
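
Given the class imbalance noted earlier, one option is a stratified split, which keeps the churn proportion the same in the training and test sets. The sketch below shows the idea on toy data invented for illustration (100 customers, 15 churners); the names `toy_X`, `toy_y`, `X_tr`, `X_te`, `y_tr`, and `y_te` are hypothetical, and this is a variation on the split above rather than the one used for the reported results.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data invented for illustration: 100 customers, 15 of whom churned.
toy_X = pd.DataFrame({"customer service calls": np.arange(100) % 7})
toy_y = pd.Series([1] * 15 + [0] * 85, name="churn")

# stratify=toy_y asks the splitter to preserve the churn proportion in both sets.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)

# Both means should match the overall churn rate of the toy data.
print(y_tr.mean(), y_te.mean())
```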



In [42]:
# Now we build our baseline model using a decision tree (CART algorithm)
baseline_tree=DecisionTreeClassifier(random_state=42)
baseline_tree.fit(X_train,y_train)
y_pred=baseline_tree.predict(X_test)
print(classification_report(y_test,y_pred))


              precision    recall  f1-score   support

           0       0.96      0.95      0.95       566
           1       0.73      0.75      0.74       101

    accuracy                           0.92       667
   macro avg       0.84      0.85      0.85       667
weighted avg       0.92      0.92      0.92       667
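
To put the 0.92 accuracy in context, the sketch below uses only the per-class numbers printed in the report above. It computes the accuracy a trivial model would reach by always predicting the majority class, and shows how the weighted average recall is built from the per-class rows. Because the inputs are already rounded to two decimals, the results are approximate.

```python
# Per-class support and recall, copied from the classification report above.
support = {0: 566, 1: 101}
recall = {0: 0.95, 1: 0.75}

total = sum(support.values())

# Accuracy of a trivial model that always predicts the majority class (no churn).
majority_accuracy = max(support.values()) / total

# Weighted average recall: each class contributes in proportion to its support.
weighted_recall = sum(recall[c] * support[c] for c in support) / total

print(f"majority-class accuracy: {majority_accuracy:.2f}")
print(f"weighted average recall: {weighted_recall:.2f}")
```

Comparing the model's accuracy with the majority-class figure, and looking at recall for class 1 specifically, gives a fairer view of how well churners are being identified than accuracy alone.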

