# Predictive Churn Analysis for Telecom Services Using Machine Learning

## Project Description

The project focuses on creating a predictive model that identifies which customers are likely to keep or discontinue their telecom operator's services. This entails an initial analysis of datasets covering contract details, personal data, and internet and phone service usage. A thorough exploratory data analysis will uncover trends and guide feature engineering. Emphasis will be on one-hot encoding for categorical variables and on devising new features that reflect customer behavior.

Using boosting algorithms, which are well suited to binary classification, the project will fine-tune the predictive model through hyperparameter optimization. The primary performance metric, the AUC-ROC score, will gauge how well the model separates likely churners from loyal customers; the target is a score of 0.88 or above.

The project aims to equip the telecom operator with the ability to identify customers who are likely to churn and to proactively send them targeted promotions and retention plan options.

## Interconnect's services
Interconnect mainly provides two types of services:
1. Landline communication. The telephone can be connected to several lines simultaneously.
2. Internet. The network can be set up via a telephone line (DSL, digital subscriber line) or through a fiber optic cable.

Some other services the company provides include:
- Internet security: antivirus software (DeviceProtection) and a malicious website blocker (OnlineSecurity)
- A dedicated technical support line (TechSupport)
- Cloud file storage and data backup (OnlineBackup)
- TV streaming (StreamingTV) and a movie directory (StreamingMovies)

The clients can choose either a monthly payment or sign a 1- or 2-year contract. They can use various payment methods and receive an electronic invoice after a transaction.

## Data Description

The data consists of files obtained from different sources:
- contract.csv — contract information
- personal.csv — the client's personal data
- internet.csv — information about Internet services
- phone.csv — information about telephone services

In each file, the column customerID contains a unique code assigned to each client.

## Import Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, train_test_split
from sklearn.metrics import roc_auc_score, accuracy_score
from xgboost import XGBClassifier


pd.options.display.max_columns = None
pd.options.display.max_rows = 100
%matplotlib inline

The dataset previews below are placeholders: I have not seen the datasets themselves, only what was shown in the project outline video.

In [3]:
# Load the datasets
contract_df = pd.read_csv('contract.csv')
personal_df = pd.read_csv('personal.csv')
internet_df = pd.read_csv('internet.csv')
phone_df = pd.read_csv('phone.csv')

contract_df.head()

Unnamed: 0,customerID,BeginDate,EndDate,Type,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges
0,7590-VHVEG,2020-01-01,No,Month-to-month,Yes,Electronic check,29.85,29.85
1,5575-GNVDE,2017-04-01,No,One year,No,Mailed check,56.95,1889.5
2,3668-QPYBK,2019-10-01,2019-12-01 00:00:00,Month-to-month,Yes,Mailed check,53.85,108.15
3,7795-CFOCW,2016-05-01,No,One year,No,Bank transfer (automatic),42.3,1840.75
4,9237-HQITU,2019-09-01,2019-11-01 00:00:00,Month-to-month,Yes,Electronic check,70.7,151.65


## Data Merging

In [5]:
# Merging the datasets on 'customerID'
df_merged = contract_df.merge(personal_df, on='customerID', how='left')
df_merged = df_merged.merge(internet_df, on='customerID', how='left')
df_merged = df_merged.merge(phone_df, on='customerID', how='left')

df_merged.head()

Unnamed: 0,customerID,BeginDate,EndDate,Type,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,gender,SeniorCitizen,Partner,Dependents,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,MultipleLines
0,7590-VHVEG,2020-01-01,No,Month-to-month,Yes,Electronic check,29.85,29.85,Female,0,Yes,No,DSL,No,Yes,No,No,No,No,
1,5575-GNVDE,2017-04-01,No,One year,No,Mailed check,56.95,1889.5,Male,0,No,No,DSL,Yes,No,Yes,No,No,No,No
2,3668-QPYBK,2019-10-01,2019-12-01 00:00:00,Month-to-month,Yes,Mailed check,53.85,108.15,Male,0,No,No,DSL,Yes,Yes,No,No,No,No,No
3,7795-CFOCW,2016-05-01,No,One year,No,Bank transfer (automatic),42.3,1840.75,Male,0,No,No,DSL,Yes,No,Yes,Yes,No,No,
4,9237-HQITU,2019-09-01,2019-11-01 00:00:00,Month-to-month,Yes,Electronic check,70.7,151.65,Female,0,No,No,Fiber optic,No,No,No,No,No,No,No


In [None]:
df_merged.info()
df_merged.isnull().mean() * 100

1. Initial Data Exploration
- Import the datasets and combine them using a common column (customerID).
- Examine the features and data types, and identify null or missing values, to understand the data.
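
The merge-and-inspect step above can be sketched on a toy example. This is a minimal illustration, not the project's actual data: the two small frames are made up, and `validate` and `indicator` are optional `merge` arguments added here to check that customerID is unique and to show which rows found no match.

```python
import pandas as pd

# Toy stand-ins for contract.csv and phone.csv (illustrative values only)
contract = pd.DataFrame({
    "customerID": ["7590-VHVEG", "5575-GNVDE", "3668-QPYBK"],
    "Type": ["Month-to-month", "One year", "Month-to-month"],
})
phone = pd.DataFrame({
    "customerID": ["5575-GNVDE", "3668-QPYBK"],
    "MultipleLines": ["No", "No"],
})

# validate raises an error if customerID is duplicated on either side;
# indicator adds a _merge column showing where each row came from
merged = contract.merge(
    phone, on="customerID", how="left", validate="one_to_one", indicator=True
)

# Share of missing values per column, in percent
missing_share = merged.isna().mean() * 100
print(merged)
print(missing_share)
```

A customer who is absent from `phone` keeps their contract row but gets a missing MultipleLines value, which is why a left merge introduces nulls that need handling later.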

## Data Preprocessing

2. Data Cleaning and Preprocessing
- Handle missing values.
- Convert data types as needed (for example, converting dates to a datetime format).
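
One possible shape for these cleaning steps is sketched below on toy rows. Several things here are assumptions rather than facts about the real files: that `'No'` in EndDate means the contract is still active, that a customer missing from a service file does not have that service, and that TotalCharges might contain blank strings (a hypothetical edge case included only to show the coercion).

```python
import pandas as pd

# Toy rows mimicking the merged frame (illustrative values only)
df = pd.DataFrame({
    "customerID": ["7590-VHVEG", "5575-GNVDE", "3668-QPYBK"],
    "BeginDate": ["2020-01-01", "2017-04-01", "2019-10-01"],
    "EndDate": ["No", "No", "2019-12-01 00:00:00"],
    "TotalCharges": ["29.85", " ", "108.15"],  # blank string: hypothetical edge case
    "OnlineSecurity": ["No", "Yes", None],     # missing after a left merge
    "MultipleLines": [None, "No", "No"],
})

# Dates: mask the 'No' marker first so only real dates are parsed (active contracts become NaT)
df["BeginDate"] = pd.to_datetime(df["BeginDate"])
df["EndDate"] = pd.to_datetime(df["EndDate"].where(df["EndDate"] != "No"))

# Numbers: values that cannot be parsed become NaN instead of raising
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")

# Services: assume a missing value means the customer does not have the service
service_cols = ["OnlineSecurity", "MultipleLines"]
df[service_cols] = df[service_cols].fillna("No")

print(df.dtypes)
print(df.isna().sum())
```

If the target variable is derived from EndDate, it should be computed before or alongside this conversion, since the `'No'` marker is replaced by NaT here.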

## Exploratory Data Analysis

3. Exploratory Data Analysis (EDA)
- Visualize data to uncover patterns, trends, and relationships.
- Investigate how the target variable (churn) relates to the other features.
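
As a rough sketch of the second point, churn can be compared across groups with a simple `groupby`. The rows below are illustrative, and the `Churn` column is derived under the assumption that `'No'` in EndDate means the customer is still active.

```python
import pandas as pd

# Toy rows (illustrative values only)
df = pd.DataFrame({
    "Type": ["Month-to-month", "One year", "Month-to-month",
             "One year", "Month-to-month", "Two year"],
    "EndDate": ["No", "No", "2019-12-01 00:00:00",
                "No", "2019-11-01 00:00:00", "No"],
    "MonthlyCharges": [29.85, 56.95, 53.85, 42.30, 70.70, 20.00],
})

# Assumed target definition: 1 if the contract has an end date, else 0
df["Churn"] = (df["EndDate"] != "No").astype(int)

# Churn rate per contract type (the mean of a 0/1 column is a rate)
churn_by_type = df.groupby("Type")["Churn"].mean().sort_values(ascending=False)

# Typical monthly charge for churned vs. retained customers
charges_by_churn = df.groupby("Churn")["MonthlyCharges"].median()

print(churn_by_type)
print(charges_by_churn)
```

The same pattern applies to any categorical feature, such as PaymentMethod or InternetService.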

In [6]:
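# Data Visualizations
# 'Churn' is not a column in the source files; derive it from EndDate
# (assumption: 'No' means the contract is still active)
df_merged['Churn'] = (df_merged['EndDate'] != 'No').astype(int)

plt.figure(figsize=(10, 6))
sns.countplot(x='Churn', data=df_merged)
plt.title('Distribution of Churn')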

## Feature Engineering

4. Feature Engineering
- Engineer new features that could be relevant for predicting customer churn, such as aggregating multiple variables into a single metric (e.g., total customer value).
- Prepare categorical variables for modeling by applying one-hot encoding to transform them into a binary matrix, and ensure the target variable is encoded as 0 and 1.
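
A minimal sketch of both points follows. The `TenureDays` feature and the `SNAPSHOT_DATE` cut-off are hypothetical choices made for illustration, not part of the original plan, and the target again assumes that `'No'` in EndDate marks an active customer.

```python
import pandas as pd

SNAPSHOT_DATE = pd.Timestamp("2020-02-01")  # hypothetical data cut-off date

# Toy rows (illustrative values only)
df = pd.DataFrame({
    "BeginDate": pd.to_datetime(["2020-01-01", "2017-04-01", "2019-10-01"]),
    "EndDate": ["No", "No", "2019-12-01 00:00:00"],
    "Type": ["Month-to-month", "One year", "Month-to-month"],
    "PaperlessBilling": ["Yes", "No", "Yes"],
    "MonthlyCharges": [29.85, 56.95, 53.85],
})

# Target encoded as 0/1
df["Churn"] = (df["EndDate"] != "No").astype(int)

# New feature: contract length in days, measured to the end date
# or to the snapshot date for customers who are still active
end = pd.to_datetime(df["EndDate"].where(df["EndDate"] != "No")).fillna(SNAPSHOT_DATE)
df["TenureDays"] = (end - df["BeginDate"]).dt.days

# One-hot encode the categoricals; the raw dates are dropped because
# EndDate would leak the target directly
y = df["Churn"]
X = pd.get_dummies(
    df.drop(columns=["Churn", "BeginDate", "EndDate"]),
    columns=["Type", "PaperlessBilling"],
    drop_first=True,
    dtype=int,
)
print(X)
```

`drop_first=True` removes one redundant column per feature; tree-based boosting models work with or without it, so this is a matter of preference.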

## Model Training with Boosting Algorithms

5. Model Training with Boosting Algorithms
- Split the data into training and testing sets to prepare for the model training phase.
- Use a boosting algorithm, such as XGBoost, that can handle imbalanced datasets and various feature types.
- Train the boosting model to establish a baseline.
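
The baseline step could look roughly like the sketch below. It runs on synthetic data from `make_classification`, since the real features are not available here, and it uses scikit-learn's `GradientBoostingClassifier` as a stand-in boosting model. `XGBClassifier` follows the same scikit-learn estimator interface (`fit`, `predict_proba`), so it should be possible to swap it in.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, mildly imbalanced data standing in for the engineered features
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=5,
    weights=[0.75, 0.25], random_state=42,
)

# stratify=y keeps the churn share the same in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Baseline model with untuned, commonly used settings
model = GradientBoostingClassifier(
    n_estimators=100, max_depth=3, learning_rate=0.1, random_state=42
)
model.fit(X_train, y_train)

# AUC-ROC needs the predicted probability of the positive class, not hard labels
proba = model.predict_proba(X_test)[:, 1]
baseline_auc = roc_auc_score(y_test, proba)
print(f"Baseline AUC-ROC: {baseline_auc:.3f}")
```

The baseline score gives a reference point against which any later tuning can be judged.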

## Model Tuning and Validation

6. Model Tuning and Validation
- Use cross-validation to assess the model’s performance.
- Optimize the model by tuning hyperparameters to improve prediction accuracy, using methods such as grid search or random search.
- Evaluate the model’s performance using appropriate metrics, such as AUC-ROC and Accuracy.
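
A sketch of cross-validated random search is shown below, again on synthetic data and with scikit-learn's `GradientBoostingClassifier` standing in for the boosting model. The parameter ranges are illustrative guesses, not recommended values.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import (
    RandomizedSearchCV, StratifiedKFold, train_test_split,
)

# Synthetic data standing in for the engineered features
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=5,
    weights=[0.75, 0.25], random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Illustrative search space (values are assumptions)
param_distributions = {
    "n_estimators": [50, 100, 150],
    "max_depth": [2, 3, 4],
    "learning_rate": [0.03, 0.1, 0.3],
}

# Stratified folds preserve the class balance in every fold
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_distributions,
    n_iter=5,            # number of sampled parameter combinations
    scoring="roc_auc",   # select on the project's primary metric
    cv=cv,
    random_state=42,
)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated AUC-ROC: {search.best_score_:.3f}")

# With the default refit=True, best_estimator_ is retrained on all training data
test_auc = roc_auc_score(y_test, search.best_estimator_.predict_proba(X_test)[:, 1])
print(f"Held-out AUC-ROC: {test_auc:.3f}")
```

`GridSearchCV` takes the same arguments apart from `n_iter` and tries every combination, which is more thorough but slower on larger grids.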

## Final Model Training and Evaluation

7. Final Model Training and Prediction
- Train the final model on the full training dataset using the best hyperparameters found during tuning.
- Make predictions on the testing set and evaluate the model's performance using the AUC-ROC and accuracy metrics.
- Interpret the results for churn prediction.
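
The two metrics consume different inputs, which matters when interpreting them. The small hand-made example below (labels and probabilities are invented for illustration) shows that AUC-ROC is computed from predicted probabilities, while accuracy is computed from labels obtained by applying a threshold, here an assumed cut-off of 0.5.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Invented true labels (1 = churned) and predicted churn probabilities
y_true = np.array([0, 0, 0, 1, 1, 0])
y_proba = np.array([0.1, 0.3, 0.6, 0.7, 0.4, 0.2])

# AUC-ROC: share of (churner, non-churner) pairs that are ranked correctly
auc = roc_auc_score(y_true, y_proba)

# Accuracy: share of correct labels after thresholding at 0.5 (assumed cut-off)
y_pred = (y_proba >= 0.5).astype(int)
acc = accuracy_score(y_true, y_pred)

print(f"AUC-ROC:  {auc:.3f}")   # 7 of 8 pairs ranked correctly -> 0.875
print(f"Accuracy: {acc:.3f}")   # 4 of 6 labels correct -> 0.667
```

Because AUC-ROC depends only on ranking, it is unaffected by the choice of threshold, whereas accuracy can look high on imbalanced data even when few churners are caught.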

## Project Conclusion

Summarize the project workflow, highlighting key findings and suggestions for customer retention, such as promotional offers and customer retention programs.