This is the 1st Alura Data Science Challenge about Telco Churn classification. The goal is to predict customer Churn with a classification model.
The dataset is in JSON format and is available at the following link:
https://raw.githubusercontent.com/stephaniaslis/DS_Challenge_Telco/main/Telco-Customer-Churn.json
Data cleaning was performed through the following steps:
- Dropping useless columns
- Data type definition
- Null values
- Missing analysis (%)
- Deciding whether to fill missing values
- Duplicated columns
- Duplicated rows
- Constant columns
- Feature analysis
- Outlier analysis
- Data export
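The steps above can be sketched in pandas. This is a minimal illustration on a toy frame (the column names `customerID`, `tenure`, `Charges.Total`, and `Churn` follow the Telco dataset; the values are made up):

```python
import pandas as pd

# Toy frame standing in for the Telco data (the real columns differ).
df = pd.DataFrame({
    "customerID": ["001", "002", "002", "003"],
    "tenure": [1, 34, 34, 2],
    "Charges.Total": ["29.85", "1889.5", "1889.5", " "],  # blanks hide as spaces
    "Churn": ["No", "No", "No", "Yes"],
})

# Drop duplicated rows, then ID-like columns that carry no signal.
df = df.drop_duplicates().drop(columns=["customerID"])

# Data type definition: blank strings become NaN, then numeric.
df["Charges.Total"] = pd.to_numeric(df["Charges.Total"], errors="coerce")

# Missing analysis (%) per column, then a simple fill decision.
missing_pct = df.isna().mean() * 100
df["Charges.Total"] = df["Charges.Total"].fillna(0)

# Drop constant columns (zero variance adds nothing to a model).
df = df.loc[:, df.nunique() > 1]
```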
The notebook is available here:
The EDA was made using pandas profiling, which generates an interactive report in HTML format:
https://github.com/stephaniaslis/DS_Challenge_Telco/blob/main/report_telco.html
PS: the HTML report does not render on GitHub; it must be downloaded and opened locally.
Separate reports were created for Churn 0 (negative Churn) and Churn 1 (positive Churn):
https://github.com/stephaniaslis/DS_Challenge_Telco/blob/main/report_telco_churn_0.html
https://github.com/stephaniaslis/DS_Challenge_Telco/blob/main/report_telco_churn_1.html
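The per-class reports come from splitting the frame on the target and profiling each subset. A sketch of the split on a toy frame (the `ProfileReport` calls are commented out so the snippet runs without pandas-profiling installed):

```python
import pandas as pd

# Toy frame; the real data has many more columns.
df = pd.DataFrame({
    "tenure": [1, 34, 2, 45, 8],
    "Churn": ["No", "No", "Yes", "No", "Yes"],
})

# One subset per target value, mirroring report_telco_churn_0/1.html.
churn_0 = df[df["Churn"] == "No"]
churn_1 = df[df["Churn"] == "Yes"]

# With pandas-profiling installed, each subset gets its own report:
# from pandas_profiling import ProfileReport
# ProfileReport(churn_1, title="Churn 1").to_file("report_telco_churn_1.html")
```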
- The target (Churn) is imbalanced: the churn rate is 27%
- The gender distribution is roughly 50/50 between male and female
- There are more customers:
- under 65 years old
- with phone service
- without dependents
- with a monthly contract
- Customer tenure is highly correlated with Account charges total
- Most Churn 0 customers:
- are under 65 years old
- don't have dependents
- have phone service
- The most common contract is month-to-month
- Customer tenure (average) : 37 months
- Account charges monthly (average): 61.30 US$
- Customer tenure is highly correlated with Account charges total
- Most Churn 1 customers:
- are under 65 years old
- don't have a partner
- don't have dependents
- have phone service
- have fiber optic internet
- have internet Online Security
- have internet Online Backup
- have internet Device Protection
- have internet Tech Support
- have a monthly contract
- have Paperless Billing
- Customer tenure (average) : 17 months
- Account charges monthly (average): 74.44 US$
- Customer tenure is highly correlated with Account charges total
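The per-class averages and correlations above come from grouping by the target. A pandas sketch with toy numbers (not the real dataset's values):

```python
import pandas as pd

# Toy data: churners (1) have shorter tenure and higher monthly charges.
df = pd.DataFrame({
    "tenure": [40, 34, 12, 45, 22],
    "Charges.Monthly": [55.0, 60.0, 80.0, 58.0, 70.0],
    "Churn": [0, 0, 1, 0, 1],
})

# Average tenure and monthly charges per churn class.
summary = df.groupby("Churn")[["tenure", "Charges.Monthly"]].mean()

# Correlation between tenure and monthly charges, per class.
corr = df.groupby("Churn")[["tenure", "Charges.Monthly"]].apply(
    lambda g: g["tenure"].corr(g["Charges.Monthly"])
)
```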
The notebook is available here:
For feature selection, a majority-voting method was used, combining 3 selection approaches:
- Statistical test:
- ANOVA for numeric features
- Chi-squared for categorical features
- RFECV
- Boruta
Features kept by at least 2 of the selection approaches will be used in the modeling process.
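The voting scheme can be sketched with scikit-learn. Boruta lives in a separate package (`boruta`), so this sketch tallies votes from the two sklearn-based selectors only; the counting logic extends to a third vote unchanged:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

# Toy data standing in for the Telco features.
X, y = make_classification(n_samples=200, n_features=8,
                           n_informative=3, random_state=0)

votes = np.zeros(X.shape[1], dtype=int)

# Vote 1: ANOVA F-test (chi2 would be the test for categorical features).
votes += SelectKBest(f_classif, k=4).fit(X, y).get_support().astype(int)

# Vote 2: recursive feature elimination with cross-validation.
rfecv = RFECV(LogisticRegression(max_iter=1000), cv=3).fit(X, y)
votes += rfecv.get_support().astype(int)

# Keep features selected by the majority (here: both selectors).
selected = np.where(votes >= 2)[0]
```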
The notebook is available here:
This is a Churn prediction project, so the metric chosen to evaluate the model is Recall, because missing an actual churner (a false negative) is costlier than flagging a loyal customer (a false positive).
First of all, a train/test split was performed.
SMOTE was the balancing method used for this dataset.
The models were ranked using PyCaret:
After that, the three best models were fitted and predictions were made on the test dataset:
Model | Accuracy | Recall | Precision |
---|---|---|---|
xgboost | 0.7597 | 0.6346 | 0.541 |
ada | 0.7564 | 0.7273 | 0.5306 |
lightgbm | 0.7725 | 0.6613 | 0.5613 |
Baseline | 0.73 | | |
The chosen model is AdaBoost because it performs above the baseline and has the best recall.
After that, the model was tuned using random grid search over the following hyperparameters:
- n_estimators
- learning_rate
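The tuning step can be sketched with scikit-learn's `RandomizedSearchCV` over the two hyperparameters above, scoring by recall since that is the chosen metric (toy data and search ranges are illustrative, not the project's actual grid):

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import RandomizedSearchCV

# Toy data standing in for the SMOTE-balanced training set.
X, y = make_classification(n_samples=300, random_state=0)

# Random search over n_estimators and learning_rate, optimizing recall.
search = RandomizedSearchCV(
    AdaBoostClassifier(random_state=0),
    param_distributions={
        "n_estimators": range(50, 301, 50),
        "learning_rate": loguniform(0.01, 1.0),
    },
    n_iter=5, scoring="recall", cv=3, random_state=0,
)
search.fit(X, y)
best = search.best_params_
```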
 | Accuracy | Recall | Precision |
---|---|---|---|
Train | 0.79 | 0.85 | 0.77 |
Baseline | 0.73 | | |
Model confusion matrix:
The most important feature in this model is customer tenure:
 | Accuracy | Recall | Precision |
---|---|---|---|
Test | 0.75 | 0.77 | 0.52 |
Baseline | 0.73 | | |
Test confusion matrix:
Classification report by class:
Train and test comparison:
 | Accuracy | Recall | Precision |
---|---|---|---|
Train | 0.79 | 0.85 | 0.77 |
Test | 0.75 | 0.77 | 0.52 |
Baseline | 0.73 | | |
The notebook is available here:
The accuracy is 0.75. Considering that the test dataset is imbalanced (0.73 for class 0 and 0.27 for class 1), the model predicts only slightly better than the baseline.
As shown, the recall (sensitivity) on the test dataset is 0.77, which means that out of every 100 customers who actually churn, 77 are correctly predicted and 23 are missed.
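The recall reading can be verified directly from a confusion matrix. A sketch with hypothetical counts (not the project's actual matrix):

```python
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical labels: 100 actual churners, 77 caught, 23 missed,
# plus 100 non-churners of whom 40 are wrongly flagged.
y_true = [1] * 100 + [0] * 100
y_pred = [1] * 77 + [0] * 23 + [0] * 60 + [1] * 40

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
recall = tp / (tp + fn)  # 77 / (77 + 23) = 0.77
assert recall == recall_score(y_true, y_pred)
```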
Next steps:
- collect more data to build a more robust model
- deployment