This is the 1st Alura Data Science Challenge about Telco Churn classification. The goal is to predict customer Churn with a classification model.
The dataset is in JSON format and is available at the following link:
https://raw.githubusercontent.com/stephaniaslis/DS_Challenge_Telco/main/Telco-Customer-Churn.json
Data cleaning was performed through the following steps:
- Dropping useless columns
- Data type definition
- Null values
- Missing analysis (%)
- Deciding whether to fill missing values
- Duplicated columns
- Duplicated rows
- Constant columns
- Feature analysis
- Outlier analysis
- Data export
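The steps above can be sketched in pandas. This is a minimal illustration on a toy frame (the column names `customerID`, `tenure`, `Charges.Total`, and `Churn` follow the Telco dataset; the values are made up):

```python
import pandas as pd

# Toy frame standing in for the Telco data (the real columns differ).
df = pd.DataFrame({
    "customerID": ["001", "002", "002", "003"],
    "tenure": [1, 34, 34, 2],
    "Charges.Total": ["29.85", "1889.5", "1889.5", " "],  # blanks hide as spaces
    "Churn": ["No", "No", "No", "Yes"],
})

# Drop duplicated rows, then ID-like columns that carry no signal.
df = df.drop_duplicates().drop(columns=["customerID"])

# Data type definition: blank strings become NaN, then numeric.
df["Charges.Total"] = pd.to_numeric(df["Charges.Total"], errors="coerce")

# Missing analysis (%) per column, then a simple fill decision.
missing_pct = df.isna().mean() * 100
df["Charges.Total"] = df["Charges.Total"].fillna(0)

# Drop constant columns (zero variance adds nothing to a model).
df = df.loc[:, df.nunique() > 1]
```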
The notebook is available here:
The EDA was made using pandas profiling, which generates an interactive report in HTML format:
https://github.com/stephaniaslis/DS_Challenge_Telco/blob/main/report_telco.html
PS: the HTML report does not render on GitHub; it must be downloaded and opened locally.
Separate reports were created for Churn 0 (negative Churn) and Churn 1 (positive Churn):
https://github.com/stephaniaslis/DS_Challenge_Telco/blob/main/report_telco_churn_0.html
https://github.com/stephaniaslis/DS_Challenge_Telco/blob/main/report_telco_churn_1.html
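The per-class reports come from splitting the frame on the target and profiling each subset. A sketch of the split on a toy frame (the `ProfileReport` calls are commented out so the snippet runs without pandas-profiling installed):

```python
import pandas as pd

# Toy frame; the real data has many more columns.
df = pd.DataFrame({
    "tenure": [1, 34, 2, 45, 8],
    "Churn": ["No", "No", "Yes", "No", "Yes"],
})

# One subset per target value, mirroring report_telco_churn_0/1.html.
churn_0 = df[df["Churn"] == "No"]
churn_1 = df[df["Churn"] == "Yes"]

# With pandas-profiling installed, each subset gets its own report:
# from pandas_profiling import ProfileReport
# ProfileReport(churn_1, title="Churn 1").to_file("report_telco_churn_1.html")
```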
- The target (Churn) is imbalanced: the churn rate is 27%
- The gender distribution is roughly 50/50 between male and female
- There are more customers:
- under 65 years old
- with phone service
- without dependents
- with a monthly contract
- Customer tenure is highly correlated with Account charges total
- Most Churn 0 customers:
- are under 65 years old
- don't have dependents
- have phone service
- The most common contract is month-to-month
- Customer tenure (average) : 37 months
- Account charges monthly (average): 61.30 US$
- Customer tenure is highly correlated with Account charges total
- Most Churn 1 customers:
- are under 65 years old
- don't have a partner
- don't have dependents
- have phone service
- have fiber optic internet
- have internet Online Security
- have internet Online Backup
- have internet Device Protection
- have internet Tech Support
- have a monthly contract
- have Paperless Billing
- Customer tenure (average) : 17 months
- Account charges monthly (average): 74.44 US$
- Customer tenure is highly correlated with Account charges total
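The per-class averages and correlations above come from grouping by the target. A pandas sketch with toy numbers (not the real dataset's values):

```python
import pandas as pd

# Toy data: churners (1) have shorter tenure and higher monthly charges.
df = pd.DataFrame({
    "tenure": [40, 34, 12, 45, 22],
    "Charges.Monthly": [55.0, 60.0, 80.0, 58.0, 70.0],
    "Churn": [0, 0, 1, 0, 1],
})

# Average tenure and monthly charges per churn class.
summary = df.groupby("Churn")[["tenure", "Charges.Monthly"]].mean()

# Correlation between tenure and monthly charges, per class.
corr = df.groupby("Churn")[["tenure", "Charges.Monthly"]].apply(
    lambda g: g["tenure"].corr(g["Charges.Monthly"])
)
```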
The notebook is available here:
For feature selection, a majority-voting method was used, combining 3 selection approaches:
- Statistical test:
- ANOVA for numeric features
- Chi-squared for categorical features
- RFECV
- Boruta
Features kept by at least 2 of the selection approaches will be used in the modeling process.
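The voting scheme can be sketched with scikit-learn. Boruta lives in a separate package (`boruta`), so this sketch tallies votes from the two sklearn-based selectors only; the counting logic extends to a third vote unchanged:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

# Toy data standing in for the Telco features.
X, y = make_classification(n_samples=200, n_features=8,
                           n_informative=3, random_state=0)

votes = np.zeros(X.shape[1], dtype=int)

# Vote 1: ANOVA F-test (chi2 would be the test for categorical features).
votes += SelectKBest(f_classif, k=4).fit(X, y).get_support().astype(int)

# Vote 2: recursive feature elimination with cross-validation.
rfecv = RFECV(LogisticRegression(max_iter=1000), cv=3).fit(X, y)
votes += rfecv.get_support().astype(int)

# Keep features selected by the majority (here: both selectors).
selected = np.where(votes >= 2)[0]
```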
The notebook is available here:
This is a Churn prediction project, so the metric chosen to evaluate the model is Recall, because missing an actual churner (a false negative) is costlier than flagging a loyal customer (a false positive).
First of all, a train/test split was performed.
SMOTE was the balancing method used for this dataset.
The models were ranked using PyCaret:
After that, the three best models were fitted and predictions were made on the test dataset:
Model | Accuracy | Recall | Precision |
---|---|---|---|
xgboost | 0.7597 | 0.6346 | 0.541 |
ada | 0.7564 | 0.7273 | 0.5306 |
lightgbm | 0.7725 | 0.6613 | 0.5613 |
Baseline | 0.73 | | |
The chosen model is AdaBoost because it performs above the baseline and has the best recall.
After that, the model was tuned using random grid search over the following hyperparameters:
- n_estimators
- learning_rate
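The tuning step can be sketched with scikit-learn's `RandomizedSearchCV` over the two hyperparameters above, scoring by recall since that is the chosen metric (toy data and search ranges are illustrative, not the project's actual grid):

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import RandomizedSearchCV

# Toy data standing in for the SMOTE-balanced training set.
X, y = make_classification(n_samples=300, random_state=0)

# Random search over n_estimators and learning_rate, optimizing recall.
search = RandomizedSearchCV(
    AdaBoostClassifier(random_state=0),
    param_distributions={
        "n_estimators": range(50, 301, 50),
        "learning_rate": loguniform(0.01, 1.0),
    },
    n_iter=5, scoring="recall", cv=3, random_state=0,
)
search.fit(X, y)
best = search.best_params_
```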
 | Accuracy | Recall | Precision |
---|---|---|---|
Train | 0.79 | 0.85 | 0.77 |
Baseline | 0.73 | | |
Model confusion matrix:
The most important feature in this model is customer tenure:
 | Accuracy | Recall | Precision |
---|---|---|---|
Test | 0.75 | 0.77 | 0.52 |
Baseline | 0.73 | | |
Test confusion matrix:
Classification report by class:
Train and test comparison:
 | Accuracy | Recall | Precision |
---|---|---|---|
Train | 0.79 | 0.85 | 0.77 |
Test | 0.75 | 0.77 | 0.52 |
Baseline | 0.73 | | |
The notebook is available here:
The accuracy is 0.75. Considering that the test dataset is imbalanced (0.73 for class 0 and 0.27 for class 1), the model predicts only slightly better than the baseline.
As shown, the recall (sensitivity) on the test dataset is 0.77, which means that out of every 100 customers who actually churn, 77 are correctly predicted and 23 are missed.
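The recall reading can be verified directly from a confusion matrix. A sketch with hypothetical counts (not the project's actual matrix):

```python
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical labels: 100 actual churners, 77 caught, 23 missed,
# plus 100 non-churners of whom 40 are wrongly flagged.
y_true = [1] * 100 + [0] * 100
y_pred = [1] * 77 + [0] * 23 + [0] * 60 + [1] * 40

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
recall = tp / (tp + fn)  # 77 / (77 + 23) = 0.77
assert recall == recall_score(y_true, y_pred)
```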
Next steps:
- collect more data to build a more robust model
- deployment