End-to-end prediction model covering the full machine learning pipeline, built with heavy use of statistical data science techniques. The pipeline is documented step by step, with an explanation for each stage.
A hyperparameter-tuned machine learning model (95% score) that predicts customer loyalty, based on:
- Customer Account Information & Demographic
- Services Information
- Geographic Information
To summarize,
- Developed an end-to-end machine learning pipeline for predicting customer retention/churn.
- Heavily utilized statistical data science techniques to analyze and interpret complex datasets.
- Built a robust model that effectively identifies customers likely to leave, with data analysis that enables proactive customer retention strategies.
Method used:
- EDA (exploratory data analysis)
- Statistical median
Key Results & derived Insights:
- Converted the Total Charges column to a numeric datatype.
- Replaced missing values in the Total Charges column with the median. The median is computed per contract type, following the statistical pattern found between Total Charges and the contract type feature.
- Removed redundant, repeated columns such as 'Lat Long', 'Country', 'State' to reduce dimensional complexity.
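The Total Charges cleanup can be sketched with pandas as below. This is a minimal illustration, not the notebook's code: the column names `Total Charges` and `Contract`, the helper name `impute_total_charges`, and the toy values are assumptions.

```python
import pandas as pd

def impute_total_charges(df, value_col="Total Charges", group_col="Contract"):
    """Convert value_col to numeric and fill gaps with the per-group median."""
    out = df.copy()
    # Non-numeric entries (for example blank strings) become NaN.
    out[value_col] = pd.to_numeric(out[value_col], errors="coerce")
    # Median of value_col within each contract type, aligned to the rows.
    group_median = out.groupby(group_col)[value_col].transform("median")
    out[value_col] = out[value_col].fillna(group_median)
    return out

# Toy data, made up for illustration only.
raw_charges = pd.DataFrame({
    "Contract": ["Month-to-month", "Month-to-month", "Month-to-month",
                 "Two year", "Two year", "Two year"],
    "Total Charges": ["100.0", " ", "300.0", "2000.0", "4000.0", " "],
})
imputed = impute_total_charges(raw_charges)
print(imputed)
```

Each blank entry is filled with the median of its own contract group rather than the global median, so long-contract customers are not pulled toward the lower totals of short-contract ones.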
Created and transformed categorical and numerical features that could help in predicting customer churn.
Method used:
- Correlation matrix (Pearson)
- High-correlation threshold
Key Results & derived Insights:
- Used Pearson's correlation matrix with a threshold value to look for redundant features and reduce dimensional complexity.
- No pair of numerical features reaches the correlation threshold of 0.9, so no numerical features need to be dropped.
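A correlation-threshold check of this kind might look like the sketch below. The helper name `features_above_threshold` and the toy columns are hypothetical; only the Pearson method and the 0.9 threshold come from the text above.

```python
import numpy as np
import pandas as pd

def features_above_threshold(df, threshold=0.9):
    """Return columns whose absolute Pearson correlation with an earlier column reaches the threshold."""
    corr = df.select_dtypes(include="number").corr(method="pearson").abs()
    # Keep only the upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [col for col in upper.columns if (upper[col] >= threshold).any()]

# Toy data: "b" is an exact multiple of "a", "c" is only weakly related.
corr_demo = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],
    "c": [5, 1, 4, 2, 3],
})
to_drop = features_above_threshold(corr_demo, threshold=0.9)
print(to_drop)
```

An empty list corresponds to the outcome reported here: nothing crosses the threshold, so nothing is dropped.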
Method used:
- Histogram (for distribution analysis)
- Standard deviation outlier handling
Key Results & derived Insights:
- The features mostly follow a bell-shaped normal distribution.
- So standard deviation can be used to clean outliers: values beyond 2 standard deviations are treated as outliers.
- Outliers are then replaced with the median.
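The outlier rule can be illustrated as follows. This is a sketch under assumptions: it measures the 2 x standard deviation distance from the column mean and uses the median of the full column (outliers included) as the replacement; the notebook may differ in these details.

```python
import pandas as pd

def replace_outliers_with_median(series, n_std=2.0):
    """Replace values further than n_std standard deviations from the mean with the median."""
    mean, std = series.mean(), series.std()
    is_outlier = (series - mean).abs() > n_std * std
    # The median is taken over the whole series, including the outliers.
    return series.mask(is_outlier, series.median())

# Toy values: the last entry is an obvious outlier.
monthly = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 100.0])
monthly_clean = replace_outliers_with_median(monthly)
print(monthly_clean.tolist())
```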
Method used:
- Min-Max scaling
- Data normalization
Key Results & derived Insights:
- This multi-dimensional dataset requires scale normalization. Applied MinMaxScaler() to the numerical features.
In machine learning, the min-max range of some features can be many times larger than that of other features. Data normalization is a common practice that transforms numeric columns to a common scale.
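As a small illustration of the scaling step, the sketch below applies scikit-learn's `MinMaxScaler` to two numeric columns. The column names and values are made up for the example.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy numeric features on very different scales (hypothetical columns).
numeric = pd.DataFrame({
    "Tenure Months": [1, 12, 24, 72],
    "Monthly Charges": [20.0, 50.0, 80.0, 120.0],
})

scaler = MinMaxScaler()  # default feature_range is (0, 1)
scaled = pd.DataFrame(scaler.fit_transform(numeric), columns=numeric.columns)
print(scaled)
```

Each column is mapped with (x - min) / (max - min), so both features end up in the same 0 to 1 range.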
Method used:
- Bar chart of each category vs. churn
- Visualization analysis for feature selection
Key Results & derived Insights:
- From the chart visualization, the prominent features that relate to churn are: ‘Partner’, ‘Dependents’, ‘Internet Service’, ‘Online Security’, ‘Online Backup’, ‘Device Protection’, ‘Tech Support’, ‘Streaming TV’, ‘Streaming Movies’, ‘Contract’, ‘Paperless Billing’, ‘Payment Method’. This analysis is covered in more detail in the Bonus section.
- The features with relatively less impact are: ‘Gender’, ‘Senior Citizen’, ‘Phone Service’.
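The numbers behind such category-vs-churn bar charts can be computed with a crosstab, as in this sketch. It assumes a target column named `Churn` with `Yes`/`No` values; the helper name and toy rows are hypothetical.

```python
import pandas as pd

def churn_share_by_category(df, feature, target="Churn"):
    """Share of churned / retained customers within each category of a feature."""
    # normalize="index" converts counts to row-wise proportions.
    return pd.crosstab(df[feature], df[target], normalize="index")

# Toy data, made up for illustration only.
contracts = pd.DataFrame({
    "Contract": ["Month-to-month"] * 4 + ["Two year"] * 4,
    "Churn": ["Yes", "Yes", "Yes", "No", "No", "No", "No", "Yes"],
})
shares = churn_share_by_category(contracts, "Contract")
print(shares)
# shares.plot(kind="bar", stacked=True) would draw the chart (requires matplotlib).
```

A feature whose categories show clearly different churn shares is a candidate for selection; a feature with near-identical shares across categories carries little signal.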
Method used:
- One-hot encoding
Key Results & derived Insights:
- Each categorical column in the dataset has 2-3 labels, so one-hot encoding is a good fit.
- The dataset is encoded into numeric data for machine learning.
Categorical labels are encoded because most machine learning models can only operate on numerical data. Encoding converts categorical variables into a numerical format; one-hot encoding does this by creating one binary column per label.
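One-hot encoding can be shown on a tiny example with `pandas.get_dummies`. The rows are made up, and the notebook may use a different encoder.

```python
import pandas as pd

# Toy categorical data, made up for illustration only.
categories = pd.DataFrame({
    "Partner": ["Yes", "No", "Yes"],
    "Contract": ["Month-to-month", "One year", "Two year"],
})

# One binary column per label of each categorical feature.
encoded = pd.get_dummies(categories, columns=["Partner", "Contract"], dtype=int)
print(encoded)
```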
Method used:
- Oversampling (for class imbalance)
- SMOTE (Synthetic Minority Oversampling Technique)
Key Results & derived Insights:
- The "churn: yes" class has very few samples, and it is the crucial class. This imbalance is not good for binary classification.
- SMOTE is applied to oversample the minority class.
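The README does not say which SMOTE implementation was used. The sketch below illustrates only the core idea with NumPy: each synthetic sample is placed at a random point on the line between a minority sample and one of its nearest minority neighbours. The function name `smote_sketch` and its parameters are hypothetical, and a library implementation would normally be preferred in practice.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples by interpolating between neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    k = min(k, len(X_min) - 1)  # cannot have more neighbours than other points
    # Pairwise Euclidean distances between minority samples.
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]
    # Pick a random base sample and one of its k nearest neighbours.
    base = rng.integers(0, len(X_min), size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    # Interpolate at a random position between the two points.
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Toy minority class: the four corners of the unit square.
minority = np.array([[0, 0], [1, 0], [0, 1], [1, 1]])
synthetic = smote_sketch(minority, n_new=6, k=2)
print(synthetic)
```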
Method used:
- stratify (for a balanced class distribution in each dataset split)
Key Results & derived Insights:
- Applied stratify to keep the class distribution balanced across the train and test data, so that evaluation is correct.
- The dataset is split into 75% train and 25% test data, given 10,000 samples.
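A stratified 75/25 split can be sketched with scikit-learn's `train_test_split`. The arrays below are synthetic stand-ins, and `random_state=42` is an arbitrary choice for reproducibility.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 rows, 20% positive class.
X_all = np.arange(2000).reshape(1000, 2)
y_all = np.array([0] * 800 + [1] * 200)

X_train, X_test, y_train, y_test = train_test_split(
    X_all, y_all,
    test_size=0.25,   # 75% train, 25% test
    stratify=y_all,   # keep the class ratio the same in both parts
    random_state=42,
)
print(len(X_train), len(X_test), y_train.mean(), y_test.mean())
```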
Engineered experiments to optimize machine learning model selection.
Method used:
- Benchmark of models: random forest, logistic regression, SVM, KNN, and gradient boosting classification algorithms
- Monitoring against the baseline complexity of the dataset
Key Results & derived Insights:
- Experimentation for model selection.
- Selected algorithm: Random Forest (with 95% accuracy). The baseline accuracy of the dataset was 50%.
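A benchmark of this kind could be set up as in the sketch below. It runs on synthetic data from `make_classification`, so its scores say nothing about the churn dataset. Using `DummyClassifier` as the baseline and 5-fold cross-validated accuracy as the comparison metric are assumptions, as are all model settings.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic, roughly balanced stand-in data (not the churn dataset).
X_bench, y_bench = make_classification(n_samples=400, n_features=10, random_state=0)

models = {
    "baseline": DummyClassifier(strategy="most_frequent"),
    "random forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "logistic regression": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
    "KNN": KNeighborsClassifier(),
    "gradient boosting": GradientBoostingClassifier(random_state=0),
}

# Mean cross-validated accuracy per model.
scores = {
    name: cross_val_score(model, X_bench, y_bench, cv=5, scoring="accuracy").mean()
    for name, model in models.items()
}
for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

Comparing every candidate against the trivial baseline shows how much of the score comes from the model rather than from the class distribution.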
Engineered experiments to optimize the hyperparameters of the selected machine learning model.
Method used:
- Randomized search with a parameter grid
- 5-fold cross-validation
- Random forest parameters: max_depth (tree depth), max_features selection, min_samples_split
Key Results & derived Insights:
- Best parameter combination with 100 iterations:
{
'n_estimators': 50,
'min_samples_split': 2,
'max_features': 'log2',
'max_depth': 35
}
- The number of false negatives is very low: only 81 compared to 1248.
- Minimizing false negatives (i.e., not missing customers who are likely to churn) is very important for churn prediction.
- This is optimal performance.
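The randomized search over random forest parameters with 5-fold cross-validation can be sketched with scikit-learn's `RandomizedSearchCV`. The search space below is hypothetical (the README reports only the best combination), the data is synthetic, `scoring="f1"` is an assumption, and `n_iter` is kept at 5 instead of the reported 100 so the example runs quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (not the churn dataset).
X_tune, y_tune = make_classification(n_samples=300, n_features=10, random_state=0)

# Hypothetical search space covering the parameters named above.
param_distributions = {
    "n_estimators": [10, 25, 50],
    "max_depth": [5, 15, 25, 35],
    "max_features": ["sqrt", "log2"],
    "min_samples_split": [2, 5, 10],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,        # the README reports 100 iterations
    cv=5,            # 5-fold cross-validation
    scoring="f1",    # assumption: the README does not state the search metric
    random_state=0,
)
search.fit(X_tune, y_tune)
print(search.best_params_, round(search.best_score_, 3))
```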
- The prediction task focuses on whether a user will churn/leave, so minimizing false negatives (i.e., not missing customers who will churn) is important. F1-score is a good indicator.
- F1-score is 95%, which is good.
- AUC score is 95%, which is also good and is reflected in the ROC curve. The ROC diagram is available in the notebook.
- Minimizing false negatives (i.e., correctly identifying all customers who are likely to churn) may be more important than minimizing false positives. So it could be more useful to look at the F1 score than at accuracy, for example.
- An F1 score of 95% is a good score for the binary classification model.
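The metrics discussed above (false negatives, F1, AUC) can be computed with `sklearn.metrics`, as in this sketch. The labels and scores are made up and unrelated to the reported 95% results; the 0.5 decision threshold is an assumption.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, recall_score, roc_auc_score

# Toy ground truth (1 = churn) and predicted churn probabilities.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_score = np.array([0.9, 0.8, 0.7, 0.3, 0.6, 0.4, 0.2, 0.2, 0.1, 0.1])
y_pred = (y_score >= 0.5).astype(int)  # assumed decision threshold

# For binary labels, ravel() yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
recall = recall_score(y_true, y_pred)   # tp / (tp + fn): falls as false negatives rise
f1 = f1_score(y_true, y_pred)           # harmonic mean of precision and recall
auc = roc_auc_score(y_true, y_score)    # computed from scores, not from hard labels

print(f"false negatives={fn}, recall={recall:.2f}, F1={f1:.2f}, AUC={auc:.3f}")
```

A churner predicted as staying counts as a false negative, which lowers recall and therefore F1, while accuracy on an imbalanced dataset can stay high even when many churners are missed.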
Details provided in notebook. Outlined steps for deploying:
- Model Serialization
- Deployment Approach
- Hosting Options
Monitoring and maintenance:
- Performance Monitoring
- Data Drift
- Retraining
- Versioning
- We have an almost equal number of men and women customers in the data set.
- Both genders have similar churn rates. We see there is no difference in the churn rate.
- There are far fewer senior citizen customers. Non-senior citizens are the majority of customers, which makes sense.
- Customers with no partner are a bit more likely to churn. 65% churn rate but major 45% don’t churn.
- Customers with dependents are substantially less prone to churn. 30% don’t churn. Only around 5% churn.
- The overall number of customers with dependents is lower than the number of customers with no dependents.
- Customers with no dependents are a bit more likely to churn. 95% churn rate but major 70% don’t churn. Most customers have no dependents, which makes sense.
- Observing the churn rate, having phone service doesn’t contribute to churn.
- Having multiple lines doesn’t contribute to churn.
- Customers with fiber optic internet service are highly prone to churn. Improvements should be made, or alternatives like DSL should be provided.
- Customers with DSL internet service have a much lower rate of churn. The company can focus on DSL internet service.
- Customers who have no internet service, when it comes to streaming TV or streaming movies, are less likely to churn.