- Data Visualisation with EDA
- Machine Learning: Logistic Regression
- Machine Learning: Naive Bayes
- Machine Learning: Decision Tree
- Machine Learning: Random Forest
Nagammai S - Exploratory Data Analysis, Decision Tree, Logistic Regression
Melissa S - Data Preparation, Logistic Regression
Can we predict which customers are more likely to churn, i.e. close their credit card accounts? We used multiple machine learning models to tackle this problem:
Logistic regression is a supervised learning technique. Attrition_Flag, our dependent variable, records customer activity (i.e. attrited customers vs. existing customers) and is categorical. Since the dependent variable is categorical, we chose logistic regression over simple linear regression. However, it must be noted that because logistic regression includes every variable in the regression equation, irrelevant predictors can drag down its accuracy.
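A minimal sketch of this setup with scikit-learn, assuming the column layout of the Kaggle BankChurners.csv file (the file path, and the choice to use only numeric predictors here, are illustrative assumptions; our actual pipeline also one-hot encodes the categorical columns, as described below):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the Kaggle BankChurners dataset (file path is an assumption)
df = pd.read_csv("BankChurners.csv")

# Binary target: 1 = attrited (churned) customer, 0 = existing customer
y = (df["Attrition_Flag"] == "Attrited Customer").astype(int)

# Numeric predictors only for this sketch; drop the client ID and the
# pre-computed "Naive_Bayes_Classifier_*" helper columns in the raw file
X = df.select_dtypes(include="number").drop(columns=["CLIENTNUM"])
X = X.loc[:, ~X.columns.str.startswith("Naive_Bayes")]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Logistic regression includes every supplied predictor in the equation
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))
```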
A decision tree is also a supervised learning technique. Its tree-like structure presents each decision in a clear, well-defined manner, and it achieved higher accuracy than logistic regression. However, a single decision tree has high variance and requires complex calculations when there are many class variables, so we also used a random forest, an ensemble of many decision trees. A random forest handles large datasets more efficiently and is more accurate than a single decision tree.
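Since both models share the same scikit-learn interface, comparing them is a small change on top of the split above (the hyperparameters below are illustrative defaults, not our tuned values):

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# A single tree: easy to read off the decision rules, but high variance
tree = DecisionTreeClassifier(max_depth=5, random_state=42)
tree.fit(X_train, y_train)

# A random forest averages many decorrelated trees, trading some
# interpretability for lower variance and usually higher accuracy
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)

for name, clf in [("Decision tree", tree), ("Random forest", forest)]:
    print(name, "accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```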
- Naive Bayes
- Naive Bayes assumes the predictors are independent of one another and can be trained on relatively little data, so we used it to predict the class of each observation. It computes the probability of each class given the features and predicts the class with the highest probability. The Naive Bayes classifier remains fast and accurate even on large datasets (a minimal sketch follows this list).
- Resampling the data
- To balance the classes and improve accuracy, the data was resampled: observations from the minority class were duplicated (random oversampling) until the class sizes matched (see the resampling sketch after this list).
- One-Hot Encoding
- One-hot encoding converts categorical variables into a form the algorithms can use, which improves prediction and accuracy. Each categorical variable is first mapped to integers and then represented as a binary vector in which only the position for that category is 1 and every other position is 0. This also prevents the model from treating larger integer codes as "greater than" smaller ones (an encoding sketch follows this list).
- Calculating permutation importance
- Permutation importance measures the change in the model's prediction error when each predictor variable is shuffled randomly in turn: the model's accuracy is re-evaluated after each shuffle, and a large drop marks the shuffled variable as important. This gives a direct measure of each variable's importance (see the sketch after this list).
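For the Naive Bayes item, a minimal sketch reusing the train/test split from the logistic regression example (GaussianNB is an assumption here, chosen because the sketch uses numeric predictors; other variants suit count or binary features):

```python
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Fits per-class Gaussian distributions for each feature under the
# independence assumption, then predicts the most probable class
nb = GaussianNB()
nb.fit(X_train, y_train)
print("Naive Bayes accuracy:", accuracy_score(y_test, nb.predict(X_test)))
```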
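One way to do the duplication described in the resampling item, using sklearn.utils.resample on the raw dataframe (a sketch of random oversampling; the class labels match the dataset's Attrition_Flag values):

```python
import pandas as pd
from sklearn.utils import resample

# Attrited customers are the minority class in this dataset
majority = df[df["Attrition_Flag"] == "Existing Customer"]
minority = df[df["Attrition_Flag"] == "Attrited Customer"]

# Duplicate minority rows (sampling with replacement) until the
# two classes are the same size
minority_upsampled = resample(
    minority, replace=True, n_samples=len(majority), random_state=42
)
df_balanced = pd.concat([majority, minority_upsampled])
print(df_balanced["Attrition_Flag"].value_counts())
```

In practice the duplication should be applied only to the training split, so that copies of a minority-class row cannot appear in both the training and test sets.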
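For the one-hot encoding item, pandas get_dummies is the simplest route (Card_Category is one of the dataset's categorical columns; the same call covers the others):

```python
import pandas as pd

# Each category becomes its own 0/1 column, so no spurious ordering
# is implied between, e.g., 'Blue' and 'Platinum' cards
encoded = pd.get_dummies(df, columns=["Card_Category"])
print([c for c in encoded.columns if c.startswith("Card_Category")])
```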
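And for permutation importance, scikit-learn provides this directly; the sketch below scores the fitted random forest from the earlier example (n_repeats is an illustrative choice):

```python
from sklearn.inspection import permutation_importance

# Shuffle each predictor in turn and record the mean drop in accuracy
# over the repeats; a large drop marks an important feature
result = permutation_importance(
    forest, X_test, y_test, n_repeats=10, random_state=42
)
for name, score in sorted(
    zip(X_test.columns, result.importances_mean), key=lambda t: -t[1]
):
    print(f"{name}: {score:.4f}")
```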
The random forest has the highest accuracy and is therefore the most suitable model for predicting churn. Hence, we are able to predict which customers are more likely to churn on their credit cards.