seow2002/SC1015-mini-project

Credit Card Churn

About

This is a Mini-Project for SC1015 (Introduction to Data Science and Artificial Intelligence), which focuses on credit card churn using this Kaggle data set. For a detailed walkthrough, refer to the source code in this order:
  1. Data Visualisation with EDA
  2. Machine Learning: Logistic Regression
  3. Machine Learning: Naive Bayes
  4. Machine Learning: Decision Tree
  5. Machine Learning: Random Forest

Contributors

Phoebe C - Machine Learning: Logistic Regression, Decision Tree, Naive Bayes, Random Forest
Nagammai S - Exploratory Data Analysis, Decision Tree, Logistic Regression
Melissa S - Data Preparation, Logistic Regression

Problem Definition

Are we able to predict which customers are more likely to carry out credit card churning?

Models Used

Classification (Logistic Regression, Decision Tree, Random Forest)

We used multiple machine learning models to find a solution for our problem:

Logistic regression is a supervised learning technique. Attrition_Flag records the customer's status (i.e. attrited customer vs existing customer) and is a categorical variable. Since this categorical variable is our dependent variable, we chose logistic regression instead of linear regression. However, because logistic regression includes every predictor in the regression equation, irrelevant or strongly correlated predictors may lower its accuracy.
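As a rough illustration of how logistic regression turns a weighted sum of predictors into a class label, the sketch below uses only the Python standard library. The weights, bias, feature values, and function names are made-up assumptions for illustration; they are not coefficients fitted in this project.

```python
import math

def sigmoid(z: float) -> float:
    """Map any real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, weights, bias):
    """Probability of the positive class (here: an attrited customer)."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

def predict(features, weights, bias, threshold=0.5):
    """Label is 1 (attrited) when the probability reaches the threshold."""
    return 1 if predict_proba(features, weights, bias) >= threshold else 0

# Hypothetical coefficients for two scaled predictors (assumed values).
weights = [-1.5, 0.8]
bias = 0.2
customer = [1.0, 0.5]

p = predict_proba(customer, weights, bias)   # probability of attrition
label = predict(customer, weights, bias)     # 0 = existing, 1 = attrited
```

Every predictor contributes a term to `z`, which is why an uninformative predictor with a non-zero weight can pull the probability in the wrong direction.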

A decision tree is also a supervised learning technique. Its tree-like structure represents each decision in a clear and well-defined manner. Moreover, a decision tree can capture non-linear relationships that logistic regression misses, which can give it higher accuracy. However, a single decision tree has high variance and requires complex calculations when there are many class variables, so we also used a random forest, which is made up of multiple decision trees. A random forest handles large datasets well and is typically more accurate than a single decision tree.
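The idea of combining many high-variance trees can be sketched without any library. The example below is a simplification, not the model trained in this project: each "tree" is a one-split decision stump on a single feature, trained on a bootstrap sample, and the ensemble predicts by majority vote. A real random forest also samples a random subset of features at each split. The toy data and the helper names (`fit_stump`, `fit_forest`, `forest_predict`) are hypothetical.

```python
import random
from collections import Counter

def fit_stump(xs, ys):
    """Pick the threshold on a single feature that misclassifies the fewest rows.

    The stump predicts 1 when x <= threshold and 0 otherwise.
    """
    best_threshold, best_errors = None, len(xs) + 1
    for t in sorted(set(xs)):
        errors = sum((1 if x <= t else 0) != y for x, y in zip(xs, ys))
        if errors < best_errors:
            best_threshold, best_errors = t, errors
    return best_threshold

def fit_forest(xs, ys, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample (rows drawn with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        forest.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return forest

def forest_predict(forest, x):
    """Majority vote over the individual stump predictions."""
    votes = Counter(1 if x <= t else 0 for t in forest)
    return votes.most_common(1)[0][0]

# Toy data (assumed): low transaction counts go with churn (label 1).
transaction_counts = [20, 25, 30, 35, 40, 60, 65, 70, 80, 90]
churned = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

forest = fit_forest(transaction_counts, churned)
low_count_prediction = forest_predict(forest, 20)
high_count_prediction = forest_predict(forest, 90)
```

Individual stumps may pick different thresholds depending on their bootstrap sample; voting averages out those differences, which is the variance reduction described above.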

Learning Points

  • Naive Bayes
  • Naive Bayes assumes that the predictors are independent of one another given the class, and it can work with relatively little training data. It computes the probability of each class given the predictor values and predicts the class with the highest probability. The classifier is fast and can remain accurate even on large datasets.
  • Resampling the data
  • To improve accuracy and balance the classes, the data was resampled. Some of the observations from the minority class were duplicated (oversampling) so that both classes were equally represented.
  • One-Hot Encoding
  • One-hot encoding converts categorical variables into a numeric form that the algorithms can use, which improves prediction accuracy. Each category is represented as a binary vector: the position that corresponds to the category is 1, and all other positions are 0. Moreover, one-hot encoding ensures that the model does not treat a category with a higher integer code as having a larger value.
  • Calculating permutation importance
  • Permutation importance measures the change in the model's prediction error when a predictor is disturbed. The values of each predictor variable are shuffled randomly, one variable at a time, and the accuracy of the model is recorded. The resulting drop in accuracy directly measures the importance of that variable.
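The resampling, one-hot encoding, and permutation importance steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the project's code: the rows, labels, category values, and `toy_model` are invented, and the helper names are hypothetical. The notebooks may use library implementations instead.

```python
import random

def one_hot(values):
    """Encode each category as a binary vector containing a single 1."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    vectors = [[1 if index[v] == i else 0 for i in range(len(categories))]
               for v in values]
    return categories, vectors

def oversample_minority(rows, labels, seed=0):
    """Duplicate randomly chosen minority-class rows until classes are balanced."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = max(len(group) for group in by_class.values())
    out_rows, out_labels = [], []
    for label, group in by_class.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        out_rows.extend(group + extra)
        out_labels.extend([label] * target)
    return out_rows, out_labels

def accuracy(model, rows, labels):
    """Fraction of rows for which the model predicts the true label."""
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(rows)

def permutation_importance(model, rows, labels, column, n_repeats=20, seed=0):
    """Mean drop in accuracy after randomly shuffling one column."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    drops = []
    for _ in range(n_repeats):
        shuffled = [r[column] for r in rows]
        rng.shuffle(shuffled)
        permuted = [r[:column] + [v] + r[column + 1:]
                    for r, v in zip(rows, shuffled)]
        drops.append(baseline - accuracy(model, permuted, labels))
    return sum(drops) / n_repeats

# Assumed toy inputs: a categorical column, and rows of two numeric predictors.
categories, encoded = one_hot(["Blue", "Gold", "Blue", "Silver"])

rows = [[20, 3], [30, 1], [70, 2], [80, 5], [90, 4], [60, 0]]
labels = [1, 1, 0, 0, 0, 0]  # 1 = attrited (minority class), 0 = existing
balanced_rows, balanced_labels = oversample_minority(rows, labels)

def toy_model(row):
    """Hypothetical classifier that only looks at the first predictor."""
    return 1 if row[0] < 50 else 0

used = permutation_importance(toy_model, balanced_rows, balanced_labels, column=0)
ignored = permutation_importance(toy_model, balanced_rows, balanced_labels, column=1)
```

Because `toy_model` never reads the second predictor, shuffling that column cannot change its accuracy, so `ignored` is zero, while shuffling the first predictor lowers accuracy and gives a positive importance.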

Conclusion

Total transaction count, total revolving balance, and the total change in transaction count from Q4 to Q1 are the variables most closely correlated with churn. These variables can be used to predict which customers are more likely to carry out credit card churning.
Random Forest has the highest accuracy and is therefore the best model for predicting churn. Hence, we are able to predict which customers are more likely to carry out credit card churning.

References

Feature importance in naive Bayes classifiers. InBlog. (n.d.). Retrieved April 24, 2022, from https://blog.ineuron.ai/Feature-Importance-in-Naive-Bayes-Classifiers-5qob5d5sFW#:~:text=Feature%20importance%20is%20the%20methodology,are%20in%20predicting%20target%20variable
Goyal, S. (2020, November 19). Credit Card customers. Kaggle. Retrieved April 24, 2022, from https://www.kaggle.com/datasets/sakshigoyal7/credit-card-customers?select=BankChurners.csv
Supervised learning. scikit. (n.d.). Retrieved April 24, 2022, from https://scikit-learn.org/stable/supervised_learning.html
