- Data Visualisation with EDA
- Machine Learning: Logistic Regression
- Machine Learning: Naive Bayes
- Machine Learning: Decision Tree
- Machine Learning: Random Forest
Nagammai S - Exploratory Data Analysis, Decision Tree, Logistic Regression
Melissa S - Data Preparation, Logistic Regression
Can we predict which customers are more likely to churn, i.e. close their credit card accounts? We used multiple machine learning models to tackle this problem:
Logistic regression is a supervised learning technique. Attrition_Flag, our dependent variable, records customer activity (i.e. attrited customers vs. existing customers) and is categorical. Since the dependent variable is categorical, we chose logistic regression over simple linear regression. However, it must be noted that because logistic regression includes every variable in the regression equation, irrelevant predictors can drag down its accuracy.
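A minimal sketch of this setup with scikit-learn, assuming the column layout of the Kaggle BankChurners.csv file (the file path, and the choice to use only numeric predictors here, are illustrative assumptions; our actual pipeline also one-hot encodes the categorical columns, as described below):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the Kaggle BankChurners dataset (file path is an assumption)
df = pd.read_csv("BankChurners.csv")

# Binary target: 1 = attrited (churned) customer, 0 = existing customer
y = (df["Attrition_Flag"] == "Attrited Customer").astype(int)

# Numeric predictors only for this sketch; drop the client ID and the
# pre-computed "Naive_Bayes_Classifier_*" helper columns in the raw file
X = df.select_dtypes(include="number").drop(columns=["CLIENTNUM"])
X = X.loc[:, ~X.columns.str.startswith("Naive_Bayes")]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Logistic regression includes every supplied predictor in the equation
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))
```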
A decision tree is also a supervised learning technique. Its tree-like structure presents each decision in a clear, well-defined manner, and it achieved higher accuracy than logistic regression. However, a single decision tree has high variance and requires complex calculations when there are many class variables, so we also used a random forest, an ensemble of many decision trees. A random forest handles large datasets more efficiently and is more accurate than a single decision tree.
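Since both models share the same scikit-learn interface, comparing them is a small change on top of the split above (the hyperparameters below are illustrative defaults, not our tuned values):

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# A single tree: easy to read off the decision rules, but high variance
tree = DecisionTreeClassifier(max_depth=5, random_state=42)
tree.fit(X_train, y_train)

# A random forest averages many decorrelated trees, trading some
# interpretability for lower variance and usually higher accuracy
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)

for name, clf in [("Decision tree", tree), ("Random forest", forest)]:
    print(name, "accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```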
- Naive Bayes
- Naive Bayes assumes the predictors are independent of one another and can be trained on relatively little data, so we used it to predict the class of each observation. It computes the probability of each class given the features and predicts the class with the highest probability. The Naive Bayes classifier remains fast and accurate even on large datasets (a minimal sketch follows this list).
- Resampling the data
- To balance the classes and improve accuracy, the data was resampled: observations from the minority class were duplicated (random oversampling) until the class sizes matched (see the resampling sketch after this list).
- One-Hot Encoding
- One-hot encoding converts categorical variables into a form the algorithms can use, which improves prediction and accuracy. Each categorical variable is first mapped to integers and then represented as a binary vector in which only the position for that category is 1 and every other position is 0. This also prevents the model from treating larger integer codes as "greater than" smaller ones (an encoding sketch follows this list).
- Calculating permutation importance
- Permutation importance measures the change in the model's prediction error when each predictor variable is shuffled randomly in turn: the model's accuracy is re-evaluated after each shuffle, and a large drop marks the shuffled variable as important. This gives a direct measure of each variable's importance (see the sketch after this list).
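For the Naive Bayes item, a minimal sketch reusing the train/test split from the logistic regression example (GaussianNB is an assumption here, chosen because the sketch uses numeric predictors; other variants suit count or binary features):

```python
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Fits per-class Gaussian distributions for each feature under the
# independence assumption, then predicts the most probable class
nb = GaussianNB()
nb.fit(X_train, y_train)
print("Naive Bayes accuracy:", accuracy_score(y_test, nb.predict(X_test)))
```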
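One way to do the duplication described in the resampling item, using sklearn.utils.resample on the raw dataframe (a sketch of random oversampling; the class labels match the dataset's Attrition_Flag values):

```python
import pandas as pd
from sklearn.utils import resample

# Attrited customers are the minority class in this dataset
majority = df[df["Attrition_Flag"] == "Existing Customer"]
minority = df[df["Attrition_Flag"] == "Attrited Customer"]

# Duplicate minority rows (sampling with replacement) until the
# two classes are the same size
minority_upsampled = resample(
    minority, replace=True, n_samples=len(majority), random_state=42
)
df_balanced = pd.concat([majority, minority_upsampled])
print(df_balanced["Attrition_Flag"].value_counts())
```

In practice the duplication should be applied only to the training split, so that copies of a minority-class row cannot appear in both the training and test sets.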
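For the one-hot encoding item, pandas get_dummies is the simplest route (Card_Category is one of the dataset's categorical columns; the same call covers the others):

```python
import pandas as pd

# Each category becomes its own 0/1 column, so no spurious ordering
# is implied between, e.g., 'Blue' and 'Platinum' cards
encoded = pd.get_dummies(df, columns=["Card_Category"])
print([c for c in encoded.columns if c.startswith("Card_Category")])
```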
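And for permutation importance, scikit-learn provides this directly; the sketch below scores the fitted random forest from the earlier example (n_repeats is an illustrative choice):

```python
from sklearn.inspection import permutation_importance

# Shuffle each predictor in turn and record the mean drop in accuracy
# over the repeats; a large drop marks an important feature
result = permutation_importance(
    forest, X_test, y_test, n_repeats=10, random_state=42
)
for name, score in sorted(
    zip(X_test.columns, result.importances_mean), key=lambda t: -t[1]
):
    print(f"{name}: {score:.4f}")
```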
The random forest has the highest accuracy and is therefore the most suitable model for predicting churn. Hence, we are able to predict which customers are more likely to churn on their credit cards.