Achievement:
2nd Place (out of over 900 participants and 322 teams)
This competition is the National University of Singapore's biggest annual Data Science hackathon, where participants use machine learning to tackle real-life business cases of corporate partners. Our team was judged by the NUS Statistics and Data Science Society, as well as a data scientist and a senior machine learning engineer from Singlife.
Team Number: 219
Team Name: Team Zero
Team Members / Contributors:
- Reina Peh
- Ryan Tan
- Zhang Bowen
- Claudia Lai
Our goal is to predict the binary outcomes of the target `f_purchase_lh` using Singlife's highly imbalanced dataset (minority class: only 3.95% of the target column), which contains 304 columns and 17,992 rows.
Test Metrics:
- Precision
- Recall
- F1-Score (our priority)
Techniques and Tools:
- EDA
- Data Cleaning
- Feature Engineering
- Imputation Techniques
- RandomUnderSampler
- SMOTENC
- XGBClassifier Model
- SelectFromModel Feature Selection Method
- Optuna
- LIME (Explainable AI)
Refer to the Exploratory Data Analysis notebook.
Function 1: `clean_data(data, target)`
- Null Value Analysis: It calculates and displays the count and percentage of null values per column
- Column Removal: Columns with 100% null values are removed
- `hh_20`, `pop_20`, and `hh_size_est` are also removed because we observed that `hh_size` = `pop_20` / `hh_20`; `hh_size` is more meaningful than `pop_20` and `hh_20`, and more granular than `hh_size_est`
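A minimal sketch of what this step could look like in pandas; the exact implementation in our notebook may differ, and the printed null summary is illustrative:

```python
import pandas as pd

def clean_data(data: pd.DataFrame, target: str) -> pd.DataFrame:
    """Sketch of Function 1: null-value analysis and redundant-column removal."""
    # Null Value Analysis: count and percentage of nulls per column
    null_summary = pd.DataFrame({
        "null_count": data.isnull().sum(),
        "null_pct": data.isnull().mean() * 100,
    }).sort_values("null_pct", ascending=False)
    print(null_summary)

    # Column Removal: drop columns that are 100% null
    data = data.dropna(axis=1, how="all")

    # hh_size = pop_20 / hh_20 already captures these columns' information,
    # so drop the redundant ones
    return data.drop(columns=["hh_20", "pop_20", "hh_size_est"], errors="ignore")
```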
Function 2: `clean_data_v2(data)`
- `None` Entries Handling: Counts and percentages of `None` entries per column are calculated and sorted. Rows where `min_occ_date` or `cltdob_fix` are `None` are removed
- Data Type Optimization: Converts all float64 columns to float32 for efficiency
- Column Dropping: The `clntnum` column, a unique identifier, is dropped as it does not contribute to the analysis
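A hedged sketch of this function, assuming the column names above and straightforward pandas operations:

```python
import pandas as pd

def clean_data_v2(data: pd.DataFrame) -> pd.DataFrame:
    """Sketch of Function 2: None handling, dtype optimization, ID removal."""
    # Count and sort None/NaN percentages per column
    print((data.isnull().mean() * 100).sort_values(ascending=False))

    # Remove rows where min_occ_date or cltdob_fix is missing
    data = data.dropna(subset=["min_occ_date", "cltdob_fix"])

    # Convert float64 columns to float32 for memory efficiency
    float_cols = data.select_dtypes(include="float64").columns
    data = data.astype({col: "float32" for col in float_cols})

    # Drop the unique client identifier, which carries no predictive signal
    return data.drop(columns=["clntnum"], errors="ignore")
```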
Function 3: `clean_target_column(data, target_column_name)`
This function is dedicated to preprocessing the target column of the dataset.
We believe that the age of clients influences their purchasing decisions, hence we added a new column containing values calculated by subtracting `cltdob_fix` from `min_occ_date`.
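The age feature could be derived roughly as follows; the helper name `add_client_age` and the output column `client_age` are placeholders, not the actual names used in our notebook:

```python
import pandas as pd

def add_client_age(data: pd.DataFrame) -> pd.DataFrame:
    """Derive an approximate client age (in years) at min_occ_date."""
    min_occ = pd.to_datetime(data["min_occ_date"])
    dob = pd.to_datetime(data["cltdob_fix"])
    # client_age is a placeholder column name for the derived feature
    data["client_age"] = (min_occ - dob).dt.days / 365.25
    return data
```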
Total percentage of null values in the DataFrame: 22.6%
32 columns with > 90% null values
83 columns with > 50% null values
Skewed Distributions
Since our data contained some features with right-skewed distributions, and many features with binary values (0s and 1s), we adopted median data imputation to fill the null values. This is because median imputation provides more representative values for features with only 0 and 1 values, and is also robust in the presence of skewed data distributions and outliers.
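A minimal sketch of this imputation step using scikit-learn's `SimpleImputer`, assuming the cleaned feature DataFrame is passed in as `X`:

```python
import pandas as pd
from sklearn.impute import SimpleImputer

def impute_median(X: pd.DataFrame) -> pd.DataFrame:
    """Fill remaining nulls with per-column medians, which are robust to
    right-skewed distributions and keep binary 0/1 columns on valid values."""
    imputer = SimpleImputer(strategy="median")
    return pd.DataFrame(imputer.fit_transform(X), columns=X.columns, index=X.index)
```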
We implemented a combined under-over sampling strategy to create a more balanced dataset to improve our model's ability to predict the minority class instances without losing significant information.
Under-Sampling
We first applied Random Under-Sampling to reduce the size of the overrepresented class. This approach helps in balancing the class distribution and reducing the training dataset size, which can be beneficial for computational efficiency.
Over-Sampling with SMOTENC
After under-sampling, we used SMOTENC (Synthetic Minority Over-sampling Technique for Nominal and Continuous data) for over-sampling the minority class. Unlike basic over-sampling techniques, SMOTENC generates synthetic samples for the minority class in a more sophisticated manner, considering both nominal and continuous features.
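A sketch of the combined strategy with imbalanced-learn; the `sampling_strategy` ratios and the way the categorical column indices are obtained are illustrative assumptions, not our exact settings:

```python
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import SMOTENC

def resample(X, y, categorical_idx):
    """Combined under/over-sampling sketch (ratios are illustrative)."""
    # Step 1: Random Under-Sampling shrinks the majority class
    X_under, y_under = RandomUnderSampler(
        sampling_strategy=0.1, random_state=42
    ).fit_resample(X, y)

    # Step 2: SMOTENC synthesizes minority samples, treating the columns at
    # categorical_idx as nominal and the rest as continuous
    smote_nc = SMOTENC(
        categorical_features=categorical_idx,
        sampling_strategy=0.5,
        random_state=42,
    )
    return smote_nc.fit_resample(X_under, y_under)
```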
One of our primary challenges was to decipher the most influential factors from a high-dimensional dataset that originally contained over 300 columns (200+ after data cleaning).
Utilizing a Strong Classifier:
We employed the XGBClassifier, a gradient boosting framework renowned for its effectiveness in classification tasks and its capability to rank feature importance.
SelectFromModel Methodology:
The SelectFromModel method was applied in tandem with the XGBClassifier. This method analyzes the feature importance scores generated by the classifier and retains only the most significant features. We chose to keep the top 40 features, retaining enough features to capture the diverse aspects of customer behavior while avoiding model over-complexity and potential overfitting.
Computational Efficiency:
Recursive Feature Elimination (RFE) is inherently iterative and computationally demanding, especially with a large number of features. In contrast, SelectFromModel offers a more computationally efficient alternative.
Preserving Interpretability:
While PCA is effective for reducing dimensionality, it transforms the original features into principal components, which can be challenging to interpret, especially in a business context where understanding specific feature influences is crucial. SelectFromModel maintains the original features, making the results more interpretable and actionable.
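A sketch of this selection step; the XGBClassifier settings shown are placeholders, but `threshold=-np.inf` combined with `max_features=40` is one way to make `SelectFromModel` keep exactly the top 40 features:

```python
import numpy as np
from xgboost import XGBClassifier
from sklearn.feature_selection import SelectFromModel

def select_top_features(X, y, k=40):
    """Keep the k features the XGBClassifier ranks as most important."""
    # threshold=-np.inf disables the importance cutoff so that exactly
    # max_features columns are retained
    selector = SelectFromModel(
        XGBClassifier(n_estimators=200, eval_metric="logloss", random_state=42),
        max_features=k,
        threshold=-np.inf,
    )
    selector.fit(X, y)
    selected = X.columns[selector.get_support()]
    return X[selected], list(selected)
```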
Optuna is a hyperparameter optimization framework that employs a Bayesian optimization technique to search the hyperparameter space. Unlike traditional grid search, which exhaustively tries all possible combinations, or random search, which randomly selects combinations within the search space, Optuna intelligently navigates the search space by considering the results of past trials.
For our XGBoost classifier, the following key hyperparameters were considered:
- `n_estimators`: The number of trees in the ensemble.
- `max_depth`: The maximum depth of the trees.
- `learning_rate`: The step size shrinkage used to prevent overfitting.
- `subsample`: The fraction of samples used to fit each tree.
- `colsample_bytree`: The fraction of features used when constructing each tree.
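A sketch of the Optuna study, assuming `X_res`/`y_res` from the resampling step above; the search ranges and `n_trials` are illustrative, and the F1-score is evaluated here with cross-validation for simplicity:

```python
import optuna
from xgboost import XGBClassifier
from sklearn.model_selection import cross_val_score

def objective(trial):
    # Search ranges are illustrative; X_res / y_res are the resampled data
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 100, 1000),
        "max_depth": trial.suggest_int("max_depth", 3, 10),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
        "subsample": trial.suggest_float("subsample", 0.5, 1.0),
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.5, 1.0),
    }
    model = XGBClassifier(**params, eval_metric="logloss", random_state=42)
    return cross_val_score(model, X_res, y_res, scoring="f1", cv=3).mean()

study = optuna.create_study(direction="maximize")  # uses the TPE sampler by default
study.optimize(objective, n_trials=50)
print(study.best_params, study.best_value)
```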
The optimization process resulted in a set of hyperparameters that improved the F1-score by approximately 10% over the baseline model, indicating a better balance of precision and recall.
Local Interpretable Model-Agnostic Explanations (LIME) elucidates the decision-making of complex machine learning models by approximating them with simpler, interpretable models around specific instances. This technique demystifies "black box" models, making their predictions transparent and aiding in better decision-making.
Reference:
Papers with Code - LIME Explained. (2016). https://paperswithcode.com/method/lime
In our study, we applied LIME 100 times across different instances, averaging the explanation weights to counteract the randomness in data perturbation and ensure more consistent insights. This method helps pinpoint the features most impactful to the model's predictions, providing a stable and clear understanding of its behavior. The averaging process across multiple LIME iterations reduces variability, allowing us to identify and rely on the consistent influence of specific features on the model's decisions.
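A sketch of this repeated-LIME procedure; the class names, `num_features`, and the way the 100 instances are sampled are assumptions for illustration:

```python
from lime.lime_tabular import LimeTabularExplainer

def averaged_lime_weights(model, X_train, X_explain, n_runs=100):
    """Average LIME explanation weights over n_runs instances to smooth out
    the randomness of LIME's local perturbation."""
    explainer = LimeTabularExplainer(
        X_train.values,
        feature_names=list(X_train.columns),
        class_names=["no_purchase", "purchase"],  # assumed label names
        mode="classification",
    )
    totals = {}
    for _, row in X_explain.sample(n=n_runs, random_state=42).iterrows():
        exp = explainer.explain_instance(row.values, model.predict_proba, num_features=10)
        for feature, weight in exp.as_list():
            totals[feature] = totals.get(feature, 0.0) + weight
    # Rank features by mean absolute contribution across runs
    averaged = {f: w / n_runs for f, w in totals.items()}
    return dict(sorted(averaged.items(), key=lambda kv: abs(kv[1]), reverse=True))
```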
Since we only had 2-3 days to work on this datathon, here are some approaches we would like to take if given more time:
- Use more advanced Optuna features like pruners and samplers for further refinement
- Use other oversampling techniques like ADASYN, which adaptively generates minority samples according to their distributions