suarez96/Heart_Statlog_Classifier

SVM vs Random Forest?

C. Augusto Suarez: University of New Brunswick, Dec 2019

This project is a comparison of a support vector machine and a random forest classifier on the same dataset: the BNG_Heart_Statlog data from OpenML. The dataset contains one million entries of 13 features plus a 14th "class" column: the Absence/0 or Presence/1 of heart disease in the subject. The project is built in Python using Jupyter notebooks, and the necessary environment can be built from the heart_statlog.yml file included. I also included the dataset for two reasons:

  • The dataset is small enough to upload to GitHub
  • The dataset is already openly available on OpenML. For private datasets, I strongly believe that the enclosed data should be protected and kept private under all circumstances.

The Process

Initial Visualizations

We take an initial gander at the dataset and observe 13 features as input, with a binary output: the presence or absence of heart disease. We can see that all of our input data is numerical, but our output data is categorical, of type string.

[Figure: initial overview of the dataset]
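This first check can be sketched as below, using a tiny toy frame as a stand-in for the real CSV (the column names here are illustrative, not the dataset's exact headers):

```python
import pandas as pd

# Toy stand-in for the BNG_Heart_Statlog frame: numeric features
# plus a string-valued "class" column.
df = pd.DataFrame({
    "age": [63.0, 44.0, 55.0],
    "resting_blood_pressure": [145.0, 130.0, 132.0],
    "class": ["absent", "present", "absent"],
})

# Every feature column is numeric; the output column is an object (string) dtype.
print(df.dtypes)
print(df["class"].unique())
```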

We can easily fix this problem by using a nominal converter to get a numerical output instead, as seen below:

[Figure: the class column after nominal-to-numerical conversion]
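The write-up does not name the exact converter used; one plausible implementation is scikit-learn's LabelEncoder, which assigns integer labels in alphabetical order of the class names:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"class": ["absent", "present", "absent", "present"]})

# LabelEncoder sorts classes alphabetically, so "absent" -> 0 and
# "present" -> 1, matching the Absence/0, Presence/1 convention.
le = LabelEncoder()
df["class"] = le.fit_transform(df["class"])
print(df["class"].tolist())  # [0, 1, 0, 1]
```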

After we've done our initial exploration, we continue by looking at visualizations of what I think are intuitive relationships. In this case, I looked at the distribution of ages among both classes. Here we can see that, although not a deciding factor, the age distributions of both classes are, in fact, slightly different. We can also see that the spreads of resting blood pressure are similar among both classes but not identical.

[Figure: age and resting blood pressure distributions for both classes]

Model Selection

After this, I decided on the models that I wanted to compare for this particular dataset. Even though neural networks have had remarkable success classifying similar data, I wanted to try machine learning models that I had not implemented before, so I chose to look at SVMs and Random Forests.

  • SVMs, because they have the ability to learn nonlinear relationships between features using kernel methods. For the same reason, we can also test one early without having to preprocess our data excessively.
  • Random Forests, because if the feature relationships are not high-order, the model should perform better, train faster, and scale to a larger subset of the data with higher accuracy.

Model Training

As mentioned previously, we could use the SVM 'kernel trick' to predict with almost no data preparation. And, as we can see below, after tuning only a few hyperparameters the SVM already classifies the small subset with high accuracy. Below is the confusion matrix for 20,000 samples and a radial basis function (RBF) kernel.

[Figure: confusion matrix, 20,000 samples, RBF kernel]
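A minimal sketch of this first run, substituting a small synthetic 13-feature dataset for the real subset (the C value here is illustrative, not the value the project settled on):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in: 13 features, binary target, like the real data.
X, y = make_classification(n_samples=2000, n_features=13, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# RBF kernel, almost no preprocessing needed thanks to the kernel trick.
clf = SVC(kernel="rbf", C=10, gamma="scale")
clf.fit(X_tr, y_tr)

cm = confusion_matrix(y_te, clf.predict(X_te))
print(cm)
```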

SVM Tuning

In this section, we describe the series of steps taken in an attempt to increase the model's accuracy and scale it to the desired size of one million entries (the entire dataset). I used the parameters from the best trainings to fine-tune my SVM in each following group of trainings. For this, I made use of scikit-learn's GridSearchCV, which greatly streamlined my workflow thanks to its capacity to automate several trainings at once; for simple machine learning tasks, I highly recommend it. The tuning steps (almost all of them, at least) were as follows:

  • We increased the sample size from 20,000 to 40,000 and added the default 3rd-degree polynomial kernel to our kernel search-space parameters. We used a slack-variable selection of 10, 50 and 100, as well as a selection of 1k, 5k and 10k for the maximum number of iterations. We started the tuning with a 2-fold cross-validation in our grid search. Training time: 5 min, 46 s.

[Figure: initial tuning results]

  • From these results, we increased the resolution of the slack variable to go from 10 to 100 in increments of 10, and changed the max iterations to 5k, 7.5k, and 10k. Training time: 23 min, 18 s.

[Figure: increased slack resolution results]

  • Then we moved from 2-fold to 3-fold cross-validation in an attempt to check for overfitting, with a slack of only 90 and 7.5k iterations. Training time: 2 min, 10 s.

[Figure: 3-fold cross-validation results]

  • Added the 'linear' kernel to the search-space parameters and went up to 4-fold validation. Training time: 2 min, 46 s.

[Figure: 4-fold cross-validation with linear kernel]

  • Removed the 'linear' and 'poly' kernels from the search-space parameters and added 'sigmoid' to check it as well. Generated a confusion matrix and classification report to see whether the model "performs as advertised". Training time: 4 min, 12 s.

[Figures: rbf/sigmoid grid search results; confusion matrix and classification report]

  • Here is where I wanted to see if the model would scale up. As such, I decided to increase the sample size from 40,000 to 100,000 samples and let it run overnight. I was also using only 'rbf' for the kernel parameter and a slack-variable selection of 80 to 100 in increments of 5. Training time: 33 min, 32 s.

[Figures: SVM at 100,000 samples; confusion matrix and classification report]
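The grid searches above can be sketched as follows, again on synthetic stand-in data and using the first round's search space (the later rounds varied the grids as described):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Small synthetic stand-in so the sketch runs in seconds.
X, y = make_classification(n_samples=500, n_features=13, random_state=0)

param_grid = {
    "kernel": ["rbf", "poly"],        # 'poly' defaults to degree 3
    "C": [10, 50, 100],               # the "slack variable" selection
    "max_iter": [1000, 5000, 10000],  # the 1k/5k/10k iteration caps
}

# 2-fold cross-validation, as in the first tuning round.
search = GridSearchCV(SVC(), param_grid, cv=2)
search.fit(X, y)
print(search.best_params_)
```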

Here we reach a point where we need to do some more work before we can continue scaling.

We can see that the increase in samples fed into the SVM training took a drastic toll on our accuracy metrics. This happens because, as our training data grows, so does the number of support vectors, meaning that our classification boundaries have a harder time generalizing to unseen data. To attempt to fix this, I tried several approaches, separately and in a pipeline:

  • The first thing I tried was a simple normalization: every feature was normalized to mean 0 and standard deviation 1.

  • When this failed to make a noticeable impact, I looked at principal component analysis (PCA) and attempted to train on the new transformed, PCA-fit data. I also tried tuning the number of iterations the PCA performed on the data.

  • When PCA failed, I tried undersampling AND oversampling techniques to see if it was a class-imbalance problem. For undersampling I used random undersampling, which takes a random subset of the majority class of equal size to the under-represented class to account for the imbalance. For oversampling, I used the imbalanced-learn implementation of the Synthetic Minority Oversampling TEchnique (SMOTE) and applied it to the minority class. This technique takes points in the minority class, calculates the difference vector to each of their K nearest neighbors, then creates a new data point somewhere along that vector by multiplying it by a random number between 0 and 1.
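The SMOTE step just described can be sketched from scratch with NumPy; this is a simplified illustration of the idea, not the library implementation the project actually used:

```python
import numpy as np

def smote_sample(X_min, k=5, n_new=10, rng=None):
    """Generate synthetic minority-class points: pick a minority point,
    take the difference vector to one of its k nearest neighbours, and
    step a random fraction (0..1) along that vector."""
    rng = np.random.default_rng(rng)
    new_points = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        # Distances from point i to every minority point.
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbours = np.argsort(d)[1:k + 1]  # skip the point itself
        j = rng.choice(neighbours)
        step = rng.random()                  # uniform in [0, 1)
        new_points.append(X_min[i] + step * (X_min[j] - X_min[i]))
    return np.array(new_points)

# Toy minority class: 20 points with 13 features.
X_min = np.random.default_rng(0).normal(size=(20, 13))
synthetic = smote_sample(X_min, k=5, n_new=10, rng=1)
print(synthetic.shape)
```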

No combination of these techniques produced adequate results, so I turned my attention to my second machine learning algorithm: the Random Forest classifier.

Random Forest Tuning

  • After the initial tuning stages of the random forest, I arrived surprisingly quickly at an acceptable classifier for a sample size of 10,000 with SMOTE upsampling and 4-fold cross-validation.

[Figure: confusion matrix, 10,000 samples with SMOTE upsampling]

  • Maintaining the upsampling technique, I scaled the sample size from 10k to 100k samples to see how our new classifier would perform at the same size where our SVM went awry. In our search space, we used 100, 500, and 1000 as our number of estimators; None (default), 16, and 64 as our max leaf nodes; and None (default), 8, and 16 as our max depth parameter. Essentially, I wanted to see if there was a way we could maximize our accuracy metrics while using parameters that keep the training time short.

[Figures: random forest at 100,000 samples with upsampling; confusion matrix]
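That search space can be sketched with GridSearchCV on synthetic data; the grid shape mirrors the write-up, but the tree counts are shrunk here (100/500/1000 in the actual runs) so the sketch finishes in seconds:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=400, n_features=13, random_state=0)

param_grid = {
    "n_estimators": [10, 50],            # 100/500/1000 in the real runs
    "max_leaf_nodes": [None, 16, 64],
    "max_depth": [None, 8, 16],
}

# 4-fold cross-validation, matching the forest tuning above.
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=4)
search.fit(X, y)
print(search.best_params_)
```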

  • From the results above, I noticed that increasing the estimators from 500 to 1000 had no real impact on accuracy but doubled our training time. As such, I decided to see if I could improve the estimator on streamlined parameters by running the same training on the PCA-fit-transformed data. My streamlined parameters were n_estimators: 500, 1000 (honestly, I probably just forgot to remove the 1000) and max_depth: 16 and None. Training time: 3 min, 28 s.

[Figures: streamlined parameters with PCA; confusion matrix]
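Chaining PCA in front of the forest can be sketched with a scikit-learn Pipeline; the component count below is an assumption for illustration, since the write-up does not state it:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=500, n_features=13, random_state=0)

# PCA runs first, so each forest trains on the reduced feature space.
pipe = Pipeline([
    ("pca", PCA(n_components=8)),  # component count is illustrative
    ("rf", RandomForestClassifier(n_estimators=50, max_depth=16,
                                  random_state=0)),
])

scores = cross_val_score(pipe, X, y, cv=4)
print(scores.mean())
```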

  • In parallel, I also ran a training with the same search space, but without PCA and with downsampled data. The results were in line with the non-PCA, upsampled results from the step above. Training time: 32 min, 32 s.

[Figures: streamlined parameters without PCA; confusion matrix]

Note: with the same parameters, PCA (as expected) trained MUCH faster, but came with a drastic loss in accuracy, so the time gain is less enticing.

  • Now the big one: I let the model train overnight with our streamlined parameters {'n_estimators': [500], 'max_leaf_nodes': [None], 'max_depth': [16]} on the entire dataset (minus 10 rows, actually, for a reason that will be explained later). Training time: 2 hours, 13 min, 38 s.

[Figures: final random forest results; confusion matrix]
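The final run, including the 10 rows held back for the sanity check described later, can be sketched as follows (synthetic stand-in data again, so the sizes here are far smaller than the real one million rows):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=13, random_state=0)

# Hold out the last 10 rows for the final spot check.
X_train, y_train = X[:-10], y[:-10]
X_hold, y_hold = X[-10:], y[-10:]

# The streamlined parameters from the write-up.
final_rf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=None,
                                  max_depth=16, random_state=0)
final_rf.fit(X_train, y_train)

# Compare predictions on the held-out rows against their ground truths.
preds = final_rf.predict(X_hold)
print(list(zip(preds, y_hold)))
```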

The Results

The final random forest model performs well over my desired target (80%) and was able to scale to the entire one million data points! It trained in a very reasonable time and generalized well to our hold-out sets under 4-fold cross-validation. With more data preparation and tuning, I am sure I could achieve an even better result, but this is satisfactory for now.

One More Thing

Remember those last 10 data points that I held out from our dataset? I wanted to see what the model would predict about them, laid out next to their ground truths. This was more of a whimsical exercise than an actual motivated research technique, but it was still an interesting thought. The results are below:

[Figure: predictions vs. ground truths for the 10 held-out points]

As was expected of a 90% model, the random forest classifier correctly categorized 9 out of the 10 last data entries. And with that, I conclude my first investigation of the heart_statlog dataset. But now the obvious question: what else could I do? Well, with the work that's already been done, I can think of a few things that might improve our random forest classifier:

  • Cleaning our data with other techniques. Maybe a different feature extraction method using composite or filtered attributes.
  • Seeing what techniques like varimax rotation could do for our PCAs. For that matter, seeing how we could take advantage of the PCAs in any way, seeing as the dimensionality reduction drastically decreased our training time
  • Trying other sampling techniques such as ADASYN or maybe a non-random undersampling
  • Running a slightly modified random forest algorithm, such as a boosted tree

Maybe I'll come back to this at a later stage but for now, these results suffice.
