# Blood Donation Predictors Report

## Introduction

Blood donations are a vital component of saving lives, and there is an ever-growing need for healthy and clean volunteer donors (Gillespie & Hillyer, 2002). Finding donors who will donate repeatedly is a hard task for many transfusion centers (Armitage & Conner, 2001). This creates a need to understand what motivates individuals to donate blood, and whether certain factors affect whether someone chooses to donate, specifically among people who have donated before. To address this question we explore a dataset provided by Yeh et al. (2009), which indicates whether an individual donated blood or not. Each donor has four characteristics associated with them: 1) the time in months since their last donation (Recency), 2) the total number of times they have donated (Frequency), 3) the total amount of blood they have donated in cubic centimetres (Monetary), and 4) the time in months since their first donation (Time). We use this dataset to examine whether these features help predict whether an individual donated blood.

## Preliminary EDA 

Before building any models or running statistical tests, we conducted a preliminary exploratory data analysis (EDA) to gain insight into how our model might perform. Prior to the EDA we split our data into train and test sets, and then used only the train set to derive information. As shown in Table 1, there were 598 observations in our train dataset. We separated the data into two additional tables based on the target class: Table 2 for those who did not donate, and Table 3 for those who did. This separation showed that our data was imbalanced: class 1, representing candidates who did not donate, had 460 observations, versus 138 for class 2, representing individuals who did donate. We also noted from Table 1 that almost all features had a high variance, which suggested to us that they may not be exceptionally predictive.
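The split and the class-balance check described above can be sketched as follows. This is a minimal illustration, not the report's actual script: the toy values, `test_size=0.2`, and `random_state=42` are assumptions, and only the column names and the counts 460 and 138 come from the report.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data using the report's column names; the values are made up.
toy_df = pd.DataFrame({
    "since_last_don": [2, 4, 11, 16, 4, 23, 2, 14, 9, 21],
    "total_dons": [6, 10, 3, 2, 7, 1, 12, 4, 5, 2],
    "total_blood": [1500, 2500, 750, 500, 1750, 250, 3000, 1000, 1250, 500],
    "since_first_don": [28, 41, 26, 35, 30, 23, 52, 40, 33, 45],
    "Class": [2, 2, 1, 1, 2, 1, 2, 1, 1, 1],
})

# Hold out a test set before any exploration (split size and seed are assumptions).
train_df, test_df = train_test_split(toy_df, test_size=0.2, random_state=42)

# Count observations per class in the train set and convert to proportions.
class_counts = train_df["Class"].value_counts().sort_index()
class_share = class_counts / class_counts.sum()

# The same proportion calculation with the class counts reported in Tables 2 and 3.
report_counts = pd.Series({1: 460, 2: 138}, name="Class")
report_share = report_counts / report_counts.sum()
print(report_share.round(3))
```

With the reported counts, class 1 makes up roughly 77% of the train set and class 2 roughly 23%, which is the imbalance described above.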

```python
import pandas as pd

# Load the training split; 'Unnamed: 0' is the row index that was saved with the CSV.
blood_df_train = pd.read_csv('../data/processed/train_data.csv').drop('Unnamed: 0', axis=1)
print('Table 1:')
blood_df_train.describe()
```

Table 1:

| | since_last_don | total_dons | total_blood | since_first_don | Class |
|---|---|---|---|---|---|
| count | 598.0 | 598.0 | 598.0 | 598.0 | 598.0 |
| mean | 9.951505 | 5.653846 | 1413.461538 | 35.0301 | 1.230769 |
| std | 8.39913 | 5.939018 | 1484.754538 | 24.345691 | 0.421678 |
| min | 0.0 | 1.0 | 250.0 | 2.0 | 1.0 |
| 25% | 4.0 | 2.0 | 500.0 | 16.0 | 1.0 |
| 50% | 9.0 | 4.0 | 1000.0 | 28.0 | 1.0 |
| 75% | 14.75 | 7.0 | 1750.0 | 50.75 | 1.0 |
| max | 74.0 | 50.0 | 12500.0 | 98.0 | 2.0 |


```python
# Summary statistics for class 1 only (did not donate).
print('Table 2:')
blood_df_train[blood_df_train['Class'] == 1].describe()
```

Table 2:

| | since_last_don | total_dons | total_blood | since_first_don | Class |
|---|---|---|---|---|---|
| count | 460.0 | 460.0 | 460.0 | 460.0 | 460.0 |
| mean | 11.315217 | 4.969565 | 1242.391304 | 36.121739 | 1.0 |
| std | 8.699697 | 4.908215 | 1227.053762 | 24.566267 | 0.0 |
| min | 0.0 | 1.0 | 250.0 | 2.0 | 1.0 |
| 25% | 4.0 | 2.0 | 500.0 | 16.0 | 1.0 |
| 50% | 11.0 | 3.0 | 750.0 | 28.0 | 1.0 |
| 75% | 16.0 | 7.0 | 1750.0 | 52.0 | 1.0 |
| max | 74.0 | 44.0 | 11000.0 | 98.0 | 1.0 |


```python
# Summary statistics for class 2 only (donated).
print('Table 3:')
blood_df_train[blood_df_train['Class'] == 2].describe()
```

Table 3:

| | since_last_don | total_dons | total_blood | since_first_don | Class |
|---|---|---|---|---|---|
| count | 138.0 | 138.0 | 138.0 | 138.0 | 138.0 |
| mean | 5.405797 | 7.934783 | 1983.695652 | 31.391304 | 2.0 |
| std | 5.175233 | 8.134998 | 2033.749576 | 23.31424 | 0.0 |
| min | 0.0 | 1.0 | 250.0 | 2.0 | 2.0 |
| 25% | 2.0 | 3.0 | 750.0 | 15.0 | 2.0 |
| 50% | 4.0 | 6.0 | 1500.0 | 28.0 | 2.0 |
| 75% | 4.0 | 10.0 | 2500.0 | 41.75 | 2.0 |
| max | 26.0 | 50.0 | 12500.0 | 98.0 | 2.0 |


In addition to the tables, we created visualizations to help us understand the distribution of the data. Though not included in this report, the EDA.ipynb file contains plots of all observations together, regardless of whether the observation indicated donated or not donated, in which we observed that almost all features had a roughly exponential distribution. This carried through to Figures 1, 2, 3 and 4 below, where we separated the features by class. Both classes had an exponential-like distribution and followed the same trend. This indicated that these features may not be particularly strong for separating the two classes, even though the data represent a binary classification problem.
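One quick numeric check of the exponential-shape observation uses only the summary statistics in Table 1. An exponential distribution has a standard deviation equal to its mean, so a ratio of the two (the coefficient of variation) near 1 is consistent with that shape. The sketch below copies the `mean` and `std` rows from Table 1; the `coef_var` column name is our own label.

```python
import pandas as pd

# Mean and standard deviation copied from Table 1 (full train set).
table1 = pd.DataFrame(
    {
        "mean": [9.951505, 5.653846, 1413.461538, 35.0301],
        "std": [8.39913, 5.939018, 1484.754538, 24.345691],
    },
    index=["since_last_don", "total_dons", "total_blood", "since_first_don"],
)

# Coefficient of variation: std / mean. A value of 1 matches an exponential distribution.
table1["coef_var"] = table1["std"] / table1["mean"]
print(table1["coef_var"].round(2))
```

The ratios come out at about 0.84 for `since_last_don`, about 1.05 for `total_dons` and `total_blood`, and roughly 0.7 for `since_first_don`, so "exponential" fits the donation-count features best and is a looser approximation for the time since first donation. This check compares only two moments and is not a formal goodness-of-fit test.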

### Figure 1.

![Figure 1](../results/since_first_don.png)

### Figure 2.

![Figure 2](../results/since_last_don.png)

### Figure 3.

![Figure 3](../results/total_blood.png)

### Figure 4.

![Figure 4](../results/total_dons.png)
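Per-class histograms like the figures above could be drawn along the following lines. This is a hedged sketch rather than the code that produced the figures: the helper `plot_feature_by_class`, the synthetic right-skewed data, and the choice of 20 shared bins are all assumptions made for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Hypothetical right-skewed stand-in data; not the real blood donation values.
toy_df = pd.DataFrame({
    "since_last_don": np.concatenate([rng.exponential(11, 300), rng.exponential(5, 100)]),
    "Class": [1] * 300 + [2] * 100,
})


def plot_feature_by_class(df, feature, class_col="Class", bins=20):
    """Overlay one normalised histogram per class for a single feature."""
    # Shared bin edges so the two classes are directly comparable.
    edges = np.histogram_bin_edges(df[feature], bins=bins)
    fig, ax = plt.subplots(figsize=(6, 4))
    for label, group in df.groupby(class_col):
        ax.hist(group[feature], bins=edges, alpha=0.5, density=True, label=f"Class {label}")
    ax.set_xlabel(feature)
    ax.set_ylabel("Density")
    ax.set_title(f"{feature} by class")
    ax.legend()
    return fig


fig = plot_feature_by_class(toy_df, "since_last_don")
```

Setting `density=True` normalises each class separately, which makes the shapes comparable despite the class imbalance; raw counts would be dominated by class 1.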

## Methodology

We implemented a decision tree model from scikit-learn to follow up on our observations from the EDA and address our research question. We chose a decision tree because it is suited to binary classification and is easily interpretable. Prior to implementing the model, we cleaned and processed the data to ensure there were no missing or erroneous values. We then selected a random subset of the class 1 portion of the training data to address the class imbalance, so that our model was trained on a dataset with a 50/50 split of classes. Once this was completed, we created a decision tree model and ran a grid search (`GridSearchCV`) with 10-fold cross-validation to tune the maximum depth hyperparameter and fit the model.
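The steps above (undersampling class 1, then tuning `max_depth` with 10-fold cross-validation) can be sketched as below. This is an illustration under stated assumptions, not the analysis script itself: the data are synthetic, and the depth range of 1 to 10, the seeds, and the variable names are all hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Hypothetical synthetic stand-in for the train set (random values, not the real data).
n_major, n_minor = 120, 40
train_df = pd.DataFrame({
    "since_last_don": rng.integers(0, 30, n_major + n_minor),
    "total_dons": rng.integers(1, 20, n_major + n_minor),
    "since_first_don": rng.integers(2, 98, n_major + n_minor),
    "Class": [1] * n_major + [2] * n_minor,
})
train_df["total_blood"] = train_df["total_dons"] * 250

# Randomly undersample class 1 so both classes have the same number of rows.
minority = train_df[train_df["Class"] == 2]
majority = train_df[train_df["Class"] == 1].sample(n=len(minority), random_state=0)
balanced = pd.concat([majority, minority]).sample(frac=1, random_state=0)  # shuffle rows

X = balanced.drop(columns="Class")
y = balanced["Class"]

# Grid search over tree depth with 10-fold cross-validation (depth range is an assumption).
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": list(range(1, 11))},
    cv=10,
    scoring="accuracy",
)
search.fit(X, y)

best_depth = search.best_params_["max_depth"]
best_cv_accuracy = search.best_score_
training_error = 1 - search.best_estimator_.score(X, y)
print(best_depth, round(best_cv_accuracy, 3), round(training_error, 3))
```

Undersampling discards most of the class 1 rows, which keeps the procedure simple but leaves less data for training.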

## Results 

As shown in Table 4, our grid search determined that the best `max_depth` setting was 7. With that value we obtained a CV accuracy score of 0.65, with a training error of 0.14 and a validation error of 0.36.


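```python
# Load the saved model results; 'Unnamed: 0' is the row index that was saved with the CSV.
print('Table 4:')
pd.read_csv('../results/analysis_result.csv').drop('Unnamed: 0', axis=1)
```

Table 4:

| | Best_max_depth | Best_CV_Score | Training_Error | Validation_Error |
|---|---|---|---|---|
| 0 | 7 | 0.65 | 0.140909 | 0.357143 |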
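To make the error figures easier to compare with the CV accuracy, the short calculation below converts them to accuracies and computes the gap between training and validation error. It assumes that "error" here means the misclassification rate, that is, 1 minus accuracy; the report does not state this explicitly.

```python
# Values copied from the results table above.
training_error = 0.140909
validation_error = 0.357143

# Assumption: error is the misclassification rate, so accuracy = 1 - error.
training_accuracy = 1 - training_error
validation_accuracy = 1 - validation_error

# A large gap between validation and training error suggests overfitting.
generalization_gap = validation_error - training_error

print(round(training_accuracy, 3), round(validation_accuracy, 3), round(generalization_gap, 3))
```

Under that assumption, the tree classifies about 86% of the training rows correctly but only about 64% of the validation rows, a gap of roughly 22 percentage points.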


## Discussion 

Based on our results, we infer that the four features, 1) time since last donation, 2) total number of donations, 3) total blood donated, and 4) time since first donation, together have some predictive power for whether an individual will donate blood. However, since our cross-validation accuracy was low (0.65) and our validation error was high (0.36), this combined predictive power is weak. We therefore would not recommend this model as a tool for predicting blood donation. We suggest that other factors may provide better predictions of whether a past donor donates again.
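One way to put the 0.65 accuracy in context is to compare it with simple baselines. The sketch below uses the class counts from Tables 2 and 3. Which baseline is the fair comparison depends on whether the scored data were balanced, which this report does not state, so both are shown as assumptions.

```python
# Class counts in the train set, taken from Tables 2 and 3.
n_class1, n_class2 = 460, 138
cv_accuracy = 0.65  # Best_CV_Score from Table 4

# Baseline on a balanced 50/50 sample: guessing either class is right half the time.
balanced_baseline = 0.5

# Baseline on the original class ratio: always predict the majority class (class 1).
majority_baseline = max(n_class1, n_class2) / (n_class1 + n_class2)

print(round(cv_accuracy - balanced_baseline, 3), round(cv_accuracy - majority_baseline, 3))
```

On balanced data the model is about 15 percentage points above chance, whereas on data with the original class ratio, always predicting "did not donate" would already reach about 77% accuracy.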

## Conclusion

Acquiring blood donations from volunteers is a crucial but difficult task. Therefore, it is important to understand what motivates individuals to donate. We assessed four predictors of donation, and though they proved to be better than random, our results do not indicate that they are robust predictors. We therefore suggest that additional information needs to be acquired to better predict blood donations.


## References

Armitage, C. J., & Conner, M. (2001). Social cognitive determinants of blood donation. Journal of Applied Social Psychology, 31(7), 1431-1457.

Gillespie, T. W., & Hillyer, C. D. (2002). Blood donors and factors impacting the blood donation decision. Transfusion Medicine Reviews, 16(2), 115-130.

Yeh, I. C., Yang, K. J., & Ting, T. M. (2009). Knowledge discovery on RFM model using Bernoulli sequence. Expert Systems with Applications, 36(3), 5866-5871.
