## Introduction
<p><img src="https://assets.datacamp.com/production/project_646/img/blood_donation.png" style="float: right;" alt="A pictogram of a blood bag with blood donation written in it" width="200"></p>
<p>Blood transfusion saves lives - from replacing lost blood during major surgery or a serious injury to treating various illnesses and blood disorders. Ensuring that enough blood is in supply whenever it is needed is a serious challenge for health professionals. According to <a href="https://www.webmd.com/a-to-z-guides/blood-transfusion-what-to-know#1">WebMD</a>, "about 5 million Americans need a blood transfusion every year".</p>
<p>Our dataset is from a mobile blood donation vehicle in Taiwan. The Blood Transfusion Service Center drives to different universities and collects blood as part of a blood drive. We want to predict whether or not a donor will give blood the next time the vehicle comes to campus.</p>
<p>The data is stored in <code>datasets/transfusion.data</code> and is structured according to the RFMTC marketing model (a variation of RFM). We'll explore what that means later in this notebook. First, let's inspect the data.</p>
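A quick way to inspect a file before loading it is to print its first few raw lines. The sketch below is only illustrative: `peek_lines` is a hypothetical helper, and it runs on a small temporary sample assembled from the column names and rows shown in this notebook's outputs, not on the real file.

```python
import os
import tempfile
from itertools import islice


def peek_lines(path, n=5):
    """Return the first n lines of a text file, without trailing newlines."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in islice(handle, n)]


# Illustrative sample only: header and rows copied from the notebook output,
# written to a temporary file so the sketch runs without the real dataset.
sample = (
    "Recency (months),Frequency (times),Monetary (c.c. blood),"
    "Time (months),whether he/she donated blood in March 2007\n"
    "2,50,12500,98,1\n"
    "0,13,3250,28,1\n"
    "1,16,4000,35,1\n"
)

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "transfusion_sample.data")
    with open(sample_path, "w", encoding="utf-8") as handle:
        handle.write(sample)
    first_lines = peek_lines(sample_path, n=2)

for line in first_lines:
    print(line)
```

Seeing a header row followed by comma-separated integers is what suggests that `pd.read_csv` with its default settings should be enough.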

#### Loading the blood donations data
<p>We now know that we are working with a typical CSV file (i.e., the delimiter is <code>,</code>, etc.). We can proceed to load the data into memory.</p>

In [2]:
# Importing pandas
import pandas as pd

# Reading in dataset
transfusion = pd.read_csv("datasets/transfusion.data")

# Printing out the first rows of our dataset
transfusion.head()

Unnamed: 0,Recency (months),Frequency (times),Monetary (c.c. blood),Time (months),whether he/she donated blood in March 2007
0,2,50,12500,98,1
1,0,13,3250,28,1
2,1,16,4000,35,1
3,2,20,5000,45,1
4,1,24,6000,77,0


#### Inspecting transfusion DataFrame
<p>Let's briefly return to our discussion of the RFM model. RFM stands for Recency, Frequency and Monetary Value, and it is commonly used in marketing to identify your best customers. In our case, our customers are blood donors.</p>
<p>RFMTC is a variation of the RFM model. Below is a description of what each column means in our dataset:</p>
<ul>
<li>R (Recency - months since the last donation)</li>
<li>F (Frequency - total number of donations)</li>
<li>M (Monetary - total blood donated in c.c.)</li>
<li>T (Time - months since the first donation)</li>
<li>a binary variable representing whether he/she donated blood in March 2007 (1 stands for donating blood; 0 stands for not donating blood)</li>
</ul>
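One thing worth noticing about these columns is that, in the rows printed by `transfusion.head()` above, Monetary looks like a fixed multiple of Frequency. The sketch below re-types those five rows by hand and checks the ratio. Whether the same relationship holds for all 748 rows is not verified here.

```python
import pandas as pd

# The five rows printed by transfusion.head() above, re-typed by hand.
rows = pd.DataFrame(
    {
        "Recency (months)": [2, 0, 1, 2, 1],
        "Frequency (times)": [50, 13, 16, 20, 24],
        "Monetary (c.c. blood)": [12500, 3250, 4000, 5000, 6000],
        "Time (months)": [98, 28, 35, 45, 77],
    }
)

# Volume per donation implied by each row.
cc_per_donation = rows["Monetary (c.c. blood)"] / rows["Frequency (times)"]
print(cc_per_donation.tolist())

# Correlation between the two columns on this small sample.
sample_corr = rows["Frequency (times)"].corr(rows["Monetary (c.c. blood)"])
print(round(sample_corr, 3))
```

If the ratio is constant across the whole dataset, the two columns carry the same information, which is worth keeping in mind when interpreting any model built on them.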
<p>It looks like every column in our DataFrame has a numeric type, which is exactly what we want when building a machine learning model. Let's verify our hypothesis.</p>

In [2]:
# Print a concise summary of transfusion DataFrame
transfusion.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 748 entries, 0 to 747
Data columns (total 5 columns):
 #   Column                                      Non-Null Count  Dtype
---  ------                                      --------------  -----
 0   Recency (months)                            748 non-null    int64
 1   Frequency (times)                           748 non-null    int64
 2   Monetary (c.c. blood)                       748 non-null    int64
 3   Time (months)                               748 non-null    int64
 4   whether he/she donated blood in March 2007  748 non-null    int64
dtypes: int64(5)
memory usage: 29.3 KB


In [22]:
transfusion.shape

(748, 5)

#### Creating blood_donation column
<p>We are aiming to predict the value in the <code>whether he/she donated blood in March 2007</code> column. Let's rename it to <code>blood_donation</code> so that it's more convenient to work with.</p>

In [3]:
# Rename whether he/she donated blood in March 2007 column as 'blood_donation' for brevity 
transfusion.rename(
    columns={'whether he/she donated blood in March 2007': 'blood_donation'},
    inplace=True
)

# Print out the first 2 rows
transfusion.head(2)

Unnamed: 0,Recency (months),Frequency (times),Monetary (c.c. blood),Time (months),blood_donation
0,2,50,12500,98,1
1,0,13,3250,28,1


#### Checking blood_donation incidence
<p>We want to predict whether or not the same donor will give blood the next time the vehicle comes to campus. The model for this is a binary classifier, meaning that there are only 2 possible outcomes:</p>
<ul>
<li><code>0</code> - the donor will not give blood</li>
<li><code>1</code> - the donor will give blood</li>
</ul>
<p>blood_donation incidence is defined as the number of cases of each individual value in a dataset. That is, how many 0s are there in the blood_donation column compared to how many 1s? blood_donation incidence gives us an idea of how balanced (or imbalanced) our dataset is.</p>

In [4]:
# Print target incidence proportions, rounding output to 3 decimal places
transfusion.blood_donation.value_counts(normalize = True).round(3)

0    0.762
1    0.238
Name: blood_donation, dtype: float64
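These proportions also give a reference point for any classifier: always predicting the majority class is right about 76% of the time. A minimal sketch of that baseline follows. The class counts (570 and 178) are an assumption inferred from the proportions above and the 748 rows, not read from the file.

```python
# Assumed class counts: 570 zeros and 178 ones. These are inferred from the
# proportions above (0.762 and 0.238 of 748 rows), not read from the file.
labels = [0] * 570 + [1] * 178

# A "majority class" baseline predicts the most common label for every donor.
majority_label = max(set(labels), key=labels.count)
baseline_predictions = [majority_label] * len(labels)

# Accuracy is the share of predictions that match the true label.
correct = sum(pred == true for pred, true in zip(baseline_predictions, labels))
baseline_accuracy = correct / len(labels)
print(majority_label, round(baseline_accuracy, 3))
```

Any model accuracy is best read against this figure rather than against 50%.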

#### Splitting transfusion into train and test datasets
<p>We'll now use the <code>train_test_split()</code> function to split the <code>transfusion</code> DataFrame.</p>
<p>The <code>blood_donation</code> incidence told us that <code>0</code>s make up 76% of our dataset. We want to keep the same structure in the train and test datasets, i.e., in both datasets about 76% of the target values should be <code>0</code>. This is very easy to do with the <code>train_test_split()</code> function from the <code>scikit-learn</code> library: all we need to do is specify the <code>stratify</code> parameter. In our case, we'll stratify on the <code>blood_donation</code> column.</p>

In [5]:
# Import train_test_split method
from sklearn.model_selection import train_test_split

# Split transfusion DataFrame into
# X_train, X_test, y_train and y_test datasets,
# stratifying on the `blood_donation` column
X_train, X_test, y_train, y_test = train_test_split(
    transfusion.drop(columns='blood_donation'),
    transfusion.blood_donation,
    test_size=0.25,
    random_state=42,
    stratify=transfusion.blood_donation
)

# Print out the first 2 rows of X_train
X_train.head(2)

Unnamed: 0,Recency (months),Frequency (times),Monetary (c.c. blood),Time (months)
334,16,2,500,16
99,5,7,1750,26

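To see what `stratify` does, one can compare the share of `0`s in the two resulting splits. The sketch below does this on a synthetic stand-in for the target column rather than on the real data. The class counts (570 and 178) and the placeholder feature are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the target column: 748 labels, roughly 76% zeros.
# The counts (570 and 178) are an assumption inferred from the proportions
# reported earlier, not values read from the file.
y = np.array([0] * 570 + [1] * 178)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature, not real data

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Share of zeros in each split; with stratification both stay close to 0.762.
train_zero_share = float(np.mean(y_tr == 0))
test_zero_share = float(np.mean(y_te == 0))
print(len(y_tr), len(y_te))
print(round(train_zero_share, 3), round(test_zero_share, 3))
```

The same check can be run on the real `y_train` and `y_test` with `value_counts(normalize=True)`.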

#### Untuned Transfusion Accuracy with XGBClassifier

Now it is time to train our M.L. model and test its accuracy. <br>
I'll first fit an `XGBClassifier` from the `xgboost` library to the training data. <br>
Then I'll compute the accuracy score on the test data with `accuracy_score` from `sklearn.metrics`.


In [11]:
import xgboost as xgb
from sklearn.metrics import accuracy_score
xgbc = xgb.XGBClassifier(objective = "binary:logistic",
                            n_estimators = 10,
                            seed = 123)
# Fitting our model to XGBClassifier
xgbc.fit(X_train, y_train)
# Creating y_pred
y_pred = xgbc.predict(X_test)
# Accuracy on the test set (accuracy_score expects y_true first, then y_pred)
print("Untuned Model's Accuracy is: ", accuracy_score(y_test, y_pred))

Untuned Model's Accuracy is:  0.7754010695187166


Our untuned model's accuracy is **0.7754010695187166**. Now, let's try to improve it by tuning hyperparameters.

#### Hyperparameter Tuning with RandomizedSearchCV
**Randomized search with cross-validation** is a hyperparameter tuning technique that evaluates random combinations of hyperparameter values, drawn from specified ranges, and keeps the combination with the best cross-validated score.[*](https://medium.com/@senapati.dipak97/grid-search-vs-random-search-d34c92946318#:~:text=Random%20search%20is%20a%20technique,configurations%20in%20the%20parameter%20space.)
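The core idea can be sketched in a few lines of plain Python. This is not how `RandomizedSearchCV` is implemented. `random_search` and `toy_score` are hypothetical names, and the toy objective (which peaks at an arbitrary value of 40) stands in for a cross-validated accuracy.

```python
import random


def random_search(param_grid, score_fn, n_iter, rng):
    """Try n_iter random combinations from param_grid and keep the best one."""
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        # Draw one value per hyperparameter, independently of the others.
        candidate = {name: rng.choice(values) for name, values in param_grid.items()}
        score = score_fn(candidate)
        if score > best_score:
            best_params, best_score = candidate, score
    return best_params, best_score


def toy_score(params):
    """Hypothetical objective: highest (zero) when n_estimators equals 40."""
    return -abs(params["n_estimators"] - 40)


grid = {"n_estimators": list(range(1, 100))}
best_params, best_score = random_search(
    grid, toy_score, n_iter=30, rng=random.Random(0)
)
print(best_params, best_score)
```

With only 30 draws from 99 values, the search may well miss the true optimum. That is the trade-off random search makes: a bounded budget in exchange for coverage that is not exhaustive. Note that this sketch samples with replacement, so it can evaluate the same combination twice.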

In [19]:
from sklearn.model_selection import RandomizedSearchCV
import numpy as np
xgbc = xgb.XGBClassifier()
trans_param_grid = {"n_estimators":np.arange(1,100,1),
                   "seed":np.arange(1,200,1)}
randomized_accuracy = RandomizedSearchCV(estimator=xgbc, param_distributions = trans_param_grid,
                                        n_iter = 30, scoring = "accuracy", cv = 4,
                                        verbose = 1)
randomized_accuracy.fit(X_train, y_train)
print("Best Params are: ", randomized_accuracy.best_params_)

Fitting 4 folds for each of 30 candidates, totalling 120 fits

Best Params are:  {'seed': 149, 'n_estimators': 12}




`Best Params are:  {'seed': 149, 'n_estimators': 12}` <br>
Let's refit the model with these parameters and see whether `accuracy_score` changes.

#### Tuned Transfusion Accuracy with XGBClassifier

In [21]:
xgbc = xgb.XGBClassifier(objective = "binary:logistic",
                            n_estimators = 12,
                            seed = 149)
# Fitting our model to XGBClassifier
xgbc.fit(X_train, y_train)
# Creating y_pred
y_pred = xgbc.predict(X_test)
# Accuracy on the test set (accuracy_score expects y_true first, then y_pred)
print("Tuned Model's Accuracy is: ", accuracy_score(y_test, y_pred))

Tuned Model's Accuracy is:  0.7807486631016043




Accuracy moved from about 0.775 to about 0.781. That may seem like a tiny difference, but in M.L. even small differences can have a large impact.
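To put the change in perspective, the two scores can be converted back into counts of correct predictions. A minimal sketch, assuming the test set holds 187 rows (25% of the 748 rows in the dataset):

```python
# Assumption: the test set holds 25% of the 748 rows, i.e. 187 rows.
n_test = 187

# Accuracy values copied from the two outputs above.
untuned_accuracy = 0.7754010695187166
tuned_accuracy = 0.7807486631016043

# Convert each accuracy back into a count of correct predictions.
untuned_correct = round(untuned_accuracy * n_test)
tuned_correct = round(tuned_accuracy * n_test)
print(untuned_correct, tuned_correct, tuned_correct - untuned_correct)
```

On a test set of this size, each additional correct prediction moves accuracy by roughly half a percentage point, so small changes in the score are best confirmed with cross-validation or a larger test set.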