# BATCH- JULY 2022

# TECHNOCOLABS- MINI PROJECT BY VINAY KUMAR KUSHWAHA

# Predict Blood Donation for Future Expectancy


<img src = "blood1.png" style="width:800px;height:400px">

# Introduction

Forecasting blood supply is a serious and recurrent problem for blood collection managers: in January 2019, "Nationwide, the Red Cross saw 27,000 fewer blood donations over the holidays than they see at other times of the year." Machine learning can learn patterns in donor data to help predict future blood donations and, in turn, help save more lives.

# Problem Statement

In this project, we work with data collected from the donor database of the Blood Transfusion Service Center in Hsin-Chu City, Taiwan. The center sends its blood transfusion service bus to a university in Hsin-Chu City about every three months to collect donated blood. The dataset consists of a random sample of 748 donors. The task is to predict whether a blood donor will donate within a given time window. We will walk through the full model-building process, from inspecting the dataset to using the TPOT library to automate the machine learning pipeline.


# Project Tasks:

* Inspecting transfusion.data file  
* Loading the blood donations data  
* Inspecting transfusion DataFrame  
* Creating target column  
* Checking target incidence  
* Splitting transfusion into train and test datasets  
* Selecting model using TPOT  
* Checking the variance  
* Log normalization  
* Training the model  
* Conclusion  

## Importing the libraries

In [15]:
#Importing the required libraries

import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")


In [16]:
#loading the dataset
data = pd.read_csv(r"C:\Users\HP\Downloads\transfusion\transfusion.data", delimiter="," )
data

Unnamed: 0,Recency (months),Frequency (times),Monetary (c.c. blood),Time (months),whether he/she donated blood in March 2007
0,2,50,12500,98,1
1,0,13,3250,28,1
2,1,16,4000,35,1
3,2,20,5000,45,1
4,1,24,6000,77,0
...,...,...,...,...,...
743,23,2,500,38,0
744,21,2,500,52,0
745,23,3,750,62,0
746,39,1,250,39,0


# Inspecting transfusion DataFrame

The columns of this dataset follow the RFM model. RFM stands for Recency, Frequency and Monetary Value, and it is commonly used in marketing to identify the best customers. In our case, the customers are blood donors.  
RFMTC is a variation of the RFM model. Below is a description of what each column means in our dataset:

* R (Recency - months since the last donation)  
* F (Frequency - total number of donations)  
* M (Monetary - total blood donated in c.c.)  
* T (Time - months since the first donation)  
* a binary variable representing whether he/she donated blood in March 2007 (1 stands for donating blood; 0 stands for not donating blood)  

It looks like every column in our DataFrame has a numeric type, which is exactly what we want when building a machine learning model. Let's verify our hypothesis.

In [17]:
#top 5 rows
data.head()

Unnamed: 0,Recency (months),Frequency (times),Monetary (c.c. blood),Time (months),whether he/she donated blood in March 2007
0,2,50,12500,98,1
1,0,13,3250,28,1
2,1,16,4000,35,1
3,2,20,5000,45,1
4,1,24,6000,77,0


In [18]:
#Summary of the dataset
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 748 entries, 0 to 747
Data columns (total 5 columns):
 #   Column                                      Non-Null Count  Dtype
---  ------                                      --------------  -----
 0   Recency (months)                            748 non-null    int64
 1   Frequency (times)                           748 non-null    int64
 2   Monetary (c.c. blood)                       748 non-null    int64
 3   Time (months)                               748 non-null    int64
 4   whether he/she donated blood in March 2007  748 non-null    int64
dtypes: int64(5)
memory usage: 29.3 KB


In [19]:
#description of the dataset
data.describe()

Unnamed: 0,Recency (months),Frequency (times),Monetary (c.c. blood),Time (months),whether he/she donated blood in March 2007
count,748.0,748.0,748.0,748.0,748.0
mean,9.506684,5.514706,1378.676471,34.282086,0.237968
std,8.095396,5.839307,1459.826781,24.376714,0.426124
min,0.0,1.0,250.0,2.0,0.0
25%,2.75,2.0,500.0,16.0,0.0
50%,7.0,4.0,1000.0,28.0,0.0
75%,14.0,7.0,1750.0,50.0,0.0
max,74.0,50.0,12500.0,98.0,1.0


In [20]:
#checking any missing value
data.isnull().sum()

Recency (months)                              0
Frequency (times)                             0
Monetary (c.c. blood)                         0
Time (months)                                 0
whether he/she donated blood in March 2007    0
dtype: int64

In [21]:
#finding the rows and columns of the dataset
data.shape

(748, 5)

In [22]:
data.corr()

Unnamed: 0,Recency (months),Frequency (times),Monetary (c.c. blood),Time (months),whether he/she donated blood in March 2007
Recency (months),1.0,-0.182745,-0.182745,0.160618,-0.279869
Frequency (times),-0.182745,1.0,1.0,0.63494,0.218633
Monetary (c.c. blood),-0.182745,1.0,1.0,0.63494,0.218633
Time (months),0.160618,0.63494,0.63494,1.0,-0.035854
whether he/she donated blood in March 2007,-0.279869,0.218633,0.218633,-0.035854,1.0
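
The correlation table above reports a coefficient of 1.0 between Frequency (times) and Monetary (c.c. blood). One plausible explanation, going only by the five rows printed by data.head(), is that every donation is recorded as 250 c.c., so Monetary is Frequency multiplied by a constant. The sketch below checks this on those five rows only. It is an assumption, not something verified here, that the pattern holds for the whole file.

```python
import pandas as pd

# The first five rows of the two columns, copied from the data.head() output.
sample = pd.DataFrame({
    "Frequency (times)": [50, 13, 16, 20, 24],
    "Monetary (c.c. blood)": [12500, 3250, 4000, 5000, 6000],
})

# Volume donated per donation for each donor.
cc_per_donation = sample["Monetary (c.c. blood)"] / sample["Frequency (times)"]
print(cc_per_donation.unique())

# A constant ratio implies a perfect linear correlation between the columns.
freq_monetary_corr = sample["Frequency (times)"].corr(sample["Monetary (c.c. blood)"])
print(freq_monetary_corr)
```

If this holds for all rows, the two columns carry the same information, which is worth keeping in mind when interpreting model coefficients.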


# Creating target column

We are aiming to predict the value in the "whether he/she donated blood in March 2007" column. Let's rename it to target so that it's more convenient to work with.

In [23]:
#Renaming the target column
data.rename(columns={'whether he/she donated blood in March 2007':'target'}, inplace=True)
data.head()

Unnamed: 0,Recency (months),Frequency (times),Monetary (c.c. blood),Time (months),target
0,2,50,12500,98,1
1,0,13,3250,28,1
2,1,16,4000,35,1
3,2,20,5000,45,1
4,1,24,6000,77,0


# Checking target incidence

We want to predict whether or not the same donor will give blood the next time the vehicle comes to campus. The model for this is a binary classifier, meaning that there are only 2 possible outcomes:  
0 - the donor will not give blood  
1 - the donor will give blood  
Target incidence is defined as the number of cases of each individual target value in a dataset. That is, how many 0s are there in the target column compared to how many 1s? Target incidence gives us an idea of how balanced (or imbalanced) our dataset is.

In [93]:
#percentage of donors and non-donors 
data.target.value_counts(normalize=True).mul(100).round(3).astype(str) + "%"

0    76.203%
1    23.797%
Name: target, dtype: object

In [25]:
#dropping the target columns
X_data = data.drop(columns='target')
X_data.head()

Unnamed: 0,Recency (months),Frequency (times),Monetary (c.c. blood),Time (months)
0,2,50,12500,98
1,0,13,3250,28
2,1,16,4000,35
3,2,20,5000,45
4,1,24,6000,77


In [26]:
#checking target columns
Y_data = data.target
Y_data.head()

0    1
1    1
2    1
3    1
4    0
Name: target, dtype: int64

# Splitting transfusion into train and test datasets

We'll now use the train_test_split() function to split the transfusion DataFrame.  
Target incidence informed us that 0s appear about 76% of the time in our dataset. We want to keep the same structure in the train and test datasets, i.e., both datasets should have a 0 target incidence of about 76%. This is easy to do with train_test_split() from the scikit-learn library: all we need to do is specify the stratify parameter. In our case, we'll stratify on the target column.
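
To illustrate what the stratify parameter does, here is a minimal sketch on hypothetical toy data (100 rows with 24 positives, chosen to mimic the incidence above; the names X_toy and y_toy are made up for this example). It compares the share of 1s in the train and test parts with and without stratification.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rows, 24 positives (roughly the incidence seen above).
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([1] * 24 + [0] * 76)

# Plain random split: class shares in train and test can drift apart.
_, _, y_tr_plain, y_te_plain = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0
)

# Stratified split: class shares are preserved in both parts.
_, _, y_tr_strat, y_te_strat = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0, stratify=y_toy
)

print("plain      train/test share of 1s:", y_tr_plain.mean(), y_te_plain.mean())
print("stratified train/test share of 1s:", y_tr_strat.mean(), y_te_strat.mean())
```

With stratification, both parts keep the 24% share of 1s, whereas the plain split only matches it approximately.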

In [27]:
# Import train_test_split method
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_data,Y_data,test_size=0.25,random_state=23,stratify=data.target)

In [28]:
#dataset of X
X_train,X_test

(     Recency (months)  Frequency (times)  Monetary (c.c. blood)  Time (months)
 658                11                  3                    750             26
 70                  2                  4                   1000             16
 294                11                  5                   1250             35
 590                 2                  1                    250              2
 125                 2                  5                   1250             34
 ..                ...                ...                    ...            ...
 124                 2                  4                   1000             26
 480                23                  1                    250             23
 434                14                  2                    500             35
 619                 4                  1                    250              4
 466                23                  4                   1000             45
 
 [561 rows x 4 columns],
      Recency

In [37]:
#dataset of Y
y_train, y_test

(658    0
 70     0
 294    0
 590    0
 125    0
       ..
 124    0
 480    0
 434    0
 619    0
 466    0
 Name: target, Length: 561, dtype: int64,
 552    1
 718    0
 109    0
 545    1
 251    0
       ..
 726    0
 577    0
 702    0
 249    0
 24     0
 Name: target, Length: 187, dtype: int64)

# Selecting model using TPOT

TPOT is a Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.

<img src = "wpt2.png" style="width:800px;height:400px">

TPOT will automatically explore hundreds of possible pipelines to find the best one for our dataset. Note, the outcome of this search will be a scikit-learn pipeline, meaning it will include any pre-processing steps as well as the model.
We are using TPOT to help us zero in on one model that we can then explore and optimize further.
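
To make "a scikit-learn pipeline" concrete, here is a minimal sketch of one built by hand rather than by TPOT. The data is synthetic (from make_classification), and the step names "scale" and "clf" as well as the choice of MinMaxScaler followed by LogisticRegression are illustrative assumptions, not the result of a TPOT search.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic binary classification data with four features.
X_syn, y_syn = make_classification(n_samples=200, n_features=4, random_state=0)

# A pipeline chains pre-processing steps and a final model into one estimator.
pipe = Pipeline([
    ("scale", MinMaxScaler()),            # pre-processing step
    ("clf", LogisticRegression(C=0.5)),   # final model
])
pipe.fit(X_syn, y_syn)

# The fitted pipeline exposes its steps and behaves like a single model.
step_names = [name for name, _ in pipe.steps]
proba = pipe.predict_proba(X_syn[:3])
print(step_names)
print(proba)
```

Because the scaler is fitted inside the pipeline, the same transformation is applied automatically whenever the pipeline is used for prediction.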

In [45]:
#Importing the tpot..

from tpot import TPOTClassifier
from sklearn.metrics import roc_auc_score
tpot = TPOTClassifier(generations=7,
                        population_size=20,
                        verbosity=2,
                        scoring='roc_auc',
                        random_state=23,
                        disable_update_check=True,
                        config_dict='TPOT light')
                    
tpot.fit(X_train, y_train)


Optimization Progress:   0%|          | 0/160 [00:00<?, ?pipeline/s]


Generation 1 - Current best internal CV score: 0.7447138135715426

Generation 2 - Current best internal CV score: 0.7447204391595637

Generation 3 - Current best internal CV score: 0.7447204391595637

Generation 4 - Current best internal CV score: 0.7447204391595637

Generation 5 - Current best internal CV score: 0.7522352200669574

Generation 6 - Current best internal CV score: 0.7612278968434919

Generation 7 - Current best internal CV score: 0.7612278968434919

Best pipeline: LogisticRegression(RobustScaler(ZeroCount(MinMaxScaler(input_matrix))), C=0.5, dual=False, penalty=l2)


TPOTClassifier(config_dict='TPOT light', disable_update_check=True,
               generations=7, population_size=20, random_state=23,
               scoring='roc_auc', verbosity=2)

In [100]:
#printing the steps of the best pipeline
print('\nBest pipeline steps:', end='\n')
for pl, (name, transform) in enumerate(tpot.fitted_pipeline_.steps, start=1):
    print(f'{pl}. {transform}')


Best pipeline steps:
1. MinMaxScaler()
2. ZeroCount()
3. RobustScaler()
4. LogisticRegression(C=0.5, random_state=23)


In [110]:
#AUC score of the TPOT pipeline on the test set

tpot_auc_score = roc_auc_score(y_test, tpot.predict_proba(X_test)[:, 1])
print(f'\nAUC score : {tpot_auc_score:.5f}')


AUC score : 0.75814


# Checking the variance

TPOT picked LogisticRegression as the best model for our dataset, preceded by three pre-processing steps (MinMaxScaler, ZeroCount and RobustScaler), giving us a test AUC score of 0.75814. This is a good starting point. Let's see if we can make it better.
One of the assumptions of linear models such as logistic regression is that the data and the features we are giving them are related in a linear fashion, or can be measured with a linear distance metric. If a feature in our dataset has a variance that is an order of magnitude or more greater than that of the other features, this could impact the model's ability to learn from the other features in the dataset.
Correcting for high variance is called normalization. It is one of the possible transformations to apply before training a model. Let's check the variance to see if such a transformation is needed.
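
The "order of magnitude" rule of thumb can be turned into a small check. The sketch below uses a hypothetical helper, high_variance_columns, and a made-up toy DataFrame. The factor of 10 relative to the median variance is an arbitrary illustrative threshold, not a standard.

```python
import pandas as pd

def high_variance_columns(df, factor=10.0):
    """Return columns whose variance is at least `factor` times the median column variance."""
    variances = df.var()
    return variances[variances >= factor * variances.median()].index.tolist()

# Toy data: column "c" is on a much larger scale than "a" and "b".
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],
    "c": [250, 500, 750, 1000, 1250],
})

flagged = high_variance_columns(toy)
print(toy.var())
print(flagged)
```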

In [47]:
X_train.var().round(2)

Recency (months)              70.93
Frequency (times)             34.51
Monetary (c.c. blood)    2156724.28
Time (months)                586.29
dtype: float64

# Log normalization

The variance of Monetary (c.c. blood) is very high in comparison to any other column in the dataset. This means that, unless accounted for, this feature may be given more weight by the model (i.e., be seen as more important) than any other feature.
One way to correct for high variance is to use log normalization.
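
As a small illustration of why the logarithm helps, the sketch below applies np.log to a handful of Monetary values copied from rows printed earlier in this notebook and compares the variance before and after. It is only a sketch on nine values. The natural log is defined here because every value is positive (the describe() output shows a minimum of 250).

```python
import numpy as np
import pandas as pd

# Monetary values taken from rows shown earlier in this notebook.
monetary = pd.Series(
    [12500, 3250, 4000, 5000, 6000, 500, 500, 750, 250],
    name="Monetary (c.c. blood)",
)

# Natural log compresses large values much more than small ones.
monetary_log = np.log(monetary)

print("variance before:", round(monetary.var(), 2))
print("variance after :", round(monetary_log.var(), 2))
```

For a column that can contain zeros, np.log1p would be a common alternative, since the log of 0 is undefined.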

In [61]:
# normalize the column
col_norm = ["Monetary (c.c. blood)"]

# Copy X_train and X_test into X_train_norm and X_test_norm
X_train_norm, X_test_norm = X_train.copy(), X_test.copy()

# Log normalization
for data_norm in [X_train_norm, X_test_norm]:
    # Add log normalized column
    data_norm['monetary_in_log'] = np.log(data_norm[col_norm])
    # Drop the original column
    data_norm.drop(columns=col_norm, inplace=True)

After the transformation, the variance looks much better, as the output below shows. Time (months) now has the largest variance, but it is not orders of magnitude higher than the rest of the variables, so we'll leave it as is.

In [111]:
X_train_norm.var().round(2)

Recency (months)      70.93
Frequency (times)     34.51
Time (months)        586.29
monetary_in_log        0.85
dtype: float64

# Training the model

In [63]:
#importing logistic regression..

from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(C=0.5, random_state=23)

# Train the model
logreg.fit(X_train_norm, y_train)

LogisticRegression(C=0.5, random_state=23)

In [109]:
# AUC score for normalized logistic model
logreg_auc_score = roc_auc_score(y_test, logreg.predict_proba(X_test_norm)[:, 1])
print(f'\nAUC score :{logreg_auc_score:.5f}' )


AUC score :0.78693


# Conclusion

The demand for blood fluctuates throughout the year. As one prominent example, blood donations slow down during busy holiday seasons. An accurate forecast of the future supply of blood allows appropriate action to be taken ahead of time, which can help save more lives.  

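We explored automatic model selection using TPOT, and the resulting pipeline reached a test AUC score of 0.75814. Note that this number cannot be compared directly with the 76% success rate that a model always predicting 0 would reach according to the target incidence: that figure is an accuracy, whereas a model that gives every donor the same score has an AUC of 0.5. We then log normalized the Monetary feature, trained a logistic regression on the result, and improved the AUC score to 0.78693.  
In the field of machine learning, even small improvements in a metric can be important.  
So, on this test set, the log-normalized model performs better than the TPOT pipeline.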
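
Accuracy and AUC are on different scales, which matters when comparing a model against the "always predict 0" baseline. The sketch below uses hypothetical labels with a 76/24 class balance (not the actual test set) to show how such a baseline scores on each metric.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Hypothetical labels with roughly the 76/24 class balance of this dataset.
y_true = np.array([0] * 76 + [1] * 24)

# A "model" that always predicts class 0 and gives everyone the same score.
always_zero_pred = np.zeros_like(y_true)
always_zero_score = np.zeros(len(y_true), dtype=float)

baseline_accuracy = accuracy_score(y_true, always_zero_pred)
baseline_auc = roc_auc_score(y_true, always_zero_score)

print("baseline accuracy:", baseline_accuracy)
print("baseline AUC     :", baseline_auc)
```

The baseline looks strong on accuracy because of the class imbalance, but it cannot rank donors at all, which is what AUC measures.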