# Predict Blood Donation for Future Expectancy

Forecasting blood supply is a serious and recurrent problem for blood collection managers: in January 2019, "Nationwide, the Red Cross saw 27,000 fewer blood donations over the holidays than they see at other times of the year." Machine learning can learn patterns in donor data to help predict future donations and, in turn, help save more lives.

In this project, we work with data from the donor database of the Blood Transfusion Service Center in Hsin-Chu City, Taiwan. About every three months, the center sends its blood transfusion service bus to a university in Hsin-Chu City to collect donated blood. The dataset, obtained from the UCI Machine Learning Repository, consists of a random sample of 748 donors.

Our task is to predict whether a blood donor will donate within a given time window. We will walk through the full model-building process, from inspecting the dataset to using the TPOT library to automate the machine learning pipeline.

#### Importing necessary libraries

In [57]:
import pandas as pd
import numpy as np
import os

#### Loading the blood donations data

In [58]:
inp_file="transfusion.data"
data=pd.read_csv(inp_file)
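
For readers without the file at hand, the loading step can be sketched on an in-memory copy of the first five records from the `data.head()` output. The `sample_csv` string is an illustrative stand-in for `transfusion.data`, assuming the file is comma-separated with a header row, which is what calling `read_csv` with default arguments implies.

```python
import io
import pandas as pd

# Illustrative stand-in for transfusion.data: the header plus the first five
# records from the data.head() output (assumes a comma-separated file).
sample_csv = """Recency (months),Frequency (times),Monetary (c.c. blood),Time (months),whether he/she donated blood in March 2007
2,50,12500,98,1
0,13,3250,28,1
1,16,4000,35,1
2,20,5000,45,1
1,24,6000,77,0
"""

# read_csv accepts any file-like object, so StringIO works like a file path.
sample = pd.read_csv(io.StringIO(sample_csv))
print(sample.shape)
print(sample.dtypes)
```

All five columns parse as integers, which matches the `int64` dtypes reported later by `data.info()`.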

#### Inspecting transfusion DataFrame

In [59]:
data.head()

| | Recency (months) | Frequency (times) | Monetary (c.c. blood) | Time (months) | whether he/she donated blood in March 2007 |
|---|---|---|---|---|---|
| 0 | 2 | 50 | 12500 | 98 | 1 |
| 1 | 0 | 13 | 3250 | 28 | 1 |
| 2 | 1 | 16 | 4000 | 35 | 1 |
| 3 | 2 | 20 | 5000 | 45 | 1 |
| 4 | 1 | 24 | 6000 | 77 | 0 |


In [60]:
data.shape

(748, 5)

In [61]:
data.describe()

| | Recency (months) | Frequency (times) | Monetary (c.c. blood) | Time (months) | whether he/she donated blood in March 2007 |
|---|---|---|---|---|---|
| count | 748.0 | 748.0 | 748.0 | 748.0 | 748.0 |
| mean | 9.506684 | 5.514706 | 1378.676471 | 34.282086 | 0.237968 |
| std | 8.095396 | 5.839307 | 1459.826781 | 24.376714 | 0.426124 |
| min | 0.0 | 1.0 | 250.0 | 2.0 | 0.0 |
| 25% | 2.75 | 2.0 | 500.0 | 16.0 | 0.0 |
| 50% | 7.0 | 4.0 | 1000.0 | 28.0 | 0.0 |
| 75% | 14.0 | 7.0 | 1750.0 | 50.0 | 0.0 |
| max | 74.0 | 50.0 | 12500.0 | 98.0 | 1.0 |


In [62]:
data.isnull().sum()

Recency (months)                              0
Frequency (times)                             0
Monetary (c.c. blood)                         0
Time (months)                                 0
whether he/she donated blood in March 2007    0
dtype: int64

#### Checking the variance and log normalization

In [63]:
data.var()

Recency (months)                              6.553543e+01
Frequency (times)                             3.409751e+01
Monetary (c.c. blood)                         2.131094e+06
Time (months)                                 5.942242e+02
whether he/she donated blood in March 2007    1.815819e-01
dtype: float64

#### The variance of Monetary is very high, so we apply log normalization

In [64]:
data['log_Monetary']=np.log(data['Monetary (c.c. blood)'])

In [65]:
data.var()

Recency (months)                              6.553543e+01
Frequency (times)                             3.409751e+01
Monetary (c.c. blood)                         2.131094e+06
Time (months)                                 5.942242e+02
whether he/she donated blood in March 2007    1.815819e-01
log_Monetary                                  8.363484e-01
dtype: float64
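
The effect of the log transform on spread can be checked on a handful of values. The sketch below uses only the five `Monetary (c.c. blood)` values from the `data.head()` output rather than the full column, so its numbers will differ from the variances printed above; `raw` and `logged` are illustrative names.

```python
import numpy as np

# Monetary values (c.c. of blood) from the first five rows of the dataset.
raw = np.array([12500, 3250, 4000, 5000, 6000], dtype=float)

# The natural log compresses large values far more than small ones.
logged = np.log(raw)

# Sample variance (ddof=1) to match the default of pandas' DataFrame.var().
print("variance before:", raw.var(ddof=1))
print("variance after: ", logged.var(ddof=1))
```

The transform leaves the ordering of donors unchanged; it only shrinks the scale so that this column no longer dwarfs the others.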

In [66]:
Monetary=data.pop('Monetary (c.c. blood)')

#### Renaming the target column

In [67]:
data.rename(columns={"whether he/she donated blood in March 2007":"target"},inplace=True)

In [68]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 748 entries, 0 to 747
Data columns (total 5 columns):
Recency (months)     748 non-null int64
Frequency (times)    748 non-null int64
Time (months)        748 non-null int64
target               748 non-null int64
log_Monetary         748 non-null float64
dtypes: float64(1), int64(4)
memory usage: 29.3 KB


In [69]:
features=['Recency (months)','Frequency (times)','Time (months)','log_Monetary']

#### Checking target incidence

In [70]:
y=data['target']
X=data[features]

In [71]:
y.head()

0    1
1    1
2    1
3    1
4    0
Name: target, dtype: int64

In [72]:
X.head()

| | Recency (months) | Frequency (times) | Time (months) | log_Monetary |
|---|---|---|---|---|
| 0 | 2 | 50 | 98 | 9.433484 |
| 1 | 0 | 13 | 28 | 8.08641 |
| 2 | 1 | 16 | 35 | 8.29405 |
| 3 | 2 | 20 | 45 | 8.517193 |
| 4 | 1 | 24 | 77 | 8.699515 |
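
Target incidence is the share of each class in the target column; the cells above only separate `y` from `X`. As a sketch of the check itself, the snippet below rebuilds a target vector with the class counts implied by the `data.describe()` output (a mean of 0.237968 over 748 rows gives 178 positives) instead of loading the file. With the real data, the same `value_counts` call on `y` should apply.

```python
import pandas as pd

# Class counts implied by describe(): mean 0.237968 * 748 rows = 178 donors.
n_rows, n_pos = 748, 178
target = pd.Series([1] * n_pos + [0] * (n_rows - n_pos), name="target")

# normalize=True returns proportions instead of raw counts.
incidence = target.value_counts(normalize=True).round(3)
print(incidence)
```

Roughly 24% of the sampled donors gave blood in March 2007, so the classes are imbalanced. This is the reason for passing `stratify=y` when splitting the data.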


#### Splitting transfusion into train and test datasets

In [73]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42,stratify=y)

In [74]:
X_train.shape

(501, 4)

In [75]:
X_test.shape

(247, 4)

In [76]:
y_train.shape

(501,)

In [77]:
y_test.shape

(247,)
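
Because `stratify=y` is passed, both partitions should keep roughly the same share of positive cases. The sketch below illustrates this on a synthetic target with the dataset's class counts (178 positives out of 748, as implied by `data.describe()`); the feature matrix `X_demo` is a placeholder of row indices, not real donor data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic target with the dataset's class balance: 178 ones, 570 zeros.
y_demo = np.array([1] * 178 + [0] * 570)
# Placeholder single-column feature matrix (row indices only).
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42, stratify=y_demo
)

# With stratification, the positive rate is nearly identical in both parts.
print(len(y_tr), len(y_te))
print(round(y_tr.mean(), 3), round(y_te.mean(), 3))
```

Without stratification, a random split of a dataset this small could leave the test set with noticeably more or fewer donors than the training set, which would make the test score harder to interpret.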

#### Selecting a model using TPOT

In [78]:
from tpot import TPOTClassifier

In [81]:
pipeline_optimizer = TPOTClassifier(generations=5, population_size=20, cv=5,
                                    random_state=42, verbosity=2)
pipeline_optimizer.fit(X_train, y_train)

Version 0.10.2 of tpot is outdated. Version 0.11.7 was released Wednesday January 06, 2021.



Generation 1 - Current best internal CV score: 0.7964882488248824
Generation 2 - Current best internal CV score: 0.7964882488248824
Generation 3 - Current best internal CV score: 0.7964882488248824
Generation 4 - Current best internal CV score: 0.7965080508050806
Generation 5 - Current best internal CV score: 0.7965080508050806

Best pipeline: LogisticRegression(PCA(PolynomialFeatures(input_matrix, degree=2, include_bias=False, interaction_only=False), iterated_power=9, svd_solver=randomized), C=10.0, dual=False, penalty=l1)


TPOTClassifier(config_dict=None, crossover_rate=0.1, cv=5,
        disable_update_check=False, early_stop=None, generations=5,
        max_eval_time_mins=5, max_time_mins=None, memory=None,
        mutation_rate=0.9, n_jobs=1, offspring_size=None,
        periodic_checkpoint_folder=None, population_size=20,
        random_state=42, scoring=None, subsample=1.0, template=None,
        use_dask=False, verbosity=2, warm_start=False)

In [82]:
print(pipeline_optimizer.score(X_test, y_test))
pipeline_optimizer.export('tpot_exported_pipeline.py')

0.7854251012145749
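
The best pipeline reported by TPOT chains three scikit-learn steps: polynomial feature expansion, PCA, and an L1-penalized logistic regression. A hand-written equivalent might look like the sketch below. It is not the contents of `tpot_exported_pipeline.py`; the `solver="liblinear"` argument is an assumption added here because recent scikit-learn releases default to a solver that does not support the L1 penalty, and the random arrays `X_fake` and `y_fake` only stand in for the training data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Steps and hyperparameters copied from the "Best pipeline" line above;
# solver="liblinear" is an assumption so that penalty="l1" is accepted.
tpot_like_pipeline = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False, interaction_only=False),
    PCA(iterated_power=9, svd_solver="randomized"),
    LogisticRegression(C=10.0, dual=False, penalty="l1", solver="liblinear"),
)

# Placeholder data with four features, standing in for X_train / y_train.
rng = np.random.default_rng(42)
X_fake = rng.uniform(0.0, 10.0, size=(200, 4))
y_fake = (X_fake[:, 0] + rng.normal(0.0, 2.0, size=200) < 5.0).astype(int)

tpot_like_pipeline.fit(X_fake, y_fake)
print(tpot_like_pipeline.predict(X_fake[:5]))
```

With four input features, the degree-2 expansion produces 14 columns (4 linear terms plus 10 squares and pairwise products), which PCA then rotates before the classifier sees them.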


#### Training the model

In [83]:
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

In [107]:
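# Note: unlike the best pipeline TPOT reported, this uses only an
# L2-regularized logistic regression (no PolynomialFeatures or PCA steps).
exported_pipeline = make_pipeline(
    LogisticRegression(solver='lbfgs', C=15.0, dual=False, penalty="l2")
)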

In [108]:
exported_pipeline.fit(X_train, y_train)
results = exported_pipeline.predict(X_test)
print(exported_pipeline.score(X_test,y_test))

0.7935222672064778


In [109]:
from sklearn.metrics import classification_report
print(classification_report(y_test, results))

              precision    recall  f1-score   support

           0       0.82      0.94      0.87       188
           1       0.63      0.32      0.43        59

   micro avg       0.79      0.79      0.79       247
   macro avg       0.72      0.63      0.65       247
weighted avg       0.77      0.79      0.77       247
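
The per-class figures in the report follow from four counts. The counts below are not printed anywhere in the notebook; they are inferred here from the rounded report and the 0.7935 accuracy (19 of 59 donors and 177 of 188 non-donors classified correctly), so treat them as a reconstruction.

```python
# Confusion-matrix counts inferred from the report (a reconstruction, not
# notebook output): rows are true classes, columns are predicted classes.
tn, fp = 177, 11   # true class 0 (support 188)
fn, tp = 40, 19    # true class 1 (support 59)

precision_1 = tp / (tp + fp)   # of predicted donors, how many donated
recall_1 = tp / (tp + fn)      # of actual donors, how many were found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)
accuracy = (tp + tn) / (tp + tn + fp + fn)

# Accuracy of always predicting the majority class (no donation).
baseline = (tn + fp) / (tp + tn + fp + fn)

print(round(precision_1, 2), round(recall_1, 2), round(f1_1, 2))
print(round(accuracy, 4), round(baseline, 4))
```

Under these counts, the model's 0.79 accuracy is only about three points above the 0.76 obtained by always predicting "no donation", and recall for donors is 0.32. Accuracy alone therefore overstates how well actual donors are identified.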

