# **Predict Blood Donation for Future Expectancy**

# **Project Tasks**

* Inspecting transfusion.data file
* Loading the blood donations data
* Inspecting transfusion DataFrame
* Creating target column
* Checking target incidence
* Splitting transfusion into train and test datasets
* Selecting model using TPOT
* Checking the variance
* Log normalization
* Training the logistic regression model
* Conclusion

## 1. Inspecting transfusion.data file

Blood transfusion saves lives, from replacing lost blood during major surgery or after a serious injury to treating various illnesses and blood disorders. Ensuring that enough blood is in supply whenever it is needed is a serious challenge for health professionals. According to WebMD, "about 5 million Americans need a blood transfusion every year."

## 2. Loading the blood donations data

In [87]:
import pandas as pd
import numpy as np
df = pd.read_csv("transfusion.csv")
df

Unnamed: 0,Recency (months),Frequency (times),Monetary (c.c. blood),Time (months),whether he/she donated blood in March 2007
0,2,50,12500,98,1
1,0,13,3250,28,1
2,1,16,4000,35,1
3,2,20,5000,45,1
4,1,24,6000,77,0
...,...,...,...,...,...
743,23,2,500,38,0
744,21,2,500,52,0
745,23,3,750,62,0
746,39,1,250,39,0


## 3. Inspecting transfusion DataFrame

In [82]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 748 entries, 0 to 747
Data columns (total 5 columns):
 #   Column                                      Non-Null Count  Dtype
---  ------                                      --------------  -----
 0   Recency (months)                            748 non-null    int64
 1   Frequency (times)                           748 non-null    int64
 2   Monetary (c.c. blood)                       748 non-null    int64
 3   Time (months)                               748 non-null    int64
 4   whether he/she donated blood in March 2007  748 non-null    int64
dtypes: int64(5)
memory usage: 29.3 KB


In [83]:
df.shape

(748, 5)

In [84]:
df.isnull().sum()

Recency (months)                              0
Frequency (times)                             0
Monetary (c.c. blood)                         0
Time (months)                                 0
whether he/she donated blood in March 2007    0
dtype: int64

## 4. Creating target column

In [2]:
df1 = df.rename(columns={'whether he/she donated blood in March 2007': 'Target'})
df1

Unnamed: 0,Recency (months),Frequency (times),Monetary (c.c. blood),Time (months),Target
0,2,50,12500,98,1
1,0,13,3250,28,1
2,1,16,4000,35,1
3,2,20,5000,45,1
4,1,24,6000,77,0
...,...,...,...,...,...
743,23,2,500,38,0
744,21,2,500,52,0
745,23,3,750,62,0
746,39,1,250,39,0


## 5. Checking target incidence

In [3]:
print("-------------Total percentage of 0 and 1")
df1["Target"].value_counts(normalize= True)*100


-------------Total percentage of 0 and 1


0    76.203209
1    23.796791
Name: Target, dtype: float64

In [4]:
# Features: every column except the last one
X = df1.iloc[:, :-1]
# Target: the last column
Y = df1.iloc[:, -1]


## 6. Splitting transfusion into train and test datasets


In [80]:
from sklearn.model_selection import train_test_split

In [64]:
X_train, X_test, Y_train, Y_test = train_test_split(
    X,
    Y,
    test_size=0.25,
    random_state=42,
    stratify=Y
)
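
The `stratify=Y` argument is meant to keep the target incidence from section 5 roughly equal in both splits. The following is a minimal sketch of that behaviour on synthetic stand-in data, assuming the 23.8% incidence over 748 rows corresponds to 178 positive rows; the names `y_demo`, `X_demo`, and the `*_rate` variables are illustrative and not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same class balance as Target (assumed 178 of 748 positives)
y_demo = pd.Series([1] * 178 + [0] * 570, name="Target")
X_demo = pd.DataFrame({"feature": np.arange(len(y_demo))})

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo,
    y_demo,
    test_size=0.25,
    random_state=42,
    stratify=y_demo,  # preserve the share of 1s in both splits
)

# Share of positive rows overall, in the training split, and in the test split
overall_rate = y_demo.mean()
train_rate = y_tr.mean()
test_rate = y_te.mean()

print(f"overall: {overall_rate:.3f}  train: {train_rate:.3f}  test: {test_rate:.3f}")
```

Running the same three `mean()` checks on `Y`, `Y_train`, and `Y_test` would confirm the split on the real data.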


In [65]:
X_test

Unnamed: 0,Recency (months),Frequency (times),Monetary (c.c. blood),Time (months)
41,2,5,1250,16
682,11,2,500,25
532,4,8,2000,28
538,2,8,2000,38
153,2,1,250,2
...,...,...,...,...
621,4,1,250,4
371,14,3,750,31
484,23,1,250,23
655,16,16,4000,77


## 7. Selecting model using TPOT

In [66]:
from tpot import TPOTClassifier

In [67]:

from sklearn.metrics import roc_auc_score

tpot = TPOTClassifier(
    generations=5,
    population_size=20,
    verbosity=2,
    scoring='roc_auc',
    random_state=42,
    disable_update_check=True,
    config_dict='TPOT light'
)


In [72]:
tpot.fit(X_train, Y_train)
tpot_auc_score = roc_auc_score(Y_test, tpot.predict_proba(X_test)[:, 1])
print(f'\nAUC score: {tpot_auc_score:.4f}')
print('\nBest pipeline steps:')
for idx, (name, transform) in enumerate(tpot.fitted_pipeline_.steps, start=1):
    print(f'{idx}. {transform}')



Generation 1 - Current best internal CV score: 0.7422459184429089
Generation 2 - Current best internal CV score: 0.7422459184429089
Generation 3 - Current best internal CV score: 0.7422459184429089
Generation 4 - Current best internal CV score: 0.7422459184429089
Generation 5 - Current best internal CV score: 0.7423330644124079
Best pipeline: LogisticRegression(input_matrix, C=0.1, dual=False, penalty=l2)

AUC score: 0.7853

Best pipeline steps:
1. LogisticRegression(C=0.1, random_state=42)
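
Both the TPOT search and the test-set evaluation are scored with ROC AUC. As a small illustration of what that number measures, the sketch below computes it on four made-up predictions and compares it with the fraction of (positive, negative) pairs in which the positive example receives the higher score. The arrays `y_true` and `scores` are toy values chosen for illustration, not notebook data.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy labels and predicted probabilities (illustrative values only)
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

auc = roc_auc_score(y_true, scores)

# Equivalent pairwise view: how often a positive outranks a negative
# (there are no tied scores in this toy example)
pos = scores[y_true == 1]
neg = scores[y_true == 0]
pairwise = np.mean([p > n for p in pos for n in neg])

print(f"roc_auc_score: {auc:.2f}, pairwise ranking rate: {pairwise:.2f}")
```

Because the metric depends only on how rows are ranked, it is less sensitive to the 76/24 class imbalance than plain accuracy would be.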


## 8. Checking the variance

In [76]:
X_train.var().round(3)

Recency (months)              66.929
Frequency (times)             33.830
Monetary (c.c. blood)    2114363.700
Time (months)                611.147
dtype: float64

## 9. Log normalization

In [77]:
X_train_normed, X_test_normed = X_train.copy(), X_test.copy()
col_to_normalize = 'Monetary (c.c. blood)'


for df_ in [X_train_normed, X_test_normed]:
    df_['Monetary_log'] = np.log(df_[col_to_normalize])
    df_.drop(columns=col_to_normalize, inplace=True)

X_train_normed.var().round(3)

Recency (months)      66.929
Frequency (times)     33.830
Time (months)        611.147
Monetary_log           0.837
dtype: float64
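
The effect of the log transform can also be checked by hand. The sketch below uses only the nine rows visible in the section 2 preview, not the full training set, so its variances will differ from the ones printed above; the names `sample`, `raw_var`, `log_var`, and `is_multiple` are introduced here for illustration.

```python
import numpy as np
import pandas as pd

# The nine rows shown in the section 2 preview (not the full dataset)
sample = pd.DataFrame({
    "Frequency (times)": [50, 13, 16, 20, 24, 2, 2, 3, 1],
    "Monetary (c.c. blood)": [12500, 3250, 4000, 5000, 6000, 500, 500, 750, 250],
})

# Variance before and after the natural-log transform
raw_var = sample["Monetary (c.c. blood)"].var()
log_var = np.log(sample["Monetary (c.c. blood)"]).var()

# In these rows Monetary equals 250 c.c. per donation
is_multiple = bool(
    (sample["Monetary (c.c. blood)"] == 250 * sample["Frequency (times)"]).all()
)

print(f"raw variance: {raw_var:.1f}")
print(f"log variance: {log_var:.3f}")
print(f"Monetary == 250 * Frequency in every sampled row: {is_multiple}")
```

`np.log` is only defined for strictly positive values. That holds in the rows shown here, where the smallest volume is 250; if a column could contain zeros, `np.log1p` would be a safer choice.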

## 10. Training the logistic regression model

In [78]:
from sklearn import linear_model

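# Same hyperparameters as the pipeline selected by TPOT
logreg = linear_model.LogisticRegression(
    C=0.1, dual=False, penalty='l2', solver='liblinear', random_state=42
)

# Train the model on the log-normalized features
logreg.fit(X_train_normed, Y_train)

# Evaluate on the held-out test set
logreg_auc_score = roc_auc_score(Y_test, logreg.predict_proba(X_test_normed)[:, 1])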
print(f'\nAUC score: {logreg_auc_score:.4f}')


AUC score: 0.7867


In [88]:
import pickle

In [90]:
# Use a context manager so the file handle is closed after writing
with open("blood_donation.pkl", "wb") as f:
    pickle.dump(logreg, f)
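
A saved model is only useful if it can be loaded back and still produces the same predictions. The following is a minimal round-trip sketch, assuming a small synthetic dataset and a temporary file rather than the notebook's `blood_donation.pkl`; `X_demo`, `y_demo`, `model`, and `restored` are hypothetical names.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data standing in for the four normalized features
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 4))
y_demo = (X_demo[:, 0] + rng.normal(scale=0.5, size=100) > 0).astype(int)

model = LogisticRegression(C=0.1, solver="liblinear", random_state=42)
model.fit(X_demo, y_demo)

# Save to a temporary file, then load the model back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.pkl")
    with open(path, "wb") as f:
        pickle.dump(model, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored model should reproduce the original probabilities
original_proba = model.predict_proba(X_demo)[:, 1]
restored_proba = restored.predict_proba(X_demo)[:, 1]
print("Predictions match:", np.allclose(original_proba, restored_proba))
```

When loading the real model, new rows need the same preprocessing as in section 9 (the `Monetary_log` column in place of the raw volume) and the same column order. Pickle files should only be loaded from trusted sources.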