
# Predicting Blood Donation

In [0]:
from google.colab import files
uploaded = files.upload()

Saving transfusion.data to transfusion.data


In [0]:
import io
import pandas as pd

df = pd.read_csv(io.BytesIO(uploaded['transfusion.data']))

df.head()

,Recency (months),Frequency (times),Monetary (c.c. blood),Time (months),whether he/she donated blood in March 2007
0,2,50,12500,98,1
1,0,13,3250,28,1
2,1,16,4000,35,1
3,2,20,5000,45,1
4,1,24,6000,77,0


In [0]:
!head -n 5 transfusion.data

Recency (months),Frequency (times),Monetary (c.c. blood),Time (months),"whether he/she donated blood in March 2007"
2 ,50,12500,98 ,1
0 ,13,3250,28 ,1
1 ,16,4000,35 ,1
2 ,20,5000,45 ,1


RFM stands for Recency, Frequency, and Monetary value. It is a framework commonly used to identify a business's most valuable customers. In this project, the customers are blood donors.

Data Dictionary:
* R (Recency - months since the last donation)
* F (Frequency - total number of donations)
* M (Monetary - total blood donated in cc)
* T (Time - months since the first donation)
* Binary variable representing whether he/she donated blood in March 2007 (1 stands for donating blood; 0 stands for not donating blood)
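
In the sample rows displayed above, Monetary is always 250 times Frequency, which suggests each donation is recorded as 250 c.c. A minimal sketch of that check, using only the five rows printed by `df.head()` (the `sample` frame and its short column names are illustrative and not part of the notebook):

```python
import pandas as pd

# The five rows displayed by df.head() above, with shortened (hypothetical) column names
sample = pd.DataFrame(
    {
        "frequency": [50, 13, 16, 20, 24],
        "monetary": [12500, 3250, 4000, 5000, 6000],
    }
)

# Volume per donation for each row
cc_per_donation = sample["monetary"] / sample["frequency"]
print(cc_per_donation.unique())

# Correlation between the two columns in this sample
print(sample["frequency"].corr(sample["monetary"]))
```

If the same ratio holds across the full dataset, Frequency and Monetary carry the same information, which is worth keeping in mind when interpreting model coefficients.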

In [0]:
df.rename(
    columns={'whether he/she donated blood in March 2007': 'target'},
    inplace=True
)

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 748 entries, 0 to 747
Data columns (total 5 columns):
Recency (months)         748 non-null int64
Frequency (times)        748 non-null int64
Monetary (c.c. blood)    748 non-null int64
Time (months)            748 non-null int64
target                   748 non-null int64
dtypes: int64(5)
memory usage: 29.3 KB


In [0]:
# Target incidence

df.target.value_counts(normalize=True).round(3)

0    0.762
1    0.238
Name: target, dtype: float64

In [0]:
# Preserving the target incidence rate in both train/test sets

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns='target'),
    df['target'],
    test_size=0.25,
    random_state=42,
    stratify=df['target']
)

In [0]:
print(y_train.value_counts(normalize=True).round(3))
print(y_test.value_counts(normalize=True).round(3))

0    0.761
1    0.239
Name: target, dtype: float64
0    0.765
1    0.235
Name: target, dtype: float64
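
The effect of `stratify` can be seen more directly by comparing a stratified split with a plain one. The sketch below uses synthetic labels, with class counts chosen (as an assumption) to match the 0.762 / 0.238 incidence reported above, and placeholder random features. It is only an illustration, not the notebook's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with roughly the same imbalance as the target (assumed for illustration)
rng = np.random.default_rng(42)
y = np.array([1] * 178 + [0] * 570)   # 748 rows, about 23.8% positives
X = rng.normal(size=(y.size, 4))      # placeholder features

# Split with and without stratification
_, _, _, y_test_plain = train_test_split(X, y, test_size=0.25, random_state=0)
_, _, _, y_test_strat = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

print(f"overall positive rate:    {y.mean():.3f}")
print(f"test rate, no stratify:   {y_test_plain.mean():.3f}")
print(f"test rate, with stratify: {y_test_strat.mean():.3f}")
```

Without stratification, the test-set rate depends on the random seed and can drift away from the overall rate. With it, the rate stays as close to the overall rate as the integer class counts allow.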


In [26]:
from tpot import TPOTClassifier
from sklearn.metrics import roc_auc_score

# Instantiate TPOTClassifier
tpot = TPOTClassifier(
    generations=5,
    population_size=20,
    verbosity=2,
    scoring='roc_auc',
    random_state=42,
    disable_update_check=True,
    config_dict='TPOT light'
)
tpot.fit(X_train, y_train)

# AUC score 
tpot_auc_score = roc_auc_score(y_test, tpot.predict_proba(X_test)[:, 1])
print(f'\nAUC score: {tpot_auc_score:.4f}')

# Print best pipeline steps
print('\nBest pipeline steps:', end='\n')
for idx, (name, transform) in enumerate(tpot.fitted_pipeline_.steps, start=1):
    # Print idx and transform
    print(f'{idx}. {transform}')

Generation 1 - Current best internal CV score: 0.7424354492343548
Generation 2 - Current best internal CV score: 0.7425225952038538
Generation 3 - Current best internal CV score: 0.7463665665033654
Generation 4 - Current best internal CV score: 0.7463665665033654
Generation 5 - Current best internal CV score: 0.7474026915476985

Best pipeline: LogisticRegression(MultinomialNB(input_matrix, alpha=0.1, fit_prior=True), C=20.0, dual=False, penalty=l1)

AUC score: 0.7721

Best pipeline steps:
1. StackingEstimator(estimator=MultinomialNB(alpha=0.1, class_prior=None,
                                          fit_prior=True))
2. LogisticRegression(C=20.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l1',
                   random_state=42, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)


TPOT selected a pipeline that stacks MultinomialNB into LogisticRegression as the best model, with a test AUC score of 0.7721.
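
The best pipeline printed above passes the output of a MultinomialNB model into LogisticRegression through TPOT's `StackingEstimator`. The general idea can be sketched with scikit-learn alone: fit the first model, append its predictions to the feature matrix, then fit the second model on the augmented matrix. The snippet below is a rough illustration on synthetic non-negative data, not TPOT's actual implementation. The helper `add_stacked_features`, the data, and the choice of the `liblinear` solver are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(42)

# Synthetic non-negative count features (MultinomialNB requires non-negative inputs)
X = rng.poisson(lam=5.0, size=(200, 4)).astype(float)
y = (X[:, 0] + rng.normal(scale=2.0, size=200) > 5).astype(int)


def add_stacked_features(model, X):
    """Hypothetical helper: prepend a fitted model's predictions and probabilities to X."""
    preds = model.predict(X).reshape(-1, 1)
    probs = model.predict_proba(X)
    return np.hstack([preds, probs, X])


# Stage 1: Naive Bayes with the hyperparameters TPOT reported
nb = MultinomialNB(alpha=0.1, fit_prior=True).fit(X, y)
X_stacked = add_stacked_features(nb, X)

# Stage 2: L1-penalized logistic regression on the augmented features
# (liblinear is used here because it supports the L1 penalty)
clf = LogisticRegression(C=20.0, penalty="l1", solver="liblinear", random_state=42)
clf.fit(X_stacked, y)

print(X.shape, "->", X_stacked.shape)
```

The second model sees the original four features plus three extra columns: the predicted class and the two class probabilities from the first model.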

# Logistic Regression

### Correcting high variance column - Log Normalization

In [27]:
# X_train's variance, rounding the output to 3 decimal places
X_train.var().round(3)

Recency (months)              66.929
Frequency (times)             33.830
Monetary (c.c. blood)    2114363.700
Time (months)                611.147
dtype: float64

Monetary has a much higher variance than the other features. Left uncorrected, it can get more weight (be treated as more important) than the other features during model fitting.

In [28]:
import numpy as np

X_train_normed, X_test_normed = X_train.copy(), X_test.copy()
col_to_normalize = 'Monetary (c.c. blood)'

# Log normalization
for df_ in [X_train_normed, X_test_normed]:
    # Add log normalized column
    df_['monetary_log'] = np.log(df_[col_to_normalize])
    # Drop the original column
    df_.drop(columns=col_to_normalize, inplace=True)

# Check the variance for X_train_normed
X_train_normed.var().round(3)

Recency (months)      66.929
Frequency (times)     33.830
Time (months)        611.147
monetary_log           0.837
dtype: float64

### Model Fitting

In [39]:
from sklearn import linear_model

# Instantiate LogisticRegression
logreg = linear_model.LogisticRegression(
    solver='liblinear',
    random_state=42
)

# Train the model on the log-normalized features
logreg.fit(X_train_normed, y_train)

# AUC score for the logistic regression model
logreg_auc_score = roc_auc_score(y_test, logreg.predict_proba(X_test_normed)[:, 1])
print(f'AUC score: {logreg_auc_score:.3f}')


AUC score: 0.789
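
The AUC score used in this notebook can be read as the probability that a randomly chosen donor (class 1) receives a higher predicted score than a randomly chosen non-donor (class 0). A small sketch on made-up labels and scores (not taken from the dataset) compares a manual pairwise count with `roc_auc_score`. The manual version ignores ties, which do not occur in this toy example.

```python
from sklearn.metrics import roc_auc_score

# Toy labels and predicted scores, invented for illustration
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

# Manual AUC: fraction of (positive, negative) pairs where the positive scores higher
pos = [s for s, t in zip(scores, y_true) if t == 1]
neg = [s for s, t in zip(scores, y_true) if t == 0]
pairs = [(p, n) for p in pos for n in neg]
manual_auc = sum(p > n for p, n in pairs) / len(pairs)

print(manual_auc)                     # 0.75
print(roc_auc_score(y_true, scores))  # 0.75
```

Three of the four positive/negative pairs are ordered correctly, so both calculations give 0.75.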


The logistic regression model reached a test AUC of 0.789, compared with 0.7721 for the TPOT pipeline, an absolute improvement of about 0.017.