## Blood Transfusion Service Center Data Set
### Data Set Information:

To demonstrate the RFMTC marketing model (a modified version of RFM), this study adopted the donor database of the Blood Transfusion Service Center in Hsin-Chu City, Taiwan. The center sends its blood transfusion service bus to a university in Hsin-Chu City to collect donated blood about every three months. To build the RFMTC model, we selected 748 donors at random from the donor database. Each of the 748 donor records includes R (Recency - months since last donation), F (Frequency - total number of donations), M (Monetary - total blood donated in c.c.), T (Time - months since first donation), and a binary variable representing whether the donor gave blood in March 2007 (1 stands for donating blood; 0 stands for not donating blood).

Source: UCI Machine Learning Repository

In [24]:
import numpy as np
import pandas as pd

# Load the data set directly from the UCI repository
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/blood-transfusion/transfusion.data')

# Rename the columns to descriptive names
df = df.rename(columns={'Recency (months)' : 'Recency - months since last donation',
                    'Frequency (times)' : 'Frequency - total number of donation',
                    'Monetary (c.c. blood)' : 'Monetary - total blood donated in c.c.',
                    'Time (months)' : 'Time - months since first donation' ,
                    'whether he/she donated blood in March 2007' : 'made_donation_in_march_2007'})

In [31]:
# Split the dataset into train and test samples.

# Separate the features (X) from the target (y)

X = df.drop(columns='made_donation_in_march_2007')
y = df['made_donation_in_march_2007']

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, shuffle=True, random_state=42)

In [52]:
df.isnull().sum()

Recency - months since last donation      0
Frequency - total number of donation      0
Monetary - total blood donated in c.c.    0
Time - months since first donation        0
made_donation_in_march_2007               0
dtype: int64

In [32]:
# Get quick initial metrics estimate.

# Using simple pandas value counts method
print(y_train.value_counts(normalize=True))

# Using sklearn accuracy_score
import numpy as np
from sklearn.metrics import accuracy_score

majority_class = y_train.mode()[0]
prediction = np.full(shape=y_train.shape, 
                     fill_value=majority_class)

accuracy_score(y_train, prediction)

0    0.768271
1    0.231729
Name: made_donation_in_march_2007, dtype: float64


0.768270944741533
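
The same majority-class baseline can also be obtained with scikit-learn's `DummyClassifier`. The sketch below is a minimal illustration on made-up labels (the `X_toy` and `y_toy` arrays are hypothetical and are not the donor data). It only shows that the `most_frequent` strategy predicts the majority class for every row, so its accuracy equals the share of that class.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical toy labels: 6 zeros and 2 ones (not the donor data)
y_toy = np.array([0, 0, 0, 1, 0, 1, 0, 0])
# DummyClassifier ignores the features, so a placeholder column is enough
X_toy = np.zeros((len(y_toy), 1))

# Always predict the most frequent class seen during fit
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_toy, y_toy)
baseline_pred = baseline.predict(X_toy)

baseline_accuracy = accuracy_score(y_toy, baseline_pred)
print(baseline_accuracy)
```

In the notebook, the equivalent call would fit on `X_train` and `y_train`; any model worth keeping should beat this number.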

In [36]:
# Data pre-processing, Feature selection and Model selection.

# Imports for pipeline

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.feature_selection import SelectKBest,f_classif
from sklearn.linear_model import LogisticRegression

## create a pipeline

pipeline = make_pipeline(
                         RobustScaler(),
                         SelectKBest(f_classif),
                         LogisticRegression(solver='lbfgs'))


In [41]:
#Model Validation

from sklearn.model_selection import GridSearchCV

param_grid = {
    'selectkbest__k': [1, 2, 3, 4],
    'logisticregression__class_weight': [None,'balanced'],
    'logisticregression__C': [.0001, .001, .01, .1, 1.0, 10.0, 100.00, 1000.0, 10000.0]
}

gridsearch = GridSearchCV(pipeline, param_grid, cv=5, scoring='accuracy', verbose=1)
gridsearch.fit(X_train, y_train)

Fitting 5 folds for each of 72 candidates, totalling 360 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 360 out of 360 | elapsed:    5.2s finished


GridSearchCV(cv=5, error_score='raise-deprecating',
       estimator=Pipeline(memory=None,
     steps=[('robustscaler', RobustScaler(copy=True, quantile_range=(25.0, 75.0), with_centering=True,
       with_scaling=True)), ('selectkbest', SelectKBest(k=10, score_func=<function f_classif at 0x122dfa158>)), ('logisticregression', LogisticRegression(C=1.0, clas...nalty='l2', random_state=None, solver='lbfgs',
          tol=0.0001, verbose=0, warm_start=False))]),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'selectkbest__k': [1, 2, 3, 4], 'logisticregression__class_weight': [None, 'balanced'], 'logisticregression__C': [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0, 10000.0]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='accuracy', verbose=1)
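
The "72 candidates" and "360 fits" reported in the log follow from the size of the grid: 4 values of `k` times 2 class weights times 9 values of `C`, each evaluated on 5 folds. A minimal sketch of that arithmetic using only the standard library (the `grid` dictionary copies the values from `param_grid`, and `n_folds` mirrors `cv=5`):

```python
from itertools import product

# Same values as param_grid in the grid search cell
grid = {
    'selectkbest__k': [1, 2, 3, 4],
    'logisticregression__class_weight': [None, 'balanced'],
    'logisticregression__C': [.0001, .001, .01, .1, 1.0, 10.0, 100.0, 1000.0, 10000.0],
}
n_folds = 5  # mirrors cv=5

# Every combination of one value per parameter is one candidate
candidates = list(product(*grid.values()))
n_candidates = len(candidates)
n_fits = n_candidates * n_folds

print(n_candidates, n_fits)
```

The count excludes the final refit on the full training set that `GridSearchCV` performs when `refit=True`.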

In [44]:
# interpret the results.

# best cross validation score

print('Cross validation score:', gridsearch.best_score_)

# Best parameters which resulted in the best score

print('Best Parameters',gridsearch.best_params_)

# Which features were selected?
selector = gridsearch.best_estimator_.named_steps['selectkbest']
all_names = X_train.columns
selected_mask = selector.get_support()
selected_names = all_names[selected_mask]
unselected_names = all_names[~selected_mask]

print('Features selected:')
for name in selected_names:
    print(name)

print()
print('Features not selected:')
for name in unselected_names:
    print(name)

Cross validation score: 0.7807486631016043
Best Parameters {'logisticregression__C': 1.0, 'logisticregression__class_weight': None, 'selectkbest__k': 4}
Features selected:
Recency - months since last donation
Frequency - total number of donation
Monetary - total blood donated in c.c.
Time - months since first donation

Features not selected:
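
With `k=4` and only four input columns, `SelectKBest` keeps every feature, so the "not selected" list is empty. To see the selector actually drop columns, here is a small sketch on synthetic data (generated with `make_classification`; the `X_demo`, `y_demo`, and `selector_demo` names are hypothetical and unrelated to the donor data) that prints each feature's ANOVA F-score and whether it survives `k=2`:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data: 4 features, 2 informative and 2 pure noise
X_demo, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=2,
    n_redundant=0, random_state=0)

# Keep the 2 features with the highest F-scores
selector_demo = SelectKBest(f_classif, k=2).fit(X_demo, y_demo)

mask_demo = selector_demo.get_support()
for index, (score, kept) in enumerate(zip(selector_demo.scores_, mask_demo)):
    status = 'selected' if kept else 'dropped'
    print(f'feature {index}: F = {score:.2f} ({status})')
```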


In [50]:
#Get the best model and check it against test data set.

# Predict with X_test features

y_pred = gridsearch.predict(X_test)

#compare predictions to y_test labels.

test_score = accuracy_score(y_test,y_pred)
print('Accuracy score on test data set', test_score)

Accuracy score on test data set 0.7540106951871658
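
Because only about 23% of the training labels are positive, accuracy alone can hide how the model treats donors who did give blood. A minimal sketch, on hypothetical toy labels rather than the notebook's `y_test` and `y_pred`, of how a confusion matrix and the recall of the positive class could be inspected with scikit-learn:

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Hypothetical toy labels and predictions (not the notebook's results)
y_true_toy = [0, 0, 0, 0, 0, 0, 1, 1]
y_pred_toy = [0, 0, 0, 0, 0, 1, 0, 1]

toy_accuracy = accuracy_score(y_true_toy, y_pred_toy)

# Rows are true classes, columns are predicted classes
toy_matrix = confusion_matrix(y_true_toy, y_pred_toy)

# Share of actual positives that the predictions recover
toy_recall = recall_score(y_true_toy, y_pred_toy)

print(toy_accuracy)
print(toy_matrix)
print(toy_recall)
```

In the notebook, the same calls would take `y_test` and `y_pred` in place of the toy lists.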


In [46]:
gridsearch.best_estimator_.named_steps['selectkbest']

SelectKBest(k=4, score_func=<function f_classif at 0x122dfa158>)