# Predict whether a person donated blood
**This notebook walks through the basic steps for using `TPOTClassifier`.**


In [1]:

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)


import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/kaggle/input/blood-transfusion/transfusion.data


In [2]:

transfusion = pd.read_csv('../input/blood-transfusion/transfusion.data')

transfusion.head()

| | Recency (months) | Frequency (times) | Monetary (c.c. blood) | Time (months) | whether he/she donated blood in March 2007 |
|---|---|---|---|---|---|
| 0 | 2 | 50 | 12500 | 98 | 1 |
| 1 | 0 | 13 | 3250 | 28 | 1 |
| 2 | 1 | 16 | 4000 | 35 | 1 |
| 3 | 2 | 20 | 5000 | 45 | 1 |
| 4 | 1 | 24 | 6000 | 77 | 0 |


**According to the data description:**
- R (Recency): months since the last donation
- F (Frequency): total number of donations
- M (Monetary): total blood donated, in c.c.
- T (Time): months since the first donation
- A binary variable indicating whether he/she donated blood in March 2007 (1 = donated; 0 = did not donate). This is the target column.
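In the five rows shown by `transfusion.head()`, Monetary is exactly 250 times Frequency, which suggests a fixed 250 c.c. per donation. The short check below uses only those five displayed rows; whether the relationship holds for every row of the dataset is an assumption that would need to be verified on the full DataFrame.

```python
# (Frequency, Monetary) pairs copied from the five rows displayed above
head_rows = [(50, 12500), (13, 3250), (16, 4000), (20, 5000), (24, 6000)]

# Blood volume per donation for each displayed row
cc_per_donation = {monetary / frequency for frequency, monetary in head_rows}

print(cc_per_donation)  # {250.0}
```

If the same ratio holds across the whole dataset, Frequency and Monetary carry the same information, which is worth keeping in mind when interpreting any model fitted on both columns.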

In [3]:
transfusion = transfusion.rename(columns={'whether he/she donated blood in March 2007': 'target'})

So `target` takes two values:
- 0: the donor did not give blood in March 2007
- 1: the donor gave blood in March 2007


In [4]:
# Check class balance via target incidence (the share of each class)
transfusion.target.value_counts(normalize=True).round(3)

0    0.762
1    0.238
Name: target, dtype: float64

Class 0 has a target incidence of 0.762, so the data is imbalanced. We therefore stratify on the target column when splitting, so that the train and test sets keep the same class proportions.
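To illustrate what stratification does, here is a minimal pure-Python sketch. It is not scikit-learn's implementation: `stratified_split` and `incidence` are hypothetical helpers, and the labels are synthetic (1000 rows with 24% positives, chosen to resemble the imbalance above). The idea is simply to split each class separately so both parts keep the same class share.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size, seed):
    """Split indices into train/test so each class keeps the same share in both."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                       # shuffle within the class
        n_test = round(len(indices) * test_size)   # per-class test quota
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx


def incidence(labels, indices):
    """Share of positive labels among the given indices."""
    return sum(labels[i] for i in indices) / len(indices)


# Synthetic labels: 760 negatives and 240 positives (an assumption for illustration)
labels = [0] * 760 + [1] * 240
train_idx, test_idx = stratified_split(labels, test_size=0.25, seed=7)

train_incidence = incidence(labels, train_idx)
test_incidence = incidence(labels, test_idx)
print(len(train_idx), len(test_idx))    # 750 250
print(train_incidence, test_incidence)  # 0.24 0.24
```

Without stratification, a purely random split could leave the smaller test set with a noticeably different share of positives, which makes the test score harder to compare with the cross-validation score.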

In [5]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    transfusion.drop(columns='target'),
    transfusion.target,
    test_size=0.25,
    random_state=7,
    stratify=transfusion.target,
)

In [6]:
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(561, 4) (187, 4) (561,) (187,)


TPOT is a Python automated machine learning (AutoML) tool that optimizes machine learning pipelines using genetic programming. TPOT automatically explores hundreds of possible pipelines to find the best one for our dataset. Note that the outcome of this search is a scikit-learn pipeline, so it includes any preprocessing steps as well as the model.

References:
- http://epistasislab.github.io/tpot/
- https://machinelearningmastery.com/tpot-for-automated-machine-learning-in-python/
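As a rough guide to how many pipelines a search evaluates, the TPOT documentation gives the total as `population_size + generations * offspring_size`, where `offspring_size` defaults to `population_size`. The sketch below applies that formula to the settings used in the next cell (`generations=10`, `population_size=20`); `total_pipelines` is a hypothetical helper written for this note, not part of TPOT.

```python
def total_pipelines(population_size, generations, offspring_size=None):
    """Number of pipelines evaluated: initial population plus offspring per generation."""
    if offspring_size is None:
        # Assumed default: offspring_size equals population_size
        offspring_size = population_size
    return population_size + generations * offspring_size


budget = total_pipelines(population_size=20, generations=10)
print(budget)  # 220
```

This agrees with the total of 220 shown in the optimization progress bar of the run below. Each of those pipelines is scored with cross-validation, so raising either setting increases run time roughly in proportion.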

In [7]:
from tpot import TPOTClassifier
from sklearn.metrics import roc_auc_score

tpot = TPOTClassifier(
    generations=10,
    population_size=20,
    verbosity=2,
    scoring='roc_auc',
    random_state=42,
    disable_update_check=True,
    config_dict='TPOT light'
)
tpot.fit(X_train, y_train)

# AUC score for tpot model
tpot_auc_score = roc_auc_score(y_test, tpot.predict_proba(X_test)[:, 1])
print(f'\nAUC score: {tpot_auc_score:.4f}')

# Print the steps of the best pipeline
print('\nBest pipeline steps:')
for idx, (_, transform) in enumerate(tpot.fitted_pipeline_.steps, start=1):
    print(f'{idx}. {transform}')

Optimization Progress:   0%|          | 0/220 [00:00<?, ?pipeline/s]


Generation 1 - Current best internal CV score: 0.7428330234896582

Generation 2 - Current best internal CV score: 0.7428330234896582

Generation 3 - Current best internal CV score: 0.7428330234896582

Generation 4 - Current best internal CV score: 0.7428330234896582

Generation 5 - Current best internal CV score: 0.7451120698726719

Generation 6 - Current best internal CV score: 0.7451120698726719

Generation 7 - Current best internal CV score: 0.7451120698726719

Generation 8 - Current best internal CV score: 0.7451120698726719

Generation 9 - Current best internal CV score: 0.7451120698726719

Generation 10 - Current best internal CV score: 0.7471714195517205

Best pipeline: LogisticRegression(MultinomialNB(input_matrix, alpha=0.01, fit_prior=True), C=0.5, dual=False, penalty=l2)

AUC score: 0.7768

Best pipeline steps:
1. StackingEstimator(estimator=MultinomialNB(alpha=0.01))
2. LogisticRegression(C=0.5, random_state=42)
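One way to read the AUC score reported above: it is the probability that a randomly chosen positive (a donor) receives a higher predicted score than a randomly chosen negative, with ties counted as half. The pure-Python sketch below computes that directly on four made-up scores (not taken from this dataset); `pairwise_auc` is a hypothetical helper, and `roc_auc_score` arrives at the same quantity through a different algorithm.

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]

    wins = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                wins += 1.0
            elif pos_score == neg_score:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy labels and scores, invented for illustration
toy_auc = pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(toy_auc)  # 0.75
```

Read this way, the test AUC of 0.7768 means the fitted pipeline ranks a donor above a non-donor in roughly 78% of such pairs. Because AUC depends only on the ranking of scores, it is not distorted by the 76/24 class imbalance the way plain accuracy can be.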
