# Project Description
Forecasting blood supply is a serious and recurrent problem for blood collection managers. In this Project, you will work with data collected from the donor database of Blood Transfusion Service Center.<p>The dataset, obtained from the Machine Learning Repository, consists of a random sample of 748 donors. Your task will be to predict if a blood donor will donate within a given time window.<p>You will look at the full model-building process: from inspecting the dataset to using the tpot library to automate your Machine Learning pipeline. To complete this Project, you need to know some Python, pandas, and logistic regression.

## Task 1: 
Inspect the file that contains the dataset.<p>•Print out the first 5 lines from datasets/transfusion.data using the head shell command.<br>Read the narrative for each task in the notebook before turning to the more detailed instructions here. To run a shell command in a notebook, prefix it with !, e.g. !ls lists the directory contents.
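In a notebook, `!head -n 5 datasets/transfusion.data` prints the first five raw lines of the file. Where the `head` command is not available (on Windows, for example), a few lines of Python can do the same job. The sketch below is only an illustration: `head_lines` is a hypothetical helper name, and the path is the one given in the task description.

```python
import os
from itertools import islice


def head_lines(path, n=5):
    """Return the first n lines of a text file without reading the rest."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in islice(f, n)]


# Path taken from the task description; adjust it if the file lives elsewhere.
path = "datasets/transfusion.data"
if os.path.exists(path):
    for line in head_lines(path):
        print(line)
```

Looking at the raw lines first shows the delimiter and the header row before the file is handed to pandas.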


In [1]:
import pandas as pd
transfusion=pd.read_csv('transfusion.data')
transfusion.head()

| | Recency (months) | Frequency (times) | Monetary (c.c. blood) | Time (months) | whether he/she donated blood in March 2007 |
|---|---|---|---|---|---|
| 0 | 2 | 50 | 12500 | 98 | 1 |
| 1 | 0 | 13 | 3250 | 28 | 1 |
| 2 | 1 | 16 | 4000 | 35 | 1 |
| 3 | 2 | 20 | 5000 | 45 | 1 |
| 4 | 1 | 24 | 6000 | 77 | 0 |


## Task 2:
Load the dataset.<br>•Import the pandas library.<br>•Load the transfusion.data file from datasets/transfusion.data and assign it to the transfusion variable.<br>•Display the first rows of the DataFrame with the head() method to verify the file was loaded correctly.<br>If you print the first few rows of data, you should see a table with only 5 columns.

In [2]:
import pandas as pd
transfusion=pd.read_csv('transfusion.data')
transfusion.head(2)

| | Recency (months) | Frequency (times) | Monetary (c.c. blood) | Time (months) | whether he/she donated blood in March 2007 |
|---|---|---|---|---|---|
| 0 | 2 | 50 | 12500 | 98 | 1 |
| 1 | 0 | 13 | 3250 | 28 | 1 |


## Task 3:
Inspect the DataFrame's structure.<br>•Print a concise summary of the transfusion DataFrame with the info() method.<p>DataFrame's info() method prints some useful information about a DataFrame:<br>•index type<br>•column types<br>•non-null values<br>•memory usage


In [3]:
print('Shape:', transfusion.shape)
# info() prints its summary itself and returns None, so it is not wrapped in print()
transfusion.info()
print('Description:', transfusion.describe())

Shape: (748, 5)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 748 entries, 0 to 747
Data columns (total 5 columns):
 #   Column                                      Non-Null Count  Dtype
---  ------                                      --------------  -----
 0   Recency (months)                            748 non-null    int64
 1   Frequency (times)                           748 non-null    int64
 2   Monetary (c.c. blood)                       748 non-null    int64
 3   Time (months)                               748 non-null    int64
 4   whether he/she donated blood in March 2007  748 non-null    int64
dtypes: int64(5)
memory usage: 29.3 KB
Description:        Recency (months)  Frequency (times)  Monetary (c.c. blood)  \
count        748.000000         748.000000             748.000000   
mean           9.506684           5.514706            1378.676471   
std            8.095396           5.839307            1459.826781   
min            0.000000           1.000000  

## Task 4:  
Rename a column.<br>Rename whether he/she donated blood in March 2007 to target for brevity.<br>•Print the first 2 rows of the DataFrame with the head() method to verify the change was done correctly.<p>By setting the inplace parameter of the rename() method to True, the transfusion DataFrame is changed in-place, i.e., the transfusion variable will now point to the updated DataFrame, as you'll verify by printing the first 2 rows.

In [4]:
transfusion.rename(columns={'whether he/she donated blood in March 2007':'target'},inplace=True)
transfusion.head(2)

| | Recency (months) | Frequency (times) | Monetary (c.c. blood) | Time (months) | target |
|---|---|---|---|---|---|
| 0 | 2 | 50 | 12500 | 98 | 1 |
| 1 | 0 | 13 | 3250 | 28 | 1 |


## Task 5:
Print target incidence.<p>•Use the value_counts() method on the transfusion.target column to print target incidence proportions, setting normalize=True and rounding the output to 3 decimal places.<p>By default, the value_counts() method returns counts of unique values. Setting normalize=True makes value_counts() return the relative frequencies of the unique values instead.

In [5]:
round(transfusion['target'].value_counts(normalize=True),3)

0    0.762
1    0.238
Name: target, dtype: float64
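
To see what `normalize=True` computes, the same proportions can be reproduced with the standard library. This is a minimal sketch on toy labels, not pandas' implementation; `value_proportions` and `toy_target` are hypothetical names introduced for illustration.

```python
from collections import Counter


def value_proportions(values, ndigits=3):
    """Relative frequency of each unique value, most frequent first."""
    counts = Counter(values)
    total = len(values)
    # most_common() orders by descending count, as value_counts() does by default
    return {
        value: round(count / total, ndigits)
        for value, count in counts.most_common()
    }


toy_target = [0, 0, 0, 1]  # toy labels, not the real dataset
print(value_proportions(toy_target))  # {0: 0.75, 1: 0.25}
```

Each proportion is simply the class count divided by the number of rows, which is why the two values in the output above sum to 1.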

## Task 6:
Split the transfusion DataFrame into train and test datasets.<br>•Import train_test_split from the sklearn.model_selection module.<br>•Split transfusion into X_train, X_test, y_train and y_test datasets, stratifying on the target column.<br>•Print the first 2 rows of the X_train DataFrame with the head() method.<p>Writing the code to split the data into the 4 datasets needed would require a lot of work. Instead, you will use the train_test_split() function from the scikit-learn library.

In [6]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(transfusion.drop(columns='target'),
                                              transfusion.target,
                                              test_size=0.25,
                                              random_state=42,
                                              stratify= transfusion.target
                                              )
x_train.head(2)

| | Recency (months) | Frequency (times) | Monetary (c.c. blood) | Time (months) |
|---|---|---|---|---|
| 334 | 16 | 2 | 500 | 16 |
| 99 | 5 | 7 | 1750 | 26 |
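
To make the effect of the `stratify` argument concrete, here is a minimal pure-Python sketch of a stratified split. It is not scikit-learn's implementation; `stratified_split` and the toy labels are hypothetical and only illustrate the idea that every class is split in the same ratio, so the target incidence stays about the same in the train and test parts.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.25, seed=42):
    """Return (train_idx, test_idx) with each class split in the same ratio."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Take the same fraction of every class for the test set
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels with a 24% positive rate (illustrative, not the real data)
labels = [0] * 76 + [1] * 24
train_idx, test_idx = stratified_split(labels)
test_positive_rate = sum(labels[i] for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx), test_positive_rate)  # 75 25 0.24
```

Without stratification, a random split of an imbalanced target can leave the test set with noticeably more or fewer positives than the training set, which makes the AUC comparison later on less reliable.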


## Task 7:
Use the TPOT library to find the best machine learning pipeline.<br>•Import TPOTClassifier from tpot and roc_auc_score from sklearn.metrics.<br>•Create an instance of TPOTClassifier and assign it to the tpot variable.<br>•Print tpot_auc_score, rounding it to 4 decimal places.<br>•Print idx and transform in the for-loop to display the pipeline steps.<p>You will adapt the classification example from TPOT's documentation. In particular, you will specify scoring='roc_auc', because this is the metric that you want to optimize for, and add random_state=42 for reproducibility. You'll also use the TPOT light configuration, with only fast models and preprocessors.<p>The nice thing about TPOT is that it has the same API as scikit-learn, i.e., you first instantiate a model and then train it using the fit method.<p>Data pre-processing affects the model's performance, and tpot's fitted_pipeline_ attribute will allow you to see what pre-processing (if any) was done in the best pipeline.

In [7]:
!pip install TPOT



In [10]:

from tpot import TPOTClassifier
from sklearn.metrics import roc_auc_score


tpot = TPOTClassifier(
    generations=5,
    population_size=20,
    verbosity=2,
    scoring='roc_auc',
    random_state=42,
    disable_update_check=True,
    config_dict='TPOT light'
)
tpot.fit(x_train, y_train)


tpot_auc_score = roc_auc_score(y_test, tpot.predict_proba(x_test)[:, 1])
print(f'\nAUC score: {tpot_auc_score:.4f}')


print('\nBest pipeline steps:')
for idx, (name, transform) in enumerate(tpot.fitted_pipeline_.steps, start=1):
    # Each step is a (name, estimator) pair; print the estimator
    print(f'{idx}. {transform}')

HBox(children=(IntProgress(value=0, description='Optimization Progress', max=120, style=ProgressStyle(descript…


Generation 1 - Current best internal CV score: 0.7422459184429089
Generation 2 - Current best internal CV score: 0.7422459184429089
Generation 3 - Current best internal CV score: 0.7422459184429089
Generation 4 - Current best internal CV score: 0.7422459184429089
Generation 5 - Current best internal CV score: 0.7423330644124079
Best pipeline: LogisticRegression(input_matrix, C=0.1, dual=False, penalty=l2)

AUC score: 0.7853

Best pipeline steps:
1. LogisticRegression(C=0.1, random_state=42)
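
The AUC score reported above has a simple interpretation: it is the probability that a randomly chosen donor who did donate receives a higher predicted score than a randomly chosen donor who did not. The sketch below computes that pairwise definition directly. It is an illustration with toy scores, not how `roc_auc_score` is implemented, and `pairwise_auc` is a hypothetical name.

```python
def pairwise_auc(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(y_true, y_score) if y == 1]
    negatives = [s for y, s in zip(y_true, y_score) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")

    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0   # positive ranked above negative
            elif pos == neg:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


# Toy labels and scores: 3 of the 4 positive/negative pairs are ordered correctly
print(pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

Because only the ordering of the scores matters, AUC does not depend on a classification threshold, which suits an imbalanced target like this one. The double loop is quadratic in the number of rows, so it is meant for understanding rather than for large datasets.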


## Task 8:
Check the variance.<br>•Print X_train's variance using the var() method and round it to 3 decimal places.<p>The pandas.DataFrame.var() method returns the column-wise variance of a DataFrame, which makes comparing the variance across the features in X_train simple and straightforward.

In [12]:

variance = round(x_train.var(),3)
print(f'\nX_train Variance :\n {variance}')


X_train Variance :
 Recency (months)              66.929
Frequency (times)             33.830
Monetary (c.c. blood)    2114363.700
Time (months)                611.147
dtype: float64
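
One way to read these numbers: in every row displayed so far, Monetary equals 250 times Frequency (for example 50 → 12500 and 13 → 3250). If that holds for the whole column, which is an assumption based only on the displayed rows, the Monetary variance is 250² = 62500 times the Frequency variance, roughly in line with 33.830 and 2114363.700 above. The sketch below checks the scaling rule on the five rows shown in Task 1, using the sample variance that `var()` computes by default.

```python
import statistics

# Frequency values from the five rows shown in Task 1
frequency = [50.0, 13.0, 16.0, 20.0, 24.0]
# Assumption: each donation is 250 c.c., as in the displayed rows
monetary = [250.0 * f for f in frequency]

var_frequency = statistics.variance(frequency)  # sample variance (n - 1)
var_monetary = statistics.variance(monetary)

# Scaling a feature by a constant c multiplies its variance by c ** 2
print(var_frequency, var_monetary, var_monetary / var_frequency)
```

A large variance can therefore reflect nothing more than the unit of measurement, which is why the next task rescales the column rather than treating it as more informative.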


## Task 9:
Correct for high variance.<p>•Copy X_train and X_test into X_train_normed and X_test_normed respectively.<br>•Assign the column name (a string) that has the highest variance to the col_to_normalize variable.<br>•For each of the copied DataFrames:<br>
•Log normalize col_to_normalize and add the result to the DataFrame as a new column.<br>•Drop col_to_normalize.<br>•Print X_train_normed variance using the var() method and round it to 3 decimal places.<p>X_train and X_test must have the same structure. To keep your code "DRY" (Don't Repeat Yourself), you use a for-loop to apply the same set of transformations to each of the DataFrames.<p>Normally, you'll do pre-processing before you split the data (it could be one of the steps in a machine learning pipeline). Here, you are testing various ideas with the goal of improving model performance, and therefore this approach is fine.

In [13]:
variance.max()

2114363.7

In [15]:
import numpy as np


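# Work on copies so the original splits stay untouched
x_train_normed, x_test_normed = x_train.copy(), x_test.copy()

# Name of the column with the highest variance: 'Monetary (c.c. blood)'
col_to_normalize = x_train_normed.var().idxmax()

# Apply the same transformation to both DataFrames
for df_ in [x_train_normed, x_test_normed]:
    # Add the log-transformed column, then drop the original one
    df_['monetary_log'] = np.log(df_[col_to_normalize])
    df_.drop(columns=col_to_normalize, inplace=True)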



round(x_train_normed.var(),3)

Recency (months)      66.929
Frequency (times)     33.830
Time (months)        611.147
monetary_log           0.837
dtype: float64
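
The drop in variance comes from the logarithm compressing large values far more than small ones. A small standard-library sketch, using the five Monetary values shown in Task 1 rather than the full training set, illustrates this; `math.log` is the natural logarithm, the same function `np.log` applies element-wise.

```python
import math
import statistics

# Monetary values from the five rows shown in Task 1
monetary = [12500, 3250, 4000, 5000, 6000]
monetary_log = [math.log(value) for value in monetary]

# Sample variance before and after the transformation
print(statistics.variance(monetary))
print(statistics.variance(monetary_log))
```

The logarithm is only defined for strictly positive inputs. That is safe here as long as every donor has donated at least once; for a column that can contain zeros, `np.log1p` would be one alternative.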

## Task 10:
Train the logistic regression model.<br>•Import linear_model from sklearn.<br>•Create an instance of linear_model.LogisticRegression and assign it to the logreg variable.<br>•Train the logreg model using the fit() method.<br>•Print logreg_auc_score.<p>The scikit-learn library has a consistent API when it comes to fitting a model:<br>1. Create an instance of the model you want to train.<br>2. Train it on your train datasets using the fit method.<p>You may recognise this pattern from when you trained the TPOT model. This is the beauty of the scikit-learn library: you can quickly try out different models with only a few code changes.

In [16]:

from sklearn import linear_model


logreg = linear_model.LogisticRegression(
    solver='liblinear',
    random_state=42
)


logreg.fit(x_train_normed, y_train)


logreg_auc_score = roc_auc_score(y_test, logreg.predict_proba(x_test_normed)[:, 1])
print(f'\nAUC score: {logreg_auc_score:.4f}')


AUC score: 0.7891


## Task 11:
Sort your models based on their AUC score from highest to lowest.<br>•Import itemgetter from the operator module.<br>•Sort the list of (model_name, model_score) pairs from highest to lowest using the reverse=True parameter.

In [17]:

from operator import itemgetter


sorted(
    [('tpot', tpot_auc_score), ('logreg', logreg_auc_score)],
    key=itemgetter(1),
    reverse=True
)

[('logreg', 0.7890972663699937), ('tpot', 0.7852828989192625)]

### Congratulations, you've made it to the end! Good luck and keep on learning!


<b>Shubham R Khule 
<br>M.Sc(Data Science & Big Data Analytics)
<br>Pune, India</b>