## 1. Inspecting transfusion.data file
<p><img src="https://assets.datacamp.com/production/project_646/img/blood_donation.png" style="float: right;" alt="A pictogram of a blood bag with blood donation written in it" width="200"></p>
<p>Blood transfusion saves lives - from replacing lost blood during major surgery or a serious injury to treating various illnesses and blood disorders. Ensuring that there's enough blood in supply whenever needed is a serious challenge for health professionals. According to <a href="https://www.webmd.com/a-to-z-guides/blood-transfusion-what-to-know#1">WebMD</a>, "about 5 million Americans need a blood transfusion every year".</p>
<p>Our dataset is from a mobile blood donation vehicle in Taiwan. The Blood Transfusion Service Center drives to different universities and collects blood as part of a blood drive. We want to predict whether or not a donor will give blood the next time the vehicle comes to campus.</p>
<p>The data is stored in <code>datasets/transfusion.data</code> and it is structured according to the RFMTC marketing model (a variation of RFM). We'll explore what that means later in this notebook. First, let's inspect the data.</p>

A little <a href="https://www.optimove.com/resources/learning-center/rfm-segmentation">explanation</a> about the RFM model =) 

In [26]:
with open('datasets/transfusion.data', 'r') as f:
    for i in range(5):
        line = f.readline()
        print(line)

Recency (months),Frequency (times),Monetary (c.c. blood),Time (months),"whether he/she donated blood in March 2007"

2 ,50,12500,98 ,1

0 ,13,3250,28 ,1

1 ,16,4000,35 ,1

2 ,20,5000,45 ,1
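
The printed lines show two details worth noting before loading the file: the last column name is wrapped in quotes, and some values carry a trailing space (for example `2 `). As a minimal sketch, the standard `csv` module can parse a copy of these lines held in memory (the `sample` string below is typed in from the output above, not read from disk):

```python
import csv
import io

# First lines of the file, copied from the output above.
sample = (
    'Recency (months),Frequency (times),Monetary (c.c. blood),Time (months),'
    '"whether he/she donated blood in March 2007"\n'
    '2 ,50,12500,98 ,1\n'
    '0 ,13,3250,28 ,1\n'
)

rows = list(csv.reader(io.StringIO(sample)))
header, records = rows[0], rows[1:]

# The csv module strips the quotes around the last column name.
print(header[-1])

# Some raw fields keep a trailing space ('2 '); int() ignores surrounding whitespace.
parsed = [[int(field) for field in record] for record in records]
print(parsed)
```

So the file is a regular comma-separated file with a header row and five integer columns.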



## 2. Loading the blood donations data
<p>We now know that we are working with a typical CSV file (i.e., the delimiter is <code>,</code>, etc.). We proceed to loading the data into memory.</p>

In [11]:
import pandas as pd

transfusion = pd.read_csv("datasets/transfusion.data")

transfusion.head(5)

,Recency (months),Frequency (times),Monetary (c.c. blood),Time (months),whether he/she donated blood in March 2007
0,2,50,12500,98,1
1,0,13,3250,28,1
2,1,16,4000,35,1
3,2,20,5000,45,1
4,1,24,6000,77,0


## 3. Inspecting transfusion DataFrame
<p>Let's briefly return to our discussion of the RFM model. RFM stands for Recency, Frequency and Monetary Value, and it is commonly used in marketing for identifying your best customers. In our case, our customers are blood donors.</p>
<p>RFMTC is a variation of the RFM model. Below is a description of what each column means in our dataset:</p>
<ul>
<li>R (Recency - months since the last donation)</li>
<li>F (Frequency - total number of donations)</li>
<li>M (Monetary - total blood donated in c.c.)</li>
<li>T (Time - months since the first donation)</li>
<li>a binary variable representing whether he/she donated blood in March 2007 (1 stands for donating blood; 0 stands for not donating blood)</li>
</ul>
<p>It looks like every column in our DataFrame has the numeric type, which is exactly what we want when building a machine learning model. Let's verify our hypothesis.</p>

In [13]:
transfusion.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 748 entries, 0 to 747
Data columns (total 5 columns):
 #   Column                                      Non-Null Count  Dtype
---  ------                                      --------------  -----
 0   Recency (months)                            748 non-null    int64
 1   Frequency (times)                           748 non-null    int64
 2   Monetary (c.c. blood)                       748 non-null    int64
 3   Time (months)                               748 non-null    int64
 4   whether he/she donated blood in March 2007  748 non-null    int64
dtypes: int64(5)
memory usage: 29.3 KB
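
One thing the column descriptions do not say, but the first rows suggest, is that M may simply be F multiplied by a fixed volume per donation. The sketch below checks this on the five rows displayed by `head(5)` only; whether it holds for all 748 rows is an assumption that would need to be verified on the full DataFrame.

```python
# (Recency, Frequency, Monetary, Time, target) for the five rows shown by head(5).
rows = [
    (2, 50, 12500, 98, 1),
    (0, 13, 3250, 28, 1),
    (1, 16, 4000, 35, 1),
    (2, 20, 5000, 45, 1),
    (1, 24, 6000, 77, 0),
]

# Volume donated per donation, in c.c., for each displayed row.
cc_per_donation = {monetary / frequency for _, frequency, monetary, _, _ in rows}
print(cc_per_donation)  # {250.0}
```

On the loaded data, the same check could be written as `(transfusion['Monetary (c.c. blood)'] == 250 * transfusion['Frequency (times)']).all()`.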


## 4. Creating target column
<p>We are aiming to predict the value in the <code>whether he/she donated blood in March 2007</code> column. Let's rename it to <code>target</code> so that it's more convenient to work with.</p>

In [14]:
transfusion.rename(
    columns={'whether he/she donated blood in March 2007': 'target'},
    inplace=True
)

transfusion.head(2)

,Recency (months),Frequency (times),Monetary (c.c. blood),Time (months),target
0,2,50,12500,98,1
1,0,13,3250,28,1


## 5. Checking target incidence
<p>We want to predict whether or not the same donor will give blood the next time the vehicle comes to campus. The model for this is a binary classifier, meaning that there are only 2 possible outcomes:</p>
<ul>
<li><code>0</code> - the donor will not give blood</li>
<li><code>1</code> - the donor will give blood</li>
</ul>
<p>Target incidence is defined as the number of cases of each individual target value in a dataset. That is, how many 0s are in the target column compared to how many 1s? Target incidence gives us an idea of how balanced (or imbalanced) our dataset is.</p>

In [15]:
transfusion.target.value_counts(normalize=True).round(3)

0    0.762
1    0.238
Name: target, dtype: float64

Although I understand that a more balanced dataset would typically provide better training for my model, I've decided to use this particular dataset for a few reasons. Firstly, this dataset is already publicly available and has been used in previous studies, which will allow me to compare my results with others and potentially build on previous research. Additionally, this dataset is specifically focused on blood donation, which is the area I'm interested in exploring for my project. Finally, I believe that I can still make use of techniques such as evaluation metrics that take into account the class imbalance, as well as adjusting the class weights during model training, to mitigate the potential bias towards the majority class.
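
To make the class-weight idea concrete, here is a minimal sketch of the usual "balanced" heuristic, `n_samples / (n_classes * class_count)`, which is the formula scikit-learn applies when `class_weight='balanced'` is requested. The class counts are not printed in this notebook, so they are recovered from the rounded shares above and should be read as approximate.

```python
n_samples = 748                  # from transfusion.info()
share = {0: 0.762, 1: 0.238}     # from value_counts(normalize=True)

# Approximate class counts recovered from the rounded shares (an assumption).
counts = {label: round(n_samples * p) for label, p in share.items()}

# "Balanced" weighting: n_samples / (n_classes * class_count).
n_classes = len(counts)
weights = {label: n_samples / (n_classes * c) for label, c in counts.items()}

print(counts)
print({label: round(w, 3) for label, w in weights.items()})
```

With these weights, each class contributes the same total weight to the loss, so mistakes on the rarer class `1` cost roughly three times as much as mistakes on class `0`.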

Stratification is a good way to make sure the class imbalance is reflected consistently in every split of the dataset. It does not remove the imbalance; instead, it divides the dataset so that each subset contains the same proportion of each class as the full dataset (here, the classes of the target variable indicating whether a donor will donate blood again). In our case, stratifying on the target keeps roughly 76% instances of the negative class and 24% instances of the positive class in both the training and the test set.

## 6. Splitting transfusion into train and test datasets
<p>We'll now use the <code>train_test_split()</code> function to split the <code>transfusion</code> DataFrame.</p>
<p>Target incidence informed us that in our dataset <code>0</code>s appear 76% of the time. We want to keep the same structure in train and test datasets, i.e., both datasets should have a <code>0</code> target incidence of 76%. This is very easy to do using the <code>train_test_split()</code> function from the <code>scikit-learn</code> library: all we need to do is specify the <code>stratify</code> parameter. In our case, we'll stratify on the <code>target</code> column.</p>

In [16]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    transfusion.drop(columns='target'),
    transfusion.target,
    test_size=0.25,
    random_state=42,
    stratify=transfusion.target
)

X_train.head(2)

,Recency (months),Frequency (times),Monetary (c.c. blood),Time (months)
334,16,2,500,16
99,5,7,1750,26
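
To illustrate what stratifying on the target achieves, the following is a simplified, hand-written stratified split. It is not scikit-learn's implementation, and the helper `stratified_split` and the class counts (570 zeros, 178 ones, derived from the rounded incidence) are assumptions made for the example.

```python
import random
from collections import Counter


def stratified_split(labels, test_size, seed):
    """Return (train_idx, test_idx) with class shares preserved in both parts."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        # Shuffle within each class, then cut the same fraction from each class.
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx


# Toy labels with roughly the incidence reported earlier (assumed counts).
labels = [0] * 570 + [1] * 178
train_idx, test_idx = stratified_split(labels, test_size=0.25, seed=42)

for name, idx in [('train', train_idx), ('test', test_idx)]:
    share = Counter(labels[i] for i in idx)
    print(name, len(idx), round(share[0] / len(idx), 3))
```

Both parts end up with close to 76% zeros, which is the property the `stratify` argument provides.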


## 7. Selecting model using TPOT
<p><a href="https://github.com/EpistasisLab/tpot">TPOT</a> is a Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.</p>
<p><img src="https://assets.datacamp.com/production/project_646/img/tpot-ml-pipeline.png" alt="TPOT Machine Learning Pipeline"></p>
<p>TPOT will automatically explore hundreds of possible pipelines to find the best one for our dataset. Note, the outcome of this search will be a <a href="https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html">scikit-learn pipeline</a>, meaning it will include any pre-processing steps as well as the model.</p>
<p>We are using TPOT to help us zero in on one model that we can then explore and optimize further.</p>

In [17]:
from tpot import TPOTClassifier
from sklearn.metrics import roc_auc_score

tpot = TPOTClassifier(
    generations=5,
    population_size=20,
    verbosity=2,
    scoring='roc_auc',
    random_state=42,
    disable_update_check=True,
    config_dict='TPOT light'
)
tpot.fit(X_train, y_train)

tpot_auc_score = roc_auc_score(y_test, tpot.predict_proba(X_test)[:, 1])
print(f'\nAUC score: {tpot_auc_score:.4f}')

print('\nBest pipeline steps:', end='\n')
for idx, (name, transform) in enumerate(tpot.fitted_pipeline_.steps, start=1):
    # Print idx and transform
    print(f'{idx}. {transform}')

Optimization Progress:   0%|          | 0/120 [00:00<?, ?pipeline/s]


Generation 1 - Current best internal CV score: 0.7422459184429089

Generation 2 - Current best internal CV score: 0.7422459184429089

Generation 3 - Current best internal CV score: 0.7422459184429089

Generation 4 - Current best internal CV score: 0.7422459184429089

Generation 5 - Current best internal CV score: 0.7456308339276876

Best pipeline: MultinomialNB(Normalizer(input_matrix, norm=l2), alpha=0.001, fit_prior=True)

AUC score: 0.7637

Best pipeline steps:
1. Normalizer()
2. MultinomialNB(alpha=0.001)
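
Since the result of the search is an ordinary scikit-learn pipeline, the best pipeline reported above can also be rebuilt by hand. This is a minimal sketch: the tiny `X_demo` matrix and `y_demo` labels are made up to show the API and are not the training data.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

# Same steps and hyperparameters as reported by TPOT above.
pipeline = make_pipeline(
    Normalizer(norm='l2'),
    MultinomialNB(alpha=0.001, fit_prior=True),
)

# Tiny non-negative feature matrix (R, F, M, T), only to demonstrate the calls.
X_demo = np.array([
    [2, 50, 12500, 98],
    [0, 13, 3250, 28],
    [16, 2, 500, 16],
    [5, 7, 1750, 26],
])
y_demo = np.array([1, 1, 0, 0])  # illustrative labels, not taken from the dataset

pipeline.fit(X_demo, y_demo)
print(pipeline.predict_proba(X_demo).shape)  # (4, 2)
```

On the real data, the same object would be fitted with `pipeline.fit(X_train, y_train)`.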


### TPOT Analysis
        
TPOT picked a Normalizer step followed by MultinomialNB as the best pipeline for our dataset, giving us an AUC score of 0.7637 on the test set. This is a good starting point. Let's see if we can make it better. First, I will explain a little bit about the AUC and about logistic regression, the model we train later in this notebook.

The AUC (Area Under the Curve) is a performance metric commonly used in binary classification problems to evaluate the performance of a machine learning model. It measures the ability of the model to distinguish between the positive and negative classes (i.e., the donors who will donate blood again and those who will not). The AUC score ranges from 0 to 1, with higher values indicating better performance. A score of 0.5 indicates that the model is no better than random guessing, while a score of 1 indicates perfect classification. For more information about AUC and related metrics, see <a href="https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc?hl">Classification: ROC and AUC (Google Developers)</a>.
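
One way to read the AUC is as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it directly from that definition; the helper name `auc_score` is made up for this example, and in practice `roc_auc_score` from scikit-learn does the job.

```python
def auc_score(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


print(auc_score([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
print(auc_score([0, 0, 1, 1], [0.5, 0.5, 0.5, 0.5]))   # 0.5
print(auc_score([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9]))   # 1.0
```

The second call shows why a model that gives every donor the same score gets 0.5: it cannot rank anyone above anyone else.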


In our case, TPOT has selected a Naive Bayes pipeline (Normalizer followed by MultinomialNB) as the best model for our dataset, with an AUC score of 0.7637. This means that, given one randomly chosen donor who donated again and one who did not, the model ranks the first above the second about 76% of the time. However, there is always room for improvement, and techniques such as feature engineering, hyperparameter tuning, and different model families can be tried to further improve the AUC score. Later in this notebook we try logistic regression, illustrated in the figure below.
<p><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*klFuUpBGVAjTfpTak2HhUA.png" alt="Logistic Regression Example"></p>

## 8. Checking the variance
<p>TPOT picked a <code>Normalizer</code> followed by <code>MultinomialNB</code> as the best pipeline for our dataset, giving us the AUC score of 0.7637. This is a good starting point. Let's see if a logistic regression model, which is linear and easy to interpret, can do better.</p>
<p>One of the assumptions for linear models is that the data and the features we are giving it are related in a linear fashion, or can be measured with a linear distance metric. If a feature in our dataset has a high variance that's orders of magnitude greater than the other features, this could impact the model's ability to learn from other features in the dataset.</p>
<p>Correcting for high variance is called normalization. It is one of the possible transformations you do before training a model. Let's check the variance to see if such transformation is needed.</p>

In [18]:
X_train.var().round(3)

Recency (months)              66.929
Frequency (times)             33.830
Monetary (c.c. blood)    2114363.700
Time (months)                611.147
dtype: float64
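
The size of the `Monetary (c.c. blood)` variance is easier to understand once we recall that, in the rows printed earlier, Monetary equals 250 times Frequency. If that holds for every row (an assumption, since only a few rows are shown), then the rule Var(a·X) = a²·Var(X) says the two variances should differ by a factor of 250². A quick check with the rounded variances printed above:

```python
import math

var_frequency = 33.830        # from X_train.var() above
var_monetary = 2114363.700    # from X_train.var() above

# Var(a * X) = a**2 * Var(X), so the ratio of the variances estimates a**2.
scale = math.sqrt(var_monetary / var_frequency)
print(round(scale, 1))  # 250.0
```

The large variance therefore reflects the unit of measurement (c.c. instead of number of donations) rather than extra information.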

Normalization is a crucial step in preparing our data for machine learning. It involves scaling the values of our features so that they fall within a certain range, usually between 0 and 1, or -1 and 1. This can help to address issues with high variance in our data, where some features may have much larger values than others. 

By normalizing our data, we can ensure that all features are treated equally by our model and that no single feature dominates the others. This can improve the accuracy and generalization performance of our model, especially for algorithms that rely on distance metrics, such as k-nearest neighbors or support vector machines.

There are several ways to normalize data for use in machine learning models. Some of the most common techniques include:

*Min-Max Scaling*: this technique scales the values of each feature to a specific range, usually between 0 and 1. This is done by subtracting the minimum value of the feature and dividing by its range (maximum minus minimum).

*Z-score normalization*: this technique normalizes the values of each feature to have mean zero and standard deviation 1. This is done by subtracting the feature's mean and dividing by the standard deviation.

*Unit Vector Scaling*: this technique rescales each sample's feature vector to have norm 1, which means that the sum of squares of its values equals 1. This is what scikit-learn's Normalizer, the first step of the TPOT pipeline above, does by default.

*Robust Scaler*: this technique scales the values of each feature by taking the median and quartiles instead of the mean and standard deviation. This makes it less sensitive to outliers.

*Log Transformation*: This technique applies a logarithmic transformation to the feature values, which can help reduce variation in extreme values.

These are just some of the normalization techniques commonly used in machine learning. Choosing the appropriate technique depends on the dataset and model being used.
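
The formulas behind the first four techniques can be sketched in a few lines of plain Python. The numbers below are made up for illustration; in a real project the scikit-learn classes `MinMaxScaler`, `StandardScaler`, `RobustScaler` and `Normalizer` would normally be used instead.

```python
import math
import statistics

values = [250, 500, 1000, 12500]  # made-up values of one feature, with an outlier

# Min-max scaling: (x - min) / (max - min), giving values in [0, 1].
lo, hi = min(values), max(values)
min_max = [(x - lo) / (hi - lo) for x in values]

# Z-score: (x - mean) / std, giving mean 0 and standard deviation 1.
mean, std = statistics.mean(values), statistics.pstdev(values)
z_score = [(x - mean) / std for x in values]

# Robust scaling: (x - median) / IQR, less affected by the outlier 12500.
q1, median, q3 = statistics.quantiles(values, n=4, method='inclusive')
robust = [(x - median) / (q3 - q1) for x in values]

# Unit vector scaling: divide one sample's feature vector by its Euclidean norm.
sample = [2, 50, 12500, 98]  # one row (R, F, M, T)
norm = math.sqrt(sum(x * x for x in sample))
unit = [x / norm for x in sample]

for name, result in [('min-max', min_max), ('z-score', z_score),
                     ('robust', robust), ('unit', unit)]:
    print(name, [round(v, 3) for v in result])
```

Note how the outlier pushes the first three min-max values close to 0, while robust scaling keeps them spread out.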

### Log transformation - Explanation

Log transformation is a technique commonly used to normalize data in machine learning. It involves taking the logarithm of the values of a feature, which can help to reduce the impact of extreme values and make the distribution of the data more symmetrical. In particular, the log transformation is often used for data that is skewed to the right, meaning that the majority of the values are low, but there are some very high values that can throw off the analysis.

When performing a log transformation, each value is replaced by its natural logarithm (ln) or its base-10 logarithm. Because the logarithm grows slowly, large values are compressed much more than small ones, which helps to bring extreme values closer to the center of the distribution. This can be particularly useful for features that span several orders of magnitude or follow a power-law distribution, such as income or population size, where a few observations are far larger than the rest.

Overall, log transformation is a useful tool for preprocessing data in machine learning, particularly when dealing with skewed distributions or power-law relationships. However, it's important to note that log transformation can't be applied to zero or negative values, so you may need to add a small constant to the values before taking the logarithm.
<p><img src="https://www.medcalc.org/manual/images/logtransformation.png" alt="Log Transformation Example" width="200"></p>
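
The compression effect described above can be seen on a handful of numbers. This is a minimal sketch with made-up, right-skewed values; `math.log1p` is shown as one common way to handle zeros, which is an alternative to adding a small constant by hand.

```python
import math
import statistics

values = [250, 500, 750, 1000, 12500]  # made-up, right-skewed values

logged = [math.log(x) for x in values]

# Before the transform, the largest value is 50 times the smallest...
print(max(values) / min(values))            # 50.0
# ...on the log scale the gap is only log(50).
print(round(max(logged) - min(logged), 3))  # 3.912

# The variance drops sharply after the transform.
print(statistics.variance(values) > statistics.variance(logged))  # True

# log(0) is undefined; log1p(x) = log(1 + x) is defined at zero.
print(math.log1p(0))  # 0.0
```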


## 9. Log normalization
<p><code>Monetary (c.c. blood)</code>'s variance is very high in comparison to any other column in the dataset. This means that, unless accounted for, this feature may be given more weight by the model (i.e., be seen as more important) than any other feature.</p>
<p>One way to correct for high variance is to use log normalization.</p>

In [19]:
import numpy as np

# Work on copies so the original train and test sets stay untouched
X_train_normed, X_test_normed = X_train.copy(), X_test.copy()

col_to_normalize = 'Monetary (c.c. blood)'

for df_ in [X_train_normed, X_test_normed]:
    # Add the log-transformed column, then drop the original one
    df_['monetary_log'] = np.log(df_[col_to_normalize])
    df_.drop(columns=col_to_normalize, inplace=True)


X_train_normed.var().round(3)

Recency (months)      66.929
Frequency (times)     33.830
Time (months)        611.147
monetary_log           0.837
dtype: float64

## 10. Training the logistic regression model
<p>The variance looks much better now. Notice that now <code>Time (months)</code> has the largest variance, but it's not <a href="https://en.wikipedia.org/wiki/Order_of_magnitude">orders of magnitude</a> higher than the rest of the variables, so we'll leave it as is.</p>
<p>We are now ready to train the logistic regression model.</p>

In [20]:
from sklearn import linear_model

logreg = linear_model.LogisticRegression(
    solver='liblinear',
    random_state=42
)

logreg.fit(X_train_normed, y_train)

# AUC score for the logistic regression model
logreg_auc_score = roc_auc_score(y_test, logreg.predict_proba(X_test_normed)[:, 1])
print(f'\nAUC score: {logreg_auc_score:.4f}')


AUC score: 0.7891


## 11. Conclusion
<p>The demand for blood fluctuates throughout the year. As one <a href="https://www.kjrh.com/news/local-news/red-cross-in-blood-donation-crisis">prominent</a> example, blood donations slow down during busy holiday seasons. An accurate forecast of the future supply of blood allows appropriate action to be taken ahead of time and therefore saves more lives.</p>
<p>In this notebook, we explored automatic model selection using TPOT, and the best pipeline it found reached an AUC score of 0.7637. For reference, a model that simply chooses <code>0</code> all the time would be right 76% of the time (the target incidence), but because it cannot rank donors its AUC would be only 0.5. We then log normalized the <code>Monetary (c.c. blood)</code> feature and trained a logistic regression model, which reached an AUC score of 0.7891, an improvement of about 0.025. In the field of machine learning, even small improvements can be important, depending on the purpose.</p>
<p>Another benefit of using a logistic regression model is that it is interpretable. We can inspect its coefficients to see how each variable in our dataset relates to the response variable (<code>target</code>).</p>
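
As a sketch of what this interpretation looks like, each logistic regression coefficient can be turned into an odds ratio with `exp(coefficient)`. The coefficients below are hypothetical values chosen for illustration and are NOT the fitted ones; with the model trained above, the real values would be read from `logreg.coef_[0]`.

```python
import math

# Hypothetical coefficients for illustration only; they are NOT the fitted values.
coefficients = {
    'Recency (months)': -0.10,
    'Frequency (times)': 0.05,
}

# exp(coefficient) is the factor by which the odds of donating change
# for a one-unit increase in the feature, holding the other features fixed.
odds_ratios = {feature: math.exp(coef) for feature, coef in coefficients.items()}

for feature, ratio in odds_ratios.items():
    print(f'{feature}: odds ratio = {ratio:.3f}')
```

An odds ratio below 1 means the feature lowers the odds of the positive class as it grows, and a ratio above 1 means it raises them.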

In [22]:
from operator import itemgetter

# Sort models based on their AUC score from highest to lowest
sorted(
    [('tpot', tpot_auc_score), ('logreg', logreg_auc_score)],
    key=itemgetter(1),
    reverse=True
)

[('logreg', 0.7890972663699937), ('tpot', 0.7637476160203432)]

## References
Rai, K. (2020). The math behind Logistic Regression. Available at: https://medium.com/analytics-vidhya/the-math-behind-logistic-regression-c2f04ca27bca

Hills, S., Eraso, Y. (2021). Factors associated with non-adherence to social distancing rules during the COVID-19 pandemic: a logistic regression analysis. BMC Public Health 21, 352.

KJRH. Red Cross in blood donation 'crisis'. Available at: https://www.kjrh.com/news/local-news/red-cross-in-blood-donation-crisis

WebMD. Blood transfusion - what to know if you get one. Available at: https://www.webmd.com/a-to-z-guides/blood-transfusion-what-to-know#1

Google Developers. Classification: ROC and AUC. Available at: https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc?hl