## Introduction
<p><img src="https://assets.datacamp.com/production/project_646/img/blood_donation.png" style="float: right;" alt="A pictogram of a blood bag with blood donation written in it" width="200"></p>
<p>Blood transfusion saves lives - from replacing lost blood during major surgery or a serious injury to treating various illnesses and blood disorders. Ensuring that there's enough blood in supply whenever needed is a serious challenge for health professionals. According to <a href="https://www.webmd.com/a-to-z-guides/blood-transfusion-what-to-know#1">WebMD</a>, "about 5 million Americans need a blood transfusion every year".</p>
<p>Our dataset is from a mobile blood donation vehicle in Taiwan. The Blood Transfusion Service Center drives to different universities and collects blood as part of a blood drive. We want to predict whether or not a donor will give blood the next time the vehicle comes to campus.</p>
<p>The data is stored in <code>datasets/transfusion.data</code> and it is structured according to the RFMTC marketing model (a variation of RFM). We'll explore what that means later in this notebook. First, let's inspect the data.</p>

## 1. Loading the blood donations data
<p>The file is a typical CSV file (i.e., the delimiter is <code>,</code>, etc.), so we can load it into memory with pandas.</p>

In [1]:
# Importing pandas
import pandas as pd

# Reading in dataset
transfusion = pd.read_csv("C:\\Users\\talfi\\python\\pred\\pred1\\pred2\\pred.data")

# Printing out the first rows of our dataset
transfusion.head()

Unnamed: 0,Recency (months),Frequency (times),Monetary (c.c. blood),Time (months),whether he/she donated blood in March 2007
0,2,50,12500,98,1
1,0,13,3250,28,1
2,1,16,4000,35,1
3,2,20,5000,45,1
4,1,24,6000,77,0


## 2. Inspecting transfusion DataFrame
<p>Let's briefly return to our discussion of the RFM model. RFM stands for Recency, Frequency and Monetary Value, and it is commonly used in marketing for identifying your best customers. In our case, our customers are blood donors.</p>
<p>RFMTC is a variation of the RFM model. Below is a description of what each column means in our dataset:</p>
<ul>
<li>R (Recency - months since the last donation)</li>
<li>F (Frequency - total number of donations)</li>
<li>M (Monetary - total blood donated in c.c.)</li>
<li>T (Time - months since the first donation)</li>
<li>a binary variable representing whether he/she donated blood in March 2007 (1 stands for donating blood; 0 stands for not donating blood)</li>
</ul>
<p>It looks like every column in our DataFrame has the numeric type, which is exactly what we want when building a machine learning model. Let's verify our hypothesis.</p>

In [2]:
# Print a concise summary of transfusion DataFrame
transfusion.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 748 entries, 0 to 747
Data columns (total 5 columns):
 #   Column                                      Non-Null Count  Dtype
---  ------                                      --------------  -----
 0   Recency (months)                            748 non-null    int64
 1   Frequency (times)                           748 non-null    int64
 2   Monetary (c.c. blood)                       748 non-null    int64
 3   Time (months)                               748 non-null    int64
 4   whether he/she donated blood in March 2007  748 non-null    int64
dtypes: int64(5)
memory usage: 29.3 KB
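The five rows printed by `transfusion.head()` suggest that the Monetary column carries no information beyond Frequency: each donation seems to account for the same volume. The sketch below re-types only those five rows by hand (the `sample` DataFrame is a hypothetical stand-in, not the full dataset), so whether the relationship holds for all 748 rows is an assumption to check against the real `transfusion` DataFrame.

```python
import pandas as pd

# The five rows printed by transfusion.head() above, re-typed by hand.
# This is NOT the full dataset, only the rows visible in the output.
sample = pd.DataFrame({
    "Frequency (times)": [50, 13, 16, 20, 24],
    "Monetary (c.c. blood)": [12500, 3250, 4000, 5000, 6000],
})

# Volume donated per donation for each of the sample rows.
cc_per_donation = sample["Monetary (c.c. blood)"] / sample["Frequency (times)"]
print(cc_per_donation.unique())
```

If the ratio is constant across the whole dataset, the two columns are perfectly correlated, which is worth keeping in mind when interpreting a linear model fitted on both.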


## 3. Creating blood_donation column
<p>We are aiming to predict the value in the <code>whether he/she donated blood in March 2007</code> column. Let's rename it to <code>blood_donation</code> so that it's more convenient to work with.</p>

In [2]:
# Rename whether he/she donated blood in March 2007 column as 'blood_donation' for brevity 
transfusion.rename(
    columns={'whether he/she donated blood in March 2007': 'blood_donation'},
    inplace=True
)

# Print out the first 2 rows
transfusion.head(2)

Unnamed: 0,Recency (months),Frequency (times),Monetary (c.c. blood),Time (months),blood_donation
0,2,50,12500,98,1
1,0,13,3250,28,1


## 4. Checking blood_donation incidence
<p>We want to predict whether or not the same donor will give blood the next time the vehicle comes to campus. The model for this is a binary classifier, meaning that there are only 2 possible outcomes:</p>
<ul>
<li><code>0</code> - the donor will not give blood</li>
<li><code>1</code> - the donor will give blood</li>
</ul>
<p>blood_donation incidence is defined as the number of cases of each individual value in a dataset. That is, how many 0s are in the blood_donation column compared to how many 1s? blood_donation incidence gives us an idea of how balanced (or imbalanced) our dataset is.</p>

In [3]:
# Print target incidence proportions, rounding output to 3 decimal places
transfusion.blood_donation.value_counts(normalize = True).round(3)

0    0.762
1    0.238
Name: blood_donation, dtype: float64
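One practical use of these proportions is as a reference point for accuracy. A classifier that always predicts the most frequent class is correct exactly as often as that class occurs, so any model should be compared against that number. The sketch below uses only the two proportions printed above; the `incidence` Series is typed in by hand for illustration.

```python
import pandas as pd

# Class proportions as reported by value_counts(normalize=True) above.
incidence = pd.Series({0: 0.762, 1: 0.238}, name="blood_donation")

# A classifier that always predicts the most frequent class is right
# exactly as often as that class occurs in the data.
majority_class = incidence.idxmax()
baseline_accuracy = incidence.max()

print(majority_class, baseline_accuracy)
```

On the full dataset this puts the "always predict 0" baseline at roughly 76% accuracy.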

## SCIKIT LEARN

## 5. Splitting transfusion into train and test datasets
<p>We'll now use the <code>train_test_split()</code> function to split the <code>transfusion</code> DataFrame.</p>
<p>The blood_donation incidence told us that <code>0</code>s make up 76% of our dataset. We want to keep the same structure in the train and test datasets, i.e., in both datasets about 76% of the target values should be <code>0</code>. This is very easy to do using the <code>train_test_split()</code> function from the <code>scikit-learn</code> library: all we need to do is specify the <code>stratify</code> parameter. In our case, we'll stratify on the <code>blood_donation</code> column.</p>

In [4]:
# Import train_test_split method
from sklearn.model_selection import train_test_split

# Split transfusion DataFrame into
# X_train, X_test, y_train and y_test datasets,
# stratifying on the `blood_donation` column
X_train, X_test, y_train, y_test = train_test_split(
    transfusion.drop(columns='blood_donation'),
    transfusion.blood_donation,
    test_size=0.25,
    random_state=42,
    stratify=transfusion.blood_donation
)

# Print out the first 2 rows of X_train
X_train.head(2)

Unnamed: 0,Recency (months),Frequency (times),Monetary (c.c. blood),Time (months)
334,16,2,500,16
99,5,7,1750,26
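To see what `stratify` does, it can help to compare the positive-class share before and after the split. The following is a minimal sketch on synthetic labels, not on the real file: the label vector `y` is built by hand with 178 positives out of 748 to mimic the roughly 23.8% incidence reported earlier, and the single `feature` column is a placeholder.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the target: 748 labels, 178 of them positive
# (about 23.8%, chosen to mimic the incidence reported earlier).
y = pd.Series(np.r_[np.zeros(570, dtype=int), np.ones(178, dtype=int)])
X = pd.DataFrame({"feature": np.arange(len(y))})  # placeholder feature

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Positive-class share in the full data, the train part and the test part.
print(round(y.mean(), 3), round(y_tr.mean(), 3), round(y_te.mean(), 3))
```

With stratification the three shares should agree to within the rounding forced by whole rows; without it, the test share can drift noticeably on a dataset this small.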


## 6. Accuracy Score with Scikit-Learn

Now it is time to train our ML model and measure its accuracy. <br>
First, I'll fit a logistic regression model using `LogisticRegression` from `sklearn.linear_model`. <br>
Then, I'll measure its accuracy on the training set with the `.score()` method. <br>
After that, I'll measure its accuracy on the test set, again with `.score()`. I could use `accuracy_score` from `sklearn.metrics` instead, but both do the same thing: inspecting the scikit-learn source code shows that the `.score()` method of `LogisticRegression` calls `sklearn.metrics.accuracy_score` directly.


In [11]:
from sklearn.linear_model import LogisticRegression
reg_all = LogisticRegression()
# Fitting (training) a logistic regression model on the training set
reg_all.fit(X_train, y_train)
# Accuracy on the training set
reg_all.score(X_train, y_train)

0.7736185383244206

In [12]:
# Accuracy on the test set
reg_all.score(X_test, y_test)

0.7700534759358288
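The equivalence between `.score()` and `accuracy_score` mentioned above can be checked directly. This is a small sketch on a synthetic binary problem (the arrays `X` and `y` are randomly generated for illustration and are not the transfusion data):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Small synthetic binary problem (not the transfusion data).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

model = LogisticRegression().fit(X, y)

# .score() returns the mean accuracy, i.e. the same number accuracy_score
# gives when applied to the model's own predictions.
via_score = model.score(X, y)
via_metric = accuracy_score(y, model.predict(X))

print(via_score == via_metric)
```

Using `accuracy_score` explicitly is mainly useful when the predictions are needed for other metrics as well.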

Our model's test accuracy is **0.7700534759358288**. Now, let's repeat the same steps with TensorFlow and find out which library gives us better accuracy.

## TENSORFLOW

## 7. Importing the necessary libraries

In [13]:
from __future__ import absolute_import, division, print_function, unicode_literals
import tensorflow as tf
import pandas as pd

## 8. Renaming Column Names for Better Analysis

In [17]:
transfusion = transfusion.rename(columns={"Recency (months)":"Recency-months", "Frequency (times)":"Frequency-times", "Monetary (c.c. blood)":"Monetary-cc-blood","Time (months)":"Time-months"})

## 9. Splitting our data into Training and Evaluation parts

In [18]:
# First 449 rows (about 60% of the data) form the training set
traindata = transfusion.iloc[0:449, :].copy()
# Remaining 299 rows (about 40%) form the evaluation set
evaldata = transfusion.iloc[449:748, :].copy()
# Separate the label column from the features
ytrain = traindata.pop("blood_donation")
yeval = evaldata.pop("blood_donation")

## 10. Creating Input Function

I am creating input functions to supply data for training, evaluating, and prediction.<br>

An input function is a function that returns a tf.data.Dataset object which outputs the following two-element tuple:

* features - A Python dictionary in which:<br>
Each key is the name of a feature. <br>
Each value is an array containing all of that feature's values.
* label - An array containing the values of the label for every example. <br>

In [19]:
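import numpy as np

def input_evaluation_set():
    # Illustrative (features, labels) pair built from the first two rows shown earlier
    features = {"Recency-months": np.array([2, 0]),
                "Frequency-times": np.array([50, 13]),
                "Monetary-cc-blood": np.array([12500, 3250]),
                "Time-months": np.array([98, 28])}
    labels = np.array([1, 1])
    return features, labels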

In [20]:
def input_fn(features, labels, training=True, batch_size=256):
    """An input function for training or evaluating"""
    # Convert the inputs to a Dataset.
    dataset = tf.data.Dataset.from_tensor_slices((dict(features), labels))
    
    # Shuffle and repeat if you are in training mode.
    if training:
        dataset = dataset.shuffle(1000).repeat()
        
    return dataset.batch(batch_size)

## 11. Defining the Feature Columns

A feature column is an object describing how the model should use raw input data from the features dictionary. When I build an Estimator model, I pass it a list of feature columns that describes each of the features I want the model to use. The `tf.feature_column` module provides many options for representing data to the model.

For the transfusion data, the 4 raw features are numeric values, so we'll build a list of feature columns to tell the Estimator model to represent each of the four features as 32-bit floating-point values. The code to create the feature columns is:

In [21]:
# Feature columns describe how to use the input.
my_feature_columns = []
for key in traindata.keys():
    my_feature_columns.append(tf.feature_column.numeric_column(key = key))

Now that I have the description of how I want the model to represent the raw features, I can build the estimator.

## 12. Instantiate an estimator
___
The Transfusion problem is a classic classification problem. Fortunately, TensorFlow provides several pre-made classifier Estimators, including:

* `tf.estimator.DNNClassifier` for deep models that perform multi-class classification.
* `tf.estimator.DNNLinearCombinedClassifier` for wide & deep models.
* `tf.estimator.LinearClassifier` for classifiers based on linear models. <br>
For this problem, `tf.estimator.DNNClassifier` seems like the best choice. Here's how I instantiated this Estimator:

In [22]:
# Building a DNN with 2 hidden layers of 30 and 10 hidden nodes respectively.
classifier = tf.estimator.DNNClassifier(feature_columns=my_feature_columns,
                                        # Two hidden layers of 30 and 10 nodes respectively.
                                        hidden_units=[30, 10],
                                        # The model must choose between 2 classes.
                                        n_classes=2)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_model_dir': 'C:\\Users\\talfi\\AppData\\Local\\Temp\\tmpakh71z17', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_checkpoint_save_graph_def': True, '_service': None, '_cluster_spec': ClusterSpec({}), '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}


#### Train, Evaluate, and Predict
___
Now that I have an Estimator object, I can call methods to do the following:

* Train the model.
* Evaluate the trained model.
* Use the trained model to make predictions.

## 13. Training the Model

In [23]:
# Training the Model
classifier.train(input_fn = lambda: input_fn(traindata, ytrain, training=True),
                steps = 5000)

Instructions for updating:
Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
INFO:tensorflow:Calling model_fn.
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Calling checkpoint listeners before saving checkpoint 0...
INFO:tensorflow:Saving checkpoints for 0 into C:\Users\talfi\AppData\Local\Temp\tmpakh71z17\model.ckpt.
INFO:tensorflow:Calling checkpoint listeners after saving checkpoint 0...
INFO:tensorflow:loss = 34.921375, step = 0
INFO:tensorflow:global_step/sec: 414.142
INFO:tensorflow:loss = 0.711242, step = 100 (0.243 sec)
INFO:tensorflow:global_step/sec: 609.928
INFO:tensorflow:loss = 0.68018097, step = 200 (0.165 sec)
IN

<tensorflow_estimator.python.estimator.canned.dnn.DNNClassifierV2 at 0x1fd36a73880>

`Loss = 0.48`: the lower the loss, the better, and 0.48 is a good starting point.

I wrap my `input_fn` call in a lambda to capture the arguments while providing an input function that takes no arguments, as expected by the Estimator. The `steps` argument tells the `train` method to stop after that number of training steps.
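The lambda trick is plain Python and can be tried without TensorFlow. In the sketch below, `make_batches` is a hypothetical stand-in for `input_fn` that only reports the arguments it received; `functools.partial` is shown as an alternative way to get the same zero-argument callable.

```python
from functools import partial

def make_batches(features, labels, training=True, batch_size=256):
    """Hypothetical stand-in for input_fn that just reports its arguments."""
    return {"n_rows": len(labels), "training": training, "batch_size": batch_size}

features = {"x": [1, 2, 3]}
labels = [0, 1, 0]

# Both wrappers take no arguments and remember the values bound here.
wrapped_lambda = lambda: make_batches(features, labels, training=True)
wrapped_partial = partial(make_batches, features, labels, training=True)

print(wrapped_lambda())
print(wrapped_partial())
```

One difference worth knowing: the lambda looks up `features` and `labels` when it is called, while `partial` binds the objects at the moment it is created, so rebinding those names later only affects the lambda.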

## 14. Evaluating the Model
Now that the model has been trained, I can get some statistics on its performance. The following code block evaluates the accuracy of the trained model on the evaluation data. <br>
Because this is evaluation rather than training, I set `training=False` and pass `evaldata` and `yeval` to `classifier.evaluate`.

In [24]:
eval_result = classifier.evaluate(input_fn=lambda: input_fn(evaldata, yeval, training = False))

print("\nTest set accuracy: {accuracy:0.3f}\n".format(**eval_result))

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2021-02-01T04:25:31Z
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from C:\Users\talfi\AppData\Local\Temp\tmpakh71z17\model.ckpt-5000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Inference Time : 0.76304s
INFO:tensorflow:Finished evaluation at 2021-02-01-04:25:32
INFO:tensorflow:Saving dict for global step 5000: accuracy = 0.80936456, accuracy_baseline = 0.78929764, auc = 0.78897643, auc_precision_recall = 0.53606206, average_loss = 0.4223799, global_step = 5000, label/mean = 0.21070234, loss = 0.34025657, precision = 0.6666667, prediction/mean = 0.22000127, recall = 0.1904762
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 5000: C:\Users\talfi\AppData\Local\Temp\tmpakh71z17\model.ckpt-5000

Test set accuracy: 0.809
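The evaluation log reports precision, recall and a baseline alongside accuracy, and those numbers are easier to read as counts. The counts below are not logged anywhere; they are inferred as the only whole-number combination consistent with the logged metrics for the 299 evaluation rows, so treat them as a reconstruction rather than an output of the model.

```python
# Confusion counts inferred from the logged metrics (a reconstruction,
# not values printed by TensorFlow).
n_eval = 299                  # rows 449..747 of the DataFrame
tp, fp, fn = 12, 6, 51        # true positives, false positives, false negatives
tn = n_eval - tp - fp - fn    # the remaining rows are true negatives

accuracy = (tp + tn) / n_eval
precision = tp / (tp + fp)
recall = tp / (tp + fn)
label_mean = (tp + fn) / n_eval              # share of actual donors
baseline = max(label_mean, 1 - label_mean)   # always predict the majority class

print(round(accuracy, 4), round(precision, 4), round(recall, 4), round(baseline, 4))
```

Read this way, the model identifies 12 of the 63 actual donors in the evaluation set, and its accuracy sits about two percentage points above the `accuracy_baseline` of always predicting "no donation".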



## 15. CONCLUSION

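Our accuracy is **0.809** with TensorFlow. <br>It is **0.770** with Scikit-Learn. <br> That is an increase of **3.9 percentage points**. It may seem like a small change, but for ML models even small changes can make a big difference. Keep in mind that the two models were scored on different splits (a stratified random 25% test set for Scikit-Learn, the last 40% of the rows for TensorFlow), so the comparison is only approximate.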