# Predict Blood Donations

Blood donation has been around for a long time. The first successful recorded transfusion was between two dogs in 1665, and the first medical use of human blood in a transfusion occurred in 1818. Even today, donated blood remains a critical resource during emergencies.

The dataset is from a mobile blood donation vehicle in Taiwan. The Blood Transfusion Service Center drives to different universities and collects blood as part of a blood drive. We want to predict whether or not a donor will give blood the next time the vehicle comes to campus.

#### This is a competition organised by [DrivenData](https://www.drivendata.org/competitions/2/warm-up-predict-blood-donations/page/5/)

## Objective 
The objective of this notebook is to predict whether a blood donor will donate within a given time window, given the following features:
- Months since last donation (recency)
- Number of donations made (frequency)
- Total volume of blood donated, in c.c.
- Months since first donation (time)

In [2]:
# Import the libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt

RANDOM_STATE = 2017

%matplotlib inline

In [10]:
# Loading the data set
df = pd.read_csv('train.csv', index_col=False)

In [7]:
df.head()

| | Unnamed: 0 | Months since Last Donation | Number of Donations | Total Volume Donated (c.c.) | Months since First Donation | Made Donation in March 2007 |
|---|---|---|---|---|---|---|
| 0 | 619 | 2 | 50 | 12500 | 98 | 1 |
| 1 | 664 | 0 | 13 | 3250 | 28 | 1 |
| 2 | 441 | 1 | 16 | 4000 | 35 | 1 |
| 3 | 160 | 2 | 20 | 5000 | 45 | 1 |
| 4 | 358 | 1 | 24 | 6000 | 77 | 0 |


In [5]:
df.shape

(576, 6)

We do not need the first column, so we will drop it.

In [12]:
# Change the name of the columns
df.columns = ['unname','months_since_last_donation', 'num_donations', 'vol_donations', 'months_since_first_donation', 'class']

In [15]:
df.drop(['unname'], axis=1, inplace=True)
df.head()

| | months_since_last_donation | num_donations | vol_donations | months_since_first_donation | class |
|---|---|---|---|---|---|
| 0 | 2 | 50 | 12500 | 98 | 1 |
| 1 | 0 | 13 | 3250 | 28 | 1 |
| 2 | 1 | 16 | 4000 | 35 | 1 |
| 3 | 2 | 20 | 5000 | 45 | 1 |
| 4 | 1 | 24 | 6000 | 77 | 0 |
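In the five rows shown, `vol_donations` is exactly 250 times `num_donations`. If that holds for every row, the two columns carry the same information. The snippet below is a minimal sketch of such a check, run only on the five rows printed above; `head` and `is_constant` are illustrative names, and in the notebook the same ratio would be computed on the full `df`.

```python
import pandas as pd

# The five rows printed by df.head() above (not the full dataset)
head = pd.DataFrame({
    "num_donations": [50, 13, 16, 20, 24],
    "vol_donations": [12500, 3250, 4000, 5000, 6000],
})

# Volume donated per donation; a single unique value means the columns are proportional
ratio = head["vol_donations"] / head["num_donations"]
is_constant = ratio.nunique() == 1
```

Whether the ratio is constant across all 576 rows is not shown here and would need to be verified on the real data.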


In [21]:
# Find the null values
df.isnull().sum()

months_since_last_donation     0
num_donations                  0
vol_donations                  0
months_since_first_donation    0
class                          0
dtype: int64

Good: there are no null values.

#### In the class column there are two classes
- class 1 : The donor donated blood in March 2007.
- class 0 : The donor did not donate blood in March 2007.

Note: I am assuming that 1 means donated and 0 means not donated.

In [23]:
# Count how many people donated
df['class'].value_counts()

0    438
1    138
Name: class, dtype: int64

We can see that there is a class imbalance here: class 0 has 438 rows (about 76%) while class 1 has only 138 (about 24%).
We can address this by oversampling the minority class (1) or undersampling the majority class (0).
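As a rough illustration of the oversampling option, the sketch below resamples the minority class with replacement until both classes are the same size. It uses a stand-in frame, `df_demo`, with the class counts reported above and a placeholder feature column. It is not the real data, and the variable names are assumptions made for this example.

```python
import numpy as np
import pandas as pd

RANDOM_STATE = 2017  # same seed value the notebook defines

# Stand-in frame with the reported class counts (438 zeros, 138 ones).
# In the notebook this would be the real `df`.
df_demo = pd.DataFrame({
    "num_donations": np.arange(576),  # placeholder feature, not real data
    "class": [0] * 438 + [1] * 138,
})

majority = df_demo[df_demo["class"] == 0]
minority = df_demo[df_demo["class"] == 1]

# Draw minority rows with replacement until the class sizes match
minority_upsampled = minority.sample(
    n=len(majority), replace=True, random_state=RANDOM_STATE
)

# Recombine the two classes and shuffle the rows
balanced = (
    pd.concat([majority, minority_upsampled])
    .sample(frac=1, random_state=RANDOM_STATE)
    .reset_index(drop=True)
)

counts = balanced["class"].value_counts()
```

If a validation split is used, resampling is normally applied to the training portion only, so that duplicated rows do not leak into the evaluation set.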

# [Exploratory data analysis](https://en.wikipedia.org/wiki/Exploratory_data_analysis)

With EDA we will explore the data to build some intuition about it.

In [24]:
def ecdf(data):
    """Compute ECDF for a one-dimensional array of measurements."""

    # Number of data points: n
    n = len(data)

    # x-data for the ECDF: the sorted measurements
    x = np.sort(data)

    # y-data for the ECDF: evenly spaced steps from 1/n to 1
    y = np.arange(1, n + 1) / float(n)

    return x, y
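To make the return values concrete, here is a small worked example. It repeats a compact version of `ecdf` so the snippet runs on its own, and applies it to the five `months_since_last_donation` values shown by `df.head()` earlier. The `months` array and `frac_le_1` are illustrative names; the five values are only a sample, not the full column.

```python
import numpy as np

def ecdf(data):
    """Compact copy of the notebook's ecdf, repeated so this snippet is standalone."""
    x = np.sort(np.asarray(data))
    y = np.arange(1, len(x) + 1) / float(len(x))
    return x, y

# The five 'months_since_last_donation' values from df.head()
months = np.array([2, 0, 1, 2, 1])

x, y = ecdf(months)
# x is the sorted sample:      [0, 1, 1, 2, 2]
# y steps up by 1/n per point: [0.2, 0.4, 0.6, 0.8, 1.0]

# Fraction of observations at or below 1 month (3 of 5 values)
frac_le_1 = np.mean(months <= 1)
```

For tied values the ECDF height is the largest `y` among the ties, which matches `frac_le_1` here (0.6 at `x = 1`). The pairs can then be drawn as points, for example with `plt.plot(x, y, marker='.', linestyle='none')`.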