# <font color=green>Blood Donor Prediction (Warmup Blood Donation Prediction Challenge)</font>


#### Question: Can you predict whether a donor will return to donate blood given their donation history?

As per the instructions, the task is to predict whether a donor will give blood the next time the blood donation van comes to campus. First, I need to load the data and explore it.

### <font color=green>Importing Libraries and Data Loading</font> 

#### Importing Libraries for Data exploration and Visualization

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from IPython.display import display  # public import path; IPython.core.display is deprecated for this

In [None]:
TransTrain = pd.read_csv('training_data.csv')
TransTest = pd.read_csv('test_data.csv')

#### The task concerns a mobile blood donation van in Taiwan that visits a university campus each month. The goal is to determine, based on past data, whether a donor will donate blood in March 2007.

## <font color=green>Flow Chart</font>

![title](img/2019.png)

### <font color=green>Data Exploration</font>

##### Let's take a look at the training data

In [None]:
display(TransTrain.head())

The first column is the ID column. I would have removed it, but it is required for the submission along with the predicted values, so I will rename it to `Client_ID` instead.

In [None]:
TransTrain.rename(columns={'Unnamed: 0':'Client_ID'}, inplace=True)
TransTest.rename(columns={'Unnamed: 0':'Client_ID'}, inplace=True)

In [None]:
print(TransTrain.shape)
print(TransTest.shape)

So the training data has 576 observations in total, and the test set is smaller, with 200 observations. Let's see if there are any missing values in the dataset.

In [None]:
TransTrain.isnull().sum()

In [None]:
TransTest.isnull().sum()

There are no missing values. Let's explore further and check the data type of each variable.

In [None]:
print(TransTrain.dtypes)

### <font color=green>Descriptive Statistics </font>

The data seems to be clean and does not contain any missing values. This is good news, as I don't need to worry about data cleaning. However, I need to check whether the data has any outliers that need to be treated.

In [None]:
TransTrain.describe()

The descriptive statistics give a much clearer picture of the training data. Several variables appear to have outliers. For example, "Number of Donations" ranges from 1 to 50, but its mean is only 5.42, so the maximum lies far out in the right tail and is likely an outlier. Similarly, "Months since Last Donation" ranges from 0 to 74, but its mean is only 9.439. These outliers need to be handled before moving on to modeling.
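Comparing the mean with the min and max is only a rough signal. One common way to make the check explicit is the 1.5 × IQR rule, sketched below. The helper name `iqr_outlier_mask`, the multiplier `k=1.5`, and the toy donation counts are assumptions for illustration and are not taken from the dataset.

```python
import numpy as np

def iqr_outlier_mask(values, k=1.5):
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (values < lower) | (values > upper)

# Hypothetical donation counts (not the real data): one very large value
donations = np.array([1, 2, 2, 3, 4, 5, 5, 6, 7, 50])
mask = iqr_outlier_mask(donations)
print("Flagged as outliers:", donations[mask])
```

On the real data the same helper could be applied column by column, for example `iqr_outlier_mask(TransTrain['Number of Donations'])`.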

I want to see how my variables are related. First, I will use `pairplot` from the seaborn library to see the distribution of each variable and the pairwise relationships between them.

In [None]:
sns.pairplot(TransTrain, diag_kind='kde',hue='Made Donation in March 2007')
plt.show()

The plot above makes a lot of things clear. First, the target variable contains only the values 1 and 0, so there are no ambiguities in it. The pairplot also confirms the existence of outliers in multiple variables.

Two variables appear to be highly correlated, as they show a strongly linear relationship: Number of Donations and Total Volume Donated. This is not surprising, since the total volume donated is directly proportional to how many times a donor has given blood. However, two strongly correlated features may cause problems during model building. I will need to remove one of them, but I want to be sure first, so I will create a heat map.

#### Are any of the variables highly correlated with each other?

In [None]:
corr = TransTrain.corr()
# Correlation Plot for the independent variables
fig, ax = plt.subplots(figsize=(25,15))
sns.heatmap(corr, annot=True,
        xticklabels=corr.columns,
        yticklabels=corr.columns, cmap="YlGnBu")

The heat map above confirms that Number of Donations and Total Volume Donated are highly correlated with each other.
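Reading pairs off a heat map works for a handful of columns, but the same check can be done programmatically. The sketch below lists column pairs whose absolute Pearson correlation exceeds a threshold. The helper `highly_correlated_pairs`, the `0.9` threshold, the toy values, and the fixed 250 c.c. per donation are all assumptions made for illustration.

```python
import pandas as pd

def highly_correlated_pairs(df, threshold=0.9):
    """List column pairs whose absolute Pearson correlation is >= threshold."""
    corr = df.corr().abs()
    cols = list(corr.columns)
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle only, skip the diagonal
            if corr.iloc[i, j] >= threshold:
                pairs.append((cols[i], cols[j]))
    return pairs

# Toy data (hypothetical): volume is a fixed multiple of the donation count
toy = pd.DataFrame({
    "Number of Donations": [1, 2, 5, 8, 13, 50],
    "Months since Last Donation": [2, 14, 4, 11, 2, 9],
})
toy["Total Volume Donated (c.c.)"] = 250 * toy["Number of Donations"]

pairs = highly_correlated_pairs(toy)
print(pairs)
```

When one column is an exact multiple of another, their correlation is 1, so keeping both adds no information for the model.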


I still need to make sure that my target variable doesn't contain any ambiguous values, and check whether there is any class imbalance in it.

#### What is the Distribution of Target Variable?

In [None]:
TransTrain['Made Donation in March 2007'].value_counts()

This shows that my data is imbalanced. Let's plot it to get a clearer picture.

In [None]:
fig, ax = plt.subplots(figsize=(20,6))
# Draw on the axes created above so the figsize is actually used
TransTrain['Made Donation in March 2007'].value_counts().plot(kind='bar', ax=ax)

Roughly four times fewer people donated in March 2007 than did not donate.
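The imbalance can also be expressed as class shares and a majority-to-minority ratio rather than read off the bar chart. The snippet below is a small sketch on made-up labels; the 8-to-2 split is a hypothetical example, not the actual class counts of the training data.

```python
import pandas as pd

# Hypothetical labels: 8 non-donors (0) and 2 donors (1)
labels = pd.Series([0] * 8 + [1] * 2, name="Made Donation in March 2007")

counts = labels.value_counts()               # absolute counts per class
share = labels.value_counts(normalize=True)  # fraction of rows per class
ratio = counts.loc[0] / counts.loc[1]        # majority-to-minority ratio

print(counts)
print(share)
print("Non-donors per donor:", ratio)
```

Running the same three lines on `TransTrain['Made Donation in March 2007']` gives the real shares.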

### <font color=green>Feature Engineering and Feature Selection</font>

Before I get rid of the correlated features, I will create some additional variables. Let's create a feature that marks frequent donors; this information can be extracted from the `Number of Donations` feature. Since the mean of `Number of Donations` is about 5, I will create a new categorical feature that flags a donor as a frequent donor if they have donated 5 or more times.

In [None]:
TransTrain['Frequenct Donor'] = (TransTrain['Number of Donations'] >= 5)
TransTest['Frequenct Donor'] = (TransTest['Number of Donations'] >= 5)
display(TransTrain.head())

I will create another feature that captures how long a donor typically waits between donations. For this, I subtract `Months since Last Donation` from `Months since First Donation` and divide the result by `Number of Donations`.

In [None]:
TransTrain['Donation Frequency'] = ((TransTrain['Months since First Donation'] - TransTrain['Months since Last Donation'])
                           /TransTrain['Number of Donations'])
TransTest['Donation Frequency'] = ((TransTest['Months since First Donation'] - TransTest['Months since Last Donation'])
                           /TransTest['Number of Donations'])
display(TransTrain.head(15))

In [None]:
plt.figure(figsize=(15, 5))

# Overlay the 'Donation Frequency' distribution of each target class.
# sns.histplot(kde=True) replaces the deprecated sns.distplot.
target = TransTrain['Made Donation in March 2007']
sns.histplot(TransTrain.loc[target == 0, 'Donation Frequency'], kde=True,
             stat='density', color='green', label='Did not donate')
sns.histplot(TransTrain.loc[target == 1, 'Donation Frequency'], kde=True,
             stat='density', color='yellow', label='Donated')
plt.ylabel('Density')
plt.title('Distribution of Monthly Donation Frequency')
plt.legend()

plt.show()

From the plot above, we can see that frequent donors are more likely to donate blood again. However, there are many 0 values in the `Donation Frequency` feature. A value of 0 means the first and last donations fall in the same month, which is the case for donors who donated only once and never returned.
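A small worked example makes the feature and its zero values easier to interpret. The three donors below are hypothetical; the column names match the dataset, and the formula is the one used above.

```python
import pandas as pd

# Three hypothetical donors (values invented for illustration)
toy = pd.DataFrame({
    "Months since First Donation": [24, 10, 4],
    "Months since Last Donation": [2, 10, 0],
    "Number of Donations": [11, 1, 2],
})

# Active span in months divided by the number of donations
toy["Donation Frequency"] = (
    (toy["Months since First Donation"] - toy["Months since Last Donation"])
    / toy["Number of Donations"]
)
print(toy)
```

The first donor gave 11 times over a 22-month span, so the value is 2 months per donation. The second donor gave once, so the span is 0 and the feature is 0. Note that the value is measured in months per donation, so a smaller non-zero value means more frequent donations.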

Now that we have our additional features, I will remove Total Volume Donated from the dataset.

In [None]:
TransTrain.drop(['Total Volume Donated (c.c.)'], axis=1, inplace=True)
TransTest.drop(['Total Volume Donated (c.c.)'], axis=1, inplace=True)

Let's check the correlation matrix again, now including the newly added features.

In [None]:
corr = TransTrain.corr()
# Correlation Plot for the independent variables
fig, ax = plt.subplots(figsize=(25,15))
sns.heatmap(corr, annot=True,
        xticklabels=corr.columns,
        yticklabels=corr.columns, cmap="BuPu")

Before I move on to anything else, I need to encode the categorical feature as numeric values for modeling.

In [None]:
TransTrain['Frequenct Donor'] = TransTrain['Frequenct Donor'].astype('category')
TransTest['Frequenct Donor'] = TransTest['Frequenct Donor'].astype('category')


In [None]:
TransTrain['Frequenct Donor'] = TransTrain['Frequenct Donor'].cat.codes
TransTest['Frequenct Donor'] = TransTest['Frequenct Donor'].cat.codes

In [None]:
display(TransTrain.head())

### <font color=green>Splitting into Training and Validation Set</font>

In [None]:
#!pip install imbalanced-learn
!pip install scikit-learn

In [None]:
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

In [None]:
X= TransTrain.drop(['Made Donation in March 2007'],axis=1)
Y= pd.DataFrame(TransTrain['Made Donation in March 2007'])
x_test=TransTest

In [None]:
x_train, x_val, y_train, y_val = train_test_split(X, Y,
                                                  test_size = .1,
                                                  random_state=12)

### <font color=green>Handling Class Imbalance Using Smote</font>

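Now that we have split the data into training and validation sets, let's handle the class imbalance using SMOTE. Only the training split is resampled, so the validation set keeps its original class distribution.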

In [None]:
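sm = SMOTE(random_state=12, sampling_strategy=1.0)  # `ratio` was replaced by `sampling_strategy` in newer imbalanced-learn
x_train_res, y_train_res = sm.fit_resample(x_train, y_train.values.ravel())  # .values.ravel() passes the target as a 1-D array; fit_sample was renamed fit_resample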
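To give some intuition for what SMOTE does, the sketch below generates synthetic minority points by interpolating between a minority sample and one of its nearest minority neighbours. This is a simplified illustration of the idea, not the imbalanced-learn implementation; the function name `smote_like_sample`, its parameters, and the toy points are assumptions.

```python
import numpy as np

def smote_like_sample(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points on segments between minority neighbours."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(minority))               # pick a minority sample
        dists = np.linalg.norm(minority - minority[i], axis=1)
        neighbours = np.argsort(dists)[1:k + 1]       # k nearest, skipping itself
        j = rng.choice(neighbours)                    # pick one neighbour
        lam = rng.random()                            # interpolation weight in [0, 1)
        synthetic.append(minority[i] + lam * (minority[j] - minority[i]))
    return np.array(synthetic)

# Hypothetical minority-class points with two features
minority = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0], [4.0, 6.0]])
new_points = smote_like_sample(minority, n_new=3)
print(new_points)
```

Because every synthetic point lies on a segment between two existing minority samples, the oversampled class stays inside the region the real minority data already occupies.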

### <font color=green>Modelling</font>

In [None]:
from sklearn import model_selection
from sklearn.linear_model import LogisticRegressionCV
from sklearn import metrics
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss

Now I define a function that prints a confusion matrix and a classification report. Although the challenge's evaluation metric is log loss, I want to check the recall and accuracy of the model as well.

In [None]:
""""Defining A function for model evaulatin"""
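def model1(mod, model_name, x_train, y_train, x_test, y_test):
    # Fit on the data passed in (here: the SMOTE-resampled training set)
    mod.fit(x_train, y_train)
    print(model_name)
    # Cross-validated log loss on the training data; the scorer returns negated values
    cv_scores = cross_val_score(mod, x_train, y_train, scoring="neg_log_loss", cv=5)
    print("CV Log Loss:", -cv_scores.mean())
    # Log loss on the held-out set, from the predicted probability of class 1
    val_proba = mod.predict_proba(x_test)[:, 1]
    print("Log Loss:", log_loss(np.ravel(y_test), val_proba))
    # Out-of-fold class predictions; sklearn metrics expect (y_true, y_pred) in that order
    predictions = cross_val_predict(mod, x_train, y_train, cv=5)
    cm = confusion_matrix(y_train, predictions)
    print("Confusion Matrix:  \n", cm)
    print("                    Classification Report \n", classification_report(y_train, predictions))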

#### <font color=green>Logistic Regression</font>

In [None]:
LR=LogisticRegressionCV(max_iter=1000,scoring='neg_log_loss')

In [None]:
model1(LR,"Logistic Regression",x_train_res,y_train_res,x_val,y_val)
y_val_lr = LR.predict_proba(x_val)[:, 1]

We can see that the recall of the logistic regression model for correctly predicting blood donors is 68 percent. I also want to see how it performs on the challenge metric: the log loss is 0.541, which seems fairly low. For log loss, lower is better.
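To make the metric concrete, here is a minimal sketch of binary log loss computed by hand. The helper name `binary_log_loss`, the clipping constant `eps`, and the toy labels and probabilities are assumptions for illustration. A useful reference point is that predicting 0.5 for every donor gives a log loss of ln 2 ≈ 0.693.

```python
import numpy as np

def binary_log_loss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood of the true labels under predicted P(y=1)."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip probabilities so log(0) never occurs
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))

y = [0, 1, 0, 0]  # hypothetical labels
baseline = binary_log_loss(y, [0.5, 0.5, 0.5, 0.5])   # uninformative predictions
confident = binary_log_loss(y, [0.1, 0.9, 0.2, 0.1])  # mostly correct and confident

print("Baseline:", baseline)
print("Confident:", confident)
```

The metric rewards well-calibrated probabilities: confident correct predictions pull the score toward 0, while confident wrong ones are penalised heavily.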