<h1> Question 2 - Use the user, user_features & model_test_file datasets</h1>
<h3> <font color='crimson'>1. Write a function that takes as input the user features and outputs the predicted response variable (e.g. content_created) found in the user dataset.</font></h3>

In [1]:
# import libraries
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from imblearn.metrics import classification_report_imbalanced
import warnings
warnings.filterwarnings("ignore")
import plotly
import plotly.graph_objs as go
import plotly.offline as plt
import plotly.figure_factory as ff
plt.init_notebook_mode(connected=True)

In [2]:
# dataset path
excel_file = "../spreadsheets/Data science take home Datasets.xlsx"

# loading the sheets 'user', 'user_features', 'model_test_file' into dataframes
user_df = pd.read_excel(excel_file, "user")
user_features_df = pd.read_excel(excel_file, "user_features")
model_test_file_df = pd.read_excel(excel_file, "model_test_file")

# indexing the dataframes with user_id
user_df.set_index('user_id', inplace=True)
user_features_df.set_index('user_id', inplace=True)
model_test_file_df.set_index('user_id', inplace=True)

<h3 align='center'><font color='crimson'> Mapping function from input user features to the output predicted response variable</font></h3>

In [3]:
# creating a dataset comprising features and a response variable (var_1, var_2, ... var_12, response)
dataset = user_features_df.copy()
# the response is the first column of user_df; look it up by user_id
dataset['response'] = dataset.index.map(lambda x: user_df.loc[x].iloc[0])

In [4]:
# ratio of output response variable
print("Number of occurrences of each response variable: ")
dataset['response'].value_counts()

Number of occurrences of each response variable: 


0    4516
1     801
Name: response, dtype: int64

<p>The counts above show that the response variable is heavily imbalanced: only 801 of 5,317 users (about 15%) created content. To compensate, we use the Synthetic Minority Over-sampling Technique (SMOTE), an over-sampling method. Rather than duplicating minority records, SMOTE creates synthetic ones: it picks a minority sample, picks one of its nearest minority-class neighbours, and places a new point at a random position on the line segment between the two.</p>
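
<p>The interpolation step can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the imblearn implementation used in this notebook: the function name <code>smote_sketch</code>, the toy points, and the choice of <code>k=2</code> are all assumptions made for the example.</p>

```python
import math
import random


def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the chosen point (excluding itself)
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        # a single random gap in [0, 1) places the new point on the
        # segment that joins the base point and the chosen neighbour
        gap = rng.random()
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# hypothetical toy minority samples with two features each
minority_samples = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.5), (8.0, 8.0)]
new_points = smote_sketch(minority_samples, n_new=3)
print(new_points)
```

<p>Because every synthetic point is a convex combination of two real minority samples, it always stays inside the region already covered by the minority class.</p>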

In [5]:
# partitioning the dataset into training and test sets in a 2:1 ratio (test_size=1/3),
# stratified so that both sets keep the same class balance
training_features, test_features, training_target, test_target = \
    train_test_split(dataset.iloc[:, :-1], dataset.iloc[:, -1], test_size=1/3, stratify=dataset.iloc[:, -1])
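
<p>As a rough sanity check on the stratified split, the class counts printed earlier can be compared with the supports listed in the classification report at the end of this notebook. The snippet below only redoes that arithmetic with the numbers copied from those two outputs; it does not touch the data.</p>

```python
# class counts printed by dataset['response'].value_counts()
n_negative, n_positive = 4516, 801
# supports listed in the classification report on the held-out split
test_negative, test_positive = 1506, 267

full_rate = n_positive / (n_negative + n_positive)
test_rate = test_positive / (test_negative + test_positive)
test_fraction = (test_negative + test_positive) / (n_negative + n_positive)

print(f"positive rate, full data:  {full_rate:.4f}")
print(f"positive rate, test split: {test_rate:.4f}")
print(f"share of data held out:    {test_fraction:.4f}")
```

<p>The two positive rates agree to about three decimal places and the held-out share is one third, which is what <code>stratify</code> and <code>test_size=1/3</code> are expected to produce.</p>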

<p> We use a RandomForestClassifier to model the data after oversampling the minority class. A random forest builds many decision trees on bootstrap samples of the data and combines their predictions, which gives a more accurate and stable result than a single tree. Combining the trees (by a majority vote or by averaging their predicted class probabilities) also reduces the overfitting that individual deep trees are prone to. The trees are built independently of one another, so they can be trained in parallel. Feature scaling is not required because each split applies a threshold to one feature at a time. The model also reports feature importances, which indicate which features contribute most to the predicted response. In practice, random forests are a strong baseline on tabular data and work reasonably well on small datasets such as this one.
</p>

In [6]:
# data to be transformed is chained together through a pipeline
pipe = Pipeline([('oversample', SMOTE()),
                 ('clf', RandomForestClassifier(n_jobs=-1))])

# StratifiedKFold builds the folds so that each one preserves the percentage of samples of each class
skf = StratifiedKFold()

# range of each parameter to be explored for tuning
param_grid = {'oversample__ratio': [0.25, 0.5, 1],      # minority-to-majority ratio targeted by the oversampling
              'clf__max_depth': [3, 5],                 # maximum depth of each tree
              'clf__max_features': ['sqrt', 'log2'],    # maximum number of features to be considered at each split
              'clf__n_estimators': [25, 50, 100]}       # number of trees in the forest

# using the F1 score as the scoring criterion and StratifiedKFold for cross-validation
grid = GridSearchCV(pipe, param_grid, return_train_score=False,
                    n_jobs=-1, scoring='f1', cv=skf)

grid.fit(training_features, training_target)

GridSearchCV(cv=StratifiedKFold(n_splits=3, random_state=None, shuffle=False),
       error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('oversample', SMOTE(k=None, k_neighbors=5, kind='regular', m=None, m_neighbors=10, n_jobs=1,
   out_step=0.5, random_state=None, ratio='auto', svm_estimator=None)), ('clf', RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='a..._jobs=-1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False))]),
       fit_params=None, iid=True, n_jobs=-1,
       param_grid={'oversample__ratio': [0.25, 0.5, 1], 'clf__max_depth': [3, 5], 'clf__max_features': ['sqrt', 'log2'], 'clf__n_estimators': [25, 50, 100]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
       scoring='f1', verbose=0)
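
<p>To get a feel for the cost of this search, the grid can be expanded by hand. The sketch below copies <code>param_grid</code> from the cell above and takes <code>n_splits=3</code> from the StratifiedKFold shown in the output; it uses only the standard library and does not fit anything.</p>

```python
from itertools import product

param_grid = {
    'oversample__ratio': [0.25, 0.5, 1],
    'clf__max_depth': [3, 5],
    'clf__max_features': ['sqrt', 'log2'],
    'clf__n_estimators': [25, 50, 100],
}
n_splits = 3  # from the StratifiedKFold repr in the output above

# every combination of one value per parameter is one candidate pipeline
candidates = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]
n_fits = len(candidates) * n_splits

print(len(candidates), "candidates,", n_fits, "cross-validation fits")
```

<p>That is 36 candidates and 108 cross-validation fits, plus one final refit of the best candidate on the whole training set because <code>refit=True</code>.</p>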

<p> Accuracy is the usual scoring criterion for parameter tuning, but it is misleading on an imbalanced dataset: the large number of true negatives dominates the score, and a model that always predicts 0 would already reach about 85% accuracy here (4516 of 5317 users). Recall alone is a sensible criterion when missing a positive case is far more costly than a false alarm, as in fraud or cancer detection. Here there is no reason to weight true positives that heavily, so we balance precision and recall by using the F1 score, the harmonic mean of the two.
</p>
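
<p>The argument against accuracy can be checked from the class counts alone. The sketch below scores a hypothetical baseline that predicts 0 for every user; the helper name <code>f1_from</code> is an assumption made for this example, and precision is taken as 0 when no positive predictions are made.</p>

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# class counts from dataset['response'].value_counts()
n_negative, n_positive = 4516, 801

# hypothetical baseline: predict 0 for every user
baseline_accuracy = n_negative / (n_negative + n_positive)
baseline_recall = 0 / n_positive   # no positive user is ever found
baseline_precision = 0.0           # no positive predictions; 0 by convention
baseline_f1 = f1_from(baseline_precision, baseline_recall)

print(f"baseline accuracy: {baseline_accuracy:.3f}")
print(f"baseline F1:       {baseline_f1:.3f}")
```

<p>The baseline looks good on accuracy while being useless for finding content creators, and its F1 score of 0 makes that visible.</p>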

<h3> <font color='crimson'> 2. Report the predicted response for the users in the model_test_file. If you use any visualizations/metrics to validate the model please include them in the report. Please explain the reasoning behind the technique you used to build the model.</font></h3>


<h3 align='center'> <font color='crimson'>Predicted response for the users in the model_test_file</font></h3>

In [7]:
# predicted response for the users in model_test_file
model_test_file_df['prediction'] = grid.predict(model_test_file_df.iloc[:, :12])
print("Predicted response for the users in model_test_file: ")
model_test_file_df['prediction']

Predicted response for the users in model_test_file: 


user_id
154472    0
151147    1
10543     0
136986    0
137008    0
137059    0
137253    0
137323    0
171652    0
137341    0
137768    0
137822    0
137879    0
137984    1
138103    0
138200    0
138405    1
138439    0
138473    0
138483    0
138582    0
138606    0
138738    0
138948    0
139606    0
149425    1
149419    0
150894    1
141733    0
144653    0
147263    0
147793    0
148663    1
149065    0
149643    0
149792    0
10420     0
152535    0
153169    0
153389    0
153395    0
154070    0
154777    0
168361    0
168689    0
168840    0
168845    0
168862    1
168864    0
168989    0
168991    0
169591    0
169636    0
169853    0
259509    0
260608    1
602671    0
602672    0
8107      0
Name: prediction, dtype: int64

<h3 align='center'><font color='crimson'> Visualizations to validate the model</font></h3>

<p> To validate the model, we take the two features with the highest feature importance and check whether each shows a visible trend against the response variable.</p>

In [8]:
# extracting the random forest classifier (pipeline step 'clf') refitted with the best parameters
rf_clf = grid.best_estimator_.named_steps['clf']

# feature importances of the model
feature_importances = rf_clf.feature_importances_

# plotting the feature importances of each feature
data = [go.Bar(x=training_features.columns,
               y=feature_importances)]
layout = go.Layout(title='<b>Feature Importance Plot</b>',
                   xaxis={'title': "Feature Name"}, yaxis={'title': "Feature Importance Value"})
plot = go.Figure(data=data, layout=layout)
plt.iplot(plot)

In [9]:
# finding the top 2 features with highest feature importances
# (argsort is ascending, so reverse it and keep the first two indices)
best_feature_indices = np.argsort(feature_importances)[::-1][:2]
best_features = [training_features.columns[idx] for idx in best_feature_indices]
print("Top 2 features with highest feature importances are: " + str(best_features))

Top 2 features with highest feature importances are: ['var_12', 'var_11']


In [10]:
# scatterplot - var_12 vs response
data = [go.Scatter(x=dataset[best_features[0]].apply(np.log10), y=dataset['response'], mode='markers')]
layout = go.Layout(title='<b>'+best_features[0]+' vs Response</b>',
                   xaxis={'title': "log10("+best_features[0]+")"}, yaxis={'title': "Response"})
plot = go.Figure(data=data, layout=layout)
plt.iplot(plot)

<p>In the above plot, we can observe that as the value of var_12 increases the number of users creating content increases.</p>

In [11]:
# scatterplot - var_11 vs response
data = [go.Scatter(x=dataset[best_features[1]].apply(np.log10), y=dataset['response'], mode='markers')]
layout = go.Layout(title='<b>'+best_features[1]+' vs Response</b>',
                   xaxis={'title': "log10("+best_features[1]+")"}, yaxis={'title': "Response"})
plot = go.Figure(data=data, layout=layout)
plt.iplot(plot)

<p>In the above plot, we can observe that most users with var_11 below 10 or above 1000 aren't creating content.</p>

<h3 align='center'><font color='crimson'> Metrics to validate the model</font></h3>

In [12]:
# printing the classification results
test_features_predictions = grid.predict(test_features)
print("Best parameters after tuning are: " + str(grid.best_params_))
print("F1_Score of the model is: " + str(f1_score(test_target, test_features_predictions)))
print("Accuracy of the model is: " + str(accuracy_score(test_target, test_features_predictions)))
print("Precision of the model is: " + str(precision_score(test_target, test_features_predictions)))
print("Recall of the model is: " + str(recall_score(test_target, test_features_predictions)))
print("Classification Report on test data: \n" + 
      classification_report_imbalanced(test_target, test_features_predictions))

Best parameters after tuning are: {'clf__max_depth': 5, 'clf__max_features': 'log2', 'clf__n_estimators': 25, 'oversample__ratio': 0.5}
F1_Score of the model is: 0.644736842105
Accuracy of the model is: 0.878172588832
Precision of the model is: 0.574780058651
Recall of the model is: 0.734082397004
Classification Report on test data: 
                   pre       rec       spe        f1       geo       iba       sup

          0       0.95      0.90      0.73      0.93      0.81      0.67      1506
          1       0.57      0.73      0.90      0.64      0.81      0.65       267

avg / total       0.89      0.88      0.76      0.88      0.81      0.67      1773
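
<p>The notebook imports <code>confusion_matrix</code> but never prints one. The counts can still be recovered from the scores above, since precision, recall and the class supports determine them. The sketch below does that arithmetic with the values copied from the output; it is a consistency check on the reported numbers, not a new evaluation of the model.</p>

```python
# supports and scores copied from the output above
n_negative, n_positive = 1506, 267
recall, precision = 0.734082397004, 0.574780058651

tp = round(recall * n_positive)      # content creators correctly flagged
fn = n_positive - tp                 # content creators missed
fp = round(tp / precision) - tp      # non-creators wrongly flagged
tn = n_negative - fp                 # non-creators correctly left alone

accuracy = (tp + tn) / (n_negative + n_positive)
f1 = 2 * tp / (2 * tp + fp + fn)

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"accuracy={accuracy:.6f} f1={f1:.6f}")
```

<p>The recovered counts (TN=1361, FP=145, FN=71, TP=196) reproduce the reported accuracy and F1 score, so the four metrics are mutually consistent.</p>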

