## Question: Which model performed better? And why?
To answer the question, it is important to keep in mind what we are trying to predict: whether someone clicked or not. Which is more important here, precision or recall? In other words, is it more valuable to be sure that a predicted click really was a click (precision), or to make sure that the classifier does not miss any of the actual clicks (recall)?

Looking at the outcomes of the logistic regression and k-nearest neighbors from the tutorial, both models show about the same precision for non-clicks, but the logistic regression shows clearly higher recall (it hardly misses any non-clicks). For clicks, precision is a little higher for the logistic regression than for k-nearest neighbors, but recall is considerably worse (although neither performs particularly well here). This means that the logistic regression misses a very substantial share of the actual clicks, but when it predicts a click, this is more likely to be true than with k-nearest neighbors.

Considering that we want to predict whether someone clicked after being exposed to the campaign, k-nearest neighbors might be the better option, as it is important not to miss as many clicks as the logistic regression does. With the logistic regression we could not be at all sure that we actually found most of the people who clicked, which gives less opportunity to look more closely into who they are, how they came to the website, etc.

Furthermore, the f1 score can inform the decision. It is the harmonic mean of precision and recall, so it takes both measures into account. Here k-nearest neighbors performed worse on the non-clicks but better on the clicks than the logistic regression. Since a good measure of the clicks has more value than one of the non-clicks (we want to know more about the clicks and predict them better to gain value from them), I would again choose k-nearest neighbors.
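
For reference, the three measures discussed above can be written in terms of true positives (TP), false positives (FP) and false negatives (FN) for the class of interest (here: clicks). These are the standard definitions, not something specific to the tutorial:

$$
\text{precision} = \frac{TP}{TP + FP}, \qquad
\text{recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
$$

Because the harmonic mean is pulled toward the smaller of the two values, a classifier with high precision but very low recall on clicks still ends up with a low f1 score for clicks.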

## Creating a model to predict sells

In [1]:
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_recall_fscore_support
from sklearn.utils.multiclass import unique_labels
%matplotlib inline

I included some handy functions that make the display a little nicer and make it easier to compare the results of different classifiers:
* make it possible to print bold (printmd)
* make the classification report a pandas dataframe (classification_report_pandas)
* make the confusion matrix a pandas dataframe (cm2df)
* make it possible to display pandas dataframes side by side (display_side_by_side)

In [2]:
from IPython.display import Markdown, display
def printmd(string):
    display(Markdown(string))

In [3]:
def classification_report_pandas(real_value,predictions, **kwargs):

    labels = unique_labels(real_value, predictions)
    
    precision, recall, f_score, support = precision_recall_fscore_support(real_value,
                                                                          predictions,
                                                                          labels=labels,
                                                                          average=None)
    if 'number' in kwargs:
        number = kwargs['number']
        results_pd = pd.DataFrame({"class": labels,
                               "f_score": f_score,
                               'precision':precision,
                               'recall':recall,
                               'n_neighbors':number
                               })
    else:
        results_pd = pd.DataFrame({"class": labels,
                               "f_score": f_score,
                               'precision':precision,
                               'recall':recall,
                               })
    return results_pd
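
As a possible shortcut (not what the function above does), scikit-learn's `classification_report` accepts `output_dict=True`, and the resulting dictionary can be turned into a dataframe. A minimal sketch on made-up labels; `y_true` and `y_pred` are hypothetical stand-ins for the real columns:

```python
import pandas as pd
from sklearn.metrics import classification_report

# Made-up labels for illustration only (not taken from the web campaign data).
y_true = [0, 0, 1, 1]
y_pred = [0, 1, 1, 1]

# output_dict=True returns nested dictionaries instead of a formatted string.
report = classification_report(y_true, y_pred, output_dict=True)

# Transpose so that each class (plus the averages) becomes one row.
report_df = pd.DataFrame(report).transpose()
print(report_df)
```

Unlike `classification_report_pandas`, this also includes the accuracy and the averaged rows, and it has no `n_neighbors` column, so the custom function remains the better fit for the comparisons below.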

In [4]:
def cm2df(cm, labels):
    # Wrap the confusion matrix in a dataframe: rows are the true labels,
    # columns are the predicted labels, both in the order given by `labels`.
    return pd.DataFrame(cm, index=labels, columns=labels)

In [5]:
from IPython.display import display_html
def display_side_by_side(*args):
    html_str=''
    for df in args:
        html_str+=df.to_html()
    # Only touch the opening <table ...> tags, not the closing ones.
    display_html(html_str.replace('<table', '<table style="display:inline"'), raw=True)

... and something to silence the pandas `SettingWithCopyWarning` that would otherwise appear when adding prediction columns to the test set

In [6]:
pd.options.mode.chained_assignment = None
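
An alternative to switching the warning off globally is to make an explicit copy of a subset before adding columns to it. A small sketch on a made-up dataframe (`toy`, `subset` and their columns are hypothetical, not the web campaign data):

```python
import pandas as pd

# Made-up data for illustration only.
toy = pd.DataFrame({'x': range(10), 'y': [0, 1] * 5})

# .copy() makes the subset an independent dataframe, so adding a column
# to it is unambiguous and does not touch the original.
subset = toy[toy['x'] >= 8].copy()
subset['predicted'] = 0

print(subset)
print('predicted' in toy.columns)
```

Applied to this notebook, that would presumably mean calling `.copy()` on `test` right after the split; whether that alone avoids the warning may depend on the pandas version.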

The first steps are the same as last week: reading the dataframe and putting the referrals into separate dummy-coded columns.

In [342]:
webdata = pd.read_excel('web_campaign_simulated.xlsx')

In [343]:
def check_referral(referral, site):
    if referral == site:
        return 1
    return 0

In [344]:
webdata['google'] = webdata['referral'].apply(check_referral, args=('google',))
webdata['facebook'] = webdata['referral'].apply(check_referral, args=('facebook',))
webdata['news_a'] = webdata['referral'].apply(check_referral, args=('newsletter A',))
webdata['news_b'] = webdata['referral'].apply(check_referral, args=('newsletter B',))
webdata['nyt'] = webdata['referral'].apply(check_referral, args=('nyt',))
webdata['tumblr'] = webdata['referral'].apply(check_referral, args=('tumblr',))
webdata['twitter'] = webdata['referral'].apply(check_referral, args=('twitter',))
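
The dummy coding above could probably also be done in one step with `pd.get_dummies`. A minimal sketch on a made-up `referral` column (`toy` is hypothetical); note that the new columns are named after the referral values themselves, so names like `news_a` would need a rename afterwards:

```python
import pandas as pd

# Hypothetical miniature stand-in for the 'referral' column of webdata.
toy = pd.DataFrame({'referral': ['google', 'facebook', 'nyt', 'google']})

# One 0/1 column per referral value, named after the value itself.
dummies = pd.get_dummies(toy['referral']).astype(int)
toy = pd.concat([toy, dummies], axis=1)
print(toy)
```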

Now I split the data into a train and a test dataset and train the classifiers (using the same features as in the tutorial, but substituting sell with click, as it would not make sense to predict sells with sell).

In [345]:
train, test = train_test_split(webdata, test_size=0.2, random_state=0)

In [346]:
logit_clf = LogisticRegression(max_iter=1000, fit_intercept = True)
n_clf = KNeighborsClassifier(n_neighbors=5)

In [347]:
features = ['age', 'female', 'google', 'click', 'facebook', 'time_spent', 'campaign_1']

In [348]:
logit_clf.fit(train[features], train['sell'])
n_clf.fit(train[features], train['sell'])

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')

In [349]:
test['predicted_sells_logit'] = logit_clf.predict(test[features])
test['predicted_sells_nn'] = n_clf.predict(test[features])

The confusion matrices (logistic regression first, k-nearest neighbors second) show that both classifiers performed very well at predicting sells. For the logistic regression, false positives and false negatives are almost the same (16 and 15), while for k-nearest neighbors the false negatives (20) are higher than the false positives (9). Precision and recall show, however, that both models score very high on both dimensions and that their f1 scores are close to identical. In the end, each of the classifiers can predict the sales almost perfectly, with k-nearest neighbors performing slightly better (although we are talking about differences of about 0.01 in precision and recall here), so I would maybe choose this model. In the end it basically does not matter, though: both will perform very well.

In [350]:
display_side_by_side(cm2df(confusion_matrix(test['sell'], test['predicted_sells_logit']), labels = [0,1]), cm2df(confusion_matrix(test['sell'], test['predicted_sells_nn']), labels = [0,1]))

Unnamed: 0,0,1
0,1105,16
1,15,666

Unnamed: 0,0,1
0,1112,9
1,20,661


In [351]:
display_side_by_side(classification_report_pandas(test['sell'], test['predicted_sells_logit']), classification_report_pandas(test['sell'], test['predicted_sells_nn']))

Unnamed: 0,class,f_score,precision,recall
0,0,0.986167,0.986607,0.985727
1,1,0.977256,0.97654,0.977974

Unnamed: 0,class,f_score,precision,recall
0,0,0.987128,0.982332,0.991971
1,1,0.978534,0.986567,0.970631
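
To make the link between the confusion matrix and the report explicit, the class 1 (sell) values of the logistic regression can be recomputed by hand from the four counts displayed above. This is only a check of the arithmetic; the variable names are my own:

```python
# Confusion matrix of the logistic regression as displayed above
# (rows: true class, columns: predicted class).
tn, fp = 1105, 16
fn, tp = 15, 666

precision = tp / (tp + fp)   # share of predicted sells that were real sells
recall = tp / (tp + fn)      # share of real sells that were found
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 6), round(recall, 6), round(f1, 6))
```

The three values agree with the class 1 row of the first report table (precision 0.97654, recall 0.977974, f_score 0.977256).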


## Changes to the k-nearest neighbors classifier
I decided to look at every number of neighbors between 1 and 10, using a loop to store all confusion matrices (converted to dataframes) in a list and then print them side by side so I can compare all of them directly. This shows that with only one neighbor the model has more false positives than the other models but fewer false negatives. In general, all of the classifiers misclassified about the same number of sells (between 24 and 30), but sometimes the false negatives are lower and sometimes the false positives. It is difficult to see a real pattern here, but the results seem to stabilize at around eight neighbors; after that no major changes can be seen. Again, the models all perform very well in general.

This can also be seen when ordering the different classifiers according to precision and recall: between the models with the highest and the lowest precision for non-sells (one neighbor versus two and six) there is a difference of 0.005, and for sells (two and six versus one) a difference of about 0.015. However, it also becomes apparent that the classifiers that are higher on precision for non-sells are lower on precision for sells and the other way around (see for example classifiers 2 and 6, which are the highest on sells and the lowest on non-sells). Thus one always has to make a trade-off between the two and see what is more valuable for the question at hand. Regarding recall we can see differences of similar magnitude between the best and worst classifiers. Furthermore, the classifiers with high precision on non-sells or sells now have low recall on that class and the other way around. So here one again has to decide whether precision or recall is more important (as described for the first question) or whether one wants the highest f1 score (a mixture of both).

In [352]:
matrices = []
for i in range(1,11):
    n_clf = KNeighborsClassifier(n_neighbors=i)
    n_clf.fit(train[features], train['sell'])
    test['predicted_sells_nn'] = n_clf.predict(test[features])
    matrix = confusion_matrix(test['sell'], test['predicted_sells_nn'])
    matrices.append(cm2df(matrix, [0,1]))
print('1neighbors  2neighbors  3neighbors  4neighbors  5neighbors  6neighbors  7neighbors  8neighbors  9neighbors  10neighbors')
display_side_by_side(*matrices)

1neighbors  2neighbors  3neighbors  4neighbors  5neighbors  6neighbors  7neighbors  8neighbors  9neighbors  10neighbors


Unnamed: 0,0,1
0,1107,14
1,16,665

Unnamed: 0,0,1
0,1117,4
1,22,659

Unnamed: 0,0,1
0,1113,8
1,19,662

Unnamed: 0,0,1
0,1116,5
1,21,660

Unnamed: 0,0,1
0,1112,9
1,20,661

Unnamed: 0,0,1
0,1117,4
1,22,659

Unnamed: 0,0,1
0,1110,11
1,17,664

Unnamed: 0,0,1
0,1115,6
1,18,663

Unnamed: 0,0,1
0,1113,8
1,17,664

Unnamed: 0,0,1
0,1113,8
1,18,663
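
To compare the ten matrices without reading them one by one, the off-diagonal counts can be summed per number of neighbors. The numbers below are copied by hand from the output above; the dictionary layout is just my own way of organizing them:

```python
# (false positives, false negatives) per number of neighbors,
# read off the confusion matrices shown above.
errors = {
    1: (14, 16), 2: (4, 22), 3: (8, 19), 4: (5, 21), 5: (9, 20),
    6: (4, 22), 7: (11, 17), 8: (6, 18), 9: (8, 17), 10: (8, 18),
}

# Total number of misclassified test cases per number of neighbors.
totals = {k: fp + fn for k, (fp, fn) in errors.items()}
for k, total in totals.items():
    fp, fn = errors[k]
    print(f"k={k:2d}  false positives={fp:2d}  false negatives={fn:2d}  total={total}")

best_k = min(totals, key=totals.get)
print("fewest errors:", best_k, totals[best_k])
```

The totals range from 24 (eight neighbors) to 30 (one neighbor), which is a small spread given 1,802 test cases.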


In [353]:
data_list = []
for i in range(1,11):
    n_clf = KNeighborsClassifier(n_neighbors=i)
    n_clf.fit(train[features], train['sell'])
    test['predicted_sells_nn'] = n_clf.predict(test[features])
    dataframe = classification_report_pandas(test['sell'], test['predicted_sells_nn'], number = i)
    data_list.append(dataframe)
final_data_sells = pd.concat(data_list)

In [354]:
df_sells0 = final_data_sells[final_data_sells['class'] == 0]
df_sells1 = final_data_sells[final_data_sells['class'] == 1]

In [355]:
display_side_by_side(df_sells0.sort_values('precision', ascending = False), df_sells1.sort_values('precision', ascending = False))
display_side_by_side(df_sells0.sort_values('recall', ascending = False), df_sells1.sort_values('recall', ascending = False))

Unnamed: 0,class,f_score,n_neighbors,precision,recall
0,0,0.986631,1,0.985752,0.987511
0,0,0.988894,9,0.984956,0.992864
0,0,0.987544,7,0.984916,0.990187
0,0,0.989352,8,0.984113,0.994648
0,0,0.988455,10,0.984085,0.992864
0,0,0.988016,3,0.983216,0.992864
0,0,0.987128,5,0.982332,0.991971
0,0,0.988485,4,0.98153,0.99554
0,0,0.988496,2,0.980685,0.996432
0,0,0.988496,6,0.980685,0.996432

Unnamed: 0,class,f_score,n_neighbors,precision,recall
1,1,0.980655,2,0.993967,0.967695
1,1,0.980655,6,0.993967,0.967695
1,1,0.980684,4,0.992481,0.969163
1,1,0.982222,8,0.991031,0.973568
1,1,0.981523,9,0.988095,0.975037
1,1,0.980769,10,0.988077,0.973568
1,1,0.980015,3,0.98806,0.9721
1,1,0.978534,5,0.986567,0.970631
1,1,0.979351,7,0.983704,0.975037
1,1,0.977941,1,0.979381,0.976505


Unnamed: 0,class,f_score,n_neighbors,precision,recall
0,0,0.988496,2,0.980685,0.996432
0,0,0.988496,6,0.980685,0.996432
0,0,0.988485,4,0.98153,0.99554
0,0,0.989352,8,0.984113,0.994648
0,0,0.988016,3,0.983216,0.992864
0,0,0.988894,9,0.984956,0.992864
0,0,0.988455,10,0.984085,0.992864
0,0,0.987128,5,0.982332,0.991971
0,0,0.987544,7,0.984916,0.990187
0,0,0.986631,1,0.985752,0.987511

Unnamed: 0,class,f_score,n_neighbors,precision,recall
1,1,0.977941,1,0.979381,0.976505
1,1,0.979351,7,0.983704,0.975037
1,1,0.981523,9,0.988095,0.975037
1,1,0.982222,8,0.991031,0.973568
1,1,0.980769,10,0.988077,0.973568
1,1,0.980015,3,0.98806,0.9721
1,1,0.978534,5,0.986567,0.970631
1,1,0.980684,4,0.992481,0.969163
1,1,0.980655,2,0.993967,0.967695
1,1,0.980655,6,0.993967,0.967695


As the models above showed so few differences for sells (they all predicted the sales close to perfectly), I decided to repeat the analysis for clicks to maybe see more pronounced differences. The confusion matrices indeed show a pattern: for even numbers of neighbors the false negatives and false positives are much further apart (more false negatives, fewer false positives) than for odd numbers of neighbors, where the two are closer together (almost equal for one neighbor). This pattern gets less clear after seven neighbors, and after that only minor changes can be seen. After some searching I found online that one should use odd values for binary classification to avoid ties (both class labels receiving the same number of votes), so I guess this is related to the pattern here: with even numbers we get more false negatives and recall for clicks gets worse, so one should rather choose an odd number (at least when recall is important).
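
What such a tie looks like can be sketched on a tiny made-up example with two neighbors, one from each class. The data and variable names are hypothetical and only serve to illustrate the split vote:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Made-up one-dimensional data: one point per class, for illustration only.
X = np.array([[0.0], [2.0]])
y = np.array([0, 1])

clf = KNeighborsClassifier(n_neighbors=2).fit(X, y)

# With two neighbors, one from each class, the vote is split evenly.
query = np.array([[1.0]])
proba = clf.predict_proba(query)
print(proba)
print(clf.predict(query))
```

How the tie is broken is an implementation detail. As far as I can tell, scikit-learn falls back to the class that comes first in `clf.classes_` (here 0, i.e. no click), which would fit the higher number of false negatives for even numbers of neighbors, but I treat that as an assumption rather than documented behavior.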

In [356]:
features1 = ['age', 'female', 'google', 'sell', 'facebook', 'time_spent', 'campaign_1']

In [357]:
matrices = []
for i in range(1,11):
    n_clf = KNeighborsClassifier(n_neighbors=i)
    n_clf.fit(train[features1], train['click'])
    test['predicted_clicks_nn'] = n_clf.predict(test[features1])
    matrix = confusion_matrix(test['click'], test['predicted_clicks_nn'])
    matrices.append(cm2df(matrix, [0,1]))
print('1neighbors 2neighbors 3neighbors 4neighbors 5neighbors 6neighbors 7neighbors 8neighbors 9neighbors 10neighbors')
display_side_by_side(*matrices)

1neighbors 2neighbors 3neighbors 4neighbors 5neighbors 6neighbors 7neighbors 8neighbors 9neighbors 10neighbors


Unnamed: 0,0,1
0,664,427
1,410,301

Unnamed: 0,0,1
0,934,157
1,593,118

Unnamed: 0,0,1
0,710,381
1,466,245

Unnamed: 0,0,1
0,918,173
1,586,125

Unnamed: 0,0,1
0,772,319
1,492,219

Unnamed: 0,0,1
0,913,178
1,579,132

Unnamed: 0,0,1
0,827,264
1,518,193

Unnamed: 0,0,1
0,927,164
1,584,127

Unnamed: 0,0,1
0,831,260
1,532,179

Unnamed: 0,0,1
0,916,175
1,594,117


In [358]:
data_list = []
for i in range(1,11):
    n_clf = KNeighborsClassifier(n_neighbors=i)
    n_clf.fit(train[features1], train['click'])
    test['predicted_clicks_nn'] = n_clf.predict(test[features1])
    dataframe = classification_report_pandas(test['click'], test['predicted_clicks_nn'], number = i)
    data_list.append(dataframe)
final_data_clicks = pd.concat(data_list)

In [359]:
df_clicks0 = final_data_clicks[final_data_clicks['class'] == 0]
df_clicks1 = final_data_clicks[final_data_clicks['class'] == 1]

In [360]:
display_side_by_side(df_clicks0.sort_values('precision', ascending = False), df_clicks1.sort_values('precision', ascending = False))
display_side_by_side(df_clicks0.sort_values('recall', ascending = False), df_clicks1.sort_values('recall', ascending = False))

Unnamed: 0,class,f_score,n_neighbors,precision,recall
0,0,0.613395,1,0.61825,0.608616
0,0,0.678982,7,0.61487,0.75802
0,0,0.712529,8,0.613501,0.849679
0,0,0.70693,6,0.61193,0.836847
0,0,0.713522,2,0.611657,0.856095
0,0,0.655626,5,0.610759,0.707608
0,0,0.707514,4,0.610372,0.84143
0,0,0.677262,9,0.609685,0.761687
0,0,0.704344,10,0.606623,0.839597
0,0,0.626378,3,0.603741,0.650779

Unnamed: 0,class,f_score,n_neighbors,precision,recall
1,1,0.253493,8,0.436426,0.178622
1,1,0.239351,2,0.429091,0.165963
1,1,0.25857,6,0.425806,0.185654
1,1,0.330479,7,0.422319,0.271449
1,1,0.24777,4,0.419463,0.175809
1,1,0.418346,1,0.413462,0.423347
1,1,0.311304,9,0.407745,0.251758
1,1,0.350681,5,0.407063,0.308017
1,1,0.2333,10,0.400685,0.164557
1,1,0.366492,3,0.391374,0.344585


Unnamed: 0,class,f_score,n_neighbors,precision,recall
0,0,0.713522,2,0.611657,0.856095
0,0,0.712529,8,0.613501,0.849679
0,0,0.707514,4,0.610372,0.84143
0,0,0.704344,10,0.606623,0.839597
0,0,0.70693,6,0.61193,0.836847
0,0,0.677262,9,0.609685,0.761687
0,0,0.678982,7,0.61487,0.75802
0,0,0.655626,5,0.610759,0.707608
0,0,0.626378,3,0.603741,0.650779
0,0,0.613395,1,0.61825,0.608616

Unnamed: 0,class,f_score,n_neighbors,precision,recall
1,1,0.418346,1,0.413462,0.423347
1,1,0.366492,3,0.391374,0.344585
1,1,0.350681,5,0.407063,0.308017
1,1,0.330479,7,0.422319,0.271449
1,1,0.311304,9,0.407745,0.251758
1,1,0.25857,6,0.425806,0.185654
1,1,0.253493,8,0.436426,0.178622
1,1,0.24777,4,0.419463,0.175809
1,1,0.239351,2,0.429091,0.165963
1,1,0.2333,10,0.400685,0.164557
