# Charles Book Club - predict customers who will buy a certain book

In this scenario a book club is trying to send mail out to book club members to see if they want to buy a book called Art History of Florence. 
- But they don’t want to send mail out to all members, So the goal of the exercise is to find a way to send mail to only those with a high probability of actually buying the book.

### Loading libraries and the dataset 

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss, accuracy_score, f1_score, precision_recall_fscore_support, confusion_matrix
from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import SMOTE
from sklearn.metrics import recall_score,precision_score
from sklearn import svm
from sklearn.model_selection import KFold

In [2]:
bk_club = pd.read_csv('CBC_4000.csv', header = 2)

In [3]:
bk_club.head()

Unnamed: 0,Seq#,ID#,Gender,M,R,F,FirstPurch,ChildBks,YouthBks,CookBks,...,Related Purchase,Unnamed: 19,Mcode,Rcode,Fcode,Yes_Florence,No_Florence,Unnamed: 25,Unnamed: 26,Unnamed: 27
0,1,25,1,297,14,2,22,0,1,1,...,0,,5,4,2,0,1,,,
1,2,29,0,128,8,2,10,0,0,0,...,0,,4,3,2,0,1,,,
2,3,46,1,138,22,7,56,2,1,2,...,2,,4,4,3,0,1,,,
3,4,47,1,228,2,1,2,0,0,0,...,0,,5,1,1,0,1,,,
4,5,51,1,257,10,1,10,0,0,0,...,0,,5,3,1,0,1,,,


In [4]:
bk_club.describe()

Unnamed: 0,Seq#,ID#,Gender,M,R,F,FirstPurch,ChildBks,YouthBks,CookBks,...,Related Purchase,Unnamed: 19,Mcode,Rcode,Fcode,Yes_Florence,No_Florence,Unnamed: 25,Unnamed: 26,Unnamed: 27
count,4000.0,4000.0,4000.0,4000.0,4000.0,4000.0,4000.0,4000.0,4000.0,4000.0,...,4000.0,0.0,4000.0,4000.0,4000.0,4000.0,4000.0,0.0,0.0,0.0
mean,2000.5,16594.623,0.7045,208.0915,13.3905,3.83325,26.50725,0.63975,0.30475,0.73125,...,0.885,,4.28125,3.17,2.08575,0.0845,0.9155,,,
std,1154.844867,9484.433792,0.456324,100.948548,8.103822,3.458386,18.35138,0.994343,0.61194,1.089413,...,1.226234,,0.915619,0.928071,0.831907,0.278171,0.278171,,,
min,1.0,25.0,0.0,15.0,2.0,1.0,2.0,0.0,0.0,0.0,...,0.0,,1.0,1.0,1.0,0.0,0.0,,,
25%,1000.75,8253.25,0.0,129.0,8.0,1.0,12.0,0.0,0.0,0.0,...,0.0,,4.0,3.0,1.0,0.0,1.0,,,
50%,2000.5,16581.0,1.0,208.0,12.0,2.0,20.0,0.0,0.0,0.0,...,0.0,,5.0,3.0,2.0,0.0,1.0,,,
75%,3000.25,24838.25,1.0,283.0,16.0,6.0,36.0,1.0,0.0,1.0,...,1.0,,5.0,4.0,3.0,0.0,1.0,,,
max,4000.0,32977.0,1.0,479.0,36.0,12.0,99.0,7.0,5.0,7.0,...,8.0,,5.0,4.0,3.0,1.0,1.0,,,


In [5]:
bk_club.size

112000

In [6]:
bk_club.shape

(4000, 28)

So, we have 4000 observations with 28 variables

In [9]:
bk_club.columns

Index(['Gender', 'M', 'R', 'F', 'FirstPurch', 'ChildBks', 'YouthBks',
       'CookBks', 'DoItYBks', 'RefBks', 'ArtBks', 'GeogBks', 'ItalCook',
       'ItalAtlas', 'ItalArt', 'Florence', 'Related Purchase'],
      dtype='object')

#### Keeping only the required columns and performing regularisation on columns with large values.


In [10]:
bk_club = bk_club[['Gender', 'M', 'R', 'F', 'FirstPurch', 'ChildBks',
                   'YouthBks', 'CookBks', 'DoItYBks', 'RefBks', 'ArtBks', 'GeogBks',
                   'ItalCook', 'ItalAtlas', 'ItalArt', 'Florence', 'Related Purchase']]

for col in ['M', 'R', 'F', 'FirstPurch']:
    
    bk_club[col] = (bk_club[col] - bk_club[col].mean())/bk_club[col].std()

In [11]:
bk_club.head()

Unnamed: 0,Gender,M,R,F,FirstPurch,ChildBks,YouthBks,CookBks,DoItYBks,RefBks,ArtBks,GeogBks,ItalCook,ItalAtlas,ItalArt,Florence,Related Purchase
0,1,0.880731,0.075211,-0.530088,-0.245608,0,1,1,0,0,0,0,0,0,0,0,0
1,0,-0.793389,-0.66518,-0.530088,-0.89951,0,0,0,0,0,0,0,0,0,0,0,0
2,1,-0.694329,1.0624,0.915673,1.607113,2,1,2,0,1,0,1,1,0,0,0,2
3,1,0.197214,-1.405571,-0.819241,-1.335445,0,0,0,0,0,0,0,0,0,0,0,0
4,1,0.484489,-0.418383,-0.819241,-0.89951,0,0,0,0,0,0,0,0,0,0,0,0


#### Splitting the dataframe into features 'X' and target variable 'y'.

In [12]:
X = bk_club[['Gender', 'M', 'R', 'F', 'FirstPurch', 'ChildBks',
       'YouthBks', 'CookBks', 'DoItYBks', 'RefBks', 'ArtBks', 'GeogBks',
       'ItalCook', 'ItalAtlas', 'ItalArt', 'Related Purchase']]

y = bk_club['Florence']

In [18]:
y.value_counts()

0    3662
1     338
Name: Florence, dtype: int64

#### As you can see the dataset is highly unbalanced. Anyway ignoring this for the time being and splitting into train and test sets

In [19]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)


In [20]:
y_train.value_counts()

0    2551
1     249
Name: Florence, dtype: int64

### As you can see the data is highly unbalanced.

- Here, the data is highly skewed with 10 percent positive samples and 90 percent negative samples instead of using accuracy as a parameter we are more concerened with the RECALL of the model.
- If the recall(ratio of predicted true positives over all positives) is high, we are less likely to miss a customer for advertising the new book. 
- Therefore the model aims at improving RECALL of the predictions.

## Trying logistic regression on unbalanced data

In [58]:
model = LogisticRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
y_train_pred = model.predict(X_train)

print('accuracy: ', accuracy_score(predictions,y_test))

#a = precision_recall_fscore_support(y_train,y_train_pred)
#print('training ',a)
#a = precision_recall_fscore_support(y_val,predictions)
#print('val ',a)

print('validation loss: ' ,log_loss(y_test,predictions))
print('training loss: ', log_loss(y_train,y_train_pred))
print('confusion matrix:\n ', confusion_matrix(y_test, predictions))
print('test f1 score: ', f1_score(y_test,predictions))
#print(f1_score(y_train,y_train_pred))
print('recall: ',recall_score(y_test,predictions))
print('precision: ',precision_score(y_test,predictions))

accuracy:  0.9275
validation loss:  2.50406128863
training loss:  2.98513767384
confusion matrix:
  [[1111    0]
 [  87    2]]
test f1 score:  0.043956043956
recall:  0.0224719101124
precision:  1.0


- The accuracy is high but that is not the right metric to judge our model as the data is highly skewed.
-  We want high recall which is really poor in this case.
- Let us try using 'balanced logistic regression' which adjusts the weights of our model according to the frequency of each class.
- Also we are varying the C parameter for regularisation.(increasing variance)

In [62]:
model = LogisticRegression(class_weight = 'balanced', C=100)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
y_train_pred = model.predict(X_train)
print('accuracy: ', accuracy_score(predictions,y_test))
#a = precision_recall_fscore_support(y_train,y_train_pred)
#print('training ',a)
#a = precision_recall_fscore_support(y_val,predictions)
#print('val ',a)

print('validation loss: ' ,log_loss(y_test,predictions))
print('training loss: ', log_loss(y_train,y_train_pred))
print('confusion matrix:\n ', confusion_matrix(y_test, predictions))
print('test f1 score: ', f1_score(y_test,predictions))
#print(f1_score(y_train,y_train_pred))
print('recall: ',recall_score(y_test,predictions))
print('precision: ',precision_score(y_test,predictions))

accuracy:  0.6575
validation loss:  11.8297734598
training loss:  11.4473694883
confusion matrix:
  [[747 364]
 [ 47  42]]
test f1 score:  0.169696969697
recall:  0.47191011236
precision:  0.103448275862


- We get an okayish recall this time.
- Hence we should now try something else to reduce this problem of skewed classes in our data.

### Applying SMOTE(synthetic minority oversampling technique)

- The correct way of applying SMOTE is to apply it to the training set only not to the entire dataset. 
- This will ensure that similar examples are not copied over to the test set giving us false high results.

In [14]:
sm = SMOTE(random_state = 1,ratio = 'minority')
X_train_res, y_train_res = sm.fit_sample(X_train, y_train)
y_train_res = pd.Series(y_train_res)
X_train_res = pd.DataFrame(X_train_res)
X_train_res.columns = X_train.columns

As you can now see the two classes are balanced in the training set.

In [15]:
pd.Series(y_train_res).value_counts()

1    2339
0    2339
dtype: int64

Applying logistic regression again, this time to the training set with balanced classes.

In [68]:
model = LogisticRegression(C = 100)
model.fit(X_train_res, y_train_res)
predictions = model.predict(X_test)
y_train_pred = model.predict(X_train_res)
print('accuracy: ', accuracy_score(predictions,y_test))
#a = precision_recall_fscore_support(y_train,y_train_pred)
#print('training ',a)
#a = precision_recall_fscore_support(y_val,predictions)
#print('val ',a)



print('validation loss: ' ,log_loss(y_test,predictions))
print('training loss: ', log_loss(y_train_res,y_train_pred))
print('confusion matrix:\n ', confusion_matrix(y_test, predictions))
print('test f1 score: ', f1_score(y_test,predictions))
#print(f1_score(y_train,y_train_pred))
print('recall: ',recall_score(y_test,predictions))
print('precision: ',precision_score(y_test,predictions))

accuracy:  0.664166666667
validation loss:  11.5995142842
training loss:  13.1420566882
confusion matrix:
  [[748 363]
 [ 40  49]]
test f1 score:  0.195608782435
recall:  0.550561797753
precision:  0.118932038835


- As we can see, the recall has improved a bit.
- Let us now try some other model.
- Also using K-Fold validation this time with 5 folds.

### time for SVMs

#### Using K FOLD

- Using KFold with 5 folds, applying SMOTE on 4 folds and using the remaining fold as test set.
- We use an 'rbf' kernel with parameters C =500, and gamma = 2.

In [22]:
kf = KFold(n_splits = 5, shuffle = True, random_state = 1)

In [76]:
recall_list = []
for train_index, test_index in kf.split(X):
    X_train_split = X.iloc[train_index]
    y_train_split = y[train_index]
    
    X_test_split = X.iloc[test_index]
    y_test_split = y[test_index]
        
    sm = SMOTE(random_state = 1,ratio = 'minority')
    X_train_res, y_train_res = sm.fit_sample(X_train_split, y_train_split)
    y_train_res = pd.Series(y_train_res)
    X_train_res = pd.DataFrame(X_train_res)
    X_train_res.columns = X_train.columns
    
    model = svm.SVC(kernel='rbf',C=500, gamma = 2)
    
    model.fit(X_train_res, y_train_res)
    predictions = model.predict(X_val)
    y_train_res_pred = model.predict(X_train_res)
    recall_list.append(recall_score(y_val,predictions))
    
    print('JCV: ',log_loss(y_val,predictions))
    print('Jtrain:', log_loss(y_train_res,y_train_res_pred))
    print('recall: ',recall_score(y_val,predictions))
    print('accuracy: ',accuracy_score(y_val,predictions))
    print(confusion_matrix(y_val, predictions))
    print('\n')

JCV:  0.755553224832
Jtrain: 0.590617785861
recall:  1.0
accuracy:  0.978125
[[571  14]
 [  0  55]]


JCV:  1.83490497963
Jtrain: 0.602018585662
recall:  0.854545454545
accuracy:  0.946875
[[559  26]
 [  8  47]]


JCV:  1.67299946779
Jtrain: 0.48381039307
recall:  0.836363636364
accuracy:  0.9515625
[[563  22]
 [  9  46]]


JCV:  1.942832409
Jtrain: 0.558413568778
recall:  0.727272727273
accuracy:  0.94375
[[564  21]
 [ 15  40]]


JCV:  1.99680049649
Jtrain: 0.571142120852
recall:  0.727272727273
accuracy:  0.9421875
[[563  22]
 [ 15  40]]




In [77]:
print('recall: ',sum(recall_list)/len(recall_list))

recall:  0.829090909091
