In [53]:
import pandas as pd
from patsy import dmatrices
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
import numpy as np
from sklearn.metrics import confusion_matrix

# Problem Context
The problem is to predict whether a session with an `Add to Cart` event will be abandoned or result in a successful purchase.

This is a binary classification problem: we are predicting one of two classes.
Logistic regression is a safe first choice for a model, as it outputs a value between 0 and 1 that estimates the probability that a given feature set X belongs to the positive class.

We will train the model, evaluate its goodness of fit, and then test it.
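As a minimal sketch of why the output lands between 0 and 1: logistic regression computes a linear score (the log odds) and passes it through the logistic function. The helper name `sigmoid` and the input values below are illustrative assumptions, not part of the notebook.

```python
import math

def sigmoid(log_odds: float) -> float:
    """Map a log-odds value (the linear predictor) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-log_odds))

# Hypothetical linear predictor values, for illustration only.
for z in (-4.0, 0.0, 4.0):
    print(z, round(sigmoid(z), 4))
```

A log-odds value of 0 corresponds to a probability of exactly 0.5, which is why 0.5 is the usual default classification threshold.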

# Preparing the Dataset
The choice of independent variables for the model was explained in the exploration.txt file, question 4. However, I added some variables here: interaction terms (ratios of existing features) that I deemed would be useful (explanation in the comments).

All categorical variables are one-hot encoded, dropping the first value to reduce collinearity with the intercept.
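To illustrate what "dropping the first value" means, here is a small pure-Python sketch. The helper `one_hot_drop_first` is hypothetical; in this notebook patsy performs the equivalent encoding automatically for categorical columns when the formula contains an intercept.

```python
def one_hot_drop_first(values):
    """One-hot encode category labels, dropping the first (sorted) level.

    The dropped level becomes the reference category: its effect is
    absorbed by the intercept, and every other coefficient is read
    relative to it.
    """
    levels = sorted(set(values))
    kept = levels[1:]
    rows = [[1 if value == level else 0 for level in kept] for value in values]
    return kept, rows

columns, rows = one_hot_drop_first(["Americas", "Asia", "Europe", "Asia"])
print(columns)
for row in rows:
    print(row)
```

Keeping all levels alongside an intercept would make the indicator columns sum to the intercept column, so the design matrix would be perfectly collinear.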

In [54]:
# helper to build an R-style (patsy) formula from the DataFrame columns
def formula_from_cols(df, y, exclude):
    return y + ' ~ ' + ' + '.join([col for col in df.columns if col!=y and col not in exclude])

df = pd.read_csv("googleanalytics.csv")

# converting targets to int
df['is_purchase']=df['is_purchase'].astype(int)

# action_vs_view_ratio: page views per user action (despite the name, this is
# views divided by actions, so a lower value means more interaction per page view)
df['action_vs_view_ratio'] = df['number_of_page_views']/df['number_of_user_actions'] 
# avg_seconds_per_actions: how rapidly the user interacts with the page
df['avg_seconds_per_actions'] = df['time_on_site']/df['number_of_user_actions']
display(df)

# split the training and testing data set
train, test = train_test_split(df, test_size=0.1)
display(train)
# exclude last_add_to_cart_hour from the generated formula and add it back as
# C(last_add_to_cart_hour), because it is a categorical variable stored as a number
y, X = dmatrices(formula_from_cols(train, 'is_purchase', ['last_add_to_cart_hour']) + ' + C(last_add_to_cart_hour)', data=train, return_type='dataframe')

Unnamed: 0,has_remove_from_cart,is_purchase,continent,is_new_visit,isMobile,last_add_to_cart_hour,number_of_user_actions,number_of_page_views,time_on_site,has_promo,first_add_to_cart_delay,add_to_cart_count,quick_click_count,onsite_click_count,product_click_count,action_vs_view_ratio,avg_seconds_per_actions
0,False,0,Americas,False,False,14,1,1,3,False,0.000,1,0,0,0,1.000000,3.000000
1,False,0,Americas,True,True,17,1,3,76,False,71.293,1,0,0,0,3.000000,76.000000
2,False,0,Asia,True,True,5,2,3,318,False,290.289,2,0,0,0,1.500000,159.000000
3,False,0,Asia,False,False,10,2,4,654,True,561.491,1,0,0,0,2.000000,327.000000
4,False,0,Americas,False,False,12,2,5,47,False,26.985,1,1,0,0,2.500000,23.500000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
50132,False,0,Asia,True,False,22,1,6,85,False,73.899,1,0,0,0,6.000000,85.000000
50133,False,0,Europe,False,False,15,2,5,49,False,46.777,1,1,0,0,2.500000,24.500000
50134,False,0,Europe,True,False,11,3,10,296,False,80.489,1,2,0,0,3.333333,98.666667
50135,False,0,Americas,True,False,8,4,9,459,False,106.458,2,2,0,0,2.250000,114.750000


Unnamed: 0,has_remove_from_cart,is_purchase,continent,is_new_visit,isMobile,last_add_to_cart_hour,number_of_user_actions,number_of_page_views,time_on_site,has_promo,first_add_to_cart_delay,add_to_cart_count,quick_click_count,onsite_click_count,product_click_count,action_vs_view_ratio,avg_seconds_per_actions
39530,False,0,Americas,True,False,12,2,14,437,False,51.148,1,1,0,0,7.000000,218.500000
5849,False,0,Americas,True,False,8,18,42,914,False,224.076,8,10,0,0,2.333333,50.777778
18497,False,1,Americas,False,True,22,10,18,345,False,41.256,8,2,0,0,1.800000,34.500000
23971,False,0,Americas,False,False,7,2,5,87,False,58.168,1,1,0,0,2.500000,43.500000
47420,False,0,Americas,True,True,15,4,15,1157,False,293.492,2,1,0,1,3.750000,289.250000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
38123,False,0,Europe,True,False,10,1,11,265,False,101.668,1,0,0,0,11.000000,265.000000
42667,False,0,Europe,True,False,7,4,10,191,False,99.838,1,2,0,1,2.500000,47.750000
15216,False,0,Americas,True,False,7,9,12,98,False,95.877,1,6,0,2,1.333333,10.888889
8609,True,0,Europe,True,True,14,10,30,546,False,168.420,4,4,0,0,3.000000,54.600000


## Fitting the Model
Here we fit the model using statsmodels, which works really well for checking goodness of fit.
I was going to use sklearn, but it does not report p-values for the coefficients out of the box.

In [55]:
logit_model=sm.Logit(y,X)
result=logit_model.fit(maxiter=100)
print(result.summary())

Optimization terminated successfully.
         Current function value: 0.359584
         Iterations 9
                           Logit Regression Results                           
Dep. Variable:            is_purchase   No. Observations:                45123
Model:                          Logit   Df Residuals:                    45080
Method:                           MLE   Df Model:                           42
Date:                Fri, 10 Sep 2021   Pseudo R-squ.:                  0.2848
Time:                        02:13:06   Log-Likelihood:                -16226.
converged:                       True   LL-Null:                       -22687.
Covariance Type:            nonrobust   LLR p-value:                     0.000
                                     coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------------------------
Intercept                         -3.7836      0.673     -5.622     

# Results
For each independent variable in the table above, we can look at the p-value of its z-score. A low p-value means that the null hypothesis (that the independent variable has no effect, i.e. a coefficient of 0) can be rejected.

For our table, any feature with a `P>|z|` smaller than 0.01 (a 1% significance level) can be considered statistically significant enough to reject the null hypothesis.

Some findings based on this fact:

It seems like customers from Europe are more likely to abandon carts, as indicated by their negative coefficient (lower log odds of making a purchase relative to the reference continent).

It seems like isMobile is also a very significant independent variable to have, as it decreases the log odds of a purchase by 0.8123. This could mean that people generally don't like purchasing on their phones.
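A log-odds coefficient is easier to read once it is exponentiated into an odds ratio. The sketch below does this for the isMobile coefficient quoted above; the variable names are illustrative.

```python
import math

# Log-odds coefficient for isMobile, as quoted in the text above.
coef_is_mobile = -0.8123

# exp(coefficient) is the multiplicative change in the odds of a purchase
# when the feature goes from 0 to 1, holding the other features fixed.
odds_ratio = math.exp(coef_is_mobile)
print(round(odds_ratio, 3))
```

In other words, with the other features held fixed, a mobile session has roughly 0.44 times the odds of ending in a purchase compared with a non-mobile session.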

It also seems that the best hours of the day (with statistical significance) for the last `Add to Cart` event to convert a session into a purchase are between 5 AM and 6 PM.

The `has_remove_from_cart` variable is also statistically significant enough to conclude that a session is less likely to convert into a purchase if the user removes items from their cart.

Oddly enough, a session is less likely to convert into a purchase if the user clicked on a promotion. This could be because some promotions are 'click-bait' and the user quickly loses interest. However, I'm also not sure if my interpretation of that column is correct.

The list goes on. There is enough statistical evidence to say that:
- new visitors are less likely to purchase
- the longer a user spends on the site in a session, the more likely the purchase
- the more user_action events a session has, the less likely the purchase
- the longer it takes to add to cart for the first time in a session, the less likely the purchase
- the more actions per page, the more likely the purchase
- etc.

Moreover, the LLR p-value, which comes from a likelihood-ratio test of our model against the null (intercept-only) model, is reported as 0.000, meaning there is sufficient evidence to reject the null hypothesis that all coefficients are zero. (The independent variables here do matter.)
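The likelihood-ratio statistic and the pseudo R-squared in the summary can be reproduced by hand from the two log-likelihoods it reports. This is a sketch using the rounded values printed in the summary table; the variable names are my own.

```python
# Values copied from the summary table above (already rounded there).
ll_model = -16226.0  # Log-Likelihood of the fitted model
ll_null = -22687.0   # LL-Null, the intercept-only model
df_model = 42        # Df Model

# Likelihood-ratio statistic: twice the gain in log-likelihood.
llr_statistic = 2.0 * (ll_model - ll_null)

# McFadden's pseudo R-squared: 1 - LL(model) / LL(null).
pseudo_r2 = 1.0 - ll_model / ll_null

print(llr_statistic, round(pseudo_r2, 4))
```

Under the null hypothesis the statistic follows a chi-square distribution with `df_model` degrees of freedom, whose mean is 42. A value in the thousands sits so far in the tail that the p-value is reported as 0.000.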

However, even with these results, the predictive power of this model is limited. We will see why by looking at the confusion matrix.


In [59]:

def calculate_metrics_of_fit(pred):
    # Both result.pred_table() and sklearn's confusion_matrix use
    # rows = observed class and columns = predicted class.
    TP = pred[1][1]  # observed purchase, predicted purchase
    FN = pred[1][0]  # observed purchase, predicted abandoned
    FP = pred[0][1]  # observed abandoned, predicted purchase
    TN = pred[0][0]  # observed abandoned, predicted abandoned
    print("Confusion Matrix:")
    print(np.array([[TP,FP],[FN,TN]]))
    P = TP + FN   # all observed purchases
    N = TN + FP   # all observed abandoned sessions
    PP = TP + FP  # all predicted purchases
    print("Total positives (purchases): " + str(P))
    print("Total negatives (abandoned): " + str(N))
    print("False positives: " + str(FP))
    print("False negatives: " + str(FN))
    print("True positives: " + str(TP))
    print("True negatives: " + str(TN))
    accuracy = (TP + TN) / (P + N)
    precision = TP/PP
    recall = TP / P
    F1 = 2*precision*recall/(precision + recall)
    print("accuracy: " + str(round(accuracy, 4)))
    print("precision: " + str(round(precision, 4)))
    print("recall: " + str(round(recall, 4)))
    print("F1: " + str(round(F1, 4)))
    # the null model is right on every observed negative, so its accuracy is N/(P+N)
    print("accuracy of null model, predicting all sessions to be abandoned: " + str(round(N/(P+N), 4)))
    
pred = result.pred_table()
calculate_metrics_of_fit(pred)

Confusion Matrix:
[[ 3138.  1818.]
 [ 5964. 34203.]]
Total positives (purchases): 9102.0
Total negatives (abandoned): 36021.0
False positives: 1818.0
False negatives: 5964.0
True positives: 3138.0
True negatives: 34203.0
accuracy: 0.8275
precision: 0.6332
recall: 0.3448
F1: 0.4464
accuracy of null model, predicting all sessions to be abandoned: 0.7983


Our model has better accuracy than predicting all sessions to be abandoned (0.8275 versus 0.7983), which is good: we at least beat the null model.

The caveat is that the recall of this model is pretty low, at about 34%: the model catches only about a third of the sessions that actually end in a purchase. Precision is better, at about 63%. This is still better than a model that only predicts abandoned sessions, since that one would have an F1 score of 0. F1 is the harmonic mean of precision and recall, ranging from 0 (worst) to 1 (best, when precision and recall are both 1), so a higher F1 score means a better balance between the two.

We use the F1 score here because we have an unbalanced dataset, with the negatives outnumbering the positives roughly 4 to 1. This is so we don't get fooled by a high overall prediction accuracy.

By the F1 score, our model is more useful than the null model.


# Testing
Having a model that predicts its own training data is not very interesting. We need to look at the confusion matrix of a test set, which is independent of the training set and randomly sampled. We set 10% of our dataset aside for this testing.

In [60]:
# build the test design matrices with the same formula as for training
y_test, X_test = dmatrices(formula_from_cols(test, 'is_purchase', ['last_add_to_cart_hour']) + ' + C(last_add_to_cart_hour)', data=test, return_type='dataframe')

# classify a session as a purchase when the predicted probability exceeds 0.5
result_test=np.array(logit_model.predict(result.params,X_test) > 0.5, dtype=float)

calculate_metrics_of_fit(confusion_matrix(y_test, result_test))


Confusion Matrix:
[[ 349  216]
 [ 661 3788]]
Total positives (purchases): 1010
Total negatives (abandoned): 4004
False positives: 216
False negatives: 661
True positives: 349
True negatives: 3788
accuracy: 0.8251
precision: 0.6177
recall: 0.3455
F1: 0.4432
accuracy of null model, predicting all sessions to be abandoned: 0.7986


No large drop in accuracy (0.8275 on the training set versus 0.8251 on the test set) means our model generalizes fairly well :)
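As a cross-check on the metric definitions used in this notebook, here is a minimal, dependency-free sketch. It assumes the layout shared by sklearn's `confusion_matrix` and statsmodels' `pred_table` (rows are observed classes, columns are predicted classes). The function name and the counts are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def metrics_from_confusion(cm):
    """Compute metrics from cm, where cm[i][j] counts samples
    observed as class i and predicted as class j."""
    tn, fp = cm[0]
    fn, tp = cm[1]
    total = tn + fp + fn + tp
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        # A model that always predicts the negative class is right
        # on every observed negative.
        "null_accuracy": (tn + fp) / total,
    }

# Hypothetical counts: 150 observed negatives, 50 observed positives.
toy = metrics_from_confusion([[140, 10], [20, 30]])
for name, value in toy.items():
    print(name, round(value, 4))
```

Comparing `accuracy` with `null_accuracy` shows how much of the headline accuracy is simply due to class imbalance.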

# Model discussion

Logistic regression is good for predicting binary classes such as in this problem, but when the data has many categories, it can become hard to implement efficiently in real life.

If I had more data, I would love to explore some ideas around the product categories and their prices.

I did take a look, but when I summed the prices by product (`add to cart` is + and `remove from cart` is -), it showed a negative total for some items, which led me to think that the cart might save items between sessions. That kind of stumped me and I decided it wouldn't be in the scope of my model.

Moreover, a more complex model would also take into account the user's past sessions and the time-series nature of hits within a session.

Some of the selected features are also not necessarily linear in the log odds. For instance, I have some doubts about the linearity of time spent on site (at some point, you get diminishing returns from staying on a site). Some of the numerical features could be transformed with a log or a square.
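A quick sketch of what a log transform does to a feature like `time_on_site`, using a few values taken from the rows displayed earlier. The choice of `log1p` (log of 1 + x, which is safe at 0) is my assumption, not something the notebook uses.

```python
import math

# A few time_on_site values (in seconds) from the rows displayed earlier.
times = [3, 76, 318, 654, 1157]

# log1p compresses large values: each extra second matters less and less,
# which encodes the "diminishing returns" idea.
log_times = [math.log1p(t) for t in times]

for t, lt in zip(times, log_times):
    print(t, round(lt, 3))
```

After the transform, the gap between 654 and 1157 seconds is much smaller than the gap between 3 and 76 seconds, even though the raw difference is larger.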

Moreover, aside from the interaction ratio terms I added in, a lot of interactions are probably missing from this model. Given how little data there is, it would be hard to fully express all of these interaction terms, especially with product info included as well. This can be remedied using factorization machines: "The Factorization Machines algorithm is a general-purpose supervised learning algorithm that you can use for both classification and regression tasks. It is an extension of a linear model that is designed to capture interactions between features within high dimensional sparse datasets economically." - https://docs.aws.amazon.com/sagemaker/latest/dg/fact-machines.html

Since one hot encoding on categorical variables with multiple possible values creates a sparse matrix, this would be one way to capture the interactions efficiently.
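To make the idea concrete, here is a sketch of the scoring function of a generic second-order factorization machine. It is not SageMaker's implementation, and the function names and toy weights are hypothetical. Each feature gets a small latent vector, and the weight of a pairwise interaction is the dot product of the two vectors, so all pairs are covered without one parameter per pair.

```python
def fm_score_naive(x, w0, w, v):
    """Direct definition: bias + linear terms + all pairwise interactions."""
    n = len(x)
    total = w0 + sum(w[i] * x[i] for i in range(n))
    for i in range(n):
        for j in range(i + 1, n):
            dot = sum(a * b for a, b in zip(v[i], v[j]))
            total += dot * x[i] * x[j]
    return total

def fm_score(x, w0, w, v):
    """Same score in O(n * k) time, using the identity
    sum_{i<j} <v_i, v_j> x_i x_j
        = 0.5 * sum_f ((sum_i v_if x_i)^2 - sum_i (v_if x_i)^2)."""
    n, k = len(x), len(v[0])
    total = w0 + sum(w[i] * x[i] for i in range(n))
    for f in range(k):
        s = sum(v[i][f] * x[i] for i in range(n))
        s_sq = sum((v[i][f] * x[i]) ** 2 for i in range(n))
        total += 0.5 * (s * s - s_sq)
    return total

# Hypothetical weights for 3 one-hot style features with k = 2 latent factors.
x = [1, 0, 1]
w0, w = 0.1, [0.2, -0.3, 0.5]
v = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.5]]
print(fm_score(x, w0, w, v), fm_score_naive(x, w0, w, v))
```

For classification, this score would play the role of the log odds and be passed through the logistic function, just like the linear predictor in logistic regression.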

Some other thoughts:

- Using cities instead of continents (would require more data and a smarter way of storing categorical variables)
- Use regularization (L1, also known as Lasso, or L2)
- Could have used average page load, and also time of the year, since I was doing unions
  on a table each month for a period of 2 years.
- Use a different threshold for a positive classification
- Use grid search with 10-fold cross-validation to find the optimal alpha and threshold for the above two points.
- I would also need to actually understand every single field in the dataset, as I am still confused about some of them. (what the heck is `adsenseBackfillDfpMeasurableImpressions`?) not everything was Googlable.
- Use PCA to remove collinearity, but that would also reduce model interpretability, since we would not be training on our original features.
- At the very least, remove some of the statistically insignificant counts because they are collinear with the `number_of_user_actions` variable.
- ROC curves could be used for more in-depth quantification of 'goodness of fit'
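The threshold idea from the list above can be sketched without any library: sweep the cut-off applied to the predicted probabilities and watch precision trade off against recall. The function name, probabilities, and labels below are hypothetical, for illustration only.

```python
def precision_recall_at(threshold, probs, labels):
    """Precision and recall when predicting positive for prob > threshold."""
    pairs = list(zip(probs, labels))
    tp = sum(1 for p, y in pairs if p > threshold and y == 1)
    fp = sum(1 for p, y in pairs if p > threshold and y == 0)
    fn = sum(1 for p, y in pairs if p <= threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical predicted probabilities and true labels.
probs = [0.05, 0.20, 0.35, 0.45, 0.55, 0.70, 0.90]
labels = [0, 0, 1, 0, 1, 0, 1]

for t in (0.3, 0.5, 0.7):
    print(t, precision_recall_at(t, probs, labels))
```

Lowering the threshold raises recall at the cost of precision, and raising it does the opposite. An ROC curve summarizes the same kind of sweep in terms of true and false positive rates.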