<a id='top'></a>
# GA Revenue Prediction Test Based on Last Year's Data
This fleshes out what GeoffPidcock's **[Joke Submission Workbook](https://www.kaggle.com/geoffpidcock/joke-submission-workbook)** shows, i.e. that predicting revenue for so few repeat visitors will be a real *&lt;word removed by Kaggle's censorship squad>* of a problem.

**This kernel uses last year's data to show that the RMSE would have been, for an all zero prediction, 0.32.**
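
For reference, the score that `rmse_log1p` computes later in this kernel can be written out as follows. The symbols are introduced here only for illustration: $n$ is the number of unique visitors scored, $r_i$ is the summed revenue of visitor $i$ over the prediction period, and $\hat{r}_i$ is the predicted sum.

$$
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\bigl(\ln(1+r_i) - \ln(1+\hat{r}_i)\bigr)^{2}}
$$

With an all-zero prediction, $\hat{r}_i = 0$ and $\ln(1+0) = 0$, so this reduces to

$$
\mathrm{RMSE}_{\text{all zero}} = \sqrt{\frac{1}{n}\sum_{i\,:\,r_i>0}\bigl(\ln(1+r_i)\bigr)^{2}}
$$

Only visitors who actually spend contribute to the sum, so a very small share of paying repeat visitors keeps the all-zero score low.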

## As a guide to repeat visitors, here is a breakdown of the v2 train/test data:
v2 data|<p align="left">v2 data description
---|:---
train_v2   |<p align="left"> 01/08/2016 -> 30/04/2018 : 21 months with 1323730 unique visitors
test_v2    |<p align="left"> 01/05/2018 -> 05/10/2018 : 5 months with 296530 unique visitors
prediction |<p align="left"> 01/12/2018 -> 31/01/2019 : 2 months; **? repeat visitors**
* Only 2759 visitors appear in both train_v2 and test_v2 sets, and only 195 of them have rev>0.
* We should expect the prediction period to have a similar quantity of repeat visitors.
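
Counts like the ones above could be obtained along these lines. This is a minimal sketch on hypothetical toy frames, not the code that produced the figures: the frame names, the `revenue` column, and the choice to measure "rev>0" in the later period are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical toy data: one row per visit.
earlier = pd.DataFrame({
    "fullVisitorId": ["a", "b", "c", "c"],
    "revenue": [0.0, 10.0, 0.0, 0.0],
})
later = pd.DataFrame({
    "fullVisitorId": ["b", "c", "d", "b"],
    "revenue": [0.0, 5.0, 7.0, 0.0],
})

# Visitors seen in both periods.
repeat_ids = set(earlier["fullVisitorId"]) & set(later["fullVisitorId"])

# Total revenue per repeat visitor in the later period only
# (assumption: "rev>0" refers to spending in the later period).
later_rev = (
    later[later["fullVisitorId"].isin(repeat_ids)]
    .groupby("fullVisitorId")["revenue"]
    .sum()
)

n_repeat = len(repeat_ids)
n_repeat_paying = int((later_rev > 0).sum())
print(n_repeat, n_repeat_paying)
```

Summing per visitor before testing for positive revenue matters because one visitor can have several visits, most of them with no purchase.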

## And here is a breakdown of last year's data:
2017 data|<p align="left">2017 data description
---|:---
train       |<p align="left"> we ignore training data since we're predicting all zeros
test        |<p align="left"> 01/05/2017 -> 05/10/2017 : 5 months with 305597 unique visitors (similar to this year)
prediction  |<p align="left"> 01/12/2017 -> 31/01/2018 : 2 months with 144049 unique visitors; **2174 repeat visitors**
* 2174 repeat visitors is 1.51% of 144049
* 176 had rev>0 which is 0.12% of 144049
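
The two shares can be rechecked directly from the counts quoted in the table. The variable names below are illustrative only.

```python
# Counts quoted in the 2017 breakdown.
prediction_visitors = 144049
repeat_visitors = 2174
repeat_visitors_with_revenue = 176

# Shares of all prediction-period visitors, in percent.
repeat_share = 100 * repeat_visitors / prediction_visitors
paying_share = 100 * repeat_visitors_with_revenue / prediction_visitors

print(f"repeat visitors: {repeat_share:.2f}%")
print(f"repeat visitors with revenue: {paying_share:.2f}%")
```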

**Assuming a similar year, we'll have to predict which 0.12% of visitors will spend money.**  I think I'm going to submit all zeros.
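
To see why so few spenders keep the all-zero score low, here is a small sketch with made-up numbers. The helper `all_zero_rmse` and the toy revenue array are hypothetical and only illustrate the effect; they are not the data scored in this kernel.

```python
import numpy as np

def all_zero_rmse(actual_revenue):
    """RMSE between log1p(actual) and log1p(0), i.e. an all-zero prediction."""
    actual_log = np.log1p(np.asarray(actual_revenue, dtype=float))
    return float(np.sqrt(np.mean(actual_log ** 2)))

# Hypothetical example: 1000 visitors, exactly one of whom spends.
# np.expm1(10.0) is chosen so that log1p(revenue) is 10 (up to rounding).
revenue = np.zeros(1000)
revenue[0] = np.expm1(10.0)

score = all_zero_rmse(revenue)
print(score)
```

The single spender contributes a squared log error of about 100, but averaging over all 1000 visitors before taking the root shrinks that to a small score.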


In [None]:
import ast
import warnings

import numpy as np
import pandas as pd
from sklearn import metrics

warnings.filterwarnings('ignore')

train_v2_file = '../input/train_v2.csv'
!ls -ld $train_v2_file

# Load only the columns needed; read fullVisitorId as a string so the IDs are kept verbatim.
relevant_cols = ['date', 'fullVisitorId', 'totals']
train = pd.read_csv(train_v2_file, usecols=relevant_cols, dtype={'fullVisitorId': 'str'})

# 'totals' holds a dict-like string; extract transactionRevenue (NaN when absent).
train['totals.transactionRevenue'] = train['totals'].apply(lambda x: ast.literal_eval(x).get('transactionRevenue', np.nan))
train['totals.transactionRevenue'] = pd.to_numeric(train['totals.transactionRevenue'], errors="coerce")
train.drop(columns=['totals'], inplace=True)
print(train.shape)
print(train.columns)

In [None]:
def rmse_log1p(df, col1, col2):
    """Return the RMSE between log1p(df[col1]) and log1p(df[col2])."""
    return np.sqrt(metrics.mean_squared_error(np.log1p(df[col1].values), np.log1p(df[col2].values)))

print ("")
print ("Get 5 month period for test submission and 2 month period for competition:")
keepcols = ['fullVisitorId','totals.transactionRevenue']
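# Select each date window with one inclusive mask and copy it, so later column assignments are safe.
in_submission  = train['date'].between(20170501, 20171005)
in_competition = train['date'].between(20171201, 20180131)
submission2017  = train.loc[in_submission, keepcols].copy()
competition2017 = train.loc[in_competition, keepcols].copy()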
print ("submission2017", submission2017.shape)
print ("competition2017", competition2017.shape)

print ("")
print ("Get visitors who are in both submission and competition periods:")
submission_visitors = list(submission2017['fullVisitorId'].dropna().astype(str).unique())
competition_visitors = list(competition2017['fullVisitorId'].dropna().astype(str).unique())
visitors_in_both = set(submission_visitors) & set(competition_visitors)
print ("unique 2017 sub visitors:",len(submission_visitors))
print ("unique 2017 comp visitors:",len(competition_visitors))
print ("unique 2017 in both:",len(visitors_in_both))

print ("")
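# Revenue earned during the submission window is not part of the target, so it is zeroed;
# actual revenue from the competition window is merged in further down.
submission2017['totals.transactionRevenue'] = 0.0
submission2017['totals.predictedRevenue'] = 0.0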
submission2017 = submission2017.groupby('fullVisitorId').sum().reset_index()
print ("submission with unique visitors and predictions of zero:", submission2017.shape)

competition2017['totals.transactionRevenue'] = competition2017['totals.transactionRevenue'].fillna(0)
competition2017 = competition2017.groupby('fullVisitorId').sum().reset_index()
competition2017 = competition2017[competition2017['fullVisitorId'].isin(visitors_in_both)]
competition2017['totals.predictedRevenue'] = 0.0
print ("competition visitors who appeared in submission:", competition2017.shape)

submission2017 = pd.concat([submission2017, competition2017], axis=0)
submission2017 = submission2017.groupby('fullVisitorId').sum().reset_index()
print ("submission with competition data for calculating RMSE:", submission2017.shape)

print ("")
print("RMSE score with all predictions zero:", rmse_log1p(submission2017, 'totals.transactionRevenue', 'totals.predictedRevenue'))