In this notebook, I will explore the GA Customer Revenue dataset, do some quick data preprocessing, 
and train a simple model that will serve as a benchmark.

Let's get started!

In [None]:
import pandas as pd
import missingno as msno

In [None]:
# Some constants.
# The visitor id is read as a string: otherwise, pandas will try to interpret
# this column as an integer (which is wrong according to the competition's guidelines).
VISITOR_ID_COL = "fullVisitorId"
DTYPES = {VISITOR_ID_COL: 'str'}
TARGET_COL = "transactionRevenue"
TRAIN_DATA_PATH = "../input/train.csv"

In [None]:
train_df = pd.read_csv(TRAIN_DATA_PATH, dtype=DTYPES)

In [None]:
train_df.sample(2).T

In [None]:
msno.matrix(train_df)

No missing data, awesome! Or maybe one shouldn't be that enthusiastic, since there are a lot of 
nested columns (more about this later) ;)

Alright, after loading the data and having a look at some samples, the first
thing to do is extract the target for this problem and some (basic, for now) 
features. Let's do that!

# Unnesting the target

So, where is the target? As mentioned in the competition's directions, it is inside
the `totals` column. Let's have a look, shall we?

In [None]:
RAW_TARGET_COL = "totals"
raw_target_s = train_df[RAW_TARGET_COL]

In [None]:
import ast
for index, raw_target_row in raw_target_s.sample(30).items():  # iteritems() was removed in pandas 2.0
    print(ast.literal_eval(raw_target_row))  # parses literals only, safer than eval

As you can see, this is a nested column (it is a dict). Moreover, the 
target of interest `transactionRevenue` isn't always available. 
Let's unnest this column and explore the missing values.

In [None]:
import ast

records = []
for index, raw_target_row in raw_target_s.items():
    # literal_eval parses the dict without executing arbitrary code.
    parsed_target_row = ast.literal_eval(raw_target_row)
    records.append(parsed_target_row)
parsed_target_df = pd.DataFrame(records)
# Don't forget the visitor id!
parsed_target_df[VISITOR_ID_COL] = train_df[VISITOR_ID_COL]
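
As a side note, here is a minimal sketch of another way to unnest such a column, assuming that each cell holds valid JSON (I haven't checked this for every row). The two toy rows and the `toy_` / `raw_totals_s` names are made up for illustration; they are not taken from the dataset.

```python
import json

import pandas as pd

# Toy stand-in for the raw `totals` column: one JSON string per visit.
raw_totals_s = pd.Series([
    '{"visits": "1", "hits": "4", "transactionRevenue": "37860000"}',
    '{"visits": "1", "hits": "1"}',
])

# json.loads only parses data; unlike eval, it never executes code.
toy_parsed_df = pd.json_normalize(raw_totals_s.map(json.loads).tolist())

# Keys that are absent from a row become NaN in the resulting frame.
print(toy_parsed_df)
```

The second toy row has no `transactionRevenue` key, so it ends up as a missing value, which is the same pattern explored below.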

In [None]:
parsed_target_df.sample(3).T

In [None]:
msno.matrix(parsed_target_df)

Wow, it seems that the target to predict is missing most of the time. How often exactly?

In [None]:
def percentage_of_missing(df, col):
    return 100 * df[col].isnull().sum() / df.shape[0]

missing_target_percent = percentage_of_missing(parsed_target_df, 
                                              TARGET_COL)

In [None]:
"The target column contains {}% missing data!".format(missing_target_percent.round(2))

In what follows, I will assume that a missing value for `transactionRevenue`
means that the transaction value is 0 (even though it could be a "real" missing 
value). Let's fill the missing values accordingly and then check the
distribution of `transactionRevenue`.

In [None]:
target_df = (parsed_target_df.loc[:, [TARGET_COL, VISITOR_ID_COL]]
                            .assign(**{TARGET_COL: lambda df: df[TARGET_COL].fillna(0.0)
                                                                            .astype(int)}))
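
The cast above relies on `astype(int)` converting the revenue strings. A slightly more explicit variant, sketched here on made-up values (the numbers and the `toy_` names are mine, not from the dataset), goes through `pd.to_numeric` first:

```python
import pandas as pd

# Toy revenue column as it comes out of the parsed dicts: strings or missing.
toy_revenue_s = pd.Series(["37860000", None, "1500000"])

# to_numeric turns the strings into numbers and keeps missing entries as NaN;
# fillna(0) then encodes the "missing means no purchase" assumption.
toy_filled_s = pd.to_numeric(toy_revenue_s).fillna(0).astype("int64")
print(toy_filled_s.tolist())
```

Using `"int64"` rather than `int` is a deliberate choice in this sketch: it avoids depending on the platform's default integer size for large revenue values.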

In [None]:
target_df.sample(5)

In [None]:
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt


fig, ax = plt.subplots(1, 1, figsize=(12, 8))
# Since most of the transactions are 0$, I will remove these when plotting
# the distribution.
# histplot replaces distplot, which is deprecated since seaborn 0.11.
sns.histplot(np.log(target_df.loc[lambda df: df[TARGET_COL] > 0,
                                  TARGET_COL]),
             kde=True, stat="density", ax=ax)
ax.set_xlabel("Log of transaction revenue ($)")

In [None]:
# The same thing as above but this time aggregated using the 
# visitor unique id.
fig, ax = plt.subplots(1, 1, figsize=(12, 8))
# Since most of the transactions are 0$, I will remove these when plotting
# the distribution.

def _log_sum_agg(g):
    """ Take the natural logarithm of the aggregated sum
    (+1 to avoid -inf for a 0 sum).
    """
    return np.log(g.sum() + 1)

grouped_target_a = (target_df.groupby(VISITOR_ID_COL)
                             .agg({TARGET_COL: _log_sum_agg})
                             .values)
sns.histplot(grouped_target_a[grouped_target_a > np.log(1)],
             kde=True, stat="density", ax=ax)
ax.set_xlabel("Log of sum of transaction revenue ($)")

# Basic feature extraction

In order to build the benchmark model, one needs some features. Let's use the following ones: 

* `date`: this is the date of the transaction. I assume that it is in UTC.
* `geoNetwork`: this is a nested column that contains information about the location of the transaction.

In what follows, I will extract these features and engineer some basic ones (day of week, month, year, and so on).

In [None]:
DATE_COL = "date"
TMS_GMT_COL = "tms_gmt"
# Here, I parse the DATE_COL to extract year, month, and day information 
# (using their positions). Then, I build the TMS_GMT column (using pandas' 
# to_datetime function) and extract additional calendar features: 
# day of week, week of year, and day of year. 
# Notice that I drop the DATE_COL and TMS_GMT columns since these
# aren't numerical columns.
date_df = (train_df[[DATE_COL]].assign(year=lambda df: df[DATE_COL].astype(str)
                                                                   .str[0:4]
                                                                   .astype(int),
                                       month=lambda df: df[DATE_COL].astype(str)
                                                                    .str[4:6]
                                                                    .astype(int),
                                       day=lambda df: df[DATE_COL].astype(str)
                                                                  .str[6:8]
                                                                  .astype(int))
                               .drop(DATE_COL, axis=1)
                               .assign(tms_gmt=lambda df: pd.to_datetime(df))
                               .assign(dow=lambda df: df[TMS_GMT_COL].dt.dayofweek,
                                       woy=lambda df: df[TMS_GMT_COL].dt.isocalendar().week.astype(int),
                                       doy=lambda df: df[TMS_GMT_COL].dt.dayofyear)
                               .drop(TMS_GMT_COL, axis=1))
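
For comparison, here is a small sketch of the same calendar features built by parsing the date in one step with an explicit format. It assumes the `date` values look like `20170312` (year, month, day) and a pandas version that provides `dt.isocalendar()` (1.1 or later); the two toy dates and the `toy_` names are made up.

```python
import pandas as pd

# Toy stand-in for the `date` column, stored as integers like 20170312.
toy_date_s = pd.Series([20170312, 20161231])

# Parse the YYYYMMDD values in one step instead of slicing the string.
toy_tms_s = pd.to_datetime(toy_date_s.astype(str), format="%Y%m%d")

toy_date_df = pd.DataFrame({
    "year": toy_tms_s.dt.year,
    "month": toy_tms_s.dt.month,
    "day": toy_tms_s.dt.day,
    "dow": toy_tms_s.dt.dayofweek,                       # Monday is 0
    "woy": toy_tms_s.dt.isocalendar().week.astype(int),  # ISO week number
    "doy": toy_tms_s.dt.dayofyear,
})
print(toy_date_df)
```

Note that the ISO week of a date near New Year can belong to the neighbouring year, which is why 2016-12-31 falls in week 52 here.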

In [None]:
date_df.sample(5)

In [None]:
import ast

records = []
GEO_COL = "geoNetwork"
for index, row in train_df[GEO_COL].items():
    parsed_row = ast.literal_eval(row)
    records.append(parsed_row)

geo_df = pd.DataFrame(records)

In [None]:
geo_df.sample(2).T

To make things simpler, I will only keep the `country` and `continent` features 
from the `geoNetwork` parsed column. I will also dummify these features. Finally, I will combine
the various engineered features. Let's do that!

In [None]:
GEO_COLS_TO_KEEP = ["country", "continent"]
engineered_train_df = (geo_df.loc[:, GEO_COLS_TO_KEEP]
                             .pipe(pd.get_dummies)
                             .pipe(pd.merge, date_df, 
                                   left_index=True,
                                   right_index=True)
                             .pipe(pd.merge, 
                                   train_df[[VISITOR_ID_COL]],
                                   left_index=True,
                                   right_index=True))
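
To make the "dummify" step concrete, here is a tiny sketch of what `pd.get_dummies` does on a made-up frame (the three rows and the `toy_` names are for illustration only):

```python
import pandas as pd

toy_geo_df = pd.DataFrame({
    "country": ["France", "Japan", "France"],
    "continent": ["Europe", "Asia", "Europe"],
})

# One indicator column per distinct value, named "<column>_<value>".
toy_dummies_df = pd.get_dummies(toy_geo_df)
print(sorted(toy_dummies_df.columns))
```

Since the resulting columns depend on the values present in the data, another dataset (a test set, for example) would need to be aligned to the same columns before being scored. I leave that step out of this notebook.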

In [None]:
engineered_train_df.sample(2).T

Awesome! Time to do some (basic) modeling.

# LASSO as a benchmark

Now that I have prepared some features, I will train a LASSO model (i.e. a linear regression model that
does feature selection automatically) and compute its CV score. 
Notice that I can't use `cross_val_score` from sklearn: it scores each row separately, whereas here I need 
to aggregate the out-of-fold predictions per visitor before computing the score. 
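
The scoring idea can be sketched as a small helper. This is only an illustration under my own assumptions: the function name `visitor_level_rmse` is hypothetical, and clipping negative predictions to 0 is a choice I make here so that the logarithm is always defined.

```python
import numpy as np
import pandas as pd


def visitor_level_rmse(visitor_ids, y_true, y_pred):
    """RMSE between log(1 + summed revenue) per visitor, true vs. predicted."""
    scored_df = pd.DataFrame({
        "visitor": np.asarray(visitor_ids),
        "true": np.asarray(y_true, dtype=float),
        # Negative predictions are clipped to 0 so that the log is defined.
        "pred": np.clip(np.asarray(y_pred, dtype=float), 0, None),
    })
    # Sum per visitor first, then take log(1 + x) of the sums.
    grouped_df = np.log1p(scored_df.groupby("visitor")[["true", "pred"]].sum())
    return float(np.sqrt(((grouped_df["pred"] - grouped_df["true"]) ** 2).mean()))


# Two visits for visitor "a", one for visitor "b".
toy_score = visitor_level_rmse(["a", "a", "b"], [10.0, 0.0, 0.0], [4.0, 6.0, 0.0])
print(toy_score)
```

In the toy call, the per-visit predictions for visitor "a" are both wrong, yet their sum matches the true sum, so the visitor-level error is zero. A row-level score would not see it that way, which is the reason for aggregating first.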

In [None]:
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold

In [None]:
# For reproducibility
SEED = 314
CV = 5
# Resources are limited! 
N_SAMPLES = 10000
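# random_state only has an effect (and is only accepted by recent sklearn
# versions) when shuffle=True.
kf = KFold(CV, shuffle=True, random_state=SEED)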


benchmark = Lasso(random_state=SEED)

df = engineered_train_df.sample(N_SAMPLES, random_state=SEED).drop(VISITOR_ID_COL, axis=1)

# TODO: Do some cleaning and refactoring of the CV computation. 
# Also check the grouping step...
# LASSO warnings are annoying. :)
import warnings
warnings.simplefilter("ignore")


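cv_rmse = []
# kf.split returns positions within the sampled df, so first align the
# target rows with the sampled feature rows (they share the same index labels).
sampled_target_df = target_df.loc[df.index]
for train_index, test_index in kf.split(df):
    train_features_df = df.iloc[train_index, :]
    test_features_df = df.iloc[test_index, :]
    train_target_s = sampled_target_df.iloc[train_index][TARGET_COL]
    test_target_df = sampled_target_df.iloc[test_index, :].reset_index(drop=True)
    benchmark.fit(train_features_df, train_target_s)
    # Clip negative predictions at 0 so that the log in _log_sum_agg is defined.
    test_target_df.loc[:, "predictions"] = np.clip(benchmark.predict(test_features_df), 0, None)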
    grouped_df  = (test_target_df.groupby(VISITOR_ID_COL)
                                 .agg({"predictions": _log_sum_agg, 
                                       TARGET_COL: _log_sum_agg})
                                 .reset_index())
    rmse = ((grouped_df["predictions"] - grouped_df[TARGET_COL]) ** 2).mean() ** 0.5
    cv_rmse.append(rmse)

cv_rmse = np.array(cv_rmse)

In [None]:
cv_rmse

In [None]:
"The mean CV RMSE for the benchmark is: {}".format(cv_rmse.mean())

That's it for now. I hope you have enjoyed this introductory notebook. 
Stay tuned for more!