@author: André Daniël VOLSCHENK  

Kaggle project {Google Analytics Customer Revenue Prediction}  
kaggle.com/andredanielvolschenk  


# Problem Statement

DESCRIPTION  
The 80/20 rule has proven true for many businesses–only a small percentage of customers produce most of the revenue. As such, marketing teams are challenged to make appropriate investments in promotional strategies.  
RStudio has partnered with Google Cloud and Kaggle to demonstrate the business impact that thorough data analysis can have.  
In this competition, you’re challenged to analyze a Google Merchandise Store (also known as GStore, where Google swag is sold) customer dataset to predict revenue per customer. Hopefully, the outcome will be more actionable operational changes and a better use of marketing budgets for those companies who choose to use data analysis on top of GA data.  

EVALUATION  
Submissions are scored on the root mean squared error.  
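
To make the metric concrete, here is a minimal sketch of root mean squared error in NumPy. The function name `rmse` and the toy values are illustrative assumptions, not part of the competition kit.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Toy log-revenue values (made up for illustration)
actual = [0.0, 0.0, 2.5]
predicted = [0.0, 0.5, 2.0]
print(rmse(actual, predicted))
```

Because most visitors generate no revenue, a submission of all zeros already scores reasonably on this metric, which is why the sample rows above all predict 0.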

Submission File  
For each fullVisitorId in the test set, you must predict the natural log of their total revenue in PredictedLogRevenue. The submission file should contain a header and have the following format:  

| fullVisitorId       | PredictedLogRevenue |
|---------------------|---------------------|
| 0000000259678714014 | 0                   |
| 0000049363351866189 | 0                   |
| 0000053049821714864 | 0                   |

etc.  

DATA  

What files do I need?  
You will need to download train.csv and test.csv. These contain the data necessary to make predictions for each fullVisitorId listed in sample_submission.csv.  

All information below pertains to the data in both CSV and BigQuery format.  

What should I expect the data format to be?  
Both train.csv and test.csv contain the columns listed under Data Fields. Each row in the dataset is one visit to the store. Because we are predicting the log of the total revenue per user, be aware that not all rows in test.csv will correspond to a row in the submission, but all unique fullVisitorIds will correspond to a row in the submission.  

IMPORTANT: Due to the formatting of fullVisitorId, you must load the IDs as strings in order for all IDs to be properly unique!  
There are multiple columns which contain JSON blobs of varying depth. In one of those JSON columns, totals, the sub-column transactionRevenue contains the revenue information we are trying to predict. This sub-column exists only for the training data.  
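
The warning about loading IDs as strings can be illustrated with a small sketch. The two-row CSV below is made up for illustration: the IDs differ only by leading zeros, which is one way numeric parsing can collapse distinct identifiers.

```python
import io
import pandas as pd

# Made-up CSV: two different ID strings that are the same number
csv_text = "fullVisitorId,visits\n0000000259678714014,1\n259678714014,2\n"

# Default parsing treats the column as integers and drops leading zeros
as_int = pd.read_csv(io.StringIO(csv_text))

# Forcing a string dtype keeps each ID exactly as written
as_str = pd.read_csv(io.StringIO(csv_text), dtype={"fullVisitorId": "str"})

print("unique IDs parsed as numbers:", as_int["fullVisitorId"].nunique())
print("unique IDs parsed as strings:", as_str["fullVisitorId"].nunique())
```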

What am I predicting?  
We are predicting the natural log of the sum of all transactions per user.   
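
As a rough sketch of how such a target could be built from visit-level rows: sum the revenue per user, then take the log. The column name `transactionRevenue` on a flat frame and the toy values are assumptions for illustration. The sketch also assumes `log(1 + x)` rather than a plain log, since the log of zero is undefined and most users have zero revenue.

```python
import numpy as np
import pandas as pd

# Toy visit-level data; NaN means the visit had no transaction
visits = pd.DataFrame({
    "fullVisitorId": ["001", "001", "002"],
    "transactionRevenue": [10.0, 5.0, np.nan],
})

# Sum revenue per user (NaN is skipped, so a user with no purchases sums to 0)
revenue_per_user = visits.groupby("fullVisitorId")["transactionRevenue"].sum()

# Assumption: log1p, i.e. ln(1 + total), so zero revenue maps to 0
log_revenue = np.log1p(revenue_per_user)
print(log_revenue)
```

Note that three visit rows become two target rows, one per unique fullVisitorId.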

File Descriptions  

train.csv - the training set - contains the same data as the BigQuery rstudio_train_set.  
test.csv - the test set - contains the same data as the BigQuery rstudio_test_set.  
sampleSubmission.csv - a sample submission file in the correct format. Contains all fullVisitorIds in test.csv.  

Data Fields  

| Data name | Description |
|-----------|-------------|
| fullVisitorId | A unique identifier for each user of the Google Merchandise Store. |
| channelGrouping | The channel via which the user came to the Store. |
| date | The date on which the user visited the Store. |
| device | The specifications for the device used to access the Store. |
| geoNetwork | This section contains information about the geography of the user. |
| sessionId | A unique identifier for this visit to the store. |
| socialEngagementType | Engagement type, either "Socially Engaged" or "Not Socially Engaged". |
| totals | This section contains aggregate values across the session. |
| trafficSource | This section contains information about the Traffic Source from which the session originated. |
| visitId | An identifier for this session. This is part of the value usually stored as the _utmb cookie. This is only unique to the user. For a completely unique ID, you should use a combination of fullVisitorId and visitId. |
| visitNumber | The session number for this user. If this is the first session, then this is set to 1. |
| visitStartTime | The timestamp (expressed as POSIX time). |

A more complete description of each column is given at: https://support.google.com/analytics/answer/3437719?hl=en

Removed Data Fields  
Some fields were censored to remove target leakage. The major censored fields are listed below.  
hits - This row and nested fields are populated for any and all types of hits. Provides a record of all page visits.  
customDimensions - This section contains any user-level or session-level custom dimensions that are set for a session. This is a repeated field and has an entry for each dimension that is set.  
totals - Multiple sub-columns were removed from the totals field.  

# Import libraries
Let's import libraries and see which data files we have in our environment.

In [None]:
import numpy as np 
import pandas as pd 
import json
import pandas.io.json as pdjson

import ast

import os
print(os.listdir("../input"))

Note: train_v2.csv has 1'708'337 observations and test_v2.csv has 401'589 observations.  
That is a LOT of data!!!  

# Load

Recall that, due to the formatting of fullVisitorId, you must load the IDs as strings in order for all IDs to be properly unique!  

Since `train_v2` and `test_v2` are so big, we will only load 100 rows of each for now, just to take a quick look.  

Let's see what `data1` looks like:

In [None]:
#path = 'C:\\Users\\Andre\\code\\Kaggle\\2. Google Analytics Customer Revenue Prediction\\train.csv'
path = '../input/train_v2.csv'
data1 = pd.read_csv(path, sep=',', dtype={'fullVisitorId': 'str'}, nrows=100)
del(path)

# load the competition test data
#path = 'C:\\Users\\Andre\\code\\Kaggle\\2. Google Analytics Customer Revenue Prediction\\test.csv'
path = '../input/test_v2.csv'
data2 = pd.read_csv(path, sep=',', dtype={'fullVisitorId': 'str'}, nrows=100)
del(path)

print('data1 shape:', data1.shape)
data1.head()

Let's see what `data2` looks like:

In [None]:
print('data2 shape:', data2.shape)
data2.head()

From both `data1` and `data2` we can see that there are multiple columns which contain JSON blobs of varying depth.  

JSON stands for 'JavaScript Object Notation'.  
JSON is semi-structured data, in that each row in the table need not have the same number of columns once its sub-columns are expanded.

We will soon write a function to flatten all JSON columns that have embedded sub-columns.
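
Before doing this on the real files, a toy sketch of what flattening means may help. The two JSON strings below are invented, and `pd.json_normalize` is used here as the flattening tool (available in pandas 1.0 and later). Nested keys become dotted column names, and keys missing from a row become NaN.

```python
import json
import pandas as pd

# Invented blobs: the second row lacks the nested 'adwordsClickInfo' key
raw = pd.Series([
    '{"source": "google", "adwordsClickInfo": {"page": "1"}}',
    '{"source": "(direct)"}',
])

# Parse each string into a dict, then expand the dicts into columns
parsed = raw.apply(json.loads)
flat = pd.json_normalize(parsed.tolist())

# Prefix with the parent column name so the origin stays visible
flat.columns = [f"trafficSource_{name}" for name in flat.columns]
print(flat)
```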

Let's first make a list of the JSON columns:
* `customDimensions`
* `device`
* `geoNetwork`
* `hits`
* `totals`
* `trafficSource`


We have one problem to address first: the `hits` and `customDimensions` columns encode their JSON blobs differently. Each blob is wrapped in `[` and `]` characters (that is, it is a list), and furthermore it uses single quotes `'` instead of double quotes `"`. These two columns will need special handling.  
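
A small sketch of why these two columns need different treatment: the single-quoted blob is not valid JSON, but it is a valid Python literal, so `ast.literal_eval` can read it. The sample strings are made up to mimic the shape described above, and the helper name `first_entry` is hypothetical.

```python
import ast
import json

blob = "[{'index': '4', 'value': 'North America'}]"

# json.loads rejects single-quoted strings
try:
    json.loads(blob)
    json_ok = True
except json.JSONDecodeError:
    json_ok = False

def first_entry(text):
    """Parse a list-wrapped blob and return its first dict, or None if the list is empty."""
    items = ast.literal_eval(text)
    return items[0] if items else None

print("valid JSON:", json_ok)
print(first_entry(blob))
print(first_entry("[]"))
```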

# Explore JSON fields
First we merge `data1` and `data2`...

In [None]:
data = pd.concat([data1, data2], ignore_index=True)  # DataFrame.append was removed in pandas 2.0
del(data1, data2)
print(' data shape:', data.shape)

Let us look at the JSON subcolumns in `data`...  
Let's first look at the first observation in `customDimensions`:

In [None]:
data.customDimensions[0]

This JSON blob has 2 subcolumns: index and value  
These can be kept in for now.  

Now let's look at `device`:

In [None]:
data.device[0]

Some columns here really won't add useful information to our prediction, namely: 'browserVersion', 'browserSize', 'operatingSystemVersion', 'mobileInputSelector', 'flashVersion', 'screenColors', 'screenResolution'.  

Now let's consider `geoNetwork`:

In [None]:
data.geoNetwork[0]

Some useless columns:  
'cityId', 'latitude', 'longitude', 'networkLocation'.  

Next we consider `hits`:

In [None]:
data.hits[0]

That is a lot of columns!  
In fact, these are subcolumns within subcolumns! These JSON blobs are deeply nested!  
For now we will keep them all.  

Next we look at `totals`:

In [None]:
data.totals[0]

We will keep these for now, too.  

Finally we can look at `trafficSource`:

In [None]:
data.trafficSource[0]

These are all important to keep, so we keep them.  

# Parsing JSON
Now we finally declare the function to parse the JSON blobs, keeping in mind the columns we can already delete, and keeping in mind that `customDimensions` and `hits` are encoded differently:

In [None]:

def parse(csv_path, nrows=None):

    JSON_COLUMNS = ['device', 'geoNetwork', 'totals', 'trafficSource']

    df = pd.read_csv(csv_path, 
                     converters={column: json.loads for column in JSON_COLUMNS}, 
                     dtype={'fullVisitorId': 'str'}, # Important!!
                     nrows=nrows)
    
    device_list=df['device'].tolist()
    
    #deleting unwanted columns before normalizing
    for device in device_list:
        del device['browserVersion'],device['browserSize'],device['flashVersion'],device['mobileInputSelector'],device['operatingSystemVersion'],device['screenResolution'],device['screenColors']
    df['device']=pd.Series(device_list)
    
    geoNetwork_list=df['geoNetwork'].tolist()
    for network in geoNetwork_list:
        del network['latitude'],network['longitude'],network['networkLocation'],network['cityId']
    df['geoNetwork']=pd.Series(geoNetwork_list)
    
    df['hits']=df['hits'].apply(ast.literal_eval)
    df['hits']=df['hits'].str[0]
    df['hits']=df['hits'].apply(lambda x: {'index':np.nan,'value':np.nan} if pd.isnull(x) else x)
    
    df['customDimensions']=df['customDimensions'].apply(ast.literal_eval)
    df['customDimensions']=df['customDimensions'].str[0]
    df['customDimensions']=df['customDimensions'].apply(lambda x: {'index':np.nan,'value':np.nan} if pd.isnull(x) else x)
    
    JSON_COLUMNS = ['device', 'geoNetwork', 'totals', 'trafficSource','hits','customDimensions']

    for column in JSON_COLUMNS:
        column_as_df = pd.json_normalize(df[column].tolist())  # top-level function since pandas 1.0
        column_as_df.columns = [f"{column}_{subcolumn}" for subcolumn in column_as_df.columns]
        df = df.drop(column, axis=1).merge(column_as_df, right_index=True, left_index=True)
    
    return df
print("The 'parse' function to flatten JSON columns has been created")
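
One design note on the key deletion inside `parse`: `del record[key]` raises a `KeyError` if any row lacks that key. A more forgiving variant, sketched below under the assumption that missing keys should simply be ignored, uses `dict.pop` with a default. The helper name `drop_keys` and the sample record are hypothetical.

```python
# Same device keys that parse() removes
UNWANTED_DEVICE_KEYS = [
    "browserVersion", "browserSize", "flashVersion", "mobileInputSelector",
    "operatingSystemVersion", "screenResolution", "screenColors",
]

def drop_keys(record, keys):
    """Remove the given keys from a dict in place, ignoring any that are absent."""
    for key in keys:
        record.pop(key, None)
    return record

# Hypothetical record that only has one of the unwanted keys
example = {"browser": "Chrome", "browserVersion": "not available in demo dataset"}
cleaned = drop_keys(example, UNWANTED_DEVICE_KEYS)
print(cleaned)
```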

Now let's read in `data1` and `data2` with 100'000 rows each, for a total of 200'000 observations.  
We then merge these to form `data`:

In [None]:
data1 = parse('../input/train_v2.csv', nrows=100000)
data2 = parse("../input/test_v2.csv",nrows=100000)

print('data1 shape: ', data1.shape)
print('data2 shape: ', data2.shape)

data = pd.concat([data1, data2], sort=True)  # DataFrame.append was removed in pandas 2.0
del(data1, data2)

print('merged data shape (rows, columns):', data.shape)

First let's see which columns still hold JSON blobs in the form of lists:  

In [None]:
jsonlist=[]
for i in range(len(data.columns)):   # for each column
    if (isinstance(data.iloc[1,i], list) ):  # see if some element 1 is a list
        jsonlist.append( data.columns[i] )   # if yes, then save name to list
print(jsonlist)

Before flattening these, we can immediately look into which columns we can delete.  
Let's look at the number of unique values per column. We want to count unique values *including* NaNs, because NaNs *may* indicate something unique about the column.  
Columns that hold lists cannot be hashed, so for those we count unique string representations instead and flag them as a JSON blob in the form of a list:

In [None]:
print("Printout for each column's number of unique values (incl. nans)\n")
for col in data.columns:
    try:
        print(col, ':', data[col].nunique(dropna=False))
    except TypeError:
        a=data[col].astype('str')
        #print(a)
        print( col, ':', a.nunique(dropna=False), ' >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> LIST')
# Clean workspace
del(col)

# Columns with constant values:

Looks like there are quite a few features with only 1 value (including nans!) in the entire dataset.  
These columns will obviously not contribute to predictive power, so they may be removed.  
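
The idea can be checked on a toy frame first. The sketch below is illustrative only: the column names are invented, and the helper `n_unique` mirrors the try/except pattern used in this notebook (list-valued columns raise `TypeError` in `nunique`, so they are compared as strings).

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "constant": ["x", "x", "x"],
    "all_nan": [np.nan, np.nan, np.nan],
    "varied": [1, 2, 2],
    "lists": [[], [], []],
})

def n_unique(series):
    """Count unique values including NaN; fall back to strings for unhashable values."""
    try:
        return series.nunique(dropna=False)
    except TypeError:
        return series.astype("str").nunique(dropna=False)

counts = {col: n_unique(toy[col]) for col in toy.columns}
constant_cols = [col for col, n in counts.items() if n == 1]
reduced = toy.drop(columns=constant_cols)

print(counts)
print("dropped:", constant_cols)
```

Collecting the names first and dropping them in one call avoids modifying the frame while looping over its columns.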

Let's remove these constant columns and see the new shape of our data:

In [None]:
print('Data shape before dropping constant columns:', data.shape)

print('\nColumns being dropped:')

for col in data.columns:
    try:
        if (data[col].nunique(dropna=False) == 1):
            del(data[col])
            print(col)
    except TypeError:
        a=data[col].astype('str')
        if (a.nunique(dropna=False) == 1):
            del(data[col])
            print(col)
del(col)

print('\ndata shape is now:', data.shape)

Let's look at what our data looks like now:

In [None]:
data.head()

Let's see which JSON columns remain:

In [None]:
print("Printout for each column's number of unique values (incl. nans)\n")
for col in data.columns:
    try:
        print(col, ':', data[col].nunique(dropna=False))
    except TypeError:
        a=data[col].astype('str')
        #print(a)
        print( col, ':', a.nunique(dropna=False), ' >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> LIST')
# Clean workspace
del(col)

There are many JSON columns with only 2 unique values (including NaNs!).  
These can be coded as categorical variables to save space! We will remember this for later.  
We only have 2 JSON lists that contain more than 2 unique entries: `hits_product` and `hits_promotion`.  
Let's consider each individually to see what subcolumns they have!  
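
As an aside on the categorical encoding mentioned above, here is a minimal sketch of the memory effect on a low-cardinality column. The values and the row count are invented; actual savings depend on the data.

```python
import pandas as pd

# Invented column with only two distinct values, repeated many times
as_text = pd.Series(["desktop", "mobile"] * 50_000)
as_category = as_text.astype("category")

# deep=True includes the memory held by the string values themselves
text_bytes = as_text.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print("text dtype bytes:    ", text_bytes)
print("category dtype bytes:", category_bytes)
```

A categorical column stores each distinct value once plus a small integer code per row, which is why it shrinks when there are few distinct values.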

For `hits_product` we consider observation 0:

In [None]:
print('number of unique values in column:', 
      data['hits_product'].astype('str').nunique(dropna=False), '\n' )
print( data['hits_product'].iloc[0] )

That is a lot of additional columns!  
Basically this column encodes the products shown in the GStore search result. We will leave this out, as this is a lot of information for what is likely minimal gain, if any.  

In [None]:
print('data shape:', data.shape)
data = data.drop(labels=['hits_product'], axis=1)
print('Removed hits_product')
print('data shape:', data.shape)

Next we look at `hits_promotion`. We consider observation 1:

In [None]:
print('number of unique values in column:', 
      data['hits_promotion'].astype('str').nunique(dropna=False), '\n' )
print( data['hits_promotion'].iloc[1] )

This is again a lot of columns and data, but the information is already summarised in other existing columns that relate to promotions, namely:  
`hits_promotionActionInfo.promoIsClick` and `hits_promotionActionInfo.promoIsView`.  

Let us delete this column:

In [None]:
print('data shape:', data.shape)
data = data.drop(labels=['hits_promotion'], axis=1)
print('Removed hits_promotion')
print('data shape:', data.shape)

We now have 106 features remaining.  
In the next notebook, we will continue reducing the number of columns under consideration, and we shall finally start using all the available observations.  

Let's print out our columns one last time so that we can re-use them for the next notebook:

In [None]:
print("Printout of the remaining column names\n")
for col in data.columns:
    print(col)
# Clean workspace
del(col)


# Final comments
We have completed the data cleaning notebook. In the next notebook we shall reload observations, but immediately delete columns that do not feature in the list as given above, in order to save memory.  
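
A rough sketch of what that column filtering might look like follows. The keep-list here is a short hypothetical stand-in for the full list printed above, and the helper name `keep_only` is invented. Columns absent from a frame are skipped, since the revenue sub-column exists only in the training data.

```python
import pandas as pd

# Hypothetical keep-list; in practice this would be the column names printed above
KEEP_COLUMNS = ["fullVisitorId", "totals_transactionRevenue", "device_browser"]

def keep_only(df, keep):
    """Return df restricted to the columns in `keep` that it actually contains."""
    present = [col for col in keep if col in df.columns]
    return df[present]

# Toy test-like frame: no revenue column, plus one column we do not want
toy = pd.DataFrame({
    "fullVisitorId": ["001"],
    "device_browser": ["Chrome"],
    "device_language": ["not available in demo dataset"],
})
slim = keep_only(toy, KEEP_COLUMNS)
print(list(slim.columns))
```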

*Acknowledgements*  
Special thank you to the following authors for their insightful kernels:  
https://www.kaggle.com/julian3833/1-quick-start-read-csv-and-flatten-json-fields  
https://www.kaggle.com/codlife/pre-processing-for-huge-train-data-with-chunksize  
https://www.kaggle.com/usmanabbas/flatten-hits-and-customdimensions  

I welcome comments and suggestions for improvement!  

Part 2 shall be about Visualization, Exploratory Data Analysis (EDA), and Feature Engineering.  
Each feature in the dataset shall be explored and discussed.  
Part 2 : https://www.kaggle.com/andredanielvolschenk/gstore-part-2-visuals-eda-feature-engineering