# Exploratory Analysis

Group project for the 2019 Data Science Workshop at the University of California, Berkeley.

The project is the Google Analytics Customer Revenue Prediction competition on Kaggle: https://www.kaggle.com/c/ga-customer-revenue-prediction

Group members:

* Andy Vargas (mentor)
* Yuem Park
* Marvin Pohl
* Michael Yeh

In [1]:
import os
import json
import numpy as np
import pandas as pd
try:
    from pandas import json_normalize  # pandas >= 1.0
except ImportError:
    from pandas.io.json import json_normalize  # older pandas
import matplotlib.pyplot as plt
from ast import literal_eval

pd.options.display.max_columns = 999

## Load data

Note that the data files are too large to upload to GitHub. Instead, the directory `./data/` has been added to the `.gitignore`; on your local machine it should contain the following files, all downloaded from the Kaggle competition website:

* sample_submission_v2.csv
* test_v2.csv
* train_v2.csv

In [2]:
# if you want to create a sub-sampled raw dataset, set this to True
create_small = False

# if you want to load and clean the full raw data, set this to True
# (if False, the previously cleaned data is read from disk instead)
create_cleaned = False

# if you want to create a sub-sampled cleaned dataset, set this to True
create_small_cleaned = True

# if you want to load ONLY a previously generated sub-sampled cleaned dataset, set this to True
load_small_cleaned_only = False

Generate a sub-sampled version of the raw data (a 10% random sample of the rows), for faster load/computation times:

In [3]:
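if create_small:
    # read the IDs as strings so that they are not converted to numbers
    train = pd.read_csv('./data/train_v2.csv', dtype={'fullVisitorId': 'str'})
    # fixed random_state so that the same 10% of rows is drawn every time
    train_small = train.sample(frac=0.1, random_state=2019)
    train_small.to_csv('./data/train_small.csv', index=False)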
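
The `dtype={'fullVisitorId':'str'}` argument matters because an ID column left to type inference is parsed as a number. The sketch below illustrates this on a tiny in-memory CSV; the two IDs are hypothetical values chosen for illustration (one with leading zeros), not rows checked against the real file.

```python
import io
import pandas as pd

# Hypothetical IDs for illustration; the first one has leading zeros.
csv_text = "fullVisitorId,visits\n0007460955084541987,1\n4518848952832757046,1\n"

as_default = pd.read_csv(io.StringIO(csv_text))
as_str = pd.read_csv(io.StringIO(csv_text), dtype={'fullVisitorId': 'str'})

print(as_default['fullVisitorId'].tolist())  # parsed as integers, leading zeros lost
print(as_str['fullVisitorId'].tolist())      # kept exactly as written in the file

# Sampling with a fixed random_state draws the same rows on every run.
first = as_str.sample(frac=0.5, random_state=2019)
second = as_str.sample(frac=0.5, random_state=2019)
print(first.equals(second))
```

Keeping the IDs as strings also means that grouping by visitor later on compares the identifiers exactly as they appear in the file.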

Some of the columns are in JSON format. The following function (modified from https://www.kaggle.com/sudalairajkumar/simple-exploration-baseline-ga-customer-revenue) flattens them, so that we end up with a more typical data table in which each column holds a single feature:

In [4]:
def load_df(csv_path):
    JSON_COLUMNS = ['device', 'geoNetwork', 'totals', 'trafficSource']

    df = pd.read_csv(csv_path,
                     converters={column: json.loads for column in JSON_COLUMNS},
                     dtype={'fullVisitorId': 'str'})  # Important!! keeps the IDs as strings

    # 'hits' and 'customDimensions' use Python-style quoting and booleans rather than JSON:
    # fix the formatting in these two columns, and convert them into lists of dictionaries.
    # Each (old, new) pair is a literal replacement, applied in the order listed.
    quote_fixes = [("{'", '{"'), ("'}", '"}'), (": '", ': "'), ("',", '",'),
                   (", '", ', "'), ("':", '":'),
                   ("\\'", "'")]  # unescape \' (not a valid JSON escape)
    hits_fixes = [('"7" ', '"7in '), ('/7" ', '/7in '), ('"Player"', "'Player'")]
    bool_fixes = [('True', 'true'), ('False', 'false')]

    fixes_by_column = {'hits': quote_fixes + hits_fixes + bool_fixes,
                       'customDimensions': quote_fixes + bool_fixes}
    for column, fixes in fixes_by_column.items():
        for old, new in fixes:
            df[column] = df[column].str.replace(old, new, regex=False)
        df[column] = df[column].apply(json.loads)

    # flatten each JSON column into one column per key, named "<column>.<key>"
    for column in JSON_COLUMNS:
        column_as_df = json_normalize(df[column])
        column_as_df.columns = [f"{column}.{subcolumn}" for subcolumn in column_as_df.columns]
        df = df.drop(column, axis=1).merge(column_as_df, right_index=True, left_index=True)
    print(f"Loaded {os.path.basename(csv_path)}. Shape: {df.shape}")
    return df
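
To make the flattening step concrete, here is a minimal sketch on a toy table. The two rows and their values are made up for illustration; only the column naming scheme (`<column>.<key>`, with nested keys joined by a dot) mirrors what `load_df` does.

```python
import pandas as pd

try:
    from pandas import json_normalize  # pandas >= 1.0
except ImportError:
    from pandas.io.json import json_normalize  # older pandas

# Toy stand-in for the real table; the values are made up for illustration.
toy = pd.DataFrame({
    'visitId': [1, 2],
    'trafficSource': [
        {'source': 'google', 'adwordsClickInfo': {'page': '1'}},
        {'source': '(direct)', 'adwordsClickInfo': {'page': '2'}},
    ],
})

# One column per key; nested dictionaries become dotted column names.
flat = json_normalize(toy['trafficSource'].tolist())
flat.columns = [f"trafficSource.{sub}" for sub in flat.columns]

# Swap the original dictionary column for its flattened version, matching on the row index.
toy_flat = toy.drop('trafficSource', axis=1).merge(flat, left_index=True, right_index=True)
print(toy_flat.columns.tolist())
```

The merge on the index relies on both frames sharing the default 0..n-1 index, which holds for a table freshly read with `pd.read_csv`.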

In [5]:
if create_cleaned and not load_small_cleaned_only:
    
    csv_path = './data/train_v2.csv'
    train = load_df(csv_path)
    train.head()
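
The chained replacements in `load_df` exist because `hits` and `customDimensions` look like Python literals (single quotes, `True`/`False`) rather than JSON. A possible alternative, sketched here and not what the notebook actually uses, is `ast.literal_eval`, which parses such strings directly. The sample string and its keys are invented for illustration, and whether every raw cell is a valid Python literal is an assumption that would need checking on the real data.

```python
from ast import literal_eval

# Made-up cell in the same style as the raw 'hits' column (keys are illustrative).
raw = """[{'hitNumber': '1', 'isEntrance': True, 'name': 'Tablet 7" Sleeve'}]"""

# literal_eval only evaluates literals (lists, dicts, strings, numbers, booleans),
# so it handles the single quotes, True/False and the embedded double quote as-is.
hits = literal_eval(raw)
print(type(hits), hits[0]['name'])
```

On a column this would be applied as `df['hits'].apply(literal_eval)`. It may be noticeably slower than `json.loads` on large files, which is one reason to prefer the replacement approach.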

In [6]:
if create_small_cleaned:
    
    csv_path = './data/train_small.csv'
    train_small = load_df(csv_path)
    train_small.head()

Loaded train_small.csv. Shape: (170834, 59)


Use the following to identify JSON load errors:
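
One possible approach, sketched under assumptions and not taken from the original notebook: try `json.loads` on each string individually and collect the position, the parser message, and a snippet around the failure. The helper name `find_json_errors` and its `context` parameter are hypothetical, and the sample strings are made up.

```python
import json

def find_json_errors(values, context=30):
    """Return (position, message, snippet) for every string that json.loads rejects."""
    errors = []
    for i, text in enumerate(values):
        try:
            json.loads(text)
        except json.JSONDecodeError as err:
            # show a window of text around the character where parsing failed
            start = max(err.pos - context, 0)
            errors.append((i, err.msg, text[start:err.pos + context]))
    return errors

# Made-up examples: the second string has an unescaped double quote inside a value.
samples = ['[{"hitNumber": "1"}]', '[{"name": "Tablet 7" Sleeve"}]', '[]']
bad = find_json_errors(samples)
for position, message, snippet in bad:
    print(position, message, repr(snippet))
```

Run on a column after the string replacements but before `.apply(json.loads)`, the snippets point at patterns (such as the `7"` product names) that still need a dedicated replacement.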

Get rid of any features that only have a single value (and therefore are not useful for differentiating samples):

In [7]:
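if create_cleaned and not load_small_cleaned_only:

    # columns with a single unique value (nunique() does not count NaN)
    NA_cols = []
    for col in train.columns:
        # 'hits' and 'customDimensions' hold lists, which nunique() cannot hash
        if col!='hits' and col!='customDimensions':
            if train[col].nunique()==1:
                NA_cols.append(col)
            
    train.drop(NA_cols, axis=1, inplace=True)

    print(NA_cols)
    
    # save it out as a .csv which we can read back in later
    train.to_csv('./data/train_cleaned.csv', index=False)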

If we instead decided to skip the loading and cleaning, read in the previously generated .csv:

In [8]:
if not create_cleaned and not load_small_cleaned_only:
    
    try:
        train = pd.read_csv('./data/train_cleaned.csv', dtype={'fullVisitorId':'str'})
    except FileNotFoundError:
        print('./data/train_cleaned.csv does not exist.')

./data/train_cleaned.csv does not exist.


Do the same for the smaller data file, after a quick look at the flattened table:

In [9]:
train_small.head()

Unnamed: 0,channelGrouping,customDimensions,date,fullVisitorId,hits,socialEngagementType,visitId,visitNumber,visitStartTime,device.browser,device.browserSize,device.browserVersion,device.deviceCategory,device.flashVersion,device.isMobile,device.language,device.mobileDeviceBranding,device.mobileDeviceInfo,device.mobileDeviceMarketingName,device.mobileDeviceModel,device.mobileInputSelector,device.operatingSystem,device.operatingSystemVersion,device.screenColors,device.screenResolution,geoNetwork.city,geoNetwork.cityId,geoNetwork.continent,geoNetwork.country,geoNetwork.latitude,geoNetwork.longitude,geoNetwork.metro,geoNetwork.networkDomain,geoNetwork.networkLocation,geoNetwork.region,geoNetwork.subContinent,totals.bounces,totals.hits,totals.newVisits,totals.pageviews,totals.sessionQualityDim,totals.timeOnSite,totals.totalTransactionRevenue,totals.transactionRevenue,totals.transactions,totals.visits,trafficSource.adContent,trafficSource.adwordsClickInfo.adNetworkType,trafficSource.adwordsClickInfo.criteriaParameters,trafficSource.adwordsClickInfo.gclId,trafficSource.adwordsClickInfo.isVideoAd,trafficSource.adwordsClickInfo.page,trafficSource.adwordsClickInfo.slot,trafficSource.campaign,trafficSource.isTrueDirect,trafficSource.keyword,trafficSource.medium,trafficSource.referralPath,trafficSource.source
0,Organic Search,"[{'index': '4', 'value': 'EMEA'}]",20161010,8443509489214341414,"[{'hitNumber': '1', 'time': '0', 'hour': '0', ...",Not Socially Engaged,1476083992,1,1476083992,Chrome,not available in demo dataset,not available in demo dataset,desktop,not available in demo dataset,False,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,Windows,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,Europe,Netherlands,not available in demo dataset,not available in demo dataset,not available in demo dataset,glaslokaal.nl,not available in demo dataset,not available in demo dataset,Western Europe,1.0,1,1.0,1,,,,,,1,,,not available in demo dataset,,,,,(not set),,,(none),,(direct)
1,Social,[],20161130,979088112537619939,"[{'hitNumber': '1', 'time': '0', 'hour': '6', ...",Not Socially Engaged,1480514701,1,1480514701,Safari,not available in demo dataset,not available in demo dataset,desktop,not available in demo dataset,False,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,Macintosh,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,Europe,Slovakia,not available in demo dataset,not available in demo dataset,not available in demo dataset,t-com.sk,not available in demo dataset,not available in demo dataset,Eastern Europe,1.0,1,1.0,1,,,,,,1,,,not available in demo dataset,,,,,(not set),,,referral,/yt/about/sk/,youtube.com
2,Referral,"[{'index': '4', 'value': 'North America'}]",20161219,299022312209485464,"[{'hitNumber': '1', 'time': '0', 'hour': '18',...",Not Socially Engaged,1482201535,1,1482201535,Chrome,not available in demo dataset,not available in demo dataset,desktop,not available in demo dataset,False,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,Macintosh,not available in demo dataset,not available in demo dataset,not available in demo dataset,New York,not available in demo dataset,Americas,United States,not available in demo dataset,not available in demo dataset,New York NY,rr.com,not available in demo dataset,New York,Northern America,,2,1.0,2,,3.0,,,,1,,,not available in demo dataset,,,,,(not set),,,(none),/Rewards/Products/Details.aspx,(direct)
3,Organic Search,[],20180122,9266484053810934946,"[{'hitNumber': '1', 'time': '0', 'hour': '21',...",Not Socially Engaged,1516684961,1,1516684961,Chrome,not available in demo dataset,not available in demo dataset,desktop,not available in demo dataset,False,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,Windows,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,Asia,Philippines,not available in demo dataset,not available in demo dataset,not available in demo dataset,pldt.net,not available in demo dataset,not available in demo dataset,Southeast Asia,,11,1.0,8,1.0,75.0,,,,1,,,not available in demo dataset,,,,,(not set),,(not provided),organic,,google
4,Referral,"[{'index': '4', 'value': 'North America'}]",20170330,4518848952832757046,"[{'hitNumber': '1', 'time': '0', 'hour': '13',...",Not Socially Engaged,1490904631,2,1490904631,Chrome,not available in demo dataset,not available in demo dataset,desktop,not available in demo dataset,False,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,not available in demo dataset,Windows,not available in demo dataset,not available in demo dataset,not available in demo dataset,New York,not available in demo dataset,Americas,United States,not available in demo dataset,not available in demo dataset,New York NY,(not set),not available in demo dataset,New York,Northern America,1.0,1,,1,,,,,,1,,,not available in demo dataset,,,,,(not set),True,,(none),/,(direct)


In [10]:
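if create_small_cleaned:
    
    # columns with a single unique value (nunique() does not count NaN)
    NA_cols = []
    for col in train_small.columns:
        # 'hits' and 'customDimensions' hold lists, which nunique() cannot hash
        if col!='hits' and col!='customDimensions':
            if train_small[col].nunique()==1:
                NA_cols.append(col)
            
    train_small.drop(NA_cols, axis=1, inplace=True)

    print(NA_cols)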
    
    # save it out as a .csv which we can read back in later
    train_small.to_csv('./data/train_small_cleaned.csv', index=False)
    
else:
    
    train_small = pd.read_csv('./data/train_small_cleaned.csv', dtype={'fullVisitorId':'str'})
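
One detail of the single-value filter is worth checking: `nunique()` skips missing values by default, so a column that is either 1.0 or empty (as `totals.bounces` appears to be in the `head()` output above) also counts as single-valued and would be dropped, even though missing versus present may carry information. The sketch below shows the difference on a toy frame with made-up values; `nunique(dropna=False)` is shown as a stricter alternative, not as what the notebook does.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values, using column names from the table above.
toy = pd.DataFrame({
    'socialEngagementType': ['Not Socially Engaged'] * 3,
    'totals.visits': [1, 1, 1],
    'totals.bounces': [1.0, np.nan, 1.0],
    'device.browser': ['Chrome', 'Safari', 'Chrome'],
})

# Same test as in the notebook: NaN is ignored when counting unique values.
constant_cols = [c for c in toy.columns if toy[c].nunique() == 1]

# Stricter variant: NaN counts as its own value, so "1.0 or missing" columns survive.
constant_strict = [c for c in toy.columns if toy[c].nunique(dropna=False) == 1]

print(constant_cols)
print(constant_strict)
```

Comparing the two lists on the real data would show which indicator-style columns are affected before deciding which variant to keep.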

## Simple exploration

In [None]:
train_small.head()

In [None]:
train_small.info()

Things to look at:

* how many times do people buy things?

In [None]:
train_small.groupby('fullVisitorId')['totals.transactionRevenue'].sum().reset_index()

In [None]:
train_small[train_small['totals.transactionRevenue']>1e9]
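
The grouped sum above gives total revenue per visitor, but not how many times each visitor bought something. A minimal sketch of one way to get both, assuming that a non-missing `totals.transactionRevenue` marks a session with a purchase (the toy IDs and amounts are made up). In the Google Analytics export schema this field is the revenue multiplied by 10^6, so the `1e9` threshold in the filter above corresponds to 1000 currency units.

```python
import numpy as np
import pandas as pd

# Toy sessions with made-up values; revenue is in micro-units (x 10^6), NaN means no purchase.
toy = pd.DataFrame({
    'fullVisitorId': ['001', '001', '002', '003', '003', '003'],
    'totals.transactionRevenue': [np.nan, 25e6, np.nan, 10e6, np.nan, 40e6],
})

# 'count' ignores NaN, so it counts purchasing sessions; 'sum' totals the revenue.
per_visitor = (toy.groupby('fullVisitorId')['totals.transactionRevenue']
                  .agg(['count', 'sum'])
                  .rename(columns={'count': 'n_purchases', 'sum': 'revenue'}))

# How many visitors bought 0, 1, 2, ... times.
purchase_counts = per_visitor['n_purchases'].value_counts().sort_index()

print(per_visitor)
print(purchase_counts)
```

Applied to `train_small`, `purchase_counts` would answer the question in the list above directly.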