# Exploratory Analysis

Group project for the 2019 Data Science Workshop at the University of California, Berkeley.

The project is the Google Analytics Customer Revenue Prediction competition on Kaggle: https://www.kaggle.com/c/ga-customer-revenue-prediction

Group members:

* Andy Vargas (mentor)
* Yuem Park
* Marvin Pohl
* Michael Yeh

In [56]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import json
import os
import ast
try:
    from pandas import json_normalize # pandas >= 1.0
except ImportError:
    from pandas.io.json import json_normalize # older pandas
import pandas_profiling as pp
import math

Load data:

Note that the data files are too large to upload to GitHub. Instead, the directory `./data/` has been added to the .gitignore; on your local machine it should contain the following files, all downloaded from the Kaggle competition website:

* sample_submission_v2.csv
* test_v2.csv
* train_v2.csv
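
Before loading anything, it can help to confirm that the three files are actually in place. A minimal sketch, assuming the notebook is run from the repository root; the helper name `missing_data_files` is hypothetical and not part of the project:

```python
from pathlib import Path

# File names as listed above; the helper name and default directory are assumptions.
EXPECTED_FILES = ["sample_submission_v2.csv", "test_v2.csv", "train_v2.csv"]

def missing_data_files(data_dir="./data"):
    """Return the expected Kaggle files that are not present in data_dir."""
    data_dir = Path(data_dir)
    return [name for name in EXPECTED_FILES if not (data_dir / name).is_file()]

missing = missing_data_files()
if missing:
    print("Missing from ./data/:", ", ".join(missing))
```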

Some of the columns are in JSON format. The following function (taken from [here](https://www.kaggle.com/sudalairajkumar/simple-exploration-baseline-ga-customer-revenue)) flattens them, so that we end up with a more typical data table in which each column holds a single feature:

In [2]:
def load_df(csv_path=None, nrows=None):
    JSON_COLUMNS = ['device', 'geoNetwork', 'totals', 'trafficSource'] # columns that json.loads can parse directly
    JSON_COLUMNS2 = ['customDimensions', 'device', 'geoNetwork', 'totals', 'trafficSource'] # all columns to flatten, including the tweaked 'customDimensions'

    df = pd.read_csv(csv_path,
                     converters={column: json.loads for column in JSON_COLUMNS}, # parse the JSON strings into dicts
                     dtype={'fullVisitorId': 'str'}, # Important!! read the ID as a string, not a number
                     nrows=nrows)

    # replace empty lists in 'customDimensions' with a placeholder dict
    # (a boolean mask with .loc also works for nrows=None and avoids chained assignment)
    df.loc[df['customDimensions'] == '[]', 'customDimensions'] = "{'index':'','value':''}"

    df['customDimensions'] = df['customDimensions'].str.replace("[", '', regex=False) # drop square brackets in column 'customDimensions'
    df['customDimensions'] = df['customDimensions'].str.replace("]", '', regex=False)
    df['customDimensions'] = df['customDimensions'].str.replace("'", "\"", regex=False) # JSON needs double quotes

    df['customDimensions'] = df['customDimensions'].apply(json.loads) # convert strings in 'customDimensions' to dict
        
    for column in JSON_COLUMNS2: # distribute the dicts to separate columns of the dataframe
        column_as_df = json_normalize(df[column])
        column_as_df.columns = [f"{column}.{subcolumn}" for subcolumn in column_as_df.columns]
        df = df.drop(column, axis=1).merge(column_as_df, right_index=True, left_index=True)
    
    return df
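
To see what the flattening step does in isolation, here is a toy sketch on made-up data (two rows and a single JSON column; none of the values come from the competition files). It assumes pandas 1.0 or later, where the function is available as `pd.json_normalize`:

```python
import json
import pandas as pd

# Toy stand-in for the raw CSV: one plain column and one JSON-string column.
raw = pd.DataFrame({
    "visitId": [1, 2],
    "device": ['{"browser": "Chrome", "isMobile": false}',
               '{"browser": "Firefox", "isMobile": true}'],
})

# Parse each JSON string into a dict, then spread the keys into columns.
parsed = raw["device"].apply(json.loads)
device_df = pd.json_normalize(parsed.tolist())
device_df.columns = [f"device.{sub}" for sub in device_df.columns]

# Both frames share the default 0..n-1 index, so an index join lines up row by row.
flat = raw.drop(columns="device").join(device_df)
print(flat.columns.tolist())
```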

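The version below redefines `load_df` so that it also parses the `hits` column and gives each hit its own row:

In [110]: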
def hits_converter(data):
    return json.loads(json.dumps(ast.literal_eval(data)))

def customDimensions_converter(data):
    if data == '[]':
        return {}
    else:
        return hits_converter(data)[0]

def load_df(csv_path='data/train_v2.csv', nrows=None):
    
    conv_dict = {'device': json.loads,
                'geoNetwork': json.loads,
                'totals': json.loads,
                'trafficSource': json.loads,
                'hits': hits_converter,
                'customDimensions': customDimensions_converter}
    
    df = pd.read_csv(csv_path, 
                     converters=conv_dict, 
                     dtype={'fullVisitorId': 'str'}, # Important!!
                     nrows=nrows)
    
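    # give each hit its own row; reset the index right away because explode() repeats
    # the visit's index label for every hit, while json_normalize() numbers its rows 0..n-1,
    # so the index-based merge below would otherwise pair rows incorrectly
    df = df.explode('hits').reset_index() # the original row number is kept in column 'index'
    JSON_COLUMNS = ['device', 'geoNetwork', 'totals', 'trafficSource', 'hits', 'customDimensions']
    for column in JSON_COLUMNS:
        column_as_df = json_normalize(df[column].tolist())
        column_as_df.columns = [f"{column}.{subcolumn}" for subcolumn in column_as_df.columns]
        df = df.drop(column, axis=1).merge(column_as_df, right_index=True, left_index=True)

    return df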
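
`explode` keeps the original index label on every row it creates, so the exploded frame no longer has a unique index, whereas `json_normalize` returns rows numbered 0..n-1. A toy sketch (made-up data, assuming pandas 1.0 or later for `pd.json_normalize`) of resetting the index before joining the two:

```python
import ast
import pandas as pd

# Toy data: 'hits' is a Python-literal string holding a list of dicts per visit.
visits = pd.DataFrame({
    "visitId": [10, 20],
    "hits": ["[{'hitNumber': '1'}, {'hitNumber': '2'}]", "[{'hitNumber': '1'}]"],
})
# The strings use single quotes, so they are parsed as Python literals, not as JSON.
visits["hits"] = visits["hits"].apply(ast.literal_eval)

# explode() repeats the original index label for every hit of a visit ...
exploded = visits.explode("hits")
print(exploded.index.tolist())

# ... so reset it before joining with json_normalize output, which is indexed 0..n-1.
exploded = exploded.reset_index(drop=True)
hits_df = pd.json_normalize(exploded["hits"].tolist()).add_prefix("hits.")
per_hit = exploded.drop(columns="hits").join(hits_df)
print(per_hit)
```

Without the reset, an index-based join would match each visit's label against a single row of the normalized frame instead of that visit's own hits.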

Load a small subset of train_v2, defined by `nrows`.

In [111]:
%%time
train_df = load_df('./data/train_v2.csv', nrows=1700) # nrows defines size of subset
train_df.head(10)

CPU times: user 10.1 s, sys: 114 ms, total: 10.2 s
Wall time: 10.2 s


Unnamed: 0,index,channelGrouping,date,fullVisitorId,socialEngagementType,visitId,visitNumber,visitStartTime,device.browser,device.browserVersion,...,hits.transaction.transactionRevenue,hits.transaction.transactionTax,hits.transaction.transactionShipping,hits.transaction.affiliation,hits.transaction.localTransactionRevenue,hits.transaction.localTransactionTax,hits.transaction.localTransactionShipping,hits.item.transactionId,customDimensions.index,customDimensions.value
0,0,Organic Search,20171016,3162355547410993243,Not Socially Engaged,1508198450,1,1508198450,Firefox,not available in demo dataset,...,,,,,,,,,4,EMEA
1,1,Referral,20171016,8934116514970143966,Not Socially Engaged,1508176307,6,1508176307,Chrome,not available in demo dataset,...,,,,,,,,,4,North America
2,1,Referral,20171016,8934116514970143966,Not Socially Engaged,1508176307,6,1508176307,Chrome,not available in demo dataset,...,,,,,,,,,4,North America
3,2,Direct,20171016,7992466427990357681,Not Socially Engaged,1508201613,1,1508201613,Chrome,not available in demo dataset,...,,,,,,,,,4,North America
4,2,Direct,20171016,7992466427990357681,Not Socially Engaged,1508201613,1,1508201613,Chrome,not available in demo dataset,...,,,,,,,,,4,North America
5,3,Organic Search,20171016,9075655783635761930,Not Socially Engaged,1508169851,1,1508169851,Chrome,not available in demo dataset,...,,,,,,,,,4,North America
6,3,Organic Search,20171016,9075655783635761930,Not Socially Engaged,1508169851,1,1508169851,Chrome,not available in demo dataset,...,,,,,,,,,4,North America
7,4,Organic Search,20171016,6960673291025684308,Not Socially Engaged,1508190552,1,1508190552,Chrome,not available in demo dataset,...,,,,,,,,,4,North America
8,4,Organic Search,20171016,6960673291025684308,Not Socially Engaged,1508190552,1,1508190552,Chrome,not available in demo dataset,...,,,,,,,,,4,North America
9,5,Referral,20171016,166277907528479249,Not Socially Engaged,1508196701,1,1508196701,Chrome,not available in demo dataset,...,,,,,,,,,4,EMEA


In [118]:
for column in train_df.columns:
    print(column)

index
channelGrouping
date
fullVisitorId
socialEngagementType
visitId
visitNumber
visitStartTime
device.browser
device.browserVersion
device.browserSize
device.operatingSystem
device.operatingSystemVersion
device.isMobile
device.mobileDeviceBranding
device.mobileDeviceModel
device.mobileInputSelector
device.mobileDeviceInfo
device.mobileDeviceMarketingName
device.flashVersion
device.language
device.screenColors
device.screenResolution
device.deviceCategory
geoNetwork.continent
geoNetwork.subContinent
geoNetwork.country
geoNetwork.region
geoNetwork.metro
geoNetwork.city
geoNetwork.cityId
geoNetwork.networkDomain
geoNetwork.latitude
geoNetwork.longitude
geoNetwork.networkLocation
totals.visits
totals.hits
totals.pageviews
totals.bounces
totals.newVisits
totals.sessionQualityDim
totals.timeOnSite
totals.transactions
totals.transactionRevenue
totals.totalTransactionRevenue
trafficSource.campaign
trafficSource.source
trafficSource.medium
trafficSource.keyword
trafficSource.adwordsClickInfo.

In [173]:
train_df['hits.social.socialNetwork'].value_counts()

(not set)    10461
Name: hits.social.socialNetwork, dtype: int64

## Clean up
There seem to be a few columns whose only value is 'not available in demo dataset'. Let's get rid of them, as well as any other features that have only a single value (and are therefore not useful for differentiating samples):

In [4]:
NA_cols = []
for col in train_df.columns:
    if train_df[col].nunique()==1: # find columns that have only 1 unique element
        NA_cols.append(col) # create list of these columns 
        
NA_cols

['socialEngagementType',
 'device.browserVersion',
 'device.browserSize',
 'device.operatingSystemVersion',
 'device.mobileDeviceBranding',
 'device.mobileDeviceModel',
 'device.mobileInputSelector',
 'device.mobileDeviceInfo',
 'device.mobileDeviceMarketingName',
 'device.flashVersion',
 'device.language',
 'device.screenColors',
 'device.screenResolution',
 'geoNetwork.cityId',
 'geoNetwork.latitude',
 'geoNetwork.longitude',
 'geoNetwork.networkLocation',
 'totals.visits',
 'totals.bounces',
 'totals.newVisits',
 'trafficSource.adwordsClickInfo.criteriaParameters',
 'trafficSource.isTrueDirect',
 'trafficSource.adwordsClickInfo.isVideoAd',
 'trafficSource.campaignCode']
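
One detail worth keeping in mind: `nunique()` ignores missing values by default, so a column that is either one fixed value or NaN also counts as having a single unique element. A small sketch on made-up data; the helper name `constant_columns` is hypothetical, and unlike the cell above it uses `<= 1` so that all-NaN columns are caught as well:

```python
import pandas as pd

def constant_columns(df):
    """Return the columns of df that hold at most one distinct non-missing value."""
    # nunique() skips NaN by default, so "one value or NaN" counts as constant here.
    return [col for col in df.columns if df[col].nunique() <= 1]

toy = pd.DataFrame({
    "browser": ["Chrome", "Firefox", "Chrome"],
    "browserVersion": ["not available in demo dataset"] * 3,
    "bounces": ["1", None, "1"],
})
print(constant_columns(toy))
```

Columns such as `totals.bounces` and `totals.newVisits` in the list above may be of this second kind, in which case dropping them also discards the information about which rows were missing.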

Drop columns with only a single unique element, as well as the column `hits`:

In [5]:
train_df.drop(NA_cols, axis=1, inplace=True)
train_df.drop('hits', axis=1, inplace=True, errors='ignore') # 'hits' is already gone if the hit-level load_df was used

## Exploratory data analysis

Create pandas_profiling report:

In [6]:
# pass the path positionally; the keyword is `outputfile` in older pandas_profiling releases and `output_file` in newer ones
pp.ProfileReport(train_df).to_file("EDA_report-mp.html")

Compare different features:

In [13]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 170000 entries, 0 to 169999
Data columns (total 36 columns):
channelGrouping                                 170000 non-null object
date                                            170000 non-null int64
fullVisitorId                                   170000 non-null object
visitId                                         170000 non-null int64
visitNumber                                     170000 non-null int64
visitStartTime                                  170000 non-null int64
customDimensions.index                          170000 non-null object
customDimensions.value                          170000 non-null object
device.browser                                  170000 non-null object
device.operatingSystem                          170000 non-null object
device.isMobile                                 170000 non-null bool
device.deviceCategory                           170000 non-null object
geoNetwork.continent                       

In [53]:
train_df['totals.transactionRevenue'].dropna() # drop() removes index labels; dropna() removes missing values

Set target:

In [31]:
train_target = train_df.groupby('fullVisitorId')['totals.transactionRevenue']
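
The line above only creates a groupby object; nothing is aggregated yet. The competition's target is, as far as we understand it, the natural log of each visitor's summed revenue plus one. A sketch of that aggregation on made-up data, assuming one row per visit and revenue stored as strings or NaN (the names `toy`, `revenue_per_visitor` and `toy_target` are illustrative):

```python
import numpy as np
import pandas as pd

# Toy visit-level data; revenue arrives as strings, with NaN for visits without a purchase.
toy = pd.DataFrame({
    "fullVisitorId": ["a", "a", "b"],
    "totals.transactionRevenue": ["1000000", np.nan, np.nan],
})

# Convert to numbers, treat missing revenue as zero, then sum per visitor.
revenue = pd.to_numeric(toy["totals.transactionRevenue"], errors="coerce").fillna(0)
revenue_per_visitor = revenue.groupby(toy["fullVisitorId"]).sum()

# log1p(x) = ln(x + 1), so visitors without any revenue map to 0.
toy_target = np.log1p(revenue_per_visitor)
print(toy_target)
```

On the hit-level table produced by the revised `load_df`, each visit's revenue is repeated once per hit, so the rows would need to be de-duplicated per visit (for example on `fullVisitorId` and `visitId`) before summing.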

In [55]:
type(train_df['totals.transactionRevenue'][1])

float

In [59]:
train_df['totals.transactionRevenue'].isna() # element-wise NaN test; map() expects the function first and the iterable second

In [99]:
# the column mixes float NaN with numeric strings, so convert it before testing for NaN
revenue = pd.to_numeric(train_df['totals.transactionRevenue'], errors='coerce')
revenue.iloc[:len(train_df)//100].isna()

In [96]:
list(range(len(train_df)//10000))

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]

In [86]:
list(range(len({0,1,2})))

[0, 1, 2]

In [80]:
ast.literal_eval('[1, 2, 3]') # eval() and ast.literal_eval() expect a string, not a list

In [104]:
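train_df.dropna(subset=['totals.transactionRevenue'], inplace=True) # only drop rows without revenue; a bare dropna() drops every row with a NaN in any column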
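
Called without arguments, `dropna` removes a row as soon as any column is missing, which on a table with many sparse columns can leave nothing behind. A toy sketch (made-up values) contrasting that with restricting the check to the revenue column:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "totals.transactionRevenue": [np.nan, 5.0, 7.0],
    "trafficSource.keyword": ["(not provided)", np.nan, "shirts"],
})

# Without arguments, dropna() removes every row that has a NaN in any column.
print(len(toy.dropna()))

# With subset=..., only NaN in the listed columns causes a row to be dropped.
print(len(toy.dropna(subset=["totals.transactionRevenue"])))
```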

In [107]:
train_df

Unnamed: 0,channelGrouping,date,fullVisitorId,visitId,visitNumber,visitStartTime,customDimensions.index,customDimensions.value,device.browser,device.operatingSystem,...,trafficSource.campaign,trafficSource.source,trafficSource.medium,trafficSource.keyword,trafficSource.referralPath,trafficSource.adContent,trafficSource.adwordsClickInfo.page,trafficSource.adwordsClickInfo.slot,trafficSource.adwordsClickInfo.gclId,trafficSource.adwordsClickInfo.adNetworkType
