# Exploratory Analysis

Group project for the 2019 Data Science Workshop at the University of California, Berkeley.

The project is the Google Analytics Customer Revenue Prediction competition on Kaggle: https://www.kaggle.com/c/ga-customer-revenue-prediction

Group members:

* Andy Vargas (mentor)
* Yuem Park
* Marvin Pohl
* Michael Yeh

In [1]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import json
import os
import ast
from pandas.io.json import json_normalize
import pandas_profiling as pp

Load data:

Note that the data files are too large to upload to GitHub - instead, the directory `./data/` has been added to the .gitignore, which should contain the following files on your local machine, all downloaded from the Kaggle competition website:

* sample_submission_v2.csv
* test_v2.csv
* train_v2.csv

Some of the columns are in JSON format - the following function (taken from [here](https://www.kaggle.com/sudalairajkumar/simple-exploration-baseline-ga-customer-revenue)) flattens the JSON columns, such that we end up with a more typical data table, where each column has a single feature in it:

In [2]:
def load_df(csv_path=None, nrows=None):
    JSON_COLUMNS = ['device', 'geoNetwork', 'totals', 'trafficSource'] # columns of the dataframe that can readily be converted
    JSON_COLUMNS2 = ['customDimensions','device', 'geoNetwork', 'totals', 'trafficSource'] # columns of the dataframe that need to be tweaked
    
    df = pd.read_csv(csv_path, 
                     converters={column: json.loads for column in JSON_COLUMNS}, # convert JSON_COLUMNS to json format
                     dtype={'fullVisitorId': 'str'}, # Important!!
                     nrows=nrows)
    
    for row in list(range(nrows)): # fill empty elements in column 'customDimensions' and 
        if df['customDimensions'][row] == '[]':
            df['customDimensions'][row] = "{'index':'','value':''}"
    
    df['customDimensions'] = df['customDimensions'].str.replace("[", '') # drop square brackets in column 'customDimensions'
    df['customDimensions'] = df['customDimensions'].str.replace("]", '')
    df['customDimensions'] = df['customDimensions'].str.replace("'", "\"")
    
    
    for row in list(range(nrows)): # convert strings in 'customDimensions' to dict
        df['customDimensions'][row] = json.loads(df['customDimensions'][row])
        
    for column in JSON_COLUMNS2: # distribute the dicts to separate columns of the dataframe
        column_as_df = json_normalize(df[column])
        column_as_df.columns = [f"{column}.{subcolumn}" for subcolumn in column_as_df.columns]
        df = df.drop(column, axis=1).merge(column_as_df, right_index=True, left_index=True)
    
    return df

Load a small subset of train_v2, defined by `nrows`.

In [3]:
#%%time
train_df = load_df('./data/train_v2.csv', nrows=200) # nrows defines size of subset
train_df.head()

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if sys.path[0] == '':
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


Unnamed: 0,channelGrouping,date,fullVisitorId,hits,socialEngagementType,visitId,visitNumber,visitStartTime,customDimensions.index,customDimensions.value,...,trafficSource.adwordsClickInfo.gclId,trafficSource.adwordsClickInfo.isVideoAd,trafficSource.adwordsClickInfo.page,trafficSource.adwordsClickInfo.slot,trafficSource.campaign,trafficSource.isTrueDirect,trafficSource.keyword,trafficSource.medium,trafficSource.referralPath,trafficSource.source
0,Organic Search,20171016,3162355547410993243,"[{'hitNumber': '1', 'time': '0', 'hour': '17',...",Not Socially Engaged,1508198450,1,1508198450,4,EMEA,...,,,,,(not set),,water bottle,organic,,google
1,Referral,20171016,8934116514970143966,"[{'hitNumber': '1', 'time': '0', 'hour': '10',...",Not Socially Engaged,1508176307,6,1508176307,4,North America,...,,,,,(not set),,,referral,/a/google.com/transportation/mtv-services/bike...,sites.google.com
2,Direct,20171016,7992466427990357681,"[{'hitNumber': '1', 'time': '0', 'hour': '17',...",Not Socially Engaged,1508201613,1,1508201613,4,North America,...,,,,,(not set),True,,(none),,(direct)
3,Organic Search,20171016,9075655783635761930,"[{'hitNumber': '1', 'time': '0', 'hour': '9', ...",Not Socially Engaged,1508169851,1,1508169851,4,EMEA,...,,,,,(not set),,(not provided),organic,,google
4,Organic Search,20171016,6960673291025684308,"[{'hitNumber': '1', 'time': '0', 'hour': '14',...",Not Socially Engaged,1508190552,1,1508190552,4,Central America,...,,,,,(not set),,(not provided),organic,,google


## Clean up
There seems to be a few columns that are 'not available in demo dataset.' Let's get rid of them, as well as any features that only have a single value (and therefore are not useful for differentiating samples):

In [4]:
NA_cols = []
for col in train_df.columns:
    if train_df[col].nunique()==1: # find columns that have only 1 unique element
        NA_cols.append(col) # create list of these columns 
        
NA_cols

['date',
 'socialEngagementType',
 'device.browserSize',
 'device.browserVersion',
 'device.flashVersion',
 'device.language',
 'device.mobileDeviceBranding',
 'device.mobileDeviceInfo',
 'device.mobileDeviceMarketingName',
 'device.mobileDeviceModel',
 'device.mobileInputSelector',
 'device.operatingSystemVersion',
 'device.screenColors',
 'device.screenResolution',
 'geoNetwork.cityId',
 'geoNetwork.latitude',
 'geoNetwork.longitude',
 'geoNetwork.networkLocation',
 'totals.bounces',
 'totals.newVisits',
 'totals.visits',
 'trafficSource.adwordsClickInfo.criteriaParameters',
 'trafficSource.adwordsClickInfo.isVideoAd',
 'trafficSource.isTrueDirect']

Drop colums with only a single unique element and the column `hits`:

In [15]:
train_df.drop(NA_cols, axis=1, inplace=True)
train_df.drop('hits', axis=1, inplace=True)

Create pandas_profiling report:

In [16]:
pp.ProfileReport(train_df).to_file(outputfile="EDA_report-mps.html")

## Test enviroment