# Exploratory Analysis

Group project for the 2019 Data Science Workshop at the University of California, Berkeley.

The project is the Google Analytics Customer Revenue Prediction competition on Kaggle: https://www.kaggle.com/c/ga-customer-revenue-prediction

Group members:

* Andy Vargas (mentor)
* Yuem Park
* Marvin Pohl
* Michael Yeh

In [1]:
import os
import json
import numpy as np
import pandas as pd
try:
    from pandas import json_normalize  # pandas >= 1.0
except ImportError:
    from pandas.io.json import json_normalize  # older pandas versions
import matplotlib.pyplot as plt

pd.options.display.max_columns = 999

## Load data

Note that the data files are too large to upload to GitHub, so the directory `./data/` has been added to the .gitignore. On your local machine, that directory should contain the following files, all downloaded from the Kaggle competition website:

* sample_submission_v2.csv
* test_v2.csv
* train_v2.csv

In [2]:
# set this to True to load and clean the raw data;
# leave it as False to skip that step and read the previously cleaned .csv instead
create_cleaned = False

# set this to True to create a sub-sampled version of the cleaned data
create_small_cleaned = False

# set this to True to ONLY load the previously generated sub-sampled cleaned data
load_small_cleaned_only = True

Some of the columns are in JSON format - the following function (taken from https://www.kaggle.com/sudalairajkumar/simple-exploration-baseline-ga-customer-revenue) flattens the JSON columns, such that we end up with a more typical data table, where each column has a single feature in it:

In [3]:
if create_cleaned and not load_small_cleaned_only:
    
    def load_df(csv_path='./data/train_v2.csv', nrows=None):
        JSON_COLUMNS = ['device', 'geoNetwork', 'totals', 'trafficSource']

        df = pd.read_csv(csv_path, 
                         converters={column: json.loads for column in JSON_COLUMNS}, 
                         dtype={'fullVisitorId': 'str'}, # Important: keep IDs as strings so they are not parsed as numbers
                         nrows=nrows)

        for column in JSON_COLUMNS:
            column_as_df = json_normalize(df[column])
            column_as_df.columns = [f"{column}.{subcolumn}" for subcolumn in column_as_df.columns]
            df = df.drop(column, axis=1).merge(column_as_df, right_index=True, left_index=True)
        print(f"Loaded {os.path.basename(csv_path)}. Shape: {df.shape}")
        return df
    
    train = load_df()
    
    # a bare expression inside an if block is not displayed, so print it explicitly
    print(train.head())
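To make the flattening step concrete, here is a minimal sketch on a toy table with a single JSON column. The toy values are made up for illustration, and it assumes pandas 1.0 or later, where `json_normalize` is available as `pd.json_normalize`. It uses `join` rather than `merge`, but does the same index-aligned combination as the function above.

```python
import json
import pandas as pd

# Toy stand-in for the raw CSV: one column holds a JSON string per row.
raw = pd.DataFrame({
    "fullVisitorId": ["001", "002"],
    "device": ['{"browser": "Chrome", "isMobile": false}',
               '{"browser": "Safari", "isMobile": true}'],
})

# Parse each JSON string into a dict, then expand the dicts into columns.
parsed = raw["device"].apply(json.loads)
device_df = pd.json_normalize(parsed.tolist())
device_df.columns = [f"device.{sub}" for sub in device_df.columns]

# Swap the JSON column for the flattened columns, aligned on the row index.
flat = raw.drop(columns="device").join(device_df)
print(flat.columns.tolist())
```

Each key in the JSON object becomes its own column, prefixed with the name of the column it came from.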

There seem to be a few columns whose only value is 'not available in demo dataset'. Let's get rid of them, as well as any other features that only have a single value (and therefore are not useful for differentiating samples):

In [4]:
if create_cleaned and not load_small_cleaned_only:

    NA_cols = []
    for col in train.columns:
        if train[col].nunique()==1:
            NA_cols.append(col)
            
    train.drop(NA_cols, axis=1, inplace=True)

    # a bare expression inside an if block is not displayed, so print it explicitly
    print(NA_cols)
    
    # save it out as a .csv which we can read back in later
    train.to_csv('./data/train_cleaned.csv', index=False)
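One detail of the single-value check is worth seeing on a small example: `nunique()` ignores missing values by default, so a column holding one value plus NaN is also treated as constant. The sketch below uses made-up column names (`placeholder_col`, `flag_col`) purely for illustration. Whether such columns should be dropped is a judgment call, since the presence or absence of the value can itself carry information.

```python
import numpy as np
import pandas as pd

# Toy table with hypothetical column names.
toy = pd.DataFrame({
    "visitNumber": [1, 2, 3],
    "placeholder_col": ["not available in demo dataset"] * 3,
    "flag_col": [1.0, np.nan, 1.0],
})

# Default behaviour: NaN is ignored, so flag_col (1.0 or NaN) counts as single-valued.
constant_cols = [col for col in toy.columns if toy[col].nunique() == 1]

# With dropna=False, NaN counts as a value, so only truly uniform columns are found.
strictly_constant = [col for col in toy.columns
                     if toy[col].nunique(dropna=False) == 1]

print(constant_cols)
print(strictly_constant)
```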

If we decided to skip the loading and cleaning, read in the previously generated .csv instead:

In [5]:
if not create_cleaned and not load_small_cleaned_only:
    
    train = pd.read_csv('./data/train_cleaned.csv', dtype={'fullVisitorId':'str'})

Generate a sub-sampled version of the cleaned data, for faster load/computation times:

In [6]:
if create_small_cleaned and not load_small_cleaned_only:
    
    train_small = train.sample(frac=0.1, random_state=2019)
    
    train_small.to_csv('./data/train_cleaned_small.csv', index=False)
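As a quick check of what the sub-sampling does, the sketch below runs the same `sample` call on a toy table (the toy data is made up for illustration). Fixing `random_state` makes the sample reproducible, and `frac=0.1` keeps 10% of the rows. Note that rows are sampled individually, so a visitor's sessions may be only partly included.

```python
import pandas as pd

# Toy table of 100 visits.
toy = pd.DataFrame({"visitId": range(100)})

# The same seed gives the same rows on every run.
first = toy.sample(frac=0.1, random_state=2019)
second = toy.sample(frac=0.1, random_state=2019)

print(len(first))
print(first.equals(second))
```

The sampled rows keep their original index labels, which is why the data is written out with `index=False`: reading the .csv back in then gives a fresh index starting at 0.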

If we are not creating the sub-sampled data, read in the previously generated sub-sampled .csv instead:

In [7]:
if not create_small_cleaned:
    
    train_small = pd.read_csv('./data/train_cleaned_small.csv', dtype={'fullVisitorId':'str'})

## Simple exploration

In [12]:
train_small.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 170834 entries, 0 to 170833
Data columns (total 36 columns):
channelGrouping                                 170834 non-null object
customDimensions                                170834 non-null object
date                                            170834 non-null int64
fullVisitorId                                   170834 non-null object
hits                                            170834 non-null object
visitId                                         170834 non-null int64
visitNumber                                     170834 non-null int64
visitStartTime                                  170834 non-null int64
device.browser                                  170834 non-null object
device.deviceCategory                           170834 non-null object
device.isMobile                                 170834 non-null bool
device.operatingSystem                          170834 non-null object
geoNetwork.city                            

Things to look at:

* how many times do people buy things?
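
One possible way to approach the purchase-count question is sketched below on toy data. It assumes the flattened revenue column is named `totals.transactionRevenue` (it does not appear in the truncated output above) and that a missing value means no purchase on that visit. Both are assumptions to check against the real data.

```python
import numpy as np
import pandas as pd

# Toy visits; the revenue column name is an assumption about the flattened data.
toy = pd.DataFrame({
    "fullVisitorId": ["001", "001", "002", "003", "003", "003"],
    "totals.transactionRevenue": [np.nan, 1.5e7, np.nan, 2.0e7, np.nan, 3.0e7],
})

# A visit counts as a purchase when its revenue is present and positive.
toy["purchased"] = toy["totals.transactionRevenue"].fillna(0) > 0

# Number of purchasing visits per visitor.
purchases_per_visitor = toy.groupby("fullVisitorId")["purchased"].sum()

# How many visitors made 0, 1, 2, ... purchases.
purchase_counts = purchases_per_visitor.value_counts().sort_index()
print(purchase_counts)
```

On the real data, the same grouping could be applied to `train_small`, keeping in mind that the row-level sub-sample may not contain every visit of a given visitor.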