<h1>Data Cleaning of a GStore Customer Dataset</h1>
<p>The goal of this data cleaning is to prepare the dataset for data analysis and prediction. The dataset was downloaded from a Kaggle competition (https://www.kaggle.com/c/ga-customer-revenue-prediction) and contains the Google Merchandise Store's customer transactions from August 1st, 2016 to October 15th, 2018.</p>
<p>This notebook contains the steps I took to clean the dataset:</p>
<ol>
    <li>Loading the Dataset and Flattening JSON Columns</li>
    <li>Examining the columns</li>
    <li>Investigating the values</li>
    <ol>
        <li>Finding columns having null values</li>
        <li>Replacing null values with 0</li>
        <li>Replacing '(not set)', '(none)', '(not provided)' with '(missing)'</li>
        <li>Dropping columns having uniform value</li>
    </ol>
    <li>Fixing columns data type</li>
</ol>
<p>Reference of Google Analytics Schema: https://support.google.com/analytics/answer/3437719?hl=en</p>

In [1]:
import pandas as pd
import numpy as np
import json
import os
from pandas import json_normalize  # pandas >= 1.0; older releases expose it as pandas.io.json.json_normalize
from pandas.api.types import is_string_dtype
from pandas.api.types import is_int64_dtype
from pandas.api.types import is_float_dtype
from pandas.api.types import is_object_dtype
from pandas.api.types import is_bool_dtype
import math

<h2>1. Loading the Dataset and Flattening JSON Columns</h2>
<p>A Google Analytics customer transaction has six RECORD-type fields: 'totals', 'trafficSource', 'device', 'customDimensions', 'geoNetwork', and 'hits'. Because 'geoNetwork.subContinent' and 'totals.hits' can be used as substitutes for 'customDimensions' and 'hits', these two RECORDs are not flattened in the following step.</p>
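<p>As a minimal sketch of what flattening a RECORD column means, the example below expands one JSON-encoded column of a toy dataframe into prefixed sub-columns. The rows and the names <code>raw</code>, <code>flat</code>, and <code>toy_flat</code> are made up for illustration and are not part of the dataset or of this notebook's loader.</p>

```python
import json
import pandas as pd

# Toy rows (made up for illustration) with one JSON-encoded column
raw = pd.DataFrame({
    "fullVisitorId": ["0001", "0002"],
    "device": ['{"browser": "Chrome", "isMobile": false}',
               '{"browser": "Safari", "isMobile": true}'],
})

# Parse each JSON string into a dict, then expand the dicts into columns
parsed = raw["device"].apply(json.loads)
flat = pd.json_normalize(parsed.tolist())

# Prefix each sub-column with the parent column name
flat.columns = [f"device.{name}" for name in flat.columns]

# Both frames share the default 0..n-1 index, so a column-wise concat lines up row by row
toy_flat = pd.concat([raw.drop(columns="device"), flat], axis=1)
print(toy_flat.columns.tolist())
```

<p>The parent prefix keeps sub-columns with the same name in different RECORDs from colliding after flattening.</p>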

In [2]:
#Train dataset file's path
df_read_path = "../data/train_v2_small.csv"

In [3]:
"""
Function load_df(csv_path, nrows)
    This function reads a dataset, flattens the JSON columns listed in JSON_COLUMNS, and 
    constructs a dataframe in which each flattened column is named '<parent>.<subcolumn>'. 
    The parent prefix prevents duplicate column names after JSON flattening.
Input:
    1. csv_path: the path of the dataset file to load.
    2. nrows: number of rows to read during the data loading (None reads all rows).
Output:
    A dataframe of the dataset.

"""
def load_df(csv_path=df_read_path, nrows=None):
    JSON_COLUMNS = ['totals', 'trafficSource', 'device', 'geoNetwork']
    df = pd.read_csv(csv_path,
                     converters = {column: json.loads for column in JSON_COLUMNS}, 
                     dtype= {'fullVisitorId': 'str', 'date': 'str'}, # Keep IDs and dates as strings so they are not parsed as numbers
                     nrows=nrows)
    
    for column in JSON_COLUMNS:
        column_as_df = json_normalize(df[column])
        column_as_df.columns = [f"{column}.{subcolumn}" for subcolumn in column_as_df.columns]
        df = df.drop(column, axis=1).merge(column_as_df, right_index=True, left_index=True)
    print(f"Loaded {os.path.basename(csv_path)}. Shape: {df.shape}")
    return df
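
<p>The <code>dtype</code> argument in the loader matters because a numeric-looking identifier is otherwise parsed as a number. The sketch below shows the difference on a one-row CSV held in memory; the ID value and the names <code>csv_text</code>, <code>as_default</code>, and <code>as_text</code> are made up for illustration.</p>

```python
import io
import pandas as pd

# One made-up row; the ID deliberately starts with zeros
csv_text = "fullVisitorId,date\n0012345,20171016\n"

# Default parsing infers integers and silently drops the leading zeros
as_default = pd.read_csv(io.StringIO(csv_text))

# Forcing string dtype keeps the identifier exactly as written
as_text = pd.read_csv(io.StringIO(csv_text),
                      dtype={"fullVisitorId": "str", "date": "str"})

print(as_default.loc[0, "fullVisitorId"])
print(as_text.loc[0, "fullVisitorId"])
```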

In [4]:
#read train_v2_small.csv and flatten json columns
df = load_df(csv_path=df_read_path, nrows=2000)

Loaded train_v2_small.csv. Shape: (2000, 59)


In [5]:
#drop the 'hits' and 'customDimensions' columns, which are not flattened
df.drop(['hits', 'customDimensions'], axis='columns', inplace = True)

<h2>2. Examining the columns</h2>

In [6]:
"""
Function find_columns_datatype(dataframe, datatype='object')
    This function finds all columns in the dataframe that have the requested data type.
Input:
    dataframe: the dataframe to inspect.
    datatype: one of 'int', 'float', 'bool', 'string', or 'object' (the default).
Output:
    a list of the names of the columns having the requested datatype.
"""
def find_columns_datatype(dataframe, datatype='object'):
    # Map each supported datatype name to its pandas type-check function
    dtype_checks = {
        'int': is_int64_dtype,
        'float': is_float_dtype,
        'bool': is_bool_dtype,
        'string': is_string_dtype,
        'object': is_object_dtype,
    }
    check = dtype_checks.get(datatype)
    # An unrecognized datatype name yields an empty list
    cols = [col for col in dataframe.columns
            if check is not None and check(dataframe[col])]
    print(f'{len(cols)} columns have {datatype} : {cols}')
    return cols

In [7]:
bool_cols = find_columns_datatype(df, 'bool')

1 columns have bool : ['device.isMobile']


In [8]:
int_cols = find_columns_datatype(df, 'int')

3 columns have int : ['visitId', 'visitNumber', 'visitStartTime']


In [9]:
float_cols = find_columns_datatype(df, 'float')

0 columns have float : []


In [10]:
obj_cols = find_columns_datatype(df)

53 columns have object : ['channelGrouping', 'date', 'fullVisitorId', 'socialEngagementType', 'totals.visits', 'totals.hits', 'totals.pageviews', 'totals.bounces', 'totals.newVisits', 'totals.sessionQualityDim', 'totals.timeOnSite', 'totals.transactions', 'totals.transactionRevenue', 'totals.totalTransactionRevenue', 'trafficSource.campaign', 'trafficSource.source', 'trafficSource.medium', 'trafficSource.keyword', 'trafficSource.adwordsClickInfo.criteriaParameters', 'trafficSource.referralPath', 'trafficSource.isTrueDirect', 'trafficSource.adContent', 'trafficSource.adwordsClickInfo.page', 'trafficSource.adwordsClickInfo.slot', 'trafficSource.adwordsClickInfo.gclId', 'trafficSource.adwordsClickInfo.adNetworkType', 'trafficSource.adwordsClickInfo.isVideoAd', 'device.browser', 'device.browserVersion', 'device.browserSize', 'device.operatingSystem', 'device.operatingSystemVersion', 'device.mobileDeviceBranding', 'device.mobileDeviceModel', 'device.mobileInputSelector', 'device.mobileDevic
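
<p>For comparison, pandas also offers <code>DataFrame.select_dtypes</code>, which groups columns by dtype without a helper function. This is an alternative technique, not what the notebook uses. The toy dataframe and its values below are made up for illustration.</p>

```python
import pandas as pd

# Toy frame (made up) with explicit dtypes so the result does not depend on platform defaults
toy = pd.DataFrame({
    "visitNumber": pd.Series([1, 2], dtype="int64"),
    "device.isMobile": pd.Series([True, False], dtype="bool"),
    "channelGrouping": pd.Series(["Direct", "Social"], dtype="object"),
})

# select_dtypes returns a sub-frame; its column labels are the matching names
bool_like = toy.select_dtypes(include="bool").columns.tolist()
int_like = toy.select_dtypes(include="int64").columns.tolist()
object_like = toy.select_dtypes(include="object").columns.tolist()

print(bool_like, int_like, object_like)
```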

<h2>3. Investigating the values</h2>

<h3>A. Finding columns having null values</h3>

In [11]:
"""
Function find_cols_missing_values(dataframe)
    This function finds columns that have at least one null value in it.
Input: 
    dataframe: the name of the dataframe
Output:
    list of columns having at least one null value, its total missing values, 
    and percentage of missing value.
"""
def find_cols_missing_values(dataframe):
    total_sum = dataframe.isnull().sum()
    total_count = dataframe.isnull().count()
    
    total = total_sum.sort_values(ascending=False)
    percent = (total_sum / total_count * 100).sort_values(ascending=False)

    total_percent = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
    print("Columns having at least one null values: ")
    print(total_percent[~(total_percent['Total'] == 0)]) #print only columns having at least 1 null values
    return

In [12]:
#Find columns having null values
find_cols_missing_values(df)

Columns having at least one null values: 
                                              Total  Percent
totals.totalTransactionRevenue                 1976    98.80
totals.transactionRevenue                      1976    98.80
totals.transactions                            1976    98.80
trafficSource.adContent                        1776    88.80
trafficSource.adwordsClickInfo.adNetworkType   1753    87.65
trafficSource.adwordsClickInfo.gclId           1753    87.65
trafficSource.adwordsClickInfo.isVideoAd       1753    87.65
trafficSource.adwordsClickInfo.page            1753    87.65
trafficSource.adwordsClickInfo.slot            1753    87.65
trafficSource.referralPath                     1444    72.20
trafficSource.isTrueDirect                     1428    71.40
totals.bounces                                 1023    51.15
totals.timeOnSite                               978    48.90
trafficSource.keyword                           839    41.95
totals.newVisits                           
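
<p>The same kind of summary can be sketched more compactly, assuming only that the mean of a boolean null mask equals the fraction of nulls. The toy dataframe and the names <code>null_total</code> and <code>summary</code> are made up for illustration.</p>

```python
import numpy as np
import pandas as pd

# Toy frame (made up) with a different number of nulls per column
toy = pd.DataFrame({
    "totals.hits": ["1", "4", "2", "7"],
    "totals.bounces": ["1", np.nan, np.nan, "1"],
    "totals.transactions": [np.nan, np.nan, np.nan, "1"],
})

null_total = toy.isnull().sum()
summary = pd.DataFrame({
    "Total": null_total,
    "Percent": toy.isnull().mean() * 100,  # mean of a boolean mask = share of nulls
})

# Keep only columns with at least one null, most affected first
summary = summary[summary["Total"] > 0].sort_values("Total", ascending=False)
print(summary)
```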

<p>
    Based on the value frequencies of the columns, three columns have a boolean data type: 'device.isMobile', 'trafficSource.isTrueDirect', and 'trafficSource.adwordsClickInfo.isVideoAd'. Two of them have a significant number of missing values, as do several numeric columns. According to the <a href="https://support.google.com/analytics/answer/3437719?hl=en">Google Analytics Schema</a>, these missing values can be replaced with the default values listed below.
</p>
<ol>
    <li>'trafficSource.isTrueDirect': missing data can be filled with False, indicating that the source is not a direct URL.</li>
    <li>'trafficSource.adwordsClickInfo.isVideoAd': missing data can be filled with True, indicating that the video ad is not a YouTube TrueView ad.</li>
    <li>'totals.pageviews': missing data can be filled with 0, indicating that there is no pageview in the session.</li>
    <li>'totals.bounces': missing data can be filled with 0, indicating that the session is not bounced.</li>
    <li>'totals.newVisits': missing data can be filled with 0, indicating that it is a returning customer.</li>
    <li>'totals.sessionQualityDim': missing data should be filled with 0, indicating that the Session Quality is not calculated for the selected time range.</li>
    <li>'totals.timeOnSite': missing data can be filled with 0, indicating that 0 seconds were spent on site in the session.</li>
    <li>'totals.transactions': missing data should be filled with 0, indicating that there is no transaction during the session.</li>
    <li>'totals.transactionRevenue': missing data should be filled with 0, indicating that there is no transaction during the session.</li>
    <li>'totals.totalTransactionRevenue': missing data should be filled with 0, indicating that there is no transaction during the session.</li>
    <li>'trafficSource.adwordsClickInfo.page': missing data can be filled with 0, indicating that there is no ad in the search results.</li>
</ol>
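
<p>A per-column fill plan like the list above can be sketched as a dictionary passed to <code>DataFrame.fillna</code>. The toy rows and the names <code>fill_values</code> and <code>filled</code> are made up for illustration, and only three of the listed columns are shown.</p>

```python
import numpy as np
import pandas as pd

# Toy frame (made up); object dtype mirrors columns that come out of JSON flattening
toy = pd.DataFrame({
    "trafficSource.isTrueDirect": [True, np.nan],
    "totals.bounces": ["1", np.nan],
    "totals.transactionRevenue": [np.nan, "15990000"],
}, dtype=object)

# One fill value per column, following the list above
fill_values = {
    "trafficSource.isTrueDirect": False,
    "totals.bounces": 0,
    "totals.transactionRevenue": 0,
}

# fillna with a dict fills each named column with its own value
filled = toy.fillna(value=fill_values)
print(filled)
```

<p>Keeping the plan in one dictionary makes the chosen default for each column easy to review in a single place.</p>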

<h3>B. Replacing null values with 0</h3>

In [13]:
#Fill null values with False
df['trafficSource.isTrueDirect'] = df['trafficSource.isTrueDirect'].fillna(False)

#Fill null values with True
df['trafficSource.adwordsClickInfo.isVideoAd'] = df['trafficSource.adwordsClickInfo.isVideoAd'].fillna(True)

In [14]:
"""
Function fill_null_zero(dataframe, list_cols) 
    This function replaces null values in the given columns with 0.
Input: 
    dataframe: name of dataframe
    list_cols: list of columns to fill
Output: None; the dataframe is modified in place.
"""
def fill_null_zero(dataframe, list_cols):
    for col in list_cols:
        # Assign the result back instead of calling fillna(inplace=True) on a column selection
        dataframe[col] = dataframe[col].fillna(0)

In [15]:
#Fill null values with 0
num_cols = ['totals.transactionRevenue','totals.totalTransactionRevenue','totals.transactions','trafficSource.adwordsClickInfo.page','totals.timeOnSite','totals.sessionQualityDim','totals.pageviews','totals.bounces', 'totals.newVisits']
fill_null_zero(df, num_cols)

<h3>C. Replace '(not set)', '(none)', '(not provided)' with '(missing)'</h3>

In [18]:
"""
Function replace_values_missing(dataframe, list_cols, word)
    This function replaces the word found in list_cols, as well as any 
    remaining null values in those columns, with '(missing)'.
Input:
    dataframe: the name of the dataframe
    list_cols: list of columns to look at
    word: the string to replace with '(missing)'
Output:
    the same dataframe, modified in place.
"""
def replace_values_missing(dataframe, list_cols, word):
    for col in list_cols:
        # Mark occurrences of the placeholder word as null
        dataframe.loc[dataframe[col] == word, col] = np.nan
        # Fill every null in the column, including pre-existing ones
        dataframe[col] = dataframe[col].fillna('(missing)')
    return dataframe

In [None]:
#Replace '(not set)', '(none)', '(not provided)' and null values with '(missing)' in all object type columns
replace_values_missing(df, obj_cols, '(not set)')
replace_values_missing(df, obj_cols, '(none)')
replace_values_missing(df, obj_cols, '(not provided)')
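
<p>As a hedged alternative to calling the helper once per placeholder, <code>DataFrame.replace</code> accepts a list of values to replace in one pass. The sketch below runs on a made-up toy dataframe; <code>placeholders</code> and <code>cleaned</code> are illustrative names.</p>

```python
import pandas as pd

# Toy frame (made up) containing placeholders and one real null
toy = pd.DataFrame({
    "trafficSource.keyword": ["(not provided)", "google store", None],
    "geoNetwork.city": ["(not set)", "Seattle", "(not set)"],
}, dtype=object)

placeholders = ["(not set)", "(none)", "(not provided)"]

# Replace all placeholder strings at once, then fill the remaining nulls
cleaned = toy.replace(placeholders, "(missing)").fillna("(missing)")
print(cleaned)
```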

In [20]:
#Check again if there is any null value in df
find_cols_missing_values(df)

Columns having at least one null values: 
Empty DataFrame
Columns: [Total, Percent]
Index: []


<h3>D. Dropping columns having uniform value</h3>
<p>A column that has the same value in every row does not contribute to the variation in the dependent variables (transaction and revenue), so it can be removed from the dataframe.</p>

In [21]:
"""
Function find_uniform_column(dataframe, list_cols)
    This function finds columns having the same value throughout the column.
Input:
    dataframe: the name of the dataframe
    list_cols: the list of columns to inspect
Output:
    list of columns having uniform value in it.
"""
def find_uniform_column(dataframe, list_cols):
    uniform_cols = []
    for col in list_cols:
        if (dataframe[col].nunique(dropna = False) == 1):
            uniform_cols.append(col)
    return uniform_cols

In [22]:
#Find the columns having the same value throughout the column
uni_bool_cols = find_uniform_column(df, bool_cols)
print(uni_bool_cols)
uni_int_cols = find_uniform_column(df, int_cols)
print(uni_int_cols)
uni_obj_cols = find_uniform_column(df, obj_cols)
print(uni_obj_cols)

[]
[]
['date', 'socialEngagementType', 'totals.visits', 'trafficSource.adwordsClickInfo.criteriaParameters', 'device.browserVersion', 'device.browserSize', 'device.operatingSystemVersion', 'device.mobileDeviceBranding', 'device.mobileDeviceModel', 'device.mobileInputSelector', 'device.mobileDeviceInfo', 'device.mobileDeviceMarketingName', 'device.flashVersion', 'device.language', 'device.screenColors', 'device.screenResolution', 'geoNetwork.cityId', 'geoNetwork.latitude', 'geoNetwork.longitude', 'geoNetwork.networkLocation']


In [23]:
#Drop columns with uniform value
df.drop(uni_obj_cols, axis='columns', inplace = True)

In [24]:
#The size of df 
print(df.shape)

(2000, 37)
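
<p>The uniform-column check can also be sketched without a loop, since <code>DataFrame.nunique</code> returns one distinct-value count per column. The toy dataframe and the names <code>uniform</code> and <code>trimmed</code> are made up for illustration.</p>

```python
import pandas as pd

# Toy frame (made up): two columns are constant, one varies
toy = pd.DataFrame({
    "socialEngagementType": ["Not Socially Engaged"] * 3,
    "device.browser": ["Chrome", "Safari", "Chrome"],
    "totals.visits": ["1", "1", "1"],
})

# dropna=False counts a null as its own value, so an all-null column is also uniform
uniform = toy.columns[toy.nunique(dropna=False) == 1].tolist()
trimmed = toy.drop(columns=uniform)

print(uniform)
print(trimmed.shape)
```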


<h2>4. Fixing columns data type</h2>

<p>Columns with a limited number of distinct values can be given the categorical data type. The number after each column name is its count of distinct values.</p>
<ol>
<li>channelGrouping : 8</li>
<li>trafficSource.campaign : 33</li>
<li>trafficSource.medium : 7</li>
<li>trafficSource.adwordsClickInfo.slot : 4 (including 1633063 'missing')</li>
<li>trafficSource.source : 345</li>
<li>device.operatingSystem : 24 </li>
<li>device.deviceCategory : 3</li>
<li>geoNetwork.continent : 6</li>
<li>geoNetwork.subContinent : 23</li>
<li>geoNetwork.country : 228</li>
<li>geoNetwork.region : 483</li>
<li>geoNetwork.metro : 123</li>
<li>geoNetwork.city : 956</li>
<li>device.browser : 129</li>
<li>trafficSource.adContent : 77 (including 1643600 'missing')</li>
<li>trafficSource.adwordsClickInfo.adNetworkType : 4 (including 1633063 'missing')</li>
</ol>
<p>The rest of the columns keep the string type:</p>
<ol>
<li>date</li>
<li>fullVisitorId</li>
<li>trafficSource.adwordsClickInfo.gclId</li> 
<li>trafficSource.keyword : 4547 (including 1052780 'missing')</li>
<li>trafficSource.referralPath : 3197 (including 1142073 'missing')</li>
<li>geoNetwork.networkDomain : 41982</li>
</ol>
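
<p>One practical reason to prefer the categorical type for low-cardinality columns is memory use. The sketch below compares the two representations on a made-up series with three distinct values; the exact byte counts depend on the pandas version, so only their order is relied on.</p>

```python
import pandas as pd

# 3,000 made-up values drawn from only three distinct strings
values = ["desktop", "mobile", "tablet"] * 1000

as_object = pd.Series(values, dtype="object")
as_category = as_object.astype("category")

# deep=True includes the memory held by the Python string objects
object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(object_bytes, category_bytes)
print(list(as_category.cat.categories))
```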

In [None]:
#Check again the data type of all columns 
int_cols = find_columns_datatype(df, 'int')
bool_cols = find_columns_datatype(df, 'bool')
obj_cols = find_columns_datatype(df)

In [26]:
#Convert the 3 boolean and 16 object type columns to category type
str_cat_cols= ['device.isMobile', 'trafficSource.isTrueDirect', 'trafficSource.adwordsClickInfo.isVideoAd', 'channelGrouping', 'trafficSource.campaign', 'trafficSource.medium', 'trafficSource.adwordsClickInfo.slot', 'trafficSource.source', 'device.operatingSystem', 'device.deviceCategory', 'geoNetwork.continent', 'geoNetwork.subContinent', 'geoNetwork.country', 'geoNetwork.region', 'geoNetwork.metro', 'geoNetwork.city', 'device.browser', 'trafficSource.adContent', 'trafficSource.adwordsClickInfo.adNetworkType']
for col in str_cat_cols:
    df[col] = df[col].astype('category')

In [27]:
#Convert 10 object type columns to integer type
str_int_cols = ['totals.hits', 'totals.pageviews', 'totals.bounces', 'totals.newVisits', 'totals.sessionQualityDim', 'totals.timeOnSite', 'totals.transactions', 'totals.transactionRevenue', 'totals.totalTransactionRevenue', 'trafficSource.adwordsClickInfo.page']
for col in str_int_cols:
    df[col] = df[col].astype('int64')
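
<p>The integer conversion above works when every value is a digit string or a number, which holds after the null filling in section 3. The sketch below shows that case, and, as an assumption beyond this notebook, how <code>pd.to_numeric</code> with <code>errors="coerce"</code> could handle a column that still contains a non-numeric string. All values are made up.</p>

```python
import pandas as pd

# Mixed digit strings and a filled-in 0, as produced by the null-filling step
raw = pd.Series(["15990000", 0, "1"], dtype="object")
as_int = raw.astype("int64")

# Hypothetical messy column: "n/a" cannot be parsed as a number
messy = pd.Series(["12", "n/a", 3], dtype="object")

# errors="coerce" turns unparseable entries into NaN, which can then be filled
coerced = pd.to_numeric(messy, errors="coerce").fillna(0).astype("int64")

print(as_int.tolist())
print(coerced.tolist())
```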

In [None]:
df.info()

In [29]:
# save to a file for next use
df_write_path = "../data/train_small_clean.csv"

df.to_csv(df_write_path, index=False)
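
<p>CSV files do not store data types, so the string and category types set above have to be declared again when the cleaned file is read back. The sketch below round-trips a made-up one-row dataframe through an in-memory buffer instead of a file; the names <code>toy</code>, <code>buffer</code>, and <code>reloaded</code> are illustrative.</p>

```python
import io
import pandas as pd

# Toy cleaned frame (made up) with one string ID and one categorical column
toy = pd.DataFrame({
    "fullVisitorId": ["0012345"],
    "device.deviceCategory": pd.Categorical(["desktop"]),
})

# Write to an in-memory buffer in place of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# Declare the intended dtypes again on load
reloaded = pd.read_csv(buffer, dtype={"fullVisitorId": "str",
                                      "device.deviceCategory": "category"})
print(reloaded.dtypes)
```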