### Being new to Google Analytics, I decided to analyse the Google Analytics Customer Revenue Prediction dataset available here on Kaggle. Most of the work here comes from tutorials and snippets that I found on Kaggle or elsewhere, and each source is linked where it is used.

As per the source of the dataset, the prediction target is the natural log of the sum of all transactions per user, and the resulting model is then applied to the test set. The formula can be understood as:

$$y_{user} = \sum_{i=1}^{n} transaction_{user_i} $$
$$target_{user} = \ln({y_{user} + 1})$$
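
To make the formula concrete, here is a minimal sketch on a made-up table. The column names and values are hypothetical and only illustrate the two steps: sum per user, then take the log of the sum plus one.

```python
import numpy as np
import pandas as pd

# Toy transactions (hypothetical values); NaN means the visit produced no revenue.
toy = pd.DataFrame({
    "fullVisitorId": ["a", "a", "b", "c"],
    "transactionRevenue": [10.0, 5.0, np.nan, 0.0],
})

# y_user: sum of all transactions per user (NaN is skipped, so user "b" sums to 0).
y_user = toy.groupby("fullVisitorId")["transactionRevenue"].sum()

# target_user = ln(y_user + 1), which np.log1p computes directly.
target_user = np.log1p(y_user)
print(target_user)
```

Adding 1 before taking the log keeps users with zero revenue at a target of exactly 0 instead of negative infinity.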

In [None]:
import os 
import pandas as pd
import numpy as np
import random
import json
from pandas import json_normalize  # pandas >= 1.0; the old pandas.io.json import path was deprecated and later removed

print(os.listdir("../input"))
print(os.listdir("../input/ga-customer-revenue-prediction/"))

There are 6 csvs in total, but from the looks of it, the main datasets that we will likely be using are the last two csvs.

In [None]:
train_df = pd.read_csv("../input/ga-customer-revenue-prediction/train.csv", low_memory = False)
train_df.head()

From the output above, we can tell that some columns come in the form of dictionaries, or more precisely, JSON. I am not a big fan of JSON and have always struggled with it, so I went looking and found a function that flattens these columns. [source][1]

[1]: https://www.kaggle.com/julian3833/1-quick-start-read-csv-and-flatten-json-fields/notebook

All hail the kaggle masters who share their code for free. =D
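
Before the full loader, here is a toy illustration of what "flattening" means. The two-row frame below is made up, and the sketch assumes pandas 1.0 or newer, where `pd.json_normalize` is available at the top level.

```python
import json
import pandas as pd

# Hypothetical two-row sample mimicking one JSON column of the raw csv.
raw = pd.DataFrame({
    "device": ['{"browser": "Chrome", "isMobile": false}',
               '{"browser": "Safari", "isMobile": true}'],
})

# Parse each JSON string into a dict, then expand the dicts into columns.
parsed = raw["device"].apply(json.loads)
flat = pd.json_normalize(parsed.tolist())

# Prefix the new columns with the original column name, e.g. "device.browser".
flat.columns = [f"device.{sub}" for sub in flat.columns]
print(flat)
```

Each key inside the JSON becomes its own column, which is why the flattened training frame ends up with names such as `device.browser`.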

In [None]:
# Function that flattens the json format columns, namely device, geoNetwork, totals & trafficSource
def load_df(file_path, nrows=None):
    JSON_COLUMNS = ['device', 'geoNetwork', 'totals', 'trafficSource']
    
    df = pd.read_csv(file_path, 
                     converters={column: json.loads for column in JSON_COLUMNS}, 
                     dtype={'fullVisitorId': 'str'}, # Important!!
                     nrows=nrows)
    
    for column in JSON_COLUMNS:
        column_as_df = json_normalize(df[column]) # Breaks down the JSON-formatted column into separate columns
        column_as_df.columns = [f"{column}.{subcolumn}" for subcolumn in column_as_df.columns]
        df = df.drop(column, axis=1).merge(column_as_df, right_index=True, left_index=True)
    print(f"Loaded {os.path.basename(file_path)}. Shape: {df.shape}")
    return df

In [None]:
train_df = load_df(file_path = "../input/ga-customer-revenue-prediction/train.csv")

In [None]:
pd.options.display.max_columns = 999

In [None]:
train_df.head()

### What the columns tell us:

1. channelGrouping: The channel that directed the user to the store.
2. date: Date the user visited the store.
3. fullVisitorId: Unique ID for each user.
4. sessionId: Unique identifier for each visit per user to the store. The structure of the ID can be understood as: fullVisitorId + _ + visitId
5. socialEngagementType: Engagement type, either socially engaged or not socially engaged, but all rows here are actually not socially engaged.
6. visitId: Unique visit identifier.
7. visitStartTime: Timestamp in POSIX format.
8. geoNetwork: Geographical location of the visit.
9. trafficSource: Information about the traffic source from which the session originated.
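
The sessionId structure described in the list can be sketched on made-up IDs. The values below are hypothetical. They also hint at why the loader reads fullVisitorId as a string: parsed as a number, an ID with leading zeros would lose them.

```python
import pandas as pd

# Hypothetical IDs; fullVisitorId is kept as a string so leading zeros survive.
toy = pd.DataFrame({
    "fullVisitorId": ["0012345", "987654"],
    "visitId": [1472830385, 1472880147],
})

# sessionId = fullVisitorId + "_" + visitId
toy["sessionId"] = toy["fullVisitorId"] + "_" + toy["visitId"].astype(str)
print(toy["sessionId"].tolist())
```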

In [None]:
train_df.describe()

In [None]:
train_df.columns.tolist()

### Check column data types. 

In [None]:
train_df.dtypes

In [None]:
print("Channel Grouping:", train_df.channelGrouping.unique().tolist())
print("Social Engagement Type:", train_df.socialEngagementType.unique().tolist())
print("Number of unique visit IDs:", train_df.visitId.nunique())
print("Browser used:", train_df["device.browser"].unique())

### Dropping constant values

In [None]:
const_cols = [col for col in train_df.columns if train_df[col].nunique(dropna = False) == 1]
const_cols
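
The cell above only lists the constant columns. A minimal sketch of actually dropping them is shown below on a made-up frame. The frame and its column names are hypothetical, and on the real data the same `drop` call would be applied to `train_df`.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: "flag" and "empty" each hold a single distinct value.
toy = pd.DataFrame({
    "visits": [1, 2, 3],
    "flag": ["x", "x", "x"],
    "empty": [np.nan, np.nan, np.nan],
})

# Same test as above: with dropna=False, an all-NaN column counts as one unique value.
const_cols = [col for col in toy.columns if toy[col].nunique(dropna=False) == 1]

# Drop the constant columns, since they cannot help distinguish one row from another.
reduced = toy.drop(columns=const_cols)
print(const_cols, reduced.columns.tolist())
```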

## Exploratory Data Analysis

### As we are attempting to predict the total revenue per user, it would be good to look into the distribution of the revenue per visitor we have in the training data.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
# Turning the column of transaction Revenue into float instead of object. We want it to be numeric
train_df["totals.transactionRevenue"] = train_df["totals.transactionRevenue"].astype(float)

# Grouping the revenue by visitor ID
grouped_by_vis_id = train_df.groupby("fullVisitorId")["totals.transactionRevenue"].sum().reset_index()

grouped_by_vis_id.head()

In [None]:
grouped_by_vis_id.shape[0]

In [None]:
plt.figure(figsize = (8, 6))
sns.scatterplot(data = grouped_by_vis_id, x = range(grouped_by_vis_id.shape[0]), y = np.sort(np.log1p(grouped_by_vis_id["totals.transactionRevenue"].values)))
plt.title("Seaborn Way")
plt.ylabel("log1p(Transaction Revenue)", fontsize = 14)

plt.figure(figsize = (8, 6))
plt.scatter(x = range(grouped_by_vis_id.shape[0]), y = np.sort(np.log1p(grouped_by_vis_id["totals.transactionRevenue"].values)))
plt.title("Matplotlib way")
plt.ylabel("log1p(Transaction Revenue)", fontsize = 14)

The outputs above further support the 80/20 rule: only a small percentage of customers produce most of the revenue, as noted in the overview from the data source.
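
One way to put a number on the 80/20 idea is to compute the share of revenue held by the top 20% of customers. The sketch below uses made-up revenue totals, so the printed share says nothing about the real data. On the notebook's data the same steps would run on the summed revenue column of `grouped_by_vis_id`.

```python
import numpy as np

# Hypothetical per-customer revenue totals, mostly zero as in the training data.
revenue = np.array([0, 0, 0, 0, 0, 0, 0, 5.0, 15.0, 80.0])

# Sort descending and keep the top 20% of customers (at least one).
sorted_rev = np.sort(revenue)[::-1]
top_n = max(1, len(sorted_rev) // 5)

# Share of total revenue produced by that top group.
top_share = sorted_rev[:top_n].sum() / sorted_rev.sum()
print(f"Top {top_n} of {len(sorted_rev)} customers hold {top_share:.0%} of revenue")
```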

### Obtaining the ratio/percentage numerically

In [None]:
non_zero_cases = pd.notnull(train_df["totals.transactionRevenue"]).sum()
print("Non zero transactions:", non_zero_cases)
print("Percentage of non-zero transactions out of entire dataset:", non_zero_cases / train_df.shape[0])

revenue_generating_customers = (grouped_by_vis_id["totals.transactionRevenue"] > 0).sum()
print("Unique customers with non-zero transactions:", revenue_generating_customers)
print("Percentage of UNIQUE revenue generating customers out of entire dataset:", revenue_generating_customers / grouped_by_vis_id.shape[0]) # divided by the number of rows in grouped_by_vis_id because we want the UNIQUE customers

The following plots are basically a copy of SRK's version, which I find really well executed. It can be found [here][1].

[1]: https://www.kaggle.com/sudalairajkumar/simple-exploration-baseline-ga-customer-revenue

### Device Info

In [None]:
import plotly.subplots
from plotly import tools
import plotly.offline as py
py.init_notebook_mode(connected = True)
import plotly.graph_objs as go

Below is a use of Plotly, a Python library for creating interactive web-based visualizations. This is the first time I have used Plotly and, as mentioned, the work is just a spin-off of SRK's work (basically the same stuff, to be honest).
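
The charts in the rest of this notebook all start from the same groupby aggregation with `size`, `count` and `mean`. A small sketch on made-up visits shows what each of the three columns means. The rows are hypothetical, and the column names simply mirror the ones used in the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical visits; NaN revenue marks a visit without a purchase.
toy = pd.DataFrame({
    "device.browser": ["Chrome", "Chrome", "Chrome", "Safari", "Safari"],
    "totals.transactionRevenue": [np.nan, 30.0, 10.0, np.nan, np.nan],
})

# "size" counts every row, "count" only non-NaN rows, "mean" averages non-NaN rows.
cnt_srs = toy.groupby("device.browser")["totals.transactionRevenue"].agg(["size", "count", "mean"])
cnt_srs.columns = ["count", "count_of_non_zero_revenue", "mean"]
print(cnt_srs)
```

Because NaN rows are skipped, the `mean` column is the average revenue over purchasing visits only, not over all visits, which is worth keeping in mind when reading the "Mean Revenue" panels.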

In [None]:
def horizontal_bar_chart(cnt_srs, color):
    trace = go.Bar(
        y = cnt_srs.index[::-1], #This will turn the sequence of index upside down, i.e. last becomes first and vice versa.
        x = cnt_srs.values[::-1],
        showlegend = False,
        orientation = "h",
        marker = dict(
            color = color))
    return trace

In [None]:
# Device Browser
cnt_srs = train_df.groupby("device.browser")["totals.transactionRevenue"].agg(["size", "count", "mean"])
cnt_srs.columns = ["count", "count_of_non_zero_revenue", "mean"]
cnt_srs = cnt_srs.sort_values(by = "count", ascending = False)
#cnt_srs.head()
#cnt_srs.index[::-1]
trace1 = horizontal_bar_chart(cnt_srs["count"].head(10), "rgba(242, 160, 160, 1)")
trace2 = horizontal_bar_chart(cnt_srs["count_of_non_zero_revenue"].head(10), "rgba(156, 240, 198, 1)")
trace3 = horizontal_bar_chart(cnt_srs["mean"].head(10), "rgba(115, 228, 246, 1)")


# Device Category
cnt_srs = train_df.groupby("device.deviceCategory")["totals.transactionRevenue"].agg(["size", "count", "mean"])
cnt_srs.columns = ["count", "count_of_non_zero_revenue", "mean"]
cnt_srs = cnt_srs.sort_values(by = "count", ascending = False)
trace4 = horizontal_bar_chart(cnt_srs["count"].head(10), "rgba(242, 160, 160, 1)")
trace5 = horizontal_bar_chart(cnt_srs["count_of_non_zero_revenue"].head(10), "rgba(156, 240, 198, 1)")
trace6 = horizontal_bar_chart(cnt_srs["mean"].head(10), "rgba(115, 228, 246, 1)")

# Device Operating System
cnt_srs = train_df.groupby("device.operatingSystem")["totals.transactionRevenue"].agg(["size", "count", "mean"])
cnt_srs.columns = ["count", "count_of_non_zero_revenue", "mean"]
cnt_srs = cnt_srs.sort_values(by = "count", ascending = False)
trace7 = horizontal_bar_chart(cnt_srs["count"].head(10), "rgba(242, 160, 160, 1)")
trace8 = horizontal_bar_chart(cnt_srs["count_of_non_zero_revenue"].head(10), "rgba(156, 240, 198, 1)")
trace9 = horizontal_bar_chart(cnt_srs["mean"].head(10), "rgba(115, 228, 246, 1)")



fig = plotly.subplots.make_subplots(rows = 3, cols = 3, vertical_spacing = 0.04, subplot_titles = ["Device Browser - Count", 
                                                                                         "Device Browser - Non-zero Revenue Count", 
                                                                                         "Device Browser - Mean Revenue",
                                                                                         "Device Category - Count", 
                                                                                         "Device Category - Non-zero Revenue Count", 
                                                                                         "Device Category - Mean Revenue", 
                                                                                         "Device OS - Count", 
                                                                                         "Device OS - Non-zero Revenue Count", 
                                                                                         "Device OS - Mean Revenue"])

fig.append_trace(trace1, 1, 1)
fig.append_trace(trace2, 1, 2)
fig.append_trace(trace3, 1, 3)
fig.append_trace(trace4, 2, 1)
fig.append_trace(trace5, 2, 2)
fig.append_trace(trace6, 2, 3)
fig.append_trace(trace7, 3, 1)
fig.append_trace(trace8, 3, 2)
fig.append_trace(trace9, 3, 3)

fig['layout'].update(height=1200, width=1200, paper_bgcolor='rgb(233, 233, 233)', title="Device Plots")
py.iplot(fig, filename='device-plots')

### Timeline plotting: Date exploration

In [None]:
import datetime

In [None]:
def scatter_plot(cnt_srs, color):
    trace = go.Scatter(
        x = cnt_srs.index[::-1],
        y = cnt_srs.values[::-1],
        showlegend = False,
        marker = dict(color = color))
    return trace

In [None]:
train_df.head(2)

In [None]:
train_df['date'] = train_df['date'].apply(lambda x: datetime.date(int(str(x)[:4]), int(str(x)[4:6]), int(str(x)[6:])))
train_df.head()
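
As an alternative to slicing the date string by hand, the same conversion can be sketched with `pd.to_datetime` and an explicit format. The two sample dates are made up, and this is only one possible way to do it, not what the original kernel uses.

```python
import pandas as pd

# Hypothetical sample of the raw integer dates (YYYYMMDD).
toy = pd.DataFrame({"date": [20160902, 20170101]})

# Vectorised parse: convert to string, then parse with an explicit format.
parsed = pd.to_datetime(toy["date"].astype(str), format="%Y%m%d")

# .dt.date yields datetime.date objects, matching what the lambda above produces.
toy["date"] = parsed.dt.date
print(toy["date"].tolist())
```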

In [None]:
cnt_srs = train_df.groupby("date")["totals.transactionRevenue"].agg(["size", "count"])
cnt_srs.columns = ["count", "count_of_non_zero_revenue"]
cnt_srs = cnt_srs.sort_index()
trace1 = scatter_plot(cnt_srs["count"], "rgba(13, 81, 108, 1)")
trace2 = scatter_plot(cnt_srs["count_of_non_zero_revenue"], "rgba(54, 13, 108, 1)")


fig = plotly.subplots.make_subplots(rows = 2, cols = 1, vertical_spacing = 0.08,
                                   subplot_titles = ["Date - Count", "Date - Non-zero Revenue Instances"])
fig.append_trace(trace1, 1, 1)
fig.append_trace(trace2, 2, 1)
fig["layout"].update(height = 800, width = 800, paper_bgcolor = "rgb(233, 233, 233)", title = "Timeline Plots")
py.iplot(fig, filename = "date_plots")

In [None]:
test_df = load_df(file_path = "../input/ga-customer-revenue-prediction/test.csv")

In [None]:
test_df["date"] = test_df["date"].apply(lambda x:datetime.date(int(str(x)[:4]), int(str(x)[4:6]), int(str(x)[6:])))
cnt_srs = test_df.groupby("date")["fullVisitorId"].size()
cnt_srs.head()

In [None]:
trace = scatter_plot(cnt_srs, "rgba(162, 11, 74, 1)")  # alpha must lie between 0 and 1

layout = go.Layout(
    height = 400, 
    width = 800,
    paper_bgcolor = "rgb(233, 233, 233)",
    title = "Test set timeline")

data = [trace]
fig = go.Figure(data = data, layout = layout)
py.iplot(fig, filename = "ActivationDate")

In [None]:
cnt_srs = train_df.groupby("geoNetwork.continent")["totals.transactionRevenue"].agg(["size", "count", "mean"])
cnt_srs.columns = ["count", "count_of_non_zero_revenue", "mean"]
cnt_srs = cnt_srs.sort_values(by = "count", ascending = False)
trace1 = horizontal_bar_chart(cnt_srs["count"].head(10), "rgba(237, 139, 27, 1)")
trace2 = horizontal_bar_chart(cnt_srs["count_of_non_zero_revenue"].head(10), "rgba(241, 200, 94, 1)")
trace3 = horizontal_bar_chart(cnt_srs["mean"].head(10), "rgba(253, 220, 6, 1)")

cnt_srs = train_df.groupby("geoNetwork.subContinent")["totals.transactionRevenue"].agg(["size", "count", "mean"])
cnt_srs.columns = ["count", "count_of_non_zero_revenue", "mean"]
cnt_srs = cnt_srs.sort_values(by = "count", ascending = False)
trace4 = horizontal_bar_chart(cnt_srs["count"], "rgba(237, 139, 27, 1)")
trace5 = horizontal_bar_chart(cnt_srs["count_of_non_zero_revenue"], "rgba(241, 200, 94, 1)")
trace6 = horizontal_bar_chart(cnt_srs["mean"], "rgba(253, 220, 6, 1)")


cnt_srs = train_df.groupby("geoNetwork.networkDomain")["totals.transactionRevenue"].agg(["size", "count", "mean"])
cnt_srs.columns = ["count", "count_of_non_zero_revenue", "mean"]
cnt_srs = cnt_srs.sort_values(by = "count", ascending = False)
trace7 = horizontal_bar_chart(cnt_srs["count"].head(10), "rgba(237, 139, 27, 1)")
trace8 = horizontal_bar_chart(cnt_srs["count_of_non_zero_revenue"].head(10), "rgba(241, 200, 94, 1)")
trace9 = horizontal_bar_chart(cnt_srs["mean"].head(10), "rgba(253, 220, 6, 1)")

fig = plotly.subplots.make_subplots(rows = 3, cols = 3, vertical_spacing = 0.08, horizontal_spacing = 0.15,
                                  subplot_titles = ["Continent - Count", "Continent - Non-zero Revenue Count", "Continent - Mean Revenue",
                                          "Sub Continent - Count",  "Sub Continent - Non-zero Revenue Count", "Sub Continent - Mean Revenue",
                                          "Network Domain - Count", "Network Domain - Non-zero Revenue Count", "Network Domain - Mean Revenue"])

fig.append_trace(trace1, 1, 1)
fig.append_trace(trace2, 1, 2)
fig.append_trace(trace3, 1, 3)
fig.append_trace(trace4, 2, 1)
fig.append_trace(trace5, 2, 2)
fig.append_trace(trace6, 2, 3)
fig.append_trace(trace7, 3, 1)
fig.append_trace(trace8, 3, 2)
fig.append_trace(trace9, 3, 3)

fig["layout"].update(height = 1500, width = 1200, paper_bgcolor = "rgb(233, 233, 233)", 
                    title = "Geographical Information")
py.iplot(fig, filename = "geo_plots")

In [None]:
def create_cnt_srs(df, col1, col2):
    cnt_srs = df.groupby(col1)[col2].agg(["size", "count", "mean"])
    cnt_srs.columns = ["count", "count_of_non_zero_revenue", "mean"]
    cnt_srs = cnt_srs.sort_values(by = "count", ascending = False)
    
    return cnt_srs

In [None]:
cnt_srs = create_cnt_srs(train_df, "trafficSource.source", "totals.transactionRevenue")
trace1 = horizontal_bar_chart(cnt_srs["count"].head(10), "rgba(113, 228, 228, 1)")
trace2 = horizontal_bar_chart(cnt_srs["count_of_non_zero_revenue"].head(10), "rgba(30, 158, 158, 1)")
trace3 = horizontal_bar_chart(cnt_srs["mean"].head(10), "rgba(21, 106, 106, 1)")

cnt_srs = create_cnt_srs(train_df, "trafficSource.medium", "totals.transactionRevenue")
trace4 = horizontal_bar_chart(cnt_srs["count"].head(10), "rgba(113, 228, 228, 1)")
trace5 = horizontal_bar_chart(cnt_srs["count_of_non_zero_revenue"].head(10), "rgba(30, 158, 158, 1)")
trace6 = horizontal_bar_chart(cnt_srs["mean"].head(10), "rgba(21, 106, 106, 1)")

fig = plotly.subplots.make_subplots(rows = 2, cols = 3, vertical_spacing = 0.08, horizontal_spacing = 0.15,
                                   subplot_titles = ["Traffic Source - Count", "Traffic Source - Non-zero Revenue Count", "Traffic Source - Mean Revenue",
                                                    "Traffic Medium - Count", "Traffic Medium - Non-zero Revenue Count", "Traffic Medium - Mean Revenue"])

fig.append_trace(trace1, 1, 1)
fig.append_trace(trace2, 1, 2)
fig.append_trace(trace3, 1, 3)
fig.append_trace(trace4, 2, 1)
fig.append_trace(trace5, 2, 2)
fig.append_trace(trace6, 2, 3)

fig["layout"].update(height = 1200, width = 1200, paper_bgcolor = "rgb(233, 233, 233)", 
                    title = "Traffic Source")
py.iplot(fig, filename = "traffic_info_plots")

In [None]:
train_df.head()

In [None]:
cnt_srs = create_cnt_srs(train_df, "totals.pageviews", "totals.transactionRevenue")
trace1 = horizontal_bar_chart(cnt_srs["count"].head(60), "rgba(171, 11, 11, 1)")
trace2 = horizontal_bar_chart(cnt_srs["count_of_non_zero_revenue"].head(60), "rgba(238, 21, 21, 1)")
trace3 = horizontal_bar_chart(cnt_srs["mean"].head(60), "rgba(246, 125, 125, 1)")

cnt_srs = create_cnt_srs(train_df, "totals.hits", "totals.transactionRevenue")
trace4 = horizontal_bar_chart(cnt_srs["count"].head(60), "rgba(8, 51, 144, 1)")
trace5 = horizontal_bar_chart(cnt_srs["count_of_non_zero_revenue"].head(60), "rgba(13, 84, 236, 1)")
trace6 = horizontal_bar_chart(cnt_srs["mean"].head(60), "rgba(97, 144, 244, 1)")

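fig = plotly.subplots.make_subplots(rows = 2, cols = 3, vertical_spacing = 0.08, horizontal_spacing = 0.15,
                                   # The comma after "Pageviews - Mean Revenue" matters: without it Python joins two adjacent titles into one string
                                   subplot_titles = ["Pageviews - Count", "Pageviews - Non-zero Revenue Count", "Pageviews - Mean Revenue",
                                                    "Total hits - Count", "Total hits - Non-zero Revenue Count", "Total hits - Mean Revenue"])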

fig.append_trace(trace1, 1, 1)
fig.append_trace(trace2, 1, 2)
fig.append_trace(trace3, 1, 3)
fig.append_trace(trace4, 2, 1)
fig.append_trace(trace5, 2, 2)
fig.append_trace(trace6, 2, 3)

fig["layout"].update(height = 1200, width = 1200, paper_bgcolor = "rgb(233, 233, 233)", 
                    title = "Page Visits")
py.iplot(fig, filename = "page_view_visits_plot")

#### So that concludes this look at what Plotly is capable of. I find it really similar to RStudio's Shiny, which I am much more familiar with, but in some respects I find Plotly better, mainly when it comes to interactivity. Plus, its object-oriented style is much more straightforward for me, which made picking it up much easier than I thought.