This kernel is a beginner's approach to exploring the data set in preparation for building a predictive model. This is my first kernel and competition, and I would love to hear everyone's feedback!

In this kernel, I will go through each feature of the data and discuss its relevance to predicting revenue per customer.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import json
import matplotlib.pyplot as plt


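# json_normalize moved to the top-level pandas namespace in pandas 1.0
try:
    from pandas import json_normalize
except ImportError:
    from pandas.io.json import json_normalize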

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.

To flatten the JSON fields in the data set, I used the approach from Julián Peller's kernel "1 - Quick start: read csv and flatten json fields". Thanks for this!



In [None]:
def load_df(csv_path='../input/train.csv', nrows=None):
    JSON_COLUMNS = ['device', 'geoNetwork', 'totals', 'trafficSource']
    
    df = pd.read_csv(csv_path, 
                     converters={column: json.loads for column in JSON_COLUMNS}, 
                     dtype={'fullVisitorId': 'str'}, # Important: keep IDs as strings so they are not parsed as numbers
                     nrows=nrows)
    
    for column in JSON_COLUMNS:
        column_as_df = json_normalize(df[column])
        column_as_df.columns = [f"{column}.{subcolumn}" for subcolumn in column_as_df.columns]
        df = df.drop(column, axis=1).merge(column_as_df, right_index=True, left_index=True)
    print(f"Loaded {os.path.basename(csv_path)}. Shape: {df.shape}")
    return df
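
To see what the flattening step does, here is a minimal sketch on a two-row toy frame. The toy values are made up, and it calls `pd.json_normalize` directly (available in pandas 1.0 and later), so treat it as an illustration rather than the exact code path of `load_df`.

```python
import json
import pandas as pd

# Toy stand-in for one JSON column; the values are made up for illustration.
toy = pd.DataFrame({
    "fullVisitorId": ["001", "002"],
    "device": ['{"browser": "Chrome", "isMobile": false}',
               '{"browser": "Safari", "isMobile": true}'],
})

# Parse each JSON string into a dict, then expand the dicts into columns.
parsed = toy["device"].apply(json.loads)
flat = pd.json_normalize(parsed.tolist())
flat.columns = [f"device.{c}" for c in flat.columns]

# Both frames share the default 0..n-1 index, so joining on the index is safe.
toy_flat = toy.drop(columns="device").join(flat)
print(toy_flat)
```

Each key of the JSON object becomes its own `device.<key>` column, which is what `load_df` does for all four JSON columns.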

Load the train and test data frames. Pass `nrows` to load only a subset of rows for quicker execution.

In [None]:
train_df = load_df()  # for a quick run: load_df(nrows=20000)
test_df = load_df("../input/test.csv")  # for a quick run: load_df("../input/test.csv", nrows=20000)

In [None]:
train_df.head()

In [None]:
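# Drop columns that contain only the demo-dataset placeholder.
# Every row is checked (not just the first), so a column that is masked
# in some rows but populated in others is kept.
PLACEHOLDER = 'not available in demo dataset'
features_not_available = [col for col in train_df.columns
                          if (train_df[col] == PLACEHOLDER).all()]
train_df = train_df.drop(features_not_available, axis=1)
train_df.head()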
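
Some placeholder columns are masked in every row, while others may be masked in only part of the data, and the two cases call for different handling. Here is a small sketch on made-up data that measures the share of placeholder values per column. The toy values and the names `placeholder_share`, `fully_masked` and `partly_masked` are my own, not from the data set.

```python
import pandas as pd

PLACEHOLDER = "not available in demo dataset"

# Toy frame; the values are invented for illustration only.
toy = pd.DataFrame({
    "device.browser": ["Chrome", "Safari", "Chrome"],
    "device.language": [PLACEHOLDER] * 3,
    "geoNetwork.city": [PLACEHOLDER, "London", PLACEHOLDER],
})

# Fraction of rows per column that hold the placeholder string.
placeholder_share = (toy == PLACEHOLDER).mean()

# Columns with no information at all versus columns with some real values.
fully_masked = placeholder_share[placeholder_share == 1.0].index.tolist()
partly_masked = placeholder_share[
    (placeholder_share > 0) & (placeholder_share < 1)
].index.tolist()

print(placeholder_share)
```

Fully masked columns carry no signal and can be dropped, whereas partly masked ones might still be useful once the placeholder is treated as a missing value.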

# Data fields
## 1. channelGrouping

The channel via which the user came into the store. Here we want to explore:
- Channel density i.e. how many visits per channel
- Channel revenue i.e. how much revenue each channel generates

In [None]:
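# Number of visits per channel, sorted ascending
channel_density = train_df.groupby(['channelGrouping']).channelGrouping.count().sort_values()
x_den = channel_density.index
y_den = channel_density
# Draw the bars first so the categorical x-axis exists before annotating
plt.bar(x_den, y_den)
for a, b in zip(x_den, y_den):
    # Label each bar with its share of all visits
    plt.text(a, b + 60, f"{100 * b / train_df.shape[0]:.0f}%", ha='center')
plt.xticks(rotation=60)
plt.xlabel('Channels')
plt.ylabel('Visits')
plt.title('Channel Density')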

In [None]:
train_df["totals.transactionRevenue"] = train_df["totals.transactionRevenue"].astype('float')
channel_revenue = train_df.groupby('channelGrouping')['totals.transactionRevenue'].sum().sort_values().reset_index()
plt.xticks(rotation=60)
plt.xlabel('Channels')
plt.ylabel('Revenue')
plt.title('Revenue per channel')
plt.bar(channel_revenue['channelGrouping'], channel_revenue['totals.transactionRevenue'])

Channel breakdown:
- Referral: Traffic that occurs when a user finds you through a site other than a major search engine
- Social: Traffic from a social network, such as Facebook, LinkedIn, Twitter, or Instagram
- Organic search: Traffic from search engine results that is earned, not paid
- Paid search: Traffic from search engine results that is the result of paid advertising via Google AdWords or another paid search platform
- Email: Traffic from email marketing that has been properly tagged with an email parameter
- Other: If traffic does not fit into another source or has been tagged as “Other” via a URL parameter, it will be bucketed into “Other” traffic.
- Direct: Any traffic where the referrer or source is unknown.
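
Density and revenue can also be combined into revenue per visit, which makes channels of very different sizes easier to compare. Here is a minimal sketch on made-up numbers. The toy frame and the `visits`, `revenue` and `revenue_per_visit` column names are illustrative assumptions, not part of the data set.

```python
import numpy as np
import pandas as pd

# Toy visits; NaN revenue means the visit had no transaction.
toy = pd.DataFrame({
    "channelGrouping": ["Referral", "Referral", "Social",
                        "Social", "Social", "Direct"],
    "totals.transactionRevenue": [20.0, np.nan, np.nan,
                                  np.nan, 5.0, np.nan],
})

# 'size' counts every visit; 'sum' skips NaN, so it adds up real revenue only.
summary = (
    toy.groupby("channelGrouping")["totals.transactionRevenue"]
       .agg(["size", "sum"])
       .rename(columns={"size": "visits", "sum": "revenue"})
)
summary["revenue_per_visit"] = summary["revenue"] / summary["visits"]

print(summary.sort_values("revenue_per_visit", ascending=False))
```

On the real data the same pattern would apply to `train_df`, with the caveat that a channel with few visits can show a noisy ratio.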

## 2. Device
The specifications for the device used to access the Store.
### 2.1 device.browser
The web browser used to access the store.

In [None]:
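# 'count' = visits with non-null revenue (transactions), 'size' = all visits,
# 'mean' = mean revenue over transacting visits only
device_browser = train_df.groupby('device.browser')['totals.transactionRevenue'].agg(['count', 'size', 'mean']).sort_values('count', ascending=False)[0:8]
device_browser.head()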

In [None]:
# 'count' ignores missing revenue, so this shows visits that produced revenue
plt.xticks(rotation=60)
plt.xlabel('Browser')
plt.ylabel('Transactions')
plt.title('Device browser transaction count')
plt.bar(device_browser.index, device_browser['count'])

In [None]:
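# Total revenue per browser ('size' is a visit count, not revenue),
# aligned to the browsers selected in device_browser
browser_revenue = train_df.groupby('device.browser')['totals.transactionRevenue'].sum().reindex(device_browser.index)
plt.xticks(rotation=60)
plt.xlabel('Browser')
plt.ylabel('Revenue')
plt.title('Device browser total revenue')
plt.bar(browser_revenue.index, browser_revenue)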

In [None]:
plt.xticks(rotation=60)
plt.xlabel('Browser')
plt.ylabel('Mean revenue per transaction')
plt.title('Device browser mean revenue per transaction')
plt.bar(device_browser.index, device_browser['mean'])
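
When reading these plots, it helps to keep in mind how `count`, `size`, `sum` and `mean` treat missing revenue. The sketch below uses a made-up five-row frame (the values are invented, and `stats` is just an illustrative name) to show the difference.

```python
import numpy as np
import pandas as pd

# Toy visits; NaN revenue means the visit had no transaction.
toy = pd.DataFrame({
    "device.browser": ["Chrome", "Chrome", "Chrome", "Safari", "Safari"],
    "totals.transactionRevenue": [10.0, np.nan, 30.0, np.nan, np.nan],
})

stats = (
    toy.groupby("device.browser")["totals.transactionRevenue"]
       .agg(["count", "size", "sum", "mean"])
)

# count: non-null values only (visits with revenue)
# size:  all rows in the group (all visits)
# sum:   total revenue, NaN treated as nothing to add
# mean:  average over non-null values, NaN if the group has none
print(stats)
```

So `count` answers "how many visits produced revenue", `size` answers "how many visits were there", and `mean` is an average per transacting visit rather than per visit.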