**Objective of the notebook:**

In this notebook, we will perform a simple exploratory analysis of the missing values in the data set.

**Objective of the competition:**

In this competition, we are challenged to analyze a Google Merchandise Store (also known as GStore, where Google swag is sold) customer dataset to predict revenue per customer.

In [None]:
import warnings
warnings.filterwarnings('ignore')

import os
import json
import numpy as np
import pandas as pd
try:
    from pandas import json_normalize  # pandas >= 1.0
except ImportError:
    from pandas.io.json import json_normalize  # older pandas versions
import matplotlib.pyplot as plt

import seaborn as sns
color = sns.color_palette()

%matplotlib inline
pd.options.mode.chained_assignment = None
pd.options.display.max_columns = 999

from IPython.display import HTML, display
import tabulate

import plotly.offline as py
import plotly.graph_objs as go
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)

**File Descriptions**

train.csv - the training set - contains the same data as the BigQuery rstudio_train_set.

test.csv - the test set - contains the same data as the BigQuery rstudio_test_set.

**Data Fields**

fullVisitorId - A unique identifier for each user of the Google Merchandise Store.

channelGrouping - The channel via which the user came to the Store.

date - The date on which the user visited the Store.

device - The specifications for the device used to access the Store.

geoNetwork - This section contains information about the geography of the user.

sessionId - A unique identifier for this visit to the store.

socialEngagementType - Engagement type, either "Socially Engaged" or "Not Socially Engaged".

totals - This section contains aggregate values across the session.

trafficSource - This section contains information about the Traffic Source from which the session originated.

visitId - An identifier for this session. This is part of the value usually stored as the _utmb cookie. This is only unique to the user. For a completely unique ID, you should use a combination of fullVisitorId and visitId.

visitNumber - The session number for this user. If this is the first session, then this is set to 1.

visitStartTime - The timestamp (expressed as POSIX time).
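
The two notes above (a fully unique session ID from fullVisitorId plus visitId, and visitStartTime as POSIX time) can be sketched as follows. This is only an illustration on a made-up toy dataframe; `session_key` and `visitStart` are hypothetical column names, not fields of the dataset.

```python
import pandas as pd

# Toy sessions with made-up values; the column names follow the field list above
sessions = pd.DataFrame({
    'fullVisitorId': ['0001', '0001', '0002'],
    'visitId': [1472830385, 1472880147, 1472830385],
    'visitStartTime': [1472830385, 1472880147, 1472830390],
})

# visitId alone repeats across users, so combine it with fullVisitorId
# ('session_key' is a hypothetical name chosen for this sketch)
sessions['session_key'] = sessions['fullVisitorId'] + '_' + sessions['visitId'].astype(str)

# POSIX time is seconds since 1970-01-01 UTC
sessions['visitStart'] = pd.to_datetime(sessions['visitStartTime'], unit='s')

print(sessions[['session_key', 'visitStart']])
```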

**Credit Note** : Using code from kernel [**Quick start: read csv and flatten json fields**](https://www.kaggle.com/julian3833/1-quick-start-read-csv-and-flatten-json-fields) by [Julián Peller1](https://www.kaggle.com/julian3833)

In [None]:
def load_df(csv_path='../input/train.csv', nrows=None):
    JSON_COLUMNS = ['device', 'geoNetwork', 'totals', 'trafficSource']
    
    df = pd.read_csv(csv_path, 
                     converters={column: json.loads for column in JSON_COLUMNS}, 
                     dtype={'fullVisitorId': 'str'},  # Important: keep IDs as strings so leading zeros are preserved
                     nrows=nrows)
    
    for column in JSON_COLUMNS:
        column_as_df = json_normalize(df[column])
        column_as_df.columns = [f"{column}.{subcolumn}" for subcolumn in column_as_df.columns]
        df = df.drop(column, axis=1).merge(column_as_df, right_index=True, left_index=True)
    print(f"Loaded {os.path.basename(csv_path)}. Shape: {df.shape}")
    return df
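
To make the flattening step in `load_df` easier to follow, here is a minimal sketch on a two-row in-memory CSV. The rows and the `device` sub-fields are made up for illustration, and `pd.json_normalize` assumes pandas 1.0 or newer.

```python
import io
import json
import pandas as pd

# Toy CSV with one JSON column, mimicking the layout of train.csv (values are made up)
toy_csv = io.StringIO(
    'fullVisitorId,device\n'
    '0012345,"{""browser"": ""Chrome"", ""isMobile"": false}"\n'
    '0067890,"{""browser"": ""Safari"", ""isMobile"": true}"\n'
)

# Parse the JSON column into dicts while reading; keep the ID as a string
toy = pd.read_csv(toy_csv,
                  converters={'device': json.loads},
                  dtype={'fullVisitorId': 'str'})

# Expand each dict into its own columns and prefix them with the parent column name
flat = pd.json_normalize(toy['device'].tolist())
flat.columns = [f"device.{subcolumn}" for subcolumn in flat.columns]
toy = toy.drop('device', axis=1).merge(flat, left_index=True, right_index=True)

print(toy)
```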

In [None]:
%%time
train_df = load_df()
test_df = load_df("../input/test.csv")

In [None]:
print('size of training data : ', train_df.shape)
print('size of testing data  : ', test_df.shape)

**Train data snippet:**

In [None]:
train_df.head()

**Column names for training data**

In [None]:
train_df.columns.values

**Test data snippet:**

In [None]:
test_df.head()

**Column names for testing data**

In [None]:
test_df.columns.values

***Missing values assessment***

In this section:

    [a] We will display the count and percentage of missing values.
    
    [b] We will explore the attributes with missing values.
    
    [c] We will explore whether we can recommend an imputation for the missing values.


**The statistics below show that there are 8 columns with more than 97% missing values.**

The next task is to analyse the attributes with missing values and try to recommend how to impute them.

In [None]:
total = train_df.isnull().sum().sort_values(ascending = False)
percent = (train_df.isnull().sum() / train_df.isnull().count()*100).sort_values(ascending = False)
missing_application_train_data  = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_application_train_data.head(20)
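
The same count-and-percentage table can be wrapped in a small helper so that it can be reused for the test data. This is a sketch only; `missing_summary` is a hypothetical name, and the toy dataframe is made up.

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Return count and percentage of missing values for columns that have any."""
    total = df.isnull().sum()
    percent = df.isnull().mean() * 100  # mean of a boolean mask = share of missing rows
    summary = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
    return summary[summary['Total'] > 0].sort_values('Total', ascending=False)

# Toy data (made up): 'a' is mostly missing, 'b' has one gap, 'c' is complete
toy = pd.DataFrame({
    'a': [1, np.nan, np.nan, np.nan],
    'b': [1, 2, np.nan, 4],
    'c': [1, 2, 3, 4],
})
summary = missing_summary(toy)
print(summary)
```

With the notebook's dataframes this would be called as `missing_summary(train_df)` and `missing_summary(test_df)`.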

<h3> <span style="color:brown">**Feature # 1 : trafficSource.campaignCode**</span></h3>

**Description:**

Value of the utm_id campaign tracking parameter, **used for manual campaign tracking.**

**Analysis:**

In the train dataframe, trafficSource.campaignCode has a value in only one cell, so we can drop this attribute.

In the test dataframe, trafficSource.campaignCode is not present.

**Recommendation:** We can plan to remove this feature.

In [None]:
temp1 = train_df['trafficSource.campaignCode'].value_counts()

trace1 = go.Bar(
    x = temp1.index,
    y = temp1 ,
)

data = [trace1]

layout = go.Layout(
    title = "Campaign code for training data",
    xaxis=dict(
        title='Campaign codes',
        domain=[0, 0.5]
    ),
    
    yaxis=dict(
        title='Count of Campaign codes '
        
    )
)

fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='Campaign code')

In [None]:
train_df['trafficSource.campaignCode'].value_counts()
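
One way to act on this recommendation is to drop every train column that does not exist in the test data, while keeping the target. The sketch below uses made-up toy frames; `TARGET` and `train_only` are hypothetical names introduced here.

```python
import numpy as np
import pandas as pd

TARGET = 'totals.transactionRevenue'  # the target exists only in train and must be kept

# Toy frames with made-up values
train = pd.DataFrame({
    'visitNumber': [1, 2, 1],
    'trafficSource.campaignCode': [np.nan, 'some_code', np.nan],
    TARGET: [np.nan, '2400000', np.nan],
})
test = pd.DataFrame({'visitNumber': [1, 3]})

# Columns present in train but absent from test cannot be used at prediction time
train_only = [c for c in train.columns if c not in test.columns and c != TARGET]
train = train.drop(columns=train_only)

print(train_only)
print(train.columns.tolist())
```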

<h3> <span style="color:brown">**Feature # 2 : totals.transactionRevenue**</span></h3>

**Description:**
Total transaction revenue, expressed as <span style="color:blue">the value passed to Analytics multiplied by 10^6 (e.g., 2.40 would be given as 2400000).</span>

**Analysis:**

This is the target attribute.

**Recommendation:** 5332 data points have valid numerical values; all remaining data points can be populated with the value 0.


In [None]:
transactionRevenue = train_df['totals.transactionRevenue'].value_counts()
print(transactionRevenue.head())
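# Note: len() of value_counts() is the number of distinct revenue values, not the number of rows with revenue
len(transactionRevenue)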
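
The recommendation above (fill missing revenue with 0) and the 10^6 scaling from the description can be sketched as follows. The values are made up, and `revenue_units` and `has_purchase` are hypothetical names; the sketch assumes the raw values arrive as strings, which `pd.to_numeric` handles either way.

```python
import numpy as np
import pandas as pd

# Toy target column (made-up values); NaN means the session had no transaction
revenue_raw = pd.Series(['2400000', np.nan, np.nan, '15000000'],
                        name='totals.transactionRevenue')

# Convert to numbers and treat "no transaction" as zero revenue
revenue = pd.to_numeric(revenue_raw).fillna(0)

# Undo the 10^6 scaling to get the value originally passed to Analytics
revenue_units = revenue / 1e6

# Flag sessions that produced any revenue
has_purchase = revenue > 0

print(revenue_units.tolist())
print(int(has_purchase.sum()), 'of', len(revenue), 'sessions have revenue')
```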

<h3> <span style="color:brown">**Feature # 3 : trafficSource.adwordsClickInfo.page**</span></h3>

**Description:**

Page number in search results where the ad was shown.

**Analysis:**

In 21362 cases the ad was shown on the 1st page, in 73 cases on the 2nd page, and so on.

**Recommendation:** _In my experience, I never go beyond 3 or 4 pages of search results, so this feature is a good candidate for binning._


In [None]:
temp1 = train_df['trafficSource.adwordsClickInfo.page'].value_counts()
temp2 = test_df['trafficSource.adwordsClickInfo.page'].value_counts()


trace1 = go.Bar(
    x = temp1.index,
    y = temp1,
    name = 'train'
)
trace2 = go.Bar(
    x = temp2.index,
    y = temp2,
    name = 'test'
)

data = [trace1, trace2]

layout = go.Layout(
    title = "Page # where the ad was shown (train vs. test)",
    width = 900,
    xaxis=dict(
        title='Page #',
        tickfont=dict(
            size=14,
            color='rgb(107, 107, 107)'
        )
    ),
    
    yaxis=dict(
        title='# of instances',  
        titlefont=dict(
            size=16,
            color='rgb(107, 107, 107)'
        ),
        tickfont=dict(
            size=14,
            color='rgb(107, 107, 107)'
        )
    )
)

fig = go.Figure(data=data, layout=layout)
iplot(fig)

In [None]:
print(train_df['trafficSource.adwordsClickInfo.page'].value_counts())
print(test_df['trafficSource.adwordsClickInfo.page'].value_counts())
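
A possible binning along the lines of the recommendation for this feature is sketched below. The bin edges (page 1, page 2, page 3 and beyond) and the `no ad click` label for missing values are assumptions made for illustration, not something the data dictates; the toy values are made up.

```python
import numpy as np
import pandas as pd

# Toy page numbers (made up); assumed to be stored as strings, NaN where no ad was clicked
page = pd.Series(['1', '2', np.nan, '5', '1', '14'])

page_num = pd.to_numeric(page)

# Right-inclusive bins: (0, 1], (1, 2], (2, inf)
page_bin = pd.cut(page_num,
                  bins=[0, 1, 2, np.inf],
                  labels=['page 1', 'page 2', 'page 3+'])

# Give missing values their own category instead of leaving them as NaN
page_bin = page_bin.cat.add_categories('no ad click').fillna('no ad click')

print(page_bin.value_counts())
```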

<h3> <span style="color:brown">**Feature # 4 : trafficSource.adwordsClickInfo.adNetworkType**</span></h3>

**Description:**

Network Type. Takes one of the following values: {"Google Search", "Content", "Search partners", "Ad Exchange", "Yahoo Japan Search", "Yahoo Japan AFS", "unknown"}

**Analysis:**

The train dataframe contains ***Google Search*** and ***Search partners*** (the Content class is missing from the training data).
The test dataframe contains ***Content***, ***Google Search*** and ***Search partners***.

**Recommendation:** _Not sure how to handle "Content"._
    We may need to bin the whole data into 2 bins:
    
        1) Google Search and 
        
        2) Others 

In [None]:
temp1 = train_df['trafficSource.adwordsClickInfo.adNetworkType'].value_counts()
temp2 = test_df['trafficSource.adwordsClickInfo.adNetworkType'].value_counts()


trace1 = go.Bar(
    x = temp1.index,
    y = temp1,
    name = 'train'
   
)
trace2 = go.Bar(
    x = temp2.index,
    y = temp2,
    name = 'test'
    
)

data = [trace1, trace2]

layout = go.Layout(
    title = "Ad network type (train vs. test)",
    width = 900,
    xaxis=dict(
        title='Ad network type',
        tickfont=dict(
            size=14,
            color='rgb(107, 107, 107)'
        )
    ),
    
    yaxis=dict(
        title='# of instances',  
        titlefont=dict(
            size=16,
            color='rgb(107, 107, 107)'
        ),
        tickfont=dict(
            size=14,
            color='rgb(107, 107, 107)'
        )
    )
)

fig = go.Figure(data=data, layout=layout)
iplot(fig)

In [None]:
print(train_df['trafficSource.adwordsClickInfo.adNetworkType'].value_counts())
print(test_df['trafficSource.adwordsClickInfo.adNetworkType'].value_counts())
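
The two-bin idea for the network type can be sketched as below. Keeping missing values in a separate `Missing` class is an extra assumption of this sketch (the recommendation itself only names two bins), and `bin_network` is a hypothetical helper.

```python
import numpy as np
import pandas as pd

def bin_network(value):
    """Map a network type to 'Google Search', 'Others', or 'Missing' (assumed extra class)."""
    if pd.isna(value):
        return 'Missing'
    return 'Google Search' if value == 'Google Search' else 'Others'

# Toy values (made up); 'Content' stands for a class seen only in the test data
network = pd.Series(['Google Search', 'Search partners', np.nan, 'Content', 'Google Search'])
network_bin = network.map(bin_network)

print(network_bin.value_counts())
```

Because every value other than Google Search lands in the same bin, a class that appears only in the test data needs no special handling.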

<h3> <span style="color:brown">**Feature # 5 : trafficSource.adwordsClickInfo.slot**</span></h3>

**Description:**

Position of the Ad. Takes one of the following values: {"RHS", "Top"}

**Analysis:**

According to the documentation, Top and RHS are the only valid values, but the test dataframe also contains Google Display Network, which may be an invalid entry.

**Recommendation:** We can replace all NaN's and Google Display Network with a single separate class.

In [None]:
temp1 = train_df['trafficSource.adwordsClickInfo.slot'].value_counts()
temp2 = test_df['trafficSource.adwordsClickInfo.slot'].value_counts()


trace1 = go.Bar(
    x = temp1.index,
    y = temp1,
    name = 'train'
   
)
trace2 = go.Bar(
    x = temp2.index,
    y = temp2,
    name = 'test'
    
)

data = [trace1, trace2]

layout = go.Layout(
    title = "Ad slot (train vs. test)",
    width = 900,
    xaxis=dict(
        title='Ad slot',
        tickfont=dict(
            size=14,
            color='rgb(107, 107, 107)'
        )
    ),
    
    yaxis=dict(
        title='# of instances',  
        titlefont=dict(
            size=16,
            color='rgb(107, 107, 107)'
        ),
        tickfont=dict(
            size=14,
            color='rgb(107, 107, 107)'
        )
    )
)

fig = go.Figure(data=data, layout=layout)
iplot(fig)

In [None]:
print(train_df['trafficSource.adwordsClickInfo.slot'].value_counts())
print(test_df['trafficSource.adwordsClickInfo.slot'].value_counts())
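
A minimal sketch of this replacement, assuming the label `Unknown` for the new class (any label that does not clash with Top or RHS would do); the toy values are made up.

```python
import numpy as np
import pandas as pd

VALID_SLOTS = {'Top', 'RHS'}  # the documented values

# Toy slot values (made up), including a missing entry and the undocumented value
slot = pd.Series(['Top', np.nan, 'RHS', 'Google Display Network', 'Top'])

# Keep documented values, send everything else (NaN included) to one class
slot_clean = slot.where(slot.isin(VALID_SLOTS), 'Unknown')

print(slot_clean.value_counts())
```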


<h3> <span style="color:brown">**Feature # 6 : trafficSource.adwordsClickInfo.isVideoAd**</span></h3>

**Description:**

True if it is a TrueView video ad.

**Analysis and Recommendation:**

In both the train and test dataframes, all NaN's can be replaced with True.

In [None]:
print(train_df['trafficSource.adwordsClickInfo.isVideoAd'].value_counts())
print(test_df['trafficSource.adwordsClickInfo.isVideoAd'].value_counts())

<h3> <span style="color:brown">**Feature # 7 : trafficSource.isTrueDirect**</span></h3>

**Description:**

True if the source of the session was Direct (meaning the user typed the name of your website URL into the browser or came to your site via a bookmark). This field will also be true if 2 successive but distinct sessions have exactly the same campaign details. Otherwise NULL.

**Analysis and Recommendation:**

In both the train and test dataframes, all NaN's can be replaced with False.



In [None]:
print(train_df['trafficSource.isTrueDirect'].value_counts())
print(test_df['trafficSource.isTrueDirect'].value_counts())
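
The two boolean recommendations (NaN to True for isVideoAd, NaN to False for isTrueDirect) can share one helper. This is a sketch under the assumption that the non-missing values are Python booleans, as produced by `json.loads`; `fill_bool` is a hypothetical name and the toy values are made up.

```python
import numpy as np
import pandas as pd

def fill_bool(series, default):
    """Replace missing entries with `default` and return a proper boolean column."""
    return series.map(lambda v: default if pd.isna(v) else bool(v)).astype(bool)

# Toy columns (made up), stored as object dtype because they mix booleans and NaN
is_video_ad = pd.Series([False, np.nan, np.nan], dtype=object)
is_true_direct = pd.Series([True, np.nan, True], dtype=object)

is_video_ad_filled = fill_bool(is_video_ad, default=True)
is_true_direct_filled = fill_bool(is_true_direct, default=False)

print(is_video_ad_filled.tolist())
print(is_true_direct_filled.tolist())
```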


<h3> <span style="color:brown">**Feature # 8 : trafficSource.referralPath**</span></h3>

**Description:**

If trafficSource.medium is "referral", then this is set to the path of the referrer.

In [None]:
#print(train_df['trafficSource.referralPath'].value_counts())
#print(test_df['trafficSource.referralPath'].value_counts())

<h3> <span style="color:brown">**Feature # 9 : trafficSource.keyword**</span></h3>

**Description:**

The keyword of the traffic source, usually set when the trafficSource.medium is "organic" or "cpc".

In [None]:
# print(train_df['trafficSource.keyword'].value_counts())
# print(test_df['trafficSource.keyword'].value_counts())

<h3> <span style="color:brown">**Feature # 10 : totals.bounces**</span></h3>

**Description:**
Total bounces (a bounce occurs when the user leaves the site rather than continuing to other pages). For a bounced session, the value is 1, otherwise it is null.

**Analysis**

**[Bounce rate](https://en.wikipedia.org/wiki/Bounce_rate)** is an Internet marketing term used in web traffic analysis. It represents the percentage of visitors who enter the site and then leave ("bounce") rather than continuing to view other pages within the same site. Bounce rate is calculated by dividing the number of single-page visits by the total number of visits.

**Recommendation**

Replace all NaN's with _0_

In [None]:
print(train_df['totals.bounces'].value_counts())
print(test_df['totals.bounces'].value_counts())
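
Once the NaN's are replaced with 0, the bounce rate follows directly as the mean of the flag. The sketch below uses made-up sessions and a made-up `channelGrouping` split; `bounce_flag` and `bounce_rate` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Toy sessions (made-up values); bounces is '1' for a bounced session, NaN otherwise
toy = pd.DataFrame({
    'channelGrouping': ['Organic Search', 'Organic Search', 'Direct', 'Direct', 'Direct'],
    'totals.bounces': ['1', np.nan, np.nan, '1', np.nan],
})

# NaN means "did not bounce", so fill with 0 and store as an integer flag
toy['bounce_flag'] = pd.to_numeric(toy['totals.bounces']).fillna(0).astype(int)

# Bounce rate = bounced sessions / all sessions, as a percentage
bounce_rate = toy['bounce_flag'].mean() * 100
bounce_rate_by_channel = toy.groupby('channelGrouping')['bounce_flag'].mean() * 100

print(bounce_rate)
print(bounce_rate_by_channel)
```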

<h3> <span style="color:brown">**Feature # 11 : totals.newVisits**</span></h3>

**Description:**

Total number of new users in session (for convenience). If this is the first visit, this value is 1, otherwise it is null.

**Recommendation:**

Replace all NaN's with _0_

In [None]:
print(train_df['totals.newVisits'].value_counts())
print(test_df['totals.newVisits'].value_counts())
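
Since totals.bounces and totals.newVisits follow the same "1 or null" convention, the zero fill can be applied to both in one loop. This is a sketch on made-up data; `FLAG_COLUMNS` is a hypothetical name, and the same loop would be run on both the train and test dataframes.

```python
import numpy as np
import pandas as pd

FLAG_COLUMNS = ['totals.bounces', 'totals.newVisits']  # columns that are 1 or null

# Toy frame with made-up values
toy = pd.DataFrame({
    'totals.bounces': ['1', np.nan, np.nan],
    'totals.newVisits': [np.nan, '1', '1'],
})

for col in FLAG_COLUMNS:
    # Null means the event did not happen, so it becomes 0
    toy[col] = pd.to_numeric(toy[col]).fillna(0).astype(int)

print(toy)
```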