# Kaggle Google Analytics: Feature engineering - Analysis and EDA on Tableau

This notebook digs deeper into the GA data and features, and attempts to extract insights by applying various ratios to subsets and categorical variables. It is currently a work in progress, and any comments, ideas or suggestions are welcome.

# Introduction and context
tbc

- Paper to be discussed: https://arxiv.org/pdf/1701.07852.pdf
- Error analysis for context: https://www.kaggle.com/xavierbourretsicotte/lgbm-error-analysis-eda-and-thoughts

## Description and intuition behind the main variables

### Categorical variables

- **source**: the *origin* of website traffic. Possible sources include “google” (the name of a search engine), “facebook.com” (the name of a referring site), “spring_newsletter” (the name of one of your newsletters), and “direct” (users who typed your URL directly into their browser, or who had bookmarked your site).
- **medium**: the *category* of the traffic source. Possible media include “organic” (unpaid search), “cpc” (cost per click, i.e. paid search), “referral” (a link from another site), “email” (the name of a custom medium you have created), and “none” (the medium assigned to direct traffic).
- **channel**: a group of several traffic sources with the same medium.
- **channelGrouping**: a *set of channels*, which makes it easier to highlight or aggregate traffic channels.
- **keyword**: when SSL search is employed, keyword has the value (not provided).
- **campaign**: the name of the referring Google Ads campaign or of a custom campaign that you have created.
- **content**: identifies a specific link or content item in a custom campaign. For example, if you have two call-to-action links within the same email message, you can use different content values to differentiate them and tell which version is most effective.
- **geoNetwork.networkDomain**: the domain name of the user's ISP, derived from the domain name registered to the ISP's IP address.
- **trafficSource.adwordsClickInfo.gclId**: the Google click ID.
- **trafficSource.adwordsClickInfo.page**: the page number in search results where the ad was shown.
- **trafficSource.adwordsClickInfo.slot**: the position of the ad. Takes one of the following values: {"RHS", "Top"}.
- **referralPath**: if trafficSource.medium is "referral", this is set to the path of the referrer. (The host name of the referrer is in trafficSource.source.)

### Numerical
- **visitNumber**: The session number for this user. If this is the first session, then this is set to 1.
- **bounces**: tbc
- **hits**: tbc
- **newVisits**: If this is the first visit, this value is 1, otherwise it is null.
- **pageviews**: Total number of pageviews within the session.

The GA descriptions are available at: https://support.google.com/analytics/answer/1033173?hl=en&ref_topic=6010089


# A summary of features and ratios for the main variables
The following story shows the relationship between the features (columns) and the three main prediction proxies (rows) when applied to different views, slices and variables of the dataset.

For example, how does the bounce rate affect the average revenue per session when analysed at the level of referral paths? In other words, do referral paths with a high bounce rate have low revenue per session? (Yes.)

### Features
- Bounce rate
- Hits per session
- Page views per session
- Hits per pageview
- Average visit number
- Average time since last visit
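
As a rough illustration of how these ratios could be derived outside Tableau, here is a minimal pandas sketch. It assumes the flattened column names commonly used for this competition (`totals.hits`, `totals.pageviews`, `totals.bounces`, `visitNumber`, `trafficSource.referralPath`), treats a null bounce as "not bounced", and computes each ratio as a ratio of group totals; all three are assumptions rather than a description of the Tableau workbook. The helper `feature_ratios` and the toy rows are hypothetical, and the time-since-last-visit feature is left out because it needs per-user timestamps.

```python
import pandas as pd

# Toy session-level rows; the values are invented purely for illustration.
sessions = pd.DataFrame({
    "trafficSource.referralPath": ["/a", "/a", "/b", "/b", "/b"],
    "totals.hits": [1, 4, 2, 6, 10],
    "totals.pageviews": [1, 2, 2, 3, 5],
    "totals.bounces": [1, None, None, None, None],  # assumed: 1 if bounced, null otherwise
    "visitNumber": [1, 2, 1, 1, 3],
})


def feature_ratios(df, by):
    """Aggregate session rows to one row per group and derive the ratio features."""
    # Treat a missing bounce flag as "not bounced" (assumption).
    tmp = df.assign(bounced=df["totals.bounces"].fillna(0))
    grouped = tmp.groupby(by)
    out = pd.DataFrame({
        "sessions": grouped.size(),
        "bounces": grouped["bounced"].sum(),
        "hits": grouped["totals.hits"].sum(),
        "pageviews": grouped["totals.pageviews"].sum(),
        "avg_visit_number": grouped["visitNumber"].mean(),
    })
    # Ratios of group totals (not means of per-session ratios).
    out["bounce_rate"] = out["bounces"] / out["sessions"]
    out["hits_per_session"] = out["hits"] / out["sessions"]
    out["pageviews_per_session"] = out["pageviews"] / out["sessions"]
    out["hits_per_pageview"] = out["hits"] / out["pageviews"]
    return out


features = feature_ratios(sessions, "trafficSource.referralPath")
print(features)
```

Passing a list of columns as `by` would give a composite grouping key.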

### Prediction proxies
There are three different ways of looking at the final prediction, each with its own benefits and drawbacks. Note that each of these is calculated at the appropriate level of granularity.
- Revenues per session: this is the prediction target used by models at the *session level*
- Percentage of sessions with positive (i.e. non-zero) revenues: this can be viewed as the target of a classification model at the *session level*
- Total revenues: calculated at the corresponding level of granularity
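
The three proxies can be sketched in pandas as follows. This is only an illustration under stated assumptions: the revenue column is taken to be `totals.transactionRevenue`, a missing value is treated as zero revenue, and the helper name `prediction_proxies` and the toy rows are hypothetical.

```python
import pandas as pd

# Toy session-level rows; the values are invented purely for illustration.
sessions = pd.DataFrame({
    "geoNetwork.country": ["US", "US", "US", "CA", "CA"],
    "totals.transactionRevenue": [None, 20.0, None, None, None],
})


def prediction_proxies(df, by, revenue_col="totals.transactionRevenue"):
    """Compute the three prediction proxies for each group defined by `by`."""
    # Assumption: a null revenue means the session generated no revenue.
    tmp = df.assign(_revenue=df[revenue_col].fillna(0))
    tmp["_has_revenue"] = tmp["_revenue"] > 0
    grouped = tmp.groupby(by)
    return pd.DataFrame({
        "revenue_per_session": grouped["_revenue"].mean(),
        "pct_sessions_with_revenue": grouped["_has_revenue"].mean() * 100,
        "total_revenue": grouped["_revenue"].sum(),
    })


proxies = prediction_proxies(sessions, "geoNetwork.country")
print(proxies)
```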

### Levels of granularity
- Country
- Country + city
- Referral path
- Keyword
- Source
- Medium + city
- Network Domain
- Network Domain + city

### Regression line
A simple linear regression line is fitted to each graph; its *p-value* can be seen by hovering the mouse over the line. This is useful for gaining intuition, but an alternative way of viewing the potential impact of a feature is to look for vertical lines, or cuts, in the data, which are well modelled by decision trees. More discussion tbc.
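
To make the contrast between the two views concrete, the sketch below fits a least-squares line with `scipy.stats.linregress` (which reports a p-value for the slope) and then searches for the single cut on the same feature that a depth-1 regression tree would choose. The data points are invented, and `best_single_split` is a hypothetical helper written for this illustration, not the method Tableau uses.

```python
import numpy as np
from scipy import stats

# Toy group-level points (invented): x = bounce rate, y = revenue per session.
x = np.array([0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9])
y = np.array([5.0, 6.0, 5.5, 6.5, 0.5, 0.0, 0.4, 0.1])

# View 1: a straight line, with the p-value of the slope.
fit = stats.linregress(x, y)


def best_single_split(x, y):
    """Return the cut on x that minimises the squared error of a two-leaf split."""
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    best_threshold, best_sse = None, np.inf
    for i in range(1, len(xs)):
        if xs[i] == xs[i - 1]:
            continue  # no cut is possible between identical x values
        left, right = ys[:i], ys[i:]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            best_threshold, best_sse = (xs[i - 1] + xs[i]) / 2, sse
    return best_threshold, best_sse


# View 2: a vertical cut, as a decision stump would place it.
threshold, sse = best_single_split(x, y)
print(f"slope={fit.slope:.2f}, p-value={fit.pvalue:.4f}, best cut at x={threshold:.2f}")
```

On data like this the line reports a significant negative slope, but the single cut describes the step in revenue more directly, which is the pattern tree-based models pick up.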


In [None]:
%%HTML
<div class='tableauPlaceholder' id='viz1538638609624' style='position: relative'><noscript><a href='#'><img alt=' ' src='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;2N&#47;2N75FGWFZ&#47;1_rss.png' style='border: none' /></a></noscript><object class='tableauViz'  style='display:none;'><param name='host_url' value='https%3A%2F%2Fpublic.tableau.com%2F' /> <param name='embed_code_version' value='3' /> <param name='path' value='shared&#47;2N75FGWFZ' /> <param name='toolbar' value='yes' /><param name='static_image' value='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;2N&#47;2N75FGWFZ&#47;1.png' /> <param name='animate_transition' value='yes' /><param name='display_static_image' value='yes' /><param name='display_spinner' value='yes' /><param name='display_overlay' value='yes' /><param name='display_count' value='yes' /><param name='filter' value='publish=yes' /></object></div>                <script type='text/javascript'>                    var divElement = document.getElementById('viz1538638609624');                    var vizElement = divElement.getElementsByTagName('object')[0];                    vizElement.style.width='1016px';vizElement.style.height='1014px';                    var scriptElement = document.createElement('script');                    scriptElement.src = 'https://public.tableau.com/javascripts/api/viz_v1.js';                    vizElement.parentNode.insertBefore(scriptElement, vizElement);                </script>

# Map

### Average bounce rates (colors) and revenue per session (size of black dot) 

In [None]:
%%HTML
<div class='tableauPlaceholder' id='viz1538639356356' style='position: relative'><noscript><a href='#'><img alt=' ' src='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;Ka&#47;KaggleGoogleAnalytics-Featureandratios&#47;Map1&#47;1_rss.png' style='border: none' /></a></noscript><object class='tableauViz'  style='display:none;'><param name='host_url' value='https%3A%2F%2Fpublic.tableau.com%2F' /> <param name='embed_code_version' value='3' /> <param name='site_root' value='' /><param name='name' value='KaggleGoogleAnalytics-Featureandratios&#47;Map1' /><param name='tabs' value='yes' /><param name='toolbar' value='yes' /><param name='static_image' value='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;Ka&#47;KaggleGoogleAnalytics-Featureandratios&#47;Map1&#47;1.png' /> <param name='animate_transition' value='yes' /><param name='display_static_image' value='yes' /><param name='display_spinner' value='yes' /><param name='display_overlay' value='yes' /><param name='display_count' value='yes' /><param name='filter' value='publish=yes' /></object></div>                <script type='text/javascript'>                    var divElement = document.getElementById('viz1538639356356');                    var vizElement = divElement.getElementsByTagName('object')[0];                    vizElement.style.width='100%';vizElement.style.height=(divElement.offsetWidth*0.75)+'px';                    var scriptElement = document.createElement('script');                    scriptElement.src = 'https://public.tableau.com/javascripts/api/viz_v1.js';                    vizElement.parentNode.insertBefore(scriptElement, vizElement);                </script>

# Investigating traffic quality
*Are there any bots?*

Note that when the hits per session ratio is **exactly** 1, 2 or 3, absolutely no sessions have any revenue. A similar effect applies to hits per pageview when the ratio is exactly 2, 3 or 4. A possible explanation is that bots, crawlers or other automated scripts may be the only visitors that, on average, behave in such a way.
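
A check of this kind could be sketched in pandas as below, at the network-domain level. The column names follow the flattened naming assumed for this dataset, null revenue is treated as zero, and the domains and values are invented for illustration, so the output says nothing about the real data.

```python
import pandas as pd

# Toy session-level rows; domains and values are hypothetical.
sessions = pd.DataFrame({
    "geoNetwork.networkDomain": [
        "crawler.example", "crawler.example",
        "isp.example", "isp.example", "isp.example",
    ],
    "totals.hits": [2, 2, 3, 12, 7],
    "totals.transactionRevenue": [None, None, None, 30.0, None],
})

# One row per network domain: session count, total hits, total revenue.
by_domain = sessions.groupby("geoNetwork.networkDomain").agg(
    sessions=("totals.hits", "size"),
    hits=("totals.hits", "sum"),
    revenue=("totals.transactionRevenue", "sum"),  # nulls are skipped, i.e. count as 0
)
by_domain["hits_per_session"] = by_domain["hits"] / by_domain["sessions"]

# Flag domains whose ratio is exactly 1, 2 or 3.
by_domain["exact_small_ratio"] = by_domain["hits_per_session"].isin([1, 2, 3])

flagged = by_domain[by_domain["exact_small_ratio"]]
print(flagged)
print("Revenue from flagged domains:", flagged["revenue"].sum())
```

Domains flagged this way are only candidates for automated traffic; a low-volume human visitor with two sessions of two hits each would be flagged as well.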


In [None]:
%%HTML
<div class='tableauPlaceholder' id='viz1538639607639' style='position: relative'><noscript><a href='#'><img alt=' ' src='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;Ka&#47;KaggleGoogleAnalytics-Featureandratios&#47;NetworkDomain_hits_per_sess_1&#47;1_rss.png' style='border: none' /></a></noscript><object class='tableauViz'  style='display:none;'><param name='host_url' value='https%3A%2F%2Fpublic.tableau.com%2F' /> <param name='embed_code_version' value='3' /> <param name='site_root' value='' /><param name='name' value='KaggleGoogleAnalytics-Featureandratios&#47;NetworkDomain_hits_per_sess_1' /><param name='tabs' value='yes' /><param name='toolbar' value='yes' /><param name='static_image' value='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;Ka&#47;KaggleGoogleAnalytics-Featureandratios&#47;NetworkDomain_hits_per_sess_1&#47;1.png' /> <param name='animate_transition' value='yes' /><param name='display_static_image' value='yes' /><param name='display_spinner' value='yes' /><param name='display_overlay' value='yes' /><param name='display_count' value='yes' /><param name='filter' value='publish=yes' /></object></div>                <script type='text/javascript'>                    var divElement = document.getElementById('viz1538639607639');                    var vizElement = divElement.getElementsByTagName('object')[0];                    vizElement.style.width='100%';vizElement.style.height=(divElement.offsetWidth*0.75)+'px';                    var scriptElement = document.createElement('script');                    scriptElement.src = 'https://public.tableau.com/javascripts/api/viz_v1.js';                    vizElement.parentNode.insertBefore(scriptElement, vizElement);                </script>

# Next steps...

...