In [1]:
## Importing packages

# This R environment comes with all of CRAN and many other helpful packages preinstalled.
# You can see which packages are installed by checking out the kaggle/rstats docker image: 
# https://github.com/kaggle/docker-rstats

library(tidyverse) # metapackage with lots of helpful functions

## Running code

# In a notebook, you can run a single code cell by clicking in the cell and then hitting 
# the blue arrow to the left, or by clicking in the cell and pressing Shift+Enter. In a script, 
# you can run code by highlighting the code you want to run and then clicking the blue arrow
# at the bottom of this window.

## Reading in files

# You can access files from datasets you've added to this kernel in the "../input/" directory.
# You can see the files added to this kernel by running the code below. 

list.files(path = "../input/acp-data")

## Saving data

# If you save any files or images, these will be put in the "output" directory. You 
# can see the output directory by committing and running your kernel (using the 
# Commit & Run button) and then checking out the compiled version of your kernel.

Registered S3 methods overwritten by 'ggplot2':
  method         from 
  [.quosures     rlang
  c.quosures     rlang
  print.quosures rlang

Registered S3 method overwritten by 'rvest':
  method            from
  read_xml.response xml2

-- Attaching packages --------------------------------------- tidyverse 1.2.1 --

v ggplot2 3.1.1       v purrr   0.3.3  
v tibble  2.1.1       v dplyr   0.8.0.1
v tidyr   0.8.3       v stringr 1.4.0  
v readr   1.3.1       v forcats 0.4.0  

-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()



# 1. Business Objective

## 1.1 Problem Statement

### Premise: 
* Clickthrough rate (CTR) is a ratio showing how often people who see your ad end up clicking it. Clickthrough rate (CTR) can be used to gauge how well your keywords and ads are performing.

* CTR is the number of clicks that your ad receives divided by the number of times your ad is shown: clicks ÷ impressions = CTR. For example, if you had 5 clicks and 100 impressions, then your CTR would be 5%.

* Each of your ads and keywords have their own CTRs that you can see listed in your account.

* A high CTR is a good indication that users find your ads helpful and relevant. CTR also contributes to your keyword's expected CTR, which is a component of Ad Rank. Note that a good CTR is relative to what you're advertising and on which networks.

> Credits: Google (https://support.google.com/adwords/answer/2615875?hl=en) 
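
As a quick check of the formula above, here is a minimal R sketch. The helper name `ctr` and the numbers in the vectorised call are made up for illustration and are not part of the original notebook.

```r
# Hypothetical helper: CTR = clicks / impressions
ctr <- function(clicks, impressions) {
  # Return NA where there were no impressions, to avoid dividing by zero
  ifelse(impressions > 0, clicks / impressions, NA_real_)
}

# The example from the text: 5 clicks out of 100 impressions
ctr(5, 100)        # 0.05, i.e. 5%

# Vectorised over several ads (illustrative numbers)
ctr(c(5, 0, 12), c(100, 0, 400))
```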

* Search advertising is a multi-billion-dollar internet industry and one of the most lucrative success stories in applied machine learning. It relies heavily on learned models that predict ad click-through rates (CTR) accurately, reliably, and with low latency.

* Search engines such as those operated by Google, Microsoft, and Yahoo build their revenue model on ad CTR prediction. Under cost-per-click (CPC) advertising, the ads that advertisers bid on are selected and ranked by the product of CPC and CTR, which is the expected revenue per impression.
* The business objective for these companies therefore centers on balancing profit (and thus the CPC) against user satisfaction (and thus the CTR).
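
The ranking rule described above (order ads by CPC times CTR) can be sketched as follows. The ad ids, bids, and predicted CTRs are invented for this example, and the column names `cpc`, `pctr`, and `expected_revenue` are assumptions rather than fields of the dataset.

```r
# Illustrative candidate ads; ids, bids, and predicted CTRs are invented
ads <- data.frame(
  AdID = c("a1", "a2", "a3"),
  cpc  = c(2.00, 0.50, 1.20),    # advertiser's bid per click
  pctr = c(0.010, 0.060, 0.020)  # predicted click-through rate
)

# Expected revenue per impression is the product CPC x CTR
ads$expected_revenue <- ads$cpc * ads$pctr

# Rank ads from highest to lowest expected revenue
ranked <- ads[order(-ads$expected_revenue), ]
ranked
```

In this toy data the ad with the highest bid (`a1`) ranks last, because its predicted CTR is low. This is why an accurate CTR estimate matters as much as the bid itself.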

> **The mathematical objective of our project is to estimate pCTR, the probability that a given ad is clicked, conditioned on the ad (AdID), the user (UserID), and the relevant context: P(Click | AdID, UserID, Context). Accurately predicting pCTR is critical for maximizing revenue and improving user satisfaction.**
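
To make the conditional probability concrete, here is a minimal sketch that estimates a click probability with logistic regression. The model choice is an assumption for illustration only, not a method stated in this notebook. The toy numbers are invented, and only two context features (`Pos`, `Depth`) are used instead of the full AdID/UserID/context set.

```r
# Toy aggregated data (invented numbers); the column names mirror training.txt
toy <- data.frame(
  Pos        = c(1, 1, 2, 2, 3, 3),
  Depth      = c(1, 2, 2, 3, 3, 3),
  Impression = c(100, 120, 90, 110, 80, 95),
  Click      = c(9, 8, 4, 5, 2, 1)
)

# Binomial GLM: clicks out of impressions as a function of context features
fit <- glm(cbind(Click, Impression - Click) ~ Pos + Depth,
           family = binomial, data = toy)

# Estimated P(Click | Pos = 1, Depth = 2)
pctr <- predict(fit, newdata = data.frame(Pos = 1, Depth = 2), type = "response")
pctr
```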

## 1.2 Constraints

* Interpretability
* Low latency

## 1.3 Dataset source

> https://www.kaggle.com/c/kddcup2012-track2

# 2. Machine Learning Objective

## 2.1 A Glimpse of the Dataset 

### 2.1.1 Schema

<table style="width:50%;text-align:center;">
<caption style="text-align:center;">Data Files</caption>
<tr>
<td><b>Filename</b></td><td><b>Format (size)</b></td>
</tr>
<tr>
<td>training</td><td>.txt (9.9 GB)</td>
</tr>
<tr>
<td>queryid_tokensid</td><td>.txt (704 MB)</td>
</tr>
<tr>
<td>purchasedkeywordid_tokensid</td><td>.txt (26 MB)</td>
</tr>
<tr>
<td>titleid_tokensid</td><td>.txt (172 MB)</td>
</tr>
<tr>
<td>descriptionid_tokensid</td><td>.txt (268 MB)</td>
</tr>
<tr>
<td>userid_profile</td><td>.txt (284 MB)</td>
</tr>
</table>

### 2.1.2 Feature description of training.txt (Main)

<table style="width:100%">
  <caption style="text-align:center;">training.txt</caption>
  <tr>
    <th>Feature</th>
    <th>Description</th>
  </tr>
  <tr>
    <td>UserID</td>
    <td>The unique id for each user</td>
    </tr>
  <tr>
    <td>AdID</td>
    <td>The unique id for each ad</td>
  </tr>
  <tr>
    <td>QueryID</td>
    <td>The unique id for each query; it is the primary key of the query table (queryid_tokensid.txt)</td>
  </tr>
  <tr>
    <td>Depth</td>
    <td>The number of ads impressed in a session is known as the 'depth'. </td>
  </tr>
  <tr>
    <td>Position</td>
    <td>The order of an ad in the impression list is known as the ‘position’ of that ad.</td>
  </tr>
  <tr>
    <td>Impression</td>
    <td>The number of search sessions in which the ad (AdID) was shown to the user (UserID) who issued the query (QueryID).</td>
  </tr>
  <tr>
    <td>Click</td>
    <td>The number of times, among the above impressions, the user (UserID) clicked the ad (AdID).</td>
  </tr>
  <tr>
    <td>TitleId</td>
    <td>A property of ads. This is the key of 'titleid_tokensid.txt'. [An ad, when impressed, is displayed as a short text known as the 'title', followed by a slightly longer text known as the 'description', and a URL (usually shortened to save screen space) known as the 'display URL'.]</td>
  </tr>
  <tr>
    <td>DescId</td>
    <td>A property of ads. This is the key of 'descriptionid_tokensid.txt'. [An ad, when impressed, is displayed as a short text known as the 'title', followed by a slightly longer text known as the 'description', and a URL (usually shortened to save screen space) known as the 'display URL'.]</td>
  </tr>
  <tr>
    <td>AdURL</td>
    <td>The URL is shown together with the title and description of an ad. It is usually the shortened landing page URL of the ad, but not always. In the data file,  this URL is hashed for anonymity.</td>
  </tr>
  <tr>
    <td>KeyId</td>
    <td>A property of ads. This is the key of 'purchasedkeywordid_tokensid.txt'.</td>
  </tr>
  <tr>
    <td>AdvId</td>
    <td>The id of the advertiser, a property of the ad. Some advertisers consistently optimize their ads, so the title and description of their ads are more attractive than those of others' ads.</td>
  </tr>
</table>

___
There are five additional data files, as mentioned in the above section: 

1. queryid_tokensid.txt 

2. purchasedkeywordid_tokensid.txt 

3. titleid_tokensid.txt 

4. descriptionid_tokensid.txt 

5. userid_profile.txt 

Each line of the first four files maps an id to a list of tokens, corresponding to the query, keyword, ad title, and ad description, respectively. In each line, a TAB character separates the id from the token set. A token is essentially a word in a natural language. For anonymity, each token is represented by its hash value. Tokens are delimited by the character ‘|’.
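
The line format described above can be parsed in base R as follows. This is a minimal sketch on a single hand-written line; the variable names are illustrative.

```r
# One raw line in the format described above: id, TAB, then '|'-delimited token hashes
line <- "1\t1545|75|31"

# Split the id from the token set on the TAB character
parts <- strsplit(line, "\t", fixed = TRUE)[[1]]
id <- as.integer(parts[1])

# Split the token set on '|'; fixed = TRUE stops '|' being read as regex alternation
tokens <- strsplit(parts[2], "|", fixed = TRUE)[[1]]

id      # 1
tokens  # "1545" "75" "31"
```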

Each line of ‘userid_profile.txt’ is composed of UserID, Gender, and Age, delimited by the TAB character. Note that not every UserID in the training and the testing set will be present in ‘userid_profile.txt’. Each field is described below: 

1. Gender:  '1'  for male, '2' for female,  and '0'  for unknown. 

2. Age: '1'  for (0, 12],  '2' for (12, 18], '3' for (18, 24], '4'  for  (24, 30], '5' for (30,  40], and '6' for greater than 40. 
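
The Gender and Age codes above can be turned into labelled factors for easier reading in plots and summaries. The sketch below uses made-up rows, and the names `users_demo`, `GenderLabel`, and `AgeGroup` are assumptions, not columns of the original file.

```r
# Illustrative rows in the userid_profile.txt layout (values are made up)
users_demo <- data.frame(UId = 1:4, Gender = c(1, 2, 0, 1), Age = c(5, 3, 1, 6))

# Map the integer codes to labelled factors using the coding described above
users_demo$GenderLabel <- factor(users_demo$Gender,
                                 levels = c(1, 2, 0),
                                 labels = c("male", "female", "unknown"))
users_demo$AgeGroup <- factor(users_demo$Age,
                              levels = 1:6,
                              labels = c("(0,12]", "(12,18]", "(18,24]",
                                         "(24,30]", "(30,40]", ">40"),
                              ordered = TRUE)
users_demo
```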

### 2.1.3 Examining an instance of the data
__training_subsampled.csv__ (subsampled version: 5M rows of training.txt)
<pre>
Click  Impression  AdURL                AdId      AdvId  Depth  Pos  QId      KeyId  TitleId  DescId  UId
0      1           4298118681424644510  7686695   385    3      3    1601     5521   7709     576     490234
0      1           4860571499428580850  21560664  37484  2      2    2255103  317    48989    44771   490234
0      1           9704320783495875564  21748480  36759  3      3    4532751  60721  685038   29681   490234
</pre>

__queryid_tokensid.txt__
<pre>
QId	Query
0	12731
1	1545|75|31
2	383
3	518|1996
4	4189|75|31
</pre>

__purchasedkeywordid_tokensid.txt__
<pre>
KId Keyword
0	12731
1	1545
2	477
3	1545|75|31
4	279
</pre>

__titleid_tokensid.txt__
<pre>
TitleId	Title
0	615|1545|75|31|1|138|1270|615|131
1	466|582|685|1|42|45|477|314
2	12731|190|513|12731|677|183
3	2371|3970|1|2805|4340|3|2914|10640|3688|11|834|3
4	165|134|460|2887|50|2|17527|1|1540|592|2181|3|...
</pre>

__descriptionid_tokensid.txt__
<pre>
DescId	Description
0	1545|31|40|615|1|272|18889|1|220|511|20|5270|1...
1	172|46|467|170|5634|5112|40|155|1965|834|21|41...
2	2672|6|1159|109662|123|49933|160|848|248|207|1...
3	13280|35|1299|26|282|477|606|1|4016|1671|771|1...
4	13327|99|128|494|2928|21|26500|10|11733|10|318
</pre>

__userid_profile.txt__
<pre>
UId	Gender   Age
 1	  1	    5
 2	  2	    3
 3	  1	    5
 4	  1	    3
 5	  2	    1
</pre>

## 2.2 Mapping the business objective to a machine learning problem

### 2.2.1 What is the primary objective of this ML problem?

### 2.2.2 What constraints / biases do we need to address?

### 2.2.3 Key Performance Indicator / Performance Evaluation Metrics 

### 2.2.4 What type of Machine Learning problem can our problem statement be posed as?

# 3. Exploratory Data Analysis


## 3.1 Importing the data

### 3.1.1 Read training_subsampled.csv (Main data)

In [None]:
main = read.csv("../input/acp-data/training_subsampled.csv")

In [None]:
main %>% head

### 3.1.2 Read Query data

In [None]:
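query.colnames  = c('QId', 'Query')
# stringsAsFactors = FALSE keeps the token strings as character (needed later by strsplit)
queries = read.csv("../input/acp-data/queryid_tokensid.txt", sep = "\t", header = FALSE, col.names = query.colnames, stringsAsFactors = FALSE)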

In [None]:
queries %>% head

### 3.1.3 Read User data

In [None]:
user.colnames  = c('UId', 'Gender', 'Age')
users = read.csv("../input/acp-data/userid_profile.txt", sep = "\t", header = FALSE, col.names = user.colnames)

In [None]:
users %>% head
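
Because not every UserID in the main data has a profile, joining the two tables needs a left join so that no impression rows are dropped. Here is a minimal sketch with base R's `merge()`. The small data frames `main_demo` and `profile_demo` are illustrative stand-ins for `main` and `users`.

```r
# Small stand-ins for `main` and `users` (values are illustrative)
main_demo    <- data.frame(UId = c(1, 2, 99), Click = c(0, 1, 0), Impression = c(1, 2, 1))
profile_demo <- data.frame(UId = c(1, 2), Gender = c(1, 2), Age = c(5, 3))

# Left join: keep every row of the main table, even when the user has no profile
joined <- merge(main_demo, profile_demo, by = "UId", all.x = TRUE)
joined

# Count rows whose user is missing from the profile table
sum(is.na(joined$Gender))
```

Rows for users without a profile come back with `NA` in `Gender` and `Age`, which can then be handled explicitly, for example by mapping them to the "unknown" code.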

### 3.1.4 Read Ad Title data

In [None]:
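adtitle.colnames  = c('TitleId', 'Title')
# stringsAsFactors = FALSE keeps the token strings as character (needed later by strsplit)
ad.titles = read.csv("../input/acp-data/titleid_tokensid.txt", sep = "\t", header = FALSE, col.names = adtitle.colnames, stringsAsFactors = FALSE)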

In [None]:
ad.titles %>% head

### 3.1.5 Read Ad Description data

In [None]:
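addesc.colnames  = c('DescId', 'Description')
# stringsAsFactors = FALSE keeps the token strings as character (needed later by strsplit)
ad.desc = read.csv("../input/acp-data/descriptionid_tokensid.txt", sep = "\t", header = FALSE, col.names = addesc.colnames, stringsAsFactors = FALSE)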

In [None]:
ad.desc %>% head

### 3.1.6 Read Purchased Keyword token data

In [None]:
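keywd.colnames  = c('KeywordId', 'Keyword')
# stringsAsFactors = FALSE keeps the token strings as character (needed later by strsplit)
keywords = read.csv("../input/acp-data/purchasedkeywordid_tokensid.txt", sep = "\t", header = FALSE, col.names = keywd.colnames, stringsAsFactors = FALSE)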

In [None]:
keywords %>% head

In [None]:
findCount <- function(x) {
  length(unlist(strsplit(x, "|", fixed = TRUE)))
}
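
A possible way to use `findCount` on a whole token column is sketched below. The demo data frame and the `TokenCount` column name are assumptions. Note that `findCount` pools every token in its input, so it has to be applied one element at a time.

```r
# Same helper as in the cell above, repeated so this sketch runs on its own
findCount <- function(x) {
  length(unlist(strsplit(x, "|", fixed = TRUE)))
}

# Illustrative token strings in the format of queryid_tokensid.txt
queries_demo <- data.frame(QId = 0:2,
                           Query = c("12731", "1545|75|31", "518|1996"),
                           stringsAsFactors = FALSE)

# Apply findCount() to each query string separately
queries_demo$TokenCount <- vapply(queries_demo$Query, findCount, integer(1),
                                  USE.NAMES = FALSE)
queries_demo

# An equivalent vectorised form that avoids the per-element function call
lengths(strsplit(queries_demo$Query, "|", fixed = TRUE))
```

On millions of rows the vectorised `lengths(strsplit(...))` form is likely to be noticeably faster than calling the helper once per row.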

In [None]:
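# Toy 5 x 6 data frame of random normal values for trying out apply()
X <- as.data.frame(matrix(rnorm(30), nrow = 5, ncol = 6))
# The original call gave no function; mean is an assumed placeholder (MARGIN = 2 applies it per column)
apply(X, 2, mean)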