# w261 Final Project - Clickthrough Rate Prediction


Team 24   
Vivian Lu, Siddhartha Jakkamreddy, Venky Nagapudi, Luca Garre   
Summer 2019, sections 4 and 5   

## Table of Contents

* __Section 1__ - Question Formulation
* __Section 2__ - Algorithm Explanation
* __Section 3__ - EDA & Challenges
* __Section 4__ - Algorithm Implementation
* __Section 5__ - Course Concepts

# __Section 1__ - Question Formulation

## __Introduction__
Online advertising is a multibillion-dollar industry fueled by large investments and ever-increasing performance goals. Targeted advertising is receiving growing interest because of its potential for revenue generation. It draws on users' browsing history and demographics, on ad features such as overall appearance, colors and text, and on website features such as the ad's relative placement on the page and its size. In this context, machine learning helps identify the features that most affect users' click-through rates (CTR) and, based on this understanding, informs the design of ads that maximize performance metrics such as click and conversion rates. Machine learning models can also be deployed in a data pipeline environment to select and serve, on a user-specific basis, the ad expected to maximize the user's interest.

...

## __Goal of the analysis__
The purpose of the present analysis is to estimate whether a given ad will be clicked based on a set of features describing the ad. 

...

## __Description of the dataset__
The dataset is provided by [put_reference_to_CriteoLabs] and consists of three files: `readme.txt`, `train.txt` and `test.txt`. The readme file contains a brief description of the data, while `train.txt` and `test.txt` contain the train and test data. Both data files are formatted as tab-separated value tables and contain 45840617 and 6042135 rows, respectively. The commands further below verify these counts; they expect the data to be in a `data` folder inside the current working directory. According to the data description, each row represents an ad and contains the following fields:

- 1 binary field indicating whether the ad has been clicked (1) or not (0). This field is available only for the train data;
- 13 fields containing integer features representing counts;
- 26 categorical features, hashed as 32-bit keys for anonymization purposes.

A printout of the first rows of the data files shows that the data contain no headers. As a result, with the sole exception of the first binary field, the fields cannot be characterized in terms of the features they represent. Rows can also have missing values: the printed lines contain empty fields between consecutive tabs, so they have fewer entries than the number of fields specified in the `readme.txt` file.
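
The field layout described above can be sketched as a small parser. This is a minimal illustration, not the parsing used later in the project: `parse_line`, `has_label`, `N_INT` and `N_CAT` are hypothetical names, and the sketch assumes that a missing value always appears as an empty string between two tabs.

```python
from typing import List, Optional, Tuple

N_INT = 13   # integer count features, per the dataset description
N_CAT = 26   # hashed categorical features, per the dataset description


def parse_line(line: str, has_label: bool = True
               ) -> Tuple[Optional[int], List[Optional[int]], List[Optional[str]]]:
    """Split one tab-separated row into (label, integer features, categorical features).

    Empty fields are mapped to None. Train rows carry a leading label field,
    test rows do not (pass has_label=False for them).
    """
    # Strip only the newline: trailing tabs encode empty trailing fields.
    fields = line.rstrip("\n").split("\t")
    expected = N_INT + N_CAT + (1 if has_label else 0)
    if len(fields) != expected:
        raise ValueError(f"expected {expected} fields, got {len(fields)}")

    label = int(fields[0]) if has_label else None
    offset = 1 if has_label else 0
    ints = [int(v) if v else None for v in fields[offset:offset + N_INT]]
    cats = [v if v else None for v in fields[offset + N_INT:]]
    return label, ints, cats


# Synthetic train row: label 0, three integer fields (one empty), one category.
row = "\t".join(["0", "1", "", "5"] + [""] * 10 + ["68fd1e64"] + [""] * 25)
label, ints, cats = parse_line(row)
print(label, ints[:3], cats[0])
```

Mapping empty fields to `None` keeps missing values distinct from a genuine count of 0, which matters when summarizing the integer features.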

...

In [5]:
#number of rows in the train data
!wc -l data/train.txt

45840617 data/train.txt


In [6]:
#number of rows in the test data
!wc -l data/test.txt

6042135 data/test.txt


In [7]:
# first row of the train data
!head -1 data/train.txt

0	1	1	5	0	1382	4	15	2	181	1	2		2	68fd1e64	80e26c9b	fb936136	7b4723c4	25c83c98	7e0ccccf	de7995b8	1f89b562	a73ee510	a8cd5504	b2cb9c98	37c9c164	2824a5f6	1adce6ef	8ba8b39a	891b62e7	e5ba7672	f54016b9	21ddcdc9	b1252a9d	07b5194c		3a171ecb	c5c50484	e8b83407	9727dd16


In [8]:
# first row of the test data
!head -1 data/test.txt

	29	50	5	7260	437	1	4	14		1	0	6	5a9ed9b0	a0e12995	a1e14474	08a40877	25c83c98		964d1fdd	5b392875	a73ee510	de89c3d2	59cd5ae7	8d98db20	8b216f7b	1adce6ef	78c64a1d	3ecdadf7	3486227d	1616f155	21ddcdc9	5840adea	2c277e62		423fab69	54c91918	9b3e8820	e75c9ae9


# __Section 2__ - Algorithm Explanation

# __Section 3__ - EDA & Challenges

# __Section 4__ - Algorithm Implementation

# __Section 5__ - Course Concepts

In [9]:
# imports
import re
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from os import path
import seaborn as sns

In [2]:
# store path to notebook
PWD = !pwd
PWD = PWD[0]

In [3]:
PWD

'/media/notebooks/W261/W261_Final_Project'

In [4]:
# create Spark Session
from pyspark.sql import SparkSession
app_name = "final_project"
master = "local[*]"
spark = SparkSession\
        .builder\
        .appName(app_name)\
        .master(master)\
        .getOrCreate()
sc = spark.sparkContext

In [5]:
# create a random sample from the train data for EDA (5%)
# fixed seed for reproducibility
train_sample = sc.textFile('data/train.txt').sample(False,0.05,2019).cache()
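
As a rough sanity check on the sampling step, the size of the sample can be compared with what a 5% draw is expected to produce. The sketch below assumes that `sample(False, 0.05, 2019)` keeps each row independently with probability 0.05 (Bernoulli sampling), and it takes the sample size from the `count` reported for the label column in the `df.describe()` output further down. The variable names are illustrative.

```python
import math

n_train = 45_840_617      # rows in train.txt (from `wc -l`)
fraction = 0.05           # sampling fraction passed to sample()
n_sample = 2_292_037      # rows in the sample (label-column count in df.describe())

# Under Bernoulli sampling the sample size follows Binomial(n_train, fraction).
expected = n_train * fraction
std = math.sqrt(n_train * fraction * (1 - fraction))
z_score = (n_sample - expected) / std

print(f"expected size: {expected:,.0f} +/- {std:,.0f}")
print(f"observed size: {n_sample:,} (z = {z_score:.3f})")
```

A z-score close to zero indicates that the observed sample size is consistent with a 5% sample rather than the 20% one might otherwise assume.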

In [6]:
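# inspect a few sampled rows; note that top(5) returns the 5 largest lines
# in lexicographic order, not the first 5 (take(5) would return the first 5)
train_sample.top(5)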

['1\t99\t9\t35\t30\t1\t11\t154\t27\t44\t1\t8\t\t11\t05db9164\t89ddfee8\td08c8f30\t3805af8d\t25c83c98\t7e0ccccf\t1c86e0eb\t062b5529\ta73ee510\t935a36f0\t755e4a50\tadd8347b\t5978055e\t051219e6\td5223973\tf8a48751\te5ba7672\t5bb2ec8e\t21ddcdc9\ta458ea53\t540e0e34\t\t32c7478e\t3fdb382b\tf0f449dd\t49d68486',
 '1\t99\t44\t2\t2\t1153\t2\t2959\t17\t265\t2\t50\t1\t2\t5a9ed9b0\t26a88120\tb00d1501\td16679b9\t25c83c98\t7e0ccccf\t3f4ec687\t5b392875\ta73ee510\t0e9ead52\tc4adf918\te0d76380\t85dbe138\tb28479f6\t2ebbf26a\t1203a270\t8efede7f\tb486119d\t\t\t73d06dde\tad3062eb\t32c7478e\taee52b6f\t\t',
 '1\t99\t38\t1\t3\t10\t4\t374\t23\t426\t1\t19\t42\t3\t68fd1e64\t26a88120\td032c263\tc18be181\t0942e0a7\tfbad5c96\t3f4ec687\t1f89b562\ta73ee510\t726f00fd\tc4adf918\tdfbb09fb\t85dbe138\t07d13a8f\t040ec437\t84898b2a\t8efede7f\t57598e25\t\t\t0014c32a\t\t32c7478e\t3b183c5c\t\t',
 '1\t99\t33\t2\t2\t433\t32\t166\t30\t129\t1\t5\t\t2\t05db9164\t9e5ce894\t3e90a31f\t13508380\t25c83c98\t7e0ccccf\t3598a741\t0b153874\ta7

In [7]:
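# write the sample as a single part file; saveAsTextFile creates a
# directory named train_sample.txt that contains part-00000
train_sample.coalesce(1,True).saveAsTextFile("train_sample.txt")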

In [8]:
!head -5 train_sample.txt/part-00000

0	0	1		0	16597	557	3	5	123	0	1		1	8cf07265	7cd19acc	77f2f2e5	d16679b9	4cf72387	fbad5c96	8fb24933	0b153874	a73ee510	0095a535	3617b5f5	9f32b866	428332cf	b28479f6	83ebd498	31ca40b6	e5ba7672	d0e5eb07			dfcfc3fa	ad3062eb	32c7478e	aee52b6f		
0	1	0	1		1427	3	16	11	50	0	2	1		05db9164	26a88120	615e3e4e	2788fed8	4cf72387	7e0ccccf	3f4ec687	0b153874	a73ee510	0e9ead52	c4adf918	f5d19c1c	85dbe138	07d13a8f	24ff9452	1034ac0d	3486227d	b486119d			63580fba		32c7478e	2a90c749		
0		1			23255		0	1	73		0			7e5c2ff4	d833535f	b00d1501	d16679b9	25c83c98	7e0ccccf	65c53f25	1f89b562	a73ee510	3b08e48b	ad2bc6f4	e0d76380	39ccb769	b28479f6	a733d362	1203a270	776ce399	281769c2			73d06dde		32c7478e	aee52b6f		
0	0	37	23	9	1635	84	2	17	109	0	2		50	05db9164	9b25e48b	2d9b2559	96302ef8	43b19349	fbad5c96	e64ca89e	5b392875	a73ee510	3b76bfa9	87bb382c	3d899a5a	d95a2a6d	8ceecbc8	8f3ef960	24352c5c	07c540c4	7d8c03aa	fbf39fb5	a458ea53	0c61029b		32c7478e	216a829e	001f3601	abc00283
0	2	0	9	5	44	5	2	4	5	2	2		5	5a9ed9b0	3e4b7926	e1266b28	

In [15]:
df = pd.read_csv('train_sample.txt/part-00000', delimiter='\t', header=None)

In [16]:
df.head()

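| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 30 | 31 | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.0 | 1 | | 0.0 | 16597.0 | 557.0 | 3.0 | 5.0 | 123.0 | ... | e5ba7672 | d0e5eb07 | | | dfcfc3fa | ad3062eb | 32c7478e | aee52b6f | | |
| 1 | 0 | 1.0 | 0 | 1.0 | | 1427.0 | 3.0 | 16.0 | 11.0 | 50.0 | ... | 3486227d | b486119d | | | 63580fba | | 32c7478e | 2a90c749 | | |
| 2 | 0 | | 1 | | | 23255.0 | | 0.0 | 1.0 | 73.0 | ... | 776ce399 | 281769c2 | | | 73d06dde | | 32c7478e | aee52b6f | | |
| 3 | 0 | 0.0 | 37 | 23.0 | 9.0 | 1635.0 | 84.0 | 2.0 | 17.0 | 109.0 | ... | 07c540c4 | 7d8c03aa | fbf39fb5 | a458ea53 | 0c61029b | | 32c7478e | 216a829e | 001f3601 | abc00283 |
| 4 | 0 | 2.0 | 0 | 9.0 | 5.0 | 44.0 | 5.0 | 2.0 | 4.0 | 5.0 | ... | 07c540c4 | e261f8d8 | 21ddcdc9 | b1252a9d | 31b4af04 | | 32c7478e | 8d653a3e | 445bbe3b | 32280082 |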


In [17]:
df.describe()

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 2292037.0 | 1253362.0 | 2292037.0 | 1799718.0 | 1794652.0 | 2232828.0 | 1779597.0 | 2193136.0 | 2290882.0 | 2193136.0 | 1253362.0 | 2193136.0 | 537580.0 | 1794652.0 |
| mean | 0.2564675 | 3.524587 | 105.5237 | 27.03023 | 7.327119 | 18536.03 | 115.5878 | 16.45949 | 12.53695 | 105.9942 | 0.6180864 | 2.738067 | 0.992498 | 8.217248 |
| std | 0.436683 | 9.456074 | 387.9461 | 402.6174 | 8.845165 | 69153.18 | 337.0288 | 70.80816 | 16.92911 | 219.744 | 0.6845344 | 5.228751 | 6.083341 | 15.87631 |
| min | 0.0 | 0.0 | -2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.0 | 0.0 | 0.0 | 2.0 | 2.0 | 326.0 | 8.0 | 1.0 | 2.0 | 10.0 | 0.0 | 1.0 | 0.0 | 2.0 |
| 50% | 0.0 | 1.0 | 3.0 | 6.0 | 4.0 | 2812.0 | 32.0 | 3.0 | 7.0 | 38.0 | 1.0 | 1.0 | 0.0 | 4.0 |
| 75% | 1.0 | 3.0 | 35.0 | 18.0 | 10.0 | 10136.0 | 102.0 | 11.0 | 19.0 | 110.0 | 1.0 | 3.0 | 1.0 | 10.0 |
| max | 1.0 | 1575.0 | 19219.0 | 65535.0 | 969.0 | 2634953.0 | 66619.0 | 34536.0 | 4513.0 | 18345.0 | 8.0 | 163.0 | 1831.0 | 4317.0 |
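
The `count` row above gives the number of non-missing values per column, so the share of missing values follows directly. The sketch below copies those counts by hand and assumes that column 0 (the label) is never missing, so that its count equals the number of sampled rows; `non_null` and `missing_share` are illustrative names. On the DataFrame itself, `df.isna().mean()` should give the same shares for all 40 columns.

```python
# Non-missing counts for the label (0) and the 13 integer features (1-13),
# copied from the `count` row of df.describe().
non_null = {
    0: 2292037, 1: 1253362, 2: 2292037, 3: 1799718, 4: 1794652,
    5: 2232828, 6: 1779597, 7: 2193136, 8: 2290882, 9: 2193136,
    10: 1253362, 11: 2193136, 12: 537580, 13: 1794652,
}
n_rows = non_null[0]  # assumption: the label column has no missing values

missing_share = {col: 1 - cnt / n_rows for col, cnt in non_null.items()}

# List columns from most to least affected by missing values.
for col, share in sorted(missing_share.items(), key=lambda kv: -kv[1]):
    print(f"column {col:>2}: {share:6.1%} missing")
```

On this sample, column 12 is the sparsest, with roughly three quarters of its values missing, while columns 1 and 10 are missing in a little under half of the rows.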
