# Click-Through Rate Prediction
### Notebook created by [Wenyi Xu](https://github.com/xuwenyihust)
### Create a [click-through rate](https://www.kaggle.com/c/criteo-display-ad-challenge) (CTR) prediction pipeline.

### Parse CTR Data

View Criteo's agreement.

In [3]:
from IPython.lib.display import IFrame

IFrame("http://labs.criteo.com/downloads/2014-kaggle-display-advertising-challenge-dataset/", 600, 350)

#### Load the data

In [5]:
# Hidden path
rawData = sc.textFile('/FileStore/tables/86sw321e1469469485636/dac_sample.txt').map(lambda x: x.replace('\t', ','))
print(rawData.take(1))
print(type(rawData))

#### Split the dataset into training, validation and test sets

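Specify the weights and seed for the `randomSplit` method.

**Training : Validation : Test** will be approximately **8 : 1 : 1**. Rows are assigned at random, so the split sizes only match the weights in expectation.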

In [7]:
weights = [.8, .1, .1]
seed = 42

In [8]:
rawTrainData, rawValidationData, rawTestData = rawData.randomSplit(weights, seed)
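
To illustrate why the split is only approximately 8 : 1 : 1, here is a plain-Python sketch of weighted random splitting. It is not Spark's implementation; `random_split` is a hypothetical helper written for this illustration. Each item draws a random number in [0, 1) and goes to the split whose cumulative weight range contains it.

```python
import random
from bisect import bisect_right

def random_split(items, weights, seed):
    """Assign each item to one split with probability proportional to its weight."""
    total = float(sum(weights))
    # Cumulative upper bounds of each split's share of [0, 1)
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)
    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for item in items:
        # Clamp the index in case rounding leaves the last bound just below 1.0
        idx = min(bisect_right(bounds, rng.random()), len(weights) - 1)
        splits[idx].append(item)
    return splits

train, val, test = random_split(range(10000), [.8, .1, .1], 42)
sizes = (len(train), len(val), len(test))
```

The three sizes always sum to the number of input items, but each one only lands near its target share. Fixing the seed makes the assignment reproducible across runs.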

**Cache** the split datasets, since we will use them repeatedly.

In [10]:
rawTrainData.cache()
rawValidationData.cache()
rawTestData.cache()

**Data preview**: count the rows in each split and check that the counts add up to the full dataset.

In [12]:
nTrain = rawTrainData.count()
nVal = rawValidationData.count()
nTest = rawTestData.count()
print('%d %d %d %d' % (nTrain, nVal, nTest, nTrain + nVal + nTest))
print(rawData.take(1))

#### Extract features

Split each data point (a comma-separated string) into separate fields.

Drop the first field -> **label** (clicked or not)

Save the remaining fields -> **features**

Define a *parse_data_point* function that takes one data point (row) and returns a list of **(featureID, value)** tuples, where the featureID is the position of the feature after the label has been dropped.

In [15]:
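def parse_data_point(data_point):
  # Drop the label (first field) and keep the remaining fields as features
  features = data_point.split(',')[1:]
  # Pair each feature value with its position, which serves as the featureID
  return list(enumerate(features))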
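
As a quick check of what the function returns, the sketch below applies it to a single row outside Spark. The row is a hypothetical, shortened example (a real row has many more fields), and the function is repeated here only so the snippet runs on its own.

```python
def parse_data_point(data_point):
    # Drop the label (first field) and keep the remaining fields as features
    features = data_point.split(',')[1:]
    # Pair each feature value with its position, which serves as the featureID
    return list(enumerate(features))

# Hypothetical, shortened row: label, two numeric fields, one empty field, one hashed category
sample_row = '0,1,5,,68fd1e64'
parsed = parse_data_point(sample_row)
# parsed == [(0, '1'), (1, '5'), (2, ''), (3, '68fd1e64')]
```

Two details are worth noting: featureIDs start at 0 because the label has already been removed, and an empty field is kept as the empty string `''`, so it is treated as a feature value of its own.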

In [16]:
parsedTrainFeat = rawTrainData.map(parse_data_point)
print(parsedTrainFeat.take(1))

Each row's feature string has now been split into a list of (featureID, value) tuples.

Count the **number of distinct values for each feature**.

In [18]:
numCategories = (parsedTrainFeat
                 # Flatten the per-row lists into one RDD of (featureID, value) pairs
                 .flatMap(lambda x: x)
                 # Keep a single copy of each distinct (featureID, value) pair
                 .distinct()
                 # Replace each value with 1 so the pairs can be counted per featureID
                 .map(lambda x: (x[0], 1))
                 # Sum the ones: the number of distinct values for each featureID
                 .reduceByKey(lambda x, y: x + y)
                 .sortByKey()
                 .collect())

# Number of distinct values for the feature with featureID 2
print(numCategories[2][1])
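
The same count can be sketched in plain Python on a toy dataset, which makes the role of `distinct()` easier to see. This is an illustration only; `count_categories` and the toy rows are hypothetical and not part of the notebook's pipeline.

```python
from collections import defaultdict

def count_categories(parsed_points):
    """Return a sorted list of (featureID, number of distinct values)."""
    distinct_values = defaultdict(set)
    for point in parsed_points:
        for feature_id, value in point:
            # A set keeps one copy of each value, like distinct() on (featureID, value) pairs
            distinct_values[feature_id].add(value)
    return sorted((fid, len(vals)) for fid, vals in distinct_values.items())

# Toy parsed rows in the same shape as parse_data_point's output
toy = [
    [(0, 'a'), (1, 'x'), (2, '')],
    [(0, 'b'), (1, 'x'), (2, '7')],
    [(0, 'a'), (1, 'y'), (2, '9')],
]
num_categories = count_categories(toy)
# num_categories == [(0, 2), (1, 2), (2, 3)]
```

Here `num_categories[2][1]` is 3, because feature 2 takes the values `''`, `'7'` and `'9'`. Repeated values such as `'a'` for feature 0 are counted once.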