# Click-Through Rate Prediction
### Notebook created by [Wenyi Xu](https://github.com/xuwenyihust)
### Create a [click-through rate](https://www.kaggle.com/c/criteo-display-ad-challenge) (CTR) prediction pipeline.

In [2]:
import numpy as np
from pyspark.mllib.linalg import SparseVector

### 1. Parse CTR Data

View Criteo's agreement.

In [4]:
from IPython.lib.display import IFrame

IFrame("http://labs.criteo.com/downloads/2014-kaggle-display-advertising-challenge-dataset/", 600, 350)

#### Load the data

In [6]:
# Hidden path
rawData = sc.textFile('/FileStore/tables/86sw321e1469469485636/dac_sample.txt').map(lambda x: x.replace('\t', ','))
print rawData.take(1)
print type(rawData)

#### Split the dataset into training, validation and test sets

Specify the weights and seed for the `randomSplit` method.

**Training : Validation : Test** will be **8 : 1 : 1**.

In [8]:
weights = [.8, .1, .1]
seed = 42

In [9]:
rawTrainData, rawValidationData, rawTestData = rawData.randomSplit(weights, seed)
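
To show why the resulting split sizes are only approximately 8 : 1 : 1, here is a plain-Python sketch of a weighted random split. It is an illustration of the idea, not Spark's implementation; the helper name `random_split` and the toy input `range(10000)` are assumptions made for this example.

```python
import random

def random_split(items, weights, seed):
    """Assign each item to one split with probability proportional to its weight."""
    total = float(sum(weights))
    # Cumulative upper bounds, e.g. roughly [0.8, 0.9, 1.0] for weights [.8, .1, .1]
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)

    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for item in items:
        draw = rng.random()
        for i, upper in enumerate(bounds):
            # The last split catches any draw left over by floating-point rounding
            if draw < upper or i == len(bounds) - 1:
                splits[i].append(item)
                break
    return splits

train, val, test = random_split(range(10000), [.8, .1, .1], 42)
sizes = (len(train), len(val), len(test))
```

Each item is assigned independently, so `sizes` is close to, but generally not exactly, (8000, 1000, 1000). Fixing the seed makes the assignment reproducible.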

**Cache** the split datasets, since we will use them repeatedly.

In [11]:
rawTrainData.cache()
rawValidationData.cache()
rawTestData.cache()

**Data preview**.

In [13]:
nTrain = rawTrainData.count()
nVal = rawValidationData.count()
nTest = rawTestData.count()
print nTrain, nVal, nTest, nTrain + nVal + nTest
print rawData.take(1)

#### Extract features

Split each data point (a string) into its fields.

Drop the first field -> **label** (clicked or not)

Save the remaining fields -> **features**

Define a *parsePoint* function that takes a data point (row) and returns a list of **(featureID, value)** tuples.

In [16]:
def parsePoint(point):
    """Converts a comma separated string into a list of (featureID, value) tuples.

    Note:
        featureIDs should start at 0 and increase to the number of features - 1.

    Args:
        point (str): A comma separated string where the first value is the label and the rest
            are features.

    Returns:
        list: A list of (featureID, value) tuples.
    """
    features = point.split(',')[1:]
    return list(enumerate(features))

In [17]:
parsedTrainFeat = rawTrainData.map(parsePoint)
print parsedTrainFeat.take(1)

We can see that the feature string has now been split into a list of (featureID, value) tuples.
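
The same parsing logic can be tried without Spark. The sketch below uses a made-up row (the name `parse_point` and both sample strings are assumptions for illustration) and also shows that a missing field becomes an empty string.

```python
def parse_point(point):
    """Drop the label (first field) and number the remaining fields from 0."""
    features = point.split(',')[1:]
    return list(enumerate(features))

# Hypothetical row: label first, then five feature values
parsed = parse_point('0,1,1,5,0,1382')

# Hypothetical row with a missing first feature
parsed_missing = parse_point('1,,7')
```

Here `parsed` is `[(0, '1'), (1, '1'), (2, '5'), (3, '0'), (4, '1382')]` and `parsed_missing` is `[(0, ''), (1, '7')]`, so an empty field is treated as its own category value.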

Count the **number of distinct values for each feature**.

In [19]:
numCategories = (parsedTrainFeat
                 # Flatten the per-row lists into one RDD of (featureID, value) pairs
                 .flatMap(lambda x: x)
                 # Keep a single copy of each (featureID, value) pair
                 .distinct()
                 # Replace each value with 1 so that pairs can be counted per featureID
                 .map(lambda x: (x[0], 1))
                 # Sum the 1s per featureID to get its number of distinct values
                 .reduceByKey(lambda x, y: x + y)
                 .sortByKey()
                 .collect())

print numCategories[2][1]
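
For intuition, the same count can be reproduced on a tiny in-memory dataset with plain Python. The rows below are invented for illustration, and `set` plus `Counter` stand in for `distinct()` and `reduceByKey()`.

```python
from collections import Counter

# Toy rows: each row is a list of (featureID, value) tuples
rows = [
    [(0, 'a'), (1, 'x')],
    [(0, 'b'), (1, 'x')],
    [(0, 'a'), (1, 'y')],
    [(0, 'c'), (1, 'x')],
]

# Flatten and de-duplicate the (featureID, value) pairs
distinct_pairs = set(pair for row in rows for pair in row)

# Count how many distinct values each featureID has
counts = Counter(feature_id for feature_id, _ in distinct_pairs)
num_categories = sorted(counts.items())
```

Feature 0 takes the values a, b and c, and feature 1 takes x and y, so `num_categories` is `[(0, 3), (1, 2)]`.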

### 2. Generate OHE Features

#### Construct an OHE dictionary

Categorical feature:  **(featureID, category)** tuple.

OHE dictionary:  **Map** each tuple **to** a distinct **integer**.

Function:

- Input the lists of (featureID, category) tuples.
- Flatten the lists to get all the unique (featureID, category) tuples.
- Attach a unique integer to each distinct feature to create the dictionary.

In [22]:
def createOneHotDict(inputData):
    """Creates a one-hot-encoder dictionary based on the input data.

    Args:
        inputData (RDD of lists of (int, str)): An RDD of observations where each observation is
            made up of a list of (featureID, value) tuples.

    Returns:
        dict: A dictionary where the keys are (featureID, value) tuples and map to values that are
            unique integers.
    """
    distinctFeats = (inputData
                       .flatMap(lambda row: row)
                       .distinct())
    
    return (distinctFeats
                           # Zips this RDD with its element indices
                           # (featureID, value) -> ((featureID, value), int)
                           .zipWithIndex()
                           # Return the key-value pairs in this RDD to the master as a dictionary.
                           .collectAsMap())
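
A plain-Python sketch of the same idea on toy data may help make the dictionary concrete. The function name `create_one_hot_dict` and the sample rows are assumptions for this example. The pairs are sorted here only to make the numbering deterministic; in the Spark version the integer given to each pair depends on how the RDD is partitioned, so only its uniqueness matters.

```python
def create_one_hot_dict(rows):
    """Map every distinct (featureID, value) pair to a unique integer."""
    distinct_pairs = sorted(set(pair for row in rows for pair in row))
    return dict((pair, index) for index, pair in enumerate(distinct_pairs))

# Toy rows: each row is a list of (featureID, value) tuples
rows = [
    [(0, 'mouse'), (1, 'black')],
    [(0, 'cat'), (1, 'tabby'), (2, 'mouse')],
    [(0, 'bear'), (1, 'black'), (2, 'salmon')],
]

toy_dict = create_one_hot_dict(rows)
```

The three rows contain seven distinct pairs, so `toy_dict` has seven entries numbered 0 to 6. Note that `(0, 'mouse')` and `(2, 'mouse')` get different integers because they belong to different features.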

Create an **OHE dictionary** based on **parsedTrainFeat**:

    [[(0, u'1'), (1, u'1'), (2, u'5'), (3, u'0'), (4, u'1382'), (5, u'4'), (6, u'15'), ...

In [24]:
ctrOHEDict = createOneHotDict(parsedTrainFeat)
numCtrOHEFeats = len(ctrOHEDict)
print 'Number of elements in the dictionary: ', numCtrOHEFeats

In [25]:
print "(0, '1'): ", ctrOHEDict[(0, '1')]
print "(0, '3'): ", ctrOHEDict[(0, '3')]
print "(1, '4'): ", ctrOHEDict[(1, '4')]
print "(3, '1'): ", ctrOHEDict[(3, '1')]
print "(9, '3'): ", ctrOHEDict[(9, '3')]

#### Define an OHE function

Use it to generate **OHE features** from the original categorical data.

The OHE features should be in **SparseVector** format to reduce storage and computation costs.

In [27]:
def oneHotEncoding(rawFeats, OHEDict, numOHEFeats):
    """Produce a one-hot-encoding from a list of features and an OHE dictionary.

    Note:
        You should ensure that the indices used to create a SparseVector are sorted.

    Args:
        rawFeats (list of (int, str)): The features corresponding to a single observation.  Each
            feature consists of a tuple of featureID and the feature's value. (e.g. sampleOne)
        OHEDict (dict): A mapping of (featureID, value) to unique integer.
        numOHEFeats (int): The total number of unique OHE features (combinations of featureID and
            value).

    Returns:
        SparseVector: A SparseVector of length numOHEFeats with indices equal to the unique
            identifiers for the (featureID, value) combinations that occur in the observation and
            with values equal to 1.0.
    """
    return SparseVector(numOHEFeats, [(OHEDict[(featID, value)],1) for (featID, value) in rawFeats])
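
Because the dictionary is built from the training data only, the lookup `OHEDict[(featID, value)]` raises a `KeyError` for any category that appears only in the validation or test set. One possible way to handle this, shown here as a plain-Python sketch and not as the notebook's own method, is to skip pairs that are missing from the dictionary. The helper name `one_hot_indices` and the toy dictionary are assumptions for illustration.

```python
def one_hot_indices(raw_feats, ohe_dict):
    """Return the sorted OHE indices that are active for one observation.

    Pairs that are not in the dictionary (unseen categories) are ignored.
    """
    return sorted(ohe_dict[pair] for pair in raw_feats if pair in ohe_dict)

# Toy dictionary mapping (featureID, value) pairs to unique integers
ohe_dict = {
    (0, 'bear'): 0, (0, 'cat'): 1, (0, 'mouse'): 2,
    (1, 'black'): 3, (1, 'tabby'): 4,
}

# Every pair is known: both indices are returned, in sorted order
seen = one_hot_indices([(1, 'black'), (0, 'mouse')], ohe_dict)

# (0, 'dog') never occurred in the dictionary's source data, so it is dropped
unseen = one_hot_indices([(0, 'dog'), (1, 'tabby')], ohe_dict)
```

Here `seen` is `[2, 3]` and `unseen` is `[4]`. In the Spark setting, such a list could be passed on as `SparseVector(numOHEFeats, indices, [1.0] * len(indices))`.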

#### Apply OHE to the dataset

For each data point (sample):

**data point -> one-hot encoded (categorical to numerical) -> sparse vector -> labeled point**

In [29]:
from pyspark.mllib.regression import LabeledPoint

In [30]:
def parseOHEPoint(point, OHEDict, numOHEFeats):
    """Obtain the label and feature vector for this raw observation.

    Note:
        You must use the function `oneHotEncoding` in this implementation or later portions
        of this lab may not function as expected.

    Args:
        point (str): A comma separated string where the first value is the label and the rest
            are features.
        OHEDict (dict of (int, str) to int): Mapping of (featureID, value) to unique integer.
        numOHEFeats (int): The number of unique features in the training dataset.

    Returns:
        LabeledPoint: Contains the label for the observation and the one-hot-encoding of the
            raw features based on the provided OHE dictionary.
    """
    return LabeledPoint(point.split(',')[0], oneHotEncoding(parsePoint(point), OHEDict, numOHEFeats))

In [31]:
OHETrainData = rawTrainData.map(lambda point: parseOHEPoint(point, ctrOHEDict, numCtrOHEFeats))
OHETrainData.cache()
print OHETrainData.take(1)

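To explain the format of the resulting data points, the next two cells build a SparseVector of size 10 with the value 1.0 at indices 1 and 6, and a LabeledPoint that pairs this vector with the label 0: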

In [33]:
SparseVector(10, [(1,1), (6,1)])

In [34]:
LabeledPoint(0, SparseVector(10, [(1,1), (6,1)]))
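
To make the sparse format concrete, the sketch below expands the same size and (index, value) pairs into an ordinary dense list. The helper `to_dense` is hypothetical and written in plain Python only for illustration.

```python
def to_dense(size, pairs):
    """Expand (index, value) pairs into a dense list of the given size."""
    dense = [0.0] * size
    for index, value in pairs:
        dense[index] = float(value)
    return dense

# Same size and pairs as the SparseVector example
dense = to_dense(10, [(1, 1), (6, 1)])
```

`dense` is `[0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]`: only positions 1 and 6 are non-zero, which is why storing just the indices and values is cheaper for one-hot encoded data.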

### Visualize the Feature Frequency