# **Click-Through Rate Prediction**
#### This exercise covers the steps for creating a click-through rate (CTR) prediction pipeline. 
#### **This exercise will cover:**
+ #### *Part 1:* Parse CTR data and generate one-hot-encoded (OHE) features
+ #### *Part 2:* CTR prediction and log loss evaluation
+ #### *Visualization:* ROC curve
 
#### For reference, the relevant Spark methods are documented in [Spark's Python API](https://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD) and the relevant NumPy methods in the [NumPy Reference](http://docs.scipy.org/doc/numpy/reference/index.html).

### **Part 1: Parse CTR data and generate OHE features**

#### **Data loading**

In [None]:
import os.path
baseDir = os.path.join('data')
inputPath = os.path.join('HandsOnML', 'dac_sample.txt')
fileName = os.path.join(baseDir, inputPath)

if os.path.isfile(fileName):
    rawData = (sc
               .textFile(fileName, 2)
               .map(lambda x: x.replace('\t', ',')))  # work with either ',' or '\t' separated data
    print rawData.take(1)

#### **Splitting the data**
#### 1. Split the data into training, validation, and test sets using the [randomSplit method](https://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.randomSplit) with the specified weights and seed, creating one RDD for each dataset.
#### 2. [Cache](https://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.cache) each of these RDDs.
#### 3. Compute the size of each dataset.
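To build intuition for a weighted random split, here is a minimal sketch in plain Python on a local list. It is not Spark's implementation of `randomSplit`; the function name `random_split` and the bucket-assignment scheme are assumptions made for illustration. Each item draws one uniform random number and lands in the bucket whose cumulative weight range contains it, so the resulting sizes are only approximately proportional to the weights.

```python
import random

def random_split(items, weights, seed):
    """Assign each item to exactly one bucket, with probability proportional to its weight."""
    rng = random.Random(seed)
    total = float(sum(weights))
    # Cumulative boundaries, roughly [0.6, 0.8, 1.0] for weights [.6, .2, .2]
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)
    buckets = [[] for _ in weights]
    for item in items:
        draw = rng.random()
        for i, bound in enumerate(bounds):
            # The last bucket catches any draw left over by floating-point rounding
            if draw < bound or i == len(bounds) - 1:
                buckets[i].append(item)
                break
    return buckets

train, val, test = random_split(list(range(1000)), [.6, .2, .2], seed=42)
print(len(train) + len(val) + len(test))  # 1000: every item lands in exactly one bucket
```

Fixing the seed makes the split reproducible, which is why the exercise passes a seed alongside the weights.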

In [None]:
# TODO: Replace <FILL IN> with appropriate code
weights = [.6, .2, .2]
seed = 42
# Use randomSplit with weights and seed
rawTrainData, rawValidationData, rawTestData = <FILL IN>
# Cache the data because we need to use these datasets later
rawTrainData.<FILL IN>
rawValidationData.<FILL IN>
rawTestData.<FILL IN>

nTrain = <FILL IN>
nVal = <FILL IN>
nTest = <FILL IN>
print nTrain, nVal, nTest, nTrain + nVal + nTest
print rawData.take(1)

#### **Creating the OHE dictionary**
#### Generate a dictionary that maps each distinct (featureID, value) pair in the raw training data to a unique integer. Ignore the first field (the 0-1 label) and parse the remaining fields (the raw features).
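The parsing step can be illustrated on a single string without Spark. The sketch below is an assumption-laden toy: `parse_line` and the sample line are hypothetical, and an empty field is simply kept as the value `''`.

```python
def parse_line(line):
    """Drop the label (first field) and pair each remaining field with its position."""
    fields = line.split(',')
    return [(feature_id, value) for feature_id, value in enumerate(fields[1:])]

sample = '0,1,,a7c,b2'  # hypothetical line: label 0 followed by four raw features
print(parse_line(sample))  # [(0, '1'), (1, ''), (2, 'a7c'), (3, 'b2')]
```

Note that the featureIDs start at 0 for the first field after the label, so the same value appearing in two different columns yields two different (featureID, value) pairs.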

In [None]:
# First we need an RDD in which each element is a list of (featureID, value) tuples from the training data
# TODO: Replace <FILL IN> with appropriate code
def parseData(data):
    """Converts a comma separated string into a list of (featureID, value) tuples.

    Note:
        featureIDs should start at 0 and increase to the number of features - 1.

    Args:
        data (str): A comma separated string where the first value is the label and the rest
            are features.

    Returns:
        list: A list of (featureID, value) tuples.
    """
    
    data_chain = <FILL IN>
    
    fv_tuple = <FILL IN>
    
    return fv_tuple
    
parsedTrainFeat = <FILL IN>
print parsedTrainFeat.take(1)

In [None]:
# Create an OHE dictionary from the RDD of (featureID, value) lists
# produced by parseData above

In [None]:
##### OHE dictionary sample : 
#cityOHEDict[(0,'Beijing')] = 0
#cityOHEDict[(0,'Paris')] = 1
#cityOHEDict[(0,'London')] = 2
#cityOHEDict[(0,'New York')] = 3
#cityOHEDict[(1, 'Asia')] = 4
#cityOHEDict[(1, 'Europe')] = 5
#cityOHEDict[(1, 'American')] = 6
#cityOHEDict[(2, 'very much')] = 7
#cityOHEDict[(2, 'a little')] = 8
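A dictionary like the sample above can be built from a small local list as follows. This is only a sketch: `build_ohe_dict` is a hypothetical helper, and sorting the distinct pairs is one arbitrary way to make the index assignment deterministic (any one-to-one assignment would serve).

```python
def build_ohe_dict(observations):
    """Map every distinct (featureID, value) pair to a unique integer index."""
    distinct = set()
    for obs in observations:
        distinct.update(obs)
    # Sorting fixes the order so repeated runs give the same indices
    return dict((pair, index) for index, pair in enumerate(sorted(distinct)))

observations = [
    [(0, 'Beijing'), (1, 'Asia')],
    [(0, 'Paris'), (1, 'Europe')],
    [(0, 'London'), (1, 'Europe')],
]
ohe_dict = build_ohe_dict(observations)
print(len(ohe_dict))  # 5 distinct (featureID, value) pairs
```

The number of keys equals the length of the one-hot-encoded feature vector, since each distinct pair gets its own position.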

In [None]:
# TODO: Replace <FILL IN> with appropriate code
def createOneHotDictionary(inputData):
    """Creates a one-hot-encoder dictionary based on the input data.

    Args:
        inputData (RDD of lists of (int, str)): An RDD of observations where each observation is
            made up of a list of (featureID, value) tuples.

    Returns:
        dictionary: A dictionary where the keys are (featureID, value) tuples and map to values that are
            unique integers.
    """
    
    DistinctFeatures = <FILL IN>
    
    OHEDictionary = <FILL IN>
    
    return OHEDictionary
OHEDictionary = <FILL IN>
numCtrOHEFeats = len(OHEDictionary.keys())
print numCtrOHEFeats

In [None]:
# We now have 190509 OHE features in total

#### **Transform the training data into an RDD of LabeledPoint**
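The core of one-hot encoding an observation is a dictionary lookup that skips unknown pairs. The sketch below uses a hypothetical helper `one_hot_indices` and a toy dictionary; it returns only the sorted indices, which, together with values of 1.0 and the total feature count, are the kind of arguments a `SparseVector(size, indices, values)` call expects.

```python
def one_hot_indices(raw_feats, ohe_dict):
    """Return the sorted OHE indices of the (featureID, value) pairs found in the dictionary."""
    return sorted(ohe_dict[pair] for pair in raw_feats if pair in ohe_dict)

ohe_dict = {(0, 'Beijing'): 0, (0, 'Paris'): 1, (1, 'Asia'): 2, (1, 'Europe'): 3}
# (0, 'Tokyo') is not in the dictionary, so it is silently dropped
print(one_hot_indices([(1, 'Europe'), (0, 'Tokyo')], ohe_dict))  # [3]
print(one_hot_indices([(1, 'Asia'), (0, 'Beijing')], ohe_dict))  # [0, 2]
```

Dropping unseen pairs means a validation or test observation may have fewer active features than a training observation, but the vector length stays the same.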

In [None]:
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.linalg import SparseVector

In [None]:
# TODO: Replace <FILL IN> with appropriate code
def oneHotEncoding(rawFeats, OHEDictionary):
    """Produce a 1-of-k encoding from a list of features and an 1-of-k dictionary.

    Note:
        You should ensure that the indices used to create a SparseVector are sorted.
        !Important!:Because some categorical values will likely appear in validation/Test data 
        that did not exist in the training data. 
        To deal with this situation, ignoring unseen categories in validation/test data

    Args:
        rawFeats (list of (int, str)): The features corresponding to a single observation.Each
            feature consists of a tuple of featureID and the feature's value.
        OHEDictionary: A mapping of (featureID, value) to unique integer.

    Returns:
        SparseVector: A SparseVector of length numOHEFeats with indicies equal to the unique
            identifiers for the (featureID, value) combinations that occur in the observation and
            with values equal to 1.0.
    """
    return <FILL IN>

def getLabeledPoint(data, OHEDictionary, numOHEFeats):
    """Obtain the label and feature vector for this raw observation.

    Note:
        You must use the function `oneHotEncoding` in this implementation.
    Args:
        data (str): A comma separated string where the first value is the label and the rest
            are features.
        OHEDictionary (dictionary of (int, str) to int): Mapping of (featureID, value) to unique integer.
        numOHEFeats (int): The number of unique features in the training dataset.

    Returns:
        LabeledPoint: Contains the label for the observation and the one-hot-encoding of the
            raw features based on the provided OHE dictionary.
    """
    
    parsedFeat = <FILL IN>
    
    sparseVector = <FILL IN>
    
    return <FILL IN>

OHETrainData = rawTrainData.map(lambda data: getLabeledPoint(data, OHEDictionary, numCtrOHEFeats))
OHETrainData.cache()
print OHETrainData.take(1)

OHEValidationData = rawValidationData.map(lambda data: getLabeledPoint(data, OHEDictionary, numCtrOHEFeats))
OHEValidationData.cache()
print OHEValidationData.take(1)

OHETestData = rawTestData.map(lambda data: getLabeledPoint(data, OHEDictionary, numCtrOHEFeats))
OHETestData.cache()
print OHETestData.take(1)

### **Part 2: CTR prediction and logloss evaluation**

#### **Log loss**
#### Throughout this exercise, we will use log loss to evaluate the quality of models.  Log loss is defined as: $$  \begin{align} \scriptsize \ell_{log}(p, y) = \begin{cases} -\log (p) & \text{if } y = 1 \\\ -\log(1-p) & \text{if } y = 0 \end{cases} \end{align} $$ where $ \scriptsize p$ is a probability between 0 and 1 and $ \scriptsize y$ is a label of either 0 or 1.
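A few worked values make the definition concrete. This sketch clamps $p$ into $[\epsilon, 1 - \epsilon]$, which is one way to avoid evaluating $\log(0)$; the helper name `log_loss` and the default epsilon are assumptions chosen for illustration.

```python
from math import log

def log_loss(p, y, epsilon=1e-11):
    """Log loss for probability p and label y, with p nudged away from 0 and 1."""
    p = min(max(p, epsilon), 1.0 - epsilon)
    return -log(p) if y == 1 else -log(1.0 - p)

print(round(log_loss(0.5, 1), 4))  # 0.6931, i.e. log(2)
print(round(log_loss(0.9, 1), 4))  # 0.1054: confident and right, small loss
print(round(log_loss(0.9, 0), 4))  # 2.3026: confident and wrong, large loss
```

The loss grows without bound as a confident prediction turns out to be wrong, which is why the clamping step matters for predictions of exactly 0 or 1.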

In [None]:
# TODO: Replace <FILL IN> with appropriate code
from math import log

def computeLogLoss(p, y):
    """Calculates the value of log loss for a given probabilty and label.

    Note:
        log(0) is undefined, so when p is 0 we need to add a small value (epsilon) to it
        and when p is 1 we need to subtract a small value (epsilon) from it.

    Args:
        p (float): A probability between 0 and 1.
        y (int): A label.  Takes on the values 0 and 1.

    Returns:
        float: The log loss value.
    """
    epsilon = 10e-12
    
    if y == 1.:
        return <FILL IN>
    if y == 0.:
        return <FILL IN>

#### **Predicted probability**
#### To compute the log loss for a trained model, we need code that generates predictions from it. Write a function that computes the raw linear prediction (t = x.dot(w) + intercept) of a logistic regression model and then passes it through a [sigmoid function](http://en.wikipedia.org/wiki/Sigmoid_function) $ \scriptsize \sigma(t) = (1+ e^{-t})^{-1} $ to return the model's probabilistic prediction. Then compute probabilistic predictions on the training data.
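The prediction formula can be traced by hand on a tiny example. In this sketch the observation is represented as a plain list of active indices rather than a SparseVector, and `sigmoid_probability` and the weights are hypothetical; because one-hot features all have the value 1.0, the dot product reduces to a sum of the selected weights.

```python
from math import exp

def sigmoid_probability(active_indices, weights, intercept):
    """Probability for a one-hot observation given as a list of active indices."""
    # With all feature values equal to 1.0, x.dot(w) is a sum of selected weights
    raw = sum(weights[i] for i in active_indices) + intercept
    raw = max(min(raw, 20.0), -20.0)  # bound the raw prediction for numerical purposes
    return 1.0 / (1.0 + exp(-raw))

weights = [0.5, -1.0, 2.0]
print(sigmoid_probability([], weights, 0.0))                 # 0.5: raw prediction is 0
print(round(sigmoid_probability([0, 2], weights, -0.5), 4))  # 0.8808: raw prediction is 2.0
```

Bounding the raw prediction keeps `exp` from overflowing and keeps the output strictly between 0 and 1.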

In [None]:
# TODO: Replace <FILL IN> with appropriate code
from math import exp #  exp(-t) = e^-t

def getProbability(x, w, intercept):
    """Calculate the probability for an observation given a set of weights and intercept.

    Note:
        We'll bound our raw prediction between 20 and -20 for numerical purposes.

    Args:
        x (SparseVector): A vector with values of 1.0 for features that exist in this
            observation and 0.0 otherwise.
        w (DenseVector): A vector of weights (betas) for the model.
        intercept (float): The model's bias.

    Returns:
        float: A probability between 0 and 1.
    """
    rawPrediction = <FILL IN>

    # Bound the raw prediction value
    
    rawPrediction = min(rawPrediction, 20)
    rawPrediction = max(rawPrediction, -20)
    
    return <FILL IN>

#### **Evaluate the model**
#### Now we write a general function that takes a model and a dataset as input and returns the mean log loss over that dataset.
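The aggregation itself is just an average of per-observation losses. The sketch below works on a local list of (label, probability) pairs instead of an RDD; `mean_log_loss`, the epsilon default, and the sample pairs are assumptions for illustration.

```python
from math import log

def mean_log_loss(labels_and_probs, epsilon=1e-11):
    """Average log loss over a list of (label, predicted probability) pairs."""
    total = 0.0
    for y, p in labels_and_probs:
        p = min(max(p, epsilon), 1.0 - epsilon)  # avoid log(0)
        total += -log(p) if y == 1 else -log(1.0 - p)
    return total / len(labels_and_probs)

pairs = [(1, 0.8), (0, 0.3), (1, 0.6), (0, 0.1)]
print(round(mean_log_loss(pairs), 3))  # about 0.299
```

As a point of comparison, a model that always predicts 0.5 scores log(2), roughly 0.693, regardless of the labels.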

In [None]:
# TODO: Replace <FILL IN> with appropriate code
def computeLogLossOfModel(model, data):
    """Calculates the log loss for the data given the model.

    Args:
        model (LogisticRegressionModel): A trained logistic regression model.
        data (RDD of LabeledPoint): Labels and features for each observation.

    Returns:
        float: Log loss for the data.
    """
    
    probability_label_tuple = <FILL IN>
    
    loss = <FILL IN>
    
    return loss.mean()

#### **Training the model with logistic regression**
#### Use [LogisticRegressionWithSGD](https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.classification.LogisticRegressionWithSGD) to train several models on `OHETrainData`, one for each hyperparameter configuration below. Keep the model with the lowest validation log loss as the best model.
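The selection logic is a plain grid search: try every combination and remember the one with the lowest loss. The sketch below is independent of Spark; `grid_search` is a hypothetical helper, and `toy_validation_loss` is a made-up stand-in for "train a model, then compute its validation log loss".

```python
from itertools import product

def grid_search(step_sizes, reg_params, evaluate):
    """Return (best_loss, best_step_size, best_reg_param) over all combinations."""
    best = (float('inf'), None, None)
    for step_size, reg_param in product(step_sizes, reg_params):
        loss = evaluate(step_size, reg_param)
        if loss < best[0]:
            best = (loss, step_size, reg_param)
    return best

# Hypothetical stand-in for training a model and scoring it on validation data
def toy_validation_loss(step_size, reg_param):
    return abs(step_size - 10) * 0.01 + reg_param

best_loss, best_step, best_reg = grid_search([1, 10], [1e-6, 1e-3], toy_validation_loss)
print(best_step)  # 10
print(best_reg)   # 1e-06
```

Scoring each candidate on the validation set rather than the training set keeps the test set untouched until the final evaluation.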

In [None]:
from pyspark.mllib.classification import LogisticRegressionWithSGD

numIters = 80
regType = 'l2'
includeIntercept = True

# Initialize variables
bestModel = None
bestLogLoss = float('inf')

In [None]:
# TODO: Replace <FILL IN> with appropriate code
stepSizes = [1,10]
regParams = [1e-6,1e-3]
for stepSize in stepSizes:
    for regParam in regParams:
        model = <FILL IN>
        logLossVa = <FILL IN>
        print ('\tstepSize = {0:.1f}, regParam = {1:.0e}: logloss = {2:.3f}'
               .format(stepSize, regParam, logLossVa))
        if (logLossVa < bestLogLoss):
            bestModel = <FILL IN>
            bestLogLoss = <FILL IN>

#### **Visualization: ROC curve**
#### The ROC curve is a graphical plot that illustrates the performance of a binary classifier as its discrimination threshold is varied.
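The points on the curve can be computed by hand for a tiny example. This sketch sorts hypothetical (label, score) pairs by descending score and treats each position as a threshold, accumulating true and false positives along the way; `roc_points` is an illustrative helper and assumes both classes are present.

```python
def roc_points(labels_and_scores):
    """Return (falsePositiveRate, truePositiveRate) lists, one point per threshold."""
    ranked = sorted(labels_and_scores, key=lambda pair: pair[1], reverse=True)
    num_pos = sum(1 for label, _ in ranked if label == 1)
    num_neg = len(ranked) - num_pos
    fpr, tpr = [], []
    true_pos = false_pos = 0
    for label, _ in ranked:
        # Lowering the threshold past this score predicts one more positive
        if label == 1:
            true_pos += 1
        else:
            false_pos += 1
        tpr.append(true_pos / float(num_pos))
        fpr.append(false_pos / float(num_neg))
    return fpr, tpr

fpr, tpr = roc_points([(1, 0.9), (0, 0.8), (1, 0.7), (0, 0.2)])
print(fpr)  # [0.0, 0.5, 0.5, 1.0]
print(tpr)  # [0.5, 0.5, 1.0, 1.0]
```

A curve that hugs the top-left corner indicates that positives tend to receive higher scores than negatives; the diagonal corresponds to random guessing.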

In [None]:
import matplotlib.pyplot as plt
import numpy as np
labelsAndScores = OHEValidationData.map(lambda lp:
                                            (lp.label, getProbability(lp.features, bestModel.weights, bestModel.intercept)))
labelsAndWeights = labelsAndScores.collect()
labelsAndWeights.sort(key=lambda (k, v): v, reverse=True)
labelsByWeight = np.array([k for (k, v) in labelsAndWeights])

length = labelsByWeight.size
truePositives = labelsByWeight.cumsum()
numPositive = truePositives[-1]
falsePositives = np.arange(1.0, length + 1, 1.) - truePositives

truePositiveRate = truePositives / numPositive
falsePositiveRate = falsePositives / (length - numPositive)

def preparePlot(xticks, yticks, figsize=(10.5, 6), hideLabels=False, gridColor='#999999',
                gridWidth=1.0):
    """Template for generating the plot layout."""
    plt.close()
    fig, ax = plt.subplots(figsize=figsize, facecolor='white', edgecolor='white')
    ax.axes.tick_params(labelcolor='#999999', labelsize='10')
    for axis, ticks in [(ax.get_xaxis(), xticks), (ax.get_yaxis(), yticks)]:
        axis.set_ticks_position('none')
        axis.set_ticks(ticks)
        axis.label.set_color('#999999')
        if hideLabels: axis.set_ticklabels([])
    plt.grid(color=gridColor, linewidth=gridWidth, linestyle='-')
    for position in ['bottom', 'top', 'left', 'right']:
        ax.spines[position].set_visible(False)
    return fig, ax

# Generate layout and plot data
fig, ax = preparePlot(np.arange(0., 1.1, 0.1), np.arange(0., 1.1, 0.1))
ax.set_xlim(-.05, 1.05), ax.set_ylim(-.05, 1.05)
ax.set_ylabel('True Positive Rate')
ax.set_xlabel('False Positive Rate')
plt.plot(falsePositiveRate, truePositiveRate, color='#8cbfd0', linestyle='-', linewidth=3.)
plt.plot((0., 1.), (0., 1.), linestyle='--', color='#d6ebf2', linewidth=2.)  # Baseline model
pass

#### **Evaluate on the test set**
#### Finally, evaluate the best model on the test set.

In [None]:
# TODO: Replace <FILL IN> with appropriate code
# Log loss for the best model on the test dataset
logLossTest = <FILL IN>
print ('Test Dataset Log Loss:\n\tlogLossTest = {0:.3f}'
       .format(logLossTest))

sortedWeights = sorted(bestModel.weights)
print sortedWeights[:5], bestModel.intercept

In [None]:
labelsAndScores = OHETestData.map(lambda lp:
                                            (lp.label, getProbability(lp.features, bestModel.weights, bestModel.intercept)))
labelsAndWeights = labelsAndScores.collect()
labelsAndWeights.sort(key=lambda (k, v): v, reverse=True)
labelsByWeight = np.array([k for (k, v) in labelsAndWeights])

length = labelsByWeight.size
truePositives = labelsByWeight.cumsum()
numPositive = truePositives[-1]
falsePositives = np.arange(1.0, length + 1, 1.) - truePositives

truePositiveRate = truePositives / numPositive
falsePositiveRate = falsePositives / (length - numPositive)

# Generate layout and plot data
fig, ax = preparePlot(np.arange(0., 1.1, 0.1), np.arange(0., 1.1, 0.1))
ax.set_xlim(-.05, 1.05), ax.set_ylim(-.05, 1.05)
ax.set_ylabel('True Positive Rate')
ax.set_xlabel('False Positive Rate')
plt.plot(falsePositiveRate, truePositiveRate, color='#8cbfd0', linestyle='-', linewidth=3.)
plt.plot((0., 1.), (0., 1.), linestyle='--', color='#d6ebf2', linewidth=2.)  # Baseline model
pass