## Logistic regression

Today's second exercise will involve classification using logistic regression with <a href="http://spark.apache.org/docs/latest/mllib-guide.html">MLlib</a>.

This exercise is divided into three parts:
+ #### 1. Importing and preparing the data
+ #### 2. Logistic regression
+ #### 3. Evaluating the results

<br>
In the following exercises, you will need to replace parts of the code in each cell that starts with the following comment: "#Replace the `<INSERT>`"

To go through the notebook, fill in each `<INSERT>` with appropriate code. 
To run a cell, press Shift-Enter to run it and advance to the next cell, or Ctrl-Enter to run it and stay on the same cell. Do the exercises from top to bottom, because later cells may depend on code in earlier cells.

## Description of the data set
In this exercise, we will utilize the <a href="https://www.kaggle.com/c/stumbleupon">StumbleUpon Evergreen Classification Challenge</a> data set:

>StumbleUpon is a user-curated web content discovery engine that recommends relevant, high quality pages and media to its users, based on their interests. While some pages we recommend, such as news articles or seasonal recipes, are only relevant for a short period of time, others maintain a timeless quality and can be recommended to users long after they are discovered. In other words, pages can either be classified as "ephemeral" or "evergreen".

We will try to accurately predict a page as either "ephemeral" or "evergreen", i.e. having long lasting value to the users.

If we wanted to run these lines as part of a script, we would need to create a Spark context ourselves:

In [1]:
# from pyspark import SparkContext, StorageLevel
# from pyspark.sql import SQLContext
# sc = SparkContext(master="local[*]")
# sqlContext = SQLContext(sc)

In [2]:
#Helper functions to check results
import numpy as np

def check(x,y,label):
    if(x == y):
        print("Yay, "+label+" is correct!")
    else:
        print("Nay, "+label+" is incorrect, please try again!")

def checkArray(x,y,label):
    if np.allclose(x,y):
        print("Yay, "+label+" is correct!")
    else:
        print("Nay, "+label+" is incorrect, please try again!")

## 1. Importing and preparing the data

The data set is saved as a text file named "evergreen.tsv" in the <a href="https://en.wikipedia.org/wiki/Tab-separated_values">tab separated values (TSV)</a> format, without a header. We want to read this text file into an RDD, one line per element. Each line consists of the fields described in the <a href="https://www.kaggle.com/c/stumbleupon/data">data specification</a>.
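
To make the format concrete, here is a minimal sketch in plain Python (no Spark needed) of how a single TSV line breaks down into fields. The line itself is a made-up example, and we assume, as the later cells do, that every field is wrapped in double quotes.

```python
# Hypothetical line: tab-separated fields, each wrapped in double quotes.
line = '"http://example.com/page"\t"4042"\t"business"\t"0.789131"\t"?"'

# Split on tabs, then strip the surrounding quotes from every field.
fields = [field.strip('"') for field in line.split("\t")]

print(len(fields))  # number of fields in this toy line
print(fields)
```

Note that a missing value shows up as the literal string "?" rather than as an empty field.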

### 1.1 Creating the RDD
Read in the TSV file as an RDD:

In [None]:
rawLines = sc.textFile('evergreen.tsv').cache()

### 1.2 Checking out the RDD

Run the following code to get a feel for the data set:

In [None]:
numObs = rawLines.count()
sampleObs = rawLines.take(1)
sampleObs = sampleObs[0]
numFeatures = len(sampleObs.split('\t'))-1
print("The number of observations: "+str(numObs))
print("The number of features: "+str(numFeatures))
print("One observation: "+ str(sampleObs))
print(len(sampleObs))

In [None]:
#Check if the observations are correct
check(numObs, 7395, "the number of observations")
check(numFeatures, 26, "the number of features")
check(len(sampleObs), 6803, "the first observation")

### 1.2 Checking out the RDD some more
Features number 4, 5, 18, and 21 in the eighth observation have missing values, represented by the string "?":

In [None]:
sampleObs = rawLines.take(8)[7]
print(sampleObs)
sampleObsVector = sampleObs.split('\t')
print("\nThe missing values from features 4, 5, 18, and 21: "+str([sampleObsVector[i] for i in [3,4,17,20]]))

Feature number 4 seems to be categorical:

In [None]:
feature4 = rawLines.map(lambda x:x.split("\t")[3].strip("\""))
print(feature4.take(2**5))

### 1.3 Parsing the vectors
This time we will hold off on creating the LabeledPoints, because the data needs some preprocessing first. First, we will remove the first three features: the URL of the page, the page ID, and the JSON representing the text of the page. 

The URL may contain some information, i.e. pages with similar addresses may have similar evergreen qualities, but we will skip it for this exercise. The page ID should not contain any information, as it is only used for identifying the page, although it could contain some structure, e.g. if the distribution of evergreen pages is not uniform over time. We will return to the JSON feature later in this exercise.

Implement the function below so that it parses a line of the TSV file and returns a list of unicode tokens corresponding to features 4 to 27. Remove the leading and trailing quotes (") using strip().

In [None]:
#Replace the <INSERT>
def parseObsVec(line):
    """Creates a vector of features 4 to 27 from a line in the input file.

    Args:
        line (str): A line from the input TSV file.

    Returns:
        vector: A list of unicode strings holding features 4 to 27, with the quotation marks removed.
    """
    wholeLine = line.split("\t")
    vector = <INSERT>
    return vector

In [None]:
parsedVec = rawLines.map(lambda x: parseObsVec(x))
exampleVec = parsedVec.take(1)[0]
print(exampleVec)

In [None]:
#Check if the parsing function is correct
check(exampleVec, [u'business', u'0.789131', u'2.055555556', u'0.676470588', u'0.205882353', u'0.047058824', 
                   u'0.023529412', u'0.443783175', u'0', u'0', u'0.09077381', u'0', u'0.245831182', u'0.003883495',
                   u'1', u'1', u'24', u'0',u'5424', u'170', u'8', u'0.152941176', u'0.079129575', u'0'], 
      "the parsing function")

### 1.4 Creating a list of categories

To map the categories to one-hot encoded (OHE) features, we first need the list of distinct categories. The category is the first element of each vector in parsedVec.

Create a list of the different categories from this first item in parsedVec:

In [None]:
#Replace the <INSERT>
#Extract the first element
parsedCat= parsedVec.<INSERT>
#Collect all the distinct elements
listOfCat = parsedCat.<INSERT>
print(listOfCat)

In [None]:
#Check if the list of categories is correct
check(listOfCat, [u'recreation', u'business', u'computer_internet', u'culture_politics', u'law_crime', u'health', u'?',
                   u'gaming', u'unknown', u'science_technology', u'sports', u'religion', u'weather',
                   u'arts_entertainment'], 
      "the list of categories")

### 1.5 Creating a dictionary for the categories

After creating the list we need to create a dictionary mapping the categories to indices. 

As you may have noticed, one category is "?". It could be imputed by replacing it with the most frequent category or with the "unknown" category, but because this is a categorical feature, we can simply treat "?" as a category of its own.

Create such a dictionary below:

In [None]:
OHEDict ={}
for i in range(0, len(listOfCat)):
    OHEDict[listOfCat[i]] = i
print(OHEDict)

In [None]:
#Check if the dictionary of categories is correct
check(OHEDict, {u'gaming': 7, u'recreation': 0, u'business': 1, u'computer_internet': 2, u'unknown': 8,
                u'culture_politics': 3, u'science_technology': 9, u'law_crime': 4, u'sports': 10, u'religion': 
                11, u'weather': 12, u'health': 5, u'?': 6, u'arts_entertainment': 13}, 
      "the dictionary of categories")

### 1.6 Extending the feature vector with one-hot encoded categories
We now need to create the OHE features, i.e. extend the feature vector with a vector consisting of 13 zeros and a single one.
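
As a small illustration of the idea, the sketch below one-hot encodes a toy row in plain Python. The three-category mapping, the row, and the helper name `one_hot` are all made up for this example; the real mapping has 14 categories.

```python
def one_hot(category, index_of):
    """Return a list of zeros with a single 1.0 at the category's index."""
    encoded = [0.0] * len(index_of)
    encoded[index_of[category]] = 1.0
    return encoded

# Hypothetical three-category mapping, much smaller than the real one.
toy_index = {"sports": 0, "health": 1, "?": 2}

toy_row = ["health", "0.5", "12"]
# Drop the categorical column and append its one-hot encoding at the end.
encoded_row = toy_row[1:] + one_hot(toy_row[0], toy_index)
print(encoded_row)  # ['0.5', '12', 0.0, 1.0, 0.0]
```

The missing-value marker "?" gets its own position, so no imputation is needed for the categorical feature.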

Implement the following function that extends the feature vector with OHE features. Convert the category, which is the first element of rawVector, to its one-hot encoding and append that encoding to the end of rawVector.

In [None]:
#Replace the <INSERT>
def oneHotEncoding(rawVector, OHEDict):
    """Extends the feature vector with binary OHE features using a dictionary.

    Args:
        rawVector (list of str): The features corresponding to a single observation.
        OHEDict (dict): A mapping of the categories to a unique integer.

    Returns:
        vector: A list holding features 5 to 27, extended with the 14 binary OHE features.
    """
    catVector = np.zeros(len(OHEDict))
    catVector[<INSERT>] = 1
    rawVector.<INSERT>
    vector = rawVector[1:]
    return vector

In [None]:
parsedOHE = parsedVec.map(lambda x:  oneHotEncoding(x,OHEDict))
sampleOHE = parsedOHE.take(1)[0]
print(sampleOHE)

In [None]:
#Check if the OHE features are correct
check(sampleOHE, [u'0.789131', u'2.055555556', u'0.676470588', u'0.205882353', u'0.047058824', u'0.023529412',
                u'0.443783175', u'0', u'0', u'0.09077381', u'0', u'0.245831182', u'0.003883495', u'1', u'1',
                u'24', u'0', u'5424', u'170', u'8', u'0.152941176', u'0.079129575', u'0', 0.0, 1.0, 0.0, 0.0,
                0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], 
      "the OHE features")

### 1.7 Impute real valued feature
The first feature of the current feature vector is real valued and has missing values, represented by "?" as seen before. To be able to use the feature, we need to impute the missing values. One way of doing this is to replace the missing values with the mean of the non-missing values.
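
The sketch below shows mean imputation on a single toy column in plain Python. The values are made up; only the technique carries over to the RDD version.

```python
# Hypothetical column of string values where "?" marks a missing entry.
column = ["0.2", "?", "0.6", "0.4", "?"]

# Average only the entries that are present.
present = [float(v) for v in column if v != "?"]
mean = sum(present) / len(present)

# Replace each missing entry with that mean; leave the others untouched.
imputed = [mean if v == "?" else v for v in column]
print(imputed)
```

As in the notebook, imputed entries end up as floats while the untouched entries remain strings; they are all converted to numbers later on.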

Implement the following function that takes the RDD of feature vectors and replaces each "?" in the first feature with the mean of that feature:

In [None]:
#Replace the <INSERT>
def imputeMean(vector):
    """Imputes the missing values of the first feature with the mean

    Args:
        vector (RDD of list): The feature vectors, one list per observation.

    Returns:
        RDD: The same feature vectors, with the first feature imputed
    """
    meanVec = vector.map(<INSERT>).filter(<INSERT>)
    mean = meanVec.map(lambda x:  float(x)).<INSERT>
    vector = vector.map(lambda x: x if x[0] != "?" else [mean]+x[1:])
    return vector

In [None]:
parsedMean = imputeMean(parsedOHE)
sampleMean = parsedMean.take(8)[7]
print(sampleMean)

In [None]:
#Check if the real value imputation is correct
check(sampleMean, [0.603334316623788, u'1.883333333', u'0.71969697', u'0.265151515', u'0.113636364', u'0.015151515', 
                   u'0.49934811', u'0', u'0', u'0.02661597', u'0', u'0.173745927', u'0.025830258', u'?', u'0', u'5',
                   u'?', u'27656', u'132', u'4', u'0.068181818', u'0.148550725', u'0', 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 
                   1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], 
      "the mean imputation")

### 1.8 Impute binary feature
The 14th and 17th features of the current feature vector are binary valued and have missing values, represented by "?" as seen before. To use these features, we need a method to impute the missing values.

One way of doing this is to replace the missing values with the mode of the non-missing values, i.e. the most frequent value. We can use the fact that the mode of a binary feature is its mean rounded to the nearest integer.
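
To see why rounding the mean gives the mode, note that the mean of a 0/1 feature is the fraction of ones. A quick check on made-up values:

```python
from collections import Counter

# Hypothetical binary column with the missing entries already filtered out.
values = [1, 0, 1, 1, 0, 1, 1]

mean = sum(values) / float(len(values))  # fraction of ones, here 5/7
mode_from_mean = round(mean)             # 1 when more than half are ones
mode_from_counts = Counter(values).most_common(1)[0][0]

print(mean, mode_from_mean, mode_from_counts)
```

If exactly half of the values are ones, both 0 and 1 are modes, and the result then depends on how `round` treats 0.5.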

The following function (which is very similar to the imputeMean function) takes the RDD of feature vectors and replaces each "?" in the feature at position index with the mode of that feature:

In [None]:
def imputeMode(vector,index):
    """Imputes the missing values of the feature at position index with the mode (for binary features)

    Args:
        vector (RDD of list): The feature vectors, one list per observation.
        index (integer): The zero-based position of the binary feature to impute.

    Returns:
        RDD: The same feature vectors, with the feature at position index imputed
    """
    meanVec = vector.map(lambda x:  x[index]).filter(lambda x:  x != "?")
    mean = meanVec.map(lambda x:  float(x)).mean()
    mode = round(mean)
    vector = vector.map(lambda x: x if x[index] != "?" else x[0:index]+[mode]+x[index+1:])
    return vector

In [None]:
parsedMode1 = imputeMode(parsedMean,13)
parsedMode2 = imputeMode(parsedMode1,16)
sampleMode = parsedMode2.take(8)[7]
print(sampleMode)

In [None]:
#Check if the binary mode imputation is correct
check(sampleMode, [0.603334316623788, u'1.883333333', u'0.71969697', u'0.265151515', u'0.113636364', u'0.015151515',
                   u'0.49934811', u'0', u'0', u'0.02661597', u'0', u'0.173745927', u'0.025830258', 1.0, u'0', u'5',
                   0.0, u'27656', u'132', u'4', u'0.068181818', u'0.148550725', u'0', 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
                   1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], 
      "the mode imputation")

### 1.9 Remove the label from the vector
To scale and normalize the features, we need to remove the labels from the feature vectors. The label is the 23rd element of the vector.

Separate the labels from the feature vectors:

In [None]:
labels = parsedMode2.map(lambda x: x[22])
features = parsedMode2.map(lambda x: x[0:22]+x[23:])
sampleLabel = labels.take(1)[0]
sampleFeatures = features.take(1)[0]
print(sampleLabel)
print(sampleFeatures)

In [None]:
#Check if the labels and feature vectors are correct
check(sampleLabel, "0", "the labels")
check(sampleFeatures, [u'0.789131', u'2.055555556', u'0.676470588', u'0.205882353', u'0.047058824', u'0.023529412', 
                   u'0.443783175', u'0', u'0', u'0.09077381', u'0', u'0.245831182', u'0.003883495', u'1', u'1', 
                   u'24', u'0', u'5424', u'170', u'8', u'0.152941176', u'0.079129575', 0.0, 1.0, 0.0, 0.0, 0.0,
                   0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], 
      "the features")

### 1.10 Scaling and normalizing the observations
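
The cells below first scale every feature and then normalize every observation. The following plain-Python sketch illustrates the two steps on toy data. It assumes the default settings of MLlib's StandardScaler (divide each column by its sample standard deviation, without centering) and Normalizer (rescale each row to unit Euclidean length); the helper names and the handling of constant columns are our own choices for this illustration.

```python
import math
import statistics

def scale_to_unit_std(rows):
    """Divide each column by its sample standard deviation (no centering)."""
    columns = list(zip(*rows))
    stds = [statistics.stdev(col) for col in columns]
    # A constant column has zero spread; map it to 0.0 to avoid dividing by zero.
    return [[v / s if s > 0 else 0.0 for v, s in zip(row, stds)] for row in rows]

def l2_normalize(row):
    """Rescale one row so that its Euclidean length is 1."""
    norm = math.sqrt(sum(v * v for v in row))
    return [v / norm for v in row] if norm > 0 else list(row)

# Hypothetical observations whose columns live on very different scales.
rows = [[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]]
scaled = scale_to_unit_std(rows)
prepared = [l2_normalize(r) for r in scaled]
for r in prepared:
    print(r)
```

Scaling works column by column so that no feature dominates just because of its units, whereas normalization works row by row.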

In [None]:
#Importing some important stuff
from pyspark.mllib.feature import StandardScaler
from pyspark.mllib.feature import Normalizer
from pyspark.mllib.regression import LabeledPoint

In [None]:
def scaleNorm(features,labels):
    scaler = StandardScaler().fit(features)
    scaledFeatures = scaler.transform(features)
    norm = Normalizer()
    normScaledFeatures = norm.transform(scaledFeatures)
    normScaledObs = labels.zip(normScaledFeatures)
    normScaledObs = normScaledObs.map(lambda x: LabeledPoint(x[0], x[1]))
    return normScaledObs

In [None]:
normScaledObs = scaleNorm(features,labels)
sampleNormScaled = normScaledObs.take(1)[0]
print(sampleNormScaled)

In [None]:
checkArray(sampleNormScaled.features, [0.487082351628,0.0258987958334,0.361671922814,0.152373563046,0.0532495054536,
                                       0.0351843677492,0.0084491817701,0.0,0.0,0.237864093861,0.0,0.508669847859,
                                       0.000219746796182,0.0,0.229299339034,0.127813034757,0.0,0.0663708422856,
                                       0.10287584824,0.268730211619,0.0906239738681,0.108465470686,0.0,0.335395658357
                                       ,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0], 
      "the normalized and scaled features")

### 1.11 Illustrating the features
We will visualize the data set by projecting it from 36 dimensions to 2. The cell below does this with scikit-learn's TruncatedSVD, which is closely related to <a href="https://en.wikipedia.org/wiki/Principal_component_analysis">PCA</a> but does not center the data first:

In [None]:
from sklearn.decomposition import TruncatedSVD
#Project the features (before scaling and normalization) onto two components
Y = TruncatedSVD(n_components=2).fit_transform(features.collect())

In [None]:
%matplotlib inline

In [None]:
import matplotlib.pyplot as plt
plt.figure(figsize=(15,10))
axes = plt.gca()
axes.set_xlim([-3000,220000])
axes.set_ylim([-3000,5000])
plt.scatter(Y[:, 0], Y[:, 1], c=labels.collect(),linewidths=.5, s=25)

Let us zoom in on the large cluster of points:

In [None]:
plt.figure(figsize=(15,10))
axes = plt.gca()
axes.set_xlim([-1000,50000])
axes.set_ylim([-600,1000])
plt.scatter(Y[:, 0], Y[:, 1], c=labels.collect(),linewidths=.5, s=50)

### 1.12  Splitting the data into training, validation, and test sets 
Our last task in preparing the data set is to split it into a training, a validation, and a test set. A common split is 70%/15%/15%:

In [None]:
weights = [.7, .15, .15]
seed = 0
trainObsNum, valObsNum, testObsNum = normScaledObs.randomSplit(weights, seed)

trainObsNum.cache()
valObsNum.cache()
testObsNum.cache()

In [None]:
numTrain = trainObsNum.count()
numVal = valObsNum.count()
numTest = testObsNum.count()

print("Size of train set: "+str(numTrain))
print("Size of validation set: "+str(numVal))
print("Size of test set: "+str(numTest))
print("Size of total data set: "+str(numTrain+numVal+numTest))

In [None]:
#Check if the data sets are correct
check(numTrain, 5141, "the train set")
check(numVal, 1111, "the validation set")
check(numTest, 1143, "the test set")
check(numTrain+numVal+numTest, 7395, "the total data set")
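
The set sizes are close to, but not exactly, 70%/15%/15% of the 7395 observations. This is expected when every observation is assigned to a set independently at random. The sketch below is a simplified plain-Python stand-in for that idea, not Spark's actual sampling code; `random_split` is a hypothetical helper.

```python
import random

def random_split(items, weights, seed):
    """Assign every item independently to one bucket, with probability proportional to its weight."""
    rng = random.Random(seed)
    total = float(sum(weights))
    buckets = [[] for _ in weights]
    for item in items:
        draw = rng.random() * total
        cumulative = 0.0
        for bucket, weight in zip(buckets, weights):
            cumulative += weight
            if draw < cumulative:
                bucket.append(item)
                break
        else:
            # Guard against floating-point round-off at the upper edge.
            buckets[-1].append(item)
    return buckets

train, val, test = random_split(range(1000), [0.7, 0.15, 0.15], seed=0)
print(len(train), len(val), len(test))
```

Fixing the seed makes the split reproducible, which is what allows the checks above to compare against exact counts.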

## 2. Logistic regression

### 2.1 Logistic regression with limited-memory BFGS
MLlib's <a href="https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.classification.LogisticRegressionWithLBFGS">LogisticRegressionWithLBFGS</a> fits a logistic regression model using <a href="https://en.wikipedia.org/wiki/Limited-memory_BFGS">limited-memory Broyden–Fletcher–Goldfarb–Shanno</a> (L-BFGS) as the optimization algorithm. MLlib also has an SGD variant of logistic regression, but L-BFGS tends to be both faster and use less memory.

Create the logistic regression model below, with the regParam option set to regPar and all other options left at their default values:

In [None]:
#Import even more important stuff
from pyspark.mllib.classification import LogisticRegressionWithLBFGS

In [None]:
#Replace the <INSERT>
regPar = 0.00001
modelNum = <INSERT>
weights = modelNum.weights

In [None]:
#Check if the model is correct
checkArray(weights, [-0.593982441776,-0.485583947393,0.983432367551,0.243023024636,2.697369678,-0.970078920158,
                   -1.15629521729,-0.679875196911,0.0,-2.63052337353,-0.344456478847,0.401163670025,-1.09054556395,
                   0.0,0.130032949101,-4.43478759595,-0.879216191225,-1.27835993825,1.4469798125,-0.583934656104,
                   -0.0292248086351,-0.869472496836,5.33744180816,4.8052916109,-0.400227340539,1.44351290076,
                   0.449742429556,2.51770631199,3.50314642457,0.0629487776013,0.119936170972,1.06368150779,
                   -0.828227386113,0.383527091966,-8.34051003105,1.0660670245], 
      "weights of the model")

### 2.2 Evaluating the logistic regression

A common metric for evaluating binary classifiers is the <a href="https://en.wikipedia.org/wiki/Receiver_operating_characteristic#Area_under_the_curve">area under the receiver operating characteristic curve (AUROC)</a>, which is also the evaluation metric used by the <a href="https://www.kaggle.com/c/stumbleupon/details/evaluation">StumbleUpon Evergreen Classification Challenge</a>. A model that assigns random labels to observations will get an AUROC of 0.5, so hopefully we will beat that! ;)
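
One way to read the AUROC is as the probability that a randomly chosen positive observation gets a higher score than a randomly chosen negative one. The sketch below computes it that way in plain Python, on made-up scores and labels; it is meant to build intuition and is not how MLlib computes the metric.

```python
def auroc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))

toy_labels = [1, 1, 0, 1, 0, 0]
# One positive (0.6) is ranked below one negative (0.7): 8 of 9 pairs are correct.
ranked = auroc([0.9, 0.8, 0.7, 0.6, 0.3, 0.1], toy_labels)
# Identical scores carry no information, so every pair is a tie.
uninformative = auroc([0.5] * 6, toy_labels)
print(ranked, uninformative)
```

Because only the ordering of the scores matters, the metric needs real-valued scores rather than hard 0/1 predictions.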

You do not need to calculate this yourself; instead, use the following MLlib class: <a href="https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.evaluation.BinaryClassificationMetrics">BinaryClassificationMetrics</a>. By default, the predict method returns binary class labels. To get the raw scores instead, you first need to call <a href="https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.classification.LogisticRegressionModel.clearThreshold">clearThreshold()</a> on the model.
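
The difference between a thresholded and a raw prediction can be sketched in plain Python as follows. The `predict` helper, the weights, and the input are all hypothetical and unrelated to the fitted model; passing `threshold=None` plays the role of a cleared threshold here.

```python
import math

def predict(weights, intercept, features, threshold=0.5):
    """Logistic regression prediction; threshold=None returns the raw probability."""
    margin = sum(w * x for w, x in zip(weights, features)) + intercept
    probability = 1.0 / (1.0 + math.exp(-margin))
    if threshold is None:
        return probability
    return 1 if probability > threshold else 0

# Hypothetical two-feature model, unrelated to the fitted weights in this notebook.
toy_weights, toy_intercept = [2.0, -1.0], 0.0
x = [0.5, 0.2]

hard_label = predict(toy_weights, toy_intercept, x)                 # 0 or 1
raw_score = predict(toy_weights, toy_intercept, x, threshold=None)  # between 0 and 1
print(hard_label, raw_score)
```

With hard labels, the ROC curve collapses to a single operating point, so the raw scores are what make the AUROC informative.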

Calculate the AUROC for the validation data:

In [None]:
from pyspark.mllib.evaluation import BinaryClassificationMetrics

In [None]:
#Replace the <INSERT>
modelNum.<INSERT>
predLabel = valObsNum.map(lambda lp:(modelNum.predict(lp.features), lp.label))
metrics = <INSERT>
AUROC = metrics.<INSERT>
print(AUROC)

In [None]:
#Check if the AUROC on the validation set is correct
check(round(AUROC,3), 0.736, "AUROC of the validation set")

## 3. Evaluating the results

### 3.1 Evaluate the model on the test set
Let us try out the model on the test set and see how it fares:

In [None]:
predLabel = testObsNum.map(lambda lp:(modelNum.predict(lp.features), lp.label))
metrics = BinaryClassificationMetrics(predLabel)
AUROC = metrics.areaUnderROC
print(AUROC)

In [None]:
#Check if the numeric model's AUROC on the test set is correct
check(round(AUROC,3), 0.710, "AUROC of the test set")

### 3.2 Post mortem discussion
If you check <a href="https://www.kaggle.com/c/stumbleupon/leaderboard">the leaderboard</a> on Kaggle, you can see that the winner had an AUROC of 0.88906. This is not directly comparable with our result on the test set, because the competition used another test set, but it still gives an indication of the quality of our result.

A lower bound should be an AUROC of 0.5, because that is what a model that assigns random labels to observations will get, although 10 participants in the Kaggle challenge did not do better than that.

If we had gotten the same result on their test set, i.e. 0.710, we would have placed 540th out of 625 participants. We would not have beaten the Random Forest Benchmark of 0.768, in which a random forest was applied to the provided features.

But you will improve on this result in this week's home assignment by using the JSON feature!