[Spark with Jupyter Notebook on MacOS (2.0.0 and higher)](https://medium.com/@roshinijohri/spark-with-jupyter-notebook-on-macos-2-0-0-and-higher-c61b971b5007)
==========================================================================================

#### Run in Terminal:
`brew install apache-spark`

`brew info apache-spark`

`export SPARK_HOME='/usr/local/Cellar/apache-spark/2.4.5/libexec/'` (edit the version number to match your installation)

`pyspark`

In [1]:
import os
exec(open(os.path.join(os.environ['SPARK_HOME'], 'python/pyspark/shell.py')).read())

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.4.5
      /_/

Using Python version 3.7.4 (default, Aug 13 2019 15:17:50)
SparkSession available as 'spark'.


In [2]:
import pyspark
from pyspark.sql.session import SparkSession
spark = SparkSession.builder \
    .appName('spark test') \
    .getOrCreate()

columns = ['id', 'dogs', 'cats']
vals = [
    (1, 2, 0),
    (2, 0, 1)
]

In [3]:
# Create DataFrame
df = spark.createDataFrame(vals, columns)
df.show()

+---+----+----+
| id|dogs|cats|
+---+----+----+
|  1|   2|   0|
|  2|   0|   1|
+---+----+----+



# Numeric Feature Extraction

In [4]:
import pyspark
from pyspark import SparkConf
from pyspark.sql import SparkSession
from pyspark.ml.feature import RegexTokenizer, VectorAssembler, Normalizer, StandardScaler, MinMaxScaler, MaxAbsScaler, \
                               CountVectorizer, IDF, StringIndexer, PCA, StopWordsRemover
from pyspark.ml import Pipeline
from pyspark.ml.regression import LinearRegression
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.clustering import KMeans
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
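# Note: importing min and max from pyspark.sql.functions shadows Python's built-in min() and max() in this notebook
from pyspark.sql.functions import avg, col, concat, count, desc, explode, lit, min, max, split, stddev, udf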
from pyspark.sql.types import IntegerType

import re

In [5]:
# Instantiate a Spark session 
# The entry point to programming Spark with the Dataset and DataFrame API
# Note: master(): sets the Spark master URL to connect to, such as "local" to run locally, "local[4]" to run 
#                 locally with 4 cores, or "spark://master:7077" to run on a Spark standalone cluster
#       appName(): sets a name for the application, which will be shown in the Spark web UI
#       getOrCreate(): gets an existing SparkSession or, if there is none, creates a new one based on the options set in this builder
spark = SparkSession.builder \
    .master('local') \
    .appName('Word Count') \
    .getOrCreate()

In [6]:
# Get all values as a list of key-value pairs
spark.sparkContext.getConf().getAll()

[('spark.master', 'local'),
 ('spark.sql.catalogImplementation', 'hive'),
 ('spark.rdd.compress', 'True'),
 ('spark.driver.host', '192.168.0.19'),
 ('spark.serializer.objectStreamReset', '100'),
 ('spark.app.id', 'local-1592509346247'),
 ('spark.executor.id', 'driver'),
 ('spark.submit.deployMode', 'client'),
 ('spark.ui.showConsoleProgress', 'true'),
 ('spark.app.name', 'Word Count'),
 ('spark.driver.port', '52912')]

In [7]:
spark

### Read in the Data Set

In [8]:
path = '/Users/yangweichle/Documents/Employment/TRAINING/DATA SCIENCE/Spark/Udacity_Spark for Big Data/Machine Learning with Spark/data/Train_onetag_small.json'

# Loads JSON files and returns the results as a `DataFrame`
# Note: path: string represents path to the JSON dataset, or a list of paths, or RDD of Strings storing JSON objects
stack_overflow_data = spark.read.json(path=path)

# Sets the storage level to persist the contents of the `DataFrame` across operations after the first time it is computed
# This can only be used to assign a new storage level if the `DataFrame` does not have a storage level set yet
# If no storage level is specified, it defaults to `MEMORY_AND_DISK`
stack_overflow_data.persist()

DataFrame[Body: string, Id: bigint, Tags: string, Title: string, oneTag: string]

In [9]:
# Prints out the schema in the tree format
stack_overflow_data.printSchema()

root
 |-- Body: string (nullable = true)
 |-- Id: long (nullable = true)
 |-- Tags: string (nullable = true)
 |-- Title: string (nullable = true)
 |-- oneTag: string (nullable = true)



In [10]:
# Returns the first ``n`` rows
stack_overflow_data.head()

Row(Body="<p>I'd like to check if an uploaded file is an image file (e.g png, jpg, jpeg, gif, bmp) or another file. The problem is that I'm using Uploadify to upload the files, which changes the mime type and gives a 'text/octal' or something as the mime type, no matter which file type you upload.</p>\n\n<p>Is there a way to check if the uploaded file is an image apart from checking the file extension using PHP?</p>\n", Id=1, Tags='php image-processing file-upload upload mime-types', Title='How to check if an uploaded file is an image without mime type?', oneTag='php')

### Tokenization

Tokenization splits strings into separate words. Spark has a [Tokenizer](https://spark.apache.org/docs/latest/ml-features.html#tokenizer) class as well as RegexTokenizer, which allows for more control over the tokenization process.
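
To make the effect of `pattern='\\W'` concrete, here is a rough plain-Python approximation of what `RegexTokenizer` does with its default settings (lowercase the text, split on the pattern, drop empty tokens). The helper `approx_regex_tokenize` is hypothetical and not part of Spark. Python's `\W` is Unicode-aware by default, so results may differ from the Java regex dialect on non-ASCII text.

```python
import re

def approx_regex_tokenize(text, pattern=r"\W", min_token_length=1):
    """Hypothetical stand-in for RegexTokenizer with gaps=True (split on the pattern)."""
    pieces = re.split(pattern, text.lower())
    # Splitting on single non-word characters leaves empty strings behind; drop them
    return [p for p in pieces if len(p) >= min_token_length]

tokens = approx_regex_tokenize("<p>I'd like to check</p>")
print(tokens)
```

Note that the HTML tag names (`p`) and the pieces of contractions (`i`, `d`) survive as tokens, which matches the `words` column shown in the outputs of this notebook.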

In [11]:
# A regex based tokenizer that extracts tokens either by using the provided regex pattern (in Java dialect) to 
#    split the text (default) or repeatedly matching the regex (if gaps is false)
# Optional parameters also allow filtering tokens using a minimal length
# It returns an array of strings that can be empty
regexTokenizer = RegexTokenizer(inputCol='Body', outputCol='words', pattern='\\W')

# Transforms the input dataset with optional parameters
stack_overflow_data = regexTokenizer.transform(stack_overflow_data)

In [12]:
# Returns the first ``n`` rows
stack_overflow_data.head()

Row(Body="<p>I'd like to check if an uploaded file is an image file (e.g png, jpg, jpeg, gif, bmp) or another file. The problem is that I'm using Uploadify to upload the files, which changes the mime type and gives a 'text/octal' or something as the mime type, no matter which file type you upload.</p>\n\n<p>Is there a way to check if the uploaded file is an image apart from checking the file extension using PHP?</p>\n", Id=1, Tags='php image-processing file-upload upload mime-types', Title='How to check if an uploaded file is an image without mime type?', oneTag='php', words=['p', 'i', 'd', 'like', 'to', 'check', 'if', 'an', 'uploaded', 'file', 'is', 'an', 'image', 'file', 'e', 'g', 'png', 'jpg', 'jpeg', 'gif', 'bmp', 'or', 'another', 'file', 'the', 'problem', 'is', 'that', 'i', 'm', 'using', 'uploadify', 'to', 'upload', 'the', 'files', 'which', 'changes', 'the', 'mime', 'type', 'and', 'gives', 'a', 'text', 'octal', 'or', 'something', 'as', 'the', 'mime', 'type', 'no', 'matter', 'which

#### Count the number of words in each question body

In [13]:
# Create a user defined function (UDF)
body_length = udf(lambda x: len(x), IntegerType())

# Returns a new `DataFrame` by adding a column or replacing the existing column that has the same name
stack_overflow_data = stack_overflow_data.withColumn('BodyLength', body_length(stack_overflow_data.words))

#### Count the number of paragraphs and links in each question body

In [14]:
# Create a user defined function (UDF)
number_of_paragraphs = udf(lambda x: len(re.findall('</p>', x)), IntegerType())
number_of_links = udf(lambda x: len(re.findall('</a>', x)), IntegerType())

# Returns a new `DataFrame` by adding a column or replacing the existing column that has the same name
stack_overflow_data = stack_overflow_data.withColumn('NumParagraphs', number_of_paragraphs(stack_overflow_data.Body))
stack_overflow_data = stack_overflow_data.withColumn('NumLinks', number_of_links(stack_overflow_data.Body))
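
The two UDFs above simply count closing tags in the raw HTML. A small plain-Python check of that counting logic, using a made-up sample body (not taken from the data set):

```python
import re

# Made-up sample body for illustration only
sample_body = "<p>See <a href='https://example.com'>this page</a>.</p>\n\n<p>Thanks!</p>\n"

# Same expressions as inside the UDFs: one match per closing tag
num_paragraphs = len(re.findall('</p>', sample_body))
num_links = len(re.findall('</a>', sample_body))
print(num_paragraphs, num_links)
```

Because only closing tags are matched, a paragraph or link that is never closed in the HTML would not be counted.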

In [15]:
# Returns the first ``n`` rows
stack_overflow_data.head(2)

[Row(Body="<p>I'd like to check if an uploaded file is an image file (e.g png, jpg, jpeg, gif, bmp) or another file. The problem is that I'm using Uploadify to upload the files, which changes the mime type and gives a 'text/octal' or something as the mime type, no matter which file type you upload.</p>\n\n<p>Is there a way to check if the uploaded file is an image apart from checking the file extension using PHP?</p>\n", Id=1, Tags='php image-processing file-upload upload mime-types', Title='How to check if an uploaded file is an image without mime type?', oneTag='php', words=['p', 'i', 'd', 'like', 'to', 'check', 'if', 'an', 'uploaded', 'file', 'is', 'an', 'image', 'file', 'e', 'g', 'png', 'jpg', 'jpeg', 'gif', 'bmp', 'or', 'another', 'file', 'the', 'problem', 'is', 'that', 'i', 'm', 'using', 'uploadify', 'to', 'upload', 'the', 'files', 'which', 'changes', 'the', 'mime', 'type', 'and', 'gives', 'a', 'text', 'octal', 'or', 'something', 'as', 'the', 'mime', 'type', 'no', 'matter', 'whic

### VectorAssembler

Combine the body length, number of paragraphs, and number of links columns into a single vector column.

In [16]:
# A feature transformer that merges multiple columns into a vector column
vecAssembler = VectorAssembler(inputCols=['BodyLength', 'NumParagraphs', 'NumLinks'], outputCol='NumFeatures')

# Transforms the input dataset with optional parameters
stack_overflow_data = vecAssembler.transform(stack_overflow_data)

In [17]:
# Returns the first ``n`` rows
stack_overflow_data.head()

Row(Body="<p>I'd like to check if an uploaded file is an image file (e.g png, jpg, jpeg, gif, bmp) or another file. The problem is that I'm using Uploadify to upload the files, which changes the mime type and gives a 'text/octal' or something as the mime type, no matter which file type you upload.</p>\n\n<p>Is there a way to check if the uploaded file is an image apart from checking the file extension using PHP?</p>\n", Id=1, Tags='php image-processing file-upload upload mime-types', Title='How to check if an uploaded file is an image without mime type?', oneTag='php', words=['p', 'i', 'd', 'like', 'to', 'check', 'if', 'an', 'uploaded', 'file', 'is', 'an', 'image', 'file', 'e', 'g', 'png', 'jpg', 'jpeg', 'gif', 'bmp', 'or', 'another', 'file', 'the', 'problem', 'is', 'that', 'i', 'm', 'using', 'uploadify', 'to', 'upload', 'the', 'files', 'which', 'changes', 'the', 'mime', 'type', 'and', 'gives', 'a', 'text', 'octal', 'or', 'something', 'as', 'the', 'mime', 'type', 'no', 'matter', 'which

### Normalize the Vectors

In [18]:
# Normalize a vector to have unit norm using the given p-norm
normalizer = Normalizer(inputCol='NumFeatures', outputCol='ScaledNumFeatures')

# Transforms the input dataset with optional parameters
stack_overflow_data = normalizer.transform(stack_overflow_data)
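
As a sketch of what the `Normalizer` computes for one row (default `p=2.0`), the following plain-Python function divides each component by the vector's p-norm. The function name `p_normalize` and the feature values are made up for illustration, and leaving an all-zero vector unchanged is a choice made for this sketch.

```python
def p_normalize(values, p=2.0):
    """Divide each component by the p-norm of the vector."""
    norm = sum(abs(v) ** p for v in values) ** (1.0 / p)
    if norm == 0:
        return list(values)  # sketch-only choice: nothing to scale
    return [v / norm for v in values]

features = [83.0, 2.0, 0.0]  # made-up [BodyLength, NumParagraphs, NumLinks]
unit = p_normalize(features)

# The result should have unit L2 norm
l2_norm = sum(v * v for v in unit) ** 0.5
print(unit, l2_norm)
```

Since this normalization works row by row, a large `BodyLength` dominates the other two components of the same row.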

In [19]:
# Returns the first ``n`` rows
stack_overflow_data.head(2)

[Row(Body="<p>I'd like to check if an uploaded file is an image file (e.g png, jpg, jpeg, gif, bmp) or another file. The problem is that I'm using Uploadify to upload the files, which changes the mime type and gives a 'text/octal' or something as the mime type, no matter which file type you upload.</p>\n\n<p>Is there a way to check if the uploaded file is an image apart from checking the file extension using PHP?</p>\n", Id=1, Tags='php image-processing file-upload upload mime-types', Title='How to check if an uploaded file is an image without mime type?', oneTag='php', words=['p', 'i', 'd', 'like', 'to', 'check', 'if', 'an', 'uploaded', 'file', 'is', 'an', 'image', 'file', 'e', 'g', 'png', 'jpg', 'jpeg', 'gif', 'bmp', 'or', 'another', 'file', 'the', 'problem', 'is', 'that', 'i', 'm', 'using', 'uploadify', 'to', 'upload', 'the', 'files', 'which', 'changes', 'the', 'mime', 'type', 'and', 'gives', 'a', 'text', 'octal', 'or', 'something', 'as', 'the', 'mime', 'type', 'no', 'matter', 'whic

### Scale the Vectors

In [20]:
# Standardizes features by removing the mean and scaling to unit variance using column summary statistics
#    on the samples in the training set
# The "unit std" is computed using the `corrected sample standard deviation`, which is computed as the 
#    square root of the unbiased sample variance
standardScaler = StandardScaler(inputCol='NumFeatures', outputCol='ScaledNumFeatures2', withStd=True)

# Fits a model to the input dataset with optional parameters
scalerModel = standardScaler.fit(stack_overflow_data)

# Transforms the input dataset with optional parameters
stack_overflow_data = scalerModel.transform(stack_overflow_data)
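
Unlike the `Normalizer`, the `StandardScaler` works per feature (column) rather than per row. A minimal plain-Python sketch for a single feature, assuming made-up values and a hypothetical helper `standard_scale`; it uses the corrected sample standard deviation (dividing by n - 1) and does not handle a feature with zero variance.

```python
import statistics

def standard_scale(column, with_mean=False, with_std=True):
    """Scale one feature column; mirrors the withMean / withStd switches."""
    mean = statistics.mean(column) if with_mean else 0.0
    std = statistics.stdev(column) if with_std else 1.0  # sample std, n - 1 in the denominator
    return [(x - mean) / std for x in column]

lengths = [10.0, 20.0, 30.0]  # made-up values for one feature

scaled = standard_scale(lengths)                    # withStd only, as in the cell above
centered = standard_scale(lengths, with_mean=True)  # also subtract the mean
print(scaled, centered)
```

With `withStd=True` alone the values keep their sign and relative spacing but are no longer centered on zero; adding `withMean=True` shifts them so that the column mean is zero.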

In [21]:
# Returns the first ``n`` rows
stack_overflow_data.head(2)

[Row(Body="<p>I'd like to check if an uploaded file is an image file (e.g png, jpg, jpeg, gif, bmp) or another file. The problem is that I'm using Uploadify to upload the files, which changes the mime type and gives a 'text/octal' or something as the mime type, no matter which file type you upload.</p>\n\n<p>Is there a way to check if the uploaded file is an image apart from checking the file extension using PHP?</p>\n", Id=1, Tags='php image-processing file-upload upload mime-types', Title='How to check if an uploaded file is an image without mime type?', oneTag='php', words=['p', 'i', 'd', 'like', 'to', 'check', 'if', 'an', 'uploaded', 'file', 'is', 'an', 'image', 'file', 'e', 'g', 'png', 'jpg', 'jpeg', 'gif', 'bmp', 'or', 'another', 'file', 'the', 'problem', 'is', 'that', 'i', 'm', 'using', 'uploadify', 'to', 'upload', 'the', 'files', 'which', 'changes', 'the', 'mime', 'type', 'and', 'gives', 'a', 'text', 'octal', 'or', 'something', 'as', 'the', 'mime', 'type', 'no', 'matter', 'whic

# Text Processing

Find the term frequencies of the words.

### CountVectorizer

In [22]:
# Extracts a vocabulary from document collections and generates a `CountVectorizerModel`
cv = CountVectorizer(inputCol='words', outputCol='TF', vocabSize=1000)

# Fits a model to the input dataset with optional parameters
cvModel = cv.fit(stack_overflow_data)

# Transforms the input dataset with optional parameters
stack_overflow_data = cvModel.transform(stack_overflow_data)
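
The idea behind `CountVectorizer` can be sketched in plain Python: pick the most frequent terms across the corpus as the vocabulary, then count each vocabulary term per document. The documents below are made up, and the dense lists stand in for the sparse vectors that Spark produces.

```python
from collections import Counter

# Made-up tokenized documents
docs = [["spark", "is", "fast"], ["spark", "is", "spark"], ["spark"]]

# Vocabulary: the vocab_size most frequent terms across the whole corpus
corpus_counts = Counter(token for doc in docs for token in doc)
vocab_size = 2
vocabulary = [term for term, _ in corpus_counts.most_common(vocab_size)]

def term_frequencies(doc):
    """Count how often each vocabulary term occurs in one document."""
    counts = Counter(doc)
    return [counts[term] for term in vocabulary]

tf = [term_frequencies(doc) for doc in docs]
print(vocabulary, tf)
```

Terms outside the vocabulary (here `fast`) are simply dropped, which is what `vocabSize=1000` does to rarer words in the cell above.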

In [23]:
# Returns the first ``num`` rows as a `list` of `Row`
stack_overflow_data.take(1)

[Row(Body="<p>I'd like to check if an uploaded file is an image file (e.g png, jpg, jpeg, gif, bmp) or another file. The problem is that I'm using Uploadify to upload the files, which changes the mime type and gives a 'text/octal' or something as the mime type, no matter which file type you upload.</p>\n\n<p>Is there a way to check if the uploaded file is an image apart from checking the file extension using PHP?</p>\n", Id=1, Tags='php image-processing file-upload upload mime-types', Title='How to check if an uploaded file is an image without mime type?', oneTag='php', words=['p', 'i', 'd', 'like', 'to', 'check', 'if', 'an', 'uploaded', 'file', 'is', 'an', 'image', 'file', 'e', 'g', 'png', 'jpg', 'jpeg', 'gif', 'bmp', 'or', 'another', 'file', 'the', 'problem', 'is', 'that', 'i', 'm', 'using', 'uploadify', 'to', 'upload', 'the', 'files', 'which', 'changes', 'the', 'mime', 'type', 'and', 'gives', 'a', 'text', 'octal', 'or', 'something', 'as', 'the', 'mime', 'type', 'no', 'matter', 'whic

In [24]:
# An array of terms in the vocabulary
cvModel.vocabulary

['p',
 'the',
 'i',
 'to',
 'code',
 'a',
 'gt',
 'lt',
 'is',
 'and',
 'pre',
 'in',
 'this',
 'of',
 'it',
 'that',
 'for',
 '0',
 '1',
 'have',
 'my',
 'if',
 'on',
 'but',
 'with',
 'can',
 'not',
 'be',
 'as',
 't',
 'li',
 'from',
 '2',
 's',
 'http',
 'an',
 'm',
 'strong',
 'new',
 'how',
 'do',
 'com',
 'so',
 'or',
 'at',
 'using',
 'when',
 'am',
 'like',
 'class',
 'id',
 'there',
 'get',
 'are',
 'name',
 'what',
 'any',
 'file',
 'string',
 'data',
 'all',
 'which',
 'want',
 'would',
 'amp',
 'use',
 'java',
 'function',
 'public',
 'some',
 '3',
 'text',
 'error',
 'android',
 'value',
 'c',
 'x',
 'href',
 'you',
 'one',
 'by',
 'user',
 'me',
 'server',
 'type',
 'here',
 'way',
 'return',
 'int',
 'will',
 'div',
 'need',
 'then',
 'set',
 'e',
 'system',
 'has',
 'problem',
 'out',
 'php',
 'no',
 'just',
 '4',
 'org',
 'know',
 'html',
 'only',
 'where',
 'page',
 'application',
 '5',
 'thanks',
 'var',
 'br',
 'we',
 'd',
 'should',
 'does',
 'add',
 'n',
 'true',

In [25]:
# Show the last 10 terms in the vocabulary
cvModel.vocabulary[-10:]

['customer',
 'desktop',
 'buttons',
 'previous',
 'master',
 'math',
 '000',
 'comes',
 'blog',
 'wordpress']

### Inverse Document Frequency

In [26]:
# Compute the Inverse Document Frequency (IDF) given a collection of documents
idf = IDF(inputCol='TF', outputCol='TFIDF')

# Fits a model to the input dataset with optional parameters
idfModel = idf.fit(stack_overflow_data)

# Transforms the input dataset with optional parameters
stack_overflow_data = idfModel.transform(stack_overflow_data)
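
Spark's documentation gives the smoothed formula IDF(t, D) = log((|D| + 1) / (DF(t, D) + 1)), where |D| is the number of documents and DF(t, D) is the number of documents containing term t. A small plain-Python sketch of that formula, using made-up term counts rather than the real `TF` column:

```python
import math

# Made-up term counts: one row per document, one column per vocabulary term
tf_rows = [[1, 1], [2, 1], [1, 0]]

num_docs = len(tf_rows)
num_terms = len(tf_rows[0])

# Document frequency: in how many documents does each term appear at least once?
doc_freq = [sum(1 for row in tf_rows if row[j] > 0) for j in range(num_terms)]

# Smoothed inverse document frequency
idf = [math.log((num_docs + 1) / (df + 1)) for df in doc_freq]

# TF-IDF: scale each term count by the term's IDF weight
tfidf = [[count * weight for count, weight in zip(row, idf)] for row in tf_rows]
print(doc_freq, idf)
```

A term that appears in every document gets an IDF of zero, so very common tokens such as `p` or `the` contribute little or nothing to the `TFIDF` vector.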

In [27]:
# Returns the first ``n`` rows
stack_overflow_data.head()

Row(Body="<p>I'd like to check if an uploaded file is an image file (e.g png, jpg, jpeg, gif, bmp) or another file. The problem is that I'm using Uploadify to upload the files, which changes the mime type and gives a 'text/octal' or something as the mime type, no matter which file type you upload.</p>\n\n<p>Is there a way to check if the uploaded file is an image apart from checking the file extension using PHP?</p>\n", Id=1, Tags='php image-processing file-upload upload mime-types', Title='How to check if an uploaded file is an image without mime type?', oneTag='php', words=['p', 'i', 'd', 'like', 'to', 'check', 'if', 'an', 'uploaded', 'file', 'is', 'an', 'image', 'file', 'e', 'g', 'png', 'jpg', 'jpeg', 'gif', 'bmp', 'or', 'another', 'file', 'the', 'problem', 'is', 'that', 'i', 'm', 'using', 'uploadify', 'to', 'upload', 'the', 'files', 'which', 'changes', 'the', 'mime', 'type', 'and', 'gives', 'a', 'text', 'octal', 'or', 'something', 'as', 'the', 'mime', 'type', 'no', 'matter', 'which

### StringIndexer

In [28]:
# A label indexer that maps a string column of labels to an ML column of label indices
# If the input column is numeric, we cast it to string and index the string values
# The indices are in [0, numLabels). By default, this is ordered by label frequencies so the most frequent label gets index 0
# Note: stringOrderType: the ordering behavior; default value is 'frequencyDesc'; other options are 'frequencyAsc', 'alphabetDesc', and 'alphabetAsc'
stringIndexer = StringIndexer(inputCol='oneTag', outputCol='label')

# Fits a model to the input dataset with optional parameters
stringIndexerModel = stringIndexer.fit(stack_overflow_data)

# Transforms the input dataset with optional parameters
stack_overflow_data = stringIndexerModel.transform(stack_overflow_data)
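
With the default `frequencyDesc` ordering, the most frequent label receives index 0.0, the next one 1.0, and so on. A plain-Python sketch of that mapping with made-up labels (chosen so that no two labels share the same frequency):

```python
from collections import Counter

tags = ["php", "java", "php", "git", "java", "php"]  # made-up labels

# Order labels from most to least frequent, then number them
ordered = [label for label, _ in Counter(tags).most_common()]
label_to_index = {label: float(i) for i, label in enumerate(ordered)}

indexed = [label_to_index[t] for t in tags]
print(label_to_index, indexed)
```

The resulting indices are arbitrary identifiers, not quantities, so their numeric order should not be read as a ranking of the tags beyond frequency.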

In [29]:
# Returns the first ``n`` rows
stack_overflow_data.head()

Row(Body="<p>I'd like to check if an uploaded file is an image file (e.g png, jpg, jpeg, gif, bmp) or another file. The problem is that I'm using Uploadify to upload the files, which changes the mime type and gives a 'text/octal' or something as the mime type, no matter which file type you upload.</p>\n\n<p>Is there a way to check if the uploaded file is an image apart from checking the file extension using PHP?</p>\n", Id=1, Tags='php image-processing file-upload upload mime-types', Title='How to check if an uploaded file is an image without mime type?', oneTag='php', words=['p', 'i', 'd', 'like', 'to', 'check', 'if', 'an', 'uploaded', 'file', 'is', 'an', 'image', 'file', 'e', 'g', 'png', 'jpg', 'jpeg', 'gif', 'bmp', 'or', 'another', 'file', 'the', 'problem', 'is', 'that', 'i', 'm', 'using', 'uploadify', 'to', 'upload', 'the', 'files', 'which', 'changes', 'the', 'mime', 'type', 'and', 'gives', 'a', 'text', 'octal', 'or', 'something', 'as', 'the', 'mime', 'type', 'no', 'matter', 'which

# Creating Features

### Read in the Data Set

In [30]:
path = '/Users/yangweichle/Documents/Employment/TRAINING/DATA SCIENCE/Spark/Udacity_Spark for Big Data/Machine Learning with Spark/data/Train_onetag_small.json'

# Loads JSON files and returns the results as a `DataFrame`
# Note: path: string represents path to the JSON dataset, or a list of paths, or RDD of Strings storing JSON objects
stack_overflow_data2 = spark.read.json(path=path)

# Sets the storage level to persist the contents of the `DataFrame` across operations after the first time it is computed
# This can only be used to assign a new storage level if the `DataFrame` does not have a storage level set yet
# If no storage level is specified, it defaults to `MEMORY_AND_DISK`
stack_overflow_data2.persist()

DataFrame[Body: string, Id: bigint, Tags: string, Title: string, oneTag: string]

In [31]:
# A regex based tokenizer that extracts tokens either by using the provided regex pattern (in Java dialect) to 
#    split the text (default) or repeatedly matching the regex (if gaps is false)
# Optional parameters also allow filtering tokens using a minimal length
# It returns an array of strings that can be empty 
regexTokenizer = RegexTokenizer(inputCol='Body', outputCol='words', pattern='\\W')

# Transforms the input dataset with optional parameters
stack_overflow_data2 = regexTokenizer.transform(stack_overflow_data2)

In [32]:
# Create a user defined function (UDF)
body_length = udf(lambda x: len(x), IntegerType())

# Returns a new `DataFrame` by adding a column or replacing the existing column that has the same name
stack_overflow_data2 = stack_overflow_data2.withColumn('BodyLength', body_length(stack_overflow_data2.words))

In [33]:
# Returns the first ``n`` rows
stack_overflow_data2.head()

Row(Body="<p>I'd like to check if an uploaded file is an image file (e.g png, jpg, jpeg, gif, bmp) or another file. The problem is that I'm using Uploadify to upload the files, which changes the mime type and gives a 'text/octal' or something as the mime type, no matter which file type you upload.</p>\n\n<p>Is there a way to check if the uploaded file is an image apart from checking the file extension using PHP?</p>\n", Id=1, Tags='php image-processing file-upload upload mime-types', Title='How to check if an uploaded file is an image without mime type?', oneTag='php', words=['p', 'i', 'd', 'like', 'to', 'check', 'if', 'an', 'uploaded', 'file', 'is', 'an', 'image', 'file', 'e', 'g', 'png', 'jpg', 'jpeg', 'gif', 'bmp', 'or', 'another', 'file', 'the', 'problem', 'is', 'that', 'i', 'm', 'using', 'uploadify', 'to', 'upload', 'the', 'files', 'which', 'changes', 'the', 'mime', 'type', 'and', 'gives', 'a', 'text', 'octal', 'or', 'something', 'as', 'the', 'mime', 'type', 'no', 'matter', 'which

### Question 1

Select the question with Id = 1112. How many words does its body contain (check the `BodyLength` column)?

In [34]:
# Filters rows using the given condition
stack_overflow_data2.where(stack_overflow_data2.Id == 1112).show()

+--------------------+----+--------------------+--------------------+------+--------------------+----------+
|                Body|  Id|                Tags|               Title|oneTag|               words|BodyLength|
+--------------------+----+--------------------+--------------------+------+--------------------+----------+
|<p>I submitted my...|1112|iphone app-store ...|iPhone app releas...|iphone|[p, i, submitted,...|        63|
+--------------------+----+--------------------+--------------------+------+--------------------+----------+



### Question 2

Create a new column that concatenates the question `Title` and `Body`. Apply the same functions we used before to compute the number of words in this combined column. What's the value in this new column for Id = 5123?

In [35]:
# Returns a new `DataFrame` by adding a column or replacing the existing column that has the same name
# concat: concatenates multiple input columns together into a single column;
#         the function works with strings, binary and compatible array columns
stack_overflow_data2 = stack_overflow_data2.withColumn('Desc', concat(col('Title'), lit(' '), col('Body')))

In [36]:
# A regex based tokenizer that extracts tokens either by using the provided regex pattern (in Java dialect) to 
#    split the text (default) or repeatedly matching the regex (if gaps is false)
# Optional parameters also allow filtering tokens using a minimal length
# It returns an array of strings that can be empty 
regexTokenizer = RegexTokenizer(inputCol='Desc', outputCol='words2', pattern='\\W')

# Transforms the input dataset with optional parameters
stack_overflow_data2 = regexTokenizer.transform(stack_overflow_data2)

In [37]:
# Returns a new `DataFrame` by adding a column or replacing the existing column that has the same name
stack_overflow_data2 = stack_overflow_data2.withColumn('DescLength', body_length(stack_overflow_data2.words2))

In [38]:
# Filters rows using the given condition
stack_overflow_data2.where(stack_overflow_data2.Id == 5123).collect()

[Row(Body="<p>Here's an interesting experiment with using Git. Think of Github's ‘pages’ feature: I write a program in one branch (e.g. <code>master</code>), and a documentation website is kept in another, entirely unrelated branch (e.g. <code>gh-pages</code>).</p>\n\n<p>I can generate documentation in HTML format from the code in my <code>master</code>-branch, but I want to publish this as part of my documentation website in the <code>gh-pages</code> branch.</p>\n\n<p>How could I intelligently generate my docs from my code in <code>master</code>, move it to my <code>gh-pages</code> branch and commit the changes there? Should I use a post-commit hook or something? Would this be a good idea, or is it utterly foolish?</p>\n", Id=5123, Tags='git branch', Title='Git branch experiment', oneTag='git', words=['p', 'here', 's', 'an', 'interesting', 'experiment', 'with', 'using', 'git', 'think', 'of', 'github', 's', 'pages', 'feature', 'i', 'write', 'a', 'program', 'in', 'one', 'branch', 'e', '

### Create a Vector

Create a vector from the combined Title + Body length column. In the next few questions, you'll try different normalizer/scaler methods on this new column.

In [39]:
# A feature transformer that merges multiple columns into a vector column
vecAssembler = VectorAssembler(inputCols=['DescLength'], outputCol='DescVec')

# Transforms the input dataset with optional parameters
stack_overflow_data2 = vecAssembler.transform(stack_overflow_data2)

### Question 3

Using the `Normalizer` method, what's the normalized value for question Id = 512?

In [40]:
# Normalize a vector to have unit norm using the given p-norm
normalizer = Normalizer(inputCol='DescVec', outputCol='DescVecNormalizer')

# Transforms the input dataset with optional parameters
stack_overflow_data2 = normalizer.transform(stack_overflow_data2)
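
Because `DescVec` holds a single component, the default L2 `Normalizer` should map every nonzero length to 1.0, whatever the original value. A quick plain-Python check of that arithmetic with made-up lengths:

```python
desc_lengths = [57.0, 120.0, 3.0]  # made-up DescLength values

# A one-component vector [x] has L2 norm |x|, so x / |x| is 1.0 for any x > 0
normalized = [[x / abs(x)] for x in desc_lengths]
print(normalized)
```

This is why row-wise normalization tells us nothing about a single feature; the column-wise scalers in the next questions are the ones that preserve differences between rows.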

In [41]:
# Filters rows using the given condition
stack_overflow_data2.where(stack_overflow_data2.Id == 512).collect()

[Row(Body="<p>I'd like to have a better understanding of what optimizations HotSpot might generate for my Java code at run time. </p>\n\n<p>Is there a way to see the optimized code that HotSpot is using after it's been running for a while?</p>\n", Id=512, Tags='java optimization hotspot', Title='How can I see the code that HotSpot generates after optimizing?', oneTag='java', words=['p', 'i', 'd', 'like', 'to', 'have', 'a', 'better', 'understanding', 'of', 'what', 'optimizations', 'hotspot', 'might', 'generate', 'for', 'my', 'java', 'code', 'at', 'run', 'time', 'p', 'p', 'is', 'there', 'a', 'way', 'to', 'see', 'the', 'optimized', 'code', 'that', 'hotspot', 'is', 'using', 'after', 'it', 's', 'been', 'running', 'for', 'a', 'while', 'p'], BodyLength=46, Desc="How can I see the code that HotSpot generates after optimizing? <p>I'd like to have a better understanding of what optimizations HotSpot might generate for my Java code at run time. </p>\n\n<p>Is there a way to see the optimized code 

### Question 4

Using the `StandardScaler` method (centering on the mean and scaling by the standard deviation), what's the standardized value for question Id = 512?

In [42]:
# Standardizes features by removing the mean and scaling to unit variance using column summary statistics
#    on the samples in the training set
# The "unit std" is computed using the `corrected sample standard deviation`, which is computed as the 
#    square root of the unbiased sample variance
standardScaler = StandardScaler(inputCol='DescVec', outputCol='DescVecStandardScaler', withMean=True, withStd=True)

# Fits a model to the input dataset with optional parameters
scalerModel = standardScaler.fit(stack_overflow_data2)

# Transforms the input dataset with optional parameters
stack_overflow_data2 = scalerModel.transform(stack_overflow_data2)

In [43]:
# Filters rows using the given condition
stack_overflow_data2.where(stack_overflow_data2.Id == 512).collect()

[Row(Body="<p>I'd like to have a better understanding of what optimizations HotSpot might generate for my Java code at run time. </p>\n\n<p>Is there a way to see the optimized code that HotSpot is using after it's been running for a while?</p>\n", Id=512, Tags='java optimization hotspot', Title='How can I see the code that HotSpot generates after optimizing?', oneTag='java', words=['p', 'i', 'd', 'like', 'to', 'have', 'a', 'better', 'understanding', 'of', 'what', 'optimizations', 'hotspot', 'might', 'generate', 'for', 'my', 'java', 'code', 'at', 'run', 'time', 'p', 'p', 'is', 'there', 'a', 'way', 'to', 'see', 'the', 'optimized', 'code', 'that', 'hotspot', 'is', 'using', 'after', 'it', 's', 'been', 'running', 'for', 'a', 'while', 'p'], BodyLength=46, Desc="How can I see the code that HotSpot generates after optimizing? <p>I'd like to have a better understanding of what optimizations HotSpot might generate for my Java code at run time. </p>\n\n<p>Is there a way to see the optimized code 

### Question 5

Using the `MinMaxScaler` method, what's the normalized value for question Id = 512?

In [44]:
# Rescale each feature individually to a common range [min, max] linearly using column summary statistics, 
#   which is also known as min-max normalization or Rescaling. The rescaled value for feature E is calculated as,
#   Rescaled(e_i) = (e_i - E_min) / (E_max - E_min) * (max - min) + min
# For the case E_max == E_min, Rescaled(e_i) = 0.5 * (max + min)
mmScaler = MinMaxScaler(inputCol='DescVec', outputCol='DescVecMinMaxScaler')

# Fits a model to the input dataset with optional parameters
mmScalerModel = mmScaler.fit(stack_overflow_data2)

# Transforms the input dataset with optional parameters
stack_overflow_data2 = mmScalerModel.transform(stack_overflow_data2)
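
The rescaling formula in the comments above can be checked on a few numbers in plain Python. The helper `min_max_scale` and the values are made up for illustration. The sketch calls `builtins.min` and `builtins.max` explicitly because this notebook imports `min` and `max` from `pyspark.sql.functions`, which shadows the built-ins.

```python
import builtins  # min/max from pyspark.sql.functions shadow the built-ins in this notebook

def min_max_scale(column, new_min=0.0, new_max=1.0):
    """Rescale one feature column linearly to [new_min, new_max]."""
    e_min, e_max = builtins.min(column), builtins.max(column)
    if e_max == e_min:
        # Constant feature: every value maps to the midpoint of the target range
        return [0.5 * (new_max + new_min) for _ in column]
    return [(e - e_min) / (e_max - e_min) * (new_max - new_min) + new_min
            for e in column]

lengths = [10.0, 20.0, 50.0]  # made-up DescLength values
scaled = min_max_scale(lengths)
print(scaled)
```

The smallest value maps to 0, the largest to 1, and everything else falls in between in proportion to its distance from the minimum.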

In [45]:
# Filters rows using the given condition
stack_overflow_data2.where(stack_overflow_data2.Id == 512).collect()

[Row(Body="<p>I'd like to have a better understanding of what optimizations HotSpot might generate for my Java code at run time. </p>\n\n<p>Is there a way to see the optimized code that HotSpot is using after it's been running for a while?</p>\n", Id=512, Tags='java optimization hotspot', Title='How can I see the code that HotSpot generates after optimizing?', oneTag='java', words=['p', 'i', 'd', 'like', 'to', 'have', 'a', 'better', 'understanding', 'of', 'what', 'optimizations', 'hotspot', 'might', 'generate', 'for', 'my', 'java', 'code', 'at', 'run', 'time', 'p', 'p', 'is', 'there', 'a', 'way', 'to', 'see', 'the', 'optimized', 'code', 'that', 'hotspot', 'is', 'using', 'after', 'it', 's', 'been', 'running', 'for', 'a', 'while', 'p'], BodyLength=46, Desc="How can I see the code that HotSpot generates after optimizing? <p>I'd like to have a better understanding of what optimizations HotSpot might generate for my Java code at run time. </p>\n\n<p>Is there a way to see the optimized code 

# Dimensionality Reduction

In [46]:
# PCA trains a model to project vectors to a lower dimensional space of the top `k` principal components
pca = PCA(k=100, inputCol='TFIDF', outputCol='pcaTFIDF')

# Fits a model to the input dataset with optional parameters
pcaModel = pca.fit(stack_overflow_data)

# Transforms the input dataset with optional parameters
stack_overflow_data = pcaModel.transform(stack_overflow_data)

In [47]:
# Returns the first ``n`` rows
stack_overflow_data.head()

Row(Body="<p>I'd like to check if an uploaded file is an image file (e.g png, jpg, jpeg, gif, bmp) or another file. The problem is that I'm using Uploadify to upload the files, which changes the mime type and gives a 'text/octal' or something as the mime type, no matter which file type you upload.</p>\n\n<p>Is there a way to check if the uploaded file is an image apart from checking the file extension using PHP?</p>\n", Id=1, Tags='php image-processing file-upload upload mime-types', Title='How to check if an uploaded file is an image without mime type?', oneTag='php', words=['p', 'i', 'd', 'like', 'to', 'check', 'if', 'an', 'uploaded', 'file', 'is', 'an', 'image', 'file', 'e', 'g', 'png', 'jpg', 'jpeg', 'gif', 'bmp', 'or', 'another', 'file', 'the', 'problem', 'is', 'that', 'i', 'm', 'using', 'uploadify', 'to', 'upload', 'the', 'files', 'which', 'changes', 'the', 'mime', 'type', 'and', 'gives', 'a', 'text', 'octal', 'or', 'something', 'as', 'the', 'mime', 'type', 'no', 'matter', 'which

# Supervised ML Algorithms

### Linear Regression

In [48]:
# Create a user defined function (UDF)
number_of_tags = udf(lambda x: len(x.split(' ')), IntegerType())

# Returns a new `DataFrame` by adding a column or replacing the existing column that has the same name
stack_overflow_data = stack_overflow_data.withColumn('NumTags', number_of_tags(stack_overflow_data.Tags))
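
As a plain-Python sketch of what the UDF computes per row (no Spark required), the lambda splits the space-separated `Tags` string and counts the pieces. The helper name `count_tags` is made up for illustration; the sample string is the `Tags` value of the first row shown above.

```python
def count_tags(tags: str) -> int:
    """Count space-separated tags, mirroring the lambda passed to udf()."""
    return len(tags.split(' '))


first_row_tags = 'php image-processing file-upload upload mime-types'
num_tags = count_tags(first_row_tags)

# Edge case: splitting an empty string on ' ' yields [''], so the count is 1, not 0
empty_count = count_tags('')
```

If empty `Tags` values were possible, the UDF would need an explicit check for them; the counts reported below (1 to 5 tags) suggest that is not an issue in this data set.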

In [49]:
# Returns the first ``n`` rows
stack_overflow_data.head()

Row(Body="<p>I'd like to check if an uploaded file is an image file (e.g png, jpg, jpeg, gif, bmp) or another file. The problem is that I'm using Uploadify to upload the files, which changes the mime type and gives a 'text/octal' or something as the mime type, no matter which file type you upload.</p>\n\n<p>Is there a way to check if the uploaded file is an image apart from checking the file extension using PHP?</p>\n", Id=1, Tags='php image-processing file-upload upload mime-types', Title='How to check if an uploaded file is an image without mime type?', oneTag='php', words=['p', 'i', 'd', 'like', 'to', 'check', 'if', 'an', 'uploaded', 'file', 'is', 'an', 'image', 'file', 'e', 'g', 'png', 'jpg', 'jpeg', 'gif', 'bmp', 'or', 'another', 'file', 'the', 'problem', 'is', 'that', 'i', 'm', 'using', 'uploadify', 'to', 'upload', 'the', 'files', 'which', 'changes', 'the', 'mime', 'type', 'and', 'gives', 'a', 'text', 'octal', 'or', 'something', 'as', 'the', 'mime', 'type', 'no', 'matter', 'which

In [50]:
stack_overflow_data.groupby('NumTags').count().orderBy('NumTags').show()

+-------+-----+
|NumTags|count|
+-------+-----+
|      1|13858|
|      2|26540|
|      3|28769|
|      4|19108|
|      5|11725|
+-------+-----+



In [51]:
stack_overflow_data.groupby('NumTags').agg(avg(col('BodyLength'))).orderBy('NumTags').show()

+-------+------------------+
|NumTags|   avg(BodyLength)|
+-------+------------------+
|      1|135.41311877615817|
|      2|153.82456669178598|
|      3|172.73704334526747|
|      4|192.67050450073268|
|      5|218.54251599147122|
+-------+------------------+



In [52]:
# A feature transformer that merges multiple columns into a vector column
vecAssembler = VectorAssembler(inputCols=['BodyLength'], outputCol='LengthFeature')

# Transforms the input dataset with optional parameters
stack_overflow_data = vecAssembler.transform(stack_overflow_data)

In [53]:
# Returns the first ``n`` rows
stack_overflow_data.head()

Row(Body="<p>I'd like to check if an uploaded file is an image file (e.g png, jpg, jpeg, gif, bmp) or another file. The problem is that I'm using Uploadify to upload the files, which changes the mime type and gives a 'text/octal' or something as the mime type, no matter which file type you upload.</p>\n\n<p>Is there a way to check if the uploaded file is an image apart from checking the file extension using PHP?</p>\n", Id=1, Tags='php image-processing file-upload upload mime-types', Title='How to check if an uploaded file is an image without mime type?', oneTag='php', words=['p', 'i', 'd', 'like', 'to', 'check', 'if', 'an', 'uploaded', 'file', 'is', 'an', 'image', 'file', 'e', 'g', 'png', 'jpg', 'jpeg', 'gif', 'bmp', 'or', 'another', 'file', 'the', 'problem', 'is', 'that', 'i', 'm', 'using', 'uploadify', 'to', 'upload', 'the', 'files', 'which', 'changes', 'the', 'mime', 'type', 'and', 'gives', 'a', 'text', 'octal', 'or', 'something', 'as', 'the', 'mime', 'type', 'no', 'matter', 'which

In [54]:
# Linear regression 
# The learning objective is to minimize the specified loss function, with regularization
# This supports two kinds of loss:
#   * squaredError (a.k.a squared loss)
#   * huber (a hybrid of squared error for relatively small errors and absolute error for relatively large ones, and we estimate the scale parameter from training data)
# This supports multiple types of regularization:
#   * none (a.k.a. ordinary least squares)
#   * L2 (ridge regression)
#   * L1 (Lasso)
#   * L2 + L1 (elastic net)
# Note: Fitting with huber loss only supports none and L2 regularization
lr = LinearRegression(maxIter=5, regParam=0.0, fitIntercept=False, solver='normal')

In [55]:
data = stack_overflow_data.select(col('NumTags').alias('label'), col('LengthFeature').alias('features'))

# Returns the first ``n`` rows
data.head()

Row(label=5, features=DenseVector([83.0]))

In [56]:
# Fits a model to the input dataset with optional parameters
lrModel = lr.fit(data)

In [57]:
# LinearRegression model coefficients
lrModel.coefficients

DenseVector([0.0079])

In [58]:
# LinearRegression model intercept
lrModel.intercept

0.0

In [59]:
# LinearRegression summary (e.g. residuals, mse, r-squared) of model on training set
# An exception is thrown if `trainingSummary is None`
lrModelSummary = lrModel.summary

In [60]:
# LinearRegression R^2, the coefficient of determination
lrModelSummary.r2

0.42481762576079773
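
An R^2 of about 0.42 may look surprisingly high for a single coefficient and no intercept. For regression through the origin, R^2 is commonly computed against the uncentered total sum of squares (the sum of `y^2`) rather than the variance around the mean. Spark's regression metrics appear to follow this convention when `fitIntercept=False`, but treat that as an assumption. The sketch below uses made-up toy data and hypothetical helper names to show how much the two definitions can differ.

```python
def fit_through_origin(xs, ys):
    """Least-squares slope for y = b * x (no intercept)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


def r2_scores(xs, ys, slope):
    """Return (uncentered R^2, centered R^2) for predictions slope * x."""
    ss_err = sum((y - slope * x) ** 2 for x, y in zip(xs, ys))
    mean_y = sum(ys) / len(ys)
    ss_tot_uncentered = sum(y * y for y in ys)            # baseline: always predict 0
    ss_tot_centered = sum((y - mean_y) ** 2 for y in ys)  # baseline: always predict the mean
    return 1 - ss_err / ss_tot_uncentered, 1 - ss_err / ss_tot_centered


# Toy data, made up for illustration (not from the Stack Overflow data set)
xs = [1.0, 2.0, 3.0]
ys = [1.0, 2.0, 2.0]
slope = fit_through_origin(xs, ys)
r2_uncentered, r2_centered = r2_scores(xs, ys, slope)
```

On this toy data the uncentered score is much higher than the centered one, so R^2 values from models with and without an intercept are not directly comparable.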

### Question

Build a linear regression model using the length of the combined `Title` + `Body` fields. What is the value of r^2 when fitting a model with `maxIter=5`, `regParam=0.0`, `fitIntercept=False`, `solver='normal'`?

In [61]:
# Returns a new `DataFrame` by adding a column or replacing the existing column that has the same name
# concat: concatenates multiple input columns together into a single column;
#         the function works with strings, binary and compatible array columns
stack_overflow_data = stack_overflow_data.withColumn('Desc', concat(col('Title'), lit(' '), col('Body')))

In [62]:
# A regex based tokenizer that extracts tokens either by using the provided regex pattern (in Java dialect) to 
#    split the text (default) or repeatedly matching the regex (if gaps is false)
# Optional parameters also allow filtering tokens using a minimal length
# It returns an array of strings that can be empty 
regexTokenizer = RegexTokenizer(inputCol='Desc', outputCol='words2', pattern='\\W')

# Transforms the input dataset with optional parameters
stack_overflow_data = regexTokenizer.transform(stack_overflow_data)
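
A rough plain-Python equivalent of this tokenizer, assuming `RegexTokenizer`'s defaults (`gaps=True`, `toLowercase=True`, `minTokenLength=1`). The function name `regex_tokenize` is hypothetical, and Python's `re` module is not identical to Java's regex dialect (for example in how `\W` treats non-ASCII characters), so this is only an approximation.

```python
import re


def regex_tokenize(text: str, pattern: str = r'\W', min_token_length: int = 1):
    """Lowercase, split on every match of `pattern`, and drop tokens that are too short."""
    tokens = re.split(pattern, text.lower())
    return [t for t in tokens if len(t) >= min_token_length]


tokens = regex_tokenize("<p>I'd like to")
```

This also explains why HTML tag names such as `p` show up as tokens in the `words` column: the angle brackets are treated as separators, not the tag itself.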

In [63]:
# Returns a new `DataFrame` by adding a column or replacing the existing column that has the same name
stack_overflow_data = stack_overflow_data.withColumn('DescLength', body_length(stack_overflow_data.words2))

In [64]:
# A feature transformer that merges multiple columns into a vector column
vecAssembler = VectorAssembler(inputCols=['DescLength'], outputCol='DescVec')

# Transforms the input dataset with optional parameters
stack_overflow_data = vecAssembler.transform(stack_overflow_data)

In [65]:
stack_overflow_data.groupby('NumTags').agg(avg(col('DescLength'))).orderBy('NumTags').show()

+-------+------------------+
|NumTags|   avg(DescLength)|
+-------+------------------+
|      1|143.68776158175783|
|      2| 162.1539186134137|
|      3|181.26021064340088|
|      4|201.46530249110322|
|      5|227.64375266524522|
+-------+------------------+



In [66]:
# Linear regression 
# The learning objective is to minimize the specified loss function, with regularization
# This supports two kinds of loss:
#   * squaredError (a.k.a squared loss)
#   * huber (a hybrid of squared error for relatively small errors and absolute error for relatively large ones, and we estimate the scale parameter from training data)
# This supports multiple types of regularization:
#   * none (a.k.a. ordinary least squares)
#   * L2 (ridge regression)
#   * L1 (Lasso)
#   * L2 + L1 (elastic net)
# Note: Fitting with huber loss only supports none and L2 regularization
lr = LinearRegression(maxIter=5, regParam=0.0, fitIntercept=False, solver='normal')

In [67]:
data = stack_overflow_data.select(col('NumTags').alias('label'), col('DescVec').alias('features'))

# Returns the first ``n`` rows
data.head(5)

[Row(label=5, features=DenseVector([96.0])),
 Row(label=1, features=DenseVector([83.0])),
 Row(label=3, features=DenseVector([3168.0])),
 Row(label=3, features=DenseVector([124.0])),
 Row(label=3, features=DenseVector([154.0]))]

In [68]:
# Fits a model to the input dataset with optional parameters
lrModel = lr.fit(data)

In [69]:
# LinearRegression R^2, the coefficient of determination
lrModel.summary.r2

0.44551495963084176

### Logistic Regression

In [70]:
# Logistic regression
# This class supports multinomial logistic (softmax) and binomial logistic regression
logreg = LogisticRegression(maxIter=10, regParam=0.0)

In [71]:
# The label has more than two classes, so this fits a multinomial logistic regression
data2 = stack_overflow_data.select(col('label'), col('TFIDF').alias('features'))

# Returns the first ``n`` rows
data2.head()

Row(label=3.0, features=SparseVector(1000, {0: 0.0026, 1: 0.7515, 2: 0.1374, 3: 0.3184, 5: 0.3823, 8: 1.0754, 9: 0.3344, 15: 0.5899, 21: 1.8551, 28: 1.1263, 31: 1.1113, 35: 3.3134, 36: 1.2545, 43: 2.3741, 45: 2.3753, 48: 1.2254, 51: 1.1879, 57: 11.0264, 61: 2.8957, 71: 2.1945, 78: 1.6947, 84: 6.5898, 86: 1.6136, 94: 2.3569, 97: 1.8218, 99: 2.6292, 100: 1.9206, 115: 2.3592, 147: 5.4841, 152: 2.1116, 169: 2.6328, 241: 2.5745, 283: 3.2325, 306: 3.2668, 350: 6.2367, 490: 3.8893, 578: 3.6182, 759: 3.7771, 832: 8.8964}))

In [72]:
# Fits a model to the input dataset with optional parameters
logregModel = logreg.fit(data2)

In [73]:
# LogisticRegression model coefficients
logregModel.coefficientMatrix

DenseMatrix(301, 1000, [7.2356, 0.0372, 0.0333, 0.0894, -0.0442, 0.0287, 0.0018, 0.0007, ..., -0.0006, -0.0009, -0.0002, -0.0003, -0.0, -0.0015, -0.0003, -0.0005], 1)

In [74]:
# LogisticRegression model intercept
logregModel.interceptVector

DenseVector([5.0624, 4.2809, 4.1836, 4.0456, 3.9815, 3.8424, 3.3918, 3.4562, 3.3316, 3.2418, 2.9428, 2.8218, 2.7839, 2.7625, 2.6392, 2.5983, 2.4539, 2.4447, 2.3916, 2.3566, 2.1003, 2.0631, 2.0567, 1.7878, 1.7815, 1.7789, 1.7183, 1.5344, 1.5141, 1.4106, 1.3633, 1.3618, 1.3407, 1.3321, 1.3387, 1.2438, 1.1902, 1.1985, 1.2037, 1.2022, 1.1798, 1.1327, 1.1006, 1.0406, 0.9521, 0.9417, 0.9192, 0.9164, 0.8901, 0.8584, 0.8452, 0.8359, 0.8296, 0.8064, 0.7944, 0.7899, 0.7819, 0.7776, 0.7598, 0.7628, 0.7327, 0.7291, 0.6964, 0.6557, 0.6597, 0.6572, 0.6451, 0.6439, 0.6062, 0.6087, 0.5191, 0.5071, 0.5063, 0.5012, 0.466, 0.4616, 0.4529, 0.4337, 0.4241, 0.4104, 0.406, 0.3852, 0.3536, 0.3461, 0.3453, 0.3236, 0.2877, 0.2839, 0.2742, 0.2597, 0.2447, 0.2194, 0.1946, 0.1855, 0.1849, 0.1718, 0.1684, 0.1625, 0.1454, 0.1231, 0.1156, 0.1063, 0.0996, 0.0669, 0.0713, 0.0213, 0.0143, 0.0013, 0.0025, -0.0089, -0.0199, -0.0291, -0.0283, -0.0409, -0.0398, -0.0501, -0.0721, -0.0735, -0.0711, -0.0729, -0.0839, -0.0852, 

In [75]:
# LogisticRegression summary (e.g. accuracy/precision/recall, objective history, total iterations) of model on training set
# An exception is thrown if `trainingSummary is None`
logregModelSummary = logregModel.summary

In [76]:
# LogisticRegression accuracy (total number of correctly classified instances out of the total number of instances)
logregModelSummary.accuracy

0.3674

In [77]:
# Accuracy expected from guessing uniformly at random among the 301 tags
1/301.0

0.0033222591362126247
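
To put the two numbers side by side, the short calculation below expresses the training accuracy as a multiple of the uniform-guess rate. The variable names are made up for illustration; the values are the ones printed above.

```python
num_classes = 301        # number of rows in the coefficient matrix above
model_accuracy = 0.3674  # training accuracy reported above

chance_accuracy = 1 / num_classes
lift_over_chance = model_accuracy / chance_accuracy
```

Uniform guessing is a weak baseline when tags are imbalanced; always predicting the most frequent tag would be a stricter comparison.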

# Unsupervised ML Algorithms

### K-Means

Examine the distribution of the Title + Body length feature used earlier. Instead of using the raw number of words, group the questions into categories based on this length, for example short, longer, and super long.

### Question 1

How many times greater is the Description Length of the longest question than the Description Length of the shortest question (rounded to the nearest whole number)?

Tip: Don't forget to import Spark SQL's aggregate functions that can operate on DataFrame columns.

In [78]:
stack_overflow_data.agg(min('DescLength')).show()

+---------------+
|min(DescLength)|
+---------------+
|             10|
+---------------+



In [79]:
stack_overflow_data.agg(max('DescLength')).show()

+---------------+
|max(DescLength)|
+---------------+
|           7532|
+---------------+



In [80]:
stack_overflow_data.agg(max('DescLength')/min('DescLength')).show()

+-----------------------------------+
|(max(DescLength) / min(DescLength))|
+-----------------------------------+
|                              753.2|
+-----------------------------------+



### Question 2

What are the mean and standard deviation of the Description length?

In [81]:
stack_overflow_data.agg(avg('DescLength'), stddev('DescLength')).show()

+---------------+-----------------------+
|avg(DescLength)|stddev_samp(DescLength)|
+---------------+-----------------------+
|      180.28187|     192.10819533505128|
+---------------+-----------------------+
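
The column header `stddev_samp` indicates that `stddev` is the sample standard deviation (dividing by `n - 1`); `pyspark.sql.functions` also provides `stddev_pop` for the population version. The standard-library sketch below shows the difference on the five `DescLength` values printed by `data.head(5)` earlier.

```python
import statistics

# The five DescLength values shown by data.head(5) above
lengths = [96, 83, 3168, 124, 154]

mean_length = statistics.mean(lengths)
sample_std = statistics.stdev(lengths)       # divides by n - 1, like stddev_samp
population_std = statistics.pstdev(lengths)  # divides by n, like stddev_pop
```

With 100,000 rows the two versions are nearly identical; the distinction only matters for small samples like this one.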



### Question 3

Use K-means to create 5 clusters of Description lengths: set the random seed to 42 and fit a 5-cluster K-means model on the Description length column (use `KMeans().setParams(...)`).

What length is the center of the cluster representing the longest questions?

In [82]:
# K-means clustering with a k-means++ like initialization mode (the k-means|| algorithm by Bahmani et al)
# Sets params for KMeans
kmeans = KMeans().setParams(featuresCol='DescVec', predictionCol='DescGroup', k=5, seed=42)

# Fits a model to the input dataset with optional parameters
kmeansModel = kmeans.fit(stack_overflow_data)

# Transforms the input dataset with optional parameters
stack_overflow_data = kmeansModel.transform(stack_overflow_data)

In [83]:
# Returns the first ``n`` rows
stack_overflow_data.head()

Row(Body="<p>I'd like to check if an uploaded file is an image file (e.g png, jpg, jpeg, gif, bmp) or another file. The problem is that I'm using Uploadify to upload the files, which changes the mime type and gives a 'text/octal' or something as the mime type, no matter which file type you upload.</p>\n\n<p>Is there a way to check if the uploaded file is an image apart from checking the file extension using PHP?</p>\n", Id=1, Tags='php image-processing file-upload upload mime-types', Title='How to check if an uploaded file is an image without mime type?', oneTag='php', words=['p', 'i', 'd', 'like', 'to', 'check', 'if', 'an', 'uploaded', 'file', 'is', 'an', 'image', 'file', 'e', 'g', 'png', 'jpg', 'jpeg', 'gif', 'bmp', 'or', 'another', 'file', 'the', 'problem', 'is', 'that', 'i', 'm', 'using', 'uploadify', 'to', 'upload', 'the', 'files', 'which', 'changes', 'the', 'mime', 'type', 'and', 'gives', 'a', 'text', 'octal', 'or', 'something', 'as', 'the', 'mime', 'type', 'no', 'matter', 'which

In [84]:
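# The per-cluster average of DescLength approximates each cluster center;
# the fitted centers themselves are available from kmeansModel.clusterCenters()
stack_overflow_data.groupby('DescGroup').agg(avg(col('DescLength')), avg(col('NumTags')), count(col('DescLength'))).orderBy('avg(DescLength)').show()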

+---------+------------------+------------------+-----------------+
|DescGroup|   avg(DescLength)|      avg(NumTags)|count(DescLength)|
+---------+------------------+------------------+-----------------+
|        4| 92.75317245164402| 2.732166913366707|            60127|
|        0|224.90495069296375| 3.068663379530917|            30016|
|        2| 457.1547183613753|3.2275054864667156|             8202|
|        3| 989.9467576791809| 3.279180887372014|             1465|
|        1| 2634.815789473684|3.3684210526315788|              190|
+---------+------------------+------------------+-----------------+
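
For intuition, here is a minimal one-dimensional version of Lloyd's algorithm, the iteration at the core of K-means. It is a sketch under simplifying assumptions: the starting centers are fixed by hand (Spark uses the seeded k-means|| initialization instead), the data are made-up toy lengths, and `kmeans_1d` is a hypothetical helper.

```python
def kmeans_1d(values, centers, max_iter=20):
    """Lloyd's algorithm on scalars: assign to the nearest center, then re-average."""
    centers = list(centers)
    for _ in range(max_iter):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Keep the old center if a cluster ends up empty
        new_centers = [sum(c) / len(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:
            break  # assignments no longer change the centers
        centers = new_centers
    return centers


# Toy lengths, made up for illustration: two short, two medium, two long
lengths = [10, 12, 100, 110, 1000, 1100]
centers = kmeans_1d(lengths, centers=[10.0, 100.0, 1000.0])
longest_center = max(centers)
```

At convergence each center is the mean of the points assigned to it, which is why the grouped averages in the table above are a reasonable stand-in for the cluster centers.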



# ML Pipelines

In [85]:
path = '/Users/yangweichle/Documents/Employment/TRAINING/DATA SCIENCE/Spark/Udacity_Spark for Big Data/Machine Learning with Spark/data/Train_onetag_small.json'

# Loads JSON files and returns the results as a `DataFrame`
# Note: path: string represents path to the JSON dataset, or a list of paths, or RDD of Strings storing JSON objects
stack_overflow_data = spark.read.json(path=path)

# Sets the storage level to persist the contents of the `DataFrame` across operations after the first time it is computed
# This can only be used to assign a new storage level if the `DataFrame` does not have a storage level set yet
# If no storage level is specified, it defaults to `MEMORY_AND_DISK`
stack_overflow_data.persist()

DataFrame[Body: string, Id: bigint, Tags: string, Title: string, oneTag: string]

In [86]:
# A regex based tokenizer that extracts tokens either by using the provided regex pattern (in Java dialect) to 
#    split the text (default) or repeatedly matching the regex (if gaps is false)
# Optional parameters also allow filtering tokens using a minimal length
# It returns an array of strings that can be empty
regexTokenizer = RegexTokenizer(inputCol='Body', outputCol='words', pattern='\\W')

# Extracts a vocabulary from document collections and generates a `CountVectorizerModel`
cv = CountVectorizer(inputCol='words', outputCol='TF', vocabSize=1000)

# Compute the Inverse Document Frequency (IDF) given a collection of documents
idf = IDF(inputCol='TF', outputCol='features')

# A label indexer that maps a string column of labels to an ML column of label indices
# If the input column is numeric, we cast it to string and index the string values
# The indices are in [0, numLabels). By default, this is ordered by label frequencies so the most frequent label gets index 0
# Note: stringOrderType: the ordering behavior; default value is 'frequencyDesc'; other options are 'frequencyAsc', 'alphabetDesc' and 'alphabetAsc'
stringIndexer = StringIndexer(inputCol='oneTag', outputCol='label')
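
The `CountVectorizer` + `IDF` stages together produce TF-IDF weights. The sketch below reproduces the idea in plain Python on made-up documents, using the smoothed formula `log((m + 1) / (df + 1))` given in the Spark ML feature documentation (`m` is the number of documents, `df` the number of documents containing the term). All names here are illustrative.

```python
import math
from collections import Counter

# Toy tokenized documents, made up for illustration
docs = [['spark', 'java', 'spark'], ['java', 'python'], ['java', 'sql']]

num_docs = len(docs)
# Document frequency: number of documents containing each term
doc_freq = Counter(term for doc in docs for term in set(doc))
# Smoothed inverse document frequency
idf_weights = {term: math.log((num_docs + 1) / (df + 1)) for term, df in doc_freq.items()}


def tfidf(doc):
    """Raw term counts (what CountVectorizer emits) scaled by IDF."""
    return {term: count * idf_weights[term] for term, count in Counter(doc).items()}


first_doc_weights = tfidf(docs[0])
```

A term that appears in every document (`java` here) gets an IDF of zero, so it contributes nothing to the feature vector no matter how often it occurs.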

In [87]:
# Logistic regression
# This class supports multinomial logistic (softmax) and binomial logistic regression
logreg = LogisticRegression(maxIter=10, regParam=0.0, elasticNetParam=0)

In [88]:
# A simple pipeline, which acts as an estimator
# A Pipeline consists of a sequence of stages, each of which is either an `Estimator` or a `Transformer`
# When `Pipeline.fit` is called, the stages are executed in order
# If a stage is an `Estimator`, its `Estimator.fit` method will be called on the input dataset to fit a model
#   Then the model, which is a transformer, will be used to transform the dataset as the input to the next stage
# If a stage is a `Transformer`, its `Transformer.transform` method will be called to produce the dataset for the next stage
# The fitted model from a `Pipeline` is a `PipelineModel`, which consists of fitted models and transformers, corresponding to the pipeline stages
# If stages is an empty list, the pipeline acts as an identity transformer
pipeline = Pipeline(stages=[regexTokenizer, cv, idf, stringIndexer, logreg])
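
The estimator/transformer contract described in the comments above can be sketched without Spark. Everything below (`Lowercase`, `VocabularyEstimator`, `VocabularyModel`, `fit_pipeline`, `transform_pipeline`) is a hypothetical toy, not Spark's implementation; it only illustrates the control flow of fitting stages in order.

```python
class Lowercase:
    """Transformer: exposes transform() only."""
    def transform(self, rows):
        return [r.lower() for r in rows]


class VocabularyEstimator:
    """Estimator: fit() learns state and returns a fitted transformer."""
    def fit(self, rows):
        vocab = sorted({w for r in rows for w in r.split()})
        return VocabularyModel(vocab)


class VocabularyModel:
    """Fitted model: maps known words to their vocabulary index."""
    def __init__(self, vocab):
        self.index = {w: i for i, w in enumerate(vocab)}

    def transform(self, rows):
        return [[self.index[w] for w in r.split() if w in self.index] for r in rows]


def fit_pipeline(stages, rows):
    """Fit stages in order, feeding each stage's output to the next stage."""
    fitted = []
    for stage in stages:
        model = stage.fit(rows) if hasattr(stage, 'fit') else stage
        rows = model.transform(rows)
        fitted.append(model)
    return fitted


def transform_pipeline(fitted, rows):
    """Apply every fitted stage in order."""
    for model in fitted:
        rows = model.transform(rows)
    return rows


fitted = fit_pipeline([Lowercase(), VocabularyEstimator()], ['Spark ML', 'ML Pipelines'])
encoded = transform_pipeline(fitted, ['Spark Pipelines'])
```

The key point is that state learned during fitting (here the vocabulary) is frozen in the fitted stages and reused unchanged on new data.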

In [89]:
# Fits a model to the input dataset with optional parameters
plogregModel = pipeline.fit(stack_overflow_data)

# Transforms the input dataset with optional parameters
stack_overflow_data = plogregModel.transform(stack_overflow_data)

In [90]:
# Returns the first ``n`` rows
stack_overflow_data.head()

Row(Body="<p>I'd like to check if an uploaded file is an image file (e.g png, jpg, jpeg, gif, bmp) or another file. The problem is that I'm using Uploadify to upload the files, which changes the mime type and gives a 'text/octal' or something as the mime type, no matter which file type you upload.</p>\n\n<p>Is there a way to check if the uploaded file is an image apart from checking the file extension using PHP?</p>\n", Id=1, Tags='php image-processing file-upload upload mime-types', Title='How to check if an uploaded file is an image without mime type?', oneTag='php', words=['p', 'i', 'd', 'like', 'to', 'check', 'if', 'an', 'uploaded', 'file', 'is', 'an', 'image', 'file', 'e', 'g', 'png', 'jpg', 'jpeg', 'gif', 'bmp', 'or', 'another', 'file', 'the', 'problem', 'is', 'that', 'i', 'm', 'using', 'uploadify', 'to', 'upload', 'the', 'files', 'which', 'changes', 'the', 'mime', 'type', 'and', 'gives', 'a', 'text', 'octal', 'or', 'something', 'as', 'the', 'mime', 'type', 'no', 'matter', 'which

In [91]:
# Total number of correctly classified instances
stack_overflow_data.where(stack_overflow_data.label == stack_overflow_data.prediction).count()

36740

In [92]:
# Total number of instances
stack_overflow_data.count()

100000

In [93]:
# Accuracy = total number of correctly classified instances / total number of instances
(stack_overflow_data.where(stack_overflow_data.label == stack_overflow_data.prediction).count())/stack_overflow_data.count()

0.3674

In [94]:
# Evaluator for Multiclass Classification, which expects two input columns: prediction and label
evaluator = MulticlassClassificationEvaluator(predictionCol='prediction', labelCol='label', metricName='accuracy')

# Evaluates the output with optional parameters
accuracy = evaluator.evaluate(stack_overflow_data)

In [95]:
# LogisticRegression accuracy (total number of correctly classified instances out of the total number of instances)
accuracy

0.3674
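
The manual count and the evaluator agree because accuracy is simply the share of rows where `prediction` equals `label`. A minimal sketch, with a hypothetical `accuracy` helper and made-up toy labels:

```python
def accuracy(labels, predictions):
    """Fraction of positions where the prediction equals the label."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


# Toy labels and predictions, made up for illustration
toy_accuracy = accuracy([0.0, 1.0, 2.0, 1.0], [0.0, 1.0, 1.0, 1.0])

# The counts reported above reproduce the notebook's figure
notebook_accuracy = 36740 / 100000
```

Note that this figure is measured on the same rows the pipeline was fitted on, so it is a training accuracy rather than an estimate of performance on unseen questions.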

# Model Selection and Tuning

In [96]:
path = '/Users/yangweichle/Documents/Employment/TRAINING/DATA SCIENCE/Spark/Udacity_Spark for Big Data/Machine Learning with Spark/data/Train_onetag_small.json'

# Loads JSON files and returns the results as a `DataFrame`
# Note: path: string represents path to the JSON dataset, or a list of paths, or RDD of Strings storing JSON objects
stack_overflow_data = spark.read.json(path=path)

# Sets the storage level to persist the contents of the `DataFrame` across operations after the first time it is computed
# This can only be used to assign a new storage level if the `DataFrame` does not have a storage level set yet
# If no storage level is specified, it defaults to `MEMORY_AND_DISK`
stack_overflow_data.persist()

DataFrame[Body: string, Id: bigint, Tags: string, Title: string, oneTag: string]

### Step 1. Train Test Split

As a first step, split the data set into 80% training data and 20% test data. Set the random seed to `42`.

In [97]:
# Randomly splits this `DataFrame` with the provided weights
train, test = stack_overflow_data.randomSplit([0.8, 0.2], seed=42)

# Train, Test, Validation sets
#train, rest = stack_overflow_data.randomSplit([0.6, 0.4], seed=42)
#test, validation = rest.randomSplit([0.5, 0.5], seed=42)
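
`randomSplit` assigns rows to splits by random sampling, so the weights are targets rather than exact sizes; this is why the test set below has 19,978 rows instead of exactly 20,000. The sketch illustrates the idea with Python's `random` module. The helper `random_split` is hypothetical, and Python's generator will not reproduce Spark's split for the same seed.

```python
import random


def random_split(rows, train_weight=0.8, seed=42):
    """Send each row to train with probability train_weight, otherwise to test."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_weight else test).append(row)
    return train, test


rows = list(range(10_000))  # stand-in for DataFrame rows
train_rows, test_rows = random_split(rows)
test_fraction = len(test_rows) / len(rows)
```

Every row lands in exactly one split, and the test fraction comes out close to, but usually not exactly, 20%.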

### Step 2. Build Pipeline

In [98]:
# A regex based tokenizer that extracts tokens either by using the provided regex pattern (in Java dialect) to 
#    split the text (default) or repeatedly matching the regex (if gaps is false)
# Optional parameters also allow filtering tokens using a minimal length
# It returns an array of strings that can be empty
regexTokenizer = RegexTokenizer(inputCol='Body', outputCol='words', pattern='\\W')

# Extracts a vocabulary from document collections and generates a `CountVectorizerModel`
cv = CountVectorizer(inputCol='words', outputCol='TF', vocabSize=1000)

# Compute the Inverse Document Frequency (IDF) given a collection of documents
idf = IDF(inputCol='TF', outputCol='features')

# A label indexer that maps a string column of labels to an ML column of label indices
# If the input column is numeric, we cast it to string and index the string values
# The indices are in [0, numLabels). By default, this is ordered by label frequencies so the most frequent label gets index 0
# Note: stringOrderType: the ordering behavior; default value is 'frequencyDesc'; other options are 'frequencyAsc', 'alphabetDesc' and 'alphabetAsc'
stringIndexer = StringIndexer(inputCol='oneTag', outputCol='label')
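
What `StringIndexer` does with the default `frequencyDesc` ordering can be sketched with `collections.Counter`. The tag list is made up and deliberately has no frequency ties, since tie-breaking is not covered here.

```python
from collections import Counter

# Toy tag column, made up for illustration (no ties in frequency)
tags = ['java', 'php', 'java', 'python', 'java', 'php']

# Most frequent label gets index 0.0, the next gets 1.0, and so on
ordered = [tag for tag, _ in Counter(tags).most_common()]
label_index = {tag: float(i) for i, tag in enumerate(ordered)}
labels = [label_index[t] for t in tags]
```

Because the mapping is learned from the data it is fitted on, a tag that appears only in the test split has no index; how Spark treats such rows is controlled by the indexer's `handleInvalid` parameter.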

In [99]:
# Logistic regression
# This class supports multinomial logistic (softmax) and binomial logistic regression
logreg = LogisticRegression(maxIter=10, regParam=0.0, elasticNetParam=0)

In [100]:
# A simple pipeline, which acts as an estimator
# A Pipeline consists of a sequence of stages, each of which is either an `Estimator` or a `Transformer`
# When `Pipeline.fit` is called, the stages are executed in order
# If a stage is an `Estimator`, its `Estimator.fit` method will be called on the input dataset to fit a model
#   Then the model, which is a transformer, will be used to transform the dataset as the input to the next stage
# If a stage is a `Transformer`, its `Transformer.transform` method will be called to produce the dataset for the next stage
# The fitted model from a `Pipeline` is a `PipelineModel`, which consists of fitted models and transformers, corresponding to the pipeline stages
# If stages is an empty list, the pipeline acts as an identity transformer
pipeline = Pipeline(stages=[regexTokenizer, cv, idf, stringIndexer, logreg])

In [101]:
# Fits a model to the input dataset with optional parameters
plogregModel = pipeline.fit(train)

In [102]:
# Transforms the input dataset with optional parameters
results = plogregModel.transform(test)

In [103]:
# Returns the first ``n`` rows
results.head()

Row(Body='<blockquote>\n  <p><strong>Possible Duplicate:</strong><br>\n  <a href="http://cstheory.stackexchange.com/questions/1574/do-you-use-any-article-organizers">Do you use any article organizers?</a>  </p>\n</blockquote>\n\n\n\n<p>As part of my Ph.D. studies I need to create an overview of recently (since 2000) published papers with impact on my study field.</p>\n\n<p>Creating a list of articles isn\'t a problem, but what I also want to do is to follow the most cited articles and authors (after I\'m done with this).</p>\n\n<p>Is there some tool, that would allow me to store the articles and then simplify the most cited authors/articles search?</p>\n', Id=6835, Tags='soft-question software citations', Title='Tool for managing scientific papers', oneTag='soft-question', words=['blockquote', 'p', 'strong', 'possible', 'duplicate', 'strong', 'br', 'a', 'href', 'http', 'cstheory', 'stackexchange', 'com', 'questions', '1574', 'do', 'you', 'use', 'any', 'article', 'organizers', 'do', 'yo

In [104]:
# Total number of correctly classified instances
results.where(results.label == results.prediction).count()

6851

In [105]:
# Total number of instances
results.count()

19978

In [106]:
# Accuracy = total number of correctly classified instances / total number of instances
(results.where(results.label == results.prediction).count())/results.count()

0.34292721994193615

In [107]:
# Evaluator for Multiclass Classification, which expects two input columns: prediction and label
evaluator = MulticlassClassificationEvaluator(predictionCol='prediction', labelCol='label', metricName='accuracy')

# Evaluates the output with optional parameters
accuracy = evaluator.evaluate(results)

In [108]:
# LogisticRegression accuracy (total number of correctly classified instances out of the total number of instances)
accuracy

0.34292721994193615

### K-Fold Cross Validation

### Step 3. Tune Model

On the 80% training split, find the most accurate logistic regression model using 3-fold cross-validation with the following parameter grid:

- CountVectorizer vocabulary size: `[1000, 2000]`
- LogisticRegression regularization parameter: `[0.0, 0.1]`
- LogisticRegression maximum number of iterations: `[10]`

In [109]:
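# Builder for a param grid used in grid search-based model selection
# `addGrid` registers a list of candidate values for a parameter; `build` returns every combination
# maxIter is not added to the grid because `logreg` already fixes it at 10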
paramGrid = ParamGridBuilder() \
    .addGrid(cv.vocabSize, [1000, 2000]) \
    .addGrid(logreg.regParam, [0.0, 0.1]) \
    .build()

# K-fold cross validation performs model selection by splitting the dataset into a set of non-overlapping randomly
#   partitioned folds which are used as separate training and test datasets 
#   e.g., with k=3 folds, K-fold cross validation will generate 3 (training, test) dataset pairs, each of which
#   uses 2/3 of the data for training and 1/3 for testing
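# Each fold is used as the test set exactly once
# Note: MulticlassClassificationEvaluator() defaults to metricName='f1', so candidates are ranked by F1, not accuracy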
crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=MulticlassClassificationEvaluator(),
                          numFolds=3)
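
The fold logic described in the comments can be sketched on row indices alone. This version assigns folds round-robin for readability, whereas Spark partitions rows randomly; `k_fold_indices` is a hypothetical helper.

```python
def k_fold_indices(n_rows, k=3):
    """Yield (train_idx, validation_idx) pairs; each fold is used for validation once."""
    # Deterministic round-robin folds (Spark assigns rows to folds at random instead)
    folds = [list(range(i, n_rows, k)) for i in range(k)]
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


splits = list(k_fold_indices(9, k=3))
```

With 9 rows and 3 folds, each pair trains on 6 rows and validates on the remaining 3, matching the 2/3 to 1/3 proportion noted above.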

In [110]:
# Fits a model to the input dataset with optional parameters
cvModel = crossval.fit(train)

In [111]:
# Average cross-validation metric for each parameter combination, in `paramGrid` order
cvModel.avgMetrics

[0.30080930368241526,
 0.23247357061619897,
 0.3280572469357257,
 0.2580292235553692]
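
To read these four numbers, it helps to line them up with the grid. The sketch below rebuilds the grid with `itertools.product` and picks the best entry. It assumes the metrics are listed in the same order as the grid was built (vocabulary size varying slowest), which is worth confirming against `paramGrid` itself; the dictionary keys and variable names are illustrative.

```python
from itertools import product

vocab_sizes = [1000, 2000]
reg_params = [0.0, 0.1]

# Cartesian product of the two value lists: 2 x 2 = 4 candidate settings
grid = [{'vocabSize': v, 'regParam': r} for v, r in product(vocab_sizes, reg_params)]

# avgMetrics as printed above; ASSUMED to be in the same order as `grid`
avg_metrics = [0.30080930368241526, 0.23247357061619897,
               0.3280572469357257, 0.2580292235553692]

best_index = max(range(len(grid)), key=lambda i: avg_metrics[i])
best_params = grid[best_index]

num_folds = 3
fits_during_search = len(grid) * num_folds  # plus one final refit on the full training set
```

Under that ordering assumption, the larger vocabulary without regularization scores best, and regularization with `regParam=0.1` hurts at both vocabulary sizes.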

### Step 4. Compute Accuracy of Best Model

In [112]:
# Transforms the input dataset with optional parameters
results = cvModel.transform(test)

In [113]:
# Total number of correctly classified instances
results.where(results.label == results.prediction).count()

7193

In [114]:
# Total number of instances
results.count()

19978

In [115]:
# Accuracy = total number of correctly classified instances / total number of instances
(results.where(results.label == results.prediction).count())/results.count()

0.3600460506557213

In [116]:
# Evaluator for Multiclass Classification, which expects two input columns: prediction and label
evaluator = MulticlassClassificationEvaluator(predictionCol='prediction', labelCol='label', metricName='accuracy')

# Evaluates the output with optional parameters
accuracy = evaluator.evaluate(results)

In [117]:
# LogisticRegression accuracy (total number of correctly classified instances out of the total number of instances)
accuracy

0.3600460506557213