
In [1]:
pip install pyspark

Collecting pyspark
  Downloading pyspark-3.2.0.tar.gz (281.3 MB)
Collecting py4j==0.10.9.2
  Downloading py4j-0.10.9.2-py2.py3-none-any.whl (198 kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.2.0-py2.py3-none-any.whl size=281805912 sha256=eb81f2e61e66d2f8829fa7ad68c476238b17856c25e1ffaaa6c5db0cca6007f4
  Stored in directory: /root/.cache/pip/wheels/0b/de/d2/9be5d59d7331c6c2a7c1b6d1a4f463ce107332b1ecd4e80718
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9.2 pyspark-3.2.0


In [2]:
from pyspark.sql import SparkSession
from pyspark.ml.feature import RegexTokenizer, VectorAssembler, Normalizer, StandardScaler
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

import re

In [3]:
spark = SparkSession.builder \
    .master("local") \
    .appName("Word Count") \
    .getOrCreate()

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [5]:
stack_overflow_data = 'drive/MyDrive/Github/spark/data/Train_onetag_small.json'

In [6]:
df = spark.read.json(stack_overflow_data)

In [7]:
df.head()

Row(Body="<p>I'd like to check if an uploaded file is an image file (e.g png, jpg, jpeg, gif, bmp) or another file. The problem is that I'm using Uploadify to upload the files, which changes the mime type and gives a 'text/octal' or something as the mime type, no matter which file type you upload.</p>\n\n<p>Is there a way to check if the uploaded file is an image apart from checking the file extension using PHP?</p>\n", Id=1, Tags='php image-processing file-upload upload mime-types', Title='How to check if an uploaded file is an image without mime type?', oneTag='php')

# **Tokenization**

In [8]:
regexTokenizer = RegexTokenizer(inputCol="Body", outputCol="words", pattern="\\W")
df = regexTokenizer.transform(df)
df.head()

Row(Body="<p>I'd like to check if an uploaded file is an image file (e.g png, jpg, jpeg, gif, bmp) or another file. The problem is that I'm using Uploadify to upload the files, which changes the mime type and gives a 'text/octal' or something as the mime type, no matter which file type you upload.</p>\n\n<p>Is there a way to check if the uploaded file is an image apart from checking the file extension using PHP?</p>\n", Id=1, Tags='php image-processing file-upload upload mime-types', Title='How to check if an uploaded file is an image without mime type?', oneTag='php', words=['p', 'i', 'd', 'like', 'to', 'check', 'if', 'an', 'uploaded', 'file', 'is', 'an', 'image', 'file', 'e', 'g', 'png', 'jpg', 'jpeg', 'gif', 'bmp', 'or', 'another', 'file', 'the', 'problem', 'is', 'that', 'i', 'm', 'using', 'uploadify', 'to', 'upload', 'the', 'files', 'which', 'changes', 'the', 'mime', 'type', 'and', 'gives', 'a', 'text', 'octal', 'or', 'something', 'as', 'the', 'mime', 'type', 'no', 'matter', 'which
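To make the tokenization step concrete, here is a minimal plain-Python sketch of what RegexTokenizer appears to do with `pattern="\\W"`: lowercase the text, split on every non-word character, and drop empty pieces. The function name `regex_tokenize` and the shortened sample string are illustrative assumptions, and this is an approximation of the behaviour, not Spark's implementation.

```python
import re


def regex_tokenize(text, pattern=r"\W", min_token_length=1):
    """Lowercase the text, split on the pattern, and drop tokens that are too short."""
    # re.ASCII restricts \W to ASCII word characters (an assumption made here
    # to stay close to the default behaviour of Java regular expressions).
    pieces = re.split(pattern, text.lower(), flags=re.ASCII)
    return [piece for piece in pieces if len(piece) >= min_token_length]


# Shortened version of the first question body (hypothetical sample).
sample = "<p>I'd like to check if an uploaded file is an image file</p>"
tokens = regex_tokenize(sample)
print(tokens)
```

Note how the HTML tag `<p>` survives as the token `p` and the apostrophe splits `I'd` into `i` and `d`, which is exactly what the `words` column above starts with.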

In [9]:
body_length = udf(lambda x: len(x), IntegerType())
df = df.withColumn("BodyLength", body_length(df.words))

In [10]:
number_of_paragraphs = udf(lambda x: len(re.findall("</p>", x)), IntegerType())
number_of_links = udf(lambda x: len(re.findall("</a>", x)), IntegerType())

In [11]:
df = df.withColumn("NumParagraphs", number_of_paragraphs(df.Body))
df = df.withColumn("NumLinks", number_of_links(df.Body))
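The two UDFs simply count closing tags in the raw HTML. A small plain-Python check, using a made-up HTML snippet (the `body` string and the helper name `count_closing_tags` are assumptions for illustration), shows the idea outside Spark:

```python
import re


def count_closing_tags(html, tag):
    """Count occurrences of the closing tag </tag> in an HTML string."""
    return len(re.findall(f"</{tag}>", html))


# Hypothetical question body with two paragraphs and one link.
body = (
    '<p>First paragraph with a <a href="https://example.com">link</a>.</p>\n\n'
    "<p>Second paragraph.</p>\n"
)

num_paragraphs = count_closing_tags(body, "p")
num_links = count_closing_tags(body, "a")
print(num_paragraphs, num_links)
```

Counting closing tags avoids having to match attributes such as `href`, which only appear in the opening tag.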

In [12]:
df.head(2)

[Row(Body="<p>I'd like to check if an uploaded file is an image file (e.g png, jpg, jpeg, gif, bmp) or another file. The problem is that I'm using Uploadify to upload the files, which changes the mime type and gives a 'text/octal' or something as the mime type, no matter which file type you upload.</p>\n\n<p>Is there a way to check if the uploaded file is an image apart from checking the file extension using PHP?</p>\n", Id=1, Tags='php image-processing file-upload upload mime-types', Title='How to check if an uploaded file is an image without mime type?', oneTag='php', words=['p', 'i', 'd', 'like', 'to', 'check', 'if', 'an', 'uploaded', 'file', 'is', 'an', 'image', 'file', 'e', 'g', 'png', 'jpg', 'jpeg', 'gif', 'bmp', 'or', 'another', 'file', 'the', 'problem', 'is', 'that', 'i', 'm', 'using', 'uploadify', 'to', 'upload', 'the', 'files', 'which', 'changes', 'the', 'mime', 'type', 'and', 'gives', 'a', 'text', 'octal', 'or', 'something', 'as', 'the', 'mime', 'type', 'no', 'matter', 'whic

## **VectorAssembler**

Combine the body length, number of paragraphs, and number of links columns into a single feature vector.

In [13]:
assembler = VectorAssembler(inputCols=["BodyLength", "NumParagraphs", "NumLinks"], outputCol="NumFeatures")
df = assembler.transform(df)

In [14]:
df.head()

Row(Body="<p>I'd like to check if an uploaded file is an image file (e.g png, jpg, jpeg, gif, bmp) or another file. The problem is that I'm using Uploadify to upload the files, which changes the mime type and gives a 'text/octal' or something as the mime type, no matter which file type you upload.</p>\n\n<p>Is there a way to check if the uploaded file is an image apart from checking the file extension using PHP?</p>\n", Id=1, Tags='php image-processing file-upload upload mime-types', Title='How to check if an uploaded file is an image without mime type?', oneTag='php', words=['p', 'i', 'd', 'like', 'to', 'check', 'if', 'an', 'uploaded', 'file', 'is', 'an', 'image', 'file', 'e', 'g', 'png', 'jpg', 'jpeg', 'gif', 'bmp', 'or', 'another', 'file', 'the', 'problem', 'is', 'that', 'i', 'm', 'using', 'uploadify', 'to', 'upload', 'the', 'files', 'which', 'changes', 'the', 'mime', 'type', 'and', 'gives', 'a', 'text', 'octal', 'or', 'something', 'as', 'the', 'mime', 'type', 'no', 'matter', 'which

## **Normalize the Vectors**

In [15]:
scaler = Normalizer(inputCol="NumFeatures", outputCol="ScaledNumFeatures")
df = scaler.transform(df)

In [16]:
df.head(2)

[Row(Body="<p>I'd like to check if an uploaded file is an image file (e.g png, jpg, jpeg, gif, bmp) or another file. The problem is that I'm using Uploadify to upload the files, which changes the mime type and gives a 'text/octal' or something as the mime type, no matter which file type you upload.</p>\n\n<p>Is there a way to check if the uploaded file is an image apart from checking the file extension using PHP?</p>\n", Id=1, Tags='php image-processing file-upload upload mime-types', Title='How to check if an uploaded file is an image without mime type?', oneTag='php', words=['p', 'i', 'd', 'like', 'to', 'check', 'if', 'an', 'uploaded', 'file', 'is', 'an', 'image', 'file', 'e', 'g', 'png', 'jpg', 'jpeg', 'gif', 'bmp', 'or', 'another', 'file', 'the', 'problem', 'is', 'that', 'i', 'm', 'using', 'uploadify', 'to', 'upload', 'the', 'files', 'which', 'changes', 'the', 'mime', 'type', 'and', 'gives', 'a', 'text', 'octal', 'or', 'something', 'as', 'the', 'mime', 'type', 'no', 'matter', 'whic
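With its default settings the Normalizer rescales each row vector to unit L2 norm, independently of all other rows. A minimal plain-Python sketch (the helper name `l2_normalize` is an assumption, and this is not Spark's code) reproduces the first component reported for the features `[63.0, 1.0, 0.0]` of question Id 1112 later in this notebook:

```python
import math


def l2_normalize(vector):
    """Divide every component by the Euclidean norm of the vector."""
    norm = math.sqrt(sum(v * v for v in vector))
    if norm == 0.0:
        # A zero vector has no direction; this sketch leaves it unchanged.
        return list(vector)
    return [v / norm for v in vector]


normalized = l2_normalize([63.0, 1.0, 0.0])
print(normalized)

# A one-element vector with a positive value always normalizes to [1.0].
single = l2_normalize([96.0])
print(single)
```

The second call matters when a single length column is normalized on its own: every row then ends up with the same value, so row-wise normalization carries no information in that case.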

## **Scale the Vectors**

In [17]:
scaler2 = StandardScaler(inputCol="NumFeatures", outputCol="ScaledNumFeatures2", withStd=True)
scalerModel = scaler2.fit(df)
df = scalerModel.transform(df)

In [18]:
df.head(2)

[Row(Body="<p>I'd like to check if an uploaded file is an image file (e.g png, jpg, jpeg, gif, bmp) or another file. The problem is that I'm using Uploadify to upload the files, which changes the mime type and gives a 'text/octal' or something as the mime type, no matter which file type you upload.</p>\n\n<p>Is there a way to check if the uploaded file is an image apart from checking the file extension using PHP?</p>\n", Id=1, Tags='php image-processing file-upload upload mime-types', Title='How to check if an uploaded file is an image without mime type?', oneTag='php', words=['p', 'i', 'd', 'like', 'to', 'check', 'if', 'an', 'uploaded', 'file', 'is', 'an', 'image', 'file', 'e', 'g', 'png', 'jpg', 'jpeg', 'gif', 'bmp', 'or', 'another', 'file', 'the', 'problem', 'is', 'that', 'i', 'm', 'using', 'uploadify', 'to', 'upload', 'the', 'files', 'which', 'changes', 'the', 'mime', 'type', 'and', 'gives', 'a', 'text', 'octal', 'or', 'something', 'as', 'the', 'mime', 'type', 'no', 'matter', 'whic
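Unlike the Normalizer, the StandardScaler works per column and needs a fit step to learn the column statistics. A rough plain-Python sketch for a single column, assuming the sample standard deviation (n - 1 in the denominator) and using the hypothetical helper `standard_scale`:

```python
import statistics


def standard_scale(values, with_mean=False, with_std=True):
    """Scale one column: optionally subtract the mean, optionally divide by the std."""
    mean = statistics.mean(values) if with_mean else 0.0
    # statistics.stdev is the sample standard deviation (divides by n - 1).
    std = statistics.stdev(values) if with_std else 1.0
    return [(v - mean) / std for v in values]


column = [2.0, 4.0, 6.0]  # made-up values: mean 4, sample std 2

scaled_only = standard_scale(column)                    # withStd=True only
centered = standard_scale(column, with_mean=True)       # withMean=True, withStd=True
print(scaled_only)
print(centered)
```

With `withStd=True` alone the values are only divided by the standard deviation, so they stay positive; adding `withMean=True` also centers them around zero.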

## **CountVectorizer**

In [19]:
from pyspark.ml.feature import CountVectorizer, IDF, StringIndexer

In [20]:
# find the term frequencies of the words
cv = CountVectorizer(inputCol="words", outputCol="TF", vocabSize=1000)
cvmodel = cv.fit(df)
df = cvmodel.transform(df)
df.take(1)

[Row(Body="<p>I'd like to check if an uploaded file is an image file (e.g png, jpg, jpeg, gif, bmp) or another file. The problem is that I'm using Uploadify to upload the files, which changes the mime type and gives a 'text/octal' or something as the mime type, no matter which file type you upload.</p>\n\n<p>Is there a way to check if the uploaded file is an image apart from checking the file extension using PHP?</p>\n", Id=1, Tags='php image-processing file-upload upload mime-types', Title='How to check if an uploaded file is an image without mime type?', oneTag='php', words=['p', 'i', 'd', 'like', 'to', 'check', 'if', 'an', 'uploaded', 'file', 'is', 'an', 'image', 'file', 'e', 'g', 'png', 'jpg', 'jpeg', 'gif', 'bmp', 'or', 'another', 'file', 'the', 'problem', 'is', 'that', 'i', 'm', 'using', 'uploadify', 'to', 'upload', 'the', 'files', 'which', 'changes', 'the', 'mime', 'type', 'and', 'gives', 'a', 'text', 'octal', 'or', 'something', 'as', 'the', 'mime', 'type', 'no', 'matter', 'whic
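Conceptually, fitting a CountVectorizer picks the most frequent terms in the corpus as the vocabulary, and transforming maps each document to sparse counts over that vocabulary. The following plain-Python sketch illustrates this on two tiny made-up documents; the function names are assumptions and details such as `minDF` are ignored.

```python
from collections import Counter


def fit_vocabulary(documents, vocab_size):
    """Keep the vocab_size most frequent terms across all documents."""
    totals = Counter(token for doc in documents for token in doc)
    return [term for term, _ in totals.most_common(vocab_size)]


def term_frequencies(doc, vocabulary):
    """Map a tokenized document to {vocabulary index: count}."""
    index = {term: i for i, term in enumerate(vocabulary)}
    counts = Counter(index[token] for token in doc if token in index)
    return dict(sorted(counts.items()))


docs = [["a", "b", "a"], ["a", "b", "c"]]
vocabulary = fit_vocabulary(docs, vocab_size=2)
tf_first = term_frequencies(docs[0], vocabulary)
tf_second = term_frequencies(docs[1], vocabulary)
print(vocabulary, tf_first, tf_second)
```

Terms outside the vocabulary (here `c`) are silently dropped, which is what `vocabSize=1000` does to rare words in the real data.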

In [21]:
cvmodel.vocabulary

['p',
 'the',
 'i',
 'to',
 'code',
 'a',
 'gt',
 'lt',
 'is',
 'and',
 'pre',
 'in',
 'this',
 'of',
 'it',
 'that',
 'for',
 '0',
 '1',
 'have',
 'my',
 'if',
 'on',
 'but',
 'with',
 'can',
 'not',
 'be',
 'as',
 't',
 'li',
 'from',
 '2',
 's',
 'http',
 'an',
 'm',
 'strong',
 'new',
 'how',
 'do',
 'com',
 'so',
 'or',
 'at',
 'using',
 'when',
 'am',
 'like',
 'class',
 'id',
 'there',
 'get',
 'are',
 'name',
 'what',
 'any',
 'file',
 'string',
 'data',
 'all',
 'which',
 'want',
 'would',
 'amp',
 'use',
 'java',
 'function',
 'public',
 'some',
 '3',
 'text',
 'error',
 'android',
 'value',
 'c',
 'x',
 'href',
 'you',
 'one',
 'by',
 'user',
 'me',
 'server',
 'type',
 'here',
 'way',
 'return',
 'int',
 'will',
 'div',
 'need',
 'then',
 'set',
 'e',
 'system',
 'has',
 'problem',
 'out',
 'php',
 'no',
 'just',
 '4',
 'org',
 'know',
 'html',
 'only',
 'where',
 'page',
 'application',
 '5',
 'thanks',
 'var',
 'br',
 'we',
 'd',
 'should',
 'does',
 'add',
 'n',
 'true',

In [22]:
# show the last 10 terms in the vocabulary
cvmodel.vocabulary[-10:]

['customer',
 'desktop',
 'buttons',
 'previous',
 'math',
 'master',
 '000',
 'blog',
 'comes',
 'wordpress']

## **Inverse Document Frequency**

In [23]:
idf = IDF(inputCol="TF", outputCol="TFIDF")
idfModel = idf.fit(df)
df = idfModel.transform(df)
df.head()

Row(Body="<p>I'd like to check if an uploaded file is an image file (e.g png, jpg, jpeg, gif, bmp) or another file. The problem is that I'm using Uploadify to upload the files, which changes the mime type and gives a 'text/octal' or something as the mime type, no matter which file type you upload.</p>\n\n<p>Is there a way to check if the uploaded file is an image apart from checking the file extension using PHP?</p>\n", Id=1, Tags='php image-processing file-upload upload mime-types', Title='How to check if an uploaded file is an image without mime type?', oneTag='php', words=['p', 'i', 'd', 'like', 'to', 'check', 'if', 'an', 'uploaded', 'file', 'is', 'an', 'image', 'file', 'e', 'g', 'png', 'jpg', 'jpeg', 'gif', 'bmp', 'or', 'another', 'file', 'the', 'problem', 'is', 'that', 'i', 'm', 'using', 'uploadify', 'to', 'upload', 'the', 'files', 'which', 'changes', 'the', 'mime', 'type', 'and', 'gives', 'a', 'text', 'octal', 'or', 'something', 'as', 'the', 'mime', 'type', 'no', 'matter', 'which
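The IDF stage down-weights terms that occur in almost every document. Spark's documentation gives the smoothed formula idf(t) = log((m + 1) / (df(t) + 1)), where m is the number of documents and df(t) the number of documents containing term t. A plain-Python sketch on made-up sparse counts (the helper names and the toy data are assumptions):

```python
import math


def idf_weights(doc_term_counts, vocab_size):
    """Compute log((m + 1) / (df + 1)) for every vocabulary index."""
    m = len(doc_term_counts)
    doc_freq = [0] * vocab_size
    for counts in doc_term_counts:
        for idx in counts:
            doc_freq[idx] += 1
    return [math.log((m + 1) / (df + 1)) for df in doc_freq]


def tfidf(counts, weights):
    """Multiply each term count by the IDF weight of that term."""
    return {idx: count * weights[idx] for idx, count in counts.items()}


# Three toy documents as {vocabulary index: count}; term 0 occurs in all of them.
toy_docs = [{0: 2, 1: 1}, {0: 1}, {0: 3, 2: 1}]
weights = idf_weights(toy_docs, vocab_size=3)
first_doc = tfidf(toy_docs[0], weights)
print(weights, first_doc)
```

A term present in every document gets weight 0, which is consistent with the very small TF-IDF value at index 0 (the token `p`) in the feature vector printed in the Logistic Regression section.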

## **StringIndexer**

In [24]:
indexer = StringIndexer(inputCol="oneTag", outputCol="label")
df = indexer.fit(df).transform(df)

In [25]:
df.head()

Row(Body="<p>I'd like to check if an uploaded file is an image file (e.g png, jpg, jpeg, gif, bmp) or another file. The problem is that I'm using Uploadify to upload the files, which changes the mime type and gives a 'text/octal' or something as the mime type, no matter which file type you upload.</p>\n\n<p>Is there a way to check if the uploaded file is an image apart from checking the file extension using PHP?</p>\n", Id=1, Tags='php image-processing file-upload upload mime-types', Title='How to check if an uploaded file is an image without mime type?', oneTag='php', words=['p', 'i', 'd', 'like', 'to', 'check', 'if', 'an', 'uploaded', 'file', 'is', 'an', 'image', 'file', 'e', 'g', 'png', 'jpg', 'jpeg', 'gif', 'bmp', 'or', 'another', 'file', 'the', 'problem', 'is', 'that', 'i', 'm', 'using', 'uploadify', 'to', 'upload', 'the', 'files', 'which', 'changes', 'the', 'mime', 'type', 'and', 'gives', 'a', 'text', 'octal', 'or', 'something', 'as', 'the', 'mime', 'type', 'no', 'matter', 'which
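By default the StringIndexer assigns indices by descending label frequency, so the most common tag becomes 0.0. A small plain-Python sketch of that ordering (the helper `fit_string_index` and the tag list are made up; ties are broken alphabetically here, which is an assumption):

```python
from collections import Counter


def fit_string_index(labels):
    """Map each distinct label to a float index, most frequent label first."""
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


tags = ["php", "java", "php", "git", "java", "php"]
mapping = fit_string_index(tags)
indexed = [mapping[tag] for tag in tags]
print(mapping)
print(indexed)
```

The resulting numbers are category codes, not magnitudes: label 3.0 is not "three times" label 1.0.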

In [26]:
df.columns

['Body',
 'Id',
 'Tags',
 'Title',
 'oneTag',
 'words',
 'BodyLength',
 'NumParagraphs',
 'NumLinks',
 'NumFeatures',
 'ScaledNumFeatures',
 'ScaledNumFeatures2',
 'TF',
 'TFIDF',
 'label']

## **Question 1**
Select the question with Id = 1112. How many words does its body contain (check the BodyLength column)?

In [27]:
df.where(df.Id == 1112).show()

+--------------------+----+--------------------+--------------------+------+--------------------+----------+-------------+--------+--------------+--------------------+--------------------+--------------------+--------------------+-----+
|                Body|  Id|                Tags|               Title|oneTag|               words|BodyLength|NumParagraphs|NumLinks|   NumFeatures|   ScaledNumFeatures|  ScaledNumFeatures2|                  TF|               TFIDF|label|
+--------------------+----+--------------------+--------------------+------+--------------------+----------+-------------+--------+--------------+--------------------+--------------------+--------------------+--------------------+-----+
|<p>I submitted my...|1112|iphone app-store ...|iPhone app releas...|iphone|[p, i, submitted,...|        63|            1|       0|[63.0,1.0,0.0]|[0.99987404748359...|[0.32825169441613...|(1000,[0,1,2,3,8,...|(1000,[0,1,2,3,8,...|  7.0|
+--------------------+----+--------------------+----

## **Question 2**
Create a new column that concatenates the question title and body. Apply the same functions we used before to compute the number of words in this combined column. What's the value in this new column for Id = 5123?

In [28]:
from pyspark.sql.functions import col, lit, concat

df = df.withColumn("TitleBody", concat(col("Title"), lit(' '), col("Body")))

In [29]:
regexTokenizer2 = RegexTokenizer(inputCol="TitleBody", outputCol="words2", pattern="\\W")
df = regexTokenizer2.transform(df)
df = df.withColumn("TitleBodyLength", body_length(df.words2))

In [30]:
df.where(df.Id == 5123).show()

+--------------------+----+----------+--------------------+------+--------------------+----------+-------------+--------+---------------+--------------------+--------------------+--------------------+--------------------+-----+--------------------+--------------------+---------------+
|                Body|  Id|      Tags|               Title|oneTag|               words|BodyLength|NumParagraphs|NumLinks|    NumFeatures|   ScaledNumFeatures|  ScaledNumFeatures2|                  TF|               TFIDF|label|           TitleBody|              words2|TitleBodyLength|
+--------------------+----+----------+--------------------+------+--------------------+----------+-------------+--------+---------------+--------------------+--------------------+--------------------+--------------------+-----+--------------------+--------------------+---------------+
|<p>Here's an inte...|5123|git branch|Git branch experi...|   git|[p, here, s, an, ...|       132|            3|       0|[132.0,3.0,0.0]|[0.99

## **Question 3**
Using the Normalizer, what is the normalized value for the question with Id = 512?

In [31]:
assembler2 = VectorAssembler(inputCols=["TitleBodyLength"], outputCol="TitleBodyVec")
df = assembler2.transform(df)

In [32]:
scaler = Normalizer(inputCol="TitleBodyVec", outputCol="TitleBodyVecNorm")
df = scaler.transform(df)

In [33]:
df.where(df.Id == 512).show()

+--------------------+---+--------------------+--------------------+------+--------------------+----------+-------------+--------+--------------+--------------------+--------------------+--------------------+--------------------+-----+--------------------+--------------------+---------------+------------+----------------+
|                Body| Id|                Tags|               Title|oneTag|               words|BodyLength|NumParagraphs|NumLinks|   NumFeatures|   ScaledNumFeatures|  ScaledNumFeatures2|                  TF|               TFIDF|label|           TitleBody|              words2|TitleBodyLength|TitleBodyVec|TitleBodyVecNorm|
+--------------------+---+--------------------+--------------------+------+--------------------+----------+-------------+--------+--------------+--------------------+--------------------+--------------------+--------------------+-----+--------------------+--------------------+---------------+------------+----------------+
|<p>I'd like to ha...|512|ja

## **Question 4**
Using the StandardScaler (centering on the mean and scaling by the standard deviation), what is the scaled value for the question with Id = 512?

In [34]:
scaler3 = StandardScaler(inputCol="TitleBodyVec", outputCol="TitleBodyVecNorm3", withMean=True, withStd=True)
scalerModel2 = scaler3.fit(df)
df = scalerModel2.transform(df)

In [35]:
df.where(df.Id == 512).show()

+--------------------+---+--------------------+--------------------+------+--------------------+----------+-------------+--------+--------------+--------------------+--------------------+--------------------+--------------------+-----+--------------------+--------------------+---------------+------------+----------------+--------------------+
|                Body| Id|                Tags|               Title|oneTag|               words|BodyLength|NumParagraphs|NumLinks|   NumFeatures|   ScaledNumFeatures|  ScaledNumFeatures2|                  TF|               TFIDF|label|           TitleBody|              words2|TitleBodyLength|TitleBodyVec|TitleBodyVecNorm|   TitleBodyVecNorm3|
+--------------------+---+--------------------+--------------------+------+--------------------+----------+-------------+--------+--------------+--------------------+--------------------+--------------------+--------------------+-----+--------------------+--------------------+---------------+------------+----

## **Question 5**
Using the MinMaxScaler, what is the scaled value for the question with Id = 512?

In [36]:
from pyspark.ml.feature import MinMaxScaler

scaler_q3 = MinMaxScaler(inputCol="TitleBodyVec", outputCol="TitleBodyVecMinMaxScaler")
scalerModel_q3 = scaler_q3.fit(df)
df = scalerModel_q3.transform(df)

In [37]:
df.where(df.Id == 512).show()

+--------------------+---+--------------------+--------------------+------+--------------------+----------+-------------+--------+--------------+--------------------+--------------------+--------------------+--------------------+-----+--------------------+--------------------+---------------+------------+----------------+--------------------+------------------------+
|                Body| Id|                Tags|               Title|oneTag|               words|BodyLength|NumParagraphs|NumLinks|   NumFeatures|   ScaledNumFeatures|  ScaledNumFeatures2|                  TF|               TFIDF|label|           TitleBody|              words2|TitleBodyLength|TitleBodyVec|TitleBodyVecNorm|   TitleBodyVecNorm3|TitleBodyVecMinMaxScaler|
+--------------------+---+--------------------+--------------------+------+--------------------+----------+-------------+--------+--------------+--------------------+--------------------+--------------------+--------------------+-----+--------------------+----
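The MinMaxScaler maps each column linearly onto a target range, [0, 1] by default, using the column minimum and maximum learned during fit. A plain-Python sketch of the formula, using the minimum (10) and maximum (7532) of TitleBodyLength that are reported further down in this notebook; the helper name `min_max_scale` is an assumption:

```python
def min_max_scale(value, col_min, col_max, new_min=0.0, new_max=1.0):
    """Linearly rescale value from [col_min, col_max] to [new_min, new_max]."""
    if col_max == col_min:
        # Constant column: fall back to the middle of the target range.
        return 0.5 * (new_min + new_max)
    return (value - col_min) / (col_max - col_min) * (new_max - new_min) + new_min


shortest = min_max_scale(10, 10, 7532)
longest = min_max_scale(7532, 10, 7532)
midpoint = min_max_scale(3771, 10, 7532)
print(shortest, longest, midpoint)
```

Because the maximum is an extreme outlier compared with the mean length of about 180 words, most questions end up squeezed close to 0 after this scaling.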

# **PCA**

PCA works well as long as the number of input columns is not too high. Here the input is the 1000-dimensional TF-IDF vector, which is reduced to k=100 components.

In [38]:
from pyspark.ml.feature import PCA

pca = PCA(k=100, inputCol="TFIDF", outputCol="pcaTFIDF")
model = pca.fit(df)
df = model.transform(df)

In [39]:
df.head()

Row(Body="<p>I'd like to check if an uploaded file is an image file (e.g png, jpg, jpeg, gif, bmp) or another file. The problem is that I'm using Uploadify to upload the files, which changes the mime type and gives a 'text/octal' or something as the mime type, no matter which file type you upload.</p>\n\n<p>Is there a way to check if the uploaded file is an image apart from checking the file extension using PHP?</p>\n", Id=1, Tags='php image-processing file-upload upload mime-types', Title='How to check if an uploaded file is an image without mime type?', oneTag='php', words=['p', 'i', 'd', 'like', 'to', 'check', 'if', 'an', 'uploaded', 'file', 'is', 'an', 'image', 'file', 'e', 'g', 'png', 'jpg', 'jpeg', 'gif', 'bmp', 'or', 'another', 'file', 'the', 'problem', 'is', 'that', 'i', 'm', 'using', 'uploadify', 'to', 'upload', 'the', 'files', 'which', 'changes', 'the', 'mime', 'type', 'and', 'gives', 'a', 'text', 'octal', 'or', 'something', 'as', 'the', 'mime', 'type', 'no', 'matter', 'which

# **Linear Regression**

Linear regression needs a continuous variable as the target output, so let's build a model that predicts the number of tags.


In [40]:
number_of_tags = udf(lambda x: len(x.split(" ")), IntegerType())
df = df.withColumn("NumTags", number_of_tags(df.Tags))

In [41]:
df.head()

Row(Body="<p>I'd like to check if an uploaded file is an image file (e.g png, jpg, jpeg, gif, bmp) or another file. The problem is that I'm using Uploadify to upload the files, which changes the mime type and gives a 'text/octal' or something as the mime type, no matter which file type you upload.</p>\n\n<p>Is there a way to check if the uploaded file is an image apart from checking the file extension using PHP?</p>\n", Id=1, Tags='php image-processing file-upload upload mime-types', Title='How to check if an uploaded file is an image without mime type?', oneTag='php', words=['p', 'i', 'd', 'like', 'to', 'check', 'if', 'an', 'uploaded', 'file', 'is', 'an', 'image', 'file', 'e', 'g', 'png', 'jpg', 'jpeg', 'gif', 'bmp', 'or', 'another', 'file', 'the', 'problem', 'is', 'that', 'i', 'm', 'using', 'uploadify', 'to', 'upload', 'the', 'files', 'which', 'changes', 'the', 'mime', 'type', 'and', 'gives', 'a', 'text', 'octal', 'or', 'something', 'as', 'the', 'mime', 'type', 'no', 'matter', 'which

In [42]:
df.groupby("NumTags").count().orderBy("NumTags").show()

+-------+-----+
|NumTags|count|
+-------+-----+
|      1|13858|
|      2|26540|
|      3|28769|
|      4|19108|
|      5|11725|
+-------+-----+



In [43]:
from pyspark.sql.functions import avg

df.groupby("NumTags").agg(avg(col("BodyLength"))).orderBy("NumTags").show()

+-------+------------------+
|NumTags|   avg(BodyLength)|
+-------+------------------+
|      1|135.41311877615817|
|      2|153.82456669178598|
|      3|172.73704334526747|
|      4|192.67050450073268|
|      5|218.54251599147122|
+-------+------------------+



In [44]:
assembler = VectorAssembler(inputCols=["BodyLength"], outputCol="LengthFeature")
df = assembler.transform(df)

In [45]:
df.head()

Row(Body="<p>I'd like to check if an uploaded file is an image file (e.g png, jpg, jpeg, gif, bmp) or another file. The problem is that I'm using Uploadify to upload the files, which changes the mime type and gives a 'text/octal' or something as the mime type, no matter which file type you upload.</p>\n\n<p>Is there a way to check if the uploaded file is an image apart from checking the file extension using PHP?</p>\n", Id=1, Tags='php image-processing file-upload upload mime-types', Title='How to check if an uploaded file is an image without mime type?', oneTag='php', words=['p', 'i', 'd', 'like', 'to', 'check', 'if', 'an', 'uploaded', 'file', 'is', 'an', 'image', 'file', 'e', 'g', 'png', 'jpg', 'jpeg', 'gif', 'bmp', 'or', 'another', 'file', 'the', 'problem', 'is', 'that', 'i', 'm', 'using', 'uploadify', 'to', 'upload', 'the', 'files', 'which', 'changes', 'the', 'mime', 'type', 'and', 'gives', 'a', 'text', 'octal', 'or', 'something', 'as', 'the', 'mime', 'type', 'no', 'matter', 'which

In [46]:
from pyspark.ml.regression import LinearRegression

lr = LinearRegression(maxIter=5, regParam=0.0, fitIntercept=False, solver="normal")

In [47]:
data = df.select(col("NumTags").alias("label"), col("LengthFeature").alias("features"))
data.head()

Row(label=5, features=DenseVector([83.0]))

In [48]:
lrModel = lr.fit(data)

In [49]:
lrModel.coefficients

DenseVector([0.0079])

In [50]:
lrModel.intercept

0.0

In [51]:
lrModelSummary = lrModel.summary

In [52]:
lrModelSummary.r2

0.42481762576079996

An r^2 of about 0.42 is not good: the model needs feature engineering and parameter tuning.
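For a single feature and `fitIntercept=False`, the least-squares fit has a closed form: slope = sum(x*y) / sum(x*x). The sketch below, on made-up numbers, also contrasts two definitions of r^2. As far as I know, Spark's summary uses the through-origin form when no intercept is fitted; treat that as an assumption to check against the Spark documentation, not as something this notebook establishes. All names in the code are illustrative.

```python
def fit_through_origin(xs, ys):
    """Least-squares slope for the model y = slope * x (no intercept)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


def r_squared(xs, ys, slope, through_origin):
    """1 - SS_res / SS_tot, with SS_tot taken around 0 or around the mean of y."""
    ss_res = sum((y - slope * x) ** 2 for x, y in zip(xs, ys))
    if through_origin:
        ss_tot = sum(y * y for y in ys)
    else:
        mean_y = sum(ys) / len(ys)
        ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


xs = [1.0, 2.0, 3.0]  # made-up feature values
ys = [1.0, 2.0, 2.0]  # made-up targets

slope = fit_through_origin(xs, ys)
r2_origin = r_squared(xs, ys, slope, through_origin=True)
r2_centered = r_squared(xs, ys, slope, through_origin=False)
print(slope, r2_origin, r2_centered)
```

The through-origin value is noticeably higher than the centered one on the same fit, so r^2 values from models with and without an intercept are not directly comparable.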

## **Question 1**

Build a linear regression model using the length of the combined Title + Body fields. What is the value of r^2 when fitting a model with maxIter=5, regParam=0.0, fitIntercept=False, solver="normal"?

In [53]:
df.groupby("NumTags").agg(avg(col("TitleBodyLength"))).orderBy("NumTags").show()

+-------+--------------------+
|NumTags|avg(TitleBodyLength)|
+-------+--------------------+
|      1|  143.68776158175783|
|      2|   162.1539186134137|
|      3|  181.26021064340088|
|      4|  201.46530249110322|
|      5|  227.64375266524522|
+-------+--------------------+



In [54]:
data = df.select(col("NumTags").alias("label"), col("TitleBodyVec").alias("features"))
data.head()

Row(label=5, features=DenseVector([96.0]))

In [55]:
lrModel_q1 = lr.fit(data)

In [56]:
lrModel_q1.summary.r2

0.4455149596308462

# **Logistic Regression**

Predict a tag for each question. The oneTag column holds the most common tag a question has. This is a multiclass classification problem, but you can simplify it to a single-target classification.

In [57]:
data2 = df.select(col("label").alias("label"), col("TFIDF").alias("features"))
data2.head()

Row(label=3.0, features=SparseVector(1000, {0: 0.0026, 1: 0.7515, 2: 0.1374, 3: 0.3184, 5: 0.3823, 8: 1.0754, 9: 0.3344, 15: 0.5899, 21: 1.8551, 28: 1.1263, 31: 1.1113, 35: 3.3134, 36: 1.2545, 43: 2.3741, 45: 2.3753, 48: 1.2254, 51: 1.1879, 57: 11.0264, 61: 2.8957, 71: 2.1945, 78: 1.6947, 84: 6.5898, 86: 1.6136, 94: 2.3569, 97: 1.8218, 99: 2.6292, 100: 1.9206, 115: 2.3592, 147: 5.4841, 152: 2.1116, 169: 2.6328, 241: 2.5745, 283: 3.2325, 306: 3.2668, 350: 6.2367, 490: 3.8893, 578: 3.6182, 759: 3.7771, 832: 8.8964}))

In [58]:
from pyspark.ml.classification import LogisticRegression

lr2 = LogisticRegression(maxIter=10, regParam=0.0)

In [59]:
lrModel2 = lr2.fit(data2)

In [60]:
lrModel2.coefficientMatrix

DenseMatrix(301, 1000, [-3.366, -0.0027, 0.0564, 0.0879, -0.0873, 0.0537, 0.0028, 0.0011, ..., -0.0012, -0.002, -0.0007, -0.0005, -0.0001, -0.0007, -0.0033, -0.0011], 1)

In [61]:
lrModel2.interceptVector

DenseVector([5.0349, 3.3699, 4.2743, 4.0637, 4.3092, 4.1958, 3.4761, 3.7129, 3.1161, 2.8859, 3.2474, 2.9347, 3.1001, 2.8214, 2.7538, 3.0578, 3.2807, 3.0748, 3.0897, 2.7948, 2.6887, 3.0541, 2.9388, 2.1495, 3.1079, 2.0982, 2.6243, 1.7693, 2.122, 1.8288, 2.2967, 2.2808, 1.5032, 2.3648, 2.6052, 1.3411, 2.2037, 0.8171, 1.7048, 1.877, 1.5355, 1.322, 1.796, 0.885, 1.0449, 1.3353, 1.1925, 0.9448, 0.8333, 0.9393, 0.8649, 1.0534, 0.4789, -0.091, 0.5386, 0.7571, 0.9612, 1.8785, 1.3878, 2.2508, -0.3393, 1.0961, 0.8168, 0.5892, 1.659, 0.802, 0.5494, 0.663, 0.1311, 0.203, -0.1665, 0.6357, 0.9036, 0.6988, -0.0894, 0.4273, 1.3243, 0.1655, 0.7225, 1.1429, 0.0385, 0.3549, 0.0286, -0.2136, 0.4241, 0.3773, -0.5959, -0.0354, -0.3957, -0.1694, 0.0864, 0.6522, -0.1645, 0.1155, -0.1603, 0.8341, 0.1882, 0.2458, 0.3807, -0.6794, -0.4357, -0.3846, 0.2361, 0.2743, -0.4891, -0.1413, 0.8849, -0.0908, 0.2604, -0.0706, -0.367, 0.5003, 0.161, 0.2303, -0.3059, 0.343, 0.4042, 0.1672, -0.416, -0.0833, -0.0584, -0.4839, -
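The coefficient matrix has one row per class (301 rows of 1000 weights) and the intercept vector one entry per class. In multinomial logistic regression, a prediction is obtained by computing one linear score per class, turning the scores into probabilities with a softmax, and picking the largest. A plain-Python sketch with tiny made-up weights (three classes, two features; all names and numbers are assumptions):

```python
import math


def softmax(scores):
    """Convert raw scores to probabilities that sum to 1."""
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict(coefficient_rows, intercepts, features):
    """Return (predicted class index, class probabilities)."""
    scores = [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(coefficient_rows, intercepts)
    ]
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs


rows = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
intercepts = [0.0, 0.0, 0.0]
label, probabilities = predict(rows, intercepts, [0.2, 0.9])
print(label, probabilities)
```

The large positive intercepts for the first classes in the output above fit this picture: frequent tags start with a higher baseline score before any word is taken into account.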

In [62]:
lrModel2.summary.accuracy

0.39128

# **Unsupervised Learning**

We might want to take a look at the distribution of the Title+Body length feature we used before and, instead of using the raw number of words, create categories based on this length: short, longer, ..., super long.

## **Question 1**

How many times greater is the Description Length (here, the TitleBodyLength column) of the longest question than the Description Length of the shortest question (rounded to the nearest whole number)?

Tip: Don't forget to import Spark SQL's aggregate functions that can operate on DataFrame columns.

In [63]:
# Note: importing min and max this way shadows Python's built-ins for the rest of the notebook
from pyspark.sql.functions import min, max, stddev

df.agg(min("TitleBodyLength")).show()

+--------------------+
|min(TitleBodyLength)|
+--------------------+
|                  10|
+--------------------+



In [64]:
df.agg(max("TitleBodyLength")).show()

+--------------------+
|max(TitleBodyLength)|
+--------------------+
|                7532|
+--------------------+
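With the minimum and maximum in hand, the ratio asked for in Question 1 is a one-line calculation. The two numbers below are copied from the cell outputs above.

```python
shortest = 10    # min(TitleBodyLength) from the output above
longest = 7532   # max(TitleBodyLength) from the output above

ratio = round(longest / shortest)
print(ratio)  # 753
```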



## **Question 2**

What is the mean and standard deviation of the Description length?

In [65]:
df.agg(avg("TitleBodyLength"), stddev("TitleBodyLength")).show()

+--------------------+----------------------------+
|avg(TitleBodyLength)|stddev_samp(TitleBodyLength)|
+--------------------+----------------------------+
|           180.28187|          192.10819533505023|
+--------------------+----------------------------+



## **Question 3**

Let's use K-means to create 5 clusters of Description Lengths. Set the random seed to 42 and fit a 5-cluster K-means model on the Description Length column (you can use KMeans().setParams(...)). What length is the center of the cluster representing the longest questions?

In [66]:
from pyspark.ml.clustering import KMeans

kmeans = KMeans().setParams(featuresCol="TitleBodyVec", predictionCol="TitleBodyGroup", k=5, seed=42)
model = kmeans.fit(df)
df = model.transform(df)

In [67]:
from pyspark.sql.functions import count

df.groupby("TitleBodyGroup").agg(avg(col("TitleBodyLength")), avg(col("NumTags")), count(col("TitleBodyLength"))).orderBy("avg(TitleBodyLength)").show()

+--------------+--------------------+------------------+----------------------+
|TitleBodyGroup|avg(TitleBodyLength)|      avg(NumTags)|count(TitleBodyLength)|
+--------------+--------------------+------------------+----------------------+
|             0|   96.02297592997812|2.7428884026258205|                 63066|
|             4|  238.22969197457567|3.0864357058042886|                 28634|
|             1|  492.97406340057637|3.2337175792507207|                  6940|
|             3|   1064.769101595298| 3.292191435768262|                  1191|
|             2|  2731.0828402366865|  3.42603550295858|                   169|
+--------------+--------------------+------------------+----------------------+
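The table above answers the question through the per-cluster averages: in K-means a cluster center is the mean of the points assigned to it, so once the algorithm has converged the averages should be close to what the fitted model's `clusterCenters()` returns. The following plain-Python sketch of Lloyd's algorithm on one-dimensional data shows that relationship. It uses fixed starting centers and made-up lengths, whereas Spark uses its own seeded initialization, so it is an illustration and not a reimplementation.

```python
def kmeans_1d(values, initial_centers, max_iter=20):
    """Alternate between assigning points to the nearest center and re-averaging."""
    centers = list(initial_centers)
    for _ in range(max_iter):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # An empty cluster keeps its previous center.
        new_centers = [
            sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break  # converged: assignments no longer change the centers
        centers = new_centers
    return centers


lengths = [1.0, 2.0, 3.0, 100.0, 101.0, 102.0]  # made-up question lengths
centers = kmeans_1d(lengths, initial_centers=[1.0, 100.0])
longest_cluster_center = max(centers)
print(centers, longest_cluster_center)
```

In the real data the cluster of longest questions is tiny (169 questions), which shows how strongly a few very long posts pull one center away from the rest.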



# **Pipelines**

In [68]:
print(type(lrModel2))

<class 'pyspark.ml.classification.LogisticRegressionModel'>


It is a transformer

In [69]:
print(type(lr2))

<class 'pyspark.ml.classification.LogisticRegression'>


It is an estimator, but the type name does not really say that.

Let's build a pipeline using the same example as in Logistic Regression. A pipeline chains Transformers and Estimators, and the steps need to form a DAG.

In [70]:
df2 = spark.read.json(stack_overflow_data)
df2.persist()

DataFrame[Body: string, Id: bigint, Tags: string, Title: string, oneTag: string]

In [71]:
from pyspark.ml import Pipeline

# turn the body into a list of words
regexTokenizer = RegexTokenizer(inputCol="Body", outputCol="words", pattern="\\W")
# turn the words into term frequencies
cv = CountVectorizer(inputCol="words", outputCol="TF", vocabSize=10000)
# turn the term frequencies into TF-IDF
idf = IDF(inputCol="TF", outputCol="features")
# turn string tags into numeric labels; this stage does not depend on the others,
# so it can sit anywhere in the Pipeline before the classifier
indexer = StringIndexer(inputCol="oneTag", outputCol="label")

lr = LogisticRegression(maxIter=10, regParam=0.0, elasticNetParam=0)

pipeline = Pipeline(stages=[regexTokenizer, cv, idf, indexer, lr])
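Before fitting, it may help to see what fitting a pipeline means conceptually: each stage is visited in order, estimators are fitted on the data produced so far and replaced by the resulting model, and the data is transformed before moving on. The toy classes below are hypothetical stand-ins written in plain Python; they only mimic the `fit`/`transform` protocol and are not Spark classes.

```python
class Lowercase:
    """Transformer: has transform() only."""

    def transform(self, rows):
        return [row.lower() for row in rows]


class VocabularyEstimator:
    """Estimator: fit() learns something from the data and returns a model."""

    def fit(self, rows):
        vocab = sorted({word for row in rows for word in row.split()})
        return VocabularyModel(vocab)


class VocabularyModel:
    """The fitted model is itself a transformer."""

    def __init__(self, vocab):
        self.index = {word: i for i, word in enumerate(vocab)}

    def transform(self, rows):
        return [[self.index[w] for w in row.split() if w in self.index] for row in rows]


def fit_pipeline(stages, rows):
    """Fit estimators in order; this sketch transforms after every stage for simplicity."""
    fitted = []
    for stage in stages:
        model = stage.fit(rows) if hasattr(stage, "fit") else stage
        rows = model.transform(rows)
        fitted.append(model)
    return fitted


def transform_pipeline(fitted, rows):
    """Apply every fitted stage in order."""
    for model in fitted:
        rows = model.transform(rows)
    return rows


fitted_stages = fit_pipeline([Lowercase(), VocabularyEstimator()], ["Spark ML", "ML pipeline"])
encoded = transform_pipeline(fitted_stages, ["Spark pipeline", "Hello ML"])
print(encoded)
```

The benefit of bundling the stages is that new data goes through exactly the same fitted steps, with the vocabulary and other statistics learned once during fit.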

In [None]:
plrModel = pipeline.fit(df2)

In [None]:
df3 = plrModel.transform(df2)

In [None]:
df3.head()

In [None]:
df3.filter(df3.label == df3.prediction).count()