# Linear Regression Quiz
Use this Jupyter notebook to work out the answer to the quiz from the previous section. An answer key follows in the next part of the lesson.

In [40]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import concat, col, lit, udf
from pyspark.ml.feature import RegexTokenizer, VectorAssembler
from pyspark.ml.regression import LinearRegression
from pyspark.sql.types import IntegerType
# TODOS: 
# 1) import any other libraries you might need
# 2) run the cells below to read the dataset and extract description length features
# 3) write code to answer the quiz question

In [2]:
spark = SparkSession.builder \
    .master("local") \
    .appName("Creating Features") \
    .getOrCreate()

### Read Dataset

In [3]:
stack_overflow_data = 'Train_onetag_small.json'

In [4]:
df = spark.read.json(stack_overflow_data)
df.persist()

DataFrame[Body: string, Id: bigint, Tags: string, Title: string, oneTag: string]

### Build Description Length Features

In [11]:
df = df.withColumn("Desc", concat(col("Title"), lit(' '), col("Body")))

In [36]:
#df.take(1)

In [21]:
# UDF returning the number of elements in an array column (used below to count tokens)
body_length = udf(lambda x: len(x), IntegerType())

In [29]:
#df=df.drop('words')

In [30]:
# Split Desc on non-word characters, then store the token count as DescLength
regexTokenizer = RegexTokenizer(inputCol="Desc", outputCol="words", pattern="\\W")
df = regexTokenizer.transform(df)
df = df.withColumn("DescLength", body_length(df.words))
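
To make the `DescLength` feature concrete, the sketch below approximates the tokenizing step in plain Python. It assumes `RegexTokenizer` is running with its default settings (the pattern is treated as a separator, text is lowercased, and empty tokens are dropped). The helper name `approx_tokens` is hypothetical, and Python's regex engine may not match Spark's in every edge case, so treat this as an illustration rather than a replacement for the cell above.

```python
import re

def approx_tokens(text):
    """Rough stand-in for RegexTokenizer(pattern="\\W") with default settings."""
    # Lowercase, split on every non-word character, and drop empty pieces.
    pieces = re.split(r"\W", text.lower(), flags=re.ASCII)
    return [p for p in pieces if p]

# Title of the first question in the dataset (shown later in df.head()).
title = "How to check if an uploaded file is an image without mime type?"
tokens = approx_tokens(title)
desc_length = len(tokens)  # same idea as body_length applied to the words column
print(tokens)
print(desc_length)
```

Note that the feature counts tokens, not characters, so HTML tag names in `Body` such as `p` are counted as words too.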

In [34]:
assembler = VectorAssembler(inputCols=["DescLength"], outputCol="DescVec")
df = assembler.transform(df)

In [80]:
# Count the space-separated tags attached to each question
numtag = udf(lambda x: len(x.split(" ")), IntegerType())
df = df.withColumn("NumTags", numtag(df.Tags))
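
As a quick sanity check, the same counting logic can be run in plain Python on the `Tags` string of the first question. The function name `count_tags` is hypothetical and is used here only for illustration.

```python
def count_tags(tags):
    """Same logic as the numtag lambda: split on single spaces and count."""
    return len(tags.split(" "))

# Tags value of the first row in the dataset.
sample_tags = "php image-processing file-upload upload mime-types"
n = count_tags(sample_tags)
print(n)

# Caveat: split(" ") also counts the empty string between repeated spaces,
# whereas split() with no argument collapses runs of whitespace.
n_double_space = count_tags("php  upload")
n_robust = len("php  upload".split())
print(n_double_space, n_robust)
```

If the tag strings could contain repeated or trailing spaces, `split()` without an argument would be the safer choice. The notebook's version is fine as long as tags are separated by exactly one space.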

In [82]:
df.head()

Row(Body="<p>I'd like to check if an uploaded file is an image file (e.g png, jpg, jpeg, gif, bmp) or another file. The problem is that I'm using Uploadify to upload the files, which changes the mime type and gives a 'text/octal' or something as the mime type, no matter which file type you upload.</p>\n\n<p>Is there a way to check if the uploaded file is an image apart from checking the file extension using PHP?</p>\n", Id=1, Tags='php image-processing file-upload upload mime-types', Title='How to check if an uploaded file is an image without mime type?', oneTag='php', Desc="How to check if an uploaded file is an image without mime type? <p>I'd like to check if an uploaded file is an image file (e.g png, jpg, jpeg, gif, bmp) or another file. The problem is that I'm using Uploadify to upload the files, which changes the mime type and gives a 'text/octal' or something as the mime type, no matter which file type you upload.</p>\n\n<p>Is there a way to check if the uploaded file is an imag

# Question
Build a linear regression model that predicts the number of tags (`NumTags`) from the length of the combined Title + Body fields (`DescLength`, measured in tokens). What is the value of r^2 when fitting a model with `maxIter=5, regParam=0.0, fitIntercept=False, solver="normal"`?

In [91]:
df.createOrReplaceTempView("t1")
spark.sql('''
    SELECT NumTags, AVG(DescLength)
    FROM t1
    GROUP BY 1
    ORDER BY 2 ''').show()

+-------+------------------+
|NumTags|   avg(DescLength)|
+-------+------------------+
|      1|143.68776158175783|
|      2| 162.1539186134137|
|      3|181.26021064340088|
|      4|201.46530249110322|
|      5|227.64375266524522|
+-------+------------------+



In [83]:
data=df.select('DescVec','NumTags')
data.show()

+--------+-------+
| DescVec|NumTags|
+--------+-------+
|  [96.0]|      5|
|  [83.0]|      1|
|[3168.0]|      3|
| [124.0]|      3|
| [154.0]|      3|
|  [75.0]|      3|
| [121.0]|      1|
| [170.0]|      3|
| [107.0]|      3|
|  [74.0]|      5|
| [145.0]|      3|
| [148.0]|      3|
|  [24.0]|      3|
|  [49.0]|      3|
|  [48.0]|      1|
| [389.0]|      3|
| [380.0]|      2|
| [216.0]|      2|
| [123.0]|      4|
| [404.0]|      5|
+--------+-------+
only showing top 20 rows



In [84]:
# TODO: write your code to answer this question
lr = LinearRegression(featuresCol='DescVec', labelCol='NumTags',
                      maxIter=5, regParam=0.0, fitIntercept=False, solver='normal')

In [85]:
lrmodel=lr.fit(data)

In [86]:
lrmodel.summary.r2

0.4455149596308462
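
A note on reading this number: with `fitIntercept=False` the model is forced through the origin, and r^2 is then commonly measured against a baseline of zero rather than against the mean of the label. To my understanding Spark follows this convention for no-intercept fits, but that is an assumption worth checking against the Spark documentation. The plain-Python sketch below uses the first five rows shown by `data.show()` to illustrate how much the two definitions can differ. The function names are hypothetical, and five rows are far too few to reproduce the value above.

```python
def slope_through_origin(xs, ys):
    """Least-squares slope b for the no-intercept model y = b * x."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def r_squared(xs, ys, slope, through_origin):
    """1 - SS_res / SS_tot, with SS_tot optionally measured from zero."""
    ss_res = sum((y - slope * x) ** 2 for x, y in zip(xs, ys))
    if through_origin:
        ss_tot = sum(y * y for y in ys)
    else:
        mean_y = sum(ys) / len(ys)
        ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot

# First five (DescLength, NumTags) pairs from data.show() above.
xs = [96.0, 83.0, 3168.0, 124.0, 154.0]
ys = [5.0, 1.0, 3.0, 3.0, 3.0]

b = slope_through_origin(xs, ys)
r2_origin = r_squared(xs, ys, b, through_origin=True)
r2_centered = r_squared(xs, ys, b, through_origin=False)
print(r2_origin, r2_centered)
```

On this tiny sample the through-origin value is positive while the mean-centered value is negative, so an r^2 from a no-intercept fit should not be compared directly with one from a model that fits an intercept.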

In [87]:
lrmodel.coefficients

DenseVector([0.0079])

What error message do you see if you pass the integer column `DescLength` as the features column instead of the vector column `DescVec`?

In [93]:
test = df.select('DescLength', 'NumTags')
# DescLength is a plain integer column, not a vector column
lr2 = LinearRegression(featuresCol='DescLength', labelCol='NumTags',
                       maxIter=5, regParam=0.0, fitIntercept=False, solver='normal')

In [96]:
test.show()

+----------+-------+
|DescLength|NumTags|
+----------+-------+
|        96|      5|
|        83|      1|
|      3168|      3|
|       124|      3|
|       154|      3|
|        75|      3|
|       121|      1|
|       170|      3|
|       107|      3|
|        74|      5|
|       145|      3|
|       148|      3|
|        24|      3|
|        49|      3|
|        48|      1|
|       389|      3|
|       380|      2|
|       216|      2|
|       123|      4|
|       404|      5|
+----------+-------+
only showing top 20 rows



In [94]:
lr2.fit(test)

IllegalArgumentException: 'requirement failed: Column DescLength must be of type struct<type:tinyint,size:int,indices:array<int>,values:array<double>> but was actually int.'