# Model Tuning Quiz
Use this Jupyter notebook to find the answer to the quiz in the previous section. There is an answer key in the next part of the lesson.

In [55]:
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.mllib.evaluation import MulticlassMetrics
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import RegexTokenizer,CountVectorizer,IDF,StringIndexer
# TODOS: 
# 1) import any other libraries you might need
# 2) run the cells below to read dataset
# 3) follow the steps below to find the answer to the quiz question

In [2]:
spark = SparkSession.builder \
    .master("local") \
    .appName("Creating Features") \
    .getOrCreate()

In [3]:
stack_overflow_data = 'Train_onetag_small.json'

In [4]:
df = spark.read.json(stack_overflow_data)
df.persist()

DataFrame[Body: string, Id: bigint, Tags: string, Title: string, oneTag: string]

In [6]:
df.printSchema()

root
 |-- Body: string (nullable = true)
 |-- Id: long (nullable = true)
 |-- Tags: string (nullable = true)
 |-- Title: string (nullable = true)
 |-- oneTag: string (nullable = true)



In [13]:
df.createOrReplaceTempView('t1')
spark.sql('''
select distinct oneTag
from t1''').show()

+-------------------+
|             oneTag|
+-------------------+
|                 qt|
|           keyboard|
|                cpu|
|      documentation|
|algebra-precalculus|
|          windows-7|
|   image-processing|
|            magento|
|             iphone|
|           geometry|
|    database-design|
|windows-server-2008|
|               unix|
|     zend-framework|
|      asp.net-mvc-2|
|                ftp|
|              xcode|
|             debian|
|              azure|
|            android|
+-------------------+
only showing top 20 rows



# Question
What is the accuracy of the best model trained with the parameter grid described in Step 3 below (keeping all other parameters at their default values), computed on the untouched 10% of the data?

### Step 1. Train Test Split
As a first step, split the data set into 90% training data and set aside the remaining 10% for testing. Set the random seed to `42`.

In [12]:
# TODO: write your code for this step
train, test = df.randomSplit([0.9, 0.1], seed=42)
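
`randomSplit` assigns rows to the splits at random, so the resulting sizes are close to 90% and 10% rather than exact. The pure-Python sketch below imitates that behaviour on a list of hypothetical row ids. `bernoulli_split` is an illustrative helper, not part of PySpark, and it does not reproduce Spark's actual sampler or its seed handling.

```python
import random

def bernoulli_split(rows, train_fraction=0.9, seed=42):
    """Assign each row independently to train or test (illustrative only)."""
    rng = random.Random(seed)
    train_part, test_part = [], []
    for row in rows:
        # Each row lands in the training part with probability `train_fraction`.
        if rng.random() < train_fraction:
            train_part.append(row)
        else:
            test_part.append(row)
    return train_part, test_part

rows = list(range(10_000))  # hypothetical row ids
train_rows, test_rows = bernoulli_split(rows)
train_share = len(train_rows) / len(rows)
print(len(train_rows), len(test_rows), round(train_share, 3))
```

Running `train.count()` and `test.count()` on the real DataFrames is a quick way to see the same effect on the actual split.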

### Step 2. Build Pipeline

In [22]:
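# TODO: write your code for this step
# Split the post body into lowercase word tokens on non-word characters
tokenzier=RegexTokenizer(inputCol='Body',outputCol='words',pattern='\\W')
# Term counts per document
cv=CountVectorizer(inputCol='words',outputCol='TF')
# Rescale term counts by inverse document frequency
idf=IDF(inputCol='TF',outputCol='features')
# Encode the tag strings as numeric labels
indexer=StringIndexer(inputCol='oneTag',outputCol='label')
lrmodel=LogisticRegression(featuresCol='features',labelCol='label')
# Chain all stages so cross-validation refits them together
pipline=Pipeline(stages=[tokenzier,cv,idf,indexer,lrmodel])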
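
To make the feature stages less of a black box, here is a rough pure-Python sketch of what tokenizing, counting, and IDF weighting do to a few documents. The three sample strings are hypothetical stand-ins for the `Body` column, and `tokenize` is an illustrative helper. The sketch assumes the default `RegexTokenizer` behaviour (lowercasing, splitting on the pattern, dropping empty tokens) and the smoothed formula `log((N + 1) / (df + 1))` given in the Spark `IDF` documentation. It ignores vocabulary-size limits and sparse vectors.

```python
import math
import re
from collections import Counter

docs = [
    "How do I sort a list in Python?",
    "Sort a dict by value in Python",
    "Center a div with CSS",
]  # hypothetical stand-ins for the Body column

def tokenize(text):
    # Lowercase, split on non-word characters, and drop empty tokens.
    return [tok for tok in re.split(r"\W", text.lower()) if tok]

tokenized = [tokenize(d) for d in docs]

# Term frequency: raw count of each term within a document.
term_counts = [Counter(tokens) for tokens in tokenized]

# Document frequency: number of documents that contain each term.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
n_docs = len(docs)

# Smoothed inverse document frequency: log((N + 1) / (df + 1)).
idf_weights = {t: math.log((n_docs + 1) / (df + 1)) for t, df in doc_freq.items()}

# TF-IDF: term count scaled by the term's IDF weight.
tfidf = [{t: c * idf_weights[t] for t, c in counts.items()} for counts in term_counts]
print(tfidf[2])
```

A term such as `a`, which occurs in every sample document, gets weight 0, while a term unique to one document keeps a positive weight.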

### Step 3. Tune Model
Using the 90% training split, find the most accurate logistic regression model with 3-fold cross-validation over the following parameter grid:

- CountVectorizer vocabulary size: `[1000, 5000]`
- LogisticRegression regularization parameter: `[0.0, 0.1]`
- LogisticRegression maximum number of iterations: `[10]`
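
The grid above expands to every combination of the listed values. The sketch below enumerates them in plain Python and counts how many pipeline fits 3-fold cross-validation implies. The dictionary keys are just labels chosen for this illustration, not PySpark `Param` objects.

```python
from itertools import product

grid = {
    "vocabSize": [1000, 5000],
    "regParam": [0.0, 0.1],
    "maxIter": [10],
}
num_folds = 3

# Cartesian product of all value lists: one dict per parameter combination.
names = list(grid)
combinations = [dict(zip(names, values)) for values in product(*grid.values())]
for combo in combinations:
    print(combo)

cv_fits = len(combinations) * num_folds  # one fit per combination per fold
total_fits = cv_fits + 1                 # plus one refit of the best combination on all training data
print(len(combinations), cv_fits, total_fits)
```

With 4 combinations and 3 folds, the full pipeline is fitted 12 times during the search and once more for the final model, which is why this step can take a while even on a small data set.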

In [31]:
CrossValidator?

In [29]:
# set up parameters
param=ParamGridBuilder() \
    .addGrid(cv.vocabSize,[1000,5000]) \
    .addGrid(lrmodel.maxIter,[10]) \
    .addGrid(lrmodel.regParam,[0.0,0.1]) \
    .build()

In [33]:
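# set up crossvalidation
# Note: MulticlassClassificationEvaluator defaults to metricName='f1';
# pass metricName='accuracy' to select the best model by accuracy instead.
crossval=CrossValidator(estimator=pipline,
                       estimatorParamMaps=param,
                       evaluator=MulticlassClassificationEvaluator(),
                       numFolds=3)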

### Step 4: Compute Accuracy of Best Model

In [34]:
# TODO: write your code for this step
model=crossval.fit(train)

In [36]:
pre=model.transform(test)

In [45]:
pre.createOrReplaceTempView('t2')
spark.sql('''
select avg(case when prediction =label then 1
            else 0
            end)
from t2''').show()

+-----------------------------------------------------+
|avg(CASE WHEN (prediction = label) THEN 1 ELSE 0 END)|
+-----------------------------------------------------+
|                                    0.392378263937897|
+-----------------------------------------------------+
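
The query above computes accuracy as the average of a 0/1 indicator: 1 when the prediction equals the label, 0 otherwise. The same arithmetic on a handful of hypothetical `(prediction, label)` pairs looks like this:

```python
# Hypothetical (prediction, label) pairs, not taken from the notebook's data.
pairs = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (2.0, 1.0), (2.0, 2.0)]

# 1 for a correct prediction, 0 for an incorrect one.
hits = [1 if prediction == label else 0 for prediction, label in pairs]

# Accuracy is the mean of the indicator values.
accuracy = sum(hits) / len(hits)
print(hits, accuracy)
```

Three of the five pairs match, so the accuracy here is 0.6.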



In [61]:
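# Pair up (prediction, label) for the RDD-based metrics API; use a new name so the `test` DataFrame is not overwritten
prediction_and_labels=pre.select('prediction','label').rdd.map(tuple)

In [63]:
metrics=MulticlassMetrics(predictionAndLabels=prediction_and_labels)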

In [71]:
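# Overall accuracy; the no-argument fMeasure() returned this same value in Spark 2.x but is deprecated in favor of accuracy
metrics.accuracy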

0.392378263937897

In [68]:
metrics.recall(label=1)

0.5136138613861386

In [69]:
metrics.precision(label=1)

0.39040451552210725
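
Per-label precision and recall answer two different questions: of the rows predicted as a given label, how many were right (precision), and of the rows that truly carry that label, how many were found (recall). The sketch below computes both from hypothetical `(prediction, label)` pairs. `precision_recall` is an illustrative helper, not the `MulticlassMetrics` implementation.

```python
def precision_recall(pairs, label):
    """Precision and recall for one label from (prediction, label) pairs."""
    true_positives = sum(1 for p, l in pairs if p == label and l == label)
    predicted = sum(1 for p, _ in pairs if p == label)  # rows predicted as `label`
    actual = sum(1 for _, l in pairs if l == label)     # rows whose true label is `label`
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    return precision, recall

# Hypothetical (prediction, label) pairs, not taken from the notebook's data.
pairs = [(1.0, 1.0), (1.0, 0.0), (1.0, 2.0), (0.0, 1.0), (2.0, 2.0), (0.0, 0.0)]

precision, recall = precision_recall(pairs, label=1.0)
print(precision, recall)
```

Here label `1.0` is predicted three times but is correct only once, while one of its two true rows is found, giving a precision of 1/3 and a recall of 1/2. The notebook's results for label 1 show the same pattern, with recall higher than precision.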