## Day 79 Lecture 2 Assignment

In this assignment, we will learn about machine learning in Spark.

Run the cells below to start a Spark session.

In [0]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://www-eu.apache.org/dist/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz
!tar xf spark-2.4.4-bin-hadoop2.7.tgz

In [0]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.4-bin-hadoop2.7"

In [0]:
!pip install -q findspark
!pip install -q pyspark

In [4]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [0]:
from pyspark.sql import SparkSession
from pyspark.sql.types import *

In [0]:
APP_NAME = "Day79"

In [0]:
spark = SparkSession.builder.appName(APP_NAME).getOrCreate()

In this assignment, we will be using the video game sales dataset again. It is loaded below.

In [0]:
video = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("/content/Video_Games_Sales_as_at_22_Dec_2016.csv")

In [68]:
video.show()

+--------------------+--------+---------------+------------+--------------------+--------+--------+--------+-----------+------------+------------+------------+----------+----------+-------------------+------+
|                Name|Platform|Year_of_Release|       Genre|           Publisher|NA_Sales|EU_Sales|JP_Sales|Other_Sales|Global_Sales|Critic_Score|Critic_Count|User_Score|User_Count|          Developer|Rating|
+--------------------+--------+---------------+------------+--------------------+--------+--------+--------+-----------+------------+------------+------------+----------+----------+-------------------+------+
|          Wii Sports|     Wii|           2006|      Sports|            Nintendo|   41.36|   28.96|    3.77|       8.45|       82.53|          76|          51|         8|       322|           Nintendo|     E|
|   Super Mario Bros.|     NES|           1985|    Platform|            Nintendo|   29.08|    3.58|    6.81|       0.77|       40.24|        null|        null|     

We will predict global sales using a number of variables in this dataset. We will start by removing all missing data (though we know that this will make the dataset significantly smaller).

In [97]:
# Answer below:
video = video.dropna()
video.count()

6947

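Next, we will create dummy variables for the genre. Spark's encoder works on numeric category indices, so first index `Genre` with `StringIndexer`, then create the dummy variables using the `OneHotEncoder` provided in Spark.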

In [98]:
from pyspark.ml.feature import StringIndexer
indexer = StringIndexer(inputCol="Genre", outputCol="GenreIndexed")
indexed = indexer.fit(video).transform(video)
indexed.show()

+--------------------+--------+---------------+--------+--------------------+--------+--------+--------+-----------+------------+------------+------------+----------+----------+--------------------+------+------------+
|                Name|Platform|Year_of_Release|   Genre|           Publisher|NA_Sales|EU_Sales|JP_Sales|Other_Sales|Global_Sales|Critic_Score|Critic_Count|User_Score|User_Count|           Developer|Rating|GenreIndexed|
+--------------------+--------+---------------+--------+--------------------+--------+--------+--------+-----------+------------+------------+------------+----------+----------+--------------------+------+------------+
|          Wii Sports|     Wii|           2006|  Sports|            Nintendo|   41.36|   28.96|    3.77|       8.45|       82.53|          76|          51|         8|       322|            Nintendo|     E|         1.0|
|      Mario Kart Wii|     Wii|           2008|  Racing|            Nintendo|   15.68|   12.76|    3.79|       3.29|       3

In [99]:
# Answer below:
# Note: in Spark 3.0+ this class is named OneHotEncoder.
from pyspark.ml.feature import OneHotEncoderEstimator
encoder = OneHotEncoderEstimator(inputCols=['GenreIndexed'], outputCols=['Genre_D'])
model = encoder.fit(indexed)
encoded = model.transform(indexed)
encoded.show()

+--------------------+--------+---------------+--------+--------------------+--------+--------+--------+-----------+------------+------------+------------+----------+----------+--------------------+------+------------+--------------+
|                Name|Platform|Year_of_Release|   Genre|           Publisher|NA_Sales|EU_Sales|JP_Sales|Other_Sales|Global_Sales|Critic_Score|Critic_Count|User_Score|User_Count|           Developer|Rating|GenreIndexed|       Genre_D|
+--------------------+--------+---------------+--------+--------------------+--------+--------+--------+-----------+------------+------------+------------+----------+----------+--------------------+------+------------+--------------+
|          Wii Sports|     Wii|           2006|  Sports|            Nintendo|   41.36|   28.96|    3.77|       8.45|       82.53|          76|          51|         8|       322|            Nintendo|     E|         1.0|(11,[1],[1.0])|
|      Mario Kart Wii|     Wii|           2008|  Racing|        
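The `Genre_D` value `(11,[1],[1.0])` is a sparse vector: size 11, with the value 1.0 at index 1. The following is a minimal pure-Python sketch of the two steps above, not Spark's implementation. It assumes the default settings (labels indexed by descending frequency, last category dropped as the baseline). The helper names and the toy genre list are hypothetical, and the alphabetical tie-break is a choice made for this sketch only.

```python
from collections import Counter

def index_by_frequency(values):
    """Assign index 0 to the most frequent label, 1 to the next, and so on.
    Ties are broken alphabetically (an assumption of this sketch)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: i for i, label in enumerate(ordered)}

def one_hot_drop_last(index, num_categories):
    """One-hot encode an index, dropping the last category so that it
    becomes the all-zeros baseline."""
    size = num_categories - 1
    vec = [0.0] * size
    if index < size:
        vec[index] = 1.0
    return vec

def to_dense(size, indices, values):
    """Expand sparse notation such as (11,[1],[1.0]) into a plain list."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense

# Toy data, not taken from the dataset.
genres = ["Action", "Sports", "Action", "Racing", "Sports", "Action"]
mapping = index_by_frequency(genres)
encoded_rows = [one_hot_drop_last(mapping[g], len(mapping)) for g in genres]

wii_sports_genre = to_dense(11, [1], [1.0])
```

With 12 genres and the last one dropped, each row gets an 11-slot vector, which matches the size shown in the output above.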

Convert the critic score and the user score to a number between 0 and 1 by dividing by 100. 

In [0]:
# Answer below:
# Reference the columns of `encoded` (the DataFrame being transformed), not `video`
encoded1 = (encoded
            .withColumn('Scaled_US', encoded['User_Score'] / 100)
            .withColumn('Scaled_CS', encoded['Critic_Score'] / 100))


In [101]:
encoded1.show()

+--------------------+--------+---------------+--------+--------------------+--------+--------+--------+-----------+------------+------------+------------+----------+----------+--------------------+------+------------+--------------+-------------------+---------+
|                Name|Platform|Year_of_Release|   Genre|           Publisher|NA_Sales|EU_Sales|JP_Sales|Other_Sales|Global_Sales|Critic_Score|Critic_Count|User_Score|User_Count|           Developer|Rating|GenreIndexed|       Genre_D|          Scaled_US|Scaled_CS|
+--------------------+--------+---------------+--------+--------------------+--------+--------+--------+-----------+------------+------------+------------+----------+----------+--------------------+------+------------+--------------+-------------------+---------+
|          Wii Sports|     Wii|           2006|  Sports|            Nintendo|   41.36|   28.96|    3.77|       8.45|       82.53|          76|          51|         8|       322|            Nintendo|     E|   

Using the `VectorAssembler`, create a vector of features from the scaled user score, the scaled critic score, and the one-hot encoded genre vector.

In [0]:
# Answer below:
from pyspark.ml.feature import VectorAssembler
feature_cols = ['Scaled_CS','Scaled_US','Genre_D']
assembler = VectorAssembler(inputCols = feature_cols, outputCol = 'features')


In [103]:
type(assembler)

pyspark.ml.feature.VectorAssembler

In [0]:
video_features = assembler.transform(encoded1)

In [105]:
type(video_features)

pyspark.sql.dataframe.DataFrame

In [106]:
video_features.show()

+--------------------+--------+---------------+--------+--------------------+--------+--------+--------+-----------+------------+------------+------------+----------+----------+--------------------+------+------------+--------------+-------------------+---------+--------------------+
|                Name|Platform|Year_of_Release|   Genre|           Publisher|NA_Sales|EU_Sales|JP_Sales|Other_Sales|Global_Sales|Critic_Score|Critic_Count|User_Score|User_Count|           Developer|Rating|GenreIndexed|       Genre_D|          Scaled_US|Scaled_CS|            features|
+--------------------+--------+---------------+--------+--------------------+--------+--------+--------+-----------+------------+------------+------------+----------+----------+--------------------+------+------------+--------------+-------------------+---------+--------------------+
|          Wii Sports|     Wii|           2006|  Sports|            Nintendo|   41.36|   28.96|    3.77|       8.45|       82.53|          76|   

Split the data into 70% in the training sample and 30% in the test sample.

In [107]:
# Answer below:
train, test = video_features.randomSplit([0.7,0.3], seed = 1)
train.count()


4837

In [108]:
test.count()

2110
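The two counts do not land exactly on 70% and 30%. This is expected: `randomSplit` treats the weights as sampling probabilities for each row rather than as exact quotas. A short arithmetic check using the counts printed above:

```python
# Row counts copied from the outputs above.
train_rows, test_rows = 4837, 2110

total_rows = train_rows + test_rows
train_share = train_rows / total_rows
test_share = test_rows / total_rows

print(f"total: {total_rows}")
print(f"train share: {train_share:.4f}")
print(f"test share:  {test_share:.4f}")
```

The total equals the row count after `dropna()`, so no rows were lost or duplicated by the split.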

Using the train and test data, fit a linear regression to predict global sales. Print the R-squared from the model summary.

In [113]:
train.show()

+--------------------+--------+---------------+------------+--------------------+--------+--------+--------+-----------+------------+------------+------------+----------+----------+--------------------+------+------------+---------------+--------------------+---------+--------------------+
|                Name|Platform|Year_of_Release|       Genre|           Publisher|NA_Sales|EU_Sales|JP_Sales|Other_Sales|Global_Sales|Critic_Score|Critic_Count|User_Score|User_Count|           Developer|Rating|GenreIndexed|        Genre_D|           Scaled_US|Scaled_CS|            features|
+--------------------+--------+---------------+------------+--------------------+--------+--------+--------+-----------+------------+------------+------------+----------+----------+--------------------+------+------------+---------------+--------------------+---------+--------------------+
|   Tales of Xillia 2|     PS3|           2012|Role-Playing|  Namco Bandai Games|     0.2|    0.12|    0.45|       0.07|       

In [118]:
# Answer below:
from pyspark.ml.regression import LinearRegression

lr = LinearRegression(featuresCol= 'features', labelCol='Global_Sales')

# Fit the model
lrModel = lr.fit(train)

# Print the coefficients and intercept for the linear regression
print("Coefficients: " + str(lrModel.coefficients))
print("Intercept: " + str(lrModel.intercept))

Coefficients: [4.158643649444037,-9.538641777750332,0.33790985363750004,0.2520051380322762,0.4455677944897637,0.1162363737205317,0.3086939617053152,0.5083768382248952,0.6547017855166578,0.16823868355872185,0.21493745006933673,-0.3653731428035088,-0.026537101215647948]
Intercept: -1.7205279827879534
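The 13 coefficients follow the order of the assembled feature vector: the columns in `feature_cols`, with `Genre_D` expanded into its 11 slots. The sketch below pairs each coefficient with a label. The `Genre_D[i]` names are made up for illustration. Slot `i` corresponds to genre index `i` from the `StringIndexer`, assuming the encoder's default of dropping the last category as the baseline.

```python
# Coefficients copied from the output above, rounded to four decimals.
coefficients = [4.1586, -9.5386, 0.3379, 0.2520, 0.4456, 0.1162, 0.3087,
                0.5084, 0.6547, 0.1682, 0.2149, -0.3654, -0.0265]

feature_cols = ["Scaled_CS", "Scaled_US", "Genre_D"]
genre_slots = 11  # size of the Genre_D vector, as in (11,[1],[1.0])

# Expand the vector column into one label per slot, keeping assembler order.
names = []
for col in feature_cols:
    if col == "Genre_D":
        names.extend(f"Genre_D[{i}]" for i in range(genre_slots))
    else:
        names.append(col)

labelled = dict(zip(names, coefficients))
for name, value in labelled.items():
    print(f"{name:>12}: {value:+.4f}")
```

Each genre coefficient is read relative to the dropped baseline genre, which has no slot of its own.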


In [120]:
lrModel.summary.rootMeanSquaredError

2.085580280038145

In [123]:
lrModel.summary.residuals.show()

+--------------------+
|           residuals|
+--------------------+
|   0.245207318404432|
|  0.5660934230931725|
|-0.02733844629026...|
| -0.0168971108791251|
| -0.8974979934621012|
|  0.4490501143433512|
|  0.5023486917819483|
|  0.8367433232321027|
|  0.4049345020988634|
| -0.6880306827888953|
|0.007966964258761325|
| -0.1812360237464551|
| 0.04796696425876132|
| 0.00909204253095236|
| -0.4431230797844513|
| 0.03244911087609248|
|-0.38343871074587627|
| -0.3605214002850966|
| -0.5471254228952057|
|  0.2508746471238254|
+--------------------+
only showing top 20 rows



In [124]:
lrModel.summary.r2

0.061834142107681456
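For reference, R-squared compares the model's squared error with that of always predicting the mean: $R^2 = 1 - SS_{res}/SS_{tot}$. The sketch below computes it, along with the root mean squared error, on toy numbers that are not taken from the dataset. The function names are illustrative and not part of the Spark API.

```python
import math

def r_squared(actual, predicted):
    """1 - (residual sum of squares) / (total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

def rmse(actual, predicted):
    """Square root of the mean squared residual."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

# Toy values for illustration only.
actual = [1.0, 2.0, 3.0, 4.0]
close_fit = [1.1, 1.9, 3.2, 3.8]
mean_only = [2.5] * 4  # always predict the mean of `actual`

r2_close = r_squared(actual, close_fit)
r2_mean = r_squared(actual, mean_only)
rmse_close = rmse(actual, close_fit)

print(r2_close, r2_mean, rmse_close)
```

A model that only predicts the mean scores 0, so an R-squared of about 0.06 means these features explain roughly 6% of the variance in `Global_Sales` on the training data. `lrModel.summary` describes the training split only. The held-out `test` split can be scored with `lrModel.evaluate(test)`, which returns a summary exposing the same `r2` and `rootMeanSquaredError` fields.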