This is the first assignment for the Coursera course "Advanced Machine Learning and Signal Processing".

Execute all cells one after the other and you are done. Note that in the last cell you have to enter your email address (the one you use for Coursera) and a submission token, which you obtain from the programming assignment page on Coursera.

This notebook is designed to run in an IBM Watson Studio default runtime, NOT the Watson Studio Apache Spark runtime, because the default runtime with 1 vCPU is free of charge. Therefore, we install Apache Spark in local mode, for test purposes only. Please don't use it in production.

In case you are facing issues, please read the following two documents first:

https://github.com/IBM/skillsnetwork/wiki/Environment-Setup

https://github.com/IBM/skillsnetwork/wiki/FAQ

Then, please feel free to ask:

https://coursera.org/learn/machine-learning-big-data-apache-spark/discussions/all

Please make sure to follow the guidelines before asking a question:

https://github.com/IBM/skillsnetwork/wiki/FAQ#im-feeling-lost-and-confused-please-help-me


This notebook should also work outside Watson Studio. If you are running in an Apache Spark context outside Watson Studio, please remove the Apache Spark setup in the first notebook cells.

In [2]:
from IPython.display import Markdown, display
def printmd(string):
    display(Markdown('# <span style="color:red">'+string+'</span>'))


if ('sc' in locals() or 'sc' in globals()):
    printmd('<<<<<!!!!! It seems that you are running in an IBM Watson Studio Apache Spark Notebook. Please run it in an IBM Watson Studio Default Runtime (without Apache Spark) !!!!!>>>>>')


In [1]:
#!pip install pyspark==2.4.5

In [2]:
try:
    from pyspark import SparkContext, SparkConf
    from pyspark.sql import SparkSession
except ImportError:
    printmd('<<<<<!!!!! Please restart your kernel after installing Apache Spark !!!!!>>>>>')
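
The import check above only reports a problem after the import has failed. As a minimal sketch, the same check can be done up front with the standard library. The helper name `pyspark_available` is hypothetical and not part of the assignment.

```python
import importlib.util


def pyspark_available():
    """Return True if the pyspark package can be imported in this kernel."""
    # find_spec locates the package without actually importing it.
    return importlib.util.find_spec("pyspark") is not None


if pyspark_available():
    print("pyspark is importable; no installation needed.")
else:
    print("pyspark not found; run `pip install pyspark==2.4.5`, then restart the kernel.")
```

The version pin in the message is the one from the commented-out install cell above.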

In [3]:
sc = SparkContext.getOrCreate(SparkConf().setMaster("local[*]"))

spark = SparkSession \
    .builder \
    .getOrCreate()

In [4]:
!wget https://github.com/IBM/coursera/raw/master/coursera_ml/a2.parquet

--2020-05-21 15:14:46--  https://github.com/IBM/coursera/raw/master/coursera_ml/a2.parquet
Resolving github.com (github.com)... 140.82.114.4
Connecting to github.com (github.com)|140.82.114.4|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://github.com/IBM/skillsnetwork/raw/master/coursera_ml/a2.parquet [following]
--2020-05-21 15:14:47--  https://github.com/IBM/skillsnetwork/raw/master/coursera_ml/a2.parquet
Reusing existing connection to github.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/IBM/skillsnetwork/master/coursera_ml/a2.parquet [following]
--2020-05-21 15:14:47--  https://raw.githubusercontent.com/IBM/skillsnetwork/master/coursera_ml/a2.parquet
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 199.232.28.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|199.232.28.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
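
If `wget` is not available (for example outside Watson Studio), the file can probably be fetched with the Python standard library instead. The following is a sketch under that assumption. The helper name `fetch` is hypothetical, and the URL in the usage comment is the redirect target shown in the log above.

```python
import shutil
import urllib.request
from pathlib import Path


def fetch(url, dest):
    """Download url to dest unless dest already exists; return dest as a Path."""
    dest = Path(dest)
    if dest.exists():
        # Skip the download so that re-running the cell does not fetch again.
        return dest
    # urlopen follows HTTP redirects such as the 301/302 responses in the log.
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest


# Usage (requires network access):
# fetch("https://github.com/IBM/skillsnetwork/raw/master/coursera_ml/a2.parquet", "a2.parquet")
```

Skipping an existing file is a design choice for notebooks: by default, `wget` saves a repeated download under a new name such as `a2.parquet.1` rather than reusing the first copy.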

In [9]:
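# Parquet is Spark's default data source format, so no format argument is needed
df = spark.read.load('a2.parquet')

# Register the DataFrame as a temporary view so it can be queried with SQL
df.createOrReplaceTempView("df")
spark.sql("SELECT * FROM df").show(5)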


+-----+-----------+-------------------+-------------------+-------------------+
|CLASS|   SENSORID|                  X|                  Y|                  Z|
+-----+-----------+-------------------+-------------------+-------------------+
|    0|         26| 380.66434005495194| -139.3470983812975|-247.93697521077704|
|    0|         29| 104.74324299209692| -32.27421440203938|-25.105013725863852|
|    0| 8589934658| 118.11469236129976| 45.916682927433534| -87.97203782706572|
|    0|34359738398| 246.55394030642543|-0.6122810693132044|-398.18662513951506|
|    0|17179869241|-190.32584900181487|  234.7849657520335|-206.34483804019288|
+-----+-----------+-------------------+-------------------+-------------------+
only showing top 5 rows



In [13]:
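# One-hot encode the CLASS column.
# Note: in PySpark 2.4 OneHotEncoder is a Transformer, so transform() can be called directly;
# in PySpark 3.x it is an Estimator and must be fitted first (encoder.fit(df).transform(df)).
from pyspark.ml.feature import OneHotEncoder

encoder = OneHotEncoder(inputCol='CLASS', outputCol='Enc_CLass')
encoded = encoder.transform(df)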
encoded.show(5)

+-----+-----------+-------------------+-------------------+-------------------+-------------+
|CLASS|   SENSORID|                  X|                  Y|                  Z|    Enc_CLass|
+-----+-----------+-------------------+-------------------+-------------------+-------------+
|    0|         26| 380.66434005495194| -139.3470983812975|-247.93697521077704|(1,[0],[1.0])|
|    0|         29| 104.74324299209692| -32.27421440203938|-25.105013725863852|(1,[0],[1.0])|
|    0| 8589934658| 118.11469236129976| 45.916682927433534| -87.97203782706572|(1,[0],[1.0])|
|    0|34359738398| 246.55394030642543|-0.6122810693132044|-398.18662513951506|(1,[0],[1.0])|
|    0|17179869241|-190.32584900181487|  234.7849657520335|-206.34483804019288|(1,[0],[1.0])|
+-----+-----------+-------------------+-------------------+-------------------+-------------+
only showing top 5 rows
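
The `Enc_CLass` values are printed in Spark's sparse vector notation `(size,[indices],[values])`. As a rough illustration, the plain-Python sketch below mimics that notation. It assumes the encoder drops the last category (Spark's default `dropLast=True`) and that `CLASS` has two categories, which the vector size of 1 suggests. The helper names `one_hot_sparse` and `to_dense` are hypothetical and not part of PySpark.

```python
def one_hot_sparse(category, num_categories, drop_last=True):
    """Encode a category index as a (size, indices, values) triple."""
    size = num_categories - 1 if drop_last else num_categories
    if category < size:
        return (size, [category], [1.0])
    # With drop_last, the last category is encoded as the all-zero vector.
    return (size, [], [])


def to_dense(sparse):
    """Expand a (size, indices, values) triple into a dense list."""
    size, indices, values = sparse
    dense = [0.0] * size
    for index, value in zip(indices, values):
        dense[index] = value
    return dense


print(one_hot_sparse(0, 2))            # (1, [0], [1.0])
print(to_dense(one_hot_sparse(1, 2)))  # [0.0]
```

Under these assumptions, class 0 maps to `(1,[0],[1.0])` as in the table above, and class 1 would map to the all-zero vector.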



In [18]:
# Assemble the numeric columns X, Y and Z into a single feature vector column
from pyspark.ml.feature import VectorAssembler

assembler = VectorAssembler(inputCols=['X', 'Y', 'Z'], outputCol='Feature')
assembled = assembler.transform(encoded)
assembled.show(5)



+-----+-----------+-------------------+-------------------+-------------------+-------------+--------------------+
|CLASS|   SENSORID|                  X|                  Y|                  Z|    Enc_CLass|             Feature|
+-----+-----------+-------------------+-------------------+-------------------+-------------+--------------------+
|    0|         26| 380.66434005495194| -139.3470983812975|-247.93697521077704|(1,[0],[1.0])|[380.664340054951...|
|    0|         29| 104.74324299209692| -32.27421440203938|-25.105013725863852|(1,[0],[1.0])|[104.743242992096...|
|    0| 8589934658| 118.11469236129976| 45.916682927433534| -87.97203782706572|(1,[0],[1.0])|[118.114692361299...|
|    0|34359738398| 246.55394030642543|-0.6122810693132044|-398.18662513951506|(1,[0],[1.0])|[246.553940306425...|
|    0|17179869241|-190.32584900181487|  234.7849657520335|-206.34483804019288|(1,[0],[1.0])|[-190.32584900181...|
+-----+-----------+-------------------+-------------------+-------------------+-------------+--------------------+
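
Conceptually, the assembler keeps every existing column and adds one new column holding the input values as a vector. The plain-Python sketch below illustrates this for a single row. The function `assemble` and the sample values are made up for illustration and are not taken from `a2.parquet`.

```python
def assemble(row, input_cols, output_col):
    """Return a copy of row with input_cols combined into one list-valued column."""
    out = dict(row)  # keep all original columns, as VectorAssembler does
    out[output_col] = [float(row[col]) for col in input_cols]
    return out


# Illustrative values only
row = {"CLASS": 0, "X": 1.5, "Y": -2.0, "Z": 0.25}
assembled_row = assemble(row, ["X", "Y", "Z"], "Feature")
print(assembled_row["Feature"])  # [1.5, -2.0, 0.25]
```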

In [21]:
# Chain the encoder and the assembler into a single pipeline
from pyspark.ml import Pipeline

pipeline = Pipeline(stages=[encoder, assembler])
model = pipeline.fit(df)
processed = model.transform(df)
processed.show(5)

+-----+-----------+-------------------+-------------------+-------------------+-------------+--------------------+
|CLASS|   SENSORID|                  X|                  Y|                  Z|    Enc_CLass|             Feature|
+-----+-----------+-------------------+-------------------+-------------------+-------------+--------------------+
|    0|         26| 380.66434005495194| -139.3470983812975|-247.93697521077704|(1,[0],[1.0])|[380.664340054951...|
|    0|         29| 104.74324299209692| -32.27421440203938|-25.105013725863852|(1,[0],[1.0])|[104.743242992096...|
|    0| 8589934658| 118.11469236129976| 45.916682927433534| -87.97203782706572|(1,[0],[1.0])|[118.114692361299...|
|    0|34359738398| 246.55394030642543|-0.6122810693132044|-398.18662513951506|(1,[0],[1.0])|[246.553940306425...|
|    0|17179869241|-190.32584900181487|  234.7849657520335|-206.34483804019288|(1,[0],[1.0])|[-190.32584900181...|
+-----+-----------+-------------------+-------------------+-------------------+-------------+--------------------+
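
The fit/transform contract behind a pipeline can be sketched in plain Python. This is a toy model of the idea, not Spark's implementation, and all class names (`Double`, `Center`, `CenterModel`, `ToyPipeline`, `ToyPipelineModel`) are hypothetical. Stages that have a `fit` method are fitted on the data first; stages without one are used as they are.

```python
class Double:
    """Toy transformer: needs no fitting."""
    def transform(self, values):
        return [2 * v for v in values]


class CenterModel:
    """Fitted model produced by Center: subtracts the learned mean."""
    def __init__(self, mean):
        self.mean = mean

    def transform(self, values):
        return [v - self.mean for v in values]


class Center:
    """Toy estimator: learns the mean of the data it is fitted on."""
    def fit(self, values):
        return CenterModel(sum(values) / len(values))


class ToyPipelineModel:
    """Applies the fitted stages in order."""
    def __init__(self, stages):
        self.stages = stages

    def transform(self, values):
        for stage in self.stages:
            values = stage.transform(values)
        return values


class ToyPipeline:
    def __init__(self, stages):
        self.stages = stages

    def fit(self, values):
        fitted = []
        for stage in self.stages:
            # Estimators are fitted on the output of the previous stages.
            model = stage.fit(values) if hasattr(stage, "fit") else stage
            values = model.transform(values)
            fitted.append(model)
        return ToyPipelineModel(fitted)


model = ToyPipeline([Double(), Center()]).fit([1.0, 2.0, 3.0])
print(model.transform([1.0, 2.0, 3.0]))  # [-2.0, 0.0, 2.0]
```

The fitted model can then be applied to new data with the same learned parameters, which is the main reason to fit once and reuse the result.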

In [24]:
# Keep only the encoded label and the assembled feature vector
X_train = processed.drop('X', 'Y', 'Z', 'CLASS', 'SENSORID')
X_train.show(5)

+-------------+--------------------+
|    Enc_CLass|             Feature|
+-------------+--------------------+
|(1,[0],[1.0])|[380.664340054951...|
|(1,[0],[1.0])|[104.743242992096...|
|(1,[0],[1.0])|[118.114692361299...|
|(1,[0],[1.0])|[246.553940306425...|
|(1,[0],[1.0])|[-190.32584900181...|
+-------------+--------------------+
only showing top 5 rows

