# Million Song Dataset

This is a subset of the Million Song Dataset: http://millionsongdataset.com/

The subset was taken from the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/YearPredictionMSD#

The dataset contains over 500,000 songs. Each record holds the year of the song and 90 attributes describing the timbre averages and timbre covariances of the song.

We will create a model that predicts the year of a song from its timbre attributes (timbre is the color or tone quality of the sound).

In [1]:
from pyspark import SparkContext
from pyspark.sql import SparkSession
import pyspark
import json

Create a Spark Session

Spark's machine learning libraries can require more memory for the Java Spark driver process than the default allocation provides.
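The `sc` and `spark` objects used in this notebook are not created in any cell shown here, which suggests they were supplied by the notebook environment. Outside such an environment, a session with a larger driver heap could be built roughly as follows. This is a minimal sketch: the helper names, the application name, and the `"4g"` value are illustrative assumptions, not settings taken from this notebook, and `spark.driver.memory` only takes effect if no driver JVM is already running.

```python
def driver_config(driver_memory="4g", app_name="MSD-year-prediction"):
    """Return Spark settings as a plain dict (values here are illustrative)."""
    return {
        "spark.app.name": app_name,
        "spark.driver.memory": driver_memory,
    }


def build_session(conf):
    """Create (or reuse) a SparkSession configured with the given settings."""
    # Imported here so that driver_config() can be used without Spark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


# Hypothetical usage:
# spark = build_session(driver_config("4g"))
# sc = spark.sparkContext
```

Keeping the settings in a dict makes it easy to print or log what the session was started with.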

In [2]:
sc.setLogLevel("ERROR")
spark

---

Loading the dataset

When loading a CSV file, Spark treats every column as a string by default, even if the values are numeric. Passing `inferSchema=True` makes Spark scan the data and infer the data type of each column.

In [4]:
path = "YearPredictionMSD.txt"
MSD_dd = spark.read.csv(path,inferSchema=True,header=False)
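Schema inference costs an extra pass over the file. When the layout is already known, an explicit schema avoids that pass. The sketch below builds a DDL-formatted schema string, assuming one integer label column followed by 90 double columns and keeping Spark's default `_cN` names; the helper name `msd_schema_ddl` is hypothetical.

```python
def msd_schema_ddl(n_features=90):
    """Build a DDL schema string: an INT label column, then DOUBLE feature columns."""
    columns = ["_c0 INT"]
    columns += [f"_c{i} DOUBLE" for i in range(1, n_features + 1)]
    return ", ".join(columns)


schema = msd_schema_ddl()

# With a live session, the string could be passed in place of inferSchema:
# MSD_dd = spark.read.csv(path, schema=schema, header=False)
```

`spark.read.csv` accepts either a `StructType` or a DDL string for `schema`; the string form is used here only because it is shorter to write for 91 columns.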

---

Count the number of rows in the Spark DataFrame.

In [5]:
MSD_dd.count()

515345

---

Show the column names.

The file has no header row, so Spark assigns default column names of the form `_cN`.

In [6]:
MSD_dd.columns

['_c0',
 '_c1',
 '_c2',
 '_c3',
 '_c4',
 '_c5',
 '_c6',
 '_c7',
 '_c8',
 '_c9',
 '_c10',
 '_c11',
 '_c12',
 '_c13',
 '_c14',
 '_c15',
 '_c16',
 '_c17',
 '_c18',
 '_c19',
 '_c20',
 '_c21',
 '_c22',
 '_c23',
 '_c24',
 '_c25',
 '_c26',
 '_c27',
 '_c28',
 '_c29',
 '_c30',
 '_c31',
 '_c32',
 '_c33',
 '_c34',
 '_c35',
 '_c36',
 '_c37',
 '_c38',
 '_c39',
 '_c40',
 '_c41',
 '_c42',
 '_c43',
 '_c44',
 '_c45',
 '_c46',
 '_c47',
 '_c48',
 '_c49',
 '_c50',
 '_c51',
 '_c52',
 '_c53',
 '_c54',
 '_c55',
 '_c56',
 '_c57',
 '_c58',
 '_c59',
 '_c60',
 '_c61',
 '_c62',
 '_c63',
 '_c64',
 '_c65',
 '_c66',
 '_c67',
 '_c68',
 '_c69',
 '_c70',
 '_c71',
 '_c72',
 '_c73',
 '_c74',
 '_c75',
 '_c76',
 '_c77',
 '_c78',
 '_c79',
 '_c80',
 '_c81',
 '_c82',
 '_c83',
 '_c84',
 '_c85',
 '_c86',
 '_c87',
 '_c88',
 '_c89',
 '_c90']
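The default names carry no meaning, so it can help to rename them. The sketch below assumes the layout given in the UCI description of this subset (the year first, then 12 timbre averages, then 78 timbre covariances); verify that layout against the dataset page before relying on it. The names `year`, `timbre_avg_N`, and `timbre_cov_N` are made up for illustration, and the rest of this notebook keeps using the default `_cN` names.

```python
def msd_column_names(n_avg=12, n_cov=78):
    """Return descriptive column names: label first, then averages, then covariances."""
    names = ["year"]
    names += [f"timbre_avg_{i}" for i in range(1, n_avg + 1)]
    names += [f"timbre_cov_{i}" for i in range(1, n_cov + 1)]
    return names


names = msd_column_names()

# With the DataFrame loaded earlier, the rename would be a single call:
# MSD_named = MSD_dd.toDF(*names)
```

`DataFrame.toDF` returns a new DataFrame and leaves `MSD_dd` unchanged, so the later cells are unaffected.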

---

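Show the first 3 rows of the first 4 columns.

Column `_c0` is the year of the song; `_c1` to `_c3` are the first three timbre features.

Calling `select` on its own only returns a new DataFrame and prints its schema. Add `.show()` to display the values.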

In [7]:
MSD_dd.select("_c0","_c1","_c2","_c3")

DataFrame[_c0: int, _c1: double, _c2: double, _c3: double]

In [8]:
MSD_dd.select("_c0","_c1","_c2","_c3").show(3)

+----+--------+--------+--------+
| _c0|     _c1|     _c2|     _c3|
+----+--------+--------+--------+
|2001|49.94357|21.47114| 73.0775|
|2001|48.73215| 18.4293|70.32679|
|2001|50.95714|31.85602|55.81851|
+----+--------+--------+--------+
only showing top 3 rows



---

Double-check the data types with `.printSchema()`.

In [9]:
MSD_dd.printSchema()

root
 |-- _c0: integer (nullable = true)
 |-- _c1: double (nullable = true)
 |-- _c2: double (nullable = true)
 |-- _c3: double (nullable = true)
 |-- _c4: double (nullable = true)
 |-- _c5: double (nullable = true)
 |-- _c6: double (nullable = true)
 |-- _c7: double (nullable = true)
 |-- _c8: double (nullable = true)
 |-- _c9: double (nullable = true)
 |-- _c10: double (nullable = true)
 |-- _c11: double (nullable = true)
 |-- _c12: double (nullable = true)
 |-- _c13: double (nullable = true)
 |-- _c14: double (nullable = true)
 |-- _c15: double (nullable = true)
 |-- _c16: double (nullable = true)
 |-- _c17: double (nullable = true)
 |-- _c18: double (nullable = true)
 |-- _c19: double (nullable = true)
 |-- _c20: double (nullable = true)
 |-- _c21: double (nullable = true)
 |-- _c22: double (nullable = true)
 |-- _c23: double (nullable = true)
 |-- _c24: double (nullable = true)
 |-- _c25: double (nullable = true)
 |-- _c26: double (nullable = true)
 |-- _c27: double (nullable = true)
 ... (output truncated)

---

Start the machine learning process!

First, we need to split the data into training and test sets.

In [10]:
from pyspark.ml.feature import VectorAssembler 
from pyspark.ml import Pipeline
from pyspark.ml.regression import GBTRegressor
from pyspark.ml.feature import VectorIndexer
from pyspark.ml.evaluation import RegressionEvaluator

(trainingData, testData) = MSD_dd.randomSplit([0.7, 0.3])
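`randomSplit` draws a different split on every run unless a seed is given, so the RMSE reported further down will vary between runs. A minimal sketch of a seeded split follows; the helper name `split_train_test` and the seed value `42` are arbitrary choices for illustration. Even with a seed, the split is only repeatable if the DataFrame is read and partitioned the same way each time.

```python
def split_train_test(df, train_fraction=0.7, seed=42):
    """Split a Spark DataFrame into (train, test) using a fixed random seed."""
    weights = [train_fraction, 1.0 - train_fraction]
    # randomSplit returns one DataFrame per weight, in the same order.
    train, test = df.randomSplit(weights, seed=seed)
    return train, test


# Hypothetical usage with the DataFrame loaded earlier:
# trainingData, testData = split_train_test(MSD_dd)
```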

Get the feature columns (the timbre attributes), which are all columns except `_c0`.

In [11]:
feature_columns = MSD_dd.columns[1:]

Set up Spark's gradient-boosted tree (GBT) regressor.

In [12]:
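# Combine the 90 timbre columns into a single vector column named "features"
assembler = VectorAssembler(inputCols=feature_columns, outputCol="features")
# Gradient-boosted trees: 10 boosting iterations, each tree up to depth 10
gbt = GBTRegressor(featuresCol="features", labelCol="_c0", maxIter=10, maxDepth=10)
pipeline = Pipeline(stages=[assembler, gbt])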

Train the model on `trainingData`.

In [13]:
model = pipeline.fit(trainingData)
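Once the pipeline is fitted, the GBT stage can report which timbre columns it relied on most. The sketch below assumes the regressor is the last stage of the fitted pipeline, as it is in this notebook; the helper name `top_features` is hypothetical. No output is shown because the values depend on the random split.

```python
def top_features(pipeline_model, feature_columns, k=5):
    """Return the k (column, importance) pairs with the highest importance."""
    gbt_model = pipeline_model.stages[-1]  # assumes the regressor is the last stage
    importances = gbt_model.featureImportances.toArray()
    ranked = sorted(
        zip(feature_columns, importances),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return [(name, float(score)) for name, score in ranked[:k]]


# Hypothetical usage after fitting:
# for name, score in top_features(model, feature_columns):
#     print(name, score)
```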

---

Now we use the trained model to make predictions on the test set.

In [14]:
# Make predictions.
predictions = model.transform(testData)

# Select example rows to display.
predictions.select("prediction", "_c0", "features").show(5)

# Select (prediction, true label) and compute test error
evaluator = RegressionEvaluator(
    labelCol="_c0", predictionCol="prediction", metricName="rmse")
rmse = evaluator.evaluate(predictions)

print("Root Mean Squared Error (RMSE) on test data = %g" % rmse)


+------------------+----+--------------------+
|        prediction| _c0|            features|
+------------------+----+--------------------+
|1981.6303655935535|1941|[30.25882,-62.391...|
|  1989.66312356362|1941|[31.96273,-101.69...|
|1987.0270204672634|1941|[34.57043,-169.65...|
|1984.6133678545932|1941|[39.21391,-135.56...|
|1989.9391258267783|1958|[42.68421,15.1133...|
+------------------+----+--------------------+
only showing top 5 rows





Root Mean Squared Error (RMSE) on test data = 9.53532
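RMSE is the square root of the mean squared difference between predicted and true years, so it is expressed in years. To make the metric concrete, the sketch below recomputes it in plain Python for the five rows displayed above. Those rows are all from 1941 and 1958, and each prediction is more than 30 years off, so the RMSE over just these rows is far larger than the overall test RMSE. The function name `rmse` is an illustrative helper, not part of Spark.

```python
import math


def rmse(predictions, labels):
    """Root mean squared error between two equal-length sequences of numbers."""
    if len(predictions) != len(labels) or len(labels) == 0:
        raise ValueError("predictions and labels must be non-empty and equal in length")
    squared_errors = [(p - y) ** 2 for p, y in zip(predictions, labels)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# The five (prediction, _c0) pairs shown in the output above.
shown_predictions = [
    1981.6303655935535,
    1989.66312356362,
    1987.0270204672634,
    1984.6133678545932,
    1989.9391258267783,
]
shown_labels = [1941, 1941, 1941, 1941, 1958]

print("RMSE on the five displayed rows = %g" % rmse(shown_predictions, shown_labels))
```

The gap between this figure and the overall test RMSE suggests the model does much worse on the oldest songs than on the test set as a whole.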