# Machine Learning in PySpark

- PySpark has two machine learning libraries:

    - `mllib`, mainly RDD based
    - `ml`, mainly PySpark DataFrame based

- Spark ML is the newer, scikit-learn-inspired machine learning library, and it is where active development takes place.
- `spark.ml` is the recommended package because its DataFrame-based API is more versatile and flexible.
- Column names matter in both `mllib` and `ml`; for example, `spark.ml` estimators look for columns named `features` and `label` by default.

In [None]:
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('LinearReg').getOrCreate()

In [6]:
# CSV options
infer_schema = "false"
first_row_is_header = "true"
delimiter = ","
file_type = "csv"

# With inferSchema set to "false", every column is read as a string
df = spark.read.format(file_type) \
  .option("inferSchema", infer_schema) \
  .option("header", first_row_is_header) \
  .option("sep", delimiter) \
  .load("Advertising.csv")

In [7]:
df.show(5)

+---+-----+-----+---------+-----+
|_c0|   TV|radio|newspaper|sales|
+---+-----+-----+---------+-----+
|  1|230.1| 37.8|     69.2| 22.1|
|  2| 44.5| 39.3|     45.1| 10.4|
|  3| 17.2| 45.9|     69.3|  9.3|
|  4|151.5| 41.3|     58.5| 18.5|
|  5|180.8| 10.8|     58.4| 12.9|
+---+-----+-----+---------+-----+
only showing top 5 rows



In [8]:
regressionDataFrame = df.drop('_c0')

In [9]:
regressionDataFrame

DataFrame[TV: string, radio: string, newspaper: string, sales: string]

In [10]:
regressionDataFrame.take(5)

[Row(TV='230.1', radio='37.8', newspaper='69.2', sales='22.1'),
 Row(TV='44.5', radio='39.3', newspaper='45.1', sales='10.4'),
 Row(TV='17.2', radio='45.9', newspaper='69.3', sales='9.3'),
 Row(TV='151.5', radio='41.3', newspaper='58.5', sales='18.5'),
 Row(TV='180.8', radio='10.8', newspaper='58.4', sales='12.9')]

In [11]:
regressionDataFrame.show(5)

+-----+-----+---------+-----+
|   TV|radio|newspaper|sales|
+-----+-----+---------+-----+
|230.1| 37.8|     69.2| 22.1|
| 44.5| 39.3|     45.1| 10.4|
| 17.2| 45.9|     69.3|  9.3|
|151.5| 41.3|     58.5| 18.5|
|180.8| 10.8|     58.4| 12.9|
+-----+-----+---------+-----+
only showing top 5 rows



In [12]:
regressionDataRDD = regressionDataFrame.rdd.map(list)

In [13]:
regressionDataRDD.take(5)

[['230.1', '37.8', '69.2', '22.1'],
 ['44.5', '39.3', '45.1', '10.4'],
 ['17.2', '45.9', '69.3', '9.3'],
 ['151.5', '41.3', '58.5', '18.5'],
 ['180.8', '10.8', '58.4', '12.9']]

## Labeled Points

MLlib requires each training example to be expressed as a `LabeledPoint`. A labeled point pairs the target value (the label) with a vector of predictors (the features).

In [14]:
from pyspark.mllib.regression import LabeledPoint

# The last column (sales) is the label; the first three columns are the features
regressionDataLabelPoint = regressionDataRDD.map(lambda data: LabeledPoint(data[3], data[0:3]))

regressionDataLabelPoint.take(5)

[LabeledPoint(22.1, [230.1,37.8,69.2]),
 LabeledPoint(10.4, [44.5,39.3,45.1]),
 LabeledPoint(9.3, [17.2,45.9,69.3]),
 LabeledPoint(18.5, [151.5,41.3,58.5]),
 LabeledPoint(12.9, [180.8,10.8,58.4])]
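
The rows in `regressionDataRDD` are lists of strings because the CSV was read with `inferSchema` set to `"false"`; the output above shows that `LabeledPoint` turns them into floats. The sketch below makes that cast explicit in plain Python, assuming the label sits in the last of four columns as it does in this dataset. `row_to_label_and_features` is a hypothetical helper, not part of PySpark.

```python
def row_to_label_and_features(row, label_index=3):
    """Split one row of string values into (label, features) as floats.

    Hypothetical helper for illustration; assumes every value parses as a float.
    """
    values = [float(v) for v in row]
    label = values[label_index]
    # Everything except the label column is treated as a feature
    features = values[:label_index] + values[label_index + 1:]
    return label, features


# First row of the Advertising data, as shown earlier
label, features = row_to_label_and_features(['230.1', '37.8', '69.2', '22.1'])
print(label, features)
```

The resulting pair could then be passed to `LabeledPoint(label, features)` inside the `map` call.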

## Split the Data into Train and Test Sets

In [16]:
regressionLabelPointSplit = regressionDataLabelPoint.randomSplit([0.7, 0.3])

In [17]:
regressionLabelPointTrainData = regressionLabelPointSplit[0]

In [18]:
type(regressionLabelPointTrainData)

pyspark.rdd.PipelinedRDD

## Optional: Converting a PySpark DataFrame to a Pandas DataFrame

In [20]:
regressionLabelPointTrainData.toDF().toPandas().head(5)

              features  label
0  [230.1, 37.8, 69.2]   22.1
1   [17.2, 45.9, 69.3]    9.3
2  [151.5, 41.3, 58.5]   18.5
3  [180.8, 10.8, 58.4]   12.9
4   [57.5, 32.8, 23.5]   11.8


In [27]:
regressionLabelPointTrainData.toDF().toPandas()['features'].values[:10]

array([DenseVector([230.1, 37.8, 69.2]), DenseVector([17.2, 45.9, 69.3]),
       DenseVector([151.5, 41.3, 58.5]), DenseVector([180.8, 10.8, 58.4]),
       DenseVector([57.5, 32.8, 23.5]), DenseVector([120.2, 19.6, 11.6]),
       DenseVector([8.6, 2.1, 1.0]), DenseVector([199.8, 2.6, 21.2]),
       DenseVector([66.1, 5.8, 24.2]), DenseVector([204.1, 32.9, 46.0])],
      dtype=object)

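##### An extra step is needed to get a NumPy array

The `features` column holds `DenseVector` objects (note `dtype=object` in the output above), not a numeric NumPy array. Each vector can be converted with its `toArray()` method.

- http://blog.madhukaraphatak.com/spark-vector-to-numpy/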

In [25]:
regressionLabelPointTestData = regressionLabelPointSplit[1]

In [23]:
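# Note: LinearRegressionWithSGD has been deprecated since Spark 2.0
# in favor of pyspark.ml.regression.LinearRegression.
from pyspark.mllib.regression import LinearRegressionWithSGD as lrSGD

# Fit a linear model with an intercept using stochastic gradient descent
ourModelWithLinearRegression = lrSGD.train(data=regressionLabelPointTrainData,
                                           iterations=100, step=0.0002, intercept=True)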

In [24]:
ourModelWithLinearRegression

(weights=[0.05732977236215489,0.10086159549519522,0.052482565077522574], intercept=1.0021531334458886)

In [28]:
ourModelWithLinearRegression.weights

DenseVector([0.0573, 0.1009, 0.0525])

In [29]:
ourModelWithLinearRegression.intercept

1.0021531334458886

## Checking the Model Performance with Metrics

In [58]:
regressionLabelPointTestData.take(2)

[LabeledPoint(10.4, [44.5,39.3,45.1]), LabeledPoint(7.2, [8.7,48.9,75.0])]

In [69]:
regressionLabelPointTestData.count()

54

In [70]:
regressionLabelPointTrainData.count()

146
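
The two counts (146 and 54) are close to a 70/30 split of the 200 rows, but not exact. This is expected: `randomSplit` assigns each element to a split at random, with probability proportional to the weights, so the split sizes vary from run to run unless a `seed` is passed. Below is a simplified, single-machine sketch of that idea in plain Python. `random_split` is a hypothetical stand-in for illustration, not Spark's implementation.

```python
import random


def random_split(items, weights, seed=None):
    """Assign each item to one split with probability proportional to its weight."""
    total = float(sum(weights))
    # Cumulative boundaries in [0, 1], e.g. [0.0, 0.7, 1.0] for weights [0.7, 0.3]
    bounds = [0.0]
    for w in weights:
        bounds.append(bounds[-1] + w / total)
    bounds[-1] = 1.0  # guard against floating-point drift

    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for item in items:
        x = rng.random()  # uniform draw in [0, 1)
        for i in range(len(weights)):
            if bounds[i] <= x < bounds[i + 1]:
                splits[i].append(item)
                break
    return splits


train, test = random_split(range(200), [0.7, 0.3], seed=42)
print(len(train), len(test))  # approximately, not exactly, 140 and 60
```

In PySpark, the same reproducibility comes from the `seed` argument, for example `randomSplit([0.7, 0.3], seed=42)`.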

In [43]:
regressionLabelPointTestData.toDF().show(2)

+----------------+-----+
|        features|label|
+----------------+-----+
|[44.5,39.3,45.1]| 10.4|
| [8.7,48.9,75.0]|  7.2|
+----------------+-----+
only showing top 2 rows



In [66]:
from pyspark.mllib.evaluation import RegressionMetrics

# RegressionMetrics expects an RDD of (prediction, observation) pairs
prediObserRDDin = regressionLabelPointTestData.map(
    lambda x: (float(ourModelWithLinearRegression.predict(x.features)), x.label))

In [68]:
prediObserRDDin.take(2)

[(9.88415239151922, 10.4), (10.369246553525876, 7.2)]
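
Each prediction of a linear model is the dot product of the weights and the feature vector, plus the intercept. As a check, this plain-Python sketch recomputes the prediction for the first test point from the weights and intercept printed earlier. `predict_linear` is a hypothetical helper, not a PySpark function.

```python
def predict_linear(weights, intercept, features):
    """Return intercept + sum(w_i * x_i) for a linear model."""
    return intercept + sum(w * x for w, x in zip(weights, features))


# Values copied from the model output and the first test point shown earlier
weights = [0.05732977236215489, 0.10086159549519522, 0.052482565077522574]
intercept = 1.0021531334458886
features = [44.5, 39.3, 45.1]

prediction = predict_linear(weights, intercept, features)
print(prediction)  # approximately 9.884, matching the first pair above
```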

In [54]:
metrics = RegressionMetrics(prediObserRDDin)
metrics

<pyspark.mllib.evaluation.RegressionMetrics at 0x7f31aca48880>

In [55]:
metrics.r2

0.7964421473548583

In [56]:
metrics.rootMeanSquaredError

2.295759591568639
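
For reference, both metrics can be computed directly from (prediction, observation) pairs. The sketch below uses the common definitions: RMSE is the square root of the mean squared error, and R² is one minus the ratio of the residual sum of squares to the total sum of squares. The helper names and the toy pairs are invented for illustration and are not taken from the dataset.

```python
import math


def rmse(pairs):
    """Root mean squared error over (prediction, observation) pairs."""
    return math.sqrt(sum((p - o) ** 2 for p, o in pairs) / len(pairs))


def r2(pairs):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    observations = [o for _, o in pairs]
    mean_obs = sum(observations) / len(observations)
    ss_res = sum((o - p) ** 2 for p, o in pairs)
    ss_tot = sum((o - mean_obs) ** 2 for o in observations)
    return 1.0 - ss_res / ss_tot


# Toy (prediction, observation) pairs, invented for illustration
toy_pairs = [(2.5, 3.0), (0.0, -0.5), (2.0, 2.0), (8.0, 7.0)]
print(rmse(toy_pairs), r2(toy_pairs))
```

An R² close to 1 means the predictions explain most of the variance in the observations, while RMSE is expressed in the same units as the label (here, sales).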

## Resources

- https://www.lucasallen.io/spark-dataframes-mllib-tutorial/