## Spark Cluster Overview

Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext object in your main program (called the driver program).

![Spark Cluster Overview](http://spark.apache.org/docs/2.0.0/img/cluster-overview.png)

In [1]:
import socket

This code runs on the 'driver' node

In [2]:
print( "Hello World from " + socket.gethostname() )

Hello World from yp-spark-dal09-env5-0033


Create some data and distribute it across the cluster's 'executor' nodes

In [3]:
rdd = spark.sparkContext.parallelize( range(0, 100) )

Run a function on the 'executor' nodes and return the values to the 'driver' node

In [4]:
# collect() returns a plain Python list to the driver, so `rdd` is a list from here on
rdd = rdd.map( lambda x: "Hello World from " + socket.gethostname() ).collect()

Print all the values

In [5]:
print( rdd )

['Hello World from yp-spark-dal09-env5-0036', 'Hello World from yp-spark-dal09-env5-0036', 'Hello World from yp-spark-dal09-env5-0036', ... (output truncated)

Print out the unique values

In [7]:
print( set(rdd) )

set(['Hello World from yp-spark-dal09-env5-0036', 'Hello World from yp-spark-dal09-env5-0049'])
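
The collected values can also be tallied per host with standard-library tools. The following is a minimal sketch using `collections.Counter`; the `results` list is made-up sample data standing in for the collected values, and the host names and counts are illustrative only.

```python
from collections import Counter

# Hypothetical stand-in for the list returned by collect();
# the host names and the 60/40 split are illustrative, not real output.
results = (["Hello World from host-a"] * 60 +
           ["Hello World from host-b"] * 40)

# Tally how many elements each executor host produced.
counts = Counter(results)

for message, n in counts.most_common():
    print(n, message)
```

A strongly uneven tally would suggest that the data was not spread evenly across the executors.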


## Alternating Least Squares (ALS) Hello World

Download the data

In [9]:
! rm -f ratings.dat
! wget https://raw.githubusercontent.com/snowch/movie-recommender-demo/master/web_app/data/ratings.dat

--2017-04-24 05:41:45--  https://raw.githubusercontent.com/snowch/movie-recommender-demo/master/web_app/data/ratings.dat
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.48.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.48.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 8910990 (8.5M) [text/plain]
Saving to: ‘ratings.dat’


2017-04-24 05:41:46 (54.5 MB/s) - ‘ratings.dat’ saved [8910990/8910990]



Inspect the data

In [10]:
! head -3 ratings.dat
! echo
! tail -3 ratings.dat

1::832::2::N/A
1::1781::1::N/A
1::1124::1::N/A

5999::1608::5::N/A
5999::982::4::N/A
5999::1350::3::N/A


Load the data into Spark

In [11]:
from pyspark.mllib.recommendation import Rating

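# parse each 'user::product::rating::extra' line into a Rating, ignoring the fourth field
ratingsRDD = spark.sparkContext.textFile('ratings.dat') \
               .map(lambda l: l.split("::")) \
               .map(lambda p: Rating(
                                  user = int(p[0]),
                                  product = int(p[1]),
                                  rating = float(p[2]),
                                  )).cache()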
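
The parsing done inside the two `map` calls above can be tried locally without Spark. This is a small sketch in plain Python; `parse_rating_line` is a hypothetical helper name introduced only for illustration.

```python
def parse_rating_line(line):
    """Split a 'user::product::rating::extra' line into typed fields."""
    fields = line.strip().split("::")
    # The fourth field ('N/A' in the sample data) is ignored, as in the Spark code.
    return int(fields[0]), int(fields[1]), float(fields[2])

print(parse_rating_line("1::832::2::N/A"))  # prints (1, 832, 2.0)
```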

In [15]:
from pyspark.mllib.recommendation import ALS

# set some values for the parameters
# these should be ascertained via experimentation
rank = 5
numIterations = 20
lambdaParam = 0.1

# ALS.train accepts the RDD of Rating objects directly, so no DataFrame conversion is needed
model = ALS.train(ratingsRDD, rank, numIterations, lambdaParam)
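
The comments in the cell above say the parameter values should be found by experiment. One common way to compare candidate settings is the root-mean-square error (RMSE) between actual and predicted ratings on held-out data. The sketch below shows only the metric, in plain Python; the `rmse` helper and the sample numbers are illustrative assumptions, not part of the original notebook.

```python
import math

def rmse(pairs):
    """Root-mean-square error over (actual, predicted) rating pairs."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one (actual, predicted) pair")
    squared_error = sum((actual - predicted) ** 2 for actual, predicted in pairs)
    return math.sqrt(squared_error / len(pairs))

# Made-up held-out ratings paired with made-up predictions.
sample = [(5.0, 4.0), (3.0, 3.0), (1.0, 2.0)]
print(rmse(sample))
```

A lower RMSE on held-out ratings suggests a better combination of `rank`, `numIterations`, and `lambdaParam`.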

Predict how **user=1** would rate **product=1**

In [14]:
model.predict(user=1, product=1)

1.4271848394995756

Predict the **top 1** recommendation for **all** users.

In [17]:
model.recommendProductsForUsers(1).toDF().collect()

[Row(_1=1084, _2=Row(_1=Row(user=1084, product=1830, rating=1.7749724456713025))),
 Row(_1=3456, _2=Row(_1=Row(user=3456, product=334, rating=1.6720514484427365))),
 Row(_1=772, _2=Row(_1=Row(user=772, product=1891, rating=4.193756669050148))),
 Row(_1=3764, _2=Row(_1=Row(user=3764, product=2280, rating=4.097255538860141))),
 Row(_1=3272, _2=Row(_1=Row(user=3272, product=891, rating=4.14989990244117))),
 Row(_1=752, _2=Row(_1=Row(user=752, product=891, rating=4.304143637696159))),
 Row(_1=4352, _2=Row(_1=Row(user=4352, product=2249, rating=4.035544597239712))),
 Row(_1=1724, _2=Row(_1=Row(user=1724, product=670, rating=4.299077560409296))),
 Row(_1=428, _2=Row(_1=Row(user=428, product=2249, rating=4.194831649946838))),
 Row(_1=1900, _2=Row(_1=Row(user=1900, product=2280, rating=4.293863012999837))),
 Row(_1=1328, _2=Row(_1=Row(user=1328, product=2539, rating=4.117531728996765))),
 Row(_1=464, _2=Row(_1=Row(user=464, product=550, rating=4.003281616169205))),
 Row(_1=1040, _2=Row(_1=Row(

In the movie recommender web application, the predicted ratings are stored in a Cloudant datastore for easy retrieval.
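
As a rough sketch, the nested recommendation output above could be flattened into JSON-style documents before being written to a document store such as Cloudant. The `Rating` namedtuple here is a local stand-in for the pyspark class, the `to_documents` helper and the document field names are hypothetical, and the Cloudant write itself is omitted because the web application's storage code is not shown here.

```python
import json
from collections import namedtuple

# Local stand-in with the same fields as pyspark.mllib.recommendation.Rating.
Rating = namedtuple("Rating", ["user", "product", "rating"])

def to_documents(recommendations):
    """Flatten (user, [Rating, ...]) pairs into one dict per recommendation."""
    docs = []
    for user, ratings in recommendations:
        for r in ratings:
            # Field names are illustrative, not the web app's real schema.
            docs.append({"user": user, "product": r.product, "rating": r.rating})
    return docs

# Sample shaped like the collected output above, with rounded values.
sample = [(1084, (Rating(1084, 1830, 1.77),)),
          (772, (Rating(772, 1891, 4.19),))]

for doc in to_documents(sample):
    print(json.dumps(doc, sort_keys=True))
```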