# Project 2

The goal of this assignment is for you to try out different ways of implementing and configuring a recommender, and to evaluate your different approaches.

For project 2, you’re asked to take some recommendation data (such as your toy movie dataset, MovieLens, or another dataset of your choosing) and implement at least two different recommendation algorithms on the data, for example content-based, user-user CF, and/or item-item CF. You should evaluate different approaches, using different algorithms, normalization techniques, similarity methods, neighborhood sizes, etc. You don’t need to be exhaustive; these are just some suggested possibilities. You may use whatever third-party libraries you want. Please provide at least one graph and a textual summary of your evaluation.

You may work in a small group.  Please submit a link to your GitHub repository for your Jupyter notebook or RMarkdown file.  Due end of day on Sunday June 26th.

**Requires the Jupyter-Scala language Kernel, available from: https://github.com/alexarchambault/jupyter-scala**

In [25]:
classpath.add("org.apache.spark" %% "spark-core" % "1.6.1",
              "org.apache.spark" %% "spark-mllib" % "1.6.1",
              "org.apache.spark" %% "spark-sql" % "1.6.1")

0 new artifact(s)




# Response

## The Recommender System

As I'm fairly new to Spark and to data manipulation in Scala, let's keep the problem simple. This is a system that recommends movies to users based on the dataset collected by the class survey.

As part of this exercise, I will build similarity-based models by hand and compare their results against the collaborative filtering library in Spark.

## The Code

### Firing up a Spark Context

In [26]:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql._
import org.apache.spark.sql.types.{StructType,StructField,StringType,IntegerType}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.{MatrixEntry, RowMatrix}

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql._
import org.apache.spark.sql.types.{StructType,StructField,StringType,IntegerType}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.{MatrixEntry, RowMatrix}

In [27]:


val conf = new SparkConf()
  .setAppName("week1-EstimatePi")
  .setMaster("local")
val sc = new SparkContext(conf)


conf: org.apache.spark.SparkConf = org.apache.spark.SparkConf@15593650
sc: org.apache.spark.SparkContext = org.apache.spark.SparkContext@4af0eef9

### Data Loading and Transformations

The objective here is to:

* Load the `MovieRatings.csv` file
* Transform it into a zero-filled matrix
* Transform it into a long-format data structure
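
The three steps above can be sketched without Spark on a couple of rows. This is only an illustration in plain Scala: the two data lines are rebuilt from the output shown in this notebook, and the names `lines`, `zeroFilled` and `longTriples` are my own, not part of the notebook.

```scala
// Header plus two rows shaped like MovieRatings.csv (rebuilt from the notebook output).
val lines = Seq(
  "Critic,CaptainAmerica,Deadpool,Frozen,JungleBook,PitchPerfect2,StarWarsForce",
  "Burton,,,,4,,4",
  "Charley,4,5,4,3,2,3"
)

// split(",", -1) keeps trailing empty fields, so the line does not need padding.
val rows = lines.map(_.split(",", -1).map(_.trim).toVector)
val header = rows.head
val body = rows.tail

// Zero-filled matrix: drop the critic name, turn blanks into 0.0.
val zeroFilled = body.map(_.tail.map(v => if (v.isEmpty) 0.0 else v.toDouble))

// Long format: one (critic, movie, rating) triple per rated cell.
val longTriples = for {
  row <- body
  (movie, value) <- header.tail.zip(row.tail)
  if value.nonEmpty
} yield (row.head, movie, value.toDouble)
```

Zipping `header.tail` with `row.tail` pairs each rating with the title of its own column, which avoids any index arithmetic.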


In [28]:
// Read the CSV file
val csv =
    sc
        .textFile("../MovieRatings.csv")
        .map(line =>
             line
                 .replaceAll(",$", ", ") // pad a trailing comma so that split keeps the last (empty) field
                 .split(",")
                 .map(t => t.trim)
            )
csv.collect


csv: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[2] at map at Main.scala:29
res27_1: Array[Array[String]] = Array(
  Array(
    "Critic",
    "CaptainAmerica",
    "Deadpool",
    "Frozen",
    "JungleBook",
    "PitchPerfect2",
    "StarWarsForce"
  ),
  Array("Burton", "", "", "", "4", "", "4"),
  Array("Charley", "4", "5", "4", "3", "2", "3"),
  Array("Dan", "", "5", "", "", "", "5"),
  Array("Dieudonne", "5", "4", "", "", "", "5"),
  Array("Matt", "4", "", "2", "", "2", "5"),
  Array("Mauricio",

#### Transforming into Zero-filled Matrix

In [29]:
//val movies = sc.parallelize(csv.first)
// header row: the cell "Critic" followed by the movie titles
val movies = csv.first
// first column: the header cell "Critic" followed by the critic names
val critics = csv.collect.map(_(0))

// let's also make a parallelized version of those
val moviesPar = sc.parallelize(movies).zipWithIndex
val criticsPar = sc.parallelize(critics).zipWithIndex

movies: Array[String] = Array(
  "Critic",
  "CaptainAmerica",
  "Deadpool",
  "Frozen",
  "JungleBook",
  "PitchPerfect2",
  "StarWarsForce"
)
critics: Array[String] = Array(
  "Critic",
  "Burton",
  "Charley",
  "Dan",
  "Dieudonne",
  "Matt",
  "Mauricio",
  "Max",
  "Nathan",
  "Param",
  "Parshu",
  "Prashanth",
  "Shipra",
  "Sreejaya",
  "Steve",
  "Vuthy",
  "Xingjia"
)
moviesPar: org.apache.spark.rdd.RDD[(String, Long)] = ZippedWithIndexRDD[4] at zipWithIndex at Main.scala:38
criticsPar: org.apache.spark.rdd.RDD[(String, Long)] = ZippedWithIndexRDD[6] at zipWithIndex at Main.scala:41

In [30]:
val zeroFilledMatrix =
    csv
        .collect
        .filterNot(r => r(0) == movies(0)) // filter out first row
        .map(r => r.drop(1)) // drop the critic name in the first column, keeping only the ratings
        .map(r => r.map(value => if (value == "") 0.00 else value.toDouble))

// and now parallelize it
val zeroFilledMatrixPar = sc.parallelize(zeroFilledMatrix)

// now, let's convert it into a linalg matrix so we can perform linear algebra operations on it
val criticMoviesMatrix = new RowMatrix(zeroFilledMatrixPar.map(line => Vectors.dense(line)))


// also, let's have a transposed version of it:
val dataTransposed = sc.parallelize(zeroFilledMatrix.toSeq.transpose)

val moviesCriticMatrix = new RowMatrix(dataTransposed.map(line => Vectors.dense(line.toArray)))


zeroFilledMatrix: Array[Array[Double]] = Array(
  Array(0.0, 0.0, 0.0, 4.0, 0.0, 4.0),
  Array(4.0, 5.0, 4.0, 3.0, 2.0, 3.0),
  Array(0.0, 5.0, 0.0, 0.0, 0.0, 5.0),
  Array(5.0, 4.0, 0.0, 0.0, 0.0, 5.0),
  Array(4.0, 0.0, 2.0, 0.0, 2.0, 5.0),
  Array(4.0, 0.0, 3.0, 3.0, 4.0, 0.0),
  Array(4.0, 4.0, 4.0, 2.0, 2.0, 4.0),
  Array(0.0, 0.0, 0.0, 0.0, 0.0, 4.0),
  Array(4.0, 4.0, 1.0, 0.0, 0.0, 5.0),
  Array

#### Transforming into a Long-format data structure

For practical purposes, we'll use an index-based long format, meaning that the critic and movie names are replaced by integer indices.
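
One thing to watch with an index-based format is which position counts as index 0. In this notebook `movies` and `critics` both keep the header cell `Critic` at index 0, so any index arithmetic has to account for it. Below is a minimal sketch of a pivot that builds the index tables without the header cell. The miniature data set and the names `movieNames`, `criticNames` and `longIndexed` are hypothetical, for illustration only.

```scala
// Hypothetical miniature of the survey; the values are illustrative only.
val header = Vector("Critic", "CaptainAmerica", "Deadpool", "Frozen")
val body = Vector(
  Vector("Burton", "", "", "4"),
  Vector("Charley", "4", "5", "")
)

// Index tables built without the header cell, so index 0 is the first real film / critic.
val movieNames = header.tail
val criticNames = body.map(_.head)

// (criticIndex, movieIndex, rating) for every rated cell.
val longIndexed = for {
  (row, criticIdx) <- body.zipWithIndex
  (value, movieIdx) <- row.tail.zipWithIndex
  if value.nonEmpty
} yield (criticIdx, movieIdx, value.toDouble)

// Map the indices back to names to check the alignment.
val named = longIndexed.map { case (c, m, r) => (criticNames(c), movieNames(m), r) }
```

Mapping the indices back to names right after the pivot is a cheap way to confirm that every rating still points at the column it came from.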

In [31]:
val longFormat =
    csv
        .collect
        .filterNot(r => r(0) == movies(0)) // filter out first row
        .flatMap(r => (1 to movies.length-1).map(i => (r(0), movies(i-1), r(i)))) // pivot each column so that we have: (user, movie, rating); column i is labelled movies(i-1)
        .filter(r => r._3 != "") // filter out those unrated movies
        .map(r => (critics.indexOf(r._1), movies.indexOf(r._2), r._3.toDouble)) // replace the names by their indices and convert the rating to a double

val ratingsLong = sc.parallelize(longFormat)

longFormat: Array[(Int, Int, Double)] = Array(
  (1, 3, 4.0),
  (1, 5, 4.0),
  (2, 0, 4.0),
  (2, 1, 5.0),
  (2, 2, 4.0),
  (2, 3, 3.0),
  (2, 4, 2.0),
  (2, 5, 3.0),
  (3, 1, 5.0),
  (3, 5, 5.0),
  (4, 0, 5.0),
  (4, 1, 4.0),
  (4, 5, 5.0),
  (5, 0, 4.0),
  (5, 2, 2.0),
  (5, 4, 2.0),
  (5, 5, 5.0),
  (6, 0, 4.0),
  (

## Model Building - Cosine Similarity Based

Let's now build the following models using the cosine similarity between rating vectors:

* Movie-Movie collaborative filtering
* User-User collaborative filtering
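
For reference, the cosine similarity of two rating vectors is their dot product divided by the product of their Euclidean norms. Here is a plain-Scala sketch of that formula, checked on two rows copied from the zero-filled matrix output above. The function name `cosine` is my own. Note that with zero-filling an unrated movie counts as a rating of 0, which is an assumption of this approach rather than a property of the data.

```scala
// Cosine similarity of two equally long vectors; returns 0.0 if either vector has zero norm.
def cosine(a: Seq[Double], b: Seq[Double]): Double = {
  require(a.length == b.length, "vectors must have the same length")
  val dot = a.zip(b).map { case (x, y) => x * y }.sum
  val normA = math.sqrt(a.map(x => x * x).sum)
  val normB = math.sqrt(b.map(x => x * x).sum)
  if (normA == 0.0 || normB == 0.0) 0.0 else dot / (normA * normB)
}

// Two rows taken from the zero-filled matrix output.
val rowA = Seq(5.0, 4.0, 0.0, 0.0, 0.0, 5.0)
val rowB = Seq(4.0, 4.0, 1.0, 0.0, 0.0, 5.0)
val sim = cosine(rowA, rowB)
```

Worked by hand, the dot product is 61 and the norms are the square roots of 66 and 58, so the similarity is about 0.986.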


### Movie-Movie  Collaborative Filtering

In [32]:
// columnSimilarities() returns the upper-triangular cosine similarities between the matrix columns
val moviesMoviesCosineDistance = criticMoviesMatrix.columnSimilarities()

val moviesMoviesSimilarities = moviesMoviesCosineDistance
  .entries
  .map {
    case MatrixEntry(i, j, u) => (i, j, u) }
  .collect
  .map(r => Seq(movies(r._1.toInt), movies(r._2.toInt), r._3.toDouble)) // look up a name for each column index
  .sortBy(-_(2).asInstanceOf[Double]) // most similar pairs first

moviesMoviesCosineDistance: org.apache.spark.mllib.linalg.distributed.CoordinateMatrix = org.apache.spark.mllib.linalg.distributed.CoordinateMatrix@11921c07
moviesMoviesSimilarities: Array[Seq[Any]] = Array(
  List(Deadpool, Frozen, 0.9164305818234458),
  List(Critic, CaptainAmerica, 0.8011927448021527),
  List(Critic, PitchPerfect2, 0.7649201061631455),
  List(Critic, JungleBook, 0.7437115739277018),
  List(Critic, Deadpool, 0.7406840835263832),
  List(CaptainAmerica, PitchPerfect2, 0.7299810272285707),
  List(Deadpool, JungleBook, 0.7191555984642707),
  List(CaptainAmerica, Deadpool, 0.6802170238514358),
  List(Critic, Frozen, 0.5992177600023244),
  List(Frozen, JungleBook, 0.5913486717104743),
  List(CaptainAmerica, JungleBook, 0.579267135696289),
  List(Deadpool, PitchPerfect2, 0.5773720984150789),
  List(CaptainAmerica, Frozen, 0.568979447

### User-User Collaborative Filtering Model

In [33]:
// the transposed matrix has one column per critic, so its column similarities are user-user similarities
val userUserCosineDistance = moviesCriticMatrix.columnSimilarities()

val userUserSimilarities = userUserCosineDistance
  .entries
  .map {
    case MatrixEntry(i, j, u) => (i, j, u) }
  .collect
  .map(r => Seq(critics(r._1.toInt), critics(r._2.toInt), r._3.toDouble)) // look up a name for each column index
  .sortBy(-_(2).asInstanceOf[Double]) // most similar pairs first

userUserCosineDistance: org.apache.spark.mllib.linalg.distributed.CoordinateMatrix = org.apache.spark.mllib.linalg.distributed.CoordinateMatrix@62e53689
userUserSimilarities: Array[Seq[Any]] = Array(
  List(Dan, Nathan, 0.9859249803487347),
  List(Mauricio, Shipra, 0.9847319278346619),
  List(Burton, Mauricio, 0.9811873171500672),
  List(Burton, Shipra, 0.9792633226865932),
  List(Burton, Parshu, 0.9610484599102903),
  List(Param, Parshu, 0.9600666937386864),
  List(Param, Shipra, 0.955672134494952),
  List(Burton, Param, 0.9474847084398104),
  List(Mauricio, Parshu, 0.9410294354946785),
  List(Mauricio, Param, 0.9296599791147713),
  List(Parshu, Shipra, 0.9293555142631518),
  List(Burton, Steve, 0.927771250724491),
  List(Dieudonne, Sreejaya, 0.9091372900969895),
  List(Prashanth, Vuthy, 0.8999999999999999),
  List(Shipra, St

### Querying the Models
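
A caveat before reading the query results: `columnSimilarities()` reports 0-based column positions of the matrix it was called on, and those matrices contain neither the header row nor the critic-name column. The `movies` and `critics` arrays, however, still start with the header cell `Critic`, so `movies(i)` names the entry one position before column `i`. That appears to be why `Critic` shows up as a movie title in the movie-movie output. Below is a minimal sketch of an aligned lookup, assuming the `movies` array shown earlier; `labelPairs` is a hypothetical helper, not part of the notebook.

```scala
// Name array as in the notebook: index 0 is the header cell "Critic".
val movies = Vector("Critic", "CaptainAmerica", "Deadpool", "Frozen",
                    "JungleBook", "PitchPerfect2", "StarWarsForce")

// Matrix columns exclude the header, so column i corresponds to names(i + 1).
def labelPairs(entries: Seq[(Long, Long, Double)],
               names: Vector[String]): Seq[(String, String, Double)] = {
  val labels = names.tail // drop the header cell once, then index directly
  entries.map { case (i, j, sim) => (labels(i.toInt), labels(j.toInt), sim) }
}

// Column pair (2, 3) with the similarity reported at the top of the movie-movie output.
val labelled = labelPairs(Seq((2L, 3L, 0.9164305818234458)), movies)
```

With this lookup the top movie pair would read Frozen / JungleBook. The same shift applies to the critic names in the user-user output, so the names in the results below should be read with that in mind.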

* User: Who should Mauricio go out to the movies with?

In [34]:
val user = "Mauricio"
userUserSimilarities
    .filter(r => r(0) == user || r(1) == user)
    .map(r => (if (r(1) == user) r(0) else r(1), r(2).asInstanceOf[Double]))
    .filter(_._2 > 0.75)
    .sortBy(-_._2)

user: String = "Mauricio"
res33_1: Array[(Any, Double)] = Array(
  (Shipra, 0.9847319278346619),
  (Burton, 0.9811873171500672),
  (Parshu, 0.9410294354946785),
  (Param, 0.9296599791147713),
  (Nathan, 0.866578244826242),
  (Steve, 0.8574929257125442),
  (Dan, 0.8123623944599234),
  (Dieudonne, 0.8081220356417687)
)

* If I liked Frozen, what other movies should I watch?

In [35]:
val movie = "Frozen"
moviesMoviesSimilarities
    .filter(r => r(0) == movie || r(1) == movie)
    .map(r => (if (r(1) == movie) r(0) else r(1), r(2).asInstanceOf[Double]))
    .filter(_._2 > 0.75)
    .sortBy(-_._2)

movie: String = "Frozen"
res34_1: Array[(Any, Double)] = Array((Deadpool, 0.9164305818234458))

## Using Collaborative Filtering ALS Model

Based on: http://spark.apache.org/docs/latest/mllib-collaborative-filtering.html

In [36]:
import org.apache.spark.mllib.recommendation.ALS
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel
import org.apache.spark.mllib.recommendation.Rating

val ratingsALS = ratingsLong.map(r=>Rating(r._1, r._2, r._3))

// Build the recommendation model using ALS
val rank = 10
val numIterations = 10
val model = ALS.train(ratingsALS, rank, numIterations, 0.01)

// Evaluate the model on the same ratings it was trained on (training error)
val criticsMovies = ratingsALS.map { case Rating(critic, movie, rate) =>
  (critic, movie)
}
val predictions =
  model.predict(criticsMovies).map { case Rating(critic, movie, rate) =>
    ((critic, movie), rate)
  }
val ratesAndPredictions = ratingsALS.map { case Rating(critic, movie, rate) =>
  ((critic, movie), rate)
}.join(predictions)
val MSE = ratesAndPredictions.map { case ((critic, movie), (r1, r2)) =>
  val err = (r1 - r2)
  err * err
}.mean()

println("Mean Squared Error = " + MSE)

Mean Squared Error = 1.7300149322469586E-4


import org.apache.spark.mllib.recommendation.ALS
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel
import org.apache.spark.mllib.recommendation.Rating
ratingsALS: org.apache.spark.rdd.RDD[org.apache.spark.mllib.recommendation.Rating] = MapPartitionsRDD[22] at map at Main.scala:42
rank: Int = 10
numIterations: Int = 10
model: org.apache.spark.mllib.recommendation.MatrixFactorizationModel = org.apache.spark.mllib.recommendation.MatrixFactorizationModel@60b785dd
criticsMovies: org.apache.spark.rdd.RDD[(Int, Int)] = MapPartitionsRDD[230] at map at Main.scala:54
predictions: org.apache.spark.rdd.RDD[((Int, Int), Double)] = MapPartitionsRDD[240] at map at Main.scala:59
ratesAndPredictions: org.apache.spark.rdd.RDD[((Int, Int), (Double, Double))] = MapPartitionsRDD[244] at join at Main.scala:66
MSE: Double = 1.7300149322469586E-4

In [37]:
model.recommendProducts(critics.indexOf(user),5).map(r => (critics(r.user), movies(r.product), r.rating))
model.recommendUsersForProducts(3).collect.flatMap(m=>m._2.map(r=>(movies(m._1),critics(r.user),r.rating)))
model.recommendProductsForUsers(3).collect.flatMap(m=>m._2.map(r=>(movies(r.product),critics(m._1),r.rating)))


res36_0: Array[(String, String, Double)] = Array(
  ("Mauricio", "CaptainAmerica", 4.55839706531505),
  ("Mauricio", "Critic", 4.000934381708938),
  ("Mauricio", "JungleBook", 3.9826942547896538),
  ("Mauricio", "PitchPerfect2", 3.896175989914277),
  ("Mauricio", "Deadpool", 3.0055283089099647)
)
res36_1: Array[(String, String, Double)] = Array(
  ("JungleBook", "Mauricio", 3.9826942547896538),
  ("JungleBook", "Sreejaya", 3.981912103579872),
  ("JungleBook", "Vuthy", 2.9946114143957767),
  ("Critic", "Sreejaya", 5.015867105191404),
  ("Critic", "Dieudonne", 4.985885047261088),
  ("C

In [38]:
//sc.stop



In [50]:
val moviesCriticMatrix = new RowMatrix(sc.parallelize(Array(Array(1.00,4.0),Array(1.0,4.0))).map(line => Vectors.dense(line)))

moviesCriticMatrix: org.apache.spark.mllib.linalg.distributed.RowMatrix = org.apache.spark.mllib.linalg.distributed.RowMatrix@41baebb1

In [51]:
moviesCriticMatrix.columnSimilarities().entries.collect

res50: Array[org.apache.spark.mllib.linalg.distributed.MatrixEntry] = Array(MatrixEntry(0L, 1L, 0.9999999999999998))