# Project 3

The goal of this assignment is give you practice working with Singular Value Decomposition.

Your task is implement a matrix factorization method—such as singular value decomposition (SVD) or Alternating Least Squares (ALS)—in the context of a recommender system.

You may approach this in a large number of ways.  You are welcome to start with an existing recommender system written by yourself or someone else (always citing your sources, so that you can be graded on what you added, not what you found).

Here is one example.  Suppose you start with (or create) a collaborative filtering system against (a subset of) the MovieLens database or our toy dataset.  You could create a content-based system, where you populate your item profiles by pulling text information for specific movies from a source like imdb, applying text processing techniques (like TF-IDF), then using SVD and topic modeling to create a set of features derived from the text.

An extra intermediate step could be to take text that was pre-classified, e.g. “fighting” or “singing” and build out two “explainable” features.  SVD builds features that may or may not map neatly to movie genres or news topics.

**Requires the Jupyter-Scala language Kernel, available from: https://github.com/alexarchambault/jupyter-scala**

In [1]:
classpath.add( "org.apache.spark" %% "spark-core" % "1.6.1",
             "org.apache.spark" %% "spark-mllib" % "1.6.1",
              "org.apache.spark" %% "spark-sql" % "1.6.1",
             "co.theasi" % "plotly_2.10" % "0.1")

161 new artifact(s)


161 new artifacts in macro
161 new artifacts in runtime
161 new artifacts in compile




# Response

## The Recommender System

This week I'll try loading a more complicated dataset: Plain text. For the purpose of this exercise, I'll use t

## The Code

### Firing up a Spark Context

In [2]:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql._
import org.apache.spark.sql.types._
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.{MatrixEntry, RowMatrix}

[32mimport [36morg.apache.spark.{SparkConf, SparkContext}[0m
[32mimport [36morg.apache.spark.sql._[0m
[32mimport [36morg.apache.spark.sql.types._[0m
[32mimport [36morg.apache.spark.mllib.linalg.Vectors[0m
[32mimport [36morg.apache.spark.mllib.linalg.distributed.{MatrixEntry, RowMatrix}[0m

In [3]:

    val conf = new SparkConf()
      .setAppName("week1-EstimatePi")
      .setMaster("local") 

    val sc = new SparkContext(conf)


Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/06/30 18:52:40 INFO SparkContext: Running Spark version 1.6.1
16/06/30 18:52:40 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/06/30 18:52:41 INFO SecurityManager: Changing view acls to: malarconba001
16/06/30 18:52:41 INFO SecurityManager: Changing modify acls to: malarconba001
16/06/30 18:52:41 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(malarconba001); users with modify permissions: Set(malarconba001)
16/06/30 18:52:43 INFO Utils: Successfully started service 'sparkDriver' on port 15597.
16/06/30 18:52:43 INFO Slf4jLogger: Slf4jLogger started
16/06/30 18:52:43 INFO Remoting: Starting remoting
16/06/30 18:52:44 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriverActorSystem@192.168.1.15:15610]
16/06/30 18:52:44 INFO Utils: Su

[36mconf[0m: org.apache.spark.SparkConf = org.apache.spark.SparkConf@455e7809
[36msc[0m: org.apache.spark.SparkContext = org.apache.spark.SparkContext@18e637dd

### Data Loading and Transformations

The objective here is to:

* Load the http://mc6help.tripod.com/RecipeLibrary/AllAppetizerRecipes.txt file
* Transform into Zero filled matrix
* Transform into Long-format data structure


In [4]:

val csv = 
    sc
        .textFile("AllAppetizerRecipes.txt")
        .map(t => t.trim)
        .map(t => (t,t=="* Exported from MasterCook *") ) // add boolean if we have a record delimiter
        .zipWithIndex // add record id
        .map(r=>(r._1._1,r._1._2,r._2)) // flatten the nested index
csv.take(20)



[36mcsv[0m: org.apache.spark.rdd.RDD[(String, Boolean, Long)] = MapPartitionsRDD[5] at map at Main.scala:30
[36mres3_1[0m: Array[(String, Boolean, Long)] = [33mArray[0m(
  [33m[0m([32m"* Exported from MasterCook *"[0m, [32mtrue[0m, [32m0L[0m),
  [33m[0m([32m""[0m, [32mfalse[0m, [32m1L[0m),
  [33m[0m([32m"Barbecue Pecans"[0m, [32mfalse[0m, [32m2L[0m),
  [33m[0m([32m""[0m, [32mfalse[0m, [32m3L[0m),
  [33m[0m([32m"Recipe By     : Possum Kingdom Lake Cookbook"[0m, [32mfalse[0m, [32m4L[0m),
  [33m[0m([32m"Serving Size  : 25    Preparation Time : 0:00"[0m, [32mfalse[0m, [32m5L[0m),
  [33m[0m([32m"Categories    :"[0m, [32mfalse[0m, [32m6L[0m),
  [33m[0m([32m"Amount  Measure       Ingredient -- Preparation Method"[0m, [32mfalse[0m, [32m7L[0m),
  [33m[0m([32m"--------  ------------  --------------------------------"[0m, [32mfalse[0m, [32m8L[0m),
  [33m[0m([32m"2   tablespoons  butter"[0m, [32mfalse[0m, [32m9L[

From the above, sample, we can see that the text file gets imported a line at a time per record. However, the recipes file is formatted as follows:

```
* Exported from MasterCook *

                     Barbecue Pecans

Recipe By     : Possum Kingdom Lake Cookbook
Serving Size  : 25    Preparation Time : 0:00
Categories    : 
  Amount  Measure       Ingredient -- Preparation Method
--------  ------------  --------------------------------
       2   tablespoons  butter
     1/4           cup  Worcestershire sauce
       1    tablespoon  catsup
       6        dashes  Hot sauce
       4          cups  Pecans -- halves
                        salt -- to taste

Melt butter in a large saucepan; add Worcestershire sauce, , catsup, and hot sauce.  

Stir in nuts; spoon into a glass baking dish, spreading evenly.  toast at 400 degrees about 20 minutes, stirring frequently.  

Turn out on absorbent towels, and sprinkle with salt.

                                    - - - - - - - - - - - - - - - - - - - 
```

The goal here is to map the file into: RecipeIngredients(RecipeName, RecipeText) 

Where:

* Record Break: is the line: ```* Exported from MasterCook *```
* An entire record is composed of the lines between record breaks
* RecipeName: The third line in the record
* RecipeText: The concatenated lines of the recipe record

In [5]:
// The goal with this section is to end up with a recipeIndexTable collection that looks like: recipeIndexTable(RecordId,RecipeID)

// Generate a recordIndex table (RecipeId, LineId) by relying in the presence and offset of the record break
val recordIndexes =  csv.filter(_._2).map(_._3).zipWithIndex.map(r=>(r._2,r._1)) ++ sc.parallelize(Seq((csv.filter(_._2).count,csv.count)))

// Now, let's just create an offset recordIndexes Table 
val recordIndexesOffset = recordIndexes.map(r=>(r._1-1,r._2-1))

// and join it with the record indexes so we have the recipeIndexTable(RecordId,RecipeID)
val recipeIndexTable = recordIndexes.join(recordIndexesOffset).flatMap(r=> (r._2._1 to r._2._2).map(t=>(t,r._1)))
recipeIndexTable.collect.sorted

[36mrecordIndexes[0m: org.apache.spark.rdd.RDD[(Long, Long)] = UnionRDD[12] at $plus$plus at Main.scala:27
[36mrecordIndexesOffset[0m: org.apache.spark.rdd.RDD[(Long, Long)] = MapPartitionsRDD[13] at map at Main.scala:30
[36mrecipeIndexTable[0m: org.apache.spark.rdd.RDD[(Long, Long)] = MapPartitionsRDD[17] at flatMap at Main.scala:33
[36mres4_3[0m: Array[(Long, Long)] = [33mArray[0m(
  [33m[0m([32m0L[0m, [32m0L[0m),
  [33m[0m([32m1L[0m, [32m0L[0m),
  [33m[0m([32m2L[0m, [32m0L[0m),
  [33m[0m([32m3L[0m, [32m0L[0m),
  [33m[0m([32m4L[0m, [32m0L[0m),
  [33m[0m([32m5L[0m, [32m0L[0m),
  [33m[0m([32m6L[0m, [32m0L[0m),
  [33m[0m([32m7L[0m, [32m0L[0m),
  [33m[0m([32m8L[0m, [32m0L[0m),
  [33m[0m([32m9L[0m, [32m0L[0m),
  [33m[0m([32m10L[0m, [32m0L[0m),
  [33m[0m([32m11L[0m, [32m0L[0m),
  [33m[0m([32m12L[0m, [32m0L[0m),
  [33m[0m([32m13L[0m, [32m0L[0m),
  [33m[0m([32m14L[0m, [32m0L[0m),
  [33m[0m(

Let's now join it to the imported data so we add the recipeID

In [6]:


val csvIndexed = csv.map(r=>(r._3,(r._1,r._2))).join(recipeIndexTable)
csvIndexed.take(4)

[36mcsvIndexed[0m: org.apache.spark.rdd.RDD[(Long, ((String, Boolean), Long))] = MapPartitionsRDD[21] at join at Main.scala:27
[36mres5_1[0m: Array[(Long, ((String, Boolean), Long))] = [33mArray[0m(
  [33m[0m(
    [32m3558L[0m,
    [33m[0m(
      [33m[0m(
        [32m"Blend well the cream cheese with the Brie cheese. Add the hazelnuts and apple; blend. Spread on melba toast or crackers."[0m,
        [32mfalse[0m
      ),
      [32m92L[0m
    )
  ),
  [33m[0m([32m1084L[0m, [33m[0m([33m[0m([32m"- - - - - - - - - - - - - - - - - - -"[0m, [32mfalse[0m), [32m28L[0m)),
  [33m[0m([32m3586L[0m, [33m[0m([33m[0m([32m"2            tb  Olive oil"[0m, [32mfalse[0m), [32m93L[0m)),
  [33m[0m([32m1410L[0m, [33m[0m([33m[0m([32m""[0m, [32mfalse[0m), [32m37L[0m))
)

Let's now combine the recipe lines separated by a pipe (|) 

In [7]:
import scala.util.matching.Regex


val recipesText = csvIndexed
    .map(r=>(r._2._2,r._1,r._2._1._1))// let's flatten the nested list so we have RecipeId, Recipe Line Id and Recipe Line
    .sortBy(r=>(r._1,r._2)) // Properly sort it so we have all lines consecutively arranged as per the recipe line id
    .map(r=>(r._1,r._3)) // retain only the recipeid and recipe lines
    .groupBy(_._1) // and group it by the RecipeId
    .map( 
        g => (
                g._1,    // return the RecipeId
                g._2.map(_._2.trim).mkString("\n")
        )  // and concatenate the nested array of recipe lines with a pipe
        )
    .map { g=>
        val pattern = new Regex("""(?s)\* Exported from MasterCook \*\n\n([^\n]+)\n\n""", "RecipeName")
        (g._1,
        pattern.findFirstMatchIn(g._2).get.group("RecipeName"),
        g._2.replaceAll("""(?is)^.+---------\n""","").trim
        )
         }

recipesText.take(4)

[32mimport [36mscala.util.matching.Regex[0m
[36mrecipesText[0m: org.apache.spark.rdd.RDD[(Long, String, String)] = MapPartitionsRDD[32] at map at Main.scala:36
[36mres6_2[0m: Array[(Long, String, String)] = [33mArray[0m(
  [33m[0m(
    [32m34L[0m,
    [32m"Cheese-Olive Balls"[0m,
    [32m"""
1/4      teaspoon  hot pepper sauce    
1      teaspoon  paprika    
1/2      teaspoon  salt    
2          cups  sharp cheddar cheese -- grated    
1/2           cup  butter    
1           cup  flour -- sifted    
olives    
    
Mix ingredientsexcept olives  like a pie crust.  Wrap each olive with mixture.  spread the little balls on a pan and freeze.  Bake at 425 degrees for 12 minutes. Can be keep frozen in a bag. Serve hot.    
    
    
    
    
[33m...[0m

And finally, let's extract our data in a long format: RecipeIngredients(RecipeName,RecipeNameHash, RecipeText) 

In [8]:
var recipeIngredients = recipesText
    .map{
        text=>
            (text._2, text._3) // amd return (RecipeName, RecipeText)
    }

recipeIngredients.collect

[36mrecipeIngredients[0m: org.apache.spark.rdd.RDD[(String, String)] = MapPartitionsRDD[33] at map at Main.scala:26
[36mres7_1[0m: Array[(String, String)] = [33mArray[0m(
  [33m[0m(
    [32m"Cheese-Olive Balls"[0m,
    [32m"""
1/4      teaspoon  hot pepper sauce    
1      teaspoon  paprika    
1/2      teaspoon  salt    
2          cups  sharp cheddar cheese -- grated    
1/2           cup  butter    
1           cup  flour -- sifted    
olives    
    
Mix ingredientsexcept olives  like a pie crust.  Wrap each olive with mixture.  spread the little balls on a pan and freeze.  Bake at 425 degrees for 12 minutes. Can be keep frozen in a bag. Serve hot.    
    
    
    
    
    
[33m...[0m

# Using Spark-ML Transformations Library

In contrast to Sparl's MLLlib library, ML is much simpler and offers a simplified interface. http://spark.apache.org/docs/latest/ml-guide.html (Lot's of code has been borrowed from this site)

The goal is to:

* Create a Dataframe
* Pre-process the text: Remove unwanted chars, stop words and tokenize
* Convert the TF-IDF (HashingTF and IDF)


In [9]:
import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}
import org.apache.spark.sql.functions._
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

val sentenceData = sqlContext.createDataFrame(recipeIngredients).toDF("recipeName", "recipeText")
val sentenceDataClean = sentenceData.withColumn("recipeTextClean", 
                        regexp_replace(
                            regexp_replace(
                                lower(sentenceData("recipeText")) // make everything lowercase
                                ,"[^a-z]"," " // replace non-letters for whitespaces
                            )
                            ," +"," " // convert multiple whitespaces into a single space
                        )
                       )
sentenceDataClean.select("recipeTextClean").take(3)



[32mimport [36morg.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}[0m
[32mimport [36morg.apache.spark.sql.functions._[0m
[36msqlContext[0m: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@5cd1fd15
[36msentenceData[0m: org.apache.spark.sql.DataFrame = [recipeName: string, recipeText: string]
[36msentenceDataClean[0m: org.apache.spark.sql.DataFrame = [recipeName: string, recipeText: string, recipeTextClean: string]
[36mres8_5[0m: Array[org.apache.spark.sql.Row] = [33mArray[0m(
  [ teaspoon hot pepper sauce teaspoon paprika teaspoon salt cups sharp cheddar cheese grated cup butter cup flour sifted olives mix ingredientsexcept olives like a pie crust wrap each olive with mixture spread the little balls on a pan and freeze bake at degrees for minutes can be keep frozen in a bag serve hot nutr assoc ],
  [ ounces jumbo ripe olives canned pitted cup italian dressing bunch green onions drain olives and marinate at room temperature in dressing for one hour 

Now, let's tokenize the RDD

In [10]:
// tokenize

val tokenizer = new Tokenizer().setInputCol("recipeTextClean").setOutputCol("words")
val tokenizedData = tokenizer.transform(sentenceDataClean)


tokenizedData.select("words").take(3)


[36mtokenizer[0m: org.apache.spark.ml.feature.Tokenizer = tok_85cef01bf337
[36mtokenizedData[0m: org.apache.spark.sql.DataFrame = [recipeName: string, recipeText: string, recipeTextClean: string, words: array<string>]
[36mres9_2[0m: Array[org.apache.spark.sql.Row] = [33mArray[0m(
  [WrappedArray(, teaspoon, hot, pepper, sauce, teaspoon, paprika, teaspoon, salt, cups, sharp, cheddar, cheese, grated, cup, butter, cup, flour, sifted, olives, mix, ingredientsexcept, olives, like, a, pie, crust, wrap, each, olive, with, mixture, spread, the, little, balls, on, a, pan, and, freeze, bake, at, degrees, for, minutes, can, be, keep, frozen, in, a, bag, serve, hot, nutr, assoc)],
  [WrappedArray(, ounces, jumbo, ripe, olives, canned, pitted, cup, italian, dressing, bunch, green, onions, drain, olives, and, marinate, at, room, temperature, in, dressing, for, one, hour, or, more, turning, to, coat, on, all, sides, cut, green, onions, into, one, inch, pieces, slash, one, end, of, each, piece

and remove the stop words

In [11]:
// remove stop words

import org.apache.spark.ml.feature.StopWordsRemover

val remover = new StopWordsRemover()
  .setInputCol("words")
  .setOutputCol("filtered")
remover.setStopWords(remover.getStopWords++Array("exported","internet","address","com","from","mastercook","recipe","by","serving","size","preparation","time","categories","amount","measure","ingredient","preparation","method","place","all","ingredients","in","copyright","notice","taken","raw","gourmet","simple","recipes","living","nomi","shannon","nomi","shannon","commercial","rights","reserved","distributed","freely","non","commercial","purposes","provided","copyright","notice","included","following","web","site","http","www","living","foods","rawgourmet","contact","author","questions","regarding","matter","rawgourmet","living","foods","source","http","www","living","foods","recipes","gadogado","html","copyright","nomi","shannon","read","copyright","notice","yield","cups","notes","based","indonesian","dish","traditionally","peanuts","using","almonds","peanuts","recommended","fungus","called","aflatoxin","naturally","occurs","peanut","crop","crops","inspected","certain","percentage","allowed","proven","carcinogen","peanuts","best","left","reason","peanut","butter","isn","t","recommended","possible","make","butter","raw","peanuts","peanut","butter","produced","roasted","peanuts","nutr","assoc"))
val swdData = remover.transform(tokenizedData)

swdData.select("filtered").take(3)

[32mimport [36morg.apache.spark.ml.feature.StopWordsRemover[0m
[36mremover[0m: org.apache.spark.ml.feature.StopWordsRemover = stopWords_f95bb8207aab
[36mres10_2[0m: org.apache.spark.ml.feature.StopWordsRemover = stopWords_f95bb8207aab
[36mswdData[0m: org.apache.spark.sql.DataFrame = [recipeName: string, recipeText: string, recipeTextClean: string, words: array<string>, filtered: array<string>]
[36mres10_4[0m: Array[org.apache.spark.sql.Row] = [33mArray[0m(
  [WrappedArray(, teaspoon, hot, pepper, sauce, teaspoon, paprika, teaspoon, salt, sharp, cheddar, cheese, grated, cup, cup, flour, sifted, olives, mix, ingredientsexcept, olives, like, pie, crust, wrap, olive, mixture, spread, little, balls, pan, freeze, bake, degrees, minutes, frozen, bag, serve, hot)],
  [WrappedArray(, ounces, jumbo, ripe, olives, canned, pitted, cup, italian, dressing, bunch, green, onions, drain, olives, marinate, room, temperature, dressing, hour, turning, coat, sides, cut, green, onions, inch, pi

In [11]:
Now we can use the limited list of words to generate a dataframe with the hashed features 

: 

In [12]:
// hash-tf array

val hashingTF = new HashingTF()
  .setInputCol("filtered").setOutputCol("rawFeatures").setNumFeatures(500)
val featurizedData = hashingTF.transform(swdData)

featurizedData.select("rawFeatures").take(3)



[36mhashingTF[0m: org.apache.spark.ml.feature.HashingTF = hashingTF_0098bbfdaeb2
[36mfeaturizedData[0m: org.apache.spark.sql.DataFrame = [recipeName: string, recipeText: string, recipeTextClean: string, words: array<string>, filtered: array<string>, rawFeatures: vector]
[36mres11_2[0m: Array[org.apache.spark.sql.Row] = [33mArray[0m(
  [(500,[0,1,23,36,42,85,93,119,124,146,153,160,163,206,249,251,263,279,284,288,302,335,349,355,371,378,398,423,432,455,482,488],[1.0,2.0,1.0,1.0,1.0,1.0,3.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,2.0,1.0,1.0,1.0])],
  [(500,[0,8,26,59,71,75,85,93,97,103,138,139,141,152,171,175,176,229,236,250,252,284,297,316,317,356,372,378,382,395,423,432,438,443,447,488,494,496],[1.0,1.0,4.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,4.0,1.0,2.0,1.0,1.0,1.0,1.0])],
  [(500,[0,4,10,17,29,51,52,66,93,100,102,115,129,158,176,207,216,229,309,310,313,329,3

Normalizing

In [13]:
// rescale wordcounts

val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val idfModel = idf.fit(featurizedData)

val rescaledData = idfModel.transform(featurizedData)


rescaledData.select("features").take(3)


[36midf[0m: org.apache.spark.ml.feature.IDF = idf_8d7f8f85eb7e
[36midfModel[0m: org.apache.spark.ml.feature.IDFModel = idf_8d7f8f85eb7e
[36mrescaledData[0m: org.apache.spark.sql.DataFrame = [recipeName: string, recipeText: string, recipeTextClean: string, words: array<string>, filtered: array<string>, rawFeatures: vector, features: vector]
[36mres12_3[0m: Array[org.apache.spark.sql.Row] = [33mArray[0m(
  [(500,[0,1,23,36,42,85,93,119,124,146,153,160,163,206,249,251,263,279,284,288,302,335,349,355,371,378,398,423,432,455,482,488],[0.059423420470800806,2.295766675349785,1.6486586255873816,1.1478833376748925,1.8718021769015913,1.5998684614179497,1.5037685182495197,0.9067212808580044,0.9710156315634016,0.6931471805599453,1.1180303745252111,2.159484249353372,1.3121863889661687,1.0608719606852626,1.5051412020614923,2.5649493574615367,1.2770950691548986,1.754019141245208,1.754019141245208,1.9363406980391626,1.8718021769015913,0.6013396313068224,2.005333569526114,0.7125652664170469,2

Finally, let's estimate a good number of features for the PCA by using the variance

In [14]:
// Estimate the k parameter for PCA based on the number of features that explain up to 80% of variance

import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, Statistics}
// calculate the column variances
val variances = new RowMatrix(rescaledData.select("features").rdd.map(r => r(0).asInstanceOf[org.apache.spark.mllib.linalg.SparseVector])).computeColumnSummaryStatistics.variance.toArray.sorted(Ordering[Double].reverse)
// and the total variance
val varianceTotal = variances.reduceLeft((a,b) => a + b)
// get the number of columns that explain 80% of the variance and use this as k
val k = variances.map{var s = 0.0; d => {s += d; s/varianceTotal}}.zipWithIndex.filter(_._1>0.8).head._2

[32mimport [36morg.apache.spark.mllib.stat.{MultivariateStatisticalSummary, Statistics}[0m
[36mvariances[0m: Array[Double] = [33mArray[0m(
  [32m6.017903085109311[0m,
  [32m4.757474156511935[0m,
  [32m4.740411822959413[0m,
  [32m4.4541309892818886[0m,
  [32m3.335603581688776[0m,
  [32m3.1993890370482068[0m,
  [32m3.186828120085474[0m,
  [32m3.054532046090568[0m,
  [32m3.0070157507307647[0m,
  [32m2.992070101863424[0m,
  [32m2.9445148418187963[0m,
  [32m2.5951491059948553[0m,
  [32m2.477303977401982[0m,
  [32m2.425215832796767[0m,
  [32m2.405393024411766[0m,
  [32m2.393122497879171[0m,
  [32m2.367174662584539[0m,
  [32m2.3385269239482094[0m,
  [32m2.2817463683701344[0m,
[33m...[0m
[36mvarianceTotal[0m: Double = [32m374.47878621576535[0m
[36mk[0m: Int = [32m243[0m

And finally, let's run the PCA transformation

In [15]:
/// PCA

import org.apache.spark.ml.feature.PCA
import org.apache.spark.mllib.linalg.Vectors

val pca = new PCA()
  .setInputCol("features")
  .setOutputCol("pcaFeatures")
  .setK(k)
  .fit(rescaledData)
val pcaDF = pca.transform(rescaledData)
pcaDF.select("recipeName","pcaFeatures").take(3)


[32mimport [36morg.apache.spark.ml.feature.PCA[0m
[32mimport [36morg.apache.spark.mllib.linalg.Vectors[0m
[36mpca[0m: org.apache.spark.ml.feature.PCAModel = pca_02c2194e6894
[36mpcaDF[0m: org.apache.spark.sql.DataFrame = [recipeName: string, recipeText: string, recipeTextClean: string, words: array<string>, filtered: array<string>, rawFeatures: vector, features: vector, pcaFeatures: vector]
[36mres14_4[0m: Array[org.apache.spark.sql.Row] = [33mArray[0m(
  [Cheese-Olive Balls,[-1.5745572325731925,0.22192109724859044,-0.1580542136807957,-0.3123119779488005,0.1517057539806714,-1.6133968878431084,-0.07801142995610842,0.1551214441506181,-0.18823113908511294,-1.0957153773622479,-0.40657782344279053,-0.28137855784953036,0.49419586340560806,-0.033542182171615434,-0.6924852729615883,-0.054834622388484106,0.34504507374813814,0.8023428955688754,-0.03803106861617622,-1.1144731966830235,-0.06428218365788038,-0.7269873333506213,-0.22923701170976507,1.3416618107695653,-0.047807655356420

## Calculating the text-based Recipe-Recipe Cosine Simmilarity Model

In order to calculate the cosine simmilarities, we need to implement the formula as one is not available in the Dataframe object. 

The goal is to calculate the recipe-recipe cosine similarities based on the PCA-derived features 

In [16]:
val recipePcaRdd = pcaDF
    .select("recipeName","pcaFeatures","filtered")
    .rdd
    .zipWithIndex
    .collect
    .map(r=> (r._2,
              r._1(0).toString,
              r._1(1).asInstanceOf[org.apache.spark.mllib.linalg.DenseVector].toArray,
              Math.sqrt(r._1(1)
                        .asInstanceOf[org.apache.spark.mllib.linalg.DenseVector]
                        .toArray
                        .reduce((T,v) => T + v*v)
                       ), // calculate the vector length
              r._1(2).toString
             )
        )

[36mrecipePcaRdd[0m: Array[(Long, String, Array[Double], Double, String)] = [33mArray[0m(
  [33m[0m(
    [32m0L[0m,
    [32m"Cheese-Olive Balls"[0m,
    [33mArray[0m(
      [32m-1.5745572325731925[0m,
      [32m0.22192109724859044[0m,
      [32m-0.1580542136807957[0m,
      [32m-0.3123119779488005[0m,
      [32m0.1517057539806714[0m,
      [32m-1.6133968878431084[0m,
      [32m-0.07801142995610842[0m,
      [32m0.1551214441506181[0m,
      [32m-0.18823113908511294[0m,
      [32m-1.0957153773622479[0m,
      [32m-0.40657782344279053[0m,
      [32m-0.28137855784953036[0m,
      [32m0.49419586340560806[0m,
      [32m-0.033542182171615434[0m,
      [32m-0.6924852729615883[0m,
[33m...[0m

In [17]:
val simmilarities = recipePcaRdd.flatMap(r0=> recipePcaRdd
                     .filter(r1 => r1._1>r0._1)
                     .map(r1 => 
                          (r0._2,
                            r1._2,
                            (0 to r1._3.length-1).map(i=>r0._3(i)*r1._3(i)).foldLeft(0.0)((T,v) => T + v)/(r0._4*r1._4),
                            r0._5,
                            r1._5
                           )
                    )
    ).sortBy(-_._3)

[36msimmilarities[0m: Array[(String, String, Double, String, String)] = [33mArray[0m(
  [33m[0m(
    [32m"Blue Cheese Stuffed Mushrooms"[0m,
    [32m"Stuffed Mushrooms"[0m,
    [32m0.6884983721232661[0m,
    [32m"WrappedArray(, large, fresh, mushrooms, tablespoons, margarine, cup, finely, chopped, red, pepper, cup, heavy, cream, cup, crumbled, blue, cheese, cooked, rice, tablespoon, minced, fresh, basil, teaspoon, ground, white, pepper, fresh, basil, chopped, garnish, clean, mushrooms, damp, paper, towel, remove, mushroom, stems, finely, chop, stems, set, aside, saute, mushroom, caps, skillet, tender, drain, paper, towels, saute, mushroom, stems, red, pepper, skillet, add, cream, bring, boil, reduce, heat, add, cheese, cook, melted, stir, rice, basil, pepper, cook, thoroughly, heated, spoon, rice, mixture, mushroom, caps, mushroom, caps, greased, shallow, baking, pan, cover, bake, degrees, minutes, tender, drain, paper, towels, garnish, stuffed, mushrooms, basil, rice, cou

In [18]:
// Helper function that displays a nicely formatted table
def displayTable(table:List[Map[String, String]])(implicit publish: jupyter.api.Publish[jupyter.api.Evidence]): Unit = {
    val keys = table.flatMap(r=>r.keys).distinct.sorted
    val header = "<th>"+keys.mkString("</th><th>")+"</th>"
    val rows = "<tr>"+table.map(r=>keys.map(k=>"<td>"+r.getOrElse(k,"&nbsp;")+"</td>")).mkString("</tr><tr>")+"</tr>"
    publish.display("table",("text/html" -> ("<table>"+header+rows+"</table>")))
}

defined [32mfunction [36mdisplayTable[0m

## The Recipe-Recipe Simliarity Based Model

Let's see the top-20 similar reciples

In [19]:
displayTable(simmilarities
//             .filter(_._3 >0.5)
             .map( r=>
                        Map("Recipe 0" -> r._1,
                            "Recipe 1" -> r._2,
                            "Cosine Similarity" -> r._3.toString//,
//                            "Recipe Text 0" -> r._4,
//                            "Recipe Text 1" -> r._5
                           )
                )
             .toList
             .take(20)
            )

0,1,2
0.6884983721232661,Blue Cheese Stuffed Mushrooms,Stuffed Mushrooms
0.6702883263140217,Mushrooms Filled with Feta Cheese and Pine Nuts,Blue Cheese Stuffed Mushrooms
0.6415835534206843,Mushrooms Filled with Feta Cheese and Pine Nuts,Stuffed Mushrooms
0.6112733989136132,Five-Spice Appetizer Meatballs,Sweet and Sour Party Meat Balls
0.5367300021737985,"Pork Meatball With Sweet-Sour Sauce, Pk",Five-Spice Appetizer Meatballs
0.5139370241898292,Fruit Salsa Dip,Ants on a Log
0.4912639950927898,Deviled Eggs,"Deviled Egg Slices, Pk"
0.4763990337573837,Elegant Vegetarian Pate,Mushroom Individuals
0.4730949203798235,Cheesy Wontons With Sweet and Sour Dip,Ginger-Date Wontons
0.4713876303247022,Southwestern Chicken Filo Triangles,Italian Roasted Vegetables




## Plotting it:

Scala/Spark does not offer much plotting options. For convenience, let's embeed a static plotly graph. I'll eventually figure out how to dynamically pass data to the graph from my app

In [20]:
publish.display("table",("text/html" -> ("""<iframe width="900" height="800" frameborder="0" scrolling="no" src="https://plot.ly/~rmalarc/5.embed"></iframe>""")))



In [21]:
sc.stop



# Conclusions

* I'm getting better with Spark!.
* The ML library is much user-friendly than ML-LIB, even though it's not as feature-rich. 
* The PCA-based relationships appear to work, although it's hard to prove as I'm not that familiar with the recipes. I suspect that there is a fair amount of noise due to certain keywords which should be added to the stop-word list.