# Project 2

The goal of this assignment is for you to try out different ways of implementing and configuring a recommender, and to evaluate your different approaches.

For project 2, you’re asked to take some recommendation data (such as your toy movie dataset, MovieLens, or another dataset of your choosing), and implement at least two different recommendation algorithms on the data.  For example, content-based, user-user CF, and/or item-item CF.  You should evaluate different approaches, using different algorithms, normalization techniques, similarity methods, neighborhood sizes, etc.  You don’t need to be exhaustive; these are just some suggested possibilities.  You may use whatever third-party libraries you want.  Please provide at least one graph, and a textual summary of your evaluation.

You may work in a small group.  Please submit a link to your GitHub repository for your Jupyter notebook or RMarkdown file.  Due end of day on Sunday June 26th.

**Requires the Jupyter-Scala language Kernel, available from: https://github.com/alexarchambault/jupyter-scala**

In [1]:
classpath.add("org.apache.spark" %% "spark-core" % "1.6.1",
              "org.apache.spark" %% "spark-mllib" % "1.6.1",
              "org.apache.spark" %% "spark-sql" % "1.6.1")

158 new artifact(s)


158 new artifacts in macro
158 new artifacts in runtime
158 new artifacts in compile




# Response

## The Recommender System

As I'm fairly new to Spark and the whole data manipulation world in Scala, let's keep the problem simple. This is a system that finds similar recipes based on their shared ingredients, using a plain-text export of appetizer recipes.

As part of this exercise, I compute ingredient-ingredient and recipe-recipe cosine similarities with the `columnSimilarities` method of Spark MLlib's `RowMatrix`.

## The Code

### Firing up a Spark Context

In [2]:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql._
import org.apache.spark.sql.types._
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.{MatrixEntry, RowMatrix}

[32mimport [36morg.apache.spark.{SparkConf, SparkContext}[0m
[32mimport [36morg.apache.spark.sql._[0m
[32mimport [36morg.apache.spark.sql.types._[0m
[32mimport [36morg.apache.spark.mllib.linalg.Vectors[0m
[32mimport [36morg.apache.spark.mllib.linalg.distributed.{MatrixEntry, RowMatrix}[0m

In [3]:

    val conf = new SparkConf()
      .setAppName("week1-EstimatePi")
      .setMaster("local") 

    val sc = new SparkContext(conf)


Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/06/25 10:39:32 INFO SparkContext: Running Spark version 1.6.1
16/06/25 10:39:32 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/06/25 10:39:32 INFO SecurityManager: Changing view acls to: malarconba001
16/06/25 10:39:32 INFO SecurityManager: Changing modify acls to: malarconba001
16/06/25 10:39:32 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(malarconba001); users with modify permissions: Set(malarconba001)
16/06/25 10:39:35 INFO Utils: Successfully started service 'sparkDriver' on port 9425.
16/06/25 10:39:36 INFO Slf4jLogger: Slf4jLogger started
16/06/25 10:39:36 INFO Remoting: Starting remoting
16/06/25 10:39:37 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriverActorSystem@192.168.1.15:9440]
16/06/25 10:39:37 INFO Utils: Succ

[36mconf[0m: org.apache.spark.SparkConf = org.apache.spark.SparkConf@45a822c8
[36msc[0m: org.apache.spark.SparkContext = org.apache.spark.SparkContext@2d60c055

### Data Loading and Transformations

The objective here is to:

* Load the http://mc6help.tripod.com/RecipeLibrary/AllAppetizerRecipes.txt file
* Transform into Zero filled matrix
* Transform into Long-format data structure
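
To make the two target shapes concrete, here is a minimal plain-Scala sketch (no Spark) of how long-format triples relate to a zero-filled matrix. The object name `LongToMatrix`, the helper `toMatrix`, and the toy rows are hypothetical and only for illustration; the notebook itself builds these structures with RDDs.

```scala
object LongToMatrix {
  // Hypothetical toy data in long format: (recipeName, ingredientName, isUsed)
  val longFormat: Seq[(String, String, Double)] = Seq(
    ("Recipe A", "butter", 1.0),
    ("Recipe A", "salt", 1.0),
    ("Recipe B", "salt", 1.0),
    ("Recipe B", "olives", 1.0)
  )

  // Returns the column order (sorted distinct ingredient names) and one
  // zero-filled row per recipe, with the stored value where an ingredient is used.
  def toMatrix(rows: Seq[(String, String, Double)]): (Seq[String], Map[String, Seq[Double]]) = {
    val columns = rows.map(_._2).distinct.sorted
    val matrix = rows.groupBy(_._1).map { case (recipe, entries) =>
      val used = entries.map(e => e._2 -> e._3).toMap
      recipe -> columns.map(name => used.getOrElse(name, 0.0))
    }
    (columns, matrix)
  }
}
```

With the toy rows, the columns come out as `butter`, `olives`, `salt`, and every recipe gets a row of the same width, with zeros for the ingredients it does not use.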


In [4]:

val csv = 
    sc
        .textFile("AllAppetizerRecipes.txt")
        .map(t => t.trim)
        .map(t => (t,t=="* Exported from MasterCook *") ) // add boolean if we have a record delimiter
        .zipWithIndex // add record id
        .map(r=>(r._1._1,r._1._2,r._2)) // flatten the nested index
csv.take(20)



[36mcsv[0m: org.apache.spark.rdd.RDD[(String, Boolean, Long)] = MapPartitionsRDD[5] at map at Main.scala:30
[36mres3_1[0m: Array[(String, Boolean, Long)] = [33mArray[0m(
  [33m[0m([32m"* Exported from MasterCook *"[0m, [32mtrue[0m, [32m0L[0m),
  [33m[0m([32m""[0m, [32mfalse[0m, [32m1L[0m),
  [33m[0m([32m"Barbecue Pecans"[0m, [32mfalse[0m, [32m2L[0m),
  [33m[0m([32m""[0m, [32mfalse[0m, [32m3L[0m),
  [33m[0m([32m"Recipe By     : Possum Kingdom Lake Cookbook"[0m, [32mfalse[0m, [32m4L[0m),
  [33m[0m([32m"Serving Size  : 25    Preparation Time : 0:00"[0m, [32mfalse[0m, [32m5L[0m),
  [33m[0m([32m"Categories    :"[0m, [32mfalse[0m, [32m6L[0m),
  [33m[0m([32m"Amount  Measure       Ingredient -- Preparation Method"[0m, [32mfalse[0m, [32m7L[0m),
  [33m[0m([32m"--------  ------------  --------------------------------"[0m, [32mfalse[0m, [32m8L[0m),
  [33m[0m([32m"2   tablespoons  butter"[0m, [32mfalse[0m, [32m9L[

From the above sample, we can see that the text file gets imported one line per record. However, the recipes file is formatted as follows:

```
* Exported from MasterCook *

                     Barbecue Pecans

Recipe By     : Possum Kingdom Lake Cookbook
Serving Size  : 25    Preparation Time : 0:00
Categories    : 
  Amount  Measure       Ingredient -- Preparation Method
--------  ------------  --------------------------------
       2   tablespoons  butter
     1/4           cup  Worcestershire sauce
       1    tablespoon  catsup
       6        dashes  Hot sauce
       4          cups  Pecans -- halves
                        salt -- to taste

Melt butter in a large saucepan; add Worcestershire sauce, , catsup, and hot sauce.  

Stir in nuts; spoon into a glass baking dish, spreading evenly.  toast at 400 degrees about 20 minutes, stirring frequently.  

Turn out on absorbent towels, and sprinkle with salt.

                                    - - - - - - - - - - - - - - - - - - - 
```

The goal here is to map the file into: RecipeIngredients(RecipeName, Ingredient, IsUsed) 

Where:

* Record Break: is the line: ```* Exported from MasterCook *```
* An entire record is composed of the lines between record breaks
* RecipeName: The third line in the record
* Ingredient: The third column in the Ingredients table
* IsUsed: Defaults to 1.
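
The record rules above can be illustrated on a small in-memory list of lines. This is a plain-Scala sketch rather than the Spark code used in this notebook: it tags each line with a running recipe ID using `scanLeft`, which is sequential and so only suits data that fits on one machine. The names `RecordSplitter`, `tagLines`, and `recipeNames` are hypothetical.

```scala
object RecordSplitter {
  val Delimiter = "* Exported from MasterCook *"

  // Tag each line with a recipe ID; the ID increases by one at every record break.
  // Lines before the first record break get the ID -1.
  def tagLines(lines: Seq[String]): Seq[(Int, String)] = {
    val ids = lines
      .map(_.trim)
      .scanLeft(-1)((id, line) => if (line == Delimiter) id + 1 else id)
      .tail // drop the seed value so ids and lines have the same length
    ids.zip(lines)
  }

  // The recipe name is the third line of each record (index 2).
  def recipeNames(lines: Seq[String]): Seq[String] =
    tagLines(lines)
      .filter(_._1 >= 0)
      .groupBy(_._1)
      .toSeq
      .sortBy(_._1)
      .flatMap { case (_, record) => record.lift(2).map(_._2.trim) }
}
```

Records with fewer than three lines are skipped here instead of raising an error; that is a design choice of this sketch, not something the source file format guarantees.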

In [5]:
// The goal with this section is to end up with a recipeIndexTable collection that looks like: recipeIndexTable(RecordId,RecipeID)

// Generate a recordIndex table (RecipeId, LineId) by relying on the presence and offset of the record break
val recordIndexes =  csv.filter(_._2).map(_._3).zipWithIndex.map(r=>(r._2,r._1)) ++ sc.parallelize(Seq((csv.filter(_._2).count,csv.count)))

// Now, let's just create an offset recordIndexes Table 
val recordIndexesOffset = recordIndexes.map(r=>(r._1-1,r._2-1))

// and join it with the record indexes so we have the recipeIndexTable(RecordId,RecipeID)
val recipeIndexTable = recordIndexes.join(recordIndexesOffset).flatMap(r=> (r._2._1 to r._2._2).map(t=>(t,r._1)))
recipeIndexTable.collect.sorted

[36mrecordIndexes[0m: org.apache.spark.rdd.RDD[(Long, Long)] = UnionRDD[12] at $plus$plus at Main.scala:27
[36mrecordIndexesOffset[0m: org.apache.spark.rdd.RDD[(Long, Long)] = MapPartitionsRDD[13] at map at Main.scala:30
[36mrecipeIndexTable[0m: org.apache.spark.rdd.RDD[(Long, Long)] = MapPartitionsRDD[17] at flatMap at Main.scala:33
[36mres4_3[0m: Array[(Long, Long)] = [33mArray[0m(
  [33m[0m([32m0L[0m, [32m0L[0m),
  [33m[0m([32m1L[0m, [32m0L[0m),
  [33m[0m([32m2L[0m, [32m0L[0m),
  [33m[0m([32m3L[0m, [32m0L[0m),
  [33m[0m([32m4L[0m, [32m0L[0m),
  [33m[0m([32m5L[0m, [32m0L[0m),
  [33m[0m([32m6L[0m, [32m0L[0m),
  [33m[0m([32m7L[0m, [32m0L[0m),
  [33m[0m([32m8L[0m, [32m0L[0m),
  [33m[0m([32m9L[0m, [32m0L[0m),
  [33m[0m([32m10L[0m, [32m0L[0m),
  [33m[0m([32m11L[0m, [32m0L[0m),
  [33m[0m([32m12L[0m, [32m0L[0m),
  [33m[0m([32m13L[0m, [32m0L[0m),
  [33m[0m([32m14L[0m, [32m0L[0m),
  [33m[0m(

Let's now join it to the imported data so we add the recipe ID.

In [6]:


val csvIndexed = csv.map(r=>(r._3,(r._1,r._2))).join(recipeIndexTable)
csvIndexed.take(4)

[36mcsvIndexed[0m: org.apache.spark.rdd.RDD[(Long, ((String, Boolean), Long))] = MapPartitionsRDD[21] at join at Main.scala:27
[36mres5_1[0m: Array[(Long, ((String, Boolean), Long))] = [33mArray[0m(
  [33m[0m(
    [32m3558L[0m,
    [33m[0m(
      [33m[0m(
        [32m"Blend well the cream cheese with the Brie cheese. Add the hazelnuts and apple; blend. Spread on melba toast or crackers."[0m,
        [32mfalse[0m
      ),
      [32m92L[0m
    )
  ),
  [33m[0m([32m1084L[0m, [33m[0m([33m[0m([32m"- - - - - - - - - - - - - - - - - - -"[0m, [32mfalse[0m), [32m28L[0m)),
  [33m[0m([32m3586L[0m, [33m[0m([33m[0m([32m"2            tb  Olive oil"[0m, [32mfalse[0m), [32m93L[0m)),
  [33m[0m([32m1410L[0m, [33m[0m([33m[0m([32m""[0m, [32mfalse[0m), [32m37L[0m))
)

Let's now combine the recipe lines separated by a pipe (|) 

In [7]:
val recipesText = csvIndexed
    .map(r=>(r._2._2,r._1,r._2._1._1))// let's flatten the nested list so we have RecipeId, Recipe Line Id and Recipe Line
    .sortBy(r=>(r._1,r._2)) // Properly sort it so we have all lines consecutively arranged as per the recipe line id
    .map(r=>(r._1,r._3)) // retain only the recipeid and recipe lines
    .groupBy(_._1) // and group it by the RecipeId
    .map( 
        g => (
                g._1,    // return the RecipeId
                g._2.map(_._2.trim).mkString("|")  // and concatenate the nested array of recipe lines with a pipe
        )
    )

recipesText.take(4)

[36mrecipesText[0m: org.apache.spark.rdd.RDD[(Long, String)] = MapPartitionsRDD[31] at map at Main.scala:30
[36mres6_1[0m: Array[(Long, String)] = [33mArray[0m(
  [33m[0m(
    [32m34L[0m,
    [32m"* Exported from MasterCook *||Cheese-Olive Balls||Recipe By     :|Serving Size  : 1     Preparation Time : 0:00|Categories    :|Amount  Measure       Ingredient -- Preparation Method|--------  ------------  --------------------------------|1/4      teaspoon  hot pepper sauce|1      teaspoon  paprika|1/2      teaspoon  salt|2          cups  sharp cheddar cheese -- grated|1/2           cup  butter|1           cup  flour -- sifted|olives||Mix ingredientsexcept olives  like a pie crust.  Wrap each olive with mixture.  spread the little balls on a pan and freeze.  Bake at 425 degrees for 12 minutes. Can be keep frozen in a bag. Serve hot.||||||||||- - - - - - - - - - - - - - - - - - -|||||Nutr. Assoc. : 0"[0m
  ),
  [33m[0m(
    [32m52L[0m,
    [32m"* Exported from MasterCook *||S

And finally, let's extract our data in a long format: RecipeIngredients(RecipeName, IngredientName, IsUsed). The numeric recipe and ingredient IDs are added in the cells that follow.
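
Before the Spark version, here is the per-line rule in isolation: a table line is split on runs of two or more spaces and the last column is taken as the ingredient. This is a plain-Scala sketch with a hypothetical helper `IngredientLine.ingredientOf`; it mirrors the filtering idea of the cell below (skip empty, very short, and dash-only lines) but is not the exact code path.

```scala
object IngredientLine {
  // Split a table line on runs of two or more spaces and keep the last column.
  // Returns None for empty lines, very short fragments, and dash-only separator lines.
  def ingredientOf(line: String): Option[String] = {
    val trimmed = line.trim
    if (trimmed.length <= 2 || trimmed.forall(c => c == '-' || c == ' ')) None
    else trimmed.split(" {2,}").lastOption
  }
}
```

Single spaces inside a column survive the split, so `Worcestershire sauce` stays one ingredient, and a line with only an ingredient column (such as `salt -- to taste`) is returned whole.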

In [8]:
import scala.util.matching.Regex
val recipeIngredients = recipesText
    .flatMap{
        text=>
            val recipeName = {        // get the recipe name from the first part of the recipe record
                val pattern = new Regex("""\* Exported from MasterCook \*\|\|([^\|]+)\|\|""", "RecipeName") 
                pattern.findFirstMatchIn(text._2).get.group("RecipeName")  
            }

            """(?s)----\|.*?\|\|""".r   /* Let's find the recipe table, which is all the stuff 
                                                 between the header and 2 consecutive pipes (new lines) */
                .findFirstMatchIn(text._2)    // get the first match
                .map(r=>r.matched)            // and return the matched text from the regex object
                .getOrElse("")                // get or default to an empty string
                .split("\\|")                 // and now process the table lines
                .map(
                    _.trim.replaceAll(" {2,}","\\|")  // for each line, let's pipe delimit the columns.
                )
                .filter(r=>(!r.isEmpty)&&(r!="----")&&(r.length>2)) // filtering out the empty ones and short aberrations
                .map { r=>   // for each recipe line
                    val v = r.split("\\|")    // split it into an array of columns by the pipe sign
                    (recipeName, v(v.length-1),1.00) // and return (RecipeName, IngredientName, IsUsed)
                }
    }

recipeIngredients.collect

[32mimport [36mscala.util.matching.Regex[0m
[36mrecipeIngredients[0m: org.apache.spark.rdd.RDD[(String, String, Double)] = MapPartitionsRDD[32] at flatMap at Main.scala:26
[36mres7_2[0m: Array[(String, String, Double)] = [33mArray[0m(
  [33m[0m([32m"Cheese-Olive Balls"[0m, [32m"hot pepper sauce"[0m, [32m1.0[0m),
  [33m[0m([32m"Cheese-Olive Balls"[0m, [32m"paprika"[0m, [32m1.0[0m),
  [33m[0m([32m"Cheese-Olive Balls"[0m, [32m"salt"[0m, [32m1.0[0m),
  [33m[0m([32m"Cheese-Olive Balls"[0m, [32m"sharp cheddar cheese -- grated"[0m, [32m1.0[0m),
  [33m[0m([32m"Cheese-Olive Balls"[0m, [32m"butter"[0m, [32m1.0[0m),
  [33m[0m([32m"Cheese-Olive Balls"[0m, [32m"flour -- sifted"[0m, [32m1.0[0m),
  [33m[0m([32m"Cheese-Olive Balls"[0m, [32m"olives"[0m, [32m1.0[0m),
  [33m[0m([32m"Stuffed Ripe Olives"[0m, [32m"jumbo ripe olives -- canned, pitted"[0m, [32m1.0[0m),
  [33m[0m([32m"Stuffed Ripe Olives"[0m, [32m"Italian dressing"

Let's have a freestanding list of recipes and ingredients.

In [9]:
val ingredients = recipeIngredients.map(_._2).distinct.zipWithIndex
ingredients.collect
val recipes = recipeIngredients.map(_._1).distinct.zipWithIndex
recipes.collect

[36mingredients[0m: org.apache.spark.rdd.RDD[(String, Long)] = ZippedWithIndexRDD[37] at zipWithIndex at Main.scala:25
[36mres8_1[0m: Array[(String, Long)] = [33mArray[0m(
  [33m[0m([32m"French bread loaf -- sliced and heated"[0m, [32m0L[0m),
  [33m[0m([32m"bread"[0m, [32m1L[0m),
  [33m[0m([32m"cucumber -- thinly sliced"[0m, [32m2L[0m),
  [33m[0m([32m"pine nuts -- toasted"[0m, [32m3L[0m),
  [33m[0m([32m"soy sauce"[0m, [32m4L[0m),
  [33m[0m([32m"-- crumbled"[0m, [32m5L[0m),
  [33m[0m([32m"sour cream -- or yogurt"[0m, [32m6L[0m),
  [33m[0m([32m"Salt -- pepper"[0m, [32m7L[0m),
  [33m[0m([32m"Hot sauce"[0m, [32m8L[0m),
  [33m[0m([32m"whole-wheat flour pastry"[0m, [32m9L[0m),
  [33m[0m([32m"chicken pieces -- see notes"[0m, [32m10L[0m),
  [33m[0m([32m"catsup"[0m, [32m11L[0m),
  [33m[0m([32m"segmented"[0m, [32m12L[0m),
  [33m[0m([32m"Mushrooms -- coarsely chopped"[0m, [32m13L[0m),
  [33m[0m([32m"mince

Let's finally bake the indexes along with the recipeIngredients

In [10]:
val withIngredientsIndexed = recipeIngredients.groupBy(_._2).join(ingredients)
withIngredientsIndexed.take(4)

[36mwithIngredientsIndexed[0m: org.apache.spark.rdd.RDD[(String, (Iterable[(String, String, Double)], Long))] = MapPartitionsRDD[47] at join at Main.scala:27
[36mres9_1[0m: Array[(String, (Iterable[(String, String, Double)], Long))] = [33mArray[0m(
  [33m[0m(
    [32m"French bread loaf -- sliced and heated"[0m,
    [33m[0m(
      [33mCompactBuffer[0m(
        [33m[0m(
          [32m"Baked Whole Garlic with French Bread"[0m,
          [32m"French bread loaf -- sliced and heated"[0m,
          [32m1.0[0m
        )
      ),
      [32m0L[0m
    )
  ),
  [33m[0m([32m"bread"[0m, [33m[0m([33mCompactBuffer[0m([33m[0m([32m"Chicken Almond Dainties"[0m, [32m"bread"[0m, [32m1.0[0m)), [32m1L[0m)),
  [33m[0m(
    [32m"cucumber -- thinly sliced"[0m,
    [33m[0m([33mCompactBuffer[0m([33m[0m([32m"Snack Sandwiches"[0m, [32m"cucumber -- thinly sliced"[0m, [32m1.0[0m)), [32m2L[0m)
  ),
  [33m[0m(
[33m...[0m

It's time to flatten the group

In [11]:
val withIngredientsIndexedFlat = withIngredientsIndexed.flatMap(r => r._2._1.map(x => (x._1,x._2,r._2._2,x._3)))
withIngredientsIndexedFlat.take(4)

[36mwithIngredientsIndexedFlat[0m: org.apache.spark.rdd.RDD[(String, String, Long, Double)] = MapPartitionsRDD[48] at flatMap at Main.scala:25
[36mres10_1[0m: Array[(String, String, Long, Double)] = [33mArray[0m(
  [33m[0m(
    [32m"Baked Whole Garlic with French Bread"[0m,
    [32m"French bread loaf -- sliced and heated"[0m,
    [32m0L[0m,
    [32m1.0[0m
  ),
  [33m[0m([32m"Chicken Almond Dainties"[0m, [32m"bread"[0m, [32m1L[0m, [32m1.0[0m),
  [33m[0m([32m"Snack Sandwiches"[0m, [32m"cucumber -- thinly sliced"[0m, [32m2L[0m, [32m1.0[0m),
  [33m[0m(
    [32m"Mushrooms Filled with Feta Cheese and Pine Nuts"[0m,
    [32m"pine nuts -- toasted"[0m,
    [32m3L[0m,
    [32m1.0[0m
  )
)

And add the recipe ID following the same methodology:

In [12]:
val recipeIngredientsIndexed = withIngredientsIndexedFlat
    .groupBy(_._1)
    .join(recipes)
    .flatMap(r => r._2._1.map(x => (x._1,r._2._2,x._2,x._3,x._4)))

recipeIngredientsIndexed.take(4)

[36mrecipeIngredientsIndexed[0m: org.apache.spark.rdd.RDD[(String, Long, String, Long, Double)] = MapPartitionsRDD[54] at flatMap at Main.scala:30
[36mres11_1[0m: Array[(String, Long, String, Long, Double)] = [33mArray[0m(
  [33m[0m([32m"Crunchy Chocolate-Coconut Balls"[0m, [32m0L[0m, [32m"egg -- slightly beaten"[0m, [32m149L[0m, [32m1.0[0m),
  [33m[0m([32m"Crunchy Chocolate-Coconut Balls"[0m, [32m0L[0m, [32m"sweet chocolate squares"[0m, [32m170L[0m, [32m1.0[0m),
  [33m[0m([32m"Crunchy Chocolate-Coconut Balls"[0m, [32m0L[0m, [32m"butter"[0m, [32m285L[0m, [32m1.0[0m),
  [33m[0m([32m"Crunchy Chocolate-Coconut Balls"[0m, [32m0L[0m, [32m"coconut flakes"[0m, [32m322L[0m, [32m1.0[0m)
)

### Converting the RDD into a Spark SQL DataFrame

Thinking that it may be useful, here it is:


In [13]:
// Lots of code from: http://spark.apache.org/docs/latest/sql-programming-guide.html

// sc is an existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

[36msqlContext[0m: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@41927918

In [14]:
// Import Row.
import org.apache.spark.sql.Row;

// Import Spark SQL data types
import org.apache.spark.sql.types.{StructType,StructField,StringType,DoubleType};
// Define the schema explicitly
val schema =
  StructType(List(StructField("recipeName", StringType, true)
             ,StructField("ingredientName", StringType, true)
             ,StructField("isUsed", DoubleType, true)
            )
            )

// Convert records of the RDD (recipeIngredients) to Rows.
val rowRDD = recipeIngredients.map(p => Row(p._1, p._2,p._3))

// Apply the schema to the RDD.
val recipeIngredientsDataFrame = sqlContext.createDataFrame(rowRDD, schema)

// Register the DataFrames as a table.
recipeIngredientsDataFrame.registerTempTable("recipeingredients")



[32mimport [36morg.apache.spark.sql.Row[0m
[32mimport [36morg.apache.spark.sql.types.{StructType,StructField,StringType,DoubleType}[0m
[36mschema[0m: org.apache.spark.sql.types.StructType = [33mStructType[0m(
  StructField(recipeName,StringType,true),
  StructField(ingredientName,StringType,true),
  StructField(isUsed,DoubleType,true)
)
[36mrowRDD[0m: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[55] at map at Main.scala:41
[36mrecipeIngredientsDataFrame[0m: org.apache.spark.sql.DataFrame = [recipeName: string, ingredientName: string, isUsed: double]

Let's query the dataframe

In [15]:
sqlContext.sql("SELECT * FROM recipeingredients where upper(ingredientName) like '%CILANTRO%'").collect

[36mres14[0m: Array[org.apache.spark.sql.Row] = [33mArray[0m(
  [Pot Stickers,Cilantro -- minced,1.0],
  [Fruit Salsa Dip,cilantro,1.0],
  [Tuna Appetizers,dried cilantro,1.0],
  [Vietnamese Spring Rolls,cilantro -- minced,1.0],
  [Southwestern Chicken Filo Triangles,fresh cilantro -- finely minced,1.0],
  [Southwestern Chicken Filo Triangles,fresh cilantro -- minced,1.0],
  [Crab And Avocado Cocktail,cilantro; fresh -- snipped,1.0]
)

I thought I would use the SQL collection to pivot the matrix. However, the performance was so poor when I ran it that I'm omitting it completely from the exercise.

### Transforming into a Matrix and Generating the Cosine Similarities

In [16]:
// Helper function that displays a nicely formatted table
def displayTable(table:List[Map[String, String]])(implicit publish: jupyter.api.Publish[jupyter.api.Evidence]): Unit = {
    val keys = table.flatMap(r=>r.keys).distinct.sorted
    val header = "<th>"+keys.mkString("</th><th>")+"</th>"
    val rows = "<tr>"+table.map(r=>keys.map(k=>"<td>"+r.getOrElse(k,"&nbsp;")+"</td>")).mkString("</tr><tr>")+"</tr>"
    publish.display("table",("text/html" -> ("<table>"+header+rows+"</table>")))
}

defined [32mfunction [36mdisplayTable[0m

In [17]:
// I don't yet know why but I have to get these counts and hard code them into the functor. 
ingredients.count.toInt
recipes.count.toInt

[36mres16_0[0m: Int = [32m626[0m
[36mres16_1[0m: Int = [32m103[0m

In [18]:
// Unfortunately, the only similarity function I found is buried in the RowMatrix object; I build dense vectors for it here
val recipeIngredientsMatrix = new RowMatrix(
    recipeIngredientsIndexed
        .groupBy(_._2)
        .map(
            r=>
                Vectors.dense(
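                    (0 to 626).map(  // note: `0 to 626` is inclusive, so this yields 627 columns for 626 ingredients; the last one is always zero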
                        c=>  r._2.map(i=> i._4.toInt).filter(_ == c).size.toDouble
                    ).toArray
                )
        )
)

recipeIngredientsMatrix.rows.take(5)

val cs = recipeIngredientsMatrix.columnSimilarities


val ic = ingredients.collect
cs.entries
  .map {
    case MatrixEntry(i, j, u) => (i, j, u) }
  .collect
  .map(r => (ic.filter(ri => ri._2 == r._1).map(_._1), ic.filter(ri => ri._2 == r._2).map(_._1), r._3.toDouble))
  .sortBy(-_._3)

[36mrecipeIngredientsMatrix[0m: org.apache.spark.mllib.linalg.distributed.RowMatrix = org.apache.spark.mllib.linalg.distributed.RowMatrix@531a5cf0
[36mres17_1[0m: Array[org.apache.spark.mllib.linalg.Vector] = [33mArray[0m(
  [0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,

In [19]:
val ingredientsRecipeMatrix = new RowMatrix(
    recipeIngredientsIndexed
        .groupBy(_._4)
        .map(
            r=>
                Vectors.dense(
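                    (0 to 103).map(  // note: `0 to 103` is inclusive, so this yields 104 columns for 103 recipes; the last one is always zero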
                        c=>  r._2.map(i=> i._2.toInt).filter(_ == c).size.toDouble*100
                    ).toArray
                )
        )
)

ingredientsRecipeMatrix.rows.take(5)

val cs = ingredientsRecipeMatrix.columnSimilarities

// let's collect the ingredients so we can see them
val recipeIngredientsLocal = recipeIngredients.collect

val rc = recipes.collect
val cse = cs.entries
      .map {
        case MatrixEntry(i, j, u) => (i, j, u) }
      .collect
      .map(r => (rc.filter(ri => ri._2 == r._1).map(_._1), rc.filter(ri => ri._2 == r._2).map(_._1), r._3.toDouble))
      .sortBy(-_._3)
      .map(r=> Map(" Recipe 1"->r._1.mkString
                   ,"Ingredients 1"->recipeIngredientsLocal.filter(_._1==r._1.mkString).map(_._2).mkString("<br>")
                   ," Recipe 2"->r._2.mkString
                   ,"Ingredients 2"->recipeIngredientsLocal.filter(_._1==r._2.mkString).map(_._2).mkString("<br>")
                   ,"  Cosine Similarity"->r._3.toString))


[36mingredientsRecipeMatrix[0m: org.apache.spark.mllib.linalg.distributed.RowMatrix = org.apache.spark.mllib.linalg.distributed.RowMatrix@790be33d
[36mres18_1[0m: Array[org.apache.spark.mllib.linalg.Vector] = [33mArray[0m(
  [0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0],
  [0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0

## The Recipe-Recipe Similarity-Based Model

Let's see the top-20 similar recipes

In [20]:
displayTable(
    cse.toList.take(20)
    )

| Cosine Similarity | Recipe 1 | Recipe 2 | Ingredients 1 | Ingredients 2 |
|---|---|---|---|---|
| 0.4780914437337573 | Cranberry Coconut Fruit Balls | Orange Coconut Balls | dried apricots pecans fresh cranberries -- rinsed and drained grated orange peel -- from 1 orange butter confectioner's sugar graham cracker crumbs coconut flakes red food coloring green food coloring | frozen orange juice concentrate -- thawed butter water confectioner's sugar coconut flakes graham cracker crumbs chopped pecans |
| 0.4629100498862756 | Crunchy Chocolate-Coconut Balls | Orange Coconut Balls | sweet chocolate squares butter egg -- slightly beaten confectioner's sugar coconut flakes whole wheat flakes -- crisp | frozen orange juice concentrate -- thawed butter water confectioner's sugar coconut flakes graham cracker crumbs chopped pecans |
| 0.4472135954999578 | Toasted ID Bits, Pk | Debbie's Spiced Pecans | butter -- melted worcestershire sauce celery salt tabasco sauce garlic powder bite-size oat cereal rings pretzel sticks shelled peanuts bite-size shredded wheat | pecan halves butter -- melted tabasco sauce worcestershire sauce garlic salt |
| 0.408248290463863 | Rumaki | Pineapple Chicken Satay | waterchestnuts, canned -- drained soy sauce bacon brown sugar | boneless skinless chicken breasts -- trimmed and cut into fresh pineapple (about 50 pieces) -- peeled and cut into unsweetened coconut milk sesame oil soy sauce fresh ginger -- minced brown sugar scallions -- cut into thin sticks DIPPING SAUCE creamy peanut butter unsweetened coconut milk pineapple juice soy sauce brown sugar fresh ginger -- chopped garlic -- chopped scallions -- minced hot pepper sauce |
| 0.408248290463863 | Curried Pecans | Curried Meat Balls, Pk | melted butter curry powder salt pecan halves -- or walnuts | curry powder dry bread crumbs few grains pepper egg -- beaten salt minced steak |
| 0.3872983346207416 | Crunchy Chocolate-Coconut Balls | Cranberry Coconut Fruit Balls | sweet chocolate squares butter egg -- slightly beaten confectioner's sugar coconut flakes whole wheat flakes -- crisp | dried apricots pecans fresh cranberries -- rinsed and drained grated orange peel -- from 1 orange butter confectioner's sugar graham cracker crumbs coconut flakes red food coloring green food coloring |
| 0.3749999999999999 | Baked Whole Garlic with French Bread | Rosemary Chicken Wings | garlic -- left whole olive oil clarified butter salt black pepper French bread loaf -- sliced and heated cream cheese or soft cheese (optional) butter -- softened | olive oil butter finely chopped shallots dried rosemary lemonade black pepper salt chicken wings |
| 0.3535533905932737 | Rhode Island Clam Cakes | Swedish Meat Ball Appetizers, Pk | flour baking powder salt pepper eggs milk clams, canned with liquid lard, or more -- for deep frying | cooking oil ground beef egg soft bread crumbs brown sugar salt pepper ginger ground cloves nutmeg cinnamon milk sour cream salt |
| 0.3380617018914066 | Sardine Appetizer | Copenhagens | mashed sardines minced pimento chopped olives mayonnaise lemon juice toasted bread stuffed olives | waterchestnuts -- canned, drained shrimp mayonnaise chopped parsley lemon juice |
| 0.3333333333333333 | Mushroom Individuals | Swedish Meat Ball Appetizers, Pk | button mushrooms butter garlic clove -- crushed dill weed salt pepper lemon juice sherry sour cream | cooking oil ground beef egg soft bread crumbs brown sugar salt pepper ginger ground cloves nutmeg cinnamon milk sour cream salt |
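
As a sanity check, the top score can be reproduced by hand. For binary ingredient vectors, cosine similarity reduces to the number of shared ingredients divided by the square root of the product of the two ingredient counts. The sketch below is plain Scala (no Spark); the object name `SetCosine` is hypothetical, and the two ingredient sets are my reading of the first row of the output above, where the line breaks between ingredients were lost, so the exact split is an assumption.

```scala
object SetCosine {
  // Cosine similarity of two binary vectors, written on sets:
  // dot product = size of the intersection, norms = sqrt of each set's size.
  def cosine(a: Set[String], b: Set[String]): Double =
    if (a.isEmpty || b.isEmpty) 0.0
    else (a intersect b).size / math.sqrt(a.size.toDouble * b.size)

  // Assumed ingredient split for "Cranberry Coconut Fruit Balls" (10 items)
  val cranberry: Set[String] = Set(
    "dried apricots", "pecans", "fresh cranberries -- rinsed and drained",
    "grated orange peel -- from 1 orange", "butter", "confectioner's sugar",
    "graham cracker crumbs", "coconut flakes", "red food coloring", "green food coloring"
  )

  // Assumed ingredient split for "Orange Coconut Balls" (7 items)
  val orange: Set[String] = Set(
    "frozen orange juice concentrate -- thawed", "butter", "water",
    "confectioner's sugar", "coconut flakes", "graham cracker crumbs", "chopped pecans"
  )
}
```

Under that split the two recipes share 4 ingredients, so `SetCosine.cosine(SetCosine.cranberry, SetCosine.orange)` evaluates 4 / sqrt(10 * 7), about 0.478, in line with the first score reported above. The set form only matches the matrix form when no recipe lists the same ingredient string twice.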




## Plotting it:

Scala/Spark does not offer many plotting options. For convenience, let's embed a static Plotly graph. I'll eventually figure out how to dynamically pass data to the graph from my app.

In [21]:
publish.display("table",("text/html" -> ("""<iframe width="900" height="800" frameborder="0" scrolling="no" src="https://plot.ly/~rmalarc/3.embed"></iframe>""")))



In [22]:
sc.stop



# Conclusions

* Spark is raw power, not bells and whistles and user friendliness. Here are some of the major limitations/issues I've found:
  * Lots of data types and disparate APIs. For instance, the cosine similarity is buried in a Matrix API, which you can't use directly if you have a plain RDD or SQL DataFrame. Lots of data conversions.
  * I only found that ONE similarity function (cosine). I didn't get to experiment with other functions.
  * I couldn't get it to reference other RDDs within another RDD's functor. For instance, I had to hardcode the length of the recipes array as I couldn't directly access it within the second iterator. For the same reason, I had to first calculate whatever I wanted to access from within the functor, and bake it into the RDD by using joins. Perhaps this is the Spark way. Lots of "Task not serializable" errors: http://stackoverflow.com/questions/22592811/task-not-serializable-java-io-notserializableexception-when-calling-function-ou
  * I couldn't get the `columnSimilarities` method to work properly with a matrix of sparse vectors. Due to time constraints, I turned them into dense vectors (not pretty).
* It was really cool to be able to completely parse a plain text list of recipes into a "dataset".
* The cosine similarity seems to work pretty nicely, even in a really sparse scenario such as this one. Better results should be obtained by mastering the list of ingredients (for example, unifying variants such as "cilantro" and "Cilantro -- minced").
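
The last point could start with something as simple as the sketch below. It assumes that everything after `--` is a preparation note, following the "Ingredient -- Preparation Method" header of the recipe table; the object `IngredientNames` and its `normalize` helper are hypothetical and were not part of the run above.

```scala
object IngredientNames {
  // Drop the preparation note (everything from "--" onward),
  // collapse repeated whitespace, and lowercase the result.
  def normalize(raw: String): String =
    raw
      .replaceFirst("\\s*--.*", "")
      .trim
      .replaceAll("\\s+", " ")
      .toLowerCase
}
```

With this rule, "Cilantro -- minced" and "cilantro" map to the same key, while entries such as "-- crumbled" collapse to an empty string and would need to be filtered out. Variants like "fresh cilantro" or "dried cilantro" are left distinct; merging those would need a curated synonym list.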
