<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://cdn2.hubspot.net/hubfs/438089/docs/training/dblearning-banner.png" alt="Databricks Learning" width="555" height="64">
</div>

&copy; 2018 Databricks, Inc. All rights reserved.<br/>

### Challenges
* The business wants better product recommendations and higher conversion on the website and in emails
* Data Science spends most of its time connecting and wrangling data, and very little on actual data science
* Data science is hard to scale from sample data to large data sets


### Azure Databricks Solutions
* With all the data in one place (Azure Storage, Azure Data Lake), data scientists can spend their time on data science
* Azure Databricks scales ML to GB, TB, or PB of data
* Easily go into production with ML (save results to CosmosDB)

### Why Initech uses Azure Databricks for ML
* With millions of users and 100,000s of products, product recommendations need more than a single machine
* Easy APIs for a newer data science team
* Store results in CosmosDB for online serving (emails, website, etc.)

#### Azure Databricks for Machine Learning and Data Scientists
![arch](https://kpistoropen.blob.core.windows.net/collateral/roadshow/azure_roadshow_ml.png)

# ![Spark Logo Tiny](https://s3-us-west-2.amazonaws.com/curriculum-release/images/105/logo_spark_tiny.png) Providing Product Recommendations

One of the most common uses of big data is to predict what users want.  This allows Google to show you relevant ads, Amazon to recommend relevant products, and Netflix to recommend movies that you might like.  This lab will demonstrate how we can use Apache Spark to recommend products to a user.  

We will start with some basic techniques, and then use the SparkML library's Alternating Least Squares method to make more sophisticated predictions. Here are the SparkML [Python docs](https://spark.apache.org/docs/latest/api/python/pyspark.ml.html) and the [Scala docs](https://spark.apache.org/docs/latest/api/scala/#org.apache.spark.ml.package).

For this lesson, we will use around 900,000 historical product ratings from our company Initech.

In this lab:
* *Part 0*: Exploratory Analysis
* *Part 1*: Collaborative Filtering
* *Part 2*: Your Recommendations

## ![Spark Logo Tiny](https://s3-us-west-2.amazonaws.com/curriculum-release/wiki-book/general/logo_spark_tiny.png) *Part 0:* Exploratory Analysis

Let's start by taking a look at our data.  It's already mounted for us as Parquet files under `dbfs:/mnt/training-sources/initech/productRatings/`.  Exploratory analysis should answer questions such as:

* How many observations do I have?
* What are the features?
* Do I have missing values?
* What do summary statistics (e.g. mean and variance) tell me about my data?
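
As a minimal sketch of how each of these questions maps to a computation, the plain-Python example below works on a handful of hypothetical rating rows (the values are made up for illustration and are not the lab data). On the Spark DataFrame used in this lab, the corresponding tools would be methods such as `count()`, `printSchema()`, and `describe()`.

```python
from statistics import mean, pvariance

# Hypothetical sample rows; the real lab data lives in a Spark DataFrame.
rows = [
    {"user_id": 1, "product_id": 10, "rating": 5.0},
    {"user_id": 1, "product_id": 11, "rating": 3.0},
    {"user_id": 2, "product_id": 10, "rating": 4.0},
    {"user_id": 3, "product_id": 12, "rating": None},  # a missing value
]

# How many observations do I have?
n_observations = len(rows)

# What are the features?
features = sorted(rows[0].keys())

# Do I have missing values? Count the None entries per column.
missing = {f: sum(1 for r in rows if r[f] is None) for f in features}

# Summary statistics over the ratings that are present.
ratings = [r["rating"] for r in rows if r["rating"] is not None]
rating_mean = mean(ratings)
rating_variance = pvariance(ratings)  # population variance

print(n_observations, features, missing, rating_mean, rating_variance)
```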

Start by importing the data.  Bind it to `product_ratings` by running the cell below.

In [7]:
product_ratings = spark.read.parquet("dbfs:/mnt/training-sources/initech/productRatings/")

In [8]:
display(product_ratings)

Take a count of the data using the `count()` DataFrame method.

In [10]:
#TO-DO
product_ratings.count()

#### Let's look at what these product_ids mean

* There is a product lookup dataset in parquet located here: `dbfs:/mnt/training-sources/initech/productsShort/`

In [12]:
#TO-DO
# Use the Spark Parquet reader, as for product_ratings above, to read the products data
product_df = spark.read.parquet("dbfs:/mnt/training-sources/initech/productsShort/")

In [13]:
display(product_df)

## ![Spark Logo Tiny](https://s3-us-west-2.amazonaws.com/curriculum-release/wiki-book/general/logo_spark_tiny.png) *Part 1:* Collaborative Filtering

The image below (from [Wikipedia][collab]) shows an example of predicting a user's rating using collaborative filtering. First, people rate different items (such as videos, products, articles, images, or games). The system then predicts a user's rating for an item that the user has not rated yet. These predictions are built on the existing ratings of other users whose ratings are similar to those of the active user. For instance, in the image below the system predicts that the active user will not like the video.  
![collaborative filtering](https://courses.edx.org/c4x/BerkeleyX/CS100.1x/asset/Collaborative_filtering.gif)
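
To make the idea concrete, here is a minimal plain-Python sketch of user-based collaborative filtering: the active user's missing rating is estimated as a similarity-weighted average of other users' ratings. The users, items, and ratings are hypothetical, and cosine similarity is just one possible choice of similarity measure. This neighborhood approach is shown only to illustrate the concept; it is not the ALS method used later in this lab.

```python
from math import sqrt

# Hypothetical ratings: user -> {item: rating on a 1-5 scale}.
ratings = {
    "alice": {"video_a": 5, "video_b": 1, "video_c": 4},
    "bob":   {"video_a": 5, "video_b": 1, "video_c": 5, "video_d": 1},
    "carol": {"video_a": 1, "video_b": 5, "video_d": 5},
}

def cosine_similarity(a, b):
    """Cosine similarity over the items that both users have rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = sqrt(sum(a[i] ** 2 for i in shared))
    norm_b = sqrt(sum(b[i] ** 2 for i in shared))
    return dot / (norm_a * norm_b)

def predict(user, item):
    """Similarity-weighted average of other users' ratings for the item."""
    weighted_sum = 0.0
    total_weight = 0.0
    for other, other_ratings in ratings.items():
        if other == user or item not in other_ratings:
            continue
        weight = cosine_similarity(ratings[user], other_ratings)
        weighted_sum += weight * other_ratings[item]
        total_weight += weight
    return weighted_sum / total_weight if total_weight else None

# Alice has not rated video_d; estimate her rating from Bob and Carol.
prediction = predict("alice", "video_d")
print(prediction)
```

Because Alice's ratings resemble Bob's more than Carol's, the estimate lands closer to Bob's low rating for `video_d` than to Carol's high one.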

[SparkML]: http://spark.apache.org/docs/latest/ml-guide.html
[collab]: https://en.wikipedia.org/?title=Collaborative_filtering
[collab2]: http://recommender-systems.org/collaborative-filtering/

In [15]:
# Hold out 60% of our data for training, 20% for validation, and 20% for testing
seed = 1800009193
(training_df, validation_df, test_df) = product_ratings.randomSplit([.6, .2, .2], seed=seed)

In [16]:
display(training_df)
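
The sketch below illustrates the idea behind a seeded, weighted random split in plain Python: each row gets an independent random draw and falls into the bucket whose cumulative weight range contains that draw. The `random_split` helper is hypothetical and only mimics the concept; it is not Spark's implementation. It does show two properties worth keeping in mind: the split sizes are approximate rather than exact, and reusing the same seed reproduces the same split.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row to one bucket by drawing a uniform random number."""
    total = sum(weights)
    # Cumulative, normalized upper bounds, e.g. roughly [0.6, 0.8, 1.0].
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)
    rng = random.Random(seed)
    buckets = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for index, bound in enumerate(bounds):
            if draw < bound:
                buckets[index].append(row)
                break
        else:
            # Guard against floating-point rounding in the last bound.
            buckets[-1].append(row)
    return buckets

rows = list(range(1000))  # stand-in for 1,000 rating rows
training, validation, test = random_split(rows, [0.6, 0.2, 0.2], seed=1800009193)
print(len(training), len(validation), len(test))
```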

### My Ratings
* Fill in your ratings for products in the `product_df` above
* Pick 5-10 product ids to rate
* Choose ratings from 1 to 5

In [18]:
#TO-DO
my_user_id = 0
my_rated_products = [
     (1, my_user_id, 5), # Replace with your ratings.
     (2, my_user_id, 5),
     (3, my_user_id, 5),
     (4, my_user_id, 5),
     (6, my_user_id, 1),
     (7, my_user_id, 1),
     (9, my_user_id, 1),
     ]

In [19]:
my_ratings_df = spark.createDataFrame(my_rated_products, ['product_id','user_id','rating'])

Join your ratings with the `product_df` to see your ratings with the product metadata

In [21]:
display(my_ratings_df.join(product_df, ['product_id']))

Union your ratings with `training_df` so that the model is also trained on your ratings

In [23]:
# unionByName (Spark 2.3+) matches columns by name; union() matches them by position
training_with_my_ratings_DF = training_df.unionByName(my_ratings_df)

### Alternating Least Squares

In this part, we will use the Apache Spark ML Pipeline implementation of Alternating Least Squares, [ALS (Python)](http://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.recommendation.ALS) or [ALS (Scala)](https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.recommendation.ALS). ALS takes a training dataset (DataFrame) and several parameters that control the model creation process.

The process we will use for determining the best model is as follows:
1. Pick a set of model parameters. The most important parameter of the model is the *rank*, which is the number of columns in the Users matrix or, equivalently, the number of rows in the Products matrix. In general, a lower rank will mean higher error on the training dataset, but a high rank may lead to [overfitting](https://en.wikipedia.org/wiki/Overfitting).  We will train a model with a rank of 2 using the `training_df` dataset.

2. Set the appropriate parameters on the `ALS` object:
    * The "User" column will be set to the values in our `user_id` DataFrame column.
    * The "Item" column will be set to the values in our `product_id` DataFrame column.
    * The "Rating" column will be set to the values in our `rating` DataFrame column.
    * We'll be using a regularization parameter of 0.1.
    
   **Note**: Read the documentation for the ALS class **carefully**. It will help you accomplish this step.
3. Have the ALS output transformation (i.e., the result of `ALS.fit()`) produce a _new_ column
   called "prediction" that contains the predicted value.

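4. Create the model using `ALS.fit()`. When comparing several rank values we would fit one model per rank; here we fit a single rank-2 model 
   against the training data set (`training_df` plus your own ratings).

5. We'll run our prediction against our validation data set (`validation_df`) and check the error.

6. Use `.setColdStartStrategy("drop")` so that rows with a `NaN` prediction are dropped. ALS cannot predict for users or products that were absent from the training data, and leaving those `NaN` values in would make the error metric `NaN` as well.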
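
To show what "alternating" means, here is a minimal plain-Python sketch of ALS under simplifying assumptions: the rank is 1 (so each least-squares solve reduces to a single division), the ratings are hypothetical, and every factor starts at 1.0. Spark's implementation differs in details such as rank, initialization, and how regularization is applied, so treat this as an illustration of the technique rather than a description of the library.

```python
# Hypothetical (user_id, product_id, rating) triples.
observed = [
    (0, 0, 5.0), (0, 1, 4.0),
    (1, 0, 4.0), (1, 1, 3.0), (1, 2, 2.0),
    (2, 1, 5.0), (2, 2, 3.0),
]
reg_param = 0.1
max_iter = 5

# Rank-1 factors: one number per user and one number per product.
user_factor = {u: 1.0 for u, _, _ in observed}
item_factor = {p: 1.0 for _, p, _ in observed}

def objective():
    """Squared error on observed ratings plus the L2 penalty on all factors."""
    error = sum((r - user_factor[u] * item_factor[p]) ** 2 for u, p, r in observed)
    penalty = sum(x ** 2 for x in user_factor.values())
    penalty += sum(y ** 2 for y in item_factor.values())
    return error + reg_param * penalty

history = [objective()]
for _ in range(max_iter):
    # Hold product factors fixed; solve a ridge regression for each user.
    for user in user_factor:
        own = [(p, r) for u, p, r in observed if u == user]
        numerator = sum(r * item_factor[p] for p, r in own)
        denominator = sum(item_factor[p] ** 2 for p, r in own) + reg_param
        user_factor[user] = numerator / denominator
    # Hold user factors fixed; solve a ridge regression for each product.
    for product in item_factor:
        own = [(u, r) for u, p, r in observed if p == product]
        numerator = sum(r * user_factor[u] for u, r in own)
        denominator = sum(user_factor[u] ** 2 for u, r in own) + reg_param
        item_factor[product] = numerator / denominator
    history.append(objective())

# Predict a rating that was never observed: user 0 on product 2.
predicted = user_factor[0] * item_factor[2]
print(history, predicted)
```

Each half-step exactly minimizes the regularized objective over one set of factors while the other is held fixed, which is why the values in `history` never increase.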

In [25]:
from pyspark.ml.recommendation import ALS

# Let's initialize our ALS learner
als = ALS()

# Now we set the parameters for the method
(als.setPredictionCol("prediction")
   .setUserCol("user_id")
   .setItemCol("product_id")
   .setRatingCol("rating")
   .setMaxIter(5)
   .setSeed(seed)
   .setRegParam(0.1)
   .setRank(2)
   .setColdStartStrategy("drop")
)

### Validation: 
Let's see how we did against known ratings.

In [27]:
#TO-DO
# Fit the model on the training data that includes your own ratings
model = als.fit(training_with_my_ratings_DF)
# Run the model to create predictions against validation_df
predict_df = model.transform(validation_df)

In [28]:
display(predict_df)
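
One common way to "check the error" is the root-mean-square error (RMSE) between the `rating` and `prediction` columns. The plain-Python sketch below uses hypothetical (rating, prediction) pairs to show the calculation and why dropping `NaN` predictions matters. On the Spark DataFrame, `pyspark.ml.evaluation.RegressionEvaluator` with `metricName="rmse"` is one option for computing the same metric.

```python
from math import isnan, nan, sqrt

# Hypothetical (rating, prediction) pairs; nan marks a cold-start row.
pairs = [(5.0, 4.5), (3.0, 3.5), (1.0, 2.0), (4.0, nan)]

def rmse(rows):
    """Root-mean-square error between ratings and predictions."""
    return sqrt(sum((rating - pred) ** 2 for rating, pred in rows) / len(rows))

# A single nan prediction makes the whole metric nan.
rmse_with_nan = rmse(pairs)

# Dropping those rows first mirrors the effect of the "drop" cold-start strategy.
kept = [(rating, pred) for rating, pred in pairs if not isnan(pred)]
rmse_dropped = rmse(kept)

print(rmse_with_nan, rmse_dropped)
```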

## ![Spark Logo Tiny](https://s3-us-west-2.amazonaws.com/curriculum-release/wiki-book/general/logo_spark_tiny.png) *Part 2:* Your Recommendations:
Let's look at what ALS recommended for your user based on your ratings.

In [30]:
#TO-DO
# Generate the top 10 product recommendations for every user,
# then filter the predictions DataFrame down to your own user ID
predictions = model.recommendForAllUsers(10)
my_predictions = predictions.filter(predictions.user_id == my_user_id)

In [31]:
display(my_predictions)

In [32]:
from pyspark.sql.functions import explode
# Turn the array of recommendations into one row per recommended product
my_recs = (my_predictions
   .select("user_id", explode("recommendations").alias("recommendations"))
   .select("user_id", "recommendations.product_id", "recommendations.rating")
)

In [33]:
display(my_recs)
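
The recommendations arrive as one row per user holding a nested list, and "exploding" that list produces one row per recommended product. The plain-Python sketch below shows the same reshaping on hypothetical data (the product IDs and ratings are made up).

```python
# Hypothetical output shaped like one row per user with a nested list.
nested = [
    {"user_id": 0, "recommendations": [
        {"product_id": 42, "rating": 4.8},
        {"product_id": 7, "rating": 4.5},
    ]},
]

# "Explode" the list: one output row per (user, recommendation) pair,
# then pull the nested fields up into top-level columns.
flat = [
    {
        "user_id": row["user_id"],
        "product_id": rec["product_id"],
        "rating": rec["rating"],
    }
    for row in nested
    for rec in row["recommendations"]
]

for row in flat:
    print(row)
```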

### Join

In [35]:
from pyspark.sql.functions import explode
# Same as above, then join with product_df to attach the product metadata
my_recs = (my_predictions
   .select("user_id", explode("recommendations").alias("recommendations"))
   .select("user_id", "recommendations.product_id", "recommendations.rating")
   .join(product_df, ['product_id'])
)

In [36]:
display(my_recs)
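
Joining on `product_id` with Spark's default join type is an inner join, so a recommendation only appears in the result if its product also exists in the lookup data. The plain-Python sketch below illustrates that behavior; the product names and the `name` column are hypothetical and not taken from the lab's dataset.

```python
# Hypothetical product lookup rows and recommendation rows.
products = [
    {"product_id": 42, "name": "Stapler"},
    {"product_id": 7, "name": "Printer"},
]
recs = [
    {"product_id": 42, "rating": 4.8},
    {"product_id": 7, "rating": 4.5},
    {"product_id": 99, "rating": 4.1},  # no matching product row
]

# Index the lookup table by the join key.
by_id = {p["product_id"]: p for p in products}

# Inner join: keep only recommendations whose product_id has a match.
joined = [
    {**rec, **by_id[rec["product_id"]]}
    for rec in recs
    if rec["product_id"] in by_id
]

for row in joined:
    print(row)
```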