# Matrix multiplication

To understand matrix multiplication more directly, let's do some matrix operations manually.

In [2]:
import numpy as np
import pandas as pd
a = np.array([[2, 2], [3, 3]])
b = np.array([[1, 2], [4, 4]])
a = pd.DataFrame(a, index=['One', 'Two'], columns=[0, 1])
b = pd.DataFrame(b, index=['One', 'Two'], columns=[0, 1])

# Use the .head() method to view the contents of matrices a and b
print("Matrix A: ")
print(a.head())

print("Matrix B: ")
print(b.head())

# Complete the matrix with the product of matrices a and b
product = np.array([[10, 12], [15, 18]])

# Run this validation to see how your estimate performs
product == np.dot(a, b)

Matrix A: 
     0  1
One  2  2
Two  3  3
Matrix B: 
     0  1
One  1  2
Two  4  4


array([[ True,  True],
       [ True,  True]])
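
To make the "by hand" rule concrete, here is a minimal sketch that computes each entry of the product as the dot product of a row of a with a column of b. The helper name `matmul_by_hand` is made up for this illustration, and plain NumPy arrays are used instead of DataFrames.

```python
import numpy as np

a = np.array([[2, 2], [3, 3]])
b = np.array([[1, 2], [4, 4]])

def matmul_by_hand(x, y):
    """Multiply two 2-D arrays entry by entry (illustrative helper, not a NumPy API)."""
    rows, inner = x.shape
    inner_y, cols = y.shape
    if inner != inner_y:
        raise ValueError("columns of x must equal rows of y")
    out = np.zeros((rows, cols), dtype=x.dtype)
    for i in range(rows):
        for j in range(cols):
            # Entry (i, j) is row i of x dotted with column j of y
            out[i, j] = sum(x[i, k] * y[k, j] for k in range(inner))
    return out

product = matmul_by_hand(a, b)
print(product)
print(np.array_equal(product, a @ b))
```

For example, the top-left entry is 2*1 + 2*4 = 10, which matches the first value entered in the exercise.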

# Matrix multiplication part II

Let's put your matrix multiplication skills to the test.

In [3]:
# Print the dimensions of a
print(a.shape)

# Print the dimensions of b
print(b.shape)

# Can a and b be multiplied together?
a_times_b = True

(2, 2)
(2, 2)
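
The shape check can be written as a one-line rule: two matrices can be multiplied when the number of columns of the left matrix equals the number of rows of the right matrix. The sketch below assumes 2-D NumPy arrays. The function `can_multiply` and the matrices `c` and `d` are hypothetical names added for illustration.

```python
import numpy as np

def can_multiply(x, y):
    """Return True when x @ y is defined for 2-D arrays x and y."""
    return x.shape[1] == y.shape[0]

a = np.array([[2, 2], [3, 3]])  # shape (2, 2)
c = np.ones((2, 3))             # hypothetical matrix, shape (2, 3)
d = np.ones((4, 2))             # hypothetical matrix, shape (4, 2)

print(can_multiply(a, c))  # True: (2, 2) times (2, 3) gives shape (2, 3)
print(can_multiply(c, d))  # False: 3 columns versus 4 rows
print(can_multiply(d, c))  # True: order matters, (4, 2) times (2, 3) gives (4, 3)
```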


# Matrix factorization

Matrix G is provided here as a pandas DataFrame. View it to understand what it looks like. Then look at the candidate factor matrices H, I, and J (also pandas DataFrames) and determine which two of them produce G when multiplied together.

In [4]:
# # Take a look at Matrix G using the following print function
# print("Matrix G:")
# print(G)

# # Take a look at the matrices H, I, and J and determine which pair of those matrices will produce G when multiplied together. 
# print("Matrix H:")
# print(H)
# print("Matrix I:")
# print(I)
# print("Matrix J:")
# print(J)

# # Multiply the two matrices that are factors of the matrix G
# prod = np.matmul(H, J)
# print(G == prod)
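
Since G, H, I, and J are not shown here, the following sketch uses small made-up matrices with the same names to show one way to search for the factor pair: try every ordered pair and keep the ones whose product equals G. The values are hypothetical and are not the exercise's data.

```python
from itertools import permutations

import numpy as np

# Hypothetical candidate factors (not the matrices from the exercise)
candidates = {
    "H": np.array([[1, 2], [0, 1]]),
    "I": np.array([[2, 0], [1, 1]]),
    "J": np.array([[3, 1], [1, 2]]),
}
G = np.array([[5, 5], [1, 2]])

# Order matters in matrix multiplication, so test ordered pairs
matches = [
    (left, right)
    for left, right in permutations(candidates, 2)
    if np.array_equal(candidates[left] @ candidates[right], G)
]
print(matches)
```

With these values only the pair (H, J) reproduces G. Reversing the order to (J, H) gives a different matrix.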

# Non-negative matrix factorization

A single matrix can have two equally accurate factorizations: one whose factors contain only non-negative values and another whose factors contain some negative values.

The matrix M has been factored in two different ways. Take a look at each pair of factor matrices, L and U, and W and H, to see how they differ. Then compare their products to see that both pairs reproduce essentially the same matrix.

In [5]:
# # View the L, U, W, and H matrices.
# print("Matrices L and U:") 
# print(L)
# print(U)

# print("Matrices W and H:")
# print(W)
# print(H)

# # Calculate RMSE between LU and M
# print("RMSE of LU: ", getRMSE(LU, M))

# # Calculate RMSE between WH and M
# print("RMSE of WH: ", getRMSE(WH, M))
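
The idea can be reproduced with a small made-up example. In the sketch below, W and H are non-negative, while L and U are built from them by inserting an invertible matrix and its inverse between the factors, which leaves the product unchanged but introduces a negative entry. All values are hypothetical, and `get_rmse` is an assumed stand-in for the exercise's `getRMSE` helper, whose implementation is not shown.

```python
import numpy as np

def get_rmse(preds, actuals):
    """Root-mean-square error between two arrays of equal shape (assumed stand-in)."""
    return float(np.sqrt(np.mean((preds - actuals) ** 2)))

M = np.array([[1, 2], [4, 3]])

# Non-negative factorization
W = np.array([[1, 0], [1, 1]])
H = np.array([[1, 2], [3, 1]])

# A second factorization of the same matrix that contains a negative value
L = np.array([[1, -1], [1, 0]])
U = np.array([[4, 3], [3, 1]])

print("RMSE of LU:", get_rmse(L @ U, M))
print("RMSE of WH:", get_rmse(W @ H, M))
print("Negative entries in L:", bool((L < 0).any()))
print("Negative entries in W or H:", bool((W < 0).any() or (H < 0).any()))
```

Both products equal M exactly here, so the two RMSE values are zero even though only one factorization is non-negative.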

# Estimating recommendations

Use your knowledge of matrix multiplication to determine which movie will have the highest recommendation for User_3. The ratings matrix has been factorized into U and P with ALS.

In [6]:
# # View left factor matrix
# print(U)
# # View right factor matrix
# print(P)
# # Multiply factor matrices
# UP = np.matmul(U,P)

# # Convert to pandas DataFrame
# print(pd.DataFrame(UP, columns = P.columns, index = U.index))
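
As an illustration of the same steps, the sketch below uses made-up factor matrices: U holds one row of latent features per user and P holds one column per movie. The user names, movie names, and values are hypothetical. The factors are converted to NumPy arrays before multiplying so that the labels cannot interfere with the product.

```python
import pandas as pd

# Hypothetical user factors (4 users x 2 latent features)
U = pd.DataFrame(
    [[0.5, 1.0], [1.5, 0.2], [0.3, 2.0], [1.0, 1.0]],
    index=["User_1", "User_2", "User_3", "User_4"],
)
# Hypothetical movie factors (2 latent features x 3 movies)
P = pd.DataFrame(
    [[2.0, 1.0, 0.5], [0.5, 2.0, 1.0]],
    columns=["Movie_A", "Movie_B", "Movie_C"],
)

# Multiply the factors and keep the user and movie labels
UP = pd.DataFrame(U.to_numpy() @ P.to_numpy(), index=U.index, columns=P.columns)
print(UP)

# The highest predicted rating in User_3's row is the top recommendation
best_movie = UP.loc["User_3"].idxmax()
print(best_movie)
```

For User_3 the predicted scores are 0.3*2.0 + 2.0*0.5 = 1.6, 0.3*1.0 + 2.0*2.0 = 4.3, and 0.3*0.5 + 2.0*1.0 = 2.15, so Movie_B ranks first in this made-up data.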

# RMSE as ALS alternates

As you know, ALS alternates between the two factor matrices, adjusting the values of one while holding the other fixed, so that their product comes closer and closer to the original ratings matrix. This exercise illustrates that process.

In [7]:
# # Use getRMSE(preds, actuals) to calculate the RMSE of matrices T and F1.
# getRMSE(F1, T)

# # Create list of F2, F3, F4, F5, and F6
# Fs = [F2, F3, F4, F5, F6]

# # Calculate RMSE for F2, F3, F4, F5, and F6.
# getRMSEs(Fs, T)
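
The alternation can be sketched in a few lines of NumPy. This is a simplified illustration, not Spark's implementation: it assumes a fully observed ratings matrix and uses plain least squares with no regularization, whereas real ALS handles missing ratings and a regularization term. The matrix T below is made up, and `get_rmse` is an assumed stand-in for the exercise's `getRMSE`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical fully observed ratings matrix (4 users x 5 movies)
T = np.array([
    [5, 4, 1, 1, 3],
    [4, 5, 1, 2, 3],
    [1, 1, 5, 4, 2],
    [2, 1, 4, 5, 3],
], dtype=float)

rank = 2
U = rng.random((T.shape[0], rank))  # user factors, random start
P = rng.random((rank, T.shape[1]))  # movie factors, random start

def get_rmse(preds, actuals):
    """Root-mean-square error between two arrays of equal shape (assumed stand-in)."""
    return float(np.sqrt(np.mean((preds - actuals) ** 2)))

rmses = [get_rmse(U @ P, T)]
for _ in range(5):
    # Hold U fixed and solve the least-squares problem for P
    P = np.linalg.lstsq(U, T, rcond=None)[0]
    rmses.append(get_rmse(U @ P, T))
    # Hold P fixed and solve the least-squares problem for U
    U = np.linalg.lstsq(P.T, T.T, rcond=None)[0].T
    rmses.append(get_rmse(U @ P, T))

for step, value in enumerate(rmses):
    print(step, round(value, 4))
```

Because each half-step minimizes the squared error over one factor matrix, the RMSE in this sketch never increases from one step to the next.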

# Correct format and distinct users

Take a look at the R DataFrame. Notice that it is in conventional or "wide" format, with a different movie in each column. Also notice that the user and movie names are not in integer format. Follow the steps to properly prepare this data for ALS.

In [8]:
# # Import monotonically_increasing_id and show R
# from pyspark.sql.functions import monotonically_increasing_id
# R.show()

# # Use the to_long() function to convert the dataframe to the "long" format.
# ratings = to_long(R)
# ratings.show()

# # Get unique users and repartition to 1 partition
# users = ratings.select("User").distinct().coalesce(1)

# # Create a new column of unique integers called "userId" in the users dataframe.
# users = users.withColumn("userId", monotonically_increasing_id()).persist()
# users.show()
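
The reshaping step can be previewed locally with pandas. The sketch below is an analogue of the Spark steps, not the `to_long()` helper itself, which is not shown here. The table R and its column names are made up for illustration.

```python
import pandas as pd

# Hypothetical wide ratings table: one row per user, one column per movie
R = pd.DataFrame({
    "User": ["Ann", "Bo", "Cy"],
    "Movie_A": [4.0, None, 1.0],
    "Movie_B": [None, 5.0, 2.0],
})

# Wide to long: one row per (user, movie, rating), skipping missing ratings
ratings = R.melt(id_vars="User", var_name="Movie", value_name="Rating").dropna()

# Replace the string names with integer IDs
ratings = ratings.assign(
    userId=pd.factorize(ratings["User"])[0],
    movieId=pd.factorize(ratings["Movie"])[0],
)
print(ratings)
```

Each distinct user and movie name receives one integer, and only the ratings that actually exist remain as rows.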

# Assigning integer IDs to movies

Let's do the same thing to the movies. Then let's join the new user IDs and movie IDs into one dataframe.

In [9]:
# # Extract the distinct movies
# movies = ratings.select("Movie").distinct() 

# # Repartition the data to have only one partition.
# movies = movies.coalesce(1) 

# # Create a new column of movieId integers. 
# movies = movies.withColumn("movieId", monotonically_increasing_id()) 

# # Join the ratings, users and movies dataframes
# movie_ratings = ratings.join(users, "User", "left").join(movies, "Movie", "left")
# movie_ratings.show()
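
The reason for coalescing to one partition before calling monotonically_increasing_id() is easier to see with a small emulation. Spark's documentation describes the current implementation as placing the partition ID in the upper bits and the record number within the partition in the lower 33 bits. The function `emulated_ids` below is a hypothetical pure-Python imitation of that layout, not a Spark API.

```python
def emulated_ids(partition_sizes):
    """Imitate the documented ID layout: partition index shifted left by 33 bits, plus row number."""
    ids = []
    for partition_index, size in enumerate(partition_sizes):
        for row in range(size):
            ids.append((partition_index << 33) + row)
    return ids

two_partitions = emulated_ids([3, 3])
one_partition = emulated_ids([6])

print(two_partitions)  # IDs jump at the partition boundary
print(one_partition)   # IDs are consecutive
```

With two partitions of three rows each, the IDs are 0, 1, 2, 8589934592, 8589934593, 8589934594. With a single partition they are simply 0 through 5, which gives compact integer IDs.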

# Build out an ALS model

Let's specify your first ALS model. Complete the code below to build your first ALS model.

Recall that you can use the .columns attribute of the ratings DataFrame to see the names of the columns that contain the user, movie, and rating data. Spark needs to know the names of these columns in order to perform ALS correctly.

In [10]:
# # Split the ratings dataframe into training and test data
# (training_data, test_data) = ratings.randomSplit([0.8, 0.2], seed=42)

# # Set the ALS hyperparameters
# from pyspark.ml.recommendation import ALS
# als = ALS(userCol="userId", itemCol="movieId", ratingCol="rating", rank=10, maxIter=15, regParam=0.1,
#           coldStartStrategy="drop", nonnegative=True, implicitPrefs=False)

# # Fit the model to the training_data
# model = als.fit(training_data)

# # Generate predictions on the test_data
# test_predictions = model.transform(test_data)
# test_predictions.show()
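
One of the settings above, coldStartStrategy="drop", matters for evaluation. A user or movie that appears only in the test data has no learned factors, so its prediction is NaN, and a single NaN makes the RMSE NaN as well. The NumPy sketch below imitates that effect with made-up numbers. It is an illustration of the idea, not Spark code.

```python
import numpy as np

actual = np.array([4.0, 3.0, 5.0])
# Hypothetical predictions: the middle one belongs to a user unseen during training
predicted = np.array([3.5, np.nan, 4.0])

# RMSE over all rows is NaN because of the one missing prediction
rmse_with_nan = float(np.sqrt(np.mean((predicted - actual) ** 2)))

# Dropping rows with NaN predictions gives a usable metric
keep = ~np.isnan(predicted)
rmse_dropped = float(np.sqrt(np.mean((predicted[keep] - actual[keep]) ** 2)))

print(rmse_with_nan)
print(round(rmse_dropped, 4))
```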

# Build RMSE evaluator

Now that you know how to fit a model to training data and generate test predictions, you need a way to evaluate how well your model performs. For this we'll build an evaluator. Evaluators in Spark can be built in various ways. For our purposes, we want a RegressionEvaluator that calculates the RMSE. Once the evaluator is built, we can fit the model to our data, generate predictions, and score them.

In [11]:
# # Import RegressionEvaluator
# from pyspark.ml.evaluation import RegressionEvaluator

# # Complete the evaluator code
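# # labelCol must match the name of the ratings column in the data ("rating", as passed to ALS above)
# evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating", predictionCol="prediction")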

# # Extract the 3 parameters
# print(evaluator.getMetricName())
# print(evaluator.getLabelCol())
# print(evaluator.getPredictionCol())
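
For reference, the "rmse" metric is the standard root-mean-square error. Writing $r_i$ for an actual rating, $\hat{r}_i$ for the matching prediction, and $n$ for the number of rated pairs being evaluated (symbols chosen here for illustration):

$$
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(r_i - \hat{r}_i\right)^2}
$$

As a small worked example with made-up numbers, actual ratings 4, 3, 5 and predictions 3.5, 3, 4 give squared errors 0.25, 0, and 1, so

$$
\mathrm{RMSE} = \sqrt{\frac{0.25 + 0 + 1}{3}} \approx 0.645
$$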

# Get RMSE

Now that you know how to build a model and generate predictions, and have an evaluator to tell you how well it predicts ratings, you can calculate the RMSE to see how well the ALS model performed. We'll use the evaluator built in the previous exercise to calculate and print the RMSE.

In [12]:
# # Evaluate the "test_predictions" dataframe
# RMSE = evaluator.evaluate(test_predictions)

# # Print the RMSE
# print(RMSE)