# Movie Recommendations in BigQuery ML

Generate movie recommendations for users with BigQuery Machine Learning (BigQuery ML).

### 1. Connecting a Jupyter Notebook to BigQuery

Set the environment variable that the notebook uses to authenticate with BigQuery.

In [1]:
import os 
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'C:/Users/Rakyan Prajnagra/Documents/Data Engineer/GCP-DataEngineerLearningPath/Quest-MachineLearning/Quest-1-Movie-Recommendations-BigQueryML/qwiklabs-gcp-02-d1e4aca70f3d-fc50c64884f9.json'

Load the BigQuery client library's notebook magics by running the command below.

In [2]:
%reload_ext google.cloud.bigquery
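
The `%%bigquery` magic is a notebook convenience. As a rough sketch of the same idea in plain Python, the helper below runs a SQL string through the `google-cloud-bigquery` client and returns a DataFrame. The name `run_query` and its optional `client` argument are illustrative assumptions, not part of the original notebook, and `to_dataframe()` additionally needs pandas installed.

```python
def run_query(sql, client=None):
    """Run a SQL string and return the result as a pandas DataFrame.

    `run_query` is a hypothetical helper for illustration only.
    """
    if client is None:
        # Imported lazily so the helper can be defined even where
        # google-cloud-bigquery is not installed.
        from google.cloud import bigquery

        # The client picks up GOOGLE_APPLICATION_CREDENTIALS from the
        # environment, which is why the variable is set first.
        client = bigquery.Client()
    return client.query(sql).to_dataframe()


# Example call (needs valid credentials and network access):
# df = run_query("SELECT 1 AS ok")
# print(df)
```

Passing the client in as an argument keeps the helper easy to exercise with a stand-in object when no credentials are available.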

### 2. Create a New Dataset

The dataset is used to store the tables for this analysis. A new dataset named `movies` can be created with a SQL query.

In [3]:
%%bigquery
CREATE SCHEMA movies
OPTIONS(
    location = 'EU'
)

The `movies` dataset has been created. Its location is the `EU` multi-region.

### 3. Explore MovieLens Dataset

Before we make predictions for users, we first have to explore the `MovieLens` data.

The `MovieLens` data is available as CSV files in Google Cloud Storage. There are 2 files that we will load into BigQuery:
1. The MovieLens ratings file contains the rating given by each user for each movie
2. The MovieLens raw movies file contains each movie with its genres

In [4]:
%%bigquery
LOAD DATA OVERWRITE movies.movielens_ratings
FROM FILES (
  format = 'CSV',
  uris = ['gs://dataeng-movielens/ratings.csv'])

In [5]:
%%bigquery
LOAD DATA OVERWRITE movies.movielens_movies_raw
FROM FILES (
  format = 'CSV',
  uris = ['gs://dataeng-movielens/movies.csv'])

Both files were successfully loaded into BigQuery.

Let's check how many users gave ratings, how many movies were rated, and how many ratings there are in total.

In [6]:
%%bigquery
SELECT
  COUNT(DISTINCT userId) totalUsers,
  COUNT(DISTINCT movieId) totalMovies,
  COUNT(*) totalRatings
FROM
  movies.movielens_ratings

| | totalUsers | totalMovies | totalRatings |
|---|---|---|---|
| 0 | 138493 | 26744 | 20000263 |


We can confirm that the dataset consists of over 138 thousand users, nearly 27 thousand movies, and a little more than 20 million ratings.
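
As a quick back-of-the-envelope check, the three counts above also tell us how sparse the ratings are: most users have rated only a tiny share of the catalogue. The sketch below uses only the numbers from the query output; the variable names are illustrative.

```python
# Counts taken from the query output above.
total_users = 138_493
total_movies = 26_744
total_ratings = 20_000_263

# Number of (user, movie) pairs if every user had rated every movie.
possible_pairs = total_users * total_movies

# Fraction of those pairs that actually has a rating.
density = total_ratings / possible_pairs

print(f"possible pairs: {possible_pairs:,}")
print(f"rated fraction: {density:.2%}")
print(f"average ratings per user: {total_ratings / total_users:.1f}")
```

Roughly half a percent of all possible user-movie pairs has a rating, which is the gap a recommendation model tries to fill with predictions.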

Now let's look at the raw movies table.

In [7]:
%%bigquery
SELECT *
FROM
  movies.movielens_movies_raw
WHERE
  movieId < 5

| | movieId | title | genres |
|---|---|---|---|
| 0 | 3 | Grumpier Old Men (1995) | Comedy\|Romance |
| 1 | 4 | Waiting to Exhale (1995) | Comedy\|Drama\|Romance |
| 2 | 2 | Jumanji (1995) | Adventure\|Children\|Fantasy |
| 3 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy |


As we can see, the data type of the `genres` column is STRING, with the individual genres separated by `|`.

We can convert the `genres` column into an array with this query.

In [8]:
%%bigquery
CREATE OR REPLACE TABLE movies.movielens_movies 
AS
SELECT * 
REPLACE(SPLIT(genres, "|") AS genres)
FROM
  movies.movielens_movies_raw

The results are saved into a new table called `movielens_movies`.
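
To see what `SPLIT(genres, "|")` does to a single value, here is a small Python analogue using two rows from the output above. It only mirrors the string splitting and does not touch BigQuery; the variable names are illustrative.

```python
# Sample rows copied from the movielens_movies_raw output above.
raw_rows = [
    {"movieId": 1, "title": "Toy Story (1995)",
     "genres": "Adventure|Animation|Children|Comedy|Fantasy"},
    {"movieId": 3, "title": "Grumpier Old Men (1995)",
     "genres": "Comedy|Romance"},
]

# Same idea as SELECT * REPLACE(SPLIT(genres, "|") AS genres):
# keep every column, but swap the genres string for a list.
rows = [{**row, "genres": row["genres"].split("|")} for row in raw_rows]

# With genres as a list, filtering by one genre is a membership test.
comedies = [row["title"] for row in rows if "Comedy" in row["genres"]]
print(comedies)
```

Storing genres as an array is what later lets a query `UNNEST(genres)` and filter on a single genre such as `'Comedy'`.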

### 4. Make Movie Recommendation

After exploring the dataset, the next step is to create and train a model. The model we want predicts the rating a user would give to a movie they have not yet seen, based only on the ratings that user gave to other movies.

A trained model is already available as `cloud-training-prod-bucket.movies.movie_recommender` (the `movies` dataset in the `cloud-training-prod-bucket` project), so we don't have to create the model again. We will evaluate that model.

In [9]:
%%bigquery
SELECT * FROM ML.EVALUATE(MODEL `cloud-training-prod-bucket.movies.movie_recommender`)

| | mean_absolute_error | mean_squared_error | mean_squared_log_error | median_absolute_error | r2_score | explained_variance |
|---|---|---|---|---|---|---|
| 0 | 0.653355 | 0.736362 | 0.052166 | 0.521292 | 0.328412 | 0.328415 |


The query above shows the evaluation metrics for the trained model. Use MSE (`mean_squared_error`) or MAE (`mean_absolute_error`) when comparing two or more models: the lower the value, the better. Here the MAE of about 0.65 means that predictions are off by roughly two-thirds of a rating point on average. The R2 score (`r2_score`) measures how well a regression model fits the data. It is at most 1 (it can drop below 0 for a model that fits worse than always predicting the mean), and the closer to 1, the better the fit.
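
To make these metrics concrete, here is a minimal pure-Python sketch that computes MAE, MSE, and the R2 score using their standard textbook definitions. The ratings are made-up illustrative numbers, not output from the model, and `regression_metrics` is a hypothetical helper name.

```python
def regression_metrics(actual, predicted):
    """Compute MAE, MSE and R2 for two equally long lists of numbers."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]

    mae = sum(abs(e) for e in errors) / n   # mean absolute error
    mse = sum(e * e for e in errors) / n    # mean squared error

    # R2 compares the model's squared error with that of a baseline
    # that always predicts the mean of the actual values.
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    if ss_tot == 0:
        raise ValueError("R2 is undefined when all actual values are equal")
    ss_res = sum(e * e for e in errors)
    r2 = 1 - ss_res / ss_tot

    return {"mae": mae, "mse": mse, "r2": r2}


# Made-up ratings for illustration only.
actual = [4.0, 3.0, 5.0, 2.0]
predicted = [3.5, 3.0, 4.0, 3.0]
metrics = regression_metrics(actual, predicted)
print(metrics)
```

Because MSE squares each error, a few large misses raise it much more than they raise MAE, which is worth keeping in mind when the two metrics rank models differently.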

After evaluating the model, we can use it to provide recommendations.

Recommend the best comedy movies for user `903`.

In [11]:
%%bigquery
SELECT *
FROM ML.PREDICT(MODEL `cloud-training-prod-bucket.movies.movie_recommender`,
  (
    SELECT
      movieId,
      title,
      903 AS userId
    FROM `movies.movielens_movies`,
    UNNEST(genres) g
    WHERE g = 'Comedy' 
  ))
ORDER BY predicted_rating DESC
LIMIT 5  

| | predicted_rating | movieId | title | userId |
|---|---|---|---|---|
| 0 | 6.305485 | 82978 | Neighbors (1920) | 903 |
| 1 | 5.659956 | 26136 | Hallelujah Trail, The (1965) | 903 |
| 2 | 5.608128 | 69075 | Trojan War (1997) | 903 |
| 3 | 5.423441 | 3337 | I'll Never Forget What's'isname (1967) | 903 |
| 4 | 5.301408 | 6167 | Stand-In (1937) | 903 |


The result shows the recommended movies for user `903`. Because the query does not filter anything out, it may include movies that the user has already seen and rated.

To exclude the movies that the user has already seen, execute the query below.

In [12]:
%%bigquery
SELECT *
FROM ML.PREDICT(MODEL `cloud-training-prod-bucket.movies.movie_recommender`,
  (
    WITH seen AS (
      SELECT ARRAY_AGG(movieId) AS movies
      FROM movies.movielens_ratings
      WHERE userId = 903 )
    SELECT
      movieId,
      title,
      903 AS userId
    FROM
      movies.movielens_movies,
      UNNEST(genres) g,
      seen
    WHERE g = 'Comedy' AND movieId NOT IN UNNEST(seen.movies) 
  ))
ORDER BY predicted_rating DESC
LIMIT 5

| | predicted_rating | movieId | title | userId |
|---|---|---|---|---|
| 0 | 6.305485 | 82978 | Neighbors (1920) | 903 |
| 1 | 5.659956 | 26136 | Hallelujah Trail, The (1965) | 903 |
| 2 | 5.608128 | 69075 | Trojan War (1997) | 903 |
| 3 | 5.423441 | 3337 | I'll Never Forget What's'isname (1967) | 903 |
| 4 | 5.301408 | 6167 | Stand-In (1937) | 903 |


The result looks the same as before. That is because we limit the output to only 5 movies, and apparently none of the top predicted comedy movies had been watched and rated by user `903`.
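
The filtering step can be sketched in plain Python to show why the output may or may not change. This is only an analogue of the `NOT IN UNNEST(seen.movies)` condition: it filters after prediction, whereas the SQL filters the candidate movies before they reach `ML.PREDICT`. The helper name `recommend_unseen` and the sample predictions are hypothetical.

```python
def recommend_unseen(predictions, seen_movie_ids, k=5):
    """Return the k highest-rated predictions for movies not yet seen."""
    seen = set(seen_movie_ids)  # set membership checks are fast
    unseen = [p for p in predictions if p["movieId"] not in seen]
    ranked = sorted(unseen, key=lambda p: p["predicted_rating"], reverse=True)
    return ranked[:k]


# Made-up predictions for illustration only.
predictions = [
    {"movieId": 1, "predicted_rating": 4.8},
    {"movieId": 2, "predicted_rating": 4.5},
    {"movieId": 3, "predicted_rating": 4.1},
    {"movieId": 4, "predicted_rating": 3.9},
]

# The user has seen movies 1 and 3, so the top 2 changes.
top_filtered = recommend_unseen(predictions, seen_movie_ids=[1, 3], k=2)
print([p["movieId"] for p in top_filtered])

# The user has only seen movie 4, so the top 2 is unaffected.
top_same = recommend_unseen(predictions, seen_movie_ids=[4], k=2)
print([p["movieId"] for p in top_same])
```

The second call mirrors what happened for user `903`: the seen movies rank below the cut-off, so removing them leaves the top results unchanged.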

### 5. User Prediction

Suppose American Mullet (2001) needs more reviews from users. Identify the users who are likely to rate it the highest.

In [13]:
%%bigquery
SELECT *
FROM ML.PREDICT(MODEL `cloud-training-prod-bucket.movies.movie_recommender`,
  (
    WITH allUsers AS (
      SELECT DISTINCT userId
      FROM movies.movielens_ratings )
    SELECT 
      96481 AS movieId, 
      (
        SELECT title
        FROM movies.movielens_movies
        WHERE movieId=96481) title,
      userId
      FROM allUsers 
  ))
ORDER BY predicted_rating DESC
LIMIT 100

| | predicted_rating | movieId | title | userId |
|---|---|---|---|---|
| 0 | 6.000194 | 96481 | American Mullet (2001) | 104104 |
| 1 | 5.928113 | 96481 | American Mullet (2001) | 57703 |
| 2 | 5.902559 | 96481 | American Mullet (2001) | 22625 |
| 3 | 5.882102 | 96481 | American Mullet (2001) | 118093 |
| 4 | 5.740621 | 96481 | American Mullet (2001) | 37594 |
| ... | ... | ... | ... | ... |
| 95 | 5.082952 | 96481 | American Mullet (2001) | 112039 |
| 96 | 5.081513 | 96481 | American Mullet (2001) | 90701 |
| 97 | 5.080543 | 96481 | American Mullet (2001) | 136866 |
| 98 | 5.079940 | 96481 | American Mullet (2001) | 120799 |


The result gives us the 100 users with the highest predicted rating for the movie, who can be asked for a review.

### 6. Prediction For All Users & Movies

The query below produces predictions for all the users and movies encountered during training.

In [14]:
%%bigquery
SELECT *
FROM ML.RECOMMEND(MODEL `cloud-training-prod-bucket.movies.movie_recommender`)
LIMIT 100000

| | predicted_rating | userId | movieId |
|---|---|---|---|
| 0 | 4.263859 | 136704 | 4096 |
| 1 | 4.570833 | 136704 | 8448 |
| 2 | 2.300742 | 136704 | 76032 |
| 3 | 1.059709 | 136704 | 7169 |
| 4 | 3.638563 | 136704 | 8705 |
| ... | ... | ... | ... |
| 99995 | 4.589922 | 103434 | 79551 |
| 99996 | 3.652789 | 103434 | 71104 |
| 99997 | 3.149116 | 103434 | 106688 |
| 99998 | 4.482003 | 103434 | 69569 |


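The result is too large to handle if we don't use a `LIMIT` clause, because the model returns a predicted rating for every combination of user and movie. That output also covers movies a user has already seen and rated, which is why we filter those out before making recommendations.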
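
One way to keep such a large output manageable is to retain only the few best predictions per user while streaming through the rows. The sketch below assumes the rows arrive as `(userId, movieId, predicted_rating)` tuples; the helper name `top_k_per_user` is hypothetical, and the sample rows are the first five from the output above.

```python
import heapq
from collections import defaultdict


def top_k_per_user(rows, k=3):
    """Keep the k highest predicted ratings for each user.

    Each user gets a min-heap of at most k (rating, movieId) pairs, so
    memory grows with users * k rather than with the number of rows.
    """
    heaps = defaultdict(list)
    for user_id, movie_id, rating in rows:
        heap = heaps[user_id]
        if len(heap) < k:
            heapq.heappush(heap, (rating, movie_id))
        else:
            # Push the new pair, then drop the lowest-rated one.
            heapq.heappushpop(heap, (rating, movie_id))
    # Return each user's pairs sorted from best to worst.
    return {user: sorted(heap, reverse=True) for user, heap in heaps.items()}


# First five rows of the ML.RECOMMEND output above.
sample_rows = [
    (136704, 4096, 4.263859),
    (136704, 8448, 4.570833),
    (136704, 76032, 2.300742),
    (136704, 7169, 1.059709),
    (136704, 8705, 3.638563),
]
best = top_k_per_user(sample_rows, k=2)
print(best)
```

For user `136704` this keeps movies `8448` and `4096`, the two highest predicted ratings among the five sample rows.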