# Question 1: Hands-on demonstration

## Installing pymongo (if not installed already)
Start the Anaconda command prompt. Then:
- Execute the following: `conda install -c anaconda pymongo`
- After the installation is complete, execute the following: `pip install "pymongo[srv]"` (the `srv` extra is needed for `mongodb+srv://` connection strings)

After both installations are done, close the command prompt and shut down Jupyter. Then restart it.

In [None]:
# Import required libraries

import pandas as pd
import numpy as np
from pymongo import MongoClient
import time

In [None]:
# Establish a connection to a MongoDB database

# Create a connection string containing the necessary credentials and connection details
conn_string = "mongodb+srv://movielens:movielens123@cluster0.dadyq.mongodb.net/myFirstDatabase?retryWrites=true&w=majority"

# Create a MongoClient instance with the connection string
client = MongoClient(conn_string)


## Define the number of iterations

In [None]:
# Define how many times each query will be executed
# Query execution time will be calculated as the average of these iterations

num_iterations = 5

# Option 1: Normalized schema

In [None]:
# Select the first database option called ML_Option_1

db = client.ML_Option_1

In the code below, we retrieve the fields needed for the utility matrix (`movie_id`, `user_id`, and `rating`) from the `RATINGS` collection and convert the result into a Pandas DataFrame. We measure the execution time of the query as the difference between two `time.time()` readings taken immediately before and after it, so each measurement covers the full round trip, including network transfer and DataFrame construction on the client. A single measurement can be unreliable because of factors such as database caching, so we run the query six times: the first run warms up the database and is discarded, and the execution times of the remaining five runs are recorded and averaged. Finally, the mean execution time is printed.
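
As a standalone illustration of this warm-up-then-average pattern (separate from the notebook cell that follows), here is a minimal sketch that wraps it in a reusable helper. The helper name `mean_query_time`, its parameters, and the stand-in workload are assumptions made for illustration; the sketch also swaps `time.time()` for `time.perf_counter()`, which is a monotonic clock intended for measuring short durations.

```python
import time
import statistics

def mean_query_time(run_query, num_iterations=5, warmup_runs=1):
    """Run run_query repeatedly; return (mean_seconds, timings), ignoring warm-up runs."""
    timings = []
    for i in range(warmup_runs + num_iterations):
        start = time.perf_counter()
        run_query()
        elapsed = time.perf_counter() - start
        # Only record runs that come after the warm-up phase
        if i >= warmup_runs:
            timings.append(elapsed)
    return statistics.mean(timings), timings

# Stand-in workload (hypothetical); in the notebook this would be a callable
# that executes the find() or aggregate() query and builds the DataFrame
mean_seconds, timings = mean_query_time(lambda: sum(range(100_000)))
print(f"Mean over {len(timings)} timed runs: {mean_seconds:.4f} s")
```

With a helper like this, the three design options could be timed by passing a different callable for each, which keeps the measurement logic identical across options.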

In [None]:
# Create a list to store the execution times
option1_exe_times = []

# Execute the query as many times as defined in num_iterations
# One additional iteration is performed first to warm up the database

for i in range(0, num_iterations+1):
    # Capture the start time
    start_time = time.time()
    
    # Execute the query
    data = pd.DataFrame(list(db.RATINGS.find({},{'movie_id': 1,
                                                 'user_id': 1, 
                                                 'rating': 1})))
    
    # Capture the end time
    end_time = time.time()
    
    # Capture the execution time of the iteration unless it is the first one
    # The first iteration is used for warming up the database    
    if i != 0:
        option1_exe_times.append(end_time - start_time)
        print(f'Iteration {i} execution time: {end_time - start_time}')

# Calculate the mean execution time
option1_mean_time = np.mean(option1_exe_times)

# Print the mean execution time
print(f'\nMean execution time: {option1_mean_time}')

In [None]:
# Show the data retrieved from the query

data

In the code below, we use the Pandas `pivot_table` function to create the utility matrix, with users as rows and movies as columns. Because the same user may have rated the same movie more than once, we average the ratings for each user-movie pair; pairs with no rating are filled with 0. The execution time of this cell is not relevant for evaluating the design options, because it operates on data already held in the memory of the machine running this notebook.
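
To make the averaging and zero-filling behavior concrete, here is a small standalone sketch on made-up data (the `ratings` DataFrame is hypothetical, not taken from the database). User 1 rates movie 10 twice, and user 2 has no rating for movie 20.

```python
import pandas as pd

# Hypothetical ratings: user 1 rated movie 10 twice (4 and 2)
ratings = pd.DataFrame({
    "user_id":  [1, 1, 1, 2],
    "movie_id": [10, 10, 20, 10],
    "rating":   [4, 2, 5, 3],
})

# Rows are users, columns are movies; duplicate pairs are averaged,
# and user-movie pairs with no rating are filled with 0
utility = pd.pivot_table(ratings, values="rating", index="user_id",
                         columns="movie_id", aggfunc="mean", fill_value=0)
print(utility)
```

The two ratings of movie 10 by user 1 collapse into their mean of 3, and the missing pair (user 2, movie 20) becomes 0. Note that this convention makes an unrated movie indistinguishable from a rating of exactly 0, which is acceptable only if 0 is not a valid rating in the data.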

In [None]:
# Generate the utility matrix from the data retrieved

utility_matrix = pd.pivot_table(data, values='rating', index=['user_id'],
                                columns=['movie_id'], aggfunc='mean', fill_value=0)

utility_matrix 

# Option 2: Ratings are subdocuments of movies

In [None]:
# Select the second database option called ML_Option_2

db = client.ML_Option_2

The code below retrieves the fields needed for the utility matrix. Because ratings are stored as subdocuments of `MOVIES` in this database, a more complex query is required: an aggregation pipeline. The pipeline first uses `$unwind` to emit one document per embedded `ratings` subdocument, then uses `$addFields` to copy the nested `rating` and `user_id` values into top-level fields, and finally uses `$project` to return only the required fields. As before, the query is executed six times, and only the last five executions are used to calculate the mean execution time.
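
To show what these three stages do to a document, here is a plain-Python emulation that runs without a database. The sample documents, including the `title` field and the `_id` values, are hypothetical; the real `MOVIES` documents may contain different fields. This is a sketch of the stages' effect, not of how MongoDB executes them.

```python
# Hypothetical documents shaped like the MOVIES collection in Option 2
movies = [
    {"_id": "a1", "movie_id": 10, "title": "Hypothetical A",
     "ratings": [{"user_id": 1, "rating": 4}, {"user_id": 2, "rating": 5}]},
    {"_id": "b2", "movie_id": 20, "title": "Hypothetical B",
     "ratings": [{"user_id": 1, "rating": 3}]},
]

rows = []
for movie in movies:
    # $unwind: one output document per element of the ratings array
    # (documents with a missing or empty array produce no output)
    for embedded in movie.get("ratings", []):
        doc = {**movie, "ratings": embedded}
        # $addFields: copy nested values to top-level fields
        doc["rating"] = doc["ratings"]["rating"]
        doc["user_id"] = doc["ratings"]["user_id"]
        # $project: keep only the listed fields (_id is kept by default)
        rows.append({k: doc[k] for k in ("_id", "user_id", "movie_id", "rating")})

for row in rows:
    print(row)
```

Two movies with three ratings in total become three flat rows, one per rating, which is the shape `pd.DataFrame` needs before pivoting.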

In [None]:
# Create a list to store the execution times
option2_exe_times = []

# Execute the query as many times as defined in num_iterations
# One additional iteration is performed first to warm up the database

for i in range(0, num_iterations+1):
    # Capture the start time
    start_time = time.time()
    
    # Execute the query
    data = pd.DataFrame(list(db.MOVIES.aggregate([
        {
            '$unwind': {
                'path': '$ratings'
            }
        }, {
            '$addFields': {
                'rating': '$ratings.rating', 
                'user_id': '$ratings.user_id'
            }
        }, {
            '$project': {
                'user_id': 1, 
                'movie_id': 1, 
                'rating': 1
            }
        }
    ])))
    
    # Capture the end time
    end_time = time.time()
    
    # Capture the execution time of the iteration unless it is the first one
    # The first iteration is used for warming up the database    
    if i != 0:
        option2_exe_times.append(end_time - start_time)
        print(f'Iteration {i} execution time: {end_time - start_time}')

# Calculate the mean execution time
option2_mean_time = np.mean(option2_exe_times)

# Print the mean execution time
print(f'\nMean execution time: {option2_mean_time}')

In [None]:
# Show the data retrieved from the query

data

In [None]:
# Generate the utility matrix from the data retrieved

utility_matrix = pd.pivot_table(data, values='rating', index=['user_id'],
                                columns=['movie_id'], aggfunc='mean', fill_value=0)

utility_matrix 

# Option 3: Ratings are subdocuments of users

In [None]:
# Select the third database option called ML_Option_3

db = client.ML_Option_3

The code below retrieves the fields needed for the utility matrix. Because ratings are stored as subdocuments of `USERS` in this database, the query is similar to the one used in Option 2. The pipeline uses `$unwind` to emit one document per embedded `ratings` subdocument, uses `$addFields` to copy the nested `rating` and `movie_id` values into top-level fields, and uses `$project` to return only the required fields. As before, the query is executed six times, and only the last five executions are used to calculate the mean execution time.

In [None]:
# Create a list to store the execution times
option3_exe_times = []

# Execute the query as many times as defined in num_iterations
# One additional iteration is performed first to warm up the database

for i in range(0, num_iterations+1):
    # Capture the start time
    start_time = time.time()
    
    # Execute the query
    data = pd.DataFrame(list(db.USERS.aggregate([
        {
            '$unwind': {
                'path': '$ratings'
            }
        }, {
            '$addFields': {
                'rating': '$ratings.rating', 
                'movie_id': '$ratings.movie_id'
            }
        }, {
            '$project': {
                'user_id': 1, 
                'movie_id': 1, 
                'rating': 1
            }
        }
    ])))
    
    # Capture the end time
    end_time = time.time()
    
    # Capture the execution time of the iteration unless it is the first one
    # The first iteration is used for warming up the database    
    if i != 0:
        option3_exe_times.append(end_time - start_time)
        print(f'Iteration {i} execution time: {end_time - start_time}')

# Calculate the mean execution time
option3_mean_time = np.mean(option3_exe_times)

# Print the mean execution time
print(f'\nMean execution time: {option3_mean_time}')

In [None]:
# Show the data retrieved from the query

data

In [None]:
# Generate the utility matrix from the data retrieved

utility_matrix = pd.pivot_table(data, values='rating', index=['user_id'],
                                columns=['movie_id'], aggfunc='mean', fill_value=0)

utility_matrix 

# Comparison of execution times


At this point, we have obtained the mean execution times for the three design options. The code below prints them side by side, rounded to two decimal places, to facilitate comparison. The lowest value indicates the fastest design option for this query.

In [None]:
# Print mean execution times for comparison

print(f'Option 1 mean execution time in seconds: {round(option1_mean_time, 2)}')
print(f'Option 2 mean execution time in seconds: {round(option2_mean_time, 2)}')
print(f'Option 3 mean execution time in seconds: {round(option3_mean_time, 2)}')

# Optional: Collaborative Filtering

The following cells are optional and implement collaborative filtering, as discussed in the tutorial. The implementation presented here is only one of many possible approaches.

In the cell below, the `cosine_similarity` function from the scikit-learn library is used to calculate the pairwise similarities between the rows (users) of the utility matrix generated last, in Option 3. Because every option produces the same utility matrix, there is no need to rerun any of the previous cells. All of this processing is performed locally, so the execution times of these cells are not relevant for the purposes of this tutorial.
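
For reference, the cosine similarity between two users is the dot product of their rating vectors divided by the product of the vectors' lengths. The standalone sketch below computes it with NumPy alone on a made-up matrix (`toy` is hypothetical data); for rows with non-zero length, it should agree with what `cosine_similarity` returns.

```python
import numpy as np

# Hypothetical utility matrix: 3 users (rows) x 3 movies (columns)
toy = np.array([[5, 0, 3],
                [4, 0, 0],
                [0, 2, 0]], dtype=float)

# Length (Euclidean norm) of each user's rating vector
norms = np.linalg.norm(toy, axis=1)

# Pairwise dot products divided by the products of the vector lengths
sim = (toy @ toy.T) / np.outer(norms, norms)
print(np.round(sim, 3))
```

Users 0 and 1 both rated the first movie, so their similarity is positive, while users 0 and 2 share no rated movie and get a similarity of 0. Every user has a similarity of 1 with themselves, which is why the diagonal is zeroed out before searching for the most similar user.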

In [None]:
from sklearn import metrics

# Calculate cosine similarity matrix
cosine_sim = metrics.pairwise.cosine_similarity(utility_matrix)

cosine_sim

In [None]:
# Replace the diagonal values with 0
# This is a workaround to eliminate similarities with self

np.fill_diagonal(cosine_sim, 0)

cosine_sim

In [None]:
# Enter the index value of the user to whom you want to recommend
# For the first user, enter 0
# This user is called "selected user" hereafter

user_index = 0

In [None]:
# Retrieve the similarity scores of the selected user

user_similarity_scores = cosine_sim[user_index]

user_similarity_scores

In [None]:
# Identify the highest similarity score of the "selected user"

max_similarity_score = np.max(user_similarity_scores)

max_similarity_score

In [None]:
# Identify the index value of the most similar user to the "selected user"

index_of_most_similar = np.where(user_similarity_scores == max_similarity_score)[0][0]

index_of_most_similar

In [None]:
# Retrieve the movies and the ratings of the most similar user as a dictionary

user_item = utility_matrix.iloc[index_of_most_similar].to_dict()

user_item

In [None]:
# Sort the movies by ratings (from highest to lowest)

sorted_dict = dict(sorted(user_item.items(), key=lambda item: item[1], reverse=True))

sorted_dict

In [None]:
# Find movies not watched by the "selected user"

not_watched = utility_matrix.iloc[user_index][utility_matrix.iloc[user_index]==0].index.values

not_watched

In [None]:
# Iterate through the most similar user's dictionary
# Find movies rated 4 or 5 by the most similar user, but not watched by the "selected user"

five_star = []
four_star = []

for key in sorted_dict:
    if key in not_watched:
        if sorted_dict[key] == 5:
            five_star.append(key)
        elif sorted_dict[key] == 4:
            four_star.append(key)

In [None]:
# Show all 5-star movies that can be recommended to the "selected user"

five_star

In [None]:
# Show all 4-star movies that can be recommended to the "selected user"

four_star
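
The recommendation steps above can also be condensed into a few vectorized operations. The following standalone sketch uses a made-up utility matrix (`utility` is hypothetical data) and makes two substitutions that are assumptions rather than part of the tutorial: `np.argmax` replaces the `np.max` plus `np.where` lookup, and a threshold of `>= 4` replaces the exact `== 4` and `== 5` checks so that averaged ratings such as 4.5 are not missed.

```python
import numpy as np
import pandas as pd

# Hypothetical utility matrix: users 1-3 (rows), movies 10-40 (columns), 0 = not rated
utility = pd.DataFrame(
    [[5, 0, 0, 4],
     [5, 4, 0, 4],
     [0, 0, 5, 0]],
    index=[1, 2, 3], columns=[10, 20, 30, 40], dtype=float)

# Pairwise cosine similarity between users, with self-similarity zeroed out
values = utility.to_numpy()
norms = np.linalg.norm(values, axis=1)
sim = (values @ values.T) / np.outer(norms, norms)
np.fill_diagonal(sim, 0)

user_index = 0                                     # the "selected user"
neighbor_index = int(np.argmax(sim[user_index]))   # most similar user

selected = utility.iloc[user_index]
neighbor = utility.iloc[neighbor_index]

# Movies the selected user has not rated but the neighbor rated 4 or higher
candidates = neighbor[(selected == 0) & (neighbor >= 4)].sort_values(ascending=False)
recommended = candidates.index.tolist()
print(recommended)
```

In this toy data, the second user is the closest match to the first, and movie 20 is the only title that the neighbor rated highly and the selected user has not rated.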