## CONTENT-BASED FILTERING

This section uses cosine similarity, which measures the cosine of the angle between two vectors. Here the vectors are the one-hot encoded genres of each movie, so the score reflects how similar two movies are by genre alone. We use scikit-learn's cosine_similarity function to do the content-based filtering. The code below reads each movie's genres from the ml-25m/movies.csv file. Among equally similar movies, it then uses a weighted rating as a tiebreaker to decide what the top 10 recommendations should be.
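To make the genre comparison concrete, here is a minimal sketch of cosine similarity on one-hot genre vectors. The five-genre vocabulary, the three example movies, and the cosine_sim helper are made up for illustration; the notebook itself uses scikit-learn's cosine_similarity on the full genre columns.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine of the angle between two vectors (0.0 if either vector is all zeros)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

# Hypothetical genre vocabulary: Adventure, Animation, Children, Comedy, Drama
movie_a = [1, 1, 1, 1, 0]  # Adventure|Animation|Children|Comedy
movie_b = [1, 1, 1, 0, 0]  # Adventure|Animation|Children
movie_c = [0, 0, 0, 0, 1]  # Drama

sim_ab = cosine_sim(movie_a, movie_b)  # three shared genres -> high similarity
sim_ac = cosine_sim(movie_a, movie_c)  # no shared genres -> zero similarity

print(f"a vs b: {sim_ab:.3f}")
print(f"a vs c: {sim_ac:.3f}")
```

Because one-hot vectors have no negative entries, the score always falls between 0 (no genres in common) and 1 (identical genre sets).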

In [1]:
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Load the movies data
movies = pd.read_csv('ml-25m/movies.csv')

# One-hot encoding for genres
movies['genres'] = movies['genres'].apply(lambda x: x.split('|'))
genres_encoded = movies['genres'].str.join('|').str.get_dummies()

# Combine the encoded genres with the original movies dataframe
movies_encoded = pd.concat([movies, genres_encoded], axis=1)

# Load the ratings data
ratings = pd.read_csv('ml-25m/ratings.csv')

# Calculate the mean rating for each movie
mean_ratings = ratings.groupby('movieId')['rating'].mean()

# Calculate the number of ratings for each movie
num_ratings = ratings.groupby('movieId')['rating'].count()

# Calculate the weighted rating for each movie.
# The constant 100 is arbitrary; it shrinks the score of movies with very few ratings toward 0.
weighted_ratings = (mean_ratings * num_ratings) / (num_ratings + 100)

# Add the weighted ratings to the movies dataframe
movies_encoded = movies_encoded.merge(weighted_ratings.rename('weighted_rating'), how='left', on='movieId')

# Fill NA values with the mean of the weighted ratings
movies_encoded['weighted_rating'] = movies_encoded['weighted_rating'].fillna(movies_encoded['weighted_rating'].mean())

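# Create a reverse map from movie titles to row indices.
# Some titles appear more than once, so keep only the first row for each title;
# otherwise indices[title] would return several rows instead of a single index.
indices = pd.Series(movies_encoded.index, index=movies_encoded['title'])
indices = indices[~indices.index.duplicated(keep='first')]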

def get_recommendations(title, num_recommendations=10):
    # Get the index of the movie that matches the title
    idx = indices[title]
    
    # Get the genre data for this movie
    movie_genres = movies_encoded.iloc[idx, 3:-1].values.reshape(1, -1)  # Exclude the last column, which is now the weighted rating

    # Calculate similarity scores between this movie and all others in one vectorized call
    # (much faster than calling cosine_similarity once per row)
    all_genres = movies_encoded.iloc[:, 3:-1].values  # Exclude the last column, which is now the weighted rating
    scores = cosine_similarity(movie_genres, all_genres)[0]
    # Keep every movie except the query movie itself
    similarity_scores = {i: score for i, score in zip(movies_encoded.index, scores) if i != idx}
    
    # Sort movies based on the similarity scores and the weighted ratings
    sorted_similarity_scores = sorted(similarity_scores.items(), key=lambda x: (x[1], movies_encoded.loc[x[0], 'weighted_rating']), reverse=True)
    
    # Get the indices of the top matches
    top_indices = [index for index, score in sorted_similarity_scores[:num_recommendations]]
    
    # Return the top matches
    return movies_encoded['title'].iloc[top_indices]

# Test the function
print(get_recommendations('Toy Story (1995)'))


4780                Monsters, Inc. (2001)
3021                   Toy Story 2 (1999)
43614                        Moana (2016)
3912     Emperor's New Groove, The (2000)
2203                          Antz (1998)
22353               Boxtrolls, The (2014)
30348            The Good Dinosaur (2015)
11604              Shrek the Third (2007)
20015                        Turbo (2013)
12969      Tale of Despereaux, The (2008)
Name: title, dtype: object
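The weighted rating used as a tiebreaker in the cell above can be checked by hand. The sketch below restates the same formula as a small function; the function name, the parameter m, and the two example movies are illustrative assumptions, not part of the notebook.

```python
def weighted_rating(mean_rating, num_ratings, m=100):
    """Same formula as the notebook: mean * n / (n + m), with m = 100 by default."""
    return mean_rating * num_ratings / (num_ratings + m)

# Hypothetical movies: a perfect score from 2 users vs. a good score from 10,000 users
few = weighted_rating(5.0, 2)
many = weighted_rating(4.0, 10_000)

print(f"5.0 stars from 2 ratings      -> {few:.3f}")
print(f"4.0 stars from 10,000 ratings -> {many:.3f}")
```

With only two ratings the perfect score is pulled almost to zero, while the widely rated movie keeps nearly all of its mean. This is why sparsely rated titles rarely win ties in the ranking.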


In [None]:
%pip install fuzzywuzzy
%pip install flask_session

In [3]:
from fuzzywuzzy import process

def chatbot():
    print("Hello! I can recommend movies for you.")
    print("Enter 'quit' to exit the chatbot at any time.")
    while True:
        title = input("Enter a movie title: ")
        if title.lower() == 'quit':
            print("Goodbye!")
            break
        # Use fuzzy matching to find the closest match to the user's input in the list of movie titles
        closest_match, score = process.extractOne(title, movies_encoded['title'].values)
        if score < 60:  # You can adjust this threshold
            print("I'm sorry, but I couldn't find a movie that closely matches '{}'. Please try another one.".format(title))
            continue
        try:
            recommendations = get_recommendations(closest_match)
            print("If you liked {}, you might also like these movies:".format(closest_match))
            for i, movie in enumerate(recommendations, 1):
                print("{}. {}".format(i, movie))
        except KeyError:
            print("I'm sorry, but I couldn't find that movie in my database. Please try another one.")

# Run the chatbot
chatbot()




Hello! I can recommend movies for you.
Enter 'quit' to exit the chatbot at any time.
If you liked King Cobra (2016), you might also like these movies:
1. Shawshank Redemption, The (1994)
2. Godfather, The (1972)
3. Godfather: Part II, The (1974)
4. Goodfellas (1990)
5. American History X (1998)
6. On the Waterfront (1954)
7. Green Mile, The (1999)
8. No Country for Old Men (2007)
9. 400 Blows, The (Les quatre cents coups) (1959)
10. Dog Day Afternoon (1975)
Goodbye!
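The chatbot relies on fuzzy matching to map free-form input to a known title. As a rough illustration of the idea only, the sketch below swaps in Python's standard-library difflib for fuzzywuzzy, so its similarity ratios (0 to 1) are not comparable to the 0 to 100 scores returned by process.extractOne. The title list and the closest_title helper are hypothetical.

```python
import difflib

# Hypothetical title list standing in for movies_encoded['title']
titles = ["Toy Story (1995)", "Toy Story 2 (1999)", "Monsters, Inc. (2001)"]

def closest_title(query, titles, cutoff=0.6):
    """Return the title most similar to the query, or None if nothing clears the cutoff."""
    lowered = {t.lower(): t for t in titles}  # compare case-insensitively
    matches = difflib.get_close_matches(query.lower(), list(lowered), n=1, cutoff=cutoff)
    return lowered[matches[0]] if matches else None

print(closest_title("toy story", titles))
print(closest_title("zzzz", titles))
```

Whatever matcher is used, the threshold matters: a low cutoff accepts loose matches, so the chatbot may answer for a different movie than the one the user meant.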


## COLLABORATIVE FILTERING

### Collaborative Filtering using SVD

Here's a high-level overview of what we're doing:

1. Preprocess our data to get a user-item matrix. Each cell in this matrix represents the rating a user gave to a movie. If a user hasn't rated a movie, that cell is filled with 0 as a placeholder.
2. Apply SVD to this matrix. SVD will decompose our user-item matrix into three separate matrices. We can use these matrices to predict the missing ratings in our user-item matrix.
3. Write a function that uses these predicted ratings to recommend movies to a user.

How we are using Singular Value Decomposition:
- The SVD model decomposes the user-item matrix (users as rows, movies as columns, ratings as cell values) into three separate matrices, keeping only the k largest singular values (k = 50 in the code below).
- These matrices capture the latent factors that explain the observed ratings. A latent factor might correspond to a genre, a time period, or another movie characteristic that affects how users rate movies.
- Multiplying the three matrices back together gives a matrix of predicted ratings for every user-movie pair, including movies a user hasn't rated yet. This is a form of collaborative filtering because it uses the ratings of all users to predict the ratings of each individual user.

To recommend movies to a user, the system sorts the movies by their predicted ratings and picks the top ones. Because movies the user has already rated are excluded, the recommendations are always new to that user.
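The steps above can be sketched on a toy matrix. This example uses np.linalg.svd rather than the scipy svds call in the notebook, and the 4 x 5 rating matrix, k = 2, and the top_unrated helper are assumptions chosen only to keep the example small.

```python
import numpy as np

# Hypothetical 4-user x 5-movie rating matrix; 0 marks "not rated"
R = np.array([
    [5, 4, 0, 1, 0],
    [4, 5, 1, 0, 0],
    [0, 1, 5, 4, 0],
    [1, 0, 4, 5, 3],
], dtype=float)

k = 2  # number of latent factors to keep (assumed for this toy example)

# Full SVD, then keep only the k largest singular values and their vectors
U, s, Vt = np.linalg.svd(R, full_matrices=False)
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

# Multiply the three truncated matrices to get predicted ratings for every cell
R_hat = U_k @ np.diag(s_k) @ Vt_k

def top_unrated(user_row, n=2):
    """Column indices of the n highest predicted ratings among movies the user has not rated."""
    scores = R_hat[user_row].copy()
    scores[R[user_row] > 0] = -np.inf  # exclude movies that are already rated
    return [int(j) for j in np.argsort(scores)[::-1][:n]]

print(np.round(R_hat, 2))
print(top_unrated(0))
```

Note that the zeros are treated as real ratings during the decomposition, exactly as in the notebook, so the predictions for unrated cells are biased toward 0. Only their relative order is used for ranking.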

In [14]:
# Preprocess data to get a user-item matrix
# Limit data to the 1000 most active users and the 1000 most-rated movies for the purpose of this demonstration
top_users = ratings['userId'].value_counts().index[:1000]
top_movies = ratings['movieId'].value_counts().index[:1000]
limited_ratings = ratings[ratings['userId'].isin(top_users) & ratings['movieId'].isin(top_movies)]

# Create the user-item matrix (cells for unrated movies are filled with 0)
user_item_matrix = limited_ratings.pivot_table(index='userId', columns='movieId', values='rating', fill_value=0)


In [17]:
from scipy.sparse.linalg import svds

# Convert the user-item matrix to a numpy array
user_item_matrix_np = user_item_matrix.to_numpy()

# Apply SVD. The SVD model predicts ratings for all movies, including those that the
# user has already rated. Therefore, if a user has rated a movie highly, it's likely
# that the model will predict a high rating for that movie and potentially recommend
# it back to the user. We will exclude those movies in the function later.
U, sigma, Vt = svds(user_item_matrix_np, k=50)

# Since we got sigma as an array, we need to convert it to a diagonal matrix form.
sigma = np.diag(sigma)

# Preview the shapes of U, sigma, and Vt
print(f"Shape of U: {U.shape}")
print(f"Shape of sigma: {sigma.shape}")
print(f"Shape of Vt: {Vt.shape}")


# Predict the ratings
predicted_ratings = np.dot(np.dot(U, sigma), Vt)

# Convert the predicted ratings to a DataFrame
predicted_ratings_df = pd.DataFrame(predicted_ratings, columns=user_item_matrix.columns, index=user_item_matrix.index)

def recommend_movies(user_id, num_recommendations=10):
    """
    This function receives a user ID and returns a list of recommended movies. 
    Recommendations are based on the predicted ratings.
    """
    # Get the predicted ratings for this user
    user_ratings = predicted_ratings_df.loc[user_id]
    
    # Get the movies that the user has already rated
    rated_movies = ratings[ratings['userId'] == user_id]['movieId']
    
    # Exclude the movies that the user has already rated
    user_ratings = user_ratings.drop(rated_movies, errors='ignore')
    
    # Sort the movies based on the predicted ratings
    sorted_user_ratings = user_ratings.sort_values(ascending=False)
    
    # Get the top recommendations
    recommendations = sorted_user_ratings[:num_recommendations]
    
    # Map the movie IDs to titles
    recommendations = recommendations.reset_index()
    recommendations = recommendations.merge(movies[['movieId', 'title']], how='left', on='movieId')
    
    # Return the top recommendations
    return recommendations[['title', 'movieId']]

print(recommend_movies(548))



Shape of U: (1000, 50)
Shape of sigma: (50, 50)
Shape of Vt: (50, 1000)
                                               title  movieId
0                                The Revenant (2015)   139385
1                                    Deadpool (2016)   122904
2                                     Ant-Man (2015)   122900
3                                 The Martian (2015)   134130
4                              Big Short, The (2015)   148626
5  Star Wars: Episode VII - The Force Awakens (2015)   122886
6                                   Cape Fear (1991)     1343
7                  Captain America: Civil War (2016)   122920
8             Snow White and the Seven Dwarfs (1937)      594
9                                    Whiplash (2014)   112552


### Collaborative Filtering with PCA

The PCA method transforms the user-item matrix to a lower-dimensional space using PCA and then computes the cosine similarity between users in that space. The cosine similarity identifies the users who are most similar to the given user. Movies that these similar users have rated highly (4 or above in the code below) but that the given user hasn't rated yet are then recommended. This is a form of collaborative filtering because it uses the preferences of similar users to recommend movies to a user.
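As a small illustration of this pipeline, the sketch below centers a toy rating matrix, projects users onto the leading principal components, and compares them by cosine similarity. It computes PCA directly from NumPy's SVD instead of sklearn.decomposition.PCA, and the 4 x 4 matrix, n_components = 2, and the most_similar helper are assumptions for the example.

```python
import numpy as np

# Hypothetical ratings: users 0 and 1 prefer movies 0-1, users 2 and 3 prefer movies 2-3
R = np.array([
    [5, 4, 1, 1],
    [4, 5, 1, 2],
    [1, 1, 5, 4],
    [2, 1, 4, 5],
], dtype=float)

# Center each movie column, then take principal directions from the SVD of the centered data
centered = R - R.mean(axis=0)
_, _, Vt = np.linalg.svd(centered, full_matrices=False)

n_components = 2  # assumed for this toy example
Z = centered @ Vt[:n_components].T  # user coordinates in the reduced space

# Cosine similarity between users: normalize rows, then take dot products
norms = np.linalg.norm(Z, axis=1, keepdims=True)
Z_unit = Z / np.where(norms == 0, 1.0, norms)
S = Z_unit @ Z_unit.T

def most_similar(user, n=1):
    """Indices of the n users most similar to the given user, excluding the user itself."""
    order = np.argsort(S[user])[::-1]
    return [int(v) for v in order if v != user][:n]

print(np.round(S, 2))
print(most_similar(0), most_similar(2))
```

Once the neighbors are known, recommending is a lookup: collect the movies those neighbors rated highly and drop the ones the target user has already rated.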

In [24]:
from sklearn.decomposition import PCA

# Center the data
centered_matrix = user_item_matrix - user_item_matrix.mean()

# Apply PCA
pca = PCA(n_components=50)
pca.fit(centered_matrix)

# Transform the original data
transformed_matrix = pca.transform(centered_matrix)

# Convert the transformed matrix to a DataFrame
transformed_df = pd.DataFrame(transformed_matrix, index=user_item_matrix.index)

from sklearn.metrics.pairwise import cosine_similarity

# Calculate the cosine similarity between users
user_similarity = cosine_similarity(transformed_df)

# Convert the similarity matrix to a DataFrame
user_similarity_df = pd.DataFrame(user_similarity, index=user_item_matrix.index, columns=user_item_matrix.index)




In [23]:
def recommend_movies_pca(user_id, num_recommendations=10):
    """
    This function receives a user ID and returns a list of recommended movies. 
    Recommendations are based on the ratings of similar users.
    """
    # Get the top 10 similar users to the given user
    similar_users = user_similarity_df[user_id].sort_values(ascending=False)[1:11]
    
    # Get the movies that these similar users have rated highly
    high_rated_movies_similar_users = ratings[ratings['userId'].isin(similar_users.index) & (ratings['rating'] >= 4)]
    
    # Exclude the movies that the user has already rated
    rated_movies = ratings[ratings['userId'] == user_id]['movieId']
    recommended_movies = high_rated_movies_similar_users.loc[~high_rated_movies_similar_users['movieId'].isin(rated_movies)]
    
    # Get the top recommendations
    recommendations = recommended_movies['movieId'].value_counts().index[:num_recommendations]
    
    # Map the movie IDs to titles
    recommendations = pd.DataFrame(recommendations, columns=['movieId'])
    recommendations = recommendations.merge(movies[['movieId', 'title']], how='left', on='movieId')
    
    # Return the top recommendations
    return recommendations

# Test the function
print(recommend_movies_pca(548))


   movieId                           title
0   134130              The Martian (2015)
1   148626           Big Short, The (2015)
2   158238            The Nice Guys (2016)
3   139644                  Sicario (2015)
4   139385             The Revenant (2015)
5   128360        The Hateful Eight (2015)
6   122904                 Deadpool (2016)
7   161024  Jim Jefferies: Freedumb (2016)
8   122900                  Ant-Man (2015)
9   117121            Dorothy Mills (2008)


## Summary for Collaborative Filtering

recommend_movies: This function performs collaborative filtering via singular value decomposition (SVD). It predicts a rating for every movie in the user-item matrix, based on the user's existing ratings and the ratings of all other users. It then drops the movies the user has already rated and recommends those with the highest predicted ratings. The underlying assumption is that users who have agreed in the past will agree in the future and will like similar kinds of movies.

recommend_movies_pca: This function also performs collaborative filtering, but with a k-nearest neighbors (k-NN) approach that uses PCA for dimensionality reduction. It first identifies the users that are most similar to the given user, based on their ratings after projection onto the principal components. It then recommends the movies that these similar users have rated highly but the given user hasn't rated yet. The underlying assumption is that users with similar rating histories will have similar preferences for movies they haven't rated.

In summary, recommend_movies (SVD) generates recommendations based on a combination of the given user's ratings and all other users' ratings, while recommend_movies_pca (PCA) generates recommendations based on the ratings of users who are similar to the given user.

## Small-Scale Implementation of Matrix Factorization

This section implements a simple form of matrix factorization for collaborative filtering, a technique often used in recommender systems. At a high level, it reads a CSV file of ratings and tries to fill in the missing ratings.

Here's a summary of what the code does:

csv_to_mtx: This function reads a CSV file and converts it to a NumPy array. This file is assumed to be a user-item matrix with ratings, where each row represents a user, each column represents an item (movie), and the cell values are the user's rating for that item.

matrix_factorization: This function performs matrix factorization on the input user-item matrix R using stochastic gradient descent. It takes two initial matrices P and Q, which are the latent feature matrices for the users and items respectively, a parameter K for the number of latent features, and additional parameters that control the learning process: steps for the number of iterations to run, alpha for the learning rate, and beta for the regularization term. The function iteratively updates P and Q to minimize the squared difference between the actual ratings and the predicted ratings (the product of P and the transpose of Q), plus a regularization term to prevent overfitting. Only non-zero ratings are used; zeros are treated as missing. If the error falls below 0.001, the function stops early.

After defining these functions, the script reads a user-item matrix from a CSV file using the csv_to_mtx function. It then initializes the user and item latent feature matrices P and Q with random values. These matrices have dimensions N x K and M x K respectively, where N is the number of users, M is the number of items, and K is the number of latent features.

The script then calls the matrix_factorization function to factorize the user-item matrix into P and Q. The resulting matrices represent the users and items in terms of the latent feature space.

Finally, the script calculates the predicted ratings by multiplying P by the transpose of Q. The resulting matrix nR has the same dimensions as the original user-item matrix, but the ratings are now predictions based on the learned latent features.
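In symbols, the procedure described above can be written as follows. This is a restatement of the code in the next cell, with $p_{ik}$ and $q_{kj}$ denoting entries of P and of the transposed Q, $\alpha$ the learning rate, and $\beta$ the regularization parameter. For each observed rating $r_{ij} > 0$, the prediction error is

$$e_{ij} = r_{ij} - \sum_{k=1}^{K} p_{ik} q_{kj}$$

and each latent feature $k$ is updated by

$$p_{ik} \leftarrow p_{ik} + \alpha \left( 2 e_{ij} q_{kj} - \beta p_{ik} \right), \qquad q_{kj} \leftarrow q_{kj} + \alpha \left( 2 e_{ij} p_{ik} - \beta q_{kj} \right)$$

After each full pass over the matrix, the stopping check uses the regularized error

$$E = \sum_{(i,j):\, r_{ij} > 0} \left[ e_{ij}^2 + \frac{\beta}{2} \sum_{k=1}^{K} \left( p_{ik}^2 + q_{kj}^2 \right) \right]$$

and the loop ends early if $E < 0.001$. Note that the code updates $p_{ik}$ before $q_{kj}$, so the $q_{kj}$ update sees the new value of $p_{ik}$.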

In [30]:
import numpy

def csv_to_mtx(file):
    # Read the CSV at the given path into a NumPy array, skipping the header row
    mtx = numpy.genfromtxt(file, delimiter=',', skip_header=1)
    return mtx

def matrix_factorization(R, P, Q, K, steps=5000, alpha=0.0002, beta=0.02):
    '''
    R: rating matrix
    P: |U| * K (User features matrix)
    Q: |D| * K (Item features matrix)
    K: latent features
    steps: iterations
    alpha: learning rate
    beta: regularization parameter'''
    Q = Q.T

    for step in range(steps):
        for i in range(len(R)):
            for j in range(len(R[i])):
                if R[i][j] > 0:
                    # calculate error
                    eij = R[i][j] - numpy.dot(P[i,:],Q[:,j])

                    for k in range(K):
                        # gradient step using the alpha (learning rate) and beta (regularization) parameters
                        P[i][k] = P[i][k] + alpha * (2 * eij * Q[k][j] - beta * P[i][k])
                        Q[k][j] = Q[k][j] + alpha * (2 * eij * P[i][k] - beta * Q[k][j])

        # Compute the regularized squared error over the observed ratings
        e = 0
        for i in range(len(R)):
            for j in range(len(R[i])):
                if R[i][j] > 0:
                    e = e + pow(R[i][j] - numpy.dot(P[i,:],Q[:,j]), 2)
                    for k in range(K):
                        e = e + (beta/2) * (pow(P[i][k],2) + pow(Q[k][j],2))
        # Stop early once the error drops below the 0.001 convergence threshold
        if e < 0.001:
            break

    return P, Q.T

R = csv_to_mtx('ml-25m/user-movie-ratings.csv')
print(R)
# N: number of users
N = len(R)
# M: number of movies
M = len(R[0])
# K: number of latent features
K = 5

P = numpy.random.rand(N,K)
Q = numpy.random.rand(M,K)

nP, nQ = matrix_factorization(R, P, Q, K)

nR = numpy.dot(nP, nQ.T)
print()
print("Predictions:")
print(nR)

[[5. 3. 0. 1.]
 [4. 0. 0. 1.]
 [1. 1. 0. 5.]
 [1. 0. 0. 4.]
 [0. 1. 5. 4.]
 [2. 1. 3. 0.]]

Predictions:
[[5.01036269 2.92309956 3.7599997  1.00322818]
 [3.97408237 2.41286203 2.68993508 0.99988375]
 [1.04952349 0.91155024 5.04139756 4.97017238]
 [0.99089893 0.72014583 3.96034546 3.97950694]
 [1.75164628 1.0417902  4.96814036 4.0010332 ]
 [1.8940461  1.15572328 2.99266344 1.84698784]]
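One quick sanity check on the output above is to measure the error only on the entries that were actually rated, since the zeros are missing values rather than targets. The sketch below copies R and the printed predictions into arrays and computes that error; the variable names observed, rmse, and max_abs_error are introduced here for illustration.

```python
import numpy as np

# Ratings and predictions copied from the output above (0 marks a missing rating)
R = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [1, 0, 0, 4],
    [0, 1, 5, 4],
    [2, 1, 3, 0],
], dtype=float)

nR = np.array([
    [5.01036269, 2.92309956, 3.7599997,  1.00322818],
    [3.97408237, 2.41286203, 2.68993508, 0.99988375],
    [1.04952349, 0.91155024, 5.04139756, 4.97017238],
    [0.99089893, 0.72014583, 3.96034546, 3.97950694],
    [1.75164628, 1.0417902,  4.96814036, 4.0010332],
    [1.8940461,  1.15572328, 2.99266344, 1.84698784],
])

observed = R > 0                    # mask of entries that were actually rated
errors = (R - nR)[observed]         # prediction errors on the rated entries only
rmse = float(np.sqrt(np.mean(errors ** 2)))
max_abs_error = float(np.max(np.abs(errors)))

print(f"Observed entries: {int(observed.sum())}")
print(f"RMSE on observed entries: {rmse:.4f}")
print(f"Largest absolute error: {max_abs_error:.4f}")
```

A low error here only shows that the model reproduces the ratings it was trained on. Judging the filled-in values would require holding some known ratings out of training and comparing against them.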
