# SVD from scratch using gradient descent

This notebook uses gradient descent to learn singular vectors, following the approach described in this [article](https://sifter.org/simon/journal/20070815.html). It proceeds in four steps:

1. Extract user-item interactions from the ratings dataframe.
2. Define the SVD model with functions for initializing the user and movie vectors, predicting ratings, and updating the vectors using gradient descent.
3. Train the model on the user-item interactions data.
4. Use the learned vectors to make predictions on new user-movie pairs.
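
The prediction and update in step 2 can be written compactly. The symbols are notation chosen here for illustration, not taken from the article: $p_u$ and $q_i$ are the factor vectors for user $u$ and movie $i$, $r_{ui}$ is the observed rating, and $\eta$ is the learning rate.

```latex
\hat{r}_{ui} = p_u \cdot q_i, \qquad e_{ui} = r_{ui} - \hat{r}_{ui}

p_u \leftarrow p_u + \eta \, e_{ui} \, q_i, \qquad
q_i \leftarrow q_i + \eta \, e_{ui} \, p_u
```

Each update is a gradient step on $\tfrac{1}{2} e_{ui}^2$ with respect to one of the two vectors.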

Evaluation metric: root mean squared error (RMSE).
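
For reference, RMSE over a test set $T$ of (user, movie) pairs is defined below. The notation ($r_{ui}$ for the true rating, $\hat{r}_{ui}$ for the prediction) is chosen here for illustration.

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{|T|} \sum_{(u,i) \in T} \left( r_{ui} - \hat{r}_{ui} \right)^2}
```

Lower is better, and the value is expressed in the same units as the ratings.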

In [12]:
# implement SVD from scratch
import numpy as np
import pandas as pd

# evaluate the SVD for the MovieLens dataset
from sklearn.metrics import mean_squared_error
from math import sqrt
from sklearn.model_selection import train_test_split


In [2]:
movies_df = pd.read_csv('../data/ml-latest-small/movies.csv')
ratings_df = pd.read_csv('../data/ml-latest-small/ratings.csv')
links_df = pd.read_csv('../data/ml-latest-small/links.csv')
tags_df = pd.read_csv('../data/ml-latest-small/tags.csv')


In [3]:
movies_df.info()
movies_df.head(3)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9742 entries, 0 to 9741
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  9742 non-null   int64 
 1   title    9742 non-null   object
 2   genres   9742 non-null   object
dtypes: int64(1), object(2)
memory usage: 228.5+ KB


   movieId                    title                                       genres
0        1         Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy
1        2           Jumanji (1995)                   Adventure|Children|Fantasy
2        3  Grumpier Old Men (1995)                               Comedy|Romance


### --------> OBSERVATIONS:

+ movieId: A unique identifier for the movie.
+ title: The title of the movie, along with its release year in parentheses.
+ genres: The genres associated with the movie, separated by pipe characters (|).
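
Because `title` and `genres` each pack two pieces of information into one string, it can help to split them before further analysis. The sketch below is a minimal example on a toy frame copied from the three rows shown above. The column names `genre_list` and `year` are hypothetical, and the regular expression assumes the year is the last parenthesized group in the title.

```python
import pandas as pd

# Toy frame copied from the three rows shown by movies_df.head(3).
movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Toy Story (1995)", "Jumanji (1995)", "Grumpier Old Men (1995)"],
    "genres": [
        "Adventure|Animation|Children|Comedy|Fantasy",
        "Adventure|Children|Fantasy",
        "Comedy|Romance",
    ],
})

# Split the pipe-separated genres string into a Python list per movie.
movies["genre_list"] = movies["genres"].str.split("|")

# Pull a trailing four-digit year in parentheses out of the title.
# Titles without such a suffix would get NaN here (assumed format).
movies["year"] = (
    movies["title"].str.extract(r"\((\d{4})\)\s*$", expand=False).astype(float)
)

print(movies[["movieId", "year", "genre_list"]])
```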

In [4]:
ratings_df.info()
ratings_df.head(3)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100836 entries, 0 to 100835
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   userId     100836 non-null  int64  
 1   movieId    100836 non-null  int64  
 2   rating     100836 non-null  float64
 3   timestamp  100836 non-null  int64  
dtypes: float64(1), int64(3)
memory usage: 3.1 MB


   userId  movieId  rating  timestamp
0       1        1     4.0  964982703
1       1        3     4.0  964981247
2       1        6     4.0  964982224


### --------> OBSERVATIONS:

+ userId: A unique identifier for the user who provided the rating.
+ movieId: A unique identifier for the movie, which is consistent with the movieId in the movies.csv file.
+ rating: The user's rating for the movie on a scale of 0.5 to 5 stars, given in increments of 0.5 stars.
+ timestamp: The Unix timestamp representing the time at which the user provided the rating.
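
The raw `timestamp` integers are hard to read, so a common step is to convert them to datetimes. The sketch below does this on a toy frame copied from the three rows shown above and also checks that the ratings sit on the half-star grid. The names `rated_at` and `on_half_star_grid` are hypothetical.

```python
import pandas as pd

# Toy frame copied from the three rows shown by ratings_df.head(3).
ratings = pd.DataFrame({
    "userId": [1, 1, 1],
    "movieId": [1, 3, 6],
    "rating": [4.0, 4.0, 4.0],
    "timestamp": [964982703, 964981247, 964982224],
})

# Interpret the integers as seconds since the Unix epoch (UTC).
ratings["rated_at"] = pd.to_datetime(ratings["timestamp"], unit="s")

# Check that every rating is a multiple of 0.5 between 0.5 and 5.0.
on_half_star_grid = (
    ((ratings["rating"] * 2) % 1 == 0) & ratings["rating"].between(0.5, 5.0)
)

print(ratings[["timestamp", "rated_at"]])
print(on_half_star_grid.all())
```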

In [5]:
links_df.info()
links_df.head(3)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9742 entries, 0 to 9741
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   movieId  9742 non-null   int64  
 1   imdbId   9742 non-null   int64  
 2   tmdbId   9734 non-null   float64
dtypes: float64(1), int64(2)
memory usage: 228.5 KB


   movieId  imdbId   tmdbId
0        1  114709    862.0
1        2  113497   8844.0
2        3  113228  15602.0


### --------> OBSERVATIONS:

+ movieId: A unique identifier for the movie, which is consistent with the movieId in the movies.csv file.
+ imdbId: The Internet Movie Database (IMDb) identifier for the movie. 
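+ tmdbId: The Movie Database (TMDb) identifier for the movie. Only 9734 of the 9742 rows have a value, so 8 are missing, which is why pandas stores this column as float64 rather than int64.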
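
The `info()` output above shows fewer non-null `tmdbId` values than rows, and a float64 dtype for what is really an integer id. One option, sketched here on a toy frame, is the pandas nullable integer dtype `Int64`. The fourth row (id 999 with a missing `tmdbId`) is made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame: the first three links rows plus one made-up row with a missing tmdbId.
links = pd.DataFrame({
    "movieId": [1, 2, 3, 999],            # 999 is a hypothetical id
    "imdbId": [114709, 113497, 113228, 0],
    "tmdbId": [862.0, 8844.0, 15602.0, np.nan],
})

# NaN forces a float column when the data is loaded.
print(links["tmdbId"].dtype)

# Convert to the nullable integer dtype; the missing entry becomes <NA>.
links["tmdbId"] = links["tmdbId"].astype("Int64")
print(links["tmdbId"])
```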

In [6]:
tags_df.info()
tags_df.head(3)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3683 entries, 0 to 3682
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   userId     3683 non-null   int64 
 1   movieId    3683 non-null   int64 
 2   tag        3683 non-null   object
 3   timestamp  3683 non-null   int64 
dtypes: int64(3), object(1)
memory usage: 115.2+ KB


   userId  movieId              tag   timestamp
0       2    60756            funny  1445714994
1       2    60756  Highly quotable  1445714996
2       2    60756     will ferrell  1445714992


### --------> OBSERVATIONS:

+ userId: A unique identifier for the user who assigned the tag.
+ movieId: A unique identifier for the movie, which is consistent with the movieId in the movies.csv file.
+ tag: A text label assigned by the user to describe or categorize the movie.
+ timestamp: The Unix timestamp representing the time at which the user assigned the tag.

In [10]:
# Extract user-item interactions
user_item_ratings = ratings_df[['userId', 'movieId', 'rating']]

# Define the SVD-style matrix factorization model
class SVD:
    def __init__(self, num_factors, learning_rate, num_epochs):
        self.num_factors = num_factors      # length of each latent vector
        self.learning_rate = learning_rate  # step size for each update
        self.num_epochs = num_epochs        # full passes over the ratings

    def fit(self, user_item_ratings):
        # Map raw ids to contiguous row indices of the factor matrices
        user_ids = user_item_ratings.userId.unique()
        movie_ids = user_item_ratings.movieId.unique()
        self.user_index = {user_id: idx for idx, user_id in enumerate(user_ids)}
        self.movie_index = {movie_id: idx for idx, movie_id in enumerate(movie_ids)}

        # Random initial factors, one row per user / movie
        self.user_factors = np.random.randn(len(user_ids), self.num_factors)
        self.movie_factors = np.random.randn(len(movie_ids), self.num_factors)

        for epoch in range(self.num_epochs):
            # itertuples keeps integer ids as integers and is faster than iterrows
            for user_id, movie_id, rating in user_item_ratings.itertuples(index=False):
                user_idx = self.user_index[user_id]
                movie_idx = self.movie_index[movie_id]

                prediction = np.dot(self.user_factors[user_idx], self.movie_factors[movie_idx])
                error = rating - prediction

                # Gradient step on the squared error, applied to all factors at once.
                # Copy the user vector first so both updates use the pre-update values.
                old_user_factors = self.user_factors[user_idx].copy()
                self.user_factors[user_idx] += self.learning_rate * error * self.movie_factors[movie_idx]
                self.movie_factors[movie_idx] += self.learning_rate * error * old_user_factors

    def predict(self, user_id, movie_id):
        user_idx = self.user_index.get(user_id)
        movie_idx = self.movie_index.get(movie_id)

        # Users or movies not seen during fit have no learned vector
        if user_idx is None or movie_idx is None:
            return None

        return np.dot(self.user_factors[user_idx], self.movie_factors[movie_idx])

# Train the model
model = SVD(num_factors=2, learning_rate=0.001, num_epochs=100)
model.fit(user_item_ratings)

# Make predictions
user_id = 1
movie_id = 2
predicted_rating = model.predict(user_id, movie_id)
print(f"Predicted rating for user {user_id} and movie {movie_id}: {predicted_rating}")


Predicted rating for user 1 and movie 2: 4.2251652250894365
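
A single predicted rating does not show whether training converged. One way to check is to record the training RMSE after every epoch. The sketch below applies the same per-rating update to a tiny made-up dataset. The function `train_mf`, the toy ratings, the 0.1 initialization scale, the learning rate, and the fixed seed are all illustrative assumptions, not part of the notebook's model.

```python
import numpy as np

def train_mf(ratings, n_users, n_items, num_factors=2, lr=0.02, epochs=300, seed=0):
    """SGD matrix factorization on (user_idx, item_idx, rating) triples.

    Returns the factor matrices and the training RMSE recorded per epoch.
    """
    rng = np.random.default_rng(seed)
    P = 0.1 * rng.standard_normal((n_users, num_factors))  # user factors
    Q = 0.1 * rng.standard_normal((n_items, num_factors))  # item factors
    history = []
    for _ in range(epochs):
        squared_error = 0.0
        for u, i, r in ratings:
            err = r - P[u] @ Q[i]
            squared_error += err ** 2
            p_old = P[u].copy()        # keep the pre-update user vector
            P[u] += lr * err * Q[i]
            Q[i] += lr * err * p_old
        history.append(float(np.sqrt(squared_error / len(ratings))))
    return P, Q, history

# Made-up ratings for 3 users and 3 items, for illustration only.
toy_ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0),
               (1, 2, 1.0), (2, 1, 2.0), (2, 2, 5.0)]

P, Q, history = train_mf(toy_ratings, n_users=3, n_items=3)
print(f"first epoch RMSE: {history[0]:.3f}, last epoch RMSE: {history[-1]:.3f}")
```

If the last value is not clearly below the first, the learning rate or the number of epochs probably needs adjusting.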


In [13]:
def rmse(y_true, y_pred):
    """Root mean squared error between true and predicted ratings."""
    mse = mean_squared_error(y_true, y_pred)
    return sqrt(mse)

# Split the data into train and test sets
train_df, test_df = train_test_split(user_item_ratings, test_size=0.2, random_state=42)

# Train the model on the train set
model = SVD(num_factors=2, learning_rate=0.001, num_epochs=100)
model.fit(train_df)

# Predict and evaluate on the test set
y_true = []
y_pred = []

for user_id, movie_id, rating in test_df.itertuples(index=False):
    prediction = model.predict(user_id, movie_id)

    # predict returns None for users or movies absent from the train set; skip those pairs
    if prediction is not None:
        y_true.append(rating)
        y_pred.append(prediction)

rmse_value = rmse(y_true, y_pred)
print(f"Root Mean Squared Error: {rmse_value}")


Root Mean Squared Error: 1.064477021491554


### --------> OBSERVATIONS:

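An RMSE of about 1.06 means that the typical prediction error is roughly one rating point. RMSE is not a plain average of the errors: squaring weights large errors more heavily. Test pairs whose user or movie did not appear in the training set are skipped, so the figure covers only pairs the model can score.

Ratings range from 0.5 to 5.0 in half-star steps, so an error of this size spans about two steps on the scale. The result is moderate rather than strong, and there is room for improvement: the model uses only two factors, no bias terms, and no regularization.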
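
An RMSE figure is easier to judge next to a trivial baseline. A minimal sketch, assuming the baseline of interest is "predict the training-set mean rating for every test pair". The helper `global_mean_baseline_rmse` and the numbers are made up for illustration.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error computed directly with NumPy."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def global_mean_baseline_rmse(train_ratings, test_ratings):
    """RMSE of predicting the training-set mean rating for every test pair."""
    mean_rating = float(np.mean(train_ratings))
    return rmse(test_ratings, np.full(len(test_ratings), mean_rating))

# Made-up numbers: the training mean is 4.0, each test rating is off by 1.0.
train = [4.0, 4.0, 4.0]
test = [3.0, 5.0]
print(global_mean_baseline_rmse(train, test))  # 1.0
```

In the notebook this could be called with `train_df['rating']` and `test_df['rating']`. A factorization model is only worth its cost if it beats that number.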