# SVD from scratch using gradient descent

This notebook implements the gradient-descent method for learning singular vectors described in the [article](https://sifter.org/simon/journal/20070815.html).

1. Extract user-item interactions from the ratings dataframe.
2. Define the SVD model with functions for initializing the user and movie vectors, predicting ratings, and updating the vectors using gradient descent.
3. Train the model on the user-item interactions data.
4. Use the learned vectors to make predictions on new user-movie pairs.

## Data

This dataset (ml-latest-small) describes 5-star rating and free-text tagging activity from [MovieLens](http://movielens.org), a movie recommendation service. It contains 100836 ratings and 3683 tag applications across 9742 movies. These data were created by 610 users between March 29, 1996 and September 24, 2018. This dataset was generated on September 26, 2018.

Users were selected at random for inclusion. All selected users had rated at least 20 movies. No demographic information is included. Each user is represented by an id, and no other information is provided.

The data are contained in the files `links.csv`, `movies.csv`, `ratings.csv` and `tags.csv`. More details about the contents and use of all these files follow.

This is a *development* dataset. As such, it may change over time and is not an appropriate dataset for shared research results. See available *benchmark* datasets if that is your intent.

This and other GroupLens data sets are publicly available for download at <http://grouplens.org/datasets/>.


## Evaluation metrics:
+ RMSE
+ MAE
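
For reference, the two error measures computed later in the notebook can be written as follows. The notation is introduced here for illustration: $r_{u,i}$ is the true rating of user $u$ for movie $i$, $\hat{r}_{u,i}$ is the predicted rating, and $N$ is the number of test pairs for which a prediction is available.

$$
\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{(u,i)} \left(r_{u,i} - \hat{r}_{u,i}\right)^2},
\qquad
\mathrm{MAE} = \frac{1}{N}\sum_{(u,i)} \left|r_{u,i} - \hat{r}_{u,i}\right|
$$

Because the errors are squared before averaging, RMSE is never smaller than MAE and is more sensitive to a few large errors.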

In [None]:
import numpy as np
import pandas as pd

# evaluate the SVD for the MovieLens dataset
from sklearn.metrics import mean_squared_error
from math import sqrt
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestNeighbors
from sklearn.metrics.pairwise import cosine_similarity



In [3]:
movies_df = pd.read_csv('../data/ml-latest-small/movies.csv')
ratings_df = pd.read_csv('../data/ml-latest-small/ratings.csv')
links_df = pd.read_csv('../data/ml-latest-small/links.csv')
tags_df = pd.read_csv('../data/ml-latest-small/tags.csv')


In [4]:
movies_df.info()
movies_df.head(3)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9742 entries, 0 to 9741
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  9742 non-null   int64 
 1   title    9742 non-null   object
 2   genres   9742 non-null   object
dtypes: int64(1), object(2)
memory usage: 228.5+ KB


Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance


### --------> OBSERVATIONS:

+ movieId: A unique identifier for the movie.
+ title: The title of the movie, along with its release year in parentheses.
+ genres: The genres associated with the movie, separated by pipe characters (|).

In [5]:
ratings_df.info()
ratings_df.head(3)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100836 entries, 0 to 100835
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   userId     100836 non-null  int64  
 1   movieId    100836 non-null  int64  
 2   rating     100836 non-null  float64
 3   timestamp  100836 non-null  int64  
dtypes: float64(1), int64(3)
memory usage: 3.1 MB


Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224


### --------> OBSERVATIONS:

+ userId: A unique identifier for the user who provided the rating.
+ movieId: A unique identifier for the movie, which is consistent with the movieId in the movies.csv file.
+ rating: The user's rating for the movie on a scale of 0.5 to 5 stars, given in increments of 0.5 stars.
+ timestamp: The Unix timestamp representing the time at which the user provided the rating.

In [6]:
links_df.info()
links_df.head(3)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9742 entries, 0 to 9741
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   movieId  9742 non-null   int64  
 1   imdbId   9742 non-null   int64  
 2   tmdbId   9734 non-null   float64
dtypes: float64(1), int64(2)
memory usage: 228.5 KB


Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0
2,3,113228,15602.0


### ------> OBSERVATIONS:

+ movieId: A unique identifier for the movie, which is consistent with the movieId in the movies.csv file.
+ imdbId: The Internet Movie Database (IMDb) identifier for the movie. 
+ tmdbId: The Movie Database (TMDb) identifier for the movie. 

In [7]:
tags_df.info()
tags_df.head(3)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3683 entries, 0 to 3682
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   userId     3683 non-null   int64 
 1   movieId    3683 non-null   int64 
 2   tag        3683 non-null   object
 3   timestamp  3683 non-null   int64 
dtypes: int64(3), object(1)
memory usage: 115.2+ KB


Unnamed: 0,userId,movieId,tag,timestamp
0,2,60756,funny,1445714994
1,2,60756,Highly quotable,1445714996
2,2,60756,will ferrell,1445714992


### --------> OBSERVATIONS:

+ userId: A unique identifier for the user who assigned the tag.
+ movieId: A unique identifier for the movie, which is consistent with the movieId in the movies.csv file.
+ tag: A text label assigned by the user to describe or categorize the movie.
+ timestamp: The Unix timestamp representing the time at which the user assigned the tag.

In [8]:
# Extract user-item interactions from the ratings DataFrame
user_item_ratings = ratings_df[['userId', 'movieId', 'rating']]

# Define the SVD (Singular Value Decomposition) model class
class SVD:
    def __init__(self, num_factors, learning_rate, num_epochs):
        # Initialize model parameters: number of factors (dimensionality of the user/item embeddings),
        # learning rate for stochastic gradient descent, and number of epochs for training
        self.num_factors = num_factors
        self.learning_rate = learning_rate
        self.num_epochs = num_epochs

    def fit(self, user_item_ratings):
        # Initialize user and movie factors as random matrices with dimensions (# of unique users/items) x (num_factors)
        self.user_factors = np.random.randn(user_item_ratings.userId.nunique(), self.num_factors)
        self.movie_factors = np.random.randn(user_item_ratings.movieId.nunique(), self.num_factors)
        
        # Create dictionaries mapping user/movie IDs to their corresponding indices in the factor matrices
        self.user_index = {user_id: idx for idx, user_id in enumerate(user_item_ratings.userId.unique())}
        self.movie_index = {movie_id: idx for idx, movie_id in enumerate(user_item_ratings.movieId.unique())}

        # Iterate through the data num_epochs times
        for epoch in range(self.num_epochs):
            # Iterate over all user-movie interactions in the data
            for _, (user_id, movie_id, rating) in user_item_ratings.iterrows():
                # Get user/movie indices
                user_idx = self.user_index[user_id]
                movie_idx = self.movie_index[movie_id]

                # Predict the rating and calculate the prediction error
                prediction = np.dot(self.user_factors[user_idx], self.movie_factors[movie_idx])
                error = rating - prediction

                # Update user and movie factors using the calculated error and the learning rate.
                # Cache the user vector first so the movie update uses the pre-update value.
                user_vec = self.user_factors[user_idx].copy()
                self.user_factors[user_idx] += self.learning_rate * error * self.movie_factors[movie_idx]
                self.movie_factors[movie_idx] += self.learning_rate * error * user_vec

    def predict(self, user_id, movie_id):
        # Get user/movie indices
        user_idx = self.user_index.get(user_id, -1)
        movie_idx = self.movie_index.get(movie_id, -1)

        # If the user or movie is unknown, return None
        if user_idx == -1 or movie_idx == -1:
            return None

        # Predict the rating as the dot product of user and movie factors
        return np.dot(self.user_factors[user_idx], self.movie_factors[movie_idx])
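
The update rule in `fit` can be read as one stochastic gradient step on the squared error of a single rating. The toy example below is a minimal sketch of that step. The ratings are made up, and the initialization scale of 0.1, the learning rate of 0.01, and the 500 epochs are assumptions chosen for the illustration, not values taken from the notebook. The user vector is cached before updating so that both updates use pre-update values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: (user_idx, movie_idx, rating) triples for 4 users and 3 movies
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0),
           (2, 2, 1.0), (3, 1, 2.0), (3, 2, 4.5)]

num_factors = 2
learning_rate = 0.01  # assumed value for this sketch

# Small random initial vectors (assumed scale of 0.1)
user_factors = rng.normal(scale=0.1, size=(4, num_factors))
movie_factors = rng.normal(scale=0.1, size=(3, num_factors))


def train_rmse():
    """RMSE of the current factors on the toy ratings."""
    errors = [r - user_factors[u] @ movie_factors[m] for u, m, r in ratings]
    return float(np.sqrt(np.mean(np.square(errors))))


rmse_before = train_rmse()

for epoch in range(500):
    for u, m, r in ratings:
        # error = r - p_u . q_m; for the loss 0.5 * error**2 the gradient
        # with respect to p_u is -error * q_m (and symmetrically for q_m)
        error = r - user_factors[u] @ movie_factors[m]
        user_old = user_factors[u].copy()
        user_factors[u] += learning_rate * error * movie_factors[m]
        movie_factors[m] += learning_rate * error * user_old

rmse_after = train_rmse()
print(f"train RMSE before: {rmse_before:.3f}, after: {rmse_after:.3f}")
```

On such a tiny data set the training error only shows that the updates move in the right direction; it says nothing about held-out accuracy.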


In [10]:
%%time

def rmse(y_true, y_pred):
    mse = mean_squared_error(y_true, y_pred)
    return sqrt(mse)

def mae(y_true, y_pred):
    return np.mean(np.abs(np.array(y_true) - np.array(y_pred)))


# Split the data into train and test sets
train_df, test_df = train_test_split(user_item_ratings, test_size=0.2, random_state=42)

# Train the model on the train set
model_SVD = SVD(num_factors=35, learning_rate=0.001, num_epochs=100)
model_SVD.fit(train_df)

# Predict and evaluate on the test set
y_true = []
y_pred = []

for _, (user_id, movie_id, rating) in test_df.iterrows():
    prediction = model_SVD.predict(user_id, movie_id)
    
    if prediction is not None:
        y_true.append(rating)
        y_pred.append(prediction)

rmse_value = rmse(y_true, y_pred)
print(f"Root Mean Squared Error: {rmse_value}")
print(f"Mean Absolute Error: {mae(y_true, y_pred)}")


Root Mean Squared Error: 2.0974801454615064
Mean Absolute Error: 1.3493420650933812
CPU times: user 4min 53s, sys: 1.08 s, total: 4min 54s
Wall time: 4min 55s


### ----------> OBSERVATIONS: 

The model's Root Mean Squared Error (RMSE) is 2.097 and its Mean Absolute Error (MAE) is 1.349, measured on the test pairs whose user and movie both appear in the training set (other pairs are skipped because `predict` returns `None`). The MAE means that predictions are off by about 1.35 rating points on average; the larger RMSE shows that some errors are much bigger, since RMSE weights large errors more heavily. On a rating scale of 0.5 to 5.0 these errors are high, so the model's accuracy could be improved. The cell took approximately 4 minutes and 55 seconds to run.
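
One factor that may contribute to the large errors is the initialization. `np.random.randn` draws entries with unit variance, so the dot product of two independent 35-dimensional vectors has a standard deviation of about sqrt(35) ≈ 5.9, which is wider than the whole rating scale. The sketch below checks this numerically and compares it with a scale of 0.1. That smaller scale is an assumed alternative for illustration, not something the notebook uses.

```python
import numpy as np

rng = np.random.default_rng(0)
num_factors = 35      # same number of factors as in the notebook
num_samples = 100_000  # number of simulated (user, movie) pairs


def initial_prediction_std(scale):
    """Standard deviation of dot products between random factor vectors."""
    users = rng.normal(scale=scale, size=(num_samples, num_factors))
    movies = rng.normal(scale=scale, size=(num_samples, num_factors))
    # Row-wise dot product: one simulated initial prediction per pair
    predictions = np.einsum('ij,ij->i', users, movies)
    return float(predictions.std())


std_unit = initial_prediction_std(1.0)   # matches np.random.randn
std_small = initial_prediction_std(0.1)  # assumed smaller scale

print(f"scale 1.0: std of initial predictions = {std_unit:.2f}")
print(f"scale 0.1: std of initial predictions = {std_small:.3f}")
```

With unit-variance vectors the first predictions are spread far outside the 0.5 to 5.0 range, so early updates have to undo that spread before they can fit the ratings.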

# KNN Based CF

In [16]:
class KNN_CF:
    def __init__(self, k, user_based=True):
        # Initialize model parameters: number of neighbors (K) and whether the model is user-based or item-based
        self.k = k
        self.user_based = user_based

        # Initialize the KNN model
        self.model = NearestNeighbors(n_neighbors=self.k, metric='cosine')

    def fit(self, user_item_ratings):
        # Create a user-item interactions matrix (unrated entries are filled with 0)
        self.interactions_matrix = user_item_ratings.pivot(index='userId', columns='movieId', values='rating').fillna(0)

        # If the model is item-based, transpose the interactions matrix
        if not self.user_based:
            self.interactions_matrix = self.interactions_matrix.T

        # Fit the KNN model
        self.model.fit(self.interactions_matrix)

    def predict(self, user_id, movie_id):
        # If the model is user-based
        if self.user_based:
            # If the user or movie is unknown, return None
            if user_id not in self.interactions_matrix.index or movie_id not in self.interactions_matrix.columns:
                return None

            # Find the K nearest neighbors to the user
            distances, indices = self.model.kneighbors(self.interactions_matrix.loc[user_id].values.reshape(1, -1))

            # Calculate the predicted rating as the weighted average of the neighbors' ratings for the movie
            weights = 1 - distances.flatten()
            ratings = self.interactions_matrix.iloc[indices.flatten()].loc[:, movie_id]
            prediction = np.dot(ratings, weights) / np.sum(weights)

            return prediction

        # If the model is item-based
        else:
            # If the user or movie is unknown, return None
            if user_id not in self.interactions_matrix.columns or movie_id not in self.interactions_matrix.index:
                return None

            # Find the K nearest neighbors to the movie
            distances, indices = self.model.kneighbors(self.interactions_matrix.loc[movie_id].values.reshape(1, -1))

            # Calculate the predicted rating as the weighted average of the neighbors' ratings by the user
            weights = 1 - distances.flatten()
            ratings = self.interactions_matrix.iloc[indices.flatten()].loc[:, user_id]
            prediction = np.dot(ratings, weights) / np.sum(weights)

            return prediction

# User-Based Collaborative Filtering

The model finds users that are similar to the target user based on their rating history. It then predicts the target user's rating for a specific item based on the ratings given to that item by the similar users.
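
The following is a small sketch of that prediction step on a made-up 4-user, 3-movie matrix (the user ids, movie ids, and ratings are hypothetical). It follows the same recipe as `KNN_CF.predict`: cosine nearest neighbors on a zero-filled matrix, then an average of the neighbors' ratings weighted by `1 - distance`. It also shows that a user who is part of the fitted matrix comes back as their own nearest neighbor.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Hypothetical toy matrix: rows are users, columns are movies, 0 means "not rated"
matrix = pd.DataFrame(
    [[5.0, 4.0, 0.0],
     [4.0, 5.0, 1.0],
     [1.0, 0.0, 5.0],
     [5.0, 3.0, 0.0]],
    index=[10, 11, 12, 13],     # user ids
    columns=[100, 101, 102],    # movie ids
)

knn = NearestNeighbors(n_neighbors=2, metric='cosine')
knn.fit(matrix.values)

# Query with user 10, who is also one of the fitted rows
distances, indices = knn.kneighbors(matrix.loc[[10]].values)

# Cosine similarity = 1 - cosine distance, used as the neighbor weight
weights = 1 - distances.flatten()
neighbor_ids = matrix.index[indices.flatten()].tolist()

# Weighted average of the neighbors' ratings for movie 101
neighbor_ratings = matrix.iloc[indices.flatten()][101].values
prediction = float(np.dot(neighbor_ratings, weights) / weights.sum())

print("neighbors:", neighbor_ids)
print(f"predicted rating for movie 101: {prediction:.2f}")
```

Because user 10 is in the fitted data, the first neighbor is user 10 itself, so the user's own stored rating enters the average with the largest weight.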



In [17]:
%%time
# Initialize and train the user-based KNN_CF model
# (fitted on the full ratings table, which also contains the test rows)
model_user_based_knn_cf = KNN_CF(k=5, user_based=True)
model_user_based_knn_cf.fit(user_item_ratings)

# predict and evaluate on the test set
y_true = []
y_pred = []

for _, (user_id, movie_id, rating) in test_df.iterrows():
    prediction = model_user_based_knn_cf.predict(user_id, movie_id)
    
    if prediction is not None:
        y_true.append(rating)
        y_pred.append(prediction)

print(f"Root Mean Squared Error: {rmse(y_true, y_pred)}")
print(f"Mean Absolute Error: {mae(y_true, y_pred)}")

Root Mean Squared Error: 1.4820965972028965
Mean Absolute Error: 1.2579065539751189
CPU times: user 30min 22s, sys: 8min 11s, total: 38min 34s
Wall time: 4min 16s


### -------> OBSERVATION: 

The User-Based Collaborative Filtering model has an RMSE of 1.4821 and an MAE of 1.2579, both lower than the SVD model's 2.097 and 1.349. On a rating scale of 0.5 to 5.0, an average absolute error of about 1.26 points is still substantial. Note that the model was fitted on the full `user_item_ratings`, which contains the test rows, so these figures are not a clean held-out estimate. The cell took approximately 4 minutes and 16 seconds of wall time to run.

# Item-Based Collaborative Filtering

The model finds items that are similar to the target item based on the ratings they received from users. It then predicts the target user's rating for the target item based on the user's ratings for the similar items.
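
In symbols, the item-based branch of `KNN_CF.predict` computes the following weighted average. The notation is introduced here for illustration: $N_k(i)$ is the set of the $k$ nearest items to item $i$ returned by the cosine nearest-neighbor search, $d_{\cos}(i, j)$ is the cosine distance between the rating vectors of items $i$ and $j$, and $r_{u,j}$ is the stored rating of user $u$ for item $j$ (0 if the user has not rated it, because of `fillna(0)`).

$$
\hat{r}_{u,i} = \frac{\sum_{j \in N_k(i)} w_{ij}\, r_{u,j}}{\sum_{j \in N_k(i)} w_{ij}},
\qquad
w_{ij} = 1 - d_{\cos}(i, j)
$$

The user-based branch has the same form with the roles of users and items exchanged.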

In [18]:
%%time
# Initialize and train the item-based KNN_CF model
# (fitted on the full ratings table, which also contains the test rows)
model_item_based_knn_cf = KNN_CF(k=5, user_based=False)
model_item_based_knn_cf.fit(user_item_ratings)

# predict and evaluate on the test set
y_true = []
y_pred = []

for _, (user_id, movie_id, rating) in test_df.iterrows():
    prediction = model_item_based_knn_cf.predict(user_id, movie_id)
    
    if prediction is not None:
        y_true.append(rating)
        y_pred.append(prediction)

print(f"Root Mean Squared Error: {rmse(y_true, y_pred)}")
print(f"Mean Absolute Error: {mae(y_true, y_pred)}")

Root Mean Squared Error: 1.422181994356813
Mean Absolute Error: 1.1501393384975036
CPU times: user 32min 39s, sys: 8min 39s, total: 41min 19s
Wall time: 4min 29s


# Auto-Adaptive Imputation (AutAI)

In [21]:
%%time

def autai_imputation(R, active_user, active_item):
    # R is a 2-D float array (users x items) with np.nan marking missing ratings;
    # active_user and active_item are positional row/column indices into R.
    # Initialize the imputed matrix R_prime
    R_prime = R.copy()

    # Calculate similarity between each pair of users.
    # cosine_similarity rejects NaN, so missing ratings are treated as 0 here.
    sim = cosine_similarity(np.nan_to_num(R))

    # Identify the set of users who rated the active item
    Ua = np.where(~np.isnan(R[:, active_item]))[0]

    # Identify the set of items rated by the active user
    Ts = np.where(~np.isnan(R[active_user]))[0]

    # Perform imputation for each missing rating in the submatrix Na_s = R[Ua, Ts]
    for u in Ua:
        for i in Ts:
            if np.isnan(R[u, i]):
                # Weighted average of the known ratings for item i, weighted by similarity to user u
                mask = ~np.isnan(R[:, i])
                weights = sim[u, mask]
                if weights.sum() > 0:
                    # Impute the rating
                    R_prime[u, i] = np.average(R[mask, i], weights=weights)

    return R_prime

# Build the user x item matrix (NaN = not rated); the long-format ratings table cannot be imputed directly
ratings_matrix = user_item_ratings.pivot(index='userId', columns='movieId', values='rating')

# Perform imputation on the user-item ratings matrix
imputed = autai_imputation(ratings_matrix.values, active_user=0, active_item=0)

# Convert the imputed matrix back to long format (userId, movieId, rating), dropping entries that are still missing
user_item_ratings_imputed = (
    pd.DataFrame(imputed, index=ratings_matrix.index, columns=ratings_matrix.columns)
    .reset_index()
    .melt(id_vars='userId', var_name='movieId', value_name='rating')
    .dropna(subset=['rating'])
    .astype({'movieId': 'int64'})
)

# Split the data into train and test sets (the test set may include imputed ratings)
train_df, test_df = train_test_split(user_item_ratings_imputed, test_size=0.2, random_state=42)

# Initialize and train the item-based KNN_CF model on the imputed training data
model_autai_knn_cf = KNN_CF(k=5, user_based=False)
model_autai_knn_cf.fit(train_df)

# predict and evaluate on the test set
y_true = []
y_pred = []

for _, (user_id, movie_id, rating) in test_df.iterrows():
    prediction = model_autai_knn_cf.predict(user_id, movie_id)
    
    if prediction is not None:
        y_true.append(rating)
        y_pred.append(prediction)

print(f"Root Mean Squared Error: {rmse(y_true, y_pred)}")
print(f"Mean Absolute Error: {mae(y_true, y_pred)}")


