# SVD from scratch using gradient descent

uses the described backpropagation algorithm for gradient descent and singular vectors as described in the [article](https://sifter.org/simon/journal/20070815.html)

1. Extract user-item interactions from the ratings dataframe.
2. Define the SVD model with functions for initializing the user and movie vectors, predicting ratings, and updating the vectors using gradient descent.
3. Train the model on the user-item interactions data.
4. Use the learned vectors to make predictions on new user-movie pairs.

## Data

This dataset (ml-latest-small) describes 5-star rating and free-text tagging activity from [MovieLens](http://movielens.org), a movie recommendation service. It contains 100836 ratings and 3683 tag applications across 9742 movies. These data were created by 610 users between March 29, 1996 and September 24, 2018. This dataset was generated on September 26, 2018.

Users were selected at random for inclusion. All selected users had rated at least 20 movies. No demographic information is included. Each user is represented by an id, and no other information is provided.

The data are contained in the files `links.csv`, `movies.csv`, `ratings.csv` and `tags.csv`. More details about the contents and use of all these files follows.

This is a *development* dataset. As such, it may change over time and is not an appropriate dataset for shared research results. See available *benchmark* datasets if that is your intent.

This and other GroupLens data sets are publicly available for download at <http://grouplens.org/datasets/>.


## Evaluation metric: 
+ RMSE

In [None]:
import numpy as np
import pandas as pd

from sklearn.metrics import mean_squared_error
from math import sqrt
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestNeighbors
import pickle



In [44]:
ratings_df = pd.read_csv('../data/ml-latest-small/ratings.csv')

# unique users
print(f'Number of unique users: {ratings_df.userId.unique().shape[0]}\n')

# unique movies
print(f'Number of unique movies: {ratings_df.movieId.unique().shape[0]}\n')

# unique ratings
print(f'Number of unique ratings: {ratings_df.rating.unique().shape[0]}')
# unique ratings
print(f'Number of unique ratings: {ratings_df.rating.unique()}\n')

ratings_df.info()
ratings_df.head(3)


Number of unique users: 610

Number of unique movies: 9724

Number of unique ratings: 10
Number of unique ratings: [4.  5.  3.  2.  1.  4.5 3.5 2.5 0.5 1.5]

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100836 entries, 0 to 100835
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   userId     100836 non-null  int64  
 1   movieId    100836 non-null  int64  
 2   rating     100836 non-null  float64
 3   timestamp  100836 non-null  int64  
dtypes: float64(1), int64(3)
memory usage: 3.1 MB


Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224


### --------> OBSERVATIONS:

+ userId: A unique identifier for the user who provided the rating.
+ movieId: A unique identifier for the movie, which is consistent with the movieId in the movies.csv file.
+ rating: The user's rating for the movie on a scale of 0.5 to 5 stars, given in increments of 0.5 stars.
+ timestamp: The Unix timestamp representing the time at which the user provided the rating.

In [35]:
def split_data_by_rated_items_unique_users(df, test_size, given_n):

    # Extract user-item interactions from the ratings DataFrame
    df = df[['userId', 'movieId', 'rating']]

    unique_users = df['userId'].unique()
    np.random.shuffle(unique_users)

    # Get the user IDs for each set
    # the size of each set is 50, 100, 400, and 60 users respectively
    M50_users = unique_users[:50]
    M100_users = unique_users[50:150]
    M400_users = unique_users[150:550]
    test_users = unique_users[550:]

    # Split the DataFrame into the different sets based on the user IDs
    M50_df = df[df['userId'].isin(M50_users)]
    M100_df = df[df['userId'].isin(M100_users)]
    M400_df = df[df['userId'].isin(M400_users)]
    test_df = df[df['userId'].isin(test_users)]

    # For each user in the test set, keep only 'given_n' rated items if they have rated that many,
    # otherwise keep all the items they have rated.
    test_df = test_df.groupby('userId').apply(lambda x: x.sample(min(len(x), given_n), random_state=42))
    test_df = test_df.reset_index(drop=True)

    return M50_df, M100_df, M400_df, test_df


def all_but_one(df):
    # For each user, select one rating and split it into a separate DataFrame
    test_df = df.groupby('userId').sample(n=1, random_state=42)
    train_df = df.drop(test_df.index)
    
    return train_df, test_df


# Call the functions
M50_df, M100_df, M400_df, test_df_given_10 = split_data_by_rated_items_unique_users(ratings_df, test_size=0.2, given_n=10)
# print('Training set:\n', train_df_given_10)
# print('Test set:\n', test_df_given_10)

train_df, test_df = all_but_one(ratings_df)
# print('All-But-One Training set:\n', train_df)
# print('All-But-One Test set:\n', test_df)

In [36]:
# rmse: root mean squared error
def rmse(y_true, y_pred):
    mse = mean_squared_error(y_true, y_pred)
    return sqrt(mse)

# mae: mean absolute error
def mae(y_true, y_pred):
    return np.mean(np.abs(np.array(y_true) - np.array(y_pred)))

In [37]:
# Define the SVD (Singular Value Decomposition) model class
class SVD:
    def __init__(self, num_factors, learning_rate, num_epochs):
        # Initialize model parameters: number of factors (dimensionality of the user/item embeddings),
        # learning rate for stochastic gradient descent, and number of epochs for training
        self.num_factors = num_factors
        self.learning_rate = learning_rate
        self.num_epochs = num_epochs

    def fit(self, user_item_ratings):
        # Initialize user and movie factors as random matrices with dimensions (# of unique users/items) x (num_factors)
        self.user_factors = np.random.randn(user_item_ratings.userId.nunique(), self.num_factors)
        self.movie_factors = np.random.randn(user_item_ratings.movieId.nunique(), self.num_factors)
        
        # Create dictionaries mapping user/movie IDs to their corresponding indices in the factor matrices
        self.user_index = {user_id: idx for idx, user_id in enumerate(user_item_ratings.userId.unique())}
        self.movie_index = {movie_id: idx for idx, movie_id in enumerate(user_item_ratings.movieId.unique())}

        # Iterate through the data num_epochs times
        for epoch in range(self.num_epochs):
            # Iterate over all user-movie interactions in the data
            for _, (user_id, movie_id, rating) in user_item_ratings.iterrows():
                # Get user/movie indices
                user_idx = self.user_index[user_id]
                movie_idx = self.movie_index[movie_id]

                # Predict the rating and calculate the prediction error
                prediction = np.dot(self.user_factors[user_idx], self.movie_factors[movie_idx])
                error = rating - prediction

                # Update user and movie factors using the calculated error, and the learning rate
                self.user_factors[user_idx] += self.learning_rate * error * self.movie_factors[movie_idx]
                self.movie_factors[movie_idx] += self.learning_rate * error * self.user_factors[user_idx]

    def predict(self, user_id, movie_id):
        # Get user/movie indices
        user_idx = self.user_index.get(user_id, -1)
        movie_idx = self.movie_index.get(movie_id, -1)

        # If the user or movie is unknown, return None
        if user_idx == -1 or movie_idx == -1:
            return None

        # Predict the rating as the dot product of user and movie factors
        return np.dot(self.user_factors[user_idx], self.movie_factors[movie_idx])


In [42]:
%%time
# Train the model on the train set
model_SVD_M50_df = SVD(num_factors=35, learning_rate=0.001, num_epochs=100)
model_SVD_M50_df.fit(M50_df)

# Predict and evaluate on the test set
y_true = []
y_pred = []

for _, (user_id, movie_id, rating) in test_df_given_10.iterrows():
    prediction = model_SVD_M50_df.predict(user_id, movie_id)
    
    if prediction is not None:
        y_true.append(rating)
        y_pred.append(prediction)

rmse_value = rmse(y_true, y_pred)
print(f"Root Mean Squared Error: {rmse_value}")
print(f"Mean Absolute Error: {mae(y_true, y_pred)}")

# export model into model folder
pickle.dump(model_SVD_M50_df, open('../models/model_SVD_M50_df.pkl', 'wb'))
print('model_SVD_M50_df.pkl exported')


ValueError: Found array with 0 sample(s) (shape=(0,)) while a minimum of 1 is required.

### ----------> OBSERVATIONS: 

The model's Root Mean Squared Error (RMSE) is 2.097, and the Mean Absolute Error (MAE) is 1.349. These values indicate that on average, the model's predictions are approximately 2.1 and 1.35 rating points off from the actual ratings, respectively. Given the rating scale of 0.5 to 5.0, these errors are relatively high, suggesting that the model's accuracy could be improved. The total computation time was approximately 4 minutes and 55 seconds.

# KNN Based CF

In [22]:
class KNN_CF:
    def __init__(self, k, user_based=True):
        self.k = k
        self.user_based = user_based
        self.EPSILON = 1e-9

    def fit(self, user_item_ratings):
        self.user_item_matrix = user_item_ratings.pivot(index='userId', columns='movieId', values='rating').fillna(0).values
        self.n_users, self.n_items = self.user_item_matrix.shape
        self.user_ids = user_item_ratings['userId'].unique()
        self.item_ids = user_item_ratings['movieId'].unique()

        # Compute Pearson Correlation Coefficient for Each Pair
        if self.user_based:
            self.matrix = self.user_item_matrix
            self.n_entities = self.n_users
        else:
            self.matrix = self.user_item_matrix.T
            self.n_entities = self.n_items

        self.pearson_corr = np.zeros((self.n_entities, self.n_entities))
        for i in range(self.n_entities):
            for j in range(self.n_entities):
                mask_i = self.matrix[i, :] > 0
                mask_j = self.matrix[j, :] > 0
                corrated_index = np.intersect1d(np.where(mask_i), np.where(mask_j))
                if len(corrated_index) == 0:
                    continue
                mean_i = np.sum(self.matrix[i, :]) / (np.sum(np.clip(self.matrix[i, :], 0, 1)) + self.EPSILON)
                mean_j = np.sum(self.matrix[j, :]) / (np.sum(np.clip(self.matrix[j, :], 0, 1)) + self.EPSILON)
                i_sub_mean = self.matrix[i, corrated_index] - mean_i
                j_sub_mean = self.matrix[j, corrated_index] - mean_j
                sum_sqrt_i = np.sqrt(np.sum(np.square(i_sub_mean)))
                sum_sqrt_j = np.sqrt(np.sum(np.square(j_sub_mean)))
                sim = np.sum(i_sub_mean * j_sub_mean) / (sum_sqrt_i * sum_sqrt_j + self.EPSILON)
                self.pearson_corr[i, j] = sim

    def predict(self, user_id, movie_id):
        if user_id in self.user_ids and movie_id in self.item_ids:
            user_idx = np.where(self.user_ids == user_id)[0][0]
            item_idx = np.where(self.item_ids == movie_id)[0][0]
            
            # User-based CF
            if self.user_based:
                sim_entity_indices = np.argsort(self.pearson_corr[user_idx])[-(self.k + 1):-1]
                sim_vals = self.pearson_corr[user_idx, sim_entity_indices]
                entity_mean = np.sum(self.matrix[user_idx, :]) / (np.sum(np.clip(self.matrix[user_idx, :], 0, 1)) + self.EPSILON)
                sim_entities = self.matrix[sim_entity_indices, :]
                mask_rated = sim_entities[:, item_idx] > 0
                sim_vals = sim_vals[mask_rated]
                sim_entities = sim_entities[mask_rated, :]
                
            # Item-based CF
            else:
                sim_entity_indices = np.argsort(self.pearson_corr[item_idx])[-(self.k + 1):-1]
                sim_vals = self.pearson_corr[item_idx, sim_entity_indices]
                entity_mean = np.sum(self.matrix[:, user_idx]) / (np.sum(np.clip(self.matrix[:, user_idx], 0, 1)) + self.EPSILON)
                sim_entities = self.matrix[:, sim_entity_indices]
                mask_rated = sim_entities[user_idx, :] > 0
                sim_vals = sim_vals[mask_rated]
                sim_entities = sim_entities[:, mask_rated]

            sim_entity_mean = np.sum(sim_entities, axis=1) / (np.sum(np.clip(sim_entities, 0, 1), axis=1) + self.EPSILON)
            sim_r_sum_mean = sim_vals * (sim_entities - sim_entity_mean[:, None])
            
            prediction = entity_mean + np.sum(sim_r_sum_mean) / (np.sum(sim_vals) + self.EPSILON)
            prediction = np.clip(prediction, 0, 5)
            
            return prediction
        else:
            return None


# User-Based Collaborative Filtering

the model finds users that are similar to the target user based on their rating history. It then predicts the target user's rating for a specific item based on the ratings given to that item by the similar users.



In [23]:
%%time
# Initialize and train the User based KNN_CF model
model_user_based_knn_cf = KNN_CF(k=5, user_based=True)
model_user_based_knn_cf.fit(user_item_ratings)

# predict and evaluate on the test set
y_true = []
y_pred = []

for _, (user_id, movie_id, rating) in test_df.iterrows():
    prediction = model_user_based_knn_cf.predict(user_id, movie_id)
    
    if prediction is not None:
        y_true.append(rating)
        y_pred.append(prediction)

# assuming that rmse and mae are defined or imported from appropriate library
print(f"Root Mean Squared Error: {rmse(y_true, y_pred)}")
print(f"Mean Absolute Error: {mae(y_true, y_pred)}")


ValueError: operands could not be broadcast together with shapes (0,) (0,9724) 

### -------> OBSERVATION: 

The User-Based Collaborative Filtering model has an RMSE of 1.4821 and an MAE of 1.2579. These values represent the average discrepancy between the predicted and actual ratings. Given the rating scale of 0.5 to 5.0, these results are moderately good, suggesting the model's predictions are reasonably close to the actual ratings. The model took approximately 4 minutes and 16 seconds to run.

# Item-Based Collaborative Filtering

the model finds items that are similar to the target item based on the ratings they received from users. It then predicts the target user's rating for the target item based on the user's ratings for the similar items.

In [None]:
%%time
# Initialize and train the Item based KNN_CF model
model_item_based_knn_cf = KNN_CF(k=5, user_based=False)
model_item_based_knn_cf.fit(user_item_ratings)

# Predict and evaluate on the test set
y_true = []
y_pred = []

for _, (user_id, movie_id, rating) in test_df.iterrows():
    prediction = model_item_based_knn_cf.predict(user_id, movie_id)
    
    if prediction is not None:
        y_true.append(rating)
        y_pred.append(prediction)

# Assuming that rmse and mae are defined or imported from an appropriate library
print(f"Root Mean Squared Error: {rmse(y_true, y_pred)}")
print(f"Mean Absolute Error: {mae(y_true, y_pred)}")

Root Mean Squared Error: 1.422181994356813
Mean Absolute Error: 1.1501393384975036
CPU times: user 32min 39s, sys: 8min 39s, total: 41min 19s
Wall time: 4min 29s


# Auto-Adaptive Imputation (AutAI)

In [21]:
%%time

def autai_imputation(R, active_user, active_item):
    # Initialize the imputed matrix R_prime
    R_prime = R.copy()

    # Calculate similarity between each pair of users
    # not this
    sim = cosine_similarity(R)

    # Identify the set of items rated by the active user
    Ua = np.where(~np.isnan(R[active_user]))[0]

    # Identify the set of items co-rated by the active user and each other user
    Ts = set()
    for ua_prime in Ua:
        Taa_prime = np.where(~np.isnan(R[:, ua_prime]))[0]
        Ts = Ts.union(Taa_prime)

    # Identify the set of ratings by the active user and on the co-rated items
    Na_s = R[np.ix_(Ua, list(Ts))]

    # Perform imputation for each rating in Na_s
    for ua_prime in range(Na_s.shape[0]):
        for i in range(Na_s.shape[1]):
            if np.isnan(Na_s[ua_prime, i]):
                # Calculate the weighted average of the user's ratings, weighted by similarity
                weights = sim[ua_prime, :]
                ratings = R[:, i]
                mask = ~np.isnan(ratings)
                r_hat = np.average(ratings[mask], weights=weights[mask])
                
                # Impute the rating
                R_prime[ua_prime, i] = r_hat

    return R_prime

# Perform imputation on the user-item ratings matrix
user_item_ratings_imputed = autai_imputation(user_item_ratings.values, active_user=0, active_item=0)

# Convert the imputed ratings matrix to a dataframe
user_item_ratings_imputed = pd.DataFrame(user_item_ratings_imputed, index=user_item_ratings.index, columns=user_item_ratings.columns)

# Split the data into train and test sets
train_df, test_df = train_test_split(user_item_ratings_imputed, test_size=0.2, random_state=42)

# Initialize and train the User based KNN_CF model
model_user_based_knn_cf = KNN_CF(k=5, user_based=False)
model_user_based_knn_cf.fit(train_df)

# predict and evaluate on the test set
y_true = []
y_pred = []

for _, (user_id, movie_id, rating) in test_df.iterrows():
    prediction = model_user_based_knn_cf.predict(user_id, movie_id)
    
    if prediction is not None:
        y_true.append(rating)
        y_pred.append(prediction)

print(f"Root Mean Squared Error: {rmse(y_true, y_pred)}")
print(f"Mean Absolute Error: {mae(y_true, y_pred)}")



: 

: 