Tasks
1. Data Retrieval and Preprocessing
- Obtain MovieLens 1M dataset. []
- Load dataset. []
- Check data integrity: []
  - Address issues like missing movies. []
  - Handle data inconsistencies (e.g., user IDs with additional data, ratings for non-existent movies). []
- Create User-Item Interaction Matrix. []
- Split data for 5-fold cross-validation. []
- Handle cold starts (users or items not seen during training). []
2. Recommendation Algorithms
- Implement Naive Approaches: []
  - Global Average Rating. []
  - Average Rating per Item. []
  - Average Rating per User. []
  - Optimal Linear Combination with and without bias. []
- Implement UV Matrix Decomposition. []
- Implement Matrix Factorization with Gradient Descent and Regularization. []
- For each algorithm, calculate:
  - RMSE (Root Mean Squared Error) and MAE (Mean Absolute Error). []
- Address Cold Starts for the implemented algorithms. []
3. Visualization
- Apply dimensionality reduction techniques for visualization: []
  - PCA (Principal Component Analysis). []
  - t-SNE (t-Distributed Stochastic Neighbor Embedding). []
  - UMAP (Uniform Manifold Approximation and Projection). []
4. Documentation and Reporting 
- Document code, algorithms, and preprocessing steps. []
- Summarize and analyze the results of each algorithm. []
- Provide insights into the best-performing algorithms. []
- Discuss challenges and limitations encountered during the implementation. []

In [19]:
from math import sqrt

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import sklearn
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.manifold import TSNE
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier
# import umap  # optional: requires the umap-learn package

## Data retrieval

In [20]:
# Loading data sets

# movies
# Column names are set via names=, so no later rename is needed
df_movies = pd.read_csv("ml-1m/movies.dat", sep='::', encoding='ISO-8859-1', header=None, engine='python', names=['MovieID', 'Title', 'Genres'])

#ratings
df_ratings = pd.read_csv("ml-1m/ratings.dat", sep='::', encoding='ISO-8859-1', header=None, engine='python', names=['UserID', 'MovieID', 'Rating', 'Timestamp'])

#users
df_users = pd.read_csv("ml-1m/users.dat", sep='::', encoding='ISO-8859-1', header=None, engine="python", names=['UserID', 'Gender', 'Age', 'Occupation', 'ZipCode'])



In [21]:
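# Note: a notebook cell displays only its last expression, so only the
# users result appears below; wrap each call in print() to see all three.

# Check for missing values in ratings dataset
df_ratings.isnull().sum()

# Check for missing values in movies dataset
df_movies.isnull().sum()

# Check for missing values in users dataset
df_users.isnull().sum()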



UserID        0
Gender        0
Age           0
Occupation    0
ZipCode       0
dtype: int64

In [22]:
# Checking data integrity

# movies
print(df_movies.head())

# ratings
print(df_ratings.head())

# users
print(df_users.head())

# MovieID 91 is missing from movies.dat, so we add a placeholder row for it
new_movie = pd.DataFrame({'MovieID': [91], 'Title': ['Unknown'], 'Genres': ['Unknown']})
df_movies = pd.concat([df_movies, new_movie], ignore_index=True)




   MovieID                               Title                        Genres
0        1                    Toy Story (1995)   Animation|Children's|Comedy
1        2                      Jumanji (1995)  Adventure|Children's|Fantasy
2        3             Grumpier Old Men (1995)                Comedy|Romance
3        4            Waiting to Exhale (1995)                  Comedy|Drama
4        5  Father of the Bride Part II (1995)                        Comedy
   UserID  MovieID  Rating  Timestamp
0       1     1193       5  978300760
1       1      661       3  978302109
2       1      914       3  978301968
3       1     3408       4  978300275
4       1     2355       5  978824291
   UserID Gender  Age  Occupation ZipCode
0       1      F    1          10   48067
1       2      M   56          16   70072
2       3      M   25          15   55117
3       4      M   45           7   02460
4       5      M   25          20   55455
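
The gap at MovieID 91 was found by inspection. A minimal sketch of how such gaps, and ratings that point at movies missing from the catalogue, could be detected programmatically is shown here. The toy frames and the helper name `find_integrity_issues` are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

# Toy stand-ins for df_movies and df_ratings (values are made up).
toy_movies = pd.DataFrame({"MovieID": [1, 2, 4], "Title": ["A", "B", "D"]})
toy_ratings = pd.DataFrame(
    {"UserID": [1, 1, 2], "MovieID": [1, 3, 4], "Rating": [5, 4, 3]}
)

def find_integrity_issues(movies, ratings):
    """Return (missing IDs inside 1..max, ratings for unknown movies)."""
    known_ids = set(movies["MovieID"])
    max_id = int(movies["MovieID"].max())
    # IDs skipped inside the 1..max range of the catalogue
    gaps = sorted(set(range(1, max_id + 1)) - known_ids)
    # Ratings whose MovieID has no row in the catalogue
    orphan_ratings = ratings[~ratings["MovieID"].isin(known_ids)]
    return gaps, orphan_ratings

gaps, orphans = find_integrity_issues(toy_movies, toy_ratings)
print("Missing MovieIDs:", gaps)
print(orphans)
```

On the real data the same call would take `df_movies` and `df_ratings`; whether a gap needs a placeholder depends on whether any rating actually references it.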


In [23]:
# User-item interaction
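
This cell is still empty. One possible way to build the user-item interaction matrix is a pivot of the ratings table, sketched below on toy data. The function name `build_interaction_matrix` is a hypothetical helper, and keeping unrated pairs as NaN (rather than 0) is an assumption that later algorithms may or may not share.

```python
import pandas as pd

# Toy ratings in the same column layout as ratings.dat (values are made up).
toy_ratings = pd.DataFrame({
    "UserID":  [1, 1, 2, 3, 3],
    "MovieID": [10, 20, 10, 20, 30],
    "Rating":  [5, 3, 4, 2, 1],
})

def build_interaction_matrix(ratings: pd.DataFrame) -> pd.DataFrame:
    """Return a users x movies matrix; unrated pairs stay NaN."""
    return ratings.pivot_table(index="UserID", columns="MovieID", values="Rating")

matrix = build_interaction_matrix(toy_ratings)
print(matrix)
```

A dense matrix of this kind is mostly NaN on MovieLens 1M, so a sparse representation may be preferable if memory becomes a concern.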

In [24]:
# Split users into 5 folds

num_folds = 5

# KFold object that shuffles the rows and splits them into num_folds folds
kf = KFold(n_splits=num_folds, shuffle=True, random_state=42)

# NumPy array of the user rows; KFold only uses it to generate positional indices
user_data = df_users.values

# Split the data into 5 folds
for fold, (train_indices, test_indices) in enumerate(kf.split(user_data)):
    # Split the users dataset
    train_users = df_users.iloc[train_indices]
    test_users = df_users.iloc[test_indices]

    # Split the ratings dataset
    train_ratings = df_ratings[df_ratings['UserID'].isin(train_users['UserID'])]
    test_ratings = df_ratings[df_ratings['UserID'].isin(test_users['UserID'])]

    print(f"Fold {fold + 1} - Train Users: {len(train_users)}, Test Users: {len(test_users)}")


Fold 1 - Train Users: 4832, Test Users: 1208
Fold 2 - Train Users: 4832, Test Users: 1208
Fold 3 - Train Users: 4832, Test Users: 1208
Fold 4 - Train Users: 4832, Test Users: 1208
Fold 5 - Train Users: 4832, Test Users: 1208


In [25]:
# Split movies into 5 folds
num_folds = 5

# KFold object that shuffles the rows and splits them into num_folds folds
kf = KFold(n_splits=num_folds, shuffle=True, random_state=42)

# NumPy array of the movie rows; KFold only uses it to generate positional indices
movie_data = df_movies.values

# Split the data into 5 folds
for fold, (train_indices, test_indices) in enumerate(kf.split(movie_data)):
    # Split the movies dataset
    train_movies = df_movies.iloc[train_indices]
    test_movies = df_movies.iloc[test_indices]

    print(f"Fold {fold + 1} - Train Movies: {len(train_movies)}, Test Movies: {len(test_movies)}")

Fold 1 - Train Movies: 3107, Test Movies: 777
Fold 2 - Train Movies: 3107, Test Movies: 777
Fold 3 - Train Movies: 3107, Test Movies: 777
Fold 4 - Train Movies: 3107, Test Movies: 777
Fold 5 - Train Movies: 3108, Test Movies: 776


In [26]:
# Split ratings into 5 folds
num_folds = 5

# KFold object that shuffles the rows and splits them into num_folds folds
kf = KFold(n_splits=num_folds, shuffle=True, random_state=42)

# NumPy array of the rating rows; KFold only uses it to generate positional indices
ratings_data = df_ratings.values

# Split the data into 5 folds
for fold, (train_indices, test_indices) in enumerate(kf.split(ratings_data)):
    # Split the ratings dataset
    train_ratings = df_ratings.iloc[train_indices]
    test_ratings = df_ratings.iloc[test_indices]

    print(f"Fold {fold + 1} - Train Ratings: {len(train_ratings)}, Test Ratings: {len(test_ratings)}")


Fold 1 - Train Ratings: 800167, Test Ratings: 200042
Fold 2 - Train Ratings: 800167, Test Ratings: 200042
Fold 3 - Train Ratings: 800167, Test Ratings: 200042
Fold 4 - Train Ratings: 800167, Test Ratings: 200042
Fold 5 - Train Ratings: 800168, Test Ratings: 200041
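
With a rating-level split, a test fold can contain movies or users that never appear in the corresponding training fold. A minimal sketch of one common fallback, assuming the global training mean is an acceptable default for unseen movies, is given below. The toy frames and the name `predict_item_mean` are illustrative only.

```python
import pandas as pd

# Toy train/test folds (values are made up); movie 30 is unseen in training.
train = pd.DataFrame({"UserID": [1, 1, 2], "MovieID": [10, 20, 10], "Rating": [4, 2, 5]})
test = pd.DataFrame({"UserID": [2, 3], "MovieID": [20, 30], "Rating": [3, 4]})

def predict_item_mean(train, test):
    """Predict each test rating by the movie's training mean, with a fallback."""
    global_mean = train["Rating"].mean()
    item_mean = train.groupby("MovieID")["Rating"].mean()
    # map() yields NaN for movies absent from the training fold (cold start)
    preds = test["MovieID"].map(item_mean)
    return preds.fillna(global_mean)

preds = predict_item_mean(train, test)
print(preds)
```

The same pattern applies to per-user means, and it matters even more for the user-level split above, where every test user is unseen during training by construction.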


# Recommendation algorithms

## Naive approach
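
The naive predictors can also be fitted on a training fold and scored on a held-out fold, which gives an out-of-sample RMSE and MAE. The sketch below is one way to do that under stated assumptions: it uses `numpy.linalg.lstsq` instead of `LinearRegression`, falls back to the global mean for unseen users or movies, and clips predictions to the 1 to 5 rating range. All function names and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def fit_naive_combination(train):
    """Fit Rating ~ alpha * R_item + beta * R_user + gamma on a training fold."""
    g = train["Rating"].mean()
    item_mean = train.groupby("MovieID")["Rating"].mean()
    user_mean = train.groupby("UserID")["Rating"].mean()
    X = np.column_stack([
        train["MovieID"].map(item_mean).to_numpy(),
        train["UserID"].map(user_mean).to_numpy(),
        np.ones(len(train)),  # bias column for gamma
    ])
    y = train["Rating"].to_numpy(dtype=float)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return g, item_mean, user_mean, coef

def predict_naive_combination(model, test):
    g, item_mean, user_mean, (alpha, beta, gamma) = model
    # Unseen movies or users fall back to the global training mean
    r_item = test["MovieID"].map(item_mean).fillna(g)
    r_user = test["UserID"].map(user_mean).fillna(g)
    return (alpha * r_item + beta * r_user + gamma).clip(1, 5).to_numpy()

def rmse_mae(y_true, y_pred):
    err = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(err ** 2))), float(np.mean(np.abs(err)))

# Toy folds (values are made up); user 4 is unseen in training.
train = pd.DataFrame({"UserID": [1, 1, 2, 2, 3, 3],
                      "MovieID": [10, 20, 10, 30, 20, 30],
                      "Rating": [5, 3, 4, 2, 4, 1]})
test = pd.DataFrame({"UserID": [1, 4], "MovieID": [30, 10], "Rating": [2, 4]})

model = fit_naive_combination(train)
preds = predict_naive_combination(model, test)
rmse, mae = rmse_mae(test["Rating"], preds)
print(f"RMSE: {rmse:.4f}, MAE: {mae:.4f}")
```

Dropping the column of ones from `X` would give the variant without bias mentioned in the task list.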

In [29]:
class recommenderSystem():


    def Naive_1(self, df_ratings):
        # Naive approach: global, per-item and per-user average ratings
        r_global = df_ratings['Rating'].mean()
        r_item = (df_ratings.groupby('MovieID')['Rating'].mean()
                  .reset_index().rename(columns={'Rating': 'R_item'}))
        r_user = (df_ratings.groupby('UserID')['Rating'].mean()
                  .reset_index().rename(columns={'Rating': 'R_user'}))
        df_ratings = df_ratings.merge(r_item, on='MovieID').merge(r_user, on='UserID')
        print(r_global)
        print(r_item.head())
        print(r_user.head())
        print(df_ratings.head())

        # Linear combination: Rating = alpha * R_item + beta * R_user + gamma
        # Note: fitted on all ratings, so the coefficients are in-sample estimates
        X = df_ratings[['R_item', 'R_user']]
        y = df_ratings['Rating']
        model = LinearRegression().fit(X, y)

        alpha, beta = model.coef_
        gamma = model.intercept_

        print(f'alpha: {alpha}, beta: {beta}, gamma: {gamma}')

        
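    def function_3(self, train):
        # Placeholder: the UV matrix decomposition
        raise NotImplementedError("UV matrix decomposition is not implemented yet")

    def function_4(self, train):
        # Placeholder: the Matrix Factorization
        raise NotImplementedError("Matrix factorization is not implemented yet")

    def function_5(self, train):
        # Placeholder
        raise NotImplementedError("function_5 is not implemented yet")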
        
    def visualisation_1(self, data):
        # Apply PCA: project data (n_samples x n_features) onto 2 components
        pca = PCA(n_components=2)
        return pca.fit_transform(data)

    def visualisation_2(self, data):
        # Apply t-SNE
        tsne = TSNE(n_components=2, verbose=1)
        return tsne.fit_transform(data)

    def visualisation_3(self, data):
        # Apply UMAP; imported lazily so the rest of the class works
        # without the optional umap-learn package
        import umap
        umap_model = umap.UMAP(n_components=2)
        return umap_model.fit_transform(data)

    def cross_validation(self, data, folds):
        # Prepare cross-validation; shuffle and random_state must be passed by keyword
        kfold = KFold(n_splits=folds, shuffle=True, random_state=1)
        train_list = []
        test_list = []
        # Enumerate splits; data is assumed to be a DataFrame, indexed by position
        for train, test in kfold.split(data):
            train_list.append(data.iloc[train])
            test_list.append(data.iloc[test])
        return train_list, test_list

    def perf_measures(self, y_true, y_pred):
        # Calculate RMSE (Root Mean Squared Error)
        rmse = sqrt(mean_squared_error(y_true, y_pred))
        print(f'RMSE: {rmse}')

        # Calculate MAE (Mean Absolute Error)
        mae = mean_absolute_error(y_true, y_pred)
        print(f'MAE: {mae}')
        return rmse, mae

    def main(self, data):
        train_list, test_list = self.cross_validation(data, 5)
        return train_list, test_list

if __name__ == '__main__':
    # Load ratings
    df_ratings = pd.read_csv('ml-1m/ratings.dat', sep='::', header=None, engine='python',
                             names=['UserID', 'MovieID', 'Rating', 'Timestamp'])
    print(df_ratings.head())

    # Load users
    df_users = pd.read_csv('ml-1m/users.dat', sep='::', header=None, engine='python',
                           names=['UserID', 'Gender', 'Age', 'Occupation', 'Zip-code'])
    print(df_users.head())

    # Load movies
    df_movies = pd.read_csv('ml-1m/movies.dat', sep='::', header=None, encoding='ISO-8859-1',
                            engine='python', names=['MovieID', 'Title', 'Genre'])
    print(df_movies.head())

    rec = recommenderSystem()
    rec.Naive_1(df_ratings)


   UserID  MovieID  Rating  Timestamp
0       1     1193       5  978300760
1       1      661       3  978302109
2       1      914       3  978301968
3       1     3408       4  978300275
4       1     2355       5  978824291
   UserID Gender  Age  Occupation Zip-code
0       1      F    1          10    48067
1       2      M   56          16    70072
2       3      M   25          15    55117
3       4      M   45           7    02460
4       5      M   25          20    55455
   MovieID                               Title                         Genre
0        1                    Toy Story (1995)   Animation|Children's|Comedy
1        2                      Jumanji (1995)  Adventure|Children's|Fantasy
2        3             Grumpier Old Men (1995)                Comedy|Romance
3        4            Waiting to Exhale (1995)                  Comedy|Drama
4        5  Father of the Bride Part II (1995)                        Comedy
3.581564453029317
   MovieID    R_item
0        1  4