# Introduction to the project

This notebook compares the performance of a recommender system based on non-negative matrix factorization (NMF) with the baseline from the week 3 assignment.

## Transferring data from GitHub to Colab

In [1]:
!rm -rf /content/bbc_news_dtsa_5510
!rm -rf /content/tmp_data
!git clone https://github.com/rat-sparebank1/bbc_news_dtsa_5510.git

Cloning into 'bbc_news_dtsa_5510'...
remote: Enumerating objects: 48, done.
remote: Counting objects: 100% (48/48), done.
remote: Compressing objects: 100% (45/45), done.
remote: Total 48 (delta 9), reused 0 (delta 0), pack-reused 0
Receiving objects: 100% (48/48), 7.57 MiB | 12.03 MiB/s, done.
Resolving deltas: 100% (9/9), done.


In [2]:
!unzip -q bbc_news_dtsa_5510/data/hw3_dtsa_week3.zip -d /content/tmp_data

## Importing data

Using the same code as in the week 3 assignment.

In [3]:
import pandas as pd

In [4]:
MV_users = pd.read_csv('tmp_data/users.csv')
MV_movies = pd.read_csv('tmp_data/movies.csv')
train = pd.read_csv('tmp_data/train.csv')
test = pd.read_csv('tmp_data/test.csv')

In [5]:
print(MV_users)
print(MV_movies)
print(train)
print(test)

       uID gender  age  accupation    zip
0        1      F    1          10  48067
1        2      M   56          16  70072
2        3      M   25          15  55117
3        4      M   45           7  02460
4        5      M   25          20  55455
...    ...    ...  ...         ...    ...
6035  6036      F   25          15  32603
6036  6037      F   45           1  76006
6037  6038      F   56           1  14706
6038  6039      F   45           0  01060
6039  6040      M   25           6  11106

[6040 rows x 5 columns]
       mID                        title  year  Doc  Com  Hor  Adv  Wes  Dra  \
0        1                    Toy Story  1995    0    1    0    0    0    0   
1        2                      Jumanji  1995    0    0    0    1    0    0   
2        3             Grumpier Old Men  1995    0    1    0    0    0    0   
3        4            Waiting to Exhale  1995    0    1    0    0    0    1   
4        5  Father of the Bride Part II  1995    0    1    0    0    0    0 

## Construct a rating matrix

In [6]:
# Using elements from class RecSys()
from collections import namedtuple
from scipy.sparse import coo_matrix, csr_matrix
import numpy as np
Data = namedtuple('Data', ['users','movies','train','test'])
data = Data(MV_users, MV_movies, train, test)
mid2idx = dict(zip(data.movies.mID,list(range(len(data.movies)))))
uid2idx = dict(zip(data.users.uID,list(range(len(data.users)))))

def rating_matrix(data):
  ind_movie = [mid2idx[x] for x in data.train.mID]
  ind_user = [uid2idx[x] for x in data.train.uID]
  rating_train = list(data.train.rating)
  allusers = list(data.users['uID'])
  allmovies = list(data.movies['mID'])

  return np.array(coo_matrix((rating_train, (ind_user, ind_movie)), shape=(len(allusers), len(allmovies))).toarray())

r_matrix = rating_matrix(data)
print(r_matrix)

[[5 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [3 0 0 ... 0 0 0]]
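
To make the construction concrete, here is a minimal sketch on a made-up dataset of three users and four movies. The IDs and ratings are invented for illustration and do not come from the assignment files. It shows how `coo_matrix` places each rating at (user index, movie index) and leaves every unrated pair at 0.

```python
import numpy as np
from scipy.sparse import coo_matrix

# Hypothetical toy data (not from the assignment files).
user_ids = [10, 20, 30]
movie_ids = [100, 200, 300, 400]
ratings = [(10, 100, 5), (10, 300, 3), (20, 200, 4), (30, 400, 1)]  # (uID, mID, rating)

# Map raw IDs to contiguous row/column indices, as uid2idx/mid2idx do above.
toy_uid2idx = {uid: i for i, uid in enumerate(user_ids)}
toy_mid2idx = {mid: j for j, mid in enumerate(movie_ids)}

rows = [toy_uid2idx[u] for u, _, _ in ratings]
cols = [toy_mid2idx[m] for _, m, _ in ratings]
vals = [r for _, _, r in ratings]

# Build the sparse matrix and convert it to a dense array.
toy_matrix = coo_matrix(
    (vals, (rows, cols)), shape=(len(user_ids), len(movie_ids))
).toarray()
print(toy_matrix)

# Fraction of entries that hold an observed rating.
density = np.count_nonzero(toy_matrix) / toy_matrix.size
print(density)
```

Only 4 of the 12 entries are observed here. In the dense array the zeros are indistinguishable from "rated 0", which matters for any model that is fitted to the whole matrix.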


## Using the genre features in the MV_movies dataset for matrix factorization

In [7]:
column_list = (MV_movies.columns).to_list()
columns_to_remove = ['mID', 'title', 'year']
for column in columns_to_remove:
  column_list.remove(column)
print("movie categories: ", column_list)
n_components = len(column_list)
print("number of categories: ", n_components)

movie categories:  ['Doc', 'Com', 'Hor', 'Adv', 'Wes', 'Dra', 'Ani', 'War', 'Chi', 'Cri', 'Thr', 'Sci', 'Mys', 'Rom', 'Fil', 'Fan', 'Act', 'Mus']
number of categories:  18
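
The same selection can be written without the removal loop. The sketch below uses a hypothetical miniature of `MV_movies` with only three genre columns and assumes, as above, that every column other than `mID`, `title` and `year` is a genre flag.

```python
import pandas as pd

# Hypothetical miniature of MV_movies with only three genre columns.
toy_movies = pd.DataFrame({
    "mID": [1, 2],
    "title": ["Toy Story", "Jumanji"],
    "year": [1995, 1995],
    "Doc": [0, 0],
    "Com": [1, 0],
    "Adv": [0, 1],
})

# Everything that is not an identifier column is treated as a genre flag.
genre_columns = toy_movies.columns.drop(["mID", "title", "year"]).to_list()
print(genre_columns)       # ['Doc', 'Com', 'Adv']
print(len(genre_columns))  # 3
```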


## Use scikit-learn to do NMF

In [11]:
from sklearn.decomposition import NMF

model = NMF(n_components=n_components, init='random', max_iter=400)
W = model.fit_transform(r_matrix)
H = model.components_



In [12]:
print("W matrix shape: ", W.shape)
print("H matrix shape: ", H.shape)
print("Rating matrix shape: ", r_matrix.shape)

W matrix shape:  (6040, 18)
H matrix shape:  (18, 3883)
Rating matrix shape:  (6040, 3883)
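
The shapes are easier to see on a small matrix. The following is a minimal sketch on an invented 4 x 3 matrix with 2 components; the values, the variable names and the fixed `random_state` are assumptions made for illustration only.

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical 4 x 3 non-negative matrix (rows: users, columns: movies).
toy_r = np.array([
    [5.0, 4.0, 0.0],
    [4.0, 5.0, 1.0],
    [0.0, 1.0, 5.0],
    [1.0, 0.0, 4.0],
])

# random_state is fixed only so that the sketch is repeatable.
toy_model = NMF(n_components=2, init="random", random_state=0, max_iter=1000)
toy_w = toy_model.fit_transform(toy_r)  # shape (4, 2): one row of factors per user
toy_h = toy_model.components_           # shape (2, 3): one column of factors per movie

# The product of the two factors has the same shape as the input matrix.
toy_reconstruction = toy_w @ toy_h
print(toy_w.shape, toy_h.shape, toy_reconstruction.shape)
print(np.round(toy_reconstruction, 2))
```

With its default loss, scikit-learn's `NMF` fits every entry of the input matrix, so the zeros count toward the fit just like the non-zero values.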


In [13]:
# Using the reconstructed matrix to make predictions on the test data

def predict_from_W_H(uid, mid):
  user_index = uid2idx[uid]
  movie_index = mid2idx[mid]
  return np.dot(W[user_index, :], H[:, movie_index])

y_pred_list = []
for user, movie in zip(data.test.uID, data.test.mID):
  y_pred_list.append(predict_from_W_H(user, movie))
y_pred = np.array(y_pred_list)
y_true = (data.test.rating).to_numpy()

rmse = np.sqrt(((y_true - y_pred)**2).mean())
print(rmse)

2.861111446705944
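
The per-pair loop can also be expressed with NumPy indexing. This is a sketch on invented factor matrices, assuming the user and movie IDs have already been converted to row and column indices; it is not the code that produced the RMSE above.

```python
import numpy as np

# Hypothetical factors: 3 users x 2 components, 2 components x 4 movies.
toy_w = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 2.0]])
toy_h = np.array([[4.0, 2.0, 0.0, 1.0], [1.0, 3.0, 2.0, 0.0]])

# Test pairs given as row/column indices (after the uid2idx/mid2idx lookup).
user_idx = np.array([0, 1, 2])
movie_idx = np.array([0, 1, 2])
toy_true = np.array([5.0, 3.0, 4.0])

# Row-wise dot product: pick the user rows and movie columns, multiply, sum.
toy_pred = np.sum(toy_w[user_idx] * toy_h[:, movie_idx].T, axis=1)

# Root mean squared error between true and predicted ratings.
toy_rmse = np.sqrt(np.mean((toy_true - toy_pred) ** 2))
print(toy_pred)  # [4.  2.5 4. ]
print(toy_rmse)
```

This avoids one Python-level function call per test row, which can matter when the test set is large.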


## Some Conclusions:

This was my approach:
1. Create a rating matrix from the training data. The rating matrix has lots of zeros, since not all users have rated all movies.
2. Extract the genre information from MV_movies and count the genres: there are 18 in the dataset.
3. Use this number as the number of components for the matrix factorization, which gives two matrices:


> **number of users x 18**, and **18 x number of movies**


4. Make predictions as the dot product between the user's row in the W matrix and the movie's column in the H matrix.
5. Use these predictions to estimate the ratings in the test set.

The RMSE was worse than that of the simplest baseline: 2.86 vs 1.29.
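
In symbols, the approach amounts to the following. The notation is introduced here for illustration: $m$ and $n$ are the numbers of users and movies, $k = 18$ is the number of components, $r_{ui}$ is the true rating of movie $i$ by user $u$, and $T$ is the set of (user, movie) pairs in the test set.

```latex
R \approx W H, \qquad
W \in \mathbb{R}_{\ge 0}^{m \times k}, \quad
H \in \mathbb{R}_{\ge 0}^{k \times n}, \quad k = 18

\hat{r}_{ui} = \sum_{j=1}^{k} W_{uj} H_{ji}

\mathrm{RMSE} = \sqrt{\frac{1}{|T|} \sum_{(u,i) \in T} \left( r_{ui} - \hat{r}_{ui} \right)^{2}}
```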

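This may be expected. For the approach to work, fitting the model would have to discover the dependencies between the genres, the movies and the users; in effect, the model assumes that users rate movies similarly mostly or solely depending on genre. This is not necessarily true, since other factors also affect ratings. Another likely contributor is that unrated movies are stored as zeros in the rating matrix, and NMF fits those zeros as if they were actual ratings, which pulls the predictions toward zero.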

One way to reduce the RMSE without changing the approach of using the genres to choose the number of features for the factorization is to incorporate the genre information in MV_movies directly: both by including the actual genre flags per movie and by computing, for example, the average rating per user per genre.
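
As a rough illustration of the last idea, the sketch below computes the average rating per user per genre with pandas. The two small tables are hypothetical stand-ins for `MV_movies` and `train`, reduced to two genre columns; how such averages would then be fed into a model is left open.

```python
import pandas as pd

# Hypothetical miniature versions of MV_movies and train.
toy_movies = pd.DataFrame({
    "mID": [1, 2, 3],
    "Com": [1, 0, 1],
    "Adv": [0, 1, 1],
})
toy_train = pd.DataFrame({
    "uID": [1, 1, 2, 2],
    "mID": [1, 2, 2, 3],
    "rating": [5, 3, 4, 2],
})
genre_columns = ["Com", "Adv"]

# Attach the genre flags to every rating, then go to one row per (rating, genre).
merged = toy_train.merge(toy_movies, on="mID")
long_form = merged.melt(
    id_vars=["uID", "mID", "rating"],
    value_vars=genre_columns,
    var_name="genre",
    value_name="has_genre",
)

# Keep only the genres a movie belongs to, then average per user and genre.
user_genre_mean = (
    long_form[long_form["has_genre"] == 1]
    .groupby(["uID", "genre"])["rating"]
    .mean()
    .unstack()  # rows: users, columns: genres; NaN if a user rated no movie of a genre
)
print(user_genre_mean)
```

In this toy data user 2 rated two adventure movies (4 and 2), so the adventure average for that user is 3.0.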