<a href="https://colab.research.google.com/github/tobias-hoepfl/Digital-Organizations-SE/blob/main/assignment/assignment_7_Hoepfl.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This assignment again consists of a theoretical part (learning portfolio) and a practical part (assignment). The goal is to train a neural model for a recommendation system.

The plan is to discuss what you learned from the theory part in the first week, so you are relatively free in how you fill your Learning Portfolio on this topic. In the following week we will discuss your solutions to the practical part.

# Theory part (filling your Learning Portfolio, June 7)

In preparation for the practical part, please familiarize yourself with the following video sources over the next week:

1) Please watch the following videos:

https://www.youtube.com/watch?v=Fmtorg_dmM0&ab_channel=ritvikmath (not absolutely necessary, only for the overview)

https://course.fast.ai/Lessons/lesson7.html (The second part of the presentation starting with the topic collaborative filtering is mandatory)

Note: The first part of the video mainly contains tips on neural networks for submitting to a Kaggle competition. To follow it fully you would have to watch the end of the sixth video, but this is not mandatory.

2) Please download the following notebook and edit it in Google Colab. Try to answer a few of the questions asked at the end. Take notes and update your Learning Portfolio.

https://www.kaggle.com/code/jhoward/collaborative-filtering-deep-dive/notebook


# Practical part (Assignment, June 14)

Find any data set that can be used for a recommender system and try to train and validate a neural network for it.

For this purpose, please download a data set from one of the lists below and use it in your program.

https://gist.github.com/entaroadun/1653794

https://github.com/caserec/Datasets-for-Recommender-Systems

https://grouplens.org/datasets/movielens/

https://eigentaste.berkeley.edu/dataset/

**Dataset chosen:**

http://www2.informatik.uni-freiburg.de/~cziegler/BX/


**Credits:**

Book-Crossing Dataset ... mined by Cai-Nicolas Ziegler, DBIS Freiburg
	
Collected by Cai-Nicolas Ziegler in a 4-week crawl (August / September 2004) from the Book-Crossing community with kind permission from Ron Hornbaker, CTO of Humankind Systems. Contains 278,858 users (anonymized but with demographic information) providing 1,149,780 ratings (explicit / implicit) about 271,379 books.

	
[ ! ] Freely available for research use when acknowledged with the following reference (further details on the dataset are given in this publication):

Improving Recommendation Lists Through Topic Diversification,
Cai-Nicolas Ziegler, Sean M. McNee, Joseph A. Konstan, Georg Lausen; Proceedings of the 14th International World Wide Web Conference (WWW '05), May 10-14, 2005, Chiba, Japan. To appear.

In [None]:
# Mount Google Drive to access the dataset
from google.colab import drive
import pandas as pd

drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# The original file first had to be re-saved with a different encoding, because reading it directly caused problems
df = pd.read_csv("/content/drive/MyDrive/DigitalOrganizations/BX-Book-Ratings.csv", sep=';')

In [None]:
df.head()
# The main rating table holds the User-ID, the ISBN and the rating

Unnamed: 0,User-ID,ISBN,Book-Rating
0,276725,034545104X,0
1,276726,0155061224,5
2,276727,0446520802,0
3,276729,052165615X,3
4,276729,0521795028,6


In [None]:
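# Drop implicit ratings: in Book-Crossing a rating of 0 marks an implicit rating; explicit ratings run from 1 to 10
df = df[df['Book-Rating'] != 0]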

In [None]:
df.head()

Unnamed: 0,User-ID,ISBN,Book-Rating
1,276726,0155061224,5
3,276729,052165615X,3
4,276729,0521795028,6
6,276736,3257224281,8
7,276737,0600570967,6


In [None]:
import torch
from fastai.collab import *
from fastai.tabular.all import *

In [None]:
class DotProduct(Module):
    #number of users, books and factors used to initialize the embeddings
    def __init__(self, n_users, n_books, n_factors):
        self.user_factors = Embedding(n_users, n_factors)
        self.book_factors = Embedding(n_books, n_factors)
        
    def forward(self, x):
        users = self.user_factors(x[:,0])
        books = self.book_factors(x[:,1])
        return (users * books).sum(dim=1)
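
As a plain-Python illustration of what `forward` computes for one row, the sketch below multiplies a user vector and a book vector element-wise and sums the result. The three-factor vectors and the function name `dot_score` are made up for illustration; they are not learned embeddings.

```python
def dot_score(user_vec, book_vec):
    """Element-wise product summed over the factor dimension."""
    return sum(u * b for u, b in zip(user_vec, book_vec))

# Hypothetical 3-factor embeddings (the notebook itself uses 50 factors).
user = [0.5, -1.0, 2.0]
similar_book = [1.0, -0.5, 1.5]     # points in roughly the same direction as the user
opposite_book = [-1.0, 1.0, -0.5]   # points in roughly the opposite direction

print(dot_score(user, similar_book))   # 0.5 + 0.5 + 3.0
print(dot_score(user, opposite_book))  # -0.5 - 1.0 - 1.0
```

A large positive score means the user's preferences and the book's characteristics line up along the same latent factors.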

In [None]:
# Larger batch size than in the video example, because this dataset contains considerably more data
dls = CollabDataLoaders.from_df(df, item_name='ISBN', user_name='User-ID', bs=256)
dls.show_batch()

Unnamed: 0,User-ID,ISBN,Book-Rating
0,210587,60519134,6
1,247240,440223202,9
2,260944,451161343,10
3,195790,3404125010,8
4,203280,743406184,6
5,114446,99271478,7
6,236223,312868855,4
7,23547,451405501,8
8,225002,60502258,5
9,56959,911266135,10


In [None]:
n_users = len(dls.classes['User-ID'])
n_books = len(dls.classes['ISBN'])
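
An embedding layer is indexed by position, so the raw user IDs and ISBNs first have to be mapped to contiguous integers; `dls.classes` holds that mapping. A minimal sketch of the idea follows, assuming that index 0 is reserved for a `#na#` placeholder, as fastai's categorical preprocessing appears to do. The helper `build_vocab` is hypothetical and not part of fastai.

```python
def build_vocab(values, na_token="#na#"):
    """Map each distinct value to a contiguous index, reserving 0 for na_token."""
    classes = [na_token] + sorted(set(values))
    index_of = {value: i for i, value in enumerate(classes)}
    return classes, index_of

# User IDs taken from the df.head() output after filtering.
user_ids = [276726, 276729, 276729, 276736, 276737]
classes, index_of = build_vocab(user_ids)

# Number of embedding rows: distinct users plus one placeholder row.
n_user_rows = len(classes)
print(classes)
print(n_user_rows, index_of[276729])
```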

In [None]:
model = DotProduct(n_users, n_books, 50)
learn = Learner(dls, model, loss_func=MSELossFlat())

In [None]:
learn.fit_one_cycle(5, 5e-3)
# By the fifth epoch the training loss is far below the validation loss, i.e. the model is overfitting
# A higher learning rate does not help,
# e.g. learn.fit_one_cycle(5, 5e-2)

# The validation loss (MSE of about 35) is still very poor for ratings on a 1-10 scale

epoch,train_loss,valid_loss,time
0,58.292271,57.467304,00:15
1,39.41753,41.757679,00:15
2,24.931091,36.746586,00:15
3,17.693079,35.355202,00:15
4,14.69533,35.174816,00:15


In [None]:
# Now we extend the model with sigmoid_range so that predictions stay within the rating range
class DotProduct(Module):
    def __init__(self, n_users, n_books, n_factors, y_range=(1,10.5)):
        self.user_factors = Embedding(n_users, n_factors)
        self.book_factors = Embedding(n_books, n_factors)
        self.y_range = y_range
        
    def forward(self, x):
        users = self.user_factors(x[:,0])
        books = self.book_factors(x[:,1])
        return sigmoid_range((users * books).sum(dim=1), *self.y_range)

In [None]:
model = DotProduct(n_users, n_books, 50)
learn = Learner(dls, model, loss_func=MSELossFlat())
learn.fit_one_cycle(5, 5e-3)
# The result is already much better: the validation loss drops from about 35 to about 5.1

epoch,train_loss,valid_loss,time
0,6.372075,6.332678,00:16
1,4.124954,5.234595,00:15
2,1.825995,5.145153,00:16
3,0.869565,5.128331,00:15
4,0.532589,5.131505,00:16
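
`sigmoid_range` squashes the raw dot product into the interval given by `y_range`. A plain-Python sketch of the same formula, `sigmoid(x) * (high - low) + low`, shows how unbounded scores map onto the rating scale. The name `sigmoid_range_py` is chosen here only to avoid clashing with the fastai function.

```python
import math

def sigmoid_range_py(x, low, high):
    """Scale a sigmoid so that the output lies strictly between low and high."""
    return 1.0 / (1.0 + math.exp(-x)) * (high - low) + low

# Raw scores far below zero approach the lower bound, far above zero the upper bound.
for raw in (-10.0, 0.0, 4.0, 10.0):
    print(raw, round(sigmoid_range_py(raw, 1, 10.5), 3))
```

Since the sigmoid only approaches its limits asymptotically, the upper bound is set slightly above the highest rating of 10 so that the model can actually predict it. The same argument applies to the lower bound, which is why a value below 1, such as 0.5, is a reasonable choice there.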


In [None]:
# Now we additionally include a bias that accounts for the effect that some users give higher ratings in general
# and some books are rated higher in general
class DotProductBias(Module):
    def __init__(self, n_users, n_books, n_factors, y_range=(0.5,10.5)):
        self.user_factors = Embedding(n_users, n_factors)
        self.user_bias = Embedding(n_users, 1)
        self.book_factors = Embedding(n_books, n_factors)
        self.book_bias = Embedding(n_books, 1)
        self.y_range = y_range
        
    def forward(self, x):
        users = self.user_factors(x[:,0])
        books = self.book_factors(x[:,1])
        res = (users * books).sum(dim=1, keepdim=True)
        res += self.user_bias(x[:,0]) + self.book_bias(x[:,1])
        return sigmoid_range(res, *self.y_range)

In [None]:
model = DotProductBias(n_users, n_books, 50)
learn = Learner(dls, model, loss_func=MSELossFlat())
learn.fit_one_cycle(5, 5e-3)
# Only now do the results reach a reasonable range: the validation MSE of about 3.5 corresponds to an RMSE of about 1.87, an acceptable error on a 1-10 scale

epoch,train_loss,valid_loss,time
0,4.768305,4.68262,00:16
1,2.875582,3.566159,00:16
2,1.276504,3.515994,00:18
3,0.600183,3.492131,00:16
4,0.327826,3.495471,00:19
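
The bias terms add a per-user and a per-book offset to the dot product before the sigmoid is applied. The following sketch traces that computation for a single user-book pair in plain Python; the function name `predict` and all numbers are invented for illustration.

```python
import math

def predict(user_vec, book_vec, user_bias, book_bias, y_range=(0.5, 10.5)):
    """Dot product plus both biases, squashed into y_range by a scaled sigmoid."""
    low, high = y_range
    raw = sum(u * b for u, b in zip(user_vec, book_vec)) + user_bias + book_bias
    return 1.0 / (1.0 + math.exp(-raw)) * (high - low) + low

# Hypothetical 2-factor embeddings (the notebook itself uses 50 factors).
user_vec, book_vec = [0.2, -0.4], [0.5, 0.5]

neutral = predict(user_vec, book_vec, user_bias=0.0, book_bias=0.0)
popular = predict(user_vec, book_vec, user_bias=0.0, book_bias=1.5)
print(round(neutral, 2), round(popular, 2))
```

With identical embeddings, a positive book bias alone raises the predicted rating, which is how the model can express that a book is generally well liked regardless of individual taste.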


In [None]:
# Now we add weight decay, which penalizes large weights to prevent overfitting
model = DotProductBias(n_users, n_books, 50)
learn = Learner(dls, model, loss_func=MSELossFlat())
learn.fit_one_cycle(5, 5e-3, wd=0.1)
# Training fits more slowly; judging by the validation loss (about 3.80 here vs. about 3.50 without), no or a lower weight decay also seems suitable in this case

epoch,train_loss,valid_loss,time
0,4.959069,4.827568,00:16
1,3.456735,3.818312,00:15
2,2.308975,3.783504,00:15
3,1.588842,3.788006,00:16
4,1.144086,3.797323,00:20
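
Conceptually, weight decay adds the sum of squared weights, scaled by `wd`, to the loss, so every gradient step also shrinks each weight towards zero. The sketch below shows that effect for one plain SGD step. It illustrates the idea only and is not the optimizer that fastai runs inside `fit_one_cycle`; the function name `sgd_step` and all numbers are assumptions.

```python
def sgd_step(weights, grads, lr, wd=0.0):
    """One SGD update on loss + wd * sum(w**2); the penalty adds 2 * wd * w to each gradient."""
    return [w - lr * (g + 2 * wd * w) for w, g in zip(weights, grads)]

weights = [3.0, -2.0, 0.5]
grads = [0.0, 0.0, 0.0]  # a zero data gradient isolates the effect of the penalty

print(sgd_step(weights, grads, lr=0.1))          # unchanged without weight decay
print(sgd_step(weights, grads, lr=0.1, wd=0.1))  # every weight moves towards zero
```

The shrinkage is proportional to the weight itself, so large embedding values are pulled back hardest, which slows down how quickly the training loss can fall.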
