# Data Preprocessing

In [1]:
import pandas as pd
import numpy as np
from tqdm.notebook import tqdm

import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
import pytorch_lightning as pl
np.random.seed(123)  # seed: sets the integer starting value used when generating random numbers


In [16]:
# dataset: https://grouplens.org/datasets/movielens/

ratings = pd.read_csv('movie_data_/rating.csv', parse_dates=['timestamp'], nrows= 1000000)

#movies = pd.read_csv('movie_data_/movie.csv',  nrows= 1000000)
#`timestamp` column that shows the date and time the review was submitted. 

In [17]:
ratings

| | userId | movieId | rating | timestamp |
|---|---|---|---|---|
| 0 | 1 | 2 | 3.5 | 2005-04-02 23:53:47 |
| 1 | 1 | 29 | 3.5 | 2005-04-02 23:31:16 |
| 2 | 1 | 32 | 3.5 | 2005-04-02 23:33:39 |
| 3 | 1 | 47 | 3.5 | 2005-04-02 23:32:07 |
| 4 | 1 | 50 | 3.5 | 2005-04-02 23:29:40 |
| ... | ... | ... | ... | ... |
| 999995 | 6743 | 1580 | 4.0 | 2005-06-03 01:05:57 |
| 999996 | 6743 | 1584 | 3.0 | 2005-06-03 01:09:03 |
| 999997 | 6743 | 1586 | 3.0 | 2005-06-03 01:23:51 |
| 999998 | 6743 | 1589 | 4.0 | 2005-06-03 01:26:30 |


In [18]:
#movies

In [19]:
rand_userIds = np.random.choice(ratings['userId'].unique(), size=int(len(ratings['userId'].unique())), replace=False)
ratings = ratings.loc[ratings['userId'].isin(rand_userIds)]

print('There are {} rows of data from {} users'.format(len(ratings), len(rand_userIds)))

There are 1000000 rows of data from 6743 users


In [20]:
# concat dataframes
#movies_with_ratings = ratings.merge(movies[["movieId","title"]],how="left",on="movieId")
#movies_with_ratings


In [21]:
ratings.sample(5)

| | userId | movieId | rating | timestamp |
|---|---|---|---|---|
| 712548 | 4746 | 5952 | 5.0 | 2012-02-12 20:29:10 |
| 291653 | 2002 | 3633 | 4.0 | 2000-12-08 01:32:26 |
| 652718 | 4347 | 48326 | 3.0 | 2007-01-19 13:39:02 |
| 652382 | 4347 | 5064 | 3.5 | 2003-07-28 23:29:53 |
| 547746 | 3673 | 1883 | 4.0 | 1998-07-26 20:30:15 |


After filtering the dataset, there are now 1,000,000 rows of data from 6,743 users.

Each row in the dataframe corresponds to a movie review made by a single user.

## Train-test split:

Along with the rating, there is also a `timestamp` column that shows the date and time the review was submitted. Using the `timestamp` column, we will implement our train-test split strategy using the leave-one-out methodology. For each user, the most recent review is used as the test set (i.e. leave one out), while the rest will be used as training data.


This train-test split strategy is often used when training and evaluating recommender systems. Doing a random split would not be fair, as we could potentially be using a user's recent reviews for training and earlier reviews for testing. This introduces data leakage with a look-ahead bias, and the performance of the trained model would not be generalizable to real-world performance.

### Split ratings dataset into a train and test set using the leave-one-out methodology: 

In [22]:
ratings['rank_latest'] = ratings.groupby(['userId'])['timestamp'] \
                                .rank(method='first', ascending=False)

train_ratings = ratings[ratings['rank_latest'] != 1]
test_ratings = ratings[ratings['rank_latest'] == 1]

# drop columns that we no longer need
train_ratings = train_ratings[['userId', 'movieId', 'rating']]
test_ratings = test_ratings[['userId', 'movieId', 'rating']]
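To see what the rank-based split does, here is a minimal sketch on a toy dataframe. The values and the names `toy`, `toy_train` and `toy_test` are made up for illustration and are not part of MovieLens; only the `groupby(...).rank(...)` pattern mirrors the cell above.

```python
import pandas as pd

# Toy interaction log (hypothetical values, not taken from MovieLens)
toy = pd.DataFrame({
    "userId": [1, 1, 1, 2, 2],
    "movieId": [10, 20, 30, 10, 40],
    "timestamp": pd.to_datetime([
        "2005-01-01", "2005-03-01", "2005-02-01",
        "2006-05-01", "2006-04-01",
    ]),
})

# Rank each user's reviews from newest (rank 1) to oldest
toy["rank_latest"] = (
    toy.groupby("userId")["timestamp"].rank(method="first", ascending=False)
)

# The newest review per user is held out; everything else is training data
toy_train = toy[toy["rank_latest"] != 1]
toy_test = toy[toy["rank_latest"] == 1]

print(toy_test[["userId", "movieId"]])
```

Each user contributes exactly one row to `toy_test`, so the test set has as many rows as there are users, and every training row for a user is older than that user's test row.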

### Converting the dataset into an implicit feedback dataset

As discussed earlier, we will train a recommender system using implicit feedback. However, the MovieLens dataset that we're using is based on explicit feedback. To convert this dataset into an implicit feedback dataset, we'll simply binarize the ratings such that they are all '1' (positive class). The value of '1' represents that the user has interacted with the item.





In [23]:
train_ratings.loc[:, 'rating'] = 1

train_ratings.sample(5)

| | userId | movieId | rating |
|---|---|---|---|
| 748244 | 4986 | 1883 | 1 |
| 448645 | 3051 | 4079 | 1 |
| 869087 | 5805 | 31522 | 1 |
| 988309 | 6656 | 4979 | 1 |
| 272108 | 1865 | 3527 | 1 |


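To train the model we also need negative samples, i.e. movies that the user has not interacted with. For every positive sample we randomly draw 4 movies that the user has not interacted with and label them '0', so the ratio of negative to positive samples is 4:1.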

In [24]:
# Get a list of all movie IDs
all_movieIds = ratings['movieId'].unique()

# Placeholders that will hold the training data
users, items, labels = [], [], []

# This is the set of items that each user has interaction with
user_item_set = set(zip(train_ratings['userId'], train_ratings['movieId']))

# 4:1 ratio of negative to positive samples
num_negatives = 4

for (u, i) in tqdm(user_item_set):
    users.append(u)
    items.append(i)
    labels.append(1) # items that the user has interacted with are positive
    for _ in range(num_negatives):
        # randomly select an item
        negative_item = np.random.choice(all_movieIds) 
        # check that the user has not interacted with this item
        while (u, negative_item) in user_item_set:
            negative_item = np.random.choice(all_movieIds)
        users.append(u)
        items.append(negative_item)
        labels.append(0) # items not interacted with are negative
        
        # Each iteration picks a random candidate negative_item and then checks whether
        # (u, negative_item) is in user_item_set, which holds the movies the user has
        # interacted with. If the pair is not in the set, the candidate is confirmed as
        # a negative item (thinking of it as candidate_negative_item makes this easier to follow).
        # The while loop keeps re-drawing until it finds an item the user has not interacted with.


  0%|          | 0/993257 [00:00<?, ?it/s]
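As a quick sanity check of the sampling logic above, the following sketch runs the same rejection-sampling idea on a toy set of interactions. The helper `sample_negatives`, the toy IDs and the use of a local `np.random.default_rng` generator are assumptions made for this illustration; the notebook itself uses the global `np.random.choice`.

```python
import numpy as np

rng = np.random.default_rng(123)  # local generator; the seed is arbitrary


def sample_negatives(user, positives, all_items, num_negatives, rng):
    """Draw `num_negatives` items that `user` has not interacted with."""
    negatives = []
    for _ in range(num_negatives):
        candidate = rng.choice(all_items)
        # Re-draw until the (user, item) pair is not a known interaction
        while (user, candidate) in positives:
            candidate = rng.choice(all_items)
        negatives.append(int(candidate))
    return negatives


# Hypothetical toy data: 6 movies and 3 observed (user, movie) interactions
toy_items = np.array([10, 20, 30, 40, 50, 60])
toy_positives = {(1, 10), (1, 20), (2, 30)}

users, items, labels = [], [], []
for u, i in sorted(toy_positives):
    users.append(u)
    items.append(i)
    labels.append(1)  # observed interaction
    for neg in sample_negatives(u, toy_positives, toy_items, 4, rng):
        users.append(u)
        items.append(neg)
        labels.append(0)  # sampled non-interaction

print(len(labels), sum(labels))
```

With 3 positives and 4 negatives each, the result has 15 rows, of which 3 are labelled 1. Negatives are drawn with replacement, so the same movie can appear more than once for a user, just as in the loop above.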

We now have the data in the format required by our model. Before we move on, let's define a PyTorch Dataset to facilitate training. The class below simply encapsulates the code we have written above into a PyTorch Dataset class.

In [25]:
class MovieLensTrainDataset(Dataset):
    """MovieLens PyTorch Dataset for Training
    
    Args:
        ratings (pd.DataFrame): Dataframe containing the movie ratings
        all_movieIds (list): List containing all movieIds
    
    """

    def __init__(self, ratings, all_movieIds):
        self.users, self.items, self.labels = self.get_dataset(ratings, all_movieIds)

    def __len__(self):
        return len(self.users)
  
    def __getitem__(self, idx):
        return self.users[idx], self.items[idx], self.labels[idx]

    def get_dataset(self, ratings, all_movieIds):
        users, items, labels = [], [], []
        user_item_set = set(zip(ratings['userId'], ratings['movieId']))

        num_negatives = 4
        for u, i in user_item_set:
            users.append(u)
            items.append(i)
            labels.append(1)
            for _ in range(num_negatives):
                negative_item = np.random.choice(all_movieIds)
                while (u, negative_item) in user_item_set:
                    negative_item = np.random.choice(all_movieIds)
                users.append(u)
                items.append(negative_item)
                labels.append(0)
        return torch.tensor(users), torch.tensor(items), torch.tensor(labels)

## Model - Neural Collaborative Filtering (NCF)

### User Embeddings



### Learned Embeddings


### Model Architecture




Let's define this NCF model using PyTorch Lightning:

In [26]:
class NCF(pl.LightningModule):
    """ Neural Collaborative Filtering (NCF)
    
        Args:
            num_users (int): Number of unique users
            num_items (int): Number of unique items
            ratings (pd.DataFrame): Dataframe containing the movie ratings for training
            all_movieIds (list): List containing all movieIds (train + test)
    """
    
    def __init__(self, num_users, num_items, ratings, all_movieIds):
        super().__init__()
        self.user_embedding = nn.Embedding(num_embeddings=num_users, embedding_dim=8)
        self.item_embedding = nn.Embedding(num_embeddings=num_items, embedding_dim=8)
        self.fc1 = nn.Linear(in_features=16, out_features=64)
        self.fc2 = nn.Linear(in_features=64, out_features=32)
        self.output = nn.Linear(in_features=32, out_features=1)
        self.ratings = ratings
        self.all_movieIds = all_movieIds

    def forward(self, user_input, item_input):
        
        # Pass through embedding layers
        user_embedded = self.user_embedding(user_input)
        item_embedded = self.item_embedding(item_input)

        # Concat the two embedding layers
        vector = torch.cat([user_embedded, item_embedded], dim=-1)

        # Pass through dense layer
        vector = nn.ReLU()(self.fc1(vector))
        vector = nn.ReLU()(self.fc2(vector))

        # Output layer
        pred = nn.Sigmoid()(self.output(vector))

        return pred
    
    def training_step(self, batch, batch_idx):
        user_input, item_input, labels = batch
        predicted_labels = self(user_input, item_input)
        loss = nn.BCELoss()(predicted_labels, labels.view(-1, 1).float())
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters())

    def train_dataloader(self):
        return DataLoader(MovieLensTrainDataset(self.ratings, self.all_movieIds),
                          batch_size=4)
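To make the tensor shapes in `forward` and the loss in `training_step` concrete, here is a NumPy sketch of the same computation. It is not the PyTorch model: the weights are random stand-ins, the toy sizes are invented, and the weight matrices are stored as (in, out), whereas `nn.Linear` stores them as (out, in).

```python
import numpy as np

rng = np.random.default_rng(0)
num_users, num_items, dim = 5, 7, 8  # toy sizes; dim=8 matches the class above

# Randomly initialised parameters standing in for the learned weights
user_emb = rng.normal(scale=0.1, size=(num_users, dim))
item_emb = rng.normal(scale=0.1, size=(num_items, dim))
W1, b1 = rng.normal(scale=0.1, size=(16, 64)), np.zeros(64)
W2, b2 = rng.normal(scale=0.1, size=(64, 32)), np.zeros(32)
W3, b3 = rng.normal(scale=0.1, size=(32, 1)), np.zeros(1)


def forward(user_ids, item_ids):
    # An embedding lookup is row indexing into the embedding table
    x = np.concatenate([user_emb[user_ids], item_emb[item_ids]], axis=-1)  # (B, 16)
    x = np.maximum(x @ W1 + b1, 0.0)  # fc1 + ReLU -> (B, 64)
    x = np.maximum(x @ W2 + b2, 0.0)  # fc2 + ReLU -> (B, 32)
    return 1.0 / (1.0 + np.exp(-(x @ W3 + b3)))  # sigmoid -> (B, 1)


def bce(pred, target):
    # Binary cross-entropy averaged over the batch
    eps = 1e-7
    pred = np.clip(pred, eps, 1 - eps)
    return float(-np.mean(target * np.log(pred) + (1 - target) * np.log(1 - pred)))


batch_users = np.array([0, 1, 4])
batch_items = np.array([2, 6, 3])
batch_labels = np.array([[1.0], [0.0], [0.0]])

pred = forward(batch_users, batch_items)
loss = bce(pred, batch_labels)
print(pred.shape, loss)
```

The concatenated vector has 8 + 8 = 16 features, which is why `fc1` is declared with `in_features=16`, and the labels are reshaped to a column so that they line up with the `(B, 1)` predictions.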

In [27]:
type(train_ratings)

pandas.core.frame.DataFrame

We instantiate the NCF model using the class that we have defined above.

In [28]:
num_users = ratings['userId'].max()+1
num_items = ratings['movieId'].max()+1

all_movieIds = ratings['movieId'].unique()

model = NCF(num_users, num_items, train_ratings, all_movieIds)

Train the NCF model for 2 epochs.

In [30]:

trainer = pl.Trainer(max_epochs=2, gpus=None, reload_dataloaders_every_epoch=True,
                     progress_bar_refresh_rate=50, logger=False, checkpoint_callback=False)

trainer.fit(model)

# 'reload_dataloaders_every_epoch=True' creates a new, randomly chosen set of negative samples for each epoch,
# which ensures that our model is not biased by the selection of negative samples.





GPU available: False, used: False
TPU available: False, using: 0 TPU cores

  | Name           | Type      | Params
---------------------------------------------
0 | user_embedding | Embedding | 54.0 K
1 | item_embedding | Embedding | 1.0 M 
2 | fc1            | Linear    | 1.1 K 
3 | fc2            | Linear    | 2.1 K 
4 | output         | Linear    | 33    
---------------------------------------------
1.1 M     Trainable params
0         Non-trainable params
1.1 M     Total params
4.409     Total estimated model params size (MB)


Training: 0it [00:00, ?it/s]

1
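The parameter counts in the summary table can be checked by hand. The sketch below assumes the largest `userId` is 6743, which is consistent with the 6743 users reported earlier but is not shown directly; the helper names are made up for this check.

```python
def embedding_params(num_embeddings, embedding_dim):
    # One vector of length `embedding_dim` per index
    return num_embeddings * embedding_dim


def linear_params(in_features, out_features):
    # Weight matrix plus one bias per output unit
    return in_features * out_features + out_features


num_users = 6743 + 1  # assumption: the largest userId is 6743

print(embedding_params(num_users, 8))  # 53952
print(linear_params(16, 64))           # 1088
print(linear_params(64, 32))           # 2080
print(linear_params(32, 1))            # 33
```

These match the 54.0 K, 1.1 K, 2.1 K and 33 shown in the table. The item embedding is the largest layer because its size is set by the largest `movieId` plus one, not by the number of distinct movies in the sample.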

# Evaluating our Recommender System

Now that our model is trained, we are ready to evaluate it using the test data. In traditional Machine Learning projects, we evaluate our models using metrics such as Accuracy (for classification problems) and RMSE (for regression problems). However, such metrics are too simplistic for evaluating recommender systems.

To design a good metric for evaluating recommender systems, we need to first understand how modern recommender systems are used. 


The key here is that we don't need the user to interact with *every* single item in the list of recommendations. Instead, we just need the user to interact with at least one item on the list. As long as the user does that, the recommendations have worked.

To simulate this, let's run the following evaluation protocol to generate a list of 10 recommended items for each user.

* For each user, randomly select 99 items that the user **has not interacted with**
* Combine these 99 items with the test item (the actual item that the user interacted with). We now have 100 items.
* Run the model on these 100 items, and rank them according to their predicted probabilities
* Select the top 10 items from the list of 100 items. If the test item is present within the top 10 items, then we say that this is a hit.
* Repeat the process for all users. The Hit Ratio is then the fraction of users for whom there was a hit.

This evaluation protocol is known as **Hit Ratio @ 10**, and it is commonly used to evaluate recommender systems. 
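The per-user hit check in this protocol can be sketched in a few lines. The function `hit_at_k` and the synthetic scores below are illustrative assumptions; they stand in for the model's predicted probabilities over the 100 candidate items.

```python
import numpy as np


def hit_at_k(scores, test_position, k=10):
    """Return 1 if the candidate at `test_position` is among the k highest scores."""
    top_k = np.argsort(scores)[::-1][:k]
    return int(test_position in top_k)


rng = np.random.default_rng(0)

# User A: 100 candidates, the test item (last position) gets the highest score
scores_a = rng.random(100)
scores_a[99] = 2.0

# User B: 100 candidates, the test item (last position) gets the lowest score
scores_b = rng.random(100)
scores_b[99] = -1.0

hits = [hit_at_k(scores_a, 99), hit_at_k(scores_b, 99)]
hit_ratio = float(np.mean(hits))
print(hit_ratio)  # 0.5
```

Averaging the 0/1 hits over users gives the Hit Ratio. A ranker that ordered the 100 candidates at random would place the test item in the top 10 about 10% of the time, which is a useful baseline when reading the score.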

### Hit Ratio @ 10 

Now, let's evaluate our model using the described protocol.

In [31]:
# User-item pairs for testing
test_user_item_set = set(zip(test_ratings['userId'], test_ratings['movieId']))

# Dict of all items that are interacted with by each user
user_interacted_items = ratings.groupby('userId')['movieId'].apply(list).to_dict()

hits = []
for (u,i) in tqdm(test_user_item_set):
    interacted_items = user_interacted_items[u]
    not_interacted_items = set(all_movieIds) - set(interacted_items)
    selected_not_interacted = list(np.random.choice(list(not_interacted_items), 99, replace=False))
    test_items = selected_not_interacted + [i]
    
    predicted_labels = np.squeeze(model(torch.tensor([u]*100), 
      torch.tensor(test_items)).detach().numpy())
    
    top10_items = [test_items[idx] for idx in np.argsort(predicted_labels)[::-1][0:10].tolist()]
    
    if i in top10_items:
        hits.append(1)
    else:
        hits.append(0)


# Top-10 recommended movie IDs for the last user processed in the loop above
df = pd.DataFrame(top10_items)
#df.merge(movies[["title"]],how="left", on="userId")
print(df)
        
print("The Hit Ratio @ 10 is {:.2f}".format(np.average(hits)))

# The model gives 10 predictions. These are expected to be the movies closest to what the user has previously watched and rated.

  0%|          | 0/6743 [00:00<?, ?it/s]

       0
0   2291
1   1485
2  48516
3   2470
4   1244
5   5013
6   5388
7   2662
8   3699
9   2318
The Hit Ratio @ 10 is 0.71


That is a pretty good Hit Ratio @ 10 score!

To put this into context, what this means is that 71% of the users were recommended the actual item (among a list of 10 items) that they eventually interacted with.