# Importing the required libraries

PyTorch is the main library of this project: it provides the tools needed to build, train, and use neural networks. PyTorch Lightning organizes and structures PyTorch code so that deep learning models are faster to write and train. The other libraries are standard ones for data processing (Pandas) and scientific computing (NumPy).

In [13]:
import random
import argparse

import numpy as np
import pandas as pd

import pytorch_lightning as pl

import torch
from pytorch_lightning.callbacks import ModelCheckpoint
from pytorch_lightning.loggers import TensorBoardLogger
from torch.utils.data import DataLoader
import torch.nn as nn
from torch.nn import Linear
from torch.nn import functional as F

from data_processing import get_context, pad_list, map_column, MASK, PAD

# Building the model

The model we implement is called BERT4Rec and is based on BERT, a model from NLP. It is a Transformer neural network trained to predict "masked" movies in a user's history. The original paper: https://arxiv.org/pdf/1904.06690.pdf

The first step is to build each user's history as a chronological list of movies. Some of these movies are replaced with a [MASK] token, and BERT4Rec is then trained to predict the correct values of the [MASK] items. In doing so, the model learns a useful representation of each movie as well as the important patterns that exist between movies. At inference time, we can simply append a [MASK] to the end of a user's sequence to predict the movie they are most likely to want to watch next.
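
As a rough illustration of the masking scheme described above, the sketch below masks a toy history at random for training and appends a single mask for inference. The `MASK_TOKEN` value, the movie IDs, and the helper names are hypothetical and chosen only for illustration; the notebook's own helpers appear further down.

```python
import random

MASK_TOKEN = 1  # hypothetical ID reserved for the [MASK] token


def mask_for_training(history, mask_prob=0.2, seed=None):
    """Replace each item with MASK_TOKEN with probability mask_prob."""
    rng = random.Random(seed)
    return [MASK_TOKEN if rng.random() < mask_prob else item for item in history]


def mask_for_inference(history):
    """Append one MASK_TOKEN; the model's output at that slot is the next-item guess."""
    return list(history) + [MASK_TOKEN]


history = [12, 47, 5, 88, 23]  # toy movie IDs in chronological order
masked = mask_for_training(history, mask_prob=0.2, seed=0)
targets = list(history)  # the model is trained to recover the original items
inference_input = mask_for_inference(history)
```

During training, only the positions that hold `MASK_TOKEN` in `masked` contribute to the loss; the unmasked items serve as context.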

In [None]:
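def masked_accuracy(y_pred: torch.Tensor, y_true: torch.Tensor, mask: torch.Tensor):
    """Accuracy computed only over the positions where `mask` is True."""

    # Index of the highest-scoring item at each position
    _, predicted = torch.max(y_pred, 1)

    # Keep only the masked positions
    y_true = torch.masked_select(y_true, mask)
    predicted = torch.masked_select(predicted, mask)

    acc = (y_true == predicted).double().mean()

    return acc


def masked_ce(y_pred, y_true, mask):
    """Cross-entropy averaged only over the positions where `mask` is True."""

    loss = F.cross_entropy(y_pred, y_true, reduction="none")

    # Zero out the loss at unmasked positions
    loss = loss * mask

    # The small epsilon avoids a division by zero when nothing is masked
    return loss.sum() / (mask.sum() + 1e-8)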


class Recommender(pl.LightningModule):
    def __init__(
        self,
        vocab_size,
        channels=128,
        cap=0,
        mask=1,
        dropout=0.4,
        lr=1e-4,
    ):
        super().__init__()

        self.cap = cap
        self.mask = mask

        self.lr = lr
        self.dropout = dropout
        self.vocab_size = vocab_size

        self.item_embeddings = torch.nn.Embedding(
            self.vocab_size, embedding_dim=channels
        )

        self.input_pos_embedding = torch.nn.Embedding(512, embedding_dim=channels)

        encoder_layer = nn.TransformerEncoderLayer(
            d_model=channels, nhead=4, dropout=self.dropout
        )

        self.encoder = torch.nn.TransformerEncoder(encoder_layer, num_layers=6)

        self.linear_out = Linear(channels, self.vocab_size)

        self.do = nn.Dropout(p=self.dropout)

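    def encode_src(self, src_items):
        # Item IDs -> embeddings, shape (batch, sequence, channels)
        src_items = self.item_embeddings(src_items)

        batch_size, in_sequence_len = src_items.size(0), src_items.size(1)
        # Position indices 0..sequence-1, repeated for every sequence in the batch
        pos_encoder = (
            torch.arange(0, in_sequence_len, device=src_items.device)
            .unsqueeze(0)
            .repeat(batch_size, 1)
        )
        pos_encoder = self.input_pos_embedding(pos_encoder)

        src_items += pos_encoder

        # The encoder expects (sequence, batch, channels) by default
        src = src_items.permute(1, 0, 2)

        src = self.encoder(src)

        # Back to (batch, sequence, channels)
        return src.permute(1, 0, 2)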

    def forward(self, src_items):

        src = self.encode_src(src_items)

        out = self.linear_out(src)

        return out

    def training_step(self, batch, batch_idx):
        src_items, y_true = batch

        y_pred = self(src_items)

        y_pred = y_pred.view(-1, y_pred.size(2))
        y_true = y_true.view(-1)

        src_items = src_items.view(-1)
        mask = src_items == self.mask

        loss = masked_ce(y_pred=y_pred, y_true=y_true, mask=mask)
        accuracy = masked_accuracy(y_pred=y_pred, y_true=y_true, mask=mask)

        self.log("train_loss", loss)
        self.log("train_accuracy", accuracy)

        return loss

    def validation_step(self, batch, batch_idx):
        src_items, y_true = batch

        y_pred = self(src_items)

        y_pred = y_pred.view(-1, y_pred.size(2))
        y_true = y_true.view(-1)

        src_items = src_items.view(-1)
        mask = src_items == self.mask

        loss = masked_ce(y_pred=y_pred, y_true=y_true, mask=mask)
        accuracy = masked_accuracy(y_pred=y_pred, y_true=y_true, mask=mask)

        self.log("valid_loss", loss)
        self.log("valid_accuracy", accuracy)

        return loss

    def test_step(self, batch, batch_idx):
        src_items, y_true = batch

        y_pred = self(src_items)

        y_pred = y_pred.view(-1, y_pred.size(2))
        y_true = y_true.view(-1)

        src_items = src_items.view(-1)
        mask = src_items == self.mask

        loss = masked_ce(y_pred=y_pred, y_true=y_true, mask=mask)
        accuracy = masked_accuracy(y_pred=y_pred, y_true=y_true, mask=mask)

        self.log("test_loss", loss)
        self.log("test_accuracy", accuracy)

        return loss

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=self.lr)
        scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
            optimizer, patience=10, factor=0.1
        )
        return {
            "optimizer": optimizer,
            "lr_scheduler": scheduler,
            "monitor": "valid_loss",
        }
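
To make the masked loss above more concrete, here is a framework-free sketch that recomputes the same quantity on a toy example: the cross-entropy is averaged over the masked positions only, and the other positions are ignored. The function name and the toy values are illustrative assumptions; the notebook itself relies on `F.cross_entropy`.

```python
import math


def masked_cross_entropy(logits, targets, mask):
    """Average cross-entropy over the positions where mask is True.

    logits: list of per-position score lists
    targets: list of class indices
    mask: list of booleans marking the positions that count
    """
    total, count = 0.0, 0
    for scores, target, keep in zip(logits, targets, mask):
        if not keep:
            continue
        # log-sum-exp, subtracting the max for numerical stability
        top = max(scores)
        log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
        total += log_norm - scores[target]
        count += 1
    return total / count if count else 0.0


logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 3.0, 0.0]]
targets = [0, 2, 1]
mask = [True, False, True]  # only positions 0 and 2 were [MASK] in the input
loss = masked_cross_entropy(logits, targets, mask)
```

Changing the target at position 1 leaves `loss` unchanged, since that position is not masked.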

# Training the model

The model was trained for 100 epochs with a batch size of 64, the Adam optimizer with a learning rate of 1e-4, and a cross-entropy loss. The history length used during training is 120 items (movies).

The data comes from the MovieLens 1M dataset, which contains 1 million ratings from about 6,000 users on about 4,000 movies.

In [None]:
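def mask_list(l1, p=0.8):
    """Keep each item with probability `p`; otherwise replace it with MASK."""

    l1 = [a if random.random() < p else MASK for a in l1]

    return l1


def mask_last_elements_list(l1, val_context_size: int = 5):
    """Randomly mask (with probability 0.5) only the last `val_context_size` items."""

    l1 = l1[:-val_context_size] + mask_list(l1[-val_context_size:], p=0.5)

    return l1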


class Dataset(torch.utils.data.Dataset):
    def __init__(self, groups, grp_by, split, history_size=120):
        self.groups = groups
        self.grp_by = grp_by
        self.split = split
        self.history_size = history_size

    def __len__(self):
        return len(self.groups)

    def __getitem__(self, idx):
        group = self.groups[idx]

        df = self.grp_by.get_group(group)

        context = get_context(df, split=self.split, context_size=self.history_size)

        trg_items = context["movieId_mapped"].tolist()

        if self.split == "train":
            src_items = mask_list(trg_items)
        else:
            src_items = mask_last_elements_list(trg_items)

        pad_mode = "left" if random.random() < 0.5 else "right"
        trg_items = pad_list(trg_items, history_size=self.history_size, mode=pad_mode)
        src_items = pad_list(src_items, history_size=self.history_size, mode=pad_mode)

        src_items = torch.tensor(src_items, dtype=torch.long)

        trg_items = torch.tensor(trg_items, dtype=torch.long)

        return src_items, trg_items


def train(
    data_csv_path: str,
    log_dir: str = "recommender_logs",
    model_dir: str = "recommender_models",
    batch_size: int = 64,
    epochs: int = 5,
    history_size: int = 120,
):
    data = pd.read_csv(data_csv_path)

    data.sort_values(by="timestamp", inplace=True)

    data, mapping, inverse_mapping = map_column(data, col_name="movieId")

    grp_by_train = data.groupby(by="userId")

    groups = list(grp_by_train.groups)

    train_data = Dataset(
        groups=groups,
        grp_by=grp_by_train,
        split="train",
        history_size=history_size,
    )
    val_data = Dataset(
        groups=groups,
        grp_by=grp_by_train,
        split="val",
        history_size=history_size,
    )

    print("len(train_data)", len(train_data))
    print("len(val_data)", len(val_data))

    train_loader = DataLoader(
        train_data,
        batch_size=batch_size,
        num_workers=4,
        shuffle=True,
        persistent_workers=True,
    )
    val_loader = DataLoader(
        val_data,
        batch_size=batch_size,
        num_workers=4,
        shuffle=False,
        persistent_workers=True,
    )

    model = Recommender(
        vocab_size=len(mapping) + 2,
        lr=1e-4,
        dropout=0.3,
    )

    logger = TensorBoardLogger(
        save_dir=log_dir,
    )

    checkpoint_callback = ModelCheckpoint(
        monitor="valid_loss",
        mode="min",
        dirpath=model_dir,
        filename="recommender",
    )

    trainer = pl.Trainer(
        max_epochs=epochs,
        logger=logger,
        callbacks=[checkpoint_callback],
    )
    trainer.fit(model, train_loader, val_loader)

    result_val = trainer.test(model, val_loader)

    output_json = {
        "val_loss": result_val[0]["test_loss"],
        "best_model_path": checkpoint_callback.best_model_path,
    }

    print(output_json)

    return output_json
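
The `pad_list` helper used in `Dataset.__getitem__` is imported from `data_processing` and is not shown in this notebook. As a minimal sketch of what such left/right padding could look like, assuming a padding ID of 0 and a mask ID of 1 (both hypothetical here), consider:

```python
PAD_TOKEN = 0  # hypothetical padding ID


def pad_to_length(items, length, mode="left", pad_value=PAD_TOKEN):
    """Keep the most recent `length` items (length > 0), then pad on the chosen side."""
    items = list(items)[-length:]
    padding = [pad_value] * (length - len(items))
    return padding + items if mode == "left" else items + padding


src = [1, 47, 5, 1, 23]    # masked input (1 stands for [MASK] in this sketch)
trg = [12, 47, 5, 88, 23]  # original items
src_padded = pad_to_length(src, 8, mode="left")
trg_padded = pad_to_length(trg, 8, mode="left")
```

Whatever the actual implementation, the source and target sequences need the same padding side so that each masked position stays aligned with its target, which is why the notebook draws `pad_mode` once and reuses it for both lists.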

In [None]:
parser = argparse.ArgumentParser()
parser.add_argument("--data_csv_path", type=str, default="data/MovieLens 1M Dataset/ml-1m/ratings.csv")
parser.add_argument("--epochs", type=int, default=100)
# parse_known_args ignores the extra arguments that a notebook kernel adds to sys.argv
args, _ = parser.parse_known_args()

train(
    data_csv_path=args.data_csv_path,
    epochs=args.epochs,
)

At the end of training, the model is saved. In our case, it reached a loss of 6.16 and an accuracy of 1.11% after 100 epochs, measured by `trainer.test` on the validation loader. The paper reports 28% accuracy, without stating the number of training epochs.
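
To put the reported loss in perspective, a quick back-of-the-envelope check compares it with what a uniform random guess over the catalogue would achieve. The catalogue size of 4,000 is the approximate figure quoted earlier, used here as an assumption rather than the exact vocabulary size.

```python
import math

num_items = 4000      # approximate catalogue size (assumption)
reported_loss = 6.16  # loss reported above

uniform_loss = math.log(num_items)    # cross-entropy of a uniform guess
uniform_accuracy = 1.0 / num_items    # top-1 accuracy of a uniform guess
perplexity = math.exp(reported_loss)  # "effective" number of equally likely items

print(f"uniform loss     ~ {uniform_loss:.2f}")
print(f"uniform accuracy ~ {uniform_accuracy:.4%}")
print(f"model perplexity ~ {perplexity:.0f}")
```

Under this assumption, a uniform guess gives a loss of about 8.29, so a loss of 6.16 (a perplexity of roughly 473) indicates that the model has learned some structure even though its top-1 accuracy remains low.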

# Using the trained model to recommend movies

In [25]:
data_csv_path = "data/MovieLens 1M Dataset/ml-1m/ratings.csv"
movies_path = "data/MovieLens 1M Dataset/ml-1m/movies.csv"

model_path = "recommender_models/recommender.ckpt"

In [31]:
data = pd.read_csv(data_csv_path)
movies = pd.read_csv(movies_path)

In [32]:
data.sort_values(by="timestamp", inplace=True)

In [33]:
data, mapping, inverse_mapping = map_column(data, col_name="movieId")
grp_by_train = data.groupby(by="userId")

In [34]:
random.sample(list(grp_by_train.groups), k=10)

[1366, 4410, 2109, 3235, 5368, 361, 4867, 3173, 2938, 47]

In [35]:
model = Recommender(
        vocab_size=len(mapping) + 2,
        lr=1e-4,
        dropout=0.3,
    )
model.eval()
model.load_state_dict(torch.load(model_path)["state_dict"])

<All keys matched successfully>

In [36]:
movie_to_idx = {a: mapping[b] for a, b in zip(movies.title.tolist(), movies.movieId.tolist()) if b in mapping}
idx_to_movie = {v: k for k, v in movie_to_idx.items()}

In [37]:
def predict(list_movies, model, movie_to_idx, idx_to_movie):
    
    ids = [PAD] * (120 - len(list_movies) - 1) + [movie_to_idx[a] for a in list_movies] + [MASK]
    
    src = torch.tensor(ids, dtype=torch.long).unsqueeze(0)
    
    with torch.no_grad():
        prediction = model(src)
    
    masked_pred = prediction[0, -1].numpy()
    
    sorted_predicted_ids = np.argsort(masked_pred).tolist()[::-1]
    
    sorted_predicted_ids = [a for a in sorted_predicted_ids if a not in ids]
    
    return [idx_to_movie[a] for a in sorted_predicted_ids[:30] if a in idx_to_movie]
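
The ranking step inside `predict` can be illustrated in isolation: sort the item IDs by score, drop the ones already present in the input (padding, mask, and watched movies), and keep the first few. The sketch below uses toy scores and a hypothetical helper name.

```python
def top_k_unseen(scores, seen_ids, k=3):
    """Return the k highest-scoring item IDs that are not in seen_ids."""
    seen = set(seen_ids)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [i for i in ranked if i not in seen][:k]


scores = [0.1, 0.3, 2.5, 1.7, 0.9, 2.1]  # toy scores, one per item ID
seen_ids = [0, 1, 2]                      # e.g. PAD, MASK, and an already-watched movie
recommended = top_k_unseen(scores, seen_ids, k=2)
```

Here `recommended` is `[5, 3]`: item 2 has the highest score but is filtered out because it already appears in the input. Using a set for the seen IDs keeps the membership test cheap when the catalogue is large.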


### Scenario 1: Adventure/Fantasy

In [43]:
list_movies = ["Willow (1988)",
               "Star Wars: Episode I - The Phantom Menace (1999)",
               "Time Bandits (1981)",
               "Ladyhawke (1985)"]

top_movie = predict(list_movies, model, movie_to_idx, idx_to_movie)
top_movie

['Star Wars: Episode IV - A New Hope (1977)',
 'Star Wars: Episode V - The Empire Strikes Back (1980)',
 'E.T. the Extra-Terrestrial (1982)',
 'Alien (1979)',
 'Star Wars: Episode VI - Return of the Jedi (1983)',
 'Raiders of the Lost Ark (1981)',
 'Back to the Future (1985)',
 'Jaws (1975)',
 'Jurassic Park (1993)',
 'Princess Bride The (1987)',
 'Godfather The (1972)',
 'American Beauty (1999)',
 'Abyss The (1989)',
 '2001: A Space Odyssey (1968)',
 'Ghostbusters (1984)',
 'Godfather: Part II The (1974)',
 'Terminator 2: Judgment Day (1991)',
 '20000 Leagues Under the Sea (1954)',
 'Braveheart (1995)',
 'Willy Wonka and the Chocolate Factory (1971)',
 'Saving Private Ryan (1998)',
 'Matrix The (1999)',
 'Psycho (1960)',
 'Goonies The (1985)',
 'Indiana Jones and the Last Crusade (1989)',
 'Wizard of Oz The (1939)',
 'Beetlejuice (1988)',
 'Hook (1991)',
 'NeverEnding Story The (1984)',
 "Schindler's List (1993)"]

### Scenario 2: Action/Adventure

In [44]:
list_movies = ["Golden Voyage of Sinbad The (1974)",
               "Sinbad and the Eye of the Tiger (1977)",
               "Godzilla 2000 (Gojira ni-sen mireniamu) (1999)",
               "Mortal Kombat (1995)",
               "Judge Dredd (1995)",
               "Waterworld (1995)",
]
top_movie = predict(list_movies, model, movie_to_idx, idx_to_movie)
top_movie

['Star Wars: Episode IV - A New Hope (1977)',
 'Alien (1979)',
 'Jaws (1975)',
 'Psycho (1960)',
 'Star Wars: Episode V - The Empire Strikes Back (1980)',
 'Godfather The (1972)',
 '2001: A Space Odyssey (1968)',
 'Godfather: Part II The (1974)',
 'King Kong (1933)',
 'Matrix The (1999)',
 'Terminator 2: Judgment Day (1991)',
 'Exorcist The (1973)',
 'Raiders of the Lost Ark (1981)',
 'Star Wars: Episode VI - Return of the Jedi (1983)',
 'American Beauty (1999)',
 'Birds The (1963)',
 'Ghostbusters (1984)',
 'Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1963)',
 'Star Wars: Episode I - The Phantom Menace (1999)',
 'E.T. the Extra-Terrestrial (1982)',
 'Halloween (1978)',
 'Shining The (1980)',
 'Seven Samurai (The Magnificent Seven) (Shichinin no samurai) (1954)',
 'Braveheart (1995)',
 'Dracula (1931)',
 'Blade Runner (1982)',
 '20000 Leagues Under the Sea (1954)',
 'Butch Cassidy and the Sundance Kid (1969)',
 'Saving Private Ryan (1998)',
 'Good The Bad and 

### Scenario 3: Comedy

In [45]:
list_movies = ["Toy Story (1995)",
               "Toy Story 2 (1999)",
               "Little Nemo: Adventures in Slumberland (1992)",
               "It Takes Two (1995)",
               "Mighty Aphrodite (1995)",
               "Ghostbusters (1984)",
               "Ace Ventura: Pet Detective (1994)"]
top_movie = predict(list_movies, model, movie_to_idx, idx_to_movie)
top_movie

['Aladdin (1992)',
 'Babe (1995)',
 'Shakespeare in Love (1998)',
 "Bug's Life A (1998)",
 'Beauty and the Beast (1991)',
 'Groundhog Day (1993)',
 'American Beauty (1999)',
 'Mary Poppins (1964)',
 'Lion King The (1994)',
 'Princess Bride The (1987)',
 'Little Mermaid The (1989)',
 'Lady and the Tramp (1955)',
 '101 Dalmatians (1961)',
 'Being John Malkovich (1999)',
 'Snow White and the Seven Dwarfs (1937)',
 'Hercules (1997)',
 'Hunchback of Notre Dame The (1996)',
 'Antz (1998)',
 'Fantasia (1940)',
 'Nightmare Before Christmas The (1993)',
 'Galaxy Quest (1999)',
 'Jungle Book The (1967)',
 'Clueless (1995)',
 'Wrong Trousers The (1993)',
 'My Cousin Vinny (1992)',
 'South Park: Bigger Longer and Uncut (1999)',
 'Addams Family The (1991)',
 'Sleeping Beauty (1959)',
 'Peter Pan (1953)',
 'Close Shave A (1995)']

We can see that the model makes interesting recommendations in the Adventure/Fantasy genre. Note that the model has no access to the movies' genres.

In these cases, the model was able to suggest excellent movies, such as Aladdin or Star Wars, that match the theme of the user's history.

# Conclusion

In this project, we developed a movie recommendation system built on a natural language processing technique (BERT-style masked prediction) applied to users' viewing histories, with movie titles used only to look up the items. Although the model's accuracy during training was poor, the system's recommendations made sense when tested.