In [5]:
!pip install transformers[torch]


In [2]:
from typing import List, Dict, Tuple
from datetime import datetime
import json, csv
from tqdm import tqdm

import torch
import torch.nn as nn
from torch import Tensor
from torch.optim import AdamW
from transformers import LongformerTokenizer, LongformerModel


In [3]:
model = LongformerModel.from_pretrained("allenai/longformer-base-4096")
tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")

Some weights of the model checkpoint at allenai/longformer-base-4096 were not used when initializing LongformerModel: ['lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'lm_head.dense.weight', 'lm_head.bias', 'lm_head.dense.bias', 'lm_head.layer_norm.weight']
- This IS expected if you are initializing LongformerModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing LongformerModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [4]:
# Sample profile description, kept verbatim; adjacent string literals are concatenated by Python
sample_text = (
    "I came to University a bit later in life, took a couple of gap years before I sorted my life out and got myself to University. I started at 20 which wasn't too bad, only a year or two older than most people.\n"
    "I had a boyfriend I loved, great friends, and overall a great life in London. But I decided I wanted to get a degree and to have the University experience and to get out of London for a few years. I was motivated and felt ready for University.\n"
    "I struggled with my first year, I fell into old bad habits and my expectations of what university would be like just didn't match up with the reality. I felt lonely and homesick, and wanted my old life back. I stopped attending lectures and fell behind with work. On top of this my brother tried to kill himself, and two of my aunts and my grandmother died. I just stopped caring about anything and fell into a deep depression, and had constant anxiety and panic attacks. It got so bad I was told that I would need to become an external student for a year and retake parts of modules.\n"
    "Becoming an external student in the UK means that you receive no personal maintenance loan, so I had to take up a full time bar job to pay rent on my uni house.\n"
    "I have tried to make the best of this year and stay positive, but I am at an all time low.\n"
    "I have made friends through bar work, but I don't feel particularly close to any of them or would see them outside of work, despite my attempts to make real friendships with them and do this.\n"
    "I honestly feel unsure about whether I want to carry on with university, I just don't know if I can live here for another two years.\n"
    "There are many great aspects to my uni city, a great music scene, great night life and the student scene is second to none. But I just feel noticeably older than most of the people here, parties and drugs and drinking just aren't enough for me anymore.\n"
    "It is made more difficult by the fact that most of my closest friends have now graduated and are getting on with their lives back in London or are off traveling.\n"
    "I am trying to remind myself of why I came to Uni in the first place and remember what motivated me, but its hard. I tell myself it will be better when I go back to Uni properly next year and am attending lectures, but I am worried I will just mess up again due to depression and struggle to really enjoy the next couple of years.\n"
    "I had so many plans to get more involved and do more with my life this year, but my job requires me to work sometimes up to 40-50 hours a week, and is physically exhausting. On my days off I just want to lie in bed all day. My bar also attracts lonely middle aged men I have to pretend I am not massively creeped out by and chat to them all day, whilst they continually try and make advances on me. It is really getting to me, I tried to just ignore them but my manager told me I wasn't being chatty and friendly enough :s.\n"
    "I also had to break up with my boyfriend as I literally had no time to see him and couldn't handle long distance anymore.\n"
    "I am thinking about going back to London for the summer, I will have to work but I think it will be good to spend some proper time with my family and friends, but I am worried that if I do this I won't go back to university. Many people who work on bars stay here over the summer, and my housemate will be here, but there isn't much to do when all the students are gone and I worry I will become even more depressed and lonely."
)
input_ids = tokenizer.encode(sample_text, padding="max_length", return_tensors="pt")
outputs = model(input_ids)
last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple

---

In [20]:
def embeddings_distance(embeddings1: Tensor, embeddings2: Tensor) -> Tensor:
  # Cosine similarity along the feature dimension, used as the distance term of the loss
  cos_sim = nn.CosineSimilarity(dim=1, eps=1e-6)
  return cos_sim(embeddings1, embeddings2)

class MatchingLoss(nn.Module):
  def __init__(self, margin: float, alpha: float, beta: float, tau: float = 0.5):
    super().__init__()
    self.margin = Tensor([margin])
    self.alpha = Tensor([alpha])
    self.beta = Tensor([beta])
    self.tau = Tensor([tau])
    self.sigmoid = nn.Sigmoid()

  def forward(self, user_embeddings: Tensor, receiver_embeddings: Tensor, feedback: int, evolution: float, user_profile_features: Tensor, receiver_profile_features: Tensor) -> Tensor:
    feedback = Tensor([feedback])
    evolution = Tensor([evolution])

    # Flatten the hidden states (batch size 1) and append the profile features
    complete_user_embeddings = torch.cat((user_embeddings.view(1, -1), user_profile_features), 1)
    complete_receiver_embeddings = torch.cat((receiver_embeddings.view(1, -1), receiver_profile_features), 1)

    # With tau = 0.5, switch is 1 when alpha * feedback + beta * evolution >= 0 and 0 otherwise
    switch = torch.floor(self.tau + self.sigmoid(self.alpha * feedback + self.beta * evolution))
    distance = embeddings_distance(complete_user_embeddings, complete_receiver_embeddings)
    # Both terms are gated by switch, so the loss is zero whenever switch is 0
    return switch * distance**2 + switch * torch.clamp(self.margin - distance, min=0)**2
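
The `switch` term in the loss above can be explored without PyTorch. The sketch below is a plain-Python illustration, not part of the notebook: `sigmoid` and `switch_value` are hypothetical helper names, and the defaults for `alpha`, `beta`, and `tau` simply mirror the values used when `MatchingLoss` is instantiated later on.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function, the scalar counterpart of nn.Sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))


def switch_value(feedback: int, evolution: float,
                 alpha: float = 0.7, beta: float = 0.3, tau: float = 0.5) -> int:
    """Scalar version of floor(tau + sigmoid(alpha * feedback + beta * evolution))."""
    return math.floor(tau + sigmoid(alpha * feedback + beta * evolution))


# Print the gate for a few (feedback, evolution) pairs
for feedback, evolution in [(1, 0.0), (0, 0.0), (0, -1.0), (1, -3.0)]:
    print(feedback, evolution, switch_value(feedback, evolution))
```

With `tau = 0.5` the gate opens exactly when `alpha * feedback + beta * evolution` is non-negative, so a sufficiently negative evolution can close it even after "Helpful" feedback.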

In [7]:
def get_evolution_exponential_weighted_decay(user_id: int, rho: float = 0.5) -> float:
  # Columns used: row[0] user id, row[1] date (dd-mm-YYYY), row[2] integer value
  with open("../sample_dataset/series.csv") as csvfile:
    data = csv.reader(csvfile, delimiter=';')
    next(data)  # skip the first (header) row
    series = [(int(row[2]), datetime.strptime(row[1], "%d-%m-%Y")) for row in data if int(row[0]) == user_id]
  n = len(series)
  if n == 0:
    print("Id not found")
    return 0.0
  coeff = (1 - rho) / (1 - rho**n)
  # Change per day between consecutive entries (assumes consecutive entries fall on different days)
  discrete_derivatives = [(series[i+1][0] - series[i][0]) /
                          (abs(series[i+1][1] - series[i][1]).days)
                          for i in range(n - 1)]
  # For 0 < rho < 1, later changes get larger weights (smaller exponent of rho)
  return coeff * sum(rho**(n-i-1) * discrete_derivatives[i] for i in range(n - 1))
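
To see what the weighting above does, here is a minimal sketch that applies the same formula to an in-memory series instead of `series.csv`. The function name `weighted_evolution` and the sample values are assumptions made for illustration only.

```python
from datetime import datetime
from typing import List, Tuple


def weighted_evolution(series: List[Tuple[int, datetime]], rho: float = 0.5) -> float:
    """Exponentially weighted sum of per-day changes between consecutive entries."""
    n = len(series)
    if n == 0:
        return 0.0
    coeff = (1 - rho) / (1 - rho**n)
    derivatives = [
        (series[i + 1][0] - series[i][0]) / abs(series[i + 1][1] - series[i][1]).days
        for i in range(n - 1)
    ]
    # Same weights as in the notebook: rho**(n-i-1) for the i-th change
    return coeff * sum(rho**(n - i - 1) * derivatives[i] for i in range(n - 1))


# Hypothetical series: (value, date) pairs in chronological order
sample = [
    (10, datetime(2022, 1, 1)),
    (12, datetime(2022, 1, 3)),
    (12, datetime(2022, 1, 4)),
]
print(weighted_evolution(sample))
```

In this sample the only non-zero change is the older one (+1 per day), which is weighted by `rho**2`, so the result is small and positive (1/7 for `rho = 0.5`).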

In [8]:
def get_tokenization(user_id: int, tokenizer) -> Tensor:
  user_id = str(user_id)
  with open("../sample_dataset/profiles.json", "r") as profiles_file:
    profiles = json.load(profiles_file)
  # Tokenizations never change, so each one is computed once and cached in the profiles file
  if profiles[user_id]["tokenization"] == "null":
    profiles[user_id]["tokenization"] = tokenizer.encode(profiles[user_id]["description"], padding="max_length", return_tensors="pt").tolist()[0]
    with open("../sample_dataset/profiles.json", "w") as profiles_file:
      json.dump(profiles, profiles_file, indent=2, ensure_ascii=False)
  unshaped_tokenization = torch.LongTensor(profiles[user_id]["tokenization"])
  return unshaped_tokenization.view(1, -1)  # shape (1, sequence_length)

In [9]:
def get_complete_feedback() -> List[Tuple[int, List[int], int]]:
    """Read explicit feedback as (user id, peer ids, label) tuples; rows marked "Later" are skipped."""
    with open("../sample_dataset/feedback.csv") as csvfile:
        data = csv.reader(csvfile, delimiter=';')
        next(data)  # skip the first (header) row
        action_mappings = {
            "Helpful": 1,
            "Exit": 0
        }
        feedback = [
            (int(row[0]),
             list(map(int, row[1].split(","))),
             action_mappings[row[3]])
            for row in data if row[3] != "Later"]
    return feedback
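
The parsing logic above can be tried on an in-memory file. In this sketch only the column positions (0, 1, and 3) and the action labels come from the function above; the header names, the contents of the third column, the sample rows, and the name `parse_feedback` are invented for illustration.

```python
import csv
import io
from typing import List, Tuple

# Hypothetical file contents; the real feedback.csv may use other header names and values
SAMPLE_CSV = """user_id;peer_ids;date;action
1;2,3,4;05-11-2022;Helpful
2;5,6;05-11-2022;Exit
3;1,2;06-11-2022;Later
"""

ACTION_MAPPINGS = {"Helpful": 1, "Exit": 0}


def parse_feedback(csvfile) -> List[Tuple[int, List[int], int]]:
    """Return (user id, peer ids, label) tuples, skipping rows marked "Later"."""
    reader = csv.reader(csvfile, delimiter=";")
    next(reader)  # skip the header row
    return [
        (int(row[0]), [int(peer) for peer in row[1].split(",")], ACTION_MAPPINGS[row[3]])
        for row in reader
        if row[3] != "Later"
    ]


print(parse_feedback(io.StringIO(SAMPLE_CSV)))
```

Filtering on `"Later"` before the dictionary lookup matters: that action has no entry in the mapping, so looking it up would raise a `KeyError`.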

In [10]:
def profiles_feature_extraction(ids: list) -> Dict[str, Tensor]:
  # One-hot encodings for the categorical profile fields
  gender_mapping = {
    "Female": [1,0,0],
    "Male": [0,1,0],
    "Other": [0,0,1]
  }
  occupation_mapping = {
    "Student & Employed": [1,0,0,0],
    "Student": [0,1,0,0],
    "Employed": [0,0,1,0],
    "Unemployed": [0,0,0,1]
  }
  # Load the profiles once instead of once per user
  with open("../sample_dataset/profiles.json", "r") as profiles_file:
    profiles = json.load(profiles_file)
  profile_feature_map = {}
  for user_id in ids:
    user_id = str(user_id)
    user_profile = profiles[user_id]
    # Feature row: [age, gender one-hot (3), occupation one-hot (4)], shape (1, 8)
    profile_feature_map[user_id] = Tensor([user_profile["age"],
                *gender_mapping[user_profile["gender"]],
                *occupation_mapping[user_profile["occupation"]]]).view(1,-1)
  return profile_feature_map
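
As a quick check of the feature layout produced above, the following sketch encodes a single profile into a plain Python list. The profile values and the name `encode_profile` are hypothetical; the mappings are copied from the function above.

```python
from typing import Dict, List

GENDER_ONE_HOT = {
    "Female": [1, 0, 0],
    "Male": [0, 1, 0],
    "Other": [0, 0, 1],
}
OCCUPATION_ONE_HOT = {
    "Student & Employed": [1, 0, 0, 0],
    "Student": [0, 1, 0, 0],
    "Employed": [0, 0, 1, 0],
    "Unemployed": [0, 0, 0, 1],
}


def encode_profile(profile: Dict) -> List[float]:
    """Return [age, gender one-hot (3), occupation one-hot (4)] as floats."""
    raw = [
        profile["age"],
        *GENDER_ONE_HOT[profile["gender"]],
        *OCCUPATION_ONE_HOT[profile["occupation"]],
    ]
    return [float(value) for value in raw]


# Hypothetical profile entry
features = encode_profile({"age": 23, "gender": "Female", "occupation": "Student"})
print(features)
```

Each profile yields 8 values. As in the notebook, the age is passed through unscaled while the categorical fields are one-hot encoded.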

In [12]:
def retrain(user_id: int, group_peers_ids: List[int], user_feedback: int, tokenizer, optimizer, model, matching_loss):
  model.train()
  user_evolution = get_evolution_exponential_weighted_decay(user_id)
  # Include the user's own id so the lookup below also works when the user is not listed among the peers
  profile_features = profiles_feature_extraction([user_id, *group_peers_ids])
  for peer_id in tqdm(group_peers_ids):
    # The user embeddings are recomputed on every step because the weights change after each update
    user_embeddings = model(get_tokenization(user_id, tokenizer))[0]  # the last hidden state is the first element of the output tuple
    peer_embeddings = model(get_tokenization(peer_id, tokenizer))[0]
    loss = matching_loss(user_embeddings, peer_embeddings, user_feedback, user_evolution, profile_features[str(user_id)], profile_features[str(peer_id)])

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

In [21]:
model = LongformerModel.from_pretrained("allenai/longformer-base-4096")
tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
matching_loss = MatchingLoss(margin=1, alpha=0.7, beta=0.3)
optimizer = AdamW(model.parameters())


In [23]:
# Retrain validation
feedback = get_complete_feedback()
for user_id, group_peers_ids, user_feedback in feedback:
  retrain(user_id, group_peers_ids, user_feedback, tokenizer, optimizer, model, matching_loss)

100%|██████████| 3/3 [03:48<00:00, 76.05s/it]
100%|██████████| 3/3 [03:54<00:00, 78.01s/it]
100%|██████████| 3/3 [03:47<00:00, 75.91s/it]
