<a href="https://colab.research.google.com/github/vvikasreddy/lexically_constrained_beam_search_/blob/main/beam_search.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

References:

Marian MT model : https://huggingface.co/docs/transformers/model_doc/marian

Code to get the logits : https://huggingface.co/docs/transformers/main_classes/output

to get the BOS and EOS tokens: https://huggingface.co/docs/transformers/main_classes/configuration#transformers.PretrainedConfig.decoder_start_token_id

get topk values : https://pytorch.org/docs/stable/generated/torch.topk.html

ideas and core implementation drawn from this paper: https://arxiv.org/pdf/1704.07138

bleu score : https://www.nltk.org/api/nltk.translate.bleu_score.html

rogue score : https://huggingface.co/spaces/evaluate-metric/rouge/blob/main/README.md

(not directly related to beam search project)


reference to link google colab with .py file from git  : https://colab.research.google.com/github/jckantor/cbe61622/blob/master/docs/A.02-Downloading_Python_source_files_from_github.ipynb

reference to get rogue score working : https://stackoverflow.com/questions/56081324/why-are-google-colab-shell-commands-not-working



## downloading essential modules

In [1]:
import locale
print(locale.getpreferredencoding())
import locale
def getpreferredencoding(do_setlocale = True):
  return "UTF-8"
locale.getpreferredencoding = getpreferredencoding

# reference to get this working : https://stackoverflow.com/questions/56081324/why-are-google-colab-shell-commands-not-working

UTF-8


In [2]:
!pip install datasets



## Importing necessary libraries

In [3]:
import torch, random
from torch.utils.data import Dataset
from datasets import load_dataset
from transformers import MarianMTModel, MarianTokenizer
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from torch.utils.data import DataLoader
from tqdm import tqdm
import warnings
from nltk.translate.bleu_score import corpus_bleu

## loading the dataset, considering the wmt turkish - english translation

In [4]:
ds = load_dataset("wmt/wmt16", "tr-en")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


## Glancing the organization of the dataset

In [5]:
ds

DatasetDict({
    train: Dataset({
        features: ['translation'],
        num_rows: 205756
    })
    validation: Dataset({
        features: ['translation'],
        num_rows: 1001
    })
    test: Dataset({
        features: ['translation'],
        num_rows: 3000
    })
})

In [6]:
ds['train'][0]

{'translation': {'en': "Kosovo's privatisation process is under scrutiny",
  'tr': "Kosova'nın özelleştirme süreci büyüteç altında"}}

## Loading the tokenizer and model, based of Marian-NMT

In [7]:
tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-tr-en")
model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-tr-en")



In [8]:
model

MarianMTModel(
  (model): MarianModel(
    (shared): Embedding(62389, 512, padding_idx=62388)
    (encoder): MarianEncoder(
      (embed_tokens): Embedding(62389, 512, padding_idx=62388)
      (embed_positions): MarianSinusoidalPositionalEmbedding(512, 512)
      (layers): ModuleList(
        (0-5): 6 x MarianEncoderLayer(
          (self_attn): MarianAttention(
            (k_proj): Linear(in_features=512, out_features=512, bias=True)
            (v_proj): Linear(in_features=512, out_features=512, bias=True)
            (q_proj): Linear(in_features=512, out_features=512, bias=True)
            (out_proj): Linear(in_features=512, out_features=512, bias=True)
          )
          (self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          (activation_fn): SiLU()
          (fc1): Linear(in_features=512, out_features=2048, bias=True)
          (fc2): Linear(in_features=2048, out_features=512, bias=True)
          (final_layer_norm): LayerNorm((512,), eps=1e-05

In [9]:
# taking a look at the number of beams used by the model.
model.config.num_beams

6

In [10]:
# viewing the turkish and english translation
print(ds["validation"][1]["translation"]["tr"])
print(ds["validation"][1]["translation"]["en"])

Norveç'in beş milyon insanı en yüksek yaşam standartlarının tadını çıkarıyor, sadece Avrupa'da değil, dünyada.
Norway's five million people enjoy one of the highest standards of living, not just in Europe, but in the world.


## Extracting and importing the constraints from my github

In [11]:
# code to import constraints and store in a local directory, from my git. (code reference cicted above.)

user = "vvikasreddy"
repo = "lexically_constrained_beam_search_"
pyfile = "constraints.py"

# i.e url is "https://github.com/vvikasreddy/lexically_constrained_beam_search_/blob/main/constraints.py"

url = f"https://raw.githubusercontent.com/{user}/{repo}/main/{pyfile}"
!wget --no-cache --backups=1 {url}

import constraints

--2024-12-07 01:22:36--  https://raw.githubusercontent.com/vvikasreddy/lexically_constrained_beam_search_/main/constraints.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.110.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4914 (4.8K) [text/plain]
Saving to: ‘constraints.py’


2024-12-07 01:22:36 (67.7 MB/s) - ‘constraints.py’ saved [4914/4914]



In [12]:
# takes almost 4 minutes to get the constraints, you will see 3 progress bars
c = constraints.get_constraints()

100%|██████████| 205756/205756 [00:44<00:00, 4671.14it/s]
100%|██████████| 205756/205756 [01:06<00:00, 3105.42it/s]
100%|██████████| 26221852/26221852 [00:41<00:00, 634381.51it/s]


In [13]:
print("some of the constraints are :")

# Extract 5 random keys
random_keys = random.sample(list(c.keys()), 5)

for key in random_keys:
  print(key, c[key])

print("The length of the constraints is", len(c))

some of the constraints are :
('Ekim', 'Pazartesi') (('Monday', '(October'), 0.9722479175671673)
('ve', 'NATO') (('and', 'NATO'), 1.0560217638878333)
('ve', 'Sırbistan') (('and', 'Serbia'), 0.9671508806281317)
('en', 'yüksek') (('the', 'highest'), 1.0629455738333067)
('Reuters,', 'BBC,') (('Reuters,', 'BBC,'), 0.9497800176080862)
The length of the constraints is 570


## Helper functions for Beam Search

In [14]:
def get_ngrams(src, n = 2, ):
  """
  Args:
    src: text for which ngrams should be returned
    n : represents the value of ngrams

  returns:
    returns ngrams
  """

  src = src.split(" ")
  src = [tuple(src[i:i+n]) for i in range(len(src) - n + 1)]
  return src

def constraints_tokens(src, c):
  """
  args:
    src: It is the turkish sentence, to which we want to return the constraints
    c: represents the entire list of constraints

  returns:
    returns the corresponding constraints of the src text
  """

  # gets the ngrams
  ngrams = get_ngrams(src)

  constraints_src = []
  for ngram in ngrams:

    # if ngram is present then add it to the constraints list
    if ngram in c:
      f = c[ngram][0]
      for gram in f:
        if  gram in constraints_src: continue
        out = tokenizer(gram, return_tensors="pt")
        constraints_src.append(out["input_ids"])

  return constraints_src

# this are some of the example constraints of the sample text in turkish
constraints_tokens("Southeast European Times için Priştine'den Muhamet Brayşori'nin haberi -- 21/03/12", c)

[tensor([[3113,   56,   47, 1517,    0]]),
 tensor([[5827, 1786,  373,    0]]),
 tensor([[5827, 1786,  373,    0]]),
 tensor([[3762,    0]]),
 tensor([[27,  0]]),
 tensor([[3113,   56,   47, 1517,    0]]),
 tensor([[21,  0]]),
 tensor([[ 4388, 10158,   204,     0]]),
 tensor([[1417,    0]]),
 tensor([[6041,   47, 2628,    0]]),
 tensor([[27,  0]]),
 tensor([[3113,   56,   47, 1517,    0]])]

In [15]:
def visualize_data(decoder_input):
  """
  code used for sanity, to visualize the constraints.
  args:
    decoder_input : takes in decoder token ids
  returns:
    returns the tokens"""

  return tokenizer.decode(decoder_input.squeeze(), skip_special_tokens = True)

In [16]:


def get_indices(ds, c):
  """
  Since wmt dataset is not domain speicific, I am trying to only take the sentences, for which constraints are present.

  args:
    ds: represents the hugging face WMT dataset
    c: represents the constraints
  returns:
    indices: which contains the first 1000 indices, that have constraints present.
  """
  count = 0
  indices = []
  count = 100
  i = 0

  while count:
    x = ds["train"][i]["translation"]["tr"]
    if constraints_tokens(x, c):
      count -=1
      indices.append(i)
    i+=1

  return indices

print("The last index in all of the 3000 is,", get_indices(ds,c)[-1])

The last index in all of the 3000 is, 273


In [20]:
class Results(Dataset):

  """
  Results class is the implementation of the Beam search

  Methods:

  generate_translation: Generates the next best token
  get_the_text: returns the translated text at the end of beam search
  get_top_k_prob: given a bunch of translations, it returns the top k beams
  constraints_tokens: returns the constraints of the particular target sentence (i.e in our case turkish sentence)
  beam_search: tries to return the best translation, we have a cap of maxlength 30, and beamsize of 6
  __len__: returns the length of indices considered (in our case 1000, same as the size of validation set)
  __getitem__: for each item called returns the translated sentence at the end of beam search
  """

  def __init__(self, ds, c, indices, device = None):
      """
      Args:
          ds : represents the Hugging face dataset
          c: represents all the constraints generated
          indices: these are the indices which we consider for the test
          device: chooses the device based on the configs
      """
      self.data = ds
      self.c = c
      self.indices = indices
      if device is None:
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

  def generate_translation(self, src_text, decoder_input=[], probabilities=[], get_constrained_token_probability=-1, k=6):

    """
    this method generates the k best translations and their probabilites

    args:
      src_text : target sentence in turkish
      decoder_inputs : holds the token ids in english, up till that timestep
      probabliities : holds the corresponding probabilities of the decoder_inputs
      get_constrained_token_probablity: holds value -1, if there is no constraint, else contains the constraint token id
      k : represents the beam size, which is set to 6

    returns:
      decoder_input_ids: it holds the decoder input ids so far
      probabilities: holds the corresponding probabilities of the decoder inputs
      vis_data: used for sanity purposes, to visualize the generated data
      """

    # holds cuda if available else cpu
    device = self.device

    model.to(device)

    # Tokenize input
    encoder_inputs = tokenizer(src_text, return_tensors="pt").to(device)

    # If decoder_input is empty, include the decoder start token
    if len(decoder_input) == 0:
      # Initial decoder start token has probability 1
      probabilities = torch.tensor([[1.0]]).to(device)
      decoder_input = torch.tensor([[model.config.decoder_start_token_id]]).to(device)
    else:
      # Ensure decoder_input and probabilities are on the correct device
      decoder_input = decoder_input.to(device)
      if probabilities != [] :
        probabilities = probabilities.to(device)

    # Change the model to eval mode and stop the computation of gradients
    model.eval()
    with torch.no_grad():
      # Generate tokens
      outputs = model(
          input_ids=encoder_inputs.input_ids,
          attention_mask=encoder_inputs.attention_mask,
          decoder_input_ids=decoder_input
      )

      # Get the most frequently generated token
      next_token_logits = outputs.logits[:, -1, :]

      # Constraint handling
      if get_constrained_token_probability != -1:
          softmax_ = torch.softmax(next_token_logits, dim=-1)
          return softmax_[0][get_constrained_token_probability]

      # Get the top k tokens with maximum logits value
      top_probs, top_indices = torch.topk(torch.softmax(next_token_logits, dim=-1), k=k)

    # Initializing output containers
    decoder_input_tokens = []
    probs = []
    vis_data = []

    for indx, id in enumerate(top_indices[0]):
        # Concatenate new tokens and probabilities
        new_decoder_input = torch.cat([decoder_input, id.unsqueeze(0).unsqueeze(0)], dim=1)
        new_probs = torch.cat([probabilities, top_probs[0][indx].unsqueeze(0).unsqueeze(0)], dim=1)
        decoder_input_tokens.append(new_decoder_input)
        probs.append(new_probs)

        # Generate visualization data, used for sanity (removed for submission)
        vis_data.append((vis_data, tokenizer.decode(new_decoder_input.squeeze(), skip_special_tokens=True)))

    return decoder_input_tokens, probs, vis_data

  def get_the_text(self, x):
    """
    Given decoder token ids, it converts to tokens

    args:
      x: represents the decoder token ids
    returns:
      the corresponding translated sentences in english
    """

    token_id_list = x

    # moves tensors to cpu
    token_ids_batch = [tensor_item.cpu().squeeze().tolist() for tensor_item in token_id_list]

    decoded_sentences = tokenizer.batch_decode(token_ids_batch, skip_special_tokens=True)

    if not decoded_sentences: return [""]
    else: return decoded_sentences

  def get_top_k_prob(self, A, B, k=6):

    """
    takes in decoder inputs, and corresponding probablities and returns the k best beams

    args:
      A: represents the list of decoder input ids
      B: represents the list of correspondding probabilities
      k : it is the beam size
    returns:
      returns the k best values."""

    # container to hold the cummulative probablility sum of the beams
    d = {}

    # cummulative sum
    for indx, val in enumerate(B):
      cum_sum = torch.prod(val)
      d[cum_sum] = indx

    # sort to the cummulative probability sum
    sorted_keys = sorted(d.keys(), reverse = True)

    # containers to store top k values
    top_k_indices = []
    top_k_sequences = []

    for key in sorted_keys[:k]:
      top_k_indices.append(A[d[key]])
      top_k_sequences.append(B[d[key]])

    return top_k_sequences, top_k_indices

  def constraints_tokens(self, src):
    """
    args:
      src: It is the turkish sentence, to which we want to return the constraints
      c: represents the entire list of constraints

    returns:
      returns the corresponding constraints of the src text
  """
    # gets the ngrams
    ngrams = get_ngrams(src)
    constraints_src = []

    for ngram in ngrams:
      # if ngram is present then add it to the constraints list
      if ngram in self.c:
        f = self.c[ngram][0]
        for gram in f:

          if  gram in constraints_src: continue
          out = tokenizer(gram, return_tensors="pt")
          constraints_src.append(out["input_ids"])

    return constraints_src

  def beam_search(self, maxlen, numC, k, src, constrained_tokens):

    """
    Combines all the methods to return the k best beams

    args:
      maxlen: it is the max length, until where we generate sentences
      numC: represents the number of constraints
      k: represents the number of beams
      src: represents the target sentences, in turkish language
      constrained_tokens: represnts the constraints associated with the corresponding sentence

    returns:
      returns 2 sentences,
      1. sentence containing the constraints associated with the sentence
      2. sentence generated without any constraints
    """

    # device where values are present
    device = self.device

    decoder_start_token = model.config.decoder_start_token_id

    # initialize the grids
    grids = [[[] for _ in range(numC + 1)] for _ in range(maxlen + 1)]
    probs_grid  = [[[] for _ in range(numC + 1)] for _ in range(maxlen + 1)]

    # intialize the first grid to start hyp
    grids[0][0] = [1]

    generated_constraint_index = 0

    # iterate through the timestep
    for t in range(1, maxlen):

        index_c = max(0, (numC - t) - maxlen)
        # iterate through constraints
        for c in range(index_c, min(t, numC) + 1):

            # Prepare batched generation to reduce individual calls, g holds the decoder input tokens, for the current timestep.
            g = []

            # storing decoder inputs
            decoder_inputs = []
            probs = []
            vis_data = []

            # generation of translations for current hypotheses
            for indx, element in enumerate(grids[t-1][c]):

              if type(element) == int:
                decoder_input = []
                prev_probs =[]
              else:
                decoder_input = element.to(device)
                prev_probs = probs_grid[t-1][c][indx].to(device)

              # collecting the batch of translations
              t_g, t_probs, t_vis_data = self.generate_translation(src_text=src, decoder_input=decoder_input, probabilities=prev_probs)

              # adding to the current lists
              g.extend(t_g)
              probs.extend(t_probs)
              vis_data.extend(t_vis_data)

            # retrieve the probability of the constraint and add that to the decoder_input.
            if c > 0 and constrained_tokens:

              for indx, element in enumerate(grids[t-1][c-1]):

                if c == 1 and t == 1:
                  decoder_inputs = torch.tensor([[model.config.decoder_start_token_id]]).to(device)
                  prob = torch.tensor([[1]]).to(device)
                else:
                  decoder_inputs = element.to(device)
                  prob = probs_grid[t-1][c-1][indx].to(device)

                # iterating, because a constraint can be made up of many token ids
                partial_constraints = constrained_tokens[c - 1].tolist()

                # iterating, because a constraint can be made up of many token ids
                for partial_constraint in partial_constraints[0]:
                  if partial_constraint == 0: continue


                  cons = self.generate_translation(src, decoder_input=decoder_input, get_constrained_token_probability=partial_constraint)
                  decoder_inputs = torch.cat([decoder_inputs, torch.tensor(partial_constraint).unsqueeze(0).unsqueeze(0).to(device)], dim=1)

                  # check to ensure the constraints are generated, else a warning is raised
                  if cons is None:  # Check if the constraints list is empty
                      warnings.warn("Generated constraints are empty. Proceeding without constraints.", UserWarning)

                  prob = torch.cat([prob, torch.tensor(cons).unsqueeze(0).unsqueeze(0).to(device)], dim=1)

                  # appending to generated token ids to(g), and probabilites to probs
                  g.append(decoder_inputs)
                  probs.append(prob)

            # storing the k best token ids to the current grid.
            probs_grid[t][c], grids[t][c] = self.get_top_k_prob(g, probs, k)

    # return the best token ids which contains all the constraints, and ones without any constraints.
    return self.get_top_k_prob(grids[maxlen -1][numC], probs_grid[maxlen - 1][numC], k = 1), self.get_top_k_prob(grids[maxlen -1][0], probs_grid[maxlen - 1][0], k = 1)

  def __len__(self):
      """Returns the total number of samples in the dataset."""
      return len(self.indices)

  def __getitem__(self, idx):
      """
      Retrieves a single target sentence, and returns prediction with constraints, prediction without any constraints, and reference translation

      args:
          idx: Index of the sample to retrieve.

      Returns:
          returns target sentence, prediction with constraints, prediction without any constraints, and reference translation
      """

      sample = self.data["train"][self.indices[idx]]["translation"]["tr"]

      # gets the constraints for that sentence
      constraints = self.constraints_tokens(sample)

      # using beam search gets the predicted translations, both with and without constraints
      prediction, prediction_without_constraints = self.beam_search(maxlen= 23, numC=len(constraints), k = 4, src = sample, constrained_tokens = constraints)

      # convert the token ids to tokens
      prediction = self.get_the_text(prediction[1])
      prediction_without_constraints = self.get_the_text(prediction_without_constraints[1])
      result = self.data["train"][self.indices[idx]]["translation"]["en"]

      # if len(prediction) > 1: prediction = prediction[0][:]
      # else: prediction = []
      # if len(prediction_without_constraints) > 0 : prediction_without_constraints = prediction_without_constraints[0][:]

      return sample, prediction[0], prediction_without_constraints[0], result

In [21]:
# creating the instance of the Results class
indices = get_indices(ds,c)
results = Results(ds,c, indices = indices)

# generating sentences for 43th index
results[43]

  prob = torch.cat([prob, torch.tensor(cons).unsqueeze(0).unsqueeze(0).to(device)], dim=1)


("Sırbistan'ın, Kosova'da yaşayan Sırp nüfusun ve yanı sıra dünya genelinde birçok ülkenin bu eyleme karşı çıkması, ciddi sorunlarla karşı karşıya kalacağımızın habercisi.",
 "Serbia's opposition to this action bloc, the Serb population in Koso as well as many",
 "Serbia's opposition to the Serbian population in Kosovo, as well as many countries around the world",
 'The fact that neither Serbia nor the Serbian population in Kosovo as well as a number of countries throughout the world agree with such an act speaks of a period of serious challenges we are facing.')

In [22]:
# Initialize lists to store predictions and actual texts

if __name__ == "__main__":
  predictions = []
  predictions_without_constraints = []
  references = []

  # using a dataloader, to get the batches of 15
  dataloader = DataLoader(results, batch_size=2, shuffle=False)

  # Iterate through dataloader and storing results in containers
  for tr_text, pred_text, pred_cons_text, actual_text in tqdm(dataloader):

    predictions.extend([ref.split() for ref in pred_text])
    predictions_without_constraints.extend([ref.split() for ref in pred_cons_text])
    references.extend([[ref.split()] for ref in actual_text])


  prob = torch.cat([prob, torch.tensor(cons).unsqueeze(0).unsqueeze(0).to(device)], dim=1)
100%|██████████| 50/50 [35:24<00:00, 42.49s/it]


In [23]:

from nltk.translate.bleu_score import sentence_bleu
import nltk

def BLEU_score(references, translations):
  """
  Averages the BLEU score over all the translations

  args:
    references: represents acutal translations
    translations: represents predicted translations
  returns:
    average BLEU score
  """
  bleu_scores = []

  # since my constraints were generated only for ngrams of size 2
  weights = [0.5, 0.5,0,0]

  for ref, trans in zip(references, translations):
    # Calculate BLEU score for this sentence
    score = sentence_bleu(ref, trans, weights = weights)
    bleu_scores.append(score)

  # Calculate and return average BLEU score
  return sum(bleu_scores) / len(bleu_scores)


bleu_score_constraints = BLEU_score(references, predictions)
bleu_score_without_constraints = BLEU_score(references, predictions_without_constraints)
print(f"BLEU Score with constraints: {bleu_score_constraints:.4f}, BLEU score without constraints: {bleu_score_without_constraints:.4f}")

BLEU Score with constraints: 0.2515, BLEU score without constraints: 0.4039


The hypothesis contains 0 counts of 3-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 4-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 2-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()


In [24]:
!pip install evaluate

Collecting evaluate
  Downloading evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)
Downloading evaluate-0.4.3-py3-none-any.whl (84 kB)
[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/84.0 kB[0m [31m?[0m eta [36m-:--:--[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m84.0/84.0 kB[0m [31m7.2 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: evaluate
Successfully installed evaluate-0.4.3


In [26]:
!pip install rouge_score

Collecting rouge_score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... [?25l[?25hdone
Building wheels for collected packages: rouge_score
  Building wheel for rouge_score (setup.py) ... [?25l[?25hdone
  Created wheel for rouge_score: filename=rouge_score-0.1.2-py3-none-any.whl size=24935 sha256=f0dcd58ab342d4fc5111017129888a4ed3fa4d4fd12e0128f2dcf1bf31d492f0
  Stored in directory: /root/.cache/pip/wheels/5f/dd/89/461065a73be61a532ff8599a28e9beef17985c9e9c31e541b4
Successfully built rouge_score
Installing collected packages: rouge_score
Successfully installed rouge_score-0.1.2


In [27]:
import evaluate

def calculate_rouge_with_evaluate(references, translations):
    """
    Calculate average ROUGE-1, ROUGE-2, and ROUGE-L F1 scores using the `evaluate` library.

    args:
      references: represents acutal translations
      translations: represents predicted translations
    returns:
      average BLEU score
  """
    # Load the ROUGE metric
    rouge = evaluate.load("rouge")

    # joining the references and translations into one list
    references_flat = [" ".join(ref[0]) for ref in references]
    translations_flat = [" ".join(trans) for trans in translations]
    results = rouge.compute(predictions=translations_flat, references=references_flat)

    return {
        "rouge_1": results["rouge1"],
        "rouge_2": results["rouge2"],
        "rouge_L": results["rougeL"]
    }

# Calculate ROUGE scores for predictions with and without constraints
rouge_scores_constraints = calculate_rouge_with_evaluate(references, predictions)
rouge_scores_no_constraints = calculate_rouge_with_evaluate(references, predictions_without_constraints)

print(f"ROUGE-1 with constraints: {rouge_scores_constraints['rouge_1']:.4f}")
print(f"ROUGE-2 with constraints: {rouge_scores_constraints['rouge_2']:.4f}")
print(f"ROUGE-L with constraints: {rouge_scores_constraints['rouge_L']:.4f}")

print(f"ROUGE-1 without constraints: {rouge_scores_no_constraints['rouge_1']:.4f}")
print(f"ROUGE-2 without constraints: {rouge_scores_no_constraints['rouge_2']:.4f}")
print(f"ROUGE-L without constraints: {rouge_scores_no_constraints['rouge_L']:.4f}")


ROUGE-1 with constraints: 0.4838
ROUGE-2 with constraints: 0.2797
ROUGE-L with constraints: 0.4195
ROUGE-1 without constraints: 0.6283
ROUGE-2 without constraints: 0.4563
ROUGE-L without constraints: 0.5792
