# Week 36

1. Explore the dataset from https://huggingface.co/datasets/coastalcph/tydi_xor_rc. Familiarize yourself with the dataset card, download the
dataset and explore its columns. Summarize basic data statistics for training and validation data in each of the languages Finnish (fi), Japanese
(ja) and Russian (ru).

2. For each of the languages Finnish, Japanese and Russian, report the 5
most common words in the questions from the training set. What kind of
words are they?

3. Implement a rule-based classifier that predicts whether a question is answerable or impossible, only using the document (context) and question.
You may use machine translation as a component. Use the answerable
field to evaluate it on the validation set. What is the performance of your
classifier for each of the languages Finnish, Japanese and Russian?



####  1. Exploring the dataset

In [None]:
import pandas as pd

splits = {'train': 'train.parquet', 'validation': 'validation.parquet'}
df = pd.read_parquet("hf://datasets/coastalcph/tydi_xor_rc/" + splits["train"])



In [None]:
print(df)  # df.head without parentheses only prints the bound method; call df.head() to see just the first rows

                                                question  \
0      উইকিলিকস কত সালে সর্বপ্রথম ইন্টারনেটে প্রথম তথ...   
1               দ্বিতীয় বিশ্বযুদ্ধে কোন দেশ পরাজিত হয় ?   
2      মার্কিন যুক্তরাষ্ট্রের সংবিধান অনুযায়ী মার্কিন...   
3      আরব-ইসরায়েলি যুদ্ধে আরবের মোট কয়জন সৈন্যের মৃ...   
4              বিশ্বে প্রথম পুঁজিবাদী সমাজ কবে গড়ে ওঠে ?   
...                                                  ...   
15321              కోళ్లు ఎక్కువగా ఏ దేశంలో కనిపిస్తాయి?   
15322       క్షయ వ్యాధికి విరుగుడు ఏ దేశంలో కనుగొన్నారు?   
15323                 ఖురాన్ ఏ అరబ్బీ భాషలో ఎవరు రాసారు?   
15324  టెక్సస్ రాష్ట్రంలోని అతిపెద్ద మానవ నిర్మితం ఏది ?   
15325         తమిళనాడులో రాష్ట్ర మొదటి ముఖ్యమంత్రి ఎవరు?   

                                                 context lang  answerable  \
0      WikiLeaks () is an international non-profit or...   bn        True   
1      The war in Europe concluded with an invasion o...   bn        True   
2      Same-sex ma

In [None]:
df.describe()

       answer_start
count  15326.0
mean   157.062769
std    226.748482
min    -1.0
25%    13.0
50%    78.0
75%    210.0
max    3964.0


In [None]:
# Train and validation sets
train_df = pd.read_parquet("hf://datasets/coastalcph/tydi_xor_rc/" + splits["train"])
val_df = pd.read_parquet("hf://datasets/coastalcph/tydi_xor_rc/" + splits["validation"])

# Explore the dataset columns (training and validation)
print("Training Data Columns:", train_df.columns)
print("Validation Data Columns:", val_df.columns)

Training Data Columns: Index(['question', 'context', 'lang', 'answerable', 'answer_start', 'answer',
       'answer_inlang'],
      dtype='object')
Validation Data Columns: Index(['question', 'context', 'lang', 'answerable', 'answer_start', 'answer',
       'answer_inlang'],
      dtype='object')


In [None]:
finnish_questions = df[df['lang'] == 'fi']
japanese_questions = df[df['lang'] == 'ja']
russian_questions = df[df['lang'] == 'ru']

# Summarize basic statistics
print("Statistics for Finnish Questions:")
print(finnish_questions.describe(include='all'))  # or include only numeric features by default
print("\nStatistics for Japanese Questions:")
print(japanese_questions.describe(include='all'))
print("\nStatistics for Russian Questions:")
print(russian_questions.describe(include='all'))

Statistics for Finnish Questions:
                                                question  \
count                                               2126   
unique                                              2100   
top     Kuinka monta peliä Final Fantasy-sarjaan kuuluu?   
freq                                                   3   
mean                                                 NaN   
std                                                  NaN   
min                                                  NaN   
25%                                                  NaN   
50%                                                  NaN   
75%                                                  NaN   
max                                                  NaN   

                                                  context  lang answerable  \
count                                                2126  2126       2126   
unique                                               2054     1          2   
top     Mir

We can see that the training set contains 2301 Japanese questions (2100 unique), 2126 Finnish questions (2100 unique, as the output above shows) and 1983 Russian questions (1612 unique).
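
The exercise also asks for validation statistics and for more than raw counts. A minimal sketch of a per-language summary follows. The helper name `summarize_split` and the toy rows are illustrative assumptions; in the notebook one could pass `train_df.to_dict("records")` and `val_df.to_dict("records")` instead.

```python
from collections import defaultdict

def summarize_split(rows, languages=("fi", "ja", "ru")):
    """Per-language counts for an iterable of row dicts.

    Each row is assumed to carry the dataset's 'lang', 'question'
    and 'answerable' fields.
    """
    stats = defaultdict(lambda: {"n": 0, "answerable": 0, "questions": set()})
    for row in rows:
        if row["lang"] in languages:
            entry = stats[row["lang"]]
            entry["n"] += 1
            entry["answerable"] += int(bool(row["answerable"]))
            entry["questions"].add(row["question"])
    return {
        lang: {
            "n": entry["n"],
            "unique_questions": len(entry["questions"]),
            "answerable_share": entry["answerable"] / entry["n"],
        }
        for lang, entry in stats.items()
    }

# Toy rows standing in for the real training or validation split
toy_rows = [
    {"lang": "fi", "question": "Kuka on X?", "answerable": True},
    {"lang": "fi", "question": "Kuka on X?", "answerable": False},
    {"lang": "ja", "question": "Xは何？", "answerable": True},
    {"lang": "bn", "question": "ignored", "answerable": True},
]
summary = summarize_split(toy_rows)
for lang, values in summary.items():
    print(lang, values)
```

Reporting the share of answerable questions per language also gives a majority-class baseline against which the classifier in part 3 can be compared.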

#### 2.  5 most common words

In [None]:
!pip install googletrans==4.0.0-rc1



In [None]:
from collections import Counter
from googletrans import Translator
translator = Translator()

In [None]:
# Find 5 most common words in Finnish in df dataset

finnish_questions = df[df['lang'] == 'fi']['question']

# Combine all questions into a single string
all_questions_text = ' '.join(finnish_questions.astype(str))

# Tokenize the text
words = all_questions_text.lower().split()

# Count the frequency of each word
word_counts = Counter(words)

# Get the 5 most common words
top_5_words = word_counts.most_common(5)

# Translate words dynamically
translated_top_5 = []
for word, count in top_5_words:
    translation = translator.translate(word, src='fi', dest='en').text  # Translate word from Finnish to English
    translated_top_5.append((word, count, translation))

# Print the top 5 words with translations
print("Top 5 most common words in Finnish questions with English translations:")
for word, count, translation in translated_top_5:
    print(f"{word} (count: {count}) - English: {translation}")

# print("Top 5 most common words in Finnish questions:", top_5_words)


Top 5 most common words in Finnish questions with English translations:
on (count: 582) - English: there is
mikä (count: 328) - English: What
milloin (count: 287) - English: When
vuonna (count: 227) - English: in
kuka (count: 215) - English: Who


In [None]:
# Find 5 most common words in Russian in df dataset
russian_questions = df[df['lang'] == 'ru']['question']

all_questions_text = ' '.join(russian_questions.astype(str))

words = all_questions_text.lower().split()

word_counts = Counter(words)

top_5_words = word_counts.most_common(5)

# Translate words dynamically
translated_top_5 = []
for word, count in top_5_words:
    translation = translator.translate(word, src='ru', dest='en').text  # Translate word from Russian to English
    translated_top_5.append((word, count, translation))

# Print the top 5 words with translations
print("Top 5 most common words in Russian questions with English translations:")
for word, count, translation in translated_top_5:
    print(f"{word} (count: {count}) - English: {translation}")

#print("Top 5 most common words in Russian questions:", top_5_words)




Top 5 most common words in Russian questions with English translations:
в (count: 978) - English: V
сколько (count: 426) - English: How many
на (count: 385) - English: on
когда (count: 251) - English: When
кто (count: 209) - English: Who


For both Finnish and Russian, the most common words are question words
("what", "when", "who", "how many") or grammatical function words such as prepositions and the copula.
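
Splitting on whitespace alone leaves punctuation attached, so a sentence-final "on?" is counted separately from "on". A small sketch of the same count with punctuation stripped first (the function name `top_words` and the toy questions are illustrative assumptions):

```python
import re
from collections import Counter

def top_words(questions, n=5):
    """Count whitespace-separated words after replacing punctuation with spaces."""
    counts = Counter()
    for question in questions:
        # \w matches Unicode letters and digits, so "ä" and Cyrillic survive
        cleaned = re.sub(r"[^\w\s]", " ", question.lower())
        counts.update(cleaned.split())
    return counts.most_common(n)

toy_questions = [
    "Mikä on Suomen pääkaupunki?",
    "Kuka on presidentti?",
    "Milloin se on?",
]
top = top_words(toy_questions, n=1)
print(top)
```

On real data this may shift the counts slightly compared with the plain `split()` version, since words followed by a question mark are merged with their bare form.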

Because Japanese words are not separated by spaces, we have to use a tokenizer:

In [None]:
# Load tokenizer directly
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer_ja = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-ja-en")
#model_ja = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-ja-en")

tokenizer_fi = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-fi-en")
#model_fi = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-fi-en")

tokenizer_ru = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-ru-en")
#model_ru = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-ru-en")



In [None]:
def tokenize_japanese_sentences(sentence):
    tokens = tokenizer_ja.tokenize(sentence)
    return " ".join(tokens)

japanese_questions = df[df['lang'] == 'ja']['question']

# Apply tokenization to the 'japanese_questions' series
tokenized_japanese_questions = japanese_questions.apply(tokenize_japanese_sentences)

# Print the original and tokenized questions
for original, tokenized in zip(japanese_questions.head(5), tokenized_japanese_questions.head(5)):
    print(f"Original: {original}")
    print(f"Tokenized: {tokenized}")
    print("-" * 50)

Original: ポーランドで農地改革が行われたことがある？
Tokenized: ▁ポーランド で 農 地 改革 が 行わ れた こと がある ?
--------------------------------------------------
Original: ビスカヤ県で初めて進出した大規模鉱業会社は何？
Tokenized: ▁ビ スカ ヤ 県 で 初めて 進 出 した 大 規模 鉱 業 会社 は 何 ?
--------------------------------------------------
Original: 古代ローマ帝国はいつ起きた？
Tokenized: ▁古代 ローマ 帝国 は いつ 起き た ?
--------------------------------------------------
Original: スペイン・バスク州の州都はどこですか？
Tokenized: ▁スペイン ・ バス ク 州 の 州 都 は どこ です か ?
--------------------------------------------------
Original: イタリア王国海軍は第一次世界大戦中に何隻戦艦をつくった？
Tokenized: ▁イタリア 王国 海軍 は 第一 次 世界 大戦 中 に 何 隻 戦艦 を つく った ?
--------------------------------------------------


Now that the Japanese questions are properly tokenized, we can proceed to find the most common words:

In [None]:
import re

def clean_text(text):
    # Remove punctuation using regex
    text = re.sub(r'[^\w\s]', '', text)  # Removes non-word characters (punctuation)
    return text.lower().split()

# Convert tokenized questions into a single string
all_questions_text = ' '.join(tokenized_japanese_questions.astype(str))

# Clean the text by removing punctuation
cleaned_words = clean_text(all_questions_text)

# Count word frequencies
word_counts = Counter(cleaned_words)

# Get the top 5 most common words
top_5_words = word_counts.most_common(5)

# Translate words dynamically
translated_top_5 = []
for word, count in top_5_words:
    translation = translator.translate(word, src='ja', dest='en').text  # Translate word from Japanese to English
    translated_top_5.append((word, count, translation))

# Print the top 5 words with translations
print("Top 5 most common words in Japanese questions with English translations:")
for word, count, translation in translated_top_5:
    print(f"{word} (count: {count}) - English: {translation}")

Top 5 most common words in Japanese questions with English translations:
は (count: 2224) - English: teeth
の (count: 1632) - English: of
何 (count: 542) - English: what
した (count: 469) - English: did
いつ (count: 459) - English: when


As we can see, the filtered Japanese results are similar to the Finnish and Russian results: question words and grammatical function words dominate.

Note the "は" token. Although the translator renders it as "teeth" (歯, "tooth", is written は in hiragana), here it is the grammatical particle は, which marks the topic of the sentence. It appears in most questions, which explains its high count.

Now all together:

In [None]:
def clean_text(text):
    # Remove punctuation using regex
    text = re.sub(r'[^\w\s]', '', text)  # Removes non-word characters (punctuation)
    return text.lower().split()

def find_most_common_words(tokenizer, language, num_words=5):
    # Filter questions for the specified language
    questions_ = df[df['lang'] == language]['question']

    # Tokenize questions
    tokenized_questions = questions_.apply(lambda x: " ".join(tokenizer.tokenize(x)))

    # Convert tokenized questions into a single string
    all_questions_text = ' '.join(tokenized_questions.astype(str))

    # Clean the text by removing punctuation
    cleaned_words = clean_text(all_questions_text)

    # Count word frequencies
    word_counts = Counter(cleaned_words)

    # Get the top n most common words
    top_n_words = word_counts.most_common(num_words)

    # Translate words dynamically
    translated_top_n = []
    for word, count in top_n_words:
        translation = translator.translate(word, src=language, dest='en').text  # Translate word to English
        translated_top_n.append((word, count, translation))

    # Print the top n words with translations
    for word, count, translation in translated_top_n:
        print(f"{word} (count: {count}) - English: {translation}")

print("Top 5 most common words in Finnish questions:")
find_most_common_words(tokenizer_fi, 'fi')

print("\nTop 5 most common words in Japanese questions:")
find_most_common_words(tokenizer_ja, 'ja')

print("\nTop 5 most common words in Russian questions:")
find_most_common_words(tokenizer_ru, 'ru')

Top 5 most common words in Finnish questions:
on (count: 678) - English: there is
n (count: 420) - English: of
mikä (count: 328) - English: What
milloin (count: 287) - English: When
a (count: 242) - English: a

Top 5 most common words in Japanese questions:
は (count: 2224) - English: teeth
の (count: 1632) - English: of
何 (count: 542) - English: what
した (count: 469) - English: did
いつ (count: 459) - English: when

Top 5 most common words in Russian questions:
в (count: 1172) - English: V
на (count: 493) - English: on
а (count: 451) - English: A
сколько (count: 426) - English: How many
е (count: 317) - English: e


####  3. Rule-based Classifier

Classifier that predicts whether a question is answerable or impossible, only using the document (context) and question. You may use machine translation as a component.

In [None]:
!pip install tqdm



In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from sklearn.metrics import accuracy_score, confusion_matrix
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from tqdm import tqdm  # Import tqdm for progress bars

nltk.download('stopwords')
nltk.download('wordnet')
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

def clean_tokens(tokens):
    # Remove punctuation tokens and stopwords, and change words to "basic" form (lemmatize)
    cleaned_tokens = [lemmatizer.lemmatize(token) for token in tokens if token.isalpha() and token not in stop_words]
    return cleaned_tokens

def predict_answerability(context, question, lang):
    translated_question = translator.translate(question, src=lang, dest='en').text

    # Tokenize both the context and the translated question
    tokenized_context = tokenizer.tokenize(context.lower())
    tokenized_question = tokenizer.tokenize(translated_question.lower())

    # Clean tokens of common words and punctuation
    cleaned_context = clean_tokens(tokenized_context)
    cleaned_question = clean_tokens(tokenized_question)

    # Check for overlap
    for token in cleaned_question:
        if token in cleaned_context:
            return 'True', translated_question

    # If there is no overlap, we determine the question unanswerable
    return 'False', translated_question


# Evaluate the classifier on the validation set
def evalute_classifier(df_, num_samples=100, seed=420):
  predictions = []
  true_labels = []

  sampled_rows = df_.sample(n=num_samples, random_state=seed)

  for index, row in tqdm(sampled_rows.iterrows(), total=num_samples):
      context = row['context']  # Always in English
      question = row['question']  # Could be in any language
      lang = row['lang']

      prediction, translated_question = predict_answerability(context, question, lang)

      predictions.append(1 if prediction == 'True' else 0)
      true_labels.append(1 if str(row['answerable']) == 'True' else 0)

      """print(f"Context: {context}")
      print(f"Question: {question}")
      print(f"Translated Question: {translated_question}")
      print(f"Prediction: {prediction}")  # This will be 'answerable' or 'impossible'
      print(f"True Label: {row['answerable']}")
      print("-" * 50)"""

  # Calculate accuracy
  accuracy = accuracy_score(true_labels, predictions)
  print(f"Accuracy: {accuracy * 100:.2f}%")

  # Calculate false-positive and false-negative rates.
  # With labels=[0, 1], cm.ravel() yields (tn, fp, fn, tp) in that order;
  # with labels=[1, 0] the order would be (tp, fn, fp, tn) and the two rates would be swapped.
  cm = confusion_matrix(true_labels, predictions, labels=[0, 1])
  tn, fp, fn, tp = cm.ravel()
  false_positive_percentage = (fp / (fp + tn)) * 100 if (fp + tn) > 0 else 0
  false_negative_percentage = (fn / (fn + tp)) * 100 if (fn + tp) > 0 else 0

  print(f"False Positive Percentage: {false_positive_percentage:.2f}%")
  print(f"False Negative Percentage: {false_negative_percentage:.2f}%")


print("Classifier evaluation for Finnish:")
evalute_classifier(val_df[val_df['lang'] == 'fi'])
print("-" * 50)
print("Classifier evaluation for Japanese:")
evalute_classifier(val_df[val_df['lang'] == 'ja'])
print("-" * 50)
print("Classifier evaluation for Russian:")
evalute_classifier(val_df[val_df['lang'] == 'ru'])

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Classifier evaluation for Finnish:


100%|██████████| 100/100 [01:43<00:00,  1.04s/it]


Accuracy: 68.00%
False Positive Percentage: 93.94%
False Negative Percentage: 1.49%
--------------------------------------------------
Classifier evaluation for Japanese:


 12%|█▏        | 12/100 [00:13<01:38,  1.11s/it]


KeyboardInterrupt: 
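
A rule that returns answerable as soon as a single token is shared tends to fire for almost every question-context pair. One possible refinement, sketched here under assumptions, is to require that a share of the distinct question tokens appear in the context. The names `overlap_ratio` and `predict_answerable` and the threshold `0.5` are illustrative and untuned; the inputs are assumed to be cleaned English tokens such as those produced by `clean_tokens`.

```python
def overlap_ratio(question_tokens, context_tokens):
    """Share of distinct question tokens that also occur in the context."""
    question_set = set(question_tokens)
    if not question_set:
        return 0.0
    return len(question_set & set(context_tokens)) / len(question_set)

def predict_answerable(question_tokens, context_tokens, threshold=0.5):
    # threshold is a hypothetical value; it would need tuning on held-out data
    return overlap_ratio(question_tokens, context_tokens) >= threshold

context = ["helsinki", "capital", "finland", "founded", "1550"]
print(predict_answerable(["capital", "finland"], context))
print(predict_answerable(["president", "finland", "born"], context))
```

Because the rule returns a boolean directly, it avoids the string comparison against `'True'` when collecting predictions.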

# Week 37

Let k be the number of members in your group (k ∈ {1, **2**, 3}). Implement
k different * language models for the questions in the three languages Finnish,
Japanese and Russian, as well as for the document contexts in English (total
**2 × 4** language models), using the training data. Evaluate each of them on the
validation data, report their performance and discuss the results. Reminder: a
language model is a function that takes text as input and returns its probability.

\* Different approach (n-gram/neural) or different n, different smoothing etc.
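
For reference, the quantities that the n-gram models in this section estimate can be written as follows. This is a summary under the usual trigram Markov assumption; the symbols $k$ (smoothing coefficient), $V$ (vocabulary size) and $c(\cdot)$ (training count) are notation chosen here.

$$
P(w_1, \dots, w_T) \approx P(w_1)\, P(w_2 \mid w_1) \prod_{t=3}^{T} P(w_t \mid w_{t-2}, w_{t-1})
$$

$$
P_{\text{add-}k}(w_t \mid w_{t-2}, w_{t-1}) = \frac{c(w_{t-2}, w_{t-1}, w_t) + k}{c(w_{t-2}, w_{t-1}) + kV}
$$

$$
\mathrm{PP}(w_1, \dots, w_T) = P(w_1, \dots, w_T)^{-1/T}
$$

Lower perplexity means the model assigns higher probability to the validation text. For a bigram model the product runs from $t = 2$ with $P(w_t \mid w_{t-1})$ instead.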


In [None]:
import pandas as pd

splits = {'train': 'train.parquet', 'validation': 'validation.parquet'}
df = pd.read_parquet("hf://datasets/coastalcph/tydi_xor_rc/" + splits["train"])

train_df = pd.read_parquet("hf://datasets/coastalcph/tydi_xor_rc/" + splits["train"])
val_df = pd.read_parquet("hf://datasets/coastalcph/tydi_xor_rc/" + splits["validation"])

#### 1. Trigram model (Sofia)

In [None]:
from collections import defaultdict, Counter
import math

class TrigramModel:
    def __init__(self, tokenizer, smoothing_coef):
        # Initialize the trigram, bigram, and unigram counts
        self.trigram_counts = defaultdict(Counter)
        self.bigram_counts = defaultdict(Counter)
        self.unigram_counts = Counter()
        self.known_words = set()
        self.tokenizer = tokenizer
        self.smoothing_coef = smoothing_coef

    def train(self, sentences):
        """
        Train the trigram model on raw sentences.
        :param sentences: Iterable of sentence strings; each is lowercased and tokenized with self.tokenizer.
        """
        tokenized_sentences = [self.tokenizer.tokenize(sentence.lower()) for sentence in sentences]

        for sentence in tokenized_sentences:
            # Replace first occurrences with 'OOV' and update known_words set
            for i in range(len(sentence)):
                word = sentence[i]
                if word not in self.known_words:
                    sentence[i] = 'OOV'
                    self.known_words.add(word)

            # Count unigrams, bigrams, and trigrams
            for i in range(len(sentence) - 2):
                self.unigram_counts[sentence[i]] += 1
                self.bigram_counts[sentence[i]][sentence[i + 1]] += 1
                self.trigram_counts[(sentence[i], sentence[i + 1])][sentence[i + 2]] += 1

            # Count remaining bigram and unigrams (empty sentences are skipped to avoid an IndexError)
            if len(sentence) > 1:
                self.unigram_counts[sentence[-2]] += 1
                self.bigram_counts[sentence[-2]][sentence[-1]] += 1
            if sentence:
                self.unigram_counts[sentence[-1]] += 1  # Count the last word as a unigram

    def unigram_probability(self, word):
        """
        Calculate the unigram probability P(word).
        :param word: The word for which to calculate the probability.
        :return: The probability of the word.
        """
        total_count = sum(self.unigram_counts.values())
        if word in self.unigram_counts:
            return self.unigram_counts[word] / total_count
        else:
            return self.unigram_counts['OOV'] / total_count if 'OOV' in self.unigram_counts else 0

    def bigram_probability(self, w1, w2):
        """
        Calculate the bigram probability P(w2 | w1).
        :param w1: Previous word.
        :param w2: Current word.
        :return: Probability of w2 given w1.
        """
        vocab_size = len(self.known_words)
        if self.unigram_counts[w1] > 0:
            return (self.bigram_counts[w1][w2] + self.smoothing_coef) / (self.unigram_counts[w1] + self.smoothing_coef * vocab_size)
        else:
            return (self.bigram_counts['OOV'][w2] + self.smoothing_coef) / (self.unigram_counts['OOV'] + self.smoothing_coef * vocab_size)

    def trigram_probability(self, w1, w2, w3):
        """
        Calculate the trigram probability P(w3 | w1, w2).
        :param w1: First word.
        :param w2: Second word.
        :param w3: Third word.
        :return: Probability of w3 given w1 and w2.
        """
        vocab_size = len(self.known_words)

        # Get the bigram count for (w1, w2)
        bigram_count = self.bigram_counts[w1][w2]

        if bigram_count > 0:
            return (self.trigram_counts[(w1, w2)][w3] + self.smoothing_coef) / (bigram_count + self.smoothing_coef * vocab_size)
        else:
            # Handle the case where (w1, w2) is unknown, fallback to a smoothed probability for OOV
            return (self.trigram_counts[('OOV', 'OOV')][w3] + self.smoothing_coef) / (self.bigram_counts['OOV']['OOV'] + self.smoothing_coef * vocab_size)

    def replace_unknowns(self, sentence):
        """
        Replace unknown words in the sentence with 'OOV'.
        :param sentence: A list of words (tokens).
        :return: List of words with unknown words replaced by 'OOV'.
        """
        return ['OOV' if word not in self.known_words else word for word in sentence]

    def sentence_perplexity(self, sentence):
        """
        Calculate the perplexity of a sentence.
        :param sentence: A sentence string.
        :return: The perplexity of the sentence.
        """
        tokens = self.tokenizer.tokenize(sentence.lower())
        tokens = self.replace_unknowns(tokens)

        T = len(tokens)

        if T == 0:
            return float('inf')

        # Initial unigram and bigram probabilities for first two tokens
        prob = math.pow(self.unigram_probability(tokens[0]), 1/T)
        if T > 1:
            prob *= math.pow(self.bigram_probability(tokens[0], tokens[1]), 1/T)

        # Trigram probabilities for the rest of the tokens
        for i in range(2, T):
            prob *= math.pow(self.trigram_probability(tokens[i - 2], tokens[i - 1], tokens[i]), 1/T)

        return 1 / prob if prob > 0 else float('inf')


In [None]:
!pip install transformers

In [None]:
from transformers import AutoTokenizer
# Load the XLM-Roberta tokenizer
tokenizer_multilingual = AutoTokenizer.from_pretrained("xlm-roberta-base")

In [None]:
def evaluate_model(model, validation_sentences):
    """
    Evaluate the model by calculating the perplexity on validation sentences.
    :param model: The TrigramModel instance to evaluate.
    :param validation_sentences: List of validation sentences.
    :return: Average perplexity of the model on the validation set.
    """
    total_perplexity = 0
    count = 0

    for sentence in validation_sentences:
        perplexity = model.sentence_perplexity(sentence)
        total_perplexity += perplexity
        count += 1

    return total_perplexity / count if count > 0 else float('inf')

smoothing_coef = 0.1

Finnish_model = TrigramModel(tokenizer_multilingual, smoothing_coef)
Finnish_model.train(train_df[train_df['lang'] == 'fi']['question'])

Japanese_model = TrigramModel(tokenizer_multilingual, smoothing_coef)
Japanese_model.train(train_df[train_df['lang'] == 'ja']['question'])

Russian_model = TrigramModel(tokenizer_multilingual, smoothing_coef)
Russian_model.train(train_df[train_df['lang'] == 'ru']['question'])

Context_model = TrigramModel(tokenizer_multilingual, smoothing_coef)
Context_model.train(train_df['context'])

validation_fi = val_df[val_df['lang'] == 'fi']['question']
validation_ja = val_df[val_df['lang'] == 'ja']['question']
validation_ru = val_df[val_df['lang'] == 'ru']['question']
validation_context = val_df['context']


# Evaluate models
finnish_perplexity = evaluate_model(Finnish_model, validation_fi)
print(f"Finnish Model Perplexity: {finnish_perplexity:.2f}")

japanese_perplexity = evaluate_model(Japanese_model, validation_ja)
print(f"Japanese Model Perplexity: {japanese_perplexity:.2f}")

russian_perplexity = evaluate_model(Russian_model, validation_ru)
print(f"Russian Model Perplexity: {russian_perplexity:.2f}")

context_perplexity = evaluate_model(Context_model, validation_context)
print(f"Context Model Perplexity: {context_perplexity:.2f}")
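
`evaluate_model` averages per-sentence perplexities, so a five-token question weighs as much as a long context paragraph. An alternative that is often reported is corpus-level perplexity, computed over all tokens at once. A minimal sketch, assuming the per-token probabilities have already been collected per sentence (the function name `corpus_perplexity` is hypothetical):

```python
import math

def corpus_perplexity(sentence_token_probs):
    """Perplexity over all tokens of all sentences.

    sentence_token_probs: list of lists, one probability per token.
    """
    total_log_prob = 0.0
    total_tokens = 0
    for token_probs in sentence_token_probs:
        for p in token_probs:
            if p <= 0:
                return float("inf")
            total_log_prob += math.log(p)
        total_tokens += len(token_probs)
    if total_tokens == 0:
        return float("inf")
    # exp of the negative mean log-probability per token
    return math.exp(-total_log_prob / total_tokens)

probs = [[0.5, 0.5], [0.25, 0.25, 0.25, 0.25]]
print(corpus_perplexity(probs))
```

In this toy input the two sentences have perplexities 2 and 4, so their plain average is 3, while the token-weighted value is higher because the longer sentence contributes more tokens.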


#### 2. Bigram model (Tom)

The dataset for each language is rather small. To avoid overfitting, I chose to implement a simple model: the bigram model.

In [None]:
from collections import defaultdict, Counter
import math

class BigramModel:
    def __init__(self, tokenizer, smoothing_coef):
        # Initialize the bigram and unigram counts
        self.bigram_counts = defaultdict(Counter)
        self.unigram_counts = Counter()
        self.known_words = set()
        self.tokenizer = tokenizer
        self.smoothing_coef = smoothing_coef

    def train(self, sentences):
        """
        Train the bigram model on raw sentences.
        :param sentences: Iterable of sentence strings; each is lowercased and tokenized with self.tokenizer.
        """
        tokenized_sentences = [self.tokenizer.tokenize(sentence.lower()) for sentence in sentences]

        for sentence in tokenized_sentences:
            # Replace first occurrences with 'OOV' and update known_words set
            for i in range(len(sentence)):
                word = sentence[i]
                if word not in self.known_words:
                    sentence[i] = 'OOV'
                    self.known_words.add(word)

            # Count bigrams and unigrams
            for i in range(len(sentence) - 1):
                self.unigram_counts[sentence[i]] += 1
                self.bigram_counts[sentence[i]][sentence[i + 1]] += 1
            if sentence:  # Skip empty sentences to avoid an IndexError
                self.unigram_counts[sentence[-1]] += 1  # Count the last word as a unigram

    def unigram_probability(self, word):
        """
        Calculate the unigram probability P(word).
        :param word: The word for which to calculate the probability.
        :return: The probability of the word.
        """
        total_count = sum(self.unigram_counts.values())
        if word in self.unigram_counts:
            return self.unigram_counts[word] / total_count
        else:
            return self.unigram_counts['OOV'] / total_count if 'OOV' in self.unigram_counts else 0

    def bigram_probability(self, w1, w2):
        """
        Calculate the bigram probability P(w2 | w1).
        :param w1: Previous word.
        :param w2: Current word.
        :return: Probability of w2 given w1.
        """
        vocab_size = len(self.known_words)  # Unique words
        if self.unigram_counts[w1] > 0:
            return (self.bigram_counts[w1][w2] + self.smoothing_coef) / (self.unigram_counts[w1] + self.smoothing_coef * vocab_size)
        else:
            return (self.bigram_counts['OOV'][w2] + self.smoothing_coef) / (self.unigram_counts['OOV'] + self.smoothing_coef * vocab_size)

    def replace_unknowns(self, sentence):
        """
        Replace unknown words in the sentence with 'OOV'.
        :param sentence: A list of words (tokens).
        :return: List of words with unknown words replaced by 'OOV'.
        """
        return ['OOV' if word not in self.known_words else word for word in sentence]

    def sentence_preplexity(self, sentence):
        """
        Calculate the perplexity of a sentence.
        :param sentence: A sentence string.
        :return: The perplexity of the sentence.
        """
        tokens = self.tokenizer.tokenize(sentence.lower())
        tokens = self.replace_unknowns(tokens)

        T = len(tokens)

        if T == 0:
            return float('inf')

        prob = math.pow(self.unigram_probability(tokens[0]), 1/T)
        for i in range(T - 1):
            prob *= math.pow(self.bigram_probability(tokens[i], tokens[i + 1]), 1/T)

        # Perplexity is the inverse of the geometric-mean token probability
        return 1 / prob if prob > 0 else float('inf')

Now that we have a working model template, we can apply it to the Russian / Japanese / Finnish questions and the English context.

In [None]:
!pip install transformers

In [None]:
from transformers import AutoTokenizer
# Load the XLM-Roberta tokenizer
tokenizer_multilingual = AutoTokenizer.from_pretrained("xlm-roberta-base")

In [None]:
def evaluate_model(model, validation_sentences):
    """
    Evaluate the model by calculating the perplexity on validation sentences.
    :param model: The BigramModel instance to evaluate.
    :param validation_sentences: List of validation sentences.
    :return: Average perplexity of the model on the validation set.
    """
    total_perplexity = 0
    count = 0

    for sentence in validation_sentences:
        perplexity = model.sentence_preplexity(sentence)
        total_perplexity += perplexity
        count += 1

    return total_perplexity / count if count > 0 else float('inf')

smoothing_coef = 0.1  # Arbitrary choice; it cannot be tuned here because the validation set serves as the test set.

Finnish_model = BigramModel(tokenizer_multilingual, smoothing_coef)
Finnish_model.train(train_df[train_df['lang'] == 'fi']['question'])

Japanese_model = BigramModel(tokenizer_multilingual, smoothing_coef)
Japanese_model.train(train_df[train_df['lang'] == 'ja']['question'])

Russian_model = BigramModel(tokenizer_multilingual, smoothing_coef)
Russian_model.train(train_df[train_df['lang'] == 'ru']['question'])

Context_model = BigramModel(tokenizer_multilingual, smoothing_coef)
Context_model.train(train_df['context'])

validation_fi = val_df[val_df['lang'] == 'fi']['question']
validation_ja = val_df[val_df['lang'] == 'ja']['question']
validation_ru = val_df[val_df['lang'] == 'ru']['question']
validation_context = val_df['context']


# Evaluate models
finnish_perplexity = evaluate_model(Finnish_model, validation_fi)
print(f"Finnish Model Perplexity: {finnish_perplexity:.2f}")

japanese_perplexity = evaluate_model(Japanese_model, validation_ja)
print(f"Japanese Model Perplexity: {japanese_perplexity:.2f}")

russian_perplexity = evaluate_model(Russian_model, validation_ru)
print(f"Russian Model Perplexity: {russian_perplexity:.2f}")

context_perplexity = evaluate_model(Context_model, validation_context)
print(f"Context Model Perplexity: {context_perplexity:.2f}")


To check: why is the perplexity of the trigram model higher than that of the bigram model? Shouldn't it be the opposite? (Sofia)
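
One plausible explanation, offered as a hypothesis rather than a verified diagnosis: add-k smoothing adds k·V to the denominator of every context, and with only a few thousand questions most trigram contexts are seen once or twice, so the smoothed trigram estimates end up close to uniform over the vocabulary. Bigram contexts are seen more often and suffer less. A common remedy is to interpolate trigram, bigram and unigram estimates. The sketch below assumes pre-tokenized sentences; the function names and the weights `(0.5, 0.3, 0.2)` are illustrative and would need tuning on held-out data.

```python
from collections import Counter

def train_counts(sentences):
    """Count unigrams, bigrams and trigrams in tokenized sentences."""
    uni, bi, tri = Counter(), Counter(), Counter()
    for tokens in sentences:
        uni.update(tokens)
        bi.update(zip(tokens, tokens[1:]))
        tri.update(zip(tokens, tokens[1:], tokens[2:]))
    return uni, bi, tri

def interpolated_prob(w1, w2, w3, uni, bi, tri, lambdas=(0.5, 0.3, 0.2)):
    """P(w3 | w1, w2) as a weighted mix of trigram, bigram and unigram estimates."""
    l3, l2, l1 = lambdas  # illustrative weights, assumed to sum to 1
    total = sum(uni.values())
    p_uni = uni[w3] / total if total else 0.0
    # uni[w2] is used as the context count; it slightly overcounts sentence-final words
    p_bi = bi[(w2, w3)] / uni[w2] if uni[w2] else 0.0
    p_tri = tri[(w1, w2, w3)] / bi[(w1, w2)] if bi[(w1, w2)] else 0.0
    return l3 * p_tri + l2 * p_bi + l1 * p_uni

toy = [["kuka", "on", "x"], ["mikä", "on", "y"], ["kuka", "on", "y"]]
uni, bi, tri = train_counts(toy)
seen = interpolated_prob("kuka", "on", "x", uni, bi, tri)
unseen_context = interpolated_prob("milloin", "on", "x", uni, bi, tri)
print(seen, unseen_context)
```

When the trigram context was never observed, the estimate falls back on the bigram and unigram terms instead of a near-uniform smoothed value, which is the behaviour add-k smoothing alone does not provide.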

# Week 38


Let k be the number of members in your group. For each of the three languages
Finnish, Japanese and Russian separately, using the training data, train k different classifiers that receive the document (context) and question as input
and predict whether the question is answerable or impossible given the context.

Evaluate the classifiers on the respective validation sets, report and analyse the performance for each language and compare the scores across languages.
The classifiers can use machine translation, linguistic/lexical features (e.g.,
bag-of-words, n-gram counts, word overlap), word embeddings, or word/sentence
representations from (multilingual) neural language models. You can also train
or fine-tune your own neural language models on the dataset. Different from
1(c), however, they must be learned rather than rule-based. Motivate your
choice of features and classifier.

#### 1. Neural Network with mean embeddings (Tom)

In [None]:
import pandas as pd

splits = {'train': 'train.parquet', 'validation': 'validation.parquet'}
df = pd.read_parquet("hf://datasets/coastalcph/tydi_xor_rc/" + splits["train"])
# Train and validation sets
train_df = pd.read_parquet("hf://datasets/coastalcph/tydi_xor_rc/" + splits["train"])
val_df = pd.read_parquet("hf://datasets/coastalcph/tydi_xor_rc/" + splits["validation"])

In [None]:
!pip install transformers torch
!pip install tqdm

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from transformers import DistilBertTokenizer, DistilBertModel
from tqdm import tqdm
import tensorflow as tf
from tensorflow.keras import layers, models
from sklearn.metrics import confusion_matrix
import random

In [None]:
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-multilingual-cased')
distilbert_model = DistilBertModel.from_pretrained('distilbert-base-multilingual-cased')

def get_mean_pooled_embedding(sentences):
    inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        outputs = distilbert_model(**inputs)
    last_hidden_states = outputs.last_hidden_state

    # Mean pooling over the token embeddings
    mean_pooled = last_hidden_states.mean(dim=1)
    return mean_pooled
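
`get_mean_pooled_embedding` averages over every position of `last_hidden_state`. That is harmless when a single string is passed, but if a list of sentences is passed with `padding=True`, the padding positions would also enter the mean. The framework-free sketch below shows the arithmetic of mask-aware mean pooling; `masked_mean` is a hypothetical helper and the vectors are toy values, not DistilBERT outputs.

```python
def masked_mean(token_vectors, attention_mask):
    """Average the token vectors whose mask value is 1, ignoring padding."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep == 1]
    if not kept:
        raise ValueError("attention mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


# Toy sequence: two real tokens followed by one padding position
vectors = [[1.0, 3.0], [3.0, 5.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = masked_mean(vectors, mask)
naive = [sum(vec[d] for vec in vectors) / len(vectors) for d in range(2)]
print("masked mean:", pooled)
print("naive mean: ", naive)
```

In PyTorch the same idea is usually expressed by multiplying the hidden states by the expanded attention mask and dividing by the mask sum.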

In [None]:
import os
import numpy as np
import pandas as pd  # Make sure to import pandas
from google.colab import drive
from tqdm import tqdm  # Import tqdm for progress bar

def ensure_directory_exists(directory):
    if not os.path.exists(directory):
        os.makedirs(directory)

def pre_process(data, use_embeddings=False, save_embeddings=False, filename=None):
    # Ensure the directory exists if either use_embeddings or save_embeddings is True
    if use_embeddings or save_embeddings:
        # Mount Google Drive
        drive.mount('/content/drive')
        ensure_directory_exists(os.path.dirname(filename))

    # Check if embeddings file already exists and if we want to use it
    if use_embeddings and os.path.exists(filename):
        print("Loading existing embeddings...")
        return load_embeddings(filename)

    # If file doesn't exist or we don't want to use it, process data
    context_embeddings = []
    question_embeddings = []

    for index, row in tqdm(data.iterrows(), total=len(data), desc="Processing data"):
        context = row['context']
        question = row['question']
        context_embedding = get_mean_pooled_embedding(context)
        question_embedding = get_mean_pooled_embedding(question)
        context_embeddings.append(context_embedding)
        question_embeddings.append(question_embedding)

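    # Each embedding has shape (1, hidden_size). Concatenate in PyTorch first, then
    # hand NumPy arrays to TensorFlow instead of passing torch tensors to tf ops.
    context_tensor = tf.convert_to_tensor(torch.cat(context_embeddings, dim=0).numpy())
    question_tensor = tf.convert_to_tensor(torch.cat(question_embeddings, dim=0).numpy())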
    label_tensor = tf.convert_to_tensor(data['answerable'].astype(int), dtype=tf.int32)

    # Save embeddings to Google Drive if requested
    if save_embeddings and filename:
        np.savez(filename, context=context_tensor.numpy(), question=question_tensor.numpy(), labels=label_tensor.numpy())
        print(f"Embeddings saved to {filename}")

    return context_tensor, question_tensor, label_tensor

def load_embeddings(filename):
    data = np.load(filename)
    context_tensor = tf.convert_to_tensor(data['context'])
    question_tensor = tf.convert_to_tensor(data['question'])
    label_tensor = tf.convert_to_tensor(data['labels'], dtype=tf.int32)
    return context_tensor, question_tensor, label_tensor

In [None]:
def set_seed(seed_value=42):
    # Setting the seed for Python's random module
    random.seed(seed_value)

    # Setting the seed for NumPy
    np.random.seed(seed_value)

    # Setting the seed for TensorFlow
    tf.random.set_seed(seed_value)

    # Ensuring TensorFlow works deterministically (optional, but useful for reproducibility)
    os.environ['PYTHONHASHSEED'] = str(seed_value)
    os.environ['TF_DETERMINISTIC_OPS'] = '1'

def create_and_train_model(training_data, use_embeddings=False, save_embeddings=False, filename=None):
    # Process Data
    context_tensor, question_tensor, label_tensor = pre_process(training_data, use_embeddings, save_embeddings, filename)

    # Create model
    input_1 = layers.Input(shape=(context_tensor.shape[1],))
    input_2 = layers.Input(shape=(question_tensor.shape[1],))

    # Concatenate the two input vectors
    concatenated = layers.Concatenate()([input_1, input_2])

    # Hidden layer
    hidden_layer = layers.Dense(128, activation='relu')(concatenated)

    # Output layer (binary classification, use a single unit with sigmoid activation)
    output = layers.Dense(1, activation='sigmoid')(hidden_layer)

    # Define the model
    model = models.Model(inputs=[input_1, input_2], outputs=output)

    # Compile the model (using binary crossentropy loss for binary classification)
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

    # Train the model
    model.fit([context_tensor, question_tensor], label_tensor, epochs=10, batch_size=32)
    return model

def evaluate_model(model, validation_data, use_embeddings=False, save_embeddings=False, filename=None):
    # Process Data
    context_tensor, question_tensor, label_tensor = pre_process(validation_data, use_embeddings, save_embeddings, filename)

    # Evaluate the model
    threshold = 0.5
    loss, accuracy = model.evaluate([context_tensor, question_tensor], label_tensor)

    # Calculate Confusion Matrix
    predictions_prob = model.predict([context_tensor, question_tensor])
    predictions = (predictions_prob >= threshold).astype(int)
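    # Fix the label order so a 2x2 matrix is returned even if one class is absent
    tn, fp, fn, tp = confusion_matrix(label_tensor.numpy(), predictions.ravel(), labels=[0, 1]).ravel()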

    # Calculate TPR and FPR
    tpr = tp / (tp + fn) if (tp + fn) > 0 else 0.0  # True Positive Rate
    fpr = fp / (fp + tn) if (fp + tn) > 0 else 0.0  # False Positive Rate

    return loss, accuracy, tpr, fpr

In [None]:
set_seed()
base_dir = '/content/drive/My Drive/nlp/week3'
use_embeddings = False # change in order to use existing embeddings
save_embeddings = False # change in order to save new embeddings

results = {}

for lang in ['ru', 'ja', 'fi']:
    # Filter training and validation data for the specific language
    train_data = train_df[train_df['lang'] == lang]
    val_data = val_df[val_df['lang'] == lang]

    # Create and train the model
    print(f"Training model for language: {lang}")
    train_file_name = os.path.join(base_dir, f"{lang}_embeddings.npz")
    model = create_and_train_model(train_data, use_embeddings, save_embeddings, train_file_name)

    # Evaluate the model
    print(f"Evaluating model for language: {lang}")
    test_file_name = os.path.join(base_dir, f"{lang}_test_embeddings.npz")
    loss, accuracy, tpr, fpr = evaluate_model(model, val_data, use_embeddings, save_embeddings, test_file_name)

    # Store the results
    results[lang] = {
        'Loss': loss,
        'Accuracy': accuracy,
        'True Positive Rate (TPR)': tpr,
        'False Positive Rate (FPR)': fpr
    }

# Print the results for each language model
for lang, metrics in results.items():
    print(f"Results for {lang}:")
    print(f"  Loss: {metrics['Loss']:.4f}")
    print(f"  Accuracy: {metrics['Accuracy']:.4f}")
    print(f"  True Positive Rate (TPR): {metrics['True Positive Rate (TPR)']:.4f}")
    print(f"  False Positive Rate (FPR): {metrics['False Positive Rate (FPR)']:.4f}")
    print()  # Just for better readability

#### 2. Logistic Regression with Bag-of-Words (Sofia)

In [None]:
import nltk
import pandas as pd
import warnings
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

# Fix the "copy warning" for readable output
warnings.filterwarnings("ignore", category=pd.errors.SettingWithCopyWarning)

# Import Datasets
splits = {'train': 'train.parquet', 'validation': 'validation.parquet'}
df = pd.read_parquet("hf://datasets/coastalcph/tydi_xor_rc/" + splits["train"])

# Train and validation sets
train_df = pd.read_parquet("hf://datasets/coastalcph/tydi_xor_rc/" + splits["train"])
val_df = pd.read_parquet("hf://datasets/coastalcph/tydi_xor_rc/" + splits["validation"])

# Split questions by the language
finnish_questions = df[df['lang'] == 'fi']
japanese_questions = df[df['lang'] == 'ja']
russian_questions = df[df['lang'] == 'ru']

languages = ['fi', 'ru']

# Load stop words into lists
nltk.download('stopwords')
stop_words_english = list(stopwords.words('english'))  # Convert to a list
stop_words_finnish = list(stopwords.words('finnish'))
stop_words_russian = list(stopwords.words('russian'))

for lang in languages:
    questions = df[df['lang'] == lang]

    # Generating stopword list for each case
    if lang == 'fi':
        stop_words = stop_words_finnish + stop_words_english
    elif lang == 'ru':
        stop_words = stop_words_russian + stop_words_english
    else:
        stop_words = stop_words_english

    # Combine context and question into a single input feature
    questions.loc[:, 'combined_input'] = questions.loc[:, 'context'] + ' ' + questions.loc[:, 'question']

    # Create a bag-of-words representation with removal of stop-words
    vectorizer = CountVectorizer(stop_words=stop_words)
    X = vectorizer.fit_transform(questions['combined_input'])
    y = questions['answerable'].astype(int)  # Convert 'True'/'False' to 0/1

    # Split the data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Train a logistic regression model
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)

    # Make predictions on the test set
    y_pred = model.predict(X_test)

    # Evaluate the model
    accuracy = accuracy_score(y_test, y_pred)
    print(f"Accuracy for language {lang}: {accuracy:.2f}")
    print(classification_report(y_test, y_pred))


The problem may be that, to decide whether a question can be answered, the model needs to look at the actual keywords and ignore words that carry no meaning.

The model has excellent performance for class 1 (answerable), with high precision, recall, and F1-score.
However, it struggles with class 0 (impossible) questions, as indicated by its low precision, recall, and F1-score.
Accuracy is misleading in this case due to the class imbalance. Although 92% of the predictions are correct, this is mainly due to the overwhelming presence of class 1.
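
To make the point about accuracy concrete, the sketch below compares plain accuracy with balanced accuracy (the mean of per-class recall) for a trivial classifier that always predicts "answerable". The 92/8 label split is a hypothetical example, not the measured class distribution of this dataset.

```python
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so each class counts equally."""
    recalls = []
    for cls in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == cls]
        recalls.append(sum(y_pred[i] == cls for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


# Hypothetical labels: 92 answerable (1) and 8 impossible (0) questions
y_true = [1] * 92 + [0] * 8
always_answerable = [1] * len(y_true)

print("accuracy:         ", accuracy(y_true, always_answerable))
print("balanced accuracy:", balanced_accuracy(y_true, always_answerable))
```

A classifier that never predicts class 0 reaches the same accuracy as the majority share, while its balanced accuracy stays at chance level.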

To improve performance on class 0:
- Use class weighting to penalize the model for misclassifying class 0.
- Resample the data to balance the classes (e.g., oversample class 0 or undersample class 1).
- Experiment with more sophisticated models or feature engineering techniques.
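
As an illustration of the class-weighting idea, the sketch below computes weights inversely proportional to class frequency, using the formula scikit-learn documents for `class_weight='balanced'`: `n_samples / (n_classes * count)`. The helper name `balanced_class_weights` and the 90/10 split are assumptions made for the example.

```python
from collections import Counter


def balanced_class_weights(labels):
    """weight[c] = n_samples / (n_classes * count[c])."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


labels = [1] * 90 + [0] * 10  # hypothetical imbalance
weights = balanced_class_weights(labels)
print(weights)
```

With scikit-learn, passing `class_weight='balanced'` to `LogisticRegression` should apply the same weighting without computing it by hand.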

In [None]:
# Count the answerable and non-answerable questions in the training set

not_answerable_count = train_df[train_df['answerable'] == False].shape[0]
print(f"Number of questions not answerable: {not_answerable_count}")

answerable_count = train_df[train_df['answerable'] == True].shape[0]
print(f"Number of questions answerable: {answerable_count}")


Japanese text requires a specific tokenizer:

In [None]:
# Load tokenizer directly
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer_ja = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-ja-en")
#model_ja = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-ja-en")

def tokenize_japanese_sentences(sentence):
    tokens = tokenizer_ja.tokenize(sentence)
    return " ".join(tokens)

japanese_questions = df[df['lang'] == 'ja']['question']

# Apply tokenization to the 'japanese_questions' series
tokenized_japanese_questions = japanese_questions.apply(tokenize_japanese_sentences)


# Print the original and tokenized questions
#for original, tokenized in zip(japanese_questions.head(5), tokenized_japanese_questions.head(5)):
#    print(f"Original: {original}")
#    print(f"Tokenized: {tokenized}")
#    print("-" * 50)
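
The reason Japanese needs separate handling can be shown without any model: Japanese is written without spaces, so the default word pattern of `CountVectorizer` (`(?u)\b\w\w+\b`) treats a whole clause as one token. The sketch below applies that pattern with the standard `re` module and contrasts it with character bigrams, a simple alternative when no tokenizer is available. The example sentences and the `char_ngrams` helper are illustrative assumptions.

```python
import re

# Default token pattern documented for scikit-learn's CountVectorizer
WORD_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def char_ngrams(text, n=2):
    """Character n-grams over the text with whitespace and punctuation removed."""
    chars = [c for c in text if c.isalnum()]
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]


finnish = "Mikä on Suomen pääkaupunki?"
japanese = "富士山の高さはいくつか？"

print(WORD_PATTERN.findall(finnish))   # one token per Finnish word
print(WORD_PATTERN.findall(japanese))  # the whole clause comes back as a single token
print(char_ngrams(japanese))           # overlapping character bigrams
```

Subword tokenizers such as the one loaded above address the same problem; character n-grams are only the cheapest fallback.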

# Week 39

We now move from binary classification to span-based QA, i.e. identifying the
span in the document that answers the question.
Let k be the number of members in your group. Using the training data in
Finnish, Japanese and Russian separately, train k different sequence labellers,
which predict the tokens in a document context that constitute the answer to
the corresponding question. You can decide whether to train one model per
language or a single model for all three languages. Evaluate using a sequence labelling metric on the validation set, report and analyse the performance for each
language and compare the scores across languages. Note that if the question is
unanswerable, a correct output must be empty (contain no tokens).

#### 1. All-languages DistilBertForQuestionAnswering model (Tom)

In [None]:
!pip install datasets
!pip install --upgrade torch accelerate

import pandas as pd

splits = {'train': 'train.parquet', 'validation': 'validation.parquet'}
# Train and validation sets
train_df = pd.read_parquet("hf://datasets/coastalcph/tydi_xor_rc/" + splits["train"])
val_df = pd.read_parquet("hf://datasets/coastalcph/tydi_xor_rc/" + splits["validation"])


In [None]:
# Import necessary libraries
import torch
from transformers import (
    DistilBertForQuestionAnswering,
    DistilBertTokenizerFast,
    Trainer,
    TrainingArguments,
)
from datasets import Dataset
import numpy as np

# Check and set the device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device:", device)

# Load pre-trained model and tokenizer
model_name = 'distilbert-base-multilingual-cased'
tokenizer = DistilBertTokenizerFast.from_pretrained(model_name)
model = DistilBertForQuestionAnswering.from_pretrained(model_name).to(device)

# Function to prepare features for the model
def prepare_features(examples):
    # Tokenize the inputs with the tokenizer
    tokenized_examples = tokenizer(
        examples['question'],
        examples['context'],
        truncation=True,
        max_length=384,
        stride=128,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,  # Needed to map tokens to original text
        padding='max_length',
    )

    # Since one example might give rise to multiple features (due to truncation),
    # we need to keep track of the mapping between features and examples
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")
    offset_mapping = tokenized_examples["offset_mapping"]

    # Initialize lists to store the start and end positions
    start_positions = []
    end_positions = []

    for i, offsets in enumerate(offset_mapping):
        # Get the input IDs and CLS index
        input_ids = tokenized_examples["input_ids"][i]
        cls_index = input_ids.index(tokenizer.cls_token_id)

        # Get the sequence IDs to differentiate question and context tokens
        sequence_ids = tokenized_examples.sequence_ids(i)

        # Map the feature to its corresponding example
        sample_index = sample_mapping[i]
        answer = examples["answer"][sample_index]
        answer_start = examples["answer_start"][sample_index]
        context = examples["context"][sample_index]

        # If there is no answer, set the start and end positions to the CLS index
        if answer_start is None or answer_start == -1:
            start_positions.append(cls_index)
            end_positions.append(cls_index)
        else:
            # Calculate the start and end character positions of the answer in the context
            start_char = answer_start
            end_char = answer_start + len(answer)

            # Find the start and end token indices in the feature
            token_start_index = 0
            while sequence_ids[token_start_index] != 1:
                token_start_index += 1

            token_end_index = len(input_ids) - 1
            while sequence_ids[token_end_index] != 1:
                token_end_index -= 1

            # If the answer is out of the span (due to truncation), label it as CLS index
            if not (offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char):
                start_positions.append(cls_index)
                end_positions.append(cls_index)
            else:
                # Otherwise, find the start and end token indices that correspond to the answer
                # Note that the answer could be in the middle of the tokens due to tokenization
                # Adjust the start and end positions accordingly
                start_position = token_start_index
                end_position = token_end_index

                for idx in range(token_start_index, token_end_index + 1):
                    if offsets[idx][0] <= start_char and offsets[idx][1] > start_char:
                        start_position = idx
                    if offsets[idx][0] < end_char and offsets[idx][1] >= end_char:
                        end_position = idx
                        break

                start_positions.append(start_position)
                end_positions.append(end_position)

    # Add the start and end positions to the tokenized examples
    tokenized_examples["start_positions"] = start_positions
    tokenized_examples["end_positions"] = end_positions
    tokenized_examples.pop("offset_mapping")  # Remove offset_mapping as it's no longer needed

    return tokenized_examples

Using device: cuda


Some weights of DistilBertForQuestionAnswering were not initialized from the model checkpoint at distilbert-base-multilingual-cased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
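
`prepare_features` relies on `max_length=384` together with `stride=128`, so a long context is split into several overlapping features, and features that do not contain the full answer are labelled with the CLS index. The sketch below imitates that windowing on a plain token list. `sliding_windows` is a hypothetical helper; the real tokenizer also reserves room for the question and special tokens, which this simplification ignores.

```python
def sliding_windows(tokens, window, stride):
    """Split tokens into windows of at most `window` tokens that overlap by `stride`."""
    if not 0 <= stride < window:
        raise ValueError("stride must satisfy 0 <= stride < window")
    step = window - stride
    spans = []
    start = 0
    while True:
        end = min(start + window, len(tokens))
        spans.append((start, end))
        if end == len(tokens):
            break
        start += step
    return spans


context_tokens = [f"tok{i}" for i in range(10)]
answer_span = (6, 8)  # hypothetical answer covering tokens 6 and 7

for start, end in sliding_windows(context_tokens, window=5, stride=2):
    inside = start <= answer_span[0] and answer_span[1] <= end
    print((start, end), "answer inside" if inside else "label as CLS")
```

The overlap means an answer near a window boundary is usually fully contained in at least one feature, at the cost of producing more training features than examples.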


In [None]:
# Convert pandas DataFrames to Hugging Face Datasets
train_dataset = Dataset.from_pandas(train_df[(train_df['lang'] == 'ru') | (train_df['lang'] == 'ja') | (train_df['lang'] == 'fi')])

# Apply the prepare_features function to the dataset
tokenized_train_dataset = train_dataset.map(
    prepare_features,
    batched=True,
    remove_columns=train_dataset.column_names,
)

# Set training arguments
training_args = TrainingArguments(
    output_dir='./results',
    eval_strategy='no',  # Disable evaluation during training
    save_strategy='no',     # You can choose when to save checkpoints
    learning_rate=3e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=100,
    save_total_limit=2,
    fp16=True,
    dataloader_num_workers=2,
    report_to="none"  # This disables reporting to W&B
)

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train_dataset,
    tokenizer=tokenizer,
    # compute_metrics can be omitted since we're not evaluating
)


# Start training
trainer.train()

# Save the trained model
#trainer.save_model('./trained_model')

Step,Training Loss
100,3.5267
200,2.7442
300,2.7635
400,2.6103
500,2.2111
600,2.0356
700,1.9375
800,1.9574
900,1.6794
1000,1.5475


TrainOutput(global_step=1236, training_loss=2.149041206705532, metrics={'train_runtime': 305.3471, 'train_samples_per_second': 64.746, 'train_steps_per_second': 4.048, 'total_flos': 1937258840724480.0, 'train_loss': 2.149041206705532, 'epoch': 3.0})

In [None]:
!pip install evaluate


In [None]:
# Import necessary libraries
import numpy as np
from datasets import Dataset
import evaluate
from collections import defaultdict

# Load the evaluation metric
metric = evaluate.load("squad")

val_dataset = Dataset.from_pandas(val_df[(val_df['lang'] == 'ru') | (val_df['lang'] == 'ja') | (val_df['lang'] == 'fi')])
val_dataset = val_dataset.map(lambda example, idx: {'id': str(idx)}, with_indices=True)

# Prepare the validation features
def prepare_validation_features(examples):
    tokenized_examples = tokenizer(
        examples['question'], examples['context'],
        truncation=True, max_length=384, stride=128,
        return_overflowing_tokens=True, return_offsets_mapping=True,
        padding='max_length'
    )
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")
    tokenized_examples["example_id"] = [examples["id"][sample_mapping[i]] for i in range(len(tokenized_examples["input_ids"]))]
    return tokenized_examples

# Tokenize the validation dataset
tokenized_val_dataset = val_dataset.map(prepare_validation_features, batched=True, remove_columns=val_dataset.column_names)

# Make predictions on the validation dataset
predictions = trainer.predict(tokenized_val_dataset)

# Post-process predictions to get the final answers
def postprocess_predictions(examples, features, predictions):
    all_start_logits, all_end_logits = predictions
    example_id_to_index = {str(k): i for i, k in enumerate(examples["id"])}
    features_per_example = defaultdict(list)

    for i, feature in enumerate(features):
        features_per_example[str(feature["example_id"])].append(i)

    final_predictions = {}
    for example_id, feature_indices in features_per_example.items():
        context = examples[example_id_to_index[example_id]]["context"]
        best_answer, max_score = "", -float('inf')

        for feature_index in feature_indices:
            start_logits, end_logits, offsets = all_start_logits[feature_index], all_end_logits[feature_index], features[feature_index]["offset_mapping"]
            start_idx, end_idx = np.argmax(start_logits), np.argmax(end_logits)

            if start_idx <= end_idx and start_idx < len(offsets) and end_idx < len(offsets):
                start_char, end_char = offsets[start_idx][0], offsets[end_idx][1]
                predicted_answer = context[start_char:end_char]
                score = start_logits[start_idx] + end_logits[end_idx]

                if score > max_score:
                    best_answer, max_score = predicted_answer, score

        final_predictions[example_id] = best_answer if best_answer != "" else " "

    return final_predictions

# Generate final predictions
final_predictions = postprocess_predictions(val_dataset, tokenized_val_dataset, predictions.predictions)

# Convert 'val_dataset' to a pandas DataFrame and filter by language
val_df_with_ids = val_dataset.to_pandas().astype({'id': str})
val_df_with_ids.loc[val_df_with_ids['answer_start'] == -1, 'answer'] = " "
languages = ['ru', 'fi', 'ja']
val_dfs_by_language = {lang: val_df_with_ids[val_df_with_ids['lang'] == lang].reset_index(drop=True) for lang in languages}
val_dfs_by_language['all'] = val_df_with_ids[val_df_with_ids['lang'].isin(languages)].reset_index(drop=True)

# Evaluate and compute metrics per language
for lang, lang_df in val_dfs_by_language.items():
    if lang_df.empty:
        print(f"No examples found for language: {lang}\n")
        continue

    lang_dataset = Dataset.from_pandas(lang_df)
    references = [{"id": ex["id"], "answers": {"text": [ex["answer"]], "answer_start": [ex["answer_start"]]}} for ex in lang_dataset]
    lang_ids = set(lang_df['id'])
    predictions_formatted = [{"id": k, "prediction_text": v} for k, v in final_predictions.items() if k in lang_ids]

    if len(predictions_formatted) != len(references):
        print(f"Mismatch in predictions/references for language {lang}\n")
        continue

    # Compute and print metrics
    results = metric.compute(predictions=predictions_formatted, references=references)
    print(f"Language: {lang}\nExact Match: {results['exact_match']:.2f}\nF1 Score: {results['f1']:.2f}\n")

Language: ru
Exact Match: 41.16
F1 Score: 27.95

Language: fi
Exact Match: 44.13
F1 Score: 28.41

Language: ja
Exact Match: 50.00
F1 Score: 26.79

Language: all
Exact Match: 45.22
F1 Score: 27.75
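
In the scores above, Exact Match is higher than F1 for every language. That cannot happen when both metrics are computed over the same non-empty answers, because an exact match always implies a perfect token overlap. A likely cause (an assumption, not verified against this run) is that unanswerable questions are encoded as `" "`: after normalization both prediction and reference are empty, so they match exactly but share no tokens and receive an F1 of 0. The sketch below shows a token-level F1 that scores an empty prediction against an empty reference as 1, in the style of SQuAD v2 evaluation. `token_f1` is a hypothetical helper that splits on whitespace, which is too coarse for Japanese.

```python
from collections import Counter


def token_f1(prediction, reference):
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    if not pred_tokens or not ref_tokens:
        # Both empty: a correct "no answer". Only one empty: a complete miss.
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(token_f1(" ", " "))                         # correct empty prediction
print(token_f1("Birdland", " "))                  # answer given for an unanswerable question
print(token_f1("the Birdland club", "Birdland"))  # partial overlap
```

If this is indeed the cause, re-scoring with an empty-aware F1 (or the `squad_v2` metric) would be expected to bring F1 to at least the Exact Match level.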



In [None]:
!pip install googletrans==4.0.0-rc1


In [None]:
from googletrans import Translator

# Initialize the translator
translator = Translator()

# Convert 'val_dataset' to a pandas DataFrame; 'id' must be a string
# to match the keys in 'final_predictions'
val_df_with_ids = val_dataset.to_pandas()
val_df_with_ids['id'] = val_df_with_ids['id'].astype(str)

# Iterate over examples 11 to 20 (zero-based indices 10 to 19)
for idx in range(10, min(20, len(val_df_with_ids))):
    # Get the example
    example = val_df_with_ids.iloc[idx]
    example_id = example['id']
    question = example['question']
    context = example['context']
    actual_answer = example['answer']
    predicted_answer = final_predictions.get(example_id, "")
    language = example['lang']

    # Translate the question into English
    try:
        translation = translator.translate(question, dest='en')
        question_en = translation.text
    except Exception as e:
        question_en = "Translation unavailable"
        print(f"Error translating question {idx+1}: {e}")

    print(f"Example {idx+1}:")
    print(f"Question ({language}): {question}")
    print(f"Translated Question: {question_en}")
    print(f"Context: {context}\n")
    print(f"Actual Answer: {actual_answer}")
    print(f"Actual Answer_Start: {example['answer_start']}")
    print(f"Predicted Answer: {predicted_answer}")
    print("-" * 80)

Example 11:
Question (fi): Mikä on Jaco Pastoriuksen tunnetuin kappale?
Translated Question: What is Jaco Pastorius' most famous song?
Context: Birdland marked the peak of Weather Report's commercial career with the release of "Heavy Weather". With the addition of Jaco Pastorius, the band was able to push its music to the "height of its popularity", and with that came "Birdland." "Birdland" served as a tribute to the famous New York City jazz club that hosted many famous jazz musicians, which operated on Broadway from 1949 through 1965. This was the club, which he frequented almost daily, where Zawinul heard Count Basie, Louis Armstrong, Duke Ellington, and Miles Davis. It was also where he met his wife, Maxine. Looking back, Zawinul claimed, "The old Birdland was the most important place in my life." The song was also named in honor of the man after whom the club was named, Charlie Parker, the 'Bird' himself. 

Actual Answer: Birdland
Actual Answer_Start: 0
Predicted Answer:  
-------
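
In the example above the model returns an empty answer although the reference is "Birdland". `postprocess_predictions` takes the best start and the best end independently and keeps the span only if `start_idx <= end_idx`, so a strong start paired with an earlier strong end yields nothing. A common alternative, sketched here with hypothetical names and toy logits, scores the top-k start/end pairs jointly and compares the best valid span with the no-answer score at the CLS position (index 0).

```python
def top_indices(logits, k):
    """Indices of the k largest logits, skipping position 0 (CLS)."""
    return sorted(range(1, len(logits)), key=lambda i: logits[i], reverse=True)[:k]


def best_span(start_logits, end_logits, top_k=3, max_answer_len=30):
    """Return (start, end, score) of the best valid span, or None if no-answer wins."""
    null_score = start_logits[0] + end_logits[0]
    best = None
    for s in top_indices(start_logits, top_k):
        for e in top_indices(end_logits, top_k):
            if s <= e < s + max_answer_len:
                score = start_logits[s] + end_logits[e]
                if best is None or score > best[2]:
                    best = (s, e, score)
    if best is None or best[2] <= null_score:
        return None
    return best


# Toy logits: the best start (index 2) lies after the best end (index 1),
# so independent argmax would produce an invalid span.
start_logits = [1.0, 0.2, 3.0, 0.1, 2.5]
end_logits = [1.0, 2.8, 0.3, 2.6, 0.2]
print(best_span(start_logits, end_logits))
```

Whether this would change the prediction for the example above is not known without the actual logits; the sketch only shows the selection rule.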

#### 2. Multilingual BERT token classifier (Sofia)

This section fine-tunes a multilingual BERT model for binary token classification (labelling each token as part of the answer or not in a question-answering dataset) and evaluates its performance.
- Pre-trained BERT model (bert-base-multilingual-cased)

In [None]:
from torch.utils.data import Dataset, DataLoader, TensorDataset
import os
import pandas as pd
import torch
from sklearn.metrics import classification_report, precision_recall_fscore_support
from sklearn.utils import compute_class_weight
from sklearn.model_selection import train_test_split
from transformers import BertTokenizer, BertConfig, BertForTokenClassification, Trainer, TrainingArguments, BertTokenizerFast
from torch.nn import CrossEntropyLoss
import torch.nn.functional as F

In [None]:
MAX_LENGTH = 512

def tokenize_and_align_labels(data, tokenizer, max_length=512):
    tokenized_inputs = {
        "input_ids": [],
        "attention_mask": [],
        "labels": []
    }

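    for i in range(len(data)):
        context = data['context'].iloc[i]
        question = data['question'].iloc[i]
        answer = data['answer'].iloc[i]

        encoded_input = tokenizer(
            question,
            context,
            max_length=max_length,
            truncation=True,
            padding="max_length",
            return_offsets_mapping=True,
        )

        offsets = encoded_input.pop("offset_mapping")
        # 0 for question tokens, 1 for context tokens, None for special and padding tokens
        sequence_ids = encoded_input.sequence_ids()

        # Default label 0: the token is not part of the answer
        labels = [0] * len(encoded_input['input_ids'])

        # Locate the answer in the context by character position
        answer_start_idx = context.find(answer)
        if answer_start_idx != -1:
            answer_end_idx = answer_start_idx + len(answer)

            for idx, (start, end) in enumerate(offsets):
                # Offsets of question tokens index into the question string,
                # so only context tokens may be labelled as answer tokens
                if sequence_ids[idx] == 1 and start >= answer_start_idx and end <= answer_end_idx:
                    labels[idx] = 1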


        if len(labels) < max_length:
            labels += [-100] * (max_length - len(labels))


        tokenized_inputs["input_ids"].append(encoded_input["input_ids"])
        tokenized_inputs["attention_mask"].append(encoded_input["attention_mask"])
        tokenized_inputs["labels"].append(labels)


    tokenized_inputs["input_ids"] = torch.tensor(tokenized_inputs["input_ids"])
    tokenized_inputs["attention_mask"] = torch.tensor(tokenized_inputs["attention_mask"])
    tokenized_inputs["labels"] = torch.tensor(tokenized_inputs["labels"])

    return tokenized_inputs

class CustomDataset(Dataset):
    def __init__(self, input_ids, attention_mask, labels):
        self.input_ids = input_ids
        self.attention_mask = attention_mask
        self.labels = labels

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        # The inputs are already tensors; index and cast them instead of
        # re-wrapping them with torch.tensor(), which raises a UserWarning
        item = {
            'input_ids': self.input_ids[idx].long(),
            'attention_mask': self.attention_mask[idx].long(),
            'labels': self.labels[idx].long()
        }
        return item

In [None]:
model_save_path = "./model_weights"

tokenizer = BertTokenizerFast.from_pretrained('bert-base-multilingual-cased')

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

splits = {'train': 'train.parquet', 'validation': 'validation.parquet'}

# Load only the Finnish, Japanese and Russian rows and the columns needed below
train_data = pd.read_parquet("hf://datasets/coastalcph/tydi_xor_rc/" + splits["train"], columns=['context', 'question', 'answerable', 'answer', 'lang'], filters=[('lang', 'in', ['ja','fi','ru'])])
valid_data = pd.read_parquet("hf://datasets/coastalcph/tydi_xor_rc/" + splits["validation"], columns=['context', 'question', 'answerable', 'answer', 'lang'], filters=[('lang', 'in', ['ja','fi','ru'])])



# train_data, valid_data = train_test_split(data, test_size=0.1, random_state=42)


tokenized_train_data = tokenize_and_align_labels(train_data, tokenizer)
tokenized_valid_data = tokenize_and_align_labels(valid_data, tokenizer)

batch_size = 16
train_dataset = CustomDataset(tokenized_train_data['input_ids'],
                              tokenized_train_data['attention_mask'],
                              tokenized_train_data['labels'])

valid_dataset = CustomDataset(tokenized_valid_data['input_ids'],
                              tokenized_valid_data['attention_mask'],
                              tokenized_valid_data['labels'])


model = BertForTokenClassification.from_pretrained('bert-base-multilingual-cased', num_labels=2)

class WeightedTrainer(Trainer):
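    # **kwargs absorbs extra arguments (e.g. num_items_in_batch) that newer transformers versions pass
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):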
        labels = inputs.get("labels")
        outputs = model(**inputs)
        logits = outputs.get("logits")


        loss_fct = CrossEntropyLoss(weight=torch.tensor([1.0, 5.0]).to(device))
        loss = loss_fct(logits.view(-1, model.config.num_labels), labels.view(-1))

        return (loss, outputs) if return_outputs else loss

training_args = TrainingArguments(
    output_dir=model_save_path,
    num_train_epochs=15,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=64,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
    save_strategy="epoch",
    eval_strategy="epoch",
    load_best_model_at_end=True,
    save_total_limit=4,
    learning_rate=1e-5,
    gradient_accumulation_steps=8,
    fp16=True
)


trainer = WeightedTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=valid_dataset
)


trainer.train()

print("Saving the model...")
model.save_pretrained(model_save_path)
tokenizer.save_pretrained(model_save_path)

print("Evaluating the model on the validation dataset...")
eval_results = trainer.evaluate()

def get_predictions(model, dataloader):
    model.eval()
    predictions, true_labels = [], []

    for batch in dataloader:

        # The DataLoader already yields tensors, so they only need to be moved to the device
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)

        with torch.no_grad():

            outputs = model(input_ids, attention_mask=attention_mask)
            logits = outputs.logits


        predicted_labels = torch.argmax(logits, dim=2)


        predictions.extend(predicted_labels.cpu().numpy())
        true_labels.extend(labels.cpu().numpy())

    return predictions, true_labels


valid_dataloader = DataLoader(valid_dataset, batch_size=64)


predictions, true_labels = get_predictions(model, valid_dataloader)


flattened_predictions = [pred for batch in predictions for pred in batch]
flattened_true_labels = [true for batch in true_labels for true in batch]


filtered_preds = [pred for pred, label in zip(flattened_predictions, flattened_true_labels) if label != -100]
filtered_labels = [label for label in flattened_true_labels if label != -100]


precision, recall, f1, _ = precision_recall_fscore_support(filtered_labels, filtered_preds, average='weighted')


report = classification_report(filtered_labels, filtered_preds)


print(f"Evaluation Results:\nPrecision: {precision:.2f}\nRecall: {recall:.2f}\nF1 Score: {f1:.2f}")
print(f"\nClassification Report:\n{report}")

Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-multilingual-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 0 | 0.5619 | 0.490097 |
| 1 | 0.4851 | 0.454535 |
| 2 | 0.5085 | 0.457419 |
| 3 | 0.4266 | 0.398539 |
| 4 | 0.3851 | 0.34483 |
| 5 | 0.2994 | 0.320205 |
| 6 | 0.2774 | 0.315281 |
| 8 | 0.2083 | 0.392403 |
| 9 | 0.1768 | 0.43924 |
| 10 | 0.1505 | 0.438939 |


Saving the model...
Evaluating the model on the validation dataset...



Evaluation Results:
Precision: 0.95
Recall: 0.94
F1 Score: 0.95

Classification Report:
              precision    recall  f1-score   support

           0       0.98      0.96      0.97    673526
           1       0.42      0.55      0.48     33034

    accuracy                           0.94    706560
   macro avg       0.70      0.76      0.72    706560
weighted avg       0.95      0.94      0.95    706560
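
The weighted averages above are dominated by class 0, which makes up the vast majority of tokens. As a rough sanity check, the sketch below recomputes the macro and weighted F1 from the per-class values in the report. The inputs are the rounded figures printed above, so the results are approximate.

```python
# Per-class F1 scores and supports copied from the classification report above
f1_per_class = {0: 0.97, 1: 0.48}
support = {0: 673526, 1: 33034}

total = sum(support.values())

# Macro average: every class counts equally
macro_f1 = sum(f1_per_class.values()) / len(f1_per_class)

# Weighted average: every class counts in proportion to its support
weighted_f1 = sum(f1_per_class[c] * support[c] for c in support) / total

# Share of tokens that belong to the answer class
positive_share = support[1] / total

print(f"macro F1    ~ {macro_f1:.3f}")
print(f"weighted F1 ~ {weighted_f1:.3f}")
print(f"class-1 share of tokens: {positive_share:.1%}")
```

Since under 5% of the tokens belong to class 1, the class-1 row and the macro average say more about answer-span detection than the weighted average does.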



# Week 40

We now introduce open QA, i.e. generating an answer to a question even when it is not extracted as a span from a document.

The English answer is available for all answerable questions in the dataset. For some of the questions, the answer in the same language as the question is also available, in the `answer_inlang` field. Use this subset of the questions in Finnish, Japanese and Russian to train (or fine-tune) a model that receives the question and context as input and generates the in-language answer. You can decide whether to train one model per language or a single model for all three languages.

If your group contains at least two members, additionally train an encoder-decoder model that receives only the question as input and generates the in-language answer.

If your group contains at least three members, additionally train an encoder-decoder model that receives only the English answer as input and generates the in-language answer.

Evaluate using a text generation evaluation metric on the validation set, compare the results across languages and models, and discuss them.
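
As an illustration of what a text generation metric for short answers could look like, here is a minimal sketch of a character n-gram F1 score. It is a simplified stand-in, not the official chrF implementation (which is available in the `sacrebleu` package), and the function names `char_ngrams` and `char_ngram_f1` are hypothetical. Character n-grams are used because they need no word segmentation, which is convenient for Japanese.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams after removing whitespace."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def char_ngram_f1(prediction, reference, max_n=3):
    """Average F1 over character n-gram orders 1..max_n (simplified, not official chrF)."""
    scores = []
    for n in range(1, max_n + 1):
        pred, ref = char_ngrams(prediction, n), char_ngrams(reference, n)
        if not pred or not ref:
            continue  # skip orders longer than either string
        overlap = sum((pred & ref).values())
        precision = overlap / sum(pred.values())
        recall = overlap / sum(ref.values())
        f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
        scores.append(f1)
    return sum(scores) / len(scores) if scores else 0.0


print(char_ngram_f1("Ei", "Ei"))       # 1.0
print(char_ngram_f1("Vuonna 1700", "1760"))
print(char_ngram_f1("東京", "大阪"))     # 0.0
```

Orders longer than either string are skipped, so a one- or two-character answer that matches exactly still scores 1.0.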

#### 1. Fine-tuned mT5 (Sofia)

##### Finnish language

In [None]:
!pip install transformers datasets sacrebleu pandas evaluate torch googletrans==4.0.0-rc1

In [None]:
import pandas as pd
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer
import torch

splits = {'train': 'train.parquet', 'validation': 'validation.parquet'}
df = pd.read_parquet("hf://datasets/coastalcph/tydi_xor_rc/" + splits["train"],
                     filters=[('lang', 'in', ['fi'])])

# df = pd.concat([pd.read_parquet('../../train.parquet',filters=[('lang', 'in', ['fi'])]),pd.read_parquet('../../validation.parquet',filters=[('lang', 'in', ['fi'])])])

model_name = 'facebook/m2m100_418M'
tokenizer = M2M100Tokenizer.from_pretrained(model_name)
model = M2M100ForConditionalGeneration.from_pretrained(model_name)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

tokenizer.src_lang = "en"
target_lang = "fi"

def translate_text(text):
    if text:
        encoded_text = tokenizer(text, return_tensors='pt').to(device)
        translated_tokens = model.generate(**encoded_text, forced_bos_token_id=tokenizer.get_lang_id(target_lang))
        return tokenizer.decode(translated_tokens[0], skip_special_tokens=True)
    return text

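# NOTE: this overwrites the dataset's own `answer_inlang` column with M2M100
# translations of the English answer, so the later cells train and evaluate
# on machine-translated targets
df['answer_inlang'] = df['answer'].apply(translate_text)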

df.to_parquet('translated_fi_answers.parquet')

In [None]:
from datasets import Dataset
from sklearn.model_selection import train_test_split
from transformers import (
    DataCollatorForSeq2Seq,
    MT5ForConditionalGeneration,
    MT5Tokenizer,
    Trainer,
    TrainingArguments,
)
import torch
import pandas as pd

In [None]:
# Load MT5 model and tokenizer for finetuning
MODEL_NAME = 'google/mt5-small'
tokenizer_mt5 = MT5Tokenizer.from_pretrained(MODEL_NAME)

# Prepare data for MT5 finetuning
def prepare_data(samples, tokenizer=None, max_input_length=64, max_target_length=32):
    english_answers = samples['answer']
    inlang_answers = samples['answer_inlang']

    model_inputs = tokenizer(
        english_answers, max_length=max_input_length, truncation=True, padding="max_length"
    )

    labels = tokenizer(
        inlang_answers, max_length=max_target_length, truncation=True, padding="max_length"
    ).input_ids

    labels = [
        [(label if label != tokenizer.pad_token_id else -100) for label in label_example]
        for label_example in labels
    ]

    model_inputs["labels"] = labels
    return model_inputs

# Split data into train and validation
train_df, val_df = train_test_split(df, test_size=0.2, random_state=42)

# Convert pandas DataFrames to Hugging Face Datasets
train_dataset = Dataset.from_pandas(train_df)
val_dataset = Dataset.from_pandas(val_df)

# Tokenize datasets
tokenized_train = train_dataset.map(lambda x: prepare_data(x, tokenizer=tokenizer_mt5), batched=True)
tokenized_val = val_dataset.map(lambda x: prepare_data(x, tokenizer=tokenizer_mt5), batched=True)

# Initialize the model and data collator for finetuning
model_mt5 = MT5ForConditionalGeneration.from_pretrained(MODEL_NAME).to(device)
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer_mt5, model=model_mt5, label_pad_token_id=-100)

# Custom Trainer to ensure all tensors are contiguous before saving
class ContiguousTrainer(Trainer):
    def save_model(self, output_dir=None, _internal_call=False):  # Accept _internal_call argument
        # Make all model parameters contiguous
        for param in self.model.parameters():
            param.data = param.data.contiguous()
        super().save_model(output_dir)

# Training arguments
training_args = TrainingArguments(
    output_dir="./mt5_model",
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=10,
    evaluation_strategy="epoch",
    learning_rate=5e-5,
    logging_dir="./logs",
    logging_steps=100,
    save_steps=500,
    save_total_limit=1,
    fp16=False,
    report_to="none"
)

# Use the custom ContiguousTrainer for training
trainer = ContiguousTrainer(
    model=model_mt5,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
    tokenizer=tokenizer_mt5,
    data_collator=data_collator,
)

# Train the model
trainer.train()

# Save the final model and tokenizer
model_mt5.save_pretrained('./mt5_model')
tokenizer_mt5.save_pretrained('./mt5_model')



| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 1 | 14.331 | 8.390997 |
| 2 | 6.7577 | 4.044094 |
| 3 | 5.0446 | 3.313808 |
| 4 | 4.5356 | 3.060942 |
| 5 | 4.1709 | 2.90127 |
| 6 | 3.8862 | 2.803905 |
| 7 | 3.6667 | 2.742069 |
| 8 | 3.52 | 2.6953 |
| 9 | 3.2753 | 2.664326 |
| 10 | 3.3988 | 2.658067 |


In [None]:
from sklearn.model_selection import train_test_split
import torch
from transformers import MT5ForConditionalGeneration, MT5Tokenizer
from torch.utils.data import DataLoader
from datasets import Dataset
import pandas as pd
from tqdm import tqdm
from functools import partial
import nltk
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model_dir = './mt5_model/'
model = MT5ForConditionalGeneration.from_pretrained(model_dir).to(device)
tokenizer = MT5Tokenizer.from_pretrained(model_dir)

data = pd.read_parquet('translated_fi_answers.parquet',
                       columns=['context', 'question', 'answerable','answer', 'lang','answer_inlang'],
                       filters=[('lang', 'in', ['fi'])])

train_data, valid_data = train_test_split(data, test_size=0.1, random_state=42)

valid_data['id'] = valid_data.index.astype(str)
valid_dataset = Dataset.from_pandas(valid_data)

def prepare_data(samples, tokenizer=None, max_input_length=64, max_target_length=32):
    english_answers = samples['answer']
    inlang_answers = samples['answer_inlang']
    model_inputs = tokenizer(
        english_answers, max_length=max_input_length, truncation=True, padding="max_length"
    )
    labels = tokenizer(
        inlang_answers, max_length=max_target_length, truncation=True, padding="max_length"
    ).input_ids

    model_inputs["labels"] = labels
    return model_inputs

def collate_fn(batch):
    input_ids = torch.tensor([item['input_ids'] for item in batch], dtype=torch.long)
    attention_mask = torch.tensor([item['attention_mask'] for item in batch], dtype=torch.long)
    return {'input_ids': input_ids, 'attention_mask': attention_mask}

def generate_answers(model, valid_dl):
    model.eval()
    all_predictions = []
    with torch.no_grad():
        for batch in tqdm(valid_dl, desc="Generating answers"):
            batch = {k: v.to(device) for k, v in batch.items()}

            generated_ids = model.generate(
                input_ids=batch['input_ids'],
                attention_mask=batch['attention_mask'],
                max_length=64,
                num_beams=4
            )

            preds = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
            all_predictions.extend(preds)
    return all_predictions

tokenized_valid = valid_dataset.map(partial(prepare_data, tokenizer=tokenizer), batched=True)

valid_dl = DataLoader(tokenized_valid, collate_fn=collate_fn, shuffle=False, batch_size=4)

generated_answers = generate_answers(model, valid_dl)

predictions = [{'id': str(i), 'prediction_text': pred} for i, pred in enumerate(generated_answers)]

gold = [{'id': example['id'], 'answers': example['answer_inlang']} for _, example in valid_data.iterrows()]

def compute_metrics(predictions, references, tokenizer):
    exact_match = total = bleu_score_total = 0
    chencherry = SmoothingFunction()

    for pred, ref in zip(predictions, references):
        total += 1
        pred_text = pred["prediction_text"]
        true_text = ref["answers"]

        exact_match += exact_match_score(pred_text, true_text)

        pred_tokens = tokenizer.tokenize(pred_text)
        ref_tokens = tokenizer.tokenize(true_text)

        print([ref_tokens],pred_tokens)
        bleu_score_total += sentence_bleu([ref_tokens], pred_tokens, smoothing_function=chencherry.method1)

    exact_match = 100.0 * exact_match / total
    avg_bleu = bleu_score_total / total

    return {'exact_match': exact_match, 'bleu': avg_bleu}

def exact_match_score(prediction, ground_truth):
    return prediction.strip().lower() == ground_truth.strip().lower()

results = compute_metrics(predictions, gold, tokenizer)

for i, prediction in enumerate(predictions):
    answer = valid_data.iloc[i]['answer_inlang']
    predicted_answer = prediction['prediction_text']

    print(f"Example {i + 1}")
    print(f"In language answer: {answer}")
    print(f"Predicted Answer: {predicted_answer}\n")

print(f"Evaluation Results:\nExact Match: {results['exact_match']:.2f}%\nBLEU Score: {results['bleu']:.4f}")

Map:   0%|          | 0/213 [00:00<?, ? examples/s]

Generating answers: 100%|██████████| 54/54 [00:43<00:00,  1.24it/s]


[['▁Pää', 'osat', '▁Siri', 'ma', '▁Rat', 'watt', 'e', '▁Dias', '▁Bandar', 'ana', 'ike']] ['▁Pää', 'osat', '▁Siri', 'ma', '▁Rat', 'watt', 'e']
[['▁', '1760']] ['▁Vuo', 'nna', '▁1700']
[['▁Ei']] ['▁Ei']
[['▁Ei']] ['▁Ei']
[['▁', 'Tyy', 'ppi', '▁tanssi', '▁käänt', 'y', 'y', '.']] ['▁Pää', 'osat', '▁on', '▁', 'a', '▁', 'tyyppi', ',', '▁joka', '▁on', '▁', 'a', '▁', 'tyyppi', ',', '▁joka', '▁on', '▁', 'a', '▁', 'tyyppi', ',', '▁joka', '▁on', '▁', 'a', '▁', 'tyyppi', ',', '▁joka', '▁on', '▁', 'a', '▁', 'tyyppi', ',', '▁joka', '▁on', '▁', 'a', '▁', 'tyyppi', ',', '▁joka', '▁on', '▁', 'a', '▁', 'tyyppi', ',', '▁joka', '▁on', '▁', 'a', '▁', 'tyyppi', ',', '▁joka', '▁on', '▁', 'perinte', 'inen', '▁tanssi']
[['▁1860', 's']] ['▁Vuo', 'nna', '▁1860']
[['▁Ei']] ['▁Ei']
[['▁P', 'iraat', 'tipu', 'olu', 'e']] ['▁Pää', 'osat', '▁Pirate', '▁Party']
[['▁', '1527']] ['▁Vuo', 'nna', '▁', '1527']
[['▁', 'Algeria', ',', '▁T', 'ša', 'd', ',', '▁Egypt', 'i', ',', '▁', 'Libya', ',', '▁Mali', ',', '▁Maurit', 'ania'
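
The `exact_match_score` helper in the evaluation cell above compares strings after `strip()` and `lower()` only. A slightly more forgiving variant is sketched below, assuming that Unicode width differences and ASCII punctuation should not count as mismatches. The names `normalize_answer` and `normalized_exact_match` are hypothetical and not part of the notebook.

```python
import string
import unicodedata


def normalize_answer(text):
    """Lowercase, apply NFKC normalisation, drop ASCII punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFKC", text).lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def normalized_exact_match(prediction, ground_truth):
    """Compare two answers after normalisation."""
    return normalize_answer(prediction) == normalize_answer(ground_truth)


print(normalized_exact_match("Ei.", "ei"))            # True
print(normalized_exact_match("１５２７", "1527"))       # True (full-width digits folded by NFKC)
print(normalized_exact_match("Vuonna 1527", "1527"))  # False
```

Predictions that add extra words, such as a leading "Vuonna", still fail this check, so an overlap-based metric remains useful alongside it.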

##### Russian language

In [None]:
!pip install transformers datasets sacrebleu pandas evaluate torch googletrans==4.0.0-rc1



In [None]:
import pandas as pd
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer
import torch
import time
import logging

# Load data
splits = {'train': 'train.parquet', 'validation': 'validation.parquet'}
df = pd.read_parquet("hf://datasets/coastalcph/tydi_xor_rc/" + splits["train"],
                     filters=[('lang', 'in', ['ru'])])

# Initialize model and tokenizer
model_name = 'facebook/m2m100_418M'
tokenizer = M2M100Tokenizer.from_pretrained(model_name)
model = M2M100ForConditionalGeneration.from_pretrained(model_name)

# Set up device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

tokenizer.src_lang = "en"
target_lang = "ru"

# Translation function with detailed logging
def translate_text(text, index, total):
    if text:

        print(f"Starting translation for index {index + 1}/{total}")

        # Encoding and moving to device
        encoded_text = tokenizer(text, return_tensors='pt').to(device)

        # Token generation
        translated_tokens = model.generate(**encoded_text, forced_bos_token_id=tokenizer.get_lang_id(target_lang))

        # Decoding
        translation = tokenizer.decode(translated_tokens[0], skip_special_tokens=True)

        print(f"Completed translation for {index + 1}/{total}")

        return translation
    return text

# Apply translation with progress logging
total = len(df)
df['answer_inlang'] = [translate_text(text, i, total) for i, text in enumerate(df['answer'])]

# Save translated data
df.to_parquet('translated_ru_answers.parquet')
print("Translation completed and saved to 'translated_ru_answers.parquet'")

In [None]:
from datasets import Dataset
from sklearn.model_selection import train_test_split
from transformers import (
    DataCollatorForSeq2Seq,
    MT5ForConditionalGeneration,
    MT5Tokenizer,
    Trainer,
    TrainingArguments,
)
import torch
import pandas as pd

In [None]:
# Load MT5 model and tokenizer for finetuning
MODEL_NAME = 'google/mt5-small'
tokenizer_mt5 = MT5Tokenizer.from_pretrained(MODEL_NAME)

# Prepare data for MT5 finetuning
def prepare_data(samples, tokenizer=None, max_input_length=64, max_target_length=32):
    english_answers = samples['answer']
    inlang_answers = samples['answer_inlang']

    model_inputs = tokenizer(
        english_answers, max_length=max_input_length, truncation=True, padding="max_length"
    )

    labels = tokenizer(
        inlang_answers, max_length=max_target_length, truncation=True, padding="max_length"
    ).input_ids

    labels = [
        [(label if label != tokenizer.pad_token_id else -100) for label in label_example]
        for label_example in labels
    ]

    model_inputs["labels"] = labels
    return model_inputs

# Split data into train and validation
train_df, val_df = train_test_split(df, test_size=0.2, random_state=42)

# Convert pandas DataFrames to Hugging Face Datasets
train_dataset = Dataset.from_pandas(train_df)
val_dataset = Dataset.from_pandas(val_df)

# Tokenize datasets
tokenized_train = train_dataset.map(lambda x: prepare_data(x, tokenizer=tokenizer_mt5), batched=True)
tokenized_val = val_dataset.map(lambda x: prepare_data(x, tokenizer=tokenizer_mt5), batched=True)

# Initialize the model and data collator for finetuning
model_mt5 = MT5ForConditionalGeneration.from_pretrained(MODEL_NAME).to(device)
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer_mt5, model=model_mt5, label_pad_token_id=-100)

# Custom Trainer to ensure all tensors are contiguous before saving
class ContiguousTrainer(Trainer):
    def save_model(self, output_dir=None, _internal_call=False):  # Accept _internal_call argument
        # Make all model parameters contiguous
        for param in self.model.parameters():
            param.data = param.data.contiguous()
        super().save_model(output_dir)

# Training arguments
training_args = TrainingArguments(
    output_dir="./mt5_model",
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=10,
    evaluation_strategy="epoch",
    learning_rate=5e-5,
    logging_dir="./logs",
    logging_steps=100,
    save_steps=500,
    save_total_limit=1,
    fp16=False,
    report_to="none"
)

# Use the custom ContiguousTrainer for training
trainer = ContiguousTrainer(
    model=model_mt5,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
    tokenizer=tokenizer_mt5,
    data_collator=data_collator,
)

# Train the model
trainer.train()

# Save the final model and tokenizer
model_mt5.save_pretrained('./mt5_model')
tokenizer_mt5.save_pretrained('./mt5_model')

In [None]:
from sklearn.model_selection import train_test_split
import torch
from transformers import MT5ForConditionalGeneration, MT5Tokenizer
from torch.utils.data import DataLoader
from datasets import Dataset
import pandas as pd
from tqdm import tqdm
from functools import partial
import nltk
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model_dir = './mt5_model/'
model = MT5ForConditionalGeneration.from_pretrained(model_dir).to(device)
tokenizer = MT5Tokenizer.from_pretrained(model_dir)

data = pd.read_parquet('translated_ru_answers.parquet',
                       columns=['context', 'question', 'answerable','answer', 'lang','answer_inlang'],
                       filters=[('lang', 'in', ['ru'])])

train_data, valid_data = train_test_split(data, test_size=0.1, random_state=42)

valid_data['id'] = valid_data.index.astype(str)
valid_dataset = Dataset.from_pandas(valid_data)

def prepare_data(samples, tokenizer=None, max_input_length=64, max_target_length=32):
    english_answers = samples['answer']
    inlang_answers = samples['answer_inlang']
    model_inputs = tokenizer(
        english_answers, max_length=max_input_length, truncation=True, padding="max_length"
    )
    labels = tokenizer(
        inlang_answers, max_length=max_target_length, truncation=True, padding="max_length"
    ).input_ids

    model_inputs["labels"] = labels
    return model_inputs

def collate_fn(batch):
    input_ids = torch.tensor([item['input_ids'] for item in batch], dtype=torch.long)
    attention_mask = torch.tensor([item['attention_mask'] for item in batch], dtype=torch.long)
    return {'input_ids': input_ids, 'attention_mask': attention_mask}

def generate_answers(model, valid_dl):
    model.eval()
    all_predictions = []
    with torch.no_grad():
        for batch in tqdm(valid_dl, desc="Generating answers"):
            batch = {k: v.to(device) for k, v in batch.items()}

            generated_ids = model.generate(
                input_ids=batch['input_ids'],
                attention_mask=batch['attention_mask'],
                max_length=64,
                num_beams=4
            )

            preds = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
            all_predictions.extend(preds)
    return all_predictions

tokenized_valid = valid_dataset.map(partial(prepare_data, tokenizer=tokenizer), batched=True)

valid_dl = DataLoader(tokenized_valid, collate_fn=collate_fn, shuffle=False, batch_size=4)

generated_answers = generate_answers(model, valid_dl)

predictions = [{'id': str(i), 'prediction_text': pred} for i, pred in enumerate(generated_answers)]

gold = [{'id': example['id'], 'answers': example['answer_inlang']} for _, example in valid_data.iterrows()]

def compute_metrics(predictions, references, tokenizer):
    exact_match = total = bleu_score_total = 0
    chencherry = SmoothingFunction()

    for pred, ref in zip(predictions, references):
        total += 1
        pred_text = pred["prediction_text"]
        true_text = ref["answers"]

        exact_match += exact_match_score(pred_text, true_text)

        pred_tokens = tokenizer.tokenize(pred_text)
        ref_tokens = tokenizer.tokenize(true_text)

        print([ref_tokens],pred_tokens)
        bleu_score_total += sentence_bleu([ref_tokens], pred_tokens, smoothing_function=chencherry.method1)

    exact_match = 100.0 * exact_match / total
    avg_bleu = bleu_score_total / total

    return {'exact_match': exact_match, 'bleu': avg_bleu}

def exact_match_score(prediction, ground_truth):
    return prediction.strip().lower() == ground_truth.strip().lower()

results = compute_metrics(predictions, gold, tokenizer)

for i, prediction in enumerate(predictions):
    answer = valid_data.iloc[i]['answer_inlang']
    predicted_answer = prediction['prediction_text']

    print(f"Example {i + 1}")
    print(f"In language answer: {answer}")
    print(f"Predicted Answer: {predicted_answer}\n")

print(f"Evaluation Results:\nExact Match: {results['exact_match']:.2f}%\nBLEU Score: {results['bleu']:.4f}")

##### Japanese language

In [None]:
!pip install transformers datasets sacrebleu pandas evaluate torch googletrans==4.0.0-rc1

In [None]:
import pandas as pd
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer
import torch
import time
import logging

# Set up logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(message)s')

# Load data
splits = {'train': 'train.parquet', 'validation': 'validation.parquet'}
df = pd.read_parquet("hf://datasets/coastalcph/tydi_xor_rc/" + splits["train"],
                     filters=[('lang', 'in', ['ja'])])

# Initialize model and tokenizer
model_name = 'facebook/m2m100_418M'
tokenizer = M2M100Tokenizer.from_pretrained(model_name)
model = M2M100ForConditionalGeneration.from_pretrained(model_name)

# Set up device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

tokenizer.src_lang = "en"
target_lang = "ja"

# Translate one English answer into the target language, printing progress
def translate_text(text, index, total):
    if not text:
        return text

    print(f"Starting translation for index {index + 1}/{total}")

    # Encode the source text and move the tensors to the model's device
    encoded_text = tokenizer(text, return_tensors='pt').to(device)

    # Force the first generated token to be the target-language tag
    translated_tokens = model.generate(**encoded_text, forced_bos_token_id=tokenizer.get_lang_id(target_lang))

    # Decode the generated token IDs back into text
    translation = tokenizer.decode(translated_tokens[0], skip_special_tokens=True)

    print(f"Completed translation for {index + 1}/{total}")

    return translation

# Apply translation with progress logging
total = len(df)
df['answer_inlang'] = [translate_text(text, i, total) for i, text in enumerate(df['answer'])]

# Save translated data
df.to_parquet('translated_ja_answers.parquet')
print("Translation completed and saved to 'translated_ja_answers.parquet'")
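Translating every row separately repeats work whenever the same English answer occurs more than once. A minimal sketch of caching translations per unique string, assuming that repeated answers exist in the data and that translating a given string always gives the same result. The helper name `translate_unique` and its `translate_fn` argument are hypothetical and not part of the notebook; `translate_fn` stands in for a one-argument wrapper around the translation function.

```python
def translate_unique(texts, translate_fn):
    """Translate each distinct non-empty string once and reuse the result."""
    cache = {}
    translated = []
    for text in texts:
        if text and text not in cache:
            cache[text] = translate_fn(text)
        # Empty or missing values are passed through unchanged
        translated.append(cache.get(text, text))
    return translated


# Toy stand-in for a real translation call, used only to show the behaviour
calls = []

def fake_translate(text):
    calls.append(text)
    return text.upper()

print(translate_unique(["yes", "no", "yes", "", None], fake_translate))
print(f"translation calls made: {len(calls)}")
```

With a real model, `translate_fn` could be a small wrapper that calls the existing per-row function; only the number of model calls changes, not the output order.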

In [None]:
# Only the names used in the fine-tuning cell are imported here
from datasets import Dataset
from sklearn.model_selection import train_test_split
from transformers import (
    DataCollatorForSeq2Seq,
    MT5ForConditionalGeneration,
    MT5Tokenizer,
    Trainer,
    TrainingArguments,
)
import torch
import pandas as pd

In [None]:
# Load MT5 model and tokenizer for finetuning
MODEL_NAME = 'google/mt5-small'
tokenizer_mt5 = MT5Tokenizer.from_pretrained(MODEL_NAME)

# Prepare data for MT5 finetuning
def prepare_data(samples, tokenizer=None, max_input_length=64, max_target_length=32):
    english_answers = samples['answer']
    inlang_answers = samples['answer_inlang']

    model_inputs = tokenizer(
        english_answers, max_length=max_input_length, truncation=True, padding="max_length"
    )

    labels = tokenizer(
        inlang_answers, max_length=max_target_length, truncation=True, padding="max_length"
    ).input_ids

    labels = [
        [(label if label != tokenizer.pad_token_id else -100) for label in label_example]
        for label_example in labels
    ]

    model_inputs["labels"] = labels
    return model_inputs

# Split data into train and validation
train_df, val_df = train_test_split(df, test_size=0.2, random_state=42)

# Convert pandas DataFrames to Hugging Face Datasets
train_dataset = Dataset.from_pandas(train_df)
val_dataset = Dataset.from_pandas(val_df)

# Tokenize datasets
tokenized_train = train_dataset.map(lambda x: prepare_data(x, tokenizer=tokenizer_mt5), batched=True)
tokenized_val = val_dataset.map(lambda x: prepare_data(x, tokenizer=tokenizer_mt5), batched=True)

# Initialize the model and data collator for finetuning
model_mt5 = MT5ForConditionalGeneration.from_pretrained(MODEL_NAME).to(device)
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer_mt5, model=model_mt5, label_pad_token_id=-100)

# Custom Trainer to ensure all tensors are contiguous before saving
class ContiguousTrainer(Trainer):
    def save_model(self, output_dir=None, _internal_call=False):  # Accept _internal_call argument
        # Make all model parameters contiguous
        for param in self.model.parameters():
            param.data = param.data.contiguous()
        super().save_model(output_dir)

# Training arguments
training_args = TrainingArguments(
    output_dir="./mt5_model",
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=10,
    evaluation_strategy="epoch",
    learning_rate=5e-5,
    logging_dir="./logs",
    logging_steps=100,
    save_steps=500,
    save_total_limit=1,
    fp16=False,
    report_to="none"
)

# Use the custom ContiguousTrainer for training
trainer = ContiguousTrainer(
    model=model_mt5,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
    tokenizer=tokenizer_mt5,
    data_collator=data_collator,
)

# Train the model
trainer.train()

# Save the final model and tokenizer
model_mt5.save_pretrained('./mt5_model')
tokenizer_mt5.save_pretrained('./mt5_model')

In [None]:
from sklearn.model_selection import train_test_split
import torch
from transformers import MT5ForConditionalGeneration, MT5Tokenizer
from torch.utils.data import DataLoader
from datasets import Dataset
import pandas as pd
from tqdm import tqdm
from functools import partial
import nltk
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model_dir = './mt5_model/'
model = MT5ForConditionalGeneration.from_pretrained(model_dir).to(device)
tokenizer = MT5Tokenizer.from_pretrained(model_dir)

data = pd.read_parquet('translated_ja_answers.parquet',
                       columns=['context', 'question', 'answerable','answer', 'lang','answer_inlang'],
                       filters=[('lang', 'in', ['ja'])])

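# Use the same split as in fine-tuning (test_size=0.2, random_state=42); a different test_size
# would produce a different shuffle, so the evaluation rows could overlap with the training rows
train_data, valid_data = train_test_split(data, test_size=0.2, random_state=42)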

valid_data['id'] = valid_data.index.astype(str)
valid_dataset = Dataset.from_pandas(valid_data)

def prepare_data(samples, tokenizer=None, max_input_length=64, max_target_length=32):
    english_answers = samples['answer']
    inlang_answers = samples['answer_inlang']
    model_inputs = tokenizer(
        english_answers, max_length=max_input_length, truncation=True, padding="max_length"
    )
    labels = tokenizer(
        inlang_answers, max_length=max_target_length, truncation=True, padding="max_length"
    ).input_ids

    model_inputs["labels"] = labels
    return model_inputs

def collate_fn(batch):
    input_ids = torch.tensor([item['input_ids'] for item in batch], dtype=torch.long)
    attention_mask = torch.tensor([item['attention_mask'] for item in batch], dtype=torch.long)
    return {'input_ids': input_ids, 'attention_mask': attention_mask}

def generate_answers(model, valid_dl):
    model.eval()
    all_predictions = []
    with torch.no_grad():
        for batch in tqdm(valid_dl, desc="Generating answers"):
            batch = {k: v.to(device) for k, v in batch.items()}

            generated_ids = model.generate(
                input_ids=batch['input_ids'],
                attention_mask=batch['attention_mask'],
                max_length=64,
                num_beams=4
            )

            preds = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
            all_predictions.extend(preds)
    return all_predictions

tokenized_valid = valid_dataset.map(partial(prepare_data, tokenizer=tokenizer), batched=True)

valid_dl = DataLoader(tokenized_valid, collate_fn=collate_fn, shuffle=False, batch_size=4)

generated_answers = generate_answers(model, valid_dl)

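# Reuse the validation row ids so that predictions and gold entries share the same 'id'
predictions = [{'id': ex_id, 'prediction_text': pred} for ex_id, pred in zip(valid_data['id'], generated_answers)]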

gold = [{'id': example['id'], 'answers': example['answer_inlang']} for _, example in valid_data.iterrows()]

def compute_metrics(predictions, references, tokenizer):
    exact_match = total = bleu_score_total = 0
    chencherry = SmoothingFunction()

    for pred, ref in zip(predictions, references):
        total += 1
        pred_text = pred["prediction_text"]
        true_text = ref["answers"]

        exact_match += exact_match_score(pred_text, true_text)

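        # Compare at the subword level using the model's own tokenizer
        pred_tokens = tokenizer.tokenize(pred_text)
        ref_tokens = tokenizer.tokenize(true_text)

        # Debug output: reference tokens and predicted tokens for this example
        print([ref_tokens], pred_tokens)
        # Sentence-level BLEU, smoothed because short answers often have no higher-order n-gram matches
        bleu_score_total += sentence_bleu([ref_tokens], pred_tokens, smoothing_function=chencherry.method1)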

    exact_match = 100.0 * exact_match / total
    avg_bleu = bleu_score_total / total

    return {'exact_match': exact_match, 'bleu': avg_bleu}

def exact_match_score(prediction, ground_truth):
    return prediction.strip().lower() == ground_truth.strip().lower()

results = compute_metrics(predictions, gold, tokenizer)

for i, prediction in enumerate(predictions):
    answer = valid_data.iloc[i]['answer_inlang']
    predicted_answer = prediction['prediction_text']

    print(f"Example {i + 1}")
    print(f"In language answer: {answer}")
    print(f"Predicted Answer: {predicted_answer}\n")

print(f"Evaluation Results:\nExact Match: {results['exact_match']:.2f}%\nBLEU Score: {results['bleu']:.4f}")
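The `exact_match_score` helper only strips and lowercases both strings. For Japanese, lowercasing has no effect, and full-width versus half-width characters or stray spaces can make otherwise identical answers count as a mismatch. A minimal sketch of a more lenient comparison, assuming NFKC normalization and the removal of all whitespace are acceptable for this evaluation (removing whitespace also merges words in Finnish or Russian, which may or may not be desired). The names `normalize_answer` and `lenient_exact_match` are hypothetical.

```python
import unicodedata


def normalize_answer(text):
    """Apply NFKC normalization, lowercase, and drop all whitespace."""
    text = unicodedata.normalize("NFKC", text).lower()
    return "".join(text.split())


def lenient_exact_match(prediction, ground_truth):
    """Return True when both strings are equal after normalization."""
    return normalize_answer(prediction) == normalize_answer(ground_truth)


# Full-width digits and an ideographic space versus their half-width forms
print(lenient_exact_match("１９４５年", "1945年"))
print(lenient_exact_match("Tokyo　Tower", "tokyo tower"))
# Different answers still count as a mismatch
print(lenient_exact_match("東京", "京都"))
```

Such a function could be swapped in for `exact_match_score` without changing the rest of the metric computation, since it takes the same two string arguments.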

#### 2. No context model

In [None]:
!pip install transformers datasets sacrebleu pandas evaluate

In [None]:
from transformers import MT5ForConditionalGeneration, MT5Tokenizer
from datasets import Dataset
import pandas as pd

# Load mT5-small model and tokenizer
model_name = 'google/mt5-small'
tokenizer = MT5Tokenizer.from_pretrained(model_name)
model = MT5ForConditionalGeneration.from_pretrained(model_name)

splits = {'train': 'train.parquet', 'validation': 'validation.parquet'}
# Train and validation sets
train_df = pd.read_parquet("hf://datasets/coastalcph/tydi_xor_rc/" + splits["train"])
val_df = pd.read_parquet("hf://datasets/coastalcph/tydi_xor_rc/" + splits["validation"])

languages = ['ru', 'fi', 'ja']

train_df = train_df[train_df['lang'].isin(languages) & train_df['answer_inlang'].notna()]
val_df = val_df[val_df['lang'].isin(languages) & val_df['answer_inlang'].notna()]

#train_df.loc[train_df['answer_start'] == -1, 'answer_inlang'] = ""
#val_df.loc[val_df['answer_start'] == -1, 'answer_inlang'] = ""

train_dataset = Dataset.from_pandas(train_df)
val_dataset = Dataset.from_pandas(val_df)

def preprocess_data(examples):
    # Tokenize the questions (inputs)
    inputs = examples['question']
    targets = examples['answer_inlang']

    # Tokenize the inputs with padding and truncation
    model_inputs = tokenizer(inputs, max_length=128, truncation=True, padding="max_length")

    # Tokenize the target answers with padding and truncation
    labels = tokenizer(text_target=targets, max_length=128, truncation=True, padding="max_length").input_ids

    # Note: padding token IDs are left in the labels here (the masking line below is disabled),
    # so padded positions also contribute to the loss; masking them would mean using -100 instead of 0
    #labels = [[(label if label != tokenizer.pad_token_id else 0) for label in label_seq] for label_seq in labels]

    # Add labels to model inputs
    model_inputs["labels"] = labels

    return model_inputs

# Tokenize the datasets
train_dataset = train_dataset.map(preprocess_data, batched=True)
val_dataset = val_dataset.map(preprocess_data, batched=True)

In [None]:
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

# Define training arguments
training_args = Seq2SeqTrainingArguments(
    output_dir='./results_mt5',
    eval_strategy="no",
    save_strategy="no",
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    predict_with_generate=True,
    num_train_epochs=3,
    logging_dir='./logs_mt5',
    save_total_limit=3,
    learning_rate=1e-5,
    logging_steps=1,
    fp16=False,  # Disable mixed precision to avoid potential issues
    report_to="none"  # Disable W&B logging
)

# Initialize the Trainer
trainer = Seq2SeqTrainer(
    model=model,                           # The mT5 model
    args=training_args,                    # The training arguments we set up
    train_dataset=train_dataset,           # The training dataset
    tokenizer=tokenizer,                   # The tokenizer
)

trainer.train()

Step,Training Loss


In [None]:
!pip install googletrans==4.0.0-rc1



In [None]:
import evaluate
import numpy as np

# Load BLEU metric
bleu_metric = evaluate.load("sacrebleu")

def remove_extra_tokens(decoded_texts):
    # Remove tokens like <extra_id_0> from decoded strings
    cleaned_texts = [text.replace("<extra_id_0>", "").strip() for text in decoded_texts]
    return cleaned_texts

# Safe decode function to handle potential token ID errors
def safe_decode(predictions):
    try:
        # Ensure we skip special tokens like PAD, CLS, SEP, etc.
        decoded_texts = tokenizer.batch_decode(predictions, skip_special_tokens=True)
        return remove_extra_tokens(decoded_texts)
    except IndexError as e:
        print(f"Error decoding: {e}")
        return [""] * len(predictions)  # Return empty strings if there's a decoding error

# Calculate BLEU, TPR, and FPR
def compute_metrics(pred):
    labels_ids = pred.label_ids
    labels_ids = np.where(labels_ids != -100, labels_ids, tokenizer.pad_token_id)
    pred_ids = pred.predictions

    # Remove invalid token IDs (e.g., negative values)
    pred_ids = [[token for token in pred if token >= 0] for pred in pred_ids]

    # Decode predicted and reference texts
    pred_str = safe_decode(pred_ids)
    labels_str = safe_decode(labels_ids)

    # BLEU expects a list of references, where each reference itself is a list
    labels_str = [[label] for label in labels_str]
    bleu = bleu_metric.compute(predictions=pred_str, references=labels_str)["score"]

    # Calculate TPR and FPR based on the presence of an answer
    tp = fp = tn = fn = 0
    for pred, label in zip(pred_str, [ref[0] for ref in labels_str]):
        if label:  # Answerable
            if pred:  # Non-empty prediction
                tp += 1
            else:  # Empty prediction
                fn += 1
        else:  # Unanswerable
            if pred:  # Non-empty prediction
                fp += 1
            else:  # Empty prediction
                tn += 1

    # Calculate TPR and FPR
    tpr = tp / (tp + fn) if (tp + fn) > 0 else 0
    fpr = fp / (fp + tn) if (fp + tn) > 0 else 0

    return {"bleu": bleu, "tpr": tpr, "fpr": fpr}

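# Make predictions with a bounded generation length.
# Note: top_k, top_p and temperature only take effect when sampling is enabled (do_sample=True);
# without it they are ignored and decoding is deterministic.
predictions = trainer.predict(val_dataset, max_new_tokens=50, top_k=50, top_p=0.9, temperature=0.7)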

# Compute metrics after predictions
metrics = compute_metrics(predictions)
print(f"BLEU Score: {metrics['bleu']}")
print(f"True Positive Rate (TPR): {metrics['tpr']}")
print(f"False Positive Rate (FPR): {metrics['fpr']}")

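# Decode predictions and labels, removing <extra_id_0>.
# Gathered arrays may be padded with -100, which the tokenizer cannot decode, so map it to the pad token first
pred_ids = np.where(predictions.predictions != -100, predictions.predictions, tokenizer.pad_token_id)
label_ids = np.where(predictions.label_ids != -100, predictions.label_ids, tokenizer.pad_token_id)
pred_str = remove_extra_tokens(tokenizer.batch_decode(pred_ids, skip_special_tokens=True))
label_str = remove_extra_tokens(tokenizer.batch_decode(label_ids, skip_special_tokens=True))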

# Convert `val_dataset` to a DataFrame to add decoded answers for easier filtering
val_df = val_dataset.to_pandas()

# Add the decoded predictions and true answers to the DataFrame
val_df['predicted_answer'] = pred_str
val_df['true_answer'] = label_str

# Define the list of languages to filter by (adjust as per your dataset's language codes)
languages = ['fi', 'ja', 'ru']  # Adjust language codes as needed

# Loop through each language, filter the DataFrame, and print the first 10 entries
for lang in languages:
    print(f"\nPredictions for language: {lang}")
    lang_df = val_df[val_df['lang'] == lang]
    print(lang_df[['question', 'true_answer', 'predicted_answer']].head(10))
    print("\n" + "="*80 + "\n")


KeyboardInterrupt: 

# Week 41+

While generating an answer is more flexible than extracting it as a span, it
may be right for the wrong reasons, i.e. the answer may be correct even if the
question is unanswerable given the context.
Use all questions in Finnish, Japanese and Russian to train (or fine-tune) a
model that receives the question and context as input and generates the English
answer. You can decide whether to train one model per question language or a
single model for all three languages.
Evaluate using a text generation metric on the validation set, and compare
the overall results between answerable and unanswerable examples. Can the
model answer correctly even when the answer is not provided in the context?
Discuss the results.
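
The comparison asked for above can be sketched without any model code: score each generated answer against the gold English answer and average the scores separately for answerable and unanswerable examples. This is a minimal illustration, assuming whitespace tokenization is adequate for the English answers and that the original English `answer` is kept as the reference for unanswerable rows too (otherwise "correct despite missing context" cannot be measured). The function names and the toy inputs are hypothetical.

```python
from collections import Counter


def token_f1(prediction, reference):
    """Token-overlap F1 between two whitespace-tokenized English strings."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def scores_by_answerability(predictions, references, answerable_flags):
    """Average token F1 separately for answerable and unanswerable examples."""
    groups = {True: [], False: []}
    for pred, ref, flag in zip(predictions, references, answerable_flags):
        groups[bool(flag)].append(token_f1(pred, ref))
    return {
        "answerable": sum(groups[True]) / len(groups[True]) if groups[True] else None,
        "unanswerable": sum(groups[False]) / len(groups[False]) if groups[False] else None,
    }


# Toy inputs, invented purely for illustration
preds = ["Helsinki", "in 1945", "Tokyo"]
refs = ["Helsinki", "1945", "Kyoto"]
flags = [True, True, False]
print(scores_by_answerability(preds, refs, flags))
```

A non-zero score in the unanswerable group would suggest that the model can sometimes produce the right answer without support from the context, for example from knowledge stored in its parameters.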

#### Tom

In [None]:
!pip install transformers datasets sacrebleu pandas evaluate torch googletrans==4.0.0-rc1

In [None]:
from transformers import T5ForConditionalGeneration, T5Tokenizer
from datasets import Dataset
import pandas as pd

# Load flan-t5-small model and tokenizer
model_name = 'google/flan-t5-small'
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

# Define dataset paths and load data
splits = {'train': 'train.parquet', 'validation': 'validation.parquet'}
train_df = pd.read_parquet("hf://datasets/coastalcph/tydi_xor_rc/" + splits["train"])
val_df = pd.read_parquet("hf://datasets/coastalcph/tydi_xor_rc/" + splits["validation"])

languages = ['ru', 'fi', 'ja']
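# Copy the filtered frames so the assignments below modify them directly rather than a view of the full data
train_df = train_df[train_df['lang'].isin(languages)].copy()
val_df = val_df[val_df['lang'].isin(languages)].copy()

# Unanswerable questions get an empty target answer
train_df.loc[train_df['answerable'] == False, 'answer'] = ""
val_df.loc[val_df['answerable'] == False, 'answer'] = ""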

# Convert data to Hugging Face dataset format
train_dataset = Dataset.from_pandas(train_df)
val_dataset = Dataset.from_pandas(val_df)

def preprocess_data(examples):
    # Tokenize the questions (inputs)
    inputs = examples['question']
    targets = examples['answer']

    # Tokenize the inputs with padding and truncation
    model_inputs = tokenizer(inputs, examples['context'], max_length=512, truncation=True, padding="max_length")

    # Tokenize the target answers with padding and truncation
    labels = tokenizer(text_target=targets, max_length=128, truncation=True, padding="max_length").input_ids

    # Replace padding token ID with -100 to ignore in loss calculation
    labels = [[(label if label != tokenizer.pad_token_id else -100) for label in label_seq] for label_seq in labels]

    # Add labels to model inputs
    model_inputs["labels"] = labels

    return model_inputs

# Tokenize the datasets
train_dataset = train_dataset.map(preprocess_data, batched=True)
val_dataset = val_dataset.map(preprocess_data, batched=True)

from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments
import torch

# Define training arguments with optimizations for memory management
training_args = Seq2SeqTrainingArguments(
    output_dir='./results_flan_t5',
    eval_strategy="no",
    save_strategy="no",
    per_device_train_batch_size=8,  # Reduced batch size
    per_device_eval_batch_size=8,   # Reduced eval batch size
    gradient_accumulation_steps=2,  # Grad accumulation to simulate larger batch
    predict_with_generate=True,
    num_train_epochs=3,
    logging_dir='./logs_flan_t5',
    save_total_limit=3,
    learning_rate=1e-5,
    logging_steps=50,
    fp16=False,  # Mixed precision is disabled
    report_to="none"  # Disable W&B logging
)

# Clear GPU cache
torch.cuda.empty_cache()

# Initialize the Trainer
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    tokenizer=tokenizer,
)

# Train the model
trainer.train()

Step,Training Loss
50,1.7296
100,1.6044
150,1.5488
200,1.4684
250,1.408
300,1.3635
350,1.3461
400,1.3263
450,1.3269
500,1.2496


TrainOutput(global_step=1203, training_loss=1.3188989422069624, metrics={'train_runtime': 897.2135, 'train_samples_per_second': 21.433, 'train_steps_per_second': 1.341, 'total_flos': 3574674405457920.0, 'train_loss': 1.3188989422069624, 'epoch': 3.0})

In [None]:
import evaluate
import nltk
import numpy as np
import torch
from googletrans import Translator

# Download the necessary NLTK data
nltk.download('punkt')

# Initialize translation service
translator = Translator()

# Load the BLEU metric (the F1 metric is currently disabled)
bleu_metric = evaluate.load("bleu")
# f1_metric = evaluate.load("f1")  # Load the F1 metric

# Separate answerable and unanswerable examples from val_dataset
answerable_dataset = val_dataset.filter(lambda x: x["answerable"] == True)
unanswerable_dataset = val_dataset.filter(lambda x: x["answerable"] == False)

# Separate answerable and unanswerable examples by language
languages = ['ru', 'fi', 'ja']
answerable_datasets = {lang: answerable_dataset.filter(lambda x: x['lang'] == lang) for lang in languages}
unanswerable_datasets = {lang: unanswerable_dataset.filter(lambda x: x['lang'] == lang) for lang in languages}

# Function to evaluate a dataset split
def evaluate_split(dataset, split_name=""):
    predictions = trainer.predict(dataset)
    preds = predictions.predictions

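    # Decode predictions; gathered predictions may be padded with -100, which the tokenizer cannot decode
    preds = np.where(preds != -100, preds, tokenizer.pad_token_id)
    decoded_preds = [str(pred) for pred in tokenizer.batch_decode(preds, skip_special_tokens=True)]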

    # Replace -100 in labels with the pad_token_id
    label_ids = predictions.label_ids
    label_ids = np.where(label_ids != -100, label_ids, tokenizer.pad_token_id)

    # Decode labels
    decoded_labels = [str(label) for label in tokenizer.batch_decode(label_ids, skip_special_tokens=True)]

    # Prepare references: each reference should be a list of reference texts
    references = [[label] for label in decoded_labels]

    # Check if all references are empty for unanswerable questions
    if all(len(ref[0]) == 0 for ref in references):
        print(f"{split_name}: All references are empty, skipping BLEU.")
        bleu_score = {"bleu": None}
        # f1_score_value = None
    else:
        # Compute BLEU score
        bleu_score = bleu_metric.compute(predictions=decoded_preds, references=references)
        print(f"{split_name} BLEU score: {bleu_score['bleu']}")

        # Compute F1 score
        # f1_score_value = f1_metric.compute(predictions=decoded_preds, references=decoded_labels)["f1"]
        # print(f"{split_name} F1 score: {f1_score_value:.4f}")

    # Compute Exact Match score
    def exact_match(predictions, references):
        return np.mean([pred.strip() == ref.strip() for pred, ref in zip(predictions, references)])

    em_score = exact_match(decoded_preds, decoded_labels)
    print(f"{split_name} Exact Match score: {em_score:.4f}")

    return {
        "bleu": bleu_score.get("bleu"),
        # "f1": f1_score_value,
        "exact_match": em_score,
        "decoded_preds": decoded_preds,
        "decoded_labels": decoded_labels
    }


# Evaluate each language split for both answerable and unanswerable questions
results = {}
for lang in languages:
    print(f"\nEvaluating answerable examples in {lang}:")
    answerable_results = evaluate_split(answerable_datasets[lang], f"Answerable ({lang})")
    results[f"answerable_{lang}"] = answerable_results

    print(f"\nEvaluating unanswerable examples in {lang}:")
    unanswerable_results = evaluate_split(unanswerable_datasets[lang], f"Unanswerable ({lang})")
    results[f"unanswerable_{lang}"] = unanswerable_results

# Combine all answerable and unanswerable datasets across languages
all_answerable = answerable_dataset  # already includes all answerable examples across languages
all_unanswerable = unanswerable_dataset  # already includes all unanswerable examples across languages

# Evaluate combined answerable and unanswerable datasets across all languages
print("\nEvaluating all answerable examples across languages:")
all_answerable_results = evaluate_split(all_answerable, "Answerable (All Languages)")
results["answerable_all"] = all_answerable_results

print("\nEvaluating all unanswerable examples across languages:")
all_unanswerable_results = evaluate_split(all_unanswerable, "Unanswerable (All Languages)")
results["unanswerable_all"] = all_unanswerable_results

# Display comparison results by language and answerability
print("\nComparison of results by language and answerability:")
for lang in languages:
    for metric in ["bleu", "exact_match"]:
        ans_score = results[f"answerable_{lang}"][metric]
        unans_score = results[f"unanswerable_{lang}"][metric]
        print(f"{metric.capitalize()} - {lang.upper()} - Answerable: {ans_score}, Unanswerable: {unans_score}")

# Display overall comparison for all languages combined
print("\nOverall results for all languages combined:")
for metric in ["bleu", "exact_match"]:
    ans_score = results["answerable_all"][metric]
    unans_score = results["unanswerable_all"][metric]
    print(f"{metric.capitalize()} - All Languages - Answerable: {ans_score}, Unanswerable: {unans_score}")

Evaluating answerable examples in ru:




KeyboardInterrupt: 

In [None]:
# Function to calculate in-context and empty prediction percentages
def calculate_prediction_statistics(decoded_preds, contexts):
    """Return the percentages of empty, in-context, and out-of-context predictions."""
    total = len(decoded_preds)
    if total == 0:  # Avoid division by zero when there are no predictions
        return 0.0, 0.0, 0.0

    in_context_count = 0
    empty_count = 0

    for pred, context in zip(decoded_preds, contexts):
        pred = pred.strip()
        if not pred:  # Empty prediction
            empty_count += 1
        elif pred in context:  # Prediction appears verbatim in the context
            in_context_count += 1

    empty_percentage = (empty_count / total) * 100
    in_context_percentage = (in_context_count / total) * 100
    out_of_context_percentage = 100 - in_context_percentage - empty_percentage

    return empty_percentage, in_context_percentage, out_of_context_percentage

# Calculate statistics for answerable examples across all languages
print("\nCalculating prediction statistics for all answerable examples across languages:")
all_answerable_preds = results["answerable_all"]["decoded_preds"]
all_answerable_contexts = all_answerable["context"]
answerable_empty, answerable_in_context, answerable_out_of_context = calculate_prediction_statistics(
    all_answerable_preds, all_answerable_contexts
)
print(f"Answerable (All Languages) - Empty predictions: {answerable_empty:.2f}%, "
      f"In-context predictions: {answerable_in_context:.2f}%, "
      f"Out-of-context predictions: {answerable_out_of_context:.2f}%")

# Calculate statistics for unanswerable examples across all languages
print("\nCalculating prediction statistics for all unanswerable examples across languages:")
all_unanswerable_preds = results["unanswerable_all"]["decoded_preds"]
all_unanswerable_contexts = all_unanswerable["context"]
unanswerable_empty, unanswerable_in_context, unanswerable_out_of_context = calculate_prediction_statistics(
    all_unanswerable_preds, all_unanswerable_contexts
)
print(f"Unanswerable (All Languages) - Empty predictions: {unanswerable_empty:.2f}%, "
      f"In-context predictions: {unanswerable_in_context:.2f}%, "
      f"Out-of-context predictions: {unanswerable_out_of_context:.2f}%")

# Calculate statistics for all examples across all languages
print("\nCalculating prediction statistics for all examples across languages:")
all_preds = list(all_answerable_preds) + list(all_unanswerable_preds)
all_contexts = list(all_answerable["context"]) + list(all_unanswerable["context"])
all_empty, all_in_context, all_out_of_context = calculate_prediction_statistics(all_preds, all_contexts)
print(f"All examples (All Languages) - Empty predictions: {all_empty:.2f}%, "
      f"In-context predictions: {all_in_context:.2f}%, "
      f"Out-of-context predictions: {all_out_of_context:.2f}%")


Calculating prediction statistics for all answerable examples across languages:
Answerable (All Languages) - Empty predictions: 15.67%, In-context predictions: 80.02%, Out-of-context predictions: 4.31%

Calculating prediction statistics for all unanswerable examples across languages:
Unanswerable (All Languages) - Empty predictions: 22.84%, In-context predictions: 73.43%, Out-of-context predictions: 3.73%

Calculating prediction statistics for all examples across languages:
All examples (All Languages) - Empty predictions: 17.90%, In-context predictions: 77.97%, Out-of-context predictions: 4.13%
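
The percentages above come from an exact substring test, so a prediction that differs from the context only in casing, spacing, or character width (for example full-width digits) is counted as out-of-context. A minimal sketch of a more lenient check follows. The names `normalize` and `classify_prediction` are hypothetical and not part of the notebook, and using this variant would change the reported numbers.

```python
import unicodedata


def normalize(text):
    """Apply NFKC normalization, case-fold, and collapse whitespace runs."""
    text = unicodedata.normalize("NFKC", text)
    return " ".join(text.casefold().split())


def classify_prediction(pred, context):
    """Return 'empty', 'in_context', or 'out_of_context' for one prediction."""
    pred_norm = normalize(pred)
    if not pred_norm:
        return "empty"
    if pred_norm in normalize(context):
        return "in_context"
    return "out_of_context"


# Toy inputs for illustration only (not taken from the dataset)
print(classify_prediction("  Helsinki ", "The capital of Finland is  helsinki."))
print(classify_prediction("Oslo", "The capital of Finland is Helsinki."))
```

Comparing the counts from this check with the exact-match counts would show how many of the out-of-context predictions are only formatting differences rather than text the model produced on its own.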


In [None]:
# Print specific unanswerable predictions where the prediction is non-empty and not verbatim in the context
print("\nUnanswerable questions with non-empty, non-verbatim predictions across languages:")
for lang in languages:
    decoded_preds = results[f"unanswerable_{lang}"]["decoded_preds"]
    contexts = unanswerable_datasets[lang]["context"]
    questions = unanswerable_datasets[lang]["question"]

    for question, context, prediction in zip(questions, contexts, decoded_preds):
        if prediction.strip() and prediction.strip() not in context:
            # Translate question and context to English
            question_en = translator.translate(question, dest='en').text
            context_en = translator.translate(context, dest='en').text

            # Print the question, context, and prediction
            print(f"Language: {lang.upper()}")
            print(f"Question (Translated): {question_en}")
            print(f"Context (Translated): {context_en}")
            print(f"Model Prediction: {prediction}")
            print("-" * 50)