<a href="https://colab.research.google.com/github/sudhirslab/aiml-hotspot/blob/main/Multilingual_Model_w_PyTorch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Install the necessary libraries
!pip install transformers torch



In [None]:
# @author : Sudhir R. Pradhan

import torch
from transformers import BertTokenizer, BertForMaskedLM

# Load the pre-trained mBERT tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased')
model = BertForMaskedLM.from_pretrained('bert-base-multilingual-cased')

# Example of multilingual input (English, Spanish, and Hindi)
# texts = [
#     "I love programming and learning new things.",
#     "Me encanta programar y aprender cosas nuevas.",
#     "मुझे प्रोग्रामिंग पसंद है और नई चीजें सीखना।"
# ]

texts = [
    "I love programming and learning new ",
    "Me encanta programar y aprender cosas ",
    "मुझे प्रोग्रामिंग पसंद है और नई चीजें सीखना "
]

# Tokenize the texts in multiple languages
inputs = tokenizer(texts, return_tensors='pt', padding=True, truncation=True)

# Run a forward pass. No token is masked in these inputs, so the model has no blank to fill.
inputs['labels'] = inputs.input_ids.detach().clone()  # Labels only affect outputs.loss, which is not used below
outputs = model(**inputs)

# Get the prediction logits and the token ids of the words in the input
logits = outputs.logits
predicted_ids = torch.argmax(logits, dim=-1)

# Decode the predicted tokens back into words
predicted_texts = tokenizer.batch_decode(predicted_ids, skip_special_tokens=True)

print("Input texts: ", texts)
print("Predicted texts: ", predicted_texts)


Some weights of the model checkpoint at bert-base-multilingual-cased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Input texts:  ['I love programming and learning new ', 'Me encanta programar y aprender cosas ', 'मुझे प्रोग्रामिंग पसंद है और नई चीजें सीखना ']
Predicted texts:  ['. I love programming and learning new. learning new I I I I I. I love learning programming learning learning new learning', '. Me encanta programar y aprender cosas... Me Me Me Me Me Me aprender a programa, para', '. मुझे प्रोग्रामिंग पसंद है और नई चीजें सीखना ।']



> The warning you're seeing is expected and harmless. The `bert-base-multilingual-cased` checkpoint was saved from BERT's pre-training setup, so it also contains the pooler (`bert.pooler`) and the next-sentence-prediction head (`cls.seq_relationship`). `BertForMaskedLM` does not need these weights for the Masked Language Modeling (MLM) task and skips them when loading, so the warning can be ignored here.
>
> The main issue is the incorrect predictions. Output such as "I I I I I. I love learning programming" suggests that the model isn't filling in a missing word at all. This could happen for a couple of reasons:
>
> 1. Using `BertForMaskedLM` with incorrect inputs: the texts contain no `[MASK]` token (BERT's mask token is `[MASK]`, not `<mask>`), so there is no masked position for the model to predict. Two further factors may also hurt quality: the word to be predicted sits at the end of the sentence, where there is little context to its right, and the model may fill in words less reliably in some languages than in others.
> 2. Model inference strategy: `BertForMaskedLM` predicts a masked token from the context surrounding it. The code instead takes the argmax at every position and decodes all of them, including the padded positions of the shorter sentences, which is a likely source of the repeated, irrelevant words.
>
> Let's break it down step by step and make some refinements to fix the issue.
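One reading of the repeated words in the first cell's output is that the shorter sentences were padded to the length of the longest one, and `batch_decode` also decoded whatever the model emitted at those padded positions. The sketch below illustrates that idea with plain Python lists standing in for tensors, so it runs without downloading a model. All ids are hypothetical toy values, and `drop_padding` is an illustrative helper, not part of `transformers`.

```python
# Toy stand-ins (hypothetical ids, not real mBERT vocabulary entries).
PAD_ID = 0

# Two tokenized sentences; the first is padded to the length of the second.
input_ids = [
    [101, 11, 12, 102, PAD_ID, PAD_ID],
    [101, 21, 22, 23, 24, 102],
]

# 1 marks a real token, 0 marks padding (the role attention_mask plays in the notebook).
attention_mask = [[int(tok != PAD_ID) for tok in row] for row in input_ids]

# Stand-in for logits.argmax(dim=-1): an id is still produced at padded positions.
predicted_ids = [
    [5, 11, 12, 5, 11, 11],
    [5, 21, 22, 23, 24, 5],
]

def drop_padding(pred_rows, mask_rows):
    """Keep only predictions that line up with real (non-padding) tokens."""
    return [
        [pred for pred, keep in zip(preds, mask) if keep]
        for preds, mask in zip(pred_rows, mask_rows)
    ]

kept = drop_padding(predicted_ids, attention_mask)
print(kept)
```

Filtering this way only removes the noise from padding. It does not turn the first cell into a fill-in-the-blank task, because the inputs still contain no mask token.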


In [None]:
# @author : Sudhir R. Pradhan

# Install the necessary libraries
# !pip install transformers torch

import torch
from transformers import BertTokenizer, BertForMaskedLM

# Load the pre-trained mBERT tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased')
model = BertForMaskedLM.from_pretrained('bert-base-multilingual-cased')

# Example of multilingual input (English, Spanish, and Hindi) with masked tokens
texts = [
    "I love programming and learning new [MASK].",
    "Me encanta programar y aprender cosas [MASK].",
    "मुझे प्रोग्रामिंग पसंद है और नई चीजें सीखना [MASK]।"
]

# Tokenize the texts in multiple languages
inputs = tokenizer(texts, return_tensors='pt', padding=True, truncation=True)

# Perform a masked language modeling task (predict the masked token)
# Prepare inputs for prediction
with torch.no_grad():
    outputs = model(**inputs)

# Get the prediction logits for the masked token positions
logits = outputs.logits

# Extract the indices of the masked positions (index of the [MASK] token)
masked_indices = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]

# For each sentence, predict the masked token
predicted_ids = []
for idx in masked_indices:
    predicted_token_id = logits[0, idx].argmax().item()
    predicted_ids.append(predicted_token_id)

# Decode the predicted tokens back into words
predicted_tokens = tokenizer.convert_ids_to_tokens(predicted_ids)
predicted_texts = []

# Replace the [MASK] token with the predicted token in the original sentence
for i, text in enumerate(texts):
    predicted_text = text.replace('[MASK]', predicted_tokens[i])
    predicted_texts.append(predicted_text)

print("Input texts: ", texts)
print("Predicted texts: ", predicted_texts)




Input texts:  ['I love programming and learning new [MASK].', 'Me encanta programar y aprender cosas [MASK].', 'मुझे प्रोग्रामिंग पसंद है और नई चीजें सीखना [MASK]।']
Predicted texts:  ['I love programming and learning new things.', 'Me encanta programar y aprender cosas I.', 'मुझे प्रोग्रामिंग पसंद है और नई चीजें सीखना things।']
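The Spanish and Hindi predictions above ("I" and "things") look out of place. A likely cause is visible in the cell: `nonzero(as_tuple=True)[1]` keeps only the column of each `[MASK]`, and the loop then reads `logits[0, idx]`, so every prediction is taken from the first sentence's logits. Pairing each column with its own row, for example `logits[row, col]` for the row and column tensors returned by `nonzero(as_tuple=True)`, should address this. The sketch below shows the difference on toy data. Plain lists stand in for tensors, and all ids and scores are hypothetical.

```python
# Toy batch: 3 sentences, 4 positions each, vocabulary of 5 ids (all hypothetical).
MASK_ID = 4
input_ids = [
    [1, 2, 4, 0],  # mask at position 2
    [1, 4, 3, 0],  # mask at position 1
    [1, 2, 3, 4],  # mask at position 3
]

def one_hot(best, size=5):
    """Scores over the toy vocabulary with a single clear winner."""
    return [1.0 if v == best else 0.0 for v in range(size)]

# The id the toy "model" prefers at each (row, position).
best_id = [
    [0, 0, 2, 0],
    [0, 3, 0, 0],
    [0, 0, 0, 1],
]
logits = [[one_hot(b) for b in row] for row in best_id]

def argmax(scores):
    return max(range(len(scores)), key=scores.__getitem__)

# (row, column) of every mask token, the pure-Python analogue of nonzero(as_tuple=True).
mask_positions = [
    (r, c)
    for r, row in enumerate(input_ids)
    for c, tok in enumerate(row)
    if tok == MASK_ID
]

# Always reading row 0 mirrors logits[0, idx] in the cell above.
row_zero_only = [argmax(logits[0][c]) for _, c in mask_positions]

# Reading each mask from its own sentence.
per_sentence = [argmax(logits[r][c]) for r, c in mask_positions]

print("row 0 only:  ", row_zero_only)
print("per sentence:", per_sentence)
```

Only the first sentence gets the same answer either way, which matches the pattern in the output above, where the English prediction is the only plausible one.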


---------


In [None]:
from transformers import XLMRobertaForMaskedLM, XLMRobertaTokenizer
import torch

# Load the XLM-RoBERTa tokenizer and model
model_name = "xlm-roberta-large"  # A multilingual transformer model
tokenizer = XLMRobertaTokenizer.from_pretrained(model_name)
model = XLMRobertaForMaskedLM.from_pretrained(model_name)

# The XLM-RoBERTa tokenizer already defines its own <pad> and <mask> tokens, so no extra setup is needed

# Input sentences in multiple languages
texts = [
    "I love programming and learning new <mask>.",  # English
    "Me encanta programar y aprender nuevas <mask>.",  # Spanish
    "मुझे प्रोग्रामिंग पसंद है और नई चीजें सीखना <mask>।",  # Hindi
]

# Function to predict the next word for a masked token
def predict_next_word(text):
    # Tokenize input sentence and prepare it for masked language modeling
    input_ids = tokenizer.encode(text, return_tensors="pt")

    # Create attention mask (1 for real tokens, 0 for padding tokens)
    attention_mask = torch.ones(input_ids.shape, device=input_ids.device)

    # Get predictions for the <mask> token
    with torch.no_grad():
        output = model(input_ids, attention_mask=attention_mask)

    # Find the index of the <mask> token
    mask_token_index = torch.where(input_ids == tokenizer.mask_token_id)[1].item()

    # Get the predicted token ID (highest probability for the <mask> token)
    predicted_token_id = output.logits[0, mask_token_index].argmax().item()

    # Decode the predicted token to the next word
    predicted_word = tokenizer.decode(predicted_token_id, skip_special_tokens=True)

    return predicted_word

# Loop through the text inputs and predict the next word
for text in texts:
    next_word = predict_next_word(text)
    print(f"Input: {text}")
    print(f"Predicted next word: {next_word}\n")


Some weights of the model checkpoint at xlm-roberta-large were not used when initializing XLMRobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing XLMRobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLMRobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Input: I love programming and learning new <mask>.
Predicted next word: things

Input: Me encanta programar y aprender nuevas <mask>.
Predicted next word: cosas

Input: मुझे प्रोग्रामिंग पसंद है और नई चीजें सीखना <mask>।
Predicted next word: है
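Taking the argmax returns only the single highest-scoring token. To see how confident the model is, or which alternatives it considered, one option is to apply a softmax to the logits at the mask position and list the top few candidates. With the tensors from the cell above, that would be roughly `torch.topk(output.logits[0, mask_token_index].softmax(dim=-1), k)`. The sketch below shows the same idea in plain Python on a toy vocabulary. The tokens and scores are hypothetical, not real model output.

```python
import math

def softmax(scores):
    """Convert raw scores to probabilities, subtracting the max for numerical stability."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(scores, vocab, k=3):
    """Return the k most probable (token, probability) pairs, best first."""
    probs = softmax(scores)
    ranked = sorted(zip(vocab, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Hypothetical vocabulary and scores at the masked position.
vocab = ["things", "skills", "languages", "cars"]
mask_scores = [4.0, 3.0, 2.5, -1.0]

for token, prob in top_k(mask_scores, vocab):
    print(f"{token}: {prob:.3f}")
```

Listing several candidates makes it easier to tell a confident prediction from a near tie, which the single word printed above cannot show.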

