## Introduction to Approach 3: Fine-Tuning BERT for Language Identification

In this notebook, I take a more advanced approach to language identification: fine-tuning a pre-trained BERT (Bidirectional Encoder Representations from Transformers) model. The objective is to use a multilingual BERT model, pre-trained on text in many languages, to accurately identify the language of a given text sample.

### Why BERT?
BERT models are pre-trained on a large corpus of text, which lets them learn rich representations of language structure. By fine-tuning BERT on the specific task of language identification, I aim to use those representations to improve classification accuracy across diverse languages with relatively little additional training.

### Goals and Setup
The goal is to fine-tune the `bert-base-multilingual-cased` model, which supports 104 languages. The cased variant preserves the capitalization of the input text, which can help it pick up language-specific cues.

I will:
- Set up the model and tokenizer for multilingual input.
- Prepare the dataset, which includes encoding texts and converting language labels to a format suitable for BERT.
- Define training parameters that balance efficiency with performance, allowing for quick iterations and adjustments.
- Utilize the Hugging Face Transformers library to streamline model training and evaluation.


In [2]:
import numpy as np
import random
from transformers import BertTokenizer, BertForSequenceClassification
from transformers import Trainer, TrainingArguments
from transformers import AutoTokenizer
import torch
from torch.utils.data import Dataset

In [3]:
# !pip install transformers[torch]

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


# Data Loading and Sampling
This section covers loading and randomly sampling the training and test data. Sampling reduces computation time while developing the model on Google Colab. I use 25,000 samples for training and 10,000 for testing, drawn at random so that the subset stays representative of the full dataset.

In [3]:
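def load_data_sample(filepath, sample_size=10, random_state=42):
    """
    Load a random sample of lines from a file to speed up later computations.

    The same random_state selects the same line indices in files with the same
    number of lines, which keeps a text file and its label file aligned.
    """
    with open(filepath, 'r', encoding='utf-8') as file:
        lines = file.readlines()

    # A local generator gives the same sample without reseeding the global random state
    rng = random.Random(random_state)
    sampled_indices = rng.sample(range(len(lines)), sample_size)
    sampled_lines = [lines[i].strip() for i in sampled_indices]
    return sampled_lines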

def load_datasets(train_data_path, train_labels_path, test_data_path, test_labels_path, train_sample_size=25000, test_sample_size=10000):
    """
    Load training and testing data and labels from specified file paths.

    Args:
    train_data_path (str): Path to the training data file.
    train_labels_path (str): Path to the training labels file.
    test_data_path (str): Path to the testing data file.
    test_labels_path (str): Path to the testing labels file.
    train_sample_size (int): Number of samples to load from the training data.
    test_sample_size (int): Number of samples to load from the testing data.

    Returns:
    tuple: Tuple containing loaded training data, training labels, testing data, testing labels.
    """
    # Load training data and labels
    X_train = load_data_sample(train_data_path, sample_size=train_sample_size)
    y_train = load_data_sample(train_labels_path, sample_size=train_sample_size)

    # Load testing data and labels
    X_test = load_data_sample(test_data_path, sample_size=test_sample_size)
    y_test = load_data_sample(test_labels_path, sample_size=test_sample_size)

    return X_train, y_train, X_test, y_test


# Define the base path for data files
base_path = '/content/drive/MyDrive/data'

# Paths to data files using the base path
train_data_path = f'{base_path}/train/x_train.txt'
train_labels_path = f'{base_path}/train/y_train.txt'
test_data_path = f'{base_path}/test/x_test.txt'
test_labels_path = f'{base_path}/test/y_test.txt'

In [4]:
# Load all datasets using simplified paths
X_train, y_train, X_test, y_test = load_datasets(train_data_path, train_labels_path, test_data_path, test_labels_path)
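
Because texts and labels live in separate files, they stay aligned only if both draws select the same line indices. `load_data_sample` relies on this implicitly by reusing the same seed on files of equal length. A minimal sketch that makes the pairing explicit is shown below; the helper name `load_paired_sample` is hypothetical and assumes that line i of the label file is the label of line i of the text file.

```python
import random


def load_paired_sample(data_path, labels_path, sample_size, random_state=42):
    """Sample the same line indices from a text file and its label file."""
    with open(data_path, "r", encoding="utf-8") as f:
        texts = [line.strip() for line in f]
    with open(labels_path, "r", encoding="utf-8") as f:
        labels = [line.strip() for line in f]

    # Fail early if the two files cannot be paired line by line
    if len(texts) != len(labels):
        raise ValueError(f"{len(texts)} texts but {len(labels)} labels")

    rng = random.Random(random_state)
    k = min(sample_size, len(texts))  # never request more lines than exist
    indices = rng.sample(range(len(texts)), k)
    return [texts[i] for i in indices], [labels[i] for i in indices]
```

Drawing the indices once and applying them to both lists means a length mismatch raises an error instead of silently pairing texts with the wrong labels.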

In [69]:
print(X_train[0])
print(y_train[0])

By the late 1980s, CBS was telecasting 15 or 16 regular season games per year. In 1989 alone, only 13 of the 24 playoff games (Games 1–3, specifically) in Round 1 aired on TBS or CBS (for example, none of the four games from the Seattle–Houston first round series appeared on national television). Notably, Game 5 of the 1989 playoff series between the Chicago Bulls and Cleveland Cavaliers (featuring Michael Jordan's now famous game winning, last second shot over Craig Ehlo) was not nationally televised. CBS affiliates in Virginia elected to show the first game of a second round series between Seattle and the Lakers.
cbk


Next, I initialize and load the pre-trained BERT model along with its tokenizer, configured for the multilingual language classification task with 235 unique labels (the number of languages in the dataset).




In [None]:
# Load tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased')
model = BertForSequenceClassification.from_pretrained(
    'bert-base-multilingual-cased',
    num_labels=235,  # The number of unique labels in the dataset
    output_attentions=False,
    output_hidden_states=False,
)

### Encode Texts and Labels

In [25]:
def encode_texts(texts):
    return tokenizer(texts, padding=True, truncation=True, return_tensors="pt", max_length=512)
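
With `padding=True`, every text in a single tokenizer call is padded to the longest sequence in that call, capped here at `max_length=512`. The sketch below gives a rough estimate of the tensor memory this implies for the 25,000-sample training set. It assumes the worst case, where at least one text reaches 512 tokens, and that the tokenizer returns three int64 tensors (`input_ids`, `token_type_ids`, `attention_mask`); the function name is made up for illustration.

```python
def padded_encoding_bytes(num_texts, padded_length, num_tensors=3, bytes_per_value=8):
    """Bytes needed to hold num_tensors dense integer tensors of shape (num_texts, padded_length)."""
    return num_texts * padded_length * num_tensors * bytes_per_value


worst_case = padded_encoding_bytes(25_000, 512)
print(f"{worst_case / 2**20:.0f} MiB")  # 293 MiB
```

If this becomes a bottleneck, a common alternative is to tokenize without padding and pad each batch separately, for example with the `DataCollatorWithPadding` class from Transformers.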

In [26]:
# Combine all labels from both train and test sets to ensure all are known before encoding
all_labels = sorted(set(y_train) | set(y_test))  # Union of y_train and y_test labels, sorted for consistency

# Create a mapping from language code to a unique index
lang2idx = {lang: idx for idx, lang in enumerate(all_labels)}

# Function to encode labels based on the mapping
def encode_labels(labels):
    return np.array([lang2idx[lang] for lang in labels])
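
The index assigned to each language depends on which labels are present when `lang2idx` is built, so exactly the same mapping has to be available at inference time. One way to avoid rebuilding it from data is to store it next to the model. A minimal sketch with toy labels follows; the helper names and the idea of a separate JSON file are assumptions, not part of the notebook.

```python
import json
from pathlib import Path


def build_label_maps(labels):
    """Build label -> index and index -> label maps from an iterable of labels."""
    ordered = sorted(set(labels))
    label2id = {label: idx for idx, label in enumerate(ordered)}
    id2label = {idx: label for label, idx in label2id.items()}
    return label2id, id2label


def save_label_map(label2id, path):
    """Write the label -> index map as JSON."""
    Path(path).write_text(json.dumps(label2id, ensure_ascii=False, indent=2), encoding="utf-8")


def load_label_maps(path):
    """Read the JSON file back and rebuild both directions of the mapping."""
    label2id = json.loads(Path(path).read_text(encoding="utf-8"))
    id2label = {idx: label for label, idx in label2id.items()}
    return label2id, id2label


toy_label2id, toy_id2label = build_label_maps(["eng", "deu", "eng", "fra"])
print(toy_label2id)  # {'deu': 0, 'eng': 1, 'fra': 2}
```

As an alternative, `from_pretrained` accepts `id2label` and `label2id` keyword arguments, which are stored in the model config and saved together with the model.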

The `LanguageDataset` class organizes the encoded texts and labels so that the Hugging Face `Trainer` can consume them.


In [28]:

class LanguageDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        # Encodings are already tensors (return_tensors="pt"), so index them directly
        item = {key: val[idx] for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx], dtype=torch.long)
        return item

    def __len__(self):
        return len(self.labels)

# Tokenize and encode the sampled texts
train_encodings = encode_texts(X_train)
test_encodings = encode_texts(X_test)

# Convert language codes to integer class indices
train_labels = encode_labels(y_train)
test_labels = encode_labels(y_test)

# Create Dataset objects
train_dataset = LanguageDataset(train_encodings, train_labels)
eval_dataset = LanguageDataset(test_encodings, test_labels)


Next, I configure the training parameters for the model: the number of epochs, the batch sizes for training and evaluation, and the directory for saving model checkpoints.


In [31]:
training_args = TrainingArguments(
    output_dir='./bert_base_model_results',  # directory to save model checkpoints
    num_train_epochs=2,            # to reduce computational load and time
    per_device_train_batch_size=16,  # batch size for training
    per_device_eval_batch_size=64,   # batch size for evaluation
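    warmup_steps=500,              # learning-rate warmup steps
    weight_decay=0.01,             # weight decay applied by the optimizer
    logging_dir='./logs',          # directory for training logs
    evaluation_strategy="epoch"    # evaluate after each epoch; newer transformers releases name this eval_strategy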
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset
)

In [None]:
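# Smoke test on a small subset before the full run.
# Slice each tensor explicitly; slicing the tokenizer output object itself is not reliable across versions.
# Note: this trains the same `model` object that the full run uses afterwards.
small_train_dataset = LanguageDataset({k: v[:100] for k, v in train_encodings.items()}, train_labels[:100])
small_eval_dataset = LanguageDataset({k: v[:50] for k, v in test_encodings.items()}, test_labels[:50])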

test_trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset
)
test_trainer.train()


In [None]:
trainer.train()
# cell output:
# Epoch	Training Loss	Validation Loss ---- [3126/3126 21:29, Epoch 2/2]
# 1	0.065400	0.286143
# 2	0.064200	0.259737
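
The step counter in this output can be checked by hand. With 25,000 training samples, a batch size of 16, and 2 epochs, the last batch of each epoch is only partially filled, so each epoch takes the ceiling of 25,000 / 16 optimizer steps. A small sketch follows, assuming a single device and no gradient accumulation.

```python
import math


def total_training_steps(num_samples, batch_size, num_epochs):
    """Number of optimizer steps when the last partial batch of each epoch is kept."""
    steps_per_epoch = math.ceil(num_samples / batch_size)
    return steps_per_epoch * num_epochs


steps = total_training_steps(25_000, 16, 2)
print(steps)  # 3126
print(f"warmup share: {500 / steps:.1%}")
```

The result matches the 3126 steps shown in the progress bar, and it shows that the 500 warmup steps cover roughly the first sixth of training.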

In [None]:
model.save_pretrained('./final_model')
tokenizer.save_pretrained('./final_model')

In [None]:
evaluation_results = trainer.evaluate()
# cell output:
# {'eval_loss': 0.2597372531890869,
#  'eval_runtime': 66.2139,
#  'eval_samples_per_second': 151.026,
#  'eval_steps_per_second': 2.371,
#  'epoch': 2.0}
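
The throughput figures in this output are consistent with the 10,000-sample test set and the evaluation batch size of 64, which a short calculation confirms. The values are copied from the output above.

```python
import math

eval_runtime = 66.2139          # seconds, from 'eval_runtime'
samples_per_second = 151.026    # from 'eval_samples_per_second'
eval_batch_size = 64            # per_device_eval_batch_size

num_samples = round(eval_runtime * samples_per_second)
num_batches = math.ceil(num_samples / eval_batch_size)

print(num_samples)                            # 10000
print(num_batches)                            # 157
print(round(num_batches / eval_runtime, 3))   # 2.371
```

The last value reproduces `eval_steps_per_second`. The output contains no accuracy figure because only the loss is computed by default.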

In [76]:
from google.colab import files

# Zip the model directory for easier download
!zip -r final_model.zip ./final_model

# Trigger the browser download
files.download('final_model.zip')

  adding: final_model/ (stored 0%)
  adding: final_model/special_tokens_map.json (deflated 42%)
  adding: final_model/config.json (deflated 77%)
  adding: final_model/model.safetensors (deflated 7%)
  adding: final_model/tokenizer_config.json (deflated 75%)
  adding: final_model/vocab.txt (deflated 45%)

In [None]:
# # Zip the model directory for easier download
# !zip -r bert_base_model_results.zip ./bert_base_model_results

# # Trigger the browser download
# files.download('bert_base_model_results.zip')
files.download('final_model.zip')  # files.download expects a file, so download the zip archive rather than the directory

# Testing the model

In [5]:
# Load the model and tokenizer
model_path = './final_model'
loaded_model = BertForSequenceClassification.from_pretrained(model_path)
loaded_tokenizer = BertTokenizer.from_pretrained(model_path)

In [6]:

# Rebuild the label list exactly as it was built for training (union of train and test labels)
unique_labels = sorted(set(y_train) | set(y_test))

# Create a mapping from labels to indices
label_to_index = {label: idx for idx, label in enumerate(unique_labels)}

# Invert the mapping to get from indices to labels
index_to_language = {idx: label for label, idx in label_to_index.items()}


In [74]:
def predict(text, tokenizer, model, index_to_language):
    model.eval()  # Ensure the model is in evaluation mode
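    # Tokenize with the tokenizer passed in (encode_texts accepts only the texts argument)
    inputs = tokenizer(text, padding=True, truncation=True, return_tensors="pt", max_length=512)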
    with torch.no_grad():  # Disable gradient calculation for inference
        outputs = model(**inputs)  # Forward pass
        logits = outputs.logits
        prediction_index = torch.argmax(logits, dim=-1).item()  # Get the index of the highest logit

    # Convert index to language using the mapping
    predicted_language = index_to_language[prediction_index]
    return predicted_language

In [75]:
# Example English text
english_text = "This is a simple test to see how the BERT model classifies this sentence."

# Predict using the pretrained model
predicted_language = predict(english_text, loaded_tokenizer, loaded_model, index_to_language)

print("Predicted Language:", predicted_language)

Predicted Language: eng
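
`predict` returns only the top label. When a confidence estimate is useful, the logits can be turned into probabilities with a softmax and then ranked. A minimal pure-Python sketch follows; the function name `top_k_languages`, the toy logits, and the toy mapping are made up for illustration.

```python
import math


def top_k_languages(logits, index_to_language, k=3):
    """Return the k most probable (language, probability) pairs."""
    max_logit = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(index_to_language[i], probs[i]) for i in ranked[:k]]


toy_mapping = {0: "deu", 1: "eng", 2: "fra"}
for language, prob in top_k_languages([0.5, 3.0, 1.0], toy_mapping, k=2):
    print(f"{language}: {prob:.3f}")
```

With the model from this notebook, the logits for a single text could be passed as `outputs.logits[0].tolist()`.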
