<a href="https://colab.research.google.com/github/simulate111/Deep-Learning-in-Human-Language-Technology/blob/main/course_project_reza_Final.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Deep learning in Human Language Technology Project (Template)

- Student(s) Name(s): Mohammadreza Akhtari (2304399)
- Date: October 27, 2024
- Chosen Corpus: amazon_reviews_multi
- Contributions (if group project): -

### Corpus information

- Description of the chosen corpus:
The dataset is a large, diverse collection of Amazon product reviews for multilingual text classification. The corpus comprises reviews in six languages: English, Japanese, German, French, Chinese, and Spanish, collected between November 1, 2015 and November 1, 2019. For each language, the training, development, and test sets contain 200,000, 5,000, and 5,000 reviews, respectively. The corpus is balanced across star ratings, so each rating accounts for 20% of the reviews in each language.

- Paper(s) and other published materials related to the corpus:
Heterogeneous text graph for comprehensive multilingual sentiment analysis: capturing short- and long-distance semantics https://doi.org/10.7717/peerj-cs.1876

Sentiment Analysis Across Languages: Evaluation Before and After Machine Translation to English https://doi.org/10.48550/arXiv.2405.02887

The Multilingual Amazon Reviews Corpus https://doi.org/10.18653/v1/2020.emnlp-main.369

- Random baseline performance and expected performance for recent machine learned models:
Random baseline: about 0.20 accuracy for five-way star rating prediction, because the corpus is balanced across the five star ratings.
For sentiment classification: 0.90730 for a German BERT model (bert-base-german-cased).
For star rating prediction: 0.64942 for a multilingual XLM-R model (xlm-roberta-base).

In another study, the MSA-GCN model scored between 85 and 87 and was reported as superior across many languages.
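
As a sanity check on the random baseline, the sketch below simulates uniform random guessing on a balanced five-class label set. The sample size, the seed, and the function name `random_baseline_accuracy` are illustrative assumptions, not part of the corpus or of this notebook.

```python
import random

def random_baseline_accuracy(num_classes=5, num_samples=50_000, seed=0):
    """Estimate the accuracy of uniform random guessing on balanced labels."""
    rng = random.Random(seed)
    # Balanced labels: every class appears equally often.
    labels = [i % num_classes for i in range(num_samples)]
    # Guess each label uniformly at random.
    guesses = [rng.randrange(num_classes) for _ in range(num_samples)]
    correct = sum(1 for g, y in zip(guesses, labels) if g == y)
    return correct / num_samples

estimate = random_baseline_accuracy()
print(f"Simulated random baseline: {estimate:.3f} (analytic value: {1 / 5:.3f})")
```

Any fine-tuned model should clearly exceed this value before its accuracy is interpreted further.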

---

## 1. Setup

In [None]:
# Your code to install and import libraries etc. here
!pip install transformers datasets evaluate
import transformers
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments, DataCollatorWithPadding, EarlyStoppingCallback
from datasets import load_dataset, DatasetDict
import evaluate
import numpy as np
import os
os.environ["WANDB_DISABLED"] = "true"

Collecting datasets
  Downloading datasets-3.0.2-py3-none-any.whl.metadata (20 kB)
Collecting evaluate
  Downloading evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Downloading datasets-3.0.2-py3-none-any.whl (472 kB)
Downloading evaluate-0.4.3-py3-none-any.whl (84 kB)
Downloading dill-0.3.8-py3-none-any.whl (116 kB)

---

## 2. Data download, sampling and preprocessing

### 2.1. Download the corpus

In [None]:
# Your code to download the corpus here
# First, train and evaluate on the English subset, as requested
dataset_en = load_dataset("mteb/amazon_reviews_multi", "en", trust_remote_code=True)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


README.md:   0%|          | 0.00/47.0 [00:00<?, ?B/s]

amazon_reviews_multi.py:   0%|          | 0.00/6.17k [00:00<?, ?B/s]

0000.parquet:   0%|          | 0.00/28.3M [00:00<?, ?B/s]

en/validation/0000.parquet:   0%|          | 0.00/713k [00:00<?, ?B/s]

en/test/0000.parquet:   0%|          | 0.00/711k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/200000 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/5000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/5000 [00:00<?, ? examples/s]

In [None]:
print(dataset_en)
# Across all six languages, the corpus has 1,200,000 training examples, with 30,000 for validation and 30,000 for testing.
# The data features are id, text, label, and label_text.
# A single-language subset, such as English here, has 200,000 training examples, with 5,000 for validation and 5,000 for testing.

DatasetDict({
    train: Dataset({
        features: ['id', 'text', 'label', 'label_text'],
        num_rows: 200000
    })
    validation: Dataset({
        features: ['id', 'text', 'label', 'label_text'],
        num_rows: 5000
    })
    test: Dataset({
        features: ['id', 'text', 'label', 'label_text'],
        num_rows: 5000
    })
})


In [None]:
# Here you can find more information about the dataset, as given by its provider.
print(dataset_en['train'].info)

DatasetInfo(description='We provide an Amazon product reviews dataset for multilingual text classification. The dataset contains reviews in English, Japanese, German, French, Chinese and Spanish, collected between November 1, 2015 and November 1, 2019. Each record in the dataset contains the review text, the review title, the star rating, an anonymized reviewer ID, an anonymized product ID and the coarse-grained product category (e.g. ‘books’, ‘appliances’, etc.) The corpus is balanced across stars, so each star rating constitutes 20% of the reviews in each language.\nFor each language, there are 200,000, 5,000 and 5,000 reviews in the training, development and test sets respectively. The maximum number of reviews per reviewer is 20 and the maximum number of reviews per product is 20. All reviews are truncated after 2,000 characters, and all reviews are at least 20 characters long.\nNote that the language of a review does not necessarily match the language of its marketplace (e.g. revi

### 2.2. Sampling and preprocessing

In [None]:
# Your code for any necessary sampling and preprocessing here
# Downsample the dataset to make the computation faster and feasible.
# The English subset has 200,000 training examples and 5,000 each for validation and test.
# The training set is downsized to 10% (20,000) and the validation set to 50% (2,500); the test set is kept whole (5,000).
# The data is shuffled first so that the sample has a reasonable distribution over the classes.
# shuffle() is called without a seed, so the exact sample differs between runs.
train_dataset_en = dataset_en['train'].shuffle().select(range(int(len(dataset_en['train']) * 0.1)))
val_dataset_en = dataset_en['validation'].shuffle().select(range(int(len(dataset_en['validation']) * 0.5)))
test_dataset_en = dataset_en['test'].shuffle().select(range(int(len(dataset_en['test']) * 1)))

# Build a new DatasetDict from the downsampled splits for the rest of the analysis.
downsampled_dataset_en = DatasetDict({'train': train_dataset_en, 'validation': val_dataset_en, 'test': test_dataset_en})
print(downsampled_dataset_en)

DatasetDict({
    train: Dataset({
        features: ['id', 'text', 'label', 'label_text'],
        num_rows: 20000
    })
    validation: Dataset({
        features: ['id', 'text', 'label', 'label_text'],
        num_rows: 2500
    })
    test: Dataset({
        features: ['id', 'text', 'label', 'label_text'],
        num_rows: 5000
    })
})
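
The `shuffle()` calls above use no seed, so each run draws a different subset. A minimal sketch of a reproducible draw in plain Python follows; `subsample_indices` and the seed value 42 are illustrative assumptions. With the `datasets` library, passing `seed=` to `Dataset.shuffle` should serve the same purpose.

```python
import random

def subsample_indices(num_rows, fraction, seed=42):
    """Return a reproducible random subset of row indices."""
    rng = random.Random(seed)
    indices = list(range(num_rows))
    rng.shuffle(indices)  # shuffle first so the subset is not biased by row order
    return indices[: int(num_rows * fraction)]

# Same fractions as in the notebook: 10% of train, 50% of validation.
train_idx = subsample_indices(200_000, 0.1)
val_idx = subsample_indices(5_000, 0.5)
print(len(train_idx), len(val_idx))
```

Fixing the seed makes the label counts and all later accuracies repeatable across sessions.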


In [None]:
# Check that the labels are evenly distributed in each split
for labs in downsampled_dataset_en.keys():
    print(f"{labs} labels:")
    labels = downsampled_dataset_en[labs]['label']
    lab, counts = np.unique(labels, return_counts=True)
    for label, count in zip(lab, counts):
        print(f"'{label}': {count} samples")
    print()

train labels:
'0': 4013 samples
'1': 4040 samples
'2': 3982 samples
'3': 3969 samples
'4': 3996 samples

validation labels:
'0': 506 samples
'1': 490 samples
'2': 497 samples
'3': 496 samples
'4': 511 samples

test labels:
'0': 1000 samples
'1': 1000 samples
'2': 1000 samples
'3': 1000 samples
'4': 1000 samples



In [None]:
# Your code for any necessary sampling and preprocessing here
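# Load the bert-base-cased tokenizer, which was also used in the course exercises.
# Note: the classifier trained in Section 3 is bert-base-multilingual-cased, which has its own
# vocabulary; a tokenizer and a model should normally come from the same checkpoint.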
model = "bert-base-cased"
tokenizer = transformers.AutoTokenizer.from_pretrained(model)



In [None]:
# Define a simple function that applies the tokenizer
# maximum length of BERT models is 512 due to the position embeddings
def tokenize(sample):
    return tokenizer(
        sample["text"],
        max_length=512,
        truncation=True)
tokenized_datasets_en = downsampled_dataset_en.map(tokenize)

Map:   0%|          | 0/20000 [00:00<?, ? examples/s]

Map:   0%|          | 0/2500 [00:00<?, ? examples/s]

Map:   0%|          | 0/5000 [00:00<?, ? examples/s]
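
The tokenizer above is loaded from bert-base-cased, while the model trained in Section 3 is bert-base-multilingual-cased. The toy sketch below illustrates why the two should normally come from the same checkpoint: token ids are only meaningful relative to the vocabulary that produced them. Both vocabularies are made up for illustration and are not the real BERT vocabularies.

```python
# Hypothetical toy vocabularies; real checkpoints have tens of thousands of entries.
vocab_a = {"[UNK]": 0, "good": 1, "bad": 2, "product": 3}
vocab_b = {"[UNK]": 0, "produkt": 1, "good": 2, "bad": 3, "product": 4}

def encode(text, vocab):
    """Map whitespace-separated tokens to ids, falling back to [UNK]."""
    return [vocab.get(token, vocab["[UNK]"]) for token in text.split()]

def decode(ids, vocab):
    """Map ids back to tokens using the given vocabulary."""
    id_to_token = {idx: token for token, idx in vocab.items()}
    return [id_to_token.get(idx, "[UNK]") for idx in ids]

ids = encode("good product", vocab_a)
print(decode(ids, vocab_a))  # ['good', 'product']
print(decode(ids, vocab_b))  # ['produkt', 'bad']
```

A model whose embedding table follows `vocab_b` would therefore receive different words than the ones written in the review.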

---

## 3. Machine learning model

### 3.1. Model training

In [None]:
# Your code to train the transformer based model on the training set and evaluate the performance on the validation set here
# Based on the course exercises.
# The dataset covers several languages, so a multilingual model is used here.
modell = "bert-base-multilingual-cased"
# Initialize the model
# There are 5 labels: star ratings 1 to 5 are encoded as labels 0 to 4.
model = AutoModelForSequenceClassification.from_pretrained(modell, num_labels=5)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-multilingual-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
# Set training arguments lr=1e-5
training_argse5_500 = transformers.TrainingArguments(
    output_dir="checkpoints",
    eval_strategy="steps",
    logging_strategy="no",
    load_best_model_at_end=True,
    eval_steps=100,
    learning_rate=0.00001,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=32,
    max_steps=500,
    report_to="none")

In [None]:
# Load the accuracy metric
accuracy = evaluate.load("accuracy")

# Accuracy calculation function
def compute_accuracy(outputs_and_labels):
    outputs, labels = outputs_and_labels
    predictions = outputs.argmax(axis=-1)  # Pick the index of the "winning" label
    return accuracy.compute(predictions=predictions, references=labels)

# Collator
data_collator = DataCollatorWithPadding(tokenizer)

# Early stopping: number of evaluations without improvement to tolerate before stopping
early_stopping = EarlyStoppingCallback(early_stopping_patience=5)
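
To make the metric concrete, here is a small worked example of what `compute_accuracy` does with the model outputs, written in plain Python. The logits and labels are invented for illustration.

```python
def argmax(row):
    """Index of the largest value in a list of scores."""
    return max(range(len(row)), key=lambda i: row[i])

def accuracy_from_logits(logits, labels):
    """Share of rows whose highest-scoring class equals the reference label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)

# Four invented examples with five class scores each.
logits = [
    [0.1, 2.3, 0.0, -1.0, 0.5],   # predicted class 1
    [1.9, 0.2, 0.1, 0.0, -0.3],   # predicted class 0
    [0.0, 0.1, 0.2, 0.3, 3.0],    # predicted class 4
    [0.4, 0.3, 1.2, 0.1, 0.0],    # predicted class 2
]
labels = [1, 0, 3, 2]
print(accuracy_from_logits(logits, labels))  # 0.75
```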

In [None]:
# Print a sample of test labels to see that they are not ordered
print(downsampled_dataset_en["test"]["label"][:100])

[2, 4, 1, 1, 4, 0, 0, 3, 1, 1, 3, 2, 0, 4, 3, 4, 2, 4, 1, 2, 3, 1, 4, 0, 0, 4, 4, 1, 2, 3, 4, 0, 2, 4, 1, 3, 4, 2, 1, 4, 1, 1, 1, 3, 4, 1, 2, 0, 2, 1, 2, 0, 4, 1, 2, 4, 1, 4, 3, 3, 1, 4, 0, 3, 4, 3, 1, 0, 4, 4, 3, 0, 2, 0, 4, 4, 1, 0, 4, 0, 3, 4, 4, 3, 1, 0, 1, 3, 1, 3, 2, 1, 4, 0, 3, 3, 3, 0, 1, 2]


In [None]:
# Initialize the Trainer
trainer_en_e5_500 = transformers.Trainer(
    model=model,
    args=training_argse5_500,
    train_dataset=tokenized_datasets_en["train"],
    eval_dataset=tokenized_datasets_en["validation"],
    compute_metrics=compute_accuracy,
    data_collator=data_collator,
    tokenizer=tokenizer,
    callbacks=[early_stopping])

max_steps is given, it will override any value given in num_train_epochs


In [None]:
trainer_en_e5_500.train()

Step,Training Loss,Validation Loss,Accuracy
100,No log,1.603041,0.2568
200,No log,1.590898,0.2572
300,No log,1.5558,0.3212
400,No log,1.533177,0.3188
500,No log,1.518368,0.3256


TrainOutput(global_step=500, training_loss=1.5680638427734375, metrics={'train_runtime': 274.5821, 'train_samples_per_second': 14.568, 'train_steps_per_second': 1.821, 'total_flos': 259759274926608.0, 'train_loss': 1.5680638427734375, 'epoch': 0.2})
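
The `'epoch': 0.2` in the output follows from the step count and the batch size. The arithmetic is sketched below, assuming a single device and no gradient accumulation (the training arguments set neither); `epochs_covered` is an illustrative helper name.

```python
def epochs_covered(max_steps, batch_size, num_train_examples):
    """Fraction of the training set seen after max_steps optimizer steps."""
    return max_steps * batch_size / num_train_examples

# Values from the training arguments and the downsampled English training set.
print(epochs_covered(500, 8, 20_000))   # 0.2
print(epochs_covered(1000, 8, 20_000))  # 0.4
print(epochs_covered(2000, 8, 20_000))  # 0.8
```

Even the longest run in this notebook therefore sees less than one full pass over the 20,000 training reviews.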

### 3.2 Hyperparameter optimization

In [None]:
# Set training arguments: lr=1e-5 and 1000 steps instead of 500
training_argse5_1000 = transformers.TrainingArguments(
    output_dir="checkpoints",
    eval_strategy="steps",
    logging_strategy="no",
    load_best_model_at_end=True,
    eval_steps=100,
    learning_rate=0.00001,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=32,
    max_steps=1000,
    report_to="none")

In [None]:
# Initialize the Trainer (it reuses the same `model` object, so training continues from the previous run)
trainer_en_e5_1000 = transformers.Trainer(
    model=model,
    args=training_argse5_1000,
    train_dataset=tokenized_datasets_en["train"],
    eval_dataset=tokenized_datasets_en["validation"],
    compute_metrics=compute_accuracy,
    data_collator=data_collator,
    tokenizer=tokenizer,
    callbacks=[early_stopping])

max_steps is given, it will override any value given in num_train_epochs


In [None]:
trainer_en_e5_1000.train()

Step,Training Loss,Validation Loss,Accuracy
100,No log,1.508843,0.334
200,No log,1.423228,0.378
300,No log,1.404901,0.3916
400,No log,1.404437,0.3864
500,No log,1.320775,0.4264
600,No log,1.326175,0.4268
700,No log,1.285812,0.4532
800,No log,1.268042,0.4532
900,No log,1.265406,0.4588
1000,No log,1.257705,0.4564


TrainOutput(global_step=1000, training_loss=1.3316658935546875, metrics={'train_runtime': 659.2427, 'train_samples_per_second': 12.135, 'train_steps_per_second': 1.517, 'total_flos': 511028878537776.0, 'train_loss': 1.3316658935546875, 'epoch': 0.4})
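
Because `trainer_en_e5_1000` receives the same `model` object as `trainer_en_e5_500`, this run continues from the weights left by the previous one rather than from the pretrained checkpoint. This is consistent with the accuracy at step 100 (0.334) already exceeding the final accuracy of the 500-step run (0.3256). The toy sketch below shows the aliasing effect in plain Python; `ToyModel` and `toy_train` are hypothetical stand-ins, not Hugging Face classes.

```python
import copy

class ToyModel:
    """Hypothetical stand-in for a model: one weight that training updates in place."""
    def __init__(self):
        self.weight = 0.0

def toy_train(model, steps):
    """Pretend training: every step nudges the weight by one unit."""
    for _ in range(steps):
        model.weight += 1.0
    return model.weight

# Shared object: the second run starts where the first one stopped.
shared = ToyModel()
toy_train(shared, 500)
print(toy_train(shared, 1000))  # 1500.0

# Fresh copy per run: each run starts from the same initial state.
base = ToyModel()
print(toy_train(copy.deepcopy(base), 500))   # 500.0
print(toy_train(copy.deepcopy(base), 1000))  # 1000.0
```

With the real `Trainer`, one option is to reload the checkpoint with `AutoModelForSequenceClassification.from_pretrained` before each run, so that the runs can be compared on equal terms.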

In [None]:
# Set training arguments lr=1e-5 and step 2000 instead of 500
training_argse5_2000 = transformers.TrainingArguments(
    output_dir="checkpoints",
    eval_strategy="steps",
    logging_strategy="no",
    load_best_model_at_end=True,
    eval_steps=100,
    learning_rate=0.00001,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=32,
    max_steps=2000,
    report_to="none")

In [None]:
# Initialize the Trainer lr=1e-5 and step=2000
trainer_en_e5_2000 = transformers.Trainer(
    model=model,
    args=training_argse5_2000,
    train_dataset=tokenized_datasets_en["train"],
    eval_dataset=tokenized_datasets_en["validation"],
    compute_metrics=compute_accuracy,
    data_collator=data_collator,
    tokenizer=tokenizer,
    callbacks=[early_stopping])

max_steps is given, it will override any value given in num_train_epochs


In [None]:
trainer_en_e5_2000.train()

Step,Training Loss,Validation Loss,Accuracy
100,No log,1.348708,0.4352
200,No log,1.309572,0.4452
300,No log,1.367085,0.4404
400,No log,1.388093,0.4436
500,No log,1.357381,0.4496
600,No log,1.299636,0.4628
700,No log,1.301477,0.4644
800,No log,1.277516,0.4712
900,No log,1.266281,0.4756
1000,No log,1.223707,0.4776


TrainOutput(global_step=2000, training_loss=1.107172119140625, metrics={'train_runtime': 1365.6091, 'train_samples_per_second': 11.716, 'train_steps_per_second': 1.465, 'total_flos': 1011693368995056.0, 'train_loss': 1.107172119140625, 'epoch': 0.8})

In [None]:
# Examine different learning rates

# Set training arguments lr=1e-3 and step 500
training_argse3_500 = transformers.TrainingArguments(
    output_dir="checkpoints",
    eval_strategy="steps",
    logging_strategy="no",
    load_best_model_at_end=True,
    eval_steps=100,
    learning_rate=0.001,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=32,
    max_steps=500,
    report_to="none")
# Set training arguments lr=1e-1 and step 500
training_argse1_500 = transformers.TrainingArguments(
    output_dir="checkpoints",
    eval_strategy="steps",
    logging_strategy="no",
    load_best_model_at_end=True,
    eval_steps=100,
    learning_rate=0.1,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=32,
    max_steps=500,
    report_to="none")

In [None]:
# Initialize the Trainer for lr=1e-3 and step=500
trainer_en_e3_500 = transformers.Trainer(
    model=model,
    args=training_argse3_500,
    train_dataset=tokenized_datasets_en["train"],
    eval_dataset=tokenized_datasets_en["validation"],
    compute_metrics=compute_accuracy,
    data_collator=data_collator,
    tokenizer=tokenizer,
    callbacks=[early_stopping])

max_steps is given, it will override any value given in num_train_epochs


In [None]:
# Initialize the Trainer, lr=1e-1 and step=500
trainer_en_e1_500 = transformers.Trainer(
    model=model,
    args=training_argse1_500,
    train_dataset=tokenized_datasets_en["train"],
    eval_dataset=tokenized_datasets_en["validation"],
    compute_metrics=compute_accuracy,
    data_collator=data_collator,
    tokenizer=tokenizer,
    callbacks=[early_stopping])

max_steps is given, it will override any value given in num_train_epochs


In [None]:
trainer_en_e3_500.train()

Step,Training Loss,Validation Loss,Accuracy
100,No log,1.699089,0.2044
200,No log,1.692819,0.2024
300,No log,1.6526,0.196
400,No log,1.6193,0.2044
500,No log,1.615826,0.2024


TrainOutput(global_step=500, training_loss=1.692766357421875, metrics={'train_runtime': 325.0646, 'train_samples_per_second': 12.305, 'train_steps_per_second': 1.538, 'total_flos': 259759274926608.0, 'train_loss': 1.692766357421875, 'epoch': 0.2})

In [None]:
trainer_en_e1_500.train()

Step,Training Loss,Validation Loss,Accuracy
100,No log,19.864988,0.2044
200,No log,14.27912,0.1988
300,No log,7.98982,0.2044
400,No log,4.353621,0.196
500,No log,2.002184,0.1984


TrainOutput(global_step=500, training_loss=13.4939541015625, metrics={'train_runtime': 285.9376, 'train_samples_per_second': 13.989, 'train_steps_per_second': 1.749, 'total_flos': 259759274926608.0, 'train_loss': 13.4939541015625, 'epoch': 0.2})

In [None]:
#learning_rate=0.0000001 and step of 500
training_argse7_500 = transformers.TrainingArguments(
    output_dir="checkpoints",
    eval_strategy="steps",
    logging_strategy="no",
    load_best_model_at_end=True,
    eval_steps=100,
    learning_rate=0.0000001,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=32,
    max_steps=500,
    report_to="none")

In [None]:
# Initialize the Trainer lr=1e-7 and step=500
trainer_en_e7_500 = transformers.Trainer(
    model=model,
    args=training_argse7_500,
    train_dataset=tokenized_datasets_en["train"],
    eval_dataset=tokenized_datasets_en["validation"],
    compute_metrics=compute_accuracy,
    data_collator=data_collator,
    tokenizer=tokenizer,
    callbacks=[early_stopping])

max_steps is given, it will override any value given in num_train_epochs


In [None]:
trainer_en_e7_500.train()

Step,Training Loss,Validation Loss,Accuracy
100,No log,2.000044,0.1984
200,No log,1.99853,0.1984
300,No log,1.997409,0.1984
400,No log,1.996653,0.1984
500,No log,1.996568,0.1984


TrainOutput(global_step=500, training_loss=4.20672900390625, metrics={'train_runtime': 275.2327, 'train_samples_per_second': 14.533, 'train_steps_per_second': 1.817, 'total_flos': 259759274926608.0, 'train_loss': 4.20672900390625, 'epoch': 0.2})
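
The accuracies of the lr=1e-3, 1e-1, and 1e-7 runs all hover around 0.20. Comparing them with the validation label counts printed in Section 2.2 suggests that in these runs the model predicts one single class for every review: each observed accuracy equals the share of exactly one label. A short check, using only numbers from the outputs above (the helper name `constant_class_for` is an illustrative assumption):

```python
# Validation label counts from the label-distribution output in Section 2.2.
val_counts = {0: 506, 1: 490, 2: 497, 3: 496, 4: 511}

def constant_class_for(accuracy, counts=val_counts, digits=4):
    """Return the label whose share of the validation set equals the accuracy, if any."""
    n = sum(counts.values())
    for label, count in counts.items():
        if round(count / n, digits) == round(accuracy, digits):
            return label
    return None

# Distinct validation accuracies reported by the three runs above.
for acc in [0.2044, 0.2024, 0.196, 0.1988, 0.1984]:
    print(acc, "->", constant_class_for(acc))
```

A model that outputs the same class for every input would also score exactly 0.20 on the balanced test set of 5,000 reviews.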

### 3.3. Evaluation on test set

In [None]:
# Your code to evaluate the final model on the test set here
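# Evaluation on the test dataset.
# Note: every Trainer above wraps the same `model` object, so this evaluates the weights
# as they are after the last training run, not after the first lr=1e-5, 500-step run.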
print(trainer_en_e5_500.evaluate(eval_dataset=tokenized_datasets_en["test"])['eval_accuracy'])

0.2


### 3.4. Cross-lingual experiments

In [None]:
# Your code to train and evaluate the cross-lingual model
# Train on the German subset and validate on the English validation set
dataset_de = load_dataset("mteb/amazon_reviews_multi", "de")

# Tokenize the dataset
tokenized_datasets_de = dataset_de.map(tokenize, batched=True)
trainer_de = Trainer(
    model=model,
    args=training_argse5_500,
    train_dataset=tokenized_datasets_de["train"],
    eval_dataset=tokenized_datasets_en["validation"],
    compute_metrics=compute_accuracy,
    data_collator=DataCollatorWithPadding(tokenizer),
    tokenizer=tokenizer,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=5)])

trainer_de.train()

max_steps is given, it will override any value given in num_train_epochs


Step,Training Loss,Validation Loss,Accuracy
100,No log,66.017967,0.2044
200,No log,65.272972,0.2044
300,No log,64.742859,0.2044
400,No log,64.434914,0.2044
500,No log,64.330772,0.2044


TrainOutput(global_step=500, training_loss=65.6970078125, metrics={'train_runtime': 321.3447, 'train_samples_per_second': 12.448, 'train_steps_per_second': 1.556, 'total_flos': 451950633112656.0, 'train_loss': 65.6970078125, 'epoch': 0.02})

In [None]:
zero_shot_results = trainer_de.evaluate(eval_dataset=tokenized_datasets_en["test"])
print("Accuracy of trained on German but evaluated on English:",zero_shot_results['eval_accuracy'])

Accuracy of trained on German but evaluated on English: 0.2


In [None]:
!nvidia-smi

Sun Oct 27 09:35:55 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   68C    P0              30W /  70W |  14995MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [None]:
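# Attempt to free GPU memory. This resets Keras/TensorFlow state only; the PyTorch model
# is unaffected, which matches the unchanged memory usage reported below.
from tensorflow.keras import backend as K
K.clear_session()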

In [None]:
!nvidia-smi

Sun Oct 27 09:36:19 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   69C    P0              30W /  70W |  14995MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
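
The memory usage is unchanged because `K.clear_session()` only resets Keras/TensorFlow state, while the fine-tuned model lives in PyTorch. A minimal sketch of a PyTorch-side cleanup follows. `free_gpu_memory` is a hypothetical helper, and it can only release memory once all Python references to the models and trainers are gone (for example after `del trainer_de`).

```python
import gc

try:
    import torch  # optional here so the sketch also runs where PyTorch is not installed
except ImportError:
    torch = None

def free_gpu_memory():
    """Collect garbage and release cached CUDA memory; return True if the CUDA cache was cleared."""
    gc.collect()  # drop unreachable Python objects that may still hold tensors
    if torch is not None and torch.cuda.is_available():
        torch.cuda.empty_cache()  # hand cached, unused blocks back to the driver
        return True
    return False

print("CUDA cache cleared:", free_gpu_memory())
```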
                                                                    

In [None]:
# Next, try Japanese, whose script and structure differ greatly from the European languages
dataset_ja = load_dataset("mteb/amazon_reviews_multi", "ja")
# Tokenization
tokenized_datasets_ja = dataset_ja.map(tokenize, batched=True)
trainer_ja = Trainer(
    model=model,
    args=training_argse5_500,
    train_dataset=tokenized_datasets_ja["train"],
    eval_dataset=tokenized_datasets_en["validation"],
    compute_metrics=compute_accuracy,
    data_collator=DataCollatorWithPadding(tokenizer),
    tokenizer=tokenizer,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=5)])

trainer_ja.train()

max_steps is given, it will override any value given in num_train_epochs


Step,Training Loss,Validation Loss,Accuracy
100,No log,62.943974,0.2044
200,No log,62.290844,0.2044
300,No log,61.850407,0.2044
400,No log,61.590725,0.2044
500,No log,61.505875,0.2044


TrainOutput(global_step=500, training_loss=63.22587890625, metrics={'train_runtime': 289.9367, 'train_samples_per_second': 13.796, 'train_steps_per_second': 1.725, 'total_flos': 349581230810256.0, 'train_loss': 63.22587890625, 'epoch': 0.02})

In [None]:
#Evaluation
zero_shot_results_ja = trainer_ja.evaluate(eval_dataset=tokenized_datasets_en["test"])
print("Accuracy of trained on Japanese but evaluated on English:",zero_shot_results_ja['eval_accuracy'])

Accuracy of trained on Japanese but evaluated on English: 0.2


---

## 4. Results and summary

### 4.1 Corpus insights

The corpus consists of 1,200,000 training reviews in 6 languages, with 200,000 per language. It contains reviews written between 2015 and 2019, each labeled 0 to 4 (5 classes). Each language also has 5,000 validation and 5,000 test reviews. The data can be used to train multilingual or monolingual language models. Because the dataset is large and training on all of it requires substantial computational capacity, the English training set is downsampled to 10% (20,000 reviews). The data is shuffled before selection to avoid ordering bias and to keep the classes roughly evenly distributed in the downsampled set. The validation set is likewise downsampled to 50% (2,500 reviews), while the test set is kept at its full 5,000 reviews.

### 4.2 Results

The model is first trained and evaluated on the English reviews in the dataset. It reaches a validation accuracy of about 0.32 with a learning rate of 1e-5 after 500 steps. Because this accuracy is low, several learning rates and step counts were explored to improve it. At 500 steps, a learning rate of 1e-5 works best, with an accuracy of about 0.32 compared to about 0.20, the chance level for five balanced classes, when the learning rate is raised to 1e-3 or 1e-1 or lowered to 1e-7. The number of steps plays a major role: validation accuracy increases from about 0.32 with 500 steps to about 0.46 with 1000 steps and to 0.50 with 2000 steps, which is taken as the baseline. These runs were executed one after another on the same model object, so the 1000- and 2000-step runs continue from already fine-tuned weights.

Zero-shot cross-lingual transfer was also tested: the model is trained on German or Japanese reviews and evaluated on the English data to see how well a multilingual model transfers. The multilingual model bert-base-multilingual-cased is used for this. The achieved accuracy is 20%, compared with 32% for the English-trained baseline at the same 500 steps. Since 20% is the chance level for five balanced classes, no transfer is observed in this setup. Training on one language and applying the model to another remains a promising approach, and it could be improved with few-shot fine-tuning and other techniques.

### 4.3 Relation to random baseline / expected performance / state of the art

My model achieved 50% accuracy after 2000 steps with a learning rate of 1e-5. Further optimization and investigation of the parameters could improve the accuracy if sufficient computational resources were available. According to articles published in 2024, the current accuracy on the Multilingual Amazon Reviews Corpus is almost 88%, which is much better than mine. However, those articles use many different methods and had access to more processing and human resources. The current investigation could also be improved by using the whole dataset instead of downsampling, by training for more steps or epochs, and by optimizing each hyperparameter. Further training data could be obtained by using machine translation to translate the reviews in other languages into the target language, as discussed in one of the articles above.

---

## 5. Bonus Task (optional)

### 5.1. and 5.2. Model and Data selection

(Briefly describe which model was used and why. Also, describe how the test data was downsampled, include relevant code.)

### 5.3. Prompt design

(Include your final prompt here. Also, explain here all prompt engineering insights you learned during the project.)

### 5.4. Generate

In [None]:
# Your code to run the generative model and extract predictions from the model output.

### 5.5. Evaluation and results

(Briefly summarize your results)

### 5.6 Error analysis (group projects only)

(Present the error analysis results here)