<a href="https://colab.research.google.com/github/simulate111/Deep-Learning-in-Human-Language-Technology/blob/main/course_project_rezaC.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Deep learning in Human Language Technology Project (Template)

- Student(s) Name(s): Mohammadreza Akhtari
- Date: October 2024
- Chosen Corpus: Multilingual Amazon product reviews (`mteb/amazon_reviews_multi`)
- Contributions (if group project): -

### Corpus information

- Description of the chosen corpus:
- Paper(s) and other published materials related to the corpus:
- Random baseline performance and expected performance for recent machine learned models:

---

## 1. Setup

In [None]:
# Your code to install and import libraries etc. here
!pip install transformers datasets evaluate
import transformers
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments, DataCollatorWithPadding, EarlyStoppingCallback
from datasets import load_dataset, DatasetDict
import evaluate
import numpy as np
import os
os.environ["WANDB_DISABLED"] = "true"



---

## 2. Data download, sampling and preprocessing

### 2.1. Download the corpus

In [None]:
# Your code to download the corpus here
dataset = load_dataset("mteb/amazon_reviews_multi", trust_remote_code=True)

In [None]:
print(dataset)
# The dataset has 1,200,000 training examples, plus 30,000 each for validation and test.
# Each example has four features: id, text, label, and label_text.

DatasetDict({
    train: Dataset({
        features: ['id', 'text', 'label', 'label_text'],
        num_rows: 1200000
    })
    validation: Dataset({
        features: ['id', 'text', 'label', 'label_text'],
        num_rows: 30000
    })
    test: Dataset({
        features: ['id', 'text', 'label', 'label_text'],
        num_rows: 30000
    })
})


In [None]:
# More information about the dataset, as supplied by its provider.
print(dataset['train'].info)

DatasetInfo(description='We provide an Amazon product reviews dataset for multilingual text classification. The dataset contains reviews in English, Japanese, German, French, Chinese and Spanish, collected between November 1, 2015 and November 1, 2019. Each record in the dataset contains the review text, the review title, the star rating, an anonymized reviewer ID, an anonymized product ID and the coarse-grained product category (e.g. ‘books’, ‘appliances’, etc.) The corpus is balanced across stars, so each star rating constitutes 20% of the reviews in each language.\nFor each language, there are 200,000, 5,000 and 5,000 reviews in the training, development and test sets respectively. The maximum number of reviews per reviewer is 20 and the maximum number of reviews per product is 20. All reviews are truncated after 2,000 characters, and all reviews are at least 20 characters long.\nNote that the language of a review does not necessarily match the language of its marketplace (e.g. revi

### 2.2. Sampling and preprocessing

In [None]:
# Your code for any necessary sampling and preprocessing here
# Downsample the dataset so that training is computationally feasible.
# The original dataset has 1,200,000 training examples and 30,000 each for validation and test.
# The training set is reduced to 12,000 examples (1%), and validation and test to 3,000 each (10%).
# Each split is shuffled before selection so the subset keeps a roughly even mix of classes.
train_dataset = dataset['train'].shuffle().select(range(int(len(dataset['train']) * 0.01)))
val_dataset = dataset['validation'].shuffle().select(range(int(len(dataset['validation']) * 0.1)))
test_dataset = dataset['test'].shuffle().select(range(int(len(dataset['test']) * 0.1)))

# Build a new DatasetDict from the downsampled splits for further analysis.
downsampled_dataset = DatasetDict({'train': train_dataset, 'validation': val_dataset, 'test': test_dataset})
print(downsampled_dataset)

DatasetDict({
    train: Dataset({
        features: ['id', 'text', 'label', 'label_text'],
        num_rows: 12000
    })
    validation: Dataset({
        features: ['id', 'text', 'label', 'label_text'],
        num_rows: 3000
    })
    test: Dataset({
        features: ['id', 'text', 'label', 'label_text'],
        num_rows: 3000
    })
})
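
Because `shuffle()` is called without a seed, each run draws a different subset, so the counts printed in this notebook will not reproduce exactly. `Dataset.shuffle` accepts a `seed` argument for this purpose. The idea is sketched below with the standard library only; the toy corpus and the helper name `sample_fraction` are hypothetical and not part of the notebook.

```python
import random

def sample_fraction(items, fraction, seed):
    """Return a reproducible random subset containing `fraction` of `items`."""
    rng = random.Random(seed)        # private generator, leaves global state untouched
    k = int(len(items) * fraction)   # same rounding as int(len(...) * 0.01) above
    return rng.sample(items, k)

# Toy stand-in for the corpus: 1,000 (id, label) pairs with balanced labels 0..4.
toy_corpus = [(i, i % 5) for i in range(1000)]

first = sample_fraction(toy_corpus, 0.1, seed=42)
second = sample_fraction(toy_corpus, 0.1, seed=42)

# The same seed yields the same subset on every run.
print(len(first), first == second)
```

In the notebook itself, the equivalent change would be to pass a fixed seed to each `shuffle()` call.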


In [None]:
# Label distribution per split, to check that the classes are roughly evenly represented
for split in downsampled_dataset.keys():
    print(f"{split} labels:")
    labels = downsampled_dataset[split]['label']
    unique_labels, counts = np.unique(labels, return_counts=True)
    for label, count in zip(unique_labels, counts):
        print(f"'{label}': {count} samples")
    print()

train labels:
'0': 2292 samples
'1': 2455 samples
'2': 2419 samples
'3': 2414 samples
'4': 2420 samples

validation labels:
'0': 636 samples
'1': 579 samples
'2': 602 samples
'3': 579 samples
'4': 604 samples

test labels:
'0': 573 samples
'1': 610 samples
'2': 595 samples
'3': 629 samples
'4': 593 samples
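
The test counts printed above give two reference points for the "random baseline" entry in the corpus information: uniform random guessing, and always predicting the most frequent class. A minimal sketch of the arithmetic follows, with the counts copied from the output above; the variable names are illustrative.

```python
# Test-split label counts copied from the output above.
test_counts = {0: 573, 1: 610, 2: 595, 3: 629, 4: 593}

total = sum(test_counts.values())

# Uniform random guessing is right with probability 1/k on every example,
# so its expected accuracy does not depend on the label distribution.
uniform_baseline = 1 / len(test_counts)

# Always predicting the most frequent label scores that label's share.
majority_baseline = max(test_counts.values()) / total

print(f"examples: {total}")
print(f"uniform random baseline: {uniform_baseline:.3f}")
print(f"majority-class baseline: {majority_baseline:.3f}")
```

Both baselines sit at roughly 20% accuracy, which is the level any trained model should clearly exceed.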



In [None]:
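# Your code for any necessary sampling and preprocessing here
# The tokenizer must come from the same checkpoint as the model trained below,
# so use the multilingual BERT tokenizer rather than the English-only bert-base-cased.
model_name = "bert-base-multilingual-cased"
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)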

In [None]:
# Define a simple function that applies the tokenizer.
# BERT models accept at most 512 tokens because of their position embeddings,
# so longer reviews are truncated.
def tokenize(example):
    return tokenizer(
        example["text"],
        max_length=512,
        truncation=True)

tokenized_datasets = downsampled_dataset.map(tokenize)

Map:   0%|          | 0/12000 [00:00<?, ? examples/s]

Map:   0%|          | 0/3000 [00:00<?, ? examples/s]

Map:   0%|          | 0/3000 [00:00<?, ? examples/s]
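
The `tokenize` function only truncates. Padding is left to the `DataCollatorWithPadding` used later, which pads each batch to its own longest sequence rather than to 512 tokens. The sketch below mimics both steps on plain lists of token ids. The ids and helper names are made up for illustration, and the real tokenizer additionally handles special tokens.

```python
def truncate(token_ids, max_length):
    """Keep at most `max_length` ids, dropping the tail."""
    return token_ids[:max_length]

def pad_batch(batch, pad_id=0):
    """Pad every sequence to the longest one in the batch and build attention masks."""
    longest = max(len(seq) for seq in batch)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in batch]
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in batch]
    return input_ids, attention_mask

# Hypothetical token ids for three reviews of different lengths.
reviews = [[11, 12, 13], [21, 22, 23, 24, 25, 26, 27], [31]]

# A small max_length stands in for the 512-token limit.
truncated = [truncate(ids, max_length=5) for ids in reviews]
input_ids, attention_mask = pad_batch(truncated)

for ids, mask in zip(input_ids, attention_mask):
    print(ids, mask)
```

Padding per batch keeps short reviews cheap to process, since most reviews are far shorter than the maximum length.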

---

## 3. Machine learning model

### 3.1. Model training

In [None]:
# Your code to train the transformer based model on the training set and evaluate the performance on the validation set here
# Adapted from the course exercises.
# The dataset covers several languages, so a multilingual model is used.
model_name = "bert-base-multilingual-cased"
# Initialize the model with 5 output labels, one per star rating (1 to 5 stars).
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=5)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-multilingual-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
# Set training arguments
training_args = transformers.TrainingArguments(
    output_dir="checkpoints",
    eval_strategy="steps",
    logging_strategy="no",
    load_best_model_at_end=True,
    eval_steps=100,
    learning_rate=0.00001,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=32,
    max_steps=500,
    report_to="none")
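
With these arguments the run is bounded by `max_steps` rather than by epochs. A quick back-of-the-envelope check shows how much of the 12,000-example training subset the model sees and how many evaluations take place. It assumes a single device and no gradient accumulation, neither of which the notebook states.

```python
train_examples = 12_000   # size of the downsampled training split
batch_size = 8            # per_device_train_batch_size
max_steps = 500
eval_steps = 100

examples_seen = max_steps * batch_size
epochs_covered = examples_seen / train_examples
num_evaluations = max_steps // eval_steps

print(f"examples seen: {examples_seen}")
print(f"fraction of one epoch: {epochs_covered:.2f}")
print(f"evaluations during training: {num_evaluations}")
```

Under these assumptions the model sees about a third of the subset once, so a larger `max_steps` may be worth trying if compute allows.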

In [None]:
# Load the accuracy metric
accuracy = evaluate.load("accuracy")

# Define the compute_accuracy function
def compute_accuracy(outputs_and_labels):
    outputs, labels = outputs_and_labels
    predictions = outputs.argmax(axis=-1)  # Pick the index of the "winning" label
    return accuracy.compute(predictions=predictions, references=labels)

# Collator
data_collator = DataCollatorWithPadding(tokenizer)

# Patience is the number of consecutive evaluations without improvement before training stops
early_stopping = EarlyStoppingCallback(
    early_stopping_patience=5)
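
Patience counts evaluations without improvement, not training steps. The sketch below is a simplified stand-in for that logic, not the library's implementation, and the loss values are invented for illustration.

```python
def stopping_step(eval_losses, patience):
    """Return the 1-based evaluation index at which training would stop,
    or None if the patience budget is never exhausted."""
    best = float("inf")
    bad_evals = 0
    for index, loss in enumerate(eval_losses, start=1):
        if loss < best:
            best = loss
            bad_evals = 0          # an improvement resets the counter
        else:
            bad_evals += 1
            if bad_evals >= patience:
                return index
    return None

# Invented validation losses, one per evaluation.
losses = [1.60, 1.45, 1.47, 1.46, 1.48, 1.50, 1.49]

print(stopping_step(losses, patience=2))      # stops at the 4th evaluation
print(stopping_step(losses[:5], patience=5))  # never stops within 5 evaluations
```

With `max_steps=500` and `eval_steps=100` there are only five evaluations, so under this simplified logic a patience of 5 could never be exhausted. A smaller patience or a longer run would be needed for early stopping to have any effect.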

In [None]:
# Print a sample of test labels to see that they are not ordered
print(downsampled_dataset["test"]["label"][:100])

[4, 2, 0, 1, 1, 3, 2, 4, 4, 0, 3, 2, 1, 1, 3, 0, 2, 4, 1, 3, 0, 1, 0, 0, 1, 3, 3, 1, 1, 0, 3, 2, 4, 2, 3, 4, 1, 4, 3, 1, 1, 4, 1, 0, 3, 2, 0, 3, 1, 1, 3, 0, 1, 4, 4, 1, 4, 1, 2, 3, 0, 1, 0, 2, 2, 1, 3, 4, 1, 3, 3, 3, 3, 4, 1, 2, 0, 0, 3, 1, 0, 3, 1, 4, 4, 1, 1, 3, 3, 0, 1, 3, 2, 3, 4, 3, 4, 3, 2, 4]


In [None]:
# Initialize the Trainer
trainer = transformers.Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],  # monitor on validation; keep the test set for the final evaluation
    compute_metrics=compute_accuracy,
    data_collator=data_collator,
    tokenizer=tokenizer,
    callbacks=[early_stopping])

max_steps is given, it will override any value given in num_train_epochs


In [None]:
trainer.train()

Step,Training Loss,Validation Loss


KeyboardInterrupt: 

### 3.2 Hyperparameter optimization

In [None]:
# Define a set of hyperparameters to search over
learning_rates = [1e-5, 3e-5]  # 2e-5 currently disabled
weight_decays = [0.01, 0.1]  # 0.0 currently disabled

best_accuracy = 0
best_hyperparams = {}

for lr in learning_rates:
    for wd in weight_decays:
        # Start every run from the pretrained checkpoint so runs do not build on each other
        model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=5)

        training_args = TrainingArguments(
            output_dir=f"./results/lr{lr}_wd{wd}",
            eval_strategy="epoch",
            learning_rate=lr,
            per_device_train_batch_size=16,
            per_device_eval_batch_size=16,
            num_train_epochs=3,
            weight_decay=wd,
        )

        trainer = Trainer(
            model=model,
            args=training_args,
            train_dataset=tokenized_datasets["train"],
            eval_dataset=tokenized_datasets["validation"],
            tokenizer=tokenizer,
            data_collator=data_collator,
            compute_metrics=compute_accuracy,
        )

        trainer.train()
        eval_results = trainer.evaluate()
        # Use a new name here: `accuracy` is the metric object used by compute_accuracy
        val_accuracy = eval_results["eval_accuracy"]

        if val_accuracy > best_accuracy:
            best_accuracy = val_accuracy
            best_hyperparams = {"learning_rate": lr, "weight_decay": wd}

print(f"Best hyperparameters: {best_hyperparams}")
print(f"Best accuracy: {best_accuracy}")
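
The nested loops enumerate a small grid. The same pattern can be written with `itertools.product`, which stays readable when further hyperparameters are added. In the sketch below, `fake_validation_accuracy` is a hypothetical stand-in for a full train-and-evaluate run, and the values it returns are arbitrary, not measured results.

```python
from itertools import product

def fake_validation_accuracy(learning_rate, weight_decay):
    """Hypothetical placeholder for training a model and scoring it on validation.
    The formula is arbitrary; it only makes the example runnable."""
    return 0.5 - abs(learning_rate - 3e-5) * 1000 - weight_decay * 0.1

grid = {
    "learning_rate": [1e-5, 3e-5],
    "weight_decay": [0.01, 0.1],
}

best_score, best_config = float("-inf"), None
for values in product(*grid.values()):
    # Pair each hyperparameter name with the value chosen for this run.
    config = dict(zip(grid.keys(), values))
    score = fake_validation_accuracy(**config)
    if score > best_score:
        best_score, best_config = score, config

print(best_config)
```

Adding another hyperparameter then only requires a new key in `grid`, with no extra level of loop nesting.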


### 3.3. Evaluation on test set

In [None]:
# Your code to evaluate the final model on the test set here
test_results = trainer.evaluate(tokenized_datasets["test"])
print(test_results)

### 3.4. Cross-lingual experiments

In [None]:
# Your code to train and evaluate the cross-lingual model
# Evaluate the trained model on the full test set of each non-English language
cross_lingual_results = {}
for lang in ["de", "es", "fr", "ja", "zh"]:
    lang_test_dataset = load_dataset("mteb/amazon_reviews_multi", lang, split="test", trust_remote_code=True)
    encoded_test_dataset = lang_test_dataset.map(tokenize, batched=True)
    results = trainer.evaluate(encoded_test_dataset)
    cross_lingual_results[lang] = results
print(cross_lingual_results)

In [None]:
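# Train on English and evaluate on English
# Assumes each review id starts with its language code (for example "en_..."),
# as the non-English filter further down also does.
english_train_dataset = tokenized_datasets["train"].filter(lambda example: example["id"].startswith("en"))
english_test_dataset = tokenized_datasets["test"].filter(lambda example: example["id"].startswith("en"))
trainer.train_dataset = english_train_dataset
trainer.train()
baseline_results = trainer.evaluate(english_test_dataset)
print("Baseline results (Train on English, Evaluate on English):", baseline_results)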


In [None]:
# Evaluate the current model on the full (multilingual) test set
test_results = trainer.evaluate(tokenized_datasets["test"])
print(test_results)


In [None]:
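# Train on multiple non-English languages and evaluate on English
# Assumes each review id starts with its language code (for example "en_...").
# Note: `trainer` keeps the weights from earlier runs; reload the pretrained model first for a true zero-shot setting.
non_english_train_dataset = tokenized_datasets["train"].filter(lambda example: not example["id"].startswith("en"))
english_test_dataset = tokenized_datasets["test"].filter(lambda example: example["id"].startswith("en"))
trainer.train_dataset = non_english_train_dataset
trainer.train()
mixed_zero_shot_results = trainer.evaluate(english_test_dataset)
print("Mixed zero-shot results (Train on multiple non-English languages, Evaluate on English):", mixed_zero_shot_results)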
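
The filter above relies on review ids starting with a language code. That is an assumption about the id format and should be checked against the loaded data, for example by printing a few ids. A toy sketch of the split, using made-up records:

```python
# Made-up records; the "<lang>_<number>" id format is an assumption to verify.
records = [
    {"id": "en_0001", "label": 4},
    {"id": "de_0002", "label": 1},
    {"id": "ja_0003", "label": 2},
    {"id": "en_0004", "label": 0},
]

def language_of(record):
    """Extract the language code from an id such as 'en_0001'."""
    return record["id"].split("_", 1)[0]

english = [r for r in records if language_of(r) == "en"]
non_english = [r for r in records if language_of(r) != "en"]

print(len(english), len(non_english))
print(sorted({language_of(r) for r in non_english}))
```

Comparing the full code before the underscore is slightly stricter than a bare `startswith("en")`, which would also match any other prefix that happens to begin with the same letters.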


---

## 4. Results and summary

### 4.1 Corpus insights

(Briefly discuss what you learned about the corpus and its annotation)

### 4.2 Results

(Briefly summarize your results)

### 4.3 Relation to random baseline / expected performance / state of the art

(Compare your results with the random and state-of-the-art performance)

---

## 5. Bonus Task (optional)

### 5.1. and 5.2. Model and Data selection

(Briefly describe which model was used and why. Also, describe how the test data was downsampled, include relevant code.)

### 5.3. Prompt design

(Include your final prompt here. Also, explain here all prompt engineering insights you learned during the project.)

### 5.4. Generate

In [None]:
# Your code to run the generative model and extract predictions from the model output.

### 5.5. Evaluation and results

(Briefly summarize your results)

### 5.6 Error analysis (group projects only)

(Present the error analysis results here)