# Sentiment Analysis in Japanese
***
## Table of Contents
1. [Introduction](#1-introduction)
1. [Loading Data](#2-loading-data)
1. [Data Preprocessing](#3-data-preprocessing)
1. [Loading Pre-Trained Model](#4-loading-pre-trained-model)
1. [Tokenisation](#5-tokenisation)
1. [Training Arguments](#6-training-arguments)
1. [Evaluation Metrics](#7-evaluation-metrics)
1. [Fine-Tuning Transformer Model](#8-fine-tuning-transformer-model)
1. [Predictions with Fine-Tuned Model](#9-predictions-with-fine-tuned-model)
1. [References](#10-references)
***

In [None]:
import numpy as np
import torch
import torch.nn.functional as F
from datasets import load_dataset
import evaluate

from typing import Dict, List, Tuple, Any, Mapping, Sequence
from numpy.typing import NDArray
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
    pipeline,
)

## 1. Introduction
Text classification is a fundamental task in natural language processing (NLP) that involves assigning predefined labels or classes to given texts. One important application of text classification is sentiment analysis, a challenging task that seeks to capture the context and nuances of language in order to identify sentiments such as positive, neutral, or negative.

Unlike English, some languages (e.g., Japanese and Chinese) do not use whitespace to separate words. This lack of explicit word boundaries presents a significant challenge for NLP systems, as it requires accurate word segmentation before higher-level tasks can be performed.
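
The segmentation problem can be illustrated with a small sketch. The word list `TOY_VOCAB` and the greedy longest-match helper `segment_longest_match` below are hypothetical and written only for this illustration; production tokenisers for Japanese rely on morphological analysis and subword vocabularies rather than a simple lookup like this.

```python
from typing import List, Set

# Hypothetical, hand-picked word list used only for this illustration.
TOY_VOCAB: Set[str] = {"この", "商品", "は", "とても", "良い", "です"}


def segment_longest_match(text: str, vocab: Set[str]) -> List[str]:
    """Greedy left-to-right longest-match segmentation.

    Characters that match no vocabulary entry become single-character tokens.
    """
    max_len = max(len(word) for word in vocab)
    tokens: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, falling back to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i : i + size]
            if candidate in vocab or size == 1:
                tokens.append(candidate)
                i += size
                break
    return tokens


english = "This product is very good"
japanese = "この商品はとても良いです"

english_tokens = english.split()              # whitespace is enough here
japanese_whitespace_tokens = japanese.split() # the whole sentence stays one token
japanese_tokens = segment_longest_match(japanese, TOY_VOCAB)

print(english_tokens)
print(japanese_whitespace_tokens)
print(japanese_tokens)
```

Whitespace splitting recovers the English words but returns the Japanese sentence as a single unsegmented string, which is why a dedicated segmentation step is needed.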

This project aims to implement sentiment analysis for Amazon reviews written in Japanese by fine-tuning a transformer model from Hugging Face. We will utilise a BERT-based transformer model that has been pre-trained on large-scale Japanese text corpora.

## 2. Loading Data
The dataset is loaded from the Hugging Face Hub: [SetFit/amazon_reviews_multi_ja](https://huggingface.co/datasets/SetFit/amazon_reviews_multi_ja/viewer). It provides Japanese Amazon reviews split into training, validation, and test sets.

In [2]:
ds = load_dataset("SetFit/amazon_reviews_multi_ja")

Repo card metadata block was not found. Setting CardData to empty.


In [3]:
ds

DatasetDict({
    train: Dataset({
        features: ['id', 'text', 'label', 'label_text'],
        num_rows: 200000
    })
    validation: Dataset({
        features: ['id', 'text', 'label', 'label_text'],
        num_rows: 5000
    })
    test: Dataset({
        features: ['id', 'text', 'label', 'label_text'],
        num_rows: 5000
    })
})

## 3. Data Preprocessing

In [4]:
RANDOM_SEED = 42
N_LABELS = 3
N_EPOCHS = 3
MY_MODEL_PATH = "./my_finetuned_model"

ID_2_LABEL = {0: "NEGATIVE", 1: "NEUTRAL", 2: "POSITIVE"}
LABEL_2_ID = {"NEGATIVE": 0, "NEUTRAL": 1, "POSITIVE": 2}


In [5]:
ds["train"][0]

{'id': 'ja_0388536',
 'text': '普段使いとバイクに乗るときのブーツ兼用として購入しました。見た目や履き心地は良いです。 しかし、２ヶ月履いたらゴム底が削れて無くなりました。また、バイクのシフトペダルとの摩擦で表皮が剥がれ、本革でないことが露呈しました。ちなみに防水とも書いていますが、雨の日は内部に水が染みます。 安くて見た目も良く、履きやすかったのですが、耐久性のなさ、本革でも防水でも無かったことが残念です。結局、本革の防水ブーツを買い直しました。',
 'label': 0,
 'label_text': '0'}

In [6]:
labels, counts = np.unique(ds["train"]["label"], return_counts=True)
for label, count in zip(labels, counts):
    print(f"label: {label}, count: {count}")

label: 0, count: 40000
label: 1, count: 40000
label: 2, count: 40000
label: 3, count: 40000
label: 4, count: 40000


Labels in the range 0 – 4 represent review ratings from 1 to 5 stars, respectively. For sentiment classification purposes, we map these labels into three categories:

- Labels 0 and 1 $\rightarrow$ **Negative**
- Label 2 $\rightarrow$ **Neutral**
- Labels 3 and 4 $\rightarrow$ **Positive**

In [7]:
def convert_rating_to_sentiment(
    batch: Mapping[str, Sequence[int]],
    id2label: Dict[int, str],
) -> Dict[str, List[Any]]:
    """
    Convert numerical review ratings to sentiment class (int) and text labels.

    Args:
        batch: A batch of dataset samples containing the key "label", whose values are integer ratings.
        id2label: A dictionary mapping sentiment class IDs to their string labels,
            e.g., `{0: "NEGATIVE", 1: "NEUTRAL", 2: "POSITIVE"}`.

    Returns:
        A dictionary containing:
            - label: List of integer sentiment classes corresponding to each sample,
            - label_text: List of sentiment label strings corresponding to each sample.
    """
    ratings = batch["label"]
    sentiments = []
    sentiments_text = []
    for r in ratings:
        if r in [0, 1]:
            sentiments.append(0)
            sentiments_text.append(id2label[0])
        elif r == 2:
            sentiments.append(1)
            sentiments_text.append(id2label[1])
        else:
            sentiments.append(2)
            sentiments_text.append(id2label[2])
    return {"label": sentiments, "label_text": sentiments_text}

In [8]:
ds = ds.map(
    convert_rating_to_sentiment,
    batched=True,
    fn_kwargs={"id2label": ID_2_LABEL},
)

In [9]:
labels, counts = np.unique(ds["train"]["label"], return_counts=True)
for label, count in zip(labels, counts):
    print(f"label: {label}, count: {count}")

label: 0, count: 80000
label: 1, count: 40000
label: 2, count: 80000
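
The new counts follow directly from the mapping. As a sanity check, the sketch below rebuilds them from the per-rating counts printed earlier and derives the class proportions. The names `RATING_TO_SENTIMENT` and `majority_baseline` are introduced here for illustration only; the baseline is the accuracy a classifier would reach by always predicting one of the largest classes on data with this distribution.

```python
from collections import Counter

# Star-rating labels 0-4 mapped to sentiment ids, as described above.
RATING_TO_SENTIMENT = {0: 0, 1: 0, 2: 1, 3: 2, 4: 2}

# Training label distribution before the mapping (40,000 reviews per rating).
rating_counts = {rating: 40_000 for rating in range(5)}

sentiment_counts: Counter = Counter()
for rating, count in rating_counts.items():
    sentiment_counts[RATING_TO_SENTIMENT[rating]] += count

total = sum(sentiment_counts.values())
proportions = {
    label: count / total for label, count in sorted(sentiment_counts.items())
}
# Accuracy of always predicting one of the largest classes.
majority_baseline = max(proportions.values())

print(dict(sentiment_counts))
print(proportions)
print(f"Majority-class baseline accuracy: {majority_baseline:.2f}")
```

The neutral class is half the size of the other two, which is one reason to report a class-aware metric such as F1 alongside accuracy.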


## 4. Loading Pre-Trained Model
For this task, a pre-trained BERT model ([bert-base-japanese-v3](https://huggingface.co/tohoku-nlp/bert-base-japanese-v3)) will be used.

In [10]:
model_name = "tohoku-nlp/bert-base-japanese-v3"
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=N_LABELS,
    id2label=ID_2_LABEL,
    label2id=LABEL_2_ID,
)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at tohoku-nlp/bert-base-japanese-v3 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## 5. Tokenisation
Tokenisation converts raw text into a sequence of tokens that the model can process.

The pretrained model was trained with a specific vocabulary and tokenisation scheme, so the tokeniser that belongs to the same checkpoint must be used for both fine-tuning and inference. This prevents vocabulary mismatches and ensures that input texts are encoded in the way the model expects.
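
A toy example shows what a vocabulary mismatch looks like. The two vocabularies below (`VOCAB_A`, `VOCAB_B`) and the `encode` helper are hypothetical and far smaller than any real tokeniser vocabulary; the point is only that the same tokens map to different ids, so a model trained with one vocabulary misreads ids produced by the other.

```python
from typing import Dict, List

# Two hypothetical vocabularies: overlapping tokens, different ids.
VOCAB_A: Dict[str, int] = {"[UNK]": 0, "商品": 1, "良い": 2, "悪い": 3}
VOCAB_B: Dict[str, int] = {"[UNK]": 0, "良い": 1, "商品": 2}


def encode(tokens: List[str], vocab: Dict[str, int]) -> List[int]:
    """Map tokens to ids, using the [UNK] id for tokens not in the vocabulary."""
    return [vocab.get(token, vocab["[UNK]"]) for token in tokens]


tokens = ["商品", "悪い"]
ids_a = encode(tokens, VOCAB_A)  # ids the model trained with VOCAB_A expects
ids_b = encode(tokens, VOCAB_B)  # ids produced by the wrong tokeniser

# Interpret the wrong ids through the vocabulary the model was trained on.
id_to_token_a = {idx: token for token, idx in VOCAB_A.items()}
seen_by_model_a = [id_to_token_a[idx] for idx in ids_b]

print(ids_a, ids_b)
print(seen_by_model_a)
```

With the mismatched vocabulary, the input "商品 悪い" reaches the model as "良い [UNK]", which reverses the sentiment-bearing word and drops the other.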

In [11]:
tokeniser = AutoTokenizer.from_pretrained(model_name)


def tokenise_function(example):
    return tokeniser(
        example["text"],
        padding="max_length",
        truncation=True,
        max_length=128,
    )


tokenised_dataset = ds.map(tokenise_function, batched=True)




In [12]:
tokenised_dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'text', 'label', 'label_text', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 200000
    })
    validation: Dataset({
        features: ['id', 'text', 'label', 'label_text', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 5000
    })
    test: Dataset({
        features: ['id', 'text', 'label', 'label_text', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 5000
    })
})

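The `DataCollatorWithPadding` dynamically pads a batch of tokenised input sequences so that all sequences in the batch have the same length. Because the tokenisation step above already pads every sequence to `max_length=128`, the collator has no extra padding to add here; dropping `padding="max_length"` from the tokeniser call would let it pad each batch only to its longest sequence.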

In [13]:
data_collator = DataCollatorWithPadding(tokenizer=tokeniser, return_tensors="pt")
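
The difference between dynamic and fixed-length padding can be sketched without any library. The token ids, the `pad_batch` helper, and the padding id `PAD_ID = 0` below are assumptions made for this illustration and are not taken from the actual tokeniser.

```python
from typing import List

PAD_ID = 0  # assumed padding id for this sketch


def pad_batch(
    batch: List[List[int]], target_len: int, pad_id: int = PAD_ID
) -> List[List[int]]:
    """Right-pad every sequence in the batch to target_len."""
    return [seq + [pad_id] * (target_len - len(seq)) for seq in batch]


# Three made-up token id sequences of lengths 4, 3, and 6.
batch = [[2, 15, 8, 3], [2, 40, 3], [2, 7, 9, 11, 12, 3]]

# Dynamic padding: pad only to the longest sequence in this batch.
dynamic = pad_batch(batch, target_len=max(len(seq) for seq in batch))

# Fixed padding: pad every sequence to a global maximum length of 128.
fixed = pad_batch(batch, target_len=128)

dynamic_tokens = sum(len(seq) for seq in dynamic)
fixed_tokens = sum(len(seq) for seq in fixed)
print(f"dynamic: {dynamic_tokens} positions, fixed: {fixed_tokens} positions")
```

For short reviews, dynamic padding processes far fewer padded positions per batch, which is the main motivation for using a padding collator.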

Format the dataset for PyTorch, including the labels as PyTorch integer tensors.

In [14]:
tokenised_dataset.set_format("torch", columns=["input_ids", "attention_mask", "label"])

## 6. Training Arguments
The following code configures the training arguments for fine-tuning a Transformer model using Hugging Face's `Trainer` API. For this project, F1-Score will be employed as the evaluation metric to determine the best-performing model.

In [15]:
training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=N_EPOCHS,
    weight_decay=0.01,
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
)
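
With `load_best_model_at_end=True` and `metric_for_best_model="f1"`, the checkpoint with the best evaluation F1 is restored once training finishes. The sketch below imitates that selection on hypothetical per-epoch results; the numbers and the `select_best` helper are illustrative only, and it assumes that a higher F1 is better while a lower loss is better.

```python
from typing import Dict, List

# Hypothetical per-epoch evaluation results (illustrative values only).
eval_history: List[Dict[str, float]] = [
    {"epoch": 1, "eval_loss": 0.57, "eval_f1": 0.71},
    {"epoch": 2, "eval_loss": 0.58, "eval_f1": 0.74},
    {"epoch": 3, "eval_loss": 0.65, "eval_f1": 0.73},
]


def select_best(
    history: List[Dict[str, float]], metric: str, greater_is_better: bool
) -> Dict[str, float]:
    """Return the history entry with the best value of the given metric."""
    chooser = max if greater_is_better else min
    return chooser(history, key=lambda entry: entry[metric])


best_by_f1 = select_best(eval_history, "eval_f1", greater_is_better=True)
best_by_loss = select_best(eval_history, "eval_loss", greater_is_better=False)

print(f"Best epoch by F1:   {best_by_f1['epoch']}")
print(f"Best epoch by loss: {best_by_loss['epoch']}")
```

The two criteria can disagree, so the choice of `metric_for_best_model` determines which checkpoint is kept.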

## 7. Evaluation Metrics
The overall performance of the fine-tuned model will be evaluated using two metrics: accuracy and the weighted F1-score.

In [None]:
accuracy_metric = evaluate.load("accuracy")
f1_metric = evaluate.load("f1")


def compute_metrics(
    eval_pred: Tuple[NDArray[np.float64], NDArray[np.int8]],
) -> Dict[str, float]:
    """
    Compute accuracy and weighted F1-score for model predictions.

    This function calculates the accuracy and weighted F1-score of the model's predictions
    given the logits and true labels for a batch. Logits are expected to be floating-point
    arrays, and labels are expected to be integer arrays. Predictions are derived via the
    argmax function, accuracy is the proportion of correct predictions, and the F1-score
    is weighted by class frequency.

    Args:
        eval_pred:  A tuple containing the logits (as a NumPy array of floats, shape [N, C]) and
                    the true integer labels (NumPy array of shape [N]), where N is the number of
                    samples and C is the number of classes.

    Returns:
        Dictionary with classification metrics:
            - "accuracy": Overall accuracy as a float.
            - "f1": Weighted F1-score as a float.
    """
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    accuracy = accuracy_metric.compute(predictions=predictions, references=labels)
    f1 = f1_metric.compute(
        predictions=predictions, references=labels, average="weighted"
    )
    return {"accuracy": accuracy["accuracy"], "f1": f1["f1"]}
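
To make the two metrics concrete, the following sketch computes them by hand on a toy batch using only the standard library. The logits and labels are made up for this example, and `accuracy_and_weighted_f1` is a hypothetical helper, not part of the `evaluate` library; it is meant to mirror the definition of support-weighted F1 rather than replace the library call.

```python
from typing import Dict, List, Sequence


def argmax(row: Sequence[float]) -> int:
    """Index of the largest value in a row of logits."""
    return max(range(len(row)), key=lambda i: row[i])


def accuracy_and_weighted_f1(
    logits: List[List[float]], labels: List[int]
) -> Dict[str, float]:
    predictions = [argmax(row) for row in logits]
    pairs = list(zip(predictions, labels))
    n = len(labels)
    accuracy = sum(p == y for p, y in pairs) / n

    weighted_f1 = 0.0
    for cls in sorted(set(labels)):
        tp = sum(p == cls and y == cls for p, y in pairs)
        fp = sum(p == cls and y != cls for p, y in pairs)
        fn = sum(p != cls and y == cls for p, y in pairs)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        support = tp + fn  # number of true samples of this class
        weighted_f1 += f1 * support / n
    return {"accuracy": accuracy, "f1": weighted_f1}


# Made-up logits for six samples and three classes.
toy_logits = [
    [2.0, 0.1, -1.0],
    [0.2, 1.5, 0.3],
    [-0.5, 2.2, 0.0],
    [0.1, 0.2, 3.0],
    [-1.0, 0.0, 1.2],
    [1.1, 0.3, 0.9],
]
toy_labels = [0, 0, 1, 2, 2, 2]

toy_metrics = accuracy_and_weighted_f1(toy_logits, toy_labels)
print(toy_metrics)
```

Each class contributes its own F1-score in proportion to the number of true samples it has, so frequent classes influence the weighted score more than rare ones.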

## 8. Fine-Tuning Transformer Model
Fine-tuning transformer models is computationally expensive and time-consuming. To speed up training, only 20% of the training and validation splits is used. This corresponds to 40,000 reviews for training and 1,000 reviews for validation, compared with 200,000 and 5,000 reviews respectively in the original dataset. The test split is kept in full for the final evaluation.

In [17]:
tokenised_train_dataset = tokenised_dataset["train"].train_test_split(
    test_size=0.2, seed=RANDOM_SEED
)["test"]

tokenised_val_dataset = tokenised_dataset["validation"].train_test_split(
    test_size=0.2, seed=RANDOM_SEED
)["test"]

In [18]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenised_train_dataset,
    eval_dataset=tokenised_val_dataset,
    processing_class=tokeniser,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)
trainer.train()



| Epoch | Training Loss | Validation Loss | Accuracy | F1 |
|---|---|---|---|---|
| 1 | 0.5909 | 0.573127 | 0.755 | 0.74799 |
| 2 | 0.489 | 0.585562 | 0.746 | 0.7498 |
| 3 | 0.3747 | 0.653701 | 0.75 | 0.75091 |




TrainOutput(global_step=7500, training_loss=0.4975051188151042, metrics={'train_runtime': 6649.2331, 'train_samples_per_second': 18.047, 'train_steps_per_second': 1.128, 'total_flos': 7893402531840000.0, 'train_loss': 0.4975051188151042, 'epoch': 3.0})
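
The reported `global_step=7500` and throughput can be cross-checked from the configuration. The sketch below assumes training on a single device without gradient accumulation, which the output does not state explicitly.

```python
import math

train_samples = 200_000 * 20 // 100  # 20% subset of the training split
batch_size = 16                      # per_device_train_batch_size
epochs = 3                           # N_EPOCHS
train_runtime = 6649.2331            # seconds, taken from the TrainOutput above

# One optimisation step per batch (single device, no gradient accumulation).
steps_per_epoch = math.ceil(train_samples / batch_size)
total_steps = steps_per_epoch * epochs
samples_per_second = train_samples * epochs / train_runtime

print(f"steps per epoch:    {steps_per_epoch}")
print(f"total steps:        {total_steps}")
print(f"samples per second: {samples_per_second:.3f}")
```

Both values agree with the `global_step` and `train_samples_per_second` fields in the training output.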

In [19]:
trainer.save_model(MY_MODEL_PATH)
tokeniser.save_pretrained(MY_MODEL_PATH)

('./my_finetuned_model/tokenizer_config.json',
 './my_finetuned_model/special_tokens_map.json',
 './my_finetuned_model/vocab.txt',
 './my_finetuned_model/added_tokens.json')

In [20]:
# Evaluate on test set
metrics = trainer.evaluate(tokenised_dataset["test"])
print(metrics)



{'eval_loss': 0.6751722097396851, 'eval_accuracy': 0.751, 'eval_f1': 0.75170711393122, 'eval_runtime': 77.3708, 'eval_samples_per_second': 64.624, 'eval_steps_per_second': 4.045, 'epoch': 3.0}


## 9. Predictions with Fine-Tuned Model
After training, the fine-tuned model and tokeniser can be used to predict the label and confidence score of a single text sample.

In [30]:
text_1 = "この商品はとても使いやすく、期待以上の性能でした。買ってよかったです！"
text_2 = "価格の割に品質が低いと感じました。リピートはしません。"
text_3 = "何とも言えないです。良くも悪くもないです。"
text_4 = "時々役に立つかもしれません。"

The explicit approach is to run the model under PyTorch's inference mode, convert the logits to probabilities with softmax, and take the argmax as the predicted class, with its probability serving as the confidence score.

In [31]:
my_tokeniser = AutoTokenizer.from_pretrained(MY_MODEL_PATH)
my_model = AutoModelForSequenceClassification.from_pretrained(MY_MODEL_PATH)

inputs = my_tokeniser(text_1, return_tensors="pt")
with torch.inference_mode():
    logits = my_model(**inputs).logits
    probabilities = F.softmax(logits, dim=-1)
    confidence = probabilities.max().item()
    predicted_class_id = probabilities.argmax().item()

print(f"Text: {text_1}")
print(
    f"Predicted: {predicted_class_id} ({my_model.config.id2label[predicted_class_id]}), Confidence: {confidence:.4f}"
)

Text: この商品はとても使いやすく、期待以上の性能でした。買ってよかったです！
Predicted: 2 (POSITIVE), Confidence: 0.9924
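
The step from logits to a confidence score can also be written out by hand. The logits below are hypothetical values chosen for illustration and are not the model's actual outputs; the `softmax` helper subtracts the maximum logit before exponentiating, a common trick for numerical stability.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Convert logits to probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for the classes NEGATIVE, NEUTRAL, POSITIVE.
example_logits = [-2.1, -1.3, 3.4]
id2label = {0: "NEGATIVE", 1: "NEUTRAL", 2: "POSITIVE"}

example_probs = softmax(example_logits)
example_class_id = max(range(len(example_probs)), key=lambda i: example_probs[i])
example_confidence = example_probs[example_class_id]

print(f"Predicted: {example_class_id} ({id2label[example_class_id]})")
print(f"Confidence: {example_confidence:.4f}")
```

Softmax preserves the ordering of the logits, so the argmax of the probabilities is the same as the argmax of the logits; softmax is only needed to obtain the confidence value.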


Pipelines offer another convenient way to run inference. Passing the path of the fine-tuned model as the `model` argument loads our own model; if the argument is omitted, the pipeline falls back to a default pre-trained model for the task.

In [32]:
classifier = pipeline(
    "sentiment-analysis",
    model=MY_MODEL_PATH,
)

print(f"Text: {text_1}")
print(classifier(text_1))

print(f"\nText: {text_2}")
print(classifier(text_2))

print(f"\nText: {text_3}")
print(classifier(text_3))

print(f"\nText: {text_4}")
print(classifier(text_4))

Device set to use mps:0


Text: この商品はとても使いやすく、期待以上の性能でした。買ってよかったです！
[{'label': 'POSITIVE', 'score': 0.9924416542053223}]

Text: 価格の割に品質が低いと感じました。リピートはしません。
[{'label': 'NEGATIVE', 'score': 0.9454251527786255}]

Text: 何とも言えないです。良くも悪くもないです。
[{'label': 'NEUTRAL', 'score': 0.8939847946166992}]

Text: 時々役に立つかもしれません。
[{'label': 'NEUTRAL', 'score': 0.8266585469245911}]
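
One possible way to use the returned scores is to separate confident predictions from those that may deserve a manual check. The sketch below copies the pipeline outputs shown above into a plain list; the threshold of 0.90 and the `triage` helper are hypothetical choices for illustration, not a recommendation.

```python
from typing import Any, Dict, List, Tuple

CONFIDENCE_THRESHOLD = 0.90  # hypothetical cut-off for this sketch

# Pipeline outputs copied from the cell output above.
pipeline_outputs: List[Tuple[str, Dict[str, Any]]] = [
    ("text_1", {"label": "POSITIVE", "score": 0.9924416542053223}),
    ("text_2", {"label": "NEGATIVE", "score": 0.9454251527786255}),
    ("text_3", {"label": "NEUTRAL", "score": 0.8939847946166992}),
    ("text_4", {"label": "NEUTRAL", "score": 0.8266585469245911}),
]


def triage(
    outputs: List[Tuple[str, Dict[str, Any]]], threshold: float
) -> Tuple[Dict[str, str], List[str]]:
    """Split predictions into confident labels and items flagged for review."""
    confident: Dict[str, str] = {}
    needs_review: List[str] = []
    for name, result in outputs:
        if result["score"] >= threshold:
            confident[name] = result["label"]
        else:
            needs_review.append(name)
    return confident, needs_review


confident, needs_review = triage(pipeline_outputs, CONFIDENCE_THRESHOLD)
print(confident)
print(needs_review)
```

With this cut-off, the two neutral predictions fall below the threshold, which matches the intuition that ambiguous reviews are the hardest to classify.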


## 10. References

1. Hugging Face. (n.d.). *Fine-tuning*.<br>
https://huggingface.co/docs/transformers/training

1. Hugging Face. (n.d.). *Text classification*.<br>
https://huggingface.co/docs/transformers/tasks/sequence_classification

1. Hugging Face. (n.d.). *Trainer*.<br>
https://huggingface.co/docs/transformers/trainer

1. PyTorch Docs. (n.d.). *Transformers*.<br>
https://docs.pytorch.org/docs/stable/generated/torch.nn.Transformer.html

1. tohoku-nlp. (2023). *BERT base Japanese (unidic-lite with whole word masking, CC-100 and jawiki-20230102)*. Hugging Face.<br>
https://huggingface.co/tohoku-nlp/bert-base-japanese-v3

1. SetFit. (2022). *SetFit/amazon_reviews_multi_ja*. Hugging Face.<br>
https://huggingface.co/datasets/SetFit/amazon_reviews_multi_ja/viewer?views%5B%5D=train