# Notebook 3: Test of Oseti against other SA models

In [1]:
import oseti
import statistics
import nltk
#nltk.download('all') #runs first time only
from nltk.sentiment import SentimentIntensityAnalyzer
from pathlib import Path
import os
import numpy as np
from transformers import pipeline
import torch
from tqdm import tqdm
import pandas as pd


Note: Here, the Oseti library is used in a slightly modified form. Oseti depends on MeCab for tokenization; however, the current version of Oseti has not been adjusted for the updated MeCab. The code could be adapted to the newer MeCab, but in this notebook we use the NEologd dictionary with Janome as the tokenizer. In practice, this has almost no effect on the sentiment score. In the analysis of the corpora, the original Oseti was used with minor adjustments for compatibility with a newer MeCab version.

In [2]:
sia = SentimentIntensityAnalyzer()
analyzer = oseti.Analyzer()

# Oseti dictionary-based sentiment analysis vs. rule-based VADER
Here, we juxtapose the two approaches. The Oseti sentiment analyzer has a built-in sentence tokenizer, while VADER requires a separate tokenizer (such as the one in the NLTK package).
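The difference can be sketched as follows. Oseti's `analyze` returns one score per sentence, which the cells below average, whereas VADER scores whatever string it receives and leaves any sentence splitting to the caller. The helper names here (`naive_split`, `mean_sentence_score`, `toy_score`) are hypothetical stand-ins, not part of Oseti, VADER, or NLTK; in practice `split_sentences` would be a real tokenizer and `score` something like `sia.polarity_scores(s)['compound']`.

```python
import re
import statistics
from typing import Callable, List


def naive_split(text: str) -> List[str]:
    """Rough sentence splitter (assumption: '.', '!' or '?' ends a sentence)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def mean_sentence_score(
    text: str,
    score: Callable[[str], float],
    split_sentences: Callable[[str], List[str]] = naive_split,
) -> float:
    """Split the text into sentences, score each one, and average the scores."""
    sentences = split_sentences(text)
    if not sentences:
        return 0.0  # avoid statistics.mean() failing on an empty list
    return statistics.mean([score(sentence) for sentence in sentences])


def toy_score(sentence: str) -> float:
    """Toy scorer for illustration only: +1 for 'good', -1 for 'bad'."""
    lowered = sentence.lower()
    return float(("good" in lowered) - ("bad" in lowered))


example = "Good day. Good night. Bad luck. Nothing."
print(mean_sentence_score(example, toy_score))  # (1 + 1 - 1 + 0) / 4 = 0.25
```

The guard for an empty sentence list matters for inputs such as blank lines, where averaging would otherwise raise an error.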

In [3]:
# Build paths with pathlib so the cell runs on any OS, not only Windows
samplings_dir = Path("text samplings")

text = (samplings_dir / "direct speech sampling JA.txt").read_text(encoding="utf-8")
sampling_ja = text.split("\n")

text = (samplings_dir / "direct speech sampling EN.txt").read_text(encoding="utf-8")
sampling_en = text.split("\n")

In [4]:
annotated_dataset = pd.read_excel(Path("text samplings") / "annotated dataset.xlsx", index_col=False)
annotated_sents_ja = list(annotated_dataset["Original"])
annotated_sents_en = list(annotated_dataset["Translation"])
true_annotated_scores = list(annotated_dataset["Sentiment"])
annotated_dataset.head()


Unnamed: 0,Original,Translation,Sentiment,Unnamed: 3,Unnamed: 4,Unnamed: 5
0,私は狼狼した。,I was in a panic.,-1,,,
1,声は細かったが精一杯の喜びの声だ。,"My voice was thin, but it was the best voice I...",0,,,
2,硬直せぬ前に婦長さんが体を整える。,"Before I could stiffen, the head nurse adjuste...",-1,,,
3,家への強烈な不安が、 またも頭にのしかかって来たが、たちまち忙しさにきりきり舞いを始めて、そ...,A strong sense of anxiety about home once agai...,-1,,,
4,家を出るとき輝一が冗談ともっかずに云うと、 道子は肩を叩いて送り出した。,When Teruichi said this jokingly as he left th...,0,,,


In [5]:
#number of samplings
no_sampling_ja = len(sampling_ja)
no_sampling_en = len(sampling_en)

### Oseti results

In [6]:
oseti_sentiment = [statistics.mean(analyzer.analyze(sent)) for sent in sampling_ja]
oseti_annotated = [statistics.mean(analyzer.analyze(sent)) for sent in annotated_sents_ja]

### VADER results

In [8]:
vader_sentiment = [sia.polarity_scores(sent)['compound'] for sent in sampling_en]
vader_annotated = [sia.polarity_scores(sent)['compound'] for sent in annotated_sents_en]

### bert-finetuned-japanese-sentiment
Training dataset: Amazon Reviews\
No.: 20000 reviews\
Link: https://huggingface.co/christian-phu/bert-finetuned-japanese-sentiment

In [9]:
# model needs the following dependencies:
#!pip install fugashi
#!pip install unidic_lite


sentiment_analyzer = pipeline(
    "sentiment-analysis",
    model="christian-phu/bert-finetuned-japanese-sentiment",
)
bert_sentiment = []
for sent in tqdm(sampling_ja):
    result = sentiment_analyzer(sent)[0]
    label_to_score = {'positive': 1, 'neutral': 0, 'negative': -1}
    compound_score = label_to_score[result['label']] * result['score']
    bert_sentiment.append(compound_score)

Device set to use cpu
100%|██████████| 159/159 [00:24<00:00,  6.38it/s]


In [10]:
bert_annotated = []
for sent in tqdm(annotated_sents_ja):
    result = sentiment_analyzer(sent)[0]
    label_to_score = {'positive': 1, 'neutral': 0, 'negative': -1}
    compound_score = label_to_score[result['label']]
    bert_annotated.append(compound_score)

100%|██████████| 100/100 [00:07<00:00, 13.87it/s]


### japanese-sentiment-analysis
Training dataset: Corporate financial reports\
No.: 200 reports (6,119 sentences)\
Link: https://huggingface.co/jarvisx17/japanese-sentiment-analysis

In [12]:
sentiment_analyzer_jarv = pipeline("sentiment-analysis", model="jarvisx17/japanese-sentiment-analysis")
jarv_sentiment = []
for sent in tqdm(sampling_ja):
    result = sentiment_analyzer_jarv(sent)[0]
    label_to_score = {'positive': 1, 'neutral': 0, 'negative': -1}
    compound_score = label_to_score[result['label']] * result['score']
    jarv_sentiment.append(compound_score)

Device set to use cpu
100%|██████████| 159/159 [00:22<00:00,  7.23it/s]


In [13]:
jarv_annotated = []
for sent in tqdm(annotated_sents_ja):
    result = sentiment_analyzer_jarv(sent)[0]
    label_to_score = {'positive': 1, 'neutral': 0, 'negative': -1}
    compound_score = label_to_score[result['label']]
    jarv_annotated.append(compound_score)

100%|██████████| 100/100 [00:08<00:00, 12.09it/s]


### Japanese Stock Comment Sentiment Model
Training dataset: Comments and discussions related to Japanese stocks\
No.: Not specified\
Link: https://huggingface.co/c299m/japanese_stock_sentiment\
\
\
This model is not applicable to sentiment analysis in this study, as it only classifies market outlook into two categories: "bullish" and "bearish".

### Finance-sentiment-ja-base
Training dataset: Japanese financial news\
No.: ≈5,000 sentences/phrases\
Link: https://huggingface.co/bardsai/finance-sentiment-ja-base\
\
The model is unusable for this task, as in the majority of cases it outputs neutral sentiment scores.

In [15]:
sentiment_analyzer_bardsai = pipeline("sentiment-analysis", model="bardsai/finance-sentiment-ja-base")
bardsai_sentiment = []
for sent in tqdm(sampling_ja):
    result = sentiment_analyzer_bardsai(sent)[0]
    label_to_score = {'positive': 1, 'neutral': 0, 'negative': -1}
    compound_score = label_to_score[result['label']] * result['score']
    bardsai_sentiment.append(compound_score)

Device set to use cpu
100%|██████████| 159/159 [00:17<00:00,  8.89it/s]


In [16]:
bardsai_annotated = []
for sent in tqdm(annotated_sents_ja):
    result = sentiment_analyzer_bardsai(sent)[0]
    label_to_score = {'positive': 1, 'neutral': 0, 'negative': -1}
    compound_score = label_to_score[result['label']]
    bardsai_annotated.append(compound_score)

100%|██████████| 100/100 [00:11<00:00,  9.01it/s]


### Models Overview

In [33]:
comparative_df = pd.DataFrame({"Oseti": oseti_sentiment, "VADER": vader_sentiment, "bert-finetuned-japanese-sentiment": bert_sentiment,
                               "japanese-sentiment-analysis": jarv_sentiment,"finance-sentiment-ja-base": bardsai_sentiment})

In [34]:
comparative_df.to_csv("Models Overview Dataframe.csv")

In [35]:
comparative_df.describe()

Unnamed: 0,Oseti,VADER,bert-finetuned-japanese-sentiment,japanese-sentiment-analysis,finance-sentiment-ja-base
count,159.0,159.0,159.0,159.0,159.0
mean,0.058192,0.002313,0.374328,0.314734,0.010143
std,0.517422,0.384876,0.646258,0.923552,0.145701
min,-1.0,-0.9287,-0.997148,-0.999939,-0.99992
25%,0.0,-0.0644,0.0,-0.988994,0.0
50%,0.0,0.0,0.652621,0.98831,0.0
75%,0.0,0.1378,0.984478,0.999605,0.0
max,1.0,0.9001,0.999272,0.999955,0.99901


The transformer models for Japanese sentiment analysis did not show a clear advantage over the simple dictionary-based method used by Oseti.

1. They do not provide a direct interface for calculating sentiment intensity. Instead, intensity scores are indirectly inferred from the model’s confidence (probability) in classifying a sentence as positive, negative, or neutral.

2. Among the four documented models, only two are operational. The Japanese Stock Comment Sentiment Model is not suitable for this study, as its sentiment classes ("bearish" and "bullish") do not align with the required categories. The finance-sentiment-ja-base model tends to classify most sentences as neutral when applied to samples from the Atomic Bomb Literature corpus.

3. The transformer models did not demonstrate a meaningfully stronger correlation with the VADER model, nor among themselves.

4. Given the advantages of rule-based models like VADER, particularly their transparency and traceability, we consider VADER a reliable reference point. Measured against this benchmark, Oseti shows the best alignment (F1 = 0.577, against 0.290 to 0.328 for the transformer models). Although finance-sentiment-ja-base reached a higher weighted precision (0.657 vs. 0.585 for Oseti), its recall and F1 are lower, and 155 out of its 157 predictions were classified as neutral, limiting its practical usefulness.

In [36]:
comparative_df.corr()

Unnamed: 0,Oseti,VADER,bert-finetuned-japanese-sentiment,japanese-sentiment-analysis,finance-sentiment-ja-base
Oseti,1.0,0.357845,0.268596,0.271637,0.155789
VADER,0.357845,1.0,0.345126,0.381054,0.153501
bert-finetuned-japanese-sentiment,0.268596,0.345126,1.0,0.329962,0.000454
japanese-sentiment-analysis,0.271637,0.381054,0.329962,1.0,0.146033
finance-sentiment-ja-base,0.155789,0.153501,0.000454,0.146033,1.0


In [38]:
from sklearn.metrics import precision_score, recall_score, f1_score

def transform_sentiment(input_scores):
    """sent > 0 -> 1; sent < 0 -> -1; sent = 0 -> 0"""
    transformed_sentiment = [1 if score > 0 else (-1 if score < 0 else 0) for score in input_scores]
    return transformed_sentiment

y_vader = transform_sentiment(vader_sentiment)  
y_oseti = transform_sentiment(oseti_sentiment)
y_bert = transform_sentiment(bert_sentiment)
y_jarv = transform_sentiment(jarv_sentiment)
y_bardsai = transform_sentiment(bardsai_sentiment)

def get_metrics(true_values, predicted_values):
    """Calculates precision, recall, and F1 score"""
    precision = precision_score(true_values, predicted_values, average='weighted')  # Using 'weighted' for multi-class
    recall = recall_score(true_values, predicted_values, average='weighted')
    f1 = f1_score(true_values, predicted_values, average='weighted')
    return [precision, recall, f1]


oseti_metrics = get_metrics(y_vader, y_oseti)
bert_metrics = get_metrics(y_vader, y_bert)
jarv_metrics = get_metrics(y_vader, y_jarv)
bardsai_metrics = get_metrics(y_vader, y_bardsai)

metrics_df = pd.DataFrame({"Metric": ["Precision", "Recall", "F1"], "Oseti": oseti_metrics,
        "bert-finetuned-japanese-sentiment": bert_metrics, "japanese-sentiment-analysis": jarv_metrics,
        "finance-sentiment-ja-base": bardsai_metrics})

metrics_df


Unnamed: 0,Metric,Oseti,bert-finetuned-japanese-sentiment,japanese-sentiment-analysis,finance-sentiment-ja-base
0,Precision,0.584939,0.360627,0.230748,0.657199
1,Recall,0.591195,0.345912,0.408805,0.45283
2,F1,0.577118,0.328412,0.289929,0.305955
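The high precision of finance-sentiment-ja-base in the table above can look surprising for a model that predicts "neutral" almost everywhere. A minimal sketch on made-up labels shows how this arises with support-weighted averaging: the few non-neutral predictions are all correct, so those classes get a precision of 1.0 even though their recall is poor. The function `weighted_prf` is a hypothetical pure-Python helper intended to mirror `average='weighted'` (with undefined ratios counted as 0); it is not the scikit-learn implementation.

```python
from collections import Counter
from typing import Sequence, Tuple


def weighted_prf(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float, float]:
    """Support-weighted precision, recall and F1; undefined ratios count as 0."""
    support = Counter(y_true)      # how often each class occurs in the reference
    predicted = Counter(y_pred)    # how often each class was predicted
    n = len(y_true)
    precision_sum = recall_sum = f1_sum = 0.0
    for label, count in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == p == label)
        precision = tp / predicted[label] if predicted[label] else 0.0
        recall = tp / count
        f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
        weight = count / n
        precision_sum += weight * precision
        recall_sum += weight * recall
        f1_sum += weight * f1
    return precision_sum, recall_sum, f1_sum


# Made-up data: 8 of 10 predictions are neutral (0), the other two are correct.
y_true = [1, 1, 1, -1, -1, -1, 0, 0, 0, 0]
y_pred = [1, 0, 0, -1, 0, 0, 0, 0, 0, 0]

precision, recall, f1 = weighted_prf(y_true, y_pred)
print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.8 0.6 0.567
```

In this toy case precision is 0.8 while F1 is only about 0.57, the same pattern of high precision and weak F1 seen for the near-constant model.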


In [None]:
metrics_df.to_csv("Models against VADER tests.csv", index=False)

Average sentiment scores for different models

In [44]:
for group in [oseti_sentiment, vader_sentiment, bert_sentiment, jarv_sentiment, bardsai_sentiment]:
    print(statistics.mean(group))

0.058192323050813614
0.002313207547169812
0.3743280177971102
0.3147344626720596
0.010143305520591495


As the mean sentiment scores show, the transformer models tend to skew their output toward positive values. The model *Finance-sentiment-ja-base* is again the exception: by "neutralizing" most of its outputs, it ends up close to the VADER and Oseti means.

In [45]:
for group in [oseti_sentiment, vader_sentiment, bert_sentiment, jarv_sentiment, bardsai_sentiment]:
    print(statistics.median(group))

0
0.0
0.6526210308074951
0.9883104562759399
0.0


Although the sample is limited, the median values confirm that the two operational transformer models skew their sentiment scores toward the positive end.

### Testing Models with Annotated Dataset
From the previously used sample, 50 lines of direct speech (50 phrases or sentences) and 50 lines of authorial narration were selected. I manually annotated these 100 samples with one of three sentiment labels: positive (1), negative (-1), or neutral (0). Although some studies employ human annotation with continuous numeric sentiment values (see, for example, Bizzoni and Feldkamp), I deliberately refrained from this approach.

First, such evaluations are highly subjective, and having a limited number of annotators (N = 1) does not provide a sufficient basis for trusting the results. Second, one of the key strengths of sentiment analysis (SA) approaches that output numerical sentiment scores lies in their ability to follow traceable, rule-based processes. This characteristic aligns with the broader goals of computational criticism, which aims to offer new, reproducible perspectives on literary texts (see Ramsay).

Therefore, for evaluating the adequacy of model performance, I focus solely on whether the outputs correctly match the general polarity (tonality) of the text. In previous experiments, Transformer-based models showed that they do not output true sentiment scores in a strict sense, but rather reflect their internal confidence in polarity classification. As a result, these scores tend to have limited variation and diverge significantly from the more nuanced distribution of emotion found in natural speech.
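The limited variation can be illustrated with a short sketch of the "label sign times confidence" scoring used in the earlier cells. The dictionaries below are made-up values shaped like pipeline results, and `signed_confidence` is a hypothetical helper. The sketch assumes the reported score is the softmax probability of the top class among three labels, in which case it can never fall below 1/3: every non-neutral score therefore has a magnitude of at least 1/3, and every neutral prediction collapses to exactly 0.

```python
from typing import Dict, List, Union

LABEL_TO_SIGN = {"positive": 1, "neutral": 0, "negative": -1}


def signed_confidence(result: Dict[str, Union[str, float]]) -> float:
    """Map a {'label': ..., 'score': ...} result to a signed confidence value."""
    label = str(result["label"]).lower()
    return LABEL_TO_SIGN[label] * float(result["score"])


# Made-up results for illustration only (not real model output).
mock_results: List[Dict[str, Union[str, float]]] = [
    {"label": "positive", "score": 0.98},
    {"label": "negative", "score": 0.99},
    {"label": "neutral", "score": 0.75},
    {"label": "positive", "score": 0.65},
]

scores = [signed_confidence(result) for result in mock_results]
print(scores)

# Under the stated assumption, no score lands strictly between 0 and 1/3 in magnitude.
print(all(s == 0 or abs(s) >= 1 / 3 for s in scores))
```

A mildly positive and a strongly positive sentence can thus receive nearly the same value, since the number expresses how sure the classifier is, not how intense the sentiment is.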



In [21]:
def transform_sentiment(input_scores):
    """sent > 0 -> 1; sent < 0 -> -1; sent = 0 -> 0"""
    transformed_sentiment = [1 if score > 0 else (-1 if score < 0 else 0) for score in input_scores]
    return transformed_sentiment

In [22]:
expanded_df = annotated_dataset[["Sentiment"]].copy()  # copy avoids pandas' SettingWithCopyWarning
expanded_df["Oseti"] = oseti_annotated
expanded_df["VADER"] = transform_sentiment(vader_annotated)
expanded_df["bert-finetuned-japanese-sentiment"] = bert_annotated
expanded_df["japanese-sentiment-analysis"] = jarv_annotated
expanded_df["finance-sentiment-ja-base"] = bardsai_annotated
expanded_df.head()

Unnamed: 0,Sentiment,Oseti,VADER,bert-finetuned-japanese-sentiment,japanese-sentiment-analysis,finance-sentiment-ja-base
0,-1,-1.0,-1,-1,1,0
1,0,1.0,1,1,1,0
2,-1,1.0,0,0,1,0
3,-1,-1.0,1,1,1,0
4,0,1.0,0,-1,1,0


In [23]:
expanded_df.to_csv("100 samples annotated dataset with other model.csv", index=False)

In [26]:
from sklearn.metrics import precision_score, recall_score, f1_score



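def get_metrics(input_true_values, input_predicted_values):
    """Calculates precision, recall, and F1 score"""
    # Cast to int so float values such as -1.0 are treated as integer class labels
    # (note: int() truncates fractional values toward zero)
    true_values = [int(value) for value in input_true_values]
    predicted_values = [int(value) for value in input_predicted_values]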
    precision = precision_score(true_values, predicted_values, average='weighted')  # Using 'weighted' for multi-class
    recall = recall_score(true_values, predicted_values, average='weighted')
    f1 = f1_score(true_values, predicted_values, average='weighted')
    return [precision, recall, f1]


oseti_annotated_metrics = get_metrics(true_annotated_scores, expanded_df['Oseti'])
vader_annotated_metrics = get_metrics(true_annotated_scores, expanded_df['VADER'])
bert_annotated_metrics = get_metrics(true_annotated_scores, expanded_df['bert-finetuned-japanese-sentiment'])
jarv_annotated_metrics = get_metrics(true_annotated_scores, expanded_df['japanese-sentiment-analysis'])
bardsai_annotated_metrics = get_metrics(true_annotated_scores, expanded_df['finance-sentiment-ja-base'])

annotated_metrics_df = pd.DataFrame({"Metric": ["Precision", "Recall", "F1"], "Oseti": oseti_annotated_metrics,
        "VADER": vader_annotated_metrics,
        "bert-finetuned-japanese-sentiment": bert_annotated_metrics, "japanese-sentiment-analysis": jarv_annotated_metrics,
        "finance-sentiment-ja-base": bardsai_annotated_metrics})

annotated_metrics_df


Unnamed: 0,Metric,Oseti,VADER,bert-finetuned-japanese-sentiment,japanese-sentiment-analysis,finance-sentiment-ja-base
0,Precision,0.577143,0.620889,0.575513,0.15033,0.758763
1,Recall,0.56,0.57,0.39,0.22,0.55
2,F1,0.550859,0.574698,0.428713,0.159883,0.419309
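One detail of the cells above is worth keeping in mind when reading the Oseti column: it holds mean sentence scores rather than labels, and `get_metrics` converts every value with `int()`. The sketch below, on made-up scores, contrasts that conversion with the sign mapping used in `transform_sentiment`; the helper `sign_label` is a hypothetical restatement of that mapping. The two agree for -1.0, 0.0 and 1.0 and differ only when a mean is fractional.

```python
def sign_label(score: float) -> int:
    """Same mapping as transform_sentiment: >0 -> 1, <0 -> -1, 0 -> 0."""
    return 1 if score > 0 else (-1 if score < 0 else 0)


# Made-up mean scores for illustration only.
mean_scores = [1.0, 0.5, -0.5, -1.0, 0.0]

truncated = [int(score) for score in mean_scores]     # int() truncates toward zero
signed = [sign_label(score) for score in mean_scores]

print(truncated)  # [1, 0, 0, -1, 0]
print(signed)     # [1, 1, -1, -1, 0]
```

If fractional means occur in the data, applying the sign mapping before computing the metrics would keep weakly positive or negative sentences from being counted as neutral.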


The tests with the annotated dataset demonstrate that Oseti outperforms the tested Transformer models, which were trained on non-literary texts and limited datasets. In terms of precision, Oseti performs on par with the most suitable model tested, bert-finetuned-japanese-sentiment (0.577 vs. 0.576), but it clearly outperforms it in recall (0.56 vs. 0.39) and, consequently, in the F1 score (0.551 vs. 0.429).

Meanwhile, when applied to the translated texts, the rule-based VADER model achieves somewhat higher precision (0.621 vs. 0.577 for Oseti) and the highest F1 score overall (0.575). In contrast, finance-sentiment-ja-base, which tends to "neutralize" sentiment, once again behaves unreliably: its high weighted precision (0.759) is accompanied by a low F1 score (0.419).

As a result, the minimalistic Oseti model performs competitively with more resource-intensive and complex Transformer-based models.

In [27]:
annotated_metrics_df.to_csv("metrics for annotated dataset and other models.csv", index=False)

### References
Bizzoni, Yuri, and Pascale Feldkamp. “Sentiment Analysis for Literary Texts: Hemingway as a Case-Study.” Journal of Data Mining & Digital Humanities, vol. NLP4DH, Apr. 2024, https://doi.org/10.46298/jdmdh.13155.

Ramsay, Stephen. Reading Machines: Toward an Algorithmic Criticism. University of Illinois Press, 2011.
