
Umibench: testbench for 📏 factuality and 🎭 sentiment classifiers

Function

Umibench compares the performance of various models on two tasks: factuality detection (objective vs. subjective) and sentiment detection (positive, negative, or neutral).

It generates a leaderboard which is visible at the bottom of this page.

How to add a model or a dataset to this leaderboard? Just provide a link!

I will take care of everything. Please send:

  • for a labelled dataset: a link to a place where I can download it.
  • for a model: a URL to the API of the model. If an API key is necessary, please provide it; otherwise I will do my best to run it at my own expense.

Which models are tested so far?

  1. Thesis_Titan (paper, api)
  2. TimeLMs (paper, api)
  3. gpt-3.5-turbo-basic-prompt (paper, api)
  4. OpenHermes-2-Mistral-7B-advanced-prompt (paper, api)
  5. gpt-3.5-turbo-advanced-prompt (paper, api)
  6. OpenHermes-2-Mistral-7B-basic-prompt (paper, api)
  7. umigon (paper, api)

Supplementary information on these models

Against which annotated datasets are these models tested?

  1. mpqa (paper, data source)
    a set of newswire articles on international politics annotated for factuality and sentiment
  2. kaggle-headlines (paper, data source)
    a set of headlines from US newspapers annotated for factuality
  3. xfact (paper, data source)
    a database of factual statements labeled for veracity by expert fact-checkers
  4. subjqa (paper, data source)
    a set of consumer reviews on goods from the “electronics” product category
  5. apple (paper, data source)
    a set of tweets mentioning Apple and annotated for neutral, positive and negative sentiment
  6. alexa (paper, data source)
    a set of factual statements curated by the Amazon Alexa development team
  7. carblacac (paper, data source)
    a set of 200 tweets from anonymous individuals on their daily lives annotated for negative and positive sentiment
  8. clef2023 (paper, data source)
    a set of phrases extracted from news articles annotated for factuality and subjectivity

Find supplementary information on each dataset here

Leaderboard

Factuality: differentiating objective from subjective statements

Umigon and TimeLMs are models for sentiment analysis. We test them on factuality by treating a prediction of "neutral sentiment" as equivalent to an objective statement, and a prediction of positive or negative sentiment as equivalent to predicting a subjective statement.
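
As an illustration only (the enum and method names below are hypothetical, not taken from the Umibench code base), this mapping can be sketched in Java as follows:

```java
// Illustrative sketch: mapping a sentiment prediction onto a factuality label.
// Names are hypothetical, not the actual Umibench classes.
enum Sentiment { POSITIVE, NEGATIVE, NEUTRAL }

enum Factuality { OBJECTIVE, SUBJECTIVE }

final class SentimentToFactuality {

    // A "neutral" prediction counts as an objective statement;
    // a positive or negative prediction counts as a subjective statement.
    static Factuality map(Sentiment sentiment) {
        return sentiment == Sentiment.NEUTRAL ? Factuality.OBJECTIVE : Factuality.SUBJECTIVE;
    }
}
```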

Weighted F1 values:

Model                                     alexa  apple  carblacac  clef2023  kaggle-headlines  mpqa   subjqa  xfact
OpenHermes-2-Mistral-7B-advanced-prompt   0.986  0.609  0.961      0.641     0.738             0.784  0.974   0.792
OpenHermes-2-Mistral-7B-basic-prompt      0.969  0.707  0.916      0.563     0.544             0.678  0.883   0.388
Thesis_Titan                              0.964  0.612  0.860      0.821     0.857             0.877  0.789   0.960
TimeLMs                                   0.872  0.614  0.956      0.610     0.719             0.706  0.948   0.671
umigon                                    0.960  0.656  0.788      0.606     0.945             0.783  0.957   0.978

Overall scores and leaderboard for models on the factuality task

The overall score for each model is the average of its per-dataset weighted F1 scores, with each dataset weighted by its number of entries.

Rank  Model                                    Overall score
1     umigon                                   0.907
2     Thesis_Titan                             0.906
3     OpenHermes-2-Mistral-7B-advanced-prompt  0.790
4     TimeLMs                                  0.699
5     OpenHermes-2-Mistral-7B-basic-prompt     0.517
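
As a minimal sketch of how such a dataset-size-weighted average can be computed (the class and method names are illustrative, not the actual Umibench code):

```java
import java.util.Map;

// Illustrative sketch of the overall score: per-dataset weighted F1 scores
// averaged across datasets, each dataset weighted by its number of entries.
final class OverallScore {

    static double compute(Map<String, Double> f1ByDataset, Map<String, Integer> entriesByDataset) {
        double weightedSum = 0d;
        double totalEntries = 0d;
        for (Map.Entry<String, Double> e : f1ByDataset.entrySet()) {
            int n = entriesByDataset.get(e.getKey());
            weightedSum += e.getValue() * n;
            totalEntries += n;
        }
        return weightedSum / totalEntries;
    }
}
```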

Sentiment: differentiating between positive sentiment, negative sentiment and neutral sentiment

Here we do not test the model "Thesis_Titan", as it is a model for the task of factuality categorization, which cannot be extended or adapted to sentiment analysis.

Also, the models are tested against just one dataset: MPQA. The reason is that, to my knowledge, this is the only annotated dataset in existence which makes a rigorous distinction between different sentiment valences while also annotating texts for their subjective or objective character.

Weighted F1 values:

Model                                     apple  carblacac  mpqa
OpenHermes-2-Mistral-7B-advanced-prompt   0.598  0.867      0.847
OpenHermes-2-Mistral-7B-basic-prompt      0.690  0.883      0.708
TimeLMs                                   0.623  0.862      0.762
gpt-3.5-turbo-advanced-prompt             0.724  0.820      0.867
gpt-3.5-turbo-basic-prompt                0.663  0.832      0.827
umigon                                    0.614  0.700      0.862

Overall scores and leaderboard for models on the sentiment task

The overall score for each model is the average of its per-dataset weighted F1 scores, with each dataset weighted by its number of entries.

Rank  Model                                    Overall score
1     gpt-3.5-turbo-advanced-prompt            0.832
2     umigon                                   0.798
3     OpenHermes-2-Mistral-7B-advanced-prompt  0.791
4     gpt-3.5-turbo-basic-prompt               0.790
5     TimeLMs                                  0.735
6     OpenHermes-2-Mistral-7B-basic-prompt     0.712

How to run it

Umibench is programmed in Java.

  • You need Java 17 or later.
  • Clone this repo and open it in your favorite IDE.
  • In the private directory, rename example-properties.txt to properties.txt and set the API keys in it.
  • Navigate to the package src/main/java/net/clementlevallois/umigon/eval/models and open the classes where the API calls to Hugging Face are made. Replace the endpoints with those of the models you want to test. Public endpoints do not have enough capacity, so you need to spin up your own endpoints (see the sketch after this list).
  • Run the main class of the project (Controller.java).
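
For orientation, here is a minimal sketch of what a call to a dedicated Hugging Face inference endpoint can look like from Java. The endpoint URL, API key and class name are placeholders; the actual Umibench model classes may build their requests differently.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Illustrative sketch of a call to a dedicated Hugging Face inference endpoint.
// Endpoint URL and API key are placeholders, to be read from private/properties.txt.
public class EndpointCallSketch {

    public static void main(String[] args) throws Exception {
        String endpoint = "https://YOUR-ENDPOINT.endpoints.huggingface.cloud"; // replace with your own endpoint
        String apiKey = "hf_xxx";                                              // replace with your own API key

        String payload = "{\"inputs\": \"State terrorism is a political concept\"}";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(endpoint))
                .header("Authorization", "Bearer " + apiKey)
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(payload))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());

        // For a text-classification model, the body is typically a JSON array of label/score pairs.
        System.out.println(response.body());
    }
}
```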

Why are factuality and sentiment compared in the same test bench?

Most if not all models for sentiment analysis classify the two following statements as SUBJECTIVE / NEGATIVE, which I call into question:

  • "State terrorism is a political concept": should be OBJECTIVE, NOT SUBJECTIVE / NEGATIVE
  • "State terrorism is not the best course of action": should be SUBJECTIVE / NEGATIVE

This distinction, which can appear subtle, is actually very important to maintain when analyzing discourse in media, politics and culture in general. Otherwise, the examination of opinions and debates on topics laden with a positive or negative factual valence will be tainted by the lack of this distinction.

I developed Umigon, a model for sentiment analysis which strives to maintain this distinction. Annotated datasets that label whether text entries are objective from the point of view of the speaker are rare. Actually, only the MPQA dataset (listed above) maintains this distinction.

To extend the range of tests, I have included datasets which are carefully annotated for factuality: objective or subjective? The expectation is that a good sentiment analysis model (maintaining the distinction made above) should perform well on this factuality test as well; otherwise, its definition of "sentiment" confuses actual sentiment with positively or negatively laden factual statements.

Contact

Clement Levallois, levallois@em-lyon.com

This readme file and the leaderboard it includes were generated on 2024-02-27T16:17:58.352201800