A collection of multilingual 3-class sentiment (positive, neutral, negative) datasets.

tyqiangz/multilingual-sentiment-datasets

Multilingual Sentiment Datasets

A collection of multilingual sentiment datasets grouped into 3 classes: positive, neutral, and negative.

Most multilingual sentiment datasets are either 2-class (positive or negative), 5-class product review ratings (e.g. the Amazon multilingual dataset), or multi-class emotion datasets. However, for an average person, positive, negative, and neutral classes often suffice and are more straightforward to perceive and annotate. A positive/negative classification is also too coarse, because much real-world text is neutral in sentiment. Furthermore, most multilingual sentiment datasets are dominated by Western languages (e.g. English, German) and don't include Asian languages such as Malay and Indonesian.

For emotion-related datasets, I group the negative emotions into the negative class and the positive emotions into the positive class. For ratings datasets, I assign 1-star reviews to the negative class, 3-star reviews to the neutral class, and 5-star reviews to the positive class.
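The grouping described above can be sketched in a few lines of Python. This is only an illustration, not the code used to build this repository: the function and dictionary names are hypothetical, the emotion grouping shown (using the IndoNLU EmoT labels) is an assumption about which emotions count as positive or negative, and 2-star and 4-star reviews are left unmapped because the text above does not say how they are handled.

```python
from typing import Optional

# Star ratings named in the text above. 2 and 4 stars are not mentioned there,
# so this sketch leaves them unmapped (an assumption, not a documented rule).
STAR_TO_SENTIMENT = {1: "negative", 3: "neutral", 5: "positive"}

# Hypothetical grouping of the IndoNLU (EmoT) emotion labels into sentiment
# classes. The actual grouping used by the repository is not specified.
EMOTION_TO_SENTIMENT = {
    "anger": "negative",
    "fear": "negative",
    "sadness": "negative",
    "happy": "positive",
    "love": "positive",
}


def star_to_sentiment(stars: int) -> Optional[str]:
    """Return the sentiment class for a star rating, or None if unmapped."""
    return STAR_TO_SENTIMENT.get(stars)


def emotion_to_sentiment(emotion: str) -> Optional[str]:
    """Return the sentiment class for an emotion label, or None if unmapped."""
    return EMOTION_TO_SENTIMENT.get(emotion.strip().lower())
```

Returning `None` for unmapped inputs makes it easy to filter out rows that have no agreed sentiment class, rather than silently forcing them into one.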

Disclaimer: All credit goes to the respective dataset owners; this repository is merely an aggregation of the datasets.

Datasets

| Dataset name | Language | Source | No. of texts | Classes |
| --- | --- | --- | --- | --- |
| IndoNLU (EmoT) | Indonesian | Twitter | train: 3521<br>val: 440<br>test: 440 | anger, fear, happy, love, sadness |
| IndoNLU (SmSA) | Indonesian | Online platforms | train: 11000<br>val: 1260<br>test: 500 | positive, negative, neutral |
| IndoNLU (CASA) | Indonesian | Automobile platforms | train: 810<br>val: 90<br>test: 180 | positive, negative, neutral (6 aspects) |
| IndoNLU (HoASA) | Indonesian | Hotel reviews | train: 2283<br>val: 285<br>test: 286 | positive, negative, neutral, positive-negative (10 aspects) |
| Multilingual Amazon Reviews | English, Japanese, German, French, Chinese, Spanish | Amazon | For each language:<br>train: 200,000<br>val: 5,000<br>test: 5,000 | 1 star, 2 star, 3 star, 4 star, 5 star |
| GoEmotions | English | Reddit | 211225 | admiration, amusement, anger, annoyance, approval, caring, confusion, curiosity, desire, disappointment, disapproval, disgust, embarrassment, excitement, fear, gratitude, grief, joy, love, nervousness, optimism, pride, realization, relief, remorse, sadness, surprise |
| Offenseval Dravidian | Tamil-English, Malayalam-English, Kannada-English | Social media | Tamil:<br>train: 35139<br>val: 4388<br><br>Malayalam:<br>train: 16010<br>val: 1999<br><br>Kannada:<br>train: 6217<br>val: 777 | Not_offensive, Offensive_Untargetede, Offensive_Targeted_Insult_Individual, Offensive_Targeted_Insult_Group, Offensive_Targeted_Insult_Other, not-{lang} |
| SemEval-2018 Task 1: Affect in Tweets | English, Arabic, Spanish | Twitter | English:<br>train: 6838<br>val: 886<br>test: 3259<br><br>Spanish:<br>train: 3561<br>val: 679<br>test: 2854<br><br>Arabic:<br>train: 2278<br>val: 585<br>test: 1518 | anger, anticipation, disgust, fear, joy, love, optimism, pessimism, sadness, surprise, trust |
| Emotion | English | Twitter | train: 16000<br>val: 2000<br>test: 2000 | anger, anticipation, disgust, fear, joy, sadness, surprise, and trust |
| IMDB | English | Movies | train: 25000<br>test: 25000 | positive, negative |
| Amazon Polarity | English | Amazon | train: 3600000<br>test: 400000 | positive, negative |
| Yelp Reviews | English | Yelp | train: 650000<br>test: 50000 | 1 star, 2 star, 3 star, 4 star, 5 star |
| Yelp Polarity | English | Yelp | train: 560,000<br>test: 38,000 | positive, negative |
