
Surrey Institute for People-centred AI

S3D: A Weakly Supervised Sarcasm Dataset


This is the repository for our paper 'Utilizing Weak Supervision to Create S3D: A Sarcasm Annotated Dataset', submitted to the EMNLP NLP+CSS 2022 workshop. It includes our SAD dataset along with versions 1 and 2 of our S3D dataset. Both Twitter datasets can be used to train sarcasm detection models.

Datasets

SAD - We provide the Tweet IDs and sarcasm labels of 2,340 manually annotated tweets, collected by observing the #sarcasm hashtag. Available on HuggingFace.

S3D-v1 - We provide the Tweet IDs of 100,000 tweets, along with labels predicted by a fine-tuned BERTweet model trained on our 'Combined dataset', a corpus of over a million tweets and Reddit comments labelled for sarcasm in previous work. Available on HuggingFace.

S3D-v2 - We provide the Tweet IDs of 100,000 tweets, along with labels predicted by an ensemble of our three best fine-tuned sarcasm detection models. Available on HuggingFace.
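Because the releases ship Tweet IDs and labels rather than tweet text (the text must be re-hydrated via the Twitter API), a typical first step is to parse the ID/label pairs. A minimal sketch, assuming a two-column `tweet_id,label` CSV layout with `1` meaning sarcastic; the column names in the released files may differ:

```python
import csv
import io

# Hypothetical sample mirroring an assumed tweet_id,label layout;
# the real files are the SAD / S3D releases on HuggingFace.
sample = """tweet_id,label
1234567890123456789,1
1234567890123456790,0
1234567890123456791,1
"""

def load_id_label_pairs(fh):
    """Parse (tweet_id, label) pairs from a CSV file handle."""
    reader = csv.DictReader(fh)
    return [(row["tweet_id"], int(row["label"])) for row in reader]

pairs = load_id_label_pairs(io.StringIO(sample))
sarcastic_ids = [tid for tid, lab in pairs if lab == 1]
print(len(pairs), len(sarcastic_ids))  # → 3 2
```

Keeping IDs as strings (rather than ints) avoids precision loss when the IDs are later round-tripped through JSON tooling.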

Experiments

We provide a notebook that shows the labelling process for our datasets. You can reproduce the experiments that created S3D-v1 and S3D-v2 via our Python notebooks, which use HuggingFace to load the relevant models and label the dataset.
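At its core, the labelling step scores each tweet with a classifier and thresholds the score into a binary label. A rough offline sketch of that loop, where `predict_proba` is a hypothetical stand-in for model inference (the notebooks run the fine-tuned HuggingFace models listed below, not this toy heuristic):

```python
def predict_proba(text: str) -> float:
    """Toy stand-in for model inference, NOT the paper's model:
    returns a pretend sarcasm probability for a tweet."""
    return 0.9 if "#sarcasm" in text.lower() else 0.2

def label_tweets(tweets, threshold=0.5):
    """Weakly label tweets: (tweet, label) pairs, 1 = sarcastic."""
    return [(t, int(predict_proba(t) >= threshold)) for t in tweets]

labelled = label_tweets(["Great, more rain #sarcasm",
                         "Lovely weather today"])
print(labelled)  # → [('Great, more rain #sarcasm', 1), ('Lovely weather today', 0)]
```

Swapping `predict_proba` for a real fine-tuned classifier recovers the shape of the notebooks' labelling loop.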

Models

| Model | Fine-tuned model | Description |
| --- | --- | --- |
| BERTweet | BERTweet-base-finetuned-SARC-combined-DS | BERTweet model fine-tuned on our combined dataset |
| BERTweet | BERTweet-base-finetuned-SARC-DS | BERTweet model fine-tuned on the SARC dataset |
| RoBERTa-large | roberta-large-finetuned-SARC-combined-DS | RoBERTa-large model fine-tuned on our combined dataset |
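S3D-v2's labels come from an ensemble of three fine-tuned models. A minimal majority-vote sketch of combining three models' binary predictions (the exact combination rule used in the paper may differ):

```python
from collections import Counter

def majority_vote(*model_labels):
    """Combine per-model binary label lists by majority vote.

    Each argument is one model's label list over the same tweets.
    With three models, at least two must agree on each final label.
    """
    combined = []
    for votes in zip(*model_labels):
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined

# Three hypothetical models' predictions over four tweets:
print(majority_vote([1, 0, 1, 0],
                    [1, 1, 1, 0],
                    [0, 0, 1, 1]))  # → [1, 0, 1, 0]
```

An odd number of binary voters guarantees no ties, which is one reason ensembles of three are a common choice.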

Maintainer(s)

Jordan Painter
Diptesh Kanojia
