Multilingual News Article Similarity

Introduction

This repository contains the code for Team Innovator's SemEval 2022 Task 8 paper, "Multi-Task Training with Hyperpartisan and Semantic Relation for Multi-Lingual News Article Similarity". The shared task focuses on measuring the similarity of multilingual news articles irrespective of the style of writing, political spin, tone, or any other more subjective "design decision" imposed by a medium or outlet. We propose a pipeline that uses TextRank to filter out irrelevant information, followed by a multi-task approach in which multiple subtasks share the same encoder during training, thereby facilitating knowledge transfer.
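To make the filtering step concrete, here is a minimal sketch of TextRank-style sentence ranking in plain Python. It is an illustration under stated assumptions, not the repository's implementation: the function names, the naive sentence splitter, the word-overlap similarity, and the `keep`, `damping`, and `iterations` parameters are all hypothetical choices made for this example.

```python
import math
import re


def split_sentences(text):
    """Naive splitter on ., ! and ? (an assumption; a real pipeline would use a tokenizer)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def overlap_similarity(a, b):
    """Shared-word count normalized by the log of each sentence's vocabulary size."""
    words_a = set(re.findall(r"\w+", a.lower()))
    words_b = set(re.findall(r"\w+", b.lower()))
    if len(words_a) < 2 or len(words_b) < 2:
        return 0.0  # avoids log(1) = 0 in the denominator
    return len(words_a & words_b) / (math.log(len(words_a)) + math.log(len(words_b)))


def textrank_filter(text, keep=3, damping=0.85, iterations=50):
    """Return the `keep` highest-ranked sentences, in their original order."""
    sentences = split_sentences(text)
    n = len(sentences)
    if n <= keep:
        return sentences
    # Weighted, undirected sentence graph stored as a dense matrix.
    weights = [
        [overlap_similarity(sentences[i], sentences[j]) if i != j else 0.0
         for j in range(n)]
        for i in range(n)
    ]
    out_sums = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    # PageRank-style power iteration over the sentence graph.
    for _ in range(iterations):
        scores = [
            (1 - damping) / n + damping * sum(
                weights[j][i] / out_sums[j] * scores[j]
                for j in range(n) if out_sums[j] > 0
            )
            for i in range(n)
        ]
    top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:keep]
    return [sentences[i] for i in sorted(top)]
```

Sentences that share no vocabulary with the rest of the article receive only the baseline score and are dropped first, which is the sense in which the ranking filters irrelevant text before the article reaches the encoder.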

Data

The model is trained on multiple subtasks, as outlined below. Results are evaluated on the SemEval dataset found here.

SemEval Dataset

The SemEval dataset is a CSV file in which each row corresponds to a pair of articles. For each article, the url_lang, link, and id are given, along with similarity scores for Geography, Entities, Time, Narrative, Style, Tone, and Overall. The final evaluation uses the Overall similarity. The content of each news article is extracted using the script given here.
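As an illustration of this layout, the sketch below parses a small in-memory CSV with Python's standard `csv` module. The sample rows are made up, and the exact column names (`pair_id`, `url1_lang`, `url2_lang`, `link1`, `link2`) are assumptions based on the fields described above; check them against the header of the actual file.

```python
import csv
import io

# Hypothetical two-row sample. The header is an assumption modeled on the
# fields described in this README, not copied from the SemEval file.
SAMPLE = """pair_id,url1_lang,url2_lang,link1,link2,Geography,Entities,Time,Narrative,Style,Tone,Overall
1_2,en,en,https://example.com/a,https://example.com/b,1.0,2.0,1.0,1.5,2.0,2.0,1.5
3_4,en,de,https://example.com/c,https://example.com/d,4.0,4.0,3.0,4.0,2.0,2.0,4.0
"""


def load_pairs(handle):
    """Return one dict per article pair with its language pair and Overall score."""
    pairs = []
    for row in csv.DictReader(handle):
        pairs.append({
            "pair_id": row["pair_id"],
            "langs": (row["url1_lang"], row["url2_lang"]),
            "overall": float(row["Overall"]),  # the score used for final evaluation
        })
    return pairs


pairs = load_pairs(io.StringIO(SAMPLE))
# Pairs whose two articles are written in different languages.
cross_lingual = [p["pair_id"] for p in pairs if p["langs"][0] != p["langs"][1]]
```

To read the real file, pass an open file handle to `load_pairs` instead of the `io.StringIO` wrapper.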

Subtask Dataset

Subtask	Description	Dataset
Semantic Textual Similarity	Determine how semantically similar two pieces of text are.	STS benchmark
Hyperpartisan detection	Given a news article, decide whether it follows a hyperpartisan argumentation, i.e., whether it exhibits blind, prejudiced, or unreasoning allegiance to one party, faction, cause, or person.	Hyperpartisan News Detection
Stance detection	Estimate the relative perspective (or stance) of two pieces of text with respect to a topic, claim, or issue.	Fake News Challenge - 1
Fake news inference detection	Detect fake news using natural language inference. This entails classifying a piece of text into categories such as "pants-on-fire", "false", "barely true", "half-true", "mostly true", and "true".	Fake News Inference Dataset
Paraphrase detection	Determine whether a particular sentence is a paraphrase of the original text.	Microsoft Research Paraphrase Corpus

Preprocessed versions of the above datasets are available in the dataset folder, and some are loaded directly from the Hugging Face GLUE dataset, so there is no need to download them separately.

Models

The models can be run locally by cloning this repository, or in Google Colab using the following links. The Pearson scores reported are for the validation dataset during training.

Model	Pearson Score	Link
Main Model: Multi-task Training	0.835
Experiment 1: Multi-Objective Weighted Loss Training	0.811
Experiment 2: Multi-task Training with Multilingual Text Rank	0.737
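For readers unfamiliar with the metric in the table above, the Pearson score is the Pearson correlation coefficient between predicted and gold similarity scores. The following is a minimal reference sketch in plain Python; the function name `pearson` is hypothetical, and the notebooks may well compute the metric with a library routine instead.

```python
import math


def pearson(predicted, gold):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(predicted) != len(gold) or len(predicted) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_p = sum(predicted) / len(predicted)
    mean_g = sum(gold) / len(gold)
    # Covariance and variances, without the 1/n factors (they cancel out).
    cov = sum((p - mean_p) * (g - mean_g) for p, g in zip(predicted, gold))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_g = sum((g - mean_g) ** 2 for g in gold)
    if var_p == 0 or var_g == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_p * var_g)
```

The coefficient ranges from -1 to 1 and is unchanged by shifting or positively rescaling the predictions, so it rewards getting the relative ordering and spacing of similarity scores right rather than their absolute values.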

Setup

Clone this repository and upload it to Google Drive.
Open the relevant notebook in the training module folder and enable GPU access.
Connect the notebook to your Google Drive. You can see the tutorial here.
Run the initial cells to install the dependencies, then run the rest of the cells.

Contributors

Nidhir Bhavsar* (Navrachana University, Gujarat, India): nidbhavsar989@gmail.com
Rishikesh Devanathan* (Indian Institute of Technology Patna, India): rishi.devanathan@gmail.com
Aakash Bhatnagar* (Navrachana University, Gujarat, India): akashbharat.bhatnagar@gmail.com
Tirthankar Ghosal (UFAL, MFF, Charles University, Czech Republic): tghosal@acm.org
Muskaan Singh (IDIAP Research Institute, Switzerland)

* denotes equal contribution
