Skip to content

ZhangDataLab/CeleTrip

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

43 Commits
 
 
 
 
 
 
 
 

Repository files navigation

CeleTrip (Celebrity Trip Detection Framework)

This is the source code for paper 'Where Did the President Visit Last Week? Detecting Celebrity Trips from News Articles'.

Abstract

Celebrities’ whereabouts are of pervasive importance. For instance, where politicians go, how often they visit, and who they meet, come with profound geopolitical and economic implications. Although news articles contain travel information of celebrities, it is not possible to perform large-scale and network-wise analysis due to the lack of automatic itinerary detection tools. To design such tools, we have to overcome difficulties from the heterogeneity among news articles: 1) One single article can be noisy, with irrelevant people and locations, especially when the articles are long. 2) Though it may be helpful if we consider multiple articles together to determine a particular trip, the key semantics are still scattered across different articles intertwined with various noises, making it hard to aggregate them effectively. 3) Over 20% of the articles refer to the celebrities' trips indirectly, instead of using the exact celebrity names or location names, leading to large portions of trips escaping regular detecting algorithms. We model text content across articles related to each candidate location as a graph to better associate essential information and cancel out the noises. Besides, we design a special pooling layer based on attention mechanism and node similarity, reducing irrelevant information from longer articles. To make up the missing information resulted from indirect mentions, we construct knowledge sub-graphs for named entities (person, organization, facility, etc.). Specifically, we dynamically update embeddings of event entities like the G7 summit from news descriptions since the properties (date and location) of the event change each time, which is not captured by the pre-trained event representations. The proposed CeleTrip jointly trains these modules, which outperforms all baseline models and achieves 82.53% in the F1 metric. By open-sourcing the first tool and a carefully curated dataset for such a new task, we hope to facilitate relevant research in celebrity itinerary mining as well as the social and political analysis built upon the extracted trips.

Celebrity Trip Dataset

We provide a real-word celebrity trip dataset in file Celebrity Trip Dataset. We collecte the trips of 26 politicians and 24 artists from January 2016 to February 2021 from Wikipedia, and obtain the date and locations. Afterwards, we crawl 2,617,548 news URLs from 01/2016 to 02/2021 from GDELT, and get news articles using URLs from Newspaper3k. We label trip locations and non-trip locations from the news articles, using the ground truth trip locations of celebrities provided by Wikipedia.

CeleTrip Model

Prerequisites

The code has been successfully tested in the following environment. (For older dgl versions, you may need to modify the code)

  • Python 3.8.1
  • PyTorch 1.11.0
  • dgl 0.9.0
  • Sklearn 1.1.2
  • SpaCy 3.2.1
  • SpaCy en-core-web-trf 3.2.0
  • Gensim 3.8.3
  • nltk 3.6.5

Getting Started

Prepare data

You can download the dataset from the link google driver. We provide a sample of our dataset.

Celebrity Location Date Article Label
Donald Trump Langley 2017-01-21 ['Trump bring politic to the CIA .', "WASHINGTON — member of the national security community react with shock on Saturday after President Donald Trump ’s inaugural visit to CIA headquarters in which he use a speech in front of the agency 's memorial to attack the medium and his critic .", ... ] True

Training CeleTrip

Please run following commands for building graph.

python build_multilocation_graph.py

Please run following commands for training.

python main_celetrip.py

Preprocessing Tools

We open source our preprocessing tool (Time Detection and Location Extraction) in Preprocessing Tools/.

Others

We train our model on two Tesla M40. All our data shared from this work will be made FAIR[1].

[1] FORCE11. 2020. The FAIR Data principles. https://force11.org/info/the-fair-data-principles/.

Cite

Please cite our paper if you find this code useful for your research:

About

This is the source code for paper "Where Did the President Visit Last Week? Detecting Celebrity Trips from News Articles" in ICWSM 2024.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages