social-link-analytics-group-bsc/tw_coronavirus

Data pipeline to collect, process, analyze tweets about COVID-19

The repository contains the scripts and software modules used to collect, process, and analyze tweets about COVID-19. A MongoDB database is employed to store the downloaded and processed data. The script tweets_processor.sh is responsible for the bulk of the processing tasks: for each tweet, it computes the type, extracts the complete text, identifies the location, calculates the sentiment, infers the language, and computes engagement metrics (retweets, favorites). The script also updates the collection of users.
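
As a rough illustration, the per-tweet processing driven by tweets_processor.sh could look like the following sketch. The helper logic and the database/collection names here are hypothetical; the actual implementation lives in src/ and uses utils/db_manager.py.

```python
from pymongo import MongoClient

def enrich(tweet):
    # Compute the fields added during processing (illustrative subset)
    return {
        # Type of the tweet
        'type': 'retweet' if 'retweeted_status' in tweet else 'original',
        # Complete (untruncated) text, preferring the extended form
        'complete_text': tweet.get('extended_tweet', {}).get(
            'full_text', tweet.get('full_text', tweet.get('text', ''))),
        # Engagement metrics
        'metrics': {'retweets': tweet.get('retweet_count', 0),
                    'favorites': tweet.get('favorite_count', 0)},
    }

client = MongoClient('localhost', 27017)
collection = client['covid_tweets']['tweets']  # hypothetical names
for tweet in collection.find():
    collection.update_one({'_id': tweet['_id']}, {'$set': enrich(tweet)})
```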

The directory data contains datasets that support the processing of tweets. For example, banned_accounts.csv lists the Twitter accounts we decided to exclude from our collection of tweets. The datasets demonyms_spain.csv and places_spain*.* are used to identify the location of tweets. The file keywords_covid.txt holds the list of keywords we used to find tweets. The files query_version_history.txt and query_versions.csv record the different queries we employed in our searches on Twitter; all versions of the queries are reported together with their usage start and end dates.

The directory legacy includes scripts and code used in the past, kept in case they prove useful in the future.

The infrastructure's core is in src, which is organized into three main modules, data_loader.py, data_exporter.py, and data_wrangler.py, to load, export, and process data, respectively. Functions implemented in these modules are expected to be called from run.py, the main script to run processing and analysis. In addition, the directory utils contains utility classes that are key to executing the loading, exporting, and processing tasks. The complete structure of the directory is presented next.

Source code structure

...
├── src                                 <- Source code of the infrastructure
│   ├── __init__.py                     <- Makes src a Python module
│   ├── run.py                          <- Main script to run processing and analysis
│   ├── test.py                         <- Script with test cases to evaluate the code
│   ├── report_generator.py             <- Analysis functions used in reporting
│   ├── network_analysis.py             <- Class used to conduct network analysis
│   ├── config.json.example             <- Example of a configuration file
│   ├── utils
│   │   ├── __init__.py                 <- Makes utils a Python module
│   │   ├── db_manager.py               <- Class to operate the MongoDB database
│   │   ├── demographic_detector.py     <- Class to infer demographic features of users
│   │   ├── embedding_trainer.py        <- Class to train word embeddings from a corpus of tweets
│   │   ├── figure_maker.py             <- Class to plot figures
│   │   ├── language_detector.py        <- Class to detect the language of tweets
│   │   ├── location_detector.py        <- Class to detect the location of tweets or users
│   │   ├── sentiment_analyzer.py       <- Class to compute the polarity score of tweets
│   │   ├── utils.py                    <- General utility functions
│   │   ├── lib
│   │   │   ├── dependency.txt          <- Instructions to download the fasttext dependency
│   │   │   └── lid.176.bin             <- Binary dependency of fasttext
│   │   └── dashboard
│   │       ├── app.py                  <- Proof-of-concept of an interactive dashboard
│   │       └── assets                  <- Directory with the assets required by the dashboard
...

Demographic detector (M3Inference)

The demographic characteristics of users are inferred using the library M3Inference, a deep-learning system for demographic inference. Details on how M3Inference works under the hood can be found in the article Demographic Inference and Representative Population Estimates from Multilingual Social Media Data.

M3Inference helps us infer users' gender and age and identify which accounts are controlled by organizations and which by "people." A subsequent manual inspection of a representative sample of the M3Inference results showed low accuracy for the age inferences; hence, only gender and type of account (organization/non-organization) are considered in the analyses.
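
For reference, a minimal sketch of querying M3Inference through its M3Twitter wrapper, following the library's published interface; the exact invocation used in this pipeline lives in src/utils/demographic_detector.py and may differ.

```python
from m3inference import M3Twitter

m3twitter = M3Twitter(cache_dir='m3_cache')
result = m3twitter.infer_screen_name('BarackObama')

# The output holds probabilities for gender, age bins, and whether the
# account belongs to an organization; per the note above, only gender
# and org/non-org are used in our analyses.
print(result['output']['gender'])
print(result['output']['org'])
```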

Sentiment analyzer

The pipeline supports sentiment analysis of tweets in Spanish, Catalan, Basque, Galician, and English. For Catalan, Basque, and Galician, Polyglot, a multilingual Python toolkit for natural language processing, is used. Polyglot returns scores between 0 (most positive) and -1 (most negative), which are then normalized using the hyperbolic tangent function (tanh) so that the final scores fall in the range [-1, 1].

For English, VADER, a rule-based sentiment analysis tool, is applied together with Polyglot. Tanh is also used here to normalize the scores of both tools, which are then averaged.

In the case of Spanish tweets, a combination of three tools is employed: apart from Polyglot, a customized version of AFINN and the machine-learning-based solution Senti-Py are used. As in the English case, the resulting scores are normalized using tanh and then averaged.
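
The normalize-then-average step common to the English and Spanish cases can be sketched as follows; the tool invocations are omitted, and the raw scores below are placeholders.

```python
import math

def normalize(score):
    # Map a raw polarity score onto (-1, 1) with the hyperbolic tangent
    return math.tanh(score)

def combined_sentiment(raw_scores):
    # Average the normalized scores produced by the individual tools
    normalized = [normalize(s) for s in raw_scores]
    return sum(normalized) / len(normalized)

# e.g., hypothetical raw scores from Polyglot, AFINN, and Senti-Py
print(combined_sentiment([-0.4, -2.0, 0.1]))
```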

Language detector

Although the Twitter API provides information about the language of tweets, we observed inaccuracies in this information: tweets in Catalan were identified as French, tweets in Galician were reported as Portuguese, and so on. We therefore decided to run language-detection tools on all tweets as part of the processing task.

Three tools are used for this purpose, namely Polyglot, FastText, and LangId. A majority vote decides among the results of the three tools; that is, the language of a tweet is the language detected by the majority of the tools. If there is no clear candidate (i.e., all tools detect different languages), the language is reported as undefined.
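
A minimal sketch of the majority vote over the three detectors (the detector calls themselves are omitted):

```python
from collections import Counter

def majority_language(predictions):
    # predictions: one language code per tool (Polyglot, FastText, LangId)
    (language, votes), = Counter(predictions).most_common(1)
    # With three tools, a majority requires at least two agreeing votes
    return language if votes >= 2 else 'undefined'

print(majority_language(['ca', 'ca', 'fr']))  # -> 'ca'
print(majority_language(['ca', 'gl', 'es']))  # -> 'undefined'
```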

Location detector

A multi-criteria approach has been implemented to detect the location of tweets. First, a data model was developed to store information about cities, provinces, and autonomous communities in Spain. In the model, each city, province, autonomous community, and country of interest has the following properties.

| Property | Type | Description | Valid values | Example |
|----------|------|-------------|--------------|---------|
| name | String | Name of the city/province/autonomous community/country | List of strings | España |
| alternative_names | List | Alternative names given to the city/province/autonomous community/country | List of strings | ['Spain', 'Spagna', 'Espanya'] |
| type | String | Type of place | City, Province, Autonomous Community, Country | Country |
| flag_emoji_code | List | Emoji code of the place's flag | List of emoji codes in GitHub version | [':Spain:'] |
| languages | List | Languages spoken in the place | List of strings | ['es', 'ca', 'gl', 'eu'] |
| homonymous | Integer | Whether there is a homonymous place elsewhere | 1 (there is a homonym) / 0 | 0 |
| demonyms | Complex | Demonyms associated with the place | | |
| demonyms.names | List | Names of the demonyms | List of strings | ['Español', 'Española'] |
| demonyms.banned_prefix | List | Demonyms preceded by these terms are ignored | List of strings | ['en', 'hablo', 'es', 'lo'] |
| demonyms.banned_places | List | Demonyms are ignored if the places listed here appear in the location | List of strings | ['San Juan', 'Nuevo León'] |

An additional property is used to create hierarchical relationships between places. Places of type country contain a list of their regions, which in turn include a list of their provinces; provinces contain a list of their cities, and so on. Following the example of the table, the place España has the property regions, which contains Spain's regions. See data/places_esp.json for an example of how the cities, provinces, and regions of Spain are defined using the data model.
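
As an illustration, a place entry following this data model could look like the dictionary below. The values are taken from the examples in the table; see data/places_esp.json for the actual definitions used by the pipeline.

```python
espana = {
    'name': 'España',
    'alternative_names': ['Spain', 'Spagna', 'Espanya'],
    'type': 'country',
    'flag_emoji_code': [':Spain:'],
    'languages': ['es', 'ca', 'gl', 'eu'],
    'homonymous': 0,
    'demonyms': {
        'names': ['Español', 'Española'],
        'banned_prefix': ['en', 'hablo', 'es', 'lo'],
        'banned_places': ['San Juan', 'Nuevo León'],
    },
    # Hierarchical relationships: regions contain provinces, which in
    # turn contain cities
    'regions': [
        {'name': 'Andalucía', 'type': 'autonomous community',
         'provinces': []},
    ],
}
```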

Criteria

  1. Matching place name: the location self-declared by the user is inspected term by term, trying to find the name of a place included in the data model. If the located place has a homonym elsewhere, the name of the corresponding country, region, or province must also appear in the self-declared location. To favor inclusion over exclusion, an exception to this rule is made for locations that consist solely of the name of the place. For example, the city Cordoba exists in both Spain and Argentina, so location=Cordoba and location=Cordoba, Andalucia are both valid matches, while location=Soy de Cordoba is not.

  2. Matching demonyms in description: the user's description is analyzed term by term, trying to find a match with the demonyms defined in the data model. Demonyms preceded by the terms included in the list banned_prefix are ignored. Likewise, if a demonym matches but the user declares to be located in one of the places listed in banned_places, the match is discarded (see the sketch after this list).

  3. Language of description: tweets (and their corresponding users) are assigned to places according to the language of the user's description. For example, tweets authored by users whose descriptions are written in Basque are assigned to the Basque Country.

  4. Matching emoji flags: the self-declared location is inspected, attempting to find the flag emojis defined in the data model.
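
A sketch of the demonym matching of criterion 2, assuming place entries like the dictionary shown above; the function name is hypothetical, and the actual implementation in src/utils/location_detector.py may differ.

```python
def match_demonym(description, location, place):
    tokens = description.lower().split()
    demonyms = [d.lower() for d in place['demonyms']['names']]
    banned_prefix = [p.lower() for p in place['demonyms']['banned_prefix']]
    banned_places = [b.lower() for b in place['demonyms']['banned_places']]
    for i, token in enumerate(tokens):
        if token in demonyms:
            # Ignore demonyms preceded by a banned term (e.g., 'hablo español')
            if i > 0 and tokens[i - 1] in banned_prefix:
                continue
            # Discard the match if the user declares a banned place
            if any(b in location.lower() for b in banned_places):
                return False
            return True
    return False
```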

Algorithm

Criterion 1) is executed first; if no place is found, criterion 2) is applied. If criteria 1) and 2) do not match, criteria 3) and 4) are executed, in this order. If none of the criteria produce a match, the tweet (and its author) is assigned to an unknown place.
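
The cascade can be summarized as follows, with each criterion represented by a callable that returns a place or None:

```python
def detect_location(user, criteria):
    # criteria: callables for 1) place name, 2) demonyms in description,
    # 3) language of description, and 4) flag emojis, in that order
    for criterion in criteria:
        place = criterion(user)
        if place is not None:
            return place
    return 'unknown'
```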

Evaluation

The approach has been tested on the test set data/location_detector_testset.csv and showed an overall accuracy of 0.968, a recall of 0.983, a precision of 0.981, and an F1 of 0.982. The table below shows the evaluation of each criterion.

| Criterion | Accuracy | Recall | Precision | F1 |
|-----------|----------|--------|-----------|----|
| Matching place name | 0.982 | 0.981 | 0.995 | 0.993 |
| Matching demonyms in description | 0.920 | 0.994 | 0.922 | 0.957 |
| Language of description | 0.834 | 0.861 | 0.975 | 0.911 |
| Matching emoji flags | 0.991 | 0.995 | 0.995 | 0.995 |
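
Figures like these can be computed from the test set with standard metrics; below is a sketch assuming hypothetical column names (the actual columns are defined in data/location_detector_testset.csv).

```python
import pandas as pd
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

df = pd.read_csv('data/location_detector_testset.csv')
# 'expected_location' and 'detected_location' are illustrative names
y_true, y_pred = df['expected_location'], df['detected_location']
print('accuracy :', accuracy_score(y_true, y_pred))
print('precision:', precision_score(y_true, y_pred, average='weighted'))
print('recall   :', recall_score(y_true, y_pred, average='weighted'))
print('f1       :', f1_score(y_true, y_pred, average='weighted'))
```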

Dashboard

A proof-of-concept dashboard can be launched by running python app.py from src/dashboard.


Installation

  1. Install the requirements: pip install -r requirements.txt
  2. Rename src/config.json.example to src/config.json
  3. Set the MongoDB connection information in src/config.json
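
Once configured, the connection settings can be read from src/config.json; below is a minimal sketch assuming hypothetical key names (the expected fields are listed in src/config.json.example).

```python
import json
from pymongo import MongoClient

with open('config.json') as f:
    config = json.load(f)

# Key names below are illustrative, not the actual schema
client = MongoClient(config['host'], int(config['port']))
db = client[config['database']]
print(db.list_collection_names())
```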

Command Line Interface (CLI)

The list of supported commands is in src/run.py, and all commands must be run from the src directory. Some illustrative examples are presented below.

Example 1: Identify location

python run.py add-location-flags [mongo_collection_name] --config_file [mongo_config_file_name]

Example 2: Detect language

python run.py add-language-flag [mongo_collection_name] --config_file [mongo_config_file_name]

Example 3: Analyze sentiment

python run.py sentiment-analysis [mongo_collection_name] --config_file [mongo_config_file_name]

Technology

  1. Python 3.6
  2. MongoDB Community Edition, used as the data storage repository

Contributors

Jorge Saldivar, Nataly Buslón, and María José Rementería