Field of Research Classification (FoRC)

This is a repository for the dataset of the Field of Research Classification (FoRC) shared task, as part of the NFDI for Data Science and Artificial Intelligence project. Additionally, it is used for my master's thesis at Osnabrück University's Institute of Cognitive Science.

Overview

This repository constructs a dataset for the task of classifying scholarly papers into fields of research (FoR). The labels (i.e. FoR) are derived from the Open Research Knowledge Graph (ORKG) research fields taxonomy. The dataset is built from two sources: the ORKG and arXiv. Abstracts are added from the Crossref API, the Semantic Scholar Academic Graph (S2AG) API, and OpenAlex.

Pipeline

The dataset construction pipeline consists of:

Querying the ORKG rdfDump to get scholarly papers that contain the research field property (https://orkg.org/property/P30) and getting their metadata.
Obtaining abstracts from the Crossref API, the S2AG API, and OpenAlex.
Sampling the arXiv snapshot with a random threshold (default is 50K) while keeping the original distribution.
Merging the two datasets and preprocessing.
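
The sampling step above can be illustrated with a minimal sketch of proportional (stratified) sampling. It is not the repository's implementation: the function name, the "category" field, and the fixed seed are assumptions made for illustration, and because of rounding the sample size is only approximately the requested one.

```python
import random
from collections import Counter, defaultdict


def sample_preserving_distribution(records, key, n, seed=0):
    """Draw roughly n records so that each category keeps its original share.

    `records` is a list of dicts and `key` names the category field.
    Both are hypothetical stand-ins for the arXiv snapshot's structure.
    """
    total = len(records)
    if n >= total:
        return list(records)

    # Group records by category.
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record)

    rng = random.Random(seed)
    sample = []
    for members in groups.values():
        # Proportional allocation; rounding can shift the total slightly.
        k = min(round(n * len(members) / total), len(members))
        sample.extend(rng.sample(members, k))
    return sample


# Toy data: 60% "cs", 30% "math", 10% "physics".
papers = (
    [{"id": i, "category": "cs"} for i in range(600)]
    + [{"id": i, "category": "math"} for i in range(600, 900)]
    + [{"id": i, "category": "physics"} for i in range(900, 1000)]
)
subset = sample_preserving_distribution(papers, "category", 100)
print(Counter(paper["category"] for paper in subset))
```

With the toy data, the 100 sampled papers keep the 60/30/10 split of the full set.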

How to run

Before running the code, the following files need to be downloaded:

An arXiv snapshot: https://www.kaggle.com/datasets/Cornell-University/arxiv?resource=download. Note that the path in data_processing/process_arxiv_data.py needs to be modified if changed from the default.
The file lid.176.bin from the fastText language identification package: https://fasttext.cc/docs/en/language-identification.html. The file needs to be unzipped and the path in data_processing/data_cleaning_utils.py needs to be modified if changed from the default.
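
For orientation, here is a hedged sketch of how lid.176.bin is typically used through the fasttext Python package. How data_processing/data_cleaning_utils.py actually uses the model is not described here, so the helper names, the model path, and the confidence threshold below are assumptions.

```python
def load_language_model(path="lid.176.bin"):
    """Load the fastText language identification model.

    The default path is an assumption; point it at your local copy.
    """
    import fasttext  # imported lazily so the helpers below work without it

    return fasttext.load_model(path)


def detect_language(model, text):
    """Return (language_code, confidence) for the most likely language."""
    # predict() rejects newlines, so collapse all whitespace first.
    cleaned = " ".join(text.split())
    labels, probabilities = model.predict(cleaned, k=1)
    # Labels look like "__label__en"; strip the prefix.
    return labels[0].replace("__label__", ""), float(probabilities[0])


def is_english(model, text, min_confidence=0.5):
    """Hypothetical filter: keep text identified as English with enough confidence."""
    language, confidence = detect_language(model, text)
    return language == "en" and confidence >= min_confidence
```

A typical call would be `model = load_language_model()` followed by `is_english(model, abstract)` for each abstract.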

Navigate to the repository directory and run the following commands:

Requirements

pip install -r requirements.txt

Dataset construction

python data_processing/process_merged_data.py

This will create a dataset at data_processing/data/merged_data.csv.
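
To sanity-check the output, a small sketch like the following lists the column names and counts the rows. The column names of merged_data.csv are not documented here, so the code reads them from the file rather than assuming them; the function name is hypothetical.

```python
import csv


def summarize_csv(path):
    """Return (column_names, row_count) for a CSV file with a header row."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        columns = list(reader.fieldnames or [])
        row_count = sum(1 for _ in reader)
    return columns, row_count
```

For example, `summarize_csv("data_processing/data/merged_data.csv")` returns the header and the number of papers in the merged dataset.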

Contribution

This repository was developed by Raia Abu Ahmad (raia.abu_ahmad@dfki.de).

The initial basis for the data construction code was developed by the ORKG team. We used their code and developed it further. Their current version can be found at https://gitlab.com/TIBHannover/orkg/nlp/experiments/orkg-research-fields-classifier.
