Extracting Definienda in Mathematical Scholarly Articles with Transformers

This repository contains the datasets, code and experimental results for our paper, Extracting Definienda in Mathematical Scholarly Articles with Transformers (accepted at the 2nd Workshop on Information Extraction from Scientific Publications at IJCNLP-AACL 2023).

An 8-minute video about our work:

https://youtu.be/tUioJooDDio?si=6EnTN_5l-9t86IKk

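Our slides: Extracting_Definienda_in_Mathematical_Scholarly_Articles_with_Transformers_WIESP_slides.pdf

If you do not have access to our store of ArXiv paper sources, you can start from Step 2.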

Step 0. Collect .tex sources of ArXiv papers

Get_paper_IDs.ipynb pulls the ArXiv IDs of papers in the combinatorics category and stores them in a CSV file, 28477_id+dir.csv.

Then run get_list_of_papers on the server where the ArXiv paper sources are stored to copy the .tex files to the folder "que_tex/". Before running it, make sure that 28477_id+dir.csv, get_paper.sh and add_extthm.py are copied to the same directory.

Step 1. Extract definition-definiendum pairs from .tex files

Still on the same server as in the raw data collection step, run

python get_def-term_pairs.py que_tex out_def_all.csv

to write the definitions found in all the .tex files to a single CSV file, "out_def_all.csv".
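To give an idea of what this extraction step involves, here is a minimal sketch. It is not the code of get_def-term_pairs.py. It assumes that definitions sit in \begin{definition}...\end{definition} environments and that definienda are marked with \emph{...} or \textit{...}; the function name extract_pairs is hypothetical.

```python
import re

# Assumption: definitions are wrapped in a "definition" environment and the
# defined terms are emphasised with \emph{...} or \textit{...}.
DEF_ENV = re.compile(r"\\begin\{definition\}(.*?)\\end\{definition\}", re.DOTALL)
EMPHASIS = re.compile(r"\\(?:emph|textit)\{([^{}]*)\}")


def extract_pairs(tex_source):
    """Return one (definition_text, [definienda]) pair per definition environment."""
    pairs = []
    for match in DEF_ENV.finditer(tex_source):
        body = " ".join(match.group(1).split())  # collapse newlines and extra spaces
        terms = EMPHASIS.findall(body)           # emphasised spans inside the definition
        pairs.append((body, terms))
    return pairs


sample = r"""
\begin{definition}
A graph is \emph{bipartite} if its vertex set splits into two
independent sets.
\end{definition}
"""
pairs = extract_pairs(sample)
for definition, terms in pairs:
    print(terms, "<-", definition)
```

Real .tex sources use custom theorem environments and macros, so a regular expression like this one would miss many definitions and pick up some emphasised text that is not a definiendum. This is one reason a cleaning step is needed afterwards.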

Step 2. Clean the dataset

Run Prepare_term-def_dataset.ipynb to remove noise from the extracted definition-definiendum pairs and generate an IOB-format dataset for named entity recognition. If you start from this step, you can load out_def_all_1007.csv. The intermediate outputs of this step are in intermediate_outputs/. The cleaned and labeled data is saved in data/all_labeled_data+ID.csv.
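As an illustration of the IOB format, the sketch below tags every occurrence of a definiendum in a whitespace-tokenized definition. It is a simplified example, not the notebook's code; the label name TERM, the function name iob_tags and the whitespace tokenization are assumptions.

```python
def iob_tags(tokens, term_tokens, label="TERM"):
    """Tag each occurrence of term_tokens in tokens with B-/I- labels, O elsewhere."""
    tags = ["O"] * len(tokens)
    n = len(term_tokens)
    if n == 0:
        return tags
    for start in range(len(tokens) - n + 1):
        if tokens[start:start + n] == term_tokens:
            tags[start] = f"B-{label}"            # first token of the term
            for i in range(start + 1, start + n):
                tags[i] = f"I-{label}"            # remaining tokens of the term
    return tags


tokens = "A spanning tree of G is a tree containing every vertex of G".split()
tags = iob_tags(tokens, ["spanning", "tree"])
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Note that the second "tree" stays O because only the full term "spanning tree" is matched.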

Step 3.

Fine-tune pre-trained language models for token classification:

You may run the following notebooks with our labeled data in "data/":

CCRobertaForTokenCLS.ipynb

RobertaForTokenCLS.ipynb

SciBERTForTokenCLS.ipynb

The native evaluation output of Transformers for our experiments can be found in finetuning_results/. Our fine-tuned models are available here:
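A common preprocessing step when fine-tuning for token classification is aligning word-level labels with sub-word tokens. The sketch below shows one usual convention: only the first sub-token of each word keeps the label, and the rest get -100, which PyTorch's cross-entropy loss ignores by default. Whether the notebooks follow exactly this convention is an assumption. The word_ids list stands in for what a fast tokenizer's word_ids() method returns, and align_labels is a hypothetical helper.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def align_labels(word_ids, word_labels):
    """Map word-level label ids onto sub-word tokens.

    word_ids: one entry per sub-word token, giving the index of the word it
              belongs to, or None for special tokens.
    word_labels: one label id per word.
    """
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(IGNORE_INDEX)          # special token, no label
        elif word_id != previous:
            aligned.append(word_labels[word_id])  # first sub-token keeps the word label
        else:
            aligned.append(IGNORE_INDEX)          # later sub-tokens are ignored in the loss
        previous = word_id
    return aligned


# Three words, the second split into two sub-tokens, plus two special tokens.
word_ids = [None, 0, 1, 1, 2, None]
word_labels = [0, 1, 2]  # illustrative ids, e.g. O=0, B-TERM=1, I-TERM=2
aligned = align_labels(word_ids, word_labels)
print(aligned)
```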

Ask ChatGPT to extract definienda with your own API key:

Ask_ChatGPT_to_Extract.ipynb

Our results on the test set can be found in GPT_results/Human_corrected_annotations+gpt_res.csv

Step 4. Evaluation

Run Eval_Finetuning_10-fold.ipynb to align fine-tuned models' predictions with ChatGPT's answers. Our experimental results are in GPT_results/.
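Token-level IOB predictions and a list of extracted terms are not directly comparable, so some alignment is needed. The sketch below shows one possible way: decode the IOB tags back into term strings, then score them against a reference list with exact, case-insensitive matching. The matching rule, the metric and both function names are assumptions, not the notebook's actual procedure.

```python
def decode_terms(tokens, tags):
    """Turn an IOB tag sequence back into a list of term strings."""
    terms, current = [], []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                terms.append(" ".join(current))
            current = [token]                 # start a new term
        elif tag.startswith("I-") and current:
            current.append(token)             # continue the open term
        else:
            # "O" tag, or an "I-" tag with no open term (dropped here)
            if current:
                terms.append(" ".join(current))
            current = []
    if current:
        terms.append(" ".join(current))
    return terms


def precision_recall_f1(predicted, gold):
    """Set-based scores with exact, case-insensitive string matching."""
    pred_set = {t.lower() for t in predicted}
    gold_set = {t.lower() for t in gold}
    true_positives = len(pred_set & gold_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


tokens = "A spanning tree of G is a tree".split()
tags = ["O", "B-TERM", "I-TERM", "O", "O", "O", "O", "O"]
predicted = decode_terms(tokens, tags)
scores = precision_recall_f1(predicted, ["Spanning tree", "forest"])
print(predicted, scores)
```

Exact matching is strict: a prediction that covers only part of a multi-word definiendum counts as wrong, so a partial-match score may also be worth reporting.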
