A deep learning model using a Transformer encoder and a CRF to solve the citation parsing problem.
Dataset Used: GIANT
- Features: word embeddings & character embeddings (from a BiLSTM)
- The features are fed into a Transformer encoder layer, which provides the attention module
- The encoder outputs are passed through two fully connected linear layers
- The output of the linear layers is fed into a CRF to decode the most probable tag sequence (see the sketch below)
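A minimal PyTorch sketch of this pipeline is shown below, assuming the pytorch-crf package for the CRF layer. All class names, dimensions, and hyperparameters are illustrative and not necessarily those used in Trainer/transformerModel.py.

```python
import torch
import torch.nn as nn
from torchcrf import CRF  # assumption: the pytorch-crf package

class CitationParser(nn.Module):
    """Illustrative sketch: word + char-BiLSTM embeddings -> Transformer
    encoder -> two linear layers -> CRF."""

    def __init__(self, word_vocab, char_vocab, num_tags,
                 word_dim=96, char_dim=25, char_hidden=16,
                 nhead=4, num_layers=2, ff_dim=256):
        super().__init__()
        self.word_emb = nn.Embedding(word_vocab, word_dim, padding_idx=0)
        self.char_emb = nn.Embedding(char_vocab, char_dim, padding_idx=0)
        # Character-level BiLSTM; its final hidden states form the char embedding
        self.char_lstm = nn.LSTM(char_dim, char_hidden,
                                 batch_first=True, bidirectional=True)
        d_model = word_dim + 2 * char_hidden  # 96 + 32 = 128
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead, ff_dim,
                                               batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers)
        # Two fully connected layers mapping encoder output to tag scores
        self.fc = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                                nn.Linear(d_model, num_tags))
        self.crf = CRF(num_tags, batch_first=True)

    def _char_features(self, chars):
        # chars: (batch, seq_len, word_len) -> (batch, seq_len, 2*char_hidden)
        b, s, w = chars.shape
        emb = self.char_emb(chars.view(b * s, w))
        _, (h, _) = self.char_lstm(emb)
        return torch.cat([h[0], h[1]], dim=-1).view(b, s, -1)

    def forward(self, words, chars, tags=None, mask=None):
        x = torch.cat([self.word_emb(words),
                       self._char_features(chars)], dim=-1)
        emissions = self.fc(self.encoder(x))  # padding mask omitted for brevity
        if tags is not None:  # training: negative log-likelihood loss
            return -self.crf(emissions, tags, mask=mask, reduction='mean')
        return self.crf.decode(emissions, mask=mask)  # inference: best tag paths
```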
Built a custom tokenizer that labels each word according to the XML tag on the citation string. All punctuation is treated as a separate class. Note: DOI/URI fields were removed from the training data.
The BIO tagging scheme is used for the labels.
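For illustration, a minimal sketch of such a tokenizer follows. The regex, the "punc" label name, and the assumption that each field is a flat XML element are simplifications, not the actual implementation in dataset-builder.ipynb; DOI/URI removal is omitted here.

```python
import re
from xml.etree import ElementTree as ET

def tokenize_labeled(xml_citation):
    """Turn an XML-annotated citation string into (token, BIO-label) pairs;
    punctuation gets its own class ('punc' is a placeholder name)."""
    root = ET.fromstring(f"<root>{xml_citation}</root>")
    pairs = []
    for elem in root:
        started = False  # has this field emitted its B- token yet?
        for tok in re.findall(r"\w+|[^\w\s]", elem.text or ""):
            if not tok.isalnum():
                pairs.append((tok, "punc"))
            else:
                pairs.append((tok, ("I-" if started else "B-") + elem.tag))
                started = True
    return pairs

# tokenize_labeled('<author>Smith, J.</author><title>Deep Learning</title>')
# -> [('Smith', 'B-author'), (',', 'punc'), ('J', 'I-author'), ('.', 'punc'),
#     ('Deep', 'B-title'), ('Learning', 'I-title')]
```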
- Trainer/citation-parser-trainer.py [Main]
- Trainer/corpus.py [Loads datasets]
- Trainer/transformerModel.py [Architecture definition]
- Data Preprocessing/csv-downsampler.ipynb
This file is used to downsample the GIANT dataset to a target size. You can define the number of samples to take from each CSV and save the result as a new CSV.
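A rough pandas equivalent of what this notebook does; the input directory, sample size, and output naming are placeholders.

```python
import glob
import pandas as pd

SAMPLE_SIZE = 10_000  # illustrative per-CSV sample size

for path in glob.glob("GIANT/*.csv"):  # hypothetical input directory
    df = pd.read_csv(path)
    sampled = df.sample(n=min(SAMPLE_SIZE, len(df)), random_state=42)
    sampled.to_csv(path.replace(".csv", "-sampled.csv"), index=False)
```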
- Data Preprocessing/dataset-builder.ipynb
One part contains the tokenizer: it takes a citation string and returns a list of (token, label) tuples for each sentence. The remaining parts read each citation string from a CSV, pass it to the tokenizer, and build text files in which each line holds one token-label pair. Each CSV gets one corresponding text file.
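A sketch of the CSV-to-text-file step, reusing the tokenize_labeled sketch above; the column name is an assumption about the GIANT CSV layout.

```python
import pandas as pd

def build_text_file(csv_path, out_path, column="citationStringAnnotated"):
    """Write one 'token label' pair per line, with a blank line
    between citation strings."""
    df = pd.read_csv(csv_path)
    with open(out_path, "w", encoding="utf-8") as out:
        for xml_citation in df[column].dropna():
            for token, label in tokenize_labeled(xml_citation):  # tokenizer sketch above
                out.write(f"{token} {label}\n")
            out.write("\n")  # sentence separator
```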
- Data Preprocessing/concat-text-files.ipynb
Concatenates all the text files into one. A blank line is left between sentences.
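A minimal sketch of this concatenation; the file paths and output name are placeholders.

```python
import glob

# Each input file already has one token-label pair per line.
with open("all-data.txt", "w", encoding="utf-8") as out:
    for path in sorted(glob.glob("text-files/*.txt")):
        with open(path, encoding="utf-8") as f:
            # Strip trailing blank lines, then re-add one as the separator
            out.write(f.read().rstrip("\n") + "\n\n")
```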
- script-trainer.sh
To run this on HPC:
sbatch script-trainer.sh
You may need to create an environment and install the required modules for a successful run.
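For example, a first-time setup might look like the following (the module name and package list are assumptions based on the sketches above; adjust for your cluster):

module load python
python -m venv venv && source venv/bin/activate
pip install torch pytorch-crf pandas seqeval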
- citation-parser-inference-block-token.ipynb
This file is used to evaluate test sets. It first loads the model from a directory; you need to set the path to your evaluation dataset.
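An outline of what such an evaluation loop might look like; the model path, test_loader, and ID2TAG are placeholders, and seqeval is one common choice for entity-level metrics rather than necessarily what the notebook uses.

```python
import torch
from seqeval.metrics import classification_report  # assumption: seqeval for span-level scores

model = torch.load("models/citation-parser.pt", map_location="cpu")  # placeholder path
model.eval()

true_tags, pred_tags = [], []
with torch.no_grad():
    for words, chars, tags, mask in test_loader:     # assumes a DataLoader over the test set
        best_paths = model(words, chars, mask=mask)  # CRF decoding (see model sketch)
        for path, gold, m in zip(best_paths, tags, mask):
            n = int(m.sum())
            pred_tags.append([ID2TAG[i] for i in path])          # ID2TAG: id -> label map
            true_tags.append([ID2TAG[int(t)] for t in gold[:n]])

print(classification_report(true_tags, pred_tags))
```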