This repository contains the code for the reproduction paper Cross-domain Retrieval in the Legal and Patent Domains: A Reproducibility Study of the paper BERT-PLI: Modeling Paragraph-Level Interactions for Legal Case Retrieval and is based on the BERT-PLI Github repository.
We added the missing data preprocessing scripts as well as the script for fine-tuning the BERT model on binary classification, which is based on HuggingFace's transformers library. Furthermore, we added scripts for evaluation with the scikit-learn classification report and for ranking evaluation using the pytrec_eval library.
The open-sourced trained models can be found here.
Please cite our work as follows:
@inproceedings{althammer2021crossdomain,
title={Cross-domain Retrieval in the Legal and Patent Domains: a Reproducibility Study},
author={Sophia Althammer and Sebastian Hofstätter and Allan Hanbury},
year={2021},
booktitle={Advances in Information Retrieval, 43rd European Conference on IR Research, ECIR 2021},
}
- `./model/nlp/BertPoolOutMax.py`: models paragraph-level interactions between documents (a conceptual sketch follows this list).
- `./model/nlp/AttenRNN.py`: aggregates paragraph-level representations.
- `./config/nlp/BertPoolOutMax.config`: parameters for `./model/nlp/BertPoolOutMax.py`.
- `./config/nlp/AttenGRU.config` / `./config/nlp/AttenLSTM.config`: parameters for `./model/nlp/AttenRNN.py` (GRU / LSTM, respectively).
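To make the interaction modelling concrete, here is a minimal sketch of the idea behind `BertPoolOutMax.py`, not the repository implementation: every query paragraph is paired with every candidate paragraph, the [CLS] vector of each pair is taken from BERT, and the candidate dimension is max-pooled. The `bert-base-uncased` checkpoint and the helper name `interaction_matrix` are assumptions for illustration.

```python
# Conceptual sketch of paragraph-level interaction pooling -- not the
# repository implementation in ./model/nlp/BertPoolOutMax.py.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased").eval()

def interaction_matrix(q_paras, c_paras, max_len=256):
    """For each query paragraph, max-pool the [CLS] vectors of its pairs
    with all candidate paragraphs; returns a (len(q_paras) x 768) tensor."""
    rows = []
    with torch.no_grad():
        for q in q_paras:
            cls_vecs = []
            for c in c_paras:
                enc = tokenizer(q, c, truncation=True, max_length=max_len,
                                return_tensors="pt")
                cls_vecs.append(bert(**enc).last_hidden_state[:, 0, :])  # (1, 768)
            # max-pooling over the candidate-paragraph dimension
            rows.append(torch.stack(cls_vecs, dim=0).max(dim=0).values)
    return torch.cat(rows, dim=0)
```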
Required format of the tsv-files (a writer sketch follows below):

[label (0/1), claim_id, passage_id, claim_text, passage_text]
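For illustration, a small sketch (file name and example values are made up) that writes rows in this five-column tab-separated layout with Python's csv module:

```python
# Sketch: write fine-tuning examples in the expected five-column TSV layout.
import csv

rows = [
    # label, claim_id, passage_id, claim_text, passage_text (placeholder values)
    (1, "Q001", "P0042", "A claim sentence ...", "A relevant passage ..."),
    (0, "Q001", "P0099", "A claim sentence ...", "An irrelevant passage ..."),
]

with open("train.tsv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f, delimiter="\t")
    for row in rows:
        writer.writerow(row)
```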
`./preprocessing/coliee19_task2_create_train_input.py`: preprocessing of the legal training dataset from the COLIEE2019 Task 2 for BERT fine-tuning
python coliee19_task2_create_train.py --train-dir /home/data/coliee/task2/train --output-dir /home/data/coliee/task2/output
`./preprocessing/coliee19_task2_create_test.py`: preprocessing of the legal test dataset from the COLIEE2019 Task 2 for BERT fine-tuning
python coliee19_task2_create_test.py --train-dir /home/data/coliee/task2/train --output-dir /home/data/coliee/task2/output --test-gold-labels /home/data/coliee/task2/task2_test_golden-labels.xml
`./preprocessing/coliee19_task2_create_train_test.py`: preprocessing of the merged legal train and test dataset from the COLIEE2019 Task 2 for BERT fine-tuning
python coliee19_task2_create_train_test.py --test-dir /home/data/coliee/task2/test --train-dir /home/data/coliee/task2/train --output-dir /home/data/coliee/task2/output --test-gold-labels /home/data/coliee/task2/task2_test_golden-labels.xml
`./preprocessing/clefip13_ctp_create_train.py`: preprocessing of the patent training dataset from the CLEF-IP claim-to-passage task for BERT fine-tuning
python clefip13_ctp_create_train.py --train-dir /home/data/clefip/ctp/ --output-dir /home/data/clefip/ctp/output --corpus-dir /home/data/clefip/corpus
`./preprocessing/clefip13_ctp_create_test.py`: preprocessing of the patent test dataset from the CLEF-IP claim-to-passage task for BERT fine-tuning
python clefip13_ctp_create_test.py --train-dir /home/data/clefip/ctp/ --output-dir /home/data/clefip/ctp/output --corpus-dir /home/data/clefip/corpus
- `./preprocessing/clefip11_pac_filer_english_topics.py`: finds the English topics of the train and test files and creates txt-files with the document ids of the English topics (a filtering sketch follows after the command)
python clefip11_pac_filer_english_topics.py --train-dir /home/data/clefip/pac/train --train_topics /home/data/clefip/pac/train/files --corpus-dir /home/data/clefip/corpus --test-dir /home/data/clefip/test/ --test-topics /home/data/clefip/test/files/
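The exact logic of the filtering script is not shown in this README; the sketch below only illustrates one way to collect the ids of English topics into a txt-file, using the langdetect package as an assumed language check (the repository script may rely on different metadata or heuristics):

```python
# Sketch: collect the document ids of English topics into a txt-file.
# langdetect is only an assumed language check here; the repository script
# may use different metadata or heuristics.
import os
from langdetect import detect

def english_topic_ids(topic_dir):
    ids = []
    for name in os.listdir(topic_dir):
        with open(os.path.join(topic_dir, name), encoding="utf-8") as f:
            text = f.read()
        if text.strip() and detect(text) == "en":
            ids.append(os.path.splitext(name)[0])
    return ids

with open("english_train_topics.txt", "w", encoding="utf-8") as out:
    out.write("\n".join(english_topic_ids("/home/data/clefip/pac/train/files")))
```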
For index creation and search we use pyserini; see the pyserini Github repository for further explanations.
General setup for both domains:
- index_jsonl: create the JSON format for the pyserini indexer, either:
  - a folder with files, each of which contains an array of JSON documents, or
  - a folder with files, each of which contains a JSON document on an individual line (often called JSONL format),
  with the following format (a writer sketch follows below):
  {'id': '001', 'contents': 'text'}
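A minimal sketch of producing that JSONL layout (document ids, texts, and the output file name are placeholders):

```python
# Sketch: dump a corpus into the JSONL layout expected by the pyserini indexer.
import json

documents = {"001": "text of the first document",   # placeholder corpus
             "002": "text of the second document"}

with open("docs.jsonl", "w", encoding="utf-8") as f:
    for doc_id, text in documents.items():
        f.write(json.dumps({"id": doc_id, "contents": text}) + "\n")
```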
`./preprocessing/coliee19_task1_index_jsonl.py`: JSON format for the legal dataset from the COLIEE2019 Task 1 for pyserini index creation
python coliee19_task1_index_jsonl.py --train-dir /home/data/coliee/task2/corpus/
`./preprocessing/clefip11_pac_index_jsonl.py`: JSON format for the patent dataset from the CLEF-IP prior-art-candidate task for pyserini index creation
python clefip11_pac_index_jsonl.py --corpus-dir /home/data/clefip/corpus/ --json-dir /home/data/clefip/corpus_json
- create index with pyserini
python -m pyserini.index -collection JsonCollection -generator DefaultLuceneDocumentGenerator \
-threads 1 -input integrations/resources/sample_collection_jsonl \
-index indexes/sample_collection_jsonl -storePositions -storeDocvectors -storeRaw
- index_search: search the created index with pyserini for the given topics (a search sketch follows after the commands)
`./preprocessing/coliee19_task1_index_search.py`: search the legal index for the topics from the COLIEE2019 Task 1
python coliee19_task1_index_search.py --train-dir /home/data/coliee/task2/corpus/
`./preprocessing/clefip11_pac_index_search.py`: search the patent index for the topics from the CLEF-IP prior-art-candidate task
python clefip11_pac_index_search.py --index-dir /home/data/clefip/index/ --topic-dir /home/data/clefip/task1/train/
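After the index is built, the search step boils down to a BM25 query against the Lucene index. A hedged sketch with pyserini's Python API (recent pyserini versions expose `LuceneSearcher`, older ones provide `pyserini.search.SimpleSearcher` with the same `search()` call; the index path and query text are placeholders):

```python
# Sketch: BM25 search over a previously built pyserini index.
# Recent pyserini versions expose LuceneSearcher; older ones provide
# pyserini.search.SimpleSearcher with the same search() interface.
from pyserini.search.lucene import LuceneSearcher

searcher = LuceneSearcher("indexes/sample_collection_jsonl")  # path from the indexing step
hits = searcher.search("text of the query case or claim", k=50)

for rank, hit in enumerate(hits, start=1):
    print(f"{rank:2d} {hit.docid:20s} {hit.score:.4f}")
```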
Create the input format for `./formatter/nlp/BertDocParaFormatter.py`:
`./preprocessing/coliee19_task1_json_lines.py`: create the format for the AttentionRNN training dataset from the COLIEE2019 Task 1
python coliee19_task1_json_lines.py --train-dir /home/data/coliee/train/
`./preprocessing/coliee19_task1_json_lines_test.py`: create the format for the AttentionRNN test dataset from the COLIEE2019 Task 1
python coliee19_task1_json_lines_test.py --train-dir /home/data/coliee/train/ --output-dir /home/data/coliee/output --test-gold-labels /home/data/coliee/task1_test_golden-labels.xml
`./preprocessing/clefip11_pac_json_lines.py`: create the format for the AttentionRNN training/test dataset from the CLEF-IP prior-art-candidate task
python clefip11_pac_json_lines.py --train-dir /home/data/clefip/train/ --corpus-dir /home/data/clefip/corpus --folder-name bm25_top50
python poolout_to_train.py --train-dir /home/data/clefip/train/
`./formatter/nlp/BertDocParaFormatter.py`: prepare input for `./model/nlp/BertPoolOutMax.py`
An example:
{
"guid": "queryID_docID",
"q_paras": [...], // a list of paragraphs in the query case
"c_paras": [...], // a list of paragraphs in the candidate case
"label": 0, // 0 or 1, denoting the relevance
}
`./formatter/nlp/AttenRNNFormatter.py`: prepare input for `./model/nlp/AttenRNN.py` (a model sketch follows after the example)
An example:
{
"guid": "queryID_docID",
"res": [[],...,[]], // N * 768, result of BertPoolOutMax
"label": 0, // 0 or 1, denoting the relevance
}
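For orientation, a minimal sketch of the aggregation idea behind `AttenRNN.py` (not the repository code): a GRU runs over the N x 768 `res` sequence, an attention layer pools the hidden states, and a linear layer predicts relevance. Hidden size and layer names are assumptions, not the values from the config files.

```python
# Sketch of the aggregation idea behind ./model/nlp/AttenRNN.py: a GRU over the
# N x 768 "res" sequence, attention pooling over the hidden states, and a
# binary relevance classifier. Sizes are illustrative, not the config values.
import torch
import torch.nn as nn

class AttenGRUSketch(nn.Module):
    def __init__(self, input_dim=768, hidden_dim=256):
        super().__init__()
        self.gru = nn.GRU(input_dim, hidden_dim, batch_first=True)
        self.attn = nn.Linear(hidden_dim, 1)
        self.out = nn.Linear(hidden_dim, 2)                  # relevant / not relevant

    def forward(self, res):                                  # res: (batch, N, 768)
        states, _ = self.gru(res)                            # (batch, N, hidden)
        weights = torch.softmax(self.attn(states), dim=1)    # attention over paragraphs
        pooled = (weights * states).sum(dim=1)               # (batch, hidden)
        return self.out(pooled)                              # logits

logits = AttenGRUSketch()(torch.randn(2, 54, 768))           # e.g. 54 paragraph slots
```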
`finetune.py` / `poolout.py` / `train.py` / `test.py`: main entry points for fine-tuning, pooling out, training, and testing.
- See `requirements.txt`
- Fine-tune the BERT model on paragraph-level interaction binary classification (a simplified sketch follows after the command)
python finetune.py --model_name bert-base-uncased --task_name MRPC --do_train --do_eval --data_dir /home/data/ --max_seq_length 512 --per_device_train_batch_size 1 --learning_rate 1e-5 --num_train_epochs 3.0 --save_steps 403 --gradient_accumulation 16 --output_dir /home/data/output
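As a rough picture of what this step does (a simplified stand-in, not the repository's `finetune.py`): the claim/passage pairs from the TSV file are encoded as sentence pairs and BERT is fine-tuned for binary classification with the HuggingFace `Trainer`. File paths and the dataset class are assumptions.

```python
# Simplified stand-in for the fine-tuning step (not the repository's finetune.py):
# encode claim/passage pairs from the TSV file and fine-tune BERT for binary
# classification with the HuggingFace Trainer.
import csv
import torch
from torch.utils.data import Dataset
from transformers import (BertForSequenceClassification, BertTokenizerFast,
                          Trainer, TrainingArguments)

class PairDataset(Dataset):
    def __init__(self, tsv_path, tokenizer, max_len=512):
        with open(tsv_path, encoding="utf-8") as f:
            self.rows = list(csv.reader(f, delimiter="\t"))
        self.tok, self.max_len = tokenizer, max_len

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, i):
        label, _, _, claim, passage = self.rows[i]
        enc = self.tok(claim, passage, truncation=True,
                       max_length=self.max_len, padding="max_length")
        item = {k: torch.tensor(v) for k, v in enc.items()}
        item["labels"] = torch.tensor(int(label))
        return item

tok = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

args = TrainingArguments(output_dir="/home/data/output", num_train_epochs=3,
                         per_device_train_batch_size=1, gradient_accumulation_steps=16,
                         learning_rate=1e-5, save_steps=403)

Trainer(model=model, args=args,
        train_dataset=PairDataset("/home/data/train.tsv", tok)).train()
```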
- Get paragraph-level interactions by BERT:
python3 poolout.py -c config/nlp/BertPoolOutMax.config -g [GPU_LIST] --checkpoint [path of Bert checkpoint] --result [path to save results]
- Train
python3 train.py -c config/nlp/AttenGRU.config -g [GPU_LIST]
or
python3 train.py -c config/nlp/AttenLSTM.config -g [GPU_LIST]
- Test
python3 test.py -c config/nlp/AttenGRU.config -g [GPU_LIST] --checkpoint [path of Bert checkpoint] --result [path to save results]
or
python3 test.py -c config/nlp/AttenLSTM.config -g [GPU_LIST] --checkpoint [path of Bert checkpoint] --result [path to save results]
- Eval
evaluate the recall of the index search for the topics of the COLIEE2019 Task 1
python coliee19_task1_eval_index.py --train-dir /home/data/coliee/task1/train --test-gold-labels /home/data/coliee/task1/task1_test_golden-labels.xml
evaluate the recall of the index search for the topics of the CLEF-IP 2011 prior-art-candidate search
python clefip11_pac_eval_index.py --train-dir /home/data/clefip/pac/train --folder-name bm25_top50
evaluate the binary classification metrics for the COLIEE2019 or CLEF-IP tasks (a scikit-learn sketch follows after the commands)
python eval_predictions_binary.py --label-file /home/coliee/task1/test/test.json --pred-file /home/coliee/task1/test/pred.txt
or for BM25 run:
python eval_predictions_binary.py --label-file /home/coliee/task1/test/test.json --bm25-folder /home/coliee/task1/bm25/ --cutoff 6
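The file formats are defined by `eval_predictions_binary.py`; the sketch below only shows the scikit-learn classification report mentioned above, with placeholder label and prediction lists:

```python
# Sketch: binary classification metrics with scikit-learn's classification_report.
# The real label/prediction files are read by eval_predictions_binary.py;
# the lists below are placeholders.
from sklearn.metrics import classification_report

y_true = [1, 0, 0, 1, 0, 1]   # gold relevance labels per (query, candidate) pair
y_pred = [1, 0, 1, 1, 0, 0]   # predicted labels from the model or a BM25 cutoff

print(classification_report(y_true, y_pred, digits=4))
```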
evaluate the ranking metrics for the COLIEE2019 or CLEF-IP tasks (a pytrec_eval sketch follows after the command)
python eval_predictions_ranking.py --label-file /home/coliee/task1/test/test.json --pred-file /home/coliee/task1/test/pred.txt
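A minimal pytrec_eval sketch with placeholder qrels and run scores (the real script reads them from the label and prediction files):

```python
# Sketch: ranking metrics with pytrec_eval (placeholder qrels and run scores).
import pytrec_eval

qrels = {"q1": {"d1": 1, "d2": 0, "d3": 1}}          # gold relevance per query
run   = {"q1": {"d1": 2.3, "d2": 1.7, "d3": 0.4}}    # retrieval scores per query

evaluator = pytrec_eval.RelevanceEvaluator(qrels, {"map", "ndcg", "recall"})
print(evaluator.evaluate(run))                        # per-query metric values
```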
Student's paired t-test for the binary metrics for COLIEE2019 against the BM25 baseline:
python eval_predictions_binary_ttest.py --pred-file /home/coliee/task1/test/output_pred.txt --bm25-folder /home/coliee/task1/bm25/ --cutoff 6
Student's paired t-test for the ranking metrics for the COLIEE2019 or CLEF-IP tasks (a scipy sketch follows after the commands):
To compare two different prediction models:
python eval_predictions_ranking_ttest.py --label-file /home/coliee/task1/test/test.json --pred-file /home/coliee/task1/test/pred.txt --pred_file2 /home/coliee/task1/test/pred2.txt
To compare to the baseline BM25 performance:
python eval_predictions_ranking_ttest.py --label-file /home/coliee/task1/test/test.json --pred-file /home/coliee/task1/test/pred.txt --bm25_folder /home/coliee/task1/test/bm25_top50/
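The t-test itself reduces to comparing per-query metric values of two runs. A sketch with `scipy.stats.ttest_rel` on placeholder values:

```python
# Sketch: Student's paired t-test over per-query metric values of two runs
# (e.g. per-query average precision), using scipy.
from scipy.stats import ttest_rel

per_query_run_a = [0.41, 0.55, 0.12, 0.68, 0.33]   # placeholder values
per_query_run_b = [0.38, 0.61, 0.10, 0.70, 0.29]

t_stat, p_value = ttest_rel(per_query_run_a, per_query_run_b)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
```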
For the legal datasets refer to COLIEE 2019: the dataset from Task 2 is used for the paragraph-level fine-tuning of the BERT model, and the dataset from Task 1 for the document retrieval.
For the patent datasets refer to CLEF-IP 2013: the dataset from the claims-to-passage 2013 task is used for the paragraph-level fine-tuning of the BERT model, and the dataset from the prior-art candidate search 2011 task for the document retrieval. The candidates are retrieved from the patent corpus published in 2013.
For more details, please refer to our reproduction paper Cross-domain Retrieval in the Legal and Patent Domains: A Reproducibility Study. If you have any questions, please email sophia.althammer@tuwien.ac.at.