What is it?

A component of a paper recommender system with cross-lingual and multidisciplinary scope.

Result of the coursework for the MBA in Data Science and Analytics - USP / ESALQ - 2020-2022.

Designed to be customizable in several ways:

the sentence-transformer model
the maximum number of candidate articles evaluated for semantic similarity
the type of input document: any document that has bibliographic references is accepted

Dependencies

sentence-transformers
celery
mongoengine

Sentence transformer models

Choose one of the models: https://huggingface.co/models?library=sentence-transformers&pipeline_tag=sentence-similarity&sort=downloads&search=multilingual

For instance, paraphrase-xlm-r-multilingual-v1.

The instructions below come from the "</> Use in sentence-transformers" button found on any Hugging Face model page.

git lfs install
git clone https://huggingface.co/sentence-transformers/paraphrase-xlm-r-multilingual-v1
# if you want to clone without large files – just their pointers
# prepend your git clone with the following env var:
GIT_LFS_SKIP_SMUDGE=1

Move paraphrase-xlm-r-multilingual-v1 to the folder where you will keep the models, for instance, /path/my_sentence_transformers_models

If you get any error related to git lfs, check https://git-lfs.github.com/.
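Once the model folder is in place, it can be loaded from the local path instead of being downloaded again. The following is a minimal sketch, assuming the sentence-transformers package is installed and the model was cloned as described above. The helper names `model_dir` and `similarity` are illustrative and are not part of xlingual_papers_recommender.

```python
import os


def model_dir(models_path: str, model_name: str) -> str:
    """Return the local folder of a cloned model (illustrative helper)."""
    return os.path.join(models_path, model_name)


def similarity(models_path: str, model_name: str, text_a: str, text_b: str) -> float:
    """Cosine similarity of two texts, computed with a locally stored model."""
    # Imported lazily so that model_dir() works without the library installed.
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer(model_dir(models_path, model_name))
    embeddings = model.encode([text_a, text_b], convert_to_tensor=True)
    return float(util.cos_sim(embeddings[0], embeddings[1]))
```

With a multilingual model, `text_a` and `text_b` may be in different languages, for instance an abstract in Portuguese and one in English.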

MongoDB

docker pull mongo
docker run -d -p 27017:27017 --name papers_recommeder_mongo mongo:latest

RabbitMQ

docker pull rabbitmq
docker run -d -p 5672:5672 --name papers_recommeder_rabbitmq rabbitmq

Model

The adopted algorithm combines graph-based and content-based filtering, using semantic similarity for the content-based part.

Relationships between scientific articles are identified when a document enters the system, through the bibliographic references the articles have in common. Subsequently, the linked documents are ranked by semantic similarity and recorded in a database.

The recommendation system works in two steps: creating links between articles via common citations and assigning a similarity coefficient for a selection of these linked articles.
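The two steps can be sketched in plain Python as follows. This is only an illustration under stated assumptions, not the package's implementation: the dictionary layout, the function names, the toy two-dimensional embeddings, and the choice of selecting candidates by the number of shared references are all hypothetical. In practice the embeddings would come from a sentence-transformer model.

```python
import math
from collections import defaultdict


def link_by_shared_references(papers):
    """Step 1: map each pair of paper ids to the references they share."""
    papers_by_ref = defaultdict(set)
    for pid, paper in papers.items():
        for ref in paper["references"]:
            papers_by_ref[ref].add(pid)
    links = defaultdict(set)
    for ref, pids in papers_by_ref.items():
        ordered = sorted(pids)
        for i, a in enumerate(ordered):
            for b in ordered[i + 1:]:
                links[(a, b)].add(ref)
    return dict(links)


def cosine(u, v):
    """Cosine similarity of two equally sized vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def score_links(papers, links, max_candidates=20):
    """Step 2: score a selection of linked pairs by semantic similarity."""
    # Assumption: pairs sharing more references are evaluated first.
    candidates = sorted(links, key=lambda pair: (-len(links[pair]), pair))
    candidates = candidates[:max_candidates]
    scored = [
        (a, b, cosine(papers[a]["embedding"], papers[b]["embedding"]))
        for a, b in candidates
    ]
    return sorted(scored, key=lambda item: -item[2])


# Toy data: p1 and p2 share r2, p2 and p3 share r3, p1 and p3 share nothing.
papers = {
    "p1": {"references": {"r1", "r2"}, "embedding": [1.0, 0.0]},
    "p2": {"references": {"r2", "r3"}, "embedding": [1.0, 0.0]},
    "p3": {"references": {"r3"}, "embedding": [0.0, 1.0]},
}
links = link_by_shared_references(papers)
ranked = score_links(papers, links)
```

Here `links` contains only the pairs (p1, p2) and (p2, p3), and `ranked` lists (p1, p2) first because their embeddings are identical.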

The system itself does not establish which articles should be recommended.

The recommendation system client defines which articles to present as a recommendation depending on the criticality of the use case.

Installation

pip install -U xlingual_papers_recommender

Configurations

export DATABASE_CONNECT_URL=mongodb://my_user:my_password@127.0.0.1:27017/my_db
export CELERY_BROKER_URL="amqp://guest@0.0.0.0:5672//"
export CELERY_RESULT_BACKEND_URL="rpc://"
export MODELS_PATH=/path/my_sentence_transformers_models
export DEFAULT_MODEL=paraphrase-xlm-r-multilingual-v1
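Before starting the services, a deployment script may want to check that all five variables are set. The sketch below is one way to do that. The `Settings` class and `load_settings` function are hypothetical names introduced here; how the package itself reads these variables is not shown in this document.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    database_connect_url: str
    celery_broker_url: str
    celery_result_backend_url: str
    models_path: str
    default_model: str

    @property
    def default_model_dir(self) -> str:
        # Folder that holds the default model, e.g. the cloned repository.
        return os.path.join(self.models_path, self.default_model)


def load_settings(environ=None) -> Settings:
    """Read the five variables; a missing one raises KeyError with its name."""
    env = os.environ if environ is None else environ
    return Settings(
        database_connect_url=env["DATABASE_CONNECT_URL"],
        celery_broker_url=env["CELERY_BROKER_URL"],
        celery_result_backend_url=env["CELERY_RESULT_BACKEND_URL"],
        models_path=env["MODELS_PATH"],
        default_model=env["DEFAULT_MODEL"],
    )


# Example with the values shown above, passed as a plain dict.
example = load_settings({
    "DATABASE_CONNECT_URL": "mongodb://my_user:my_password@127.0.0.1:27017/my_db",
    "CELERY_BROKER_URL": "amqp://guest@0.0.0.0:5672//",
    "CELERY_RESULT_BACKEND_URL": "rpc://",
    "MODELS_PATH": "/path/my_sentence_transformers_models",
    "DEFAULT_MODEL": "paraphrase-xlm-r-multilingual-v1",
})
```

Calling `load_settings()` without arguments reads from the process environment.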

Celery

Start service

celery -A xlingual_papers_recommender.core.tasks worker -Q default,low_priority,high_priority --pool=solo --autoscale 8,4 --loglevel=DEBUG

Clean queue

celery worker -Q low_priority,default,high_priority --purge

Usage

Register new paper

xlingual_papers_recommender receive_paper [--skip_update SKIP_UPDATE] source_file_path log_file_path

positional arguments:
  source_file_path      /path/document.json
  log_file_path         /path/registered.jsonl

optional arguments:
  -h, --help            show this help message and exit
  --skip_update SKIP_UPDATE
                        if it is already registered, skip_update do not update

Examples of source_file_path:

docs
└── examples
    ├── document1.json
    ├── document2.json
    ├── document3.json
    ├── document4.json
    ├── document5.json
    ├── document51.json
    ├── document6.json
    ├── document6_2.json
    ├── document7.json
    └── document7_2.json

Reference attributes:

pub_year
vol
num
suppl
page
surname
organization_author
doi
journal
paper_title
source
issn
thesis_date
thesis_loc
thesis_country
thesis_degree
thesis_org
conf_date
conf_loc
conf_country
conf_name
conf_org
publisher_loc
publisher_country
publisher_name
edition
source_person_author_surname
source_organization_author

Get paper recommendations

usage: xlingual_papers_recommender get_connected_papers [-h] [--min_score MIN_SCORE] pid

positional arguments:
  pid                   pid

optional arguments:
  -h, --help            show this help message and exit
  --min_score MIN_SCORE
                        min_score
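A client application can call this command from Python. The sketch below only builds and runs the command line documented above. The function names are illustrative, and because the output format is not described here, the output is returned as raw text.

```python
import subprocess


def get_connected_papers_command(pid, min_score=None):
    """Build the argument list for the get_connected_papers subcommand."""
    command = ["xlingual_papers_recommender", "get_connected_papers"]
    if min_score is not None:
        command += ["--min_score", str(min_score)]
    command.append(pid)
    return command


def run_get_connected_papers(pid, min_score=None):
    """Run the command and return its standard output as text."""
    completed = subprocess.run(
        get_connected_papers_command(pid, min_score),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return completed.stdout
```

A stricter use case would pass a higher `min_score`, so that fewer but more similar papers are returned.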

Load papers data from datasets

Register parts

usage: xlingual_papers_recommender_ds_loader register_paper_part [-h] [--skip_update SKIP_UPDATE] [--pids_selection_file_path PIDS_SELECTION_FILE_PATH]
                                                                 {abstracts,references,keywords,paper_titles,articles} input_csv_file_path output_file_path

positional arguments:
  {abstracts,references,keywords,paper_titles,articles}
                        part_name
  input_csv_file_path   CSV file with papers part data
  output_file_path      jsonl output file path

optional arguments:
  -h, --help            show this help message and exit
  --skip_update SKIP_UPDATE
                        True to skip if paper is already registered
  --pids_selection_file_path PIDS_SELECTION_FILE_PATH
                        Selected papers ID file path (CSV file path which has "pid" column)

Register articles

Example:

xlingual_papers_recommender_ds_loader register_paper_part articles articles.csv articles.jsonl

Required columns

pid
main_lang
uri
subject_areas
pub_year
doi (optional)
network_collection (optional)
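An articles.csv file with these columns can be produced with the standard csv module. This is a sketch with made-up rows: the pids, URIs, and subject areas are placeholders, and the way multiple subject areas should be encoded in one cell is not specified here, so a single value is used.

```python
import csv
import os
import tempfile

ARTICLE_COLUMNS = [
    "pid", "main_lang", "uri", "subject_areas", "pub_year",
    "doi", "network_collection",
]


def write_articles_csv(path, rows):
    """Write rows (dicts) using the column order above; missing optional cells stay empty."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=ARTICLE_COLUMNS, restval="")
        writer.writeheader()
        writer.writerows(rows)


# Placeholder rows; the second one omits the optional columns.
rows = [
    {"pid": "pid-0001", "main_lang": "pt", "uri": "https://example.org/a/1",
     "subject_areas": "Health Sciences", "pub_year": "2020",
     "doi": "10.0000/example.1", "network_collection": "example"},
    {"pid": "pid-0002", "main_lang": "en", "uri": "https://example.org/a/2",
     "subject_areas": "Agricultural Sciences", "pub_year": "2021"},
]

with tempfile.TemporaryDirectory() as tmp:
    articles_path = os.path.join(tmp, "articles.csv")
    write_articles_csv(articles_path, rows)
    with open(articles_path, newline="", encoding="utf-8") as f:
        loaded = list(csv.DictReader(f))
```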

Register abstracts

Example:

xlingual_papers_recommender_ds_loader register_paper_part abstracts /inputs/abstracts.csv /outputs/abstracts.jsonl

Columns

pid
lang
original
text (standardized)

The paper_titles and keywords datasets use the same columns.

Register references

Example:

xlingual_papers_recommender_ds_loader register_paper_part references /inputs/references.csv /outputs/references.jsonl

Columns

pub_year
vol
num
suppl
page
surname
organization_author
doi
journal
paper_title
source
issn
thesis_date
thesis_loc
thesis_country
thesis_degree
thesis_org
conf_date
conf_loc
conf_country
conf_name
conf_org
publisher_loc
publisher_country
publisher_name
edition
source_person_author_surname
source_organization_author

Merge papers parts

usage: xlingual_papers_recommender_ds_loader merge_parts [-h] [--split_into_n_papers SPLIT_INTO_N_PAPERS] [--create_paper CREATE_PAPER]
                                                         input_csv_file_path output_file_path

positional arguments:
  input_csv_file_path   Selected papers ID file path (CSV file path which has "pid" column)
  output_file_path      jsonl output file path

optional arguments:
  -h, --help            show this help message and exit
  --split_into_n_papers SPLIT_INTO_N_PAPERS
                        True to create one register for each abstract
  --create_paper CREATE_PAPER
                        True to register papers

Example:

xlingual_papers_recommender_ds_loader merge_parts pids.csv output.jsonl
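The pids.csv input only needs a "pid" column. One possible way to build it, assuming an articles.csv like the one used to register articles, is to copy that column and drop duplicates. The function name and the sample pids below are illustrative.

```python
import csv
import os
import tempfile


def write_pids_csv(articles_csv_path, pids_csv_path):
    """Copy the pid column of articles.csv into a one-column CSV, without duplicates."""
    with open(articles_csv_path, newline="", encoding="utf-8") as src:
        pids = [row["pid"] for row in csv.DictReader(src)]
    unique_pids = list(dict.fromkeys(pids))  # keeps the first-seen order
    with open(pids_csv_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        writer.writerow(["pid"])
        writer.writerows([pid] for pid in unique_pids)
    return unique_pids


with tempfile.TemporaryDirectory() as tmp:
    articles_path = os.path.join(tmp, "articles.csv")
    pids_path = os.path.join(tmp, "pids.csv")
    # Placeholder input with a repeated pid.
    with open(articles_path, "w", newline="", encoding="utf-8") as f:
        f.write("pid,main_lang\npid-0001,pt\npid-0002,en\npid-0001,pt\n")
    selected = write_pids_csv(articles_path, pids_path)
    with open(pids_path, newline="", encoding="utf-8") as f:
        pids_rows = list(csv.reader(f))
```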

Register papers from loaded datasets

usage: xlingual_papers_recommender_ds_loader register_paper [-h] [--skip_update SKIP_UPDATE] input_csv_file_path output_file_path

positional arguments:
  input_csv_file_path   Selected papers ID file path (CSV file path which has "pid" column)
  output_file_path      jsonl output file path

optional arguments:
  -h, --help            show this help message and exit
  --skip_update SKIP_UPDATE
                        True to skip if paper is already registered

Example:

xlingual_papers_recommender_ds_loader register_paper pids.csv output.jsonl
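The output file is in JSON Lines format, so it can be inspected one record per line. The sketch below reads such a file without assuming any particular keys, since the record layout is not documented here. The sample records are placeholders.

```python
import json
import os
import tempfile


def read_jsonl(path):
    """Return the list of JSON objects stored one per line, skipping blank lines."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


with tempfile.TemporaryDirectory() as tmp:
    output_path = os.path.join(tmp, "output.jsonl")
    # Placeholder content; real records are written by the loader.
    with open(output_path, "w", encoding="utf-8") as f:
        f.write('{"pid": "pid-0001"}\n\n{"pid": "pid-0002"}\n')
    records = read_jsonl(output_path)
```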

Generate reports from papers, sources and connections

usage: xlingual_papers_recommender_reports all [-h] reports_path

positional arguments:
  reports_path  /path

optional arguments:
  -h, --help    show this help message and exit

Example:

xlingual_papers_recommender_reports all /reports

robertatakenaka/tcc_rs
