Bibliotech platform by Romain

Bibliographic analysis framework that applies topic modeling to full-text articles and makes the results browsable through a search engine.

The following libraries or docker images are used in this project:

GROBID: metadata (title, authors...) and full text extraction from the PDF (used with grobid_client_python)
OpenAlex analysis: data enrichment (adding institutions details, ORCID...)
Elasticsearch: internal search engine
Kibana: web user interface for data exploration

The code of this project is released under the AGPLv3 license.

PDF to documents

From the PDF, extract metadata and full text, and create the documents

Process the full-text TEI XML produced by GROBID to extract the metadata and the paragraphs, and store them in data/documents.parquet (each row is one paragraph/document and contains the text plus the metadata).
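The paragraph extraction described above can be sketched as follows. This is a minimal illustration and not the project's actual code: the function name `tei_to_rows`, the row fields (`title`, `paragraph_index`, `text`) and the sample TEI are assumptions made for the example. It relies only on the standard TEI namespace that GROBID uses in its output.

```python
import xml.etree.ElementTree as ET

# Namespace of TEI documents, as emitted by GROBID.
TEI = "http://www.tei-c.org/ns/1.0"
NS = {"tei": TEI}


def tei_to_rows(tei_xml: str) -> list[dict]:
    """Return one row per body paragraph, each carrying the article title.

    Hypothetical helper: the real project may keep more metadata
    (authors, DOI...) and use different column names.
    """
    root = ET.fromstring(tei_xml)

    # The article title lives in the TEI header.
    title_el = root.find(".//tei:titleStmt/tei:title", NS)
    title = "".join(title_el.itertext()).strip() if title_el is not None else ""

    rows = []
    body = root.find(".//tei:body", NS)
    if body is None:
        return rows

    for p in body.iter(f"{{{TEI}}}p"):
        # Flatten inline markup (references, formatting) and normalize whitespace.
        text = " ".join("".join(p.itertext()).split())
        if text:
            rows.append({"title": title, "paragraph_index": len(rows), "text": text})
    return rows


# Tiny hand-written TEI sample (not real GROBID output).
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><titleStmt><title>Sample article</title></titleStmt></fileDesc></teiHeader>
  <text><body>
    <div><head>Introduction</head>
      <p>First   paragraph.</p>
      <p>Second <ref>[1]</ref> paragraph.</p>
    </div>
  </body></text>
</TEI>"""

rows = tei_to_rows(SAMPLE_TEI)
for row in rows:
    print(row)
```

Assuming pandas and a Parquet engine such as pyarrow are installed, the rows could then be written with `pandas.DataFrame(rows).to_parquet("data/documents.parquet")`.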

How to run

Clone the Git repository, fill in the settings in .env.template, and rename the file to .env.

Before running docker compose, you need to run the following command on the host (Elasticsearch requires this kernel setting, and it has to be set again after each reboot):

sudo sysctl -w vm.max_map_count=262144

Then you can start the containers (the Docker image is built on the first startup):

sudo docker compose up
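As an optional pre-flight check before starting the containers, the kernel setting mentioned above can be verified from Python. This is only a sketch: the helper names `meets_requirement` and `read_max_map_count` are hypothetical and not part of the project, and the check assumes a Linux host where the value is exposed in /proc/sys/vm/max_map_count.

```python
from pathlib import Path

# Value set by the sysctl command above.
REQUIRED_MAX_MAP_COUNT = 262144


def meets_requirement(raw_value: str, required: int = REQUIRED_MAX_MAP_COUNT) -> bool:
    """Return True if the raw content of the proc file satisfies the minimum."""
    return int(raw_value.strip()) >= required


def read_max_map_count(proc_file: str = "/proc/sys/vm/max_map_count"):
    """Return the raw content of the proc file, or None if it does not exist
    (for example on a non-Linux host)."""
    path = Path(proc_file)
    if not path.is_file():
        return None
    return path.read_text()


raw = read_max_map_count()
if raw is None:
    print("vm.max_map_count could not be read on this host")
elif meets_requirement(raw):
    print("vm.max_map_count is high enough")
else:
    print("vm.max_map_count is too low, run: sudo sysctl -w vm.max_map_count=262144")
```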

TODO

BERTopic: topic modeling (separate deployment)

Romain THOMAS 2024
