romain894/bibliotech

Bibliotech platform by Romain

Bibliographic analysis framework using full-text articles, with topic modeling, browsable through a search engine.

The following libraries and Docker images are used in this project:

  • GROBID: metadata (title, authors...) and full text extraction from the PDF (used with grobid_client_python)
  • OpenAlex analysis: data enrichment (adding institutions details, ORCID...)
  • Elasticsearch: internal search engine
  • Kibana: web user interface for data exploration

The code of this project is licensed under AGPLv3.

PDF to documents

From the PDF, extract the metadata and full text, and create the documents.

Process the full-text TEI XML from GROBID to extract the metadata and the paragraphs, and store them in data/documents.parquet (each row is a paragraph/document and contains the text + the metadata).
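The paragraph extraction described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the project's actual code: the function name `tei_to_rows`, the row keys (`title`, `paragraph_index`, `text`) and the sample document are assumptions made for the example. It relies only on the TEI namespace and on paragraphs being `<p>` elements under `<body>`.

```python
import xml.etree.ElementTree as ET

# Namespace used by TEI documents.
TEI = "http://www.tei-c.org/ns/1.0"
NS = {"tei": TEI}


def tei_to_rows(tei_xml: str) -> list[dict]:
    """Return one row per non-empty body paragraph (hypothetical row layout)."""
    root = ET.fromstring(tei_xml)

    # Document-level metadata: only the title is read in this sketch.
    title_el = root.find(".//tei:titleStmt/tei:title", NS)
    title = "".join(title_el.itertext()).strip() if title_el is not None else ""

    rows = []
    body = root.find(".//tei:body", NS)
    if body is None:
        return rows

    for p in body.iter(f"{{{TEI}}}p"):
        # Flatten inline markup (references, formulas...) and normalize whitespace.
        text = " ".join("".join(p.itertext()).split())
        if text:
            rows.append({"title": title, "paragraph_index": len(rows), "text": text})
    return rows


# Tiny hand-written sample, not real GROBID output.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><titleStmt><title>Sample title</title></titleStmt></fileDesc></teiHeader>
  <text><body>
    <div><p>First paragraph.</p></div>
    <div><p>Second paragraph with <ref>a ref</ref>.</p><p>   </p></div>
  </body></text>
</TEI>"""

rows = tei_to_rows(SAMPLE)
for row in rows:
    print(row["paragraph_index"], row["title"], "|", row["text"])
```

Assuming pandas and a Parquet engine such as pyarrow are installed, `pandas.DataFrame(rows).to_parquet("data/documents.parquet")` would then write one row per paragraph. The real column names used by the project are not stated here.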

How to run

Clone the Git repository, then configure .env.template and rename it to .env.

Before running docker compose, you need to run the following command on the host (and again after each reboot):

sudo sysctl -w vm.max_map_count=262144
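Since the setting is lost at reboot, it can be useful to check it before starting the containers. The sketch below is an optional helper, not part of the project: the function name `max_map_count_ok` is hypothetical, and it assumes a Linux host where the current value is exposed in /proc/sys/vm/max_map_count. The threshold is the value used in the command above.

```python
from pathlib import Path

# Value set by the sysctl command above.
REQUIRED_MAX_MAP_COUNT = 262144


def max_map_count_ok(
    path: str = "/proc/sys/vm/max_map_count",
    required: int = REQUIRED_MAX_MAP_COUNT,
) -> bool:
    """Return True if the value stored in `path` is at least `required`."""
    try:
        current = int(Path(path).read_text().strip())
    except (OSError, ValueError):
        # File missing (non-Linux host) or unexpected content.
        return False
    return current >= required


if __name__ == "__main__":
    if max_map_count_ok():
        print("vm.max_map_count is high enough")
    else:
        print("run: sudo sysctl -w vm.max_map_count=262144")
```

The path is a parameter only so that the check can be exercised against a temporary file.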

Then you can run the containers (the docker build is triggered at the first startup):

sudo docker compose up

TODO

  • BERTopic: topic modeling (separate deployment)

Romain THOMAS 2024
