Bibliographic analysis framework using full-text articles, with topic modeling, browsable through a search engine.
The following libraries or docker images are used in this project:
- GROBID: metadata (title, authors...) and full text extraction from the PDF (used with grobid_client_python)
- OpenAlex analysis: data enrichment (adding institutions details, ORCID...)
- Elasticsearch: internal search engine
- Kibana: web user interface for data exploration
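As an illustration of the OpenAlex enrichment step, here is a minimal sketch that flattens the author and institution details of one work record. It assumes the record follows the OpenAlex work schema (an `authorships` list whose entries hold an `author` and a list of `institutions`). The function name `summarize_authorships` and the sample record are hypothetical and are not taken from this project's code.

```python
from typing import Any


def summarize_authorships(work: dict[str, Any]) -> list[dict[str, Any]]:
    """Flatten the `authorships` list of an OpenAlex-style work record."""
    rows = []
    for authorship in work.get("authorships", []):
        author = authorship.get("author") or {}
        institutions = authorship.get("institutions") or []
        rows.append(
            {
                "name": author.get("display_name"),
                "orcid": author.get("orcid"),
                "institutions": [i.get("display_name") for i in institutions],
                "countries": [i.get("country_code") for i in institutions],
            }
        )
    return rows


# Hand-written placeholder shaped like an OpenAlex work record (not real data).
sample_work = {
    "authorships": [
        {
            "author": {
                "display_name": "Jane Doe",
                "orcid": "https://orcid.org/0000-0000-0000-0000",
            },
            "institutions": [
                {"display_name": "Example University", "country_code": "FR"}
            ],
        }
    ]
}

for row in summarize_authorships(sample_work):
    print(row)
```

Keeping the flattening separate from the HTTP call makes it easy to check on a saved record before querying the API.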
The code of this project is released under the AGPLv3 license.
From the PDF, extract metadata and full text, and create the documents
Process the full-text TEI XML from GROBID to extract the metadata and the paragraphs, and store them in data/documents.parquet
(each row is a paragraph/document and contains the text + the metadata).
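The paragraph extraction could look roughly like the sketch below. It is not the project's actual implementation. It assumes the GROBID output uses the standard TEI namespace, keeps only the title as metadata, and uses a hypothetical function name (`tei_to_rows`) and a hand-written sample document.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
NS = {"tei": TEI_NS}


def tei_to_rows(tei_xml: str) -> list[dict[str, str]]:
    """Return one row per body paragraph, each carrying the article title."""
    root = ET.fromstring(tei_xml.strip())
    title_el = root.find(".//tei:titleStmt/tei:title", NS)
    title = "".join(title_el.itertext()).strip() if title_el is not None else ""

    rows = []
    body = root.find(".//tei:body", NS)
    if body is None:
        return rows
    for p in body.iter(f"{{{TEI_NS}}}p"):
        # itertext() also collects text nested in inline elements such as <ref>.
        text = " ".join("".join(p.itertext()).split())
        if text:
            rows.append({"title": title, "text": text})
    return rows


# Hand-written sample, much smaller than a real GROBID output.
SAMPLE_TEI = """
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><titleStmt>
    <title level="a" type="main">Sample title</title>
  </titleStmt></fileDesc></teiHeader>
  <text><body><div>
    <head>Introduction</head>
    <p>First <ref>[1]</ref> paragraph.</p>
    <p>Second paragraph.</p>
  </div></body></text>
</TEI>
"""

rows = tei_to_rows(SAMPLE_TEI)
print(rows)

# Storing the rows would then be one call, assuming pandas and a parquet
# engine (pyarrow or fastparquet) are installed:
#   pandas.DataFrame(rows).to_parquet("data/documents.parquet")
```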
Clone the git repository, configure the .env.template file, and rename it to .env.
Before running docker compose, you need to run the following command on the host (after each reboot):
sudo sysctl -w vm.max_map_count=262144
Then you can run the containers (the docker build is triggered at the first startup):
sudo docker compose up
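If Elasticsearch fails to start, the kernel setting above is worth checking first. The helper below is a small sketch, not part of this project, that reads the current value on a Linux host. The function name `max_map_count_ok` is hypothetical, and the path is the standard procfs location for this setting.

```python
from pathlib import Path

REQUIRED_MAX_MAP_COUNT = 262144


def max_map_count_ok(
    path: str = "/proc/sys/vm/max_map_count",
    required: int = REQUIRED_MAX_MAP_COUNT,
) -> bool:
    """Return True if the value stored at `path` is at least `required`."""
    current = int(Path(path).read_text().strip())
    return current >= required
```

On the host, calling `max_map_count_ok()` with no arguments returns `False` when the sysctl command still needs to be run.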
- BERTopic: topic modeling (separate deployment)
Romain THOMAS 2024