yngalxx/Master_degree

Master's thesis: The task of image retrieval in historical publications

Digitized historical publications (magazines, books, documents) abound in iconography (engravings, illustrations, photographs, diagrams). In view of this, a natural need arises to search for images that fulfill a given information need.

This master's thesis aims to build an application for searching visual content in newspapers made available through the Chronicling America project. The work covered:

  • Acquiring high-resolution images with a scraper.
  • Preprocessing the data.
  • Training and evaluating the object detection model.
  • Running predictions and visualizing the results.
  • Cropping visual content from the test set of original images using the predicted bounding boxes, applying OCR to the crops and storing the results in an SQLite database.
  • Implementing a full-text search engine.
  • Creating a GUI that selects the appropriate visual content for the user's text query, based on the predictions and OCR results.
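
The storage and search steps above can be sketched as follows. This is a minimal illustration, not the repository's implementation: the `crops` table, its columns, the helper names (`create_store`, `add_crop`, `search`) and the ranking rule (count of distinct query terms found in the OCR text) are all assumptions made for the example, and the sample rows are invented.

```python
import re
import sqlite3


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def create_store(conn):
    # Hypothetical schema: one row per cropped region with its OCR text.
    conn.execute(
        """CREATE TABLE IF NOT EXISTS crops (
               id INTEGER PRIMARY KEY,
               page TEXT NOT NULL,
               label TEXT NOT NULL,
               x1 INTEGER, y1 INTEGER, x2 INTEGER, y2 INTEGER,
               ocr_text TEXT NOT NULL
           )"""
    )


def add_crop(conn, page, label, box, ocr_text):
    """Store one predicted bounding box together with its OCR output."""
    x1, y1, x2, y2 = box
    conn.execute(
        "INSERT INTO crops (page, label, x1, y1, x2, y2, ocr_text) "
        "VALUES (?, ?, ?, ?, ?, ?, ?)",
        (page, label, x1, y1, x2, y2, ocr_text),
    )


def search(conn, query, limit=10):
    """Rank crops by how many distinct query terms occur in their OCR text."""
    terms = set(tokenize(query))
    scored = []
    rows = conn.execute("SELECT id, page, label, ocr_text FROM crops")
    for crop_id, page, label, ocr_text in rows:
        hits = len(terms & set(tokenize(ocr_text)))
        if hits:
            scored.append((hits, crop_id, page, label))
    # Best match first; ties are broken by insertion order (row id).
    scored.sort(key=lambda row: (-row[0], row[1]))
    return scored[:limit]


# Invented sample data, for illustration only.
conn = sqlite3.connect(":memory:")
create_store(conn)
add_crop(conn, "page_001.jpg", "photograph", (120, 80, 640, 520),
         "President visits the new harbor")
add_crop(conn, "page_001.jpg", "advertisement", (700, 90, 980, 400),
         "Fine harbor view lots for sale")
add_crop(conn, "page_002.jpg", "map", (50, 60, 900, 1200),
         "Map of the western front")
results = search(conn, "new harbor")
```

Here `results` lists the photograph first (two matching terms) and the advertisement second (one matching term). A production search engine would typically add lemmatization and a proper relevance score instead of a raw term count.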

Dataset described here: https://news-navigator.labs.loc.gov

Instructions:

  1. Before use:
    • clone the repository,
    • install requirements,
    • install Tesseract OCR using Homebrew (run "brew install tesseract"),
    • install the spaCy English language model (run "python -m spacy download en_core_web_sm"),
    • run "python setup.py install"
  2. Run "python scraper_runner.py" to obtain high-resolution images from the Newspaper Navigator project.
  3. Run "python preprocessing_runner.py" to create model input data from source annotations files.
  4. Run "python model_runner.py" to start training, evaluation or both (feel free to try various argument values).
  5. Run "python metric_runner.py" to calculate the average precision (AP) for each class, as well as its mean value (mAP).
  6. Run "python visualization_runner.py" to visualize several random model predictions.
  7. Run "python predict_runner.py" to run a prediction on a single newspaper image of your own.
  8. Run "python ocr_runner.py" to crop visual content from the test set of original images using the predicted bounding boxes, apply OCR to the crops and store the results in an SQLite database.
  9. Run "python gui_runner.py" to launch the GUI for full-text search through the OCR results of the predicted visual content.
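
To make the metric in step 5 concrete, the sketch below computes AP for each class and their mean (mAP). It assumes all-point interpolation of the precision-recall curve and assumes each detection has already been marked as a true or false positive (for example by IoU matching against the annotations). The function names and the sample numbers are illustrative, and "metric_runner.py" may use a different AP definition.

```python
def average_precision(detections, num_ground_truth):
    """All-point interpolated AP for one class.

    detections: list of (confidence, is_true_positive) pairs.
    num_ground_truth: number of annotated objects of this class.
    """
    if num_ground_truth == 0:
        return 0.0
    # Walk through detections from most to least confident.
    ordered = sorted(detections, key=lambda d: d[0], reverse=True)
    precisions, recalls = [], []
    tp = fp = 0
    for _, is_tp in ordered:
        if is_tp:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_ground_truth)
    # Interpolation envelope: replace each precision with the maximum
    # precision reached at the same or any later (higher-recall) point.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Sum the area under the interpolated precision-recall curve.
    ap, prev_recall = 0.0, 0.0
    for precision, recall in zip(precisions, recalls):
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap


def mean_average_precision(per_class):
    """per_class maps class name -> (detections, num_ground_truth)."""
    aps = {
        name: average_precision(dets, n_gt)
        for name, (dets, n_gt) in per_class.items()
    }
    if not aps:
        return aps, 0.0
    return aps, sum(aps.values()) / len(aps)


# Invented example: two classes with hand-labelled detections.
per_class = {
    "photograph": ([(0.9, True), (0.8, False), (0.7, True)], 2),
    "map": ([(0.6, True)], 1),
}
aps, mean_ap = mean_average_precision(per_class)
```

For this toy input the photograph class scores an AP of about 0.833, the map class scores 1.0, and the mAP is about 0.917.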

IMPORTANT:

  • If you intend to use a GPU, install PyTorch with the following command: "pip3 install torch==1.10.2+cu113 torchvision==0.11.3+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html".
  • Each script in the 'runner' directory is a command-line application (except 'constants.py', where you can edit the default arguments). Run any of them with '--help' to see a description of its arguments.
  • Valid paths are generated automatically, but you can override them with the Click command-line arguments of each runner.

Gallery (screenshots not included here):

  • Welcoming window
  • Search results window
  • Detailed single result window
