Vincent van Gogh Gallery Scraper

To train a Machine Learning model based on the Vincent van Gogh collection data, this script scrapes the museum's webpage and recovers all the available information on Vincent's work, including each painting's description, its search tags, the collection data, the image file, and the related works.

The script creates a local gallery from the web requests within a specific path and compiles all the data recovered from the online gallery into a CSV file.

Each option in the menu completes the gallery information by scraping a specific column (a minimal sketch of step 1 follows the list):

  1. Creates the gallery's index, recovering the ID, the title, and the target URL used to scrape the rest of the information.
  2. Saves the gallery's information into a CSV file.
  3. Loads the gallery's information from the CSV file.
  4. Checks the current gallery's dataframe description.
  5. Scrapes the basic description data for each of the gallery's objects.
  6. Recovers the download link for the image of each of the gallery's objects.
  7. Sets a boolean flag indicating whether each image is available in the local directory.
  8. Scrapes the search tags related to each of the gallery's objects.
  9. Scrapes the museum's object data related to each of the gallery's objects.
  10. Scrapes the related works of each of the gallery's objects.
  11. Exports each available image as RGB and B&W images.
  12. Exports all available data from the dataframe to JSON files in the local directory.
  13. Runs the full automatic execution from step 5 to step 12.
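
As a rough illustration of what step 1 produces, the sketch below extracts the ID, TITLE, and COLLECTION_URL columns from a snippet of collection HTML with bs4 and pandas. The markup, CSS class, and IDs are made up for the example; the project's own index creation relies on the Page class in Content.py under *\Lib\Recovery (with Selenium for the dynamically loaded index, as in *\Tests\test_selenium_bs4.py).

```python
# Sketch of the index idea in step 1: pull ID, TITLE, and COLLECTION_URL out of
# collection HTML with bs4. The markup below is a made-up stand-in for the museum
# page; the project's real scraping is done by the Page class in Content.py.
from bs4 import BeautifulSoup
import pandas as pd

FAKE_COLLECTION_HTML = """
<ul>
  <li><a class="art-object" href="/en/collection/d0001X0000" title="Example Painting A">Example Painting A</a></li>
  <li><a class="art-object" href="/en/collection/d0002X0000" title="Example Painting B">Example Painting B</a></li>
</ul>
"""

soup = BeautifulSoup(FAKE_COLLECTION_HTML, "html.parser")
rows = []
for link in soup.select("a.art-object"):            # hypothetical CSS class
    href = link["href"]
    rows.append({
        "ID": href.rstrip("/").split("/")[-1],      # element ID taken from the URL tail
        "TITLE": link["title"],                     # painting title
        "COLLECTION_URL": "https://www.vangoghmuseum.nl" + href,  # target URL for later steps
    })

index_df = pd.DataFrame(rows, columns=["ID", "TITLE", "COLLECTION_URL"])
print(index_df)
```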

Originally developed as the final project for the MSc degree in Digital Humanities between 2020 and 2021.

The code was refactored and commented for the official final presentation of the 2020/2021 project of the Uniandes Digital Humanities graduate program.


Development Environment

#TODO add IDE version, pyliter, python version, bs4, Selenium, pandas + links


Project Structure

LICENSE: MIT Project license description.

README: Project general description.

PROJECT STRUCTURE:

  • *\App is the main folder with the MVC (Model-View-Controller) architecture of the script; to run it, execute the view.py file and follow the console instructions (see the example after this list).

    • Model.py: module containing the Gallery class, in which the pandas dataframe works with the Page implementation to format the scraped data.
    • View.py: console interface to create, populate, and save the gallery's dataframe.
    • Controller.py: module connecting Model.py and View.py; it controls the export process to JSON format and all the data-cleaning functions.
  • *\Data is the folder containing the CSV files with the gallery's scraped data.

    • vanGoghGallery_large.csv: large gallery file with 964 records of Vincent van Gogh's work.
    • vanGoghGallery_small.csv: small gallery file with 61 records of Vincent van Gogh's work, useful for functional tests.
  • *\Lib is the main folder containing modules and classes useful for scraping the gallery's online data.

    • *\Recovery: contains the Content.py module with the Page class for scraping the VVG museum HTML pages.
    • *\Utils: contains the Error.py module with the reraise method to trace back errors in the code's execution.
  • *\Tests is the folder containing basic experiments and proofs of concept for the code developed in *\Lib.

    • test_page.py: basic tests for the Page class and its methods.
    • test_selenium_bs4.py: proof of concept for using Selenium with bs4 on the collection index.
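
To make the MVC split above concrete, here is a minimal sketch of how a gallery model, a controller, and a console view can fit together. The class internals, method names, and menu options are illustrative assumptions, not the project's actual API; in the actual project the console is started by running view.py in *\App and following its menu.

```python
# Minimal sketch of the MVC split described above; the internals are assumptions.
import pandas as pd


class Gallery:
    """Model: wraps the gallery dataframe (cf. Model.py)."""

    def __init__(self, csv_path):
        self.csv_path = csv_path
        self.data = pd.DataFrame(columns=["ID", "TITLE", "COLLECTION_URL"])

    def load(self):
        self.data = pd.read_csv(self.csv_path)

    def save(self):
        self.data.to_csv(self.csv_path, index=False)


class Controller:
    """Controller: connects the Gallery model with the console view (cf. Controller.py)."""

    def __init__(self, gallery):
        self.gallery = gallery

    def describe(self):
        return self.gallery.data.describe(include="all")


def run_console(controller):
    """View: tiny console menu in the spirit of View.py (options 2-4 only)."""
    while True:
        option = input("[2] save CSV, [3] load CSV, [4] describe dataframe, [0] exit: ").strip()
        if option == "2":
            controller.gallery.save()
        elif option == "3":
            controller.gallery.load()
        elif option == "4":
            print(controller.describe())
        elif option == "0":
            break


if __name__ == "__main__":
    run_console(Controller(Gallery("Data/vanGoghGallery_small.csv")))
```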

Data Structure

The columns of the CSV files inside the *\Data folder are described as follows (a loading sketch follows the list):

  • ID: element ID in the gallery and local folder name.
  • TITLE: title of the element in the gallery.
  • COLLECTION_URL: recovered element (painting) URL.
  • DOWNLOAD_URL: direct image URL/link for the image in the gallery.
  • HAS_PICTURE: boolean flag indicating whether there is a picture file in the local folder.
  • DESCRIPTION: JSON with the description of the element.
  • SEARCH_TAGS: JSON with the collection tags of the element.
  • OBJ_DATA: JSON with the museum object data of the element.
  • RELATED_WORKS: JSON with the related work text and URLs of the element.
  • IMG_DATA: numpy RGB matrix created from the original image.
  • IMG_SHAPE: numpy shape information from the original image.
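
A quick way to inspect these files is to load one of the CSVs with pandas and decode the JSON columns, roughly as sketched below. The JSON decoding and the HAS_PICTURE comparison are assumptions about how the cells are serialized; adjust them to the actual file contents.

```python
# Sketch of loading a gallery CSV and decoding its JSON columns. Column names come
# from the list above; the exact serialization of the cells is an assumption.
import json
import pandas as pd

df = pd.read_csv("Data/vanGoghGallery_small.csv")

JSON_COLUMNS = ["DESCRIPTION", "SEARCH_TAGS", "OBJ_DATA", "RELATED_WORKS"]
for column in JSON_COLUMNS:
    # Decode each serialized cell into a Python object; leave missing cells untouched.
    df[column] = df[column].apply(
        lambda cell: json.loads(cell) if isinstance(cell, str) else cell
    )

# Keep only the rows whose image file was actually downloaded locally.
# HAS_PICTURE may be read back as text, so compare on its string form.
with_images = df[df["HAS_PICTURE"].astype(str).str.lower() == "true"]
print(len(df), "records loaded,", len(with_images), "with a local image")
```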

Important Notes

  • Config.py files are Python scripts that work around the relative imports of the project's local dependencies. They are needed in all script folders, such as *\Lib and *\Recovery.
  • Selenium needs a special installation and configuration to execute in the local repository (a minimal setup check follows these notes). For more information go to the URLs:
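
Once Selenium and its browser driver are installed, a quick sanity check that the rendered page can be handed off to bs4 looks roughly like this; the browser choice and the URL are illustrative assumptions, and *\Tests\test_selenium_bs4.py contains the project's own proof of concept.

```python
# Minimal Selenium + bs4 check, in the spirit of test_selenium_bs4.py in *\Tests.
# The browser choice and URL are assumptions; the matching browser and driver must
# already be installed locally for this to run.
from selenium import webdriver
from bs4 import BeautifulSoup

URL = "https://www.vangoghmuseum.nl/en/collection"  # illustrative collection URL

driver = webdriver.Firefox()   # requires Firefox and geckodriver
try:
    driver.get(URL)                                   # let the browser render the page
    soup = BeautifulSoup(driver.page_source, "html.parser")
    print(soup.title.get_text(strip=True))
    print(len(soup.find_all("a")), "links found on the rendered page")
finally:
    driver.quit()
```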
