An internet crawler that digs through the resources of the PAN Biblioteka Gdańska
config
deployment
image_detector
tests
.gitignore
LICENSE
README.md
__init__.py
analyzer.py
converter.py
downloader.py
gif_downloader.py
oai_api.py
pga.py
requirements.txt
twitter_api.py
utils.py


PAN Kreator bot

PAN Kreator bot is an internet crawler that digs through the resources of the PAN Biblioteka Gdańska and posts interesting results on Twitter and Facebook.

The bot uses the OAI-PMH API to connect to pbc.gda.pl and run a query. A matching record is downloaded, unzipped, and converted from DjVu to JPG. Finally, the image is posted on Twitter.
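As a rough illustration of the query step, here is a minimal sketch that builds an OAI-PMH request URL and pulls record identifiers out of a response. It is not the code from `oai_api.py`: the helper names and the `https://example.org/oai` base URL are placeholders made up for this example, and the sample response is hand-written. Only the `verb` and `metadataPrefix` parameters and the XML namespace come from the OAI-PMH protocol itself.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespace defined by the OAI-PMH 2.0 protocol.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def build_oai_url(base_url, verb, **params):
    """Build an OAI-PMH request URL (hypothetical helper, not the bot's own)."""
    query = {"verb": verb}
    # Skip parameters that were not provided.
    query.update({key: value for key, value in params.items() if value is not None})
    return base_url + "?" + urlencode(query)


def extract_identifiers(xml_text):
    """Return the record identifiers found in an OAI-PMH response."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.findall(".//oai:header/oai:identifier", OAI_NS)]


# Hand-written sample response, used here instead of a real network call.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <header>
      <identifier>oai:example.org:101</identifier>
      <datestamp>2017-01-01</datestamp>
    </header>
    <header>
      <identifier>oai:example.org:102</identifier>
      <datestamp>2017-01-02</datestamp>
    </header>
  </ListIdentifiers>
</OAI-PMH>
"""

# Placeholder endpoint; the real pbc.gda.pl endpoint is not given in this README.
url = build_oai_url("https://example.org/oai", "ListIdentifiers", metadataPrefix="oai_dc")
identifiers = extract_identifiers(SAMPLE_RESPONSE)
print(url)
print(identifiers)
```

In a real run, the response text would come from an HTTP request to the library's OAI-PMH endpoint rather than from a string constant.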

But this is just part of the bot's abilities. This guy uses a machine learning algorithm (a Support Vector Machine) to get an idea of the content of the downloaded book. He's able to tell the difference between text, a blank page, and an image (preferably a figure). The bot goes through all the pages of a book and picks only those that are worth posting from his point of view. When the book ends, he chooses the page that seems to contain the highest percentage of images.
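The page selection described above could look roughly like the sketch below. It rests on an assumption that this README does not confirm: that every page is split into tiles, each tile gets one of the three labels, and the "percent of images" is the share of tiles labeled `image`. The function names are hypothetical.

```python
from collections import Counter


def image_fraction(tile_labels):
    """Share of tiles on one page that were labeled "image"."""
    if not tile_labels:
        return 0.0
    return Counter(tile_labels)["image"] / len(tile_labels)


def pick_best_page(pages):
    """Return the page number with the highest image fraction.

    `pages` maps a page number to the list of labels predicted for its tiles.
    Returns None for an empty book.
    """
    if not pages:
        return None
    return max(pages, key=lambda number: image_fraction(pages[number]))


# Made-up predictions for a three-page book.
book = {
    1: ["text", "text", "text", "text"],
    2: ["blank", "blank", "blank", "blank"],
    3: ["image", "image", "text", "blank"],
}
best_page = pick_best_page(book)
print(best_page, image_fraction(book[best_page]))
```

When several pages tie, `max` keeps the first one it sees, so the earliest page wins.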

How does he know what to look for?

The bot was initially taught by a human to distinguish three categories of pages. We used a set of 368 hand-labeled images that contained different kinds of content.

For example, this was marked as text (which we don't want to publish on Twitter):

this as a blank page (also not very interesting):

but this as an image, because it contains something different and possibly worth showing:

The effectiveness of the image recognition is quite hard to predict, but that is what makes the results of the bot's work interesting.
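To give a feel for the training step, here is a minimal sketch of fitting a Support Vector Machine on three page categories with scikit-learn. Everything about the data is an assumption: the two features (mean brightness and brightness spread), the cluster centers, and the 40 synthetic samples per class are invented for this example and are not the 368 images or the features the bot really uses.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical feature centers: (mean brightness, brightness spread) per category.
CENTERS = {
    "text": (0.85, 0.25),
    "blank": (0.97, 0.02),
    "image": (0.50, 0.30),
}


def make_synthetic_pages(n_per_class, seed=0):
    """Generate made-up feature vectors clustered around each category center."""
    rng = np.random.default_rng(seed)
    features, labels = [], []
    for label, center in CENTERS.items():
        points = rng.normal(loc=center, scale=0.02, size=(n_per_class, 2))
        features.append(points)
        labels.extend([label] * n_per_class)
    return np.vstack(features), np.array(labels)


def train_page_classifier(features, labels):
    """Fit an SVM with an RBF kernel on the labeled feature vectors."""
    classifier = SVC(kernel="rbf", C=10.0, gamma="scale")
    classifier.fit(features, labels)
    return classifier


X, y = make_synthetic_pages(n_per_class=40)
classifier = train_page_classifier(X, y)

# Classify three unseen, made-up pages, one near each category center.
new_pages = np.array([[0.86, 0.24], [0.96, 0.03], [0.48, 0.31]])
predicted = list(classifier.predict(new_pages))
print(predicted)
```

With real scans the categories overlap far more than these tidy clusters do, which is one reason the recognition results are hard to predict.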

To check what PAN Kreator has found recently, please visit his Twitter or Facebook page.

https://twitter.com/PAN_Kreator

https://www.facebook.com/pankreatorbot/

Please follow him if you like this!