No description, website, or topics provided.
Python
Switch branches/tags
Nothing to show
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Failed to load latest commit information.
volumes
.gitignore
README.md
analyse.py
barcodes.csv
dates-byvol.csv
dates.csv
documents.csv
errors.txt
harvest.py
references.json
unmatched_references.txt

README.md

DFAT Documents

Code to harvest DFAT's collection of Historical Documents and extract a bit of metadata. It was developed as a quick demonstration only, so it could be improved and extended in many ways.

Content on the DFAT website is made available under a CC-BY licence.

The documents are scraped from the DFAT website and converted to Markdown. You can find the results in the volumes directory. Metadata is saved to the documents.csv file. I've also published the harvested files to a new experimental website using Jekyll.

Dates are extracted from the documents using simple pattern matching. The date is then saved to the front matter of the Markdown document. I've created a plot of the distribution of documents by month, as well as one by month and volume. On the Jekyll site you'll find a date index that lists all the documents by month. At the bottom of this page you'll find a list of documents that I couldn't find a date in -- if you browse these you'll see various ways my pattern matching could be improved to cope with variations in punctuation, missing numbers etc.

I've also attempted to find references to files in the National Archives of Australia. I've then used my RecordSearch-tools code to search for these references in RecordSearch -- if found, they're added as links in the Markdown. I created a list of the references I couldn't find in RecordSearch -- once again a few patterns are obvious, and it's pretty easy to work out how my code my be improved to find more links. You can also browse a list of the linked files.