Work for the Harvard Royal African Company (RAC) research project, centered primarily on computational textual analysis of historical letters related to the RAC.
Digital versions of the original paper texts were scraped from Oxford Scholarly Editions Online using Web Scraper. A dump of the scraped data in .csv format is stored in
raw_html, organized by volume number.
The code for cleaning and data manipulation is contained within the following Jupyter Notebooks:
Clean-Scraped.ipynb - Extraction of letter numbers from raw HTML, extraction of letter text and joining with metadata based on volume and letter number.
This directory contains the text data at various stages of preprocessing, along with a hand-curated metadata file:
csv/032818_RAC_Networks_Database.xlsx - Hand-compiled metadata on the letters, with key information including unique identifiers (UID), date, place and author.
csv/texts.csv - Extracted letter texts with corresponding volume and letter number in .csv format.
metadata_text_merged.csv - Joined data file of the text file texts.csv and the metadata file, joined on volume number and letter number. This should be the primary reference data file.
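
The join on volume number and letter number can be sketched as follows. This is a minimal illustration using only the Python standard library, not the code from Clean-Scraped.ipynb. The column names (`volume`, `letter_number`, `text`, `UID`, `author`), the sample rows, and the helper `join_on_volume_and_letter` are all assumptions made for the example; the real files may use different headers.

```python
import csv
import io

# Hypothetical sample data standing in for texts.csv and the metadata file.
# The real column names and values may differ.
TEXTS_CSV = """volume,letter_number,text
1,1,First letter text
1,2,Second letter text
2,1,Third letter text
"""
METADATA_CSV = """volume,letter_number,UID,author
1,1,V1-L1,Author A
2,1,V2-L1,Author C
"""


def join_on_volume_and_letter(text_rows, metadata_rows):
    """Left-join text rows to metadata rows on (volume, letter_number).

    Returns the merged rows and the keys of texts with no metadata match.
    """
    # Index metadata by the composite key so each lookup is a dict access.
    index = {(r["volume"], r["letter_number"]): r for r in metadata_rows}
    merged, unmatched = [], []
    for row in text_rows:
        key = (row["volume"], row["letter_number"])
        meta = index.get(key)
        if meta is None:
            # Keep the text row, but record that it has no metadata.
            unmatched.append(key)
            meta = {}
        merged.append({**meta, **row})
    return merged, unmatched


text_rows = list(csv.DictReader(io.StringIO(TEXTS_CSV)))
metadata_rows = list(csv.DictReader(io.StringIO(METADATA_CSV)))
merged, unmatched = join_on_volume_and_letter(text_rows, metadata_rows)
```

Keeping the list of unmatched keys makes it easy to spot letters whose volume and letter number do not appear in the hand-compiled metadata, which is worth checking before relying on the merged file.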