Work for Harvard Royal African Company research project
aliases
csv
ocr_deprecated
scraped
word2vec
.gitignore
Clean-Scraped.ipynb
Cooccurrence-Analysis.ipynb
Date-Analysis.ipynb
Edit-Distance.ipynb
Exploratory-Analysis.ipynb
Generate-Search-Spreadsheets.ipynb
Helper-Sort-Sender-Names.ipynb
Interactive-Search.ipynb
Interloper-Analysis.ipynb
Most-Common-Terms.ipynb
README.md
requirements.txt


RAC

Work for the Harvard Royal African Company (RAC) research project, centered primarily on computational textual analysis of historical letters related to the RAC.

Directory Structure

Raw HTML

Digital versions of the original paper texts were scraped from Oxford Scholarly Editions Online using Web Scraper. A dump of the scraped data in .csv format is stored in raw_html, organized by volume number.

Jupyter Notebooks

The code for data cleaning and manipulation is contained in the following Jupyter Notebooks:

Clean-Scraped.ipynb - Extracts letter numbers and letter text from the raw HTML, then joins the text with the metadata on volume and letter number.
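
As a rough illustration of the extraction step, the sketch below strips the markup from one scraped HTML fragment and reads a leading letter number with a regular expression. It is not the notebook's actual code: the fragment layout (letter number followed by a period at the start of the visible text), the sample string, and the names `TextExtractor` and `extract_letter` are all assumptions made for the example.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML fragment, ignoring the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def extract_letter(raw_html):
    """Return (letter_number, text) for one scraped fragment.

    Assumes (hypothetically) that the visible text begins with the
    letter number followed by a period, e.g. "12. <letter text>".
    Returns (None, text) when no leading number is found.
    """
    parser = TextExtractor()
    parser.feed(raw_html)
    parser.close()
    # Collapse the runs of whitespace left behind by the removed markup.
    text = " ".join("".join(parser.parts).split())
    match = re.match(r"(\d+)\.\s*(.*)", text, flags=re.DOTALL)
    if match is None:
        return None, text
    return int(match.group(1)), match.group(2)


# Invented sample fragment, not taken from the scraped data.
sample = "<div><span>12.</span> <p>Wee have received yours of the 3d instant.</p></div>"
number, body = extract_letter(sample)
```

The real scraped HTML may mark letter numbers differently (for example in a dedicated element or attribute), in which case the regular expression would need to change.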

csv Directory

This directory contains the text data at successive stages of preprocessing, along with a hand-curated metadata file:

  1. csv/032818_RAC_Networks_Database.xlsx - Hand-compiled metadata on the letters, with key information including unique identifiers (UID), date, place and author.
  2. csv/texts.csv - Extracted letter texts with the corresponding volume and letter number, in .csv format.
  3. metadata_text_merged.csv - The text file texts.csv joined with the metadata file on volume number and letter number. This should be the primary reference data file.
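
The join on volume number and letter number can be sketched as follows. This is a minimal illustration using only the standard library, not the project's code: the column names (`volume`, `letter`, `uid`, `date`, `text`), the sample rows, and the function name `merge_on_volume_and_letter` are assumptions, and the choice of an inner join is a guess.

```python
import csv
import io

# Invented sample rows; the headers are assumed, not the project's actual ones.
TEXTS_CSV = """volume,letter,text
1,1,First letter text
1,2,Second letter text
2,1,Third letter text
"""

METADATA_CSV = """uid,volume,letter,date
A001,1,1,1681-01-05
A002,1,2,1681-01-19
"""


def merge_on_volume_and_letter(texts_file, metadata_file):
    """Inner-join text rows with metadata rows on (volume, letter)."""
    # Index the metadata by its composite key for constant-time lookups.
    metadata = {
        (row["volume"], row["letter"]): row
        for row in csv.DictReader(metadata_file)
    }
    merged = []
    for row in csv.DictReader(texts_file):
        key = (row["volume"], row["letter"])
        if key in metadata:
            # Keep every metadata column and append the letter text.
            merged.append({**metadata[key], "text": row["text"]})
    return merged


rows = merge_on_volume_and_letter(
    io.StringIO(TEXTS_CSV), io.StringIO(METADATA_CSV)
)
```

In this sketch, letters with no matching metadata row (volume 2, letter 1 in the sample) are dropped. Whether the project keeps or drops such rows is not stated here.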

Questions?

Email wenzhexue2014@gmail.com
