# A Humanist's Cookbook for Natural Language Processing in Python

Brandon Walsh and Rebecca Bultman

## Table of Contents

### Introduction

    "How did I do that last time?" - Brandon to himself, every day

The project began with the goal of keeping Brandon from reinventing the wheel when working on natural language processing projects. By collecting common scripts and approaches for personal use, it served as a reference point that would be easier to consult than searching back through older repositories on GitHub when presented with new iterations of old problems. Eventually, once Rebecca came on board, we started documenting the methods with a more general audience in mind.

The project is presented as a series of notebooks, a collection of Python 3 recipes for common problems and issues associated with preparing data for text analysis and natural language processing. The target audience is students or intermediate programmers who have begun to learn their way around Python but who need a little help pulling the pieces together to get something done. The goal is twofold:

1. Present codeblocks for common problems.
2. Contextualize those blocks with humanists in mind.

The text, at this point at least, makes no real attempt at coverage. Nor is there a real sequence through which someone should be expected to work. It simply gathers scripts relevant to our work as we came to them, presented for a more general audience. There are a number of ways to approach any NLP problem, and any tweak can have big ramifications for the results. Hopefully this text will help intermediate programmers get going and give them a better understanding of what they're working with.

While the notebooks primarily deal with Python, they do occasionally make reference to the terminal. Both authors use Macs, so there may be differences, in installations especially, when working with other operating systems. Otherwise things should be more or less consistent across platforms.

We'll periodically add new things as we get to them. Initial release and last updated Summer of 2020. Issues found with things here can be logged on the [issue tracker for our GitHub repository](https://github.com/walshbr/humanists-nlp-cookbook/issues/new).

This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</a>.

**2/19/2021 note: the texts in this cookbook are all written in or translated to English. While working through the book, we did work with ancient Greek and ancient Hebrew, primarily by supplementing what you see here with the [Classical Language Toolkit](http://cltk.org/), and we would strongly recommend readers working in those languages consult the CLTK directly. This cookbook deals mostly with plumbing issues that occur prior to actually working with the content of a text, so much of it *should* be language agnostic. But different languages might present different difficulties for tokenization or for developing a pipeline specific to your corpus. Readers working in other languages might benefit from [Quinn Dombrowski's work on issues in multilingual text analysis](https://www.quinndombrowski.com/?q=blog/2020/10/15/whats-word-multilingual-dh-and-english-default) as a useful primer on some of the issues involved. Thanks to [Jennifer Isasi](https://jenniferisasi.github.io/) for the feedback on this point!**

### First Steps

* [Set up](set_up.ipynb)
* [Working with the file structure](file_structure.ipynb)

### Getting Data

* [Working with plain text files](plain_text.ipynb)
* [Getting data from websites](scraping.ipynb)
* [Working with TEI](tei.ipynb)

### Preparing Data

* [Stopwords](stopwords.ipynb)
* [Dividing your text](dividing.ipynb)
* [Preparing a corpus pipeline](corpus.ipynb)