<center>
  <h1>Digital Tools and Methods for the Humanities and Social Sciences</h1>
  <img src="https://raw.githubusercontent.com/sul-cidr/Workshops/master/cidr-logo.no-text.240x140.png" alt="Center for Interdisciplinary Digital Research @ Stanford"/>
</center>

<h1>Text Analysis with Python</h1>

## Front-Matter

### Instructors
- Scott Bailey (CIDR), <em>scottbailey@stanford.edu</em>
- Simon Wiles (CIDR), <em>simon.wiles@stanford.edu</em>

### Learning Objectives

Develop practical knowledge of an end-to-end workflow for text analysis in Python using two specific libraries: spaCy and textacy.

- Import data
- Clean/preprocess text data
- Analyze single documents
- Analyze a full corpus


### Topics

- Document Tokenization
- Part-of-Speech (POS) Tagging
- Named-Entity Recognition (NER)
- Corpus Analysis and Vectorization


### Jupyter Notebooks and Google Colaboratory

Jupyter notebooks let you write and run Python code interactively. They're quickly becoming a standard way of combining data, code, and written explanations or visualizations in a single, shareable document. There are many ways to run Jupyter notebooks, including locally on your own computer, but we've decided to use Google's Colaboratory notebook platform for this workshop. Colaboratory is “a Google research project created to help disseminate machine learning education and research.” If you would like to know more about Colaboratory in general, you can visit the [Welcome Notebook](https://colab.research.google.com/notebooks/welcome.ipynb).

Using the Google Colaboratory platform allows us to focus on learning and writing Python in the workshop rather than on setting up Python, which can sometimes take a bit of extra work depending on platforms, operating systems, and other installed applications. If you'd like to install a Python distribution locally, though, we have some instructions (with gifs!) on installing Python through the Anaconda distribution, which will also help you handle virtual environments: https://github.com/sul-cidr/Workshops/wiki/Installing-and-Configuring-Anaconda-and-Jupyter-Notebooks

If you run into problems, or would like to look into other ways of installing Python or handling virtual environments, feel free to send us an email (contact-cidr@stanford.edu) or visit us during our [consulting hours](https://library.stanford.edu/research/cidr/consulting).

### Environment
If you would prefer to use Anaconda or your own local installation of Python or Jupyter Notebooks, for this workshop you will need an environment with the following packages installed and available:
- `spacy`
- `textacy`

Please note that we will not have time during the workshop to support you with problems related to a local environment, and we do recommend using the Colaboratory notebooks if you are at all unsure.
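
If you do go the local route, a quick way to confirm that both packages are visible to Python is to ask the import system whether it can find them. The snippet below is a minimal sketch using only the standard library; the helper name `missing_packages` is ours and is not part of spaCy or textacy.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the names from `names` that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages(["spacy", "textacy"])
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("spacy and textacy are both available.")
```

This only checks that the packages can be located. It does not check their versions or whether a spaCy language model has been downloaded.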

### Evaluation survey
At the end of the workshop, we would be very grateful if you could spend a minute answering a few questions that will help us continue our workshop series.
- https://stanforduniversity.qualtrics.com/jfe/form/SV_cIqXu1piN6JTeJv


# Document-level Analysis with `spaCy`

Let's start by learning how spaCy works, and using it to start analyzing a single textual document. We'll work with some sample data throughout, but talk through importing larger corpora later in the workshop. 

For now, we'll start with imports, setting up the model, and working with a short text. 

In [0]:
import spacy

spaCy uses pre-trained neural network models to process text. Here we're going to download and use a medium-sized English multi-task CNN, which offers high accuracy for part-of-speech tagging and entity recognition, and includes word vectors.

In [0]:
!python -m spacy download en_core_web_md

In [0]:
# Once we've installed the model, we can import it like any other Python library
import en_core_web_md

In [0]:
# Load the language model
nlp = en_core_web_md.load()

In [0]:
# From H.G. Wells's A Short History of the World, Project Gutenberg
text = """Even under the Assyrian monarchs and especially under
Sardanapalus, Babylon had been a scene of great intellectual
activity.  {111} Sardanapalus, though an Assyrian, had been quite
Babylon-ized.  He made a library, a library not of paper but of
the clay tablets that were used for writing in Mesopotamia since
early Sumerian days.  His collection has been unearthed and is
perhaps the most precious store of historical material in the
world.  The last of the Chaldean line of Babylonian monarchs,
Nabonidus, had even keener literary tastes.  He patronized
antiquarian researches, and when a date was worked out by his
investigators for the accession of Sargon I he commemorated the
fact by inscriptions.  But there were many signs of disunion in
his empire, and he sought to centralize it by bringing a number of
the various local gods to Babylon and setting up temples to them
there.  This device was to be practised quite successfully by the
Romans in later times, but in Babylon it roused the jealousy of
the powerful priesthood of Bel Marduk, the dominant god of the
Babylonians.  They cast about for a possible alternative to
Nabonidus and found it in Cyrus the Persian, the ruler of the
adjacent Median Empire.  Cyrus had already distinguished himself
by conquering Croesus, the rich king of Lydia in Eastern Asia
Minor.  {112} He came up against Babylon, there was a battle
outside the walls, and the gates of the city were opened to him
(538 B.C.).  His soldiers entered the city without fighting.  The
crown prince Belshazzar, the son of Nabonidus, was feasting, the
Bible relates, when a hand appeared and wrote in letters of fire
upon the wall these mystical words: _"Mene, Mene, Tekel,
Upharsin,"_ which was interpreted by the prophet Daniel, whom he
summoned to read the riddle, as "God has numbered thy kingdom and
finished it; thou art weighed in the balance and found wanting and
thy kingdom is given to the Medes and Persians."  Possibly the
priests of Bel Marduk knew something about that writing on the
wall.  Belshazzar was killed that night, says the Bible.
Nabonidus was taken prisoner, and the occupation of the city was
so peaceful that the services of Bel Marduk continued without
intermission."""

In [0]:
doc = nlp(text)

Once we pass the text into the NLP model, spaCy processes the entire text and makes many features available.

## Tokenization

The doc created by spaCy immediately provides access to the word level tokens of the text.

In [0]:
for token in doc[:15]:
  print(token)

Each of these tokens has a number of properties, and we'll look a bit more closely at this in a minute when we think about preprocessing texts, but let's continue our quick tour. 

spaCy also automatically provides sentence level tokenization.

In [0]:
for sent in doc.sents:
    print(sent.text + "\n--\n")

We can collect both words and sentences into standard Python data structures.

In [0]:
sentences = [sent.text for sent in doc.sents]
sentences

In [0]:
words = [token.text for token in doc]
words[:30]

### Filtering Tokens

Let's start with cleaning the text and counting to see what we can learn.

In [0]:
# One of the common things we do in text analysis is to remove punctuation
no_punct = [token for token in doc if not token.is_punct]
for token in no_punct[:20]:
  print(token.text, token.is_punct)

In [0]:
# This has worked, but left in new line characters and spaces
no_punct_or_space = [token for token in doc if not token.is_punct and not token.is_space]
for token in no_punct_or_space[:30]:
  print(token.text)

In [0]:
# Let's say we also want to remove numbers, and lowercase everything
lower_alpha = [token.lower_ for token in no_punct_or_space if token.is_alpha]
lower_alpha[:30]

One other common bit of preprocessing is to remove stopwords, that is, the common words in a language that don't convey the information that we are looking for in our analysis. For example, if we looked for the most common words in a text, we would want to remove stopwords so that we don't only get words such as 'a,' 'the,' and 'and.'

In [0]:
clean = [token.lower_ for token in no_punct_or_space if token.is_alpha and not token.is_stop]
clean[:30]

For this piece, we've used spaCy's built-in stopword list, which is used to set the property `is_stop` for each token. There's a good chance you would want to create custom stopword lists though, especially if you're working with historical text.

In [0]:
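# We'll just pick a couple of words we know are in the example
custom_stopwords = ["assyrian", "babylon"]

# Filter the already-cleaned, lower-cased words so that punctuation,
# whitespace, numbers, and the built-in stopwords stay removed
custom_clean = [word for word in clean if word not in custom_stopwords]
custom_clean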
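
If a custom list grows beyond a few words, it can be convenient to keep it in a set and merge it with an existing list. The sketch below uses plain Python strings rather than spaCy tokens, and `base_stopwords` is a tiny stand-in we made up for illustration; it is not spaCy's actual list.

```python
# A tiny stand-in for a built-in stopword list (an assumption for illustration).
base_stopwords = {"a", "an", "and", "had", "of", "the", "under"}

# Domain-specific additions: words that are frequent in our text
# but not informative for the question we are asking.
extra_stopwords = {"assyrian", "babylon"}

# Set union gives one combined collection with fast membership tests.
all_stopwords = base_stopwords | extra_stopwords

sample_words = ["even", "under", "the", "assyrian", "monarchs", "and", "babylon"]
kept = [word for word in sample_words if word not in all_stopwords]
print(kept)
```

Keeping the additions in their own set makes it easy to see, and later revise, which words were removed for domain reasons rather than because they are generally uninformative.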

At this point, we have a list of lower-cased tokens that doesn't contain punctuation, white-space, numbers, or stopwords. Depending on our analysis, we may or may not want to do this much cleaning. But, it is good to understand how much we can do just with spaCy. 

### Counting Tokens

Let's look at what we can do now that we have lists of tokens at different stages of cleaning. We can start with just counting.

In [0]:
print("Number of tokens in document: ", len(doc))
print("Number of tokens in cleaned document: ", len(clean))
print("Number of unique tokens in cleaned document: ", len(set(clean)))

In [0]:
from collections import Counter

In [0]:
full_counter = Counter([token.lower_ for token in doc])
full_counter.most_common(20)

In [0]:
cleaned_counter = Counter(clean)
cleaned_counter.most_common(20)

**Question:** Why do we have to use a list comprehension for the non-clean doc while we can just pass a variable directly for the cleaned set of tokens?

## Part-of-Speech Tagging

Let's turn to the other aspects of the text that spaCy exposes for us. Depending on what questions we might have about the text, these will be more or less helpful. 

We'll start with parts of speech. 

In [0]:
# spaCy provides two levels of POS tagging. Here's the more general.
for token in doc[:30]:
  print(token.text, token.pos_)

In [0]:
# We also have the more specific Penn Treebank tags.
# https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
for token in doc[:30]:
  print(token.text, token.tag_)

We can group tokens by these tags in order to understand the distribution of parts of speech throughout the text.

In [0]:
nouns = [token for token in doc if token.pos_ == "NOUN"]
verbs = [token for token in doc if token.pos_ == "VERB"]
proper_nouns = [token for token in doc if token.pos_ == "PROPN"]
adjectives = [token for token in doc if token.pos_ == "ADJ"]
adverbs = [token for token in doc if token.pos_ == "ADV"]

In [0]:
pos_counts = {
    "nouns": len(nouns),
    "verbs": len(verbs),
    "proper_nouns": len(proper_nouns),
    "adjectives": len(adjectives),
    "adverbs": len(adverbs) 
}

pos_counts
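
Building one list per part of speech works well for a handful of tags. To see the whole distribution at once, a `Counter` over the tags is a compact alternative. The sketch below uses a hand-written list of (word, tag) pairs as a stand-in for a parsed document, so the tags shown are illustrative rather than actual spaCy output. With a real `doc`, the same pattern would be `Counter(token.pos_ for token in doc)`.

```python
from collections import Counter

# Hand-written (word, tag) pairs standing in for a parsed document.
tagged_sample = [
    ("Cyrus", "PROPN"), ("had", "AUX"), ("already", "ADV"),
    ("distinguished", "VERB"), ("himself", "PRON"), ("by", "ADP"),
    ("conquering", "VERB"), ("Croesus", "PROPN"),
]

# Count how often each tag occurs, ignoring the words themselves.
tag_counts = Counter(tag for _, tag in tagged_sample)
print(tag_counts.most_common(3))
```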

spaCy also provides full dependency parsing, but we're going to leave that alone for the moment. We'll turn instead to named entity recognition. 

## Named-Entity Recognition

https://spacy.io/api/annotation#named-entities

In [0]:
for ent in doc.ents:
  print(ent.text, ent.label_)

What if we only care about geo-political entities or locations?

In [0]:
ent_filtered = [(ent.text, ent.label_) for ent in doc.ents if ent.label_ in ["GPE", "LOC"]]
ent_filtered

### Visualizing Parses

spaCy also has a nice built-in visualizer.

In [0]:
from spacy import displacy

In [0]:
displacy.render(doc, style="ent", jupyter=True)

### Activity

Pick either a particular part of speech or a named entity type, and write code to determine the most common words of that type. 

# Corpus-level Analysis with `textacy`

Let's shift to thinking about a whole corpus rather than a single document.

In doing so, we could keep working with spaCy directly if the features that it exposes help us answer the research questions we are asking. 

Instead, though, we're going to take advantage of textacy, a library built on spaCy that adds features, including the notion of a Corpus and built-in analytics on it.

In [0]:
!pip install textacy

## Generating Corpora

We'll use some of the data that is included in textacy as our corpus. You could absolutely import data otherwise, whether through reading in plain text or xml files, or pulling text data and metadata from a csv file. 

In [0]:
import textacy
import textacy.datasets

In [0]:
# We'll work with some Supreme Court cases: https://chartbeat-labs.github.io/textacy/_modules/textacy/datasets/supreme_court.html
data = textacy.datasets.SupremeCourt()

In [0]:
data.download()

What we have here is a collection of Supreme Court decisions, both full text and metadata. 

Let's look at a single one to see what we have.

In [0]:
single = list(data.texts(limit=1))[0]
single[:200]

Let's go ahead and pull a full set of texts with metadata. To keep it a bit more manageable time-wise, we'll only collect 100 of the records.

In [0]:
records = data.records(limit=100)

# records is a generator: passing it to next() returns the first record (and consumes it).
next(records)
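
One consequence of working with a generator is worth spelling out: each item can be read only once, so peeking with `next` removes that item from whatever reads the generator afterwards. The sketch below shows this with a plain Python generator (`fake_records` is a made-up stand-in for the dataset) and one way to peek without losing the record.

```python
from itertools import chain


def fake_records(limit):
    """Yield (text, metadata) pairs, mimicking a stream of records."""
    for i in range(limit):
        yield (f"text {i}", {"id": i})


stream = fake_records(5)
first = next(stream)        # consumes the first record
remaining = list(stream)    # only the other 4 records are left
print(len(remaining))

# To peek and still keep everything, put the peeked item back in front.
stream = fake_records(5)
first = next(stream)
restored = list(chain([first], stream))
print(len(restored))
```

Another simple option is to call the function that produced the generator a second time, so that you start again from a fresh stream.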

textacy includes the idea of a corpus, while spaCy only has an idea of a single document, though you can compose documents in standard Python data structures. Every corpus takes some texts or text plus metadata, along with a language model.

In [0]:
corpus = textacy.Corpus(nlp, data=records)

In [0]:
corpus

In [0]:
[doc._.preview for doc in corpus[:5]]

We can see that the type of each item in the corpus is a `Doc`. This is effectively a spaCy doc with all of the calculated features. Textacy does give you some capacity to work with those features through its API, and also exposes new features, such as ngrams and ranking algorithms for single documents. We'll come back to these once we've worked a bit at the corpus level.

We can filter this corpus based on metadata once we make it.

In [0]:
# Here we'll find all the cases where more than 6 justices voted in the majority.
large_majority = list(corpus.get(lambda doc: doc._.meta["n_maj_votes"] > 6))
len(large_majority)

In [0]:
large_majority[0]._.preview

## Analyzing the Corpus

Let's look at what we get out of the box from textacy once we've built a corpus.

In [0]:
print("number of documents: ", corpus.n_docs)
print("number of sentences: ", corpus.n_sents)
print("number of tokens: ", corpus.n_tokens)

In [0]:
# We'll pass as_strings so that the results we look at will give us strings rather than unique ids.
counts = corpus.word_counts(as_strings=True)

Notice that, by default, the `word_counts` function is doing a certain amount of cleaning for you: https://chartbeat-labs.github.io/textacy/api_reference/lang_doc_corpus.html#textacy.corpus.Corpus.word_counts 

In [0]:
sorted(counts.items(), key=lambda x: x[1], reverse=True)[:20]

For an explanation of `-PRON-`, see https://spacy.io/api/annotation#lemmatization. Basically it's spaCy's way of lemmatizing pronouns. 

In [0]:
word_doc_counts = corpus.word_doc_counts(weighting="freq", smooth_idf=True, filter_stops=True, as_strings=True)

In [0]:
sorted(word_doc_counts.items(), key=lambda x:x[1], reverse=True)[:20]

We should note that these are not tf-idf values, which are term frequencies for individual docs weighted by inverse document frequency. With `weighting="freq"`, each value is instead the fraction of documents in the corpus in which the word appears (the `smooth_idf` option only matters when `weighting="idf"`). We're still getting a sense of which words seem to have the most importance across the corpus, if document frequency is a proxy for importance.
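
To make the distinction concrete, the sketch below computes both quantities by hand on a three-document toy corpus. It uses only the standard library, and the tf-idf formula here (raw term count times `log(n_docs / doc_freq)`) is one common variant chosen for illustration. textacy and scikit-learn offer several weighting and smoothing options that differ in detail.

```python
import math
from collections import Counter

toy_docs = [
    ["court", "appeal", "court"],
    ["court", "tax"],
    ["court", "appeal"],
]
n_docs = len(toy_docs)

# Document frequency: in how many documents does each word occur?
doc_freq = Counter(word for doc_words in toy_docs for word in set(doc_words))

# As a fraction of the corpus: one value per word for the whole corpus.
doc_freq_fraction = {word: count / n_docs for word, count in doc_freq.items()}


def tf_idf(doc_words):
    """Return one tf-idf value per word for a single document."""
    term_freq = Counter(doc_words)
    return {
        word: count * math.log(n_docs / doc_freq[word])
        for word, count in term_freq.items()
    }


print(doc_freq_fraction)
print(tf_idf(toy_docs[0]))
```

Note how "court" has the highest document frequency because it appears in every document, while its tf-idf value in the first document is zero for exactly the same reason.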

Textacy provides access to different algorithms that can be run on docs, such as TextRank for keyword extraction. We'll start by working on a single doc, and then look at how we might scale up to thinking about the corpus.

In [0]:
import textacy.ke

In [0]:
key_terms_textrank = textacy.ke.textrank(corpus[4])
key_terms_textrank

For comparison, we'll take a look at another algorithm, Yake. 

In [0]:
key_terms_yake = textacy.ke.yake(corpus[4])
key_terms_yake

Let's think about aggregating keywords over part of the corpus.

In [0]:
key_terms_textrank_corpus = [textacy.ke.textrank(doc) for doc in corpus[:20]]

In [0]:
key_terms_textrank_corpus

In [0]:
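# Each keyword result is a (term, score) pair; keep only the terms so that
# the same term found in different documents can be counted together.
flat_list = [term for sublist in key_terms_textrank_corpus for term, _ in sublist]
flat_list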

In [0]:
keyword_counter = Counter(flat_list)
keyword_counter.most_common(20)

### Activity:
Let's combine a few different pieces. Try filtering the corpus on some metadata to construct a sub-corpus. Then use one of the textacy keyword algorithms to determine the most common keywords across your subcorpus. 

## Keyword in context

One thing that researchers often find helpful in working with text is simply seeing keywords in context. 

In [0]:
for doc in corpus[:20]:
  textacy.text_utils.KWIC(doc.text, "agriculture")
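
To show what a keyword-in-context view involves, here is a minimal sketch written with the standard library only. It is not textacy's implementation; the function name `simple_kwic` and its `window` parameter are our own.

```python
import re


def simple_kwic(text, keyword, window=20):
    """Return (left context, match, right context) for each occurrence of keyword."""
    lines = []
    for match in re.finditer(re.escape(keyword), text, flags=re.IGNORECASE):
        left = text[max(0, match.start() - window):match.start()]
        right = text[match.end():match.end() + window]
        lines.append((left, match.group(), right))
    return lines


sample = "The Department of Agriculture regulates agriculture across state lines."
for left, word, right in simple_kwic(sample, "agriculture"):
    # Right-align the left context so the keyword lines up in a column.
    print(f"{left:>20}  {word}  {right}")
```

The window is measured in characters here, which keeps the code short but can cut words in half. A token-based window would avoid that.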

## Vectorization

Let's continue with corpus level analysis by taking advantage of textacy's vectorizer class, which wraps functionality from scikit-learn. We could just work directly in scikit-learn, but it can be nice for mental overhead to learn one library and be able to do a great deal with it. 

We'll create a vectorizer, sticking with the normal term frequency defaults but discarding words that appear in fewer than 3 documents or in more than 95% of documents. We'll also limit our features to the top 500 words according to document frequency. This means our feature set, or columns, will have a higher degree of representation across the corpus. We could vectorize according to tf-idf as well.

In [0]:
import textacy.vsm

vectorizer = textacy.vsm.Vectorizer(min_df=3, max_df=.95, max_n_terms=500)
tokenized_corpus = (doc._.to_terms_list(ngrams=1, as_strings=True,
                                        filter_punct=True, 
                                        filter_stops=True, 
                                        filter_nums=True 
                                        ) for doc in corpus)
dtm = vectorizer.fit_transform(tokenized_corpus)
dtm

We now have a matrix representation of our corpus, where rows are documents and columns (or features) are words from the corpus. The value at any given point is the number of times that the word appears in that document. Once we have a document-term matrix, we can do a few different things with it within textacy, or we could pass it to algorithms in scikit-learn or other libraries.
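
As a rough picture of what the vectorizer produces, the sketch below builds a small document-term matrix by hand with the standard library. The toy documents and the helper `build_dtm` are made up for illustration; textacy's `Vectorizer` returns a sparse matrix and supports more options than this.

```python
from collections import Counter


def build_dtm(tokenized_docs, min_df=1):
    """Return (terms, rows) where rows[i][j] counts terms[j] in document i."""
    doc_freq = Counter(term for doc_terms in tokenized_docs for term in set(doc_terms))
    # Keep only terms that appear in at least `min_df` documents.
    terms = sorted(term for term, df in doc_freq.items() if df >= min_df)
    rows = []
    for doc_terms in tokenized_docs:
        counts = Counter(doc_terms)
        rows.append([counts[term] for term in terms])
    return terms, rows


toy_corpus = [
    ["court", "appeal", "court", "statute"],
    ["court", "tax"],
    ["tax", "appeal", "tax"],
]
toy_terms, toy_rows = build_dtm(toy_corpus, min_df=2)
print(toy_terms)
for row in toy_rows:
    print(row)
```

Because "statute" appears in only one document, the `min_df=2` threshold drops it from the columns, which is the same kind of pruning that `min_df=3` performs on the real corpus.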

In [0]:
# Let's first look at some of the terms
vectorizer.terms_list[:20]

We can see that we are still getting a number of terms which ought to be filtered out, such as numbers and punctuation. We would want to clean this up more before vectorizing in the future. 

## Topic Modeling

Let's look quickly at one example of what we can do with a vectorized corpus. Topic modeling is very popular for semantic exploration of texts, and there are numerous implementations. Textacy uses implementations from scikit-learn.

Our corpus is rather small for topic modeling, but just to see how it's done here, we'll go ahead.

In [0]:
import textacy.tm

In [0]:
model = textacy.tm.TopicModel("lda", n_topics=10)
model.fit(dtm)
doc_topic_matrix = model.transform(dtm)
doc_topic_matrix

In [0]:
for topic_idx, top_terms in model.top_topic_terms(vectorizer.id_to_term, top_n=10):
  print("topic", topic_idx, ":", "   ".join(top_terms))
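
The `doc_topic_matrix` produced by `model.transform` has one row per document and one column per topic, with each value indicating how strongly that topic is represented in that document. One common way to read such a matrix is to pick the highest-weighted topic in each row. The sketch below does this on a small hand-written matrix, not on real model output.

```python
# A hand-written stand-in: 3 documents (rows) by 4 topics (columns).
toy_doc_topic = [
    [0.70, 0.10, 0.15, 0.05],
    [0.05, 0.05, 0.10, 0.80],
    [0.20, 0.45, 0.30, 0.05],
]

# For each document, find the index of the topic with the largest weight.
dominant_topics = [
    max(range(len(weights)), key=weights.__getitem__)
    for weights in toy_doc_topic
]
print(dominant_topics)
```

With a NumPy array, `argmax(axis=1)` gives the same per-row result in a single call.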