<center>
  <h1>Digital Tools and Methods for the Humanities and Social Sciences</h1>
  <img src="https://raw.githubusercontent.com/sul-cidr/Workshops/master/cidr-logo.no-text.240x140.png" alt="Center for Interdisciplinary Digital Research @ Stanford"/>
</center>

<h1>Text Analysis with Python (and spaCy/textacy)</h1>

### Instructors
- Peter Broadwell (CIDR), <em>broadwell@stanford.edu</em>
- Simon Wiles (CIDR), <em>simon.wiles@stanford.edu</em>

### Signing in
Please sign in for this workshop at https://signin.cidr.link/Text_Analysis_with_Python/ -- when you've submitted the sign-in form, please keep the evaluation form open in a browser tab as a reminder to complete it when the workshop is over.

### About the workshop

**Learning objective**: To develop practical knowledge of methods for analyzing single documents and multi-text corpora in Python using two popular libraries: spaCy and textacy.

### Topics

- Document Tokenization
- Part-of-Speech (POS) Tagging
- Named-Entity Recognition (NER)
- Corpus Vectorization
- Topic Modeling
- Document Similarity
- Stylistic Analysis

**Note:** The examples from this workshop use English texts, but all of the methods are applicable to other languages. The availability of specialized resources (parsing rules, dictionaries, trained models) can vary considerably by language, however.

### A brief word about terms

**Text analysis** involves extraction of information from significant amounts of free-form text, e.g., literature (prose, poetry), historical records, long-form survey responses, legal documents. Some of these techniques also apply to short-form text data, including text that is already in tabular format.

Text analysis methods are built upon techniques for **Natural Language Processing** (NLP), which began as rule-based approaches to parsing human language and eventually incorporated statistical machine learning methods as well as, most recently, neural network/deep learning-based approaches.

**Text mining** typically refers to the extraction of information from very large corpora of unstructured texts.

### Jupyter notebooks and Google Colaboratory

Jupyter notebooks are a way to write and run Python code interactively. They're now a standard tool for putting together data, code, and written explanations or visualizations into a single shareable document. There are many ways to run Jupyter notebooks, including locally on your own computer, but we've decided to use Google's Colaboratory notebook platform for this workshop. Colaboratory is “a Google research project created to help disseminate machine learning education and research.” If you would like to know more about Colaboratory in general, you can visit the [Welcome Notebook](https://colab.research.google.com/notebooks/welcome.ipynb).

Using the Google Colaboratory platform allows us to focus on learning and writing Python in the workshop rather than on setting up Python, which can sometimes take a bit of extra work depending on platforms, operating systems, and other installed applications. If you'd like to install a Python distribution locally, though, we have some instructions (with gifs!) on installing Python through the Anaconda distribution, which will also help you handle virtual environments: https://github.com/sul-cidr/Workshops/wiki/Installing-and-Configuring-Anaconda-and-Jupyter-Notebooks

If you run into problems, or would like to look into other ways of installing Python or handling virtual environments, feel free to send us an email (contact-cidr@stanford.edu) for an online consultation.

### Environment
If you would prefer to use Anaconda or your own local installation of Python or Jupyter Notebooks, you will need an environment with the following packages installed and available to complete this workshop:
- `spacy`
- `textacy`

Please note that we will not have time during the workshop to support you with problems related to a local environment, so we do recommend using the Colaboratory notebooks during the workshop.

### Evaluation survey
At the end of the workshop, we would be very grateful if you would please spend a minute answering a few questions that will help us continue to develop our workshop series.
- https://evaluations.cidr.link/Text_Analysis_with_Python/

## Why spaCy and textacy?

The language processing features of spaCy and the corpus analysis methods of textacy together offer a wide range of functionality for text analysis in a well-maintained and well-documented software package that incorporates cutting-edge techniques as well as standard approaches.

The "C" in spaCy (and textacy) stands for Cython, which is Python that is compiled to C code and thus offers some performance advantages over interpreted Python, especially when working with large machine-learning models. The use of machine-learning models, including neural networks, is a key feature of spaCy and textacy. The developers of spaCy have also created [Prodigy](https://prodi.gy/), a similarly leading-edge but approachable tool for training custom machine-learning models for text analysis, among other uses.

### Other Python-based text analysis tools

The powerful and easy-to-use string manipulation features built into Python and its standard library, along with its flexible data structures and straightforward web and file I/O, make the language a popular choice for text processing. Numerous other libraries incorporating sophisticated features for text analysis and natural language processing have been built upon these capabilities.

[nltk](https://www.nltk.org/) -- Natural Language Toolkit; old-school symbolic and statistical natural language processing; strongest support for English

[Stanza](https://github.com/stanfordnlp/stanza/) -- a wrapper for accessing the Java-based Stanford CoreNLP package; also now has a pipeline for using neural networks for NLP tasks

[TextBlob](https://textblob.readthedocs.io/en/dev/) -- similar to spaCy in ease of use, though not as expansive in functionality and also limited to English

[scikit-learn](https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html) -- the central machine learning library collection for Python; includes many functions for text analysis and model building, including vectorization and topic models (which textacy uses behind the scenes)

[Gensim](https://radimrehurek.com/gensim/) -- a popular library for higher-level analyses like semantic word embedding; also does topic modeling

[flairNLP](https://github.com/flairNLP/flair) -- a somewhat more bleeding-edge NLP library for multiple languages that incorporates deep-learning frameworks

It's also possible to build text analysis models directly upon Python-friendly neural network/deep learning platforms like TensorFlow and PyTorch, although some of the tools above offer similar features with much less hassle.

Finally, the big cloud computing platforms all offer various text processing capabilities, often with Python APIs -- though it's recommended to get familiar with locally run libraries like those above so that you can judge whether using cloud services is warranted.

# Document-level analysis with `spaCy`

Let's start by learning how spaCy works and using it to begin analyzing a single text document. We'll work with larger corpora later in the workshop.

In [None]:
!pip install spacy

In [None]:
import spacy

spaCy uses pre-trained statistical and deep-learning [models](https://spacy.io/models/en) to process text. The models are differentiated by language (17 languages are supported at present), capabilities, training text, and size. Smaller models are more efficient; larger models are more accurate. Here we'll download and use a medium-sized English multi-task model, which supports part-of-speech tagging and entity recognition and includes a word vector model.

In [None]:
!python -m spacy download en_core_web_md

In [None]:
# Once we've installed the model, we can import it like any other Python library
import en_core_web_md

In [None]:
# This instantiates a spaCy text processor based on the installed model
nlp = en_core_web_md.load()

In [None]:
# From H.G. Wells's A Short History of the World, Project Gutenberg 
text = """Even under the Assyrian monarchs and especially under
Sardanapalus, Babylon had been a scene of great intellectual
activity.  {111} Sardanapalus, though an Assyrian, had been quite
Babylon-ized.  He made a library, a library not of paper but of
the clay tablets that were used for writing in Mesopotamia since
early Sumerian days.  His collection has been unearthed and is
perhaps the most precious store of historical material in the
world.  The last of the Chaldean line of Babylonian monarchs,
Nabonidus, had even keener literary tastes.  He patronized
antiquarian researches, and when a date was worked out by his
investigators for the accession of Sargon I he commemorated the
fact by inscriptions.  But there were many signs of disunion in
his empire, and he sought to centralize it by bringing a number of
the various local gods to Babylon and setting up temples to them
there.  This device was to be practised quite successfully by the
Romans in later times, but in Babylon it roused the jealousy of
the powerful priesthood of Bel Marduk, the dominant god of the
Babylonians.  They cast about for a possible alternative to
Nabonidus and found it in Cyrus the Persian, the ruler of the
adjacent Median Empire.  Cyrus had already distinguished himself
by conquering Croesus, the rich king of Lydia in Eastern Asia
Minor.  {112} He came up against Babylon, there was a battle
outside the walls, and the gates of the city were opened to him
(538 B.C.).  His soldiers entered the city without fighting.  The
crown prince Belshazzar, the son of Nabonidus, was feasting, the
Bible relates, when a hand appeared and wrote in letters of fire
upon the wall these mystical words: _"Mene, Mene, Tekel,
Upharsin,"_ which was interpreted by the prophet Daniel, whom he
summoned to read the riddle, as "God has numbered thy kingdom and
finished it; thou art weighed in the balance and found wanting and
thy kingdom is given to the Medes and Persians."  Possibly the
priests of Bel Marduk knew something about that writing on the
wall.  Belshazzar was killed that night, says the Bible.
Nabonidus was taken prisoner, and the occupation of the city was
so peaceful that the services of Bel Marduk continued without
intermission."""

By default, spaCy applies its entire NLP "pipeline" to the text as soon as it is provided to the model and outputs a processed "doc."

<img src="https://d33wubrfki0l68.cloudfront.net/3ad0582d97663a1272ffc4ccf09f1c5b335b17e9/7f49c/pipeline-fde48da9b43661abcdf62ab70a546d71.svg">

In [None]:
doc = nlp(text)

## Tokenization

The doc created by spaCy immediately provides access to the word-level tokens of the text.

In [None]:
for token in doc[:15]:
    print(token)

Each of these tokens has a number of properties, and we'll look a bit more closely at them in a minute.

spaCy also automatically provides sentence-level segmenting (senticization).

In [None]:
import itertools

for sent in itertools.islice(doc.sents, 10):
    print(sent.text + "\n--\n")

You'll notice that the line breaks in the sample text are making the extracted sentences and also the word-level tokens a bit messy. The simplest way to avoid this is just to replace all single line breaks in the text with spaces before running it through the spaCy pipeline, i.e., as a **preprocessing** step.

There are other ways to handle this within the spaCy pipeline; an important feature of spaCy is that every phase of the built-in pipeline can be replaced by a custom module. One could imagine, for example, writing a replacement sentencizer that takes advantage of the presence of two spaces between all sentences in the sample text. But we will leave that as an exercise for the reader.

In [None]:
text_as_line = text.replace("\n", " ")

doc = nlp(text_as_line)

for sent in itertools.islice(doc.sents, 10):
    print(sent.text + "\n--\n")
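
As a rough illustration of the two-space idea mentioned above, here is a sketch that splits a passage into sentences in plain Python, outside the spaCy pipeline. It is not a spaCy component: the helper name `split_on_double_space` and the sample string are made up for this example, and it assumes that sentences are separated either by two or more spaces or by a line break that directly follows closing punctuation. An abbreviation that happens to end a line would be split incorrectly.

```python
import re

def split_on_double_space(raw_text):
    """Split text into sentences using spacing cues (hypothetical helper)."""
    # Assumed boundary: closing punctuation followed by two or more spaces,
    # or by a line break (a sentence that ends exactly at a line wrap).
    boundary = re.compile(r'(?<=[.!?"]) {2,}|(?<=[.!?"])\n')
    sentences = boundary.split(raw_text)
    # Any remaining line breaks fall inside a sentence, so turn them into spaces.
    return [s.replace("\n", " ").strip() for s in sentences if s.strip()]

sample = (
    "He made a library, a library not of paper but of\n"
    "the clay tablets.  His collection has been unearthed.\n"
    "Nabonidus had even keener literary tastes."
)

for sentence in split_on_double_space(sample):
    print(sentence + "\n--")
```

A trained sentence segmenter is more robust than this kind of heuristic, but the sketch shows how much structure the typesetting of a particular text can carry.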

We can collect both words and sentences into standard Python data structures (lists, in this case).

In [None]:
sentences = [sent.text for sent in doc.sents]
sentences

In [None]:
words = [token.text for token in doc]
words

### Filtering tokens

After extracting the tokens, we can use some attributes and methods provided by spaCy, along with some vanilla Python methods, to filter the tokens to just the types we're interested in analyzing.

In [None]:
# If we're only interested in analyzing word tokens, we can remove punctuation:
for token in doc[:20]:
    print(f'TOKEN: {token.text:15} IS_PUNCTUATION: {token.is_punct}')
no_punct = [token for token in doc if not token.is_punct]

no_punct[:20]

In [None]:
# There are still some space tokens; here's how to remove spaces and newlines:
no_punct_or_space = [token for token in doc if not token.is_punct and not token.is_space]
for token in no_punct_or_space[:30]:
    print(token.text)

In [None]:
# Let's say we also want to remove numbers and lowercase everything that remains
lower_alpha = [token.lower_ for token in no_punct_or_space if token.is_alpha]
lower_alpha[:30]

One additional common filtering step is to remove stopwords. In theory, stopwords can be any words we're not interested in analyzing, but in practice, they are often the most common words in a language that do not carry much semantic information (e.g., articles, conjunctions).

In [None]:
clean = [token.lower_ for token in no_punct_or_space if token.is_alpha and not token.is_stop]
clean[:30]

We've used spaCy's built-in stopword list; membership in this list determines the property `is_stop` for each token. It's good practice to be wary of any built-in stopword list, however -- there's a good chance you will want to filter out some words that aren't on the list and keep some that are, especially if you're working with specialized texts.

In [None]:
# We'll just pick a couple of words we know are in the example
custom_stopwords = ["assyrian", "babylon"]

custom_clean = [token for token in clean if token not in custom_stopwords]
custom_clean

At this point, we have a list of lower-cased tokens that doesn't contain punctuation, white-space, numbers, or stopwords. Depending on your analytical goals, you may or may not want to do this much cleaning, but hopefully you have a greater appreciation for the kinds of cleaning that can be done with spaCy.

### Counting tokens

Now that we've used spaCy to tokenize and clean our text, we can begin one of the most fundamental text analysis tasks: counting words!

In [None]:
print("Number of tokens in document: ", len(doc))
print("Number of tokens in cleaned document: ", len(clean))
print("Number of unique tokens in cleaned document: ", len(set(clean)))

In [None]:
from collections import Counter

full_counter = Counter([token.lower_ for token in doc])
full_counter.most_common(20)

In [None]:
cleaned_counter = Counter(clean)
cleaned_counter.most_common(20)

## Part-of-speech tagging

Let's consider some other aspects of the text that spaCy exposes for us. One of the most noteworthy features is part-of-speech tagging.

In [None]:
# spaCy provides two levels of POS tagging. Here's the more general level.
for token in doc[:30]:
    print(token.text, token.pos_)

In [None]:
# spaCy also provides the more specific Penn Treebank tags.
# https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
for token in doc[:30]:
    print(token.text, token.tag_)

We can count the occurrences of each part of speech in the text, which may be useful for document classification (fiction may have different proportions of parts of speech relative to nonfiction, for example) or stylistic analysis (more on that later).

In [None]:
nouns = [token for token in doc if token.pos_ == "NOUN"]
verbs = [token for token in doc if token.pos_ == "VERB"]
proper_nouns = [token for token in doc if token.pos_ == "PROPN"]
adjectives = [token for token in doc if token.pos_ == "ADJ"]
adverbs = [token for token in doc if token.pos_ == "ADV"]

In [None]:
pos_counts = {
    "nouns": len(nouns),
    "verbs": len(verbs),
    "proper_nouns": len(proper_nouns),
    "adjectives": len(adjectives),
    "adverbs": len(adverbs) 
}

pos_counts

spaCy performs morphosyntactic analysis of individual tokens, including lemmatizing inflected or conjugated forms to their base (dictionary) forms. Reducing words to their lemmatized forms can help to make a large corpus more manageable and is generally more effective than just stemming words (trimming the inflected/conjugated endings of words until just the base portion remains), but should only be done if the inflections are not relevant to your analysis.

In [None]:
for token in doc:
    if token.pos_ in ["NOUN", "VERB"] and token.orth_ != token.lemma_:
        print(f"{token.text:15} {token.lemma_}")
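
To make the difference between stemming and lemmatizing more concrete, here is a toy comparison in plain Python. Both pieces are stand-ins invented for this sketch: `naive_stem` is a crude suffix stripper rather than a real stemming algorithm, and `toy_lemmas` is a hand-written lookup table standing in for the trained lemmatizer in spaCy's model.

```python
def naive_stem(word):
    """Crude suffix stripping (illustrative only, not a real stemming algorithm)."""
    for suffix in ("ies", "ing", "ed", "es", "s"):
        # Only strip a suffix when a reasonably long base remains.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hand-written lemma lookup standing in for a trained lemmatizer (made-up data).
toy_lemmas = {
    "libraries": "library",
    "sought": "seek",
    "tablets": "tablet",
    "wrote": "write",
}

for word, lemma in toy_lemmas.items():
    print(f"{word:12} stem: {naive_stem(word):10} lemma: {lemma}")
```

The stemmer leaves irregular forms such as "sought" untouched and can produce non-words such as "librar", whereas a lemmatizer maps each form to a dictionary entry.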

### Parsing

spaCy's trained models also provide full dependency parsing, tagging word tokens with their syntactic relations to other tokens. This functionality drives spaCy's built-in senticization as well.

We won't spend much time exploring this feature, but it's useful to see how it enables the extraction of multi-word "noun chunks" from the text. Note also that textacy (discussed below) has a built-in function to extract subject-verb-object triples from sentences.

In [None]:
for chunk in itertools.islice(doc.noun_chunks, 20):
    print(chunk.text)

## Named-entity recognition

spaCy's models do a pretty good job of identifying and classifying named entities (people, places, organizations).

It is also fairly easy to customize and fine-tune these models by providing additional training data (e.g., texts with entities labeled according to the desired scheme), but that's out of the scope of this workshop.

In [None]:
for ent in doc.ents:
    print(f'{ent.text:20} {ent.label_:15} {spacy.explain(ent.label_)}')

What if we only care about geo-political entities or locations?

In [None]:
ent_filtered = [(ent.text, ent.label_) for ent in doc.ents if ent.label_ in ["GPE", "LOC"]]
ent_filtered

### Visualizing Parses

The built-in displaCy visualizer can render the results of the named-entity recognition, as well as the dependency parser.

In [None]:
from spacy import displacy

In [None]:
displacy.render(doc, style="ent", jupyter=True)

### Activity

Pick either a particular part of speech or a named entity type, and write code to determine the most common words of that type in the sample text.

# Corpus-level analysis with `textacy`

Let's shift to thinking about a whole corpus rather than a single document. We could analyze multiple documents with spaCy and then knit the results together with some extra Python. Instead, though, we're going to take advantage of textacy, a library built on spaCy that adds corpus analysis features.

For reference, here's the [online documentation for textacy](https://textacy.readthedocs.io/en/stable/api_reference/root.html).

In [None]:
!pip install textacy

## Generating corpora

We'll use some of the data that is included in textacy as our corpus. It is certainly possible to build your own corpus by importing data from files in plain text, XML, JSON, CSV or other formats, but working with one of textacy's "pre-cooked" datasets simplifies things a bit.

In [None]:
import textacy
import textacy.datasets

In [None]:
# We'll work with a dataset of ~8,400 ("almost all") U.S. Supreme Court
# decisions from November 1946 through June 2016
# https://github.com/bdewilde/textacy-data/releases/tag/supreme_court_py3_v1.0
data = textacy.datasets.SupremeCourt()

In [None]:
data.download()

The documentation indicates the metadata that is available with each text.

In [None]:
help(textacy.datasets.supreme_court)

textacy is based on the concept of a corpus, whereas spaCy focuses on single documents. A textacy corpus is instantiated with a spaCy language model (we're using the one from the first half of this workshop) that is used to apply its analytical pipeline to each text in the corpus, and also given a set of records consisting of texts with metadata (if metadata is available).

Let's go ahead and define a set of records (texts with metadata) that we'll then add to our corpus. To keep the processing time of the data set a bit more manageable, we'll just look at a set of court decisions from a short span of time.

In [None]:
corpus = textacy.Corpus(nlp)

# There are 79 docs in this range -- they'll take a minute or two to process
recent_decisions = data.records(date_range=('2010-01-01', '2010-12-31'))

for record in recent_decisions:
    print("Adding",record[1]['case_name'])
    corpus.add_record(record)

# If the three lines above are taking too long to process all 79 docs,
# comment them out and uncomment the two lines below to download and import
# a preprocessed version of the corpus

#!wget https://github.com/sul-cidr/Workshops/raw/master/Text_Analysis_with_Python/data/scotus_2010.bin.gz
#corpus = textacy.Corpus.load(nlp, "scotus_2010.bin.gz")

In [None]:
print(len(corpus))
[doc._.preview for doc in corpus[:5]]

We can see that the type of each item in the corpus is a `Doc` - this is a processed spaCy output document, with all of the extracted features. textacy provides some capacity to work with those features via its API, and also exposes new document-level features, such as ngrams and algorithms to determine a document's readability level, among others.

We can filter this corpus based on metadata attributes.

In [None]:
corpus[0]._.meta

In [None]:
# Here we'll find all the cases where the number of justices voting in the majority was greater than 6. 
supermajorities = [doc for doc in corpus.get(lambda doc: doc._.meta["n_maj_votes"] > 6)]
len(supermajorities)

In [None]:
supermajorities[0]._.preview

## Finding important words in the corpus

In [None]:
print("number of documents: ", corpus.n_docs)
print("number of sentences: ", corpus.n_sents)
print("number of tokens: ", corpus.n_tokens)

In [None]:
# Count words by their exact text as it appears in the documents (by="orth"), leaving out numbers.
counts = corpus.word_counts(by = "orth", filter_nums=True)

In [None]:
def show_doc_counts(input_corpus, weighting, limit=20):
    doc_counts = input_corpus.word_doc_counts(weighting=weighting, filter_stops=True, by = "orth")
    print("\n".join([f'{a:15} {str(b)}' for a,b in sorted(doc_counts.items(), key=lambda x:x[1], reverse=True)[:limit]]))

`word_doc_counts` provides a few ways of quantifying the prevalence of individual words across the corpus: whether a word appears many times in most documents, just a few times in a few documents, many times in a few documents, or just a few times in most documents.

In [None]:
print("# DOCS APPEARING IN / TOTAL # DOCS")
show_doc_counts(corpus, "freq")
print("\nLOG(TOTAL # DOCS / # DOCS APPEARING IN)")
show_doc_counts(corpus, "idf")
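
To see what the two headings printed above mean, here is a small hand computation on a made-up, pre-tokenized corpus. This is only a sketch of the unsmoothed formulas named in those headings; textacy's own "idf" weighting may apply smoothing, so its numbers will not necessarily match this calculation exactly.

```python
import math

# Three tiny "documents", already tokenized (made-up data).
toy_docs = [
    ["court", "held", "statute", "court"],
    ["court", "reversed", "judgment"],
    ["statute", "court", "appeal"],
]

n_docs = len(toy_docs)
doc_counts = {}
for tokens in toy_docs:
    # set() so that a word counts once per document, however often it occurs there
    for word in set(tokens):
        doc_counts[word] = doc_counts.get(word, 0) + 1

for word, n_containing in sorted(doc_counts.items(), key=lambda x: x[1], reverse=True):
    doc_freq = n_containing / n_docs        # share of documents containing the word
    idf = math.log(n_docs / n_containing)   # unsmoothed inverse document frequency
    print(f"{word:10} freq={doc_freq:.2f}  idf={idf:.2f}")
```

A word that appears in every document gets an idf of 0 under this formula, which is why idf-style weights are useful for pushing ubiquitous words down the rankings.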

textacy provides implementations of algorithms for identifying words and phrases that are representative of a document (aka **keyterm extraction**).

In [None]:
from textacy.extract import keyterms as ke

In [None]:
corpus[0].text

In [None]:
# Run the TextRank algorithm (Mihalcea, R., & Tarau, P., 2004) for a given document

key_terms_textrank = ke.textrank(corpus[0])
key_terms_textrank

For comparison, we'll take a look at another algorithm, Yake (Campos et al., 2018)

In [None]:
key_terms_yake = ke.yake(corpus[0])
key_terms_yake

### Activity:
Let's combine a few different pieces. Try filtering the corpus on some metadata to construct a sub-corpus. Then use one of the textacy keyword algorithms to determine the most common keywords across your subcorpus. 

## Keyword in context

Sometimes researchers find it helpful just to see a particular keyword in context.

In [None]:
for doc in corpus[:5]:
    print(doc._.meta.get('case_name'))
    for match in textacy.extract.kwic.keyword_in_context(doc.text, "judgment"):
        print(" ".join(match).replace("\n", " "))
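
For intuition about what a keyword-in-context listing involves, here is a minimal sketch using only Python's `re` module. The helper name `simple_kwic`, its `window_width` parameter, and the sample sentence are invented for this example; this is not textacy's implementation.

```python
import re

def simple_kwic(text, keyword, window_width=30):
    """Yield (left context, keyword, right context) for each match (hypothetical helper)."""
    # Match the keyword as a whole word, ignoring case.
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", flags=re.IGNORECASE)
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - window_width): match.start()]
        right = text[match.end(): match.end() + window_width]
        yield left, match.group(), right

sample = "The judgment of the lower court is reversed. We vacate the judgment and remand."

for left, kw, right in simple_kwic(sample, "judgment", window_width=15):
    # Right-align the left context so the keywords line up in a column.
    print(f"{left:>15} [{kw}] {right}")
```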

## Vectorization

Let's continue with corpus-level analysis by taking advantage of textacy's vectorizer class, which wraps functionality from `scikit-learn` to count the prevalence of certain tokens in each document of the corpus and to apply weights to these counts if desired. We could just work directly in `scikit-learn`, but learning one library and being able to do a great deal with it keeps the mental overhead down.

We'll create a vectorizer, sticking with the normal term frequency defaults but discarding words that appear in fewer than 3 documents or more than 95% of documents. We'll also limit our features to the top 500 words according to document frequency. This means our feature set, or columns, will have a higher degree of representation across the corpus. We could further scale these counts according to document frequency (or inverse document frequency) weights, or normalize the weights so that they add up to 1 for each document row (L1 norm), and so on.

In [None]:
import textacy.representations

vectorizer = textacy.representations.Vectorizer(min_df=3, max_df=.95, max_n_terms=500)

tokenized_corpus = [[token.orth_ for token in list(textacy.extract.words(doc, filter_nums=True, filter_stops=True, filter_punct=True))] for doc in corpus]

dtm = vectorizer.fit_transform(tokenized_corpus)
dtm

We now have a matrix representation of our corpus, where rows are documents, and columns (or features) are words from the corpus. The value at any given point is the number of times that the word appears in that document. Once we have a document-term matrix, we could do several things with it just within textacy, though we also can pass it into different algorithms within `scikit-learn` or other libraries.
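
To make the shape of a document-term matrix concrete, here is a small stand-alone sketch that builds one from a made-up tokenized corpus with document-frequency cutoffs. The function `build_dtm` is hypothetical and deliberately simplified: it returns plain lists rather than a sparse matrix, and it omits the cap on the number of terms.

```python
from collections import Counter

# Four tiny tokenized "documents" (made-up data).
toy_corpus = [
    ["court", "statute", "court", "appeal"],
    ["court", "judgment", "statute"],
    ["court", "appeal", "petition"],
    ["court", "statute", "judgment", "judgment"],
]

def build_dtm(tokenized_docs, min_df=2, max_df=0.95):
    """Return (terms, matrix) of raw counts; a hypothetical stand-in for a vectorizer."""
    n_docs = len(tokenized_docs)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    # Keep terms found in at least min_df documents and at most max_df (a proportion) of them.
    terms = sorted(
        term for term, df in doc_freq.items()
        if df >= min_df and df / n_docs <= max_df
    )
    matrix = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        matrix.append([counts[term] for term in terms])  # one row per document
    return terms, matrix

terms, matrix = build_dtm(toy_corpus)
print(terms)
for row in matrix:
    print(row)
```

In this toy corpus "court" is dropped because it appears in every document, and "petition" is dropped because it appears in only one.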

In [None]:
# Let's look at some of the terms
vectorizer.terms_list[:20]

We can see that we are still getting a number of terms which might be filtered out, such as symbols and abbreviations. The most straightforward solutions are to filter the terms against a dictionary during vectorization, which carries the risk of inadvertently filtering words that you'd prefer to keep in the dataset, or curating a custom stopword list, which can be inflexible and time consuming. Otherwise, it is often the case that the corpus analysis tools used with the vectorized texts (e.g., topic modeling or stylistic analysis -- see below) have ways of recognizing and sequestering unwanted terms so that they can be excluded from the results if desired.

## Topic modeling

Let's look quickly at one example of what we can do with a vectorized corpus. Topic modeling is very popular for semantic exploration of texts, and there are numerous implementations of it. Textacy uses implementations from scikit-learn. 

Our corpus is rather small for topic modeling, but just to see how it's done here, we'll go ahead.

First, though, topic modeling works best when the texts are divided into approximately equal-sized "chunks." A quick word-count of the corpus will show that the decisions are of quite variable lengths, which will skew the topic model.

In [None]:
for doc in corpus:
    print(doc._.meta.get('case_name'), doc._.meta.get('decision_date'), len(doc))

We'll re-chunk the texts into documents of not more than 500 words and then recompute the document-term matrix.

In [None]:
chunked_corpus_unflattened = [ [text[x:x+500] for x in range(0,len(text),500)] for text in tokenized_corpus]
chunked_corpus = list(itertools.chain.from_iterable(chunked_corpus_unflattened))
chunked_dtm = vectorizer.fit_transform(chunked_corpus)
chunked_dtm
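
The slicing pattern used for re-chunking is easier to see on a tiny example. The sketch below applies the same idea with a chunk size of 3 to two made-up token lists; `chunk_tokens` is a helper name introduced only for this illustration.

```python
import itertools

def chunk_tokens(tokens, size):
    """Split a token list into consecutive pieces of at most `size` tokens."""
    return [tokens[start:start + size] for start in range(0, len(tokens), size)]

# Two made-up "documents" of different lengths.
toy_docs = [list("abcdefg"), list("hij")]

# Chunk each document, then flatten the per-document lists into one list of chunks.
chunks = list(itertools.chain.from_iterable(chunk_tokens(doc, 3) for doc in toy_docs))
for chunk in chunks:
    print(chunk)
```

Note that the final chunk of each document can be shorter than the rest, so the chunks are only approximately equal in size.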

In [None]:
import textacy.tm

model = textacy.tm.TopicModel("lda", n_topics=15)
model.fit(chunked_dtm)
doc_topic_matrix = model.transform(chunked_dtm)

In [None]:
for topic_idx, top_terms in model.top_topic_terms(vectorizer.id_to_term, top_n=10):
    print("topic", topic_idx, f"{model.topic_weights(doc_topic_matrix)[topic_idx]:.0%}", ":", "   ".join(top_terms))

## Document similarity with word2vec and clustering

textacy provides several built-in methods for measuring the degree of similarity between two documents, including a `word2vec`-based approach that computes the semantic similarity between documents based on the word vector model included with the spaCy language model. This technique is capable of inferring, for example, that two documents are topically related even if they don't share any words but use synonyms for a shared concept.

To evaluate this similarity comparison, we'll compute the similarity of each pair of docs in the corpus, and then branch out into `scikit-learn` a bit to look for clusters based on these similarity measurements.

In [None]:
import numpy as np

dim = corpus.n_docs

distance_matrix = np.zeros((dim,dim))
    
for i, doc_i in enumerate(corpus):
    for j, doc_j in enumerate(corpus):
        if i == j:
            continue # defaults to 0
        if i > j:
            distance_matrix[i,j] = distance_matrix[j,i]
        else:
            distance_matrix[i,j] = 1 - doc_i.similarity(doc_j)
distance_matrix
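
To our understanding, the default `similarity` method compares documents by taking the cosine similarity of their averaged word vectors. The sketch below imitates that calculation in plain Python with made-up three-dimensional vectors (real models use hundreds of dimensions), so the vocabulary, the numbers, and the helper names are all assumptions for illustration.

```python
import math

# Made-up 3-dimensional "word vectors" (illustrative values only).
toy_vectors = {
    "judge":   [0.9, 0.1, 0.0],
    "justice": [0.8, 0.2, 0.1],
    "court":   [0.7, 0.3, 0.0],
    "river":   [0.0, 0.2, 0.9],
    "water":   [0.1, 0.1, 0.8],
}

def doc_vector(tokens):
    """Average the vectors of the tokens that have one."""
    known = [toy_vectors[t] for t in tokens if t in toy_vectors]
    return [sum(dim) / len(known) for dim in zip(*known)]

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

doc_a = ["judge", "court"]
doc_b = ["justice"]          # shares no words with doc_a
doc_c = ["river", "water"]

sim_ab = cosine_similarity(doc_vector(doc_a), doc_vector(doc_b))
sim_ac = cosine_similarity(doc_vector(doc_a), doc_vector(doc_c))
print(f"a vs. b: {sim_ab:.3f}")
print(f"a vs. c: {sim_ac:.3f}")
```

Even though `doc_a` and `doc_b` have no words in common, their vectors point in nearly the same direction, so their similarity is high and the corresponding distance (1 minus the similarity) is small.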

The [OPTICS](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.OPTICS.html) hierarchical density-based clustering algorithm only finds one cluster with its default settings, but an examination of the legal issue types coded to each decision indicates that the `word2vec`-based clustering has indeed produced a group of semantically related documents.

In [None]:
from sklearn.cluster import OPTICS

clustering = OPTICS(metric='precomputed').fit(distance_matrix)
print(clustering.labels_)

In [None]:
for i, cluster_id in enumerate(clustering.labels_):
    if cluster_id == -1:
        continue
    print(cluster_id, corpus[i]._.meta['us_cite_id'], data.issue_area_codes[corpus[i]._.meta['issue_area']], ':', data.issue_codes[corpus[i]._.meta['issue']])

## Case study: Stylistic analysis of U.S. Supreme Court opinions

One of the more impressive affordances of text analysis is the ability to infer the authorship of a text with some degree of confidence based only upon the stylistic attributes observed through statistical analysis of the writer's use of common "function" words (conjunctions, articles). Note however that if a writer desires to be anonymous, she can attempt to confuse such algorithms by intentionally adopting a different writing style.

The U.S. Supreme Court decisions corpus usually identifies the author of each opinion, but authorship is sometimes described inconsistently in the text, especially for decisions issued with no majority author listed (e.g., unanimous decisions). Applying stylistic analysis techniques to a substantial subset of the corpus (here we use just the majority author portion of the decisions from 1993 to 2016) can also expose similarities between the writing styles of particular justices -- or perhaps of their clerks.
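
As a minimal sketch of the function-word idea, the code below turns a text into a "profile" of relative function-word frequencies, which is the kind of feature vector that stylistic comparisons are typically built on. The short word list, the helper name, and the two sample sentences are assumptions made for this example; the workshop code itself relies on spaCy's stopword list rather than a hand-picked list.

```python
from collections import Counter

# A small, hand-picked set of function words (an assumption for this sketch).
FUNCTION_WORDS = ["the", "of", "and", "to", "that", "but"]

def function_word_profile(text):
    """Relative frequency of each function word among all words in the text."""
    words = text.lower().split()
    counts = Counter(words)
    return [counts[w] / len(words) for w in FUNCTION_WORDS]

sample_a = "the court held that the statute applied and that the appeal failed"
sample_b = "but to dismiss is to deny relief and to deny relief is to err"

profile_a = function_word_profile(sample_a)
profile_b = function_word_profile(sample_b)

for word, freq_a, freq_b in zip(FUNCTION_WORDS, profile_a, profile_b):
    print(f"{word:6} {freq_a:.2f} {freq_b:.2f}")
```

Because function words are largely independent of subject matter, differences between such profiles tend to reflect habits of style rather than topic.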

In [None]:
# This code replaces the corpus above with a much larger corpus that may take a long
# time to process and which may take up a considerable amount of RAM.

# Note that the text of each decision in this corpus is trimmed to include only
# the opinion of the majority author (and not any footnotes, supporting or
# dissenting opinions, etc.) via regular expression matching of formulaic
# phrases in the case file, specifically:
# [JUSTICE NAME] delivered the opinion of the Court.
# --- MAJORITY AUTHOR'S OPINION HERE ---
# It is so ordered.

"""
corpus = textacy.Corpus(nlp)
import re

# 1986 - 2016 = Rehnquist and Roberts courts, but start at 1993 because there was a lot of turnover then
for year in range(1993, 2017):
    print("Finding record(s) from",year)
    year_generator = data.records(date_range=(str(year)+'-01-01', str(year)+'-12-31'))
    for record in year_generator:
        # There are only 2 opinions in this range by Byron White; exclude them
        if record[1]['maj_opinion_author'] == 95:
            continue
        text = record[0]
        # NOTE: These regexes work well on 1993-2016, but they are NOT guaranteed to work on the full corpus
        if record[1]['maj_opinion_author'] != -1:
            match = re.search(r"delivered the opinion of the Court\.(.*)It is so ordered", text, flags=re.I | re.M | re.DOTALL)
        else:
            match = re.search(r"Per Curiam\.(.*)It is so ordered", text, flags=re.I | re.M | re.DOTALL)
        if match is None:
            continue
        opinion = match.group(1)
        if len(opinion) < 500:
            continue
        mutable_record = list(record)
        mutable_record[0] = opinion
        corpus.add_record(mutable_record)
"""

# Rather than wait for the corpus to be processed, we'll just download a pre-processed
# version from GitHub. It should take about two minutes to import.

!wget https://github.com/sul-cidr/Workshops/raw/master/Text_Analysis_with_Python/data/recent_opinions.bin.gz
opinions_corpus = textacy.Corpus.load(nlp, "recent_opinions.bin.gz")

In [None]:
# If you are running this notebook locally and get an error in the above cell, uncomment the lines below:
# !pip install wget
# import wget
# wget.download("https://github.com/sul-cidr/Workshops/raw/master/Text_Analysis_with_Python/data/recent_opinions.bin.gz")
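
The trimming step described in the commented-out corpus-building code above can be tried on a toy string. This is only a minimal sketch: the case text and the `extract_majority_opinion` helper are hypothetical, and the `.strip()` call is an addition for tidier output. Only the regular expression and the minimum-length check are taken from the cell above.

```python
import re

# Same pattern as in the corpus-building code: capture everything between
# the formulaic opening and closing phrases of a majority opinion.
MAJORITY_PATTERN = re.compile(
    r"delivered the opinion of the Court\.(.*)It is so ordered",
    flags=re.IGNORECASE | re.DOTALL,
)

def extract_majority_opinion(text, min_length=500):
    """Return the trimmed majority opinion, or None if it is missing or too short."""
    match = MAJORITY_PATTERN.search(text)
    if match is None:
        return None
    opinion = match.group(1).strip()  # .strip() is an addition for this sketch
    return opinion if len(opinion) >= min_length else None

# Hypothetical case text, invented for illustration only
toy_case = (
    "Syllabus of the case.\n"
    "JUSTICE DOE delivered the opinion of the Court.\n"
    "The question presented is whether the statute applies.\n"
    "We hold that it does.\n"
    "It is so ordered.\n"
    "JUSTICE ROE, dissenting.\n"
)

opinion = extract_majority_opinion(toy_case, min_length=10)
print(opinion)
```

Because `.*` is greedy, the match runs to the last occurrence of "It is so ordered" in the text, which is one reason the pattern may not transfer cleanly to decisions outside the 1993-2016 range.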

In [None]:
# A helper function that extracts the alphabetic (non-numeric, non-punctuation)
# tokens from a document, splits them into "semantic" tokens (non-stopwords)
# and "function" tokens (stopwords), and returns these lists along with the
# sentences in the document.
def get_doc_tokens_and_sents(doc):
    sents = doc.sents
    alpha_tokens = [token for token in doc if token.is_alpha and not token.is_punct and not token.is_space]
    semantic_tokens = [token for token in alpha_tokens if not token.is_stop]
    function_tokens = [token for token in alpha_tokens if token.is_stop]
    
    return [sents, alpha_tokens, semantic_tokens, function_tokens]

# This code just builds a list of author names from the corpus and assigns
# them to colors and shapes for drawing plots later
doc_author_names = []

author_opinions = {}

author_names = []
author_surnames = []

for i, doc in enumerate(opinions_corpus):
    maj_author_id = doc._.meta["maj_opinion_author"]
    maj_author_name = data.opinion_author_codes[maj_author_id]
    if maj_author_name is None:
        maj_author_name = "None"
    doc_author_names.append(maj_author_name)
    if maj_author_name not in author_opinions:
        author_opinions[maj_author_name] = [i]
        author_names.append(maj_author_name)
        author_surnames.append(maj_author_name.split(',')[0].strip())
    else:
        author_opinions[maj_author_name].append(i)
    
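# matplotlib.cm.get_cmap() was removed in matplotlib 3.9; pyplot.get_cmap()
# is the equivalent call and also works on older releases.
import matplotlib.pyplot as plt
cmap = plt.get_cmap('gist_rainbow', len(author_names))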

available_markers = ['o', 'v', '^', '<', '>', 's', 'P', '*', '+', 'X', 'D', '1', '2', '3', '4', 'p', 'x', '|', '_']

author_name_to_color = {}
author_name_to_marker = {}
for i, name in enumerate(author_names):
    author_name_to_color[name] = cmap(i)
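    # Wrap around so that a corpus with more authors than markers does not raise an IndexError
    author_name_to_marker[name] = available_markers[i % len(available_markers)]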

doc_author_colors = [author_name_to_color[name] for name in doc_author_names]

We'll create a vectorized version of the 1993-2016 majority-opinion subcorpus in which all nouns have been removed, leaving mostly "functional" rather than "semantic" words behind. Although it's not entirely foolproof, this filtering step is meant to ensure that stylistic attributes of the documents, rather than content-based ones, remain as material for comparison and analysis.

In [None]:
from textacy.representations.vectorizers import Vectorizer

function_vectorizer = Vectorizer(min_df=1, max_df=1.0) # This will count all words in the corpus
# Remove the nouns
function_corpus = [[token.lower_ for token in list(textacy.extract.words(doc, filter_nums=True, filter_stops=False, filter_punct=True, exclude_pos=['NOUN', 'PROPN']))] for doc in opinions_corpus]

function_dtm = function_vectorizer.fit_transform(function_corpus)
# For comparison, the full corpus contains 65,840 terms
print(function_dtm.shape)
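
The noun-filtering and counting steps above can be followed by hand on a toy example. This is a sketch under stated assumptions: the hand-tagged `(text, POS)` pairs are hypothetical stand-ins for spaCy's output, and the plain-Python counting only imitates what textacy's `Vectorizer` does at scale.

```python
from collections import Counter

# Hypothetical, hand-tagged tokens standing in for spaCy output: (text, coarse POS tag)
tagged_docs = [
    [("The", "DET"), ("Court", "PROPN"), ("held", "VERB"), ("that", "SCONJ"),
     ("the", "DET"), ("statute", "NOUN"), ("applies", "VERB")],
    [("We", "PRON"), ("agree", "VERB"), ("with", "ADP"),
     ("the", "DET"), ("petitioner", "NOUN")],
]

EXCLUDED_POS = {"NOUN", "PROPN"}  # the same tags passed to exclude_pos above

def strip_nouns(tagged_tokens):
    """Lowercase each token and drop common and proper nouns."""
    return [text.lower() for text, pos in tagged_tokens if pos not in EXCLUDED_POS]

filtered_docs = [strip_nouns(doc) for doc in tagged_docs]

# Build a small document-term matrix: one row per document, one column per term
vocabulary = sorted({term for doc in filtered_docs for term in doc})
dtm = [[Counter(doc)[term] for term in vocabulary] for doc in filtered_docs]

print(vocabulary)
for row in dtm:
    print(row)
```

Each row of `dtm` is the kind of vector that the distance computations later in this section compare.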

In [None]:
# This is a helper function to compute pairwise distances (dissimilarities) between docs in the corpus.
# Cosine distance = 1 - cosine similarity: it compares the relative proportions of
# word frequencies in two documents, regardless of the documents' lengths.
import numpy as np
from scipy.spatial import distance

def get_distance_matrix(method, corpus_doc_vectors):
    dim = len(corpus_doc_vectors)
    distance_matrix = np.zeros((dim,dim))
    
    for i, vec1 in enumerate(corpus_doc_vectors):
        print("row",i)
        for j, vec2 in enumerate(corpus_doc_vectors):
            if i == j:
                continue # defaults to 0
            if i > j:
                distance_matrix[i,j] = distance_matrix[j,i]
            else:
                #print(vec1, vec2)
                distance_matrix[i,j] = method(vec1, vec2)
    return distance_matrix
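
The nested loops above call `method` once for every pair of documents, which can be slow on a corpus of this size. As a possible alternative, SciPy's `pdist` and `squareform` compute the same symmetric matrix in compiled code. The name `get_distance_matrix_fast` and the toy vectors are hypothetical; this is a sketch, not a drop-in replacement that has been tested against the workshop corpus.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def get_distance_matrix_fast(corpus_doc_vectors, metric="cosine"):
    """Pairwise distances between the rows of a dense array, as a square matrix."""
    vectors = np.asarray(corpus_doc_vectors, dtype=float)
    # pdist returns the condensed upper triangle; squareform expands it into
    # a symmetric matrix with zeros on the diagonal
    return squareform(pdist(vectors, metric=metric))

# Toy term-count vectors: the first two have identical proportions,
# the third shares no terms with either of them
toy_vectors = np.array([
    [2, 1, 0],
    [4, 2, 0],
    [0, 0, 3],
])

toy_distances = get_distance_matrix_fast(toy_vectors)
print(np.round(toy_distances, 3))
```

The first two toy documents differ in length but not in word proportions, so their cosine distance is zero. That length-independence is why cosine distance suits opinions of very different sizes.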

In [None]:
# This reorders the corpus documents by opinion author, rather than case ID/date,
# then computes a distance matrix based on the cosine similarity of the
# "non-semantic" (function) word set for each pair of documents in the corpus

dim = len(opinions_corpus)

reordered_corpus = [None] * dim
reordered_author_docs = []

author_start_indices = []
author_ranges = {}

new_index=0
for i, name in enumerate(author_names):
    author_ranges[name] = [new_index, -1]
    author_start_indices.append(new_index)
    for doc_index in author_opinions[name]:
        reordered_corpus[new_index] = opinions_corpus[doc_index]
        reordered_author_docs.append(name)
        author_ranges[name][1] = new_index
        new_index += 1

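# Use token.lower_ (as in the function_corpus cell above) so that "The" and "the" are counted as the same term
reordered_functions = [[token.lower_ for token in list(textacy.extract.words(doc, filter_nums=True, filter_stops=False, filter_punct=True, exclude_pos=['NOUN', 'PROPN']))] for doc in reordered_corpus]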
reordered_dtm = function_vectorizer.fit_transform(reordered_functions)
reordered_function_matrix = get_distance_matrix(distance.cosine, reordered_dtm.toarray())

In [None]:
# Plot a heatmap of the pairwise cosine distance matrix for the corpus, ordered by opinion author.

maxval = np.amax(reordered_function_matrix)

%matplotlib inline
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(12, 12), dpi=80)
plt.imshow(reordered_function_matrix, cmap="viridis", vmax=maxval/2, interpolation='bicubic')
plt.colorbar()
ax = fig.gca()
ax.set_xticks(author_start_indices)
ax.set_xticklabels(author_surnames)
plt.setp(ax.get_xticklabels(), rotation=45, ha="right",
         rotation_mode="anchor")
# Including the left-hand labels causes the plot to be shifted incorrectly, for some reason.
#ax.set_yticks(author_start_indices)
#ax.set_yticklabels(author_surnames)

plt.show()
for author in author_names:
    print(author,str(author_ranges[author][0]) + '-' + str(author_ranges[author][1]))
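
A heatmap can be hard to read by eye, so one optional numeric complement is to compare the average distance between opinions by the same author with the average distance between opinions by different authors. The helper `within_between_means` and the toy data below are hypothetical; the only assumption carried over from the cell above is that each author's documents occupy an inclusive `[start, end]` index range.

```python
import numpy as np

def within_between_means(distance_matrix, ranges):
    """Mean distance within author blocks vs. between different authors."""
    distances = np.asarray(distance_matrix, dtype=float)
    n = distances.shape[0]
    same_author = np.zeros((n, n), dtype=bool)
    for start, end in ranges.values():
        same_author[start:end + 1, start:end + 1] = True  # end index is inclusive
    off_diagonal = ~np.eye(n, dtype=bool)  # ignore each document's distance to itself
    within = distances[same_author & off_diagonal].mean()
    between = distances[~same_author].mean()
    return within, between

# Toy distance matrix for four documents by two hypothetical authors
toy_matrix = np.array([
    [0.0, 0.1, 0.8, 0.9],
    [0.1, 0.0, 0.7, 0.6],
    [0.8, 0.7, 0.0, 0.2],
    [0.9, 0.6, 0.2, 0.0],
])
toy_ranges = {"Author A": [0, 1], "Author B": [2, 3]}

within, between = within_between_means(toy_matrix, toy_ranges)
print(f"within-author mean: {within:.2f}, between-author mean: {between:.2f}")
```

With the workshop data the call would be `within_between_means(reordered_function_matrix, author_ranges)`. A within-author mean noticeably below the between-author mean would suggest that the function-word features carry some authorial signal.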

The document-level stylistic similarity relationships from the matrix above
also can be viewed as a two-dimensional scatterplot via techniques like PCA
(Principal Component Analysis) and MDS (Multi-Dimensional Scaling).
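
To make the idea behind MDS concrete, here is a sketch of *classical* (Torgerson) MDS in NumPy. Note that this is a different, simpler algorithm than the iterative one that scikit-learn's `MDS` class uses, so the two will not produce identical coordinates. The function name `classical_mds` and the toy distances are hypothetical.

```python
import numpy as np

def classical_mds(distances, n_components=2):
    """Embed points in n_components dimensions from a matrix of pairwise distances."""
    d = np.asarray(distances, dtype=float)
    n = d.shape[0]
    # Double-center the squared distances to obtain a Gram (inner-product) matrix
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (d ** 2) @ centering
    # Keep the eigenvectors with the largest eigenvalues
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:n_components]
    # Clip tiny negative eigenvalues caused by floating-point error
    scale = np.sqrt(np.clip(eigvals[order], 0, None))
    return eigvecs[:, order] * scale

# Toy example: three points on a line at positions 0, 1 and 3
toy_distances = np.array([
    [0.0, 1.0, 3.0],
    [1.0, 0.0, 2.0],
    [3.0, 2.0, 0.0],
])

coords = classical_mds(toy_distances)
# Recompute pairwise distances from the embedding to check that they are preserved
recovered = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
print(np.round(recovered, 3))
```

The toy distances are exactly Euclidean, so the embedding reproduces them. Cosine distances between documents generally are not, which is why a two-dimensional plot can only approximate the relationships in the full matrix.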

In [None]:
from sklearn.manifold import MDS
#from sklearn.decomposition import PCA

mds = MDS(dissimilarity='precomputed')
pos = mds.fit_transform(reordered_function_matrix)
#pca = PCA()
#pos = pca.fit_transform(reordered_function_matrix)

The resulting plot does not immediately reveal obvious document clusterings, but this doesn't necessarily indicate that the stylistic features are wholly uninformative...

In [None]:
%matplotlib inline

plt.figure(figsize=(12,8), dpi=72)

for name in author_names:
    color = author_name_to_color[name]
    marker = author_name_to_marker[name]
    xs = []
    ys = []
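    # pos was computed from the author-ordered distance matrix, so its rows line
    # up with reordered_author_docs rather than with doc_author_names
    for i, doc_name in enumerate(reordered_author_docs):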
        if doc_name != name:
            continue
        xs.append(pos[i,0])
        ys.append(pos[i,1])

    sc = plt.scatter(xs, ys, c=[color], label=name, alpha=0.7, marker=marker, edgecolors='none')

plt.legend()
plt.grid(True)

Another approach is to train a document classifier on the observed stylistic
features (i.e., function word frequencies) and then test the classifier to see
how well it can differentiate between authors.

In [None]:
# We'll use sklearn's own vectorizer for this part
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()

doc_texts = []

for doc in opinions_corpus:
    sents, alpha_tokens, semantic_tokens, function_tokens = get_doc_tokens_and_sents(doc)
    doc_texts.append(" ".join(token.lower_ for token in function_tokens))

vectorizer.fit(doc_texts)

X_texts = np.array(doc_texts)
y_labels = np.array(doc_author_names)

If we train a Naive Bayes document classifier (which is perhaps most famous for its effectiveness as a spam filter) so that it learns various word frequency associations with the author labels, we can then run the classifier against a "held-out" portion of the corpus to see how well it performs.

Note that the resulting classification accuracy is around 50%, which is much higher than the 7% you'd expect if the classifier were just choosing one of the 15 author names at random. Not too bad considering the features it is working with -- mostly counts of words like "the", "of", and "and".

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

classifier = MultinomialNB()

def train_test_nb(random_seed):
    (X_texts_train, X_texts_test,
     y_labels_train, y_labels_test) = train_test_split(X_texts, y_labels, test_size=0.33, random_state=random_seed)

    X_features_train = vectorizer.fit_transform(X_texts_train)
    X_features_test = vectorizer.transform(X_texts_test)

    classifier.fit(X_features_train, y_labels_train)

    accuracy = classifier.score(X_features_test, y_labels_test)
    print("NB accuracy:",accuracy)

# Run a few times with different train/test sets, to make sure it's not a fluke
train_test_nb(42)
train_test_nb(16)
train_test_nb(97)
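
The 7% chance figure quoted above is simply 1/15. A second, often stricter, point of comparison is the accuracy of always guessing the most prolific author. The sketch below computes both; the helper `chance_baselines` and the toy label list are hypothetical and do not reflect the real distribution of opinions.

```python
from collections import Counter

def chance_baselines(labels):
    """Return (uniform-guess accuracy, majority-class accuracy) for a list of labels."""
    counts = Counter(labels)
    uniform = 1 / len(counts)                      # pick one of the classes at random
    majority = max(counts.values()) / len(labels)  # always pick the most common class
    return uniform, majority

# Hypothetical labels: 15 authors with 10 opinions each, plus 10 extra for one author
toy_labels = [f"Author {i}" for i in range(15)] * 10 + ["Author 0"] * 10

uniform, majority = chance_baselines(toy_labels)
print(f"uniform guess: {uniform:.1%}, majority class: {majority:.1%}")
```

Running `chance_baselines(list(y_labels))` on the workshop labels would show how far the classifier's accuracy sits above both baselines.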

A related approach is to train a classifier on the full corpus and then to run this classifier on the same full corpus. Evaluating a classifier on its own training data normally produces artificially high accuracy (a symptom of "overfitting"), but it also can be used to expose cases in which the classifier still gets certain classifications consistently wrong. This type of classifier "confusion" suggests that the writing styles of the authors being confused are in fact similar.

In [None]:
from sklearn.metrics import confusion_matrix

X_features = vectorizer.fit_transform(X_texts)

classifier.fit(X_features, y_labels)

y_labels_pred = classifier.predict(X_features)
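# Named conf_matrix rather than cm so that it does not clash with matplotlib's cm module
conf_matrix = confusion_matrix(y_labels, y_labels_pred, labels=author_names)

import seaborn as sn

# Read the confusion matrix as
# left-hand labels = the "true" author
# bottom-row labels = the author predicted by the classifier

plt.figure(figsize = (10,7))
sn.heatmap(conf_matrix, annot=True, cbar=False, xticklabels=author_surnames, yticklabels=author_surnames, cmap="Blues", fmt='d')
# Widen the y-limits by half a cell on each side; this works around older
# matplotlib releases (notably 3.1.1) that cropped the top and bottom rows
b, t = plt.ylim()
b += 0.5
t -= 0.5
plt.ylim(b, t)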
plt.show()
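
The most frequently confused author pairs can also be listed directly rather than read off the heatmap. This is a minimal sketch in plain Python: the helper `most_confused_pairs` and the toy matrix are hypothetical, and the only assumption is the layout described above (rows are true authors, columns are predicted authors).

```python
def most_confused_pairs(matrix, labels, top_n=3):
    """Return the largest off-diagonal cells as (count, true_label, predicted_label)."""
    pairs = []
    for i, row in enumerate(matrix):
        for j, count in enumerate(row):
            if i != j and count > 0:  # skip correct predictions and empty cells
                pairs.append((count, labels[i], labels[j]))
    pairs.sort(key=lambda item: item[0], reverse=True)
    return pairs[:top_n]

# Toy confusion matrix for three hypothetical authors
toy_labels = ["Author A", "Author B", "Author C"]
toy_matrix = [
    [50, 7, 1],
    [4, 40, 0],
    [2, 9, 30],
]

top_pairs = most_confused_pairs(toy_matrix, toy_labels)
for count, true_author, predicted_author in top_pairs:
    print(f"{true_author} mistaken for {predicted_author}: {count} opinions")
```

The same function should accept the NumPy array returned by `confusion_matrix` together with `author_surnames`, since it only iterates over rows and cells.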

In [None]:
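# This code displays the features (words) with the highest log probabilities for
# the first two author classes. For a multinomial Naive Bayes model these are
# essentially each author's most frequent words, which are indeed very common
# "function" words (the, of, to, that, and, in, is, for)

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
vocab = np.array(vectorizer.get_feature_names_out())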

def get_feature_counts(dtm, labels, categories, term, vocab):
    # Look up the term's column once, rather than once per matching document
    vocab_position = np.where(vocab == term)[0][0]
    category_counts = {}
    for category in categories:
        category_counts[category] = 0
        for i, label in enumerate(labels):
            if label == category:
                category_counts[category] += dtm[i, vocab_position]
    return category_counts

def most_informative_features(classifier, vectorizer, categories, n=20):
    class_labels = classifier.classes_
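    if vectorizer is None:
        # Assumes classifier is a Pipeline whose first step is the vectorizer;
        # each entry in steps is a (name, estimator) tuple
        feature_names = classifier.steps[0][1].get_feature_names_out()
    else:
        feature_names = vectorizer.get_feature_names_out()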
    topn_class1 = sorted(zip(classifier.feature_log_prob_[0], feature_names))[-n:]
    topn_class2 = sorted(zip(classifier.feature_log_prob_[1], feature_names))[-n:]
    for prob, feat in reversed(topn_class2):
        print(class_labels[1], prob, feat)
        print(str(get_feature_counts(X_features, y_labels, categories, feat, vocab)))
    print()
    for prob, feat in reversed(topn_class1):
        print(class_labels[0], prob, feat)
        print(str(get_feature_counts(X_features, y_labels, categories, feat, vocab)))

most_informative_features(classifier, vectorizer, author_names)

Here we'll compute and visualize a few more style-related aspects of the texts associated with each majority author.

The "entropy" of the words used by an author in a given document, which roughly corresponds to the "unpredictability" or "variety" in an author's word choices, has proven to be effective in differentiating authors stylistically in the absence of any other identifying data. This seems to be the case here as well, at least for some of the justices.

In [None]:
author_function_entropies = []
author_alpha_entropies = []
author_opinion_lengths = []

for i, author_name in enumerate(author_opinions):
    author_function_entropies.append([])
    author_alpha_entropies.append([])
    author_opinion_lengths.append([])
    for doc_id in author_opinions[author_name]:
        doc = opinions_corpus[doc_id]
        sents, alpha_tokens, semantic_tokens, function_tokens = get_doc_tokens_and_sents(doc)
        function_entropy = textacy.text_stats.basics.entropy(function_tokens)
        alpha_entropy = textacy.text_stats.basics.entropy(alpha_tokens)
        opinion_length = len(alpha_tokens)

        author_function_entropies[i].append(function_entropy)
        author_alpha_entropies[i].append(alpha_entropy)
        author_opinion_lengths[i].append(opinion_length)
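
For intuition about what the entropy values measure, Shannon entropy over word frequencies can be computed by hand as H = -Σ p(w) · log2 p(w), where p(w) is the share of all tokens accounted for by word w. The sketch below assumes base-2 logarithms and lowercases the words first; both are choices made for this illustration and may not match textacy's implementation in every detail. The function name `word_entropy` is hypothetical.

```python
import math
from collections import Counter

def word_entropy(words):
    """Shannon entropy (in bits) of the word frequency distribution."""
    counts = Counter(word.lower() for word in words)
    total = sum(counts.values())
    return -sum((count / total) * math.log2(count / total) for count in counts.values())

repetitive = ["the", "the", "the", "the"]  # one word only: fully predictable
balanced = ["the", "of", "the", "of"]      # two equally likely words
varied = ["the", "of", "and", "that"]      # four equally likely words

for sample in (repetitive, balanced, varied):
    print(sample, word_entropy(sample))
```

Entropy rises as an author spreads usage more evenly over more distinct words. It also tends to grow with document length, which is one reason to view the opinion lengths alongside the entropies.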

In [None]:
%matplotlib inline

bplots = {"Function Word Entropies": author_function_entropies,
          "Semantic + Function Word Entropies": author_alpha_entropies,
          "Opinion Lengths": author_opinion_lengths}

for bplot in bplots:
    fig = plt.figure(figsize=(12,6), dpi=72)
    fig.gca().set_title(bplot)
    plt.boxplot(bplots[bplot], labels=author_surnames, notch=True)
    plt.show()

## Additional Topics and Resources

**Sentiment analysis**: This workshop has touched on most of the main features of spaCy and textacy. One popular type of text analysis not covered is sentiment analysis, which neither spaCy nor textacy provides as a built-in feature (yet), and which anyway is not directly applicable to the historical and legal corpora used in this workshop. Among the available Python packages, [TextBlob](https://textblob.readthedocs.io/en/dev/quickstart.html) provides an especially straightforward implementation of sentiment analysis, albeit only for English texts.

**State of the field**: Text analysis is one of the foundational areas of inquiry for computational scholarship in the humanities and social sciences. Research continues to progress at a rapid pace, particularly involving deep-learning approaches to language translation, comprehension and generation. These are now quite likely to involve the productive exchange of methods with research into computational analysis of images, audio and video. Surveying the other software packages listed at the beginning of this workshop is a good way to stay up to date on what is currently available.

**Ethical concerns**: Although this workshop covers fairly basic topics in text analysis, the more advanced methods discussed here already begin to engage with issues of bias, privacy, and automation related to computational models and AI. The research reports and essays at https://ainowinstitute.org/research.html provide illuminating perspectives on these topics.

Here's the link to the evaluation survey again:
- https://evaluations.cidr.link/Text_Analysis_with_Python/

Thank you!