<a href="https://colab.research.google.com/github/programminghistorian/jekyll/blob/Issue-3052/assets/corpus-analysis-with-spacy/corpus-analysis-with-spacy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Corpus Analysis with spaCy
## by Megan S. Kane
#### https://programminghistorian.org/en/lessons/corpus-analysis-with-spacy
(Slightly adapted by Yevgen Matusevych for Collecting Data class at RUG) 

(Adapted again by Miriam Weigand to perform a corpus analysis on the Grimes corpus set for RUG course)

### Introduction

Say you have a big collection of texts. Maybe you’ve gathered speeches from the French Revolution, compiled a bunch of Amazon product reviews, or unearthed a collection of diary entries written during the First World War. In any of these cases, computational analysis can be a good way to complement close reading of your corpus… but where should you start?

One possible way to begin is with spaCy, an industrial-strength library for Natural Language Processing (NLP) in Python. spaCy is capable of processing large corpora, generating linguistic annotations including part-of-speech tags and named entities, as well as preparing texts for further machine classification. This lab is a ‘spaCy 101’ of sorts, a primer for researchers who are new to spaCy and want to learn how it can be used for corpus analysis. It may also be useful for those who are curious about natural language processing tools in general, and how they can help us to answer humanities research questions.


### Lab Goals

By the end of this lab, you will be able to:

- Upload a corpus of texts to a platform for Python analysis (using Google Colaboratory)
- Use spaCy to enrich the corpus through tokenization, lemmatization, part-of-speech tagging and named entity recognition
- Conduct frequency analyses using part-of-speech tags and named entities
- Download an enriched dataset for use in future NLP analyses



### Why Use spaCy for Corpus Analysis?

As the name implies, corpus analysis involves studying corpora, or large collections of documents. Typically, the documents in a corpus are representative of the group(s) a researcher is interested in studying, such as the writings of a specific author or genre. By analyzing these texts at scale, researchers can identify meaningful trends in the way language is used within the target group(s).

Though computational tools like spaCy can’t read and comprehend the meaning of texts like humans do, they excel at ‘parsing’ (analyzing sentence structure) and ‘tagging’ (labeling) them. When researchers give spaCy a corpus, it will ‘parse’ every document in the collection, identifying the grammatical categories to which each word and phrase in each text most likely belongs. NLP libraries like spaCy use this information to generate lexico-grammatical tags that are of interest to researchers, such as lemmas (base words), part-of-speech tags and named entities (more on these in the Part of Speech Tagging and Named Entity Recognition sections below). Furthermore, computational tools like spaCy can perform these parsing and tagging processes much more quickly (in a matter of seconds or minutes) and on much larger corpora (hundreds, thousands, or even millions of texts) than human readers would be able to.

Though spaCy was designed for industrial use in software development, researchers also find it valuable for several reasons:

- It’s easy to set up and use spaCy’s Trained Models and Pipelines; there is no need to call a wide range of packages and functions for each individual task
- It uses fast and accurate algorithms for text-processing tasks, which are kept up-to-date by the developers so it’s efficient to run
- It performs better on text-splitting tasks than Natural Language Toolkit (NLTK), because it constructs syntactic trees for each sentence

You may still be wondering: What is the value of extracting language data such as lemmas, part-of-speech tags, and named entities from a corpus? How can this data help researchers answer meaningful humanities research questions? To illustrate, let’s look at the example corpus and questions developed for this lab.



### Dataset: Michigan Corpus of Upper-Level Student Papers

The Michigan Corpus of Upper-Level Student Papers (MICUSP) is a corpus of 829 high-scoring academic writing samples from students at the University of Michigan. The texts come from 16 disciplines and seven genres; all were written by senior undergraduate or graduate students and received an A-range score in a university course. The texts and their metadata are publicly available on MICUSP Simple, an online interface which allows users to search for texts by a range of fields (for example genre, discipline, student level, textual features) and conduct simple keyword analyses across disciplines and genres.

This lab will explore a subset of documents from MICUSP: 67 Biology papers and 98 English papers. Writing samples in this select corpus belong to all seven MICUSP genres: Argumentative Essay, Creative Writing, Critique/Evaluation, Proposal, Report, Research Paper, and Response Paper. This select corpus .txt_files.zip and the associated metadata.csv are available to download as sample materials for this lab. The dataset has been culled from the larger corpus in order to investigate the differences between two distinct disciplines of academic writing (Biology and English). It is also a manageable size for the purposes of this lab.


### Research Questions: Linguistic Differences Within Student Paper Genres and Disciplines
This lab will describe how spaCy’s utilities in stopword removal, tokenization, and lemmatization can assist in (and hinder) the preparation of student texts for analysis. You will learn how spaCy’s ability to extract linguistic annotations such as part-of-speech tags and named entities can be used to compare conventions within subsets of a discursive community of interest. The lab focuses on lexico-grammatical features that may indicate genre and disciplinary differences in academic writing.

The following research questions will be investigated:

1: Do students use certain parts-of-speech more frequently in Biology texts versus English texts, and does this linguistic discrepancy signify differences in disciplinary conventions?
Prior research has shown that even when writing in the same genres, writers in the sciences follow different conventions than those in the humanities. Notably, academic writing in the sciences has been characterized as informational, descriptive, and procedural, while scholarly writing in the humanities is narrativized, evaluative, and situation-dependent (that is, focused on discussing a particular text or prompt). By deploying spaCy on the MICUSP texts, researchers can determine whether there are any significant differences between the part-of-speech tag frequencies in English and Biology texts. For example, we might expect students writing Biology texts to use more adjectives than those in the humanities, given their focus on description. Conversely, we might suspect English texts to contain more verbs and verb auxiliaries, indicating a more narrative structure. To test these hypotheses, you’ll learn to analyze part-of-speech counts generated by spaCy, as well as to explore other part-of-speech count differences that could prompt further investigation.

2: Do students use certain named entities more frequently in different academic genres, and do these varying word frequencies signify broader differences in genre conventions?
As with disciplinary differences, research has shown that different genres of writing have their own conventions and expectations. For example, explanatory genres such as research papers, proposals and reports tend to focus on description and explanation, whereas argumentative and critique-driven texts are driven by evaluations and arguments. By deploying spaCy on the MICUSP texts, researchers can determine whether there are any significant differences between the named entity frequencies in texts within the seven different genres represented (Argumentative Essay, Creative Writing, Critique/Evaluation, Proposal, Report, Research Paper, and Response Paper). We may suspect that argumentative genres engage more with people or works of art, since these could be entities serving to support their arguments or as the subject of their critiques. Conversely, perhaps dates and numbers are more prevalent in evidence-heavy genres, such as research papers and proposals. To test these hypotheses, you’ll learn to analyze the nouns and noun phrases spaCy has tagged as ‘named entities.’

### Installing, Importing and Preprocessing

In [1]:
# Install spacy and plotly (they are imported in the next cell).
!pip install spacy
!pip install plotly



In [2]:
# Import spacy
import spacy

# Install English language model
!spacy download en_core_web_sm

# Import os to upload documents and metadata
import os

# Load spaCy visualizer
from spacy import displacy

# Import pandas DataFrame packages
import pandas as pd
pd.options.mode.chained_assignment = None  # default='warn'

# Import graphing package
import plotly.graph_objects as go
import plotly.express as px

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 18.5 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [5]:
# Create empty lists for file names and contents
texts = []
file_names = []

# Iterate through each file in the folder
for _file_name in os.listdir('Grimes'):
    # Look for only text files
    if _file_name.endswith('.txt'):
        # Append contents of each text file to text list
        # (the with block closes each file once it has been read)
        with open(os.path.join('Grimes', _file_name), 'r', encoding='utf-8') as f:
            texts.append(f.read())
        # Append name of each file to file name list
        file_names.append(_file_name)
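
As an alternative sketch of this loading step, the standard pathlib module can build the same two lists while closing each file automatically. The helper name `load_corpus` and the temporary demo folder with made-up files are assumptions for illustration; in this lab the folder would be 'Grimes'.

```python
import tempfile
from pathlib import Path


def load_corpus(folder):
    """Return (file_names, texts) for every .txt file in folder, sorted by name."""
    file_names, texts = [], []
    for path in sorted(Path(folder).glob('*.txt')):
        # read_text opens, reads and closes the file in one call
        texts.append(path.read_text(encoding='utf-8'))
        file_names.append(path.name)
    return file_names, texts


# Demo on a throwaway folder: two made-up lyric files and one non-text file
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, 'b_song.txt').write_text('[Verse 1]\nla la', encoding='utf-8')
    Path(tmp, 'a_song.txt').write_text('[Intro]\nooh', encoding='utf-8')
    Path(tmp, 'notes.csv').write_text('ignored', encoding='utf-8')
    demo_names, demo_texts = load_corpus(tmp)

print(demo_names)
```

Sorting the paths makes the row order reproducible, which os.listdir does not guarantee.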

In [6]:
# Create dictionary object associating each file name with its text
d = {'Filename':file_names,'Text':texts}

In [7]:
# Turn dictionary into a dataframe
paper_df = pd.DataFrame(d)

In [8]:
paper_df.head()

Unnamed: 0,Filename,Text
0,Flesh Without Blood.txt,"[Intro]\nOoh, ah-ah\nOoh, ah-ah\n\n[Verse 1]\n..."
1,Delete Forever.txt,"[Verse 1]\nLying so awake, things I can't esca..."
2,Kill V. Maim.txt,"[Verse 1]\nI got in a fight, I was indisposed\..."
3,Vanessa.txt,"[Intro]\nI've been\n\n[Verse 1]\nOh, I've been..."
4,Butterfly.txt,"[Verse 1]\nBig beats, black cloud\nGet it wron..."


The beginnings of some of the texts may contain extra spaces (indicated by \t or \n). These characters can be replaced by a single space using the str.replace() method.

In [9]:
# Remove extra spaces from papers
paper_df['Text'] = paper_df['Text'].str.replace(r'\s+', ' ', regex=True).str.strip()
paper_df.head()

Unnamed: 0,Filename,Text
0,Flesh Without Blood.txt,"[Intro] Ooh, ah-ah Ooh, ah-ah [Verse 1] You cl..."
1,Delete Forever.txt,"[Verse 1] Lying so awake, things I can't escap..."
2,Kill V. Maim.txt,"[Verse 1] I got in a fight, I was indisposed I..."
3,Vanessa.txt,"[Intro] I've been [Verse 1] Oh, I've been wait..."
4,Butterfly.txt,"[Verse 1] Big beats, black cloud Get it wrong,..."
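
To see what the whitespace pattern does on its own, here is a minimal sketch using Python's standard re module on a short made-up string (not taken from the corpus). It mirrors the replace-then-strip logic applied to the Text column.

```python
import re

# Made-up sample with newlines, a tab, a double space and a trailing space
raw = "[Intro]\nOoh, ah-ah\n\n[Verse 1]\n\tla  la "

# \s+ matches any run of spaces, tabs or newlines; each run becomes one space
cleaned = re.sub(r'\s+', ' ', raw).strip()

print(cleaned)
```

Note that the line breaks of the lyrics are lost in this step, so verse and line boundaries can no longer be recovered from the cleaned text.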


In [12]:
# Load metadata.
metadata_df = pd.read_csv('metadata.csv')
metadata_df.head()

Unnamed: 0,title,length,release_year,ablum_title
0,World ♡ Princess,04:41,2010.0,Halfaxa
1,Kill V. Maim,04:06,2015.0,Art Angels
2,Oblivion,04:11,2012.0,Visions
3,Butterfly,04:13,2015.0,Art Angels
4,Flesh Without Blood,04:25,2015.0,Art Angels


In [13]:
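# Remove the .txt extension from each file name
# (the dot is escaped and anchored so only a literal, final ".txt" is removed)
paper_df['Filename'] = paper_df['Filename'].str.replace(r'\.txt$', '', regex=True)

# Rename the metadata column "title" to "Filename" so both DataFrames share a key
metadata_df.rename(columns={"title": "Filename"}, inplace=True)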
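
When a pattern is treated as a regular expression, an unescaped dot matches any character, which can remove more than the file extension. A minimal sketch with the standard re module; the file name below is hypothetical and chosen only to expose the difference.

```python
import re

title = 'My txt.txt'  # hypothetical file name, not from the corpus

# Unescaped: '.' matches any character, so ' txt' is removed as well as '.txt'
loose = re.sub('.txt', '', title)

# Escaped and anchored: only a literal '.txt' at the very end is removed
strict = re.sub(r'\.txt$', '', title)

print(repr(loose), repr(strict))
```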

In [14]:
# Merge metadata and papers into new DataFrame
# Will only keep rows where both lyrics text and metadata are present
final_paper_df = metadata_df.merge(paper_df,on='Filename')
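
Because merge keeps only rows whose key appears in both DataFrames by default, songs whose file name has no exact match in the metadata are dropped silently. A minimal sketch for spotting them with plain Python sets; the titles below are copied from the two five-row previews shown earlier, so the result illustrates the check rather than describing the full corpus. In the lab, the sets could be built from paper_df['Filename'] and metadata_df['Filename'].

```python
# Titles taken from the head() previews above (illustrative subset only)
file_titles = {'Flesh Without Blood', 'Delete Forever', 'Kill V. Maim',
               'Vanessa', 'Butterfly'}
meta_titles = {'World ♡ Princess', 'Kill V. Maim', 'Oblivion',
               'Butterfly', 'Flesh Without Blood'}

# Set difference and intersection show what an inner merge drops and keeps
only_in_files = sorted(file_titles - meta_titles)
only_in_metadata = sorted(meta_titles - file_titles)
matched = sorted(file_titles & meta_titles)

print('No metadata:', only_in_files)
print('No lyrics file:', only_in_metadata)
print('Kept by merge:', matched)
```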

Let's check the head of the DataFrame again to confirm everything has worked well. Check the first five rows to make sure each has a filename, length, release year, album title and text (the full lyrics).

In [15]:
# Print DataFrame
final_paper_df.head()

Unnamed: 0,Filename,length,release_year,ablum_title,Text
0,Kill V. Maim,04:06,2015.0,Art Angels,"[Verse 1] I got in a fight, I was indisposed I..."
1,Oblivion,04:11,2012.0,Visions,[Verse 1] I never walk about after dark It's m...
2,Butterfly,04:13,2015.0,Art Angels,"[Verse 1] Big beats, black cloud Get it wrong,..."
3,Flesh Without Blood,04:25,2015.0,Art Angels,"[Intro] Ooh, ah-ah Ooh, ah-ah [Verse 1] You cl..."
4,California,03:18,2015.0,Art Angels,"[Verse 1] This, this music makes me cry It sou..."


The resulting DataFrame is now ready for analysis.

## Text Enrichment with spaCy

### Creating Doc Objects


To use spaCy, the first step is to load one of spaCy’s Trained Models and Pipelines which will be used to perform tokenization, part-of-speech tagging, and other text enrichment tasks. A wide range of options is available (see the full list in spaCy’s documentation), and they vary based on size and language.

We’ll use en_core_web_sm, which has been trained on written web texts. It may not perform as accurately as the medium and large English models, but it will deliver results most efficiently. Once we’ve loaded en_core_web_sm, we can check what actions it performs; parser, tagger, lemmatizer, and NER should be among those listed.

In [16]:
# Load nlp pipeline
nlp = spacy.load('en_core_web_sm')

# Check what functions it performs
print(nlp.pipe_names)

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']


Now that the nlp function is loaded, let’s test out its capacities on a single sentence. Calling the nlp function on a single sentence yields a Doc object. This object stores not only the original text, but also all of the linguistic annotations obtained when spaCy processed the text.

In [17]:
#Define example sentence
sentence = "This is 'an' example? sentence"

# Call the nlp model on the sentence
doc = nlp(sentence)


Next we can call on the Doc object to get the information we’re interested in. The command below loops through each token in a Doc object and prints each word in the text along with its corresponding part-of-speech:

In [18]:
# Loop through each token in doc object
for token in doc:
    # Print text and part of speech for each
    print(token.text, token.pos_)

This PRON
is AUX
' PUNCT
an DET
' PUNCT
example NOUN
? PUNCT
sentence NOUN


Let’s try the same process on the texts in our DataFrame. As we’ll be calling the nlp function on every text in the DataFrame, we should first define a function that runs nlp on whatever input text is given. Functions are a useful way to store operations that will be run multiple times, reducing duplications and improving code readability.

In [19]:
# Define a function that runs the nlp pipeline on any given input text
def process_text(text):
    return nlp(text)

NOTE: It is not obvious why a wrapper function is needed here. Since nlp is already callable, final_paper_df['Text'].apply(nlp) would produce the same Doc objects; the wrapper mainly gives the step a descriptive name.

After the function is defined, use .apply() to apply it to every cell in a given DataFrame column. In this case, nlp will run on each cell in the Text column of the final_paper_df DataFrame, creating a Doc object from every student text. These Doc objects will be stored in a new column of the DataFrame called Doc.

Running this function takes several minutes because spaCy is performing all the parsing and tagging tasks on each text. However, when it is complete, we can simply call on the resulting Doc objects to get parts-of-speech, named entities, and other information of interest, just as in the example of the sentence above.

In [20]:
# Apply the function to the "Text" column, so that the nlp pipeline is called on each student essay
final_paper_df['Doc'] = final_paper_df['Text'].apply(process_text)

### Text Reduction

#### Tokenization

A critical first step spaCy performs is tokenization, or the segmentation of strings into individual words and punctuation markers. Tokenization enables spaCy to parse the grammatical structures of a text and identify characteristics of each word, like its part-of-speech.
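
To see why a dedicated tokenizer matters, compare it with naive whitespace splitting. This sketch uses only Python's built-in str.split() on the example sentence from earlier; it is a contrast case, not spaCy's algorithm.

```python
sentence = "This is 'an' example? sentence"

# Splitting on whitespace keeps punctuation glued to the neighbouring word
naive_tokens = sentence.split()

print(naive_tokens)
print(len(naive_tokens))
```

The quotation marks and the question mark stay attached to the words here, whereas spaCy's output for the same sentence listed them as separate PUNCT tokens.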

To retrieve a tokenized version of each text in the DataFrame, we’ll write a function that iterates through any given Doc object and returns all tokens found within it.

In [21]:
# Define a function to retrieve tokens from a doc object
def get_token(doc):
    return [(token.text) for token in doc]

As with the function used to create Doc objects, the token function can be applied to the DataFrame. In this case, we will call the function on the Doc column, since this is the column which stores the results from the processing done by spaCy.

In [39]:
# Run the token retrieval function on the doc objects in the dataframe
final_paper_df['Tokens'] = final_paper_df['Doc'].apply(get_token)
final_paper_df.head(6)

Unnamed: 0,Filename,length,release_year,ablum_title,Text,Doc,Tokens,Lemmas,POS,Proper_Nouns
0,Kill V. Maim,04:06,2015.0,Art Angels,"[Verse 1] I got in a fight, I was indisposed I...","([, Verse, 1, ], I, got, in, a, fight, ,, I, w...","[[, Verse, 1, ], I, got, in, a, fight, ,, I, w...","[[, verse, 1, ], I, get, in, a, fight, ,, I, b...","[(X, XX), (NOUN, NN), (NUM, CD), (PUNCT, -RRB-...","[Pre, -, Chorus, B, Italiana, B, Verse, Pre, -..."
1,Oblivion,04:11,2012.0,Visions,[Verse 1] I never walk about after dark It's m...,"([, Verse, 1, ], I, never, walk, about, after,...","[[, Verse, 1, ], I, never, walk, about, after,...","[[, verse, 1, ], I, never, walk, about, after,...","[(X, XX), (NOUN, NN), (NUM, CD), (PUNCT, -RRB-...","[La, La, La, La, La, La, La, La, La]"
2,Butterfly,04:13,2015.0,Art Angels,"[Verse 1] Big beats, black cloud Get it wrong,...","([, Verse, 1, ], Big, beats, ,, black, cloud, ...","[[, Verse, 1, ], Big, beats, ,, black, cloud, ...","[[, verse, 1, ], big, beat, ,, black, cloud, g...","[(X, XX), (NOUN, NN), (NUM, CD), (PUNCT, -RRB-...","[Verse, Pre, -, Chorus, Sweeter, Calculate, Ve..."
3,Flesh Without Blood,04:25,2015.0,Art Angels,"[Intro] Ooh, ah-ah Ooh, ah-ah [Verse 1] You cl...","([, Intro, ], Ooh, ,, ah, -, ah, Ooh, ,, ah, -...","[[, Intro, ], Ooh, ,, ah, -, ah, Ooh, ,, ah, -...","[[, intro, ], Ooh, ,, ah, -, ah, Ooh, ,, ah, -...","[(X, XX), (X, XX), (PUNCT, -RRB-), (PROPN, NNP...","[Ooh, Ooh, Pre, -, Aye, Baby, Uncontrollable, ..."
4,California,03:18,2015.0,Art Angels,"[Verse 1] This, this music makes me cry It sou...","([, Verse, 1, ], This, ,, this, music, makes, ...","[[, Verse, 1, ], This, ,, this, music, makes, ...","[[, verse, 1, ], this, ,, this, music, make, I...","[(X, XX), (NOUN, NN), (NUM, CD), (PUNCT, -RRB-...","[Lord, Pre, -, Chorus, Ca, California, Ca, Cal..."
5,Pin,03:33,2015.0,Art Angels,[Verse 1] Dirt in your fingernails Blood on yo...,"([, Verse, 1, ], Dirt, in, your, fingernails, ...","[[, Verse, 1, ], Dirt, in, your, fingernails, ...","[[, verse, 1, ], Dirt, in, your, fingernail, b...","[(X, XX), (NOUN, NN), (NUM, CD), (PUNCT, -RRB-...","[Dirt, Gentle, Drunk, Tearin, Lighter, Light, ..."


If we compare the Text and Tokens columns, we find a couple of differences. In the table below, you’ll notice that, most importantly, the words and punctuation markers in the Tokens column are separated by commas, indicating that each has been parsed as an individual token. The text in the Tokens column is also bracketed; this indicates that the tokens have been generated as a list.

In [23]:
tokens = final_paper_df[['Text', 'Tokens']].copy()
tokens.head()

Unnamed: 0,Text,Tokens
0,"[Verse 1] I got in a fight, I was indisposed I...","[[, Verse, 1, ], I, got, in, a, fight, ,, I, w..."
1,[Verse 1] I never walk about after dark It's m...,"[[, Verse, 1, ], I, never, walk, about, after,..."
2,"[Verse 1] Big beats, black cloud Get it wrong,...","[[, Verse, 1, ], Big, beats, ,, black, cloud, ..."
3,"[Intro] Ooh, ah-ah Ooh, ah-ah [Verse 1] You cl...","[[, Intro, ], Ooh, ,, ah, -, ah, Ooh, ,, ah, -..."
4,"[Verse 1] This, this music makes me cry It sou...","[[, Verse, 1, ], This, ,, this, music, makes, ..."


#### Lemmatization

Another process performed by spaCy is lemmatization, or the retrieval of the dictionary root word of each word (for example “brighten” for “brightening”). We’ll perform a similar set of steps to those above to create a function to call the lemmas from the Doc object, then apply it to the DataFrame.

In [24]:
# Define a function to retrieve lemmas from a doc object
def get_lemma(doc):
    return [(token.lemma_) for token in doc]

# Run the lemma retrieval function on the doc objects in the dataframe
final_paper_df['Lemmas'] = final_paper_df['Doc'].apply(get_lemma)

Lemmatization can help reduce noise and refine results for researchers who are conducting keyword searches. For example, let’s compare counts of the word “fly” in the original Tokens column and in the lemmatized Lemmas column.

In [30]:
token_count = final_paper_df['Tokens'].apply(lambda x: x.count('fly')).sum()
lemma_count = final_paper_df['Lemmas'].apply(lambda x: x.count('fly')).sum()
print(f'"fly" appears in the text tokens column {token_count} times.')
print(f'"fly" appears in the lemmas column {lemma_count} times.')

"fly" appears in the text tokens column 2 times.
"fly" appears in the lemmas column 4 times.


As expected, there are more instances of “fly” in the Lemmas column, as the lemmatization process groups inflected word forms (for example “flies” or “flying”) under the base word “fly.”
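
The same comparison can be extended from one keyword to whole frequency lists. A minimal sketch with collections.Counter; the two short lists are hand-written for illustration (they are not spaCy output) and stand in for one row of the Tokens and Lemmas columns.

```python
from collections import Counter

# Hand-written stand-ins for one row of the Tokens and Lemmas columns
tokens = ['flies', 'fly', 'flying', 'butterflies', 'fly']
lemmas = ['fly', 'fly', 'fly', 'butterfly', 'fly']

# Counter maps each distinct item to the number of times it occurs
token_counts = Counter(tokens)
lemma_counts = Counter(lemmas)

print(token_counts['fly'], lemma_counts['fly'])
print(lemma_counts.most_common(1))
```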

### Text Annotation

#### Part of Speech Tagging

spaCy facilitates two levels of part-of-speech tagging: coarse-grained tagging, which predicts the simple universal part-of-speech of each token in a text (such as noun, verb, adjective, adverb), and detailed tagging, which uses a larger, more fine-grained set of part-of-speech tags (for example 3rd person singular present verb). The part-of-speech tags used are determined by the English language model we use. In this case, we’re using the small English model, and you can explore the differences between the models on spaCy’s website.

We can call the part-of-speech tags in the same way as the lemmas. Create a function to extract them from any given Doc object and apply the function to each Doc object in the DataFrame. The function we’ll create will extract both the coarse- and fine-grained part-of-speech for each token (token.pos_ and token.tag_, respectively).

In [31]:
# Define a function to retrieve parts of speech from a doc object
def get_pos(doc):
    #Return the coarse- and fine-grained part of speech text for each token in the doc
    return [(token.pos_, token.tag_) for token in doc]

# Run the part-of-speech retrieval function on the doc objects in the dataframe
final_paper_df['POS'] = final_paper_df['Doc'].apply(get_pos)

We can create a list of the part-of-speech columns to review them further. The first (coarse-grained) tag corresponds to a generally recognizable part-of-speech such as a noun, adjective, or punctuation mark, while the second (fine-grained) tags are a bit more difficult to decipher.

In [32]:
# Create a list of part of speech tags
list(final_paper_df['POS'])

[[('X', 'XX'),
  ('NOUN', 'NN'),
  ('NUM', 'CD'),
  ('PUNCT', '-RRB-'),
  ('PRON', 'PRP'),
  ('VERB', 'VBD'),
  ('ADP', 'IN'),
  ('DET', 'DT'),
  ('NOUN', 'NN'),
  ('PUNCT', ','),
  ('PRON', 'PRP'),
  ('AUX', 'VBD'),
  ('VERB', 'VBN'),
  ('PRON', 'PRP'),
  ('AUX', 'VBD'),
  ('ADV', 'RB'),
  ('PUNCT', ','),
  ('SCONJ', 'IN'),
  ('DET', 'PDT'),
  ('DET', 'DT'),
  ('ADJ', 'JJ'),
  ('NOUN', 'NN'),
  ('CCONJ', 'CC'),
  ('PRON', 'PRP'),
  ('AUX', 'VBP'),
  ('ADV', 'RB'),
  ('DET', 'DT'),
  ('NOUN', 'NN'),
  ('PUNCT', ','),
  ('CCONJ', 'CC'),
  ('PRON', 'PRP'),
  ('VERB', 'VBP'),
  ('PRON', 'WP'),
  ('PRON', 'PRP'),
  ('AUX', 'MD'),
  ('PRON', 'PRP'),
  ('VERB', 'VBD'),
  ('NOUN', 'NNS'),
  ('ADP', 'IN'),
  ('ADJ', 'JJ'),
  ('NOUN', 'NNS'),
  ('PRON', 'PRP'),
  ('VERB', 'VBP'),
  ('ADP', 'RP'),
  ('ADP', 'IN'),
  ('ADJ', 'JJ'),
  ('PRON', 'PRP'),
  ('VERB', 'VBD'),
  ('ADP', 'IN'),
  ('DET', 'DT'),
  ('NOUN', 'NN'),
  ('CCONJ', 'CC'),
  ('PRON', 'PRP'),
  ('AUX', 'VBP'),
  ('PART', 'RB'),
  ...

Fortunately, spaCy has a built-in function called explain that can provide a short description of any tag of interest. If we try it on the tag IN using spacy.explain("IN"), the output reads conjunction, subordinating or preposition.

In [46]:
spacy.explain("IN")

'conjunction, subordinating or preposition'

In [48]:
spacy.explain("PROPN")

'proper noun'
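
Long tag lists like the one printed earlier are easier to interpret once they are counted, which is also the basis for the part-of-speech frequency analysis named in the lab goals. A minimal sketch using collections.Counter on the first ten (coarse, fine) pairs from that output; in the lab, the same counting could be applied to each row of the POS column.

```python
from collections import Counter

# First ten (coarse, fine) tag pairs from the POS output shown earlier
pos_pairs = [('X', 'XX'), ('NOUN', 'NN'), ('NUM', 'CD'), ('PUNCT', '-RRB-'),
             ('PRON', 'PRP'), ('VERB', 'VBD'), ('ADP', 'IN'), ('DET', 'DT'),
             ('NOUN', 'NN'), ('PUNCT', ',')]

# Count the coarse-grained and fine-grained tags separately
coarse_counts = Counter(coarse for coarse, fine in pos_pairs)
fine_counts = Counter(fine for coarse, fine in pos_pairs)

print(coarse_counts.most_common(2))
print(fine_counts['NN'])
```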

In some cases, you may want to get only a set of part-of-speech tags for further analysis, like all of the proper nouns. A function can be written to perform this task, extracting only words which have been tagged as proper nouns.

In [34]:
# Define function to extract proper nouns from Doc object
def extract_proper_nouns(doc):
    return [token.text for token in doc if token.pos_ == 'PROPN']

# Apply function to Doc column and store resulting proper nouns in new column
final_paper_df['Proper_Nouns'] = final_paper_df['Doc'].apply(extract_proper_nouns)

Listing the proper nouns in each text can help us ascertain the texts’ subjects. Let’s list the proper nouns in two different texts: the text in the row labeled 1 of the DataFrame and the text in the row labeled 6.

In [40]:
list(final_paper_df.loc[[1,6], 'Proper_Nouns'])

[['La', 'La', 'La', 'La', 'La', 'La', 'La', 'La', 'La'],
 ['wanna',
  'party',
  'wanna',
  'Baby',
  'Baby',
  'wanna',
  'party',
  'wanna',
  'Baby',
  'Baby',
  'Said',
  'wanna',
  'party',
  'wanna',
  'Baby',
  'Baby',
  'Said',
  'Said',
  'Said']]

NOTE: It seems that <span style="color:red">spaCy is making some mistakes with the annotations here</span>. Many words that are not actually proper nouns (such as “wanna”, “party” and “Said”) have been tagged as such.

#### Named Entity Recognition

spaCy can tag named entities in the text, such as names, dates, organizations, and locations. Call the full list of named entities and their descriptions using this code:

In [41]:
# Get all NE labels and assign to variable
labels = nlp.get_pipe("ner").labels

# Print each label and its description
for label in labels:
    print(label + ' : ' + spacy.explain(label))

CARDINAL : Numerals that do not fall under another type
DATE : Absolute or relative dates or periods
EVENT : Named hurricanes, battles, wars, sports events, etc.
FAC : Buildings, airports, highways, bridges, etc.
GPE : Countries, cities, states
LANGUAGE : Any named language
LAW : Named documents made into laws.
LOC : Non-GPE locations, mountain ranges, bodies of water
MONEY : Monetary values, including unit
NORP : Nationalities or religious or political groups
ORDINAL : "first", "second", etc.
ORG : Companies, agencies, institutions, etc.
PERCENT : Percentage, including "%"
PERSON : People, including fictional
PRODUCT : Objects, vehicles, foods, etc. (not services)
QUANTITY : Measurements, as of weight or distance
TIME : Times smaller than a day
WORK_OF_ART : Titles of books, songs, etc.


We’ll create a function to extract the named entity tags from each Doc object and apply it to the Doc objects in the DataFrame, storing the named entities in a new column:

In [49]:
# Define function to extract named entities from doc objects
def extract_named_entities(doc):
    return [ent.label_ for ent in doc.ents]

# Apply function to Doc column and store resulting named entities in new column
final_paper_df['Named_Entities'] = final_paper_df['Doc'].apply(extract_named_entities)
final_paper_df['Named_Entities']

0     [CARDINAL, WORK_OF_ART, ORG, CARDINAL, WORK_OF...
1     [CARDINAL, TIME, ORG, ORG, CARDINAL, PERSON, P...
2     [CARDINAL, PERSON, CARDINAL, TIME, PERSON, CAR...
3                  [PERSON, CARDINAL, ORG, WORK_OF_ART]
4     [CARDINAL, WORK_OF_ART, GPE, GPE, CARDINAL, DA...
5     [CARDINAL, CARDINAL, GPE, PERSON, WORK_OF_ART,...
6              [PERSON, PERSON, PERSON, PERSON, PERSON]
7                                                    []
8     [PERSON, PERSON, PERSON, PERSON, PERSON, PERSO...
9     [CARDINAL, DATE, DATE, CARDINAL, DATE, DATE, D...
10                [ORG, CARDINAL, CARDINAL, DATE, DATE]
11    [TIME, PERSON, PERSON, PERSON, PERSON, PERSON,...
Name: Named_Entities, dtype: object
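
To move from per-text label lists to a frequency overview, the lists can be flattened and counted. A minimal sketch using itertools and collections.Counter; the four lists are the complete (untruncated) rows 3, 6, 7 and 10 from the output above, so the totals describe only those rows, not the whole corpus.

```python
from collections import Counter
from itertools import chain

# Rows 3, 6, 7 and 10 of the Named_Entities output above
label_lists = [
    ['PERSON', 'CARDINAL', 'ORG', 'WORK_OF_ART'],
    ['PERSON', 'PERSON', 'PERSON', 'PERSON', 'PERSON'],
    [],
    ['ORG', 'CARDINAL', 'CARDINAL', 'DATE', 'DATE'],
]

# Flatten the nested lists into one sequence, then count each label
label_counts = Counter(chain.from_iterable(label_lists))

print(label_counts.most_common(1))
print(label_counts['CARDINAL'])
```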

We can add another column with the words and phrases identified as named entities:



In [43]:
# Define function to extract the spans (words and phrases) tagged as named entities
# (named differently from the label function above so it does not overwrite it)
def extract_ne_words(doc):
    return [ent for ent in doc.ents]

# Apply function to Doc column and store resulting spans in new column
final_paper_df['NE_Words'] = final_paper_df['Doc'].apply(extract_ne_words)
final_paper_df['NE_Words']

0     [(1), (Pre, -, Chorus), (Italiana), (2), (Pre,...
1     [(1), (all, the, hours), (La, -, la, -, la, -,...
2     [(1), (Write), (2), (tonight), (Butterflies), ...
3     [(Intro, ], Ooh), (2), (Cause), (Pre, -, Choru...
4     [(1), (Pre, -, Chorus), (California), (Califor...
5     [(1), (three), (Tearin), (Lighter), (Light), (...
6             [(Intro), (Said), (Said), (Said), (Said)]
7                                                    []
8     [(Intro, ], Have), (Shinigami), (Shinigami), (...
9     [(1), (all, day), (all, day), (2), (all, day),...
10    [(Intro), (1), (2), (Everyday), (Everyday, -, ...
11    [(afternoon, Morning), (Rosa), (Vampires), (Ch...
Name: NE_Words, dtype: object

Let’s visualize the words and their named entity tags in a single text. Call the Doc object of the text at index 1 (the second row) and use displacy.render to visualize the text with the named entities highlighted and tagged:

In [44]:
# Extract the Doc object at index 1 (the second text)
doc = final_paper_df['Doc'][1]

# Visualize named entity tagging in a single paper
displacy.render(doc, style='ent', jupyter=True)

Note: <span style="color:red">again spaCy shows inconsistencies in its annotations</span>. Numbers are tagged as CARDINAL, and the vocalizations ('La-la-la') are tagged as an organization or a person.

### Download Enriched Dataset

To save the dataset of doc objects, text reductions and linguistic annotations generated with spaCy, download the final_paper_df DataFrame to your local computer as a .csv file:

In [45]:
# Save DataFrame as csv in the current working directory
# (in Google Colab, download the file or copy it to Google Drive to keep it)
final_paper_df.to_csv('grimes_corpus_with_spaCy_tags.csv')
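
One caveat when reloading the enriched dataset: CSV stores every cell as text, so list columns such as Tokens or Named_Entities come back as strings rather than lists. A minimal sketch of the round trip, using the standard csv module on an in-memory file in place of pandas; the single row is a shortened, illustrative stand-in for the real data.

```python
import ast
import csv
import io

# One illustrative row with a list-valued column
rows = [{'Filename': 'Oblivion', 'Named_Entities': ['CARDINAL', 'TIME', 'ORG']}]

# Write to an in-memory CSV file; the list is stored as its string form
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=['Filename', 'Named_Entities'])
writer.writeheader()
writer.writerows(rows)

# Read it back: the cell is now a plain string, not a list
buffer.seek(0)
loaded = next(csv.DictReader(buffer))
as_string = loaded['Named_Entities']

# ast.literal_eval safely turns the string form back into a list
restored = ast.literal_eval(as_string)

print(type(as_string).__name__, type(restored).__name__)
```

Doc objects cannot be restored this way, since only their text is written out; the texts would need to be processed with nlp again.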