<a href="https://colab.research.google.com/github/programminghistorian/jekyll/blob/Issue-3052/assets/corpus-analysis-with-spacy/corpus-analysis-with-spacy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Corpus processing of Grimes' songs with spaCy
### by Miriam Weigand, s3584674 
#### Code adapted from Megan S. Kane, https://programminghistorian.org/en/lessons/corpus-analysis-with-spacy for Collecting Data class at RUG

## Introduction
---

### Why Use spaCy for Corpus Analysis?
---

As the name implies, corpus analysis involves studying corpora, or large collections of documents. Typically, the documents in a corpus are representative of the group(s) a researcher is interested in studying, such as the writings of a specific author or genre. By analyzing these texts at scale, researchers can identify meaningful trends in the way language is used within the target group(s).

Though computational tools like spaCy can’t read and comprehend the meaning of texts like humans do, they excel at ‘parsing’ (analyzing sentence structure) and ‘tagging’ (labeling) them. When researchers give spaCy a corpus, it will ‘parse’ every document in the collection, identifying the grammatical categories to which each word and phrase in each text most likely belongs. NLP libraries like spaCy use this information to generate lexico-grammatical tags that are of interest to researchers, such as lemmas (base words), part-of-speech tags and named entities (more on these in the Part of Speech Tagging and Named Entity Recognition sections below). Furthermore, computational tools like spaCy can perform these parsing and tagging processes much more quickly (in a matter of seconds or minutes) and on much larger corpora (hundreds, thousands, or even millions of texts) than human readers would be able to.

Though spaCy was designed for industrial use in software development, researchers also find it valuable for several reasons:

- It’s easy to set up and use spaCy’s Trained Models and Pipelines; there is no need to call a wide range of packages and functions for each individual task
- It uses fast and accurate algorithms for text-processing tasks, which are kept up-to-date by the developers so it’s efficient to run
- It performs better on text-splitting tasks than Natural Language Toolkit (NLTK), because it constructs syntactic trees for each sentence

Say you have a big collection of texts. Maybe you’ve gathered speeches from the French Revolution, compiled a bunch of Amazon product reviews, or unearthed a collection of diary entries written during the First World War. In any of these cases, computational analysis can be a good way to complement close reading of your corpus… but where should you start?

One possible way to begin is with spaCy, an industrial-strength library for Natural Language Processing (NLP) in Python. spaCy is capable of processing large corpora, generating linguistic annotations including part-of-speech tags and named entities, as well as preparing texts for further machine classification. This lab is a ‘spaCy 101’ of sorts, a primer for researchers who are new to spaCy and want to learn how it can be used for corpus analysis. It may also be useful for those who are curious about natural language processing tools in general, and how they can help us to answer humanities research questions.

### <span style="color:"> Dataset </span>: 
---


The corpus was manually collected via Grimes' Genius webpage. More info on the selection criteria and collection process is available in the README on [this project's GitHub page](https://github.com/v1alina/collecting_data_assignment4).

The corpus consists of **16 Grimes songs** published between **2010 and 2022**.



###  <span style="color:red"> Research Questions </span>: 
---

The following research questions will be investigated:



### Installing, Importing and Preprocessing
---

In [138]:
# Install and import spacy and plotly. You can skip this step if you have already installed everything.
#!pip install spaCy
#!pip install plotly

In [1]:
# Import spacy
import spacy

# Install English language model
!spacy download en_core_web_sm

# Import os to upload documents and metadata
import os

# Load spaCy visualizer
from spacy import displacy

# Import pandas DataFrame packages
import pandas as pd
pd.options.mode.chained_assignment = None  # default='warn'

# Import graphing package
import plotly.graph_objects as go
import plotly.express as px

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 41.1 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [23]:
# Create empty lists for file names and contents
texts = []
file_names = []

# Iterate through each file in the folder
for _file_name in os.listdir('Grimes'):
    # Look for only text files
    if _file_name.endswith('.txt'):
        # Append contents of each text file to text list, closing the file afterwards
        with open(os.path.join('Grimes', _file_name), 'r', encoding='utf-8') as _file:
            texts.append(_file.read())
        # Append name of each file to file name list
        file_names.append(_file_name)

In [24]:
# Create dictionary object associating each file name with its text
d = {'Filename':file_names,'Text':texts}

In [25]:
# Turn dictionary into a dataframe
lyrics_df = pd.DataFrame(d)

In [26]:
lyrics_df.head()

Unnamed: 0,Filename,Text
0,Flesh Without Blood.txt,"[Intro]\nOoh, ah-ah\nOoh, ah-ah\n\n[Verse 1]\n..."
1,Delete Forever.txt,"[Verse 1]\nLying so awake, things I can't esca..."
2,Kill V. Maim.txt,"[Verse 1]\nI got in a fight, I was indisposed\..."
3,Vanessa.txt,"[Intro]\nI've been\n\n[Verse 1]\nOh, I've been..."
4,World Princess.txt,[Intro]\nThinking of her all my life\nNow I go...


The texts contain extra whitespace characters, such as tabs (\t) and line breaks (\n). Each run of these characters can be replaced by a single space using the str.replace() method with a regular expression.

As you may have noted, the Genius lyrics are all formatted in the same way: brackets indicate which part of the song the lyrics belong to. This means that the text from Genius is already a bit 'cooked', but thanks to the consistent formatting this is not a bad thing! I want to use this information to make the data set a bit clearer.

Before I merge the tables I want to do two more things:
1. Create rows for all the different parts of the songs
2. Replace the "Text" column of our data frame with the cleaned text that does not include the brackets + song part

For this purpose I am using Regular Expressions (RE).
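
As a minimal sketch of the idea (not the exact code used in the cells below), each bracketed header can be paired with the lyrics that follow it in a single pass. The `split_sections` helper and the `SECTION_PATTERN` name are hypothetical, and the sample string is a shortened stand-in modelled on the Genius layout after whitespace has been collapsed.

```python
import re

# Hypothetical pattern: a [Header], then the lyrics up to the next "[" or the end of the text
SECTION_PATTERN = re.compile(r'\[([^\]]+)\]\s*(.*?)\s*(?=\[|$)', re.DOTALL)

def split_sections(lyrics):
    """Return a list of (section_name, section_text) tuples in song order."""
    return SECTION_PATTERN.findall(lyrics)

# Shortened sample in the Genius layout (illustration only)
sample = "[Intro] Ooh, ah-ah [Verse 1] You claw, you fight [Chorus] Baby, believe me"

for name, text in split_sections(sample):
    print(f"{name}: {text}")
```

Because `findall` returns every match, a section that occurs more than once (a second chorus, say) is kept, whereas `re.search` returns only the first occurrence of a header.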

In [27]:
# Remove extra spaces from songs
lyrics_df['Text'] = lyrics_df['Text'].str.replace(r'\s+', ' ', regex=True).str.strip()
lyrics_df.head()

Unnamed: 0,Filename,Text
0,Flesh Without Blood.txt,"[Intro] Ooh, ah-ah Ooh, ah-ah [Verse 1] You cl..."
1,Delete Forever.txt,"[Verse 1] Lying so awake, things I can't escap..."
2,Kill V. Maim.txt,"[Verse 1] I got in a fight, I was indisposed I..."
3,Vanessa.txt,"[Intro] I've been [Verse 1] Oh, I've been wait..."
4,World Princess.txt,[Intro] Thinking of her all my life Now I go d...


In [28]:
# Extract sections and create new columns
import re

pattern = r'\[([^]]+)\]' # matches anything inside brackets: [example]
sections = lyrics_df['Text'].str.extractall(pattern)

def extract_section(text, section):
    # Return the text after the first [section] header, up to the next header or the end
    # re.escape guards against section names that contain regex metacharacters
    match = re.search(fr'\[{re.escape(section)}\](.*?)(?=\[|$)', text, re.DOTALL)
    return match.group(1) if match else None

# Iterate through sections and update DataFrame
for section in sections[0].unique():
    section_name = section.lower().replace(' ', '_')  # Convert to lowercase and replace spaces with underscores
    lyrics_df[section_name] = lyrics_df['Text'].apply(extract_section, section=section)

#Display the updated DataFrame
lyrics_df.head()

Unnamed: 0,Filename,Text,intro,verse_1,pre-chorus,chorus,verse_2,bridge,outro,verse_3,verse_4,verse_5,post-chorus,refrain,interlude,verse
0,Flesh Without Blood.txt,"[Intro] Ooh, ah-ah Ooh, ah-ah [Verse 1] You cl...","Ooh, ah-ah Ooh, ah-ah","You claw, you fight, you lose, got a doll tha...","Aye-yeah, aye-yeah Aye, I don't see the light...","(Now you’ll never know) Baby, believe me And ...","You hate, you bite, you lose after all, I jus...","Ooh-oh-oh-oh, ooh-oh-oh-oh Hey, hey, sing alo...",(Now you’ll never know) Aah-ah ah (Then your ...,,,,,,,
1,Delete Forever.txt,"[Verse 1] Lying so awake, things I can't escap...",,"Lying so awake, things I can't escape Lately,...","Always down when I'm not up, guess it's just ...","I see everything, I see everything Don't you ...",Funny how they think us naive when we're on t...,,,,,,,,,
2,Kill V. Maim.txt,"[Verse 1] I got in a fight, I was indisposed I...",,"I got in a fight, I was indisposed I was in, ...",B-E-H-A-V-E Arrest us Italiana mobster Lookin...,"Eh I don't behave, I don't behave, oh eh I do...","I did something bad, maybe I was wrong Someti...","Oh, the fire it's all right 'Cause we can mak...",B-E-H-A-V-E Arrest us Italiana mobster Lookin...,,,,,,,
3,Vanessa.txt,"[Intro] I've been [Verse 1] Oh, I've been wait...",I've been,"Oh, I've been waiting destiny And my heart is...",And I know and I need you in the storm And I ...,"Hey, hey, you want to play Well baby, I can g...","I hold on, and I don't care what you say But ...","Everyday, everyday, everyday, everyday Everyd...",,,,,,,,
4,World Princess.txt,[Intro] Thinking of her all my life Now I go d...,Thinking of her all my life Now I go down,"I cannot feel, I cannot feel I cannot feel, I...",,,"Thinking of her, baby that won't go Now I go ...",,,Thinking of her all my life Now I go down Thi...,"Thinking of her, baby that won't go We can go...",Thinking of her all my life Now I go down Thi...,,,,


In [29]:
def clean_brackets(text):
    pattern = r'\[([^]]+)\]'

    # Use re.sub to remove the bracketed section labels from the text
    cleaned_text = re.sub(pattern, '', text)
    
    return cleaned_text

lyrics_df['Text'] = lyrics_df['Text'].apply(clean_brackets)

# Display the updated DataFrame
lyrics_df.head()

Unnamed: 0,Filename,Text,intro,verse_1,pre-chorus,chorus,verse_2,bridge,outro,verse_3,verse_4,verse_5,post-chorus,refrain,interlude,verse
0,Flesh Without Blood.txt,"Ooh, ah-ah Ooh, ah-ah You claw, you fight, y...","Ooh, ah-ah Ooh, ah-ah","You claw, you fight, you lose, got a doll tha...","Aye-yeah, aye-yeah Aye, I don't see the light...","(Now you’ll never know) Baby, believe me And ...","You hate, you bite, you lose after all, I jus...","Ooh-oh-oh-oh, ooh-oh-oh-oh Hey, hey, sing alo...",(Now you’ll never know) Aah-ah ah (Then your ...,,,,,,,
1,Delete Forever.txt,"Lying so awake, things I can't escape Lately,...",,"Lying so awake, things I can't escape Lately,...","Always down when I'm not up, guess it's just ...","I see everything, I see everything Don't you ...",Funny how they think us naive when we're on t...,,,,,,,,,
2,Kill V. Maim.txt,"I got in a fight, I was indisposed I was in, ...",,"I got in a fight, I was indisposed I was in, ...",B-E-H-A-V-E Arrest us Italiana mobster Lookin...,"Eh I don't behave, I don't behave, oh eh I do...","I did something bad, maybe I was wrong Someti...","Oh, the fire it's all right 'Cause we can mak...",B-E-H-A-V-E Arrest us Italiana mobster Lookin...,,,,,,,
3,Vanessa.txt,"I've been Oh, I've been waiting destiny And ...",I've been,"Oh, I've been waiting destiny And my heart is...",And I know and I need you in the storm And I ...,"Hey, hey, you want to play Well baby, I can g...","I hold on, and I don't care what you say But ...","Everyday, everyday, everyday, everyday Everyd...",,,,,,,,
4,World Princess.txt,Thinking of her all my life Now I go down I ...,Thinking of her all my life Now I go down,"I cannot feel, I cannot feel I cannot feel, I...",,,"Thinking of her, baby that won't go Now I go ...",,,Thinking of her all my life Now I go down Thi...,"Thinking of her, baby that won't go We can go...",Thinking of her all my life Now I go down Thi...,,,,


### Enriching the corpus with metadata
---

In [30]:
# Load metadata.
metadata_df = pd.read_csv('metadata.csv')

In [31]:
# Since this corpus is rather small, we can check all the 16 works at one glance
# otherwise it might be good to do random checks of the metadata
metadata_df.head(16)

Unnamed: 0,title,length,release_year,ablum_title
0,World Princess,04:41,2010,Halfaxa
1,Kill V. Maim,04:06,2015,Art Angels
2,Oblivion,04:11,2012,Visions
3,Butterfly,04:13,2015,Art Angels
4,Flesh Without Blood,04:25,2015,Art Angels
5,California,03:18,2015,Art Angels
6,Pin,03:33,2015,Art Angels
7,Realiti,05:07,2015,Art Angels
8,"World Princess, Pt.II",05:06,2015,Art Angels
9,Violence,03:40,2020,Miss Anthropocene


In [32]:
# Remove .txt from title of each song (regex=False so the dot is treated literally)
lyrics_df['Filename'] = lyrics_df['Filename'].str.replace('.txt', '', regex=False)

# Rename the metadata "title" column to "Filename" so both tables share a merge key
metadata_df.rename(columns={"title": "Filename"}, inplace=True)

In [33]:
# Merge metadata and songs into new DataFrame
# Will only keep rows where both songs and metadata are present
final_lyrics_df = metadata_df.merge(lyrics_df,on='Filename')
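
Because the default (inner) merge silently drops rows without a partner, it can be worth checking which titles fail to match. The following is a small sketch using toy stand-ins for the two tables; the DataFrame contents are made up, while `how="outer"` and `indicator=True` are standard `merge` parameters.

```python
import pandas as pd

# Toy stand-ins for metadata_df and lyrics_df (contents chosen for illustration)
metadata = pd.DataFrame({"Filename": ["Oblivion", "Realiti"], "release_year": [2012, 2015]})
lyrics = pd.DataFrame({"Filename": ["Oblivion", "REALiTi"], "Text": ["...", "..."]})

# An outer merge keeps unmatched rows; indicator=True adds a "_merge" column
check = metadata.merge(lyrics, on="Filename", how="outer", indicator=True)

# Rows not marked "both" exist in one table only and would be lost in an inner merge
unmatched = check.loc[check["_merge"] != "both", ["Filename", "_merge"]]
print(unmatched)
```

Any title listed in `unmatched` is spelled differently in the two tables and needs to be harmonised before the real merge.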

Let's check the entire DataFrame again to confirm everything has worked. It should now contain the filename, length, release year, album title, the full text of the lyrics and one column per song section, with a total of 16 rows (0 - 15):

In [34]:
# Print DataFrame
final_lyrics_df.head(16)

Unnamed: 0,Filename,length,release_year,ablum_title,Text,intro,verse_1,pre-chorus,chorus,verse_2,bridge,outro,verse_3,verse_4,verse_5,post-chorus,refrain,interlude,verse
0,World Princess,04:41,2010,Halfaxa,Thinking of her all my life Now I go down I ...,Thinking of her all my life Now I go down,"I cannot feel, I cannot feel I cannot feel, I...",,,"Thinking of her, baby that won't go Now I go ...",,,Thinking of her all my life Now I go down Thi...,"Thinking of her, baby that won't go We can go...",Thinking of her all my life Now I go down Thi...,,,,
1,Kill V. Maim,04:06,2015,Art Angels,"I got in a fight, I was indisposed I was in, ...",,"I got in a fight, I was indisposed I was in, ...",B-E-H-A-V-E Arrest us Italiana mobster Lookin...,"Eh I don't behave, I don't behave, oh eh I do...","I did something bad, maybe I was wrong Someti...","Oh, the fire it's all right 'Cause we can mak...",B-E-H-A-V-E Arrest us Italiana mobster Lookin...,,,,,,,
2,Oblivion,04:11,2012,Visions,I never walk about after dark It's my point o...,,I never walk about after dark It's my point o...,,See you on a dark night (La-la-la-la-la) See ...,"And no, I'm not a jerk I would ask if you cou...",To look into my eyes and tell me La-la-la-la-...,,,,,,,,
3,Butterfly,04:13,2015,Art Angels,"Big beats, black cloud Get it wrong, get loud...",,"Big beats, black cloud Get it wrong, get loud...","I don't need to know So, do you want to? Am I...","Oh, no, it came Higher than an aeroplane Don'...","Oh, then, get lost Take his shit, maybe not L...",If you're looking for a dream girl I'll never...,If you're looking for a dream girl I'll never...,"Run away, get caught Put in cell, livestock C...","Big bird, dead man Wish I could save them Don...",,,,,
4,Flesh Without Blood,04:25,2015,Art Angels,"Ooh, ah-ah Ooh, ah-ah You claw, you fight, y...","Ooh, ah-ah Ooh, ah-ah","You claw, you fight, you lose, got a doll tha...","Aye-yeah, aye-yeah Aye, I don't see the light...","(Now you’ll never know) Baby, believe me And ...","You hate, you bite, you lose after all, I jus...","Ooh-oh-oh-oh, ooh-oh-oh-oh Hey, hey, sing alo...",(Now you’ll never know) Aah-ah ah (Then your ...,,,,,,,
5,California,03:18,2015,Art Angels,"This, this music makes me cry It sounds just ...",,"This, this music makes me cry It sounds just ...","The things they see in me, I cannot see mysel...","Ca-ah-ah-ah, California You only like me when...","Oh (Ah-ah-ah) Come Monday, it's a dream (Ah-a...",,"Oh I, eh, I Oh na, na, na, ne Oh I, eh, I Oh ...",,,,,,,
6,Pin,03:33,2015,Art Angels,Dirt in your fingernails Blood on your knees ...,,Dirt in your fingernails Blood on your knees ...,,Oh Falling off the edge with you Oh It was to...,Bite off your fingernails Cut up your skin Te...,Thought I had won I thought I won til I lost ...,,,,,,,,
7,Realiti,05:07,2015,Art Angels,"Ees I tahw si siht, pu teg When we were youn...","Ees I tahw si siht, pu teg",,There were moments when it seemed okay (But I...,"Oh, baby, every morning there are mountains t...",,,"(Give me a sign) Oh, baby, every To reality (...",,,,,,"Oh, I fear that no life will ever be like thi...","When we were young, we used to get so close t..."
8,"World Princess, Pt.II",05:06,2015,Art Angels,"I got a big dream, small world in between Me ...",,"I got a big dream, small world in between Me ...",But I can't see something more than the thing...,It's mine It's mine,"I saw the parade, big band, masquerade So wha...","Dim the light In your head, in your heart, in...","(It's mine) If I stare into the darkness, I w...",,,,,,I know most likely How I used to be a frail a...,
9,Violence,03:40,2020,Miss Anthropocene,"I'm, like, begging for it, baby Makes you wan...","I'm, like, begging for it, baby Makes you wan...",,,"I'm, like, begging for it, baby Makes you wan...",,,"And I like it like that, and I like it like t...",,,,,"You wanna make me bad, make me bad And I like...",,


I changed some of the song titles/filenames, as Grimes' unconventional song names made it difficult to merge the tables correctly.

For instance, I changed the song title "REALiTi" to "Realiti" and "World ♡ Princess" to simply "World Princess", as the heart was causing an issue when merging the tables.
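
Instead of renaming by hand, the titles could also be reduced to a plain merge key in code. This is only a sketch of one possible approach: `normalize_title` is a hypothetical helper, and dropping every symbol is an assumption that may be too aggressive for other corpora.

```python
import re
import unicodedata

def normalize_title(title):
    """Hypothetical helper: reduce a title to a plain, lowercase merge key."""
    # Fold compatibility characters (e.g. full-width letters) to their plain forms
    title = unicodedata.normalize("NFKC", title)
    # Keep only letters, digits and whitespace; symbols such as the heart are dropped
    kept = "".join(ch for ch in title if ch.isalnum() or ch.isspace())
    # Collapse runs of whitespace and ignore case
    return re.sub(r"\s+", " ", kept).strip().casefold()

print(normalize_title("World ♡ Princess"))  # world princess
print(normalize_title("REALiTi"))           # realiti
```

Applying the helper to both tables and merging on the resulting key column would leave the original display titles untouched.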

The resulting DataFrame is now ready for text enrichment.

## Text Enrichment with spaCy

### Creating Doc Objects


To use spaCy, the first step is to load one of spaCy’s Trained Models and Pipelines which will be used to perform tokenization, part-of-speech tagging, and other text enrichment tasks. A wide range of options are available, and they vary based on size and language.

We’ll use en_core_web_sm, which has been trained on written web texts. It may not perform as accurately as the medium and large English models, but it will deliver results most efficiently. Once we’ve loaded en_core_web_sm, we can check which components its pipeline contains: the parser, tagger, lemmatizer and NER should be among those listed.

In [35]:
# Load nlp pipeline
nlp = spacy.load('en_core_web_sm')

# Check what functions it performs
print(nlp.pipe_names)

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']


In [36]:
# Define a function that runs the nlp pipeline on any given input text
def process_text(text):
    return nlp(text)

After the function is defined, use .apply() to apply it to every cell in a given DataFrame column. In this case, nlp will run on each cell in the Text column of the final_lyrics_df DataFrame, creating a Doc object for every lyrical text. These Doc objects will be stored in a new column of the DataFrame called Doc.

Running this function can take some time because spaCy performs all the parsing and tagging tasks on each text. However, when it is complete, we can simply call on the resulting Doc objects to get parts-of-speech, named entities, and other information of interest.

In [37]:
# Apply the function to the "Text" column, so that the nlp pipeline is called on each song
final_lyrics_df['Doc'] = final_lyrics_df['Text'].apply(process_text)

### Text Reduction

#### Tokenization

A critical first step spaCy performs is tokenization, or the segmentation of strings into individual words and punctuation markers. Tokenization enables spaCy to parse the grammatical structures of a text and identify characteristics of each word, such as its part of speech.

To retrieve a tokenized version of each text in the DataFrame, we’ll write a function that iterates through any given Doc object and returns all tokens found within it.

In [38]:
# Define a function to retrieve tokens from a doc object
def get_token(doc):
    return [(token.text) for token in doc]
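
To get a feel for what a token is before running this on the whole corpus, here is a small sketch on a single line of lyrics. It uses a blank English pipeline (tokenizer only) purely so that it runs without a model download; the variable name `blank_nlp` is made up for this example.

```python
import spacy

# A blank English pipeline contains only the tokenizer, so no model download is needed
blank_nlp = spacy.blank("en")

doc = blank_nlp("I don't behave")
print([token.text for token in doc])  # ['I', 'do', "n't", 'behave']
```

Note that the contraction is split into two tokens. This presumably also affects keyword counts on the token level, since the "do" in "don't" becomes a token of its own.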

As with the function used to create Doc objects, the token function can be applied to the DataFrame. In this case, we will call the function on the Doc column, since this is the column which stores the results from the processing done by spaCy.

In [39]:
# Run the token retrieval function on the doc objects in the dataframe
final_lyrics_df['Tokens'] = final_lyrics_df['Doc'].apply(get_token)
final_lyrics_df.head(6)

Unnamed: 0,Filename,length,release_year,ablum_title,Text,intro,verse_1,pre-chorus,chorus,verse_2,...,outro,verse_3,verse_4,verse_5,post-chorus,refrain,interlude,verse,Doc,Tokens
0,World Princess,04:41,2010,Halfaxa,Thinking of her all my life Now I go down I ...,Thinking of her all my life Now I go down,"I cannot feel, I cannot feel I cannot feel, I...",,,"Thinking of her, baby that won't go Now I go ...",...,,Thinking of her all my life Now I go down Thi...,"Thinking of her, baby that won't go We can go...",Thinking of her all my life Now I go down Thi...,,,,,"( , Thinking, of, her, all, my, life, Now, I, ...","[ , Thinking, of, her, all, my, life, Now, I, ..."
1,Kill V. Maim,04:06,2015,Art Angels,"I got in a fight, I was indisposed I was in, ...",,"I got in a fight, I was indisposed I was in, ...",B-E-H-A-V-E Arrest us Italiana mobster Lookin...,"Eh I don't behave, I don't behave, oh eh I do...","I did something bad, maybe I was wrong Someti...",...,B-E-H-A-V-E Arrest us Italiana mobster Lookin...,,,,,,,,"( , I, got, in, a, fight, ,, I, was, indispose...","[ , I, got, in, a, fight, ,, I, was, indispose..."
2,Oblivion,04:11,2012,Visions,I never walk about after dark It's my point o...,,I never walk about after dark It's my point o...,,See you on a dark night (La-la-la-la-la) See ...,"And no, I'm not a jerk I would ask if you cou...",...,,,,,,,,,"( , I, never, walk, about, after, dark, It, 's...","[ , I, never, walk, about, after, dark, It, 's..."
3,Butterfly,04:13,2015,Art Angels,"Big beats, black cloud Get it wrong, get loud...",,"Big beats, black cloud Get it wrong, get loud...","I don't need to know So, do you want to? Am I...","Oh, no, it came Higher than an aeroplane Don'...","Oh, then, get lost Take his shit, maybe not L...",...,If you're looking for a dream girl I'll never...,"Run away, get caught Put in cell, livestock C...","Big bird, dead man Wish I could save them Don...",,,,,,"( , Big, beats, ,, black, cloud, Get, it, wron...","[ , Big, beats, ,, black, cloud, Get, it, wron..."
4,Flesh Without Blood,04:25,2015,Art Angels,"Ooh, ah-ah Ooh, ah-ah You claw, you fight, y...","Ooh, ah-ah Ooh, ah-ah","You claw, you fight, you lose, got a doll tha...","Aye-yeah, aye-yeah Aye, I don't see the light...","(Now you’ll never know) Baby, believe me And ...","You hate, you bite, you lose after all, I jus...",...,(Now you’ll never know) Aah-ah ah (Then your ...,,,,,,,,"( , Ooh, ,, ah, -, ah, Ooh, ,, ah, -, ah, , Y...","[ , Ooh, ,, ah, -, ah, Ooh, ,, ah, -, ah, , Y..."
5,California,03:18,2015,Art Angels,"This, this music makes me cry It sounds just ...",,"This, this music makes me cry It sounds just ...","The things they see in me, I cannot see mysel...","Ca-ah-ah-ah, California You only like me when...","Oh (Ah-ah-ah) Come Monday, it's a dream (Ah-a...",...,"Oh I, eh, I Oh na, na, na, ne Oh I, eh, I Oh ...",,,,,,,,"( , This, ,, this, music, makes, me, cry, It, ...","[ , This, ,, this, music, makes, me, cry, It, ..."


If we compare the Text and Tokens columns, we find a couple of differences. In the table below, you’ll notice that most importantly, the words, spaces, and punctuation markers in the Tokens column are separated by commas, indicating that each has been parsed as an individual token. The text in the Tokens column is also bracketed; this indicates that the tokens have been generated as a list.

We can have a closer look by creating a subset of our dataframe:

In [40]:
tokens = final_lyrics_df[['Text', 'Tokens']].copy()
tokens.head()

Unnamed: 0,Text,Tokens
0,Thinking of her all my life Now I go down I ...,"[ , Thinking, of, her, all, my, life, Now, I, ..."
1,"I got in a fight, I was indisposed I was in, ...","[ , I, got, in, a, fight, ,, I, was, indispose..."
2,I never walk about after dark It's my point o...,"[ , I, never, walk, about, after, dark, It, 's..."
3,"Big beats, black cloud Get it wrong, get loud...","[ , Big, beats, ,, black, cloud, Get, it, wron..."
4,"Ooh, ah-ah Ooh, ah-ah You claw, you fight, y...","[ , Ooh, ,, ah, -, ah, Ooh, ,, ah, -, ah, , Y..."


#### Lemmatization

Another process performed by spaCy is lemmatization, or the retrieval of the dictionary root word of each word (for example “brighten” for “brightening”). We’ll perform a similar set of steps to those above to create a function to call the lemmas from the Doc object, then apply it to the DataFrame.

In [41]:
# Define a function to retrieve lemmas from a doc object
def get_lemma(doc):
    return [(token.lemma_) for token in doc]

# Run the lemma retrieval function on the doc objects in the dataframe
final_lyrics_df['Lemmas'] = final_lyrics_df['Doc'].apply(get_lemma)

Lemmatization can help reduce noise and refine results for researchers who are conducting keyword searches. For example, let’s compare counts of the word “fly” in the original Tokens column and in the lemmatized Lemmas column.

In [42]:
fly_tokens = final_lyrics_df['Tokens'].apply(lambda x: x.count('fly')).sum()
fly_lemmas = final_lyrics_df['Lemmas'].apply(lambda x: x.count('fly')).sum()
print(f'"fly" appears in the text tokens column {fly_tokens} times.')
print(f'"fly" appears in the lemmas column {fly_lemmas} times.')

"fly" appears in the text tokens column 2 times.
"fly" appears in the lemmas column 4 times.


As expected, there are more instances of “fly” in the Lemmas column, as the lemmatization process has grouped inflected word forms (such as flying, flew) into the base word “fly.”

We can see the effects of lemmatization even on a corpus as small as this one. Look at the vast difference between the token and lemma counts for the verb 'do':

In [45]:
do_tokens = final_lyrics_df['Tokens'].apply(lambda x: x.count('do')).sum()
do_lemmas = final_lyrics_df['Lemmas'].apply(lambda x: x.count('do')).sum()
print(f'"do" appears in the text tokens column {do_tokens} times.')
print(f'"do" appears in the lemmas column {do_lemmas} times.')

"do" appears in the text tokens column 59 times.
"do" appears in the lemmas column 93 times.


Personally, I believe that the **lemmatizer is one of the features that makes spaCy a bit more powerful than the also commonly used NLTK library** (the Natural Language Toolkit). This is because spaCy is better at creating lemmas for verbs. [Using NLTK, unless you specify that the word you are lemmatizing is a verb, it might not create the correct lemma](https://stackoverflow.com/questions/25534214/nltk-wordnet-lemmatizer-shouldnt-it-lemmatize-all-inflections-of-a-word), because its WordNet lemmatizer treats every word as a noun by default. Guessing from the word form alone is not reliable either, because the '-ing' form in English can also be used as an adjective. Consider the following example: "they have a very loving relationship". The word "loving" is not used as a verb in this context.

spaCy relies on the part-of-speech tags it has already assigned to recognize when something is a verb and needs to be lemmatized as one, making it a bit more convenient.

### Text Annotation

#### Part of Speech Tagging

spaCy facilitates two levels of part-of-speech tagging: coarse-grained tagging, which predicts the simple universal part-of-speech of each token in a text (such as noun, verb, adjective, adverb), and detailed tagging, which uses a larger, more fine-grained set of part-of-speech tags (for example 3rd person singular present verb). The part-of-speech tags used are determined by the English language model we use. In this case, we’re using the small English model, and you can explore the differences between the models on spaCy’s website.

We can call the part-of-speech tags in the same way as the lemmas. Create a function to extract them from any given Doc object and apply the function to each Doc object in the DataFrame. The function we’ll create will extract both the coarse- and fine-grained part-of-speech for each token (token.pos_ and token.tag_, respectively).

In [160]:
# Define a function to retrieve parts of speech from a doc object
def get_pos(doc):
    # Return the coarse- and fine-grained part of speech text for each token in the doc
    return [(token.pos_, token.tag_) for token in doc]

# Run the part-of-speech retrieval function on the doc objects in the dataframe
final_lyrics_df['POS'] = final_lyrics_df['Doc'].apply(get_pos)

We can create a list from the part-of-speech column to review the tags further. The first (coarse-grained) tag corresponds to a generally recognizable part-of-speech such as a noun, adjective, or punctuation mark, while the second (fine-grained) tag is a bit more difficult to decipher.

In [161]:
# Create a list of part of speech tags
list(final_lyrics_df['POS'])

[[('SPACE', '_SP'),
  ('NOUN', 'NN'),
  ('ADP', 'IN'),
  ('PRON', 'PRP'),
  ('DET', 'PDT'),
  ('PRON', 'PRP$'),
  ('NOUN', 'NN'),
  ('ADV', 'RB'),
  ('PRON', 'PRP'),
  ('VERB', 'VBP'),
  ('ADP', 'RP'),
  ('SPACE', '_SP'),
  ('PRON', 'PRP'),
  ('AUX', 'MD'),
  ('PART', 'RB'),
  ('VERB', 'VB'),
  ('PUNCT', ','),
  ('PRON', 'PRP'),
  ('AUX', 'MD'),
  ('PART', 'RB'),
  ('VERB', 'VB'),
  ('PRON', 'PRP'),
  ('AUX', 'MD'),
  ('PART', 'RB'),
  ('VERB', 'VB'),
  ('PUNCT', ','),
  ('PRON', 'PRP'),
  ('AUX', 'MD'),
  ('PART', 'RB'),
  ('VERB', 'VB'),
  ('PRON', 'PRP'),
  ('AUX', 'MD'),
  ('PART', 'RB'),
  ('VERB', 'VB'),
  ('PUNCT', ','),
  ('PRON', 'PRP'),
  ('AUX', 'MD'),
  ('PART', 'RB'),
  ('VERB', 'VB'),
  ('PRON', 'PRP'),
  ('AUX', 'MD'),
  ('PART', 'RB'),
  ('VERB', 'VB'),
  ('PUNCT', ','),
  ('PRON', 'PRP'),
  ('AUX', 'MD'),
  ('PART', 'RB'),
  ('VERB', 'VB'),
  ('PRON', 'PRP'),
  ('AUX', 'MD'),
  ('PART', 'RB'),
  ('VERB', 'VB'),
  ('PUNCT', ','),
  ('PRON', 'PRP'),
  ('AUX', 'MD'),
  ...

Fortunately, spaCy has a built-in function called explain that can provide a short description of any tag of interest. If we try it on the tag IN using spacy.explain("IN"), the output reads conjunction, subordinating or preposition.

In [162]:
spacy.explain("IN")

'conjunction, subordinating or preposition'

In [163]:
spacy.explain("PROPN")

'proper noun'
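
A long list of tag pairs is hard to scan by eye, so a tally per tag can give a quicker overview. This is a minimal sketch on a toy stand-in for one cell of the POS column; the variable names are made up for illustration.

```python
from collections import Counter

# Toy stand-in for one cell of the POS column: (coarse, fine) tuples
pos_pairs = [("PRON", "PRP"), ("AUX", "MD"), ("PART", "RB"), ("VERB", "VB"),
             ("PUNCT", ","), ("PRON", "PRP"), ("AUX", "MD"), ("PART", "RB"), ("VERB", "VB")]

# Count how often each coarse-grained tag occurs
coarse_counts = Counter(coarse for coarse, _fine in pos_pairs)

# Collect which fine-grained tags occur under each coarse-grained tag
fine_by_coarse = {}
for coarse, fine in pos_pairs:
    fine_by_coarse.setdefault(coarse, set()).add(fine)

print(coarse_counts.most_common())
print(fine_by_coarse)
```

Each fine-grained tag collected this way can then be passed to `spacy.explain` for a description.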

In some cases, you may want to get only a set of part-of-speech tags for further analysis, like all of the proper nouns. A function can be written to perform this task, extracting only words which have been tagged as proper nouns.

In [164]:
# Define function to extract proper nouns from Doc object
def extract_proper_nouns(doc):
    return [token.text for token in doc if token.pos_ == 'PROPN']

# Apply function to Doc column and store resulting proper nouns in new column
final_lyrics_df['Proper_Nouns'] = final_lyrics_df['Doc'].apply(extract_proper_nouns)

Listing the proper nouns in each text can help us ascertain the texts’ subjects. Let’s list the proper nouns in two different texts, the text located in row 1 of the DataFrame and the text located in row 6.

In [165]:
list(final_lyrics_df.loc[[1,6], 'Proper_Nouns'])

[['Hey',
  'Hey',
  'Baby',
  'Everyday',
  'y',
  'Everyday',
  'y',
  'day',
  'Hey',
  'Hey',
  'Hey',
  'Hey',
  'Baby'],
 ['Shinigami',
  'Shinigami',
  'Got',
  'Shinigami',
  'Shinigami',
  'Got',
  'Shinigami',
  'Rent',
  'Shinigami',
  'Evеrything',
  'Got',
  'Shinigami',
  'Arе',
  'Shinigami',
  'Got',
  'Shinigami',
  'Got',
  'Shinigami',
  'Shinigami',
  'Shinigami',
  'Got',
  'Shinigami',
  'Shinigami',
  'Got',
  'Shinigami',
  'Shinigami',
  'Got',
  'Shinigami',
  'Shinigami',
  'Got',
  'Shinigami']]

NOTE: It seems that <span style="color:red">spaCy is making some mistakes with the annotations here</span>. Many of the words tagged as proper nouns are not proper nouns at all (some are not even nouns); most of them are simply capitalized. Additionally, I know that there are actual nouns in these song lyrics that were not recognized.


My hypothesis for spaCy's shortcomings here is that **lyrics do not have a very classic sentence structure as opposed to written text formats** (i.e. papers, books, news articles, forum posts). Perhaps, the model was not trained on a lot of lyrics and therefore performs badly on them.

A look into spaCy's documentation reveals that we can [train our pipelines so that they perform better on unseen data](https://spacy.io/usage/training). I believe this could be a great option for further enriching this data set. This is good to keep in mind for other projects, but it is beyond the scope of this assignment for now.
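
The output above lists some very common words among the proper nouns. One cheap thing to rule out, offered here only as a suggestion, is whether such tokens contain lookalike characters from another script, which a model would likely treat as unknown words. The `non_latin_letters` helper is hypothetical, and the test string is constructed explicitly for this example rather than taken from the corpus.

```python
import unicodedata

def non_latin_letters(token):
    """Hypothetical helper: return (char, Unicode name) for letters that are not Latin."""
    return [(ch, unicodedata.name(ch, "UNKNOWN"))
            for ch in token
            if ch.isalpha() and "LATIN" not in unicodedata.name(ch, "")]

# "\u0435" is CYRILLIC SMALL LETTER IE, which looks like a Latin "e" in most fonts
suspect = "Ev\u0435rything"

print(non_latin_letters(suspect))
print(non_latin_letters("Everything"))  # []
```

Running the helper over the Proper_Nouns column would show whether any flagged tokens need cleaning before the texts are passed to spaCy.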

#### Named Entity Recognition

spaCy can tag named entities in the text, such as names, dates, organizations, and locations. Call the full list of named entities and their descriptions using this code:

In [166]:
# Get all NE labels and assign to variable
labels = nlp.get_pipe("ner").labels

# Print each label and its description
for label in labels:
    print(label + ' : ' + spacy.explain(label))

CARDINAL : Numerals that do not fall under another type
DATE : Absolute or relative dates or periods
EVENT : Named hurricanes, battles, wars, sports events, etc.
FAC : Buildings, airports, highways, bridges, etc.
GPE : Countries, cities, states
LANGUAGE : Any named language
LAW : Named documents made into laws.
LOC : Non-GPE locations, mountain ranges, bodies of water
MONEY : Monetary values, including unit
NORP : Nationalities or religious or political groups
ORDINAL : "first", "second", etc.
ORG : Companies, agencies, institutions, etc.
PERCENT : Percentage, including "%"
PERSON : People, including fictional
PRODUCT : Objects, vehicles, foods, etc. (not services)
QUANTITY : Measurements, as of weight or distance
TIME : Times smaller than a day
WORK_OF_ART : Titles of books, songs, etc.


**During the annotation process, I realized that the NER results for the lyrical corpus I created were *not very reliable*.**

For example, it misidentified vocalisations such as 'Oh-ohh' as organisations or persons and there were other similar cases. As these annotations cannot be considered very useful, I decided to leave them out and not add them to the CSV file.

Again, as mentioned at the end of the Part of Speech Tagging section, further training of the pipeline might be the right way to address this.

### Download Enriched Dataset

To save the dataset of doc objects, text reductions and linguistic annotations generated with spaCy, download the final_lyrics_df DataFrame to your local computer as a .csv file:

In [47]:
# Save DataFrame as a csv file in the current working directory
final_lyrics_df.to_csv('grimes_corpus_with_spaCy_tags.csv')
print('Successfully saved grimes_corpus_with_spaCy_tags.csv!')

Successfully saved grimes_corpus_with_spaCy_tags.csv!


Great, we now should have a file called `grimes_corpus_with_spaCy_tags.csv` saved in the same working directory as this Notebook!
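
One caveat when reusing the file: CSV stores every cell as text, so list-valued columns such as Tokens or Lemmas come back as strings. The sketch below demonstrates this with the standard library only, using a made-up row and an in-memory file, and shows how `ast.literal_eval` can turn such a string back into a list.

```python
import ast
import csv
import io

# Toy stand-in for one row with a list-valued column (made up for illustration)
row = {"Filename": "Oblivion", "Tokens": ["I", "never", "walk"]}

# Write to an in-memory CSV; the list is stored as its string representation
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["Filename", "Tokens"])
writer.writeheader()
writer.writerow(row)

# Read it back: the Tokens cell is now a plain string
buffer.seek(0)
loaded = next(csv.DictReader(buffer))
print(type(loaded["Tokens"]).__name__)  # str

# ast.literal_eval safely parses the string back into a Python list
restored = ast.literal_eval(loaded["Tokens"])
print(restored)  # ['I', 'never', 'walk']
```

With pandas, the same conversion can be requested while loading via the `converters` argument of `pd.read_csv`. The Doc column cannot be restored this way, since only its text is written to the file; the Doc objects would have to be recreated by running the pipeline again.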