# Annotating Female Abolitionist Poets using spaCy

This notebook covers the process of creating a .csv file with annotated data for the Female Abolitionist Poems dataset using spaCy (see [README](https://github.com/tijn-donners/collecting_data_spaCy/blob/main/README.md) for more info). First, let's import the required Python libraries.


In [1]:
import os #for loading (meta)data
import pandas as pd
import spacy
#!spacy download en_core_web_sm #downloads english language model


### Now that the required modules are imported, we build two lists: one of file names and one of texts

In [2]:
texts = []
file_names = []

for file_name in os.listdir('data'):
    if file_name.endswith('.txt'):
        #open in read mode ('r'); the with-block closes the file after reading
        with open(os.path.join('data', file_name), 'r', encoding='utf-8') as f:
            texts.append(f.read())
        file_names.append(file_name)

### Now we turn these lists into a dictionary, and then a pandas DataFrame

In [3]:
d = {'Filename':file_names,'Text':texts}
df = pd.DataFrame(d)
df.head()

Unnamed: 0,Filename,Text
0,More_1788.txt,"SLAVERY,\nA POEM.\n\nBY\n\nHANNAH MORE\n\n\n..."
1,Barbauld_1791.txt,"EPISTLE\n\nTO\nWILLIAM WILBERFORCE, ESQ.\n\n \..."
2,Card_1792_1.txt,A\n\nPOEM\n\nON THE\n\nAFRICAN\n\nSLAVE TRADE....
3,More_1797.txt,CHEAP REPOSITORY\nTHE\nSORROWS OF YAMBA;\n...


### As we can see, the raw text contains new lines (\n) and other whitespace such as tabs (\t); we collapse each run of whitespace into a single space

In [4]:
df['Text'] = df['Text'].str.replace(r'\s+', ' ', regex=True).str.strip()
df.head(20)

Unnamed: 0,Filename,Text
0,More_1788.txt,"SLAVERY, A POEM. BY HANNAH MORE O great design..."
1,Barbauld_1791.txt,"EPISTLE TO WILLIAM WILBERFORCE, ESQ. [Price On..."
2,Card_1792_1.txt,A POEM ON THE AFRICAN SLAVE TRADE. ADDRESSED T...
3,More_1797.txt,"CHEAP REPOSITORY THE SORROWS OF YAMBA; OR, THE..."
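
To make the effect of the pattern concrete, here is a minimal sketch on a toy Series (the sample string is made up for illustration and is not read from the dataset). `\s+` matches any run of whitespace, including tabs and new lines, so each run becomes one space, and `.str.strip()` removes whatever is left at the edges.

```python
import pandas as pd

# Toy example (hypothetical string, not read from the dataset)
sample = pd.Series(["SLAVERY,\nA POEM.\n\n\tBY\n\nHANNAH MORE\n\n"])

# Collapse every run of whitespace into one space, then trim both ends
cleaned = sample.str.replace(r'\s+', ' ', regex=True).str.strip()

print(repr(cleaned[0]))
```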


### Now we load metadata.csv and merge the two tables on their shared Filename column

In [5]:
metadata_df = pd.read_csv('metadata.csv')
#Both DataFrames share a Filename column, so they can be merged on it
merged_df = df.merge(metadata_df, on='Filename')
merged_df.head()

Unnamed: 0,Filename,Text,Title,Year,Author,AuthorGender,Link
0,More_1788.txt,"SLAVERY, A POEM. BY HANNAH MORE O great design...","Slavery, A Poem",1788,Hannah More,F,https://brycchancarey.com/slavery/morepoems.htm
1,Barbauld_1791.txt,"EPISTLE TO WILLIAM WILBERFORCE, ESQ. [Price On...","Epistle to William Wilberforce, Esq. On the Re...",1791,Anna Letitia Barbauld,F,https://brycchancarey.com/slavery/barbauld1.htm
2,Card_1792_1.txt,A POEM ON THE AFRICAN SLAVE TRADE. ADDRESSED T...,A Poem on the African Slave Trade. Addressed t...,1792,Mary Birkett Card,F,https://brycchancarey.com/slavery/mbc1.htm
3,More_1797.txt,"CHEAP REPOSITORY THE SORROWS OF YAMBA; OR, THE...",The Sorrows of Yamba,1797,Hannah More and Eaglesfield Smith,F,https://brycchancarey.com/slavery/yamba.htm
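
Because `merge` performs an inner join by default, a text file without a matching row in metadata.csv would be dropped silently. Below is a minimal sketch of one way to check for this, using toy DataFrames in place of `df` and `metadata_df` (the file name `Unknown_1800.txt` and the variable names are hypothetical).

```python
import pandas as pd

# Toy stand-ins for df and metadata_df (values are illustrative only)
toy_df = pd.DataFrame({'Filename': ['More_1788.txt', 'Unknown_1800.txt'],
                       'Text': ['text a', 'text b']})
toy_meta = pd.DataFrame({'Filename': ['More_1788.txt'],
                         'Year': [1788]})

# A left join keeps every text; indicator=True adds a '_merge' column
# that records whether a metadata row was found for it
checked = toy_df.merge(toy_meta, on='Filename', how='left', indicator=True)

# File names that have no metadata row
missing = checked.loc[checked['_merge'] == 'left_only', 'Filename'].tolist()
print(missing)
```

An empty list here would mean that every text file has metadata.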


### Here the texts are tokenized and the token count per text is calculated

To do this, we first load the nlp pipeline. The model must already be installed; if it is not, uncomment the download line in the first cell of this notebook and run it once.

In [6]:
pipeline = spacy.load('en_core_web_sm') #load the nlp pipeline

In [7]:
def process_text(text):
    return pipeline(text)

In [8]:
#Here I apply the function to each row in the merged_df dataframe to create a new column with spaCy doc objects
merged_df['Doc'] = merged_df['Text'].apply(process_text)
merged_df.head(7)

Unnamed: 0,Filename,Text,Title,Year,Author,AuthorGender,Link,Doc
0,More_1788.txt,"SLAVERY, A POEM. BY HANNAH MORE O great design...","Slavery, A Poem",1788,Hannah More,F,https://brycchancarey.com/slavery/morepoems.htm,"(SLAVERY, ,, A, POEM, ., BY, HANNAH, MORE, O, ..."
1,Barbauld_1791.txt,"EPISTLE TO WILLIAM WILBERFORCE, ESQ. [Price On...","Epistle to William Wilberforce, Esq. On the Re...",1791,Anna Letitia Barbauld,F,https://brycchancarey.com/slavery/barbauld1.htm,"(EPISTLE, TO, WILLIAM, WILBERFORCE, ,, ESQ, .,..."
2,Card_1792_1.txt,A POEM ON THE AFRICAN SLAVE TRADE. ADDRESSED T...,A Poem on the African Slave Trade. Addressed t...,1792,Mary Birkett Card,F,https://brycchancarey.com/slavery/mbc1.htm,"(A, POEM, ON, THE, AFRICAN, SLAVE, TRADE, ., A..."
3,More_1797.txt,"CHEAP REPOSITORY THE SORROWS OF YAMBA; OR, THE...",The Sorrows of Yamba,1797,Hannah More and Eaglesfield Smith,F,https://brycchancarey.com/slavery/yamba.htm,"(CHEAP, REPOSITORY, THE, SORROWS, OF, YAMBA, ;..."
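
For larger collections, spaCy's `pipe` method processes texts as a stream, which is usually faster than calling the pipeline once per row. Below is a minimal sketch, assuming a blank English pipeline (`spacy.blank('en')`, tokenizer only) so that it runs without a model download; in this notebook the loaded `en_core_web_sm` pipeline would take its place. The toy texts and variable names are illustrative.

```python
import spacy

# Assumption: a blank English pipeline (tokenizer only) stands in for
# the en_core_web_sm pipeline loaded above
blank_pipeline = spacy.blank('en')

toy_texts = ['SLAVERY, A POEM.', 'THE SORROWS OF YAMBA']

# pipe() yields one Doc per input text, in the same order as the input
docs = list(blank_pipeline.pipe(toy_texts))

for doc in docs:
    print([token.text for token in doc])
```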


In [9]:
#This function extracts the token texts from a given spaCy doc object
def tokenize(doc):
    return [token.text for token in doc]

#now we can apply this function to the dataframe
merged_df['Tokens'] = merged_df['Doc'].apply(tokenize)
merged_df.head()

Unnamed: 0,Filename,Text,Title,Year,Author,AuthorGender,Link,Doc,Tokens
0,More_1788.txt,"SLAVERY, A POEM. BY HANNAH MORE O great design...","Slavery, A Poem",1788,Hannah More,F,https://brycchancarey.com/slavery/morepoems.htm,"(SLAVERY, ,, A, POEM, ., BY, HANNAH, MORE, O, ...","[SLAVERY, ,, A, POEM, ., BY, HANNAH, MORE, O, ..."
1,Barbauld_1791.txt,"EPISTLE TO WILLIAM WILBERFORCE, ESQ. [Price On...","Epistle to William Wilberforce, Esq. On the Re...",1791,Anna Letitia Barbauld,F,https://brycchancarey.com/slavery/barbauld1.htm,"(EPISTLE, TO, WILLIAM, WILBERFORCE, ,, ESQ, .,...","[EPISTLE, TO, WILLIAM, WILBERFORCE, ,, ESQ, .,..."
2,Card_1792_1.txt,A POEM ON THE AFRICAN SLAVE TRADE. ADDRESSED T...,A Poem on the African Slave Trade. Addressed t...,1792,Mary Birkett Card,F,https://brycchancarey.com/slavery/mbc1.htm,"(A, POEM, ON, THE, AFRICAN, SLAVE, TRADE, ., A...","[A, POEM, ON, THE, AFRICAN, SLAVE, TRADE, ., A..."
3,More_1797.txt,"CHEAP REPOSITORY THE SORROWS OF YAMBA; OR, THE...",The Sorrows of Yamba,1797,Hannah More and Eaglesfield Smith,F,https://brycchancarey.com/slavery/yamba.htm,"(CHEAP, REPOSITORY, THE, SORROWS, OF, YAMBA, ;...","[CHEAP, REPOSITORY, THE, SORROWS, OF, YAMBA, ;..."
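
The token count per text can be derived from the Tokens column by taking the length of each list. Below is a minimal sketch on a toy DataFrame standing in for `merged_df`; the column name `TokenCount` is a hypothetical choice, and the count includes punctuation tokens.

```python
import pandas as pd

# Toy stand-in for merged_df with a Tokens column (illustrative values)
toy_df = pd.DataFrame({
    'Filename': ['a.txt', 'b.txt'],
    'Tokens': [['SLAVERY', ',', 'A', 'POEM', '.'],
               ['THE', 'SORROWS', 'OF', 'YAMBA']],
})

# TokenCount (hypothetical column name): number of tokens per text,
# punctuation included
toy_df['TokenCount'] = toy_df['Tokens'].apply(len)

print(toy_df[['Filename', 'TokenCount']])
```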


### Lemmatization and Part of Speech functions

In [10]:
def lemmatize(doc):
    return [token.lemma_ for token in doc]

def pos(doc):
    #spacy.explain turns each POS abbreviation into its full description
    return [spacy.explain(token.pos_) for token in doc]

In [11]:
merged_df['POS'] = merged_df['Doc'].apply(pos)
merged_df['Lemmas'] = merged_df['Doc'].apply(lemmatize)

Now we will quickly use NLTK to compare word counts in the Lemmas and Tokens columns, to validate that the lemmatization worked. (Note that NLTK must be installed in your virtual environment for this cell to run.)

In [12]:
from nltk import FreqDist
fdist_token = FreqDist(merged_df['Tokens'][0])  #looks at the first entry in the Tokens column
fdist_lemma = FreqDist(merged_df['Lemmas'][0])  #looks at the first entry in the Lemmas column

print(f'"have" count in Tokens column: {fdist_token["have"]}')
print(f'"have" count in Lemmas column: {fdist_lemma["have"]}')

"have" count in Tokens column: 12
"have" count in Lemmas column: 30


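The count for 'have' is higher in the Lemmas column (30) than in the Tokens column (12), presumably because inflected forms such as 'has', 'had' and 'having' are mapped to the lemma 'have'. This indicates that the lemmatization was successful.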
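
To see which surface forms contribute to a lemma's count, the tokens can be paired with their lemmas. Below is a minimal sketch using `collections.Counter` from the standard library in place of NLTK; the toy lists are made up and stand in for one row of the Tokens and Lemmas columns.

```python
from collections import Counter

# Toy stand-ins for merged_df['Tokens'][0] and merged_df['Lemmas'][0]
toy_tokens = ['They', 'have', 'seen', 'what', 'she', 'has', 'had']
toy_lemmas = ['they', 'have', 'see', 'what', 'she', 'have', 'have']

# Count the surface forms that were mapped to the lemma 'have'
forms_of_have = Counter(
    token.lower()
    for token, lemma in zip(toy_tokens, toy_lemmas)
    if lemma == 'have'
)

print(forms_of_have)
```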

### Let's reorder the dataframe columns for readability and write the dataframe to CSV


In [18]:
reordered_df = merged_df[['Filename','Title','Year','Author','AuthorGender','Link','Text','Doc','Tokens','POS','Lemmas']]
reordered_df.head()

Unnamed: 0,Filename,Title,Year,Author,AuthorGender,Link,Text,Doc,Tokens,POS,Lemmas
0,More_1788.txt,"Slavery, A Poem",1788,Hannah More,F,https://brycchancarey.com/slavery/morepoems.htm,"SLAVERY, A POEM. BY HANNAH MORE O great design...","(SLAVERY, ,, A, POEM, ., BY, HANNAH, MORE, O, ...","[SLAVERY, ,, A, POEM, ., BY, HANNAH, MORE, O, ...","[proper noun, punctuation, determiner, proper ...","[SLAVERY, ,, a, POEM, ., by, HANNAH, MORE, o, ..."
1,Barbauld_1791.txt,"Epistle to William Wilberforce, Esq. On the Re...",1791,Anna Letitia Barbauld,F,https://brycchancarey.com/slavery/barbauld1.htm,"EPISTLE TO WILLIAM WILBERFORCE, ESQ. [Price On...","(EPISTLE, TO, WILLIAM, WILBERFORCE, ,, ESQ, .,...","[EPISTLE, TO, WILLIAM, WILBERFORCE, ,, ESQ, .,...","[proper noun, particle, proper noun, proper no...","[EPISTLE, to, WILLIAM, WILBERFORCE, ,, ESQ, .,..."
2,Card_1792_1.txt,A Poem on the African Slave Trade. Addressed t...,1792,Mary Birkett Card,F,https://brycchancarey.com/slavery/mbc1.htm,A POEM ON THE AFRICAN SLAVE TRADE. ADDRESSED T...,"(A, POEM, ON, THE, AFRICAN, SLAVE, TRADE, ., A...","[A, POEM, ON, THE, AFRICAN, SLAVE, TRADE, ., A...","[determiner, noun, adposition, determiner, pro...","[a, poem, on, the, AFRICAN, slave, TRADE, ., a..."
3,More_1797.txt,The Sorrows of Yamba,1797,Hannah More and Eaglesfield Smith,F,https://brycchancarey.com/slavery/yamba.htm,"CHEAP REPOSITORY THE SORROWS OF YAMBA; OR, THE...","(CHEAP, REPOSITORY, THE, SORROWS, OF, YAMBA, ;...","[CHEAP, REPOSITORY, THE, SORROWS, OF, YAMBA, ;...","[proper noun, proper noun, determiner, noun, a...","[CHEAP, REPOSITORY, the, sorrows, of, YAMBA, ;..."


In [19]:
reordered_df.to_csv('annotated_dataset.csv')
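
A CSV file stores every cell as text, so list columns such as Tokens, POS and Lemmas come back as strings when the file is read again. Below is a minimal sketch of restoring a list column with `ast.literal_eval`, using a toy DataFrame and an in-memory buffer instead of a file on disk (both are assumptions made for illustration).

```python
import ast
import io

import pandas as pd

# Toy stand-in for reordered_df (illustrative values)
toy_df = pd.DataFrame({'Filename': ['a.txt'],
                       'Tokens': [['A', 'POEM', '.']]})

# Write to an in-memory buffer; index=False leaves out the index,
# which would otherwise be read back as an 'Unnamed: 0' column
buffer = io.StringIO()
toy_df.to_csv(buffer, index=False)
buffer.seek(0)

loaded = pd.read_csv(buffer)

# Each cell is now a string like "['A', 'POEM', '.']";
# literal_eval safely parses it back into a Python list
loaded['Tokens'] = loaded['Tokens'].apply(ast.literal_eval)

print(loaded['Tokens'][0])
```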