# Run BookNLP

BookNLP is a natural language processing pipeline for long texts. It performs the following tasks:
 * Part-of-speech tagging
 * Dependency parsing
 * Entity recognition
 * Character name clustering (e.g., "Tom", "Tom Sawyer", "Mr. Sawyer", "Thomas Sawyer" -> TOM_SAWYER) and coreference resolution
 * Quotation speaker identification
 * Supersense tagging (e.g., "animal", "artifact", "body", "cognition", etc.)
 * Event tagging
 * Referential gender inference (TOM_SAWYER -> he/him/his)

This tutorial runs BookNLP on one or more plain text files. BookNLP saves its results (5 different files per text) in a folder on your local machine. The tutorial then performs some simple analyses (counts, named entity extraction).

Learn more about BookNLP:
* https://github.com/booknlp/booknlp
* https://colab.research.google.com/drive/1c9nlqGRbJ-FUP2QJe49h21hB4kUXdU_k?usp=sharing#scrollTo=CVb7N2RDkLVm

## Install Packages  

In [None]:
!pip install --no-dependencies booknlp
from booknlp.booknlp import BookNLP
#!pip install spacy
import spacy
!python -m spacy download en_core_web_sm
nlp = spacy.load('en_core_web_sm')
import glob
import os
import pandas as pd

## Prep Model and Files

In [None]:
#Set parameters of BookNLP model
model_params={
    "pipeline":"entity,quote,supersense,event,coref", 
    "model":"big", 
}

booknlp=BookNLP("en", model_params)
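
The BookNLP README describes two model sizes, "big" and "small", and suggests the small one for machines without a GPU. A minimal sketch of how the parameter dictionary above could be built for either case; `build_model_params` and its `use_gpu` flag are hypothetical names for illustration, not part of BookNLP.

```python
def build_model_params(use_gpu, pipeline="entity,quote,supersense,event,coref"):
    """Return a BookNLP parameter dict (hypothetical helper, not a BookNLP function)."""
    return {
        "pipeline": pipeline,
        # "big" is more accurate but slower; "small" is the lighter option for CPU-only machines
        "model": "big" if use_gpu else "small",
    }

# Example: parameters for a laptop without a GPU
model_params = build_model_params(use_gpu=False)
print(model_params)
```

The resulting dictionary can be passed to `BookNLP("en", model_params)` exactly as in the cell above.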

In [None]:
#Get current working directory 
path = os.getcwd()
print(path)

#Change working directory (os.chdir returns None, so keep the path in its own variable)
path = "/PATHNAME"
os.chdir(path)

## Run BookNLP on Multiple Files

In [None]:
#List the .txt files in the working directory to run BookNLP on
files = [f for f in os.listdir(path) if os.path.isfile(f) and f.endswith(".txt")]

#Set folder for BookNLP results
outputDir = "BookNLP_Results/"

#Loop to run BookNLP on each file
for f in files:
    #Use the filename as both the input file and the ID for the output files
    booknlp.process(f, outputDir, f)
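
Processing a whole folder can take a long time, so it may help to skip texts that already have output. The sketch below assumes that a finished run leaves a `<ID>.tokens` file in the results folder; `list_text_files` and `already_processed` are hypothetical helper names, not BookNLP functions.

```python
import os

def list_text_files(folder, suffix=".txt"):
    """Return the sorted names of regular files in `folder` that end with `suffix`."""
    return sorted(
        name for name in os.listdir(folder)
        if name.endswith(suffix) and os.path.isfile(os.path.join(folder, name))
    )

def already_processed(output_dir, book_id):
    """Return True if a tokens file for `book_id` already exists in `output_dir`."""
    return os.path.exists(os.path.join(output_dir, book_id + ".tokens"))

# Possible use with the loop above (left commented out because it needs a loaded model):
# for name in list_text_files("."):
#     if not already_processed("BookNLP_Results/", name):
#         booknlp.process(name, "BookNLP_Results/", name)
```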


## Analyze BookNLP Data

### Convert TSV Files to CSV Files
Convert the entities, tokens, and supersense files to CSV for further analysis.

In [None]:
#Change working directory to folder with BookNLP results
path = "/PATHNAME"
os.chdir(path)

#Define new folder for csv files
csvs = "/PATHNAME"

In [None]:
#Set file endings to look for
endings = ('.entities', '.supersense', '.tokens')

#Iterate over original BookNLP files
for filename in os.listdir(path):
    if filename.endswith(endings):
        #Read tsv into a pandas dataframe (quoting=3 treats quotation marks as plain text)
        df = pd.read_csv(filename, sep="\t", quoting=3)
        #Write the dataframe to a csv file
        df.to_csv(os.path.join(csvs, filename + '.csv'), index=False)
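
The same conversion can be done without pandas. The sketch below swaps in Python's standard `csv` module and copies rows one at a time, which avoids loading a large tokens file into memory; `tsv_to_csv` is a hypothetical helper name.

```python
import csv

def tsv_to_csv(tsv_path, csv_path):
    """Copy a tab-separated file to a comma-separated file, row by row."""
    with open(tsv_path, newline="", encoding="utf-8") as src, \
         open(csv_path, "w", newline="", encoding="utf-8") as dst:
        # QUOTE_NONE treats quotation marks in the text as ordinary characters
        reader = csv.reader(src, delimiter="\t", quoting=csv.QUOTE_NONE)
        # The writer quotes any field that contains a comma or a quotation mark
        writer = csv.writer(dst)
        for row in reader:
            writer.writerow(row)
```

Reading with `csv.QUOTE_NONE` plays the same role as `quoting=3` in the pandas call: a stray quotation mark in a token does not swallow the rest of the line.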
        

In [None]:
#Zip files in separate folders (for sharing purposes)
import glob
import zipfile

with zipfile.ZipFile('booknlp_entities.zip', 'w') as newzip:
    for filepath in glob.glob("*.txt.entities.csv"):
        newzip.write(filepath)

with zipfile.ZipFile('booknlp_tokens.zip', 'w') as newzip:
    for filepath in glob.glob("*.txt.tokens.csv"):
        newzip.write(filepath)

with zipfile.ZipFile('booknlp_supersense.zip', 'w') as newzip:
    for filepath in glob.glob("*.txt.supersense.csv"):
        newzip.write(filepath)
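
The three zip steps differ only in the file type, so they could be folded into one function. A minimal sketch, assuming the same filename patterns as the cell above; `zip_matching` is a hypothetical helper name.

```python
import glob
import zipfile

def zip_matching(pattern, zip_name):
    """Write every file matching `pattern` into a new archive and return the matches."""
    matches = sorted(glob.glob(pattern))
    with zipfile.ZipFile(zip_name, "w") as archive:
        for filepath in matches:
            archive.write(filepath)
    return matches

# Possible use, one archive per output type:
# for kind in ("entities", "tokens", "supersense"):
#     zip_matching(f"*.txt.{kind}.csv", f"booknlp_{kind}.zip")
```

Returning the list of matches makes it easy to check that a pattern actually found files before sharing the archive.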

### Explore BookNLP Output

In [None]:
#Change working directory to folder with the csv files
path = "/PATHNAME"
os.chdir(path)

#Open NER file for one book
for filepath in glob.glob("FILENAME.csv"):
    # Read the entities CSV file into a pandas dataframe
    ner_df = pd.read_csv(filepath)
# Print the resulting dataframe
ner_df

In [None]:
#Open token file for one book 
for filepath in glob.glob("FILENAME.csv"):
    # Read the tokens CSV file into a pandas dataframe
    pos_df = pd.read_csv(filepath)
# Print the resulting dataframe
pos_df

In [None]:
#Open supersense file for one book 
for filepath in glob.glob("FILENAME.csv"):
    # Read the supersense CSV file into a pandas dataframe
    supersense_df = pd.read_csv(filepath)
# Print the resulting dataframe
supersense_df

### Find All Groups of Named Entities

Several groups of named entities may be of interest to researchers: people, organizations, locations, and vehicles.

BookNLP tags the following person-based named entities: 
* PERSON (written as PER in BookNLP's entity files): People, including fictional characters.

BookNLP tags the following organization-based named entities: 
* ORG: Companies, agencies, institutions, etc.

BookNLP tags the following place and space-based named entities: 
* FAC: Buildings, airports, highways, bridges, etc.
* GPE: Countries, cities, states
* LOC: Non-GPE locations, mountain ranges, bodies of water.

BookNLP tags the following vehicle named entities: 
* VEH: Vehicles

The code below finds every instance of each group in each text and collects them per text, so that they can be stored as columns of a new dataframe.

More about named entity tags: https://towardsdatascience.com/explorations-in-named-entity-recognition-and-was-eleanor-roosevelt-right-671271117218
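
The grouping described above can be written down as a small lookup table. The sketch below assumes rows with `cat` and `text` fields, as in the entities files; `ENTITY_GROUPS` and `group_mentions` are hypothetical names, the sample rows are invented for illustration, and both `PER` and `PERSON` are accepted for people in case the tag differs between versions.

```python
from collections import defaultdict

# Entity categories named in this tutorial, grouped by the kind of thing they label
ENTITY_GROUPS = {
    "people": {"PER", "PERSON"},
    "organizations": {"ORG"},
    "places": {"FAC", "GPE", "LOC"},
    "vehicles": {"VEH"},
}

def group_mentions(rows):
    """Map each group name to the list of mention texts whose category falls in it."""
    grouped = defaultdict(list)
    for row in rows:
        for group, cats in ENTITY_GROUPS.items():
            if row["cat"] in cats:
                grouped[group].append(row["text"])
    return dict(grouped)

# Invented sample rows, only to show the shape of the result
sample_rows = [
    {"cat": "GPE", "text": "St. Petersburg"},
    {"cat": "PER", "text": "Tom"},
    {"cat": "LOC", "text": "the river"},
]
print(group_mentions(sample_rows))
```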

In [None]:
#Import to check progress of code
from tqdm import tqdm

#Get location named entities in each file and append to a dataframe
filename_list = []
loc_lists = []

for filepath in glob.glob("*.txt.entities.csv"):
    #Add filename to list of file names
    filename_list.append(filepath)
    # Read the entities CSV file into a pandas dataframe
    df = pd.read_csv(filepath)
    #Define list for locations in df
    loc_list = []
    #Iterate through dataframe and keep rows tagged as a place (LOC, GPE, or FAC)
    for index, row in tqdm(df.iterrows(), total=len(df)):
        if row['cat'] in ('LOC', 'GPE', 'FAC'):
            loc_list.append(row['text'])
    #Append this file's locations to the list of lists
    loc_lists.append(loc_list)

In [None]:
loc_lists

In [None]:
#Import itertools to flatten the per-file lists
import itertools

#Combine list of lists into one big list
master_loc_list = list(itertools.chain.from_iterable(loc_lists))
master_loc_list

In [None]:
#Make sure all words are strings
string_loc_list = [str(word) for word in master_loc_list]

#Join words in list to string
text = ' '.join(string_loc_list)

#Remove punctuation
import string
text = text.translate(str.maketrans('', '', string.punctuation))

print(text)

In [None]:
from wordcloud import WordCloud, STOPWORDS
import nltk
nltk.download('punkt')
nltk.download('stopwords')
from nltk.corpus import stopwords
import matplotlib.pyplot as plt

# Load stopwords from nltk
stop_words = set(stopwords.words('english'))

wordcloud = WordCloud(stopwords=stop_words, width=800, height=800, background_color='white', max_words=50)

wordcloud.generate(text)

#Plot word cloud
plt.figure(figsize=(8,8))
plt.imshow(wordcloud)
plt.axis('off')
plt.show()

In [None]:
#Get the 20 most frequent location words in the corpus
from collections import Counter
import string 

#Tokenize text
tokens = [token.lower() for token in nltk.word_tokenize(text) if token.lower() not in stop_words]

#Count word frequencies
word_freq = Counter(tokens)

#Get top 20 most common words
top_words = dict(word_freq.most_common(20))

#Plot word frequencies
plt.bar(top_words.keys(), top_words.values())
plt.title('Top 20 Most Frequent Words')
plt.xlabel('Words')
plt.xticks(rotation=45)
plt.ylabel('Frequency')
plt.show()
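
Tokenizing the joined string splits multi-word names such as "New York" into separate words, so the chart above counts words rather than places. A minimal sketch of an alternative that counts whole mentions instead; `top_mentions` is a hypothetical helper and the sample list is invented for illustration.

```python
from collections import Counter

def top_mentions(mentions, n=20):
    """Count whole mention strings, ignoring case and surrounding whitespace."""
    cleaned = [str(m).strip().lower() for m in mentions]
    return Counter(m for m in cleaned if m).most_common(n)

# Invented sample; with the tutorial's data this would be master_loc_list
sample = ["New York", "new york", "London", " New York "]
print(top_mentions(sample, n=2))
```

The result is a list of `(mention, count)` pairs, which can be turned into a dictionary and plotted in the same way as `top_words`.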


In [None]:
#Import to check progress of code
from tqdm import tqdm

#Find the people, orgs and vehicles
filename_list = []
people_lists = []
org_lists = []
vehicle_lists = []

for filepath in glob.glob("*.txt.entities.csv"):
    #Add filename to list of file names
    filename_list.append(filepath)
    # Read the entities CSV file into a pandas dataframe
    df = pd.read_csv(filepath)
    #Define lists for people, organizations, and vehicles in df
    people_list = []
    org_list = []
    veh_list = []
    #Iterate through dataframe and sort each row by its NE category
    for index, row in tqdm(df.iterrows(), total=len(df)):
        #Accept both PER and PERSON as the tag for people
        if row['cat'] in ('PER', 'PERSON'):
            people_list.append(row['text'])
        if row['cat'] == 'ORG':
            org_list.append(row['text'])
        if row['cat'] == 'VEH':
            veh_list.append(row['text'])
    #Append this file's lists to the lists of lists
    people_lists.append(people_list)
    org_lists.append(org_list)
    vehicle_lists.append(veh_list)

    

In [None]:
loc_list

In [None]:
#Create df with filenames, location entities, and people entities
locations_df = pd.DataFrame({'Book': filename_list, 'Locations': loc_lists, 'People': people_lists})
locations_df

In [None]:
#Define function to count number of each type of named entity in each file
def extract_entity_counts(file):
    with open(file, "r", encoding='utf-8') as f:
        text = f.read()
    doc = nlp(text)
    entity_counts = {}
    for entity in doc.ents:
        if entity.label_ not in entity_counts:
            entity_counts[entity.label_] = 1
        else: 
            entity_counts[entity.label_] += 1
    return entity_counts

#Create empty df to store results
columns = ['file','entity_type', 'count']
df = pd.DataFrame(columns=columns)

In [None]:
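import glob

#Loop over all BookNLP .book files in the directory
rows = []
for filepath in glob.glob("*.book"):
    #Count the named entities of each type in the file
    entity_counts = extract_entity_counts(filepath)
    print(entity_counts)
    #Store one row per entity type for this file
    for entity_type, count in entity_counts.items():
        rows.append({'file': filepath, 'entity_type': entity_type, 'count': count})

#Build and display the resulting dataframe
df = pd.DataFrame(rows, columns=columns)
df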
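
The cell above runs spaCy's entity recognizer, so its labels follow spaCy's tag set rather than BookNLP's. A minimal sketch of an alternative that counts the categories BookNLP already assigned, assuming the converted entities CSV files with a `cat` column; `count_categories` and `counts_to_rows` are hypothetical helper names.

```python
import csv
from collections import Counter

def count_categories(entities_csv):
    """Count how often each value of the `cat` column appears in one entities CSV file."""
    with open(entities_csv, newline="", encoding="utf-8") as f:
        return Counter(row["cat"] for row in csv.DictReader(f))

def counts_to_rows(filename, counts):
    """Turn a Counter into rows with the keys 'file', 'entity_type', and 'count'."""
    return [
        {"file": filename, "entity_type": cat, "count": n}
        for cat, n in sorted(counts.items())
    ]

# Possible use with the files from this tutorial:
# import glob
# rows = []
# for filepath in glob.glob("*.txt.entities.csv"):
#     rows.extend(counts_to_rows(filepath, count_categories(filepath)))
```

The rows use the same three column names as the empty dataframe defined earlier, so they can be passed straight to `pd.DataFrame`.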