# Phase 2: Analysis & evaluation

In part one of the project I constructed a data processing pipeline that took in the raw text of 3 books, and produced 'ready-to-use' outputs saved to a database:

![Main pipeline components](pipeline.png)

The objective now is to use this dataset to:

__1. Identify key characters in novels (information extraction)__

And then demonstrate how we could use the information extracted (about the books and their characters) to:

__2. Present potentially useful contextual and quantitative information to aid the reading experience (visualization)__

If successful, the outputs should _assist_ a human literary analyst by supplying contextual information and a more quantitative perspective to supplement traditional qualitative methods.

The three initial novels selected for the project were obtained (with permission) from Project Gutenberg at the following locations:

* __Great Expectations__: https://www.gutenberg.org/files/1400/1400-0.txt (Dickens, 1998)<br>
* __Alice In Wonderland__: https://www.gutenberg.org/ebooks/19033.txt.utf-8 (Carroll, 2008)<br>
* __Little Women__: https://www.gutenberg.org/ebooks/514.txt.utf-8 (Alcott, 1996)<br>

I prepared a fourth book for the database which will be reserved for the evaluation phase - a test case we can use to assess whether the proposed methods can be applied to other books:

* __Jane Eyre__: https://www.gutenberg.org/cache/epub/1260/pg1260.txt (Bronte, 1998)<br>

#### Library imports

In [1]:
# Standard utilities
import re
import pprint
import datetime
import itertools
import warnings
import numpy as np
import pandas as pd
from tqdm.notebook import tqdm_notebook
from collections import defaultdict

# NLP tasks
import nltk
from nltk.tokenize import word_tokenize
from nltk.tokenize import TreebankWordTokenizer
import spacy
nlp = spacy.load('en_core_web_sm')
from nltk.corpus import wordnet as wn
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Storage
from sqlalchemy import create_engine

# Visualization
from matplotlib import pyplot as plt
import seaborn as sns
from matplotlib_venn import venn3, venn3_circles
from wordcloud import WordCloud

# Data checking
import pywikibot
import requests

#### Settings

In [2]:
# General settings
warnings.filterwarnings('ignore')

# Pandas settings
pd.set_option('display.max_colwidth', 500)
tqdm_notebook.pandas()

# Pandas highlight columns gold
def highlight_column(column):
    '''Highlight specified columns in df - used as argument in df.style.apply() function'''
    return ['background-color: gold' for item in column.values]

# Visualization settings
plt.rcParams["figure.figsize"] = (8, 6)
sns.set_theme()
authors = ['dickens', 'carroll', 'alcott']
all_authors = ['dickens', 'carroll', 'alcott', 'bronte']
author_colors = {'dickens': '#ED230D', 'carroll': '#00A1FE', 'alcott': '#1EB100', 'bronte': '#FF1393'}
author_colormaps = {'dickens': 'YlOrRd', 'carroll': 'GnBu', 'alcott': 'YlGn', 'bronte': 'PuRd'}

# SQL settings
%load_ext sql
%sql sqlite:///book_store.db
# Set the style to the old default as advised by
# https://stackoverflow.com/questions/79153112/keyerror-default-when-attempting-to-create-a-table-using-magic-line-sql-in-j
%config SqlMagic.style = '_DEPRECATED_DEFAULT'

engine = create_engine('sqlite:///book_store.db', echo=False)
connection = engine.raw_connection()

# Track time
processing_start = datetime.datetime.now()

# Pywikibot settings
pywikibot.config.max_retries = 2

## 1. The datasets

### 1.1 SQL data

The current database, ```book_store.db```, contains 6 tables from phase 1 (see _phase_1_data_preparation_and_eda.ipynb_ for more details); the ```wikidata``` table that also appears in the listing below is created in section 1.2 of this notebook:

- The vocabulary tables hold the vocabulary used by each of the 4 authors (one table per author)
- The book components table is the main information source: each book is broken down by author, chapter, paragraph, sentence, text, lemmas and sentence lengths
- The chapter headings table holds each chapter heading exactly as it appears in the book, even if that is simply "Chapter IV."

In [3]:
%sql SELECT name FROM sqlite_master WHERE type='table'

 * sqlite:///book_store.db
Done.


name
wikidata
chapter_headings
book_components
dickens_vocab
carroll_vocab
alcott_vocab
bronte_vocab


Here is a sample of the first record from ```book_components``` - this is the data we will work with when performing information extraction and analysis:

In [4]:
result = %sql SELECT * FROM book_components LIMIT 1
result = result.DataFrame()
result.transpose()

 * sqlite:///book_store.db
Done.


Unnamed: 0,0
index,0
author,dickens
chapter,0
paragraph,0
sentence,0
text,"My father’s family name being Pirrip, and my Christian name Philip, my infant tongue could make of both names nothing longer or more explicit than Pip."
lemmas,my;father;family;name;be;pirrip;and;my;christian;name;philip;my;infant;tongue;could;make;of;both;name;nothing;long;or;more;explicit;than;pip


### 1.2 Incorporating Wikidata

We don't have a 'source of truth' against which to compare the results we'll obtain, so I decided to use Wikidata, which has a well-defined structure, as a proxy. My main objective is to extract the __characters__ (and their known __aliases__) from each book to use as a comparative data source in the evaluation phase of this project. I'll also bring in __main subject__, __genre__, __period__, __narrative location__ and __intended public__ where they are available, as they may be informative. Here is an example of the Wikidata structures relevant to my task:

![Wikidata structure](little_women_structures.png)

* __Items__ are entries in the knowledge base: each item has a unique identifier beginning with ```Q*```.
* Each item has a __label__ (e.g. 'Little Women') and a __description__ (e.g. 'novel by Louisa May Alcott')
* __Claims__ hold the detail behind each item: each claim links the item to a value through a property
* Items may have __properties__ - properties are standardized: each property's unique identifier begins with ```P*``` (e.g. 'P674' is for characters and 'P136' is for genre)
* Properties may refer to further items, for example the character _Jo March_ is also an item ('Q27902552') with its own label ('Jo March') and description ('fictional character from Louisa May Alcott's Little Women') which can be retrieved
* __Aliases__ are also sometimes available, for example the character _Jo March_ is also known as _Josephine March_, _Aunt Dodo_, _Jo Bhaer_, _Josephine Bhaer_
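
To make the structure above concrete, here is a minimal sketch that models items, labels, aliases and claims as plain Python dictionaries and follows the 'P674' (characters) claims from a book to its characters. It is a simplified stand-in, not pywikibot's real object model: `characters_of` is a hypothetical helper and `'Q_BOOK'` is a placeholder, not the real identifier for Little Women. Only the Jo March values are taken from the text above.

```python
# Simplified, hypothetical model of the Wikidata structures described above
items = {
    'Q_BOOK': {  # placeholder identifier, NOT the real Q-number of the book
        'labels': {'en': 'Little Women'},
        'descriptions': {'en': 'novel by Louisa May Alcott'},
        'claims': {'P674': ['Q27902552']},  # P674 = characters
    },
    'Q27902552': {
        'labels': {'en': 'Jo March'},
        'descriptions': {'en': "fictional character from Louisa May Alcott's Little Women"},
        'aliases': {'en': ['Josephine March', 'Aunt Dodo', 'Jo Bhaer', 'Josephine Bhaer']},
        'claims': {},
    },
}

def characters_of(item_id, items):
    """Follow each P674 claim to its target item and return (label, aliases) pairs."""
    pairs = []
    for target_id in items[item_id]['claims'].get('P674', []):
        target = items[target_id]
        # Aliases are optional, so fall back to None when an item has none
        pairs.append((target['labels'].get('en'),
                      target.get('aliases', {}).get('en')))
    return pairs

print(characters_of('Q_BOOK', items))
```

The point of the sketch is the two-step lookup: a claim on the book only stores a reference to another item, and the character's label and aliases live on that second item.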

The Python library __pywikibot__ ([MediaWiki, 2022](https://www.mediawiki.org/w/index.php?title=Manual:Pywikibot&oldid=5039041)) enables us to navigate this structure and retrieve only the information we need for our purposes. I selected this library because:

* It is recommended by Wikidata itself, which provides good documentation and tutorials such as [this one](https://www.wikidata.org/w/index.php?title=Wikidata:Pywikibot_-_Python_3_Tutorial/Data_Harvest&oldid=1056597057) (Wikidata, 2019) which gave me the starter code I needed to get going on formulating my own retrieval strategy (see below)
* It is actively maintained: I've used release 6.6.3 (December 2021), and 6.6.4 was already available as of January 2022 ([The Pywikibot Team, 2022](https://pypi.org/project/pywikibot/#history)). By contrast, the alternative ```Wikidata``` package (0.7.0) was last updated in 2020 ([Minhee, 2020](https://pypi.org/project/Wikidata/#history))
* It abstracts away the need to understand complex [SPARQL](https://query.wikidata.org) retrieval queries

#### Preparing to retrieve Wikidata information

I've broken the retrieval process down into 4 functions so that it is clear what is happening at each stage of the process.

##### Component 1 ```make_wiki_page()``` - formulate a request for a page

_Requirement notes:_

1. Check that the book argument is given in the correct format (```str```)
2. The language / family combination of 'en' and 'wikipedia' is important - but here I'm coding it into the function so there is no possibility for user error
3. The book variable could be any string - at this stage there is no validation, we're just formulating our request

In [5]:
def make_wiki_page(book):
    '''Takes in the name of a book and returns a pywikibot.page.Page object 
    - does not require internet connection. Note that book is case-sensitive.'''
    
    if not isinstance(book, str):
        return 'Argument "book" should be a string representing a valid book title'

    site = pywikibot.Site('en', 'wikipedia')
    page = pywikibot.Page(site, book)
    return page     

##### Component 2 ```get_wiki_item()``` - request item from Wikidata

_Requirement notes:_

1. Check that page is given in the correct format (```pywikibot.page.Page```)
2. An internet connection is required to retrieve the item, ```pywikibot.config.max_retries``` above is set to a maximum of 2 retries, BUT if there is still no response after that a ```TimeoutError``` may occur
3. If the item exists it will be returned BUT if it does not exist a ```NoPageError``` may occur

In [6]:
def get_wiki_item(page):
    '''Takes in a pywikibot.page.Page object and returns a pywikibot.page.ItemPage object
    if it exists - requires an internet connection'''
    
    if not isinstance(page, pywikibot.page.Page):
        return 'Argument "page" should be a valid pywikibot.page.Page object'
    
    try:
        item = pywikibot.ItemPage.fromPage(page)     
        return item
    
    except Exception as err:
        return err

##### Component 3 ```compile_wiki_data()``` - obtain and package just the data points we need

_Requirement notes:_

1. Check that item is given in the correct format (```pywikibot.page.ItemPage```)
2. An internet connection is required to retrieve the item claims to the properties of interest (these ```P*``` codes can easily be looked up on the Wikidata pages, simply by hovering over the property of interest's label as shown above)
3. Not every property is necessarily available for every book
4. Book details and property details should be assembled into a dict format

In [7]:
def compile_wiki_data(item):
    '''Takes in a pywikibot.page.ItemPage object and returns a dictionary containing labels, 
    descriptions, and properties of interest - requires an internet connection'''
    
    if not isinstance(item, pywikibot.page.ItemPage):
        return 'Argument "item" should be a valid pywikibot.page.ItemPage object'
    
    try: 
        # Create a dictionary to store our results 
        book_wikidata = {}

        properties_of_interest = {'characters': 'P674', 
                                  'main subject': 'P921',
                                  'genre': 'P136', 
                                  'set in period': 'P2408',
                                  'narrative location': 'P840',
                                  'intended public': 'P2360'}

        # Extract high-level data entries to our results dictionary
        book_wikidata['item'] = item.id
        book_wikidata['title'] = item.labels.get('en')
        book_wikidata['description'] = item.descriptions.get('en')

        # Claims contain the detail of each property we are interested in
        item_dict = item.get()
        clm_dict = item_dict.get('claims')
        for key, value in properties_of_interest.items():
            # Create an empty list for each property to hold results
            book_wikidata[key] = []
            # Get a list of claims
            clm_list = clm_dict.get(value)
            if clm_list is not None:
                for clm in clm_list:
                    # Only if snaktype is 'value' is there actual data to return
                    if clm.snaktype == 'value':
                        clm_trgt = clm.getTarget()
                        claim_detail = clm_trgt.get()
                        # Append the labels for each item in the property, e.g. name of each character for characters
                        if key == 'characters':
                            book_wikidata[key].append((claim_detail.get('labels').get('en'), claim_detail.get('aliases').get('en')))
                        else:
                            book_wikidata[key].append((claim_detail.get('labels').get('en')))
                        
        return book_wikidata
    except Exception as err:
        return err

##### Component 4 ```get_book_wikidata()``` - fetch and compile the data

_Requirement notes:_

1. Here we put the 3 functions so far together - each step either returns a valid instance for further processing or an error message which can shed light on the issue encountered
2. A nested dict of data for each book is returned if no errors

In [8]:
def get_book_wikidata(book):
    '''Takes in the name of a book and returns a dict of properties of interest. Note that
    book is case-sensitive.'''
    page = make_wiki_page(book)
    
    # Check that page is valid
    if isinstance(page, pywikibot.page.Page):
        item = get_wiki_item(page)
        # Check that item is valid
        if isinstance(item, pywikibot.page.ItemPage):
            book_wikidata = compile_wiki_data(item)
            # Check that the dict is valid
            if isinstance(book_wikidata, dict):
                return book_wikidata
            else:
                err = f'''There was an issue with book_wikidata: {book_wikidata}'''
                return err
        else:
            err = f'''There was an issue with item: {item}'''
            return err
    else:
        err = f'''There was an issue with page: {page}'''
        return err
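
The "valid object or error value" convention used by the four components can be seen in isolation in the sketch below. Everything in it is a hypothetical stand-in that needs no internet connection: `make_page`, `get_item`, `fetch_item` and the `KNOWN_ITEMS` lookup are illustrative names, not the pywikibot-backed functions above. It also uses early returns instead of nested `if` / `else` blocks, which is one possible way to flatten the same checks.

```python
# Hypothetical lookup standing in for the remote knowledge base
KNOWN_ITEMS = {'Little Women': 'item-1'}

def make_page(title):
    # Stage 1 stand-in: return an error message rather than raising
    if not isinstance(title, str):
        return 'Argument "title" should be a string'
    return {'title': title}

def get_item(page):
    # Stage 2 stand-in: a failed lookup returns the exception object itself
    try:
        return {'id': KNOWN_ITEMS[page['title']]}
    except KeyError as err:
        return err

def fetch_item(title):
    """Run the stages in order, stopping at the first one that does not return a dict."""
    page = make_page(title)
    if not isinstance(page, dict):
        return f'There was an issue with page: {page}'
    item = get_item(page)
    if not isinstance(item, dict):
        return f'There was an issue with item: {item}'
    return item

for title in ['Little Women', 'Unknown Book', 42]:
    print(repr(title), '->', fetch_item(title))
```

Because every stage reports failure through its return value, the caller only ever needs an `isinstance` check to decide whether to carry on.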

#### Retrieving Wikidata information

We are now ready to retrieve the data itself. Once we have it, we want to store it in the ```book_store``` database along with all the other project data, so that future iterations can read it locally instead of connecting to Wikidata and retrieving it again each time. This step would then become part of the standard data processing pipeline, alongside the steps from the preliminary part of the project.

In [9]:
# Runs for a couple of minutes potentially (depending on connection speed - 
# alcott has the most characters and therefore runs the longest)
wikidata_on_books = {}
for author, book in [("dickens", "Great Expectations"),
                     ("carroll", "Alice's Adventures in Wonderland"),
                     ("alcott", "Little Women"),
                     ("bronte", "Jane Eyre")]:
    wikidata = get_book_wikidata(book)
    if isinstance(wikidata, dict): 
        wikidata_on_books[author] = wikidata  # reuse the result fetched above rather than querying again
        print(f'''{author} completed''')
    else:
        print(f'''{author} encountered an issue: {wikidata}''')

dickens completed
carroll completed
alcott completed
bronte completed


In [10]:
# Transform the data from a dict to an appropriate format for storing in our db
wikidata = pd.json_normalize(wikidata_on_books).transpose().reset_index()
wikidata.rename(columns={0:'data'}, inplace=True)
wikidata[['author', 'property']] = wikidata['index'].str.split('.', expand=True)
wikidata = wikidata[['author', 'property', 'data']]

# We cannot store data type list in the database so transform any list / tuple combinations using
# ';' to delimit characters and '^' to delimit character / alias combinations and '|' to delimit aliases
wikidata_sql = wikidata.copy()
wikidata_sql['data'] = wikidata_sql.apply(lambda row: \
                      ';'.join([char[0] + '^' + '|'.join(char[1]) if char[1] is not None \
                      else char[0] for char in row['data']]) if isinstance(row['data'], list) and \
                                          row['property'] == 'characters' else row['data'], axis=1)
wikidata_sql['data'] = wikidata_sql['data'].apply(lambda x: \
                      ';'.join(x) if isinstance(x, list)  else x)

# Check the outputs
wikidata_sql.head()

Unnamed: 0,author,property,data
0,dickens,item,Q219552
1,dickens,title,Great Expectations
2,dickens,description,1861 novel by Charles Dickens
3,dickens,characters,Pip^Philip Pirrip;Miss Havisham;Estella^Estella Havisham|Estella Drummle;Abel Magwitch;John Wemmick;Compeyson
4,dickens,main subject,orphan


In [11]:
# The 'data' column for the characters property can be converted 
# from string to its original format with the following recipe if required:

# wikidata_sql['data'] = wikidata_sql['data'].apply(lambda x: [(name, alias.split('|')) \
# if '|' in alias else (name, alias) for name, alias in [tuple(item.split('^'))  
# if '^' in item else (item, 'None') for item in x.split(';')]])
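
The delimiter scheme (';' between characters, '^' between a character and its aliases, '|' between aliases) can be checked with a small round trip. This is a sketch under stated assumptions: `encode_characters` and `decode_characters` are hypothetical helpers, not the lambdas used in this notebook, and they assume that no name or alias contains one of the three delimiter characters. Unlike the recipe above, the decoder always returns either a list of aliases or `None`.

```python
def encode_characters(characters):
    """Join (name, aliases) pairs into a single delimited string."""
    parts = []
    for name, aliases in characters:
        # Only append '^' and the alias list when the character has aliases
        parts.append((name + '^' + '|'.join(aliases)) if aliases else name)
    return ';'.join(parts)

def decode_characters(encoded):
    """Split a delimited string back into (name, aliases) pairs."""
    characters = []
    for part in encoded.split(';'):
        name, separator, aliases = part.partition('^')
        # partition() returns an empty separator when '^' is absent
        characters.append((name, aliases.split('|') if separator else None))
    return characters

stored = 'Pip^Philip Pirrip;Miss Havisham;Estella^Estella Havisham|Estella Drummle'
decoded = decode_characters(stored)
print(decoded)
print(encode_characters(decoded) == stored)
```

If a label ever did contain ';', '^' or '|', this scheme would split it incorrectly, so a check for those characters before saving would be a sensible safeguard.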

In [12]:
# And then save to the database
wikidata_sql.to_sql('wikidata', engine, schema=None, 
              if_exists='replace', index=True)

36

In [13]:
# And check that it landed safely
%sql SELECT name FROM sqlite_master WHERE type='table';

 * sqlite:///book_store.db
Done.


name
chapter_headings
book_components
dickens_vocab
carroll_vocab
alcott_vocab
bronte_vocab
wikidata


## 2. Identifying key characters

As a reminder we'll be looking at 3 methods:

* NER using __NLTK__
* NER using __spaCy__
* Proper noun chunking using __NLTK__ parts-of-speech tagging

The overall process I plan to follow is depicted below:

![identify characters](identify_characters4.png)

1. Retrieve our pre-processed book data from our __SQL__ database
2. Obtain a stratified __sample__ (so that each book length remains representative)
3. Do an initial pass to __extract__ named entities / proper nouns
4. __Review__ the outputs and identify any __refinements__ we can make to the process to improve results
5. __Extract__ named entities / proper nouns from the full books
6. Get __name frequencies__ from the resulting data
7. __Group similar names__ in an attempt to see if we can resolve similar character names like _Queen_ and _Queen of Hearts_
8. __Evaluate__ the results

In the development stage I will only use ```dickens```, ```carroll``` and ```alcott```. I am reserving ```bronte``` for the final evaluation step to gauge how transferable the proposed methods are to a book our model has not seen before.

For each of the transformations required to arrive at the final output I'll create a function so that when it comes to processing ```bronte``` at the end I can just daisy-chain the functions together to get the final outputs for that book. It's beyond the scope of this project but ideally for productionizing one would want to create a library with suitable methods so that new books could easily be processed as required.

### 2.1 Retrieve pre-processed books data from SQL

In [14]:
book_components = pd.read_sql_query \
('SELECT * FROM book_components WHERE author IN ("dickens", "carroll", "alcott")', 'sqlite:///book_store.db')
book_components['lemmas'] = book_components['lemmas'].apply(lambda x: x.split(';'))

### 2.2 Obtain a stratified sample from books

For performance reasons, initially I'm just going to explore a stratified sample from the books so that I can check results and investigate the main issues that need to be resolved before moving on to processing the books in their entirety:

In [15]:
sample_indices = book_components.groupby\
('author', group_keys=False).apply(lambda x: x.sample(frac=0.2, random_state = 0)).index
book_sample = book_components[book_components.index.isin(sample_indices)].copy()

Let's check how many sentences we have from each book. The stratified sample is in line with expectations (```carroll``` is much shorter than ```dickens``` or ```alcott```):

In [16]:
pd.pivot_table(book_sample, index = ['author'], values = 'sentence', aggfunc = len)

Unnamed: 0_level_0,sentence
author,Unnamed: 1_level_1
alcott,1866
carroll,322
dickens,1928


### 2.3 Initial pass to extract information

Below I define 3 functions, each of which takes in some text and returns either named entities or proper nouns, and apply each one to the sample:

#### NLTK named entity recognition - ```get_named_entities_nltk()```

In [17]:
def get_named_entities_nltk(text, ner_label_list = ['PERSON']):
    '''Takes in text and a list of entity labels (default is just ['PERSON']) and returns a list
    of tuples in the form (label, entity) - as processed by nltk.
    Note the first time ne_chunk is used some downloads are required. Run the following:
    >>> nltk.download('averaged_perceptron_tagger_eng')
    >>> nltk.download('maxent_ne_chunker_tab')
    '''
    named_entity_tree = nltk.ne_chunk(nltk.tag.pos_tag(word_tokenize(text)))
    named_entities = []
    for tree in named_entity_tree.subtrees():
        if tree.label() in ner_label_list:
            named_entities.append((tree.label(), ' '.join([name[0] for name in tree.leaves()])))
    if len(named_entities) > 0:
        return named_entities
    else:
        return None # it's possible a sentence does not contain named entities

In [22]:
# Get named entities
book_sample['nltk_entities'] = book_sample['text'].progress_apply(lambda x: get_named_entities_nltk(x))
# Check outputs
book_sample.head(1).style.apply(highlight_column, subset=['nltk_entities'])

Unnamed: 0,index,author,chapter,paragraph,sentence,text,lemmas,nltk_entities
9,9,dickens,0,2,2,"At such a time I found out for certain that this bleak place overgrown with nettles was the churchyard; and that Philip Pirrip, late of this parish, and also Georgiana wife of the above, were dead and buried; and that Alexander, Bartholomew, Abraham, Tobias, and Roger, infant children of the aforesaid, were also dead and buried; and that the dark flat wilderness beyond the churchyard, intersected with dikes and mounds and gates, with scattered cattle feeding on it, was the marshes; and that the low leaden line beyond was the river; and that the distant savage lair from which the wind was rushing was the sea; and that the small bundle of shivers growing afraid of it all and beginning to cry, was Pip.","['at', 'such', 'a', 'time', 'i', 'find', 'out', 'for', 'certain', 'that', 'this', 'bleak', 'place', 'overgrow', 'with', 'nettle', 'be', 'the', 'churchyard', 'and', 'that', 'philip', 'pirrip', 'late', 'of', 'this', 'parish', 'and', 'also', 'georgiana', 'wife', 'of', 'the', 'above', 'be', 'dead', 'and', 'bury', 'and', 'that', 'alexander', 'bartholomew', 'abraham', 'tobias', 'and', 'roger', 'infant', 'child', 'of', 'the', 'aforesaid', 'be', 'also', 'dead', 'and', 'bury', 'and', 'that', 'the', 'dark', 'flat', 'wilderness', 'beyond', 'the', 'churchyard', 'intersect', 'with', 'dike', 'and', 'mound', 'and', 'gate', 'with', 'scattered', 'cattle', 'feed', 'on', 'it', 'be', 'the', 'marsh', 'and', 'that', 'the', 'low', 'leaden', 'line', 'beyond', 'be', 'the', 'river', 'and', 'that', 'the', 'distant', 'savage', 'lair', 'from', 'which', 'the', 'wind', 'be', 'rush', 'be', 'the', 'sea', 'and', 'that', 'the', 'small', 'bundle', 'of', 'shiver', 'grow', 'afraid', 'of', 'it', 'all', 'and', 'begin', 'to', 'cry', 'be', 'pip']","[('PERSON', 'Philip Pirrip'), ('PERSON', 'Georgiana'), ('PERSON', 'Alexander'), ('PERSON', 'Roger'), ('PERSON', 'Pip')]"


#### spaCy named entity recognition - ```get_named_entities_spacy()```

In [23]:
def get_named_entities_spacy(text, ner_label_list = ['PERSON']):
    '''Takes in text and a list of entity labels (default is just ['PERSON']) and returns a list
    of tuples in the form (label, entity) - as processed by spacy'''
    named_entities = [(ent.label_, ent.text) for ent in nlp(text).ents if ent.label_ in ner_label_list]
    if len(named_entities) > 0:
        return named_entities
    else:
        return None # it's possible a sentence does not contain named entities

In [25]:
# Get named entities
book_sample['spacy_entities'] = book_sample['text'].progress_apply(lambda x: get_named_entities_spacy(x))
# Check outputs
book_sample.head(1).style.apply(highlight_column, subset=['spacy_entities'])

Unnamed: 0,index,author,chapter,paragraph,sentence,text,lemmas,nltk_entities,spacy_entities
9,9,dickens,0,2,2,"At such a time I found out for certain that this bleak place overgrown with nettles was the churchyard; and that Philip Pirrip, late of this parish, and also Georgiana wife of the above, were dead and buried; and that Alexander, Bartholomew, Abraham, Tobias, and Roger, infant children of the aforesaid, were also dead and buried; and that the dark flat wilderness beyond the churchyard, intersected with dikes and mounds and gates, with scattered cattle feeding on it, was the marshes; and that the low leaden line beyond was the river; and that the distant savage lair from which the wind was rushing was the sea; and that the small bundle of shivers growing afraid of it all and beginning to cry, was Pip.","['at', 'such', 'a', 'time', 'i', 'find', 'out', 'for', 'certain', 'that', 'this', 'bleak', 'place', 'overgrow', 'with', 'nettle', 'be', 'the', 'churchyard', 'and', 'that', 'philip', 'pirrip', 'late', 'of', 'this', 'parish', 'and', 'also', 'georgiana', 'wife', 'of', 'the', 'above', 'be', 'dead', 'and', 'bury', 'and', 'that', 'alexander', 'bartholomew', 'abraham', 'tobias', 'and', 'roger', 'infant', 'child', 'of', 'the', 'aforesaid', 'be', 'also', 'dead', 'and', 'bury', 'and', 'that', 'the', 'dark', 'flat', 'wilderness', 'beyond', 'the', 'churchyard', 'intersect', 'with', 'dike', 'and', 'mound', 'and', 'gate', 'with', 'scattered', 'cattle', 'feed', 'on', 'it', 'be', 'the', 'marsh', 'and', 'that', 'the', 'low', 'leaden', 'line', 'beyond', 'be', 'the', 'river', 'and', 'that', 'the', 'distant', 'savage', 'lair', 'from', 'which', 'the', 'wind', 'be', 'rush', 'be', 'the', 'sea', 'and', 'that', 'the', 'small', 'bundle', 'of', 'shiver', 'grow', 'afraid', 'of', 'it', 'all', 'and', 'begin', 'to', 'cry', 'be', 'pip']","[('PERSON', 'Philip Pirrip'), ('PERSON', 'Georgiana'), ('PERSON', 'Alexander'), ('PERSON', 'Roger'), ('PERSON', 'Pip')]","[('PERSON', 'Alexander'), ('PERSON', 'Bartholomew'), ('PERSON', 'Abraham'), ('PERSON', 'Tobias'), ('PERSON', 'Roger'), ('PERSON', 'Pip')]"


#### NLTK proper noun extraction - ```get_proper_nouns_nltk()```

In reviewing the character names, especially in ```carroll```, I noticed names like _Queen of Hearts_ where the preposition 'of' occurs between 2 proper nouns. It's important to treat such a name as one entity rather than three separate tokens, so I'm catering for that here. Compound names like _Billy the Kid_ are also in common use, so I'm including the determiner _the_ in this recipe too.

In [26]:
def get_proper_nouns_nltk(text):
    '''Takes in text and returns a list of proper nouns (or consecutive proper nouns 
    like "Mrs. Smith" if applicable). Also caters for compound proper nouns formed with
    "of" or "the" such as "Richard of Orange" or "Billy the Kid"'''
    consecutive_tokens = nltk.tag.pos_tag(word_tokenize(text))
    # Structure to hold final list of proper nouns
    proper_nouns = []
    # Temporary structure in which to build up proper nouns
    name_list = []
    for (word, pos) in consecutive_tokens:
        if pos == 'NNP' or word.lower() in ['of', 'the']:
            name_list.append(word)
        elif len(name_list) > 0:
            full_name = ' '.join(name_list)
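            # Remove any leading or trailing 'of' / 'the' words; the \b word boundaries
            # stop names that merely start or end with those letters (e.g. 'Theodore') being clipped
            full_name = re.sub(r'(\s*\b([Tt]he|[Oo]f))+$', '', \
                               re.sub(r'^(([Tt]he|[Oo]f)\b\s*)+', '', full_name))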
            if full_name != '':
                proper_nouns.append(full_name)
            name_list = []
    if len(proper_nouns) > 0:
        return proper_nouns
    else:
        return None # it's possible a sentence does not contain any proper nouns
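
The grouping rule can be tried without NLTK by feeding hand-tagged tokens into a token-level variant of the same idea. In this sketch `chunk_proper_nouns` is a hypothetical helper, the part-of-speech tags are written by hand (assumed, not real tagger output), and stray connecting words are trimmed token by token rather than with the regular expressions used in ```get_proper_nouns_nltk()```.

```python
CONNECTORS = ('of', 'the')

def chunk_proper_nouns(tagged_tokens):
    """Group consecutive NNP tokens, allowing 'of' / 'the' inside a name."""
    names, current = [], []
    # A sentinel token at the end flushes any name that is still open
    for word, pos in list(tagged_tokens) + [('', 'END')]:
        if pos == 'NNP' or word.lower() in CONNECTORS:
            current.append(word)
        elif current:
            # Drop connecting words left dangling at either end of the run
            while current and current[0].lower() in CONNECTORS:
                current.pop(0)
            while current and current[-1].lower() in CONNECTORS:
                current.pop()
            if current:
                names.append(' '.join(current))
            current = []
    return names

# Hand-written tags for: "The Queen of Hearts shouted at Alice"
tagged = [('The', 'DT'), ('Queen', 'NNP'), ('of', 'IN'), ('Hearts', 'NNP'),
          ('shouted', 'VBD'), ('at', 'IN'), ('Alice', 'NNP')]

print(chunk_proper_nouns(tagged))  # ['Queen of Hearts', 'Alice']
```

Trimming whole tokens means a name that merely begins with the letters "The" or "Of" is left intact, and the sentinel ensures a name at the very end of the text is not lost when there is no closing punctuation.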

In [27]:
# Get proper nouns
book_sample['proper_nouns'] = book_sample['text'].progress_apply(lambda x: get_proper_nouns_nltk(x))
# Check outputs
book_sample.head(1).style.apply(highlight_column, subset=['proper_nouns'])

Unnamed: 0,index,author,chapter,paragraph,sentence,text,lemmas,nltk_entities,spacy_entities,proper_nouns
9,9,dickens,0,2,2,"At such a time I found out for certain that this bleak place overgrown with nettles was the churchyard; and that Philip Pirrip, late of this parish, and also Georgiana wife of the above, were dead and buried; and that Alexander, Bartholomew, Abraham, Tobias, and Roger, infant children of the aforesaid, were also dead and buried; and that the dark flat wilderness beyond the churchyard, intersected with dikes and mounds and gates, with scattered cattle feeding on it, was the marshes; and that the low leaden line beyond was the river; and that the distant savage lair from which the wind was rushing was the sea; and that the small bundle of shivers growing afraid of it all and beginning to cry, was Pip.","['at', 'such', 'a', 'time', 'i', 'find', 'out', 'for', 'certain', 'that', 'this', 'bleak', 'place', 'overgrow', 'with', 'nettle', 'be', 'the', 'churchyard', 'and', 'that', 'philip', 'pirrip', 'late', 'of', 'this', 'parish', 'and', 'also', 'georgiana', 'wife', 'of', 'the', 'above', 'be', 'dead', 'and', 'bury', 'and', 'that', 'alexander', 'bartholomew', 'abraham', 'tobias', 'and', 'roger', 'infant', 'child', 'of', 'the', 'aforesaid', 'be', 'also', 'dead', 'and', 'bury', 'and', 'that', 'the', 'dark', 'flat', 'wilderness', 'beyond', 'the', 'churchyard', 'intersect', 'with', 'dike', 'and', 'mound', 'and', 'gate', 'with', 'scattered', 'cattle', 'feed', 'on', 'it', 'be', 'the', 'marsh', 'and', 'that', 'the', 'low', 'leaden', 'line', 'beyond', 'be', 'the', 'river', 'and', 'that', 'the', 'distant', 'savage', 'lair', 'from', 'which', 'the', 'wind', 'be', 'rush', 'be', 'the', 'sea', 'and', 'that', 'the', 'small', 'bundle', 'of', 'shiver', 'grow', 'afraid', 'of', 'it', 'all', 'and', 'begin', 'to', 'cry', 'be', 'pip']","[('PERSON', 'Philip Pirrip'), ('PERSON', 'Georgiana'), ('PERSON', 'Alexander'), ('PERSON', 'Roger'), ('PERSON', 'Pip')]","[('PERSON', 'Alexander'), ('PERSON', 'Bartholomew'), ('PERSON', 'Abraham'), ('PERSON', 'Tobias'), ('PERSON', 'Roger'), ('PERSON', 'Pip')]","['Philip Pirrip', 'Georgiana', 'Alexander', 'Bartholomew', 'Abraham', 'Tobias', 'Roger', 'Pip']"


### 2.4 Review the outputs and look for possible method refinements

#### 2.4.1 Quality of the outputs / issues

Let's have a look at a sample of results from ```dickens```. Below we see some pitfalls with NER already:

* __spaCy__ returns nothing at all for the first, second and fifth sentences, missing _Estella_ and _Biddy_, both of which __NLTK__ finds
* In the third sentence __spaCy__ misses _Pip_ and truncates _Mr. Jaggers_ to just _Jaggers_
* In the fourth sentence __spaCy__ truncates _Miss Havisham_ to just _Havisham_ and merges _Estella_ with the preceding text as _I. “Estella_

Proper nouns look better, with all the characters present and correct, BUT:

* Punctuation (the closing quotation mark _”_) and the word _Yes_ have also been identified as proper nouns, which is incorrect

In [28]:
book_sample[(book_sample['author'] == 'dickens') & \
            (book_sample['text'].str.contains('Pip|Havisham|Estella|Abel'))].sample(5, random_state=5)

Unnamed: 0,index,author,chapter,paragraph,sentence,text,lemmas,nltk_entities,spacy_entities,proper_nouns
6106,6106,dickens,37,86,0,"“Why should I look at him?” returned Estella, with her eyes on me instead.","[why, should, i, look, at, he, return, estella, with, her, eye, on, i, instead]","[(PERSON, Estella)]",,[Estella]
1231,1231,dickens,7,91,0,"“You are to wait here, you boy,” said Estella; and disappeared and closed the door.","[you, be, to, wait, here, you, boy, say, estella, and, disappear, and, close, the, door]","[(PERSON, Estella)]",,"[”, Estella]"
8200,8200,dickens,51,33,0,"“Pip,” said Mr. Jaggers, laying his hand upon my arm, and smiling openly, “this man must be the most cunning impostor in all London.”","[pip, say, mr., jaggers, lay, his, hand, upon, my, arm, and, smile, openly, this, man, must, be, the, most, cunning, impostor, in, all, london]","[(PERSON, Pip), (PERSON, Mr. Jaggers)]","[(PERSON, Jaggers)]","[Pip, ”, Mr. Jaggers, London]"
1358,1358,dickens,8,37,0,"“Yes,” said I. “Estella waved a blue flag, and I waved a red one, and Miss Havisham waved one sprinkled all over with little gold stars, out at the coach-window.","[yes, say, i., estella, wave, a, blue, flag, and, i, wave, a, red, one, and, miss, havisham, wave, one, sprinkle, all, over, with, little, gold, star, out, at, the, coach, window]","[(PERSON, Estella), (PERSON, Miss Havisham)]","[(PERSON, I. “Estella), (PERSON, Havisham)]","[Yes, ”, Estella, Miss Havisham]"
2620,2620,dickens,16,73,0,"And now, because my mind was not confused enough before, I complicated its confusion fifty thousand-fold, by having states and seasons when I was clear that Biddy was immeasurably better than Estella, and that the plain honest working life to which I was born had nothing in it to be ashamed of, but offered me sufficient means of self-respect and happiness.","[and, now, because, my, mind, be, not, confuse, enough, before, i, complicate, its, confusion, fifty, thousand, -, fold, by, have, state, and, season, when, i, be, clear, that, biddy, be, immeasurably, well, than, estella, and, that, the, plain, honest, work, life, to, which, i, be, bear, have, nothing, in, it, to, be, ashamed, of, but, offer, i, sufficient, mean, of, self, respect, and, happiness]","[(PERSON, Biddy), (PERSON, Estella)]",,"[Biddy, Estella]"


In [29]:
# Save the sample indices so that we can look at the examples again later
sample_indices = book_sample[(book_sample['author'] == 'dickens') & \
         (book_sample['text'].str.contains('Pip|Havisham|Estella|Abel'))].sample\
(5, random_state=5).index.tolist()

I'm expecting trouble with ```carroll```. The way these characters are referenced is not typical of most training corpora: referring to a character as '_the Duchess_' or '_the Caterpillar_' is a little unusual. And indeed with NER the results are terrible:

* None of the 'odd' characters like _Caterpillar_ and _Duchess_ have been identified - in fact only _Alice_ herself in the fifth sentence gets picked up

The proper nouns do at least pick out these capitalised characters, however:

* _No_ and _Very_ are being picked up as proper nouns which is incorrect. Once again there are apostrophes, and now quotation marks too, included. These definitely need further investigation.

In [30]:
book_sample[(book_sample['author'] == 'carroll') & \
            (book_sample['text'].str.contains('White Rabbit|Caterpillar|Duchess'))].sample(5, random_state=15)

Unnamed: 0,index,author,chapter,paragraph,sentence,text,lemmas,nltk_entities,spacy_entities,proper_nouns
10575,10575,carroll,7,10,2,"Next came the guests, mostly Kings and Queens, and among them Alice recognised the White Rabbit: it was talking in a hurried nervous manner, smiling at everything that was said, and went by without noticing her.","[next, come, the, guest, mostly, king, and, queens, and, among, they, alice, recognise, the, white, rabbit, it, be, talk, in, a, hurried, nervous, manner, smile, at, everything, that, be, say, and, go, by, without, notice, she]","[(PERSON, Next), (PERSON, Kings)]","[(PERSON, Kings), (PERSON, Alice)]","[Kings, Queens, Alice, White]"
10712,10712,carroll,8,14,0,"“Very true,” said the Duchess: “flamingoes and mustard both bite.","[very, true, say, the, duchess, flamingo, and, mustard, both, bite]",,,"[Very, ”]"
10120,10120,carroll,4,13,0,“Why?” said the Caterpillar.,"[why, say, the, caterpillar]",,,[Caterpillar]
11189,11189,carroll,11,37,0,The White Rabbit put on his spectacles.,"[the, white, rabbit, put, on, his, spectacle]",,,[White Rabbit]
11244,11244,carroll,11,68,0,"The long grass rustled at her feet as the White Rabbit hurried by—the frightened Mouse splashed his way through the neighbouring pool—she could hear the rattle of the teacups as the March Hare and his friends shared their never-ending meal, and the shrill voice of the Queen ordering off her unfortunate guests to execution—once more the pig-baby was sneezing on the Duchess’s knee, while plates and dishes crashed around it—once more the shriek of the Gryphon, the squeaking of the Lizard’s slat...","[the, long, grass, rustle, at, her, foot, as, the, white, rabbit, hurry, by, the, frightened, mouse, splash, his, way, through, the, neighbour, pool, she, could, hear, the, rattle, of, the, teacup, as, the, march, hare, and, his, friend, share, their, never, end, meal, and, the, shrill, voice, of, the, queen, order, off, her, unfortunate, guest, to, execution, once, more, the, pig, baby, be, sneeze, on, the, duchess, knee, while, plate, and, dish, crash, around, it, once, more, the, shriek, ...","[(PERSON, Mouse), (PERSON, Mock Turtle)]","[(PERSON, Mouse), (PERSON, Queen)]","[White Rabbit, Mouse, March Hare, Queen, Duchess ’, Gryphon, Lizard ’, Mock Turtle]"


In [31]:
# Save the sample indices so that we can look at the examples again later
sample_indices.extend(book_sample[(book_sample['author'] == 'carroll') & \
  (book_sample['text'].str.contains('White Rabbit|Caterpillar|Duchess'))].sample(5, random_state=15).index.tolist())

When examining results from ```alcott``` we see similar issues with NER:

* _Meg_ has been missed by __NLTK__ and __spaCy__ in the first sentence
* In the second sentence _Amy_ is missed by __spaCy__, and _Hall_ is erroneously included

And as before with proper nouns we see:

* _Aren_ has been picked up as a proper noun (and also as a person by __NLTK__), and once again those extraneous punctuation marks are labelled as proper nouns

In [32]:
book_sample[(book_sample['author'] == 'alcott') & \
            (book_sample['text'].str.contains('Jo|Meg|Amy'))].sample(5, random_state=5)

Unnamed: 0,index,author,chapter,paragraph,sentence,text,lemmas,nltk_entities,spacy_entities,proper_nouns
20565,20565,alcott,48,52,0,"“There’s no need for me to say it, for everyone can see that I’m far happier than I deserve,” added Jo, glancing from her good husband to her chubby children, tumbling on the grass beside her.","[there, ’, no, need, for, i, to, say, it, for, everyone, can, see, that, i, am, far, happy, than, i, deserve, add, jo, glance, from, her, good, husband, to, her, chubby, child, tumble, on, the, grass, beside, she]","[(PERSON, Jo)]","[(PERSON, Jo)]","[’, ”, Jo]"
16890,16890,alcott,29,78,1,"I wouldn’t have told you, for I set my heart on surprising you, and I flatter myself I’ve done it,” said Jo, when she got her breath.","[i, would, not, have, tell, you, for, i, set, my, heart, on, surprise, you, and, i, flatter, myself, i, have, do, it, say, jo, when, she, get, her, breath]","[(PERSON, Jo)]","[(PERSON, Jo)]","[”, Jo]"
19810,19810,alcott,44,58,0,"“You are the same Jo still, dropping tears about one minute, and laughing the next.","[you, be, the, same, jo, still, drop, tear, about, one, minute, and, laugh, the, next]",,"[(PERSON, Jo)]",[Jo]
17687,17687,alcott,33,38,1,"I’ve tried, because one feels awkward in company not to do as everybody else is doing, but I don’t seem to get on”, said Jo, forgetting to play mentor.","[i, have, try, because, one, feel, awkward, in, company, not, to, do, as, everybody, else, be, do, but, i, do, not, seem, to, get, on, say, jo, forget, to, play, mentor]","[(PERSON, Jo)]","[(PERSON, Jo)]",[Jo]
16979,16979,alcott,30,19,2,"Just be calm, cool, and quiet, that’s safe and ladylike, and you can easily do it for fifteen minutes,” said Amy, as they approached the first place, having borrowed the white parasol and been inspected by Meg, with a baby on each arm.","[just, be, calm, cool, and, quiet, that, ’, safe, and, ladylike, and, you, can, easily, do, it, for, fifteen, minute, say, amy, as, they, approach, the, first, place, having, borrow, the, white, parasol, and, be, inspect, by, meg, with, a, baby, on, each, arm]","[(PERSON, Amy), (PERSON, Meg)]","[(PERSON, Amy)]","[’, ”, Amy, Meg]"


In [33]:
# Save the sample indices so that we can look at the examples again later
sample_indices.extend(book_sample[(book_sample['author'] == 'alcott') & \
                      (book_sample['text'].str.contains('Jo|Meg|Amy'))].sample(5, random_state=5).index.tolist())

If we have a look at the inner workings of the _Aren_ example from ```alcott```, we can see that our punctuation is being dealt with incorrectly (run ```nltk.help.upenn_tagset()``` for a list of the Penn Treebank tags and their meanings if required): 

* ```('“', 'JJ')``` - this is not an adjective
* ```('’', 'NNP')``` - this is not a proper noun
* ```('”', 'VB')``` - this is not a verb

This incorrect handling of punctuation is not only causing punctuation to be returned as proper nouns, but it's also affecting the tagging of surrounding terms:

In [34]:
sample = book_sample.loc[11559]['text']
pos_tree = nltk.tag.pos_tag(word_tokenize(sample))
print(f'''{sample}
''')
for tree in pos_tree:
    print(tree)

KeyError: 11559

We see a similar pattern with the _Very_ example from ```carroll``` where the surrounding punctuation is incorrectly tagged and _Very_ itself is tagged as a proper noun ```NNP```:

In [None]:
sample = book_sample.loc[10378]['text']
pos_tree = nltk.tag.pos_tag(word_tokenize(sample))
print(f'''{sample}
''')
for tree in pos_tree:
    print(tree)

It struck me that these typographic ('curly') apostrophes and quotation marks differ from the straight ones. I wondered whether changing ```‘’``` to ```''``` and ```“”``` to ```""``` would resolve the issue, as the straight forms are more common in modern corpora, and indeed it does: in the following example ```"``` is now tagged as punctuation, and as a result _Very_ has correctly been tagged as an adverb (```RB```).

In [None]:
adjusted_sample = '''"Very," said Alice: " - where's the Duchess?"'''
pos_tree = nltk.tag.pos_tag(word_tokenize(adjusted_sample))
print(f'''{adjusted_sample}
''')
for tree in pos_tree:
    print(tree)

Let's have a look at the punctuation we have across all 3 full books and see if there are any other irregular items that stand out. In an offline exercise I also investigated the difference between ```-``` and ```—``` (hyphens and dashes respectively) but found they did not have a negative impact on the tagging. I also looked at the impact of these characters: ```[]&*$```, but their presence was negligible and therefore no changes were required.

In [None]:
punctuation_df = book_components['text'].str.extractall(r'([^\w\s])').reset_index(drop=True)
punctuation_list = punctuation_df[0].unique()
punctuation_list

So I'm going to go ahead and replace our 'abnormal' punctuation with 'normal' punctuation in our ```book_sample```:

In [None]:
punctuation_replacements = {'“':'"', '”':'"', '‘':"'", '’':"'" }
for before, after in punctuation_replacements.items():
    book_sample['text'] = book_sample['text'].str.replace(before, after)
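
As a side note, the same four replacements can also be expressed as a single translation table. The following is a minimal sketch in plain Python rather than the pandas call used above; ```normalise_quotes``` and ```QUOTE_TABLE``` are hypothetical names introduced only for illustration.

```python
# Map each typographic quote character to its straight equivalent
QUOTE_TABLE = str.maketrans({'“': '"', '”': '"', '‘': "'", '’': "'"})

def normalise_quotes(text):
    '''Return text with curly quotes replaced by straight quotes (hypothetical helper).'''
    return text.translate(QUOTE_TABLE)

print(normalise_quotes('“Very,” said Alice: “where’s the Duchess?”'))
```

Applied to a dataframe column, something like ```book_sample['text'].map(normalise_quotes)``` should give the same result as the loop, in one pass over each string.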

And now repeat the NER / POS tagging tasks so we can check the effect:

In [None]:
book_sample['nltk_entities_v2'] = book_sample['text'].progress_apply(lambda x: get_named_entities_nltk(x))
book_sample['spacy_entities_v2'] = book_sample['text'].progress_apply(lambda x: get_named_entities_spacy(x))
book_sample['proper_nouns_v2'] = book_sample['text'].progress_apply(lambda x: get_proper_nouns_nltk(x))

This change has resolved the edge cases to do with punctuation that we identified earlier (compare the outputs of the original columns with the ```*_v2``` columns):

In [None]:
view_columns = ['author', 'text', 'nltk_entities', 'spacy_entities', 'proper_nouns', 'nltk_entities_v2',
       'spacy_entities_v2', 'proper_nouns_v2']
book_sample.loc[book_sample.index.isin(sample_indices), view_columns].style.apply\
(highlight_column, subset=['nltk_entities_v2', 'spacy_entities_v2', 'proper_nouns_v2'])

### 3.4.2 Names with titles

Another aspect that I want to look into is the use of titles like _Mr._ and _Mrs._ as these are less common in modern corpora. When it comes to _Mr._ results for NER are mixed: sometimes the _Mr._ is separated from the name (as in the first 2 rows), other times not (as in the third row):

In [None]:
book_sample.loc[book_sample['text'].str.contains('Mr. Dashwood'), view_columns].head(3)

For some reason the results for _Mrs._ with NER appear even worse!

In [None]:
book_sample.loc[book_sample['text'].str.contains('Mrs. March'), view_columns].head(3)

Incidentally __spaCy__ ([spacy.io, 2022](https://spacy.io/usage/rule-based-matching)) has a suggested workaround for this issue using rule-based matching, but I am limiting my exploration to the standard functionality in the interests of time.
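
To give a flavour of what a rule-based approach might look like, here is a rough regular-expression sketch that keeps a title attached to the capitalised word(s) following it. This is an illustrative assumption on my part and not the spaCy matcher workaround; ```TITLE_PATTERN``` and ```find_titled_names``` are hypothetical names.

```python
import re

# A title (with an optional full stop) followed by one or more capitalised words
TITLE_PATTERN = re.compile(r'\b(?:Mr|Mrs|Ms|Miss|Dr)\.?(?:\s+[A-Z][a-z]+)+')

def find_titled_names(text):
    '''Return every title + name span found in text (hypothetical helper).'''
    return TITLE_PATTERN.findall(text)

print(find_titled_names('"Pip," said Mr. Jaggers, as Miss Havisham looked on.'))
```

A pattern like this is deliberately naive: it would also swallow any capitalised word that happens to follow the name, so it is only a starting point.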

### 3.4.3 Place names

So far the quality of the names extracted by the proper nouns method is looking the best and at this point I'm planning to rely quite heavily (although not exclusively) on them. However, as I mentioned in the introduction, these names will include proper nouns that we do not want, with the most likely high-frequency ones being place names. We know __Great Expectations__ is very much to do with _London_ and a quick check shows us that _London_ is indeed being picked up by the proper nouns method, so we'll certainly need to reduce this problem as much as possible by identifying known place names and removing them from the list.

In [None]:
book_sample.loc[book_sample['text'].str.contains('London'), view_columns].head(3)

In __WordNet__ the most common 'sense' of a word is the first synset - or 'synonym set' ([Jurafsky & Martin, 2021](https://web.stanford.edu/~jurafsky/slp3/18.pdf)). So if the word _London_ is most typically associated with being a place name then this will be the first sense. Synsets are located within a hierarchy and place names resolve upwards in the hierarchy to the synset ```location.n.01``` as shown:

In [None]:
synset = wn.synsets('London')
synset[0].hypernym_paths()[0]

Using this logic we can get an evaluation from __WordNet__ on whether a name is known to be a place or not. Let's look at 2 of the place names seen above (_London_ and _Cheapside_) as well as a character name _Pip_:

In [None]:
test_location_names = ['London', 'Pip', 'Cheapside']
for name in test_location_names:
    if len(wn.synsets(name)) == 0:
        location = 'not found in WordNet'
    else:
        location = wn.synset('location.n.01') in wn.synsets(name)[0].hypernym_paths()[0]
    print(f'''{name}: place name = {location}''')

Here we see that __WordNet__ is probably only going to eliminate major place names for us. Although _Cheapside_ is a place, it's not one that is contained in __WordNet__.

An alternative to __WordNet__ would be to use a set of __gazetteers__. A gazetteer is a dictionary of place names. For example _Cheapside_ is known to the __Gazetteer of British Place Names__ ([The Association of British Counties, 2022](https://gazetteer.org.uk/purchase)). This route would likely lead to better elimination of place names. But the drawback would be the need to compile multiple gazetteers for different countries (this one won't help us with __Little Women__ which is set in America), and this is probably a small project on its own so for now I will stick with __WordNet__ to eliminate the main ones.
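
For illustration only, a gazetteer-based filter could be as simple as a set lookup. The sketch below uses a tiny hand-made set as a stand-in (it is not data taken from the Gazetteer of British Place Names), and ```toy_gazetteer``` and ```is_known_place``` are hypothetical names.

```python
# Tiny hand-made stand-in for a real gazetteer (assumption: illustrative only)
toy_gazetteer = {'london', 'cheapside', 'richmond'}

def is_known_place(name, gazetteer=toy_gazetteer):
    '''Return True if name appears in the gazetteer, ignoring case (hypothetical helper).'''
    return name.lower() in gazetteer

candidates = ['London', 'Pip', 'Cheapside']
print([name for name in candidates if not is_known_place(name)])
```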

## 3.5 Extract named entities / proper nouns from the full books

Having refined our process a little, using a sample of the data, we are now ready to begin processing on the full books. Our first steps are:

1. performing required punctuation replacements on the full texts
2. extracting __nltk__ and __spaCy__ named entities for ```PERSON```
3. extracting all proper nouns ```NNP```

In [None]:
# Create a copy of our data to continue processing
book_entities = book_components.copy()

#### Replace punctuation

In [None]:
punctuation_replacements = {'“':'"', '”':'"', '‘':"'", '’':"'" }
for before, after in punctuation_replacements.items():
    book_entities['text'] = book_entities['text'].str.replace(before, after)

#### Get named entities

In [None]:
# Takes a little longer to run on the full books!
book_entities['nltk_entities'] = book_entities['text'].progress_apply(lambda x: get_named_entities_nltk(x))
book_entities['spacy_entities'] = book_entities['text'].progress_apply(lambda x: get_named_entities_spacy(x))

#### Get proper nouns

In [None]:
book_entities['proper_nouns'] = book_entities['text'].progress_apply(lambda x: get_proper_nouns_nltk(x))

In [None]:
# Let's now check the outputs are as expected
book_entities.head(2).style.apply(highlight_column, subset=['nltk_entities', 'spacy_entities', 'proper_nouns'])

## 3.6 Get name frequencies and trim the list of names

The next task is to compile a list of all the names identified, and then for each name count how many mentions were found by each of the 3 methods.

In addition we'll perform some further 'cleanup' steps identified earlier:

1. removing locations as far as possible (using __WordNet__)
2. general cleanup steps:
    * removing names that only consist of a title like _Mr._ or _Mrs._
    * removing names that only consist of a single character
    * removing names that contain no alphabetic characters
    
We then need to trim the number of names to a manageable size. For this I will employ 2 methods:

* The first cut will be based on a threshold for the number of mentions for each name
* The second cut will be based on a threshold for the % 'mention space' each name occupies

The latter idea is based on Sack's findings ([2011](https://www.aaai.org/ocs/index.php/FSS/FSS11/paper/viewFile/4230/4528)) that character mentions follow a long-tailed distribution. In addition, I expect this method to clear out more noise from the dataset.

Counting name frequencies obviously needs to be done per book so my first step is to split our data into 3 separate dataframes: one for each book. I'll use a ```dict``` to hold this information to simplify further processing across all three books (where I can loop through the books to apply the same processing steps whenever I need):

In [None]:
processed_books = {}
for author in authors:
    processed_books[author] = book_entities[book_entities['author'] == author]
    
# Check the output is as expected
processed_books['dickens'].head(2)

#### Get character frequencies - ```get_occurrences()```
My next step is to retrieve the names that each technique has found and get a count of the frequencies of each. The following helper function will facilitate this:

In [None]:
def get_occurrences(pd_series, tuples = True):
    '''Takes in a pandas series of names and returns a frequency count of names, tuples is True
    for ner but should be False for proper nouns'''
    combined = list(itertools.chain.from_iterable(pd_series[~pd_series.isna()]))
    if tuples:
        person_occurrences = [person[1] for person in combined]
    else:
        person_occurrences = [person for person in combined]
    person_set = set(person_occurrences)
    named_entity_counts = {word:person_occurrences.count(word) for word in person_set}
    sorted_occurrences = {k: v for k, v in sorted(named_entity_counts.items(), \
                                                  reverse=True, key=lambda item: item[1])}
    return sorted_occurrences
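
For reference, the same counts could probably be obtained more compactly with ```collections.Counter```, which avoids calling ```list.count``` once per unique name. This is a minimal sketch on plain lists rather than a pandas series; ```count_names``` is a hypothetical name, not part of the pipeline.

```python
from collections import Counter
from itertools import chain

def count_names(rows, tuples=True):
    '''Count names across rows of extracted entities, most frequent first.
    rows is a list of lists; entries that are None are skipped (hypothetical helper).'''
    combined = chain.from_iterable(row for row in rows if row is not None)
    names = (item[1] for item in combined) if tuples else combined
    # most_common() returns (name, count) pairs sorted by descending count
    return dict(Counter(names).most_common())

ner_rows = [[('PERSON', 'Pip')], None, [('PERSON', 'Pip'), ('PERSON', 'Joe')]]
print(count_names(ner_rows))
```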

#### Compile character frequencies - ```compile_entities()```

Here we get occurrences per author and extraction type:

In [None]:
def compile_entities(book_dict, author, extraction_type):
    series = book_dict[author][extraction_type].copy()
    if extraction_type in ['nltk_entities', 'spacy_entities']:
        tuples = True
    else:
        tuples = False
    result = pd.json_normalize(get_occurrences(series, tuples = tuples)).transpose().reset_index()
    result.columns = ['name', extraction_type]
    return result

In [None]:
extraction_types = ['nltk_entities', 'spacy_entities', 'proper_nouns']
entities = {}
for author in authors:
    entities[author] = {}
    for extraction_type in extraction_types:
        entities[author][extraction_type] = compile_entities(processed_books, \
                                                             author=author, extraction_type=extraction_type)
        
# Check the output is as expected
entities['dickens']['nltk_entities'].head(5).style.apply(highlight_column, subset=['nltk_entities'])        

#### Combine entity counts - ```combine_entities()```

And finally we combine the data so that we can compare the results (names found by all 3 methods, and the frequency of each by method):

In [None]:
def combine_entities(entity_dict, author, extraction_types):
    # Create a list to hold character names
    all_names = []
    # Collect the character names generated by all 3 methods
    for extraction_type in extraction_types:
        all_names.extend(entity_dict[author][extraction_type]['name'].tolist())
    # Dedupe character names
    all_names = list(set(all_names)) 
    # Create a dataframe to hold our data
    name_summary = pd.DataFrame(data=all_names, columns = ['name'])
    # Bring in the frequencies for each method
    for extraction_type in extraction_types:
        name_summary = name_summary.merge(entity_dict[author][extraction_type], how='left', on='name')
    # Fill 0 where a method found no occurrences for a name
    name_summary.fillna(0, inplace=True)
    name_summary = name_summary.astype('int64', errors='ignore')
    return name_summary

In [None]:
for author in authors:
    entities[author]['extraction_summary'] = combine_entities\
    (entities, author=author, extraction_types=extraction_types)

# Check the output is as expected
entities['dickens']['extraction_summary'].query('name == "Joe"').style.apply\
(highlight_column, subset=['nltk_entities', 'spacy_entities', 'proper_nouns']) 

#### Reduce the list of names based on number of mentions

Looking at the results we can confirm 2 things:

* The names found do indeed follow the expected long-tailed distribution in all 3 cases
* The tails are ridiculously long - ```alcott``` has more than 1200 proper nouns!

Given the goal of providing background information that can aid in literary analysis it's unnecessary to burden anyone with that many names! Furthermore a lot of those low-frequency names are probably just 'noise' as a result of the proper nouns including non-character information.

In [None]:
for author in authors:
    print(f'''{author}: {len(entities[author]['extraction_summary']['proper_nouns'])} names found''')
    ax = entities[author]['extraction_summary']['proper_nouns'].sort_values\
        (ascending = False).reset_index(drop=True).plot(color = author_colors[author], linewidth=4, figsize = (20, 3))
    plt.title(f'''{author} number of proper nouns found / frequencies''', fontsize = 16)
    plt.ylabel('frequencies')
    plt.show()

We can easily see how much noise is included in this raw dataset just by sampling some names. The objective is to eliminate very low frequency names either because they are noise (not actually character names) or because they are characters whose role is presumably very minor.

In [None]:
entities['alcott']['extraction_summary'].sample(5, random_state=2).style.apply(highlight_column, subset=['name'])        

#### Reduce entities based on count thresholds - ```reduce_entities()```

Since the proper noun extraction appeared to produce the best _quality_ names I decided to only look at names where proper noun extraction found at least one instance - and in combination with this the name had to be found at least 3 times by any one of the extraction methods. Arriving at these cutoffs took a bit of offline experimentation and was something of a manual process given that we don't have a labelled dataset, but in the end these criteria work reasonably well for all 3 books:

In [None]:
def reduce_entities(entity_dict, author):
    reduced_entities = \
        entity_dict[author]['extraction_summary'][((entity_dict[author]['extraction_summary']['nltk_entities'] > 2 ) | 
        (entity_dict[author]['extraction_summary']['spacy_entities'] > 2) |
        (entity_dict[author]['extraction_summary']['proper_nouns'] > 2)) & 
        (entity_dict[author]['extraction_summary']['proper_nouns'] != 0)].copy()
    return reduced_entities
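
Stated for a single name, the rule reads: keep the name if proper noun extraction found it at least once and any one method found it three or more times. The toy sketch below applies that rule to made-up counts (not results from the books); ```keep_name``` is a hypothetical helper.

```python
def keep_name(nltk_count, spacy_count, proper_noun_count, min_mentions=3):
    '''Apply the cut described above to one name's counts (hypothetical helper).'''
    found_often = max(nltk_count, spacy_count, proper_noun_count) >= min_mentions
    return found_often and proper_noun_count != 0

# Made-up counts purely for illustration: (nltk, spacy, proper nouns)
toy_counts = {'Joe': (40, 35, 50), 'Next': (5, 0, 0), 'Hall': (1, 2, 1)}
print({name: keep_name(*counts) for name, counts in toy_counts.items()})
```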

In [None]:
# Get all entities / proper nouns with > 2 mentions
main_entities = {}
for author in authors:
    main_entities[author] = reduce_entities(entities, author)

Here we see the names have been reduced to a much more manageable number:

In [None]:
for author in authors:
    print(f'''{author}: {len(main_entities[author]['proper_nouns'])} names found''')

#### Eliminate WordNet locations - ```test_for_location()```

We can now test if a name is found as a WordNet location, and if so we drop it from the list:

In [None]:
def test_for_location(name):
    '''Takes in a name and uses Wordnet to check if the most common sense (which is always the first one)
    resolves to the hypernym 'location.n.01'.'''
    synset = wn.synsets(name)
    # The name may not occur in Wordnet (most person names don't)
    if len(synset) == 0:
        location = False
    # But if it does, check if it could occur as a location
    else:
        location = wn.synset('location.n.01') in synset[0].hypernym_paths()[0]
    return location

In [None]:
for author in authors:
    main_entities[author]['location'] = main_entities[author]['name'].apply(lambda x: test_for_location(x))
    to_drop = main_entities[author][main_entities[author]['location']==True].index
    main_entities[author].drop(to_drop, inplace = True)
    main_entities[author].reset_index(drop=True, inplace=True)

#### General cleanup steps - ```cleanup_entities()```

In [None]:
common_titles = ['Mr.', 'Mr', 'Mrs.', 'Mrs', 'Ms.', 'Ms', 'Miss', 
                 'Dr.', 'Dr', 'Rev.', 'Rev', 'Prof.', 'Prof']
def cleanup_entities(entity_dict, author, common_titles=common_titles):
    to_drop = []
    # Get names that are just common titles
    to_drop.extend(entity_dict[author][entity_dict[author]['name'].isin(common_titles)].index.tolist())
    # Get names that are only 1 character long
    to_drop.extend(entity_dict[author][entity_dict[author]['name'].str.len() == 1].index.tolist())
    # Get names that contain no alpha characters
    to_drop.extend(entity_dict[author][entity_dict[author]['name'].str.contains\
                                       (r'^[^a-zA-Z]*$', regex = True)].index.tolist())
    cleaned_entity_dict = entity_dict[author].copy()
    cleaned_entity_dict.drop(to_drop, inplace = True)
    cleaned_entity_dict.reset_index(drop=True, inplace=True)
    return cleaned_entity_dict

In [None]:
for author in authors:
    main_entities[author] = cleanup_entities(main_entities, author)

Once again the number of names has been slightly reduced in all cases:

In [None]:
for author in authors:
    print(f'''{author}: {len(main_entities[author]['proper_nouns'])} names found''')

#### Reduce entities based on % mentions - ```final_cut_entities()```

In order to get a feel for how much 'space' a character takes up (Sack, [2011](https://www.aaai.org/ocs/index.php/FSS/FSS11/paper/viewFile/4230/4528), citing Alex Woloch), I looked at the mentions of each character as a % of the total mentions in our current list of names. Using a threshold of 0.1% (0.001; again some iterations with manual review were required given the unlabelled dataset), I eliminated characters whose share of mentions was below this threshold as 'probably too minor to care about'. This step approximately halved the number of characters in ```dickens``` and ```alcott``` while leaving ```carroll``` as-is (the latter has far fewer minor characters than the other books). Our final frequency distributions are shown below:

In [None]:
def final_cut_entities(entity_dict, author):
    final_cut_entity_dict = entity_dict[author].copy()
    total_mentions = final_cut_entity_dict['proper_nouns'].sum()
    final_cut_entity_dict['proper_nouns_perc'] = final_cut_entity_dict['proper_nouns'].apply\
        (lambda x: round(x/total_mentions, 4))
    result = final_cut_entity_dict.query('proper_nouns_perc >= 0.001')
    return result
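
As a worked example of the arithmetic, with made-up counts rather than figures from the books: if the names in a list total 2,000 mentions, a name needs at least 2 mentions to reach the 0.001 threshold. A plain-Python sketch of the same calculation:

```python
# Made-up mention counts purely for illustration
toy_mentions = {'Joe': 1500, 'Biddy': 496, 'Hubble': 3, 'Camilla': 1}
total = sum(toy_mentions.values())  # 2000 mentions in total

# Share of the total 'mention space' per name, rounded to 4 decimal places
shares = {name: round(count / total, 4) for name, count in toy_mentions.items()}

# Keep only the names at or above the 0.1% threshold
kept = [name for name, share in shares.items() if share >= 0.001]
print(kept)
```

Here _Camilla_ has a share of 1/2000 = 0.0005 and is dropped, while _Hubble_ at 3/2000 = 0.0015 survives.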

In [None]:
reduced_entities = {}
for author in authors:
    reduced_entities[author] = final_cut_entities(main_entities, author)

We are left with names whose share of proper noun mentions (```proper_nouns_perc```) is >= 0.001:

In [None]:
reduced_entities['dickens'].head().style.apply(highlight_column, subset=['proper_nouns_perc'])        

In [None]:
for author in authors:
    print(f'''{author}: {len(reduced_entities[author]['proper_nouns'])} names found''')
    ax = reduced_entities[author]['proper_nouns'].sort_values(ascending = False).reset_index\
        (drop=True).plot(color = author_colors[author], linewidth=4, figsize = (20, 3))
    plt.title(f'''{author} number of names / frequencies''', fontsize = 16)
    plt.ylabel('frequencies')
    plt.show()

## 3.7 Group similar names

The next objective is to relate a list of names like the following with a single common grouping 'index':

```['Herbert', 'Pocket', 'Herbert Pocket', 'Mrs. Pocket', 'Mr. Pocket', 'Mr. Matthew Pocket', 'Miss Sarah Pocket', 'Mr. Herbert', 'Sarah Pocket', 'Miss Pocket', 'Matthew Pocket']``` 

We can infer these are all related to one another in some way (although we cannot be sure how) as they share a similar group of individual names including 'Herbert', 'Pocket', 'Sarah', 'Matthew'.

This may seem slightly counter-intuitive BUT it is the most pragmatic way I can think of to deal with similar names. For instance if we read _Herbert_ we would easily say that _Herbert Pocket_ should be the same fellow. But what about when we come across _Mr. Pocket_? Once we know of the existence of _Matthew Pocket_ we can no longer be sure which man is being referred to with _Mr. Pocket_. To err on the side of caution my approach is to 'lasso' all names which could be related to one another into a single group.

With a small number of alternatives like this, we can then say they are aliases for the same character:

```['Cat', 'Cheshire Cat']``` 

When there are too many alternatives, as in the _Pocket_ example above, we err on the side of caution and just pick out the name with the single largest frequency as a character name we can be sure of, and leave the rest to one side as 'uncertain'.

The overall recipe for this section is:

1. Generate a set of name permutations so that we can check which single names are contained within multi-word names, like 'Herbert' contained within 'Herbert Pocket'
2. Where a set of names associated with a single word like 'Herbert', e.g. {'Herbert', 'Pocket'} overlaps with another set of names with a common element - say {'Sarah', 'Pocket'} - merge these names into a single group

[Given a little more time to experiment it would have been useful to see how we could refine this based on the conventions of first name (usually unique) and surname (often not unique).]
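
The two steps of the recipe can be sketched on a toy list of names using plain sets. This is a simplified illustration under my own assumptions (a short title list, whole-word matching via ```split```), and ```group_names``` is a hypothetical helper rather than the function used in the pipeline.

```python
def group_names(names, stopwords=('Mr.', 'Mrs.', 'Miss')):
    '''Group names that share any non-title word (hypothetical helper).'''
    # Step 1: pair each name with its set of distinctive words
    groups = [({name}, {w for w in name.split() if w not in stopwords}) for name in names]
    # Step 2: keep merging any two groups whose word sets overlap
    merged = True
    while merged:
        merged = False
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                if groups[i][1] & groups[j][1]:
                    groups[i] = (groups[i][0] | groups[j][0], groups[i][1] | groups[j][1])
                    del groups[j]
                    merged = True
                    break
            if merged:
                break  # restart the scan after every merge
    return [sorted(members) for members, _ in groups]

toy_names = ['Herbert', 'Mr. Pocket', 'Herbert Pocket', 'Sarah Pocket', 'Cat', 'Cheshire Cat', 'Joe']
print(group_names(toy_names))
```

Note how _Herbert_ and _Sarah Pocket_ end up in the same group even though they share no word directly: they are linked through _Herbert Pocket_.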

#### Group overlapping entities - ```group_entities()```

In [None]:
# Check for single names contained within multi-word names, and then group names with common elements
# We don't want to group names on their generic components like 'Mr.' or 'of'
name_stopwords = common_titles + ['of', 'Of', 'the', 'The']

def group_entities(entity_dict, author):
    # Get main list of names to test, e.g. 'Herbert Pocket'
    names_to_test = entity_dict[author]['name'].tolist()
    # Add single name components to this list e.g. 'Herbert', 'Pocket'
    names_to_test = names_to_test + list(set([item for sublist in [name.split(' ') \
                                   for name in names_to_test] for item in sublist if item not in name_stopwords]))
    # Generate name permutations
    permutations = list(itertools.permutations(names_to_test, r=2))

    relationships = []
    # Get the multi-name relationships
    for i in range(len(permutations)):
        test_case = permutations[i]
        if len(test_case[0].split(' ')) == 1:
            if re.search(rf'\b{test_case[0]}\b', test_case[1]):
                relationships.append(test_case)
    # Add the single name relationships
    single_names = [(name, name) for name in names_to_test]
    for name in single_names:
        if name not in relationships:
            relationships.append(name)
    # Group relationships by single names and their elements
    d = defaultdict(list)
    for key, value in relationships:
        d[key].append(value)
    d = dict(d)
    
    # Use 'successive merging' to build final name sets - method with thanks to Alain T. via this post:
    # https://stackoverflow.com/questions/56567089/combining-lists-with-overlapping-elements
    # - this uses sets and therefore intersection to check for overlapping elements (T., 2019)
    pooled = [set(subList) for subList in d.values()]
    merging = True
    while merging:
        merging=False
        for i,group in enumerate(pooled):
            merged = next((g for g in pooled[i+1:] if g.intersection(group)),None)
            if not merged: continue
            group.update(merged)
            pooled.remove(merged)
            merging = True
    pooled = [list(name_set) for name_set in pooled]
    names_df = pd.DataFrame(columns = ['name'])
    names_df['name'] = pooled
    names_df = names_df.explode('name')
    names_df.reset_index(inplace=True)
    
    merged_df = entity_dict[author].copy()
    merged_df = merged_df.merge(names_df, how = 'left', on = 'name')
    
    # Get total character mentions per index
    totals = pd.DataFrame(merged_df.groupby(by = ['index'])['proper_nouns'].sum())
    totals.reset_index(inplace = True)

    # Get character counts per index
    counts = pd.DataFrame(merged_df.groupby(by = ['index'])['name'].count())
    counts.reset_index(inplace = True)
    counts.columns = ['index', 'char_count']
    
    # Merge character totals and counts into final entities
    merged_df = merged_df.merge(totals, how = 'left', on = 'index', suffixes = ('_ind_totals', '_grp_totals'))
    merged_df = merged_df.merge(counts, how = 'left', on = 'index', )
    
    return merged_df

In [None]:
merged_entities = {}
for author in authors:
    merged_entities[author] = group_entities(reduced_entities, author)

Here we see that all the names containing _Pocket_ now share the same index:

In [None]:
merged_entities['dickens'][merged_entities['dickens']['name'].str.contains\
                           ('Pocket')].style.apply(highlight_column, subset=['name', 'index'])

Likewise, all the names associated with _Pocket_, whether or not they contain it, share the same index:

In [None]:
index_to_view = merged_entities['dickens'][merged_entities['dickens']['name'].str.contains\
                                           ('Pocket')].iloc[0]['index']
merged_entities['dickens'][merged_entities['dickens']['index'] == index_to_view].style.apply\
    (highlight_column, subset=['name', 'index'])
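
The merging step inside `group_entities()` can be hard to follow in isolation, so here is a minimal sketch of the same idea on toy data. It is not the loop used above: it is a single-pass variant that keeps a pool of non-overlapping sets and absorbs any set sharing a member with the incoming one. The function name `merge_overlapping` and the toy groups are made up for illustration.

```python
def merge_overlapping(groups):
    """Merge sets that share at least one member into single sets."""
    pooled = []  # invariant: the sets held here never overlap each other
    for group in groups:
        group = set(group)
        disjoint = []
        for existing in pooled:
            if existing & group:
                # Shared member found: absorb the existing set into the new one
                group |= existing
            else:
                disjoint.append(existing)
        pooled = disjoint + [group]
    return pooled


# Toy input: each set holds a single name plus the longer names that contain it
toy_groups = [
    {'Herbert', 'Herbert Pocket'},
    {'Pocket', 'Herbert Pocket', 'Mrs. Pocket'},
    {'Pip', 'Mr. Pip'},
    {'Joe'},
]
merged_groups = merge_overlapping(toy_groups)
for name_set in merged_groups:
    print(sorted(name_set))
```

The first two toy sets end up in one group because they share _Herbert Pocket_, which is the same effect that gives all the _Pocket_ names a common index.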

#### Extract final entities - ```extract_final_entities()```

Our final step is to take, for each grouping, the name with the most mentions. We'll treat this as the 'key value' for the indices we've established:

In [None]:
def extract_final_entities(entity_dict, author):
    final_entities = entity_dict[author].copy()
    # Get the maximum per index for each character grouping based on proper_nouns_ind_totals
    max_records = final_entities.groupby('index', group_keys=False).apply\
        (lambda x: x.loc[x['proper_nouns_ind_totals'].idxmax()])
    # Get the records where we can only keep one name as there may be ambiguities (> 3 'variants')
    singletons = max_records[max_records['char_count'] > 3].reset_index(drop=True)
    # Drop these ambiguous records from our main df
    to_drop = final_entities[final_entities['index'].isin(singletons['index'])].index
    final_entities.drop(to_drop, inplace = True)
    # And just re-include the singleton records we can be more sure about
    final_entities = pd.concat([final_entities, singletons])
    # Merge final_entities with max_records to get the single names representative of name variants
    max_records.reset_index(drop = True, inplace = True)
    final_entities = final_entities.merge(max_records[['index', 'name']], \
                                          how = 'left', on='index', suffixes=('_original', '_key'))
    final_entities.rename(columns={'name_original':'name', 'name_key': 'key'}, inplace=True)
    return final_entities

In [None]:
final_entities = {}
for author in authors:
    final_entities[author] = extract_final_entities(merged_entities, author)

# Check that the outputs are as expected
final_entities['dickens'][final_entities['dickens']['name'].str.contains\
                          ('Herbert|Pocket|Pip')].style.apply(highlight_column, subset=['name', 'index', 'key'])

Above we see that where a group has few enough name variants for the association to be reasonably certain (3 or fewer variants), all variants are retained and grouped under a common index: see _Pip_. However, where a group has more than 3 variants and we cannot say for sure how the names are associated, only the single name with the most mentions is retained: see _Herbert_, which has been selected from amongst all the _Pocket_ names.
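
As a rough illustration of the rule just described, the sketch below applies it to a toy table. The column names mirror those used in `extract_final_entities()`, but the mention counts are invented and the filtering is written more compactly than in the function itself, so treat it as an approximation of the logic and not a drop-in replacement.

```python
import pandas as pd

toy = pd.DataFrame({
    'name': ['Pip', 'Mr. Pip', 'Herbert', 'Pocket', 'Mrs. Pocket', 'Mr. Pocket'],
    'index': [0, 0, 1, 1, 1, 1],
    'proper_nouns_ind_totals': [300, 12, 150, 60, 40, 30],  # invented counts
})

# The most-mentioned name in each group becomes that group's key
top_rows = toy.groupby('index')['proper_nouns_ind_totals'].idxmax()
keys = toy.loc[top_rows, ['index', 'name']].rename(columns={'name': 'key'})
toy = toy.merge(keys, how='left', on='index')

# Groups with more than 3 variants are treated as ambiguous: keep only the key row
toy['char_count'] = toy.groupby('index')['name'].transform('count')
kept = toy[(toy['char_count'] <= 3) | (toy['name'] == toy['key'])]
print(kept[['name', 'index', 'key']])
```

In this toy case both _Pip_ variants survive, while the four-name _Pocket_ group is reduced to _Herbert_.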

\[If I were to refine this method a little more I would look at possible first name / last name associations to see if genuinely related names could be extracted in slightly more detail, but that is again outside of the scope of this particular investigation.\]

## 3.8 Save key information to our database

### 3.8.1 Save ```book_entities```

In [None]:
book_entities_sql = book_entities.copy()
# Get the book entities into a format where we can save it to the DB (which does not like lists or tuples)
book_entities_sql['nltk_entities'] = book_entities_sql['nltk_entities'].apply\
    (lambda x: ';'.join(map(str, x)) if x is not None else 'None')
book_entities_sql['spacy_entities'] = book_entities_sql['spacy_entities'].apply\
    (lambda x: ';'.join(map(str, x)) if x is not None else 'None')
book_entities_sql['lemmas'] = book_entities_sql['lemmas'].apply\
    (lambda x: ';'.join(x) if x is not None else 'None')
book_entities_sql['proper_nouns'] = book_entities_sql['proper_nouns'].apply\
    (lambda x: ';'.join(x) if x is not None else 'None')

In [None]:
# We can always convert back again with the following recipes:

# For NER (which has tuples)
# book_entities_sql['nltk_entities'].apply(lambda x: [tuple(re.sub(r'[\(\)\']', '', item).split(', ')) \
# for item in x.split(';')] if x != 'None' else None)

# For proper nouns (which has lists)
# book_entities_sql['proper_nouns'].apply(lambda x: x.split(';') if x != 'None' else None)
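
To make the round trip concrete, here is a small sketch of the tuple recipe on toy values. It assumes each entity is a `(name, label)` tuple of plain strings. The helper names `to_db_string` and `entities_from_db_string` are hypothetical wrappers around the expressions shown above.

```python
import re


def to_db_string(items):
    """Flatten a list of tuples into a ';'-separated string for storage."""
    return ';'.join(map(str, items)) if items is not None else 'None'


def entities_from_db_string(stored):
    """Rebuild the list of tuples from the stored string."""
    if stored == 'None':
        return None
    return [tuple(re.sub(r"[\(\)']", '', item).split(', '))
            for item in stored.split(';')]


simple = [('Pip', 'PERSON'), ('Joe', 'PERSON')]
tricky = [("Alice's sister", 'PERSON')]  # apostrophe inside the name

simple_round_trip = entities_from_db_string(to_db_string(simple))
tricky_round_trip = entities_from_db_string(to_db_string(tricky))

print(simple_round_trip == simple)
print(tricky_round_trip == tricky)
```

The second comparison fails: the recipe strips every apostrophe, and Python quotes such a name with double quotes that the pattern does not remove. Names containing `;` or `, ` would be split incorrectly for the same kind of reason, so the recipe is only lossless for simple names.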

In [None]:
# Save the outputs to our db
book_entities_sql.to_sql('book_entities', connection, schema=None, 
              if_exists='replace', index=True)

### 3.8.2 Save ```final_entities```

In [None]:
# Save to the database
for author in authors:
    table_name = author + '_entities'
    final_entities[author].to_sql(table_name, connection, schema=None, 
              if_exists='replace', index=True)

In [None]:
# And check that it landed safely
%sql SELECT name FROM sqlite_master WHERE type='table';

## 3.9 Evaluation

### 3.9.1 Loop in our 4th unseen book ```bronte```

The final 'recipe' was developed and refined on ```dickens```, ```carroll``` and ```alcott```. Before we evaluate I'm now going to add ```bronte``` so that we can evaluate the final results not only on the books used in the development process but on one that hasn't been seen before.

In [None]:
# Get bronte from our db
processed_books['bronte'] = pd.read_sql_query\
    ("SELECT * FROM book_components WHERE author IN ('bronte')", 'sqlite:///book_store.db')
processed_books['bronte']['lemmas'] = processed_books['bronte']['lemmas'].apply(lambda x: x.split(';'))

# Replace punctuation
for before, after in punctuation_replacements.items():
    processed_books['bronte']['text'] = processed_books['bronte']['text'].str.replace(before, after)
    
# Get named entities
processed_books['bronte']['nltk_entities'] = processed_books['bronte']['text'].progress_apply\
    (lambda x: get_named_entities_nltk(x))
processed_books['bronte']['spacy_entities'] = processed_books['bronte']['text'].progress_apply\
    (lambda x: get_named_entities_spacy(x))

# Get proper nouns
processed_books['bronte']['proper_nouns'] = processed_books['bronte']['text'].progress_apply\
    (lambda x: get_proper_nouns_nltk(x))

# Get name frequencies and trim lists
for author in ['bronte']:
    # Compile entities
    entities[author] = {}
    for extraction_type in extraction_types:
        entities[author][extraction_type] = compile_entities(processed_books, author=author, \
                                                             extraction_type=extraction_type)
    # Combine entities 
    entities[author]['extraction_summary'] = combine_entities(entities, author=author, \
                                                              extraction_types=extraction_types)
    # Reduce entities
    main_entities[author] = reduce_entities(entities, author)
    # Remove locations   
    main_entities[author]['location'] = main_entities[author]['name'].apply(lambda x: test_for_location(x))
    to_drop = main_entities[author][main_entities[author]['location']==True].index
    main_entities[author].drop(to_drop, inplace = True)
    main_entities[author].reset_index(drop=True, inplace=True)
    # General cleanup    
    main_entities[author] = cleanup_entities(main_entities, author)
    # Final cut    
    reduced_entities[author] = final_cut_entities(main_entities, author)
    # Group entities
    merged_entities[author] = group_entities(reduced_entities, author)  
    # Extract final entities
    final_entities[author] = extract_final_entities(merged_entities, author)

### 3.9.2 Get Wikidata into the right format for lookup

In this step I will transform our Wikidata list of characters such that each index represents a 'main name' for the character and aliases are associated with that index.

In [None]:
# Read in the Wikidata characters from our database
wikidata = pd.read_sql_query("SELECT * FROM wikidata WHERE property = 'characters'", 'sqlite:///book_store.db')

In [None]:
# Transform the string format back to a list / tuple format
wikidata['data'] = wikidata['data'].apply(lambda x: [(name, alias.split('|')) \
if '|' in alias else (name, alias) for name, alias in [tuple(item.split('^'))  
if '^' in item else (item, 'None') for item in x.split(';')]])

Here we assemble the final Wikidata list so that ```index``` is the unique id for each character, ```key``` is the 'main' name assigned to the character and ```name``` is the list of all names associated with that character:

In [None]:
wikidata_chars = {}
for author in all_authors:
    wiki_char_list = wikidata.query('author == @author  and property == "characters"')['data'].tolist()
    wiki_char_df = pd.DataFrame(wiki_char_list[0], columns =['key', 'name'])
    # We're keeping the index here as our unique character key
    wiki_char_df.reset_index(inplace = True)
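    # Get the rows which have aliases (missing aliases were stored as the string 'None' above,
    # so notna() alone would not filter them out)
    aliases_df = wiki_char_df[wiki_char_df['name'].apply(lambda x: x != 'None')]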
    # Each name should be associated with itself as an alias for convenience when matching later
    wiki_char_df['name'] = wiki_char_df['key']
    # Group the main names and the aliases together
    wiki_char_df = pd.concat([wiki_char_df, aliases_df])
    wiki_char_df = wiki_char_df.explode('name')
    wikidata_chars[author] = wiki_char_df.sort_values('index').reset_index(drop= True)
    
# Check that the output is as expected
wikidata_chars['bronte'].head(10)
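
For readers who want to see the target layout without the database, the sketch below builds the same `index` / `key` / `name` structure from a single made-up string. The `;`, `^` and `|` separators are the ones used above; the input string is invented for illustration and is not an actual Wikidata record, and `parse_characters` is a hypothetical helper.

```python
import pandas as pd


def parse_characters(raw):
    """Turn 'Key^Alias1|Alias2;Key2' into (key, [key, alias1, alias2]) rows."""
    rows = []
    for item in raw.split(';'):
        key, _, aliases = item.partition('^')
        # Every character is listed under its own main name as well as its aliases
        names = [key] + (aliases.split('|') if aliases else [])
        rows.append((key, names))
    return rows


raw = 'Abel Magwitch^Provis;Pip^Handel|Mr. Pip;Estella'  # invented example
chars = pd.DataFrame(parse_characters(raw), columns=['key', 'name'])
# Keep the row number as the character id, then give each name its own row
chars = chars.reset_index().explode('name').reset_index(drop=True)
print(chars)
```

Each character keeps one `index` value across all of its rows, which is what allows a match on any alias to be traced back to a single character.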

There is one manual adjustment I'm going to make to this list: Wikidata lists a character _Alice's sister_, however this character is never actually named in the book and therefore I'm going to exclude it for purposes of our evaluation.

In [None]:
to_drop = wikidata_chars['carroll'][wikidata_chars['carroll']['key'] == "Alice's sister"].index
wikidata_chars['carroll'] = wikidata_chars['carroll'].drop(to_drop).reset_index(drop = True)

### 3.9.3 Match names found

In this step I match names found in Wikidata to names found in our final entities data:

In [None]:
for author in all_authors:
    keys_to_match = dict(zip(wikidata_chars[author]['name'], wikidata_chars[author]['key']))
    final_entities[author]['wiki_terms_key'] = final_entities[author]['key'].apply\
        (lambda x: '; '.join(list(set([key for name, key in keys_to_match.items() if re.search\
                                       (rf'\b{x}\b', name) or re.search(rf'\b{name}\b', x)]))))
    indices_to_match = dict(zip(wikidata_chars[author]['name'], wikidata_chars[author]['index']))
    final_entities[author]['wiki_terms_index'] = final_entities[author]['key'].apply\
        (lambda x: '; '.join(list(set([str(index) for name, index in indices_to_match.items() \
                                       if re.search(rf'\b{x}\b', name) or re.search(rf'\b{name}\b', x)]))))
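
The matching rule is two-directional: a Wikidata name matches if it contains our key as a whole word, or if our key contains the Wikidata name as a whole word. Below is a minimal sketch of that rule on toy values. Unlike the cell above, it passes names through `re.escape` so that characters such as `.` are matched literally. That is my own variation, and the helper name `match_wiki_keys` and the entity `Pipchin` are hypothetical.

```python
import re


def match_wiki_keys(entity_key, name_to_key):
    """Return the Wikidata keys whose names match entity_key in either direction."""
    entity_pattern = re.compile(rf'\b{re.escape(entity_key)}\b')
    hits = set()
    for name, key in name_to_key.items():
        name_pattern = re.compile(rf'\b{re.escape(name)}\b')
        if entity_pattern.search(name) or name_pattern.search(entity_key):
            hits.add(key)
    return sorted(hits)


toy_wiki = {'Estella Drummle': 'Estella Drummle', 'Pip': 'Pip'}

drummle_hits = match_wiki_keys('Drummle', toy_wiki)
mr_pip_hits = match_wiki_keys('Mr. Pip', toy_wiki)
pipchin_hits = match_wiki_keys('Pipchin', toy_wiki)
print(drummle_hits, mr_pip_hits, pipchin_hits)
```

The word boundaries stop _Pip_ from matching inside a longer word, while _Drummle_ still matches _Estella Drummle_, which is the source of the ambiguity discussed in the Dickens review below.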

### 3.9.4 Manual review of top 20 names and "precision"

Remember that precision answers the question '_Of the results we got, how many were valid?_'. As mentioned previously, this is by necessity a manual exercise, since we do not have exhaustive character lists from Wikidata. I will be limiting the character visualizations in the next section to only the top 20 per novel, so I'm going to assess how many of these top 20 were valid results. Top 20 refers to the top 20 'indexes' found so _Pip_ and _Mr. Pip_ will count as one name.

In evaluating if a name is valid or not, 2 aspects have to be considered - I have chosen to handle them as follows:

* Was the name identified a valid character name? (if not -1 from score)
* Was the name identified genuine, but not picked up as being an alias for another? (in this case -0.5 from score)

The method of scoring is rather arbitrary - it should only be viewed as a rough guide to evaluation.

In [None]:
fields_of_interest = ['name', 'key', 'index', 'proper_nouns_ind_totals', 'proper_nouns_grp_totals', 'wiki_terms_key', 'wiki_terms_index']

In [None]:
top_20 = {}
for author in all_authors:
    unique_indices_totals = final_entities[author][['index', 'proper_nouns_grp_totals']].drop_duplicates()
    top_indices = unique_indices_totals.nlargest(20, 'proper_nouns_grp_totals')['index'].tolist()
    top_20[author] = final_entities[author][final_entities[author]['index'].isin(top_indices)]
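
Because the ranking is done per index and not per name, a group contributes all of its variants once it makes the cut. The sketch below shows this on a toy table with invented totals, using a top 2 in place of a top 20. `top_groups` is a hypothetical helper that wraps the same three steps as the loop above.

```python
import pandas as pd

toy = pd.DataFrame({
    'name': ['Pip', 'Mr. Pip', 'Joe', 'Biddy', 'Sunday'],
    'index': [0, 0, 1, 2, 3],
    'proper_nouns_grp_totals': [312, 312, 500, 150, 40],  # invented totals
})


def top_groups(df, n):
    """Keep every row belonging to the n groups with the largest group totals."""
    unique_totals = df[['index', 'proper_nouns_grp_totals']].drop_duplicates()
    top_indices = unique_totals.nlargest(n, 'proper_nouns_grp_totals')['index']
    return df[df['index'].isin(top_indices)]


top_2 = top_groups(toy, 2)
print(top_2)
```

Here the top 2 groups produce three rows, since _Pip_ and _Mr. Pip_ share an index and count as one entry.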

#### Dickens top 20

By way of reminder: ```dickens``` had the smallest set of characters in Wikidata - below (check field ```wiki_terms_key```) we see which matches were found in our data:

In [None]:
print(f'''Number of unique Wikidata characters: {len(wikidata_chars['dickens']['key'].unique())}

Wikidata character list excluding aliases: 
{wikidata_chars['dickens']['key'].unique().tolist()}
''')
top_20['dickens'][fields_of_interest].sort_values('index', ascending = False)

#### Comparison to Wikidata

Notice the following:

* _Abel Magwitch_, although listed as a main character by Wikidata, is not found in our top 20. This highlights 2 issues. 1) We cannot match aliases: we have _Provis_, which is in fact an alias for _Magwitch_, but we have no way of knowing they are one and the same character. 2) Some characters loom large even if they are not frequently mentioned by name. At the beginning of __Great Expectations__ our hero _Pip_ has an encounter with _Magwitch_ which alters the course of his life, but in ways we only really understand later when _Magwitch_ returns to the narrative as _Provis_.
* Notice also, the name _Drummle_ is associated only with _Estella Drummle_ in Wikidata so there is some confusion on the matching between our data and Wikidata in this instance. Availability of a more comprehensive character list in Wikidata would have helped to alleviate this evaluation issue!

#### Precision of dickens top 20

* Most of the names in the top 20 are genuine names (even _Aged_ - there is actually a character referred to as _Aged P._ or _Aged Parent_ in the book!)
* _Sunday_ is the only name that is incorrect (-1)
* _Handel_ is one of Pip's nicknames (-0.5)


Our manual precision calculation is therefore:

$$ \frac{TP}{TP + FP} = \frac{18.5}{20} = 0.93$$ 

#### Carroll top 20

By way of reminder: ```carroll``` had quite a few unusual characters - below we see which matches were found in our data - surprisingly comprehensive:

In [None]:
print(f'''Number of unique Wikidata characters: {len(wikidata_chars['carroll']['key'].unique())}

Wikidata character list excluding aliases: 
{wikidata_chars['carroll']['key'].unique().tolist()}
''')
top_20['carroll'][fields_of_interest].sort_values('index', ascending = False)

#### Comparison to Wikidata

The results are surprisingly good, with similar variants of names being correctly grouped - except where we could not be sure how to understand relationships. For example it's disappointing the _Queen_ does not also include the variant _Queen of Hearts_, but as discussed above, this is as a result of the similar name _Knave of Hearts_ confusing matters.

Also (somewhat randomly!) _King_ has been extracted successfully as a separate character from the _Queen_ / _Queen of Hearts_ because in the book he is never referenced as _King of Hearts_ although that is what he is. Because of this no room for doubt was created.

#### Review of precision

* Most of the names in the top 20 are genuine names 
* _Come_ is the only name that shouldn't be in the list (-1)
* _Majesty_ is ambiguous: does it refer to the King or the Queen or both? We can't know without context (-0.5)
* _Footman_ is also ambiguous, there are in fact 2 footmen, the _Frog-Footman_ and the _Fish-Footman_ (-0.5)



Our manual precision calculation is therefore:

$$ \frac{TP}{TP + FP} = \frac{18}{20} = 0.90$$ 

#### Alcott top 20

By way of reminder: ```alcott``` was essentially about the _March_ family so everyone could have the surname _March_. Wikidata had an extremely comprehensive character list compared with Dickens. Below we see which matches were found in our data - where ```wiki_terms_index``` contains more than one item we cannot be sure if the character is a match or not:

In [None]:
print(f'''Number of unique Wikidata characters: {len(wikidata_chars['alcott']['key'].unique())}

Wikidata character list excluding aliases: 
{wikidata_chars['alcott']['key'].unique().tolist()}
''')
top_20['alcott'][fields_of_interest].sort_values('index', ascending = False)

#### Review of precision

* Most of the names in the top 20 are genuine names (_May_ is a real character not a month!)
* _Christmas_ and _Poor_ are the only names that shouldn't be in the list (-2)
* _Teddy_ and _Laurie_ are ambiguous - they are one and the same person (2 x -0.5)
* _Mrs. March_, _Mother_, _Marmee_ are ambiguous - they are one and the same person (3 x -0.5)


Our manual precision calculation is therefore:

$$ \frac{TP}{TP + FP} = \frac{15.5}{20} = 0.78$$ 

#### Bronte top 20

```bronte``` also has a small set of characters in Wikidata - below (check field ```wiki_terms_key```) we see which matches were found in our data:

In [None]:
print(f'''Number of unique Wikidata characters: {len(wikidata_chars['bronte']['key'].unique())}

Wikidata character list excluding aliases: 
{wikidata_chars['bronte']['key'].unique().tolist()}
''')
top_20['bronte'][fields_of_interest].sort_values('index', ascending = False)

#### Comparison to Wikidata

There were only 3 characters listed on Wikidata - we got them!

#### Precision of bronte top 20

* Most of the names in the top 20 are genuine names (including _St. John_, pronounced 'sinjin' when it is someone's name)
* _Did_, _Thornfield_ and _Lowood_ are the 3 names that are incorrect (-3)
* _God_ is debatable (-0.5)


Our manual precision calculation is therefore:

$$ \frac{TP}{TP + FP} = \frac{16.5}{20} = 0.83$$ 
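
The four manual scores can be reproduced with a few lines of arithmetic. The sketch below simply encodes the scoring rule stated at the start of this section (a full point off for an invalid name, half a point off for an ambiguous or unresolved one) and plugs in the deductions listed in each review. The function name `manual_precision` is my own.

```python
def manual_precision(n_full, n_half, n_results=20):
    """Score = results minus 1 per invalid name and 0.5 per ambiguous name."""
    score = n_results - 1.0 * n_full - 0.5 * n_half
    return score / n_results


# (full-point deductions, half-point deductions) taken from the reviews above
deductions = {
    'dickens': (1, 1),  # Sunday; Handel
    'carroll': (1, 2),  # Come; Majesty, Footman
    'alcott': (2, 5),   # Christmas, Poor; Teddy, Laurie, Mrs. March, Mother, Marmee
    'bronte': (3, 1),   # Did, Thornfield, Lowood; God
}

precision = {author: manual_precision(*d) for author, d in deductions.items()}
for author, value in precision.items():
    print(f'{author}: {value:.3f}')
```

The unrounded values are 0.925, 0.900, 0.775 and 0.825, which correspond to the 0.93, 0.90, 0.78 and 0.83 reported above.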

### 3.9.5 Recall

I'm going to look at recall from 2 points of view:

__1. Overall, of the results we should have obtained (Wikidata list), how many did we get?__

Recall is influenced by the number of items in the Wikidata lists. __Great Expectations__ actually has many more characters, but only the 6 very _main_ ones were given, whereas for __Little Women__ 61 characters were given, many of whom really are quite minor.

__2. In the Top 20, of the results we should have obtained (Wikidata list), how many did we get?__

This latter is not really 'recall' but gives us an indication of how many of the characters we wanted to get actually turned up in the Top 20.

Here I define a function to compile the results, then run it and store the output:

In [None]:
def compile_recall_data(final_entities_dict):
    recall_tracking = {}
    for author in all_authors:
        recall_tracking[author] = {}
        wiki_items = wikidata_chars[author]['key'].unique().tolist()
        found_items = final_entities_dict[author].loc[(final_entities_dict[author]['wiki_terms_key'] != '') & \
                     (~final_entities_dict[author]['wiki_terms_key'].str.contains(';')), \
                                                      'wiki_terms_key'].unique().tolist()
        other_items = final_entities_dict[author].loc[~final_entities_dict[author]['wiki_terms_key'].isin\
                                                      (found_items), 'key'].unique().tolist()
        recall_tracking[author]['wiki_items'] = len(wiki_items)
        recall_tracking[author]['found'] = len(found_items)
        recall_tracking[author]['not_found'] = len(list(set(wiki_items).difference(set(found_items))))
        recall_tracking[author]['recall'] = len(found_items) / len(wiki_items)
        recall_tracking[author]['found_items'] = found_items
        recall_tracking[author]['not_found_items'] = list(set(wiki_items).difference(set(found_items)))
        recall_tracking[author]['other_items'] = other_items
    return recall_tracking

In [None]:
overall_tracking = compile_recall_data(final_entities)
top20_tracking = compile_recall_data(top_20)
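
Stripped of the DataFrame handling, the recall figure is plain set arithmetic: the share of reference names that appear among our unambiguous matches. The sketch below shows that core calculation with placeholder names. `recall_summary` is a hypothetical helper and not part of the pipeline.

```python
def recall_summary(reference_names, matched_names):
    """Compare matched names against a reference list and report recall."""
    reference = set(reference_names)
    found = reference & set(matched_names)
    missed = reference - found
    return {
        'found': sorted(found),
        'not_found': sorted(missed),
        'recall': len(found) / len(reference),
    }


reference_list = ['Character A', 'Character B', 'Character C', 'Character D']
matched = ['Character A', 'Character C', 'Someone Else']  # placeholder values

summary = recall_summary(reference_list, matched)
print(summary)
```

Names we extracted that are not in the reference list (here `Someone Else`) do not affect recall at all, which is why precision had to be reviewed manually and separately.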

#### 3.9.5.1 Overall results

Below we see, of the results we should have obtained (Wikidata list), how many we got from the final list of characters (pie chart sizes are notionally relative to the number of characters that were listed in Wikidata):

In [None]:
# Pie chart sizes are notionally relative to the number of characters that were listed in Wikidata:
labels = ['found', 'not_found']
relative_wikidata_sizes = {'dickens': (3,3), 'carroll': (4, 4), 'alcott': (5, 5), 'bronte': (3,3)}
for author in all_authors:
    data = [overall_tracking[author]['found'], overall_tracking[author]['not_found']]
    fig1, ax = plt.subplots(figsize=relative_wikidata_sizes[author])

    ax.pie(x=data, labels=labels, autopct='%1.1f%%',
            shadow=True, startangle=90, colors = [author_colors[author], 'lightgrey'])
    plt.title(f'''recall for {author} - {overall_tracking[author]['found']} / \
{len(wikidata_chars[author]['key'].unique())} names found from wikidata list''', fontsize = 16)
    plt.show()
    
    print(f'''Found from Wikidata: 
    
    {sorted(overall_tracking[author]['found_items'])}

Not found from Wikidata:

    {sorted(overall_tracking[author]['not_found_items'])}''')

One can see above how imperfect the metric is, being heavily influenced by the level of detail available in Wikidata. Notwithstanding, I'm satisfied with these results: even in the case of ```alcott```, to find 29 of the given 61 names is acceptable, especially given that the 29 include all the characters that are usually top of mind for me as a human reader: _Jo_, _Meg_, _Amy_, _Beth_, _Laurie_, _Marmee_.

#### 3.9.5.2 Top 20 results

Below we see, of the results we should have obtained (Wikidata list), how many we got in the Top 20 list of characters:

In [None]:
# Pie chart sizes are notionally relative to the number of characters that were listed in Wikidata:
labels = ['in wikidata', 'not in wikidata']
relative_wikidata_sizes = {'dickens': (3,3), 'carroll': (4, 4), 'alcott': (5, 5), 'bronte': (3,3)}
for author in all_authors:
    data = [top20_tracking[author]['found'], 20 - top20_tracking[author]['found']]
    fig1, ax = plt.subplots(figsize=relative_wikidata_sizes[author])

    ax.pie(x=data, labels=labels, autopct=lambda x: '{:.0f}'.format(x*np.sum(data)/100),
            shadow=True, startangle=90, colors = [author_colors[author], 'lightgrey'])
    plt.title(f'''{author} - {top20_tracking[author]['found']} names found from wikidata list in top 20 names''', \
              fontsize = 16)
    plt.show()
    print(f'''Found from Wikidata: 
    
    {sorted(top20_tracking[author]['found_items'])}

Not from Wikidata:

    {sorted(top20_tracking[author]['other_items'])}''')

As seen in the Top 20 detail above, the set of names 'Not from Wikidata' contains a few that are not valid characters, but the majority of the names in the grey slices are valid even though Wikidata does not list them.

### 3.9.6 BookNLP comparison

At this point I would like to branch off just to show the Top 20 results obtained via the [BookNLP demo](https://colab.research.google.com/drive/1c9nlqGRbJ-FUP2QJe49h21hB4kUXdU_k?usp=sharing) on Google Colab. I ran the big model to extract the top 20 characters.

#### dickens

```
Id	Count	Name
302	1244	Herbert
245	1191	Miss Havisham
175	1170	Joe
289	974	 Wemmick
214	953	 Estella
265	839	 Mr. Jaggers
269	685	 Pip
204	602	 Biddy
191	410	 Mr. Pumblechook
184	348	 Mr. Wopsle
315	175	 Drummle
403	146	 Compeyson
396	132	 Provis
304	117	 Handel
275	115	 Trabb
307	115	 Mrs. Pocket
247	112	 Orlick
392	106	 Magwitch
230	97	  Mr. Pocket
226	68	  Camilla
```

_Observations_:

* Aliases are also not picked up by BookNLP: _Provis_ and _Magwitch_ have separate character IDs, as do _Pip_ and _Handel_, and we can't be sure about _Herbert_ and _Mr. Pocket_ either.

#### carroll

```
Id	Count	Name
21	691	  Alice
59	79	   Alice
27	50	   the Mouse
23	35	   the Rabbit
45	12	   the Caterpillar
51	11	   the Hatter
53	11	   the Dormouse
25	8	    Dinah
37	8	    the Dodo
52	8	    the March Hare
42	7	    W. RABBIT
48	7	    the Cheshire - Cat
54	7	    the Knave of Hearts
35	5	    the Duck
43	5	    the Rabbit's--"Pat ! Pat
28	4	    William the Conqueror
40	4	    The Duchess
41	4	    Mary Ann
46	4	    the Pigeon
57	4	    Two
```

_Observations_:

* There are some issues with the _White Rabbit_, and strangely _Alice_ is counted as 2 separate character IDs.

#### alcott

```
Id	Count	Name
134	6356	Jo
192	2627	Amy
163	1766	Laurie
136	1101	Laurie
223	1065	Beth
200	949	 Meg
149	702	 Mrs. March
467	467	 Mr. Bhaer
199	362	 John
283	300	 Laurie
327	238	 Mother
153	235	 Hannah
558	234	 Demi
154	208	 Aunt March
239	176	 Meg
292	144	 Fred
448	135	 Beth
328	95	  Beth
179	93	  Mr. March
290	82	  Miss Kate
```

_Observations_:

* Again we see some duplicated character names (for example 3 instances of _Beth_)
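
The duplication is easy to quantify from the pasted output. The sketch below re-enters the ```alcott``` rows from the table above as `(id, count, name)` tuples and counts how many character IDs share each name. Whether rows with the same name really refer to the same character is an assumption that would need checking against the text.

```python
from collections import Counter

# (id, count, name) rows copied from the alcott BookNLP output above
alcott_booknlp = [
    (134, 6356, 'Jo'), (192, 2627, 'Amy'), (163, 1766, 'Laurie'),
    (136, 1101, 'Laurie'), (223, 1065, 'Beth'), (200, 949, 'Meg'),
    (149, 702, 'Mrs. March'), (467, 467, 'Mr. Bhaer'), (199, 362, 'John'),
    (283, 300, 'Laurie'), (327, 238, 'Mother'), (153, 235, 'Hannah'),
    (558, 234, 'Demi'), (154, 208, 'Aunt March'), (239, 176, 'Meg'),
    (292, 144, 'Fred'), (448, 135, 'Beth'), (328, 95, 'Beth'),
    (179, 93, 'Mr. March'), (290, 82, 'Miss Kate'),
]

# Count how many separate character ids carry each name
ids_per_name = Counter(name for _, _, name in alcott_booknlp)
repeated_names = {name: n for name, n in ids_per_name.items() if n > 1}
print(repeated_names)
```

In this listing _Laurie_ and _Beth_ each appear under three IDs and _Meg_ under two, so 20 rows cover only 15 distinct names.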

#### bronte

```
Id	Count	Name
300	1917	Mr. Rochester
188	1202	St. John
155	819	 Jane
168	449	 Bessie
268	443	 Adèle
157	342	 Mrs. Fairfax
221	341	 Helen
308	324	 Miss Ingram
167	281	 Mrs. Reed
171	240	 Georgiana
242	235	 Mary
417	201	 Diana
209	167	 Miss Temple
306	136	 Mrs. Fairfax
204	134	 Mr. Brocklehurst
170	129	 John
271	125	 Grace Poole
349	115	 Mason
378	109	 Jane
163	107	 Reader
```

_Observations_:

* What we can see from the ```Count``` column in each case is that coreference resolution (which is part of the default pipeline) has increased the number of mentions per character considerably compared to my results.

Overall, having seen that, even with the latest technology, this is quite a hard problem to solve, I'm not displeased with the results obtained using my relatively naïve techniques!

In the final section I will reflect on some improvements that I believe could be made in future iterations. In the meantime I will proceed to the second part of this project: looking at some analyses and visualizations we could present to a human reader to complement their reading of the book.