# Extracting substrings of text from start and end characters


Let's say you've run spaCy or BookNLP and you've got a list of named entities along with their start and end characters (in the case of spaCy) or their start and end tokens (in the case of BookNLP).

How would you use those start and end positions to get some of the original text surrounding that named entity?

Let's look at an example using Arthur Conan Doyle's *The Adventures of Sherlock Holmes*:

## Define filepath

In [1]:
sherlock_holmes_text_filepath = '../_datasets/texts/literature/Arthur-Conan-Doyle-The-Adventures-of-Sherlock-Holmes.txt'

sherlock_holmes_text = open(sherlock_holmes_text_filepath, mode='r', encoding='utf-8').read()

## Example: what we want: the 200 characters before and the 200 characters after the word "Odessa"
BookNLP tells us that one of the first named entities in *The Adventures of Sherlock Holmes* is "Odessa", at the character start position of 2627 and the character end position 2633. Let's confirm that this is true by using the slice method `[]`:


In [4]:
sherlock_holmes_text[2627:2633]

'Odessa'
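
Why does the slice end at 2633 rather than 2632? A Python slice includes the start index but stops just before the end index, so the end position is one past the entity's last character. Here is a minimal sketch on a toy sentence (the sentence and variable names are stand-ins for illustration, not the full novel):

```python
# Toy sentence standing in for the full novel (an assumption for illustration)
text = "his summons to Odessa in the case of the Trepoff murder"

start = text.find("Odessa")    # index of the first character, "O"
end = start + len("Odessa")    # one past the last character, "a"

print(start, end)
print(text[start:end])         # the slice stops just before `end`
print(end - start)             # the difference is the entity's length
```

This is also why `2633 - 2627` gives 6, the number of letters in "Odessa".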

### Let's look at the 200 characters before and 200 characters after "Odessa":

In [5]:
print(sherlock_holmes_text[2627-200:2633+200])

n following out those clues, and clearing up those
mysteries which had been abandoned as hopeless by the official police.
From time to time I heard some vague account of his doings: of his
summons to Odessa in the case of the Trepoff murder, of his clearing up
of the singular tragedy of the Atkinson brothers at Trincomalee, and
finally of the mission which he had accomplished so delicately and
successfu


### How do we do this systematically?
To do this systematically, we need the start and end characters of each named entity. We also need the positions 200 characters before the start and 200 characters after the end, which mark the boundaries of our small section of context around the named entity.
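
One way to package that idea is a small helper function. The sketch below is only an illustration: the name `context_window` and its `window` parameter are hypothetical, and the sample sentence is a shortened stand-in for the novel. It keeps the window boundaries inside the text, which matters for entities that sit close to the beginning or end.

```python
def context_window(text, start, end, window=200):
    """Return the entity plus up to `window` characters on each side."""
    left = max(0, start - window)          # don't let the left edge go below 0
    right = min(len(text), end + window)   # don't let the right edge pass the end
    return text[left:right]

# Shortened stand-in for the novel (an assumption for illustration)
sample = ("From time to time I heard some vague account of his doings: "
          "of his summons to Odessa in the case of the Trepoff murder.")

start = sample.find("Odessa")
end = start + len("Odessa")

# Ask for 10 characters of context on each side of "Odessa"
print(context_window(sample, start, end, window=10))
```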

## Step 1: Generate NER data on geographic locations:

### Import our libraries

In [101]:
import pandas as pd
import booknlp
from spacy import displacy
from collections import Counter

In [33]:
import en_core_web_sm
nlp = en_core_web_sm.load()

## spaCy NER data
Unlike BookNLP, spaCy gives us the start and end characters of named entities, but doesn't automatically put the outputs into a `pandas` dataframe. We have to do that ourselves.

### Let's run spaCy on our Sherlock Holmes text

In [34]:

spacy_document = nlp(sherlock_holmes_text)

## Let's create a dataframe that includes only the "GPE" named entities, along with the start and end positions

In [81]:
# Create three empty lists, which will become the columns of our dataframe
named_entities = []
entity_start_characters = []
entity_end_characters = []


# loop over all the named entities in our spacy_document to extract spacy's data
for named_entity in spacy_document.ents:
    if named_entity.label_ == "GPE": # Look at just the geopolitical entities, labeled "GPE"
        # add the text of that entity to our "named_entities" list
        named_entities.append(named_entity.text)
        
        # add the start character of that entity to our "entity_start_characters" list
        entity_start_characters.append(named_entity.start_char)
        
        # add the end character of that entity to our "entity_end_characters" list
        entity_end_characters.append(named_entity.end_char)
    

In [82]:
# Zip together our lists so that each row holds one entity's text, start, and end (one convenient way to prepare lists for a dataframe)
zipped = list(zip(named_entities, entity_start_characters, entity_end_characters))

In [83]:
# Make a dataframe out of our zipped lists
sherlock_holmes_spacy_GPE_entities_df = pd.DataFrame(zipped, columns=['named_entities', 'entity_start_characters', 'entity_end_characters'])

In [102]:
# Let's peek inside our dataframe: 
sherlock_holmes_spacy_GPE_entities_df

Unnamed: 0,named_entities,entity_start_characters,entity_end_characters
0,Bohemia,93,100
1,Holmes,1750,1756
2,Holland,2864,2871
3,Holmes,4850,4856
4,London,5775,5781
5,Egria,9174,9179
6,Bohemia,9219,9226
7,Bohemia,9517,9524
8,Holmes,10437,10443
9,Hercules,11046,11054
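
Zipping is one route to a dataframe, but not the only one. As a hedged alternative, `pandas` can also build a dataframe from a dictionary that maps column names to lists. The sketch below uses the first three rows of the table above as toy data; the `entities_df` name and the `entity_length` column are additions for illustration only.

```python
import pandas as pd

# Toy values copied from the first three rows of the table above
named_entities = ["Bohemia", "Holmes", "Holland"]
entity_start_characters = [93, 1750, 2864]
entity_end_characters = [100, 1756, 2871]

# Each dictionary key becomes a column name; each list becomes that column
entities_df = pd.DataFrame({
    "named_entities": named_entities,
    "entity_start_characters": entity_start_characters,
    "entity_end_characters": entity_end_characters,
})

# Hypothetical extra column: end minus start should equal the entity's length
entities_df["entity_length"] = (
    entities_df["entity_end_characters"] - entities_df["entity_start_characters"]
)

print(entities_df)
```

Comparing `entity_length` with the number of letters in each name is a quick sanity check that the start and end columns line up.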


### Let's use the start and end characters to extract the strings of context around our spaCy GPE entities

In [92]:
# Create a list of pairs of start and end characters for each GPE named entity
spaCy_entityStartandEndLocations = list(sherlock_holmes_spacy_GPE_entities_df[['entity_start_characters', 'entity_end_characters']].itertuples(index=False, name=None))

# Define a variable that is the full text of Sherlock Holmes stories, read in as a string
    # We've defined this variable above, but I wanted to include it here so we know what
    # "sherlock_holmes_text" refers to
sherlock_holmes_text = open(sherlock_holmes_text_filepath, mode='r', encoding='utf-8').read()

# Create an empty list that we will populate with the contexts for NER locations
spaCy_contexts_for_NER_locations = []

# Loop over each of the start and end locations to produce a 400-character chunk of context
for start,end in spaCy_entityStartandEndLocations:
    # The next line of code will slice our Sherlock Holmes text file by
    # the start character positions, minus 200 
    # the end character positions, plus 200 
    # then add that text to our contexts list
    spaCy_contexts_for_NER_locations.append(sherlock_holmes_text[start-200:end+200])

### Print the contexts around spaCy named geopolitical locations:

In [93]:
spaCy_contexts_for_NER_locations

['',
 're disturbing than a strong emotion in a nature such as his. And\nyet there was but one woman to him, and that woman was the late Irene\nAdler, of dubious and questionable memory.\n\nI had seen little of Holmes lately. My marriage had drifted us away\nfrom each other. My own complete happiness, and the home-centred\ninterests which rise up around the man who first finds himself master\nof his own establishment',
 'der, of his clearing up\nof the singular tragedy of the Atkinson brothers at Trincomalee, and\nfinally of the mission which he had accomplished so delicately and\nsuccessfully for the reigning family of Holland. Beyond these signs of\nhis activity, however, which I merely shared with all the readers of\nthe daily press, I knew little of my former friend and companion.\n\nOne night—it was on the twentieth of Mar',
 'o harness.”\n\n“Then, how do you know?”\n\n“I see it, I deduce it. How do I know that you have been getting\nyourself very wet lately, and that you have a m
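
Notice that the first item in the list above is an empty string. The first entity, "Bohemia", starts at character 93, so `start-200` is -107, and Python reads a negative slice index as counting back from the end of the string. The slice therefore starts near the end of the text and stops at character 300, which selects nothing. A minimal sketch of the problem and one possible fix, using a dummy string in place of the novel (an assumption for illustration):

```python
text = "x" * 1000            # dummy stand-in for a long text
start, end = 93, 100         # the "Bohemia" offsets from the spaCy table

# Naive slice: start - 200 is -107, which counts back from the end of the string
naive = text[start - 200:end + 200]

# Clamped slice: never let the left edge drop below 0
clamped = text[max(0, start - 200):end + 200]

print(len(naive))
print(len(clamped))
```

The same clamp could be applied inside the loop above for any entity that appears within the first 200 characters.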

### Print the context around the fifth location, "London"

In [87]:
print(spaCy_contexts_for_NER_locations[4])

 of the sole in
order to remove crusted mud from it. Hence, you see, my double
deduction that you had been out in vile weather, and that you had a
particularly malignant boot-slitting specimen of the London slavey. As
to your practice, if a gentleman walks into my rooms smelling of
iodoform, with a black mark of nitrate of silver upon his right
forefinger, and a bulge on the right side of his top-hat to


## BookNLP NER on *The Adventures of Sherlock Holmes*


In [52]:
# Import BookNLP
from booknlp.booknlp import BookNLP

# Define the model parameters
# We can choose between the "big" or "small" model
# Below we are only using the named entity recognition pipeline ("entity")
# But there are other options that we might use, like quotations, or "coreference" resolution
model_params={
		"pipeline":"entity", 
		"model":"big"
	}
	
booknlp=BookNLP("en", model_params)

# Input file to process
input_file="../_datasets/texts/literature/Arthur-Conan-Doyle-The-Adventures-of-Sherlock-Holmes.txt"

# Output directory to store resulting files in
output_directory="sherlock_holmes/"

# Files within this directory will be named ${book_id}.entities, ${book_id}.tokens, etc.
book_id="adventures_of_sherlock_holmes"

{'pipeline': 'entity', 'model': 'big'}
--- startup: 2.603 seconds ---


In [53]:
booknlp.process(input_file, output_directory, book_id)

--- spacy: 25.151 seconds ---
--- entities: 223.445 seconds ---
--- quotes: 0.166 seconds ---
--- name coref: 0.483 seconds ---
--- TOTAL (excl. startup): 252.697 seconds ---, 127661 words


### Let's read in the tagged entities BookNLP output as a DataFrame

In [7]:
# Let's read in the tagged entities BookNLP output as a DataFrame
sherlock_holmes_entities_df = pd.read_csv('../_week9/sherlock_holmes/adventures_of_sherlock_holmes.entities', delimiter='\t')

In [8]:
# Let's peek inside the dataframe
sherlock_holmes_entities_df

Unnamed: 0,COREF,start_token,end_token,prop,cat,text
0,240,3,4,PROP,PER,Sherlock Holmes
1,241,6,8,PROP,PER,Arthur Conan Doyle
2,1,14,15,PROP,FAC,Bohemia II
3,-1,17,22,NOM,FAC,The Red - Headed League III
4,2,35,39,PROP,VEH,The Five Orange Pips VI
...,...,...,...,...,...,...
18063,-1,127638,127638,PRON,PER,she
18064,-1,127644,127648,NOM,FAC,a private school at Walsall
18065,239,127648,127648,PROP,GPE,Walsall
18066,0,127651,127651,PRON,PER,I


### Create a new dataframe that includes only the "GPE" named entities

In [20]:
sherlock_holmes_GPE_entities_df = sherlock_holmes_entities_df[sherlock_holmes_entities_df['cat'] == 'GPE']
sherlock_holmes_GPE_entities_df

Unnamed: 0,COREF,start_token,end_token,prop,cat,text
9,4,99,99,PROP,GPE,BOHEMIA
62,6,509,509,PROP,GPE,Odessa
65,7,531,531,PROP,GPE,Trincomalee
68,8,551,551,PROP,GPE,Holland
83,9,657,657,PROP,GPE,Scarlet
174,10,1214,1214,PROP,GPE,London
229,11,1638,1638,PROP,GPE,Europe
264,-1,2025,2029,NOM,GPE,a German - speaking country
265,4,2032,2032,PROP,GPE,Bohemia
266,12,2037,2037,PROP,GPE,Carlsbad


How many rows?

In [96]:
len(sherlock_holmes_GPE_entities_df)

420

### Getting the start and end characters from the start/end tokens for our named entities
BookNLP gives us the start and end TOKENS, rather than the start and end CHARACTERS. We need characters to slice up our text file as a string, so we'll have to get those start and end characters (what BookNLP calls the "byte_onset" and "byte_offset") from a different set of BookNLP output data: the .tokens file.

In [46]:
# Let's read in the tokens BookNLP output as a DataFrame
# The tokens file contains information about each token in the text file,
# including token ID and the index position in the original text (per https://github.com/booknlp/booknlp#usage)
# NOTE: there is a " in one line. To make sure pandas doesn't mistake that
# quotation mark for a quoting character, we use the parameter "quoting=3" (see the pandas documentation for read_csv)
sherlock_holmes_tokens_df = pd.read_csv('sherlock_holmes/adventures_of_sherlock_holmes.tokens', encoding='utf-8', engine="python", delimiter='\t', quoting=3)

In [49]:
sherlock_holmes_tokens_df

Unnamed: 0,paragraph_ID,sentence_ID,token_ID_within_sentence,token_ID_within_document,word,lemma,byte_onset,byte_offset,POS_tag,fine_POS_tag,dependency_relation,syntactic_head_ID,event
0,0,0,0,0,The,the,1,4,DET,DT,det,1,O
1,0,0,1,1,Adventures,Adventures,5,15,PROPN,NNPS,nsubj,-1,O
2,0,0,2,2,of,of,16,18,ADP,IN,prep,1,O
3,0,0,3,3,Sherlock,Sherlock,19,27,PROPN,NNP,compound,4,O
4,0,0,4,4,Holmes,Holmes,28,34,PROPN,NNP,pobj,2,O
...,...,...,...,...,...,...,...,...,...,...,...,...,...
127656,2545,6644,55,127656,met,meet,562169,562172,VERB,VBN,ccomp,127652,O
127657,2545,6644,56,127657,with,with,562173,562177,ADP,IN,prep,127656,O
127658,2545,6644,57,127658,considerable,considerable,562178,562190,ADJ,JJ,amod,127659,O
127659,2545,6644,58,127659,success,success,562191,562198,NOUN,NN,pobj,127657,O
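
Before doing this with dataframes, it may help to see the idea in plain Python. The sketch below copies the first five token rows from the table above into a dictionary and looks up the character span for "Sherlock Holmes" (tokens 3 through 4 in the entities table). The toy `text` is an assumption, laid out so that it matches those offsets; it also assumes, as the earlier "Odessa" check suggests, that the onset and offset values index characters of the Python string.

```python
# First five rows of the tokens table: token ID -> (byte_onset, byte_offset)
token_offsets = {
    0: (1, 4),     # The
    1: (5, 15),    # Adventures
    2: (16, 18),   # of
    3: (19, 27),   # Sherlock
    4: (28, 34),   # Holmes
}

# Toy text laid out to match those offsets (note the leading newline at position 0)
text = "\nThe Adventures of Sherlock Holmes"

def entity_char_span(start_token, end_token):
    """Onset comes from the first token, offset from the last token."""
    start_char = token_offsets[start_token][0]
    end_char = token_offsets[end_token][1]
    return start_char, end_char

start_char, end_char = entity_char_span(3, 4)
print(text[start_char:end_char])
```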


### Let's create smaller dataframes that include only the token ID and the byte onset or byte offset

Here, instead of filtering for values in a dataframe, we're creating a new dataframe with just a few of the columns from the larger tokens dataframe (since we don't need most of this information)

In [98]:
sherlock_holmes_tokens_byte_onset_df = sherlock_holmes_tokens_df[['token_ID_within_document', 'byte_onset']]

sherlock_holmes_tokens_byte_offset_df = sherlock_holmes_tokens_df[['token_ID_within_document', 'byte_offset']]


In [99]:
#Let's take a peek inside the byte onset dataframe
sherlock_holmes_tokens_byte_onset_df

Unnamed: 0,token_ID_within_document,byte_onset
0,0,1
1,1,5
2,2,16
3,3,19
4,4,28
...,...,...
127656,127656,562169
127657,127657,562173
127658,127658,562178
127659,127659,562191


## Merging our dataframes
We're going to merge our `sherlock_holmes_tokens_byte_onset_df` dataframe with our `sherlock_holmes_GPE_entities_df` dataframe. We can do this because each token has a unique ID: we can match the "start_token" column in our entities dataframe with the "token_ID_within_document" column in `sherlock_holmes_tokens_byte_onset_df`.

Notice that the tokens dataframe is much larger (127661 rows) than our GPE entities dataframe (only 420 rows). Because the merge keeps only rows that have a match in both dataframes, a successful merge should leave us with exactly 420 rows.

In [52]:
# Merge the byte_onset dataframe with our GPE entities dataframe using `merge()`
sherlock_holmes_merged_byte_onset = pd.merge(sherlock_holmes_GPE_entities_df, sherlock_holmes_tokens_byte_onset_df.set_index('token_ID_within_document'), left_on='start_token', right_index=True)
sherlock_holmes_merged_byte_onset

Unnamed: 0,COREF,start_token,end_token,prop,cat,text,byte_onset
9,4,99,99,PROP,GPE,BOHEMIA,566
62,6,509,509,PROP,GPE,Odessa,2627
65,7,531,531,PROP,GPE,Trincomalee,2740
68,8,551,551,PROP,GPE,Holland,2864
83,9,657,657,PROP,GPE,Scarlet,3350
174,10,1214,1214,PROP,GPE,London,5775
229,11,1638,1638,PROP,GPE,Europe,7638
264,-1,2025,2029,NOM,GPE,a German - speaking country,9190
265,4,2032,2032,PROP,GPE,Bohemia,9219
266,12,2037,2037,PROP,GPE,Carlsbad,9241


### Because some of these named entities, like "New Jersey", are more than one token long, we're going to need to get the ending character (the "byte offset") from the *end* token, rather than from the start token:

In [53]:
# Merge the merged dataframe a SECOND time (this time, with the byte_offset data) using `merge()`
sherlock_holmes_merged_entities_df = pd.merge(sherlock_holmes_merged_byte_onset, sherlock_holmes_tokens_byte_offset_df.set_index('token_ID_within_document'), left_on='end_token', right_index=True)

In [54]:
sherlock_holmes_merged_entities_df

Unnamed: 0,COREF,start_token,end_token,prop,cat,text,byte_onset,byte_offset
9,4,99,99,PROP,GPE,BOHEMIA,566,573
62,6,509,509,PROP,GPE,Odessa,2627,2633
65,7,531,531,PROP,GPE,Trincomalee,2740,2751
68,8,551,551,PROP,GPE,Holland,2864,2871
83,9,657,657,PROP,GPE,Scarlet,3350,3357
174,10,1214,1214,PROP,GPE,London,5775,5781
229,11,1638,1638,PROP,GPE,Europe,7638,7644
264,-1,2025,2029,NOM,GPE,a German - speaking country,9190,9215
265,4,2032,2032,PROP,GPE,Bohemia,9219,9226
266,12,2037,2037,PROP,GPE,Carlsbad,9241,9249
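
As a possible alternative to two merges, the same lookup can be sketched with `Series.map`, which replaces each token ID with the value stored under that ID in an indexed Series. The toy dataframes below reuse the first five token rows shown earlier; the first entity row comes from the entities table, and the second ("Holmes" on its own) is made up for illustration.

```python
import pandas as pd

# First five rows of the tokens table
tokens_df = pd.DataFrame({
    "token_ID_within_document": [0, 1, 2, 3, 4],
    "byte_onset": [1, 5, 16, 19, 28],
    "byte_offset": [4, 15, 18, 27, 34],
})

# Toy entity rows: the second one is hypothetical
entities_df = pd.DataFrame({
    "text": ["Sherlock Holmes", "Holmes"],
    "start_token": [3, 4],
    "end_token": [4, 4],
})

# Index the token data by token ID so it can serve as a lookup table
offsets = tokens_df.set_index("token_ID_within_document")

# Onset comes from the start token, offset from the end token
entities_df["byte_onset"] = entities_df["start_token"].map(offsets["byte_onset"])
entities_df["byte_offset"] = entities_df["end_token"].map(offsets["byte_offset"])

# Any token ID without a match would show up as a missing value
assert entities_df[["byte_onset", "byte_offset"]].notna().all().all()

print(entities_df)
```

Unlike a merge, `map` never adds or drops rows, so the entities dataframe keeps its original length.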


In [97]:
len(sherlock_holmes_merged_entities_df)

420

### Let's use the start and end characters to extract the strings of context around our BookNLP GPE entities

In [94]:
# Create a list of pairs of start and end characters for each named entity
BookNLP_entityStartandEndLocations = list(sherlock_holmes_merged_entities_df[['byte_onset', 'byte_offset']].itertuples(index=False, name=None))

# Define a variable that is the full text of Sherlock Holmes stories, read in as a string
    # We've defined this variable (twice!) above; but I've included it so we know what "sherlock_holmes_text" refers to
sherlock_holmes_text = open(sherlock_holmes_text_filepath, mode='r', encoding='utf-8').read()

# Create an empty list that we will populate with the contexts for NER locations
BookNLP_contexts_for_NER_locations = []

# Loop over each of the start and end locations to produce a 400-character chunk of context
for start,end in BookNLP_entityStartandEndLocations:
    # The next line of code will slice our Sherlock Holmes text file by
    # the start character positions, minus 200 
    # the end character positions, plus 200 
    # then add that text to our contexts list
    BookNLP_contexts_for_NER_locations.append(sherlock_holmes_text[start-200:end+200])

In [95]:
BookNLP_contexts_for_NER_locations

['.    The Adventure of the Engineer’s Thumb\n   X.     The Adventure of the Noble Bachelor\n   XI.    The Adventure of the Beryl Coronet\n   XII.   The Adventure of the Copper Beeches\n\n\n\n\nI. A SCANDAL IN BOHEMIA\n\n\nI.\n\nTo Sherlock Holmes she is always _the_ woman. I have seldom heard him\nmention her under any other name. In his eyes she eclipses and\npredominates the whole of her sex. It was not that he felt a',
 'n following out those clues, and clearing up those\nmysteries which had been abandoned as hopeless by the official police.\nFrom time to time I heard some vague account of his doings: of his\nsummons to Odessa in the case of the Trepoff murder, of his clearing up\nof the singular tragedy of the Atkinson brothers at Trincomalee, and\nfinally of the mission which he had accomplished so delicately and\nsuccessfu',
 ' police.\nFrom time to time I heard some vague account of his doings: of his\nsummons to Odessa in the case of the Trepoff murder, of his clearing up\nof 

In [71]:
#Let's print the context around the second word in our list, "Odessa"
print(BookNLP_contexts_for_NER_locations[1])

n following out those clues, and clearing up those
mysteries which had been abandoned as hopeless by the official police.
From time to time I heard some vague account of his doings: of his
summons to Odessa in the case of the Trepoff murder, of his clearing up
of the singular tragedy of the Atkinson brothers at Trincomalee, and
finally of the mission which he had accomplished so delicately and
successfu


## To write output to a single text file

In [255]:
output_file = open('booknlp-contexts.txt', mode='w', encoding='utf-8')

for context in BookNLP_contexts_for_NER_locations:
     output_file.write(context)
     output_file.write('\n')
output_file.close()
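
Each context already contains line breaks, so a single `'\n'` between contexts can make it hard to tell where one ends and the next begins. One hedged option is to join the contexts with a more distinctive separator. In the sketch below the separator string, the toy contexts, and the temporary output location are all assumptions for illustration.

```python
import tempfile
from pathlib import Path

# Toy contexts that contain their own line breaks (an assumption)
contexts = ["first line\nsecond line", "another\ncontext"]

# Hypothetical separator: choose something that cannot occur in the text itself
separator = "\n-----\n"

# Write to a temporary directory so the example does not touch real files
output_path = Path(tempfile.mkdtemp()) / "contexts.txt"
with open(output_path, mode="w", encoding="utf-8") as output_file:
    output_file.write(separator.join(contexts))

# Reading the file back and splitting on the separator recovers the list
recovered = output_path.read_text(encoding="utf-8").split(separator)
print(recovered == contexts)
```

The `with` block also closes the file automatically, so no separate `close()` call is needed.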

## To write output to multiple text files in a new directory

In [128]:
# To output our context for GPE-tagged words as a series of new files with the same beginning, followed by the number of the section

#Import pathlib 
from pathlib import Path

# Define and name the new output directory using pathlib
path = Path('BookNLP_GPE_contexts/')
path.mkdir(exist_ok=True)

# Set the prefix for our output files, followed by the number of the section
begining_of_output_filenames = 'Sherlock_Holmes_BookNLP_GPE_contexts-'

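# Iterate over each of the chunks of context for BookNLP NER, numbering the files from 1
for i, context in enumerate(BookNLP_contexts_for_NER_locations, start=1):
    output_filepath = path / (begining_of_output_filenames + str(i) + '.txt')
    # Use a "with" block so each file is closed, and write as UTF-8 to match how we read the text
    with open(output_filepath, mode='w', encoding='utf-8') as output_file:
        output_file.write(context)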
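
With 420 output files, plain numbers sort awkwardly in a file browser (file 10 comes before file 2). A possible refinement is to zero-pad the number in each filename. The sketch below uses toy contexts, a temporary directory, and a hypothetical `context-` prefix; none of these names come from the notebook above.

```python
import tempfile
from pathlib import Path

contexts = ["context one", "context two", "context three"]   # toy data

# Hypothetical output directory inside a temporary folder
output_dir = Path(tempfile.mkdtemp()) / "GPE_contexts"
output_dir.mkdir(exist_ok=True)

width = 3   # three digits are enough for up to 999 files
for i, context in enumerate(contexts, start=1):
    # f"{i:03d}" turns 1 into "001", 2 into "002", and so on
    filename = f"context-{i:0{width}d}.txt"
    (output_dir / filename).write_text(context, encoding="utf-8")

print(sorted(p.name for p in output_dir.iterdir()))
```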