This week's task is to run some taggers on the Audible data given to us by Dan. We have two sets of files. The first is an unlabeled dataset that contains approximately 25% of the text from 1000 books. The second consists of 10 books from the first dataset, stored as XML files that are lemmatized and labeled with parts of speech, word chunks (e.g. verb phrases, noun phrases), and the basic entities (People, Organizations, and Locations). Let's do some exploration of these files.

First, let's find out how many individual words are given entity tags, and what type of tags they are:

In [35]:
import lxml.etree as etree
from bs4 import BeautifulSoup
import glob

files = glob.glob('data/audible/processedText/*.xml')
entities = {'P': 0, 'L':0, 'O':0, 'D':0}
total_tokens = 0
for f in files:
    with open(f) as fh:
        soup = BeautifulSoup(fh.read(), 'lxml')
    for w in soup.find_all('w'):
        total_tokens += 1  # every <w> element is one token
        if 'ner' in w.attrs:
            # the first character of the ner attribute is the entity type (P, L, O)
            entities[w.attrs['ner'][0]] += 1

In [36]:
for i in entities.keys():
    print('There are ' + str(entities[i]) + ' instances of ' + str(i))
print('***** ' + str(total_tokens) + ' tokens ******')

There are 9057 instances of P
There are 1407 instances of L
There are 391 instances of O
There are 0 instances of D
***** 272320 tokens ******


In the labeled data, we see that there are just over 270k individual tokens, and nearly 11k of them are labeled as entities. The majority of these entities (~9k) are people, with a few location tags, and a small number of Organizations. Dates are not labeled in these data, so we'll exclude them from our experimentation.
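As a quick sanity check on those proportions, the counts printed above can be turned into shares directly. This is only arithmetic on the numbers already shown; the variable names `entity_share` and `type_share` are mine.

```python
# Counts copied from the output above (D is omitted because it is zero).
entities = {'P': 9057, 'L': 1407, 'O': 391}
total_tokens = 272320

total_entity_tokens = sum(entities.values())

# Share of all tokens that carry an entity tag.
entity_share = total_entity_tokens / float(total_tokens)

# Share of entity-tagged tokens that belong to each type.
type_share = {k: v / float(total_entity_tokens) for k, v in entities.items()}

print(total_entity_tokens)
print(round(100 * entity_share, 1))
for k in sorted(type_share):
    print(k, round(100 * type_share[k], 1))
```

So roughly 4% of the tokens carry an entity tag, and about 83% of those tagged tokens are people.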

One thing to note is that this method counts tokens rather than entities, so an entity that spans several tokens is counted once for every token it contains. For instance:

In [142]:
files = glob.glob('data/audible/processedText/*.xml')
total_tokens = 0
unique_entities = {}
for f in files:
    with open(f) as fh:
        soup = BeautifulSoup(fh.read(), 'lxml')
    file_id = f[-14:-4]  # the 10 characters before '.xml'
    for tag in soup.find_all(lambda tag: 'ner' in tag.attrs):
        # key = entity id + file id; value = number of tokens in that entity
        key = tag['ner'] + '_' + file_id
        unique_entities[key] = unique_entities.get(key, 0) + 1

In [180]:
import operator
# the entity made up of the most tokens
longest_entity = max(unique_entities.items(), key=operator.itemgetter(1))
ent_id, file_id = longest_entity[0].split('_')
with open('data/audible/processedText/' + file_id + '.xml') as fh:
    soup = BeautifulSoup(fh.read(), 'lxml')
entity = soup.find_all(ner=ent_id)
print('The entity that takes up the largest number of strings is:')
print(' '.join(i.text for i in entity))
print('\n')
print('*****************')
print('The entire sentence that contains this entity is:')
print(' '.join(entity[0].parent.parent.text.split()))

The entity that takes up the largest number of strings is:
Harmonic Field of Glass Bells and Green Gig


*****************
The entire sentence that contains this entity is:
We might add here that later on the constructors had an article published in a prominent scientific journal under the title of ' Recursive β – Metafunctions in the Special Case of a Bogus Polypolice Transmogrification Conversion on an Oscillating Harmonic Field of Glass Bells and Green Gig , Kerosene Lamp on the Left to Divert Attention , Solved by Beastly Incarceration – Concatenation ' , which was subsequently exploited by the tabloids as ' The Police State Rears Its Ugly Head ' .


Under the first counting method, this entity would have been counted as 8 distinct entities, whereas it's meant to be a single one. Interestingly, this example is also mislabeled: it's tagged as an organization, but it's actually part of a title (and would make a great band name!). Regardless, this highlights why a standard scoring approach that treats each token independently won't work so well.
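To make the difference between the two counts concrete, here is a minimal sketch that reports both tokens per type and distinct entities per type. It assumes, as the lookup above suggests, that all tokens of one entity share the same `ner` value. The sample XML is a made-up fragment (its layout is my assumption, not the real file format), and I use the standard library's `xml.etree.ElementTree` instead of BeautifulSoup so the snippet runs on its own.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical fragment; only the <w>/<c> tags and the ner attribute mirror the data.
SAMPLE = """<doc><s>
<w ner="O37">Keeper</w> <w ner="O37">of</w> <w ner="O37">the</w>
<w ner="O37">Royal</w> <w ner="O37">Seal</w>
<w>met</w> <w ner="P485">Trurl</w> <c>.</c>
</s></doc>"""

def count_entities(xml_text):
    """Return (tokens per type, distinct entities per type)."""
    root = ET.fromstring(xml_text)
    token_counts = Counter()
    entity_ids = set()
    for w in root.iter('w'):
        ner = w.get('ner')
        if ner:
            token_counts[ner[0]] += 1  # first character is the type
            entity_ids.add(ner)        # the full value identifies the entity
    entity_counts = Counter(ner[0] for ner in entity_ids)
    return token_counts, entity_counts

token_counts, entity_counts = count_entities(SAMPLE)
print(dict(token_counts))
print(dict(entity_counts))
```

In this fragment the organization accounts for five tokens but only one entity. Across several files the key would also need the file id, as in the cell above.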

# Spacy Entities

Below, I explore how spaCy handles entities in this text. I've obtained a Python wrapper for the CoNLL scoring script from [here](https://github.com/mesnilgr/is13). The Perl CoNLL scoring script itself is available [here](http://www.cnts.ua.ac.be/conll2002/ner/bin/conlleval.txt). One difficulty is that the spaCy tokenizer works a bit differently from the tokenizer used on the Audible data. Below, I print out a sample of one of the documents with the NER tags. The printout is in the following format:


|spacy_text   |spacy_tag   |******   |audible_text   |audible_tag   |
|---|---|--:|---|---|
|disemboweled   |O-   |******   |disemboweled   |O   |
|,   |O-   |******   |,   |O   |
|buried   |O-   |******   |buried   |O   |


In [196]:
import conll_scoring
import spacy
nlp = spacy.load('en')

In [277]:
words = soup.find_all(['w', 'c'])
doc = nlp(' '.join(w.text for w in words))
start = 175
subdoc = doc[start:225]
for i, w in enumerate(subdoc):
    # fall back to 'O' (outside) when the Audible token has no ner attribute
    audible_ner = words[i+start].attrs.get('ner', 'O')
    print(w.text + '\t' + w.ent_iob_ + '-' + w.ent_type_ + '\t******  ' + words[i+start].text + '\t' + audible_ner)

disemboweled	O-	******  disemboweled	O
,	O-	******  ,	O
buried	O-	******  buried	O
alive	O-	******  alive	O
,	O-	******  ,	O
crucified	O-	******  crucified	O
and	O-	******  and	O
burnt	O-	******  burnt	O
at	O-	******  at	O
the	O-	******  the	O
stake	O-	******  stake	O
,	O-	******  ,	O
after	O-	******  after	O
which	O-	******  which	O
your	O-	******  your	O
ashes	O-	******  ashes	O
shall	O-	******  shall	O
be	O-	******  be	O
sent	O-	******  sent	O
into	O-	******  into	O
orbit	O-	******  orbit	O
as	O-	******  as	O
a	O-	******  a	O
and	O-	******  and	O
perpetual	O-	******  perpetual	O
reminder	O-	******  reminder	O
to	O-	******  to	O
all	O-	******  all	O
would	O-	******  would-be	O
-	O-	******  regicides	O
be	O-	******  ,	O
regicides	O-	******  amen	O
,	O-	******  .	O
amen	O-	******  '	O
.	O-	******  '	O
'	O-	******  Ca	O
'	O-	******  n't	O
Ca	O-	******  you	O
n't	O-	******  wait	O
you	O-	******  a	O
wait	O-	******  bit	O
a	O-	******  ?	O
bit	O-	******  '	O
?	O-	******  asked	O
'	O-	*****

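One can see that the compound word would-be (about halfway down) is treated differently by the two systems: spaCy splits it into three tokens (would, -, be), while the Audible data keeps it as a single token. Every row after it is therefore shifted by two positions, which makes lining up the annotations a bit of a challenge.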
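One way to think about the alignment problem is as a grouping task: for each Audible token, collect the run of spaCy tokens that spell it out. The sketch below works on plain lists of strings. It assumes the finer tokenization only ever splits the coarser tokens and never merges them; `align_tokens` is a hypothetical helper, not part of spaCy.

```python
def align_tokens(coarse, fine):
    """For each coarse token, return the indices of the fine tokens that compose it."""
    groups = []
    j = 0
    for tok in coarse:
        group, built = [], ''
        # Keep consuming fine tokens until their concatenation equals the coarse token.
        while built != tok:
            if j >= len(fine) or not tok.startswith(built + fine[j]):
                raise ValueError('cannot align %r at fine index %d' % (tok, j))
            built += fine[j]
            group.append(j)
            j += 1
        groups.append(group)
    return groups

audible = ['all', 'would-be', 'regicides', ',', 'amen']
spacy_like = ['all', 'would', '-', 'be', 'regicides', ',', 'amen']
groups = align_tokens(audible, spacy_like)
print(groups)
```

Each inner list can then be used to pool the spaCy tags for one Audible token. If the assumption fails, the function raises an error instead of silently drifting out of step.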

In [330]:
# ugh, this is ugly.
def condense_tokens(spacy_doc, audible_words, position, offset):
    """Merge the spaCy tokens that together cover one Audible token."""
    word = audible_words[position+offset[1]]
    spacy_tokens = nlp(word.text)  # how many tokens spaCy makes of this word
    begin = position + offset[0]
    span = spacy_doc[begin:begin+len(spacy_tokens)]
    spacy_text = ''.join(j.text for j in span)
    spacy_tag = ''.join(j.ent_iob_ for j in span) + '-' + ''.join(j.ent_type_ for j in span)
    audible_text = word.text
    audible_tag = word.attrs.get('ner', 'O')  # 'O' matches the untagged label used in the main loop

    offset[0] += len(spacy_tokens)-1  # spaCy is now this many extra tokens ahead
    return(spacy_text, spacy_tag, audible_text, audible_tag, offset)

words = soup.find_all(['w', 'c'])
doc = nlp(' '.join(w.text for w in words))
offset = [0, 0] #spacy, audible
subdoc = doc #for testing
spacy_text = []
spacy_tag = []
audible_text = []
audible_tag = []
for i, w in enumerate(subdoc):
    if i+offset[1] == len(words):
        break
    word = words[i+offset[1]]
    spacy_tokens = nlp(word.text) #check to see if the tokenizations match

    if len(spacy_tokens)==1: #if they match
        token = subdoc[i+offset[0]]
        spacy_text.append(token.text)
        spacy_tag.append(token.ent_iob_ + '-' + token.ent_type_)
        audible_text.append(word.text)
        audible_tag.append(word.attrs.get('ner', 'O')) #'O' when the token is untagged

    if len(spacy_tokens)>1: #if spacy splits it up into more tokens
        stext, stag, atext, atag, offset = condense_tokens(doc, words, i, offset)
        spacy_text.append(stext)
        spacy_tag.append(stag)
        audible_text.append(atext)
        audible_tag.append(atag)

In [338]:
import pandas as pd
df = pd.DataFrame({'spacy_text':spacy_text,
             'spacy_tag':spacy_tag,
             'audible_text':audible_text,
             'audible_tag':audible_tag})
df.spacy_tag.value_counts()

O-                                      28961
B-PERSON                                  329
I-ORG                                     137
B-CARDINAL                                134
OOO-                                      118
B-ORG                                     115
B-GPE                                      89
OO-                                        46
B-ORDINAL                                  44
I-DATE                                     33
B-NORP                                     29
I-PERSON                                   27
I-TIME                                     23
I-CARDINAL                                 21
B-DATE                                     21
I-MONEY                                    17
I-FAC                                      15
B-TIME                                     11
I-WORK_OF_ART                              10
B-FAC                                       9
B-PRODUCT                                   9
I-EVENT                           

In [260]:
sentences[0].find_all(['w', 'c'])

[<w idx="28305" lem="finally" loc="167433" pos="RB">Finally</w>,
 <c idx="28306">,</c>,
 <w idx="28307" lem="the" loc="167442" pos="DT">the</w>,
 <w idx="28308" lem="order" loc="167446" pos="NN">order</w>,
 <w idx="28309" lem="be" loc="167452" pos="VBD">was</w>,
 <w idx="28310" lem="issue" loc="167456" pos="VBN">issued</w>,
 <w idx="28311" lem="that" loc="167463" pos="IN">that</w>,
 <w idx="28312" lem="anyone" loc="167468" pos="NN">anyone</w>,
 <w idx="28313" lem="resemble" loc="167475" pos="VBG">resembling</w>,
 <w idx="28314" lem="a" loc="167486" pos="DT">a</w>,
 <w idx="28315" lem="policeman" loc="167488" pos="NN">policeman</w>,
 <w idx="28316" lem="be" loc="167498" pos="VBD">was</w>,
 <w idx="28317" lem="to" loc="167502" pos="TO">to</w>,
 <w idx="28318" lem="be" loc="167505" pos="VB">be</w>,
 <w idx="28319" lem="detain" loc="167508" pos="VBN">detained</w>,
 <w idx="28320" lem="and" loc="167517" pos="CC">and</w>,
 <w idx="28321" lem="hold" loc="167521" pos="VBN">held</w>,
 <w idx="2

In [206]:
words = soup.find_all(['w', 'c'])
' '.join(w.text for w in words)



In [207]:
doc = nlp(' '.join(w.text for w in words))

In [241]:
subdoc = doc[0:500]
for i, w in enumerate(subdoc):
    if 'ner' in words[i].attrs:
        print(w.text + '    ' + w.ent_iob_ + '-' + w.ent_type_ + '******' + words[i].text + words[i].attrs['ner'])

Keeper    B-PERSON******KeeperO37
of    O-******ofO37
the    B-ORG******theO37
Royal    I-ORG******RoyalO37
Seal    I-ORG******SealO37
'    O-******TrurlP485
saying    O-******MannequinP486
Signed    O-******KroolP487
,    O-******TrurlP488
,    O-******TrurlP489


In [191]:
conll_scoring.conlleval('O', 'O', 'dog', 'temp')

{'f1': 0.0, 'p': 0.0, 'r': 0.0}
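The call above only passes placeholder values. To score real output, the Audible tags (for example `O37` or `P485`) would first need to be in the same B-/I-/O form that spaCy reports. A minimal sketch of that conversion follows, assuming that adjacent tokens with the same `ner` value belong to one entity and that a bare `O` means "no entity". The type names in `TYPE_NAMES` are my own guess at a mapping (spaCy also uses labels such as GPE for places), and `audible_to_iob` is a hypothetical helper.

```python
# Assumed mapping from the Audible type letter to a spaCy-style label.
TYPE_NAMES = {'P': 'PERSON', 'L': 'LOC', 'O': 'ORG'}

def audible_to_iob(tags):
    """Convert per-token Audible tags like 'O37' into IOB tags like 'B-ORG'."""
    iob = []
    prev = None
    for tag in tags:
        if len(tag) < 2:
            # A bare 'O' carries no entity id, so the token is outside any entity.
            iob.append('O')
            prev = None
        else:
            # Same id as the previous token continues the entity; otherwise start a new one.
            prefix = 'I' if tag == prev else 'B'
            iob.append(prefix + '-' + TYPE_NAMES[tag[0]])
            prev = tag
    return iob

sample = ['O37', 'O37', 'O', 'P485', 'P486']
converted = audible_to_iob(sample)
print(converted)
```

Note that `P485` and `P486` come out as two separate one-token entities because their ids differ.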

In [236]:
words[i].attrs['loc']

'168903'