This week's task is to try running some taggers on the Audible data given to us by Dan. We have two sets of files. The first is an unlabeled dataset that contains approximately 25% of the text from 1000 books. The second consists of 10 books from the first dataset, stored as XML files that are lemmatized and labeled with parts of speech, word chunks (e.g. verb phrases, noun phrases), and basic entities (People, Organizations, and Locations). Let's do some exploration of these files.

First, let's find out how many individual words are given entity tags, and what type of tags they are:

In [1]:
from bs4 import BeautifulSoup
import glob

files = glob.glob('data/audible/processedText/*.xml')
# P = person, L = location, O = organization, D = date
entities = {'P': 0, 'L': 0, 'O': 0, 'D': 0}
total_tokens = 0
for f in files:
    with open(f) as fh:
        soup = BeautifulSoup(fh.read(), 'lxml')
    for w in soup.find_all('w'):
        # each child node of a <w> element counts as one token
        for item in w:
            total_tokens += 1
            if 'ner' in w.attrs:
                # the first character of the ner ID is the entity type
                entities[w.attrs['ner'][0]] += 1

In [36]:
for i in entities.keys():
    print('There are ' + str(entities[i]) + ' instances of ' + str(i))
print('***** ' + str(total_tokens) + ' tokens ******')

There are 9057 instances of P
There are 1407 instances of L
There are 391 instances of O
There are 0 instances of D
***** 272320 tokens ******


In the labeled data, we see that there are just over 270k individual tokens, and nearly 11k of them are labeled as entities. The majority of these entities (~9k) are people, with a few location tags, and a small number of Organizations. Dates are not labeled in these data, so we'll exclude them from our experimentation.
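
To put those counts in proportion, the arithmetic can be checked directly from the printed totals. This is a small sketch that uses only the numbers in the output above; the variable names are mine.

```python
# Counts copied from the cell output above (D is left out because it is zero).
entity_counts = {'P': 9057, 'L': 1407, 'O': 391}
total_tokens = 272320

entity_tokens = sum(entity_counts.values())
# float() keeps the division correct on Python 2 as well as Python 3.
entity_share = entity_tokens / float(total_tokens)

print('%d entity tokens (%.1f%% of all tokens)' % (entity_tokens, 100 * entity_share))
for label, count in sorted(entity_counts.items()):
    print('%s: %.1f%% of entity tokens' % (label, 100.0 * count / entity_tokens))
```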

One thing to note is that this method counts every token of a multi-word entity separately, so an entity that spans several tokens is counted several times. For instance:

In [9]:
files = glob.glob('data/audible/processedText/*.xml')
unique_entities = {}  # '<ner id>_<file id>' -> number of tokens carrying that tag
for f in files:
    xml_data = open(f).read()
    soup = BeautifulSoup(xml_data, 'lxml')
    for i in soup.find_all(lambda tag: 'ner' in tag.attrs):
        key = i['ner'] + '_' + f[-14:-4]  # f[-14:-4] is the 10-character file ID
        unique_entities[key] = unique_entities.get(key, 0) + 1

In [10]:
import operator
longest_entity = max(unique_entities.items(), key=operator.itemgetter(1))  # .items() works on Python 2 and 3
ent_id, file_id = longest_entity[0].split('_')
xml_data = open('data/audible/processedText/'+file_id+'.xml').read()
soup = BeautifulSoup(xml_data, 'lxml')
entity = soup.findAll(ner=ent_id)
print('The entity that takes up the largest number of strings is:')
print(' '.join(i.text for i in entity))
print('\n')
print('*****************')
print('The entire sentence that contains this entity is:')
print(' '.join(entity[0].parent.parent.text.split()))

The entity that takes up the largest number of strings is:
Harmonic Field of Glass Bells and Green Gig


*****************
The entire sentence that contains this entity is:
We might add here that later on the constructors had an article published in a prominent scientific journal under the title of ' Recursive β – Metafunctions in the Special Case of a Bogus Polypolice Transmogrification Conversion on an Oscillating Harmonic Field of Glass Bells and Green Gig , Kerosene Lamp on the Left to Divert Attention , Solved by Beastly Incarceration – Concatenation ' , which was subsequently exploited by the tabloids as ' The Police State Rears Its Ugly Head ' .


Under the first counting method, this entity would have been counted as 8 distinct entities, whereas it's actually meant to be one. Interestingly, this example is also mislabeled: it is tagged as an organization, but it's actually part of a title (and would make a great band name!). Regardless, this highlights why a simple token-by-token scoring approach won't work so well.
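
The difference between the two ways of counting can be sketched on toy data. The `(file_id, ner_id)` pairs below are made up for illustration, and the sketch assumes, as the cell above does, that a ner ID identifies one entity within a file.

```python
from collections import Counter

# Hypothetical (file_id, ner_id) pairs, one per tagged token.
tagged_tokens = [
    ('book_a', 'O12'), ('book_a', 'O12'), ('book_a', 'O12'),  # one three-token organization
    ('book_a', 'P1'),
    ('book_b', 'P1'), ('book_b', 'P1'),                       # same ID, different file
]

# Number of tokens carrying each (file, ID) pair.
tokens_per_entity = Counter(tagged_tokens)

# Token-level count by type: every tagged token counts once.
token_counts = Counter(ner_id[0] for _, ner_id in tagged_tokens)
# Entity-level count by type: every (file, ID) pair counts once.
entity_counts = Counter(ner_id[0] for _, ner_id in tokens_per_entity)

print(token_counts)
print(entity_counts)
```

Here the three-token organization contributes three to the token-level count but only one to the entity-level count.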

# Spacy Entities

Below, I explore how Spacy handles entities in this text. The standard way of scoring this type of data is the one used for the CONLL shared tasks. The Perl CONLL scoring script is available [here](http://www.cnts.ua.ac.be/conll2002/ner/bin/conlleval.txt), and I use it as my scorer. One difficulty is that the Spacy tokenizer works a bit differently from the tokenizer used on the Audible data. Below, I print a sample of one of the documents with the ner tags from both systems. The printout is in the following format:


|spacy_text   |spacy_tag   |******   |audible_text   |audible_tag   |
|---|---|--:|---|---|
|disemboweled   |O-   |******   |disemboweled   |O   |
|,   |O-   |******   |,   |O   |
|buried   |O-   |******   |buried   |O   |


In [202]:
import spacy
nlp = spacy.load('en')

In [12]:
words = soup.find_all(['w', 'c'])
doc = nlp(' '.join(w.text for w in words))
start = 175
subdoc = doc[start:225]
for i, w in enumerate(subdoc):
    audible_word = words[i+start]
    # tokens without a ner attribute are outside any entity
    audible_tag = audible_word.attrs.get('ner', 'O')
    # print() with a single string works on both Python 2 and 3
    print(w.text + '\t' + w.ent_iob_ + '-' + w.ent_type_ + '\t******  ' + audible_word.text + '\t' + audible_tag)

disemboweled	O-	******  disemboweled	O
,	O-	******  ,	O
buried	O-	******  buried	O
alive	O-	******  alive	O
,	O-	******  ,	O
crucified	O-	******  crucified	O
and	O-	******  and	O
burnt	O-	******  burnt	O
at	O-	******  at	O
the	O-	******  the	O
stake	O-	******  stake	O
,	O-	******  ,	O
after	O-	******  after	O
which	O-	******  which	O
your	O-	******  your	O
ashes	O-	******  ashes	O
shall	O-	******  shall	O
be	O-	******  be	O
sent	O-	******  sent	O
into	O-	******  into	O
orbit	O-	******  orbit	O
as	O-	******  as	O
a	O-	******  a	O
and	O-	******  and	O
perpetual	O-	******  perpetual	O
reminder	O-	******  reminder	O
to	O-	******  to	O
all	O-	******  all	O
would	O-	******  would-be	O
-	O-	******  regicides	O
be	O-	******  ,	O
regicides	O-	******  amen	O
,	O-	******  .	O
amen	O-	******  '	O
.	O-	******  '	O
'	O-	******  Ca	O
'	O-	******  n't	O
Ca	O-	******  you	O
n't	O-	******  wait	O
you	O-	******  a	O
wait	O-	******  bit	O
a	O-	******  ?	O
bit	O-	******  '	O
?	O-	******  asked	O
'	O-	*****

One can see that the compound word would-be (just past halfway down) is treated differently by the two systems: Spacy splits it into three tokens, while the Audible data keeps it as one. From that point on, the two columns are out of step. This makes lining up the annotations a bit of a challenge, especially since Spacy may assign multiple tags to a word that has only one tag in the Audible data. To counteract this, I'll group together the words that are split apart, and give the group an entity tag if any of its tokens has one. Fortunately, there were no cases where these split words contained two different tags.
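
The grouping idea can be sketched without Spacy at all. The function below is a minimal illustration, not the code used in this notebook: `group_fine_tokens` is a hypothetical helper, and it assumes the two tokenizations spell out the same characters and differ only in where they split.

```python
def group_fine_tokens(coarse_tokens, fine_tokens, fine_tags):
    """Group fine-grained tokens under the coarse token they spell out.

    Returns one (coarse_token, tag) pair per coarse token. The tag is the
    first non-'O' tag among the grouped fine tokens, or 'O' if none has one.
    """
    aligned = []
    pos = 0
    for coarse in coarse_tokens:
        joined = ''
        tags = []
        # Consume fine tokens until they spell out the coarse token.
        while joined != coarse:
            if pos >= len(fine_tokens):
                raise ValueError('ran out of fine tokens at %r' % coarse)
            joined += fine_tokens[pos]
            tags.append(fine_tags[pos])
            pos += 1
            if not coarse.startswith(joined):
                raise ValueError('tokenizations diverge at %r' % coarse)
        entity_tags = [t for t in tags if t != 'O']
        aligned.append((coarse, entity_tags[0] if entity_tags else 'O'))
    return aligned


# The would-be example from the printout above, plus a made-up tagged case.
print(group_fine_tokens(['all', 'would-be', 'regicides'],
                        ['all', 'would', '-', 'be', 'regicides'],
                        ['O', 'O', 'O', 'O', 'O']))
print(group_fine_tokens(['Smith-Jones', 'said'],
                        ['Smith', '-', 'Jones', 'said'],
                        ['B-PERSON', 'O', 'O', 'O']))
```

Matching on characters rather than on token counts means the alignment recovers on its own after each split word, at the cost of failing loudly if the two texts ever differ.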

In [64]:
# ugh, this is ugly.
def condense_tokens(spacy_doc, audible_words, position, offset):
    # offset is [spacy_offset, audible_offset]; it is updated in place and returned
    audible_word = audible_words[position+offset[1]]
    spacy_tokens = nlp(audible_word.text)  # how many tokens spacy makes of this one word
    start = position + offset[0]
    span = spacy_doc[start:start+len(spacy_tokens)]
    # concatenate the text, IOB codes and entity types of the split tokens
    spacy_text = ''.join(j.text for j in span)
    spacy_tag = ''.join(j.ent_iob_ for j in span) + '-' + ''.join(j.ent_type_ for j in span)
    audible_text = audible_word.text
    audible_tag = audible_word.attrs.get('ner', 'O-')  # 'O-' when the word has no ner attribute

    offset[0] += len(spacy_tokens)-1
    return(spacy_text, spacy_tag, audible_text, audible_tag, offset)

spacy_text = []
spacy_tag = []
audible_text = []
audible_tag = []
for f in files:
    xml_data = open(f).read()
    soup = BeautifulSoup(xml_data, 'lxml')
    words = soup.find_all(['w', 'c'])
    doc = nlp(' '.join(w.text for w in words))
    offset = [0, 0] #spacy, audible
    subdoc = doc #for testing
    for i, w in enumerate(subdoc):
        if i+offset[1] == len(words):
            break
        spacy_tokens = nlp(words[i+offset[1]].text) #check to see if the tokenizations match

        if len(spacy_tokens)==1: #if they match
            spacy_text.append(subdoc[i+offset[0]].text)
            spacy_tag.append(subdoc[i+offset[0]].ent_iob_ + '-' + subdoc[i+offset[0]].ent_type_)
            audible_text.append(words[i+offset[1]].text)
            audible_tag.append(words[i+offset[1]].attrs.get('ner', 'O')) #'O' when the word has no ner attribute

        if len(spacy_tokens)>1: #if spacy splits it up into more tokens
            stext, stag, atext, atag, offset = condense_tokens(doc, words, i, offset)
            spacy_text.append(stext)
            spacy_tag.append(stag)
            audible_text.append(atext)
            audible_tag.append(atag)

In [95]:
import pandas as pd
df = pd.DataFrame({'spacy_text':spacy_text,
             'spacy_tag':spacy_tag,
             'audible_text':audible_text,
             'audible_tag':audible_tag})

## Standardizing Tags
Spacy's entity recognizer includes many categories that are not in the Audible data. The next bits of code remove those categories from the Spacy tags and convert the remaining ones to a standard format similar to the one expected by the CONLL script. I then standardize the tags from the Audible data in the same way.

In [136]:
import re
def clean_spacy_tags(tags):
    # 1. entity types with no counterpart in the Audible data become 'O-'
    prog = re.compile('^(B|I|O)+-(CARDINAL|PERCENT|DATE|NORP|TIME|QUANTITY|MONEY|ORDINAL|LAW|FAC|WORK_OF_ART|PRODUCT|EVENT|LANGUAGE)+$')
    if prog.match(tags):
        out = 'O-'
    else:
        out = tags
    # 2. condensed non-entity tags such as 'OOO-' collapse to 'O-'
    prog = re.compile('^O+-$')
    if prog.match(out):
        out = 'O-'
    # 3. GPE maps to L, keeping the first IOB code
    prog = re.compile('^(B|I|O)(B|I|O)*?-GPE$')
    if prog.match(out):
        matched = prog.match(out).group(1)
        out = matched+'-L'
    # 4. PERSON/ORG/LOC shorten to P/O/L, keeping the B or I code captured by the pattern
    prog = re.compile('^((B|I)|O)+-(PERSON|ORG|LOC)+$')
    m = prog.match(out)
    if m:
        out = m.group(2)+'-'+m.group(3)[0]
    return(out)

df['cleaned_spacy_tags'] = df.spacy_tag.apply(clean_spacy_tags)

df.cleaned_spacy_tags.value_counts()
#df[df.cleaned_spacy_tags=='O-P']

O-     315119
B-P      7195
B-L      1341
I-P       977
B-O       904
I-O       592
I-L       188
Name: cleaned_spacy_tags, dtype: int64
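
For single, uncondensed tokens the same mapping can be written as a lookup table, which may be easier to audit than the regular expressions. This is only a sketch of the idea: `SPACY_TO_AUDIBLE` and `map_spacy_tag` are names I made up, and unlike the cell above it does not handle the concatenated tags produced for split words.

```python
# Spacy labels that have a counterpart in the Audible data.
# As in the cell above, GPE and LOC both become L.
SPACY_TO_AUDIBLE = {'PERSON': 'P', 'ORG': 'O', 'GPE': 'L', 'LOC': 'L'}


def map_spacy_tag(iob, ent_type):
    """Return a tag such as 'B-P', or 'O-' for anything outside the shared label set."""
    label = SPACY_TO_AUDIBLE.get(ent_type)
    if iob not in ('B', 'I') or label is None:
        return 'O-'
    return iob + '-' + label


print(map_spacy_tag('B', 'PERSON'))
print(map_spacy_tag('I', 'GPE'))
print(map_spacy_tag('B', 'DATE'))
```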

In [148]:
def clean_audible_tags(item):
    prog = re.compile('^O$')
    if prog.match(item):
        out = 'O-'
    else:
        out = item
    prog = re.compile('^(P|L|O)([0-9])+$')
    if prog.match(out):
        matched = '-'+prog.match(out).group(1)
        out = matched
    return(out)

df['aud_clean'] = df.audible_tag.apply(clean_audible_tags)
df['counter'] = df.groupby('audible_tag').cumcount()

df['iob_sys'] = 'O'

# the first token carrying each ID gets B, later tokens with the same ID get I
# (.loc avoids pandas' chained-assignment warning)
df.loc[df.counter==0, 'iob_sys'] = 'B'
df.loc[df.counter>0, 'iob_sys'] = 'I'
tagged = df.aud_clean!='O-'
df.loc[tagged, 'aud_clean'] = df.loc[tagged, 'iob_sys']+df.loc[tagged, 'aud_clean']
df.aud_clean.value_counts()


O-     315461
I-P      5857
B-P      3200
B-L       743
I-L       664
I-O       235
B-O       156
Name: aud_clean, dtype: int64
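
Because the cell above calls `groupby('audible_tag').cumcount()` on the whole frame, only the first token carrying a given ID anywhere in the data gets a `B`; if an ID recurs later, in another mention or another file, that mention starts with `I`. The sketch below shows an alternative that starts a new chunk at every contiguous run of the same ID. It is not the method behind the scores reported below, `ids_to_iob` is a hypothetical helper, and it would merge two different entities that happen to sit side by side with the same ID.

```python
from itertools import groupby


def ids_to_iob(entity_ids):
    """Turn per-token entity IDs (e.g. 'P12', or 'O' for no entity) into IOB tags.

    A new chunk starts whenever the ID changes, so each separate mention
    of the same ID gets its own B tag.
    """
    tags = []
    for entity_id, run in groupby(entity_ids):
        run_length = len(list(run))
        if entity_id == 'O':
            tags.extend(['O-'] * run_length)
        else:
            label = entity_id[0]  # first character is the entity type
            tags.append('B-' + label)
            tags.extend(['I-' + label] * (run_length - 1))
    return tags


print(ids_to_iob(['P1', 'P1', 'O', 'P1', 'L3']))
```

Applied per file, this would keep IDs from one book from affecting the tags of another.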

## Spacy Scoring
Below, I write a file for the CONLL script. Its precision, recall, and FB1 figures are computed over entire entity chunks rather than individual tokens; only the accuracy figure is per token. As such, if the Audible data labels "Susan Brown" as one entity chunk and Spacy identifies only "Susan" as the chunk, the chunk counts as incorrect.
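
The "Susan Brown" case can be worked through with a few lines of Python. This is a simplified sketch of chunk-level scoring under my own assumptions, not a port of the Perl script: a chunk is a `(start, end, label)` span, and a predicted chunk is correct only if the gold tags contain exactly the same span.

```python
def extract_chunks(tags):
    """Return the set of (start, end, label) spans in tags like 'B-P', 'I-P', 'O-'."""
    chunks = set()
    start, label = None, None
    # The trailing 'O-' is a sentinel that closes a chunk ending on the last token.
    for i, tag in enumerate(list(tags) + ['O-']):
        prefix, _, tag_label = tag.partition('-')
        inside = start is not None and prefix == 'I' and tag_label == label
        if not inside:
            if start is not None:
                chunks.add((start, i, label))
            if prefix in ('B', 'I') and tag_label:
                start, label = i, tag_label  # an I after O also opens a chunk
            else:
                start, label = None, None
    return chunks


def chunk_scores(gold_tags, predicted_tags):
    """Chunk-level precision and recall."""
    gold = extract_chunks(gold_tags)
    predicted = extract_chunks(predicted_tags)
    correct = len(gold & predicted)
    precision = correct / float(len(predicted)) if predicted else 0.0
    recall = correct / float(len(gold)) if gold else 0.0
    return precision, recall


# Tokens: Susan Brown visited Paris
gold = ['B-P', 'I-P', 'O-', 'B-L']       # "Susan Brown" is one person chunk
predicted = ['B-P', 'O-', 'O-', 'B-L']   # only "Susan" is found
precision, recall = chunk_scores(gold, predicted)
token_accuracy = sum(g == p for g, p in zip(gold, predicted)) / float(len(gold))
print(precision, recall, token_accuracy)
```

Three of the four tokens agree, but only one of the two chunks does, which is why token accuracy can sit far above chunk-level precision and recall.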

In [256]:
# .copy() so the assignment below works on its own frame rather than a slice of df
outdat = df[['audible_text', 'aud_clean', 'cleaned_spacy_tags']].copy()
# the output file is space-separated, so strip spaces inside tokens
outdat['audible_text'] = outdat.audible_text.apply(lambda x: x.replace(' ', ''))
outdat.to_csv('temp', sep=' ', encoding='utf-8', index=False, header=False)


processed 326316 tokens with 9108 phrases; found: 9488 phrases; correct: 6682.

`accuracy:  97.33%; precision:  70.43%; recall:  73.36%; FB1:  71.86`

                L: precision:  48.88%; recall:  62.96%; FB1:  55.03  1342
                
                O: precision:   5.53%; recall:  25.00%; FB1:   9.06  940
                
                P: precision:  82.90%; recall:  76.02%; FB1:  79.31  7206