# Annotate data with Named Entity Recognition (NER)

## Maximum entropy classifier with a semi-supervised learning approach to detect named entities in a German text

This example uses a maximum entropy classifier that was bootstrapped with wordlists to label an unlabeled corpus. A good overview of different learning techniques, and of the semi-supervised approach in particular, can be found in the paper by Nadeau and Sekine.

```
Nadeau, David, and Satoshi Sekine. "A survey of named entity recognition and classification." Lingvisticae Investigationes 30.1 (2007): 3-26.
```

This is my naive approach to Named Entity Recognition. The goal is to extract 'technical' entities such as Java, .Net, etc. from a text.

In this context, semi-supervised learning means:
- Start with a bootstrap algorithm that labels the unlabeled corpus using two wordlists: the label TECH marks technical entities and NONTECH marks words that are not technical entities.
- Use the newly labeled corpus to train the classifier.
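
The two steps above can be sketched on a toy scale. This is only an illustration, not the code used further down: the function name `bootstrap_labels`, the tiny wordlists and the rule that TECH wins when a word is in both lists are my assumptions.

```python
def bootstrap_labels(tokens, tech_words, nontech_words):
    """Label tokens found in one of the wordlists; tokens in neither list are skipped."""
    tech = {w.lower() for w in tech_words}
    nontech = {w.lower() for w in nontech_words}
    labeled = []
    for token in tokens:
        key = token.lower()
        if key in tech:  # assumption: TECH takes precedence over NONTECH
            labeled.append((token, "TECH"))
        elif key in nontech:
            labeled.append((token, "NONTECH"))
    return labeled


# Toy corpus and toy wordlists, for illustration only
tokens = "I use Java and Python for the backend".split()
seed = bootstrap_labels(tokens, ["java", "python"], ["the", "and", "for"])
print(seed)
```

The resulting (token, label) pairs play the role of the labeled corpus that the classifier is trained on; words that appear in neither list simply stay unlabeled.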

<h3>Notice</h3>

This is just a playground example. For better results you will probably need a better corpus and a modified feature extraction function.



In [4]:
import collections

import nltk
import nltk.classify.util
from nltk.util import ngrams

from textblob import TextBlob
from bs4 import BeautifulSoup

import pandas as pd

<h3>Corpus</h3>

I used a simple corpus with some sentences from different Wikipedia articles about programming languages, so keep in mind that the F-score is not really meaningful. To get better results, use a larger, domain-specific corpus that covers the entities you want to train. Note that the cells below read Stack Overflow post bodies from `data/stackoverflow_posts.csv`.
http://data.stackexchange.com/stackoverflow/query/edit/368452

In [5]:
posts = pd.read_csv('data/stackoverflow_posts.csv')

print(posts.head())

                                                Body
0  <p>After I have migrated my project from VS201...
1  <p>How does GHC handle thunks that are accesse...
2  <p>An example usage:</p>\n\n<p><a href="http:/...
3  <p>With the new PreferenceFragmentCompat from ...
4  <p>I've always programed android with Eclipse ...


In [6]:
# Concatenate all post bodies into one HTML string
# (avoids the .ix indexer, which recent pandas versions no longer provide)
content = ''.join(posts['Body'])

soup = BeautifulSoup(content, 'html.parser')
    
blob = TextBlob(soup.get_text()) 

print('sentences', len(blob.sentences))

('sentences', 24185)


<h3>Feature extraction</h3>

The following table describes the features that are extracted and used for named entity recognition.

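feature|description
---|---
word|current word (e.g. Java)
word-1|the previous word, or `<START>` at the beginning of a sentence
word+1|the next word, or `<END>` at the end of a sentence
length|length of the current word
isupper|true if all cased characters of the current word are uppercase (e.g. PHP)
special-char|true if the current word contains one of the characters +, -, . or #
contains-digit|true if the current word contains at least one digit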


An additional feature could be the POS (part-of-speech) tag.
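
A rough sketch of how such a feature could be added is shown here. It is not part of the notebook's pipeline: the helper name `add_pos_features` and the feature keys `pos`, `pos-1` and `pos+1` are my assumptions, and the tags are written by hand. In practice they would come from a tagger, and the caller has to make sure the tagged tokens line up with the word positions.

```python
def add_pos_features(features, i, tagged_sentence):
    """Return a copy of `features` extended with the POS tags around position i.

    `tagged_sentence` is a list of (word, tag) tuples aligned with the sentence.
    """
    extended = dict(features)
    extended["pos"] = tagged_sentence[i][1]
    # Same boundary markers as for the word-1 / word+1 features
    extended["pos-1"] = tagged_sentence[i - 1][1] if i > 0 else "<START>"
    extended["pos+1"] = (tagged_sentence[i + 1][1]
                         if i + 1 < len(tagged_sentence) else "<END>")
    return extended


# Hand-written Penn Treebank style tags, for illustration only
tagged = [("I", "PRP"), ("like", "VBP"), ("Java", "NNP")]
feats = add_pos_features({"word": "Java"}, 2, tagged)
print(feats)
```

Returning a copy keeps the original feature dict untouched, so the same base features can be reused with and without POS information when comparing classifiers.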


In [7]:

def contains_word(word, wordlist):
    """Return True if any entry of wordlist occurs as a substring of word (case-insensitive)."""
    return any(w.lower() in word.lower() for w in wordlist)

def contains_digits(s):
    return any(char.isdigit() for char in s)

def extract_features(word, i, sentence):
    """Build the feature dict for the word at position i of a tokenized sentence."""
    features = {"word": word,
                "length": len(word),
                "isupper": word.isupper(),
                "special-char": contains_word(word, ['+', '-', '.', '#']),
                "contains-digit": contains_digits(word)}

    # Previous token, or a start marker at the beginning of the sentence
    if i == 0:
        features["word-1"] = "<START>"
    else:
        features["word-1"] = sentence[i-1]

    # Next token, or an end marker at the end of the sentence
    if (i+1) < len(sentence):
        features["word+1"] = sentence[i+1]
    else:
        features["word+1"] = "<END>"

    return features
    

<h3>Bootstrapping</h3>


In [8]:
def get_features_from_text(blob, wordlist, limit=1000):
    """Return feature dicts for the first `limit` corpus words that appear in wordlist."""
    # Lowercase the wordlist once instead of once per token
    vocabulary = set(w.lower() for w in wordlist)
    featuresets = []

    for s in blob.sentences:
        for i, w in enumerate(s.words):
            # Stop scanning the corpus as soon as enough examples are collected
            if len(featuresets) >= limit:
                return featuresets

            if w.lower() in vocabulary:
                featuresets.append(extract_features(w, i, s.words))

    return featuresets


In [9]:
words = pd.read_csv('data/top1000en.txt', header=None)
words.columns = ['Word']
print(words.head())

nonefeats = get_features_from_text(blob, words['Word'].values)

print('NONTECH', len(nonefeats))

  Word
0  the
1   of
2   to
3  and
4    a
('NONTECH', 1000)


In [10]:
tech = pd.read_csv('data/top_stackoverflow_tags.csv')
print(tech.head())


techfeats = get_features_from_text(blob, tech['TagName'].values)

print('TECH', len(techfeats))

   Id     TagName   Count  ExcerptPostId  WikiPostId
0   1        .net  216781        3624959     3607476
1   2        html  453768        3673183     3673182
2   3  javascript  937830        3624960     3607052
3   4         css  332802        3644670     3644669
4   5         php  803687        3624936     3607050
('TECH', 1000)


<h3>Train the classifier</h3>

In [11]:
featureset_tech = [(f, 'TECH') for f in techfeats]
print(featureset_tech[0])

({'word': u'LINQ', 'contains-digit': False, 'special-char': False, 'word-1': u'following', 'length': 4, 'word+1': u'statement', 'isupper': True}, 'TECH')


In [12]:
featureset_nontech = [(f, 'NONTECH') for f in nonefeats]
print(featureset_nontech[0])

({'word': u'After', 'contains-digit': False, 'special-char': False, 'word-1': '<START>', 'length': 5, 'word+1': u'I', 'isupper': False}, 'NONTECH')


In [13]:
# Use 3/4 of each class for training; integer division keeps the cutoffs usable as slice indices
techcutoff = len(featureset_tech) * 3 // 4
nonecutoff = len(featureset_nontech) * 3 // 4

trainfeats = featureset_tech[:techcutoff] + featureset_nontech[:nonecutoff]
testfeats = featureset_tech[techcutoff:] + featureset_nontech[nonecutoff:]

print('train on %d instances, test on %d instances' % (len(trainfeats), len(testfeats)))
print('train with %d TECH instances, train with %d NONE instances' % (techcutoff, nonecutoff))



train on 1500 instances, test on 500 instances
train with 750 TECH instances, train with 750 NONE instances


In [14]:
classifier = nltk.MaxentClassifier.train(trainfeats, algorithm="GIS", trace=0)
print('accuracy', nltk.classify.accuracy(classifier, testfeats))

('accuracy', 0.902)


<h3>Evaluation</h3>

The following function evaluates the classifier. Besides the overall accuracy and the most informative features, it prints for each label:

- F-measure
- precision 
- recall 




In [15]:
def eval_clf(classifier, testfeats):
    """Print accuracy, per-label precision/recall/F-measure and the most informative features."""
    # refsets: indices of test instances per gold label
    # testsets: indices of test instances per predicted label
    refsets = collections.defaultdict(set)
    testsets = collections.defaultdict(set)

    for i, (feats, label) in enumerate(testfeats):
        refsets[label].add(i)
        observed = classifier.classify(feats)
        testsets[observed].add(i)

    print('accuracy: %s' % nltk.classify.util.accuracy(classifier, testfeats))

    for label in refsets:
        print('')
        print('label: %s' % label)
        print('precision: %s' % nltk.metrics.precision(refsets[label], testsets[label]))
        print('recall: %s' % nltk.metrics.recall(refsets[label], testsets[label]))
        print('F-measure: %s' % nltk.metrics.f_measure(refsets[label], testsets[label]))

    print('')
    print("most_informative_features:")

    classifier.show_most_informative_features()
    

In [16]:
eval_clf(classifier, testfeats)

accuracy: 0.902

label: TECH
precision: 0.857651245552
recall: 0.964
F-measure: 0.907721280603

label: NONTECH
precision: 0.958904109589
recall: 0.84
F-measure: 0.89552238806

most_informative_features:
  -7.152 word==u'go' and label is 'NONTECH'
  -4.687 word==u'function' and label is 'NONTECH'
   4.171 word==u'those' and label is 'NONTECH'
   4.051 word==u'reason' and label is 'NONTECH'
   3.711 word==u'decided' and label is 'NONTECH'
  -3.665 word+1==u'Point' and label is 'TECH'
  -3.644 length==2 and label is 'TECH'
  -3.524 word==u'file' and label is 'NONTECH'
   3.292 word+1==u'removed' and label is 'NONTECH'
   3.254 word==u'call' and label is 'NONTECH'
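
The TECH numbers above can be reproduced by hand. The counts in this sketch are not printed by the notebook; I inferred them from the reported scores, assuming the test set holds 250 TECH and 250 NONTECH instances (the last quarter of 1000 each). The function name `precision_recall_f` is made up for the illustration.

```python
def precision_recall_f(tp, fp, fn):
    """Compute precision, recall and the balanced F-measure from raw counts."""
    precision = tp / float(tp + fp)
    recall = tp / float(tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Inferred counts for label TECH (assumption, see text):
# 241 of 250 TECH instances found, 40 NONTECH instances wrongly labeled TECH
p, r, f = precision_recall_f(tp=241, fp=40, fn=9)
print("precision=%.4f recall=%.4f F=%.4f" % (p, r, f))
# prints: precision=0.8577 recall=0.9640 F=0.9077
```

With these counts the NONTECH side follows as well: 210 of 250 NONTECH instances are recognized, which together with the 241 correct TECH instances gives 451 / 500 = 0.902 accuracy.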


<h3>Label new data</h3>

The following example labels every word of a new, unseen text.

In [17]:
new_text = "Where can i find a good PHP or Ruby Tutorial? Should I use Scrum or Kanban."
blob = TextBlob(new_text)


for s in blob.sentences:
    # enumerate gives the position of each word inside its sentence
    for i, w in enumerate(s.words):
        print(w + ' (' + classifier.classify(extract_features(w, i, s.words)) + ')')
    print("")

Where (NONTECH)
can (NONTECH)
i (NONTECH)
find (NONTECH)
a (NONTECH)
good (NONTECH)
PHP (TECH)
or (NONTECH)
Ruby (TECH)
Tutorial (TECH)

Should (NONTECH)
I (NONTECH)
use (NONTECH)
Scrum (TECH)
or (NONTECH)
Kanban (TECH)
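
Since the goal is to extract the technical entities, the per-word labels can be collapsed into a list of TECH words. The following is only a sketch: `extract_entities` is a hypothetical helper, and `toy_featurize` and `toy_classify` are stand-ins so the snippet runs without the trained model. In the notebook one would pass `extract_features` and `classifier.classify` instead.

```python
def extract_entities(words, classify, featurize, target="TECH"):
    """Return the words of one tokenized sentence that are classified as `target`."""
    return [w for i, w in enumerate(words)
            if classify(featurize(w, i, words)) == target]


# Stand-ins for extract_features and classifier.classify (assumptions)
def toy_featurize(word, i, sentence):
    return {"word": word}


def toy_classify(features):
    return "TECH" if features["word"] in {"PHP", "Ruby"} else "NONTECH"


words = "Where can i find a good PHP or Ruby Tutorial".split()
print(extract_entities(words, toy_classify, toy_featurize))
```

Passing the classifier and the feature function as arguments keeps the helper independent of a particular model, which makes it easy to try out different feature sets.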

