# Training and predicting with a part-of-speech tagging model

---

Tutorial followed along from: http://nlpforhackers.io/training-pos-tagger/

POS tagging underpins many downstream NLP tasks, so learning how it works is a good first step into NLP. I wish I could've used this on my 6th grade English homework.

### Labeled POS datasets

To create a supervised model that tags parts of speech, you need a labeled dataset! Luckily, NLTK, a Python package, comes with one right out of the box. It is found in:

    nltk.corpus.treebank.tagged_sents()
    
Other datasets commonly used are:

    Penn Treebank
    
Interestingly, since Twitter and conversational text contain far more nuanced, grungy, and slangy terms, there are also datasets for training POS taggers on tweets:

    http://www.cs.cmu.edu/~ark/TweetNLP/

In [15]:
import nltk
nltk.data.path.append("/Volumes/Secondary/")
from nltk import sent_tokenize, word_tokenize
import pickle
#from sklearn import DictVectorizer

In [2]:
tagged_sentences = nltk.corpus.treebank.tagged_sents()
 
print(tagged_sentences[0])
print("Tagged sentences: ", len(tagged_sentences))
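# Note: this counts the words in the Brown corpus, not in the treebank sample loaded above;
# nltk.corpus.treebank.tagged_words() would give the treebank count.
print("Tagged words:", len(nltk.corpus.brown.tagged_words()))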

[('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ('61', 'CD'), ('years', 'NNS'), ('old', 'JJ'), (',', ','), ('will', 'MD'), ('join', 'VB'), ('the', 'DT'), ('board', 'NN'), ('as', 'IN'), ('a', 'DT'), ('nonexecutive', 'JJ'), ('director', 'NN'), ('Nov.', 'NNP'), ('29', 'CD'), ('.', '.')]
Tagged sentences:  3914
Tagged words: 1161192
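
A quick way to get a feel for a tagged corpus is to count how often each tag occurs. The sketch below does that for a single sentence copied by hand from the output above; `sample_sentences` is a stand-in for `tagged_sentences`, not data loaded from NLTK.

```python
from collections import Counter

# Stand-in for nltk.corpus.treebank.tagged_sents(): one sentence copied from the output above.
sample_sentences = [
    [('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ('61', 'CD'),
     ('years', 'NNS'), ('old', 'JJ'), (',', ','), ('will', 'MD'),
     ('join', 'VB'), ('the', 'DT'), ('board', 'NN'), ('as', 'IN'),
     ('a', 'DT'), ('nonexecutive', 'JJ'), ('director', 'NN'),
     ('Nov.', 'NNP'), ('29', 'CD'), ('.', '.')],
]

# Count every tag across all sentences.
tag_counts = Counter(tag for sent in sample_sentences for _, tag in sent)

print(tag_counts.most_common(3))
```

The same one-liner should work on the real `tagged_sentences`, since it only relies on each sentence being a list of `(word, tag)` pairs.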


### Create a feature set to be used to predict a word's part of speech

---

Intuition and feature choice learned here: https://www.youtube.com/watch?v=LivXkL2DO_w

The choice of features comes from our domain knowledge of the English language. We intuitively know many rules that we would want to include when predicting part of speech. Some are below:
    - Capitalization - usually a proper noun
    - Prefix - the first one to three characters of the word
    - Suffix - '-ly' usually marks an adverb
    - Numeric - usually a cardinal number (tagged 'CD' in the output above)
    - Position in sentence - first or last word
    

In [3]:
def features(sentence, index):
    """ sentence: [w1, w2, ...], index: the index of the word """
    return {
        'word': sentence[index],
        'is_first': index == 0,
        'is_last': index == len(sentence) - 1,
        'is_capitalized': sentence[index][0].upper() == sentence[index][0],
        'is_all_caps': sentence[index].upper() == sentence[index],
        'is_all_lower': sentence[index].lower() == sentence[index],
        'prefix-1': sentence[index][0],
        'prefix-2': sentence[index][:2],
        'prefix-3': sentence[index][:3],
        'suffix-1': sentence[index][-1],
        'suffix-2': sentence[index][-2:],
        'suffix-3': sentence[index][-3:],
        'prev_word': '' if index == 0 else sentence[index - 1],
        'next_word': '' if index == len(sentence) - 1 else sentence[index + 1],
        'has_hyphen': '-' in sentence[index],
        'is_numeric': sentence[index].isdigit(),
        'capitals_inside': sentence[index][1:].lower() != sentence[index][1:]
    }

In [4]:
sent = 'Geoff went to the library today'
token_sent = word_tokenize(sent)
print("Example for the word 'went': ",features(sentence=token_sent,index=1))

Example for the word 'went':  {'word': 'went', 'is_first': False, 'is_last': False, 'is_capitalized': False, 'is_all_caps': False, 'is_all_lower': True, 'prefix-1': 'w', 'prefix-2': 'we', 'prefix-3': 'wen', 'suffix-1': 't', 'suffix-2': 'nt', 'suffix-3': 'ent', 'prev_word': 'Geoff', 'next_word': 'to', 'has_hyphen': False, 'is_numeric': False, 'capitals_inside': False}
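
One quirk of the `is_capitalized` check above: comparing the first character with its upper-cased form is also true for digits and punctuation, because `upper()` leaves them unchanged. A minimal sketch of the difference, with `str.isupper()` shown as a possible stricter alternative (both helper names are made up for illustration and are not part of the tutorial):

```python
def is_capitalized_loose(word):
    # Same test as in features(): true whenever upper() leaves the first character unchanged.
    return word[0].upper() == word[0]

def is_capitalized_strict(word):
    # Hypothetical alternative: true only when the first character is an uppercase letter.
    return word[0].isupper()

for word in ['Pierre', 'went', '61', ',']:
    print(word, is_capitalized_loose(word), is_capitalized_strict(word))
```

Swapping in the stricter check would change the feature values, so the model would need retraining before the two versions could be compared.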


### Split the tagged corpus into train/test

---

First we must separate the tags from the corpus words. The tags are the 'y'. To create our 'X', we take the words and run them through our featurizing function.

In [5]:
cut = int(.7 * len(tagged_sentences))
training_sentences = tagged_sentences[:cut]
testing_sentences = tagged_sentences[cut:]

In [6]:
print(len(training_sentences))
print(len(testing_sentences))

2739
1175
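
The cut above keeps the corpus order, so the test set is simply the last 30% of the sentences. If a corpus were ordered by topic or source, shuffling before cutting would give a more representative split. A sketch using only the standard library; `split_sentences` and the seed are illustrative choices, not something the tutorial uses:

```python
import random

def split_sentences(sentences, train_fraction=0.7, seed=42):
    """Shuffle a copy of the sentences and cut it into train/test parts."""
    shuffled = list(sentences)              # copy, so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)   # seeded for a reproducible split
    cut = int(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

# Toy data standing in for tagged sentences.
toy = [[('w%d' % i, 'NN')] for i in range(10)]
train, test = split_sentences(toy)
print(len(train), len(test))
```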


In [7]:
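def untag(tagged_sentence, t='word'):
    """Return only the words (t='word') or only the tags (t='tag') of a tagged sentence."""
    if t == 'word':
        return [word for word, _ in tagged_sentence]
    elif t == 'tag':
        return [tag for _, tag in tagged_sentence]
    raise ValueError("t must be 'word' or 'tag'")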

In [8]:
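def transform_tagged_Xy(tagged_sentences):
    """Build one feature dict (X) and one tag (y) per word in the tagged sentences."""
    X = []
    y = []

    for tagged_sentence in tagged_sentences:
        # Strip the tags once per sentence rather than once per word.
        words = untag(tagged_sentence)
        for i, (word, tag) in enumerate(tagged_sentence):
            X.append(features(words, i))
            y.append(tag)

    return X, y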

#### Example: tag and word features

In [9]:
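i = 4
# Featurize just the first training sentence so this example runs before the full transform below.
X_example, y_example = transform_tagged_Xy(training_sentences[:1])
print(y_example[i])
print(X_example[i])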

In [10]:
X_train, y_train = transform_tagged_Xy(training_sentences)
X_test, y_test = transform_tagged_Xy(testing_sentences)

### DictVectorizer

---

The 'X' features are still in dictionary format, so we can use sklearn.feature_extraction.DictVectorizer to convert them into a numpy array. String-valued features such as 'word' or 'suffix-2' are one-hot encoded, with one column per distinct value, while boolean features keep a single 0/1 column. For more info see: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/feature_extraction/dict_vectorizer.py

It is similar to the CountVectorizer in that the columns are learned at 'fit' time: the more distinct feature values it is fit with, the more columns each sample gets.

In addition, any feature or value not seen during 'fit' is silently ignored at transform time, so it contributes only zeros.
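
To make the encoding concrete, here is a hand-rolled sketch of the idea in plain Python. It is not scikit-learn's implementation, and the function names are invented for illustration. It only mimics the documented behaviour: each distinct string value becomes its own `feature=value` column, boolean and numeric values keep a single column, and anything not seen while fitting is dropped at transform time.

```python
def fit_columns(samples):
    """Collect the sorted column names a one-hot encoding of these dicts needs."""
    columns = set()
    for sample in samples:
        for name, value in sample.items():
            # Strings get one column per distinct value; other values share one column.
            columns.add('%s=%s' % (name, value) if isinstance(value, str) else name)
    return sorted(columns)

def transform(samples, columns):
    """Turn each dict into a list of floats, ignoring unseen columns."""
    index = {col: i for i, col in enumerate(columns)}
    rows = []
    for sample in samples:
        row = [0.0] * len(columns)
        for name, value in sample.items():
            if isinstance(value, str):
                col, cell = '%s=%s' % (name, value), 1.0
            else:
                col, cell = name, float(value)
            if col in index:          # unseen features and values are skipped
                row[index[col]] = cell
        rows.append(row)
    return rows

train = [{'word': 'Pierre', 'is_first': True}, {'word': 'Vinken', 'is_first': False}]
columns = fit_columns(train)
print(columns)
print(transform([{'word': 'Geoff', 'is_first': True}], columns))
```

The unseen word 'Geoff' has no column of its own, so only the `is_first` cell is set for that sample.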


In [17]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import Pipeline

In [18]:
dv = DictVectorizer(sparse=False)

In [22]:
X_train[:2]

[{'capitals_inside': False,
  'has_hyphen': False,
  'is_all_caps': False,
  'is_all_lower': False,
  'is_capitalized': True,
  'is_first': True,
  'is_last': False,
  'is_numeric': False,
  'next_word': 'Vinken',
  'prefix-1': 'P',
  'prefix-2': 'Pi',
  'prefix-3': 'Pie',
  'prev_word': '',
  'suffix-1': 'e',
  'suffix-2': 're',
  'suffix-3': 'rre',
  'word': 'Pierre'},
 {'capitals_inside': False,
  'has_hyphen': False,
  'is_all_caps': False,
  'is_all_lower': False,
  'is_capitalized': True,
  'is_first': False,
  'is_last': False,
  'is_numeric': False,
  'next_word': ',',
  'prefix-1': 'V',
  'prefix-2': 'Vi',
  'prefix-3': 'Vin',
  'prev_word': 'Pierre',
  'suffix-1': 'n',
  'suffix-2': 'en',
  'suffix-3': 'ken',
  'word': 'Vinken'}]

In [24]:
# fitting only sample 1: the resulting vector for sample 1 has length 17
print('fitting only sample 1, the length of resulting vector sample 1: ',len(dv.fit_transform(X_train[:1])[0]))
# fitting samples 1-20: the resulting vector for sample 1 has length 162
print('fitting samples 1-20, the length of resulting vector sample 1: ',len(dv.fit_transform(X_train[:20])[0]))

fitting only sample 1, the length of resulting vector sample 1:  17
fitting samples 1-20, the length of resulting vector sample 1:  162


In [26]:
dv.fit_transform(X_train[:20])[0]

array([ 0.,  0.,  0.,  0.,  1.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.])

### Now time to train the model

---

Using a pipeline, first DictVectorize the features, then train a classifier. I chose a random forest because I have had good results with it in the past.

The original author, in a sense, stunted the growth of the model by training on only the first 10k samples. After running it a couple of times, it's clear why. With the DictVectorizer, every new word, prefix, and suffix adds a column, so the dense 'X' matrix grows in both rows and columns as samples are added. On my computer, training on 10k samples takes ~1.5 mins, whereas training on 20k takes ~5.5 mins.
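
A rough back-of-the-envelope sketch of why a dense feature matrix gets expensive as samples are added. The column counts below are made-up illustrative figures, not measurements from this notebook, and the 8 bytes per cell assumes 64-bit floats.

```python
def dense_megabytes(n_samples, n_columns, bytes_per_cell=8):
    """Memory needed for a dense n_samples x n_columns float matrix, in MB."""
    return n_samples * n_columns * bytes_per_cell / 1e6

# Hypothetical column counts: more samples usually bring more unique words and affixes.
small = dense_megabytes(10_000, 20_000)   # 10k samples
large = dense_megabytes(20_000, 35_000)   # 20k samples, with more columns too
print(small, large)
```

Leaving `DictVectorizer` at its default `sparse=True` avoids storing all the zeros, and scikit-learn's tree ensembles accept sparse input, so that may be worth trying before cutting the sample size.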

The result is ~93% accuracy! Though this seems good, overall accuracy can hide poor performance on rare tags, which is why per-tag precision, recall, and F1 are often reported for NLP tasks as well.

In [None]:
%%time
clf = Pipeline([
    ('vectorizer', DictVectorizer(sparse=False)),
#    ('classifier', DecisionTreeClassifier(criterion='entropy')),
    ('classifier', RandomForestClassifier(n_estimators=100,criterion='entropy')),
])

# Tip: fit on only the first 10K samples (X_train[:10000], y_train[:10000]) if you're running this multiple times. It takes a fair bit :)

clf.fit(X_train[:20000], y_train[:20000])
 
print("Accuracy:", clf.score(X_test, y_test))

---

In addition, this is a good time to save the trained model so it can be used in the future without retraining.

In [None]:
# Use a context manager so the file is closed once the model is written.
with open("20k_pos_tagger.p", "wb") as f:
    pickle.dump(clf, f)
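
A small round-trip sketch of saving and reloading with `with` blocks, so the file handles are closed promptly. A plain dict stands in for the fitted pipeline and the temporary path is arbitrary; both are placeholders for illustration.

```python
import os
import pickle
import tempfile

# Placeholder for the fitted pipeline.
model_stand_in = {'name': 'pos_tagger', 'n_samples': 20000}

path = os.path.join(tempfile.mkdtemp(), 'model.p')

# 'with' closes the file even if dumping or loading raises.
with open(path, 'wb') as f:
    pickle.dump(model_stand_in, f)

with open(path, 'rb') as f:
    restored = pickle.load(f)

print(restored == model_stand_in)
```

Two things to keep in mind with pickled models: only load pickle files you trust, and a pickled scikit-learn model is generally only safe to reload with the library version that wrote it.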

### Now that we have a trained model, let's wrap it up to predict future parts of speech

---

I chose to build word_tokenize into the function, switched on or off with a flag, after I ran into an issue below. This way it accepts either a raw string or an already tokenized list.

In [27]:
def predict_pos_tags(sent, token_sent_on=False):
    '''Takes a sentence, and predicts the part of speech tag for each word.

    sent: a raw string if token_sent_on is True, otherwise a list of tokens.
    '''
    global clf

    # Load the pickled model only if no trained model is in memory yet.
    if 'clf' not in globals():
        with open("20k_pos_tagger.p", "rb") as f:
            clf = pickle.load(f)

    token_sent = word_tokenize(sent) if token_sent_on else sent
    if not token_sent:
        return []

    # Build one feature dict per token, then predict the whole sentence in one call.
    sentence_features = [features(token_sent, index=i) for i in range(len(token_sent))]
    sentence_tags = clf.predict(sentence_features)

    return list(zip(token_sent, sentence_tags))

In [29]:
with open("20k_pos_tagger.p", "rb") as f:
    clf = pickle.load(f)

In [30]:
sentence = 'Geoff went to the library today.'
predict_pos_tags(sentence,token_sent_on=True)

[('Geoff', 'NNP'),
 ('went', 'VBD'),
 ('to', 'TO'),
 ('the', 'DT'),
 ('library', 'NN'),
 ('today', 'NN'),
 ('.', '.')]

In [31]:
token_sent = ['Geoff','went','to','the','library','today','.']
predict_pos_tags(token_sent)

[('Geoff', 'NNP'),
 ('went', 'VBD'),
 ('to', 'TO'),
 ('the', 'DT'),
 ('library', 'NN'),
 ('today', 'NN'),
 ('.', '.')]

### NLTK has an out-of-the-box POS tagger, let's compare performance

---

Even on the simple example sentence, the NLTK tagger and ours disagree (compare the tag for 'library' in the output below with ours above)! This is great! Let's see how well they perform on the test corpus. Since our POS tagger was trained on the training split, only the held-out test data gives a fair comparison.

The following blocks of code only reshape the test data into a form that NLTK's tagger can work with.

In [32]:
nltk.pos_tag(token_sent)

[('Geoff', 'NNP'),
 ('went', 'VBD'),
 ('to', 'TO'),
 ('the', 'DT'),
 ('library', 'JJ'),
 ('today', 'NN'),
 ('.', '.')]

In [33]:
def remove_tagged_Xy(tagged_sentences):
    X = []
    y = []

    for tagged_sentence in tagged_sentences:
        X.append(untag(tagged_sentence))
        for i, word_tag in enumerate(tagged_sentence):
            word, tag = word_tag
            y.append(tag)

    return X, y

In [34]:
X_test, y_test = remove_tagged_Xy(testing_sentences)

y_pred = nltk.pos_tag_sents(X_test)

In [35]:
y_word = [word for pred in y_pred for word, pos in pred]
y_pred = [pos for pred in y_pred for word, pos in pred]

zipped_y_pred = list(zip(y_word,y_pred,y_test))
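
`zip` stops silently at the shortest input, so if a tagger ever returned a different number of tokens than there are gold tags, the rows above would be misaligned without any error. A small defensive sketch; `zip_aligned` is a hypothetical helper, not part of NLTK:

```python
def zip_aligned(*sequences):
    """Zip sequences together, but fail loudly if their lengths differ."""
    lengths = {len(seq) for seq in sequences}
    if len(lengths) > 1:
        raise ValueError('length mismatch: %s' % sorted(lengths))
    return list(zip(*sequences))

words = ['Geoff', 'went']
predicted = ['NNP', 'VBD']
truth = ['NNP', 'VBD']
rows = zip_aligned(words, predicted, truth)
print(rows)
```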

### Now that we have a list of word/pos_pred/pos_truth, we can make a scorer

---

Keep a count of which tags the model predicted correctly and which it didn't. Also keep track of the words it tagged incorrectly - perhaps the NLTK model makes different mistakes than ours.

In [36]:
def score_pred(zipped_y_pred):
    '''Score accuracy of predictions of NLTK and custom POS tagger.
    '''
    correct = 0
    incorrect = 0
    y_error = []
    
    for word, y_pred, y_test in zipped_y_pred:
        if y_pred == y_test:
            correct += 1
        else:
            incorrect += 1
            y_error.append((word,y_pred,y_test))
            
    print('number correct: ', correct)
    print('number incorrect: ', incorrect)

    accuracy = correct/(correct+incorrect)
    
    return accuracy, y_error

In [37]:
nltk_score, nltk_errors = score_pred(zipped_y_pred)

number correct:  26552
number incorrect:  3067


### Create the same format for the custom POS tagger

---


In [38]:
[pos for word, pos in predict_pos_tags(X_test[0])][:5]

['CC', 'DT', 'NN', 'MD', 'RB']

In [39]:
y_word = []
y_pred = []

for sent in X_test:
    # Tag each sentence once and reuse the result for both lists.
    tagged = predict_pos_tags(sent)
    y_word.append(untag(tagged))
    y_pred.append(untag(tagged, t='tag'))

In [40]:
y_pred_flat = []
y_word_flat = []
for word, pred in zip(y_word,y_pred):
    for pos in pred:
        y_pred_flat.append(pos)
    for w in word:
        y_word_flat.append(w)
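
The nested loops above flatten lists of lists. An equivalent, shorter sketch uses `itertools.chain.from_iterable`; the toy lists stand in for `y_word` and `y_pred`:

```python
from itertools import chain

nested_words = [['Geoff', 'went'], ['to', 'the', 'library']]
nested_tags = [['NNP', 'VBD'], ['TO', 'DT', 'NN']]

# chain.from_iterable walks each inner list in order and yields its items.
flat_words = list(chain.from_iterable(nested_words))
flat_tags = list(chain.from_iterable(nested_tags))

print(list(zip(flat_words, flat_tags)))
```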

In [41]:
custom_zipped_y_pred = list(zip(y_word_flat,y_pred_flat,y_test))

In [42]:
custom_score, custom_errors = score_pred(custom_zipped_y_pred)

number correct:  27517
number incorrect:  2102


### Examine the errors to see where each model is going awry

---

NLTK model errors:

    many *'s in the errors; these don't look like real words (e.g. '*RNR*-2' looks like a treebank annotation marker)
    lots of 0's too, which also don't look like real words
    hyphenated words
    proper nouns
    not many common words incorrectly tagged
    
Custom model:

    some proper nouns
    errors look like more common words

In [None]:
nltk_errors

In [None]:
custom_errors
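
Scrolling through thousands of error tuples is slow. A sketch of summarising them instead, by counting `(predicted, true)` tag pairs and the most frequently mis-tagged words. The `sample_errors` list is invented toy data in the same `(word, predicted, truth)` shape that `score_pred` returns:

```python
from collections import Counter

# Invented examples, shaped like the tuples in nltk_errors and custom_errors.
sample_errors = [
    ('*-1', 'NN', '-NONE-'),
    ('0', 'CD', '-NONE-'),
    ('*T*-2', 'NN', '-NONE-'),
    ('library', 'JJ', 'NN'),
    ('library', 'JJ', 'NN'),
]

# Which wrong tag was given for which true tag, and how often.
confusions = Counter((pred, truth) for _, pred, truth in sample_errors)
# Which words are mis-tagged most often.
words = Counter(word for word, _, _ in sample_errors)

print(confusions.most_common(2))
print(words.most_common(1))
```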

In [None]:
import re

In [None]:
a = '*RNR*-2'

In [None]:
match = re.search(r'\*',a)

In [None]:
match.group()

In [None]:
zero_count = 0
astrict_count = 0
nltk_no_astrict_errors = []

for tup in nltk_errors:
    # Raw string so the backslash reaches the regex engine unchanged.
    match1 = re.search(r'\*',tup[0])
    # Note: this matches any token containing a '0' (e.g. '100'), not only the bare '0' token.
    match2 = re.search('0',tup[0])
    if match1:
        astrict_count += 1
    elif match2:
        zero_count += 1
    else:
        nltk_no_astrict_errors.append(tup)
        
print('astrict count: ',astrict_count)
print('zero count: ', zero_count)

In [None]:
nltk_no_astrict_errors

### Recalculating the score, omitting asterisk and zero errors from the NLTK model

---

Manually taking the print out of our score function:

    number correct:  26552
    number incorrect:  3067
    
And subtracting the asterisk and zero errors:

    astrict count:  1599
    zero count:  344
    
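With these errors removed, the NLTK tagger scores about 95.9% (26552 of 27676), compared with about 92.9% (27517 of 29619) for our model. This is better than our model's performance, though our model's score still includes those tokens, so the comparison is not quite like-for-like. We'll see if we can add extra features to ours to make it more robust.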

In [None]:
correct = 26552
incorrect = 3067

incorrect_astrict = 1599
incorrect_zero = 344
incorrect = incorrect - incorrect_astrict - incorrect_zero

total = correct + incorrect

correct / total
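
The calculation above removes the trace-like tokens only from the NLTK error count. To compare two taggers on equal footing, the same tokens can be filtered out of both score lists before computing accuracy. A sketch on toy data; `accuracy_excluding` and `is_trace` are hypothetical helpers, and the rule inside `is_trace` is an assumption about what counts as a non-word:

```python
def is_trace(word):
    # Assumed rule: tokens containing '*' or equal to '0' are treated as non-words.
    return '*' in word or word == '0'

def accuracy_excluding(zipped, skip):
    """Accuracy over (word, predicted, truth) rows, ignoring rows where skip(word) is true."""
    kept = [(pred, truth) for word, pred, truth in zipped if not skip(word)]
    if not kept:
        raise ValueError('no rows left to score')
    return sum(pred == truth for pred, truth in kept) / len(kept)

# Invented rows in the same shape as zipped_y_pred and custom_zipped_y_pred.
toy_rows = [
    ('Geoff', 'NNP', 'NNP'),
    ('*-1', 'NN', '-NONE-'),
    ('0', 'CD', '-NONE-'),
    ('library', 'JJ', 'NN'),
]
real_words_only = accuracy_excluding(toy_rows, is_trace)
every_row = accuracy_excluding(toy_rows, lambda word: False)
print(real_words_only, every_row)
```

Calling it once with `zipped_y_pred` and once with `custom_zipped_y_pred` would score both taggers on the same subset of tokens.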

### Conclusion:

---

All in all, our model was trained on only 20k samples of the total ~70k and, once the asterisk and zero errors are set aside, is only about 3 percentage points less accurate than the out-of-the-box NLTK POS tagger. I'm pretty happy with this. All credit for the feature rules goes to Bogdani from the tutorial above.

I'd like to calculate precision and recall to get an F1 score in a later exercise, since I've read this is what NLP models are commonly evaluated on. Perhaps there's a better way than coding up counters myself, though. I'll read into it before diving in.
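
As a starting point for that exercise, here is a sketch of per-tag precision, recall, and F1 computed with counters over `(word, predicted, truth)` rows like the ones built earlier. The rows are toy data and `per_tag_scores` is an illustrative name. scikit-learn's `classification_report` in `sklearn.metrics` produces a similar table from flat lists of true and predicted tags, which may well be the easier route.

```python
from collections import Counter

def per_tag_scores(rows):
    """Return {tag: (precision, recall, f1)} from (word, predicted, truth) rows."""
    true_pos, predicted, actual = Counter(), Counter(), Counter()
    for _, pred, truth in rows:
        predicted[pred] += 1
        actual[truth] += 1
        if pred == truth:
            true_pos[truth] += 1

    scores = {}
    for tag in set(predicted) | set(actual):
        # Guard each division: a tag may never be predicted, or never occur in the truth.
        precision = true_pos[tag] / predicted[tag] if predicted[tag] else 0.0
        recall = true_pos[tag] / actual[tag] if actual[tag] else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores[tag] = (precision, recall, f1)
    return scores

# Invented rows for illustration.
toy_rows = [
    ('Geoff', 'NNP', 'NNP'),
    ('library', 'JJ', 'NN'),
    ('today', 'NN', 'NN'),
    ('old', 'JJ', 'JJ'),
]
toy_scores = per_tag_scores(toy_rows)
for tag, (p, r, f1) in sorted(toy_scores.items()):
    print(tag, round(p, 2), round(r, 2), round(f1, 2))
```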