## Complete Guide to Training a Part-of-Speech Tagger
##### By Ruben Seoane, all credits to nlpforhackers.io
Based on the following tutorial: https://nlpforhackers.io/training-pos-tagger/

**Part-of-speech (POS) tagging** is one of the main components of most NLP analyses. It consists of labelling each word with its appropriate part of speech (noun, verb, adjective, adverb, pronoun, ...).

### Penn Treebank Tags
The most popular tag set is the Penn Treebank tag set. Taggers such as those in NLTK and Stanford CoreNLP are trained on this tag set.

### What is POS tagging


In [3]:
from nltk import word_tokenize, pos_tag

print(pos_tag(word_tokenize("I'm learning NLP")))

[('I', 'PRP'), ("'m", 'VBP'), ('learning', 'VBG'), ('NLP', 'NNP')]
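To read output like this, it helps to keep a few Penn Treebank tags at hand. The lookup below is a small hand-picked subset covering only the tags in this example, not the full tag set, and `describe` is a hypothetical helper name:

```python
# A hand-picked subset of Penn Treebank tags (the full tag set is larger)
PENN_TAGS = {
    "PRP": "personal pronoun",
    "VBP": "verb, non-3rd person singular present",
    "VBG": "verb, gerund or present participle",
    "NNP": "proper noun, singular",
}

def describe(tagged_tokens):
    """Pair each (word, tag) tuple with a human-readable tag description."""
    return [(word, tag, PENN_TAGS.get(tag, "unknown tag"))
            for word, tag in tagged_tokens]

tagged = [("I", "PRP"), ("'m", "VBP"), ("learning", "VBG"), ("NLP", "NNP")]
for word, tag, meaning in describe(tagged):
    print(f"{word:10} {tag:4} {meaning}")
```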


### POS Tagging tools in NLTK
NLTK provides tools to build simple taggers, such as:
1. **DefaultTagger** tags everything with the same tag.
2. **RegexpTagger** applies tags according to a set of regular expressions.
3. **UnigramTagger** picks the most frequent tag for a known word.
4. **BigramTagger** and **TrigramTagger** work similarly to the unigram tagger, but take some of the context into consideration.
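To make the "most frequent tag for a known word" idea concrete, here is a minimal plain-Python sketch of a unigram lookup with a default tag as fallback. It is not NLTK's implementation, and the training data, `train_unigram`, and `tag` are hypothetical names made up for illustration. In NLTK the same kind of fallback is configured through a tagger's `backoff` argument.

```python
from collections import Counter, defaultdict

def train_unigram(tagged_sentences):
    """Map each word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {word: tag_counts.most_common(1)[0][0]
            for word, tag_counts in counts.items()}

def tag(words, model, default_tag="NN"):
    """Tag known words from the model; fall back to default_tag otherwise."""
    return [(word, model.get(word, default_tag)) for word in words]

# Hypothetical toy training data, not taken from the treebank
toy_train = [
    [("the", "DT"), ("run", "NN"), ("was", "VBD"), ("long", "JJ")],
    [("they", "PRP"), ("run", "VBP"), ("daily", "RB")],
    [("we", "PRP"), ("run", "VBP"), ("fast", "RB")],
]
model = train_unigram(toy_train)
print(tag(["they", "run", "marathons"], model))
```

Here "run" is tagged `VBP` because that tag outnumbers `NN` in the toy data, and the unseen word "marathons" receives the default tag.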

### Picking a Corpus to train the POS tagger
These resources are very hard to come by, as annotating a large enough amount of text takes a lot of time and resources. One such resource can be found within NLTK:

In [5]:
import nltk

tagged_sentences = nltk.corpus.treebank.tagged_sents()

print(tagged_sentences[0])
print("Tagged sentences: ", len(tagged_sentences))
print("Tagged words: ", len(nltk.corpus.treebank.tagged_words()))

[('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ('61', 'CD'), ('years', 'NNS'), ('old', 'JJ'), (',', ','), ('will', 'MD'), ('join', 'VB'), ('the', 'DT'), ('board', 'NN'), ('as', 'IN'), ('a', 'DT'), ('nonexecutive', 'JJ'), ('director', 'NN'), ('Nov.', 'NNP'), ('29', 'CD'), ('.', '.')]
Tagged sentences:  3914
Tagged words:  100676


### Training our own POS Tagger using scikit-learn
Before training a classifier, we must first agree on what features we will be using.
Obvious choices: the word itself, and the words before and after it.
The tagger can be more precise if we also consider suffixes: the **2-letter suffix is a great indicator of past-tense verbs** ending in "-ed", and the 3-letter suffix helps recognize the present participle ending in "-ing".
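The suffix intuition can be checked with a quick count. The sketch below groups a tiny hand-made word list (not taken from the treebank) by its last `n` letters and tallies the tags per suffix; `tags_by_suffix` is a hypothetical helper name.

```python
from collections import Counter, defaultdict

# Hand-made toy sample, just to show the idea
toy_tagged = [
    ("walked", "VBD"), ("jumped", "VBD"), ("talked", "VBD"),
    ("learning", "VBG"), ("running", "VBG"),
    ("table", "NN"), ("bed", "NN"),
]

def tags_by_suffix(tagged_words, n):
    """Count how often each tag occurs for every n-letter suffix."""
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        counts[word[-n:].lower()][tag] += 1
    return counts

two = tags_by_suffix(toy_tagged, 2)
three = tags_by_suffix(toy_tagged, 3)
print(two["ed"].most_common())
print(three["ing"].most_common())
```

Even in this toy sample "-ed" is a strong but not perfect signal ("bed" is a noun), which is why the suffix is used as one feature among many rather than as a rule.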

In [7]:
def features(sentence, index):
    """ sentence: [w1, w2, ...], index: the index of the word """
    return {
        'word': sentence[index],
        'is_first': index == 0,
        'is_last': index == len(sentence) - 1,
        'is_capitalized': sentence[index][0].upper() == sentence[index][0],
        'is_all_caps': sentence[index].upper() == sentence[index],
        'is_all_lower': sentence[index].lower() == sentence[index],
        'prefix-1': sentence[index][0],
        'prefix-2': sentence[index][:2],
        'prefix-3': sentence[index][:3],
        'suffix-1': sentence[index][-1],
        'suffix-2': sentence[index][-2:],
        'suffix-3': sentence[index][-3:],
        'prev_word': '' if index == 0 else sentence[index - 1],
        'next_word': '' if index == len(sentence) - 1 else sentence[index + 1],
        'has_hyphen': '-' in sentence[index],
        'is_numeric': sentence[index].isdigit(),
        'capitals_inside': sentence[index][1:].lower() != sentence[index][1:]
    }

import pprint
pprint.pprint(features(['This', 'is', 'a', 'sentence'], 2))


{'capitals_inside': False,
 'has_hyphen': False,
 'is_all_caps': False,
 'is_all_lower': True,
 'is_capitalized': False,
 'is_first': False,
 'is_last': False,
 'is_numeric': False,
 'next_word': 'sentence',
 'prefix-1': 'a',
 'prefix-2': 'a',
 'prefix-3': 'a',
 'prev_word': 'is',
 'suffix-1': 'a',
 'suffix-2': 'a',
 'suffix-3': 'a',
 'word': 'a'}


A helper function to **strip the tags from our tagged corpus** and feed it to our classifier:


In [14]:
def untag(tagged_sentence):
    """ tagged_sentence: [(w1, t1), (w2, t2), ...] -> [w1, w2, ...] """
    return [w for w, t in tagged_sentence]

Let's build our training set. The corpus is composed of sentences, but our classifier should accept features for a single word, so we'll have to perform some transformations.

In [15]:
# Split the dataset for training and testing
cutoff = int(.75 * len(tagged_sentences))
training_sentences = tagged_sentences[:cutoff]
test_sentences = tagged_sentences[cutoff:]

print(len(training_sentences))
print(len(test_sentences))

2935
979


In [19]:
def transform_to_dataset(tagged_sentences):
    X, y = [], []

    for tagged in tagged_sentences:
        words = untag(tagged)  # strip the tags once per sentence
        for index in range(len(tagged)):
            X.append(features(words, index))
            y.append(tagged[index][1])

    # Return only after every sentence has been processed
    return X, y

X, y = transform_to_dataset(training_sentences)

Now we can train the classifier. We will try a **DecisionTreeClassifier**:

from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import Pipeline

clf = Pipeline([
    ('vectorizer', DictVectorizer(sparse=False)),
    ('classifier', DecisionTreeClassifier(criterion='entropy'))
])
# Fit on at most the first 10,000 samples to keep training time manageable
clf.fit(X[:10000], y[:10000])

print('Training Completed')

X_test, y_test = transform_to_dataset(test_sentences)
print("Accuracy: ", clf.score(X_test, y_test))
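To see what the vectorizer step contributes, here is a rough plain-Python sketch of one-hot encoding feature dicts. It only approximates the idea behind `DictVectorizer` (string values become one column per key/value pair, booleans and numbers keep a single column per key); `one_hot_rows` is a hypothetical name and the real class differs in details.

```python
def one_hot_rows(feature_dicts):
    """Turn feature dicts into numeric rows, roughly in the spirit of DictVectorizer."""
    def column(key, value):
        # Strings get one column per (key, value) pair; bools and numbers one per key
        return f"{key}={value}" if isinstance(value, str) else key

    columns = sorted({column(k, v) for d in feature_dicts for k, v in d.items()})
    index = {name: i for i, name in enumerate(columns)}
    rows = []
    for d in feature_dicts:
        row = [0.0] * len(columns)
        for k, v in d.items():
            row[index[column(k, v)]] = 1.0 if isinstance(v, str) else float(v)
        rows.append(row)
    return columns, rows

columns, rows = one_hot_rows([
    {"word": "a", "is_first": False},
    {"word": "This", "is_first": True},
])
print(columns)
print(rows)
```

Every distinct word, prefix, and suffix adds a column, so a dense matrix grows quickly with the size of the training set.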

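**_Training finishes very fast, with accuracy between 19% and 23%._** Numbers this low are consistent with `return X, y` in `transform_to_dataset` sitting inside the outer `for` loop: the function then returns after the first sentence, so the classifier is trained on one sentence and scored on one sentence. The `return` must come after the loop.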
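When a dataset-building function yields suspiciously few samples, the indentation of its `return` is worth checking. A minimal sketch with hypothetical toy data and function names shows the difference:

```python
def collect_early_return(sentences):
    items = []
    for sentence in sentences:
        for word in sentence:
            items.append(word)
        return items  # inside the outer loop: exits after the first sentence

def collect_all(sentences):
    items = []
    for sentence in sentences:
        for word in sentence:
            items.append(word)
    return items  # after the loop: every sentence is processed

toy = [["a", "b"], ["c"], ["d", "e", "f"]]
print(len(collect_early_return(toy)))  # 2
print(len(collect_all(toy)))           # 6
```

Printing `len(X)` right after building the dataset is a cheap sanity check: it should be close to the number of tagged words in the training split, not the length of one sentence.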


### Let's Tag

In [22]:
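def pos_tag(sentence):
    """ sentence: [w1, w2, ...] -> [(w1, t1), (w2, t2), ...] """
    tags = clf.predict([features(sentence, index) for index in range(len(sentence))])
    # zip() is lazy in Python 3, so build a list to make the pairs printable
    return list(zip(sentence, tags))

print(pos_tag(word_tokenize("This is my friend, John.")))


##### Note: returning the bare `zip(sentence, tags)` prints `<zip object at 0x...>` in Python 3, because `zip` is a lazy iterator; wrapping it in `list(...)` prints the (word, tag) pairs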
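In Python 3, `zip` returns a lazy iterator, which is why printing it directly shows a `zip object` rather than its contents. A minimal sketch, using made-up tags rather than real classifier output:

```python
words = ["This", "is", "John"]
tags = ["DT", "VBZ", "NNP"]  # made-up tags for illustration, not classifier output

pairs = zip(words, tags)
print(type(pairs).__name__)  # zip

# Materialize the iterator to see (and reuse) the pairs
materialized = list(pairs)
print(materialized)

# A zip iterator can be consumed only once
print(list(pairs))  # []
```

Returning a list from the tagging function also means callers can iterate over the result more than once.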