# References
1. http://www.aclweb.org/anthology/C94-1027
2. https://becominghuman.ai/part-of-speech-tagging-tutorial-with-the-keras-deep-learning-library-d7f93fa05537

In [1]:
import numpy as np

In [2]:
import nltk
nltk.download('treebank')
nltk.download('universal_tagset')

[nltk_data] Downloading package treebank to /home/rushabh/nltk_data...
[nltk_data]   Package treebank is already up-to-date!
[nltk_data] Downloading package universal_tagset to
[nltk_data]     /home/rushabh/nltk_data...
[nltk_data]   Package universal_tagset is already up-to-date!


True

In [3]:
from nltk.corpus import treebank

sentences = treebank.tagged_sents(tagset='universal')
len(sentences)

3914

The corpus is split 80/20 into training and test sets.  
The first 25% of the training portion is then held out as the validation set.  
The remaining 75% of the training portion is used as the training set.  

In [4]:
train_test_cutoff = int(.80 * len(sentences)) 
training_sentences = sentences[:train_test_cutoff]
testing_sentences = sentences[train_test_cutoff:]
 
train_val_cutoff = int(.25 * len(training_sentences))
validation_sentences = training_sentences[:train_val_cutoff]
training_sentences = training_sentences[train_val_cutoff:]
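
As a quick sanity check on the split above, the same slicing can be run on a stand-in list. This is only a sketch: `split_sentences` is a hypothetical helper that is not part of the notebook, and the list of integers merely stands in for the 3914 tagged sentences.

```python
def split_sentences(sentences, train_fraction=0.80, val_fraction=0.25):
    """Hypothetical helper mirroring the slicing used in the notebook."""
    train_test_cutoff = int(train_fraction * len(sentences))
    training = sentences[:train_test_cutoff]
    testing = sentences[train_test_cutoff:]

    # The first val_fraction of the training portion becomes the validation set.
    train_val_cutoff = int(val_fraction * len(training))
    validation = training[:train_val_cutoff]
    training = training[train_val_cutoff:]
    return training, validation, testing


# Stand-in for the 3914 tagged sentences reported by len(sentences).
fake_sentences = list(range(3914))
train, val, test = split_sentences(fake_sentences)
print(len(train), len(val), len(test))
```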

A dictionary of features is created for each word.  
Features, in addition to the term itself:  
1. whether the term (word) is first in the sentence  
2. whether the term (word) is last in the sentence  
3. 2- and 3-letter prefixes  
4. 2- and 3-letter suffixes  
5. previous and next words  

In [6]:
def add_basic_features(sentence_terms, index):
    # Build the feature dictionary for the word at position `index`.
    term = sentence_terms[index]
    return {
        'term': term,
        'is_first': index == 0,
        'is_last': index == len(sentence_terms) - 1,
        'prefix-2': term[:2],
        'prefix-3': term[:3],
        'suffix-2': term[-2:],
        'suffix-3': term[-3:],
        # Empty string marks a missing neighbour at the sentence boundary.
        'prev_word': '' if index == 0 else sentence_terms[index - 1],
        'next_word': '' if index == len(sentence_terms) - 1 else sentence_terms[index + 1]
    }
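
To make the feature dictionary concrete, here is the same function applied to one word of a short sentence. The sentence is made up for illustration and does not come from the treebank.

```python
def add_basic_features(sentence_terms, index):
    term = sentence_terms[index]
    return {
        'term': term,
        'is_first': index == 0,
        'is_last': index == len(sentence_terms) - 1,
        'prefix-2': term[:2],
        'prefix-3': term[:3],
        'suffix-2': term[-2:],
        'suffix-3': term[-3:],
        'prev_word': '' if index == 0 else sentence_terms[index - 1],
        'next_word': '' if index == len(sentence_terms) - 1 else sentence_terms[index + 1]
    }


# Illustrative sentence (an assumption, not taken from the corpus).
sentence = ['The', 'cat', 'sleeps', '.']
features = add_basic_features(sentence, 1)  # features for 'cat'
for name, value in features.items():
    print(name, repr(value))
```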

untag() removes the tag from each (word, tag) pair in a sentence, leaving only the words.   
transform_to_dataset() generates the input data (one feature dictionary per word) and the output data (the tag of each word).

In [7]:
def untag(tagged_sentence):
    # Drop the tags, keeping only the words of the sentence.
    return [w for w, _ in tagged_sentence]

def transform_to_dataset(tagged_sentences):
    X, y = [], []
    for pos_tags in tagged_sentences:
        # Untag once per sentence instead of once per word.
        sentence_terms = untag(pos_tags)
        for index, (term, class_) in enumerate(pos_tags):
            X.append(add_basic_features(sentence_terms, index))
            y.append(class_)
    return X, y
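
The shape of the resulting data is easiest to see on a single hand-written tagged sentence. In this sketch, `toy_features` is a hypothetical, reduced stand-in for `add_basic_features`, used only to keep the block short, and the `feature_fn` parameter is likewise an assumption added for illustration.

```python
def untag(tagged_sentence):
    return [w for w, _ in tagged_sentence]


def toy_features(sentence_terms, index):
    # Hypothetical, reduced stand-in for add_basic_features.
    return {'term': sentence_terms[index], 'is_first': index == 0}


def transform_to_dataset(tagged_sentences, feature_fn=toy_features):
    X, y = [], []
    for tagged_sentence in tagged_sentences:
        terms = untag(tagged_sentence)
        for index, (_, tag) in enumerate(tagged_sentence):
            X.append(feature_fn(terms, index))  # one feature dict per word
            y.append(tag)                       # one tag per word
    return X, y


# Illustrative sentence with universal-tagset style labels (an assumption).
tagged = [[('The', 'DET'), ('cat', 'NOUN'), ('sleeps', 'VERB'), ('.', '.')]]
X, y = transform_to_dataset(tagged)
print(X[0], y[0])
```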

In [8]:
X_train, y_train = transform_to_dataset(training_sentences)
X_test, y_test = transform_to_dataset(testing_sentences)
X_val, y_val = transform_to_dataset(validation_sentences)

In [9]:
from sklearn.feature_extraction import DictVectorizer
 
dict_vectorizer = DictVectorizer(sparse=False)
dict_vectorizer.fit(X_train + X_test + X_val)

DictVectorizer(dtype=<class 'numpy.float64'>, separator='=', sort=True,
        sparse=False)

DictVectorizer converts the list of feature dictionaries into numeric vectors, as explained here: https://stackoverflow.com/questions/27473957/understanding-dictvectorizer-in-scikit-learn
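
The encoding idea can be sketched in plain Python. This is an illustration of the concept only, not scikit-learn's implementation: `fit_vocabulary` and `vectorize` are hypothetical names. String values become one `key=value` column each, while boolean or numeric values keep a single column named after the key.

```python
def column_name(key, value):
    # String-valued features get one column per distinct value.
    return f"{key}={value}" if isinstance(value, str) else key


def fit_vocabulary(rows):
    """Hypothetical helper: map every column name to a column index."""
    names = {column_name(k, v) for row in rows for k, v in row.items()}
    return {name: i for i, name in enumerate(sorted(names))}


def vectorize(rows, vocabulary):
    """Hypothetical helper: turn feature dicts into dense float vectors."""
    vectors = []
    for row in rows:
        vector = [0.0] * len(vocabulary)
        for key, value in row.items():
            index = vocabulary.get(column_name(key, value))
            if index is not None:  # features unseen during fitting are ignored
                vector[index] = 1.0 if isinstance(value, str) else float(value)
        vectors.append(vector)
    return vectors


rows = [{'term': 'the', 'is_first': True}, {'term': 'cat', 'is_first': False}]
vocabulary = fit_vocabulary(rows)
vectors = vectorize(rows, vocabulary)
print(vocabulary)
print(vectors)
```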

In [10]:
X_train = dict_vectorizer.transform(X_train)
X_test = dict_vectorizer.transform(X_test)
X_val = dict_vectorizer.transform(X_val)

LabelEncoder encodes each part-of-speech label as an integer.  
Reference - https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html
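
Conceptually, label encoding assigns each distinct tag an integer index in sorted order. The sketch below shows that idea in plain Python. `fit_labels` is a hypothetical helper and the tag list is made up, so this is an illustration rather than the library's code.

```python
def fit_labels(labels):
    """Hypothetical helper: sorted unique classes plus a label -> index map."""
    classes = sorted(set(labels))
    return classes, {label: i for i, label in enumerate(classes)}


# Illustrative tags (an assumption, not notebook output).
tags = ['DET', 'NOUN', 'VERB', 'NOUN', '.']
classes, to_index = fit_labels(tags)

encoded = [to_index[tag] for tag in tags]  # tags -> integers
decoded = [classes[i] for i in encoded]    # integers -> tags
print(classes)
print(encoded)
```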

In [11]:
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
label_encoder.fit(y_train + y_test + y_val)

LabelEncoder()

In [12]:
y_train = label_encoder.transform(y_train)
y_test = label_encoder.transform(y_test)
y_val = label_encoder.transform(y_val)

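The integer labels are then one-hot encoded, so each label becomes a vector with a single 1 at the index of its class.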
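
A minimal plain-Python sketch of one-hot encoding follows. `one_hot` is a hypothetical helper written for illustration, and the integer labels are made up. The number of columns is assumed to be the largest label plus one unless given explicitly.

```python
def one_hot(indices, num_classes=None):
    """Hypothetical helper: one row per label, with a single 1.0 per row."""
    if num_classes is None:
        num_classes = max(indices) + 1
    rows = []
    for index in indices:
        row = [0.0] * num_classes
        row[index] = 1.0
        rows.append(row)
    return rows


# Illustrative integer labels (an assumption).
encoded_labels = [1, 2, 3, 2, 0]
one_hot_labels = one_hot(encoded_labels)
for label, row in zip(encoded_labels, one_hot_labels):
    print(label, row)
```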

In [13]:
from keras.utils import np_utils
 
y_train = np_utils.to_categorical(y_train)
y_test = np_utils.to_categorical(y_test)
y_val = np_utils.to_categorical(y_val)

Using TensorFlow backend.


A three-layer fully connected perceptron is used, as mentioned in Schmid's paper: http://www.aclweb.org/anthology/C94-1027

In [14]:
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation

def build_model(input_dim, hidden_neurons, output_dim):
    model = Sequential([
        Dense(hidden_neurons, input_dim=input_dim),
        Activation('relu'),
        Dropout(0.2),
        Dense(hidden_neurons),
        Activation('relu'),
        Dropout(0.2),
        Dense(output_dim, activation='softmax')
    ])

    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model
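
To show what the stacked layers compute at prediction time, here is a plain-Python sketch of a forward pass through two ReLU layers and a softmax output. The toy layer sizes, the random weights, and the helper names are all assumptions for illustration. Dropout is left out because it is only active during training.

```python
import math
import random


def dense(x, weights, biases):
    # weights holds one row of input weights per output unit.
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def relu(x):
    return [max(0.0, v) for v in x]


def softmax(x):
    shift = max(x)  # subtract the maximum for numerical stability
    exps = [math.exp(v - shift) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def random_layer(n_in, n_out, rng):
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def forward(x, layers):
    *hidden_layers, (out_weights, out_biases) = layers
    for weights, biases in hidden_layers:
        x = relu(dense(x, weights, biases))
    return softmax(dense(x, out_weights, out_biases))


rng = random.Random(0)
input_dim, hidden_neurons, output_dim = 6, 4, 3  # toy sizes, not the notebook's
layers = [
    random_layer(input_dim, hidden_neurons, rng),
    random_layer(hidden_neurons, hidden_neurons, rng),
    random_layer(hidden_neurons, output_dim, rng),
]
probabilities = forward([1.0, 0.0, 0.0, 1.0, 0.0, 1.0], layers)
print(probabilities)
```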

In [15]:
from keras.wrappers.scikit_learn import KerasClassifier

model_params = {
    'build_fn': build_model,
    'input_dim': X_train.shape[1],
    'hidden_neurons': 512,
    'output_dim': y_train.shape[1],
    'epochs': 5,
    'batch_size': 256,
    'verbose': 1,
    'validation_data': (X_val, y_val),
    'shuffle': True
}

clf = KerasClassifier(**model_params)

In [16]:
hist = clf.fit(X_train, y_train)

Train on 61107 samples, validate on 19530 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [17]:
score = clf.score(X_test, y_test, verbose=0)    
print('model accuracy: {}'.format(score))

model accuracy: 0.9656669494485752
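
With one-hot labels and a softmax output, accuracy can be understood as the fraction of words whose most probable predicted tag matches the true tag. The sketch below illustrates that computation on made-up values. `argmax` and `accuracy` are hypothetical helpers, not the code Keras runs.

```python
def argmax(values):
    return max(range(len(values)), key=values.__getitem__)


def accuracy(y_true_one_hot, y_pred_probabilities):
    """Hypothetical helper: share of rows where the predicted class is correct."""
    correct = sum(argmax(t) == argmax(p)
                  for t, p in zip(y_true_one_hot, y_pred_probabilities))
    return correct / len(y_true_one_hot)


# Made-up labels and predictions for four words and three classes.
y_true = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]]
y_pred = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.4, 0.3], [0.2, 0.5, 0.3]]
toy_score = accuracy(y_true, y_pred)
print('toy accuracy: {}'.format(toy_score))
```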


Comparison to Schmid's implementation
1. Schmid uses a 6-gram model (3 preceding words + 2 following words + the word itself).  
I use a 3-word window (1 preceding + the word + 1 succeeding), as I get almost the same accuracy (96.6%), as shown above.

2. I also check whether the word is first or last in the sentence.

3. Instead of building a prefix/suffix tree and checking whether a particular prefix or suffix exists, I take the 2- and 3-letter prefixes and the 2- and 3-letter suffixes of every word and encode them in the feature vector.
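
The difference in window size can be sketched with a feature function whose context width is configurable. `context_features` and its parameters are hypothetical and only illustrate how a wider window adds neighbouring words as features. The sketch works on raw words, as this notebook does, and does not reproduce the input representation used in Schmid's paper.

```python
def context_features(sentence_terms, index, n_prev=1, n_next=1):
    """Hypothetical helper: the word plus n_prev preceding and n_next following words."""
    features = {'term': sentence_terms[index]}
    for offset in range(1, n_prev + 1):
        pos = index - offset
        features[f'prev_word-{offset}'] = sentence_terms[pos] if pos >= 0 else ''
    for offset in range(1, n_next + 1):
        pos = index + offset
        features[f'next_word-{offset}'] = (
            sentence_terms[pos] if pos < len(sentence_terms) else '')
    return features


# Illustrative sentence (an assumption, not from the corpus).
sentence = ['The', 'cat', 'sat', 'on', 'the', 'mat', '.']
narrow = context_features(sentence, 3)                    # 1 + word + 1
wide = context_features(sentence, 3, n_prev=3, n_next=2)  # 3 + word + 2
print(narrow)
print(wide)
```

A wider window gives the model more context per word, at the cost of a larger one-hot feature vector after vectorization.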