# **Deep Learning for Natural Language Processing**

In this notebook we are going to do the following:

### **Part - 1: We will build a custom Word Embedding model with Word2Vec for the words present in a given text corpus.**

### **Part - 2: Then we will apply it for POS Tagging - a Multi-class Classification task.**

# ----------------------------------------------------------------------------

## Part 1:  Building Custom Word Embeddings

To train our embeddings we will use the Skip-gram implementation in the Word2Vec module of the gensim library.

It provides the algorithms for both Skip-gram and a closely related model, Continuous Bag-of-Words (CBOW). The `sg` argument selects between them, and its default (`sg=0`) is CBOW.

Gensim’s Word2Vec models are trained on a list (or some other iterable) of sentences that have been pre-processed and tokenised, that is, split into separate words and punctuation.
Luckily, the NLTK library provides a number of tokenised corpora, such as the Brown corpus, so we can skip the text processing step and jump straight into defining our model!

Before we begin, we need to download the ‘brown’ corpus using the NLTK data downloader.

### Download the Brown Corpus

In [None]:
import nltk
nltk.download('brown')

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.


True

### Import Libraries

In [None]:
from nltk.corpus import brown
from gensim.models import Word2Vec
import multiprocessing
import collections
import numpy as np

### Build Vocabs

In [None]:
sentences = brown.sents()
print(sentences[:5])

[['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.'], ['The', 'jury', 'further', 'said', 'in', 'term-end', 'presentments', 'that', 'the', 'City', 'Executive', 'Committee', ',', 'which', 'had', 'over-all', 'charge', 'of', 'the', 'election', ',', '``', 'deserves', 'the', 'praise', 'and', 'thanks', 'of', 'the', 'City', 'of', 'Atlanta', "''", 'for', 'the', 'manner', 'in', 'which', 'the', 'election', 'was', 'conducted', '.'], ['The', 'September-October', 'term', 'jury', 'had', 'been', 'charged', 'by', 'Fulton', 'Superior', 'Court', 'Judge', 'Durwood', 'Pye', 'to', 'investigate', 'reports', 'of', 'possible', '``', 'irregularities', "''", 'in', 'the', 'hard-fought', 'primary', 'which', 'was', 'won', 'by', 'Mayor-nominate', 'Ivan', 'Allen', 'Jr.', '.'], ['``', 'Only', 'a', 'relative', 'handful', 'of', 'such', 'rep

In [None]:
len(sentences)

57340

### Build the Word2Vec Model to Train Custom Embeddings

Now we will train our model. 

To do that we simply need to create a new Word2Vec instance. 

The **Word2Vec** constructor takes a broad range of parameters, but we will concentrate on the few that are most relevant:

• **sentences**: The iterable over the tokenised sentences we will train on (the Brown sentences).

• **size**: The dimensionality of our embeddings (renamed `vector_size` in gensim 4.0 and later). There is no single best value that suits all applications. Models for more syntax-related tasks, such as part-of-speech tagging or parsing, typically work well with lower values such as 50, while many other tasks work best with higher values like 300 or 500.

• **window**: This determines which words are considered contexts of the target. For a window of size n, the contexts are the n words to the left of the target and the n words to its right. The size of the window affects the type of similarity captured in the embeddings: bigger windows result in more topical/domain similarities.

• **min_count**: We can use this parameter to tell the model to ignore infrequent words, so that no embedding is created for them and they are not used as contexts. The min_count defines a threshold frequency that a word must reach to be included in the vocabulary.

• **negative**: Defines the number of negative samples (incorrect training pair instances) that are drawn for each good sample.

• **iter**: The number of epochs, that is, how many passes we make over our training data (renamed `epochs` in gensim 4.0 and later).

• **sg**: Selects the training algorithm: 1 for Skip-gram, 0 (the default) for CBOW.

• **workers**: Determines how many worker threads will be used to train the model.
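To make the roles of **window** and **min_count** concrete, here is a minimal pure-Python sketch of how (target, context) training pairs could be formed. The helper names `build_vocab` and `skipgram_pairs` and the toy sentences are made up for illustration; this is not gensim's code, and real implementations typically add further steps (such as randomly shrinking the window and subsampling frequent words) that are omitted here.

```python
from collections import Counter

def build_vocab(sentences, min_count):
    """Keep only words that occur at least min_count times (hypothetical helper)."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word for word, count in counts.items() if count >= min_count}

def skipgram_pairs(sentence, window, vocab):
    """Return (target, context) pairs using up to `window` words on each side."""
    # Assumption: out-of-vocabulary words are dropped before windows are formed.
    tokens = [word for word in sentence if word in vocab]
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

toy_sentences = [["the", "cat", "sat", "on", "the", "mat"], ["the", "dog", "sat"]]
vocab = build_vocab(toy_sentences, min_count=2)
pairs = skipgram_pairs(toy_sentences[0], window=1, vocab=vocab)
print(sorted(vocab))
print(pairs)
```

With `min_count=2` only "the" and "sat" survive, so the first sentence shrinks to three tokens before any pairs are generated.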

In [None]:
EMB_DIM=300

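# Note: `size` and `iter` are the gensim 3.x argument names (gensim 4.0+ uses `vector_size` and `epochs`).
# Note: `sg` is left at its default of 0, so this call trains CBOW; pass sg=1 to train Skip-gram.
w2v = Word2Vec(sentences, size=EMB_DIM, window=5, min_count=5, negative=15, iter=10, workers=multiprocessing.cpu_count())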

# Get trained embeddings as KeyedVectors instance
word_vectors = w2v.wv
print(word_vectors)

<gensim.models.keyedvectors.Word2VecKeyedVectors object at 0x7fec217fcc90>


In [None]:
# Total words in the corpus
w2v.corpus_total_words

1161192

### Words Similar to a Given Word

In [None]:
result = word_vectors.similar_by_word('Sunday')
print('\n Most similar to Sunday : \n', result[:3])

result = word_vectors.similar_by_word('money')
print('\n Most similar to money : \n', result[:3])

result = word_vectors.similar_by_word('child')
print('\n Most similar to child : \n', result[:3])


 Most similar to Sunday : 
 [('Friday', 0.9132860898971558), ('Monday', 0.9124994277954102), ('Saturday', 0.8899158239364624)]

 Most similar to money : 
 [('job', 0.7186464071273804), ('care', 0.7074971199035645), ('advantage', 0.6929278373718262)]

 Most similar to child : 
 [('person', 0.8067123889923096), ('artist', 0.753311038017273), ('woman', 0.7446346879005432)]


In [None]:
result = word_vectors.most_similar(positive=['child'], negative=['person'])
print('\n Most similar to child but dissimilar to person : \n', result[:3])


 Most similar to child but dissimilar to person : 
 [('voice', 0.36989983916282654), ('Pamela', 0.3337928056716919), ('smile', 0.32788094878196716)]
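The scores above are cosine similarities. As a rough sketch of what a similarity query does, the NumPy code below ranks words by cosine similarity to a query built from unit-length positive vectors minus unit-length negative vectors. The function `most_similar`, the helper `unit`, and the three-dimensional toy vectors are all invented for illustration; they are not gensim's implementation or the embeddings trained above.

```python
import numpy as np

def unit(v):
    """Scale a vector to unit length."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

def most_similar(vectors, positive, negative=(), topn=3):
    """Rank words by cosine similarity to mean(positive) - mean(negative)."""
    signed = [unit(vectors[w]) for w in positive] + [-unit(vectors[w]) for w in negative]
    query = unit(np.mean(signed, axis=0))
    exclude = set(positive) | set(negative)  # never return the query words themselves
    scores = [(w, float(unit(v) @ query)) for w, v in vectors.items() if w not in exclude]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Toy vectors (assumption): two "day" words and two "work" words.
toy_vectors = {
    "Sunday": [1.0, 0.1, 0.0],
    "Monday": [0.9, 0.2, 0.0],
    "money": [0.0, 1.0, 0.2],
    "job": [0.1, 0.9, 0.3],
}
nearest = most_similar(toy_vectors, positive=["Sunday"], topn=2)
print(nearest)
```

Because every vector is normalised first, the dot product is exactly the cosine of the angle between the word and the query, so the scores always fall between -1 and 1.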


### We have obtained the Word Embeddings.
### Now let's use these embeddings in a multi-class classification task.

# --------------------------------------------------------------------------


## Part 2: Part-of-Speech (POS) Tagging - Multi-class Classification of Words


### Download Conll2000 Dataset

In [None]:
nltk.download('conll2000')

[nltk_data] Downloading package conll2000 to /root/nltk_data...
[nltk_data]   Unzipping corpora/conll2000.zip.


True

In [None]:
from nltk.corpus import conll2000
from gensim.models import Word2Vec
import tensorflow as tf
from tensorflow.keras.layers import Dense, Embedding, Activation, Flatten
from tensorflow.keras import Sequential
from tensorflow.keras.utils import to_categorical


In [None]:
train_words = conll2000.tagged_words('train.txt')
test_words = conll2000.tagged_words('test.txt')

print(train_words[:10])


[('Confidence', 'NN'), ('in', 'IN'), ('the', 'DT'), ('pound', 'NN'), ('is', 'VBZ'), ('widely', 'RB'), ('expected', 'VBN'), ('to', 'TO'), ('take', 'VB'), ('another', 'DT')]


Our first step is to process this data into a model-friendly format: replace all words and tags with their corresponding indexes and split the data into inputs (word ids) and outputs (tag labels).

To do that we need a dictionary that maps words to their corresponding ids and a similar dictionary for the tags. We will create the latter from our CoNLL-2000 training data, but for the former we will use the vocabulary of our trained embedding model, since it contains only the words we are able to represent.

In [None]:
# Build dictionary of the vocab words and the POS tags

def get_tag_vocabulary(tagged_words):
 """
 Accepts text in the form of (word, tag) tuples and returns
 a dictionary mapping POS-tags to unique ids.
 """

 tag2id = {}
 for item in tagged_words:
  tag = item[1]
  tag2id.setdefault(tag, len(tag2id))
 return tag2id


In [None]:
# In gensim 3.x, word_vectors.vocab stores Vocab objects rather than integers,
# but we would like our dictionary to map words to ints.
# (In gensim 4.0+, word_vectors.key_to_index already provides this mapping.)
word2id = {k: v.index for k,v in word_vectors.vocab.items()}
tag2id = get_tag_vocabulary(train_words)


In [None]:
word2id

{'The': 14,
 'Fulton': 5615,
 'County': 1280,
 'Grand': 5377,
 'said': 59,
 'Friday': 1852,
 'an': 34,
 'investigation': 2586,
 'of': 3,
 'recent': 595,
 'primary': 1162,
 'election': 1521,
 'produced': 1206,
 '``': 12,
 'no': 67,
 'evidence': 475,
 "''": 13,
 'that': 8,
 'any': 84,
 'irregularities': 9647,
 'took': 220,
 'place': 188,
 '.': 2,
 'jury': 1754,
 'further': 499,
 'in': 7,
 'the': 0,
 'City': 762,
 'Executive': 8895,
 'Committee': 1235,
 ',': 1,
 'which': 35,
 'had': 25,
 'over-all': 3165,
 'charge': 869,
 'deserves': 5880,
 'praise': 5616,
 'and': 4,
 'thanks': 3917,
 'Atlanta': 3166,
 'for': 11,
 'manner': 838,
 'was': 10,
 'conducted': 2046,
 'term': 1391,
 'been': 48,
 'charged': 1962,
 'by': 24,
 'Superior': 5881,
 'Court': 960,
 'Judge': 2861,
 'to': 5,
 'investigate': 7781,
 'reports': 1407,
 'possible': 254,
 'won': 1604,
 'Allen': 4961,
 'Jr.': 1469,
 'Only': 1062,
 'a': 6,
 'relative': 2530,
 'handful': 6888,
 'such': 91,
 'received': 609,
 'considering': 2862,
 

In [None]:
tag2id

{'#': 24,
 '$': 26,
 "''": 20,
 '(': 25,
 ')': 27,
 ',': 11,
 '.': 14,
 ':': 39,
 'CC': 12,
 'CD': 18,
 'DT': 2,
 'EX': 22,
 'FW': 40,
 'IN': 1,
 'JJ': 8,
 'JJR': 33,
 'JJS': 30,
 'MD': 23,
 'NN': 0,
 'NNP': 10,
 'NNPS': 28,
 'NNS': 9,
 'PDT': 37,
 'POS': 13,
 'PRP': 29,
 'PRP$': 17,
 'RB': 4,
 'RBR': 32,
 'RBS': 36,
 'RP': 38,
 'SYM': 42,
 'TO': 6,
 'UH': 43,
 'VB': 7,
 'VBD': 21,
 'VBG': 16,
 'VBN': 5,
 'VBP': 15,
 'VBZ': 3,
 'WDT': 34,
 'WP': 31,
 'WP$': 41,
 'WRB': 35,
 '``': 19}

### Unknown Words

We add a new word to our vocabulary — the ‘UNK’, which will represent all words we don’t have an embedding for. 

But adding this word to the vocabulary means it will need to have a corresponding embedding, not present in our representations. One solution would be to retrain Skip-gram after having replaced some occurrences of low frequency words in our training data with an ‘UNK’ token. 

But we will approach this problem from a different angle by approximating the UNK’s vector with a mean of all existing embeddings. After doing so, we will add this new representation to the matrix of all other embeddings.
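The effect of this approach can be checked on a toy example before applying it to the real embeddings. The sketch below uses a made-up 3x2 matrix and vocabulary (not the trained vectors) to show that inserting the mean vector at index 0 shifts every existing word down by one row while leaving its vector unchanged.

```python
import numpy as np

# Toy embedding matrix (assumption): 3 words, 2 dimensions each.
emb = np.array([[1.0, 0.0],
                [0.0, 1.0],
                [1.0, 1.0]])
word2id_toy = {"the": 0, "cat": 1, "sat": 2}

# Approximate the UNK vector with the mean of all existing embeddings.
unk_vector = emb.mean(axis=0)

# Insert the new row at index 0 and shift every existing id by one.
emb_with_unk = np.insert(emb, 0, unk_vector, axis=0)
word2id_with_unk = {word: index + 1 for word, index in word2id_toy.items()}
word2id_with_unk["UNK"] = 0

print(emb_with_unk.shape)
print(emb_with_unk[word2id_with_unk["UNK"]])
```

Keeping the matrix rows and the dictionary ids in step is the important part: if only one of them is updated, every lookup silently returns the wrong vector.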

In [None]:
def add_new_word(new_word, new_vector, new_index, embedding_matrix, word2id):
 """
 Add a new word to the existing matrix of word embeddings.
 """
 
 # Inserting the vector before given index along axis 0
 embedding_matrix = np.insert(embedding_matrix, [new_index], [new_vector], axis=0)

 # Updating the indexes of words that follow the new word
 word2id = {word: (index+1) if index>=new_index else index for word, index in word2id.items()}

 word2id[new_word] = new_index
 return embedding_matrix, word2id
 

In [None]:
def get_int_data(tagged_words, word2id, tag2id):
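 """
 Replaces all words and tags with their corresponding ids
 and separates words (features) from the tags (labels).
 Words missing from word2id are mapped to the module-level UNK_INDEX,
 which must be defined before this function is called.
 """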

 # X will hold word ids, Y will hold ids of their tags
 X, Y =[], [] 
 unk_count=0
 for word, tag in tagged_words:
  Y.append(tag2id.get(tag))
  if word in word2id:
   X.append(word2id.get(word))
  else:
   X.append(UNK_INDEX)
   unk_count+=1
 # Note: despite the label, the value printed is a fraction (e.g. 0.143 means 14.3%)
 print('Data Created. Percentage of unknown words : %.3f' %(unk_count/len(tagged_words)))
 return np.array(X), np.array(Y)


In [None]:
# Generally UNK is associated with index 0
UNK_INDEX=0
UNK_TOKEN="UNK"

embedding_matrix = word_vectors.vectors # word embeddings found from the Brown corpus
unk_vector = embedding_matrix.mean(0)
embedding_matrix, word2id = add_new_word(UNK_TOKEN, unk_vector, UNK_INDEX, embedding_matrix, word2id)


Now it’s time to get our integer, model-friendly data — both for the train and test splits.


In [None]:
X_train, Y_train = get_int_data(train_words, word2id, tag2id)

X_test, Y_test = get_int_data(test_words, word2id, tag2id)

Y_train, Y_test = to_categorical(Y_train), to_categorical(Y_test)

Data Created. Percentage of unknown words : 0.143
Data Created. Percentage of unknown words : 0.149


In [None]:
X_train

array([   0,    8,    1, ..., 2750,  802,    3])

In [None]:
Y_train

array([[1., 0., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]], dtype=float32)
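Each row of `Y_train` is a one-hot vector with a single 1 in the column of the tag id. The NumPy sketch below shows the idea on made-up ids; the function `one_hot` is an illustrative stand-in, not the Keras implementation. One practical point: `to_categorical` infers the number of columns from the largest id it sees unless `num_classes` is passed, so passing it explicitly is a safer way to keep the train and test label matrices the same width.

```python
import numpy as np

def one_hot(ids, num_classes):
    """Turn integer class ids into one-hot rows (illustrative helper)."""
    ids = np.asarray(ids)
    encoded = np.zeros((ids.size, num_classes), dtype=np.float32)
    encoded[np.arange(ids.size), ids] = 1.0
    return encoded

toy_tag_ids = [0, 2, 1]          # made-up tag ids
encoded = one_hot(toy_tag_ids, num_classes=4)
decoded = encoded.argmax(axis=1)  # argmax recovers the original ids
print(encoded)
print(decoded)
```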


## Define and Train the Model

Our next step is to define the model for POS classification. We will do so using TensorFlow’s implementation of the Keras API. 

Our model will take as input an index into the word embedding matrix, which will be used to look up the appropriate embedding. 

It will have one hidden layer with the tanh activation function and at the final layer will use the softmax activation — outputting a probability distribution over all possible tags.
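The computation described above can be sketched in plain NumPy: look up an embedding row, apply a tanh hidden layer, then a softmax output layer. The sizes and random weights below are toy assumptions chosen only to keep the example small; they are not the trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, EMB, HIDDEN, CLASSES = 10, 4, 3, 5  # toy sizes (assumption)

E = rng.normal(size=(VOCAB, EMB))          # frozen embedding matrix
W1 = rng.normal(size=(EMB, HIDDEN))        # hidden layer weights
b1 = np.zeros(HIDDEN)
W2 = rng.normal(size=(HIDDEN, CLASSES))    # output layer weights
b2 = np.zeros(CLASSES)

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    exp_z = np.exp(z)
    return exp_z / exp_z.sum(axis=-1, keepdims=True)

def forward(word_ids):
    """Embedding lookup -> tanh hidden layer -> softmax over tags."""
    x = E[word_ids]                 # shape (batch, EMB)
    h = np.tanh(x @ W1 + b1)        # shape (batch, HIDDEN)
    return softmax(h @ W2 + b2)     # shape (batch, CLASSES)

probs = forward(np.array([0, 3, 7]))
print(probs.shape)
print(probs.sum(axis=1))
```

Since the only input is a single word id, the model can do no better than assign each word its most likely tag; it has no access to the surrounding context.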

In [None]:
HIDDEN_SIZE = 50
BATCH_SIZE = 128

def define_model(embedding_matrix, class_count):
 """
 Creates and returns a simple part-of-speech model, which takes only one word as input.
 """
 vocab_length = len(embedding_matrix)
 model = Sequential()

 model.add(Embedding(input_dim=vocab_length, 
                     output_dim=EMB_DIM, 
                     weights=[embedding_matrix], # the matrix holding the pre-trained embeddings
                     input_length=1,     # how many indexes we look up per example
                     trainable=False))   # We don't want to train this layer
 model.add(Flatten())
 model.add(Dense(HIDDEN_SIZE))
 model.add(Activation("tanh"))
 model.add(Dense(class_count))
 model.add(Activation("softmax"))

 model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=["accuracy"])
 return model


In [None]:
class_count = len(tag2id)

In [None]:
pos_model = define_model(embedding_matrix, class_count)
pos_model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 1, 300)            4552200   
_________________________________________________________________
flatten (Flatten)            (None, 300)               0         
_________________________________________________________________
dense (Dense)                (None, 50)                15050     
_________________________________________________________________
activation (Activation)      (None, 50)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 44)                2244      
_________________________________________________________________
activation_1 (Activation)    (None, 44)                0         
Total params: 4,569,494
Trainable params: 17,294
Non-trainable params: 4,552,200
_________________________________________
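The parameter counts in the summary can be reproduced by hand, which is a useful sanity check on the layer shapes. The arithmetic below uses only the sizes shown in the summary above: a dense layer with `n_in` inputs and `n_out` units has `n_in * n_out` weights plus `n_out` biases.

```python
EMB_DIM, HIDDEN_SIZE, CLASS_COUNT = 300, 50, 44   # from the summary above
embedding_params = 4_552_200                      # from the summary above

# Number of rows in the embedding matrix (vocabulary size including UNK).
vocab_rows = embedding_params // EMB_DIM

# Dense layers: weights plus one bias per unit.
hidden_params = EMB_DIM * HIDDEN_SIZE + HIDDEN_SIZE
output_params = HIDDEN_SIZE * CLASS_COUNT + CLASS_COUNT

trainable = hidden_params + output_params
total = embedding_params + trainable
print(vocab_rows, hidden_params, output_params, trainable, total)
```

The embedding layer holds almost all of the parameters, but because it is frozen only the two small dense layers are actually trained.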

### Train the Model on POS Data

In [None]:
pos_model.fit(X_train, Y_train, epochs=3)

Epoch 1/3
Epoch 2/3
Epoch 3/3


<tensorflow.python.keras.callbacks.History at 0x7febe2beae90>


## Evaluate the Model

Now that we have a trained model it’s time to see how well it’s performing on the unseen data. 

We will use it to tag the words from the test data and calculate the accuracy of its predictions: the ratio of the number of correct tags to the number of all words in the test set. 

To get more insight, we will also determine which words are most commonly mistagged.
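Both measurements can be illustrated on a handful of made-up predictions. In the sketch below, the function `accuracy_and_errors` and all of the toy ids are assumptions for illustration; accuracy is the share of positions where the predicted tag id equals the true one, and a counter tallies the words at the positions that were wrong.

```python
import collections
import numpy as np

def accuracy_and_errors(y_true, y_pred, word_ids, id2word, topn=10):
    """Return accuracy and the most frequently mistagged words (illustrative helper)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    wrong = y_true != y_pred                      # boolean mask of mistakes
    accuracy = 1.0 - wrong.mean()
    mistagged = np.asarray(word_ids)[wrong]       # word ids at the wrong positions
    errors = collections.Counter(id2word[i] for i in mistagged)
    return accuracy, errors.most_common(topn)

toy_id2word = ["UNK", "the", "that"]   # made-up vocabulary
toy_word_ids = [0, 1, 2, 0, 2]
toy_true = [0, 1, 2, 0, 3]             # made-up true tag ids
toy_pred = [1, 1, 2, 2, 2]             # made-up predicted tag ids

accuracy, common_errors = accuracy_and_errors(toy_true, toy_pred, toy_word_ids, toy_id2word)
print(accuracy, common_errors)
```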

In [None]:
def evaluate_model(model, x_test, y_test):
 """
 Evaluates the given model by computing the accuracy of its predictions
 on the given test data and prints the 10 most frequently mistagged words.
 Relies on the module-level id2word list to map word ids back to words.
 """
 loss, acc = model.evaluate(x_test, y_test)
 print("\n Loss : %.2f" %loss)
 print("\n Accuracy : %.2f" %acc)

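 # Following lines are used to get the most commonly mistagged words.
 # predict_classes() is not available in recent TensorFlow releases,
 # so take the argmax of the predicted probabilities instead.
 y_pred = np.argmax(model.predict(x_test), axis=-1)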
 error_counter = collections.Counter()

 for i in range(len(x_test)):
  correct_tag_id = np.argmax(y_test[i], axis=0) # turn one-hot-encoding to an index
  if y_pred[i] != correct_tag_id:
   word = id2word[x_test[i]]
   error_counter[word] +=1
 print('\n Most common errors : \n', error_counter.most_common(10))


In [None]:
id2word = sorted(word2id, key=word2id.get)
evaluate_model(pos_model, X_test, Y_test)


 Loss : 0.60

 Accuracy : 0.81





 Most common errors : 
 [('UNK', 5034), ('that', 136), ('have', 51), ('as', 37), ('more', 30), ('Jaguar', 29), ('executive', 21), ('about', 18), ('American', 18), ('yield', 16)]


## Make Prediction/POS for a Single Word

In [None]:
idx = 34442  # position in the test set of the word to inspect (2100 and 5000 were also tried)

test_word = X_test[idx]
print('\n Index of the word being tested : ', test_word)
print('\n The actual word being tested : ', id2word[test_word])

# Prediction of the model
pred_idx = np.argmax(pos_model.predict(X_test[idx:idx+1]), axis=1)[0]

# Invert tag2id so the predicted id can be mapped back to its tag
id2tag = {tag_id: tag for tag, tag_id in tag2id.items()}
print('\n Predicted POS Tag : ', id2tag[pred_idx])



 Index of the word being tested :  345

 The actual word being tested :  line

 Predicted POS Tag :  NN



We could probably do even better with stronger embeddings. UNK is by far the most frequently mistagged token in the error list above, so if you want, you can retrain the embeddings on a bigger corpus and see how the performance of the POS model improves.