## Assignment 3 - Named Entity Recognition

In this assignment, we are going to build a Named Entity Recognition model. With this model, we will also tag new data.

More on Named Entity Recognition:

https://blog.paralleldots.com/data-science/named-entity-recognition-milestone-models-papers-and-technologies/

https://blog.paralleldots.com/product/applications-named-entity-recognition-api/

### Steps:

**1. Import the data**

**2. Build the model**

**3. Pick a dataset to run the model on**

**4. Build a function to load new data and print the tags**

Your web application will load small sections of text (such as tweets or headlines) and from that, you will tag the text based on the presence of named entities.

*What you will be graded on:*

1. Ability to build a model on word and tag data

2. Ability to use the model to predict on new data and display that prediction

*The model will be based on:*
1. Embeddings from words
2. Embeddings from tag inputs

### Step 1: Importing the data

Below is some code to get you started. As in the part of speech tagging example, you will have to write code to:

0. Split your data into a train/test set (use an 80/20 or 90/10 split, since we will later apply this model to an entirely separate set of data)
1. Find the set of all words
2. Find the set of all tags
3. ***Optional*** **Create a function called ent_tagger** that will turn a sentence into this output for model building :
``` [('Thousands', 'O'), ('of', 'O'), ('demonstrators', 'O'), ('have',  'O'), ('marched',  'O'), ('through',  'O'), ('London', 'B-geo'), ('to',  'O'), ('protest',  'O'), ('the',  'O'), ('war',  'O'), ('in',  'O'), ('Iraq',  'B-geo'), ('and', 'O'), ('demand',  'O'), ('the',  'O'), ('withdrawal', 'O'), ('of', 'O'), ('British', 'B-gpe'), ('troops',  'O'), ('from', 'O'), ('that', 'O'), ('country', 'O'), ('.', 'O')]
```
4. Make a dictionary of words to index and entity tag to index
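
A minimal sketch of the optional `ent_tagger` helper from item 3, assuming the data is in the long format of `ner_dataset.csv` (one row per token, with `Sentence #`, `Word`, and `Tag` columns). The `toy` frame and the function signature are made up for illustration.

```python
import pandas as pd

def ent_tagger(df, sentence_id):
    """Return one sentence as a list of (word, tag) tuples, in row order."""
    rows = df[df["Sentence #"] == sentence_id]
    return list(zip(rows["Word"], rows["Tag"]))

# Hypothetical toy data in the same long format as ner_dataset.csv.
toy = pd.DataFrame({
    "Sentence #": ["Sentence: 1"] * 3 + ["Sentence: 2"] * 2,
    "Word": ["marched", "through", "London", "Indian", "forces"],
    "Tag": ["O", "O", "B-geo", "B-gpe", "O"],
})

tagged = ent_tagger(toy, "Sentence: 1")
print(tagged)
```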

**Part of Step 1: Formatting the data**

Data will need to be:

1. Indexed
2. Limited by vocabulary (i.e., replace tokens with UNKNOWN if they are too rare; choose a reasonable limit based on your survey of the data and on model performance)
3. Padded

In [1]:
import pandas as pd
import numpy as np

data = pd.read_csv("ner_dataset.csv", encoding="latin1")
data = data.ffill()  # forward-fill the sentence label, which is only present on each sentence's first row
print(data.tail(10))
print(data.shape)

              Sentence #       Word  POS    Tag
1048565  Sentence: 47958     impact   NN      O
1048566  Sentence: 47958          .    .      O
1048567  Sentence: 47959     Indian   JJ  B-gpe
1048568  Sentence: 47959     forces  NNS      O
1048569  Sentence: 47959       said  VBD      O
1048570  Sentence: 47959       they  PRP      O
1048571  Sentence: 47959  responded  VBD      O
1048572  Sentence: 47959         to   TO      O
1048573  Sentence: 47959        the   DT      O
1048574  Sentence: 47959     attack   NN      O
(1048575, 4)


In [2]:
order = range(len(data))  # row position, used later to keep words in their original order
data['order'] = order

print(data.head(10))
print(data.shape)

    Sentence #           Word  POS    Tag  order
0  Sentence: 1      Thousands  NNS      O      0
1  Sentence: 1             of   IN      O      1
2  Sentence: 1  demonstrators  NNS      O      2
3  Sentence: 1           have  VBP      O      3
4  Sentence: 1        marched  VBN      O      4
5  Sentence: 1        through   IN      O      5
6  Sentence: 1         London  NNP  B-geo      6
7  Sentence: 1             to   TO      O      7
8  Sentence: 1        protest   VB      O      8
9  Sentence: 1            the   DT      O      9
(1048575, 5)


convert all words to lower case

In [3]:
data['Word'] = data['Word'].str.lower()
data.head(10)

    Sentence #           Word  POS    Tag  order
0  Sentence: 1      thousands  NNS      O      0
1  Sentence: 1             of   IN      O      1
2  Sentence: 1  demonstrators  NNS      O      2
3  Sentence: 1           have  VBP      O      3
4  Sentence: 1        marched  VBN      O      4
5  Sentence: 1        through   IN      O      5
6  Sentence: 1         london  NNP  B-geo      6
7  Sentence: 1             to   TO      O      7
8  Sentence: 1        protest   VB      O      8
9  Sentence: 1            the   DT      O      9


merge each word into a sentence

In [4]:
Word_transpose = data.sort_values('order').groupby('Sentence #')['Word'].apply(lambda data: data.reset_index(drop=True)).unstack()

In [5]:
Sentence = Word_transpose.apply(lambda x: ' '.join(x.dropna().astype(str).values), axis=1)

In [6]:
Sentence.head(3)

Sentence #
Sentence: 1      thousands of demonstrators have marched throug...
Sentence: 10     iranian officials say they expect to get acces...
Sentence: 100    helicopter gunships saturday pounded militant ...
dtype: object

merge each tag into a sentence format

In [7]:
Tag_transpose = data.sort_values('order').groupby('Sentence #')['Tag'].apply(lambda df: df.reset_index(drop=True)).unstack()

In [8]:
Tag = Tag_transpose.apply(lambda x: ' '.join(x.dropna().astype(str).values), axis=1)
Tag.head(3)

Sentence #
Sentence: 1      O O O O O O B-geo O O O O O B-geo O O O O O B-...
Sentence: 10     B-gpe O O O O O O O O O O O O O O B-tim O O O ...
Sentence: 100    O O B-tim O O O O O B-geo O O O O O B-org O O ...
dtype: object
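
The transpose-and-join steps above can be condensed. A minimal sketch, assuming the same long format: joining words and tags per sentence with `groupby(..., sort=False)` keeps sentences in file order rather than sorting the labels as strings (`Sentence: 1`, `Sentence: 10`, `Sentence: 100`, ...). The `toy` frame is illustrative only.

```python
import pandas as pd

toy = pd.DataFrame({
    "Sentence #": ["Sentence: 1", "Sentence: 1", "Sentence: 2",
                   "Sentence: 2", "Sentence: 10"],
    "Word": ["london", "calling", "indian", "forces", "hello"],
    "Tag": ["B-geo", "O", "B-gpe", "O", "O"],
})

# Join the tokens and tags of each sentence into space-separated strings.
sentences = (
    toy.groupby("Sentence #", sort=False)
       .agg({"Word": " ".join, "Tag": " ".join})
       .rename(columns={"Word": "Words", "Tag": "Tags"})
)
print(sentences)
```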

In [9]:
data =pd.concat([Sentence, Tag], axis=1)
data.columns = ['Words', 'Tags']
data

Unnamed: 0_level_0,Words,Tags
Sentence #,Unnamed: 1_level_1,Unnamed: 2_level_1
Sentence: 1,thousands of demonstrators have marched throug...,O O O O O O B-geo O O O O O B-geo O O O O O B-...
Sentence: 10,iranian officials say they expect to get acces...,B-gpe O O O O O O O O O O O O O O B-tim O O O ...
Sentence: 100,helicopter gunships saturday pounded militant ...,O O B-tim O O O O O B-geo O O O O O B-org O O ...
Sentence: 1000,they left after a tense hour-long standoff wit...,O O O O O O O O O O O
Sentence: 10000,u.n. relief coordinator jan egeland said sunda...,B-geo O O B-per I-per O B-tim O B-geo O B-gpe ...
Sentence: 10001,mr. egeland said the latest figures show 1.8 m...,B-per I-per O O O O O O O O O O O O O O O O O ...
Sentence: 10002,he said last week 's tsunami and the massive u...,O O O O O O O O O O O O O O O O O O B-geo O B-...
Sentence: 10003,"some 1,27,000 people are known dead .",O O O O O O O
Sentence: 10004,"aid is being rushed to the region , but the u....",O O O O O O O O O O B-geo O O O O O O O O O O ...
Sentence: 10005,lebanese politicians are condemning friday 's ...,B-gpe O O O B-tim O O O O O O O O B-geo O O O ...


In [30]:
data.to_csv("ner_data_formatted.csv")
# Save the formatted dataset so the formatting steps above can be skipped next time.

tokenize words

In [10]:
from nltk.tokenize import word_tokenize

data['Tokenized_Words'] = [word_tokenize(i) for i in data['Words']]
data['Tokenized_Tags'] = [word_tokenize(i) for i in data['Tags']]

In [11]:
data.head(3)

Unnamed: 0_level_0,Words,Tags,Tokenized_Words,Tokenized_Tags
Sentence #,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Sentence: 1,thousands of demonstrators have marched throug...,O O O O O O B-geo O O O O O B-geo O O O O O B-...,"[thousands, of, demonstrators, have, marched, ...","[O, O, O, O, O, O, B-geo, O, O, O, O, O, B-geo..."
Sentence: 10,iranian officials say they expect to get acces...,B-gpe O O O O O O O O O O O O O O B-tim O O O ...,"[iranian, officials, say, they, expect, to, ge...","[B-gpe, O, O, O, O, O, O, O, O, O, O, O, O, O,..."
Sentence: 100,helicopter gunships saturday pounded militant ...,O O B-tim O O O O O B-geo O O O O O B-org O O ...,"[helicopter, gunships, saturday, pounded, mili...","[O, O, B-tim, O, O, O, O, O, B-geo, O, O, O, O..."
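
Because the `Words` column was built by joining already-tokenized words with spaces, running `word_tokenize` on it again may split some tokens further (abbreviations or contractions, for example), which would leave a sentence with more word tokens than tags. A quick sanity check, sketched here on made-up rows, compares the lengths; splitting on whitespace with `str.split()` is an alternative that preserves the original token boundaries by construction.

```python
import pandas as pd

toy = pd.DataFrame({
    "Words": ["u.n. relief coordinator", "he said last week 's tsunami"],
    "Tags": ["B-geo O O", "O O O O O O"],
})

# Whitespace splitting recovers exactly the tokens that were joined earlier.
toy["Tokenized_Words"] = toy["Words"].str.split()
toy["Tokenized_Tags"] = toy["Tags"].str.split()

# Rows where the word and tag sequences disagree in length.
lengths_differ = toy["Tokenized_Words"].str.len() != toy["Tokenized_Tags"].str.len()
mismatched = toy[lengths_differ]
print(len(mismatched), "misaligned sentences")
```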


Split your data into a train/test set

In [12]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(data, test_size=0.2, random_state=42)

In [13]:
print(data.shape, train.shape, test.shape)

(47959, 4) (38367, 4) (9592, 4)


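Limited by vocabulary: tokens that occur fewer than `min_freq` times are left out of the lexicon and later mapped to `<UNK>`. Note that with the default `min_freq=1` used below, every token is kept; pass `min_freq=2` to drop words that occur only once.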

In [14]:
import pickle

def make_lexicon(token_seqs, min_freq=1):
    # First, count how often each word appears in the text.
    token_counts = {}
    for seq in token_seqs:
        for token in seq:
            if token in token_counts:
                token_counts[token] += 1
            else:
                token_counts[token] = 1

    # Then, assign each word to a numerical index. Filter words that occur less than min_freq times.
    lexicon = [token for token, count in token_counts.items() if count >= min_freq]
    # Word indices start at 2: 0 is reserved for padding, and 1 is reserved for unknown words.
    lexicon = {token:idx + 2 for idx,token in enumerate(lexicon)}
    lexicon[u'<UNK>'] = 1 # Unknown words are those that occur fewer than min_freq times

    print("LEXICON SAMPLE ({} total items):".format(len(lexicon)))
    print(dict(list(lexicon.items())[:20]))
    
    return lexicon
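
To see the effect of the frequency cutoff, here is a compact sketch of the same idea using `collections.Counter`. The helper name `build_lexicon` and the toy sentences are illustrative and not part of the notebook.

```python
from collections import Counter

def build_lexicon(token_seqs, min_freq=1):
    """Map tokens seen at least min_freq times to indices 2, 3, ...
    Index 0 is left for padding and index 1 is <UNK>."""
    counts = Counter(token for seq in token_seqs for token in seq)
    kept = [token for token, count in counts.items() if count >= min_freq]
    lexicon = {token: idx + 2 for idx, token in enumerate(kept)}
    lexicon["<UNK>"] = 1
    return lexicon

toy_seqs = [["the", "lion", "slept"], ["the", "lion", "roared"]]

keep_all = build_lexicon(toy_seqs, min_freq=1)    # every token is kept
drop_rare = build_lexicon(toy_seqs, min_freq=2)   # singletons fall back to <UNK>

print(len(keep_all), len(drop_rare))
print([drop_rare.get(tok, drop_rare["<UNK>"]) for tok in ["the", "lion", "slept"]])
```

With `min_freq=2`, `slept` and `roared` are no longer in the lexicon, so they are indexed as `<UNK>` (1).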

Find the set of all words and the set of all tags

In [15]:
print("WORDS:")
words_lexicon = make_lexicon(train['Tokenized_Words'])

WORDS:
LEXICON SAMPLE (28954 total items):
{'south': 2, 'korea': 3, "'s": 4, 'government': 5, 'tuesday': 6, 'also': 7, 'unveiled': 8, 'a': 9, 'so-called': 10, 'green': 11, 'new': 12, 'job': 13, 'creation': 14, 'plan': 15, ',': 16, 'expected': 17, 'to': 18, 'create': 19, '9,60,000': 20, 'jobs': 21}


In [16]:
print("TAGS:")
tags_lexicon = make_lexicon(train['Tokenized_Tags'])

TAGS:
LEXICON SAMPLE (18 total items):
{'B-geo': 2, 'I-geo': 3, 'O': 4, 'B-tim': 5, 'I-tim': 6, 'B-org': 7, 'B-per': 8, 'I-per': 9, 'B-gpe': 10, 'I-org': 11, 'B-art': 12, 'B-eve': 13, 'I-eve': 14, 'I-gpe': 15, 'B-nat': 16, 'I-nat': 17, 'I-art': 18, '<UNK>': 1}


Indexed: From strings to numbers

In [17]:
def tokens_to_idxs(token_seqs, lexicon):
    # Map every token to its index; tokens missing from the lexicon map to <UNK>.
    idx_seqs = [[lexicon.get(token, lexicon['<UNK>']) for token in token_seq]
                for token_seq in token_seqs]
    return idx_seqs

In [18]:
train = train.copy()  # work on an explicit copy to avoid pandas' SettingWithCopyWarning
train['Word_Idxs'] = tokens_to_idxs(train['Tokenized_Words'], words_lexicon)
train['Tag_Idxs'] = tokens_to_idxs(train['Tokenized_Tags'], tags_lexicon)
train[['Tokenized_Words', 'Word_Idxs', 'Tokenized_Tags', 'Tag_Idxs']][:10]

Unnamed: 0_level_0,Tokenized_Words,Word_Idxs,Tokenized_Tags,Tag_Idxs
Sentence #,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Sentence: 16935,"[south, korea, 's, government, tuesday, also, ...","[2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1...","[B-geo, I-geo, O, O, B-tim, O, O, O, O, O, O, ...","[2, 3, 4, 4, 5, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, ..."
Sentence: 33479,"[when, the, lion, found, that, he, could, not,...","[23, 24, 25, 26, 27, 28, 29, 30, 31, 16, 28, 3...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ...","[4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, ..."
Sentence: 10275,"[the, cost, of, major, food, commodities, has,...","[24, 41, 42, 43, 44, 45, 46, 47, 48, 24, 49, 5...","[O, O, O, O, O, O, O, O, O, O, B-tim, I-tim, O...","[4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 6, 4, 4, 4, ..."
Sentence: 32476,"[argentina, 's, lionel, messi, tied, the, matc...","[59, 4, 60, 61, 62, 24, 63, 64, 65, 66, 22]","[B-org, O, B-per, I-per, O, O, O, B-tim, O, B-...","[7, 4, 8, 9, 4, 4, 4, 5, 4, 5, 4]"
Sentence: 10421,"[in, addition, to, 65,000, regular, h1-b, visa...","[67, 68, 18, 69, 70, 71, 72, 16, 73, 7, 46, 74...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ...","[4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, ..."
Sentence: 45981,"[palestinian, president, mahmoud, abbas, is, r...","[97, 98, 99, 100, 101, 102, 18, 103, 104, 105,...","[B-gpe, B-per, I-per, I-per, O, O, O, O, O, O,...","[10, 8, 9, 9, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,..."
Sentence: 30601,"[the, discovery, channel, and, nasa, have, res...","[24, 116, 117, 35, 118, 103, 119, 120, 51, 42,...","[O, B-org, I-org, O, B-org, O, O, B-tim, O, O,...","[4, 7, 11, 4, 7, 4, 4, 5, 4, 4, 4, 4, 4, 2, 4,..."
Sentence: 36662,"[afghan, president, hamid, karzai, led, his, c...","[126, 98, 127, 128, 129, 105, 130, 4, 131, 132...","[B-gpe, B-per, I-per, I-per, O, O, O, O, B-tim...","[10, 8, 9, 9, 4, 4, 4, 4, 5, 6, 4, 5, 4, 4, 4,..."
Sentence: 47265,"[mr., blair, also, stressed, that, his, decisi...","[143, 144, 7, 145, 27, 105, 146, 18, 147, 148,...","[B-per, I-per, O, O, O, O, O, O, O, O, O, O, O...","[8, 9, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 2, 4, ..."
Sentence: 25420,"[diplomats, reported, some, progress, during, ...","[157, 102, 158, 159, 160, 134, 4, 161, 16, 35,...","[O, O, O, O, O, B-tim, O, O, O, O, O, O, O, O,...","[4, 4, 4, 4, 4, 5, 4, 4, 4, 4, 4, 4, 4, 4, 4, ..."


In [19]:
test.head(3)

Unnamed: 0_level_0,Words,Tags,Tokenized_Words,Tokenized_Tags
Sentence #,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Sentence: 22048,the report calls on president bush and congres...,O O O O B-per I-per O B-org O O B-gpe O O O O ...,"[the, report, calls, on, president, bush, and,...","[O, O, O, O, B-per, I-per, O, B-org, O, O, B-g..."
Sentence: 1273,the construction on the baku-t'bilisi-ceyhan o...,O O O O O O O O O O O O O O O B-org I-org O O ...,"[the, construction, on, the, baku-t'bilisi-cey...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."
Sentence: 1541,the pact was initially approved after discussi...,O O O O O O O O B-per I-per O B-gpe B-per I-pe...,"[the, pact, was, initially, approved, after, d...","[O, O, O, O, O, O, O, O, B-per, I-per, O, B-gp..."


In [20]:
test = test.copy()  # work on an explicit copy to avoid pandas' SettingWithCopyWarning
test['Word_Idxs'] = tokens_to_idxs(test['Tokenized_Words'], words_lexicon)
test['Tag_Idxs'] = tokens_to_idxs(test['Tokenized_Tags'], tags_lexicon)
test[['Tokenized_Words', 'Word_Idxs', 'Tokenized_Tags', 'Tag_Idxs']][:10]

Unnamed: 0_level_0,Tokenized_Words,Word_Idxs,Tokenized_Tags,Tag_Idxs
Sentence #,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Sentence: 22048,"[the, report, calls, on, president, bush, and,...","[24, 564, 356, 181, 98, 1024, 35, 2722, 18, 73...","[O, O, O, O, B-per, I-per, O, B-org, O, O, B-g...","[4, 4, 4, 4, 8, 9, 4, 7, 4, 4, 10, 4, 4, 4, 4,..."
Sentence: 1273,"[the, construction, on, the, baku-t'bilisi-cey...","[24, 1989, 181, 24, 1, 862, 7333, 16, 24, 1, 1...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ...","[4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, ..."
Sentence: 1541,"[the, pact, was, initially, approved, after, d...","[24, 3805, 222, 4634, 74, 413, 5668, 570, 98, ...","[O, O, O, O, O, O, O, O, B-per, I-per, O, B-gp...","[4, 4, 4, 4, 4, 4, 4, 4, 8, 9, 4, 10, 8, 9, 9,..."
Sentence: 41443,"[zelenovic, had, lived, in, khanty-mansiisk, ,...","[21841, 153, 7064, 67, 1, 16, 158, 6124, 1095,...","[B-per, O, O, O, B-geo, O, O, O, O, O, O, B-ge...","[8, 4, 4, 4, 2, 4, 4, 4, 4, 4, 4, 2, 4, 4, 5, ..."
Sentence: 18642,"[exports, have, grown, significantly, because,...","[380, 103, 2046, 8703, 184, 42, 24, 2135, 2798...","[O, O, O, O, O, O, O, O, O, O, O, O, B-geo, I-...","[4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 2, 3, 4, ..."
Sentence: 17216,"[mrs., sheehan, began, holding, vigils, in, cr...","[1948, 11896, 700, 163, 1, 67, 5991, 272, 273,...","[B-per, I-per, O, O, O, O, B-geo, O, O, O, O, ...","[8, 9, 4, 4, 4, 4, 2, 4, 4, 4, 4, 4, 4, 4, 4, ..."
Sentence: 37529,"[another, incident, details, the, shooting, de...","[1351, 1790, 686, 24, 3409, 782, 42, 75, 4188,...","[O, O, O, O, O, O, O, O, O, O, B-gpe, O, O, O,...","[4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 10, 4, 4, 4, 4,..."
Sentence: 2185,"[the, report, shows, that, murders, in, large,...","[24, 564, 1531, 27, 9992, 67, 1332, 1794, 16, ...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ...","[4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, ..."
Sentence: 8273,"[it, exploded, as, passengers, were, working, ...","[73, 2831, 502, 1768, 213, 519, 18, 10250, 24,...","[O, O, O, O, O, O, O, O, O, O, O, O, B-geo, O,...","[4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 2, 4, 4, ..."
Sentence: 19145,"[in, the, past, few, years, ,, uighur, separat...","[67, 24, 49, 5132, 51, 16, 7979, 1693, 581, 10...","[O, O, B-tim, I-tim, O, O, B-org, O, O, O, O, ...","[4, 4, 5, 6, 4, 4, 7, 4, 4, 4, 4, 4, 4, 4, 4, ..."


Padding: since each sentence has a different number of words, we create a padded matrix whose width is the length of the longest sentence in the training set plus one, to allow for offsetting the sequences by one position. For every shorter sentence, we prepend zeros to the row; each zero represents an empty word (or tag) position.
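
As an illustration of what left-padding does, here is a small NumPy-only sketch. The helper `pad_left` is hypothetical; it is meant to mirror the default behaviour of the Keras `pad_sequences` function (`padding='pre'`, `truncating='pre'`), not replace it.

```python
import numpy as np

def pad_left(idx_seqs, max_len, value=0):
    """Left-pad (and, if needed, left-truncate) index sequences to max_len."""
    padded = np.full((len(idx_seqs), max_len), value, dtype="int32")
    for row, seq in enumerate(idx_seqs):
        seq = list(seq)[-max_len:]          # keep at most the last max_len items
        if seq:
            padded[row, -len(seq):] = seq   # right-align the real tokens
    return padded

toy_idxs = [[24, 41, 22], [59, 22]]
padded = pad_left(toy_idxs, max_len=4)
print(padded)
```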

In [21]:
from keras.preprocessing.sequence import pad_sequences

def pad_idx_seqs(idx_seqs, max_seq_len):
    # Keras provides a convenient padding function; by default it pads with zeros at the start of each sequence
    padded_idxs = pad_sequences(sequences=idx_seqs, maxlen=max_seq_len)
    return padded_idxs

Using TensorFlow backend.


In [22]:
max_seq_len = max([len(idx_seq) for idx_seq in train['Word_Idxs']]) # Get length of longest sequence
train_padded_words = pad_idx_seqs(train['Word_Idxs'], 
                                  max_seq_len + 1) #Add one to max length for offsetting sequence by 1
train_padded_tags = pad_idx_seqs(train['Tag_Idxs'],
                                 max_seq_len + 1)  #Add one to max length for offsetting sequence by 1

print("WORDS:\n", train_padded_words)
print("SHAPE:", train_padded_words.shape, "\n")

print("TAGS:\n", train_padded_tags)
print("SHAPE:", train_padded_tags.shape, "\n")

WORDS:
 [[    0     0     0 ...    12    21    22]
 [    0     0     0 ...    24    40    22]
 [    0     0     0 ...    57    58    22]
 ...
 [    0     0     0 ...    24  3978    22]
 [    0     0     0 ... 24644 14475    22]
 [    0     0     0 ...   141  8494    22]]
SHAPE: (38367, 105) 

TAGS:
 [[0 0 0 ... 4 4 4]
 [0 0 0 ... 4 4 4]
 [0 0 0 ... 4 4 4]
 ...
 [0 0 0 ... 4 5 4]
 [0 0 0 ... 8 9 4]
 [0 0 0 ... 4 4 4]]
SHAPE: (38367, 105) 



define input and output

In [23]:
import pandas
print(pandas.DataFrame(list(zip(train_padded_words[0,1:], train_padded_tags[0,:-1], train_padded_tags[0, 1:])),
                columns=['Input Words', 'Input Tags', 'Output Tags']))

     Input Words  Input Tags  Output Tags
0              0           0            0
1              0           0            0
2              0           0            0
3              0           0            0
4              0           0            0
5              0           0            0
6              0           0            0
7              0           0            0
8              0           0            0
9              0           0            0
10             0           0            0
11             0           0            0
12             0           0            0
13             0           0            0
14             0           0            0
15             0           0            0
16             0           0            0
17             0           0            0
18             0           0            0
19             0           0            0
20             0           0            0
21             0           0            0
22             0           0      
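
The offset works as follows: at each position the model sees the current word together with the tag of the previous position, and predicts the tag of the current word. A small sketch with made-up index values (the array names are illustrative):

```python
import numpy as np

# One padded sentence (length = longest sentence + 1), made-up indices.
padded_words = np.array([[0, 0, 24, 41, 22]])
padded_tags = np.array([[0, 0, 4, 2, 4]])

input_words = padded_words[:, 1:]    # current word
input_tags = padded_tags[:, :-1]     # tag of the previous position
output_tags = padded_tags[:, 1:]     # tag to predict for the current word

print(input_words, input_tags, output_tags, sep="\n")
```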

### Step 2. Build the model

Here we will build a Bidirectional LSTM-CRF model using the `Bidirectional` wrapper layer from Keras and the `CRF` layer from Keras-contrib.

**Documentation and source code:**

https://keras.io/layers/wrappers/#bidirectional

https://github.com/keras-team/keras-contrib

Fit your model with a validation split of 0.1 and feel free to use as many epochs as you like. Base your predictions on both the input words **and** the tags of the previous words, as in the POS example.

After building your model, grade its performance on your test set, both by comparing your predicted output to the actual tags (*at least 3 examples*) and by calculating the averaged precision and recall for your tags.
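
One possible way to compute averaged precision and recall is sketched below. It assumes tags are integer indices with 0 as padding, ignores padded positions, and macro-averages over the tags that occur in the gold data. The function name and the toy arrays are made up for illustration.

```python
import numpy as np

def macro_precision_recall(y_true, y_pred, pad_idx=0):
    """Average per-tag precision and recall over non-padding positions."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    mask = y_true != pad_idx                 # ignore padded positions
    y_true, y_pred = y_true[mask], y_pred[mask]
    precisions, recalls = [], []
    for tag in np.unique(y_true):            # only tags present in the gold data
        true_pos = np.sum((y_pred == tag) & (y_true == tag))
        predicted = np.sum(y_pred == tag)
        actual = np.sum(y_true == tag)
        precisions.append(true_pos / predicted if predicted else 0.0)
        recalls.append(true_pos / actual)
    return float(np.mean(precisions)), float(np.mean(recalls))

# Made-up gold and predicted tag indices (0 = padding, 4 = 'O', 2 = 'B-geo').
gold = [[0, 4, 4, 2], [0, 0, 2, 4]]
pred = [[0, 4, 2, 2], [0, 0, 4, 4]]
precision, recall = macro_precision_recall(gold, pred)
print(round(precision, 3), round(recall, 3))
```

Because `O` dominates the data, a macro average of this kind weights rare entity tags as heavily as `O`, which tends to give a more informative picture than plain accuracy.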

In [24]:
print(train_padded_words[1,:])
print(train_padded_tags[1,:])

[ 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0 23 24 25 26 27 28 29 30 31 16 28 32 33 24 34 35
 36 37 16 35 38 39 24 40 22]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4]


In [25]:
print(len(train_padded_words[1,:]))
print(len(train_padded_tags[1,:]))

105
105


In [26]:
train_padded_words.shape
train_padded_tags.shape

(38367, 105)

In [27]:
test.head(3)

Unnamed: 0_level_0,Words,Tags,Tokenized_Words,Tokenized_Tags,Word_Idxs,Tag_Idxs
Sentence #,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Sentence: 22048,the report calls on president bush and congres...,O O O O B-per I-per O B-org O O B-gpe O O O O ...,"[the, report, calls, on, president, bush, and,...","[O, O, O, O, B-per, I-per, O, B-org, O, O, B-g...","[24, 564, 356, 181, 98, 1024, 35, 2722, 18, 73...","[4, 4, 4, 4, 8, 9, 4, 7, 4, 4, 10, 4, 4, 4, 4,..."
Sentence: 1273,the construction on the baku-t'bilisi-ceyhan o...,O O O O O O O O O O O O O O O B-org I-org O O ...,"[the, construction, on, the, baku-t'bilisi-cey...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ...","[24, 1989, 181, 24, 1, 862, 7333, 16, 24, 1, 1...","[4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, ..."
Sentence: 1541,the pact was initially approved after discussi...,O O O O O O O O B-per I-per O B-gpe B-per I-pe...,"[the, pact, was, initially, approved, after, d...","[O, O, O, O, O, O, O, O, B-per, I-per, O, B-gp...","[24, 3805, 222, 4634, 74, 413, 5668, 570, 98, ...","[4, 4, 4, 4, 4, 4, 4, 4, 8, 9, 4, 10, 8, 9, 9,..."


In [28]:
# Get lexicon lookup (index -> token)

def get_lexicon_lookup(lexicon):
    lexicon_lookup = {idx: lexicon_item for lexicon_item, idx in lexicon.items()}
    print("LEXICON LOOKUP SAMPLE:")
    print(dict(list(lexicon_lookup.items())[:20]))
    return lexicon_lookup

tags_lexicon_lookup = get_lexicon_lookup(tags_lexicon)

LEXICON LOOKUP SAMPLE:
{2: 'B-geo', 3: 'I-geo', 4: 'O', 5: 'B-tim', 6: 'I-tim', 7: 'B-org', 8: 'B-per', 9: 'I-per', 10: 'B-gpe', 11: 'I-org', 12: 'B-art', 13: 'B-eve', 14: 'I-eve', 15: 'I-gpe', 16: 'B-nat', 17: 'I-nat', 18: 'I-art', 1: '<UNK>'}


In [29]:
# Save tags_lexicon and words_lexicon to pickle files
import pickle

with open("tags_lexicon.pkl", "wb") as f:
    pickle.dump(tags_lexicon, f)
with open("words_lexicon.pkl", "wb") as f:
    pickle.dump(words_lexicon, f)
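
The saved lexicons can be loaded back when tagging new data, so that words and tags are indexed exactly as during training. A minimal round-trip sketch with a made-up lexicon and a temporary file path:

```python
import os
import pickle
import tempfile

toy_lexicon = {"<UNK>": 1, "O": 2, "B-geo": 3}   # made-up lexicon

path = os.path.join(tempfile.mkdtemp(), "toy_lexicon.pkl")
with open(path, "wb") as f:
    pickle.dump(toy_lexicon, f)

with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == toy_lexicon)
```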

In [30]:
train.head(2)

Unnamed: 0_level_0,Words,Tags,Tokenized_Words,Tokenized_Tags,Word_Idxs,Tag_Idxs
Sentence #,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Sentence: 16935,south korea 's government tuesday also unveile...,B-geo I-geo O O B-tim O O O O O O O O O O O O ...,"[south, korea, 's, government, tuesday, also, ...","[B-geo, I-geo, O, O, B-tim, O, O, O, O, O, O, ...","[2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1...","[2, 3, 4, 4, 5, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, ..."
Sentence: 33479,"when the lion found that he could not escape ,...",O O O O O O O O O O O O O O O O O O O O O O O O O,"[when, the, lion, found, that, he, could, not,...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ...","[23, 24, 25, 26, 27, 28, 29, 30, 31, 16, 28, 3...","[4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, ..."


In [31]:
len(train['Tokenized_Tags'].iloc[0])
#len(train['Tokenized_Words'][0])

22

**seq_input_length**: the length of the padded matrices for the word and tag sentence inputs, which is the same for both since there is a one-to-one mapping between words and tags. Here it is 105: the longest sentence in the training data (104 tokens) plus one for the offset.

**n_word_input_nodes**: the number of unique words in the lexicon, plus one to account for matrix padding represented by 0 values. This indicates the number of rows in the word embedding layer, where each row corresponds to a word.

**n_tag_input_nodes**: the number of unique tags in the dataset, plus one to account for padding. This indicates the number of rows in the tag embedding layer, where each row corresponds to a tag.

**n_word_embedding_nodes**: the number of dimensions in the word embedding layer, which can be freely defined. Here, the variable is set to 300, although `model2` below uses an embedding size of 20.

**n_tag_embedding_nodes**: the number of dimensions in the tag embedding layer, which can be freely defined. Here, it is set to 100; `model2` below does not include a tag embedding.

**n_hidden_nodes**: the number of dimensions in the hidden layer. Like the embedding layers, this can be freely chosen. Here, the variable is set to 500, although `model2` below uses 300 LSTM units per direction.

In [32]:
from keras.utils import to_categorical
train_padded_tags = [to_categorical(i, num_classes=22) for i in train_padded_tags]

In [33]:
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Bidirectional, TimeDistributed, Dense

max_length = 105  # padded sequence length (longest training sentence + 1)
n_word_input_nodes = len(words_lexicon) + 1  # Add one for 0 padding
n_tag_input_nodes = len(tags_lexicon) + 1    # Add one for 0 padding
# The next three values are defined for reference but are not used by model2 below.
n_word_embedding_nodes = 300
n_tag_embedding_nodes = 100
n_hidden_nodes = 500

model2 = Sequential()
# input_dim has one row more than strictly needed, which is harmless.
model2.add(Embedding(input_dim=n_word_input_nodes + 1,
                     input_length=max_length,
                     output_dim=20,
                     mask_zero=True))
model2.add(Bidirectional(LSTM(300, return_sequences=True,
                              dropout=0.50, recurrent_dropout=0.25)))
# 22 output classes, matching num_classes in the to_categorical call above.
model2.add(TimeDistributed(Dense(units=22, activation='softmax')))

model2.compile(loss='categorical_crossentropy', optimizer='rmsprop')

model2.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 105, 20)           579120    
_________________________________________________________________
bidirectional_1 (Bidirection (None, 105, 600)          770400    
_________________________________________________________________
time_distributed_1 (TimeDist (None, 105, 22)           13222     
Total params: 1,362,742
Trainable params: 1,362,742
Non-trainable params: 0
_________________________________________________________________
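
The parameter counts in the summary can be checked by hand. The sketch below reproduces them from the layer sizes used above, taking the lexicon size (28954) from the earlier output; the variable names are illustrative.

```python
vocab_rows = 28954 + 2        # len(words_lexicon) + 2, as passed to input_dim
embed_dim = 20
lstm_units = 300
n_classes = 22

embedding_params = vocab_rows * embed_dim
# Each LSTM direction has 4 gates, each with input weights,
# recurrent weights, and a bias vector.
lstm_params = 2 * 4 * (lstm_units * (embed_dim + lstm_units) + lstm_units)
# The dense layer sees the concatenated forward and backward states.
dense_params = (2 * lstm_units) * n_classes + n_classes

total = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total)
```

The four printed values match the `Param #` column and the total of 1,362,742 reported by `model2.summary()`.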


In [34]:
train_padded_words.shape
#train_padded_tags.shape

(38367, 105)

In [35]:
train_padded_tags

[array([[1., 0., 0., ..., 0., 0., 0.],
        [1., 0., 0., ..., 0., 0., 0.],
        [1., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]]), array([[1., 0., 0., ..., 0., 0., 0.],
        [1., 0., 0., ..., 0., 0., 0.],
        [1., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]]), array([[1., 0., 0., ..., 0., 0., 0.],
        [1., 0., 0., ..., 0., 0., 0.],
        [1., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]]), array([[1., 0., 0., ..., 0., 0., 0.],
        [1., 0., 0., ..., 0., 0., 0.],
        [1., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]]), array([

In [36]:
batch_size = 32
epochs = 5
model2_fit = model2.fit(x= train_padded_words, y= np.array(train_padded_tags), 
                   batch_size = batch_size,
                   epochs = epochs,
                   verbose = 1,
                   validation_split = 0.1)

#model2.save_weights('HW3_model2.h5') #Save model
#model2.save('HW3_model2.h5') #Save model

Train on 34530 samples, validate on 3837 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [37]:
# Reuse the training maximum (104) so the test matrices match the model's input length of 105
max_seq_len = 104
test_padded_words = pad_idx_seqs(test['Word_Idxs'], 
                                  max_seq_len + 1) #Add one to max length for offsetting sequence by 1
test_padded_tags = pad_idx_seqs(test['Tag_Idxs'],
                                 max_seq_len + 1)  #Add one to max length for offsetting sequence by 1

print("WORDS:\n", test_padded_words)
print("SHAPE:", test_padded_words.shape, "\n")

print("TAGS:\n", test_padded_tags)
print("SHAPE:", test_padded_tags.shape, "\n")

WORDS:
 [[    0     0     0 ...   389   563    22]
 [    0     0     0 ...   290  2191    22]
 [    0     0     0 ...  5448  8716    22]
 ...
 [    0     0     0 ...    19    21    22]
 [    0     0     0 ... 24407  6110    22]
 [    0     0     0 ...   748  6153    22]]
SHAPE: (9592, 105) 

TAGS:
 [[0 0 0 ... 4 4 4]
 [0 0 0 ... 4 4 4]
 [0 0 0 ... 4 4 4]
 ...
 [0 0 0 ... 4 4 4]
 [0 0 0 ... 4 4 4]
 [0 0 0 ... 5 6 4]]
SHAPE: (9592, 105) 



In [38]:
test.head(3)

Unnamed: 0_level_0,Words,Tags,Tokenized_Words,Tokenized_Tags,Word_Idxs,Tag_Idxs
Sentence #,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Sentence: 22048,the report calls on president bush and congres...,O O O O B-per I-per O B-org O O B-gpe O O O O ...,"[the, report, calls, on, president, bush, and,...","[O, O, O, O, B-per, I-per, O, B-org, O, O, B-g...","[24, 564, 356, 181, 98, 1024, 35, 2722, 18, 73...","[4, 4, 4, 4, 8, 9, 4, 7, 4, 4, 10, 4, 4, 4, 4,..."
Sentence: 1273,the construction on the baku-t'bilisi-ceyhan o...,O O O O O O O O O O O O O O O B-org I-org O O ...,"[the, construction, on, the, baku-t'bilisi-cey...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ...","[24, 1989, 181, 24, 1, 862, 7333, 16, 24, 1, 1...","[4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, ..."
Sentence: 1541,the pact was initially approved after discussi...,O O O O O O O O B-per I-per O B-gpe B-per I-pe...,"[the, pact, was, initially, approved, after, d...","[O, O, O, O, O, O, O, O, B-per, I-per, O, B-gp...","[24, 3805, 222, 4634, 74, 413, 5668, 570, 98, ...","[4, 4, 4, 4, 4, 4, 4, 4, 8, 9, 4, 10, 8, 9, 9,..."


In [39]:
test_padded_tags

array([[0, 0, 0, ..., 4, 4, 4],
       [0, 0, 0, ..., 4, 4, 4],
       [0, 0, 0, ..., 4, 4, 4],
       ...,
       [0, 0, 0, ..., 4, 4, 4],
       [0, 0, 0, ..., 4, 4, 4],
       [0, 0, 0, ..., 5, 6, 4]], dtype=int32)

In [40]:
from keras.utils import to_categorical
test_padded_tags = [to_categorical(i, num_classes=22) for i in test_padded_tags]

In [41]:
test_padded_words

array([[    0,     0,     0, ...,   389,   563,    22],
       [    0,     0,     0, ...,   290,  2191,    22],
       [    0,     0,     0, ...,  5448,  8716,    22],
       ...,
       [    0,     0,     0, ...,    19,    21,    22],
       [    0,     0,     0, ..., 24407,  6110,    22],
       [    0,     0,     0, ...,   748,  6153,    22]], dtype=int32)

In [42]:
test_padded_tags

[array([[1., 0., 0., ..., 0., 0., 0.],
        [1., 0., 0., ..., 0., 0., 0.],
        [1., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]]), array([[1., 0., 0., ..., 0., 0., 0.],
        [1., 0., 0., ..., 0., 0., 0.],
        [1., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]]), array([[1., 0., 0., ..., 0., 0., 0.],
        [1., 0., 0., ..., 0., 0., 0.],
        [1., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]]), array([[1., 0., 0., ..., 0., 0., 0.],
        [1., 0., 0., ..., 0., 0., 0.],
        [1., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]]), array([
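
For reference, the one-hot step in `In [40]` can be reproduced with plain NumPy. The sketch below is an illustration on a toy array, not the notebook's code: `one_hot` is a hypothetical helper, and the 22 classes come from the `num_classes=22` argument used above.

```python
import numpy as np

def one_hot(tag_idxs, num_classes):
    """Map integer tag indices of shape (...) to one-hot vectors of shape (..., num_classes)."""
    tag_idxs = np.asarray(tag_idxs, dtype=int)
    # Row i of the identity matrix is the one-hot vector for class i
    return np.eye(num_classes, dtype=np.float32)[tag_idxs]

# Two toy "sentences" of four padded tag indices each
toy_tags = np.array([[0, 0, 4, 4], [0, 5, 6, 4]])
encoded = one_hot(toy_tags, num_classes=22)
print(encoded.shape)
```

Each tag index becomes a length-22 vector with a single 1, so a `(sentences, 105)` array of indices turns into `(sentences, 105, 22)`, the same shape as the model's predictions.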

In [43]:
model2_fit.history

{'loss': [0.40925528253640364,
  0.21257546664647975,
  0.1760508212634456,
  0.15887706981787916,
  0.14821302712491213],
 'val_loss': [0.2490084861090384,
  0.1885258686248915,
  0.16251268243071995,
  0.15756557339689276,
  0.14840370966162816]}

In [44]:
score = model2.evaluate(test_padded_words, np.array(test_padded_tags), verbose=0)
#prediction = model2.predict(test_padded_words, verbose = 1)

In [45]:
model2_pred = model2.predict(test_padded_words, verbose = 1)



In [46]:
score

0.14923693049540612

In [71]:
model2_pred.shape

(9592, 105, 22)

In [106]:
model2_pred[0][0] #this is for 1 word

array([2.4473278e-02, 7.0979816e-08, 1.3921768e-02, 2.7304003e-03,
       8.9763671e-01, 9.1587013e-04, 9.8175427e-04, 3.5263315e-02,
       3.2889592e-03, 2.5611741e-03, 8.7878219e-04, 6.1755502e-03,
       7.0179049e-03, 2.1207957e-03, 6.6361798e-04, 1.4992574e-05,
       5.6557218e-04, 1.7462524e-05, 7.7187718e-04, 6.2100099e-08,
       9.6338397e-08, 6.2948111e-08], dtype=float32)

In [104]:
model2_pred[0].shape #this is for 1 sentence (105 positions x 22 tags)

(105, 22)

In [97]:
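import numpy as np
# NOTE: max over axis=1 gives the highest score per tag across the 105 positions,
# not a predicted tag per word; np.argmax(model2_pred, axis=-1) would give tag indices.
model2_pred_tag = model2_pred.max(axis=1)[None,:]
model2_pred_tag[0]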

array([[2.44732779e-02, 7.09798158e-08, 3.15121710e-02, ...,
        6.21000993e-08, 9.63383968e-08, 6.29481107e-08],
       [1.67458244e-02, 2.50766430e-08, 9.88412797e-01, ...,
        2.15525322e-08, 3.60119365e-08, 2.21819114e-08],
       [3.06374636e-02, 3.06996348e-08, 9.82302487e-01, ...,
        2.65947833e-08, 4.38043379e-08, 2.98040099e-08],
       ...,
       [5.96012510e-02, 3.96474782e-08, 2.01744977e-02, ...,
        4.10766212e-08, 6.69945450e-08, 4.35406662e-08],
       [7.40099885e-03, 1.10055012e-07, 6.31961048e-01, ...,
        1.09961533e-07, 1.28268923e-07, 1.17650515e-07],
       [1.39994789e-02, 1.78495938e-08, 4.70103649e-03, ...,
        1.70730221e-08, 2.37048106e-08, 1.72141821e-08]], dtype=float32)
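
Note that `max(axis=1)` returns, for each tag, the largest probability found anywhere in the sentence, which is why the result still has 22 columns. To get one predicted tag index per word, `argmax` over the last axis is the usual choice. A minimal sketch on a toy array, with shapes shrunk to 2 sentences, 3 positions, and 4 tags for readability (`toy_pred` is a stand-in, not real model output):

```python
import numpy as np

# Toy stand-in for model2_pred: (sentences, positions, tags)
toy_pred = np.array([
    [[0.9, 0.05, 0.03, 0.02], [0.1, 0.7, 0.1, 0.1], [0.2, 0.2, 0.5, 0.1]],
    [[0.8, 0.1, 0.05, 0.05], [0.1, 0.1, 0.1, 0.7], [0.3, 0.4, 0.2, 0.1]],
], dtype=np.float32)

per_tag_max = toy_pred.max(axis=1)    # shape (2, 4): best score per tag, not per word
tag_idxs = toy_pred.argmax(axis=-1)   # shape (2, 3): one tag index per position

print(per_tag_max.shape, tag_idxs.shape)
print(tag_idxs)
```

Applied to `model2_pred`, the `argmax` version would give an array of shape `(9592, 105)` that can be compared directly with the integer tag indices.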

### Step 3. Pick a dataset

Pick a dataset that has short text, similar to the sentences you just tagged. Headlines and tweets are good choices.

https://www.kaggle.com/datasets?sortBy=relevance&group=public&search=news&page=1&pageSize=20&size=all&filetype=all&license=all

I got my test dataset from: https://www.kaggle.com/c/twitter-sentiment-analysis2/data

In [23]:
import pandas as pd
twitter = pd.read_csv("test.csv", encoding="latin1")
twitter['SentimentText'] = twitter['SentimentText'].map(lambda x: x.lstrip())
twitter.head(10)

Unnamed: 0,SentimentText
0,is so sad for my APL friend.............
1,I missed the New Moon trailer...
2,omg its already 7:30 :O
3,.. Omgaga. Im sooo im gunna CRy. I've been at...
4,i think mi bf is cheating on me!!! T_T
5,or i just worry too much?
6,Juuuuuuuuuuuuuuuuussssst Chillin!!
7,Sunny Again Work Tomorrow :-| TV...
8,handed in my uniform today . i miss you already
9,hmmmm.... i wonder how she my number @-)


Tokenize the text

In [24]:
from nltk.tokenize import word_tokenize

twitter['Test_Word'] = [word_tokenize(i) for i in twitter['SentimentText']]

twitter.head(10)

Unnamed: 0,SentimentText,Test_Word
0,is so sad for my APL friend.............,"[is, so, sad, for, my, APL, friend, ..., ..., ..."
1,I missed the New Moon trailer...,"[I, missed, the, New, Moon, trailer, ...]"
2,omg its already 7:30 :O,"[omg, its, already, 7:30, :, O]"
3,.. Omgaga. Im sooo im gunna CRy. I've been at...,"[.., Omgaga, ., Im, sooo, im, gunna, CRy, ., I..."
4,i think mi bf is cheating on me!!! T_T,"[i, think, mi, bf, is, cheating, on, me, !, !,..."
5,or i just worry too much?,"[or, i, just, worry, too, much, ?]"
6,Juuuuuuuuuuuuuuuuussssst Chillin!!,"[Juuuuuuuuuuuuuuuuussssst, Chillin, !, !]"
7,Sunny Again Work Tomorrow :-| TV...,"[Sunny, Again, Work, Tomorrow, :, -|, TV, Toni..."
8,handed in my uniform today . i miss you already,"[handed, in, my, uniform, today, ., i, miss, y..."
9,hmmmm.... i wonder how she my number @-),"[hmmmm, ..., ., i, wonder, how, she, my, numbe..."


Convert the tokens to numeric indices

In [25]:
def tokens_to_idxs(token_seqs, lexicon):
    # Map each token to its index; tokens missing from the lexicon map to '<UNK>'
    idx_seqs = [[lexicon[token] if token in lexicon else lexicon['<UNK>'] for token in token_seq]
                for token_seq in token_seqs]
    return idx_seqs

In [26]:
twitter['Word_Idxs'] = tokens_to_idxs(twitter['Test_Word'], words_lexicon)
twitter.head(10)

Unnamed: 0,SentimentText,Test_Word,Word_Idxs
0,is so sad for my APL friend.............,"[is, so, sad, for, my, APL, friend, ..., ..., ...","[101, 2852, 22491, 82, 7102, 1, 4719, 15526, 1..."
1,I missed the New Moon trailer...,"[I, missed, the, New, Moon, trailer, ...]","[1, 3222, 24, 1, 1, 25323, 15526]"
2,omg its already 7:30 :O,"[omg, its, already, 7:30, :, O]","[1, 255, 2266, 1, 4950, 1]"
3,.. Omgaga. Im sooo im gunna CRy. I've been at...,"[.., Omgaga, ., Im, sooo, im, gunna, CRy, ., I...","[25231, 1, 22, 1, 1, 1, 1, 1, 22, 1, 20263, 40..."
4,i think mi bf is cheating on me!!! T_T,"[i, think, mi, bf, is, cheating, on, me, !, !,...","[674, 12218, 1, 1, 101, 27799, 181, 4755, 4244..."
5,or i just worry too much?,"[or, i, just, worry, too, much, ?]","[90, 674, 4275, 4476, 1935, 1380, 15000]"
6,Juuuuuuuuuuuuuuuuussssst Chillin!!,"[Juuuuuuuuuuuuuuuuussssst, Chillin, !, !]","[1, 1, 4244, 4244]"
7,Sunny Again Work Tomorrow :-| TV...,"[Sunny, Again, Work, Tomorrow, :, -|, TV, Toni...","[1, 1, 1, 1, 4950, 1, 1, 1]"
8,handed in my uniform today . i miss you already,"[handed, in, my, uniform, today, ., i, miss, y...","[2137, 67, 7102, 6180, 701, 22, 674, 6681, 382..."
9,hmmmm.... i wonder how she my number @-),"[hmmmm, ..., ., i, wonder, how, she, my, numbe...","[1, 15526, 22, 674, 17074, 1916, 238, 7102, 10..."
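
Many tokens in this output map to index 1, which appears to be the `<UNK>` entry. Capitalised words such as `I` and `New` are among them, and the training sentences shown earlier are lowercased, so lowercasing tokens before the lookup may recover some of them. The sketch below uses a tiny hypothetical lexicon (`toy_lexicon`, with made-up indices) rather than `words_lexicon`, and `tokens_to_idxs_lower` and `unk_rate` are illustrative helpers, not part of the notebook.

```python
def tokens_to_idxs_lower(token_seqs, lexicon):
    """Like tokens_to_idxs, but lowercases each token before the lookup."""
    unk = lexicon['<UNK>']
    return [[lexicon.get(token.lower(), unk) for token in seq] for seq in token_seqs]

def unk_rate(idx_seqs, unk_idx):
    """Fraction of all tokens that were mapped to the unknown-word index."""
    total = sum(len(seq) for seq in idx_seqs)
    unknown = sum(idx == unk_idx for seq in idx_seqs for idx in seq)
    return unknown / total if total else 0.0

# Hypothetical lexicon: only lowercase entries, as in the training data
toy_lexicon = {'<UNK>': 1, 'i': 674, 'missed': 3222, 'the': 24, 'new': 150}
toy_tokens = [['I', 'missed', 'the', 'New', 'Moon']]

# Case-sensitive lookup (what the cell above does) versus lowercased lookup
cased = [[toy_lexicon.get(t, toy_lexicon['<UNK>']) for t in seq] for seq in toy_tokens]
lowered = tokens_to_idxs_lower(toy_tokens, toy_lexicon)

print(unk_rate(cased, 1), unk_rate(lowered, 1))
```

Comparing the two rates on the real tweets would show how much of the `<UNK>` problem is casing and how much is genuinely out-of-vocabulary slang.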


Pad the sequences

In [51]:
from keras.preprocessing.sequence import pad_sequences

def pad_idx_seqs(idx_seqs, max_seq_len):
    # Keras provides a convenient padding function (zeros are added on the left by default)
    padded_idxs = pad_sequences(sequences=idx_seqs, maxlen=max_seq_len)
    return padded_idxs

In [52]:
#max_seq_len = max([len(idx_seq) for idx_seq in test['Word_Idxs']]) # Get length of longest sequence
max_seq_len = 104  # hard-coded so the padded length (104 + 1 = 105) matches the train/test arrays
twitter_padded_words = pad_idx_seqs(twitter['Word_Idxs'], 
                                  max_seq_len + 1) #Add one to max length for offsetting sequence by 1

print("WORDS:\n", twitter_padded_words)
print("SHAPE:", twitter_padded_words.shape, "\n")

WORDS:
 [[    0     0     0 ... 15526 15526    22]
 [    0     0     0 ...     1 25323 15526]
 [    0     0     0 ...     1  4950     1]
 ...
 [    0     0     0 ...  4146 12183   339]
 [    0     0     0 ...  4401   772   979]
 [    0     0     0 ... 20332    35   895]]
SHAPE: (299989, 105) 
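
By default `pad_sequences` pads with zeros on the left and, when a sequence is longer than `maxlen`, drops items from the front (`padding='pre'`, `truncating='pre'`). A pure-Python sketch of that default behaviour, for illustration only (`pad_pre` is a hypothetical name, not a Keras function):

```python
def pad_pre(idx_seqs, max_len, value=0):
    """Left-pad each sequence to max_len; keep only the last max_len items if longer."""
    padded = []
    for seq in idx_seqs:
        seq = list(seq)[-max_len:]                      # truncate from the front
        padded.append([value] * (max_len - len(seq)) + seq)
    return padded

print(pad_pre([[1, 3222, 24], [5, 6, 7, 8, 9, 10]], max_len=5))
```

A tweet longer than 105 tokens would therefore lose its beginning, not its end.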



### Step 4. Tag your new data!

Modify the **ent_tagger function**, which combined words and tags from your original dataset, so that it can also load new text from your new dataset and output the tags predicted by your trained model alongside the text. Make your function load five random texts from your data and output the tagged text.

Predict tags for the new data

In [53]:
twitter_pred = model2.predict(twitter_padded_words, verbose = 1)



In [54]:
twitter_pred

array([[[5.26923593e-03, 4.66981120e-09, 4.07995796e-03, ...,
         6.81531453e-09, 8.15907342e-09, 6.04837336e-09],
        [5.26923593e-03, 4.66981120e-09, 4.07995796e-03, ...,
         6.81531453e-09, 8.15907342e-09, 6.04837336e-09],
        [5.26923593e-03, 4.66981120e-09, 4.07995796e-03, ...,
         6.81531453e-09, 8.15907342e-09, 6.04837336e-09],
        ...,
        [3.88303215e-06, 5.47951728e-09, 3.70813505e-04, ...,
         4.80038764e-09, 6.59133059e-09, 3.94181354e-09],
        [2.59347144e-06, 1.59388525e-09, 1.51832166e-04, ...,
         1.56488245e-09, 1.71828063e-09, 1.08578335e-09],
        [8.32022735e-08, 5.84924427e-12, 6.76375930e-05, ...,
         7.90559371e-12, 7.14522885e-12, 4.02597790e-12]],

       [[4.94878665e-02, 3.62786295e-06, 9.20156315e-02, ...,
         3.90264358e-06, 4.77561707e-06, 3.80467350e-06],
        [4.94878665e-02, 3.62786295e-06, 9.20156315e-02, ...,
         3.90264358e-06, 4.77561707e-06, 3.80467350e-06],
        [4.94878665e-02, 

In [None]:
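# twitter_pred already holds per-tag probabilities, so take the argmax over the tag axis
twitter_pred_tag = np.argmax(twitter_pred, axis=-1)  # shape: (n_tweets, 105)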
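
One way to finish Step 4 is to wrap the steps above (tokenize, index, pad, predict, argmax) in a single function that returns `(word, tag)` pairs in the same shape as the `ent_tagger` output. The sketch below is an outline under stated assumptions, not the graded solution: `tag_texts`, `predict_fn`, and `idx_to_tag` are hypothetical names, tokenization is a plain `split()` instead of `word_tokenize` to keep the block self-contained, tokens are lowercased on the assumption that the lexicon is lowercase, and a dummy predictor stands in for `model2.predict`.

```python
import random
import numpy as np

def tag_texts(texts, lexicon, idx_to_tag, predict_fn, max_len=105):
    """Return one [(word, tag), ...] list per input text."""
    token_seqs = [text.split()[-max_len:] for text in texts]  # keep the last max_len tokens
    unk = lexicon['<UNK>']
    padded = np.zeros((len(texts), max_len), dtype='int32')   # 0 = padding, on the left
    for row, tokens in enumerate(token_seqs):
        if tokens:
            padded[row, -len(tokens):] = [lexicon.get(t.lower(), unk) for t in tokens]
    probs = predict_fn(padded)                 # expected shape: (n, max_len, n_tags)
    tag_idxs = np.argmax(probs, axis=-1)       # one tag index per position
    tagged = []
    for tokens, row in zip(token_seqs, tag_idxs):
        tags = row[max_len - len(tokens):]     # drop the left padding
        tagged.append([(tok, idx_to_tag[int(i)]) for tok, i in zip(tokens, tags)])
    return tagged

# Hypothetical lexicon and tag mapping, for demonstration only
toy_lexicon = {'<UNK>': 1, 'today': 701, 'i': 674}
toy_idx_to_tag = {0: '<PAD>', 1: 'O', 2: 'B-tim'}

def dummy_predict(padded):
    """Stand-in for model2.predict: tags index 701 as B-tim, everything else as O."""
    probs = np.zeros(padded.shape + (3,), dtype='float32')
    probs[..., 1] = 1.0
    probs[padded == 701] = [0.0, 0.0, 1.0]
    return probs

texts = ['handed in my uniform today', 'i miss you already', 'or i just worry too much?']
random.seed(0)
for tagged in tag_texts(random.sample(texts, k=2), toy_lexicon, toy_idx_to_tag, dummy_predict):
    print(tagged)
```

In the notebook, the call would presumably pass `random.sample(list(twitter['SentimentText']), 5)`, `words_lexicon`, the inverse of your tag-to-index dictionary, and `model2.predict`.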

### Result:

In [36]:
twitter[8:9]

Unnamed: 0,SentimentText,Test_Word,Word_Idxs
8,handed in my uniform today . i miss you already,"[handed, in, my, uniform, today, ., i, miss, you, already]","[2137, 67, 7102, 6180, 701, 22, 674, 6681, 3820, 2266]"


In [39]:
from IPython.display import HTML, display
import tabulate
import pandas as pd
pd.set_option('display.max_colwidth', None)  # None = no truncation (-1 is deprecated since pandas 1.0)

result_tbl = [["Twitter", "Actual Words", "Predicted Tags"],
            # the tag string below is entered as a literal, not read from twitter_pred
            [twitter['SentimentText'][8:9], twitter['Test_Word'][8:9], "O O O O B-tim O O O O O"]
    ]

display(HTML(tabulate.tabulate(result_tbl, tablefmt='html')))

0,1,2
Twitter,Actual Words,Predicted Tags
"8 handed in my uniform today . i miss you already Name: SentimentText, dtype: object","8 [handed, in, my, uniform, today, ., i, miss, you, already] Name: Test_Word, dtype: object",O O O O B-tim O O O O O


##### Most of the tags are probably O, which can lead to class imbalance in the labels.
##### But I'm glad my model predicted the word "today" as "B-tim" correctly!!
##### Which means the model works at least some of the time!!
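
To check the imbalance concern, it helps to count how often each tag occurs. A minimal sketch with `collections.Counter` on a toy list of tag sequences; `tag_distribution` is a hypothetical helper, and in the notebook the input would be a column such as `Tokenized_Tags`.

```python
from collections import Counter

def tag_distribution(tag_seqs):
    """Return each tag's share of all tokens, most frequent tag first."""
    counts = Counter(tag for seq in tag_seqs for tag in seq)
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.most_common()}

# Toy tag sequences, not real data
toy_tags = [['O', 'O', 'O', 'O', 'B-tim'], ['O', 'B-per', 'I-per', 'O', 'O']]
for tag, share in tag_distribution(toy_tags).items():
    print(f"{tag}: {share:.0%}")
```

If O dominates, the loss alone says little about entity tags, so per-tag precision and recall are a more informative check.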

#### the end...