For this demo, we will use the [MIT Restaurant Corpus](https://groups.csail.mit.edu/sls/downloads/restaurant/) -- a dataset of transcriptions of spoken utterances about restaurants.

The dataset has the following entity types:

* 'B-Rating'
* 'I-Rating'
* 'B-Amenity'
* 'I-Amenity'
* 'B-Location'
* 'I-Location'
* 'B-Restaurant_Name'
* 'I-Restaurant_Name'
* 'B-Price'
* 'B-Hours'
* 'I-Hours'
* 'B-Dish'
* 'I-Dish'
* 'B-Cuisine'
* 'I-Price'
* 'I-Cuisine'

Let us load the dataset and see what we are working with.

In [292]:
with open('../data/sent_train', 'r') as train_sent_file:
  train_sentences = train_sent_file.readlines()

with open('../data/label_train', 'r') as train_labels_file:
  train_labels = train_labels_file.readlines()

with open('../data/sent_test', 'r') as test_sent_file:
  test_sentences = test_sent_file.readlines()

with open('../data/label_test', 'r') as test_labels_file:
  test_labels = test_labels_file.readlines()


Let us see some example data points.

In [293]:
# Print the 6th sentence in the test set i.e. index value 5.
print(test_sentences[5])

# Print the labels of this sentence
print(test_labels[5])

any good ice cream parlors around 

O B-Rating B-Cuisine I-Cuisine I-Cuisine B-Location 
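The labels follow the BIO convention: a `B-` tag opens an entity, `I-` tags continue it, and `O` marks tokens outside any entity. As a minimal sketch (the helper name `extract_entities` is an assumption introduced here, not part of the notebook), the tags for the sentence above can be grouped into entity spans like this:

```python
def extract_entities(tokens, tags):
    """Group BIO tags into (entity_type, text) spans."""
    entities = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new span, closing any open one.
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag continues the open span of the same type.
            current_tokens.append(token)
        else:
            # "O" (or a stray I- tag) closes the open span.
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        entities.append((current_type, " ".join(current_tokens)))
    return entities

example_sentence = "any good ice cream parlors around"
example_labels = "O B-Rating B-Cuisine I-Cuisine I-Cuisine B-Location"
print(extract_entities(example_sentence.split(), example_labels.split()))
# → [('Rating', 'good'), ('Cuisine', 'ice cream parlors'), ('Location', 'around')]
```

Stray `I-` tags that do not continue an open span are dropped in this sketch; other conventions treat them as the start of a new entity.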



# Defining Features for Custom NER

First, let us install the required modules.

In [294]:
# Install the pycrf and sklearn-crfsuite packages using pip
!pip install pycrf
!pip install sklearn-crfsuite





We will now start with computing features for our input sequences.

We have defined the following features for CRF model building:

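- f1 = the word in lower case;
- f2 = the last 3 characters of the word;
- f3 = the last 2 characters of the word;
- f4 = 1 if the word is in upper case; otherwise, 0;
- f5 = 1 if the word is a number; otherwise, 0;
- f6 = 1 if the word starts with a capital letter; otherwise, 0.

The lower-case, upper-case, digit, and capitalisation features are also computed for the previous word. The first word of a sentence has no previous word, so it gets a 'BEG' marker instead.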


In [295]:
#Define a function to get the above defined features for a word.
def getFeaturesForAWord(sentence, pos):
    word = sentence[pos]
    features = {
        'word.lower' : word.lower(), 
        'word[-3:]' : word[-3:],
        'word[-2:]' : word[-2:],   
        'word.isupper' : word.isupper(),
        'word.isidigt' : word.isdigit(),
        'words.startsWithCapital' : word[0].isupper()}

    if pos > 0:
        prev_word = sentence[pos-1]
        features.update({
            'prev_words.lower' : prev_word.lower(),
            'prev_words.isupper' : prev_word.isupper(),
            'prev_words.isdigit' : prev_word.isdigit(), 
            'prev_words.startsWithCapital' : prev_word[0].isupper()})
    else:
        features.update({'BEG' : True})

    return features

# Computing Features

Define a function to get features for a sentence using the already defined 'getFeaturesForAWord' function.

In [296]:
# Define a function to get features for a sentence
# using the 'getFeaturesForAWord' function.
def getFeaturesForASentence(sentence):
    sentence_words = sentence.split()
    return [getFeaturesForAWord(sentence_words, pos) for pos in range(len(sentence_words))]

Define a function to get the labels for a sentence.

In [297]:
# Define a function to get the labels for a sentence.
def getLabeslInListForASentence(labels):
    return labels.split()

Example features for a sentence


In [298]:
# Apply 'getFeaturesForASentence' to the sentence at index 5 in train_sentences
test = train_sentences[5]
print(test)

features = getFeaturesForASentence(test)
print(features)


a place that serves soft serve ice cream 

[{'word.lower': 'a', 'word[-3:]': 'a', 'word[-2:]': 'a', 'word.isupper': False, 'word.isidigt': False, 'words.startsWithCapital': False, 'BEG': True}, {'word.lower': 'place', 'word[-3:]': 'ace', 'word[-2:]': 'ce', 'word.isupper': False, 'word.isidigt': False, 'words.startsWithCapital': False, 'prev_words.lower': 'a', 'prev_words.isupper': False, 'prev_words.isdigit': False, 'prev_words.startsWithCapital': False}, {'word.lower': 'that', 'word[-3:]': 'hat', 'word[-2:]': 'at', 'word.isupper': False, 'word.isidigt': False, 'words.startsWithCapital': False, 'prev_words.lower': 'place', 'prev_words.isupper': False, 'prev_words.isdigit': False, 'prev_words.startsWithCapital': False}, {'word.lower': 'serves', 'word[-3:]': 'ves', 'word[-2:]': 'es', 'word.isupper': False, 'word.isidigt': False, 'words.startsWithCapital': False, 'prev_words.lower': 'that', 'prev_words.isupper': False, 'prev_words.isdigit': False, 'prev_words.startsWithCapital': False}, {

Compute the features for the training and test sentences (X_train, X_test) and split the corresponding label strings into lists (Y_train, Y_test).

In [299]:
type(train_sentences), len(train_sentences)
X_train = [getFeaturesForASentence(sentence) for sentence in train_sentences]
Y_train = [getLabeslInListForASentence(labels) for labels in train_labels]

X_test = [getFeaturesForASentence(sentence) for sentence in test_sentences]
Y_test = [getLabeslInListForASentence(labels) for labels in test_labels]
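A CRF needs exactly one label per token, so it can be worth checking that every feature sequence lines up with its label sequence before training. The following is a minimal sketch on toy data; `find_length_mismatches` is a hypothetical helper introduced here, and the toy lists stand in for `X_train` and `Y_train`:

```python
def find_length_mismatches(X, Y):
    """Return (index, n_features, n_labels) for every misaligned pair."""
    return [
        (i, len(x), len(y))
        for i, (x, y) in enumerate(zip(X, Y))
        if len(x) != len(y)
    ]

# Toy data standing in for X_train / Y_train.
toy_X = [[{"word.lower": "any"}, {"word.lower": "good"}], [{"word.lower": "pizza"}]]
toy_Y = [["O", "B-Rating"], ["B-Dish", "O"]]
print(find_length_mismatches(toy_X, toy_Y))  # → [(1, 1, 2)]
```

An empty result means every sentence has as many feature dictionaries as labels.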
    

In [300]:
print(X_train[0])
print()
print(Y_train[0])

[{'word.lower': '2', 'word[-3:]': '2', 'word[-2:]': '2', 'word.isupper': False, 'word.isidigt': True, 'words.startsWithCapital': False, 'BEG': True}, {'word.lower': 'start', 'word[-3:]': 'art', 'word[-2:]': 'rt', 'word.isupper': False, 'word.isidigt': False, 'words.startsWithCapital': False, 'prev_words.lower': '2', 'prev_words.isupper': False, 'prev_words.isdigit': True, 'prev_words.startsWithCapital': False}, {'word.lower': 'restaurants', 'word[-3:]': 'nts', 'word[-2:]': 'ts', 'word.isupper': False, 'word.isidigt': False, 'words.startsWithCapital': False, 'prev_words.lower': 'start', 'prev_words.isupper': False, 'prev_words.isdigit': False, 'prev_words.startsWithCapital': False}, {'word.lower': 'with', 'word[-3:]': 'ith', 'word[-2:]': 'th', 'word.isupper': False, 'word.isidigt': False, 'words.startsWithCapital': False, 'prev_words.lower': 'restaurants', 'prev_words.isupper': False, 'prev_words.isdigit': False, 'prev_words.startsWithCapital': False}, {'word.lower': 'inside', 'word[-3:

In [301]:
X_test[0], Y_test[0]

([{'word.lower': 'a',
   'word[-3:]': 'a',
   'word[-2:]': 'a',
   'word.isupper': False,
   'word.isidigt': False,
   'words.startsWithCapital': False,
   'BEG': True},
  {'word.lower': 'four',
   'word[-3:]': 'our',
   'word[-2:]': 'ur',
   'word.isupper': False,
   'word.isidigt': False,
   'words.startsWithCapital': False,
   'prev_words.lower': 'a',
   'prev_words.isupper': False,
   'prev_words.isdigit': False,
   'prev_words.startsWithCapital': False},
  {'word.lower': 'star',
   'word[-3:]': 'tar',
   'word[-2:]': 'ar',
   'word.isupper': False,
   'word.isidigt': False,
   'words.startsWithCapital': False,
   'prev_words.lower': 'four',
   'prev_words.isupper': False,
   'prev_words.isdigit': False,
   'prev_words.startsWithCapital': False},
  {'word.lower': 'restaurant',
   'word[-3:]': 'ant',
   'word[-2:]': 'nt',
   'word.isupper': False,
   'word.isidigt': False,
   'words.startsWithCapital': False,
   'prev_words.lower': 'star',
   'prev_words.isupper': False,
   'prev_wo

In [302]:
type(X_train)

list

# CRF Model Training

Now we have all the information we need to train our CRF. Let us see how we can do that.

In [303]:
import sklearn_crfsuite

from sklearn_crfsuite import metrics

In [304]:
type(X_train), type(Y_train)

(list, list)

We create a CRF object and pass the training data to it. The model then "trains" and learns the weights for the feature functions.
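For reference, the standard linear-chain CRF formulation makes "weights for feature functions" concrete. The symbols here are assumptions not used elsewhere in this notebook: $x$ is the token sequence, $y$ a label sequence of length $n$, $f_k$ the feature functions, $w_k$ their learned weights, and $Z(x)$ the normaliser over all candidate label sequences $y'$:

```latex
P(y \mid x) = \frac{1}{Z(x)} \exp\left( \sum_{i=1}^{n} \sum_{k} w_k \, f_k(y_{i-1}, y_i, x, i) \right),
\qquad
Z(x) = \sum_{y'} \exp\left( \sum_{i=1}^{n} \sum_{k} w_k \, f_k(y'_{i-1}, y'_i, x, i) \right)
```

Training chooses the weights $w_k$ that make the labelled training sequences most probable under this distribution.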

In [305]:
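# Build the CRF model.
crf = sklearn_crfsuite.CRF(max_iterations=100)
try:
    crf.fit(X_train, Y_train)
except AttributeError:
    # Some sklearn-crfsuite / scikit-learn version combinations can raise an
    # AttributeError at the end of fit(); it is ignored here on the
    # assumption that training itself has already completed.
    pass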

In [306]:
labels = list(crf.classes_)
labels.remove('O')
labels

['B-Rating',
 'I-Rating',
 'B-Amenity',
 'I-Amenity',
 'B-Location',
 'I-Location',
 'B-Restaurant_Name',
 'I-Restaurant_Name',
 'B-Price',
 'B-Hours',
 'I-Hours',
 'B-Dish',
 'I-Dish',
 'B-Cuisine',
 'I-Price',
 'I-Cuisine']

# Model Testing and Evaluation
The model is trained; let us now see how well it performs on the test data.

In [307]:
# Calculate the f1 score using the test data
Y_pred = crf.predict(X_test)
metrics.flat_f1_score(Y_test, Y_pred, average='weighted', labels=labels)

0.7918269093318624
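To make the weighted score less opaque, here is a minimal pure-Python sketch of the idea behind a support-weighted flat F1: flatten all sequences, compute F1 per label, then average with each label weighted by how often it occurs in the true tags. The function name `weighted_flat_f1` and the toy data are assumptions for illustration; this is not the library's implementation and may differ from it in edge cases.

```python
from collections import Counter

def weighted_flat_f1(y_true, y_pred, labels):
    """Support-weighted mean of per-label F1 over flattened tag sequences."""
    true_flat = [tag for seq in y_true for tag in seq]
    pred_flat = [tag for seq in y_pred for tag in seq]
    support = Counter(tag for tag in true_flat if tag in labels)

    total = 0.0
    for label in labels:
        pairs = list(zip(true_flat, pred_flat))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        total += f1 * support[label]

    n = sum(support.values())
    return total / n if n else 0.0

toy_true = [["O", "B-Hours", "I-Hours"], ["B-Dish", "O"]]
toy_pred = [["O", "B-Hours", "O"], ["B-Dish", "O"]]
toy_labels = ["B-Hours", "I-Hours", "B-Dish"]
print(weighted_flat_f1(toy_true, toy_pred, toy_labels))
```

In the toy data, two of the three labels are predicted perfectly and one is missed entirely, each with a support of 1, so the score is two thirds. Leaving 'O' out of `labels`, as the notebook does, keeps the dominant non-entity tag from inflating the score.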

In [308]:
# group B and I results
sorted_labels = sorted(
    labels,
    key=lambda name: (name[1:], name[0])
)
print(metrics.flat_classification_report(
    Y_test, Y_pred, labels=sorted_labels, digits=3
))

                   precision    recall  f1-score   support

        B-Amenity      0.778     0.692     0.733       533
        I-Amenity      0.699     0.744     0.721       524
        B-Cuisine      0.860     0.821     0.840       532
        I-Cuisine      0.752     0.630     0.685       135
           B-Dish      0.792     0.726     0.757       288
           I-Dish      0.618     0.669     0.643       121
          B-Hours      0.741     0.675     0.706       212
          I-Hours      0.851     0.888     0.869       295
       B-Location      0.872     0.846     0.859       812
       I-Location      0.800     0.850     0.825       788
          B-Price      0.836     0.743     0.786       171
          I-Price      0.652     0.652     0.652        66
         B-Rating      0.802     0.826     0.814       201
         I-Rating      0.861     0.696     0.770       125
B-Restaurant_Name      0.881     0.794     0.835       402
I-Restaurant_Name      0.809     0.735     0.770       



In [309]:
# Print the original and predicted labels for the test sentence at index 12.
idx = 12
print(test_sentences[idx])
print("Original", Y_test[idx])
print("Predicted", Y_pred[idx])

any restaurants open right now 

Original ['O', 'O', 'B-Hours', 'I-Hours', 'I-Hours']
Predicted ['O', 'O', 'B-Hours', 'I-Hours', 'I-Hours']


# Transitions Learned by CRF

In [310]:
from util import print_top_likely_transitions
from util import print_top_unlikely_transitions

In [311]:
print_top_likely_transitions(crf.transition_features_)

B-Amenity -> I-Amenity 6.688133
B-Restaurant_Name -> I-Restaurant_Name 6.598282
B-Location -> I-Location 6.313856
B-Dish -> I-Dish  5.968869
B-Hours -> I-Hours 5.876036
I-Restaurant_Name -> I-Restaurant_Name 5.753196
B-Cuisine -> I-Cuisine 5.654875
B-Rating -> I-Rating 5.600620
I-Amenity -> I-Amenity 5.540266
I-Location -> I-Location 5.518915
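The `util` module is not shown in this notebook, so the following is only a sketch of what such helpers might look like. It relies on `crf.transition_features_` being a dictionary that maps `(label_from, label_to)` pairs to learned weights; `top_transitions` and `print_transitions` are hypothetical names, and the toy weights are made up for illustration.

```python
from collections import Counter

def top_transitions(transition_features, n=10, likely=True):
    """Return the n highest- (or lowest-) weighted label transitions."""
    ranked = Counter(transition_features).most_common()
    return ranked[:n] if likely else ranked[-n:]

def print_transitions(transitions):
    """Print one 'from -> to weight' line per transition."""
    for (label_from, label_to), weight in transitions:
        print(f"{label_from} -> {label_to} {weight:.6f}")

# Toy weights; a fitted model would supply crf.transition_features_.
toy_weights = {
    ("B-Dish", "I-Dish"): 5.97,
    ("B-Dish", "I-Hours"): -2.10,
    ("O", "O"): 1.25,
}
print_transitions(top_transitions(toy_weights, n=1))
print_transitions(top_transitions(toy_weights, n=1, likely=False))
```

High positive weights, such as `B-X -> I-X`, reflect transitions the model has learned to favour. Strongly negative weights mark transitions it treats as unlikely, for example jumping from the start of one entity type into the inside of another.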


### Graded Questions

In [312]:
with open('../data/data.txt', 'r') as input_file:
    content = input_file.readlines()

sentences = content[0].split('.'); len(sentences)

187

In [313]:
import spacy # import spacy module
model = spacy.load("en_core_web_sm") #load pre-trained model

In [314]:
processed_docs = []
for sentence in sentences:
    processed_docs.append(model(sentence))

In [315]:
print(sentences[2]);processed_doc = processed_docs[2]

[ print(word.text, ' --- ', word.pos_, ' --- ', word.tag_) for word in processed_doc ];print()
[ print(ent.text, ' --- ', ent.label_) for ent in processed_doc.ents ]

 When the company had started the price of its stock was $10, which you can consider as the base price
   ---  SPACE  ---  _SP
When  ---  SCONJ  ---  WRB
the  ---  DET  ---  DT
company  ---  NOUN  ---  NN
had  ---  AUX  ---  VBD
started  ---  VERB  ---  VBN
the  ---  DET  ---  DT
price  ---  NOUN  ---  NN
of  ---  ADP  ---  IN
its  ---  PRON  ---  PRP$
stock  ---  NOUN  ---  NN
was  ---  AUX  ---  VBD
$  ---  SYM  ---  $
10  ---  NUM  ---  CD
,  ---  PUNCT  ---  ,
which  ---  PRON  ---  WDT
you  ---  PRON  ---  PRP
can  ---  AUX  ---  MD
consider  ---  VERB  ---  VB
as  ---  ADP  ---  IN
the  ---  DET  ---  DT
base  ---  NOUN  ---  NN
price  ---  NOUN  ---  NN

10  ---  MONEY


[None]

In [316]:
sentences[0]

"The stock price of a certain company was $100 a year ago, once this company came into boom phase then it's stock price rise to $200"

In [317]:
count = 0
price = 0
#print(len(sentences))
for sentence in sentences:
    processed_doc = model(sentence)
    #print(count, end=' - ')
    for ent in processed_doc.ents:
        if ent.label_ == "MONEY":
            count += 1
            price += int(ent.text)
            #print(ent.text,end=' - ')
        if ent.label_ == "DATE":
            print(ent.text)
    #print()

print(count, price, "average", price/count)

a year ago
[... 61 more identical lines ...]
248 23128 average 93.25806451612904
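The loop above calls `int(ent.text)`, which works only when every MONEY entity is a plain run of digits, as appears to be the case for this data. For text with thousands separators, decimals, or currency symbols, a more defensive parser may be needed. This is a minimal sketch; `parse_money` and `average_amount` are hypothetical helpers and the sample strings are invented:

```python
import re

def parse_money(text):
    """Extract a numeric amount from an entity string, or None if absent."""
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    if match is None:
        return None
    return float(match.group().replace(",", ""))

def average_amount(entity_texts):
    """Average the parseable amounts, skipping anything without digits."""
    amounts = [a for a in map(parse_money, entity_texts) if a is not None]
    return sum(amounts) / len(amounts) if amounts else None

print(average_amount(["100", "1,250", "$10.50", "a few dollars"]))  # → 453.5
```

Returning `None` for an empty list also avoids the division by zero that `price/count` would hit if no MONEY entity were found.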


In [318]:
X_train = [ "Bella is not giving money", 
            "Tony is teaching NLP to Chris", 
            "Joe should focus to his job", 
            "Peter is trained by Adele", 
            "Temperature is maintained by AC", 
            "He is playing the Cricket", 
            "Guitar has been played by him"]

# ‘P’, ‘A’ and ‘E’ stand for Person, Activity and Entity
Y_train = [ "P O O A E",  # ‘Bella (P) is not giving (A) money (E)’
            "P O A E O P", #‘Tony (P) is teaching (A) NLP (E) to Chris (P)’
            "P O A O O E", #‘Joe (P) should focus (A) to his job (E)’
            "P O A O P", #‘Peter (P) is trained (A) by Adele (P)’
            "E O A O E", #‘Temperature (E) is maintained (A) by AC (E)’
            "O O A O E", #‘He is playing (A) the Cricket (E)’
            "E O O A O V" #‘Guitar (E) has been played (A) by him(V)’
]
len(X_train), len(Y_train)

(7, 7)

In [319]:
from spacy import displacy
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_sm")

import pandas as pd

def getXXX(sent, labels):
    # Print the raw inputs and the parsed document types for reference.
    print(sent)
    print(labels)
    doc = nlp(sent)
    print(type(doc))
    processed_doc = model(sent)
    print(type(processed_doc))

    # Map each entity's text to its spaCy NER label.
    word_ents = {ent.text: ent.label_ for ent in processed_doc.ents}

    # Collect per-token attributes alongside the given labels.
    word, pos, tag, ner, dep, label = [], [], [], [], [], []
    for i, token in enumerate(processed_doc):
        word.append(token.text)
        pos.append(token.pos_)
        tag.append(token.tag_)
        ner.append(word_ents.get(token.text, ""))
        dep.append(token.dep_)
        label.append(labels[i])

    print(labels)
    df = pd.DataFrame(data={'words': word, 'pos': pos, 'tag': tag, 'ner': ner, 'dep': dep, 'labels': label})
    displacy.render(processed_doc, style="dep")
    return df

for c in range(len(X_train)):
    sent = X_train[c]
    labels = Y_train[c].split(' ')
    df = getXXX(sent, labels)
    break

Bella is not giving money
['P', 'O', 'O', 'A', 'E']
<class 'spacy.tokens.doc.Doc'>
<class 'spacy.tokens.doc.Doc'>
['P', 'O', 'O', 'A', 'E']


In [320]:
getXXX('Harry is not gardening as it is raining', "P O O A O O O A".split(' '))

Harry is not gardening as it is raining
['P', 'O', 'O', 'A', 'O', 'O', 'O', 'A']
<class 'spacy.tokens.doc.Doc'>
<class 'spacy.tokens.doc.Doc'>
['P', 'O', 'O', 'A', 'O', 'O', 'O', 'A']


       words    pos  tag     ner    dep labels
0      Harry  PROPN  NNP  PERSON  nsubj      P
1         is    AUX  VBZ           ROOT      O
2        not   PART   RB            neg      O
3  gardening   VERB  VBG          acomp      A
4         as  SCONJ   IN           mark      O
5         it   PRON  PRP          nsubj      O
6         is    AUX  VBZ            aux      O
7    raining   VERB  VBG          advcl      A


In [321]:
doc = nlp("Harry is not gardening as it is raining")
for tok in doc:
    print(tok.text,tok.dep_,end=' ; ')

displacy.render(doc, style="dep", jupyter = True)

Harry nsubj ; is ROOT ; not neg ; gardening acomp ; as mark ; it nsubj ; is aux ; raining advcl ; 

In [322]:
processed_doc = model('Forest');print(processed_doc, type(processed_doc))
for ent in processed_doc.ents:
    print(ent.text, ent.label_)

Forest <class 'spacy.tokens.doc.Doc'>


In [323]:
'''doc = nlp(passive[7])
displacy.render(doc, style='dep')'''

"doc = nlp(passive[7])\ndisplacy.render(doc, style='dep')"

In [324]:
processed_doc = model(X_train[0])
[ print(word.text, ' --- ', word.pos_, ' --- ', word.tag_) for word in processed_doc];print()
for ent in processed_doc.ents:
    print(ent.text, ent.label_)

Bella  ---  PROPN  ---  NNP
is  ---  AUX  ---  VBZ
not  ---  PART  ---  RB
giving  ---  VERB  ---  VBG
money  ---  NOUN  ---  NN

Bella PERSON


In [325]:
'''F1(X, xi, xi-1, i) = 1 if xi starts with a capital letter; otherwise, 0.
F2(X, xi, xi-1, i) = 1 if xi is the parent of the ‘nsubj’ and ‘nsubjpass’ dependency-tagged words AND the corresponding child words have the NER tag ‘E’ and ‘P’; otherwise, 0.
F3(X, xi, xi-1, i) = 1 if xi is the parent of the ‘dobj’ dependency-tagged words AND the corresponding child words have the NER tag ‘E’ and ‘P’; otherwise, 0.'''
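The three feature functions can be sketched without a parser by writing the dependency parse by hand. Everything here is an assumption made for illustration: the `Tok` structure, the helper names, the hand-written parse of the first toy sentence, and the reading of "‘E’ and ‘P’" as "tagged either E or P".

```python
from dataclasses import dataclass

@dataclass
class Tok:
    text: str
    dep: str    # dependency label of this token
    head: int   # index of the token's parent (the root points to itself)
    label: str  # P / A / E / O tag from the toy data

def children(tokens, i):
    """Tokens whose parent is the token at index i."""
    return [t for j, t in enumerate(tokens) if t.head == i and j != i]

def has_tagged_child(tokens, i, deps):
    """1 if some child has a dependency in deps and is tagged E or P."""
    return int(any(c.dep in deps and c.label in {"E", "P"}
                   for c in children(tokens, i)))

def f1(tokens, i):
    return int(tokens[i].text[0].isupper())

def f2(tokens, i):
    return has_tagged_child(tokens, i, {"nsubj", "nsubjpass"})

def f3(tokens, i):
    return has_tagged_child(tokens, i, {"dobj"})

# Hand-written parse of "Bella is not giving money" (not parser output).
toy_sent = [
    Tok("Bella", "nsubj", 3, "P"),
    Tok("is", "aux", 3, "O"),
    Tok("not", "neg", 3, "O"),
    Tok("giving", "ROOT", 3, "A"),
    Tok("money", "dobj", 3, "E"),
]
toy_rows = [(t.text, f1(toy_sent, i), f2(toy_sent, i), f3(toy_sent, i))
            for i, t in enumerate(toy_sent)]
print(toy_rows)
```

Under these assumptions only "Bella" fires F1, and only "giving" fires F2 and F3, since it is the parent of both the subject and the direct object.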

In [326]:
# Define a function to get the above defined features for a word.
def getFeaturesForAWord(sentence, pos):
    word = sentence[pos]
    features = [
        f'words.startsWithCapital={word[0].isupper()}',
        f'word.lower={word.lower()}',
        f'word[-3:]={word[-3:]}',
        f'word[-2:]={word[-2:]}',
        f'word.isupper={word.isupper()}',
        f'word.isdigit={word.isdigit()}',
    ]

    if pos > 0:
        # Features of the previous word.
        prev_word = sentence[pos-1]
        features.extend([
            f'prev_words.lower={prev_word.lower()}',
            f'prev_words.isupper={prev_word.isupper()}',
            f'prev_words.isdigit={prev_word.isdigit()}',
            f'prev_words.startsWithCapital={prev_word[0].isupper()}'])
    else:
        # Mark the first word of the sentence.
        features.append('BEG')

    return features

# Define a function to get features for a sentence
# using the 'getFeaturesForAWord' function.
def getFeaturesForASentence(sentence):
    sentence_words = sentence.split()
    return [getFeaturesForAWord(sentence_words, pos) for pos in range(len(sentence_words))]

In [327]:
# Build the CRF model.
crf = sklearn_crfsuite.CRF(max_iterations=100)
try:
    crf.fit(X_train, Y_train)
except AttributeError:
    pass

Y_pred = crf.predict(X_test)
metrics.flat_f1_score(Y_test, Y_pred, average='weighted')

id =12
print(test_sentences[id])
print("Original", Y_test[id])
print("Predicted", Y_pred[id])

print_top_likely_transitions(crf.transition_features_)
print_top_unlikely_transitions(crf.transition_features_)

ValueError: The numbers of items and labels differ: |x| = 25, |y| = 9
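The numbers in this error are consistent with raw strings being passed to `fit`: `X_train` and `Y_train` were reassigned to plain sentence and label strings in an earlier cell, the first sentence has 25 characters, and its label string has 9. A string is iterated character by character, so each character is treated as one item. The sketch below reproduces the counts and shows them lining up once both sides are split into tokens. The `featurize` function is a hypothetical stand-in for `getFeaturesForASentence`, and the variable names are introduced here for illustration.

```python
raw_sentences = ["Bella is not giving money"]
raw_labels = ["P O O A E"]

# Passing raw strings makes each character look like one item.
print(len(raw_sentences[0]), len(raw_labels[0]))  # → 25 9

def featurize(sentence):
    """Hypothetical stand-in: one feature list per whitespace token."""
    words = sentence.split()
    return [[f"word.lower={w.lower()}", f"word.istitle={w.istitle()}"]
            for w in words]

X_feats = [featurize(s) for s in raw_sentences]
Y_tags = [y.split() for y in raw_labels]

# After tokenising, items and labels line up one-to-one.
print(len(X_feats[0]), len(Y_tags[0]))  # → 5 5
```

Note also that `except AttributeError` does not catch a `ValueError`, and that `X_test` and `Y_test` still hold the restaurant data at this point, so the evaluation lines after `fit` would need matching test data as well.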