For this demo, we will use the [MIT Restaurant Corpus](https://groups.csail.mit.edu/sls/downloads/restaurant/) -- a dataset of transcriptions of spoken utterances about restaurants.

The dataset uses the following entity tags:

* 'B-Rating'
* 'I-Rating'
* 'B-Amenity'
* 'I-Amenity'
* 'B-Location'
* 'I-Location'
* 'B-Restaurant_Name'
* 'I-Restaurant_Name'
* 'B-Price'
* 'I-Price'
* 'B-Hours'
* 'I-Hours'
* 'B-Dish'
* 'I-Dish'
* 'B-Cuisine'
* 'I-Cuisine'
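
These tags follow the BIO convention: a `B-` prefix marks the first token of an entity, an `I-` prefix marks a continuation token, and tokens outside any entity are tagged `O` (as the label files show). As a small sketch, the distinct entity types can be recovered by stripping the prefixes; the variable names are illustrative only.

```python
# The 16 tags listed above.
tags = [
    'B-Rating', 'I-Rating', 'B-Amenity', 'I-Amenity',
    'B-Location', 'I-Location', 'B-Restaurant_Name', 'I-Restaurant_Name',
    'B-Price', 'I-Price', 'B-Hours', 'I-Hours',
    'B-Dish', 'I-Dish', 'B-Cuisine', 'I-Cuisine',
]

# Drop the "B-" / "I-" prefix to get the underlying entity type.
entity_types = sorted({tag.split('-', 1)[1] for tag in tags})
print(len(entity_types), entity_types)
```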

Let us load the dataset and see what we are working with.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
label_test_file = "/content/drive/MyDrive/Datasets/label_test.txt"
label_train_file = "/content/drive/MyDrive/Datasets/label_train.txt"
sent_test_file = "/content/drive/MyDrive/Datasets/sent_test.txt"
sent_train_file = "/content/drive/MyDrive/Datasets/sent_train.txt"

In [4]:
with open(sent_train_file, 'r') as train_sent_file:
  train_sentences = train_sent_file.readlines()

with open(label_train_file, 'r') as train_labels_file:
  train_labels = train_labels_file.readlines()

with open(sent_test_file, 'r') as test_sent_file:
  test_sentences = test_sent_file.readlines()

with open(label_test_file, 'r') as test_labels_file:
  test_labels = test_labels_file.readlines()


Let us see some example data points.

In [6]:
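# Note: this pairs a training sentence with a test label line, so the tags
# shown below do not align with the tokens (6 tokens vs. 7 tags).
# Use train_labels[0] to see the labels that belong to this sentence.
train_sentences[0], test_labels[0]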

('2 start restaurants with inside dining \n',
 'O B-Rating I-Rating O B-Location I-Location B-Amenity \n')

In [8]:
# Print the 6th sentence in the test set i.e. index value 5.

sixth_sentence = test_sentences[5]
print(sixth_sentence)

# Print the labels of this sentence

test_labels[5]



any good ice cream parlors around 



'O B-Rating B-Cuisine I-Cuisine I-Cuisine B-Location \n'
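
To read a sentence and its labels side by side, the two lines can be split on whitespace and zipped together. The sketch below uses the sentence and label line printed above, hard-coded so that it runs on its own.

```python
# Sentence and label line copied from the output above.
sentence = "any good ice cream parlors around \n"
labels = "O B-Rating B-Cuisine I-Cuisine I-Cuisine B-Location \n"

tokens = sentence.split()
tags = labels.split()

# Each token must have exactly one tag.
assert len(tokens) == len(tags)

pairs = list(zip(tokens, tags))
for token, tag in pairs:
    print(f"{token:10s} {tag}")
```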

#Defining Features for Custom NER

First, let us install the required modules.

In [9]:
# Install the pycrf and sklearn-crfsuite packages using pip

! pip install pycrf
! pip install sklearn.crfsuite


Collecting pycrf
  Downloading pycrf-0.0.1.tar.gz (1.1 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pycrf
  Building wheel for pycrf (setup.py) ... done
  Created wheel for pycrf: filename=pycrf-0.0.1-py3-none-any.whl size=1870 sha256=fc42204229809dae76368ec3e2db86cb91037151a30faba1b75f176d21624608
  Stored in directory: /root/.cache/pip/wheels/e3/d2/c9/ba15b05ba596e2eafeb83c2903e79d634207367555aae8c7d2
Successfully built pycrf
Installing collected packages: pycrf
Successfully installed pycrf-0.0.1
Collecting sklearn.crfsuite
  Downloading sklearn_crfsuite-0.5.0-py2.py3-none-any.whl.metadata (4.9 kB)
Collecting python-crfsuite>=0.9.7 (from sklearn.crfsuite)
  Downloading python_crfsuite-0.9.11-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.3 kB)
Downloading sklearn_crfsuite-0.5.0-py2.py3-none-any.whl (10 kB)
Downloading python_crfsuite-0.9.11-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.wh



We will now start with computing features for our input sequences.

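We define the following features for each word when building the CRF model:

- f1 = the word in lower case
- f2 = the last 3 characters of the word
- f3 = the last 2 characters of the word
- f4 = 1 if the word is entirely in upper case; otherwise, 0
- f5 = 1 if the word is a number; otherwise, 0
- f6 = 1 if the word starts with a capital letter; otherwise, 0

The function below encodes the flags f4 to f6 as `True`/`False` strings. It also adds the lower-case form, digit flag, and capitalization flag of the previous word, plus a `BEG` marker for the first word of a sentence and an `END` marker for the last.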


In [11]:
#Define a function to get the above defined features for a word.

def getFeaturesForOneWord(sentence, pos):
  word = sentence[pos]

  features = [
      'word.lower=' + word.lower(),
      'word[-3:]=' + word[-3:],
      'word[-2:]=' + word[-2:],
      'word.isupper=%s' % word.isupper(),
      'word.isdigit=%s' % word.isdigit(),
      'word.startsWithCapital=%s' % word[0].isupper()
  ]

  if pos > 0:
    prev_word = sentence[pos-1]
    features.extend([
        'prev_word.lower=' + prev_word.lower(),
        'prev_word.isdigit=%s' % prev_word.isdigit(),
        'prev_word.startsWithCapital=%s' % prev_word[0].isupper()
    ])
  else:
    features.append('BEG')

  if pos == len(sentence)-1:
    features.append('END')

  return features


#Computing Features

Next, define a function that builds the features for a whole sentence by calling 'getFeaturesForOneWord' on each word.

In [12]:
# Define a function to get features for a sentence
# using the 'getFeaturesForOneWord' function.

def getFeaturesForOneSentence(sentence):
  sentence_list = sentence.split()
  return [getFeaturesForOneWord(sentence_list, pos) for pos in range(len(sentence_list))]


Define a function to get the labels for a sentence.

In [13]:
# Define a function to get the labels for a sentence.
def getLabelsInListForOneSentence(labels):
  return labels.split()

Example features for a sentence


In [15]:
# Apply 'getFeaturesForOneSentence' to the sentence at index 5 of train_sentences
# and show the features of its third word (index 2).
print(train_sentences[5])
features = getFeaturesForOneSentence(train_sentences[5])
features[2]


a place that serves soft serve ice cream 



['word.lower=that',
 'word[-3:]=hat',
 'word[-2:]=at',
 'word.isupper=False',
 'word.isdigit=False',
 'word.startsWithCapital=False',
 'prev_word.lower=place',
 'prev_word.isdigit=False',
 'prev_word.startsWithCapital=False']
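
The word shown above sits in the middle of the sentence, so neither boundary marker appears in its features. The sketch below repeats a copy of `getFeaturesForOneWord` so that it runs on its own, then looks at the first and last words of the same sentence to show where `BEG` and `END` are added.

```python
def getFeaturesForOneWord(sentence, pos):
    # Copy of the function defined earlier, repeated so this block is standalone.
    word = sentence[pos]
    features = [
        'word.lower=' + word.lower(),
        'word[-3:]=' + word[-3:],
        'word[-2:]=' + word[-2:],
        'word.isupper=%s' % word.isupper(),
        'word.isdigit=%s' % word.isdigit(),
        'word.startsWithCapital=%s' % word[0].isupper(),
    ]
    if pos > 0:
        prev_word = sentence[pos - 1]
        features.extend([
            'prev_word.lower=' + prev_word.lower(),
            'prev_word.isdigit=%s' % prev_word.isdigit(),
            'prev_word.startsWithCapital=%s' % prev_word[0].isupper(),
        ])
    else:
        features.append('BEG')  # first word: no previous-word features
    if pos == len(sentence) - 1:
        features.append('END')  # last word of the sentence
    return features

tokens = "a place that serves soft serve ice cream".split()
first = getFeaturesForOneWord(tokens, 0)
last = getFeaturesForOneWord(tokens, len(tokens) - 1)
print(first)
print(last)
```

Note that a one-word sentence receives both markers, since the two checks are independent.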

Compute the features for the training and test sentences (X_train, X_test) and the corresponding label lists (Y_train, Y_test).

In [16]:
X_train = [getFeaturesForOneSentence(sentence) for sentence in train_sentences]
Y_train = [getLabelsInListForOneSentence(labels) for labels in train_labels]

X_test = [getFeaturesForOneSentence(sentence) for sentence in test_sentences]
Y_test = [getLabelsInListForOneSentence(labels) for labels in test_labels]
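
CRF training expects each sentence's feature list and label list to have the same length, so a quick sanity check before fitting can catch misaligned lines. In the sketch below, `find_misaligned` is a hypothetical helper (not part of any library), and the toy data stands in for `X_train` and `Y_train`.

```python
def find_misaligned(X, Y):
    """Return indices of sentences whose feature and label counts differ."""
    return [i for i, (x, y) in enumerate(zip(X, Y)) if len(x) != len(y)]

# Toy stand-ins for X_train / Y_train (made up for illustration).
X_toy = [
    [['word.lower=any'], ['word.lower=good']],  # two words
    [['word.lower=pizza']],                     # one word
]
Y_toy = [
    ['O', 'B-Rating'],  # two labels: aligned
    ['B-Dish', 'O'],    # two labels for one word: misaligned
]

print(find_misaligned(X_toy, Y_toy))  # → [1]
```

On the real data, `find_misaligned(X_train, Y_train)` should return an empty list.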

#CRF Model Training

Now we have all the information we need to train our CRF. Let us see how we can do that.

In [17]:
import sklearn_crfsuite

from sklearn_crfsuite import metrics

We create a CRF object and pass the training data to it. The model then "trains" and learns the weights for the feature functions.

In [19]:
# Build the CRF model.

crf = sklearn_crfsuite.CRF(max_iterations=100)
crf.fit(X_train, Y_train)


#Model Testing and Evaluation
The model is trained; let us now see how well it performs on the test data.

In [20]:
# Calculate the f1 score using the test data

Y_pred = crf.predict(X_test)
metrics.flat_f1_score(Y_test, Y_pred, average='weighted')


0.8792035513463958
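
`flat_f1_score` flattens the per-sentence label lists into a single token-level list before scoring. With `average='weighted'`, the per-label F1 scores are averaged with weights proportional to each label's number of true occurrences. The pure-Python sketch below illustrates that computation on made-up data; `weighted_flat_f1` is a hypothetical name, and this is an illustration of the definition rather than the library's implementation.

```python
from collections import Counter

def weighted_flat_f1(y_true, y_pred):
    """Token-level F1 per label, averaged with weights equal to label support."""
    true_flat = [t for seq in y_true for t in seq]
    pred_flat = [p for seq in y_pred for p in seq]

    support = Counter(true_flat)  # number of true occurrences per label
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(true_flat, pred_flat):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted label p was wrong
            fn[t] += 1  # true label t was missed

    total = 0.0
    for label, n in support.items():
        denom = 2 * tp[label] + fp[label] + fn[label]
        f1 = 2 * tp[label] / denom if denom else 0.0
        total += n * f1
    return total / len(true_flat)

# Made-up example: one 'B-Rating' token is mispredicted as 'O'.
toy_true = [['O', 'B-Rating'], ['O', 'O']]
toy_pred = [['O', 'O'], ['O', 'O']]
score = weighted_flat_f1(toy_true, toy_pred)
print(round(score, 4))
```

Because the weights follow label frequency, very frequent labels have the largest influence on this score, so it is worth reading alongside per-label results.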

In [26]:
# Print the original labels and predicted labels for the sentence in the test data at index 10.
print(test_sentences[10])
print(Y_test[10])
print(Y_pred[10])


any places around here that has a nice view 

['O', 'O', 'B-Location', 'I-Location', 'O', 'O', 'O', 'B-Amenity', 'I-Amenity']
['O', 'O', 'B-Location', 'I-Location', 'O', 'O', 'O', 'B-Amenity', 'I-Amenity']
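
Token-level tags are easier to interpret once they are grouped into entity spans. The sketch below does this for the sentence and labels printed above; `extract_entities` is a hypothetical helper written for illustration, and it treats a stray `I-` tag (one without a matching preceding `B-`) as the start of a new entity.

```python
def extract_entities(tokens, tags):
    """Group BIO tags into (entity_type, text) spans."""
    entities = []
    current_type, current_tokens = None, []

    def flush():
        # Store the entity collected so far, if any.
        if current_tokens:
            entities.append((current_type, ' '.join(current_tokens)))

    for token, tag in zip(tokens, tags):
        starts_new = tag.startswith('B-') or (
            tag.startswith('I-') and tag[2:] != current_type
        )
        if starts_new:
            flush()
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith('I-'):
            current_tokens.append(token)  # continue the current entity
        else:  # 'O' closes any open entity
            flush()
            current_type, current_tokens = None, []
    flush()
    return entities

# Sentence and labels copied from the output above.
tokens = "any places around here that has a nice view".split()
tags = ['O', 'O', 'B-Location', 'I-Location', 'O', 'O', 'O', 'B-Amenity', 'I-Amenity']
entities = extract_entities(tokens, tags)
print(entities)
```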


#Transitions Learned by CRF

In [27]:
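# `util` is a local helper module (util.py) that is not included here;
# it must be on the Python path for these imports to succeed.
from util import print_top_likely_transitions
from util import print_top_unlikely_transitions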

ModuleNotFoundError: No module named 'util'

In [28]:
print_top_likely_transitions(crf.transition_features_)

NameError: name 'print_top_likely_transitions' is not defined

In [None]:
print_top_unlikely_transitions(crf.transition_features_)
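
The `util` module imported above is not available in this notebook, which is why the calls fail. Its contents are not shown, so the following is only a guess at what such helpers might do: rank the entries of `crf.transition_features_`, a dictionary mapping `(label_from, label_to)` pairs to learned weights, and print the highest or lowest ones. The function bodies, the `n` parameter, and the toy weights are all assumptions for illustration.

```python
def top_transitions(transition_features, n=10, likely=True):
    """Return the n highest-weight (likely) or lowest-weight (unlikely) transitions."""
    ranked = sorted(
        transition_features.items(),
        key=lambda item: item[1],
        reverse=likely,
    )
    return ranked[:n]

def print_transitions(transitions):
    for (label_from, label_to), weight in transitions:
        print(f"{label_from:18s} -> {label_to:18s} {weight:.4f}")

# Hypothetical stand-ins for the missing util helpers.
def print_top_likely_transitions(transition_features, n=10):
    print_transitions(top_transitions(transition_features, n, likely=True))

def print_top_unlikely_transitions(transition_features, n=10):
    print_transitions(top_transitions(transition_features, n, likely=False))

# Made-up weights; with a trained model, pass crf.transition_features_ instead.
toy_weights = {
    ('B-Location', 'I-Location'): 2.0,
    ('O', 'O'): 0.5,
    ('O', 'I-Location'): -1.5,
}
print_top_likely_transitions(toy_weights, n=2)
print_top_unlikely_transitions(toy_weights, n=1)
```

In a well-trained BIO tagger, one would expect transitions such as `B-X -> I-X` to carry high weights and transitions such as `O -> I-X` to carry low ones.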