# Named Entity Recognition with MIT Restaurant Dataset

Your name: Nguyen Quang Anh

Student ID: BA10-002

**Due: 23:59 19/3/2023**

# Task Description

In this assignment, you will train an NER model using Conditional Random Fields (CRF) on the MIT Restaurant Dataset and report the accuracy of your model on the test dataset.

You will use the [MIT Restaurant Dataset](https://groups.csail.mit.edu/sls/downloads/restaurant/) to do the task.

# How to submit

- Attach the notebook file (.ipynb) and submit your work to Google Classroom
- Name your file as YourName_StudentID_Assignment4.ipynb, e.g., Nguyen_Van_A_ST099834_Assignment4.ipynb
- Write your name and student ID into this notebook
- Copying others' assignments is strictly prohibited.

# Install python-crfsuite

In [None]:
!pip install -q python-crfsuite


# Imports

In [None]:
from itertools import chain
import pycrfsuite

# Dataset

We will use the [MIT Restaurant Dataset](https://groups.csail.mit.edu/sls/downloads/restaurant/).

The data set is already in CoNLL format. We will use the [train](https://groups.csail.mit.edu/sls/downloads/restaurant/restauranttrain.bio) data to create the NER model and evaluate the model on the [test](https://groups.csail.mit.edu/sls/downloads/restaurant/restauranttest.bio) data.

In [None]:
%%capture
!rm -f restauranttrain.bio
!rm -f restauranttest.bio

!wget https://groups.csail.mit.edu/sls/downloads/restaurant/restauranttest.bio
!wget https://groups.csail.mit.edu/sls/downloads/restaurant/restauranttrain.bio

# Loading data (50 points)

In this part, you will load a data file into a list of sentences. Each sentence is a list of (word, tag) tuples.

For instance, the sentence below will be loaded into a list of (word, tag) tuples. Note that each line of the file holds the tag first, followed by the word:

```
O	a
B-Rating	four
I-Rating	star
O	restaurant
B-Location	with
I-Location	a
B-Amenity	bar
```

You will complete the function below

Here's the implementation of the `load_data` function to load the data from the specified file path and return it as a list of sentences, where each sentence is a list of tuples representing (word, tag) pairs:

In [None]:
def load_data(file_path):
    """Load data into a list of sentences.

    Args:
        file_path (str): Path to a BIO file with one "tag<TAB>word" pair per
            line and a blank line between sentences

    Returns:
        sentences: list of sentences, each a list of (word, tag) tuples
    """
    sentences = []
    with open(file_path, 'r') as file:
        sentence = []
        for line in file:
            line = line.strip()
            if line:
                tag, word = line.split()
                sentence.append((word, tag))
            elif sentence:  # blank line ends a sentence; ignore repeated blank lines
                sentences.append(sentence)
                sentence = []
        if sentence:
            sentences.append(sentence)
    return sentences

Here, we open the file at the specified path and loop through its lines. A non-empty line holds a tag followed by a word, separated by whitespace (a tab in this dataset); we split it and append the pair to the current sentence as a (word, tag) tuple. An empty line marks a sentence boundary, so we append the current sentence to the list of sentences and start a new one. Finally, after the loop, we append the last sentence if it still contains any words, which happens when the file does not end with an empty line.
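As a quick sanity check, the same parsing logic can be run on the sample sentence shown earlier without touching the downloaded files. This is only a sketch: `parse_bio_lines` is a hypothetical helper that mirrors `load_data` but reads from an in-memory list of lines instead of a file path.

```python
def parse_bio_lines(lines):
    """Parse 'tag<whitespace>word' lines into a list of sentences.

    Hypothetical helper for illustration; it mirrors load_data but takes
    an iterable of lines rather than a file path.
    """
    sentences, sentence = [], []
    for line in lines:
        line = line.strip()
        if line:
            tag, word = line.split()       # the file stores the tag first
            sentence.append((word, tag))   # but we keep (word, tag) order
        elif sentence:                     # blank line closes the sentence
            sentences.append(sentence)
            sentence = []
    if sentence:                           # last sentence without trailing blank line
        sentences.append(sentence)
    return sentences


# The sample sentence from the task description, followed by a blank line.
sample_lines = [
    "O\ta",
    "B-Rating\tfour",
    "I-Rating\tstar",
    "O\trestaurant",
    "B-Location\twith",
    "I-Location\ta",
    "B-Amenity\tbar",
    "",
]

parsed = parse_bio_lines(sample_lines)
print(parsed[0])
```

The single parsed sentence should match the first test sentence shown later in this notebook, with each pair in (word, tag) order.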

In [None]:
train_sents = load_data('restauranttrain.bio')
test_sents = load_data('restauranttest.bio')

Let's check the number of sentences in train and test data

In [None]:
len(train_sents)

7660

In [None]:
len(test_sents)

1521

In [None]:
train_sents[0]

[('2', 'B-Rating'),
 ('start', 'I-Rating'),
 ('restaurants', 'O'),
 ('with', 'O'),
 ('inside', 'B-Amenity'),
 ('dining', 'I-Amenity')]

# Features (50 points)

You can extract as many features as you want. You will implement the following basic features.

※ Of course, you can add more features.

*Word identity (lowercase)*

- Previous word identity
- Current word identity
- Next word identity
- Previous word and current word combination. Concatenate the previous word and the current word with '||'
- Current word and next word combination. Concatenate the two words with '||'

*Word shapes*

- Word prefix and suffix (4 characters)
- Whether the first character of the current word is a capital letter

**All you need to do is to complete the function `word2features`.**

Here's the implementation of the `word2features` function to extract the specified features for a given word in a sentence:

In [None]:
def word2features(sentence, i):
    """
    Arguments:
        sentence (list): list of words [w1, w2,...,w_n]
        i (int): index of the word
    Return:
        features (dict): dictionary of features
    """
    word = sentence[i]
    prev_word = '' if i==0 else sentence[i-1].lower()
    next_word = '' if i==len(sentence)-1 else sentence[i+1].lower()
    features = {
        'word.lower()': word.lower(),
        'prev_word.lower()': prev_word,
        'next_word.lower()': next_word,
        'prev_cur_word': prev_word + '||' + word.lower(),
        'cur_next_word': word.lower() + '||' + next_word,
        'prefix_4': word[:4],
        'suffix_4': word[-4:],
        'first_letter_upper': int(word[0].isupper())
    }
    return features



def sent2features(sentence):
    """
    sentence is a list of words [w1, w2,...,w_n]
    """
    return [word2features(sentence, i) for i in range(len(sentence))]


def sent2labels(sentence):
    """
    sentence is a list of tuples (word, tag)
    """
    return [tag for token, tag in sentence]

def untag(sentence):
    """
    sentence is a list of tuples (word, tag)
    """
    return [token for token, _ in sentence]

Here, we first get the current word, the previous word, and the next word from the sentence. The previous word is an empty string at the start of the sentence, and the next word is an empty string at the end. Then, we create a dictionary of features with the following keys:

- `word.lower()`: The lowercase version of the current word
- `prev_word.lower()`: The lowercase version of the previous word
- `next_word.lower()`: The lowercase version of the next word
- `prev_cur_word`: The concatenation of the lowercase previous word and the lowercase current word, separated by ||
- `cur_next_word`: The concatenation of the lowercase current word and the lowercase next word, separated by ||
- `prefix_4`: The first four characters of the current word
- `suffix_4`: The last four characters of the current word
- `first_letter_upper`: 1 if the current word starts with a capital letter, 0 otherwise

The prefix and suffix are taken from the word as written (not lowercased), and each is the entire word if the word is shorter than four characters.

Finally, we return the dictionary of features.
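Since the task allows extra features, one common candidate is a word shape. The sketch below is not called by the `word2features` implementation above; the helper name `get_word_shape`, its return format, and the suggested feature keys are assumptions for illustration. It maps digits to `d`, uppercase letters to `X`, lowercase letters to `x`, keeps other characters, and returns the first and last four characters of the resulting shape.

```python
def get_word_shape(word):
    """Return (prefix_shape, suffix_shape) for a word.

    Hypothetical helper: digits become 'd', uppercase letters 'X',
    lowercase letters 'x', and any other character is kept as is.
    """
    shape_chars = []
    for ch in word:
        if ch.isdigit():
            shape_chars.append("d")
        elif ch.isupper():
            shape_chars.append("X")
        elif ch.islower():
            shape_chars.append("x")
        else:
            shape_chars.append(ch)
    shape = "".join(shape_chars)
    # Slicing returns the whole shape when it is shorter than four characters.
    return shape[:4], shape[-4:]


print(get_word_shape("2"))
print(get_word_shape("5pm"))
print(get_word_shape("McDonald's"))
```

To try it, the two values could be added to the feature dictionary under keys such as `word_shape.prefix` and `word_shape.suffix`. Because the words in this dataset appear in lowercase, the shape would mostly help separate digits and punctuation from plain words.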

Let's try to extract features for the first sentence

In [None]:
train_sents[0]

[('2', 'B-Rating'),
 ('start', 'I-Rating'),
 ('restaurants', 'O'),
 ('with', 'O'),
 ('inside', 'B-Amenity'),
 ('dining', 'I-Amenity')]

In [None]:
sent2features(untag(train_sents[0]))[0]

{'word.lower()': '2',
 'prev_word.lower()': '',
 'next_word.lower()': 'start',
 'prev_cur_word': '||2',
 'cur_next_word': '2||start',
 'prefix_4': '2',
 'suffix_4': '2',
 'first_letter_upper': 0}

# Create train/test data

In [None]:
X_train = [sent2features(untag(s)) for s in train_sents]
y_train = [sent2labels(s) for s in train_sents]

X_test = [sent2features(untag(s)) for s in test_sents]
y_test = [sent2labels(s) for s in test_sents]

# Training

In [None]:
%%time
trainer = pycrfsuite.Trainer(verbose=False)

for xseq, yseq in zip(X_train, y_train):
    trainer.append(xseq, yseq)

CPU times: user 623 ms, sys: 11 ms, total: 634 ms
Wall time: 636 ms


# Set model parameters

In [None]:
#@title Set model parameters

max_iterations = 50 #@param[50, 20, 100]

trainer.set_params({
    'c1': 1.0,   # coefficient for L1 penalty
    'c2': 1e-3,  # coefficient for L2 penalty
    'max_iterations': max_iterations,

    # include transitions that are possible, but not observed
    'feature.possible_transitions': True
})
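For intuition, `c1` and `c2` weight the two penalty terms of an elastic-net style training objective. The formula below is a sketch, assuming the default L-BFGS training algorithm and ignoring whatever constant scaling CRFsuite applies internally to each term:

$$
\min_{\mathbf{w}} \; -\sum_{n=1}^{N} \log p\left(\mathbf{y}^{(n)} \mid \mathbf{x}^{(n)}; \mathbf{w}\right) \;+\; c_1 \lVert \mathbf{w} \rVert_1 \;+\; c_2 \lVert \mathbf{w} \rVert_2^2
$$

Here $\mathbf{w}$ is the vector of feature weights, and $(\mathbf{x}^{(n)}, \mathbf{y}^{(n)})$ are the feature and label sequences of the $n$-th training sentence. With `c1 = 1.0` and `c2 = 1e-3`, the L1 term carries most of the regularization, which tends to push many feature weights to exactly zero.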

In [None]:
%%time
trainer.train('mitrestaurant.crfsuite')

CPU times: user 8.46 s, sys: 71.1 ms, total: 8.53 s
Wall time: 9.12 s


# Evaluation

We will use the [seqeval](https://github.com/chakki-works/seqeval) package to evaluate the NER results.

In [None]:
!pip install -q seqeval[cpu]



# Make Predictions

In [None]:
tagger = pycrfsuite.Tagger()
tagger.open('mitrestaurant.crfsuite')

<contextlib.closing at 0x7fad38bae1f0>

In [None]:
example_sent = test_sents[0]
example_sent

[('a', 'O'),
 ('four', 'B-Rating'),
 ('star', 'I-Rating'),
 ('restaurant', 'O'),
 ('with', 'B-Location'),
 ('a', 'I-Location'),
 ('bar', 'B-Amenity')]

In [None]:
print("Predicted:", ' '.join(tagger.tag(sent2features(untag(example_sent)))))
print("Correct:  ", ' '.join(sent2labels(example_sent)))

Predicted: O B-Rating I-Rating O O O B-Amenity
Correct:   O B-Rating I-Rating O B-Location I-Location B-Amenity


In [None]:
%%time
y_pred = [tagger.tag(xseq) for xseq in X_test]

CPU times: user 77.1 ms, sys: 958 µs, total: 78 ms
Wall time: 78.5 ms


In [None]:
from seqeval.metrics import classification_report

print(classification_report(y_test, y_pred))

                 precision    recall  f1-score   support

        Amenity       0.71      0.65      0.68       533
        Cuisine       0.84      0.81      0.83       532
           Dish       0.78      0.72      0.75       288
          Hours       0.73      0.65      0.69       212
       Location       0.82      0.80      0.81       812
          Price       0.80      0.81      0.80       171
         Rating       0.79      0.77      0.78       201
Restaurant_Name       0.78      0.75      0.77       402

      micro avg       0.79      0.75      0.77      3151
      macro avg       0.78      0.75      0.76      3151
   weighted avg       0.79      0.75      0.77      3151
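The scores in this report are computed over whole entities rather than individual tokens: a predicted entity counts as correct only if both its type and its boundaries match the reference. The sketch below illustrates that idea in plain Python on the example sentence from the prediction step. The helpers `bio_to_spans` and `span_prf` are hypothetical, and the span rules are a simplification that may differ from seqeval's exact handling of malformed tag sequences.

```python
def bio_to_spans(tags):
    """Convert a BIO tag sequence into a set of (label, start, end) spans.

    Hypothetical helper: 'B-X' starts an entity, 'I-X' continues an open
    entity of the same type, and anything else closes the open entity.
    """
    spans = set()
    start, label = None, None
    # The trailing "O" is a sentinel that flushes an entity ending the sentence.
    for i, tag in enumerate(list(tags) + ["O"]):
        prefix, _, kind = tag.partition("-")
        continues = prefix == "I" and kind == label
        if not continues:
            if label is not None:
                spans.add((label, start, i))   # end index is exclusive
            if prefix in ("B", "I"):
                start, label = i, kind
            else:
                start, label = None, None
    return spans


def span_prf(gold_spans, pred_spans):
    """Entity-level precision, recall, and F1 from two sets of spans."""
    true_positives = len(gold_spans & pred_spans)
    precision = true_positives / len(pred_spans) if pred_spans else 0.0
    recall = true_positives / len(gold_spans) if gold_spans else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Tags copied from the example sentence in the prediction step.
correct = "O B-Rating I-Rating O B-Location I-Location B-Amenity".split()
predicted = "O B-Rating I-Rating O O O B-Amenity".split()

gold_spans = bio_to_spans(correct)
pred_spans = bio_to_spans(predicted)
precision, recall, f1 = span_prf(gold_spans, pred_spans)
print(sorted(gold_spans - pred_spans))   # entities the model missed
print(precision, recall, f1)
```

On this one sentence, both predicted entities are correct but the Location entity is missed, so precision is 1.0 and recall is 2/3. The report above aggregates the same kind of counts over the whole test set, per entity type.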

