# Named Entity Recognition with MIT Restaurant Dataset

Your name: Nguyen Van A

Student ID: USTH001

**Due: 23:59 19/3/2023**

## Task Description

In this assignment, you will train a NER model using Conditional Random Fields (CRF) on the training data and report the accuracy of your model on the test dataset.

You will use the [MIT Restaurant Dataset](https://groups.csail.mit.edu/sls/downloads/restaurant/) for this task.

## How to submit

- Attach the notebook file (.ipynb) and submit your work to Google Classroom
- Name your file as YourName_StudentID_Assignment4.ipynb. E.g., Nguyen_Van_A_ST099834_Assignment4.ipynb
- Write your name and student ID into this notebook
- Copying others' assignments is strictly prohibited.


## Install python-crfsuite

In [None]:
!pip install -q python-crfsuite

[?25l[K     |▍                               | 10 kB 20.2 MB/s eta 0:00:01[K     |▉                               | 20 kB 22.9 MB/s eta 0:00:01[K     |█▎                              | 30 kB 26.1 MB/s eta 0:00:01[K     |█▊                              | 40 kB 18.5 MB/s eta 0:00:01[K     |██▏                             | 51 kB 21.0 MB/s eta 0:00:01[K     |██▋                             | 61 kB 18.0 MB/s eta 0:00:01[K     |███                             | 71 kB 16.5 MB/s eta 0:00:01[K     |███▌                            | 81 kB 17.7 MB/s eta 0:00:01[K     |████                            | 92 kB 19.3 MB/s eta 0:00:01[K     |████▍                           | 102 kB 18.6 MB/s eta 0:00:01[K     |████▉                           | 112 kB 18.6 MB/s eta 0:00:01[K     |█████▎                          | 122 kB 18.6 MB/s eta 0:00:01[K     |█████▊                          | 133 kB 18.6 MB/s eta 0:00:01[K     |██████▏                         | 143 kB 18.6 MB/s eta 0:

## Imports

In [None]:
from itertools import chain
import pycrfsuite

## Dataset

We will use the [MIT Restaurant Dataset](https://groups.csail.mit.edu/sls/downloads/restaurant/).

The data set is already in CoNLL format. We will use the [train](https://groups.csail.mit.edu/sls/downloads/restaurant/restauranttrain.bio) data to create the NER model and evaluate the model on the [test](https://groups.csail.mit.edu/sls/downloads/restaurant/restauranttest.bio) data.

### Download data

In [None]:
%%capture
!rm -f restauranttrain.bio
!rm -f restauranttest.bio

!wget https://groups.csail.mit.edu/sls/downloads/restaurant/restauranttest.bio
!wget https://groups.csail.mit.edu/sls/downloads/restaurant/restauranttrain.bio

## Loading data (30 points)

In this part, you will load a data file into a list of sentences. Each sentence is a list of (word, tag) tuples.

**Note: Blank lines are used to separate sentences.**

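For instance, the sentence below will be loaded into a list of (word, tag) tuples. Note that in the data file each line holds the tag first, followed by the word: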

```
O	a
B-Rating	four
I-Rating	star
O	restaurant
B-Location	with
I-Location	a
B-Amenity	bar
```
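As a rough sketch of the target structure, the snippet below parses the example above from an in-memory string instead of a file. It assumes every non-blank line is `tag<TAB>word`, as in the example, and swaps the two columns to produce `(word, tag)` tuples. The names `example_text` and `parse_bio_lines` are hypothetical and not part of the assignment template.

```python
# Hypothetical in-memory copy of the example sentence (tag<TAB>word per line).
example_text = (
    "O\ta\n"
    "B-Rating\tfour\n"
    "I-Rating\tstar\n"
    "O\trestaurant\n"
    "B-Location\twith\n"
    "I-Location\ta\n"
    "B-Amenity\tbar\n"
    "\n"
)


def parse_bio_lines(lines):
    """Group tag<TAB>word lines into sentences of (word, tag) tuples."""
    sentences, current = [], []
    for line in lines:
        line = line.strip()
        if not line:
            # A blank line closes the current sentence, if there is one.
            if current:
                sentences.append(current)
                current = []
            continue
        tag, word = line.split("\t")
        current.append((word, tag))
    # Keep the last sentence when the input does not end with a blank line.
    if current:
        sentences.append(current)
    return sentences


parsed = parse_bio_lines(example_text.splitlines())
print(parsed[0])
```

The same loop could be applied to the lines of an opened file, but how you structure `load_data` is up to you.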

You will complete the function below

In [None]:
## Add necessary import here

def load_data(file_path):
    """Load data into a list of list of (word, tag) tuples

    Args:
        file_path (str): Path to data

    Returns:
        sentences: list of sentences, each a list of (word, tag) tuples
    """
    sentences = []

    #TODO: Write your code here

    return sentences

In [None]:
train_sents = load_data('restauranttrain.bio')
test_sents = load_data('restauranttest.bio')

Let's check the number of sentences in train and test data

In [None]:
len(train_sents)

In [None]:
len(test_sents)

In [None]:
train_sents[0]

## Features (50 points)

You can extract as many features as you want. You will implement the following basic features.

※ Of course, you can add more features.

*Word identity (lowercase)*

- Previous word identity
- Current word identity
- Next word identity
- Previous word and current word combination. Concatenate the previous word and the current word with '||'
- Current word and next word combination. Concatenate the two words with '||'

*Word shapes*

- Word prefix and suffix (4 characters)
- Whether the first character of the current word is a capital letter

**All you need to do is to complete the function `word2features`.**
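To make the feature list above concrete, here is one possible shape for the feature dictionary of a single token. This is only a sketch: the function name `sketch_features` and all dictionary keys are hypothetical, and taking the prefix and suffix from the original (not lowercased) word is an assumption, since the list above does not specify it.

```python
def sketch_features(words, i):
    """Build an illustrative feature dict for words[i] (hypothetical key names)."""
    word = words[i]
    lower = word.lower()
    prev_word = '' if i == 0 else words[i - 1].lower()
    next_word = '' if i == len(words) - 1 else words[i + 1].lower()
    return {
        # Word identity features (lowercased)
        'prev_word': prev_word,
        'word': lower,
        'next_word': next_word,
        'prev_word||word': prev_word + '||' + lower,
        'word||next_word': lower + '||' + next_word,
        # Word shape features (assumption: taken from the original casing)
        'prefix4': word[:4],
        'suffix4': word[-4:],
        'is_capitalized': word[:1].isupper(),
    }


feats = sketch_features(['a', 'four', 'star', 'restaurant'], 1)
for name, value in feats.items():
    print(name, '=', repr(value))
```

At sentence boundaries the previous or next word is the empty string, so the combined features become, for example, `'||a'` for the first token.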

In [None]:
def word2features(sentence, i):
    """
    Arguments:
        sentence (list): list of words [w1, w2,...,w_n]
        i (int): index of the word
    Return:
        features (dict): dictionary of features
    """
    word = sentence[i]
    prev_word = '' if i==0 else sentence[i-1].lower()
    next_word = '' if i==len(sentence)-1 else sentence[i+1].lower()
    features = {
        #TODO: Write your features here
    }

    return features


def sent2features(sentence):
    """
    sentence is a list of words [w1, w2,...,w_n]
    """
    return [word2features(sentence, i) for i in range(len(sentence))]


def sent2labels(sentence):
    """
    sentence is a list of tuples (word, tag)
    """
    return [tag for token, tag in sentence]

def untag(sentence):
    """
    sentence is a list of tuples (word, tag)
    """
    return [token for token, _ in sentence]

Let's try to extract features for the first sentence

In [None]:
train_sents[0]

In [None]:
sent2features(untag(train_sents[0]))[0]

### Create train/test data

In [None]:
X_train = [sent2features(untag(s)) for s in train_sents]
y_train = [sent2labels(s) for s in train_sents]

X_test = [sent2features(untag(s)) for s in test_sents]
y_test = [sent2labels(s) for s in test_sents]

## Training

In [None]:
%%time
trainer = pycrfsuite.Trainer(verbose=False)

for xseq, yseq in zip(X_train, y_train):
    trainer.append(xseq, yseq)

CPU times: user 1.01 s, sys: 8.76 ms, total: 1.02 s
Wall time: 1.03 s


In [None]:
#@title Set model parameters

max_iterations = 50 #@param[50, 20, 100]

trainer.set_params({
    'c1': 1.0,   # coefficient for L1 penalty
    'c2': 1e-3,  # coefficient for L2 penalty
    'max_iterations': max_iterations,

    # include transitions that are possible, but not observed
    'feature.possible_transitions': True
})

In [None]:
%%time
trainer.train('mitrestaurant.crfsuite')

## Evaluation (20 points)

We will use the [seqeval](https://github.com/chakki-works/seqeval) package to evaluate the NER results.

In [None]:
!pip install -q seqeval[cpu]

### Make Predictions

In [None]:
tagger = pycrfsuite.Tagger()
tagger.open('mitrestaurant.crfsuite')

<contextlib.closing at 0x7fa0a3ce0c50>

In [None]:
example_sent = test_sents[0]
example_sent

In [None]:
print("Predicted:", ' '.join(tagger.tag(sent2features(untag(example_sent)))))
print("Correct:  ", ' '.join(sent2labels(example_sent)))

In [None]:
%%time
y_pred = [tagger.tag(xseq) for xseq in X_test]

In [None]:
from seqeval.metrics import classification_report

print(classification_report(y_test, y_pred))
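The report from seqeval is computed over entity spans rather than individual tokens: a predicted entity counts as correct only if its type and its boundaries both match the gold annotation. The sketch below illustrates that idea in plain Python with micro-averaged precision, recall, and F1. It is a rough illustration, not seqeval's implementation, so edge cases may be handled differently; `extract_entities`, `entity_prf`, and the `pred` sequence are hypothetical.

```python
def extract_entities(tags):
    """Return the set of (type, start, end) spans in a BIO tag sequence."""
    entities = set()
    start, etype = None, None
    # The extra 'O' is a sentinel that closes a span ending at the last token.
    for i, tag in enumerate(list(tags) + ['O']):
        prefix, _, label = tag.partition('-')
        # An open span ends at O, at any B- tag, or when the type changes.
        if etype is not None and (prefix != 'I' or label != etype):
            entities.add((etype, start, i - 1))
            start, etype = None, None
        # A span starts at B-, or at an I- tag that does not continue a span.
        if prefix in ('B', 'I') and etype is None:
            start, etype = i, label
    return entities


def entity_prf(y_true, y_pred):
    """Micro-averaged entity-level precision, recall, and F1."""
    n_true = n_pred = n_correct = 0
    for true_tags, pred_tags in zip(y_true, y_pred):
        true_ents = extract_entities(true_tags)
        pred_ents = extract_entities(pred_tags)
        n_true += len(true_ents)
        n_pred += len(pred_ents)
        n_correct += len(true_ents & pred_ents)
    precision = n_correct / n_pred if n_pred else 0.0
    recall = n_correct / n_true if n_true else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold = [['O', 'B-Rating', 'I-Rating', 'O', 'B-Location', 'I-Location', 'B-Amenity']]
# Hypothetical prediction that misses the Location entity entirely.
pred = [['O', 'B-Rating', 'I-Rating', 'O', 'O', 'O', 'B-Amenity']]
precision, recall, f1 = entity_prf(gold, pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In this example two of the three gold entities are found and nothing spurious is predicted, so precision is 1.0 and recall is 2/3.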

## References

1. [Datasets for Entity Recognition](https://github.com/juand-r/entity-recognition-datasets)
2. [sklearn-crfsuite tutorial](https://sklearn-crfsuite.readthedocs.io/en/latest/tutorial.html#let-s-use-conll-2002-data-to-build-a-ner-system)
3. [Quick Recipe: Build a POS tagger using a Conditional Random Field](https://nlpforhackers.io/crf-pos-tagger/)
4. [NLP Guide: Identifying Part of Speech Tags using Conditional Random Fields](https://medium.com/analytics-vidhya/pos-tagging-using-conditional-random-fields-92077e5eaa31)
5. [CRFsuite - Tutorial on Chunking Task](http://www.chokkan.org/software/crfsuite/tutorial.html)