# Classifying Voice Commands

<img src="https://www.cheatsheet.com/wp-content/uploads/2016/01/Siri-in-iOS-9-640x305.png" width=400>

Goal:

a) predict the intent of the speaker of a voice command 

b) extract the interesting named entities within the command.

Our focus: part (b), named entity recognition (NER).

Together, the two tasks amount to figuring out *what* the speaker wants, and then *how* to accomplish that request.

<img src="https://miro.medium.com/max/2594/1*rq7FCkcq4sqUY9IgfsPEOg.png" width="500">

---

In [1]:
%tensorflow_version 2.x
%pip install -q transformers

import tensorflow as tf
from urllib.request import urlretrieve
from pathlib import Path
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from transformers import BertTokenizer
from transformers import TFBertModel
from tensorflow.keras.layers import Dropout, Dense
from tensorflow.keras.losses import SparseCategoricalCrossentropy
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.metrics import SparseCategoricalAccuracy

model_name = "bert-base-cased"
tokenizer = BertTokenizer.from_pretrained(model_name)

SNIPS_DATA_BASE_URL = (
    "https://github.com/ogrisel/slot_filling_and_intent_detection_of_SLU/blob/"
    "master/data/snips/"
)
for filename in ["train", "valid", "test", "vocab.intent", "vocab.slot"]:
    path = Path(filename)
    if not path.exists():
        print(f"Downloading {filename}...")
        urlretrieve(SNIPS_DATA_BASE_URL + filename + "?raw=true", path)


def parse_line(line):
    data, intent_label = line.split(" <=> ")
    items = data.split()
    words = [item.rsplit(":", 1)[0] for item in items]
    word_labels = [item.rsplit(":", 1)[1] for item in items]
    return {
        "intent_label": intent_label, 
        "words": " ".join(words),
        "word_labels": " ".join(word_labels),
        "length": len(words),
    }

def encode_dataset(text_sequences):
    # Create token_ids array (initialized to all zeros), where 
    # rows are a sequence and columns are encoding ids
    # of each token in given sequence.
    token_ids = np.zeros(shape=(len(text_sequences), max_token_len),
                         dtype=np.int32)
    
    for i, text_sequence in enumerate(text_sequences):
        encoded = tokenizer.encode(text_sequence)
        token_ids[i, 0:len(encoded)] = encoded

    attention_masks = (token_ids != 0).astype(np.int32)
    return {"input_ids": token_ids, "attention_masks": attention_masks}


train_lines = Path("train").read_text().strip().splitlines()
valid_lines = Path("valid").read_text().strip().splitlines()
test_lines = Path("test").read_text().strip().splitlines()

df_train = pd.DataFrame([parse_line(line) for line in train_lines])
df_valid = pd.DataFrame([parse_line(line) for line in valid_lines])
df_test = pd.DataFrame([parse_line(line) for line in test_lines])

max_token_len = 43

encoded_train = encode_dataset(df_train["words"])
encoded_valid = encode_dataset(df_valid["words"])
encoded_test = encode_dataset(df_test["words"])

intent_names = Path("vocab.intent").read_text().split()
intent_map = dict((label, idx) for idx, label in enumerate(intent_names))
intent_train = df_train["intent_label"].map(intent_map).values
intent_valid = df_valid["intent_label"].map(intent_map).values
intent_test = df_test["intent_label"].map(intent_map).values

base_bert_model = TFBertModel.from_pretrained("bert-base-cased")

Colab only includes TensorFlow 2.x; %tensorflow_version has no effect.

Downloading train...
Downloading valid...
Downloading test...
Downloading vocab.intent...
Downloading vocab.slot...



Some layers from the model checkpoint at bert-base-cased were not used when initializing TFBertModel: ['nsp___cls', 'mlm___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertModel were initialized from the model checkpoint at bert-base-cased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.


## Intent Classification + NER

Refining our Natural Language Understanding system by capturing the important named elements within each voice command.

Token level classification of the BIO labels:

```
      Book : O
         a : O
     table : O
       for : O
       two : B-party_size_number
        at : O
        Le : B-restaurant_name
         R : I-restaurant_name
     ##itz : I-restaurant_name
       for : O
    Friday : B-timeRange
     night : I-timeRange
         ! : O
```
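
To make the BIO scheme concrete, here is a minimal sketch that groups a token-level labelling like the one above back into entity spans. The helper name `group_bio_entities` is hypothetical (it is not part of the notebook), and the tokens and labels are copied by hand from the example rather than produced by a model.

```python
def group_bio_entities(tokens, labels):
    """Collect (slot_name, text) spans from parallel token and BIO-label lists."""
    entities = []
    current_slot, current_tokens = None, []
    for token, label in zip(tokens, labels):
        if label.startswith("B-"):
            # A new entity starts; flush the previous one, if any.
            if current_slot is not None:
                entities.append((current_slot, current_tokens))
            current_slot, current_tokens = label[2:], [token]
        elif label.startswith("I-") and current_slot == label[2:]:
            # Continuation of the entity that is currently open.
            current_tokens.append(token)
        else:
            # "O" (or a stray "I-") closes any open entity.
            if current_slot is not None:
                entities.append((current_slot, current_tokens))
            current_slot, current_tokens = None, []
    if current_slot is not None:
        entities.append((current_slot, current_tokens))
    # Glue WordPiece continuations ("##itz") back onto the preceding piece.
    return [(slot, " ".join(toks).replace(" ##", "")) for slot, toks in entities]


tokens = ["Book", "a", "table", "for", "two", "at", "Le", "R", "##itz",
          "for", "Friday", "night", "!"]
labels = ["O", "O", "O", "O", "B-party_size_number", "O",
          "B-restaurant_name", "I-restaurant_name", "I-restaurant_name",
          "O", "B-timeRange", "I-timeRange", "O"]

entities = group_bio_entities(tokens, labels)
print(entities)
```

Each `B-` label opens a span, matching `I-` labels extend it, and anything else closes it, which is why the split pieces `R` and `##itz` end up in the same restaurant name as `Le`.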

Load the list of possible word token labels and augment it with an additional padding label, so that special tokens and padded positions can be ignored:

In [2]:
# Build a map from slot name to a unique id.
slot_names = ["[PAD]"] + Path("vocab.slot").read_text().strip().splitlines()
slot_map = {}
for label in slot_names:
    slot_map[label] = len(slot_map)
slot_map

{'[PAD]': 0,
 'B-album': 1,
 'B-artist': 2,
 'B-best_rating': 3,
 'B-city': 4,
 'B-condition_description': 5,
 'B-condition_temperature': 6,
 'B-country': 7,
 'B-cuisine': 8,
 'B-current_location': 9,
 'B-entity_name': 10,
 'B-facility': 11,
 'B-genre': 12,
 'B-geographic_poi': 13,
 'B-location_name': 14,
 'B-movie_name': 15,
 'B-movie_type': 16,
 'B-music_item': 17,
 'B-object_location_type': 18,
 'B-object_name': 19,
 'B-object_part_of_series_type': 20,
 'B-object_select': 21,
 'B-object_type': 22,
 'B-party_size_description': 23,
 'B-party_size_number': 24,
 'B-playlist': 25,
 'B-playlist_owner': 26,
 'B-poi': 27,
 'B-rating_unit': 28,
 'B-rating_value': 29,
 'B-restaurant_name': 30,
 'B-restaurant_type': 31,
 'B-served_dish': 32,
 'B-service': 33,
 'B-sort': 34,
 'B-spatial_relation': 35,
 'B-state': 36,
 'B-timeRange': 37,
 'B-track': 38,
 'B-year': 39,
 'I-album': 40,
 'I-artist': 41,
 'I-city': 42,
 'I-country': 43,
 'I-cuisine': 44,
 'I-current_location': 45,
 'I-entity_name': 

#### Word to Token Encodings

The following function generates *token-aligned* integer ids from the BIO *word-level* annotations. <img src="https://www.emoji.co.uk/files/twitter-emojis/symbols-twitter/11214-anticlockwise-downwards-and-upwards-open-circle-arrows.png" width=20>

If BERT breaks a word into multiple tokens, the word-level label is propagated to all of the word's tokens: the first token keeps the original label (including any "B-" prefix), while the remaining tokens get the same label with the "I-" prefix.
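
The rule can be tried out in isolation with a small sketch. Everything here is illustrative: `toy_tokenize` is a hypothetical stand-in for the BERT tokenizer (it only knows how to split "Ritz"), and `known_labels` is a made-up label set rather than the real `slot_map`.

```python
# Hypothetical stand-in for tokenizer.tokenize, so the example runs without BERT.
TOY_WORDPIECES = {"Ritz": ["R", "##itz"], "1999": ["199", "##9"]}


def toy_tokenize(word):
    return TOY_WORDPIECES.get(word, [word])


def expand_word_labels(words, word_labels, known_labels):
    """Return one label per sub-word token, following the B-/I- rule."""
    token_labels = []
    for word, label in zip(words, word_labels):
        pieces = toy_tokenize(word)
        # The first piece keeps the word-level label unchanged.
        token_labels.append(label)
        # Later pieces use the "I-" variant when the label set has one,
        # and otherwise fall back to repeating the original label.
        inside = label.replace("B-", "I-")
        if inside not in known_labels:
            inside = label
        token_labels.extend([inside] * (len(pieces) - 1))
    return token_labels


# Toy label set: "B-year" deliberately has no "I-year" counterpart.
known_labels = {"O", "B-restaurant_name", "I-restaurant_name", "B-year"}

expanded = expand_word_labels(
    ["at", "Ritz", "1999"],
    ["O", "B-restaurant_name", "B-year"],
    known_labels,
)
print(expanded)
```

The fallback branch matters whenever a label has no "I-" variant in the vocabulary: without it, the lookup of the expanded label would fail.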



In [3]:
# Uses the slot_map of slot name to unique id, defined above, as well
# as the BERT tokenizer, to create a np array with each row corresponding
# to a given sequence, and the columns as the id of the given token slot labels.
def encode_token_labels(text_sequences, true_word_labels):
    encoded = np.zeros(shape=(len(text_sequences), max_token_len), dtype=np.int32)
    for i, (text_sequence, word_labels) in enumerate(zip(text_sequences, true_word_labels)):
        encoded_labels = []
        for word, word_label in zip(text_sequence.split(), word_labels.split()):
            tokens = tokenizer.tokenize(word)
            encoded_labels.append(slot_map[word_label])
            expand_label = word_label.replace("B-", "I-")
            if expand_label not in slot_map:
                expand_label = word_label
            encoded_labels.extend([slot_map[expand_label]] * (len(tokens) - 1))
        encoded[i, 1:len(encoded_labels) + 1] = encoded_labels
    return encoded

In [4]:
df_train['words']

0        Add Don and Sherri to my Meditate to Sounds of...
1        put United Abominations onto my rare groove pl...
2        add the tune by misato watanabe to the Trapeo ...
3        add this artist to my this is miguel bosé play...
4        add heresy and the hotel choir to the evening ...
                               ...                        
13079    find a Consolidated Theatres showing The Good ...
13080    where can i see animated movies in the neighbo...
13081          Showtimes for animated movies in the area .
13082    Which animated movies are playing at Megaplex ...
13083               What movie schedules start at sunset ?
Name: words, Length: 13084, dtype: object

In [5]:
df_train['word_labels']

0        O B-entity_name I-entity_name I-entity_name O ...
1        O B-entity_name I-entity_name O B-playlist_own...
2        O O B-music_item O B-artist I-artist O O B-pla...
3        O O B-music_item O B-playlist_owner B-playlist...
4        O B-entity_name I-entity_name I-entity_name I-...
                               ...                        
13079    O O B-location_name I-location_name O B-movie_...
13080    O O O O B-movie_type I-movie_type B-spatial_re...
13081    O O B-movie_type I-movie_type B-spatial_relati...
13082    O B-movie_type I-movie_type O O O B-location_n...
13083      O B-object_type I-object_type O O B-timeRange O
Name: word_labels, Length: 13084, dtype: object

#### Exercise 1

Encoding the token labels for train, validation, & test:

In [6]:
# Encode the token labels and store them in slot_train, slot_valid and slot_test.

slot_train = encode_token_labels(df_train['words'], df_train['word_labels'])
slot_valid = encode_token_labels(df_valid['words'], df_valid['word_labels'])
slot_test = encode_token_labels(df_test['words'], df_test['word_labels'])

The encoded token labels for the 1st training sequence:

In [7]:
slot_train[0]

array([ 0, 72, 72, 10, 46, 46, 46, 72, 26, 25, 60, 60, 60, 60, 60, 60, 72,
       72,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0], dtype=int32)

Special tokens such as `[CLS]` and `[SEP]`, as well as all padded positions, have a 0 label.
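
Since special-token and padded positions all share label 0, that id can serve as a mask when scoring predictions. The following is only a sketch of one way to do so: `masked_token_accuracy` is a hypothetical helper, and the arrays are made up for illustration, not taken from the model.

```python
import numpy as np


def masked_token_accuracy(true_ids, predicted_ids, pad_id=0):
    """Accuracy over positions whose true label is not the padding id."""
    true_ids = np.asarray(true_ids)
    predicted_ids = np.asarray(predicted_ids)
    keep = true_ids != pad_id  # boolean mask of real word tokens
    return float((predicted_ids[keep] == true_ids[keep]).mean())


# One made-up sequence: position 0 and the last two positions carry label 0.
true_ids = np.array([[0, 72, 10, 46, 0, 0]])
pred_ids = np.array([[5, 72, 10, 72, 9, 9]])

acc = masked_token_accuracy(true_ids, pred_ids)
print(acc)  # 2 of the 3 unmasked positions match
```

Predictions at masked positions are ignored entirely, so a score computed this way is not inflated by long runs of padding. The helper assumes at least one unmasked position; with none, the mean is undefined.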

#### Exercise 2

Build a **joint sequence and token classification model** that will be trained on the encoded dataset with the NER labels.


In [8]:
class JointIntentAndSlotFillingModel(tf.keras.Model):

    def __init__(self, intent_num_labels=None, slot_num_labels=None,
                dropout_prob=0.1):
        super().__init__(name="joint_intent_slot")

        self.bert = base_bert_model
        
        # TODO: define the dropout, intent & slot classifier layers
        self.dropout = Dropout(dropout_prob)
        self.intent_classifier = Dense(intent_num_labels, name="intent_classifier") 
        self.slot_classifier = Dense(slot_num_labels, name="slot_classifier") 

    def call(self, inputs, **kwargs):
        # Run BERT once: output 0 holds the per-token hidden states,
        # output 1 the pooled representation of the whole sequence.
        bert_outputs = self.bert(inputs["input_ids"],
                                 attention_mask=inputs["attention_masks"])
        tokens_output, pooled_output = bert_outputs[0], bert_outputs[1]
        training = kwargs.get("training", False)

        # Slot logits for each token position,
        # shape (batch_size, seq_len, slot_num_labels).
        tokens_output = self.dropout(tokens_output, training=training)
        slot_logits = self.slot_classifier(tokens_output)

        # Intent logits for the whole sequence,
        # shape (batch_size, intent_num_labels).
        pooled_output = self.dropout(pooled_output, training=training)
        intent_logits = self.intent_classifier(pooled_output)
        return slot_logits, intent_logits

# TODO: create an instantiation of this model
joint_model = JointIntentAndSlotFillingModel(len(intent_map), len(slot_map))

In [9]:
# Define one classification loss per output, in the order the model returns them (slots, then intent):
losses = [SparseCategoricalCrossentropy(from_logits=True),
          SparseCategoricalCrossentropy(from_logits=True)]
          
joint_model.compile(optimizer=Adam(learning_rate=3e-5, epsilon=1e-08),
                    loss=losses,
                    metrics=[SparseCategoricalAccuracy('accuracy')])

In [10]:
# Train the model.
history = joint_model.fit(encoded_train, (slot_train, intent_train), \
    validation_data=(encoded_valid, (slot_valid, intent_valid)), \
    epochs=1, batch_size=32)



Validation accuracy: 99% after training for only one epoch.

#### Classification

<img src="https://orbitcarrot.com/wp-content/uploads/2014/12/predict.png" width=100>

The next step is prediction: the following function uses the trained model to make a prediction on a single text sequence and displays both the sequence-level (intent) and token-level (slot) class labels.
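
The decoding step itself is just an argmax over the last axis of each logits array, followed by a lookup in the label lists. A minimal sketch with made-up logits and toy label vocabularies (none of these values come from the trained model):

```python
import numpy as np

# Toy label vocabularies and logits, only for illustration.
toy_intent_names = ["BookRestaurant", "PlayMusic"]
toy_slot_names = ["[PAD]", "O", "B-timeRange", "I-timeRange"]

intent_logits = np.array([[2.1, -0.3]])          # shape (batch, n_intents)
slot_logits = np.array([[[0.1, 3.0, 0.2, 0.0],   # shape (batch, seq_len, n_slots)
                         [0.0, 0.1, 2.5, 0.3],
                         [0.2, 0.0, 0.4, 1.9]]])

# Highest-scoring class for the sequence, and for each token position.
intent_id = int(intent_logits.argmax(axis=-1)[0])
slot_ids = slot_logits.argmax(axis=-1)[0]

predicted_intent = toy_intent_names[intent_id]
predicted_slots = [toy_slot_names[i] for i in slot_ids]

print(predicted_intent)
print(predicted_slots)
```

Because only the position of the largest logit matters, no softmax is needed at prediction time.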


#### Exercise 3

In [15]:
# Use the model we trained to get the intent & slot logits
# and print the actual string of the class corresponding to
# highest logit score for each token, and the sentence overall.

def show_predictions(text, intent_names, slot_names):
    encoded_text = encode_dataset([text])
    input_ids = tf.constant(encoded_text["input_ids"])
    attention_mask = tf.constant(encoded_text["attention_masks"])

    inputs = {
        "input_ids": input_ids,
        # The key must match the one read in JointIntentAndSlotFillingModel.call.
        "attention_masks": attention_mask
    }

    outputs = joint_model(inputs) 
    slot_logits, intent_logits = outputs
    slot_ids = slot_logits.numpy().argmax(axis=-1)[0, 1:-1]
    intent_id = intent_logits.numpy().argmax(axis=-1)[0]
    print("## Intent:", intent_names[intent_id])
    print("## Slots:")
    for token, slot_id in zip(tokenizer.tokenize(text), slot_ids):
        print(f"{token:>10} : {slot_names[slot_id]}")