**In this homework, you will implement several AI models to conduct the intent detection task.**
![alt text](https://i.ibb.co/fXmYHRq/ec5.jpg)

# Part 0: Data Preprocessing

In this section, you will get a general idea of what the data looks like and apply some simple transformations.

In [3]:
import pickle
samples = pickle.load(open("sample.p", "rb"))
test_sentences = pickle.load(open("test_sentences.p", "rb"))

In [4]:
###data structure###
### [[sentence, label]] ###
print(samples[:3])

[['Turn off the holoemitter.', 2], ['Halt.', 1], ['Get off tiptoes', 6]]


There are nine categories for these sentences: 'no', 'driving', 'light', 'head', 'state', 'connection', 'stance', 'animation' and 'grid'. The mapping from index to category name is shown below.

In [5]:
ind2cat = {0: 'no', 1: 'driving', 2: 'light', 3: 'head', 4: 'state', 5: 'connection', 6: 'stance', 7: 'animation', 8: 'grid'}

In [6]:
### Distribution on categories ###
cat2sentence = {}
for sample in samples:
  sentence = sample[0]
  cat = ind2cat[sample[1]]
  if cat not in cat2sentence:
    cat2sentence[cat] = [sentence]
  else:
    cat2sentence[cat].append(sentence)

print("number of sentences for each category")
for cat, sentences in cat2sentence.items():
  print(cat, ": ", len(sentences))

number of sentences for each category
light :  716
driving :  784
stance :  758
head :  698
grid :  678
state :  676
animation :  645
no :  629
connection :  673
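
As a rough check on class balance, the counts printed above can be turned into a majority-class baseline, that is, the accuracy of a model that always predicts the largest category. This is a minimal sketch with the counts copied by hand from the output; the variable names are illustrative and not part of the assignment.

```python
# Per-category counts copied from the output above.
counts = {
    "light": 716, "driving": 784, "stance": 758, "head": 698, "grid": 678,
    "state": 676, "animation": 645, "no": 629, "connection": 673,
}

total = sum(counts.values())
largest_cat = max(counts, key=counts.get)

# Accuracy of a classifier that always predicts the largest category.
majority_baseline = counts[largest_cat] / total

print(total)
print(largest_cat, round(majority_baseline, 3))
```

The categories are close to balanced, so any model in the later parts should score well above this baseline of roughly one in eight.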


### Train/Validation Split

In [7]:
from sklearn.model_selection import train_test_split
SENTENCES = [sample[0] for sample in samples]
LABELS = [sample[1] for sample in samples]
X_train, X_val, y_train, y_val = train_test_split(SENTENCES, LABELS, test_size=0.2)

### Clean Text
Write a tokenization function clean(sentence) that takes a string of text as input and returns a list of tokens derived from that text. Here, we define a token to be a contiguous sequence of non-whitespace characters. We will remove punctuation marks and stop words and convert the text to lowercase. Hint: Use the built-in constant string.punctuation, found in the string module, and/or Python's regex library, re.

In [8]:
import numpy as np
import nltk
import re
nltk.download('stopwords')
from nltk.corpus import stopwords
STOPWORDS = stopwords.words('english')

def clean(sentence):
  '''1. tokenize the sentence (remove punctuation)
     2. remove the stop words
     3. convert all words to lowercase'''
  sentence = re.sub(r"[^\w]", " ", sentence).lower().split()  #1, 3
  sentence = [i for i in sentence if i not in STOPWORDS]      #2
  return sentence

X_train_token = [clean(sentence) for sentence in X_train]
X_val_token = [clean(sentence) for sentence in X_val]

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
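
To see what the cleaning step does to individual sentences, here is a small self-contained sketch. It mirrors the regex approach of `clean`, but uses a hypothetical miniature stop-word set (`TOY_STOPWORDS`) in place of the NLTK list so that it runs without any download; the real list is much longer.

```python
import re

# Hypothetical miniature stop-word set, used only to keep this sketch offline.
TOY_STOPWORDS = {"the", "a", "an", "to", "of", "off"}

def clean_sketch(sentence, stopwords=TOY_STOPWORDS):
    # Replace every non-word character with a space, lowercase, split on whitespace.
    tokens = re.sub(r"[^\w]", " ", sentence).lower().split()
    # Drop the stop words.
    return [t for t in tokens if t not in stopwords]

print(clean_sketch("Turn off the holoemitter."))
print(clean_sketch("Halt."))
print(clean_sketch("Don't stop!"))
```

Note that the apostrophe in "Don't" is treated as punctuation, so the word splits into two tokens. Whether those fragments survive depends on the stop-word list in use.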


In [9]:
X_token = X_train_token + X_val_token
max_len = np.max([len(i) for i in X_token]) # Find the maximum length of tokens in train/val

### Build a Vocabulary
To build a vocabulary that maps each word to an index, first find the unique words in the train/val set.

Once you have built a vocabulary, save it to a file for future use, because the vocabulary may change each time you run the code.

In [10]:
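from collections import Counter

# Count the frequency of each word across the train/val tokens in a single pass
word_count = Counter(token for tokens in X_token for token in tokens)

# Build the vocabulary: one integer index per unique word.
# Note: vectorize() below also uses 0 for padding and unknown words,
# so the word assigned index 0 shares that value.
word2ind = {word: i for i, word in enumerate(word_count)}

vocab_size = len(word2ind)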
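
The advice above about saving the vocabulary can be sketched as follows. This is one possible approach, not the assignment's required solution: the helper `build_vocab`, the toy token lists, and the file name are all hypothetical. The sketch also makes two design choices that differ from the cell above: it sorts the words so the mapping is the same on every run, and it starts indexing at 1 so that 0 stays free for padding and unknown words.

```python
import json
import os
import tempfile
from collections import Counter

def build_vocab(token_lists):
    """Map each word to an index; 0 is left free for padding/unknown words."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    # Sorting makes the mapping identical across runs.
    return {word: i for i, word in enumerate(sorted(counts), start=1)}

toy_tokens = [["turn", "holoemitter"], ["halt"], ["turn", "left"]]
vocab = build_vocab(toy_tokens)

# Save the vocabulary, then load it back (hypothetical file name).
path = os.path.join(tempfile.gettempdir(), "word2ind_sketch.json")
with open(path, "w") as f:
    json.dump(vocab, f)
with open(path) as f:
    restored = json.load(f)

print(vocab)
print(restored == vocab)
```

If indices start at 1 like this, an embedding layer would need room for one extra row, i.e. an input dimension of `len(vocab) + 1`.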

# Part 1: Recurrent Neural Network

### Convert token to vector
Convert each list of tokens into an array using the vocabulary you built before. The length of the vector is max_len; remember to zero-pad any list whose length is smaller than max_len.

In [11]:
def vectorize(tokens, max_len, word2ind):
  '''
  Input: list of tokens
  Output: 1D numpy array (length = max_len)
  '''
  word_ind = np.zeros((max_len, ))
  for i in range(len(tokens)):
    word_ind[i] = word2ind.get(tokens[i], 0)
  return word_ind

def vectorize_tokens(tokens, max_len, min_len, word2ind):
  '''
  Input: list of tokens; min_len = min(max_len, len(tokens))
  Output: 1D numpy array (length = max_len)
  Only the first min_len tokens are used, so longer inputs are truncated.
  '''
  word_ind = np.zeros((max_len, ))
  for i in range(min_len):
    # Words missing from the vocabulary map to 0
    word_ind[i] = word2ind.get(tokens[i], 0)
  return word_ind

X_train_array = np.array([vectorize(tokens, max_len, word2ind) for tokens in X_train_token])
X_val_array = np.array([vectorize(tokens, max_len, word2ind) for tokens in X_val_token])
assert X_train_array.shape[-1] == max_len
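
The padding rule described above, together with truncation for inputs longer than max_len, can be illustrated on toy data. This is a minimal sketch; `pad_or_truncate` and `toy_vocab` are hypothetical names introduced only for the example.

```python
import numpy as np

def pad_or_truncate(tokens, max_len, word2ind):
    """Return a length-max_len array of indices; unknown words and padding map to 0."""
    out = np.zeros((max_len,))
    for i, tok in enumerate(tokens[:max_len]):  # slicing drops tokens beyond max_len
        out[i] = word2ind.get(tok, 0)
    return out

toy_vocab = {"turn": 1, "left": 2, "now": 3}  # hypothetical vocabulary

short = pad_or_truncate(["turn", "left"], 4, toy_vocab)
long = pad_or_truncate(["turn", "left", "now", "turn", "left"], 4, toy_vocab)

print(short)  # two real indices followed by zero padding
print(long)   # fifth token is cut off
```

Truncation matters mainly for the test sentences, which may be longer than anything seen in the train/val set.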

### One-hot label
Convert each scalar label to a 1D array (length = 9), e.g. 0 -> array([1, 0, 0, 0, 0, 0, 0, 0, 0]).

In [12]:
def onehot(y):
  y_onehot = np.zeros((len(y), len(ind2cat)))
  for i, x in enumerate(y):
    y_onehot[i, x] = 1
  return y_onehot

y_train_onehot = onehot(y_train)
y_val_onehot = onehot(y_val)
assert y_train_onehot.shape[1] == 9
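
An equivalent way to build one-hot labels, shown here only as an alternative sketch, is to index rows of an identity matrix. The function name `onehot_eye` is hypothetical.

```python
import numpy as np

def onehot_eye(y, num_classes):
    # Row i of the identity matrix is the one-hot vector for class i.
    return np.eye(num_classes)[np.asarray(y)]

encoded = onehot_eye([0, 2, 8], 9)

print(encoded.shape)
print(encoded[0])
```

Each row contains exactly one 1, in the column given by the label.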

### Build the Recurrent Neural Network
Now it's time to build the RNN to do the classification task. You can refer to this [official document](https://www.tensorflow.org/guide/keras/rnn).

You will need an Embedding layer, an RNN layer and a Dense layer. The last layer should project to the number of labels.

In [13]:
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.layers import Embedding, LSTM, Dense

model = keras.Sequential()
# Embedding Layer, Input Dimension = vocab_size, Output Dimension = 64
model.add(Embedding(input_dim=vocab_size, output_dim=64))
# Two LSTM layers with 64 Units
model.add(LSTM(64, return_sequences=True))
model.add(LSTM(64))
# Dense to the number of classes with softmax activation function
model.add(Dense(9, activation="softmax"))
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, None, 64)          87168     
                                                                 
 lstm (LSTM)                 (None, None, 64)          33024     
                                                                 
 lstm_1 (LSTM)               (None, 64)                33024     
                                                                 
 dense (Dense)               (None, 9)                 585       
                                                                 
Total params: 153,801
Trainable params: 153,801
Non-trainable params: 0
_________________________________________________________________
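
The parameter counts in the summary above can be reproduced by hand. The sketch below assumes the standard formulas for these layer types with biases enabled: an LSTM has four gates, each with input weights, recurrent weights, and a bias. The helper function names are illustrative.

```python
def embedding_params(vocab_size, dim):
    # One dim-sized vector per vocabulary entry.
    return vocab_size * dim

def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and a bias.
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit.
    return input_dim * units + units

vocab_size = 87168 // 64  # vocabulary size implied by the embedding row above

total = (embedding_params(vocab_size, 64)
         + lstm_params(64, 64)
         + lstm_params(64, 64)
         + dense_params(64, 9))

print(vocab_size, lstm_params(64, 64), dense_params(64, 9), total)
```

The per-layer values and the total agree with the summary, which also shows that most of the parameters sit in the embedding table.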


In [14]:
model.compile(optimizer='adam', loss='categorical_crossentropy',metrics=['accuracy'])
model.fit(X_train_array, y_train_onehot, batch_size=16, epochs=10, validation_data=(X_val_array, y_val_onehot))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f6933f16d10>

### Evaluate on the test sentences
Now run your model to predict on the test sentences. You need to preprocess these sentences first, then save your predictions as a list of labels, e.g. [0, 2, 1, 5, ...].

In [15]:
# Preprocess the test sentences the same way as the training data
test_token = [clean(sentence) for sentence in test_sentences]
test_array = np.array([vectorize_tokens(i, max_len, np.minimum(max_len, len(i)), word2ind) for i in test_token])
# Pick the most probable class for each sentence and store the labels as a list
test_prediction = np.argmax(model.predict(test_array), axis=1).tolist()
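
To inspect predictions by name rather than by index, the labels can be mapped back through `ind2cat`. This is a small sketch on made-up predictions; `toy_prediction` is hypothetical and does not come from the model.

```python
from collections import Counter

# Mapping copied from the cell in Part 0.
ind2cat = {0: 'no', 1: 'driving', 2: 'light', 3: 'head', 4: 'state',
           5: 'connection', 6: 'stance', 7: 'animation', 8: 'grid'}

toy_prediction = [2, 1, 6, 2]  # hypothetical predicted labels

# int() also handles NumPy integer types returned by argmax.
names = [ind2cat[int(i)] for i in toy_prediction]

print(names)
print(Counter(names).most_common(1))
```

Looking at the distribution of predicted names is a quick way to spot a model that has collapsed onto one category.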

# Part 2: Word Embedding via pymagnitude
Instead of using the vocabulary to convert each word to a number, you can use pretrained word embeddings for the task.

Next, you'll need to download a pre-trained set of word embeddings. We'll get a set trained with Google's word2vec algorithm, which we discussed in class. You can check the full list of available embeddings and feel free to try different ones.

In [19]:
# Load the embedding
from pymagnitude import Magnitude
vectors = Magnitude("GoogleNews-vectors-negative300.magnitude")
D = vectors.query("cat").shape[0]  # embedding dimension

### Convert tokens to embeddings
You can now use pymagnitude to query each token and convert each sentence into a list of embeddings. Note that you need to zero-pad to match the maximum length.

In [20]:
def embedding(list_tokens, max_len, vectors, D):
  '''
  return an array with the shape (n_of_samples, max_len, D)
  '''
  n_of_samples = len(list_tokens)
  word_ind = np.zeros((n_of_samples, max_len, D))
  for i in range(n_of_samples):
    tokens = list_tokens[i]
    min_len = np.minimum(max_len, len(tokens))
    for j in range(min_len):
      word_ind[i, j, :] = vectors.query(tokens[j])
  return word_ind
  
X_train_embedding = embedding(X_train_token, max_len, vectors, D)
X_val_embedding = embedding(X_val_token, max_len, vectors, D)

assert X_train_embedding.shape[-1] == D
assert X_train_embedding.shape[-2] == max_len
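
The shape logic of the embedding step can be checked without downloading any vectors. In the sketch below, `ToyVectors` is a hypothetical stand-in that returns a fake vector from a `query()` method; it is not the pymagnitude API and the vectors carry no meaning. Only the resulting array layout (samples, time steps, features) is the point.

```python
import numpy as np

class ToyVectors:
    """Hypothetical stand-in for a pretrained embedding lookup with a query() method."""
    def __init__(self, dim):
        self.dim = dim

    def query(self, token):
        # Deterministic fake vector derived from the token's characters.
        rng = np.random.default_rng(sum(ord(c) for c in token))
        return rng.standard_normal(self.dim)

def embed(list_tokens, max_len, vectors, dim):
    out = np.zeros((len(list_tokens), max_len, dim))
    for i, tokens in enumerate(list_tokens):
        for j, tok in enumerate(tokens[:max_len]):  # positions past the tokens stay zero
            out[i, j, :] = vectors.query(tok)
    return out

toy = embed([["turn", "left"], ["halt"]], max_len=3, vectors=ToyVectors(4), dim=4)

print(toy.shape)
print(toy[1, 1:])  # padded positions of the second sentence
```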

### Build the RNN model
Similar to Part 1, build an RNN model using your new embedding.

In [21]:
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.layers import LSTM, Dense

model = keras.Sequential()
#TODO
# LSTM Layer with input shape (max_len, D), output shape (max_len, 256)
model.add(LSTM(256, input_shape=(max_len, D), return_sequences=True))
# LSTM Layer with 128 units
model.add(LSTM(128))
# Dense to 64 with tanh activation function
model.add(Dense(64, activation="tanh"))
# Dense to number of classes with softmax function
model.add(Dense(9, activation="softmax"))
model.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 lstm_2 (LSTM)               (None, 29, 256)           570368    
                                                                 
 lstm_3 (LSTM)               (None, 128)               197120    
                                                                 
 dense_1 (Dense)             (None, 64)                8256      
                                                                 
 dense_2 (Dense)             (None, 9)                 585       
                                                                 
Total params: 776,329
Trainable params: 776,329
Non-trainable params: 0
_________________________________________________________________


In [22]:
model.compile(optimizer='adam', loss='categorical_crossentropy',metrics=['accuracy'])
model.fit(X_train_embedding, y_train_onehot, batch_size=16, epochs=10, validation_data=(X_val_embedding, y_val_onehot))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f67ea474dd0>

### Evaluate on the test sentences
Now run your model to predict on the test sentences. You need to preprocess these sentences first, then save your predictions as a list of labels, e.g. [0, 2, 1, 5, ...].

In [23]:
# Embed the cleaned test tokens, then pick the most probable class for each sentence
test_embedding = embedding(test_token, max_len, vectors, D)
test_prediction = np.argmax(model.predict(test_embedding), axis=1).tolist()

# Part 3: BERT

In this part, you will use the BERT pipeline to further improve the performance.

This part is open-ended. We provide just one example of using BERT; feel free to find other tutorials online and customize them for this task.

Here is the list of all existing models.

In [26]:
#from transformers import *
from transformers import BertTokenizer, TFBertModel, BertConfig, TFBertForSequenceClassification
bert_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased") #feel free to change the model
bert_model = TFBertForSequenceClassification.from_pretrained('bert-base-uncased',num_labels=9)

All model checkpoint layers were used when initializing TFBertForSequenceClassification.

Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### Use BERT Tokenizer to preprocess the data
The BERT tokenizer returns a dictionary that contains 'input_ids', 'token_type_ids' and 'attention_mask'. We will use 'input_ids' and 'attention_mask' later.

In [27]:
# Test the tokenizer
sent = X_train[0]
# Pad to max_length and truncate explicitly
tokenized_sequence = bert_tokenizer.encode_plus(sent, add_special_tokens=True,
                                                max_length=30, padding="max_length",
                                                truncation=True,
                                                return_attention_mask=True)
print(tokenized_sequence)
print(bert_tokenizer.decode(tokenized_sequence['input_ids']))

{'input_ids': [101, 7592, 1012, 2071, 2017, 2377, 2033, 2070, 2189, 2011, 4202, 9170, 1029, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}
[CLS] hello. could you play me some music by taylor swift? [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]
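
The relationship between 'input_ids' and 'attention_mask' in the output above can be reproduced by hand: the mask is 1 wherever there is a real token and 0 wherever there is padding. The sketch below copies the ids from that output and assumes the padding id is 0, as the decoded [PAD] positions suggest.

```python
# Token ids copied from the tokenizer output above (14 real tokens, then padding).
input_ids = [101, 7592, 1012, 2071, 2017, 2377, 2033, 2070, 2189, 2011,
             4202, 9170, 1029, 102] + [0] * 16

PAD_ID = 0  # id of [PAD], as seen in the decoded output above

# 1 for real tokens (including [CLS] and [SEP]), 0 for padding.
attention_mask = [int(tok != PAD_ID) for tok in input_ids]

print(len(input_ids), sum(attention_mask))
```

The model uses this mask to ignore the padded positions when computing attention.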




Using the BERT tokenizer described above, encode the training and validation sentences. Note that the max length should be 64.

In [28]:
def BERT_Tokenizer(sentences):
  '''Input: list of sentences
     Output: two numpy arrays (input_ids, attention_mask), each of shape (n_sentences, 64)
  '''
  text_dict = bert_tokenizer.batch_encode_plus(sentences,
                                              max_length=64,
                                              padding="max_length",
                                              truncation=True,
                                              add_special_tokens=True,
                                              return_tensors="np")  # NumPy output for the TF/Keras model
  return np.array(text_dict["input_ids"]), np.array(text_dict["attention_mask"])

X_train_ids, X_train_masks = BERT_Tokenizer(X_train)
X_val_ids, X_val_masks = BERT_Tokenizer(X_val)
y_train_array = np.array(y_train)
y_val_array = np.array(y_val)
assert X_train_ids.shape[-1] == 64

In [29]:
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-6,epsilon=1e-08)
bert_model.compile(loss=loss,optimizer=optimizer,metrics=[metric])
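
The loss configured above takes integer labels and raw logits, which is why this part uses `y_train_array` directly instead of one-hot labels. For a single example the value is the negative log of the softmax probability assigned to the true class. The plain-Python sketch below illustrates that formula; the function name is hypothetical and this is not the Keras implementation.

```python
import math

def sparse_ce_from_logits(logits, label):
    """Cross-entropy for one example: -log(softmax(logits)[label])."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

# Nine equal logits: every class gets probability 1/9.
uniform = sparse_ce_from_logits([0.0] * 9, label=3)
# A large logit on the correct class gives a small loss.
confident = sparse_ce_from_logits([0.0, 5.0, 0.0], label=1)

print(round(uniform, 4))
print(confident < uniform)
```

With nine classes, a model that has learned nothing yet should start near a loss of ln 9, which gives a reference point when reading the training log.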

In [36]:
bert_model.fit([X_train_ids,X_train_masks],y_train_array,batch_size=16,epochs=5,validation_data=([X_val_ids,X_val_masks],y_val_array))

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7f6a2cacd290>

### Evaluate on test sentences
Again, use BERT to predict on the test sentences.

In [37]:
# Tokenize the test sentences, then take the argmax over the logits
test_ids, test_masks = BERT_Tokenizer(test_sentences)
test_prediction = np.argmax(bert_model.predict([test_ids, test_masks])[0], axis=1).tolist()