# Lab 11: Dialogue Act Tagging

Dialogue act (DA) tagging is an important step in the process of developing dialog systems. DA tagging is a problem usually solved by supervised machine learning approaches that all require large amounts of hand labeled data. A wide range of techniques have been investigated for DA tagging. In this lab, we explore two approaches to DA classification. We are using the Switchboard Dialog Act Corpus for training.
Corpus can be downloaded from http://compprag.christopherpotts.net/swda.html.


The downloaded dataset should be kept in a data folder in the same directory as this file. 

In [1]:
%tensorflow_version 1.14

import pandas as pd
import glob
from keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
import numpy as np
import keras

import sklearn.metrics
import tensorflow as tf
import matplotlib.pyplot as plt
from tqdm import tqdm_notebook as tqdm

print(tf.__version__)

`%tensorflow_version` only switches the major version: 1.x or 2.x.
You set: `1.14`. This will be interpreted as: `1.x`.


TensorFlow 1.x selected.


Using TensorFlow backend.


1.15.2


In [0]:
from zipfile import ZipFile

with ZipFile('swda.zip', 'r') as swda:
   # Extract all the contents of zip file in current directory
   swda.extractall()

In [0]:
f = glob.glob("swda/sw*/sw*.csv")
frames = []
for i in range(0, len(f)):
    frames.append(pd.read_csv(f[i]))

result = pd.concat(frames, ignore_index=True)


In [4]:
print("Number of converations in the dataset:",len(result))


Number of converations in the dataset: 223606


The dataset has many different features, we are only using act_tag and text for this training.


In [0]:
reduced_df = result[['act_tag','text']]


Reduce down the number of tags to 43 - converting the combined tags to their generic classes:

In [0]:
# Imported from "https://github.com/cgpotts/swda"
# Convert the combination tags to the generic 43 tags

import re
def damsl_act_tag(input):
        """
        Seeks to duplicate the tag simplification described at the
        Coders' Manual: http://www.stanford.edu/~jurafsky/ws97/manual.august1.html
        """
        d_tags = []
        tags = re.split(r"\s*[,;]\s*", input)
        for tag in tags:
            if tag in ('qy^d', 'qw^d', 'b^m'): pass
            elif tag == 'nn^e': tag = 'ng'
            elif tag == 'ny^e': tag = 'na'
            else: 
                tag = re.sub(r'(.)\^.*', r'\1', tag)
                tag = re.sub(r'[\(\)@*]', '', tag)            
                if tag in ('qr', 'qy'):                         tag = 'qy'
                elif tag in ('fe', 'ba'):                       tag = 'ba'
                elif tag in ('oo', 'co', 'cc'):                 tag = 'oo_co_cc'
                elif tag in ('fx', 'sv'):                       tag = 'sv'
                elif tag in ('aap', 'am'):                      tag = 'aap_am'
                elif tag in ('arp', 'nd'):                      tag = 'arp_nd'
                elif tag in ('fo', 'o', 'fw', '"', 'by', 'bc'): tag = 'fo_o_fw_"_by_bc'            
            d_tags.append(tag)
        # Dan J says (p.c.) that it makes sense to take the first;
        # there are only a handful of examples with 2 tags here.
        return d_tags[0]

In [7]:
reduced_df["act_tag"] = reduced_df["act_tag"].apply(lambda x: damsl_act_tag(x))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


There are 43 tags in this dataset. Some of the tags are Yes-No-Question('qy'), Statement-non-opinion('sd') and Statement-opinion('sv'). Tags information can be found here http://compprag.christopherpotts.net/swda.html#tags. 


To get unique tags:

In [0]:
unique_tags = set()
for tag in reduced_df['act_tag']:
    unique_tags.add(tag)

In [0]:
one_hot_encoding_dic = pd.get_dummies(list(unique_tags))


In [0]:
tags_encoding = []
for i in range(0, len(reduced_df)):
    tags_encoding.append(one_hot_encoding_dic[reduced_df['act_tag'].iloc[i]])

The tags are one hot encoded.

To create sentence embeddings:

In [0]:
sentences = []
for i in range(0, len(reduced_df)):
    sentences.append(reduced_df['text'].iloc[i].split(" "))


In [0]:
wordvectors = {}
index = 1
for s in sentences:
    for w in s:
        if w not in wordvectors:
            wordvectors[w] = index
            index += 1

In [0]:
# Max length of 137
MAX_LENGTH = len(max(sentences, key=len))

In [0]:
sentence_embeddings = []
for s in sentences:
    sentence_emb = []
    for w in s:
        sentence_emb.append(wordvectors[w])
    sentence_embeddings.append(sentence_emb)


Then we split the dataset into test and train.

In [0]:
from sklearn.model_selection import train_test_split
import numpy as np
X_train, X_test, y_train, y_test = train_test_split(sentence_embeddings, np.array(tags_encoding))


And pad the sentences with zero to make all sentences of equal length.


In [0]:
MAX_LENGTH = 137

In [0]:
from keras.preprocessing.sequence import pad_sequences
 
train_sentences_X = pad_sequences(X_train, maxlen=MAX_LENGTH, padding='post')
test_sentences_X = pad_sequences(X_test, maxlen=MAX_LENGTH, padding='post')

Split Train into Train and Validation - about 10% into validation - In order to validate the model as it is training

In [0]:

train_input = train_sentences_X[:140000]
val_input = train_sentences_X[140000:]

train_labels = y_train[:140000]
val_labels = y_train[140000:]


# Model 1 - 

The first approach we'll try is to treat DA tagging as a standard multi-class text classification task, in the way you've done before with sentiment analysis and other tasks. Each utterance will be treated independently as a text to be classified with its DA tag label. This model has an architecture of:

- Embedding  
- BLSTM  
- Fully Connected Layer
- Softmax Activation

 The model architecture is as follows: Embedding Layer (to generate word embeddings) Next layer Bidirectional LSTM. Feed forward layer with number of neurons = number of tags. Softmax activation to get the probabilities.


In [0]:
VOCAB_SIZE = len(wordvectors) # 43,731
MAX_LENGTH = len(max(sentences, key=len))
EMBED_SIZE = 100 # arbitary
HIDDEN_SIZE = len(unique_tags) 

In [20]:

from keras.models import Sequential
from keras.layers import LSTM
from keras.layers import Dense
from keras.layers import Dropout, InputLayer, Bidirectional, TimeDistributed, Activation, Embedding
from keras.optimizers import Adam

#Building the network

# Include 2 BLSTM layers, in order to capture both the forward and backward hidden states
model = Sequential()

# Embedding layer
model.add(Embedding(input_dim=VOCAB_SIZE,output_dim=EMBED_SIZE,input_length=MAX_LENGTH))

# Bidirectional 1
model.add(Bidirectional(LSTM(HIDDEN_SIZE, return_sequences=True)))

# Bidirectional 2
model.add(Bidirectional(LSTM(HIDDEN_SIZE, return_sequences=False)))

# Dense layer
model.add(Dense(HIDDEN_SIZE))

#model.add(TimeDistributed(Dense(1, activation='softmax')))
# Activation
model.add(Activation('softmax'))

model.compile(loss='categorical_crossentropy',optimizer='adam',metrics=['accuracy'])

model.summary()






Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 137, 100)          4373100   
_________________________________________________________________
bidirectional_1 (Bidirection (None, 137, 86)           49536     
_________________________________________________________________
bidirectional_2 (Bidirection (None, 86)                44720     
_________________________________________________________________
dense_1 (Dense)              (None, 43)                3741      
_________________________________________________________________
activation_1 (Activation)    (None, 43)                0         
Total params: 4,471,097
Trainable params: 4,471,097
Non-trainable params: 0
_________________________________________________________________


In [22]:
# Train the model - using validation 

model.fit(train_input,train_labels,validation_data=(val_input,val_labels), epochs = 3,batch_size=1000, verbose = 1)

Train on 140000 samples, validate on 27704 samples
Epoch 1/3

InvalidArgumentError: ignored

In [0]:
score = model.evaluate(test_sentences_X, y_test, batch_size=100)

In [0]:
print("Overall Accuracy:", score[1]*100)


## Evaluation


The overall accuracy is 67%, an effective accuracy for this task.

In addition to overall accuracy, you need to look at the accuracy of some minority classes. Signal-non-understanding ('br') is a good indicator of "other-repair" or cases in which the other conversational participant attempts to repair the speaker's error. Summarize/reformulate ('bf') has been used in dialogue summarization. Report the accuracy for these classes and some frequent errors you notice the system makes in predicting them. What do you think the reasons are？

## Minority Classes

In [0]:
# Generate predictions for the test data
y_pred = model.predict(test_sentences_X,batch_size=100)

In [0]:
from sklearn.metrics import confusion_matrix

confusion_matrix(np.argmax(y_test, axis=1), np.argmax(y_pred, axis=1))


In [0]:
import seaborn as sn
import pandas as pd
import matplotlib.pyplot as plt

confusion_mat = confusion_matrix(np.argmax(y_test, axis=1), np.argmax(y_pred, axis=1))
df_cm = pd.DataFrame(confusion_mat, index = [i for i in unique_tags],
                  columns = [i for i in unique_tags])
plt.figure(figsize = (15,10),)
sn.heatmap(df_cm, annot=True)

In [0]:
from sklearn.metrics import classification_report,accuracy_score

print('accuracy %s' % accuracy_score(np.argmax(y_test, axis=1), np.argmax(y_pred, axis=1)))
print(classification_report(np.argmax(y_test, axis=1), np.argmax(y_pred, axis=1),target_names=unique_tags))

In [0]:
# Argmax value of "br" as index
br_index = np.argmax(one_hot_encoding_dic["br"]) 

# Accuracy, using the index
br_acc = confusion_mat[br_index][br_index] / sum(confusion_mat[br_index])
print('"br" accuracy: ' + str(br_acc*100))

# Argmax value of "bf"
bf_index = np.argmax(one_hot_encoding_dic["bf"])

# Accuracy, using the index
bf_acc = confusion_mat[bf_index][bf_index] / sum(confusion_mat[bf_index])
print('"bf" accuracy: ' + str(bf_acc*100))


Due to the reduced lack of training data for the minority classes, these minority classifiers will not be very confident in classification, as they have not been fully optimised. The frequent classifiers will be more optimised and will generate more confident scores for all examples, effectively crowding out the less confident minority classifiers. 




# Model 2 - Balanced Network


One thing we can do to try to improve performance is therefore to balance the data more sensibly. As the dataset is highly imbalanced, we can simply weight up the minority classes proportionally to their underrepresentation while training. 

In [0]:
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y_integers = np.argmax(tags_encoding, axis=1)
class_weights = compute_class_weight('balanced', np.unique(y_integers), y_integers)
d_class_weights = dict(enumerate(class_weights))

## Define & Train the model

In [0]:
# Re-built the model for the balanced training

model_balanced = Sequential()
model_balanced.add(Embedding(input_dim=VOCAB_SIZE,output_dim=EMBED_SIZE,input_length=MAX_LENGTH))
model_balanced.add(Bidirectional(LSTM(HIDDEN_SIZE, return_sequences=True)))
model_balanced.add(Bidirectional(LSTM(HIDDEN_SIZE, return_sequences=False)))
model_balanced.add(Dense(HIDDEN_SIZE))
model_balanced.add(Activation('softmax'))

model_balanced.compile(loss='categorical_crossentropy',optimizer='adam',metrics=['accuracy'])

model_balanced.summary()


In [0]:
# Train the balanced network -  takes  time to achieve good accuracy
for i in range(10):
  model_balanced.fit(train_input,train_labels,validation_data=(val_input,val_labels), epochs = 5,batch_size=1000, verbose = 1,class_weight=d_class_weights)

## Test the model

In [0]:
# Overall Accuracy
score = model_balanced.evaluate(test_sentences_X, y_test, batch_size=100)

In [0]:
print("Overall Accuracy:", score[1]*100)

In [0]:
# Generate predictions for the test data
label_pred = model_balanced.predict(test_sentences_X, batch_size=100)

## Balanced network evaluation

Report the overall accuracy and the accuracy of  'br' and 'bf'  classes. Suggest other ways to handle imbalanced classes.

In [0]:
# Build the confusion matrix off these predictions

matrix_balanced = sklearn.metrics.confusion_matrix(y_test.argmax(axis=1), label_pred.argmax(axis=1))

df_cm = pd.DataFrame(matrix_balanced, index = [i for i in unique_tags],
                  columns = [i for i in unique_tags])
plt.figure(figsize = (15,10),)
sn.heatmap(df_cm, annot=True)

In [0]:
print('accuracy %s' % accuracy_score(np.argmax(y_test, axis=1), np.argmax(label_pred, axis=1)))
print(classification_report(np.argmax(y_test, axis=1), np.argmax(label_pred, axis=1),target_names=unique_tags))

In [0]:
"br"
br_index = np.argmax(one_hot_encoding_dic["br"])
br_acc = matrix_balanced[br_index][br_index] / sum(matrix_balanced[br_index])
print('"br" accuracy: ' + str(br_acc*100))

"bf"
bf_index = np.argmax(one_hot_encoding_dic["bf"])
bf_acc = matrix_balanced[bf_index][bf_index] / sum(matrix_balanced[bf_index])
print('"bf" accuracy: ' + str(bf_acc*100))



### Accuracies



### Explanation


### Other ways to handle imbalanced classes


- 

- 

Can we improve things by using context information?  Next we try to build a model which predicts DA tag from the sequence of 
previous DA tags, plus the utterance representation. 

# Using Context for Dialog Act Classification

The second approach we will try is a hierarchical approach to DA tagging. We expect there is valuable sequential information among the DA tags. So in this section we apply a BiLSTM on top of the sentence CNN representation. The CNN model learns textual information in each utterance for DA classification, acting like the text classifier from Model 1 above. Then we use a bidirectional-LSTM (BLSTM) above that to learn how to use the context before and after the current utterance to improve the output.

## Define the model

This model has an architecture of:

- Word Embedding
- CNN
- Bidirectional LSTM
- Fully-Connected output



## CNN


This is a classical CNN layer used to convolve over embedings tensor and gether useful information from it. The data is represented by hierarchy of features, which can be modelled using a CNN. We transform/reshape conv output to 2d matrix. Then we pass it to the max pooling layer that applies the max pool operation on windows of different sizes.

In [0]:
from keras.layers import Input,Reshape,Conv2D,MaxPool2D,BatchNormalization,Flatten

filter_sizes = [3,4,5]
num_filters = 64
drop = 0.2
VOCAB_SIZE = len(wordvectors) # 43,731
MAX_LENGTH = len(max(sentences, key=len))
EMBED_SIZE = 100 # arbitary
HIDDEN_SIZE = len(unique_tags) 

# CNN model
inputs = Input(shape=(MAX_LENGTH, ), dtype='int32')
embedding = Embedding(input_dim=VOCAB_SIZE, output_dim=EMBED_SIZE, input_length=MAX_LENGTH)(inputs)
reshape = Reshape((MAX_LENGTH, EMBED_SIZE, 1))(embedding)

# 3 convolutions
conv_0 = Conv2D(num_filters, kernel_size=(filter_sizes[0], EMBED_SIZE), strides=1, padding='valid', kernel_initializer='normal', activation='relu')(reshape)
bn_0 = BatchNormalization()(conv_0)
conv_1 = Conv2D(num_filters, kernel_size=(filter_sizes[1], EMBED_SIZE), strides=1, padding='valid', kernel_initializer='normal', activation='relu')(reshape)
bn_1 = BatchNormalization()(conv_1)
conv_2 = Conv2D(num_filters, kernel_size=(filter_sizes[2], EMBED_SIZE), strides=1, padding='valid', kernel_initializer='normal', activation='relu')(reshape)
bn_2 = BatchNormalization()(conv_2)

# maxpool for 3 layers
maxpool_0 = MaxPool2D(pool_size=(MAX_LENGTH - filter_sizes[0] + 1, 1), padding='valid')(bn_0)
maxpool_1 = MaxPool2D(pool_size=(MAX_LENGTH - filter_sizes[1] + 1, 1), padding='valid')(bn_1)
maxpool_2 = MaxPool2D(pool_size=(MAX_LENGTH - filter_sizes[2] + 1, 1), padding='valid')(bn_2)

# concatenate tensors
merge = keras.layers.concatenate([maxpool_0,maxpool_1,maxpool_2])
# flatten concatenated tensors
flat = Flatten()(merge)
# dense layer (dense_1)
dense_1 = Dense(HIDDEN_SIZE)(flat)
# dropout_1
dropout_1 = Dropout(drop)(dense_1)

## BLSTM

This is used to create LSTM layers. The data we’re working with has temporal properties which we want to model as well — hence the use of a LSTM. You should create a BiLSTM.

In [0]:
# BLSTM model

# Bidirectional 1
BLSTM1 = Bidirectional(LSTM(HIDDEN_SIZE, return_sequences=True))(embedding)

# Bidirectional 2
BLSTM2 = Bidirectional(LSTM(HIDDEN_SIZE, return_sequences=False))(BLSTM1)

# Dense layer (dense_2)
dense_2 = Dropout(drop)(BLSTM2)

# dropout_2
dropout_2 = Dropout(drop)(dense_2)




Concatenate 2 last layers and create the output layer

In [0]:
# concatenate 2 final layers

final = keras.layers.concatenate([dropout_1, dropout_2])

# output
output = Dense(units=HIDDEN_SIZE, activation='softmax')(final)



In [0]:
from keras import Model

# Train the model - using validation 
model = keras.Model(inputs=inputs,outputs=output)
model.compile(loss='categorical_crossentropy',optimizer='adam',metrics=['accuracy'])
model.fit(train_input,train_labels,validation_data=(val_input,val_labels), epochs = 5,batch_size=1000, verbose = 1)


In [0]:
score = model.evaluate(test_sentences_X, y_test, batch_size=100)

In [0]:
print("Overall Accuracy:", score[1]*100)

In [0]:
# Generate predictions for the test data
label_pred = model.predict(test_sentences_X, batch_size=100)

In [0]:
matrix = sklearn.metrics.confusion_matrix(y_test.argmax(axis=1), label_pred.argmax(axis=1))

"br"
br_index = np.argmax(one_hot_encoding_dic["br"])
br_acc = matrix[br_index][br_index] / sum(matrix[br_index])
print('"br" accuracy: ' + str(br_acc*100))

"bf"
bf_index = np.argmax(one_hot_encoding_dic["bf"])
bf_acc = matrix[bf_index][bf_index] / sum(matrix[bf_index])
print('"bf" accuracy: ' + str(bf_acc*100))

Report your overall accuracy. Did context help disambiguate and better predict the minority classes ('br' and 'bf')? What are frequent errors? Show one positive example where adding context changed the prediction.




### Minority Classes



# Advanced:  Bert-Based Model for Dialogue Act Tagging

In the last section we want to use BERT and leverage contextual word embeddings, following on from the last lab you've 
just done. This is an advanced part of the assignment and worth 10 marks (20%) in total. You could use your BERT-based text classifier here (instead of the CNN utterance-level classifier) and see if a pre-trained BERT language model helps. The domain difference from conversational data is one possible downside to using BERT. Explore some techniques to efficiently transfer the knowledge from conversational data and to improve model performance on DA tagging.