# Chatbot: Data Preprocessing (PyTorch)

This notebook preprocesses the [MultiWOZ dataset](https://github.com/budzianowski/multiwoz) for later use in training a chatbot in PyTorch.

---
Last updated: 11/04/2021



## Import libraries

In [1]:
import json
import random

from termcolor import colored

import pandas as pd

## The Datasets - MultiWOZ (ver 2.1)

The Multi-Domain Wizard-of-Oz datasets can be downloaded from the following website: https://github.com/budzianowski/multiwoz

After downloading, I uploaded the file `data.json` (263 MB) to my Google Drive.

In [2]:
## Load data from the Cloud
with open('drive/MyDrive/chatbot/data/data.json') as file:
    multiwoz_dataset_json = json.load(file)

print("Number of dialogues: {}\n".format(len(multiwoz_dataset_json)))
print(list(multiwoz_dataset_json.keys())[:10])

Number of dialogues: 10438

['SNG01856.json', 'SNG0129.json', 'PMUL1635.json', 'MUL2168.json', 'SNG0073.json', 'SNG01445.json', 'MUL2105.json', 'PMUL1690.json', 'MUL2395.json', 'SNG0190.json']


In [3]:
# print an example dialogue
example_dialogue = multiwoz_dataset_json['SNG01856.json']['log']
for i in range(len(example_dialogue)):
    if i % 2 == 0:
        print(colored(example_dialogue[i]['text'], 'blue'))
    else:
        print(colored(example_dialogue[i]['text'], 'red'))

am looking for a place to to stay that has cheap price range it should be in a type of hotel
Okay, do you have a specific area you want to stay in?
no, i just need to make sure it's cheap. oh, and i need parking
I found 1 cheap hotel for you that includes parking. Do you like me to book it?
Yes, please. 6 people 3 nights starting on tuesday.
I am sorry but I wasn't able to book that for you for Tuesday. Is there another day you would like to stay or perhaps a shorter stay?
how about only 2 nights.
Booking was successful.
Reference number is : 7GAWK763. Anything else I can do for you?
No, that will be all. Good bye.
Thank you for using our services.


## Remove the invalid dialogue `SNG01862.json`

Based on the [notes of MultiWOZ 2.2](https://github.com/budzianowski/multiwoz/tree/master/data/MultiWOZ_2.2), the authors discovered that `SNG01862.json` is an invalid dialogue and removed it. We therefore removed this dialogue from the datasets as well.

After removing this dialogue, the dataset contains 10,437 dialogues.
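
A minimal sketch of the removal, using a toy dictionary in place of the real `multiwoz_dataset_json`. The dialogue texts and the helper name `drop_dialogues` are made up for illustration. The preprocessing cell later in this notebook skips the key inside its loop instead; filtering up front, as here, should be an equivalent alternative.

```python
def drop_dialogues(dataset, invalid_ids):
    """Return a copy of `dataset` without the dialogues listed in `invalid_ids`."""
    invalid = set(invalid_ids)
    return {name: dialogue for name, dialogue in dataset.items() if name not in invalid}


# Toy stand-in for the loaded JSON: dialogue ID -> {"log": [turns]}
toy_dataset = {
    "SNG01856.json": {"log": [{"text": "i need a cheap hotel"}, {"text": "Which area?"}]},
    "SNG01862.json": {"log": [{"text": "I want to take a taxi"}]},
    "PMUL1635.json": {"log": [{"text": "find me a train"}, {"text": "Where to?"}]},
}

cleaned = drop_dialogues(toy_dataset, ["SNG01862.json"])
print(len(toy_dataset), "->", len(cleaned))
```

The original dictionary is left untouched, so the invalid dialogue can still be inspected afterwards.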

In [8]:
# print the invalid dialogue
print("[Invalid dialogue - SNG01862.json]\n")
example_dialogue = multiwoz_dataset_json['SNG01862.json']['log']
for i in range(len(example_dialogue)):
    if i % 2 == 0:
        print(colored(example_dialogue[i]['text'], 'blue'))
    else:
        print(colored(example_dialogue[i]['text'], 'red'))

[Invalid dialogue - SNG01862.json]

I want to take a taxi from arbury lodge guest house to the varsity restaurant.
I need a taxi to pick me up at 01:00 at the arbury lodge guesthouse and have me at the varsity restaurant at 01:30
My taxi should arrive by 13:30
Please modifiy the following answers based on the latest customer response:

What does the user want?
Where is the departure site ?	
arbury lodge guesthouse
Where is the destination ?	
the varsity restaurant
When does the user want to leave ?	
01:00
When does the user want to arrive by ?	
01:30
Booking completed!
Booked car type	:	black ford
Contact number	:	07792065670


## Preprocessing (1): add Person 1 & Person 2

For each sentence, we add `Person 1:` or `Person 2:` at the beginning.
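
The labelling rule can be sketched as a small function: even-indexed turns (0, 2, ...) are treated as `Person 1` and odd-indexed turns as `Person 2`. The name `label_turns` and the toy turns are hypothetical. Unlike the cell in this notebook, this variant joins the turns with a single space, so the result has no leading space.

```python
def label_turns(log):
    """Join the turns of one dialogue, prefixing even turns with
    'Person 1:' and odd turns with 'Person 2:'."""
    labelled = []
    for i, turn in enumerate(log):
        speaker = "Person 1" if i % 2 == 0 else "Person 2"
        labelled.append(f"{speaker}: {turn['text']}")
    return " ".join(labelled)


# Toy dialogue in the same shape as a MultiWOZ 'log' list
toy_log = [
    {"text": "i need a cheap hotel"},
    {"text": "Which area?"},
    {"text": "any area is fine"},
]
print(label_turns(toy_log))
```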

In [9]:
## Add "Person 1" and "Person 2" into the dialogue

dialogue_sentences_list = []

for json_index in multiwoz_dataset_json.keys():

    # skip the invalid dialogue
    if json_index == "SNG01862.json":
        continue
    
    dialogue = multiwoz_dataset_json[json_index]['log']
    dialogue_sentences_str = ""

    for i in range(len(dialogue)):
        if i % 2 == 0: # Person 1
            dialogue_sentences_str += " Person 1: " + dialogue[i]['text']
        else: # Person 2
            dialogue_sentences_str += " Person 2: " + dialogue[i]['text']
    
    dialogue_sentences_list.append(dialogue_sentences_str)

print("Number of dialogues: {}".format(len(dialogue_sentences_list)))
print(dialogue_sentences_list[0])

Number of dialogues: 10437
 Person 1: am looking for a place to to stay that has cheap price range it should be in a type of hotel Person 2: Okay, do you have a specific area you want to stay in? Person 1: no, i just need to make sure it's cheap. oh, and i need parking Person 2: I found 1 cheap hotel for you that includes parking. Do you like me to book it? Person 1: Yes, please. 6 people 3 nights starting on tuesday. Person 2: I am sorry but I wasn't able to book that for you for Tuesday. Is there another day you would like to stay or perhaps a shorter stay? Person 1: how about only 2 nights. Person 2: Booking was successful.
Reference number is : 7GAWK763. Anything else I can do for you? Person 1: No, that will be all. Good bye. Person 2: Thank you for using our services.


## Preprocessing (2): shuffling
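
Because the split is taken by slicing, shuffling first keeps the three sets from simply following the order of the file, and a fixed seed makes the order reproducible. The sketch below illustrates the reproducibility on toy data. It uses a local `random.Random` instance rather than the module-level seed, which is an assumption of this example, and `seeded_shuffle` is a hypothetical helper.

```python
import random


def seeded_shuffle(items, seed):
    """Return a shuffled copy of `items`; the same seed always gives the same order."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return shuffled


toy_dialogues = [f"dialogue {i}" for i in range(10)]
first = seeded_shuffle(toy_dialogues, seed=1)
second = seeded_shuffle(toy_dialogues, seed=1)

print(first == second)                          # same seed, same order
print(sorted(first) == sorted(toy_dialogues))   # nothing lost or duplicated
```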

In [10]:
## Shuffle
random.seed(1)
random.shuffle(dialogue_sentences_list)

## Preprocessing (3): remove spaces at the beginning/end of the dialogue

In [11]:
for i in range(len(dialogue_sentences_list)):
    dialogue_sentences_list[i] = dialogue_sentences_list[i].strip()

print(dialogue_sentences_list[0])

Person 1: Hello, I'm looking for a hotel that is cheap. It doesn't need to have free parking.  Person 2: I have found 10 hotels that match the criteria you listed. Is there any other criteria you have so we can narrow down your choices some more? Person 1: Yes, the hotel need to have free wifi. Person 2: All 10 of the places to stay have free wifi. Do you have a preference of a hotel or guest house? Or possibly a price or a number of stars? Person 1: No, I just need one of them that is cheap with wifi. Person 2: How about the Alexander bed and breakfast? Person 1: Yea that sounds good. Can I book for 5 people and 4 nights starting Friday Person 2: I'm sorry, that booking wasn't successful. Perhaps try another day or maybe shorten your stay? Person 1: Can we try for 2 nights? Person 2: Yes, that worked. I've got you booked with reference number GR86A29T. Is there anything else I can help you with? Person 1: I am looking for a place to go in the city centre. Person 2: we've got plenty  o

## Split into training/validation/test datasets

We split the 10,437 dialogues into 9,000 for training, 500 for validation, and 937 for testing.
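
The arithmetic behind the split can be sketched with a small helper. `split_by_counts` is a hypothetical name, and integers stand in for the dialogue strings. The test set simply takes whatever remains after the training and validation slices.

```python
def split_by_counts(items, n_train, n_valid):
    """Slice `items` into train/validation/test; the test set takes whatever is left."""
    if n_train + n_valid > len(items):
        raise ValueError("n_train + n_valid exceeds the number of items")
    train = items[:n_train]
    valid = items[n_train:n_train + n_valid]
    test = items[n_train + n_valid:]
    return train, valid, test


toy_dialogues = list(range(10437))  # integers stand in for dialogue strings
train, valid, test = split_by_counts(toy_dialogues, n_train=9000, n_valid=500)
print(len(train), len(valid), len(test))
```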

In [12]:
data_train = dialogue_sentences_list[:9000]
data_valid = dialogue_sentences_list[9000:9500]
data_test = dialogue_sentences_list[9500:]

print("data_train: {}".format(len(data_train)))
print("data_valid: {}".format(len(data_valid)))
print("data_test: {}".format(len(data_test)))

data_train: 9000
data_valid: 500
data_test: 937


In [13]:
## Example train data
print(data_train[0])

Person 1: Hello, I'm looking for a hotel that is cheap. It doesn't need to have free parking.  Person 2: I have found 10 hotels that match the criteria you listed. Is there any other criteria you have so we can narrow down your choices some more? Person 1: Yes, the hotel need to have free wifi. Person 2: All 10 of the places to stay have free wifi. Do you have a preference of a hotel or guest house? Or possibly a price or a number of stars? Person 1: No, I just need one of them that is cheap with wifi. Person 2: How about the Alexander bed and breakfast? Person 1: Yea that sounds good. Can I book for 5 people and 4 nights starting Friday Person 2: I'm sorry, that booking wasn't successful. Perhaps try another day or maybe shorten your stay? Person 1: Can we try for 2 nights? Person 2: Yes, that worked. I've got you booked with reference number GR86A29T. Is there anything else I can help you with? Person 1: I am looking for a place to go in the city centre. Person 2: we've got plenty  o

In [14]:
## Example validation data
print(data_valid[0])

Person 1: I'm stuck up here in Kings Lynn and really need to get into Cambridge. Can you look up a train for me please? Person 2: I can absolutely help you, but let's get some more information so we can book your ticket. What day and time would you like to leave?  Person 1: I would like to leave on Saturday after 17:00 Person 2: I have 7 different trains. The first one after 17:00 is train TR1499. It leaves at 17:11 and arrives at 17:58. Would that work for you?  Person 1: That works for me. What is the price for that? Person 2: 7.84 pounds for each ticket. Would you like me to book the train? Person 1: No, I'm not ready to book yet.  Thanks though.  Can you recommend a latin American restaurant? Person 2: I'm sorry there are no latin american restaurant Person 1: Are there any in the centre in the cheap price range? Person 2: I have fifteen places to dine in the centre, in a cheap range. What type of food would you like? Person 1: What about one that serves Asian oriental food? Person

In [15]:
## Example test data
print(data_test[0])

Person 1: I need to find a spot on a train on Wednesday, can you help me find one? Person 2: Yes I can. Where are you going and what time would like to arrive or depart? Person 1: I'm leaving from London Kings Cross and going to Cambridge. I'd like to leave after 14:30 on Wednesday.  Person 2: I have 5 options available for wednesday.  You can leave at 15:17, 17:17. 19:17, 21:17, or 23:17.  What time would you like to leave? Person 1: I'll take the first one at 15:17. When does it arrive? Person 2: It will arrive at 16:08 Person 1: ok i am also looking for a place to eat in the expensive price range   and should be located in the west  Person 2: There are nine restaurants available. Is there a specific food type you're looking for?  Person 1: No, nothing in particular. Just anything you would recommend. Person 2: I recommend the graffiti located at Hotel Felix Whitehouse Lane Huntington Road.  Person 1: Can you tell me the food type as well as the phone number and address? Person 2: Ye

## Convert to DataFrame

In [16]:
train_df = pd.DataFrame(data={'dialogues': data_train})
valid_df = pd.DataFrame(data={'dialogues': data_valid})
test_df = pd.DataFrame(data={'dialogues': data_test})

print(train_df.head())
print(valid_df.head())
print(test_df.head())

                                           dialogues
0  Person 1: Hello, I'm looking for a hotel that ...
1  Person 1: Hello, Im looking to book a train fr...
2  Person 1: I'm looking for a restaurant in the ...
3  Person 1: Hi, I would like a restaurant inthe ...
4  Person 1: What is there to do in the centre of...
                                           dialogues
0  Person 1: I'm stuck up here in Kings Lynn and ...
1  Person 1: I am looking for city centre north b...
2  Person 1: I am looking for a place to go in th...
3  Person 1: I need to find a place to stay in th...
4  Person 1: I need help with a car accident disp...
                                           dialogues
0  Person 1: I need to find a spot on a train on ...
1  Person 1: Can you please tell me how to get to...
2  Person 1: I was looking for a hotel called wor...
3  Person 1: I would like to stay at a guesthouse...
4  Person 1: I would like a Hungarian restaurant ...


## Save to Workspace

In [17]:
train_df.to_csv('multiwoz_train.csv', index=False)
valid_df.to_csv('multiwoz_valid.csv', index=False)
test_df.to_csv('multiwoz_test.csv', index=False)

## (Optional) Sanity Check
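
Beyond comparing shapes, it can be worth confirming that the text itself survives a CSV round trip, since the dialogues contain commas and occasionally line breaks. The sketch below uses the standard-library `csv` module and an in-memory buffer as a stand-in for `to_csv`/`read_csv`; it is not the code this notebook runs, and `roundtrip` is a hypothetical helper.

```python
import csv
import io


def roundtrip(dialogues):
    """Write dialogues to an in-memory CSV with a 'dialogues' header and read them back."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer)
    writer.writerow(["dialogues"])
    writer.writerows([d] for d in dialogues)
    buffer.seek(0)
    reader = csv.DictReader(buffer)
    return [row["dialogues"] for row in reader]


# Toy rows: one with a comma and an embedded line break, one plain
toy_rows = [
    "Person 1: Yes, please. Person 2: Booking was successful.\nReference number is : 7GAWK763.",
    "Person 1: hi Person 2: hello",
]
print(roundtrip(toy_rows) == toy_rows)
```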

In [18]:
tmp_train = pd.read_csv('multiwoz_train.csv')
tmp_valid = pd.read_csv('multiwoz_valid.csv')
tmp_test = pd.read_csv('multiwoz_test.csv')

print("tmp_train: {}".format(tmp_train.shape))
print("tmp_valid: {}".format(tmp_valid.shape))
print("tmp_test: {}".format(tmp_test.shape))

tmp_train: (9000, 1)
tmp_valid: (500, 1)
tmp_test: (937, 1)


In [19]:
tmp_train.head()

                                           dialogues
0  Person 1: Hello, I'm looking for a hotel that ...
1  Person 1: Hello, Im looking to book a train fr...
2  Person 1: I'm looking for a restaurant in the ...
3  Person 1: Hi, I would like a restaurant inthe ...
4  Person 1: What is there to do in the centre of...


In [20]:
tmp_valid.head()

                                           dialogues
0  Person 1: I'm stuck up here in Kings Lynn and ...
1  Person 1: I am looking for city centre north b...
2  Person 1: I am looking for a place to go in th...
3  Person 1: I need to find a place to stay in th...
4  Person 1: I need help with a car accident disp...


In [21]:
tmp_test.head()

                                           dialogues
0  Person 1: I need to find a spot on a train on ...
1  Person 1: Can you please tell me how to get to...
2  Person 1: I was looking for a hotel called wor...
3  Person 1: I would like to stay at a guesthouse...
4  Person 1: I would like a Hungarian restaurant ...


## References

The preprocessing steps in this notebook follow what I learned in the Coursera course **Natural Language Processing with Attention Models**.
