API Documentation

WordChatbot

Base class for word-level chatbot problems/datasets.

Variables

  • vocab_type: Whether the vocabulary is word-level or character-level.
  • is_generate_per_split: Whether the data is generated separately for each split.
  • vocab_file: Same as vocab_filename.
  • vocab_filename: Name of the vocabulary file.
  • oov_token: Token used to represent out-of-vocabulary words.
  • use_subwords_tokenizer: Whether to use a subword tokenizer.
  • targeted_vocab_size: Number of words in the vocabulary.
  • targeted_dataset_size: Number of examples to use from the dataset; if 0, the full dataset is used.
  • dataset_split: Percentage of examples in each data split.
  • data_dir: Directory containing the processed data files.
  • raw_data_dir: Directory containing the raw downloaded data.
  • raw_data: Name of the downloaded data.
  • zipped_data: Name of the zipped downloaded data.
  • url: URL from which the data is downloaded.
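
For illustration, a minimal sketch of how a subclass might set some of these variables, assuming they are exposed as properties in the usual tensor2tensor style. The import path, class name, and all values below are assumptions, not the repository's actual code:

```python
from t2t_csaky.problems.word_chatbot import WordChatbot  # import path is an assumption


class MyChatbot(WordChatbot):
  """Hypothetical word-level chatbot problem over a custom dataset."""

  @property
  def targeted_vocab_size(self):
    return 32768  # example vocabulary size

  @property
  def targeted_dataset_size(self):
    return 0  # 0 means: use the full dataset

  @property
  def dataset_split(self):
    # example percentages for the train/dev/test splits
    return {"train": 80, "val": 10, "test": 10}

  @property
  def url(self):
    return "http://example.com/my_corpus.zip"  # hypothetical download URL
```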

preprocess_data(train_mode):

Not implemented in the base class; subclasses must implement it.

hparams(defaults, unused_model_hparams):

Sets the basic problem hparams.

generate_data(data_dir, tmp_dir, task_id):

Main entry point for data generation; interleaves this project's own functions with tensor2tensor code.

generate_samples(data_dir, tmp_dir, data_split):

Generates the data samples for a given data split.
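
In tensor2tensor, text problems yield dictionaries with "inputs" and "targets" keys. A simplified sketch of what an implementation could look like; the file names are assumptions, and a real version would pick them based on data_split:

```python
import os


def generate_samples(self, data_dir, tmp_dir, data_split):
  # Hypothetical file names; the real ones are produced by create_data().
  source_path = os.path.join(data_dir, "trainSource.txt")
  target_path = os.path.join(data_dir, "trainTarget.txt")
  with open(source_path) as source, open(target_path) as target:
    for inputs, targets in zip(source, target):
      # tensor2tensor expects {"inputs": ..., "targets": ...} dicts
      yield {"inputs": inputs.strip(), "targets": targets.strip()}
```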

save_vocab(vocab):

Saves the given vocabulary to a file.
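
A standalone sketch of such a helper, assuming the vocabulary is a collections.Counter of word frequencies; the file name and token in the example are assumptions:

```python
import collections
import os


def save_vocab(vocab, data_dir, vocab_filename, vocab_size, oov_token="<unk>"):
  """Writes the most frequent words to the vocab file, one token per line."""
  with open(os.path.join(data_dir, vocab_filename), "w") as f:
    # reserve the last slot for the out-of-vocabulary token
    for word, _ in vocab.most_common(vocab_size - 1):
      f.write(word + "\n")
    f.write(oov_token + "\n")


# usage: save_vocab(collections.Counter(all_words), "data_dir", "vocab.chatbot.32768", 32768)
```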

open_6_files():

Opens the 6 data files (a source and a target file for each of the 3 splits).

close_n_files(files):

Closes the given list of files.
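
A sketch of this pair of helpers, assuming a trainSource.txt / trainTarget.txt style naming scheme (the exact file names are an assumption):

```python
import os


def open_6_files(data_dir):
  """Opens a source and a target file for each of the three splits."""
  files = []
  for split in ("train", "dev", "test"):
    for side in ("Source", "Target"):
      files.append(open(os.path.join(data_dir, split + side + ".txt"), "w"))
  return files


def close_n_files(files):
  """Closes every file in the given list."""
  for f in files:
    f.close()
```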

CharacterChatbot

Base class for character-based chatbot problems. Currently unused.

Variables

  • is_character_level: True for this problem.
  • targeted_vocab_size: 0, since there is no word vocabulary at the character level.
  • targeted_dataset_size: 0; currently only the full dataset is supported.

generator(data_dir, tmp_dir, train):

Generates the character-level data files. The 6 preprocessed files must already be present in data_dir.

OpensubtitlesChatbot

Implements the OpenSubtitles dataset.

Variables

  • dataset_version: Which year's version of the dataset to use.

preprocess_data(train_mode):

Runs the data preprocessing for a given data split.

data_pipeline_status(train_mode):

Checks each step of the data preprocessing pipeline and only runs the steps that haven't been run before.
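
A simplified sketch of such a check, assuming each pipeline step leaves a recognizable file behind; the file names and return values are hypothetical:

```python
import os


def data_pipeline_status(raw_data_dir, data_dir, zipped_data, raw_data):
  """Returns the first pipeline step that still has to be run."""
  processed = ["trainSource.txt", "trainTarget.txt", "devSource.txt",
               "devTarget.txt", "testSource.txt", "testTarget.txt"]
  if all(os.path.isfile(os.path.join(data_dir, f)) for f in processed):
    return "done"      # the 6 processed files already exist
  if os.path.exists(os.path.join(raw_data_dir, raw_data)):
    return "create"    # raw data extracted, only processing remains
  if os.path.isfile(os.path.join(raw_data_dir, zipped_data)):
    return "extract"   # archive downloaded but not yet extracted
  return "download"    # nothing has been done yet
```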

download_data(train_mode):

Downloads the data to the raw data directory.
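
A minimal sketch using only the Python 3 standard library, built on the url and zipped_data variables listed above:

```python
import os
import urllib.request


def download_data(url, raw_data_dir, zipped_data):
  """Downloads the archive unless it is already present."""
  os.makedirs(raw_data_dir, exist_ok=True)
  archive_path = os.path.join(raw_data_dir, zipped_data)
  if not os.path.isfile(archive_path):
    urllib.request.urlretrieve(url, archive_path)
  return archive_path
```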

extract_data(train_mode):

Extracts the data to the raw data directory.
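
A sketch for a gzip-compressed download; the archive format is an assumption, and a zip or tar archive would use zipfile or tarfile instead:

```python
import gzip
import shutil


def extract_data(archive_path, raw_data_path):
  """Decompresses a .gz archive into the raw data file."""
  with gzip.open(archive_path, "rb") as src, open(raw_data_path, "wb") as dst:
    shutil.copyfileobj(src, dst)
```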

create_data(train_mode):

Processes raw data and builds the 6 data files.

clean_line(line):

Preprocesses a sentence by applying several regex rules. Returns: the cleaned sentence.
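
The exact regex rules live in the source; an illustrative sketch of the kind of cleaning involved:

```python
import re


def clean_line(line):
  """Applies a few illustrative cleaning rules to one sentence."""
  line = line.lower()
  line = re.sub("<[^>]*>", "", line)          # drop xml-style markup
  line = re.sub("[^a-z0-9 .?!']", " ", line)  # keep only basic characters
  line = re.sub("([.?!])", r" \1", line)      # separate punctuation from words
  return re.sub(" +", " ", line).strip()      # collapse repeated whitespace
```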

CornellChatbotBasic

Implements a basic chatbot problem for the Cornell Movie-Dialogs Corpus.

preprocess_data(train_mode):

Preprocesses raw data.

create_data(train_mode):

Processes raw data and builds the 6 data files.

clean_line(line):

Preprocesses a sentence by applying several regex rules (similar to the OpenSubtitles clean_line above). Returns: the cleaned sentence.

extract_dialog_ids():

Extracts the dialog structure from the separate metadata file. Returns: a list of dialog IDs.
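
In the Cornell corpus the dialog structure is stored in movie_conversations.txt, whose lines end with a Python-style list of line IDs. A sketch of parsing it:

```python
import ast
import os


def extract_dialog_ids(raw_data_dir):
  """Parses the dialog structure from movie_conversations.txt.

  Each line of the file looks like:
    u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L194', 'L195', 'L196', 'L197']
  """
  dialog_ids = []
  path = os.path.join(raw_data_dir, "movie_conversations.txt")
  with open(path, errors="ignore") as f:
    for line in f:
      dialog_ids.append(ast.literal_eval(line.split(" +++$+++ ")[-1]))
  return dialog_ids
```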

CornellChatbotSeparateNames

A Cornell chatbot problem that keeps speaker names: the most frequent names are appended to the vocabulary as separate tokens, while the rest are replaced with unknown tokens.

Variables

  • targeted_name_vocab_size: Number of different personas (speaker names) appended to the vocabulary.
  • targeted_vocab_size: Sum of the default vocab size and the name vocab size.

create_data(train_mode):

Processes raw data and builds the 6 data files.

replace_names(line_dict, name_vocab):

Replaces infrequent speaker names with unknown tokens.
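
A sketch of the idea, assuming line_dict maps line IDs to (speaker name, utterance) pairs and name_vocab is the set of names kept in the vocabulary; both structures and the token name are assumptions:

```python
def replace_names(line_dict, name_vocab, unknown_name="<unk_name>"):
  """Replaces speaker names outside name_vocab with a shared unknown token."""
  for line_id, (name, utterance) in line_dict.items():
    if name not in name_vocab:
      line_dict[line_id] = (unknown_name, utterance)
  return line_dict
```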

save_vocab(vocab, name_vocab):

Saves the whole vocabulary (word vocab plus name vocab) to a file in the data directory.

PersonaChatChatbot

Implements the Persona-Chat dataset.

preprocess_data(train_mode):

Preprocesses raw data.

extract_data(train_mode):

Extracts the raw data.

create_data(train_mode):

Processes raw data and builds the 6 data files.

DailyDialogChatbot

Implements the DailyDialog dataset.

preprocess_data(train_mode):

Preprocesses raw data.

create_data(train_mode):

Processes raw data and builds the 6 data files.