API Documentation

WordChatbot

Base class for word-level chatbot problems/datasets.

Variables

  • vocab_type: Whether the vocabulary is word-level or character-level.
  • is_generate_per_split: Whether the data is generated separately for each split.
  • vocab_file: Same as vocab_filename.
  • vocab_filename: Name of the vocabulary file.
  • oov_token: Token used to represent out-of-vocabulary words.
  • use_subwords_tokenizer: Whether to use a subword tokenizer.
  • targeted_vocab_size: Number of words in the vocabulary.
  • targeted_dataset_size: Number of examples to use from the dataset; if 0, the full dataset is used.
  • dataset_split: Percentage of examples in each data split.
  • data_dir: Directory containing the processed data files.
  • raw_data_dir: Directory containing the raw downloaded data.
  • raw_data: Name of the downloaded data.
  • zipped_data: Name of the zipped downloaded data.
  • url: URL from which the data is downloaded.
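
For illustration, a minimal sketch of how a subclass might set some of these variables, assuming they are exposed as properties in the usual tensor2tensor style. The import path, class name, and all values below are assumptions, not the repository's actual code:

```python
from t2t_csaky.problems.word_chatbot import WordChatbot  # import path is an assumption


class MyChatbot(WordChatbot):
  """Hypothetical word-level chatbot problem over a custom dataset."""

  @property
  def targeted_vocab_size(self):
    return 32768  # example vocabulary size

  @property
  def targeted_dataset_size(self):
    return 0  # 0 means: use the full dataset

  @property
  def dataset_split(self):
    # example percentages for the train/dev/test splits
    return {"train": 80, "val": 10, "test": 10}

  @property
  def url(self):
    return "http://example.com/my_corpus.zip"  # hypothetical download URL
```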

preprocess_data(train_mode):

Not implemented in the base class; subclasses must implement it.

hparams(defaults, unused_model_hparams):

Sets the basic problem hparams.

generate_data(data_dir, tmp_dir, task_id):

Main entry point for data generation; interleaves this project's own functions with tensor2tensor code.

generate_samples(data_dir, tmp_dir, data_split):

Generates the data samples for a given data split.
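
In tensor2tensor, text problems yield dictionaries with "inputs" and "targets" keys. A simplified sketch of what an implementation could look like; the file names are assumptions, and a real version would pick them based on data_split:

```python
import os


def generate_samples(self, data_dir, tmp_dir, data_split):
  # Hypothetical file names; the real ones are produced by create_data().
  source_path = os.path.join(data_dir, "trainSource.txt")
  target_path = os.path.join(data_dir, "trainTarget.txt")
  with open(source_path) as source, open(target_path) as target:
    for inputs, targets in zip(source, target):
      # tensor2tensor expects {"inputs": ..., "targets": ...} dicts
      yield {"inputs": inputs.strip(), "targets": targets.strip()}
```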

save_vocab(vocab):

Saves the given vocabulary to a file.
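
A standalone sketch of such a helper, assuming the vocabulary is a collections.Counter of word frequencies; the file name and token in the example are assumptions:

```python
import collections
import os


def save_vocab(vocab, data_dir, vocab_filename, vocab_size, oov_token="<unk>"):
  """Writes the most frequent words to the vocab file, one token per line."""
  with open(os.path.join(data_dir, vocab_filename), "w") as f:
    # reserve the last slot for the out-of-vocabulary token
    for word, _ in vocab.most_common(vocab_size - 1):
      f.write(word + "\n")
    f.write(oov_token + "\n")


# usage: save_vocab(collections.Counter(all_words), "data_dir", "vocab.chatbot.32768", 32768)
```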

open_6_files():

Opens the 6 data files (a source and a target file for each of the 3 splits).

close_n_files(files):

Closes the given list of files.
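
A sketch of this pair of helpers, assuming a trainSource.txt / trainTarget.txt style naming scheme (the exact file names are an assumption):

```python
import os


def open_6_files(data_dir):
  """Opens a source and a target file for each of the three splits."""
  files = []
  for split in ("train", "dev", "test"):
    for side in ("Source", "Target"):
      files.append(open(os.path.join(data_dir, split + side + ".txt"), "w"))
  return files


def close_n_files(files):
  """Closes every file in the given list."""
  for f in files:
    f.close()
```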

CharacterChatbot

Base class for character-based chatbot problems. Currently unused.

Variables

  • is_character_level: True for this problem.
  • targeted_vocab_size: 0, since there is no word vocabulary at the character level.
  • targeted_dataset_size: 0; currently only the full dataset is supported.

generator(data_dir, tmp_dir, train):

Generates the character-level data files. The 6 preprocessed files must already be present in data_dir.

OpensubtitlesChatbot

Implements the OpenSubtitles dataset.

Variables

  • dataset_version: Which year's version of the dataset to use.

preprocess_data(train_mode):

Runs the data preprocessing for a given data split.

data_pipeline_status(train_mode):

Checks each step of the data preprocessing pipeline and only runs the steps that haven't been run before.
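
A simplified sketch of such a check, assuming each pipeline step leaves a recognizable file behind; the file names and return values are hypothetical:

```python
import os


def data_pipeline_status(raw_data_dir, data_dir, zipped_data, raw_data):
  """Returns the first pipeline step that still has to be run."""
  processed = ["trainSource.txt", "trainTarget.txt", "devSource.txt",
               "devTarget.txt", "testSource.txt", "testTarget.txt"]
  if all(os.path.isfile(os.path.join(data_dir, f)) for f in processed):
    return "done"      # the 6 processed files already exist
  if os.path.exists(os.path.join(raw_data_dir, raw_data)):
    return "create"    # raw data extracted, only processing remains
  if os.path.isfile(os.path.join(raw_data_dir, zipped_data)):
    return "extract"   # archive downloaded but not yet extracted
  return "download"    # nothing has been done yet
```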

download_data(train_mode):

Downloads the data to the raw data directory.
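
A minimal sketch using only the Python 3 standard library, built on the url and zipped_data variables listed above:

```python
import os
import urllib.request


def download_data(url, raw_data_dir, zipped_data):
  """Downloads the archive unless it is already present."""
  os.makedirs(raw_data_dir, exist_ok=True)
  archive_path = os.path.join(raw_data_dir, zipped_data)
  if not os.path.isfile(archive_path):
    urllib.request.urlretrieve(url, archive_path)
  return archive_path
```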

extract_data(train_mode):

Extracts the data to the raw data directory.
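
A sketch for a gzip-compressed download; the archive format is an assumption, and a zip or tar archive would use zipfile or tarfile instead:

```python
import gzip
import shutil


def extract_data(archive_path, raw_data_path):
  """Decompresses a .gz archive into the raw data file."""
  with gzip.open(archive_path, "rb") as src, open(raw_data_path, "wb") as dst:
    shutil.copyfileobj(src, dst)
```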

create_data(train_mode):

Processes raw data and builds the 6 data files.

clean_line(line):

Preprocesses a sentence by applying several regex rules. Returns: the cleaned sentence.
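
The exact regex rules live in the source; an illustrative sketch of the kind of cleaning involved:

```python
import re


def clean_line(line):
  """Applies a few illustrative cleaning rules to one sentence."""
  line = line.lower()
  line = re.sub("<[^>]*>", "", line)          # drop xml-style markup
  line = re.sub("[^a-z0-9 .?!']", " ", line)  # keep only basic characters
  line = re.sub("([.?!])", r" \1", line)      # separate punctuation from words
  return re.sub(" +", " ", line).strip()      # collapse repeated whitespace
```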

CornellChatbotBasic

Implements a basic chatbot problem for the Cornell Movie-Dialogs Corpus.

preprocess_data(train_mode):

Preprocesses raw data.

create_data(train_mode):

Processes raw data and builds the 6 data files.

clean_line(line):

Preprocesses a sentence by applying several regex rules (similar to the OpenSubtitles clean_line above). Returns: the cleaned sentence.

extract_dialog_ids():

Extracts the dialog structure from the separate metadata file. Returns: a list of dialog IDs.
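
In the Cornell corpus the dialog structure is stored in movie_conversations.txt, whose lines end with a Python-style list of line IDs. A sketch of parsing it:

```python
import ast
import os


def extract_dialog_ids(raw_data_dir):
  """Parses the dialog structure from movie_conversations.txt.

  Each line of the file looks like:
    u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L194', 'L195', 'L196', 'L197']
  """
  dialog_ids = []
  path = os.path.join(raw_data_dir, "movie_conversations.txt")
  with open(path, errors="ignore") as f:
    for line in f:
      dialog_ids.append(ast.literal_eval(line.split(" +++$+++ ")[-1]))
  return dialog_ids
```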

CornellChatbotSeparateNames

A Cornell chatbot problem that keeps speaker names: the most frequent names are appended to the vocabulary as separate tokens, while the rest are replaced with unknown tokens.

Variables

  • targeted_name_vocab_size: Number of different personas (speaker names) appended to the vocabulary.
  • targeted_vocab_size: Sum of the default vocab size and the name vocab size.

create_data(train_mode):

Processes raw data and builds the 6 data files.

replace_names(line_dict, name_vocab):

Replaces infrequent speaker names with unknown tokens.
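
A sketch of the idea, assuming line_dict maps line IDs to (speaker name, utterance) pairs and name_vocab is the set of names kept in the vocabulary; both structures and the token name are assumptions:

```python
def replace_names(line_dict, name_vocab, unknown_name="<unk_name>"):
  """Replaces speaker names outside name_vocab with a shared unknown token."""
  for line_id, (name, utterance) in line_dict.items():
    if name not in name_vocab:
      line_dict[line_id] = (unknown_name, utterance)
  return line_dict
```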

save_vocab(vocab, name_vocab):

Saves the whole vocabulary (word vocab plus name vocab) to a file in the data directory.

PersonaChatChatbot

Implements the Persona-Chat dataset.

preprocess_data(train_mode):

Preprocesses raw data.

extract_data(train_mode):

Extracts the raw data.

create_data(train_mode):

Processes raw data and builds the 6 data files.

DailyDialogChatbot

Implements the DailyDialog dataset.

preprocess_data(train_mode):

Preprocesses raw data.

create_data(train_mode):

Processes raw data and builds the 6 data files.