API Documentation
Base class for word-level chatbot problems/datasets.
- vocab_type: Word-level or character.
- is_generate_per_split: Whether to use data splits.
- vocab_file: Same as vocab_filename
- vocab_filename: Name of the vocabulary file.
- oov_token: How to represent out-of-vocabulary words.
- use_subwords_tokenizer: Whether to use sub-word tokenizer.
- targeted_vocab_size: Number of words in the vocabulary.
- targeted_dataset_size: Number of examples in the full dataset; if 0, the full dataset is used.
- dataset_split: Percentages for each data split.
- data_dir: Dataset directory containing processed files.
- raw_data_dir: Raw downloaded data directory.
- raw_data: Name of the downloaded data.
- zipped_data: Name of the zipped downloaded data.
- url: Download data from this url.
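The attributes above can be pictured as a small configuration class that a concrete dataset overrides. This is an illustrative sketch only: the class names, filenames, and values below are assumptions, not the repository's actual identifiers.

```python
class WordChatbotConfig:
    """Minimal stand-in for the base problem's attribute interface.

    All names and defaults here are hypothetical, chosen to mirror the
    documented attribute list.
    """
    vocab_type = "word"                   # word-level or character
    is_generate_per_split = True          # whether to use data splits
    vocab_filename = "vocab.chatbot.32768"
    oov_token = "<unk>"                   # out-of-vocabulary replacement
    use_subwords_tokenizer = False
    targeted_vocab_size = 32768
    targeted_dataset_size = 0             # 0 means: use the full dataset
    dataset_split = {"train": 80, "val": 10, "test": 10}  # percentages


class OpensubtitlesConfig(WordChatbotConfig):
    """Example of a concrete dataset overriding a few attributes."""
    dataset_version = 2012                # year the dataset comes from
    targeted_vocab_size = 100000
```

A subclass only overrides what differs from the base; everything else is inherited.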
Not implemented here; must be implemented in a subclass.
Setting basic problem hparams.
Main function for data generation. Interleaves the problem's own functions with tensor2tensor code.
Generate the data samples for a specific data split.
Save the vocabulary given as a parameter to a file.
Open the 6 data files (source and target files for each of the 3 splits).
Close the list of files given as a parameter.
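The open/close helpers above can be sketched as follows. The file-naming scheme (`trainSource.txt` etc.) is an assumption made for illustration; the actual names may differ.

```python
import os


def open_6_files(data_dir):
    """Open source and target files for the train, dev and test splits.

    The split and file names here are hypothetical.
    """
    files = []
    for split in ("train", "dev", "test"):
        for side in ("Source", "Target"):
            path = os.path.join(data_dir, f"{split}{side}.txt")
            files.append(open(path, "w", encoding="utf-8"))
    return files


def close_n_files(files):
    """Close the list of files given as a parameter."""
    for f in files:
        f.close()
```

Returning the handles as a flat list keeps the generation code simple: it writes to `files[0]`/`files[1]` for train, and so on, then closes them all in one call.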
Base class for character based chatbot problems. Currently unused.
- is_character_level: True for this problem.
- targeted_vocab_size: No vocab, so 0.
- targeted_dataset_size: Currently only full dataset is supported, so 0.
Generate the character-level data files. The 6 pre-processed files need to be present in the data_dir.
Implements the Opensubtitles dataset.
- dataset_version: The year the dataset comes from.
Run the data preprocessing for a specific data split.
Check each step of the data preprocessing pipeline and only run the steps that haven't been run before.
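The "only run steps that weren't run before" check can be expressed as a small guard around each pipeline step. This helper is a hypothetical sketch; the repository's actual bookkeeping may differ.

```python
import os


def run_if_missing(step_name, output_path, step_fn):
    """Run a preprocessing step only if its output file doesn't exist yet.

    Hypothetical helper: checks for the step's output on disk and skips
    the step when it is already present.
    """
    if os.path.exists(output_path):
        print(f"Skipping {step_name}: {output_path} already exists.")
        return False
    step_fn()
    return True
```

Chaining download, extraction, and file building through such a guard makes the whole pipeline resumable after an interruption.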
Download the data to the raw data directory.
Extract the data to the raw data directory.
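The download and extract steps can be sketched with the standard library. The function names and the assumption that the data ships as a zip archive are illustrative, not taken from the source.

```python
import os
import urllib.request
import zipfile


def maybe_download(url, raw_data_dir, zipped_name):
    """Download the zipped data into raw_data_dir unless it's already there."""
    os.makedirs(raw_data_dir, exist_ok=True)
    zip_path = os.path.join(raw_data_dir, zipped_name)
    if not os.path.exists(zip_path):
        urllib.request.urlretrieve(url, zip_path)
    return zip_path


def extract_data(zip_path, raw_data_dir):
    """Extract the downloaded archive into the raw data directory."""
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(raw_data_dir)
```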
Processes raw data and builds the 6 data files.
Preprocess a sentence using several regex rules. Returns: the cleaned sentence.
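A regex-based sentence cleaner of this kind typically lowercases, strips unwanted characters, and separates punctuation into its own tokens. The specific rules below are assumptions chosen to show the shape of the pipeline, not the repository's actual rules.

```python
import re


def preprocess_sentence(sentence):
    """Clean a sentence with a few illustrative regex rules."""
    sentence = sentence.lower()
    # Split off apostrophe contractions as separate tokens ("i'm" -> "i 'm").
    sentence = re.sub(r"'", " '", sentence)
    # Keep only letters, apostrophes, spaces and basic punctuation.
    sentence = re.sub(r"[^a-z.?!' ]", " ", sentence)
    # Separate sentence-final punctuation from the preceding word.
    sentence = re.sub(r"([.?!])", r" \1", sentence)
    # Collapse runs of whitespace and trim.
    sentence = re.sub(r"\s+", " ", sentence).strip()
    return sentence
```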
Implements a basic chatbot for the Cornell-Movie Dialog Corpus.
Preprocesses raw data.
Processes raw data and builds the 6 data files.
Preprocess a sentence using several regex rules. Returns: the cleaned sentence.
Get the dialog structure from the separate file. Returns: a list of dialog ids.
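In the Cornell Movie-Dialogs Corpus, the dialog structure lives in `movie_conversations.txt`, whose fields are separated by ` +++$+++ ` and whose last field is a Python-style list of line ids. Parsing one such line might look like this (a minimal sketch; the repository's own parser may differ):

```python
import ast


def get_dialog_ids(line):
    """Parse one line of movie_conversations.txt into a list of line ids.

    Fields are separated by ' +++$+++ '; the last field is a literal
    Python list such as "['L194', 'L195', 'L196']".
    """
    fields = line.strip().split(" +++$+++ ")
    return ast.literal_eval(fields[-1])
```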
- targeted_name_vocab_size: The number of different persona names appended to the vocabulary.
- targeted_vocab_size: Sum of the default vocab size and the name vocab size.
Processes raw data and builds the 6 data files.
Replace infrequent names with unknown tokens.
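Replacing infrequent names amounts to keeping only the `targeted_name_vocab_size` most frequent speaker names and mapping the rest to an unknown token. The function name and the `<unk_name>` token below are assumptions for illustration.

```python
from collections import Counter


def replace_rare_names(names, targeted_name_vocab_size, unk="<unk_name>"):
    """Keep the most frequent speaker names; map the rest to an
    unknown-name token. Illustrative sketch."""
    counts = Counter(names)
    kept = {n for n, _ in counts.most_common(targeted_name_vocab_size)}
    return [n if n in kept else unk for n in names]
```

The kept names are the ones later appended to the vocabulary, which is why the total vocab size is the default vocab size plus the name vocab size.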
Save the whole vocab to a file in the data directory.
Preprocesses raw data.
Extract raw data.
Processes raw data and builds the 6 data files.
Preprocesses raw data.
Processes raw data and builds the 6 data files.