# Existing Datasets

This notebook provides two ways of generating expert demonstrations.

The VirtualHome environment provides existing data (5301 training episodes), which can be downloaded from: http://virtual-home.org/release/watch_and_help/watch_data.zip

The data contains train and test splits, in ```gather_data_actiongraph_train.json``` and ```gather_data_actiongraph_test.json``` respectively.

Each file holds a list of episodes under the key ```data['train_data']``` (or ```'test_data'```). Each episode is a dict whose keys you can list with ```data['train_data'][i].keys()```; the most important ones are:

- name: name of this episode
- valid_action_with_walk: actions taken to achieve goal
- init_graph: starting condition of the episode
- graphs: history of graphs that the episode produced 
- goal: the goal states of this episode
- task_name: category of the task
- env_id: id of the environment used for this episode


In [None]:
import json

path = "Path/to/gather_data_actiongraph_{train/test/new_test}.json"
with open(path) as f:  # close the file once the JSON is loaded
    data = json.load(f)
print(data.keys()) # dict_keys(['train_data']) / dict_keys(['test_data']) / dict_keys(['new_test_data'])
print(len(data['train_data'])) # 5301
print('Task type: {}'.format(data['train_data'][0]['task_name'])) # read_book
print('Goals to achieve: {}'.format(data['train_data'][0]['goal'])) # Goals to achieve: ['on_cupcake_coffeetable', 'on_pudding_coffeetable', 'on_apple_coffeetable', 'on_apple_coffeetable', 'holds_book', 'sit_sofa']
print(data['train_data'][0]['valid_action_with_walk']) # print out actions
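
As a quick sanity check after loading, it can help to tally how many episodes each task category has. The sketch below is a minimal example that assumes each episode is a dict with the `task_name` key listed above. The helper name `count_tasks` and the three toy episodes are made up for illustration and are not taken from the dataset.

```python
from collections import Counter

def count_tasks(episodes):
    """Count episodes per task category (the 'task_name' key of each episode)."""
    return Counter(episode['task_name'] for episode in episodes)

# Toy stand-in for data['train_data']; real episodes carry more keys.
toy_episodes = [
    {'name': 'episode_a', 'task_name': 'read_book'},
    {'name': 'episode_b', 'task_name': 'read_book'},
    {'name': 'episode_c', 'task_name': 'setup_table'},
]
task_counts = count_tasks(toy_episodes)
print(task_counts.most_common())
```

With the real data, the same call would be `count_tasks(data['train_data'])`.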


# Our Approach to Creating More Data

To build a customized dataset for training, we need more data than the existing release provides. Our solution is to leverage the existing planner from [Language Models as Zero-Shot Planners](https://arxiv.org/abs/2201.07207). The following code demonstrates our modifications to that planner to generate more VirtualHome-executable data.

The data generation algorithm is described in the paper. The original algorithm feeds only one prompt into the model to predict a sequence of actions; we instead increase the number of prompts until a valid episode can be generated for a given task. We also define parsers that convert the templated sentence output into a structured script that VirtualHome can execute.
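
The retry idea can be sketched as a small loop. This is a minimal illustration under stated assumptions, not the notebook's implementation: `generate_until_valid`, `fake_generate`, and `fake_is_valid` are hypothetical names, the cap `max_k` is an invented parameter, and the validity check is a placeholder because the simulator-based check is not implemented in this notebook.

```python
def generate_until_valid(task, generate_fn, is_valid_fn, max_k=5):
    """Try k = 1, 2, ..., max_k example prompts; return the first valid plan and its k."""
    for k in range(1, max_k + 1):
        plan = generate_fn(task, k)
        if is_valid_fn(plan):
            return plan, k
    return None, max_k  # no valid plan found within the budget

# Stubs so the sketch runs without a language model or a simulator.
def fake_generate(task, k):
    # pretend plans only become non-empty once 3 examples are in the prompt
    return ['[walk <bathroom>]'] if k >= 3 else []

def fake_is_valid(plan):
    # placeholder check: a real check would execute the plan in VirtualHome
    return len(plan) > 0

plan, used_k = generate_until_valid('Take Shower', fake_generate, fake_is_valid)
print(plan, used_k)
```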

### Setup

In [4]:
!git clone https://github.com/huangwl18/language-planner
%cd language-planner
!pip install -r requirements.txt
%cd src

Cloning into 'language-planner'...
remote: Enumerating objects: 69, done.
remote: Counting objects: 100% (69/69), done.
remote: Compressing objects: 100% (57/57), done.
remote: Total 69 (delta 10), reused 66 (delta 10), pack-reused 0
Unpacking objects: 100% (69/69), done.
/content/language-planner/src/language-planner
/content/language-planner/src/language-planner/src


Import packages and specify which GPU to use.

In [5]:
import openai
import numpy as np
import torch
from sentence_transformers import SentenceTransformer
from sentence_transformers import util as st_utils
import json

GPU = 0
if torch.cuda.is_available():
    torch.cuda.set_device(GPU)
OPENAI_KEY = None  # replace this with your OpenAI API key, if you choose to use OpenAI API

### Define hyperparameters for plan generation

Select the source to be used (**OpenAI API** or **Huggingface Transformers**) and the two LMs to be used (**Planning LM** for plan generation, **Translation LM** for action/example matching). Then define generation hyperparameters.

Available language models for **Planning LM** can be found from:
- [Huggingface Transformers](https://huggingface.co/models?pipeline_tag=text-generation&sort=downloads)
- [OpenAI API](https://beta.openai.com/docs/engines) (paste the OpenAI API key from your account into `OPENAI_KEY` above; it is assigned to `openai.api_key` below)

Available language models for **Translation LM** can be found from:
- [Sentence Transformers](https://huggingface.co/sentence-transformers)

In [76]:
source = 'huggingface'
planning_lm_id = 'gpt2-large'  # see comments above for all options
translation_lm_id = 'stsb-roberta-large'  # see comments above for all options
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
if source == 'openai':
    openai.api_key = OPENAI_KEY
    sampling_params = \
            {
                "max_tokens": 10,
                "temperature": 0.6,
                "top_p": 0.9,
                "n": 10,
                "logprobs": 1,
                "presence_penalty": 0.5,
                "frequency_penalty": 0.3,
                "stop": '\n'
            }
elif source == 'huggingface':
    sampling_params = \
            {
                "max_tokens": 10,
                "temperature": 0.1,
                "top_p": 0.9,
                "num_return_sequences": 10,
                "repetition_penalty": 1.2,
                'use_cache': True,
                'output_scores': True,
                'return_dict_in_generate': True,
                'do_sample': True,
            }

### Planning LM Initialization
Initialize **Planning LM** from either **OpenAI API** or **Huggingface Transformers**. Abstract away the underlying source by creating a generator function with a common interface.

In [7]:
def lm_engine(source, planning_lm_id, device):
    if source == 'huggingface':
        from transformers import AutoModelForCausalLM, AutoTokenizer
        tokenizer = AutoTokenizer.from_pretrained(planning_lm_id)
        model = AutoModelForCausalLM.from_pretrained(planning_lm_id, pad_token_id=tokenizer.eos_token_id).to(device)

    def _generate(prompt, sampling_params):
        if source == 'openai':
            response = openai.Completion.create(engine=planning_lm_id, prompt=prompt, **sampling_params)
            generated_samples = [response['choices'][i]['text'] for i in range(sampling_params['n'])]
            # calculate mean log prob across tokens
            mean_log_probs = [np.mean(response['choices'][i]['logprobs']['token_logprobs']) for i in range(sampling_params['n'])]
        elif source == 'huggingface':
            input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
            prompt_len = input_ids.shape[-1]
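            # 'max_tokens' is not a generate() argument, so drop it and pass max_length instead
            generate_kwargs = {k: v for k, v in sampling_params.items() if k != 'max_tokens'}
            output_dict = model.generate(input_ids, max_length=prompt_len + sampling_params['max_tokens'], **generate_kwargs)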
            # discard the prompt (only take the generated text)
            generated_samples = tokenizer.batch_decode(output_dict.sequences[:, prompt_len:])
            # calculate per-token logprob
            vocab_log_probs = torch.stack(output_dict.scores, dim=1).log_softmax(-1)  # [n, length, vocab_size]
            token_log_probs = torch.gather(vocab_log_probs, 2, output_dict.sequences[:, prompt_len:, None]).squeeze(-1).tolist()  # [n, length]
            # truncate each sample if it contains '\n' (the current step is finished)
            # e.g. 'open fridge\n<|endoftext|>' -> 'open fridge'
            for i, sample in enumerate(generated_samples):
                stop_idx = sample.index('\n') if '\n' in sample else None
                generated_samples[i] = sample[:stop_idx]
                token_log_probs[i] = token_log_probs[i][:stop_idx]
            # calculate mean log prob across tokens
            mean_log_probs = [np.mean(token_log_probs[i]) for i in range(sampling_params['num_return_sequences'])]
        generated_samples = [sample.strip().lower() for sample in generated_samples]
        return generated_samples, mean_log_probs

    return _generate

generator = lm_engine(source, planning_lm_id, device)


### Translation LM Initialization
Initialize **Translation LM** and create embeddings for all available actions (for action translation) and for the task names of all available examples (for finding relevant examples).

In [10]:
# initialize Translation LM
translation_lm = SentenceTransformer(translation_lm_id).to(device)

# create action embeddings using Translation LM
with open('available_actions.json', 'r') as f:
    action_list = json.load(f)

action_list_embedding = translation_lm.encode(action_list, batch_size=512, convert_to_tensor=True, device=device)  # lower batch_size if limited by GPU memory

# create example task embeddings using Translation LM
with open('available_examples.json', 'r') as f:
    available_examples = json.load(f)

example_task_list = [example.split('\n')[0] for example in available_examples]  # first line contains the task name
example_task_embedding = translation_lm.encode(example_task_list, batch_size=512, convert_to_tensor=True, device=device)  # lower batch_size if limited by GPU memory


### Print out human-labelled episode examples



In [9]:
print("Number of expert demonstrations: {}".format(len(available_examples)))

for i in range(5):
    print(available_examples[i])
    print("====================")

Number of expert demonstrations: 5088
Task: Write an email
Step 1: Walk to home office
Step 2: Walk to computer
Step 3: Find computer
Step 4: Turn to computer
Step 5: Look at computer
Step 6: Walk to computer
Step 7: Find chair
Step 8: Sit on chair
Step 9: Find keyboard
Step 10: Grab keyboard
Step 11: Find mouse
Step 12: Grab mouse
Step 13: Type on keyboard
Task: Take shower
Step 1: Find clothes dress
Step 2: Find towel
Step 3: Walk to bathroom
Step 4: Walk to shower
Step 5: Find shower
Task: Watch TV
Step 1: Find remote control
Step 2: Grab remote control
Step 3: Find television
Step 4: Switch on television
Step 5: Turn to television
Step 6: Watch television
Step 7: Switch off television
Step 8: Put back remote control
Task: Hang up jacket
Step 1: Walk to home office
Step 2: Walk to clothes jacket
Step 3: Find clothes jacket
Step 4: Grab clothes jacket
Step 5: Walk to wall
Step 6: Find hanger
Step 7: Put clothes jacket on hanger
Task: Wash clothes
Step 1: Walk to bathroom
Step 2: Walk 

### Now we use some of these labelled examples as prompts to generate plans

In [96]:
# helper function for finding similar sentence in a corpus given a query
def find_k_most_similar(query_str, corpus_embedding, k = 1):
    query_embedding = translation_lm.encode(query_str, convert_to_tensor=True, device=device)
    # calculate cosine similarity against each candidate sentence in the corpus
    cos_scores = st_utils.pytorch_cos_sim(query_embedding, corpus_embedding)[0].detach().cpu().numpy()
    # retrieve high-ranked index and similarity score
    inds = np.argpartition(cos_scores, -k)[-k:]
    scores = cos_scores[inds]
    return inds, scores
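
To make the retrieval step concrete without loading a model, the sketch below reimplements top-k cosine similarity in plain Python on toy 2-D vectors. It is an illustration only: `cosine` and `top_k_similar` are hypothetical names, and the vectors stand in for real sentence embeddings. One difference from the cell above is deliberate: this version returns the top k sorted from most to least similar, whereas `np.argpartition` does not guarantee any order among the k indices it selects.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k_similar(query_vec, corpus_vecs, k=1):
    """Return indices and scores of the k corpus vectors closest to the query."""
    scores = [cosine(query_vec, vec) for vec in corpus_vecs]
    # sort indices by score, highest first, then keep the first k
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top = order[:k]
    return top, [scores[i] for i in top]

# Toy "embeddings"; real ones come from the Translation LM.
toy_corpus = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
inds, sims = top_k_similar([1.0, 0.2], toy_corpus, k=2)
print(inds, sims)
```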

In [93]:
import collections

def generate(task, k, task_embedding):
    def __verb_mapping(verb):
        mapping = {'find':'grab'}
        return mapping[verb] if verb in mapping else verb

    # retrieve the k most similar example tasks using the embeddings passed in as task_embedding
    inds, _ = find_k_most_similar(task, task_embedding, k = k)
    curr_prompt = ''
    for example_idx in inds:
        example = available_examples[example_idx]
        # print example and query task
        print('-'*10 + ' GIVEN EXAMPLE ' + '-'*10)
        print(example)
        print('-'*10 + ' EXAMPLE END ' + '-'*10)
        curr_prompt += f'{example}\n\n'
    curr_prompt += f'Task: {task}'
    print(f'\nTask: {task}')
    actions = []
    for step in range(1, MAX_STEPS + 1):
        best_overall_score = -np.inf
        # query Planning LM for single-step action candidates
        samples, log_probs = generator(curr_prompt + f'\nStep {step}:', sampling_params)
        for sample, log_prob in zip(samples, log_probs):
            idxs, scores = find_k_most_similar(sample, action_list_embedding)
            most_similar_idx, matching_score = idxs[0], scores[0]
            # rank each sample by its similarity score and likelihood score
            overall_score = matching_score + BETA * log_prob
            translated_action = action_list[most_similar_idx]
            # heuristic for penalizing generating the same action as the last action
            if step > 1 and translated_action == previous_action:
                overall_score -= 0.5
            # find the translated action with highest overall score
            if overall_score > best_overall_score:
                best_overall_score = overall_score
                best_action = translated_action

        # terminate early when either of the following is true:
        # 1. top P*100% of samples are all 0-length (ranked by log prob)
        # 2. overall score is below CUTOFF_THRESHOLD
        # else: autoregressive generation based on previously translated action
        top_samples_ids = np.argsort(log_probs)[-int(P * len(samples)):]
        are_zero_length = all([len(samples[i]) == 0 for i in top_samples_ids])
        below_threshold = best_overall_score < CUTOFF_THRESHOLD
        if are_zero_length:
            print(f'\n[Terminating early because top {P*100}% of samples are all 0-length]')
            break
        elif below_threshold:
            print(f'\n[Terminating early because best overall score is lower than CUTOFF_THRESHOLD ({best_overall_score} < {CUTOFF_THRESHOLD})]')
            break
        else:
            info = best_action.split(' ')
            if len(info) == 2: # e.g. find shower
                actions += ['[{} <{}>]'.format(__verb_mapping(info[0]), info[1])]
            elif len(info) == 3: # e.g. go to Bathroom
                actions += ['[{} <{}>]'.format(__verb_mapping(info[0]), info[2])]
            elif len(info) == 4: # e.g. pour facecream into pajamas
                actions += ['[{} <{}> <{}>]'.format(__verb_mapping(info[0]+info[2]), info[1], info[3])]
            previous_action = best_action
            formatted_action = (best_action[0].upper() + best_action[1:]).replace('_', ' ') # 'open_fridge' -> 'Open fridge'
            curr_prompt += f'\nStep {step}: {formatted_action}'
            print(f'Step {step}: {formatted_action}')
    
    return actions
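
The parsing branch at the end of `generate` can be pulled out into a standalone function, which makes it easier to test. The sketch below mirrors that branch: it splits the translated action on spaces and dispatches on the word count. `to_script_line` is a hypothetical name, and returning `None` for other word counts is a choice made here to make the unsupported case explicit (in `generate`, such actions are simply not appended).

```python
def to_script_line(action, verb_mapping=None):
    """Convert a templated action such as 'walk to bathroom' into '[walk <bathroom>]'."""
    verb_mapping = {'find': 'grab'} if verb_mapping is None else verb_mapping
    words = action.split(' ')
    if len(words) == 2:      # verb object, e.g. 'find shower'
        verb, args = words[0], [words[1]]
    elif len(words) == 3:    # verb preposition object, e.g. 'walk to bathroom'
        verb, args = words[0], [words[2]]
    elif len(words) == 4:    # verb object preposition object, e.g. 'pour facecream into pajamas'
        verb, args = words[0] + words[2], [words[1], words[3]]
    else:
        return None          # unsupported word count
    verb = verb_mapping.get(verb, verb)
    return '[{} {}]'.format(verb, ' '.join('<{}>'.format(a) for a in args))

for text in ['walk to bathroom', 'find shampoo', 'pour facecream into pajamas']:
    print(text, '->', to_script_line(text))
```

Because the dispatch relies only on word count, a four-word action is always read as verb, object, preposition, object, so actions with multi-word object names would need extra handling.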


In [94]:
# Define parameters
MAX_STEPS = 20  # maximum number of steps to be generated
CUTOFF_THRESHOLD = 0.8 # early stopping threshold based on matching score and likelihood score
P = 0.5  # hyperparameter for early stopping heuristic to detect whether Planning LM believes the plan is finished
BETA = 0.3  # weighting coefficient used to rank generated samples
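
To see how these parameters interact, the ranking and early-stopping rules from `generate` can be tried in isolation. The sketch below restates them in plain Python. `pick_best` and `top_samples_are_empty` are hypothetical helper names, and the candidate tuples are made-up values rather than model output.

```python
BETA = 0.3
CUTOFF_THRESHOLD = 0.8
P = 0.5

def pick_best(candidates, previous_action=None):
    """candidates: list of (translated_action, matching_score, mean_log_prob)."""
    best_action, best_score = None, float('-inf')
    for action, matching_score, log_prob in candidates:
        score = matching_score + BETA * log_prob
        if action == previous_action:
            score -= 0.5  # penalty for repeating the last action
        if score > best_score:
            best_action, best_score = action, score
    return best_action, best_score

def top_samples_are_empty(samples, log_probs):
    """True if the top P fraction of samples (ranked by log prob) are all empty strings."""
    n_top = int(P * len(samples))
    ranked = sorted(range(len(samples)), key=lambda i: log_probs[i])
    return all(len(samples[i]) == 0 for i in ranked[-n_top:])

toy_candidates = [
    ('walk to bathroom', 0.95, -0.2),  # 0.95 + 0.3 * -0.2 = 0.89
    ('find soap', 0.90, -0.5),         # 0.90 + 0.3 * -0.5 = 0.75
    ('walk to shower', 0.99, -0.1),    # 0.99 + 0.3 * -0.1 - 0.5 = 0.46 (repeat penalty)
]
best, score = pick_best(toy_candidates, previous_action='walk to shower')
print(best, round(score, 2), 'continue' if score >= CUTOFF_THRESHOLD else 'stop')

plan_finished = top_samples_are_empty(['', 'open fridge', '', ''], [-0.1, -2.0, -0.3, -1.0])
print(plan_finished)
```

In this toy case the repeat penalty pushes the best-matching candidate below the others, and the plan is judged finished because the two most likely samples are empty.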

In [95]:
data = generate(task='Take Shower', k = 3, task_embedding=example_task_embedding)
print(data)

---------- GIVEN EXAMPLE ----------
Task: Take shower
Step 1: Find clothes dress
Step 2: Find towel
Step 3: Walk to bathroom
Step 4: Walk to shower
Step 5: Find shower
---------- EXAMPLE END ----------
---------- GIVEN EXAMPLE ----------
Task: Take shower
Step 1: Walk to bathroom
Step 2: Walk to shower
Step 3: Find shower
---------- EXAMPLE END ----------
---------- GIVEN EXAMPLE ----------
Task: Take shower
Step 1: Walk to bathroom
Step 2: Walk to shower
Step 3: Find soap
Step 4: Scrub soap
Step 5: Find water
Step 6: Rinse water
---------- EXAMPLE END ----------

Task: Take Shower
Step 1: Walk to bathroom
Step 2: Walk to shower
Step 3: Find shampoo
Step 4: Wash hair

[Terminating early because best overall score is lower than CUTOFF_THRESHOLD (0.5850750530759493 < 0.8)]
['[walk <bathroom>]', '[walk <shower>]', '[grab <shampoo>]', '[wash <hair>]']


# Work in progress: batch generation

### Let's generate a batch of them

In [None]:
# Batch Generation

In [None]:
data = []
tasks = ['Make Breakfast', 'Take Shower', 'Eat Apple', 'Reply Email']
num_gen_per_task = 10
for task in tasks:
    for _ in range(num_gen_per_task):
        # generate one plan for the current task
        data += [generate(task=task, k = 3, task_embedding=example_task_embedding)]

In [97]:
# write generated data into output file, one JSON-encoded entry per line
with open('./generated_results.txt', 'w') as fout:
    for entry in data:
        fout.write(json.dumps(entry) + '\n')


['[walk <bathroom>]', '[walk <shower>]', '[grab <shampoo>]', '[wash <hair>]']

### TODO: Integrate VirtualHome API to filter out invalid sequence of actions via simulation

In [98]:
## some code here 