# Diffusion Model for Open Dialogue

Simplified version

References
- DiffuSeq (cited below)

Adapted from:

@inproceedings{gong2022diffuseq,
  author = {Gong, Shansan and Li, Mukai and Feng, Jiangtao and Wu, Zhiyong and Kong, Lingpeng},
  booktitle = {International Conference on Learning Representations, ICLR},
  title = {{DiffuSeq}: Sequence to Sequence Text Generation with Diffusion Models},
  year = 2023
}

@article{gong2023diffuseqv2,
  title={DiffuSeq-v2: Bridging Discrete and Continuous Text Spaces for Accelerated Seq2Seq Diffusion Models},
  author={Gong, Shansan and Li, Mukai and Feng, Jiangtao and Wu, Zhiyong and Kong, Lingpeng},
  journal={arXiv preprint arXiv:2310.05793},
  year={2023}
}


In [10]:
# TODO adapt from other codes

## Dataset

- Use Commonsense Conversation dataset (from Reddit)


DiffuSeq's `text_datasets.py` contains the steps that load the dataset itself.

- [ ] prepare datasets for training and validation in the format below (stored as a jsonl file?)
```
{"src": "", "trg": ""}
```

- word embeddings (to be loaded?)
- use a corpus
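
As a minimal sketch of the preparation step above, the snippet below writes source/target pairs to a jsonl file, one JSON object per line, using the `src`/`trg` keys that the loading cell later in this notebook reads. The example pairs, the `write_jsonl` helper, and the temporary output directory are illustrative assumptions, not part of DiffuSeq.

```python
import json
import os
import tempfile


def write_jsonl(pairs, path):
    """Write (source, target) text pairs as one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for src, trg in pairs:
            f.write(json.dumps({"src": src, "trg": trg}) + "\n")


# Made-up pairs, for illustration only
pairs = [
    ("how are you today?", "doing well, thanks."),
    ("any plans for the weekend?", "probably just hiking."),
]

out_dir = tempfile.mkdtemp()
train_path = os.path.join(out_dir, "train.jsonl")
write_jsonl(pairs, train_path)

# Read the file back to check the format
with open(train_path, encoding="utf-8") as f:
    rows = [json.loads(line) for line in f]
print(rows[0])
```

The same helper could be called a second time with held-out pairs to produce the validation file.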


## Training

Note that, in DiffuSeq, a model file is created to store all training progress, configuration, etc. (in bash format, pointing to raw files?)

- denoise rate ?
- using the DiffuSeq-v2 updates cut training time from 2 days to 11 hours

Load the tokenizer

For simplicity, use the BERT tokenizer rather than a custom one

In [1]:
import torch
# use GPU if available
is_cuda = torch.cuda.is_available()

if is_cuda:
    device = torch.device("cuda")
    print("GPU is available")
else:
    device = torch.device("cpu")
    print("GPU not available, CPU used")

GPU not available, CPU used


In [2]:
# Load tokenizer from bert
# TODO check if this is the best tokenizer for Commonsense Conversation dataset

from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# sep_token_id = tokenizer.sep_token_id
# pad_token_id = tokenizer.pad_token_id




Diffusion-LM (Li et al., 2022) uses an embedding function $EMB(w)$ to map the discrete text $w$ into a continuous space.
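
To make the mapping concrete, here is a small sketch of an embedding function and the reverse "rounding" step that maps a continuous vector back to the nearest token. The toy vocabulary size, the dimension, and rounding by smallest Euclidean distance are assumptions chosen for illustration; they are not taken from the Diffusion-LM or DiffuSeq code.

```python
import torch

torch.manual_seed(0)

vocab_size, d = 10, 4                     # toy sizes, for illustration only
emb = torch.nn.Embedding(vocab_size, d)   # plays the role of EMB(w)

token_ids = torch.tensor([3, 7, 1])       # discrete tokens w

with torch.no_grad():
    x0 = emb(token_ids)                   # continuous vectors, shape (3, d)

    # Perturb slightly, then round each vector back to its nearest embedding
    noisy = x0 + 0.01 * torch.randn_like(x0)
    diffs = noisy[:, None, :] - emb.weight[None, :, :]   # (3, vocab_size, d)
    dists = (diffs ** 2).sum(dim=-1)                     # squared distances
    rounded = dists.argmin(dim=-1)                       # back to token ids

print(x0.dtype, tuple(x0.shape))
print(rounded.tolist())
```

With noise this small, the rounded ids should match the original ids; with larger noise they generally will not, which is what the denoising model is trained to undo.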

load the model embeddings
```
model_weight, tokenizer = load_model_emb(args, tokenizer)
```

Note that the implementation details in DiffuSeq (first version) are:
"The maximum sequence length is 128, with embedding dimension d = 128, diffusion steps T = 2000
and a square-root noise schedule."
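
For reference, the square-root noise schedule quoted above can be sketched as follows. This assumes the form $\bar{\alpha}_t = 1 - \sqrt{t/T + s}$ from Diffusion-LM with $s = 10^{-4}$, and derives per-step betas clipped at 0.999. Both constants and the helper name `sqrt_beta_schedule` are assumptions and should be checked against the DiffuSeq code.

```python
import math


def sqrt_beta_schedule(num_steps, s=1e-4, max_beta=0.999):
    """Per-step betas derived from alpha_bar(u) = 1 - sqrt(u + s), u in [0, 1]."""
    def alpha_bar(u):
        return 1.0 - math.sqrt(u + s)

    betas = []
    for i in range(num_steps):
        t1 = i / num_steps
        t2 = (i + 1) / num_steps
        # beta_t = 1 - alpha_bar(t) / alpha_bar(t - 1), clipped for stability
        betas.append(min(1.0 - alpha_bar(t2) / alpha_bar(t1), max_beta))
    return betas


T = 2000  # diffusion steps quoted from the DiffuSeq paper
betas = sqrt_beta_schedule(T)

# Cumulative product of (1 - beta): the fraction of signal left at each step
alphas_cumprod = []
running = 1.0
for beta in betas:
    running *= 1.0 - beta
    alphas_cumprod.append(running)

print(f"first beta: {betas[0]:.4f}, last beta: {betas[-1]:.4f}")
print(f"signal left at the midpoint: {alphas_cumprod[T // 2 - 1]:.4f}")
```

Under this schedule the noised sample at step $t$ has mean $\sqrt{\bar{\alpha}_t}\,x_0$ and variance $1 - \bar{\alpha}_t$, so the signal drops quickly in the first steps and is almost gone by the last step.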

How is it different in v2 or other papers?

In [3]:
# choose hidden_dim

# choose embedding dimension
d = 128
# TODO choose embedding dimension suitable to the task

model = torch.nn.Embedding(tokenizer.vocab_size, d)

# initialize random embeddings
torch.nn.init.normal_(model.weight)

# torch.save(model.state_dict(), path_save)
# os.sync() FIXME required?

# FIXME need to implement saving and reloading?


Parameter containing:
tensor([[ 0.8701,  0.4414, -0.3765,  ..., -0.8231, -4.0339, -0.7709],
        [ 0.7203,  1.2298, -0.1098,  ...,  0.3902,  0.7161, -1.8676],
        [ 0.5635, -0.0068,  0.4906,  ..., -1.5021,  0.8436,  0.1401],
        ...,
        [-1.9333,  0.7976, -2.9892,  ..., -0.7466, -2.1189, -0.7736],
        [ 0.0074,  1.9049, -0.9545,  ..., -0.2393,  1.0275, -0.2788],
        [-0.6450,  0.6781, -2.1758,  ..., -1.1580,  0.6388, -0.7978]],
       requires_grad=True)

In [8]:
vocab_size = tokenizer.vocab_size
print(vocab_size)

30522


load the text data from file

In [5]:
# read in the training data from the jsonl file (one JSON object per line)
# TODO do this in a different way
# NOTE .strip() removes leading/trailing whitespace (including newlines) from each text

import json

data_dir = "./datasets/sample"
path = f'{data_dir}/train.jsonl'

sentence_lst = {'src': [], 'trg': []}
with open(path, 'r') as f_reader:
    for row in f_reader:
        content = json.loads(row)
        sentence_lst['src'].append(content['src'].strip())
        sentence_lst['trg'].append(content['trg'].strip())



# TODO use pandas to load faster? any other package can just load json directly rather than row by row

In [6]:

# Tokenize dataset

# NOTE custom tokenizer is of type dict in DiffuSeq

input_id_x = tokenizer(sentence_lst['src'], add_special_tokens=True)['input_ids']
input_id_y = tokenizer(sentence_lst['trg'], add_special_tokens=True)['input_ids']
tokenized_dataset = {'input_id_x': input_id_x, 'input_id_y': input_id_y}

# TODO use batching & multiple processes to make more efficient, use map function?
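
One way to turn the two token-id lists into a single model input, loosely modelled on what DiffuSeq's `text_datasets.py` appears to do, is to concatenate source and target around a separator and keep a mask marking which positions belong to the target. The sketch below is a simplified assumption of that idea: the helper name `merge_and_pad`, the truncation rule, the mask convention (0 for source, 1 for target and padding), and the toy ids are all illustrative. The ids 101, 102 and 0 are the `[CLS]`, `[SEP]` and `[PAD]` ids of `bert-base-uncased`.

```python
def merge_and_pad(src_ids, trg_ids, seq_len, sep_id=102, pad_id=0):
    """Concatenate src + [SEP] + trg and pad to seq_len.

    Returns (input_ids, input_mask). The mask is 0 over the source part,
    including the separator, and 1 over the target part and the padding.
    """
    src, trg = list(src_ids), list(trg_ids)

    # Trim the longer side until the pair plus separator fits
    while len(src) + len(trg) + 1 > seq_len:
        if len(src) > len(trg):
            src.pop()
        else:
            trg.pop()

    input_ids = src + [sep_id] + trg
    input_mask = [0] * (len(src) + 1) + [1] * len(trg)

    n_pad = seq_len - len(input_ids)
    return input_ids + [pad_id] * n_pad, input_mask + [1] * n_pad


# Toy ids shaped like BERT tokenizer output: [CLS] ... [SEP]
ids, mask = merge_and_pad([101, 7592, 2088, 102], [101, 7632, 102], seq_len=10)
print(ids)
print(mask)
```

Note that this simple truncation drops tokens from the end, so a long sequence loses its closing `[SEP]`; a fuller version would re-append the end token after trimming.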

In [None]:
# TODO create a DataLoader

# if split != 'test':
#     sampler = DistributedSampler(dataset)
#     data_loader = DataLoader(
#         dataset,
#         batch_size=batch_size,  # 20,
#         # drop_last=True,
#         sampler=sampler,
#         # shuffle=not deterministic,
#         num_workers=4,
#     )
# else:
#     data_loader = DataLoader(
#         dataset,
#         batch_size=batch_size,  # 20,
#         # drop_last=True,
#         # sampler=sampler,
#         shuffle=not deterministic,
#         num_workers=4,
#     )


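Note: is this data in continuous space? No: at this point the data are discrete token IDs (integers); they only become continuous vectors after passing through the embedding layer.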

In [7]:
from torch.utils.data import DataLoader, Dataset
from torch.utils.data.distributed import DistributedSampler

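# DataLoader indexes its dataset by position, so turn the dict of lists
# into a list of (src_ids, trg_ids) pairs first
paired_dataset = list(zip(tokenized_dataset['input_id_x'], tokenized_dataset['input_id_y']))

data_loader = DataLoader(
    paired_dataset,
    batch_size=20,  # TODO choose an appropriate batch size ?
    collate_fn=list,  # keep variable-length pairs as-is; TODO pad and convert to tensors
    # shuffle=, # TODO choose if you want to shuffle
    # num_workers=4, # TODO use multiple workers throughout script?
)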
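
A slightly fuller sketch of the data loading step, assuming a map-style `Dataset` wrapper and a collate function that pads each batch to its longest sequence. The class name `PairDataset`, the `collate` helper, the toy token ids, and padding with id 0 (the `[PAD]` id of `bert-base-uncased`) are illustrative assumptions, not DiffuSeq's implementation.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader, Dataset


class PairDataset(Dataset):
    """Map-style dataset over parallel lists of source/target token ids."""

    def __init__(self, tokenized):
        self.x = tokenized["input_id_x"]
        self.y = tokenized["input_id_y"]

    def __len__(self):
        return len(self.x)

    def __getitem__(self, idx):
        return torch.tensor(self.x[idx]), torch.tensor(self.y[idx])


def collate(batch, pad_id=0):
    """Pad sources and targets separately to the longest item in the batch."""
    xs, ys = zip(*batch)
    batch_x = pad_sequence(list(xs), batch_first=True, padding_value=pad_id)
    batch_y = pad_sequence(list(ys), batch_first=True, padding_value=pad_id)
    return batch_x, batch_y


# Toy stand-in for tokenized_dataset
toy = {
    "input_id_x": [[101, 7592, 102], [101, 2129, 2024, 2017, 102]],
    "input_id_y": [[101, 7632, 102], [101, 2986, 102]],
}

loader = DataLoader(PairDataset(toy), batch_size=2, shuffle=False, collate_fn=collate)
batch_x, batch_y = next(iter(loader))
print(tuple(batch_x.shape), tuple(batch_y.shape))
```

Padding per batch keeps tensors small; padding everything to a fixed maximum length (128 in the DiffuSeq paper) is the alternative when a fixed shape is needed.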

create the model and diffusion
```
model, diffusion = create_model_and_diffusion(
    **args_to_dict(args, load_defaults_config().keys())
)
```

create schedule sampler? is it required?

run the training loop according to batch size, etc.
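
A rough sketch of what a uniform schedule sampler and a single training step could look like. Everything here is a simplifying assumption: a toy linear beta schedule instead of the square-root one, a placeholder MLP denoiser that ignores the timestep, a mask convention of 0 for source and 1 for target, and a plain MSE on the predicted $x_0$ over target positions only. DiffuSeq's real objective appears to include further terms that are omitted here.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

T = 50                       # toy number of diffusion steps
vocab_size, d = 100, 16      # toy vocabulary and embedding sizes

# Toy linear beta schedule (a stand-in, not the sqrt schedule from the paper)
betas = torch.linspace(1e-4, 0.02, T)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)


def sample_timesteps(batch_size):
    """Uniform schedule sampler: every timestep equally likely, weight 1."""
    return torch.randint(0, T, (batch_size,)), torch.ones(batch_size)


def q_sample(x0, t, noise):
    """Forward process: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * noise."""
    abar = alphas_cumprod[t].view(-1, 1, 1)
    return abar.sqrt() * x0 + (1.0 - abar).sqrt() * noise


embedding = nn.Embedding(vocab_size, d)
# Placeholder denoiser that ignores t; a real model conditions on the timestep
denoiser = nn.Sequential(nn.Linear(d, 64), nn.Tanh(), nn.Linear(64, d))
params = list(embedding.parameters()) + list(denoiser.parameters())
optimizer = torch.optim.AdamW(params, lr=1e-3)


def train_step(input_ids, input_mask):
    x0 = embedding(input_ids)                        # (B, L, d)
    t, weights = sample_timesteps(input_ids.size(0))
    x_t = q_sample(x0, t, torch.randn_like(x0))
    target = input_mask.unsqueeze(-1).bool()         # (B, L, 1)
    x_t = torch.where(target, x_t, x0)               # source positions stay clean
    pred_x0 = denoiser(x_t)
    sq_err = ((pred_x0 - x0) ** 2) * target          # count target positions only
    denom = (target.sum(dim=(1, 2)) * d).clamp(min=1)
    loss = (sq_err.sum(dim=(1, 2)) / denom * weights).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


# Toy batch: first half of each sequence is source, second half is target
input_ids = torch.randint(0, vocab_size, (4, 12))
input_mask = torch.cat([torch.zeros(4, 6), torch.ones(4, 6)], dim=1).long()
losses = [train_step(input_ids, input_mask) for _ in range(3)]
print(losses)
```

A uniform sampler is the simplest choice; whether a loss-aware sampler is worth the extra bookkeeping is still the open question noted above.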

In [None]:
#  model = TransformerNetModel(
#         input_dims=hidden_dim,
#         output_dims=(hidden_dim if not learn_sigma else hidden_dim*2),
#         hidden_t_dim=hidden_t_dim,
#         dropout=dropout,
#         config_name=config_name,
#         vocab_size=vocab_size,
#         init_pretrained=use_plm_init
#     )

#     betas = gd.get_named_beta_schedule(noise_schedule, diffusion_steps)

#     if not timestep_respacing:
#         timestep_respacing = [diffusion_steps]

#     diffusion = SpacedDiffusion(
#         use_timesteps=space_timesteps(diffusion_steps, timestep_respacing),
#         betas=betas,
#         rescale_timesteps=rescale_timesteps,
#         predict_xstart=predict_xstart,
#         learn_sigmas = learn_sigma,
#         sigma_small = sigma_small,
#         use_kl = use_kl,
#         rescale_learned_sigmas=rescale_learned_sigmas
#     )

In [11]:
# FIXME need to define TransformerNetModel
#  The full Transformer model with attention and timestep embedding.

# Adapted from DiffuSeq

# TODO code the transformer from scratch
# TODO design the transformer from scratch

from transformers import BertConfig, BertModel
import torch.nn as nn
import torch as th

# config = BertConfig.from_pretrained('bert-base-uncased')

# Constructor signature from DiffuSeq:
#     def __init__(
#         self,
#         input_dims,
#         output_dims,
#         hidden_t_dim,
#         dropout=0,
#         config=None,
#         config_name='bert-base-uncased',
#         vocab_size=None,
#         init_pretrained='no',
#         logits_mode=1,
#     ):


class TransformerNetModel(nn.Module):
    def __init__(self, vocab_size, input_dims, output_dims=None):
        super().__init__()

        # FIXME the config is hard-coded here; pass it in as a parameter instead
        config = BertConfig.from_pretrained('bert-base-uncased')
        # FIXME set to an actual value
        # config.hidden_size = what?
        # FIXME set to an actual value
        config.hidden_dropout_prob = 0
        # the config is passed to BertModel below via config=config

        # FIXME specify input_dims
        self.input_dims = input_dims
        # output_dims was read further down without ever being set;
        # default it to input_dims
        self.output_dims = input_dims if output_dims is None else output_dims

        # self.hidden_t_dim = hidden_t_dim
        # self.dropout = dropout
        # self.logits_mode = logits_mode

        # hidden size of the BERT encoder (768 for bert-base-uncased)
        self.hidden_size = config.hidden_size

        # word embeddings: token id -> vector of size input_dims
        self.word_embedding = nn.Embedding(vocab_size, self.input_dims)

        # LM head: maps a vector back to one logit per vocabulary entry
        self.lm_head = nn.Linear(self.input_dims, vocab_size)

        with th.no_grad():  # disable gradient calculations
            # weight tying: the LM head shares its weight matrix with the embedding
            self.lm_head.weight = self.word_embedding.weight

        # FIXME what is this for
        # time_embed_dim = hidden_t_dim * 4
        # self.time_embed = nn.Sequential(
        #     linear(hidden_t_dim, time_embed_dim),
        #     SiLU(),
        #     linear(time_embed_dim, config.hidden_size),
        # )

        # if self.input_dims != config.hidden_size:
        #     self.input_up_proj = nn.Sequential(nn.Linear(input_dims, config.hidden_size),
        #                                        nn.Tanh(), nn.Linear(config.hidden_size, config.hidden_size))

        # print('initializing from pretrained bert...')
        # print(config)

        # temp_bert is temporary because only some of its submodules are kept
        temp_bert = BertModel.from_pretrained('bert-base-uncased', config=config)
        # FIXME this replaces the input_dims-sized embedding above with BERT's
        # hidden_size-sized one, so lm_head is no longer tied to it;
        # the commented-out lines below would re-tie them
        self.word_embedding = temp_bert.embeddings.word_embeddings
        # with th.no_grad():
        #     self.lm_head.weight = self.word_embedding.weight
        # # self.lm_head.weight.requires_grad = False
        # # self.word_embedding.weight.requires_grad = False

        # reuse the pretrained BERT encoder (the stack of transformer layers)
        self.input_transformers = temp_bert.encoder
        # position_ids = [[0, 1, ..., max_position_embeddings - 1]]; a buffer is
        # saved and moved with the module but is not a trainable parameter
        self.register_buffer("position_ids", th.arange(config.max_position_embeddings).expand((1, -1)))
        self.position_embeddings = temp_bert.embeddings.position_embeddings
        self.LayerNorm = temp_bert.embeddings.LayerNorm

        # drop the parts of temp_bert that are not referenced above
        del temp_bert.embeddings
        del temp_bert.pooler

        # FIXME not used yet; intended for forward(), which is still to be written
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        # project the encoder output from hidden_size back down to output_dims
        if self.output_dims != config.hidden_size:
            self.output_down_proj = nn.Sequential(nn.Linear(config.hidden_size, config.hidden_size),
                                                  nn.Tanh(), nn.Linear(config.hidden_size, self.output_dims))

          

# NOTE BertConfig holds the hyperparameters, BertModel is the network itself,
# and BertTokenizer converts text to token ids (and back)
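
On the commented-out `time_embed` block in the class above: a diffusion denoiser is usually told which timestep it is looking at, typically by turning the integer timestep into a sinusoidal feature vector and passing that through a small MLP (the `linear`, `SiLU`, `linear` stack) before adding it to the hidden states. Below is a minimal sketch of the sinusoidal part, assuming the common formulation with `max_period = 10000`. The function name and the choice to put cosines before sines are assumptions, not confirmed against the DiffuSeq source.

```python
import math


def timestep_embedding(t, dim, max_period=10000):
    """Sinusoidal features for a single timestep t, as a list of length dim."""
    half = dim // 2
    # Frequencies fall geometrically from 1 towards 1 / max_period
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    args = [t * f for f in freqs]
    emb = [math.cos(a) for a in args] + [math.sin(a) for a in args]
    if dim % 2:
        emb.append(0.0)  # zero-pad when dim is odd
    return emb


e0 = timestep_embedding(0, 8)
e5 = timestep_embedding(5, 8)
print(e0)
print([round(v, 3) for v in e5])
```

Each timestep gets a distinct vector, and nearby timesteps get similar ones, which is what lets a single network handle all noise levels.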