<a href="https://colab.research.google.com/github/crux82/ganbert-pytorch/blob/main/GANBERT_pytorch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# GAN-BERT (in PyTorch and compatible with Hugging Face)

This is a PyTorch (+ **Hugging Face** transformers) implementation of the GAN-BERT model from https://github.com/crux82/ganbert. While the original GAN-BERT was an extension of BERT, this implementation can be adapted to several architectures, ranging from RoBERTa to ALBERT!

**NOTE**: since this implementation differs from the original TensorFlow one, some results may be slightly different.


Let's GO!

Required Imports.

In [1]:
!pip install transformers
import torch
import io
import torch.nn.functional as F
import random
import numpy as np
import time
import math
import datetime
import torch.nn as nn
from transformers import AutoConfig, AutoModel, AutoTokenizer, get_constant_schedule_with_warmup
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
#!pip install torch==1.7.1+cu101 torchvision==0.8.2+cu101 -f https://download.pytorch.org/whl/torch_stable.html
#!pip install sentencepiece

# Set the random seeds for reproducibility
seed_val = 42
random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
if torch.cuda.is_available():
  torch.cuda.manual_seed_all(seed_val)

# !pip install -U sentence-transformers

Defaulting to user installation because normal site-packages is not writeable


In [2]:
# import torch, gc
# gc.collect()
# torch.cuda.empty_cache()

In [2]:
# If there's a GPU available...
if torch.cuda.is_available():    
    # Tell PyTorch to use the GPU.    
    device = torch.device("cuda")
    print('There are %d GPU(s) available.' % torch.cuda.device_count())
    print('We will use the GPU:', torch.cuda.get_device_name(0))
# If not...
else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")

There are 1 GPU(s) available.
We will use the GPU: Tesla V100-SXM2-32GB


### Input Parameters


In [3]:
#--------------------------------
#  Transformer parameters
#--------------------------------
max_seq_length = 64
batch_size = 64

#--------------------------------
#  GAN-BERT specific parameters
#--------------------------------
# number of hidden layers in the generator, 
# each of the size of the output space
num_hidden_layers_g = 1
# number of hidden layers in the discriminator, 
# each of the size of the input space
num_hidden_layers_d = 1
# size of the generator's input noise vectors
noise_size = 100
# dropout to be applied to discriminator's input vectors
out_dropout_rate = 0.2

# Replicate labeled data to balance poorly represented datasets, 
# e.g., less than 1% of labeled material
apply_balance = True

#--------------------------------
#  Optimization parameters
#--------------------------------
learning_rate_discriminator = 5e-5
learning_rate_generator = 5e-5
epsilon = 1e-8
num_train_epochs = 30
multi_gpu = True
# Scheduler
apply_scheduler = False
warmup_proportion = 0.1
# Print
print_each_n_step = 10

#--------------------------------
#  Adopted Transformer model
#--------------------------------
# Since this version is compatible with Huggingface transformers, you can uncomment
# (or add) transformer models compatible with GAN

model_name = "bert-base-cased"
#model_name = "bert-base-uncased"
#model_name = "roberta-base"
#model_name = "albert-base-v2"
#model_name = "xlm-roberta-base"
#model_name = "amazon/bort"

#--------------------------------
#  Retrieve the TREC QC Dataset
#--------------------------------
! git clone https://github.com/crux82/ganbert

#  NOTE: in this setting 50 classes are involved
labeled_file = "./ganbert/data/labeled.tsv"
unlabeled_file = "./ganbert/data/unlabeled.tsv"
test_filename = "./ganbert/data/test.tsv"

label_list = ["UNK_UNK","ABBR_abb", "ABBR_exp", "DESC_def", "DESC_desc", 
              "DESC_manner", "DESC_reason", "ENTY_animal", "ENTY_body", 
              "ENTY_color", "ENTY_cremat", "ENTY_currency", "ENTY_dismed", 
              "ENTY_event", "ENTY_food", "ENTY_instru", "ENTY_lang", 
              "ENTY_letter", "ENTY_other", "ENTY_plant", "ENTY_product", 
              "ENTY_religion", "ENTY_sport", "ENTY_substance", "ENTY_symbol", 
              "ENTY_techmeth", "ENTY_termeq", "ENTY_veh", "ENTY_word", "HUM_desc", 
              "HUM_gr", "HUM_ind", "HUM_title", "LOC_city", "LOC_country", 
              "LOC_mount", "LOC_other", "LOC_state", "NUM_code", "NUM_count", 
              "NUM_date", "NUM_dist", "NUM_money", "NUM_ord", "NUM_other", 
              "NUM_perc", "NUM_period", "NUM_speed", "NUM_temp", "NUM_volsize", 
              "NUM_weight"]

Cloning into 'ganbert'...
remote: Enumerating objects: 77, done.
remote: Counting objects: 100% (77/77), done.
remote: Compressing objects: 100% (59/59), done.
remote: Total 77 (delta 33), reused 54 (delta 18), pack-reused 0
Unpacking objects: 100% (77/77), 679.94 KiB | 11.93 MiB/s, done.
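
As a rough illustration of what `apply_balance` does later in `generate_data_loader`, the sketch below recomputes the replication factor in isolation. The helper name `replication_factor` is made up for this example; the arithmetic is meant to mirror the notebook's code, not to replace it.

```python
import math

def replication_factor(num_labeled, num_total):
    """Number of copies kept for each labeled example (hypothetical helper)."""
    label_mask_rate = num_labeled / num_total
    if label_mask_rate == 1:
        return 1  # fully labeled data is left untouched
    balance = int(1 / label_mask_rate)
    balance = int(math.log(balance, 2))
    return max(balance, 1)

# 1% labeled: int(1 / 0.01) = 100 and log2(100) is about 6.64, truncated to 6
print(replication_factor(100, 10000))
# 50% labeled: log2(2) = 1, so no extra copies
print(replication_factor(50, 100))
```

The logarithm keeps the replication modest: even with 1% labeled data, each labeled example appears only a handful of times.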


Load the Transformer Model

In [4]:
transformer = AutoModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)


Some weights of the model checkpoint at bert-base-cased were not used when initializing BertModel: ['cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



In [5]:
model = transformer

Function required to load the dataset

In [6]:
def get_qc_examples(input_file):
  """Reads a QC file and returns a list of (question, label) tuples."""
  examples = []

  with open(input_file, 'r', encoding="utf-8") as f:
      contents = f.read()
      file_as_list = contents.splitlines()
      # The first line of the file is skipped
      for line in file_as_list[1:]:
          # The first token is the label ("COARSE:fine"), the rest is the question
          split = line.split(" ")
          question = ' '.join(split[1:])

          text_a = question
          inn_split = split[0].split(":")
          label = inn_split[0] + "_" + inn_split[1]
          examples.append((text_a, label))
  # No explicit close is needed: the with-block closes the file

  return examples
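
To make the expected line format concrete, here is a minimal sketch of the parsing step applied to a single line. The helper `parse_qc_line` and the sample line are illustrative assumptions; they follow the `COARSE:fine question` layout that `get_qc_examples` splits on.

```python
def parse_qc_line(line):
    """Split one 'COARSE:fine question text' line into (question, label)."""
    head, _, question = line.partition(" ")
    parts = head.split(":")
    label = parts[0] + "_" + parts[1]
    return question, label

# Illustrative line in the assumed format
sample = "DESC:manner How did serfdom develop in and then leave Russia ?"
question, label = parse_qc_line(sample)
print(label)     # DESC_manner
print(question)  # How did serfdom develop in and then leave Russia ?
```

The resulting label string matches the `COARSE_fine` naming used in `label_list`.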

**Load** the input QC dataset (fine-grained)

In [9]:
#Load the examples
labeled_examples = get_qc_examples(labeled_file)
unlabeled_examples = get_qc_examples(unlabeled_file)
test_examples = get_qc_examples(test_filename)

Functions required to convert examples into Dataloader

In [10]:
def generate_data_loader(input_examples, label_masks, label_map, do_shuffle = False, balance_label_examples = False):
  '''
  Generate a DataLoader from the input examples, which are masked when they are 
  to be treated as NOT labeled.
  '''
  examples = []

  # Count the percentage of labeled examples  
  num_labeled_examples = 0
  for label_mask in label_masks:
    if label_mask: 
      num_labeled_examples += 1
  label_mask_rate = num_labeled_examples/len(input_examples)

  # Apply the balancing, if requested
  for index, ex in enumerate(input_examples): 
    if label_mask_rate == 1 or not balance_label_examples:
      examples.append((ex, label_masks[index]))
    else:
      # Labeled examples are replicated `balance` times
      if label_masks[index]:
        balance = int(1/label_mask_rate)
        balance = int(math.log(balance,2))
        if balance < 1:
          balance = 1
        for b in range(0, int(balance)):
          examples.append((ex, label_masks[index]))
      else:
        examples.append((ex, label_masks[index]))
  
  #-----------------------------------------------
  # Generate input examples to the Transformer
  #-----------------------------------------------
  input_ids = []
  input_mask_array = []
  label_mask_array = []
  label_id_array = []

  # Tokenization 
  for (text, label_mask) in examples:
    encoded_sent = tokenizer.encode(text[0], add_special_tokens=True, max_length=max_seq_length, padding="max_length", truncation=True)
    input_ids.append(encoded_sent)
    label_id_array.append(label_map[text[1]])
    label_mask_array.append(label_mask)
  
  # Attention mask (to ignore padded input wordpieces)
  for sent in input_ids:
    # Compare against the tokenizer's pad id (0 for BERT, but not for every model)
    att_mask = [int(token_id != tokenizer.pad_token_id) for token_id in sent]
    input_mask_array.append(att_mask)
  # Conversion to Tensor
  input_ids = torch.tensor(input_ids) 
  input_mask_array = torch.tensor(input_mask_array)
  label_id_array = torch.tensor(label_id_array, dtype=torch.long)
  label_mask_array = torch.tensor(label_mask_array)

  # Building the TensorDataset
  dataset = TensorDataset(input_ids, input_mask_array, label_id_array, label_mask_array)

  if do_shuffle:
    sampler = RandomSampler
  else:
    sampler = SequentialSampler

  # Building the DataLoader
  return DataLoader(
              dataset,  # The training samples.
              sampler = sampler(dataset), 
              batch_size = batch_size) # Trains with this batch size.

def format_time(elapsed):
    '''
    Takes a time in seconds and returns a string hh:mm:ss
    '''
    # Round to the nearest second.
    elapsed_rounded = int(round((elapsed)))
    # Format as hh:mm:ss
    return str(datetime.timedelta(seconds=elapsed_rounded))
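
The attention mask built inside `generate_data_loader` can be pictured with a few token ids. This is only a sketch: the ids are made up, the helper `build_attention_mask` is hypothetical, and a pad id of 0 is assumed (true for BERT vocabularies; other models may use a different value, e.g. 1 for RoBERTa).

```python
def build_attention_mask(token_ids, pad_token_id=0):
    """Return 1 for real tokens and 0 for padding positions."""
    return [int(t != pad_token_id) for t in token_ids]

# Made-up sequence padded to length 8; 101 and 102 stand in for [CLS] and [SEP].
ids = [101, 2129, 2106, 102, 0, 0, 0, 0]
print(build_attention_mask(ids))  # [1, 1, 1, 1, 0, 0, 0, 0]
```

Passing the pad id explicitly is a design choice that keeps the mask correct when the tokenizer is swapped for one whose padding token is not 0.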

Convert the input examples into DataLoader

In [11]:
label_map = {}
for (i, label) in enumerate(label_list):
  label_map[label] = i
#------------------------------
#   Load the train dataset
#------------------------------
train_examples = labeled_examples
#The labeled (train) dataset is assigned with a mask set to True
train_label_masks = np.ones(len(labeled_examples), dtype=bool)
#If unlabeled examples are available
# if unlabeled_examples:
#   train_examples = train_examples + unlabeled_examples
#   #The unlabeled (train) dataset is assigned with a mask set to False
#   tmp_masks = np.zeros(len(unlabeled_examples), dtype=bool)
#   train_label_masks = np.concatenate([train_label_masks,tmp_masks])

train_dataloader = generate_data_loader(train_examples, train_label_masks, label_map, do_shuffle = True, balance_label_examples = apply_balance)

#------------------------------
#   Load the test dataset
#------------------------------
#The labeled (test) dataset is assigned with a mask set to True
test_label_masks = np.ones(len(test_examples), dtype=bool)

test_dataloader = generate_data_loader(test_examples, test_label_masks, label_map, do_shuffle = False, balance_label_examples = False)
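
The commented-out block above would append the unlabeled examples with a mask set to False. The sketch below shows how examples and masks line up in that case. The three examples are invented, plain Python lists stand in for the NumPy arrays, and it is assumed that unlabeled lines carry the placeholder label `UNK_UNK` from `label_list`.

```python
# Invented toy data: two labeled questions and one unlabeled question.
labeled = [("What is an atom ?", "DESC_def"), ("Who wrote Hamlet ?", "HUM_ind")]
unlabeled = [("Where is the Nile ?", "UNK_UNK")]

examples = labeled + unlabeled
# True marks examples whose label contributes to the supervised loss.
label_masks = [True] * len(labeled) + [False] * len(unlabeled)

for (text, label), mask in zip(examples, label_masks):
    print(mask, label, text)

# Fraction of labeled material, the quantity used to decide on balancing.
labeled_rate = sum(label_masks) / len(examples)
print(round(labeled_rate, 3))
```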

In [12]:
# next(iter(train_dataloader))

We define the Generator and Discriminator as discussed in https://www.aclweb.org/anthology/2020.acl-main.191/

In [13]:
#------------------------------
#   The Generator as in 
#   https://www.aclweb.org/anthology/2020.acl-main.191/
#   https://github.com/crux82/ganbert
#------------------------------
# class Generator(nn.Module):
#     def __init__(self, noise_size=100, output_size=512, hidden_sizes=[512], dropout_rate=0.1):
#         super(Generator, self).__init__()
#         layers = []
#         hidden_sizes = [noise_size] + hidden_sizes
#         for i in range(len(hidden_sizes)-1):
#             layers.extend([nn.Linear(hidden_sizes[i], hidden_sizes[i+1]), nn.LeakyReLU(0.2, inplace=True), nn.Dropout(dropout_rate)])

#         layers.append(nn.Linear(hidden_sizes[-1],output_size))
#         self.layers = nn.Sequential(*layers)

#     def forward(self, noise):
#         output_rep = self.layers(noise)
#         return output_rep

#------------------------------
#   The Discriminator
#   https://www.aclweb.org/anthology/2020.acl-main.191/
#   https://github.com/crux82/ganbert
#------------------------------
class Discriminator(nn.Module):
    def __init__(self, input_size=512, hidden_sizes=[512], num_labels=2, dropout_rate=0.1):
        super(Discriminator, self).__init__()
        self.input_dropout = nn.Dropout(p=dropout_rate)
        layers = []
        hidden_sizes = [input_size] + hidden_sizes
        for i in range(len(hidden_sizes)-1):
            layers.extend([nn.Linear(hidden_sizes[i], hidden_sizes[i+1]), nn.LeakyReLU(0.2, inplace=True), nn.Dropout(dropout_rate)])

        self.layers = nn.Sequential(*layers) # for the flatten
        self.logit = nn.Linear(hidden_sizes[-1],num_labels+1) # +1 for the probability of this sample being fake/real.
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, input_rep):
        input_rep = self.input_dropout(input_rep)
        last_rep = self.layers(input_rep)
        logits = self.logit(last_rep)
        probs = self.softmax(logits)
        return last_rep, logits, probs
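
The Discriminator emits `num_labels + 1` logits, with the last slot reserved for the "fake" class. A small plain-Python sketch, using made-up logits for three classes, shows how the two readings of that output differ: the fake probability comes from the full softmax, while classification renormalizes over the real classes only.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up logits for 3 real classes plus the extra "fake" slot (last position).
logits = [2.0, 0.5, -1.0, 0.0]

probs = softmax(logits)
p_fake = probs[-1]                  # analogous to probs[:, -1] in the notebook
class_probs = softmax(logits[:-1])  # analogous to softmax over logits[:, 0:-1]

predicted = max(range(len(class_probs)), key=class_probs.__getitem__)
print(round(p_fake, 3), predicted)
```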

We instantiate the Discriminator and Generator

In [14]:
# The config file is required to get the dimension of the vector produced by 
# the underlying transformer
config = AutoConfig.from_pretrained(model_name)
hidden_size = int(config.hidden_size)
# Define the number and width of hidden layers
# hidden_levels_g = [hidden_size for i in range(0, num_hidden_layers_g)]
hidden_levels_d = [hidden_size for i in range(0, num_hidden_layers_d)]

#-------------------------------------------------
#   Instantiate the Generator and Discriminator
#-------------------------------------------------
# generator = Generator(noise_size=noise_size, output_size=hidden_size, hidden_sizes=hidden_levels_g, dropout_rate=out_dropout_rate)
discriminator = Discriminator(input_size=hidden_size, hidden_sizes=hidden_levels_d, num_labels=len(label_list), dropout_rate=out_dropout_rate)

# Put everything in the GPU if available
if torch.cuda.is_available():    
#   generator.cuda()
  discriminator.cuda()
  transformer.cuda()
  if multi_gpu:
    transformer = torch.nn.DataParallel(transformer)

# print(config)

In [15]:
one_walk = [x[0] for x in train_examples]
seed = [x[0] for x in labeled_examples]

In [16]:
import os
# Create the output directory if it does not exist yet
os.makedirs('./save', exist_ok=True)
with open("save/seed.txt","w", encoding="utf-8") as file:
    for i in seed:
        file.write(i+"\n")
        
with open("save/one_walk.txt","w", encoding = "utf-8") as file:
    for i in one_walk:
        file.write(i+"\n")      

In [17]:
# from transformers import AutoTokenizer, AutoModel
# import torch


#Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0] #First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    sum_embeddings = torch.sum(token_embeddings * input_mask_expanded, 1)
    sum_mask = torch.clamp(input_mask_expanded.sum(1), min=1e-9)
    return sum_embeddings / sum_mask



#Sentences we want sentence embeddings for
# sentences = ['This framework generates embeddings for each input sentence',
#              'Sentences are passed as a list of string.',
#              'The quick brown fox jumps over the lazy dog.']

# #Load AutoModel from huggingface model repository
# # tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
# model = AutoModel.from_pretrained(model_name)

# #Tokenize sentences
# encoded_input = tokenizer(sentences, add_special_tokens=True, padding="max_length", truncation=True, max_length=max_seq_length, return_tensors='pt')
# # tokenizer.encode(text[0], add_special_tokens=True, max_length=max_seq_length, padding="max_length", truncation=True)
# encoded_input = tokenizer(sentences, add_special_tokens=True, padding="max_length", truncation=True, max_length=max_seq_length, return_tensors='pt')

# #Compute token embeddings
# with torch.no_grad():
#     model_output = model(**encoded_input)

# #Perform pooling. In this case, mean pooling
# sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
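
To see what `mean_pooling` computes, the same arithmetic can be written out for a single sentence in plain Python. This is an illustrative sketch with made-up numbers, not the tensor code used in training; the function name `mean_pool` is hypothetical.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average token vectors over positions whose mask is 1."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    for vec, m in zip(token_embeddings, attention_mask):
        for j in range(dim):
            summed[j] += vec[j] * m
    # Same guard against division by zero as torch.clamp(..., min=1e-9)
    count = max(sum(attention_mask), 1e-9)
    return [s / count for s in summed]

# One sentence, 4 positions (the last two are padding), embedding size 2.
tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0], [9.0, 9.0]]
mask = [1, 1, 0, 0]
print(mean_pool(tokens, mask))  # [2.0, 3.0]
```

The padded positions hold large values on purpose: they do not affect the result because the mask zeroes them out before averaging.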

In [18]:
# encoded_input = tokenizer(sentences, add_special_tokens=True, padding="max_length", truncation=True, max_length=max_seq_length, return_tensors='pt')

# # #Compute token embeddings
# with torch.no_grad():
#     model_output = model(**encoded_input)

# # #Perform pooling. In this case, mean pooling
# sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
# sentence_embeddings2

In [19]:
# from sentence_transformers import SentenceTransformer
# model_st = SentenceTransformer(model_name,device='cuda:0')

# #Our sentences we like to encode
# sentences = ['This framework generates embeddings for each input sentence',
#     'Sentences are passed as a list of string.',
#     'The quick brown fox jumps over the lazy dog.']

#Sentences are encoded by calling model.encode()



Let's go with the training procedure

In [21]:
len(one_walk)

5452
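
With 5452 training examples, the step counts used by the scheduler branch of the training cell can be worked out ahead of time. The sketch below simply replays those formulas with the parameter values set earlier in the notebook; `steps_per_epoch` is an extra quantity added here for comparison.

```python
import math

num_train_examples = 5452   # len(one_walk) printed above
batch_size = 64
num_train_epochs = 30
warmup_proportion = 0.1

# Same formulas as the scheduler branch of the training cell.
num_train_steps = int(num_train_examples / batch_size * num_train_epochs)
num_warmup_steps = int(num_train_steps * warmup_proportion)

# The DataLoader keeps the final partial batch, so an epoch has
# ceil(n / batch_size) optimizer steps.
steps_per_epoch = math.ceil(num_train_examples / batch_size)

print(num_train_steps, num_warmup_steps, steps_per_epoch)
```

Because the last partial batch is kept, the real number of steps over 30 epochs is slightly higher than `num_train_steps`; this only matters when `apply_scheduler` is enabled.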

In [23]:
save_dir = "./save"
training_stats = []

# Measure the total training time for the whole run.
total_t0 = time.time()

#models parameters
transformer_vars = [i for i in transformer.parameters()]
d_vars = transformer_vars + [v for v in discriminator.parameters()]
# g_vars = [v for v in generator.parameters()]

#optimizer
dis_optimizer = torch.optim.AdamW(d_vars, lr=learning_rate_discriminator)
# gen_optimizer = torch.optim.AdamW(g_vars, lr=learning_rate_generator) 

#scheduler
if apply_scheduler:
  num_train_examples = len(train_examples)
  num_train_steps = int(num_train_examples / batch_size * num_train_epochs)
  num_warmup_steps = int(num_train_steps * warmup_proportion)

  scheduler_d = get_constant_schedule_with_warmup(dis_optimizer, 
                                           num_warmup_steps = num_warmup_steps)
  # The generator optimizer is commented out above, so no scheduler is built for it
  # scheduler_g = get_constant_schedule_with_warmup(gen_optimizer, 
  #                                          num_warmup_steps = num_warmup_steps)
import os.path
import time
# from torch import device as device_
# model.to('cuda')
# device = device_("cuda" if cuda.is_available() else "cpu")
# model = Model().to(device)
# For each epoch...
for epoch_i in range(0, num_train_epochs):
    # ========================================
    #               Training
    # ========================================
    # Perform one full pass over the training set.
    print("")
    print('======== Epoch {:} / {:} ========'.format(epoch_i + 1, num_train_epochs))
    print('Training...')

    # Measure how long the training epoch takes.
    t0 = time.time()

    # Reset the total loss for this epoch.
    tr_g_loss = 0
    tr_d_loss = 0
    tr_rewarder = 0
    # Put the model into training mode.
    transformer.train() 
#     generator.train()
    discriminator.train()

    ent_w= 1
    eval_file_prefix = os.path.join(save_dir,'evaler_file'+str(ent_w))

    
    file_path =  eval_file_prefix + str(epoch_i) + ".txt"
    while not os.path.exists(file_path):
        time.sleep(5)

    if os.path.isfile(file_path):
        # read file
        with open(eval_file_prefix + str(epoch_i) + ".txt","r",encoding="utf-8") as f:
            sentences_irl = f.read().splitlines()
    else:
        raise ValueError("%s isn't a file!" % file_path)
        
    
    # For each batch of training data...
    for step, batch in enumerate(train_dataloader):

        # Progress update every print_each_n_step batches.
        if step % print_each_n_step == 0 and not step == 0:
            # Calculate elapsed time in minutes.
            elapsed = format_time(time.time() - t0)
            
            # Report progress.
            print('  Batch {:>5,}  of  {:>5,}.    Elapsed: {:}.'.format(step, len(train_dataloader), elapsed))

        # Unpack this training batch from our dataloader. 
        b_input_ids = batch[0].to(device)
        b_input_mask = batch[1].to(device)
        b_labels = batch[2].to(device)
        b_label_mask = batch[3].to(device)

        real_batch_size = b_input_ids.shape[0]
     
        # Encode real data in the Transformer
        model_outputs = transformer(b_input_ids, attention_mask=b_input_mask)
        hidden_states = model_outputs[-1]
        
        # Generate fake data that should have the same distribution as the ones
        # encoded by the transformer. 
        # In the original GAN-BERT, noise vectors are fed to the Generator:
#         noise = torch.zeros(real_batch_size, noise_size, device=device).uniform_(0, 1)
#         # Generate fake data
#         gen_rep = generator(noise)
        # Here the "fake" representations come instead from sentences sampled
        # from the file read above, encoded by the transformer and mean-pooled
        sent_sampled = random.sample(sentences_irl,batch_size)
        encoded_input = tokenizer(sent_sampled, add_special_tokens=True, padding="max_length", truncation=True, max_length=max_seq_length, return_tensors='pt')
        encoded_input = encoded_input.to(device)
        # Compute token embeddings (no gradient flows through this branch)
        with torch.no_grad():
            model_output = model(**encoded_input)

        # Perform pooling. In this case, mean pooling
        sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
        gen_rep = sentence_embeddings.to(device)
        # Generate the output of the Discriminator for real and fake data.
        # First, we put together the output of the transformer and the fake representations
        disciminator_input = torch.cat([hidden_states, gen_rep], dim=0)
        # Then, we select the output of the discriminator
        features, logits, probs = discriminator(disciminator_input)

        # Finally, we separate the discriminator's output for the real and fake
        # data
        features_list = torch.split(features, real_batch_size)
        D_real_features = features_list[0]
        D_fake_features = features_list[1]
      
        logits_list = torch.split(logits, real_batch_size)
        D_real_logits = logits_list[0]
        D_fake_logits = logits_list[1]
        
        probs_list = torch.split(probs, real_batch_size)
        D_real_probs = probs_list[0]
        D_fake_probs = probs_list[1]

        #---------------------------------
        #  LOSS evaluation
        #---------------------------------
        # Generator's LOSS estimation
        g_loss_d = -1 * torch.mean(torch.log(1 - D_fake_probs[:,-1] + epsilon))
        
        
#         g_feat_reg = torch.mean(torch.pow(torch.mean(D_real_features, dim=0) - torch.mean(D_fake_features, dim=0), 2))
#         g_feat_reg = 
        g_loss = g_loss_d
  
        # Discriminator's LOSS estimation
        logits = D_real_logits[:,0:-1]
        log_probs = F.log_softmax(logits, dim=-1)
        # The discriminator provides an output for labeled and unlabeled real data
        # so the loss evaluated for unlabeled data is ignored (masked)
        label2one_hot = torch.nn.functional.one_hot(b_labels, len(label_list)).to(torch.float)
        
        per_example_loss = -torch.sum(label2one_hot * log_probs, dim=-1)
        per_example_loss = torch.masked_select(per_example_loss, b_label_mask.to(device))
        labeled_example_count = per_example_loss.type(torch.float32).numel()

        # It may be the case that a batch does not contain labeled examples, 
        # so the "supervised loss" in this case is not evaluated
        if labeled_example_count == 0:
          D_L_Supervised = 0
        else:
          D_L_Supervised = torch.div(torch.sum(per_example_loss.to(device)), labeled_example_count)
                 
        D_L_unsupervised1U = -1 * torch.mean(torch.log(1 - D_real_probs[:, -1] + epsilon))
        D_L_unsupervised2U = -1 * torch.mean(torch.log(D_fake_probs[:, -1] + epsilon))
        d_loss = D_L_Supervised + D_L_unsupervised1U + D_L_unsupervised2U

        #---------------------------------
        #  OPTIMIZATION
        #---------------------------------
        # Avoid gradient accumulation gen, disc, reward, update
#         gen_optimizer.zero_grad()
        dis_optimizer.zero_grad()

        # Calculate weight updates
        # retain_graph=True is required since the underlying graph will be deleted after backward
#         g_loss.backward(retain_graph=True)
        d_loss.backward() 
        
        # Apply modifications
#         gen_optimizer.step()
        dis_optimizer.step()

        # A detail log of the individual losses
        #print("{0:.4f}\t{1:.4f}\t{2:.4f}\t{3:.4f}\t{4:.4f}".
        #      format(D_L_Supervised, D_L_unsupervised1U, D_L_unsupervised2U,
        #             g_loss_d, g_feat_reg))

        # Save the losses to print them later
        tr_g_loss += g_loss.item()
        tr_d_loss += d_loss.item()
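        # D_L_Supervised is the plain integer 0 when the batch has no labeled examples
        if torch.is_tensor(D_L_Supervised):
            tr_rewarder += D_L_Supervised.item()
        else:
            tr_rewarder += 0.0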

        # Update the learning rate with the scheduler
        if apply_scheduler:
          scheduler_d.step()
          # scheduler_g.step()  # no generator optimizer in this variant

    # Calculate the average loss over all of the batches.
    avg_train_loss_g = tr_g_loss / len(train_dataloader)
    avg_train_loss_d = tr_d_loss / len(train_dataloader)
    avg_train_loss_rewarder = tr_rewarder / len(train_dataloader)
    
    with open(eval_file_prefix + str(epoch_i) + "_gloss.txt","w",encoding="utf-8") as f:
        f.write(str(avg_train_loss_g))
    with open(eval_file_prefix + "_dloss.txt","w",encoding="utf-8") as f:
        f.write(str(avg_train_loss_d))
    with open(eval_file_prefix + str(epoch_i) + "_dloss.txt","w",encoding="utf-8") as f:
        f.write(str(avg_train_loss_d))
        
#     with open(file_path_2,"r",encoding="utf-8") as f:
#                 avg_train_loss_d=f.read()    
    # Measure how long this epoch took.
    training_time = format_time(time.time() - t0)

    print("")
    print("  Average training loss generator: {0:.3f}".format(avg_train_loss_g))
    print("  Average training loss discriminator: {0:.3f}".format(avg_train_loss_d))
    print("  Training epoch took: {:}".format(training_time))
        
    # ========================================
    #     TEST ON THE EVALUATION DATASET
    # ========================================
    # After the completion of each training epoch, measure our performance on
    # our test set.
    print("")
    print("Running Test...")

    t0 = time.time()

    # Put the model in evaluation mode--the dropout layers behave differently
    # during evaluation.
    transformer.eval() #maybe redundant
    discriminator.eval()
#     generator.eval()

    # Tracking variables 
    total_test_accuracy = 0
   
    total_test_loss = 0
    nb_test_steps = 0

    all_preds = []
    all_labels_ids = []

    #loss
    nll_loss = torch.nn.CrossEntropyLoss(ignore_index=-1)

    # Evaluate data for one epoch
    for batch in test_dataloader:
        
        # Unpack this test batch from our dataloader. 
        b_input_ids = batch[0].to(device)
        b_input_mask = batch[1].to(device)
        b_labels = batch[2].to(device)
        
        # Tell pytorch not to bother with constructing the compute graph during
        # the forward pass, since this is only needed for backprop (training).
        with torch.no_grad():        
            model_outputs = transformer(b_input_ids, attention_mask=b_input_mask)
            hidden_states = model_outputs[-1]
            _, logits, probs = discriminator(hidden_states)
            ###log_probs = F.log_softmax(probs[:,1:], dim=-1)
            filtered_logits = logits[:,0:-1]
            # Accumulate the test loss.
            total_test_loss += nll_loss(filtered_logits, b_labels)
            
        # Accumulate the predictions and the input labels
        _, preds = torch.max(filtered_logits, 1)
        
        all_preds += preds.detach().cpu().tolist()
        all_labels_ids += b_labels.detach().cpu().tolist()
        
    all_preds = np.array(all_preds)
    all_labels_ids = np.array(all_labels_ids)
    # Report the final accuracy for this validation run.
#     all_preds = torch.stack(torch.Tensor(all_preds)).numpy()
#     all_labels_ids = torch.stack(torch.Tensor(all_labels_ids)).numpy()
    test_accuracy = np.sum(all_preds == all_labels_ids) / len(all_preds)
    print("  Accuracy: {0:.3f}".format(test_accuracy))

    # Calculate the average loss over all of the batches.
    avg_test_loss = total_test_loss / len(test_dataloader)
    avg_test_loss = avg_test_loss.item()
    
    # Measure how long the validation run took.
    test_time = format_time(time.time() - t0)
    
    print("  Test Loss: {0:.3f}".format(avg_test_loss))
    print("  Test took: {:}".format(test_time))

    # Record all statistics from this epoch.
    training_stats.append(
        {
            'epoch': epoch_i + 1,
            'Training Loss generator': avg_train_loss_g,
            'Training Loss discriminator': avg_train_loss_d,
            'Valid. Loss': avg_test_loss,
            'Valid. Accur.': test_accuracy,
            'Training Time': training_time,
            'Test Time': test_time
        }
    )




Training...
  Batch    10  of     86.    Elapsed: 0:06:44.
  Batch    20  of     86.    Elapsed: 0:06:47.
  Batch    30  of     86.    Elapsed: 0:06:50.
  Batch    40  of     86.    Elapsed: 0:06:54.
  Batch    50  of     86.    Elapsed: 0:06:57.
  Batch    60  of     86.    Elapsed: 0:07:00.
  Batch    70  of     86.    Elapsed: 0:07:04.
  Batch    80  of     86.    Elapsed: 0:07:07.

  Average training loss generator: 0.928
  Average training loss discriminator: 4.212
  Training epoch took: 0:07:09

Running Test...
  Accuracy: 0.592
  Test Loss: 1.963
  Test took: 0:00:01

Training...
  Batch    10  of     86.    Elapsed: 0:22:49.
  Batch    20  of     86.    Elapsed: 0:22:53.
  Batch    30  of     86.    Elapsed: 0:22:56.
  Batch    40  of     86.    Elapsed: 0:22:59.
  Batch    50  of     86.    Elapsed: 0:23:02.
  Batch    60  of     86.    Elapsed: 0:23:06.
  Batch    70  of     86.    Elapsed: 0:23:09.
  Batch    80  of     86.    Elapsed: 0:23:12.

  Average training loss gene
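
The unsupervised loss terms in the training cell are easier to follow with concrete numbers. The sketch below evaluates them in plain Python for made-up "fake" probabilities; the variable names are chosen for this example, while the formulas are meant to match `D_L_unsupervised1U`, `D_L_unsupervised2U` and `g_loss_d`.

```python
import math

epsilon = 1e-8

def mean(xs):
    return sum(xs) / len(xs)

# Made-up "fake" probabilities (last softmax slot) for two real and two fake inputs.
p_fake_on_real = [0.10, 0.20]
p_fake_on_fake = [0.90, 0.70]

# Real inputs should NOT be flagged as fake (D_L_unsupervised1U).
d_loss_real = -mean([math.log(1 - p + epsilon) for p in p_fake_on_real])
# Fake inputs should be flagged as fake (D_L_unsupervised2U).
d_loss_fake = -mean([math.log(p + epsilon) for p in p_fake_on_fake])
# Generator-side term: low when fake inputs pass as real (g_loss_d).
g_loss = -mean([math.log(1 - p + epsilon) for p in p_fake_on_fake])

print(round(d_loss_real, 3), round(d_loss_fake, 3), round(g_loss, 3))
```

In this notebook the generator and its optimizer are commented out, so the generator-side term is only accumulated for logging; the discriminator terms are the ones that are backpropagated.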

In [24]:
for stat in training_stats:
  print(stat)

print("\nTraining complete!")

print("Total training took {:} (h:mm:ss)".format(format_time(time.time()-total_t0)))

{'epoch': 1, 'Training Loss generator': 0.9186387432002744, 'Training Loss discriminator': 4.204453378222709, 'Valid. Loss': 2.015935182571411, 'Valid. Accur.': 0.51, 'Training Time': '0:07:31', 'Test Time': '0:00:01'}
{'epoch': 2, 'Training Loss generator': 2.685943609060243, 'Training Loss discriminator': 1.6678712742273198, 'Valid. Loss': 1.2660813331604004, 'Valid. Accur.': 0.696, 'Training Time': '0:22:16', 'Test Time': '0:00:01'}
{'epoch': 3, 'Training Loss generator': 3.885301398676495, 'Training Loss discriminator': 1.0134386281634487, 'Valid. Loss': 1.0490162372589111, 'Valid. Accur.': 0.748, 'Training Time': '0:22:24', 'Test Time': '0:00:01'}
{'epoch': 4, 'Training Loss generator': 4.8890673809273295, 'Training Loss discriminator': 0.6670544550342615, 'Valid. Loss': 0.8141199350357056, 'Valid. Accur.': 0.8, 'Training Time': '0:21:21', 'Test Time': '0:00:01'}
{'epoch': 5, 'Training Loss generator': 5.443946810655816, 'Training Loss discriminator': 0.4466429023548614, 'Valid. L
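
Since each entry of `training_stats` is a dictionary, picking out the best epoch takes one line. The sketch below uses the four complete records printed above, with losses rounded to three decimals; the fifth record is cut off in the output, so it is left out.

```python
# Values copied (rounded) from the four complete records in the output above.
training_stats = [
    {"epoch": 1, "Valid. Loss": 2.016, "Valid. Accur.": 0.51},
    {"epoch": 2, "Valid. Loss": 1.266, "Valid. Accur.": 0.696},
    {"epoch": 3, "Valid. Loss": 1.049, "Valid. Accur.": 0.748},
    {"epoch": 4, "Valid. Loss": 0.814, "Valid. Accur.": 0.8},
]

# Select the record with the highest accuracy.
best = max(training_stats, key=lambda s: s["Valid. Accur."])
print("best epoch:", best["epoch"], "accuracy:", best["Valid. Accur."])
```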

In [17]:
len(newsgroups_train.target)

11314

In [2]:

from sklearn.datasets import fetch_20newsgroups
newsgroups_train = fetch_20newsgroups(subset='train')

In [23]:
for i in newsgroups_train.target:print(i)

7
4
4
1
14
16
13
3
2
4
8
19
4
14
6
0
1
7
12
5
0
10
6
2
4
1
12
9
15
7
6
13
12
17
18
10
8
11
8
16
9
4
3
9
9
4
4
8
12
14
5
15
2
13
17
11
7
10
2
14
12
5
4
6
7
0
11
16
0
6
17
7
12
7
3
12
11
7
2
2
0
16
1
2
7
3
2
1
10
12
12
17
12
2
8
8
18
5
0
1
6
12
8
4
17
12
12
12
1
6
18
4
3
10
9
0
13
11
5
14
15
8
4
15
15
1
0
16
9
8
6
13
6
17
14
0
9
1
2
15
13
9
2
8
2
13
2
0
15
14
1
14
17
14
4
4
7
19
1
15
17
16
2
15
9
12
6
9
6
6
18
1
10
6
10
5
2
13
3
9
13
12
13
8
4
3
9
1
12
4
2
2
11
13
4
1
12
0
16
12
16
7
17
15
11
14
2
7
10
14
15
5
16
11
4
13
7
4
13
17
1
15
17
17
9
16
17
0
16
8
13
2
14
10
2
2
9
14
9
2
15
15
4
4
11
11
11
13
0
6
18
13
0
18
16
2
9
11
9
3
0
18
7
9
14
9
10
17
8
5
3
2
13
6
7
1
11
4
12
11
6
2
14
9
1
4
11
15
15
6
5
10
8
4
16
17
9
8
10
17
10
11
8
15
10
4
12
0
5
4
0
18
17
2
3
0
19
19
9
19
5
13
14
10
5
11
6
6
1
13
13
18
17
2
9
15
2
8
9
2
19
14
5
14
15
9
3
18
1
18
10
13
19
18
15
9
14
4
4
0
12
18
14
14
17
13
15
5
4
18
12
9
6
1
4
6
8
0
17
3
1
18
16
18
8
13
12
15
8
6
19
14
4
4
17
5
6
5
13
4
11
10
12
3
12
7
14
11
1
11
9
8
9
3
18
2
14
3
8
14
0
13
6
18
3
10
14
18
0
7
0
18
12
14
7
11
17
19
18
0
19
11
15
3
0
18
19
8
0
6
13
14
12
17
10
4
1
12
7
10
14
1
14
3
19
14
7
2
9
18
2
3
19
15
12
17
6
7
8
6
7
1
11
16
13
1
0
7
14
2
0
15
19
6
10
5
4
4
18
16
5
5
1
12
15
10
15
13
12
19
14
5
10
3
4
19
18
14
7
5
3
8
17
17
3
12
0
2
14
12
10
3
7
18
12
14
14
15
11
14
6
2
15
11
0
11
3
8
5
5
8
8
15
9
1
16
6
5
8
7
0
18
10
1
2
13
6
3
16
10
2
15
5
2
8
13
3
4
7
1
11
13
6
14
0
7
2
13
13
1
18
10
13
8
0
15
15
7
7
16
7
19
14
11
2
0
19
11
9
8
5
2
6
11
1
12
17
18
10
15
11
12
9
11
6
13
16
17
5
8
5
11
9
8
0
19
2
11
10
8
5
4
6
10
13
7
1
12
6
3
12
15
12
6
1
18
12
12
13
9
3
9
1
16
13
1
7
12
0
10
2
3
10
2
15
5
8
8
5
14
9
9
15
7
3
17
0
9
16
14
7
11
14
8
9
1
10
0
1
12
16
5
14
17
18
18
8
0
19
5
17
10
14
5
3
13
1
5
13
18
9
13
9
18
13
3
4
1
5
15
1
9
17
11
0
10
16
15
18
9
12
1
16
7
17
13
4
14
14
11
18
3
9
3
2
7
14
4
8
8
10
1
12
10
16
18
11
16
9
9
15
18
10
11
8
3
7
10
19
2
12
13
14
8
13
1
11
14
11
2
3
8
3
0
14
2
16
5
11
11
17
9
4
17
13
15
5
11
8
13
19
9
15
18
18
1
16
14
6
11
0
6
15
2
6
7
14
8
2
5
14
2
7
1
18
2
10
9
7
12
5
0
18
2
1
13
3
1
12
2
4
15
5
10
9
4
4
3
4
3
5
10
12
0
3
6
15
15
16
4
14
5
13
4
13
8
14
12
11
7
3
17
10
11
16
3
11
19
1
19
7
19
3
15
5
8
13
8
7
4
3
14
17
16
0
18
5
17
19
14
16
0
14
3
10
11
11
11
18
13
15
16
11
4
1
12
10
2
16
18
12
6
8
7
2
8
4
19
2
9
8
7
5
14
0
15
12
15
9
18
16
17
11
10
15
1
11
17
15
5
18
10
11
15
1
1
18
0
12
4
11
0
9
18
13
11
6
19
8
13
14
11
13
17
7
0
7
14
17
7
12
9
4
9
14
17
12
7
9
8
14
9
17
8
15
4
17
6
19
13
8
17
3
2
16
11
13
5
13
6
0
12
4
15
3
14
0
0
6
4
2
13
3
12
12
8
10
6
7
10
16
13
7
10
19
1
0
19
1
3
3
13
17
7
17
17
2
16
8
8
3
2
4
17
17
5
14
6
10
11
16
7
1
10
10
19
11
14
19
7
13
6
1
12
6
19
16
0
3
6
8
9
1
5
8
13
0
10
19
16
3
18
7
13
9
19
15
9
8
6
19
14
9
4
1
16
17
13
17
8
16
13
0
13
19
1
13
15
1
6
19
3
4
18
0
0
7
9
4
7
8
11
17
16
9
3
11
4
14
1
13
9
16
2
12
2
11
13
9
15
9
10
0
7
16
1
12
5
13
15
2
16
2
13
17
19
4
10
9
11
4
0
8
3
4
15
10
9
12
1
3
7
6
15
11
12
17
15
6
17
10
7
8
2
13
1
4
8
6
16
17
14
1
8
11
3
0
11
1
15
6
15
6
8
7
9
17
14
3
8
17
4
3
19
12
12
8
2
7
2
15
2
6
11
0
16
2
3
14
12
11
16
11
1
11
13
13
9
9
13
16
3
9
4
19
11
16
17
6
4
14
11
3
15
14
9
1
6
18
17
15
2
9
1
17
4
2
14
14
11
15
7
18
12
8
6
8
18
2
3
3
0
11
18
14
18
16
13
0
12
13
16
6
8
1
6
14
19
9
15
14
4
2
4
9
1
11
5
2
14
14
7
9
17
3
17
1
18
14
0
6
3
12
0
9
17
15
11
7
7
3
6
15
17
4
13
6
1
4
1
13
2
11
0
9
16
5
17
15
18
7
2
1
8
19
0
15
17
11
8
6
14
0
11
12
8
3
11
5
0
2
19
5
7
16
5
3
15
18
3
4
14
12
19
7
14
16
6
7
10
19
2
17
12
7
14
2
15
3
12
13
4
2
12
6
16
7
15
2
10
15
18
8
11
13
2
13
8
1
4
14
0
17
10
8
2
5
10
1
18
16
18
9
4
5
6
10
2
12
4
15
4
1
17
4
14
4
19
7
15
18
9
9
11
11
17
12
10
0
7
3
14
6
18
8
15
11
4
7
14
5
0
13
7
9
15
2
4
17
14
7
16
18
8
13
4
14
2
11
15
11
10
8
10
8
10
12
2
19
8
2
3
3
5
13
1
18
11
14
8
0
16
2
16
6
16
14
17
7
11
4
6
7
5
14
4
16
1
6
17
11
7
11
14
5
10
6
10
15
8
3
0
15
19
13
3
11
11
2
1
10
3
14
11
8
5
16
18
15
5
19
14
15
11
17
12
12
15
10
10
14
4
2
18
14
10
1
18
17
18
12
18
8
14
5
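
Printing every label gives little insight into how the classes are distributed. A more compact view is a per-class count, sketched below with `numpy.bincount`. `class_counts` is a hypothetical helper, and `sample_targets` is just the first ten labels from the output above, standing in for `newsgroups_train.target`. The value `num_classes=20` is an assumption based on the labels 0 to 19 seen in that output.

```python
import numpy as np


def class_counts(targets, num_classes):
    """Count how many examples carry each integer label in [0, num_classes)."""
    return np.bincount(np.asarray(targets), minlength=num_classes)


# First ten labels from the output above, used as a small stand-in.
sample_targets = [7, 4, 4, 1, 14, 16, 13, 3, 2, 4]
counts = class_counts(sample_targets, num_classes=20)

# Print only the classes that actually occur in the sample.
for label, count in enumerate(counts):
    if count:
        print(f"class {label}: {count}")
```

On the real data the call would presumably be `class_counts(newsgroups_train.target, num_classes=20)`.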

In [10]:
# Try to read the first raw document from the path recorded by the loader.
with open(newsgroups_train.filenames[0], "r") as f:
    text = f.read()

FileNotFoundError: [Errno 2] No such file or directory: 'C:\\Users\\rohit\\scikit_learn_data\\20news_home\\20news-bydate-train\\rec.autos\\102994'
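
The error is consistent with the loader keeping only a cached copy of the dataset: the paths in `filenames` are recorded, but the extracted files themselves are likely no longer on disk. The decoded text of each post is already available in `newsgroups_train.data`, so reading from disk is not needed. The sketch below shows that fallback. `read_document` is a hypothetical helper, the `latin-1` encoding is an assumption, and `fake_bunch` is a stand-in object so the example runs without downloading anything.

```python
from pathlib import Path
from types import SimpleNamespace


def read_document(bunch, index=0):
    """Return one document's text, preferring the file on disk if it still exists."""
    path = Path(bunch.filenames[index])
    if path.is_file():
        # Assumption: the raw posts are latin-1 encoded.
        return path.read_text(encoding="latin-1")
    # Fall back to the text already held in memory by the loader.
    return bunch.data[index]


# Stand-in for the object returned by fetch_20newsgroups (illustrative values only).
fake_bunch = SimpleNamespace(
    filenames=["missing_dir/rec.autos/102994"],
    data=["From: someone\nSubject: example post"],
)

text = read_document(fake_bunch)
print(text.splitlines()[0])
```

With the real dataset, `newsgroups_train.data[0]` gives the same text directly.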

In [2]:
with open(r"C:\Users\rohit\Downloads\soumya_gpt3_prepared.jsonl",  "r", encoding="utf-8") as f:
    all_data = f.read().splitlines()

In [3]:
import random

random.shuffle(all_data)

In [5]:
len(all_data)

301004

In [6]:
# 60% train, 20% validation, 20% test
train = all_data[:int(len(all_data)*0.6)]
validation = all_data[int(len(all_data)*0.6):int(len(all_data)*0.8)]
test = all_data[int(len(all_data)*0.8):]

In [9]:
with open(r"C:\Users\rohit\OneDrive\Desktop\GPT3\raw_data\train_question2query.jsonl", "w", encoding="utf-8") as file:
    for i in train:
        file.write(i+"\n")

with open(r"C:\Users\rohit\OneDrive\Desktop\GPT3\raw_data\val_question2query.jsonl", "w", encoding="utf-8") as file:
    for i in validation:
        file.write(i+"\n")

with open(r"C:\Users\rohit\OneDrive\Desktop\GPT3\raw_data\test_question2query.jsonl", "w", encoding="utf-8") as file:
    for i in test:
        file.write(i+"\n")
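
The shuffle above is unseeded, so the three files differ on every run. A reproducible variant of the 60/20/20 split could look like the sketch below. `split_lines` and its parameters are hypothetical names, and the validation boundary is computed as `train_end + int(n * val_frac)`, which may differ by one item from the `int(n * 0.8)` boundary used above for some list lengths.

```python
import random


def split_lines(lines, train_frac=0.6, val_frac=0.2, seed=42):
    """Shuffle a copy of `lines` with a fixed seed and cut it into three parts."""
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)  # local RNG, global state untouched
    train_end = int(len(shuffled) * train_frac)
    val_end = train_end + int(len(shuffled) * val_frac)
    return shuffled[:train_end], shuffled[train_end:val_end], shuffled[val_end:]


# Ten dummy JSONL records stand in for all_data.
records = [f'{{"id": {i}}}' for i in range(10)]
train_part, val_part, test_part = split_lines(records)
print(len(train_part), len(val_part), len(test_part))
```

Passing the same `seed` reproduces the same partition, which makes later evaluation runs comparable.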

In [44]:
!pip install --upgrade openai

Collecting openai
  Downloading openai-0.13.0.tar.gz (37 kB)
Building wheels for collected packages: openai
  Building wheel for openai (setup.py): started
  Building wheel for openai (setup.py): finished with status 'done'
  Created wheel for openai: filename=openai-0.13.0-py3-none-any.whl size=46095 sha256=fa25c7d210a84eeb3a3d4c2ca40b9181c084fa0c5a8e1c6a397eb284765b6d9b
  Stored in directory: c:\users\rohit\appdata\local\pip\cache\wheels\43\ff\28\fbed62f325b5caadef004370946438d698c4dd245ba73753f3
Successfully built openai
Installing collected packages: openai
  Attempting uninstall: openai
    Found existing installation: openai 0.10.5
    Uninstalling openai-0.10.5:
      Successfully uninstalled openai-0.10.5
Successfully installed openai-0.13.0


In [45]:
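# `export` is a Unix shell builtin and is not available in the Windows command shell.
# Each `!` command also runs in its own subshell, so set the variable for the
# notebook session with the %env magic instead. Never commit a real key.
%env OPENAI_API_KEY=<your-api-key>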


In [1]:
!openai tools fine_tunes.prepare_data -f C:\Users\rohit\OneDrive\Desktop\GPT3\raw_data\test_question2query.tsv

'openai' is not recognized as an internal or external command,
operable program or batch file.
