### Flan-T5 inference
Inference notebook for running the fine-tuned model on the test and holdout datasets.

#### Step 1: Install Required Dependencies

In [1]:
!pip install evaluate
!pip install sacrebleu
!pip install bert-score


Import the libraries used to load the dataset, the Large Language Model (LLM), the tokenizer, and the evaluation metrics.

In [2]:
import os
import torch
from datasets import load_dataset
from transformers import T5Tokenizer, T5ForConditionalGeneration
import evaluate
import json
import bert_score

#### Step 2: Preprocess dataset
The `restructure_json` function takes a list of JSON file names (without extension). For each name it reads `<name>.json` from the current working directory, keeps only the `disfluent` and `original` fields of every record, and writes the result as a flat list to `<name>_output.json`.

In [3]:
def restructure_json(file_names):
    """
    Restructures the JSON files specified by the given list of file names.

    Parameters:
    file_names (list): A list of file names (without extension) to be processed.

    Returns:
    None
    """
    for file_name in file_names:
        input_path = os.path.join(os.getcwd(), f"{file_name}.json")
        output_path = os.path.join(os.getcwd(), f"{file_name}_output.json")

        #print(input_path)
        #print(output_path)

        with open(input_path, 'r') as f:
            raw_data = json.load(f)
        #print(raw_data)

        dataset = [{'disfluent': item['disfluent'], 'original': item['original']} for item in raw_data.values()]

        with open(output_path, 'w') as f:
            json.dump(dataset, f, indent=4)



In [4]:
# List the train, dev and test file names without the .json extension. To run on a holdout dataset, rename the holdout file to test.json.
# Each file must exist in the current working directory with a .json extension.
file_names = ["train", "dev", "test"]
restructure_json(file_names)
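
To make the transformation concrete, here is a minimal in-memory sketch of what the restructuring step does to one file. The input layout (a JSON object keyed by question ID) is inferred from the `raw_data.values()` call above; the IDs `q1`/`q2` and the extra `note` field are hypothetical and only serve to show that everything except `disfluent` and `original` is dropped.

```python
import json

# Hypothetical raw input: a JSON object keyed by question ID. The IDs and the
# extra "note" field are made up for illustration.
raw_data = {
    "q1": {
        "original": "In what country is Normandy located?",
        "disfluent": "In what country is Norse found no wait Normandy not Norse?",
        "note": "dropped by the restructuring step",
    },
    "q2": {
        "original": "Who was the Norse leader?",
        "disfluent": "When I mean Who was the Norse leader?",
        "note": "dropped by the restructuring step",
    },
}

# Same comprehension as in restructure_json: keep only the two fields and
# discard the outer IDs, producing a flat list of records.
dataset = [
    {"disfluent": item["disfluent"], "original": item["original"]}
    for item in raw_data.values()
]

print(json.dumps(dataset, indent=4))
```

The flat list of records is the shape that `load_dataset("json", ...)` reads directly in the next cell.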

Load the preprocessed test dataset. The dataset contains 3643 samples consisting of disfluent and original (fluent) questions.

In [5]:
data_files=os.path.join(os.getcwd(), "test_output.json")
test_dataset = load_dataset("json", data_files=data_files)
test_dataset["test"] = test_dataset.pop("train")
test_dataset['test']


Dataset({
    features: ['disfluent', 'original'],
    num_rows: 3643
})

Log in to Hugging Face with an access token to download the model and tokenizer. Avoid hard-coding the token in the notebook; here it is read from an environment variable, assumed to be named `HF_TOKEN`.

In [6]:
from huggingface_hub import login
# Assumes the access token has been exported as the HF_TOKEN environment variable
login(token=os.environ["HF_TOKEN"])

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /root/.cache/huggingface/token
Login successful


Define the model and tokenizer names. Make sure both point to the correct Hugging Face repository.

In [7]:
device = "cuda" if torch.cuda.is_available() else "cpu"
print(device)
model_name = "vinamra98/optimized_flan_T5_disfl-qa_model"
tokenizer_name = "vinamra98/optimized_flan_T5_disfl-qa_model"

cpu


In [8]:
# Load the tokenizer and model from your Hugging Face account
tokenizer = T5Tokenizer.from_pretrained(tokenizer_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)
model.eval()


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.



T5ForConditionalGeneration(
  (shared): Embedding(32128, 768)
  (encoder): T5Stack(
    (embed_tokens): Embedding(32128, 768)
    (block): ModuleList(
      (0): T5Block(
        (layer): ModuleList(
          (0): T5LayerSelfAttention(
            (SelfAttention): T5Attention(
              (q): Linear(in_features=768, out_features=768, bias=False)
              (k): Linear(in_features=768, out_features=768, bias=False)
              (v): Linear(in_features=768, out_features=768, bias=False)
              (o): Linear(in_features=768, out_features=768, bias=False)
              (relative_attention_bias): Embedding(32, 12)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): T5LayerFF(
            (DenseReluDense): T5DenseGatedActDense(
              (wi_0): Linear(in_features=768, out_features=2048, bias=False)
              (wi_1): Linear(in_features=768, out_features=2048, bias=False)
              (wo):

Define a function that takes a disfluent question and returns a fluent version of it. The tokenizer encodes the input (truncated or padded to 512 tokens), and the fine-tuned model generates the fluent question with beam search (5 beams), capped at 512 output tokens.

In [9]:
def generate_fluent_question(disfluent_question):
    """
    Generates a fluent question from a disfluent question.

    Parameters:
    disfluent_question (str): The disfluent question to be converted.

    Returns:
    str: The generated fluent question.

    """
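    inputs = tokenizer(disfluent_question, return_tensors="pt", truncation=True, padding="max_length", max_length=512)
    # Keep the inputs on the same device as the model
    inputs = inputs.to(model.device)

    with torch.no_grad():
        # Pass the attention mask so that padding tokens are ignored by the encoder
        outputs = model.generate(inputs['input_ids'], attention_mask=inputs['attention_mask'], max_length=512, num_beams=5, early_stopping=True)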

    return tokenizer.decode(outputs[0], skip_special_tokens=True)

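Run the model over the test dataset to convert each disfluent question into a fluent one. The generated questions are collected in `predictions`, the reference questions in `references`, and all three strings for each example in `results`. To process only a subset of the examples, uncomment the counter lines in the cell.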

In [10]:

predictions = []
references = []
results = []

# To limit the number of examples to generate predictions for, uncomment all the commented lines below and set max_examples
# counter = 0
# max_examples = 10

for example in test_dataset['test']:
    disfluent_question = example['disfluent']
    original_question = example['original']

    generated_question = generate_fluent_question(disfluent_question)
    #print(original_question)
    #print(generated_question)

    predictions.append(generated_question.strip())
    references.append([original_question.strip()])

    results.append({
        'disfluent': disfluent_question,
        'original': original_question,
        'generated': generated_question
    })

    # counter += 1

    # if counter >= max_examples:
    #     break
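
As an alternative to the manual counter, the loop can be capped with `itertools.islice`, which works on any iterable. The sketch below uses a stand-in list of records in place of `test_dataset['test']` and skips the model call, so the names `examples` and `processed` are illustrative only.

```python
from itertools import islice

# Stand-in for test_dataset['test']: any iterable of dict records works.
examples = [
    {"disfluent": f"disfluent question {i}", "original": f"question {i}"}
    for i in range(25)
]

max_examples = 10  # set to None to process every example

processed = []
for example in islice(examples, max_examples):
    # In the notebook, this is where generate_fluent_question would be called.
    processed.append(example["original"])

print(len(processed))
```

Setting `max_examples = None` makes `islice` yield the whole iterable, so the same loop serves both the quick check and the full run.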

#### Step 3: Evaluate the results using BLEU and BERTScore

In [11]:
# Load BLEU metric
bleu = evaluate.load("sacrebleu")


In [12]:
# Calculate BLEU score
bleu_score = bleu.compute(predictions=predictions, references=references)
print(f"BLEU Score: {bleu_score['score']:.2f}")

# Calculate BERTScore
P, R, F1 = bert_score.score(predictions, references, lang="en", verbose=True)
print(f"F1 Score: {F1.mean().item():.4f}")

BLEU Score: 90.26



Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


calculating scores...
computing bert embedding.


  0%|          | 0/1 [00:00<?, ?it/s]

computing greedy matching.


  0%|          | 0/1 [00:00<?, ?it/s]

done in 2.24 seconds, 4.47 sentences/sec
F1 Score: 0.9923
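
Both metric calls above receive one prediction string per example and one list of reference strings per example, which is how the inference loop builds `predictions` and `references`. A minimal sketch of a shape check that could run before scoring is shown below; `check_metric_inputs` is a hypothetical helper and not part of `evaluate` or `bert_score`.

```python
def check_metric_inputs(predictions, references):
    """Validate the prediction/reference shapes and return the example count.

    Hypothetical helper: expects predictions as a list of strings and
    references as a list of non-empty lists of strings.
    """
    if len(predictions) != len(references):
        raise ValueError(
            f"{len(predictions)} predictions but {len(references)} reference lists"
        )
    for i, (pred, refs) in enumerate(zip(predictions, references)):
        if not isinstance(pred, str):
            raise ValueError(f"prediction {i} is not a string")
        if not isinstance(refs, list) or not refs:
            raise ValueError(f"references for example {i} must be a non-empty list")
        if not all(isinstance(ref, str) for ref in refs):
            raise ValueError(f"references for example {i} must all be strings")
    return len(predictions)


predictions = ["Who was the Norse leader?"]
references = [["Who was the Norse leader?"]]
print(check_metric_inputs(predictions, references))
```

The returned count is also a quick way to confirm how many examples the reported scores actually cover.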


#### Step 4: Show the first 10 inference outputs

In [13]:
for result in results[:10]:  # Show the first 10 examples
    print(f"Disfluent: {result['disfluent']}")
    print(f"Original: {result['original']}")
    print(f"Generated: {result['generated']}")
    print("="*50)

Disfluent: In what country is Norse found no wait Normandy not Norse?
Original: In what country is Normandy located?
Generated: In what country is Normandy found?
Disfluent: From which countries no tell me when were the Normans in Normandy?
Original: When were the Normans in Normandy?
Generated: When were the Normans in Normandy?
Disfluent: From which Norse leader I mean countries did the Norse originate?
Original: From which countries did the Norse originate?
Generated: From which countries did the Norse originate?
Disfluent: When I mean Who was the Norse leader?
Original: Who was the Norse leader?
Generated: Who was the Norse leader?
Disfluent: When no what century did the Normans first gain their separate identity?
Original: What century did the Normans first gain their separate identity?
Generated: What century did the Normans first gain their separate identity?
Disfluent: Who gave their name to Frankish in the 1000's and 1100's no Normandy?
Original: Who gave their name to Normand

In [14]:
# with open("inference_results_with_bleu.json", "w") as f:
#     json.dump(results, f, indent=4)