# Assignment 1

# Question 1

## Problem: Fine-tune a lightweight transformer model (like DistilBERT or ALBERT) on a small subset of a question-answering dataset (10000 examples).
- ## After fine-tuning, visualize the attention patterns and explore the impact of pruning specific attention heads on the model's performance. Prune a few attention heads and measure the impact on model accuracy or loss.
- ## Implement layer freezing—freeze different layers during fine-tuning (e.g., freeze the bottom N layers and only train the top layers). Compare how freezing different layers affects performance.

## Analyze the trade-offs between computational efficiency and performance. Fine-tune the model with Adapter Modules added for efficient, task-specific fine-tuning. Adapters are lightweight neural modules that can be inserted into transformer layers and are trained on a specific task without modifying the original model weights. Use BertViz, Captum, or other libraries to visualize attention across heads and layers for different QA examples.

## Dataset: Use a small subset of the SQuAD dataset or use another small dataset, such as BoolQ. Aim for a dataset size of 10000 examples for training and 200 examples for evaluation
## Dataset Link:
- https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json
- https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json

## Tasks:
1. Fine-tune a lightweight transformer model on a small subset of a question-answering dataset (10000 examples).
2. Visualize the attention patterns and explore the impact of pruning specific attention heads on the model's performance. Prune a few attention heads and measure the impact on model accuracy or loss. Use BertViz, Captum, or other libraries to visualize attention across heads and layers for different QA examples.
3. Implement layer freezing—freeze different layers during fine-tuning (e.g., freeze the bottom N layers and only train the top layers). Compare how freezing different layers affects performance.
4. Fine-tune the model with Adapter Modules added for efficient, task-specific fine-tuning. Adapters are lightweight neural modules that can be inserted into transformer layers and are trained on a specific task without modifying the original model weights.

In [None]:
!pip install transformers tokenizers datasets tqdm evaluate accelerate bertviz peft scikit-learn nltk

## Loading the Dataset

In [None]:
import pandas as pd
import numpy as np
import evaluate
import collections
from datasets import load_dataset, Dataset

import torch
from torch.utils.data import DataLoader
from torch.optim import AdamW
from transformers import (
    AutoTokenizer,
    DefaultDataCollator,
    AutoModelForQuestionAnswering,
    TrainingArguments,
    Trainer,
    default_data_collator,
    get_scheduler,
    DataCollatorWithPadding
)
from accelerate import Accelerator
from tqdm.auto import tqdm
from sklearn.metrics import f1_score, precision_score, recall_score, mean_squared_error

## Creating the train and test split

In [None]:
ini_ds = load_dataset("rajpurkar/squad")

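train_df= pd.DataFrame(ini_ds['train']).sample(10000, random_state=42)  # fixed seed so the subset is reproducible
train_df.drop('id', inplace= True, axis=1)
train_df.reset_index(drop=True, inplace= True)  # positional index 0..N-1, old index discarded

test_df= pd.DataFrame(ini_ds['validation']).sample(500, random_state=42)
test_df.drop('id', inplace= True, axis=1)
test_df.reset_index(drop=True, inplace= True)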

In [None]:
ini_tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
ini_model = AutoModelForQuestionAnswering.from_pretrained("distilbert-base-uncased")

## Methods for Preprocessing and evaluation and DataLoaders

In [None]:
def preprocess_data( df, cur_tokenizer ):

  ques= [ question.strip() for question in df['question'] ]
  cxt= [ context.strip() for context in df['context'] ]
  inputs= cur_tokenizer(  ques, cxt, max_length= 364, truncation= "only_second", return_offsets_mapping= True, padding= "max_length" )

  offset_mapping= inputs.pop('offset_mapping')
  st_pos = []; en_pos = []
  ans = df['answers']

  for i, offset in enumerate(offset_mapping):

      answer = ans[i]
      st_char = answer["answer_start"][0];  en_char = answer["answer_start"][0] + len(answer["text"][0])
      seq_ids = inputs.sequence_ids(i)

      idx = 0
      while seq_ids[idx] != 1:
          idx += 1

      cxt_start = idx

      while seq_ids[idx] == 1:
          idx += 1

      cxt_end = idx - 1
      # If the answer is not fully inside the (possibly truncated) context, label it (0, 0)
      if offset[cxt_start][0] > st_char or offset[cxt_end][1] < en_char:
          st_pos.append(0)
          en_pos.append(0)
      else:
          # Otherwise it's the start and end token positions
          idx = cxt_start
          while idx <= cxt_end and offset[idx][0] <= st_char:
              idx += 1
          st_pos.append(idx - 1)

          idx = cxt_end
          while idx >= cxt_start and offset[idx][1] >= en_char:
              idx -= 1
          en_pos.append(idx + 1)

  df["start_positions"] = st_pos
  df["end_positions"] = en_pos

  data = {'input_ids': inputs['input_ids'],
        'attention_mask': inputs['attention_mask'],
        'start_positions':st_pos,
        'end_positions': en_pos,
  }

  dff = pd.DataFrame(data)
  dff.to_csv('train_encoded.csv',index=False)
  return Dataset.from_pandas(dff)

def evaluate_model(cur_model, eval_dataloader):
    cur_model.eval()
    exact_matches = 0
    f1_scores = []
    precisions = []
    recalls = []
    mse_list = []
    total = 0

    with torch.no_grad():
        for batch in eval_dataloader:
            input_ids = batch['input_ids']
            attention_mask = batch['attention_mask']
            true_start_positions = batch['start_positions']
            true_end_positions = batch['end_positions']

            outputs = cur_model(input_ids=input_ids, attention_mask=attention_mask)
            pred_start_positions = torch.argmax(outputs.start_logits, dim=-1)
            pred_end_positions = torch.argmax(outputs.end_logits, dim=-1)

            for i in range(len(true_start_positions)):
                true_start = true_start_positions[i].item()
                true_end = true_end_positions[i].item()
                pred_start = pred_start_positions[i].item()
                pred_end = pred_end_positions[i].item()

                if pred_start == true_start and pred_end == true_end:
                    exact_matches += 1

                common_tokens = set(range(pred_start, pred_end + 1)) & set(range(true_start, true_end + 1))
                if common_tokens:
                    precision = len(common_tokens) / (pred_end - pred_start + 1)
                    recall = len(common_tokens) / (true_end - true_start + 1)
                    f1 = 2 * (precision * recall) / (precision + recall)
                else:
                    f1 = 0
                    precision = 0
                    recall = 0

                f1_scores.append(f1)
                precisions.append(precision)
                recalls.append(recall)

                mse_list.append(mean_squared_error([true_start, true_end], [pred_start, pred_end]))

            total += len(true_start_positions)

    exact_match = exact_matches / total * 100
    f1 = np.mean(f1_scores) * 100
    precision = np.mean(precisions) * 100
    recall = np.mean(recalls) * 100

    return {
        "exact_match": exact_match,
        "f1": f1,
        "precision": precision,
        "recall": recall,
    }
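
To make the span-level metrics in `evaluate_model` concrete, the following minimal sketch recomputes precision, recall, and F1 for a single predicted span. The helper name `span_f1` and the example token positions are illustrative and are not part of the notebook.

```python
def span_f1(pred_start, pred_end, true_start, true_end):
    """Token-overlap precision, recall and F1 between two inclusive spans."""
    pred_tokens = set(range(pred_start, pred_end + 1))
    true_tokens = set(range(true_start, true_end + 1))
    common = pred_tokens & true_tokens
    if not common:
        return 0.0, 0.0, 0.0
    precision = len(common) / len(pred_tokens)
    recall = len(common) / len(true_tokens)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Predicted tokens 10..14 against gold tokens 12..15: three tokens overlap.
precision, recall, f1 = span_f1(10, 14, 12, 15)
print(precision, recall, round(f1, 3))
```

Because the comparison is made on token positions rather than on answer text, these scores are not the same as the official SQuAD exact-match and F1 metrics.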

In the **SQuAD** (Stanford Question Answering Dataset), the **`answer_start`** list is part of the **`answers`** feature in the dataset, specifically used to indicate the starting character position of the answer within the context.

- **`context`**: A passage or paragraph from which the answer is derived.
- **`question`**: A question that pertains to the context.
- **`answers`**: Contains information about the answer to the question. It is a dictionary with two parallel lists:
  - **`text`**: The answer strings.
  - **`answer_start`**: The integer character index in the context at which each answer in `text` begins.

In the case of multiple possible answers, there can be more than one starting position (thus, the **list**). For example, this could happen when annotators provide different, valid answer spans for the same question.

### Example:

```json
{
  "context": "The Statue of Liberty is located in New York.",
  "question": "Where is the Statue of Liberty located?",
  "answers": {
    "text": ["New York"],
    "answer_start": [36]
  }
}
```

Here, the answer **"New York"** begins at character index **36** (0-based) in the context string.
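
A quick way to sanity-check an `answer_start` value is to slice the context with it. The sketch below does this for the example context; it uses only built-in string operations and is not part of the original pipeline.

```python
context = "The Statue of Liberty is located in New York."
answer_text = "New York"

# str.find returns the 0-based index of the first occurrence (or -1 if absent).
answer_start = context.find(answer_text)
answer_end = answer_start + len(answer_text)  # exclusive end, as used in preprocess_data

# Slicing the context with these offsets must reproduce the answer text.
recovered = context[answer_start:answer_end]
print(answer_start, answer_end, repr(recovered))
```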

In [None]:
train_df= preprocess_data(train_df, ini_tokenizer)
test_df= preprocess_data(test_df, ini_tokenizer)

train_dataloader = DataLoader(
    train_df,
    shuffle=True,
    collate_fn=default_data_collator,
    batch_size=16
)
eval_dataloader = DataLoader(
    test_df,
    collate_fn=default_data_collator,
    batch_size=16
)

## Initial Training

### Full Training and Saving the model

In [None]:
DEVICE = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

In [None]:
from tqdm.auto import tqdm

ini_optimizer = AdamW(ini_model.parameters(),lr = 3e-5)
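# fp16 mixed precision needs a GPU; fall back to full precision on CPU
ini_accelerator = Accelerator(mixed_precision="fp16" if torch.cuda.is_available() else "no")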
ini_model, ini_optimizer, train_dataloader, eval_dataloader = ini_accelerator.prepare(
    ini_model, ini_optimizer, train_dataloader, eval_dataloader
)

num_train_epochs=3
num_update_steps_per_epoch = len(train_dataloader)
num_training_steps = num_train_epochs * num_update_steps_per_epoch

ini_lr_scheduler = get_scheduler(
    'linear',
    optimizer= ini_optimizer,
    num_warmup_steps = 0,
    num_training_steps = num_training_steps,
)

ini_model.to(DEVICE)

progress_bar = tqdm(range(num_training_steps))
ini_model.train()
for epoch in range(num_train_epochs):
    for batch in train_dataloader:
        batch = {k: v.to(DEVICE) for k, v in batch.items()}
        outputs = ini_model(**batch)
        loss = outputs.loss
        ini_accelerator.backward(loss)

        ini_optimizer.step()
        ini_lr_scheduler.step()
        ini_optimizer.zero_grad()
        progress_bar.update(1)

In [None]:
import os

directory_name = '/content/saved_mdl'

os.makedirs(directory_name, exist_ok=True)
def save(cur_model, cur_optimizer, output_model):
    torch.save({
        'model_state_dict': cur_model.state_dict(),
        'optimizer_state_dict': cur_optimizer.state_dict()
    }, output_model)

save(ini_model, ini_optimizer, '/content/saved_mdl/ini-fine.pth')

### Loading the Saved Model and optimizer

In [None]:
checkpoint = torch.load('/content/saved_mdl/ini-fine.pth', map_location='cpu')
ini_model.load_state_dict(checkpoint['model_state_dict'])
ini_optimizer.load_state_dict(checkpoint['optimizer_state_dict'])

### Obtaining results on the Evaluation dataset

In [None]:
metrics = evaluate_model( ini_model, eval_dataloader)
print("Exact Match:", metrics["exact_match"])
print("F1 Score:", metrics["f1"])
print("Precision:", metrics["precision"])
print("Recall:", metrics["recall"])

### Visualizing Attention head using BertViz

In [None]:
from bertviz import head_view, model_view

def Visualize_attn_heads(cur_model, cur_tokenizer):
  # Define context and question
  context = "There are 45 iterations where one can succeed."
  question = "How many iterations can one succeed in?"

  # Tokenize question and context
  inputs = cur_tokenizer(question, context, return_tensors="pt")
  inputs = {k: v.to(DEVICE) for k, v in inputs.items()}  # Move inputs to device

  # Enable output of attention scores
  cur_model.config.output_attentions = True
  cur_model.eval()

  # Run the model to get outputs and attention scores
  with torch.no_grad():
      outputs = cur_model(**inputs)
      attentions = outputs.attentions  # Extract attention scores

  input_tokens = cur_tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
  # layer_index = 1  # You can change this index to visualize different layers
  # if layer_index < len(attentions):
      # head_view(attentions, input_tokens, layer= layer_index )

  model_view(attentions, input_tokens)

Visualize_attn_heads(ini_model, ini_tokenizer)

### Pruning some heads

In [None]:
heads_to_prune = {
    0: [0, 1],
    1: [2, 3],
    2: [4, 5],
}

def evaluate_pruned_model( cur_model, dataloader, pruneHeads ):
  cur_model.prune_heads(pruneHeads)

  metrics = evaluate_model( cur_model, dataloader)
  print("Exact Match:", metrics["exact_match"])
  print("F1 Score:", metrics["f1"])
  print("Precision:", metrics["precision"])
  print("Recall:", metrics["recall"])

evaluate_pruned_model(ini_model, eval_dataloader, heads_to_prune)
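
As a rough sense of scale for the pruning above, the sketch below counts how many heads the `heads_to_prune` dictionary removes. The figures of 6 layers, 12 heads per layer, and a hidden size of 768 are the published `distilbert-base-uncased` configuration values, hard-coded here as assumptions rather than read from the model.

```python
# Assumed configuration of distilbert-base-uncased (hard-coded, not read from the model).
N_LAYERS, N_HEADS, HIDDEN_SIZE = 6, 12, 768
HEAD_DIM = HIDDEN_SIZE // N_HEADS  # width of one attention head

heads_to_prune = {0: [0, 1], 1: [2, 3], 2: [4, 5]}

total_heads = N_LAYERS * N_HEADS
pruned_heads = sum(len(heads) for heads in heads_to_prune.values())
remaining_heads = total_heads - pruned_heads

# Output width of the query/key/value projections left in each pruned layer.
remaining_width = {
    layer: (N_HEADS - len(heads)) * HEAD_DIM
    for layer, heads in heads_to_prune.items()
}

print(f"{pruned_heads} of {total_heads} heads pruned, {remaining_heads} remain")
print(remaining_width)
```

Pruning is applied to the model in place, so the model should be reloaded before trying a different set of heads.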

## Freezing Layers

In [None]:
fre_tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
fre_model = AutoModelForQuestionAnswering.from_pretrained("distilbert-base-uncased")

In [None]:
# Set the number of layers to freeze
n = 2

# Freeze the bottom n layers
for i in range(min(n, len(fre_model.distilbert.transformer.layer))):
    for param in fre_model.distilbert.transformer.layer[i].parameters():
        param.requires_grad = False

fre_optimizer = AdamW(filter(lambda p: p.requires_grad, fre_model.parameters()), lr=3e-5)
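# fp16 mixed precision needs a GPU; fall back to full precision on CPU
fre_accelerator = Accelerator(mixed_precision="fp16" if torch.cuda.is_available() else "no")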
fre_model, fre_optimizer, train_dataloader, eval_dataloader = fre_accelerator.prepare(
    fre_model, fre_optimizer, train_dataloader, eval_dataloader
)

num_train_epochs = 3
num_update_steps_per_epoch = len(train_dataloader)
num_training_steps = num_train_epochs * num_update_steps_per_epoch

fre_lr_scheduler = get_scheduler(
    'linear',
    optimizer= fre_optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)

fre_model.to(DEVICE)
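
One way to quantify the efficiency side of layer freezing is to count how many parameters remain trainable. The helper below is a minimal sketch: `count_parameters` and the `FakeParam` stand-in are hypothetical names introduced so the example runs without loading a model. It relies only on the `numel()` method and the `requires_grad` attribute that PyTorch parameters expose.

```python
from dataclasses import dataclass

def count_parameters(parameters):
    """Return (trainable, total) element counts for an iterable of parameters."""
    trainable = total = 0
    for p in parameters:
        n = p.numel()
        total += n
        if p.requires_grad:
            trainable += n
    return trainable, total

@dataclass
class FakeParam:
    """Stand-in exposing the two members of torch.nn.Parameter used above."""
    size: int
    requires_grad: bool = True

    def numel(self):
        return self.size

# Toy "model": two frozen tensors followed by two trainable ones.
params = [FakeParam(100, False), FakeParam(50, False), FakeParam(100), FakeParam(50)]
trainable, total = count_parameters(params)
print(f"trainable: {trainable} / {total} ({100 * trainable / total:.1f}%)")
```

With the notebook's model this would be called as `count_parameters(fre_model.parameters())`, repeated for each value of `n` being compared.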

In [None]:
progress_bar = tqdm(range(num_training_steps))
fre_model.train()
for epoch in range(num_train_epochs):
    for batch in train_dataloader:
        batch = {k: v.to(DEVICE) for k, v in batch.items()}
        outputs = fre_model(**batch)
        loss = outputs.loss
        fre_accelerator.backward(loss)

        fre_optimizer.step()
        fre_lr_scheduler.step()
        fre_optimizer.zero_grad()
        progress_bar.update(1)


### Saving the Freeze-layers model

In [None]:
save(fre_model, fre_optimizer, '/content/saved_mdl/fre-fine.pth')

### Obtaining results on the Evaluation dataset

In [None]:
metrics = evaluate_model( fre_model, eval_dataloader)
print("Exact Match:", metrics["exact_match"])
print("F1 Score:", metrics["f1"])
print("Precision:", metrics["precision"])
print("Recall:", metrics["recall"])

### Visualizing Attention head using BertViz

In [None]:
Visualize_attn_heads(fre_model, fre_tokenizer)

### Pruning some heads

In [None]:
heads_to_prune = {
    0: [0, 1],
    1: [2, 3],
    2: [4, 5],
}

evaluate_pruned_model(fre_model, eval_dataloader, heads_to_prune)

## Using Adapter Modules in fine-tuning
---

### Adapters: implemented with the `peft` library
- LoRA (Low-Rank Adaptation) fine-tunes large language models efficiently by adding trainable low-rank matrices to selected layers. Rather than updating all of a model’s weights, it trains only these small matrices, which reduces the number of trainable parameters, the optimizer memory, and the checkpoint size. The code below uses AdaLoRA, a LoRA variant that adapts the rank budget across layers during training.
- Key points of LoRA
    - The original model weights stay frozen; only the added low-rank matrices are updated.
    - `Low-Rank Decomposition`: The weight update is written as the product of two thin matrices of rank `r`, where `r` is much smaller than the layer's dimensions. Approximating the update in this low-dimensional space can capture the needed adaptation while largely preserving model performance.
    - `Task Adaptation`: LoRA works well for tasks like domain adaptation, question answering, and sentiment analysis, allowing models to specialize for specific tasks without full re-training.
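
To illustrate the parameter savings, the arithmetic below compares a full update of one 768 x 768 projection with a rank-12 low-rank update. The layer size is the hidden size of `distilbert-base-uncased` and the rank matches the `init_r=12` used later; both are hard-coded assumptions, and the count ignores biases.

```python
# Assumed shapes: one attention projection W of size d_out x d_in, adapted with rank r.
d_out, d_in, r = 768, 768, 12

# Full fine-tuning updates every entry of W.
full_update_params = d_out * d_in

# LoRA learns delta_W = B @ A with B of size d_out x r and A of size r x d_in.
lora_params = d_out * r + r * d_in

ratio = lora_params / full_update_params
print(f"full: {full_update_params}, LoRA: {lora_params}, ratio: {ratio:.2%}")
```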

In [None]:
adp_tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
adp_model = AutoModelForQuestionAnswering.from_pretrained("distilbert-base-uncased")

for param in adp_model.parameters():
    param.requires_grad = True

adp_model

In [None]:
from peft import LoraConfig, AdaLoraModel, AdaLoraConfig

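# Extractive QA corresponds to the QUESTION_ANS task type, not SEQ_2_SEQ_LM.
config = AdaLoraConfig( peft_type="ADALORA", task_type="QUESTION_ANS", init_r=12, lora_alpha=32, target_modules=["q_lin", "k_lin"], lora_dropout=0.01)
# peft injects the adapter layers into adp_model in place and marks only the adapter weights as trainable.
model = AdaLoraModel(adp_model, config, "default")

# The QA head is newly initialised, so keep it trainable alongside the adapter weights.
for param in adp_model.qa_outputs.parameters():
    param.requires_grad = True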

adp_optimizer = AdamW(filter(lambda p: p.requires_grad, adp_model.parameters()), lr=3e-5)
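# fp16 mixed precision needs a GPU; fall back to full precision on CPU
adp_accelerator = Accelerator(mixed_precision="fp16" if torch.cuda.is_available() else "no")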
adp_model, adp_optimizer, train_dataloader, eval_dataloader = adp_accelerator.prepare(
    adp_model, adp_optimizer, train_dataloader, eval_dataloader
)

num_train_epochs = 3
num_update_steps_per_epoch = len(train_dataloader)
num_training_steps = num_train_epochs * num_update_steps_per_epoch

adp_lr_scheduler = get_scheduler(
    'linear',
    optimizer= adp_optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)

In [None]:
adp_model.to(DEVICE)
progress_bar = tqdm(range(num_training_steps))
adp_model.train()

for epoch in range(num_train_epochs):
    for batch in train_dataloader:
        batch = {k: v.to(DEVICE) for k, v in batch.items()}
        outputs = adp_model(**batch)
        loss = outputs.loss
        adp_accelerator.backward(loss)

        adp_optimizer.step()
        adp_lr_scheduler.step()
        adp_optimizer.zero_grad()
        progress_bar.update(1)

### Saving the adapter model

In [None]:
save(adp_model, adp_optimizer, '/content/saved_mdl/adp-fine.pth')

### Obtaining results on the Evaluation dataset

In [None]:
metrics = evaluate_model( adp_model, eval_dataloader)
print("Exact Match:", metrics["exact_match"])
print("F1 Score:", metrics["f1"])
print("Precision:", metrics["precision"])
print("Recall:", metrics["recall"])

### Visualizing Attention head using BertViz

In [None]:
from bertviz import head_view, model_view

Visualize_attn_heads(adp_model, adp_tokenizer)

### Pruning some heads

In [None]:
heads_to_prune = {
    0: [0, 1],
    1: [2, 3],
    2: [4, 5],
}

evaluate_pruned_model(adp_model, eval_dataloader, heads_to_prune)

# Question 2

## Design a system to convert natural language questions into SPARQL queries for retrieving answers from Wikidata. Your system should:
- ## Train and test a Neural Machine Translation (NMT) model using the QALD-9 dataset, which contains question-query pairs.
- ## Perform entity and relation linking to map question entities and relations to Wikidata using tools like BLINK, TagMe or Falcon 2.0.
- ## Find the corresponding answers to the questions.

## Report two metrics:
- ## Accuracy of correct SPARQL query generation.
- ## Accuracy of correct answer retrieval from Wikidata based on the generated SPARQL queries.

## Dataset link:
- Train set: https://github.com/KGQA/QALD_9_plus/blob/main/data/qald_9_plus_train_wikidata.json
- Test set: https://github.com/KGQA/QALD_9_plus/blob/main/data/qald_9_plus_test_wikidata.json

## Consider only the English-language questions

## CLOCQ API

In [None]:
import json
import re
import requests
from nltk.metrics import edit_distance

In [None]:
import json
import re
import requests

class UsageClass:
    def __init__(self, host="http://localhost", port="8888"):
        self.host = host
        self.port = port
        self.session = requests.Session()
        self.entity_regex = re.compile("^Q[0-9]+$")
        self.property_regex = re.compile("^P[0-9]+$")

    def get_label(self, curit):
        response = self._req("/item_to_label", {"item": curit})
        return json.loads(response)

    def get_labels(self, curit):
        response = self._req("/item_to_labels", {"item": curit})
        return json.loads(response)

    def get_aliases(self, curit):
        response = self._req("/item_to_aliases", {"item": curit})
        return json.loads(response)

    def get_description(self, curit):
        response = self._req("/item_to_description", {"item": curit})
        return json.loads(response)

    def get_types(self, curit):
        response = self._req("/item_to_types", {"item": curit})
        return json.loads(response)

    def get_type(self, curit):
        response = self._req("/item_to_type", {"item": curit})
        return json.loads(response)

    def is_wikidata_entity(self, string):
        return self.entity_regex.match(string) is not None

    def is_wikidata_predicate(self, string):
        return self.property_regex.match(string) is not None

    def _req(self, endpoint, payload, linking_path=False):
        if linking_path:
            url = self.host.replace("api", "linking_api") + endpoint
        elif self.port == "443":
            url = self.host + endpoint
        else:
            url = f"{self.host}:{self.port}{endpoint}"
        return self.session.post(url, json=payload).content.decode("utf-8")

    def get_frequency(self, kb_item):
        response = self._req("/frequency", {"item": kb_item})
        return json.loads(response)

    def get_neighborhood(self, kb_item, p=1000, include_labels=True, include_type=False):
        params = {"item": kb_item, "p": p, "include_labels": include_labels, "include_type": include_type}
        response = self._req("/neighborhood", params)
        return json.loads(response)

    def get_neighborhood_two_hop(self, kb_item, p=1000, include_labels=True, include_type=False):
        params = {"item": kb_item, "p": p, "include_labels": include_labels, "include_type": include_type}
        response = self._req("/two_hop_neighborhood", params)
        return json.loads(response)

    def connect(self, kb_item1, kb_item2):
        params = {"item1": kb_item1, "item2": kb_item2}
        response = self._req("/connect", params)
        return json.loads(response)

    def connectivity_check(self, kb_item1, kb_item2):
        params = {"item1": kb_item1, "item2": kb_item2}
        connectivity_value = self._req("/connectivity_check", params)
        return float(connectivity_value)

    def relation_linking(self, question, parameters=dict(), top_ranked=True):
        params = {"question": question, "parameters": parameters, "top_ranked": top_ranked}
        response = self._req("/relation_linking", params, linking_path=True)
        return json.loads(response)

    def entity_linking(self, question, parameters=dict(), k="AUTO"):
        params = {"question": question, "parameters": parameters, "k": k}
        response = self._req("/entity_linking", params, linking_path=True)
        return json.loads(response)

    def get_search_space(self, question, parameters=dict(), include_labels=True, include_type=False):
        params = {"question": question, "parameters": parameters, "include_labels": include_labels, "include_type": include_type}
        response = self._req("/search_space", params)
        return json.loads(response)

# Helper Functions (Data Preprocessing)

In [None]:
import json
import re

def get_optimal_links(entity_links, relation_links):
    links_map = {}
    for entity in entity_links['linkings']:
        distance = edit_distance(entity['item']['label'].lower(), entity['mention'])
        if entity['mention'] in links_map:
            if distance < links_map[entity['mention']][0]:
                links_map[entity['mention']] = (distance, entity['item']['id'], "wd:")
        else:
            links_map[entity['mention']] = (distance, entity['item']['id'], "wd:")

    for relation in relation_links['linkings']:
        distance = edit_distance(relation['item']['label'].lower(), relation['mention'])
        if relation['mention'] in links_map:
            if distance < links_map[relation['mention']][0]:
                links_map[relation['mention']] = (distance, relation['item']['id'], "wdt:")
        else:
            links_map[relation['mention']] = (distance, relation['item']['id'], "wdt:")

    return links_map

def replace_label_with_id_in_question(text, wd_id):
    label_text = myCollector.get_label(wd_id)
    label_regex = re.compile(r'\b' + re.escape(label_text) + r'\b', re.IGNORECASE)

    prefix = "wd:" if wd_id[0] == "Q" else "wdt:"
    return label_regex.sub(prefix + wd_id, text)

def substitute_mention_with_id(text, mention_text, identifier, id_prefix):
    mention_regex = re.compile(r'\b' + re.escape(mention_text) + r'\b', re.IGNORECASE)
    return mention_regex.sub(id_prefix + identifier, text)

def clean_and_process_question(query_text):
    query_text = query_text.strip().rstrip(".?")
    query_text = query_text.lower()

    entity_links = myCollector.entity_linking(question=query_text)
    relation_links = myCollector.relation_linking(question=query_text)

    optimal_links = get_optimal_links(entity_links, relation_links)

    for mention, link_data in optimal_links.items():
        identifier = link_data[1]
        prefix = link_data[2]
        query_text = substitute_mention_with_id(query_text, mention, identifier, prefix)

    return query_text

def parse_wikidata_elements(sparql_query):
    entity_matches = re.findall(r'wd:Q[0-9]+', sparql_query)  # entity ids are "Q" followed by digits
    relation_matches = re.findall(r'wdt:P[0-9]+', sparql_query)
    wikidata_items = entity_matches + relation_matches

    sparql_body = 'SELECT ' + sparql_query.split('SELECT', 1)[-1].strip()

    item_labels = {item: myCollector.get_label(item.split(":")[-1]).lower() for item in wikidata_items}

    return [item_labels, sparql_body]

def reformat_item_with_sparql(data_item):
    """
    1) Replaces mentions of entities and relations within the question with their corresponding Wikidata IDs.
    2) Trims down the SPARQL query for further processing.
    """
    item_id = data_item['id']
    question_text = data_item['question']
    sparql_text = data_item['sparql']

    question_text = question_text.strip().rstrip(".?").lower()

    linked_labels, trimmed_sparql = parse_wikidata_elements(sparql_text)

    question_tokens = question_text.split()

    for i, token in enumerate(question_tokens):
        for identifier, label in linked_labels.items():
            if edit_distance(label, token) <= 3:  # Allowing an edit distance of 3 for slight differences
                question_tokens[i] = identifier

    refined_question = " ".join(question_tokens)

    for identifier in linked_labels:
        refined_question = replace_label_with_id_in_question(refined_question, identifier.split(":")[-1])

    processed_output = {
        'id': item_id,
        'question': refined_question,
        'sparql': trimmed_sparql
    }

    return processed_output

As shown above, a single mention can have several candidate linkings, so we need to pick the best one. `get_optimal_links` keeps the candidate whose label has the smallest edit distance to the mention.
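
The selection rule can be sketched as follows: for each mention, keep the candidate whose label is closest to the mention in edit distance. This is a simplified, self-contained stand-in for `get_optimal_links`; the `levenshtein` helper replaces `nltk.metrics.edit_distance`, and the candidate list is made up for illustration rather than returned by CLOCQ.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def pick_best_links(linkings):
    """Keep, per mention, the candidate id with the smallest label distance."""
    best = {}
    for link in linkings:
        mention = link["mention"]
        distance = levenshtein(link["label"].lower(), mention)
        if mention not in best or distance < best[mention][0]:
            best[mention] = (distance, link["id"])
    return best

# Illustrative candidates for one mention (not actual API output).
candidates = [
    {"mention": "new york city", "label": "New York", "id": "Q1384"},
    {"mention": "new york city", "label": "New York City", "id": "Q60"},
]
best_links = pick_best_links(candidates)
print(best_links)
```

Edit distance is a purely lexical signal, so a candidate with a near-identical label can still be the wrong entity; the ranking returned by the linker may be worth combining with it.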


In [None]:
myCollector = UsageClass(host="https://clocq.mpi-inf.mpg.de/api", port="443")

ques = "Who is the mayor of new york city?";  wiki_id="Q60"
upd_ques = replace_label_with_id_in_question(ques, wiki_id)
upd_ques

{
        "id": "2",
        "question": "Who developed Skype?",
        "sparql": "PREFIX wdt: <http://www.wikidata.org/prop/direct/> PREFIX wd: <http://www.wikidata.org/entity/> SELECT ?uri WHERE { wd:Q40984 wdt:P178 ?uri . }"
    },
    
{
        "id": "3",
        "question": "Which people were born in Heraklion?",
        "sparql": "PREFIX wdt: <http://www.wikidata.org/prop/direct/> PREFIX wd: <http://www.wikidata.org/entity/> SELECT ?uri WHERE { ?uri wdt:P19 wd:Q160544 . }"
    },


{
        "id": "142",
        "question": "Which telecommunications organizations are located in Belgium?",
        "sparql": "PREFIX wdt: <http://www.wikidata.org/prop/direct/> PREFIX wd: <http://www.wikidata.org/entity/> SELECT DISTINCT ?uri WHERE { { ?uri wdt:P31 wd:Q43229 } UNION { ?uri wdt:P31/(wdt:P279*) wd:Q43229 } . ?uri wdt:P452 wd:Q418 .  ?uri wdt:P17 wd:Q31 . }"
    },

In [None]:
item = {
        "id": "5",
        "question": "Who is the mayor of New York City?",
        "sparql": "PREFIX wdt: <http://www.wikidata.org/prop/direct/> PREFIX wd: <http://www.wikidata.org/entity/> SELECT DISTINCT ?uri WHERE { wd:Q60 wdt:P6 ?uri . }"
    }

print(reformat_item_with_sparql(item), end="\n\n")
print(clean_and_process_question(item['question']))

# Basic Data Preprocessing Steps

In [None]:
def load_json_file(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        data = json.load(file)
    return data

def write_to_json(new_data, file_path):
    with open(file_path, 'w', encoding='utf-8') as file:
        json.dump(new_data, file, indent=4)

## Extracting only id, English question and SPARQL (Narrowing)

In [None]:
data=load_json_file("/content/q2_indata/qald_train_wk.json")
narrowed=[]
for question in data['questions']:
    item={}
    item['id']=question['id']
    item['question']=question['question'][0]['string']
    item['sparql']=question['query']['sparql']
    narrowed.append(item)

write_to_json(narrowed,'/content/qald_trnr.json')

data=load_json_file("/content/q2_indata/qald_test_wk.json")
narrowed=[]
for question in data['questions']:
    item={}
    item['id']=question['id']
    item['question']=question['question'][0]['string']
    item['sparql']=question['query']['sparql']
    narrowed.append(item)

write_to_json(narrowed,'/content/qald_tenr.json')

## Converting all the SPARQL queries to prefixed SPARQL form (Prefixing)

In [None]:
my_prf = {"http://www.wikidata.org/prop/direct/": "wdt:", "http://www.wikidata.org/entity/": "wd:",
}

def convert_expanded_to_prefixed(sparql_query):
    prefix_declaration = " ".join([f"PREFIX {value} <{key}>" for key, value in my_prf.items()])
    prefixed_query = prefix_declaration + " " + sparql_query
    for uri, prefix in my_prf.items():
        uri_pattern = re.escape(uri)
        prefixed_query = re.sub(f"<{uri_pattern}([A-Za-z0-9_]+)>", f"{prefix}\\1", prefixed_query)
    return prefixed_query

def convert_ask_to_prefixed(sparql_query):
    prefix_declaration = " ".join([f"PREFIX {value} <{key}>" for key, value in my_prf.items()])
    prefixed_query = prefix_declaration + " " + sparql_query
    for uri, prefix in my_prf.items():
        uri_pattern = re.escape(uri)
        prefixed_query = re.sub(f"<{uri_pattern}([A-Za-z0-9_]+)>", f"{prefix}\\1", prefixed_query)
    return prefixed_query
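
As a small worked example of the rewriting step, the sketch below shortens full Wikidata URIs to their prefixed form. It mirrors the regular-expression substitution used above but omits the `PREFIX` declarations; `shorten_uris` is a hypothetical name and the input query is written by hand for illustration.

```python
import re

PREFIXES = {
    "http://www.wikidata.org/prop/direct/": "wdt:",
    "http://www.wikidata.org/entity/": "wd:",
}

def shorten_uris(sparql_query):
    """Replace <full-uri/ID> with prefix:ID for every known namespace."""
    for uri, prefix in PREFIXES.items():
        pattern = "<" + re.escape(uri) + r"([A-Za-z0-9_]+)>"
        sparql_query = re.sub(pattern, prefix + r"\1", sparql_query)
    return sparql_query

expanded = (
    "SELECT ?uri WHERE { <http://www.wikidata.org/entity/Q40984> "
    "<http://www.wikidata.org/prop/direct/P178> ?uri . }"
)
shortened = shorten_uris(expanded)
print(shortened)
```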

In [None]:
data=load_json_file("/content/qald_trnr.json")

prefixed=[]
for item in data:
    item['sparql']=item['sparql'].strip()
    if item['sparql'][0] == "S":
        item['sparql'] = convert_expanded_to_prefixed(item['sparql'])
    elif item['sparql'][0] == "A":
        item['sparql'] = convert_ask_to_prefixed(item['sparql'])

    prefixed.append(item)

write_to_json(prefixed,'/content/tr_nrprf.json')

data=load_json_file("/content/qald_tenr.json")

prefixed=[]
for item in data:
    item['sparql']=item['sparql'].strip()
    if item['sparql'][0] == "S":
        item['sparql'] = convert_expanded_to_prefixed(item['sparql'])
    elif item['sparql'][0] == "A":
        item['sparql'] = convert_ask_to_prefixed(item['sparql'])

    prefixed.append(item)

write_to_json(prefixed,'/content/te_nrprf.json')

## Question overwriting and sparql shortening

In [None]:
data=load_json_file("/content/tr_nrprf.json")

processed=[]
for item in data:
    processed.append(reformat_item_with_sparql(item))

write_to_json(processed,'/content/tr_fin.json')
data=load_json_file("/content/te_nrprf.json")

processed=[]
for item in data:
    mod={}
    mod['id'] = item['id']
    mod['question'] = clean_and_process_question(item['question'])
    mod['actual_sparql'] = item['sparql']
    processed.append(mod)

write_to_json(processed,'/content/te_fin.json')

## Adding [start] and [end] tokens

In [None]:
data=load_json_file("/content/tr_fin.json")
processed=[]
for item in data:
    item['sparql'] = "[start] " + item['sparql'] + " [end]"
    processed.append(item)

write_to_json(processed,'/content/tr_fin.json')

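data=load_json_file("/content/te_fin.json")
processed=[]
for item in data:
    sparql = item['actual_sparql']
    # Drop any text before SELECT; queries without SELECT (such as ASK) are left unchanged
    # instead of getting a spurious "SELECT " prepended.
    if 'SELECT' in sparql:
        sparql = 'SELECT ' + sparql.split('SELECT', 1)[-1].strip()
    item['actual_sparql'] = "[start] " + sparql.strip() + " [end]"
    processed.append(item)

write_to_json(processed,'/content/te_fin.json')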
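
Both cells above read and rewrite the same file, so running them a second time adds the markers again. One way to guard against that is a small wrapper that adds the markers only once. This is a sketch, and `wrap_query` is a hypothetical helper, not part of the notebook.

```python
START, END = "[start]", "[end]"

def wrap_query(sparql: str) -> str:
    """Add [start]/[end] markers once, even if the query is already wrapped."""
    body = sparql.strip()
    if body.startswith(START):
        body = body[len(START):].strip()
    if body.endswith(END):
        body = body[:-len(END)].strip()
    return f"{START} {body} {END}"

once = wrap_query("SELECT ?x WHERE { ?x a dbo:City }")
twice = wrap_query(once)
print(once)           # [start] SELECT ?x WHERE { ?x a dbo:City } [end]
print(once == twice)  # True
```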

# Neural Machine Translation

In [None]:
import json
import os
import pathlib
import random
import re
import string

import numpy as np
import tensorflow as tf
import tensorflow.data as tf_data
import tensorflow.strings as tf_strings
import keras
from keras import layers
from keras.layers import TextVectorization

In [None]:
def load_json_file(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        data = json.load(file)
    return data

# Load the training pairs and the test pairs (the validation set is split off below)
data = load_json_file("tr_fin.json")
txt_pairs = []
for item in data:
    txt_pairs.append((item['question'], item['sparql']))

data = load_json_file("te_fin.json")
tst_pairs = []
for item in data:
    tst_pairs.append((item['question'], item['actual_sparql']))

# shuffling the text pairs
random.shuffle( txt_pairs )
nval_samples = int(0.15 * len(txt_pairs))

# Split into training and validation pairs and finding out the number of training samples
ntr_samples = len(txt_pairs) - nval_samples
tr_pairs = txt_pairs[:ntr_samples]
val_pairs = txt_pairs[ntr_samples:ntr_samples + nval_samples]

print(f"Total number of training pairs: {len(tr_pairs)}\nTotal number of Validation Pairs: {len(val_pairs)}\n{len(tst_pairs)} test pairs")
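
The shuffle above is unseeded, so the train/validation split changes on every run. A reproducible variant could look like the following sketch; `split_pairs`, the seed value, and the toy pairs are illustrative assumptions, not part of the notebook.

```python
import random

def split_pairs(pairs, val_fraction=0.15, seed=42):
    """Shuffle a copy of the pairs with a fixed seed and split off a validation tail."""
    rng = random.Random(seed)       # local generator, leaves the global state alone
    shuffled = list(pairs)
    rng.shuffle(shuffled)
    n_val = int(val_fraction * len(shuffled))
    n_train = len(shuffled) - n_val
    return shuffled[:n_train], shuffled[n_train:]

# Toy (question, query) pairs, made up for the example.
toy = [(f"question {i}", f"[start] query {i} [end]") for i in range(20)]
train, val = split_pairs(toy)
print(len(train), len(val))
```

Calling `split_pairs` again with the same seed returns the same split, which makes the four configurations easier to compare across runs.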

## Text Vectorization
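We'll vectorize the English questions and the SPARQL queries, limiting the vocabulary size and sequence length. The `eng_*` and `spa_*` names in the code refer to the questions and the SPARQL queries, respectively.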


In [None]:
vocab_size = 16000
sequence_length = 25
batch_size = 32

# Vectorizers for the questions (eng_vect) and the SPARQL queries (spa_vect).
# Note: the default standardization lowercases text and strips punctuation,
# so "[start]"/"[end]" are stored as "start"/"end".
eng_vect = TextVectorization(max_tokens=vocab_size, output_mode="int", output_sequence_length=sequence_length)
# One extra position, because the target is the decoder input shifted by one token
spa_vect = TextVectorization(max_tokens=vocab_size, output_mode="int", output_sequence_length=sequence_length + 1)

# Adapt each vectorizer on the training texts only
tren_texts = [pair[0] for pair in tr_pairs]
trsp_texts = [pair[1] for pair in tr_pairs]
eng_vect.adapt(tren_texts)
spa_vect.adapt(trsp_texts)

def format_dataset(eng, spa):
    eng = eng_vect(eng)
    spa = spa_vect(spa)
    return ({ "encoder_inputs": eng, "decoder_inputs": spa[:, :-1], }, spa[:, 1:] )

def make_dataset(pairs):
    eng_texts, spa_texts = zip(*pairs)
    eng_texts = list(eng_texts)
    spa_texts = list(spa_texts)
    dataset = tf_data.Dataset.from_tensor_slices((eng_texts, spa_texts))
    dataset = dataset.batch(batch_size)
    dataset = dataset.map(format_dataset)
    return dataset.cache().shuffle(2048).prefetch(16)

# Creating the datasets
train_ds = make_dataset(tr_pairs)
val_ds = make_dataset(val_pairs)
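
`TextVectorization` uses the `lower_and_strip_punctuation` standardization by default, which matters for SPARQL because the markers and most query syntax are punctuation. The sketch below imitates that behaviour in plain Python to show the effect on one query; `default_like`, `keep_query_symbols`, and the set of kept characters are illustrative assumptions, not the Keras implementation.

```python
import re
import string

SAMPLE = "[start] SELECT ?uri WHERE { ?uri dbo:country dbr:Germany } [end]"

def default_like(text: str) -> str:
    """Plain-Python imitation of lowercasing and stripping all punctuation."""
    return re.sub(f"[{re.escape(string.punctuation)}]", "", text.lower())

def keep_query_symbols(text: str, keep: str = "[]?{}:_.") -> str:
    """Lowercase, but keep the characters listed in `keep` (an illustrative choice)."""
    strip_chars = "".join(c for c in string.punctuation if c not in keep)
    return re.sub(f"[{re.escape(strip_chars)}]", "", text.lower())

print(default_like(SAMPLE).split())
# ['start', 'select', 'uri', 'where', 'uri', 'dbocountry', 'dbrgermany', 'end']
print(keep_query_symbols(SAMPLE).split())
# ['[start]', 'select', '?uri', 'where', '{', '?uri', 'dbo:country', 'dbr:germany', '}', '[end]']
```

`TextVectorization` also accepts a callable for its `standardize` argument, so a comparable rule could be written with `tf.strings` operations. Both variants here still lowercase resource names, which may itself be undesirable for case-sensitive identifiers.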

## Transformer Model with Attention (Toggleable)
We'll create a Transformer model where attention can be turned on or off, depending on the task.


In [None]:
class PositionalEmbedding(layers.Layer):
    def __init__(self, sequence_length, vocab_size, embed_dim, **kwargs):
        super().__init__(**kwargs)
        self.token_embeddings = layers.Embedding(input_dim=vocab_size, output_dim=embed_dim)
        self.position_embeddings = layers.Embedding(input_dim=sequence_length, output_dim=embed_dim)
        self.sequence_length = sequence_length
        self.vocab_size = vocab_size
        self.embed_dim = embed_dim

    def call(self, inputs):
        length = tf.shape(inputs)[-1]
        positions = tf.range(start=0, limit=length, delta=1)
        embedded_tokens = self.token_embeddings(inputs)
        embedded_positions = self.position_embeddings(positions)
        return embedded_tokens + embedded_positions

    def compute_mask(self, inputs, mask=None):
        if mask is None:
            return None
        else:
            return tf.not_equal(inputs, 0)

    def get_config(self):
        config = super().get_config()
        config.update({
            "sequence_length": self.sequence_length,
            "vocab_size": self.vocab_size,
            "embed_dim": self.embed_dim,
        })
        return config

class TransformerEncoder(layers.Layer):
    def __init__(self, embed_dim, dense_dim, num_heads, **kwargs):
        super().__init__(**kwargs)
        self.embed_dim = embed_dim
        self.dense_dim = dense_dim
        self.num_heads = num_heads
        self.attention = layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
        self.dense_proj = keras.Sequential([
            layers.Dense(dense_dim, activation="relu"),
            layers.Dense(embed_dim),
        ])
        self.layernorm_1 = layers.LayerNormalization()
        self.layernorm_2 = layers.LayerNormalization()
        self.supports_masking = True

    def call(self, inputs, mask=None):
        attention_output = self.attention(query=inputs, value=inputs, key=inputs)
        proj_input = self.layernorm_1(inputs + attention_output)
        proj_output = self.dense_proj(proj_input)
        return self.layernorm_2(proj_input + proj_output)

class TransformerDecoder(layers.Layer):
    def __init__(self, embed_dim, latent_dim, num_heads, use_attention=True, **kwargs):
        super().__init__(**kwargs)
        self.embed_dim = embed_dim
        self.latent_dim = latent_dim
        self.num_heads = num_heads
        self.use_attention = use_attention
        self.attention_1 = layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
        self.attention_2 = layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
        self.dense_proj = keras.Sequential([
            layers.Dense(latent_dim, activation="relu"),
            layers.Dense(embed_dim),
        ])
        self.layernorm_1 = layers.LayerNormalization()
        self.layernorm_2 = layers.LayerNormalization()
        self.layernorm_3 = layers.LayerNormalization()
        self.supports_masking = True

    def call(self, inputs, encoder_outputs, mask=None):
        causal_mask = self.get_causal_attention_mask(inputs)
        padding_mask = tf.minimum(tf.cast(mask[:, None, :], dtype="int32"), causal_mask) if mask is not None else None

        attention_output_1 = self.attention_1(query=inputs, value=inputs, key=inputs, attention_mask=causal_mask)
        out_1 = self.layernorm_1(inputs + attention_output_1)

        if self.use_attention:
            attention_output_2 = self.attention_2(
                query=out_1, value=encoder_outputs, key=encoder_outputs, attention_mask=padding_mask
            )
            out_2 = self.layernorm_2(out_1 + attention_output_2)
        else:
            out_2 = out_1

        proj_output = self.dense_proj(out_2)
        return self.layernorm_3(out_2 + proj_output)

    def get_causal_attention_mask(self, inputs):
        input_shape = tf.shape(inputs)
        batch_size, sequence_length = input_shape[0], input_shape[1]
        i = tf.range(sequence_length)[:, None]
        j = tf.range(sequence_length)
        mask = tf.cast(i >= j, dtype="int32")
        return tf.tile(tf.reshape(mask, (1, sequence_length, sequence_length)), [batch_size, 1, 1])

def build_transformer_model(use_attention=True):
    embed_dim = 256
    latent_dim = 2048
    num_heads = 8

    # Encoder
    encoder_inputs = keras.Input(shape=(None,), dtype="int64", name="encoder_inputs")
    x = PositionalEmbedding(sequence_length, vocab_size, embed_dim)(encoder_inputs)
    encoder_outputs = TransformerEncoder(embed_dim, latent_dim, num_heads)(x)
    encoder = keras.Model(encoder_inputs, encoder_outputs)

    # Decoder
    decoder_inputs = keras.Input(shape=(None,), dtype="int64", name="decoder_inputs")
    encoded_seq_inputs = keras.Input(shape=(None, embed_dim), name="decoder_state_inputs")
    x = PositionalEmbedding(sequence_length, vocab_size, embed_dim)(decoder_inputs)

    if use_attention:
        x = TransformerDecoder(embed_dim, latent_dim, num_heads)(x, encoded_seq_inputs)
    else:
        x = layers.Dense(latent_dim, activation="relu")(x)

    x = layers.Dropout(0.5)(x)
    decoder_outputs = layers.Dense(vocab_size, activation="softmax")(x)
    decoder = keras.Model([decoder_inputs, encoded_seq_inputs], decoder_outputs)

    # Assembling full transformer model
    decoder_outputs = decoder([decoder_inputs, encoder_outputs])
    transformer = keras.Model([encoder_inputs, decoder_inputs], decoder_outputs, name="transformer")
    return transformer, encoder, decoder
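
The causal mask built in `get_causal_attention_mask` uses the comparison `i >= j`, so position `i` may only attend to itself and earlier positions. A plain-Python sketch of the same pattern for a short sequence (the `causal_mask` helper is for illustration only):

```python
def causal_mask(length: int):
    """Row i marks with 1 the positions j <= i that position i may attend to."""
    return [[1 if i >= j else 0 for j in range(length)] for i in range(length)]

for row in causal_mask(4):
    print(row)
# [1, 0, 0, 0]
# [1, 1, 0, 0]
# [1, 1, 1, 0]
# [1, 1, 1, 1]
```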

# Evaluating the Model on 4 different settings



In [None]:
epochs = 25
max_decoded_sentence_length = 20
spa_vocab = spa_vect.get_vocabulary()
spa_index_lookup = dict(zip(range(len(spa_vocab)), spa_vocab))

In [None]:
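def decode_sequence(transformer, input_sentence, use_greedy=True, beam_size=3):
    """Generate a query token by token, feeding back the tokens decoded so far."""
    tokenized_input_sentence = eng_vect([input_sentence])
    decoded_sentence = "[start]"
    for i in range(max_decoded_sentence_length):
        tokenized_target_sentence = spa_vect([decoded_sentence])[:, :-1]
        predictions = transformer([tokenized_input_sentence, tokenized_target_sentence])

        if use_greedy:
            sampled_token_index = tf.argmax(predictions[0, i, :]).numpy()
        else:
            # Picks uniformly among the beam_size most likely tokens at this step
            # (top-k sampling); it does not track several hypotheses like a full beam search.
            top_k_indices = tf.argsort(predictions[0, i, :])[-beam_size:].numpy()
            sampled_token_index = random.choice(top_k_indices)
        sampled_token = spa_index_lookup[sampled_token_index]
        decoded_sentence += " " + sampled_token

        # The default TextVectorization standardization strips the brackets,
        # so the end marker may appear in the vocabulary as "end".
        if sampled_token in ("[end]", "end"):
            break
    return decoded_sentence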

def evaluate_model(use_attention, use_greedy, beam_size=3, teacher_forcing=True):
    """Train one configuration and decode every test question with it."""
    transformer, encoder, decoder = build_transformer_model(use_attention=use_attention)
    transformer.compile(optimizer="rmsprop", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    # Note: this first call already trains for `epochs` epochs and its history is discarded;
    # the history returned below comes from the second fit call.
    transformer.fit(train_ds, epochs=epochs, validation_data=val_ds)

    if teacher_forcing:
        history = transformer.fit(train_ds, epochs=epochs, validation_data=val_ds)
    else:
        # Here the targets are replaced by the (unshifted) decoder inputs
        history = transformer.fit(train_ds.map(lambda x, y: (x, x["decoder_inputs"])), epochs=epochs, validation_data=val_ds)

    # Decode every test question with the trained model
    test_eng_texts = [pair[0] for pair in tst_pairs]
    translations = []
    for input_sentence in test_eng_texts:
        translated_sentence = decode_sequence(transformer, input_sentence, use_greedy=use_greedy, beam_size=beam_size)
        translations.append((input_sentence, translated_sentence))

    return history.history, translations

def evaluate_all_combinations():
    """
    1. No attention, greedy decoding, with teacher forcing
    2. Attention, beam search decoding, with teacher forcing
    3. Attention, greedy decoding, without teacher forcing
    4. No attention, beam search decoding, without teacher forcing
    """
    history_1, translations_1 = evaluate_model(use_attention=False, use_greedy=True, teacher_forcing=True)
    history_2, translations_2 = evaluate_model(use_attention=True, use_greedy=False, beam_size=3, teacher_forcing=True)
    history_3, translations_3 = evaluate_model(use_attention=True, use_greedy=True, teacher_forcing=False)
    history_4, translations_4 = evaluate_model(use_attention=False, use_greedy=False, beam_size=3, teacher_forcing=False)

    print("Summary")
    print("1. No attention, greedy decoding, with teacher forcing")
    print(f"History: {history_1}")

    print("2. Attention, beam search decoding, with teacher forcing")
    print(f"History: {history_2}")

    print("3. Attention, greedy decoding, without teacher forcing")
    print(f"History: {history_3}")

    print("4. No attention, beam search decoding, without teacher forcing")
    print(f"History: {history_4}")

    return [translations_1,translations_2,translations_3,translations_4]

# Using all the 4 tasks to evaluate
trans1,trans2,trans3,trans4 = evaluate_all_combinations()
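
In `decode_sequence`, the non-greedy branch picks randomly among the `beam_size` most likely tokens at each step. For comparison, beam search in the usual sense keeps several partial sequences and ranks them by cumulative log-probability. The following is a minimal sketch of that idea; `TABLE` and `toy_log_probs` are hypothetical stand-ins for the trained model's softmax output, with made-up probabilities.

```python
import math

def beam_search(next_log_probs, start="[start]", end="[end]", beam_size=3, max_len=10):
    """Keep the beam_size best partial sequences by total log-probability."""
    beams = [([start], 0.0)]
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens[-1] == end:            # finished hypotheses are carried over
                candidates.append((tokens, score))
                continue
            for tok, lp in next_log_probs(tokens).items():
                candidates.append((tokens + [tok], score + lp))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
        if all(tokens[-1] == end for tokens, _ in beams):
            break
    return beams[0][0]

def greedy(next_log_probs, start="[start]", end="[end]", max_len=10):
    """Always take the single most likely next token."""
    tokens = [start]
    while tokens[-1] != end and len(tokens) <= max_len:
        lps = next_log_probs(tokens)
        tokens.append(max(lps, key=lps.get))
    return tokens

# Hypothetical next-token probabilities, keyed by the previous token.
TABLE = {
    "[start]": {"select": 0.6, "ask": 0.4},
    "select": {"?x": 0.6, "?y": 0.4},
    "ask": {"where": 1.0},
    "?x": {"[end]": 1.0},
    "?y": {"[end]": 1.0},
    "where": {"[end]": 1.0},
}

def toy_log_probs(tokens):
    return {tok: math.log(p) for tok, p in TABLE[tokens[-1]].items()}

print(" ".join(greedy(toy_log_probs)))       # [start] select ?x [end]
print(" ".join(beam_search(toy_log_probs)))  # [start] ask where [end]
```

In this toy table the greedy path has total probability 0.36, while the beam finds the sequence with probability 0.4 that starts with a less likely first token.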

# Analyzing Model Performance for different Configurations

## Model Configuration Analysis for SPARQL Query Generation

### 1. No Attention, Greedy Decoding, With Teacher Forcing

- Training Accuracy: Started at 61% and gradually increased to around 63%.
- Training Loss: Decreased from 2.36 to 2.09, showing a steady learning curve.
- Validation Accuracy: Consistently held between 64% and 65%.
- Validation Loss: Remained stable, around 2.7 to 2.8.

### 2. Attention, Beam Search Decoding, With Teacher Forcing

- Training Accuracy: Reached about 84.77% by the final epochs.
- Training Loss: Dropped significantly, reaching a low of 1.03.
- Validation Accuracy: Approached 80% by the end.
- Validation Loss: Ended at around 1.01.

### 3. Attention, Greedy Decoding, Without Teacher Forcing

- Training Accuracy: Ended at around 78%.
- Training Loss: Steadily decreased, ending at 0.87.
- Validation Accuracy: Initially high but dropped to around 60%.
- Validation Loss: Increased to approximately 3.22, indicating poor generalization.

### 4. No Attention, Beam Search Decoding, Without Teacher Forcing

- Training Accuracy: Ended at 62%.
- Training Loss: Decreased from 3.56 to 2.18, showing gradual improvement.
- Validation Accuracy: Declined to around 45% by the end.
- Validation Loss: Rose from 2.95 to 4.42, showing poor generalization.

### Optimal Configuration

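- The Attention, Beam Search Decoding, With Teacher Forcing configuration stands out as the best.
- It delivers the highest accuracy for both training and validation, showing strong generalisation.
- The accuracy and loss values above are measured during `fit`, so they reflect the attention and training-target settings; the decoding strategy is only used in `decode_sequence`.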

# Saving the Translations

In [None]:
import json

def load_json_file(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        data = json.load(file)
    return data

def write_to_json(new_data, file_path):
    with open(file_path, 'w', encoding='utf-8') as file:
        json.dump(new_data, file, indent=4)

# This file must contain the same items, in the same order, as the one used to build tst_pairs
data=load_json_file("test_final.json")

processed=[]
for i, item in enumerate(data):
    # trans1..trans4 are the lists returned by evaluate_all_combinations()
    item['NA_G_TF'] = trans1[i][1]
    item["A_B_TF"] = trans2[i][1]
    item["A_G_WTF"] = trans3[i][1]
    item["NA_B_WTF"] = trans4[i][1]
    processed.append(item)

write_to_json(processed,'test_pred.json')

Translations have been saved in `test_pred.json`.

# Analysing the Accuracy of the Answers fetched by the Predicted Sparql Query

Even though the token-level accuracy reached up to 76% in certain configurations, the generated queries were not accurate enough to be sent as valid query requests.
- SPARQL queries are highly structured. Without more training examples that show correct SPARQL structures, the model cannot reliably generate queries that are both syntactically and semantically valid.
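
The gap between token-level accuracy and usable queries can be made measurable with a stricter metric. The sketch below computes an exact-match rate and a cheap structural check; the helper names and the two example pairs are hypothetical and made up for illustration, and the check is not a real SPARQL parser.

```python
def strip_markers(query: str) -> str:
    """Drop [start]/[end] markers and collapse whitespace."""
    return " ".join(query.replace("[start]", "").replace("[end]", "").split())

def looks_well_formed(query: str) -> bool:
    """Cheap structural check: known query form and balanced braces."""
    body = strip_markers(query)
    if not body.upper().startswith(("SELECT", "ASK")):
        return False
    depth = 0
    for ch in body:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:        # closing brace before its opening brace
                return False
    return depth == 0 and "{" in body

def exact_match_rate(pairs) -> float:
    """Share of (reference, prediction) pairs identical after case and space normalisation."""
    hits = sum(strip_markers(r).lower() == strip_markers(p).lower() for r, p in pairs)
    return hits / len(pairs) if pairs else 0.0

# Made-up (reference, prediction) pairs for illustration.
pairs = [
    ("[start] SELECT ?x WHERE { ?x a dbo:City } [end]",
     "[start] select ?x where { ?x a dbo:city } [end]"),
    ("[start] ASK WHERE { dbr:Berlin a dbo:City } [end]",
     "[start] ask where dbrberlin a dbocity"),
]
print(exact_match_rate(pairs))                       # 0.5
print([looks_well_formed(p) for _, p in pairs])      # [True, False]
```

A prediction can share most tokens with its reference and still fail both checks, which is consistent with the observation above.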