# Question Answering on FAQs of GST (Goods and Services Tax) in India


🙌 Welcome to this hands-on tutorial exploring the capabilities of [Ludwig](https://ludwig.ai/latest/) 0.8 for building a Question Answering model for FAQs (Frequently Asked Questions) on GST (Goods and Services Tax) in India.

Ludwig, an open-source package, is used here to train machine learning models in Encoder-Combiner-Decoder (ECD) mode and to fine-tune LLMs via instruction tuning, all through declarative config files.

A bit more about GST: it is a single tax structure that replaces the multitude of taxes that existed before in India, such as service tax, central excise duty, VAT, and more. It is an all-in-one tax that streamlines the entire tax process in India. This transition from a multi-tax system to a single-tax system raises a lot of queries. These queries, along with their answers, are available as FAQs. An ML model or a fine-tuned LLM trained on them could help build a chatbot-like application on top.

👉👉 Step-by-step explanation of the solution is available [here](https://medium.com/analytics-vidhya/how-to-fine-tune-llms-without-coding-41cf8d4b5d23).

## Installation 🧰

This notebook needs a HuggingFace API token, approved access to Llama-2-7b-hf, and a GPU with at least 12 GiB of VRAM. A T4 GPU is used here.
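Before installing anything, it can help to confirm that the runtime meets the 12 GiB requirement. The following is a minimal sketch, not part of the original notebook: `meets_vram_requirement` and `report_gpu` are hypothetical helper names, and the PyTorch query is guarded so the snippet still runs where `torch` is not installed.

```python
GIB = 1024 ** 3  # bytes per GiB

def meets_vram_requirement(total_bytes, required_gib=12.0):
    """Return True if a device with `total_bytes` of memory has at least `required_gib` GiB."""
    return total_bytes / GIB >= required_gib

def report_gpu(required_gib=12.0):
    """Describe the first CUDA device, or explain why it could not be checked."""
    try:
        import torch  # optional here; hypothetical helper, not a Ludwig API
    except ImportError:
        return "PyTorch is not installed, so the GPU cannot be queried."
    if not torch.cuda.is_available():
        return "No CUDA GPU is visible to PyTorch."
    props = torch.cuda.get_device_properties(0)
    verdict = "meets" if meets_vram_requirement(props.total_memory, required_gib) else "is below"
    return f"{props.name}: {props.total_memory / GIB:.1f} GiB, {verdict} the {required_gib:g} GiB minimum"

print(report_gpu())
```

If the report says the device is below the minimum, switching the runtime type to a larger GPU before continuing is likely to save a failed training run later.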

In [None]:
!pip uninstall -y tensorflow --quiet
!pip install ludwig
!pip install ludwig[llm]

Collecting ludwig
  Downloading ludwig-0.8.6.tar.gz (1.0 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting PyYAML!=5.4.*,<6.0.1,>=3.12 (from ludwig)
  Downloading PyYAML-6.0-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (682 kB)
Collecting dataclasses-json (from ludwig)
  Downloading dataclasses_json-0.6.3-py3-none-any.whl (28 kB)
Collecting jsonschema<4.7,>=4.5.0 (from ludwig)
  Downloading jsonschema-4.6.2-py3-none-any.whl (80 kB)

Collecting sentence-transformers (from ludwig[llm])
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
  Preparing metadata (setup.py) ... done
Collecting faiss-cpu (from ludwig[llm])
  Downloading faiss_cpu-1.7.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.6 MB)
Collecting accelerate (from ludwig[llm])
  Downloading accelerate-0.24.1-py3-none-any.whl (261 kB)
Collecting loralib (from ludwig[llm])
  Downloading loralib-0.1.2-py3-none-any.whl (10 kB)
Collecting peft>=0.4.0 (from ludwig[llm])
  Downloading peft-0.6.2-py3-none-any.whl (174 kB)

Enable text wrapping so we don't have to scroll horizontally, and create a function to flush the CUDA cache.

In [None]:
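import torch
from IPython.display import HTML, display

def set_css():
  # Wrap long lines in cell output instead of scrolling horizontally.
  display(HTML('''
  <style>
    pre {
        white-space: pre-wrap;
    }
  </style>
  '''))

# Re-apply the CSS before every cell runs.
get_ipython().events.register('pre_run_cell', set_css)

def clear_cache():
  # Release cached GPU memory. Drop references to large models in the caller first
  # (e.g. `del model`); assigning `model = None` here would only rebind a local name.
  if torch.cuda.is_available():
    torch.cuda.empty_cache()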

-> **Set Up Your HuggingFace Token** 🤗

We'll be using Llama-2, a model released by Meta. The model is not openly accessible: you have to request access, which is then tied to your HuggingFace token.

Obtain a [HuggingFace API Token](https://huggingface.co/settings/tokens) and request access to [Llama2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf) before proceeding. You may need to sign up on HuggingFace if you don't already have an account: https://huggingface.co/join

In [None]:
import getpass
import locale; locale.getpreferredencoding = lambda: "UTF-8"
import logging
import os
import torch
import yaml

from ludwig.api import LudwigModel


os.environ["HUGGING_FACE_HUB_TOKEN"] = getpass.getpass("Token:")
assert os.environ["HUGGING_FACE_HUB_TOKEN"]

Token:··········


## Configurations


This config defines instruction fine-tuning with the Mistral 7B model. It is based on [this](https://predibase.com/blog/fine-tuning-mistral-7b-on-a-single-gpu-with-ludwig) tutorial, with a modified prompt.

In [None]:
instruction_tuning_mistral_yaml = yaml.safe_load("""
model_type: llm
# base_model: mistralai/Mistral-7B-v0.1
base_model: alexsherstinsky/Mistral-7B-v0.1-sharded

quantization:
 bits: 4

adapter:
 type: lora

prompt:
  template: |
    ### Instruction:
    You are a taxation expert on Goods and Services Tax used in India.
    Take the Input given below which is a Question. Give Answer for it as a Response.

    ### Input:
    {Question}

    ### Response:

input_features:
 - name: Question
   type: text
   preprocessing:
      max_sequence_length: 1024

output_features:
 - name: Answer
   type: text
   preprocessing:
      max_sequence_length: 384

trainer:
  type: finetune
  epochs: 5
  batch_size: 1
  eval_batch_size: 2
  gradient_accumulation_steps: 16  # effective batch size = batch size * gradient_accumulation_steps
  learning_rate: 2.0e-4
  enable_gradient_checkpointing: true
  learning_rate_scheduler:
    decay: cosine
    warmup_fraction: 0.03
    reduce_on_plateau: 0

generation:
  temperature: 0.1
  max_new_tokens: 512

backend:
 type: local
""")

This config defines instruction fine-tuning with the Llama 2 7B model. It is based on [this](https://colab.research.google.com/drive/1c3AO8l_H6V_x37RwQ8V7M6A-RmcBf2tG) tutorial, with a modified prompt.

In [None]:
instruction_tuning_llama2_yaml = yaml.safe_load("""
model_type: llm
base_model: meta-llama/Llama-2-7b-hf

quantization:
 bits: 4

adapter:
 type: lora

prompt:
  template: |
    ### Instruction:
    You are a taxation expert on Goods and Services Tax used in India.
    Take the Input given below which is a Question. Give Answer for it as a Response.

    ### Input:
    {Question}

    ### Response:

input_features:
 - name: Question
   type: text

output_features:
 - name: Answer
   type: text

trainer:
 type: finetune
 learning_rate: 0.0003
 batch_size: 1
 gradient_accumulation_steps: 8
 epochs: 3

backend:
 type: local
""")

The following config takes the ECD approach to the Question Answering problem, using the Llama 2 model as a frozen (non-trainable) text encoder.

In [None]:
qna_tuning_config_dict = {
    "input_features": [
        {
            "name": "Question",
            "type": "text",
            "encoder": {
                "type": "auto_transformer",
                "pretrained_model_name_or_path": "meta-llama/Llama-2-7b-hf",
                "trainable": False
            },
            "preprocessing": {
                "cache_encoder_embeddings": True
            }
        }
    ],
    "output_features": [
        {
            "name": "Answer",
            "type": "text"
        }
    ]
}
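Two details of the instruction-tuning configs above are easy to check by hand: what the prompt looks like once `{Question}` is filled in, and the effective batch size implied by each `trainer` section. The sketch below is illustrative only. Ludwig performs its own prompt templating internally, so `str.format` is used here merely to approximate the resulting text, and `render_prompt` and `effective_batch_size` are hypothetical helper names.

```python
# Prompt text copied from the two instruction-tuning configs.
PROMPT_TEMPLATE = (
    "### Instruction:\n"
    "You are a taxation expert on Goods and Services Tax used in India.\n"
    "Take the Input given below which is a Question. Give Answer for it as a Response.\n"
    "\n"
    "### Input:\n"
    "{Question}\n"
    "\n"
    "### Response:\n"
)

def render_prompt(row, template=PROMPT_TEMPLATE):
    """Fill each {column} placeholder in the template from a dataset row (a dict)."""
    return template.format(**row)

def effective_batch_size(trainer):
    """Examples contributing to one optimizer step: batch_size * gradient_accumulation_steps."""
    return trainer["batch_size"] * trainer.get("gradient_accumulation_steps", 1)

question = "Does aggregate turnover include value of inward supplies received on which RCM is payable?"
print(render_prompt({"Question": question}))

# Trainer values copied from the Mistral and Llama 2 configs respectively.
print(effective_batch_size({"batch_size": 1, "gradient_accumulation_steps": 16}))  # 16
print(effective_batch_size({"batch_size": 1, "gradient_accumulation_steps": 8}))   # 8
```

In the notebook itself, the same helper could be applied to the parsed configs, for example `effective_batch_size(instruction_tuning_mistral_yaml["trainer"])`.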

## Dataset
The data is available as a CSV file on GitHub [here](https://raw.githubusercontent.com/yogeshhk/Sarvadnya/master/src/ludwig/data/cbic-gst_gov_in_fgaq.csv). `wget` it once using the cell below and keep it in the `data` folder, then comment the cell out for subsequent runs.

In [None]:
# !pip install wget
# import wget

# # Replace the URL with the raw URL of the file on GitHub
# url = "https://raw.githubusercontent.com/yogeshhk/Sarvadnya/master/src/ludwig/data/cbic-gst_gov_in_fgaq.csv"

# # Download the file
# wget.download(url, 'cbic-gst_gov_in_fgaq.csv')
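As an alternative to commenting the cell in and out, the download can be made idempotent. The following is a minimal sketch using only the standard library; `download_once` is a hypothetical helper (not part of the original notebook), and the `fetch` parameter exists only so the function can be exercised without network access.

```python
import os
from urllib.request import urlretrieve

DATA_URL = "https://raw.githubusercontent.com/yogeshhk/Sarvadnya/master/src/ludwig/data/cbic-gst_gov_in_fgaq.csv"

def download_once(url, dest, fetch=urlretrieve):
    """Download `url` to `dest` unless the file already exists. Returns True if a download ran."""
    if os.path.exists(dest):
        return False
    # Create the target folder (e.g. `data`) if it is missing.
    os.makedirs(os.path.dirname(dest) or ".", exist_ok=True)
    fetch(url, dest)
    return True

# Uncomment in the notebook to fetch the file into the data folder:
# download_once(DATA_URL, os.path.join("data", "cbic-gst_gov_in_fgaq.csv"))
```

Because the function returns early when the file is present, the cell can stay enabled across reruns.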

-> Mounting Google Drive needs permission. Change the Drive path below to the location where the CSV file needed for the notebook resides.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

import os
os.chdir('/content/drive/MyDrive/ImpDocs/Work/AICoach/Notebooks')

Mounted at /content/drive


In [None]:
from google.colab import data_table; data_table.enable_dataframe_formatter()
import numpy as np; np.random.seed(123)
import pandas as pd

Change the path below to the location of the CSV file on your Drive.

In [None]:
df = pd.read_csv('/content/drive/MyDrive/ImpDocs/Work/AICoach/Notebooks/data/cbic-gst_gov_in_fgaq.csv', encoding='cp1252')
df.head()


| Unnamed: 0 | Question | Answer |
|---|---|---|
| 0 | Does aggregate turnover include value of inwar... | Refer Section 2(6) of CGST Act. Aggregate turn... |
| 1 | What if the dealer migrated with wrong PAN as ... | New registration would be required as partners... |
| 2 | A taxable person’s business is in many states.... | He is liable to register if the aggregate turn... |
| 3 | Can we use provisional GSTIN or do we get new ... | Provisional GSTIN (PID) should be converted in... |
| 4 | Whether trader of country liquor is required t... | If the person is involved in 100% supply of go... |


A crucial step is compiling a dataset that mirrors the real-world questions taxpayers grapple with. This is a Question Answering dataset in which each row consists of:
- `Question`, which describes a query
- `Answer`, which gives the corresponding answer
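Since training depends on every row having both fields, a quick sanity check on the rows can catch empty or missing values early. This is a minimal sketch on plain dictionaries; `split_valid_rows` is a hypothetical helper and the sample rows are illustrative, not taken verbatim from the dataset.

```python
REQUIRED_COLUMNS = ("Question", "Answer")

def split_valid_rows(rows, required=REQUIRED_COLUMNS):
    """Separate rows with non-empty text in every required column from the rest."""
    valid, rejected = [], []
    for row in rows:
        # A value counts only if it is a string with non-whitespace content
        # (this also rejects None and pandas NaN, which is a float).
        ok = all(isinstance(row.get(col), str) and row[col].strip() for col in required)
        (valid if ok else rejected).append(row)
    return valid, rejected

sample = [
    {"Question": "Does aggregate turnover include value of inward supplies received on which RCM is payable?",
     "Answer": "Refer Section 2(6) of CGST Act."},
    {"Question": "   ", "Answer": "An answer without a question."},
    {"Question": "A question without an answer.", "Answer": None},
]
valid, rejected = split_valid_rows(sample)
print(len(valid), len(rejected))  # 1 2
```

With the notebook's dataframe, the rows could be passed as `split_valid_rows(df.to_dict("records"))`.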

## Running Ludwig: Training

Ludwig's declarative config lets us define the architecture clearly, which makes the training process transparent.

Instantiate `LudwigModel` with the fine-tuning config `instruction_tuning_llama2_yaml` and train it on the dataframe loaded from the GST CSV.

In [None]:
model_instruction_tuning = LudwigModel(config=instruction_tuning_llama2_yaml) # instruction_tuning_mistral_yaml, instruction_tuning_llama2_yaml, qna_tuning_config_dict
results_instruction_tuning = model_instruction_tuning.train(dataset=df)

config.json:   0%|          | 0.00/609 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/776 [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/414 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


model.safetensors.index.json:   0%|          | 0.00/26.8k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/9.98G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/3.50G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/188 [00:00<?, ?B/s]

trainable params: 4,194,304 || all params: 6,742,609,920 || trainable%: 0.06220594176090199
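The figures in the last log line can be reproduced with a little arithmetic. The sketch below assumes a LoRA rank of 8 applied to the query and value projections of every layer; neither is stated in the config above, but together with Llama-2-7B's hidden size (4096) and layer count (32) they yield exactly the reported trainable parameter count.

```python
HIDDEN_SIZE = 4096      # Llama-2-7B hidden dimension
NUM_LAYERS = 32         # Llama-2-7B transformer blocks
LORA_RANK = 8           # assumption: not set explicitly in the config
ADAPTED_PER_LAYER = 2   # assumption: query and value projections

# Each adapted (d_in x d_out) weight gains two low-rank factors: (d_in x r) and (r x d_out).
params_per_matrix = LORA_RANK * (HIDDEN_SIZE + HIDDEN_SIZE)
lora_params = NUM_LAYERS * ADAPTED_PER_LAYER * params_per_matrix

all_params = 6_742_609_920  # "all params" from the log, adapters included
trainable_pct = 100 * lora_params / all_params

print(f"{lora_params:,}")      # 4,194,304
print(f"{trainable_pct:.4f}")  # 0.0622
```

In other words, only about 0.06% of the weights are updated during fine-tuning, which is what makes a 7B model trainable on a single T4 when combined with 4-bit quantization.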


Instantiate another `LudwigModel` with the ECD config `qna_tuning_config_dict` and train it on the same GST dataframe. [Commented out for this run]

In [None]:
# model_ecd = LudwigModel(config=qna_tuning_config_dict)
# results_ecd = model_ecd.train(dataset=df)

The test (inference) dataset has just a couple of questions for which answers are sought.

In [None]:
test_df = pd.DataFrame([
    {
        "Question": "If I am not an existing taxpayer and wish to newly register under GST, when can I do so?"
    },
    {
        "Question": "Does aggregate turnover include value of inward supplies received on which RCM is payable?"
    },
])


## Running Ludwig: Inferencing

With training complete, we put the model to the test by feeding it a set of GST-related questions to see the declarative framework in action.

**Predictions on fine-tuned model**

In [None]:
predictions_instruction_tuning_df, output_directory = model_instruction_tuning.predict(dataset=test_df)
print(predictions_instruction_tuning_df["Answer_response"].tolist())

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


[['nobody can be registered under gst unless he is liable to be registered under section 22 of the cgst act, 2017 read with section 2(6) of the sgst act, 2017.'], ['nobody is liable to pay rcm on inward supplies.']]




The answers are `[['nobody can be registered under gst unless he is liable to be registered under section 22 of the cgst act, 2017 read with section 2(6) of the sgst act, 2017.'], ['nobody is liable to pay rcm on inward supplies.']]`

These are reasonably OK, but it is a little odd that both answers start with `nobody`. There could be many reasons: the quality of the base LLM, the training parameters and, above all, the need for a far bigger dataset for fine-tuning.
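The raw output is a list of single-item lists, which is hard to read next to the questions. A small helper can pair each question with its generated answer. This is a sketch with a hypothetical `pair_answers` function; the strings are copied from the test questions and the output shown above.

```python
def pair_answers(questions, responses):
    """Pair each question with its first generated response (None if there is none)."""
    return [(q, r[0] if r else None) for q, r in zip(questions, responses)]

questions = [
    "If I am not an existing taxpayer and wish to newly register under GST, when can I do so?",
    "Does aggregate turnover include value of inward supplies received on which RCM is payable?",
]
responses = [
    ["nobody can be registered under gst unless he is liable to be registered under "
     "section 22 of the cgst act, 2017 read with section 2(6) of the sgst act, 2017."],
    ["nobody is liable to pay rcm on inward supplies."],
]

for q, a in pair_answers(questions, responses):
    print(f"Q: {q}\nA: {a}\n")
```

In the notebook, the inputs would be `test_df["Question"].tolist()` and `predictions_instruction_tuning_df["Answer_response"].tolist()`.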

**Predictions on ECD model** [Commented out for this run]

In [None]:
# predictions_ecd_df, output_directory = model_ecd.predict(dataset=test_df)
# print(predictions_ecd_df["Answer_response"].tolist())

## **Observations** 🔎

The fine-tuned model seems to have given decent results. Ludwig's declarative approach provides a clear and concise way to build machine learning models, which makes it a valuable tool for complex domains. It becomes extremely easy to switch between these approaches, change the base LLM, and so on.

## **Resources** 🧺
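To illustrate how little changes when swapping the base LLM, here is a minimal sketch that derives one config from another. `with_base_model` is a hypothetical helper, and the config shown is an abbreviated stand-in for the full ones used earlier in this notebook.

```python
import copy

def with_base_model(config, base_model):
    """Return a copy of an LLM config that targets a different base model."""
    new_config = copy.deepcopy(config)  # leave the original config untouched
    new_config["base_model"] = base_model
    return new_config

# Abbreviated config for illustration; the real one also has prompt, features, and trainer.
llama2_config = {
    "model_type": "llm",
    "base_model": "meta-llama/Llama-2-7b-hf",
    "adapter": {"type": "lora"},
}
mistral_config = with_base_model(llama2_config, "mistralai/Mistral-7B-v0.1")

print(mistral_config["base_model"])  # mistralai/Mistral-7B-v0.1
print(llama2_config["base_model"])   # meta-llama/Llama-2-7b-hf
```

The derived dictionary can then be passed to `LudwigModel(config=...)` in the same way as the configs defined earlier.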
- Fine-tuning Mistral 7B on a Single GPU with Ludwig https://predibase.com/blog/fine-tuning-mistral-7b-on-a-single-gpu-with-ludwig
- Efficient Fine-Tuning for Llama-v2-7b on a Single GPU https://www.youtube.com/watch?v=g68qlo9Izf0
- If you're new to LLMs, check out this webinar where Daliana Liu discusses the 10 things to know about LLMs: https://www.youtube.com/watch?v=fezMHMk7u5o&t=2027s&ab_channel=Predibase
- Ludwig 0.8 Release Blogpost for the full set of new features: https://predibase.com/blog/ludwig-v0-8-open-source-toolkit-to-build-and-fine-tune-custom-llms-on-your-data
- Ludwig Documentation: https://ludwig.ai/latest/