

# Gen AI Assignment - Sourikta Nag


### Installing required libraries for model loading, embedding, retrieval, and generation.


- **transformers**: For loading and interacting with language models.
- **faiss-cpu**: For fast similarity search, essential for retrieval.
- **sentence-transformers**: To create embeddings of text for similarity-based retrieval.
- **torch**: The backend required to run the embedding and generation models.


In [1]:
pip install transformers faiss-cpu sentence-transformers torch


Collecting faiss-cpu
  Downloading faiss_cpu-1.9.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.4 kB)
Downloading faiss_cpu-1.9.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (27.5 MB)
Installing collected packages: faiss-cpu
Successfully installed faiss-cpu-1.9.0


### Loading data

- **Load each file separately**: Read the contents of each file.
- **Combine them into one list**: Merge the contents of all the files into a single list.


[Dataset](https://www.kaggle.com/datasets/akshatgupta7/llm-fine-tuning-dataset-of-indian-legal-texts)


In [2]:
import json

# Load the datasets
with open("/content/constitution_qa.json") as f:
    constitution_data = json.load(f)

with open("/content/crpc_qa.json") as f:
    crpc_data = json.load(f)

with open("/content/ipc_qa.json") as f:
    ipc_data = json.load(f)

# Combine datasets into one list
combined_data = constitution_data + crpc_data + ipc_data

# Check the structure of a few entries
print(combined_data[:5])


[{'question': 'What is India according to the Union and its Territory?', 'answer': 'India, that is Bharat, shall be a Union of States.'}, {'question': 'How is India, that is Bharat, defined in terms of its political structure?', 'answer': 'India, that is Bharat, is defined as a Union of States according to the Union and its Territory.'}, {'question': 'What does the territory of India comprise of?', 'answer': 'The territory of India shall comprise the territories of the States, the Union territories specified in the First Schedule, and such other territories as may be acquired.'}, {'question': 'What does the territory of a country, such as India, comprise of, according to their constitutional provisions?', 'answer': 'The territory of a country like India comprises the territories of the States, the Union territories specified in the First Schedule, and such other territories as may be acquired.'}, {'question': 'Who has the authority to admit or establish new States into the Union?', 'an
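
The load-and-combine step above can also be written as a small loop that checks each record while loading. This is a minimal sketch, not the notebook's own code: the helper name `load_qa_files` is hypothetical, and it assumes every file holds a JSON list of objects with `question` and `answer` keys, as the printed sample suggests.

```python
import json

def load_qa_files(paths):
    """Load several JSON files of question/answer records into one list."""
    combined = []
    for path in paths:
        with open(path, encoding="utf-8") as f:
            records = json.load(f)
        for i, record in enumerate(records):
            # Fail early on malformed entries instead of during preprocessing.
            if not isinstance(record, dict) or not record.keys() >= {"question", "answer"}:
                raise ValueError(f"{path}: entry {i} lacks 'question'/'answer' keys")
        combined.extend(records)
    return combined

# Usage in this notebook would look like:
# combined_data = load_qa_files([
#     "/content/constitution_qa.json",
#     "/content/crpc_qa.json",
#     "/content/ipc_qa.json",
# ])
```

Validating at load time keeps a missing key from surfacing later as a `KeyError` in the preprocessing loop.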

### Preprocessing Data

- **Standardize text formatting**: make all text lowercase, remove extra whitespace.


In [3]:
def standardize_text(text):
    text = text.lower()  # Convert to lowercase
    text = " ".join(text.split())  # Remove extra whitespace
    return text

# Apply standardization to each question and answer
for entry in combined_data:
    entry["question"] = standardize_text(entry["question"])
    entry["answer"] = standardize_text(entry["answer"])
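
As a quick check of what the standardization does, the sketch below repeats the function on a made-up string (the sample text is an assumption, not taken from the dataset).

```python
def standardize_text(text):
    text = text.lower()            # convert to lowercase
    return " ".join(text.split())  # collapse runs of spaces, tabs, and newlines

sample = "  What is   the\tpunishment for THEFT?\n"
print(standardize_text(sample))  # what is the punishment for theft?
```

Note that lowercasing also changes abbreviations such as "CrPC" to "crpc", which is visible in the retrieval outputs later in the notebook.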


- **Remove duplicate or redundant questions**: store unique questions in a dictionary to remove duplicates.


In [4]:
unique_entries = {}
for entry in combined_data:
    question = entry["question"]
    # If question is not in dictionary, add it
    if question not in unique_entries:
        unique_entries[question] = entry

# Convert the dictionary back to a list
cleaned_data = list(unique_entries.values())

print("Original data length:", len(combined_data))
print("Cleaned data length:", len(cleaned_data))


Original data length: 14543
Cleaned data length: 14453
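
The dictionary-based approach keeps the first entry seen for each question and drops later ones, even when their answers differ. A minimal sketch on toy data (the entries and the helper name `deduplicate_by_question` are made up for illustration):

```python
def deduplicate_by_question(entries):
    """Keep the first entry for each distinct question, preserving order."""
    unique = {}
    for entry in entries:
        # setdefault only stores the entry if the question is not yet a key.
        unique.setdefault(entry["question"], entry)
    return list(unique.values())

toy = [
    {"question": "what is theft?", "answer": "first answer"},
    {"question": "what is bail?", "answer": "bail answer"},
    {"question": "what is theft?", "answer": "second answer"},
]
cleaned = deduplicate_by_question(toy)
print(len(toy), "->", len(cleaned))  # 3 -> 2
print(cleaned[0]["answer"])          # first answer
```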


- **Save the cleaned dataset**: save the cleaned data as a new json file.


In [5]:
with open("/content/cleaned_legal_data.json", "w") as f:
    json.dump(cleaned_data, f, indent=4)

print("Cleaned dataset saved as cleaned_legal_data.json")


Cleaned dataset saved as cleaned_legal_data.json


### Create Embeddings for Retrieval

- **Initialize the embedding model**: use a pre-trained Sentence-Transformers model (`all-mpnet-base-v2`) for embeddings.


In [6]:
from sentence_transformers import SentenceTransformer
import numpy as np

embedding_model = SentenceTransformer('all-mpnet-base-v2')



- **Generate embeddings**: create embeddings for each question-answer pair and store them.


In [7]:
texts = [entry["question"] + " " + entry["answer"] for entry in cleaned_data]
embeddings = embedding_model.encode(texts, convert_to_tensor=True)

# Save embeddings and other data for retrieval
np.save("legal_embeddings.npy", embeddings.cpu().numpy())
ids = [str(i) for i in range(len(cleaned_data))]


### Building a FAISS Index for Fast Retrieval

- **Load embeddings and initialize FAISS**: set up the FAISS index using the embeddings.


In [8]:
import faiss

# Define the dimension of the embeddings
embedding_dim = embeddings.shape[1]
index = faiss.IndexFlatL2(embedding_dim)
index.add(embeddings.cpu().numpy())
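
`IndexFlatL2` performs an exact, brute-force search and reports squared Euclidean distances. The sketch below reproduces that idea in plain Python on toy 2-D vectors so the ranking is easy to follow. It is only an illustration of the computation, not how FAISS is implemented, and the function names are hypothetical.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def brute_force_search(vectors, query, k):
    """Return (distances, indices) of the k vectors closest to the query."""
    scored = sorted((squared_l2(v, query), i) for i, v in enumerate(vectors))
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]

vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 3.0]]
distances, indices = brute_force_search(vectors, [0.9, 0.1], k=2)
print(indices)  # [1, 0]
```

Smaller distances mean closer matches, so results come back ordered from most to least similar.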


- **Test retrieval**: test retrieving the top-k most relevant documents based on a sample query.


In [9]:
def retrieve_top_k(query, k=5):
    query_embedding = embedding_model.encode(query, convert_to_tensor=True).cpu().numpy()
    distances, indices = index.search(query_embedding.reshape(1, -1), k)
    return [(cleaned_data[idx]["question"], cleaned_data[idx]["answer"]) for idx in indices[0]]

# Sample test
print(retrieve_top_k("What is the process for arrest as per CrPC?", k=3))


[('which section details the procedure to be followed when a private person makes an arrest?', 'section 43'), ('what should happen when any person is arrested?', 'when any person is arrested, he shall be examined by a medical officer in the service of central or state government.'), ('what is the procedure by magistrate before whom such person arrested is brought according to section 81?', 'procedure by magistrate before whom such person arrested is brought.')]


To use a private model from Hugging Face, authentication is required. I generated a token from my Hugging Face account and used it to access the model. However, I was unable to use that model in this assignment because of insufficient GPU memory, so the login cell below is commented out.

In [10]:
#from huggingface_hub import login
#login(".....")
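
If authentication is needed again, one option is to read the token from an environment variable instead of pasting it into the notebook. This is a hedged sketch: the variable name `HF_TOKEN` and both helper names are assumptions, and the `login` call only runs when `huggingface_hub` is installed and a token is set.

```python
import os

def read_hf_token(env_var="HF_TOKEN"):
    """Return the token stored in the environment, or None if unset or blank."""
    token = os.environ.get(env_var, "").strip()
    return token or None

def login_if_token_available():
    """Log in to the Hugging Face Hub only when a token is configured."""
    token = read_hf_token()
    if token is None:
        return False
    # Imported lazily so the sketch loads even without huggingface_hub installed.
    from huggingface_hub import login
    login(token=token)
    return True
```

Keeping the token out of the notebook avoids leaking it when the notebook is shared.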


### Loading the Language Model and Tokenizer

- **Load flan-t5-base**: load `google/flan-t5-base`, an instruction-tuned T5 model that can handle question answering. Both the model and the tokenizer are loaded to prepare for text generation based on the retrieved context.


In [11]:
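import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Load a smaller Flan-T5 model
model_name = "google/flan-t5-base"
# Use the GPU when one is available; otherwise fall back to the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
model = T5ForConditionalGeneration.from_pretrained(model_name).to(device)
tokenizer = T5Tokenizer.from_pretrained(model_name)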




### Defining the RAG Pipeline to Generate Answers

- **Retrieve context**: calls `retrieve_top_k` to get the top-k relevant question-answer pairs, which serve as context for the model.
- **Format input for T5**: combines the question and the retrieved context into one input string and tokenizes it for the T5 model.
- **Generate answer**: passes the input to the model, which generates an answer based on the question and context.


In [12]:
def generate_answer(question, k=5):
    # Step 1: Retrieve relevant context using FAISS
    context = retrieve_top_k(question, k)
    context_text = " ".join([q + " " + a for q, a in context])

    # Step 2: Prepare input for the T5 model
    input_text = f"question: {question} context: {context_text}"
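    # Move the tokenized input to whichever device the model is on
    inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

    # Step 3: Generate the answer (input_ids and attention_mask are both passed)
    outputs = model.generate(**inputs, max_length=200)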
    answer = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return answer
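
With `k=5` the concatenated context can grow long. The sketch below shows one way to build the same `question: ... context: ...` prompt while capping the context size. It is an illustration under stated assumptions: the helper name `build_prompt` and the 300-word default are hypothetical, and a word count is only a rough proxy for the tokenizer's token count.

```python
def build_prompt(question, pairs, max_context_words=300):
    """Join retrieved (question, answer) pairs into a prompt, in ranked order,
    stopping before the context exceeds the word budget."""
    context_parts = []
    used = 0
    for q, a in pairs:
        words = f"{q} {a}".split()
        if used + len(words) > max_context_words:
            break  # keep whole pairs only; drop the lower-ranked remainder
        context_parts.append(" ".join(words))
        used += len(words)
    context_text = " ".join(context_parts)
    return f"question: {question} context: {context_text}"

pairs = [
    ("what is bail?", "release pending trial."),
    ("what is theft?", "dishonest taking of property."),
]
print(build_prompt("Explain bail", pairs, max_context_words=7))
# question: Explain bail context: what is bail? release pending trial.
```

Because the pairs arrive ranked by distance, cutting from the end removes the least relevant context first.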


### Testing the Model with Questions

In [13]:
question = "What are the rights of an arrested person under CrPC?"
print("Generated Answer:", generate_answer(question))


Generated Answer: the person arrested should be identified.


- **Some sample questions for additional testing**


In [17]:
# Question on the Indian Constitution
question_1 = "What are the fundamental rights guaranteed by the Indian Constitution?"
print("Generated Answer for Question 1:", generate_answer(question_1))

# Question on CrPC (Code of Criminal Procedure)
question_2 = "What is the procedure for granting bail under the CrPC?"
print("Generated Answer for Question 2:", generate_answer(question_2))

# Question on IPC (Indian Penal Code)
question_3 = "What is the punishment for theft under the Indian Penal Code?"
print("Generated Answer for Question 3:", generate_answer(question_3))

# Question on legal definitions
question_4 = "How does the IPC define 'wrongful restraint'?"
print("Generated Answer for Question 4:", generate_answer(question_4))

# Question on judicial powers under CrPC
question_5 = "What are the powers of a magistrate under the CrPC?"
print("Generated Answer for Question 5:", generate_answer(question_5))

# Question on sedition under IPC
question_6 = "What does the Indian Penal Code say about sedition?"
print("Generated Answer for Question 6:", generate_answer(question_6))

# Question on fundamental duties
question_7 = "What are the fundamental duties of Indian citizens according to the Constitution?"
print("Generated Answer for Question 7:", generate_answer(question_7))

# Question on preventive detention
question_8 = "What provisions exist for preventive detention under the Indian Constitution?"
print("Generated Answer for Question 8:", generate_answer(question_8))

# Question on evidence collection
question_9 = "What are the rules regarding evidence collection under CrPC?"
print("Generated Answer for Question 9:", generate_answer(question_9))

# Question on legal immunity
question_10 = "Who has immunity from legal proceedings under the Indian Constitution?"
print("Generated Answer for Question 10:", generate_answer(question_10))

# Additional questions

# Question on the right to life
question_11 = "What is the significance of the right to life under Article 21 of the Constitution?"
print("Generated Answer for Question 11:", generate_answer(question_11))

# Question on right to information
question_12 = "What rights are provided under the Right to Information Act?"
print("Generated Answer for Question 12:", generate_answer(question_12))

# Question on public nuisance under IPC
question_13 = "How does the IPC define public nuisance?"
print("Generated Answer for Question 13:", generate_answer(question_13))

# Question on appeals in CrPC
question_14 = "What is the process for filing an appeal under the CrPC?"
print("Generated Answer for Question 14:", generate_answer(question_14))

# Question on dowry prohibition
question_15 = "What does the law say about dowry under the Dowry Prohibition Act?"
print("Generated Answer for Question 15:", generate_answer(question_15))

# Question on criminal conspiracy
question_16 = "How does the IPC define criminal conspiracy?"
print("Generated Answer for Question 16:", generate_answer(question_16))

# Question on custodial violence
question_17 = "What are the legal protections against custodial violence in India?"
print("Generated Answer for Question 17:", generate_answer(question_17))

# Question on anticipatory bail
question_18 = "What is anticipatory bail and how can it be obtained under CrPC?"
print("Generated Answer for Question 18:", generate_answer(question_18))

# Question on the legal definition of a contract
question_19 = "What constitutes a contract under the Indian Contract Act?"
print("Generated Answer for Question 19:", generate_answer(question_19))

# Question on contempt of court
question_20 = "What are the types of contempt of court recognized under Indian law?"
print("Generated Answer for Question 20:", generate_answer(question_20))

# Question on juvenile justice
question_21 = "What are the provisions for juvenile offenders under the Juvenile Justice Act?"
print("Generated Answer for Question 21:", generate_answer(question_21))

# Question on plea bargaining
question_22 = "What is the concept of plea bargaining under Indian criminal law?"
print("Generated Answer for Question 22:", generate_answer(question_22))

# Question on rights of women
question_23 = "What legal protections are provided to women against domestic violence?"
print("Generated Answer for Question 23:", generate_answer(question_23))

# Question on property rights
question_24 = "What are the property rights of women under the Hindu Succession Act?"
print("Generated Answer for Question 24:", generate_answer(question_24))

# Question on the right to education
question_25 = "What rights are guaranteed under the Right to Education Act?"
print("Generated Answer for Question 25:", generate_answer(question_25))


Generated Answer for Question 1: freedom of speech and expression
Generated Answer for Question 2: give notice of the application for bail to the public prosecutor
Generated Answer for Question 3: imprisonment for life
Generated Answer for Question 4: the obstruction of a private way over land or water which a person in good faith believes himself to have a lawful right to obstruct
Generated Answer for Question 5: to direct local investigation and examination
Generated Answer for Question 6: offences constituted by an act in respect of which a complaint may be made under section 20 of the cattle-trespass act
Generated Answer for Question 7: administrative service
Generated Answer for Question 8: any law providing for preventive detention
Generated Answer for Question 9: no
Generated Answer for Question 10: the governor or rajpramukh of a state
Generated Answer for Question 11: the right to life and personal liberty means that no person shall be deprived of his life or personal liberty 
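
The repeated `print` calls in the cell above could be condensed into a loop over a list of questions. This is a minimal sketch: `run_questions` is a hypothetical helper, and `echo_answer` is a stand-in for `generate_answer` so the example runs without the model.

```python
def run_questions(questions, answer_fn):
    """Print a numbered answer for each question and return the pairs."""
    results = []
    for number, question in enumerate(questions, start=1):
        answer = answer_fn(question)
        print(f"Generated Answer for Question {number}: {answer}")
        results.append((question, answer))
    return results

# Stand-in for generate_answer; replace with the real function in the notebook.
def echo_answer(question):
    return f"(answer to: {question})"

questions = [
    "What is the procedure for granting bail under the CrPC?",
    "How does the IPC define 'wrongful restraint'?",
]
results = run_questions(questions, echo_answer)
```

Returning the question-answer pairs also makes it easier to save or review the outputs later.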

### Conclusion

- Each code block performs a specific part of the RAG process:
  - **Setup**: install dependencies and load data.
  - **Data preprocessing**: clean and standardize the data.
  - **Embedding and Indexing**: create embeddings and a FAISS index for similarity-based retrieval.
  - **Answer generation**: combine retrieval with language generation to produce responses grounded in the retrieved context.
