<a href="https://colab.research.google.com/github/themodernturing/pakistan-penal-code-qa/blob/main/pakistan_panel_code.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## 📝 Introduction
This notebook builds a question-answering system that uses the Pakistan Penal Code as its source of information. Users can ask legal questions and receive answers grounded in the text of the Code.

The notebook combines several tools and techniques from machine learning and natural language processing (NLP):

## 🔍 PDF Text Extraction
Using PyMuPDF (fitz), the notebook extracts the full text content from the Pakistan Penal Code PDF document. This allows us to convert the legal text into a machine-readable format for downstream processing.

## ✂️ Text Chunking
Legal documents are long, so the full text is split into smaller, overlapping chunks (1,000 characters with a 200-character overlap) using LangChain’s RecursiveCharacterTextSplitter. Small chunks stay within what the embedding and language models can process, and the overlap helps preserve context that would otherwise be cut at a chunk boundary.
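
To make the idea of overlapping chunks concrete, here is a minimal sketch of a fixed-size sliding window. It is a simplification and not LangChain's algorithm: RecursiveCharacterTextSplitter first tries to split on separators such as paragraph and line breaks. The function name `sliding_chunks` and the sample string are hypothetical.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size windows whose neighbours share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
pieces = sliding_chunks(sample, chunk_size=10, chunk_overlap=4)
for piece in pieces:
    print(piece)
```

Each printed window starts with the last four characters of the previous one, which is the effect the overlap setting aims for.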

## 🧠 Semantic Embeddings & Vector Store
The notebook uses HuggingFace sentence-transformers to generate embeddings for each text chunk. These embeddings capture the semantic meaning of the text and are stored in a FAISS vector store for efficient similarity search.
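
The similarity search can be pictured with a small pure-Python sketch. The three-dimensional vectors and chunk labels below are made up for illustration; real embeddings have hundreds of dimensions. The sketch uses cosine similarity for readability, while FAISS supports several distance metrics and the notebook's index may use a different one.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], chunk_vecs: list[list[float]], k: int = 2) -> list[int]:
    """Return the indices of the k chunk vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), i) for i, vec in enumerate(chunk_vecs)]
    scored.sort(reverse=True)  # highest similarity first
    return [i for _, i in scored[:k]]


# Hypothetical toy data standing in for embedded chunks
toy_chunks = ["theft punishment", "definition of judge", "theft in dwelling house"]
toy_vectors = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.9], [0.8, 0.3, 0.1]]
query_vector = [1.0, 0.0, 0.0]

best = top_k(query_vector, toy_vectors, k=2)
print([toy_chunks[i] for i in best])
```

A vector store such as FAISS performs the same kind of ranking, but with index structures that keep it fast for many thousands of vectors.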

## 🔁 Retrieval-Based QA Pipeline
Using LangChain’s RetrievalQA chain, the system retrieves the sections of the document most relevant to a user query and passes them to a pretrained transformer model (google/flan-t5-base), which generates a natural-language answer. The answer is therefore grounded in the retrieved text rather than in the model's general knowledge alone.
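
As a rough picture of what a "stuff"-style chain does, the sketch below concatenates retrieved chunks and the question into one prompt. The function name `build_stuff_prompt` and the prompt wording are hypothetical; LangChain uses its own template.

```python
def build_stuff_prompt(question: str, retrieved_chunks: list[str]) -> str:
    """Place every retrieved chunk into a single prompt ahead of the question."""
    context = "\n\n".join(retrieved_chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Hypothetical chunks standing in for retriever output
example_prompt = build_stuff_prompt(
    "What is the punishment for theft?",
    ["Chunk about theft.", "Chunk about fines."],
)
print(example_prompt)
```

Because all chunks go into one prompt, its length grows with the number and size of the retrieved chunks, which matters for models with a short input limit.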

## 🌐 Web Interface
Gradio provides a simple web-based UI where users enter legal queries and receive answers in real time. This makes the system accessible to non-technical users such as law students, researchers, and the general public.



In [None]:
!pip install pymupdf




In [None]:
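import fitz  # PyMuPDF; available after the pip install above

def extract_text_from_pdf(file_path="/content/Pakistan Panel Code.pdf"):
    """Return the text of every page, each preceded by a '--- Page N ---' marker."""
    pages = []
    # The context manager closes the PDF once all pages have been read
    with fitz.open(file_path) as doc:
        for page_num, page in enumerate(doc, start=1):
            text = page.get_text()
            pages.append(f"\n\n--- Page {page_num} ---\n\n{text}")
    return "".join(pages)

# Usage
pdf_text = extract_text_from_pdf()
print(pdf_text[:1000])  # Preview the first 1,000 characters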




--- Page 1 ---

THE PAKISTAN PENAL CODE,1860
Last Amended on 2017-02-16
CONTENTS
SECTIONS:
CHAPTER I
INTRODUCTION
1.
Title and extent of operation of the Code.
2.
Punishment of offences committed within Pakistan.
3.
Punishment of offences committed beyond, but which by law may be tried
within, Pakistan.
4.
Extension of Code to extra-territorial offences.
5.
Certain laws not to be affected by this Act.
CHAPTER II
GENERAL EXPLANATIONS
6.
Definitions in the Code to be understood subject to exceptions.
7.
Sense of expression once explained.
8.
Gender.
9.
Number.
10.
"Man." "Woman."
11.
"Person."
12.
"Public."
13.
[Omitted.]
14.
"Servant of the State."
15.
[Repealed.]
16.
[Repealed.]
17.
"Government."
18.
[Repealed.]
19.
"Judge."
20.
"Court of Justice."
21.
"Public servant."
22.
"Moveable property."
23.
"Wrongful gain." 
Page 1 of 178


--- Page 2 ---

"Wrongful loss." 
 
Gaining wrongfully. Losing wrongfully.
24.
"Dishonestly."
25.
"Fraudulently."
26.
"Reason to believe."
27.
Property in
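
The preview shows that each page ends with a footer such as "Page 1 of 178", which adds noise to the chunks. As a minimal sketch, the extracted text could be split back into pages on the "--- Page N ---" markers and the footers removed before chunking. The names `split_pages`, `PAGE_MARKER`, and `FOOTER` are hypothetical, and the footer pattern is an assumption based only on the preview above.

```python
import re

# Matches the marker that extract_text_from_pdf writes before each page
PAGE_MARKER = re.compile(r"\n\n--- Page (\d+) ---\n\n")
# Assumed footer format, taken from the preview ("Page 1 of 178")
FOOTER = re.compile(r"^Page \d+ of \d+[ \t]*$", re.MULTILINE)


def split_pages(full_text: str) -> dict[int, str]:
    """Map page number to page text, with footer lines removed."""
    # re.split with a capturing group yields:
    # [text before first marker, "1", page 1 text, "2", page 2 text, ...]
    parts = PAGE_MARKER.split(full_text)
    pages = {}
    for number, body in zip(parts[1::2], parts[2::2]):
        pages[int(number)] = FOOTER.sub("", body).strip()
    return pages


sample_text = (
    "\n\n--- Page 1 ---\n\n"
    '"Wrongful gain." \nPage 1 of 178\n'
    "\n\n--- Page 2 ---\n\n"
    '"Wrongful loss." \n'
)
print(split_pages(sample_text))
```

Keeping the page number alongside each page's text would also make it possible to cite the page an answer came from.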

In [None]:
!pip install langchain




In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Assuming pdf_text holds your full document
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = text_splitter.split_text(pdf_text)

print(f"Total chunks: {len(chunks)}")
print(chunks[0])  # Preview the first chunk


Total chunks: 851
--- Page 1 ---

THE PAKISTAN PENAL CODE,1860
Last Amended on 2017-02-16
CONTENTS
SECTIONS:
CHAPTER I
INTRODUCTION
1.
Title and extent of operation of the Code.
2.
Punishment of offences committed within Pakistan.
3.
Punishment of offences committed beyond, but which by law may be tried
within, Pakistan.
4.
Extension of Code to extra-territorial offences.
5.
Certain laws not to be affected by this Act.
CHAPTER II
GENERAL EXPLANATIONS
6.
Definitions in the Code to be understood subject to exceptions.
7.
Sense of expression once explained.
8.
Gender.
9.
Number.
10.
"Man." "Woman."
11.
"Person."
12.
"Public."
13.
[Omitted.]
14.
"Servant of the State."
15.
[Repealed.]
16.
[Repealed.]
17.
"Government."
18.
[Repealed.]
19.
"Judge."
20.
"Court of Justice."
21.
"Public servant."
22.
"Moveable property."
23.
"Wrongful gain." 
Page 1 of 178


--- Page 2 ---


In [None]:
!pip install -U langchain langchain-community
!pip install -U openai tiktoken faiss-cpu
!pip install sentence-transformers



In [None]:
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

embedding_model = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

vectorstore = FAISS.from_texts(chunks, embedding=embedding_model)

# Optional: save the vectorstore
vectorstore.save_local("pakistan_penal_code_index")




In [None]:
from langchain.chains import RetrievalQA
from langchain_community.llms import HuggingFacePipeline
from transformers import pipeline

# Load FLAN-T5 locally with Hugging Face Transformers
qa_pipeline = pipeline("text2text-generation", model="google/flan-t5-base", max_length=512)

# Wrap the pipeline so LangChain can call it as an LLM
llm = HuggingFacePipeline(pipeline=qa_pipeline)

# chain_type="stuff" places all retrieved chunks into a single prompt
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(),
    return_source_documents=True
)

query = "What is the punishment for theft in the Pakistan Penal Code?"
result = qa_chain.invoke({"query": query})  # invoke() replaces the deprecated direct call

print("Answer:", result["result"])


Device set to use cpu
Token indices sequence length is longer than the specified maximum sequence length for this model (832 > 512). Running this sequence through the model will result in indexing errors


Answer: imprisonment of either description for a term which may extend to five years, and shall also be liable to fine
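
The tokenizer warning in the output above (832 > 512) indicates that the assembled prompt was longer than the model's maximum input length. One possible mitigation, sketched here under stated assumptions, is to keep only as many retrieved chunks as fit a token budget. The function `fit_chunks_to_budget` is hypothetical, and its default token counter (whitespace-separated words) is only a crude stand-in for the model's real tokenizer.

```python
from typing import Callable


def fit_chunks_to_budget(
    chunks: list[str],
    max_tokens: int,
    count_tokens: Callable[[str], int] = lambda text: len(text.split()),
) -> list[str]:
    """Keep chunks in ranked order until adding the next one would exceed max_tokens."""
    selected = []
    used = 0
    for chunk in chunks:
        cost = count_tokens(chunk)
        if used + cost > max_tokens:
            break  # stop at the first chunk that does not fit
        selected.append(chunk)
        used += cost
    return selected


# Hypothetical ranked chunks; the word counts are 3, 2 and 4
ranked = ["one two three", "four five", "six seven eight nine"]
kept = fit_chunks_to_budget(ranked, max_tokens=5)
print(kept)
```

A simpler alternative is to retrieve fewer chunks in the first place, for example with `vectorstore.as_retriever(search_kwargs={"k": 2})`, or to use a smaller `chunk_size` when splitting.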


In [None]:
import fitz  # PyMuPDF
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.chains import RetrievalQA
from langchain_community.llms import HuggingFacePipeline
from transformers import pipeline
import gradio as gr

# Step 1: Extract text from PDF
def extract_text_from_pdf(file_path):
    doc = fitz.open(file_path)
    full_text = ""
    for page_num, page in enumerate(doc, start=1):
        text = page.get_text()
        full_text += f"\n\n--- Page {page_num} ---\n\n{text}"
    return full_text

# Step 2: Build everything from PDF
def build_qa_system(pdf_path):
    print("📄 Extracting text...")
    full_text = extract_text_from_pdf(pdf_path)

    print("✂️ Splitting text...")
    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
    chunks = splitter.split_text(full_text)

    print("🧠 Embedding & indexing...")
    embedder = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
    vectordb = FAISS.from_texts(chunks, embedding=embedder)

    print("🤖 Loading local LLM...")
    hf_pipeline = pipeline("text2text-generation", model="google/flan-t5-base", max_length=512)
    llm = HuggingFacePipeline(pipeline=hf_pipeline)

    print("🔗 Creating QA chain...")
    qa = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",
        retriever=vectordb.as_retriever(),
        return_source_documents=True
    )
    return qa

# Step 3: Interface logic
pdf_path = "/content/Pakistan Panel Code.pdf"
qa_chain = build_qa_system(pdf_path)

def answer_question(query):
    result = qa_chain.invoke({"query": query})
    return result["result"]

# Step 4: Launch Gradio UI
demo = gr.Interface(
    fn=answer_question,
    inputs=gr.Textbox(label="Ask a question about the Pakistan Penal Code"),
    outputs=gr.Textbox(label="Answer"),
    title="Pakistan Penal Code Chatbot",
    description="Ask legal questions based on the Pakistan Penal Code PDF file."
)

demo.launch(share=True)


📄 Extracting text...
✂️ Splitting text...
🧠 Embedding & indexing...
🤖 Loading local LLM...


Device set to use cpu


🔗 Creating QA chain...
Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://c6464a16238a5c024d.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)
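
The chain is built with `return_source_documents=True`, but the interface returns only the answer text. As a minimal sketch, the retrieved passages could be shown next to the answer so users can check it against the Code. The function `format_answer` and the `FakeDocument` stand-in are hypothetical; the sketch assumes only that each source document exposes a `page_content` string, as LangChain documents do.

```python
from dataclasses import dataclass


@dataclass
class FakeDocument:
    """Stand-in for a LangChain Document; only page_content is used here."""
    page_content: str


def format_answer(result: dict, preview_chars: int = 80) -> str:
    """Render the answer followed by a short preview of each source chunk."""
    lines = [f"Answer: {result['result']}", "Sources:"]
    for i, doc in enumerate(result.get("source_documents", []), start=1):
        # Collapse whitespace so multi-line chunks fit on one preview line
        snippet = " ".join(doc.page_content.split())[:preview_chars]
        lines.append(f"  [{i}] {snippet}")
    return "\n".join(lines)


# Hypothetical result shaped like the chain's output dictionary
fake_result = {
    "result": "imprisonment and fine",
    "source_documents": [FakeDocument("Whoever commits theft\nshall be punished")],
}
print(format_answer(fake_result))
```

In the Gradio function, `format_answer(result)` could be returned in place of `result["result"]`.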




In [None]:
!pip install gradio  # run this before the cell above that imports gradio

