<a href="https://colab.research.google.com/github/themodernturing/pakistan-penal-code-qa/blob/main/pakistan_panel_code.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## 📝 Introduction
This notebook builds an AI-powered question-answering system that uses the Pakistan Penal Code as its primary source of information. The goal is to let users ask legal questions and receive context-aware answers grounded in the text of the Penal Code.

The notebook combines several tools and techniques from machine learning and natural language processing (NLP), each described in the sections that follow.

## 🔍 PDF Text Extraction
Using PyMuPDF (fitz), the notebook extracts the full text content from the Pakistan Penal Code PDF document. This allows us to convert the legal text into a machine-readable format for downstream processing.

## ✂️ Text Chunking
Because legal documents are long and detailed, the full text is split into smaller, overlapping chunks (1,000 characters with a 200-character overlap in this notebook) using LangChain’s RecursiveCharacterTextSplitter. Each chunk is small enough to embed and retrieve individually, and the overlap reduces the chance that a provision is cut off at a chunk boundary.
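To make the idea of overlapping chunks concrete, here is a minimal sketch that uses a plain fixed-size sliding window. It is a simplification: RecursiveCharacterTextSplitter also tries to break on separators such as paragraph and line breaks, which this sketch ignores. The function name `chunk_text` and the sample string are assumptions made for illustration.

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split text into fixed-size windows that overlap by chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# Tiny sample so the overlap is easy to see (illustrative values only)
sample = "abcdefghijklmnopqrstuvwxyz"
demo_chunks = chunk_text(sample, chunk_size=10, chunk_overlap=4)
print(demo_chunks)
# ['abcdefghij', 'ghijklmnop', 'mnopqrstuv', 'stuvwxyz']
```

Each chunk repeats the last four characters of the previous one, so text that straddles a boundary still appears whole in at least one chunk.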

## 🧠 Semantic Embeddings & Vector Store
The notebook uses a HuggingFace sentence-transformers model (all-MiniLM-L6-v2) to generate an embedding for each text chunk. These embeddings capture the semantic meaning of the text and are stored in a FAISS vector store for efficient similarity search.
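Conceptually, similarity search means finding the stored vectors closest to the query vector. The sketch below does this by brute force in pure Python with Euclidean (L2) distance, which, as far as I know, is the default metric of LangChain's FAISS wrapper. The three-dimensional vectors and the labels are invented for illustration; real all-MiniLM-L6-v2 embeddings have 384 dimensions, and FAISS performs the same search far more efficiently.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query_vec, indexed, k=2):
    """Return the k (text, distance) pairs closest to query_vec."""
    scored = [(text, l2_distance(query_vec, vec)) for text, vec in indexed]
    return sorted(scored, key=lambda pair: pair[1])[:k]


# Toy "embeddings": made-up numbers, not output of a real model
toy_index = [
    ("Section on theft", [0.9, 0.1, 0.0]),
    ("Section on defamation", [0.0, 0.2, 0.9]),
    ("Section on robbery", [0.8, 0.3, 0.1]),
]
query_vec = [1.0, 0.0, 0.0]  # stands in for an embedded question about theft

nearest = top_k(query_vec, toy_index, k=2)
print([text for text, _ in nearest])
# ['Section on theft', 'Section on robbery']
```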

## 🔁 Retrieval-Based QA Pipeline
Using LangChain’s RetrievalQA chain, the system retrieves the sections of the document most relevant to a user query and passes them to a pretrained transformer model (google/flan-t5-base in this notebook) to generate a natural-language answer. This grounds the response in the document rather than in the model's general knowledge alone.
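The notebook builds its chain with `chain_type="stuff"`, which means the retrieved chunks are concatenated ("stuffed") into a single prompt together with the question. The sketch below illustrates that idea only. The template wording and the function name `build_stuffed_prompt` are assumptions, not LangChain's actual prompt, and the chunk strings are placeholders rather than text from the Penal Code.

```python
def build_stuffed_prompt(question: str, retrieved_chunks: list[str]) -> str:
    """Join all retrieved chunks into one context block and append the question."""
    context = "\n\n".join(retrieved_chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Placeholder chunks standing in for whatever the retriever returns
placeholder_chunks = ["<text of retrieved chunk 1>", "<text of retrieved chunk 2>"]
prompt = build_stuffed_prompt(
    "What is the punishment for theft in the Pakistan Penal Code?",
    placeholder_chunks,
)
print(prompt)
print(f"Prompt length: {len(prompt)} characters")
```

Because every retrieved chunk goes into one prompt, the prompt grows with both the chunk size and the number of chunks retrieved, which matters for models with a short input window.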

## 🌐 Web Interface
With Gradio, a simple and interactive web-based UI can be created, enabling users to input legal queries and receive real-time answers. This makes the system accessible to non-technical users like law students, researchers, and the general public.



In [None]:
!pip install pymupdf


In [None]:
import fitz  # PyMuPDF; importable once the install cell above has run

def extract_text_from_pdf(file_path="/content/Pakistan Panel Code.pdf"):
    doc = fitz.open(file_path)
    full_text = ""
    for page_num, page in enumerate(doc, start=1):
        text = page.get_text()
        full_text += f"\n\n--- Page {page_num} ---\n\n{text}"
    return full_text

# Usage
pdf_text = extract_text_from_pdf()
print(pdf_text[:1000])  # Preview


In [None]:
!pip install langchain


In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Assuming pdf_text holds your full document
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = text_splitter.split_text(pdf_text)

print(f"Total chunks: {len(chunks)}")
print(chunks[0])  # Preview the first chunk


In [None]:
!pip install -U langchain langchain-community
!pip install -U faiss-cpu sentence-transformers


In [None]:
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

embedding_model = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

vectorstore = FAISS.from_texts(chunks, embedding=embedding_model)

# Optional: save the vectorstore
vectorstore.save_local("pakistan_penal_code_index")



In [None]:
from langchain.chains import RetrievalQA
from langchain_community.llms import HuggingFacePipeline
from transformers import pipeline

# Load locally using Hugging Face Transformers
qa_pipeline = pipeline("text2text-generation", model="google/flan-t5-base", max_length=512)

llm = HuggingFacePipeline(pipeline=qa_pipeline)

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(),
    return_source_documents=True
)

query = "What is the punishment for theft in the Pakistan Penal Code?"
result = qa_chain.invoke({"query": query})

print("Answer:", result["result"])
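The chain above is created with `return_source_documents=True`, so the result should also carry the retrieved chunks under `result["source_documents"]`. Since the extraction step writes `--- Page N ---` markers into the text, those markers can hint at where a chunk came from. The helpers below are a hypothetical sketch. The names `pages_in_chunk` and `summarize_sources` are assumptions, and stand-in objects replace real LangChain documents so the example runs on its own.

```python
import re
from types import SimpleNamespace

# Matches the markers written by extract_text_from_pdf, e.g. "--- Page 12 ---"
PAGE_MARKER = re.compile(r"--- Page (\d+) ---")


def pages_in_chunk(chunk_text: str) -> list[int]:
    """Return the page numbers whose start markers appear inside a chunk."""
    return [int(n) for n in PAGE_MARKER.findall(chunk_text)]


def summarize_sources(source_documents, preview_chars=80):
    """Build one summary line per retrieved document."""
    lines = []
    for i, doc in enumerate(source_documents, start=1):
        pages = pages_in_chunk(doc.page_content)
        preview = " ".join(doc.page_content.split())[:preview_chars]
        lines.append(f"[{i}] page markers: {pages or 'none'} | {preview}")
    return lines


# Stand-in objects with a page_content attribute (placeholder text only)
fake_sources = [
    SimpleNamespace(page_content="\n\n--- Page 12 ---\n\nplaceholder text A"),
    SimpleNamespace(page_content="placeholder text B"),
]
summary = summarize_sources(fake_sources)
for line in summary:
    print(line)
```

In the notebook this would be called as `summarize_sources(result["source_documents"])`. A chunk without a marker starts partway through a page, so a missing marker does not mean the page is unknown, only that no page begins inside that chunk.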


In [None]:
import fitz  # PyMuPDF
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.chains import RetrievalQA
from langchain_community.llms import HuggingFacePipeline
from transformers import pipeline
import gradio as gr

# Step 1: Extract text from PDF
def extract_text_from_pdf(file_path):
    doc = fitz.open(file_path)
    full_text = ""
    for page_num, page in enumerate(doc, start=1):
        text = page.get_text()
        full_text += f"\n\n--- Page {page_num} ---\n\n{text}"
    return full_text

# Step 2: Build everything from PDF
def build_qa_system(pdf_path):
    print("📄 Extracting text...")
    full_text = extract_text_from_pdf(pdf_path)

    print("✂️ Splitting text...")
    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
    chunks = splitter.split_text(full_text)

    print("🧠 Embedding & indexing...")
    embedder = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
    vectordb = FAISS.from_texts(chunks, embedding=embedder)

    print("🤖 Loading local LLM...")
    hf_pipeline = pipeline("text2text-generation", model="google/flan-t5-base", max_length=512)
    llm = HuggingFacePipeline(pipeline=hf_pipeline)

    print("🔗 Creating QA chain...")
    qa = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",
        retriever=vectordb.as_retriever(),
        return_source_documents=True
    )
    return qa

# Step 3: Interface logic
pdf_path = "/content/Pakistan Panel Code.pdf"
qa_chain = build_qa_system(pdf_path)

def answer_question(query):
    result = qa_chain.invoke({"query": query})
    return result["result"]

# Step 4: Launch Gradio UI
demo = gr.Interface(
    fn=answer_question,
    inputs=gr.Textbox(label="Ask a question about the Pakistan Penal Code"),
    outputs=gr.Textbox(label="Answer"),
    title="Pakistan Penal Code Chatbot",
    description="Ask legal questions based on the Pakistan Penal Code PDF file."
)

demo.launch(share=True)


In [None]:
# Run this cell before the Gradio app cell above; that cell imports gradio.
!pip install gradio
