<a href="https://colab.research.google.com/github/siddharth0517/AgriDocQA/blob/main/general_Document_QA_system.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
pip install langchain sentence-transformers faiss-cpu pypdf python-docx

Collecting faiss-cpu
  Downloading faiss_cpu-1.12.0-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (5.1 kB)
Collecting pypdf
  Downloading pypdf-6.0.0-py3-none-any.whl.metadata (7.1 kB)
Collecting python-docx
  Downloading python_docx-1.2.0-py3-none-any.whl.metadata (2.0 kB)
Downloading faiss_cpu-1.12.0-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (31.4 MB)
Downloading pypdf-6.0.0-py3-none-any.whl (310 kB)
Downloading python_docx-1.2.0-py3-none-any.whl (252 kB)
Installing collected packages: python-docx, pypdf, faiss-cpu
Successfully installed faiss-cpu-1.12.0 pypdf-6.0.0 pytho

**Connecting To LLM**

In [9]:
from google.colab import userdata

api_key = userdata.get("llama")

In [10]:
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=api_key,
)

**Document Loading For Multiple File Types**

In [5]:
from langchain_community.document_loaders import PyPDFLoader, TextLoader, UnstructuredWordDocumentLoader

def load_document(file_path):
    """Load a PDF, TXT, or DOCX file into a list of LangChain documents."""
    # Compare against a lower-cased path so "REPORT.PDF" is also accepted.
    lowered = file_path.lower()
    if lowered.endswith(".pdf"):
        loader = PyPDFLoader(file_path)
    elif lowered.endswith(".txt"):
        loader = TextLoader(file_path)
    elif lowered.endswith(".docx"):
        # Note: this loader depends on the `unstructured` package.
        loader = UnstructuredWordDocumentLoader(file_path)
    else:
        raise ValueError(f"Unsupported file format: {file_path}")
    return loader.load()

**Chunking the Document**

In [6]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

def chunk_documents(documents, chunk_size=1000, chunk_overlap=300):
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap
    )
    return splitter.split_documents(documents)
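
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of overlapping windows in plain Python. It is a simplified stand-in, not the algorithm `RecursiveCharacterTextSplitter` uses: the real splitter first tries to break on separators such as paragraphs and lines, so its chunk boundaries will differ. The function name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text, chunk_size=1000, chunk_overlap=300):
    """Split text into fixed-width character windows that overlap.

    Simplified illustration only: no attempt is made to break on
    paragraph, line, or word boundaries.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
for piece in sliding_window_chunks(sample, chunk_size=10, chunk_overlap=3):
    print(piece)
# prints:
# abcdefghij
# hijklmnopq
# opqrstuvwx
# vwxyz
```

Each chunk repeats the last three characters of the previous one. With the defaults above (1000 and 300), roughly 30% of every chunk is shared with its neighbour, which helps keep sentences that straddle a boundary retrievable at the cost of a larger index.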


**Embedding Chunks and Creating Index using FAISS**

In [11]:
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

def create_index(chunks):
    embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
    index = FAISS.from_documents(chunks, embeddings)
    return index

**Query and Retrieval**

In [12]:
def query_index(index, query, k=3):
    return index.similarity_search(query, k=k)
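
Conceptually, `similarity_search` embeds the query and returns the `k` chunks whose vectors lie closest to it. The sketch below shows that idea with a brute-force search in plain Python, assuming the index compares vectors by Euclidean (L2) distance, which to my understanding is the default for LangChain's FAISS wrapper. The two-dimensional vectors and the name `top_k_by_l2` are made up for illustration; real `all-MiniLM-L6-v2` embeddings have 384 dimensions.

```python
import math


def top_k_by_l2(query_vec, doc_vecs, k=3):
    """Return the indices of the k vectors closest to query_vec (Euclidean)."""
    distances = [
        (math.dist(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)
    ]
    distances.sort()  # smallest distance first
    return [idx for _, idx in distances[:k]]


# Toy stand-ins for chunk embeddings (assumed values, not real model output).
toy_chunk_vectors = [[0.0, 1.0], [1.0, 0.0], [0.9, 0.1], [-1.0, 0.0]]
toy_query_vector = [1.0, 0.2]

print(top_k_by_l2(toy_query_vector, toy_chunk_vectors, k=2))  # prints [2, 1]
```

FAISS performs the same comparison far more efficiently, but the result has the same shape: a ranked list of the nearest chunks.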

**Generating Answer Using LLM**

In [13]:
def generate_answer(client, query, retrieved_chunks):
    # Label each retrieved chunk so the model can tell the excerpts apart.
    context = "\n\n".join(
        f"Chunk {i+1}:\n{doc.page_content}" for i, doc in enumerate(retrieved_chunks)
    )

    prompt = f"""You are a helpful assistant answering questions based on the following document excerpts:

{context}

User's question: {query}

Provide a clear and concise answer based on the document."""

    response = client.chat.completions.create(
        model="meta-llama/llama-4-maverick:free",
        messages=[
            {"role": "system", "content": "You are a helpful document assistant."},
            {"role": "user", "content": prompt}
        ],
        max_tokens=300,  # caps the reply length; longer answers are cut off at this limit
        temperature=0.3  # low temperature for more deterministic output
    )

    return response.choices[0].message.content.strip()
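
Because `max_tokens=300` caps the reply, a long answer can stop mid-sentence. The sample output later in this notebook ends abruptly, which is consistent with hitting that cap. One way to detect this is to check `finish_reason` on the returned choice, which the OpenAI-style chat completions response sets to `"length"` when the token limit ended the generation. The helper name `extract_answer` and the fake response object are hypothetical, used only so the sketch runs without a network call.

```python
from types import SimpleNamespace


def extract_answer(response):
    """Return (text, was_truncated) from a chat completion response."""
    choice = response.choices[0]
    text = (choice.message.content or "").strip()  # content may be None
    was_truncated = choice.finish_reason == "length"
    return text, was_truncated


# Hypothetical stand-in shaped like a chat completion response.
fake_response = SimpleNamespace(
    choices=[
        SimpleNamespace(
            message=SimpleNamespace(content="  Basal application of FYM and "),
            finish_reason="length",
        )
    ]
)

text, was_truncated = extract_answer(fake_response)
if was_truncated:
    print("Answer hit the max_tokens limit; consider raising it.")
```

In `generate_answer`, the same check could be applied to `response` before returning, for example to append a note to the answer or to retry with a larger `max_tokens`.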


In [14]:
def document_qa_pipeline(file_path, query):
    documents = load_document(file_path)
    chunks = chunk_documents(documents)

    index = create_index(chunks)

    retrieved_chunks = query_index(index, query, k=3)

    answer = generate_answer(client, query, retrieved_chunks)

    return answer

In [21]:
query = "What are the common pests for cotton and how can I manage them?"
file_path = "AGRICULTURE.pdf"
answer = document_qa_pipeline(file_path, query)
print(answer)

Based on the document, some of the common pests that affect cotton crops include:

1. Thrips
2. Aphids
3. Leafhopper
4. Mite
5. Boll-worms (Spotted, Spiny, Pink, and Helicoverpa/ American bollworm)
6. Whiteflies
7. Stem weevil
8. Tobacco cutworm

To manage these pests, the following strategies can be employed:

1. **Monitoring**: Intensify pest monitoring through light traps, pheromone traps, and in-situ assessments at farm, village, block, regional, and State levels.
2. **Action Threshold**: Adopt an action threshold of one egg per plant or 1 larva per plant for management.
3. **Cultural Practices**:
	* Synchronized sowing of cotton with short duration varieties.
	* Avoid continuous cropping of cotton and monocropping.
	* Grow less preferred crops like greengram, blackgram, soyabean, castor, sorghum, etc., as intercrops or border crops.
	* Remove and destroy crop residues to prevent pest carryover.
4. **Specific Management Practices**:
	* For Stem weevil: Basal application of FYM and 

**UI using Gradio**

In [17]:
import gradio as gr

In [18]:
def document_qa(file, query):
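    # Depending on the Gradio version and the File component's `type` setting,
    # `file` is either an object with a `.name` path or a plain path string.
    file_path = getattr(file, "name", file)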
    documents = load_document(file_path)
    chunks = chunk_documents(documents)
    index = create_index(chunks)
    retrieved_chunks = query_index(index, query)
    answer = generate_answer(client, query, retrieved_chunks)
    return answer
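
As written, `document_qa` reloads, re-chunks, and re-embeds the document on every question, even when the same file is asked about repeatedly. A minimal sketch of one way to avoid that is to cache the built index per file path. The names `make_cached_builder` and `fake_build_index` are hypothetical, and the stub builder only records calls so the example runs without loading any models.

```python
def make_cached_builder(build_index):
    """Wrap build_index so each file path is indexed only once."""
    cache = {}

    def get_index(file_path):
        if file_path not in cache:
            cache[file_path] = build_index(file_path)
        return cache[file_path]

    return get_index


# Stub builder for illustration; it records how often it is called.
build_calls = []


def fake_build_index(file_path):
    build_calls.append(file_path)
    return f"index for {file_path}"


get_index = make_cached_builder(fake_build_index)
get_index("AGRICULTURE.pdf")
get_index("AGRICULTURE.pdf")  # served from the cache, no rebuild
print(len(build_calls))  # prints 1
```

In this notebook the builder could be `lambda p: create_index(chunk_documents(load_document(p)))`. One caveat: the cache is keyed by path, so if Gradio stores each upload under a new temporary path, re-uploading the same file would still rebuild the index; hashing the file contents would be a more robust key.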

In [22]:
iface = gr.Interface(
    fn=document_qa,
    inputs=[
        gr.File(label="Upload Document"),
        gr.Textbox(label="Enter your question")
    ],
    outputs=gr.Textbox(label="Answer"),
    title="Agri Crop Management QA System",
    description="Upload any document (PDF, TXT, DOCX) and ask questions. The system will answer based on the document content."
)

In [23]:
iface.launch()

It looks like you are running Gradio on a hosted Jupyter notebook, which requires `share=True`. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://59c559fd9499c43feb.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)


