In [1]:
!pip install cohere pdfplumber pandas nltk scikit-learn numpy
!pip install langchain faiss-cpu PyPDF2
!pip install langchain-community

Collecting cohere
  Downloading cohere-5.9.1-py3-none-any.whl.metadata (3.4 kB)
Collecting pdfplumber
  Downloading pdfplumber-0.11.4-py3-none-any.whl.metadata (41 kB)
Collecting boto3<2.0.0,>=1.34.0 (from cohere)
  Downloading boto3-1.35.18-py3-none-any.whl.metadata (6.6 kB)
Collecting fastavro<2.0.0,>=1.9.4 (from cohere)
  Downloading fastavro-1.9.7-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.5 kB)
Collecting httpx>=0.21.2 (from cohere)
  Downloading httpx-0.27.2-py3-none-any.whl.metadata (7.1 kB)
Collecting httpx-sse==0.4.0 (from cohere)
  Downloading httpx_sse-0.4.0-py3-none-any.whl.metadata (9.0 kB)
Collecting parameterized<0.10.0,>=0.9.0 (from cohere)
  Downloading parameterized-0.9.0-py2.py3-none-any.whl.metadata (18 kB)
Collecting types-requests<3.0.0,>=2.0.0 (from cohere)
  Downloading types_requests-2.32.0.20240907-py3-none-any.whl

In [2]:
import cohere
import pdfplumber
import pandas as pd
import nltk
from sklearn.metrics.pairwise import cosine_similarity
from nltk.tokenize import sent_tokenize


In [3]:
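def extract_text_from_pdf(file_path):
    """Return the text of every page in the PDF, joined by newlines."""
    page_texts = []
    with pdfplumber.open(file_path) as pdf:
        for page in pdf.pages:
            # Guard against pages with no extractable text
            # (extract_text() may yield None or an empty string)
            page_texts.append(page.extract_text() or "")
    # Join with a newline so the last word of one page does not fuse with the next
    return "\n".join(page_texts)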



In [5]:
# Example: Extract text from PDF
file_path = 'your-pdf-file'
raw_text = extract_text_from_pdf(file_path)


In [6]:
def clean_text(text):
    # Collapse newlines, tabs, and repeated spaces into single spaces
    return " ".join(text.split())

cleaned_text = clean_text(raw_text)
print(cleaned_text[:500])  # Check first 500 characters of cleaned text


Ministry of Labour & Employment, GoI MINISTRY OF LABOUR AND EMPLOYMENT SHRAM SHAKTI BHAWAN RAFI MARG, NEW DELHI – 110001 Tender Document no.: No. Z-14025/05/2024 MoLE/PG-Cell/PMU REQUEST FOR PROPOSAL (RFP) FOR “APPOINTMENT OF A PROJECT MANAGEMENT UNIT FOR MINISTRY OF LABOUR AND EMPLOYMENT” Ministry of Labour and Employment (MoLE) through this RFP, seeks to appoint a PMU having competent and expert IT resources to handle its Technical Services, Tools and Assets as a part of its comprehensive Prog
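
The cleaned text can still contain page furniture: one of the chatbot answers later in this notebook begins with `P age 42 | 75Ministry of Labour & Employment, GoI`. Below is a minimal sketch of one way to strip it before sentence splitting. The names `PAGE_MARKER`, `RUNNING_HEADER`, and `strip_page_furniture` are hypothetical, and the pattern is inferred from that single visible footer, so it may need adjusting for the rest of the PDF.

```python
import re

# Assumption: page footers look like "P age 42 | 75" in the extracted text.
PAGE_MARKER = re.compile(r"P\s?age\s+\d+\s*\|\s*\d+")
# Assumption: this string is the running header repeated on each page.
RUNNING_HEADER = "Ministry of Labour & Employment, GoI"

def strip_page_furniture(text):
    """Drop page markers and repeated running headers, keeping the first header."""
    text = PAGE_MARKER.sub(" ", text)
    # Keep the first header occurrence intact and remove only the repeats.
    first, sep, rest = text.partition(RUNNING_HEADER)
    text = first + sep + rest.replace(RUNNING_HEADER, " ")
    # Normalise the whitespace left behind by the removals.
    return " ".join(text.split())

sample = (
    "Ministry of Labour & Employment, GoI Scope of work. "
    "P age 42 | 75Ministry of Labour & Employment, GoI o Technical Score Formulation"
)
print(strip_page_furniture(sample))
```

If used, it would run on `cleaned_text` before the text is split into sentences, so the footer does not end up inside a retrieved sentence.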


In [7]:
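import os
# Read the key from the environment if set; otherwise fall back to the placeholder
cohere_api_key = os.environ.get('COHERE_API_KEY', 'your-cohere-api-key')
co = cohere.Client(cohere_api_key)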


In [8]:
nltk.download('punkt')

def split_into_sentences(text):
    return sent_tokenize(text)

sentences = split_into_sentences(cleaned_text)

# Generate embeddings for each sentence
embeddings = co.embed(texts=sentences).embeddings


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
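
The cell above sends every sentence to the embed endpoint in a single call. Embedding APIs usually cap the number of texts per request, and the SDK may or may not split the list for you. The sketch below shows explicit batching. `batched`, `embed_in_batches`, and `fake_embed` are hypothetical helpers, and the batch size of 96 is an assumption that should be checked against the current Cohere API reference.

```python
def batched(items, batch_size=96):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def embed_in_batches(texts, embed_fn, batch_size=96):
    """Call embed_fn on each batch and concatenate the resulting vectors in order."""
    vectors = []
    for batch in batched(texts, batch_size):
        vectors.extend(embed_fn(batch))
    return vectors

# Stand-in for the real API call so the sketch runs offline:
# each text is "embedded" as a one-element vector holding its length.
def fake_embed(batch):
    return [[float(len(text))] for text in batch]

demo = embed_in_batches(["a", "bb", "ccc"], fake_embed, batch_size=2)
print(demo)
```

With the client from this notebook, `embed_fn` could be `lambda batch: co.embed(texts=batch).embeddings`.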


In [9]:
def summarize_text(text):
    response = co.generate(
        model='command-xlarge-nightly',
        prompt=f"Summarize this document:\n{text}",
        max_tokens=150  # Adjust as needed
    )
    return response.generations[0].text.strip()

summary = summarize_text(cleaned_text[:3000])  # Summarize first 3000 characters
print(summary)


The Ministry of Labour and Employment in India is seeking proposals for the appointment of a Project Management Unit (PMU) to handle its technical services, tools, and assets. The request for proposal (RFP) document outlines the scope of work and technical requirements for interested companies to provide PMU services to the ministry. The RFP was published on May 2, 2024, and clarifications/corrigenda will be published on the GeM portal (https://gem.gov.in). The tender inviting authority is the Government of India, and the job requirement is for the "Appointment of a Project Management Unit" for the Ministry of Labour and Employment (MoLE). 

The pre-bid meeting will take place on


In [10]:
import numpy as np

def find_relevant_section(question, sentences, embeddings):
    # Generate the question embedding
    question_embedding = co.embed(texts=[question]).embeddings

    # Convert embeddings to numpy arrays for easier manipulation
    question_embedding = np.array(question_embedding[0])  # Convert to 1D numpy array
    embeddings = np.array(embeddings)  # Convert the sentence embeddings to a 2D numpy array

    # Check and print the shapes of embeddings for debugging
    print("Shape of question embedding:", question_embedding.shape)  # Should be 1D now
    print("Shape of sentence embeddings:", embeddings.shape)  # Should be 2D

    # Calculate cosine similarity
    similarities = cosine_similarity([question_embedding], embeddings)[0]
    most_relevant_index = similarities.argmax()
    return sentences[most_relevant_index]

# Example: Answering a question
question = "Give summary of Scope of Work"
answer = find_relevant_section(question, sentences, embeddings)
print(answer)


Shape of question embedding: (4096,)
Shape of sentence embeddings: (901, 4096)
The proposal should cover all the aspects of the scope of work.
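
The returned sentence only mentions the scope of work; it is simply the row with the highest cosine score. To make the ranking step concrete, here is a NumPy-only sketch on toy 2-D vectors. `top_k_by_cosine` is a hypothetical helper, not part of the notebook, and it assumes no vector is all zeros.

```python
import numpy as np

def top_k_by_cosine(query_vec, doc_vecs, k=3):
    """Return (index, score) pairs for the k rows of doc_vecs closest to query_vec."""
    query_vec = np.asarray(query_vec, dtype=float)
    doc_vecs = np.asarray(doc_vecs, dtype=float)
    # Normalise to unit length so the dot product equals cosine similarity.
    query_unit = query_vec / np.linalg.norm(query_vec)
    doc_units = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = doc_units @ query_unit
    # argsort is ascending, so reverse it and keep the first k indices.
    order = np.argsort(scores)[::-1][:k]
    return [(int(i), float(scores[i])) for i in order]

toy_docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ranked = top_k_by_cosine([1.0, 0.0], toy_docs, k=2)
print(ranked)
```

Row 0 points in the same direction as the query and ranks first; row 2 is at 45 degrees and ranks second. Looking at several top-ranked rows rather than only the maximum gives more surrounding context for a question.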


In [21]:
def chatbot():
    print("Document is ready for questions. Type 'exit' to stop.")
    while True:
        user_input = input("Ask a question: ")
        if user_input.lower() == 'exit':
            break
        response = find_relevant_section(user_input, sentences, embeddings)
        print("Answer:", response)

chatbot()


Document is ready for questions. Type 'exit' to stop.
Ask a question: what is the technical score formulation
Shape of question embedding: (4096,)
Shape of sentence embeddings: (901, 4096)
Answer: P age 42 | 75Ministry of Labour & Employment, GoI o Technical Score Formulation: The total technical score of the bid would comprise of scores from the Technical Bid evaluation by the Consultancy Evaluation Committee (CEC) of MOL&E as per the criteria mentioned in Annexure 1 of this RFP.
Ask a question: exit


**IMPLEMENTING CONTEXT RETRIEVAL**

In [12]:
def find_relevant_sections(question, sentences, embeddings, top_n=3):
    question_embedding = np.array(co.embed(texts=[question]).embeddings[0])
    embeddings = np.array(embeddings)

    # Calculate cosine similarity
    similarities = cosine_similarity([question_embedding], embeddings)[0]

    # Get the top n most relevant sentences
    top_indices = similarities.argsort()[-top_n:][::-1]

    # Return the top n most relevant sentences as a context block
    relevant_sections = " ".join([sentences[i] for i in top_indices])
    return relevant_sections


In [13]:
def generate_response(question, relevant_context):
    prompt = f"Question: {question}\nContext: {relevant_context}\nAnswer:"
    response = co.generate(
        model='command-xlarge-nightly',  # You can adjust the model as per your API access
        prompt=prompt,
        max_tokens=150
    )
    return response.generations[0].text.strip()


In [14]:
def chatbot():
    print("Document is ready for questions. Type 'exit' to stop.")
    while True:
        user_input = input("Ask a question: ")
        if user_input.lower() == 'exit':
            break

        # Step 1: Retrieve the top relevant sections
        relevant_context = find_relevant_sections(user_input, sentences, embeddings, top_n=3)

        # Step 2: Generate a response using the relevant context
        response = generate_response(user_input, relevant_context)

        # Output the generated response
        print("Answer:", response)

chatbot()


Document is ready for questions. Type 'exit' to stop.
Ask a question: exit


**REFINING THE ABOVE CODE**

In [15]:
def detect_question_type(question):
    if any(word in question.lower() for word in ["summary", "overview", "brief"]):
        return "summary"
    if any(word in question.lower() for word in ["formula", "equation", "calculation"]):
        return "formula"
    return "general"
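
A quick check of how this keyword routing behaves on a few questions. The function is repeated here (with the lowercasing done once) so the snippet runs on its own; the example questions are illustrative, with the first three taken from this notebook.

```python
def detect_question_type(question):
    q = question.lower()
    # Substring matching: "formulation" contains "formula", "briefly" contains "brief".
    if any(word in q for word in ["summary", "overview", "brief"]):
        return "summary"
    if any(word in q for word in ["formula", "equation", "calculation"]):
        return "formula"
    return "general"

examples = [
    "Give summary of Scope of Work",
    "what is the technical score formulation",
    "can u explain more about cdmc",
    "Summarize the scope of work",  # "summarize" does not contain "summary"
]
for q in examples:
    print(f"{q!r} -> {detect_question_type(q)}")
```

Because the check is plain substring matching, word variants such as "summarize" fall through to `"general"`, and the summary keywords win whenever a question contains keywords from both lists.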


In [16]:
def find_relevant_sections(question, sentences, embeddings, top_n=3):
    question_embedding = np.array(co.embed(texts=[question]).embeddings[0])
    embeddings = np.array(embeddings)

    # Calculate cosine similarity
    similarities = cosine_similarity([question_embedding], embeddings)[0]

    # Determine how many sections to retrieve based on the question type
    question_type = detect_question_type(question)

    if question_type == "summary":
        top_n = 5  # Retrieve more context for summaries
    elif question_type == "formula":
        top_n = 2  # Retrieve specific sections for formulas

    top_indices = similarities.argsort()[-top_n:][::-1]
    relevant_sections = " ".join([sentences[i] for i in top_indices])

    return relevant_sections, question_type


In [17]:
def generate_response(question, relevant_context, question_type):
    if question_type == "summary":
        prompt = f"Question: {question}\nContext: {relevant_context}\nProvide a detailed summary."
    elif question_type == "formula":
        prompt = f"Question: {question}\nContext: {relevant_context}\nProvide the exact formula mentioned."
    else:
        prompt = f"Question: {question}\nContext: {relevant_context}\nAnswer in detail:"

    response = co.generate(
        model='command-xlarge-nightly',
        prompt=prompt,
        max_tokens=250  # Increase token limit for detailed responses
    )
    return response.generations[0].text.strip()


In [23]:
def chatbot():
    print("Document is ready for questions. Type 'exit' to stop.")
    while True:
        user_input = input("Ask a question: ")
        if user_input.lower() == 'exit':
            break

        # Step 1: Retrieve the relevant sections and detect question type
        relevant_context, question_type = find_relevant_sections(user_input, sentences, embeddings, top_n=3)

        # Step 2: Generate a response using the relevant context and question type
        response = generate_response(user_input, relevant_context, question_type)

        # Output the generated response
        print("Answer:", response)

chatbot()


Document is ready for questions. Type 'exit' to stop.
Ask a question: what is the scope of central data management cell
Answer: The Central Data Management Cell (CDMC) is established with the following objectives:
1. Setting up a Central Data Management Cell (CDMC)
2. In the rapidly evolving landscape of labor and employment, MoL&E recognizes the need for data-driven decision-making to effectively cater to the dynamic demands of the workforce, particularly the unorganized sector.
Ask a question: can u explain more about cdmc
Answer: The Central Data Management Cell (CDMC) is a proposed data management system for the Ministry of Labour and Employment (MoL&E) to aid in evidence-based policy making and administrative decision-making. The CDMC has three primary objectives:

- Review the databases available with MoL&E and identify the data gaps for achieving evidence-based policy making: The CDMC will review existing databases to ensure they are comprehensive and identify any data gaps that