
In [1]:
pip install PyMuPDF sentence-transformers faiss-cpu transformers


Collecting PyMuPDF
  Downloading pymupdf-1.25.1-cp39-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (3.4 kB)
Collecting faiss-cpu
  Downloading faiss_cpu-1.9.0.post1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.4 kB)
Downloading pymupdf-1.25.1-cp39-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (20.0 MB)
Downloading faiss_cpu-1.9.0.post1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (27.5 MB)
Installing collected packages: PyMuPDF, faiss-cpu
Successfully installed PyMuPDF-1.25.1 faiss-cpu-1.9.0.post1


In [3]:
import fitz  # PyMuPDF
from sentence_transformers import SentenceTransformer
import faiss
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Initialize models and tokenizer
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
tokenizer = AutoTokenizer.from_pretrained('facebook/bart-large-cnn')
llm = AutoModelForSeq2SeqLM.from_pretrained('facebook/bart-large-cnn')

# Function to extract text from PDF using PyMuPDF (one string per page)
def extract_text_from_pdf(pdf_path):
    # The context manager closes the file once all pages have been read
    with fitz.open(pdf_path) as doc:
        return [page.get_text() for page in doc]

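# Function to create chunks of at most ~chunk_size words, split at sentence ends
# Note: all-MiniLM-L6-v2 truncates inputs longer than 256 word pieces by default,
# so the tail of a 512-word chunk is not reflected in its embedding
def create_chunks(text, chunk_size=512):
    # Skip empty fragments so blank pages do not produce empty chunks
    sentences = [s for s in text.split('. ') if s.strip()]
    chunks = []
    current_chunk = []
    current_length = 0
    for sentence in sentences:
        num_words = len(sentence.split())
        # Flush only a non-empty chunk; an oversized first sentence
        # would otherwise append an empty string
        if current_chunk and current_length + num_words > chunk_size:
            chunks.append('. '.join(current_chunk))
            current_chunk = []
            current_length = 0
        current_chunk.append(sentence)
        current_length += num_words
    if current_chunk:
        # Join with '. ' to restore the periods removed by split('. ')
        chunks.append('. '.join(current_chunk))
    return chunks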

# Function to create embeddings and store in vector database
def store_embeddings(chunks):
    embeddings = embedding_model.encode(chunks)
    index = faiss.IndexFlatL2(embeddings.shape[1])
    index.add(embeddings)
    return index, embeddings

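# Function to handle queries: return the k chunks nearest to the query
def handle_query(query, index, chunks, k=5):
    query_embedding = embedding_model.encode([query])
    # Never request more neighbours than there are chunks
    _, indices = index.search(query_embedding, min(k, len(chunks)))
    # FAISS pads missing results with -1; skip those instead of indexing chunks[-1]
    return [chunks[i] for i in indices[0] if i >= 0]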

# Function to generate response using LLM
# Note: facebook/bart-large-cnn is a summarization model, so the "response" is
# a summary of the query plus retrieved context rather than an extracted answer
def generate_response(query, retrieved_chunks):
    context = '\n'.join(retrieved_chunks)
    input_text = f"Query: {query}\nContext: {context}"
    # Input beyond 512 tokens is truncated, so later chunks may be dropped
    inputs = tokenizer(input_text, return_tensors='pt', max_length=512, truncation=True)
    summary_ids = llm.generate(
        **inputs, max_length=150, min_length=40,
        length_penalty=2.0, num_beams=4, early_stopping=True,
    )
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)

# Function to handle comparison queries
def handle_comparison_query(query, index, chunks):
    # This can be improved to parse specific comparison-related logic
    retrieved_chunks = handle_query(query, index, chunks)
    response = generate_response(query, retrieved_chunks)
    return response

# Main function
def main(pdf_path, query, is_comparison=False):
    text_data = extract_text_from_pdf(pdf_path)
    chunks = []
    for text in text_data:
        chunks.extend(create_chunks(text))

    index, _ = store_embeddings(chunks)

    # Retrieve up front so the context can be printed in both branches
    # (previously retrieved_chunks was undefined when is_comparison=True)
    retrieved_chunks = handle_query(query, index, chunks)
    if is_comparison:
        response = handle_comparison_query(query, index, chunks)
    else:
        response = generate_response(query, retrieved_chunks)

    # Print the query, context, and response
    print("\033[1mQuery:\033[0m", query)
    print('\n'.join(retrieved_chunks))
    print("\033[1mResponse:\033[0m", response)

    return response

# Example usage
pdf_path = '/content/Tables- Charts- and Graphs with Examples from History- Economics- Education- Psychology- Urban Affairs and Everyday Life - 2017-2018.pdf'

# Additional query
query2 = 'From page 6 get the tabular data'
response2 = main(pdf_path, query2)
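The query above names a page ("page 6"), but the chunks passed to the index carry no page information, so retrieval can only match on meaning. One possible way to honor a page reference is to tag each chunk with its page number and filter before searching. The sketch below uses only the standard library; `chunk_pages`, `requested_page`, and `chunks_for_query` are hypothetical helpers that are not part of the notebook, and the fixed word-window chunking is a simplifying assumption rather than the sentence-based logic of `create_chunks`.

```python
import re

def chunk_pages(pages, chunk_size=512):
    """Return (page_number, chunk) pairs; page numbers start at 1."""
    tagged = []
    for page_number, text in enumerate(pages, start=1):
        words = text.split()
        # Simplified chunking: fixed windows of chunk_size words per page
        for start in range(0, len(words), chunk_size):
            chunk = ' '.join(words[start:start + chunk_size])
            tagged.append((page_number, chunk))
    return tagged

def requested_page(query):
    """Return the page number named in the query, or None if there is none."""
    match = re.search(r'\bpage\s+(\d+)\b', query, flags=re.IGNORECASE)
    return int(match.group(1)) if match else None

def chunks_for_query(query, tagged_chunks):
    """Keep only chunks from the requested page; keep all chunks otherwise."""
    page = requested_page(query)
    if page is None:
        return [chunk for _, chunk in tagged_chunks]
    return [chunk for number, chunk in tagged_chunks if number == page]

# Stand-in page texts (assumed); in the notebook these would be the
# per-page strings returned by extract_text_from_pdf
pages = ['Title slide', 'x y 0 0 1 3 2 6', 'Closing remarks']
tagged = chunk_pages(pages, chunk_size=4)
print(chunks_for_query('From page 2 get the tabular data', tagged))
```

The filtered list could then be embedded and searched in place of the full chunk list, so that a page-specific query never retrieves text from other pages.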


Query: From page 6 get the tabular data
Tables, Charts, and 
Graphs Basics

x
y
0
0
1
3
2
6
3
9
4
12
5
15
6
18
7
21
8
24
•
If given a table of data, we should be able to plot it  Below is 
some sample data; plot the data with x on the x-axis and y on the 
y-axis.

Tables, Charts, and 
Graphs 
with Examples from History, Economics, 
Education, Psychology, Urban Affairs and 
Everyday Life
REVISED: MICHAEL LOLKUS 2018

0
5
10
15
20
25
30
0
1
2
3
4
5
6
7
8
•
Below is a plot of the data on the table from the previous 
slide  Notice that this plot is a straight line meaning that a 
linear equation must have generated this data.
•
What if the data is not generated by a linear equation?  We can 
fit the data using a linear regression and use that line as an 
approximation to the data  Regressions are beyond the scope of 
this workshop.

Table of Yearly U.S GDP by 
Industry (in millions of dollars)
Year
2010
2011
2012
2013
2014
2015
All Industries
26093515
27535971
28663246
29601191
308