LangChain and FAISS

In [1]:
import PyPDF2
import os
import pandas as pd
import faiss
from sentence_transformers import SentenceTransformer
import numpy as np
from langchain.chat_models import ChatOllama


In [2]:
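def extract_text_from_pdf(file_path):
    """Return the text of every page in a PDF joined by spaces, or None on failure."""
    try:
        with open(file_path, "rb") as file:
            reader = PyPDF2.PdfReader(file)
            # Extract each page once; skip pages that yield no text (e.g. scanned images).
            page_texts = [page.extract_text() for page in reader.pages]
            text = " ".join(t for t in page_texts if t)
        return text
    except Exception as e:
        print(f"Error extracting text from {file_path}: {e}")
        return None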

In [3]:
def load_pdfs(file_paths):
    """Extract text from each PDF and return a {file_path: text} dict, skipping failures."""
    docs = {}
    for file_path in file_paths:
        text = extract_text_from_pdf(file_path)
        if text:
            docs[file_path] = text
    print(f"Loaded {len(docs)} PDFs.")
    return docs

In [4]:
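def create_faiss_index(docs, model):
    """Embed sentence-level chunks of each document and index them in FAISS.

    Returns the index and a DataFrame mapping each index row to its sentence and source file.
    """
    sentences = []
    file_refs = []
    for file, text in docs.items():
        # Naive sentence split; drop empty or whitespace-only chunks.
        chunks = [c.strip() for c in text.split(". ") if c.strip()]
        sentences.extend(chunks)
        file_refs.extend([file] * len(chunks))

    # FAISS expects a contiguous float32 array.
    embeddings = np.ascontiguousarray(
        model.encode(sentences, convert_to_numpy=True), dtype=np.float32
    )
    index = faiss.IndexFlatL2(embeddings.shape[1])  # exact (brute-force) L2 search
    index.add(embeddings)
    data = pd.DataFrame({"sentence": sentences, "file": file_refs})
    return index, data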
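
The chunking step in `create_faiss_index` splits on the literal string `". "`, which drops the terminating period and ignores `?`, `!`, and line breaks. A minimal sketch of a regex-based alternative is shown below. `split_sentences` is a hypothetical helper, not part of the notebook, and it still mis-splits abbreviations such as "e.g. "; it is only meant to illustrate the idea.

```python
import re

# Split on whitespace that follows ., ! or ?; the lookbehind keeps the punctuation
# attached to the sentence it ends.
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text):
    """Hypothetical helper: return non-empty, stripped sentences from text."""
    return [s.strip() for s in _SENTENCE_BOUNDARY.split(text) if s.strip()]


chunks = split_sentences("FAISS indexes vectors. Does it scale?  Yes!\nIt also runs on GPU.")
print(chunks)
```

Under these assumptions, `text.split(". ")` could be swapped for `split_sentences(text)` without changing the rest of the function.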

In [5]:
def answer_question(query, index, data, model, llm, top_k=3):
    """Retrieve the top_k nearest sentences, ask the LLM, and return the answer with its source files."""
    query_embedding = model.encode([query], convert_to_numpy=True)
    _, indices = index.search(query_embedding, top_k)

    extracted_texts = []
    sources = []
    for idx in indices[0]:
        if idx < 0:  # FAISS pads with -1 when fewer than top_k results exist
            continue
        response = data.iloc[idx]
        extracted_texts.append(response["sentence"])
        sources.append(os.path.basename(response["file"]))

    context = " ".join(extracted_texts)
    prompt = f"Based on the following context, answer the question: {query}\n\nContext: {context}\n\nAnswer:"

    answer = llm.predict(prompt)
    # Deduplicate citations while keeping retrieval order.
    return {"answer": answer, "citations": list(dict.fromkeys(sources))}
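
To make the retrieval step in `answer_question` concrete, the sketch below reproduces in plain Python what a flat L2 index conceptually does: compare the query against every stored vector and keep the `top_k` closest. `l2_search` is a hypothetical stand-in written for illustration, not FAISS code; it returns squared Euclidean distances, which is also what the FAISS documentation describes for `IndexFlatL2`.

```python
def l2_search(vectors, query, top_k=3):
    """Hypothetical stand-in for a flat L2 index.

    Returns (distances, indices) of the top_k stored vectors closest to query,
    ordered from nearest to farthest. Distances are squared Euclidean distances.
    """
    scored = []
    for i, vec in enumerate(vectors):
        dist = sum((a - b) ** 2 for a, b in zip(vec, query))
        scored.append((dist, i))
    scored.sort()  # nearest first; ties broken by position
    top = scored[:top_k]
    return [d for d, _ in top], [i for _, i in top]


# Toy 2-D "embeddings"; real sentence embeddings have hundreds of dimensions.
vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
distances, indices = l2_search(vectors, [0.9, 0.1], top_k=2)
print(indices)
```

The returned positions play the same role as `indices[0]` in `answer_question`: each one is a row number into the DataFrame that holds the sentence and its source file.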

In [9]:
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
llm = ChatOllama(model="llama3.2")

pdf_files = [
    "/Users/avilochab/Desktop/Job: Work/Chatbox/files/scammer-agent.pdf",
    "/Users/avilochab/Desktop/Job: Work/Chatbox/files/LLMs.pdf",
    "/Users/avilochab/Desktop/Job: Work/Chatbox/files/SOF.pdf",
    "/Users/avilochab/Desktop/Job: Work/Chatbox/files/wef.pdf",
    "/Users/avilochab/Desktop/Job: Work/Chatbox/files/aviationGPT.pdf"
]

docs = load_pdfs(pdf_files)
index, data = create_faiss_index(docs, embedding_model)



Loaded 5 PDFs.
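
The file list above is hardcoded. As a minimal sketch, assuming all PDFs sit directly inside one directory, the list could instead be built by scanning that directory. `find_pdfs` is a hypothetical helper, and the temporary directory below exists only to make the example self-contained.

```python
import tempfile
from pathlib import Path


def find_pdfs(directory):
    """Hypothetical helper: return sorted paths of *.pdf files directly inside directory."""
    return sorted(str(p) for p in Path(directory).glob("*.pdf"))


# Demonstrate on a throwaway directory with two PDFs and one unrelated file.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("b.pdf", "a.pdf", "notes.txt"):
        Path(tmp, name).touch()
    found = [Path(p).name for p in find_pdfs(tmp)]
print(found)
```

The resulting list could be passed to `load_pdfs` in place of `pdf_files`.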


In [16]:
response = answer_question("Explain LLM in 100 words", index, data, embedding_model, llm)
print("Answer:", response["answer"])
print("Citations:", response["citations"])

Answer: A Large Language Model (LLM) is a type of artificial intelligence designed to understand and generate human-like language. It is trained on vast amounts of text data, enabling it to learn patterns and relationships within language. LLMs can be fine-tuned for specific tasks, such as answering questions or generating text, by integrating retrieved information from various sources. This integration allows the model to produce accurate and context-specific responses, making them more effective in human-computer interaction and conversational AI applications. The ability of LLMs to adapt to new information and tasks is a key aspect of their performance.
Citations: ['aviationGPT.pdf']


In [11]:
response = answer_question("Explain intent of a scammer", index, data, embedding_model, llm)
print("Answer:", response["answer"])
print("Citations:", response["citations"])

Answer: The intent of a scammer is to deceive and manipulate the victim into believing that their actions are legitimate, in order to gain their trust and ultimately steal funds from them. The primary goal is to create a false sense of legitimacy and authority, allowing the scammer to carry out their malicious intentions.
Citations: ['scammer-agent.pdf']


In [14]:
response = answer_question("Explain aviationGPT", index, data, embedding_model, llm)
print("Answer:", response["answer"])
print("Citations:", response["citations"])

Answer: AviationGPT is a cutting-edge technology that offers flexibility and substantial performance enhancements in aviation. The exact nature and capabilities of AviationGPT are not specified in the provided context, but it appears to be a system or platform designed for aviation applications, possibly related to artificial intelligence, machine learning, or other advanced technologies.
Citations: ['aviationGPT.pdf']


In [15]:
response = answer_question("Explain Special forces", index, data, embedding_model, llm)
print("Answer:", response["answer"])
print("Citations:", response["citations"])

Answer: Special forces, also known as Special Operations Forces (SOF), are elite military units that conduct unconventional operations, such as counter-terrorism, direct action, special reconnaissance, and foreign internal defense missions. They are trained to operate outside the normal rules of engagement and often work behind enemy lines.

The primary purpose of special forces is to:

1. Conduct high-risk missions that require specialized skills and training.
2. Gather intelligence and conduct surveillance in hostile or denied areas.
3. Conduct direct action against high-value targets, such as terrorist leaders or key infrastructure.
4. Provide training and support to foreign military forces to help build their capacity.

Special forces are typically characterized by:

1. Elite training: Special forces undergo rigorous training that includes advanced skills in combat, tactics, and survival.
2. Small team size: Special forces teams usually consist of 5-12 personnel, which allows for f