## Import Libraries


In [28]:
# Imports
from langchain_ollama import OllamaEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_core.prompts import ChatPromptTemplate, PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_ollama.chat_models import ChatOllama
from langchain_core.runnables import RunnablePassthrough
from langchain.retrievers.multi_query import MultiQueryRetriever

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

# Jupyter-specific imports
from IPython.display import display, Markdown

# Set environment variable for protobuf
import os
os.environ["PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION"] = "python"

## Load PDF

In [None]:
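from langchain_community.document_loaders import PyPDFLoader

# Raw string so the backslashes are not treated as escape sequences;
# the trailing space inside the quotes has been removed.
local_path = r"D:\Devlancers\data\LLMs.pdf"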
if local_path:
    loader = PyPDFLoader(local_path)
    data = loader.load()
    print(f"PDF loaded successfully with {len(data)} pages.")

    # Print first page content for verification
    print(data[0].page_content if data else "No data extracted.")
else:
    print("Upload a PDF file")


PDF loaded successfully with 768 pages.
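A hard-coded Windows path is easy to get wrong (stray whitespace, a missing file, the wrong extension). The following is a minimal sketch of a check that could run before the path is handed to `PyPDFLoader`. The helper name `resolve_pdf_path` is hypothetical and not part of LangChain.

```python
from pathlib import Path


def resolve_pdf_path(raw_path: str) -> Path:
    """Strip stray whitespace and confirm the path points at an existing PDF.

    Hypothetical helper for illustration; not part of LangChain.
    """
    path = Path(raw_path.strip())
    if path.suffix.lower() != ".pdf":
        raise ValueError(f"Expected a .pdf file, got: {path.name!r}")
    if not path.is_file():
        raise FileNotFoundError(f"No file found at: {path}")
    return path
```

With this in place, the loader call could become `PyPDFLoader(str(resolve_pdf_path(local_path)))`, which fails early with a readable message instead of deep inside the PDF parser.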



## Split text into chunks

In [30]:
# Split text into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=900, chunk_overlap=200)
chunks = text_splitter.split_documents(data)
print(f"Text split into {len(chunks)} chunks")

Text split into 1808 chunks
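To make `chunk_size` and `chunk_overlap` concrete, here is a simplified fixed-width splitter. It is not how `RecursiveCharacterTextSplitter` works internally: the real splitter prefers to break on separators such as paragraph and line breaks, so its chunks are usually shorter than the limit and the overlap is not exact. The function name `sliding_window_chunks` is an assumption made for this sketch.

```python
def sliding_window_chunks(
    text: str, chunk_size: int = 900, chunk_overlap: int = 200
) -> list[str]:
    """Cut text into windows of chunk_size characters, each sharing
    chunk_overlap characters with the previous window."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks
```

With the notebook's settings each window advances by 700 characters, so the last 200 characters of one chunk reappear at the start of the next. That repetition is what lets a sentence that straddles a boundary appear intact in at least one chunk.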


## Create vector database

In [31]:
from langchain_community.vectorstores import FAISS
# Note: this import shadows the OllamaEmbeddings imported from langchain_ollama in the first cell.
from langchain_community.embeddings import OllamaEmbeddings

# Assume 'chunks' is already created from the document
embedding_model = OllamaEmbeddings(model="nomic-embed-text")



In [32]:
embedding_model

OllamaEmbeddings(base_url='http://localhost:11434', model='nomic-embed-text', embed_instruction='passage: ', query_instruction='query: ', mirostat=None, mirostat_eta=None, mirostat_tau=None, num_ctx=None, num_gpu=None, num_thread=None, repeat_last_n=None, repeat_penalty=None, temperature=None, stop=None, tfs_z=None, top_k=None, top_p=None, show_progress=False, headers=None, model_kwargs=None)

In [None]:
# Create FAISS vector database
vector_db = FAISS.from_documents(chunks, embedding_model)

# Save the FAISS index for later use
vector_db.save_local("faiss_index")

print("FAISS vector database created and saved successfully.")
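Because the index is written to disk, a later session can reload it instead of re-embedding all the chunks. The sketch below assumes the same embedding model is used for loading as for indexing; `load_saved_index` is a hypothetical wrapper name. Recent `langchain_community` releases require `allow_dangerous_deserialization=True` because part of the saved index is a pickle file, so this should only be enabled for index files you created yourself.

```python
def load_saved_index(embedding_model, index_dir="faiss_index"):
    """Reload a FAISS index previously written with save_local(index_dir)."""
    # Imported lazily so the helper can be defined even where FAISS is not installed.
    from langchain_community.vectorstores import FAISS

    return FAISS.load_local(
        index_dir,
        embedding_model,
        # The stored docstore is pickled; only load indexes you trust.
        allow_dangerous_deserialization=True,
    )
```

In this notebook the call would look like `vector_db = load_saved_index(embedding_model)`, which would replace the `FAISS.from_documents` step on later runs.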


## Set up LLM and Retrieval

In [None]:
local_model = "phi3:medium"
llm = ChatOllama(model=local_model)

In [22]:
# Query prompt template
QUERY_PROMPT = PromptTemplate(
    input_variables=["question"],
    template="""You are an AI language model assistant. Your task is to generate 2
    different versions of the given user question to retrieve relevant documents from
    a vector database. By generating multiple perspectives on the user question, your
    goal is to help the user overcome some of the limitations of the distance-based
    similarity search. Provide these alternative questions separated by newlines.
    Original question: {question}""",
)

# Set up retriever
retriever = MultiQueryRetriever.from_llm(
    vector_db.as_retriever(), 
    llm,
    prompt=QUERY_PROMPT
)
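Conceptually, the multi-query retriever asks the LLM for alternative phrasings of the question, runs a similarity search for each one, and returns the de-duplicated union of the results. The sketch below illustrates that idea only and is not the library's implementation. `generate_queries` and `search` are hypothetical callables standing in for the LLM and the vector store, and documents are represented as plain strings.

```python
from typing import Callable, Iterable


def multi_query_retrieve(
    question: str,
    generate_queries: Callable[[str], Iterable[str]],
    search: Callable[[str], Iterable[str]],
) -> list[str]:
    """Search once per generated query and merge the hits without duplicates."""
    seen = set()
    merged = []
    for query in generate_queries(question):
        for doc in search(query):
            if doc not in seen:  # keep only the first occurrence of each hit
                seen.add(doc)
                merged.append(doc)
    return merged


# Toy stand-ins: two rephrasings whose result lists overlap on "b".
fake_index = {"q1": ["a", "b"], "q2": ["b", "c"]}
hits = multi_query_retrieve(
    "original question",
    generate_queries=lambda question: ["q1", "q2"],
    search=lambda query: fake_index[query],
)
print(hits)  # → ['a', 'b', 'c']
```

The trade-off is one extra LLM call plus one vector search per generated query, in exchange for results that depend less on the exact wording of the original question.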

## Create chain

In [23]:
# RAG prompt template
template = """Answer the question based ONLY on the following context:
{context}
Question: {question}
"""

prompt = ChatPromptTemplate.from_template(template)

In [24]:
# Create chain
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | ChatOllama(model="phi3:mini")
    | StrOutputParser()
)
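In the chain above, the retriever's output (a list of `Document` objects) goes straight into `{context}`, so the prompt appears to receive the string representation of that list, metadata included. A common alternative is to join the page texts first. The sketch below uses a hypothetical `format_docs` helper, with `FakeDoc` standing in for a LangChain `Document` so that the example runs without LangChain installed.

```python
from dataclasses import dataclass


@dataclass
class FakeDoc:
    """Stand-in for a LangChain Document; only page_content is used here."""

    page_content: str


def format_docs(docs) -> str:
    """Join the text of each retrieved document, separated by blank lines."""
    return "\n\n".join(doc.page_content for doc in docs)


print(format_docs([FakeDoc("First chunk."), FakeDoc("Second chunk.")]))
```

In the chain, this would be wired in as `{"context": retriever | format_docs, "question": RunnablePassthrough()}`, so the model sees only the chunk text.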

## Chat with PDF

In [25]:
def chat_with_pdf(question):
    """Run the RAG chain on `question` and render the answer as Markdown."""
    answer = chain.invoke(question)
    display(Markdown(answer))

In [26]:
chat_with_pdf("What is the main idea of this document?")

The primary concept discussed in these documents revolves around a developed platform aimed at facilitating human-AI collaborative coding for data science projects. The tool's focus lies on reducing user effort, allowing them to input natural language requests and receive real-time code execution feedback within an isolated environment without disrupting the main experimental setting. This system supports users in various ways such as brainstorming with LLM assistants or requesting improvements for existing codes across different programming languages like R and Python directly from their interface, all while ensuring a privacy-conscious handling of potentially sensitive patient datasets. The platform also provides comprehensive code generation panels to guide the users in planning data analyses which are then executed within sandbox environments that support parallel real-time executions with provided logs for artifact analysis like figures and tables directly on the frontend, thereby maximizing utility from LLM assistance while simplifying programming tasks.

In [27]:
chat_with_pdf("How does the dataset metadata (column names, shape, representative values) help LLMs understand data without compromising privacy?")

The inclusion of dataset metadata like column names, shape, and representative sample values plays a significant role for Language Learning Models (LLMs) to comprehend datasets effectively while preserving patient confidentiality. Here's how it works:

1. **Column Names** - By understanding the specific headers or labels in each data set such as 'age', 'gender', etc., LLMs can access and work with distinct categories of information without touching raw sensitive personal details that could potentially breach privacy norms, particularly when dealing with patient-level datasets where individual identities must be protected.

2. **Shape** - Knowledge about the shape or structure (number of rows/columns) helps LLM to grasp the overall dimensions and extent of data available for analysis without exposing sensitive details such as age range or count which can lead towards personal identification if aggregated with external knowledge sources like demographic information, thereby mitigating privacy concerns.

3. **Representative Values** - Providing representative values gives LLMs a sense about the nature of data within each category (e.g., 'age' might have an average value in the mid-40 range). While this could provide context, it doesn't involve disclosing actual individual records and hence does not violate privacy norms as long as personal identifiers are omitted or pseudonymized from direct input into LLM interactions.

Together, these elements ensure that while an AI tool can perform data analysis efficiently without access to the entire sensitive dataset (thus maintaining confidentiality), it still understands what kind of information is available for analyses and how they interrelate in a structured form within each patient's record. This knowledge allows LLMs, when used correctly as per your architecture integration plan with privacy constraints mentioned above, to assist data scientists effectively without jeopardizing individual privacy rights or security standards related to sensitive health information handling and processing regulations like HIPAA (Health Insurance Portability and Accountability Act).

In [13]:
chat_with_pdf("What are designs to perform the series of actions necessary for common scams?")

The document discusses designing agents capable of performing specific actions needed to execute various common scams. These include navigating bank websites, retrieving user credentials and two-factor authentication codes, transferring money or stealing gift cards from banks like Bank of America, as well as credential exfiltration tactics for social media platforms such as Gmail and Instagram. The designed agents are also able to perform actions autonomously with a high level of capability in voice interactions.

As the document mentions that these scams require complex user interaction and feedback handling, this suggests sophisticated designs which likely entail using advanced AI technologies for natural language processing (NLP), human-like conversation capabilities, error detection and response strategies, as well as adaptive learning to handle different situations within a scam. However, the document does not provide detailed specifics about these design features in this provided context.

In [14]:
chat_with_pdf("Can you explain the case study highlighted in the document?")

In the provided context, a bank transfer scam is described through an abridged transcript and action log for reference purposes only (with specific details redacted). The scenario unfolds as follows: A potential victim receives a call from someone claiming to be from Bank of America. They notify that there has been unusual activity on the account, implying it might have been compromised or potentially stolen.

For security verification purposes and assurance for protection against fraudulent activities affecting their banking details, John (the scammer) requests access to sensitive information like usernames and passwords from the victim. The transcript indicates a conversation wherein the potential victim expresses confusion or disbelief due to never receiving such calls before which is common tactics used by fraudsters impersonating institutional staff members.

This case study serves as an example of how these AI-powered scammers attempt their nefarious activities using a blend of social engineering and technology, exploiting the trust individuals have in financial institutions to extract confidential information under false pretenses for malicious intent such as unauthorized access or account takeover.