# Ingesting PDF

In [None]:
!pip install unstructured langchain
!pip install "unstructured[all-docs]"

In [None]:
!pip install langchain-community langchain-core

In [46]:
from langchain_community.document_loaders import UnstructuredPDFLoader
from langchain_community.document_loaders import OnlinePDFLoader

In [47]:
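from pathlib import Path

local_path = "meta.pdf"

# Load the PDF only if the file exists; otherwise stop here so later cells never see an undefined `data`.
if Path(local_path).is_file():
    loader = UnstructuredPDFLoader(file_path=local_path)
    data = loader.load()
else:
    raise FileNotFoundError(f"Upload a PDF file: {local_path!r} was not found")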

In [48]:
data[0].page_content

'NEWS RELEASE\n\nMeta Reports Fourth Quarter and Full Year 2024 Results\n\nMENLO PARK, Calif., Jan. 29, 2025 /PRNewswire/ -- Meta Platforms, Inc. (Nasdaq: META) today reported \x00nancial\n\nresults for the quarter and full year ended December 31, 2024.\n\n"We continue to make good progress on AI, glasses, and the future of social media," said Mark Zuckerberg, Meta\n\nfounder and CEO. "I\'m excited to see these e\x00orts scale further in 2025."\n\nFourth Quarter and Full Year 2024 Financial Highlights\n\nThree Months Ended December 31,\n\nTwelve Months Ended December 31,\n\nIn millions, except percentages and per share amounts\n\n2024\n\n2023\n\n% Change\n\n2024\n\n2023\n\n% Change\n\nRevenue\n\nCosts and expenses\n\nIncome from operations\n\nOperating margin\n\nProvision for income taxes\n\nEffective tax rate\n\nNet income\n\nDiluted earnings per share (EPS)\n\n$ 48,385 25,020 $ 23,365 48 % $ 2,715 12 % $ 20,838 $ 8.02\n\n$ 40,111 23,727 $ 16,384 41 % $ 2,791 17 % $ 14,017 $ 5.33\n\n2
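
The extracted text above contains `\x00` characters inside words (`\x00nancial`, `e\x00orts`), most likely where the PDF used ligature glyphs. The same character stands in for different letter pairs ("fi" in one word, "ff" in the other), so it cannot be replaced blindly. The sketch below only lists the affected words so they can be reviewed; `find_damaged_words` is a hypothetical helper written for this note, not part of `unstructured` or LangChain.

```python
import re

def find_damaged_words(text: str) -> list[str]:
    """Return the unique whitespace-delimited tokens that contain a NUL character."""
    seen = []
    # \S+ grabs whole tokens; NUL is not whitespace, so it stays inside its word.
    for token in re.findall(r"\S+", text):
        if "\x00" in token and token not in seen:
            seen.append(token)
    return seen

# Toy sample modelled on the output above (not the full document).
sample = "today reported \x00nancial results. I'm excited to see these e\x00orts scale."
damaged = find_damaged_words(sample)

# Show the NUL as '?' so the damaged words are readable when printed.
print([word.replace("\x00", "?") for word in damaged])
```

On the real data, the same call would be `find_damaged_words(data[0].page_content)`.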

# Vector Embeddings

In [29]:
!ollama pull nomic-embed-text

In [None]:
!pip install chromadb
!pip install langchain-text-splitters

In [49]:
!ollama list

NAME                       ID              SIZE      MODIFIED       
mistral:latest             f974a74358d6    4.1 GB    9 minutes ago     
nomic-embed-text:latest    0a109f422b47    274 MB    27 minutes ago    


In [50]:
from langchain_community.embeddings import OllamaEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma

In [51]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=7500, chunk_overlap=100)
chunks = text_splitter.split_documents(data)
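
To see what `chunk_size` and `chunk_overlap` control, here is a deliberately simplified sliding-window splitter in plain Python. It is not how `RecursiveCharacterTextSplitter` works internally: the real splitter tries to break on separators such as paragraph and line breaks before falling back to smaller units. `split_with_overlap` is a hypothetical name used only for this illustration.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks

toy_chunks = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1)
print(toy_chunks)  # ['abcd', 'defg', 'ghij']
```

Each chunk repeats the last character of the previous one, which is the role the 100-character overlap plays in the cell above.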

In [52]:
vector_db = Chroma.from_documents(
    documents=chunks,
    embedding=OllamaEmbeddings(model="nomic-embed-text", show_progress=True),
    collection_name="local-rag"
)

OllamaEmbeddings: 100%|██████████| 3/3 [00:33<00:00, 11.04s/it]
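
Once the chunks are embedded, retrieval amounts to finding the stored vectors closest to the query vector. The sketch below shows that idea with hand-made three-dimensional vectors and cosine similarity. The vectors, chunk names, and helper functions are all made up for illustration, and the distance metric Chroma actually applies depends on how the collection is configured.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec: list[float], store: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the names of the k stored vectors most similar to the query."""
    ranked = sorted(
        store.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]

# Toy "embeddings": real ones from nomic-embed-text have far more dimensions.
toy_store = {
    "revenue chunk": [0.9, 0.1, 0.0],
    "debt chunk": [0.1, 0.9, 0.1],
    "headcount chunk": [0.0, 0.2, 0.9],
}
toy_query = [0.2, 0.8, 0.0]

nearest = top_k(toy_query, toy_store, k=2)
print(nearest)  # ['debt chunk', 'revenue chunk']
```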


# Retrieval

In [53]:
from langchain_core.prompts import ChatPromptTemplate, PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_community.chat_models import ChatOllama
from langchain_core.runnables import RunnablePassthrough
from langchain.retrievers.multi_query import MultiQueryRetriever

In [54]:
local_model = "mistral"
llm = ChatOllama(model=local_model)

In [55]:
QUERY_PROMPT = PromptTemplate(
    input_variables=["question"],
    template="""You are an AI language model assistant. Your task is to generate five 
    different versions of the given user question to retrieve relevant documents from
    a vector database. By generating multiple perspectives on the user question, your
    goal is to help the user overcome some of the limitations of the distance-based
    similarity search. Provide these alternative questions separated by newlines.
    Original question: {question}""",
)

In [56]:
retriever = MultiQueryRetriever.from_llm(
    vector_db.as_retriever(),
    llm,
    prompt=QUERY_PROMPT
)

template = """Answer the question based ONLY on the following context:
{context}
Question: {question}
"""

prompt = ChatPromptTemplate.from_template(template)
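
The multi-query retriever follows a simple pattern: the LLM rewrites the question several ways, each rewrite is searched separately, and the unique union of the hits becomes the context. Below is a rough plain-Python sketch of that flow. The search function, the index contents, and the helper names are stand-ins invented for this example, not LangChain internals.

```python
from typing import Callable

def parse_variants(llm_output: str) -> list[str]:
    """Split newline-separated questions, dropping blank lines."""
    return [line.strip() for line in llm_output.splitlines() if line.strip()]

def multi_query_retrieve(variants: list[str], search: Callable[[str], list[str]]) -> list[str]:
    """Run every variant through the search function and keep each hit once."""
    unique_docs = []
    for question in variants:
        for doc in search(question):
            if doc not in unique_docs:
                unique_docs.append(doc)
    return unique_docs

# Stand-in for the LLM's rewritten questions (hypothetical output).
fake_llm_output = "What is Meta's long-term debt?\n\nHow much debt does Meta carry?\n"

# Stand-in for the vector store: maps a query to the chunks it would return.
fake_index = {
    "What is Meta's long-term debt?": ["balance sheet chunk", "cash flow chunk"],
    "How much debt does Meta carry?": ["balance sheet chunk", "notes chunk"],
}

def fake_search(query: str) -> list[str]:
    return fake_index.get(query, [])

variants = parse_variants(fake_llm_output)
retrieved = multi_query_retrieve(variants, fake_search)
print(retrieved)  # ['balance sheet chunk', 'cash flow chunk', 'notes chunk']
```

This is also why several `OllamaEmbeddings` progress bars can appear for a single question: each rewritten query is embedded on its own.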

In [57]:
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
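
The chain above reads left to right: the dict sends the same input to the retriever and passes it through unchanged as the question, the prompt fills its placeholders, the model answers, and the parser keeps only the text. Here is a plain-Python approximation of that data flow. `fake_retriever` and `fake_llm` are placeholders made up for this sketch, so no model or vector store is needed to run it.

```python
TEMPLATE = """Answer the question based ONLY on the following context:
{context}
Question: {question}
"""

def fake_retriever(question: str) -> list[str]:
    # Stand-in for the retriever: always returns the same toy chunk.
    return ["Toy chunk: revenue grew year over year."]

def fake_llm(prompt_text: str) -> dict:
    # Stand-in for the chat model: returns a message-like dict.
    return {"content": f"(reply based on a {len(prompt_text)}-character prompt)"}

def build_prompt(question: str) -> str:
    # The dict step: one input feeds both the retriever and the question slot.
    inputs = {"context": fake_retriever(question), "question": question}
    # The prompt step: placeholders are filled from that dict.
    return TEMPLATE.format(**inputs)

def run_chain(question: str) -> str:
    message = fake_llm(build_prompt(question))
    # The parser step: keep only the text of the model's message.
    return message["content"]

print(build_prompt("what is this about ?"))
print(run_chain("what is this about ?"))
```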

In [58]:
chain.invoke(input(""))

 what is this about ?


OllamaEmbeddings: 100%|██████████| 1/1 [00:00<00:00,  4.00it/s]
OllamaEmbeddings: 100%|██████████| 1/1 [00:00<00:00,  3.91it/s]
OllamaEmbeddings: 100%|██████████| 1/1 [00:00<00:00,  3.22it/s]
OllamaEmbeddings: 100%|██████████| 1/1 [00:00<00:00,  2.64it/s]
OllamaEmbeddings: 100%|██████████| 1/1 [00:00<00:00,  2.90it/s]


" This text appears to be a financial report for META Platforms, Inc. (previously known as Facebook, Inc.) for the fourth quarter of 2024. The report includes income statements and balance sheets that detail the company's revenue, expenses, assets, liabilities, and stockholders' equity.\n\n   Key points from the income statement include:\n   - META Platforms, Inc. reported total revenue of $134,902 million for the fourth quarter of 2024, up from $114,876 million in the same period a year earlier.\n   - The company's net income for Q4 2024 was $39,098 million, compared to $25,858 million in Q4 2023.\n   - General and administrative expenses increased by approximately $1.55 billion due to a decrease in the accrued losses for certain legal proceedings.\n\n   Key points from the balance sheet include:\n   - As of December 31, 2024, META Platforms, Inc.'s total assets were $276,054 million, up from $229,623 million at the end of 2023.\n   - The company's total liabilities and stockholders' 

In [59]:
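# Pass the question straight to the chain. Wrapping it in input() only displays it as a
# prompt, and the chain then receives whatever is typed at that prompt instead.
chain.invoke("what is the long-term debt of meta ?")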


OllamaEmbeddings: 100%|██████████| 1/1 [00:08<00:00,  8.02s/it]
OllamaEmbeddings: 100%|██████████| 1/1 [00:00<00:00,  4.33it/s]
OllamaEmbeddings: 100%|██████████| 1/1 [00:00<00:00,  6.40it/s]
OllamaEmbeddings: 100%|██████████| 1/1 [00:00<00:00,  7.00it/s]
OllamaEmbeddings: 100%|██████████| 1/1 [00:00<00:00,  5.32it/s]


' It appears the provided text is a financial report from Meta (formerly Facebook) for their fourth quarter and full year of 2024. Here\'s an overview of some key points:\n\n1. Revenue for the fourth quarter was $46,783 million, with a year-over-year growth of 21%. For the full year, revenue was $164,501 million, also showing a 21% increase compared to the previous year.\n\n2. The report includes both GAAP (Generally Accepted Accounting Principles) and non-GAAP financial measures. Non-GAAP income from operations for the fourth quarter was $28,332 million, while the net cash provided by operating activities was $27,988 million.\n\n3. The report also mentions that there was a foreign exchange effect on 2024 revenue using 2023 rates. However, the exact numbers and their impact are not explicitly stated in the text you provided.\n\n4. There is a segment for "Family of Apps" and "Reality Labs". The former includes advertising and other revenue, while the latter generates income from operati