# Ingesting PDF

In [17]:
%pip install -q unstructured langchain langchain-community
%pip install "unstructured[all-docs]"

Note: you may need to restart the kernel to use updated packages.
Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


In [18]:
from langchain_community.document_loaders import UnstructuredPDFLoader
from langchain_community.document_loaders import OnlinePDFLoader

In [19]:
import os
from langchain_community.document_loaders import UnstructuredPDFLoader

folder_path = "./trendwatching"
data = []  # Documents collected from every PDF in the folder

# Check that the folder exists before listing it
if os.path.isdir(folder_path):
    # Load every PDF in the trendwatching folder
    for filename in sorted(os.listdir(folder_path)):
        if filename.lower().endswith(".pdf"):  # Only process PDF files
            file_path = os.path.join(folder_path, filename)
            loader = UnstructuredPDFLoader(file_path=file_path)
            # Accumulate documents instead of overwriting `data` on each iteration
            data.extend(loader.load())
else:
    print("Trendwatching folder not found or is not a directory")

In [20]:
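# Preview the text of the first Document in `data` (in its default mode the loader returns one Document per PDF, not one per page)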
data[0].page_content

"What is\n\n?\n\nA brand-new, monthly\n\nbrieﬁng to help you\n\nnavigate this wild New\n\nNormal. We’ll be\n\ncovering the trends no\n\nbrand can afford to\n\nignore...but that’s not all.\n\nEvery issue includes\n\nexercises and prompts to\n\nget your team innovating:\n\nMaking shifts & making\n\nsh*t happen!\n\nSHARE\n\nIssue No. 3 · October\n\n2020\n\nA global crisis drafting every\n\nbrand – yours included.\n\nAs they’re ﬁghting off the\n\nvirus, ﬁghting inequality,\n\nﬁghting to maintain their\n\nlivelihoods...consumers\n\nglobally are trying to arm\n\nthemselves with\n\nknowledge. The facts! Yet\n\nbias and misinformation\n\nare muddling what\n\nshould be clear.\n\nConsumers are anxiously\n\nanalyzing each bit of\n\nnews for themselves:\n\nWhat’s fact? What’s fake?\n\nWhat’s operating in a\n\nmurky, middle ground?\n\nWhat does this mean for\n\nme? Consumers expect\n\nall brands – not just in\n\nmedia – to take action.\n\nThere are two key\n\nopportunities: You can\n\nhelp them spo

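The preview above still contains typographic ligatures carried over from the PDF (for example the single character `ﬁ` in `brieﬁng`), which can get in the way of keyword matching. A minimal sketch of one way to fold them back into plain letters, assuming Unicode NFKC normalization is acceptable for this text. `normalize_ligatures` is a hypothetical helper name, not part of LangChain or unstructured:

```python
import unicodedata


def normalize_ligatures(text: str) -> str:
    """Fold compatibility characters (such as the 'fi' ligature U+FB01) into plain letters."""
    # NFKC replaces compatibility characters with their canonical equivalents.
    return unicodedata.normalize("NFKC", text)


# Sample taken from the preview; \ufb01 is the 'fi' ligature character.
sample = "brie\ufb01ng to help you \ufb01ghting inequality"
print(normalize_ligatures(sample))
```

NFKC also rewrites other compatibility characters (an ellipsis character becomes three dots, for instance), so it is worth checking a few documents before applying it to everything in `data`.
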
# Vector Embeddings

In [21]:
!ollama pull nomic-embed-text

Error: could not connect to ollama app, is it running?


In [22]:
!ollama list

Error: could not connect to ollama app, is it running?
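
Both commands above failed because the Ollama server was not reachable. A small sketch of a pre-flight check that could run before pulling a model, assuming the server listens on its usual local address `http://localhost:11434`. The helper name `ollama_is_running` and the timeout are illustrative choices, not part of any Ollama client library:

```python
import urllib.request


def ollama_is_running(base_url: str = "http://localhost:11434", timeout: float = 2.0) -> bool:
    """Return True if an HTTP server answers with status 200 at base_url."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, ValueError):
        # OSError covers refused connections, timeouts and HTTP errors;
        # ValueError covers a malformed URL.
        return False


if ollama_is_running():
    print("Ollama server is reachable.")
else:
    print("Ollama server is not reachable; start it (for example with `ollama serve`) and retry.")
```

The check only confirms that something answers at that address; it does not verify that `nomic-embed-text` has been pulled.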


In [23]:
%pip install -q chromadb
%pip install -q langchain-text-splitters

Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.


In [24]:
from langchain_community.embeddings import OllamaEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma

In [25]:
# Split and chunk
text_splitter = RecursiveCharacterTextSplitter(chunk_size=7500, chunk_overlap=100)
chunks = text_splitter.split_documents(data)
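
To make the `chunk_size` and `chunk_overlap` arithmetic concrete, here is a deliberately simplified sketch using a plain sliding window. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraphs and lines); `sliding_chunks` is a hypothetical helper used only for illustration:

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size windows where neighbours share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    # Stop early enough that the last window is not fully contained in the previous one.
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


# Same settings as the cell above, applied to a dummy 20,000-character string.
dummy = "x" * 20_000
print([len(c) for c in sliding_chunks(dummy, chunk_size=7500, chunk_overlap=100)])
```

With these settings each new chunk starts 7,400 characters after the previous one, so consecutive chunks share 100 characters of context.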

In [27]:
vector_db = Chroma.from_documents(
  documents=chunks,
  embedding=OllamaEmbeddings(model="nomic-embed-text", show_progress=True),
  collection_name="local-rag"
)


OllamaEmbeddings:   0%|          | 0/3 [00:00<?, ?it/s]

OllamaEmbeddings: 100%|██████████| 3/3 [00:02<00:00,  1.26it/s]
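
Once the chunks are embedded, retrieval amounts to comparing a query vector against the stored vectors. The toy sketch below ranks made-up three-dimensional vectors by cosine similarity. Both the vectors and the choice of cosine similarity are assumptions for illustration, and the collection created above may be configured with a different distance metric:

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], doc_vecs: dict[str, list[float]], k: int) -> list[str]:
    """Return the ids of the k stored vectors most similar to query_vec."""
    ranked = sorted(
        doc_vecs,
        key=lambda doc_id: cosine_similarity(query_vec, doc_vecs[doc_id]),
        reverse=True,
    )
    return ranked[:k]  # slicing returns fewer than k ids if the index is smaller


# Hypothetical embeddings for three chunks.
toy_index = {
    "chunk-0": [1.0, 0.0, 0.0],
    "chunk-1": [0.6, 0.8, 0.0],
    "chunk-2": [0.0, 1.0, 0.0],
}
print(top_k([1.0, 0.1, 0.0], toy_index, k=2))
```

Real embeddings from `nomic-embed-text` have many more dimensions, but the ranking step follows the same idea.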


# Retrieval

In [28]:
from langchain_core.prompts import ChatPromptTemplate, PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_community.chat_models import ChatOllama
from langchain_core.runnables import RunnablePassthrough
from langchain.retrievers.multi_query import MultiQueryRetriever

In [29]:
# LLM from Ollama
local_model = "llama3"
llm = ChatOllama(model=local_model)

In [30]:
QUERY_PROMPT = PromptTemplate(
  input_variables=["question"],
  template="""You are an AI language model assistant. Your task is to generate five
    different versions of the given user question to retrieve relevant documents from
    a vector database. By generating multiple perspectives on the user question, your
    goal is to help the user overcome some of the limitations of the distance-based
    similarity search. Provide these alternative questions separated by newlines.
    Original question: {question}"""
)


In [31]:
retriever = MultiQueryRetriever.from_llm(
  vector_db.as_retriever(),
  llm,
  prompt=QUERY_PROMPT
)

# RAG prompt
template = """Answer the question based ONLY on the following context:
{context}
Question: {question}
"""

prompt = ChatPromptTemplate.from_template(template)
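
The idea behind the multi-query retriever configured above can be sketched without any LLM: rephrase the question several ways, search once per variant, and keep the unique union of the hits. Everything below (`multi_query_retrieve`, `fake_variants`, `fake_search`, and the toy corpus) is hypothetical stand-in code, not the LangChain implementation:

```python
from typing import Callable


def multi_query_retrieve(
    question: str,
    generate_variants: Callable[[str], list[str]],
    search: Callable[[str], list[str]],
) -> list[str]:
    """Run search once per question variant and return the de-duplicated hits in order."""
    seen: set[str] = set()
    merged: list[str] = []
    for query in generate_variants(question):
        for doc in search(query):
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged


# Stand-in for the LLM: returns fixed rephrasings instead of generated ones.
def fake_variants(question: str) -> list[str]:
    return [question, "How do brands build trust?", "Which local and social trends matter?"]


# Stand-in for the vector store: a keyword lookup over a toy corpus.
TOY_DOCS = {
    "trust": "doc about verification badges",
    "local": "doc about local knowledge",
    "social": "doc about social platforms",
}


def fake_search(query: str) -> list[str]:
    return [text for key, text in TOY_DOCS.items() if key in query.lower()]


print(multi_query_retrieve("What trust signals matter?", fake_variants, fake_search))
```

The first two variants both hit the same document, and the de-duplication step keeps it only once before the context is passed to the answering prompt.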

In [32]:
chain = (
  {"context": retriever, "question": RunnablePassthrough()}
  | prompt
  | llm
  | StrOutputParser()
)

In [34]:
import textwrap
response = chain.invoke("I am building a startup to make friends with strangers and meet them in bars and restaurants. List the trends and companies that I should know about to improve my innovation?")
print('\n'.join(textwrap.wrap(response, width=80)))

OllamaEmbeddings: 100%|██████████| 1/1 [00:00<00:00,  1.01it/s]
Number of requested results 4 is greater than number of elements in index 3, updating n_results = 3
OllamaEmbeddings: 100%|██████████| 1/1 [00:00<00:00, 32.78it/s]
Number of requested results 4 is greater than number of elements in index 3, updating n_results = 3
OllamaEmbeddings: 100%|██████████| 1/1 [00:00<00:00, 28.52it/s]
Number of requested results 4 is greater than number of elements in index 3, updating n_results = 3
OllamaEmbeddings: 100%|██████████| 1/1 [00:00<00:00, 50.33it/s]
Number of requested results 4 is greater than number of elements in index 3, updating n_results = 3
OllamaEmbeddings: 100%|██████████| 1/1 [00:00<00:00, 17.99it/s]
Number of requested results 4 is greater than number of elements in index 3, updating n_results = 3
OllamaEmbeddings: 100%|██████████| 1/1 [00:00<00:00, 56.25it/s]
Number of requested results 4 is greater than number of elements in index 3, updating n_results = 3
OllamaEmbeddings

I've analyzed the provided context, which is focused on "The Fight For Facts"
trendwatching report by TrendWatching Make→Shift. Based on this report, here are
some trends and companies you may want to consider when building your startup:
1. **Verification badges**: Terveystalo's "Essential Influencers" campaign
suggests creating a badge for trustworthy sources. You could explore similar
ideas for verifying the identities and credibility of your users. 2. **Digital
tools for local knowledge**: The Soufan Center's collaboration with Truepic
showcases how digital tools can be used to verify stories from underreported
areas. You might consider integrating local knowledge or insights into your
platform to enhance the experience of meeting strangers. 3. **Social media
platforms**: As a startup focused on in-person interactions, you'll likely need
to coexist with social media platforms. Understanding their verification
processes and potential badges for trustworthy sources could help you desi
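
In the printed response the numbered items run into one another, most likely because `textwrap.wrap` treats newlines as ordinary whitespace and collapses them. A small sketch of a wrapper that wraps each original line separately so list items and blank lines survive. `wrap_preserving_lines` is a hypothetical helper name:

```python
import textwrap


def wrap_preserving_lines(text: str, width: int = 80) -> str:
    """Wrap each line of text on its own so existing line breaks are kept."""
    wrapped: list[str] = []
    for line in text.splitlines():
        # textwrap.wrap returns [] for an empty line; keep it as a blank line.
        wrapped.extend(textwrap.wrap(line, width=width) or [""])
    return "\n".join(wrapped)


# Made-up text standing in for a model response with a numbered list.
demo = "Here are some trends:\n1. Verification badges\n2. Digital tools for local knowledge"
print(wrap_preserving_lines(demo, width=80))
```

Calling `print(wrap_preserving_lines(response))` in place of the `'\n'.join(textwrap.wrap(...))` expression would keep one list item per line.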