## Part 1: Read a file

In [9]:
from langchain_community.document_loaders.pdf import PyPDFLoader

file_path = "/Users/i749910/Downloads/Detailed_Urban_Planning_and_Smart_Cities_Summary_Corrected.pdf"
loader = PyPDFLoader(file_path)
docs = loader.load()

docs

[Document(page_content='Summary of Urban Planning and Smart Cities\nUrban planning has evolved significantly with the advancement of digital technologies, giving rise to\nthe concept \nof the Smart City. This approach aims to improve urban environments by integrating technologies\nthat enhance \nsustainability, efficiency, and the overall quality of life for residents. Smart cities use data analytics, \nInternet of Things (IoT) devices, and artificial intelligence to tackle common urban issues such as\ntraffic congestion, \nwaste management, energy distribution, and public safety.\nKey Components of Smart City Implementation:\n1. Infrastructure and Technology: Developing a robust infrastructure that supports smart devices and\nsensors is \nessential. This includes building efficient transport systems, energy grids, and digital networks that\ncollect \nand process real-time data.\n2. Data Management: Smart cities rely heavily on data collection and analysis. Information gathered\nfrom v

## Part 2: Summarize the document

In [10]:
# Prompt
from langchain_core.prompts import PromptTemplate

prompt_template = """Write a long summary of the following document. 
Only include information that is part of the document. 
Do not include your own opinion or analysis.

Document:
"{document}"
Summary:"""
prompt = PromptTemplate.from_template(prompt_template)

In [11]:
# Define LLM Chain

from langchain_openai import ChatOpenAI
from langchain.chains.llm import LLMChain

llm = ChatOpenAI(
    temperature=0.1,
    model_name="mistral:latest",
    api_key="ollama",
    base_url="http://localhost:11434/v1",
)
llm_chain = LLMChain(llm=llm, prompt=prompt)

In [12]:
# Create full chain
from langchain.chains.combine_documents.stuff import StuffDocumentsChain

stuff_chain = StuffDocumentsChain(
    llm_chain=llm_chain, document_variable_name="document"
)
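
Conceptually, "stuffing" means joining the text of every page into one string and substituting it into the prompt, so the model is called exactly once. The sketch below imitates that in plain Python. `Page`, `stuff_prompt`, and the sample pages are illustrative names rather than LangChain objects, and the blank-line separator is an assumption about the default behavior.

```python
from dataclasses import dataclass


@dataclass
class Page:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str


# Shortened version of the summary prompt, with a single {document} slot
PROMPT = 'Document:\n"{document}"\nSummary:'


def stuff_prompt(pages, template=PROMPT, separator="\n\n"):
    """Join all page texts and place them in the single template slot."""
    joined = separator.join(p.page_content for p in pages)
    return template.format(document=joined)


pages = [Page("Page one text."), Page("Page two text.")]
stuffed = stuff_prompt(pages)
print(stuffed)
```

Because everything goes into one prompt, the whole document has to fit in the model's context window, which is the main limitation of this approach.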

In [13]:
result = stuff_chain.invoke(docs)
print(result["output_text"])

 The document discusses the concept of Smart Cities, an evolution of urban planning facilitated by digital technologies. The goal is to enhance sustainability, efficiency, and quality of life in urban environments through the integration of data analytics, IoT devices, and AI. Common urban issues such as traffic congestion, waste management, energy distribution, and public safety are addressed using these technologies.

   Key components of Smart City implementation include:
   1. Infrastructure and Technology: Developing robust infrastructure to support smart devices and sensors, including efficient transport systems, energy grids, and digital networks for real-time data collection and processing.
   2. Data Management: Smart cities rely on data collection and analysis to provide insights for informed decision-making and policy development.
   3. Citizen Engagement: Smart City projects prioritize citizen engagement through digital platforms for communication and feedback, fostering a 

In [14]:
# Invoke with limited pages (docs[:-3] drops the last three pages)
from textwrap import fill  # wraps long output to a fixed line width

result = stuff_chain.invoke(docs[:-3])
print(fill(result["output_text"]))
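
`textwrap.fill` from the Python standard library wraps a long string to a fixed line width (70 characters by default), which keeps model output readable in a notebook. A minimal sketch with made-up text and a narrower width:

```python
from textwrap import fill

# Made-up sample text, repeated to get a long single line
text = "The document discusses the concept of Smart Cities, " * 3

# Wrap to at most 40 characters per line
wrapped = fill(text, width=40)
print(wrapped)
```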

## Part 3: Query a document with MapReduce

A common way to query a document is retrieval-augmented generation (RAG): split the document into chunks, generate an embedding for each chunk, store the chunks and their embeddings in a database, retrieve the chunks whose embeddings are closest to the query (semantic search), and use the most relevant chunks to build the answer.
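
These retrieval steps can be sketched without any external services. In this toy version a bag-of-words count vector stands in for a real embedding model and a Python list stands in for the vector database. `chunk_text`, `embed`, `cosine`, and `retrieve` are illustrative names, not LangChain APIs, and the sample text is made up.

```python
import math
import re
from collections import Counter


def chunk_text(text, size=10):
    """Split text into chunks of `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def embed(text):
    """Toy embedding: lowercase word counts (a real system would call a model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query, store, k=1):
    """Return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(store, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


text = (
    "Smart cities use sensors to monitor traffic congestion in real time. "
    "Waste management relies on route optimization for collection trucks."
)
# "Database": a list of (chunk, embedding) pairs
store = [(c, embed(c)) for c in chunk_text(text)]
top = retrieve("How is traffic congestion monitored?", store, k=1)
print(top)
```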

Here I take a simpler but more processing-intensive approach: MapReduce. The model looks at the pages one by one and extracts from each page the information relevant to the question (the map step). The per-page results are then combined to build the final answer (the reduce step).
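
The control flow can be outlined in plain Python before wiring up LangChain. Here `fake_llm_map` and `fake_llm_reduce` are hypothetical stand-ins for the model calls (simple keyword matching and string joining), so the sketch shows only the shape of the computation: one call per page, then one combining call. Filtering out "not relevant" results before the reduce step is a simplification made in this sketch; the chain built in the following cells passes every partial answer to the reduce prompt.

```python
def fake_llm_map(page, query_keyword):
    """Hypothetical stand-in for the per-page model call."""
    sentences = [s.strip() for s in page.split(".") if s.strip()]
    hits = [s for s in sentences if query_keyword.lower() in s.lower()]
    return ". ".join(hits) if hits else "not relevant"


def fake_llm_reduce(partials):
    """Hypothetical stand-in for the combining model call."""
    return " | ".join(partials)


def map_reduce(pages, query_keyword):
    # Map: one independent call per page
    partials = [fake_llm_map(p, query_keyword) for p in pages]
    # Drop pages that the map step marked as irrelevant
    relevant = [p for p in partials if p != "not relevant"]
    # Reduce: one call over the surviving partial answers
    return fake_llm_reduce(relevant) if relevant else "not relevant"


pages = [
    "Sensors collect traffic data. Parks improve wellbeing.",
    "Citizens give feedback online.",
    "Energy grids report usage data every minute.",
]
answer = map_reduce(pages, "data")
print(answer)
```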

In [8]:
user_query = "What is the data used in this analysis?"

In [10]:
# Map part: Applied to each page
map_template = """The following is a set of documents
{docs}
Based on this list of documents, please identify the information that is most relevant to the following query:
{user_query} 
If the document is not relevant, please write "not relevant".
Helpful Answer:"""
map_prompt = PromptTemplate.from_template(map_template)
map_prompt = map_prompt.partial(user_query=user_query)
map_chain = LLMChain(llm=llm, prompt=map_prompt)

In [11]:
# Reduce part: Applied to the list of results
reduce_template = """The following is a set of partial answers to a user query:
{docs}
Take these and distill them into a final, consolidated answer to the following query:
{user_query} 
Complete Answer:"""
reduce_prompt = PromptTemplate.from_template(reduce_template)
reduce_prompt = reduce_prompt.partial(user_query=user_query)
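
Calling `.partial(...)` pre-fills one template variable so that only the remaining one has to be supplied when the chain runs. The same idea can be shown with `functools.partial` on an ordinary function. This is an analogy, not how LangChain implements it, and `TEMPLATE` and `render` are illustrative names.

```python
from functools import partial

TEMPLATE = "Partial answers:\n{docs}\nQuery: {user_query}\nComplete Answer:"


def render(docs, user_query):
    """Fill both slots of the template."""
    return TEMPLATE.format(docs=docs, user_query=user_query)


# Bind the query once; later calls only pass the documents.
render_for_query = partial(
    render, user_query="What is the data used in this analysis?"
)

filled = render_for_query(docs="answer A\nanswer B")
print(filled)
```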

In [12]:
# Full chain
from langchain.chains import MapReduceDocumentsChain, ReduceDocumentsChain


reduce_chain = LLMChain(llm=llm, prompt=reduce_prompt)

# Takes a list of documents, combines them into a single string, and passes this to an LLMChain
combine_documents_chain = StuffDocumentsChain(
    llm_chain=reduce_chain, document_variable_name="docs"
)

# Combines and iteratively reduces the mapped documents
reduce_documents_chain = ReduceDocumentsChain(
    # This is the final chain that is called.
    combine_documents_chain=combine_documents_chain,
    # If documents exceed context for `StuffDocumentsChain`
    collapse_documents_chain=combine_documents_chain,
    # The maximum number of tokens to group documents into.
    token_max=4000,
)
# Combining documents by mapping a chain over them, then combining results
map_reduce_chain = MapReduceDocumentsChain(
    # Map chain
    llm_chain=map_chain,
    # Reduce chain
    reduce_documents_chain=reduce_documents_chain,
    # The variable name in the llm_chain to put the documents in
    document_variable_name="docs",
    # Return the results of the map steps in the output
    return_intermediate_steps=False,
)
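
The role of `token_max` can be illustrated with a small sketch of the grouping step: mapped outputs are packed into groups that each stay within the limit, each group is collapsed into one shorter text, and the process repeats until everything fits in a single reduce call. This reflects my understanding of the behavior, not the library's exact code. `count_tokens` (a word count) and `collapse` (a truncating stub) are hypothetical stand-ins for the tokenizer and the model.

```python
def count_tokens(text):
    """Crude stand-in for a tokenizer: one token per word."""
    return len(text.split())


def group_by_budget(texts, token_max):
    """Pack consecutive texts into groups whose total stays within token_max."""
    groups, current = [], []
    for t in texts:
        if current and sum(map(count_tokens, current + [t])) > token_max:
            groups.append(current)
            current = []
        current.append(t)
    if current:
        groups.append(current)
    return groups


def collapse(group, keep=3):
    """Stand-in for the collapse call: keep the first `keep` words of each text."""
    return " ".join(" ".join(t.split()[:keep]) for t in group)


def reduce_until_fits(texts, token_max):
    """Collapse groups repeatedly until the total fits within token_max."""
    total = sum(map(count_tokens, texts))
    while total > token_max:
        texts = [collapse(g) for g in group_by_budget(texts, token_max)]
        new_total = sum(map(count_tokens, texts))
        if new_total >= total:  # no progress; stop rather than loop forever
            break
        total = new_total
    return texts


mapped = ["one two three four five"] * 4  # 20 tokens in total
final = reduce_until_fits(mapped, token_max=10)
print(final)
```

With a real model, each collapse is an extra LLM call, so a small `token_max` trades context size for more round trips.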

In [13]:
result = map_reduce_chain.invoke(docs[:-3])
print(fill(result["output_text"]))

 The data used in this analysis includes stock prices, trading volume,
and order imbalance measures for a sample of U.S. firms that
experienced a newswire hack between 2010 and 2014. Specifically, the
authors use minute-level trade and quote data from the Trade and Quote
(TAQ) database, which is maintained by the New York Stock Exchange.
They also use firm-level financial data from Compustat and measures of
media coverage from Factiva. The sample includes all U.S. common
stocks listed on the NYSE, NASDAQ, or AMEX exchanges during the study
period. Not all variables are available for all firms and time
periods, resulting in an unbalanced panel. Additionally, the analysis
uses information from legal documents of SEC prosecutions, newswire
servers, and a set of control variables such as log market
capitalization, fraction of shares held by institutional investors,
natural logarithm of number of analysts, natural logarithm of newswire
news in the quarter leading to the announcement, daily 