# Soham Navale MTech (AIML) 23070149021



## Use case

Suppose you have a set of documents (PDFs, Notion pages, customer questions, etc.) and you want to summarize the content.

LLMs are a great tool for this given their proficiency in understanding and synthesizing text.

In this walkthrough we'll go over how to perform document summarization using LLMs.

![Image description](https://github.com/langchain-ai/langchain/blob/master/docs/static/img/summarization_use_case_1.png?raw=1)

## Overview

A central question for building a summarizer is how to pass your documents into the LLM's context window. Two common approaches for this are:

1. `Stuff`: Simply "stuff" all your documents into a single prompt. This is the simplest approach (see [here](/docs/modules/chains#lcel-chains) for more on the `create_stuff_documents_chain` constructor, which is used for this method).

2. `Map-reduce`: Summarize each document on its own in a "map" step and then "reduce" the summaries into a final summary (see [here](/docs/modules/chains#legacy-chains) for more on the `MapReduceDocumentsChain`, which is used for this method).
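
To make the difference concrete, here is a minimal, library-free sketch of the two strategies. The names `stuff_summarize`, `map_reduce_summarize`, and `fake_llm` are illustrative assumptions, not LangChain APIs, and `fake_llm` is only a stand-in for a real model call so that the sketch runs offline.

```python
from typing import Callable, List


def stuff_summarize(docs: List[str], llm: Callable[[str], str]) -> str:
    """Stuff: join every document into one prompt and make a single LLM call."""
    prompt = "Write a concise summary of the following:\n" + "\n\n".join(docs)
    return llm(prompt)


def map_reduce_summarize(docs: List[str], llm: Callable[[str], str]) -> str:
    """Map-reduce: summarize each document separately, then summarize the summaries."""
    partial = [llm("Summarize this document:\n" + d) for d in docs]  # map step
    return llm("Combine these summaries into one:\n" + "\n".join(partial))  # reduce step


# Hypothetical stand-in for a real model: records each prompt and returns a label.
calls: List[str] = []


def fake_llm(prompt: str) -> str:
    calls.append(prompt)
    return f"summary#{len(calls)}"


docs = ["first document", "second document", "third document"]

stuff_result = stuff_summarize(docs, fake_llm)
stuff_calls = len(calls)  # one call for the whole set

calls.clear()
map_reduce_result = map_reduce_summarize(docs, fake_llm)
map_reduce_calls = len(calls)  # one call per document plus one reduce call
```

With three documents, the stuff approach makes one model call, while map-reduce makes four: three map calls and one reduce call. The trade-off is that stuff needs everything to fit in a single prompt, whereas map-reduce only needs each document (and the set of summaries) to fit.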

![Image description](https://github.com/langchain-ai/langchain/blob/master/docs/static/img/summarization_use_case_2.png?raw=1)

## Quickstart

To give you a sneak preview, either pipeline can be wrapped in a single object: `load_summarize_chain`.

Suppose we want to summarize a directory of PDF files. We can do this in a few lines of code.

First set environment variables and install packages:

In [None]:
%pip install --upgrade --quiet  langchain-openai tiktoken chromadb langchain langchainhub
import getpass
import os

# Prompt for the key at runtime instead of hard-coding a secret in the notebook
os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API key: ")

We can use `chain_type="stuff"`, especially if using larger context window models such as:

* 16k token OpenAI `gpt-3.5-turbo-1106`
* 100k token Anthropic [Claude-2](https://www.anthropic.com/index/claude-2)

We can also supply `chain_type="map_reduce"` or `chain_type="refine"`.
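
As a rough illustration of that choice, the sketch below picks a chain type from an estimated token count. Everything here is an assumption made for illustration: `estimate_tokens` uses a crude "about 4 characters per token" rule rather than a real tokenizer such as `tiktoken`, and `pick_chain_type` is a hypothetical helper, not part of LangChain.

```python
from typing import List


def estimate_tokens(text: str) -> int:
    """Very rough estimate: about 4 characters per token (an assumption, not a tokenizer)."""
    return max(1, len(text) // 4)


def pick_chain_type(
    docs: List[str], context_window: int, reserve_for_output: int = 1000
) -> str:
    """Return "stuff" when all documents fit in one prompt, otherwise "map_reduce"."""
    total = sum(estimate_tokens(d) for d in docs)
    budget = context_window - reserve_for_output  # leave room for the summary itself
    return "stuff" if total <= budget else "map_reduce"


small = ["short note"] * 3
large = ["x" * 40_000] * 5  # roughly 50k estimated tokens in total

choice_small = pick_chain_type(small, context_window=16_000)
choice_large = pick_chain_type(large, context_window=16_000)
```

The returned string could then be passed as `chain_type` to `load_summarize_chain`. For a real decision, count tokens with the tokenizer that matches your model.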

In [None]:
!pip install pypdf

Collecting pypdf
  Downloading pypdf-4.2.0-py3-none-any.whl (290 kB)
Installing collected packages: pypdf
Successfully installed pypdf-4.2.0


In [None]:
from langchain.chains.summarize import load_summarize_chain
from langchain_community.document_loaders import PyPDFDirectoryLoader
from langchain_openai import ChatOpenAI

# Load the PDFs found in the directory
loader = PyPDFDirectoryLoader("/content/data")
docs = loader.load()

llm = ChatOpenAI(temperature=0, model_name="gpt-4-turbo")
chain = load_summarize_chain(llm, chain_type="stuff")

chain.run(docs)



"The text discusses various aspects of Ayurvedic philosophy, focusing on the concepts of doshas, dhatus, srotas, and the interplay between the body, mind, and consciousness. It enumerates the different types of doshic combinations and their effects on health, detailing how doshas can manifest in various combinations and states, leading to different health outcomes. The text also delves into the anatomy and physiology of the human body from an Ayurvedic perspective, describing the functions and disorders associated with different body parts and systems, including the roles of various bodily fluids and tissues.\n\nFurther, it explores the concept of Agni (digestive fire) and its types, which are crucial for maintaining health. The discussion extends to the mind and senses, explaining how they interact with the material and spiritual elements of human existence. The text also touches on philosophical aspects, discussing the nature of the self (Atman), its relationship with the universe (P

## Option 1. Stuff

When we use `load_summarize_chain` with `chain_type="stuff"`, we will use the [StuffDocumentsChain](https://api.python.langchain.com/en/latest/chains/langchain.chains.combine_documents.stuff.StuffDocumentsChain.html#langchain.chains.combine_documents.stuff.StuffDocumentsChain).

The chain takes a list of documents, inserts them all into a prompt, and passes that prompt to an LLM:

In [None]:
from langchain.chains.combine_documents.stuff import StuffDocumentsChain
from langchain.chains.llm import LLMChain
from langchain_core.prompts import PromptTemplate

# Define prompt
prompt_template = """Write a concise summary of the following:
"{text}"
CONCISE SUMMARY:"""
prompt = PromptTemplate.from_template(prompt_template)

# Define LLM chain
llm = ChatOpenAI(temperature=0, model_name="gpt-4-turbo")
llm_chain = LLMChain(llm=llm, prompt=prompt)

# Define StuffDocumentsChain
stuff_chain = StuffDocumentsChain(llm_chain=llm_chain, document_variable_name="text")

docs = loader.load()
print(stuff_chain.run(docs))



The text discusses various aspects of Ayurvedic philosophy, focusing on the concepts of doshas (bodily humors), dhatus (tissues), srotas (channels), and the interplay between the body, mind, and consciousness. It enumerates different types of doshic combinations and their effects on health, detailing how doshas can manifest in both aggravated and diminished states, affecting the body's balance and function.

The discussion extends to the anatomy and physiology of the human body from an Ayurvedic perspective, describing the roles and functions of various body parts and systems, including the cardiovascular system, skin layers, and the digestive system. It also covers the treatment principles for balancing doshas and treating diseases, emphasizing the importance of understanding the body's constitution and the qualities of different substances and their effects on health.

Furthermore, the text delves into the philosophical aspects of Ayurveda, exploring the nature of the mind, senses, a

Great! This closely matches the earlier result from `load_summarize_chain`.

### Go deeper

* You can easily customize the prompt.
* You can easily try different LLMs (e.g., [Claude](/docs/integrations/chat/anthropic)) via the `llm` parameter.

## Option 2. Map-Reduce

Let's unpack the map reduce approach. For this, we'll first map each document to an individual summary using an `LLMChain`. Then we'll use a `ReduceDocumentsChain` to combine those summaries into a single global summary.

First, we specify the LLMChain to use for mapping each document to an individual summary:

### Creating Map Chain

In [None]:
from langchain.chains import MapReduceDocumentsChain, ReduceDocumentsChain
from langchain_text_splitters import CharacterTextSplitter

llm = ChatOpenAI(temperature=0)

# Map
map_template = """The following is a set of documents
{docs}
Based on this list of docs, please identify the main themes
Helpful Answer:"""
map_prompt = PromptTemplate.from_template(map_template)
map_chain = LLMChain(llm=llm, prompt=map_prompt)

We can also use the Prompt Hub to store and fetch prompts.

This will work with your [LangSmith API key](https://docs.smith.langchain.com/).

For example, see the map prompt [here](https://smith.langchain.com/hub/rlm/map-prompt).

In [None]:
#from langchain import hub

#map_prompt = hub.pull("rlm/map-prompt")
#map_chain = LLMChain(llm=llm, prompt=map_prompt)

### Creating Reduce Chain

The `ReduceDocumentsChain` takes the document mapping results and reduces them into a single output. It wraps a generic `CombineDocumentsChain` (like `StuffDocumentsChain`) but adds the ability to collapse documents before passing them to the `CombineDocumentsChain` if their cumulative size exceeds `token_max`. In this example, we can reuse the chain that combines our docs to also collapse them.

So if the cumulative number of tokens in our mapped documents exceeds 4000 tokens, then we'll recursively pass in the documents in batches of < 4000 tokens to our `StuffDocumentsChain` to create batched summaries. And once those batched summaries are cumulatively less than 4000 tokens, we'll pass them all one last time to the `StuffDocumentsChain` to create the final summary.
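
The collapse logic described above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not LangChain's actual implementation: `count_tokens` counts whitespace-separated words instead of real tokens, and `batch_by_budget`, `reduce_with_collapse`, and `fake_combine` are hypothetical names.

```python
from typing import Callable, List


def count_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: counts whitespace-separated words."""
    return len(text.split())


def batch_by_budget(texts: List[str], token_max: int) -> List[List[str]]:
    """Greedily group texts so that each group stays within token_max."""
    batches: List[List[str]] = []
    current: List[str] = []
    used = 0
    for t in texts:
        n = count_tokens(t)
        if current and used + n > token_max:
            batches.append(current)
            current, used = [], 0
        current.append(t)
        used += n
    if current:
        batches.append(current)
    return batches


def reduce_with_collapse(
    summaries: List[str], combine: Callable[[List[str]], str], token_max: int
) -> str:
    """Collapse in batches until everything fits in token_max, then combine once more."""
    while sum(count_tokens(s) for s in summaries) > token_max:
        batches = batch_by_budget(summaries, token_max)
        if len(batches) == len(summaries):
            break  # every batch holds one item, so collapsing cannot make progress
        summaries = [combine(b) for b in batches]
    return combine(summaries)


combine_calls: List[int] = []


def fake_combine(batch: List[str]) -> str:
    """Hypothetical stand-in for the LLM: keeps only the first word of each text."""
    combine_calls.append(len(batch))
    return " ".join(t.split()[0] for t in batch)


mapped = [f"doc{i} word word word" for i in range(6)]  # 6 texts x 4 "tokens" = 24
final = reduce_with_collapse(mapped, fake_combine, token_max=10)
```

Here the six mapped texts exceed the budget of 10, so they are collapsed in three batches of two, and the three batch summaries are then combined in one final call.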

In [None]:
# Reduce
reduce_template = """The following is set of summaries:
{docs}
Take these and distill it into a final, consolidated summary of the main themes.
Helpful Answer:"""
reduce_prompt = PromptTemplate.from_template(reduce_template)

In [None]:
# Note we can also get this from the prompt hub, as noted above
#reduce_prompt = hub.pull("rlm/map-prompt")

In [None]:
reduce_prompt

PromptTemplate(input_variables=['docs'], template='The following is set of summaries:\n{docs}\nTake these and distill it into a final, consolidated summary of the main themes.\nHelpful Answer:')

In [None]:
# Define the LLM chain used for the reduce step
reduce_chain = LLMChain(llm=llm, prompt=reduce_prompt)

# Takes a list of documents, combines them into a single string, and passes this to an LLMChain
combine_documents_chain = StuffDocumentsChain(
    llm_chain=reduce_chain, document_variable_name="docs"
)

# Combines and iteratively reduces the mapped documents
reduce_documents_chain = ReduceDocumentsChain(
    # This is the final chain that is called.
    combine_documents_chain=combine_documents_chain,
    # If documents exceed context for `StuffDocumentsChain`
    collapse_documents_chain=combine_documents_chain,
    # The maximum number of tokens to group documents into.
    token_max=4000,
)

### Combining Map and Reduce Chains

Combining our map and reduce chains into one:

In [None]:
# Combining documents by mapping a chain over them, then combining results
map_reduce_chain = MapReduceDocumentsChain(
    # Map chain
    llm_chain=map_chain,
    # Reduce chain
    reduce_documents_chain=reduce_documents_chain,
    # The variable name in the llm_chain to put the documents in
    document_variable_name="docs",
    # Return the results of the map steps in the output
    return_intermediate_steps=False,
)

text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=1000, chunk_overlap=0
)
split_docs = text_splitter.split_documents(docs)

In [None]:
print(map_reduce_chain.run(split_docs))

The main themes identified in the set of documents encompass a wide range of topics related to Ayurvedic principles, human anatomy, bodily functions, doshas, dhatus, srotas, and the nature of the self. These themes include the qualities and characteristics of the doshas (Vata, Pitta, Kapha), the importance of maintaining balance in the body, the impact of vitiated doshas on health, the functions and disorders of bodily tissues and fluids, the role of Agni in digestion, the relationship between the body, mind, and spirit, and the concept of the self in relation to consciousness and perception. The documents also delve into the nature of sense organs, intellect, and sense objects, as well as the eternal nature of the self and its connection to the supreme self. Overall, the main themes revolve around holistic health, balance, and the interconnectedness of physical, mental, and spiritual aspects of human existence.


### Go deeper

**Customization**

* As shown above, you can customize the LLMs and prompts for map and reduce stages.

**Real-world use-case**

* See [this blog post](https://blog.langchain.dev/llms-to-improve-documentation/) for a case study on analyzing user interactions (questions about LangChain documentation).
* The blog post and associated [repo](https://github.com/mendableai/QA_clustering) also introduce clustering as a means of summarization.
* This opens up a third path beyond the `stuff` or `map-reduce` approaches that is worth considering.

![Image description](https://github.com/langchain-ai/langchain/blob/master/docs/static/img/summarization_use_case_3.png?raw=1)

## Option 3. Refine

[RefineDocumentsChain](/docs/modules/chains#legacy-chains) is similar to map-reduce:

> The refine documents chain constructs a response by looping over the input documents and iteratively updating its answer. For each document, it passes all non-document inputs, the current document, and the latest intermediate answer to an LLM chain to get a new answer.

This is easy to run by specifying `chain_type="refine"`.
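
Before running the real chain, it may help to see the refine loop spelled out. The following is a minimal sketch, assuming a hypothetical `fake_llm` stand-in for the model and an illustrative `refine_summarize` helper. It is not how `RefineDocumentsChain` is implemented internally, only the idea from the quote above.

```python
from typing import Callable, List, Tuple


def refine_summarize(
    docs: List[str], llm: Callable[[str], str]
) -> Tuple[str, List[str]]:
    """Summarize the first document, then update the summary once per remaining document."""
    if not docs:
        raise ValueError("need at least one document")
    answer = llm("Write a concise summary of the following:\n" + docs[0])
    steps = [answer]  # intermediate answers, one per document
    for doc in docs[1:]:
        prompt = (
            f"Existing summary: {answer}\n"
            f"New context:\n{doc}\n"
            "Refine the existing summary if the new context is useful."
        )
        answer = llm(prompt)
        steps.append(answer)
    return answer, steps


# Hypothetical stand-in for a real model: records each prompt and returns a label.
calls: List[str] = []


def fake_llm(prompt: str) -> str:
    calls.append(prompt)
    return f"summary#{len(calls)}"


final, steps = refine_summarize(["first chunk", "second chunk", "third chunk"], fake_llm)
```

Unlike map-reduce, the calls here are sequential: each prompt carries the previous answer, so the documents cannot be processed in parallel.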

In [None]:
chain = load_summarize_chain(llm, chain_type="refine")
chain.run(split_docs)

"The text delves into the enumeration of combinations of doshas, totaling 50 types including aggravation and diminution. It discusses the various types of doshic aggravations and diminutions, as well as the conditions of taking them together. Doshas can combine in different proportions, totaling 62 combinations. The text also explores the normal functions and adverse effects of nature, as well as the characteristics of Vayu in the body, including its forms, vitiation, and effects on strength, complexion, happiness, and lifespan. It further delves into the etiology of V's vitiation in the dhatus, sub-divisions of Vata, and associated disorders. The importance of treating V with opposite qualities and maintaining balance for overall health is emphasized. Additionally, the text discusses the normal functions and adverse effects of Pitta and Kapha, along with their characteristics in vitiation and associated disorders. Treatment methods for pacifying Pitta and Kapha are highlighted, emphas

It's also possible to supply a prompt and return intermediate steps.

In [None]:
prompt_template = """Write a concise summary of the following:
{text}
CONCISE SUMMARY:"""
prompt = PromptTemplate.from_template(prompt_template)

refine_template = (
    "Your job is to produce a final summary\n"
    "We have provided an existing summary up to a certain point: {existing_answer}\n"
    # Trailing space and newline keep adjacent string literals from running together
    "We have the opportunity to refine the existing summary "
    "(only if needed) with some more context below.\n"
    "------------\n"
    "{text}\n"
    "------------\n"
    "Given the new context, refine the original summary in Italian.\n"
    "If the context isn't useful, return the original summary."
)
refine_prompt = PromptTemplate.from_template(refine_template)
chain = load_summarize_chain(
    llm=llm,
    chain_type="refine",
    question_prompt=prompt,
    refine_prompt=refine_prompt,
    return_intermediate_steps=True,
    input_key="input_documents",
    output_key="output_text",
)
result = chain({"input_documents": split_docs}, return_only_outputs=True)

In [None]:
print(result["output_text"])

Il testo fornisce una conoscenza dettagliata del corpo umano, inclusi i dieci sedi del respiro vitale, il sistema cardiovascolare, i sei strati della pelle, le 360 ossa (tra cui prese per i denti e unghie), gli organi di senso, i 56 sottoparti del corpo, i fluidi corporei, la predominanza dei cinque elementi nelle parti del corpo, i sette dhatus (tessuti corporei), il processo di nutrimento dei dhatus, l'uso di rasa e mala per trattare i dhatus, l'Ojas, l'aumento e la diminuzione dei dhatus, i segni di diminuzione dei dhatus, il sangue e il trattamento dei disturbi del sangue, il Sara- essenza costituzionale, i segni di eccellenza dei tessuti, e i tipi di essenza costituzionale Sara. La conoscenza dettagliata del corpo è essenziale per il benessere, e comprendere le entità del corpo consente di conoscere i fattori utili per il corpo. Il cuore è descritto come il substrato di entità come la mente, gli organi di senso, l'intelletto, gli oggetti dei sensi e il sé, insieme alle qualità, ed

In [None]:
print("\n\n".join(result["intermediate_steps"][:3]))

The text enumerates 50 total combinations of doshic aggravation and diminution, with 25 types of aggravation and 25 types of diminution. This includes 13 types of Tri-doshic aggravation, 9 types of Dual-doshic aggravated dosas, and 3 types of Single-doshic aggravation. The combinations of doshas can vary in proportions, totaling 62 different combinations. Additionally, Vata dosha is described as formless, unstable, non-unctuous, cold, light, subtle, mobile, non-slimy, rough, and with qualities of roughness, lightness, coldness, hardness, coarseness, and non-sliminess.

Il testo elenca 50 combinazioni totali di aggravamento e diminuzione doshici, con 25 tipi di aggravamento e 25 tipi di diminuzione. Questo include 13 tipi di aggravamento Tri-doshico, 9 tipi di dosas aggravati Dual-doshici e 3 tipi di aggravamento Single-doshico. Le combinazioni di dosha possono variare nelle proporzioni, totalizzando 62 diverse combinazioni. Inoltre, il dosha Vata è descritto come informe, instabile, no