<a href="https://colab.research.google.com/github/jeffheaton/app_generative_ai/blob/main/t81_559_class_03_3_text_summary.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# T81-559: Applications of Generative Artificial Intelligence
**Module 3: Large Language Models**
* Instructor: [Jeff Heaton](https://sites.wustl.edu/jeffheaton/), McKelvey School of Engineering, [Washington University in St. Louis](https://engineering.wustl.edu/Programs/Pages/default.aspx)
* For more information visit the [class website](https://sites.wustl.edu/jeffheaton/t81-558/).

# Module 3 Material

* Part 3.1: Foundation Models [[Video]](https://www.youtube.com/watch?v=Gb0tk5qq1fA) [[Notebook]](t81_559_class_03_1_llm.ipynb)
* Part 3.2: Text Generation [[Video]](https://www.youtube.com/watch?v=lB97Lqt7q58) [[Notebook]](t81_559_class_03_2_text_gen.ipynb)
* **Part 3.3: Text Summarization** [[Video]](https://www.youtube.com/watch?v=3MoIUXE2eEU) [[Notebook]](t81_559_class_03_3_text_summary.ipynb)
* Part 3.4: Text Classification [[Video]](https://www.youtube.com/watch?v=2VpOwFIGmA8) [[Notebook]](t81_559_class_03_4_classification.ipynb)
* Part 3.5: LLM Writes a Book [[Video]](https://www.youtube.com/watch?v=iU40Rttlb_Q) [[Notebook]](t81_559_class_03_5_book.ipynb)


# Google CoLab Instructions

The following code detects whether the notebook is running in Google CoLab. When it is, the code reads the OpenAI API key from the CoLab secrets and installs the required libraries.

In [1]:
import os

try:
    from google.colab import drive, userdata
    COLAB = True
    print("Note: using Google CoLab")
except ImportError:
    print("Note: not using Google CoLab")
    COLAB = False

# OpenAI Secrets
if COLAB:
    os.environ["OPENAI_API_KEY"] = userdata.get('OPENAI_API_KEY')

# Install needed libraries in CoLab
if COLAB:
    !pip install langchain langchain_openai pypdf

Note: using Google CoLab

# 3.3: Text Summarization

Large Language Models (LLMs) like GPT-4 can summarize text by extracting key information and presenting it concisely. They use the context and semantic relationships in the original text to generate a shorter version that retains its essential points. Because this draws on both natural language understanding and generation, LLMs can handle many kinds of text, from technical articles to narratives, and produce coherent, relevant summaries. The length and focus of a summary can be adjusted through the prompt, which makes LLMs effective for digesting large volumes of information quickly.

## Summarize Single PDF

We will begin by summarizing a single PDF. LangChain loads document types, such as PDFs, using a document loader, and loaders exist for many data types. The following code sets up the model that will summarize a single PDF using a generic summarization prompt.

In [2]:
from langchain.chains.summarize import load_summarize_chain
from langchain_community.document_loaders import PyPDFLoader
from langchain.prompts import PromptTemplate
from langchain_openai import ChatOpenAI
from IPython.display import display_markdown

MODEL = 'gpt-4o-mini'

llm = ChatOpenAI(
        model=MODEL,
        temperature=0.2,
        n=1
    )



The following code uses the 'load_summarize_chain' function to build a summarization chain around the LLM with the "map_reduce" chain type. It loads a PDF from the given URL ("https://arxiv.org/pdf/1706.03762") with 'PyPDFLoader' and splits it into manageable parts ('load_and_split'). The chain ('chain.invoke(docs)') summarizes each part separately and then combines those partial summaries into one; the result is returned under the 'output_text' key. Finally, the summary is displayed as Markdown so that its formatting remains intact.

In [3]:
chain = load_summarize_chain(llm, chain_type="map_reduce")

url = "https://arxiv.org/pdf/1706.03762"
loader = PyPDFLoader(url)
docs = loader.load_and_split()
summary = chain.invoke(docs)['output_text']
display_markdown(summary, raw=True)

The paper "Attention Is All You Need" introduces the Transformer model, based on attention mechanisms, which outperforms existing models in machine translation tasks. The model relies on self-attention instead of recurrence, allowing for faster training and improved performance. The model architecture consists of stacked self-attention and fully connected layers in both the encoder and decoder. The study compares the efficiency of self-attention, recurrent, and convolutional layers in sequence transduction tasks, highlighting the advantages of self-attention. The Transformer model achieves better BLEU scores in translation tasks and demonstrates generalization to other tasks. Future research will explore applications beyond text and improve generation efficiency.
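The control flow behind the "map_reduce" chain type can be illustrated without calling an LLM. The following is a minimal sketch, not LangChain's implementation: 'first_sentence' is a hypothetical stand-in for the model call, and 'map_reduce_summarize' is an illustrative name. The real chain also manages prompts and token limits, which this sketch ignores.

```python
from typing import Callable, List


def first_sentence(text: str) -> str:
    """Hypothetical stand-in for an LLM call: keep only the first sentence."""
    head = text.strip().split(". ")[0]
    return head if head.endswith(".") else head + "."


def map_reduce_summarize(chunks: List[str], summarize: Callable[[str], str]) -> str:
    # Map step: summarize every chunk independently.
    partial_summaries = [summarize(chunk) for chunk in chunks]
    # Reduce step: summarize the concatenation of the partial summaries.
    return summarize(" ".join(partial_summaries))


chunks = [
    "Transformers rely on attention. They drop recurrence entirely.",
    "Training is highly parallel. This shortens training time.",
]
result = map_reduce_summarize(chunks, first_sentence)
print(result)
```

Because each chunk is summarized independently in the map step, a document longer than the model's context window can still be processed one piece at a time.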

## Summarize with Custom Prompt

LangChain also supports custom prompts that tailor the summary to specific requirements, such as writing it in a different language. In the code below, a custom prompt template instructs the model to write a concise summary in Spanish. The template contains the instruction followed by a '{text}' placeholder for the text to summarize. Passing the template as both the 'map_prompt' and 'combine_prompt' parameters of 'load_summarize_chain' applies it to the per-chunk summaries and to the final combined summary. As before, the code loads the PDF with 'PyPDFLoader', splits it into sections, runs the chain, and displays the Spanish summary as Markdown.

In [4]:
TEMPLATE = """
Write a concise summary of the information presented. Write the summary in Spanish.

{text}

SUMMARY:"""
PROMPT = PromptTemplate(template=TEMPLATE, input_variables=["text"])

chain = load_summarize_chain(llm, chain_type="map_reduce", map_prompt=PROMPT, combine_prompt=PROMPT)

url = "https://arxiv.org/pdf/1706.03762"
loader = PyPDFLoader(url)
docs = loader.load_and_split()
summary = chain.invoke(docs)['output_text']
display_markdown(summary, raw=True)

El artículo presenta el modelo Transformer, una arquitectura de red neuronal basada en mecanismos de atención sin recurrencia ni convoluciones. El Transformer ha demostrado ser superior en calidad, más paralelizable y más rápido de entrenar en tareas de traducción automática. Se han logrado resultados sobresalientes en tareas de traducción de inglés a alemán y francés, así como en análisis de constituyentes en inglés. Se utilizan técnicas como la atención escalada de producto punto y la atención de múltiples cabezas para mejorar el rendimiento del modelo. Se destaca que el Transformer es más eficiente en términos de complejidad computacional y longitud máxima de la ruta para aprender dependencias a largo plazo en secuencias. Se menciona que el modelo ha establecido un nuevo estado del arte en tareas de traducción y se planea extender su uso a otras modalidades de entrada y salida.
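To see what the model receives for each chunk, it can help to fill the template by hand. The sketch below assumes the template uses simple f-string-style placeholders, so Python's built-in 'str.format' produces the same text; 'fill_prompt' and 'example_chunk' are hypothetical names used only for illustration.

```python
# Same template text as in the cell above.
TEMPLATE = """
Write a concise summary of the information presented. Write the summary in Spanish.

{text}

SUMMARY:"""


def fill_prompt(template: str, text: str) -> str:
    """Hypothetical helper: substitute one chunk of text into the template."""
    return template.format(text=text)


example_chunk = "The Transformer replaces recurrence with self-attention."
filled = fill_prompt(TEMPLATE, example_chunk)
print(filled)
```

The filled prompt ends with "SUMMARY:", which cues the model to continue with the summary itself.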

## Summarize Multiple PDFs

We will now see how to summarize multiple documents into one. We will summarize the following four papers, each of which is very important to the field of GenAI.

* "[Attention Is All You Need](https://arxiv.org/pdf/1706.03762)" by Ashish Vaswani et al. (2017) - This paper introduced the Transformer architecture, which has become the backbone of most modern natural language processing systems, including models such as GPT and BERT.
* "[BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/pdf/1810.04805)" by Jacob Devlin et al. (2018) - BERT (Bidirectional Encoder Representations from Transformers) revolutionized the way contextual information is handled by using a bidirectional training of Transformer models. This methodology significantly improved the performance of models on various NLP tasks.
* "[Language Models are Few-Shot Learners](https://arxiv.org/pdf/2005.14165)" by Tom B. Brown et al. (2020) - Also known as the GPT-3 paper, it explores the capabilities of very large transformer-based models, demonstrating that scaling up the size of the models improves performance across a broad spectrum of NLP tasks, often requiring little to no task-specific data.
* "[Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/pdf/1910.10683)" by Colin Raffel et al. (2019) - This paper introduces T5 (Text-to-Text Transfer Transformer), which converts all NLP tasks into a unified text-to-text format, simplifying the application of transfer learning across different tasks.

We use the same process as before to load each PDF document and collect the individual summaries in a list.

In [5]:
urls = [
  "https://arxiv.org/pdf/1706.03762",
  "https://arxiv.org/pdf/1810.04805",
  "https://arxiv.org/pdf/2005.14165",
  "https://arxiv.org/pdf/1910.10683"
]

summaries = []

# Build the chain once and reuse it for every document
chain = load_summarize_chain(llm, chain_type="map_reduce")

for url in urls:
    print(f"Reading: {url}")
    loader = PyPDFLoader(url)
    docs = loader.load_and_split()
    summary = chain.invoke(docs)['output_text']
    summaries.append(summary)

Reading: https://arxiv.org/pdf/1706.03762
Reading: https://arxiv.org/pdf/1810.04805
Reading: https://arxiv.org/pdf/2005.14165
Reading: https://arxiv.org/pdf/1910.10683
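When processing several documents, one failed download or parse would otherwise stop the whole loop. The following is a minimal sketch of one way to guard against that, assuming it is acceptable to skip a document that cannot be processed. 'summarize_all' and 'fake_summarize_url' are hypothetical names; the latter stands in for the load-and-summarize step, and the example.com URLs are placeholders.

```python
from typing import Callable, Dict, List


def summarize_all(urls: List[str], summarize_url: Callable[[str], str]) -> Dict[str, str]:
    """Hypothetical helper: map each URL to its summary, skipping failures."""
    results: Dict[str, str] = {}
    for url in urls:
        try:
            results[url] = summarize_url(url)
        except Exception as exc:  # e.g., a download or parsing error
            print(f"Skipping {url}: {exc}")
    return results


def fake_summarize_url(url: str) -> str:
    """Stand-in for load + summarize; fails on one URL to show the skip path."""
    if url.endswith("missing"):
        raise ValueError("could not load PDF")
    return f"summary of {url}"


demo_urls = ["https://example.com/a", "https://example.com/missing"]
demo = summarize_all(demo_urls, fake_summarize_url)
print(demo)
```

Keying the results by URL also records which summary came from which paper, which a plain list does not.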


After obtaining the individual summaries, the next step combines them into a single overview. The code first merges the summaries into one long string and then uses the CharacterTextSplitter class from LangChain to split that string into chunks. The splitter targets chunks of about 500 characters with up to 100 characters of overlap, so that context carries across chunk boundaries. Each chunk is wrapped in a Document object, and a new summarization chain with the same 'map_reduce' chain type processes these documents to produce a final, condensed summary of the combined summaries. Finally, this summary is displayed as Markdown to preserve its formatting.

In [6]:
from langchain.text_splitter import CharacterTextSplitter
from langchain_core.documents import Document

# Merge the per-paper summaries into one string
summary_str = " ".join(summaries)

# Split on spaces so chunks respect the 500-character target; the default
# separator ("\n\n") may not occur in the space-joined string
text_splitter = CharacterTextSplitter(separator=" ", chunk_size=500, chunk_overlap=100)
texts = text_splitter.split_text(summary_str)
docs = [Document(page_content=t) for t in texts]

# Summarize the chunks into a single final summary
chain = load_summarize_chain(llm, chain_type="map_reduce")
final_summary = chain.invoke(docs)['output_text']
display_markdown(final_summary, raw=True)

The paper "Attention Is All You Need" introduces the Transformer model for sequence transduction tasks, achieving state-of-the-art results in machine translation with less training time. It utilizes self-attention for sequence modeling, eliminating the need for recurrent or convolutional networks. The paper also discusses the introduction and performance of BERT, T5, and GPT-3 models for natural language processing tasks, highlighting their capabilities and potential advancements. Challenges and limitations, such as biases and misuse, are also addressed.
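The idea of chunk size and overlap can be shown with a simplified, fixed-width version. This sketch is not CharacterTextSplitter's actual algorithm, which splits on a separator and merges the pieces; 'chunk_with_overlap' is a hypothetical helper that slides a window over the raw characters.

```python
from typing import List


def chunk_with_overlap(text: str, chunk_size: int = 500, overlap: int = 100) -> List[str]:
    """Simplified fixed-width chunking; not LangChain's actual algorithm."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "".join(str(i % 10) for i in range(1200))
pieces = chunk_with_overlap(sample)
print([len(p) for p in pieces])  # prints [500, 500, 400]
```

The last 100 characters of each chunk repeat as the first 100 characters of the next, so a sentence cut at a boundary still appears whole in at least one chunk when it is shorter than the overlap.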