<a href="https://colab.research.google.com/github/sudarshan-koirala/youtube-stuffs/blob/main/langchain/PDFSummarizer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# PDF Summarizer with a few lines of code using Gradio, OpenAI, and LangChain

## Install necessary packages

- https://pypdf.readthedocs.io/en/stable/index.html
- https://www.gradio.app/
- https://github.com/openai/tiktoken
- https://docs.langchain.com/docs/

In [None]:
#install necessary packages
!pip install -q gradio openai pypdf tiktoken langchain

In [None]:
# Alternative: load the API key from a local JSON file
#import json, openai
#with open('env_vars.json', 'r') as f:
#    env_vars = json.load(f)
#openai.api_key = env_vars["OPENAI_API_KEY"]

In [None]:
# https://platform.openai.com/account/api-keys
import os
# Replace the placeholder string below with your actual OpenAI API key
os.environ["OPENAI_API_KEY"] = "OPENAI_API_KEY"
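
Because the cell above assigns a literal placeholder string, it is easy to run the rest of the notebook without a real key. The sketch below is one possible guard, not part of the original notebook: the helper name `get_openai_api_key` and the `PLACEHOLDER` constant are hypothetical, and it only assumes the key lives in the `OPENAI_API_KEY` environment variable.

```python
import os

# Assumption: this is the literal placeholder string from the cell above.
PLACEHOLDER = "OPENAI_API_KEY"


def get_openai_api_key(env=os.environ) -> str:
    """Return the key from `env`, or raise if it is missing or still the placeholder."""
    key = env.get("OPENAI_API_KEY", "")
    if not key or key == PLACEHOLDER:
        raise RuntimeError("Set OPENAI_API_KEY to a real key before calling the OpenAI API.")
    return key
```

Calling `get_openai_api_key()` before creating the LLM would fail early with a clear message rather than with an authentication error later on.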

In [41]:
# https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb
import tiktoken

def num_tokens_from_string(string: str, encoding_name: str) -> int:
    """Returns the number of tokens in a text string."""
    encoding = tiktoken.get_encoding(encoding_name)
    tokens = encoding.encode(string)  # encode once and reuse the result
    print(tokens)  # show the token IDs for illustration
    return len(tokens)

num_tokens_from_string("tiktoken is great!", "cl100k_base")

[83, 1609, 5963, 374, 2294, 0]


6

In [None]:
import gradio as gr
from langchain import OpenAI, PromptTemplate
from langchain.chains.summarize import load_summarize_chain
from langchain.document_loaders import PyPDFLoader

llm = OpenAI(temperature=0)

In [None]:
#PyPDFLoader??

## LangChain part
#### Function that takes a PDF file as input and returns a summary of that PDF
- LangChain's `PyPDFLoader` loads the PDF.
- We then split the document into smaller chunks.
- We use `load_summarize_chain` to create a summarization chain.
- LangChain supports three chain types for summarization: `stuff`, `map_reduce`, and `refine`. We use `map_reduce` in this example. See [LangChain summarization](https://python.langchain.com/en/latest/modules/chains/index_examples/summarize.html).
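
To make the `map_reduce` idea concrete, here is a minimal pure-Python sketch of the pattern: split the text into chunks, summarize each chunk (map), then summarize the combined partial summaries (reduce). This is not LangChain's implementation. The names `split_into_chunks`, `map_reduce_summarize`, and `keep_every_other_word` are hypothetical, the word-based splitter is a simplification, and the stand-in "summarizer" merely drops every other word so the example runs without an API key.

```python
from typing import Callable, List


def split_into_chunks(text: str, max_words: int) -> List[str]:
    """Split text into chunks of at most `max_words` whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def map_reduce_summarize(chunks: List[str], summarize: Callable[[str], str]) -> str:
    """Summarize each chunk (map), then summarize the joined partial summaries (reduce)."""
    partial_summaries = [summarize(chunk) for chunk in chunks]  # map step
    return summarize("\n".join(partial_summaries))  # reduce step


def keep_every_other_word(text: str) -> str:
    """Hypothetical stand-in for an LLM call: keep every other word."""
    return " ".join(text.split()[::2])


text = "one two three four five six seven eight nine ten eleven twelve"
chunks = split_into_chunks(text, max_words=6)
summary = map_reduce_summarize(chunks, keep_every_other_word)
print(chunks)   # two chunks of six words each
print(summary)  # contains words drawn from both chunks
```

In the real chain, the stand-in function would be replaced by calls to the LLM, which is why `map_reduce` can handle documents that do not fit into a single prompt.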

In [42]:
# Just to show how it works: load the PDF and split it into chunks
loader = PyPDFLoader('/content/2023_GPT4All_Technical_Report.pdf')
doc = loader.load_and_split()
print(len(doc))
doc[0]

3


Document(page_content='GPT4All: Training an Assistant-style Chatbot with Large Scale Data\nDistillation from GPT-3.5-Turbo\nYuvanesh Anand\nyuvanesh@nomic.aiZach Nussbaum\nzanussbaum@gmail.com\nBrandon Duderstadt\nbrandon@nomic.aiBenjamin Schmidt\nben@nomic.aiAndriy Mulyar\nandriy@nomic.ai\nAbstract\nThis preliminary technical report describes the\ndevelopment of GPT4All, a chatbot trained\nover a massive curated corpus of assistant in-\nteractions including word problems, story de-\nscriptions, multi-turn dialogue, and code. We\nopenly release the collected data, data cura-\ntion procedure, training code, and final model\nweights to promote open research and repro-\nducibility. Additionally, we release quantized\n4-bit versions of the model allowing virtually\nanyone to run the model on CPU.\n1 Data Collection and Curation\nWe collected roughly one million prompt-\nresponse pairs using the GPT-3.5-Turbo OpenAI\nAPI between March 20, 2023 and March 26th,\n2023. To do this, we first gat

In [43]:
def summarize_pdf(pdf_file_path):
    """Load a PDF, split it into chunks, and return a map_reduce summary."""
    loader = PyPDFLoader(pdf_file_path)
    docs = loader.load_and_split()  # list of Document chunks
    chain = load_summarize_chain(llm=llm, chain_type="map_reduce")
    summary = chain.run(docs)
    return summary

In [None]:
# let's grab the document
!wget https://s3.amazonaws.com/static.nomic.ai/gpt4all/2023_GPT4All_Technical_Report.pdf

In [44]:
summary = summarize_pdf("/content/2023_GPT4All_Technical_Report.pdf")
summary

' This paper presents GPT4All, a chatbot trained on a large curated dataset of assistant interactions. The authors provide all data, training code, and model weights for the community to use and evaluate the model using human evaluation data from the Self-Instruct paper. The paper also references three models (LORA, Stanford Alpaca, and LLAMA) and Self-Instruct, which are all designed to be low-rank, open, and efficient language models. The data and training details are released to accelerate open LLM research.'

## Create a simple gradio UI (if you prefer UI)

In [45]:
input_pdf_path = gr.components.Textbox(label="Provide the PDF file path")
output_summary = gr.components.Textbox(label="Summary")

interface = gr.Interface(
    fn=summarize_pdf,
    inputs=input_pdf_path,
    outputs=output_summary,
    title="PDF Summarizer",
    description="Provide the PDF file path to get the summary.",
)
# Launch separately so `interface` keeps referring to the Interface object
interface.launch(share=True)

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
Running on public URL: https://0928952a4c9150f59d.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades (NEW!), check out Spaces: https://huggingface.co/spaces
