<a href="https://colab.research.google.com/github/zganjei/Text-summarization/blob/main/text_summarization_with_langchain.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text Summarization Tool

This tool summarizes text using LangChain framework. A pdf file will be uploaded via the tools web interface and then the summary will be displayed.

## Install required libraries

In [1]:
%%capture
#LangChain module associated with OpenAI LLM models
!pip install langchain
!pip install langchain_openai
!pip install langchain_community
!pip install pypdf
# web app development library
!pip install streamlit
# tokeniser library
!pip install tiktoken
!pip install pyngrok

## Setup OpenAI API key

Setup the OpenAI API key and set it as an environment variable so that it can be retrieved in the app.py file

In [51]:
from google.colab import userdata
import os

api_key = userdata.get('openai.api_key')
if api_key:
  print(f"API key: {api_key[:20]}....")
else:
  print("API key not found!")
os.environ["OPENAI_API_KEY"] = api_key

API key: sk-proj-5pfnvLzOd8Qq....


## Build and test a GenAI app for PDF summarization

## Import required packages

In [52]:
%%writefile app.py
from langchain_openai import OpenAI
from langchain_openai.chat_models import ChatOpenAI
from langchain.document_loaders import PyPDFLoader
from langchain import PromptTemplate
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains.summarize import load_summarize_chain
import streamlit as st
import os

api_key = os.getenv('OPENAI_API_KEY')
if api_key:
  print(f"API key: {api_key[:20]}....")
else:
  print("API key not found!")



Overwriting app.py


## Load the PDF file

Load the pdf file and split it by pages

In [53]:
%%writefile -a app.py
def summarize_pdf(pdf_file_path, chunk_size, chunk_overlap, map_prompt):
  # instantiate LLM model for gpt-3.5-turbo-16k
  llm = ChatOpenAI(model_name="gpt-3.5-turbo-16k", temperature=0, openai_api_key=api_key)

  #load PDF file
  loader = PyPDFLoader(pdf_file_path)
  pages = loader.load()

Appending to app.py


## The pdf summarization function
The function will take pdf file path as an input and generate a summary of the file. We use both Map_reduce and Stuff methods.

In [54]:
%%writefile -a app.py
  #Create multiple documents
  docs_raw_text = [page.page_content for page in pages]

  # split text into chunks
  text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
  docs_chunks = text_splitter.create_documents(docs_raw_text)

  # Summarize the chunks
  chain = load_summarize_chain(llm, chain_type="stuff", prompt = map_prompt)
  summary= chain.invoke(docs_chunks,return_only_outputs=True)

  return summary['output_text']


Appending to app.py


## Add Combo template for Map_Reduce chain type

In [55]:
# combine_prompt_template = """
# You are give main points and important details of a research papaer in bullet points.
# Your goal is to give a final summary of the main research topic and findings
# which will be useful for an artificial intelligence researcher:
# "{text}"
# FINAL SUMMARY:
# """
# combine_prompt = PromptTemplate(template=combine_prompt_template, input_variables=["text"])

In [56]:
# def summarize_pdf_map_reduce(pdf_file_path, chunk_size, chunk_overlap, map_prompt, combine_prompt):
#   # instantiate LLM model for gpt-3.5-turbo-16k
#   llm = ChatOpenAI(model_name="gpt-3.5-turbo-16k", temperature=0, openai_api_key=OpenAI.api_key)

#   #load PDF file
#   loader = PyPDFLoader(pdf_file_path)
#   pages = loader.load()

#   #Create multiple documents
#   docs_raw_text = [page.page_content for page in pages]

#   # split text into chunks
#   text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
#   docs_chunks = text_splitter.create_documents(docs_raw_text)

#   # Summarize the chunks
#   chain = load_summarize_chain(llm, chain_type="map_reduce", map_prompt = map_prompt,combine_prompt=combine_prompt)
#   summary= chain.invoke(docs_chunks,return_only_outputs=True)

#   return summary['output_text']

In [57]:
# summary = summarize_pdf_map_reduce(pdf_url,1000,20,map_prompt, combine_prompt)
# print(summary)

## Add the main function with the web interface configuration

In [58]:
  %%writefile -a app.py
  def main():
    # Config the web interface
    st.set_page_config(page_title="PDF Summarizer", page_icon=":book:")
    st.title("PDF Summarizer")

    # Input PDF file path
    pdf_file_path = st.text_input("Enter PDF file path:")

    if pdf_file_path !="":
        st.write("PDF file was loaded successfully!")

    prompt_template = """
    Write a concise summary of the following text. include main points and important details in bullet points:
    "{text}"
    CONCISE SUMMARY:
    """
    map_prompt = PromptTemplate(template=prompt_template, input_variables=["text"])

    # Button
    if st.button("Summarize"):
      summary = summarize_pdf(pdf_file_path,1000,20,map_prompt)
      st.write(summary)
if __name__ == "__main__":
    main()

Appending to app.py


## Launch Streamlit app from Google Colab
Launch Streamlit app from Google Colab using ngrok service.

Start by downloding and unzipping ngrok





In [59]:
# NOTE: run this cell only the first time
# !wgt https://bin.equinox.io/c/4VmDzA7iaHb/ngrok-stable-linux-amd64.zip
# !unzip ngrok-stable-linux-amd64.zip

In [60]:
get_ipython().system_raw('./ngrok http 8501 &')

Generate an external IP to open the Streamlit app through a tunnel. The generated ip will be the password for the tunnel

In [61]:
!wget -q -O - ipv4.icanhazip.com

34.85.152.63


Run the app on the set port. Click on the provided link and paste the IP address provided above as the password

In [None]:
!streamlit run app.py & npx localtunnel --port 8501

[1G[0K⠙[1G[0K⠹
Collecting usage statistics. To deactivate, set browser.gatherUsageStats to false.
[0m
[1G[0K⠸[1G[0K⠼[1G[0K⠴[1G[0K⠦[1G[0K⠧[1G[0K⠇[0m
[34m[1m  You can now view your Streamlit app in your browser.[0m
[0m
[34m  Local URL: [0m[1mhttp://localhost:8501[0m
[34m  Network URL: [0m[1mhttp://172.28.0.12:8501[0m
[34m  External URL: [0m[1mhttp://34.85.152.63:8501[0m
[0m
[1G[0K⠏[1G[0Kyour url is: https://smart-spies-decide.loca.lt

>> from langchain.document_loaders import PyPDFLoader

with new imports of:

>> from langchain_community.document_loaders import PyPDFLoader
You can use the langchain cli to **automatically** upgrade many imports. Please see documentation here <https://python.langchain.com/docs/versions/v0_2/>
  from langchain.document_loaders import PyPDFLoader
API key: sk-proj-5pfnvLzOd8Qq....

>> from langchain.document_loaders import PyPDFLoader

with new imports of:

>> from langchain_community.document_loaders import PyPDFLoader
