In [None]:
# Install all necessary libraries

!pip install streamlit
!pip install langchain
!pip install pypdf
!pip install openai
!pip install tiktoken
!pip install python-dotenv

Collecting streamlit
  Downloading streamlit-1.36.0-py2.py3-none-any.whl (8.6 MB)
Collecting gitpython!=3.1.19,<4,>=3.0.7 (from streamlit)
  Downloading GitPython-3.1.43-py3-none-any.whl (207 kB)
Collecting pydeck<1,>=0.8.0b4 (from streamlit)
  Downloading pydeck-0.9.1-py2.py3-none-any.whl (6.9 MB)
Collecting watchdog<5,>=2.1.5 (from streamlit)
  Downloading watchdog-4.0.1-py3-none-manylinux2014_x86_64.whl (83 kB)
Collecting gitdb<5,>=4.0.1 (from gitpython!=3.1.19,<4,>=3.0.7->streamlit)
  Downloading gitdb-4.

In [None]:
# Import and set up LangChain

import streamlit as st
import os
from langchain import OpenAI
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import PyPDFLoader
from langchain import PromptTemplate
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains.summarize import load_summarize_chain
from dotenv import load_dotenv

load_dotenv()

# Make sure your OpenAI API key is set as API_KEY in your environment or .env file
OpenAI.api_key = os.getenv('API_KEY')

# Load PDF file
pdf_url = "https://www.biorxiv.org/content/10.1101/2024.01.15.575678v2.full.pdf"

loader = PyPDFLoader(pdf_url)

pages = loader.load_and_split()
#number of pages
print(len(pages),'\n')

#view page content
print(pages[3].page_content)

14 

4 C. Brito et al.
data [20, 9]. They provide a secure memory space at each server, where genomic
data can be securely processed in plaintext format. Both external and internal
attackers, even with high Operating System (OS) privileges, cannot access this
protected region and disclose the data being processed. While this is a promising
technology for running GWAS securely, its application has been typically limited
to a single-server mode [5, 18, 32]. By taking advantage of distributed infrastruc-
tures, it is possible to enhance the speed and scalability (i.e., the amount of data
being analyzed) of GWASes. However, as highlighted in this paper, developing a
distributed solution for privacy-preserving GWAS requires a fundamentally new
design that differs from previous methodologies. Namely, it requires securing
both the computation and storage of data at each cluster server and the data
being exchanged across servers.
Our Contributions We propose Gyosa, a novel distributed and priv
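The setup cell reads the key from an `API_KEY` variable, and `os.getenv` quietly returns `None` when that variable is missing, so the failure only surfaces later at the first model call. A minimal sketch of an early check is shown below. The helper name `require_env` is hypothetical and is not part of LangChain or python-dotenv.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or fail loudly.

    Hypothetical helper for this notebook only.
    """
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"Environment variable {name!r} is not set; "
            "add it to your .env file or export it before running."
        )
    return value


# Usage (assumes the variable name used in this notebook, API_KEY):
# api_key = require_env("API_KEY")
```

Calling this right after `load_dotenv()` turns a confusing authentication error into an immediate, readable one.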

In [None]:

##  Define the summarize pdf function
# Define the main function that will take pdf file path as an input and generate a summary of the file.
def summarize_pdf(pdf_file_path, chunk_size, chunk_overlap):

    #Instantiate LLM model gpt-3.5-turbo-16k
    llm=ChatOpenAI(model="gpt-3.5-turbo-16k", temperature=0, openai_api_key=OpenAI.api_key)

    #Load PDF file
    loader = PyPDFLoader(pdf_file_path)
    docs_raw = loader.load()

    #Create multiple documents
    docs_raw_text = [doc.page_content for doc in docs_raw]

    #Split text into chunks
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
    docs_chunks = text_splitter.create_documents(docs_raw_text)

    #Summarize the chunks
    # chain = load_summarize_chain(llm, chain_type="stuff")
    chain = load_summarize_chain(llm, chain_type="map_reduce")
    #Return the summary
    return chain.run(docs_chunks)
#print summary by using chain type stuff or map_reduce
print(summarize_pdf(pdf_url, 1000, 20))


The paper introduces Gyosa, a distributed computing solution for privacy-preserving genome-wide association studies (GWAS). The solution uses trusted execution environments (TEEs) like Intel SGX to securely delegate GWAS analysis to untrusted third-party infrastructures. Gyosa allows for the handling of large amounts of genomic data in a scalable and efficient manner while protecting data privacy. The experimental evaluation shows that Gyosa provides enhanced security guarantees and can achieve practical and usable privacy-preserving solutions.
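The call above passes `chunk_size=1000` and `chunk_overlap=20`. To make those two parameters concrete, here is a simplified fixed-window splitter. It is only an illustration under stated assumptions: `RecursiveCharacterTextSplitter` also tries to break on separators such as paragraphs and sentences, which this sketch does not attempt, and the function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split `text` into windows of `chunk_size` characters.

    Consecutive windows share `chunk_overlap` characters, so a sentence cut at
    a boundary still appears in full context in at least one chunk.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


print(split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1))
# ['abcd', 'defg', 'ghij']
```

A larger overlap gives each chunk more shared context at the cost of more chunks, and therefore more LLM calls with `map_reduce`.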


In [None]:
## Add prompt template to the summarizer function
# Leverage prompt templates to extract key information from the research paper in a more guided manner.
## Define prompt templates
map_prompt_template = """
                      Write a summary of the research paper for a non-academic industry professional that includes main points and any important details in bullet points.{text}
                      """

# Below is another template we could pass to the LLM; templates can be written to suit the need and the audience.
# map_prompt_template =""" Write a summary of the research paper for an IT professional to provide a technical overview of the research topic in bullet points. {text} """

map_prompt = PromptTemplate(input_variables=["text"], template=map_prompt_template)
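The `{text}` placeholder in the template is where the document text is substituted before the prompt is sent to the model. The sketch below shows the idea with plain `str.format`, standing in for the template object. The helper `fill_template` is hypothetical, and the real `PromptTemplate` performs more validation than this.

```python
MAP_TEMPLATE = (
    "Write a summary of the research paper for a non-academic industry "
    "professional that includes main points and any important details "
    "in bullet points.\n\n{text}"
)


def fill_template(template: str, text: str) -> str:
    """Substitute `text` into the template's {text} placeholder."""
    if "{text}" not in template:
        raise ValueError("template must contain a {text} placeholder")
    return template.format(text=text)


prompt = fill_template(MAP_TEMPLATE, "Gyosa is a distributed GWAS system.")
print(prompt)
```

Note that any other literal braces in a template would need to be doubled (`{{` and `}}`) for `str.format` to leave them alone.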


In [None]:

# with stuff method  using map_prompt
#Modify the custom function to add the prompt templates

def summarize_pdf(pdf_file_path, chunk_size, chunk_overlap, map_prompt):

    #Instantiate LLM model gpt-3.5-turbo-16k
    llm=ChatOpenAI(model="gpt-3.5-turbo-16k", temperature=0, openai_api_key=OpenAI.api_key)

    #Load PDF file
    loader = PyPDFLoader(pdf_file_path)
    docs_raw = loader.load()

    #Create multiple documents
    docs_raw_text = [doc.page_content for doc in docs_raw]

    #Split text into chunks
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
    docs_chunks = text_splitter.create_documents(docs_raw_text)

    #Summarize the chunks
    chain = load_summarize_chain(llm, chain_type="stuff", prompt = map_prompt)
    #Return the summary
    return chain.run(docs_chunks)
#print summary by using the stuff chain with map_prompt
print(summarize_pdf(pdf_url, 1000, 20, map_prompt))


- The research paper introduces Gyosa, a distributed computing solution for privacy-preserving genome-wide association studies (GWAS).
- Gyosa follows a distributed processing design that enables handling large amounts of genomic data in a scalable and efficient fashion.
- It leverages trusted execution environments (TEEs), such as Intel SGX, to allow users to confidentially delegate their GWAS analysis to untrusted third-party infrastructures.
- Gyosa implements a computation partitioning scheme to overcome the memory limitations of SGX while safeguarding users' genomic data privacy.
- The experimental evaluation validates the applicability and scalability of Gyosa, showing that it provides enhanced security guarantees and a practical and usable privacy-preserving solution.
- Gyosa is built on top of Apache Spark and uses the Glow library for genomic processing, allowing for the extension of the current genomic analysis pipeline.
- The results show that Gyosa has a runtime overhead ra
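The `stuff` chain places every chunk into a single prompt, so it only works when the whole document fits in the model's context window. A rough pre-check is sketched below. The ratio of 4 characters per token, the 16,000-token budget, and the function names are all assumptions for illustration; for real counts, use a tokenizer such as `tiktoken`.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate (assumed ratio, not a real tokenizer)."""
    return int(len(text) / chars_per_token) + 1


def fits_in_context(chunks, context_budget: int = 16_000,
                    reserved_for_output: int = 1_000) -> bool:
    """Return True if all chunks plus room for the answer fit the budget."""
    total = sum(estimate_tokens(chunk) for chunk in chunks)
    return total + reserved_for_output <= context_budget


def pick_chain_type(chunks) -> str:
    """Prefer the single-call 'stuff' chain, fall back to 'map_reduce'."""
    return "stuff" if fits_in_context(chunks) else "map_reduce"


print(pick_chain_type(["a" * 1000] * 10))   # small document
print(pick_chain_type(["a" * 1000] * 100))  # large document
```

The design choice here is to treat `stuff` as the cheap default and switch to `map_reduce` only when the estimate says the document is too large.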

In [None]:

# with map_reduce method
combine_prompt_template = """You will be given the main points and any important details of a research paper in bullet points.
                             Your goal is to give a final summary of the main research topic and findings that will help a researcher grasp
                             what was done during the research work.

'''{text}'''
FINAL SUMMARY:
"""

combine_prompt = PromptTemplate(input_variables=["text"], template=combine_prompt_template)
#Modify the custom function to add the prompt templates

def summarize_pdf(pdf_file_path, chunk_size, chunk_overlap, map_prompt, combine_prompt):

    #Instantiate LLM model gpt-3.5-turbo-16k
    llm=ChatOpenAI(model="gpt-3.5-turbo-16k", temperature=0, openai_api_key=OpenAI.api_key)

    #Load PDF file
    loader = PyPDFLoader(pdf_file_path)
    docs_raw = loader.load()

    #Create multiple documents
    docs_raw_text = [doc.page_content for doc in docs_raw]

    #Split text into chunks
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
    docs_chunks = text_splitter.create_documents(docs_raw_text)

    #Summarize the chunks

    # chain = load_summarize_chain(llm, chain_type="map_reduce", prompt = map_prompt)
    chain = load_summarize_chain(llm, chain_type="map_reduce", map_prompt = map_prompt, combine_prompt = combine_prompt)
    #Return the summary
    return chain.run(docs_chunks)

#print summary by using combo prompts
print(summarize_pdf(pdf_url, 1000, 20, map_prompt, combine_prompt))


- The research paper introduces Gyosa, a distributed computing solution for privacy-preserving genome-wide association studies (GWAS).
- Gyosa utilizes distributed processing and trusted execution environments (TEEs) to handle large amounts of genomic data while maintaining confidentiality.
- The paper discusses the limitations of current solutions for data privacy protection in GWAS and proposes Gyosa as a more efficient and secure alternative.
- Experimental evaluation confirms the applicability and scalability of Gyosa, demonstrating its ability to provide enhanced security guarantees.
- The paper explores the challenges and vulnerabilities associated with offloading data storage and processing to third-party infrastructures and emphasizes the need for robust security measures.
- The paper presents different frameworks and approaches, such as GWAS-Dist, GADE, and SOTERIA, for privacy-preserving solutions in GWAS.
- The paper discusses various types of attacks on genomic data analysi
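With `map_reduce`, each chunk is first summarized on its own using `map_prompt`, and the partial summaries are then merged by one more call using `combine_prompt`. The sketch below shows that control flow with a stub in place of the model. `map_reduce_summarize` and `fake_llm` are hypothetical names, and the real chain may do more (for example, collapsing intermediate summaries when they grow too long).

```python
from typing import Callable, List


def map_reduce_summarize(chunks: List[str], llm: Callable[[str], str],
                         map_template: str, combine_template: str) -> str:
    """Summarize each chunk (map), then summarize the summaries (reduce)."""
    partial = [llm(map_template.format(text=chunk)) for chunk in chunks]
    combined = "\n".join(partial)
    return llm(combine_template.format(text=combined))


calls = []


def fake_llm(prompt: str) -> str:
    """Stub model: records the prompt and returns a numbered placeholder."""
    calls.append(prompt)
    return f"summary#{len(calls)}"


final = map_reduce_summarize(
    ["chunk A", "chunk B", "chunk C"],
    fake_llm,
    map_template="Summarize:\n\n{text}",
    combine_template="Combine these points:\n\n{text}\nFINAL SUMMARY:",
)
print(final)       # summary#4
print(len(calls))  # 4 calls: three map steps plus one combine step
```

The call count is why `map_reduce` is slower than `stuff`: the number of model requests grows with the number of chunks.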

In [None]:
%%writefile app.py
## Build and test a GenAI app for PDF summarization
import os

import streamlit as st
from langchain import OpenAI
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import PyPDFLoader
from langchain import PromptTemplate
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains.summarize import load_summarize_chain
from dotenv import load_dotenv

load_dotenv()

# Make sure your OpenAI API key is set as API_KEY in your environment or .env file
OpenAI.api_key = os.getenv('API_KEY')

#summarize_pdf function

def summarize_pdf(pdf_file_path, chunk_size, chunk_overlap, prompt):
    #Instantiate LLM model gpt-3.5-turbo-16k
    llm=ChatOpenAI(model="gpt-3.5-turbo-16k", temperature=0, openai_api_key=OpenAI.api_key)

    #Load PDF file
    loader = PyPDFLoader(pdf_file_path)
    docs_raw = loader.load()

    #Create multiple documents
    docs_raw_text = [doc.page_content for doc in docs_raw]

    #Split text into chunks
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
    docs_chunks = text_splitter.create_documents(docs_raw_text)

    #Summarize the chunks
    chain = load_summarize_chain(llm, chain_type='stuff', prompt = prompt)

    #Return the summary
    summary= chain.invoke(docs_chunks, return_only_outputs=True)
    return summary["output_text"]

#streamlit app main() function

def main():
    #Set page config and title
    st.set_page_config(page_title="PDF Summarizer", page_icon=":book:", layout="wide")
    st.title("PDF Summarizer")

    #Input pdf file path
    pdf_file_path = st.text_input("Enter PDF file path:")
    if pdf_file_path != "":
        st.write("PDF file path received")

    #Prompt input
    user_prompt = st.text_input("Enter prompt:")
    user_prompt = user_prompt + """:\n\n {text}"""
    prompt = PromptTemplate(input_variables=["text"], template=user_prompt)

    #Summarize button
    if st.button("Summarize"):
        #Summarize pdf
        summary = summarize_pdf(pdf_file_path, 1000, 20, prompt)
        st.write(summary)

if __name__ == "__main__":
    main()



Overwriting app.py


In [None]:

%%writefile app.py
## Cumulative Activity
import os

import streamlit as st
from langchain import OpenAI
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import PyPDFLoader
from langchain import PromptTemplate
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains.summarize import load_summarize_chain
from dotenv import load_dotenv

load_dotenv()

# Make sure your OpenAI API key is set as API_KEY in your environment or .env file
OpenAI.api_key = os.getenv('API_KEY')

#summarize_pdf function

def summarize_pdf(pdf_file_path, chunk_size, chunk_overlap, chain_type, prompt):
    #Instantiate LLM model gpt-3.5-turbo-16k
    llm=ChatOpenAI(model="gpt-3.5-turbo-16k", temperature=0, openai_api_key=OpenAI.api_key)

    #Load PDF file
    loader = PyPDFLoader(pdf_file_path)
    docs_raw = loader.load()

    #Create multiple documents
    docs_raw_text = [doc.page_content for doc in docs_raw]

    #Split text into chunks
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
    docs_chunks = text_splitter.create_documents(docs_raw_text)

    #Create multiple prompts
    prompt = prompt + """:\n\n {text}"""
    combine_prompt = PromptTemplate(input_variables=["text"], template=prompt)
    map_prompt = PromptTemplate(template="Summarize:\n\n{text}", input_variables=["text"])

    #Summarize the chunks
    if chain_type == "map_reduce":
        chain = load_summarize_chain(llm, chain_type=chain_type,
                                    map_prompt=map_prompt, combine_prompt=combine_prompt)
    else:
        chain = load_summarize_chain(llm, chain_type= chain_type, prompt=combine_prompt)
    #Return the summary
    return chain.run(docs_chunks)

#streamlit app main() function

def main():
    #Set page config and title
    st.set_page_config(page_title="GenAI App", page_icon=":book:", layout="wide")
    st.title("PDF Summarizer")

    #Add custom sliders and selectbox for more user interaction
    chain_type = st.sidebar.selectbox("Chain Type", ["map_reduce", "stuff"])
    chunk_size = st.sidebar.slider("Chunk Size", min_value=100, max_value=10000, step=100, value=1900)
    chunk_overlap = st.sidebar.slider("Chunk Overlap", min_value=100, max_value=10000, step=100, value=200)

    #Display warning message
    if chain_type == "map_reduce":
        st.sidebar.warning("The map_reduce chain type can take more than 5 minutes to generate a summary because each chunk requires its own LLM call.")

    #Input pdf file path
    pdf_file_path = st.text_input("Enter PDF file path:")

    #Prompt input
    user_prompt = st.text_input("Enter prompt:")

    #Summarize button
    if st.button("Summarize"):
        #Summarize pdf
        summary = summarize_pdf(pdf_file_path, chunk_size, chunk_overlap, chain_type, user_prompt)
        st.write(summary)

if __name__ == "__main__":
    main()


Overwriting app.py
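In the app above, the two sidebar sliders are independent, so a user can select a chunk overlap that is as large as or larger than the chunk size, a combination the text splitter is expected to reject. A minimal validation sketch follows. The function name `chunk_param_error` is hypothetical, and treating an overlap equal to the chunk size as invalid is a choice made here, not something the source states.

```python
from typing import Optional


def chunk_param_error(chunk_size: int, chunk_overlap: int) -> Optional[str]:
    """Return a user-facing error message, or None if the values are usable."""
    if chunk_size <= 0:
        return "Chunk size must be positive."
    if chunk_overlap < 0:
        return "Chunk overlap cannot be negative."
    if chunk_overlap >= chunk_size:
        return (
            f"Chunk overlap ({chunk_overlap}) must be smaller than "
            f"chunk size ({chunk_size})."
        )
    return None


print(chunk_param_error(1900, 200))  # None
print(chunk_param_error(100, 500))

# Possible use inside main(), before calling summarize_pdf:
#     error = chunk_param_error(chunk_size, chunk_overlap)
#     if error:
#         st.sidebar.error(error)
```

Checking before the button handler runs keeps the failure in the sidebar instead of surfacing as a stack trace in the page.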


In [None]:
## Launch Streamlit app from Google Colab

# The following lines let you launch the Streamlit app from Google Colab using the [ngrok service](https://ngrok.com/)
!wget https://bin.equinox.io/c/4VmDzA7iaHb/ngrok-stable-linux-amd64.zip


--2024-06-24 14:38:27--  https://bin.equinox.io/c/4VmDzA7iaHb/ngrok-stable-linux-amd64.zip
Resolving bin.equinox.io (bin.equinox.io)... 18.205.222.128, 54.237.133.81, 52.202.168.65, ...
Connecting to bin.equinox.io (bin.equinox.io)|18.205.222.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 13921656 (13M) [application/octet-stream]
Saving to: ‘ngrok-stable-linux-amd64.zip’


2024-06-24 14:38:27 (209 MB/s) - ‘ngrok-stable-linux-amd64.zip’ saved [13921656/13921656]



In [None]:
!unzip ngrok-stable-linux-amd64.zip


Archive:  ngrok-stable-linux-amd64.zip
  inflating: ngrok                   


In [None]:
get_ipython().system_raw('./ngrok http 8501 &')
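Once ngrok is running in the background, the notebook still needs the public URL it assigned. One way to find it is sketched below, on the assumption that the ngrok agent exposes its local inspection API at `http://localhost:4040/api/tunnels` and returns a JSON object with a `tunnels` list whose entries carry a `public_url` field. Both function names are hypothetical.

```python
import json
from urllib.request import urlopen


def extract_public_urls(payload: dict) -> list[str]:
    """Pull every public_url out of a tunnels payload (assumed shape)."""
    return [
        tunnel["public_url"]
        for tunnel in payload.get("tunnels", [])
        if "public_url" in tunnel
    ]


def fetch_ngrok_urls(api_url: str = "http://localhost:4040/api/tunnels",
                     timeout: float = 5.0) -> list[str]:
    """Query the local ngrok agent (assumed endpoint) for its public URLs."""
    with urlopen(api_url, timeout=timeout) as response:
        return extract_public_urls(json.load(response))


# Offline demonstration with a hand-written payload of the assumed shape:
sample = {"tunnels": [{"public_url": "https://example.ngrok.io", "proto": "https"}]}
print(extract_public_urls(sample))  # ['https://example.ngrok.io']
```

In the notebook, `fetch_ngrok_urls()` would be called a second or two after starting ngrok, since the agent needs a moment to establish the tunnel.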


In [None]:
# Print the Colab VM's public IP address (localtunnel may ask for it as the tunnel password)
!wget -q -O - ipv4.icanhazip.com

34.48.95.8


In [None]:

!streamlit run app.py & npx localtunnel --port 8501


Collecting usage statistics. To deactivate, set browser.gatherUsageStats to false.

  You can now view your Streamlit app in your browser.

  Local URL: http://localhost:8501
  Network URL: http://172.28.0.12:8501
  External URL: http://34.48.95.8:8501

npx: installed 22 in 3.417s
your url is: https://eight-planets-lead.loca.lt

`from langchain_community.chat_models import ChatOpenAI`.

To install langchain-community run `pip install -U langchain-community`.

>> from langchain.document_loaders import PyPDFLoader

with new imports of:

>> from langchain_community.document_loaders import PyPDFLoader
You can use the langchain cli to **automatically** upgrade many imports. Please see documentation here https://python.langchain.com/v0.2/docs/versions/v0_2/ 
  warn_deprecated(
