### Summary

*TODO: add flowchart*

Here, we build a proof-of-concept (POC) all-in-one solution that reads in and summarizes a customer's document store. What we have in mind is a pipeline that converts documents (e.g., PDF), video, and audio to text and then summarizes the text in a storage-intelligent fashion.
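
As a rough illustration of the pipeline idea, the first step can be pictured as routing each file to a conversion step based on its extension. This is only a sketch: the `CONVERTERS` mapping, the step names, and the `route` helper are hypothetical and are not part of the notebook.

```python
from pathlib import Path

# Hypothetical mapping from file extension to the name of a conversion step.
# The step names are placeholders, not functions defined in this notebook.
CONVERTERS = {
    ".pdf": "pdf_to_text",
    ".mp4": "video_to_text",
    ".mp3": "audio_to_text",
    ".txt": "already_text",
}


def route(path: str) -> str:
    """Return the name of the conversion step for a file, based on its extension."""
    suffix = Path(path).suffix.lower()  # case-insensitive match on the extension
    return CONVERTERS.get(suffix, "unsupported")


for example in ["output/ipad_pro.pdf", "clips/review.MP4", "notes.docx"]:
    print(example, "->", route(example))
```

Keeping the routing in a plain dictionary makes it easy to add formats later without touching the summarization step.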

## PDF

The following example uses the iPad. Since we don't have the original tech spec files as PDFs, we first need to convert them from HTML to PDF. `pdfkit` relies on wkhtmltopdf for this: use the [wkhtmltopdf download page](https://wkhtmltopdf.org/downloads.html) to download it (or install it with a package manager). There is also the option to install it with a single command, as described in [packaging](https://github.com/wkhtmltopdf/packaging). This step is not necessarily part of the workflow we had in mind: in our workflow, the client would already have these files in storage. The installation process is therefore not included in the main `README.md`.
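
Because the conversion cells below fail if the binary is missing, a quick check before running them can save some confusion. The following is a minimal sketch; the helper name `wkhtmltopdf_available` is hypothetical, and it assumes the binary is installed under its usual name `wkhtmltopdf` and is reachable through `PATH`.

```python
import shutil


def wkhtmltopdf_available(binary_name: str = "wkhtmltopdf") -> bool:
    """Return True if the converter binary can be found on PATH."""
    return shutil.which(binary_name) is not None


if not wkhtmltopdf_available():
    print("wkhtmltopdf not found on PATH; install it before running the PDF cells")
```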

In [22]:
from pdfkit import from_url
import os
from PyPDF2 import PdfReader
from glob import glob

In [23]:
URLS = [
    'https://support.apple.com/kb/SP883?viewlocale=en_US&locale=en_US', # Pro
    'https://support.apple.com/kb/SP866?viewlocale=en_US&locale=en_US', # Air
    'https://support.apple.com/kb/SP788?viewlocale=en_US&locale=en_US', # Mini
    'https://support.apple.com/kb/SP884?viewlocale=en_US&locale=en_US', # 10th Gen
    'https://support.apple.com/kb/SP849?viewlocale=en_US&locale=en_US', # 9th Gen
]

FILE_NAMES = [
    'ipad_pro', 
    'ipad_air', 
    'ipad_mini', 
    'ipad_10th_gen', 
    'ipad_9th_gen'
]

OUTPUT_DIR = 'output'

# configure settings for pdf output
options = {
    "page-size": "A4",
    "margin-top": "0mm",
    "margin-right": "0mm",
    "margin-bottom": "0mm",
    "margin-left": "0mm",
    "encoding": "UTF-8",
}

### generate pdf from html

In [24]:
if not os.path.exists(OUTPUT_DIR):
    os.makedirs(OUTPUT_DIR)

for url, pdf_file in zip(URLS, FILE_NAMES):
    pdf_file_name = os.path.join(OUTPUT_DIR, pdf_file + '.pdf') # e.g., output/ipad_pro.pdf
    from_url(url, pdf_file_name, options=options)
    print(f'{pdf_file_name} saved')

output/ipad_pro.pdf saved
output/ipad_air.pdf saved
output/ipad_mini.pdf saved
output/ipad_10th_gen.pdf saved
output/ipad_9th_gen.pdf saved


### convert pdf to text


In [25]:
def get_pdf_text(pdf_doc):
    text = ""
    pdf_reader = PdfReader(pdf_doc)
    for page in pdf_reader.pages:
        text += page.extract_text()
    return text

In [26]:
# test
test_doc = r'output/ipad_pro.pdf'
pdf_text = get_pdf_text(test_doc)
pdf_text[:200]

'Languages\n \nEnglish\niPad Pro 12.9-inch (6th generation) - Technical\nSpecifications\nYear introduced: 2022\nIdentify your iPad model\nFinish\nSilver\nSpace Gray\nCapacity\n128GB\n256GB\n512GB\n1TB\n2TB\nSize and W'

In [27]:
pdf_docs = glob(f'{OUTPUT_DIR}/*.pdf') # all PDF files in the output directory
for doc in pdf_docs:
    file_name = os.path.basename(doc)
    txt = get_pdf_text(doc)
    with open(f'{OUTPUT_DIR}/{file_name}.txt', 'w') as f:
        f.write(txt)
        print(f'{file_name}.txt saved')

ipad_pro.pdf.txt saved
ipad_9th_gen.pdf.txt saved
ipad_10th_gen.pdf.txt saved
ipad_air.pdf.txt saved
ipad_mini.pdf.txt saved


## Video

We expect clients to have some videos in-house, but this section also serves as a way to source additional intelligence from the web. Here, we use YouTube (YT) videos with transcripts as an example. In a later stage of the project, we hope to use Amazon Transcribe to convert speech from videos to text, which would also let us handle videos without transcripts.

In [28]:
# test package imports
import pytube
import youtube_transcript_api

In [18]:
from langchain.document_loaders import YoutubeLoader

In [36]:
URL = r'https://www.youtube.com/watch?v=ujJEEJTrI1Y&ab_channel=TechGearTalk'

In [37]:
# loading transcripts from youtube videos using url
loader = YoutubeLoader.from_youtube_url(
    URL,
    add_video_info=True
)
loader.load()

[Document(page_content="- All right, so you're\nthinking of buying an iPad and you realize that there\nare a lot of great options but also some things that\nyou need to watch out for. If that's super confusing, don't worry, you're not alone. And spoiler alert, I'm actually gonna add\none iPad to this list, which I think a lot of\npeople are overlooking and they're making a huge mistake. So let's talk about the\nimportant differences. This way you can have all\nthe information you need to choose the right iPad for you. I'm also going to go over\nthe configuration options to help you narrow it down. Starting out, we have the iPad that got the most significant\nchanges in this lineup, the iPad 10. In the US it sells for 449, and if you're familiar with\nthe iPad 8 or the iPad 9 you'll immediately notice the new design. We no longer have the larger bezel on the top and the bottom and we're getting rounded\ncorners and squared off edges and it comes in some more fun colors. I absolutely lov

In [38]:
video_metadata  = loader.load()[0].metadata
video_title = video_metadata['title']
video_title

'2023 ULTIMATE iPad BUYING GUIDE!'

In [39]:
transcript = loader.load()[0].page_content
len(transcript)

15764

In [20]:
def clean_title(video_title:str) -> str:
    """Clean video title to remove special characters"""
    special_chars = ['?', '/', '\\', ':', '*', '<', '>', '|', '"']
    return "".join(c for c in video_title if c not in special_chars)

In [41]:
if not os.path.exists(OUTPUT_DIR):
    os.makedirs(OUTPUT_DIR)

fname = clean_title(video_title)
with open(f'{OUTPUT_DIR}/{fname}.txt', 'w') as f:
    f.write(transcript)
    print(f'{fname}.txt saved')

2023 ULTIMATE iPad BUYING GUIDE!.txt saved


## Summarize

Now that we have the documents in .txt format, we will summarize them using language models. First, we summarize a single document (small enough to stay within the token limit). Then, we load all the text files in the document folder and summarize them with a map-reduce approach. Here, we use examples from LangChain with API calls to OpenAI. The models can easily be swapped for Bedrock FMs.
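
To decide between the two approaches up front, one option is a rough size check before calling the model. The sketch below is an assumption-heavy illustration: `estimate_tokens` and `choose_strategy` are hypothetical helpers, the figure of about 4 characters per token is only a rule of thumb for English text (a real tokenizer would be more accurate), and the `reserve` for the prompt and the answer is an arbitrary placeholder.

```python
def estimate_tokens(text: str, chars_per_token: int = 4) -> int:
    """Rough token estimate; ~4 characters per token is a rule of thumb, not a tokenizer."""
    return -(-len(text) // chars_per_token)  # ceiling division


def choose_strategy(texts, max_tokens: int = 16385, reserve: int = 1000) -> str:
    """Pick 'stuff' if all texts plausibly fit in one prompt, else 'map_reduce'."""
    total = sum(estimate_tokens(t) for t in texts)
    # Keep some headroom (assumed value) for the prompt template and the model's answer
    return "stuff" if total <= max_tokens - reserve else "map_reduce"


print(choose_strategy(["short document"]))
print(choose_strategy(["x" * 100_000]))
```

Under this heuristic, the 15,764-character transcript loaded above would be estimated at roughly 3,941 tokens, so it would be routed to the single-prompt approach.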

In [49]:
from dotenv import load_dotenv

In [50]:
%reload_ext dotenv
%dotenv
chatGPT_api_key = os.getenv("OPENAI_API_KEY")

In [51]:
MAX_TOKEN = 16385 # max token length for OpenAI gpt-3.5-turbo-16k

### single document with stuff

Here, we use a simple [LLM chain](https://python.langchain.com/docs/modules/chains/foundational/llm_chain#:~:text=An%20LLMChain%20consists%20of%20a,and%20returns%20the%20LLM%20output.) that adds some functionality around language models. The `LLMChain` itself consists of a prompt (using `PromptTemplate`) and a language model (here, `gpt-3.5-turbo-16k` from OpenAI). 

In [54]:
from langchain.prompts import PromptTemplate
from langchain.llms import OpenAI
from langchain.chains import LLMChain
from langchain.chat_models import ChatOpenAI

In [55]:
# Define prompt
SUMMARIZE_PROMPT = """Write a concise summary of the following:
"{text}"
CONCISE SUMMARY:"""

prompt = PromptTemplate.from_template(SUMMARIZE_PROMPT)
prompt

PromptTemplate(input_variables=['text'], template='Write a concise summary of the following:\n"{text}"\nCONCISE SUMMARY:')

In [56]:
# Define LLM
llm = ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo-16k", openai_api_key= chatGPT_api_key)

In [57]:
# Define Chain
llm_chain = LLMChain(llm=llm, prompt=prompt)

Now that we have the prompt, the next step is to summarize. Here, we use `StuffDocumentsChain`, which corresponds to `load_summarize_chain` with `chain_type="stuff"`: the stuff chain takes a list of documents, inserts them all into a single prompt, and passes that prompt to an LLM.
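
The "stuff" strategy can be pictured in plain Python, without any LLM call. This is a simplified sketch rather than LangChain's implementation: `stuff_prompt` is a hypothetical helper, and the blank-line separator between documents is an assumption.

```python
SUMMARIZE_PROMPT = """Write a concise summary of the following:
"{text}"
CONCISE SUMMARY:"""


def stuff_prompt(doc_texts, template: str = SUMMARIZE_PROMPT, separator: str = "\n\n") -> str:
    """Join all document texts and insert them into a single prompt string."""
    return template.format(text=separator.join(doc_texts))


# Placeholder texts standing in for the loaded documents
prompt_text = stuff_prompt(["First document text.", "Second document text."])
print(prompt_text)
```

Since everything ends up in one prompt, this only works while the combined text stays within the model's context window.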

In [78]:
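# Define StuffDocumentsChain
from langchain.chains import StuffDocumentsChain  # needed here; the later import cell comes after this one

stuff_chain = StuffDocumentsChain(
    llm_chain=llm_chain, document_variable_name="text"
)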

In [None]:
docs = loader.load() # note: here we are still using the old YouTube loader, only one document
summary_information = stuff_chain.run(docs)
summary_information

### multiple documents with map-reduce

We start with loading files from our output directory using `DirectoryLoader`. We are going to load only files with ".txt" extension.  
> Note: under the hood, `DirectoryLoader` uses [UnstructuredFileLoader](https://python.langchain.com/docs/integrations/document_loaders/unstructured_file.html) by default, which currently supports loading text files, PowerPoint files, HTML, PDFs, images, and more. One can use `from langchain.document_loaders import UnstructuredFileLoader` for more flexibility.

In [44]:
from langchain.document_loaders import DirectoryLoader, TextLoader

In [71]:
path = f'{OUTPUT_DIR}/'
text_loader_kwargs={'autodetect_encoding': True}

loader = DirectoryLoader(path, 
                         glob='*.txt', 
                         show_progress=True,
                         use_multithreading=True,
                         loader_cls=TextLoader, # default loader class is UnstructuredFileLoader
                         loader_kwargs=text_loader_kwargs
                         )

In [106]:
docs = loader.load()
len(docs) # number of documents in the directory

100%|██████████| 8/8 [00:00<00:00, 867.49it/s]


8

Now that we have the documents, we run the summarization pipeline on them. First, we map each document to an individual summary using `LLMChain`; then we combine all the summaries into a single global summary using `ReduceDocumentsChain`.
> One can specify a map template directly, or use the Prompt Hub to store and fetch prompts. We use the map prompt provided [here](https://smith.langchain.com/hub/rlm/map-prompt). Pulling from the LangChain Hub requires `pip install langchainhub`.
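
The map and reduce steps can be sketched in plain Python with a stand-in for the LLM call. This is an illustration of the pattern only: `map_reduce_summarize` and `keep_first_words` are hypothetical names, and the stand-in "summarizer" simply keeps the first few words instead of calling a model.

```python
from typing import Callable, List


def map_reduce_summarize(chunks: List[str], summarize: Callable[[str], str]) -> str:
    """Summarize each chunk independently (map), then summarize the summaries (reduce)."""
    partial_summaries = [summarize(chunk) for chunk in chunks]  # map step
    return summarize("\n".join(partial_summaries))              # reduce step


def keep_first_words(text: str, n: int = 3) -> str:
    """Stand-in for an LLM call: keep only the first n words."""
    return " ".join(text.split()[:n])


chunks = ["one two three four", "five six seven eight"]
print(map_reduce_summarize(chunks, keep_first_words))
```

The summarizer is called once per chunk and once more for the final combination, so the number of model calls grows with the number of chunks.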

In [27]:
from langchain.chains.mapreduce import MapReduceChain
from langchain.text_splitter import CharacterTextSplitter
from langchain.chains import ReduceDocumentsChain, MapReduceDocumentsChain, StuffDocumentsChain
from langchain import hub

In [76]:
llm

OpenAI(client=<class 'openai.api_resources.completion.Completion'>, temperature=0.0, openai_api_key='<REDACTED>', openai_api_base='', openai_organization='', openai_proxy='')

In [59]:
# Map
map_prompt = hub.pull("rlm/map-prompt")
map_prompt

ChatPromptTemplate(input_variables=['docs'], messages=[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['docs'], template='The following is a set of documents:\n{docs}\nBased on this list of docs, please identify the main themes \nHelpful Answer:'))])

In [60]:
map_chain = LLMChain(llm=llm, prompt=map_prompt)

The `ReduceDocumentsChain` wraps a generic `CombineDocumentsChain` (like `StuffDocumentsChain`) but adds the ability to collapse documents before passing them on if their cumulative size exceeds `token_max`.
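
The collapsing idea can be sketched as follows: group the summaries so that each group fits within the limit, combine each group, and repeat until everything fits in one final combine call. This is a simplified illustration, not LangChain's implementation; all function names are hypothetical, and a whitespace word count stands in for a real tokenizer.

```python
from typing import Callable, List


def word_count(text: str) -> int:
    """Stand-in token counter: whitespace-separated words (not a real tokenizer)."""
    return len(text.split())


def split_into_groups(texts: List[str], token_max: int,
                      count_tokens: Callable[[str], int] = word_count) -> List[List[str]]:
    """Greedily pack texts into groups whose total size stays within token_max."""
    groups, current, used = [], [], 0
    for text in texts:
        size = count_tokens(text)
        if current and used + size > token_max:
            groups.append(current)
            current, used = [], 0
        current.append(text)
        used += size
    if current:
        groups.append(current)
    return groups


def collapse_then_combine(texts: List[str], combine: Callable[[List[str]], str],
                          token_max: int,
                          count_tokens: Callable[[str], int] = word_count) -> str:
    """Collapse groups of texts until they fit within token_max, then combine once more."""
    while len(texts) > 1 and sum(count_tokens(t) for t in texts) > token_max:
        groups = split_into_groups(texts, token_max, count_tokens)
        if len(groups) == len(texts):
            break  # no progress possible: every text already sits alone in its group
        texts = [combine(group) for group in groups]
    return combine(texts)


def first_words(group: List[str]) -> str:
    """Stand-in for the LLM combine step: keep the first word of each text."""
    return " ".join(text.split()[0] for text in group)


print(collapse_then_combine(["a b c", "d e f", "g h i", "j k l"], first_words, token_max=6))
```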

In [61]:
reduce_prompt = hub.pull("rlm/reduce-prompt")
reduce_prompt

ChatPromptTemplate(input_variables=['doc_summaries'], messages=[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['doc_summaries'], template='The following is set of summaries:\n{doc_summaries}\nTake these and distill it into a final, consolidated summary of the main themes. \nHelpful Answer:'))])

In [62]:
# Run chain
reduce_chain = LLMChain(llm=llm, prompt=reduce_prompt)

In [63]:
# Takes a list of documents and combines them into a single string
combine_documents_chain = StuffDocumentsChain(
        llm_chain=reduce_chain,
        document_variable_name="doc_summaries")

# Combines and iteratively reduces the mapped documents 
reduce_documents_chain = ReduceDocumentsChain(
    combine_documents_chain=combine_documents_chain,
    collapse_documents_chain=combine_documents_chain, # used when documents exceed the context of `combine_documents_chain`
    token_max=MAX_TOKEN)

# Combining documents by mapping a chain over them, then combining results
map_reduce_chain = MapReduceDocumentsChain(
    llm_chain=map_chain,
    reduce_documents_chain=reduce_documents_chain,
    document_variable_name="docs",
    return_intermediate_steps=False,
)

In [None]:
text_splitter = CharacterTextSplitter(chunk_size=4000,chunk_overlap=0,separator="\n")
split_docs = text_splitter.split_documents(docs)

In [223]:
# sanity check
for doc in split_docs:
    print(doc.metadata['source'], len(doc.page_content))

output/ipad_pro.pdf.txt 3980
output/ipad_pro.pdf.txt 3991
output/ipad_pro.pdf.txt 3908
output/ipad_pro.pdf.txt 3908
output/ipad_pro.pdf.txt 3972
output/ipad_pro.pdf.txt 3991
output/ipad_pro.pdf.txt 60
output/ipad_pro.txt 3980
output/ipad_pro.txt 3991
output/ipad_pro.txt 3908
output/ipad_pro.txt 3908
output/ipad_pro.txt 3972
output/ipad_pro.txt 3991
output/ipad_pro.txt 60
output/ipad_air.pdf.txt 3968
output/ipad_air.pdf.txt 3955
output/ipad_air.pdf.txt 4000
output/ipad_air.pdf.txt 3979
output/ipad_air.pdf.txt 3898
output/ipad_air.pdf.txt 1095
output/Introducing Apple Business Essentials  Apple.txt 1005
output/ipad_10th_gen.pdf.txt 3991
output/ipad_10th_gen.pdf.txt 3942
output/ipad_10th_gen.pdf.txt 3905
output/ipad_10th_gen.pdf.txt 3977
output/ipad_10th_gen.pdf.txt 3976
output/ipad_10th_gen.pdf.txt 1356
output/ipad_mini.pdf.txt 3993
output/ipad_mini.pdf.txt 3988
output/ipad_mini.pdf.txt 3996
output/ipad_mini.pdf.txt 3896
output/ipad_mini.pdf.txt 1441
output/2023 ULTIMATE iPad BUYING GUID

In [225]:
summary_information = map_reduce_chain.run(split_docs)
len(summary_information)

829

In [227]:
file_path = f'{OUTPUT_DIR}/full_summary.txt'
with open(file_path, 'w') as f:
    f.write(summary_information)
    print(f'{file_path} saved')

output/full_summary.txt saved


## Combine

The other approach is, of course, to produce one summary per document. For demonstration purposes, we use 10 files and create 10 summaries. We already have 5 PDF files and 2 YT videos converted to .txt, so we grab those plus 3 more YT videos.

In [70]:
import spacy

In [71]:
# !python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.7.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.0/en_core_web_sm-3.7.0-py3-none-any.whl (12.8 MB)
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.7.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [8]:
SRC_DIR = 'output'
OUT_DIR = 'sample'

if not os.path.exists(OUT_DIR):
    os.makedirs(OUT_DIR)

In [14]:
# move PDFs
for doc in glob(f'{SRC_DIR}/*.pdf.txt'):
    file_name = os.path.basename(doc)
    os.rename(doc, os.path.join(OUT_DIR, file_name))
    print(f'{file_name} moved to {OUT_DIR}')

ipad_pro.pdf.txt moved to sample
ipad_air.pdf.txt moved to sample
ipad_10th_gen.pdf.txt moved to sample
ipad_mini.pdf.txt moved to sample
ipad_9th_gen.pdf.txt moved to sample


In [21]:
# get more videos
URLS = [
    'https://www.youtube.com/watch?v=-v992SD1D2Q&t=266s&ab_channel=ByteReview', # The Ultimate iPad Buyers Guide 2023
    'https://youtu.be/CwtUJ30A8nY?si=RsUA6MOKyKeTPP_1', # The BEST iPad for 2023? DON’T buy wrong!
    'https://www.youtube.com/watch?v=yUKRkPKg5_U&ab_channel=Apple', # Meet the all-new iPad and iPad Pro | Apple
    
]

for url in URLS: 
    loader = YoutubeLoader.from_youtube_url(url, add_video_info=True)
    video_doc = loader.load()[0]  # load once and reuse for both metadata and transcript
    video_title = clean_title(video_doc.metadata['title'])
    transcript = video_doc.page_content
    
    with open(f'{OUT_DIR}/{video_title}.txt', 'w') as f:
        f.write(transcript)
        print(f'{video_title}.txt saved')

The Ultimate iPad Buyers Guide 2023.txt saved
The BEST iPad for 2023 DON’T buy wrong!.txt saved
Meet the all-new iPad and iPad Pro  Apple.txt saved


In [22]:
files = glob(f'{OUT_DIR}/*.txt')
len(files)

10

In [80]:
for i, file in enumerate(files):
    loader = TextLoader(file)
    doc = loader.load()
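    # str.strip() removes a set of characters, not a prefix/suffix, so use os.path helpers instead
    title = os.path.splitext(os.path.basename(doc[0].metadata['source']))[0]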
    # text = doc[0].page_content
    summary = stuff_chain.run(doc)
    
    #TODO: implement try-except for when token sizes exceeds max token length
    # this is not implemented yet because we might have different error message for different models
    # split_doc = text_splitter.split_documents(doc)
    # summary = map_reduce_chain.run(split_doc)
    
    # save to file
    with open(f'{OUT_DIR}/summary{str(i)}.txt', 'w') as f:
        f.write(summary)
    print(f'{title} saved as summary{str(i)}.txt')

ipad_pro.pdf saved as summary0.txt
The Ultimate iPad Buyers Guide 2023 saved as summary1.txt
The BEST iPad for 2023 DON’T buy wrong! saved as summary2.txt
Introducing Apple Business Essentials  Apple saved as summary3.txt
ipad_air.pdf saved as summary4.txt
Meet the all-new iPad and iPad Pro  Apple saved as summary5.txt
ipad_10th_gen.pdf saved as summary6.txt
ipad_mini.pdf saved as summary7.txt
ipad_9th_gen.pdf saved as summary8.txt
2023 ULTIMATE iPad BUYING GUIDE! saved as summary9.txt
