# Summarization

We will use [LangChain](https://www.langchain.com/), an open-source library for making applications with LLMs.


## Document location
We will try to load all the documents in the folder defined below.
If you prefer, you can change this to a different folder name.

In [None]:
#document_folder = 'documents'
document_folder = '../summarizing'

## Some configuration
To conserve GPU memory, we tell PyTorch's CUDA memory allocator to use expandable segments, which reduces memory fragmentation.

In [None]:
%env PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
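
The `%env` line is an IPython magic command, so it only works inside a notebook. In a plain Python script, a rough equivalent is to set the variable through `os.environ`. This is a minimal sketch, assuming it runs before PyTorch is imported; if PyTorch has already initialised CUDA, the setting may not be picked up.

```python
import os

# Set the allocator option before importing torch (assumption: torch is not yet loaded)
os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'expandable_segments:True'

print(os.environ['PYTORCH_CUDA_ALLOC_CONF'])
```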

## Installing Software
We’ll need to install some libraries first:

In [None]:
!pip install unstructured[all-docs] langchain-unstructured

## The Language Model
We'll use models from [HuggingFace](https://huggingface.co/), a website that has tools and models for machine learning.
We'll use the open-weights LLM 
[mistralai/Ministral-8B-Instruct-2410](https://huggingface.co/mistralai/Ministral-8B-Instruct-2410).


In [None]:
%env HF_HOME=/fp/projects01/ec443/huggingface/cache/

To use the model, we create a *pipeline*.
A pipeline can consist of several processing steps, but in this case, we only need one step.
We can use the method `HuggingFacePipeline.from_model_id()`, which automatically downloads the specified model from HuggingFace.

from transformers import pipeline

llm = pipeline("text-generation", 
               model="mistralai/Mistral-Nemo-Instruct-2407",
               device=0,
               max_new_tokens=1000)

In [None]:
from langchain_community.llms import HuggingFacePipeline

llm = HuggingFacePipeline.from_model_id(
    #model_id='mistralai/Mistral-Small-Instruct-2409',
    model_id='mistralai/Ministral-8B-Instruct-2410',
    #model_id='mistralai/Mistral-7B-Instruct-v0.3',
    task='text-generation',
    device=0,
    pipeline_kwargs={
        'max_new_tokens': 1000,
        #'temperature': 0.3,
        #'num_beams': 4,
        #'do_sample': True
    }
)


We give some arguments to the pipeline:
- `model_id`: the name of the model on HuggingFace
- `task`: the task you want to use the model for
- `device`: the GPU hardware device to use. If we don't specify a device, no GPU will be used.
- `pipeline_kwargs`: additional parameters that are passed to the model.
    - `max_new_tokens`: maximum length of the generated text, counted in tokens
    - `do_sample`: by default, the most likely next word is chosen. This makes the output deterministic. We can introduce some randomness by sampling among the most likely words instead.
    - `temperature`: the temperature controls the amount of randomness when sampling; the lower the temperature, the less randomness. It only has an effect when `do_sample` is enabled.
    - `num_beams`: by default the model works with a single sequence of tokens/words. With beam search, the program builds multiple sequences at the same time, and then selects the best one in the end.
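
The effect of `do_sample` and `temperature` can be illustrated with a toy example. The sketch below is plain Python and does not call the model: the three candidate words, their scores, and the helper names `softmax` and `pick_next_token` are made up for illustration.

```python
import math
import random

def softmax(scores, temperature=1.0):
    """Turn raw scores into probabilities; a lower temperature sharpens the distribution."""
    scaled = [s / temperature for s in scores]
    largest = max(scaled)
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def pick_next_token(tokens, scores, do_sample=False, temperature=1.0, rng=random):
    """Pick the top-scoring token, or sample from the distribution if do_sample is True."""
    if not do_sample:
        return tokens[scores.index(max(scores))]
    weights = softmax(scores, temperature)
    return rng.choices(tokens, weights=weights, k=1)[0]

# Hypothetical scores for three candidate next words
tokens = ['cat', 'dog', 'fish']
scores = [2.0, 1.5, 0.1]

# Without sampling, the result is always the highest-scoring word
print(pick_next_token(tokens, scores))

# With sampling, a low temperature stays close to the top word,
# while a high temperature spreads the choices more evenly
rng = random.Random(0)
print([pick_next_token(tokens, scores, do_sample=True, temperature=0.3, rng=rng) for _ in range(5)])
print([pick_next_token(tokens, scores, do_sample=True, temperature=2.0, rng=rng) for _ in range(5)])
```

In this sketch the temperature must be greater than zero, since the scores are divided by it.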


## Making a Prompt
We can use a *prompt* to tell the language model how to answer.
The prompt should contain a few short, helpful instructions.
In addition, we provide placeholders for the context.
LangChain replaces these with the actual documents when we execute a query.


In [None]:
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.chains.llm import LLMChain
from langchain.prompts import PromptTemplate


In [None]:
separator = '\nYour Summary:\n'
prompt_template = '''Write a summary of the following:

{context}
''' + separator
prompt = PromptTemplate(template=prompt_template,
                        input_variables=['context'])
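
To see roughly what the language model receives, we can fill in the placeholder by hand. This is a minimal sketch that imitates the substitution with Python's built-in `str.format` rather than calling LangChain; the text in `example_text` is a made-up stand-in for the loaded documents.

```python
separator = '\nYour Summary:\n'
prompt_template = '''Write a summary of the following:

{context}
''' + separator

# Hypothetical document text, standing in for the loaded documents
example_text = 'LangChain is a library for building applications with LLMs.'

# Replace the {context} placeholder with the document text
filled_prompt = prompt_template.format(context=example_text)
print(filled_prompt)
```

The filled prompt ends with the separator, so the model's summary is expected to follow directly after it.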

## Create chain

The document loader loads each PDF page as a separate 'document'.
This is partly for technical reasons, because that is how PDFs are structured.
Therefore, we use `create_stuff_documents_chain`, which joins multiple documents into a single large document before it is inserted into the prompt.

In [None]:
chain = create_stuff_documents_chain(llm, prompt)
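
Conceptually, "stuffing" amounts to concatenating the text of all the documents. The sketch below shows the idea in plain Python without LangChain. The class `Page` and the function `stuff_documents` are hypothetical, and the blank line used as separator is an assumption about the chain's default behaviour.

```python
from dataclasses import dataclass

@dataclass
class Page:
    """Hypothetical stand-in for a LangChain document; only the text is modelled."""
    page_content: str

def stuff_documents(docs, document_separator='\n\n'):
    """Join the text of all documents into one string."""
    return document_separator.join(doc.page_content for doc in docs)

pages = [Page('Text from page one.'), Page('Text from page two.')]
context = stuff_documents(pages)
print(context)
```

Because everything ends up in one prompt, this approach only works as long as the combined text fits in the model's context window.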

## Function to load and summarize a single document

This function loads a single document and runs it through the language model to produce a summary.

In [None]:
def split_result(result):
    "Split the reply from the prompt (an output parser might be a cleaner way to do this)."
    position = result.find(separator)
    if position == -1:
        # Separator not found: the result does not contain the prompt
        return result
    return result[position + len(separator):]
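
The splitting step can be tried out without running the model. The sketch below assumes that the text returned by the pipeline consists of the prompt followed by the generated summary; `extract_summary` and the example reply are made up for illustration.

```python
separator = '\nYour Summary:\n'

def extract_summary(reply, separator=separator):
    """Return the text after the first separator, or the whole reply if it is absent."""
    _, found, tail = reply.partition(separator)
    return tail.strip() if found else reply.strip()

# Hypothetical model output: the prompt followed by the generated summary
reply = 'Write a summary of the following:\n\nSome long text.\n' + separator + 'A short summary.'
print(extract_summary(reply))

# A reply without the separator is returned unchanged
print(extract_summary('Only a summary.'))
```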

In [None]:
from langchain_unstructured import UnstructuredLoader

def summarize_document(filename):
    print('\n\nProcessing file:', filename)
    loader = UnstructuredLoader(filename)
    docs = loader.load()
    document_lengths = [len(doc.page_content) for doc in docs]
    print(f'Number of documents: {len(docs)}, total length: {sum(document_lengths)}')
    print('Maximum document length: ', max(document_lengths))

    #Run chain    
    result = chain.invoke({"context": docs})
    return split_result(result)

## Loading the Documents


We use the Python library `pathlib` to iterate over all the files in `document_folder`.
`document_folder` is defined at the start of this notebook.

In [None]:
from pathlib import Path


directory = Path(document_folder)
file_iterator = directory.iterdir()
summaries  = dict()

for filename in file_iterator:
    try:
        summary = summarize_document(filename)
    except Exception as e:
        # Skip files that cannot be loaded or summarized
        print('Error:', e)
        continue
    summaries[filename] = summary
    #summaries.append(Document(page_content = summary))

    print(summary)
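
Note that `iterdir()` yields the folder entries in arbitrary order and also includes subfolders. If that is a problem, the entries can be filtered and sorted first. The sketch below demonstrates this on a temporary folder; the helper `list_files` and the file names are hypothetical.

```python
import tempfile
from pathlib import Path

def list_files(folder):
    """Return the regular files in a folder, sorted by name; subfolders are skipped."""
    return sorted(path for path in Path(folder).iterdir() if path.is_file())

# Demonstration in a temporary folder with hypothetical file names
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / 'paper_b.pdf').write_text('dummy')
    (Path(tmp) / 'paper_a.pdf').write_text('dummy')
    (Path(tmp) / 'subfolder').mkdir()
    names = [path.name for path in list_files(tmp)]
    print(names)
```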

In [None]:
with open('summaries.txt', 'w') as outfile:
    for filename in summaries:
        print('Summary of', filename, file=outfile)
        print(summaries[filename], file=outfile)
        print(file=outfile)

In [None]:
from langchain.schema.document import Document
from langchain.prompts import ChatPromptTemplate

total_prompt = ChatPromptTemplate.from_messages(
    [("system", "Below is a list of summaries of some papers. Make a total summary of all the information in all the papers:\n\n{context}\n\nTotal Summary:")]
)
total_chain = create_stuff_documents_chain(llm, total_prompt)
total_summary = total_chain.invoke({"context": [Document(page_content = summary) for summary in summaries.values()]})

print('Summary of all the summaries:')
print(total_summary)

#print(result)

with open('total_summary.txt', 'w') as outfile:
    print(total_summary, file=outfile)