# Summarization

We will use [LangChain](https://www.langchain.com/), an open-source library for making applications with LLMs.


## Document location
We will try to load all the documents in the folder defined below.
If you prefer, you can change this to a different folder name.

In [1]:
#document_folder = 'documents'
document_folder = 'summarizing'

## Some configuration
To conserve GPU memory, we ask PyTorch's CUDA allocator to use expandable segments, which reduces memory fragmentation.

In [2]:
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

## Installing Software
We’ll need to install some libraries first:

In [3]:
!pip install --upgrade pip unstructured[all-docs] langchain-unstructured

Defaulting to user installation because normal site-packages is not writeable


## The Language Model
We’ll use models from [HuggingFace](https://huggingface.co/), a website that has tools and models for machine learning.
We’ll use the open-source LLM [mistralai/Mistral-Nemo-Instruct-2407](https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407).
This model has 12 billion parameters.
For comparison, one of the largest LLMs at the time of writing is Llama 3.1, with 405 billion parameters.
Still, Mistral-Nemo-Instruct is around 25 GB, which makes it quite a large model.
To run it, we must have a GPU with at least 25 GB of memory.
It can also be run without a GPU, but that will be much slower.
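
The size of roughly 25 GB can be sanity-checked with a little arithmetic. The sketch below assumes the weights are stored as 16-bit numbers (2 bytes per parameter); that is an assumption about the checkpoint format, and running the model needs some extra memory on top of the weights. The function name `estimate_weight_memory_gb` is made up for this illustration.

```python
def estimate_weight_memory_gb(n_parameters, bytes_per_parameter=2):
    """Rough size of the model weights in gigabytes (1 GB = 10**9 bytes).

    bytes_per_parameter=2 assumes 16-bit weights, which is an assumption here.
    """
    return n_parameters * bytes_per_parameter / 1e9


nemo_gb = estimate_weight_memory_gb(12e9)    # 12 billion parameters
llama_gb = estimate_weight_memory_gb(405e9)  # 405 billion parameters

print(f"Mistral-Nemo-Instruct: about {nemo_gb:.0f} GB of weights")
print(f"Llama 3.1 405B: about {llama_gb:.0f} GB of weights")
```

Under this assumption the 12-billion-parameter model needs about 24 GB for the weights alone, which is in line with the 25 GB quoted above.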

We should tell the HuggingFace library where to store its data. If you’re running on Educloud/Fox project ec443 the model is stored at the path below.

In [4]:
%env HF_HOME=/fp/projects01/ec443/huggingface/cache/

env: HF_HOME=/fp/projects01/ec443/huggingface/cache/


If you’re not running on Educloud/Fox project ec443 you’ll need to download the model.
Even though the model Mistral-Nemo-Instruct-2407 is open source, we must log in to HuggingFace to download it.
If you’re running on Educloud/Fox project ec443 the model is *already downloaded*, so you can skip this step.

In [5]:
from huggingface_hub import login
login()


To use the model, we create a *pipeline*.
A pipeline can consist of several processing steps, but in this case, we only need one step.
We can use the method `HuggingFacePipeline.from_model_id()`, which automatically downloads the specified model from HuggingFace.

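For comparison, the same kind of pipeline can also be created directly with the `transformers` library (this variant is not used in the rest of the notebook):

from transformers import pipeline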

llm = pipeline("text-generation", 
               model="mistralai/Mistral-Nemo-Instruct-2407",
               device=0,
               max_new_tokens=1000)

In [6]:
from langchain_community.llms import HuggingFacePipeline

llm = HuggingFacePipeline.from_model_id(
    #model_id='mistralai/Mistral-Small-Instruct-2409',
    model_id='mistralai/Mistral-Nemo-Instruct-2407',
    #model_id='mistralai/Mistral-7B-Instruct-v0.3',
    task='text-generation',
    device=0,
    pipeline_kwargs={
        'max_new_tokens': 1000,
        #'temperature': 0.3,
        #'num_beams': 4,
        #'do_sample': True
    }
)



We give some arguments to the pipeline:
- `model_id`: the name of the model on HuggingFace
- `task`: the task you want to use the model for
- `device`: the GPU hardware device to use. If we don't specify a device, no GPU will be used.
- `pipeline_kwargs`: additional parameters that are passed to the model.
    - `max_new_tokens`: maximum length of the generated text
    - `do_sample`: by default, the most likely next word is chosen. This makes the output deterministic. We can introduce some randomness by sampling among the most likely words instead.
    - `temperature`: the temperature controls the amount of randomness, where zero means no randomness.
    - `num_beams`: by default the model works with a single sequence of tokens/words. With beam search, the program builds multiple sequences at the same time, and then selects the best one in the end.
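
To make `do_sample` and `temperature` more concrete, here is a toy sketch of how a single next word could be chosen. It is not the code that `transformers` runs internally: the three candidate words, their scores, and the helper names `softmax_with_temperature` and `pick_next_token` are all invented for illustration, and beam search is left out.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; a lower temperature sharpens them."""
    scaled = [score / temperature for score in logits]
    highest = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(value - highest) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


def pick_next_token(tokens, logits, do_sample=False, temperature=1.0, rng=None):
    """Choose one token: greedily by default, or by sampling if do_sample is True."""
    if not do_sample or temperature <= 0:
        # Greedy choice: always the highest-scoring token (deterministic)
        best_index = max(range(len(tokens)), key=lambda i: logits[i])
        return tokens[best_index]
    rng = rng or random.Random()
    probabilities = softmax_with_temperature(logits, temperature)
    return rng.choices(tokens, weights=probabilities, k=1)[0]


# Made-up candidates and scores for the next word
tokens = ["law", "code", "banana"]
logits = [2.0, 1.5, -1.0]

print("Greedy:", pick_next_token(tokens, logits))
rng = random.Random(0)
print("Sampled:", [pick_next_token(tokens, logits, do_sample=True,
                                   temperature=0.3, rng=rng) for _ in range(5)])
```

With a low temperature, almost all the probability goes to the top-scoring word, so sampling behaves nearly like the greedy default; with a higher temperature, less likely words are picked more often.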


## Making a Prompt
We can use a *prompt* to tell the language model how to answer.
The prompt should contain a few short, helpful instructions.
In addition, we provide placeholders for the context.
LangChain replaces these with the actual documents when we execute a query.


In [7]:
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.chains.llm import LLMChain
from langchain.prompts import PromptTemplate


In [13]:
separator = '\nYour Summary:\n'
prompt_template = '''Write a summary of the following:

{context}
''' + separator
prompt = PromptTemplate(template=prompt_template,
                        input_variables=['context'])
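
To see what the model will actually receive, we can fill in the placeholder by hand. The sketch below uses Python's plain `str.format` as a stand-in for what `PromptTemplate` does with `{context}`; the example text is made up.

```python
separator = '\nYour Summary:\n'
prompt_template = '''Write a summary of the following:

{context}
''' + separator

# Made-up text standing in for the content of a real document
example_text = "Law and programming languages both rely on precise wording."

# Replace the {context} placeholder with the actual text
filled_prompt = prompt_template.format(context=example_text)
print(filled_prompt)
```

The filled prompt ends with the separator, so the model's reply continues right after `Your Summary:`.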

## Create chain

The document loader does not return a file as one piece: it splits it into many small 'documents' (in the run further down, a single PDF becomes 320 of them).
This is partly for technical reasons, because of the way formats such as PDF are structured.
Therefore, we use the chain called `create_stuff_documents_chain`, which joins multiple documents into a single large document.

In [14]:
chain = create_stuff_documents_chain(llm, prompt)
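
Conceptually, "stuffing" just means joining the text of all the documents and putting the result into the prompt. The sketch below shows the idea in plain Python. The `Page` class and the `stuff_documents` function are hypothetical stand-ins, not LangChain code, and joining with a blank line between documents is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Page:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str


def stuff_documents(docs, template, joiner="\n\n"):
    """Join the text of all documents and insert it into the prompt template."""
    context = joiner.join(doc.page_content for doc in docs)
    return template.format(context=context)


docs = [Page("First part of the paper."), Page("Second part of the paper.")]
template = "Write a summary of the following:\n\n{context}\n\nYour Summary:\n"

stuffed_prompt = stuff_documents(docs, template)
print(stuffed_prompt)
```

Because everything ends up in one prompt, this approach only works when the combined text fits in the model's context window.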

## Function to load and summarize a single document

This function loads a single document and runs it through the language model to produce a summary.

In [15]:
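def split_result(result):
    "Return only the text after the separator, that is, the summary without the echoed prompt."
    position = result.find(separator)
    if position == -1:
        # Separator not found: return the whole reply unchanged
        return result
    summary = result[position + len(separator) :]
    return summary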
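
The splitting is needed because the reply from the pipeline starts with a copy of the prompt, as the output at the end of this notebook shows. Here is a small self-contained example of the same idea, using a made-up reply. The helper name `extract_summary` is hypothetical, and it uses `str.partition` as one possible way to do the split.

```python
separator = '\nYour Summary:\n'


def extract_summary(reply, marker=separator):
    """Return the text after the marker, or the whole reply if the marker is missing."""
    before, found, after = reply.partition(marker)
    return after.strip() if found else reply.strip()


# Made-up reply: the prompt is echoed first, then the generated summary follows
reply = ("Write a summary of the following:\n\nSome long text.\n"
         + separator
         + "A short summary.")

print(extract_summary(reply))
```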

In [16]:
from langchain_unstructured import UnstructuredLoader

def summarize_document(filename):
    print('\n\nProcessing file:', filename)
    loader = UnstructuredLoader(filename)
    docs = loader.load()
    document_lengths = [len(doc.page_content) for doc in docs]
    print(f'Number of documents: {len(docs)}, total length: {sum(document_lengths)}')
    print('Maximum document length: ', max(document_lengths))

    # Run the chain on all the loaded documents
    result = chain.invoke({"context": docs})
    return split_result(result)
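
Since all the text is sent to the model in one prompt, it can be useful to check roughly whether a document fits in the model's context window. The sketch below is only a rule of thumb: the figure of about 4 characters per token is a common approximation, not a property of this model's tokenizer, and the context window size must be looked up for the model you use (the 128,000 tokens used here is an assumption). Both function names are hypothetical.

```python
def rough_token_count(n_characters, chars_per_token=4.0):
    """Very rough estimate of the number of tokens in a text of the given length."""
    return int(n_characters / chars_per_token)


def fits_in_context(n_characters, context_window_tokens, reserved_for_output=1000):
    """Check whether the text plus room for the generated summary fits in the window."""
    needed = rough_token_count(n_characters) + reserved_for_output
    return needed <= context_window_tokens


# Total length in characters, as printed by summarize_document for one file
total_length = 70316

print("Estimated tokens:", rough_token_count(total_length))
print("Fits:", fits_in_context(total_length, context_window_tokens=128_000))
```

The value of `reserved_for_output` matches the `max_new_tokens` setting used when the pipeline was created.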

## Loading the Documents


We use the Python library `pathlib` to iterate over all the files in `document_folder`.
`document_folder` is defined at the start of this notebook.

In [17]:
from pathlib import Path


directory = Path(document_folder)
file_iterator = directory.iterdir()
summaries = dict()

for filename in file_iterator:
    try:
        summary = summarize_document(filename)
    except Exception as e:
        # Skip files that cannot be loaded or summarized
        print('Error:', e)
        continue
    summaries[filename] = summary
    print(summary)



Processing file: summarizing/Grimmelmann - 2022 - Programming Languages and Law A Research Agenda.pdf
Number of documents: 320, total length: 70316
Maximum document length:  1213
The article "Programming Languages and Law: A Research Agenda" by James Grimmelmann presents a research agenda for applying programming-language theory to law. The author argues that law and programming languages share a common focus on using precisely structured linguistic constructions to do things in the world, and that programming-language techniques can be useful in solving legal problems. The article surveys the history of research into programming languages and law, and presents ten promising avenues for future research. These include using programming languages to model legal doctrines, using legal drafting tools to improve legal drafting, using programming languages to design legal systems, and using programming languages to interpret legal texts. The article also discusses the challenges and opport
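
`iterdir()` returns everything in the folder, including subfolders and files that are not documents. If that causes errors, one option is to keep only regular files with certain extensions. The sketch below shows this with a temporary folder; the function name `list_document_files` and the list of extensions are assumptions chosen for the example.

```python
import tempfile
from pathlib import Path


def list_document_files(folder, suffixes=(".pdf", ".docx", ".txt")):
    """Return the regular files in folder whose extension is in suffixes, sorted."""
    folder = Path(folder)
    return sorted(
        path for path in folder.iterdir()
        if path.is_file() and path.suffix.lower() in suffixes
    )


# Demonstration with a temporary folder containing two files and one subfolder
with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)
    (tmp_path / "paper.pdf").write_bytes(b"")
    (tmp_path / "notes.TXT").write_text("hello")
    (tmp_path / ".ipynb_checkpoints").mkdir()
    found = [path.name for path in list_document_files(tmp_path)]

print(found)
```

The subfolder is skipped, and the extension check ignores upper and lower case.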

In [18]:
with open('summaries.txt', 'w') as outfile:
    for filename in summaries:
        print('Summary of ', filename, file = outfile)
        print(summaries[filename], file=outfile)
        print(file=outfile)

In [19]:
from langchain.schema.document import Document
from langchain.prompts import ChatPromptTemplate

total_prompt = ChatPromptTemplate.from_messages(
    [("system", "Below is a list of summaries of some papers. Make a total summary all the information in all the papers:\n\n{context}\n\nTotal Summary:")]
)
total_chain = create_stuff_documents_chain(llm, total_prompt)
total_summary = total_chain.invoke({"context": [Document(page_content = summary) for summary in summaries.values()]})

print('Summary of all the summaries:')
print(total_summary)

#print(result)

with open('total_summary.txt', 'w') as outfile:
    print(total_summary, file=outfile)

Summary of all the summaries:
System: Below is a list of summaries of some papers. Make a total summary all the information in all the papers:

The article "Programming Languages and Law: A Research Agenda" by James Grimmelmann presents a research agenda for applying programming-language theory to law. The author argues that law and programming languages share a common focus on using precisely structured linguistic constructions to do things in the world, and that programming-language techniques can be useful in solving legal problems. The article surveys the history of research into programming languages and law, and presents ten promising avenues for future research. These include using programming languages to model legal doctrines, using legal drafting tools to improve legal drafting, using programming languages to design legal systems, and using programming languages to interpret legal texts. The article also discusses the challenges and opportunities of interdisciplinary research
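
The last step follows a two-stage pattern: first summarize each paper on its own, then summarize the summaries. This is sometimes called map-reduce summarization. The sketch below shows only the structure of that pattern. The function `toy_summarize` is a made-up stand-in that keeps the first five words, so that the example runs without a language model.

```python
def summarize_all(texts, summarize):
    """Summarize each text separately, then summarize the combined summaries."""
    partial_summaries = [summarize(text) for text in texts]  # first stage
    combined = "\n\n".join(partial_summaries)
    return summarize(combined)                               # second stage


calls = []


def toy_summarize(text):
    """Toy stand-in for the language model: keep only the first five words."""
    calls.append(text)
    return " ".join(text.split()[:5])


papers = [
    "Paper one is about law. It has many details.",
    "Paper two is about code. It is long.",
]

total = summarize_all(papers, toy_summarize)
print(total)
print("Number of calls to the summarizer:", len(calls))
```

With real documents, each call to the summarizer would be a call to the language model, so the number of calls grows with the number of papers.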