# Summarizing large documents
This notebook covers different ways to summarize documents using LangChain and established LLMs. The following methods are covered:


1. **Document Stuffing** - Putting an entire document directly into the prompt
2. **Map Reduction** - Splitting the documents into chunks, mapping each chunk to a summary, then creating a summary of the summaries
3. **Clustering** - Grouping related chunks into clusters, summarizing each cluster, then creating a complete summary from all the clusters.

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

from pathlib import Path
content_path = Path('./content/nord-universitet')


## Functions for converting files to Document objects

This notebook supports the following file types:

.doc, 
.docx, 
.pdf, 
.xls, 
.xlsx

The code below converts a file to a Document object.

In [2]:
%pip install langchain_community langchain-openai pypdf langchain werkzeug unstructured python-docx -Uq

Note: you may need to restart the kernel to use updated packages.


In [9]:
from dotenv import load_dotenv

import os

load_dotenv()
OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')

In [3]:
import re
from tempfile import NamedTemporaryFile
from pypdf import PdfReader
from langchain_community.document_loaders import (
    UnstructuredExcelLoader,
    UnstructuredWordDocumentLoader)
from langchain_core.documents import Document
from werkzeug.datastructures import FileStorage

In [4]:
def clean_text(text):
    # Remove excessive newlines and keep only ASCII + æøå characters.
    text = re.sub(r'\n{2,}', '\n', text)
    text = re.sub(r'[^\x00-\x7FæøåÆØÅ]+', '', text)
    # Drop lines that are empty or contain only whitespace
    text = "\n".join([line for line in text.split('\n') if line.strip() != ''])
    return text

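def word_document_to_document(file) -> Document:
    # `file` is either a path (str) or an uploaded werkzeug FileStorage object.
    if isinstance(file, str):
        loader = UnstructuredWordDocumentLoader(file_path=file)
        data = loader.load()
    else:
        # Write the upload to a temporary file so the loader can read it by path.
        with NamedTemporaryFile() as temp_file:
            file.save(temp_file)
            temp_file.flush()  # make sure the bytes are on disk before loading
            loader = UnstructuredWordDocumentLoader(temp_file.name)
            data = loader.load()
    return data[0]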

def pdf_to_document(file) -> Document:
    reader = PdfReader(file)
    # Concatenate the extracted text of every page.
    text = ''.join(page.extract_text() for page in reader.pages)
    cleaned_text = clean_text(text)
    return Document(page_content=cleaned_text)

def excel_to_document(file) -> Document:
    loader = UnstructuredExcelLoader(file)
    data = loader.load()
    return data[0]

In [5]:
def process_file(file) -> Document:
    # `file` is either a path (str) or an uploaded object with a `filename` attribute.
    filename: str = file if isinstance(file, str) else file.filename
    lowered = filename.lower()  # match extensions case-insensitively
    if lowered.endswith(('.docx', '.doc')):
        return word_document_to_document(file)
    elif lowered.endswith('.pdf'):
        return pdf_to_document(file)
    elif lowered.endswith(('.xlsx', '.xls')):
        return excel_to_document(file)

    raise ValueError(f"Unsupported file type: {filename}")

## Load the files

In [6]:
from typing import List

documents: List[Document] = list()
for dirname, _, filenames in os.walk(content_path):
    for filename in filenames:
        documents.append(process_file(os.path.join(dirname, filename)))

In [7]:
type(documents[0])

langchain_core.documents.base.Document

## Stuffing

Stuffing becomes problematic when the number of input tokens is limited. Newer models such as the Claude models have large context windows and can take many tokens as input, but with somewhat older models such as the GPT models, or with enormous amounts of text, the document may simply not fit in the prompt.
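
A rough pre-check can indicate whether a document is likely to fit before it is stuffed into the prompt. The sketch below is only an estimate: it assumes about four characters per token (a common rule of thumb for English text, likely less accurate for Norwegian), and `estimate_tokens`, `fits_for_stuffing`, the reserved-token budget and the example window size are hypothetical, not part of LangChain. The real context window has to be looked up in the documentation of the model in use.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Roughly estimate the token count from the character count (heuristic, not a tokenizer)."""
    return math.ceil(len(text) / chars_per_token)


def fits_for_stuffing(text: str, context_window: int, reserved_tokens: int = 1000) -> bool:
    """Return True if the estimated tokens plus a reserve for prompt and answer fit the window."""
    return estimate_tokens(text) + reserved_tokens <= context_window


# Hypothetical example values: pass the context window of the model actually used.
short_text = "a" * 4_000    # about 1,000 estimated tokens
long_text = "a" * 80_000    # about 20,000 estimated tokens
print(fits_for_stuffing(short_text, context_window=16_000))
print(fits_for_stuffing(long_text, context_window=16_000))
```

For exact counts, a real tokenizer for the chosen model should replace the heuristic.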

In [8]:
stuff_prompt = """Write a concise summary of the following text. The summary must be in the same language as the following text:
{context}
SUMMARY:
"""

In [10]:
from langchain.chains.combine_documents.stuff import create_stuff_documents_chain
from langchain.chains.llm import LLMChain
from langchain.prompts import PromptTemplate
from langchain_openai import ChatOpenAI

docs = [documents[0]]

stuff_prompt_template = PromptTemplate(
    template=stuff_prompt,
    input_variables=["context"]  # must match the {context} placeholder in the template
)

llm = ChatOpenAI(model_name="gpt-3.5-turbo-0125", api_key=OPENAI_API_KEY)
chain = create_stuff_documents_chain(llm=llm, prompt=stuff_prompt_template)

print(chain.invoke({"context": docs}))

RateLimitError: Error code: 429 - {'error': {'message': 'You exceeded your current quota, please check your plan and billing details. For more information on this error, read the docs: https://platform.openai.com/docs/guides/error-codes/api-errors.', 'type': 'insufficient_quota', 'param': None, 'code': 'insufficient_quota'}}

## Map Reduction


In [11]:
summary_map_template = """Write a short summary of the following text:

{context}

SUMMARY:
"""

summary_reduce_template = """The following text is a set of summaries:

{doc_summaries}

Create a cohesive summary from the above text.
SUMMARY:"""

In [12]:
# Map Reduction Code

In [13]:
from langchain_core.documents import Document
from langchain.text_splitter import TokenTextSplitter

# Take in a list of Documents and chunk them if necessary. Splits on token length.
def split_document_by_tokens(document: list[Document], chunk_size: int, overlap: int):
    splitter = TokenTextSplitter(chunk_size=chunk_size, chunk_overlap=overlap)
    return splitter.split_documents(document)

In [14]:
from langchain.chains import LLMChain, ReduceDocumentsChain, MapReduceDocumentsChain, StuffDocumentsChain
from langchain.prompts import PromptTemplate

def summarize_document(document: list[Document]):
    """
    Takes in a list of Documents and summarizes them.
    :param document: The document(s) to be summarized.
    :return: The final summary as a string.
    """
    # Chain to generate a summary from each chunk
    map_prompt = PromptTemplate.from_template(summary_map_template)
    map_chain = LLMChain(prompt=map_prompt, llm=llm)

    # Chain to generate one cohesive summary from the summaries
    reduce_prompt = PromptTemplate.from_template(summary_reduce_template)
    reduce_chain = LLMChain(prompt=reduce_prompt, llm=llm)
    stuff_chain = StuffDocumentsChain(llm_chain=reduce_chain, document_variable_name="doc_summaries")
    reduce_docs_chain = ReduceDocumentsChain(combine_documents_chain=stuff_chain)

    # The complete map reduction chain
    map_reduce_chain = MapReduceDocumentsChain(
        llm_chain=map_chain,
        document_variable_name="context",  # must match the {context} placeholder in summary_map_template
        reduce_documents_chain=reduce_docs_chain
    )

    splitdocs = split_document_by_tokens(document, 15000, 200)
    summary = map_reduce_chain.run(splitdocs)
    return summary

## Clustering

Map Reduction works well for a moderate amount of text (a couple of pages, a few small documents, and so on). It becomes problematic when the amount of text is very large, because the method makes many calls to the model: one per chunk, plus the reduce calls.
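
How quickly the number of calls grows can be estimated with a little arithmetic. The sketch below assumes a plain sliding window in which each chunk starts `chunk_size - overlap` tokens after the previous one. The helper names are hypothetical, and the result is a lower bound, since the reduce step may need more than one call when the intermediate summaries are long.

```python
import math


def estimate_chunks(total_tokens: int, chunk_size: int, overlap: int) -> int:
    """Estimate the number of sliding-window chunks for a text of total_tokens tokens."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    if total_tokens <= 0:
        return 0
    if total_tokens <= chunk_size:
        return 1
    step = chunk_size - overlap  # each new chunk advances by this many tokens
    return math.ceil((total_tokens - overlap) / step)


def min_llm_calls(total_tokens: int, chunk_size: int, overlap: int) -> int:
    """Lower bound on LLM calls: one map call per chunk plus at least one reduce call."""
    chunks = estimate_chunks(total_tokens, chunk_size, overlap)
    return chunks + 1 if chunks else 0


# Hypothetical input size, with the chunk size and overlap used in this notebook.
print(estimate_chunks(1_000_000, chunk_size=15_000, overlap=200))
print(min_llm_calls(1_000_000, chunk_size=15_000, overlap=200))
```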

The question is then how to summarize very large amounts of text, such as entire books or very long documents. One could pick random chunks, or chunks spaced a fixed distance apart, but then there is a risk of losing essential information.

A clustering-based approach instead goes through these steps:

- Embed the text with an embedding model
- Use K-means clustering to group similar chunks
- Use silhouette scoring to find the optimal number of clusters (k)
- Use t-SNE to visualize the clusters
- Generate a summary of each cluster
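
The K-means and silhouette steps above can be sketched with scikit-learn. This is a minimal sketch under stated assumptions, not the notebook's implementation: the embeddings are synthetic 2-D points standing in for real embedding vectors, the helper names and the range of k values are hypothetical, and the t-SNE visualization and the per-cluster summarization are left out.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def pick_k_by_silhouette(embeddings: np.ndarray, k_values: range, seed: int = 0) -> int:
    """Return the k in k_values whose K-means clustering has the highest silhouette score."""
    best_k, best_score = k_values[0], -1.0
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(embeddings)
        score = silhouette_score(embeddings, labels)
        if score > best_score:
            best_k, best_score = k, score
    return best_k


def group_chunks_by_cluster(embeddings: np.ndarray, k: int, seed: int = 0) -> dict[int, list[int]]:
    """Cluster the embeddings into k groups and return chunk indices per cluster label."""
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(embeddings)
    groups: dict[int, list[int]] = {}
    for index, label in enumerate(labels):
        groups.setdefault(int(label), []).append(index)
    return groups


# Synthetic stand-in for chunk embeddings: three well-separated groups of 20 points each.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
embeddings = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])

k = pick_k_by_silhouette(embeddings, range(2, 7))
groups = group_chunks_by_cluster(embeddings, k)
for label, indices in sorted(groups.items()):
    print(label, indices)
```

Each list of indices identifies the chunks that would be summarized together. The cluster summaries could then be combined into one final summary, as in the reduce step of Map Reduction.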

In [None]:
# Clustering Code