<a href="https://colab.research.google.com/github/sahinutar/colabs/blob/main/LDA_with_LLMs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# LDA and LLMs for Document Summarization

Install the libraries that Colab does not provide by default.

In [None]:
!pip install pypdf
!pip install tiktoken
!pip install langchain
!pip install openai
!pip install gdown

Collecting pypdf
  Downloading pypdf-3.16.0-py3-none-any.whl (276 kB)
Installing collected packages: pypdf
Successfully installed pypdf-3.16.0
Collecting tiktoken
  Downloading tiktoken-0.4.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.7 MB)
Installing collected packages: tiktoken
Successfully installed tiktoken-0.4.0
Collecting langchain
  Downloading langchain-0.0.285-py3-none-any.whl (1.7 MB)
Collecting dataclasses-json<0.6.0,>=0.5.7 (from langchain)
  Downloading dataclasses_json-0.5.14-py3-none-any.whl (26 kB)
Collecting langsmith<0.1.0,>=0.0.21 (from langchain)
  Downloading langsmith-0.0.35-py3-none-any

In [None]:
import gensim
import nltk
from gensim import corpora
from gensim.models import LdaModel
from gensim.utils import simple_preprocess
from nltk.corpus import stopwords
from pypdf import PdfReader
from langchain.chains import LLMChain
from langchain.prompts import ChatPromptTemplate
from langchain.llms import OpenAI


## Functions

The following functions implement the preprocessing, topic extraction, and LLM call.

In [None]:
def preprocess(text, stop_words):
    """
    Tokenizes and preprocesses the input text, removing stopwords and short
    tokens.

    Parameters:
        text (str): The input text to preprocess.
        stop_words (set): A set of stopwords to be removed from the text.
    Returns:
        list: A list of preprocessed tokens.
    """
    result = []
    for token in simple_preprocess(text, deacc=True):
        if token not in stop_words and len(token) > 3:
            result.append(token)
    return result
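
To see what the filter keeps, here is a minimal sketch that applies the same stop-word and length checks to one sentence. It assumes a plain regex tokenizer as a rough stand-in for gensim's `simple_preprocess`, so it runs without gensim installed. The name `preprocess_sketch`, the sample sentence, and the tiny stop-word set are illustrative and not part of the notebook.

```python
import re


def preprocess_sketch(text, stop_words):
    """Keep lowercase alphabetic tokens longer than 3 characters that are not stopwords."""
    # Assumption: a regex tokenizer stands in for gensim's simple_preprocess.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in stop_words and len(t) > 3]


# Illustrative stop-word set; the notebook uses NLTK's English and Spanish lists.
sample_stop_words = {"from", "himself", "found"}
sample = "Gregor Samsa woke from troubled dreams and found himself transformed."
print(preprocess_sketch(sample, sample_stop_words))
```

In this sample, "and" is not in the stop-word set but is still dropped, because the length check removes every token of three characters or fewer.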


In [None]:

def get_topic_lists_from_pdf(file, num_topics, words_per_topic):
    """
    Extracts topics and their associated words from a PDF document using the
    Latent Dirichlet Allocation (LDA) algorithm.

    Parameters:
        file (str): The path to the PDF file for topic extraction.
        num_topics (int): The number of topics to discover.
        words_per_topic (int): The number of words to include per topic.

    Returns:
        list: A list of num_topics sublists, each containing relevant words
        for a topic.
    """
    # Load the pdf file
    loader = PdfReader(file)

    # Extract the text from each page into a list.
    # Each page is treated as one document.
    documents = [page.extract_text() for page in loader.pages]

    # Preprocess the documents
    nltk.download('stopwords')
    stop_words = set(stopwords.words(['english','spanish']))
    processed_documents = [preprocess(doc, stop_words) for doc in documents]

    # Create a dictionary and a corpus
    dictionary = corpora.Dictionary(processed_documents)
    corpus = [dictionary.doc2bow(doc) for doc in processed_documents]

    # Build the LDA model
    lda_model = LdaModel(
        corpus,
        num_topics=num_topics,
        id2word=dictionary,
        passes=15
        )

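    # Retrieve the topics and their corresponding words. Pass num_topics
    # explicitly: print_topics returns at most 20 topics by default.
    topics = lda_model.print_topics(num_topics=num_topics,
                                    num_words=words_per_topic)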

    # Store each list of words from each topic into a list
    topics_ls = []
    for topic in topics:
        words = topic[1].split("+")
        topic_words = [word.split("*")[1].replace('"', '').strip() for word in words]
        topics_ls.append(topic_words)

    return topics_ls
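
The parsing loop at the end of this function relies on the string format that gensim's `print_topics` returns: each topic is a `(topic_id, 'weight*"word" + weight*"word" + ...')` pair. Below is a minimal sketch of that step on a hand-written sample. The weights and words are made up for illustration and are not real model output, and `parse_topic_words` is a hypothetical helper name.

```python
# Hypothetical sample in the format returned by LdaModel.print_topics.
sample_topics = [
    (0, '0.021*"gregor" + 0.015*"room" + 0.012*"door"'),
    (1, '0.018*"father" + 0.014*"sister" + 0.011*"mother"'),
]


def parse_topic_words(topics):
    """Turn print_topics-style tuples into one list of words per topic."""
    topic_lists = []
    for _topic_id, formatted in topics:
        terms = formatted.split("+")
        # Each term looks like 0.021*"gregor": keep the part after "*", unquoted.
        topic_lists.append(
            [term.split("*")[1].replace('"', "").strip() for term in terms]
        )
    return topic_lists


print(parse_topic_words(sample_topics))
```

If the string parsing feels brittle, gensim's `show_topics(..., formatted=False)` returns the words and weights as tuples, which may be a cleaner alternative.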


In [None]:
def topics_from_pdf(llm, file, num_topics, words_per_topic):
    """
    Asks an LLM to describe the topics extracted from a PDF document.

    This function takes the output of `get_topic_lists_from_pdf`, which is a
    list of topic-related words for each topic, builds a prompt from it, and
    returns the LLM's response as a numbered list with bulleted subthemes.

    Parameters:
        llm (LLM): An instance of the Large Language Model (LLM) for generating
        responses.
        file (str): The path to the PDF file for extracting topic-related words.
        num_topics (int): The number of topics to consider.
        words_per_topic (int): The number of words per topic to include.

    Returns:
        str: A response generated by the language model based on the provided
        topic words.
    """

    # Extract topics and convert them to string
    list_of_topicwords = get_topic_lists_from_pdf(file, num_topics,
                                                  words_per_topic)
    string_lda = ""
    # Use a name that does not shadow the built-in `list`
    for topic_words in list_of_topicwords:
        string_lda += str(topic_words) + "\n"

    # Create the template
    template_string = '''Describe the topic of each of the {num_topics}
        double-quote delimited lists in a simple sentence and also write down
        three possible different subthemes. The lists are the result of an
        algorithm for topic discovery.
        Do not provide an introduction or a conclusion, only describe the
        topics. Do not mention the word "topic" when describing the topics.
        Use the following template for the response.

        1: <<<(sentence describing the topic)>>>
        - <<<(Phrase describing the first subtheme)>>>
        - <<<(Phrase describing the second subtheme)>>>
        - <<<(Phrase describing the third subtheme)>>>

        2: <<<(sentence describing the topic)>>>
        - <<<(Phrase describing the first subtheme)>>>
        - <<<(Phrase describing the second subtheme)>>>
        - <<<(Phrase describing the third subtheme)>>>

        ...

        n: <<<(sentence describing the topic)>>>
        - <<<(Phrase describing the first subtheme)>>>
        - <<<(Phrase describing the second subtheme)>>>
        - <<<(Phrase describing the third subtheme)>>>

        Lists: """{string_lda}""" '''

    # LLM call
    prompt_template = ChatPromptTemplate.from_template(template_string)
    chain = LLMChain(llm=llm, prompt=prompt_template)
    response = chain.run({
        "string_lda" : string_lda,
        "num_topics" : num_topics
        })

    return response
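
It can help to inspect the prompt before spending API tokens on it. The sketch below rebuilds `string_lda` the same way `topics_from_pdf` does and fills a shortened template with plain `str.format`, used here as a stand-in for `ChatPromptTemplate`. The topic words and the shortened template text are illustrative assumptions.

```python
# Hypothetical topic word lists, standing in for get_topic_lists_from_pdf output.
sample_topic_words = [
    ["gregor", "room", "door"],
    ["father", "sister", "mother"],
]

# One Python list literal per line, as in topics_from_pdf.
string_lda = "".join(str(words) + "\n" for words in sample_topic_words)

# Shortened version of the notebook's template, with the same two placeholders.
short_template = (
    "Describe the topic of each of the {num_topics} lists in a simple sentence.\n"
    'Lists: """{string_lda}"""'
)

prompt_text = short_template.format(
    num_topics=len(sample_topic_words), string_lda=string_lda
)
print(prompt_text)
```

The model receives each topic as the printed form of a Python list, brackets and quotes included, which is why the template tells it that it is reading lists.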

## OpenAI API key

For this demo, we are going to use GPT-3.5 Turbo, which requires an OpenAI API key. See [How to get an OpenAI API key for ChatGPT](https://www.maisieai.com/help/how-to-get-an-openai-api-key-for-chatgpt) for instructions on how to get one.

In [None]:
openai_key = "sk-V..."
llm = OpenAI(openai_api_key=openai_key, max_tokens=-1)
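
Hard-coding the key in a cell makes it easy to leak when the notebook is shared. One possible alternative, sketched below, reads it from the `OPENAI_API_KEY` environment variable and falls back to a hidden prompt. The helper name `load_openai_key` is hypothetical and not part of the notebook.

```python
import os
from getpass import getpass


def load_openai_key(env=os.environ):
    """Return the key from OPENAI_API_KEY, or prompt for it if the variable is unset."""
    key = env.get("OPENAI_API_KEY")
    if not key:
        # getpass hides the typed key so it does not end up in the cell output.
        key = getpass("OpenAI API key: ")
    return key


# Usage in the notebook (not executed here):
# openai_key = load_openai_key()
# llm = OpenAI(openai_api_key=openai_key, max_tokens=-1)
```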

## Testing with documents

Now, let's try it on a public domain PDF document, The Metamorphosis by Franz Kafka (1915).

In [None]:
!gdown https://drive.google.com/uc?id=1mpXUmuLGzkVEqsTicQvBPcpPJW0aPqdL

Downloading...
From: https://drive.google.com/uc?id=1mpXUmuLGzkVEqsTicQvBPcpPJW0aPqdL
To: /content/the-metamorphosis.pdf
100% 427k/427k [00:00<00:00, 18.9MB/s]


In [None]:
file = "./the-metamorphosis.pdf"

num_topics = 6
words_per_topic = 30

summary = topics_from_pdf(llm, file, num_topics, words_per_topic)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
print(summary)



1: The transformation of Gregor and its impact on his family
- The physical changes undergone by Gregor
- The psychological effects on his family
- The impact of Gregor's transformation on his relationships

2: The living space of Gregor and his family
- The furniture in the room
- The activities taking place in the living space
- The symbolic meaning of the room

3: The reactions of Gregor's family to his transformation
- The different ways the family members react
- The emotions and thoughts experienced by the family
- The attempts of the family to cope with the changes

4: The efforts of Gregor to adapt to his new state
- Gregor's attempts to adjust to his new body
- The challenges he faces in his new form
- The support he receives from his family and society

5: The daily routine of Gregor and his family
- The mundane tasks Gregor performs
- The activities of the family members
- The changes in Gregor's life due to his transformation

6: The relationship between Gregor and his fa

Also, let's try with a technical book: The Foundations of Geometry by David Hilbert (1899).


In [None]:
!gdown https://drive.google.com/uc?id=1T_FeuGsoC08U_6Xb8Awt50CJXBUqji4D

file = "./Hilbert.pdf"
summary = topics_from_pdf(llm, file, num_topics, words_per_topic)
print(summary)

Downloading...
From: https://drive.google.com/uc?id=1T_FeuGsoC08U_6Xb8Awt50CJXBUqji4D
To: /content/Hilbert.pdf
100% 878k/878k [00:00<00:00, 26.8MB/s]


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!



1: Analyzing geometric shapes, lines, and points
- Point and line relationships
- Congruent shapes
- Geometric functions

2: Applying axioms to geometric shapes
- Parallel lines
- Triangles and polygons
- Algebraic equations

3: Calculating area and content
- Area of triangles and polygons
- Measurement theory
- Archimedes' theorem

4: Describing congruence of shapes
- Angle and side relationships
- Congruence of triangles
- Parallel lines and segments

5: Exploring the multiplication of points, lines, and numbers
- Laws of multiplication
- Point and line order
- Polygons and algebra

6: Analyzing complex numbers and functions
- Domain and range of a function
- Jordan curve theorem
- Reversible displacement


Feel free to experiment with the **number of topics** and the number of **words per topic** to find the combination that works best for your document.
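
One way to organize that experimentation is a small grid over both settings. The sketch below is only an illustration: `sweep_settings` is a hypothetical helper, and a stub replaces the real LLM call so the example runs without an API key.

```python
from itertools import product


def sweep_settings(summarize, topic_counts, word_counts):
    """Call summarize(num_topics, words_per_topic) for every combination."""
    results = {}
    for num_topics, words_per_topic in product(topic_counts, word_counts):
        results[(num_topics, words_per_topic)] = summarize(num_topics, words_per_topic)
    return results


# Stub in place of the real call, so the sketch runs without an API key.
def fake_summarize(num_topics, words_per_topic):
    return f"{num_topics} topics, {words_per_topic} words each"


for settings, text in sweep_settings(fake_summarize, [4, 6], [10, 30]).items():
    print(settings, "->", text)
```

In the notebook, the stub could be swapped for `lambda n, w: topics_from_pdf(llm, file, n, w)`. Each combination trains a new LDA model and makes one LLM call, so a small grid keeps the run time and API cost manageable.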

## Licence

GNU General Public License v2.0

## Author

[Antonio Jimenez](https://www.linkedin.com/in/antonio-jimnzc)