# Topic extraction using LLMs only

The traditional ML approach to topic extraction converts each message into a vector in some vector space and then clusters those vectors. The resulting "topics" are really just regions of that vector space.

This approach has several weaknesses. Even interpreting the resulting clusters is not trivial. Editing them after the fit is pretty much impossible, and so is starting from a human-defined list of topics that is automatically expanded when necessary.

Here we show a different way, using only LLM calls. We feed one message at a time to the topic processor, which either assigns it to one of the existing topics or, if none is a good fit, puts it aside. Once the number of messages put aside reaches a threshold, they are used to extract a new topic, which is added to the list. Topic hierarchies can also be generated by setting `max_depth` to a value greater than 1.
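The loop described above can be sketched in a few lines of plain Python. This is only an illustration of the control flow, not the `wise_topic` implementation: the names `greedy_assign`, `keyword_classify` and `first_word_topic` are hypothetical, the two callables are trivial keyword-based stand-ins for the LLM calls, and the sketch assumes that all set-aside messages are moved into the newly extracted topic.

```python
from typing import Callable, Dict, List, Optional, Tuple


def greedy_assign(
    messages: List[str],
    initial_topics: List[str],
    classify: Callable[[str, List[str]], Optional[str]],
    propose_topic: Callable[[List[str]], str],
    max_unclassified: int = 2,
) -> Tuple[Dict[str, List[str]], List[str]]:
    """Assign messages to topics one at a time, growing the topic list on demand."""
    topics: Dict[str, List[str]] = {name: [] for name in initial_topics}
    unclassified: List[str] = []
    for message in messages:
        topic = classify(message, list(topics))
        if topic in topics:
            topics[topic].append(message)
            continue
        # No existing topic fits: put the message aside.
        unclassified.append(message)
        if len(unclassified) >= max_unclassified:
            # Threshold reached: derive a new topic from the set-aside messages.
            # Assumption: they all move into that new topic.
            new_topic = propose_topic(unclassified)
            topics.setdefault(new_topic, []).extend(unclassified)
            unclassified = []
    return topics, unclassified


def keyword_classify(message: str, topics: List[str]) -> Optional[str]:
    """Stand-in for the classifier LLM: match a topic name inside the message."""
    lowered = message.lower()
    for topic in topics:
        if topic.lower() in lowered:
            return topic
    return None


def first_word_topic(messages: List[str]) -> str:
    """Stand-in for the topic LLM: use the first word of the first message."""
    return messages[0].split()[0]


topics, leftover = greedy_assign(
    ["Winter is cold", "Summer is hot", "Fish swim in schools",
     "Fish live in water", "Fish are tasty"],
    initial_topics=["Winter", "Summer"],
    classify=keyword_classify,
    propose_topic=first_word_topic,
    max_unclassified=2,
)
```

Passing the classifier and the topic extractor in as callables keeps the loop itself independent of any particular LLM client, so the stand-ins can be swapped for real model calls.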


In [1]:
from pprint import pprint

from langchain_openai import ChatOpenAI

try:
    import wise_topic
except ImportError:
    # wise_topic is not installed: make the parent directory importable instead
    import os
    import sys

    sys.path.append(os.path.realpath(".."))


from wise_topic import greedy_topic_tree, tree_summary

docs = [
    "The summer sun blazed high in the sky, bringing warmth to the sandy beaches.",
    "During summer, the days are long and the nights are warm and inviting.",
    "Ice cream sales soar as people seek relief from the summer heat.",
    "Families often choose summer for vacations to take advantage of the sunny weather.",
    "Many festivals and outdoor concerts are scheduled in the summer months.",
    "Winter brings the joy of snowfall and the excitement of skiing.",
    "The cold winter nights are perfect for sipping hot chocolate by the fire.",
    "Winter storms can transform the landscape into a snowy wonderland.",
    "Heating bills tend to rise as winter's chill sets in.",
    "Many animals hibernate or migrate to cope with the harsh winter conditions.",
    "Fish swim in schools to protect themselves from predators.",
    "Salmon migrate upstream during spawning season, a remarkable journey.",
    "Tropical fish add vibrant color and life to coral reefs.",
    "Overfishing threatens many species of fish with extinction.",
    "Fish have a diverse range of habitats from deep oceans to shallow streams.",
]


In [2]:
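# LLM used for topic extraction; JSON mode constrains its replies to valid JSON
topic_llm = ChatOpenAI(
    model="gpt-4-turbo",
    temperature=0,
    model_kwargs={"response_format": {"type": "json_object"}},
)

# LLM used to assign each message to one of the existing topics
classifier_llm = ChatOpenAI(model="gpt-4-turbo", temperature=0)

topic_tree = greedy_topic_tree(
    docs,
    initial_topics=["Winter", "Summer"],  # human-defined starting topics
    topic_llm=topic_llm,
    classifier_llm=classifier_llm,
    max_depth=1,  # flat topic list, no hierarchy
    num_topics_per_update=1,
    max_unclassified_messages=2,  # set-aside messages needed to trigger a new topic
)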

tree_summary(topic_tree)

{'Summer': 6, 'Winter': 5, 'threats to diverse fish species': 4}
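For a flat tree (`max_depth=1`), per-topic counts like these can be computed by hand. The sketch below assumes each topic maps to a dict holding a `'messages'` list; `count_messages_per_topic` is a hypothetical helper written for illustration and is not the library's `tree_summary`.

```python
from typing import Dict, List


def count_messages_per_topic(tree: Dict[str, Dict[str, List[str]]]) -> Dict[str, int]:
    """Count the messages under each top-level topic of a flat topic tree."""
    return {topic: len(node["messages"]) for topic, node in tree.items()}


# Toy tree with the assumed {"topic": {"messages": [...]}} layout.
example_tree = {
    "Winter": {"messages": ["Snow fell all night.", "The lake froze over."]},
    "Summer": {"messages": ["The beach was crowded."]},
}
counts = count_messages_per_topic(example_tree)
```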

In [3]:
display(topic_tree)

{'Summer': {'messages': ['During summer, the days are long and the nights are warm and inviting.',
   'Tropical fish add vibrant color and life to coral reefs.',
   'The summer sun blazed high in the sky, bringing warmth to the sandy beaches.',
   'Ice cream sales soar as people seek relief from the summer heat.',
   'Many festivals and outdoor concerts are scheduled in the summer months.',
   'Families often choose summer for vacations to take advantage of the sunny weather.']},
 'Winter': {'messages': ['Winter storms can transform the landscape into a snowy wonderland.',
   'Winter brings the joy of snowfall and the excitement of skiing.',
   'The cold winter nights are perfect for sipping hot chocolate by the fire.',
   'Many animals hibernate or migrate to cope with the harsh winter conditions.',
   "Heating bills tend to rise as winter's chill sets in."]},
 'threats to diverse fish species': {'messages': ['Fish swim in schools to protect themselves from predators.',
   'Fish have 