In [1]:
#| default_exp summarise

# Summarisation

> Module for creating chunks of the transcript, for both paragraphs and topics.

# Methodology

## Aims

Above all, each topic needs a title so that the transcript can be navigated clearly.

A secondary aim is to provide a summary for each topic as well, so that readers can quickly understand the episode's content.

## Approach

Ultimately this is a summarisation problem: the summary is delivered as a title, and possibly as a paragraph description as well.

LLMs are the obvious approach here. They are strong both at understanding the text and at producing output in a requested format, and they have been trained extensively on summarisation tasks, so they are more than capable of the task.

## Considerations

### Prompting

How will we feed the model information and instruct it to produce an appropriate output? My first impression is that delimiters are the best way to do this: both the transcript text and the outputs can be delimited with section titles. Before the OpenAI API was released, I remember reading in their documentation that they recommended separating variables with either triple backquotes or triple hashtags.

#### Inputting Transcript

We need a way for the model to clearly recognise the text that it needs to summarise. This can be done by pasting the text under a section heading.

#### Output

We require the output to be parseable (as a backup, an LLM could be used to parse out the appropriate sections). The best way to keep this consistent is to provide examples and a set format for the output, again using section headers.
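
As a minimal sketch of the delimiter idea, the helper below places the transcript under its own triple-hashtag header and ends the prompt on the header the model is expected to complete. The function name `build_delimited_prompt` and the instruction wording are illustrative assumptions, not part of the exported module.

```python
def build_delimited_prompt(instruction: str, transcript_text: str) -> str:
    """Wrap the input under a header and finish on the header the model should fill in."""
    return (
        f"{instruction}\n\n"
        "###SECTION###\n"           # input delimiter
        f"{transcript_text}\n\n"
        "###SECTION TITLE###\n"     # output delimiter: generation continues from here
    )


example_prompt = build_delimited_prompt(
    "Give the section below a title followed by a summary.",
    "SPEAKER_00: example text\nSPEAKER_01: example text",
)
print(example_prompt)
```

Ending the prompt on an output header means the completion should begin directly with the title, which keeps later parsing simple.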

### Efficiency

While LLMs will provide excellent results, they are expensive to run. Is there any way we can make this cheaper or easier to run?

#### Combining Titles & Summary

Ideally, if we can get the LLM to output reliably, it would be most efficient to have it produce a title and a summary in one prompt. Alternatively, we could use two prompts in sequence, where the previous context is still held in the model's attention and a simple follow-up request to summarise it would be sufficient.

### Models

Here we need to consider model size, as the choice is restricted by my GPU's memory (24GB).

The introduction of Llama 2 has made this much simpler, as it comes with an already fine-tuned chat model. The choice of model then comes down to two factors: 
- parameter count
- floating point precision

A higher parameter count allows for more intricate understanding and richer output, whereas higher precision allows for greater stability. It's best to get as much use as possible out of the GPU in this situation, so the choice here is between: 

model | precision | memory
--- | --- | ---
llama-chat-7B | FP16 | 7*2=14GB
llama-chat-13B | 8-bit | 13*1=13GB

Since the task at hand isn't that complicated (it doesn't require much logic or a high degree of accuracy), it might be best to use the 7B model at the more stable FP16. This depends, though, on how stable quantized models are. It should also be noted that this 'stability' is probably more important during model training. Since I'm only using the model for inference, the weights are already known, and the loss of precision in them shouldn't affect the output too drastically. The research on this doesn't seem conclusive yet, so it's worth investigating empirically.
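
The memory column in the table is just parameter count multiplied by bytes per parameter. A small sketch of that arithmetic follows; the function name is hypothetical, and the estimate covers weights only, so activations and the attention cache would need extra headroom on top.

```python
def estimate_weight_memory_gb(params_billions: float, bits_per_param: int) -> float:
    """Rough weight-only footprint: billions of parameters x bytes per parameter."""
    bytes_per_param = bits_per_param / 8
    return params_billions * bytes_per_param


GPU_MEMORY_GB = 24  # the GPU size mentioned above

candidates = {
    "llama-chat-7B at 16-bit": (7, 16),
    "llama-chat-13B at 8-bit": (13, 8),
}
for name, (params, bits) in candidates.items():
    needed = estimate_weight_memory_gb(params, bits)
    print(f"{name}: ~{needed:.0f} GB of weights, ~{GPU_MEMORY_GB - needed:.0f} GB headroom")
```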


### Full Summary

If there's a summary of all the topics, we could use the 'Map Reduce' method on this to obtain the full summary. This is simply combining and summarising all of the individual topic summaries.
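
The shape of that map-reduce step can be sketched without a model by passing the summariser in as a callable. Everything here is illustrative: `map_reduce_summary` is a hypothetical helper, and `truncate_words` is a toy stand-in for an LLM call.

```python
from typing import Callable, List


def map_reduce_summary(texts: List[str], summarise: Callable[[str], str]) -> str:
    """Summarise each text (map), then summarise the joined summaries (reduce)."""
    partial_summaries = [summarise(text) for text in texts]
    return summarise("\n\n".join(partial_summaries))


def truncate_words(text: str, max_words: int = 4) -> str:
    """Toy stand-in for an LLM summariser: keep only the first few words."""
    return " ".join(text.split()[:max_words])


topic_texts = ["alpha beta gamma delta epsilon", "one two three four five"]
whole_summary = map_reduce_summary(topic_texts, truncate_words)
print(whole_summary)
```

In the notebook the map step has already happened by the time the full summary is needed, since each topic carries its own summary, so only the reduce step requires another model call.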

# Code

In [2]:
#| export
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
from transcriber.group import group_paragraphs_text
import torch
import pandas as pd
import io



In [3]:
import json

In [4]:
with open("../data/podcast/people_i_admire_104_joy_of_maths/tmp/transcript-grouped.json") as f:
    transcript = json.load(f)

In [5]:
#| export
def load_llm_pipeline(model="meta-llama/Llama-2-13b-chat-hf", cache_dir=None, **pipeline_args): 
    
    repo_branch = "main"

    tokenizer = AutoTokenizer.from_pretrained(model, revision=repo_branch, cache_dir=cache_dir)
    model = AutoModelForCausalLM.from_pretrained(model, revision=repo_branch, cache_dir=cache_dir, load_in_8bit=True, trust_remote_code=True, device_map='auto')

    pipe = pipeline(
        model=model, tokenizer=tokenizer,
        return_full_text=False,  
        task='text-generation',
        # -- model hyperparameters --
        temperature=0.1,  # sampling temperature; lower values make outputs less random
        top_p=0.15,  # sample from the smallest set of top tokens whose probabilities add up to this value
        top_k=0,  # 0 disables top-k filtering, so sampling relies on top_p
        max_new_tokens=400, 
        repetition_penalty=1.2,  
        **pipeline_args
    )

    return pipe

In [6]:
llm = load_llm_pipeline(cache_dir = "/home/steph/.cache/huggingface/os_models")

Loading checkpoint shards: 100%|██████████| 3/3 [00:02<00:00,  1.12it/s]


## Summarising Topics

### Formatting Text

Would it be best to format the topics as a speaker segmented transcript, or simply as plain text? I think this depends on whether the model would get confused or not.

In [7]:
#| export
def get_base_prompt(topic_text):
    return f"""

Your answer should be displayed in the following format:

###SECTION###
SPEAKER_00: example text
SPEAKER_01: example text

###SECTION TITLE###
example concise title

###SECTION SUMMARY###
example concise summary paragraph

Now give real titles and summaries for the below:

###SECTION###
{topic_text}

###SECTION TITLE###
"""

def get_intro_prompt(topic_text):
    return "For the following podcast introduction section, give it a title followed by a summary. Both the title and summary should be separated into their own section under the headers delimited by triple hashtags." + get_base_prompt(topic_text)

def get_standard_prompt(topic_text):
    return "You are an AI text summariser who for legal reasons absolutely cannot mention any personal names of the people in the text.\n\nFor the following podcast section, give it a title followed by a summary. Both the title and summary should be separated into their own section under the headers delimited by triple hashtags (###)." + get_base_prompt(topic_text)

In [8]:
#| export
def format_speech_text(topic): return '\n'.join([speech['label'] + ": " + speech['text'] for speech in topic['groups']])
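
To make the two candidate formats concrete, the sketch below renders the same toy topic both ways. The topic contents are invented for illustration; only the `groups`, `label` and `text` keys are taken from `format_speech_text` above.

```python
# Hypothetical topic with the same keys that format_speech_text reads
topic = {
    "groups": [
        {"label": "SPEAKER_00", "text": "Welcome to the show."},
        {"label": "SPEAKER_01", "text": "Thanks for having me."},
    ]
}

# Speaker-segmented: one labelled line per speech group
speaker_segmented = "\n".join(f"{g['label']}: {g['text']}" for g in topic["groups"])

# Plain text: the same speech with the speaker labels dropped
plain_text = " ".join(g["text"] for g in topic["groups"])

print(speaker_segmented)
print(plain_text)
```

The segmented form costs a few extra tokens per turn but preserves who said what, which the speaker identification prompt later in this notebook relies on.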

In [9]:
#| export
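def format_summary(summary):
    # Normalise single quotes to double quotes (note: this also converts apostrophes)
    summary = summary.replace("'",'"')
    # Strip one pair of wrapping quotes when there are no other quotes inside
    if summary.startswith('"') and summary.endswith('"') and '"' not in summary[1:-1]:
        summary = summary[1:-1]
    return summary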

In [10]:
#| export
def parse_summary(llm_summary):
    split = llm_summary.split("###")
    if len(split) == 3:
        return {
            'title': format_summary(split[0].strip()),
            'summary': format_summary(split[2].strip()),
            'summary_unparsed': llm_summary
        }
    else:
        return {'title': "", 
                'summary': "", 
                'summary_unparsed': llm_summary
            }
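
As an alternative sketch, the same parse can be anchored on the exact summary header with a regular expression, which avoids depending on the number of `###` fragments. This assumes the completion starts right after the `###SECTION TITLE###` header that ends the prompt; `parse_title_and_summary` is a hypothetical name, not part of the exported module.

```python
import re

SUMMARY_PATTERN = re.compile(
    r"\s*(?P<title>.*?)\s*###SECTION SUMMARY###\s*(?P<summary>.*)",
    flags=re.DOTALL,
)


def parse_title_and_summary(llm_output: str) -> dict:
    """Split a completion into title and summary around the summary header."""
    match = SUMMARY_PATTERN.match(llm_output)
    if match is None:
        # Keep the raw text so a failed parse can still be inspected later
        return {"title": "", "summary": "", "summary_unparsed": llm_output}
    return {
        "title": match["title"].strip(),
        "summary": match["summary"].strip(),
        "summary_unparsed": llm_output,
    }


parsed = parse_title_and_summary("A Title\n\n###SECTION SUMMARY###\nA summary.")
print(parsed["title"], "|", parsed["summary"])
```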

In [11]:
#| export
def get_topic_summary_prompt(topic):
    speech_text = format_speech_text(topic)
    if topic['label'] == 0:
        prompt = get_intro_prompt(speech_text)
    else:
        prompt = get_standard_prompt(speech_text)
    return prompt

In [12]:
#| export
def summarise_topics(transcript, llm):
    for i, topic in enumerate(transcript):
        prompt = get_topic_summary_prompt(topic)
        summary = parse_summary(
            llm(prompt)[0]['generated_text']
        )
        topic.update(summary)
        torch.cuda.empty_cache()
    return transcript

In [13]:
summarised_topics = summarise_topics(transcript, llm)

In [14]:
dictfilt = lambda x, y: dict([ (i,x[i]) for i in x if i in set(y) ])

[dictfilt(topic, ('title','summary','summary_unparsed')) for topic in summarised_topics]

[{'title': 'Once Upon a Prime: Exploring the Magical Overlap Between Literature and Mathematics',
  'summary': 'In this episode, host Steve Levitt speaks with author and professor Sarah Hart about her latest book "Once Upon a Prime," which explores the connections between literature and mathematics. They discuss how mathematical concepts such as patterns, structures, and symmetries are present in various forms of creative expression, including literature, music, and poetry. The conversation covers topics such as the use of prime numbers in haiku poetry, the role of vibrations in creating pleasing musical compositions, and the ways in which mathematicians can appreciate the beauty and elegance of literature.',
  'summary_unparsed': 'Once Upon a Prime: Exploring the Magical Overlap Between Literature and Mathematics\n\n###SECTION SUMMARY###\nIn this episode, host Steve Levitt speaks with author and professor Sarah Hart about her latest book "Once Upon a Prime," which explores the connect

## Summarising Full Transcript

In [15]:
#| export
def get_whole_summary_prompt(summarised_topics):
    summaries = '\n\n'.join([topic['summary'] for topic in summarised_topics])
    prompt = f"""Written below are summaries of every topic of a podcast episode. Please write a detailed summary for the whole podcast episode.
    
###SECTION SUMMARIES###
{summaries}

###WHOLE SUMMARY###
"""
    return prompt

In [16]:
#| export
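def summarise_transcript(transcript, llm):
    prompt = get_whole_summary_prompt(transcript)
    # Report the prompt length in tokens to check it fits in the context window
    print(len(llm.tokenizer.tokenize(prompt)))
    # Override the pipeline's default max_new_tokens for this call only;
    # setting it on llm.model.config does not change the pipeline's generation arguments
    summary = llm(prompt, max_new_tokens=2048)[0]['generated_text']
    return summary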

In [17]:
summarised_transcript = summarise_transcript(summarised_topics, llm)
summarised_transcript

925




'In this episode of Freakonomics Radio, host Steve Levitt is joined by author and professor Sarah Hart to explore the connections between literature and mathematics. They discuss how mathematical concepts such as patterns, structures, and symmetries are present in various forms of creative expression, including literature, music, and poetry. The conversation covers topics such as the use of prime numbers in haiku poetry, the role of vibrations in creating pleasing musical compositions, and the ways in which mathematicians can appreciate the beauty and elegance of literature. Additionally, they talk about the potential benefits of incorporating math appreciation courses into high school curriculums, the importance of making math accessible and engaging, and the limitations of the traditional scientific method in economics. Throughout the episode, the speakers emphasize the importance of finding the right balance between freedom and constraint in order to create something truly special, 

## Speaker Identification

We can use these LLMs for a number of other tasks which require nuanced understanding.

### Labelling Podcast Roles

LLMs could help label each speaker with their name and their role in the podcast. Most podcasts have a host and a guest, and there can sometimes be more than one of either.

In [18]:
#| export 
def get_roles_prompt(topic): 
    return f"""Below is a transcript of an introduction section from a podcast episode. For each speaker, please identify their name and their role (host, co-host, guest). Note that usually the first name mentioned is the guest who is being introduced by the host speaking. Make sure that you write the host as the first entry in the table, and don't get mixed up between the naming.

For each speaker write your answer in a table format like the example below.

###PODCAST TRANSCRIPT###
SPEAKER_00: Hello I'm here with my guest William Shakespear. My name is Joe Rogan welcome to the podcast.
SPEAKER_01: Thanks Joe, pleasure to be on.

| SPEAKER NUMBER | NAME | ROLE |
| --- | --- | --- |
| SPEAKER_00 | Joe Rogan | Host |
| SPEAKER_01 | William Shakespear | Guest |

Now do it for the below transcript:

###PODCAST TRANSCRIPT###
{format_speech_text(topic)}

| SPEAKER NUMBER | NAME | ROLE |
| --- | --- | --- |
|"""


In [19]:
prompt = get_roles_prompt(transcript[0])
roles_output = llm(prompt)[0]['generated_text']
print(roles_output)



 SPEAKER_00 | Steve Levitt | Host |
| SPEAKER_01 | Sarah Hart | Guest |


In [20]:
print("\n".join(prompt.splitlines()[-3:]) + roles_output)

| SPEAKER NUMBER | NAME | ROLE |
| --- | --- | --- |
| SPEAKER_00 | Steve Levitt | Host |
| SPEAKER_01 | Sarah Hart | Guest |


In [21]:
#| export
def markdown_to_dict(markdown_table):
    df = pd.read_table(io.StringIO(markdown_table), sep='|')
    df = df.applymap(lambda x: x.strip() if isinstance(x, str) else x)
    df = df.dropna(how='all')
    df = df.dropna(axis=1, how='all')
    df = df.iloc[1:]
    df.columns = df.columns.str.strip()
    dct_list = df.to_dict('records')
    dct = {d['SPEAKER NUMBER']: {k: v for k, v in d.items() if k != 'SPEAKER NUMBER'} for d in dct_list}
    return dct

In [22]:
#| export
def parse_roles(roles_output, prompt):
    markdown_table = "\n" + "\n".join(prompt.splitlines()[-3:]) + roles_output + "\n"
    dct = markdown_to_dict(markdown_table)
    dct = {k: {k2.lower(): v2 for k2, v2 in v.items()} for k, v in dct.items()}
    return dct
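
For comparison, a small table like this can also be parsed with plain string handling instead of pandas. The sketch below is an illustrative alternative rather than the exported implementation; the function name and the example names in the table are made up.

```python
def parse_markdown_table(table: str) -> list:
    """Turn a pipe-delimited markdown table into a list of row dictionaries."""
    rows = []
    for line in table.strip().splitlines():
        # Drop the outer pipes, then split the remaining text into cells
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        rows.append(cells)
    header, body = rows[0], rows[2:]  # rows[1] is the --- separator row
    return [dict(zip(header, row)) for row in body]


table = """| SPEAKER NUMBER | NAME | ROLE |
| --- | --- | --- |
| SPEAKER_00 | Example Host | Host |
| SPEAKER_01 | Example Guest | Guest |"""

records = parse_markdown_table(table)
roles = {
    record["SPEAKER NUMBER"]: {"name": record["NAME"], "role": record["ROLE"]}
    for record in records
}
print(roles)
```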

In [23]:
parsed_roles = parse_roles(roles_output, prompt)

In [24]:
#| export
def identify_speakers(transcript, llm):
    prompt = get_roles_prompt(transcript[0])
    llm_output = llm(prompt)[0]['generated_text']
    speaker_ids = parse_roles(llm_output, prompt)
    return speaker_ids

In [25]:
speaker_ids = identify_speakers(summarised_topics, llm)
speaker_ids



{'SPEAKER_00': {'name': 'Steve Levitt', 'role': 'Host'},
 'SPEAKER_01': {'name': 'Sarah Hart', 'role': 'Guest'}}

In [27]:
with open("../data/podcast/people_i_admire_104_joy_of_maths/transcript.json", 'w') as f:
    json.dump({
        'summary': summarised_transcript, 'topics': summarised_topics, 'speaker_ids': speaker_ids
    }, f, ensure_ascii=False, indent=2)

## Full Function

In [41]:
#| export 
def summarise(transcript, id_speakers=True):
    llm = load_llm_pipeline(cache_dir="/home/steph/.cache/huggingface/os_models")
    summarised_topics = summarise_topics(transcript, llm)
    summarised_transcript = summarise_transcript(summarised_topics, llm)
    output = {
        'summary': summarised_transcript, 
        'topics': summarised_topics
    }
    if id_speakers: 
        speaker_ids = identify_speakers(transcript, llm)
        output.update({
            'speaker_ids': speaker_ids
        })
    return output

In [42]:
#| hide
from nbdev import nbdev_export
nbdev_export()

---

# Appendix

## Map Reduce

UPDATE: This section is no longer required after the Llama 2 release, which allows for a context length of 4096. This is large enough for each topic section; all that is required is to keep the topics below the context length.

Topics can end up being longer than the maximum context length of these models (2048 tokens). The options are either reducing the size of the topics (reasonable), or splitting them up and using a split summarisation method.

Langchain recommends a pattern which involves splitting the text into parts, summarising each part, and then using these outputs for the final summarisation. Let's try that and see how well it works.

In [25]:
print([len(llm.tokenizer.tokenize(topic['text'])) for topic in transcript])

[3031, 2928, 2931, 3180, 2939, 2873, 2491, 2228, 2557, 2788, 3282, 2279, 3156, 3631, 3458]


In [27]:
import numpy as np

In [28]:
token_word_ratio = [(len(llm.tokenizer.tokenize(topic['text']))/len(topic['text'].split(' '))) for topic in transcript ]
np.mean(token_word_ratio), np.max(token_word_ratio)

(1.3364756104849838, 1.3854103343465045)

So here we want to make sure that each topic stays under 4096/1.4≈2900 words.
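
That word budget can be written as a one-line calculation. In this sketch the helper name and the optional `reserved_tokens` argument (space set aside for the prompt template and the generated output) are assumptions added for illustration.

```python
import math


def max_words_for_context(context_tokens: int, tokens_per_word: float,
                          reserved_tokens: int = 0) -> int:
    """Largest word count expected to fit once reserved tokens are set aside."""
    return math.floor((context_tokens - reserved_tokens) / tokens_per_word)


# Using the worst-case ratio measured above, rounded up to 1.4 tokens per word
full_budget = max_words_for_context(4096, 1.4)
# Hypothetical reservation of 400 tokens for the generated output
with_reserve = max_words_for_context(4096, 1.4, reserved_tokens=400)
print(full_budget, with_reserve)
```

Rounding the ratio up to 1.4 keeps the estimate conservative, but the real limit is tighter than the raw figure because the prompt template and the generated tokens share the same window.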

In [12]:
#| export
def get_num_tokens(text, tokenizer): return len(tokenizer.tokenize(text))

In [13]:
topic_text = ' '.join(topic['text'] for topic in transcript[:3])
get_num_tokens(topic_text, tokenizer)

Token indices sequence length is longer than the specified maximum sequence length for this model (6041 > 2048). Running this sequence through the model will result in indexing errors


6041

Here there's an opportunity to split on specific separators. We can use paragraphs, or we could in fact split by topic. It probably just depends on whichever is fastest.
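
As a rough illustration of splitting on paragraph separators, the sketch below greedily packs whole paragraphs into chunks under a character budget. It is a simplified stand-in with a hypothetical name, not the `chunk_text` function defined further down, and it measures length in characters rather than tokens.

```python
from typing import List


def greedy_chunks(paragraphs: List[str], max_chars: int) -> List[str]:
    """Pack whole paragraphs into chunks, starting a new chunk when the budget would be exceeded."""
    chunks, current, current_len = [], [], 0
    for paragraph in paragraphs:
        if current and current_len + len(paragraph) > max_chars:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(paragraph)
        current_len += len(paragraph)
    if current:
        chunks.append("\n\n".join(current))
    return chunks


paragraphs = ["a" * 30, "b" * 30, "c" * 30]
chunks = greedy_chunks(paragraphs, max_chars=70)
print([len(chunk) for chunk in chunks])
```

A paragraph longer than the budget still ends up in a chunk of its own, so a token-level check is still needed before each chunk is sent to the model.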

In [14]:
topic_text = split_paragraphs_text(topic_text)

In [34]:
#| export
def chunk_text(topic_text, chunk_size=4000):
    paragraphs = split_paragraphs_text(topic_text).split("\n\n")
    split_idxs = [0]
    current_length = 0
    for i, paragraph in enumerate(paragraphs):
        current_length += len(paragraph)
        if current_length > chunk_size:
            split_idxs.append(i+1)
            current_length = 0
    if len(split_idxs) > 1: split_idxs.pop()
    chunks = []
    for i, j in zip(split_idxs, split_idxs[1:]+[None]):
        if i > 0: 
            if j:
                chunks.append('\n\n'.join(paragraphs[i-1:j+1]))
            else:
                chunks.append('\n\n'.join(paragraphs[i-1:j]))
        else:
            if j:
                chunks.append('\n\n'.join(paragraphs[i:j+1]))
            else:
                chunks.append('\n\n'.join(paragraphs[i:j]))
    return chunks 

In [35]:
text_chunks = chunk_text(transcript[1]['text'])
print([get_num_tokens(text_chunk, tokenizer) for text_chunk in text_chunks])
text_chunks

[1328, 1275]


["Now, one of the most basic assumptions of economics is that constraints are bad, at least in classical economics. People are always better off, or at least no worse off, when you relax constraints.\n\nSo more money is better than less money, and more hours in the day would be better than fewer hours in the day. Do you think literature is an exception to that economic logic, that constraints really make it better? So I think there's some sweet spot, right? If you constrain everything, then you're trapped in a really rigid box. There's no room for you to be creative. With absolutely no constraints at all, you're out in the wilderness. But with a few simple constraints, like you might have in a poetic form, that doesn't stop you being creative. It spurs you to creativity. And so there's this great quote from the Irish poet, Paul Muldoon, where he said that poetic form is a straitjacket in the sense that straitjackets were a straitjacket for Houdini.\n\nThey give you something to push of

In [36]:
#| export
def get_chunk_summary_prompt(chunk_text):
    return f"""Write a concise summary of the following:

\"{chunk_text}\"

CONCISE SUMMARY: """

In [37]:
#| export
def get_chain_title_prompt(topic_text):
    return f"""The following is a series of summaries of a text chapter. Generate a title from these summaries.

It should be displayed in the below format:

###CHAPTER SUMMARIES###
summary of chapter describing why ai is good

another summary of chapter describing why ai is good

###CHAPTER TITLE###
why ai is good

Now try below:

###CHAPTER SUMMARIES###
{topic_text}

###CHAPTER TITLE###

"""

In [38]:
#| export
def get_topic_title_chain(text, pipe):
    text_chunks = chunk_text(text)
    chunk_summaries = []
    for text_chunk in text_chunks:
        summary = pipe(get_chunk_summary_prompt(text_chunk))[0]['generated_text'].strip()
        chunk_summaries.append(summary)
    title = pipe(get_chain_title_prompt("\n\n".join(chunk_summaries)))[0]['generated_text']
    return title

In [39]:
# title = get_topic_title_chain(text_chunks, pipe)
# title

In [40]:
context_length = 2048
2048 - get_num_tokens(get_chain_title_prompt(""), tokenizer)

1931

In [41]:
#| export
def title_topics(topics, model, tokenizer):
    pipe = pipeline(
        model=model, tokenizer=tokenizer,
        return_full_text=False,  
        task='text-generation',
        # we pass model parameters here too
        # stopping_criteria=stopping_criteria
        temperature=0.1,  # 'randomness' of outputs, 0.0 is the min and 1.0 the max
        top_p=0.15,  # select from top tokens whose probability add up to 15%
        top_k=0,  # select from top 0 tokens (because zero, relies on top_p)
        max_new_tokens=480,  # max number of tokens to generate in the output
        repetition_penalty=1.2  # without this output begins repeating
    )
    for topic in topics:
        topic_text = topic['text']
        if get_num_tokens(topic_text, tokenizer) < 1900: 
            topic['label'] = get_topic_title(topic_text, pipe)
        else:
            print("Topic is larger than the model's context window, running summary chain", topic['label'])
            topic['label'] = get_topic_title_chain(topic_text, pipe)
    return topics

In [42]:
transcript_titled = title_topics(transcript, model, tokenizer)
print([ topic['label'] for topic in transcript_titled ])


Topic is larger than the model's context window, running summary chain 1
Topic is larger than the model's context window, running summary chain 2
Topic is larger than the model's context window, running summary chain 4




['Exploring the Connections Between Literature and Mathematics', 'Constraints in Literature', '* Interactive Storytelling Through Graph Theory', 'The Power of Math in Fiction', 'Embracing Uncertainty in Mathematics Education', 'The Limits of Science in Economics']
