In [1]:
#| default_exp summarise

# Summarisation

> Module for creating chunks of the transcript, for both paragraphs and topics.

# Methodology

## Aims

Above all, what is required is a title for each of our topics, for clear navigation of the transcript.

A secondary aim would be to also provide summaries for these, to allow readers to quickly understand the episode's content.

## Approach

Ultimately this is a summarisation problem, providing the summary as a title and possibly a paragraph description as well.

With the introduction of LLMs, there is a clear approach to doing this. They are exceptionally capable both at understanding the text and at outputting the contents as desired. These models have been extensively trained on summarisation tasks, so they are more than capable of the task.

## Considerations

### Prompting

How will we feed the model information, and instruct it to produce an appropriate output? My first impression is that the best way to do this is by using delimiters. Both the transcript text and the outputs can be delimited through section titles. Before the OpenAI API was released I remember reading in their documents that they recommended separating variables with either triple backquotes or triple hashtags.

#### Inputting Transcript

We need a way for the model to clearly recognise the text that it needs to summarise. This can be done by pasting the text under a section heading.

#### Output

We require the output to be parseable (backup: use LLMs to parse the appropriate sections). The best way to keep this consistent is to provide examples, and provide a set format for the output to be written in again using section headers.

### Efficiency

While LLMs will provide excellent results, they are considerably expensive to run. Is there any way we can make this cheaper/easier to run?

#### Combining Titles & Summary

Ideally, if we can get the LLM to output reliably, it'd be most efficient to have it output a title and a summary in one prompt. Alternatively, we could use a two-turn prompt, where the previous context is still held in the model's attention and a simple follow-up request to summarise it would be sufficient.

### Models

Here we are required to consider the model size, as the choice is restricted by my GPU's memory (24GB).

The introduction of Llama 2 has made this much simpler, as it comes with an already fine-tuned chat model. The choice of models then comes down to two factors: 
- parameter count
- floating point precision

A higher number of parameters would allow for more intricate understanding and output possibility, whereas floating point precision would allow for greater stability. It's best to get as much use as possible out of the GPU in this situation, so the choice here is between: 

model | precision | memory (weights only)
--- | --- | ---
llama-chat-7B | FP16 (2 bytes per parameter) | 7 × 2 = 14GB
llama-chat-13B | 8-bit (1 byte per parameter) | 13 × 1 = 13GB

Since the task at hand isn't that complicated (it doesn't require lots of logic, and doesn't require large degrees of accuracy), it might be best to use the 7B model at a more stable FP16. This, though, depends on how stable quantized models are. It should also be noted that this 'stability' is probably more important during model training. Since I'm just using it for inference, the model weights are already known and the decrease in their accuracy shouldn't affect the output too drastically. The research on this doesn't seem too conclusive as of yet, so it's worth investigating empirically.
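The memory column in the table above follows from a simple rule of thumb: parameter count times bytes per parameter. A minimal sketch of that arithmetic is below. `weight_memory_gb` is a hypothetical helper, not part of this module; it assumes 1GB = 10^9 bytes and counts the weights only, so activations and the KV cache need extra headroom on top.

```python
def weight_memory_gb(n_params_billion, bits_per_param):
    """Approximate memory for the model weights alone, in GB (1GB = 1e9 bytes)."""
    bytes_per_param = bits_per_param / 8
    return n_params_billion * bytes_per_param

# The two candidates considered for a 24GB GPU
options = {
    "7B @ 16-bit": weight_memory_gb(7, 16),
    "13B @ 8-bit": weight_memory_gb(13, 8),
}
for name, gb in options.items():
    print(f"{name}: {gb:.0f}GB of weights, {24 - gb:.0f}GB left for everything else")
```

Both options leave a similar amount of headroom, which is why the choice comes down to parameter count versus precision rather than to what fits.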


### Full Summary

If there's a summary of all the topics, we could use the 'Map Reduce' method on this to obtain the full summary. This is simply combining and summarising all of the individual topic summaries.

# Code

In [1]:
#| export
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
from transcriber.group import group_paragraphs_text
import torch



In [2]:
import json

In [3]:
with open("../data/podcast/practical_ai_236_tech_stack/tmp/transcript-grouped.json") as f:
    transcript = json.load(f)

In [4]:
#| export
def load_llm_pipeline(model="meta-llama/Llama-2-13b-chat-hf", cache_dir=None, **pipeline_args): 
    
    repo_branch = "main"

    tokenizer = AutoTokenizer.from_pretrained(model, revision=repo_branch, cache_dir=cache_dir)
    model = AutoModelForCausalLM.from_pretrained(model, revision=repo_branch, cache_dir=cache_dir, load_in_8bit=True, trust_remote_code=True, device_map='auto')

    pipe = pipeline(
        model=model, tokenizer=tokenizer,
        return_full_text=False,  
        task='text-generation',
        # -- model hyperparameters --
        temperature=0.1,  # 'randomness' of outputs; must be > 0, lower values are more deterministic
        top_p=0.15,  # sample from the smallest set of top tokens whose probabilities add up to this value
        top_k=0,  # 0 disables top-k filtering, so only top_p applies
        max_new_tokens=400, 
        repetition_penalty=1.2,  
        **pipeline_args
    )

    return pipe

In [5]:
llm = load_llm_pipeline(cache_dir = "/home/steph/.cache/huggingface/os_models")

Loading checkpoint shards: 100%|██████████| 3/3 [00:10<00:00,  3.55s/it]


# Zero-Shot Titling

### Formatting Text

Would it be best to format the topics as a speaker segmented transcript, or simply as plain text? I think this depends on whether the model would get confused or not.

In [23]:
#| export
def get_base_prompt(topic_text):
    return f"""

Your answer should be displayed in the following format:

###SECTION###
SPEAKER_00: example text
SPEAKER_01: example text

###SECTION TITLE###
example concise title

###SECTION SUMMARY###
example concise summary paragraph

Now give real titles and summaries for the below:

###SECTION###
{topic_text}

###SECTION TITLE###
"""

def get_intro_prompt(topic_text):
    return "For the following podcast introduction section, give it a title followed by a summary. Both the title and summary should be separated into their own section under the headers delimited by triple hashtags." + get_base_prompt(topic_text)

def get_standard_prompt(topic_text):
    return "You are an AI text summariser who for legal reasons absolutely cannot mention any personal names of the people in the text.\n\nFor the following podcast section, give it a title followed by a summary. Both the title and summary should be separated into their own section under the headers delimited by triple hashtags (###)." + get_base_prompt(topic_text)

In [24]:
#| export
def format_speech_text(topic): return '\n'.join([speech['label'] + ": " + speech['text'] for speech in topic['groups']])
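As an illustration of the speaker-segmented format this produces, here is a small sketch run on a made-up topic. The `example_topic` dictionary is hypothetical; its shape (a `groups` list of entries with `label` and `text` keys) is inferred from the function above rather than taken from the real transcript file.

```python
def format_speech_text(topic):
    # One "SPEAKER: text" line per speech group, matching the prompt's example format
    return '\n'.join(speech['label'] + ": " + speech['text'] for speech in topic['groups'])

# Hypothetical topic, shaped like the entries the function expects
example_topic = {
    'label': 1,
    'groups': [
        {'label': 'SPEAKER_00', 'text': 'Welcome back to the show.'},
        {'label': 'SPEAKER_01', 'text': 'Thanks for having me.'},
    ],
}

formatted = format_speech_text(example_topic)
print(formatted)
```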

In [31]:
#| export
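def format_summary(summary):
    # Normalise quote characters (note: this also turns apostrophes into double quotes)
    summary = summary.replace("'",'"')
    # Strip one pair of wrapping quotes when no other quotes appear inside
    if summary.startswith('"') and summary.endswith('"') and '"' not in summary[1:-1]:
        summary = summary[1:-1]
    return summary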

In [32]:
#| export
def parse_summary(llm_summary):
    split = llm_summary.split("###")
    if len(split) == 3:
        return {
            'title': format_summary(split[0].strip()),
            'summary': format_summary(split[2].strip()),
            'summary_unparsed': llm_summary
        }
    else:
        return {'title': "", 
                'summary': "", 
                'summary_unparsed': llm_summary
            }
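To see why the parser checks for exactly three parts: the prompt already ends with the `###SECTION TITLE###` header, so a well-formed completion starts with the title and contains only the summary header. The sketch below uses made-up completions (not real model output) to show the split on a well-formed and a malformed response.

```python
# Hypothetical well-formed completion: title, then the summary header, then the summary
completion = "Example Title\n\n###SECTION SUMMARY###\nExample summary paragraph."
parts = completion.split("###")
title, header, summary = (part.strip() for part in parts)
print(len(parts), repr(title), repr(header), repr(summary))

# Hypothetical malformed completion: the model repeats the title header,
# which produces extra parts, so the parser falls back to empty fields
malformed = "###SECTION TITLE###\nExample Title\n###SECTION SUMMARY###\nExample summary."
malformed_parts = malformed.split("###")
print(len(malformed_parts))
```

In the fallback case the raw text is still kept under `summary_unparsed`, so nothing is lost and the section can be re-parsed later.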

In [27]:
#| export
def get_topic_summary_prompt(topic):
    speech_text = format_speech_text(topic)
    if topic['label'] == 0:
        prompt = get_intro_prompt(speech_text)
    else:
        prompt = get_standard_prompt(speech_text)
    return prompt

In [28]:
#| export
def summarise_topics(transcript, llm):
    for i, topic in enumerate(transcript):
        prompt = get_topic_summary_prompt(topic)
        summary = parse_summary(
            llm(prompt)[0]['generated_text']
        )
        topic.update(summary)
        torch.cuda.empty_cache()
    return transcript

In [29]:
summarised_topics = summarise_topics(transcript, llm)



In [30]:
dictfilt = lambda x, y: dict([ (i,x[i]) for i in x if i in set(y) ])

[dictfilt(topic, ('title','summary','summary_unparsed')) for topic in summarised_topics]

[{'title': 'Understanding Large Language Models and Their Role in Generative AI Applications',
  'summary': 'In this episode of Practical AI, hosts Daniel Whitenack and Chris Benson explore the concept of large language models (LLMs) and their role in generative AI applications. They discuss the differences between LLMs and other types of AI models, and examine the various components that make up the LLM ecosystem. Additionally, they touch on the idea that the model itself is not the application, but rather a tool that can be used to create value.',
  'summary_unparsed': 'Understanding Large Language Models and Their Role in Generative AI Applications\n\n###SECTION SUMMARY###\nIn this episode of Practical AI, hosts Daniel Whitenack and Chris Benson explore the concept of large language models (LLMs) and their role in generative AI applications. They discuss the differences between LLMs and other types of AI models, and examine the various components that make up the LLM ecosystem. Addi

## Full Summary

In [44]:
#| export
def get_whole_summary_prompt(summarised_topics):
    summaries = '\n\n'.join([topic['summary'] for topic in summarised_topics])
    prompt = f"""Written below are summaries of every topic of a podcast episode. Please write a detailed summary for the whole podcast episode.
    
###SECTION SUMMARIES###
{summaries}

###WHOLE SUMMARY###
"""
    return prompt

In [45]:
#| export
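def summarise_transcript(transcript, llm):
    prompt = get_whole_summary_prompt(transcript)
    # Report the prompt length so it can be checked against the context window
    print(len(llm.tokenizer.tokenize(prompt)))
    # Pass the larger generation budget at call time; otherwise the value set when
    # the pipeline was built (max_new_tokens=400) still applies
    summary = llm(prompt, max_new_tokens=2048)[0]['generated_text']
    return summary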

In [25]:
summarised_transcript = summarise_transcript(summarised_topics, llm)
summarised_transcript

1474


'This podcast episode features conversations with several guests who are experts in their respective fields, all centered around the theme of computation and its impact on society. Topics discussed include the power of computational thinking, the limitations of human understanding, the potential of automated content selection, and the implications of advanced artificial intelligence on human consciousness. Guests include Stephen Wolfram, David Deutsch, and other notable figures in the fields of computer science, physics, and philosophy.'

In [46]:
#| export 
def summarise(transcript_split):
    llm = load_llm_pipeline(cache_dir="/home/steph/.cache/huggingface/os_models")
    summarised_topics = summarise_topics(transcript_split, llm)
    summarised_transcript = summarise_transcript(summarised_topics, llm)
    return {
        'summary': summarised_transcript, 'topics': summarised_topics
    }

In [29]:
with open("../data/podcast/practical_ai_236_tech_stack/transcript.json", 'w') as f:
    json.dump({
        'summary': summarised_transcript, 'topics': summarised_topics
    }, f, ensure_ascii=False, indent=2)

In [33]:
#| hide
from nbdev import nbdev_export
nbdev_export()

---

# Map Reduce

UPDATE: This section is no longer required after the Llama 2 release, which allows for a context length of 4096 tokens. This is large enough for each topic section; all that is required is that the topics stay below the context length.

Topics can end up being longer than the maximum context length of these models (2048 tokens). The options around this are either reducing the size of the topics (reasonable), or splitting them up and using a split summarisation method. 

Langchain recommends a pattern which involves splitting the text up into parts, doing a summarisation for each of the parts, and taking these outputs to do the final summarisation. Let's try that and see how well it works.
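The pattern described above can be sketched in a few lines. This is only an outline of the idea, not Langchain's implementation: `map_reduce_summarise` is a hypothetical name, and `fake_summarise` is a stand-in for the LLM call that simply keeps the first three words, so that the flow can be followed without loading a model.

```python
def map_reduce_summarise(chunks, summarise_fn):
    # Map: summarise every chunk independently
    partial_summaries = [summarise_fn(chunk) for chunk in chunks]
    # Reduce: summarise the concatenation of the partial summaries
    return summarise_fn("\n\n".join(partial_summaries))

# Stand-in for the LLM: records what it was asked and keeps the first three words
calls = []
def fake_summarise(text):
    calls.append(text)
    return " ".join(text.split()[:3])

chunks = [
    "Constraints can help writers be creative",
    "Maths shows up in fiction",
]
final = map_reduce_summarise(chunks, fake_summarise)
print(len(calls), repr(final))
```

The cost is one model call per chunk plus one for the final step, which is worth keeping in mind given the efficiency concerns raised earlier.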

In [25]:
print([len(llm.tokenizer.tokenize(topic['text'])) for topic in transcript])

[3031, 2928, 2931, 3180, 2939, 2873, 2491, 2228, 2557, 2788, 3282, 2279, 3156, 3631, 3458]


In [27]:
import numpy as np

In [28]:
token_word_ratio = [(len(llm.tokenizer.tokenize(topic['text']))/len(topic['text'].split(' '))) for topic in transcript ]
np.mean(token_word_ratio), np.max(token_word_ratio)

(1.3364756104849838, 1.3854103343465045)

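So here we want to make sure that each topic stays under 4096/1.4 ≈ 2900 words, and in practice somewhat lower, since the prompt template and the generated output share the same context window.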
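The word budget can be worked out as below. This is a rough sketch: `max_words` is a hypothetical helper, and the 500 reserved tokens are an assumption (roughly the 400 new tokens the pipeline may generate plus an allowance for the prompt template), not a measured figure.

```python
def max_words(context_length, tokens_per_word, reserved_tokens=0):
    """Largest word count that should still fit in the context window."""
    return int((context_length - reserved_tokens) / tokens_per_word)

# Using the worst-case ratio measured above, rounded up to 1.4 tokens per word
whole_window = max_words(4096, 1.4)
with_reserve = max_words(4096, 1.4, reserved_tokens=500)  # 500 is an assumed allowance
print(whole_window, with_reserve)
```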

In [12]:
#| export
def get_num_tokens(text, tokenizer): return len(tokenizer.tokenize(text))

In [13]:
topic_text = ' '.join(topic['text'] for topic in transcript[:3])
get_num_tokens(topic_text, tokenizer)

Token indices sequence length is longer than the specified maximum sequence length for this model (6041 > 2048). Running this sequence through the model will result in indexing errors


6041

Here there's an opportunity to split it on specific separators. We can use paragraphs, or we could in fact use topics. It probably just depends on whichever is fastest.

In [14]:
topic_text = split_paragraphs_text(topic_text)

In [34]:
#| export
def chunk_text(topic_text, chunk_size=4000):
    # chunk_size is measured in characters, not tokens
    paragraphs = split_paragraphs_text(topic_text).split("\n\n")
    # Record the paragraph index at which each new chunk starts
    split_idxs = [0]
    current_length = 0
    for i, paragraph in enumerate(paragraphs):
        current_length += len(paragraph)
        if current_length > chunk_size:
            split_idxs.append(i+1)
            current_length = 0
    # Drop the last split so the trailing paragraphs are merged into the final chunk
    if len(split_idxs) > 1: split_idxs.pop()
    chunks = []
    for i, j in zip(split_idxs, split_idxs[1:]+[None]):
        # Extend each chunk by one paragraph on either side so neighbouring chunks overlap
        start = max(i-1, 0)
        end = j+1 if j is not None else None
        chunks.append('\n\n'.join(paragraphs[start:end]))
    return chunks 
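To make the overlap behaviour concrete, here is a toy version of the same splitting logic that works on a ready-made list of paragraphs and returns lists rather than joined strings. `chunk_paragraphs` is a hypothetical helper written for this illustration, and the five 30-character paragraphs are made up.

```python
def chunk_paragraphs(paragraphs, chunk_size=4000):
    # Same index logic as chunk_text, but on an already-split list of paragraphs
    split_idxs = [0]
    current_length = 0
    for i, paragraph in enumerate(paragraphs):
        current_length += len(paragraph)
        if current_length > chunk_size:
            split_idxs.append(i + 1)
            current_length = 0
    if len(split_idxs) > 1:
        split_idxs.pop()
    chunks = []
    for i, j in zip(split_idxs, split_idxs[1:] + [None]):
        start = max(i - 1, 0)
        end = j + 1 if j is not None else None
        chunks.append(paragraphs[start:end])
    return chunks

# Five toy paragraphs of 30 characters each, with a 50-character budget
a, b, c, d, e = ("a" * 30, "b" * 30, "c" * 30, "d" * 30, "e" * 30)
toy_chunks = chunk_paragraphs([a, b, c, d, e], chunk_size=50)
print([[p[0] for p in chunk] for chunk in toy_chunks])
```

Neighbouring chunks share the paragraphs on either side of the boundary (here `b` and `c`), which gives each chunk some surrounding context at the price of summarising those paragraphs twice.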

In [35]:
text_chunks = chunk_text(transcript[1]['text'])
print([get_num_tokens(text_chunk, tokenizer) for text_chunk in text_chunks])
text_chunks

[1328, 1275]


["Now, one of the most basic assumptions of economics is that constraints are bad, at least in classical economics. People are always better off, or at least no worse off, when you relax constraints.\n\nSo more money is better than less money, and more hours in the day would be better than fewer hours in the day. Do you think literature is an exception to that economic logic, that constraints really make it better? So I think there's some sweet spot, right? If you constrain everything, then you're trapped in a really rigid box. There's no room for you to be creative. With absolutely no constraints at all, you're out in the wilderness. But with a few simple constraints, like you might have in a poetic form, that doesn't stop you being creative. It spurs you to creativity. And so there's this great quote from the Irish poet, Paul Muldoon, where he said that poetic form is a straitjacket in the sense that straitjackets were a straitjacket for Houdini.\n\nThey give you something to push of

In [36]:
#| export
def get_chunk_summary_prompt(text_chunk):
    # Parameter named text_chunk to avoid shadowing the chunk_text function
    return f"""Write a concise summary of the following:

\"{text_chunk}\"

CONCISE SUMMARY: """

In [37]:
#| export
def get_chain_title_prompt(topic_text):
    return f"""The following is a series of summaries of a text chapter. Generate a title from these summaries.

It should be displayed in the below format:

###CHAPTER SUMMARIES###
summary of chapter describing why ai is good

another summary of chapter describing why ai is good

###CHAPTER TITLE###
why ai is good

Now try below:

###CHAPTER SUMMARIES###
{topic_text}

###CHAPTER TITLE###

"""

In [38]:
#| export
def get_topic_title_chain(text, pipe):
    text_chunks = chunk_text(text)
    chunk_summaries = []
    for text_chunk in text_chunks:
        summary = pipe(get_chunk_summary_prompt(text_chunk))[0]['generated_text'].strip()
        chunk_summaries.append(summary)
    title = pipe(get_chain_title_prompt("\n\n".join(chunk_summaries)))[0]['generated_text']
    return title

In [39]:
# title = get_topic_title_chain(text_chunks, pipe)
# title

In [40]:
context_length = 2048
context_length - get_num_tokens(get_chain_title_prompt(""), tokenizer)

1931

In [41]:
#| export
def title_topics(topics, model, tokenizer):
    pipe = pipeline(
        model=model, tokenizer=tokenizer,
        return_full_text=False,  
        task='text-generation',
        # we pass model parameters here too
        # stopping_criteria=stopping_criteria
        temperature=0.1,  # 'randomness' of outputs; must be > 0, lower values are more deterministic
        top_p=0.15,  # sample from the smallest set of top tokens whose probabilities add up to 15%
        top_k=0,  # 0 disables top-k filtering, so only top_p applies
        max_new_tokens=480,  # max number of tokens to generate in the output
        repetition_penalty=1.2  # without this output begins repeating
    )
    for topic in topics:
        topic_text = topic['text']
        if get_num_tokens(topic_text, tokenizer) < 1900: 
            topic['label'] = get_topic_title(topic_text, pipe)
        else:
            print("Topic is larger than the model's context window, running summary chain", topic['label'])
            topic['label'] = get_topic_title_chain(topic_text, pipe)
    return topics

In [42]:
transcript_titled = title_topics(transcript, model, tokenizer)
print([ topic['label'] for topic in transcript_titled ])


Topic is larger than the model's context window, running summary chain 1
Topic is larger than the model's context window, running summary chain 2
Topic is larger than the model's context window, running summary chain 4




['Exploring the Connections Between Literature and Mathematics', 'Constraints in Literature', '* Interactive Storytelling Through Graph Theory', 'The Power of Math in Fiction', 'Embracing Uncertainty in Mathematics Education', 'The Limits of Science in Economics']


# Appendix

## Langchain

# Misc

### Stopping Criteria

In [None]:
from transformers import StoppingCriteria, StoppingCriteriaList

In [38]:
with open("../data/podcast/practical_ai_236_tech_stack/transcript.json", 'w') as f:
    json.dump(topics, f, ensure_ascii=False, indent=2)

In [104]:
stop_token_ids = [
    tokenizer.convert_tokens_to_ids(x) for x in [
        [''], ['User', ':'], ['system', ':'], ['#','#','#'], ['\n', '\n'],
        [tokenizer.convert_ids_to_tokens([9427])[0], ':']
    ]
]
stop_token_ids = [torch.LongTensor(x).to(device, dtype=torch.float16) for x in stop_token_ids]

stop_token_ids

[tensor([0.], device='cuda:0', dtype=torch.float16),
 tensor([ 2660., 29904.], device='cuda:0', dtype=torch.float16),
 tensor([ 5204., 29904.], device='cuda:0', dtype=torch.float16),
 tensor([29936., 29936., 29936.], device='cuda:0', dtype=torch.float16),
 tensor([0., 0.], device='cuda:0', dtype=torch.float16),
 tensor([ 9424., 29904.], device='cuda:0', dtype=torch.float16)]
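One thing stands out in the output above: the last entry was built from token ID 9427 but is stored as 9424. float16 can only represent integers exactly up to 2048, so larger token IDs get rounded when cast. The sketch below reproduces the rounding with the standard library's half-precision format, without needing torch; `to_float16` is a hypothetical helper and 29901 is just an arbitrary ID of similar size to those in the output.

```python
import struct

def to_float16(value):
    # Round-trip a number through IEEE 754 half precision (struct format 'e')
    return struct.unpack('e', struct.pack('e', float(value)))[0]

for token_id in (9427, 29901, 2048):
    print(token_id, '->', to_float16(token_id))
```

This suggests the float16 cast is lossy for this purpose, and that keeping the stop IDs as integer tensors (for example `torch.long`, matching `input_ids`) would let the comparison in the stopping criteria match the intended tokens.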

In [62]:
class StopOnTokens(StoppingCriteria):
    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        for stop_ids in stop_token_ids:
            if torch.eq(input_ids[0][-len(stop_ids):], stop_ids).all():
                return True
        return False
    
stopping_criteria = StoppingCriteriaList([StopOnTokens()])

## Other Prompts

In [26]:
prompt_string = """Below is a transcript from a topic of a podcast delimited by triple quotes.
Please write an appropriate title for this topic in seven words or less.
'''{text}'''
"""

In [27]:
prompt_string = "You are an AI language model designed to read podcast transcript sections and provide chapter titles for them. These chapter titles are to accurately summarise the text in 10 words or less. You are to provide such a title for the below piece of text:\n\n'''{text}'''"

In [28]:
prompt_string ="""You are an AI language model designed to summarise podcast transcript chapters as chapter headings.

For example:
###CHAPTER TEXT###
\"\"\"example text...\"\"\"
###CHAPTER HEADING###
\"\"\"Example Title\"\"\"

Now try it:
###CHAPTER TEXT###
\"\"\"{text}\"\"\"

###CHAPTER HEADING###
\"\"\""""