# Installs

In [1]:
!pip install "torch>=2.0.1" safetensors==0.3.1 "sentencepiece>=0.1.97" ninja==1.11.1

In [None]:
!pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu118

In [3]:
# Formatting to wrap output text
from IPython.display import HTML, display

def set_css():
  display(HTML('''
  <style>
    pre {
        white-space: pre-wrap;
    }
  </style>
  '''))
get_ipython().events.register('pre_run_cell', set_css)

In [None]:
!git clone https://github.com/turboderp/exllamav2

In [None]:
%cd /content/exllamav2

In [None]:
!pip install -r requirements.txt

In [7]:
import sys, os
sys.path.append("/content/exllamav2")

# Scraping the Podcast Data

In [8]:
import requests
from bs4 import BeautifulSoup
import csv
import pandas as pd
import numpy as np

In [62]:
headers = {'User-Agent': 'Mozilla/5.0'}

# Function to scrape podcast URLs from main page
def get_podcast_urls(main_url):
    response = requests.get(main_url, headers=headers)
    soup = BeautifulSoup(response.content, 'html.parser')

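    # Find the ul containing podcast URLs
    podcast_list = soup.find('ul', {'id': 'lcp_instance_0', 'class': 'lcp_catlist'})
    if podcast_list is None:
        # Page layout changed or the request failed: nothing to scrape
        return []

    podcasts = []
    # Iterate over <li> tags only; iterating the <ul> directly also yields text nodes
    for li in podcast_list.find_all('li'):
        a_tag = li.find('a', href=True)
        if a_tag:
            url = a_tag['href']
            name = a_tag.text
            podcasts.append((url, name))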

    return podcasts

In [65]:
# Function to extract transcript text
def scrape_podcast(podcast_data, csv_writer):
    podcast_url, podcast_name = podcast_data
    response = requests.get(podcast_url, headers=headers)
    soup = BeautifulSoup(response.content, 'html.parser')

    transcript_div = soup.find('div', class_='et_pb_module et_pb_post_content et_pb_post_content_0_tb_body speaking')

    if transcript_div:
        speeches = transcript_div.find_all('p')
        for speech in speeches:
            parts = speech.text.split(':', 1)
            if len(parts) == 2:
                speaker, text = parts
                csv_writer.writerow([podcast_name, speaker.strip(), text.strip()])
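The parser above relies on `split(':', 1)`, which splits only at the first colon: colons inside the spoken text are preserved, and paragraphs without any colon are skipped. A minimal sketch of that rule in isolation follows; `split_speaker_line` is a hypothetical helper name, not part of the notebook, and the sample lines are made up.

```python
from typing import Optional, Tuple


def split_speaker_line(line: str) -> Optional[Tuple[str, str]]:
    """Split 'Speaker: text' at the first colon; return None if there is no colon."""
    parts = line.split(":", 1)
    if len(parts) != 2:
        return None
    speaker, text = parts
    return speaker.strip(), text.strip()


# Hypothetical sample paragraphs
examples = [
    "Helen Osborne: Welcome to the show.",
    "Dr. Mark Williams: Two points: trust and clarity.",
    "A paragraph with no speaker label",
]
for line in examples:
    print(split_speaker_line(line))
```

One consequence of this rule: any paragraph containing a colon is treated as speech, even when the text before the colon is not a speaker name.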

In [63]:
main_url = 'https://healthliteracy.com/podcast-transcripts/'
folder_dir = "/content/drive/MyDrive/Projects/Podcast Summariser"

# Get list of podcast URLs
podcast_urls = get_podcast_urls(main_url)

In [None]:
podcast_urls

In [66]:
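# Open CSV for writing
with open(f"{folder_dir}/interviews.csv", 'w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    # Note: the header has two names, but each row has three fields (title, speaker, text)
    writer.writerow(['Speaker', 'Text'])

    # Each entry is a (url, name) tuple
    for podcast in podcast_urls:
        scrape_podcast(podcast, writer)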

print("Scraping completed and saved to interviews.csv")

Scraping completed and saved to interviews.csv
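The rows written by `scrape_podcast` contain three fields (title, speaker, text), while the header names only two. A minimal sketch of writing a header with one name per field is shown below, using an in-memory buffer and made-up rows instead of the real file. With a header like this, the `reset_index`/`rename` step used when loading the CSV would not be needed.

```python
import csv
import io

# Hypothetical sample rows in the same (title, speaker, text) order as scrape_podcast
rows = [
    ("Episode A", "Host", "Welcome."),
    ("Episode A", "Guest", "Thanks for having me."),
]

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Title", "Speaker", "Text"])  # one header name per field
writer.writerows(rows)

# Read the data back to confirm every field lands under the right column name
buffer.seek(0)
records = list(csv.DictReader(buffer))
print(records[0])
```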


Final Dataset

In [9]:
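# The CSV header names only Speaker and Text while each row has three fields,
# so pandas loads the first field (the episode title) as the index.
# Move it back into a regular column named 'Title'.
interviews_df = pd.read_csv("/content/drive/MyDrive/Projects/Podcast Summariser/interviews.csv")
interviews_df = interviews_df.reset_index()
interviews_df = interviews_df.rename(columns={'index': 'Title'})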

In [10]:
interviews_df.head()

Unnamed: 0,Title,Speaker,Text
0,Health Literacy: Helping Patients Feel Cared F...,Helen Osborne,Welcome to Health Literacy Out Loud. I’m Helen...
1,Health Literacy: Helping Patients Feel Cared F...,Dr. Mark Williams,"Thanks, Helen. It’s an honor to be able to tak..."
2,Health Literacy: Helping Patients Feel Cared F...,Helen Osborne,"I’m your great fan. In that 1995 paper, which ..."
3,Health Literacy: Helping Patients Feel Cared F...,Dr. Mark Williams,That’s fascinating. We always suspected as we ...
4,Health Literacy: Helping Patients Feel Cared F...,Helen Osborne,I know. I treasure all the people doing this t...


# Using an LLM for Summarisation

In [89]:
# Number of episodes scraped
len(interviews_df['Title'].unique())

185

In [76]:
# Titles of the podcasts/interviews
interviews_df['Title'].unique()

array(['Health Literacy: Helping Patients Feel Cared For, and Cared About (HLOL #239)',
       'Artificial Intelligence & Health Communication (HLOL #238)',
       'The Language of Civility (HLOL #237)',
       'Television Ads for Medications (HLOL #236)',
       'The Value of Knowing Why Health Literacy Matters (HLOL #235)',
       'Time Toxicity: Time that Patients Can Lose to Treatment (HLOL #234)',
       'Using Art to Communicate About Surgery (HLOL #233)',
       'Communicating About Potential Healthcare Fraud and Abuse (HLOL #232)',
       'Words Matter: What We Say and Write Can Affect Health Understanding (HLOL #231)',
       'Wellness and Health Literacy (HLOL #230)',
       'Oral Health Literacy: How Diseases of the Mouth Affect Overall Health (HLOL #229)',
       'Bullet Points and Other Types of Lists (HLOL #228)',
       'A Standardized Patient’s Perspective of Health Communication (HLOL #227)',
       'Building Trust with Each Audience (HLOL #226)',
       'Live Virtual 

In [77]:
# Take the text of a specific episode
filtered_df = interviews_df[interviews_df['Title'] == "Artificial Intelligence & Health Communication (HLOL #238)"]

# Join all the text into one string
text_to_summarise = "\n".join(filtered_df['Text'])
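Joining only the `Text` column drops the speaker names, so the model cannot tell who said what. A small sketch of a join that keeps the labels is given below; the `turns` list is hypothetical sample data standing in for the rows of `filtered_df`.

```python
# Hypothetical sample turns standing in for the (Speaker, Text) rows of filtered_df
turns = [
    ("Helen Osborne", "Welcome to Health Literacy Out Loud."),
    ("Guest", "Thanks for having me."),
]

# Prefix every turn with its speaker before joining
labelled_text = "\n".join(f"{speaker}: {text}" for speaker, text in turns)
print(labelled_text)
```

With the real data, the pairs could come from `zip(filtered_df['Speaker'], filtered_df['Text'])`. The labelled version is longer, which matters for models with a small context window.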

In [79]:
text_to_summarise

'Welcome to Health Literacy Out Loud. I’m Helen Osborne, President of Health Literacy Consulting, founder of Health Literacy Month and author of the book Health Literacy from A to Z. I also produce and host this podcast series, Health Literacy Out Loud.\nthe introduction of artificial intelligence, or AI. I’m certainly grappling with how to best use it as a tool in health communication.\nThanks for having me.\nThis AI, it’s here. I’m reading about it everywhere. Clue us all in. What is AI, and what do we need to know and do?\nAI isn’t new. We’ve had artificial intelligence for a while, and I’ll get to what that is in a second.\nI think that’s what really overwhelmed me. As you talk about it, that it’s been around for a while, that’s probably why if I’m looking for a pair of sneakers, all of a sudden those very same sneakers show up on my computer. It’s like, “Really?”\nExactly.\nBut now you’re saying it can not only give us back information we somehow entered into it, but it can create

In [80]:
len(text_to_summarise)

8510

# exllamav2
- Allows the use of quantised models

Setup

In [11]:
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer

from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

import time
import os, glob
import torch

In [None]:
# Predownloaded model weights of a pretrained LLM
model_directory = "/content/drive/MyDrive/Projects/LLMs/vicuna-13b-v1.3.0-GPTQ"

In [13]:
config = ExLlamaV2Config()
config.model_dir = model_directory
config.prepare()

In [14]:
model = ExLlamaV2(config)
print("Loading model: " + model_directory)
model.load([18, 24])
# model.load([16, 24])

Loading model: /content/drive/MyDrive/Projects/LLMs/vicuna-13b-v1.3.0-GPTQ


([18, 24], [10.16779100894928, 23.9716796875])

In [15]:
tokenizer = ExLlamaV2Tokenizer(config)

cache = ExLlamaV2Cache(model)

In [16]:
# Initialize generator
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)

In [17]:
# Sampler settings
settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.85
settings.top_k = 50
settings.top_p = 0.8
settings.token_repetition_penalty = 1.15
# Ban the EOS token so generation always runs for the full max_new_tokens
settings.disallow_tokens(tokenizer, [tokenizer.eos_token_id])
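The temperature, `top_k`, and `top_p` values above control how candidate tokens are filtered before one is sampled. The sketch below illustrates the general idea on a tiny list of made-up logits. It is a plain-Python illustration under that assumption, not exllamav2's actual implementation, and `filter_top_k_top_p` is a hypothetical name.

```python
import math


def filter_top_k_top_p(logits, temperature=0.85, top_k=50, top_p=0.8):
    """Return {token_index: probability} after temperature, top-k and top-p filtering."""
    # Temperature scaling followed by a numerically stable softmax
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    ranked = sorted(((e / total, i) for i, e in enumerate(exps)), reverse=True)

    # Top-k: keep only the k most probable tokens
    ranked = ranked[:top_k]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p
    kept, cumulative = [], 0.0
    for prob, index in ranked:
        kept.append((prob, index))
        cumulative += prob
        if cumulative >= top_p:
            break

    # Renormalise so the surviving probabilities sum to 1
    norm = sum(prob for prob, _ in kept)
    return {index: prob / norm for prob, index in kept}


# Made-up logits for a four-token vocabulary
kept = filter_top_k_top_p([2.0, 1.0, 0.0, -1.0], temperature=1.0, top_k=3, top_p=0.8)
print(kept)
```

Lower `top_p` or `top_k` values narrow the candidate set and make the output more predictable; a higher temperature flattens the distribution before filtering.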

In [90]:
# Alpaca-style instruction prompt (Vicuna v1.3 itself is normally prompted in a USER:/ASSISTANT: chat format)
prompt_input = f"""Write a concise, one-paragraph summary of the key concepts, context, and implications of the interview: {text_to_summarise}."""
prompt = f"""### Instruction:

{prompt_input}

### Response:
"""

In [92]:
max_new_tokens = 512

generator.warmup()
time_begin = time.time()

output = generator.generate_simple(prompt, settings, max_new_tokens, seed = 42)

time_end = time.time()
time_total = time_end - time_begin

In [93]:
print(output)
print()
print(f"Response generated in {time_total:.2f} seconds, {max_new_tokens} tokens, {max_new_tokens / time_total:.2f} tokens/second")

anything else. We’re just more aware of it now than we were before, because it’s right out there and it’s more available to the public than it ever was before.
We haven’t really talked about the dark sides of AI. It’s great to have a positive conversation about it, but there are fears about the ways in which ChatGPT could be weaponized.
We can create voices, too, right?
It can create voices. We have to be aware of that. I would say as people who are working in this space of education and literacy, we should also be talking to our communities about this, about the ways in which AI can potentially cause harm.
I’m so glad I’m talking to you. I feel like I’m a few steps behind. Just today, I wanted to look up some medical thing, so I went to Dr. Google. I’ve gotten much more comfortable about that. Years ago, I would not be that comfortable. I know you can’t trust things on the internet, but I know how to vet that myself and go to the sites that I find credible.
One hundred percent.
Ways t

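**The open-source model did not produce a summary: the output above is transcript-style text rather than a condensed account of the interview. The likely cause is the limited context length, since the 8,510-character transcript plus the instructions leaves little or no room within the model's context window.**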
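A rough way to see the context problem, and one common workaround, is sketched below. It assumes a 2,048-token context window for this model and roughly 4 characters per token for English text. Both figures are assumptions; a real check would count tokens with the model's tokenizer. `estimate_tokens` and `chunk_text` are hypothetical helpers.

```python
import math
from typing import List

CONTEXT_TOKENS = 2048   # assumed context window of the model
MAX_NEW_TOKENS = 512    # matches max_new_tokens used above
CHARS_PER_TOKEN = 4.0   # rough heuristic for English text, not a tokenizer count


def estimate_tokens(text: str) -> int:
    """Estimate the token count from the character count."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def chunk_text(text: str, max_chars: int) -> List[str]:
    """Greedily pack whole lines into chunks of up to max_chars characters.

    A single line longer than max_chars is kept whole as its own chunk.
    """
    chunks, current = [], ""
    for line in text.split("\n"):
        candidate = f"{current}\n{line}" if current else line
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = line
    if current:
        chunks.append(current)
    return chunks


# Tokens left for the prompt once room for the response is reserved
prompt_budget = CONTEXT_TOKENS - MAX_NEW_TOKENS
transcript = "x" * 8510  # stand-in with the same length as the real transcript
print(estimate_tokens(transcript), "estimated tokens vs a budget of", prompt_budget)

# Small demonstration of the chunking rule on made-up lines
sample = "\n".join(["a" * 30] * 5)
chunks = chunk_text(sample, max_chars=70)
print([len(chunk) for chunk in chunks])
```

Each chunk could then be summarised separately and the partial summaries combined in a final pass, at the cost of extra generation time.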

# OpenAI API

In [None]:
# openai.ChatCompletion (used below) was removed in openai 1.0, so pin an earlier release
%pip install "openai<1"

In [None]:
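import os
import openai
import json

# Read the OpenAI key from the environment rather than hard-coding it in the notebook
openai.api_key = os.environ.get("OPENAI_API_KEY", "")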

"Chain of Density" (CoD) custom prompt template to generate dense summaries that are detailed and entity-centric without being overly dense and hard to follow.
- https://arxiv.org/abs/2309.04269

In [83]:
# Function to get openai response using custom prompt template
def llm_summariser(text_to_summarise):

    prompt_content = f"""
    Article: {text_to_summarise}
    You will generate increasingly concise, entity-dense summaries of the above article.

    Repeat the following 2 steps 5 times.

    Step 1. Identify 1-3 informative entities (";" delimited) from the article which are missing from the previously generated summary.
    Step 2. Write a new, denser summary of identical length which covers every entity and detail from the previous summary plus the missing entities.

    A missing entity is:
    - relevant to the main story,
    - specific yet concise (5 words or fewer),
    - novel (not in the previous summary),
    - faithful (present in the article),
    - anywhere (can be located anywhere in the article).

    Guidelines:

    - The first summary should be long (4-5 sentences, ~80 words) yet highly non-specific, containing little information beyond the entities marked as missing. Use overly verbose language and fillers (e.g., "this article discusses") to reach ~80 words.
    - Make every word count: rewrite the previous summary to improve flow and make space for additional entities.
    - Make space with fusion, compression, and removal of uninformative phrases like "the article discusses".
    - The summaries should become highly dense and concise yet self-contained, i.e., easily understood without the article.
    - Missing entities can appear anywhere in the new summary.
    - Never drop entities from the previous summary. If space cannot be made, add fewer new entities.

    Remember, use the exact same number of words for each summary.
    Answer in JSON. The JSON should be a list (length 5) of dictionaries whose keys are "Missing_Entities" and "Denser_Summary".
    """

    # Formulate the message structure
    messages = [
        {"role": "system", "content": "You will generate increasingly concise, entity-dense summaries of the article. Follow the guidelines provided in the prompt."},
        {"role": "user", "content": prompt_content}
    ]

    # Get response from GPT-4 using the Chat model
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=messages
    )

    # Reformat json output to list of dictionaries
    final_response = json.loads(response.choices[0].message['content'])

    # Return the final 5 summaries
    return final_response
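`json.loads` fails if the model wraps its answer in a Markdown code fence or returns a different structure than requested. A defensive parsing sketch follows. `parse_summaries` is a hypothetical helper, and the sample response is made up rather than real model output.

```python
import json

FENCE = "`" * 3  # Markdown code-fence marker


def parse_summaries(raw: str, expected: int = 5) -> list:
    """Parse the model's reply into a list of summary dictionaries."""
    cleaned = raw.strip()
    if cleaned.startswith(FENCE):
        # Drop the surrounding fence and an optional 'json' language tag
        cleaned = cleaned.strip("`").strip()
        if cleaned.startswith("json"):
            cleaned = cleaned[len("json"):]
    data = json.loads(cleaned)

    if not isinstance(data, list) or len(data) != expected:
        raise ValueError(f"expected a list of {expected} items")
    for item in data:
        if not isinstance(item, dict) or not {"Missing_Entities", "Denser_Summary"} <= set(item):
            raise ValueError("each item needs Missing_Entities and Denser_Summary keys")
    return data


# Made-up response wrapped in a code fence, as chat models sometimes return
sample = [{"Missing_Entities": f"entity {i}", "Denser_Summary": f"summary {i}"} for i in range(5)]
raw_response = FENCE + "json\n" + json.dumps(sample) + "\n" + FENCE
parsed = parse_summaries(raw_response)
print(len(parsed), "summaries parsed")
```

Inside `llm_summariser`, the final `json.loads(...)` call could be swapped for a helper like this so that malformed replies raise a clear error.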

In [84]:
len(text_to_summarise)

8510

In [85]:
time_begin = time.time()

summarised_text = llm_summariser(text_to_summarise)

time_end = time.time()
time_total = time_end - time_begin
print(f"Response generated in {time_total:.2f} seconds")

Response generated in 39.02 seconds


In [86]:
print(summarised_text)

[{'Missing_Entities': 'AI in health communication; ChatGPT 3 and 4; Microsoft, Google, OpenAI', 'Denser_Summary': 'This podcast discussion is around the scope of artificial intelligence (AI) particularly in health communication. The role of AI in generating content for sports stories and finance journalism is presented. The guests discuss applications of AI tools such as ChatGPT 3 and 4. The evolution and popularization of AI, promoted by technology giants Microsoft, Google and OpenAI, is discussed throughout the conversation, underlying the potential benefits and challenges of a technology-driven future.'}, {'Missing_Entities': 'Role of AI in finance; Impact on jobs; Importance in education and literacy', 'Denser_Summary': 'The discussion frames AI as a vital tool of multiple domains, including health communication, finance journalism, and sports stories. Tools like ChatGPT 3 and 4 enable easy translation and content simplification. Microsoft, Google, and OpenAI play significant roles

In [88]:
print(summarised_text[3]["Denser_Summary"])

AI's role extends from plain language writing to deepfake creation, impacting education, health communication, finance journalism, and the job market. The technology, promoted by giants like Microsoft, Google, OpenAI, even pervades mediums like newsletters, raising trust issues. Balancing utilization and understanding of its ethical nuances is needed in a rapidly-progressing tech milieu as depicted through the Adele deepfake video discussion.


In [87]:
print(summarised_text[4]["Denser_Summary"])

Within the educational, communication, and economic frame, AI's implications extend from content creation to job displacement. It pervades diverse fields, from writing newsletters to creating deepfake videos, like an Adele example discussed. Citing AI applications like ChatGPT raises issues, emphasizing vetting digital information's importance. Navigating this evolving tech landscape requires understanding it thoroughly, a key takeaway from the Health Literacy Out Loud podcast discussion.
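The prompt asks for the same number of words in every summary. A quick way to check how closely the model followed that instruction is to count the words in each `Denser_Summary`. The sketch below uses made-up entries in the same shape as `summarised_text`; `summary_word_counts` is a hypothetical helper.

```python
def summary_word_counts(summaries):
    """Return the whitespace-delimited word count of each Denser_Summary."""
    return [len(item["Denser_Summary"].split()) for item in summaries]


# Made-up entries in the same shape as the parsed model output
example_summaries = [
    {"Missing_Entities": "AI", "Denser_Summary": "This podcast discusses AI."},
    {"Missing_Entities": "ChatGPT", "Denser_Summary": "AI and ChatGPT."},
]
counts = summary_word_counts(example_summaries)
print(counts)
```

Running `summary_word_counts(summarised_text)` on the real output would show whether the length stays roughly constant while the entity count grows.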


**With the Chain of Density prompt and GPT-4, the summaries are coherent and entity-dense, a clear improvement over the local model's output.**