<a href="https://colab.research.google.com/github/remzicam/ted_talks_summarizer/blob/main/ted_talk_summarizer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install -q transformers


In [2]:
from transformers import pipeline
from requests import get
from re import sub
from pprint import pprint
from warnings import filterwarnings
# suppress warnings
filterwarnings("ignore")

In [3]:
summarizer = pipeline("summarization", 
                      'pszemraj/led-base-book-summary',
                      device=0)


In [4]:
def clean_text(text: str) -> str:
    """Cleans subtitle text of ted talks.
    Args:
        text (str): subtitle of ted talk
    Returns:
        cleaned_text (str): cleaned version of subtitle text
    """
    # remove text inside parentheses (e.g. "(Applause)"); the pattern stops at
    # the first ")" so text between two tags on the same line is preserved
    text = sub(r"\([^)]*\)", "", text)
    # split into lines and drop the first line (the header of the .vtt file)
    text = text.split("\n")[1:]
    # remove empty strings
    text = list(filter(None, text))
    # remove timestamp lines, which contain the pattern "-->"
    cleaned_text = " ".join([x.strip() for x in text if "-->" not in x])
    return cleaned_text


def ted_talk_transcriber(link: str) -> str:
    """Creates transcription of ted talks from url.
    Args:
        link (str): url link of ted talks
    Returns:
        raw_text (str): raw transcription of the ted talk
    """
    # fetch the talk page
    page = get(link)
    # extract the unique talk id needed to build the subtitle file URL
    talk_id = page.text.split("project_masters/")[1].split("/")[0]
    raw_text = get(
        f"https://hls.ted.com/project_masters/{talk_id}/subtitles/en/full.vtt"
    ).text
    return raw_text


def text_summarizer(text: str) -> str:
    """Summarizes given text.
    Args:
        text (str): ted talks transcription
    Returns:
        str: summary
    """
    result = summarizer(
        text,
        min_length=8,
        max_length=256,
        no_repeat_ngram_size=3,
        encoder_no_repeat_ngram_size=3,
        repetition_penalty=3.5,
        num_beams=4,
        do_sample=False,
        early_stopping=True,
    )
    return result[0]["summary_text"]


def run(link: str) -> str:
    """Summarizes ted talks given link.
    Args:
        link (str): url link of ted talks
    Returns:
        str: summary
    """
    raw_text = ted_talk_transcriber(link)
    cleaned_transcript = clean_text(raw_text)
    return text_summarizer(cleaned_transcript)
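
The cleaning step can be tried offline on a small hand-written subtitle snippet. The block below is a minimal sketch, not the notebook's exact function: `clean_vtt` and `sample_vtt` are hypothetical names, the sample cues are made up, and the parenthesis pattern is assumed to stop at the first closing parenthesis.

```python
from re import sub


def clean_vtt(text: str) -> str:
    """Illustrative stand-in for clean_text (hypothetical helper)."""
    # drop parenthesised annotations such as "(Applause)"
    text = sub(r"\([^)]*\)", "", text)
    # split into lines and skip the first (header) line
    lines = text.split("\n")[1:]
    # keep non-empty lines that are not timestamp lines
    kept = [line.strip() for line in lines if line.strip() and "-->" not in line]
    return " ".join(kept)


# made-up sample in WebVTT layout: header, then timestamped cues
sample_vtt = (
    "WEBVTT\n"
    "\n"
    "00:00:00.000 --> 00:00:02.000\n"
    "Hello, everyone.\n"
    "\n"
    "00:00:02.000 --> 00:00:04.000\n"
    "(Applause)\n"
    "\n"
    "00:00:04.000 --> 00:00:06.000\n"
    "Thank you (Laughter) very much.\n"
)

cleaned = clean_vtt(sample_vtt)
print(cleaned)
```

Removing an annotation from the middle of a line leaves a double space behind; that is probably harmless for the tokenizer, but it could be collapsed with an extra `sub(r"\s+", " ", ...)` if desired.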

In [5]:
short_link = 'https://www.ted.com/talks/jen_gunter_the_truth_about_yeast_in_your_body'
pprint(run(short_link))

("You may have heard about sugar consumption, but you probably haven't. Well, "
 "according to a new study , there's actually a link between sugar consumption "
 'and an infection. According to the American College of Surgeons, sugar '
 'consumption leads to yeast infections because people overeat too much. And '
 "then there's diabetes, which is also linked to an increase in risk for some "
 "types of yeast infections . The good news: Type 2 diabetes doesn't seem to "
 'be associated with any increased risk for these kinds of problems. But '
 "wait--there's more to this story than meets the eye. Let's just skip the "
 '"repetition" part of this whole "sneak oil" thing. Go away, sugar eaters. '
 'Get over it.')


In [6]:
link = 'https://www.ted.com/talks/ted_ed_how_will_ai_change_the_world'
pprint(run(link))

('Artificial intelligence uses artificial intelligence to predict how humans '
 'will behave in the future. According to Stuart Russell, an AI expert , '
 'humans often fail to understand "the fixed objective" because they don\'t '
 "have enough information to make decisions about what's important to them. "
 "For example, if you ask someone to buy you coffee, but they can't figure out "
 "how to fix the world's ills, then there's no way to adapt to this new "
 'technology.')
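
The talk id lookup in `ted_talk_transcriber` raises an `IndexError` when the page does not contain `project_masters/`. A possible more defensive variant is sketched below, assuming a regular expression is acceptable; `extract_talk_id` is a hypothetical helper, and the sample HTML (including the `1234` id and the file name after it) is invented for illustration.

```python
import re
from typing import Optional


def extract_talk_id(html: str) -> Optional[str]:
    """Return the path segment after 'project_masters/', or None if absent."""
    # capture everything up to the next slash after "project_masters/"
    match = re.search(r"project_masters/([^/\"']+)/", html)
    return match.group(1) if match else None


# invented page fragment; the real page markup may differ
sample_html = '<link href="https://hls.ted.com/project_masters/1234/manifest.m3u8">'

talk_id = extract_talk_id(sample_html)
missing = extract_talk_id("<html></html>")
print(talk_id, missing)
```

Returning `None` lets the caller decide whether to skip the talk or raise a clearer error than the bare `IndexError`.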
