<a href="https://colab.research.google.com/github/r-c-c/ted_talks_summarizer/blob/main/ted_talk_transcriber.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install -q transformers

In [2]:
from transformers import pipeline
from requests import get
from re import sub
from pprint import pprint
from warnings import filterwarnings
# suppress warnings
filterwarnings("ignore")

In [3]:
# load the Long-T5 book-summary model; device=0 places it on the first GPU
summarizer = pipeline("summarization",
                      'pszemraj/long-t5-tglobal-base-16384-book-summary',
                      device=0)
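
The cell above passes `device=0`, which expects a CUDA GPU to be present. A minimal sketch of choosing the device at runtime follows; `pick_device` is a hypothetical helper, not part of the notebook, and it assumes the usual pipeline convention that `-1` means CPU.

```python
def pick_device():
    """Return 0 (first GPU) when CUDA is usable, otherwise -1 (CPU)."""
    try:
        import torch  # assumption: PyTorch is the installed backend
    except ImportError:
        return -1
    return 0 if torch.cuda.is_available() else -1

device = pick_device()
print(device)
```

The result could then be passed as `device=device` when building the pipeline, so the same notebook runs on a CPU-only runtime, only more slowly.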

In [4]:
def clean_text(text):
    """Clean the subtitle text of a TED talk.

    Args:
        text (str): raw subtitle (WebVTT) text of the talk
    Returns:
        cleaned_text (str): subtitle text without cues, timestamps, or blank lines
    """
    # remove parenthesized cues such as (Applause); the character class stops
    # the match from swallowing text between two cues on the same line
    text = sub(r'\([^)]*\)', '', text)
    # split into lines and drop the first one (the WEBVTT header line)
    lines = text.split("\n")[1:]
    # strip whitespace from every line
    lines = [line.strip() for line in lines]
    # drop empty lines and timestamp lines, which contain "-->"
    cleaned_text = " ".join(line for line in lines if line and "-->" not in line)
    return cleaned_text
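
To see what the cleaning steps do without downloading anything, here is a small standalone sketch. The subtitle snippet is made up for illustration, and `clean_vtt_sketch` is a hypothetical restatement of the same steps (it uses a character class in the regex, which is an assumption about the intended behavior when two cues share a line).

```python
from re import sub

# Made-up WebVTT snippet, for illustration only
SAMPLE_VTT = """WEBVTT

00:00:00.000 --> 00:00:02.000
Hello everyone.

00:00:02.000 --> 00:00:04.000
(Laughter) Thanks for coming (Applause)
"""

def clean_vtt_sketch(text):
    """Drop parenthesized cues, the header line, timestamps, and blank lines."""
    # remove cues such as (Laughter) without spanning from one cue to the next
    text = sub(r'\([^)]*\)', '', text)
    # skip the first line, which holds the WEBVTT header
    lines = [line.strip() for line in text.split("\n")[1:]]
    # keep only non-empty lines that are not timestamps
    return " ".join(line for line in lines if line and "-->" not in line)

cleaned = clean_vtt_sketch(SAMPLE_VTT)
print(cleaned)  # Hello everyone. Thanks for coming
```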


In [5]:
def ted_talk_transcriber(link):
    """Fetch the English transcript of a TED talk from its URL.

    Args:
        link (str): URL of the TED talk
    Returns:
        cleaned_transcript (str): transcript of the TED talk
    """
    # request the talk page
    page = get(link)
    # extract the unique talk id needed to reach the subtitle file
    talk_id = page.text.split("project_masters/")[1].split("/")[0]
    raw_text = get(f'https://hls.ted.com/project_masters/{talk_id}/subtitles/en/full.vtt').text
    cleaned_transcript = clean_text(raw_text)
    return cleaned_transcript
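
The id extraction relies on the talk page containing a `project_masters/<id>/` fragment. The sketch below shows that step offline on a made-up HTML string; `extract_talk_id` is a hypothetical helper, and the sample markup (including the `manifest.m3u8` file name and the id `12345`) is an assumption, not the real page layout.

```python
def extract_talk_id(html):
    """Return the id that follows 'project_masters/', or None if it is absent."""
    marker = "project_masters/"
    if marker not in html:
        return None
    return html.split(marker)[1].split("/")[0]

# Made-up page fragment, for illustration only
sample_html = '<video src="https://hls.ted.com/project_masters/12345/manifest.m3u8">'

talk_id = extract_talk_id(sample_html)
subtitle_url = f"https://hls.ted.com/project_masters/{talk_id}/subtitles/en/full.vtt"
print(talk_id)       # 12345
print(subtitle_url)  # https://hls.ted.com/project_masters/12345/subtitles/en/full.vtt
```

Returning `None` when the marker is missing gives a clearer failure than the `IndexError` a bare `split(...)[1]` would raise if the page layout changes.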

In [6]:
def summ(link):
    """Produce a summary of a TED talk from its link.

    First transcribes the talk, then summarizes the transcript.

    Args:
        link (str): URL of the TED talk
    Returns:
        summary (str): summary of the TED talk transcript
    """
    transcript = ted_talk_transcriber(link)
    result = summarizer(transcript)
    summary = result[0]['summary_text']
    return summary
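
The function indexes the pipeline output as `result[0]['summary_text']`, that is, a list holding one dict per input. A small sketch with a stand-in summarizer makes that shape visible without loading the model; `summarize_with` and `fake_summarizer` are hypothetical names used only for this illustration.

```python
def summarize_with(summarizer_fn, transcript):
    """Return the summary string from a pipeline-style callable."""
    result = summarizer_fn(transcript)
    return result[0]['summary_text']

def fake_summarizer(text):
    # Hypothetical stand-in: echoes the first sentence in the pipeline's output shape
    first_sentence = text.split(".")[0].strip() + "."
    return [{'summary_text': first_sentence}]

summary = summarize_with(fake_summarizer, "Yeast is a fungus. It lives in the gut.")
print(summary)  # Yeast is a fungus.
```

Passing the summarizer in as an argument is one way to check the surrounding glue code on a machine without a GPU.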

In [7]:
short_link='https://www.ted.com/talks/jen_gunter_the_truth_about_yeast_in_your_body'
pprint(summ(short_link))

('In this chapter, Jen Gunter explains some of the most common myths about '
 "yeast infections and how they're related to overeating sugar. The first "
 "theory is that it's because of the overgrow of the yeast in the gut; then "
 "there's more sugar in the blood, which feeds the yeast. There's also no way "
 'that eating too much sugar can lead to yeast infection.')


In [8]:
link='https://www.ted.com/talks/ted_ed_how_will_ai_change_the_world'
pprint(summ(link))

('In this chapter, the narrator explains how artificial intelligence will '
 "change people's lives. He uses examples from his own life to explain that "
 "humans don't have the same sense of responsibility as machines do. He says "
 "that when you ask if someone can get you ice cream, you don'tes mean that "
 'they want to kill everybody in Starbucks to make sure you get it. But when '
 'you say that something is important, you really mean that everything else in '
 'the world matters. For example, if you wanted to fix the ocean '
 "acidification, you wouldn't need to kill all the fish and seaweed. If you "
 'asked for coffee, you would be satisfied with getting it right away. The '
 'problem with building an intelligent system now is that we give it a fixed '
 "goal. It doesn't matter whether or not there are side effects of the "
 'reaction on the ocean or anything else. When you build an intelligent '
 "machine, you know what the true purpose is. That's when you start to exhibit "
