This notebook gathers the list of TED talks and their URLs. The next notebook, *02B_Collect_metadata*, uses these URLs to collect metadata for each talk.

## Import libraries

In [1]:
import pandas as pd
import requests
from bs4 import BeautifulSoup

## 2A.1 Gather URLs of talks

The list of talks is retrieved from the [TED Talks](https://www.ted.com/talks) website, filtered for talks in **English**.

Scrape date: 26/07/2020

In [2]:
# instantiate empty list to save list of talks' titles and urls
talks = []

# there are 115 pages of videos; note that range(0, 115) requests page=0 through page=114
for i in range(0, 115):
    
    # set url to make request from
    url = 'https://www.ted.com/talks?language=en&page=' + str(i) + '&sort=newest'
    
    # make request
    res = requests.get(url, headers={'User-agent': 'S bot 1.0'})
    
    # stop scraping if the request did not succeed
    if res.status_code != 200:
        print(f'Page {i}: status error {res.status_code}')
        break
    
    # create a BeautifulSoup object
    soup = BeautifulSoup(res.content, 'lxml')
    
    # each matching <h4> wraps the <a> tag that holds the talk title and relative url
    for e in soup.find_all('h4', {'class': 'f-w:700 h9 m5'}):
        link = e.find('a')
        talk = {
            'title': link.text.strip(),
            'url': 'https://www.ted.com' + link['href'],
        }
        talks.append(talk)
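
The cell above builds both the listing URL and each talk URL by string concatenation. As a minimal sketch of the same step using only the standard library, the helpers below assemble the query string with `urlencode` and resolve relative links with `urljoin`. The names `BASE_URL`, `build_page_url` and `absolute_talk_url` are hypothetical and not part of the original notebook, and the talk slug is made up for illustration.

```python
from urllib.parse import urlencode, urljoin

BASE_URL = 'https://www.ted.com'  # site root, as used in the scraping cell


def build_page_url(page, language='en', sort='newest'):
    """Return the listing URL for one results page (hypothetical helper)."""
    # urlencode keeps the dict's insertion order: language, page, sort
    query = urlencode({'language': language, 'page': page, 'sort': sort})
    return urljoin(BASE_URL, '/talks') + '?' + query


def absolute_talk_url(href):
    """Resolve a relative href from the listing page against the site root."""
    return urljoin(BASE_URL, href)


print(build_page_url(3))
print(absolute_talk_url('/talks/some_talk_slug'))  # made-up slug for illustration
```

Keeping the query parameters in a dict makes it easier to change the language or sort order later without editing a long string.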

In [3]:
# convert list of titles and urls to a df
talks_urls = pd.DataFrame(talks, columns=['title', 'url'])

In [4]:
# check for nulls
print(talks_urls.isnull().sum())

# check for duplicates
talks_urls.duplicated(keep='first').sum()

title    0
url      0
dtype: int64


36

In [5]:
# remove duplicated entries
talks_urls.drop_duplicates(keep='first', inplace=True)
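
To illustrate what `duplicated(keep='first')` and `drop_duplicates(keep='first')` do, here is a small sketch on a toy DataFrame. The titles and URLs are made up and the variable names `toy`, `n_duplicates` and `deduped` are assumptions for this example only.

```python
import pandas as pd

# toy data with made-up titles and urls; the second row repeats the first
toy = pd.DataFrame({
    'title': ['Talk A', 'Talk A', 'Talk B'],
    'url': ['https://example.com/a', 'https://example.com/a', 'https://example.com/b'],
})

# keep='first' flags only the later copies, so the sum counts removable rows
n_duplicates = int(toy.duplicated(keep='first').sum())

# without inplace=True, drop_duplicates returns a new frame;
# reset_index(drop=True) renumbers the remaining rows from 0
deduped = toy.drop_duplicates(keep='first').reset_index(drop=True)

print(n_duplicates, len(deduped))
```

Note that a row counts as a duplicate only when every column matches, so two talks with the same title but different URLs would both be kept.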

In [6]:
# check number of entries
talks_urls.shape[0]

4104
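
As a quick sanity check on these counts, the sketch below reconstructs the number of rows scraped before deduplication from the figures reported above. The interpretation in the final comment is an assumption and was not verified against the website.

```python
requests_made = 115   # pages requested by range(0, 115)
duplicates = 36       # duplicated rows reported by the duplicate check
unique_talks = 4104   # rows left after drop_duplicates

# rows collected before deduplication
rows_scraped = unique_talks + duplicates

# split the total evenly across the requests
per_page, remainder = divmod(rows_scraped, requests_made)
print(rows_scraped, per_page, remainder)

# Assumption: an even split with no remainder would be consistent with every
# request returning the same number of talks and one page being fetched twice.
```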

## Export URLs as CSV file

In [7]:
talks_urls.to_csv('../../data/talks_urls.csv', index=False)
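
One way to confirm the export is to read the file back and compare it with what was written. The sketch below does this with a temporary directory rather than the project path, on a one-row frame with made-up values; `frame`, `restored` and `path` are names assumed for this example.

```python
import os
import tempfile

import pandas as pd

# one-row stand-in for talks_urls, with made-up values
frame = pd.DataFrame({'title': ['Talk A'], 'url': ['https://example.com/a']})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'talks_urls.csv')
    # index=False keeps the row index out of the file, as in the export cell
    frame.to_csv(path, index=False)
    restored = pd.read_csv(path)

print(list(restored.columns), restored.shape)
```

Because `index=False` is used, the file contains only the `title` and `url` columns, so reading it back does not produce an extra unnamed index column.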