# TED Talks Transcript Scraper Notebook

In [1]:
from bs4 import BeautifulSoup
from urllib.request import urlopen
from urllib.parse import urljoin
import time

## Data structure

```json
{
    "Talk Title" : "hoge", 
    "Talk Link Address" : "https://www.ted.com/talks/hoge?language=en", 
    "Language" : "en", 
    "Topics" : ["topic1", "topic2", "topic3"],
    "Transcript Text" : ["sentence1", "sentence2", "sentence3"]
}
```
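
A record in this shape can be assembled as a plain Python dict and serialized with the standard `json` module. The following is a minimal sketch: the helper name `make_talk_record` and the sample values are hypothetical and do not appear in the notebook itself.

```python
import json


def make_talk_record(title, link, language, topics, transcript):
    """Build one talk record using the keys of the data structure above."""
    return {
        "Talk Title": title,
        "Talk Link Address": link,
        "Language": language,
        "Topics": list(topics),
        "Transcript Text": list(transcript),
    }


# Placeholder values for illustration only.
record = make_talk_record(
    "hoge",
    "https://www.ted.com/talks/hoge?language=en",
    "en",
    ["topic1", "topic2", "topic3"],
    ["sentence1", "sentence2", "sentence3"],
)

# ensure_ascii=False keeps non-ASCII text readable in the serialized output.
serialized = json.dumps(record, ensure_ascii=False, indent=4)
print(serialized)
```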

## Get the link addresses of all talks

### Get the link to each talk listed on the first page

https://www.ted.com/talks

In [2]:
base_url = "https://www.ted.com/talks"
target_url = base_url + "?language=" + "en"
html = urlopen(target_url)

soup = BeautifulSoup(html.read(), "lxml")
talk_link = soup.find_all("div", {"class": "talk-link"})

In [3]:
talk_addresses = [tl.find("h4", {"class": "h9"}).find("a").attrs['href'] for tl in talk_link]

### Convert the relative talk addresses to absolute addresses

In [4]:
talk_addresses = [urljoin(base_url, talk_address) for talk_address in talk_addresses]
for talk_address in talk_addresses:
    print(talk_address)

https://www.ted.com/talks/rebecca_mackinnon_we_can_fight_terror_without_sacrificing_our_rights?language=en
https://www.ted.com/talks/sebastian_kraves_the_era_of_personal_dna_testing_is_here?language=en
https://www.ted.com/talks/nadia_lopez_why_open_a_school_to_close_a_prison?language=en
https://www.ted.com/talks/david_burkus_why_you_should_know_how_much_your_coworkers_get_paid?language=en
https://www.ted.com/talks/eric_liu_let_s_make_voting_fun_again?language=en
https://www.ted.com/talks/abigail_marsh_why_some_people_are_more_altruistic_than_others?language=en
https://www.ted.com/talks/michael_murphy_architecture_that_s_built_to_heal?language=en
https://www.ted.com/talks/michael_shellenberger_how_fear_of_nuclear_power_is_hurting_the_environment?language=en
https://www.ted.com/talks/julie_lythcott_haims_how_to_raise_successful_kids_without_over_parenting?language=en
https://www.ted.com/talks/neha_narula_the_future_of_money?language=en
https://www.ted.com/talks/franz_freudenthal_a_new_wa
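
How `urljoin` resolves an `href` depends on the form of that `href`. The sketch below illustrates three common cases. The `href` values are hypothetical examples, not values copied from the page.

```python
from urllib.parse import urljoin

base_url = "https://www.ted.com/talks"

# Hypothetical href values, chosen only to illustrate the resolution rules.
root_relative = "/talks/some_speaker_some_title?language=en"
query_only = "?language=en&page=2"
already_absolute = "https://www.ted.com/talks/some_speaker_some_title?language=en"

# A root-relative path replaces the path of the base URL.
resolved_path = urljoin(base_url, root_relative)
# A query-only reference keeps the base path and replaces the query string.
resolved_query = urljoin(base_url, query_only)
# An absolute URL is returned unchanged.
resolved_absolute = urljoin(base_url, already_absolute)

print(resolved_path)
print(resolved_query)
print(resolved_absolute)
```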

### Collect the talk link addresses obtained from the first page

In [5]:
all_talk_link_address = []
# extend() appends every address from the first page in one call
all_talk_link_address.extend(talk_addresses)
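
The first-page steps (find each `talk-link` block, read the `href`, convert it to an absolute address) could be folded into one helper. This is a sketch under assumptions: the function name `extract_talk_links` and the sample HTML are hypothetical, the built-in `html.parser` is used instead of `lxml` so that no extra parser is needed, and the class names are the ones used in the cells above, which may not match the live site.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_talk_links(html, base_url):
    """Return the absolute talk URLs found in one listing page."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for card in soup.find_all("div", {"class": "talk-link"}):
        heading = card.find("h4", {"class": "h9"})
        anchor = heading.find("a") if heading is not None else None
        # Skip blocks without a usable link instead of raising AttributeError.
        if anchor is None or not anchor.get("href"):
            continue
        links.append(urljoin(base_url, anchor["href"]))
    return links


# Hand-written stand-in for a listing page (not real TED markup).
SAMPLE_HTML = """
<div class="talk-link"><h4 class="h9"><a href="/talks/a?language=en">A</a></h4></div>
<div class="talk-link"><h4 class="h9"><a href="/talks/b?language=en">B</a></h4></div>
<div class="talk-link"><h4 class="h9"></h4></div>
"""

sample_links = extract_talk_links(SAMPLE_HTML, "https://www.ted.com/talks")
print(sample_links)
```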

### Get the talk list on the next page (page 2)

In [6]:
pagination_div = soup.find("div", {"class" : "pagination"})

In [7]:
next_link_a = pagination_div.find("a", {"class": "pagination__next"})
next_link = next_link_a.attrs['href']
next_link = urljoin(base_url, next_link)
print(next_link)

https://www.ted.com/talks?language=en&page=2
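
The next-page lookup can also be written so that a missing pagination block or a missing "next" link returns `None` rather than raising an error. A minimal sketch, assuming the same class names as above; the function name `find_next_page_url` and the sample HTML are hypothetical.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def find_next_page_url(html, base_url):
    """Return the absolute URL of the next listing page, or None on the last page."""
    soup = BeautifulSoup(html, "html.parser")
    pagination = soup.find("div", {"class": "pagination"})
    if pagination is None:
        return None
    next_anchor = pagination.find("a", {"class": "pagination__next"})
    if next_anchor is None or not next_anchor.get("href"):
        return None
    return urljoin(base_url, next_anchor["href"])


# Hand-written stand-ins for a middle page and a last page (not real TED markup).
PAGE_WITH_NEXT = (
    '<div class="pagination">'
    '<a class="pagination__next" href="/talks?language=en&amp;page=2">Next</a>'
    "</div>"
)
LAST_PAGE = '<div class="pagination"><span class="pagination__next">Next</span></div>'

print(find_next_page_url(PAGE_WITH_NEXT, "https://www.ted.com/talks"))
print(find_next_page_url(LAST_PAGE, "https://www.ted.com/talks"))
```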


In [8]:
page_counter = 3
while True:
    target_url = next_link
    html = urlopen(target_url)
    soup = BeautifulSoup(html.read(), "lxml")
    
    talk_link = soup.find_all("div", {"class" : "talk-link"})
    talk_addresses = [tl.find("h4", {"class": "h9"}).find("a").attrs['href'] for tl in talk_link]
    talk_addresses = [urljoin(base_url, talk_address) for talk_address in talk_addresses]
    
    print("page: %d" % page_counter)
#     for talk_address in talk_addresses:
#         print(talk_address)
    
    # Add the link addresses
    for talk_address in talk_addresses:
        all_talk_link_address.append(talk_address)

    # Get the next page
    pagination_div = soup.find("div", {"class" : "pagination"})
    next_link_a = pagination_div.find("a", {"class": "pagination__next"})

    # Stop if there is no next page
    if next_link_a is None:
        break

    next_link = next_link_a.attrs['href']
    next_link = urljoin(base_url, next_link)
    print(next_link)

    page_counter += 1
    time.sleep(2)

page: 3
https://www.ted.com/talks?language=en&page=3
page: 4
https://www.ted.com/talks?language=en&page=4
page: 5
https://www.ted.com/talks?language=en&page=5
page: 6
https://www.ted.com/talks?language=en&page=6
page: 7
https://www.ted.com/talks?language=en&page=7
page: 8
https://www.ted.com/talks?language=en&page=8
page: 9
https://www.ted.com/talks?language=en&page=9
page: 10
https://www.ted.com/talks?language=en&page=10
page: 11
https://www.ted.com/talks?language=en&page=11
page: 12
https://www.ted.com/talks?language=en&page=12
page: 13
https://www.ted.com/talks?language=en&page=13
page: 14
https://www.ted.com/talks?language=en&page=14
page: 15
https://www.ted.com/talks?language=en&page=15
page: 16
https://www.ted.com/talks?language=en&page=16
page: 17
https://www.ted.com/talks?language=en&page=17
page: 18
https://www.ted.com/talks?language=en&page=18
page: 19
https://www.ted.com/talks?language=en&page=19
page: 20
https://www.ted.com/talks?language=en&page=20
page: 21
https://www.ted
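
The pagination loop can be separated from the HTTP and parsing details, which makes it possible to try it without touching the network. The sketch below is one possible arrangement, not the notebook's own code: `collect_all_links`, `load_page`, `delay_seconds`, `max_pages`, and the fake pages are all hypothetical names introduced for illustration.

```python
import time


def collect_all_links(first_url, load_page, delay_seconds=0.0, max_pages=1000):
    """Follow next-page links and collect the talk links from every page.

    load_page(url) must return (talk_links, next_url), where next_url is
    None on the last page.
    """
    all_links = []
    visited = set()
    url = first_url
    # Stop on the last page, on a repeated URL, or at the page limit.
    while url is not None and url not in visited and len(visited) < max_pages:
        visited.add(url)
        talk_links, next_url = load_page(url)
        all_links.extend(talk_links)
        if next_url is not None and delay_seconds > 0:
            time.sleep(delay_seconds)  # pause between requests
        url = next_url
    return all_links


# Fake three-page site used in place of real HTTP requests.
FAKE_PAGES = {
    "page1": (["talk_a", "talk_b"], "page2"),
    "page2": (["talk_c"], "page3"),
    "page3": (["talk_d"], None),
}

links = collect_all_links("page1", lambda url: FAKE_PAGES[url])
print(links)
```

Tracking visited URLs and capping the page count are guards against a pagination link that points back to an earlier page, which would otherwise keep a `while True` loop running indefinitely.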

## Get the topics of a talk

## Try getting the topics for the first talk

In [9]:
target_talk_url = all_talk_link_address[0]
print(target_talk_url)

https://www.ted.com/talks/rebecca_mackinnon_we_can_fight_terror_without_sacrificing_our_rights?language=en


In [10]:
html = urlopen(target_talk_url)
soup = BeautifulSoup(html.read(), "lxml")

In [11]:
talk_topics_div = soup.find("div", {"class": "talk-topics"})

In [12]:
talk_topics_items = talk_topics_div.find_all("li", {"class":"talk-topics__item"})
type(talk_topics_items)

bs4.element.ResultSet

In [13]:
topic_list = []
for tti in talk_topics_items:
    topic = tti.find("a")
    if topic is not None:
        topic_str = topic.get_text().replace("\n","")
        print(topic_str)
        topic_list.append(topic_str)

Internet
Middle East
Communication
Democracy
Global issues
Privacy
Protests
Security
Social media
Society
Technology
Terrorism
Violence
War
Web
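
The topic extraction can be wrapped in a function that returns a list ready for the `"Topics"` field. A minimal sketch, assuming the class names used in the cells above; `extract_topics` and the sample HTML are hypothetical, and `html.parser` stands in for `lxml`.

```python
from bs4 import BeautifulSoup


def extract_topics(html):
    """Return the topic names listed on a talk page, in document order."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", {"class": "talk-topics"})
    if container is None:
        return []
    topics = []
    for item in container.find_all("li", {"class": "talk-topics__item"}):
        anchor = item.find("a")
        if anchor is None:
            continue
        # Collapse newlines and repeated spaces inside the link text.
        text = " ".join(anchor.get_text().split())
        if text:
            topics.append(text)
    return topics


# Hand-written stand-in for the topics block (not real TED markup).
SAMPLE_TOPICS_HTML = """
<div class="talk-topics"><ul>
<li class="talk-topics__item"><a href="/topics/internet">
Internet
</a></li>
<li class="talk-topics__item"><a href="/topics/middle+east">
Middle East
</a></li>
<li class="talk-topics__item">no link here</li>
</ul></div>
"""

sample_topics = extract_topics(SAMPLE_TOPICS_HTML)
print(sample_topics)
```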


## Get the transcript of a talk

In [14]:
target_transcrpit_url = target_talk_url.replace("?language=en", "/transcript?language=en")
print(target_transcrpit_url)

https://www.ted.com/talks/rebecca_mackinnon_we_can_fight_terror_without_sacrificing_our_rights/transcript?language=en
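
The string replacement above only works when the URL ends with exactly `?language=en`. A variant that works for any language code is to split the URL and append `/transcript` to its path. This is a sketch of that alternative; the function name `to_transcript_url` is hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def to_transcript_url(talk_url):
    """Insert '/transcript' at the end of the path, keeping the query string."""
    parts = urlsplit(talk_url)
    path = parts.path.rstrip("/")
    # Do nothing if the URL already points at the transcript page.
    if not path.endswith("/transcript"):
        path += "/transcript"
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


talk_url = (
    "https://www.ted.com/talks/"
    "rebecca_mackinnon_we_can_fight_terror_without_sacrificing_our_rights?language=en"
)
transcript_url = to_transcript_url(talk_url)
print(transcript_url)
```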


In [15]:
html = urlopen(target_transcrpit_url)
soup = BeautifulSoup(html.read(), "lxml")

In [16]:
talk_transcrpit_para = soup.find_all("p", {"class": "talk-transcript__para"})

In [17]:
for ttp in talk_transcrpit_para:
    transcript_text = ttp.find("span", {"class": "talk-transcript__para__text"})
    # Join on whitespace so that words on adjacent lines do not fuse together
    print(" ".join(transcript_text.get_text().split()))

There's a big question at the center of life in our democracies today: How do we fight terror without destroying democracies, without trampling human rights?
I've spent much of my career working with journalists, with bloggers, with activists, with human rights researchers all around the world, and I've come to the conclusion that if our democratic societies do not double down on protecting and defending human rights, freedom of the press and a free and open internet, radical extremist ideologies are much more likely to persist.
(Applause)
OK, all done. Thank you very much. No, just joking.
(Laughter)
I actually want to drill down on this a little bit.
So, one of the countries that has been on the frontlines of this issue is Tunisia, which was the only country to come out of the Arab Spring with a successful democratic revolution. Five years later, they're struggling with serious terror attacks and rampant ISIS recruitment. And many Tunisians are calling on their government to do whatever it takes to keep them safe.
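
To fill the `"Transcript Text"` field, the paragraphs can be gathered into a list rather than printed. A minimal sketch, assuming the class names used in the cells above; `extract_transcript` and the sample HTML are hypothetical, and `html.parser` stands in for `lxml`. Whitespace is normalized by splitting and re-joining, so line breaks inside a paragraph become single spaces.

```python
from bs4 import BeautifulSoup


def extract_transcript(html):
    """Return one whitespace-normalized string per transcript paragraph."""
    soup = BeautifulSoup(html, "html.parser")
    paragraphs = []
    for para in soup.find_all("p", {"class": "talk-transcript__para"}):
        span = para.find("span", {"class": "talk-transcript__para__text"})
        if span is None:
            continue
        # split() drops all runs of whitespace, including newlines.
        text = " ".join(span.get_text().split())
        if text:
            paragraphs.append(text)
    return paragraphs


# Hand-written stand-in for a transcript page (not real TED markup).
SAMPLE_TRANSCRIPT_HTML = """
<p class="talk-transcript__para">
  <span class="talk-transcript__para__text">first line
second line
third line</span>
</p>
<p class="talk-transcript__para">
  <span class="talk-transcript__para__text">(Applause)</span>
</p>
<p class="talk-transcript__para">no text span</p>
"""

sample_paragraphs = extract_transcript(SAMPLE_TRANSCRIPT_HTML)
print(sample_paragraphs)
```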
