# Introduction

Several commonly used libraries work together to scrape data from the web. Here, I introduce 'Requests' and 'Beautiful Soup' for accessing and parsing data from webpages.

I provide a simple example of scraping an English transcript from the TED Talks website and transforming it into a data frame that can be further processed.

## **Requests:**
- It is a library that you can use in Python to send HTTP requests (install it with pip install requests from the command line, or with !pip install requests in a notebook cell). In short, it lets you access websites and retrieve their content.
- Here is the user guide: https://requests.readthedocs.io/en/master/

In [1]:
# import the library
import requests

In [2]:
# send a GET request to a webpage;
# here I access a transcript page on TED Talks

ted_result = requests.get('https://www.ted.com/talks/rahaf_harfoush_how_burnout_makes_us_less_creative/transcript?language=en')
ted_result


<Response [200]>

- Here, <Response [200]> means that you have successfully accessed the page.
- Otherwise, it will show a different status code, for example <Response [404]> when the page is not found.
- You can also use .status_code to check whether the request was successful.
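
To make the status codes a little more concrete, here is a minimal sketch that turns a numeric code into a readable label. It uses Python's standard-library http.HTTPStatus rather than Requests itself, so it runs without any network access; the helper name describe_status is my own and is not part of either library.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short label such as '200 OK' for an HTTP status code."""
    status = HTTPStatus(code)
    return f"{status.value} {status.phrase}"


# A few codes you are likely to meet while scraping.
for code in (200, 404, 500):
    print(describe_status(code))
```

Codes in the 200 range indicate success, codes in the 400 range indicate a problem with the request (such as a wrong URL), and codes in the 500 range indicate a problem on the server.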

In [3]:
print(ted_result.status_code)

200


In [4]:
# check the headers
ted_result.headers

{'Connection': 'keep-alive', 'Content-Length': '21066', 'Content-Type': 'text/html; charset=utf-8', 'Status': '200 OK', 'X-Frame-Options': 'SAMEORIGIN', 'X-XSS-Protection': '1; mode=block', 'X-Content-Type-Options': 'nosniff', 'X-Download-Options': 'noopen', 'X-Permitted-Cross-Domain-Policies': 'none', 'Referrer-Policy': 'strict-origin-when-cross-origin', 'Cache-Control': 'no-transform, public, max-age=0, s-maxage=180', 'ETag': 'W/"9ee1cd73491b3412fea3db361b6812fd"', 'Strict-Transport-Security': 'max-age=31536000; includeSubDomains', 'Content-Encoding': 'gzip', 'Accept-Ranges': 'bytes', 'Date': 'Fri, 06 Nov 2020 05:58:49 GMT', 'Via': '1.1 varnish', 'Age': '0', 'X-Served-By': 'e12, cache-bwi5121-BWI, cache-sjc10024-SJC', 'X-Cache': 'MISS, MISS', 'X-Cache-Hits': '0, 0', 'Vary': 'Accept-Encoding', 'Set-Cookie': '_nu=1604642329; Expires=Wed, 05 Nov 2025 05:58:49 GMT; path=/, _abby=Eoe2Kj2ElSETTbm; Expires=Wed, 05 Nov 2025 05:58:49 GMT; Path=/; Domain=.ted.com, _abby_aa_forever=b; Expires=F
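
The 'Content-Type' entry in the headers above tells you what kind of document came back and how it is encoded. As a rough sketch, the value can be split into its parts with the standard library's email.message.Message; the header string is copied from the output above, and the variable names are only for illustration.

```python
from email.message import Message

# Header value copied from the response headers shown above.
content_type = 'text/html; charset=utf-8'

# Message can split a Content-Type value into media type and parameters.
msg = Message()
msg['Content-Type'] = content_type

media_type = msg.get_content_type()
charset = msg.get_content_charset()

print(media_type)
print(charset)
```

Knowing that the page is HTML encoded as UTF-8 confirms that an HTML parser is the right tool for the next step.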

In [5]:
# check the content
ted_result.content



## **Beautiful Soup:**
- It is a library that helps to parse HTML and XML data, so that you can extract the content/data that you really need.
- Here is the documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

In [6]:
# import the library
from bs4 import BeautifulSoup

In [7]:
# after accessing the webpage, I want to extract information from it.

# store the content 
ted_src = ted_result.content

# use BeautifulSoup to process the content 
soup = BeautifulSoup(ted_src, 'lxml')
soup

<!DOCTYPE html>
<!--[if lt IE 8]> <html class="no-js loggedout oldie ie7" lang="en"> <![endif]--><!--[if IE 8]> <html class="no-js loggedout oldie ie8" lang="en"> <![endif]--><!--[if gt IE 8]><!--><html class="no-js loggedout" lang="en"><!--<![endif]-->
<head>
<script>
  (function (H){
  H.className=H.className.replace(/\bno-js\b/,'js');
  if (('; '+document.cookie).match(/; _ted_user_id=/)) H.className=H.className.replace(/\bloggedout\b/,'loggedin');
  })(document.documentElement)
</script><meta charset="utf-8"/>
<meta content="https://pi.tedcdn.com/r/talkstar-photos.s3.amazonaws.com/uploads/12bb712e-e93f-4a3d-87b7-d71c3572d2f7/RahafHarfoush_2020V-embed.jpg?c=1050%2C550&amp;w=1050" property="og:image"/>
<meta content="https://pi.tedcdn.com/r/talkstar-photos.s3.amazonaws.com/uploads/12bb712e-e93f-4a3d-87b7-d71c3572d2f7/RahafHarfoush_2020V-embed.jpg?c=1050%2C550&amp;w=1050" property="og:image:secure_url"/>
<meta content="1050" property="og:image:width"/>
<meta content="550" property="og

In [8]:
# access specific information from the processed content
# for instance, I want to gather the English transcript;
# the processed page source above shows that the English transcript is stored in 'p' tags

ted_transcripts = soup.find_all('p')
ted_transcripts

[<p>
 											A few years ago, my obsession
 with productivity
 											got so bad that I suffered
 an episode of burnout
 											that scared the hell out of me.
 											I'm talking insomnia,
 weight gain, hair loss — the works.
 											I was so overworked that my brain
 											literally couldn't come up
 with another idea.
 											That indicated to me that my identity
 was linked with this idea of productivity.
 									</p>,
 <p>
 											[The Way We Work]
 									</p>,
 <p>
 											Do you feel guilty if you haven't
 been productive enough during the day?
 											Do you spend hours
 reading productivity hacks,
 											trying new frameworks
 and testing new apps
 											to get even more done?
 											I've tried them all —
 task apps, calendar apps,
 											time-management apps,
 											things that are meant to manage your day.
 											We've been so obsessed with doing more
 											that we've missed
 the most important

This returns a list of the 'p' elements on the webpage. However, the content in the last 'p' tag is not part of the transcript of the speech, so I need to remove it.

In [9]:
# check the length of the content
len(ted_transcripts)

15

There are 15 small paragraphs, and the last paragraph is the one I do not need. 
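
Before slicing, it helps to check how many items a given slice keeps. The sketch below uses a stand-in list of 15 placeholder strings (not the real 'p' tags) to compare two ways of trimming the end of a list.

```python
# Stand-in for the 15 'p' elements; the strings are placeholders only.
paragraphs = [f"paragraph {i}" for i in range(1, 16)]

# A negative stop index drops exactly the last element.
without_last = paragraphs[:-1]

# A fixed stop index keeps that many elements from the start.
first_13 = paragraphs[:13]

print(len(paragraphs))
print(len(without_last))
print(len(first_13))
```

With 15 elements, [:-1] keeps 14 of them, while [:13] keeps 13 and therefore drops the last two.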

In [10]:
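# get rid of the trailing 'p' tags that are not needed
# (note: with 15 elements, [:13] keeps the first 13 and drops the last two)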
ted_transcripts_sub = ted_transcripts[:13]
ted_transcripts_sub

[<p>
 											A few years ago, my obsession
 with productivity
 											got so bad that I suffered
 an episode of burnout
 											that scared the hell out of me.
 											I'm talking insomnia,
 weight gain, hair loss — the works.
 											I was so overworked that my brain
 											literally couldn't come up
 with another idea.
 											That indicated to me that my identity
 was linked with this idea of productivity.
 									</p>,
 <p>
 											[The Way We Work]
 									</p>,
 <p>
 											Do you feel guilty if you haven't
 been productive enough during the day?
 											Do you spend hours
 reading productivity hacks,
 											trying new frameworks
 and testing new apps
 											to get even more done?
 											I've tried them all —
 task apps, calendar apps,
 											time-management apps,
 											things that are meant to manage your day.
 											We've been so obsessed with doing more
 											that we've missed
 the most important

In [11]:
# clean up the text by removing the 'p' tags, newlines, and tabs
ted_transcripts_clean = str(ted_transcripts_sub).replace('\n','').replace('\t','').replace('<p>','').replace('</p>','')
ted_transcripts_clean

"[A few years ago, my obsessionwith productivitygot so bad that I sufferedan episode of burnoutthat scared the hell out of me.I'm talking insomnia,weight gain, hair loss — the works.I was so overworked that my brainliterally couldn't come upwith another idea.That indicated to me that my identitywas linked with this idea of productivity., [The Way We Work], Do you feel guilty if you haven'tbeen productive enough during the day?Do you spend hoursreading productivity hacks,trying new frameworksand testing new appsto get even more done?I've tried them all —task apps, calendar apps,time-management apps,things that are meant to manage your day.We've been so obsessed with doing morethat we've missedthe most important thing.Many of these tools aren't helping.They're making things worse., OK, let's talk aboutproductivity for a second.Historically, productivityas we know it todaywas used during the industrial revolution.It was a system that measured performancebased on consistent output.You cloc

In [12]:
# # There are [] brackets that are not needed. 
# ted_transcripts_clean = ted_transcripts_clean.replace('[','').replace(']','')
# ted_transcripts_clean
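
One side effect of the str() and .replace() approach above is that words on either side of a line break get glued together (for example, 'obsessionwith'). A possible alternative, sketched here on a small made-up HTML snippet rather than the real TED page, is to call get_text() on each tag and then normalize the whitespace. I use the built-in 'html.parser' here only so that the sketch does not depend on lxml being installed.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page (not the real TED markup).
html = """
<html><body>
<p>
\t\t\tA few years ago, my obsession
with productivity
\t\t\tgot so bad.
</p>
<p>
\t\t\t[The Way We Work]
</p>
</body></html>
"""

soup = BeautifulSoup(html, 'html.parser')

# get_text() returns the text without tags; split() + join() collapses
# every run of newlines, tabs, and spaces into a single space.
paragraphs = [' '.join(p.get_text().split()) for p in soup.find_all('p')]

print(paragraphs)
```

This keeps each paragraph as its own list item, so the square brackets and '.,' separators produced by str() never appear in the first place.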

In [13]:
# split the text into a list of sentences on '.'
ted_transcripts_clean_ls = ted_transcripts_clean.split('.')
print(ted_transcripts_clean_ls)

['[A few years ago, my obsessionwith productivitygot so bad that I sufferedan episode of burnoutthat scared the hell out of me', "I'm talking insomnia,weight gain, hair loss — the works", "I was so overworked that my brainliterally couldn't come upwith another idea", 'That indicated to me that my identitywas linked with this idea of productivity', ", [The Way We Work], Do you feel guilty if you haven'tbeen productive enough during the day?Do you spend hoursreading productivity hacks,trying new frameworksand testing new appsto get even more done?I've tried them all —task apps, calendar apps,time-management apps,things that are meant to manage your day", "We've been so obsessed with doing morethat we've missedthe most important thing", "Many of these tools aren't helping", "They're making things worse", ", OK, let's talk aboutproductivity for a second", 'Historically, productivityas we know it todaywas used during the industrial revolution', 'It was a system that measured performancebase

In [14]:
# check how many sentences there are
print(len(ted_transcripts_clean_ls))

51
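
Splitting on '.' alone removes the periods and leaves sentences that end with '?' or '!' attached to the next one, as item 4 in the output above shows. Here is a minimal sketch of a regex-based split that handles all three endings. It assumes the text keeps a space after each sentence, which is not the case for the string produced above, and the sample text is shortened from the transcript for illustration.

```python
import re

# Sample text with a space after each sentence (an assumption of this sketch).
text = "Do you feel guilty? I've tried them all. Many of these tools aren't helping."

# Split at whitespace that follows '.', '?' or '!', keeping the punctuation.
sentences = [s.strip() for s in re.split(r'(?<=[.?!])\s+', text) if s.strip()]

print(sentences)
print(len(sentences))
```

This is still a simple rule: abbreviations such as 'Dr.' would be split as well, so a dedicated sentence tokenizer may be a better fit for larger projects.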


## Convert data into a Pandas DataFrame for further processing

In [15]:
# import libraries 
import pandas as pd

In [16]:
# convert the list into a dictionary keyed by sentence index
dict_cont = dict(enumerate(ted_transcripts_clean_ls))
dict_cont

{0: '[A few years ago, my obsessionwith productivitygot so bad that I sufferedan episode of burnoutthat scared the hell out of me',
 1: "I'm talking insomnia,weight gain, hair loss — the works",
 2: "I was so overworked that my brainliterally couldn't come upwith another idea",
 3: 'That indicated to me that my identitywas linked with this idea of productivity',
 4: ", [The Way We Work], Do you feel guilty if you haven'tbeen productive enough during the day?Do you spend hoursreading productivity hacks,trying new frameworksand testing new appsto get even more done?I've tried them all —task apps, calendar apps,time-management apps,things that are meant to manage your day",
 5: "We've been so obsessed with doing morethat we've missedthe most important thing",
 6: "Many of these tools aren't helping",
 7: "They're making things worse",
 8: ", OK, let's talk aboutproductivity for a second",
 9: 'Historically, productivityas we know it todaywas used during the industrial revolution',
 10: 'I

In [17]:
# convert the dictionary into a DataFrame
df_ted_trans = pd.DataFrame.from_dict(dict_cont, orient='index', columns=['sentence'])
df_ted_trans

| | sentence |
|---|---|
| 0 | [A few years ago, my obsessionwith productivit... |
| 1 | I'm talking insomnia,weight gain, hair loss — ... |
| 2 | I was so overworked that my brainliterally cou... |
| 3 | That indicated to me that my identitywas linke... |
| 4 | , [The Way We Work], Do you feel guilty if you... |
| 5 | We've been so obsessed with doing morethat we'... |
| 6 | Many of these tools aren't helping |
| 7 | They're making things worse |
| 8 | , OK, let's talk aboutproductivity for a second |
| 9 | Historically, productivityas we know it todayw... |


In [18]:
# save the file 
df_ted_trans.to_csv('df_ted_transcripts.csv')
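
As a shorter route to the same kind of table, a DataFrame can also be built directly from a list, since pandas numbers the rows from 0 on its own. The sketch below uses three placeholder sentences and writes the CSV to an in-memory buffer instead of a file, so that nothing is saved to disk; the names sentences, df, and csv_text are only for illustration.

```python
import io

import pandas as pd

# Placeholder sentences standing in for the cleaned transcript list.
sentences = ["First sentence", "Second sentence", "Third sentence"]

# Build the DataFrame directly; the index 0, 1, 2 is created automatically.
df = pd.DataFrame({'sentence': sentences})

# index=False leaves the row numbers out of the CSV output.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

print(df.shape)
print(csv_text)
```

Without index=False, the row numbers are written as a first column with an empty header, which shows up as 'Unnamed: 0' when the file is read back with pd.read_csv.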

## Summary

Above is a short introduction to web scraping libraries such as Requests and Beautiful Soup, with a simple example of extracting text from a TED Talks webpage.