# Web scraping basics with BeautifulSoup

## Introduction

As a data scientist, I often find myself looking for external data sources that could be relevant for my machine learning projects. The problem is that it is uncommon to find open source data sets that perfectly correspond to what you are looking for, or free APIs that give you access to data. In this case, web scraping can be one solution to get more data. 

#### What is web scraping?

Web scraping consists in gathering data available on websites. This can be done manually by a human user or by a bot. The latter can of course gather data much faster than a human user and that is why we are going to focus on this. Is it therefore technically possible to collect all the data of a website in a matter of minutes this kind of bot. The legality of this practice is not well defined however. Websites usually describe in their terms of use and in their robots.txt file if they allow scrapers or not.

#### How does it work?

Web scrapers gather website data in the same way a human would do it: the scraper goes onto a web page of the website, gets the relevant data, and move forward to the next web page. Every website has a different structure, that is why web scrapers are usually built to explore one website. The two important issues that arise during the implementation of a web scraper are the following:
- What is the structure of the web pages that contain relevant data?
- How can we get to those web pages?

In order to answer those questions, we need to understand a little how websites work. Websites are created using HTML (Hypertext Markup Language), along with CSS (Cascading Style Sheets) and JavaScript. HTML elements are separated by tags and they directly introduce content to the web page. Here is what a basic HTML document looks like:

<img src="images/basic_html_page.png">

We can see that the content of the first heading is contained between the 'h1' tags. The first paragraph is contained between the 'p' tags. On a real website, we need to find out between which tags the relevant data is and tell it to our scraper. We also need to specify which links should be explored and where they can be found among the HTML file. With all this information, our scraper should be able to gather the required data.

#### What tools are we going to use?

In this tutorial we are going to use the Python modules requests and BeautifulSoup.

Requests will allow us to send HTTP requests to get the HTML files.

Link to requests documentation: http://docs.python-requests.org/en/master/

BeautifulSoup will be used to parse the HTML files. It is one of the most used library for web scraping. Its is quite simple to use and has many features that help gathering websites data efficiently.

Link to BeautifulSoup documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

## Prerequisites

- python 2.7
- requests
- beautifulsoup4
- pandas

## Objective

We want to scrape the data of an online book store: http://books.toscrape.com/

This website is fictional so we can scrape it as much as we want.

In this tutorial we will be gathering the following information about all the products of the website:
- book title
- price
- availability
- image
- category
- rating

## Warm-up: get the content of the main page

First let's use the requests module to get the HTML of the website's main page.

In [1]:
main_url = "https://www.tempo.co/indeks/2019/03/04/nasional"

In [2]:
import requests
result = requests.get(main_url)

In [3]:
result.text[:1000]

'<!DOCTYPE html>\r\n<html id="tempoco-2017" lang="en">\r\n  <head>\r\n\r\n\t<title>Indeks 4 Maret 2019 - Tempo.co</title>\n\r\n  <meta charset="utf-8">\r\n  <meta name="viewport" content="width=device-width, initial-scale=1.0">\r\n  <link rel="original-source" href="https://www.tempo.co/" />\r\n    <link rel="canonical" href="https://nasional.tempo.co/read/1181493/hadiri-deklarasi-geunting-ksp-jokowi-serius-tangani-stunting" />\r\n  \t    <link rel="publisher" href="https://plus.google.com/109335234362909335582"/>\r\n\r\n    \r\n    <meta name="description" content=""/>\r\n    <meta name="keywords" content=""/>\r\n      \t<meta property="fb:app_id" content="332404380172618" />\r\n    <meta property="fb:pages" content="160355148441">\r\n    <meta property="og:locale" content="id_ID" />\r\n\t\t<meta content="all" name="robots"/>\r\n\t\t<meta content="index, follow" name="robots"/>\r\n\t\t<meta content="index, follow" name="yahoobot"/>\r\n\t\t<!-- iklan 0 =  -  -  -->\t\t\t\t<meta name="a

The result is quite messy! Let's make this more readable:

In [4]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(result.text, 'html.parser')

In [5]:
print(soup.prettify()[:1000])

<!DOCTYPE html>
<html id="tempoco-2017" lang="en">
 <head>
  <title>
   Indeks 4 Maret 2019 - Tempo.co
  </title>
  <meta charset="utf-8"/>
  <meta content="width=device-width, initial-scale=1.0" name="viewport"/>
  <link href="https://www.tempo.co/" rel="original-source"/>
  <link href="https://nasional.tempo.co/read/1181493/hadiri-deklarasi-geunting-ksp-jokowi-serius-tangani-stunting" rel="canonical"/>
  <link href="https://plus.google.com/109335234362909335582" rel="publisher"/>
  <meta content="" name="description">
   <meta content="" name="keywords">
    <meta content="332404380172618" property="fb:app_id"/>
    <meta content="160355148441" property="fb:pages"/>
    <meta content="id_ID" property="og:locale">
     <meta content="all" name="robots"/>
     <meta content="index, follow" name="robots"/>
     <meta content="index, follow" name="yahoobot"/>
     <!-- iklan 0 =  -  -  -->
     <meta content="indek" name="adx:sections"/>
     <link href="https://www.tempo.co/rss" rel="al

The function prettify() makes the HTML more readable. However we will not use this directly to explore where the relevant data is.

Let's define a function to request and parse a HTML web page as we will need this a lot during this tutorial:

In [6]:
def getAndParseURL(url):
    result = requests.get(url)
    soup = BeautifulSoup(result.text, 'html.parser')
    return(soup)

## Find book URLs on the main page

Now let's start to dive deeper into the subject. In order to get the book data, we need to be able to access their product page. The first step consist in finding the URL of every book product page.

In your browser, go onto the website main page, right-click on the name of a product and click on inspect. This will show you the HTML part of the web page corresponding to this element. Congratulations, you have found the first book link!

Note the structure of the HTML code:

<img src="images/inspect.png">

You can try this with every other product on the page: the structure is always the same. The link of the product corresponds to the 'href' attribute of the 'a' tag. This one belongs to an 'article' tag with the a class value 'product_pod'. This seems to be a reliable source to spot product URLs.

BeautifulSoup enables us to find those special 'article' tags. We can wall the find() function in order to find the first occurence of this tag in the HTML:

In [7]:
soup.find("div", class_ = "card card-type-1")

<div class="card card-type-1">
<div class="wrapper clearfix">
<a class="col" href="https://nasional.tempo.co/read/1181863/ksatria-airlangga-kasus-andi-arief-jokowi-serius-lawan-narkoba">
<img src="https://statik.tempo.co/data/2019/03/04/id_823956/823956_400.jpg"/>
</a>
<a class="col" href="https://nasional.tempo.co/read/1181863/ksatria-airlangga-kasus-andi-arief-jokowi-serius-lawan-narkoba">
<h2 class="title">Ksatria Airlangga: Kasus Andi Arief, Jokowi Serius Lawan Narkoba</h2>
<p>Ksatria Airlangga menyebut penangkapan Andi Arief bukti Jokowi serius lawan narkoba.</p>
<span class="col">4 Maret 2019 22:11 WIB</span>
</a>
</div>
</div>

We still have too much information.

Let's dive deeper in the tree by adding the other child tags:

In [8]:
soup.find("div", class_ = "card card-type-1").div.a

<a class="col" href="https://nasional.tempo.co/read/1181863/ksatria-airlangga-kasus-andi-arief-jokowi-serius-lawan-narkoba">
<img src="https://statik.tempo.co/data/2019/03/04/id_823956/823956_400.jpg"/>
</a>

Much better! But we only need the URL contained in the 'href' value. 

We can get this by adding .get("href") to the previous instruction:

Ok, we managed to get our first product URL with BeautifulSoup. 

Now let's gather all the products URLs on the main web page at once using the findAll() function:

In [20]:
a = soup.find('section', class_ = 'list list-type-1')
containers = a.find_all('div', class_ = 'card card-type-1')
print(type(containers))
print(len(containers))
main_page_products_urls = []
for c in containers:
    #print(c)
    x = c.find('a')
    a_link = x.get('href')
    #print(a_link)
    main_page_products_urls.append(a_link)
    
main_page_products_urls[:30]
    

<class 'bs4.element.ResultSet'>
42


['https://nasional.tempo.co/read/1181863/ksatria-airlangga-kasus-andi-arief-jokowi-serius-lawan-narkoba',
 'https://nasional.tempo.co/read/1181849/soal-penangkapan-andi-arief-demokrat-masalah-sensitif-bagi-kami',
 'https://nasional.tempo.co/read/1181838/andi-arief-ditangkap-demokrat-tak-ada-toleransi-untuk-narkoba',
 'https://nasional.tempo.co/read/1181837/sebut-andi-arief-korban-polisi-buka-peluang-rehabilitasi',
 'https://nasional.tempo.co/read/1181834/tkn-komentari-andi-arief-jokowi-serius-berantas-narkoba',
 'https://nasional.tempo.co/read/1181828/andi-arief-terjerat-narkoba-psi-menyindir-lewat-cuitan',
 'https://nasional.tempo.co/read/1181821/ada-dua-wna-di-dpt-pemilu-2019-kota-cirebon',
 'https://nasional.tempo.co/read/1181820/kasus-suap-bakamla-kpk-bekukan-rekening-pt-merial-esa-rp-60-m',
 'https://nasional.tempo.co/read/1181809/3-komentar-andi-arief-sepekan-sebelum-ditangkap-beri-jokowi-c',
 'https://nasional.tempo.co/read/1181815/menteri-agama-dukung-ide-jokowi-soal-hari-sarun

In [21]:
print(str(len(main_page_products_urls)) + " fetched products URLs")
print("One example:")
main_page_products_urls[0]

42 fetched products URLs
One example:


'https://nasional.tempo.co/read/1181863/ksatria-airlangga-kasus-andi-arief-jokowi-serius-lawan-narkoba'

This function is very handy for finding all the values at once, but you have to check that all the information collected is relevant. Sometimes one same tag can contain completely different data. That is why it is important to be as specific as possible when choosing the tags. Here we decided to rely on the tag 'article' with the 'product_pod' class because this seems to be a very specific tag and it is unlikely that we can find data other than product data in it.

The previous URLs correspond to their relative path from the main page. In order to make them complete, we just need to add before them the URL of the main page: http://books.toscrape.com/index.html (after removing the index.html part).

Now let's use this to define a function to retrieve book links on any given page of the website:

In [22]:
def getBooksURLs(url):
   
    #containers = a.find_all('div', class_ = 'card card-type-1')
    
    soup = getAndParseURL(url)
    a = soup.find('section', class_ = 'list list-type-1')
    # remove the index.html part of the base url before returning the results
    return([x.div.a.get('href') for x in a.findAll("div", class_ = "card card-type-1")])

## Find book categories URLs on the main page

Now let's try retrieving the URLs corresponding the different product categories:

<img src="images/inspect2.png">

By inspecting, we can see that they follow the same URL pattern: 'catalogue/category/books'. 

We can tell BeautifulSoup to match the URLs that contain this pattern in order to retrieve easily the categories URLs:

In [12]:
import re

categories_urls = [main_url + x.get('href') for x in soup.find_all("a", href=re.compile("catalogue/category/books"))]
categories_urls = categories_urls[1:] # we remove the first one because it corresponds to all the books

print(str(len(categories_urls)) + " fetched categories URLs")
print("Some examples:")
categories_urls[:5]

0 fetched categories URLs
Some examples:


[]

We managed to retrieve the 50 categories URLs successfully! 

Remember to always check what you fetched to be sure that all the information is relevant.

Getting the URLs of subsections of a website can be very useful if we want to scrape a specific part of it.

## Scrape all books data

For the last part of this tutorial, we will finally tackle our main objective: gather data about all the books of the website.

We know how to get the links of the books within a given page. If all the books were displayed on a same page this would be easy. However this situation is unlikely as it is not very user friendly to display all the catalog to the user on the same page.

Usually products are displayed on multiple pages or on one page but through scrolling. We can see here at the bottom of the main page that there are 50 products pages and a button 'next' to access to the next product page.

<img src="images/next.png">

On the next pages there is also a 'previous' button to come back to the last product page.

<img src="images/previous_next.png">

### Get all pages URLs

In order to fetch all the products URLs, we need to be able to get through all the pages. To do so, we can go iteratively through all the 'next' buttons.

<img src="images/next_inspect.png">

The 'next' button contains the pattern 'page'. We can use this to retrieve the URLs of the next pages. But let's be careful: the 'previous' button also contains this pattern!

If we have two results when matching with 'page', we should take the second one as it will correspond to the next page. For the first and the last pages we will have only one result because we will have either the 'next' button or the 'previous' button.

In [13]:
# store all the results into a list
pages_urls = [main_url]

soup = getAndParseURL(pages_urls[0])

# while we get two matches, this means that the web page contains a 'previous' and a 'next' button
# if there is only one button, this means that we are either on the first page or on the last page
# we stop when we get to the last page

while len(soup.findAll("a", href=re.compile("page"))) == 2 or len(pages_urls) == 1:
    
    # get the new complete url by adding the fetched URL to the base URL (and removing the .html part of the base URL)
    new_url = "/".join(pages_urls[-1].split("/")[:-1]) + "/" + soup.findAll("a", href=re.compile("page"))[-1].get("href")
    
    # add the URL to the list
    pages_urls.append(new_url)
    
    # parse the next page
    soup = getAndParseURL(new_url)

In [14]:
print(str(len(pages_urls)) + " fetched URLs")
print("Some examples:")
pages_urls[:5]

50 fetched URLs
Some examples:


['http://books.toscrape.com/index.html',
 u'http://books.toscrape.com/catalogue/page-2.html',
 u'http://books.toscrape.com/catalogue/page-3.html',
 u'http://books.toscrape.com/catalogue/page-4.html',
 u'http://books.toscrape.com/catalogue/page-5.html']

We successfully managed to get the 50 pages URLs. What is interesting here is that the URL of those pages is highly predictable. We could have just created this list by incrementing 'page-X.html' until 50.

This solution could work for this exact example but would not work anymore if the number of pages changed (e.g. if the website decided to print more products per pages, or if the catalog changed).

One solution could be to increment the value until we get on a 404 page.

<img src="images/404.png">

Here we can see that trying to go to the 51th page effectively gets us a 404 error. 

Fortunately the result of a request has a very useful attribute that can show us the return status of the HTML request.

In [49]:
result = requests.get("https://www.tempo.co/indeks/2019/03/04/nasional")
print("status code for page 50: " + str(result.status_code))

result = requests.get("https://www.tempo.co/indeks/2019/03/37/nasional")
print("status code for page 51: " + str(result.status_code))

status code for page 50: 200
status code for page 51: 200


The 200 code indicates that there is no error. The 404 code tells us that the page was not found.

We can use this information to get all our pages URLs: we should iterate until we get a 404 code.

Let's try this method now:

In [23]:
#str(int(pages_urls[-1].split("/")[6]) + 1 )
pages_urls = []
tahun = ["2019"]
bulan = ["01", "02", "03", "04", "05", "06", "07", "08", "09", "10", "11", "12"]
tgl = ["01", "02", "03", "04", "05", "06", "07", "08", "09", "10", "11", "12", "13", "14", "15", "16", "17", "18", "19", "20", "21", "22", "23", "24", "25", "26", "27", "28", "29", "30", "31"]
new_page = "https://www.tempo.co/indeks/2019/01/01/nasional"
#while requests.get(new_page).status_code == 200:
for y in bulan:
    for x in tgl:
            pages_urls.append(new_page)
            new_page = pages_urls[-1].split("/")[0] + "/" + pages_urls[-1].split("/")[1] + "/" + pages_urls[-1].split("/")[2] + "/" + pages_urls[-1].split("/")[3] + "/" + "2019/" + y + "/" + x + "/nasional"
    
    
    #new_page = pages_urls[-1].split("/")[0] + "/" + pages_urls[-1].split("/")[1] + "/" + pages_urls[-1].split("/")[2] + "/" + pages_urls[-1].split("/")[3] + "/" + pages_urls[-1].split("/")[4] + "/" + pages_urls[-1].split("/")[5] + "/" + str(x) + "/" + pages_urls[-1].split("/")[7]
    

In [25]:
print(str(len(pages_urls)) + " fetched URLs")
print("Some examples:")
pages_urls[300:373]

372 fetched URLs
Some examples:


['https://www.tempo.co/indeks/2019/10/21/nasional',
 'https://www.tempo.co/indeks/2019/10/22/nasional',
 'https://www.tempo.co/indeks/2019/10/23/nasional',
 'https://www.tempo.co/indeks/2019/10/24/nasional',
 'https://www.tempo.co/indeks/2019/10/25/nasional',
 'https://www.tempo.co/indeks/2019/10/26/nasional',
 'https://www.tempo.co/indeks/2019/10/27/nasional',
 'https://www.tempo.co/indeks/2019/10/28/nasional',
 'https://www.tempo.co/indeks/2019/10/29/nasional',
 'https://www.tempo.co/indeks/2019/10/30/nasional',
 'https://www.tempo.co/indeks/2019/10/31/nasional',
 'https://www.tempo.co/indeks/2019/11/01/nasional',
 'https://www.tempo.co/indeks/2019/11/02/nasional',
 'https://www.tempo.co/indeks/2019/11/03/nasional',
 'https://www.tempo.co/indeks/2019/11/04/nasional',
 'https://www.tempo.co/indeks/2019/11/05/nasional',
 'https://www.tempo.co/indeks/2019/11/06/nasional',
 'https://www.tempo.co/indeks/2019/11/07/nasional',
 'https://www.tempo.co/indeks/2019/11/08/nasional',
 'https://ww

We managed to obtain the same URLs using this simpler method!

### Get all products URLs

Now the next step consists in fetching all the products URLs for every page. This step is quite simple as we already have the list of all pages and the function to get products URLs from a page.

Let's iterate through the pages and apply our function:

In [26]:
booksURLs = []
for page in pages_urls:
    booksURLs.extend(getBooksURLs(page))

In [28]:
print(str(len(booksURLs)) + " fetched URLs")
print("Some examples:")
booksURLs[3000:3082]

3081 fetched URLs
Some examples:


['https://nasional.tempo.co/read/1186893/korban-meninggal-banjir-sentani-mencapai-89-orang-74-hilang',
 'https://nasional.tempo.co/read/1186884/banser-cabut-laporan-atas-tirto-id-soal-kartun-provokatif-nu',
 'https://nasional.tempo.co/read/1186882/senior-minta-golkar-tak-ambil-sikap-ekstrem-soal-erwin-aksa',
 'https://nasional.tempo.co/read/1186869/golkar-gelar-rapat-khusus-bahas-dukungan-erwin-aksa-ke-sandiaga',
 'https://nasional.tempo.co/read/1186859/kegiatan-disisipi-politik-kompolnas-ingatkan-netralitas-polri',
 'https://nasional.tempo.co/read/1186851/jokowi-segera-integrasikan-sistem-transportasi-jabodetabek',
 'https://nasional.tempo.co/read/1186825/kpk-sita-laptop-dari-rumah-romahurmuziy',
 'https://nasional.tempo.co/read/1186820/3-fakta-seputar-kontroversi-ajakan-dukung-jokowi-di-acara-polri',
 'https://nasional.tempo.co/read/1186816/kpk-sita-uang-dari-ruang-menteri-agama-sekjen-ppp-honor-pribadi',
 'https://nasional.tempo.co/read/1186808/ky-disarankan-usut-tudingan-pelanggara

We finally got the 1000 book URLs. This corresponds to the number indicated on the website!

### Get product data

The last step consist in scraping the data for each product. Let's explore first how the information is structured on the products pages:

<img src="images/product_inspect.png">

We can easily retrieve a lot of information for every book:
- book title
- price
- availability
- image
- category
- rating

Let's do it!

In [30]:
%%time

names = []
deskripsi = []
tgl = []
link_url = []
categories = []
ratings = []

# scrape data for every book URL: this may take some time
for url in booksURLs:
        soup = getAndParseURL(url)
        # product name
        
        title_box = soup.find('h1', attrs={'itemprop':'headline'})
        title = title_box.text
        #print (title)
        names.append(title)
        
        # deskripsi
        des_box = soup.find('div', attrs={'id':'isi'})
        tes = des_box.findAll('p')
        des = ''
        for x in tes:
            des = des + x.text
        deskripsi.append(des)
        
        # tanggal published
        tgl_box = soup.find('span', attrs={'itemprop':'datePublished'})
        tgl_p = tgl_box.text
        tgl.append(tgl_p)
        
      
        
    

KeyboardInterrupt: 

In [46]:
%%time

names = []
deskripsi = []
tgl = []
link_url = []
categories = []
ratings = []

filename = "corpustempo.csv"
f = open(filename, "w", encoding="utf-8")

headers = "title` deskripsi` date"

f.write(headers)


# scrape data for every book URL: this may take some time
for url in booksURLs:
        soup = getAndParseURL(url)
        # product name
        
        title_box = soup.find('h1', attrs={'itemprop':'headline'})
        title = title_box.text
        #print (title)
        names.append(title)
        
        # deskripsi
        des_box = soup.find('div', attrs={'id':'isi'})
        tes = des_box.findAll('p')
        des = ''
        for x in tes:
            des = des + x.text
        deskripsi.append(des)
        
        # tanggal published
        tgl_box = soup.find('span', attrs={'itemprop':'datePublished'})
        tgl_p = tgl_box.text
        tgl.append(tgl_p)
        
        print("title: " + title)
        print("deskripsi: " + des)
        print("date: " + tgl_p)

        f.write("\n" + title.replace("\r\n\t\t\t\t\t\t\t", "") + "`" + des + "`" + tgl_p.replace(",", "|") )
f.close()

title: 
							Longsor Sukabumi, Sudah 15 Korban Meninggal yang Ditemukan						
deskripsi: TEMPO.CO, Jakarta -Tim gabungan pencarian korban tanah longsor Sukabumi, tepatnya di Kampung Garehong, Dusun Cimapag, Desa Sirnaresmi, Kecamatan Cisolok, Kabupaten Sukabumi, Jawa Barat, menemukan kembali jasad warga yang tertimbun. Tercatat, hingga pukul 14.00 WIB, Selasa 1 Januari 2019, sudah 13 warga yang ditemukan.Baca : PVMBG Ungkap 3 Faktor Penyebab Longsor Sukabumi"Hingga tadi siang, kami kembali menemukan 13 korban dalam kondisi meninggal dunia," kata Komandan Resor Militer 061 Suryakencana, Kolonel Muhammad Hasan, kepada wartawan di posko bencana, Senin 1 Januari 2019.Dengan begitu, lanjut Hasan, korban meninggal dunia yang tertimbun sebanyak 15 orang. Sebelumnya pada Senin 31 Desember 2018 malam, tim evakuasi gabungan menemukan dua korban. "Hingga saat ini masih ada enam mayat di lokasi," jelas Hasan.Hasan memastikan, hasil pendataan detail di lapangan, korban yang terindikasi tertimbun 

title: 
							Malam Ini Gempa Guncang Banda Aceh dan Selatan Jawa						
deskripsi: TEMPO.CO, Jakarta - Gempa dengan magnitudo 5.1 mengguncang Banda Aceh pada Selasa petang 1 Januari 2019. Badan Meteorologi Klimatologi dan Geofisika (BMKG) melalui situs resmi bmkg.go.id, menyatakan gempa tersebut berpotensi tidak tsunami.Baca juga: Awal Tahun 2019, Lombok Diguncang Gempa M 3,0BMKG menyebutkan jika getaran gempa dirasakan hingga Banda Aceh dengan skala II MMI. Gempa tersebut terjadi pada pukul 18.55 WIB pada kedalaman 14 km.Pusat gempa berada di laut 93 km barat daya dari Banda Aceh dengan koordinat pada 5.47 Lintang Utara, 94.49 Bujur Timur."Pusat gempa di laut 93 km barat daya Banda Aceh," tulis BMKG.Tak berapa lama dari gempa di Aceh, lindu juga mengguncang kawasan selatan Jawa. Dalam pengumuman yang dibuat BMKG, gempa tersebut bermagnitudo 5,0 terjadi pada pukul 19.25 WIB.Gempa itu terjadi di 327 km barat daya Kabupaten Pangandaran, Jawa Barat; 330 Km barat daya Cilacap, Jawa Tengah

title: 
							Gaya dan Cara Jokowi Sambut Tahun Baru Tiga Tahun Terakhir						
deskripsi: Jakarta - Presiden Joko Widodo melewatkan malam pergantian tahun dengan cara berbeda dalam tiga tahun terakhir. Namun gaya Jokowi cenderung sama pada setiap tahunnya.Baca: Sambut Tahun Baru 2019, Jokowi Undang PKL ke Istana BogorPergantian tahun 2019 ini, Jokowi memilih bersama keluarganya. Ditemani istrinya, Iriana, dan putra bungsunya, Kaesang Pangarep, ia menghabiskan malam tutup tahun 2018 di rumahnya, Wisma Bayurini, Istana Kepresidenan Bogor.Bersama anggota Pasukan Pengamanan Presiden (Paspampres) serta pegawai Istana Kepresidenan Bogor, Jokowi menyantap hidangan dari pedagang angkringan pada Senin malam. Pedagang angkringan itu sengaja diundang ke Istana Bogor untuk perayaan malam Tahun Baru 2019."Sate ayam, sate kambing, dan sate sapi menjadi salah satu menu yang disajikan malam itu. Selain itu, ada juga bakmi dengkul dan wedang ronde," ujar Deputi Bidang Protokol, Pers, dan Media Sekreta

title: 
							5 Fakta Kelompok Mujahidin Indonesia Timur yang Tembak 2 Polisi						
deskripsi: TEMPO.CO, Jakarta - Nama kelompok Mujahidin Indonesia Timur atau MIT kembali mencuat. Polisi menduga kelompok yang berada di pegunungan Poso, Sulawesi Tengah berada di balik kasus mutilasi dan penembakan dua anggota Polri di Pantai Kapal Dusun Salubose.Baca juga: Polda: Gerilya Mujahidin Indonesia Timur di Poso BerubahPenembakan dua anggota polisi itu berawal dari rencana tim kepolisian setempat memeriksa tempat terjadinya perkara pembunuhan dengan mutilasi. Saat ingin mengevakuasi korban mutilasi, terjadi baku tembak. Dua anggota polisi tertembak yaitu Brigadir Polisi Kepala (Bripka) Andrew Maha Putra dan Bripka Baso.Kepolisian RI pun langsung menurunkan dua Satuan Setingkat Pleton (SST) Brigadir Mobil, satu SST dari Poso dan Palu untuk membantu Kepolisian Resor Parimo melakukan pengejaran terhadap kelompok Mujahidin Indonesia Timur.Kelompok yang diduga mendalangi beberapa aksi terorisme it

title: 
							Adik Gus Dur Anggap Tes Baca Al Quran bagi Capres Tidaklah Urgen						
deskripsi: TEMPO.CO, Jakarta - Adik kandung Gus Dur, Lily Chodidjah Wahid alias Lily Wahid menilai, tes baca Al Quran untuk pasangan calon presiden dan calon wakil presiden yang akan berlaga di pemilihan presiden 2019, tidaklah urgen. "Saya menganggap hal itu tidak urgen," ujar Lily di kediaman Ma'ruf Amin di Jakarta pada Senin, 31 Desember 2018.Baca: Soal Tes Baca Al Quran untuk Capres, Amien Rais: Itu Lucu SekaliMenurut dia, kaum muslim yang waras pikirannya dan memiliki perhatian terhadap agama, otomatis akan memilih pemimpin yang bisa menjadi panutan dalam keislaman. "Jadi enggak usah dibawa-bawa Al Quran lah. Seperti halnya kalimat tauhid, sebaiknya enggak usah dijadikan bendera. Laa ilaaha illallah itu tempatnya di hati, bukan di mana-mana," ujar dia.Tokoh pluralis ini mengatakan, sejak Indonesia merdeka pada 17 Agustus 1945, masalah pluralisme sudah selesai dengan penghapusan tujuh kata dalam s

title: 
							Korban Longsor Sukabumi: 5 Tewas, 38 Orang Masih Tertimbun						
deskripsi: TEMPO.CO, Jakarta - Badan Nasional Penanggulangan Bencana (BNPB) mengatakan korban meninggal dalam bencana longsor di Kampung Cigarehong, Dusun Cimapag, Desa Sirnaresmi, Cisolok, Sukabumi, Jawa Barat, bertambah. Saat ini tercatat pada data sementara per 1 Januari 2019 pukul 10.00 WIB, lima orang meninggal.Baca: Evakuasi Longsor Sukabumi Terhambat Minimnya Alat Berat"Korban tewas menjadi lima orang, dan tiga luka-luka," ujar Kepala Pusat Data Informasi dan Humas BNPB Sutopo Purwo Nugroho dalam keterangan tertulis, Selasa, 1 Januari 2018.Longsor di Sukabumi ini terjadi pada Senin malam, 31 Desember 2018, pukul 17.30 WIB. Longsor terjadi disebabkan hujan deras mengguyur desa. Akibatnya, terjadi aliran permukaan di areal hutan dan persawahan dari perbukitan di lokasi kejadian. Aliran air kemudian menyebabkan material perbukitan meluncur menuruni lereng dan menimbun 30 rumah.Sutopo mengatakan, tercata

title: 
							Gempa yang Guncang Lombok Awal Tahun Baru 2019 Tergolong Dangkal						
deskripsi: TEMPO.CO, Jakarta - Tak lama setelah pergantian tahun, masyarakat di wilayah utara Pulau Lombok merasakan gempa. Kejadiannya pada pukul 00.15 Wita. Badan Meteorologi, Klimatologi, dan Geofisika (BMKG) mencatat gempa tersebut bermagnitudo 3,0.Baca: Awal Tahun 2019, Lombok Diguncang Gempa M 3,0Kepala Bidang Informasi Gempa bumi dan Peringatan Dini Tsunami BMKG Daryono, lewat keterangan tertulis, menginformasikan gempa terjadi pada Selasa, 1 Januari 2019, pukul 00.15.44 Wita. "Wilayah Lombok Utara mengalami gempa bumi tektonik," ujarnya.Hasis analisis BMKG menujukkan bahwa gempa bumi ini berkekuatan 3,0 Magnitudo. Pusat gempa pada koordinat 8,36 LS dan 116,20 BT, atau tepatnya berlokasi di darat."Jaraknya sekitar 8 kilometer barat laut Lombok Utara, Nusa Tenggara Barat pada kedalaman 10 km," kata Daryono.Dampak gempa bumi berdasarkan laporan masyarakat dirasakan di wilayah Lombok Utara dengan 

title: 
							KY Berharap DPR Segera Selesaikan RUU Jabatan Hakim						
deskripsi: TEMPO.CO, Jakarta - Wakil Ketua Komisi Yudisial, Maradaman Harahap, berharap Dewan Perwakilan Rakyat (DPR) segera menyelesaikan revisi Undang-Undang Jabatan Hakim. Beleid itu dinilai akan menguatkan posisi KY sebagai pengawas Mahkamah Agung (MA).Baca: Menteri Yohana Akan Bahas Usia Minimum Pernikahan dengan DPRMaradaman mengatakan, KY mengusulkan tambahan kewenangan agar dapat memberi sanksi kepada MA. "Satu sisi KY diberikan wewenang untuk mengawasi, tapi tidak diberi kewenangan untuk memberi sanksi," kata Maradaman di kantornya, Jakarta, Senin, 31 Desember 2018.Selama ini, pengawasan KY hanya menghasilkan usulan. MA dapat tidak melaksanakan rekomendasi tersebut jika mereka menilai putusan tersebut tidak sesuai wewenang KY atau teknis yudisial.Maradaman mencontohkan, sepanjang 2018 KY mengusulkan 39 putusan sanksi kepada MA terkait pelanggaran Kode Etik dan Pedoman Perilaku Hakim (KEPPH). MA hanya mere

title: 
							Presiden Jokowi Tunda Pelantikan Kepala BNPB yang Baru						
deskripsi: TEMPO.CO, Jakarta -Presiden Joko Widodo atau Jokowi menunda pelantikan Letnan Jenderal TNI Doni Monardo sebagai Kepala Badan Nasional Penanggulanan Bencana disingkat BNPB. Awalnya Doni akan dilantik pada Rabu, 2 Januari 2019 untuk menggantikan Laksamana Muda (Purn) Willem Rampangilei.Baca : Humas, BNPB dan BPBD Siap Mendukung Kepala yang Baru"Mohon maaf, pelantikan Kepala BNPB tidak jadi. Ditunda," kata juru bicara Presiden, Johan Budi, melalui pesan singkat pada Selasa, 1 Januari 2019. Johan menuturkan, kabar itu dia dengar dari Menteri Sekretaris Negara Pratikno. Pratikno juga sudah mengabarkan informasi tersebut kepada wartawan. "Menginformasikan bahwa besok (Rabu, 2 Januari 2019) tidak ada pelantikan. Maaf jika sudah dengar ada pelantikan," katanya.Sebelumnya beredar undangan pelantikan Kepala BNPB baru. Berdasarkan undangan tersebut, pelantikan akan dilaksanakan digelar di Istana Negara pada Rab

From cffi callback <function _verify_callback at 0x000002A2DBBA5BF8>:
Traceback (most recent call last):
  File "C:\Users\Wilda\Anaconda3\lib\site-packages\OpenSSL\SSL.py", line 306, in wrapper
    @wraps(callback)
KeyboardInterrupt


SSLError: HTTPSConnectionPool(host='nasional.tempo.co', port=443): Max retries exceeded with url: /read/1160669/malam-ini-gempa-guncang-banda-aceh-dan-selatan-jawa (Caused by SSLError(SSLError("bad handshake: Error([('SSL routines', 'ssl3_get_server_certificate', 'certificate verify failed')],)",),))

In [41]:
# add data into pandas df
import pandas as pd

scraped_data = pd.DataFrame({'name': names,'deskripsi': deskripsi, 'tanggal':tgl} )
scraped_data.head(900)

Unnamed: 0,name,deskripsi,tanggal
0,"\r\n\t\t\t\t\t\t\tLongsor Sukabumi, Sudah 15 K...","TEMPO.CO, Jakarta -Tim gabungan pencarian korb...","Selasa, 1 Januari 2019 23:26 WIB"
1,\r\n\t\t\t\t\t\t\tPresiden Jokowi Tunda Pelant...,"TEMPO.CO, Jakarta -Presiden Joko Widodo atau J...","Selasa, 1 Januari 2019 23:04 WIB"
2,\r\n\t\t\t\t\t\t\tHumas: BNPB dan BPBD Siap Me...,"TEMPO.CO, Jakarta-Kepala Pusat Data Informasi ...","Selasa, 1 Januari 2019 21:53 WIB"
3,"\r\n\t\t\t\t\t\t\tLibur Awal Tahun, Jokowi Jog...","TEMPO.CO, Jakarta-Presiden Joko Widodo (Jokowi...","Selasa, 1 Januari 2019 21:22 WIB"
4,"\r\n\t\t\t\t\t\t\tSetelah Aceh, Gempa Magnitud...","TEMPO.CO, Jakarta-Gempa dengan magnitudo 5 men...","Selasa, 1 Januari 2019 21:07 WIB"
5,\r\n\t\t\t\t\t\t\tSBY: Saya Minta Restu Rakyat...,"TEMPO.CO, Jakarta - Cuitan berseri pertama Ket...","Selasa, 1 Januari 2019 21:00 WIB"
6,\r\n\t\t\t\t\t\t\tMalam Ini Gempa Guncang Band...,"TEMPO.CO, Jakarta - Gempa dengan magnitudo 5.1...","Selasa, 1 Januari 2019 20:08 WIB"
7,\r\n\t\t\t\t\t\t\tLetjen TNI Doni Monardo Beso...,"TEMPO.CO, Jakarta - Letnan Jenderal TNI Doni M...","Selasa, 1 Januari 2019 19:31 WIB"
8,\r\n\t\t\t\t\t\t\tAmien Rais Sebut yang Mendes...,"TEMPO.CO, Jakarta - Ketua Dewan Kehormatan Par...","Selasa, 1 Januari 2019 19:31 WIB"
9,"\r\n\t\t\t\t\t\t\tBesok, Jokowi Lantik Kepala ...","TEMPO.CO, Jakarta - Presiden Joko Widodo akan ...","Selasa, 1 Januari 2019 19:00 WIB"


We got our data: our web scraping experiment is a success. 

Some data cleaning may be useful before using them:
- transform the ratings into numerical values
- remove the numbers in the product_category column

## Wrap up

We have seen how to get through websites and gather data on each web page using automated web scrapers. One key thing in order to build efficient web scrapers is to understand the structure of the website on which you want to scrape the information. This means that you will probably have to maintain you scraper if you want it to remain useful after websites updates.

This book store website was an easy example, but in real life you may have to deal with more complex websites that render some of their content using Javascript. You may want to use a browser automator like Selenium for those kind of tasks (https://www.seleniumhq.org/).