# Web Scraping

Web scraping (also called web harvesting or web data extraction) is the automated extraction of data from websites. Scraping software can access the World Wide Web directly over the Hypertext Transfer Protocol (HTTP) or through a web browser.

For this class: **BeautifulSoup**

BeautifulSoup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. Its main limitation is that it only parses the markup it is given and does not run JavaScript, so on its own it cannot scrape dynamic websites whose content is rendered by JavaScript. We will use Selenium to cover this drawback.
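As a minimal sketch of what navigating and searching the parse tree looks like, the snippet below parses a small hand-written HTML string (made up for illustration, not taken from any real site) with the built-in `html.parser`, then reads text and attributes from it.

```python
from bs4 import BeautifulSoup

# Hand-written HTML used only for illustration (not from a real site).
sample_html = """
<html><body>
  <h1 id="top">Book list</h1>
  <ul>
    <li class="book"><a href="/products/a">First &amp; Second</a></li>
    <li class="book"><a href="/products/b">Third</a></li>
  </ul>
</body></html>
"""

soup_demo = BeautifulSoup(sample_html, "html.parser")

# find() returns the first matching tag (or None if nothing matches).
heading = soup_demo.find("h1").get_text()

# find_all() returns every matching tag; attributes are read like dict keys.
book_titles = [li.get_text(strip=True) for li in soup_demo.find_all("li", {"class": "book"})]
book_links = [a["href"] for a in soup_demo.find_all("a")]

print(heading)
print(book_titles)
print(book_links)
```

Note that `get_text()` returns decoded text, so the entity `&amp;` in the markup comes back as a plain `&`.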

Installation (run in the conda prompt):
- pip install bs4
- pip install requests
- pip install selenium

### Scraping data from Gramedia

In [1]:
import requests
page = requests.get("https://www.gramedia.com/categories/buku")
page

<Response [200]>
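A `200` status only says that the server answered. It does not guarantee that the book data is in the HTML that `requests` received, because a JavaScript-driven page may fill in its content later, inside the browser. One hedged way to check is to count the target elements in the HTML string. The helper name `count_matches` and both HTML strings below are made up for illustration.

```python
from bs4 import BeautifulSoup

def count_matches(html, tag, css_class):
    """Count the elements with the given tag and class in an HTML string."""
    parsed = BeautifulSoup(html, "html.parser")
    return len(parsed.find_all(tag, {"class": css_class}))

# Illustrative stand-ins: what a server might send before and after JavaScript runs.
raw_html = '<html><body><app-root></app-root><script src="main.js"></script></body></html>'
rendered_html = '<html><body><div class="list-title">Some Book</div></body></html>'

print(count_matches(raw_html, "div", "list-title"))       # no matching elements in the raw HTML
print(count_matches(rendered_html, "div", "list-title"))  # one matching element after rendering
```

With the real page you would pass `page.text`; a count of zero would suggest that the content is rendered by JavaScript and that a browser driver such as Selenium is needed.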

In [3]:
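from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Selenium 4 takes the chromedriver path through a Service object;
# passing the path positionally is deprecated.
driver = webdriver.Chrome(service=Service('../../../../../chromedriver.exe'))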

url="https://www.gramedia.com/categories/buku"
driver.get(url)
html = driver.page_source

soup = BeautifulSoup(html, "html.parser")
print(soup.prettify()[:700])



<html class="" lang="id">
 <head>
  <base href="/"/>
  <meta charset="utf-8"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <meta content="#281e5a" name="theme-color"/>
  <meta content="index, follow" name="robots"/>
  <link href="/assets/favicon.ico" rel="icon" type="image/x-icon"/>
  <link href="manifest.json" rel="manifest"/>
  <link href="/assets/favicon.ico" rel="shortcut icon" type="image/x-icon"/>
  <meta content="z91Vp6ZYo9UoX5D4ur6i4Lrl0l1j3DDoCH08fD3n53g" name="google-site-verification"/>
  <meta content="810657685650228" property="fb:app_id"/>
  <style>
   .async-hide {
        opacity: 0 !important
    }
  </style>
  <script async="" src="https://app.


Each book title on the page sits in a `div` with the class `list-title`, for example:

``` <div _ngcontent-web-gramedia-c26 class="list-title">Biru Pram Frasa</div> ```


In [4]:
soup.find_all('div',{"class":"list-title"})

[<div _ngcontent-web-gramedia-c26="" class="list-title">Cabaca : Cara Cepat Pintar Membaca Untuk Anak 4-6 Tahun</div>,
 <div _ngcontent-web-gramedia-c26="" class="list-title">Living Clean : Gaya Hidup Minim Toksin!</div>,
 <div _ngcontent-web-gramedia-c26="" class="list-title">Biru Pram Frasa</div>,
 <div _ngcontent-web-gramedia-c26="" class="list-title">Uncensored Blak-Blakan Soal Pasutri (New Cover)</div>,
 <div _ngcontent-web-gramedia-c26="" class="list-title">Hilangnya Sajadah Ayah</div>,
 <div _ngcontent-web-gramedia-c26="" class="list-title">Komik Next G: Misteri Ruang Ipa</div>,
 <div _ngcontent-web-gramedia-c26="" class="list-title">Indonesia X-Files</div>,
 <div _ngcontent-web-gramedia-c26="" class="list-title">Laskar &amp; Jingga</div>,
 <div _ngcontent-web-gramedia-c26="" class="list-title">The King 2023 Soshum : Bedah Kisi2 Sbmptn &amp; Um Mandiri</div>,
 <div _ngcontent-web-gramedia-c26="" class="list-title">FSA Komik Pendidikan Vitamin : Menjaga Kesehatan Tubuh Dan Hati</

In [5]:
for div_tag in soup.find_all('div',{"class":"list-title"}):
    print(div_tag.get_text())

Cabaca : Cara Cepat Pintar Membaca Untuk Anak 4-6 Tahun
Living Clean : Gaya Hidup Minim Toksin!
Biru Pram Frasa
Uncensored Blak-Blakan Soal Pasutri (New Cover)
Hilangnya Sajadah Ayah
Komik Next G: Misteri Ruang Ipa
Indonesia X-Files
Laskar & Jingga
The King 2023 Soshum : Bedah Kisi2 Sbmptn & Um Mandiri
FSA Komik Pendidikan Vitamin : Menjaga Kesehatan Tubuh Dan Hati
FSA Real Account 04
Renungan Sufi Membuka Tirai Kegaiban
Mereka Yang Tak Pernah Mati
Injury Time Ujian Sekolah Sd/Mi 2023
Parenting Detox
Bebaskan Aku Dari Alergi, Panduan Ortu Kekinian Vol. 2
Ergonomi Dinamika Beban Kerja
FSA Sang Pemenang Berdiri Sendirian (The Winner Stands Alone)
Beternak Magot
Populasi - Sampel, Teknik Sampling&Bias Dalam Penelitian


In [6]:
# Collect the scraped fields into a DataFrame, one column per field
import pandas as pd

data = pd.DataFrame()

data['Title'] = [title.get_text() for title in soup.find_all('div', {"class": "list-title"})]
data['Author'] = [author.get_text() for author in soup.find_all('p', {"class": "div-author"})]
data['Price'] = [price.get_text() for price in soup.find_all('p', {"class": "formats-price"})]
data['Image'] = [img['src'] for img in soup.find_all('img', {"class": "product-list-img"})]

# Product links: take the first <a> inside each product container.
# The _ngcontent-... attribute looks auto-generated and may change between site builds.
links = []
for tag in soup.find_all('div', {"_ngcontent-web-gramedia-c26": "", "class": "ng-star-inserted"}):
    try:
        links.append("https://www.gramedia.com" + tag.find_all('a', {"_ngcontent-web-gramedia-c26": ""})[0]['href'])
    except (IndexError, KeyError):
        # Skip containers that have no matching <a> or no href
        pass

data['Link'] = links

data

| | Title | Author | Price | Image | Link |
|---|---|---|---|---|---|
| 0 | Cabaca : Cara Cepat Pintar Membaca Untuk Anak ... | FANY NOVIA CAHYANI | Rp 59.000 | https://cdn.gramedia.com/uploads/items/111_HOS... | https://www.gramedia.com/products/cabaca-cara-... |
| 1 | Living Clean : Gaya Hidup Minim Toksin! | Inge Tumiwa - Bachrens | Rp 88.000 | /assets/default-images/product.png | https://www.gramedia.com/products/living-clean... |
| 2 | Biru Pram Frasa | MEGAKATA None | Rp 88.000 | https://cdn.gramedia.com/uploads/items/Biru_Pr... | https://www.gramedia.com/products/biru-pram-frasa |
| 3 | Uncensored Blak-Blakan Soal Pasutri (New Cover) | @OLEVELOVE | Rp 109.000 | /assets/default-images/product.png | https://www.gramedia.com/products/uncensored-b... |
| 4 | Hilangnya Sajadah Ayah | Benny Rhamdani | Rp 64.000 | https://cdn.gramedia.com/uploads/items/Hilangn... | https://www.gramedia.com/products/hilangnya-sa... |
| 5 | Komik Next G: Misteri Ruang Ipa | Sayyida Mafaazaa Salsabila, dkk. | Rp 39.000 | /assets/default-images/product.png | https://www.gramedia.com/products/komik-next-g... |
| 6 | Indonesia X-Files | dr. Abdul Mun'im Idries, Sp.F. | Rp 98.000 | https://cdn.gramedia.com/uploads/items/Indones... | https://www.gramedia.com/products/indonesia-x-... |
| 7 | Laskar & Jingga | Tresia . | Rp 99.000 | /assets/default-images/product.png | https://www.gramedia.com/products/laskar-jingga |
| 8 | The King 2023 Soshum : Bedah Kisi2 Sbmptn & Um... | Forum Tentor Indonesia | Rp 240.000 | /assets/default-images/product.png | https://www.gramedia.com/products/the-king-202... |
| 9 | FSA Komik Pendidikan Vitamin : Menjaga Kesehat... | TIM PRODUKSI KBS VITAMIN | Rp 148.000 | https://cdn.gramedia.com/uploads/items/FSA_Kom... | https://www.gramedia.com/products/fsa-komik-pe... |
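The scraped `Price` values are strings such as `Rp 59.000`, and some `Image` values are site-relative paths. A minimal cleanup sketch follows, assuming the dot is a thousands separator and prices have no decimal part. The helper names `parse_rupiah` and `absolute_url` are made up for illustration, and the `example.jpg` URL is hypothetical.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.gramedia.com"

def parse_rupiah(price_text):
    """Convert a string such as 'Rp 59.000' to the integer 59000.

    Assumes the dot is a thousands separator and there is no decimal part.
    """
    digits = "".join(ch for ch in price_text if ch.isdigit())
    return int(digits) if digits else None

def absolute_url(src):
    """Resolve a site-relative path against BASE_URL; absolute URLs pass through."""
    return urljoin(BASE_URL, src)

print(parse_rupiah("Rp 59.000"))
print(absolute_url("/assets/default-images/product.png"))
print(absolute_url("https://cdn.gramedia.com/uploads/items/example.jpg"))  # hypothetical URL
```

On the DataFrame these could be applied column-wise, for example `data['Price'].map(parse_rupiah)` and `data['Image'].map(absolute_url)`.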


In [10]:
# Scrape the same fields across several listing pages
from selenium.webdriver.chrome.service import Service

titles = []
authors = []
prices = []
images = []
links = []

driver = webdriver.Chrome(service=Service('../../../../../chromedriver.exe'))

for i in range(1, 21):
    url = "https://www.gramedia.com/categories/buku?page={}".format(i)
    driver.get(url)
    html = driver.page_source
    soup = BeautifulSoup(html, "html.parser")

    # Extend each list with the values found on this page
    titles += [tag.get_text() for tag in soup.find_all('div', {"class": "list-title"})]
    authors += [tag.get_text() for tag in soup.find_all('p', {"class": "div-author"})]
    prices += [tag.get_text() for tag in soup.find_all('p', {"class": "formats-price"})]
    images += [img['src'] for img in soup.find_all('img', {"class": "product-list-img"})]

    for tag in soup.find_all('div', {"_ngcontent-web-gramedia-c26": "", "class": "ng-star-inserted"}):
        try:
            links.append("https://www.gramedia.com" + tag.find_all('a', {"_ngcontent-web-gramedia-c26": ""})[0]['href'])
        except (IndexError, KeyError):
            # Skip containers that have no matching <a> or no href
            pass

driver.quit()  # close the browser once all pages are fetched

# The five lists must have the same length, otherwise pandas raises a ValueError
data_multipage = pd.DataFrame()
data_multipage['Title'] = titles
data_multipage['Author'] = authors
data_multipage['Price'] = prices
data_multipage['Image'] = images
data_multipage['Link'] = links

data_multipage



| | Title | Author | Price | Image | Link |
|---|---|---|---|---|---|
| 0 | Cabaca : Cara Cepat Pintar Membaca Untuk Anak ... | FANY NOVIA CAHYANI | Rp 59.000 | https://cdn.gramedia.com/uploads/items/111_HOS... | https://www.gramedia.com/products/cabaca-cara-... |
| 1 | Living Clean : Gaya Hidup Minim Toksin! | Inge Tumiwa - Bachrens | Rp 88.000 | /assets/default-images/product.png | https://www.gramedia.com/products/living-clean... |
| 2 | Biru Pram Frasa | MEGAKATA None | Rp 88.000 | https://cdn.gramedia.com/uploads/items/Biru_Pr... | https://www.gramedia.com/products/biru-pram-frasa |
| 3 | Uncensored Blak-Blakan Soal Pasutri (New Cover) | @OLEVELOVE | Rp 109.000 | /assets/default-images/product.png | https://www.gramedia.com/products/uncensored-b... |
| 4 | Hilangnya Sajadah Ayah | Benny Rhamdani | Rp 64.000 | https://cdn.gramedia.com/uploads/items/Hilangn... | https://www.gramedia.com/products/hilangnya-sa... |
| ... | ... | ... | ... | ... | ... |
| 135 | Sd Kl 4 Pembelajaran Bahasa Mandarin Interakti... | Yi Ying | Rp 82.000 | /assets/default-images/product.png | https://www.gramedia.com/products/sd-kl-4-pemb... |
| 136 | 99 Kekuatan Berfikir Positif | Aning Naafiah | Rp 50.000 | /assets/default-images/product.png | https://www.gramedia.com/products/99-kekuatan-... |
| 137 | Bumi Dan Lukanya | | Rp 89.500 | /assets/default-images/product.png | https://www.gramedia.com/products/bumi-dan-luk... |
| 138 | Experiential Based Learning, Pembelajaran Berb... | Ismijarti Juni Susanti Prof. Richardus Eko In... | Rp 39.000 | /assets/default-images/product.png | https://www.gramedia.com/products/experiential... |
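Building each column from a separate `find_all` call only lines up if every book has every element. A sketch of an alternative is to loop over one container per book and look up the fields inside it, so a missing field becomes `None` instead of shifting the column. The HTML below is hand-written, and the `product-card` wrapper class and the helper `text_or_none` are assumptions for illustration; the real container on the site would need to be found by inspecting the page.

```python
from bs4 import BeautifulSoup

# Hand-written HTML; the "product-card" wrapper class is an assumption for
# illustration, while the inner class names follow the ones used in this notebook.
cards_html = """
<div class="product-card">
  <div class="list-title">Book A</div>
  <p class="div-author">Author A</p>
  <p class="formats-price">Rp 10.000</p>
</div>
<div class="product-card">
  <div class="list-title">Book B</div>
  <p class="formats-price">Rp 20.000</p>
</div>
"""

def text_or_none(parent, tag, css_class):
    """Return the stripped text of the first match inside parent, or None."""
    node = parent.find(tag, {"class": css_class})
    return node.get_text(strip=True) if node is not None else None

cards_soup = BeautifulSoup(cards_html, "html.parser")
card_records = []
for card in cards_soup.find_all("div", {"class": "product-card"}):
    # All fields of one book are read from the same container, so they stay aligned.
    card_records.append({
        "Title": text_or_none(card, "div", "list-title"),
        "Author": text_or_none(card, "p", "div-author"),
        "Price": text_or_none(card, "p", "formats-price"),
    })

print(card_records)
```

A list of dicts like `card_records` can be passed straight to `pd.DataFrame(...)`, which fills the missing author with an empty value rather than raising a length error.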


In [14]:
# Visit each book's own page and collect its details
from selenium.webdriver.chrome.service import Service

def text_or_none(node):
    """Return the text of a tag, or None if the tag was not found."""
    return node.get_text() if node is not None else None

# One dict per book. Appending each field to its own list inside a single try
# block can leave the lists with different lengths when a later lookup fails,
# and pandas then raises an error such as
# "ValueError: Length of values (0) does not match length of index (16)".
records = []

driver = webdriver.Chrome(service=Service('../../../../../chromedriver.exe'))

for i in range(1, 3):
    url = "https://www.gramedia.com/categories/buku?page={}".format(i)
    driver.get(url)
    soup = BeautifulSoup(driver.page_source, "html.parser")

    for tag in soup.find_all('div', {"_ngcontent-web-gramedia-c26": "", "class": "ng-star-inserted"}):
        a_tag = tag.find('a', {"_ngcontent-web-gramedia-c26": ""})
        if a_tag is None or not a_tag.has_attr('href'):
            continue  # this container does not link to a product page

        driver.get("https://www.gramedia.com" + a_tag['href'])
        soup_ind = BeautifulSoup(driver.page_source, "html.parser")

        # <p> tags in the detail section, in order: pages, publisher, issue date
        detail = soup_ind.find('div', {'class': 'detail-section'})
        details = [p.get_text() for p in detail.find_all('p')] if detail is not None else []
        details += [None] * 3  # pad so the indexing below never fails

        records.append({
            'Title': text_or_none(soup_ind.find('div', {"class": "book-title"})),
            'Author': text_or_none(soup_ind.find('span', {"class": "title-author"})),
            'Price': text_or_none(soup_ind.find('div', {'class': 'price-product'})),
            'Desc': text_or_none(soup_ind.find('pre')),
            'Num Pages': details[0],
            'Date Issue': details[2],
            'Publisher': details[1],
        })

driver.quit()  # close the browser once all pages are fetched

pages = pd.DataFrame(records, columns=['Title', 'Author', 'Price', 'Desc',
                                       'Num Pages', 'Date Issue', 'Publisher'])

pages

In [None]:
# index=False keeps the row index out of the file
data_multipage.to_csv("result_scrapping.csv", index=False)
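By default `to_csv` also writes the DataFrame index as an unnamed first column, which `read_csv` later labels `Unnamed: 0`. A small sketch with made-up data, using an in-memory buffer instead of a file, shows the difference that `index=False` makes:

```python
import io
import pandas as pd

# Made-up rows, used only to compare the two ways of saving.
df_demo = pd.DataFrame({"Title": ["Book A", "Book B"], "Price": ["Rp 10.000", "Rp 20.000"]})

# Default: the index is written as an extra, unnamed first column.
with_index = io.StringIO()
df_demo.to_csv(with_index)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)

# index=False: only the real columns are written.
without_index = io.StringIO()
df_demo.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)
print(cols_without_index)
```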