# Collecting Data by Scraping with BeautifulSoup4

Libraries and packages need to be installed first:

1. Requests ([docs](http://docs.python-requests.org/en/master/))
2. BeautifulSoup ([docs](https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.html?highlight=tag#))
3. Pandas ([docs](https://pandas.pydata.org/pandas-docs/stable/))

**Scraping** means **accessing a webpage**, **retrieving its HTML/XML/JSON** structure, and **extracting data** from it. Some sites provide an API to expose their data to the public (Twitter, GitHub, etc.), while others don't. Scraping web pages with Python mostly relies on `Requests` and `BeautifulSoup`, although there are more advanced options like **Selenium** or **Scrapy**.

> **Requests** is a library for making HTTP requests (GET/POST/PUT/etc.) to a webpage and retrieving the raw HTML

> **BeautifulSoup** is a parser to extract data from HTML/XML such that we can navigate and search any HTML/XML tag and attributes easily

> **Pandas** is a high-performance, easy-to-use library for data structures and data analysis

## Making a Request

Making an HTTP GET request is really simple using `requests.get`.

In [1]:
# import libraries
from contextlib import closing
import datetime
import time

import pandas as pd
import requests
from requests.exceptions import RequestException
from bs4 import BeautifulSoup

In [2]:
csx_url = 'https://csx.itb.ac.id/seminar/'
csx_html = requests.get(url=csx_url, timeout=2)
csx_html

<Response [200]>

As you can see, `requests.get` returns a `Response` object, displayed here with its **response status code**: 200. Other response status codes are listed [here](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes). We can then read the response body in multiple formats, depending on the response.
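As a rough illustration of how a status code can be interpreted without leaving Python, the standard library's `http.HTTPStatus` enum maps each registered code to its reason phrase. The helper name `describe_status` is hypothetical and not part of Requests.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short human-readable label for an HTTP status code."""
    try:
        status = HTTPStatus(code)
    except ValueError:
        # The code is not registered in the standard library enum
        return f'{code} (unknown)'
    return f'{status.value} {status.phrase}'


for code in (200, 404, 500):
    print(describe_status(code))
```

With a real response, `csx_html.status_code` can be passed to the helper, and `csx_html.ok` is `True` for any status code below 400.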

In [3]:
# bytes
# print(csx_html.content)    # will print a long HTML output

In [4]:
# string
# print(csx_html.text)    # will print a long HTML output

In [5]:
# json -> dict
itunes_url = 'https://itunes.apple.com/search?term=paramore&entity=song'
itunes_html = requests.get(itunes_url)
# print(itunes_html.json())    # will print a long JSON output

### URL Parameters

In `itunes_url`, there are parameters like `term` and `entity`. Parameters specify what kind of data we want to extract. They are usually documented in the API docs or visible directly in the URL.

**Requests** provides another argument when calling `get`: `params`. We can create a **dictionary of keys and values for URL parameters** and pass it as the `params` argument.

In [6]:
itunesurl = 'https://itunes.apple.com/search'
itunes_param = {
    'term': 'paramore',
    'entity': 'song',
}
itunes = requests.get(url=itunesurl, params=itunes_param)
print(itunes.status_code, itunes.url, sep=' - ')

200 - https://itunes.apple.com/search?term=paramore&entity=song
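The query string that Requests builds from a `params` dictionary can be previewed offline with the standard library's `urllib.parse.urlencode`. This is only a sketch of the encoding step; the helper name `build_url` is hypothetical.

```python
from urllib.parse import urlencode


def build_url(base_url, params):
    """Append URL-encoded `params` to `base_url` as a query string."""
    return f'{base_url}?{urlencode(params)}'


# Same parameters as the iTunes request above
print(build_url('https://itunes.apple.com/search',
                {'term': 'paramore', 'entity': 'song'}))

# Brackets and spaces in keys and values are percent-encoded
print(urlencode({'search[keywords]': 'ipad 6 2018 case'}))
```

Passing a dictionary instead of concatenating strings by hand means special characters such as spaces and brackets are escaped for us.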


In [7]:
print(f'content type: {itunes.headers["Content-Type"]}')
print(itunes.headers)

content type: text/javascript; charset=utf-8
{'x-apple-jingle-correlation-key': '37BL334KQNZX35UV7FJG4XFY6I', 'x-apple-application-site': 'ST11', 'content-disposition': 'attachment; filename=1.txt', 'apple-originating-system': 'MZStoreServices', 'Content-Encoding': 'gzip', 'strict-transport-security': 'max-age=31536000', 'x-apple-translated-wo-url': '/WebObjects/MZStoreServices.woa/ws/wsSearch?term=paramore&entity=song&urlDesc=', 'apple-tk': 'false', 'x-apple-orig-url': 'https://itunes.apple.com/search?term=paramore&entity=song', 'apple-seq': '0', 'apple-timing-app': '432 ms', 'x-content-type-options': 'nosniff', 'Content-Type': 'text/javascript; charset=utf-8', 'x-apple-application-instance': '2000744', 'x-webobjects-loadaverage': '0', 'x-apple-request-uuid': 'dfc2bdef-8a83-737d-f695-f9526e5cb8f2', 'Content-Length': '6549', 'Vary': 'Accept-Encoding', 'Cache-Control': 'max-age=65490', 'Date': 'Sun, 31 Mar 2019 12:10:44 GMT', 'X-Cache': 'TCP_MEM_HIT from a23-45-232-140.deploy.akamaitech

## Requesting with Modular Python

In [8]:
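def get_html(url, param=None, time_out=None):
    """Attempt to get the HTML at `url` via an HTTP GET request.
    
    Parameters
    ----------
    url : str
        URL or API URI
    param : dict, optional
        key-value pairs to be attached to the URL as query parameters
    time_out : float or int or tuple of both, optional
        time limit in seconds. A single number applies to both connecting
        and reading. If a tuple (2, 5) is given, then 2 is the time limit
        to establish a connection and 5 is the time limit to wait on a
        response.
    
    Returns
    -------
    tuple of (str, str) or None
        Raw HTML and the complete URL. None is returned when the request
        fails or the response is not HTML, so callers that unpack the
        result should check for None first.
    """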
    try:
        with closing(requests.get(url, params=param, timeout=time_out, stream=True)) as response:
            response.raise_for_status()
            if is_good_response(response):
                return response.text, response.url
            else:
                return None
    except RequestException as request_error:
        error_log(url, params=param, msg=request_error)
        return None


def is_good_response(response):
    """Evaluate response.
    
    Return True if the response has status 200 and an HTML content type,
    otherwise return False.
    
    Parameters
    ----------
    response
        Requests response
    
    Returns
    -------
    bool
        Response quality
    """
    # .get avoids a KeyError when the Content-Type header is missing
    content_type = response.headers.get('Content-Type', '').lower()
    return (
        response.status_code == 200
        and 'html' in content_type
    )


def error_log(url, params=None, msg=None):
    """Print an error message for a failed request.
    
    Parameters
    ----------
    url : str
        URL string
    params : dict
        key-value pair attached to url
    msg : str
        Error message based on `RequestException`
    """
    print(
        f'Error occurred during request to {url}',
        f'with parameter {params}',
        f'Error Message {msg}',
        sep='\n'
    )

In [9]:
# get_html(url=csx_url)
get_html('https://www.python.org/no-pages')

Error occurred during request to https://www.python.org/no-pages
with parameter None
Error Message 404 Client Error: Not Found for url: https://www.python.org/no-pages
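The content-type check can be tried without any network access by using stand-in objects. The sketch below is an assumption-laden simplification: `is_html` is a hypothetical helper, and `SimpleNamespace` with a plain dict only imitates a `requests.Response` (real response headers are case-insensitive, a plain dict is not). It also shows why a response like the iTunes one, whose content type is `text/javascript`, would not count as HTML.

```python
from types import SimpleNamespace


def is_html(response):
    """Return True when the response has status 200 and an HTML content type."""
    # .get avoids a KeyError when the server omits the Content-Type header
    content_type = response.headers.get('Content-Type', '').lower()
    return response.status_code == 200 and 'html' in content_type


# Stand-ins for requests.Response objects (hypothetical, for illustration only)
html_page = SimpleNamespace(status_code=200,
                            headers={'Content-Type': 'text/html; charset=UTF-8'})
json_api = SimpleNamespace(status_code=200,
                           headers={'Content-Type': 'text/javascript; charset=utf-8'})
no_header = SimpleNamespace(status_code=200, headers={})

print(is_html(html_page), is_html(json_api), is_html(no_header))
```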


## Structuring The Contents

All of those responses are plain strings, which makes it difficult to read and access the tags and content. `BeautifulSoup` handles this for us by parsing the HTML string. BeautifulSoup also takes a `features` argument that selects the parser. We can use Python's built-in HTML parser, `html.parser`, or use `lxml` for speed.

> The docs recommend passing `features` explicitly so that the same parser is used on every platform

In [10]:
# don't forget to install and import the library
soup = BeautifulSoup(csx_html.text, features='lxml')
# print(soup)    # will print a long HTML output

In [11]:
# what can we do with `soup`?
# dir(soup)    # list all attributes and methods for the object. Also print a long output

In [12]:
# print(soup.body)    # print html body
# print(soup.li)    # print first occurrence of li tag
print(soup.article)

<article class="group post-23 page type-page status-publish hentry">
<div class="entry themeform tes">
<p><strong>Agustus (0,1)</strong></p>
<ul>
<li>2018-08-14 0900 BSC-A Tesis II 20916001 Diyah Wijayati<br/>
<em>Penerapan Teori Permainan dalam Menentukan Strategi Pemasaran Program Studi Teknik Industri dan Informatika di Sekolah Tinggi Teknologi Bandung</em><br/>
(Supervisor: Dr. Agus Yodi Gunawan)</li>
</ul>
<p><strong>Juli (1,0)</strong></p>
<ul>
<li>2018-07-16 0900 CAS 1A Tesis I 20917007 Syahrul Bahar Hamdani<br/>
<em>Predictive Maintenance Mesin Pesawat dengan Pendekatan Machine Learning</em><br/>
(Supervisor: Dr. Nuning Nuraini)</li>
</ul>
<p><strong>Juni (1,0)</strong></p>
<ul>
<li>2018-06-28 1330 BSC-A Tesis I 20916003 Ufra Neshia<br/>
<em>Pelabelan Jarak Ajaib Menggunakan Algoritma Paralel</em><br/>
(Supervisor: Dr. Rinovia M. G. Simanjuntak)</li>
</ul>
<p><strong>April (3,1)</strong></p>
<ul>
<li>2018-04-05 1030 BSC-A Tesis I 20916004 Arif Nurwahid<br/>
<em>Polinom Pembangk

In [13]:
# print(len(soup.find_all('article')))
# print(soup.find_all(name='li'))
print(len(soup.find_all('li')))

73


**So, how do we get what we want from those tag stacks?** - *Inspection*

For csx, suppose we are interested in the **thesis title, author, supervisor, and seminar date**. From inspecting the structure, we find that all of those are descendants of the `article` tag. Because we already know there is only one `article` tag, we can use `.article` instead of `.find_all()`, or we can use `.find()`.

> While scraping a website, we have to read the markup and find where the contents are located. So, expect **a lot of inspecting** and **a lot of HTML**
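The difference between `.find()`, `.find_all()`, and attribute access such as `.article` can be seen on a small hand-written snippet. The markup below only loosely mimics the seminar page and is an assumption made for illustration; `html.parser` is used so the sketch needs no extra parser installed.

```python
from bs4 import BeautifulSoup

# Hand-written markup that loosely mimics the seminar page (an assumption)
html = """
<article>
  <p><strong>Agustus (0,1)</strong></p>
  <ul><li>first entry</li></ul>
  <p><strong>Juli (1,0)</strong></p>
  <ul><li>second entry</li></ul>
</article>
"""
soup = BeautifulSoup(html, features='html.parser')

first_li = soup.find('li')        # first match, or None when nothing matches
all_li = soup.find_all('li')      # list of every match (possibly empty)
print(first_li.get_text(), len(all_li))

# Attribute access is shorthand for .find() with that tag name
print(soup.article is soup.find('article'))

# A tag that does not exist in the markup
print(soup.find('table'))
```

Because `.find()` returns `None` when nothing matches, it is worth checking the result before chaining further calls on it.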

In [14]:
csx_article = soup.find('article')
type(csx_article)

bs4.element.Tag

In [15]:
# what do we have in `p` tag within `article` tag
# csx_article.find_all('p')
csx_article.find_all('strong')

[<strong>Agustus (0,1)</strong>,
 <strong>Juli (1,0)</strong>,
 <strong>Juni (1,0)</strong>,
 <strong>April (3,1)</strong>,
 <strong>Maret (2,4)</strong>,
 <strong>Februari (3,0)</strong>,
 <strong>Januari (2,1)</strong>]

In [16]:
month_with_thesis = [c.text for c in csx_article.find_all('strong')]
df = pd.DataFrame([s.split(' ') for s in month_with_thesis], columns=['month', 'thesis'])
df

| | month | thesis |
|---|---|---|
| 0 | Agustus | (0,1) |
| 1 | Juli | (1,0) |
| 2 | Juni | (1,0) |
| 3 | April | (3,1) |
| 4 | Maret | (2,4) |
| 5 | Februari | (3,0) |
| 6 | Januari | (2,1) |


In [17]:
# more detail on splitting thesis
df['thesis'].map(lambda x: x.strip('()').split(','))

0    [0, 1]
1    [1, 0]
2    [1, 0]
3    [3, 1]
4    [2, 4]
5    [3, 0]
6    [2, 1]
Name: thesis, dtype: object

In [18]:
# define thesis1 and thesis2 feature
df['thesis1'] = df['thesis'].map(lambda x: x.strip('()').split(',')[0])
df['thesis2'] = df['thesis'].map(lambda x: x.strip('()').split(',')[1])

In [19]:
df

| | month | thesis | thesis1 | thesis2 |
|---|---|---|---|---|
| 0 | Agustus | (0,1) | 0 | 1 |
| 1 | Juli | (1,0) | 1 | 0 |
| 2 | Juni | (1,0) | 1 | 0 |
| 3 | April | (3,1) | 3 | 1 |
| 4 | Maret | (2,4) | 2 | 4 |
| 5 | Februari | (3,0) | 3 | 0 |
| 6 | Januari | (2,1) | 2 | 1 |
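The same split can also be done in one step with a regular expression that yields integers instead of strings. This is a sketch under the assumption that every label looks like `Agustus (0,1)`; the page does not say what the two counts mean, so the helper (hypothetical name `parse_month_label`) simply calls them `first` and `second`.

```python
import re

# Matches labels such as "Agustus (0,1)": a month name, then two counts
LABEL_PATTERN = re.compile(
    r'^(?P<month>\w+)\s*\((?P<first>\d+),\s*(?P<second>\d+)\)$'
)


def parse_month_label(label):
    """Split a label like 'Agustus (0,1)' into (month, first, second)."""
    match = LABEL_PATTERN.match(label.strip())
    if match is None:
        raise ValueError(f'unexpected label format: {label!r}')
    return match['month'], int(match['first']), int(match['second'])


print(parse_month_label('Agustus (0,1)'))
```

A list of such tuples can be passed straight to `pd.DataFrame(..., columns=[...])` to get numeric columns.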


Now, let's look at the other features listed above. By inspecting (again), we know that those features live in the `li` tags inside the `article` tag. So, we know where they are; let's collect them!

Let's get acquainted with **children** and **descendants**.

> `children` gives the direct children of a `Tag` object. It returns an iterator rather than a Python list, so wrap it in `list()` to index it or take its length.

> `descendants` iterates over every node nested inside a `Tag` object: its children, their children, and so on, including text strings.

In [20]:
# .children iterates over the direct children of a tag; list() collects them
print('Children of <li>:')
print(list(csx_article.find('li').children), len(list(csx_article.find('li').children)))    # children
# .descendants iterates over every node nested inside a tag
print('Descendants of <li>:')
print(list(csx_article.find('li').descendants), len(list(csx_article.find('li').descendants)))    # descendants

Children of <li>:
['2018-08-14 0900 BSC-A Tesis II 20916001 Diyah Wijayati', <br/>, '\n', <em>Penerapan Teori Permainan dalam Menentukan Strategi Pemasaran Program Studi Teknik Industri dan Informatika di Sekolah Tinggi Teknologi Bandung</em>, <br/>, '\n(Supervisor: Dr. Agus Yodi Gunawan)'] 6
Descendants of <li>:
['2018-08-14 0900 BSC-A Tesis II 20916001 Diyah Wijayati', <br/>, '\n', <em>Penerapan Teori Permainan dalam Menentukan Strategi Pemasaran Program Studi Teknik Industri dan Informatika di Sekolah Tinggi Teknologi Bandung</em>, 'Penerapan Teori Permainan dalam Menentukan Strategi Pemasaran Program Studi Teknik Industri dan Informatika di Sekolah Tinggi Teknologi Bandung', <br/>, '\n(Supervisor: Dr. Agus Yodi Gunawan)'] 7
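To isolate what `descendants` adds on top of `children`, here is a smaller sketch on a hand-written `<li>` shaped like one seminar entry (the text content is made up for illustration).

```python
from bs4 import BeautifulSoup

# A simplified <li>, shaped like one seminar entry (hand-written for illustration)
li = BeautifulSoup(
    '<li>header text<br/><em>Thesis title</em><br/>(Supervisor: Someone)</li>',
    features='html.parser',
).li

children = list(li.children)         # direct children only
descendants = list(li.descendants)   # children plus everything nested inside them

print(len(children), len(descendants))

# Nodes that are descendants but not direct children (compared by identity,
# because the two <br/> tags are equal to each other)
extra = [node for node in descendants
         if all(node is not child for child in children)]
print(extra)
```

The only extra node is the string nested inside `<em>`, which is why the title can be read from `descendants` directly.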


In [21]:
# capture what we need
# 1. Thesis title is inside <em>..</em> tag or in -3 in descendants
# 2. Author is 1st element in descendants
# 3. Supervisor is the last element in descendants
title = pd.Series([list(li.descendants)[-3] for li in csx_article.find_all('li')], name='title')
supervisor = pd.Series([list(li.descendants)[-1].strip('()').split(':')[-1][1:] for li in csx_article.find_all('li')], name='supervisor')
seminar_date = pd.to_datetime(pd.Series([list(li.descendants)[0].split(' ', 1)[0] for li in csx_article.find_all('li')], name='seminar_date'))
author = pd.Series([list(li.descendants)[0].split('209')[-1].split(' ', 1)[-1] for li in csx_article.find_all('li')], name='author')
room = pd.Series([list(li.descendants)[0].split('Tesis')[0].split(' ', 2)[-1].strip() for li in csx_article.find_all('li')], name='seminar_room')

df = pd.concat([author, title, supervisor, seminar_date, room], axis=1)
display(df)

| | author | title | supervisor | seminar_date | seminar_room |
|---|---|---|---|---|---|
| 0 | Diyah Wijayati | Penerapan Teori Permainan dalam Menentukan Str... | Dr. Agus Yodi Gunawan | 2018-08-14 | BSC-A |
| 1 | Syahrul Bahar Hamdani | Predictive Maintenance Mesin Pesawat dengan Pe... | Dr. Nuning Nuraini | 2018-07-16 | CAS 1A |
| 2 | Ufra Neshia | Pelabelan Jarak Ajaib Menggunakan Algoritma Pa... | Dr. Rinovia M. G. Simanjuntak | 2018-06-28 | BSC-A |
| 3 | Arif Nurwahid | Polinom Pembangkit Kode Siklik Aditif atas Z2Z4 | Djoko Suprijanto, Ph.D. | 2018-04-05 | BSC-A |
| 4 | Dimas Dwi Adiguna | Analisis Interaksi Surface Binding antara Afla... | Acep Purqon, Ph.D. | 2018-04-09 | BSC-A |
| 5 | Teja Kesuma | Pengaruh Pelanggan Prioritas pada Permasalahan... | Acep Purqon, Ph.D. | 2018-04-09 | BSC-A |
| 6 | Teja Kesuma | Pengaruh Pelanggan Prioritas pada Permasalahan... | Acep Purqon, Ph.D. | 2018-04-16 | BSC-A |
| 7 | Prasetiyo Hadi Purwoko | Pengembangan Perangkat Lunak Berorientasi Obye... | Muhamad A. Martoprawiro, Ph.D. | 2018-03-02 | BSC-A |
| 8 | Robieth Sohiburoyyan | Pendekatan Simulated Annealing pada Pelabelan ... | Dr. Rinovia M. G. Simanjuntak | 2018-03-05 | BSC-A |
| 9 | Arfian Alimansyah | Aplikasi Persamaan Jump Diffusion dalam Valuas... | Acep Purqon, Ph.D. | 2018-03-06 | BSC-A |
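The chained `split` calls above depend on details such as student IDs starting with `209`. A regular expression can make the expected layout explicit. The sketch below assumes every entry's first line follows the pattern seen in the sample output (`<date> <time> <room> Tesis <I or II> <8-digit student id> <author>`); the helper name `parse_entry_header` and the field names are hypothetical.

```python
import re
from datetime import datetime

# Assumed layout: "<date> <time> <room> Tesis <I|II> <8-digit student id> <author>"
ENTRY_PATTERN = re.compile(
    r'^(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{4}) (?P<room>.+?) '
    r'Tesis (?P<stage>I+) (?P<student_id>\d{8}) (?P<author>.+)$'
)


def parse_entry_header(text):
    """Parse the first text line of a seminar <li> into a dict of fields."""
    match = ENTRY_PATTERN.match(text.strip())
    if match is None:
        return None   # layout differs from the assumption above
    fields = match.groupdict()
    fields['date'] = datetime.strptime(fields['date'], '%Y-%m-%d').date()
    return fields


print(parse_entry_header(
    '2018-07-16 0900 CAS 1A Tesis I 20917007 Syahrul Bahar Hamdani'
))
```

Returning `None` for lines that do not match makes it easy to spot entries that need a closer look instead of silently producing wrong columns.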


## Scraping Bukalapak Webpage

Let's focus on some particular categories: **komputer, handphone, elektronik, and kamera**. For instance, using the search query _case ipad 6 2018_ in the **handphone** category, we get the following URL: `https://www.bukalapak.com/c/handphone?search%5Bhashtag%5D=&search%5Bkeywords%5D=case+ipad+6+2018`. Breaking down this URL, the base URL is `https://www.bukalapak.com/c`, `handphone` is the category identifier, and the rest is the query keywords or parameters. Hence, we try to scrape with an arbitrary search query as **input** for every category.
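The breakdown of that URL can be reproduced with the standard library, which also decodes the percent-encoded brackets back into `search[keywords]`. This is only a sketch for inspecting the URL shown above.

```python
from urllib.parse import urlsplit, parse_qs

url = ('https://www.bukalapak.com/c/handphone'
       '?search%5Bhashtag%5D=&search%5Bkeywords%5D=case+ipad+6+2018')

parts = urlsplit(url)
category = parts.path.rsplit('/', 1)[-1]                # last path segment
query = parse_qs(parts.query, keep_blank_values=True)   # decoded parameters

print(parts.netloc, category)
print(query)
```

The decoded keys are exactly what goes into the `params` dictionary when building the request.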

Next, inspecting the HTML structure, the site shows all matching products. Basically, there is a **div** tag with class **basic-products** that acts as the main showcase. There may also be **section** tags in the top and bottom areas. Each product is wrapped inside an **article** tag, which helps us extract the product information from each product card.
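That structure can be rehearsed offline before touching the live site. The markup below is hand-written to follow the description (a `div.basic-products` holding `article.product-display` cards, each with an `a.product__name`); the product names and paths are made up, and the real page carries many more attributes and may have changed.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Hand-written markup following the structure described above (an assumption)
html = """
<div class="basic-products">
  <article class="product-display">
    <a class="product__name" title="Case A" href="/p/handphone/case-a">Case A</a>
  </article>
  <article class="product-display">
    <a class="product__name" title="Case B" href="/p/handphone/case-b">Case B</a>
  </article>
</div>
"""
soup = BeautifulSoup(html, features='html.parser')
showcase = soup.find('div', class_='basic-products')

products = []
for card in showcase.find_all('article', class_='product-display'):
    anchor = card.find('a', class_='product__name')
    products.append({
        'product_title': anchor['title'],
        # urljoin resolves the relative href against the site root
        'product_url': urljoin('https://www.bukalapak.com', anchor['href']),
    })

print(products)
```

A list of dictionaries like `products` can be turned into a table with `pd.DataFrame(products)`.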

In [22]:
# let's define sites information
BL_SITE_BASE = 'https://bukalapak.com'
BL_SITE_SOURCE = [
    'https://www.bukalapak.com/c/handphone',
    'https://www.bukalapak.com/c/komputer',
    'https://www.bukalapak.com/c/elektronik',
    'https://www.bukalapak.com/c/kamera',
]

In [23]:
param = {
    'search[keywords]': 'ipad 6 2018 case'
}
bl_html, _ = get_html(BL_SITE_SOURCE[0], param=param)
bl_soup = BeautifulSoup(bl_html, features='lxml')

In [24]:
# bl_soup    # will print a long HTML output

In [25]:
bl_basic_product = bl_soup.find(name='div', class_='basic-products')
product_title = pd.Series(
    [product_card.find('a', class_='product__name')['title']
     for product_card in bl_basic_product.find_all('article', class_='product-display')],
    name='product_title'
)
product_href = pd.Series(
    [BL_SITE_BASE+product_card.find('a', class_='product__name')['href']
     for product_card in bl_basic_product.find_all('article', class_='product-display')],
    name='product_url'
)

bl_product = pd.concat([product_title, product_href], axis=1)
bl_product

| | product_title | product_url |
|---|---|---|
| 0 | Switcheasy CoverBuddy iPad 9.7 2018 2017 Case ... | https://bukalapak.com/p/handphone/aksesoris-ha... |
| 1 | Ipad 6 2018 9.7 Case Cover TOTU dengan Pen Holder | https://bukalapak.com/p/handphone/aksesoris-ha... |
| 2 | Smart Case Cover New iPad 9.7 2018 Gen 6 Model... | https://bukalapak.com/p/handphone/aksesoris-ha... |
| 3 | smart case smart cover new ipad 6 generation i... | https://bukalapak.com/p/handphone/aksesoris-ha... |
| 4 | CASE COVER ORIGINAL SMARTCASE AUTO LOCK FOR NE... | https://bukalapak.com/p/handphone/aksesoris-ha... |
| 5 | Silikon Ipad 6 2018 - Ipad 5 2017 - Ipad 9.7 2... | https://bukalapak.com/p/handphone/aksesoris-ha... |
| 6 | CASE COVER TOTU WITH PEN HOLDER IPAD 6 2018 O... | https://bukalapak.com/p/handphone/tablet/13czb... |
| 7 | Griffin Survivor Cover New iPad 5 Air 3 9.7 iP... | https://bukalapak.com/p/handphone/aksesoris-ha... |
| 8 | Ipad 6 2018 97 Case Cover UAG dengan Pen Holder | https://bukalapak.com/p/handphone/aksesoris-ha... |
| 9 | NEW IPAD 6 2018 9.7 inchi Rugged Armor NEW IPA... | https://bukalapak.com/p/handphone/aksesoris-ha... |


In [26]:
print(bl_product.loc[21, 'product_title'])
print(bl_product.loc[21, 'product_url'])

Dazzle Ipad 6 2018 9.7 Inch - Case Tebal Tahan Banting ShockProof Bisa Stand
https://bukalapak.com/p/handphone/aksesoris-handphone/casing-cover/1crw4be-jual-dazzle-ipad-6-2018-9-7-inch-case-tebal-tahan-banting-shockproof-bisa-stand?from=&product_owner=normal_seller


In [27]:
# define a function that take searching query as function parameter
def bl_make_soup(bl_url, bl_param=None):
    """Get the HTML string with `get_html()` and parse it.
    
    Parameters
    ----------
    bl_url: str
        URL target for the HTTP GET request
    bl_param: dict
        `params` to pass into `requests.get()` method
        
    Returns
    -------
    bs4.BeautifulSoup
        BeautifulSoup object
    str
        Complete URL of the request
    """
    html, url = get_html(url=bl_url, param=bl_param)
    
    return BeautifulSoup(html, features='lxml'), url

In [28]:
param = {
    'search[keywords]': 'ipad 6 2018 new',
    'page': 1,
}
soup, query_url = bl_make_soup(BL_SITE_SOURCE[0], param)
print(query_url)

https://www.bukalapak.com/c/handphone?search%5Bkeywords%5D=ipad+6+2018+new&page=1


In [29]:
# how many product per page?
product_finder = soup.find_all(name='a', attrs={'class': 'product__name'})
product_title_list = [BL_SITE_BASE+product['href'] for product in product_finder]
print(product_title_list[:2])
print(f'Number of product per page: {len(product_title_list)} products')

# how many pages are generated?
# --
# to answer this question, Bukalapak provides 2 options:
#    1. a <span class='last-page'>
#    2. the pagination <a> tags directly
# so, first check for a span with class last-page; if it exists, use its string as max page
# else, use the string of the second-to-last <a> as max page
pagination = soup.find(name='div', attrs={'class': 'pagination'})
lastpage = pagination.find(name='span', attrs={'class': 'last-page'})
if lastpage is None:
    max_page = pagination.find_all(name='a')[-2].string
else:
    max_page = lastpage.string
print(f'Number of page: {max_page} pages')

['https://bukalapak.com/p/handphone/tablet/hdqs7u-jual-bnib-new-ipad-9-7-inch-ipad-2018-ipad-6-6th-gen-air-4-32gb-wifi-only?from=&product_owner=normal_seller', 'https://bukalapak.com/p/handphone/tablet/hdqwc7-jual-bnib-new-ipad-9-7-inch-ipad-2018-ipad-6-6th-gen-air-4-128gb-wifi-only?from=&product_owner=normal_seller']
Number of product per page: 50 products
Number of page: 321 pages


In [30]:
# define necessary function
def count_product(url, param, page):
    """Count the products generated for a `page`.
    
    Parameters
    ----------
    url: `str`
        URL target for the HTTP GET request
    param: `dict`
        `params` to pass into `requests.get()` method
    page: `int`
        Page number
    
    Returns
    -------
    product_amount: `int`
        Amount of product in `page`
    """
    param['page'] = page
    # bl_make_soup returns (soup, url); only the soup is needed here
    soup, _ = bl_make_soup(bl_url=url, bl_param=param)
    basic_product = soup.find('div', class_='basic-products')
    product_amount = len(basic_product.find_all('article', class_='product-display'))
    
    return product_amount

def count_page(soup):
    """Count page.
    
    Parameters
    ----------
    soup: `BeautifulSoup`
        BeautifulSoup object
    
    Returns
    -------
    max_page: `int`
        Number of generated page
    """
    pagination = soup.find('div', class_='pagination')
    last_page = pagination.find('span', class_='last-page')
    if last_page is None:
        max_page = int(pagination.find_all('a')[-2].get_text())
    else:
        max_page = int(last_page.get_text())
    
    return max_page

We now have the number of pages, so we can iterate over them. Next, we can access the `href` of each product on every page.

In [31]:
# function to extract href
def get_href(base_url, html):
    """Get href in given `html` text.
    
    Parameters
    ----------
    base_url: `str`
        Base URL to concatenate with `href` string.
    html: `str`
        HTML string of <a> tag.
    
    Returns
    -------
    href: `str`
        value of href attributes
    """
    anchor_soup = BeautifulSoup(html, features='lxml')
    # index the <a> tag itself; the BeautifulSoup object has no attributes
    href = anchor_soup.a['href']
    
    return base_url + href

In [33]:
# get html string
param['search[keywords]'] = 'zenfone max pro 2018'
soup, _ = bl_make_soup(bl_url=BL_SITE_SOURCE[0], bl_param=param)
# determine max_page
max_page = count_page(soup)
print(f'Total page: {max_page} page(s)')

# loop for every page
data_product = pd.DataFrame([], columns=['product_title', 'product_url'])
for page in range(1, max_page+1):
    time.sleep(1)
    param['page'] = page
    soup, _ = bl_make_soup(BL_SITE_SOURCE[0], param)
    basic_product = soup.find('div', class_='basic-products')
    product_page = pd.concat(
        [
            pd.DataFrame([[product['title'], BL_SITE_BASE+product['href']]], columns=['product_title', 'product_url'])
            for product in basic_product.find_all('a', class_='product__name')
        ], ignore_index=True
    )
    data_product = pd.concat([data_product, product_page], ignore_index=True)
    if page % 20 == 0:
        print(f'Collect data product in page: {page}', f'Current shape: {data_product.shape}')

Total page: 22 page(s)
Collect data product in page: 20 Current shape: (600, 2)
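The loop above reuses one `param` dictionary and overwrites its `page` key on every iteration. An alternative sketch, which is not the notebook's approach, is to generate a fresh dictionary per page so the original search parameters stay untouched. The helper name `page_params` is hypothetical.

```python
def page_params(base_param, max_page):
    """Yield one parameter dict per page without mutating `base_param`."""
    for page in range(1, max_page + 1):
        # Build a fresh dict so the caller's dict is left untouched
        yield {**base_param, 'page': page}


base = {'search[keywords]': 'zenfone max pro 2018'}
pages = list(page_params(base, max_page=3))
print(pages[0], pages[-1], base, sep='\n')
```

Each yielded dictionary can be passed as `bl_param` to `bl_make_soup`, with `time.sleep` between calls as in the loop above.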


In [34]:
data_product

| | product_title | product_url |
|---|---|---|
| 0 | Asus Zenfone Max Pro M1 | https://bukalapak.com/p/handphone/hp-smartphon... |
| 1 | ASUS ZENFONE MAX PRO M1 RAM 3GB ROM 32GB GARAN... | https://bukalapak.com/p/handphone/hp-smartphon... |
| 2 | TEMPERED GLASS FULL LEM TEMPERED GLASS 5D FOR ... | https://bukalapak.com/p/handphone/aksesoris-ha... |
| 3 | Case Asus Zenfone Max Pro M1 | https://bukalapak.com/p/handphone/aksesoris-ha... |
| 4 | Tempered Glass Asus Zenfone Max M1 ZB555KL | https://bukalapak.com/p/handphone/aksesoris-ha... |
| 5 | Original Black Matte Softcase Ultra Thin Baby ... | https://bukalapak.com/p/handphone/aksesoris-ha... |
| 6 | Original Softcase TPU Solid Matte Pro Series B... | https://bukalapak.com/p/handphone/aksesoris-ha... |
| 7 | Case Asus Zenfone 5 | https://bukalapak.com/p/handphone/aksesoris-ha... |
| 8 | ASUS ZENFONE MAX PRO M1 ZB602KL | https://bukalapak.com/p/handphone/hp-smartphon... |
| 9 | LCD 1SET ASUS ZB602KL ZB601KL ASUS ZENFONE MAX... | https://bukalapak.com/p/handphone/spare-part-t... |


Once we have the products' hrefs, we move on to **collecting product details**. The following are the features we may want to look at:

1. Seller name
2. Product price
3. Product star (if it exists)
4. Product rating (if it exists)
5. Total view
6. Total sold

In [35]:
URL = data_product.loc[20, 'product_url']
psoup, _ = bl_make_soup(URL)

In [36]:
seller = psoup.find('a', class_='c-user-identification__name').get_text()
price = psoup.find('span', class_='amount').get_text()
details = [detail.get_text().strip() for detail in psoup.find_all('dd', class_=['qa-pd-sold-value', 'qa-pd-seen-value', 'qa-pd-favorited-value', 'qa-pd-weight'])]

In [37]:
details

['2', '301', '2', '50 gram']

In [38]:
# define function to get product details
# and returns a dictionary
def get_product_details(soup):
    """Collect product details from given `soup`.
    
    Parameters
    ----------
    soup: `BeautifulSoup`
        BeautifulSoup object
    
    Returns
    -------
    detail_dict: `dict`
        Product details with its title.
    """
    detail_dict = {'seller_name': soup.find('a', class_='c-user-identification__name').get_text()}
    detail_dict['price'] = float(soup.find('span', class_='amount').get_text())
    detail_dict['total_sold'] = int(soup.find('dd', class_='qa-pd-sold-value').get_text())
    detail_dict['total_views'] = int(soup.find('dd', class_='qa-pd-seen-value').get_text())
    detail_dict['favorited'] = int(soup.find('dd', class_='qa-pd-favorited-value').get_text())
    detail_dict['weight'] = soup.find('dd', class_='qa-pd-weight').get_text().strip()
    
    return detail_dict

In [39]:
product_detail = get_product_details(psoup)
product_detail

{'seller_name': 'A1 STORE',
 'price': 165.0,
 'total_sold': 2,
 'total_views': 301,
 'favorited': 2,
 'weight': '50 gram'}
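One caveat worth checking on the price: the raw text of the price element is not shown here, but if the page prints amounts with a dot as the thousands separator (an assumption, though common for rupiah prices), then `float('165.000')` yields `165.0` rather than 165000. A hypothetical helper `parse_price` under that assumption could look like this:

```python
import re


def parse_price(text):
    """Convert a price string such as 'Rp165.000' to an integer amount.

    Assumes dots are thousands separators and there is no decimal part.
    """
    digits = re.sub(r'\D', '', text)   # keep digits only
    if not digits:
        raise ValueError(f'no digits found in {text!r}')
    return int(digits)


print(parse_price('165.000'))   # thousands separator removed
print(float('165.000'))         # dot treated as a decimal point
```

Printing the raw `get_text()` value once before converting it is a quick way to confirm which interpretation applies.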