# Web Scraping with Beautiful and Mechanical Soup

In this notebook, we cover the basics of web scraping.

To begin, Beautiful Soup and Mechanical Soup must be installed. Uncomment the cell below and run it if the packages have not yet been installed.

In [1]:
#!pip install beautifulsoup4
#!pip install MechanicalSoup
#!pip install lxml

In [2]:
import mechanicalsoup
from datetime import datetime
import re
import time


# Table of Contents
1. [Web Scraping of Singapore's Current Weather](#weather)
2. [Text Messaging Abbreviations](#abb)
3. [Scraping Data from Shopee](#shopee)

## Web Scraping of Singapore's Current Weather <a name="weather"></a>

Mechanical Soup is used to scrape the weather details from http://www.weather.gov.sg/home.

This simple example scrapes Singapore's current temperature, wind speed, and precipitation.

`<Response [200]>` indicates that the HTTP request succeeded and the site's HTML was retrieved.

In [3]:
url = "http://www.weather.gov.sg/home"
browser = mechanicalsoup.Browser()
page = browser.get(url)
now = datetime.now()
current_datetime = now.strftime("%d %B %Y, %H:%M:%S")
page

<Response [200]>

`page.soup.select` takes a CSS selector as its argument and returns a list of all elements that match it. For example, the minimum and maximum temperatures are each in an `<h2>` tag inside a `<div>` tag inside an element with the class `media`, which gives the selector `.media div h2`.

`.text` is used to obtain the text inside an element.
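
As a minimal sketch of how `select` and `.text` behave, the snippet below parses a small hand-written HTML string with Beautiful Soup directly. The markup is hypothetical: it only mimics the nesting described above, and the real page may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the nesting described above;
# the real weather page may differ.
html = """
<div class="media">
  <div><h2>23°C</h2></div>
  <div><h2>32°C</h2></div>
</div>
<div class="w-sky"><p>Today</p><p>Thundery showers</p></div>
"""

soup = BeautifulSoup(html, "html.parser")

# select() returns a list of every element matching the CSS selector,
# in the order the elements appear in the document.
temperatures = [tag.text for tag in soup.select(".media div h2")]

# Indexing picks one match; .text returns the text inside that element.
forecast = soup.select(".w-sky p")[1].text

print(temperatures)
print(forecast)
```

Because the list follows document order, unpacking it into named variables relies on knowing which value the page lists first.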

In [4]:
min_temp, max_temp = page.soup.select(".media div h2")
min_temp = min_temp.text
max_temp = max_temp.text

In [5]:
forecast = page.soup.select(".w-sky p")[1].text

In [6]:
precip, wind = page.soup.select(".w-wind p")
precip = precip.text.strip()
wind = wind.text.strip()

Singapore's current weather forecast details are shown below:

In [7]:
print("Date:", current_datetime)
print(f"Minimum Temperature: {min_temp}\nMaximum Temperature: {max_temp}")
print(f"Precipitation: {precip}\nWind: {wind}")
print("Forecast:", forecast)

Date: 19 May 2021, 11:09:53
Minimum Temperature: 32°C
Maximum Temperature: 23°C
Precipitation: 55% - 80%
Wind: SW 10 - 20 km/h
Forecast: Thundery showers


## Text Messaging Abbreviations <a name = 'abb'></a>
Text messaging abbreviations are extracted from an [HTML page](https://www.webopedia.com/reference/text-abbreviations/) where the information is held in a `<table>` tag.

In [8]:
from bs4 import BeautifulSoup
from urllib.request import urlopen, Request
import pandas as pd

The website checks the User-Agent header to screen out bots, so one must be supplied with the request. To use Beautiful Soup:

In [9]:
url = "https://www.webopedia.com/reference/text-abbreviations/"
req = Request(url, headers={'User-Agent':'Mozilla/5.0'})
page = urlopen(req)
html = page.read().decode("utf-8")
soup = BeautifulSoup(html, "html.parser")
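
To see what the header setup does without making a network call, the request object can be inspected before it is sent. This is only a sketch; the names `demo_url` and `demo_req` are illustrative and not part of the notebook.

```python
from urllib.request import Request

demo_url = "https://www.webopedia.com/reference/text-abbreviations/"

# Building a Request does not contact the server; only urlopen() would.
demo_req = Request(demo_url, headers={"User-Agent": "Mozilla/5.0"})

# urllib stores header names capitalised, so the key is "User-agent".
print(demo_req.get_header("User-agent"))

# With no request body, the request is sent as a GET.
print(demo_req.get_method())
```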

To use Mechanical Soup:

In [10]:
b = mechanicalsoup.StatefulBrowser()
b.set_user_agent('my-awesome-script')
b.get(url)

<Response [200]>

The abbreviation table is located via the `<table>` tag. Each `<td>` is one cell of the table, so the cells alternate between the abbreviation column and the meaning column, and we need to track which column each cell belongs to. The table also contains title rows between the entries. To keep these out of the table we are building, we skip any cell whose text contains 'chat abbreviations'.
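
The alternating-column logic can be sketched on a plain list of strings, independent of any web page. The helper `pair_cells` and the title texts in the sample list are hypothetical and chosen only for illustration.

```python
def pair_cells(cells, title_marker="chat abbreviations"):
    """Split a flat list of cell texts into (abbreviations, meanings)."""
    abbreviations, meanings = [], []
    expect_abbreviation = True
    for text in cells:
        text = text.strip()
        if title_marker in text.lower():
            # A title row: skip it and restart on the abbreviation column.
            expect_abbreviation = True
        elif expect_abbreviation:
            abbreviations.append(text)
            expect_abbreviation = False
        else:
            meanings.append(text)
            expect_abbreviation = True
    return abbreviations, meanings


# Sample cell texts standing in for the scraped <td> entries;
# the two title strings are made up.
cells = [
    "Top Chat Abbreviations",
    "?4U", "I have a question for you",
    "Chat Abbreviations: Z",
    "ZOT", "Zero tolerance",
]
abbreviations, meanings = pair_cells(cells)
print(list(zip(abbreviations, meanings)))
```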

In [11]:
tables = soup.find_all("table")
test = tables[0]

td = test.find_all("td")

In [12]:
abb = []
meaning = []
abb_col = True  # True if the next cell belongs to the abbreviation column
for cell in td:
    text = cell.text.strip()
    # Skip title rows; the cell after a title starts a new abbreviation
    if 'chat abbreviations' in text.lower():
        abb_col = True
    elif abb_col:
        abb.append(text)
        abb_col = False
    else:
        meaning.append(text)
        abb_col = True

pd.DataFrame(data = {"abb":abb, "meaning":meaning})

| | abb | meaning |
|---|---|---|
| 0 | ? | I have a question |
| 1 | ? | I don’t understand what you mean |
| 2 | ?4U | I have a question for you |
| 3 | ;S | Gentle warning, like “Hmm? What did you say?” |
| 4 | ^^ | Meaning “read line” or “read message” above |
| ... | ... | ... |
| 1547 | ZH | Sleeping Hour |
| 1548 | ZOMG | Used in World of Warcraft to mean OMG (Oh My God) |
| 1549 | ZOT | Zero tolerance |
| 1550 | ZUP | Meaning “What’s up?” |


## Scraping Data from Shopee <a name='shopee'></a>

The Shopee website is dynamic: some listings are loaded only when the page is scrolled, and the site requires JavaScript. Selenium is therefore used to drive a real browser and scroll the page.
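
The idea of scrolling until nothing new loads can be sketched without a browser. In the snippet below, `scroll_to_bottom` and `FakePage` are hypothetical names: the function takes two callables so the loop can be tried against a stand-in page whose height grows a few times and then stops.

```python
import time


def scroll_to_bottom(get_height, scroll, pause=1.0, max_rounds=50):
    """Scroll until the page height stops growing; return the final height."""
    last_height = get_height()
    for _ in range(max_rounds):
        scroll()
        time.sleep(pause)  # give lazily loaded content time to appear
        new_height = get_height()
        if new_height == last_height:
            break  # nothing new was loaded, so the bottom was reached
        last_height = new_height
    return last_height


class FakePage:
    """Stand-in for a browser: each scroll loads more until a limit."""

    def __init__(self, heights):
        self.heights = heights
        self.index = 0

    def height(self):
        return self.heights[self.index]

    def scroll(self):
        self.index = min(self.index + 1, len(self.heights) - 1)


fake = FakePage([1000, 1800, 2400])
final_height = scroll_to_bottom(fake.height, fake.scroll, pause=0)
print(final_height)
```

With Selenium, `get_height` could wrap `browser.execute_script("return document.body.scrollHeight")` and `scroll` could wrap `browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")`, the two calls used in this notebook. The `max_rounds` cap is an added safeguard against pages that never stop growing.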


In [13]:
#!pip install selenium

In [14]:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('start-maximized')
options.add_argument('disable-infobars')
options.add_argument('--disable-extensions')
# Pass the options to the Chrome driver; they have no effect unless supplied here.
# Assumes Chrome and a matching chromedriver are available on this machine.
browser = webdriver.Chrome(options=options)

In this example, we search for bubble tea in the Dining and Services category:

In [15]:
url = "https://shopee.sg/search?category=166&keyword=bubble%20tea&trackingId=searchhint-1621331087-ae9298f7-b7bd-11eb-9b29-f898ef6c82ca"
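
The search parameters live in the URL's query string, and each page of results is requested by adding a `page` parameter. A rough sketch of building such URLs with the standard library is shown here; `with_page` is a hypothetical helper, and the shortened base URL omits the tracking parameter for readability.

```python
from urllib.parse import parse_qsl, quote, urlencode, urlsplit, urlunsplit


def with_page(url, page):
    """Return the URL with its 'page' query parameter set to the given page."""
    parts = urlsplit(url)
    # Drop any existing page parameter, then append the requested one.
    params = [(k, v) for k, v in parse_qsl(parts.query) if k != "page"]
    params.append(("page", str(page)))
    # quote_via=quote keeps spaces encoded as %20 rather than '+'.
    query = urlencode(params, quote_via=quote)
    return urlunsplit(parts._replace(query=query))


base = "https://shopee.sg/search?category=166&keyword=bubble%20tea"
for page_number in range(3):
    print(with_page(base, page_number))
```

Page numbering starts at 0, matching the counter used when the result pages are fetched.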

In [16]:
def get_results(link, browser, max_pages = 999999):
    # By default, max_pages is a large number so that all pages of search results are scraped.
    # Set max_pages to scrape only the first max_pages pages of search results.
    count = 0
    r = []
    scroll_pause_time = 1
    # while loop to obtain all pages of entries based on search results.
    while True:
        url = link + '&page=' + str(count)
        browser.get(url)
        # The Shopee site loads content dynamically, so scroll the page to ensure everything is loaded.
        while True:
            time.sleep(scroll_pause_time)
            last_height = browser.execute_script("return document.body.scrollHeight")
            #print("last height",last_height)
            browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            time.sleep(scroll_pause_time)
            new_height = browser.execute_script('return document.body.scrollHeight')
            #print("new height", new_height)
            if new_height == last_height:
                browser.execute_script('window.scrollTo(0, document.body.scrollHeight);')
                time.sleep(scroll_pause_time)
                new_height = browser.execute_script('return document.body.scrollHeight')
            if new_height == last_height:
                break
            else:
                last_height = new_height
                continue
        # Obtain the HTML 
        html = browser.execute_script("return document.getElementsByTagName('html')[0].innerHTML")
        # Parse using beautifulsoup
        soup = BeautifulSoup(html, "html.parser")
        
        # If no more pages of search results, break from loop
        if len(soup.select('.shopee-search-empty-result-section')) != 0:
            break
        # else add the search results into the list
        r.extend(soup.select(".col-xs-2-4.shopee-search-item-result__item a"))
        count += 1
        if count + 1 > max_pages:
            break
        
    #r = soup.select(".col-xs-2-4.shopee-search-item-result__item a")
    Names = []
    Prices = []
    Original_Prices = []
    for res in r:
        # Obtain the name of the listing
        name = re.findall('<a data-sqe="link" href="/.{1,280}-i', str(res))[0][26:-2]
        name = name.replace('-',' ')
        
        # Some of the listing have 2 prices, 1 being the discount price and the other being original price
        prices = re.findall(r"\$.{1,4}\.[0-9][0-9]", str(res.text))
        original_price = 0
        price = 0
        if len(prices) == 2:
            original_price, price = prices
        else:
            original_price = prices[0]
            price = prices[0]
        Names.append(name)
        Prices.append(price)
        Original_Prices.append(original_price)
    df = pd.DataFrame(data={"Name":Names, "Price":Prices, "Original Price":Original_Prices})
    return df
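
The price handling inside `get_results` can be tried in isolation. The sketch below makes two assumptions that differ from the function above: it uses a stricter pattern that accepts only digits before the decimal point, and it returns `None` values when a listing has no price instead of raising an error. The helper name `split_prices` and the sample strings are hypothetical.

```python
import re

# A dollar sign, one to four digits, a dot, and two decimal digits.
PRICE_PATTERN = re.compile(r"\$\d{1,4}\.\d{2}")


def split_prices(text):
    """Return (original_price, price) found in a listing's text."""
    prices = PRICE_PATTERN.findall(text)
    if not prices:
        return None, None  # no price found in this listing
    if len(prices) >= 2:
        return prices[0], prices[1]  # original price is listed first
    return prices[0], prices[0]  # no discount: both are the same


print(split_prices("LiHO TEA Earl Grey Milk Tea $4.00$2.49"))
print(split_prices("Tiger Sugar Brown Sugar $5.60"))
print(split_prices("Sold out"))
```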

In [17]:
df = get_results(url, browser)

In [18]:
df

| | Name | Price | Original Price |
|---|---|---|---|
| 0 | Takeaway Hollin 2 Large Sized Bubble Tea at Su... | $6.90 | $11.80 |
| 1 | Takeaway Hollin 2 Large Sized Bubble Tea at To... | $6.90 | $11.80 |
| 2 | Takeaway Hopii Bubble Tea Large Boba Milk Tea ... | $3.00 | $5.80 |
| 3 | Takeaway Chun Yang Tea Signature Bubble Tea at... | $4.00 | $5.80 |
| 4 | (Open now) Tropical Sunday 30 All Iced Tea and... | $6.00 | $6.00 |
| 5 | Western and Bubble Tea Family Meal Delivery | $49.90 | $49.90 |
| 6 | LiHO TEA Earl Grey Milk Tea Bubble Tea | $2.49 | $4.00 |
| 7 | Tiger Sugar Brown Sugar Signature Series Bubbl... | $5.60 | $5.60 |
| 8 | Takeaway Hopii Bubble Tea 2 Fresh Milk Bubble ... | $5.00 | $6.80 |
| 9 | Yan Xi Tang Strawberry Fresh Milk Bubble Tea | $2.99 | $4.80 |


References:<br>
https://realpython.com/python-web-scraping-practical-introduction/#install-mechanicalsoup <br>
https://stackoverflow.com/questions/20986631/how-can-i-scroll-a-web-page-using-selenium-webdriver-in-python

Author: Tay Yan Jie<br>
Last Updated: 19 May 2021