# Web Scraping with Beautiful Soup and Mechanical Soup

This notebook covers the basics of web scraping.

To begin, Beautiful Soup and Mechanical Soup must be installed. If the packages have not yet been installed, uncomment the cell below and run it.

In [1]:
#!pip install beautifulsoup4
#!pip install MechanicalSoup
#!pip install lxml

In [2]:
import mechanicalsoup
from datetime import datetime


# Table of Contents
1. [Web Scraping of Singapore's Current Weather](#weather)
2. [Text Messaging Abbreviations](#abb)

## Web Scraping of Singapore's Current Weather <a name="weather"></a>

Mechanical Soup is used to scrape the weather details from http://www.weather.gov.sg/home.

This is a simple scrape that collects Singapore's current temperature, wind speed and precipitation.

`<Response [200]>` indicates that the HTTP request succeeded and the page's HTML was retrieved.

In [3]:
url = "http://www.weather.gov.sg/home"
browser = mechanicalsoup.Browser()
page = browser.get(url)
now = datetime.now()
current_datetime = now.strftime("%d %B %Y, %H:%M:%S")
page

<Response [200]>
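The `<Response [200]>` output only shows the numeric status code. As a minimal sketch, the helper below spells out what a code means using the standard library's `http.HTTPStatus`. The name `describe_status` is hypothetical and not part of Mechanical Soup, and the helper assumes the code is passed in as a plain integer.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return a short, human-readable description of an HTTP status code."""
    # HTTPStatus raises ValueError for codes the standard library does not know.
    status = HTTPStatus(code)
    # Codes from 200 to 299 mean the request succeeded.
    outcome = "success" if 200 <= code < 300 else "not a success"
    return f"{code} {status.phrase} ({outcome})"


print(describe_status(200))
print(describe_status(404))
```

In this notebook it could be called as `describe_status(page.status_code)`, since the response object exposes the code through its `status_code` attribute.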

`page.soup.select` takes a CSS selector as its argument and returns a list of all elements that match it. For example, the minimum and maximum temperatures are each found in an `<h2>` tag nested in a `<div>` tag, which is itself inside an element with the class `media`. The matching selector is `.media div h2`.

`.text` is used to obtain the text inside an element.
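To try the selector without contacting the weather site, the sketch below runs `select` on a small inline snippet. The markup is a hypothetical stand-in that only mimics the structure described above, so the real page's HTML may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the described structure, not the real page.
html = """
<div class="media">
  <div><h2>31°C</h2></div>
  <div><h2>23°C</h2></div>
</div>
<div class="other"><h2>ignored</h2></div>
"""

soup = BeautifulSoup(html, "html.parser")

# Matches every <h2> inside a <div> that sits inside an element with class "media".
matches = soup.select(".media div h2")

# .text returns the text content of each matched element.
temps = [tag.text for tag in matches]
print(temps)
```

The `<h2>` under `class="other"` is not returned, because it has no ancestor with the class `media`.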

In [4]:
# The page appears to list the maximum first (31°C, then 23°C in the recorded run).
max_temp, min_temp = page.soup.select(".media div h2")
min_temp = min_temp.text
max_temp = max_temp.text

In [5]:
forecast = page.soup.select(".w-sky p")[1].text

In [6]:
precip, wind = page.soup.select(".w-wind p")
precip = precip.text.strip()
wind = wind.text.strip()

Singapore's current weather forecast details are shown below:

In [7]:
print("Date:", current_datetime)
print(f"Minimum Temperature: {min_temp}\nMaximum Temperature: {max_temp}")
print(f"Precipitation: {precip}\nWind: {wind}")
print("Forecast:", forecast)

Date: 18 May 2021, 14:22:48
Minimum Temperature: 23°C
Maximum Temperature: 31°C
Precipitation: 55% - 90%
Wind: SW 5 - 10 km/h
Forecast: Moderate Rain


## Text Messaging Abbreviations <a name = 'abb'></a>
Text messaging abbreviations are extracted from an [HTML page](https://www.webopedia.com/reference/text-abbreviations/), where the information is stored in a `<table>` tag.

In [8]:
from bs4 import BeautifulSoup
from urllib.request import urlopen, Request
import pandas as pd

The website checks the User-Agent header to screen out bots, so the request must supply one. To use Beautiful Soup:

In [9]:
url = "https://www.webopedia.com/reference/text-abbreviations/"
req = Request(url, headers={'User-Agent':'Mozilla/5.0'})
page = urlopen(req)
html = page.read().decode("utf-8")
soup = BeautifulSoup(html, "html.parser")
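To see what the `headers` argument does, the sketch below builds the same kind of `Request` and inspects it without sending anything over the network. One detail to be aware of: `urllib` stores header names with only the first letter capitalised, so the lookup key is `User-agent`.

```python
from urllib.request import Request, build_opener

url = "https://www.webopedia.com/reference/text-abbreviations/"
req = Request(url, headers={"User-Agent": "Mozilla/5.0"})

# urllib normalises header names with str.capitalize(), hence "User-agent".
custom_agent = req.get_header("User-agent")

# Without a custom header, urllib's default opener identifies itself
# as "Python-urllib/<version>".
default_agent = dict(build_opener().addheaders)["User-agent"]

print("Custom: ", custom_agent)
print("Default:", default_agent)
```

The default value names Python explicitly, which is presumably the kind of client the site's bot check rejects.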

To use Mechanical Soup:

In [10]:
b = mechanicalsoup.StatefulBrowser()
b.set_user_agent('my-awesome-script')
b.get(url)

<Response [200]>

The `<table>` tag is used to find the abbreviation table. Each `<td>` holds a single cell of the table, so we need to track whether a cell belongs to the abbreviation column or the meaning column. In addition, section titles appear between the entries. To keep them out of the table we are creating, we skip any cell whose text contains 'chat abbreviations'.

In [11]:
tables = soup.find_all("table")
test = tables[0]

td = test.find_all("td")
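As a small illustration of why the flat list of `<td>` cells needs further handling, the sketch below runs the same `find_all` calls on an inline table. The markup and its rows are hypothetical and much simpler than the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical table: one title row followed by two abbreviation/meaning rows.
html = """
<table>
  <tr><td colspan="2">Chat Abbreviations: A</td></tr>
  <tr><td>AFK</td><td>Away from keyboard</td></tr>
  <tr><td>BRB</td><td>Be right back</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
first_table = soup.find_all("table")[0]

# find_all("td") flattens the table: titles, abbreviations and meanings
# all end up in a single list, in document order.
cell_texts = [cell.text.strip() for cell in first_table.find_all("td")]
print(cell_texts)
```

The title cell appears in the same list as the data cells, which is why the next step has to filter it out and alternate between the two columns.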

In [12]:
abb = []
meaning = []
abb_col = True  # True when the next cell is an abbreviation, False when it is a meaning.
for cell in td:
    text = cell.text.strip()
    # Skip section titles and restart at the abbreviation column.
    if "chat abbreviations" in text.lower():
        abb_col = True
    elif abb_col:
        abb.append(text)
        abb_col = False
    else:
        meaning.append(text)
        abb_col = True

pd.DataFrame(data = {"abb":abb, "meaning":meaning})

| | abb | meaning |
|---|---|---|
| 0 | ? | I have a question |
| 1 | ? | I don’t understand what you mean |
| 2 | ?4U | I have a question for you |
| 3 | ;S | Gentle warning, like “Hmm? What did you say?” |
| 4 | ^^ | Meaning “read line” or “read message” above |
| ... | ... | ... |
| 1547 | ZH | Sleeping Hour |
| 1548 | ZOMG | Used in World of Warcraft to mean OMG (Oh My God) |
| 1549 | ZOT | Zero tolerance |
| 1550 | ZUP | Meaning “What’s up?” |
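The pairing step can also be written without a flag, as a rough alternative sketch: filter out the title cells first, then pair every even-indexed entry with the odd-indexed entry after it. The function name `pair_cells` and the sample cells are hypothetical, and the approach assumes titles only appear between complete abbreviation/meaning pairs.

```python
def pair_cells(cells, title_marker="chat abbreviations"):
    """Pair alternating abbreviation/meaning cells, skipping title cells."""
    # Drop any cell whose text contains the title marker (case-insensitive).
    entries = [c.strip() for c in cells if title_marker not in c.lower()]
    # The remaining entries alternate: abbreviation, meaning, abbreviation, ...
    return list(zip(entries[0::2], entries[1::2]))


# Hypothetical cell texts, in the order find_all("td") would return them.
cells = [
    "Chat Abbreviations: A",
    "AFK",
    "Away from keyboard",
    "Chat Abbreviations: B",
    "BRB ",
    "Be right back",
]

pairs = pair_cells(cells)
print(pairs)
```

The resulting list of tuples could be passed to `pd.DataFrame(pairs, columns=["abb", "meaning"])`. Unlike the flag-based loop, this version does not resynchronise after a title, so a stray unpaired cell would shift every later pair.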


References:<br>
https://realpython.com/python-web-scraping-practical-introduction/#install-mechanicalsoup

Author: Tay Yan Jie<br>
Last Updated: 18 May 2021