# Web Scraping

### Taikgun 'Tek' Song. email: songt@ufl.edu

## Introduction

##### What is Web Scraping?

Web scraping is the process of extracting data from websites by parsing their HTML or interacting with web applications. Common uses include:
* Data Analysis: Collecting data for statistical analysis, research, and market analysis.
* Content Aggregation: Gathering similar content from multiple websites (e.g., news, blogs).
* Automation: Performing repetitive tasks such as filling out forms and submitting data.

##### Understanding HTTP Requests

* Request: When you enter a URL in a web browser and hit enter, the browser sends an HTTP request to the server specified by the URL.
* Response: The server processes the request and responds with an HTTP response. If the request is for a web page, the response usually includes an HTML document. <br>
The browser then interprets and renders this HTML document to display the web page.

<div>
<img src="https://raw.githubusercontent.com/songtuf/webscraping/main/html_request.png" width="700"/>
</div>
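
The request/response cycle described above can be tried without touching any real website. The following is a minimal sketch using only the Python standard library: it starts a tiny local server in a background thread, sends one `GET` request to it, and reads the response. The page content, the `DemoHandler` class, and the variable names are hypothetical and exist only for this illustration.

```python
import threading
from http.client import HTTPConnection
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical page content served by the demo server
PAGE = b"<html><body><h1>Hello</h1></body></html>"


class DemoHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Build the HTTP response: status code, headers, then the HTML body
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(PAGE)))
        self.end_headers()
        self.wfile.write(PAGE)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


# Port 0 asks the operating system for any free port
server = HTTPServer(("127.0.0.1", 0), DemoHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

try:
    # The client side: send a GET request and read the response
    connection = HTTPConnection("127.0.0.1", server.server_port, timeout=5)
    connection.request("GET", "/")
    response = connection.getresponse()
    status = response.status
    content_type = response.getheader("Content-Type")
    body = response.read().decode("utf-8")
    connection.close()
finally:
    server.shutdown()
    server.server_close()

print(status, content_type)
print(body)
```

A browser performs the same exchange and then renders the returned HTML instead of printing it.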

## HTML

### Why Is HTML Important?

* HTML stands for HyperText Markup Language.
* HTML provides the structure of web pages by defining elements such as headings, paragraphs, lists, links, images, and more.
* HTML uses a standardized set of tags to define the structure and layout of a web document. [Link to the list of html tags](https://www.w3schools.com/TAGS/default.asp)
* HTML enables the creation of hypertext links, which connect different web pages and resources across the internet

### HTML Structure

![Directory Structure](https://raw.githubusercontent.com/songtuf/webscraping/main/html_structure.png)

<!DOCTYPE html>  <!-- Declaration -->
<HTML>           <!-- Root of HTML doc -->
    <HEAD>       <!-- Metadata -->
        <TITLE>My Title</TITLE>
    </HEAD>
    <BODY>      <!-- contents -->
        <H1>A Heading</H1>
        <a href="https://www.google.com/">Link text</a>
    </BODY>
</HTML>
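
To see the nesting in the structure above as a program sees it, the sketch below walks the same document with Python's built-in `html.parser` module and prints one indented line per opening tag. The `TagOutline` class is a hypothetical helper written for this illustration, not part of any library.

```python
from html.parser import HTMLParser

HTML_DOC = """<!DOCTYPE html>
<html>
    <head>
        <title>My Title</title>
    </head>
    <body>
        <h1>A Heading</h1>
        <a href="https://www.google.com/">Link text</a>
    </body>
</html>"""


class TagOutline(HTMLParser):
    """Collect one indented line per opening tag to show the nesting."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        # Indent by the current depth, then go one level deeper
        self.lines.append("  " * self.depth + tag)
        self.depth += 1

    def handle_endtag(self, tag):
        # A closing tag moves back up one level
        self.depth -= 1


outline = TagOutline()
outline.feed(HTML_DOC)
print("\n".join(outline.lines))
```

The printed outline shows `head` and `body` nested under `html`, with `title`, `h1`, and `a` one level deeper. This parent/child relationship is what scraping tools navigate.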

### HTML Element

<div>
<img src="https://raw.githubusercontent.com/songtuf/webscraping/main/HTML_tags2.png" width="700"/>
</div>

### Your Turn! -- HTML

1. Start a new Markdown chunk below
2. Declare the HTML document <br>
```<!DOCTYPE html>```
3. Insert the following tags: `html`, `head`, and `body`
    ```
    <!DOCTYPE html>
    <html>
        <head>
        </head>
        <body>
        </body>
    </html>
    ```
4. Insert `title` tag under the `head` tag
    ```
    <head>
        <title>HTML Tutorial</title>
    </head>
    ```
5. Create a division using the `div` tag under the `body` tag
    ```
    <body>
        <H1> This is H1 heading</H1>
        <p> This is a paragraph</p>
        <div>
            <H3>This is H3 heading</H3>
        </div>
    </body>
    ```

Double click on this markdown chunk

## Essential Packages for Web Scraping

* `requests` - Library for sending HTTP requests
    - We will use the `GET` request to retrieve data from the server
* `BeautifulSoup` - Popular library that extracts data from HTML and XML documents
* `pandas` - Powerful library for data manipulation and analysis

In [None]:
# This chunk contains Python code
# Text following the `#` symbol represents comments
# Python ignores comments during execution
# Comments are provided to help readers understand the code

In [1]:
# Let's try to import the requests and BeautifulSoup packages
# The two lines should run smoothly
import requests
from bs4 import BeautifulSoup

In [2]:
# As an example, we will scrape example.com
# Define object `url` to be http://example.com
url = "http://example.com"

# Use the `get` request to retrieve the data
response = requests.get(url)
# Check the response
response

<Response [200]>

### Selected HTTP status codes
* 200: OK -- Request was successful
* 401: Unauthorized -- Request lacks valid authentication credentials for the requested resource
* 403: Forbidden -- Server understood the request but refused to authorize it
* 404: Not Found -- Server cannot find the requested resource
* 502: Bad Gateway -- Server received an invalid response from the upstream server
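
The codes listed above can also be looked up programmatically. Here is a small sketch using the standard library's `http.HTTPStatus`; the `describe_status` function is a hypothetical helper for this illustration.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return a short, human-readable summary of an HTTP status code."""
    phrase = HTTPStatus(code).phrase
    if 200 <= code < 300:
        outcome = "success"
    elif 400 <= code < 500:
        outcome = "client error"
    elif 500 <= code < 600:
        outcome = "server error"
    else:
        outcome = "other"
    return f"{code} {phrase}: {outcome}"


for code in (200, 401, 403, 404, 502):
    print(describe_status(code))
```

With `requests`, the numeric code is available as `response.status_code`, and `response.raise_for_status()` raises an exception for 4xx and 5xx responses, which is one way to stop early before parsing an error page.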

In [3]:
# Extract the text part of the response
html_content = response.text
html_content

'<!doctype html>\n<html>\n<head>\n    <title>Example Domain</title>\n\n    <meta charset="utf-8" />\n    <meta http-equiv="Content-type" content="text/html; charset=utf-8" />\n    <meta name="viewport" content="width=device-width, initial-scale=1" />\n    <style type="text/css">\n    body {\n        background-color: #f0f0f2;\n        margin: 0;\n        padding: 0;\n        font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;\n        \n    }\n    div {\n        width: 600px;\n        margin: 5em auto;\n        padding: 2em;\n        background-color: #fdfdff;\n        border-radius: 0.5em;\n        box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);\n    }\n    a:link, a:visited {\n        color: #38488f;\n        text-decoration: none;\n    }\n    @media (max-width: 700px) {\n        div {\n            margin: 0 auto;\n            width: auto;\n        }\n    }\n    </style>    \n</head>\n\n<body>\n<div>\n    <

In [4]:
# Create a BeautifulSoup object and specify the parser
soup = BeautifulSoup(html_content, "html.parser")
# Note the difference between the html_content output and the soup output
soup

<!DOCTYPE html>

<html>
<head>
<title>Example Domain</title>
<meta charset="utf-8"/>
<meta content="text/html; charset=utf-8" http-equiv="Content-type"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<style type="text/css">
    body {
        background-color: #f0f0f2;
        margin: 0;
        padding: 0;
        font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;
        
    }
    div {
        width: 600px;
        margin: 5em auto;
        padding: 2em;
        background-color: #fdfdff;
        border-radius: 0.5em;
        box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);
    }
    a:link, a:visited {
        color: #38488f;
        text-decoration: none;
    }
    @media (max-width: 700px) {
        div {
            margin: 0 auto;
            width: auto;
        }
    }
    </style>
</head>
<body>
<div>
<h1>Example Domain</h1>
<p>This domain is for use in illustrative example

In [5]:
# Extract Information using tags
# Find the first 'h1' tag
h1_tag = soup.find('h1')
print(h1_tag.string)

Example Domain


In [6]:
# Use the `find_all` function to extract multiple tags
# Example, find all 'p' tags
p_tags = soup.find_all('p')
p_tags

[<p>This domain is for use in illustrative examples in documents. You may use this
     domain in literature without prior coordination or asking for permission.</p>,
 <p><a href="https://www.iana.org/domains/example">More information...</a></p>]

In [7]:
# Note that multiple p tags are saved in a list.
p_tags[0]

<p>This domain is for use in illustrative examples in documents. You may use this
    domain in literature without prior coordination or asking for permission.</p>

In [8]:
p_tags[1]

<p><a href="https://www.iana.org/domains/example">More information...</a></p>

In [9]:
# Extract the text of the `a` tag
soup.find('a').text

'More information...'

In [10]:
# Extract hypertext reference (href) link in the `a` tag
soup.find('a')['href']

'https://www.iana.org/domains/example'

In [11]:
# Save the extracted url
next_url = soup.find('a')['href']
next_url

'https://www.iana.org/domains/example'

In [12]:
# Let's make another get request, this time using the extracted URL
response = requests.get(next_url)
html_content = response.text
soup = BeautifulSoup(html_content, "html.parser")
# Try to extract the two urls for "RFC 2606" and "RFC 6761".
# Press F12 (Windows) or Cmd+Option+I (Mac) to open the browser's developer tools
# Use the inspector to identify the path to "RFC 2606" and "RFC 6761"
# Hint 1: You may have to descend through several layers of tags. Use `select` instead of `find_all`, with the `>` symbol to reach a child tag
# Hint 2: the `main` tag may not be supported. Use other tags.
# Hint 3: You may have to specify a certain `div` tag. Use its class
soup.select('div[class=help-article] > p > a')
# Note that the `select` function naturally puts multiple tags into a list

[<a href="/go/rfc2606">RFC 2606</a>, <a href="/go/rfc6761">RFC 6761</a>]
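
The difference between the child combinator `>` and a plain descendant selector is easier to see on a small, self-contained document. The HTML below is hypothetical and only loosely modeled on the page structure used above. It is a sketch of how `select` behaves, not a copy of the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML for illustration only
SAMPLE_HTML = """
<div class="help-article">
  <p>See <a href="/go/rfc2606">RFC 2606</a>.</p>
  <ul><li><a href="/domains/reserved">Reserved Domains</a></li></ul>
</div>
<div class="footer"><p><a href="/about">About</a></p></div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# `>` matches direct children only: an `a` inside a `p` inside div.help-article
direct = soup.select("div.help-article > p > a")

# A space matches any descendant, however deeply nested
nested = soup.select("div.help-article a")

print([a["href"] for a in direct])
print([a["href"] for a in nested])
```

The first selector returns only the link inside the paragraph. The second also returns the link nested in the list, while the footer link is excluded in both cases because it sits in a different `div`.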

### Your Turn! -- BeautifulSoup
The goal is to extract the link for "IANA-managed Reserved Domains"

1. Send a `get` request to `'https://www.iana.org/domains/example'`
2. Press F12 (Windows) or Cmd+Option+I (Mac) to open the browser's developer tools
3. Use the inspector to identify the path to the "IANA-managed Reserved Domains" link
4. Extract the URL of "IANA-managed Reserved Domains" and save it as `html_ex`

Hint: `html_ex` will be a list if you used the `select` function. <br>
To use `["href"]`, first select an element from the list. Example: `html_ex[0]['href']`

In [13]:
response = requests.get("https://www.iana.org/help/example-domains")
html_content = response.text
soup = BeautifulSoup(html_content, "html.parser")
html_ex = soup.select("div[class=help-article] > ul > li > a")

In [14]:
# Note that the extracted URL is incomplete.
# This is because the href is a relative path: it omits the site's base URL.
# Check the base URL -- "https://www.iana.org"
# Prepend the base URL to the extracted path
"https://www.iana.org" + html_ex[0]['href']

'https://www.iana.org/domains/reserved'
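
String concatenation works here because the extracted path starts with `/`. As a more general alternative, the standard library's `urllib.parse.urljoin` resolves a link against the URL of the page it was found on. The sketch below assumes the page URL used in the exercise; the `absolute_links` helper and the second and third example links are hypothetical.

```python
from urllib.parse import urljoin


def absolute_links(page_url, hrefs):
    """Resolve each extracted href against the URL of the page it came from."""
    return [urljoin(page_url, href) for href in hrefs]


page_url = "https://www.iana.org/help/example-domains"
links = absolute_links(
    page_url,
    ["/domains/reserved", "other-page", "https://example.com/x"],
)
for link in links:
    print(link)
```

A path starting with `/` is resolved against the site root, a bare name is resolved against the current page's folder, and a link that is already absolute is returned unchanged.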

### Saving a Table
* We will use the `pandas` package to extract a table into a DataFrame
* Let's use the new URL `https://www.iana.org/domains/reserved`


In [15]:
# Import the pandas library under its conventional alias `pd`
from io import StringIO
import pandas as pd

response = requests.get('https://www.iana.org/domains/reserved')
html_content = response.text
soup = BeautifulSoup(html_content, "html.parser")
# Wrap the HTML string in StringIO; passing a literal string to read_html is deprecated
table = pd.read_html(StringIO(str(soup)))
# Note that read_html returns a list of DataFrames, one per table
table[0]


Unnamed: 0,Domain,Domain (A-label),Language,Script
0,إختبار,XN--KGBECHTV,Arabic,Arabic
1,آزمایشی,XN--HGBK6AJ7F53BBA,Persian,Arabic
2,测试,XN--0ZWM56D,Chinese,Han (Simplified variant)
3,測試,XN--G6W251D,Chinese,Han (Traditional variant)
4,испытание,XN--80AKHBYKNJ4F,Russian,Cyrillic
5,परीक्षा,XN--11B5BS3A9AJ6G,Hindi,Devanagari (Nagari)
6,δοκιμή,XN--JXALPDLP,"Greek, Modern (1453-)",Greek
7,테스트,XN--9T4B11YI5A,Korean,"Hangul (Hangŭl, Hangeul)"
8,טעסט,XN--DEBA0AD,Yiddish,Hebrew
9,テスト,XN--ZCKZAH,Japanese,Katakana


In [16]:
# You can also target a specific table when the page contains several.
# Recall the `find` function: it accepts class, id, and other attributes as arguments.
from io import StringIO
specific_table_html = str(soup.find('table', {'id': 'arpa-table'}))
pd.read_html(StringIO(specific_table_html))[0]


Unnamed: 0,Domain,Domain (A-label),Language,Script
0,إختبار,XN--KGBECHTV,Arabic,Arabic
1,آزمایشی,XN--HGBK6AJ7F53BBA,Persian,Arabic
2,测试,XN--0ZWM56D,Chinese,Han (Simplified variant)
3,測試,XN--G6W251D,Chinese,Han (Traditional variant)
4,испытание,XN--80AKHBYKNJ4F,Russian,Cyrillic
5,परीक्षा,XN--11B5BS3A9AJ6G,Hindi,Devanagari (Nagari)
6,δοκιμή,XN--JXALPDLP,"Greek, Modern (1453-)",Greek
7,테스트,XN--9T4B11YI5A,Korean,"Hangul (Hangŭl, Hangeul)"
8,טעסט,XN--DEBA0AD,Yiddish,Hebrew
9,テスト,XN--ZCKZAH,Japanese,Katakana
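
As an alternative to locating the table with BeautifulSoup first, `pd.read_html` accepts an `attrs` argument that filters tables by their HTML attributes. The sketch below uses a small hypothetical document with made-up ids, and it assumes an HTML parser that pandas supports (such as `lxml`) is installed, as it is on Colab.

```python
from io import StringIO

import pandas as pd

# Hypothetical page with two tables; only the second has the id we want
SAMPLE_HTML = """
<table id="first"><tr><th>A</th></tr><tr><td>1</td></tr></table>
<table id="wanted">
  <tr><th>Domain</th><th>Language</th></tr>
  <tr><td>example</td><td>English</td></tr>
  <tr><td>test</td><td>English</td></tr>
</table>
"""

# attrs keeps only the tables whose attributes match the given values
tables = pd.read_html(StringIO(SAMPLE_HTML), attrs={"id": "wanted"})
wanted = tables[0]
print(wanted)
```

The result is still a list, so the DataFrame is taken with `[0]` even though only one table matches.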


In [17]:
# Let's spice things up!
# The goal is to extract the "Dance/Electronic" chart from the Wikipedia Billboard_charts page.
from io import StringIO
response = requests.get('https://en.wikipedia.org/wiki/Billboard_charts')
html_content = response.text
soup = BeautifulSoup(html_content, "html.parser")
# One easy solution is to read all tables
table = pd.read_html(StringIO(str(soup)))
# Then extract the table of interest by its position in the list
table[4]
# This is not a great strategy: a page may contain many tables, and their positions can change


Unnamed: 0,Chart title,Chart type,Number of positions,Description
0,Dance Club Songs,reports from DJs,50,Compiled exclusively from playlists submitted ...
1,Hot Dance/Electronic Songs,"Continuous airplay, single sales, digital down...",50,A chart which uses the same methodology as the...
2,Dance/Mix Show Airplay,Continuous airplay (Spins from exclusive repor...,40,Originally called Hot Dance Airplay when it wa...
3,Dance/Electronic Digital Song Sales,digital sales,50,A chart that tracks the digital download sales...
4,Dance/Electronic Streaming Songs,streaming,25,A chart that tracks the week's top Dance/Elect...


In [18]:
# An alternative is to pinpoint the path to the "Dance/Electronic" table
# Note that this is a non-trivial task
# Step 1: Locate the heading "Dance/Electronic"
soup.find('h3',id='Dance/Electronic')

<h3 id="Dance/Electronic"><span id="Dance.2FElectronic"></span>Dance/Electronic</h3>

In [19]:
# Step 2: Move up to the parent `div` of the heading, because the `table` tag is a sibling of that `div`
soup.find('h3',id='Dance/Electronic').find_parent("div")

<div class="mw-heading mw-heading3"><h3 id="Dance/Electronic"><span id="Dance.2FElectronic"></span>Dance/Electronic</h3><span class="mw-editsection"><span class="mw-editsection-bracket">[</span><a href="/w/index.php?title=Billboard_charts&amp;action=edit&amp;section=9" title="Edit section: Dance/Electronic"><span>edit</span></a><span class="mw-editsection-bracket">]</span></span></div>

In [20]:
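# Step 3: As mentioned in Step 2, the `table` follows the `div`, so search forward for the next table.
# `find_next` is the current name of the older `findNext` alias.
from io import StringIO
path_table = soup.find('h3',id='Dance/Electronic').find_parent("div").find_next("table")
dance_tab = pd.read_html(StringIO(str(path_table)))[0]
dance_tab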


Unnamed: 0,Chart title,Chart type,Number of positions,Description
0,Dance Club Songs,reports from DJs,50,Compiled exclusively from playlists submitted ...
1,Hot Dance/Electronic Songs,"Continuous airplay, single sales, digital down...",50,A chart which uses the same methodology as the...
2,Dance/Mix Show Airplay,Continuous airplay (Spins from exclusive repor...,40,Originally called Hot Dance Airplay when it wa...
3,Dance/Electronic Digital Song Sales,digital sales,50,A chart that tracks the digital download sales...
4,Dance/Electronic Streaming Songs,streaming,25,A chart that tracks the week's top Dance/Elect...
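
The three navigation steps can be rehearsed on a small document before trying them on a large page. The HTML below is hypothetical and much simpler than the real Wikipedia markup. It is a sketch of the heading-to-table navigation, using `find_next_sibling` to make the sibling relationship explicit.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: each heading sits in its own div, followed by a table
SAMPLE_HTML = """
<div class="mw-heading"><h3 id="Rock">Rock</h3></div>
<table><tr><td>rock chart</td></tr></table>
<div class="mw-heading"><h3 id="Dance">Dance</h3></div>
<p>Intro paragraph</p>
<table><tr><td>dance chart</td></tr></table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Step 1: locate the heading by its id
heading = soup.find("h3", id="Dance")
# Step 2: move up to the div that wraps the heading
wrapper = heading.find_parent("div")
# Step 3: take the first table among the siblings that follow the div
dance_table = wrapper.find_next_sibling("table")

print(dance_table.get_text(strip=True))
```

`find_next_sibling` skips the intervening paragraph and only considers elements at the same level as the `div`, whereas `find_next` searches everything that comes later in the document.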


### Your Turn! -- Extract Table
1. Use the following URL `"https://en.wikipedia.org/wiki/Billboard_charts"`
2. Extract the table for `R&B/Hip-Hop`

## Saving the Outputs
* It is important to know your current working directory
* Files are saved to the current working directory unless another path is specified
* Use the `pandas` package to save the output in CSV format

In [21]:
# First, let's mount Google Drive
# We will save the output on Google Drive
# Use the following code when working on Google Colab
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [22]:
# Check the working directory
import os
os.getcwd()

'/content'

In [23]:
# Save the `dance_tab` DataFrame as "Billboard_Dance_Electronics.csv"
# No directory is given, so the file is written to the current working directory
dance_tab.to_csv("Billboard_Dance_Electronics.csv")
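
To control where the file ends up, a full path can be passed to `to_csv`. The sketch below writes a small hypothetical DataFrame into a temporary folder and reads it back. The folder, file name, and data are placeholders for this illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical data standing in for a scraped table
charts = pd.DataFrame(
    {
        "Chart title": ["Dance Club Songs", "Dance/Mix Show Airplay"],
        "Number of positions": [50, 40],
    }
)

with tempfile.TemporaryDirectory() as output_dir:
    # Build the full path: output folder + file name
    out_path = Path(output_dir) / "charts.csv"
    # index=False leaves out the row numbers, so no "Unnamed: 0" column appears on reload
    charts.to_csv(out_path, index=False)
    restored = pd.read_csv(out_path)

print(restored)
```

On Colab with Drive mounted, the output folder could instead be a directory under the mount point (commonly `/content/drive/MyDrive`, though this is an assumption about the Drive layout).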

### Selenium -- Automated Browser

In [24]:
!pip install google-colab-selenium

Collecting google-colab-selenium
  Downloading google_colab_selenium-1.0.14-py3-none-any.whl.metadata (2.7 kB)
Collecting selenium (from google-colab-selenium)
  Downloading selenium-4.25.0-py3-none-any.whl.metadata (7.1 kB)
Collecting trio~=0.17 (from selenium->google-colab-selenium)
  Downloading trio-0.26.2-py3-none-any.whl.metadata (8.6 kB)
Collecting trio-websocket~=0.9 (from selenium->google-colab-selenium)
  Downloading trio_websocket-0.11.1-py3-none-any.whl.metadata (4.7 kB)
Collecting outcome (from trio~=0.17->selenium->google-colab-selenium)
  Downloading outcome-1.3.0.post0-py2.py3-none-any.whl.metadata (2.6 kB)
Collecting wsproto>=0.14 (from trio-websocket~=0.9->selenium->google-colab-selenium)
  Downloading wsproto-1.2.0-py3-none-any.whl.metadata (5.6 kB)
Collecting h11<1,>=0.9.0 (from wsproto>=0.14->trio-websocket~=0.9->selenium->google-colab-selenium)
  Downloading h11-0.14.0-py3-none-any.whl.metadata (8.2 kB)
Downloading google_colab_selenium-1.0.14-py3-none-any.whl (8.

In [None]:
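# The following code will not work: the page builds its table with JavaScript,
# so the table is missing from the HTML that `requests` receives
from io import StringIO
response = requests.get("https://www.iborrowdesk.com/")
html_content = response.text
soup = BeautifulSoup(html_content, "html.parser")
table = pd.read_html(StringIO(str(soup)))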
table

In [26]:
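from io import StringIO
import google_colab_selenium as gs
# Launch an automated Chrome browser, which runs the page's JavaScript
driver = gs.Chrome()
driver.get("https://www.iborrowdesk.com/")
# page_source holds the HTML after the browser has rendered the page
soup = BeautifulSoup(driver.page_source, "html.parser")
table = pd.read_html(StringIO(str(soup)))
table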



[    Symbol                          Name      Fee  Availability  \
 0    CZOOF               CAZOO GROUP LTD  985.7 %         25000   
 1     PRTG           PORTAGE BIOTECH INC  979.1 %         25000   
 2     TBIO               TELESIS BIO INC  973.5 %         15000   
 3     BNZI      BANZAI INTERNATIONAL INC  961.6 %         25000   
 4     REVB    REVELATION BIOSCIENCES INC  922.5 %        250000   
 5    FGFPP        FG FINANCIAL GROUP INC  904.2 %         30000   
 6     WHLR  WHEELER REAL ESTATE INVESTME  856.2 %         20000   
 7     TOVX         THERIVA BIOLOGICS INC  695.0 %        100000   
 8   TLF.CA  BROMPTON TECH LEADERS INCOME  694.8 %        200000   
 9     ADTX                    ADITXT INC  682.9 %         75000   
 10    GNLN    GREENLANE HOLDINGS INC - A  666.9 %         25000   
 11    PSIG    PS INTERNATIONAL GROUP LTD  653.2 %        550000   
 12    MGOL                MGO GLOBAL INC  606.2 %         50000   
 13    ONCO                 ONCONETIX INC  606.1

## What's Next?
* The next step is to write a loop that repeats these requests across multiple pages (web crawling)
* Pages that build their content with JavaScript are dynamic, so a plain HTTP request does not return the rendered data
* Use an automated web browser such as `Selenium` to retrieve data from such pages
* Use a proxy server to avoid bot detection
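
The crawling loop mentioned in the first bullet might be sketched as follows. This is an illustration under stated assumptions, not a complete crawler: `crawl` and `fake_fetch` are hypothetical names, and the fetch step is passed in as a function so the loop can be tried without network access.

```python
import time


def crawl(urls, fetch, delay_seconds=1.0):
    """Fetch each URL in turn, pausing between requests, and collect the results."""
    pages = {}
    for index, url in enumerate(urls):
        if index > 0:
            # Pause between requests to avoid overloading the server
            time.sleep(delay_seconds)
        try:
            pages[url] = fetch(url)
        except Exception as error:
            # Record the failure and continue with the remaining URLs
            print(f"Skipping {url}: {error}")
            pages[url] = None
    return pages


def fake_fetch(url):
    """Stand-in for a real request, used here to keep the example offline."""
    if url.endswith("/missing"):
        raise ValueError("404 Not Found")
    return f"<html><body>{url}</body></html>"


pages = crawl(
    ["https://example.com/a", "https://example.com/missing"],
    fake_fetch,
    delay_seconds=0,
)
print(pages)
```

In a real run, the fetch function could be something like `lambda url: requests.get(url, timeout=10).text`, with a delay of a second or more between requests.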