# Web Data Scraping

__Web Data Scraping__ is a technique for extracting data from websites by programmatically accessing web pages and pulling out the information you need. It is useful when a website does not provide an __*Application Programming Interface (API)*__, or when you need large amounts of data quickly and the site's API rate limits do not allow for it. The key steps of web scraping are:

1. __Sending a Request:__ The first step is to send a request to the web server hosting the website to be scraped. This request is typically sent over HTTP or HTTPS.

2. __Receiving the Response:__ The server responds to the request by sending back the requested web page, often in HTML format. Other formats like JSON and XML can also be received depending on the API or web service.

3. __Parsing the Data:__ Once the data is received, it needs to be parsed. For HTML, this usually involves using libraries like BeautifulSoup in Python, which allow for easy navigation of the structure of the HTML and extraction of the relevant information.

4. __Data Extraction:__ After parsing, the necessary data is extracted. This could be product details on an e-commerce site, stock prices, sports statistics, or any other information available on the web.
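The four steps above can be sketched in a few lines. This is a minimal illustration, not the notebook's own demo: the request and response steps are replaced by a hard-coded HTML string (an assumption made so the sketch runs offline), and the `products` markup and field names are hypothetical.

```python
from bs4 import BeautifulSoup

# Steps 1-2 (request/response) are stubbed with a hard-coded page so the
# sketch runs offline; in practice this string would come from an HTTP response.
html = """
<html><body>
  <ul id="products">
    <li class="product"><span class="name">Widget</span> <span class="price">9.99</span></li>
    <li class="product"><span class="name">Gadget</span> <span class="price">24.50</span></li>
  </ul>
</body></html>
"""

# Step 3: parse the HTML into a navigable tree.
soup = BeautifulSoup(html, "html.parser")

# Step 4: extract the fields of interest from each product entry.
products = [
    {
        "name": item.find("span", class_="name").get_text(strip=True),
        "price": float(item.find("span", class_="price").get_text(strip=True)),
    }
    for item in soup.find_all("li", class_="product")
]

print(products)
```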

# Imports

In [None]:
import pandas as pd
import numpy as np
import requests
import warnings
import matplotlib.pyplot as plt
from bs4 import BeautifulSoup

warnings.filterwarnings("ignore")

## BeautifulSoup Methods (Common)


| Method                     | Description                                                   |
|----------------------------|---------------------------------------------------------------|
| `.find()`                  | Finds the **first** matching element                          |
| `.find_all()`              | Finds **all** matching elements                               |
| `.find_next()`             | Finds the **next** matching element in document order         |
| `.find_previous()`         | Finds the **previous** matching element in document order     |
| `.find_next_sibling()`     | Finds the **next** matching sibling element                   |
| `.find_previous_sibling()` | Finds the **previous** matching sibling element               |
| `.find_parents()`          | Finds **all** matching ancestor elements                      |
| `.find_parent()`           | Finds the **closest** matching ancestor element               |
| `.get_text()`              | Extracts the text inside a tag and its descendants            |
| `.decompose()`             | Removes an element from the tree and destroys it              |
| `.replace_with()`          | Replaces an element with new content                          |
| `.select()`                | Finds **all** elements matching a CSS selector                |
| `.select_one()`            | Finds the **first** element matching a CSS selector           |
| `.get()`                   | Retrieves an **attribute value** (`None` if absent)           |
| `.has_attr()`              | Checks whether an element has an **attribute**                |
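Several methods in the table are not exercised by the demo that follows. The sketch below tries `.decompose()`, `.replace_with()`, `.find_parent()`, and `.has_attr()` on a small inline snippet; the markup is made up for illustration and is not taken from the demo page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for this illustration.
html = (
    '<div id="box"><p class="note">Keep <b>this</b></p>'
    '<script>track()</script><a href="/x">old link</a></div>'
)
soup = BeautifulSoup(html, "html.parser")

# .decompose() removes the element (and its contents) from the tree.
soup.find("script").decompose()

# .replace_with() swaps an element for new content (here, a plain string).
soup.find("a").replace_with("new text")

# .find_parent() walks upward to the closest matching ancestor.
bold = soup.find("b")
parent_div = bold.find_parent("div")

# .has_attr() checks for an attribute; .get() reads its value.
has_id = parent_div.has_attr("id")
box_id = parent_div.get("id")

print(soup)
print(has_id, box_id)
```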

## Web Scrape Basic Demo

In [None]:
# Link to URL

url = 'https://renatomaaliw3.github.io/scrape_demo.html'

In [None]:
# Send requests

# Many websites inspect the "User-Agent" header of an incoming HTTP request
# to see what kind of client is making it.
# Setting this header to a browser-like value can help avoid blocks and makes it
# more likely that the site serves the same content a browser would receive.

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/133.0.0.0 Safari/537.36'}
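To see the effect of a custom header without touching the network, one option is to prepare a request and inspect it before it is sent. The sketch below does that; `demo_url` and `demo_headers` are hypothetical values chosen for illustration, and the request is never actually sent.

```python
import requests

# Hypothetical values for illustration; the request is prepared but never sent.
demo_url = "https://example.com/"
demo_headers = {"User-Agent": "Mozilla/5.0 (demo)"}

prepared = requests.Request("GET", demo_url, headers=demo_headers).prepare()

# The default User-Agent that requests would send if none were supplied.
default_agent = requests.utils.default_user_agent()

print(prepared.headers["User-Agent"])
print(default_agent)
```

The default value identifies the client as the `requests` library, which is why some sites treat it differently from a browser.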

In [None]:
# Response

# requests.get() makes an HTTP GET request, which retrieves data
# from the specified resource and returns a Response object.

response = requests.get(url = url, headers = headers)
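Before parsing, it is usually worth confirming that the request succeeded. The sketch below shows one way to do so with `raise_for_status()`. To keep it runnable offline, it builds `Response` objects by hand, which is purely an illustration device; the helper name `is_usable` is hypothetical. A real call would typically also pass `timeout=` to `requests.get()` so that it cannot hang indefinitely.

```python
import requests

def is_usable(resp):
    """Return True when the response has a non-error status code."""
    try:
        resp.raise_for_status()  # raises HTTPError for 4xx/5xx codes
    except requests.HTTPError:
        return False
    return True

# Hand-built responses stand in for real ones (illustration only).
ok_resp = requests.Response()
ok_resp.status_code = 200

missing_resp = requests.Response()
missing_resp.status_code = 404

print(is_usable(ok_resp), is_usable(missing_resp))
```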

In [None]:
# Create a bs4 object to parse the HTML content

# response.content holds the raw bytes of the HTML returned by requests.get().
# Passing bytes lets BeautifulSoup work out the character encoding itself.

# 'html.parser' tells BeautifulSoup to use Python's built-in HTML parser.
# The parser converts the raw HTML into a structured format (a parse tree) that you
# can easily navigate and search with methods like .find(), .find_all(), and .select().

soup = BeautifulSoup(response.content, 'html.parser')
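As a small illustration of why raw bytes are a reasonable input, the sketch below feeds UTF-8 encoded bytes with a declared charset to BeautifulSoup and reads back the decoded text. The `raw_bytes` value is a made-up stand-in for `response.content`.

```python
from bs4 import BeautifulSoup

# Stand-in for response.content: raw UTF-8 bytes with a declared charset.
raw_bytes = (
    '<html><head><meta charset="utf-8"><title>Café menu</title></head>'
    "<body><p>Crème brûlée</p></body></html>"
).encode("utf-8")

soup = BeautifulSoup(raw_bytes, "html.parser")

# The accented characters survive because the bytes were decoded as UTF-8.
title_text = soup.title.string
paragraph_text = soup.p.get_text()

print(title_text, "|", paragraph_text)
```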

In [None]:
# .find()

# output = soup.find('table')
# output = soup.find('table', id = 'table-2')
# output = soup.find('td', class_ = 'vip')
output = soup.find('td', attrs = {'data-occupation': 'ceo'})

output

<td data-occupation="ceo">Ariel Buckner</td>
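The commented-out variants in the cell above can be compared side by side on a small inline table. This is a sketch on made-up markup (the names and attribute values are hypothetical), so it runs without fetching the demo page.

```python
from bs4 import BeautifulSoup

html = """
<table id="table-2">
  <tr><td class="vip">Ada</td><td data-occupation="ceo">Grace</td></tr>
  <tr><td class="vip">Linus</td><td data-occupation="cto">Ken</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

by_tag = soup.find("td")                      # first <td> in document order
by_id = soup.find("table", id="table-2")      # match on the id attribute
by_class = soup.find("td", class_="vip")      # class_ avoids the reserved word
by_attr = soup.find("td", attrs={"data-occupation": "cto"})  # hyphenated names need attrs
missing = soup.find("td", class_="guest")     # no match returns None

print(by_tag.get_text(), by_class.get_text(), by_attr.get_text(), missing)
```

Because `.find()` returns `None` when nothing matches, it is worth checking the result before calling methods on it.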

In [None]:
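# .find_all()
# Uncomment one example at a time. With every example commented out, `output`
# still holds the value assigned in the .find() cell above.

# output = soup.find_all('th')
# output = soup.find_all('th', class_ = 'name') # or use attrs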

# target_ol = soup.find('ol', class_ = 'list level2')
# output = target_ol.find_all('span', recursive = False)

# target_ol = soup.find('ol', class_ = 'list level1')
# output = target_ol.find_all('span', string = ' LEVEL 2 ')

# import re
# target_ol = soup.find('ol', class_ = 'list level1')
# output = target_ol.find_all('span', string = re.compile('LEVEL 2'))

# target_ol = soup.find('ol', class_ = 'list level1')
# output = target_ol.find_all('ol', class_ = 'list level3', limit = 2)

output

<td data-occupation="ceo">Ariel Buckner</td>
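The `.find_all()` options shown above (`recursive`, `string`, a compiled regex, and `limit`) behave as follows on a small nested list. The markup is a simplified, hypothetical stand-in for the demo page.

```python
import re
from bs4 import BeautifulSoup

html = """
<ol class="list level1">
  <li><span> LEVEL 1 </span>
    <ol class="list level2">
      <li><span> LEVEL 2 </span></li>
      <li><span> LEVEL 2 again </span></li>
    </ol>
  </li>
</ol>
"""
soup = BeautifulSoup(html, "html.parser")
outer = soup.find("ol", class_="list level1")

all_spans = outer.find_all("span")                     # every nested <span>
direct_items = outer.find_all("li", recursive=False)   # direct children only
exact = outer.find_all("span", string=" LEVEL 2 ")     # exact text, spaces included
pattern = outer.find_all("span", string=re.compile("LEVEL 2"))  # regex search
limited = outer.find_all("span", limit=2)              # stop after two matches

print(len(all_spans), len(direct_items), len(exact), len(pattern), len(limited))
```

Note that the exact `string` match is whitespace-sensitive, which is why the regex form is often more forgiving.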

In [None]:
# .select()
# Uncomment one example at a time; otherwise `output` keeps its earlier value.

# output = soup.select('ol')
# output = soup.select('li > span')
# output = soup.select('li#item-304')
# output = soup.select('ol#list-837 li > span.red')
# output = soup.select('li[data-code="E5F6"]') # .select() takes CSS attribute selectors, not attrs
# output = soup.select('ol.list.level3 > li:first-child')
# output = soup.select('span[class ="link red"]')

output

<td data-occupation="ceo">Ariel Buckner</td>
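The selector patterns listed above can be tried on a small inline list. This is a sketch on hypothetical markup that reuses a few identifiers from the cell for readability.

```python
from bs4 import BeautifulSoup

html = """
<ol id="list-837" class="list level3">
  <li id="item-304" data-code="C3D4"><span class="link red">A</span></li>
  <li id="item-305" data-code="E5F6"><span class="link green">B</span></li>
</ol>
"""
soup = BeautifulSoup(html, "html.parser")

by_id = soup.select("li#item-304")                         # tag with a given id
by_attr = soup.select('li[data-code="E5F6"]')              # attribute selector
child_combo = soup.select("ol#list-837 > li > span.red")   # direct-child chain
first_child = soup.select("ol.list.level3 > li:first-child")  # pseudo-class
both_classes = soup.select("span.link.green")              # element with both classes

print([tag.get("id") for tag in by_id + by_attr + first_child])
```

`.select()` always returns a list, so an empty result is `[]` rather than `None`.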

In [None]:
# .select_one()
# Uncomment one example at a time; otherwise `output` keeps its earlier value.

# output = soup.select_one('ol.list.level3 > li')
# output = soup.select_one('ol span')
# output = soup.select_one('table span')

output

<td data-occupation="ceo">Ariel Buckner</td>
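Unlike `.select()`, `.select_one()` returns a single element or `None`. The sketch below, on made-up markup, shows one way to guard against the `None` case before reading text.

```python
from bs4 import BeautifulSoup

html = '<ol class="list"><li><span>first</span></li><li><span>second</span></li></ol>'
soup = BeautifulSoup(html, "html.parser")

first_span = soup.select_one("ol span")     # first match only
no_match = soup.select_one("table span")    # no <table> here, so this is None

# Guard before calling methods on the result to avoid AttributeError on None.
first_text = first_span.get_text() if first_span is not None else ""
missing_text = no_match.get_text() if no_match is not None else ""

print(repr(first_text), repr(missing_text))
```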

In [None]:
# .children (an attribute, not a method): iterates over direct children only

ol_list_level3 = soup.find('ol', class_ = 'list level3')

for child in ol_list_level3.children:

  print(child)



<li class="entry gamma" data-code="C3D4" id="item-867">
<span class="link green"> Third Level A.1 </span>
</li>


<li class="entry delta" data-code="E5F6" id="item-145">
<span class="link purple"> Third Level A.2 </span>
</li>




In [None]:
# .descendants (an attribute, not a method): iterates over children, grandchildren, and so on

ol_list_level3 = soup.find('ol', class_ = 'list level3')

for node in ol_list_level3.descendants:

  print(node)



<li class="entry gamma" data-code="C3D4" id="item-867">
<span class="link green"> Third Level A.1 </span>
</li>


<span class="link green"> Third Level A.1 </span>
 Third Level A.1 




<li class="entry delta" data-code="E5F6" id="item-145">
<span class="link purple"> Third Level A.2 </span>
</li>


<span class="link purple"> Third Level A.2 </span>
 Third Level A.2 
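The blank lines in the two outputs above come from whitespace text nodes between the tags. A minimal sketch, on a simplified hypothetical list, of how `.children` and `.descendants` differ and how the text nodes can be filtered out:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = """<ol>
<li><span> A.1 </span></li>
<li><span> A.2 </span></li>
</ol>"""
soup = BeautifulSoup(html, "html.parser")
ol = soup.find("ol")

children = list(ol.children)        # direct children, including newline strings
descendants = list(ol.descendants)  # every nested node, depth first

# Keep only element nodes by filtering on the Tag type.
child_tags = [node for node in children if isinstance(node, Tag)]
descendant_tags = [node.name for node in descendants if isinstance(node, Tag)]

print(len(children), len(descendants), descendant_tags)
```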






In [None]:
# .find_next_sibling()

table_1 = soup.select_one('table#table-1')
th_name = table_1.find('th', class_ = 'name')
next_sib = th_name.find_next_sibling()

next_sib

<th class="email"> Email </th>

In [None]:
# .find_previous_sibling()

table_1 = soup.select_one('table#table-1')
th_email = table_1.find('th', class_ = 'email')
prev_sib = th_email.find_previous_sibling('th')

prev_sib

<th class="name"> Name </th>
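The sibling methods skip over text nodes, which the plain `.next_sibling` attribute does not. The sketch below contrasts the two on a hypothetical header row and also shows filtering and the plural `.find_next_siblings()`.

```python
from bs4 import BeautifulSoup

html = """<table><tr>
<th class="name"> Name </th>
<th class="email"> Email </th>
<th class="country"> Country </th>
</tr></table>"""
soup = BeautifulSoup(html, "html.parser")
th_name = soup.find("th", class_="name")

raw_next = th_name.next_sibling                  # the newline string between the tags
next_tag = th_name.find_next_sibling()           # skips text nodes
country = th_name.find_next_sibling("th", class_="country")  # filter siblings
later = th_name.find_next_siblings("th")         # all following <th> siblings
nothing = th_name.find_previous_sibling("th")    # first header has none, so None

print(repr(raw_next), next_tag["class"], len(later), nothing)
```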

In [None]:
# .get_text()

table_1 = soup.select_one('table#table-1')
th_name = table_1.find('th', class_ = 'name')
th_name.get_text().strip()

'Name'
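`.get_text()` also accepts `strip` and `separator` arguments, which matter when a tag contains several nested elements. A small sketch on made-up markup:

```python
from bs4 import BeautifulSoup

html = "<td> <b>Ariel</b>\n  <i>Buckner</i> </td>"
soup = BeautifulSoup(html, "html.parser")
td = soup.find("td")

raw = td.get_text()                              # keeps all original whitespace
stripped = td.get_text(strip=True)               # strips each piece, then joins with ""
joined = td.get_text(separator=" ", strip=True)  # strips each piece, joins with a space

print(repr(raw), repr(stripped), repr(joined))
```

With `strip=True` alone the pieces are concatenated without a space, so a `separator` is usually wanted when the tag has more than one text node.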

In [None]:
# .get_text() - long form (explicit nested loops)

data = []
table_rows = soup.select('table#table-1 > tr')[1:]

for row in table_rows:

  cols = []
  tds = row.find_all('td')

  for td in tds:

    text = td.get_text()
    cols.append(text)

  data.append(cols)

data

[["Geoffrey O'Donnell", 'vitae.semper@protonmail.org', 'Nigeria', 'Flevoland'],
 ['Justin Patrick', 'orci.ut.sagittis@outlook.com', 'Peru', 'Punjab'],
 ['Ariel Buckner', 'nunc@google.net', 'Vietnam', 'North Chungcheong'],
 ['Imani Faulkner', 'hendrerit@yahoo.com', 'Philippines', 'Bicol'],
 ['Colorado Hampton', 'eros.nam@hotmail.couk', 'Brazil', 'Katsina']]

In [None]:
# .get_text() - recommended (list comprehension with strip = True)

data = []
table_rows = soup.select('table#table-1 > tr')[1:]

for row in table_rows:

  cols = [col.get_text(strip = True) for col in row.find_all('td')]
  data.append(cols)

data

[["Geoffrey O'Donnell", 'vitae.semper@protonmail.org', 'Nigeria', 'Flevoland'],
 ['Justin Patrick', 'orci.ut.sagittis@outlook.com', 'Peru', 'Punjab'],
 ['Ariel Buckner', 'nunc@google.net', 'Vietnam', 'North Chungcheong'],
 ['Imani Faulkner', 'hendrerit@yahoo.com', 'Philippines', 'Bicol'],
 ['Colorado Hampton', 'eros.nam@hotmail.couk', 'Brazil', 'Katsina']]
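A nested list like `data` is a natural input for a pandas `DataFrame`. The sketch below assumes a table whose first row holds `<th>` headers; the `people` table and its contents are hypothetical, not the demo page's data.

```python
import pandas as pd
from bs4 import BeautifulSoup

html = """
<table id="people">
  <tr><th> Name </th><th> Email </th></tr>
  <tr><td> Ada </td><td> ada@example.com </td></tr>
  <tr><td> Grace </td><td> grace@example.com </td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
rows = soup.select("table#people > tr")

# The first row supplies the column names; the remaining rows supply the records.
column_names = [th.get_text(strip=True) for th in rows[0].find_all("th")]
records = [[td.get_text(strip=True) for td in row.find_all("td")] for row in rows[1:]]

df = pd.DataFrame(records, columns=column_names)
print(df)
```

The `table > tr` selector works here because `html.parser` does not insert a `<tbody>` element; with a different parser the selector may need adjusting.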

In [None]:
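# .attrs
# Note: despite its name, div_tag holds an <li> element.
# Multi-valued attributes such as class are returned as a list.

div_tag = soup.find('li', id = 'item-592')
div_tag.attrs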

{'id': 'item-592', 'class': ['entry', 'alpha'], 'data-code': 'X7Z3'}

In [None]:
# sample scenario: check only the first class in the list

if (div_tag.attrs.get('class')[0] == 'entry'):

  print(True)

else:

  print(False)

True


In [None]:
# sample scenario: membership test with `in` (order-independent)

classes = div_tag.attrs.get('class')

if ('entry' in classes and 'alpha' in classes):

  print(True)

else:

  print(False)

True


In [None]:
# sample scenario: exact match of the full class set

if (set(div_tag.attrs.get('class')) == {'entry', 'alpha'}):

  print(True)

else:

  print(False)

True


In [None]:
# .get()

div_tag = soup.find('li', id = 'item-592')
div_tag.get('data-code')

'X7Z3'
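`.get()` differs from square-bracket access in how it handles a missing attribute. A minimal sketch, on an inline copy of the element's attributes shown above; the `data-owner` attribute is hypothetical and deliberately absent.

```python
from bs4 import BeautifulSoup

html = '<li id="item-592" class="entry alpha" data-code="X7Z3">text</li>'
soup = BeautifulSoup(html, "html.parser")
li_tag = soup.find("li")

code = li_tag.get("data-code")                  # existing attribute
missing = li_tag.get("data-owner")              # absent attribute returns None
fallback = li_tag.get("data-owner", "unknown")  # absent attribute with a default
classes = li_tag.get("class", [])               # multi-valued attribute is a list

# Square-bracket access raises KeyError instead of returning None.
try:
    li_tag["data-owner"]
    raised = False
except KeyError:
    raised = True

print(code, missing, fallback, classes, raised)
```

Using `.get("class", [])` also makes checks such as `'entry' in classes` safe for elements that have no class at all.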