<div>
<img src=https://www.institutedata.com/wp-content/uploads/2019/10/iod_h_tp_primary_c.svg width="300">
</div>



# Demo 8.3: Web Scraping

INSTRUCTIONS:

- Run the cells
- Observe and understand the results

# Web Scraping in Python (using BeautifulSoup)

# HTML Basics
Before starting with the code, let’s understand the basics of HTML and some rules of scraping.

## HTML tags
Below is the source code for a simple HTML webpage.

    <!DOCTYPE html>  
    <html>  
        <head>
        </head>
        <body>
            <h1> First Scraping </h1>
            <p> Hello World </p>
        </body>
    </html>
    
This is the basic syntax of an HTML webpage. Each tag marks out a block of the page:
1. `<!DOCTYPE html>`: HTML documents must start with a document type declaration.
2. The HTML document is contained between `<html>` and `</html>`.
3. Metadata and script declarations sit between `<head>` and `</head>`.
4. The visible part of the HTML document is between `<body>` and `</body>`.
5. Headings are defined with the `<h1>` through `<h6>` tags.
6. Paragraphs are defined with the `<p>` tag.

Other useful tags include `<a>` for hyperlinks, `<i>` for italics, `<table>` for tables, `<tr>` for table rows, and `<td>` for table cells.

HTML tags often carry `id` or `class` attributes. An `id` identifies a single tag, and its value must be unique within the HTML document. A `class` can be shared by many tags and is typically used to give them the same styling. We can use these ids and classes to locate the data we want.
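
As a minimal sketch of locating tags by `id` and `class` with BeautifulSoup: the HTML fragment, the id `page-title`, and the class `intro` below are all made up for illustration and do not come from any real page.

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment, invented for illustration only.
demo_html = """
<body>
  <h1 id="page-title">First Scraping</h1>
  <p class="intro">Hello World</p>
  <p class="intro">Hello again</p>
  <p>No class here</p>
</body>
"""

demo_soup = BeautifulSoup(demo_html, "html.parser")

# An id is unique, so find() returns the single matching tag (or None).
title_tag = demo_soup.find(id="page-title")
print(title_tag.get_text())

# A class can be shared, so find_all() returns every matching tag.
# The keyword is class_ because class is a reserved word in Python.
intro_texts = [p.get_text() for p in demo_soup.find_all("p", class_="intro")]
print(intro_texts)
```

The paragraph without a class is skipped by the second lookup, which is the kind of narrowing that makes attributes useful when scraping.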

## Scraping Rules
1. **Always** check a website’s **Terms and Conditions** before you scrape it. Be careful to read the statements about legal use of data. Usually, the retrieved data should not be used for commercial purposes.
2. **Do not** request data from the website too aggressively with a program (also known as spamming), as this may overload the website. Make sure the program behaves reasonably, at roughly the pace of a human visitor. At most one request per second is good practice.
3. The layout of a website may change from time to time, so make sure to revisit the site and rewrite the code as needed.
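
The second rule can be sketched as a small helper that pauses between requests. The names `fetch_politely` and `fake_fetch` are hypothetical, and the stand-in download function is only there so the example runs without touching a real website.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    results = []
    for position, url in enumerate(urls):
        if position > 0:
            time.sleep(delay_seconds)  # wait before every request except the first
        results.append(fetch(url))
    return results


# Stand-in for a real download function, so the sketch runs offline.
def fake_fetch(url):
    return "contents of " + url


pages = fetch_politely(
    ["https://example.com/a", "https://example.com/b"],
    fake_fetch,
)
print(pages)
```

In real use, `fetch` could be any function that downloads one page; the one-second default simply mirrors the rule of thumb above.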

## Inspecting a Wikipedia Page
Let’s take one page from **Wikipedia** as an example.

Open the web page on [Timeline of food](https://en.wikipedia.org/wiki/Timeline_of_food) with the browser and inspect it.

It has a number of events listed by year. We shall scrape these and load our results into a dataframe.

In [1]:
## Import Libraries
import pandas as pd
import requests
from bs4 import BeautifulSoup
import re
pd.set_option('display.max_colwidth', None)  # show full column contents without truncation

### Define the content to retrieve (webpage's URL)

In [2]:
url = "https://en.wikipedia.org/wiki/Timeline_of_food"

In [3]:
r = requests.get(url)
if r.status_code == 200:
    page = r.content
    print('Type of the variable \'page\':', page.__class__.__name__)
    print('Page Retrieved. Request Status Code: %d, Page Size: %d' % (r.status_code, len(page)))
else:
    print('Some problem occurred. Request Status Code: %d' % r.status_code)

Type of the variable 'page': bytes
Page Retrieved. Request Status Code: 200, Page Size: 231647
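
A slightly more defensive variant of the request is sketched below. It is not part of the original demo: the function name `fetch_html` and the User-Agent string are assumptions made for illustration, while `timeout`, `headers`, and `raise_for_status()` are standard parts of the `requests` API.

```python
import requests


def fetch_html(url, session=None, timeout=10):
    """Download url and return the raw bytes, raising on HTTP errors."""
    if session is None:
        session = requests.Session()
    # Hypothetical User-Agent string; replace it with one that identifies your project.
    headers = {"User-Agent": "food-timeline-demo/0.1 (learning exercise)"}
    response = session.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError for 4xx and 5xx responses
    return response.content
```

Calling `fetch_html(url)` with the `url` defined above would return the same kind of `bytes` object as `r.content`, but a failed request raises an exception instead of printing a message, and a stalled connection gives up after the timeout.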


### Convert the stream of bytes into a BeautifulSoup representation

In [4]:
soup = BeautifulSoup(page, 'html.parser')
print('Type of the variable \'soup\':', soup.__class__.__name__)

Type of the variable 'soup': BeautifulSoup


### Check the content
- The HTML source
- Includes all tags and scripts
- Can be long!

In [5]:
print(soup.prettify())

<!DOCTYPE html>
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-zebra-design-enabled vector-feature-custom-font-size-clientpref-0 vector-feature-client-preferences-disabled vector-feature-client-prefs-pinned-disabled vector-toc-available" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Timeline of food - Wikipedia
  </title>
  <script>
   (function(){var className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-featur

### Check the HTML's Title

In [6]:
print('Title tag :%s:' % soup.title)
print('Title text:%s:' % soup.title.string)

Title tag :<title>Timeline of food - Wikipedia</title>:
Title text:Timeline of food - Wikipedia:


### `<li>` tags
- This page uses the tag `li` to introduce each event in the food timeline

        <li>`Year`: `Event description`</li>

The following code locates all `li` tags not containing any class or id attributes.

In [7]:
list_of_li_tags = soup.find_all('li', attrs={'class': None, 'id': None})

In [8]:
len(list_of_li_tags)

173

Only the first 154 tags (indices 0 to 153) are timeline events; the remaining ones are related-page links and citations:

In [9]:
list_of_li_tags[:154]

[<li>5-2 million years ago: Hominids shift away from the consumption of <a href="/wiki/Nut_(fruit)" title="Nut (fruit)">nuts</a> and <a href="/wiki/Berry" title="Berry">berries</a> to begin the consumption of <a href="/wiki/Meat" title="Meat">meat</a>.<sup class="reference" id="cite_ref-1"><a href="#cite_note-1">[1]</a></sup><sup class="reference" id="cite_ref-2"><a href="#cite_note-2">[2]</a></sup></li>,
 <li>2.5-1.8 million years ago: The discovery of the use of fire for may have created a sense of sharing as a group. Earliest estimate for invention of <a href="/wiki/Cooking" title="Cooking">cooking</a>, by <a class="mw-redirect" href="/wiki/Phylogenetic" title="Phylogenetic">phylogenetic</a> analysis.<sup class="reference" id="cite_ref-PNAScooking_3-0"><a href="#cite_note-PNAScooking-3">[3]</a></sup></li>,
 <li>250,000 years ago: <a class="mw-redirect" href="/wiki/Hearths" title="Hearths">Hearths</a> appear, accepted archeological estimate for invention of cooking chicken. <sup clas

In [10]:
list_of_li_tags[154:]

[<li><a href="/wiki/Food_history" title="Food history">Food history</a></li>,
 <li><a class="mw-redirect" href="/wiki/History_of_breakfast" title="History of breakfast">History of breakfast</a></li>,
 <li><a href="/wiki/List_of_ancient_dishes" title="List of ancient dishes">List of ancient dishes</a></li>,
 <li><a href="/wiki/List_of_food_and_beverage_museums" title="List of food and beverage museums">List of food and beverage museums</a></li>,
 <li><link href="mw-data:TemplateStyles:r1133582631" rel="mw-deduplicated-inline-style"/><cite class="citation book cs1" id="CITEREFMelitta_Weiss_Adamson2004">Melitta Weiss Adamson (2004). <a class="external text" href="https://books.google.com/books?id=jtgud2P-EGwC&amp;pg=PR9" rel="nofollow">"Timeline"</a>. <i>Food in Medieval Times</i>. Greenwood. <a class="mw-redirect" href="/wiki/ISBN_(identifier)" title="ISBN (identifier)">ISBN</a> <a href="/wiki/Special:BookSources/978-0-313-32147-4" title="Special:BookSources/978-0-313-32147-4"><bdi>978-0
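
A hard-coded cut-off such as 154 stops working as soon as the page gains or loses an entry. One possible alternative, sketched here on a made-up fragment, is to keep only the items whose text looks like an event. The pattern is an assumption (an event starts with a digit and contains a `': '` separator), so entries that begin with a word would be missed.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical fragment that mimics the page: two events followed by a plain link item.
demo_html = """
<ul>
  <li>1961: Invention of the Chorleywood bread process.</li>
  <li>1964: The iconic Australian biscuit Tim Tam enters the market.</li>
  <li><a href="/wiki/Food_history">Food history</a></li>
</ul>
"""

demo_soup = BeautifulSoup(demo_html, "html.parser")

# Assumption: an event starts with a digit and contains a ': ' separator.
event_pattern = re.compile(r"^\d.*?: ")

event_tags = [
    tag for tag in demo_soup.find_all("li")
    if event_pattern.match(tag.get_text())
]
print(len(event_tags))
```

The link-only item is dropped because its text has no leading digit, without relying on its position in the list.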

Let's look at parsing one of these tags.

In [11]:
sampletag = list_of_li_tags[153]
sampletag

<li>2017: The art of making Neapolitan <a href="/wiki/Pizza" title="Pizza">pizza</a> was added to <a href="/wiki/UNESCO" title="UNESCO">UNESCO</a>'s list of <a href="/wiki/Intangible_cultural_heritage" title="Intangible cultural heritage">intangible cultural heritage</a>.<sup class="reference" id="cite_ref-GuardianPizza2017_94-0"><a href="#cite_note-GuardianPizza2017-94">[94]</a></sup></li>

The `get_text` method extracts the text content from the tag:

In [12]:
text = sampletag.get_text()
text

"2017: The art of making Neapolitan pizza was added to UNESCO's list of intangible cultural heritage.[94]"

We would also like to remove the `[94]` reference marker at the end. A regular expression can do this.

In [13]:
text_minus_ref = re.sub(r"\[.*?\]", "", text)  # raw string avoids invalid-escape warnings
text_minus_ref

"2017: The art of making Neapolitan pizza was added to UNESCO's list of intangible cultural heritage."

Next, we can use `str.split` to separate the year from the event description:

In [14]:
text_minus_ref.split(': ', 1) # maxsplit=1: split only at the first ': '

['2017',
 "The art of making Neapolitan pizza was added to UNESCO's list of intangible cultural heritage."]

## Parsing all elements
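
Before applying the reference-stripping pattern to every element, it may help to see how it behaves on less tidy text. The sample string below is invented for illustration; it compares the non-greedy pattern used in this demo with a greedy variant and with a narrower pattern that only targets numeric markers.

```python
import re

# Hypothetical text with a bracketed remark and two kinds of reference marker.
sample_text = "2017: Pizza [as a craft] was added to UNESCO's list.[94][note 2]"

# Non-greedy pattern from the demo: removes each bracketed span separately.
non_greedy = re.sub(r"\[.*?\]", "", sample_text)

# Greedy variant (no `?`): removes everything from the first '[' to the last ']'.
greedy = re.sub(r"\[.*\]", "", sample_text)

# Narrower pattern: removes only purely numeric markers such as [94].
numeric_only = re.sub(r"\[\d+\]", "", sample_text)

print(repr(non_greedy))
print(repr(greedy))
print(repr(numeric_only))
```

The non-greedy pattern also deletes bracketed text that is part of the sentence, so the narrower pattern may be the safer choice when descriptions can contain brackets of their own.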

Now that we are able to handle one list element, we can apply that to all elements through a list comprehension:

In [15]:
eventlist = pd.Series([re.sub(r"\[.*?\]", "", tag.get_text()) for tag in list_of_li_tags[:154]])
eventlist

0                                                                                                                                                                                           5-2 million years ago: Hominids shift away from the consumption of nuts and berries to begin the consumption of meat.
1                                                                                                                           2.5-1.8 million years ago: The discovery of the use of fire for may have created a sense of sharing as a group. Earliest estimate for invention of cooking, by phylogenetic analysis.
2                                                                                                                                                                                                           250,000 years ago: Hearths appear, accepted archeological estimate for invention of cooking chicken. 
3                                                                                 

In [16]:
df = eventlist.str.split(pat=': ', n=1, expand=True)
df

Unnamed: 0,0,1
0,5-2 million years ago,Hominids shift away from the consumption of nuts and berries to begin the consumption of meat.
1,2.5-1.8 million years ago,"The discovery of the use of fire for may have created a sense of sharing as a group. Earliest estimate for invention of cooking, by phylogenetic analysis."
2,"250,000 years ago","Hearths appear, accepted archeological estimate for invention of cooking chicken."
3,"170,000 years ago",Cooked starchy roots and tubers in Africa
4,"40,000 years ago","First evidence of human fish consumption: isotopic analysis of the skeletal remains of Tianyuan man, a modern human from eastern Asia, has shown that he regularly consumed freshwater fish."
...,...,...
149,1960,The invention of the potato water gun knife facilitates the mass production of French fries by fast food restaurants.
150,1961,Invention of the Chorleywood bread process.
151,1964,The iconic Australian biscuit Tim Tam enters the market.
152,2013,"Professor Mark Post at Maastricht University pioneered a proof-of-concept for cultured meat by creating the first hamburger patty grown directly from cells. Since then, other cultured meat prototypes have gained media attention: SuperMeat opened a farm-to-fork restaurant called ""The Chicken"""
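
Row 4 in the table above shows why the split is limited to one: its description contains `': '` itself. The sketch below uses plain `str.split` on an abridged copy of that entry; the second argument plays the same role as `n=1` in the pandas call.

```python
# Abridged text of row 4, which contains ': ' inside the description.
entry = ("40,000 years ago: First evidence of human fish consumption: "
         "isotopic analysis of the skeletal remains of Tianyuan man")

unlimited = entry.split(": ")   # splits at every ': '
limited = entry.split(": ", 1)  # splits at the first ': ' only

print(len(unlimited))
print(limited[0])
print(limited[1])
```

Without the limit the description would be broken into two pieces, which in the DataFrame would show up as an unwanted third column.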


In [17]:
df.columns = ['Time', 'Event']

In [18]:
df

Unnamed: 0,Time,Event
0,5-2 million years ago,Hominids shift away from the consumption of nuts and berries to begin the consumption of meat.
1,2.5-1.8 million years ago,"The discovery of the use of fire for may have created a sense of sharing as a group. Earliest estimate for invention of cooking, by phylogenetic analysis."
2,"250,000 years ago","Hearths appear, accepted archeological estimate for invention of cooking chicken."
3,"170,000 years ago",Cooked starchy roots and tubers in Africa
4,"40,000 years ago","First evidence of human fish consumption: isotopic analysis of the skeletal remains of Tianyuan man, a modern human from eastern Asia, has shown that he regularly consumed freshwater fish."
...,...,...
149,1960,The invention of the potato water gun knife facilitates the mass production of French fries by fast food restaurants.
150,1961,Invention of the Chorleywood bread process.
151,1964,The iconic Australian biscuit Tim Tam enters the market.
152,2013,"Professor Mark Post at Maastricht University pioneered a proof-of-concept for cultured meat by creating the first hamburger patty grown directly from cells. Since then, other cultured meat prototypes have gained media attention: SuperMeat opened a farm-to-fork restaurant called ""The Chicken"""


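We now have our data in a form that is easier to consume or process. For example, the following code lists events between 1910 and 1950. Note that `Time` holds strings, so `between` compares them lexicographically; this works here for four-digit years, but it also lets through entries such as `1920s`.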

In [19]:
df[df.Time.between('1910', '1950')]

Unnamed: 0,Time,Event
142,1912,Otto Rohwedder invents the bread-slicing machine. It wouldn't enter use before 1928 however.
143,1916,The first domesticated blueberries reach the market.
144,1920s,French fries introduced in the United States by returning First World War soldiers.
145,1940,"The McDonald's brothers opened their first McDonald's restaurant on May 15 in San Bernardino, California."
146,1948,Canada lifts the ban on margarine.
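
Because the filter above compares strings, it can admit entries that are not years in the range at all. The sketch below contrasts it with a numeric comparison on a small made-up sample; the `193 BCE` row is hypothetical and is included only to show the failure mode.

```python
import pandas as pd

# Small hypothetical sample in the same shape as df above.
sample = pd.DataFrame({
    "Time": ["250,000 years ago", "1912", "1920s", "1948", "193 BCE"],
    "Event": ["a", "b", "c", "d", "e"],
})

# String comparison, as in the cell above.
by_string = sample[sample["Time"].between("1910", "1950")]

# Numeric comparison: anything that is not a plain number becomes NaN and is excluded.
year = pd.to_numeric(sample["Time"], errors="coerce")
by_number = sample[year.between(1910, 1950)]

print(by_string["Time"].tolist())
print(by_number["Time"].tolist())
```

The numeric version drops `193 BCE` but also drops `1920s`, so whether it is preferable depends on how decade entries should be treated.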




---






© 2024 Institute of Data


---






