Content under Creative Commons Attribution license CC-BY 4.0, code under BSD 3-Clause License © 2019 R. Watkins

Note: This tutorial is based on: https://www.analyticsvidhya.com/blog/2015/10/beginner-guide-web-scraping-beautiful-soup-python/


## Introduction

Web scraping is a useful technique for converting unstructured data on the web (such as a table or list) into structured data (such as a dataframe) that you can use for a variety of purposes. For example, you might want to scrape an education website to get data on school test scores, or you might want to take voting information by zip code from a government webpage to create a visualization of voting trends in your area.

These are just a few of the questions / problems / products whose solutions might start with web scraping and information extraction (data collection) before you get to data analysis and interpretation.


## Ways to extract information from the web

There are several ways to extract information from the web. **APIs** are probably the best way to extract data from a website. Many large websites, such as Twitter, Facebook, Google, Reddit, and StackOverflow, provide APIs to access their data (or at least limited sections of their data) in a more structured manner. If you can get the data you want through an API, that is almost always the preferred approach over web scraping: when the provider already offers structured data, there is little reason to build an engine that extracts the same information.
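
As a rough illustration of why an API is usually easier, the sketch below parses a made-up JSON payload of the kind an API might return. The payload, its field names, and the `school_scores` helper are assumptions for illustration only, not any real provider's format.

```python
import json

# Hypothetical API response body; real APIs define their own fields.
SAMPLE_RESPONSE = """
{"schools": [
    {"name": "North High", "math_score": 81},
    {"name": "South High", "math_score": 77}
]}
"""

def school_scores(body):
    """Return a {school name: math score} dict from a JSON response body."""
    data = json.loads(body)
    return {school["name"]: school["math_score"] for school in data["schools"]}

print(school_scores(SAMPLE_RESPONSE))
```

The data arrives already labeled, so there are no tags to search through and no layout to reverse-engineer.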

**RSS feeds** are another way that a website can share information for people to use for other purposes. For example, blogs will often produce an RSS feed that you can use to get a copy of all the recent posts, which you can then use for your work. RSS feeds are limited in their use, however, and are mostly found on sites that have routine updates (such as news pages or podcasts).
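
An RSS feed is an XML document, so Python's standard library can read one without any scraping. The following is a minimal sketch on a made-up feed; the feed contents and the `feed_items` helper are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 feed, trimmed to two posts.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Blog</title>
    <item><title>First post</title><link>https://example.com/1</link></item>
    <item><title>Second post</title><link>https://example.com/2</link></item>
  </channel>
</rss>"""

def feed_items(xml_text):
    """Return (title, link) pairs for every <item> in an RSS feed."""
    root = ET.fromstring(xml_text)
    return [(item.findtext("title"), item.findtext("link"))
            for item in root.iter("item")]

for title, link in feed_items(SAMPLE_FEED):
    print(title, link)
```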

But what can you do when you want information that is on a website, but the site doesn't have an API or RSS feed that meets your purposes? Well, that is when you scrape the website to fetch the information.


## What is Web Scraping?

Web scraping is a technique for extracting information from websites. This technique mostly focuses on the transformation of unstructured data (HTML format) on the web into structured data (database or spreadsheet).

You can perform web scraping in various ways, ranging from Google Docs to almost every programming language. I would resort to Python because of its ease of use and rich ecosystem. It has a library known as ‘BeautifulSoup’ which assists with this task. In this article, I’ll show you the easiest way to learn web scraping using Python.

For those of you who need a non-programming way to extract information from web pages, you can also look at import.io. It provides a GUI-driven interface to perform all basic web scraping operations. The hackers can continue to read this article!

 

## Libraries required for web scraping
Python is an open source programming language, and you will often find multiple libraries that can perform the same function. Hence, it is necessary to find the best library to use. We will use BeautifulSoup (a Python library), since it is easy and intuitive to work with:

**BeautifulSoup**: It is an incredible tool for pulling information out of a webpage. You can use it to extract tables, lists, and paragraphs, and you can also apply filters to extract specific information from web pages.

BeautifulSoup does not however fetch the web page for us. We will also use a library for opening the webpage URL:

**urllib.request**: It is a Python module which can be used for fetching URLs.

Python has several other options for HTML scraping in addition to BeautifulSoup, such as mechanize, scrapemark, and scrapy.
 

## Basics – Get familiar with HTML (Tags)
While scraping the web, you deal with HTML tags, so it is quite useful to have a good understanding of them. Below is the basic syntax of HTML, with the most common tags explained:

    <!DOCTYPE html> : HTML documents must start with a type declaration
    The HTML document is contained between <html> and </html>
    The visible part of the HTML document is between <body> and </body>
    HTML headings are defined with the <h1> to <h6> tags
    HTML paragraphs are defined with the <p> tag
    Other useful HTML tags are:

    HTML links are defined with the <a> tag, e.g. <a href="http://www.test.com">This is a link for test.com</a>
    HTML tables are defined with <table>, rows with <tr>, and rows are divided into data cells with <td>
    HTML lists start with <ul> (unordered) or <ol> (ordered). Each list item starts with <li>
    
If you are new to HTML tags, I would also recommend the HTML tutorial from W3Schools. It will give you a clear understanding of HTML tags.
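
To see how these tags look from Python, here is a small sketch that parses a made-up page with BeautifulSoup and pulls out a heading, the list items, and a link target. The sample HTML and the variable names are assumptions for illustration; the library itself is introduced step by step further down.

```python
from bs4 import BeautifulSoup

# Made-up page that uses the tags described above.
SAMPLE_HTML = """<!DOCTYPE html>
<html>
  <body>
    <h1>Capitals</h1>
    <p>A short list.</p>
    <ul><li>Itanagar</li><li>Amaravati</li></ul>
    <a href="http://www.test.com">This is a link for test.com</a>
  </body>
</html>"""

sample_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

heading = sample_soup.h1.string                            # text inside <h1>
items = [li.string for li in sample_soup.find_all("li")]   # text of each <li>
link_target = sample_soup.a["href"]                        # value of the href attribute

print(heading, items, link_target)
```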

Before getting started, take a look at the webpage you will be scraping:
<a href="https://en.wikipedia.org/wiki/List_of_state_and_union_territory_capitals_in_India" target="_blank">  https://en.wikipedia.org/wiki/List_of_state_and_union_territory_capitals_in_India</a>

Ok, now it is time to begin... you will have to import the **BeautifulSoup** library, which is part of the **bs4** package, and import **urllib.request**.  

In [3]:
import urllib.request
from bs4 import BeautifulSoup

With the libraries now loaded, you will want to specify a variable for the webpage (i.e., URL) you are scraping. In this example we will call it "wiki", and use **urllib** to open that URL. Then you will want to run **BeautifulSoup** on that "page".

In [4]:
wiki = "https://en.wikipedia.org/wiki/List_of_state_and_union_territory_capitals_in_India"
page = urllib.request.urlopen(wiki)
soup = BeautifulSoup(page, "html.parser")  # state the parser explicitly; bs4 warns when it has to guess
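
Some servers reject requests that arrive without a descriptive User-Agent header, and a slow server can hang a call that has no timeout. The sketch below shows one way to handle both with `urllib.request.Request`. The `build_request` and `fetch_html` helpers, the header value, and the 10-second timeout are illustrative assumptions, not part of the original tutorial.

```python
import urllib.request

# Illustrative header value; describe your own script here.
USER_AGENT = "scraping-tutorial/0.1 (educational use)"

def build_request(url):
    """Wrap a URL in a Request that carries a User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})

def fetch_html(url, timeout=10):
    """Download a page and return its raw bytes (needs network access)."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        return response.read()

# Building the request does not touch the network; only fetch_html does.
request = build_request(
    "https://en.wikipedia.org/wiki/List_of_state_and_union_territory_capitals_in_India"
)
print(request.full_url)
```

The bytes returned by `fetch_html(wiki)` can be handed to `BeautifulSoup` in place of `page`.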

You can see the HTML version of the webpage using the "soup.prettify()" function.

Exercise: In the cell below, write the code to print (i.e., view) the HTML of the webpage.

Do your results look like this at the top?

    <!DOCTYPE html>
    <html class="client-nojs" dir="ltr" lang="en">
     <head>
      <meta charset="utf-8"/>
      <title>
       List of state and union territory capitals in India - Wikipedia
      </title>
      <script>
       document.documentElement.className="client-js"...

If not, try again.  If you can't get it, highlight the lines below to view the correct code.

<span style="color:white"> print (soup.prettify())</span>


Using BeautifulSoup, you can scrape specific elements from the webpage using the HTML tags. You can bring in the whole element, including the tags, or just the contents found between the tags (in the example below, a "string").

In [None]:
soup.title

In [None]:
soup.title.string


In [None]:
soup.a
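
If you want to try these three lookups without fetching anything, the sketch below runs them on a made-up page. The sample HTML and variable names are assumptions for illustration.

```python
from bs4 import BeautifulSoup

sample_soup = BeautifulSoup(
    "<html><head><title>Demo page</title></head>"
    "<body><a href='/first'>First</a><a href='/second'>Second</a></body></html>",
    "html.parser",
)

title_tag = sample_soup.title            # whole element, tags included
title_text = sample_soup.title.string    # only the text between the tags
first_link = sample_soup.a               # attribute access returns the first match only

print(title_tag)
print(title_text)
print(first_link)
```

Note that `sample_soup.a` returns only the first `<a>` element, even though the page contains two.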

BeautifulSoup can also be used to find specific HTML tags within a webpage. This is very helpful for finding the specific elements within the webpage that you want to scrape. This is when knowing HTML tags is useful: for example, HTML tables (such as the one we want to scrape from this Wikipedia page) are marked by "table" tags. So to find a table, we can use BeautifulSoup to locate and return the table(s) we want.

In [None]:
soup.find_all("table")
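
When a page holds several tables, printing all of them is hard to read. One option is to list only each table's class attribute to see which one to target. The following sketch does this on a made-up two-table page; the sample HTML is an assumption for illustration.

```python
from bs4 import BeautifulSoup

sample_soup = BeautifulSoup(
    "<table class='wikitable sortable'><tr><td>1</td></tr></table>"
    "<table><tr><td>layout</td></tr></table>",
    "html.parser",
)

tables = sample_soup.find_all("table")
# .get("class") returns a list of class names, or None when the attribute is absent.
table_classes = [table.get("class") for table in tables]

print(len(tables), table_classes)
```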

You can further narrow down the tables on the page by using other HTML attributes, such as a table's class. This allows us to get the specific table we want from the webpage. You can use the "inspect" function of your web browser to look at the HTML code for elements within the webpage as well.

In [None]:
right_table=soup.find('table', class_='wikitable sortable plainrowheaders')
right_table
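
A `class_` value that contains several classes is compared against the attribute text exactly, so the lookup can come back empty if the page later lists the same classes in a different order. A CSS selector is one alternative. The sketch below compares the two on a made-up table whose classes are deliberately reordered.

```python
from bs4 import BeautifulSoup

# Made-up table whose classes appear in a different order than in the cell above.
sample_soup = BeautifulSoup(
    "<table class='sortable wikitable plainrowheaders'><tr><td>1</td></tr></table>",
    "html.parser",
)

exact = sample_soup.find("table", class_="wikitable sortable plainrowheaders")
by_selector = sample_soup.select_one("table.wikitable.sortable.plainrowheaders")

print(exact)                    # None: the class string does not match exactly
print(by_selector.td.string)    # the selector matches regardless of class order
```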

Exercise: In the cell below, write code that will retrieve just rows #2 and #3 of the table.  

Do your results look like this?

    [<tr>
    <td>2
    </td>
    <th scope="row"><a href="/wiki/Andhra_Pradesh" title="Andhra Pradesh">Andhra Pradesh</a>
    </th>
    <td><a class="mw-redirect" href="/wiki/Hyderabad,_India" title="Hyderabad, India">Hyderabad</a> <small>(<i>de jure</i> to 2024)</small><br/><a href="/wiki/Amaravati" title="Amaravati">Amaravati</a> <small>(<i>de facto</i> from 2017)</small><sup class="reference" id="cite_ref-gulte.com_3-0"><a href="#cite_note-gulte.com-3">[3]</a></sup><sup class="reference" id="cite_ref-4"><a href="#cite_note-4">[4]</a></sup><sup class="reference" id="cite_ref-5"><a href="#cite_note-5">[a]</a></sup>
    </td>
    <td><a href="/wiki/Amaravati" title="Amaravati">Amaravati</a><sup class="reference" id="cite_ref-gulte.com_3-1"><a href="#cite_note-gulte.com-3">[3]</a></sup>
    </td>
    <td><a href="/wiki/Andhra_Pradesh_High_Court" title="Andhra Pradesh High Court">Amaravati</a>
    </td>
    <td>1956<br/>2017
    </td>
    <td><a href="/wiki/Kurnool" title="Kurnool">Kurnool</a> (1953-1956)
    </td></tr>, <tr>
    <td>3
    </td>
    <th scope="row"><a href="/wiki/Arunachal_Pradesh" title="Arunachal Pradesh">Arunachal Pradesh</a>
    </th>
    <td><a href="/wiki/Itanagar" title="Itanagar">Itanagar</a>
    </td>
    <td>Itanagar
    </td>
    <td><a href="/wiki/Guwahati" title="Guwahati">Guwahati</a>
    </td>
    <td>1986
    </td>
    <td> —
    </td></tr>]
    
If not, try again.  If you can't get it, highlight the lines below to view the correct code.

<span style="color:white">
table=soup.find('table', class_='wikitable sortable plainrowheaders')<br>
allrows = table.find_all('tr')<br>
print (allrows[2:4]) </span>

You can also retrieve all the links from the webpage.

In [None]:
all_links = soup.find_all("a")
for link in all_links:
    print (link.get("href"))
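
Many of the printed values are relative paths (such as `/wiki/...`), page fragments, or `None` for anchors that have no `href`. If you need full URLs, `urllib.parse.urljoin` can resolve them against the page address. This is a minimal sketch on made-up anchors; the sample HTML is an assumption for illustration.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

BASE_URL = "https://en.wikipedia.org/wiki/List_of_state_and_union_territory_capitals_in_India"

sample_soup = BeautifulSoup(
    "<a href='/wiki/Itanagar'>Itanagar</a>"
    "<a id='top'>anchor without href</a>"
    "<a href='#cite_note-4'>[4]</a>",
    "html.parser",
)

absolute_links = [
    urljoin(BASE_URL, a["href"])
    for a in sample_soup.find_all("a", href=True)   # href=True skips anchors with no href
]

print(absolute_links)
```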

You can also retrieve just the last five links on the webpage.

In [None]:
all_links = soup.find_all('a')

print('Total number of URLs present = ',len(all_links)) 

print('\n\nLast 5 URLs in the page are : \n')

if len(all_links) > 5 :
  
  last_5 = all_links[len(all_links)-5:]
  for url in last_5 :
    print(url.get('href'))
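
The cell above prints nothing when a page has five or fewer links. A negative slice covers that case without the length check. The `last_n` helper below is a hypothetical name used only for this sketch.

```python
def last_n(items, n=5):
    """Return the final n items, or the whole list when it is shorter than n.

    Assumes n is a positive integer.
    """
    return items[-n:]

hrefs = ["/a", "/b", "/c", "/d", "/e", "/f", "/g"]
print(last_n(hrefs))         # the last five entries
print(last_n(hrefs[:3]))     # a short list is returned unchanged
```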

Now that you are familiar with scraping the table data from the webpage, you can organize the HTML table into a pandas dataframe that you can work with for data analysis. You will start by collecting each column of the table into a list.

In [None]:
#Generate one list per table column
A=[]
B=[]
C=[]
D=[]
E=[]
F=[]
G=[]
for row in right_table.find_all("tr"):
    cells = row.find_all('td')
    states = row.find_all('th') #The State/UT name sits in a <th> cell
    if len(cells)==6: #Only extract table body rows, not the heading
        A.append(cells[0].find(string=True))
        B.append(states[0].find(string=True))
        C.append(cells[1].find(string=True))
        D.append(cells[2].find(string=True))
        E.append(cells[3].find(string=True))
        F.append(cells[4].find(string=True))
        G.append(cells[5].find(string=True))
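
Looking up only the first text node of a cell means that cells mixing text with links or line breaks come back truncated or with stray newlines. One alternative, sketched below on a made-up row, is `get_text`, which joins all of the text in a cell. The sample HTML and the `row_texts` helper are assumptions for illustration.

```python
from bs4 import BeautifulSoup

sample_soup = BeautifulSoup(
    "<table><tr>"
    "<td>2\n</td>"
    "<th scope='row'><a href='/wiki/Andhra_Pradesh'>Andhra Pradesh</a></th>"
    "<td><a href='/wiki/Amaravati'>Amaravati</a> <small>(de facto)</small></td>"
    "</tr></table>",
    "html.parser",
)

def row_texts(row):
    """Return the cleaned text of every <td> and <th> cell in a table row."""
    return [cell.get_text(" ", strip=True) for cell in row.find_all(["td", "th"])]

row = sample_soup.find("tr")
first_node = row.find_all("td")[1].find(string=True)   # first text node only
full_row = row_texts(row)                              # full text of each cell

print(first_node)
print(full_row)
```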

You can then use pandas to make the dataframe. 

In [None]:
#import pandas to convert list to data frame
import pandas as pd
df=pd.DataFrame(A,columns=['Number'])
df['State/UT']=B
df['Admin_Capital']=C
df['Legislative_Capital']=D
df['Judiciary_Capital']=E
df['Year_Capital']=F
df['Former_Capital']=G
df
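
The same dataframe can also be built in a single call by passing pandas a dictionary that maps column names to lists. The sketch below uses short placeholder lists in place of the scraped ones; the variable names and values are assumptions for illustration.

```python
import pandas as pd

# Placeholder values standing in for the scraped column lists.
numbers = ["1", "2"]
states = ["Andaman and Nicobar Islands", "Andhra Pradesh"]
capitals = ["Port Blair", "Amaravati"]

sample_df = pd.DataFrame({
    "Number": numbers,
    "State/UT": states,
    "Admin_Capital": capitals,
})

print(sample_df)
```

All lists must have the same length, which is also a quick check that no row was skipped for one column only.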

Exercise: Now it is your turn to try one on your own.

You want to scrape this Wikipedia page: https://en.wikipedia.org/wiki/List_of_economic_expansions_in_the_United_States

You want the table of growth periods since the Great Depression, which you can later analyze with pandas in interesting ways.

In the cell below, write the Python code that will retrieve this table as a dataframe.

Do your results look similar to this?
```text
	Years	Duration	Annual Employement Growth	Annual GDP Growth	Description
0	Oct 1945–	37	+5.2%	+1.5%	As the United States demobilized from
1	Oct 1949–	45	+4.4%	+6.9%	The United States exited recession in late 194...
2	May 1954–	39	+2.5%	+4.0%	Expansion resumed following a return to growth...
3	April 1958–	24	+3.6%	+5.6%	A brief, two-year period of expansion occurred...
4	Feb 1961–	106	+3.3%	+4.9%	A long expansionary period began in 1961. Inco...
5	Nov 1970–	36	+3.4%	+5.1%	Growth resumed after the brief
6	Mar 1975–	58	+3.6%	+4.3%	Following the steep
7	Jul 1980–	12	+2.0%	+4.4%	This short period of growth saw unemployment r...
8	Dec 1982–	92	+2.8%	+4.3%	Inflation was under control by the mid-1980s. ...
9	Mar 1991–	120	+2.0%	+3.6%	Following a
10	Nov 2001–	73	+0.9%	+2.8%	Another mild recession
11	June 2009–	123+\n	+1.1%	+2.3%	The effects of the
```

If not, try again. If you can't get it, highlight the lines below to view the correct code.<br>
<font style="color:white">
wiki2 = "https://en.wikipedia.org/wiki/List_of_economic_expansions_in_the_United_States"  
page2 = urllib.request.urlopen(wiki2)  
soup2 = BeautifulSoup(page2, "html.parser")  
table2 = soup2.find('table', class_='wikitable sortable')  
A=[]  
B=[]  
C=[] 
D=[]  
E=[]  
F=[]  
G=[]  
for row in table2.find_all("tr"):  
    cells = row.find_all('td')  
    dates=row.find_all('th') #To store second column data  
    if len(cells)==5: #Only extract table body not heading  
        A.append(cells[0].find(text=True))  
        B.append(cells[1].find(text=True))  
        C.append(cells[2].find(text=True))  
        D.append(cells[3].find(text=True))  
        E.append(cells[4].find(text=True))  
df=pd.DataFrame(A,columns=['Years'])  
df['Duration']=B  
df['Annual Employement Growth']=C  
df['Annual GDP Growth']=D  
df['Description']=E  
df  

</font>
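
The growth columns arrive as strings such as `+5.2%`, so they need converting to numbers before any analysis. The following is a minimal sketch of that step; the `parse_percent` helper is a hypothetical name, and it assumes every value is a signed number followed by a percent sign.

```python
def parse_percent(text):
    """Convert a string such as '+5.2%' to the float 5.2."""
    cleaned = text.strip().rstrip("%")
    cleaned = cleaned.replace("\u2212", "-")   # web pages may use a Unicode minus sign
    return float(cleaned)

growth = ["+5.2%", "+4.4%", "+0.9%"]
print([parse_percent(value) for value in growth])
```

With the dataframe from the exercise, something like `df['Annual GDP Growth'].map(parse_percent)` would apply it to a whole column, provided every cell follows this pattern.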


Thanks for completing the tutorial.



