# Web Scraping

Web scraping is the practice of gathering data from the World Wide Web without using web APIs. It is generally done by writing a program that automatically downloads web pages and parses the data they contain.

## Web Scraping using pandas

`pandas` provides the `read_html` function, which extracts HTML tables from a web page into a list of DataFrames.

In [3]:
import pandas as pd
dfs = pd.read_html("https://en.wikipedia.org/wiki/SQL", match = 'Source')

In [4]:
dfs[0]

Unnamed: 0,Source,Abbreviation,Full name
0,ANSI/ISO Standard,SQL/PSM,SQL/Persistent Stored Modules
1,Interbase / Firebird,PSQL,Procedural SQL
2,IBM DB2,SQL PL,SQL Procedural Language (implements SQL/PSM)
3,IBM Informix,SPL,Stored Procedural Language
4,IBM Netezza,NZPLSQL[18],(based on Postgres PL/pgSQL)
5,Invantive,PSQL[19],Invantive Procedural SQL (implements SQL/PSM a...
6,MariaDB,"SQL/PSM, PL/SQL",SQL/Persistent Stored Module (implements SQL/P...
7,Microsoft / Sybase,T-SQL,Transact-SQL
8,Mimer SQL,SQL/PSM,SQL/Persistent Stored Module (implements SQL/PSM)
9,MySQL,SQL/PSM,SQL/Persistent Stored Module (implements SQL/PSM)
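
The `match` argument used above keeps only the tables whose text matches the given string or regular expression. The following is a minimal offline sketch of that behaviour. It assumes a small hypothetical HTML snippet (the `html` string and its second table are made up for illustration) and that an HTML parser usable by pandas, such as `lxml`, is installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical page with two tables; only the first one mentions "Source".
html = """
<table>
  <tr><th>Source</th><th>Abbreviation</th></tr>
  <tr><td>IBM Informix</td><td>SPL</td></tr>
  <tr><td>MySQL</td><td>SQL/PSM</td></tr>
</table>
<table>
  <tr><th>Fruit</th><th>Colour</th></tr>
  <tr><td>Apple</td><td>Red</td></tr>
</table>
"""

# Without `match`, every table found in the document is returned.
all_tables = pd.read_html(StringIO(html))

# With `match`, only tables containing matching text are returned.
source_tables = pd.read_html(StringIO(html), match="Source")

print(len(all_tables), len(source_tables))
print(source_tables[0])
```

Wrapping the text in `StringIO` avoids passing a literal HTML string directly, which recent pandas versions discourage.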


In [5]:
dfs2 = pd.read_html("https://gujcovid19.gujarat.gov.in/")

In [6]:
len(dfs2)

2

In [8]:
dfs2[0]

Unnamed: 0,District,Active Cases,Cases Tested for COVID19,Patients Recovered,People Under Quarantine,Total Deaths
0,Ahmedabad,27,4777 5282830,1 234882,33,3411
1,Amreli,0,746 590262,10708,80,102
2,Anand,5,1703 533652,9580,0,49
3,Aravalli,0,939 359870,5108,0,78
4,Banaskantha,0,1954 831954,13469,0,162
5,Bharuch,0,985 509500,11308,0,118
6,Bhavnagar,2,2156 1187078,21143,118,301
7,Botad,0,358 252823,2176,0,42
8,Chhota Udaipur,0,1066 264284,3357,0,38
9,Dahod,0,881 697116,9917,0,38


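#### Remark
1. Depending on the pandas version and the parser in use, `read_html` may fail on `https` URLs; the pandas documentation notes that the `lxml` parser accepts only the `http`, `ftp`, and `file` protocols. If a URL that starts with `https` fails, try replacing it with `http`.
2. Some sites may produce an SSL certificate verification error when read through `read_html`. To work around this, run the following code before calling `read_html`. Note that it disables certificate verification for the whole session, so use it only with sites you trust.

        import ssl
        ssl._create_default_https_context = ssl._create_unverified_context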
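
A narrower alternative to changing the default HTTPS context globally is to pass an unverified context to a single `urlopen` call. The sketch below uses only the standard library; the helper names `unverified_context` and `fetch_html` are hypothetical, and no request is made until `fetch_html` is actually called.

```python
import ssl
from urllib import request


def unverified_context():
    """Build an SSL context that skips certificate checks (use with care)."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False       # must be turned off before verify_mode
    ctx.verify_mode = ssl.CERT_NONE  # do not verify the server certificate
    return ctx


def fetch_html(url, context=None):
    """Download a page and return its body as text."""
    with request.urlopen(url, context=context) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


ctx = unverified_context()
print(ctx.verify_mode, ctx.check_hostname)
```

The text returned by `fetch_html(url, context=ctx)` can then be wrapped in `io.StringIO` and passed to `read_html`, so that only this one download skips verification.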

## Beautiful Soup

Beautiful Soup is a Python library for extracting data from HTML and XML files. It can be installed with the following command:

    pip install beautifulsoup4

Since Beautiful Soup works on HTML documents, we need another package to fetch them from the web. We shall use the `request` sub-module of the `urllib` package for this purpose.

In [5]:
from urllib import request
from bs4 import BeautifulSoup

When we submit a request to a web server, we provide a header containing a "User-Agent" string, so that the web server treats the request as if it were sent by a specific web browser.

* The `Request` object is created by calling the `Request` class.
* The request is sent to the web server using the `urlopen` function.
* Finally, a `BeautifulSoup` object is created from the response using the `BeautifulSoup` constructor and an HTML parser. `html.parser` is the HTML parser included in the Python 3 standard library.

In the following code, the specified User Agent indicates the **Chrome** browser.

In [6]:
headers = {'User-Agent':
           'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'}
req = request.Request('http://pythonscraping.com/pages/page1.html', headers = headers)
html = request.urlopen(req)
bs = BeautifulSoup(html.read(), 'html.parser')
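
Because a `Request` object is only a description of the request, it can be inspected before anything is sent. The sketch below uses a shortened, illustrative User-Agent string and the hypothetical names `demo_headers` and `demo_req`; it runs offline because `urlopen` is never called.

```python
from urllib import request

# Shortened User-Agent string, for illustration only.
demo_headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
demo_req = request.Request(
    "http://pythonscraping.com/pages/page1.html", headers=demo_headers
)

# Nothing is sent until urlopen() is called, so these checks need no network.
print(demo_req.full_url)        # the URL that would be requested
print(demo_req.host)            # host part of the URL
print(demo_req.get_method())    # GET, because no data is attached
print(demo_req.header_items())  # header names are stored capitalized
```

Note that `Request` normalizes header names with `str.capitalize`, so the header is stored as `User-agent`.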

The `BeautifulSoup` object `bs` contains the parsed html page.

In [7]:
bs

<html>
<head>
<title>A Useful Page</title>
</head>
<body>
<h1>An Interesting Title</h1>
<div>
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
</div>
</body>
</html>

#### Some terminology

The `<title>` tag is a child of the `<head>` tag.\
The `<head>` tag is the parent of the `<title>` tag.\
The `<h1>` and `<div>` tags are siblings.\
The `<h1>` tag is a descendant of the `<html>` tag.
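
These relationships can be checked directly in Beautiful Soup. The following sketch parses a compact copy of the page shown above (the variable names `page` and `soup` are chosen for this illustration, and the body text is shortened) and walks the tree with `parent`, `children`, `find_next_sibling`, and `find_all`.

```python
from bs4 import BeautifulSoup

# Compact copy of the example page, written without whitespace between tags
# so that the tree contains no whitespace-only text nodes.
page = (
    "<html><head><title>A Useful Page</title></head>"
    "<body><h1>An Interesting Title</h1><div>Lorem ipsum</div></body></html>"
)
soup = BeautifulSoup(page, "html.parser")

parent_name = soup.title.parent.name                      # parent of <title>
child_names = [child.name for child in soup.head.children]  # children of <head>
sibling = soup.h1.find_next_sibling("div")                # sibling after <h1>
descendant_names = [tag.name for tag in soup.html.find_all(True)]

print(parent_name)
print(child_names)
print(sibling)
print(descendant_names)
```

On real pages the whitespace between tags appears as text nodes, which is why `find_next_sibling("div")` is used here rather than the raw `next_sibling` attribute.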

The `bs` object can now be used to query the contents of the parsed HTML document. For example, the following code extracts the `title` tag.

In [8]:
bs.title

<title>A Useful Page</title>

The following two queries also produce the same output.

In [9]:
bs.head.title 

<title>A Useful Page</title>

In [10]:
bs.html.head.title

<title>A Useful Page</title>

Beautiful Soup provides mechanisms to navigate the entire HTML tree of a web page. The navigation can be performed up, down, across, and diagonally.

#### Important

If you query an HTML tag that does not exist, the result is **None**. However, accessing a child of a non-existent tag raises an `AttributeError`, because `None` has no attributes.

It is, therefore, necessary to use exception handling while doing web scraping.
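
A minimal sketch of this behaviour, using a small inline HTML string rather than one of the pages above. The helper `tag_text` is hypothetical and shows one way to avoid the error by checking for `None` first.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<html><body><h1>An Interesting Title</h1></body></html>", "html.parser"
)

# The page has no <table>, so the query returns None.
missing = soup.table
print(missing)

# Accessing a child of the missing tag means accessing an attribute of None.
try:
    soup.table.tr
except AttributeError as err:
    print("Caught:", err)


def tag_text(tag, default=""):
    """Return the text of a tag, or a default when the tag was not found."""
    return tag.get_text(strip=True) if tag is not None else default


print(tag_text(soup.h1))
print(tag_text(soup.table, default="n/a"))
```

Checking for `None` handles missing tags; network failures raised by `urlopen` still need their own `try`/`except` block.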

### Using attributes of html tags

Most HTML tags have attributes. Beautiful Soup can query an HTML page based on tag attributes using the `find` and `findAll` methods (`find_all` is the current name for `findAll`).

To understand this, let's obtain an html page containing tag attributes.

In [11]:
req = request.Request('http://pythonscraping.com/pages/warandpeace.html', headers = headers)
html = request.urlopen(req)
bs = BeautifulSoup(html.read(), 'html.parser')
bs

<html>
<head>
<style>
.green{
	color:#55ff55;
}
.red{
	color:#ff5555;
}
#text{
	width:50%;
}
</style>
</head>
<body>
<h1>War and Peace</h1>
<h2>Chapter 1</h2>
<div id="text">
"<span class="red">Well, Prince, so Genoa and Lucca are now just family estates of the
Buonapartes. But I warn you, if you don't tell me that this means war,
if you still try to defend the infamies and horrors perpetrated by
that Antichrist- I really believe he is Antichrist- I will have
nothing more to do with you and you are no longer my friend, no longer
my 'faithful slave,' as you call yourself! But how do you do? I see
I have frightened you- sit down and tell me all the news.</span>"
<p></p>
It was in July, 1805, and the speaker was the well-known <span class="green">Anna
Pavlovna Scherer</span>, maid of honor and favorite of the <span class="green">Empress Marya
Fedorovna</span>. With these words she greeted <span class="green">Prince Vasili Kuragin</span>, a man
of high rank and importance, who was the firs

When you visit the above web page, you can see that the lines spoken by characters in the story are displayed in red, and the names of the characters are displayed in green.

This is achieved by assigning `class` attributes to `<span>` tags and using CSS.
    
In the code given below, we query for the names of the characters in the story.

In [12]:
nameList = bs.findAll('span', {'class':'green'})
pd.Series([name.text for name in nameList]).unique()


array(['Anna\nPavlovna Scherer', 'Empress Marya\nFedorovna',
       'Prince Vasili Kuragin', 'Anna Pavlovna', 'St. Petersburg',
       'the prince', 'Prince Vasili', 'Wintzingerode', 'King of Prussia',
       'le Vicomte de Mortemart', 'Montmorencys', 'Rohans', 'Abbe Morio',
       'the Emperor', 'Dowager Empress Marya Fedorovna', 'the baron',
       'the Empress', "Anna Pavlovna's", 'Her Majesty', 'Baron\nFunke',
       'The prince', 'Anna\nPavlovna', 'Anatole'], dtype=object)

#### Notes
* The `find` and `findAll` methods provide several options for searching within a web page.
* These two methods are the most used in web scraping with Beautiful Soup.
* Pattern matching through *regular expressions* is very useful in searching for information that follows a specific pattern.
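
A few of these search options can be sketched as follows. The HTML in `snippet` is a shortened, hypothetical excerpt modelled on the page above, and `find_all` is used as the current spelling of `findAll`.

```python
import re

from bs4 import BeautifulSoup

# Shortened, hypothetical excerpt modelled on the War and Peace page.
snippet = """
<h1>War and Peace</h1>
<h2>Chapter 1</h2>
<div id="text">
<span class="red">Well, Prince, so Genoa and Lucca are now just family estates.</span>
<span class="green">Anna Pavlovna Scherer</span>
<span class="green">Prince Vasili Kuragin</span>
<span class="green">Anna Pavlovna</span>
</div>
"""
soup = BeautifulSoup(snippet, "html.parser")

# Search by CSS class (class_ avoids the reserved word `class`).
green = soup.find_all("span", class_="green")

# A regular expression as the tag name matches h1 to h6.
headings = [tag.name for tag in soup.find_all(re.compile(r"^h[1-6]$"))]

# A regular expression on the tag's text keeps spans starting with "Anna".
annas = [tag.get_text() for tag in soup.find_all("span", string=re.compile(r"^Anna"))]

# `limit` stops the search after the given number of matches.
first_span = soup.find_all("span", limit=1)

# `find` returns the first match only (or None), here looked up by id.
text_div = soup.find(id="text")

print(len(green), headings, annas, first_span[0]["class"], text_div.name)
```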

## Scraping Google Search results

Let's consider an example of scraping a Google search results page. Before we can perform web scraping, it is important to understand the structure of the web page and identify the content that we want to scrape.

On most web browsers, pressing "Ctrl + U" will display the html source of the web page. Alternatively, you can 'right click' and select "View Page Source" from the context menu.

In the example below, we want to scrape two numbers from the search results page, as served to the Chrome browser (yes, the page differs between browsers):
1. Number of search results
2. Time taken to search the results

When the HTML source is inspected, it can be noted that the required contents are available in the `div` tag with id `result-stats`.

In [13]:
headers = {'User-Agent':
           'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'}
req = request.Request('https://www.google.com/search?q=python', headers = headers)
html = request.urlopen(req)
bs = BeautifulSoup(html.read(), 'html.parser')

In [14]:
#sresult = bs.find("div", {"id": "result-stats"})
sresult = bs.find(id = "result-stats")
type(sresult)

bs4.element.Tag

In [15]:
n = sresult.text.split()[1]
int(n.translate(str.maketrans('', '', ',')))

528000000

In [16]:
float(sresult.text.split()[3].translate(str.maketrans('', '', '(')))

0.53
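
Splitting on whitespace depends on the exact position of each token. As an alternative, the two numbers can be pulled out with a regular expression. This is a sketch only: the helper `parse_result_stats` is hypothetical, and the `sample` string assumes the text has the form implied by the token positions used above.

```python
import re


def parse_result_stats(text):
    """Extract (result_count, seconds) from a result-stats string."""
    match = re.search(r"([\d,]+)\s+results?\s+\(([\d.]+)\s+seconds?\)", text)
    if match is None:
        raise ValueError(f"Unexpected result-stats format: {text!r}")
    count = int(match.group(1).replace(",", ""))  # drop thousands separators
    seconds = float(match.group(2))
    return count, seconds


# Assumed format of the result-stats text.
sample = "About 528,000,000 results (0.53 seconds)"
print(parse_result_stats(sample))
```

Raising `ValueError` on an unexpected format makes a layout change on the page visible immediately instead of silently producing wrong numbers.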

Next, we write Python code that performs Google searches for a set of search words and creates a DataFrame whose two columns hold the above two pieces of information from the result pages.

In [17]:
words = ["python", "javascript", "scala", "sql"]
results = []
times = []
for word in words:
    req = request.Request("https://www.google.com/search?q="+ word, headers = headers)
    html = request.urlopen(req)
    bs = BeautifulSoup(html.read(), 'html.parser')
    sresult = bs.find("div", {"id": "result-stats"})
    tokens = sresult.text.split()
    results.append(int(tokens[1].translate(str.maketrans('', '', ','))))
    times.append(float(tokens[3].translate(str.maketrans('', '', '()'))))

In [18]:
df = pd.DataFrame({'Results':results, 'time':times}, index = words)
df

Unnamed: 0,Results,time
python,528000000,0.48
javascript,1950000000,0.58
scala,199000000,0.56
sql,347000000,0.46
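
The loop above concatenates each search word directly into the URL, which works for single words made of plain letters. For phrases or words with special characters, the query should be URL-encoded first. A minimal sketch, assuming a hypothetical helper `search_url` and a made-up list of search terms:

```python
from urllib.parse import urlencode


def search_url(query):
    """Build a Google search URL with a properly encoded query string."""
    return "https://www.google.com/search?" + urlencode({"q": query})


# Made-up search terms: a plain word, a phrase, and a word with symbols.
for term in ["python", "web scraping", "c++"]:
    print(search_url(term))
```

The returned string can be passed to `request.Request` in place of the concatenated URL.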
