In the following article, Strata Scratch outlines some of the principles and techniques used to extract data from the web using the Python packages "requests" and "Beautiful Soup". This is typically known as web scraping and is an incredibly powerful tool. There are other tools, such as "urllib2" and "Selenium", that are also very useful, but for the purposes of this step we will focus mainly on "requests" and "Beautiful Soup". "requests" fetches a web page in Python using GET calls, and "Beautiful Soup" is predominantly used to parse and "prettify" the returned HTML so that we can extract the relevant data. Web pages are full of extra markup and text that can make it hard to extract the "real" data. When I discovered how to do this it opened up a whole new world for me. When you master these ideas the data sources available to you will expand enormously and you will have entered the world of Big Data. Implement the Google Colab below and see how you get on.



![sslogo](https://github.com/stratascratch/stratascratch.github.io/raw/master/assets/sslogo.jpg)

# Web scraping in Python

Scraping refers to extracting useful data from web pages, which are written in a markup language called HTML. To scrape data from the HTML tree we first have to download the web page to our PC.

We will use the following packages to achieve the tasks in this lesson:
- [`requests`](http://docs.python-requests.org/en/master/)
- [`beautifulsoup4`](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#quick-start)

### Install the packages using pip

In [1]:
!pip install requests



In [2]:
!pip install beautifulsoup4



### Import the modules

In [3]:
import numpy as np
import pandas as pd
import requests
import bs4
import lxml.etree as xml

## Basic concepts

### Fetch webpage contents using requests

To fetch a webpage we use the `get` function from requests. It accepts many optional arguments, but the one required argument is the URL of the webpage you want retrieved.
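As a rough illustration of those optional arguments, the sketch below builds a request without sending it, so the final URL and headers can be inspected offline. The query parameter `q` and the `my-scraper/0.1` User-Agent value are made up for this example and are not something the page is known to use.

```python
import requests

URL = "http://www.csgnetwork.com/llinfotable.html"

# Build the request without sending it, so we can inspect what `get` would send.
# The query parameter and the header value are hypothetical.
prepared = requests.Request(
    "GET",
    URL,
    params={"q": "latitude"},
    headers={"User-Agent": "my-scraper/0.1"},
).prepare()

print(prepared.method)
print(prepared.url)
print(prepared.headers["User-Agent"])

# The equivalent live call would be:
# response = requests.get(URL, params={"q": "latitude"},
#                         headers={"User-Agent": "my-scraper/0.1"}, timeout=10)
```

Passing `timeout` in the live call is optional, but it keeps the notebook from hanging if the server never answers.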

In [21]:
#URL = "https://github.com/requests/requests"
URL="http://www.csgnetwork.com/llinfotable.html"


requests.get(URL)

<Response [200]>

The result of this method is a Response object. The number 200 is a [status code](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes). 200 is OK and it means no error.
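To get a feel for what the different status codes mean, here is a small sketch that uses only the standard library. The helper name `describe_status` is hypothetical; on a real response the number is available as `response.status_code`.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short description such as '200 OK (success)'."""
    phrase = HTTPStatus(code).phrase  # raises ValueError for unknown codes
    if 200 <= code < 300:
        kind = "success"
    elif 300 <= code < 400:
        kind = "redirect"
    elif 400 <= code < 500:
        kind = "client error"
    elif 500 <= code < 600:
        kind = "server error"
    else:
        kind = "informational"
    return f"{code} {phrase} ({kind})"


for code in (200, 404, 503):
    print(describe_status(code))
```

In practice you rarely need to classify codes by hand: calling `response.raise_for_status()` raises an exception for any 4xx or 5xx response.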

In [22]:
requests.get(URL, {}).text

'<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.1 Transitional//EN">\r\n<HTML>\r\n<!--   HTML coding exclusively with the CoffeeCup HTML 2010 Editor --> \r\n<!--          http://www.coffeecup.com                             --> \r\n<!--   Brewed on December 6 1994 1:09:21 PM                        --> \r\n<!--   Updated on September 6 2001 4:18:12 AM                      --> \r\n<!--   Updated on January 4 2011 8:22:45 PM                        --> \r\n<!--   Created by Dr. Gene Davis - Computer Support Group          --> \r\n<HEAD>\r\n<title>Countries, Capitals, Latitude and Longitude Table</title>\r\n<META NAME="Title" CONTENT="CSGNetwork Countries, Capitals, Latitude and Longitude Table">\r\n<META NAME="Author" CONTENT="For CSGNetwork.Com, Computer Support Group - Dr. Gene Davis">\r\n<META NAME="Subject" CONTENT="CSGNetwork public use information table">\r\n<META NAME="Description" CONTENT="Countries, Capitals, Latitude and Longitude Table. Though this table does not, most pages require 

To get the HTML as a string we use the `text` property of the Response object.

Before we go further you should know that you will often get an error when accessing a webpage. There are many possible errors and even more possible causes, but the most common cases are:
- You used a wrong URL.
- The website is down. To confirm this, try opening it in a browser.
- The website blocks bots and scraping agents. You can try sending a browser-like `User-Agent` header to get around this; look into the `headers` parameter of the `get` method. A plausible User-Agent usually helps, but if it doesn't, good luck trying to find a solution.
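The cases above can be handled in one place. The following is a minimal sketch, not the only way to do it: the helper name `fetch_html` and the 10 second timeout are assumptions, and the User-Agent string is the one used in the full example later in this lesson.

```python
import requests

# Browser-like User-Agent string, reused from the full example in this lesson.
BROWSER_UA = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/65.0.3325.183 Safari/537.36"
)


def fetch_html(url, user_agent=BROWSER_UA, timeout=10):
    """Return the page HTML as a string, or None if the request fails."""
    try:
        response = requests.get(
            url, headers={"User-Agent": user_agent}, timeout=timeout
        )
        # Turn 4xx and 5xx status codes into exceptions.
        response.raise_for_status()
    except requests.RequestException as error:
        # Covers wrong URLs, connection problems, timeouts and HTTP errors.
        print(f"Could not fetch {url!r}: {error}")
        return None
    return response.text


# A malformed URL is rejected before any network traffic happens.
print(fetch_html("not-a-valid-url"))
```

Returning `None` keeps the sketch short; in a larger scraper you might prefer to let the exception propagate so the caller can decide whether to retry.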

We can convert that text into a BeautifulSoup object.

#### Example 1

Create a BeautifulSoup object.

In [23]:
web_page = bs4.BeautifulSoup(requests.get(URL, {}).text, "lxml")

In [24]:
web_page

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.1 Transitional//EN">
<html>
<!--   HTML coding exclusively with the CoffeeCup HTML 2010 Editor -->
<!--          http://www.coffeecup.com                             -->
<!--   Brewed on December 6 1994 1:09:21 PM                        -->
<!--   Updated on September 6 2001 4:18:12 AM                      -->
<!--   Updated on January 4 2011 8:22:45 PM                        -->
<!--   Created by Dr. Gene Davis - Computer Support Group          -->
<head>
<title>Countries, Capitals, Latitude and Longitude Table</title>
<meta content="CSGNetwork Countries, Capitals, Latitude and Longitude Table" name="Title"/>
<meta content="For CSGNetwork.Com, Computer Support Group - Dr. Gene Davis" name="Author"/>
<meta content="CSGNetwork public use information table" name="Subject"/>
<meta content="Countries, Capitals, Latitude and Longitude Table. Though this table does not, most pages require JavaScript" name="Description"/>
<meta content="latitude, long

Web pages are trees of elements nested one inside the other.

For example:
- html
  - body
      - div
      - div
      - div
      
We say that body is a child of html and html is a parent of body, and that the 3 div are children of body. The 3 div are siblings. This terminology matters because the method names in bs4 follow it.
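The parent, child and sibling vocabulary maps directly onto bs4 attributes and methods. The snippet below is a small sketch on a made-up HTML string; it uses Python's built-in `"html.parser"` instead of `"lxml"` only so that it runs without extra packages.

```python
import bs4

# Made-up page with the same shape as the tree above.
html = "<html><body><div>first</div><div>second</div><div>third</div></body></html>"
soup = bs4.BeautifulSoup(html, "html.parser")

body = soup.body

# The parent of body is html.
print(body.parent.name)

# The children of body are the three div elements.
print([child.name for child in body.children])

# The next sibling of the first div is the second div.
first_div = body.div
print(first_div.find_next_sibling("div").text)
```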

Before you go scraping, open the website in your browser's inspector view to see the nesting hierarchy of the web page elements.

Generally all web pages have two main sections called `head` and `body`:
- `head` is where a lot of metadata lives
- `body` is what you see on the screen and it contains all links, tables and images.

#### Example 2

Let's find the title of the web page we pulled using the `head` and `title` elements.

In [25]:
web_page.head.title

<title>Countries, Capitals, Latitude and Longitude Table</title>

We can navigate the tree by going element by element. You need to know the element names (html, head, div, span, p, a and so on) but don't worry if you don't. Look at the webpage in the inspector view in your browser and you can see the full path to the element of interest.

To get the text we need to use the `text` property of elements.

In [27]:
web_page.head.title.text
#web_page

'Countries, Capitals, Latitude and Longitude Table'
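The same navigation can be tried on a tiny, made-up page. This sketch also shows two details that are easy to miss: `get_text(strip=True)` removes surrounding whitespace, and navigating to an element that does not exist gives `None`. The built-in `"html.parser"` is used here only so the snippet runs without lxml.

```python
import bs4

# Made-up page used only for this illustration.
html = """
<html>
  <head><title>  Sample page  </title></head>
  <body><p>Hello <b>world</b></p></body>
</html>
"""
soup = bs4.BeautifulSoup(html, "html.parser")

title = soup.head.title
print(repr(title.text))                  # raw text, whitespace included
print(repr(title.get_text(strip=True)))  # surrounding whitespace removed
print(soup.body.p.text)                  # text of nested elements is joined

# Dot navigation returns None when an element does not exist,
# so check for None before asking for .text.
missing = soup.body.table
print(missing is None)
```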

#### Example 3

Let's go into the body of the web page we fetched.

In [28]:
web_page.body

<body><center>
<font face="Arial,Helvetica,Geneva,Swiss,SunSans-Regular">
<script async="" src="//pagead2.googlesyndication.com/pagead/js/adsbygoogle.js"></script>
<style type="text/css">
<!--
body { font-family: Tahoma, Verdana; }
input.btn {
  color:black;
  font: bold 75% 'tahoma',helvetica,sans-serif;
  background-color:#ffd148;
  border: 2px solid;
  border-color: #c7c5c8;
 }
 
 input.btn2 {
  color:black;
  font: bold 95% 'tahoma',helvetica,sans-serif;
  background-color:#ffd148;
  border: 2px solid;
  border-color: #c7c5c8;
 
}
//-->
</style>
<meta content="width=958, initial-scale=1" name="viewport"/>
<div align="center"><table bgcolor="#899194" border="0" cellpadding="0" cellspacing="0" width="949">
<tr>
<td colspan="13"><div align="left"><img alt="CSGNetwork.com Free Information" height="37" src="http://storage.googleapis.com/csgnetworkstatic/allbar_header.png" width="950"/></div></td>
</tr>
<tr valign="middle"><td><div align="left" style="float:left;">
<form action="http://w

It is full of elements like `<a>` or `<ul>` or `<li>` or `<div>` or `<span>` etc.

The majority of that is noise to us because we only want the data in the table: the countries, capitals, latitudes and longitudes.

#### Example 4

Searching the HTML for particular elements using `find_all`.

In [29]:
#sub_web_page = web_page.find_all(name="td", attrs={"class": "location"})
sub_web_page = web_page.body.find_all(name="td")
sub_web_page

[<td colspan="13"><div align="left"><img alt="CSGNetwork.com Free Information" height="37" src="http://storage.googleapis.com/csgnetworkstatic/allbar_header.png" width="950"/></div></td>,
 <td><div align="left" style="float:left;">
 <form action="http://www.csgnetwork.com/search_frame.html" id="cse-search-box">
 <div>
 <input name="cx" type="hidden" value="partner-pub-8018289210612122:1424138921"/>
 <input name="cof" type="hidden" value="FORID:10"/>
 <input name="ie" type="hidden" value="UTF-8"/>
 <input name="q" size="55" type="text"/>
 <input class="btn" name="sa" type="submit" value="Search"/>
 </div>
 </form>
 </div>
 <div align="center" style="float:right; margin-right:40px;"><span style="decoration:none; color:white; font-weight:bold; font-size:smaller;">
 <a href="http://www.csgnetwork.com" style="decoration:none; color:white;"><b>Home</b></a>  |  
 <a href="http://www.csgnetwork.com/top_free_apps.html" style="decoration:none; color:white;"><b>Top Free Apps</b></a>  |<br/>
 <a h

In [None]:
sub_web_page[0].text

'\n        Stewartbury, AA\n      '
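To see how the `name` and `attrs` filters of `find_all` narrow the result, here is a small sketch on a made-up table. The class names `country` and `capital` are hypothetical and are not taken from the page above.

```python
import bs4

# Made-up table; the class names are assumptions for this illustration.
html = """
<table>
  <tr><td class="country">France</td><td class="capital">Paris</td></tr>
  <tr><td class="country">Japan</td><td class="capital">Tokyo</td></tr>
</table>
"""
soup = bs4.BeautifulSoup(html, "html.parser")

# Every td element, regardless of class.
all_cells = soup.find_all(name="td")

# Only the td elements whose class is "capital".
capitals = soup.find_all(name="td", attrs={"class": "capital"})

print(len(all_cells))
print([cell.text for cell in capitals])
```

`find_all` always returns a list-like result, so it can be indexed, sliced and looped over like any other list.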

#### Example 5

Get all the tags from the GitHub page of the `requests` repository. The tags are `python`, `http`, `forhumans` etc.

In [30]:
URL = "https://github.com/requests/requests"

web_page_text = requests.get(URL).text

web_page = bs4.BeautifulSoup(web_page_text, "lxml")

web_page

<!DOCTYPE html>
<html data-a11y-animated-images="system" data-a11y-link-underlines="true" data-color-mode="auto" data-dark-theme="dark" data-light-theme="light" lang="en">
<head>
<meta charset="utf-8"/>
<link href="https://github.githubassets.com" rel="dns-prefetch"/>
<link href="https://avatars.githubusercontent.com" rel="dns-prefetch"/>
<link href="https://github-cloud.s3.amazonaws.com" rel="dns-prefetch"/>
<link href="https://user-images.githubusercontent.com/" rel="dns-prefetch"/>
<link crossorigin="" href="https://github.githubassets.com" rel="preconnect"/>
<link href="https://avatars.githubusercontent.com" rel="preconnect"/>
<link crossorigin="anonymous" href="https://github.githubassets.com/assets/light-7aa84bb7e11e.css" media="all" rel="stylesheet"/><link crossorigin="anonymous" href="https://github.githubassets.com/assets/dark-f65db3e8d171.css" media="all" rel="stylesheet"/><link crossorigin="anonymous" data-color-theme="dark_dimmed" data-href="https://github.githubassets.com/

We are looking for the element `<div class="BorderGrid-row">`.

Notice that a `class` attribute can hold several entries, such as the three on the nested `<div class="my-3 d-flex flex-items-center">`.

When filtering by class we can use any single class, we don't have to list them all.

When we want to target a single element it is better to use `find`, which has the same parameters as `find_all` but returns only the first matching element (or `None` if nothing matches).
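A short sketch of both points, filtering by a single class and using `find`, on a made-up snippet. The class names are borrowed from the GitHub markup, but the HTML itself is invented for this example.

```python
import bs4

# Invented snippet; only the class names resemble the GitHub page.
html = """
<div class="my-3 d-flex flex-items-center"><span>first</span></div>
<div class="my-3"><span>second</span></div>
"""
soup = bs4.BeautifulSoup(html, "html.parser")

# Filtering by one class matches elements that carry other classes too.
first_match = soup.find(name="div", attrs={"class": "d-flex"})
print(first_match["class"])  # the class attribute is a list of entries

# find_all returns every match, find returns only the first one.
all_matches = soup.find_all(name="div", attrs={"class": "my-3"})
print(len(all_matches))

# find returns None when nothing matches.
print(soup.find(name="div", attrs={"class": "missing"}))
```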

Here we also use the `children` property for the first time.

In our list comprehension we filter by type and remove all `NavigableString`s, which are plain strings and not elements.

To understand the difference between a `NavigableString` and a `Tag` (tag is synonymous with element) look at this example.

```
    <div>
        I am some text and of type Navigable String.
        
        <a> I am a child I am of type Tag when I am an element but I am Navigable String when I am text </a>
   </div>
```

We keep only the elements, access their `text` property and clean it.
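The filtering step can be tried in isolation on a made-up `div`. This sketch uses `isinstance(child, bs4.Tag)` to keep elements, which is an alternative to comparing the type against `NavigableString`.

```python
import bs4

# Made-up div containing both plain text and child elements.
html = """
<div>
    I am some text.
    <a>first link</a>
    <a>second link</a>
</div>
"""
soup = bs4.BeautifulSoup(html, "html.parser")
div = soup.find(name="div")

# `children` is an iterator, so convert it to a list before reusing it.
children = list(div.children)

# Plain text and whitespace between elements show up as NavigableString.
kinds = [type(child).__name__ for child in children]
print(kinds)

# Keep only the elements and clean their text.
link_text = [
    child.text.strip("\n ")
    for child in children
    if isinstance(child, bs4.Tag)
]
print(link_text)
```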

In [31]:
tags_elements = web_page.find(name="div", attrs={"class": "BorderGrid-row"})
print(tags_elements)

# To keep only the cleaned text of the child elements, uncomment:
# tags_text = [elem.text.strip("\n ")
#              for elem in tags_elements.children
#              if type(elem) != bs4.NavigableString]
# tags_text

<div class="BorderGrid-row">
<div class="BorderGrid-cell">
<div class="hide-sm hide-md">
<h2 class="mb-3 h4">About</h2>
<p class="f4 my-3">
        A simple, yet elegant, HTTP library.
      </p>
<div class="my-3 d-flex flex-items-center">
<svg aria-hidden="true" class="octicon octicon-link flex-shrink-0 mr-2" data-view-component="true" height="16" version="1.1" viewbox="0 0 16 16" width="16">
<path d="m7.775 3.275 1.25-1.25a3.5 3.5 0 1 1 4.95 4.95l-2.5 2.5a3.5 3.5 0 0 1-4.95 0 .751.751 0 0 1 .018-1.042.751.751 0 0 1 1.042-.018 1.998 1.998 0 0 0 2.83 0l2.5-2.5a2.002 2.002 0 0 0-2.83-2.83l-1.25 1.25a.751.751 0 0 1-1.042-.018.751.751 0 0 1-.018-1.042Zm-4.69 9.64a1.998 1.998 0 0 0 2.83 0l1.25-1.25a.751.751 0 0 1 1.042.018.751.751 0 0 1 .018 1.042l-1.25 1.25a3.5 3.5 0 1 1-4.95-4.95l2.5-2.5a3.5 3.5 0 0 1 4.95 0 .751.751 0 0 1-.018 1.042.751.751 0 0 1-1.042.018 1.998 1.998 0 0 0-2.83 0l-2.5 2.5a1.998 1.998 0 0 0 0 2.83Z"></path>
</svg>
<span class="flex-auto min-width-0 css-truncate css-trun

## Full Example

This is an example scraper for GDP figures from Wikipedia, as reported by the IMF.

The web page is located at:
- https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)

In the end we will have a pandas data frame of GDPs for each country in 2017.
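Before running a scraper against the live page, the row-by-row pattern can be tried on a small inline table. The sketch below uses invented country names and values, and it skips any row with fewer than two `td` cells. That guard is an assumption about what a safe default looks like and is not part of the original scraper.

```python
import bs4
import pandas as pd

# Invented table that mimics a wikitable: a header row, data rows, a spanning row.
html = """
<table class="wikitable">
  <tr><th>Country</th><th>GDP</th></tr>
  <tr><td>Alpha</td><td>1,200</td></tr>
  <tr><td>Beta</td><td>950[n 1]</td></tr>
  <tr><td colspan="2">Footnote row</td></tr>
</table>
"""
soup = bs4.BeautifulSoup(html, "html.parser")
table = soup.find(name="table", attrs={"class": "wikitable"})

rows = []
for row in table.find_all("tr"):
    cells = row.find_all("td")
    # Header rows use <th> and spanning rows have too few cells: skip both.
    if len(cells) < 2:
        continue
    rows.append((cells[0].text.strip(), cells[1].text.strip()))

data_frame = pd.DataFrame(rows, columns=["country name", "gdp"])
print(data_frame)
```

Checking the number of cells instead of skipping a fixed number of rows makes the loop less sensitive to changes in the page layout.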

In [32]:
WP_URL = "https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)"

# Send a browser-like User-Agent header so the site treats us like a regular browser
web_page = bs4.BeautifulSoup(requests.get(WP_URL, headers={
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.183 Safari/537.36"
}).text, "lxml")

imf_table = web_page.find_all(name="table", attrs={"class": "wikitable"})[0]

# Get the column names of our dataframe.
# `children` is an iterator and to index it we must first convert it to a list.
columns = list(imf_table.tbody.children)[0]
columns = [elem.text.strip("\n ")
           for elem in columns
           if type(elem) != bs4.NavigableString]

rows = []

for i, row in enumerate(imf_table.tbody.find_all("tr")):
    # Skip the header
    if i <= 1 or type(row) == bs4.NavigableString:
        continue

    tds = row.find_all("td")

#    rank         = tds[0].text
    country_name = tds[0].text
    gdp          = tds[2].text

    rows.append((country_name, gdp))

data_frame = pd.DataFrame(rows, columns=['country name','gdp'])

data_frame.head()

|   | country name | gdp |
|---|---|---|
| 0 | World | 2025 |
| 1 | United States | 2025 |
| 2 | China | [n 1]2025 |
| 3 | Germany | 2025 |
| 4 | Japan | 2025 |


Our data frame is full of unclean data, and every column is of type `object`.

Our next step is to clean our data.
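One possible way to do the cleaning is sketched below. The values are invented for illustration (they are not real GDP figures), and the choice to turn unparseable entries into `NaN` is an assumption about what "clean" should mean here.

```python
import pandas as pd

# Invented raw values with the kinds of noise a scraped table tends to contain.
raw = pd.DataFrame({
    "country name": [" World ", "United States", "China"],
    "gdp": ["1,000", "[n 1]250", "n/a"],
})

clean = raw.copy()
clean["country name"] = clean["country name"].str.strip()
clean["gdp"] = (
    clean["gdp"]
    .str.replace(r"\[[^\]]*\]", "", regex=True)  # drop footnote markers like [n 1]
    .str.replace(",", "", regex=False)           # drop thousands separators
)
# Convert to numbers; anything that still cannot be parsed becomes NaN.
clean["gdp"] = pd.to_numeric(clean["gdp"], errors="coerce")

print(clean.dtypes)
print(clean)
```

After this step the `gdp` column is numeric, so it can be sorted, summed and plotted.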

## Your very own scraper

Pick a website of your choice, scrape the data into a pandas dataframe and clean it afterwards.