# Getting Data from the Internet (Part 2)
#### Obtaining and processing data

-----------
_Author: Dhavide Aruliah_

### Assignment Contents
- [Using `bs4`](#bs4)
- [Scraping Monty Python Quotes](#python-quotes)
    - [Question 1](#Question-1)
    - [Question 2](#Question-2)
    - [Question 3](#Question-3)
    - [Question 4](#Question-4)
    - [Question 5](#Question-5)
    - [Question 6](#Question-6)
- [Scraping Wikipedia](#wikipedia)
    - [Question 7](#Question-7)
    - [Question 8](#Question-8)
    - [Question 9](#Question-9)
    - [Question 10](#Question-10)
    - [Question 11](#Question-11)

#### EXPECTED TIME 1.5 HRS  


### Overview

You have seen how to obtain data programmatically from web services using the Python `requests` module. You have also seen how to process data retrieved in some common web formats (JSON and XML) using appropriate Python libraries (`json` and `lxml` respectively).

In this assignment, you will focus on working with HTML retrieved from the web using the Python module `bs4` (which stands for [*BeautifulSoup 4*](https://www.crummy.com/software/BeautifulSoup/)). This module provides basic tools for web-scraping from HTML data.

The content here is drawn from Video lectures 8-1 through 8-9.

### Activities in this Assignment

- Using `requests` to retrieve HTML pages
- Using the `BeautifulSoup` class from `bs4` to capture the content of HTML data
- Using common methods and attributes (`find`,  `find_all`, `get`, `text`, etc.)

---
<a id="bs4"></a>
### Using `bs4`

The class `BeautifulSoup` from the module `bs4` provides all the functionality you'll need for this assignment. It does not cover every scraping scenario; more sophisticated web pages (for instance, those that build their content dynamically) may require other tools such as the `selenium` module.

We'll import the `requests` module again because that is a good resource for obtaining web content.

In [1]:
from bs4 import BeautifulSoup
import requests

---
<a id="python-quotes"></a>
## Scraping Monty Python Quotes

To start, download some web data using the [Python `requests` module](https://2.python-requests.org/en/master/) as you did in the previous assignment. The data consists of
[Monty Python quotations](http://www.allgreatquotes.com/monty_python_quotes.shtml) taken from [AllGreatQuotes.com](https://www.allgreatquotes.com/).


[Back to top](#Assignment-Contents)

---

#### Question 1

Your task is as follows:

+ Obtain the text from the web page [`URL_quotes`](https://www.allgreatquotes.com/monty_python_quotes.shtml) by extracting the `.text` attribute from an associated `Response` object. Assign the result to a string `text_quotes`. This text is the content of the page in HTML.
+ Use the `str.find` method to extract the *title* from the associated HTML. This is the text that lies strictly between the tags `<title>` and `</title>`. Assign the result to a string `title_quotes`.
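
The slicing idea behind this task can be sketched on a small stand-in string, with no network access needed. The names `sample_html` and `sample_title` and the page content are hypothetical, chosen only for illustration.

```python
# A small stand-in for the downloaded page (hypothetical content).
sample_html = "<html><head><title>Sample page</title></head><body></body></html>"

open_tag, close_tag = "<title>", "</title>"
start = sample_html.find(open_tag) + len(open_tag)  # index just past "<title>"
end = sample_html.find(close_tag, start)            # index where "</title>" begins
sample_title = sample_html[start:end]               # text strictly between the tags
print(sample_title)
```

Using `len(open_tag)` rather than a hard-coded offset keeps the slice correct if the tag being searched for changes.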

In [2]:
### GRADED
### Use requests.get to query URL_quotes (provided)
### Extract the .text attribute from the response obtained and bind
###    the resulting string to the identifier text_quotes.
### Use the string method find to identify the text between
###    the substrings "<title>" and "</title>" in text_quotes.
### Assign the extracted text to title_quotes.
###
### Note: The resulting text must not contain the tag '<title>'.
###
URL_quotes = 'http://www.allgreatquotes.com/monty_python_quotes.shtml'
###
### YOUR CODE HERE
###
response = requests.get(URL_quotes)
text_quotes = response.text
title_quotes = text_quotes[(text_quotes.find("<title>")+7):text_quotes.find("</title>")]

### For verifying answer:
print('title_quotes: {}\n\n'.format(title_quotes))
print(text_quotes[:110])

title_quotes: Monty Python quotes, famous Monty Python quotes, sayings from The Pythons


<html>
<head>
<title>Monty Python quotes, famous Monty Python quotes, sayings from The Pythons</title>
<meta h


In [3]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


[Back to top](#Assignment-Contents)

---

#### Question 2

Using Python string methods to extract data from HTML is inefficient and difficult. The `BeautifulSoup` class is designed to make this much easier.

Your task in this question is to parse the string `text_quotes` into a `BeautifulSoup` object. Use the `lxml` parser to decode the text. Assign the result to the identifier `soup_quotes`.

In [4]:
### GRADED
### Instantiate a BeautifulSoup object using the string text_quotes from Question 1.
###     Use the 'lxml' parser in the call to BeautifulSoup.
###     Assign the result to soup_quotes.
###
###
### YOUR CODE HERE
###
soup_quotes = BeautifulSoup(text_quotes,'lxml')

### For verifying answer:
print('type(soup_quotes): {}'.format(type(soup_quotes)))

type(soup_quotes): <class 'bs4.BeautifulSoup'>


In [5]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


[Back to top](#Assignment-Contents)

---


#### Question 3

Your task in this question is to repeat the computation from Question 1 using the `find` method of the `BeautifulSoup` object `soup_quotes` constructed in Question 2. Recall that the `find` method returns the first tag matching the given name in the `BeautifulSoup` object; in this case, you are looking to match the tag `'title'`.

Assign the result of the `find` command to the identifier `title_quotes`.
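
As a minimal sketch of how `find` behaves, the snippet below parses a small hypothetical HTML string instead of the downloaded page. It uses the built-in `"html.parser"` so that it runs without `lxml`; the assignment itself asks for the `'lxml'` parser. All `sample_*` names are illustrative.

```python
from bs4 import BeautifulSoup

sample_html = "<html><head><title>Sample page</title></head><body><p>Hello</p></body></html>"
# "html.parser" ships with Python; the assignment itself asks for "lxml".
sample_soup = BeautifulSoup(sample_html, "html.parser")

title_tag = sample_soup.find("title")  # a bs4.element.Tag, not a string
print(title_tag)                       # the whole tag, markup included
print(title_tag.text)                  # only the enclosed text

missing = sample_soup.find("table")    # there is no <table> here, so this is None
print(missing)
```

Because `find` returns `None` when nothing matches, it is worth checking the result before reading attributes such as `.text` from it.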

In [6]:
### GRADED
### Use the find method to extract the first 'title' tag from the BeautifulSoup object
###    soup_quotes. Assign the result to title_quotes.
###
### YOUR CODE HERE
###
title_quotes = soup_quotes.find('title')

### For verifying answer:
print('title_quotes: {}\n'.format(title_quotes))
print('type(title_quotes): {}\n\n'.format(type(title_quotes)))
print(title_quotes.text)

title_quotes: <title>Monty Python quotes, famous Monty Python quotes, sayings from The Pythons</title>

type(title_quotes): <class 'bs4.element.Tag'>


Monty Python quotes, famous Monty Python quotes, sayings from The Pythons


In [7]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


[Back to top](#Assignment-Contents)

---


#### Question 4

The `BeautifulSoup` object has a `find_all` method that resembles the `find` method. The difference is that the result returned is a `ResultSet` object (which is basically a list of `bs4` elements).

In this question, you will extract all the tags from `soup_quotes` that match `'title'` (not just the first). Assign the result of the `find_all` command to the identifier `all_titles_quotes`.
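
The difference from `find` can be seen on a small hypothetical fragment with several matching tags. The built-in `"html.parser"` is used here only so the sketch runs without `lxml`.

```python
from bs4 import BeautifulSoup

sample_html = "<ul><li>one</li><li>two</li><li>three</li></ul>"
sample_soup = BeautifulSoup(sample_html, "html.parser")

items = sample_soup.find_all("li")          # ResultSet holding every <li> tag
item_texts = [item.text for item in items]  # iterate over it like a list
print(len(items), item_texts)

no_match = sample_soup.find_all("title")    # an empty ResultSet, not None
print(len(no_match))
```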

In [8]:
### GRADED
### Use soup_quotes to extract all 'title' tags and assign the output to a variable named all_titles_quotes.
###
###
### YOUR CODE HERE
###
all_titles_quotes = soup_quotes.find_all('title')

### For verifying answer:
print('all_titles_quotes: {}\n'.format(all_titles_quotes))
print('type(all_titles_quotes): {}\n'.format(type(all_titles_quotes)))
for title in all_titles_quotes:
    print(title.text)

all_titles_quotes: [<title>Monty Python quotes, famous Monty Python quotes, sayings from The Pythons</title>]

type(all_titles_quotes): <class 'bs4.element.ResultSet'>

Monty Python quotes, famous Monty Python quotes, sayings from The Pythons


In [9]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


[Back to top](#Assignment-Contents)

---

#### Question 5

The HTML source of `soup_quotes` looks something like this:

```HTML
<table width="100%" border="1" cellpadding="6" cellspacing="0" class="body">
<tr bgcolor="#FFFFFF">
<td>Newsreader [John Cleese]: And now for something completely
different.<br/>
<b>Monty Python's Flying Circus</b></td>
</tr>
<tr>
<td>Norman [Eric Idle]: Is your wife a..."goer"... eh?
Know what I mean? Know what I mean? Nudge nudge. Nudge nudge!
Know what I mean? Say no more...Know what I mean?<br/>
<b>Monty Python's Flying Circus</b></td>
</tr>
<tr>
<td>Norman [Eric Idle]: A nod's as good as a wink to a blind bat,
eh?.<br/>
<b>Monty Python's Flying Circus</b></td>
</tr>

...
</tr>
</table>
```

The quotations referred to in this page, then, are enclosed within a `<table>` tag with `class="body"` (as opposed to other tables contained in the same page that contain, e.g., links to advertisements or pages with more quotations). The quotations are enclosed between `<td>` tags.

You can extract this table using the `find` method (matching on `'table'` and using the keyword argument `class_="body"` to match on the required attribute). From that result, extract the corresponding quotations using the `find_all` method matching on `'td'`. The `find_all` yields a `ResultSet` with the desired quotations as the `text` attributes for all the tags within.
+ Assign the output of the `find` method to `table_quotes`.
+ Assign the output of the `find_all` method  to `quotes`.
+ Assign the number of elements in `quotes` to `num_quotes`.
+ Assign the text of the fifth element (i.e., element 4 when indexed from 0) to `quotes_4`.
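
The two-step pattern described above (select one table by class, then collect its cells) might look as follows on a made-up page with two tables. The HTML, the class name `nav`, and the `sample_*` names are assumptions for illustration; `"html.parser"` stands in for `lxml`.

```python
from bs4 import BeautifulSoup

sample_html = """
<table class="nav"><tr><td>Home</td></tr></table>
<table class="body">
<tr><td>First quote<br/><b>Source A</b></td></tr>
<tr><td>Second quote<br/><b>Source B</b></td></tr>
</table>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# class_ has a trailing underscore because "class" is a reserved word in Python.
sample_table = sample_soup.find("table", class_="body")
sample_cells = sample_table.find_all("td")  # searches only inside sample_table
print(len(sample_cells))
print(sample_cells[0].text)
```

Passing a dictionary such as `{"class": "body"}` as the second argument to `find` selects the same table.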

In [10]:
### GRADED
### Extract the 'table' tag from soup_quotes using the find method with the keyword argument class_="body";
###    assign the result to table_quotes.
### Extract all the tags matching "td" from table_quotes using the find_all method;
###    assign the result to quotes.
### Assign the number of quotes extracted to num_quotes.
### Assign the text of the fifth quote to quotes_4.
###
###
### YOUR CODE HERE
###
table_quotes = soup_quotes.find("table",{"class":"body"})
quotes = table_quotes.find_all("td")
num_quotes = len(quotes)
quotes_4 = quotes[4].text

### For verifying answer:
print('num_quotes: {}\n'.format(num_quotes))
print('quotes_4:\n\n{}'.format(quotes_4))

num_quotes: 10

quotes_4:

Cardinal Ximinez [Michael Palin] Nobody expects the Spanish
Inquisition!
Monty Python's Flying Circus


In [11]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


[Back to top](#Assignment-Contents)

---


#### Question 6

You will now use `find_all` to extract all the anchors (i.e., tags of the form `<a>`), and the links they contain, from `soup_quotes`.
* Assign the result of the `find_all` method to `links`.
* Assign the number of elements within `links` to the integer `num_links`.
* Construct a list `local_links` by extracting the `href` attributes from all the links in `links` that begin with the character `/`.
* Assign the number of elements within `local_links` to the integer `num_local`.
* Assign the last entry of `local_links` to the identifier `local_last`.
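
One point to watch in this task is that an `<a>` tag need not carry an `href` attribute, in which case `get('href')` returns `None`. The sketch below, on a hypothetical fragment, supplies an empty-string default so that `startswith` is always called on a string. The `sample_*` names and the HTML are illustrative only.

```python
from bs4 import BeautifulSoup

sample_html = """
<a href="/">Home</a>
<a href="/authors-a/">A</a>
<a href="https://example.com/">External</a>
<a name="top">No href here</a>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

sample_links = sample_soup.find_all("a")
# get("href", "") falls back to "" when the attribute is absent.
sample_local = [
    link.get("href")
    for link in sample_links
    if link.get("href", "").startswith("/")
]
print(len(sample_links), sample_local, sample_local[-1])
```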

In [23]:
### GRADED 
### Assign the result of the `find_all` method to `links`.
### Assign the number of elements within `links` to the integer `num_links`.
### Construct a list `local_links` by extracting all the `href` attributes from `links` that begin with the character `/`.
### Assign the number of elements within `local_links` to the integer `num_local`.
### Assign the last entry of `local_links` to the identifier `local_last`.
###
###
### YOUR CODE HERE
###

a = '''links = soup_quotes.find_all("a")
#print(links)
num_links = len(links)
local_links = list()
for link in links:
    href = link.get('href')
    if href != None:
        if href.startswith("/"):
            local_links.append(href)
num_local = len(local_links)
local_last = local_links[len(local_links)-1]
#print(local_links)
'''
## As suggested by Carleton, changing the code to work around the bug
links = soup_quotes.find_all('a')
num_links = len(links)
local_links = [link for link in links if link.get('href').startswith('/')]
num_local = len(local_links)
local_last = local_links[-1].get('href')


### For verifying answer:
print('num_links: {}\n'.format(num_links))
print('num_local: {}\n'.format(num_local))
print('local_links: {}\n'.format(local_links))
print('local_last: {}'.format(local_last))

num_links: 46

num_local: 40

local_links: [<a href="/">
<img alt="Famous quotes, funny quotes, inspirational and motivational quotations, literary, historical. Quotes by famous authors and celebrities" border="0" height="80" src="Images/logo.jpg" width="395"/></a>, <a href="/authors-a/">A</a>, <a href="/">B</a>, <a href="/authors-c/">C</a>, <a href="/authors-d/">D</a>, <a href="/authors-e/">E</a>, <a href="/authors-f/">F</a>, <a href="/authors-g/">G</a>, <a href="/authors-h/">H</a>, <a href="/authors-i/">I</a>, <a href="/authors-j/">J</a>, <a href="/authors-k/">K</a>, <a href="/authors-l/">L</a>, <a href="/authors-m/">M</a>, <a href="/authors-n/">N</a>, <a href="/authors-o/">O</a>, <a href="/authors-p/">P</a>, <a href="/authors-q/">Q</a>, <a href="/authors-r/">R</a>, <a href="/authors-s/">S</a>, <a href="/authors-t/">T</a>, <a href="/authors-u/">U</a>, <a href="/authors-v/">V</a>, <a href="/authors-w/">W</a>, <a href="/authors-x/">X</a>, <a href="/authors-y/">Y</a>, <a href="/authors-

In [13]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


---

<a id=wikipedia></a>

### Scraping Wikipedia

You'll move on here to scrape a table from [Wikipedia](https://en.wikipedia.org).


[Back to top](#Assignment-Contents)

---


#### Question 7

The identifier `URL_solar_system` provides a link to the page on [gravitationally rounded objects in the solar system](https://en.wikipedia.org/wiki/List_of_gravitationally_rounded_objects_of_the_Solar_System).

To start, download the text of the page and parse it into a `BeautifulSoup` object; bind the resulting object to the identifier `soup_solar`.

In [14]:
### GRADED
### Use the text attribute of the Response object to extract the HTML of the
###   required web page (provided in URL_solar_system), then parse it with BeautifulSoup.
###
# Get the soup object first....
URL_solar_system = "https://en.wikipedia.org/wiki/List_of_gravitationally_rounded_objects_of_the_Solar_System"
###
### YOUR CODE HERE
###
new_response = requests.get(URL_solar_system)
soup_solar = BeautifulSoup(new_response.text, 'lxml')

In [15]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


[Back to top](#Assignment-Contents)

---


#### Question 8

Having constructed the `BeautifulSoup` object `soup_solar`, your task in this question is to extract all the tables. You can use the `find_all` method to get `<table>` tags with the attribute `attrs={'class': 'wikitable'}`.
* Assign the result of this top-level `find_all` to the identifier `tables`.
* Assign the third table (i.e., item 2 indexed from 0) to `planets`.
* Find all the `<tr>` tags within `planets` and bind result to `rows`.
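
A small sketch of these steps on made-up markup is shown below. It also illustrates that matching on `class` succeeds when a tag lists several classes, as long as one of them is the requested value. The HTML and the `sample_*` names are hypothetical, and `"html.parser"` stands in for `lxml`.

```python
from bs4 import BeautifulSoup

sample_html = """
<table class="infobox"><tr><td>skip me</td></tr></table>
<table class="wikitable"><tr><th>A</th></tr></table>
<table class="wikitable sortable">
<tr><th>Name</th><th>Value</th></tr>
<tr><td>x</td><td>1</td></tr>
</table>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Both "wikitable" tables match; the "infobox" table does not.
sample_tables = sample_soup.find_all("table", attrs={"class": "wikitable"})
second_table = sample_tables[1]             # index into the ResultSet like a list
sample_rows = second_table.find_all("tr")   # rows of that one table only
print(len(sample_tables), len(sample_rows))
```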

In [16]:
### GRADED
### Assign result of this top level `find_all` to the identifier `tables`.
###      Assign the third (i.e., item 2 indexed from 0) to `planets`.
###      Find all the `<tr>` tags within `planets` and bind result to `rows`.
###
###
### YOUR CODE HERE
###
tables = soup_solar.find_all("table", attrs={'class': 'wikitable'})
planets = tables[2]
rows = planets.find_all("tr")

### For verifying answer:
print('Number of tables: {}\n'.format(len(tables)))
print('Number of rows in planets: {}'.format(len(rows)))

Number of tables: 9

Number of rows in planets: 24


In [17]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


[Back to top](#Assignment-Contents)

---

#### Question 9

Having extracted the desired sequence of tables from the `BeautifulSoup` object `soup_solar`, you can now extract data from a table. You have already extracted the rows of this table into `rows` in Question 8.

The first row, `rows[0]`, contains the headers of the table (which are the names of the planets in the solar system). The column headers in `rows[0]` look something like this when printed:

```
*Mercury[6]

*Venus[7]

*Earth[8]

*Mars[9]

°Jupiter[10]

°Saturn[11]

†Uranus[12]

†Neptune[13]
```

The raw Unicode is given as follows:

```
\n\xa0\n\n*Mercury[6]\n\n*Venus[7]\n\n*Earth[8]\n\n*Mars[9]\n\n°Jupiter[10]\n\n°Saturn[11]\n\n†Uranus[12]\n\n†Neptune[13]\n
```

Your task in this question is to extract the column headers (the names of the planets) from `rows[0]`.  Each column header is enclosed in a `<th>` tag. In each case, the associated `text` attribute has extraneous characters to ignore (notably the leading Unicode character and the trailing link number in square brackets).
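
One possible way to strip the extraneous characters is sketched below. It works directly on the raw row text quoted above rather than on individual `<th>` tags, and keeps the first run of letters from each line. This is an illustration of the cleaning step under that assumption, not necessarily the approach the grader expects; the names `raw_row_text` and `names` are hypothetical.

```python
import re

# The raw text of rows[0] as quoted above.
raw_row_text = (
    "\n\xa0\n\n*Mercury[6]\n\n*Venus[7]\n\n*Earth[8]\n\n*Mars[9]\n\n"
    "°Jupiter[10]\n\n°Saturn[11]\n\n†Uranus[12]\n\n†Neptune[13]\n"
)

names = []
for piece in raw_row_text.split("\n"):
    match = re.search(r"[A-Za-z]+", piece)  # first run of ASCII letters, if any
    if match:                               # blank lines and "\xa0" yield no match
        names.append(match.group())
print(names)
```

The same `re.search` call can be applied to the `text` attribute of each `<th>` tag when iterating over the headers one at a time.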

In [18]:
### GRADED
###
### Extract the names of the planets in order from nearest to the sun, as they appear in the table.
### To do so, iterate over all the column headers of rows[0] and then remove the leading unicode symbol
### and the link number `[3]` after the planet name.
### Store the planet names in a list named headers.
###
###
### YOUR CODE HERE
###
import re
first_row = rows[0]
all_headers = first_row.find_all("th")
headers = list()
for header in all_headers:
    header_content = header.text
    header_content = re.findall("[a-zA-Z]*",header_content)
    header_content = header_content[1]
    if len(header_content) > 0:
        headers.append(header_content)
    

### For verifying answer:
print('headers: {}'.format(headers))

headers: ['Mercury', 'Venus', 'Earth', 'Mars', 'Jupiter', 'Saturn', 'Uranus', 'Neptune']


In [19]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


[Back to top](#Assignment-Contents)

---

#### Question 10

Skipping to the fourth row (i.e., `rows[3]`), the `BeautifulSoup` elements sought are nested within `td` tags.
* Use the `find_all` method to extract all the `<td>` tags from `rows[3]`.
* For the eight planet columns, extract the 8 (mean) distances (measured in km) from each celestial body to the sun; store in a list called `mean_distances` sorted as in the Wikipedia article.
* For each entry, ignore the second number expressing the distance in astronomical units.
* Be sure to transform the `str` data to `float` values.
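
The string-to-float conversion can be sketched independently of the scraping. The helper name `first_number` is hypothetical, and the sample cell text assumes the two numbers are separated by whitespace; the second value in the sample is a rounded figure included only for illustration.

```python
def first_number(cell_text):
    """Return the first whitespace-separated number in cell_text as a float."""
    first_token = cell_text.split()[0]          # text before the first whitespace
    return float(first_token.replace(",", ""))  # drop the thousands separators

sample_cell = "57,909,175 0.3871"  # assumed layout: km value, then AU value
print(first_number(sample_cell))
```

Splitting on any whitespace (rather than on a single space) also copes with cells in which the two numbers are separated by a newline.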

In [20]:
### GRADED
### Extract numbers from 4th row (Mean distance from the Sun)
###    Use the find_all method to extract all the td tags from rows[3]
###    and accumulate a list mean_distances of each planet obtained by
###    converting the first number from each tag to a float.
###
###
### YOUR CODE HERE
###
# Collect every <td> cell in the "mean distance" row.
all_td = rows[3].find_all("td")
mean_distances = list()
# Skip the first two cells, which do not hold planet data.
for td in all_td[2:]:
    # Keep only the text before the first space (the value in km).
    space_index = td.text.index(" ")
    distance = td.text[:space_index].strip().replace(",", "")
    mean_distances.append(float(distance))

### For verifying answer:
print('mean_distances: {}'.format(mean_distances))

mean_distances: [57909175.0, 108208930.0, 149597890.0, 227936640.0, 778412010.0, 1426725400.0, 2870972200.0, 4498252900.0]


In [21]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


[Back to top](#Assignment-Contents)

---

#### Question 11

Your final task in Question 11 is similar to that in Question 10. You will extract all the entries of `rows[4]` that match `<td>`.
* Use the `find_all` method to extract all the `<td>` tags from `rows[4]`; bind the result to `rows_4`.
* For the eight planet columns, extract the 8 equatorial radii (measured in km) of each celestial body orbiting the sun; store the result in a list called `equatorial_radii` in the same sequence as in `mean_distances` from Question 10.
* For each entry, ignore the second number, which expresses the radius relative to Earth's.
* Be sure to transform the `str` data to `float` values.
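
An alternative to slicing the cell text at the first space is to read the first text fragment of each cell through the `stripped_strings` generator. The sketch below applies this to a shortened copy of the radius row (three planet columns only, wrapped in a `<table>` so that it parses on its own). The `sample_*` names are hypothetical and `"html.parser"` stands in for `lxml`.

```python
from bs4 import BeautifulSoup

sample_row_html = """
<table><tr>
<td>Equatorial <a href="/wiki/Radius">radius</a></td>
<td>km<br/>:E</td>
<td>2,439.64<br/>0.3825</td>
<td>6,051.59<br/>0.9488</td>
<td>6,378.1<br/>1</td>
</tr></table>
"""
sample_row = BeautifulSoup(sample_row_html, "html.parser").find("tr")

sample_radii = []
for cell in sample_row.find_all("td")[2:]:    # skip the two label cells
    first_piece = next(cell.stripped_strings)  # text before the <br/>
    sample_radii.append(float(first_piece.replace(",", "")))
print(sample_radii)
```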

In [22]:
### GRADED
### Extract numbers from 5th row (Equatorial radius).
###    Use the find_all method to extract all the td tags from rows[4]
###    and accumulate a list equatorial_radii of each planet obtained by
###    converting the first number from each tag to a float.

###
###
### YOUR CODE HERE
###
print(rows[4].prettify())
# Collect every <td> cell in the "equatorial radius" row.
all_td = rows[4].find_all("td")
equatorial_radii = list()
# The first two cells hold the row label and the units, so skip them.
for td in all_td[2:]:
    # Keep only the text before the first space (the value in km).
    space_index = td.text.index(" ")
    radius = td.text[:space_index].strip().replace(",", "")
    equatorial_radii.append(float(radius))

### For verifying answer:
print('equatorial_radii: {}'.format(equatorial_radii))

<tr>
 <td style="background: #fffdd0;">
  Equatorial
  <a href="/wiki/Radius" title="Radius">
   radius
  </a>
 </td>
 <td style="background: #fffdd0;">
  km
  <br/>
  :E
  <sup class="reference" id="ref_Fnone">
   <a href="#endnote_Fnone">
    [f]
   </a>
  </sup>
 </td>
 <td style="background: #eeffee;">
  2,439.64
  <br/>
  0.3825
 </td>
 <td style="background: #eeffee;">
  6,051.59
  <br/>
  0.9488
 </td>
 <td style="background: #eeffee;">
  6,378.1
  <br/>
  1
 </td>
 <td style="background: #eeffee;">
  3,397.00
  <br/>
  0.53260
 </td>
 <td style="background: #fffaee;">
  71,492.68
  <br/>
  11.209
 </td>
 <td style="background: #fffaee;">
  60,267.14
  <br/>
  9.449
 </td>
 <td style="background: #ffe6ea;">
  25,557.25
  <br/>
  4.007
 </td>
 <td style="background: #ffe6ea;">
  24,766.36
  <br/>
  3.883
 </td>
</tr>

equatorial_radii: [2439.64, 6051.59, 6378.1, 3397.0, 71492.68, 60267.14, 25557.25, 24766.36]


In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###
