# Scraping an HTML page (loading and searching its contents)

# Local:  saved in a file on your computer
# Remote: somewhere on the web

To fully understand this notebook, open the example_html.html file in another tab, and open its source code in a third tab (or, even better, in your browser's View > Developer tools). The exact address of that file is printed a little further down.

For scraping we need a few different libraries, most notably BeautifulSoup. Let's import them first:

In [1]:
import os
from urllib.request import urlopen
from bs4 import BeautifulSoup

We can pass a web page's address as a string and open it. BeautifulSoup then converts the page into a BeautifulSoup object, which has many useful methods and attributes:

In [2]:
# website address
#page = 'http://www.uebs.ed.ac.uk'

# open the url and store the website
#website = urlopen(page)

# for now we use a local file (os.getcwd() gets the Current Working Directory, aka. the folder you're in)
file_url = "file:///"+os.getcwd()+"/example_html.html"
website_source_code = urlopen(file_url)


# in another tab: open the example_html.html file directly in your browser to see what it looks like
# then in your browser, right click and select 'view source', or open developer tools to see the source
print("Paste this url to your browser to see the demo website (copy the whole thing, together with the file:// part):")
print(file_url)

# convert the website's content; this needs a parser, in this case an HTML parser
soup = BeautifulSoup(website_source_code, 'html.parser')

Paste this url to your browser to see the demo website (copy the whole thing, together with the file:// part):
file:///C:\Users\reena\OneDrive\Documents\GitHub\web-and-social-network-analytics-notes\week1-web-scraping-and-analytics/example_html.html
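
Building the address by joining strings leaves Windows backslashes in it, as the printed url above shows. A minimal sketch of one alternative, assuming the standard library's `pathlib` is acceptable here: `Path.as_uri()` produces a `file://` url with forward slashes. The helper name `local_file_url` is made up for this illustration.

```python
from pathlib import Path


def local_file_url(filename):
    """Return a file:// url for a file in the current working directory."""
    # Path.cwd() is absolute, which as_uri() requires
    return (Path.cwd() / filename).as_uri()


file_url = local_file_url("example_html.html")
print(file_url)
```

The resulting string should be usable with `urlopen` in the same way as the hand-built one.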


In [3]:
# here's the complete html of the page, but it's easier to read if you open its source using the url above
print(soup)

<!DOCTYPE html>

<html>
<head>
<style>
.hipster {
	background-color:black;
	color:red;
	padding:22px;
}
</style>
<script type="text/javascript">
  var numberOfClicks = 0;
  function clickedButton()
  {
      numberOfClicks += 1;
    document.getElementById("clickableButton").text="GOOD JOB! You clicked me "+numberOfClicks+" times. If you reload the page I will go back to the original state :)"; 
  }
</script>
</head>
<body>
<h1 title="A header">Example for Media and Web Analytics</h1>
<p>Here you typically see some text.
Ocassionaly, an URL is present <a href="http://www.ed.ac.uk">UoE</a>
</p>
<h1 title="A header">Some other stuff</h1>
<h2>3 Rows and 3 Columns:</h2>
<table>
<tr>
<td>100</td>
<td>200</td>
<td>300</td>
</tr>
<tr id="middle_row">
<td>400</td>
<td>500</td>
<td>600</td>
</tr>
<tr>
<td>700</td>
<td>800</td>
<td>900</td>
</tr>
</table>
<a href="#" id="clickableButton" onclick="clickedButton()" target="none">CLICK ME!</a>
<div class="hipster">
<h2>A Dangerous-Loo

In [3]:

# .find_all retrieves all <h1> tags:
h1Tags = soup.find_all('h1')
for h1 in h1Tags:
    print('Complete tag code: ', h1)
    print("Just the text in the tag: ", h1.text)
    print()

Complete tag code:  <h1 title="A header">Example for Media and Web Analytics</h1>
Just the text in the tag:  Example for Media and Web Analytics

Complete tag code:  <h1 title="A header">Some other stuff</h1>
Just the text in the tag:  Some other stuff



In [5]:
# Added this one for practice. 
tdTags = soup.find_all('td')
for td in tdTags:
    print('td tags:', td.text )

td tags: 100
td tags: 200
td tags: 300
td tags: 400
td tags: 500
td tags: 600
td tags: 700
td tags: 800
td tags: 900


In [4]:
titleTags = soup.find_all('title')
for title in titleTags:
    print('Complete tag code: ', title)
    print("Just the text in the tag: ", title.text)
    
# nothing will be printed: there are no <title> </title> tags on the page
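
When nothing matches, `.find_all()` returns an empty list, so the loop body never runs and nothing is printed. A small sketch of making that case visible, using a made-up one-line page instead of example_html.html:

```python
from bs4 import BeautifulSoup

# Made-up stand-in page without a <title> tag
html = "<html><head></head><body><h1>Only a header</h1></body></html>"
demo_soup = BeautifulSoup(html, "html.parser")

title_tags = demo_soup.find_all("title")

# An empty result is falsy, so it can be checked before looping
if not title_tags:
    print("No <title> tags found")

for title in title_tags:
    print("Just the text in the tag: ", title.text)
```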

## Understanding the html is all about finding components you need:

### .find_all( ) will find all things that match the criteria, in a list
### .find( ) will find just the first item that matches the criteria

You can use it on the whole website, like `a_table = soup.find("table")`, or on an element you found before, like `first_row = a_table.find("tr")`

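You can search for tag types, ids or classes: `soup.find("h1")`, `soup.find(id="main_navigation")`, `soup.find(class_="warning_message")`. Note the trailing underscore in `class_`: `class` is a reserved word in Python, so it cannot be used as an argument name.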
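
These searches can be sketched on a small made-up snippet. The HTML below is invented for illustration; only the id and class names come from the text. Note that when nothing matches, `.find()` returns `None` rather than raising an error.

```python
from bs4 import BeautifulSoup

# Invented HTML; the id and class names mirror the ones mentioned in the text
html = """
<div id="main_navigation"><a href="/home">Home</a></div>
<p class="warning_message">First warning</p>
<p class="warning_message">Second warning</p>
<table><tr><td>1</td></tr><tr><td>2</td></tr></table>
"""
demo_soup = BeautifulSoup(html, "html.parser")

# Search by id and by class (class_ with a trailing underscore)
nav = demo_soup.find(id="main_navigation")
first_warning = demo_soup.find(class_="warning_message")
all_warnings = demo_soup.find_all("p", class_="warning_message")

# Search inside an element found earlier
a_table = demo_soup.find("table")
first_row = a_table.find("tr")      # only the first match
rows = a_table.find_all("tr")       # every match, as a list

# No match: find() gives None, so check before using .text
missing = demo_soup.find(id="does_not_exist")
if missing is None:
    print("No element with that id")

print(first_warning.text, len(all_warnings), len(rows))
```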

It is very common to fetch an element by its unique id:

In [6]:
middle_row = soup.find(id='middle_row')

print('Complete tag code: ', middle_row)
print("Just the text in the tag: ", middle_row.text)


Complete tag code:  <tr id="middle_row">
<td>400</td>
<td>500</td>
<td>600</td>
</tr>
Just the text in the tag:  
400
500
600



In [9]:
#Added for practice
click_button = soup.find(id='clickableButton')

print('Complete tag:', click_button)
print('\nJust the text:', click_button.text)

Complete tag: <a href="#" id="clickableButton" onclick="clickedButton()" target="none">CLICK ME!</a>

Just the text: CLICK ME!


## Find children:

When, as above, a tag contains children (tags inside it), you can extract them into a list.
For example, the table row above, `<tr></tr>`, contains three table data cells, `<td></td>`.

`.findChildren()` will give you a list with all tags inside a given tag, including tags nested deeper inside its children.

You can specify exactly which children you want, as with `.find()`. So you could use `.findChildren("tr")` or `.findChildren(class_="warning_message")`

In [10]:
middle_row = soup.find(id='middle_row')
cells_in_the_row = middle_row.findChildren()
for cell in cells_in_the_row:
    print('Complete tag code: ', cell, "Just the text in the tag: ", cell.text)


Complete tag code:  <td>400</td> Just the text in the tag:  400
Complete tag code:  <td>500</td> Just the text in the tag:  500
Complete tag code:  <td>600</td> Just the text in the tag:  600


In [14]:
#Added for practice
middle_row_ex= soup.find(id='middle_row')
cells_children= middle_row_ex.findChildren()
for count, each_cell in enumerate(cells_children):
    print(count, each_cell)

0 <td>400</td>
1 <td>500</td>
2 <td>600</td>


You can dive deeper into certain tags. For example, here we look for all divs with the (CSS) class called hipster:

In [15]:
class_elements = soup.find_all("div", {"class" : "hipster" })
for element in class_elements:
    print('whole tag:\n', str(element), '\n')
    #print('Just the text: ', line.text)
    print('Just the text: ', element.text)

whole tag:
 <div class="hipster">
<h2>A Dangerous-Looking Header</h2>
<p>
I look like a paragraph Kylo Ren could have written.
</p>
</div> 

Just the text:  
A Dangerous-Looking Header

I look like a paragraph Kylo Ren could have written.


whole tag:
 <div class="hipster">
<h2>Another Dangerous-Looking Header</h2>
<p>
This one is not as scary.
</p>
</div> 

Just the text:  
Another Dangerous-Looking Header

This one is not as scary.




## <span style='background :yellow' > <font color='red'> Check this one</font>  </span>
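 <font color='red'>The cell below prints the whole list twice: the loop runs once per matching tag, but each pass prints title_elements (the full list) instead of the loop variable each_ele.</font>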


In [16]:
#Added for practice. 
title_elements= soup.find_all("h1", {"title": "A header"})
for each_ele in title_elements:
    print("whole tag:", title_elements)

whole tag: [<h1 title="A header">Example for Media and Web Analytics</h1>, <h1 title="A header">Some other stuff</h1>]
whole tag: [<h1 title="A header">Example for Media and Web Analytics</h1>, <h1 title="A header">Some other stuff</h1>]
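
Printing the loop variable instead gives one tag per line. A small self-contained sketch, with the two headers retyped from example_html.html so that the block runs on its own:

```python
from bs4 import BeautifulSoup

# The two headers, retyped from example_html.html
html = """
<h1 title="A header">Example for Media and Web Analytics</h1>
<h1 title="A header">Some other stuff</h1>
"""
demo_soup = BeautifulSoup(html, "html.parser")

title_elements = demo_soup.find_all("h1", {"title": "A header"})
for each_ele in title_elements:
    # Print the loop variable, not the list it comes from
    print("whole tag:", each_ele)
```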


Getting all the elements out of the table:

In [17]:
# list all tables, since we only have 1, use the first in the list at index 0
my_table = soup.find_all('table')[0]
# or just use: my_table = soup.find('table')

# loop the rows and keep the row number
row_num = 0
for row in my_table.find_all('tr'):
    print("Row: "+str(row_num))
    row_num = row_num+1

    #loop the cells in the row
    for cell in row.find_all('td'):
        print("whole html:", str(cell)+" \tJust content: "+cell.text)
        
# if you'd like, try to change this code to use .findChildren( ) rather than .find_all('tr')

Row: 0
whole html: <td>100</td> 	Just content: 100
whole html: <td>200</td> 	Just content: 200
whole html: <td>300</td> 	Just content: 300
Row: 1
whole html: <td>400</td> 	Just content: 400
whole html: <td>500</td> 	Just content: 500
whole html: <td>600</td> 	Just content: 600
Row: 2
whole html: <td>700</td> 	Just content: 700
whole html: <td>800</td> 	Just content: 800
whole html: <td>900</td> 	Just content: 900
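
If the goal is to keep the numbers rather than print them, the same nested loop can collect them into a list of rows. A minimal sketch, assuming every cell holds an integer as in the demo table (the table is retyped here so the block runs on its own). It also uses `enumerate` in place of the manual `row_num` counter.

```python
from bs4 import BeautifulSoup

# The demo table, retyped from example_html.html
html = """
<table>
<tr><td>100</td><td>200</td><td>300</td></tr>
<tr id="middle_row"><td>400</td><td>500</td><td>600</td></tr>
<tr><td>700</td><td>800</td><td>900</td></tr>
</table>
"""
demo_soup = BeautifulSoup(html, "html.parser")
my_table = demo_soup.find("table")

table_data = []
for row_num, row in enumerate(my_table.find_all("tr")):
    # Convert the text of every cell in this row to an integer
    row_values = [int(cell.text) for cell in row.find_all("td")]
    print("Row:", row_num, row_values)
    table_data.append(row_values)
```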


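 <font color='red'>The row numbers in the cell below run up to 11 because .findChildren() also returns nested tags: the table yields 3 tr tags and 9 td tags, 12 in total. A td tag has no child tags, so the inner loop prints nothing for it.</font>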

In [25]:
new_table = soup.find_all('table')[0]
#first_child=new_table.findChildren()
for row_num, each_elements in enumerate(new_table.findChildren()):
    print(row_num)
    print() #print each children
    #second_child=each_elements.findChildren()
    for each_cell in each_elements.findChildren():
        print('Whole html:', each_cell, '\tJUst the text:', each_cell.text)

0
Whole html: <td>100</td> 	JUst the text: 100
Whole html: <td>200</td> 	JUst the text: 200
Whole html: <td>300</td> 	JUst the text: 300
1
2
3
4
Whole html: <td>400</td> 	JUst the text: 400
Whole html: <td>500</td> 	JUst the text: 500
Whole html: <td>600</td> 	JUst the text: 600
5
6
7
8
Whole html: <td>700</td> 	JUst the text: 700
Whole html: <td>800</td> 	JUst the text: 800
Whole html: <td>900</td> 	JUst the text: 900
9
10
11
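
By default the search covers every nested tag, not only the direct children. One way to restrict it is the `recursive=False` argument of `.find_all()`. The sketch below uses `.find_all()` rather than `.findChildren()`; as far as I know the two behave the same in BeautifulSoup 4, but treat that as an assumption. The table is retyped so the block runs on its own.

```python
from bs4 import BeautifulSoup

# The demo table, retyped from example_html.html
html = """
<table>
<tr><td>100</td><td>200</td><td>300</td></tr>
<tr id="middle_row"><td>400</td><td>500</td><td>600</td></tr>
<tr><td>700</td><td>800</td><td>900</td></tr>
</table>
"""
demo_soup = BeautifulSoup(html, "html.parser")
new_table = demo_soup.find("table")

# Default search: every nested tag (3 rows + 9 cells)
all_tags = new_table.find_all()

# recursive=False: only the tags directly inside the table (the rows)
direct_rows = new_table.find_all(recursive=False)

for row_num, row in enumerate(direct_rows):
    print(row_num)
    for each_cell in row.find_all(recursive=False):
        print('Whole html:', each_cell, '\tJust the text:', each_cell.text)
```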


### Minitask: Now attempt to scrape something from a real online website:

Use the code above to make a list of all the degrees available in the Business School of the University of Edinburgh.

1. You will need to get the source of the page the list is on and feed it into BeautifulSoup (see code above). Use this url instead of our demo file:// address: https://www.ed.ac.uk/studying/undergraduate/degrees/index.php?action=view&code=12
2. Get the html component that holds all the degrees. Use developer tools to identify what type of component it is (hint: ul stands for "unordered list"). Does this component have a class or an id? How would you get a component when you know its id? (hint: proxy_degreeList)
3. What type of tag are the actual names of the degrees in? (div, a, p, or something else) Hint: what tag surrounds the name of the course?
4. Grab the children of that type from the component holding all the names and, in a loop, extract only the text of each of them and print it.


I am posting the solution lower down, but do try to solve it by yourself first!

In [None]:
# copy-paste relevant parts of the code from above to start:

Only uncover the solutions once you have tried to complete the task:
    
    
<details><summary style='color:blue'>CLICK HERE TO SEE THE HINT 1.</summary>

1. You will need to get the source of the page the list is on and feed it into BeautifulSoup (see code above). Use this url instead of our demo file:// address: https://www.ed.ac.uk/studying/undergraduate/degrees/index.php?action=view&code=12

```
file_url = "https://www.ed.ac.uk/studying/undergraduate/degrees/index.php?action=view&code=12"
website_source_code = urlopen(file_url)
soup_degrees_website = BeautifulSoup(website_source_code, 'html.parser')
```
</details>

<details><summary style='color:blue'>CLICK HERE TO SEE THE HINT 2.</summary>

2. Get the html component that holds all the degrees. Use developer tools to identify what type of component it is (hint: ul stands for "unordered list"). Does this component have a class or an id? How would you get a component when you know its id? (hint: proxy_degreeList)

```
degrees = soup_degrees_website.find(id='proxy_degreeList')
```
</details>

<details><summary style='color:blue'>CLICK HERE TO SEE THE HINT 3.</summary>

3. What type of tag are the actual names of the degrees in? (div, a, p, or something else) Hint: what tag surrounds the name of the course?

```
for list_item in degrees.findChildren("a"):
```
</details>



<details><summary style='color:blue'>CLICK HERE TO SEE THE HINT 4.</summary>

4. Grab the children of that type from the component holding all the names and, in a loop, extract only the text of each of them and print it.

```
    print("Degree Name:", list_item.text)
```
</details>


In [28]:
file_url = "https://www.ed.ac.uk/studying/undergraduate/degrees/index.php?action=view&code=12"
website_source_code = urlopen(file_url)
soup_degrees_website = BeautifulSoup(website_source_code, 'html.parser')

In [11]:
degrees = soup_degrees_website.find(id='proxy_degreeList')
for list_item in degrees.findChildren("a"):
     print("Degree Name:", list_item.text)

Degree Name: Business and Economics (MA) NL11
Degree Name: Business and Geography (MA) NL17
Degree Name: Business and Law (MA) NM11
Degree Name: Business Management (MA) N100
Degree Name: Business with Decision Analytics (MA) NN12
Degree Name: Business with Enterprise and Innovation (MA) N1N2
Degree Name: Business with Human Resource Management (MA) N1N6
Degree Name: Business with Marketing (MA) N1N5
Degree Name: Business with Strategic Economics (MA) N1L1
Degree Name: Finance and Business (MA) NN13
Degree Name: International Business (MA) N120
Degree Name: International Business with Arabic (MA) N1T6
Degree Name: International Business with Chinese (MA) N1T1
Degree Name: International Business with French (MA) N1R1
Degree Name: International Business with German (MA) N1R2
Degree Name: International Business with Italian (MA) N1R3
Degree Name: International Business with Japanese (MA) N1T2
Degree Name: International Business with Russian (MA) N1R7
Degree Name: International Business wi

In [36]:
# Using li instead of a prints a blank line after each degree name
all_degrees = soup_degrees_website.find(id='proxy_degreeList')
for new_list_item in all_degrees.findChildren("li"):
     print("Degree Name:", new_list_item.text)

Degree Name: Business and Economics (MA) NL11

Degree Name: Business and Geography (MA) NL17

Degree Name: Business and Law (MA) NM11

Degree Name: Business Management (MA) N100

Degree Name: Business with Decision Analytics (MA) NN12

Degree Name: Business with Enterprise and Innovation (MA) N1N2

Degree Name: Business with Human Resource Management (MA) N1N6

Degree Name: Business with Marketing (MA) N1N5

Degree Name: Business with Strategic Economics (MA) N1L1

Degree Name: Finance and Business (MA) NN13

Degree Name: International Business (MA) N120

Degree Name: International Business with Arabic (MA) N1T6

Degree Name: International Business with Chinese (MA) N1T1

Degree Name: International Business with French (MA) N1R1

Degree Name: International Business with German (MA) N1R2

Degree Name: International Business with Italian (MA) N1R3

Degree Name: International Business with Japanese (MA) N1T2

Degree Name: International Business with Russian (MA) N1R7

Degree Name: Interna
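
The blank lines appear when the text of an li element carries extra whitespace around the name. One way to remove it is sketched below on a made-up fragment: the two degree names are taken from the output above, but the markup around them is an assumption about how the real page is structured. `get_text(strip=True)` strips the whitespace from each piece of text.

```python
from bs4 import BeautifulSoup

# Assumed structure: each li holds a link followed by a newline
html = """
<ul id="proxy_degreeList">
<li><a href="#">Business and Economics (MA) NL11</a>
</li>
<li><a href="#">Business and Geography (MA) NL17</a>
</li>
</ul>
"""
demo_soup = BeautifulSoup(html, "html.parser")
all_degrees = demo_soup.find(id="proxy_degreeList")

# .text keeps the trailing newline; get_text(strip=True) removes it
names = [li.get_text(strip=True) for li in all_degrees.find_all("li")]
for name in names:
    print("Degree Name:", name)
```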

New changes should be committed to git, otherwise they will be lost.