## Webscrape a basic HTML file

### Webscraping:

Extracting data from websites.

### Is it legal?

- Depends on how you use the data.
- If a publicly available site does not require users to agree to any Terms of Service before accessing the data, users are generally free to collect data from it with web crawlers.



### What will we do today?

Scrape Coursera courses to build a dataset with the following fields:
- Title
- Rating 
- Number of Enrollments
- Main Category 
- Sub Category


### What our solution looks like

### Outline:

- Understand webscraping through a basic site
- Develop a scraper for Coursera and create our dataset
- See the development workflow through GitHub

### Libraries used

#### Beautiful Soup

- A Python library for pulling data out of HTML and XML files.
- Works with a parser to navigate and search the data in these web pages.
- Searches by tag and by label (for example, a class attribute).

#### lxml

- The parser used in this project.
- An easy-to-use library for processing XML and HTML in Python.

#### Requests module

- Lets you send HTTP requests from Python.
- Each HTTP request returns a Response object with all the response data (content, encoding, status, etc.).
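
To make the Response object concrete without touching a real website, here is a minimal sketch that serves one page from a throwaway local server and fetches it with Requests. The handler class `DemoHandler`, the `PAGE` bytes, and the local address are all assumptions made up for this illustration; they are not part of the project.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests

# Hypothetical page content, used only for this demo.
PAGE = b"<html><head><title>Book Excerpt</title></head><body></body></html>"


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that answers every GET with the same page."""

    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(PAGE)))
        self.end_headers()
        self.wfile.write(PAGE)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


# Port 0 asks the operating system for any free port.
server = HTTPServer(("127.0.0.1", 0), DemoHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

try:
    with requests.Session() as session:
        session.trust_env = False  # ignore proxy settings for this local demo
        url = f"http://127.0.0.1:{server.server_port}/"
        response = session.get(url, timeout=5)
finally:
    server.shutdown()
    server.server_close()

print(response.status_code)  # HTTP status of the reply
print(response.encoding)     # encoding taken from the Content-Type header
print(response.text)         # body decoded to a string
```

Against a real site the same three attributes are the ones to look at first; `response.text` is what would typically be handed to a parser.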

### Import Beautiful Soup

In [5]:
from bs4 import BeautifulSoup 

#### Parse our HTML file
Open the HTML file and pass it to BeautifulSoup, together with the name of the parser (lxml), to parse our simple HTML file.

In [6]:
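# Read the file as UTF-8 (the page declares that charset) and parse it with lxml
with open('basic.html', encoding='utf-8') as html_file:
    soup = BeautifulSoup(html_file, 'lxml')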
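
BeautifulSoup also accepts markup as a plain string, which is handy when the HTML comes from somewhere other than a file. The sketch below assumes a shortened stand-in for the page (`html_text` is made up for illustration) and uses Python's bundled `html.parser` so it runs even when lxml is not installed.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for basic.html, invented for this example.
html_text = """<!DOCTYPE html>
<html>
<head><title>Book Excerpt</title></head>
<body><h1 id="site_title">Book Excerpts</h1></body>
</html>"""

# 'html.parser' ships with Python; swap in 'lxml' if it is installed.
soup_from_string = BeautifulSoup(html_text, "html.parser")

print(soup_from_string.title.text)  # Book Excerpt
print(soup_from_string.h1["id"])    # site_title
```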

See the contents of our soup :)

In [9]:
soup

<!DOCTYPE html>
<html>
<head>
<title>Book Excerpt</title>
<meta charset="utf-8"/>
</head>
<body>
<h1 id="site_title">Book Excerpts</h1>
<hr/>
<div class="excerpts">
<h2>Jane Eyre</h2>
<p>“Do you think I am an automaton? — a machine without feelings? and can bear to have my morsel of bread snatched from my lips, and my drop of living water dashed from my cup? Do you think, because I am poor, obscure, plain, and little, I am soulless and heartless? You think wrong! — I have as much soul as you — and full as much heart! And if God had gifted me with some beauty and much wealth, I should have made it as hard for you to leave me, as it is now for me to leave you. I am not talking to you now through the medium of custom, conventionalities, nor even of mortal flesh: it is my spirit that addresses your spirit; just as if both had passed through the grave, and we stood at God's feet, equal — as we are!”</p>
</div>
<hr/>
<div class="excerpts">
<h2>Homo Deus</h2>
<p>“This is the best reason to le

The prettify method indents the markup by nesting level. Looks fancy :D

In [12]:
print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  <title>
   Book Excerpt
  </title>
  <meta charset="utf-8"/>
 </head>
 <body>
  <h1 id="site_title">
   Book Excerpts
  </h1>
  <hr/>
  <div class="excerpts">
   <h2>
    Jane Eyre
   </h2>
   <p>
    “Do you think I am an automaton? — a machine without feelings? and can bear to have my morsel of bread snatched from my lips, and my drop of living water dashed from my cup? Do you think, because I am poor, obscure, plain, and little, I am soulless and heartless? You think wrong! — I have as much soul as you — and full as much heart! And if God had gifted me with some beauty and much wealth, I should have made it as hard for you to leave me, as it is now for me to leave you. I am not talking to you now through the medium of custom, conventionalities, nor even of mortal flesh: it is my spirit that addresses your spirit; just as if both had passed through the grave, and we stood at God's feet, equal — as we are!”
   </p>
  </div>
  <hr/>
  <div class="excerp
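
To see what prettify does on something small enough to read at a glance, here is a minimal sketch on a one-line snippet. The snippet and the variable names are assumptions for illustration, and `html.parser` stands in for lxml.

```python
from bs4 import BeautifulSoup

# One-line stand-in for part of the page, invented for this example.
tiny = BeautifulSoup("<div><h2>Jane Eyre</h2></div>", "html.parser")

# prettify() returns a string with each tag and each piece of text
# on its own line, indented by nesting depth.
pretty = tiny.prettify()
print(pretty)
```

Note that prettify adds whitespace around text, so it is meant for inspecting a page rather than for extracting values.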

### Extract through tags

#### Title tag

In [5]:
soup.title

<title>Book Excerpt</title>

In [6]:
soup.title.text

'Book Excerpt'

#### Div tag

In [13]:
soup.div

<div class="excerpts">
<h2>Jane Eyre</h2>
<p>“Do you think I am an automaton? — a machine without feelings? and can bear to have my morsel of bread snatched from my lips, and my drop of living water dashed from my cup? Do you think, because I am poor, obscure, plain, and little, I am soulless and heartless? You think wrong! — I have as much soul as you — and full as much heart! And if God had gifted me with some beauty and much wealth, I should have made it as hard for you to leave me, as it is now for me to leave you. I am not talking to you now through the medium of custom, conventionalities, nor even of mortal flesh: it is my spirit that addresses your spirit; just as if both had passed through the grave, and we stood at God's feet, equal — as we are!”</p>
</div>
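
Accessing a tag as an attribute, as above, gives the first matching tag in the document, and gives `None` when no such tag exists. A minimal sketch, assuming a two-excerpt stand-in for the page (`snippet` is made up, and `html.parser` is used in place of lxml):

```python
from bs4 import BeautifulSoup

# Stand-in markup with two excerpt blocks, invented for this example.
snippet = """
<div class="excerpts"><h2>Jane Eyre</h2></div>
<div class="excerpts"><h2>Homo Deus</h2></div>
"""
doc = BeautifulSoup(snippet, "html.parser")

# Attribute access stops at the first <div> it finds.
first_div = doc.div
print(first_div.h2.text)  # Jane Eyre

# A tag that is not in the document comes back as None.
missing = doc.table
print(missing)  # None

# So guard before chaining, otherwise .text raises AttributeError.
table_text = doc.table.text if doc.table is not None else ""
```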

### Customised search for tags:

Use the find method for a more customised search.

In [8]:
soup.find('div')

<div class="excerpts">
<h2>Jane Eyre</h2>
<p>“Do you think I am an automaton? — a machine without feelings? and can bear to have my morsel of bread snatched from my lips, and my drop of living water dashed from my cup? Do you think, because I am poor, obscure, plain, and little, I am soulless and heartless? You think wrong! — I have as much soul as you — and full as much heart! And if God had gifted me with some beauty and much wealth, I should have made it as hard for you to leave me, as it is now for me to leave you. I am not talking to you now through the medium of custom, conventionalities, nor even of mortal flesh: it is my spirit that addresses your spirit; just as if both had passed through the grave, and we stood at God's feet, equal — as we are!”</p>
</div>

This returns only the first div on the page.
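
When every match is needed rather than just the first, `find_all` is the counterpart of `find`. A minimal sketch on a stand-in page (the `snippet` string is an assumption for illustration, parsed with `html.parser` instead of lxml):

```python
from bs4 import BeautifulSoup

# Stand-in markup: two excerpts and a footer, invented for this example.
snippet = """
<div class="excerpts"><h2>Jane Eyre</h2></div>
<div class="excerpts"><h2>Homo Deus</h2></div>
<div class="footer"><p>Footer Information</p></div>
"""
doc = BeautifulSoup(snippet, "html.parser")

first = doc.find("div")                 # a single Tag, or None if nothing matches
all_divs = doc.find_all("div")          # a list-like result with every match
limited = doc.find_all("div", limit=2)  # stop after the first two matches

print(first.h2.text)  # Jane Eyre
print(len(all_divs))  # 3
print(len(limited))   # 2
```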

Find a specific div by its label (its class attribute):

In [10]:
soup.find('div', class_='footer')

<div class="footer">
<p>Footer Information</p>
</div>

The argument is written class_, with a trailing underscore, because class is a reserved keyword in Python.
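
Besides `class_`, there are a couple of other ways to ask for the same element. The sketch below compares them on a stand-in page; `snippet` and the variable names are assumptions for illustration, and `html.parser` replaces lxml.

```python
from bs4 import BeautifulSoup

# Stand-in markup, invented for this example.
snippet = """
<h1 id="site_title">Book Excerpts</h1>
<div class="excerpts"><h2>Jane Eyre</h2></div>
<div class="footer"><p>Footer Information</p></div>
"""
doc = BeautifulSoup(snippet, "html.parser")

# Three spellings that select the same footer div.
footer_a = doc.find("div", class_="footer")
footer_b = doc.find("div", attrs={"class": "footer"})
footer_c = doc.select_one("div.footer")  # CSS selector syntax

# Other attributes, such as id, need no underscore.
heading = doc.find(id="site_title")

print(footer_a.p.text)    # Footer Information
print(footer_a["class"])  # ['footer'] (class values come back as a list)
print(heading.text)       # Book Excerpts
```

The `attrs` dictionary form is useful when the attribute name is not a valid Python identifier, for example a `data-*` attribute.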

### Extract content from HTML file

In [20]:
excerpt = soup.find('div', class_='excerpts')
excerpt.h2.text

'Jane Eyre'

In [21]:
excerpt.p.text

"“Do you think I am an automaton? — a machine without feelings? and can bear to have my morsel of bread snatched from my lips, and my drop of living water dashed from my cup? Do you think, because I am poor, obscure, plain, and little, I am soulless and heartless? You think wrong! — I have as much soul as you — and full as much heart! And if God had gifted me with some beauty and much wealth, I should have made it as hard for you to leave me, as it is now for me to leave you. I am not talking to you now through the medium of custom, conventionalities, nor even of mortal flesh: it is my spirit that addresses your spirit; just as if both had passed through the grave, and we stood at God's feet, equal — as we are!”"

In [30]:
for excerpt in soup.find_all('div', class_='excerpts'):
    headline = excerpt.h2.text
    print('Headline:', headline)
    summary = excerpt.p.text
    print('Summary:', summary)

Headline: Jane Eyre
Summary: “Do you think I am an automaton? — a machine without feelings? and can bear to have my morsel of bread snatched from my lips, and my drop of living water dashed from my cup? Do you think, because I am poor, obscure, plain, and little, I am soulless and heartless? You think wrong! — I have as much soul as you — and full as much heart! And if God had gifted me with some beauty and much wealth, I should have made it as hard for you to leave me, as it is now for me to leave you. I am not talking to you now through the medium of custom, conventionalities, nor even of mortal flesh: it is my spirit that addresses your spirit; just as if both had passed through the grave, and we stood at God's feet, equal — as we are!”
Headline: Homo Deus
Summary: “This is the best reason to learn history: not in order to predict the future, but to free yourself of the past and imagine alternative destinies. Of course this is not total freedom – we cannot avoid being shaped by the 
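
Since the goal is a dataset rather than printed text, the same loop can collect each excerpt into a row. This is a minimal sketch under stated assumptions: the `snippet` markup, the placeholder summary text, and the column names `headline` and `summary` are invented for illustration, and `html.parser` replaces lxml. The second excerpt deliberately has no `<p>` tag to show why a guard helps.

```python
import csv
import io

from bs4 import BeautifulSoup

# Stand-in markup with placeholder text, invented for this example.
snippet = """
<div class="excerpts"><h2>Jane Eyre</h2><p>First excerpt.</p></div>
<div class="excerpts"><h2>Homo Deus</h2></div>
"""
doc = BeautifulSoup(snippet, "html.parser")

rows = []
for excerpt in doc.find_all("div", class_="excerpts"):
    heading = excerpt.find("h2")
    paragraph = excerpt.find("p")
    rows.append({
        # Fall back to an empty string when a tag is missing.
        "headline": heading.get_text(strip=True) if heading else "",
        "summary": paragraph.get_text(strip=True) if paragraph else "",
    })

# Write the rows as CSV into an in-memory buffer; a real run would
# open a file instead.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["headline", "summary"])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```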