# Data Analytics

## BeautifulSoup and Web Scraping

The incredible amount of data on the Internet is a rich resource for any field of research or personal interest.

To effectively harvest that data, you’ll need to become skilled at web scraping.

### What is Web Scraping?

Web scraping is when a program or script pretends to be a browser: it retrieves web pages, examines them, extracts information, and then moves on to more web pages.

Search engines scrape web pages too; we call this “spidering the web” or “web crawling”.

> Note: Web scraping can be illegal in some circumstances. It may also cause your IP address to be blocked by a website, sometimes permanently.

Some websites don’t like it when automatic scrapers gather their data, while others don’t mind. 

If you’re scraping a page respectfully for educational purposes, then you’re unlikely to have any problems. Still, it’s a good idea to do some research and make sure that you’re not violating any 'Terms of Service' before you start.
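
Alongside the Terms of Service, many sites publish a `robots.txt` file that states which paths automated clients may fetch. The sketch below uses Python's standard `urllib.robotparser` module to check a URL against such rules. The rules text, the `my-scraper` user agent name, the helper name `is_allowed` and the `example.com` URLs are all made up for illustration; a real check would load the file from the site itself, for example with the parser's `set_url()` and `read()` methods.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example runs offline.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(SAMPLE_ROBOTS_TXT, "my-scraper", "https://example.com/quotes"))
print(is_allowed(SAMPLE_ROBOTS_TXT, "my-scraper", "https://example.com/private/data"))
```

A passing `robots.txt` check does not replace reading the Terms of Service; it only covers what the site has declared for automated clients.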

### Scraping Web Pages

The Python libraries `requests`, `html5lib` and `BeautifulSoup` are powerful tools, well suited to the job of web scraping.

Python `requests` is a module that allows you to send HTTP requests using Python. The HTTP request returns a 'Response Object' with all the response data (content, encoding, status, etc.).
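
As a rough illustration of the response data mentioned above, the helper below collects a few commonly used attributes of a response. The function name `summarize_response` is invented for this example; the attributes it reads (`status_code`, `ok`, `encoding`, `headers`, `content`, `text`) are part of the `requests` API.

```python
import requests


def summarize_response(response):
    """Collect a few commonly used fields of a requests.Response."""
    return {
        "status": response.status_code,      # e.g. 200 for success
        "ok": response.ok,                   # True when the status is below 400
        "encoding": response.encoding,       # encoding used to decode .text
        "content_type": response.headers.get("Content-Type"),
        "num_bytes": len(response.content),  # .content is raw bytes
        "preview": response.text[:60],       # .text is a decoded str
    }
```

With network access, something like `summarize_response(requests.get(URL, timeout=10))` would summarize a live page, where `URL` is whatever address you are fetching.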

Once we have accessed the HTML content in a 'Response Object', we need to parse the data. This requires a parser that can build a nested (tree) structure from the HTML. Many HTML parser libraries are available; `html5lib` is notable because it parses HTML the same way a web browser does.

Python `BeautifulSoup` is a library (from https://www.crummy.com/software/BeautifulSoup/) for pulling data out of `HTML` and `XML` files. This document covers `BeautifulSoup` version 4, which works with Python 3.
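
To give a feel for what "pulling data out" looks like, here is a minimal sketch that parses a small, made-up HTML string rather than a live page. It uses Python's built-in `html.parser` so that it runs with only `beautifulsoup4` installed; the markup and the link URLs are placeholders.

```python
from bs4 import BeautifulSoup

# Made-up HTML used only for this illustration.
SAMPLE_HTML = """
<html>
  <head><title>Sample Page</title></head>
  <body>
    <h1>Links</h1>
    <a href="https://example.com/one">First</a>
    <a href="https://example.com/two">Second</a>
  </body>
</html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

page_title = soup.title.string  # the text inside <title>
links = [(a.get_text(), a["href"]) for a in soup.find_all("a")]  # (label, target) pairs

print(page_title)
print(links)
```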

### Installing `BeautifulSoup`, `html5lib` and `requests`

To install these libraries, use `pip`.

```shell
python -m pip install requests
python -m pip install html5lib
python -m pip install beautifulsoup4
```
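
A quick way to confirm that the installation worked is to ask Python which versions it can see. This sketch uses the standard `importlib.metadata` module (available from Python 3.8); the helper name `installed_versions` is made up for this example.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions(["requests", "html5lib", "beautifulsoup4"]).items():
    print(f"{name}: {ver or 'not installed'}")
```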

### Steps involved in web scraping:

1. Send an HTTP request to the URL of the webpage you want to access. The server responds to the request by returning the HTML content of the webpage. For this task, we will use `requests`.

2. Once we have accessed the HTML content, we are left with the task of parsing the data. Since most HTML data is nested, we cannot extract data simply through string processing. We need a parser that can create a nested (tree) structure from the HTML data. For this task, we will use `html5lib`.

3. Now, we need to navigate and search the parse tree that we created, i.e. tree traversal. For this task, we will use `BeautifulSoup`.



#### **Step 1.) - Accessing the HTML content from a webpage**

Import the `requests` library. Then, specify the URL of the webpage you want to scrape.
Send an HTTP request to the specified URL and save the response from the server in a response object called `response`.
Now, print `response.content` to get the raw HTML content of the webpage. It is of `bytes` type; `response.text` gives the same content decoded into a string.

```python
# import the libraries
import requests

# request the URL of the webpage you want to access
URL = "http://www.dr-chuck.com/page1.htm"
# URL = "https://www.geeksforgeeks.org/data-structures/"
response = requests.get(URL)

# print 'response.content' to get the raw HTML content of the webpage. It's of 'bytes' type
print(response.content)
```
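
In practice a request can fail or hang, so it can help to wrap Step 1 in a small function. The following is one possible sketch, not the only way to do it: `fetch_html` is a name invented here, and the 10-second timeout is an arbitrary choice. The `timeout=` argument and `raise_for_status()` are standard `requests` features.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the decoded HTML of url, raising an exception on failure."""
    response = requests.get(url, timeout=timeout)  # give up after `timeout` seconds
    response.raise_for_status()                    # raise HTTPError on 4xx/5xx statuses
    return response.text                           # str; response.content is bytes
```

Calling `fetch_html("http://www.dr-chuck.com/page1.htm")` would return the page as a string, assuming the site is reachable.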


#### **Step 2.) - Parsing the HTML content**

A really nice thing about the BeautifulSoup library is that it is built on top of HTML parsing libraries such as html5lib, lxml, and html.parser. So creating the BeautifulSoup object and specifying the parser library can be done at the same time.

We create a BeautifulSoup object by passing two arguments:
- **response.content** : It is the raw HTML content.
- **html5lib** : Specifying the HTML parser we want to use.

Printing `soup.prettify()` gives the visual representation of the parse tree created from the raw HTML content. 
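
Because the parser is just a name passed to the constructor, it is easy to fall back to another one. The sketch below tries `html5lib` and, if that package is not installed, falls back to Python's built-in `html.parser`. The helper name `make_soup` and the sample markup are assumptions made for this example; `FeatureNotFound` is the exception Beautiful Soup raises when a requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred="html5lib"):
    """Parse markup with the preferred parser, falling back to html.parser."""
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # The preferred parser is not installed; use the built-in one instead.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p>Hello <b>world")  # deliberately unclosed tags
print(soup.prettify())                  # shows the repaired, nested tree
print(soup.p.get_text())
```

Different parsers may repair broken markup in slightly different ways, so the printed tree can vary with the parser that ends up being used.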


```python
# import the libraries
import requests
from bs4 import BeautifulSoup as bsoup

# request the URL of the webpage you want to access
URL = "http://www.values.com/inspirational-quotes"
response = requests.get(URL)

# parse the response into a parse tree and print it in a readable form
soup = bsoup(response.content, 'html5lib')
print(soup.prettify())
```

#### **Step 3.) - Searching and navigating through the parse tree**

Now, we would like to extract some useful data from the HTML content. The soup object contains all the data in the nested structure which could be programmatically extracted. 

In our example, we are scraping a webpage consisting of some quotes. So, we would like to create a program to save those quotes (and all relevant information about them). 

The code below does that:

```python
# Program to scrape a website and save its quotes
import csv
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup as bsoup

# request the URL of the webpage you want to access
URL = "http://www.values.com/inspirational-quotes"
response = requests.get(URL)

# parse the response into a parse tree
soup = bsoup(response.content, 'html5lib')
# print(soup.prettify())

quotes = []  # a list to store quotes

# search the parse tree for the HTML container that holds the quotes
table = soup.find('div', attrs={'id': 'all_quotes'})
# print(table)

# iterate over the div containers whose class is 'quote' to find each quote's info
for row in table.find_all('div', attrs={'class': 'quote'}):
    quote = {}  # create a dictionary for each quote
    # resolve the (possibly relative) link against the page URL
    quote['url'] = urljoin(URL, row.a['href'])
    quote['lines'] = row.img['alt'].split(" #")[0]
    quote['theme'] = row.h5.a.text
    quote['img'] = row.img['src']
    quotes.append(quote)  # attach each quote to the list

print(quotes)

# save the quotes list of dictionaries into a CSV file
filename = 'inspirational_quotes.csv'
with open(filename, 'w', newline='') as f:
    w = csv.DictWriter(f, ['theme', 'url', 'img', 'lines'])
    w.writeheader()
    for quote in quotes:
        w.writerow(quote)
```

#### Let's analyze the code

- First search through the HTML content of the webpage; print it using `soup.prettify()` method and try to find a pattern or a way to navigate to the quotes.

- The quotes are inside a `div` container whose `id` is `all_quotes`. So, we find that `div` element by using the `find()` method:

            table = soup.find('div', attrs = {'id':'all_quotes'}) 

- The first argument is the HTML tag we want (`div`); the second argument is a dictionary that specifies the additional attributes the tag must have.

- The `find()` method returns the first matching element. We can try to print `table.prettify()` to get a sense of what this piece of code does.

- Now, in the table element, one can notice that each quote is inside a `div` container whose class is `quote`. So, we iterate through each `div` container with that class.

- To do this, we use the `find_all()` method (also available under the older name `findAll()`), which takes the same kind of arguments as `find()` but returns a list of all matching elements. Each quote is then visited in turn using a variable called `row`.

- Using the row variable, we find info snippets on each quote to populate a quote dictionary, which is added to the quotes list.

- Finally, save the quotes list of dictionaries into a CSV file.
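
The walkthrough above can also be tried without network access. The sketch below runs the same `find()` / `find_all()` logic on a small, invented HTML fragment that imitates the structure described in the bullets (a `div` with `id="all_quotes"` holding `div` elements of class `quote`). The markup, the `example.com` URLs and the helper name `extract_quotes` are assumptions made for illustration and are not taken from the real page. It uses the built-in `html.parser` and writes the CSV to an in-memory buffer instead of a file.

```python
import csv
import io
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://example.com/inspirational-quotes"  # placeholder page URL

# Invented markup that mimics the structure described in the walkthrough.
SAMPLE_HTML = """
<div id="all_quotes">
  <div class="quote">
    <a href="/inspirational-quotes/1-be-kind"><img alt="Be kind whenever possible. #kindness" src="https://example.com/img/1.jpg"></a>
    <h5><a href="/themes/kindness">KINDNESS</a></h5>
  </div>
  <div class="quote">
    <a href="/inspirational-quotes/2-keep-going"><img alt="Keep going. #persistence" src="https://example.com/img/2.jpg"></a>
    <h5><a href="/themes/persistence">PERSISTENCE</a></h5>
  </div>
</div>
"""


def extract_quotes(html, base_url):
    """Return one dictionary per quote found inside the 'all_quotes' container."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("div", attrs={"id": "all_quotes"})
    if table is None:  # container missing, e.g. the page layout changed
        return []
    quotes = []
    for row in table.find_all("div", attrs={"class": "quote"}):
        quotes.append({
            "theme": row.h5.a.get_text(strip=True),
            "url": urljoin(base_url, row.a["href"]),  # make the link absolute
            "img": row.img["src"],
            "lines": row.img["alt"].split(" #")[0],   # drop the trailing tag
        })
    return quotes


quotes = extract_quotes(SAMPLE_HTML, BASE_URL)

# Write the rows as CSV into an in-memory buffer and show the result.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["theme", "url", "img", "lines"])
writer.writeheader()
writer.writerows(quotes)
print(buffer.getvalue())
```

Keeping the extraction in a function that takes an HTML string makes it possible to test the parsing logic on saved or hand-written pages before pointing it at a live site.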

### A Note on Dynamic Websites

In this section we learned how to scrape a static website. 

Static sites are straightforward to work with because the server sends you an HTML page that already contains all the page information in the response. You can parse that HTML response and immediately begin to pick out the relevant data.

On the other hand, with a dynamic website, the server might not send back any HTML at all. Instead, you could receive JavaScript code as a response. This code will look completely different from what you saw when you inspected the page with your browser’s developer tools.

Many modern web applications are designed to provide their functionality in collaboration with the clients’ browsers. Instead of sending HTML pages, these apps send JavaScript code that instructs your browser to create the desired HTML. 

Web apps deliver dynamic content in this way to offload work from the server to the clients’ machines as well as to avoid page reloads and improve the overall user experience.

When we use `requests`, we only receive what the server sends back. In the case of a dynamic website, you’ll end up with some JavaScript code instead of HTML. 
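
This effect can be reproduced offline. In the sketch below, an invented page contains an empty `div` and a script that would fill it in a browser. Parsing the raw markup with Beautiful Soup (here with the built-in `html.parser`) shows only what the server sent: the container is still empty and the script is plain text. The markup and variable names are made up for this example.

```python
from bs4 import BeautifulSoup

# Invented page: the quote only appears after a browser runs the script.
DYNAMIC_HTML = """
<html><body>
  <div id="quotes"></div>
  <script>
    document.getElementById("quotes").innerHTML =
      "<p class='quote'>Stay curious.</p>";
  </script>
</body></html>
"""

soup = BeautifulSoup(DYNAMIC_HTML, "html.parser")

container_text = soup.find("div", id="quotes").get_text()  # nothing rendered yet
rendered_quotes = soup.find_all("p", class_="quote")       # script text is not parsed as HTML
script_source = soup.script.string                         # the JavaScript, as plain text

print(repr(container_text))
print(rendered_quotes)
print("innerHTML" in script_source)
```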

The only way to go from the JavaScript code you received to the content that you’re interested in is to execute the code, just like your browser does. The `requests` library can’t do that for you, but there are other solutions that can.

For example, `requests-html` is a project created by the author of the `requests` library that allows you to render JavaScript using syntax that’s similar to the syntax in requests. It also includes capabilities for parsing the data by using Beautiful Soup under the hood.