# HTML and Web Scraping


The web provides us with more data than any of us can read and understand, so we often want to work with that information programmatically in order to make sense of it. Sometimes website creators provide that data as .csv (comma-separated values) files or through an API (Application Programming Interface). Other times, we need to collect text from the web ourselves.

This tutorial goes over how to use the Requests and Beautiful Soup Python packages to work with data from web pages. The Requests module lets you integrate your Python programs with web services, while the Beautiful Soup module is designed to make screen-scraping quick. Using the Python interactive console and these two libraries, we’ll go through how to collect a web page and work with the textual information available there.


## Installing Requests and Beautiful Soup

You may already have the libraries 
[Requests](https://pypi.org/project/requests/2.7.0/) and 
[Beautiful Soup](https://pypi.org/project/beautifulsoup4/) if you
installed Python via Anaconda.

You can open Anaconda Prompt and type 
```shell
conda list
```
to view the list of packages and versions installed in the active environment.  
See the [conda cheat sheet](https://docs.conda.io/projects/conda/en/4.6.0/_downloads/52a95608c49671267e40c689e0bc00ca/conda-cheatsheet.pdf) for
more details.

You can also open the Python prompt and type
```python
help("modules")
```
to view the names of the modules available to that Python installation. Note that this listing does not include version numbers.
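To check versions from inside Python, one option is the standard library's importlib.metadata module (available in Python 3.8 and later). The following is a minimal sketch; the helper name installed_version is made up for this example and is not part of either library.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Use the names the packages are published under on PyPI
for name in ("requests", "beautifulsoup4"):
    print(name, installed_version(name) or "not installed")
```

If either line prints "not installed", that package needs to be installed before continuing.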

If you have not installed the Python libraries Requests and Beautiful 
Soup, you can type 
```shell
pip install requests beautifulsoup4
```
in Anaconda Prompt or a terminal to install both libraries.

Now that both Beautiful Soup and Requests are installed, we can move 
on to understanding how to work with the libraries to scrape websites.

## Collecting a Web Page with Requests

Let's first import the Requests module so that we can collect a sample web page:

In [None]:
import requests

We’ll assign the URL (below) of the sample web page to the variable url: 

In [None]:
url = "https://www.yelp.com/search?find_desc=Restaurants&find_loc=Fullerton%2C+CA&ns=1"
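
The part of the URL after the ? is a query string made of key=value pairs. As a small illustration using only the standard library, the sketch below rebuilds the same URL from a dictionary. The names base_url, params and rebuilt_url are chosen just for this example.

```python
from urllib.parse import urlencode

base_url = "https://www.yelp.com/search"
params = {"find_desc": "Restaurants", "find_loc": "Fullerton, CA", "ns": 1}

# urlencode escapes the comma as %2C and the space as +
rebuilt_url = f"{base_url}?{urlencode(params)}"
print(rebuilt_url)
```

Requests can perform this encoding for you if you pass the dictionary as the params argument of requests.get().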

Next, we can assign the result of a request for that page to the 
variable page with the requests.get() method. We pass the page’s 
URL (that was assigned to the url variable) to that method.

In [None]:
page = requests.get(url)

The variable page is assigned a Response object:

In [None]:
page

The Response object above shows its status_code property in square 
brackets (in this case 200). This attribute can also be accessed explicitly:

In [None]:
page.status_code

The returned code of 200 tells us that the page downloaded successfully. 
Codes that begin with the number 2 generally indicate success, while 
codes that begin with a 4 or 5 indicate that an error occurred. You 
can read more about HTTP status codes from the [W3C’s Status Code 
Definitions](https://www.w3.org/Protocols/HTTP/1.1/draft-ietf-http-v11-spec-01#Status-Codes).
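
To make the status code classes concrete, here is a small sketch that labels a code by its first digit and looks up its standard phrase with the standard library's http.HTTPStatus. The function name describe_status is hypothetical; in practice you would pass it page.status_code.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short description of an HTTP status code."""
    if 200 <= code < 300:
        category = "success"
    elif 300 <= code < 400:
        category = "redirection"
    elif 400 <= code < 500:
        category = "client error"
    elif 500 <= code < 600:
        category = "server error"
    else:
        category = "informational or unknown"
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # The code is not one of the registered status codes
        phrase = "unrecognized code"
    return f"{code} {phrase} ({category})"


print(describe_status(200))
print(describe_status(404))
```

Requests also offers page.raise_for_status(), which raises an exception for 4xx and 5xx responses instead of requiring a manual check.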

In order to work with web data, we’re going to want to access the 
text-based content of web files. We can read the content of the 
server’s response with page.text (or page.content if we would like 
to access the response in bytes).

In [None]:
page.text
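
The difference between page.text and page.content is the difference between Python's str and bytes types. The sketch below imitates it with a hard-coded byte string, an assumption standing in for a real response body, so that it runs without a network connection.

```python
# Stand-in for page.content: the raw bytes a server might send
raw_bytes = "<p>Café menu</p>".encode("utf-8")

# Stand-in for page.text: the same bytes decoded into a str
decoded_text = raw_bytes.decode("utf-8")

# The byte string is longer because é takes two bytes in UTF-8
print(type(raw_bytes).__name__, len(raw_bytes))
print(type(decoded_text).__name__, len(decoded_text))
```

Requests chooses the encoding used for page.text from the response headers and exposes its choice as page.encoding.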

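You can also type 
view-source:https://www.yelp.com/search?find_desc=Restaurants&find_loc=fullerton%2C%20CA
in the browser's address bar to view the source code of the website. 
Because the full response is long, the next cell shows only its first 100 characters.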

In [None]:
page.text[:100]

### Another Example Using Requests

In [None]:
url2 = "https://dailytitan.com/2019/03/csuf-baseball-to-open-big-west-play-against-no-17-ucsb/"
page2 = requests.get(url2)
page2.text

Here we see that the full text of the page was printed out, with all 
of its HTML tags. However, it is difficult to read because there is 
not much spacing.

In the next section, we can leverage the Beautiful Soup module to 
work with this textual data in a more human-friendly manner.

## Stepping Through a Page with Beautiful Soup

The Beautiful Soup library creates a parse tree from parsed HTML and XML 
documents (including documents with non-closed tags or tag soup and 
other malformed markup). This functionality will make the web page text 
more readable than what we saw coming from the Requests module.

To start, we’ll import Beautiful Soup into the Python console:

In [None]:
from bs4 import BeautifulSoup

Next, we’ll pass page.text to the BeautifulSoup constructor, which runs 
Python’s built-in html.parser over the HTML and returns a BeautifulSoup 
object, that is, a parse tree of the page. The constructed object 
represents the above website document as a nested data structure. 
This is assigned to the variable soup.

In [None]:
soup = BeautifulSoup(page.text, 'html.parser')

To show the contents of the page on the terminal, we can print it with 
the prettify() method in order to turn the Beautiful Soup parse tree 
into a nicely formatted Unicode string.

In [None]:
print(soup.prettify())

This renders each HTML tag on its own line. In the output above, 
we can see that there is one tag per line and that the tags are 
indented to show their nesting, which reflects the tree structure that Beautiful Soup builds. 
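
Because the output for a real page is long, it can help to try prettify() on a tiny document first. The sketch below uses a hand-written HTML string (sample_html is an assumption for illustration, not content from the page above).

```python
from bs4 import BeautifulSoup

# A tiny hand-written document used in place of a downloaded page
sample_html = "<html><body><div><p>First</p><p>Second</p></div></body></html>"
sample_soup = BeautifulSoup(sample_html, "html.parser")

# prettify() returns a string with one tag per line, indented by depth
pretty = sample_soup.prettify()
print(pretty)
```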

## Finding Instances of a Tag

We can extract every instance of a given tag from a page by using 
Beautiful Soup’s **find_all** method, which returns all matching tags 
within a document.

In [None]:
soup.find_all('p')

Running that method on our object returns each &lt;p&gt; tag 
along with its text and any tags nested 
inside it, which here includes the line break 
tags &lt;br/&gt;. You will notice in the output above that the data is 
contained in square brackets [ ]. This means the result behaves like 
a Python list. 
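
The list-like behavior is easier to see on a small example. The sketch below runs find_all on a hand-written snippet; sample_html and its contents are invented for illustration and are not taken from the page above.

```python
from bs4 import BeautifulSoup

# Hand-written sample markup, not real page content
sample_html = """
<div>
  <p>Pizza<br/>Italian</p>
  <p>Tacos<br/>Mexican</p>
  <p>Inexpensive</p>
</div>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

paragraphs = sample_soup.find_all("p")
print(len(paragraphs))               # how many <p> tags were found
print(isinstance(paragraphs, list))  # the result is a list subclass
print(paragraphs[0])                 # first match, with its nested <br/>
```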

Because it is a list, we can access a particular item within 
it by index (for example, the third &lt;p&gt; element), and use 
the **get_text()** method 
to extract all the text from inside that tag:

In [None]:
soup.find_all('p')[2].get_text()

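The output that we receive is Inexpensive, which is the text of the 
third &lt;p&gt; element in this case. The same method works for other 
tags; for example, the next cell collects every &lt;a&gt; (link) tag.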
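
As a hedged illustration of how get_text() treats nested tags, the sketch below uses a hand-written snippet (sample_html is an assumption, not real page content). It also shows the optional separator and strip arguments, which help when a tag contains line breaks.

```python
from bs4 import BeautifulSoup

# Hand-written sample markup with a <br/> inside the third paragraph
sample_html = "<div><p>One</p><p>Two</p><p>Line one<br/>Line two</p></div>"
sample_soup = BeautifulSoup(sample_html, "html.parser")

third = sample_soup.find_all("p")[2]

# Default: text pieces are joined with nothing between them
plain = third.get_text()

# With a separator: pieces are stripped and joined with a space
spaced = third.get_text(separator=" ", strip=True)

print(plain)
print(spaced)
```

Indexing with [2] raises an IndexError if the page has fewer than three matching tags, so it is worth checking the length of the list first on pages you do not control.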

In [None]:
soup.find_all('a')
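
Often the goal with &lt;a&gt; tags is the link target rather than the whole tag. The following sketch, which assumes a hand-written snippet and a hypothetical base address, reads each href attribute and resolves relative links with the standard library's urljoin.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

base = "https://www.example.com/search"  # hypothetical page address

# Hand-written sample markup, not real page content
sample_html = """
<a href="/biz/first-place">First Place</a>
<a href="https://www.example.com/about">About</a>
<a>No link target</a>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

links = []
# href=True keeps only the <a> tags that actually have an href attribute
for anchor in sample_soup.find_all("a", href=True):
    links.append((anchor.get_text(strip=True), urljoin(base, anchor["href"])))

for text, href in links:
    print(text, "->", href)
```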

Below is the complete Python code.

In [None]:
## Complete Python Code
import requests
from bs4 import BeautifulSoup

url = "https://www.yelp.com/search?find_desc=Restaurants&find_loc=Fullerton%2C+CA&ns=1"
page = requests.get(url)

# page -- <Response [200]>

print("\n", page.status_code)  ## 200

soup = BeautifulSoup(page.text, 'html.parser')
print("\n", soup.prettify())

for link in soup.find_all('a'):
    print("\n", link)
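
The script above assumes the request always succeeds. As a more defensive variation, the sketch below wraps the two steps in functions, adds a timeout, and checks for errors. The function names and the User-Agent string are hypothetical choices for this example, and some sites may still refuse automated requests.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url, timeout=10):
    """Download a page and return its text, or None if the request fails."""
    headers = {"User-Agent": "tutorial-scraper/0.1"}  # hypothetical identifier
    try:
        response = requests.get(url, headers=headers, timeout=timeout)
        response.raise_for_status()  # raises for 4xx and 5xx responses
    except requests.exceptions.RequestException as error:
        print("Request failed:", error)
        return None
    return response.text


def extract_link_targets(html):
    """Return the href value of every <a> tag that has one."""
    soup = BeautifulSoup(html, "html.parser")
    return [anchor["href"] for anchor in soup.find_all("a", href=True)]


# Example usage (requires network access):
# html = fetch_html("https://www.yelp.com/search?find_desc=Restaurants&find_loc=Fullerton%2C+CA&ns=1")
# if html is not None:
#     print(extract_link_targets(html))
```

Separating the download from the parsing makes it possible to test the parsing step on a saved or hand-written HTML string without contacting the site.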