# Introduction to Requests and Its Web Data Search Application

The Requests library is a valuable addition to your data science toolkit. It makes interacting with web services straightforward, and it avoids most of the difficulties of urllib/urllib2. It’s a simple yet powerful HTTP library that you can use to access web pages and interact with them.  

We will use the following two URLs as examples:  
- https://api.github.com
- https://assets.digitalocean.com/articles/eng_python/beautiful-soup/mockturtle.html

## Part 1: Requests Introduction

When we make a request, the response from the API comes with a response code which tells us whether our request was successful. Response codes are important because they immediately tell us if something went wrong.  

In the following sections, we will learn how to:

- Make requests using the most common HTTP methods  
- Customize your requests’ headers and data, using the query string and message body  
- Inspect data from your requests and responses  
- Make authenticated requests  
- Configure your requests to help prevent your application from backing up or slowing down

First, install the Requests library from a terminal using the following command:

pip install requests

Once Requests is installed, you can use it in your application. Import it like this:

In [1]:
import requests

### 1.1 Get Requests

HTTP methods such as GET and POST determine which action you’re trying to perform when making an HTTP request.  
One of the most common HTTP methods is GET. The GET method indicates that you’re trying to get or retrieve data from a specified resource. To make a GET request, invoke requests.get().

In [3]:
response = requests.get('https://api.github.com')
contents = response.content

In this example, we have captured the return value of get(), which is an instance of Response, and stored it in a variable called response. The raw content of the response is stored in a variable named contents.  

The status code tells you the status of the request. For example, a 200 OK status means the request was successful, whereas a 404 NOT FOUND status means the resource you were looking for was not found.  

By accessing .status_code, you can see the status code that the server returned:

In [4]:
response.status_code

200

Here are some codes that are relevant to GET requests:  

- 200: Everything went okay, and the result has been returned (if any).  
- 301: The server is redirecting you to a different endpoint. This can happen when a company switches domain names, or an endpoint name is changed.  
- 400: The server thinks you made a bad request. This can happen when you don’t send along the right data, among other things.  
- 401: The server thinks you’re not authenticated. Many APIs require login credentials, so this happens when you don’t send the right credentials to access an API.  
- 403: The resource you’re trying to access is forbidden: you don’t have the right permissions to see it.  
- 404: The resource you tried to access wasn’t found on the server.  
- 503: The server is not ready to handle the request.  
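
The codes listed above fall into families by their first digit. As a minimal sketch, the helper below uses only the standard library's http.HTTPStatus to look up the reason phrase for a code; the function name describe_status is an assumption made for illustration and is not part of Requests.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a readable summary of an HTTP status code."""
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        phrase = 'Unknown'  # code is not known to the standard library

    # The first digit of the code identifies its family.
    if 200 <= code < 300:
        family = 'success'
    elif 300 <= code < 400:
        family = 'redirection'
    elif 400 <= code < 500:
        family = 'client error'
    elif 500 <= code < 600:
        family = 'server error'
    else:
        family = 'other'
    return f'{code} {phrase} ({family})'


for code in (200, 301, 404, 503):
    print(describe_status(code))
```

The same helper could be called with response.status_code after any request.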

We can check the response status with the following code:  

In [18]:
from requests.exceptions import HTTPError

for url in ['https://api.github.com', 'https://api.github.com/unknown']:
    try:
        response = requests.get(url)

        # If the response was successful, no Exception will be raised
        response.raise_for_status()
    except HTTPError as http_err:
        print(f'HTTP error occurred: {http_err}') 
    except Exception as err:
        print(f'Other error occurred: {err}') 
    else:
        print('Success!')

Success!
HTTP error occurred: 404 Client Error: Not Found for url: https://api.github.com/unknown


### 1.2 Headers

The response headers can give you useful information, such as the content type of the response payload and a time limit on how long to cache the response. To view these headers, access .headers on the response, which returns a dictionary-like object.  
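
Header names are case-insensitive, and the object behind .headers reflects that. The sketch below builds such an object by hand with requests.structures.CaseInsensitiveDict instead of calling a server; the header values are made up for illustration.

```python
from requests.structures import CaseInsensitiveDict

# Hand-built stand-in for response.headers; the values are illustrative only.
headers = CaseInsensitiveDict({
    'Content-Type': 'application/json; charset=utf-8',
    'Cache-Control': 'public, max-age=60',
})

# Any capitalization of the header name finds the same value.
print(headers['content-type'])
print(headers['CONTENT-TYPE'])

# .get() works as on a normal dictionary for headers that may be absent.
print(headers.get('X-Missing', 'not set'))
```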

To customize headers, we can pass a dictionary of HTTP headers to get() using the headers parameter. For example, we can ask GitHub's repository search to highlight matching search terms in the results by specifying the text-match media type in the Accept header:

In [25]:
response = requests.get(
    'https://api.github.com/search/repositories',
    params={'q': 'requests+language:python'},
    headers={'Accept': 'application/vnd.github.v3.text-match+json'},
)

json_response = response.json()
repository = json_response['items'][0]
print(f'Text matches: {repository["text_matches"]}')

Text matches: [{'object_url': 'https://api.github.com/repositories/4290214', 'object_type': 'Repository', 'property': 'description', 'fragment': 'Requests + Gevent = <3', 'matches': [{'text': 'Requests', 'indices': [0, 8]}]}]


The Accept header tells the server what content types your application can handle. In this case, since you’re expecting the matching search terms to be highlighted, you’re using the header value application/vnd.github.v3.text-match+json, which is a proprietary GitHub Accept header where the content is a special JSON format.
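
To inspect what the params and headers arguments produce without contacting GitHub, one option is to build the request and prepare it instead of sending it. This is a sketch using requests.Request and its prepare() method; the variable names are arbitrary.

```python
import requests

request = requests.Request(
    'GET',
    'https://api.github.com/search/repositories',
    params={'q': 'requests+language:python'},
    headers={'Accept': 'application/vnd.github.v3.text-match+json'},
)

# prepare() builds the final request without sending anything.
prepared = request.prepare()

print(prepared.method)
print(prepared.url)                # the query string is appended and URL-encoded
print(prepared.headers['Accept'])  # the custom header is attached as given
```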

### 1.3 Authentication

Authentication helps a service understand who you are. Typically, you provide your credentials to a server by passing data through the Authorization header or a custom header defined by the service. All the request functions you’ve seen to this point provide a parameter called auth, which allows you to pass your credentials.

In [26]:
from getpass import getpass
requests.get('https://api.github.com/user', auth=('username', getpass()))

········


<Response [403]>

We can even supply our own authentication mechanism by subclassing AuthBase:

In [31]:
from requests.auth import AuthBase

class TokenAuth(AuthBase):
    """Implements a custom authentication scheme."""

    def __init__(self, token):
        self.token = token

    def __call__(self, r):
        """Attach an API token to a custom auth header."""
        r.headers['X-TokenAuth'] = f'{self.token}'
        return r


requests.get('https://httpbin.org/get', auth=TokenAuth('12345abcde-token'))

<Response [200]>

Bad authentication mechanisms can lead to security vulnerabilities, so unless a service requires a custom authentication mechanism for some reason, you’ll always want to use a tried-and-true auth scheme like Basic or OAuth.
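
To see what the Basic scheme puts on the wire, the sketch below prepares a request with a username and password tuple and decodes the resulting Authorization header. The credentials are placeholders, and the request is never sent.

```python
import base64

import requests

prepared = requests.Request(
    'GET',
    'https://api.github.com/user',
    auth=('username', 'secret'),  # placeholder credentials
).prepare()

# The header has the form "Basic <base64 of username:password>".
scheme, encoded = prepared.headers['Authorization'].split(' ', 1)
decoded = base64.b64decode(encoded).decode('utf-8')

print(scheme)
print(decoded)
```

Because Base64 is an encoding and not encryption, Basic credentials should only be sent over HTTPS.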

### 1.4 Requests Performance

When using requests, especially in a production application environment, it’s important to consider performance implications. Features like timeout control, sessions, and retry limits can help you keep your application running smoothly.  

By default, requests will wait indefinitely on the response, so you should almost always specify a timeout duration to prevent your application from hanging. To set the request’s timeout, use the timeout parameter. timeout can be an integer or float representing the number of seconds to wait on a response before timing out.  

You can also pass a tuple such as timeout=(2, 5), where the first element is the connect timeout and the second is the read timeout. In that case, if the request establishes a connection within 2 seconds and receives data within 5 seconds of the connection being established, then the response will be returned as usual. If the request times out, then the function will raise a Timeout exception:



In [32]:
import requests
from requests.exceptions import Timeout

try:
    response = requests.get('https://api.github.com', timeout=1)
except Timeout:
    print('The request timed out')
else:
    print('The request did not time out')

The request did not time out
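
The connect and read timeouts can be tried out without depending on an external service. The sketch below starts a deliberately slow HTTP server on localhost using the standard library; SlowHandler, fetch, and the 0.5 second delay are all assumptions made for this illustration. Setting trust_env to False is a choice made here so that proxy environment variables do not interfere with the local request.

```python
import threading
import time
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

import requests
from requests.exceptions import Timeout


class SlowHandler(BaseHTTPRequestHandler):
    """Hypothetical local endpoint that waits before answering."""

    def do_GET(self):
        time.sleep(0.5)  # simulate a slow server
        body = b'done'
        try:
            self.send_response(200)
            self.send_header('Content-Length', str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        except OSError:
            pass  # the client already gave up

    def log_message(self, format, *args):
        pass  # keep the example output quiet


def fetch(session, url, timeout):
    """Return the status code, or None if the request timed out."""
    try:
        return session.get(url, timeout=timeout).status_code
    except Timeout:
        return None


# Port 0 lets the operating system pick a free port.
server = ThreadingHTTPServer(('127.0.0.1', 0), SlowHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f'http://127.0.0.1:{server.server_port}/'

with requests.Session() as session:
    session.trust_env = False  # ignore proxy settings for the local server
    too_short = fetch(session, url, timeout=(2, 0.1))  # read timeout < delay
    long_enough = fetch(session, url, timeout=(2, 5))  # read timeout > delay

server.shutdown()
server.server_close()
print(too_short, long_enough)
```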


### 1.5 Session Object

If you need to fine-tune your control over how requests are being made or improve the performance of your requests, you may need to use a Session instance directly.  

Sessions are used to persist parameters across requests. For example, if you want to use the same authentication across multiple requests, you could use a session:

In [34]:
from getpass import getpass

# By using a context manager, you can ensure the resources used by
# the session will be released after use
with requests.Session() as session:
    session.auth = ('username', getpass())

    # Instead of requests.get(), you'll use session.get()
    response = session.get('https://api.github.com/user')

# You can inspect the response just like you did before
print(response.headers)
print(response.json())

········
{'Server': 'GitHub.com', 'Date': 'Tue, 07 Apr 2020 13:58:22 GMT', 'Content-Type': 'application/json; charset=utf-8', 'Transfer-Encoding': 'chunked', 'Status': '403 Forbidden', 'X-GitHub-Media-Type': 'github.v3; format=json', 'X-RateLimit-Limit': '60', 'X-RateLimit-Remaining': '57', 'X-RateLimit-Reset': '1586271455', 'Access-Control-Expose-Headers': 'ETag, Link, Location, Retry-After, X-GitHub-OTP, X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset, X-OAuth-Scopes, X-Accepted-OAuth-Scopes, X-Poll-Interval, X-GitHub-Media-Type, Deprecation, Sunset', 'Access-Control-Allow-Origin': '*', 'Strict-Transport-Security': 'max-age=31536000; includeSubdomains; preload', 'X-Frame-Options': 'deny', 'X-Content-Type-Options': 'nosniff', 'X-XSS-Protection': '1; mode=block', 'Referrer-Policy': 'origin-when-cross-origin, strict-origin-when-cross-origin', 'Content-Security-Policy': "default-src 'none'", 'Vary': 'Accept-Encoding, Accept, X-Requested-With', 'Content-Encoding': 'gzip', 'X-G

Each time you make a request with session, once it has been initialized with authentication credentials, the credentials will be persisted.  

The primary performance optimization of sessions comes in the form of persistent connections. When your app makes a connection to a server using a Session, it keeps that connection around in a connection pool. When your app wants to connect to the same server again, it will reuse a connection from the pool rather than establishing a new one.
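
The way a session persists settings can be observed without sending anything. The sketch below sets credentials and a default header once on the session and then prepares two requests with Session.prepare_request(); the credentials are placeholders and the second URL is only an example endpoint.

```python
import requests

with requests.Session() as session:
    session.auth = ('username', 'secret')  # placeholder credentials
    session.headers.update({'Accept': 'application/json'})

    # prepare_request() merges the session's settings into each request
    # without sending it.
    prepared = [
        session.prepare_request(requests.Request('GET', url))
        for url in (
            'https://api.github.com/user',
            'https://api.github.com/user/repos',
        )
    ]

for p in prepared:
    print(p.url, p.headers['Accept'], 'Authorization' in p.headers)
```

Both prepared requests carry the same Accept and Authorization headers even though neither was specified per request.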

### 1.6 Passing Data

Typically, you want to send some form-encoded data — much like an HTML form. To do this, simply pass a dictionary to the data argument. Your dictionary of data will automatically be form-encoded when the request is made. The data argument can also have multiple values for each key. This can be done by making data either a list of tuples or a dictionary with lists as values. This is particularly useful when the form has multiple elements that use the same key.  
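
The form encoding described above can be checked by preparing a POST request and looking at its body. This is a sketch; the keys and values are arbitrary, and the URL is only used to build the request, which is never sent.

```python
import requests

url = 'https://httpbin.org/post'  # used only to build the request

# A dictionary is form-encoded automatically.
single = requests.Request('POST', url, data={'key': 'value'}).prepare()

# A list of tuples allows the same key to appear more than once.
repeated = requests.Request(
    'POST', url, data=[('key', 'value1'), ('key', 'value2')]
).prepare()

print(single.headers['Content-Type'])
print(single.body)
print(repeated.body)
```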

JSON (JavaScript Object Notation) is the language of APIs. JSON is a way to encode data structures that ensures that they are easily readable by machines. JSON is the primary format in which data is passed back and forth to APIs, and most API servers will send their responses in JSON format.  

There are times that you may want to send data that is not form-encoded. If you pass in a string instead of a dict, that data will be posted directly.

In [57]:
import json

url = 'https://api.github.com/some/endpoint'
payload = {'some': 'data'}
r = requests.post(url, data=json.dumps(payload))

In [58]:
rr = requests.post(url, json=payload)
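
The two calls above send the same JSON text, but they differ in one detail: passing a string to data leaves the Content-Type header unset, while the json parameter serializes the dictionary and sets the header as well. The sketch below compares the two by preparing the requests without sending them.

```python
import json

import requests

url = 'https://api.github.com/some/endpoint'
payload = {'some': 'data'}

# A pre-serialized string passed to data is sent as-is.
as_string = requests.Request('POST', url, data=json.dumps(payload)).prepare()

# The json parameter serializes the dictionary and sets Content-Type.
as_json = requests.Request('POST', url, json=payload).prepare()

print(as_string.headers.get('Content-Type'))
print(as_json.headers.get('Content-Type'))
print(json.loads(as_string.body) == json.loads(as_json.body))
```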

After going through this part, we are able to:

- Make requests using HTTP methods such as GET and POST;  
- Customize requests by modifying headers, authentication, query strings, and message bodies;  
- Inspect the data you send to the server and the data the server sends back to you;  
- Use requests effectively with timeouts and Sessions

## Part 2: Web Tag Search Using Requests and Beautiful Soup

We will use a small example to show how Requests and Beautiful Soup can be combined to extract and search data in an HTML page.

In [59]:
url = 'https://assets.digitalocean.com/articles/eng_python/beautiful-soup/mockturtle.html'
page = requests.get(url)

page.status_code

200

The returned code of 200 tells us that the page downloaded successfully. In order to work with web data, we’re going to want to access the text-based content of web files. We can read the content of the server’s response with page.text.

In [60]:
page.text

'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"\n    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">\n\n<html lang="en-US" xmlns="http://www.w3.org/1999/xhtml" xml:lang="en-US">\n<head>\n  <meta http-equiv="content-type" content="text/html; charset=us-ascii" />\n\n  <title>Turtle Soup</title>\n</head>\n\n<body>\n  <h1>Turtle Soup</h1>\n\n  <p class="verse" id="first">Beautiful Soup, so rich and green,<br />\n  Waiting in a hot tureen!<br />\n  Who for such dainties would not stoop?<br />\n  Soup of the evening, beautiful Soup!<br />\n  Soup of the evening, beautiful Soup!<br /></p>\n\n  <p class="chorus" id="second">Beau--ootiful Soo--oop!<br />\n  Beau--ootiful Soo--oop!<br />\n  Soo--oop of the e--e--evening,<br />\n  Beautiful, beautiful Soup!<br /></p>\n\n  <p class="verse" id="third">Beautiful Soup! Who cares for fish,<br />\n  Game or any other dish?<br />\n  Who would not give all else for two<br />\n  Pennyworth only of Beautiful Soup?<br />\n  Pennyworth only of

To show the contents of the page in a readable form, we can parse it into a BeautifulSoup object and print it with the prettify() method, which turns the Beautiful Soup parse tree into a nicely formatted Unicode string.

In [62]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(page.text, 'html.parser')
print(soup.prettify())

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html lang="en-US" xml:lang="en-US" xmlns="http://www.w3.org/1999/xhtml">
 <head>
  <meta content="text/html; charset=utf-8" http-equiv="content-type"/>
  <title>
   Turtle Soup
  </title>
 </head>
 <body>
  <h1>
   Turtle Soup
  </h1>
  <p class="verse" id="first">
   Beautiful Soup, so rich and green,
   <br/>
   Waiting in a hot tureen!
   <br/>
   Who for such dainties would not stoop?
   <br/>
   Soup of the evening, beautiful Soup!
   <br/>
   Soup of the evening, beautiful Soup!
   <br/>
  </p>
  <p class="chorus" id="second">
   Beau--ootiful Soo--oop!
   <br/>
   Beau--ootiful Soo--oop!
   <br/>
   Soo--oop of the e--e--evening,
   <br/>
   Beautiful, beautiful Soup!
   <br/>
  </p>
  <p class="verse" id="third">
   Beautiful Soup! Who cares for fish,
   <br/>
   Game or any other dish?
   <br/>
   Who would not give all else for two
   <br/>
   Pennyworth only of 

We can extract every occurrence of a tag from a page by using Beautiful Soup’s find_all() method, which returns all instances of the given tag within a document.

In [63]:
soup.find_all('p')

[<p class="verse" id="first">Beautiful Soup, so rich and green,<br/>
   Waiting in a hot tureen!<br/>
   Who for such dainties would not stoop?<br/>
   Soup of the evening, beautiful Soup!<br/>
   Soup of the evening, beautiful Soup!<br/></p>,
 <p class="chorus" id="second">Beau--ootiful Soo--oop!<br/>
   Beau--ootiful Soo--oop!<br/>
   Soo--oop of the e--e--evening,<br/>
   Beautiful, beautiful Soup!<br/></p>,
 <p class="verse" id="third">Beautiful Soup! Who cares for fish,<br/>
   Game or any other dish?<br/>
   Who would not give all else for two<br/>
   Pennyworth only of Beautiful Soup?<br/>
   Pennyworth only of beautiful Soup?<br/></p>,
 <p class="chorus" id="fourth">Beau--ootiful Soo--oop!<br/>
   Beau--ootiful Soo--oop!<br/>
   Soo--oop of the e--e--evening,<br/>
   Beautiful, beauti--FUL SOUP!<br/></p>]

HTML attributes that CSS selectors rely on, such as class and ID, can be helpful to look at when working with web data using Beautiful Soup. We can target specific classes and IDs by using the find_all() method and passing the class and ID strings as arguments.

First, let’s find all of the instances of the class chorus. In Beautiful Soup we will assign the string for the class to the keyword argument class_:

In [64]:
soup.find_all('p', class_='chorus')

[<p class="chorus" id="second">Beau--ootiful Soo--oop!<br/>
   Beau--ootiful Soo--oop!<br/>
   Soo--oop of the e--e--evening,<br/>
   Beautiful, beautiful Soup!<br/></p>,
 <p class="chorus" id="fourth">Beau--ootiful Soo--oop!<br/>
   Beau--ootiful Soo--oop!<br/>
   Soo--oop of the e--e--evening,<br/>
   Beautiful, beauti--FUL SOUP!<br/></p>]

Next, let’s find the element whose ID is third by passing the string to the keyword argument id:

In [65]:
soup.find_all(id='third')

[<p class="verse" id="third">Beautiful Soup! Who cares for fish,<br/>
   Game or any other dish?<br/>
   Who would not give all else for two<br/>
   Pennyworth only of Beautiful Soup?<br/>
   Pennyworth only of beautiful Soup?<br/></p>]
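
Once the matching tags are found, their attributes and text can be read directly. The sketch below works on a short inline HTML string modeled on the page above, so it runs without a network connection; the shortened markup is an assumption for illustration and is not the full page.

```python
from bs4 import BeautifulSoup

# Shortened, hand-written excerpt in the style of the example page.
html = """
<html><body>
  <p class="verse" id="first">Beautiful Soup, so rich and green,</p>
  <p class="chorus" id="second">Beau--ootiful Soo--oop!</p>
  <p class="verse" id="third">Beautiful Soup! Who cares for fish,</p>
</body></html>
"""

soup = BeautifulSoup(html, 'html.parser')

# Attributes are read with dictionary-style access on each tag.
verse_ids = [p['id'] for p in soup.find_all('p', class_='verse')]

# find() returns the first match; get_text() drops the markup.
third_text = soup.find(id='third').get_text(strip=True)

print(verse_ids)
print(third_text)
```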

Retrieving a web page with the Requests module and doing some preliminary scraping of that page’s textual data with Beautiful Soup is a common and useful way to interact with web pages, and it gives a first understanding of how Beautiful Soup works.

## Reference

- https://docs.python.org/3/library/urllib.request.html  
- https://www.crummy.com/software/BeautifulSoup/bs4/doc/  
- https://docs.python.org/3/library/  
- https://realpython.com/python-requests/
- https://requests.readthedocs.io/en/master/user/quickstart/  
- https://www.bogotobogo.com/python/python-REST-API-Http-Requests-for-Humans-with-Flask.php  
- https://www.digitalocean.com/community/tutorials/how-to-work-with-web-data-using-requests-and-beautiful-soup-with-python-3