# Session 0 - Requests




The first step in writing a crawler in Python is to send a GET request that fetches the source code of the target web page. From there, you need to understand the structure of the response and how to make use of the fetched data.

Two modules are commonly used to send requests: `urllib`, which ships with the standard library, and the third-party `requests`.

In this session we are going to learn how to use the `requests` module for different queries and get familiar with: 

* The structure of a GET request and how to use it to fetch data
* The structure of a POST request and how to use it to send forms and files
* The structure of the response the server returns after a GET or POST request, and how to extract useful information from it
* Other topics such as cookies, timeouts, SSL certificate verification and proxies

In [1]:
# We begin by importing the `requests` module.
# If it is not installed, run `pip install requests` in a shell,
# or, if you are using conda: `conda install -c conda-forge requests`
import requests
import json

## GET Request

In [2]:
# The get request can be as easy as this one line when using requests
# Only a link is needed and voila you have all the HTML source in a given variable
r = requests.get('http://httpbin.org/get')

`r` is our Response object. We can get all the information we need from it.

In [3]:
# show the response body
print(r.text)

{
  "args": {}, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.25.1", 
    "X-Amzn-Trace-Id": "Root=1-60cb2a39-1ca4e39b2e0d007f00e109dc"
  }, 
  "origin": "185.207.249.116", 
  "url": "http://httpbin.org/get"
}
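
Since httpbin answers with JSON, the body can be parsed into a Python dictionary rather than handled as a string. The sketch below parses a trimmed copy of the output above with `json.loads`, so it runs without a network connection; on a live response object, `r.json()` should give the same result in one call.

```python
import json

# A trimmed copy of the body printed above (what `r.text` holds).
body = '''
{
  "args": {},
  "headers": {
    "Accept": "*/*",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.25.1"
  },
  "origin": "185.207.249.116",
  "url": "http://httpbin.org/get"
}
'''

data = json.loads(body)                      # str -> dict
user_agent = data["headers"]["User-Agent"]
print(user_agent)                            # python-requests/2.25.1
print(data["url"])                           # http://httpbin.org/get
```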



## Passing Parameters in URLs

You often want to send some data in the URL's query string. For example, when you search Google for "requests", the URL looks like https://www.google.com/search?q=requests. With `requests`, you pass these key/value pairs as a dictionary through the `params` argument.


In [4]:
# we can add some parameters to our get request
search = {
    'q': 'requests'
}

r = requests.get('https://www.google.com/search', params=search)

In [5]:
print(f"Response status code: {r.status_code}")
print(f"URL: {r.url}")

Response status code: 200
URL: https://www.google.com/search?q=requests


In [6]:
# you can pass other parameters in form of dictionary such as:
# 1.
payload = {'key1': 'value1',
          'key2': 2,
          'key3': ['value3a', 'value3b']}
r = requests.get('http://httpbin.org/get', params=payload)
print(f"URL: {r.url}")

URL: http://httpbin.org/get?key1=value1&key2=2&key3=value3a&key3=value3b
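
To see how `requests` turns a `params` dictionary into a query string without sending anything over the network, you can build the request and only prepare it. This is a minimal sketch using `requests.Request` and its `prepare()` method; the variable names are illustrative.

```python
import requests

payload = {'key1': 'value1', 'key2': 2, 'key3': ['value3a', 'value3b']}

# Build the request but do not send it; prepare() encodes the parameters.
prepared = requests.Request('GET', 'http://httpbin.org/get', params=payload).prepare()

print(prepared.method)
print(prepared.url)
```

List values are expanded into repeated keys, which matches the URL printed by the cell above.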


In [7]:
# 2. change user agent:
# define your user agent, in order to disguise the request as coming from a browser
# since some websites don't allow crawler agents to fetch data
headers = {'User-Agent': "Mozilla/5.0 (Windows; U; \
Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6"}
r = requests.get('http://httpbin.org/get', headers=headers)
print(r.text)

{
  "args": {}, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Host": "httpbin.org", 
    "User-Agent": "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6", 
    "X-Amzn-Trace-Id": "Root=1-60cb2a3b-57b1b83e48d7b60644bc2161"
  }, 
  "origin": "185.207.249.116", 
  "url": "http://httpbin.org/get"
}



Notice how the `User-Agent` header now reports Mozilla instead of `python-requests`. In the previous cell, the parameters were appended to the URL as its query string.

In [8]:
# if we want to see the request header:
print(r.request.headers)

{'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
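
The request headers can also be inspected before anything is sent. The sketch below prepares a request with a custom `User-Agent` (the agent string `my-crawler/0.1` is a made-up example) and shows that header lookups are case-insensitive.

```python
import requests

headers = {'User-Agent': 'my-crawler/0.1'}  # hypothetical agent string
prepared = requests.Request('GET', 'http://httpbin.org/get', headers=headers).prepare()

# Header names are matched case-insensitively.
print(prepared.headers['User-Agent'])
print(prepared.headers['user-agent'])
```

Defaults such as `Accept-Encoding` and `Connection`, visible in the output above, are added when the request is sent through `requests.get`, so a bare prepared request should only carry what you passed in.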


## Response Content

In [9]:
# 1. HTML

# define the link you want to target
link = 'http://pythonscraping.com/pages/page1.html'
# define your user agent - a disguise as a browser
headers = {'User-Agent': "Mozilla/5.0 (Windows; U; Windows NT 6.1; \
en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6"}
# Get the requested page and save it as r
r = requests.get(link, headers=headers)
print(r.text)

<html>
<head>
<title>A Useful Page</title>
</head>
<body>
<h1>An Interesting Title</h1>
<div>
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
</div>
</body>
</html>



In [10]:
# If you want to capture the raw socket response from the server you can access `r.raw`

r = requests.get('https://api.github.com/events', stream=True)
r.raw

<urllib3.response.HTTPResponse at 0x7ff3bd131470>

In [11]:
r.raw.read(20)

b'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03\xed\xbd[s\xdbX\xb6&\xf8W'

Fetching the HTML source of a page, as in cell [9], and extracting the data you need from it is the core of crawling. Not everything is text, though: for binary resources such as images, use `r.content`, which holds the response body as bytes.

In [12]:
# To capture Binary data

r = requests.get('https://github.com/favicon.ico')
with open('favicon.ico', 'wb') as f:
    f.write(r.content)
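
For large files, reading the whole body into memory through `r.content` may be wasteful. A common alternative is to combine `stream=True` with `iter_content` and write the file chunk by chunk. The helper name `download` and its defaults are assumptions made for this sketch.

```python
import requests

def download(url, path, chunk_size=8192, timeout=10):
    """Stream the body of `url` into `path`; return the number of bytes written."""
    written = 0
    # stream=True defers downloading the body until we iterate over it
    with requests.get(url, stream=True, timeout=timeout) as r:
        r.raise_for_status()            # stop on 4xx/5xx responses
        with open(path, 'wb') as f:
            for chunk in r.iter_content(chunk_size=chunk_size):
                f.write(chunk)
                written += len(chunk)
    return written

# Example call (needs network access):
# download('https://github.com/favicon.ico', 'favicon.ico')
```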

Instead of `requests`, we could also use the `urllib` package from the standard library. It works out of the box and has the basic modules needed to start scraping the web.

In [13]:
# Sometimes `requests` is not installed, or it is installed in another environment.
# In that case we can use `urlopen` from `urllib`, part of Python 3's standard library.
# It is lower-level and less convenient than `requests`,
# but works fine for simple queries.
from urllib.request import urlopen

html = urlopen(link)
html.read()

b'<html>\n<head>\n<title>A Useful Page</title>\n</head>\n<body>\n<h1>An Interesting Title</h1>\n<div>\nLorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.\n</div>\n</body>\n</html>\n'
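
`urllib` can cover the query-string and header cases from earlier as well, with a little more manual work. The sketch below only builds the request object, so it runs offline; passing `req` to `urlopen` would actually send it. The agent string is a made-up example.

```python
from urllib.parse import urlencode
from urllib.request import Request

payload = {'key1': 'value1', 'key2': 2, 'key3': ['value3a', 'value3b']}

# doseq=True expands list values into repeated keys, as `requests` does
query = urlencode(payload, doseq=True)
url = 'http://httpbin.org/get?' + query

req = Request(url, headers={'User-Agent': 'my-crawler/0.1'})  # hypothetical agent
print(req.full_url)
print(req.get_header('User-agent'))   # urllib stores header names capitalized this way

# To send it: urlopen(req).read()  (needs network access)
```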

In [14]:
# Response Headers
# `r.headers` holds the headers the server sent back (here, for the favicon request above).
# For nicer printing, we convert them to a dict and dump them as indented JSON.
print(json.dumps(dict(r.headers), indent=4))

{
    "Server": "GitHub.com",
    "Date": "Thu, 17 Jun 2021 10:55:56 GMT",
    "Content-Type": "image/x-icon",
    "Last-Modified": "Thu, 17 Jun 2021 09:40:29 GMT",
    "ETag": "W/\"60cb188d-1976\"",
    "expires": "Sun, 15 Jun 2031 10:55:56 GMT",
    "Cache-Control": "max-age=315360000",
    "Vary": "Accept-Encoding, Accept, X-Requested-With",
    "X-Frame-Options": "DENY",
    "Content-Encoding": "gzip",
    "Set-Cookie": "_gh_sess=Qhymfn4cN%2FqT4gyMx9mY%2BJqUB%2BiYAYAnxGnu7jbQyarzG4fnql8s3RDwNafiZ37w1xzlP%2FwzxFAz6y0dy0nbyQUgpxGCj9WO3SmA0TyTFmhlAdAVB9mJd3fyDE3zaQ%2BiTfdovUzjWsOLt6Pw2%2FCRWuHbdYCq2TkX9fdGIzgyqBvow%2F3EbMmzw2ASMI2ax52niwWPKVRcT4MnddLBStbvuMtWkPwl8jFfpNuQuaesCRZIdJ3S8915EwpQK3JW4Jf2ZDLa7qBngsWGaR%2FYwCg6nw%3D%3D--bkBPPR0%2FQOqPvxBL--gFzZtsKq7HBEKgC4mpJtbg%3D%3D; Path=/; HttpOnly; Secure; SameSite=Lax, _octo=GH1.1.1507979890.1623927356; Path=/; Domain=github.com; Expires=Fri, 17 Jun 2022 10:55:56 GMT; Secure; SameSite=Lax, logged_in=no; Path=/; Domain=github.com; Expire

## Post Requests

Typically, you want to send some form-encoded data, much like an HTML form does.
To do this, pass a dictionary to the `data` argument.

Forms provide:

* User interface for editing data
* Mechanism for uploading changes to the server 

A form is a collection of elements, each of which has:
    
* Name (used by application, often invisible to user)
* Value (just a string)
* User interface 

Different types of elements provide different ways to edit the value (text entry, checkbox, menu, etc.) 

In [15]:
# Your dictionary of data will automatically be form-encoded when the request is made:
data = {'key1': 'value1', 'key2': 2}
r = requests.post("http://httpbin.org/post", data=data)
print(r.text)

{
  "args": {}, 
  "data": "", 
  "files": {}, 
  "form": {
    "key1": "value1", 
    "key2": "2"
  }, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Content-Length": "18", 
    "Content-Type": "application/x-www-form-urlencoded", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.25.1", 
    "X-Amzn-Trace-Id": "Root=1-60cb2a3d-2447caa54101261304d4be5b"
  }, 
  "json": null, 
  "origin": "185.207.249.116", 
  "url": "http://httpbin.org/post"
}
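
To see what "form-encoded" means in practice, the request can be prepared without being sent and its body inspected. The sketch also prepares the same dictionary with the `json` argument of `requests`, which serializes it as JSON instead; which of the two a server expects depends on the API you are talking to.

```python
import json
import requests

data = {'key1': 'value1', 'key2': 2}
url = 'http://httpbin.org/post'

# data=... -> form encoding, like an HTML form submission
form_req = requests.Request('POST', url, data=data).prepare()
print(form_req.headers['Content-Type'])
print(form_req.body)

# json=... -> the dictionary is serialized as a JSON document
json_req = requests.Request('POST', url, json=data).prepare()
print(json_req.headers['Content-Type'])
print(json.loads(json_req.body))
```

The form body is 18 characters long, which matches the `Content-Length` header in the output above.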



In [16]:
# Post a file. The `with` block closes the file once the upload is done.
link = 'https://httpbin.org/post'
with open('favicon.ico', 'rb') as f:
    r = requests.post(link, files={'file': f})
print(r.text)

{
  "args": {}, 
  "data": "", 
  "files": {
    "file": "data:application/octet-stream;base64,AAABAAIAEBAAAAEAIAAoBQAAJgAAACAgAAABACAAKBQAAE4FAAAoAAAAEAAAACAAAAABACAAAAAAAAAFAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAABERE3YTExPFDg4OEgAAAAAAAAAADw8PERERFLETExNpAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAABQUFJYTExT8ExMU7QAAABkAAAAAAAAAAAAAABgVFRf/FRUX/xERE4UAAAAAAAAAAAAAAAAAAAAAAAAAABEREsETExTuERERHhAQEBAAAAAAAAAAAAAAAAAAAAANExMU9RUVF/8VFRf/EREUrwAAAAAAAAAAAAAAABQUFJkVFRf/BgYRLA4ODlwPDw/BDw8PIgAAAAAAAAAADw8PNBAQEP8VFRf/FRUX/xUVF/8UFBSPAAAAABAQEDAPDQ//AAAA+QEBAe0CAgL/AgIC9g4ODjgAAAAAAAAAAAgICEACAgLrFRUX/xUVF/8VFRf/FRUX/xERES0UFBWcFBQV/wEBAfwPDxH7DQ0ROwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA0NEjoTExTnFRUX/xUVF/8SEhKaExMT2RUVF/8VFRf/ExMTTwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAERERTBUVF/8VFRf/ExMT2hMTFPYVFRf/FBQU8AAAAAIAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAITExTxFRUX/xMTFPYTExT3FRUX/xQUFOEAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAFBQU4RUVF/8TExT3FBQU3hUVF/8TExT5Dw8PIQAAAAAAAAAAA

## Response Status Code

In [17]:
r = requests.get('https://httpbin.org/get')
r.status_code

200

Requests also comes with a built-in status code lookup object for easy reference:

In [18]:
requests.codes.bad

400

In [19]:
requests.codes.accepted

202
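
These named codes make status checks more readable than bare numbers. The sketch below also shows `raise_for_status()`, which raises `requests.HTTPError` for 4xx and 5xx responses. The `Response` object is built by hand here only so the example runs offline; normally it would come from `requests.get`.

```python
import requests

print(requests.codes.ok)                 # 200
print(requests.codes.not_found)          # 404

# Hand-built response, for offline illustration only.
resp = requests.Response()
resp.status_code = 404

is_missing = resp.status_code == requests.codes.not_found

try:
    resp.raise_for_status()
    raised = False
except requests.HTTPError as err:
    raised = True
    print(f"Request failed: {err}")
```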

## Cookies

Let's look at the cookies that DeepL sets:

In [20]:
r = requests.get('http://deepl.com')

In [21]:
r.cookies

<RequestsCookieJar[Cookie(version=0, name='releaseGroups', value='89.DPAY-991.2.5_123.DPAY-1399.2.2_145.DPAY-1214.2.1_187.DM-70.2.1_188.DM-71.1.2_193.DM-2.1.1_194.DM-8.2.1_198.TG-88.2.1_199.DM-13.1.1', port=None, port_specified=False, domain='www.deepl.com', domain_specified=False, domain_initial_dot=False, path='/', path_specified=True, secure='True', expires=1626555361, discard=False, comment=None, comment_url=None, rest={'SameSite': 'Lax'}, rfc2109=False), Cookie(version=0, name='userCountry', value='US', port=None, port_specified=False, domain='www.deepl.com', domain_specified=False, domain_initial_dot=False, path='/', path_specified=True, secure='True', expires=None, discard=True, comment=None, comment_url=None, rest={'SameSite': 'Lax'}, rfc2109=False)]>

You can also send your own cookies to the server, for instance to reuse a session cookie from an earlier login instead of logging in again.

In [22]:
url = 'https://httpbin.org/cookies'
cookies = dict(cookies_are='delicious')
r = requests.get(url, cookies=cookies)
print(r.text)

{
  "cookies": {
    "cookies_are": "delicious"
  }
}
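
Cookies passed as a dictionary end up in a single `Cookie` request header. The sketch below prepares the request without sending it to show that header, and also builds a `RequestsCookieJar`, which lets you scope a cookie to a domain and path. The cookie names and values are illustrative.

```python
import requests

url = 'https://httpbin.org/cookies'

# A plain dict is the simplest way to send cookies
prepared = requests.Request('GET', url, cookies={'cookies_are': 'delicious'}).prepare()
print(prepared.headers['Cookie'])

# A cookie jar can restrict each cookie to a domain and path
jar = requests.cookies.RequestsCookieJar()
jar.set('tasty_cookie', 'yum', domain='httpbin.org', path='/cookies')
print(jar.get_dict())

# To send the jar: requests.get(url, cookies=jar)  (needs network access)
```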



You can tell `requests` to stop waiting for a response after a given number of seconds with the `timeout` parameter. Without it, a request can hang indefinitely.

In [23]:
# Timeout
# r = requests.get(link, timeout=0.001 ) 
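
A timeout that is too short raises an exception, which is why the line above is commented out. A minimal way to handle it is to catch `requests.exceptions.Timeout`. The helper name `get_or_none` and the default of 5 seconds are assumptions made for this sketch.

```python
import requests

def get_or_none(url, timeout=5):
    """Return the response, or None if the server did not answer in time."""
    try:
        # timeout may also be a (connect, read) tuple, e.g. (3.05, 27)
        return requests.get(url, timeout=timeout)
    except requests.exceptions.Timeout:
        print(f"No answer from {url} within {timeout} s")
        return None

# Example call (needs network access):
# r = get_or_none('https://httpbin.org/delay/3', timeout=1)
```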

## SSL Cert Verification

`requests` verifies SSL certificates for HTTPS requests, just like a web browser does. SSL verification is enabled by default, and an `SSLError` is raised if a certificate cannot be verified:

In [31]:
# Expired SSL
try:
    requests.get('https://expired.badssl.com/')
except Exception as x:
    print(type(x), '\n', x)

<class 'requests.exceptions.SSLError'> 
 HTTPSConnectionPool(host='expired.badssl.com', port=443): Max retries exceeded with url: / (Caused by SSLError(SSLError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:852)'),))


In [39]:
try:
    requests.get('https://untrusted-root.host.badssl.com/')
except Exception as x:
    print(type(x), '\n', x)

<class 'requests.exceptions.SSLError'> 
 HTTPSConnectionPool(host='untrusted-root.host.badssl.com', port=443): Max retries exceeded with url: / (Caused by SSLError(SSLError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:852)'),))


In [35]:
try:
    requests.get('https://no-common-name.host.badssl.com/')
except Exception as x:
    print(type(x), '\n', x)

<class 'requests.exceptions.SSLError'> 
 HTTPSConnectionPool(host='no-common-name.host.badssl.com', port=443): Max retries exceeded with url: / (Caused by SSLError(SSLError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:852)'),))
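
Catching the broad `Exception` class works for a demonstration, but in a crawler it is usually clearer to catch `requests.exceptions.SSLError` specifically. The sketch below does that and exposes the `verify` argument of `requests`, which accepts the path to a CA bundle file (or `False` to skip verification, which is unsafe outside of testing). The helper name `status_or_none` is hypothetical.

```python
import requests

def status_or_none(url, verify=True, timeout=10):
    """Return the HTTP status code, or None if certificate verification fails."""
    try:
        # verify=True uses the default CA bundle; a string is treated as a path
        # to a custom bundle file
        r = requests.get(url, verify=verify, timeout=timeout)
    except requests.exceptions.SSLError as err:
        print(f"Certificate problem for {url}: {err}")
        return None
    return r.status_code

# Example call (needs network access):
# status_or_none('https://expired.badssl.com/')
```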


## Proxies

If you need to use a proxy, you can configure individual requests with the `proxies` argument, which every request method accepts:

```python
proxies = {
    'http': 'http://10.10.1.10:3128',
    'https': 'http://10.10.1.10:1080'
}
requests.get('http://example.org', proxies=proxies)
```

## References

1. [CS 142: Web Applications](https://web.stanford.edu/~ouster/cgi-bin/cs142-winter15/lectures.php), by John Ousterhout, Stanford University
2. `requests` official documentation
3. Mitchell, R. 2018. Web Scraping with Python: Collecting More Data from the Modern Web. O’Reilly Media.
