# Advanced Web Scraping

In [1]:
import requests

## Handling Web Response Status Codes

https://en.wikipedia.org/wiki/List_of_HTTP_status_codes

In [4]:
r1 = requests.get('https://thegurus.tech')
r1.status_code

200

In [5]:
r2 = requests.get('https://thegurus.tech/i_dont_exist')
r2.status_code

404

In [6]:
r3 = requests.get('http://thegurus.tech')
r3.status_code

200

In [7]:
r3.history

[<Response [301]>]

In [6]:
r3.history[0].status_code

301

In [7]:
r3.history[0].text

'<html>\r\n<head><title>301 Moved Permanently</title></head>\r\n<body bgcolor="white">\r\n<center><h1>301 Moved Permanently</h1></center>\r\n<hr><center>nginx/1.14.0 (Ubuntu)</center>\r\n</body>\r\n</html>\r\n'

In [8]:
r4 = requests.get('https://thegurus.tech')
r4.history

[]

It's a good idea to check the status code before you parse the response text.

In [9]:
url = 'http://thegurus.tech'

r = requests.get(url)

if 200 <= r.status_code < 300:
    print('request was successful')
elif 300 <= r.status_code < 400:
    # rarely reached here, because requests follows redirects by default
    print('request was redirected')
elif 400 <= r.status_code < 500:
    print('request failed because the resource either does not exist or is forbidden')
elif 500 <= r.status_code < 600:
    print('request failed because the server encountered an error')
else:
    print('unexpected status code:', r.status_code)

request was successful
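
The range checks above can also be wrapped in a small reusable helper. The following is a minimal sketch that relies only on the standard library's `http.HTTPStatus`; the function name `describe_status` and the category labels are illustrative choices, not part of requests.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return a label such as '404 Not Found (client error)'."""
    if 100 <= code < 200:
        category = 'informational'
    elif 200 <= code < 300:
        category = 'success'
    elif 300 <= code < 400:
        category = 'redirection'
    elif 400 <= code < 500:
        category = 'client error'
    elif 500 <= code < 600:
        category = 'server error'
    else:
        category = 'unknown'

    try:
        phrase = HTTPStatus(code).phrase  # standard reason phrase, e.g. 'Not Found'
    except ValueError:
        phrase = 'Unknown Status'  # code is not registered in HTTPStatus
    return f'{code} {phrase} ({category})'


for code in (200, 301, 404, 500):
    print(describe_status(code))
```

With a requests response `r`, the helper would be called as `describe_status(r.status_code)`.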


Note that the requests library follows redirects automatically: if the web resource has moved (i.e. a 30x status code), it makes an additional request to the new URL. If that URL redirects again, requests keeps following the chain until it receives a success or failure response, as long as the number of redirects does not exceed the limit (30 by default). The intermediate responses are kept in r.history, which is why the final status code above is 200 rather than 301. You can disable this behavior with requests.get(url, allow_redirects=False), in which case requests returns the redirect response itself. You can also lower the limit by setting max_redirects = n on a requests.Session object (it is not an argument of requests.get) to cut off endless redirect loops or to save time.

In [14]:
r = requests.get('http://thegurus.tech', allow_redirects=False)
r.status_code

301
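
When redirects are not followed, the new address is normally carried in the `Location` response header, and it may be relative. Below is a small sketch of resolving it with the standard library; the helper name `redirect_target` and the sample header values are assumptions for illustration, not values returned by the site.

```python
from urllib.parse import urljoin


def redirect_target(request_url: str, location: str) -> str:
    """Resolve a Location header value against the URL that was requested."""
    # urljoin leaves absolute URLs untouched and resolves relative ones.
    return urljoin(request_url, location)


# Hypothetical Location values: one absolute, one relative.
print(redirect_target('http://thegurus.tech', 'https://thegurus.tech/'))
print(redirect_target('https://thegurus.tech/old/page', '/new/page'))
```

With the response from the cell above, this would be called as `redirect_target(r.url, r.headers['Location'])`, assuming the server sent a `Location` header.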

## Handling Request Errors

### Exceeding Max Redirects

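As mentioned earlier, the number of redirects is capped (30 by default), and you can change the cap through the max_redirects attribute of a requests.Session. If a request exceeds the limit, requests raises a TooManyRedirects exception. In the example below the limit is 1, and since http://thegurus.tech redirects only once, the request still succeeds.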

In [16]:
session = requests.Session()
session.max_redirects = 1

try:
    session.get('http://thegurus.tech').status_code
    print('successful request!')
    
except requests.exceptions.TooManyRedirects as ex:
    print('handled exception!')
    print(ex)


successful request!


### Timeout

Sometimes a remote server is unresponsive, either because requests cannot connect to it or because the server accepts the connection but never sends back the promised data. By default requests sets no timeout, so the call can hang for a very long time, potentially indefinitely. This is a waste because most modern websites respond within a couple of seconds. It is therefore good practice to pass a timeout argument (in seconds) to limit the wait. When the limit is hit, requests raises ConnectTimeout if the connection could not be established in time, or ReadTimeout if the server was too slow to send data; both are subclasses of requests.exceptions.Timeout.

In [46]:
try:
    requests.get('https://thegurus.tech', timeout=0.05)
    print('successful request!')
    
except requests.exceptions.Timeout as ex:
    print('handled timeout exception!')
    print(ex)

except requests.exceptions.ConnectionError as ex:
    print('handled connection exception!')
    print(ex)

finally:
    print('finally executed')

handled timeout exception!
HTTPSConnectionPool(host='thegurus.tech', port=443): Read timed out. (read timeout=0.05)
finally executed
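
The order of the except clauses matters because of how the exception classes relate: `ConnectTimeout` inherits from both `ConnectionError` and `Timeout`, while `ReadTimeout` inherits only from `Timeout`. The sketch below checks this and shows a `(connect, read)` timeout tuple; the helper names `classify_failure` and `fetch` and the timeout values are illustrative assumptions.

```python
import requests


def classify_failure(exc: Exception) -> str:
    """Map a requests exception to a short label, most specific class first."""
    if isinstance(exc, requests.exceptions.ConnectTimeout):
        return 'connect timeout'
    if isinstance(exc, requests.exceptions.ReadTimeout):
        return 'read timeout'
    if isinstance(exc, requests.exceptions.ConnectionError):
        return 'connection error'
    return 'other request error'


def fetch(url: str, timeout=(3.05, 10)):
    """Return (response, None) on success or (None, label) on failure.

    timeout is a (connect, read) tuple in seconds; the values are arbitrary.
    """
    try:
        return requests.get(url, timeout=timeout), None
    except requests.exceptions.RequestException as exc:
        return None, classify_failure(exc)


# ConnectTimeout is both a ConnectionError and a Timeout.
print(issubclass(requests.exceptions.ConnectTimeout, requests.exceptions.ConnectionError))
print(issubclass(requests.exceptions.ConnectTimeout, requests.exceptions.Timeout))
```

A call such as `fetch('https://thegurus.tech', timeout=(3.05, 0.05))` would be expected to report a read timeout when the server is slower than the read limit.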


### SSL Certificate Error

If a website wants to use SSL/TLS, it has to obtain a certificate from a certificate authority, either purchased or free (e.g. Let's Encrypt), and configure the web server properly for the certificate to work. If the certificate is not installed properly or has expired (certificates must be renewed periodically), modern browsers such as Chrome warn the user. Similarly, the requests library raises an SSLError exception if it detects a problem with the certificate.

In [47]:
try:
    requests.get('https://thegurus.tech')
    print('successful request!')
    
except requests.exceptions.SSLError as ex:
    print('handled exception!')
    print(ex)

successful request!


## User Agent

The User-Agent header identifies the browser and operating system from which the request is made. Some websites send different responses depending on the user agent, for reasons such as:

- They want to avoid bugs of the website that only occur in certain browsers or browser versions.

- They want to personalize user experience (e.g. desktop vs mobile, small vs big screen, etc.) by sending different data.

In case you decide to fool the website by pretending to be a certain browser, use the approach below:

In [48]:
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'}
response = requests.get('https://thegurus.tech', headers=headers)
response.content[:500]

b'<!DOCTYPE html>\n<html lang="en">\n\n<head>\n        <meta charset="utf-8">\n        <meta http-equiv="X-UA-Compatible" content="IE=edge">\n        <meta name="viewport" content="width=device-width, initial-scale=1">\n\n\n        <title>The Gurus</title>\n\n            <link href="https://thegurus.tech/feeds/all.atom.xml" type="application/atom+xml" rel="alternate" title="The Gurus Full Atom Feed" />\n        <!-- Bootstrap Core CSS -->\n        <link href="https://thegurus.tech/theme/css/bootstrap.min.css" '

With the user agent string above, the request pretends to come from Chrome v71.0.3578.98 on macOS 10.14.2 (Mojave). To see what the user agent string looks like in other browsers/OS, check out https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/User-Agent
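
To verify which user agent will actually be sent, the request can be built without sending it. The sketch below, which needs no network access, sets the header once on a `requests.Session` and inspects the prepared request; the constant name `CHROME_UA` is just an illustrative label for the string used above.

```python
import requests

CHROME_UA = ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_2) AppleWebKit/537.36 '
             '(KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36')

# Without an explicit header, requests announces itself as python-requests/<version>.
default_session = requests.Session()
print(default_session.headers['User-Agent'])

# Setting the header on a Session applies it to every request made through it.
session = requests.Session()
session.headers.update({'User-Agent': CHROME_UA})

# prepare_request builds the outgoing request without sending it,
# which lets us inspect the headers offline.
prepared = session.prepare_request(requests.Request('GET', 'https://thegurus.tech'))
print(prepared.headers['User-Agent'])
```

Using a session this way avoids repeating `headers=headers` on every call.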

## Making Asynchronous Requests

In [15]:
# asyncio is part of the Python standard library (3.4+), so no pip install is needed



### __async is not about parallelizing tasks, it's about scheduling tasks__

```python
import asyncio
import requests

N_REQUESTS = 5

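async def gurus_async():
    # requests.get is a blocking call: it does not yield to the event loop,
    # so nothing else runs while a request is in flight.
    print(requests.get('https://thegurus.tech'))
    # await asyncio.sleep(0) hands control back to the event loop,
    # which can then schedule the next pending coroutine.
    await asyncio.sleep(0)
    print(requests.get('http://thegurus.tech', allow_redirects=False))
    await asyncio.sleep(0)
    print(requests.get('https://thegurus.tech/i_dont_exist'))

async def main_async():
    # Create N_REQUESTS coroutine objects and schedule them together.
    coroutines = [gurus_async() for _ in range(N_REQUESTS)]
    await asyncio.gather(*coroutines)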

def gurus_sync():
    print(requests.get('https://thegurus.tech'))
    print(requests.get('http://thegurus.tech', allow_redirects=False))
    print(requests.get('https://thegurus.tech/i_dont_exist'))

def main_sync():

    for _ in range(N_REQUESTS):
        gurus_sync()

print('\nrunning async...\n')
asyncio.run(main_async())

print('\nrunning sync...\n')
main_sync()

```

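This code cannot be run as-is in a Jupyter Notebook because asyncio.run cannot be called while another asyncio event loop is already running in the same thread, and Jupyter runs its own event loop. In recent versions of Jupyter/IPython you can write await main_async() directly in a cell instead.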
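
Because `requests.get` blocks, the coroutines above are interleaved by the scheduler but their waiting times are not overlapped. One option for overlapping them is to hand each blocking call to a worker thread. The following is a sketch under stated assumptions: `blocking_fetch` and `main_threads` are hypothetical names, and `time.sleep` stands in for a real request so that the example needs no network.

```python
import asyncio
import time

N_REQUESTS = 5


def blocking_fetch(i: int) -> str:
    """Stand-in for a blocking call such as requests.get (no network involved)."""
    time.sleep(0.2)  # simulates waiting on the server
    return f'response {i}'


async def main_threads() -> list:
    # asyncio.to_thread (Python 3.9+) runs each blocking call in a worker thread,
    # so the event loop can wait on all of them at the same time.
    tasks = [asyncio.to_thread(blocking_fetch, i) for i in range(N_REQUESTS)]
    return await asyncio.gather(*tasks)


start = time.perf_counter()
results = asyncio.run(main_threads())
elapsed = time.perf_counter() - start
print(results)
print(f'{elapsed:.2f}s elapsed')
```

Run sequentially, the five simulated calls would take about one second in total; here the waits overlap. To try this with real requests, the body of `blocking_fetch` could be replaced with a `requests.get(url, timeout=...)` call.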

## Understanding robots.txt

There are many reasons why some websites don't welcome bots:

- When a bot crawls a website, it takes up CPU, memory, and bandwidth on the server that would otherwise be dedicated to normal users.
- Sometimes the website admin does not want to expose certain semi-confidential web resources to the search engines. 
- Sometimes the website admin wants to promote the most important web resources rather than letting the search engines see many irrelevant or outdated resources resting on the server. 

To achieve these purposes, the common practice is to create a robots.txt file at the root level of the website that tells bots what to crawl and what not to.

- In case of "non-malicious" crawlers (e.g. Googlebot):

The robots.txt file is there to tell crawlers and robots which URLs they should not visit on your website. This helps them avoid crawling low-quality pages or getting stuck in crawl traps where an infinite number of URLs could be generated, for example a calendar section that creates a new URL for every day.

- In case of "malicious" crawlers:

They will simply ignore your robots.txt. The file is advisory only and does not block access to anything.

For more info about robots.txt please have a look at: https://support.google.com/webmasters/answer/6062608?hl=en&ref_topic=6061961&visit_id=637139316066285039-3854058931&rd=1
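
A well-behaved scraper can check these rules before fetching a page. Below is a minimal sketch using the standard library's `urllib.robotparser`; the robots.txt content, the `example.com` URLs, the helper `is_allowed`, and the user agent name `my-scraper` are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real file would be fetched from
# https://<site>/robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /calendar/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_allowed(url: str, user_agent: str = 'my-scraper') -> bool:
    """Return True if the parsed rules allow user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


print(is_allowed('https://example.com/blog/post.html'))
print(is_allowed('https://example.com/private/report.html'))
```

For a live site, `parser.set_url(...)` followed by `parser.read()` downloads and parses the file instead of passing the lines in by hand.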