# Web Scraping - Session 0

## `urllib` vs `requests`

* `urllib` is a package in the Python standard library that collects several modules for working with URLs. It works out of the box, so you don't have to deal with package compatibility issues, and it has the basic modules you need to start scraping the web:
    * `urllib.request` for opening and reading URLs
    * `urllib.error` containing the exceptions raised by `urllib.request`
    * `urllib.parse` for parsing URLs
    * `urllib.robotparser` for parsing robots.txt files
* However, for a higher-level HTTP client interface, the official Python documentation recommends the `requests` package. It was designed for exactly that purpose and keeps common operations down to very short commands.

For real applications I would recommend `requests`, but it doesn't hurt to get to know both.
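As a small illustration of two of the `urllib` modules listed above, the sketch below splits the example URL used later in this session with `urllib.parse` and checks two paths against a robots.txt with `urllib.robotparser`. The robots.txt rules, the `/private/` path, and the `example-bot` user agent are made up for the example; the rules are fed in as text, so nothing is downloaded.

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

# Split a URL into its components
parts = urlparse('http://pythonscraping.com/pages/page1.html')
print(parts.scheme, parts.netloc, parts.path)

# Hypothetical robots.txt content (not taken from a real site)
robots_lines = [
    'User-agent: *',
    'Disallow: /private/',
]

rp = RobotFileParser()
rp.parse(robots_lines)  # parse rules from an iterable of lines, no download

# Ask whether a given user agent may fetch a given URL
allowed_page = rp.can_fetch('example-bot', 'http://pythonscraping.com/pages/page1.html')
allowed_private = rp.can_fetch('example-bot', 'http://pythonscraping.com/private/data.html')
print(allowed_page, allowed_private)
```

Against a live site you would point the parser at the real file with `rp.set_url(...)` followed by `rp.read()` instead of calling `parse` on hand-written lines.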

## Requests

When it comes to HTTP requests and web scraping, `requests` should be your go-to package:
* It comes with very useful and short commands.
* It provides one simple call per HTTP method used by RESTful APIs and is very easy to use:

```python
import requests

resp = requests.get('http://www.mywebsite.com/user')
resp = requests.post('http://www.mywebsite.com/user')
resp = requests.put('http://www.mywebsite.com/user/put')
resp = requests.delete('http://www.mywebsite.com/user/delete')
```
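* It accepts plain dictionaries as arguments, for example for query parameters (`params`), form data (`data`), and headers (`headers`).
* It has its own built-in JSON decoder (`Response.json()`).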

In [4]:
# Auto-completion does not work with jedi enabled, so turn it off
%config Completer.use_jedi = False

In [1]:
# first example
import requests

# define the link you want to target
link = 'http://pythonscraping.com/pages/page1.html'
# define your user agent so the request looks like it comes from a browser
headers = {'User-Agent': "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6"}
# Get the requested page and save it as r
r = requests.get(link, headers=headers)
print(r.text)

<html>
<head>
<title>A Useful Page</title>
</head>
<body>
<h1>An Interesting Title</h1>
<div>
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
</div>
</body>
</html>



In [2]:
# second example
from urllib.request import urlopen

html = urlopen(link)
html.read()

b'<html>\n<head>\n<title>A Useful Page</title>\n</head>\n<body>\n<h1>An Interesting Title</h1>\n<div>\nLorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.\n</div>\n</body>\n</html>\n'
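Note that `html.read()` returns `bytes` (hence the `b'...'` prefix above), whereas `r.text` from `requests` is already a `str`. A minimal sketch of decoding such a body by hand follows. The helper name `decode_body` and the hand-built header object are illustrative assumptions; with a live response from `urlopen` you would pass `html.headers` instead.

```python
from email.message import Message


def decode_body(raw, headers, default='utf-8'):
    """Decode a response body using the charset from Content-Type, if any."""
    # get_content_charset() returns None when no charset is declared
    charset = headers.get_content_charset() or default
    return raw.decode(charset, errors='replace')


# Hand-built headers standing in for a real response's headers (assumption)
headers = Message()
headers['Content-Type'] = 'text/html; charset=utf-8'

raw = b'<h1>An Interesting Title</h1>\n'
text = decode_body(raw, headers)
print(type(raw).__name__, type(text).__name__)
print(text)

# Without a declared charset the default is used
fallback = decode_body(b'caf\xc3\xa9', Message())
print(fallback)
```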

In [5]:
# example using BeautifulSoup
# Like its Wonderland namesake, BeautifulSoup tries to make sense of the nonsensical; 
# it helps format and organize the messy web by fixing bad HTML 
# and presenting us with easily traversable Python objects representing XML structures.
from bs4 import BeautifulSoup as bs


html = urlopen(link)
inhalt = bs(html.read(), 'html.parser')
print(inhalt.h1)

<h1>An Interesting Title</h1>


When you create a BeautifulSoup object, you pass in two arguments: the HTML text and the name of the parser to use (here `'html.parser'`).

In [7]:
inhalt

<html>
<head>
<title>A Useful Page</title>
</head>
<body>
<h1>An Interesting Title</h1>
<div>
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
</div>
</body>
</html>

In [8]:
from urllib.error import HTTPError, URLError

try:
    html = urlopen('https://something_wrong.com')
except HTTPError as e:
    print("The server returned an HTTP error")
except URLError as e:
    print("The server couldn't be found!")
else:
    print(html.read())

The server couldn't be found!
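The order of the `except` clauses above matters: `HTTPError` is a subclass of `URLError`, so it has to be caught first. The pattern can be wrapped in a small helper, sketched below. The name `fetch`, the injectable `opener` parameter, and the three fake openers are assumptions added so the example runs without a network connection.

```python
import io
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def fetch(url, opener=urlopen):
    """Return the response body as bytes, or None if the request failed."""
    try:
        response = opener(url)
    except HTTPError as e:
        # Must come first: HTTPError is a subclass of URLError
        print(f"The server returned an HTTP error: {e.code}")
        return None
    except URLError as e:
        print(f"The server couldn't be found: {e.reason}")
        return None
    return response.read()


# Fake openers that simulate the three outcomes (no real request is made)
def http_error_opener(url):
    raise HTTPError(url, 404, 'Not Found', None, None)


def url_error_opener(url):
    raise URLError('simulated DNS failure')


def ok_opener(url):
    return io.BytesIO(b'<h1>An Interesting Title</h1>')


missing = fetch('http://example.com/missing', opener=http_error_opener)
unreachable = fetch('https://something_wrong.com', opener=url_error_opener)
body = fetch('http://example.com/', opener=ok_opener)
print(missing, unreachable, body)
```

Calling `fetch(url)` without the `opener` argument uses the real `urlopen`.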


In [11]:
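import lxml  # not used directly; confirms that the lxml parser is installed

# open the page again: a response body can only be read once
html = urlopen(link)
inhalt = bs(html.read(), 'lxml')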

lxml has some advantages over html.parser in that it is generally better at parsing “messy” or malformed HTML code. It is forgiving and fixes problems like unclosed tags, tags that are improperly nested, and missing head or body tags.

In [17]:
try:
    badContent = inhalt.nonExistingTag.anotherTag
except AttributeError as e:
    # inhalt.nonExistingTag is None, so asking it for .anotherTag raises
    print('Tag was not found 1')
else:
    if badContent is None:
        print('Tag was not found 2')
    else:
        print(badContent)

Tag was not found 1
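The first branch fired because `inhalt.nonExistingTag` is itself `None`, and asking `None` for `.anotherTag` raises `AttributeError`. An alternative is to check each level explicitly, as in this sketch. The helper `find_nested` and the inline HTML string are illustrative assumptions; `find_nested` is not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Small inline document so the example runs without a network connection
html_text = "<html><body><h1>An Interesting Title</h1></body></html>"
soup = BeautifulSoup(html_text, 'html.parser')


def find_nested(root, *tag_names):
    """Follow a chain of tag names; return None as soon as one is missing."""
    node = root
    for name in tag_names:
        if node is None:
            return None
        node = node.find(name)
    return node


title = find_nested(soup, 'body', 'h1')
missing = find_nested(soup, 'nonExistingTag', 'anotherTag')
print(title.get_text())
print(missing)
```

This way a missing tag at any depth yields `None` and the caller only has one case to test for.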


In [18]:
# Custom requests: pass the query parameters as a dictionary
key_dict = {'key1': 'value1', 'key2': 'value2'}
r = requests.get('http://httpbin.org/get', params=key_dict)
print("URL has been correctly encoded :", r.url)
print("Response body in string format: \n ", r.text)

URL has been correctly encoded : http://httpbin.org/get?key1=value1&key2=value2
Response body in string format: 
  {
  "args": {
    "key1": "value1", 
    "key2": "value2"
  }, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.25.1", 
    "X-Amzn-Trace-Id": "Root=1-60c8dbfe-6c3732a224b3434262737a2f"
  }, 
  "origin": "89.35.30.236", 
  "url": "http://httpbin.org/get?key1=value1&key2=value2"
}
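The body printed above is JSON. On a live response, `r.json()` uses the built-in decoder of `requests` to turn it into Python objects. The offline sketch below does roughly the same with the standard `json` module on a shortened, hand-copied version of that body; the variable names are illustrative.

```python
import json

# Shortened copy of the httpbin response body shown above
body = '''
{
  "args": {"key1": "value1", "key2": "value2"},
  "headers": {"Host": "httpbin.org"},
  "url": "http://httpbin.org/get?key1=value1&key2=value2"
}
'''

data = json.loads(body)  # JSON object -> Python dict
print(data['args'])
print(data['args']['key1'])
print(data['headers']['Host'])
```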



In [24]:
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.82 Safari/537.36'
}
link = 'https://skacem.github.io/about'
r = requests.get(link, headers=headers)
print("Response status code: ", r.status_code)

Response status code:  200


In [20]:
# Sending a POST
r = requests.post('http://httpbin.org/post', data=key_dict)
print(r.text)

{
  "args": {}, 
  "data": "", 
  "files": {}, 
  "form": {
    "key1": "value1", 
    "key2": "value2"
  }, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Content-Length": "23", 
    "Content-Type": "application/x-www-form-urlencoded", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.25.1", 
    "X-Amzn-Trace-Id": "Root=1-60c8dd66-127e3ab113c4aec76664d06f"
  }, 
  "json": null, 
  "origin": "89.35.30.236", 
  "url": "http://httpbin.org/post"
}
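The `form` field and the `Content-Type` header in this response show that a dictionary passed via `data=` is sent form-encoded. The standard library produces the same encoding for this dictionary, which also accounts for the `Content-Length` of 23. This is a sketch for illustration only; `requests` does the encoding for you.

```python
from urllib.parse import urlencode, parse_qs

key_dict = {'key1': 'value1', 'key2': 'value2'}

# Build the body as it looks for application/x-www-form-urlencoded
form_body = urlencode(key_dict)
body_length = len(form_body.encode('ascii'))
print(form_body)
print(body_length)

# Decoding recovers the fields; parse_qs returns each value as a list
decoded = parse_qs(form_body)
print(decoded)
```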



In [25]:
# Timeout: an unrealistically short timeout (in seconds) makes the request fail with ConnectTimeout
# r = requests.get(link, timeout=0.001)

ConnectTimeout: HTTPSConnectionPool(host='skacem.github.io', port=443): Max retries exceeded with url: /about (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f9fd0c557f0>, 'Connection to skacem.github.io timed out. (connect timeout=0.001)'))
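In a scraper you usually want to catch this exception rather than let it stop the run. A minimal sketch follows, assuming a hypothetical helper `get_or_none` and a fake `get` function that simulates the timeout so the example runs offline.

```python
import requests


def get_or_none(url, timeout=10, get=requests.get):
    """Return the response, or None if the request timed out."""
    try:
        return get(url, timeout=timeout)
    except requests.exceptions.Timeout as e:
        # Timeout is the base class of ConnectTimeout and ReadTimeout
        print(f"Request to {url} timed out: {e}")
        return None


# Stand-in for requests.get that always times out (no real request is made)
def always_times_out(url, timeout):
    raise requests.exceptions.ConnectTimeout(f"simulated connect timeout={timeout}")


result = get_or_none('https://skacem.github.io/about', timeout=0.001, get=always_times_out)
print(result)
```

Leaving out the `get` argument uses the real `requests.get`. The `timeout` value can also be a `(connect, read)` tuple if the two phases need different limits.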