<a href="https://colab.research.google.com/github/saffarizadeh/INSY4054/blob/main/Web_Crawling_(I).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img src="http://saffarizadeh.com/Logo.png" width="300px"/>

# *INSY 4054: Emerging Technologies*

# **Web Crawling (I)**

Instructor: Dr. Kambiz Saffarizadeh

---

# Web Crawling in Python

We can use the `requests` library to send HTTP requests to websites.

To do so, we specify the `url` that we want to open. In this example, we use the `get` method to open the URL.

In [None]:
import requests

In [None]:
url = 'https://www.marquette.edu/'
response = requests.get(url)

We can use the `dir` function on any Python object to list the methods and attributes it exposes. For example, here we check what we can do with `response`.

In [None]:
dir(response)
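
The full `dir` output includes many internal names that start with an underscore. As a small sketch, the list can be narrowed to the public names. The example builds an empty `requests.Response()` by hand so that it runs without network access; the names `blank_response` and `public_names` are made up for illustration.

```python
import requests

# An empty Response built by hand, only so the example runs offline.
blank_response = requests.Response()

# Keep only the names that do not start with an underscore.
public_names = [name for name in dir(blank_response) if not name.startswith('_')]

print(public_names)
```

The same filter works on a real `response` returned by `requests.get`.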

## Response Headers

To access the headers of the `response` object, we can use `headers`.

In [None]:
response.headers

{'Date': 'Mon, 08 Mar 2021 16:31:34 GMT', 'Server': 'Apache', 'X-Powered-By': 'PHP/7.2.34', 'Strict-Transport-Security': 'max-age=63072000; includeSubDomains; preload', 'X-Content-Type-Options': 'nosniff', 'Vary': 'Accept-Encoding,User-Agent', 'Content-Encoding': 'gzip', 'Content-Length': '16586', 'Keep-Alive': 'timeout=5, max=100', 'Connection': 'Keep-Alive', 'Content-Type': 'text/html; charset=ISO_8859-10'}

Note: `headers` is a special dictionary: it’s made just for HTTP headers. According to RFC 7230, HTTP Header names are case-insensitive.

So, we can access the headers using any capitalization we want:

In [None]:
response.headers['Content-Type']

'text/html; charset=ISO_8859-10'

Note that we can also use the `get` method to do the same thing. The difference is what happens when the header we are looking for does not exist: square brackets raise a `KeyError`, while `get` returns `None` (or a default value passed as the second argument).

In [None]:
response.headers.get('content-type')

'text/html; charset=ISO_8859-10'
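
To see both behaviors side by side without opening a website, the sketch below builds a `CaseInsensitiveDict` by hand (the class that `requests` uses for `response.headers`). The header name `X-Not-There` and the variable names are invented for the example.

```python
from requests.structures import CaseInsensitiveDict

# A hand-built stand-in for response.headers.
headers = CaseInsensitiveDict({'Content-Type': 'text/html; charset=ISO_8859-10'})

# Any capitalization finds the same entry.
same_value = headers['CONTENT-TYPE'] == headers.get('content-type')

# A missing header: get returns None (or the given default),
# while square brackets raise KeyError.
missing = headers.get('X-Not-There')
fallback = headers.get('X-Not-There', 'unknown')
try:
    headers['X-Not-There']
    raised_key_error = False
except KeyError:
    raised_key_error = True

print(same_value, missing, fallback, raised_key_error)
```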

## Cookies

Some websites use cookies: small pieces of data that the website asks your browser to store on your device and send back with later requests. To learn more about cookies, visit https://www.allaboutcookies.org/cookies/.

We can access the cookies set by a website after we open it using the `cookies.get_dict()` method. Note that some websites don't set cookies unless you are logged in.

In [None]:
response.cookies.get_dict()

{}

Websites often read their own stored cookies in later interactions so that you can continue from where you left off.

Using `requests`, you can also send websites your own custom cookies. To send your own cookies to the server, use the `cookies` parameter:

In [None]:
url = 'https://httpbin.org/cookies'

cookies = {'cookies_are': 'working'}

response = requests.get(url, cookies=cookies)

print(response.text)

{
  "cookies": {
    "cookies_are": "working"
  }
}
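
To carry cookies from one request to the next, `requests` also offers a `Session` object, which keeps a cookie jar and attaches its cookies to later requests. This goes slightly beyond the example above and is only a sketch: the cookie is set by hand (standing in for one a server would normally set), and the request is prepared but never sent, so no network access is needed.

```python
import requests

session = requests.Session()

# Pretend an earlier response stored this cookie in the session.
session.cookies.set('cookies_are', 'working')

# Build (but do not send) a request to see what the session would attach.
request = requests.Request('GET', 'https://httpbin.org/cookies')
prepared = session.prepare_request(request)

print(prepared.headers.get('Cookie'))
```

In real use, calling `session.get(url)` repeatedly would send the stored cookies automatically.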



## Custom Headers

If you’d like to add HTTP headers to a request, simply pass in a `dict` to the `headers` parameter.

This is often useful when a website only opens on specific devices or browsers. This way you can mimic those devices or browsers (to some extent).

In [None]:
url = 'https://api.github.com/some/endpoint'
headers = {'user-agent': 'my-app/0.0.1'}
response = requests.get(url, headers=headers)
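
To check which headers would actually be sent, one option is to prepare the request through a `Session` without sending it. This is a sketch that assumes the default session headers of `requests`; it shows the custom `user-agent` replacing the library's default one while the other defaults stay in place.

```python
import requests

url = 'https://api.github.com/some/endpoint'
headers = {'user-agent': 'my-app/0.0.1'}

session = requests.Session()

# Prepare the request only; nothing is sent over the network.
prepared = session.prepare_request(requests.Request('GET', url, headers=headers))

# Header names are case-insensitive, so either spelling works here.
print(prepared.headers['User-Agent'])
print(dict(prepared.headers))
```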

## Timeouts

You can tell `requests` to stop waiting for a response after a given number of seconds with the `timeout` parameter.

Nearly all production code should use this parameter in nearly all requests. Failure to do so can cause your program to hang indefinitely.

Note that `timeout` is not a time limit on the entire response download; rather, an exception is raised if the server has not issued a response for timeout seconds (more precisely, if no bytes have been received on the underlying socket for timeout seconds). If no timeout is specified explicitly, requests do not time out.

In [None]:
requests.get('https://github.com/', timeout=0.1)

<Response [200]>
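
When the time limit is exceeded, `requests` raises an exception from the `requests.exceptions.Timeout` family. The helper below is a minimal sketch of catching it; the function name `fetch_or_none` and the timeout values are arbitrary choices for illustration. Passing a tuple sets separate limits for connecting and for reading.

```python
import requests


def fetch_or_none(url, timeout=(3.05, 10)):
    """Return the response, or None if the request timed out.

    timeout is (connect seconds, read seconds); the values are arbitrary.
    """
    try:
        return requests.get(url, timeout=timeout)
    except requests.exceptions.Timeout:
        # Covers both connect timeouts and read timeouts.
        return None
```

A call such as `fetch_or_none('https://github.com/')` then returns either a response object or `None`, so a crawler can skip slow pages instead of stopping.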

## Response Status Codes

When you issue a request, the response will have a status code. This code shows whether the request succeeded (200), the page was not found (404), the server ran into an error (500), and so on.

These status codes can help you automate the process of crawling much more efficiently.

In [None]:
url = 'https://www.marquette.edu/dsakjhdasiudasjk'
response = requests.get(url)

In [None]:
response.status_code

404

In [None]:
requests.codes.ok

200

In [None]:
response.status_code == requests.codes.ok

False
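
As a rough illustration of how status codes can drive a crawler's decisions, the helper below labels a code by its standard phrase and by its class (the first digit). It uses only Python's built-in `http.HTTPStatus`; the function name `describe_status` is made up for the example.

```python
from http import HTTPStatus


def describe_status(status_code):
    """Return a label such as '404 Not Found (client error)'."""
    categories = {
        1: 'informational',
        2: 'success',
        3: 'redirection',
        4: 'client error',
        5: 'server error',
    }
    # HTTPStatus raises ValueError for codes it does not know.
    phrase = HTTPStatus(status_code).phrase
    category = categories[status_code // 100]
    return f'{status_code} {phrase} ({category})'


for code in (200, 404, 500):
    print(describe_status(code))
```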

The `get` method does not raise an exception (error) when the server responds with an unsuccessful status code. It is up to you to check the status and decide how the program should proceed.

But sometimes, you might want to intentionally raise an exception (error) if the request was unsuccessful. To do so, you can simply call the `raise_for_status` method on the response object.

In [None]:
# response.raise_for_status()
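
For a 4xx or 5xx status, `raise_for_status` raises `requests.exceptions.HTTPError`, which can be caught like any other exception. The sketch below uses a hand-built response with status 404 (an assumption made only so the example runs without a network call); with a real response the `try` block looks the same.

```python
import requests

# A hand-built response, used only so the example runs offline.
fake_response = requests.Response()
fake_response.status_code = 404

try:
    fake_response.raise_for_status()
    outcome = 'no error'
except requests.exceptions.HTTPError as error:
    # The failing response is attached to the exception.
    outcome = f'HTTP error with status {error.response.status_code}'

print(outcome)
```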

## Text and Binary Responses

To access the crawled page content, you have two main attributes: `text` and `content`.

The difference between the two is that `text` returns a string (`str`) but `content` returns `bytes`. The advantage of working with `bytes` is that they are not limited to text; that is, we can also access an image or a video if the URL refers directly to one.

Warning: It is strongly recommended that you open files in binary mode.

In [None]:
url = 'https://www.marquette.edu/'
response = requests.get(url)  # request the page so that `response` matches this url

print(type(response.text)) # returns text content
print(type(response.content)) # works for text and any other content type such as image

<class 'str'>
<class 'bytes'>
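
One place where the binary-mode warning above applies is saving downloaded bytes to disk. The sketch below is only an illustration: the `content` variable is a hand-made stand-in for `response.content`, and the file name `page.bin` is arbitrary.

```python
import tempfile
from pathlib import Path

# Stand-in for response.content; any bytes object behaves the same way.
content = 'Marquette café'.encode('utf-8')

with tempfile.TemporaryDirectory() as folder:
    path = Path(folder) / 'page.bin'

    # 'wb' writes the bytes exactly as received, with no decoding step.
    with open(path, 'wb') as file:
        file.write(content)

    round_trip = path.read_bytes()

print(round_trip == content)
```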


## JSON Response

In some cases, we might want to retrieve JSON data from a URL. To do so, we don't need to do anything special in `get`. After getting the URL, you can call the `json` method to parse the response body into Python objects. For this endpoint, the result is a list of dictionaries.

In [None]:
response = requests.get('https://api.github.com/events')

In [None]:
json_resp = response.json()

In [None]:
print(type(json_resp))
print(type(json_resp[0]))
print(json_resp)
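
Once the JSON is parsed, it is ordinary Python data and can be processed with loops or comprehensions. The sketch below uses an invented two-item sample in the same general shape (a JSON array of objects) and Python's built-in `json.loads` in place of `response.json()`; the keys `id` and `type` are assumptions for the illustration.

```python
import json

# Invented sample: a JSON array of objects.
sample_text = '[{"id": "1", "type": "PushEvent"}, {"id": "2", "type": "WatchEvent"}]'

events = json.loads(sample_text)

# Pull one field out of every dictionary in the list.
event_types = [event['type'] for event in events]

print(type(events), type(events[0]))
print(event_types)
```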

## POST requests

Sometimes, you want to send form-encoded data, much like an HTML form does. To do this, simply pass a dictionary to the `data` argument. Your dictionary of `data` will automatically be form-encoded when the request is made:

In [None]:
payload = {'key1': 'value1', 'key2': 'value2'}

In [None]:
response = requests.post("https://httpbin.org/post", data=payload)

In [None]:
print(response.text)

# OR print(response.json())

{
  "args": {}, 
  "data": "", 
  "files": {}, 
  "form": {
    "key1": "value1", 
    "key2": "value2"
  }, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Content-Length": "23", 
    "Content-Type": "application/x-www-form-urlencoded", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.23.0", 
    "X-Amzn-Trace-Id": "Root=1-60465168-277a941274853219502b47ce"
  }, 
  "json": null, 
  "origin": "34.105.42.245", 
  "url": "https://httpbin.org/post"
}
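
To look at the form encoding itself, the request can be prepared without being sent. This is a small sketch using `requests.Request(...).prepare()`; it shows the encoded body together with the `Content-Type` and `Content-Length` headers that `requests` fills in.

```python
import requests

payload = {'key1': 'value1', 'key2': 'value2'}

# Build the request without sending it, to inspect what would be sent.
prepared = requests.Request('POST', 'https://httpbin.org/post', data=payload).prepare()

print(prepared.body)
print(prepared.headers['Content-Type'])
print(prepared.headers['Content-Length'])
```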



There are times that you may want to send data that is not form-encoded. If you pass in a `string` instead of a `dict`, that data will be posted directly.

In [None]:
import json

In [None]:
payload = {'key1': 'value1', 'key2': 'value2'}

In [None]:
json.dumps(payload) # to convert the dictionary to a string

'{"key1": "value1", "key2": "value2"}'

In [None]:
response = requests.post('https://httpbin.org/post', data=json.dumps(payload))

In [None]:
print(response.text)

{
  "args": {}, 
  "data": "{\"key1\": \"value1\", \"key2\": \"value2\"}", 
  "files": {}, 
  "form": {}, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Content-Length": "36", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.23.0", 
    "X-Amzn-Trace-Id": "Root=1-60465168-54c36eeb585570f278c6b9e0"
  }, 
  "json": {
    "key1": "value1", 
    "key2": "value2"
  }, 
  "origin": "34.105.42.245", 
  "url": "https://httpbin.org/post"
}
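
As an alternative to calling `json.dumps` yourself, `requests` also accepts a `json` parameter that serializes the dictionary and sets the `Content-Type` header to `application/json`. The sketch below prepares such a request without sending it, so the result can be inspected offline.

```python
import json

import requests

payload = {'key1': 'value1', 'key2': 'value2'}

# Build the request without sending it; json= serializes the dictionary for us.
prepared = requests.Request('POST', 'https://httpbin.org/post', json=payload).prepare()

print(prepared.headers['Content-Type'])

# The body holds the serialized JSON; parsing it gives back the original data.
print(json.loads(prepared.body) == payload)
```

To actually send it, the equivalent call would be `requests.post('https://httpbin.org/post', json=payload)`.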

