
# Web Crawling in Python

In [1]:
import requests

In [2]:
url = 'https://www.marquette.edu/'
response = requests.get(url)

In [None]:
dir(response)

## Response Headers

In [None]:
response.headers

Note: `response.headers` behaves like a dictionary, but it is a special one made just for HTTP headers. According to RFC 7230, HTTP header names are case-insensitive.

So, we can access the headers using any capitalization we want:

In [None]:
response.headers['Content-Type']

In [None]:
response.headers.get('content-type')
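The case-insensitive lookup can be tried without a network call. The sketch below builds a `CaseInsensitiveDict` (the class from `requests.structures` that backs `response.headers`) by hand; the header value is a made-up example, not one returned by any real server.

```python
from requests.structures import CaseInsensitiveDict

# Hypothetical header value, used only for illustration
headers = CaseInsensitiveDict({'Content-Type': 'text/html; charset=UTF-8'})

# Every capitalization resolves to the same entry
same_value = (
    headers['content-type']
    == headers['CONTENT-TYPE']
    == headers['Content-Type']
)
print(same_value)

# The original capitalization is kept when listing the keys
print(list(headers.keys()))
```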

## Cookies

In [None]:
response.cookies.get_dict()

To send your own cookies to the server, you can use the cookies parameter:

In [None]:
url = 'https://httpbin.org/cookies'

cookies = {'cookies_are': 'working'}

response = requests.get(url, cookies=cookies)

print(response.text)
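To see what the `cookies` parameter does before anything is sent, one option is to prepare the request and inspect it. This is a minimal sketch that reuses the httpbin URL and cookie from the cell above; nothing is sent over the network.

```python
import requests

url = 'https://httpbin.org/cookies'
cookies = {'cookies_are': 'working'}

# Build the request without sending it
prepared = requests.Request('GET', url, cookies=cookies).prepare()

# The cookie dictionary ends up in the Cookie request header
cookie_header = prepared.headers['Cookie']
print(cookie_header)
```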

## Custom Headers

If you’d like to add HTTP headers to a request, simply pass in a `dict` to the `headers` parameter.

In [9]:
url = 'https://api.github.com/some/endpoint'
headers = {'user-agent': 'my-app/0.0.1'}
response = requests.get(url, headers=headers)
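If the same header should go out with every request, a `requests.Session` can carry it as a default. The sketch below is one possible approach, not the only one; it prepares a request through the session so the merged headers can be inspected without contacting the server.

```python
import requests

session = requests.Session()
# Headers set on the session are merged into every request it sends
session.headers.update({'user-agent': 'my-app/0.0.1'})

request = requests.Request('GET', 'https://api.github.com/some/endpoint')
prepared = session.prepare_request(request)

# Header names are case-insensitive, so either spelling works here
user_agent = prepared.headers['User-Agent']
print(user_agent)
```

To actually send the request, `session.get(url)` can be used in place of `requests.get(url, headers=headers)`.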

## Timeouts

You can tell Requests to stop waiting for a response after a given number of seconds with the `timeout` parameter.

Nearly all production code should use this parameter in nearly all requests. Failure to do so can cause your program to hang indefinitely.

### Note

`timeout` is not a time limit on the entire response download; rather, an exception is raised if the server has not issued a response for `timeout` seconds (more precisely, if no bytes have been received on the underlying socket for `timeout` seconds). If no timeout is specified explicitly, requests do not time out.

In [None]:
# The timeout is deliberately tiny, so this call is expected to raise a Timeout exception
requests.get('https://github.com/', timeout=0.001)
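In a script, a timeout is usually caught rather than left to stop the program. Here is a minimal sketch; `fetch_with_timeout` is a hypothetical helper name, and the default `(3.05, 10)` is an arbitrary choice of connect and read timeouts in seconds.

```python
import requests


def fetch_with_timeout(url, timeout=(3.05, 10)):
    """Return the response, or None if the request timed out.

    `timeout` may be a single number or a (connect, read) tuple.
    """
    try:
        return requests.get(url, timeout=timeout)
    except requests.exceptions.Timeout:
        # Covers both connect timeouts and read timeouts
        return None
```

With a value as small as the one in the cell above, `fetch_with_timeout('https://github.com/', timeout=0.001)` would be expected to return `None` instead of raising.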

## Response Status Codes

In [None]:
# This path should not exist, so the server is expected to answer with a 404
url = 'https://www.marquette.edu/dsakjhdasiudasjk'
response = requests.get(url)

In [None]:
response.status_code

In [None]:
requests.codes.ok

In [None]:
response.status_code == requests.codes.ok

In [None]:
response.raise_for_status()
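`raise_for_status()` raises `requests.exceptions.HTTPError` for 4xx and 5xx responses and does nothing otherwise. The sketch below wraps it in a hypothetical helper, `describe_status`, and exercises it on a hand-built `Response` object so that no network access is needed.

```python
import requests


def describe_status(response):
    """Summarize whether a response indicates success or an HTTP error."""
    try:
        response.raise_for_status()
    except requests.exceptions.HTTPError as error:
        return f'request failed: {error}'
    return 'request succeeded'


# A hand-built response standing in for a real one
fake_response = requests.Response()
fake_response.status_code = 404
result = describe_status(fake_response)
print(result)
```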

## Text and Binary Responses

In [None]:
url = 'https://www.marquette.edu/'
response = requests.get(url)

print(type(response.text))     # str: the body decoded to text
print(type(response.content))  # bytes: the raw body, which works for text and any other content type such as images
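Because `response.content` is raw bytes, it can be written straight to disk for non-text content such as images. The following is a sketch only: `save_body` is a hypothetical helper, and a `SimpleNamespace` with a few made-up bytes stands in for a real response so the example runs offline.

```python
import os
import tempfile
from types import SimpleNamespace


def save_body(response, path):
    """Write the raw response body to `path` and return the file size."""
    # 'wb' keeps the bytes exactly as received
    with open(path, 'wb') as file_handle:
        file_handle.write(response.content)
    return os.path.getsize(path)


# Stand-in for a response whose body is binary data (assumption for the demo)
stand_in = SimpleNamespace(content=b'\x89PNG\r\n\x1a\n')
target = os.path.join(tempfile.mkdtemp(), 'image.png')
bytes_written = save_body(stand_in, target)
print(bytes_written)
```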

### Warning

When uploading a file with Requests, it is strongly recommended that you open the file in binary mode. This is because Requests may attempt to provide the `Content-Length` header for you, and if it does, this value will be set to the number of bytes in the file. Errors may occur if you open the file in text mode.
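As an illustration of the warning, the sketch below opens a file in binary mode (`'rb'`) and attaches it through the `files` parameter. The file name, its contents, and the form field name `file` are assumptions made for the demo; the request is only prepared, not sent.

```python
import os
import tempfile

import requests

# Create a small throwaway file to upload (contents are made up)
path = os.path.join(tempfile.mkdtemp(), 'report.txt')
with open(path, 'wb') as file_handle:
    file_handle.write('caf\u00e9\n'.encode('utf-8'))

# Open in binary mode so byte counts match what is actually sent
with open(path, 'rb') as file_handle:
    request = requests.Request(
        'POST', 'https://httpbin.org/post', files={'file': file_handle}
    )
    prepared = request.prepare()

content_type = prepared.headers['Content-Type']
content_length = int(prepared.headers['Content-Length'])
print(content_type)
print(content_length == len(prepared.body))
```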

## JSON Response

In [None]:
response = requests.get('https://api.github.com/events')

In [None]:
json_resp = response.json()

In [None]:
json_resp
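`response.json()` returns ordinary Python objects, so the result can be processed with standard tools. The sketch below assumes the response is a list of dictionaries that each have a `type` key, as GitHub event objects do; `sample_events` is made-up data standing in for `json_resp`, and `count_event_types` is a hypothetical helper.

```python
from collections import Counter

# Made-up stand-in for the parsed JSON response
sample_events = [
    {'type': 'PushEvent', 'repo': {'name': 'octocat/hello-world'}},
    {'type': 'WatchEvent', 'repo': {'name': 'octocat/hello-world'}},
    {'type': 'PushEvent', 'repo': {'name': 'octocat/spoon-knife'}},
]


def count_event_types(events):
    """Count how often each event type occurs in a list of event dicts."""
    return Counter(event['type'] for event in events)


type_counts = count_event_types(sample_events)
print(type_counts.most_common())
```

With a live response, `count_event_types(json_resp)` would be the equivalent call.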

## POST requests

Typically, you want to send some form-encoded data, much like an HTML form. To do this, simply pass a dictionary to the `data` argument. Your dictionary of data will automatically be form-encoded when the request is made:

In [None]:
payload = {'key1': 'value1', 'key2': 'value2'}

In [None]:
response = requests.post("https://httpbin.org/post", data=payload)

In [None]:
print(response.text)

# OR print(response.json())
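To check what "form-encoded" means in practice, the request can be prepared and inspected before sending. This sketch reuses the payload and URL from the cells above and performs no network call.

```python
import requests

payload = {'key1': 'value1', 'key2': 'value2'}

# Build the POST request without sending it
prepared = requests.Request(
    'POST', 'https://httpbin.org/post', data=payload
).prepare()

form_body = prepared.body
form_content_type = prepared.headers['Content-Type']
print(form_body)
print(form_content_type)
```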

There are times when you may want to send data that is not form-encoded. If you pass in a `string` instead of a `dict`, that data will be posted directly.

In [None]:
import json

In [None]:
payload = {'key1': 'value1', 'key2': 'value2'}

In [None]:
response = requests.post('https://httpbin.org/post', data=json.dumps(payload))

In [None]:
print(response.text)
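A string passed to `data` is posted as-is and no `Content-Type` header is added for it. As an alternative, `requests` also accepts a `json` parameter that serializes the dictionary and sets the header. The sketch below compares the two by preparing the requests without sending them.

```python
import json

import requests

payload = {'key1': 'value1', 'key2': 'value2'}
url = 'https://httpbin.org/post'

# Option 1: serialize by hand and pass the string through `data`
manual = requests.Request('POST', url, data=json.dumps(payload)).prepare()

# Option 2: let requests serialize the dictionary via `json`
automatic = requests.Request('POST', url, json=payload).prepare()

print('Content-Type' in manual.headers)
print(automatic.headers['Content-Type'])
```

When the server checks the content type, the second form saves having to set the header manually.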

## Using XPath: A Primer

In [None]:
import requests
from lxml import html

In [None]:
url = 'https://www.marquette.edu/'
response = requests.get(url)

In [None]:
tree = html.fromstring(response.content)

In [None]:
link_labels = tree.xpath('//a/text()')
link_urls = tree.xpath('//a/@href')

In [None]:
link_labels

In [None]:
link_urls

In [None]:
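# Pair each link's text with its own href. Zipping the two separate XPath
# results can misalign them, because some anchors have no text or several text nodes.
label_url_pairs = [(a.text_content().strip(), a.get('href')) for a in tree.xpath('//a[@href]')]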

In [None]:
label_url_pairs

In [None]:
cleaned_label_url_pairs = []

# Keep only absolute links (those starting with http or https)
for label, link_url in label_url_pairs:
    if link_url.startswith("http"):
        cleaned_label_url_pairs.append((label, link_url))

In [None]:
cleaned_label_url_pairs
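Filtering on `http` drops relative links such as `/about`. One possible refinement is to resolve them against the page URL with `urllib.parse.urljoin`. In the sketch below, `SAMPLE_HTML`, the base URL, and the helper name `extract_links` are all assumptions for illustration; a fixed HTML string is used so the example runs offline.

```python
from urllib.parse import urljoin

from lxml import html

# Made-up page source standing in for response.content
SAMPLE_HTML = """
<html><body>
  <a href="/about">About <b>us</b></a>
  <a href="https://example.com/news">News</a>
  <a href="#top"></a>
  <a>No link here</a>
</body></html>
"""


def extract_links(page_source, base_url):
    """Return (label, absolute_url) pairs for every anchor with an href."""
    tree = html.fromstring(page_source)
    pairs = []
    for anchor in tree.xpath('//a[@href]'):
        # text_content() also collects text inside nested tags
        label = anchor.text_content().strip()
        absolute_url = urljoin(base_url, anchor.get('href'))
        if absolute_url.startswith('http'):
            pairs.append((label, absolute_url))
    return pairs


links = extract_links(SAMPLE_HTML, 'https://www.example.com/')
for label, link_url in links:
    print(repr(label), link_url)
```

For a live page, `extract_links(response.content, response.url)` would be the corresponding call.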