# Chapter 3. First Web Scraping
### Downloading web pages (p. 72)
`$ echo "requests==2.23.0" >> requirements.txt` -- had to change to 2.22.0 <br>
Will download __[this Columbia sample web page](http://www.columbia.edu/~fdc/sample.html)__

In [None]:
!pip install -r requirements.txt

In [3]:
import requests
url = 'http://www.columbia.edu/~fdc/sample.html'
response = requests.get(url)
response.status_code
response.text
response.headers
response.request.headers
response.request
response.request.url

'http://www.columbia.edu/~fdc/sample.html'

__[requests module docs](https://requests.readthedocs.io/en/master/)__ <br>
__[status codes](https://httpstatuses.com/)__ They are also described in the `http.HTTPStatus` enum with convenient constant names, such as OK, NOT_FOUND, or FORBIDDEN
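
As a quick illustration of those constant names, here is a minimal sketch that uses only the standard library. The `describe` helper is a hypothetical name introduced for this example, not something from the book.

```python
from http import HTTPStatus

def describe(code):
    """Return 'NAME (phrase)' for a numeric HTTP status code."""
    status = HTTPStatus(code)  # raises ValueError for unknown codes
    return f'{status.name} ({status.phrase})'

print(describe(200))
print(describe(404))
# HTTPStatus members are integers, so they compare equal to plain ints
print(HTTPStatus.FORBIDDEN == 403)
```

With `requests`, this should allow comparisons such as `response.status_code == HTTPStatus.OK` instead of a bare `200`.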

`$ echo "beautifulsoup4==4.8.2" >> requirements.txt`   __[Beautiful Soup doc](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)__

In [2]:
import requests
from bs4 import BeautifulSoup
url = 'http://www.columbia.edu/~fdc/sample.html'
response = requests.get(url)

page = BeautifulSoup(response.text, 'html.parser')
page.title
page.title.string
page.find_all('h3')

ModuleNotFoundError: No module named 'bs4'

Extract the text on the section for Special Characters. Stop when you reach the next `<h3>` tag:

In [None]:
link_section = page.find('h3', attrs={'id':'chars'}) # the <h3> tag with id="chars"
section = []
for el in link_section.next_elements:
    if el.name == 'h3':
        break
    section.append(el.string or '') # None if el has no text

result = ''.join(section)
result

In [None]:
import re
page.find_all( re.compile('(h2|h3)'))  #regex in find_all

### Crawling the web (p. 79)

Download the whole __[test_site directory](https://github.com/PacktPublishing/Python-Automation-Cookbook-Second-Edition/tree/master/Chapter03/test_site)__  using DownGit<br>
Start the server below with a bang command, then check the browser at __[http://localhost:8000](http://localhost:8000)__

In [34]:
!cd test_site; python simple_delay_server.py

Starting server, use <Ctrl-C> to stop
127.0.0.1 - - [24/Jul/2022 15:00:29] "GET / HTTP/1.1" 200 -
127.0.0.1 - - [24/Jul/2022 15:00:30] "GET /files/b93bec5d9681df87e6e8d5703ed7cd81-2.html HTTP/1.1" 200 -
127.0.0.1 - - [24/Jul/2022 15:00:30] "GET /files/5eabef23f63024c20389c34b94dee593-1.html HTTP/1.1" 200 -
127.0.0.1 - - [24/Jul/2022 15:00:31] "GET /files/33714fc865e02aeda2dabb9a42a787b2-0.html HTTP/1.1" 200 -
127.0.0.1 - - [24/Jul/2022 15:00:31] "GET /files/archive-september-2018.html HTTP/1.1" 200 -
127.0.0.1 - - [24/Jul/2022 15:00:32] "GET /index.html HTTP/1.1" 200 -
^C
Traceback (most recent call last):
  File "/Users/yuri/usr/python/py-autoCookBook/test_site/simple_delay_server.py", line 25, in <module>
    server.serve_forever()
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/socketserver.py", line 232, in serve_forever
    ready = selector.select(poll_interval)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/selectors.py", line 

From the same repository, download `ch03-crawl_web.py` and search for references to `python` or `crocodile`. <br>
Run either the server above or this call from a separate terminal, since Jupyter executes only one bang command at a time:
```
$ python ch03-crawl_web.py http://localhost:8000/ -p python
```

In [None]:
!python ch03-crawl_web.py http://localhost:8000/ -p python

##### Components of crawl_web.py
1. A loop that goes through all the found links, in the `main` function:

In [None]:
def main(base_url, to_search):
    checked_links = set()
    to_check = [base_url]
    max_checks = 10

    while to_check and max_checks:
        link = to_check.pop()
        links = process_link(link, to_search)
        checked_links.add(link)
        for link in links:
            if link in checked_links:
                continue
            checked_links.add(link)
            to_check.append(link)
        max_checks -= 1

2. Download and parse links in `process_link`:

In [None]:
import logging
from urllib.parse import urlparse
import http.client
def process_link(source_link, pat):
    logging.info(f'extracting links from {source_link}')
    result = requests.get(source_link)
    if result.status_code != http.client.OK:
        logging.error(f'Failed to retrieve {source_link}: {result}')
        return []
    if 'html' not in result.headers['Content-type']: #skip PDF
        logging.info(f'Not HTML: {source_link}')
        return []
    page = BeautifulSoup(result.text,'html.parser')
    search_text(source_link,page,pat)
    parsed_source = urlparse(source_link) # divides URL to elements: http site path etc
    return get_links(parsed_source,page)

def search_text(source_link, page, pat):
    '''print elements with text pattern'''
    for el in page.find_all(text = re.compile(pat,flags=re.IGNORECASE) ) :
        print(f'Link {source_link} ==> {el}')

3. The `get_links` function retrieves all links on a page:

In [None]:
from urllib.parse import urljoin
def get_links(parsed_source,page):
    '''retrieve links on the page'''
    links = []
    for el in page.find_all('a'): # <a> elements
        link = el.get('href')
        if not link: continue
        if link.startswith('#'): continue  # inside page
        if link.startswith('mailto:'): continue
        if not link.startswith('http'):    # local link
            netloc = parsed_source.netloc
            scheme = parsed_source.scheme
            path = urljoin(parsed_source.path, link)
            link = f'{scheme}://{netloc}{path}'
        if parsed_source.netloc not in link: # accept only same domain
            continue
        links.append(link)
    return links
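
To see how the three components fit together without running the test server, here is a minimal sketch over a made-up in-memory site. `FAKE_SITE`, `resolve_links`, and `crawl` are hypothetical names for this example. Two choices differ from the functions above and are assumptions of the sketch: relative links are resolved by calling `urljoin` on the full source URL, and links are taken from the front of the queue so pages are visited in the order they were found.

```python
from urllib.parse import urljoin, urlparse

# Hypothetical in-memory "site": URL -> href values found on that page
FAKE_SITE = {
    'http://localhost:8000/': ['index.html', 'files/a.html', '#top'],
    'http://localhost:8000/index.html': ['files/a.html', 'mailto:me@example.com'],
    'http://localhost:8000/files/a.html': ['b.html', 'http://example.com/out.html'],
    'http://localhost:8000/files/b.html': ['/index.html'],
}

def resolve_links(source_url, hrefs):
    """Turn raw href values into absolute URLs on the same host."""
    source_netloc = urlparse(source_url).netloc
    links = []
    for href in hrefs:
        if not href or href.startswith(('#', 'mailto:')):
            continue  # skip in-page anchors and e-mail links
        absolute = urljoin(source_url, href)  # handles relative and absolute hrefs
        if urlparse(absolute).netloc != source_netloc:
            continue  # accept only the same domain
        links.append(absolute)
    return links

def crawl(base_url, max_checks=10):
    """Visit pages starting at base_url and return them in visiting order."""
    checked_links = {base_url}
    to_check = [base_url]
    visited = []
    while to_check and max_checks:
        link = to_check.pop(0)  # oldest link first
        visited.append(link)
        for found in resolve_links(link, FAKE_SITE.get(link, [])):
            if found not in checked_links:
                checked_links.add(found)
                to_check.append(found)
        max_checks -= 1
    return visited

for url in crawl('http://localhost:8000/'):
    print(url)
```

The `max_checks` counter plays the same role as in `main`: it caps the number of pages processed so a large site cannot keep the loop running indefinitely.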

### Subscribing to feeds
`$ echo "feedparser==5.2.1" >> requirements.txt` -- installation fails with `use_2to3 is invalid` <br>
Since this __[dirty fix](https://pypi.org/project/feedparser/5.2.1/)__ is ugly, I'll skip this section on RSS

### Accessing web APIs
RESTful API using __[JSON](https://www.json.org/)__  -- `requests` has native support<br>
__[RESTful](https://codewords.recurse.com/issues/five/what-restful-actually-means)__ uses GET POST DELETE etc

We will use __[https://jsonplaceholder.typicode.com](https://jsonplaceholder.typicode.com)__ -- It simulates a common case with posts, comments, and other common resources.

In [42]:
import requests
URL = 'https://jsonplaceholder.typicode.com'
result = requests.get(URL+'/posts')  # <Response [200]>
result.json() # 100 posts
result.json()[-1]

{'userId': 10,
 'id': 100,
 'title': 'at nam consequatur ea labore ea harum',
 'body': 'cupiditate quo est a modi nesciunt soluta\nipsa voluptas error itaque dicta in\nautem qui minus magnam et distinctio eum\naccusamus ratione error aut'}

Create a new post with POST

In [55]:
new_post = {'userId' : 10, 'title' : 'a title', 'body' : 'some stuff'}
result = requests.post(URL+'/posts', json=new_post)
result        # <Response [201]>
result.json() #{'userId': 10, 'title': 'a title', 'body': 'some stuff', 'id': 101}
result.headers['Location']

'http://jsonplaceholder.typicode.com/posts/101'

Fetch an existing post with GET

In [None]:
result = requests.get(URL+'/posts/2')  # <Response [200]>
result
result.json()

Use PATCH to update its values. Check the returned resource:

In [58]:
update = {'body' : 'new body'}
result = requests.patch(URL+'/posts/2', json=update) 
result
result.json()

{'userId': 1, 'id': 2, 'title': 'qui est esse', 'body': 'new body'}

### Interacting with forms
__[test site](https://httpbin.org/forms/post)__ renders the form, but internally calls __[the URL](https://httpbin.org/post)__

In [18]:
import requests
from bs4 import BeautifulSoup
import re
response = requests.get('https://httpbin.org/forms/post') # GET /forms/post to fetch the form
page = BeautifulSoup(response.text, 'html.parser')  # name the parser explicitly to avoid a warning
form = page.find('form')
{ field.get('name')    for field in form.find_all( re.compile('input|textarea') )   }


{'comments', 'custemail', 'custname', 'custtel', 'delivery', 'size', 'topping'}

Prepare the data to post as a dictionary, then post it. <br>
(For the API we used the Content-Type `application/json`; a form uses `application/x-www-form-urlencoded` -- 400 if incorrect.)

In [16]:
data = {'custname': "Sean O'Connell", 
        'custtel': '123-456-789', 
        'custemail': 'sean@oconnell.ie', 
        'size': 'small', 
        'topping': ['bacon', 'onion'], 
        'delivery': '20:30', 
        'comments': ''}
response = requests.post('https://httpbin.org/post', data)  # /post to post
response.json()

{'args': {},
 'data': '',
 'files': {},
 'form': {'comments': '',
  'custemail': 'sean@oconnell.ie',
  'custname': "Sean O'Connell",
  'custtel': '123-456-789',
  'delivery': '20:30',
  'size': 'small',
  'topping': ['bacon', 'onion']},
 'headers': {'Accept': '*/*',
  'Accept-Encoding': 'gzip, deflate',
  'Content-Length': '140',
  'Content-Type': 'application/x-www-form-urlencoded',
  'Host': 'httpbin.org',
  'User-Agent': 'python-requests/2.22.0',
  'X-Amzn-Trace-Id': 'Root=1-62df0cba-6461d54e52fd48ab256a1a73'},
 'json': None,
 'origin': '149.117.75.11',
 'url': 'https://httpbin.org/post'}
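
To make the difference between the two content types concrete, the sketch below builds both request bodies for the dictionary posted above using only the standard library. It approximates what is sent over the wire and is not the internals of `requests`; `form_body` and `json_body` are names chosen for this example.

```python
import json
from urllib.parse import urlencode

data = {'custname': "Sean O'Connell",
        'custtel': '123-456-789',
        'custemail': 'sean@oconnell.ie',
        'size': 'small',
        'topping': ['bacon', 'onion'],
        'delivery': '20:30',
        'comments': ''}

# Form encoding: key=value pairs joined by '&'; doseq=True repeats the key
# for each item of a list value (topping=bacon&topping=onion).
form_body = urlencode(data, doseq=True)

# JSON encoding: the kind of body sent when posting with json=data instead.
json_body = json.dumps(data)

print(form_body)
print(len(form_body))
print(json_body)
```

The length of `form_body` comes out at 140 characters, which lines up with the `Content-Length` echoed back by httpbin in the output above.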

__[Cross-Site Request Forgery (CSRF)](https://stackoverflow.com/a/33829607)__ -- first download the form, as shown in the
recipe, obtain the value of the CSRF token, and resubmit it.

In [20]:
#form.find( attrs={'name' : 'token'}).get('value')
form.find( attrs={'name' : 'token'})  # returns None: the httpbin form has no field named 'token' (see the field names above)
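
Since the httpbin form lists no token field among its names, the download, extract, resubmit steps can only be sketched against made-up markup. In the example below, the form HTML, the field name `csrf_token`, and its value are all assumptions, and the standard library `html.parser` stands in for BeautifulSoup to keep the block self-contained.

```python
from html.parser import HTMLParser

# Hypothetical form markup; real sites choose their own token field name.
FORM_HTML = '''
<form method="post" action="/submit">
  <input type="hidden" name="csrf_token" value="abc123">
  <input type="text" name="custname">
</form>
'''

class HiddenFieldParser(HTMLParser):
    """Collect name/value pairs of all hidden <input> fields."""
    def __init__(self):
        super().__init__()
        self.hidden = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'input' and attrs.get('type') == 'hidden':
            self.hidden[attrs.get('name')] = attrs.get('value')

parser = HiddenFieldParser()
parser.feed(FORM_HTML)
token = parser.hidden['csrf_token']

# The token is sent back together with the visible fields.
data = {'custname': 'Sean', 'csrf_token': token}
print(data)
```

In practice the download and the resubmission usually need to share cookies, for example through a `requests.Session`, because the server typically ties the token to the session.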

### Using Selenium for advanced interaction