# Chapter 3. First Web Scraping
### Downloading web pages (p. 72)
`$ echo "requests==2.23.0" >> requirements.txt` -- had to change to 2.22.0 <br>
Will download __[this Columbia sample web page](http://www.columbia.edu/~fdc/sample.html)__

In [8]:
import requests
url = 'http://www.columbia.edu/~fdc/sample.html'
response = requests.get(url)
response.status_code
response.text
response.headers
response.request.headers
response.request
response.request.url

'http://www.columbia.edu/~fdc/sample.html'
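The cell above inspects the response without checking whether the request succeeded. A minimal sketch of a safer download, assuming a hypothetical helper name `fetch_text` and an arbitrary 10-second timeout (neither comes from the book):

```python
import requests


def fetch_text(url, timeout=10):
    """Download url and return the body as text.

    fetch_text and the 10-second default are illustrative assumptions.
    """
    # Without a timeout, requests can wait indefinitely on a stalled server.
    response = requests.get(url, timeout=timeout)
    # Raises requests.HTTPError for 4xx and 5xx status codes.
    response.raise_for_status()
    return response.text
```

Calling `fetch_text('http://www.columbia.edu/~fdc/sample.html')` should return the same HTML as `response.text` above, or raise if the server answers with an error status.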

__[requests module docs](https://requests.readthedocs.io/en/master/)__ <br>
__[status codes](https://httpstatuses.com/)__. They are also described in the `http.HTTPStatus` enum with convenient constant names, such as `OK`, `NOT_FOUND`, or `FORBIDDEN`.
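As a small illustration of the `http.HTTPStatus` enum mentioned above, the sketch below maps a numeric code to its constant name and phrase. The helper name `describe` is made up for this example:

```python
from http import HTTPStatus


def describe(code):
    """Return a readable label for a numeric HTTP status code."""
    status = HTTPStatus(code)  # raises ValueError for unknown codes
    return f'{status.value} {status.name}: {status.phrase}'


# HTTPStatus members are integers, so they compare directly with
# values such as response.status_code.
print(HTTPStatus.OK == 200)
print(describe(404))
```

This allows writing `response.status_code == HTTPStatus.OK` instead of comparing against a bare `200`.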

`$ echo "beautifulsoup4==4.8.2" >> requirements.txt`   __[Beautiful Soup doc](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)__

In [12]:
import requests
from bs4 import BeautifulSoup
url = 'http://www.columbia.edu/~fdc/sample.html'
response = requests.get(url)

page = BeautifulSoup(response.text, 'html.parser')
page.title
page.title.string
page.find_all('h3')

[<h3 id="contents">CONTENTS</h3>,
 <h3 id="basics">1. Creating a Web Page</h3>,
 <h3 id="syntax">2. HTML Syntax</h3>,
 <h3 id="chars">3. Special Characters</h3>,
 <h3 id="convert">4. Converting Plain Text to HTML</h3>,
 <h3 id="effects">5. Effects</h3>,
 <h3 id="lists">6. Lists</h3>,
 <h3 id="links">7. Links</h3>,
 <h3 id="tables">8. Tables</h3>,
 <h3 id="viewing">9. Viewing Your Web Page</h3>,
 <h3 id="install">10. Installing Your Web Page on the Internet</h3>,
 <h3 id="more">11. Where to go from here</h3>,
 <h3 id="fluid">12. Postscript: Cell Phones</h3>]

Extract the text of the Special Characters section, stopping at the next `<h3>` tag:

In [19]:
link_section = page.find('h3', attrs={'id':'chars'}) # the <h3> tag with id="chars"
section = []
for el in link_section.next_elements:
    if el.name == 'h3':
        break
    section.append(el.string or '') # el.string is None unless el has a single string child

result = ''.join(section)
result

'3. Special Characters\n\nHTML special "character entities" start with ampersand (&&) and\nend with semicolon (;;), like "&euro;&euro;" = "€".  The\never-popular "no-break space" is &nbsp;&nbsp;.  There are special\nentity names for accented Latin letters and other West European special\ncharacters such as:\n\n\n\n\n\n\n&auml;&auml;\na-umlaut\n\xa0ä\xa0\n\n\n&Auml;&Auml;\nA-umlaut \n\xa0Ä\xa0\n\n\n&aacute;&aacute;\na-acute \n\xa0á\xa0\n\n\n&agrave;&agrave;\na-grave \n\xa0à\xa0\n\n\n&ntilde;&ntilde;\nn-tilde \n\xa0ñ\xa0\n\n\n&szlig;&szlig;\nGerman double-s\n\xa0ß\xa0\n\n\n&thorn;&thorn;\nIcelandic thorn \n\xa0þ\xa0\n\xa0þ\xa0\n\n\n\n\n\n(The table above is shown in the basic, default style of HTML.  Of course\nthere are many ways to customize the appearance of tables; more\nabout this belowbelow.\n\n\n\nExamples:\n\n\nFor SpanishSpanish you would need:\n&Aacute;&Aacute; (Á),\n&aacute;&aacute; (á),\n&Eacute;&Eacute; (É),\n&eacute;&eacute; (é),\n&Iacute;&Iacute; (Í),\n&iacute;&iacute; (í)
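The output above contains doubled fragments such as `&&` and `belowbelow`. This happens because `next_elements` yields each tag and then the string inside it, and `.string` returns the same text for both. A minimal sketch of one way to avoid this, collecting only the text nodes; the HTML snippet here is a made-up stand-in for the real page, not the actual markup:

```python
from bs4 import BeautifulSoup, NavigableString

# Made-up snippet that mimics the shape of the real section.
html = (
    '<h3 id="chars">3. Special Characters</h3>'
    '<p>start with ampersand (<tt>&amp;</tt>), more about this '
    '<a href="#tables">below</a>.</p>'
    '<h3 id="convert">4. Converting</h3><p>ignored</p>'
)
page = BeautifulSoup(html, 'html.parser')

heading = page.find('h3', attrs={'id': 'chars'})
parts = []
for el in heading.next_elements:
    if getattr(el, 'name', None) == 'h3':
        break  # reached the next section heading
    if isinstance(el, NavigableString):
        parts.append(str(el))  # keep text nodes only, skip the tags

result = ''.join(parts)
print(result)
```

Each piece of text is a `NavigableString` exactly once in the tree, so filtering on that type keeps one copy of every fragment.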

In [21]:
import re
page.find_all(re.compile('(h2|h3)'))  # a regex also works as the tag filter

[<h2>Do-It-Yourself Web Authoring - a beginner's HTML tutorial</h2>,
 <h3 id="contents">CONTENTS</h3>,
 <h3 id="basics">1. Creating a Web Page</h3>,
 <h3 id="syntax">2. HTML Syntax</h3>,
 <h3 id="chars">3. Special Characters</h3>,
 <h3 id="convert">4. Converting Plain Text to HTML</h3>,
 <h3 id="effects">5. Effects</h3>,
 <h3 id="lists">6. Lists</h3>,
 <h3 id="links">7. Links</h3>,
 <h3 id="tables">8. Tables</h3>,
 <h3 id="viewing">9. Viewing Your Web Page</h3>,
 <h3 id="install">10. Installing Your Web Page on the Internet</h3>,
 <h3 id="more">11. Where to go from here</h3>,
 <h3 id="fluid">12. Postscript: Cell Phones</h3>]
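When the set of tag names is fixed, `find_all` also accepts a plain list of names, which may read more clearly than a regex (Beautiful Soup applies a regex with `search()`, so an unanchored pattern matches any tag name that contains it). A minimal sketch on a made-up HTML snippet:

```python
from bs4 import BeautifulSoup

# Made-up snippet for illustration only.
html = '<h2>Title</h2><h3 id="a">One</h3><p>text</p><h3 id="b">Two</h3>'
page = BeautifulSoup(html, 'html.parser')

# A list matches any tag whose name is in the list, in document order.
headings = page.find_all(['h2', 'h3'])
names = [tag.name for tag in headings]
texts = [tag.get_text() for tag in headings]
print(names)
print(texts)
```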

### Crawling the web (p. 79)
Downloaded __[simple_delay_server.py](https://github.com/PacktPublishing/Python-Automation-Cookbook-Second-Edition/blob/master/Chapter03/crawling_web_step1.py)__ <br>
`$ python ch03-simple_delay_server.py` <br>
Then check the browser at __[http://localhost:8000](http://localhost:8000)__

In [24]:
!python ch03-simple_delay_server.py

usage: ch03-simple_delay_server.py [-h] [-p P] url
ch03-simple_delay_server.py: error: the following arguments are required: url
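The error means the script expects a positional `url` argument, so it has to be run as `python ch03-simple_delay_server.py <url>`. Note that the download link above points to `crawling_web_step1.py`, so the saved file may be the crawler rather than the server. The usage line is consistent with an `argparse` setup like the sketch below; the help strings and the purpose of `-p` are assumptions, not taken from the script:

```python
import argparse


def build_parser():
    """Build a parser whose usage line resembles the one shown above."""
    parser = argparse.ArgumentParser(prog='ch03-simple_delay_server.py')
    # Positional argument: required, hence the error when it is missing.
    parser.add_argument('url', help='base URL to work on (assumed meaning)')
    # Optional argument: shown as [-p P] in the usage line.
    parser.add_argument('-p', help='optional parameter (assumed meaning)')
    return parser


parser = build_parser()
args = parser.parse_args(['http://localhost:8000'])
print(args.url, args.p)
```

Calling `parser.parse_args([])` instead would print the usage message and exit, which matches the behaviour seen in the cell above.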
