#  The World Wide Web
FPNP3e ch11


Overview of HTTP
---
- Clients request documents
- Servers respond by providing them
- HTTP is certainly capable of delivering stand-alone documents such as 
  - files, audio, images, and video
- The main purpose is to deliver the [World Wide Web](https://en.wikipedia.org/wiki/World_Wide_Web)
  - allow servers all over the world to publish documents, 
  - through mutual cross-links, 
  - become a single interlinked fabric of information


Hypermedia and URLs
---
- [URLs (uniform resource locators)](https://en.wikipedia.org/wiki/URL) are addresses on the web
  - a URL embedded in a document is known as a hyperlink, or simply a link
  - links are clickable and usually highlighted with underlines
- Hypermedia are images, sound, and video mixed with links
- Hypertext documents contain embedded hyperlinks

```html
<!-- some sample URLs  -->
https://www.python.org/
http://en.wikipedia.org/wiki/Python_(programming_language)
http://localhost:8000/headers
ftp://ssd.jpl.nasa.gov/pub/eph/planets/README.txt
telnet://rainmaker.wunderground.com

<!-- URL with a query string -->
https://www.google.com/search?product=ipad&bInStore=yes

<!-- URL with fragment -->
https://datatracker.ietf.org/doc/html/rfc3986#section-3.2
```

Parsing and Building URLs
---
- Every HTTP URL conforms to the syntax of a generic URI
- The URI generic syntax consists of five components organized hierarchically 
  - in order of decreasing significance from left to right
  ```
  URI = scheme ":" ["//" authority] path ["?" query] ["#" fragment]
  authority = [userinfo "@"] host [":" port]
  ``` 
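The five components can be pulled apart with the regular expression given in RFC 3986, Appendix B. The sketch below is only an illustration of the grammar; the helper name `split_uri` is made up for this example, and real code would normally use `urllib.parse` instead.

```python
import re

# Regular expression from RFC 3986, Appendix B; groups 2, 4, 5, 7 and 9 hold
# the scheme, authority, path, query and fragment respectively.
URI_PATTERN = re.compile(
    r'^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?'
)

def split_uri(uri):
    """Return the five generic URI components as a dict (None if absent)."""
    m = URI_PATTERN.match(uri)  # every part is optional, so this always matches
    return {
        'scheme': m.group(2),
        'authority': m.group(4),
        'path': m.group(5),
        'query': m.group(7),
        'fragment': m.group(9),
    }

parts = split_uri('https://datatracker.ietf.org/doc/html/rfc3986#section-3.2')
for name, value in parts.items():
    print(f'{name:>9}: {value}')
```

A component that is missing from the URI, such as the query here, comes back as `None` rather than as an empty string.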


🔭 Explore
---
- [the components of URIs](https://en.wikipedia.org/wiki/Uniform_Resource_Identifier#Syntax)
- [List of URI schemes](https://en.wikipedia.org/wiki/List_of_URI_schemes)

Parsing URLs
---

In [None]:
# The PSL urllib.parse module is used to interpret and to build URLs
from urllib.parse import urlsplit
u = urlsplit('https://www.google.com/search?product=ipad&bInStore=yes')
tuple(u)

In [None]:
print(f'type: {type(u)}\nscheme: {u.scheme}\nnetloc: {u.netloc}\npath: {u.path}\nquery: {u.query}\nfragment: {u.fragment}')

In [None]:
# The “network location” netloc can have several subordinate pieces, 
# but they are uncommon enough that urlsplit()
# does not break them out as separate items in its tuple. 
# Instead, they are available only as attributes of its result
u = urlsplit('https://brandon:atigdng@localhost:8000/')
print('netloc: ', u.netloc)
print('username: ', u.username)
print('password: ', u.password)
print('hostname: ', u.hostname)
print('port: ', u.port)

In [None]:
# the path and query components can be decomposed further
# &, #, and / are delimiters in a URL, so their literal forms must be escaped
from urllib.parse import parse_qs, parse_qsl, unquote
# intended URL: http://example.com/Q&A/TCP/IP?q=packet loss&t=time elapse
# In Q&A, & -> %26; in TCP/IP, / -> %2F; in 'packet loss', ␣ -> +
u = urlsplit('http://example.com/Q%26A/TCP%2FIP?q=packet+loss&t=time+elapse')
path = [unquote(s) for s in u.path.split('/')]
query = parse_qsl(u.query)

# the initial empty string in the path components appears because the path begins with a slash
print('path components: ', path) 

# parse_qsl() decodes each name/value pair, turning '+' back into a space
print('queries: ', query)

In [None]:
# parse_qsl() gives the query as a list since a query parameter can appear multiple times
# parse_qs()  gives the query as a dictionary
u = urlsplit('http://search.org/?q=one&q=two&p=price')
ql = parse_qsl(u.query)
qd = parse_qs(u.query)
print("Query as list: ", ql)
print("Query as dictionary: ", qd)

Building URLs
---

In [None]:
URL = 'http://example.com/Q%26A/TCP%2FIP?q=packet+loss&t=time+elapse'
u = urlsplit(URL)
path = [unquote(s) for s in u.path.split('/')]
query = parse_qsl(u.query)

from urllib.parse import quote, urlencode, urlunsplit
urlunsplit((u.scheme, u.netloc, 
            '/'.join(quote(p, safe='') for p in path),
            urlencode(query), ''))

In [None]:
s = 'Q&A/TCP IP'
qt = quote(s)
uq = unquote(qt)
print((qt, uq))
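How much gets escaped depends on where the text is going to sit in the URL. A small sketch comparing the quoting functions of `urllib.parse` on the same string (the variable names are only for illustration):

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus

s = 'Q&A/TCP IP'

as_path = quote(s)               # '/' is left alone by default (safe='/')
as_segment = quote(s, safe='')   # escape '/' too: the text is ONE path segment
as_query = quote_plus(s)         # query-string style: space becomes '+'

print(as_path)      # Q%26A/TCP%20IP
print(as_segment)   # Q%26A%2FTCP%20IP
print(as_query)     # Q%26A%2FTCP+IP

# Each form decodes back to the original text with the matching function
print(unquote(as_segment) == s, unquote_plus(as_query) == s)
```

Passing `safe=''` is what the URL-building cell relies on so that the slash in `TCP/IP` is not mistaken for a path separator.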

🔭 Explore
---
- [urllib.parse — Parse URLs into components](https://docs.python.org/3/library/urllib.parse.html#module-urllib.parse)

Relative URLs
---
- In a filesystem, there are 
  - *absolute paths*, which start from the root
  - *relative paths*, which start from the *current working directory*
- Hypertext has a similar hierarchy
  - *An absolute URL* can be accessed from anywhere
  - *Relative URLs* in a document depend on the location of the document
    - *urljoin()* can be used to find the absolute form of a relative URL

In [None]:
# The argument order of urljoin() is the same as that of os.path.join(). 
# First the base URL of the document
# Second the relative URL inside of it
from urllib.parse import urljoin
base = 'http://tools.ietf.org/html/rfc3986'
a1 = urljoin(base, 'rfc7320')
a2 = urljoin(base, '.')
a3 = urljoin(base, '..')
a4 = urljoin(base, '/dailydose/')
a5 = urljoin(base, '?version=1.0')
a6 = urljoin(base, '#section-5.4')

# Find a1-a6 manually, then verify your answers

In [None]:
# for an absolute URL, urljoin will not modify it
urljoin(base, 'https://www.google.com/search?q=apod&btnI=yes')

In [None]:
# a relative URL can omit the scheme but specify everything else.
# only the scheme is copied from the base URL
urljoin(base, '//www.google.com/search?q=apod')

In [None]:
# trailing slash matters in relative URLs
ta1 = urljoin('http://tools.ietf.org/html/rfc3986', 'rfc7320')
ta2 = urljoin('http://tools.ietf.org/html/rfc3986/', 'rfc7320')
print(ta1)
print(ta2)

- robust web applications redirect users from a wrong URL (e.g. one missing its trailing slash) to the correct path
- relative URLs are therefore not necessarily relative to the path provided in the HTTP request
- if the response redirects with a Location header, relative URLs should be resolved against that new location
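The effect of a redirect on relative links can be sketched with `urljoin()` alone. The request URL, the `Location` value, and the link below are all hypothetical; no network request is made.

```python
from urllib.parse import urljoin

requested_url = 'http://example.com/docs'   # what the client asked for
location = '/docs/'                          # hypothetical Location header value
link_in_page = 'intro.html'                  # relative link found in the page

# The Location value may itself be relative, so resolve it against the request
final_url = urljoin(requested_url, location)

wrong = urljoin(requested_url, link_in_page)  # ignores the redirect
right = urljoin(final_url, link_in_page)      # honors the redirect

print(final_url)  # http://example.com/docs/
print(wrong)      # http://example.com/intro.html
print(right)      # http://example.com/docs/intro.html
```

With the Requests library, the URL after redirects is available as `r.url`, which is the value to pass as the base.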

[The Hypertext Markup Language](https://en.wikipedia.org/wiki/HTML)
---
- the standard markup language for documents designed to be displayed in a web browser
- defines the meaning and structure of web content
- often assisted by Cascading Style Sheets (CSS) and scripting languages such as JavaScript


🔭 Explore
---
- [HTML living standard](https://html.spec.whatwg.org/)
- [CSS Snapshot](https://www.w3.org/TR/CSS/)
- [JavaScript](https://developer.mozilla.org/en-US/docs/Web/JavaScript)
- [Document Object Model (DOM)](https://developer.mozilla.org/en-US/docs/Web/API/Document_Object_Model)


[Well-formed document](https://en.wikipedia.org/wiki/Well-formed_document)
---
- An HTML file is expected to be a well-formed document with [well-formed tags](https://en.wikipedia.org/wiki/Well-formed_element) organizing content
  - Tag names are case sensitive in XML and XHTML; HTML itself matches tag names case-insensitively
  - Content must be delimited with a beginning and an end tag
  - Content must be properly nested (parents within roots, children within parents)
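These rules can be tried out with the strict XML parser in the standard library. This is a sketch of XML-style well-formedness only; browsers and HTML parsers are far more forgiving, and the helper name `is_well_formed` is invented for the example.

```python
import xml.etree.ElementTree as ET

def is_well_formed(markup):
    """True if the markup parses as well-formed XML."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True

samples = {
    'nested': '<p>This is <b>bold</b> text.</p>',
    'crossed': '<p>This is <b>bold</p></b>',          # tags overlap
    'unclosed': '<p>This is <b>bold</b> text.',       # no end tag for <p>
    'case mismatch': '<P>Case differs in XML.</p>',   # <P> is not closed by </p>
}
for label, markup in samples.items():
    print(f'{label:>13}: {is_well_formed(markup)}')
```

Only the first sample passes; the other three each break one of the rules listed above.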


Tag Examples
---
```html
<!-- 1. tags are properly nested -->
<p>This is a paragraph with <b>bold</b> and <i>italic</i> words.</p>

<!-- 2. self-contained tags -->
<br> or <br/>

<!-- 3.  the most generic kind of box -->
<div> generic box </div>

<!-- 4. the most generic way to mark running text -->
<span> generic way to mark running text </span>
```

Element selectors
---
```html
<div class="weather">
  <h5 class="city"> Tampa </h5>
  <p class="temperature"> 26 ℃ </p>
</div>

<!-- generic class selectors -->
.city, .temperature

<!-- element class selectors -->
h5.city, p.temperature

<!-- whitespace-concatenated selectors -->
.weather h5
.weather p
```
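The standard library has no CSS selector engine, so the sketch below approximates the three selector styles with the XPath subset of `xml.etree.ElementTree`. It assumes the snippet is well-formed and wraps it in a `<body>` element; libraries such as Beautiful Soup and lxml offer selector support for real pages.

```python
import xml.etree.ElementTree as ET

page = ET.fromstring('''
<body>
  <div class="weather">
    <h5 class="city"> Tampa </h5>
    <p class="temperature"> 26 ℃ </p>
  </div>
</body>
'''.strip())

def texts(path):
    """Text of every element matched by an ElementTree path expression."""
    return [el.text.strip() for el in page.findall(path)]

any_city = texts(".//*[@class='city']")            # roughly  .city
h5_city = texts(".//h5[@class='city']")            # roughly  h5.city
weather_p = texts(".//div[@class='weather']//p")   # roughly  .weather p

print(any_city, h5_city, weather_p)
```

One difference to keep in mind: `[@class='city']` compares the whole attribute value, while a CSS class selector also matches elements that carry several classes.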

🖊️ Practice
---
- Inspect an HTML webpage with developer tools in Firefox

Reading and Writing to a Database 
---
- Web applications usually have a backend database
- e.g. a simple bank web application has a database with tables to save 
  - user accounts and bank accounts
  - checking accounts, savings accounts and credit cards
  - payments, etc.
- the server scripts of the web app interact with the DBMS through [DB-API](https://peps.python.org/pep-0249/) 
  - defined in PEP 249 – Python Database API Specification v2.0
  - supported by all modern Python database connectors
  - the Python connector for SQLite, the sqlite3 module, is built into the PSL
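A minimal sketch of the DB-API calls (connect, execute, fetch, commit) using the built-in `sqlite3` module. The table layout, the sample rows, and the helper `payments_of` are assumptions made for this illustration and are not taken from bank.py.

```python
import sqlite3

# In-memory database; the table and column names are invented for this sketch
db = sqlite3.connect(':memory:')
db.execute('CREATE TABLE payment ('
           ' id INTEGER PRIMARY KEY, debit TEXT, credit TEXT,'
           ' dollars INTEGER, memo TEXT)')

rows = [
    ('brandon', 'psf', 125, 'Registration for PyCon'),
    ('brandon', 'liz', 200, 'Payment for writing that code'),
    ('sam', 'brandon', 25, 'Gas money'),
]
# "?" placeholders let the driver quote the values; avoid building SQL by hand
db.executemany('INSERT INTO payment (debit, credit, dollars, memo)'
               ' VALUES (?, ?, ?, ?)', rows)
db.commit()

def payments_of(db, account):
    """All payments that the account sent or received, oldest first."""
    cursor = db.cursor()
    cursor.execute('SELECT debit, credit, dollars FROM payment'
                   ' WHERE debit = ? OR credit = ? ORDER BY id',
                   (account, account))
    return cursor.fetchall()

for payment in payments_of(db, 'sam'):
    print(payment)
```

Because the calls follow PEP 249, the same pattern should carry over to other connectors, though the placeholder style (`?` here) can differ between drivers.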


🔭 Explore
---
- [A Routine for Building and Talking to a Database](./www/bank.py)
- View the database created above using one of the SQLite GUI applications
  - [SQLiteStudio](https://sqlitestudio.pl/)
  - [DB Browser for SQLite](https://sqlitebrowser.org/)
  - [SQLite Manager](https://addons.mozilla.org/en-US/firefox/addon/sqlite-manager-webext/)

An Insecure Web Application
---
- consists of
  - a SQLite database *bank.db* and accessing program *bank.py*
  - a server application: app_insecure.py
  - 4 template files in [Jinja](https://jinja.palletsprojects.com) template language:
    - [base.html](./www/templates/base.html) contains 
      - page title, used in element \<title\> and \<h1\>
      - page body
    - [login.html](./www/templates/login.html) contains
      - a title and a form for login
    - [index.html](./www/templates/index.html) contains
      - flash messages
      - a list of payments
      - two links
    - [pay.html](./www/templates/pay.html) contains
      - a status and hint string *complaint*
      - the form of payment
  - a css style [style.css](./www/static/style.css)
- Jinja syntax examples
  - double-brace syntax, as in {{ username }}, substitutes the value of the variable username into the page
  - brace-percent syntax, as in {% for %}, marks control statements such as loops and conditionals
- Explore [the Pallets Projects](https://palletsprojects.com/)
- [An Insecure Web Application](./www/app_insecure.py)
  ```bash
  # 1. install modules from Pallets Projects
  pip install Flask Werkzeug colorama watchdog Jinja2 click MarkupSafe itsdangerous Quart

  # Note: in case the installation conflicts with the system site-packages
  # sudo apt purge python3-click python3-prompt-toolkit
  # then rerun step 1
  # Best practice: using a virtual environment
  # [Virtual Environments and Packages](https://docs.python.org/3/tutorial/venv.html)

  # 2. run and play with the web application
  flask --app=app_insecure run
  ```
- What are the vulnerabilities in this web app?


🔭 Explore
---
- [Flask web framework](https://flask.palletsprojects.com)

HTML [Forms](https://www.w3schools.com/html/html_forms.asp) and HTTP Methods 
---
- A simple HTML form
  ```html
  <form action="/search">
    <label>Search: <input name="q"></label>
    <button type="submit">Go</button>
  </form>
  ```
- An HTML form has the default method of GET
  - encode the input fields directly in the URL
     - saved in browsing history
  - suitable for sharing browsing results
  - safe to reload, forward and back
  ```html
  GET /search?q=python+network+programming HTTP/1.1
  Host: searchme.com
  ```
  - don't use GET to send sensitive information
- Form inputs with method POST, PUT or DELETE do not go into the URL
  - a simple form with method POST
    ```html
    <form method="post" action="/donate">
      <label>Charity: <input name="name"></label>
      <label>Amount: <input name="dollars"></label>
      <button type="submit">Donate</button>
    </form>
    ```
  - the inputs are completely encoded in the body of request
    ```html
    POST /donate HTTP/1.1
    Host: example.com
    Content-Type: application/x-www-form-urlencoded
    Content-Length: 36
    
    name=PyCon%20scholarships&dollars=35
    ```
  - reloading, or navigating back and forward onto, the result of a POST can resubmit the form
    - try these operations on the /pay form of the previous insecure web app
    - two techniques to solve the problem:
      - use JavaScript or HTML5 form input constraints to prevent invalid inputs
      - redirect to another URL instead of responding simply with a 200 OK page
  - use GET for read operations and POST for writes
    - many websites have picked the wrong method
    - Could you find one?
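How the same form fields end up either in the URL or in the request body can be sketched with `urlencode()`. The field values are the ones from the donation form; the dictionary and variable names are only illustrative, and nothing is sent over the network.

```python
from urllib.parse import parse_qs, quote, urlencode

fields = {'name': 'PyCon scholarships', 'dollars': 35}

body_plus = urlencode(fields)                   # spaces become '+'
body_pct = urlencode(fields, quote_via=quote)   # spaces become '%20'

get_target = '/donate?' + body_plus             # GET: fields travel in the URL
post_body = body_pct.encode('ascii')            # POST: fields travel in the body
headers = {
    'Content-Type': 'application/x-www-form-urlencoded',
    'Content-Length': str(len(post_body)),      # byte length of the body
}

print(get_target)
print(headers['Content-Length'], post_body)
# Both encodings decode to the same fields on the server side
print(parse_qs(body_plus) == parse_qs(body_pct))
```

Either spelling of the space is accepted when decoding form data, so a server sees identical fields for both bodies.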
  

Safe and Unsafe Cookies
---
- the cookies in the previous insecure web app are easy to guess
  - it simply uses the username as the cookie
  - which can be forged to impersonate legitimate users


🖊️ Practice
---
- run the insecure web app
- use Firefox *web developer tools* to investigate its cookies


In [None]:
# An unauthenticated request gets redirected to the /login page
import requests
r = requests.get('http://localhost:5000/')
print(r.url)

In [None]:
# misuse a cookie
r = requests.get('http://localhost:5000/', cookies={'username': 'brandon'})
print(r.url)

In [None]:
# forge a request to steal money
r = requests.post('http://localhost:5000/pay', 
                  {'account': 'hacker', 'dollars': 100, 'memo': 'Auto-pay'},
                  cookies={'username': 'brandon'})
print(r.url)

- Three safe approaches to creating nonforgeable cookies
  - leave the cookie in plaintext but sign the cookie with a digital signature
  - encrypt the cookie to ciphertext
  - create a purely random string for the cookie
    - e.g. use a standard UUID library
    - save the cookie in a database, a [Redis](https://redis.io/) instance or other short-term storage for persistence

```python
import uuid
from flask import Flask, session
app = Flask(__name__)

# Flask has built-in ability to digitally sign cookies
app.secret_key = 'saiGeij8AiS2ahleahMo5dahveixuV3J'

# (inside a view function) the session object uses the secret key when setting a cookie
session['username'] = username
session['csrf_token'] = uuid.uuid4().hex

# (inside a view function) Flask verifies the signature before trusting any cookie values
username = session.get('username')
```
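What signing buys can be sketched with the standard library alone. This is not how Flask implements its sessions; it is a simplified illustration that uses an HMAC (a keyed message authentication code) in place of a full signature scheme. The key and the `sign()` / `verify()` helpers are hypothetical.

```python
import hashlib
import hmac

SECRET_KEY = b'demo-key-for-illustration-only'  # hypothetical; never hard-code real keys

def _mac(value: str) -> str:
    return hmac.new(SECRET_KEY, value.encode('utf-8'), hashlib.sha256).hexdigest()

def sign(value: str) -> str:
    """Return 'value.mac'; the value stays readable but cannot be altered."""
    return f'{value}.{_mac(value)}'

def verify(cookie: str):
    """Return the value if the MAC matches, otherwise None."""
    value, _, mac = cookie.rpartition('.')
    # compare_digest() avoids leaking how many leading characters matched
    if hmac.compare_digest(mac.encode('utf-8'), _mac(value).encode('utf-8')):
        return value
    return None

cookie = sign('brandon')
forged = 'sam.' + cookie.rpartition('.')[2]  # attacker swaps the name, keeps the MAC

print(verify(cookie))  # brandon
print(verify(forged))  # None
```

Without the secret key an attacker cannot compute a matching MAC for a different username, which is exactly what the plain `username` cookie of the insecure app lacks.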
- Further cookie safety
  - cookies should be transmitted only over HTTPS
  - rely on the browser's [Same-origin policy](https://en.wikipedia.org/wiki/Same-origin_policy) and do not weaken it on the web server

Further vulnerabilities of the insecure web app
---
- also vulnerable to 
  - [cross-site scripting (XSS)](https://en.wikipedia.org/wiki/Cross-site_scripting) attack
    - because a bare Jinja2 environment does not escape special characters automatically
    - can be fixed with Flask *render_template()*, which turns on HTML escaping 
  - [Cross-Site Request Forgery (CSRF)](https://en.wikipedia.org/wiki/Cross-site_request_forgery) attack
    - can be fixed by supplying and checking a per-session random secret
      - demoed in [pay2.html](./www/templates/pay2.html)
    - CSRF protection is usually built into web frameworks or third-party plugins
      - such as library [Flask-WTF](https://flask-wtf.readthedocs.io/)
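Both fixes can be sketched without a framework. The snippet below escapes user text with `html.escape()` and checks a per-session random secret; the `session` dictionary, the form dictionaries, and `is_genuine_post()` are stand-ins invented for this example, not the code of the improved app.

```python
import hmac
import html
import secrets

# XSS: escape user-supplied text before placing it inside HTML
memo = '<script>alert("gotcha")</script>'
safe_memo = html.escape(memo)
print(safe_memo)

# CSRF: a per-session random secret, embedded in the form as a hidden field
session = {'username': 'brandon', 'csrf_token': secrets.token_hex(16)}

def is_genuine_post(session, form):
    """Accept a POST only if the form echoes the session's secret."""
    submitted = form.get('csrf_token', '')
    return hmac.compare_digest(submitted, session['csrf_token'])

good_form = {'account': 'liz', 'dollars': '10',
             'csrf_token': session['csrf_token']}
forged_form = {'account': 'hacker', 'dollars': '100'}  # attacker cannot read the token

print(is_genuine_post(session, good_form))    # True
print(is_genuine_post(session, forged_form))  # False
```

A page on another site can make the browser send the cookie along with a forged POST, but it cannot read the token out of the genuine form, so the forged request fails the check.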

🖊️ Practice
---
- Go through [the improved web application](./www/app_improved.py)
- discuss the protections


🔭 Explore
---
- Find the features of [Django](https://www.djangoproject.com/)
  - [Overview of Django](https://en.wikipedia.org/wiki/Django_(web_framework))


🖊️ Practice
---
- Go through [Django tutorial](./django.md)

Choosing a [Web Framework](https://en.wikipedia.org/wiki/Web_framework)
---
- [Comparison of server-side web frameworks](https://en.wikipedia.org/wiki/Comparison_of_server-side_web_frameworks)
  - [Django](https://www.djangoproject.com/)
    - [overview](https://en.wikipedia.org/wiki/Django_(web_framework))
  - [Tornado](https://www.tornadoweb.org/)
    - [overview](https://en.wikipedia.org/wiki/Tornado_(web_server))
  - [Flask](https://flask.palletsprojects.com/)
    - [overview](https://en.wikipedia.org/wiki/Flask_(web_framework))
  - [Quart - a Fast Python web microframework](https://pgjones.gitlab.io/)
  - [Bottle](https://bottlepy.org/)
  - [Pyramid](https://www.trypyramid.com/)
- WSGI supports only traditional HTTP, which operates in lockstep (half-duplex)
  - live content updates therefore run into the *[long polling problem](https://en.wikipedia.org/wiki/Push_technology)*
    - work around with [Comet - a web application model](https://en.wikipedia.org/wiki/Comet_(programming))
- [WebSocket](https://en.wikipedia.org/wiki/WebSocket) provides full-duplex communication channels over a single TCP connection
  - standardized in [RFC 6455](https://datatracker.ietf.org/doc/html/rfc6455)
  - starts negotiation through HTTP, then switches to a new system of data framing
  - needs close coordination between clients and servers
  - [websockets: a library for building WebSocket servers and clients in Python](https://websockets.readthedocs.io/)
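One concrete piece of the HTTP negotiation is the `Sec-WebSocket-Accept` header: the server proves it understood the upgrade request by hashing the client's `Sec-WebSocket-Key` together with a GUID fixed by RFC 6455. A sketch of just that computation (the function name `accept_key` is made up; libraries such as websockets do this internally):

```python
import base64
import hashlib

WEBSOCKET_GUID = '258EAFA5-E914-47DA-95CA-C5AB0DC85B11'  # constant from RFC 6455

def accept_key(client_key: str) -> str:
    """Value the server returns in the Sec-WebSocket-Accept header."""
    digest = hashlib.sha1((client_key + WEBSOCKET_GUID).encode('ascii')).digest()
    return base64.b64encode(digest).decode('ascii')

# Sample key from the handshake example in RFC 6455
print(accept_key('dGhlIHNhbXBsZSBub25jZQ=='))  # s3pPLMBiTxaQ9kYGzzhZRbK+xOo=
```

After the client checks this value, both sides stop speaking HTTP on the connection and exchange WebSocket frames instead.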


🔭 Explore
---
- [Building a basic chat server with Quart](https://pgjones.gitlab.io/quart/tutorials/chat_tutorial.html#chat-tutorial)

[Web Scraping](https://en.wikipedia.org/wiki/Web_scraping)
---
- also known as web harvesting, or web data extraction 
- extracts data from websites for study, statistics, etc.
- raw scraping is a last resort; prefer sources that avoid retrieving raw HTML
- alternatives to raw scraping
  - download datasets such as [IMDb Non-Commercial Datasets](https://developer.imdb.com/non-commercial-datasets/)
  - access Web service APIs such as [Google maps platform](https://developers.google.com/maps)
- obey the Terms of Service and robots.txt
  - robots.txt shows which URLs are designed for downloading by search engines and which should be avoided
- challenges
  - avoiding visits to the same URL more than once
  - avoiding falling into loops forever
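Checking robots.txt before fetching can be sketched with `urllib.robotparser` from the standard library. The rules, the user agent name, and the URLs below are hypothetical; a real scraper would load the site's own file with `set_url()` and `read()`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed from memory instead of the network
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

def may_fetch(url, agent='my-scraper'):
    """True if the rules allow this user agent to download the URL."""
    return parser.can_fetch(agent, url)

print(may_fetch('http://example.com/index.html'))         # True
print(may_fetch('http://example.com/private/data.html'))  # False
```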


Steps for web scraping
---
- manual investigation with browser developer tools
  - Inspect elements, Network info, Console
- use automation tools
  - [web crawler programs](https://en.wikipedia.org/wiki/Web_crawler) for a whole website
  - hindered by web-based authentication, [OAuth](https://oauth.net/), anti-crawling, etc.

Three ways for fetching pages  
---
- Making direct GET or POST requests using a Python library such as
  - [urllib.request](https://docs.python.org/3/library/urllib.request.html) for simple situations
  - the Session object from library [Requests](https://requests.readthedocs.io/)
    - keeps track of cookies and does connection pooling
- Using middleware such as [Mechanize](https://mechanize.readthedocs.io/)
  - handle form elements
- Using a full-featured web browser such as Firefox, Chrome, etc.
  - can be driven headlessly with libraries such as
    - [Selenium Webdriver library](https://selenium-python.readthedocs.io/)
    - [ghost.py - a webkit web client written in python](https://ghost-py.readthedocs.io/)
    - [PhantomJS - Scriptable Headless Browser](https://phantomjs.org/)
  - ghost.py and PhantomJS work by creating a WebKit instance, while Selenium drives an installed browser


Scraping Pages That Return Structured Data
---
- some web pages return data in CSV, JSON, or some other recognized data format
  - parse with PSL or third-party libraries
- information hidden in user-facing HTML
  - turn off JavaScript in the browser then reload
  - use HTML tidy programs and Python libraries
    - [Tidy](https://www.html-tidy.org/)
    ```python
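    from bs4 import BeautifulSoup
    from lxml import etree, html
    text = '<div><p>Hello <b>world</b></p></div>'  # sample input, any HTML string

    # 1. BeautifulSoup
    soup = BeautifulSoup(text, 'html.parser')
    print(soup.prettify())

    # 2. lxml
    root = html.fromstring(text)
    print(etree.tostring(root, pretty_print=True).decode('utf-8'))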
    ```
- Three steps in examining HTML
  1. parse HTML using chosen libraries
    - hard to parse HTML documents with errors 
  2. find patterned elements with selectors
  3. retrieve the text and attribute values of each element
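The three steps can be sketched with the standard library's `html.parser`, which tolerates broken markup. The page content, the `payment` class, and the `data-dollars` attribute are hypothetical; Beautiful Soup or lxml would express the same steps more compactly with selectors.

```python
from html.parser import HTMLParser

class PaymentParser(HTMLParser):
    """Collect the text and attributes of every <li class="payment">."""

    def __init__(self):
        super().__init__()
        self.payments = []     # one dict per matching element
        self._current = None   # the element being read, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'li' and 'payment' in (attrs.get('class') or '').split():
            self._current = {'attrs': attrs, 'text': ''}

    def handle_data(self, data):
        if self._current is not None:
            self._current['text'] += data

    def handle_endtag(self, tag):
        if tag == 'li' and self._current is not None:
            # collapse runs of whitespace before storing the text
            self._current['text'] = ' '.join(self._current['text'].split())
            self.payments.append(self._current)
            self._current = None

# Hypothetical markup; note the unclosed <b>, which the parser tolerates
page = '''
<ul>
  <li class="payment" data-dollars="125">psf <b>$125</li>
  <li class="note">not a payment</li>
  <li class="payment credit" data-dollars="25">sam $25</li>
</ul>
'''

parser = PaymentParser()
parser.feed(page)                                   # 1. parse
found = parser.payments                             # 2. patterned elements
texts = [p['text'] for p in found]                  # 3. text ...
total = sum(int(p['attrs']['data-dollars']) for p in found)  # ... and attributes
print(texts, total)
```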


🖊️ Practice
---
- Scrape the payments app with [Logging In to the Payments System and Adding Up Income](./www/mscrape.py)
  ```bash
  # 1. install required libraries
  pip install beautifulsoup4 requests selenium lxml
  
  # 2. run the payments app
  python3 app_improved.py

  # 3. run the scraper
  # using Requests library
  python mscrape.py http://127.0.0.1:5000/

  # using Selenium and Firefox with -s option 
  # broken: https://bugs.launchpad.net/ubuntu/+source/firefox/+bug/2025268
  python mscrape.py -s http://127.0.0.1:5000/

  # using lxml library
  python mscrape.py -l http://127.0.0.1:5000/
  ```

🔭 Explore
---
- [Beautiful Soup](https://beautiful-soup-4.readthedocs.io/)
- [lxml](https://lxml.de/)

Recursive Scraping
---
- recursively scrape all of the URLs on a web site in which
  - some links are loaded dynamically through JavaScript 
  - others are reached only through a form post
- usually done with a web-scraping engine such as [Scrapy](https://scrapy.org/)
  - records the invoked functions and URLs already seen to avoid revisiting them
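The bookkeeping that prevents revisits and endless loops can be sketched as a breadth-first crawl with a `seen` set. To stay runnable without a network, the "site" is a hypothetical dictionary of pages and `fetch` is any function that maps a URL to HTML text; this only follows plain `<a href>` links, not JavaScript or forms.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin

# A hypothetical three-page site held in memory
SITE = {
    'http://tiny.test/': '<a href="page1.html">1</a> <a href="/page2.html">2</a>',
    'http://tiny.test/page1.html': '<a href="/">home</a> <a href="page2.html#top">2</a>',
    'http://tiny.test/page2.html': '<a href="page1.html">1</a> <a href="http://other.test/">out</a>',
}

class LinkParser(HTMLParser):
    """Collect the href of every <a> tag."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get('href')
        if tag == 'a' and href:
            self.links.append(href)

def crawl(start, fetch):
    """Breadth-first crawl; fetch(url) returns HTML text or None."""
    seen = {start}            # every URL ever queued, so none is queued twice
    queue = deque([start])
    order = []
    while queue:
        url = queue.popleft()
        markup = fetch(url)
        if markup is None:
            continue
        order.append(url)
        parser = LinkParser()
        parser.feed(markup)
        for href in parser.links:
            # make the link absolute and drop any #fragment
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute.startswith(start) and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return order

visited = crawl('http://tiny.test/', SITE.get)
print(visited)
```

Although the pages link to each other in a cycle, each one is fetched once, and the link to the other site is skipped because it falls outside the start URL.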


🖊️ Practice
---
- Scrape websites recursively with
  - [Simple Recursive Web Scraper That Does GET](./www/rscrape1.py)
  - [Recursively Scraping a Web Site with Selenium](./www/rscrape2.py)
  ```bash
  # start the tiny site
  python3 -m http.server -d ./tinysite

  # scrape the tiny site
  # find only the two links that appear literally in the HTML
  python rscrape1.py http://127.0.0.1:8000/

  # scrape httpbin.org
  python rscrape1.py http://httpbin.org/

  # scrape the tiny site with more features
  python rscrape2.py http://127.0.0.1:8000/
  ```

🔭 Explore
---
- Play with [Scrapy](https://scrapy.org/)