#  The World Wide Web
FPNP3e ch11


Overview of HTTP
---
- Clients request documents
- Servers respond by providing them
- HTTP can deliver stand-alone documents such as
  - files, audio, images, and video
- Its main purpose is to deliver the [World Wide Web](https://en.wikipedia.org/wiki/World_Wide_Web)
  - servers all over the world publish documents
  - the documents link to one another through cross-links
  - together they form a single interlinked fabric of information
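
The request and response exchange can be tried entirely on the local machine. The following is a minimal sketch using only the standard library; the handler class `HelloHandler`, the reply text, and the path `/index.html` are made up for illustration and are not part of the course code.

```python
import threading
from http.client import HTTPConnection
from http.server import BaseHTTPRequestHandler, HTTPServer

class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical handler: answers every GET with a tiny text document."""

    def do_GET(self):
        body = f'You asked for {self.path}\n'.encode('utf-8')
        self.send_response(200)
        self.send_header('Content-Type', 'text/plain; charset=utf-8')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo quiet

# Port 0 asks the operating system for any free port
server = HTTPServer(('127.0.0.1', 0), HelloHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

# The client requests a document and the server responds by providing it
conn = HTTPConnection('127.0.0.1', port, timeout=5)
conn.request('GET', '/index.html')
response = conn.getresponse()
status = response.status
document = response.read().decode('utf-8')
conn.close()

server.shutdown()
server.server_close()
print(status, document)
```

The server runs in a background thread only so that client and server fit in one cell; a real deployment would run them as separate programs.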


Hypermedia and URLs
---
- [URLs (uniform resource locators)](https://en.wikipedia.org/wiki/URL) are addresses on the web
  - a URL embedded in a document is known as a hyperlink, or simply a link
  - links are clickable and usually highlighted with an underline
- Hypermedia is images, sound, and video mixed with links
- Hypertext documents contain embedded hyperlinks
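
To see what "embedded hyperlinks" means in practice, the sketch below pulls the link targets out of a small piece of hypertext with the standard library's `html.parser`. The class name `LinkCollector` and the sample fragment are assumptions made for this example.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag that is fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value is not None:
                    self.links.append(value)

# A made-up hypertext fragment, used only for illustration
page = '''
<p>Read the <a href="https://www.python.org/">Python home page</a>
or jump to <a href="/wiki/URL#Syntax">URL syntax</a>.</p>
'''

collector = LinkCollector()
collector.feed(page)
print(collector.links)
```

The first link is an absolute URL, while the second only makes sense relative to the document that contains it.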

```html
<!-- some sample URLs -->
https://www.python.org/
http://en.wikipedia.org/wiki/Python_(programming_language)
http://localhost:8000/headers
ftp://ssd.jpl.nasa.gov/pub/eph/planets/README.txt
telnet://rainmaker.wunderground.com

<!-- URL with a query string -->
https://www.google.com/search?product=ipad&bInStore=yes

<!-- URL with fragment -->
https://datatracker.ietf.org/doc/html/rfc3986#section-3.2
```

Parsing and Building URLs
---
- Every HTTP URL conforms to the syntax of a generic URI
- The URI generic syntax consists of five components organized hierarchically 
  - in order of decreasing significance from left to right
  ```
  URI = scheme ":" ["//" authority] path ["?" query] ["#" fragment]
  authority = [userinfo "@"] host [":" port]
  ``` 
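
The grammar above can be checked against a real URL. RFC 3986 (Appendix B) gives a regular expression that splits a URI into these five components; the sketch below applies it and compares the result with `urlsplit()`. The helper name `split_uri` is an assumption made for this example.

```python
import re
from urllib.parse import urlsplit

# Regular expression taken from RFC 3986, Appendix B
URI_RE = re.compile(r'^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?')

def split_uri(uri):
    """Return (scheme, authority, path, query, fragment).

    Components that are absent from the URI come back as None,
    except the path, which is always a (possibly empty) string.
    """
    m = URI_RE.match(uri)
    return m.group(2), m.group(4), m.group(5), m.group(7), m.group(9)

uri = 'https://datatracker.ietf.org/doc/html/rfc3986#section-3.2'
parts = split_uri(uri)
print(parts)
print(tuple(urlsplit(uri)))
```

The two results agree except that `urlsplit()` reports a missing query as an empty string instead of `None`.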


🔭 Explore
---
- [the components of URIs](https://en.wikipedia.org/wiki/Uniform_Resource_Identifier#Syntax)
- [List of URI schemes](https://en.wikipedia.org/wiki/List_of_URI_schemes)

Parsing URLs
---

In [None]:
# The urllib.parse module of the Python Standard Library (PSL) parses and builds URLs
from urllib.parse import urlsplit
u = urlsplit('https://www.google.com/search?product=ipad&bInStore=yes')
tuple(u)

In [None]:
print(f'type: {type(u)}\nscheme: {u.scheme}\nnetloc: {u.netloc}\npath: {u.path}\nquery: {u.query}\nfragment: {u.fragment}')

In [None]:
# The “network location” netloc can have several subordinate pieces, 
# but they are uncommon enough that urlsplit()
# does not break them out as separate items in its tuple. 
# Instead, they are available only as attributes of its result
u = urlsplit('https://brandon:atigdng@localhost:8000/')
print('netloc: ', u.netloc)
print('username: ', u.username)
print('password: ', u.password)
print('hostname: ', u.hostname)
print('port: ', u.port)

In [None]:
# the path and query components can be decomposed further
# &, #, and / are delimiters in a URL, so their literal occurrences must be escaped
from urllib.parse import parse_qs, parse_qsl, unquote
# unescaped form: http://example.com/Q&A/TCP/IP?q=packet loss&t=time elapse
# In Q&A, & -> %26; in TCP/IP, / -> %2F; in 'packet loss', ␣ -> +
u = urlsplit('http://example.com/Q%26A/TCP%2FIP?q=packet+loss&t=time+elapse')
path = [unquote(s) for s in u.path.split('/')]
query = parse_qsl(u.query)

# the initial empty string in the path components appears because the path begins with a slash
print('path components: ', path)

# parse_qsl() decodes each + in the query back to a space
print('queries: ', query)

In [None]:
# parse_qsl() gives the query as a list since a query parameter can appear multiple times
# parse_qs()  gives the query as a dictionary that maps each name to a list of its values
u = urlsplit('http://search.org/?q=one&q=two&p=price')
ql = parse_qsl(u.query)
qd = parse_qs(u.query)
print("Query as list: ", ql)
print("Query as dictionary: ", qd)

Building URLs
---

In [None]:
URL = 'http://example.com/Q%26A/TCP%2FIP?q=packet+loss&t=time+elapse'
u = urlsplit(URL)
path = [unquote(s) for s in u.path.split('/')]
query = parse_qsl(u.query)

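from urllib.parse import quote, urlencode, urlunsplit
# quote() with safe='' escapes every reserved character, including /, in each path segment
# urlencode() rebuilds the query string; the final empty string is the fragment
urlunsplit((u.scheme, u.netloc,
            '/'.join(quote(p, safe='') for p in path),
            urlencode(query), ''))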

In [None]:
s = 'Q&A/TCP IP'
qt = quote(s)
uq = unquote(qt)
print((qt, uq))
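
The choice of quoting function decides which characters survive. The short comparison below, a sketch that uses only `urllib.parse`, shows why a path segment is quoted with `safe=''` and why query values use `+` for spaces.

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus

s = 'Q&A/TCP IP'

as_path = quote(s)               # default safe='/' leaves the slash alone
as_segment = quote(s, safe='')   # escapes the slash as well
as_query = quote_plus(s)         # escapes the slash and turns the space into '+'

print(as_path)      # Q%26A/TCP%20IP
print(as_segment)   # Q%26A%2FTCP%20IP
print(as_query)     # Q%26A%2FTCP+IP

# Each encoding round-trips with its matching decoder
print(unquote(as_segment) == s, unquote_plus(as_query) == s)
```

Using the default `quote()` on a single path segment would leave `/` unescaped, so `TCP/IP` would be read back as two segments.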

🔭 Explore
---
- [urllib.parse — Parse URLs into components](https://docs.python.org/3/library/urllib.parse.html#module-urllib.parse)

Relative URLs
---
- A filesystem has two kinds of paths
  - an *absolute path* starts from the root
  - a *relative path* starts from the *current working directory*
- Hypertext has a similar hierarchy
  - an *absolute URL* can be resolved from anywhere
  - a *relative URL* in a document depends on the location of that document
    - *urljoin()* finds the absolute form of a relative URL

In [None]:
# The argument order of urljoin() is the same as that of os.path.join(). 
# First the base URL of the document
# Second the relative URL inside of it
from urllib.parse import urljoin
base = 'http://tools.ietf.org/html/rfc3986'
a1 = urljoin(base, 'rfc7320')
a2 = urljoin(base, '.')
a3 = urljoin(base, '..')
a4 = urljoin(base, '/dailydose/')
a5 = urljoin(base, '?version=1.0')
a6 = urljoin(base, '#section-5.4')

# Find a1-a6 manually, then verify your answers

In [None]:
# for an absolute URL, urljoin will not modify it
urljoin(base, 'https://www.google.com/search?q=apod&btnI=yes')

In [None]:
# a relative URL can omit the scheme but specify everything else.
# only the scheme is copied from the base URL
urljoin(base, '//www.google.com/search?q=apod')

In [None]:
# trailing slash matters in relative URLs
ta1 = urljoin('http://tools.ietf.org/html/rfc3986', 'rfc7320')
ta2 = urljoin('http://tools.ietf.org/html/rfc3986/', 'rfc7320')
print(ta1)
print(ta2)

- robust web applications redirect users from a wrong URL (e.g. one missing its trailing slash) to the correct path
- relative URLs are not necessarily relative to the path provided in the HTTP request
- if the server responds with a redirect, relative URLs should be resolved against the URL in the Location header of the response
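
The effect of a redirect on relative links can be sketched with `urljoin()` alone. The URLs below are hypothetical and chosen only to illustrate the rule: the client asks for a path without a trailing slash and the server redirects to the version with one.

```python
from urllib.parse import urljoin

# Hypothetical URLs, chosen only to illustrate the rule
requested = 'http://example.com/docs'    # what the client asked for
location = 'http://example.com/docs/'    # Location header of the redirect
link = 'intro.html'                      # relative link found in the page

wrong = urljoin(requested, link)   # resolved against the original request
right = urljoin(location, link)    # resolved against the redirect target

print(wrong)   # http://example.com/intro.html
print(right)   # http://example.com/docs/intro.html
```

When a page is fetched with `urllib.request.urlopen()`, the URL that was finally retrieved after any redirects should be available as the `url` attribute of the response, which makes it a natural base for `urljoin()`.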

[The Hypertext Markup Language](https://en.wikipedia.org/wiki/HTML)
---
- the standard markup language for documents designed to be displayed in a web browser
- defines the meaning and structure of web content
- often assisted by Cascading Style Sheets (CSS) and scripting languages such as JavaScript


🔭 Explore
---
- [HTML living standard](https://html.spec.whatwg.org/)
- [CSS Snapshot](https://www.w3.org/TR/CSS/)
- [JavaScript](https://developer.mozilla.org/en-US/docs/Web/JavaScript)
- [Document Object Model (DOM)](https://developer.mozilla.org/en-US/docs/Web/API/Document_Object_Model)


[Well-formed document](https://en.wikipedia.org/wiki/Well-formed_document)
---
- An HTML file should be a well-formed document with [well-formed tags](https://en.wikipedia.org/wiki/Well-formed_element) organizing content
  - Tags are case sensitive in XML and XHTML (HTML parsers treat tag names as case-insensitive)
  - Content must be delimited with a beginning and an end tag
  - Content must be properly nested (parents within roots, children within parents)
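
These rules can be tested mechanically with an XML parser, which rejects anything that is not well formed. The sketch below uses `xml.etree.ElementTree` from the standard library; the helper `is_well_formed` and the sample strings are assumptions made for this example. Note that browsers are far more forgiving with HTML than an XML parser is.

```python
import xml.etree.ElementTree as ET

def is_well_formed(markup):
    """Return True if the markup parses as XML, False otherwise."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False

samples = {
    'nested': '<p>A <b>bold</b> word</p>',    # properly nested
    'crossed': '<p>A <b>bold</p></b>',        # tags overlap
    'unclosed': '<p>A paragraph',             # missing end tag
    'mixed case': '<P>A paragraph</p>',       # <P> and </p> do not match in XML
}

results = {name: is_well_formed(markup) for name, markup in samples.items()}
for name, ok in results.items():
    print(f'{name:10} -> {ok}')
```

Only the first sample passes; the other three each break one of the rules listed above.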


Tag Examples
---
```html
<!-- 1. tags are properly nested -->
<p>This is a paragraph with <b>bold</b> and <i>italic</i> words.</p>

<!-- 2. void (self-closing) tags have no content and no end tag -->
<br> or <br/>

<!-- 3.  the most generic kind of box -->
<div> generic box </div>

<!-- 4. the most generic way to mark running text -->
<span> generic way to mark running text </span>
```

Element selectors
---
```html
<div class="weather">
  <h5 class="city"> Tampa </h5>
  <p class="temperature"> 26 ℃ </p>
</div>

<!-- generic class selectors -->
.city, .temperature

<!-- element class selectors -->
h5.city, p.temperature

<!-- descendant selectors (joined by whitespace) -->
.weather h5
.weather p
```
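
Selectors matter in Python mainly when scraping pages. As a rough sketch that stays within the standard library, the snippet above can be parsed with `xml.etree.ElementTree` and queried with its limited XPath support. This is a stand-in for real CSS selector engines, not an equivalent: the attribute test matches the whole `class` value exactly, whereas a CSS class selector matches one class among several.

```python
import xml.etree.ElementTree as ET

snippet = '''
<div class="weather">
  <h5 class="city"> Tampa </h5>
  <p class="temperature"> 26 ℃ </p>
</div>
'''
root = ET.fromstring(snippet)   # root is the <div class="weather"> element

# Rough XPath counterparts of the selectors (exact attribute match only)
city = root.find(".//h5[@class='city']")                # h5.city
temperature = root.find(".//p[@class='temperature']")   # p.temperature
by_class = root.findall(".//*[@class='city']")          # .city
paragraphs = root.findall('.//p')                       # .weather p

print(city.text.strip(), temperature.text.strip())
```

Because `root` is the `weather` element itself, searching its descendants for `p` plays the role of the descendant selector `.weather p`.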

🖊️ Practice
---
- Inspect an HTML webpage with developer tools in Firefox

Reading and Writing to a Database 
---
- Web applications usually have a backend database
- e.g. a simple bank web application has a database with tables to save 
  - user accounts and bank accounts
  - checking accounts, savings accounts, and credit cards
  - payments, etc.
- the server scripts of the web app interact with the DBMS through [DB-API](https://peps.python.org/pep-0249/) 
  - defined in PEP 249 – Python Database API Specification v2.0
  - supported by all modern Python database connectors
  - the Python connector for SQLite, the `sqlite3` module, is built into the PSL
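
The DB-API pattern (connect, get a cursor, execute, commit, fetch) can be sketched with the built-in `sqlite3` module. The `payment` table, its columns, and the rows below are hypothetical and are not taken from the course's `bank.py`; the database lives in memory so nothing is written to disk.

```python
import sqlite3

# In-memory database; the table and rows are hypothetical
db = sqlite3.connect(':memory:')
cursor = db.cursor()
cursor.execute(
    'CREATE TABLE payment ('
    ' id INTEGER PRIMARY KEY, debit TEXT, credit TEXT, dollars INTEGER, memo TEXT)'
)

rows = [
    ('alice', 'bob', 125, 'Conference registration'),
    ('alice', 'carol', 200, 'Consulting fee'),
    ('dave', 'alice', 25, 'Gas money'),
]
# The ? placeholders let the driver quote the values safely
cursor.executemany(
    'INSERT INTO payment (debit, credit, dollars, memo) VALUES (?, ?, ?, ?)', rows
)
db.commit()

# Find every payment that involves one account, again with placeholders
account = 'carol'
cursor.execute(
    'SELECT debit, credit, dollars FROM payment'
    ' WHERE debit = ? OR credit = ? ORDER BY id',
    (account, account),
)
payments = cursor.fetchall()
db.close()
print(payments)
```

Passing values through placeholders instead of building the SQL text with string formatting is the habit that keeps user input from being interpreted as SQL.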


🔭 Explore
---
- [A Routine for Building and Talking to a Database](./www/bank.py)
- View the database created above using one of the SQLite GUI applications
  - [SQLiteStudio](https://sqlitestudio.pl/)
  - [DB Browser for SQLite](https://sqlitebrowser.org/)
  - [SQLite Manager](https://addons.mozilla.org/en-US/firefox/addon/sqlite-manager-webext/)

A Terrible Web Application
---
- Explore [the Pallets Projects](https://palletsprojects.com/)
- [An Insecure Web Application](./www/app_insecure.py)
  ```bash
  # 1. install modules from Pallets Projects
  pip install Flask Werkzeug colorama watchdog Jinja2 click MarkupSafe itsdangerous Quart

  # 2. run and play with the web application
  flask --app=app_insecure run
  ```
- What are the vulnerabilities in this web app?

Remaining topics in FPNP3e ch11:

- The Dance of Forms and HTTP Methods
- When Forms Use Wrong Methods
- Safe and Unsafe Cookies
- Nonpersistent Cross-Site Scripting
- Persistent Cross-Site Scripting
- Cross-Site Request Forgery
- The Improved Application
- The Payments Application in Django
- Choosing a Web Framework
- WebSockets
- Web Scraping
- Fetching Pages
- Scraping Pages
- Recursive Scraping