#  The World Wide Web
FPNP3e ch11


Overview of HTTP
---
- Clients request documents
- Servers respond by providing them
- HTTP can deliver stand-alone documents such as 
  - files, audio, images, and video
- Its main purpose is to deliver the [World Wide Web](https://en.wikipedia.org/wiki/World_Wide_Web)
  - servers all over the world publish documents
  - the documents link to one another through mutual cross-links
  - together they become a single interlinked fabric of information
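
The request and response exchange described above can be sketched with the standard library alone. This is a minimal illustration, not a production server: the handler class `DocumentHandler`, the `PAGE` content, and the loopback address are assumptions made for the example.

```python
import threading
from http.client import HTTPConnection
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical document that the server publishes
PAGE = b'<html><body><a href="/other">a link</a></body></html>'

class DocumentHandler(BaseHTTPRequestHandler):
    """Hypothetical handler: answers every GET with the same document."""

    def do_GET(self):
        self.send_response(200)
        self.send_header('Content-Type', 'text/html; charset=utf-8')
        self.send_header('Content-Length', str(len(PAGE)))
        self.end_headers()
        self.wfile.write(PAGE)

    def log_message(self, format, *args):
        pass  # keep the example output free of request logs

# Port 0 asks the operating system for any free port
server = HTTPServer(('127.0.0.1', 0), DocumentHandler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

# The client requests a document and the server responds with it
conn = HTTPConnection('127.0.0.1', port, timeout=5)
conn.request('GET', '/')
response = conn.getresponse()
status = response.status
content_type = response.getheader('Content-Type')
body = response.read()
conn.close()

server.shutdown()
server.server_close()

print(status, content_type)
print(body.decode('utf-8'))
```

The document returned here already contains a link, which is what turns isolated documents into a web.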


Hypermedia and URLs
---
- [URLs (uniform resource locators)](https://en.wikipedia.org/wiki/URL) are addresses on the web
  - a hyperlink, or simply link, embeds a URL in a document
  - links are clickable and usually highlighted with underlines
- Hypermedia mixes images, sound, and video with links
- Hypertext documents contain embedded hyperlinks

```html
<!-- some sample URLs -->
https://www.python.org/
http://en.wikipedia.org/wiki/Python_(programming_language)
http://localhost:8000/headers
ftp://ssd.jpl.nasa.gov/pub/eph/planets/README.txt
telnet://rainmaker.wunderground.com

<!-- URL with a query string -->
https://www.google.com/search?product=ipad&bInStore=yes

<!-- URL with fragment -->
https://datatracker.ietf.org/doc/html/rfc3986#section-3.2
```
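
To illustrate how a hypertext document carries embedded hyperlinks, here is a minimal sketch that collects the `href` targets of the `<a>` tags in a small HTML snippet. The class name `LinkCollector` and the sample markup are assumptions made for the example; only the standard library `html.parser` module is used.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Hypothetical helper: gathers the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value is not None:
                    self.links.append(value)

# Sample hypertext, invented for this example
html = '''<p>See <a href="https://www.python.org/">Python</a> and
<a href="https://datatracker.ietf.org/doc/html/rfc3986#section-3.2">RFC 3986</a>.</p>'''

collector = LinkCollector()
collector.feed(html)
print(collector.links)
```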

Parsing and Building URLs
---
- Every HTTP URL conforms to the syntax of a generic URI
- The URI generic syntax consists of five components organized hierarchically 
  - in order of decreasing significance from left to right
  ```
  URI = scheme ":" ["//" authority] path ["?" query] ["#" fragment]
  authority = [userinfo "@"] host [":" port]
  ``` 
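
The grammar above can be turned into a small splitter. The sketch below uses the regular expression given in Appendix B of RFC 3986; the helper name `split_uri` is hypothetical, and the function does not break the authority into userinfo, host, and port.

```python
import re

# Regular expression from RFC 3986, Appendix B
URI_PATTERN = re.compile(
    r'^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?'
)

def split_uri(uri):
    """Hypothetical helper: return the five generic URI components."""
    match = URI_PATTERN.match(uri)
    return {
        'scheme': match.group(2),
        'authority': match.group(4),
        'path': match.group(5),
        'query': match.group(7),      # None when the URI has no "?"
        'fragment': match.group(9),   # None when the URI has no "#"
    }

parts = split_uri('https://datatracker.ietf.org/doc/html/rfc3986#section-3.2')
for name, value in parts.items():
    print(f'{name}: {value}')
```

In practice `urllib.parse.urlsplit()` is the better tool; the regular expression is shown only to make the five components concrete.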


🔭 Explore
---
- [the components of URIs](https://en.wikipedia.org/wiki/Uniform_Resource_Identifier#Syntax)
- [List of URI schemes](https://en.wikipedia.org/wiki/List_of_URI_schemes)

Parsing URLs
---

In [None]:
# The urllib.parse module of the Python Standard Library (PSL) parses and builds URLs
from urllib.parse import urlsplit
u = urlsplit('https://www.google.com/search?product=ipad&bInStore=yes')
tuple(u)

In [None]:
print(f'type: {type(u)}\nscheme: {u.scheme}\nnetloc: {u.netloc}\npath: {u.path}\nquery: {u.query}\nfragment: {u.fragment}')

In [None]:
# The “network location” netloc can have several subordinate pieces, 
# but they are uncommon enough that urlsplit()
# does not break them out as separate items in its tuple. 
# Instead, they are available only as attributes of its result
u = urlsplit('https://brandon:atigdng@localhost:8000/')
print('netloc: ', u.netloc)
print('username: ', u.username)
print('password: ', u.password)
print('hostname: ', u.hostname)
print('port: ', u.port)

In [None]:
# the path and query components can be decomposed further
# &, #, and / are delimiters in a URL, so literal occurrences must be escaped
from urllib.parse import parse_qs, parse_qsl, unquote
# Intended URL: http://example.com/Q&A/TCP/IP?q=packet loss&t=time elapse
# In Q&A, & -> %26; in TCP/IP, / -> %2F; in 'packet loss', ␣ -> +
u = urlsplit('http://example.com/Q%26A/TCP%2FIP?q=packet+loss&t=time+elapse')
path = [unquote(s) for s in u.path.split('/')]
query = parse_qsl(u.query)

# the path begins with a slash, so the first path component is an empty string
print('path components: ', path) 

# parse_qsl() decodes each + in the query back into a space
print('queries: ', query)

In [None]:
# parse_qsl() gives the query as a list of (name, value) pairs,
# since a query parameter can appear multiple times
# parse_qs() gives the query as a dictionary mapping each name to a list of its values
u = urlsplit('http://search.org/?q=one&q=two&p=price')
ql = parse_qsl(u.query)
qd = parse_qs(u.query)
print("Query as list: ", ql)
print("Query as dictionary: ", qd)

Building URLs
---

In [None]:
URL = 'http://example.com/Q%26A/TCP%2FIP?q=packet+loss&t=time+elapse'
u = urlsplit(URL)
path = [unquote(s) for s in u.path.split('/')]
query = parse_qsl(u.query)

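from urllib.parse import quote, urlencode, urlunsplit
# Rebuild the URL from its parsed pieces:
# quote(p, safe='') escapes every reserved character, including /, in each path component
# urlencode() escapes the query parameters and turns spaces back into +
urlunsplit((u.scheme, u.netloc, 
            '/'.join(quote(p, safe='') for p in path),
            urlencode(query), ''))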

In [None]:
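# quote() leaves / unescaped by default (safe='/'), so only & and the space are escaped
# unquote() reverses the escaping and recovers the original string
s = 'Q&A/TCP IP'
qt = quote(s)
uq = unquote(qt)
print((qt, uq))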
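
Because a query parameter can appear more than once, rebuilding a query from the dictionary form needs one extra step. The following sketch, which assumes the same sample URL used in the parsing cells, round-trips the result of `parse_qs()` through `urlencode()` with `doseq=True` so that each list value expands into repeated parameters.

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit

u = urlsplit('http://search.org/?q=one&q=two&p=price')

# parse_qs() maps each parameter name to a list of its values
params = parse_qs(u.query)
params['q'].append('three')

# doseq=True emits one name=value pair per list element
query = urlencode(params, doseq=True)
rebuilt = urlunsplit((u.scheme, u.netloc, u.path, query, u.fragment))
print(rebuilt)
```

Without `doseq=True`, `urlencode()` would encode the string form of each list as a single value, which is rarely what is wanted.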

🔭 Explore
---
- [urllib.parse — Parse URLs into components](https://docs.python.org/3/library/urllib.parse.html#module-urllib.parse)



Relative URLs 
The Hypertext Markup Language 
Reading and Writing to a Database 
A Terrible Web Application (in Flask) 
The Dance of Forms and HTTP Methods 
When Forms Use Wrong Methods 
Safe and Unsafe Cookies  
Nonpersistent Cross-Site Scripting  
Persistent Cross-Site Scripting 
Cross-Site Request Forgery  
The Improved Application  
The Payments Application in Django 
Choosing a Web Framework
WebSockets 
Web Scraping 
Fetching Pages  
Scraping Pages 
Recursive Scraping 