## Building Foundational Python Skills for Data Analytics

https://www.safaribooksonline.com/library/view/modern-python-livelessons/9780134743400/MOPY_01_01_01.html

# Resampling - Preparation
Big Idea: Statistics modeled in a program are easier to get right and understand than a formulaic approach. The technique also extends to more complicated situations than classic formulas can handle.

## F-strings
* Old style: percent style - `%-formatting`
* Newer style: `.format()`
* Newest style: `f''`
    - `f'{value:{width}.{precision}}'`
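
The nested form in the last bullet can be sketched as follows. The names `value`, `width` and `precision` are illustrative, and the trailing `f` (fixed-point) type code is an addition made here so that `precision` counts digits after the decimal point.

```python
value = 3.14159265
width, precision = 10, 4

# The inner braces are evaluated first, so this is the same as f'{value:10.4f}'
formatted = f'{value:{width}.{precision}f}'
print(repr(formatted))
```

Because the width and precision are ordinary variables, the same format string can be reused for columns of different sizes.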

In [None]:
x = 10
print('The answer is %d today' % x)

print('The answer is {0} today'.format(x))
print('The answer is {x} today'.format(x=x))

print(f'The answer is {x} today')
print(f'The answer is {x :08d} today') # Format operators
print(f'The answer is {x ** 2 :08d} today')  # Run expression inside

# f-strings are common for building exception messages
raise ValueError(f"Expected {x!r} to be a float, not {type(x).__name__}")

## Counter Objects
* Counter: a subclass of dict; modeled on a small tote bag
* Counts how many times each event occurs
* Bags: like `multiset` in C++; unordered, tracks frequency
* `c.elements()`: pulls out the elements one at a time, repeating each as many times as its count

In [None]:
from collections import Counter
d = {}
# A traditional dict raises a KeyError if the key is missing
# d['dragons']
d = Counter()
d['dragons']
d['dragons'] += 1
print(d)

# Count elements in a list
c = Counter('red green red blue red blue green'.split())
print(f'Counter is: {c}; Most common color is {c.most_common(1)}')

# list all elements of the counter
list(c.elements())
list(c) # list the keys (same as for a dict)
list(c.values()) # list values
list(c.items()) # list k,v pair

## Statistics Module
describing data

In [None]:
from statistics import mean, median, mode, stdev, pstdev
numbers1 = [50, 52, 53]
mean(numbers1)
# sample stdev vs population stdev
stdev(numbers1)
pstdev(numbers1)

## Sequence Operations
concat / slicing / count / index / sort

In [None]:
s = [10, 20, 30]
t = [0, 40, 50, 60, 60]

# List concat
u = s + t
print(u)

# Slicing
print(u[:2])
print(u[-2:])
print(u[:2] + u[-2:]) # Concat 

# all sequences have _count_ and _index_
print(dir(list))
u[0]
u.index(10)
u.count(60)

# sort
u.sort()  # sort original list
t = sorted(u) # construct a new list

## `lambda` expressions
* try to replace lambda -> partial, itemgetter, attrgetter, ...
* should have been called `make_function()`
* Defers a computation to the future by making a function of __no arguments__. The computation runs only when the function is called. aka freeze/thaw; promises

In [None]:
lambda x: x**2
print((lambda x: x**2)(5))

# multiple arguments
lambda x, y: 3*x + y

x = 10
y = 20
f = lambda : x**y # Freeze the computation now; thaw (run) it later by calling f()
f() 

## Chained Comparison

In [None]:
x = 15
x > 6
x < 10
x > 6 and x < 10
6 < x < 20 # chained form evaluates x only once, so it is a bit more efficient

## Random / Distribution

In [None]:
import random
random.seed(123456789) # testing
random.random()

# Continuous distributions
random.uniform(1000, 1100)
random.triangular(1000, 1100) # values near the midpoint are chosen much more often
random.gauss(100, 15) # Random IQ, normal
random.expovariate(20) # exponential with mean 1/20

from statistics import mean, stdev
# triangular, uniform, gauss, expovariate
data = [random.expovariate(20) for i in range(1000)]
print(mean(data))
print(stdev(data))

In [None]:
# discrete dist
from random import choice, choices, sample, shuffle
outcomes = ['win', 'lose', 'draw', 'play again', 'double win']
choice(outcomes)
choices(outcomes, k=10) # Sampling with replacement

from collections import Counter
Counter(choices(outcomes, k=10))

sorted(sample(range(1, 57), k=6)) # Sampling without replacement

sample(outcomes, k=1)[0]
choice(outcomes) # choice is a special case of sample

shuffle(outcomes); outcomes
sample(outcomes, k=len(outcomes)) # shuffle is a special case of sample


# Resampling statistic

* Six roulette wheel spins / choices with weighting

In [None]:
from random import *
from statistics import *
from collections import *

# Six roulette wheel spins -- 18 red, 18 black, 2 green
# ways to populate
choice(['red', 'red', 'red', 'black', 'black', 'black', 'green'])

population = ['red'] * 18 + ['black'] * 18 + ['green'] * 2
choice(population)

[choice(population) for i in range(6)]
Counter([choice(population) for i in range(6)])

Counter(choices(population, k=6))

# better way
Counter(choices(['red', 'black', 'green'], [18, 18, 2], k=6))

* deal 20 cards without replacement (16 tens, 36 low)
* after a random deal, what's the likelihood of winning

In [None]:
# define what the deck is
deck = Counter(tens=16, low=36)
# list all elements
deck = list(deck.elements())
deal = sample(deck, 20)
Counter(deal)
# deal 52 cards
deal = sample(deck, 52)
remainder = deal[20:]
Counter(remainder)

* a weighted biased coin (spin it)
* P(5 or more heads from 7 spins)

In [None]:
pop = ['heads', 'tails']
cumwgt = [0.60, 1.00] # cumulative weights

# 1 trial
trial = lambda : choices(pop, cum_weights=cumwgt, k=7).count('heads') >= 5
n = 100
sum(trial() for i in range(n)) / n

# compare to the analytic approach 
from math import factorial as fact

def comb(n, r):
    return fact(n) // fact(r) // fact(n - r)

comb(10, 3)
ph=0.6
# 5 heads out of 7 spins
ph ** 5 * (1-ph) ** 2 * comb(7, 5)
# 6 heads out of 7 spins
ph ** 6 * (1-ph) ** 1 * comb(7, 6)

sum(ph**i * (1-ph)**(7-i) * comb(7, i) for i in range(5, 8))

* Does the median-of-five fall in the middle two quartiles?

In [None]:
sample(range(10000), 5)
sorted(sample(range(10000), 5))[2]

n = 100000
n // 4
n * 3 // 4

trial = lambda : n // 4 < median(sample(range(n), 5)) <= 3 * n // 4
sum(trial() for i in range(n)) / n

# Improving Reliability

## Mypy and Type Hinting
__Big idea__: Adding type hints to code helps clarify your thoughts, improves documentation, and may allow a static analysis tool to detect some kinds of errors.
* Use of `# type: ` - Type comment
* Use of function annotation
* Use of class
* Container[Type]
* Tuple and ...
* Optional arguments
* Deque vs deque
* Issues with f-string, new colon notation, secrets module

Tools:
* mypy
* pyflakes
* hypothesis
* unittest -> `nose py.test`  [classic, most beautiful way to run unittest]

In [None]:
# To check, run in Mypy: `python3 -m mypy hints.py`
import typing
from typing import List, Optional, Sequence
from collections import OrderedDict, deque, namedtuple

# Old style:
x = 10 # type: int

# New style
x: int = 10
    
# Function annotations, before run time
def f(x: int, y: Optional[int]=None) -> int:
    if y is None:
        y = 20
    return x + y
    
x = {} # type: OrderedDict  # mypy should report an error here
y = OrderedDict() # type: OrderedDict

# Sequence: indexable / iterable
def g(x: Sequence):
    print(len(x))
    print(x[2])
    for i in x:
        print(i)
    print()

# Specify Sequence type: 
def g(x: Sequence[int]):
    pass

# Specify a list
def g(x: List[int]):
    pass

info = ('Foo', 'Bar', 'Var', 'length') # Tuple[str, ...]  (all strings)

Point = namedtuple('Point', ['x', 'y'])
Point2 = typing.NamedTuple('Point2', [('x', int), ('y', int)])

## fsum, true division
* `fsum()` is more accurate than `sum()`
```Python
from math import fsum 
```
* / vs // [True division vs floor division]
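
A minimal sketch of both bullets. The running total is accumulated with an explicit loop because, as far as I know, newer Python versions (3.12 onward) already use a more accurate algorithm inside the built-in `sum()` for floats; the variable names are illustrative.

```python
from math import fsum

tenths = [0.1] * 10

# Naive accumulation: rounding error builds up at every addition
naive_total = 0.0
for t in tenths:
    naive_total += t

exact_total = fsum(tenths)    # keeps track of the lost low-order bits

print(naive_total == 1.0)     # False
print(exact_total == 1.0)     # True

true_quotient = 7 / 2         # true division always returns a float
floor_quotient = 7 // 2       # floor division rounds down
negative_floor = -7 // 2      # rounds toward negative infinity, not toward zero
```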

In [None]:
print(f'{1.1 + 2.2}')
print(f'{1.1 + 2.2 == 3.3}')
print(f'{sum([0.1] * 10)}')

##  Grouping with `defaultdict`
`defaultdict` creates a new container to store elements with a common feature (key)

```Python
d = defaultdict(set)
d['t'] # returns an empty set, whereas a regular dict raises KeyError
d['t'].add('tom')
```

```Python
d = defaultdict(list)
d['t'].append('tom')
d['t'].append('tom')
```

In [None]:
from pprint import pprint
from collections import defaultdict
names = '''david betty susan mary darlene sandy davin shelly tom michael'''.split()
d = defaultdict(list)
for name in names:
    feature = name[0] # len(name) / name[-1], etc
    d[feature].append(name)
pprint(d, width=60)

d = defaultdict(list)
for name in names:
    feature = len(name)
    d[feature].append(name)
pprint(d)

## Key function
__What__: A function that takes one argument and transforms it into a key; the operations below then work with that key.

Works with: min(), max(), sorted(), nsmallest(), nlargest(), groupby() and merge()

SQL:
```SQL
SELECT name FROM names ORDER BY len(name);
```

Python:
```Python
pprint(sorted(names, key=len))
```
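
The same `key=len` idea carries over to the other tools listed above. The following is a small sketch reusing the `names` data from the grouping section; the result names (`shortest`, `by_length`, ...) are illustrative.

```python
from heapq import nsmallest
from itertools import groupby

names = 'david betty susan mary darlene sandy davin shelly tom michael'.split()

shortest = min(names, key=len)
longest = max(names, key=len)               # first of the names tied for longest
two_shortest = nsmallest(2, names, key=len)

# groupby() only groups adjacent items, so sort by the same key first
by_length = {size: list(group)
             for size, group in groupby(sorted(names, key=len), key=len)}
```

Sorting before `groupby()` plays the role that `ORDER BY` plays before `GROUP BY`-style processing of a sorted result.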

## Transposing 2-D data with zip() and star-args
- zip: brings multiple sequences together pair-wise; unpaired elements are left out
- `from itertools import zip_longest`: fills in missing values
- use with star (*) -> unpacks `m` into separate arguments; each row becomes an argument

In [None]:
l = list(zip('abcdef','ghijklm'))
print(f'Result: {l}; m is missing') # notice `m` is missing

from itertools import zip_longest
list(zip_longest('abcdef','ghijklm'))
list(zip_longest('abcdef','ghijklm', fillvalue='x'))


# Transposing 3x2 matrix to 2x3
m = [[10, 20], [30, 40], [50, 60]]
print(f'{list(zip([10, 20], [30, 40], [50, 60]))}')

print(f'{list(zip(*m))}')

## Flattening data with list comprehensions

In [None]:
m = [[10, 20], [30, 40], [50, 60, 70]]
pprint(m, width=15)

# flattening matrix
for row in m:
    for col in row:
        print(col)
        
[x for row in m for x in row] # Same result using list comprehension

non_flat = m
for x in non_flat:
    if len(x) > 2:
        for y in x:
            print(y)
[y for x in non_flat if len(x) > 2 for y in x] # Same as above            

## Convert an iterator into a list
Use `list(iterator)`
- list: indexable; loop over it multiple times; run in reverse
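
A short sketch of why the conversion matters: an iterator can be consumed only once, while the list built from it can be indexed, reversed and re-read. The names are illustrative.

```python
squares = map(lambda x: x ** 2, range(5))   # a one-shot iterator

first_pass = list(squares)    # consumes the iterator
second_pass = list(squares)   # nothing is left to consume

# Once stored in a list, the data can be indexed, reversed and re-read
middle = first_pass[2]
backwards = list(reversed(first_pass))
```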

# K-means - Cluster Analysis
* Big Idea: K-means is an unsupervised learning tool for identifying clusters within datasets.
* Algorithm:
    - Pick arbitrary points as guesses for the center of each group.
    - Assign all the data points to the closest matching group. 
    - Within each group, average the points to get a new guess for the center of the group.
    - Repeat multiple times: Assign data and average the points

* Goal: 
Express the idea more clearly and beautifully in Python than in English.


Tasks: 
- mean(data)
- dist(point, point)
- assign_data(centroids, points)   # centroid = potential center of a cluster
- compute_centroids(groups)
- k_means(points)

In [None]:
from typing import Iterable, Tuple, Sequence, Dict, List
from pprint import pprint
from math import fsum, sqrt

# Alias Tuple[int, ...] to Point
Point = Tuple[int, ...]
Centroid = Point

points = [
    (10, 41, 23),
    (22, 30, 29),
    (11, 42, 5),
    (20, 32, 4),
    (12, 40, 12),
    (21, 36, 23),
]

def mean(data: Iterable[float]) -> float:
    'Accurate arithmetic mean'
    data = list(data) # data might be sequence OR generator; convert generator to list
    return fsum(data) / len(data)

def dist_old(p, q):
    'Euclidean distance function for multi-dimensional data'
    return sqrt(fsum([(x - y) ** 2 for x, y in zip(p, q)]))

# convert global to local; verify by `from dis import dis; dis(dist)`
def dist(p: Point, q: Point, fsum=fsum, sqrt=sqrt, zip=zip) -> float:
    return sqrt(fsum([(x - y) ** 2 for x, y in zip(p, q)]))

from dis import dis
dis(dist)

In [None]:
from collections import defaultdict
from functools import partial

def assign_data(centroids: Sequence[Centroid], data: Iterable[Point]) -> Dict[Centroid, List[Point]]:
    'Group the data points to the closest centroid'
    d = defaultdict(list)
    for point in data:
        closest_centroid = min(centroids, key=partial(dist, point)) # same as key=lambda centroid: dist(point, centroid)
        d[closest_centroid].append(point)
    return dict(d) # convert back to regular dict

# list all the centroids:
centroids = [(9, 39, 20), (12, 36,25)]
point = (11, 42, 5)
[dist(point, centroid) for centroid in centroids]

# use a key function to pick the closest centroid, rather than computing the distances separately
min(centroids, key=lambda centroid: dist(point, centroid))

In [None]:
from functools import partial # partial function evaluation; freeze some arguments
pow(2, 5)
twopow = partial(pow, 2)
twopow(5)
min(centroids, key=partial(dist, point)) # partial version

In [None]:
pprint(assign_data(centroids, points), width=45)

In [None]:
def transpose(data):
    'Swap the rows and columns in a 2-D array of data'
    return list(zip(*data))

def compute_centroids(groups: Iterable[Sequence[Point]]) -> List[Centroid]:
    'Compute the centroid of each group'
    return [tuple(map(mean, transpose(group))) for group in groups]
    # return [tuple(map(mean, zip(*group))) for group in groups]

In [None]:
from random import sample # sample without replacement
def k_means(data: Iterable[Point], k=2, iterations=50) -> List[Centroid]:
    data = list(data) # turn iterable into sequence
    centroids = sample(data, k)
    for i in range(iterations):
        labeled = assign_data(centroids, data)
        centroids = compute_centroids(labeled.values())
    return centroids

centroids = k_means(points, k=3)
d = assign_data(centroids, points)
pprint(d)

# Building Additional Skills For Data Analysis
- `defaultdict` for accumulating data (tabulating)
- `defaultdict` for reversing a one-to-many mapping
- glob
- reading files with an encoding
- using `next()` or `islice()` to remove elements from an iterator
- csv.reader
- tuple unpacking
- looping idioms: enumerate, zip, reversed, sorted, set
- incrementing instances of Counter
- assertions

## Use `defaultdict` for accumulation
and then convert `defaultdict` to __regular `dict`__ for normal use after the accumulation phase
- after the data is accumulated, convert it to a dict since its defaulting behavior is no longer needed
```Python
d = dict(d)
```
- defaultdict: grouping, accumulation

In [None]:
from collections import defaultdict
from pprint import pprint
d = defaultdict(list)
d['a'].append('red')
d['b'].append('yellow')
d['c'].append('blue')
pprint(d, width=30)

d['a'].append('mac')
d['b'].append('pc')
d['c'].append('arm')
pprint(d, width=50)

pprint(dict(d))

## Reverse a one-many mapping
- Model one-to-many: `dict(one, list_of_many)`
- reverse (defaultdict, flat list and add)
- simpler case (one-to-one): `{span: eng for eng, span in e2s.items()}`
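
The one-to-one comprehension from the last bullet can be sketched like this. The dictionaries `e2s_simple` and `lossy` are hypothetical data made up for the illustration.

```python
e2s_simple = {'one': 'uno', 'two': 'dos', 'three': 'tres'}   # hypothetical one-to-one data
s2e_simple = {span: eng for eng, span in e2s_simple.items()}

# With duplicate values the comprehension silently keeps only the last key,
# which is why the one-to-many case needs a defaultdict(list) instead
lossy = {span: eng for eng, span in {'three': 'tres', 'trio': 'tres'}.items()}
```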

In [None]:
# Pattern for 1-many: scalar, list
e2s = {
    'one': ['uno'],
    'two': ['dos'],
    'three': ['tres'],
    'trio': ['tres'],
    'free': ['libre', 'gratis']
}

# To reverse:
s2e = defaultdict(list)
for eng, spanwords in e2s.items():
    for span in spanwords:
        s2e[span].append(eng)
pprint(s2e)

## glob
global wildcard expansion (glob in bash)

could be called -> os.expand_wildcards()

In [None]:
import glob
glob.glob('*.*')

## Reading files with an encoding

In [None]:
with open('README.md', encoding='utf-8') as f:
    print(f.read())

## Remove elements from an iterator
consume some elements first then pass it to another function

In [None]:
it = iter('abcdefg')

print(next(it))
print(next(it))

print(list(it))

## CSV module

In [None]:
import csv
with open('README.md', encoding='utf-8') as f:
    for row in csv.reader(f):
        print(row)
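
A sketch that combines `csv.reader` with the iterator and tuple-unpacking ideas from the neighbouring sections. The in-memory text in `raw` is hypothetical data standing in for a real CSV file.

```python
import csv
import io

# Hypothetical data standing in for a file opened with open(..., encoding='utf-8')
raw = io.StringIO('name,color,city\nsq,red,yvr\neg,green,pvg\n')

reader = csv.reader(raw)
header = next(reader)              # consume the header row before looping
rows = []
for name, color, city in reader:   # tuple unpacking of each remaining row
    rows.append((name.upper(), color, city))
```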

## Tuple packing / unpacking
```Python
t = ('a1', 'b2', 42, '2') # Tuple packing
type(t)
len(t)
field1, field2, field3, field4 = t # unpacking
```

## Looping idioms

In [None]:
names = 'sq eg sm ll'.split()
colors = 'red green blue yellow'.split()
cities = 'yvr yvr pvg pvg'.split()

# Loop idioms:
for i in range(len(names)):
    print(names[i].upper())

for name in names:           # foreach
    print(name.upper())
    
###
for i in range(len(names)):
    print(i+1, names[i])

for i, name in enumerate(names, start=1): # enum
    print(i, name)
    
# print all color in reverse order
for i in range(len(colors) - 1, -1, -1):
    print(colors[i])
    
for color in reversed(colors):
    print(color)
    
# pair / mapcar
n = min(len(names), len(colors))
for i in range(n):
    print(names[i], colors[i])

for name, color in zip(names, colors):
    print(name, color)
    
# sorted
for color in sorted(colors, key=len):
    print(color)
    
# eliminating duplicates
for city in set(cities):
    print(city)
    
for city in sorted(set(cities)):
    print(city)
# SELECT DISTINCT city FROM Cities ORDER BY city;
# DISTINCT == set()
# ORDER BY == sorted()

# Functional programming
for i, city in enumerate(map(str.upper, reversed(sorted(set(cities))))):
    print(i, city)

In [None]:
# Sort


## Counter

In [None]:
import collections
c = collections.Counter()
c['red'] += 1
print(c)

c['blue'] += 1
c['red'] += 1
print(c)

print(f'most common: {c.most_common(1)}\n'
      f'{list(c.elements())}')

## Assertion
check assertion

In [None]:
assert 5 + 3 == 8
assert 5 - 3 == 6  # fails: raises AssertionError

# Applying Cluster Analysis To A Real Dataset
TODO

# Gearing Up For A Publisher Subscriber Application
- Pub/Sub service
- __Big Idea__:
Users make posts. Followers subscribe to the posts they are interested in. Newer posts are more relevant.
Display Posts by a user, posts for a user or posts matching a search request. Display followers of a user. Display those followed by a user. Store the user account information with hashed passwords

- Tools:
    * Unicode normalization. NFC: chr(111)+chr(776) -> chr(246)
    * Named tuples
    * sorted(), bisect() and merge() -- reverse and key arguments
    * itertools.islice()
    * sys.intern()
    * random.expovariate()
    * time.sleep() and time.time()
    * hashlib: pbkdf2_hmac, sha256/512, digest, hexdigest
    * repr of a tuple
    * joining strings
    * floor division
    * ternary operator
    * and/or short-circuit boolean operators that return a value


## Unicode Normalization
`\u0664`
`\N{trade mark sign}`
```Python
import unicodedata
u = unicodedata.normalize('NFC', string)
```
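
A concrete instance of the NFC example from the tools list, `chr(111)+chr(776) -> chr(246)`. The variable names are illustrative.

```python
import unicodedata

decomposed = chr(111) + chr(776)    # 'o' followed by a combining diaeresis
composed = unicodedata.normalize('NFC', decomposed)

same_before = decomposed == chr(246)   # two code points vs one: not equal
same_after = composed == chr(246)      # equal once normalized
```

Normalizing user names or search phrases before comparing them avoids treating visually identical strings as different.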

## Named tuples
Lookup fields by name
```Python
import collections
Person = collections.namedtuple('Person', ['fname', 'lname', 'age', 'email'])
p = Person('Y', 'Q', 1, 'abc@example.com')
```
Like regular tuple:
- `len()`
- unpackable: `a, b, c, d = p`
- sliceable: `p[:2]`
- indexable: `p[0]`
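
The tuple-like behaviours listed above, plus lookup by name, in one short sketch using the same `Person` definition. The result names are illustrative.

```python
import collections

Person = collections.namedtuple('Person', ['fname', 'lname', 'age', 'email'])
p = Person('Y', 'Q', 1, 'abc@example.com')

by_name = p.email               # lookup by field name
by_index = p[0]                 # still indexable like a tuple
fname, lname, age, email = p    # unpackable
first_two = p[:2]               # sliceable (the slice is a plain tuple)
older = p._replace(age=2)       # immutable, so _replace returns a new tuple
```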

## Sorted data / bisect
`bisect` is for searching ranges, not for searching for a particular value

cutting rope: cut n times, get n+1 sections

Example: searching income tax brackets
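
A sketch of the tax-bracket lookup. The cut points and rates are hypothetical numbers chosen for the illustration, and `marginal_rate` is a made-up helper name; three cuts give four sections, as in the rope analogy.

```python
import bisect

# Hypothetical brackets: three cut points give four sections
income_cuts = [10_000, 40_000, 90_000]
rates = [0.00, 0.10, 0.20, 0.30]

def marginal_rate(income):
    'Return the rate of the bracket that the income falls in'
    return rates[bisect.bisect(income_cuts, income)]
```

An income exactly on a cut point lands in the higher bracket, because `bisect.bisect()` returns the position to the right of equal entries.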

In [None]:
import bisect
cuts = [60, 70, 80, 90]
grades = 'FDCBA'

grades[bisect.bisect(cuts, 76)]
[grades[bisect.bisect(cuts, score)] for score in [76, 83, 92, 100, 69, 50]]

__`merge()` creates an iterator to combine multiple sorted inputs__

Sort multiple lists:
```sorted([10, 5, 20] + [1, 11, 25]) # several lists sorted (concat a new list inside)```

If multiple lists already sorted: 

In [None]:
a = [1, 3, 5]
b = [2, 4, 6]
c = [5, 10, 15]

from heapq import merge
merge_iterator = merge(a, b, c)

next(merge_iterator) # very long list; just pull a few elements

## islice
`islice()` and `next()` let you partially consume iterators

`islice()` consumes an iterator and produces an iterator; wrap it in `list()` to see the results (a generator is an iterator that runs on demand)

`islice(merge(*inputs), n)` beats combining inputs, fully sorting and slicing


In [None]:
from itertools import islice # itertools.islice(iterable, start, stop[, step])
list(islice('abcdefghi', 3))
list(islice('abcdefghi', 0, 4, 2))

it = merge(a, b, c)
list(islice(it, 3))

## sys.intern() to save memory
`s = intern(s)` saves memory for frequently used strings

e.g., for user information, we don't want many separate copies of the same username string scattered through memory

In [None]:
import sys
s = 'he'
t = 'llo'
hello = 'hello'
vv = s + t  # vv == hello, but the two strings may have different id()
u = sys.intern('hello')
v = sys.intern(s + t)

u is v


## expovariate
`random.expovariate()` is commonly used in simulations of arrival times

In [None]:
from statistics import mean
import random

mean([random.expovariate(1 / 5) for _ in range(500)]) # approaches 5, the mean of expovariate(1 / 5)

## time
`time.sleep()` adds a delay

`time.ctime()` for end-user and `time.time()` to store timestamps

In [None]:
import time
x = 10; print(x ** 2)
time.sleep(2); print('Done')

print(time.time())
print(time.ctime())

## hash
prefer sha256 and sha512 over the weaker md5 and sha1 hash functions

`hashlib.pbkdf2_hmac()` iterates a hash such as sha512 many times to slow down brute-force password guessing attacks (and mixes in a salt)
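
A minimal sketch of storing and checking a salted password hash with `pbkdf2_hmac`. The helper names `hash_password` and `check_password`, the 16-byte salt and the iteration count are assumptions made for the illustration, not values from the course.

```python
import hashlib
import hmac
import os

def hash_password(password, salt=None, iterations=100_000):
    'Return (salt, digest); a fresh random salt is generated when none is given'
    if salt is None:
        salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac('sha512', password.encode('utf-8'), salt, iterations)
    return salt, digest

def check_password(password, salt, expected, iterations=100_000):
    'Recompute the digest with the stored salt and compare in constant time'
    _, digest = hash_password(password, salt, iterations)
    return hmac.compare_digest(digest, expected)

salt, stored = hash_password('passphrase')
```

Only the salt and the digest need to be stored with the user account; the password itself is never kept.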

In [None]:
# deprecated: md5
import hashlib
try:
    out = hashlib.md5('foo')                  # TypeError: str must be encoded first
except TypeError:
    out = hashlib.md5('foo'.encode('utf-8'))
print(out.digest())
print(out.hexdigest()) # more human readable

In [None]:
# sha256
hashlib.sha256('foo'.encode('utf-8')).hexdigest()

In [None]:
# slow down the hash function
# run multiple times:
b = 'foo'.encode('utf-8')
b = hashlib.sha512(b).digest()
b = hashlib.sha512(b).digest()
b = hashlib.sha512(b).digest()
b = hashlib.sha512(b).digest()

# =====>
p = 'passphrase'.encode('utf-8')
h = hashlib.pbkdf2_hmac('sha256', p,  salt=b'rand string', iterations=100)
h

## `__repr__`

In [None]:
s = 'foo'
t = 'bar'
print(s+t)

s = 'fo'
t = 'obar'
print(s+t) # displays the same output for different inputs, which we don't want

repr((s, t))

## join strings
opposite of split

In [None]:
l = ['foo', 'bar', '2019']
' '.join(l)

## Division
/ vs //

## Ternary operator / Conditional expression
Ternary <-- 3
```Python
<posres> if <cond> else <negres>
```

In [None]:
score = 70
'pass' if score >= 70 else 'fail'

## short-circuit boolean
`and` and `or` return the __value__ that caused the expression to be `True` or `False`

Usage: `s = s or "default"` when passing optional arguments
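
A sketch of the values that `and` and `or` return, together with one pitfall of the `or "default"` idiom. The function names `label` and `label_strict` are hypothetical.

```python
first_truthy = '' or 'default'      # `or` returns the first truthy operand
kept = 'given' or 'default'
last_checked = True and 'hello'     # `and` returns the last operand when all are truthy
stopped_early = None and 'hello'    # ...or the first falsy one

def label(count=None):
    # Pitfall: a legitimate falsy argument such as 0 is also replaced
    return count or 'default'

def label_strict(count=None):
    # Testing for None explicitly keeps 0, '' and other falsy values
    return 'default' if count is None else count
```

The idiom is safe when every falsy value should fall back to the default; when 0 or an empty string is meaningful, the explicit `is None` test is the safer choice.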

In [None]:
3 < 10 and 10 < 20
bool('hello')
True and 'hello'

In [None]:
def f(x, s=None):
    s = s or 'default'
    print(x, s)
f(10, 'some value')
f(10)

# Implementing a Pub/Sub

## Start by developing the _data model_
Users make posts. Newer posts are more relevant. 

- post: namedtuple
- place to put it - newest->oldest
    - not a list (inserting at the front doesn't scale) -> a list suits data that grows to the right
    - `deque()` is preferred over `list()` because it supports appendleft()
    - double-ended queue; append and pop on both sides
- `defaultdict()` with `deque()` simplifies per-user accumulation of posts
- `deque.appendleft(datum)` >> `list.insert(0, datum)`
- In large programs, memory use is dominated by data not by containers
- keep the test data in separate file (session.py)
- Testing with `pyflakes`, `mypy`, testcases, hypothesis
- Type definition alias: `User = str`
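
A small sketch of the newest-to-oldest storage idea with `deque.appendleft()`. The name `timeline` and the sample texts are illustrative.

```python
from collections import deque
from itertools import islice

timeline = deque()
for text in ['first', 'second', 'third']:
    timeline.appendleft(text)         # O(1); list.insert(0, text) is O(n)

# The newest entries sit at the left, so a short slice gives the latest posts
newest_two = list(islice(timeline, 2))
```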

In [None]:
'pubsub.py: Simple message pub/sub service'

from typing import NamedTuple, Deque, DefaultDict
from collections import deque, defaultdict
import time

User = str
#Post = namedtuple('Post', ['timestamp', 'user', 'text'])
Post = NamedTuple('Post', [('timestamp', float), ('user', str), ('text', str)])

posts = deque()                  # type: Deque[Post]  # Posts from newest to oldest
user_posts = defaultdict(deque)   # type: DefaultDict[User, deque] # defaultdict for accumulation

def post_message(user: User, text: str, timestamp: float=None) -> None:
    timestamp = timestamp or time.time()
    post = Post(timestamp, user, text)
    posts.appendleft(post)
    user_posts[user].appendleft(post)

In [None]:
'session.py: Sample data to test the pubsub internals'
# from pubsub import *
post_message('steve', 'hello world')
post_message('gx', 'xbbbbbb')
post_message('foo', 'bar')
post_message('steve', 'hello world222')

pprint(posts)
pprint(user_posts['steve'])

## Let one user follow another
Followers subscribe to the posts they are interested in. 

In [None]:
'pubsub.py: Simple message pub/sub service'

from typing import NamedTuple, Deque, DefaultDict, List, Optional, Set
from collections import deque, defaultdict
from heapq import merge
from sys import intern
import time

User = str
Timestamp = float
#Post = namedtuple('Post', ['timestamp', 'user', 'text'])
Post = NamedTuple('Post', [('timestamp', float), ('user', str), ('text', str)])

posts = deque()                  # type: Deque[Post]  # Posts from newest to oldest
user_posts = defaultdict(deque)   # type: DefaultDict[User, deque] # defaultdict for accumulation
following = defaultdict(set)               # type: DefaultDict[User, Set[User]]
followers = defaultdict(set)               # type: DefaultDict[User, Set[User]]

def post_message(user: User, text: str, timestamp: float=None) -> None:
    timestamp = timestamp or time.time()
    post = Post(timestamp, user, text)
    posts.appendleft(post)
    user_posts[user].appendleft(post)
    
def follow(user: User, followed_user: User) -> None:
    following[user].add(followed_user)
    followers[followed_user].add(user)

In [None]:
'session.py: Sample data to test the pubsub internals'
# from pubsub import *
from pprint import pprint

post_message('steve', 'hello world')
post_message('gx', 'xbbbbbb')
post_message('foo', 'bar')
post_message('steve', 'hello world222')

follow('steve', followed_user='gx')
follow('gx', followed_user='foo')
follow('gx', followed_user='steve')

# pprint(posts)
# pprint(user_posts['steve'])
pprint(following)
pprint(followers)

## Display posts and followers
Display Posts by a user, posts for a user or posts matching a search request. Display followers of a user. Display those followed by a user. 
- `deque` already supports 
- `islice` >> `list(user_posts['steve'])`
- dev procedure:
    1. `list(islice(user_posts[user], limit))` in python shell
    1. give function name, arg names
    1. type hinting
    1. testing the code
    1. Tool chain: pyflakes, mypy
    1. When code is working, look back to optimize 
- `merge` two sorted iterable (`from heapq`)

In [None]:
from itertools import islice
list(islice(user_posts['steve'], 2))

In [None]:
def posts_by_user(user: User, limit: Optional[int]=None) -> List[Post]:
    return list(islice(user_posts[user], limit))

def posts_for_user(user: User, limit: Optional[int]=None) -> List[Post]:
    relevant = list(merge(*[user_posts[followed_user] 
                            for followed_user in following[user]], reverse=True))
    return list(islice(relevant, limit))
    
pprint(posts_for_user('gx', limit=10))

## Efficiency
Apply interning which eliminates redundant strings to save memory

In [1]:
'pubsub.py: Simple message pub/sub service'

from typing import NamedTuple, Deque, DefaultDict, List, Optional, Set
from collections import deque, defaultdict
from itertools import islice
from heapq import merge
from sys import intern
import time

User = str
Timestamp = float
#Post = namedtuple('Post', ['timestamp', 'user', 'text'])
Post = NamedTuple('Post', [('timestamp', float), ('user', str), ('text', str)])

posts = deque()                  # type: Deque[Post]  # Posts from newest to oldest
user_posts = defaultdict(deque)   # type: DefaultDict[User, deque] # defaultdict for accumulation
following = defaultdict(set)               # type: DefaultDict[User, Set[User]]
followers = defaultdict(set)               # type: DefaultDict[User, Set[User]]

def post_message(user: User, text: str, timestamp: float=None) -> None:
    user = intern(user)                # sys.intern()
    timestamp = timestamp or time.time()
    post = Post(timestamp, user, text)
    posts.appendleft(post)
    user_posts[user].appendleft(post)
    
def follow(user: User, followed_user: User) -> None:
    user, followed_user = intern(user), intern(followed_user)
    following[user].add(followed_user)
    followers[followed_user].add(user)
    
def posts_by_user(user: User, limit: Optional[int]=None) -> List[Post]:
    return list(islice(user_posts[user], limit))

def posts_for_user(user: User, limit: Optional[int]=None) -> List[Post]:
    relevant = merge(*[user_posts[followed_user]
                       for followed_user in following[user]], reverse=True)  # lazy: no intermediate list
    return list(islice(relevant, limit))

def search(phrase: str, limit: Optional[int]=None) -> List[Post]:
    # TODO: add pre-indexing to speed-up searches
    # TODO: Add time-sensitive caching of search queries
    # return [post for post in posts if phrase in post.text]
    return list(islice((post for post in posts if phrase in post.text), limit))   # [] -> (): list -> generator

# Bottle REST APIs
Micro web frameworks (such as Bottle) are all about minimizing the code and effort required to link an application to a web server. Decorators connect a route or path to a function. The function manages getting a user request, calling the application and forming the response. 
- set / get headers
- extract queries
- content negotiation
- common patterns in REST APIs


In [None]:
from pprint import pprint
from bottle import *
import time

@route('/')
def welcome():
    response.set_header('Vary', 'Accept')
    pprint(dict(request.headers))
    response.content_type = 'text/plain'
    return 'hell0'

## Content negotiation
Content negotiation attempts to honor user preferences
- Different requestors get different responses
- smaller content to mobile users / bigger content to desktop

In [None]:
@route('/')
def welcome():
    if 'text/html' in request.headers.get('Accept', '*/*'):
        response.content_type = 'text/html'
        return '<h1> HI! </h1>'
    response.content_type = 'text/plain'
    return 'hell0'

## Dynamic content

In [None]:
@route('/now')
def time_service():
    response.content_type = 'text/plain'
    return time.ctime()

## Caching
- Caching is used to limit the load on the server
- Reverse proxy like nginx 
- __header__ for cache control
- don't set 'max-age' too high (> 1 hour)

- BUT content negotiation can confuse caches unless the `Vary` header is used
    - `response.set_header('Vary', 'Accept')`

In [None]:
@route('/now')
def time_service():
    response.set_header('Cache-Control', 'max-age=1') # cache for 1 sec
    response.content_type = 'text/plain'
    return time.ctime()

## Dynamic Route
- Dynamic routes are marked with _angle brackets_
- Query String `?key=value&key2=value2`
    1. assign route
    1. extract query information
    1. call application
    1. format result
- 500 Server errors indicate a need for better error handling
    - user should never see 5xx
- JSON
    - returned dict includes both query and answer
- `Vary` header: tells caches to store separate responses depending on the contents of the named request header
- Cookies
    - store information on the user side
    - `request.get_cookie`, `response.set_cookie`
    - cookies are easily spoofed, so they have a lower level of trust
    - sign cookies by passing a `secret`

In [None]:
@route('/upper/<word>')
def upper_case_service(word):
    return word.upper()

In [None]:
secret = 'abcdefgh'

@route('/area/circle')
def circle_area_service():
    # pprint(dict(request.query))
    last_visit = request.get_cookie('last-visit', 'unknown', secret=secret)
    print(f'Last visit {last_visit}')
    response.set_header('Vary', 'Accept')
    response.set_cookie('last-visit', time.ctime(), secret=secret)   # cookie
    
    try:
        radius = float(request.query.get('radius', '0.0'))
    except ValueError as e:
        return e.args[0]
    
    area = radius ** 2.0 * 3.14 # business logic; should come from another module
    
    if 'text/html' in request.headers.get('Accept', '*/*'):
        response.content_type = 'text/html'
        return f'<p>The radius is {radius!r}</p>'
    return dict(radius=radius, area=area, service=request.path)

## Bottle's templating tool
- `{{ expression }}`
- can put _statements_ in a template: lines starting with `%`, closed with `% end`

In [None]:
from bottle import template
print(template('The answer is {{x}} today', x=10))

lastname = 'Q'
first_names = 'S X L'.split()
family_template = '''\
The {{ lastname.title() }} Family
{{ '=' * (len(lastname) + 11) }}
% for name in first_names:
* {{ name.title() }}
% end
'''

print(template(family_template, lastname=lastname, first_names=first_names))

## Small file server

In [None]:
import os
os.listdir('.')


## file server ############

file_template = '''\
<h1> List of files </h1>
<hr>
<ol>
  % for file in files:
    <li> <a href="files/{{ file }}"> {{ file }} </a> </li>
  % end
</ol>
'''

@route('/files')
def show_file():
    response.set_header('Vary', 'Accept')

    files = os.listdir('.')
    if 'text/html' not in request.headers.get('Accept', '*/*'):
        return dict(files=files)
    response.content_type = 'text/html'
    return template(file_template, files=files)
        
@route('/files/<filename>')
def serve_one_file(filename):
    return static_file(filename, './')

In [None]:
# Run
if __name__ == '__main__':
    run(host='0.0.0.0', port=18080)

# PubSub Web App
Pub Sub Service
- Display login page and check credentials
- Post a message
- Run a search
- Display followers or following
- Show user page
- Return static content