## Building Foundational Python Skills for Data Analytics

https://www.safaribooksonline.com/library/view/modern-python-livelessons/9780134743400/MOPY_01_01_01.html

# Resampling - Preparation
Big Idea: Statistics modeled in a program are easier to get right and understand than a formulaic approach. Modeling also extends to more complicated situations than the classic formulas cover.

## F-strings
* Old style: percent style - `%-formatting`
* newer style: `.format()`
* newer style: `f''`
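
All three styles accept the same format specifications, so a side-by-side comparison may help. This is only an illustrative sketch; the names `value` and `ratio` are made up for the example.

```python
value = 10
ratio = 2 / 3

# The same zero-padded, width-8 integer spec in each style
percent_style = 'The answer is %08d today' % value
format_style = 'The answer is {:08d} today'.format(value)
fstring_style = f'The answer is {value:08d} today'

# An f-string can also apply a precision spec to a float
rounded = f'{ratio:.3f}'

# !r inserts repr(), which keeps the quotes around a string
quoted = f'{"ten"!r}'
```

The f-string version reads closest to the final output, which is probably why it is the preferred style for new code.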

In [None]:
x = 10
print('The answer is %d today' % x)

print('The answer is {0} today'.format(x))
print('The answer is {x} today'.format(x=x))

print(f'The answer is {x} today')
print(f'The answer is {x :08d} today') # Format operators
print(f'The answer is {x ** 2 :08d} today')  # Run expression inside

"""common for raising exception strings"""
raise ValueError(f"Expected {x!r} to be a float, not a {type(x).__name__}")

## Counter Objects
* Counter: a subclass of dict; modeled on the Bag class from Smalltalk
* Finds out how many times each event occurs
* Bags: called a multiset in C++; no order, but each element has a frequency
* `c.elements()`: yields each element as many times as its count

In [None]:
from collections import Counter
d = {}
# A traditional dict raises a KeyError if the key is missing
# d['dragons']
d = Counter()
d['dragons']
d['dragons'] += 1
print(d)

# Count elements in a list
c = Counter('red green red blue red blue green'.split())
print(f'Counter is: {c}; Most common color is {c.most_common(1)}')

# list all elements of the counter
list(c.elements())
list(c) # list the keys (same as for a dict)
list(c.values()) # list values
list(c.items()) # list k,v pair
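
Because a Counter behaves like a bag (multiset), it also supports arithmetic between bags. The sketch below uses made-up counts (`bag`, `other`) purely for illustration.

```python
from collections import Counter

bag = Counter(red=3, blue=1)
other = Counter(red=1, green=2)

# Addition sums the counts key by key
combined = bag + other

# Subtraction drops any key whose count falls to zero or below
remaining = bag - other

# elements() repeats each key as many times as its count
expanded = sorted(bag.elements())
```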

## Statistics Module
describing data

In [None]:
from statistics import mean, median, mode, stdev, pstdev
numbers1 = [50, 52, 53]
mean(numbers1)
# sample stdev vs population stdev
stdev(numbers1)
pstdev(numbers1)
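
The difference between `stdev` and `pstdev` is the divisor: the sample version divides the squared deviations by `n - 1`, the population version by `n`. A minimal sketch that recomputes both by hand (the variable names are illustrative):

```python
from math import sqrt
from statistics import mean, pstdev, stdev

numbers = [50, 52, 53]
mu = mean(numbers)
squared_deviations = sum((x - mu) ** 2 for x in numbers)

# Sample standard deviation: divide by n - 1
sample_sd = sqrt(squared_deviations / (len(numbers) - 1))

# Population standard deviation: divide by n
population_sd = sqrt(squared_deviations / len(numbers))
```

For the same data the sample figure is always the larger of the two, and the gap shrinks as `n` grows.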

## Sequence Operations
concat / slicing / count / index / sort

In [None]:
s = [10, 20, 30]
t = [0, 40, 50, 60, 60]

# List concat
u = s + t
print(u)

# Slicing
print(u[:2])
print(u[-2:])
print(u[:2] + u[-2:]) # Concat 

# all sequences have _count_ and _index_
print(dir(list))
u[0]
u.index(10)
u.count(60)

# sort
u.sort()  # sort original list
t = sorted(u) # construct a new list

## `lambda` expressions
* try to replace lambda with `partial`, `itemgetter`, `attrgetter`, ...
* should have been called `make_function()`
* Deferring a computation to the future by making a function of __no arguments__. The computation runs only when the function is called. aka freeze/thaw; promises

In [None]:
lambda x: x**2
print((lambda x: x**2)(5))

# multiple arguments
lambda x, y: 3*x + y

x = 10
y = 20
f = lambda : x**y # Frozen: nothing is computed until we call f() to thaw it
f() 
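
As a sketch of the advice to prefer `itemgetter` or `partial` over a lambda where one fits, the example below sorts made-up `(name, age)` pairs both ways and freezes one argument of `pow`. The data and names are assumptions for illustration only.

```python
from functools import partial
from operator import itemgetter

pairs = [('tom', 30), ('ann', 25), ('bob', 35)]

# Two equivalent key functions: a lambda and itemgetter
by_age_lambda = sorted(pairs, key=lambda pair: pair[1])
by_age_getter = sorted(pairs, key=itemgetter(1))

# partial freezes the first positional argument of pow (the base)
two_to_the = partial(pow, 2)
eight = two_to_the(3)

# Freeze/thaw: the sum is not computed until deferred() is called
deferred = lambda: sum(range(1000))
thawed = deferred()
```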

## Chained Comparison

In [None]:
x = 15
x > 6
x < 10
x > 6 and x < 10
6 < x < 20 # chained form: x is evaluated only once, so it is a bit more efficient
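
One way to see that the middle operand of a chained comparison is evaluated only once is to count calls. This is a small illustrative sketch; `middle` and `calls` are hypothetical names.

```python
calls = []

def middle():
    """Return 15 and record that the function was called."""
    calls.append(1)
    return 15

# Chained form: middle() runs a single time
chained = 6 < middle() < 20
chained_calls = len(calls)

calls.clear()

# Spelled-out form: middle() runs once per comparison
spelled_out = 6 < middle() and middle() < 20
spelled_out_calls = len(calls)
```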

## Random / Distribution

In [None]:
import random
random.seed(123456789) # testing
random.random()

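# Continuous distributions
random.uniform(1000, 1100)
random.triangular(1000, 1100) # values near the midpoint are chosen more often
random.gauss(100, 15) # normal distribution, e.g. a random IQ: mean 100, stdev 15
random.expovariate(20) # exponential with rate 20, so the mean is 1/20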

from statistics import mean, stdev
# triangular, uniform, gauss, expovariate
data = [random.expovariate(20) for i in range(1000)]
print(mean(data))
print(stdev(data))

In [None]:
# discrete dist
from random import choice, choices, sample, shuffle
outcomes = ['win', 'lose', 'draw', 'play again', 'double win']
choice(outcomes)
choices(outcomes, k=10) # Sampling with replacement

from collections import Counter
Counter(choices(outcomes, k=10))

sorted(sample(range(1, 57), k=6)) # Sampling without replacement

sample(outcomes, k=1)[0]
choice(outcomes) # choice is a special case of sample (k=1)

shuffle(outcomes); outcomes
sample(outcomes, k=len(outcomes)) # shuffle is a special case of sample
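
The practical difference between the two sampling modes is whether a value can come up twice. A minimal sketch, with an arbitrary seed chosen only to make the run repeatable:

```python
import random

random.seed(2024)  # arbitrary seed, used only for repeatability
numbers = range(1, 57)

# Without replacement: six distinct lottery-style numbers
without_replacement = random.sample(numbers, k=6)

# With replacement: 200 draws from 56 values must repeat some of them
with_replacement = random.choices(numbers, k=200)
```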


# Resampling Statistics

* Six roulette wheel spins / choices with weighting

In [None]:
from random import *
from statistics import *
from collections import *

# Six roulette wheel spins -- 18 red, 18 black, 2 green
# ways to populate
choice(['red', 'red', 'red', 'black', 'black', 'black', 'green'])

population = ['red'] * 18 + ['black'] * 18 + ['green'] * 2
choice(population)

[choice(population) for i in range(6)]
Counter([choice(population) for i in range(6)])

Counter(choices(population, k=6))

# better way
Counter(choices(['red', 'black', 'green'], [18, 18, 2], k=6))
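
A quick sanity check on the weighted form is to spin many times and compare the observed share of green with the expected 2/38. This is a sketch under assumed settings: the seed and the number of spins are arbitrary.

```python
from collections import Counter
from random import choices, seed

seed(12345)  # arbitrary seed, used only for repeatability
spins = 100_000

counts = Counter(choices(['red', 'black', 'green'], weights=[18, 18, 2], k=spins))

green_share = counts['green'] / spins
expected_green = 2 / 38  # about 0.0526
```

With this many spins the observed share should land close to the expected value, which is the same reasoning the resampling examples rely on.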

* deal 20 cards without replacement (16 tens, 36 low)
* after a random deal, what is the likelihood of winning?

In [None]:
# define what the deck is
deck = Counter(tens=16, low=36)
# list all elements
deck = list(deck.elements())
deal = sample(deck, 20)
Counter(deal)
# deal 52 cards
deal = sample(deck, 52)
remainder = deal[20:]
Counter(remainder)

* a weighted biased coin (spin it)
* P(5 or more heads from 7 spins)

In [None]:
pop = ['heads', 'tails']
cumwgt = [0.60, 1.00] # cumulative weights

# 1 trial
trial = lambda : choices(pop, cum_weights=cumwgt, k=7).count('heads') >= 5
n = 100
sum(trial() for i in range(n)) / n

# compare to the analytic approach 
from math import factorial as fact

def comb(n, r):
    return fact(n) // fact(r) // fact(n - r)

comb(10, 3)
ph=0.6
# 5 heads out of 7 spins
ph ** 5 * (1-ph) ** 2 * comb(7, 5)
# 6 heads out of 7 spins
ph ** 6 * (1-ph) ** 1 * comb(7, 6)

sum(ph**i * (1-ph)**(7-i) * comb(7, i) for i in range(5, 8))
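
With only 100 trials the simulated estimate is noisy. The sketch below repeats the comparison with more trials and uses `math.comb`, which assumes Python 3.8 or later; the seed and trial count are arbitrary choices.

```python
from math import comb
from random import choices, seed

seed(7)  # arbitrary seed, used only for repeatability
p_heads = 0.6

def trial():
    """Spin the biased coin 7 times; report whether 5 or more came up heads."""
    spins = choices(['heads', 'tails'], cum_weights=[0.60, 1.00], k=7)
    return spins.count('heads') >= 5

n = 100_000
estimate = sum(trial() for _ in range(n)) / n

# Binomial probability of 5, 6, or 7 heads out of 7 spins
exact = sum(comb(7, i) * p_heads**i * (1 - p_heads)**(7 - i) for i in range(5, 8))
```

The exact value works out to roughly 0.42, and the simulated estimate should agree to about two decimal places at this sample size.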

* Does the median-of-five fall in the middle two quartiles?

In [None]:
sample(range(10000), 5)
sorted(sample(range(10000), 5))[2]

n = 100000
n // 4
n * 3 // 4

trial = lambda : n // 4 < median(sample(range(n), 5)) <= 3 * n // 4
sum(trial() for i in range(n)) / n

# Improving Reliability

## Mypy and Type Hinting
__Big idea__: Adding type hints to code helps clarify your thoughts, improves documentation, and may allow a static analysis tool to detect some kinds of errors.
* Use of `# type: ` - Type comment
* Use of function annotation
* Use of class
* Container[Type]
* Tuple and ...
* Optional arguments
* Deque vs deque
* Issues with f-string, new colon notation, secrets module

Tools:
* mypy
* pyflakes
* hypothesis
* unittest -> `nose`, `py.test`  [classic, most beautiful way to run unittests]

In [None]:
# To check, run mypy: `python3 -m mypy hints.py`
import typing
from typing import List, Optional, Sequence
from collections import OrderedDict, deque, namedtuple

# Old style:
x = 10 # type: int

# New style
x: int = 10
    
# Function annotations: checked by static tools before run time, not enforced at run time
def f(x: int, y: Optional[int]=None) -> int:
    if y is None:
        y = 20
    return x + y
    
x = {} # type: OrderedDict (mypy should report an error: a plain dict is not an OrderedDict)
y = OrderedDict() # type: OrderedDict

# Sequence: indexable / iterable
def g(x: Sequence):
    print(len(x))
    print(x[2])
    for i in x:
        print(i)
    print()

# Specify Sequence type: 
def g(x: Sequence[int]):
    pass

# Specify a list
def g(x: List[int]):
    pass

info = ('Foo', 'Bar', 'Var', 'length') # Tuple[str, ...]  (all strings)

Point = namedtuple('Point', ['x', 'y'])
Point2 = typing.NamedTuple('Point2', [('x', int), ('y', int)])
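
Several of the items listed above (`Tuple` with `...`, an optional argument, `Deque` vs `deque`) can appear together in one small function. The function `last_items` below is hypothetical, written only to illustrate the hints; it is a sketch rather than code from the lesson.

```python
from collections import deque
from typing import Deque, Optional, Tuple

def last_items(words: Tuple[str, ...], limit: Optional[int] = None) -> Deque[str]:
    """Keep the final `limit` words, or all of them when limit is None."""
    # Deque (capitalized) is the type hint; deque (lowercase) is the class
    window: Deque[str] = deque(maxlen=limit)
    window.extend(words)
    return window

recent = last_items(('Foo', 'Bar', 'Var'), limit=2)
everything = last_items(('Foo', 'Bar', 'Var'))
```

The annotations do not change run-time behavior; they only give a checker such as mypy something to verify.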

## fsum, true division
* `fsum()` is more accurate than `sum()`
```Python
from math import fsum 
```
* `/` vs `//` [true division vs floor division]

In [None]:
print(f'{1.1 + 2.2}')
print(f'{1.1 + 2.2 == 3.3}')
print(f'{sum([0.1] * 10)}')
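
The rounding error is easiest to see when a total is accumulated step by step. The sketch below contrasts that with `fsum()` and shows the two division operators. Note that on some recent Python versions the built-in `sum()` is itself more accurate for floats, so the explicit loop is used here to make the error visible.

```python
from math import fsum, isclose

# Naive accumulation carries the rounding error of each step forward
total = 0.0
for _ in range(10):
    total += 0.1

# fsum tracks the lost low-order bits and rounds once at the end
exact_total = fsum([0.1] * 10)

# Compare floats with a tolerance rather than ==
close_enough = isclose(1.1 + 2.2, 3.3)

true_quotient = 7 / 2      # true division always gives a float
floor_quotient = 7 // 2    # floor division rounds down
negative_floor = -7 // 2   # rounds toward negative infinity, not toward zero
```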

##  Grouping with `defaultdict`
`defaultdict` creates a new container to store elements with a common feature (key)

```Python
d = defaultdict(set)
d['t'] # returns an empty set, where a regular dict would raise KeyError
d['t'].add('tom')
```

```Python
d = defaultdict(list)
d['t'].append('tom')
d['t'].append('tom')
```

In [None]:
from pprint import pprint
from collections import defaultdict
names = '''david betty susan mary darlene sandy davin shelly tom michael'''.split()
d = defaultdict(list)
for name in names:
    feature = name[0] # len(name) / name[-1], etc
    d[feature].append(name)
pprint(d, width=60)

d = defaultdict(list)
for name in names:
    feature = len(name)
    d[feature].append(name)
pprint(d)
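
Since only the `feature` line changes between the two loops, the pattern can be wrapped in a helper that takes the feature as a function. `group_by` is a hypothetical name used for this sketch, not something from the standard library.

```python
from collections import defaultdict

def group_by(items, key):
    """Group items into lists, keyed by whatever key(item) returns."""
    groups = defaultdict(list)
    for item in items:
        groups[key(item)].append(item)
    return dict(groups)  # hand back a regular dict

names = 'david betty susan mary darlene sandy davin shelly tom michael'.split()

by_length = group_by(names, len)
by_first_letter = group_by(names, lambda name: name[0])
```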

## Key function
__What__: A key function takes one argument and transforms it into a key; the operations below then compare items by that key.

Works with: min(), max(), sorted(), nsmallest(), nlargest(), groupby() and merge()

SQL:
```SQL
SELECT name FROM names ORDER BY LENGTH(name);
```

Python:
```Python
pprint(sorted(names, key=len))
```
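
The same `key=len` argument works with several of the other functions listed above. A short sketch, reusing the list of names from the grouping example:

```python
from heapq import nlargest, nsmallest

names = 'david betty susan mary darlene sandy davin shelly tom michael'.split()

# min and max compare the keys but return the original items
shortest = min(names, key=len)
longest = max(names, key=len)  # on a tie, the first maximal item is returned

# The n shortest and n longest names by the same key
two_shortest = nsmallest(2, names, key=len)
two_longest = nlargest(2, names, key=len)
```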

## Transposing 2-D data with zip() and star-args
- zip: brings multiple sequences together pair-wise; unpaired elements are left out
- `from itertools import zip_longest`: fills in missing values
- use with star (*) -> unpacks m into separate arguments; each row becomes an argument

In [None]:
l = list(zip('abcdef','ghijklm'))
print(f'Result: {l}; m is missing') # notice `m` is missing

from itertools import zip_longest
list(zip_longest('abcdef','ghijklm'))
list(zip_longest('abcdef','ghijklm', fillvalue='x'))


# Transposing 3x2 matrix to 2x3
m = [[10, 20], [30, 40], [50, 60]]
print(f'{list(zip([10, 20], [30, 40], [50, 60]))}')

print(f'{list(zip(*m))}')
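
Once the rows are transposed, per-column work becomes a plain loop over the result. A small sketch that sums each column and transposes back again (the variable names are illustrative):

```python
m = [[10, 20], [30, 40], [50, 60]]

# Each tuple holds one column of the original matrix
columns = list(zip(*m))

# Column-wise aggregation: one sum per column
column_sums = [sum(column) for column in zip(*m)]

# Transposing twice restores the original rows (as lists again)
round_trip = [list(row) for row in zip(*columns)]
```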

## Flattening data with list comprehensions

In [None]:
m = [[10, 20], [30, 40], [50, 60, 70]]
pprint(m, width=15)

# flattening a matrix
for row in m:
    for col in row:
        print(col)
        
[x for row in m for x in row] # Same result using list comprehension

non_flat = m
for x in non_flat:
    if len(x) > 2:
        for y in x:
            print(y)
[y for x in non_flat if len(x) > 2 for y in x] # Same as above            

## Convert an iterator into a list
Use `list(iterator)`
- list: indexable; loop over it multiple times; run in reverse
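
The reason for the conversion is that an iterator can be consumed only once. A minimal sketch using a generator expression:

```python
squares = (n * n for n in range(4))

first_pass = list(squares)   # consumes the generator
second_pass = list(squares)  # nothing is left the second time

# The list can be indexed, reversed, and looped over repeatedly
last_square = first_pass[-1]
backwards = first_pass[::-1]
```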

# K-means - Cluster Analysis
* Big Idea: K-means is an unsupervised learning tool for identifying clusters within datasets.
* Algorithm:
    - Pick arbitrary points as guesses for the center of each group.
    - Assign all the data points to the closest matching group.
    - Within each group, average the points to get a new guess for the center of the group.
    - Repeat multiple times: assign the data, then average the points.

* Goal: 
Express the idea more clearly and beautifully in Python than in English.


Tasks: 
- mean(data)
- dist(point, point)
- assign_data(centroids, points)   # centroid = potential center of a cluster
- compute_centroids(groups)
- k_means(points)

In [4]:
from typing import Iterable, Tuple, Sequence, Dict, List
from pprint import pprint
from math import fsum, sqrt

# Alias Tuple[int, ...] to Point
Point = Tuple[int, ...]
Centroid = Point

points = [
    (10, 41, 23),
    (22, 30, 29),
    (11, 42, 5),
    (20, 32, 4),
    (12, 40, 12),
    (21, 36, 23),
]

def mean(data: Iterable[float]) -> float:
    'Accurate arithmetic mean'
    data = list(data) # data might be sequence OR generator; convert generator to list
    return fsum(data) / len(data)

def dist_old(p, q):
    'Euclidean distance function for multi-dimensional data'
    return sqrt(fsum([(x - y) ** 2 for x, y in zip(p, q)]))

# bind the globals as default arguments so they become fast local lookups; verify with `from dis import dis; dis(dist)`
def dist(p: Point, q: Point, fsum=fsum, sqrt=sqrt, zip=zip) -> float:
    return sqrt(fsum([(x - y) ** 2 for x, y in zip(p, q)]))

from dis import dis
dis(dist)

 29           0 LOAD_FAST                3 (sqrt)
              2 LOAD_FAST                2 (fsum)
              4 LOAD_CONST               1 (<code object <listcomp> at 0x7fdc403a39c0, file "<ipython-input-2-1cd6c7d96beb>", line 29>)
              6 LOAD_CONST               2 ('dist2.<locals>.<listcomp>')
              8 MAKE_FUNCTION            0
             10 LOAD_FAST                4 (zip)
             12 LOAD_FAST                0 (p)
             14 LOAD_FAST                1 (q)
             16 CALL_FUNCTION            2
             18 GET_ITER
             20 CALL_FUNCTION            1
             22 CALL_FUNCTION            1
             24 CALL_FUNCTION            1
             26 RETURN_VALUE


In [6]:
from collections import defaultdict
from functools import partial

def assign_data(centroids: Sequence[Centroid], data: Iterable[Point]) -> Dict[Centroid, List[Point]]:
    'Group the data points to the closest centroid'
    d = defaultdict(list)
    for point in data:
        closest_centroid = min(centroids, key=partial(dist, point)) # same as key=lambda centroid: dist(point, centroid)
        d[closest_centroid].append(point)
    return dict(d) # convert back to regular dict

# distance from the point to each centroid:
centroids = [(9, 39, 20), (12, 36, 25)]
point = (11, 42, 5)
[dist(point, centroid) for centroid in centroids]

# use a key function to return the closest centroid itself, rather than its distance
min(centroids, key=lambda centroid: dist(point, centroid))

(9, 39, 20)

In [7]:
from functools import partial # partial function evaluation; freeze some arguments
pow(2, 5)
twopow = partial(pow, 2)
twopow(5)
min(centroids, key=partial(dist, point)) # partial version

(9, 39, 20)

In [8]:
pprint(assign_data(centroids, points), width=45)

{(9, 39, 20): [(10, 41, 23),
               (11, 42, 5),
               (20, 32, 4),
               (12, 40, 12)],
 (12, 36, 25): [(22, 30, 29), (21, 36, 23)]}


In [16]:
def transpose(data):
    'Swap the rows and columns in a 2-D array of data'
    return list(zip(*data))

def compute_centroids(groups: Iterable[Sequence[Point]]) -> List[Centroid]:
    'Compute the centroid of each group'
    return [tuple(map(mean, transpose(group))) for group in groups]
    # return [tuple(map(mean, zip(*group))) for group in groups]

In [17]:
from random import sample # sample without replacement
def k_means(data: Iterable[Point], k=2, iterations=50) -> List[Centroid]:
    data = list(data) # turn iterable into sequence
    centroids = sample(data, k)
    for i in range(iterations):
        labeled = assign_data(centroids, data)
        centroids = compute_centroids(labeled.values())
    return centroids

centroids = k_means(points, k=3)
d = assign_data(centroids, points)
pprint(d)

{(11.5, 41.0, 8.5): [(11, 42, 5), (12, 40, 12)],
 (17.666666666666668, 35.666666666666664, 25.0): [(10, 41, 23),
                                                  (22, 30, 29),
                                                  (21, 36, 23)],
 (20.0, 32.0, 4.0): [(20, 32, 4)]}
