Most people write way too much code. Their ideas become muddy and the code becomes hard to change.

Express big ideas with a small amount of code.

# The statistics module

Core tools for data analytics!

New in version 3.4.

[Docs](https://docs.python.org/3/library/statistics.html)

In [1]:
from statistics import mean, median, mode, stdev, pstdev

In [2]:
mean([50, 52, 53])

51.666666666666664

In [3]:
median([51, 50, 52, 53])

51.5

In [5]:
mode([51, 50, 52, 53, 51, 51])

51

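About standard deviation at 11:00

pstdev (population) divides by the number n; stdev (sample) divides by n - 1.

When n is 1, dividing by n - 1 is a division by zero, which is why one formula heads to infinity while the other gives zero. In practice stdev raises StatisticsError when given fewer than two data points, and pstdev returns 0.0 for a single point.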

In [7]:
stdev([51, 50, 52, 53, 51, 51])

1.0327955589886444

In [8]:
pstdev([51, 50, 52, 53, 51, 51])

0.9428090415820634
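
The two results above differ only in the divisor: n - 1 for stdev, n for pstdev. A minimal sketch that recomputes both by hand, assuming a made-up helper named `spread` (it is not part of the statistics module):

```python
from math import sqrt
from statistics import mean, stdev, pstdev

data = [51, 50, 52, 53, 51, 51]

def spread(values, sample=True):
    """Hypothetical helper: divide by n - 1 for a sample, by n for a population."""
    m = mean(values)
    squared_deviations = sum((v - m) ** 2 for v in values)
    divisor = len(values) - 1 if sample else len(values)
    return sqrt(squared_deviations / divisor)

sample_sd = spread(data, sample=True)        # should agree with stdev(data)
population_sd = spread(data, sample=False)   # should agree with pstdev(data)
```

As n grows, n and n - 1 become nearly the same divisor, so the two values converge.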

# Concatenating lists

The + operator joins lists end to end. NumPy arrays, by contrast, add position by position.

In [10]:
s = [10, 20, 30]
t = [40, 50, 60]

In [11]:
u = s + t

In [12]:
u

[10, 20, 30, 40, 50, 60]
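
For comparison, position-by-position addition can be imitated in plain Python with zip. This is only a sketch of the idea, not how NumPy itself works:

```python
s = [10, 20, 30]
t = [40, 50, 60]

end_to_end = s + t                          # list concatenation
pairwise = [a + b for a, b in zip(s, t)]    # element-by-element sum

print(end_to_end)
print(pairwise)
```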

First two:

In [13]:
u[:2]

[10, 20]

Last two:

In [16]:
# Go back two, colon means "go all the way to the end"
u[-2:]

[50, 60]

In [17]:
u[:2] + u[-2:]

[10, 20, 50, 60]

A reminder about something people tend to forget: all sequences have count and index.

In [18]:
dir(list)

['__add__',
 '__class__',
 '__contains__',
 '__delattr__',
 '__delitem__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__gt__',
 '__hash__',
 '__iadd__',
 '__imul__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__mul__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__reversed__',
 '__rmul__',
 '__setattr__',
 '__setitem__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 'append',
 'clear',
 'copy',
 'count',
 'extend',
 'index',
 'insert',
 'pop',
 'remove',
 'reverse',
 'sort']

In [20]:
s = 'abracadabra'
i = s.index('c')

In [21]:
s[i]

'c'

In [22]:
s.count('c')

1

In [23]:
s.count('a')

5

These will be very useful for data analytics later on.

# Sorting

In [25]:
s = [10, 5, 70, 2]

In [26]:
s.sort()

In [27]:
s

[2, 5, 10, 70]

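sorted() returns a new sorted list and leaves the original alone, whereas list.sort() sorts in place. sorted() also accepts any iterable, not just lists.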

In [30]:
s = [10, 5, 70, 2]
t = sorted(s)

In [31]:
t

[2, 5, 10, 70]

In [32]:
sorted('cat')

['a', 'c', 't']
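
A short sketch of the difference between the two, with a common pitfall: list.sort() returns None, so assigning its result loses the list.

```python
s = [10, 5, 70, 2]

t = sorted(s)          # new list; s is left alone
unchanged = list(s)    # snapshot showing s still has its original order

result = s.sort()      # sorts s in place and returns None

print(t, unchanged, s, result)
```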

# Lambda

Map and lambda were popular before list comprehensions; then people forgot about lambda. Many tools have since been invented to help people avoid writing lambda.

Alternatives to lambda: partial objects, itemgetter, attrgetter, and many other objects that exist so we don't have to use lambda.
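
To make that concrete, here is a sketch that puts each tool next to the lambda it replaces. The `Point` record and the sample data are invented for the illustration:

```python
from collections import namedtuple
from functools import partial
from operator import attrgetter, itemgetter

Point = namedtuple('Point', ['x', 'y'])   # hypothetical record type for the demo
points = [Point(3, 1), Point(1, 2), Point(2, 0)]

# attrgetter('x') does the job of lambda p: p.x
by_x_lambda = sorted(points, key=lambda p: p.x)
by_x_getter = sorted(points, key=attrgetter('x'))

# itemgetter(1) does the job of lambda p: p[1]
by_y_lambda = sorted(points, key=lambda p: p[1])
by_y_getter = sorted(points, key=itemgetter(1))

# partial freezes some arguments of an existing function
parse_binary_lambda = lambda text: int(text, base=2)
parse_binary_partial = partial(int, base=2)
```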

lambda should be called make_function()

In [33]:
lambda x: x**2

<function __main__.<lambda>>

Create a function, then call it immediately with (5):

In [37]:
100 + (lambda x: x**2)(5) + 50

175

In [40]:
f = lambda x, y: 3*x + y

In [41]:
f(3, 8)

17

That is 3*3 plus 8.

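A lambda with no arguments makes a promise: it freezes a computation so it can be thawed later. Such deferred computations are also called promises or thunks.

This lambda computes nothing when it is defined. It looks up whatever x and y are bound to at the moment we call it.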

In [45]:
x = 10
y = 20
f = lambda : x ** y
f()

100000000000000000000
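
A small sketch of what "compute it when we run the function" means in practice: rebinding x between calls changes the result, because the names are looked up at call time.

```python
x = 10
y = 20
f = lambda: x ** y    # nothing is computed yet

before = f()          # uses x = 10, y = 20

x = 2                 # rebinding x changes what the next call sees
after = f()           # now computes 2 ** 20

print(before, after)
```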

# Chained comparisons

In [46]:
x = 15
x > 6

True

In [47]:
x < 10

False

In [48]:
x > 6 and x < 20

True

In [49]:
6 < x < 20

True
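
The chained form gives the same answer as the version joined with `and`, with one difference: the middle expression is evaluated only once. A sketch using a made-up function `reading` that records how often it is called:

```python
calls = []

def reading():
    """Hypothetical stand-in for an expensive lookup; records each call."""
    calls.append('called')
    return 15

chained = 6 < reading() < 20                     # reading() runs once
chained_calls = len(calls)

calls.clear()
spelled_out = 6 < reading() and reading() < 20   # reading() runs twice
spelled_out_calls = len(calls)

print(chained, chained_calls, spelled_out, spelled_out_calls)
```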

# The random module

In [50]:
from random import *

In [51]:
random()

0.5172096915450849

In [64]:
random?

By seeding the random number generator, random will always produce the same sequence

In [52]:
seed(1414123123)

In [65]:
seed?

In [53]:
random()

0.05954354881602675

In [54]:
random()

0.378610987173179

This is useful for simulations
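
A sketch of that reproducibility. Note that it uses `import random` rather than the star import above, so `random` here is the module. The private `Random` instance is one possible way to keep a simulation repeatable without reseeding the shared generator:

```python
import random

random.seed(1414123123)
first_run = [random.random() for _ in range(3)]

random.seed(1414123123)            # same seed again
second_run = [random.random() for _ in range(3)]

# A private generator leaves the module-level generator untouched.
rng = random.Random(1414123123)
private_run = [rng.random() for _ in range(3)]

print(first_run == second_run == private_run)
```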

In [56]:
uniform(1000, 1100)

1000.1400967521248

In [66]:
uniform?

triangular() is centered about the middle: by default its mode is the midpoint between low and high, so values near the halfway point are chosen more frequently.

In [101]:
triangular(low=1000, high=1100)

1049.1449832107387

To illustrate:

triangular(low=1000, high=1100)
               ________
               \      /
                \    /
                 \  /
                  \/

                1068.6058048784505

gauss() draws from the Gaussian, or normal, distribution, whose tails taper off smoothly instead of stopping at hard limits. It is a good choice for many simulations. It's named after the mathematician Carl Friedrich Gauss.

In [62]:
gauss?

In [92]:
gauss(mu=100, sigma=15)

86.97286903346375

The standard deviation is 15 here.

In [102]:
expovariate?

Exponential distribution. Its argument is called lambd (lambda is a reserved word in Python) and equals 1 divided by the desired mean. It is used to simulate arrival times.

In [103]:
expovariate(20)

0.0066559299069569935
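
One way to turn these draws into arrival times is to treat each value as the gap since the previous arrival and keep a running total. A sketch under that assumption; the seed and the number of arrivals are arbitrary choices:

```python
from itertools import accumulate
from random import Random

rng = Random(42)    # arbitrary seed, used only to make the run repeatable
rate = 20           # lambd: average number of arrivals per unit of time

gaps = [rng.expovariate(rate) for _ in range(5)]   # waiting time between arrivals
arrival_times = list(accumulate(gaps))             # running total = clock time of each arrival

print(arrival_times)
```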

In [104]:
data = [triangular(1000, 1100) for i in range(1000)]

Now let's check the mean. It should be right in the middle.

In [105]:
mean(data)

1050.7878458318398

In [106]:
stdev(data)

20.592991613076848

In [107]:
data = [uniform(1000, 1100) for i in range(1000)]

We expect about the same mean, but the spread will be much wider. Let's have a look.

In [109]:
mean(data)

1050.4378937384013

In [110]:
stdev(data)

28.8629431315601
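
The observed spreads (about 20.6 for triangular, about 28.9 for uniform) can be checked against the textbook formulas for these two distributions. A sketch, assuming the triangular peak sits at the midpoint, which is the default:

```python
from math import sqrt

low, high = 1000, 1100
width = high - low

# Textbook standard deviations on the interval [low, high]
uniform_sd = width / sqrt(12)       # about 28.87
triangular_sd = width / sqrt(24)    # about 20.41, with the peak at the midpoint

print(round(uniform_sd, 2), round(triangular_sd, 2))
```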

In [113]:
data = [gauss(100, 15) for i in range(1000)]

In [114]:
mean(data)

99.00571005651493

Let's hope it turns out to be around 15.

In [115]:
stdev(data)

14.676286473060598

In [120]:
data = [expovariate(20) for i in range(1000)]

In [117]:
1/20

0.05

In [121]:
mean(data)

0.04621896890564128

In [122]:
stdev(data)

0.04575456778724989

# Discrete distributions

Choice, choices, sample and shuffle (25:35)

In [123]:
from random import choice, choices, sample, shuffle

In [124]:
outcomes = ['win', 'lose', 'draw', 'play again', 'double win']

In [126]:
choice(outcomes)

'lose'

In [128]:
choices(outcomes, k=10)

['draw',
 'draw',
 'play again',
 'lose',
 'double win',
 'play again',
 'lose',
 'double win',
 'lose',
 'win']

In [130]:
from collections import Counter

In [133]:
Counter(choices(outcomes, k=10000))

Counter({'double win': 2058,
         'draw': 2014,
         'lose': 1971,
         'play again': 1957,
         'win': 2000})

In [142]:
outcomes = ['win', 'lose', 'draw', 'play again', 'double win']
weights = [5, 4, 3, 2, 1]

With these weights we expect 5 times as many wins, 4 times as many losses, and 3 times as many draws as double wins (and so on).

In [141]:
Counter(choices(outcomes, weights=weights, k=10000))

Counter({'double win': 657,
         'draw': 1992,
         'lose': 2669,
         'play again': 1323,
         'win': 3359})
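
The counts above can be compared with what the weights predict: each outcome should come up in proportion to its share of the total weight. A small sketch of that arithmetic:

```python
outcomes = ['win', 'lose', 'draw', 'play again', 'double win']
weights = [5, 4, 3, 2, 1]
k = 10000

total = sum(weights)
# Expected count = k * (weight / total weight), rounded to a whole number
expected = {outcome: round(k * weight / total)
            for outcome, weight in zip(outcomes, weights)}

print(expected)
```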

In [135]:
choices?

Let's win a lot.

In [159]:
outcomes = ['win', 'lose', 'draw', 'play again', 'double win']
weights = [5, 1, 1, 1, 1]

In [162]:
roll = Counter(choices(outcomes, weights=weights, k=10000))
roll

Counter({'double win': 1099,
         'draw': 1104,
         'lose': 1134,
         'play again': 1094,
         'win': 5569})

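We can pair each count with its weight to get an even better view. Looking the counts up by outcome keeps the pairs aligned, because a Counter iterates in first-seen order rather than in the order of outcomes:

In [170]:
tuple(zip(((o, roll[o]) for o in outcomes), weights))

((('win', 5569), 5),
 (('lose', 1134), 1),
 (('draw', 1104), 1),
 (('play again', 1094), 1),
 (('double win', 1099), 1))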

Are we making a slot machine?

In [143]:
shuffle(outcomes) # direct mutation

In [144]:
outcomes

['draw', 'win', 'double win', 'lose', 'play again']

In [145]:
choices(outcomes, k=5) # we can get duplicates

['play again', 'win', 'lose', 'lose', 'double win']

In [146]:
sample(outcomes, k=4)

['lose', 'double win', 'win', 'draw']
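
The difference between the two calls is replacement: choices() may repeat an item, while sample() never picks the same position twice. A sketch that also shows what happens when sample() is asked for more items than exist:

```python
from random import choices, sample

outcomes = ['win', 'lose', 'draw', 'play again', 'double win']

with_replacement = choices(outcomes, k=8)      # k may exceed len(outcomes)
without_replacement = sample(outcomes, k=4)    # every pick is distinct

try:
    sample(outcomes, k=6)                      # more picks than items
    too_many = 'allowed'
except ValueError:
    too_many = 'ValueError'

print(with_replacement, without_replacement, too_many)
```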

Let's make lottery numbers

In [155]:
ticket = sorted([5, 13, 54, 23, 11, 44])
ticket

[5, 11, 13, 23, 44, 54]

In [156]:
lottery = sorted(sample(range(1, 57), k=6))
lottery

[2, 11, 30, 46, 49, 50]

In [157]:
lottery == ticket

False

What a scam.