In [1]:
%pylab inline
import numpy as np

Populating the interactive namespace from numpy and matplotlib


## Compute statistics for a known distribution

Requested additions:
1. Do the same calcs for a sample.
2. Show how to do the calcs without .dot and outer, just standard python (much slower, but ok).
3. Add text to explain each step

### Defining a joint distribution over two discrete random variables
The probabilities are organized in a 2D array, where the columns correspond to values of $x$ and the rows correspond to values of $y$. Entry `P[j,i]` therefore holds $P(X=x_i, Y=y_j)$.

In [2]:
# We start with positive weights that don't sum to 1
P=np.array([[1.,1,2],[2,2,2]])
P2=copy(P)
P

array([[ 1.,  1.,  2.],
       [ 2.,  2.,  2.]])

In [3]:
# We then normalize the weights
# using Pure Python

#Compute the sum
s=0
for i in range(shape(P)[0]):
    for j in range(shape(P)[1]):
        s+=P[i,j]
print 'the sum is ',s
#divide by the sum
for i in range(shape(P)[0]):
    for j in range(shape(P)[1]):
        P[i,j] /= s
P

the sum is  10.0


array([[ 0.1,  0.1,  0.2],
       [ 0.2,  0.2,  0.2]])

In [4]:
# Using Numpy we can write it in a much shorter way
P2/=sum(P2)
P2

array([[ 0.1,  0.1,  0.2],
       [ 0.2,  0.2,  0.2]])

In [5]:
# The values that the random variables x and y take
x=np.array([1,2,3])
y=np.array([-1,1])
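
To make the row/column convention concrete, here is a minimal sketch that looks up a single joint probability. It uses plain nested lists holding the normalized values from above and assumes Python 3; the helper name `joint_prob` is hypothetical and introduced only for illustration.

```python
# Normalized joint probabilities: rows index y, columns index x
P = [[0.1, 0.1, 0.2],
     [0.2, 0.2, 0.2]]
x = [1, 2, 3]
y = [-1, 1]

def joint_prob(x_val, y_val):
    """Return P(X = x_val, Y = y_val) (hypothetical helper)."""
    i = x.index(x_val)   # column index, selected by the value of x
    j = y.index(y_val)   # row index, selected by the value of y
    return P[j][i]

# x = 3 is column 2 and y = -1 is row 0, so this reads P[0][2]
print(joint_prob(3, -1))
```

Note that the row index comes first when indexing, even though we usually write the pair as $(x, y)$.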

#### Computing Marginals
The marginal distributions are the probabilities associated with each random variable alone.
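
With the layout used here (columns index $x$, rows index $y$), the marginals are obtained by summing the joint probabilities over the other variable:

$$
P(X=x_i) = \sum_j P(X=x_i, Y=y_j), \qquad P(Y=y_j) = \sum_i P(X=x_i, Y=y_j)
$$

For example, $P(X=1) = 0.1 + 0.2 = 0.3$ (the sum of the first column) and $P(Y=-1) = 0.1 + 0.1 + 0.2 = 0.4$ (the sum of the first row).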

In [6]:
# The pure python way
Px=[0.]*shape(P)[1]
Py=[0.]*shape(P)[0]

for i in range(len(Px)):
    for j in range(len(Py)):
        Px[i]+=P[j,i]
        Py[j]+=P[j,i]
Px,Py

([0.30000000000000004, 0.30000000000000004, 0.40000000000000002],
 [0.40000000000000002, 0.60000000000000009])

In [7]:
#the numpy way:
Px=sum(P,axis=0)
Py=sum(P,axis=1)
Px,Py

(array([ 0.3,  0.3,  0.4]), array([ 0.4,  0.6]))

### Check whether $x$ and $y$ are independent

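If $x$ and $y$ are independent, then the joint distribution equals the outer product of the two marginals, that is, $P(X=x_i, Y=y_j) = P(X=x_i)P(Y=y_j)$ for every $i$ and $j$.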

In [8]:
# The pure python way
for i in range(len(Px)):
    for j in range(len(Py)):
        if Px[i]*Py[j] != P[j,i]:
            print "Px[%d]*Py[%d] != P[%d,%d] ::::: %5.3f*%5.3f = %5.3f != %5.3f"%\
                    (i,j,j,i,Px[i],Py[j],Px[i]*Py[j],P[j,i])


Px[0]*Py[0] != P[0,0] ::::: 0.300*0.400 = 0.120 != 0.100
Px[0]*Py[1] != P[1,0] ::::: 0.300*0.600 = 0.180 != 0.200
Px[1]*Py[0] != P[0,1] ::::: 0.300*0.400 = 0.120 != 0.100
Px[1]*Py[1] != P[1,1] ::::: 0.300*0.600 = 0.180 != 0.200
Px[2]*Py[0] != P[0,2] ::::: 0.400*0.400 = 0.160 != 0.200
Px[2]*Py[1] != P[1,2] ::::: 0.400*0.600 = 0.240 != 0.200


In [9]:
# The numpy way
np.outer(Px,Py).T - P

array([[ 0.02,  0.02, -0.04],
       [-0.02, -0.02,  0.04]])
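
Comparing floating-point numbers with `!=` can report spurious differences caused by rounding, so a tolerance-based check is usually safer. The following is a minimal sketch, assuming Python 3.5+ for `math.isclose`; the function name `is_independent` and the tolerance value are illustrative choices, not part of the original notebook.

```python
import math

P = [[0.1, 0.1, 0.2],
     [0.2, 0.2, 0.2]]

def is_independent(P, tol=1e-9):
    """Check P[j][i] == Px[i] * Py[j] for all i, j, up to a tolerance."""
    n_rows, n_cols = len(P), len(P[0])
    Py = [sum(row) for row in P]                                   # row sums
    Px = [sum(P[j][i] for j in range(n_rows)) for i in range(n_cols)]  # column sums
    return all(
        math.isclose(Px[i] * Py[j], P[j][i], abs_tol=tol)
        for i in range(n_cols)
        for j in range(n_rows)
    )

# The joint distribution from this notebook is not independent
print(is_independent(P))

# A joint distribution built as a product of marginals is independent
Q = [[px * py for px in (0.3, 0.3, 0.4)] for py in (0.4, 0.6)]
print(is_independent(Q))
```

The second call builds a table directly from the marginals found above, so it passes the check even though the individual products are not exactly representable in floating point.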

### Calculating the mean and standard deviation
To calculate the mean of $X$ and $Y$ under this distribution in Python, we iterate through the values of $x$ and $y$ and plug them into the formula $E[X] = \sum_x P(X=x)x$. The standard deviation is computed similarly, as $\sigma_X = \sqrt{\sum_x P(X=x)(x-E[X])^2}$.



In [10]:
from math import sqrt
#The python way
Ex = 0
for i in range(3):
    Ex+=Px[i]*x[i]
Ey = 0
for i in range(2):
    Ey+=Py[i]*y[i]

varx = 0
for i in range(3):
    varx+=Px[i]*(x[i] - Ex)**2
stdx = sqrt(varx)

vary = 0
for i in range(2):
    vary+=Py[i]*(y[i] - Ey)**2
stdy = sqrt(vary)

Ex,Ey,stdx,stdy

(2.1000000000000005,
 0.20000000000000007,
 0.8306623862918076,
 0.9797958971132713)

In [11]:
# In numpy you can use np.dot(A,B), which for 1-D arrays A and B multiplies
# corresponding elements and sums the products (the inner product)
Ex=np.dot(Px,x)
Ey=np.dot(Py,y)
Ex2=np.dot(Px,x**2)
Ey2=np.dot(Py,y**2)
stdx=sqrt(Ex2-Ex**2)
stdy=sqrt(Ey2-Ey**2)
print 'Ex=%f,Ey=%f,stdx=%f,stdy=%f'%(Ex,Ey,stdx,stdy)

Ex=2.100000,Ey=0.200000,stdx=0.830662,stdy=0.979796
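
The two cells above compute the standard deviation in different ways: the loop uses the definition of the variance, while the numpy version uses $E[X^2] - (E[X])^2$. A short derivation shows why the two agree:

$$
\begin{aligned}
\mathrm{Var}(X) &= E\left[(X - E[X])^2\right] \\
&= E[X^2] - 2E[X]E[X] + (E[X])^2 \\
&= E[X^2] - (E[X])^2
\end{aligned}
$$

For this distribution, $E[X^2] = 0.3\cdot 1 + 0.3\cdot 4 + 0.4\cdot 9 = 5.1$ and $E[X] = 2.1$, so $\mathrm{Var}(X) = 5.1 - 4.41 = 0.69$ and $\sigma_X = \sqrt{0.69} \approx 0.8307$, matching `stdx`.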


#### Subtract the means

In [12]:
nx=x-Ex
nx

array([-1.1, -0.1,  0.9])

In [13]:
ny=y-Ey
ny

array([-1.2,  0.8])

### Calculate the covariance


In [14]:
# in python
s=0
for i in range(len(x)):
    for j in range(len(y)):
        s+=P[j,i]*nx[i]*ny[j]
print 'the covariance is',s #our expected values are now 0 so nothing to subtract

the covariance is -0.12


In [15]:
# numpy

print 'the covariance is', np.dot(P.flatten(), np.outer(ny,nx).flatten())

the covariance is -0.12
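
The covariance can also be computed without subtracting the means first, using the identity $\mathrm{Cov}(X,Y) = E[XY] - E[X]E[Y]$. The following is a minimal pure-Python sketch of that alternative, assuming Python 3 and plain lists that hold the same values as `P`, `x` and `y` above.

```python
P = [[0.1, 0.1, 0.2],
     [0.2, 0.2, 0.2]]
x = [1, 2, 3]
y = [-1, 1]

# Index pairs (j, i): j selects the row (y value), i the column (x value)
cells = [(j, i) for j in range(len(y)) for i in range(len(x))]

Ex = sum(P[j][i] * x[i] for j, i in cells)           # E[X]
Ey = sum(P[j][i] * y[j] for j, i in cells)           # E[Y]
Exy = sum(P[j][i] * x[i] * y[j] for j, i in cells)   # E[XY]

cov = Exy - Ex * Ey
print(round(cov, 6))
```

Here $E[XY] = 0.3$ and $E[X]E[Y] = 2.1 \cdot 0.2 = 0.42$, so the result agrees with the value of about $-0.12$ obtained above.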


### and the correlation


In [16]:
s/(stdx*stdy)

-0.1474419561548973
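
The cell above divides the covariance by the product of the two standard deviations, which is the usual definition of the correlation coefficient:

$$
\rho_{X,Y} = \frac{\mathrm{Cov}(X,Y)}{\sigma_X \sigma_Y} \approx \frac{-0.12}{0.8307 \times 0.9798} \approx -0.147
$$

The correlation always lies between $-1$ and $1$; a value this close to $0$ indicates only a weak negative linear relationship between $X$ and $Y$.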

## Empirical statistics

If we now draw samples from this distribution, we can see that the empirical statistics (the sample mean, sample standard deviation and sample covariance) approach the true mean, standard deviation and covariance computed above as the number of samples grows.

In [26]:
x,Px

(array([1, 2, 3]), array([ 0.3,  0.3,  0.4]))

In [28]:
np.random.choice(x,10,True, Px)

array([1, 3, 1, 1, 1, 2, 1, 3, 1, 2])

In [29]:
numsamples = [2,10,100,100000]

for num in numsamples: 
    print "Sample mean after drawing {num} samples = {s}".format(
        num = num,
        s = np.mean(np.random.choice(x, num, True, Px))
    )

Sample mean after drawing 2 samples = 1.0
Sample mean after drawing 10 samples = 2.2
Sample mean after drawing 100 samples = 2.07
Sample mean after drawing 100000 samples = 2.10171


In [19]:
# To calculate the covariance, we generate samples (x,y) from the joint distribution P.
# nxy lists every possible (x,y) pair (with the means subtracted, which does not change
# the covariance); its row order matches P.T.flatten()
nxy =  np.array([(i,j) for i in nx for j in ny])
for num in numsamples:
    samples = np.random.choice(nxy.shape[0], num, True, P.T.flatten()) # choose row indices
    print "Sample covariance after drawing {num} samples = {s}".format(
        num = num,
        s = np.cov(
            nxy[samples][:,0],
            nxy[samples][:,1]
        )[0,1]
    )

Sample covariance after drawing 2 samples = 0.0
Sample covariance after drawing 10 samples = -0.311111111111
Sample covariance after drawing 100 samples = -0.177777777778
Sample covariance after drawing 100000 samples = -0.118734094541
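
The same experiment can be repeated for all the statistics at once, using only the standard library. The following is a minimal sketch, assuming Python 3.6+ (for `random.choices`); the function name `sample_stats` and the fixed seed are illustrative choices. It uses the $n-1$ normalization that `np.cov` applies by default.

```python
import math
import random

P = [[0.1, 0.1, 0.2],
     [0.2, 0.2, 0.2]]
x = [1, 2, 3]
y = [-1, 1]

# Every possible (x, y) outcome and its joint probability, in matching order
outcomes = [(x[i], y[j]) for j in range(len(y)) for i in range(len(x))]
weights = [P[j][i] for j in range(len(y)) for i in range(len(x))]

def sample_stats(n, seed=0):
    """Draw n joint samples; return sample means, stds, covariance, correlation."""
    rng = random.Random(seed)                 # seeded for reproducibility
    draws = rng.choices(outcomes, weights=weights, k=n)
    xs = [d[0] for d in draws]
    ys = [d[1] for d in draws]
    mx = sum(xs) / n
    my = sum(ys) / n
    # Sample variance and covariance with the n - 1 normalization
    sx = math.sqrt(sum((a - mx) ** 2 for a in xs) / (n - 1))
    sy = math.sqrt(sum((b - my) ** 2 for b in ys) / (n - 1))
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / (n - 1)
    # The correlation is undefined when either variable never varies in the sample
    corr = cov / (sx * sy) if sx > 0 and sy > 0 else float("nan")
    return mx, my, sx, sy, cov, corr

for n in (100, 100000):
    stats = sample_stats(n)
    print(n, [round(v, 4) for v in stats])
```

With a large number of samples, the printed values should be close to the exact ones computed earlier: means of about 2.1 and 0.2, standard deviations of about 0.83 and 0.98, a covariance of about $-0.12$ and a correlation of about $-0.147$.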
