### Simple Statistics

Let the following data set be given (sample size 60):

4, 3, 2, 5, 4, 6, 3, 7, 4, 1, 4, 0, 6, 4, 3, 5, 2, 3, 5, 1, 4, 4, 9, 5, 4, 3, 3, 5, 2, 4, 3, 6, 5, 2, 6, 2, 4, 5, 5, 1, 5, 4, 4, 2, 7, 1, 3, 3, 4, 7, 3, 4, 4, 6, 6, 3, 3, 2, 6, 1.

Calculate:
- The mean
- The mean recursively
- The standard deviation over the sample

For a finite list of $i$ numbers $x_1,\dots,x_i$, the mean and the population standard deviation (dividing by $i$) are straightforward to calculate:
$$\bar{x}=\frac{\sum_{k=1}^i x_k}{i}$$

$$\sigma=\sqrt{\frac{\sum_{k=1}^i(x_k-\bar{x})^2}{i}}$$

In [8]:
import math
import time

numbers = [4, 3, 2, 5, 4, 6, 3, 7, 4, 1, 4, 0, 6, 4, 3, 5, 2, 3, 5, 1, 4, 4, 9, 5, 4, 3, 3, 5, 2, 4, 3, 6, 5, 2, 
           6, 2, 4, 5, 5, 1, 5, 4, 4, 2, 7, 1, 3, 3, 4, 7, 3, 4, 4, 6, 6, 3, 3, 2, 6, 1]

# calculate mean
mean = sum(numbers) / len(numbers)

# calculate std dev
std_dev = math.sqrt(
    sum([(x - mean)**2 for x in numbers]) 
    / len(numbers)
)

print('Sample mean    : %0.2f' % mean)
print('Sample std dev : %0.2f' % std_dev)

Sample mean    : 3.87
Sample std dev : 1.77
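
The task list also asks for the mean computed recursively. One common way to do this is to correct the previous mean with each new value, $\bar{x}_i=\bar{x}_{i-1}+\frac{x_i-\bar{x}_{i-1}}{i}$, so that no running sum has to be kept. The following is a minimal sketch under that assumption; the helper name `recursive_mean` is illustrative and is not part of the notebook cells.

```python
numbers = [4, 3, 2, 5, 4, 6, 3, 7, 4, 1, 4, 0, 6, 4, 3, 5, 2, 3, 5, 1, 4, 4, 9, 5, 4, 3, 3, 5, 2, 4, 3, 6, 5, 2,
           6, 2, 4, 5, 5, 1, 5, 4, 4, 2, 7, 1, 3, 3, 4, 7, 3, 4, 4, 6, 6, 3, 3, 2, 6, 1]


def recursive_mean(values):
    """Hypothetical helper: update the previous mean with each new value."""
    mean = 0.0  # returned unchanged for an empty input (a choice made for this sketch)
    for i, x in enumerate(values, start=1):
        # new mean = old mean + (new value - old mean) / number of values seen
        mean += (x - mean) / i
    return mean


running = recursive_mean(numbers)
batch = sum(numbers) / len(numbers)

print('Recursive mean : %0.2f' % running)
print('Batch mean     : %0.2f' % batch)
```

Both lines should agree with the sample mean of 3.87 printed above, up to floating-point rounding.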


However, in streaming environments the sequence of observations of x is **unbounded**, so the full list cannot be stored and these simple statistics must be calculated **incrementally**.

To incrementally calculate the mean and standard deviation of a random variable x, we need to maintain three variables for x:
- **LS (Linear Sum)**
- **SS (Squared Sum)**
- **N (Count)**

Each new observation $x_i$ is then added by updating the three variables:
- LS = LS + $x_{i}$
- SS = SS + $x_{i}^2$
- N = N + 1

As shown below, these three variables and their incremental additive properties are sufficient to calculate the mean and standard deviation of x in a streaming environment.

In [3]:
class Stream:
    
    def __init__(self):
        self.ls = 0.0
        self.ss = 0.0
        self.n = 0.0
    
    def increment(self, x):
        """
        Add x to the observations by incrementing the sufficient stats
        """
        self.ls += x
        self.ss += x**2
        self.n += 1

    def decrement(self, x):
        """
        Remove x from the observations by decrementing the sufficient stats
        """
        self.ls -= x
        self.ss -= x**2
        self.n -= 1
    
    def mean(self):
        """
        Return mean of the observations by dividing LS by N
        """
        return self.ls/self.n
    
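    def std_dev(self):
        """
        Return the population standard deviation of the observations
        """
        variance = (self.ss/self.n) - (self.ls/self.n)**2
        # Rounding can push a near-zero variance slightly below zero
        return math.sqrt(max(variance, 0.0))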
    
    def print_stats(self):
        """
        Print the current values of the sufficient stats to the console
        """
        print('Linear Sum  : %0.2f' % self.ls)
        print('Squared Sum : %0.2f' % self.ss)
        print('N           : %0.2f' % self.n)

The mean is calculated by:
$$\bar{x}=\frac{LS}{N}$$
The population standard deviation, matching the definition at the top of this section and the `std_dev` method, is calculated by:
$$\sigma=\sqrt{\frac{SS}{N}-\left(\frac{LS}{N}\right)^2}$$
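
The expression that `std_dev` evaluates follows from expanding the square in the population standard deviation defined at the top of this section. A short derivation, writing $N$ for the number of observations, $LS=\sum_{k=1}^{N}x_k$ and $SS=\sum_{k=1}^{N}x_k^2$:

$$\begin{aligned}
\sigma^2 &= \frac{1}{N}\sum_{k=1}^{N}(x_k-\bar{x})^2 \\
&= \frac{1}{N}\sum_{k=1}^{N}x_k^2-\frac{2\bar{x}}{N}\sum_{k=1}^{N}x_k+\bar{x}^2 \\
&= \frac{SS}{N}-2\bar{x}^2+\bar{x}^2 \\
&= \frac{SS}{N}-\left(\frac{LS}{N}\right)^2
\end{aligned}$$

For the observations 4, 3, 2 this gives $LS=9$, $SS=29$ and $N=3$, so $\sigma=\sqrt{29/3-3^2}=\sqrt{0.667}\approx 0.82$.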


Below, three numbers are added incrementally to the stream, and the mean and standard deviation of the observations are then calculated.

In [4]:
stream = Stream()

stream.increment(4)
stream.increment(3)
stream.increment(2)

stream.print_stats()
print()
print('Mean: %0.2f' % stream.mean())
print('Standard Deviation: %0.2f' % stream.std_dev())

Linear Sum  : 9.00
Squared Sum : 29.00
N           : 3.00

Mean: 3.00
Standard Deviation: 0.82
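
The `decrement` method is not exercised in the cell above. One place where removing observations may be useful is a sliding window, where only the most recent items should count. The following is a minimal, self-contained sketch of that idea; the class `WindowStats` and its `size` parameter are hypothetical names introduced for illustration, and the sketch keeps its own copy of the three sufficient statistics rather than reusing `Stream`.

```python
import math
from collections import deque


class WindowStats:
    """Hypothetical helper: LS, SS and N over the last `size` items only."""

    def __init__(self, size):
        self.size = size
        self.window = deque()  # the items currently inside the window
        self.ls = 0.0
        self.ss = 0.0
        self.n = 0

    def add(self, x):
        # Increment the sufficient statistics with the new item
        self.window.append(x)
        self.ls += x
        self.ss += x**2
        self.n += 1
        # If the window is too long, decrement with the oldest item
        if self.n > self.size:
            old = self.window.popleft()
            self.ls -= old
            self.ss -= old**2
            self.n -= 1

    def mean(self):
        return self.ls / self.n

    def std_dev(self):
        variance = self.ss / self.n - (self.ls / self.n) ** 2
        # Guard against a slightly negative value caused by rounding
        return math.sqrt(max(variance, 0.0))


stats = WindowStats(size=3)
for x in [4, 3, 2, 5, 4]:
    stats.add(x)

print(list(stats.window))  # the three most recent items
print('Mean: %0.2f, Std Dev: %0.2f' % (stats.mean(), stats.std_dev()))
```

Unlike pure incrementing, this variant has to remember the items inside the window, because an item can only be subtracted if its value is still known.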


**Coming back to the original sample of 60 items:**

**4, 3, 2, 5, 4, 6, 3, 7, 4, 1, 4, 0, 6, 4, 3, 5, 2, 3, 5, 1, 4, 4, 9, 5, 4, 3, 3, 5, 2, 4, 3, 6, 5, 2, 6, 2, 4, 5, 5, 1, 5, 4, 4, 2, 7, 1, 3, 3, 4, 7, 3, 4, 4, 6, 6, 3, 3, 2, 6, 1.**

**Below, a stream is simulated where the items arrive one by one with some time delay. They are incrementally added to the stream by updating the sufficient statistics, then the sufficient statistics along with the running mean and standard deviation are printed.**

In [46]:
stream = Stream()
for number in numbers:
    print('Incoming Item: %d' % number)
    stream.increment(number)
    print('[LS,    SS,    N]')
    print([stream.ls, stream.ss, stream.n])
    print()
    print('Mean: %0.2f, Std Dev: %0.2f' % (stream.mean(), stream.std_dev()))
    print('=============================')
    time.sleep(3)

Incoming Item: 4
[LS,    SS,    N]
[4.0, 16.0, 1.0]

Mean: 4.00, Std Dev: 0.00
Incoming Item: 3
[LS,    SS,    N]
[7.0, 25.0, 2.0]

Mean: 3.50, Std Dev: 0.50
Incoming Item: 2
[LS,    SS,    N]
[9.0, 29.0, 3.0]

Mean: 3.00, Std Dev: 0.82
Incoming Item: 5
[LS,    SS,    N]
[14.0, 54.0, 4.0]

Mean: 3.50, Std Dev: 1.12
Incoming Item: 4
[LS,    SS,    N]
[18.0, 70.0, 5.0]

Mean: 3.60, Std Dev: 1.02
Incoming Item: 6
[LS,    SS,    N]
[24.0, 106.0, 6.0]

Mean: 4.00, Std Dev: 1.29
Incoming Item: 3
[LS,    SS,    N]
[27.0, 115.0, 7.0]

Mean: 3.86, Std Dev: 1.25
Incoming Item: 7
[LS,    SS,    N]
[34.0, 164.0, 8.0]

Mean: 4.25, Std Dev: 1.56
Incoming Item: 4
[LS,    SS,    N]
[38.0, 180.0, 9.0]

Mean: 4.22, Std Dev: 1.47
Incoming Item: 1
[LS,    SS,    N]
[39.0, 181.0, 10.0]

Mean: 3.90, Std Dev: 1.70
Incoming Item: 4
[LS,    SS,    N]
[43.0, 197.0, 11.0]

Mean: 3.91, Std Dev: 1.62
Incoming Item: 0
[LS,    SS,    N]
[43.0, 197.0, 12.0]

Mean: 3.58, Std Dev: 1.89
Incoming Item: 6
[LS,    SS,    N