# Tutorial 3: Estimating a Simple Mean

In this notebook we will set up the *simplest possible inference problem* and solve it twice with the same algorithm, storing the data in a different place each time: first in RAM, then in a database.

Suppose we have a catalog of stars, each one with an ID and a measured distance. We will assume that they belong to a single population lying at the same distance, and that they are standard candles. Our model has only one parameter, the mean distance to the population.

Goals:

* Simulate a stellar population, and generate an observed distance dataset
* Load the data into memory and infer the mean distance
* Create a simple database and store the data in it
* Load the data from the database and infer the mean distance

### Requirements

You will need to install `sqlalchemy`.


In [1]:
import numpy as np

## Probability Theory

We have measured distances $d_k$ for each star $k$. We are looking for the posterior PDF for the mean distance $\mu$:

${\rm Pr}(\mu | \boldsymbol{d}) \propto {\rm Pr}(\mu) \prod_k {\rm Pr}(d_k | \mu)$

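We will assign a uniform prior for $\mu$. If we also assume that each $d_k$ is drawn from a Gaussian centered on $\mu$, with the same width for every star, then the posterior PDF peaks at the arithmetic mean of the $d_k$. So we're just going to take the arithmetic mean of $d$ and call it $\mu$.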
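A short derivation shows why the arithmetic mean is a sensible estimate. It is a sketch under an assumption that is ours rather than the notebook's: each measurement is Gaussian-distributed about $\mu$ with a common, known width $\sigma$. With a uniform prior, the log-posterior is then

$\log {\rm Pr}(\mu | \boldsymbol{d}) = -\frac{1}{2\sigma^2} \sum_{k=1}^{N} (d_k - \mu)^2 + {\rm constant}$

Setting the derivative with respect to $\mu$ to zero gives

$\frac{1}{\sigma^2} \sum_{k=1}^{N} (d_k - \hat{\mu}) = 0 \quad \Rightarrow \quad \hat{\mu} = \frac{1}{N} \sum_{k=1}^{N} d_k$

Under the same assumptions the posterior is itself Gaussian in $\mu$, centered on $\hat{\mu}$ with width $\sigma / \sqrt{N}$.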

## Simulating a Stellar Population

In [2]:
class StellarPopulation(object):
    """
    Simulate a population of stars, all with the same distance.
    """
    def __init__(self, distance):
        """
        Instantiate a StellarPopulation object.
        
        Parameters
        ----------
        distance : float
            Distance to the stars, in kpc
        """
        self.distance = distance 
        return
    
    def generate(self, N=1000, rms_error=0.1):
        """
        Simulates the observations of N stars.
        
        Parameters
        ----------
        N : int
            Number of stars to generate
        rms_error : float
            RMS fractional distance uncertainty
        """
        self.N = N
        self.rms_error = rms_error
        self.d_obs = self.distance + (self.rms_error * self.distance) * np.random.randn(self.N)
        # Build the IDs as a concrete list: in Python 3, map() alone returns a lazy iterator.
        self.id = list(map(str, range(self.N)))
        return
    
    def estimate_mean_distance(self):
        """
        Estimate the mean of the observed stellar distances, and return this as well as the wallclock time taken.
        
        Returns
        -------
        mu : float
            Estimated mean distance in kpc
        time : float
            Wallclock time in milliseconds
        """
        import time as wallclock
        start = wallclock.time()
        mu = np.mean(self.d_obs)
        end = wallclock.time()
        time = (end-start) * 1e3
        return mu, time

In [3]:
cluster = StellarPopulation(3.0)
cluster.generate(10000000)

In [4]:
mu, time = cluster.estimate_mean_distance()
print("Estimated mean distance = ", mu)
print("Wallclock time spent = ", np.round(time, 1), "milliseconds")

Estimated mean distance =  2.99987992483
Wallclock time spent =  5.5 milliseconds
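
As a rough plausibility check on the printed estimate, we can compare its offset from the true distance with the standard error of the mean, $\sigma / \sqrt{N}$. The sketch below is not part of the original notebook: it takes the distance, the default `rms_error` and `N` from the cells above, and copies the mean from the printed output, so a rerun of the simulation would give a different offset.

```python
import math

# Values taken from the cells above: true distance, default rms_error, and N.
distance = 3.0       # kpc
rms_error = 0.1      # fractional distance uncertainty
N = 10000000

# Mean copied from the printed output above; a rerun will give a different value.
mu_reported = 2.99987992483

# Per-star scatter in kpc, and the resulting standard error of the mean.
sigma = rms_error * distance
standard_error = sigma / math.sqrt(N)

# Offset of the estimate from the truth, in units of the standard error.
offset = (mu_reported - distance) / standard_error

print("Standard error of the mean =", standard_error, "kpc")
print("Offset of the estimate     =", round(offset, 2), "standard errors")
```

The estimate lies a little over one standard error below the true value, which is the size of offset that random scatter alone would be expected to produce.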


## Inferring the Mean Distance

## Storing Data in a Database

## Loading the Data and Inferring the Mean
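
The last two steps, storing the catalog in a database and reading it back to infer the mean, can be sketched together as follows. This is a minimal illustration, not the notebook's own implementation: to stay dependency-free it uses Python's built-in `sqlite3` and `random` modules instead of `sqlalchemy` and `numpy`. The table name `stars`, the column names `id` and `d_obs`, and the small catalog size are all assumptions made for the example.

```python
import random
import sqlite3
import time

# Hypothetical values for illustration: a small catalog so the sketch runs quickly.
TRUE_DISTANCE = 3.0   # kpc, as for the cluster above
RMS_ERROR = 0.1       # fractional distance uncertainty
N_STARS = 10000

# Simulate (id, observed distance) pairs; the fixed seed makes reruns repeatable.
rng = random.Random(42)
catalog = [(str(k), rng.gauss(TRUE_DISTANCE, RMS_ERROR * TRUE_DISTANCE))
           for k in range(N_STARS)]

# Store the catalog in an in-memory SQLite database.
# The table and column names are assumptions, not taken from the notebook.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE stars (id TEXT PRIMARY KEY, d_obs REAL)")
connection.executemany("INSERT INTO stars (id, d_obs) VALUES (?, ?)", catalog)
connection.commit()

# Option 1: let the database compute the mean.
start = time.time()
mu_sql = connection.execute("SELECT AVG(d_obs) FROM stars").fetchone()[0]
time_sql = (time.time() - start) * 1e3

# Option 2: load the distances back into memory, then average them.
start = time.time()
distances = [row[0] for row in connection.execute("SELECT d_obs FROM stars")]
mu_mem = sum(distances) / len(distances)
time_mem = (time.time() - start) * 1e3

connection.close()

print("Mean distance (SQL AVG)       =", mu_sql, "kpc")
print("Mean distance (loaded to RAM) =", mu_mem, "kpc")
print("Wallclock times =", round(time_sql, 1), "and", round(time_mem, 1), "milliseconds")
```

Both routes apply the same estimator, so the two means should agree to within floating-point rounding; only the place where the arithmetic happens, and therefore the wallclock time, differs.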