# Tutorial 3: Estimating a Simple Mean

In this notebook we will set up the *simplest possible inference problem* and solve it twice with the same algorithm, storing the data in two different places: RAM, and a database.

Suppose we have a catalog of stars, each one with an ID and a measured distance. We will assume that they belong to a single population lying at the same distance, and that they are standard candles. Our model has only one parameter, the mean distance to the population.

Goals:

* Simulate a stellar population, and generate an observed distance dataset
* Load the data into memory and infer the mean distance
* Create a simple database and store the data in it
* Load the data from the database and infer the mean distance

### Requirements

You will need to install `sqlalchemy`.


In [None]:
import numpy as np

Make some relevant imports.

In [2]:
import sqlalchemy as sq
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker
from sqlalchemy.sql import func

## Probability Theory

We have measured distances $d_k$ for each star $k$. We are looking for the posterior PDF for the mean distance $\mu$:

${\rm Pr}(\mu | \boldsymbol{d}) \propto {\rm Pr}(\mu) \prod_k {\rm Pr}(d_k | \mu)$

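We will assign a uniform prior for $\mu$. If we also assume that each $d_k$ is drawn from a Gaussian centered on $\mu$ with the same width for every star (which is how the data are simulated below), then the posterior for $\mu$ peaks at the arithmetic mean of the $d_k$. So we're just going to take the arithmetic mean of $d$ and call it $\mu$.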
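As a quick numerical check of that claim, the sketch below evaluates the unnormalized log posterior on a grid and compares its peak with the arithmetic mean. It is only an illustration under stated assumptions: the Gaussian width `sigma`, the grid limits, and the helper `log_posterior` are not part of the notebook, and the standard library is used instead of `numpy` so the block runs on its own.

```python
import random

def log_posterior(mu, d, sigma):
    """Unnormalized log posterior for mu: a uniform prior (a constant)
    plus a Gaussian log-likelihood with the same width for every star."""
    return -0.5 * sum((dk - mu) ** 2 for dk in d) / sigma ** 2

random.seed(42)
true_distance, sigma = 3.0, 0.3  # illustrative values only
d = [random.gauss(true_distance, sigma) for _ in range(200)]

# Evaluate the log posterior on a fine grid from 2.5 to 3.5 kpc
grid = [2.5 + 0.0005 * i for i in range(2001)]
mu_map = max(grid, key=lambda mu: log_posterior(mu, d, sigma))

# The arithmetic mean should agree with the grid peak to within the grid spacing
mu_mean = sum(d) / len(d)
print(mu_map, mu_mean)
```

The two numbers should agree to within the grid spacing, because the log posterior is a parabola in $\mu$ whose vertex sits at the sample mean.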

## Simulating a Stellar Population

In [3]:
class StellarPopulation(object):
    """
    Simulate a population of stars, all with the same distance.
    """
    def __init__(self, distance):
        """
        Instantiate a StellarPopulation object.
        
        Parameters
        ----------
        distance : float
            Scalar distance to the cluster of stars, in kpc
        """
        self.distance = distance 
        return
    
    def generate(self, N=1000, rms_error=0.1):
        """
        Simulates the observations of N stars.
        
        Parameters
        ----------
        N : int
            Number of stars to generate
        rms_error : float
            RMS fractional distance uncertainty
        """
        self.N = N
        self.rms_error = rms_error
        self.d_obs = self.distance + (self.rms_error * self.distance) * np.random.randn(self.N)
        self.id = [str(i) for i in range(self.N)]
        return
    
    def estimate_mean_distance(self):
        """
        Estimate the mean of the observed stellar distances, and return this as well as the wallclock time taken.
        
        Returns
        -------
        mu : float
            Estimated mean distance in kpc
        time : float
            Wallclock time in milliseconds
        """
        import time as wallclock
        start = wallclock.time()
        mu = np.mean(self.d_obs)
        end = wallclock.time()
        time = (end-start) * 1e3
        return mu, time
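The noise model in `generate` treats `rms_error` as a *fractional* uncertainty, so the absolute scatter is `rms_error * distance`. The sketch below reproduces that model with the standard library only (an assumption made so the block is self-contained; the class itself uses `numpy`) and checks that the sample scatter comes out close to the expected value.

```python
import random
import statistics

random.seed(0)
distance, rms_error, N = 3.0, 0.1, 10000  # same defaults as the class, larger N

# Absolute scatter in kpc implied by the fractional uncertainty
sigma = rms_error * distance

# Same noise model as StellarPopulation.generate, one draw per star
d_obs = [distance + sigma * random.gauss(0.0, 1.0) for _ in range(N)]

sample_mean = statistics.mean(d_obs)
sample_std = statistics.stdev(d_obs)
print(sample_mean, sample_std)
```

With these settings the sample standard deviation should be near 0.3 kpc, while the mean is pinned down much more tightly, since its uncertainty shrinks roughly as `sigma / sqrt(N)`.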

In [4]:
cluster = StellarPopulation(3.0)
cluster.generate(100000)

In [5]:
mu, time = cluster.estimate_mean_distance()
print("Estimated mean distance = ", mu)
print("Wallclock time spent = ", np.round(time,1), "milliseconds")

Estimated mean distance =  2.99960990232
Wallclock time spent =  0.3 milliseconds
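A single sub-millisecond measurement like this one can be dominated by timer resolution and whatever else the machine is doing. One common remedy, sketched here with a hypothetical helper `time_call` that is not part of the notebook, is to repeat the call several times with `time.perf_counter` and keep the fastest run.

```python
import time

def time_call(func, repeats=20):
    """Call func() `repeats` times; return its result and the fastest run in ms."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = func()
        elapsed = (time.perf_counter() - start) * 1e3
        best = min(best, elapsed)
    return result, best

# Stand-in data so the example runs on its own
values = [3.0 + 0.001 * i for i in range(1000)]
mean, best_ms = time_call(lambda: sum(values) / len(values))
print(mean, best_ms)
```

In the notebook the same helper could wrap the real computation, for example `time_call(lambda: np.mean(cluster.d_obs))`.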


## Setting Up the Database

In [6]:
Base = declarative_base()

In [7]:
class PopulationTable(Base):
    __tablename__ = 'population'
    
    # Define columns like we did in the last notebook
    id = sq.Column(sq.Integer, primary_key=True)
    distance = sq.Column(sq.Float, nullable=False)
    
class StarTable(Base):
    __tablename__ = 'star'
    
    # Again define columns like above. These columns
    # are normal Python instance attributes.
    id = sq.Column(sq.Integer, primary_key=True)
    object_id = sq.Column(sq.Integer, sq.ForeignKey("population.id"), nullable=False)
    distance = sq.Column(sq.Float, nullable=False)
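To see what the `ForeignKey` on `object_id` means at the database level, here is a rough SQL equivalent of the two tables, written with the standard-library `sqlite3` module. The DDL is an approximation and not necessarily the exact statements `sqlalchemy` emits. Note that SQLite typically only enforces foreign keys when asked to, via `PRAGMA foreign_keys = ON` on each connection.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforcement is requested per connection

conn.execute("""
    CREATE TABLE population (
        id INTEGER PRIMARY KEY,
        distance FLOAT NOT NULL
    )""")
conn.execute("""
    CREATE TABLE star (
        id INTEGER PRIMARY KEY,
        object_id INTEGER NOT NULL REFERENCES population (id),
        distance FLOAT NOT NULL
    )""")

# A star that points at an existing population row is accepted
conn.execute("INSERT INTO population (id, distance) VALUES (0, 3.0)")
conn.execute("INSERT INTO star (object_id, distance) VALUES (0, 2.9)")

# A star that points at a missing population row is rejected
try:
    conn.execute("INSERT INTO star (object_id, distance) VALUES (7, 3.1)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

n_stars = conn.execute("SELECT count(*) FROM star").fetchone()[0]
print(rejected, n_stars)
```

Without the pragma, the second insert would normally go through silently, so a mismatch between `object_id` values and `population.id` values can go unnoticed.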

In [8]:
import os
dbfile = 'mean.db'

# Remove any existing database file so that each run starts clean
if os.path.exists(dbfile):
    os.remove(dbfile)

In [9]:
engine = sq.create_engine('sqlite:///'+dbfile)
Base.metadata.create_all(engine)

Base.metadata.bind = engine
DBSession = sessionmaker(bind=engine)
session = DBSession()

## Inserting Data

In [10]:
# Give the population an explicit id of 0, matching the object_id used for its stars
cluster_db = PopulationTable(id=0, distance=3.0)
session.add(cluster_db)

In [11]:
for d in cluster.d_obs:
    star = StarTable(object_id=0, distance=d)
    session.add(star)
    
session.commit()

## Load Data and Infer Mean

In [15]:
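import time as wallclock
start = wallclock.time()

# Note: session.query() only builds the query. It is not sent to the
# database until obj.all() is called in the print call below, so the
# timed region covers query construction only.
obj = session.query(func.avg(StarTable.distance)).filter(StarTable.object_id == 0)

end = wallclock.time()
time = (end-start) * 1e3

print("Estimated mean distance = ", obj.all()[0])

print("Wallclock time spent = ", np.round(time,1), "milliseconds")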

Estimated mean distance =  (2.9997514303065937,)
Wallclock time spent =  0.5 milliseconds
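For a timing that includes the database work itself, the query has to be executed and its result fetched inside the timed region. The sketch below does this with the standard-library `sqlite3` module and an in-memory table rather than the `sqlalchemy` session used above. The table layout mirrors `StarTable`, but the data are made-up placeholder values, so the elapsed time is not comparable to the numbers recorded in this notebook.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE star (
        id INTEGER PRIMARY KEY,
        object_id INTEGER NOT NULL,
        distance FLOAT NOT NULL
    )""")

# Placeholder distances scattered symmetrically about 3.0 kpc
distances = [3.0 + 0.001 * (i % 5 - 2) for i in range(10000)]
conn.executemany(
    "INSERT INTO star (object_id, distance) VALUES (0, ?)",
    [(d,) for d in distances],
)
conn.commit()

# Time execution *and* retrieval of the aggregate
start = time.perf_counter()
(mu_db,) = conn.execute(
    "SELECT avg(distance) FROM star WHERE object_id = ?", (0,)
).fetchone()
elapsed_ms = (time.perf_counter() - start) * 1e3

print(mu_db, elapsed_ms)
```

With the ORM, the analogous change would be to call a method that actually runs the query, such as `.all()`, before stopping the clock.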
