# Final Project (30%)

For the final project, you will need to implement a "new" statistical algorithm in Python from the research literature and write a "paper" describing the algorithm. 

Deadline 2nd May 2018 at 11:59 PM

Note: 1 bonus point for each day that it is submitted before the deadline. The actual project has a maximum grade of 100, but bonus points can push it above 100.

## Paper

The paper should have the following:

### Title

Should be concise and informative.

### Abstract

250 words or less. Identify 4-6 key phrases.

### Background

State the research paper you are using. Describe the concept of the algorithm and why it is interesting and/or useful. If appropriate, describe the mathematical basis of the algorithm. Some potential topics for the background include:

- What problem does it address? 
- What are known and possible applications of the algorithm? 
- What are its advantages and disadvantages relative to other algorithms?
- How will you use it in your research?

### Description of algorithm

First, explain in plain English what the algorithm does. Then describe the details of the algorithm, using mathematical equations or pseudocode as appropriate.

### Describe optimization for performance

First implement the algorithm in plain Python, in a straightforward way, from the description of the algorithm. Then profile and optimize it using one or more appropriate methods, such as:

1. Use of better algorithms or data structures
2. Use of vectorization
3. JIT or AOT compilation of critical functions
4. Re-writing critical functions in C++ and using pybind11 to wrap them
5. Making use of parallelism or concurrency
6. Making use of distributed computing

Document the improvement in performance obtained with each optimization performed.
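
As a hedged illustration of how a speed-up might be documented, the sketch below times a plain-Python pairwise squared-distance computation against a vectorized NumPy version and confirms that both return the same result. The toy problem and the names `pairwise_sq_dists_loop` and `pairwise_sq_dists_vec` are assumptions chosen for illustration; your own project would time its own critical functions.

```python
import timeit

import numpy as np


def pairwise_sq_dists_loop(X):
    """Plain-Python reference: squared Euclidean distance between all rows."""
    n, d = X.shape
    out = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            s = 0.0
            for k in range(d):
                diff = X[i, k] - X[j, k]
                s += diff * diff
            out[i, j] = s
    return out


def pairwise_sq_dists_vec(X):
    """Vectorized version using broadcasting: (n, 1, d) - (1, n, d)."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sum(diff ** 2, axis=-1)


rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))

# Always confirm the optimized version reproduces the reference result.
same_result = np.allclose(pairwise_sq_dists_loop(X), pairwise_sq_dists_vec(X))

# Time each version; keep the best of several repeats to reduce noise.
t_loop = min(timeit.repeat(lambda: pairwise_sq_dists_loop(X), number=1, repeat=3))
t_vec = min(timeit.repeat(lambda: pairwise_sq_dists_vec(X), number=1, repeat=3))
print(f"results agree: {same_result}")
print(f"loop: {t_loop:.4f} s, vectorized: {t_vec:.4f} s, speed-up: {t_loop / t_vec:.1f}x")
```

Reporting the correctness check next to the timings makes it clear that the faster version still computes the same thing.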

### Applications to simulated data sets

Are there specific inputs that give known outputs (e.g. there might be closed-form solutions for special input cases)? How does the algorithm perform on these?

If no such input cases are available (or in addition to such input cases), how does the algorithm perform on simulated data sets for which you know the "truth"? 
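
A minimal sketch of both checks, assuming a linear regression setting purely for illustration: the data are simulated from coefficients we choose (so the "truth" is known), and the fitted values are also compared with the closed-form normal-equations solution. Here `np.linalg.lstsq` is only a stand-in for the algorithm you are implementing.

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulate data from a model whose parameters we choose, so the "truth" is known.
n = 5000
beta_true = np.array([1.5, -2.0, 0.5])
X = rng.normal(size=(n, beta_true.size))
y = X @ beta_true + rng.normal(scale=0.1, size=n)

# Stand-in for the algorithm under test (an assumption for this sketch).
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Special case with a closed-form answer: the normal equations.
beta_closed = np.linalg.solve(X.T @ X, X.T @ y)

print("max abs error vs truth:      ", np.max(np.abs(beta_hat - beta_true)))
print("max abs error vs closed form:", np.max(np.abs(beta_hat - beta_closed)))
```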

### Applications to real data sets

Test the algorithm on the real-world examples in the original paper if possible. Try to find at least one other real-world data set that is not in the original paper and test the algorithm on it. Describe and interpret the results.

### Comparative analysis with competing algorithms

Find two other algorithms that address a similar problem. Perform a comparison, for example of accuracy or speed. You can use existing library implementations of the other algorithms; you do not need to code them yourself. Comment on your observations.
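
One possible shape for such a comparison is sketched below, assuming a toy task (estimating a location parameter from contaminated data) and two library estimators, `np.mean` and `np.median`, as stand-ins for the competing algorithms. The same loop records accuracy (RMSE against the known truth) and wall-clock time for each.

```python
import time

import numpy as np

rng = np.random.default_rng(3)


def simulate(n=200, outlier_frac=0.1):
    """Draw N(0, 1) data and replace a fraction with outliers from N(0, 10^2)."""
    x = rng.normal(size=n)
    k = int(outlier_frac * n)
    x[:k] = rng.normal(scale=10.0, size=k)
    return x


estimators = {"mean": np.mean, "median": np.median}
results = {}
for name, estimate in estimators.items():
    errors = []
    start = time.perf_counter()
    for _ in range(500):
        errors.append(estimate(simulate()) - 0.0)  # the true location is 0
    elapsed = time.perf_counter() - start  # includes the cost of simulating data
    rmse = float(np.sqrt(np.mean(np.square(errors))))
    results[name] = {"rmse": rmse, "seconds": elapsed}

for name, res in results.items():
    print(f"{name:>6}: RMSE = {res['rmse']:.3f}, time = {res['seconds']:.3f} s")
```

Collecting the metrics in one dictionary makes it straightforward to turn them into a table or plot for the paper.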

### Discussion/conclusion

Your thoughts on the algorithm. Does it fulfill a particular need? How could it be generalized to other problem domains? What are its limitations and how could it be improved further?

### References/bibliography

Make sure you cite your sources.

## Code

The code should be in a public GitHub repository with:

1. A README file
2. An open source license
3. Source code
4. Test code
5. Examples
6. A reproducible report

The package should be downloadable and installable with `python setup.py install`, or even posted to PyPI and installable with `pip install package`.


## Rubric

Each item is worth 10 points, but some sections will give up to 10 bonus points if done really well. Note that the "difficulty factor" of the chosen algorithm will be factored into the grading. 

1. Are the abstract, background and discussion readable and clear? (10-20 points)
2. Is the algorithm description clear and accurate? (10-20 points)
3. Has the algorithm been optimized? (10-20 points)
4. Are the applications to simulated/real data clear and useful? (10-20 points)
5. Was the comparative analysis done well? (10-20 points)
6. Is there a well-maintained GitHub repository for the code? (10 points)
7. Does the document show evidence of literate programming? (10 points)
8. Is the analysis reproducible? (10 points)
9. Is the code tested? Are examples provided? (10 points)
10. Is the package easily installable? (10 points)



In [5]:
import numpy as np
import scipy.linalg

### Hamiltonian Monte Carlo

In [6]:
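def hmc(theta_0, eps, max_iter, M, m, U, grad_U):
    """Hamiltonian Monte Carlo with m leapfrog steps of size eps per proposal.

    U is the potential energy (negative log posterior), grad_U its gradient,
    and M the (d, d) mass matrix.
    """
    theta_0 = np.atleast_1d(np.asarray(theta_0, dtype=float))
    M = np.atleast_2d(M)
    M_inv = scipy.linalg.inv(M)
    d = theta_0.size
    Theta = np.empty((max_iter, d))
    R = np.empty((max_iter, d))

    def H(theta, r):
        # total energy: potential plus kinetic
        return U(theta) + 0.5 * r @ M_inv @ r

    for t in range(max_iter):
        # resample the momentum from a multivariate normal, r ~ N(0, M)
        r_0 = np.random.multivariate_normal(np.zeros(d), M)
        theta_hat, r_hat = theta_0.copy(), r_0.copy()

        # leapfrog: half step in r, alternating full steps, final half step in r
        r_hat = r_hat - eps / 2 * grad_U(theta_hat)
        for i in range(m):
            theta_hat = theta_hat + eps * M_inv @ r_hat
            if i < m - 1:
                r_hat = r_hat - eps * grad_U(theta_hat)
        r_hat = r_hat - eps / 2 * grad_U(theta_hat)

        # Metropolis-Hastings correction: accept with probability min(1, exp(H_old - H_new))
        u = np.random.uniform()
        rho = np.exp(min(0.0, H(theta_0, r_0) - H(theta_hat, r_hat)))
        if u < rho:
            theta_0, r_0 = theta_hat, r_hat

        Theta[t] = theta_0
        R[t] = r_0
    return Theta, R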

In [7]:
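# The total energy of a physical system with position variables theta and momentum variables r:
# potential energy U(theta) plus kinetic energy r^T M^{-1} r / 2.
# U (a function of theta) and the mass matrix M must be defined in the enclosing scope.
H = lambda theta, r: U(theta) + 0.5 * r @ scipy.linalg.inv(M) @ r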
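
The sketch below illustrates why the total energy matters for HMC: with a small step size, the leapfrog integrator nearly conserves `H`, and it is reversible when the momentum is negated. It assumes a standard normal target (so `U` and `grad_U` have simple forms) and an identity mass matrix; the function `leapfrog` and all numerical values are illustrative choices, not part of the notebook's code.

```python
import numpy as np


def U(theta):
    """Potential energy of a standard normal target (up to a constant)."""
    return 0.5 * theta @ theta


def grad_U(theta):
    """Gradient of U for the standard normal target."""
    return theta


def H(theta, r, M_inv):
    """Total energy: potential U(theta) plus kinetic r^T M^{-1} r / 2."""
    return U(theta) + 0.5 * r @ M_inv @ r


def leapfrog(theta, r, eps, m, M_inv):
    """m leapfrog steps: half step in r, alternating full steps, half step in r."""
    r = r - 0.5 * eps * grad_U(theta)
    for i in range(m):
        theta = theta + eps * M_inv @ r
        if i < m - 1:
            r = r - eps * grad_U(theta)
    r = r - 0.5 * eps * grad_U(theta)
    return theta, r


M_inv = np.eye(2)
theta0 = np.array([1.0, -0.5])
r0 = np.array([0.3, 0.8])

theta1, r1 = leapfrog(theta0, r0, eps=0.01, m=100, M_inv=M_inv)
energy_error = abs(H(theta1, r1, M_inv) - H(theta0, r0, M_inv))
print(f"|H(end) - H(start)| = {energy_error:.2e}")

# Reversibility: flipping the momentum and integrating again returns to the start.
theta_back, r_back = leapfrog(theta1, -r1, eps=0.01, m=100, M_inv=M_inv)
print("returned to start:", np.allclose(theta_back, theta0))
```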

In [8]:
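## The potential energy: U = -(sum_i log p(x_i | theta) + log p(theta))
U = lambda px, ptheta: -(np.sum(np.log(px)) + np.sum(np.log(ptheta)))
# px holds the likelihood values p(x_i | theta) of the observed data
#   (a density such as scipy.stats.norm.pdf, the analogue of dnorm in R)
# ptheta is the prior density of theta evaluated at the current theta
# Note: H above calls U(theta), so for a concrete model wrap this in a function of theta.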
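
For a concrete model, the potential energy and the `grad_U` helper can be written directly as functions of `theta`. The sketch below assumes a toy model chosen only for illustration, x_i ~ N(theta, 1) with a N(0, 10^2) prior on theta, and checks the analytic gradient against a central finite difference.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, scale=1.0, size=200)  # simulated observations


def U(theta):
    """Negative log posterior: -(log likelihood + log prior)."""
    log_lik = np.sum(stats.norm.logpdf(x, loc=theta, scale=1.0))
    log_prior = stats.norm.logpdf(theta, loc=0.0, scale=10.0)
    return -(log_lik + log_prior)


def grad_U(theta):
    """Analytic gradient of U with respect to theta."""
    return -np.sum(x - theta) + theta / 10.0 ** 2


# Check the analytic gradient against a central finite difference.
theta, h = 1.3, 1e-5
fd = (U(theta + h) - U(theta - h)) / (2 * h)
print("analytic gradient:  ", grad_U(theta))
print("finite difference:  ", fd)
```

Working with log densities (`logpdf`) rather than taking the log of densities avoids underflow when there are many observations.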

### Stochastic gradient HMC

In [None]:
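def sghmc(theta0, eps, max_iter, m, M, C, B_hat, grad_U_til):
    """Stochastic gradient HMC.

    C is the user-chosen friction matrix, B_hat an estimate of the gradient-noise
    term, and grad_U_til a noisy (e.g. minibatch) estimate of the gradient of U.
    """
    theta0 = np.atleast_1d(np.asarray(theta0, dtype=float))
    M, C, B_hat = np.atleast_2d(M), np.atleast_2d(C), np.atleast_2d(B_hat)
    M_inv = scipy.linalg.inv(M)
    d = theta0.size
    Theta = np.empty((max_iter, d))
    R = np.empty((max_iter, d))
    noise_cov = 2 * (C - B_hat) * eps  # covariance of injected noise; must be positive semi-definite
    for j in range(max_iter):
        # resample the momentum, r ~ N(0, M)
        r0 = np.random.multivariate_normal(np.zeros(d), M)
        for i in range(m):
            theta0 = theta0 + eps * M_inv @ r0
            noise = np.random.multivariate_normal(np.zeros(d), noise_cov)
            r0 = r0 - eps * grad_U_til(theta0) - eps * C @ M_inv @ r0 + noise
        # no Metropolis-Hastings step: the state after m updates is kept as the sample
        Theta[j] = theta0
        R[j] = r0
    return Theta, R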
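
The `grad_U_til` helper is a stochastic estimate of the full gradient. A minimal sketch of one common way to build it, assuming the same illustrative toy model (x_i ~ N(theta, 1) with a N(0, 10^2) prior): evaluate the likelihood term on a random minibatch and rescale it by `N / batch_size` so that the estimate is unbiased. The data, batch size and function bodies are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(loc=2.0, scale=1.0, size=1000)  # full (simulated) data set
N = x.size


def grad_U(theta):
    """Full-data gradient of U for x_i ~ N(theta, 1) with a N(0, 10^2) prior."""
    return -np.sum(x - theta) + theta / 100.0


def grad_U_til(theta, batch_size=50):
    """Minibatch estimate: rescale the likelihood term by N / batch_size."""
    batch = rng.choice(x, size=batch_size, replace=False)
    return -(N / batch_size) * np.sum(batch - theta) + theta / 100.0


theta = 1.0
estimates = np.array([grad_U_til(theta) for _ in range(2000)])
print("full gradient:      ", grad_U(theta))
print("mean of estimates:  ", estimates.mean())
print("std of one estimate:", estimates.std())
```

The spread of the individual estimates is the gradient noise that the friction term in SGHMC is meant to counteract.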