## Assignment 2 Numpy and pandas

This assignment contains three questions, detailed below. It is due on Friday, October 5, 2018 at 23:59. Each late day results in a loss of 20% of the total points.

### Question 1 (30 points) Numpy is fast!

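Suppose we need to compute the cumulative sums of the geometric series $\sum_{i=0}^n \alpha^i$ for a given $\alpha$ and $n$, that is, the partial sums $\sum_{i=0}^{k} \alpha^i$ for $k = 0, 1, \dots, n$.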

For example, when $\alpha=0.5$ and $n=10$, the cumulative sums of $\sum_{i=0}^{10} 0.5^i$ are `[1.0, 1.5, 1.75, 1.875, 1.9375, 1.96875, 1.984375, 1.9921875, 1.99609375, 1.998046875, 1.9990234375]`.
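
For reference, each entry of this list is a partial sum of a geometric series, which has a standard closed form when $\alpha \neq 1$:

$$
S_k = \sum_{i=0}^{k} \alpha^i = \frac{1 - \alpha^{k+1}}{1 - \alpha}, \qquad k = 0, 1, \dots, n.
$$

With $\alpha = 0.5$ and $k = 10$ this gives $S_{10} = 2\,(1 - 0.5^{11}) = 1.9990234375$, which matches the last entry of the list.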


As a courtesy, I have implemented the function `cum_sum` below. It returns a single partial sum, so calling it inside a loop over a `range` builds the list of cumulative sums.

In [None]:
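def cum_sum(alpha, n):
    """Return the partial sum alpha**0 + alpha**1 + ... + alpha**n."""
    current = 1.0    # current term, starting at alpha**0
    total = current  # running total (avoids shadowing the built-in `sum`)
    for i in range(n):
        current = current * alpha
        total = total + current
    return total

cumsum = []
for i in range(11):
    cumsum.append(cum_sum(0.5, i))

print(cumsum)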

We can measure how long this code takes to run using the `time` module, as shown below:

In [None]:
import time

begin = time.time()
n_samples = 10000

cumsum = []
for i in range(n_samples):
    cumsum.append(cum_sum(0.5, i))
print(cumsum)
    
end = time.time()

time0 = end-begin
print("Time taken to run: {} seconds.".format(time0))

It takes about 3.6 seconds on my machine to run the code with 10,000 samples. Note that this time may vary depending on the memory and CPU of your machine. 
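
A single wall-clock measurement with `time.time()` can be noisy. As a hedged alternative, the sketch below uses `time.perf_counter` (a higher-resolution clock from the same `time` module) and keeps the best of a few repeats. The helper name `time_call` and the toy workload are assumptions made for illustration; they are not part of the assignment.

```python
import time

def time_call(func, *args, repeats=3):
    """Return the best wall-clock time (in seconds) over `repeats` calls of func(*args)."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    return best

# Toy workload (hypothetical): sum the first million integers.
elapsed = time_call(sum, range(1_000_000))
print("Best of 3 runs: {:.6f} seconds.".format(elapsed))
```

Taking the minimum over repeats reduces the influence of other processes running on the machine.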

**Question 1.1** (15 points) Now implement the same computation with a list comprehension and estimate how long it takes to generate the list of cumulative sums for 10,000 samples.

Hint: you can use the function `accumulate` from the `itertools` module. See the `itertools` documentation [here](https://docs.python.org/3/library/itertools.html#itertools.accumulate).
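
To illustrate what `itertools.accumulate` does, here is a small sketch on toy data unrelated to the series in this question. By default it yields running sums; passing a two-argument function (here `operator.mul` and `max`) yields running results of that function instead. The variable names are illustrative.

```python
from itertools import accumulate
import operator

values = [1, 2, 3, 4, 5]

running_sum = list(accumulate(values))
running_product = list(accumulate(values, operator.mul))
running_max = list(accumulate([3, 1, 4, 1, 5], max))

print(running_sum)      # [1, 3, 6, 10, 15]
print(running_product)  # [1, 2, 6, 24, 120]
print(running_max)      # [3, 3, 4, 4, 5]
```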

In [None]:
# Question 1.1


begin = time.time()
n_samples = 10000

# write your code here



end = time.time()

time1 = end-begin

print("Time taken to run: {} seconds.".format(time1))

In [None]:
time0/time1

**Question 1.2** (15 points) Now implement the same computation using NumPy and estimate how long it takes to generate the cumulative sums for 10,000 samples. To receive full credit, your program must be at least 1500 times faster than the for loop.

You may receive 5 bonus points if your program is at least 5000 times faster than the for loop!
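
The speed-up from NumPy typically comes from replacing Python-level loops with array operations that run in compiled code. The sketch below shows the idea on a different toy problem (running sums of squares), not on the series in this question. `np.arange` and `np.cumsum` are NumPy functions; the variable names are illustrative.

```python
import numpy as np

k = np.arange(1, 6)            # array([1, 2, 3, 4, 5])
squares = k ** 2               # elementwise power, no Python loop
running = np.cumsum(squares)   # running totals along the array

print(running.tolist())        # [1, 5, 14, 30, 55]

# The same result with an explicit Python loop, for comparison.
total, expected = 0, []
for value in range(1, 6):
    total += value ** 2
    expected.append(total)
assert running.tolist() == expected
```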

In [None]:
begin = time.time()
n_samples = 10000
alpha = 0.5
# write your code here


print(cumsum)
end = time.time()

time2 = end-begin

print("Time taken to run: {} seconds.".format(time2))

### Question 2 (30 points) Monte Carlo 

Monte Carlo is a district of Monaco where the famous Monte Carlo casino is located.

![casino](http://www.casinomontecarlo.com/wp-content/uploads/2017/06/casino-de-monte-carlo-1100x358.jpg)

Named after this casino, Monte Carlo methods (or Monte Carlo experiments) are a broad class of computational algorithms that rely on repeated random sampling to obtain numerical results. Their essential idea is to use randomness to solve problems that might be deterministic in principle. They are often used in physical and mathematical problems and are most useful when it is difficult or impossible to use other approaches. Monte Carlo methods are mainly used in three problem classes: optimization, numerical integration, and generating draws from a probability distribution.
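
As a small illustration of the numerical-integration use case, the sketch below estimates $\int_0^1 x^2\,dx = 1/3$ by averaging $x^2$ over uniform random draws. The function name `mc_integrate_square`, the sample size, and the seed are assumptions chosen for the example; `numpy.random.default_rng` requires NumPy 1.17 or later.

```python
import numpy as np

def mc_integrate_square(n_draws, seed=0):
    """Monte Carlo estimate of the integral of x**2 over [0, 1]."""
    rng = np.random.default_rng(seed)
    x = rng.random(n_draws)      # uniform draws in [0, 1)
    return float(np.mean(x ** 2))

estimate = mc_integrate_square(200_000)
print("estimate = {:.4f}, exact = {:.4f}".format(estimate, 1 / 3))
```

The error of such an estimate typically shrinks in proportion to $1/\sqrt{N}$, where $N$ is the number of draws.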

**Estimating $\pi$**

To estimate $\pi$, the idea is to simulate random $(x, y)$ points in a 2-D plane whose domain is a square with sides of 1 unit, spanning $(0, 0)$ to $(1, 1)$. Imagine a circle inscribed in this square, so that its diameter equals the side of the square: it is centered at $(0.5, 0.5)$ and has radius $0.5$. The circle has area $\pi \cdot 0.5^2 = \pi/4$ and the square has area 1, so a uniformly distributed random point in the square falls inside the circle with probability $\pi/4$. We can generate a large number of uniformly distributed random points in the square and plot them on a graph. We keep track of the total number of points, $N_{total}$, and the number of points that fall inside the circle, $N_{inner}$. The ratio $N_{inner}/N_{total}$ then approximates the ratio of the two areas, $\pi/4$, so $\pi \approx 4 N_{inner}/N_{total}$.

Write a function `approximate_pi` with the argument `number_simulations` that approximates $\pi$ using Monte Carlo simulation. Consider using `numpy.random` to make the random draws.

Give a rough estimate of how many random draws you need to achieve an accuracy of 99.999%, i.e., a relative error below 0.001% (by comparing with `numpy.pi`).

In [None]:
# Question 2



### Question 3 (40 points) California housing

We will explore the famous California housing dataset. The original database is available from StatLib: http://lib.stat.cmu.edu/datasets/


The data contains 20,640 observations on 9 variables, one observation per tract in California. The target variable is the median house price (reported in units of 100,000 dollars), and the 8 input variables (features) are: median income (in tens of thousands of dollars), median house age, average number of rooms per house, average number of bedrooms, population, average number of occupants, latitude, and longitude.

You can download the data from the Internet as follows:

```python
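# Import from the public `sklearn.datasets` namespace; the private
# `sklearn.datasets.california_housing` module path is not available in newer releases.
from sklearn.datasets import fetch_california_housing

cal_housing = fetch_california_housing()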
```

The dataset has the following format:
```
    dataset : dict-like object with the following attributes:
    dataset.data : ndarray, shape [20640, 8]
        Each row corresponding to the 8 feature values in order.
    dataset.target : numpy array of shape (20640,)
        Each value corresponds to the average house value in units of 100,000.
    dataset.feature_names : array of length 8
        Array of ordered feature names used in the dataset.
    dataset.DESCR : string
        Description of the California housing dataset.
```
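
Arrays in this layout map naturally onto a pandas `DataFrame`. The sketch below uses a tiny synthetic stand-in (made-up numbers and hypothetical column names `feat_a`, `feat_b`) rather than the real dataset, so it runs without a download.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in with the same structure: data, target, feature_names.
data = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
target = np.array([0.5, 1.5, 2.5])
feature_names = ["feat_a", "feat_b"]   # hypothetical names

df = pd.DataFrame(data, columns=feature_names)
df["target"] = target

print(df.shape)       # (3, 3): 3 observations, 2 features plus the target
print(df.describe())  # count, mean, std, min, quartiles, max per column
```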

**Question 3.1** (5 points) Create a dataframe named `california_housing` from the fetched dataset using its `data`, `target`, and `feature_names`, and show how many observations and features it contains.

**Question 3.2** (5 points) Show the descriptive statistics of the features.

**Question 3.3** (15 points) Compare areas where houses are more than 25 years old with areas where houses are less than 10 years old. What do you find about their total rooms, total bedrooms, and median household incomes?

**Question 3.4** (15 points) What is the difference in average median house price between households in the top 20% of median household income and those in the bottom 20%?
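
Group comparisons of this kind are commonly done with boolean masks and `Series.quantile`. The sketch below shows the pattern on made-up data with hypothetical columns `income` and `price`; it is not the assignment's dataset or its answer.

```python
import pandas as pd

toy = pd.DataFrame({
    "income": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
    "price":  [10.0, 12.0, 15.0, 18.0, 20.0, 25.0, 30.0, 34.0, 40.0, 50.0],
})

low_cut = toy["income"].quantile(0.2)    # 20th percentile of income
high_cut = toy["income"].quantile(0.8)   # 80th percentile of income

bottom = toy[toy["income"] <= low_cut]   # rows at or below the 20th percentile
top = toy[toy["income"] >= high_cut]     # rows at or above the 80th percentile

gap = top["price"].mean() - bottom["price"].mean()
print("Mean price gap (top minus bottom): {:.2f}".format(gap))
```

The same masking pattern works for thresholds on any column, such as an age cut-off.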