# Sampling and simulation

This demo focuses on approaches from statistics and machine learning that you can easily take advantage of without learning any new math. Most of the methodologies covered in lecture are pretty accessible too, but it's hard to demo everything! Here are some additional resources:

- Online textbook for [Data 8](https://www.inferentialthinking.com/) (chapter 9 onward)
- Detailed introduction to OLS regression, in a [notebook](https://github.com/waddell/CP255/blob/master/13-regression/statistical-modeling.ipynb) (from CP 255 last year)
- Examples from the Statsmodels library: https://www.statsmodels.org/stable/examples/
- Examples from the Scikit-Learn library: https://scikit-learn.org/

## 1. Random sampling

Often we want to inspect or validate a dataset, but don't have the ability to look in detail at each data point. Random sampling gives us a subset of the data that's more likely to be representative than, for example, looking at the first few lines of the dataset. 

Intuitively, this makes sense: Usually the top of the dataset includes observations that are all from the same place, or all from the same point in time. And statistics [tells us](https://en.wikipedia.org/wiki/Sampling_(statistics)) that random samples are even more powerful than they seem. A random sample of several dozen data points will give you a good sense of the characteristics of the full dataset, regardless of how large it is. (If you're interested in rare outcomes, you will need a larger sample. As a rule of thumb, increase the sample size until you get a few dozen of whatever kind of data point you're interested in.)
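Here's a quick sketch of that idea using made-up numbers rather than real data. It assumes a hypothetical dataset of 10,000 rents stored in time order, with rents rising steadily over time, and compares the first 50 rows against a random sample of 50 rows. The array name `rents_over_time`, the rent range, and the seed are all invented for illustration. (It uses NumPy, which is introduced just below.)

```python
import numpy as np

np.random.seed(0)  # arbitrary seed, only so the sketch is repeatable

# Hypothetical data: 10,000 rents in time order, rising from $1,000 to $2,000
rents_over_time = np.linspace(1000, 2000, 10000)

# The first 50 rows all come from the earliest time period
head_mean = rents_over_time[:50].mean()

# A random sample of 50 rows, drawn without replacement
random_rows = np.random.choice(rents_over_time, size=50, replace=False)
sample_mean = random_rows.mean()

true_mean = rents_over_time.mean()

print("mean of full dataset: ", round(true_mean, 1))
print("mean of first 50 rows:", round(head_mean, 1))
print("mean of random sample:", round(sample_mean, 1))
```

The first 50 rows only reflect the earliest rents, while the random sample should land much closer to the overall average.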

In **Python**, NumPy has dozens of random sampling and random number generating functions:   
https://docs.scipy.org/doc/numpy-1.15.0/reference/routines.random.html

In [None]:
import numpy as np

In [None]:
np.random.random()  # random float from 0 up to (but not including) 1

In [None]:
np.random.randint(low=1, high=100, size=10)  # 10 random integers from 1 to 99 (high is excluded)

### How does this work?

When computers generate random numbers, they're really just "pseudo-random". The algorithm begins with an unlikely-to-be-repeated "seed" (like the current time in microseconds), and applies a fixed series of arithmetic operations to it, so that the resulting sequence looks arbitrary even though it's completely determined by the seed. Read more [here](https://en.wikipedia.org/wiki/Random_number_generation).
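One practical consequence is that if you set the seed yourself, you get the same "random" numbers every time, which is handy for making an analysis reproducible. A minimal sketch using `np.random.seed()` (the seed value 42 is an arbitrary choice):

```python
import numpy as np

np.random.seed(42)  # fix the seed (42 is arbitrary)
first = np.random.randint(low=1, high=100, size=5)

np.random.seed(42)  # reset to the same seed
second = np.random.randint(low=1, high=100, size=5)

print(first)
print(second)
print(np.array_equal(first, second))  # True: same seed, same sequence
```

If you never set a seed, NumPy picks one for you, which is why results normally change from run to run.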

### Random sampling in Pandas

Conveniently, Pandas can directly give you a random sample of a DataFrame.

https://pandas.pydata.org/pandas-docs/version/0.24.2/reference/api/pandas.DataFrame.sample.html

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
# Download Zillow rent index data: https://www.zillow.com/research/data/
url = "http://files.zillowstatic.com/research/public/City/City_Zri_AllHomesPlusMultifamily_Summary.csv"
rents = pd.read_csv(url)

In [None]:
rents.head()

In [None]:
rents.sample(n=5)
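Two other `sample()` parameters are often useful: `frac` draws a fraction of the rows instead of a fixed number, and `random_state` fixes the seed so you get the same rows each time. A small sketch on a made-up DataFrame (the city names and rents here are hypothetical, not Zillow data):

```python
import pandas as pd

# Hypothetical stand-in for a real dataset
df = pd.DataFrame({
    "city": ["A", "B", "C", "D", "E", "F", "G", "H", "I", "J"],
    "rent": [1200, 1500, 900, 2100, 1750, 1300, 990, 2500, 1600, 1100],
})

# Same random_state -> same rows every time
sample_a = df.sample(n=3, random_state=1)
sample_b = df.sample(n=3, random_state=1)
print(sample_a.equals(sample_b))  # True

# Draw 20% of the rows (2 of 10) rather than a fixed count
fifth = df.sample(frac=0.2)
print(len(fifth))  # 2
```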

### Other applications

You can also layer random sampling on top of other statistical procedures, for example to perform **cross-validation**. In its simplest form, this is when you randomly divide your data into two chunks, a "training" set and a "testing" set. Use one set to fit a statistical model, and the other to check how well it performs with data it hasn't seen before, to better mimic real-world applications. Fuller versions of cross-validation repeat this with several different splits and average the results. Read more [here](https://en.wikipedia.org/wiki/Cross-validation_(statistics)).
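As a rough sketch, a single random training/testing split can be done with nothing more than `sample()` and `drop()`. The DataFrame, the 80/20 proportion, and the seed below are all assumptions made for illustration. (Scikit-Learn also has dedicated helpers for this.)

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with 100 rows
df = pd.DataFrame({"x": np.arange(100), "y": np.arange(100) * 2.0})

# Randomly pick 80% of the rows for training...
train = df.sample(frac=0.8, random_state=0)

# ...and keep every row that wasn't picked for testing
test = df.drop(train.index)

print(len(train), len(test))  # 80 20
```

Because the testing rows are whatever is left over after sampling, no row ends up in both sets.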

### Exercise

Generate a list of 50 random integers between 0 and 100. 

Plot a histogram of them -- if you have a list or array from NumPy called `my_list`, you can use `pd.Series(my_list).hist()`.

How much does the distribution change when you re-run the code? How much does it change if you increase the size of the sample?

## 2. Monte Carlo simulation

Computers are really good at generating random numbers, and really good at doing the same thing many times. If you're studying some kind of real-world process that involves randomness (like people's behavioral choices, or the weather, or epidemics), you can write code to simulate the process many, many times, to get a sense of what the aggregate outcomes will look like. This approach is called [Monte Carlo simulation](https://en.wikipedia.org/wiki/Monte_Carlo_method#Applications), and it's usually much easier than doing the same thing analytically. For example, this is often used in travel demand modeling or land use modeling.

### Example

We are building a 50-unit apartment building. Each unit has a 50% chance of being rented by a student, and a 50% chance of being rented by someone else. Students have an 80% chance of owning a bicycle and a 10% chance of owning a car. Non-students have a 60% chance of owning a bicycle and a 75% chance of owning a car. What's the range of car and bicycle parking demand that we're likely to see?

In [None]:
def simulate():
    """
    Simulate the number of bikes and cars owned by residents of the building.
    
    """
    bike_count = 0
    car_count = 0
    
    for i in range(50):  # do this once for each unit in the building
       
        if (np.random.random() < 0.5):  # student

            if (np.random.random() < 0.8):
                bike_count = bike_count + 1

            if (np.random.random() < 0.1):
                car_count = car_count + 1

        else:

            if (np.random.random() < 0.6):
                bike_count = bike_count + 1

            if (np.random.random() < 0.75):
                car_count = car_count + 1

    return bike_count, car_count

In [None]:
simulate()

In [None]:
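# run the simulation 500 times and plot the range of outcomes

bike_counts = []
car_counts = []

for i in range(500):
    bikes, cars = simulate()
    bike_counts.append(bikes)
    car_counts.append(cars)

# both histograms are drawn on the same axes, so label them to tell them apart
pd.Series(bike_counts).hist(alpha=0.5, label="bikes")
pd.Series(car_counts).hist(alpha=0.5, label="cars")
plt.legend()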
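To put numbers on "the range we're likely to see", one option is to summarize the simulated outcomes with percentiles. The sketch below is a self-contained, vectorized variant of the same simulation (NumPy arrays instead of a loop). The names `sim_bikes` and `sim_cars`, the 5,000 trials, the seed, and the choice of the middle 95% are all assumptions for illustration, not the only reasonable way to do it.

```python
import numpy as np

np.random.seed(0)  # arbitrary seed, only so the sketch is repeatable
n_trials, n_units = 5000, 50

# One row per simulated building, one column per unit
is_student = np.random.random((n_trials, n_units)) < 0.5

# Ownership probabilities depend on who rents the unit
p_bike = np.where(is_student, 0.8, 0.6)
p_car = np.where(is_student, 0.1, 0.75)

# Draw ownership for every unit, then count per building
sim_bikes = (np.random.random((n_trials, n_units)) < p_bike).sum(axis=1)
sim_cars = (np.random.random((n_trials, n_units)) < p_car).sum(axis=1)

# Middle 95% of simulated outcomes
bike_low, bike_high = np.percentile(sim_bikes, [2.5, 97.5])
car_low, car_high = np.percentile(sim_cars, [2.5, 97.5])

print("bikes: mean", sim_bikes.mean(), "range", bike_low, "to", bike_high)
print("cars:  mean", sim_cars.mean(), "range", car_low, "to", car_high)
```

As a sanity check, the averages should come out near the expected values you can work out by hand: 50 × (0.5 × 0.8 + 0.5 × 0.6) = 35 bikes and 50 × (0.5 × 0.1 + 0.5 × 0.75) = 21.25 cars.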

### Exercise

Simulate flipping a coin 100 times and counting the number of heads. Repeat the simulation many times, and plot a histogram of the results.

If you flipped a real coin 100 times, how far from 50 would the number of heads or tails need to be before you got suspicious?