In [None]:
import matplotlib
from datascience import *
%matplotlib inline
import matplotlib.pyplot as plots
import numpy as np
plots.style.use('fivethirtyeight')

# Activity: Normal Distributions

The cell below will load in a dataset from Inside Airbnb (http://insideairbnb.com/get-the-data.html) that has details of over 2000 Airbnb listings in Asheville, NC through February 17, 2021. It includes information about the name of the property, the zip code it's in, price, and many other fields.

Today, we'll be investigating how we might estimate the average price for a night at an Airbnb. A small number of listings (6) with prices over 1000 dollars were removed.

**Throughout this activity, we assume that we don't have access to the full population, so that we can imagine how we might estimate the population mean from a single sample.** As a result, we will avoid inspecting the full dataset as much as possible.

In [None]:
asheville = Table.read_table('data/asheville-airbnb.csv').where('price', are.below(1000))
asheville.show(5)

In [None]:
asheville.hist('price', bins = 20)

## Pick a sample size

If we want to estimate the population mean, we have to make a few choices. As we've seen, larger sample sizes produce narrower confidence intervals, meaning we have a more precise estimate. However, larger samples are more expensive to take in real life (you have to spend either more time or more money to get more data in your sample). As such, in practice you'll typically decide up front how wide your interval is allowed to be, and then set out to collect a sample large enough to produce an interval of that width. If you choose a smaller sample size, your interval will not be narrow enough to provide the required accuracy; if you choose a larger one, you will pay more for accuracy you already decided you didn't need. So, how do we choose the correct size?
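
To make the trade-off concrete, here is a minimal sketch of how the width of a 95% bootstrap interval shrinks as the sample grows. It uses only NumPy and a made-up, right-skewed population (the gamma parameters and the name `bootstrap_ci_width` are assumptions for illustration, not the Asheville data). It also uses `np.percentile`, which interpolates and so can differ slightly from the `percentile` function in `datascience`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical right-skewed "population" of nightly prices (NOT the Asheville data).
population = rng.gamma(shape=2.0, scale=75.0, size=2000)

def bootstrap_ci_width(sample, reps=1000):
    """Width of the 95% percentile-bootstrap interval for the mean of `sample`."""
    n = len(sample)
    # Each bootstrap replicate resamples n values with replacement from the sample.
    means = np.array([rng.choice(sample, size=n, replace=True).mean()
                      for _ in range(reps)])
    left, right = np.percentile(means, [2.5, 97.5])
    return right - left

widths = {}
for n in [50, 200, 800]:
    # Draw one sample of size n from the population, without replacement.
    sample = rng.choice(population, size=n, replace=False)
    widths[n] = bootstrap_ci_width(sample)
    print("sample size:", n, " CI width:", round(widths[n], 1))
```

In this sketch, each fourfold increase in sample size should cut the width roughly in half, which is why a much narrower interval can require a much more expensive sample.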

Let's start by just picking a sample size to draw from the population.

In [None]:
sample_size = 50
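# Draw distinct listings from the population (sample() draws with replacement by default)
my_sample = asheville.sample(sample_size, with_replacement=False)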

my_sample.show(5)

Let's see what kind of estimate a sample of this size produces:

In [None]:
def one_bootstrap_mean():
    """Resample my_sample with replacement (same size) and return the mean price."""
    return np.mean(my_sample.sample().column('price'))

In [None]:
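bootstrap_means = make_array()

# Collect the means of 1000 bootstrap resamples
for i in np.arange(1000):
    new_mean = one_bootstrap_mean()
    bootstrap_means = np.append(bootstrap_means, new_mean)

# The middle 95% of the bootstrap means gives the confidence interval
left = percentile(2.5, bootstrap_means)
right = percentile(97.5, bootstrap_means)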

In [None]:
Table().with_column('Bootstrap means', bootstrap_means).hist()

plots.plot([left,right], [0,0], color="gold", lw=8, zorder=1);
plots.title('Bootstrap Means (1K Bootstraps from our Sample)');
print("95% CI: (", left, ",", right,")")
print("CI Width:", right-left)
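
The steps above (resample, take the mean, repeat, then read off the middle 95%) can also be wrapped in a single reusable function. The sketch below is one possible way to do that with plain NumPy: it draws all resamples at once instead of appending in a loop. The name `bootstrap_mean_ci`, its parameters, and the `prices` array are hypothetical, and `np.percentile` interpolates, so its endpoints may differ slightly from those of `percentile` in `datascience`.

```python
import numpy as np

def bootstrap_mean_ci(values, reps=1000, level=95, seed=None):
    """Percentile-bootstrap confidence interval for the mean of `values`.

    Returns (left, right, means), where `means` holds the bootstrap means.
    """
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    n = len(values)
    # Each row of `idx` is one resample: n positions drawn with replacement.
    idx = rng.integers(0, n, size=(reps, n))
    means = values[idx].mean(axis=1)
    # Cut off equal tails on both sides, e.g. 2.5% and 97.5% for level=95.
    tail = (100 - level) / 2
    left, right = np.percentile(means, [tail, 100 - tail])
    return left, right, means

# Hypothetical sample of 50 nightly prices (NOT the Asheville data).
prices = np.random.default_rng(1).gamma(shape=2.0, scale=75.0, size=50)

left, right, means = bootstrap_mean_ci(prices, reps=1000, seed=2)
print("95% CI: (", left, ",", right, ")")
print("CI Width:", right - left)
```

With the notebook's data, a call such as `bootstrap_mean_ci(my_sample.column('price'))` would play the same role as the loop above, assuming `my_sample` is defined as earlier.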