## Repeated Sampling

Let's simulate the following scenario: 

We have a population of 10,000 students currently enrolled at Georgetown SCS. We want to estimate the average amount of time they spent on schoolwork during the last week. Surveying the complete population would be too time-consuming, so we're going to randomly select 20 students, ask them to record their schoolwork times, and calculate the mean of the sample.

We will simulate this by creating the population data using the `random` library, and we will repeatedly simulate sampling 20 students and compare the different sample means to see what they can tell us about the population.

Because we are continuing to build Python skills, now is a good time to discuss coding conventions by looking at PEP8:

https://www.python.org/dev/peps/pep-0008/

PEP stands for Python Enhancement Proposal. It is the process by which the Python development team proposes new ideas, discusses them, and, if agreed upon, formalizes them as part of the Python core language. PEP8 is the Style Guide for Python Code, originally proposed in 2001.

In [None]:
import random
import pandas as pd
import matplotlib.pyplot as plt

%config InlineBackend.figure_format = 'retina'

Next we will randomly generate our __population__ data. Just like in reality, we will not know the true parameters of our population at the outset. I will try to obfuscate the population parameters by using binary notation and a list comprehension. I will explain the basics of list comprehensions a little further down.

(If you want a challenging exercise, research binary notation and list comprehensions, and see if you can rewrite my code in a more readable format)

The population data will be a DataFrame containing student IDs from 1 to 10000, and study time recorded in minutes (rounded to the nearest 10).

In [None]:
random.seed(1234)
students = pd.DataFrame([{'id': id+1, 'study_time': int(round(random.gauss(0b00011011, 0b00001011)))*0b00001010} for id in range(10000)])
students.head()

In [None]:
students.tail()

### Single Sample
Now we will draw a random sample of 20 students, print their study times, and calculate the sample mean.

In [None]:
sample = students.sample(20, random_state=1234) # random_state is a seed, like above
sample

In [None]:
mean = sample['study_time'].mean()
print(f"Mean study time of 20 students: {mean:.1f}")

In the real world, if you had collected this sample and calculated the statistic, how would you explain what it tells us about the sample, and about the population?

#### Answer:
For these 20 students, the average time spent studying this past week was 272 minutes.

If we had to estimate the average time spent studying in the past week for all 10,000 Georgetown students, our estimate would be 272 minutes.

=========

### 10 Repeated Samples
Now let's generate repeated samples using a for loop. In the real world, it would probably be prohibitively expensive to survey a new sample of 20 people over and over again, but in Python it's easy. To start, I'll take 10 samples.

In [None]:
sample_means = []

for _ in range(10):
    temp_sample = students.sample(20)
    sample_means.append(temp_sample['study_time'].mean())
    
print("Sample Means:")
print(sample_means)

The following code does the exact same thing, but using a __list comprehension__ instead of a for loop. Because the pattern of taking an iterable (like a list), performing some operation on each element, and saving the results in a new list is so common, Python has a shorthand for it:

In [None]:
sample_means = [students.sample(20)['study_time'].mean() for _ in range(10)]
print("Sample Means:")
print(sample_means)
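
To see the pattern in isolation, here is a minimal sketch on a toy list. The names `numbers`, `squares_loop`, and `squares_comp` are made up for this illustration and are not part of the study-time data:

```python
numbers = [1, 2, 3, 4, 5]  # toy data, just for illustration

# for loop version: start with an empty list and append one result at a time
squares_loop = []
for n in numbers:
    squares_loop.append(n ** 2)

# list comprehension version: the same operation written on a single line
squares_comp = [n ** 2 for n in numbers]

print(squares_loop)
print(squares_comp)
```

Both lists hold the same values. The comprehension reads as "the result of `n ** 2` for each `n` in `numbers`".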

Let's visualize just the means quickly:

In [None]:
sample_means_df = pd.DataFrame(sample_means)
sample_means_df.hist()

This gives us a better idea of what the true population mean could be. We can also add a vertical line at the mean of these 10 values, as that is going to be our new estimate for the population mean.

In [None]:
sample_means_df.hist()
plt.axvline(sample_means_df[0].mean(), color='r', linewidth=3)
plt.show()

Now we have an estimate for the true population mean study time, but there's still a lot of variability between these samples, so we know our estimate isn't very good. Let's try again, this time with 100 samples.

### 100 Repeated Samples
Keep in mind that as we take 100 samples, each time we're choosing 20 students and finding the mean of their study times. This time we won't print the raw data, just the histogram.

In [None]:
sample_means_100 = []

for _ in range(100):
    temp_sample = students.sample(20)
    sample_means_100.append(temp_sample['study_time'].mean())
    
sample_means_100_df = pd.DataFrame(sample_means_100)
sample_means_100_df.hist()
plt.axvline(sample_means_100_df[0].mean(), color='r', linewidth=3)
plt.show()

This gives us a much better picture of the likely value of the population mean. But can we get a more accurate estimate with more samples? Yes. The idea that an average based on more observations tends to land closer to the true value is a basic concept in statistics called the _Law of Large Numbers_. Let's do 10,000 samples next:

### 10,000 Repeated Samples

In [None]:
sample_means_10k = []
sample_stds_10k = []

for _ in range(10000):
    temp_sample = students.sample(20)
    sample_means_10k.append(temp_sample['study_time'].mean())
    sample_stds_10k.append(temp_sample['study_time'].std())
    
sample_means_10k_df = pd.DataFrame(sample_means_10k)
sample_means_10k_df.hist()
plt.axvline(sample_means_10k_df[0].mean(), color='r', linewidth=3)
plt.show()

Looking at this approximation of the __sampling distribution__, we can see it has converged on what is probably a very good estimate of the population mean.
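
One way to check this kind of convergence numerically is to compare the estimate from the first 10, the first 100, and all 10,000 sample means. The sketch below is self-contained and uses only the standard library. It does not use the `students` data: `population` is a hypothetical stand-in drawn from an assumed Normal(300, 100), so the numbers it prints say nothing about the notebook's population.

```python
import random
import statistics

random.seed(0)  # arbitrary seed so the sketch is repeatable

# Hypothetical stand-in population (NOT the notebook's `students` data):
# 10,000 study times in minutes, drawn from an assumed Normal(300, 100).
population = [random.gauss(300, 100) for _ in range(10_000)]

# Draw 10,000 samples of 20 (without replacement) and record each sample mean.
sample_means = [
    statistics.mean(random.sample(population, 20)) for _ in range(10_000)
]

# Estimate the population mean from the first 10, first 100, and all 10,000 means.
for n in (10, 100, 10_000):
    estimate = statistics.mean(sample_means[:n])
    print(f"{n:>6} samples: estimate = {estimate:.1f}")

# In a simulation we can peek at the answer; in a real survey we could not.
print(f"Population mean: {statistics.mean(population):.1f}")
```

With a simulated population we can compare the estimates to the true mean directly, which is exactly what a real survey would not allow.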

You may be wondering: since we had a population of 10,000 students and we took 10,000 samples, does that mean we've used all the possible samples?

The number of ways you can choose a sample of size _k_ from a population of size _N_ is called the __binomial coefficient__ (this isn't something you have to know for this class, it's just interesting). The mathematical notation is: $\binom{N}{k}$ 

It turns out the formula is $\binom{N}{k} = \frac{N!}{k!(N-k)!}$. With $N = 10{,}000$ and $k = 20$, this is $\frac{10{,}000!}{20! \, 9{,}980!}$, which is much, much larger than 10,000. So although we have taken enough samples to get a very accurate estimate of the true population mean, we haven't yet calculated the entire true sampling distribution, which includes every single possible sample.
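
If you are curious how large that number is, Python can compute it exactly. This is a small sketch using `math.comb` from the standard library (available in Python 3.8 and later). The variable names `N`, `k`, and `n_possible_samples` are just labels chosen for this example:

```python
import math

N = 10_000  # population size
k = 20      # sample size

# Number of distinct samples of size k from a population of size N:
# N! / (k! * (N - k)!)
n_possible_samples = math.comb(N, k)

print(f"Possible samples: {n_possible_samples:.3e}")
print(f"Number of digits: {len(str(n_possible_samples))}")
```

Python integers have arbitrary precision, so the result is exact rather than a floating-point approximation.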

### Conclusion
If you compare the distributions of sample means that we generated for 10 samples, 100 samples, and 10,000 samples, how would you describe what you see?

#### Answer:
With only 10 sample means, there were too few values to really describe the shape of the distribution. We did find the mean of those 10 sample means, which is an OK estimate for the population mean, but it's still pretty rough.

When we increased to 100 samples, and then again to 10,000 samples, we saw the shape of the distribution become more symmetric and bell-shaped. We also saw our estimate for the true mean get closer and closer to 270. It is probably a safe guess that the true population mean is 270, and that the __sampling distribution__ is Normally distributed.

### Optional:
Does this mean that the population is probably Normally distributed as well?