## Playing with sampling and the Central Limit Theorem

Begin with imports:

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt 
%matplotlib inline

Let's create a population that isn't normally distributed. To do so, we will concatenate several normal distributions:

In [None]:
d1 = np.random.normal(loc=-6.4, scale=1.2, size = 40000)
d2 = np.random.normal(loc=4, scale=10, size = 16000)
d3 = np.random.normal(loc=22, scale=8, size = 72000)

population = np.concatenate([d1,d2,d3])
pop = pd.DataFrame(data=population, columns=['population'])
pop.head()

## Make a histogram. Play around with bin size

Hint: there are multiple ways to do this. Try numpy.histogram or the pandas method hist.

In [None]:
pop['population'].hist(bins=100)
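The hint above also mentions `numpy.histogram`, which returns the bin counts and bin edges without drawing anything. The following is a minimal sketch of how the bin count changes the summary. It assumes a scaled-down stand-in for the population (the same three components at 1/100 of the size) and an arbitrary seed, both chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for repeatability

# Hypothetical stand-in for the notebook's population: same components, 100x smaller
values = np.concatenate([
    rng.normal(loc=-6.4, scale=1.2, size=400),
    rng.normal(loc=4, scale=10, size=160),
    rng.normal(loc=22, scale=8, size=720),
])

for bins in (10, 50, 200):
    # counts has `bins` entries; edges has `bins + 1` entries
    counts, edges = np.histogram(values, bins=bins)
    width = edges[1] - edges[0]
    print(bins, "bins, bin width", round(width, 2), ", tallest bin holds", counts.max())
```

With few bins the two modes blur together; with many bins the counts per bin become small and noisy.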

Extra: Try displaying the data using an alternate visualization technique, a violin plot. Seaborn has a built-in method that is useful for this.

In [None]:
import seaborn as sns
ax = sns.violinplot(y='population',data=pop)

## Make a kernel density estimate of the population distribution

Hint: pandas.DataFrame.plot.kde

In [None]:
pop.plot.kde()

## Compute the mean of the population

In [None]:
pop['population'].mean()

## Compute the standard deviation of the population

In [None]:
pop['population'].std()
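One detail worth knowing here: the pandas `std` method divides by N - 1 by default (`ddof=1`), while the population standard deviation divides by N (`ddof=0`), which is also the NumPy default. With a population this large the two are nearly identical, but the small, made-up data set below shows the difference.

```python
import numpy as np
import pandas as pd

# Toy values chosen so the population sd comes out to a round number
values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

sample_sd = values.std()            # pandas default, ddof=1
population_sd = values.std(ddof=0)  # divide by N instead of N - 1
numpy_sd = np.std(values.to_numpy())  # NumPy default is ddof=0

print(sample_sd, population_sd, numpy_sd)
```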

## We have described our population. Now let's draw a sample of size n and look at the distribution of our sample mean and s.d.

Write a function that samples the pop dataframe with an argument n that is the number of samples to take. Sample without replacement.

In [None]:
def draw_sample(pop, n):
    data = pd.DataFrame(np.random.choice(pop,size=n, replace=False),columns=['population_sample'])
    return data

In [None]:
def draw(pop, n):
    data = pop.sample(n)
    return data



In [None]:
draw(pop['population'], 100)

In [None]:
draw_sample(pop['population'], 100)
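Both approaches above sample without replacement: `Series.sample` does so by default, and `np.random.choice` does so when `replace=False`. A quick way to convince yourself is to sample from values that are all distinct and check for repeats. This sketch uses a toy Series and NumPy's `default_rng` generator rather than the notebook's `pop`, and the seeds are arbitrary.

```python
import numpy as np
import pandas as pd

# Toy stand-in population with 1000 distinct values
values = pd.Series(np.arange(1000, dtype=float), name="population")

# pandas: sampling is without replacement unless replace=True is passed
s_pandas = values.sample(n=100, random_state=0)

# NumPy: replace=False must be requested explicitly
rng = np.random.default_rng(0)
s_numpy = rng.choice(values.to_numpy(), size=100, replace=False)

print(s_pandas.index.is_unique)        # no row was picked twice
print(len(np.unique(s_numpy)) == 100)  # no value was picked twice
```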

## Now we want to draw repeated samples of size *n* from the population

Create another function that calls the first one `samples` times. It should take two arguments: `samples`, and `n`, which it passes on to the first function. For each sample, append the mean and the standard deviation of the sample to two separate lists and return them.

Hint: use a loop over `range(samples)`. To create an empty list at the start of a function, try something like:

    def repeat_samples(samples, n):  
      means = []  
      sds = []  
      ...  
      return (means, sds)
    
then use the append method to append each mean and sd value to the end of each respective list.

In [None]:
def repeat_samples(samples, n):
    means = []
    sds = []

    for i in range(samples):
        sample = draw_sample(pop['population'], n)
        # Call the methods; without parentheses the method objects themselves are appended
        means.append(sample['population_sample'].mean())
        sds.append(sample['population_sample'].std())
    return (means, sds)

In [38]:
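def repeat_samples(samples, n):
    means = []
    sds = []
    for i in range(samples):
        # Draw n values without replacement and record the sample statistics
        sample = np.random.choice(pop['population'], size=n, replace=False)
        means.append(sample.mean())
        sds.append(sample.std())
    # Build the DataFrame once, after the loop, so it also exists when samples == 0
    means_sds = pd.DataFrame({'means': means, 'sds': sds})
    return means_sds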

In [39]:
repeat_samples(40, 30)

| | means | sds |
|---|---|---|
| 0 | 9.122884 | 14.760745 |
| 1 | 10.988336 | 13.497027 |
| 2 | 11.324318 | 15.484101 |
| 3 | 12.356518 | 13.668917 |
| 4 | 13.918663 | 15.032178 |
| 5 | 12.173758 | 15.420337 |
| 6 | 9.619925 | 15.842289 |
| 7 | 13.794991 | 14.743482 |
| 8 | 17.193207 | 13.057733 |
| 9 | 8.49958 | 13.281619 |


In [None]:
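# `means` only exists inside repeat_samples, so count the rows of the returned DataFrame
len(repeat_samples(40, 30)['means'])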

## Almost there!

Now make a function with two arguments `samples` and `n` that takes the return values from the last function and
* converts the lists to a single dataframe
* plots two histograms of the columns (mean, sd)
* prints out the mean and sd of the columns

Hint: to get a multi-valued return into new variables, try this:

    means, sds = repeat_samples(samples, n)
    df = pd.DataFrame(data={'means': means, 'sds': sds})

In [None]:
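def repeat_samples(pop, samples, n):
    # pop must be one-dimensional (e.g. pop['population']) for draw_sample to work
    means = []
    sds = []

    for i in range(samples):
        sample = draw_sample(pop, n)
        # draw_sample names its single column 'population_sample'
        means.append(sample['population_sample'].mean())
        sds.append(sample['population_sample'].std())

    return (means, sds)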

In [None]:
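def describe_sample(pop, samples, n):
    # pop is the 1-D population Series; samples and n are passed on to repeat_samples
    means, sds = repeat_samples(pop, samples, n)
    df = pd.DataFrame(data={'means': means, 'sds': sds})

    df.hist(bins=100)
    # Averages of the per-sample statistics, to compare with the population values
    print('Mean of sample means: {}'.format(np.round(df['means'].mean(), 2)))
    print('Mean of sample sds: {}'.format(np.round(df['sds'].mean(), 2)))

    return df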

In [None]:
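# describe_sample needs all three arguments; 1000 and 30 are example values to vary
df = describe_sample(pop['population'], samples=1000, n=30)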

## Run your final function several times with varying values of samples and n

How do the averages of the sample means and sample sds change as `samples` and `n` grow? Do they converge on the population mean and sd?
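One thing to look for while varying `n`: the central limit theorem says the spread of the sample means should shrink roughly like sigma / sqrt(n), where sigma is the population sd. The sketch below checks this numerically. It rebuilds a population with the same three components using `default_rng` and an arbitrary seed, and `sd_of_sample_means` is a hypothetical helper, not one of the notebook's functions.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed, only for repeatability

# Rebuild a population with the same three components as the notebook's
population = np.concatenate([
    rng.normal(-6.4, 1.2, 40000),
    rng.normal(4, 10, 16000),
    rng.normal(22, 8, 72000),
])
sigma = population.std()  # population sd (ddof=0)

def sd_of_sample_means(n, samples=1000):
    """Draw `samples` samples of size n and return the sd of their means."""
    means = [rng.choice(population, size=n, replace=False).mean()
             for _ in range(samples)]
    return float(np.std(means))

for n in (10, 40, 160):
    observed = sd_of_sample_means(n)
    predicted = sigma / np.sqrt(n)
    print(n, round(observed, 3), round(predicted, 3))
```

The observed and predicted columns should track each other closely, and quadrupling `n` should roughly halve both.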

## Bootstrapping your data: Finding confidence intervals

Statisticians use the central limit theorem to establish confidence intervals. Create a function that finds the nth and (100-n)th percentiles of the distribution of means returned by `describe_sample`. Note that n here is a percentile (for example 2.5), not the sample size.
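One possible shape for such a function is sketched below, using `np.percentile`. The name `percentile_interval` and the default of 2.5 are assumptions made for illustration, and the toy array stands in for the `means` column of the DataFrame returned by `describe_sample`.

```python
import numpy as np

def percentile_interval(means, n=2.5):
    """Return the nth and (100 - n)th percentiles of a collection of sample means."""
    means = np.asarray(means, dtype=float)
    lower, upper = np.percentile(means, [n, 100 - n])
    return float(lower), float(upper)

# Toy stand-in for df['means']; the seed and parameters are arbitrary
rng = np.random.default_rng(1)
toy_means = rng.normal(loc=12.0, scale=2.0, size=5000)

lower, upper = percentile_interval(toy_means, n=2.5)
print(lower, upper)  # the middle 95% of the toy means lies between these values
```

In the notebook this would be called as `percentile_interval(df['means'], n=2.5)` after running `describe_sample`.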