<a href="https://colab.research.google.com/github/tbonne/IntroPychStats/blob/main/notebooks/intro_confidence_intervals.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img src='http://drive.google.com/uc?export=view&id=1LFvPQl6_4ilF6qaXVILyTtH0whgy7OZV' width=500>

# <font color='darkorange'>Calculating confidence intervals for samples</font>

Here we will take a look at how we learn about a population from a sample.

First let's create a population! We'll give each individual in the population an IQ. To do that we will use a normal distribution with a mean of 100 and a standard deviation of 25.

In [None]:
#choose the parameters of our population
pop_IQ_mean = 100
pop_IQ_sd = 25

#Create a population of 100,000 people, each with an IQ
population_IQ_values = rnorm(n=100000, mean=pop_IQ_mean, sd=pop_IQ_sd )

#take a look at our population
hist(population_IQ_values)



Now that we have a population, let's take a random sample of 10 individuals.

In [None]:
#sample some individuals
sample_IQ_values = sample(population_IQ_values, size=10)

#take a look at our sample
sample_IQ_values

If we only had this sample of the population we could use what we've learnt in descriptive statistics to calculate a mean!

In [None]:
mean(sample_IQ_values)

How close is the sample mean (what you got from running the code just above) to the population mean?
> Remember: we made the population so we know the real population mean is 100!
  
Let's calculate a confidence interval for our sample mean. This confidence interval will help us quantify the uncertainty that we know is in our sample (i.e., we didn't sample everyone so we are uncertain about the mean of the population).

First let's calculate the standard error of the mean (SEM)! Remember, the SEM is just a measure of how much our sample mean is likely to vary as we take many samples. We can estimate this value without having to take many samples: divide the standard deviation of our sample, $s$, by the square root of the number of observations, $n$.

$$\mathrm{SEM} = \frac{s}{\sqrt{n}}$$

In [None]:
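#SEM is just the standard deviation of our sample divided by the square root of the number of observations
#using length() keeps this correct if you change the sample size later
SEM = sd(sample_IQ_values)/sqrt(length(sample_IQ_values))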
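
To see what the SEM is measuring, here is a small simulation sketch. It assumes we draw fresh samples straight from a normal distribution with `rnorm` (rather than from `population_IQ_values`), and the names `n_repeats`, `sample_means`, `empirical_SEM` and `theoretical_SEM` are made up for this illustration.

```r
# A sketch: compare the spread of many sample means with the SEM formula
set.seed(42)        # fix the random numbers so the result can be reproduced
pop_IQ_mean = 100
pop_IQ_sd = 25
n = 10              # size of each sample
n_repeats = 5000    # how many samples we take (an arbitrary choice)

# take many samples and store the mean of each one
sample_means = replicate(n_repeats, mean(rnorm(n, mean = pop_IQ_mean, sd = pop_IQ_sd)))

# how much do the sample means actually vary?
empirical_SEM = sd(sample_means)

# what does the formula give when we plug in the population standard deviation?
theoretical_SEM = pop_IQ_sd / sqrt(n)

# take a look at both
c(empirical = empirical_SEM, formula = theoretical_SEM)
```

The two numbers should be close to each other. With a single sample we do not know the population standard deviation, so we use the standard deviation of the sample in its place.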

Then we can calculate a 95% confidence interval by subtracting a margin from the sample mean and adding the same margin to it. First let's calculate the lower bound of the confidence interval by subtracting 1.96 * SEM from the sample mean. Remember that about 95% of sample means should fall within 1.96 * SEM of the population mean.
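
If you are wondering where 1.96 comes from, the sketch below looks it up using the normal distribution functions `qnorm` and `pnorm` that come with R. The variable names `z_95` and `coverage` are just for this illustration.

```r
# the value that leaves 2.5% in each tail of a standard normal distribution
z_95 = qnorm(0.975)
z_95

# proportion of a normal distribution that lies within 1.96 standard deviations of its centre
coverage = pnorm(1.96) - pnorm(-1.96)
coverage
```

The first value is very close to 1.96 and the second is very close to 0.95, which is why 1.96 goes with a 95% interval.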

In [None]:
#lower bound of the confidence interval
mean_lower_confidence = mean(sample_IQ_values) - 1.96 * SEM

#take a look
mean_lower_confidence

Then we can calculate the upper bound of the confidence interval by adding 1.96 * SEM to the mean.

In [None]:
#upper bound of the confidence interval
mean_upper_confidence = mean(sample_IQ_values) + 1.96 * SEM

#take a look
mean_upper_confidence

We can now say that a population mean between 99 and 131 is consistent with our sample. That is, we are pretty sure that the population mean is not 60 or 140. Rather, we think it falls somewhere between 99 and 131.
  
  >Note: these exact values are likely to differ each time you run the code as your random sample will be different! The main idea will remain the same though.
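
If you would like to get the same random sample every time, one option is to set a seed before sampling. This is a minimal sketch: the seed value 123 is arbitrary, and `first_draw` and `second_draw` are names made up for this example.

```r
# setting the same seed before sampling gives the same "random" draw
set.seed(123)
first_draw = sample(1:100000, size = 10)

set.seed(123)
second_draw = sample(1:100000, size = 10)

# check that the two draws are identical
identical(first_draw, second_draw)
```

Without the calls to `set.seed`, the two draws would almost certainly differ.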

Let's look at our confidence interval!

In [None]:
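#input the sample mean and the 95% CI
#(here we reuse the values calculated above; you can also type in the numbers you got)
sample_mean = mean(sample_IQ_values)
lower_CI = mean_lower_confidence
upper_CI = mean_upper_confidence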

#plot our confidence interval
plot(sample_mean, 1, xlim=c(80,150), xlab="IQ", ylab="sample")
arrows(lower_CI, 1, upper_CI, 1, length=0.05, angle=90, code=3) ## this part adds horizontal error bars

Above you should see a point and a line. The point is our sample mean, which is our best estimate of the population mean. The line is the error bar: it covers the range of population mean IQ values that are consistent with our sample at the 95% level. In other words, it shows the plausible values for the mean IQ of the population.

> In your case, does the 95% confidence interval contain the true population mean (i.e., 100)?
  

> Go back to the top and try to change the sample size to see if you can get the 95% confidence interval to fail! (i.e., where the true population mean is not in the 95% CI).
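
Rather than rerunning the cells by hand, you could let R repeat the experiment many times and count how often the interval contains the true mean. The sketch below assumes samples are drawn directly with `rnorm`, and `ci_contains_mean`, `n_repeats`, `hits` and `coverage` are hypothetical names used only here.

```r
# A sketch: how often does the 1.96 * SEM interval contain the population mean?
set.seed(2024)
pop_IQ_mean = 100
pop_IQ_sd = 25
n = 10              # sample size, try changing this
n_repeats = 2000    # number of repeated experiments (an arbitrary choice)

# returns TRUE when the 95% CI from one fresh sample contains the population mean
ci_contains_mean = function() {
  s = rnorm(n, mean = pop_IQ_mean, sd = pop_IQ_sd)
  SEM = sd(s) / sqrt(n)
  lower = mean(s) - 1.96 * SEM
  upper = mean(s) + 1.96 * SEM
  lower <= pop_IQ_mean && pop_IQ_mean <= upper
}

# repeat the experiment and calculate the proportion of intervals that worked
hits = replicate(n_repeats, ci_contains_mean())
coverage = mean(hits)
coverage
```

With samples of only 10 people you will likely see a proportion a little below 0.95, so the interval fails slightly more often than 5% of the time.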

*Disclaimer*: The 95% CI we calculated is not quite right. There are some nuances that can make a difference, especially for small samples (e.g., fewer than 30 observations). We'll leave out these nuances because right now we are interested in the idea that you can quantify your uncertainty about your sample using confidence intervals. If you'd like to learn more, check out [the great chapter here](https://www.crumplab.com/statistics/probability-sampling-and-estimation.html#the-central-limit-theorem). Later we'll see how we can use a function called **confint** to help us calculate proper 95% confidence intervals!
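
As a small peek ahead, one common way to handle the small-sample nuance is to replace 1.96 with a slightly larger value from the t distribution. The sketch below works under that assumption and is not necessarily how we will do it later. It uses `confint` on an intercept-only model fitted with `lm`, and the names `x`, `t_crit`, `manual_CI` and `model_CI` are made up for this example.

```r
# A sketch: a t-based 95% CI calculated by hand and with confint()
set.seed(1)
x = rnorm(10, mean = 100, sd = 25)   # a stand-in sample of 10 IQ values
n = length(x)
SEM = sd(x) / sqrt(n)

# the t value for a 95% interval with n - 1 degrees of freedom (larger than 1.96)
t_crit = qt(0.975, df = n - 1)
t_crit

# by hand: sample mean plus and minus t_crit * SEM
manual_CI = mean(x) + c(-1, 1) * t_crit * SEM
manual_CI

# with confint(): fit a model that only estimates the mean, then ask for its 95% CI
model_CI = confint(lm(x ~ 1), level = 0.95)
model_CI
```

The two intervals should agree, and both are a little wider than the interval built with 1.96.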