# Module 9: Inference About a Population Mean I

In this module, we discuss confidence intervals and hypothesis tests about a single parameter, the population mean. We start by assuming that the population standard deviation is known, and present the Z-confidence interval and Z-test. In the next module, we remove this assumption and present the T-confidence interval and T-test.

## Z-Confidence Intervals

The Z-confidence interval assumes that our data are normally distributed and that we know the population standard deviation. In practice this is rarely true, but it is a useful assumption for studying the basics of confidence intervals. 

All confidence intervals are of the form: (estimate) $\pm$ (margin of error). In the Z-confidence interval, the estimate is $\bar{X}$, the sample mean, and the margin of error is a critical value from the standard normal distribution times the standard deviation of $\bar{X}$, $\sigma/\sqrt{n}$. 
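
Written as a formula, the interval described above takes the following form. The symbol $z^{*}$ for the standard normal critical value is introduced here for illustration:

$$\bar{X} \pm z^{*} \frac{\sigma}{\sqrt{n}}$$

For a 95\% interval, $z^{*}$ is the 97.5th percentile of the standard normal distribution, which is roughly 1.96.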

Suppose that we have 10 observations from a normal distribution, and that we know the standard deviation of this normal distribution is three. Let's make a 95\% confidence interval for the mean of our normal distribution. Remember that we calculate the mean of a sample using the "mean()" function. Because the population standard deviation is known, we use it directly instead of estimating it from the sample with "sd()". We can also get a percentile from the normal distribution by using the "qnorm()" function.

In [None]:
data = c(4,3,6,6,3,5,6,7,7,3)
x.bar = mean(data) #Sample mean
sigma = 3 #Known population standard deviation
SD = sigma/sqrt(length(data)) #Standard deviation of x.bar
z.star = qnorm(0.975) #Standard normal critical value
MOE = z.star * SD #Margin of error
lcl = x.bar - MOE #Lower endpoint of our confidence interval
ucl = x.bar + MOE #Upper endpoint of our confidence interval
print(lcl)
print(ucl)

The lower and upper endpoints of a confidence interval are sometimes called the lower confidence limit and upper confidence limit respectively.
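
The critical value in the example uses 0.975 and not 0.95 because a 95\% interval leaves 2.5\% of the probability in each tail. The short sketch below illustrates this; the variable names `conf.level` and `central.prob` are chosen here for illustration only.

```r
conf.level = 0.95 #Desired confidence level
z.star = qnorm(1 - (1 - conf.level)/2) #97.5th percentile of the standard normal
print(z.star) #Approximately 1.96

#Check: the standard normal puts 95% of its probability between -z.star and z.star
central.prob = pnorm(z.star) - pnorm(-z.star)
print(central.prob)
```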

The sample we used in this example was generated from a normal distribution with standard deviation three and mean five, and our confidence interval does contain the true mean of the distribution. This will not always happen: if we generate new data and calculate a new 95\% confidence interval many times, about 5\% of those intervals will not contain the true mean.
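
One way to see this is to simulate it. The sketch below draws many samples from a normal distribution with mean five and standard deviation three, builds a 95\% Z-confidence interval from each, and reports the proportion of intervals that contain the true mean. The seed and the number of simulations are arbitrary choices made for this illustration.

```r
set.seed(1) #Arbitrary seed so the simulation is reproducible
n.sims = 10000 #Number of simulated samples (arbitrary choice)
n = 10 #Observations per sample
mu = 5 #True population mean
sigma = 3 #Known population standard deviation
MOE = qnorm(0.975) * sigma/sqrt(n) #Margin of error, the same for every sample

#Sample mean of each simulated data set
x.bars = replicate(n.sims, mean(rnorm(n, mean=mu, sd=sigma)))

#TRUE when the interval x.bar +/- MOE contains the true mean
covered = (x.bars - MOE <= mu) & (mu <= x.bars + MOE)
print(mean(covered)) #Proportion of intervals containing mu; should be near 0.95
```

The printed proportion should be close to, but generally not exactly, 0.95, and it will change slightly with a different seed.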

## Z-Tests

We now move on to hypothesis testing. The Z-test assumes that our data are normally distributed, and that we know the population standard deviation. We start with a null hypothesis about the population mean, such as "the population mean is equal to one". We also need an alternative hypothesis, such as "the population mean is greater than one". 

All hypothesis tests are based on a test statistic. Many of these test statistics are of the form: (estimate minus hypothesised value) divided by the standard error of the estimate. The test statistic in the Z-test is called the Z-statistic. It uses the sample mean, $\bar{X}$, as an estimate, and the true standard deviation of $\bar{X}$ in the denominator instead of a standard error. We can use the true standard deviation of $\bar{X}$ because we know the population standard deviation of our data.
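
As a formula, the Z-statistic described above can be written as follows, where $\mu_0$ (a symbol introduced here for illustration) denotes the hypothesised value of the population mean:

$$Z = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}}$$

If the null hypothesis is true and the assumptions of the Z-test hold, $Z$ follows a standard normal distribution, which is why the p-value is calculated from the standard normal.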

Once we have the Z-statistic, we calculate a p-value based on this statistic, and compare that p-value to the significance level of our test, $\alpha$. If the p-value is less than $\alpha$, we reject our null hypothesis in favor of the alternative hypothesis. If the p-value is greater than $\alpha$ then we do not reject the null hypothesis (which is not the same as accepting it).

Suppose that we have another 10 observations from a normal distribution, but this time we know that the population standard deviation is 20. Let's test the null hypothesis that the population mean is two, against the alternative hypothesis that the population mean is greater than two. Let's use $\alpha=0.05$ as our significance level.

In [None]:
data = c(-1, 17, 19, 5, 24, 42, 14, 17, 13, -4)
x.bar = mean(data) #Sample mean
mu = 2 #Hypothesised population mean
SD = 20/sqrt(10) #Standard deviation of x.bar
Z = (x.bar - mu)/SD #Z-statistic
p.value = pnorm(Z, lower.tail=FALSE) #Probability of being greater than Z
print(p.value)

Our p-value is less than $\alpha$, so we reject the null hypothesis and conclude that the population mean is greater than two.

We calculated our p-value as the probability that a standard normal is **greater than** our observed Z-statistic. This is because our alternative says that the mean is **greater than** two.

For comparison, let's do a test with the same data and null hypothesis, but this time our alternative will be that the population mean is less than two.

In [None]:
data = c(-1, 17, 19, 5, 24, 42, 14, 17, 13, -4)
x.bar = mean(data) #Sample mean
mu = 2 #Hypothesised population mean
SD = 20/sqrt(10) #Standard deviation of x.bar
Z = (x.bar - mu)/SD #Z-statistic
p.value = pnorm(Z, lower.tail=TRUE) #Probability of being less than Z
print(p.value)

This time, our p-value is much greater than $\alpha$, so we do not reject our null hypothesis. This p-value is the probability of being **less than** our Z-statistic because our alternative hypothesis says that the population mean is **less than** two.

Both of the alternative hypotheses that we have looked at so far are called one-sided because they say that the population mean falls on one side of the null hypothesis value. The last type of alternative hypothesis that we can have is called two-sided. Two-sided hypotheses say that the population mean is not equal to the null hypothesis value (that is, that the population mean can fall on either side of the null hypothesis value).

Let's use the same data and null hypothesis from the last two examples, but this time we will use the alternative hypothesis that the population mean is not equal to two.

In [None]:
data = c(-1, 17, 19, 5, 24, 42, 14, 17, 13, -4)
x.bar = mean(data) #Sample mean
mu = 2 #Hypothesised population mean
SD = 20/sqrt(10) #Standard deviation of x.bar
Z = (x.bar - mu)/SD #Z-statistic
p.value = 2*pnorm(abs(Z), lower.tail=FALSE) #Double the probability of being
                                            #greater than the absolute value of Z
print(p.value)

This p-value is only slightly less than $\alpha$, so we reject our null hypothesis, but we might want to investigate further before relying on this conclusion.

Notice that the way we calculated our p-value in this example was a bit different from the other two. First, we made our Z-statistic positive by taking its absolute value. This makes sure that we are calculating the probability of a standard normal being farther from zero than our observed Z-statistic, which corresponds to a sample mean farther from the hypothesised value than the one we observed. Second, we doubled the probability that we calculated. This is because, in a two-sided test, sample means far above and far below the hypothesised value both count as evidence against the null hypothesis. Doubling our probability ensures that we have accounted for both of these possibilities.
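
The three p-value calculations above differ only in which tail of the standard normal they use, so they can be collected into one function. The sketch below is a minimal illustration; `z.test.pvalue` is a hypothetical helper defined here, not a function provided by base R.

```r
#Hypothetical helper: p-value of a Z-test for each type of alternative
z.test.pvalue = function(data, mu, sigma, alternative) {
  Z = (mean(data) - mu)/(sigma/sqrt(length(data))) #Z-statistic
  if (alternative == "greater") {
    pnorm(Z, lower.tail=FALSE) #Probability of being greater than Z
  } else if (alternative == "less") {
    pnorm(Z, lower.tail=TRUE) #Probability of being less than Z
  } else if (alternative == "two.sided") {
    2*pnorm(abs(Z), lower.tail=FALSE) #Both tails
  } else {
    stop("alternative must be 'greater', 'less' or 'two.sided'")
  }
}

data = c(-1, 17, 19, 5, 24, 42, 14, 17, 13, -4)
p.greater = z.test.pvalue(data, mu=2, sigma=20, alternative="greater")
p.less = z.test.pvalue(data, mu=2, sigma=20, alternative="less")
p.two.sided = z.test.pvalue(data, mu=2, sigma=20, alternative="two.sided")
print(c(p.greater, p.less, p.two.sided))
```

Because the observed Z-statistic is positive for these data, the two-sided p-value is exactly twice the "greater" p-value, and the "greater" and "less" p-values add up to one.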