In [1]:
import saspy
my_session = saspy.SASsession()

Using SAS Config named: oda
Pandas module not available. Setting results to HTML
SAS Connection established. Subprocess id is 225081



# Statistical inference
We have two methods to work with statistical inference: **estimation** and **hypothesis testing**.

## Estimation

There are two ways in which we can estimate the value of a population parameter. 
1. **Point estimate**: a single number that is our best guess for the population parameter.

Suppose we are wondering how much sleep new parents in Amsterdam lost after they had their first baby. Let's say that new parents are those who had a baby within the last six months. We draw a simple random sample of $n=60$ new parents and ask them how many fewer hours per night they sleep than before they had a baby.

Let's assume that the mean number of hours that the 60 respondents in the sample slept less after they had their first baby is 2.6 ($\bar{x} = 2.6$). This means that a good point estimate for the mean number of lost sleeping hours in the population is 2.6. In other words, the statistic $\bar{x}$, which in our case is 2.6 hours, is a good point estimate for the parameter $\mu$.

However, one individual point estimate does not tell us if we're close to the population parameter we're interested in. Therefore, we'll want to know the likely precision of the point estimate. This precision is computed using an interval estimate. 

2. **Interval estimate**: a range of values within which we expect the population parameter to fall.  

On the basis of a sample mean of $\bar{x}=2.6$ hours, we might predict that the mean lost sleeping hours of all new parents in Amsterdam, $\mu$, lies somewhere between 2.3 and 2.9 hours. The probability that the interval contains the population value is called the **confidence level**. The confidence level always has a value close to 1; in most cases, it is 0.95, so we talk about a "95% confidence interval". 

### Confidence intervals for mean with **known** population standard deviation ($\sigma$)

Suppose we ask new parents how much sleep they lose per night after having a new baby. We find that the mean number of lost sleeping hours per night is $\bar{x} = 2.6$ hours and the _sample_ standard deviation is $s=0.9$ hours. Suppose we also know the _population_ standard deviation, $\sigma = 1.1$ hours. In practice, it is very unlikely that we'll know $\sigma$.

We can construct a confidence interval based on the information from our sample ($\bar{x}$, $s$) and the population standard deviation ($\sigma$). As long as our sample is sufficiently large, the sampling distribution of the sample mean is approximately normal with 

$$
\mu_{\bar{x}} = \mu \\
\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}
$$

For 95% confidence, we look up the z-score that leaves a probability of 0.025 in each tail and find that $z=\pm1.96$.

<center><img src="z_table.png" style="width:400px"/></center>


Therefore, the 95% confidence interval is given by

$$
\text{confidence interval} = \bar{x} \pm 1.96\sigma_{\bar{x}} \\
\text{where } \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} \\
\sigma_{\bar{x}} = \frac{1.1}{\sqrt{60}} \approx 0.142 \\
\text{margin of error} = 1.96\times0.142 \approx 0.28 \\
CI = 2.6 \pm 0.28
$$

Therefore, we have 95% confidence that (2.32, 2.88) contains the actual population mean. If we were to draw an infinite number of samples with $n=60$ from our population, and if we computed the confidence interval for every sample with this margin of error, in 95% of the samples the population value will fall within the confidence interval. 
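
The calculation above can be reproduced in a few lines of Python. This is a minimal sketch using only the standard library; the variable names are illustrative, and `statistics.NormalDist` stands in for the z-table lookup.

```python
from math import sqrt
from statistics import NormalDist

x_bar = 2.6        # sample mean (hours of lost sleep)
sigma = 1.1        # known population standard deviation
n = 60             # sample size
confidence = 0.95

# z-score that leaves (1 - confidence) / 2 in each tail (about 1.96 for 95%)
z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)

se = sigma / sqrt(n)              # standard deviation of the sample mean
margin_of_error = z * se
ci_known_sigma = (x_bar - margin_of_error, x_bar + margin_of_error)

print(f"z = {z:.2f}, margin of error = {margin_of_error:.2f}")
print(f"95% CI = ({ci_known_sigma[0]:.2f}, {ci_known_sigma[1]:.2f})")
```

Changing `confidence` to, say, 0.99 widens the interval because the z-score grows while the standard error stays the same.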

<center><img src="confidence_interval.png" style="width:400px"/></center>

### Confidence intervals for mean with **un**known population standard deviation ($\sigma$)

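Since we usually do not know the population standard deviation, we need to estimate it with the sample standard deviation, $s$. To account for the extra uncertainty this introduces, we use the **t-distribution** instead of the z-distribution.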

Suppose we ask $n=60$ new parents how much sleep they lose per night after having a new baby. We find that the mean number of lost sleeping hours per night is $\bar{x} = 2.6$ hours and the _sample_ standard deviation is $s=0.9$ hours. 

To construct a 95% confidence interval, we estimate the population standard deviation with the sample standard deviation and use the formula

$$
\bar{x} \pm t_{95\%}(se) \\
\text{where } se = \frac{s}{\sqrt{n}}
$$

where $se$ is the **standard error**. It is the estimated standard deviation of the sampling distribution of the sample mean. $s$ is the sample standard deviation. 

In our example,

$$
se = \frac{0.9}{\sqrt{60}} \approx 0.116 \\
CI = 2.6 \pm (2.00)(0.116) \\
= 2.6 \pm 0.23 \\
CI = (2.37, 2.83)
$$
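
As a quick check of the arithmetic, the same interval can be computed in Python. This sketch takes the critical value $t = 2.00$ directly from the example rather than computing it, because the standard library has no t-distribution quantile function; the variable names are illustrative.

```python
from math import sqrt

x_bar = 2.6     # sample mean
s = 0.9         # sample standard deviation
n = 60          # sample size
t_crit = 2.00   # 95% critical value with df = n - 1 = 59, taken from the example

se_mean = s / sqrt(n)        # estimated standard error of the sample mean
moe_t = t_crit * se_mean     # margin of error
ci_t = (x_bar - moe_t, x_bar + moe_t)

print(f"se = {se_mean:.3f}")
print(f"95% CI = ({ci_t[0]:.2f}, {ci_t[1]:.2f})")
```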

To obtain a confidence interval for a population mean, two assumptions must be satisfied.
1. Your data should be obtained by randomization.
2. Your population should be approximately normally distributed. 

### Confidence interval for proportions

For a large sample, the sampling distribution of the sample proportion, $\hat{p}$, is approximately normal with a mean that is equal to the population proportion, $p$. 

$$
\mu_{\hat{p}} = p \\
\sigma_{proportion} = \sqrt{\frac{p(1-p)}{n}}
$$

The z-score that corresponds to the 95% confidence interval is 1.96. Therefore,   

$$
\hat{p} \pm z_{95\%}\sigma_{proportion} \\
\hat{p} \pm 1.96\sigma_{proportion} \\
\text{where } \sigma_{proportion} = \sqrt{\frac{p(1-p)}{n}}
$$

Since we do not know the population parameter $p$, we can substitute an estimate: 

$$
\hat{p} \pm z_{95\%}(se) \\
\text{where } se = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}
$$

Our assumption is that $n\hat{p} \ge 10$ and $n(1-\hat{p})\ge10$.
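
The formula and its large-sample check can be wrapped in a small helper. This is a sketch only: the function name `proportion_ci` and the data (120 successes out of 400) are hypothetical and do not come from the notes.

```python
from math import sqrt
from statistics import NormalDist

def proportion_ci(successes, n, confidence=0.95):
    """Large-sample (Wald) confidence interval for a population proportion."""
    p_hat = successes / n
    # Large-sample check: n * p_hat >= 10 and n * (1 - p_hat) >= 10
    if n * p_hat < 10 or n * (1 - p_hat) < 10:
        raise ValueError("sample too small for the normal approximation")
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    se = sqrt(p_hat * (1 - p_hat) / n)   # estimated standard error
    return p_hat - z * se, p_hat + z * se

# Hypothetical data: 120 "successes" in a sample of 400
low, high = proportion_ci(120, 400)
print(f"95% CI for p: ({low:.3f}, {high:.3f})")
```

Raising an error when the check fails is one possible design choice; a gentler version could return the interval together with a warning.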

### Step-by-step plan for confidence intervals
1. Decide a confidence level
2. Proportion or mean? Proportion = z-distribution, mean = t-distribution (requires computing degrees of freedom, $df = n-1$).
3. Compute the endpoints of the confidence interval. 
4. Interpret the results. 

We often report an estimate and its **standard error (SE)**. Standard error is the standard deviation of a _statistic_ (rather than of the population). For a random sample, the standard error of the sample mean (SEM) is 

$$
SE(\bar{Y}) = \frac{\sigma}{\sqrt{n}} \\
$$

Since we do not know the standard deviation of the _population_ ($\sigma$), we can estimate the standard error of the sample mean as

$$
\hat{SE}(\bar{Y}) = \frac{s}{\sqrt{n}} \\
$$

If we are observing the wingspan of 25 butterflies with a sample mean of 3.53 and a sample standard deviation of 0.17, then

$$
SEM = \frac{0.17}{\sqrt{25}} \approx 0.034 \\
$$

**Q.** What do we mean by the term "estimation"?

**A.** Attempting to use data to give a value (or range of values) for a parameter.

**Q.** Why do we need to report more than a point estimate when using data to describe a parameter?

**A.** A point estimate alone does not give an idea of variability. We usually provide a standard error or a confidence interval as well.

**Q.** What does the standard error of a statistic measure?

**A.** Variability (standard deviation) in the statistic.

## Frequentist estimation via method of moments and maximum likelihood (the math behind estimates)

A point _estimate_ (e.g. the actual number $\bar{y}$ or $s$) is an observed value of a point _estimator_ (a random variable, such as $\bar{Y}$ or $S$, that we will use to make an inference). $\bar{Y}$, whose sampling distribution is often approximately normal, is a common estimator of the population mean, $\mu$. Good estimators have several key properties:
* If we take many different samples and look at the estimator across all of those samples, the average of the estimators should give the true population value (an "unbiased" estimator)
* The estimator should have as little variation as possible from dataset to dataset (a small standard error, SE)
* The bigger the sample, the closer the estimator is to the true population value ("consistent" estimator)

A point estimator is a random quantity that has yet to be observed. For instance, consider the sample mean of the amount of debt of 100 randomly selected people before the sample is taken. The quantity isn't observed yet and is random: it has a distribution, mean, standard error, etc. We usually denote estimators with uppercase letters.

A point estimate is a fixed quantity: the observed value of the estimator. For instance, suppose we observed the 100 randomly selected people and their sample mean amount of debt was 175 thousand. This quantity is fixed and known. We usually denote estimates with lowercase letters.

**Q.** What do we mean by the term "consistent estimator"?

**A.** An estimator that is observed arbitrarily close to the parameter as the sample size increases is a consistent estimator.  This just implies that our estimator eventually is observed right at the truth.

Note: If the bias of the estimator is 0 or goes away as the sample size grows, and the standard error shrinks toward 0 as the sample size grows, the estimator will be consistent! This should make some sense: bias tells us where the estimator is observed on average, and standard error tells us how variable our estimator is. If on average we take on the true value and our variation decreases towards 0, we must be observing closer and closer to the truth!

**Q.** What is the basic idea of the Method of Moments estimation procedure?

**A.** We take sample moments (such as $\bar{Y}$) and set them equal to population moments (such as $E(Y)$) and solve the equations for the parameters.

#### Two common methods for creating an estimator

We hope our estimators have nice properties
* Unbiased
* Low variance/standard error
* Consistency
* Easy-to-use sampling distributions

**Method of Moments (MOM)**

If we have a random sample, MOM sets the sample averages equal to the population averages to create estimators. On the upside, MOM makes finding estimators easy and MOM estimators are consistent; on the downside, the distribution of the created estimator can be difficult to determine and MOM estimators are not always unbiased (for example, the gamma MOM estimators below do not equal the true parameters on average in finite samples). 

_Gamma Distribution Example_: We have a gamma distribution $Y\sim Gamma(\alpha,\beta)$ where $Y$ is the time spent reading a news article. For gamma distributions,

$$
\mu = E(Y) = \frac{\alpha}{\beta}\\
\sigma^2 = Var(Y) = \frac{\alpha}{\beta^2} \\
E(Y^2) = \frac{\alpha(\alpha+1)}{\beta^2}
$$

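where $E(Y)$ and $E(Y^2)$ are the population (raw) _moments_. We can observe a random sample of times spent reading a news article (a random sample of $Y$s). The Law of Large Numbers states that, for large sample sizes, the sample averages should be very close to the population averages. Setting $\bar{Y} = \frac{\alpha}{\beta}$ and $\frac{1}{n}\sum_{i=1}^{n} Y_{i}^2 = \frac{\alpha(\alpha+1)}{\beta^2}$ and solving for $\alpha$ and $\beta$ gives the MOM estimators (with the observed estimates on the right). That is,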

$$
\hat{\alpha} = \frac{\bar{Y}^2}{\frac{\sum_{i=1}^{n} Y_{i}^2}{n} - \bar{Y}^2} \approx\frac{\bar{y}^2}{\frac{\sum_{i=1}^{n} y_{i}^2}{n} - \bar{y}^2} \\ \text{ and }\\
\hat{\beta} = \frac{\bar{Y}}{\frac{\sum_{i=1}^{n} Y_{i}^2}{n} - \bar{Y}^2} \approx \frac{\bar{y}}{\frac{\sum_{i=1}^{n} y_{i}^2}{n} - \bar{y}^2}
$$
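
The two estimators translate directly into code. The sketch below is illustrative only: the helper name `gamma_mom`, the simulated reading times, and the "true" values $\alpha = 2$, $\beta = 0.5$ are assumptions made for the example.

```python
import random

def gamma_mom(sample):
    """Method-of-moments estimates (alpha_hat, beta_hat) for Gamma(alpha, rate beta)."""
    n = len(sample)
    m1 = sum(sample) / n                   # first sample moment, Y-bar
    m2 = sum(y * y for y in sample) / n    # second sample moment
    spread = m2 - m1 ** 2                  # estimates alpha / beta^2
    return m1 ** 2 / spread, m1 / spread

# Simulated reading times with assumed true values alpha = 2, beta = 0.5.
# random.gammavariate takes (shape, scale), and scale = 1 / rate.
rng = random.Random(42)
true_alpha, true_beta = 2.0, 0.5
times = [rng.gammavariate(true_alpha, 1 / true_beta) for _ in range(50_000)]

alpha_hat, beta_hat = gamma_mom(times)
print(f"alpha_hat = {alpha_hat:.3f}, beta_hat = {beta_hat:.3f}")
```

With a sample this large the estimates land close to the assumed values, which is the consistency property at work.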

_Binomial/Bernoulli MOM example_: Our population is all of the customers at a bank and each customer follows a Bernoulli distribution. Our parameter $p$ is the proportion of customers willing to open an additional account. We observe 40 random customers. We can define $X_{i}=1$ if a customer $i$ opens an account, where $X_{i} \sim ^{iid} Ber(p)$. We can also define $Y$ as the number of customers opening an account, where 

$$
Y = \sum_{i=1}^{n} X_{i}\sim Bin(n,p)
$$

What is the MOM estimator of $p$?

MOM tells us to set the sample average to the population average.

$$
X \sim Ber(p) \\
E(X) = p \\
Var(X) = p(1-p)
$$

Here, $\bar{X} \approx p$ and 

$$
\text{Sample proportion} = \hat{p} = \frac{Y}{n}
$$

This estimator is unbiased (on average, it gives us $p$) and consistent and

$$
\text{Standard error} = SE(\hat{p}) = \sqrt{\frac{p(1-p)}{n}}
$$

Moreover, $\hat{p}$ can be approximated using a normal distribution so that

$$
\hat{p} \sim N(p,\sqrt{\frac{p(1-p)}{n}})
$$

**Maximum Likelihood (ML)**

ML uses the assumed curve (distribution) to find the "most likely" values of the parameters to produce the data we see. Mathematically, ML estimators are more difficult to derive, but they are generally consistent and the distribution of an MLE can often be approximated with a normal curve.

_Exponential distribution example_: The exponential distribution is a special case of the gamma with $\alpha = 1$. All we need to estimate is the $\beta$. If $Y$ is the time spent reading a news article, then 

$$
Y \sim Gamma(1, \beta) = Exp(\beta) \\
f(y) = \beta e^{-\beta y}, y \gt 0 \\
\mu = E(Y) = \frac{1}{\beta} \\
\sigma^2 = Var(Y) = \frac{1}{\beta^2}
$$

The PDF tells us "give me a $\beta$ and I will tell you how likely each observed value $y$ is". The likelihood reverses that and tells us "give me an observed value $y$ and I will tell you how plausible each $\beta$ is". 

$$
f(y | \beta) = \beta e^{-\beta y}, y \gt 0, \beta \gt 0 \\
L(\beta | y) = \beta e^{-\beta y}, y \gt 0, \beta \gt 0
$$

<div class="alert alert-block alert-success">
<b>Tip:</b> For generic $y$ value, the MLE for the exponential distribution is 

$$
\hat{\beta} = \frac{1}{y}
$$

That is, the most likely value of $\beta$ to produce $y$ is 1/$y$. For a random sample of $Y$s, $\hat{\beta}_{MLE} = \frac{1}{\bar{Y}}$
</div>

For a random sample of $Y$s, 

$$
Y_{i} \sim ^{iid} Exp(\beta), i=1,...,n
$$

We can find the joint distribution as follows

$$
A \text{ and } B \text{ independent} \implies P(A \cap B) = P(A)P(B)\\
f(y_1, y_2,...,y_n) = f(y_1)f(y_2)...f(y_n) \\
=\beta e^{-\beta y_1}\beta e^{-\beta y_2}...\beta e^{-\beta y_n} \\
= \beta^n e^{-\beta \sum_{i=1}^{n} y_{i}}
$$
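
Viewed as a function of $\beta$, this joint density is the likelihood, and its logarithm is $n\log\beta - \beta\sum y_i$, which is maximized at $\hat{\beta} = 1/\bar{y}$. The sketch below checks that closed form against a crude grid search; the reading times are hypothetical and the function name is illustrative.

```python
from math import log

def exp_log_likelihood(beta, ys):
    """Log of the joint density: n * log(beta) - beta * sum(ys)."""
    return len(ys) * log(beta) - beta * sum(ys)

# Hypothetical reading times in minutes
reading_times = [1.2, 0.4, 3.1, 2.2, 0.9, 1.7]

# Closed-form maximizer: beta_hat = n / sum(y) = 1 / y-bar
beta_mle = len(reading_times) / sum(reading_times)

# Crude grid search over beta in (0, 3] as a sanity check on the closed form
grid = [i / 1000 for i in range(1, 3001)]
beta_grid = max(grid, key=lambda b: exp_log_likelihood(b, reading_times))

print(f"closed form: {beta_mle:.4f}, grid search: {beta_grid:.3f}")
```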

**Q.** What is the difference between a joint distribution (say $f(y_1,...,y_n|parameters)$) and a likelihood (say $L(parameters|y_1,...,y_n)$)?

**A.** With a joint distribution, we consider the parameters to be known and the y’s can be varied (for instance in finding $P(Y_{1} \gt 10, Y_{2} \gt 5)$). With a likelihood, we assume the data is known and that the parameters can be varied. This allows us to maximize the function with respect to the parameter value(s). The value of the parameter that corresponds to this maximum is called the maximum likelihood estimate. 

#### Common ML and MOM comparison

<center><img src="mom_ml_estimators.png" style="width:600px"/></center>

## Central Limit Theorem and Confidence Intervals

A sampling distribution is the pattern and frequency of a statistic/estimator. For a random sample (iid) of size $n$ from a population with mean $\mu$ and variance $\sigma^2$, a good approximation to the distribution of the sample mean in a "large" sample is a normal distribution where

$$
\bar{Y} \sim N(\mu, \frac{\sigma}{\sqrt{n}}) \\
\text{with} \\
Z = \frac{\bar{Y} - \mu}{\frac{\sigma}{\sqrt{n}}}
$$
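
A small simulation makes this approximation concrete. The sketch below assumes an $Exp(1)$ population (so $\mu = 1$ and $\sigma = 1$), which is clearly not normal, and checks that the sample means still center on $\mu$ with spread close to $\sigma/\sqrt{n}$. The population, sample size, and number of repetitions are all illustrative choices.

```python
import random
from math import sqrt
from statistics import mean, stdev

rng = random.Random(0)
n = 50          # size of each sample
reps = 5_000    # number of repeated samples
rate = 1.0      # Exp(1) population: mu = 1, sigma = 1 (assumed example)

# Draw many samples and record each sample mean
sample_means = [
    mean([rng.expovariate(rate) for _ in range(n)])
    for _ in range(reps)
]

clt_mean = mean(sample_means)   # should be close to mu = 1
clt_sd = stdev(sample_means)    # should be close to sigma / sqrt(n)

print(f"mean of sample means = {clt_mean:.3f}")
print(f"sd of sample means   = {clt_sd:.3f} (theory: {1 / sqrt(n):.3f})")
```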

A confidence interval is a range of plausible values for the true parameter, built by a procedure that captures the parameter a stated percentage of the time.
* Often use 95% confidence
  * For 100 samples, about 95 intervals would contain the population mean

Probability statements about a Binomial can be approximated using either

$$
\hat{p} \sim N(p, \sqrt{\frac{p(1-p)}{n}}) \\
\text{with} \\
Z = \frac{\hat{p} - p}{\sqrt{\frac{p(1-p)}{n}}}
$$

or via

$$
Y \sim N(np, \sqrt{np(1-p)}) \\
\text{with} \\
Z = \frac{Y-np}{\sqrt{np(1-p)}}
$$

#### Continuity correction

When is a continuity correction useful?

Anytime we approximate the distribution of something discrete with a continuous distribution, we can utilize a continuity correction to improve our probability approximations. 

We saw the continuity correction applied to the Binomial and Poisson (both discrete distributions).
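
To see the effect of the correction, the sketch below compares an exact Binomial probability with the normal approximation, with and without adding 0.5. The numbers ($n = 40$, $p = 0.3$, $k = 10$) are hypothetical.

```python
from math import comb, sqrt
from statistics import NormalDist

n, p, k = 40, 0.3, 10   # hypothetical: P(Y <= 10) for Y ~ Bin(40, 0.3)

# Exact binomial probability, summing the pmf from 0 to k
exact = sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k + 1))

# Normal approximation Y ~ N(np, sqrt(np(1-p)))
approx_dist = NormalDist(mu=n * p, sigma=sqrt(n * p * (1 - p)))
plain = approx_dist.cdf(k)             # no correction
corrected = approx_dist.cdf(k + 0.5)   # continuity correction: evaluate at k + 0.5

print(f"exact = {exact:.4f}, plain = {plain:.4f}, corrected = {corrected:.4f}")
```

The corrected value sits closer to the exact probability because the discrete bar at $k$ extends to $k + 0.5$ on the continuous scale.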

#### Confidence

Confidence refers to how much we believe in our procedure. That is, if we are 95% confident, that means that our procedure will produce a confidence interval that captures the truth 95% of the time.

Said another way, if we were to repeatedly sample people and find a confidence interval for the average amount of debt people have, 95% of the intervals we created would capture the truth.
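
The repeated-sampling idea can be simulated. This sketch assumes a normal population of debts with mean 175 (thousand) and standard deviation 40; both values, and the use of a known-$\sigma$ z interval, are assumptions made for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist, mean

rng = random.Random(1)
mu, sigma = 175.0, 40.0    # assumed "true" mean and SD of debt, in thousands
n, reps = 100, 2_000       # sample size and number of repeated samples

z = NormalDist().inv_cdf(0.975)
moe = z * sigma / sqrt(n)  # margin of error of the z interval

hits = 0
for _ in range(reps):
    sample = [rng.gauss(mu, sigma) for _ in range(n)]
    x_bar = mean(sample)
    # Does this sample's interval capture the true mean?
    if x_bar - moe <= mu <= x_bar + moe:
        hits += 1

coverage = hits / reps
print(f"share of intervals capturing mu: {coverage:.3f}")
```

Each individual interval either contains $\mu$ or it does not; the 95% describes the long-run share of intervals that do.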

**Q.**: In a confidence interval of the form point estimate +/- MOE, the Margin of Error (MOE) is determined by which of the following?

**A.**: (estimated) standard error of the estimator, level of confidence we want (1-$\alpha$)100%, sampling distribution


In [None]:
%%SAS my_session
PROC FREQ Data=Color;
  TABLES EYES/BINOMIAL(Wald);
RUN;

### Common CIs

#### For a single mean
<center><img src="common_ci_single_mean.png" style="width:800px"/></center>

Where $\bar{Y}$ is the _sample_ average and $\mu$ is the _population_ average. 

A normal distribution is described by $\mu$ and $\sigma$. A t-distribution is described by degrees of freedom, calculated as sample size $n$ minus one ($t_{n-1}$, the last row in the chart above). 

**Q.**: Suppose we want to make inference for a population mean.  When should we use the “z” interval vs the “t” interval?  (i.e. when should we use the one-sample z interval instead of the one-sample t interval.)

**A.**: The t interval should be used when $\sigma$ is unknown (estimated by $s$) and your population is roughly normally distributed.  A z interval should be used when $\sigma$ is known or when you have a ‘large’ sample size. 
Note: The t interval and z interval are basically the same for sample sizes above 40 or 50 or so!

**Q.**: What is meant by paired data?  Why do we have to treat paired data differently than the previous “two-sample” case?

**A.**: Data that consists of two measurements on the same (or very similar/matched) units is paired data.  Since the observations are made on very similar units we can’t assume the two observations are independent.

**Q.**: How are the paired t-interval and the one-sample t-interval related?

**A.**: The paired t-interval is equivalent to doing a one-sample t-interval on the differences of the paired data.  Both assume normality (either of the differences or of the single sample, respectively).
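
This equivalence can be sketched in Python with the systolic blood pressure pairs that appear in the `pressure` data set of this notebook: the paired analysis reduces to one-sample statistics on the differences. The variable names are illustrative, and the differences are taken as before minus after.

```python
from math import sqrt
from statistics import mean, stdev

# Systolic blood pressure pairs from the `pressure` data set in this notebook
before = [120, 124, 130, 118, 140, 128, 140, 135, 126, 130, 126, 127]
after = [128, 131, 131, 127, 132, 125, 141, 137, 118, 132, 129, 135]

# Reduce the paired data to a single sample of differences
diffs = [b - a for b, a in zip(before, after)]

d_bar = mean(diffs)               # mean difference
s_d = stdev(diffs)                # sample SD of the differences
se_d = s_d / sqrt(len(diffs))     # standard error of the mean difference
t_stat = d_bar / se_d             # one-sample t statistic for H0: mean difference = 0
df = len(diffs) - 1               # degrees of freedom

print(f"mean diff = {d_bar:.4f}, se = {se_d:.4f}, t = {t_stat:.2f}, df = {df}")
```

A paired t-interval is then `d_bar` plus or minus the appropriate $t_{n-1}$ critical value times `se_d`.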

### T-test in SAS
Consider data on the length of court cases.
* Population = all court cases in some district
* Sample = 20 cases (variable days)
* Make inference about the average length of all court cases, $\mu$
* Create a 99% confidence interval for $\mu$ using ALPHA = 0.01

In [None]:
%%SAS my_session

PROC TTEST DATA = cases ALPHA = 0.01 PLOTS = all;
  VAR days;
RUN;

# Bayesian vs. Frequentist perspectives

Previous estimations and confidence intervals fall under the "Frequentist" viewpoint, which assumes a fixed situation and treats probability as a "long-run relative frequency", where $P(A) = \frac{\text{number of times A occurs}}{\text{total number of repeated trials}}$. 

## Bayesian paradigm

Combines prior belief with observed data to create an updated (posterior) belief. 

**Q.**: What is the biggest difference(s) between the Frequentist and Bayesian viewpoints?

**A.**: 
* The Frequentist paradigm assumes the parameters are fixed and you have repeatable situations.  Our goal is usually to find a confidence interval, do a hypothesis test, or predict.

* The Bayesian paradigm assumes the parameters are random and we quantify our belief through probability distributions.  Our goal is to find the posterior distribution and use that to inform decisions.

## Bayes' Theorem

Recall that, given a condition B, our probability of A is 

$$
P(A|B) = \frac{P(A \cap B)}{P(B)}
$$

According to the multiplication law,

$$
P(A \cap B) = P(A|B)P(B) \\
or \\
P(A \cap B) = P(B|A)P(A)
$$ 

Bayes' Theorem simply replaces the numerator in the conditional probability.

$$
\text{Bayes' Theorem: } P(A|B) = \frac{P(B|A)P(A)}{P(B)}
$$

**Q.**: When we say events make up a partition, what does that mean?

**A.**: The events are all disjoint and, in total, make up the entire space of events.

**Q.**: What is meant by a prior distribution?

**A.**: A distribution that specifies our prior belief about our parameters

**Q.**: What is meant by our likelihood?

**A.**: A distribution that models our data

**Q.**: What is meant by a posterior distribution?

**A.**: A distribution that contains our updated belief about our parameter(s) after seeing the data.


### Problem Set 7

Answer the following questions:

**1. What is the Frequentist viewpoint on statistics?**

The Frequentist paradigm assumes the parameters are true, fixed unknowns that exist in the world, and what we are trying to do is estimate them. It also assumes that situations are repeatable.  Our goal is usually to find a confidence interval, do a hypothesis test, or predict.

The Bayesian viewpoint treats things as more "fluid". Instead of thinking about "fixed" situations, we still have parameters but they are assumed to be random and to have their own distributions. When we actually do observe data, we're only observing it from one particular realization of the parameter in play.   

Often, Frequentist and Bayesian approaches produce similar rough results. Frequentist measurements can be assessed with "confidence", Bayesian measurements can actually use the word "probability". 

Frequentist
- Fixed, unknown parameters.
- Repeatable situations (e.g. probability is the "long-run" relative frequency, the distribution of a statistic in "repeated" trials).
- Confidence intervals assume that if we did the same exact experiment, in the same exact way, from the same exact population, over and over again, 95% of our intervals will contain the truth. 
- "Static" idea with situations being repeated over and over again. 
- Does not necessarily match up with real life. A sample taken today and another taken next week may be affected by events that induce changes in the population.  

In short: parameters are fixed and unknown, and situations are assumed to be repeatable (static). The problem is that things may happen between now and next week that affect the results.

"Frequentists have fixed parameters and assume that situations are repeatable." 

**2. What is the Bayesian viewpoint on statistics?**

The Bayesian paradigm assumes the parameters _are random_ and we quantify our belief through probability distributions.  Our goal is to find the posterior distribution and use that to inform decisions.

Bayesian
- We assume that parameters are random (they are, in themselves, a random variable and they have a distribution associated with them). They are not observed at one particular point and there is no true value that we're trying to estimate. Instead, there is a distribution of values and we're attempting to look at that distribution.    
- We specify some information about our parameter of interest beforehand (we call this "prior information" or "prior knowledge"). We do this in the form of a distribution (by applying a prior distribution to our parameter). Then, we get some data, just like in the Frequentist approach, and we model this data through a "likelihood". We then take the prior information and the data we gathered (prior + likelihood) and combine them into a "posterior distribution" that represents updated information about our parameter. 
- There is no "true" parameter out there. There is a distribution of values that tells us where we think the parameter will be observed in the future.  

No true parameter, just a distribution that we look at. We put a prior distribution, model through likelihood, and then combine into a posterior distribution. 

"Bayesians have random parameters and they specify their prior beliefs with a distribution, combine that with the data that they saw to create a posterior distribution or belief after they've seen the data."      
If we don't want to influence our results too much with a prior belief, we can look at non-informative prior distributions (e.g. [Jeffreys prior](https://en.wikipedia.org/wiki/Jeffreys_prior)).

**3. Should I be a Frequentist or a Bayesian?**

Not necessarily either one; choose what works best for the problem at hand.

**4. What are the terms prior, likelihood, and posterior?**

**Prior belief** = some prior belief/probability distribution about our parameters before we see the data. It is whatever distribution we assume for our parameters before seeing the data. We may have a probability of success $p$ as our parameter of interest ($p=P(Success)$) and we may model $p$ as a Beta distribution beforehand, $p \sim Beta(\alpha, \beta)$. Why a Beta? The Beta distribution's support (all of the values it takes on) is the interval from 0 to 1. Probabilities also take on values between 0 and 1, so a Beta distribution could be a reasonable prior belief about our parameter. The Beta is also very flexible, so it gives us a flexible way to specify our prior belief about our parameter.

<center><img src="1062px-Beta_distribution_pdf.png" style="width:400px"/><figcaption>Some of the possible shapes for the Beta distribution, which has support (takes on possible values) from 0 to 1.</figcaption></center>


**Likelihood** = a model for our data. We need some experiment or a study to investigate $p$. The experiment may be to observe the number of successes, $Y$, in $n$ iid trials. The likelihood is then $Y|p \sim Bin(n,p)$. There is no fixed true value of $p$, just a distribution of $p$s. 

**Posterior belief** = an updated belief about our parameters after seeing the data. Combine prior and likelihood into a posterior distribution using Bayes' theorem. With a posterior, we're looking at a distribution and are trying to find a distribution for $p$ (a random thing) given our data, i.e. $p|Y=y \sim ?$. With a Beta prior, $p$ still follows a Beta distribution: $p|y \sim Beta(\alpha +y, n-y+ \beta)$.

**5. What do we do once we have the posterior?**

Summarize that posterior distribution in meaningful ways, with something like a posterior mean $E(p|y)$, a posterior median, and a posterior standard deviation. Give a credible interval (CI). 

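**6. How are credible intervals and confidence intervals similar? different?**

Both give a range of plausible values for a parameter. A 95% credible interval contains 95% of the posterior distribution, so we can say that the parameter lies in the interval with probability 0.95. 

In the Frequentist view, the parameter is fixed, so once an interval has been computed, the probability that it contains the parameter is either 0 or 1. The 95% describes the procedure, not a single interval.  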



**7. Undergraduate students are classified into one of four groups:**
* freshmen (35% of students)
* sophomores (28% of students)
* juniors (23% of students)
* seniors (14% of students)

**For the students that are freshmen 46% live on campus, for those that are sophomores 23% live on campus, for those that are juniors 17% live on campus, and for those that are seniors 14% live on campus.**

**a. What is the probability a randomly selected student lives on campus?**

**b. If a student lives on campus, what is the probability they are a senior?**
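
One way to set up parts (a) and (b) is sketched below: the four classes form a partition, so the law of total probability gives $P(\text{campus})$ and Bayes' theorem gives $P(\text{senior}|\text{campus})$. The dictionary names are illustrative.

```python
# Class shares P(class) and on-campus rates P(campus | class) from the problem
prior = {"freshman": 0.35, "sophomore": 0.28, "junior": 0.23, "senior": 0.14}
on_campus_given = {"freshman": 0.46, "sophomore": 0.23, "junior": 0.17, "senior": 0.14}

# (a) Law of total probability over the partition of classes
p_campus = sum(prior[c] * on_campus_given[c] for c in prior)

# (b) Bayes' theorem: P(senior | campus) = P(campus | senior) P(senior) / P(campus)
p_senior_given_campus = on_campus_given["senior"] * prior["senior"] / p_campus

print(f"P(campus) = {p_campus:.4f}")
print(f"P(senior | campus) = {p_senior_given_campus:.4f}")
```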

**8. In a 2018 study, 1250 households across the U.S. were randomly selected for a survey on the use of cellular versus landline phones. Of those surveyed, 715 responded that they do not have a landline phone and rely completely on their cellular phone. Suppose we want to make Bayesian inference on p = P(landline).**

**a. What do we need to do first?**

* Define our prior beliefs
* Model the data appropriately
* Combine info with Bayes' theorem to find posterior distribution

**b. What is a reasonable likelihood?**
$$
Y = \text{number of people with a landline} \\
Y \sim Bin(1250,p) \\
Y|p \sim Bin(1250,p)
$$

**c. What is our posterior distribution?**

**d. Find and interpret a 95% credible interval for p.**
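
A possible approach to parts (c) and (d) is sketched below. It assumes a uniform $Beta(1, 1)$ prior, which the problem does not specify, and approximates the credible interval by Monte Carlo draws from the posterior because the standard library has no Beta quantile function. Since $p = P(\text{landline})$, the number of successes is $y = 1250 - 715$.

```python
import random
from math import sqrt

n = 1250
y = n - 715           # households that still have a landline
a0, b0 = 1.0, 1.0     # Beta(1, 1) uniform prior: an assumption, not given in the problem

# Conjugate update: p | y ~ Beta(a0 + y, b0 + n - y)
a_post, b_post = a0 + y, b0 + n - y
post_mean = a_post / (a_post + b_post)
post_sd = sqrt(a_post * b_post / ((a_post + b_post) ** 2 * (a_post + b_post + 1)))

# Monte Carlo 95% credible interval: 2.5th and 97.5th percentiles of posterior draws
rng = random.Random(7)
draws = sorted(rng.betavariate(a_post, b_post) for _ in range(20_000))
cred_low = draws[int(0.025 * len(draws))]
cred_high = draws[int(0.975 * len(draws))]

print(f"posterior: Beta({a_post:.0f}, {b_post:.0f}), mean = {post_mean:.3f}, sd = {post_sd:.3f}")
print(f"95% credible interval: ({cred_low:.3f}, {cred_high:.3f})")
```

Under these assumptions the interval runs from roughly 0.40 to 0.46, and it can be read as a probability statement: given the prior and the data, $p$ lies in that range with probability of about 0.95.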


**9. Data was collected on 45 Americans and the number of hours of traditional TV watched in a week is recorded. The data is included in the TVData.xlsx file. Suppose we want to make Bayesian inference on the mean number of hours watched.**

**a. What do we need to do first?**

**b. What might we use as a reasonable likelihood?**

**c. What is our posterior distribution?**

**d. Report a posterior mean and find and interpret a 95% credible interval for the mean.**

# General problems

In [2]:
%%SAS my_session
DATA geom;
     INPUT trials;
     DATALINES;
5
1
3
9
6
6
5
3
8
7
;

In [None]:
%%SAS my_session
DATA pressure;
   INPUT SBPbefore SBPafter;
   DATALINES;
120 128   
124 131   
130 131   
118 127
140 132   
128 125   
140 141   
135 137
126 118   
130 132   
126 129   
127 135
;

PROC TTEST DATA = pressure PLOTS = all;
PAIRED SBPbefore*SBPafter;
RUN;

In [None]:
%%SAS my_session
PROC TTEST DATA = sashelp.bweight ALPHA = 0.1 PLOTS = all;
VAR weight;
RUN;

In [None]:
%%SAS my_session
PROC FREQ DATA=sashelp.bweight;
TABLES Weight/BINOMIAL(Wald);
RUN;

# Hypothesis testing

**Hypotheses** are expectations about a _population_ (e.g. the parameters of that population).

The **test statistic** is the number of standard errors by which a sample value deviates from the $H_{0}$ value.

### Test about the _proportion_
Suppose we assume that the proportion of Americans who have scuba-diving experience is less than 3%. Our null hypothesis is $H_{0}: p = 0.03$ and our alternative hypothesis is $H_{A}: p \lt 0.03$. We draw a sample of 1,000 Americans ($n=1000$) and find that the sample proportion is $\hat{p}=0.02$. How likely is a sample proportion of 0.02 if the population proportion is 0.03?

**Steps**:
1. Compute the test statistic (or the number of standard errors that the sample statistic is removed from the assumed population parameter)

The number of standard errors the sample statistic is removed from the assumed population parameter is represented by a **Z-score**.

$$
\text{test statistic} = z = \frac{\hat{p}-p}{SE_{0}} \\
\text{where the standard error assumed under the null hypothesis} = SE_{0} = \sqrt{\frac{p(1-p)}{n}}
$$

$p$ is the population proportion assumed under the null hypothesis and $\hat{p}$ is the sample proportion.

Here,

$$
\text{test statistic} = z = \frac{0.02-0.03}{SE_{0}} \\
\text{where } SE_{0} = \sqrt{\frac{0.03(1-0.03)}{1000}} \approx 0.0054 \\
z = \frac{0.02-0.03}{0.0054} \approx -1.85
$$

This means that our _sample_ proportion falls 1.85 standard errors below the _population_ proportion when the null hypothesis ($H_{0}$) is true. The probability, or the **p-value**, that corresponds to this z-score is 0.0322 (3.22%), as seen in the z-table below.

<center><img src="z_score_prob_value.png" style="width:400px"/></center>

Thus, finding a sample proportion of 0.02 if the population proportion is actually 0.03 is unlikely. Is it _unlikely enough_ to reject the null hypothesis? It depends on the **significance level**, $\alpha$, that we choose before conducting the experiment. A common $\alpha$ value is 0.05. In this case, if the p-value is smaller than 0.05, we can say that "our sample provides enough evidence to reject the null hypothesis". Since $0.0322 \lt 0.05$, we reject the null hypothesis. 

We can also state that our test statistic of -1.85 falls within the **rejection region**. The critical z-value that forms the border of the rejection region is -1.64, which can be found by looking up the left-tailed probability of 0.05.    

<center><img src="rejection_region.png" style="width:600px"/></center>

We thus reject our null hypothesis and conclude that the proportion of Americans with scuba-diving experience is lower than 0.03. 

Since our alternative hypothesis was that the population parameter is _smaller_ than 0.03, $H_{A}: p \lt 0.03$, we only focused on one side of the sampling distribution (the left side). We performed a **one-tailed test**. If our alternative hypothesis had been $H_{A}: p \neq 0.03$, we would not focus just on the left side of the distribution but on both sides of the distribution. We would thus perform a **two-tailed test**. Based on the z-table, this would correspond to the critical values of -1.96 and 1.96.   

<center><img src="two_tailed_test.png" style="width:400px"/></center>

In this case, our test statistic of -1.85 does not fall in the rejection region. This means that we cannot reject the null hypothesis that $p=0.03$. 
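
Both versions of the test can be reproduced with the standard library. This is a minimal sketch in which `statistics.NormalDist` replaces the z-table; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

p0 = 0.03       # proportion assumed under the null hypothesis
p_hat = 0.02    # sample proportion
n = 1000        # sample size
alpha = 0.05    # significance level

se0 = sqrt(p0 * (1 - p0) / n)   # standard error under H0
z = (p_hat - p0) / se0          # test statistic

p_one_tailed = NormalDist().cdf(z)             # H_A: p < 0.03
p_two_tailed = 2 * NormalDist().cdf(-abs(z))   # H_A: p != 0.03

print(f"z = {z:.2f}")
print(f"one-tailed p = {p_one_tailed:.4f}, reject H0: {p_one_tailed < alpha}")
print(f"two-tailed p = {p_two_tailed:.4f}, reject H0: {p_two_tailed < alpha}")
```

The one-tailed p-value comes out slightly below the table value of 0.0322 because the code uses the unrounded z-score rather than -1.85. The conclusions match the text: reject in the one-tailed test, fail to reject in the two-tailed test.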

Choosing a one- or a two-tailed test can make a big difference to your conclusions! In practice, two-tailed tests are used much more often. 

<div class="alert alert-block alert-success">
<b>Tip:</b> Most significance tests are two-tailed and are based on a significance level ($\alpha$) of 0.05! 
</div>


### Significance test about the _population mean_
How long can scuba-divers stay under water? Suppose we expect that experienced American divers can stay under water for more than 60 minutes. Suppose we approach 100 American scuba-divers ($n=100$) and measure how long they can stay under water. We find that the mean time that these divers stay under water is 62 minutes ($\bar{x} = 62$ minutes). The sample standard deviation is 5 minutes ($s=5$ minutes). 

Here, 

$$
H_{0}: \mu = 60 \\
H_{A}: \mu \gt 60
$$

To conduct a significance test about the population mean, we assess whether it is likely that the sample we have collected actually comes from a population with a mean that equals the value in our null hypothesis.

<center><img src="sample_distribution_sample_mean.png" style="width:400px"/></center>

Steps:
1. Compute a test statistic (the number of standard errors that the sample mean lies from the value of $\mu$ stated in $H_{0}$).


**One-tailed test**

To compute the standard error, we need to know the population standard deviation, $\sigma$. Since we do not know $\sigma$, we need to estimate it using the _sample_ standard deviation, $s$. Since this introduces extra error, we employ the **t-distribution**, rather than the z-distribution. 

$$
\text{Test statistic} = t = \frac{\bar{x}-\mu_{0}}{SE} \\
\text{where standard error} = SE = \frac{s}{\sqrt{n}}
$$

Here,

$$
\text{Standard error} = \frac{5}{\sqrt{100}} = 0.5 \\
t = \frac{62-60}{0.5} = 4
$$

Our test statistic is a t-score of 4. Is this enough to reject the null hypothesis? If we employ $\alpha=0.05$ and do a one-tailed test, we can find the critical value (1.67) for our rejection region in a t table. We look at the $t_{90\%}$ column because a 90% confidence level leaves a probability of 0.05 in each tail, and we want a probability of 0.05 in the right tail of the distribution.

<center><img src="t_test.png" style="width:400px"/></center>
<center><img src="t_table_90.png" style="width:400px"/></center>

Since $t=4$ falls into our rejection region, we can reject the null hypothesis and conclude that, on average, experienced American divers stay under water for more than 60 minutes.  

<center><img src="sample_dist.png" style="width:400px"/></center>

If, instead, we state that 

$$
H_{0}: \mu = 60 \\
H_{A}: \mu \neq 60
$$

we have to do a two-tailed t-distribution test. 

**Two-tailed test**

Suppose $\alpha=0.01$. Our sampling distribution looks like this:

<center><img src="two_tailed_sample_dist.png" style="width:400px"/></center>

Our t-value is still 4 and is still in the rejection region. We still reject the null hypothesis and conclude that our finding is still statistically significant. With a two-tailed t-test, we conclude that the mean time that experienced American divers spend under water is NOT 60 minutes. 

### Step-by-step plan for conducting significance tests
Suppose you have two expectations:
* More than half of all certified divers in America have more than 35 hours of diving experience. 
  * Here, we are dealing with a proportion, $p$.
  
* Mean number of hours of diving experience of all certified divers in America is more than 35 hours. 
  * Here, we are dealing with a mean, $\mu$

How many hours of experience? In our sample of $n=500$, the distribution of the variable "hours of diving experience" is approximately normal.

**Example 1**

In the sample, a proportion of 0.57 has more than 35 hours of experience: $\hat{p} = 0.57$. Here, $H_{0}: p = 0.5$, $H_{A}: p \gt 0.5$. We thus have to conduct a right-tailed test.

In the case of proportions,
$$
z = \frac{\hat{p} - p_{0}}{SE_{0}} \\
SE_{0} = \sqrt{\frac{p_{0}(1-p_{0})}{n}}
$$

Here, 
$$
SE_{0} = \sqrt{\frac{0.5(1-0.5)}{500}} \\
z = \frac{0.57 - 0.5}{\sqrt{\frac{0.5 \times 0.5}{500}}} \approx 3.13 \\
$$

**Example 2**

The mean number of hours of diving experience is 35.5 and the standard deviation is 8: $\bar{x}=35.5$, $S=8$. Here, $H_{0}: \mu = 35$, $H_{A}: \mu \gt 35$. We thus have to conduct a right-tailed test.

In the case of the mean, 
$$
SE = \frac{8}{\sqrt{500}} \\
t = \frac{35.5 - 35}{\frac{8}{\sqrt{500}}} \approx 1.40 \\
$$
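
Both test statistics above can be reproduced with a few lines of arithmetic. The sketch below is only a check of the numbers in Examples 1 and 2, using $n = 500$ as in the formulas; the variable names are illustrative.

```python
import math

n = 500  # sample size used in both examples

# Example 1: z statistic for a proportion (H0: p = 0.5)
p_hat, p0 = 0.57, 0.5
se0 = math.sqrt(p0 * (1 - p0) / n)   # standard error under H0
z = (p_hat - p0) / se0

# Example 2: t statistic for a mean (H0: mu = 35)
x_bar, mu0, s = 35.5, 35, 8
se = s / math.sqrt(n)                # estimated standard error
t = (x_bar - mu0) / se

print(f"z = {z:.2f}")
print(f"t = {t:.2f}")
```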

**Step-by-step plan**
1. _Proportion or mean?_ In example 1, we're dealing with a proportion. In example 2, we're dealing with a mean. 
2. _Formulate your hypotheses_

In the case of proportions, our 
$$
H_{0}: p=p_{0} \\
H_{A}: p \neq p_{0} \text{ or} \\
H_{A}: p \gt p_{0} \text{ or} \\
H_{A}: p \lt p_{0}
$$

In the case of the mean, 
$$
H_{0}: \mu = \mu_{0} \\
H_{A}: \mu \neq \mu_{0} \text{ or} \\
H_{A}: \mu \gt \mu_{0} \text{ or} \\
H_{A}: \mu \lt \mu_{0}
$$
3. _Check if your assumptions are met._

In both cases 1 and 2, randomization is essential. Your data must have been collected by means of a random sample or a randomized experiment. 

In the case of proportions,
$$
np \ge 10 \text{ and} \\
n(1-p) \ge 10
$$

In the case of the mean, the population distribution should be approximately normal. 

4. Determine your significance level, $\alpha$. Usually, $\alpha=0.05$. 
5. Compute your test statistic.

In the case of proportions,
$$
z = \frac{\hat{p} - p_{0}}{SE_{0}} \\
SE_{0} = \sqrt{\frac{p_{0}(1-p_{0})}{n}}
$$

In the case of the mean, 
$$
t = \frac{\bar{x} - \mu}{SE} \\
SE = \frac{S}{\sqrt{n}}
$$

6. _Draw the sampling distribution_.
<center><img src="step_6.png" style="width:900px"/></center>

7. _Find location of test statistic_
8. _Decide if the null hypothesis should be rejected_
9. _Interpret your findings._
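
As a rough illustration, the plan can be sketched as a small Python function for the proportion case. This is a hypothetical helper, not something from the course material: the function name, the `alternative` labels and the use of `statistics.NormalDist` are all assumptions of this sketch, and the randomization assumption still has to be judged from how the data were collected.

```python
import math
from statistics import NormalDist

def proportion_z_test(successes, n, p0, alternative="two-sided", alpha=0.05):
    """One-sample z-test for a proportion, following the plan above.

    `alternative` is "two-sided", "greater" or "less" (labels chosen
    for this sketch only).
    """
    # Step 3: check the sample-size assumption under H0.
    if n * p0 < 10 or n * (1 - p0) < 10:
        raise ValueError("need n*p0 >= 10 and n*(1-p0) >= 10")

    # Step 5: test statistic, using the standard error under H0.
    p_hat = successes / n
    se0 = math.sqrt(p0 * (1 - p0) / n)
    z = (p_hat - p0) / se0

    # Steps 7-8: locate z in the sampling distribution via the p-value.
    cdf = NormalDist().cdf
    if alternative == "greater":
        p_value = 1 - cdf(z)
    elif alternative == "less":
        p_value = cdf(z)
    elif alternative == "two-sided":
        p_value = 2 * (1 - cdf(abs(z)))
    else:
        raise ValueError(f"unknown alternative: {alternative!r}")

    return z, p_value, p_value < alpha

# Example 1: 285 of 500 divers (0.57) have more than 35 hours of experience.
z, p_value, reject = proportion_z_test(285, 500, p0=0.5, alternative="greater")
print(f"z = {z:.2f}, p-value = {p_value:.4f}, reject H0: {reject}")
```

Step 9, interpreting the result in context, is left to the analyst: here the data support the claim that more than half of the divers have more than 35 hours of experience.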

### Example
A vet wants to determine the proportion of adult cats that have high tartar build-up on their teeth.  They are thinking of running a special on dental cleanings but need a large number of appointments to make it worth their while.

They want to see if the proportion of adult cats with tartar build-up is greater than 0.7.

Our null and alternative hypotheses are: 

$$
H_0: p = 0.7, H_A: p >0.7
$$
 

We observe a random sample of 50 cats and find that 39 of them have tartar build-up.  Our test statistic is:

$$
Z = \frac{\hat{p}-0.7}{\sqrt{0.7(1-0.7)/50}} \overset{H_0}{\sim} N(0,1)
$$

What is our p-value for testing these hypotheses? (Use two decimal places.)

**Answer**

$$
P(Z \ge z_{obs}) = P(Z \ge 1.2344) = 1-P(Z \le 1.2344)
$$

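Here $z_{obs}$ follows from $\hat{p} = 39/50 = 0.78$: $z_{obs} = \frac{0.78-0.7}{\sqrt{0.7(1-0.7)/50}} \approx 1.2344$. Using the standard normal CDF, the p-value is $1-0.8915=0.1085 \approx 0.11$. With a significance level of $\alpha=0.05$, we fail to reject the null hypothesis that the proportion of adult cats with tartar build-up is 0.7.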

In [12]:
%%SAS my_session
* To find P(T_19 ge 2.71);
data new; 
  t=1-CDF('T', 2.71, 19);
put t=;
RUN;

* To find 1-P(Z \le 1.2344);
data new; 
  p=1-CDF('NORMAL', 1.2344);
put p=;
RUN;
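
The two tail probabilities from the SAS cell above can be cross-checked without SAS. The sketch below is an assumption-laden stand-in, not the course's method: the normal tail uses Python's `statistics.NormalDist`, and because the standard library has no t-distribution, the t tail is approximated by integrating the t density numerically with Simpson's rule (the helper names `t_pdf` and `t_upper_tail` are made up for this example).

```python
import math
from statistics import NormalDist

def t_pdf(x, df):
    """Density of Student's t-distribution with `df` degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

def t_upper_tail(t, df, steps=2000):
    """Approximate P(T >= t) for t >= 0, using Simpson's rule on [0, t]."""
    h = t / steps
    total = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area_0_to_t = total * h / 3
    return 0.5 - area_0_to_t   # by symmetry, P(T >= 0) = 0.5

p_t = t_upper_tail(2.71, 19)            # mirrors 1 - CDF('T', 2.71, 19)
p_z = 1 - NormalDist().cdf(1.2344)      # mirrors 1 - CDF('NORMAL', 1.2344)
print(f"P(T_19 >= 2.71) = {p_t:.4f}")
print(f"P(Z >= 1.2344)  = {p_z:.4f}")
```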

### Significance test and confidence interval example
There are two methods of **inferential statistics**. 

1. Estimating a population parameter with an interval, by means of **confidence intervals**.
2. Testing a claim about a population parameter, using **significance tests**.

**Example**

Suppose you have a sample of $n=500$ divers with a mean diving experience of $\bar{x}=36$ hours and a standard deviation of $S=8$ hours. The sample distribution of the variable "hours of diving experience" is approximately normal.

Based on this information, you want to draw inferences about the population parameter, $\mu$. 

Here,
$$
H_{0}: \mu = 35 \\
H_{A}: \mu \neq 35
$$

Our assumptions are met: our analysis is based on a simple random sample, we're dealing with a large sample, and the sample distribution is approximately normal. Our test statistic is
$$
t = \frac{\bar{x} - \mu}{SE} \\
\text{where } SE = \frac{S}{\sqrt{n}} \\
$$

Thus,
$$
t = \frac{36-35}{\frac{8}{\sqrt{500}}} \approx 2.80
$$

Our sampling distribution looks like:
<center><img src="ex_sampling_distribution.png" style="width:400px"/></center>


To construct a **95% confidence interval**, we use the following:

$$
\bar{x} \pm t_{95\%}(SE) \\
\text{where } SE = \frac{\text{standard deviation}}{\sqrt{n}}
$$

The relevant t-score is 1.984, so we have 
$$
36 \pm 1.984(\frac{8}{\sqrt{500}})
$$

Thus, the 95% confidence interval is (35.29, 36.71). We can be 95% confident that this interval contains the actual population mean: with repeated sampling, 95% of the intervals constructed this way would contain it.


<div class="alert alert-block alert-success">
<b>Tip:</b> If the p-value in a two-tailed significance test is $\le 0.05$, the 95% confidence interval does NOT contain the $H_{0}$ value. Conversely, if the p-value in a two-tailed significance test is $\gt 0.05$, then the 95% confidence interval WILL contain the $H_{0}$ value. 
</div>
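
The link between the interval and the test can be checked on the diving example. This is a minimal sketch that takes the t-score of 1.984 from the text as given (it is not computed here) and uses illustrative variable names.

```python
import math

n, x_bar, s = 500, 36, 8
mu0 = 35
t_crit = 1.984            # t-score quoted in the text for a 95% interval

se = s / math.sqrt(n)
margin = t_crit * se
ci_low, ci_high = x_bar - margin, x_bar + margin

t = (x_bar - mu0) / se
reject_by_test = abs(t) > t_crit
reject_by_interval = not (ci_low <= mu0 <= ci_high)

print(f"95% CI: ({ci_low:.2f}, {ci_high:.2f})")
print(f"t = {t:.2f}, test rejects H0: {reject_by_test}, "
      f"interval excludes {mu0}: {reject_by_interval}")
```

Both routes agree: the $H_{0}$ value of 35 lies outside the interval, and the two-tailed test rejects $H_{0}$ at $\alpha = 0.05$.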
 

## Common hypothesis tests for a single mean
RR = rejection region

<center><img src="common_ht_single_mean.png" style="width:800px"/></center>
<center><img src="common_ht_single_mean_2.png" style="width:800px"/></center>
<center><img src="common_CI_differences_of_means.png" style="width:800px"/></center>
<center><img src="common_CI_differences_of_means_2.png" style="width:800px"/></center>
<center><img src="common_CI_variances.png" style="width:800px"/></center>
<center><img src="common_CI_variances_2.png" style="width:800px"/></center>

**What is the goal of a hypothesis test?**

For a hypothesis test, we make assumptions about some aspect of our population (usually a parameter) and see if our data refutes that assumption.  This gives us a way to make decisions about our population using our data.

**What do we use a p-value for?**

p-values provide a measure of evidence against the null hypothesis. We look at p-values to determine our conclusion. 

**What is meant by a ‘z-test’?**

A hypothesis test that uses the normal distribution for its test statistic.

**When is a ‘z-test’ reasonable when doing a hypothesis test for a population mean?**

* When you have a random sample of size n from a normally distributed population where you know the population variance/standard deviation.
* When you have a large random sample from (almost) any population.

Note that when you have a ‘large’ random sample from (almost) any population, a ‘t-test’ is basically the same as a ‘z-test’. This means you might just use a t-test anyway!

**We can use a p-value to make a decision about H0.  We reject H0 when our p-value is less than alpha.  When we use this rule to make a conclusion about H0, how does this control alpha?**

* Alpha = probability that we reject H0 when H0 is true = P(Reject H0|H0)
* P-value = P(result as or more extreme|H0)

We calculate the p-value assuming H0 is true, and we use the rule that we only reject when the p-value is less than alpha. Assuming H0 is true, we should only see samples that give us a p-value less than alpha in $100\alpha\%$ of our experiments. Since H0 is assumed true when finding the p-value, we will only falsely reject H0 with probability alpha.

**What do we mean by the term "controlling alpha"?**

We mean keeping this probability (P(reject H0|H0)) fixed at a small value (often 0.05).  Since type I errors are usually considered worse than type II errors, we want to ‘control’ the probability of making a type I error, i.e. we want to control alpha by keeping it small.

**What is a p-value?**

A p-value is the probability of seeing a result as or more extreme than what was observed, assuming the null hypothesis is true.

**How does a rejection region control $\alpha$?**

The rejection region is set up assuming the null hypothesis is true. In fact, it is set up so that, assuming the null is true, we should only observe our test statistic in that region with probability alpha.  That means we will only reject H0 when H0 is true with probability alpha.
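
A small simulation can illustrate how the rule "reject when the p-value is less than alpha" controls the type I error rate. The sketch below is a hypothetical setup, not part of the original notes: it assumes a normal population in which $H_{0}$ is true, a known $\sigma$ so that a z-test applies, and arbitrary choices for the sample size, seed and number of experiments.

```python
import math
import random
from statistics import NormalDist, fmean

random.seed(1)                      # fixed seed so the run is repeatable
mu0, sigma, n = 60, 16, 30          # hypothetical population with H0 true
alpha, n_experiments = 0.05, 20000
std_normal = NormalDist()

rejections = 0
for _ in range(n_experiments):
    sample = [random.gauss(mu0, sigma) for _ in range(n)]
    z = (fmean(sample) - mu0) / (sigma / math.sqrt(n))
    p_value = 2 * (1 - std_normal.cdf(abs(z)))   # two-sided p-value
    if p_value < alpha:
        rejections += 1                          # a type I error

type_i_rate = rejections / n_experiments
print(f"share of experiments that falsely reject H0: {type_i_rate:.3f}")
```

The share of false rejections should come out close to alpha, which is exactly what "controlling alpha" means.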





# Type I and Type II errors

**Power**: the probability of rejecting the null hypothesis given that it is false. We want low $\alpha$, low $\beta$, high power. Power is $1-\beta$. Larger samples increase power. 

Power is important because, before we conduct a study, it can help us determine how many participants we need. After we've conducted the study, it can help us make sense of results that are not statistically significant. 

### Example

Imagine you're a diver interested in whale sharks and you'd like to know what the average length of these gigantic animals is. Also suppose you have spent years and years in different parts of the world to study these creatures.

Over the years you have encountered and measured 258 whale sharks. Because you have measured whale sharks all over the world, we assume from now that these 258 whale sharks can be understood as a _simple random sample_.

It turns out that the mean length equals 8.3 meters. The sample standard deviation is 3.4 meters. It also turns out that the distribution of whale shark length is approximately normal.

$$
n = 258 \text{ whale sharks}\\
\bar{x} = 8.3 \text{ meters} \\
\text{sample standard deviation} = s = 3.4 \text{ meters}
$$

We will test three alternative hypotheses against the null hypothesis that the mean whale shark length in the population is 8 meters.

$$
H_{0}: \mu = 8 \text{ meters} \\
--- \\
H_{A_{1}}: \mu \neq 8 \text{ meters} \\
H_{A_{2}}: \mu \gt 8 \text{ meters} \\
H_{A_{3}}: \mu \lt 8 \text{ meters} 
$$

In all cases, **$\alpha = 0.10$**.

1. Check our assumptions.
* Randomization
* Population distribution of whale shark lengths: "approximately normal"

2. Compute the test statistic.

For $H_{A_{1}}: \mu \neq 8$:

$$
\text{test statistic} = t = \frac{\bar{x} - \mu}{SE} \\
SE = \frac{s}{\sqrt{n}} \\
------------\\
t = \frac{8.3 - 8}{\frac{3.4}{\sqrt{258}}} \approx 1.42\\
$$

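For a two-tailed test with $\alpha=0.10$, the critical values are $\pm t_{90\%} \approx \pm 1.66$. Since $t=1.42$ lies between -1.66 and 1.66, it is not within the rejection region: we do not reject the null hypothesis and cannot conclude that the population mean differs from 8.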

<center><img src="ex_1_two_tailed.png" style="width:400px"/></center>

For $H_{A_{2}}: \mu \gt 8$:

$$
\text{test statistic} = t = \frac{\bar{x} - \mu}{SE} \\
SE = \frac{s}{\sqrt{n}} \\
------------\\
t = \frac{8.3 - 8}{\frac{3.4}{\sqrt{258}}} \approx 1.42\\
$$

Instead of a two-tailed test, we now do a one-tailed test. For $\alpha=0.10$, the one-tailed critical value is $t \approx 1.29$. Since $t=1.42$ is now within the rejection region, we DO reject the null hypothesis and conclude that the population mean is indeed larger than 8.

<center><img src="ex_1_one_tailed.png" style="width:400px"/></center>

For $H_{A_{3}}: \mu \lt 8$:

$$
\text{test statistic} = t = \frac{\bar{x} - \mu}{SE} \\
SE = \frac{s}{\sqrt{n}} \\
------------\\
t = \frac{8.3 - 8}{\frac{3.4}{\sqrt{258}}} \approx 1.42\\
$$

Again we do a one-tailed test, this time in the left tail. For $\alpha=0.10$, the one-tailed critical value has magnitude 1.29, so our critical value is -1.29. Since $t=1.42$ is not within the rejection region, we do not reject the null hypothesis and we cannot conclude that the population mean is smaller than 8.

<center><img src="ex_1_one_tailed_left.png" style="width:400px"/></center>
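
The three decisions for the whale shark data can be summarised in a short sketch. It takes the critical values 1.66 and 1.29 from the text as given rather than computing them, and the dictionary keys are just labels for this example.

```python
import math

n, x_bar, s, mu0 = 258, 8.3, 3.4, 8
t = (x_bar - mu0) / (s / math.sqrt(n))

# Critical values quoted in the text for alpha = 0.10
crit_two_tailed = 1.66
crit_one_tailed = 1.29

decisions = {
    "mu != 8": abs(t) > crit_two_tailed,
    "mu > 8": t > crit_one_tailed,
    "mu < 8": t < -crit_one_tailed,
}
print(f"t = {t:.2f}")
for alternative, reject in decisions.items():
    print(f"H_A: {alternative:8s} reject H0: {reject}")
```

The test statistic is the same in all three cases; only the location of the rejection region changes, and with it the conclusion.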

# Working with hypothesis tests in SAS

### Example 1: z-test for a **proportion**
Consider a data set collected on Europeans. The goal is to test if the proportion of Europeans with blue eyes differs from 0.5. That is, the null hypothesis is that half of Europeans have blue eyes and the alternative is "not half" (more or less than a half). 

$$
H_{0}: p = 0.5 \\
H_{A}: p \neq 0.5
$$
 
**Question**: What SAS procedure that we’ve studied allows us to do a hypothesis test for a proportion?

**Answer**: `PROC FREQ`

In [None]:
%%SAS my_session
* z-test for a proportion;
PROC FREQ DATA = Color;
  TABLES Eyes / BINOMIAL(Wald);
RUN;

<center><img src="sas_example_z_test_for_proportion.png" style="width:800px"/></center>

Since we're testing if the proportion is _less than_ **or** _greater than_ 0.5, we want a two-sided test. This requires us to multiply the one-sided p-value by 2. The area to the left of -1.7321 is around 4.16% or 0.0416, so we double this to account for the area on the other side. This gives us a p-value of 0.0833.

Thus, using a significance level of $\alpha=0.05$, we fail to reject the null hypothesis that the proportion of blue-eyed Europeans is 0.5 ($0.0833 \gt 0.05$). We don't have enough evidence to say that the proportion differs from 0.5.

If we had been testing the one-sided alternative $H_{A}: p \lt 0.5$ instead, _then_ we would have been able to reject the null hypothesis, since the one-sided p-value of 0.0416 is smaller than 0.05.
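
The one-sided and two-sided p-values can be recovered from the test statistic alone. The sketch below is a cross-check in plain Python, assuming `statistics.NormalDist` in place of the SAS output; it only uses the z value of -1.7321 reported above.

```python
from statistics import NormalDist

z = -1.7321   # test statistic reported in the PROC FREQ output
alpha = 0.05

p_left = NormalDist().cdf(z)                       # for H_A: p < 0.5
p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))   # for H_A: p != 0.5

print(f"one-sided p-value: {p_left:.4f}, reject H0: {p_left < alpha}")
print(f"two-sided p-value: {p_two_sided:.4f}, reject H0: {p_two_sided < alpha}")
```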

### Example 2: t-test for a **mean**
Consider a data set with court cases and their lengths. The population is all court cases in some district, the sample is $n=20$ court cases measured on the variable `days`, and we want to make an inference about $\mu$, the average length of all court cases.

$$
H_{0}: \mu = 60 \text{ days} \\
H_{A}: \mu \gt 60 \text{ days}
$$

**Question**: What SAS procedure that we’ve studied allows us to do a one-sample t-test for a mean?

**Answer**: `PROC TTEST`

We have a small data set, so let's check our assumptions to make sure the data is _normally distributed_.

<center><img src="sas_example_t_test_for_mean_assumptions.png" style="width:800px"/></center>

The points in the plot are roughly linear, which is what we expect from normally distributed data, so let's go to the next step.

In [None]:
%%SAS my_session
/* t-test for a mean. H0 gives SAS the null value, 60. We also have to
tell SAS if we want a one-sided test or a two-sided test.
Here, we want a one-sided test, specifically an upper test (greater than),
so our SIDES = option is "U". If SIDES = L, then we are testing for "less than" 60 days */
PROC TTEST DATA = cases ALPHA=0.01 H0 = 60 SIDES = U PLOTS = all;
  VAR days;
RUN;

<center><img src="sas_example_t_test_for_mean.png" style="width:800px"/></center>

On average, our cases took 89.85 days, with a standard error of 4.2811 days. Our test statistic, the t-value, is 6.97 and comes from a t-distribution with 19 degrees of freedom. The p-value is less than 0.0001, so a result this extreme would be very rare if $H_{0}$ were true.

We have enough evidence to conclude that the mean number of days for a court case is greater than 60. 

### Example 3: difference of means
Consider the Cookie Cats game. The population is all possible players (now and in the future). The sample is approximately 90,000 players that installed the game during the study. Players were randomly assigned to have a gate at level 30 or level 40 when they installed. 

Make an inference about the _difference_ between the gates in the average number of games played, $\mu_{Diff} = \mu_{30} - \mu_{40}$.

$$
H_{0}: \mu_{Diff} = 0 \\
H_{A}: \mu_{Diff} \neq 0
$$

**Question**: What SAS procedure that we’ve studied allows us to do a two-sample t-test for a difference of means?

**Answer**: `PROC TTEST`

Note that the data in this set is NOT normally distributed.
<center><img src="cookie_cats_qq.png" style="width:800px"/></center>

We can transform this data using the _log_ of the sum of the game rounds to make normality much more reasonable. 
<center><img src="cookie_cats_log_qq.png" style="width:800px"/></center>

In [None]:
%%SAS my_session
* Difference of means;
PROC TTEST DATA = cats PLOTS = all;
  CLASS version;
  VAR log_sum_gamerounds;
RUN;

We typically use the unequal variance assumption and, therefore, the Satterthwaite approximation.
<center><img src="sas_example_difference_of_means.png" style="width:800px"/></center>

With $\alpha=0.05$, we would reject if the p-value is less than 0.05. Our p-values are 0.9 (very high). We do not have enough evidence to claim that the mean (log) number of game rounds differs between gates 30 and 40. We fail to reject the null hypothesis.
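
For readers curious what the Satterthwaite row computes, here is a minimal sketch of the unequal-variance t statistic and its approximate degrees of freedom. The function name `welch_t` is an assumption of this example, and the summary statistics are made up for illustration; they are NOT taken from the Cookie Cats data.

```python
import math

def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Unequal-variance t statistic with Satterthwaite degrees of freedom."""
    v1, v2 = sd1**2 / n1, sd2**2 / n2          # squared standard errors
    t = (mean1 - mean2) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    return t, df

# Made-up summary statistics, NOT taken from the Cookie Cats data.
t, df = welch_t(mean1=5.0, sd1=2.0, n1=10, mean2=4.0, sd2=2.0, n2=10)
print(f"t = {t:.3f}, Satterthwaite df = {df:.1f}")
```

With equal standard deviations and equal group sizes, as in this toy input, the Satterthwaite degrees of freedom reduce to $n_{1}+n_{2}-2$; the two methods differ when the groups have different spreads or sizes.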

### Example 4: paired t-test
Jocko's garage seems to be giving out really high estimates for insurance claims. To investigate insurance fraud, insurance adjusters take 10 damaged cars and take each one to both Jocko's and a repair shop they trust, Jami's repair shop. They then get estimates from each repair shop (in the end, 2 for each car).
- The cars are the unit of measurement
- **Measured twice so paired data!**
- Measurements are clearly related and **cannot be treated as independent**

We're **NOT** looking at the average of Jocko and the average of Jami and taking the difference. We're taking the difference for each car first! We're then looking at the average of these differences to see if that average is equal to 0.

$$
H_{0}: \mu_{Diff} = 0 \\
H_{A}: \mu_{Diff} \gt 0 \text{ (i.e. Jocko is keeping some money for himself)}
$$

**Question**: What SAS procedure that we’ve studied allows us to do a paired t-test?

**Answer**: `PROC TTEST`

In [None]:
%%SAS my_session
* Paired t-test;
PROC TTEST DATA = insurance PLOTS = all;
  PAIRED Jocko*Jami;
RUN;

<center><img src="sas_example_paired_t_test.png" style="width:800px"/></center>

The p-value of 0.0014 is less than $\alpha=0.05$, so we can reject the null hypothesis. Jocko needs to be investigated further. His estimates are, on average, \$160 higher than Jami's.

# Comparing confidence intervals and hypothesis tests

We seem to get the same conclusions whether we use a confidence interval or a hypothesis test. This is because the two methods are related!

<div class="alert alert-block alert-success">
<b>Tip:</b> 
<br>If the null value (e.g. $\mu_{0}$) is contained in a $100(1- \alpha)$% confidence interval for $\mu$, then we fail to reject $H_{0}$ at level $\alpha$.  
    
<br>If the null value is NOT contained in a $100(1- \alpha)$% confidence interval for $\mu$, then we reject $H_{0}$ at level $\alpha$. 
</div>

<center><img src="ci_vs_ht.png" style="width:800px"/></center>

# Power

<center><img src="type_I_type_II_errors.png" style="width:400px"/></center>

$$
Power = 1 - \beta = P(\text{Reject } H_{0}|H_{A} \text{ true})
$$

High power means low type II error. We often use 80% power. We can manipulate power by manipulating sample sizes. 

### Example
Consider data on length of court cases
- Population = all court cases (in some district)
- The random variable (RV) is Y = length of a court case
- Assumed to be normally distributed
- We want to test (using $\alpha=0.05$)

$$
H_{0}: \mu = 60 \text{ days} \\
H_{A}: \mu \gt 60 \text{ days}
$$

We want to have a power of 80% to detect a true mean of 65 days. What sample size do we need?

We must have a value to use for $\sigma$. Suppose that, from past court case data, we can estimate $\sigma$ to be 16 days.

$$
n = \frac{(z_{\alpha}+z_{\beta})^2\sigma^2}{(\mu_{0}-\mu_{A})^2} \\
= \frac{(1.645+0.842)^2 16^2}{(60-65)^2} \approx 63.3
$$

We need at least 63.3 observations, which we round up to 64 since the sample size must be a whole number. So, if we use 64 cases and the true mean is actually 65, we will reject the null hypothesis in about 80% of all possible experiments. We would have a type II error rate of about 20%.

Note that $z_{\alpha}$ is the value from the standard normal distribution with $\alpha$ area to the right of it (i.e. the $1-\alpha$ quantile of the standard normal).

Similarly, $z_{\beta}$ is the value from the standard normal distribution with $\beta$ area to the right of it (i.e. the $1-\beta$ quantile of the standard normal).
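
The sample-size calculation can be reproduced as follows. This is a sketch under the same assumptions as the example (one-sided z-test, known $\sigma = 16$), using Python's `statistics.NormalDist` for the quantiles; the variable names are illustrative.

```python
import math
from statistics import NormalDist

std_normal = NormalDist()
alpha, power = 0.05, 0.80
mu0, mu_a, sigma = 60, 65, 16

z_alpha = std_normal.inv_cdf(1 - alpha)   # 1 - alpha quantile
z_beta = std_normal.inv_cdf(power)        # 1 - beta quantile, since beta = 1 - power

n_exact = (z_alpha + z_beta) ** 2 * sigma**2 / (mu0 - mu_a) ** 2
n_required = math.ceil(n_exact)           # sample sizes are whole numbers

# Power actually achieved by the one-sided z-test with n_required cases
se = sigma / math.sqrt(n_required)
achieved_power = 1 - std_normal.cdf(z_alpha - (mu_a - mu0) / se)

print(f"n = {n_exact:.1f} -> use {n_required} cases")
print(f"achieved power: {achieved_power:.3f}")
```

Because the sample size is rounded up, the achieved power ends up slightly above the 80% target.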


**Question**: What is power?

**Answer**: P(Reject $H_{0}$|$H_{A}$ true)

# Quiz 8

1. The sashelp.bweight dataset gives information about the mother and their baby's birth weight.

Conduct a test that the average weight of a baby differs from 3350 (grams).

Report the test statistic (two decimal places).

In [None]:
%%SAS my_session
/* t-test for a mean. H0 gives SAS the null value, 3350. We also have to
tell SAS if we want a one-sided test or a two-sided test.
Here, we want a two-sided test, so our SIDES = option is "2" */
PROC TTEST DATA = sashelp.bweight ALPHA=0.01 H0 = 3350 SIDES = 2 PLOTS = all;
  VAR Weight;
RUN;

2. The sashelp.bweight dataset gives information about the mother and their baby's birth weight.

Conduct a test that the proportion of male children (Boy = 1) equals 0.52 against the alternative that it is less than 0.52.  (Just use the Wald interval used in the notes.)

Report the p-value for this one-sided test.  Use four decimal places.

In [10]:
%%SAS my_session
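* z-test for a proportion. Without a LEVEL= option, BINOMIAL tests the first level (Boy = 0);
* adding LEVEL='1' inside BINOMIAL( ) would test the proportion of boys (Boy = 1) instead;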
PROC FREQ DATA = sashelp.bweight;
  TABLES Boy / BINOMIAL(Wald P=0.52);
RUN;

Baby Boy
Boy,Frequency,Percent,Cumulative Frequency,Cumulative Percent
0,24208,48.42,24208,48.42
1,25792,51.58,50000,100.0

Binomial Proportion (Boy = 0)
Proportion,0.4842
ASE,0.0022

Confidence Limits for the Binomial Proportion (Proportion = 0.4842)
Type,95% Lower Confidence Limit,95% Upper Confidence Limit
Wald,0.4798,0.4885

Test of H0: Proportion = 0.52
ASE under H0,0.0022
Z,-16.0410
One-sided Pr < Z,<.0001
Two-sided Pr > |Z|,<.0001


12. The sashelp.bweight dataset gives information about the mother and their baby's birth weight.  There is also a variable that denotes the sex of the baby.

Conduct a test to see if the average weight of baby boys differs from that of baby girls.  (Use the Satterthwaite method.)

Report the test statistic associated with this test.

In [None]:
%%SAS my_session
* Difference of means;
PROC TTEST DATA = sashelp.bweight PLOTS = all;
  CLASS Boy;
  VAR Weight;
RUN;

14. Consider data about blood pressure measurements.  There is a before reading, a drug is given, and there is an after reading.  The data can be read into SAS (data taken from SAS documentation) using the following DATA step.

DATA pressure;

   INPUT SBPbefore SBPafter;
   DATALINES;
   
120 128   
124 131   
130 131   
118 127
140 132   
128 125   
140 141   
135 137
126 118   
130 132   
126 129   
127 135
;

Conduct a test that the mean difference is not equal to 0.  Report the p-value of this test to four decimal places.

In [9]:
%%SAS my_session
DATA pressure;
   INPUT SBPbefore SBPafter;
   DATALINES;
120 128   
124 131   
130 131   
118 127
140 132   
128 125   
140 141   
135 137
126 118   
130 132   
126 129   
127 135
;

PROC TTEST DATA = pressure PLOTS = all;
  PAIRED SBPbefore*SBPafter;
RUN;

N,Mean,Std Dev,Std Err,Minimum,Maximum
12,-1.8333,5.8284,1.6825,-9.0,8.0

Mean,95% CL Mean (lower),95% CL Mean (upper),Std Dev,95% CL Std Dev (lower),95% CL Std Dev (upper)
-1.8333,-5.5365,1.8698,5.8284,4.1288,9.8958

DF,t Value,Pr > |t|
11,-1.09,0.2992
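
The summary statistics in the PROC TTEST output above can be reproduced by hand, which also shows what "paired" means in practice. The sketch below uses only Python's standard library and the blood pressure readings from the DATA step; it stops at the t statistic and does not recompute the p-value.

```python
import math
from statistics import mean, stdev

before = [120, 124, 130, 118, 140, 128, 140, 135, 126, 130, 126, 127]
after = [128, 131, 131, 127, 132, 125, 141, 137, 118, 132, 129, 135]

# Take the difference within each pair first, then analyse the differences.
diffs = [b - a for b, a in zip(before, after)]

n = len(diffs)
mean_diff = mean(diffs)
sd_diff = stdev(diffs)               # sample standard deviation (n - 1)
se_diff = sd_diff / math.sqrt(n)
t = mean_diff / se_diff              # H0: mean difference = 0

print(f"n = {n}, mean = {mean_diff:.4f}, sd = {sd_diff:.4f}, se = {se_diff:.4f}")
print(f"t = {t:.2f} with {n - 1} degrees of freedom")
```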


# Summaries
<center><img src="BEPB0000465_135.png" style="width:800px"/></center>
Summary of sampling distribution for sample mean statistics
<center><img src="BEPB0000465_139.png" style="width:800px"/></center>
<center><img src="BEPB0000465_140.png" style="width:800px"/></center>
