# DS102 Statistical Programming in R : Lesson Seven - t-Tests

### Table of Contents <a class="anchor" id="DS102L7_toc"></a>

* [Table of Contents](#DS102L7_toc)
    * [Page 1 - Introduction](#DS102L7_page_1)
    * [Page 2 - Single Sample t-Tests](#DS102L7_page_2)
    * [Page 3 - Independent t-Tests](#DS102L7_page_3)
    * [Page 4 - Dependent t-Tests](#DS102L7_page_4)
    * [Page 5 - Key Terms and R Commands](#DS102L7_page_5)
    * [Page 6 - Hands On](#DS102L7_page_6)    

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 1 - Introduction<a class="anchor" id="DS102L7_page_1"></a>

[Back to Top](#DS102L7_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

In [1]:
from IPython.display import VimeoVideo
# Tutorial Video Name: T-Tests
VimeoVideo('331822046', width=720, height=480)

The transcript for the above overview video **[is located here](https://repo.exeterlms.com/documents/V2/DataScience/Video-Transcripts/DSO102L07overview.zip)**.

In this lesson, you will use a built-in function ```t.test()``` that automates all three types of *t*-tests. Pay close attention to the details, because all that sets these three tests apart in R are the different arguments that they use.  

By the end of this lesson, you should be able to perform: 

* Single sample *t*-tests
* Independent *t*-tests
* Dependent *t*-tests
* A check of the *t*-test assumption of normality using a histogram and/or Q-Q plot

Your hands on for this lesson will put both your data manipulation and your *t*-test skills to the test as you determine how temperatures have changed in New Haven over the years.

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 2 - Single Sample t-Tests<a class="anchor" id="DS102L7_page_2"></a>

[Back to Top](#DS102L7_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


In [2]:
from IPython.display import VimeoVideo
# Tutorial Video Name: T-Tests
VimeoVideo('328682696', width=720, height=480)

The transcript for the above topic tutorial video **[is located here](https://repo.exeterlms.com/documents/V2/DataScience/Video-Transcripts/DSO102-L07-pg2tutorial.zip)**.

# Hypothesis Tests on the Mean

One of the difficulties of learning statistics and analysis is that there is a bewildering array of hypothesis tests; each one applies to slightly different circumstances. They are all useful and important, but it can be a real trial trying to figure out which test applies to your particular data set.

In this section, you will use R to implement hypothesis tests for the following circumstances:

1.  Single Sample *t*-test: Testing the mean of a single population
>Based on a sample, is the mean of a population equal to a value or not equal to a value?

2.  Independent *t*-test: Testing the mean of two populations 
>Based on two samples, are the means of two populations equal to each other?

3.  Dependent *t*-test: Testing the mean of two populations using paired data
>Based on two samples, where each observation in one sample is related to a specific observation in the other sample, are the means of the two populations equal to each other?
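
As a rough preview, the sketch below shows how these three circumstances map onto arguments of ```t.test()```. The vectors ```before``` and ```after``` and the reference value of 50 are simulated placeholders, not data from this lesson.

```{r}
# Simulated placeholder data (hypothetical; not one of the lesson's data sets)
set.seed(42)
before <- rnorm(30, mean = 50, sd = 5)
after  <- before + rnorm(30, mean = 2, sd = 3)

# 1. Single sample: compare one sample's mean to a fixed value with mu=
one_sample <- t.test(before, mu = 50)

# 2. Independent: pass two samples; unequal variances are assumed by default
independent <- t.test(before, after)

# 3. Dependent: pass two samples of equal length and set paired = TRUE
dependent <- t.test(after, before, paired = TRUE)

# The method element records which test R actually ran
print(c(one_sample$method, independent$method, dependent$method))
```

Checking the ```method``` element is a quick way to confirm that the arguments you supplied produced the test you intended.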

---

## T-test for One Sample

You will start with scenario one: testing the mean of a single population. 

You can use the ```t.test()``` function to perform a hypothesis test on the mean. Recall that to perform a hypothesis test, you create a null hypothesis and then determine if your data give you evidence to reject the null hypothesis.

For this example, you will use the **[frostedflakes data set](https://repo.exeterlms.com/documents/V2/DataScience/Stat-Prog-R/frostedflakes.zip)**. Once you import this data, you should be able to access the ```frostedflakes``` dataset:

```{r}
head(frostedflakes)
```

```text
   Lab IA400
1 36.3  35.1
2 33.2  35.9
3 39.0  40.1
4 37.3  35.5
5 40.7  37.9
6 38.4  39.5
```

This is a data frame with two variables: ```Lab```, which contains the percentage of sugar measured in a 25 gram sample of Frosted Flakes using a laboratory high performance liquid chromatography technique, and ```IA400```, which contains the percentage of sugar in the same sample measured by a machine (the Infra-Analyzer 400).

According to the nutritional information supplied with Frosted Flakes, the sugar percentage by weight is 37%. You will create a hypothesis test to see if the data set provides evidence to the contrary. To set up the hypothesis test, define the null and alternate hypotheses as follows:

![H sub zero, mu equals thirty seven. H sub one, mu does not equal thirty seven.](Media/L08-OnePopH.png)

You will now see if your data provide evidence that you should reject the null hypothesis by using a *t*-test. In R, you will use the following commands: 

```{r}
t_obj <- t.test(frostedflakes$Lab, mu = 37)
print(t_obj)
```

You will save the object returned by ```t.test()``` in the variable ```t_obj```, then print this object. The first argument to ```t.test()``` is the data to test, written as the name of the dataset, ```frostedflakes```, followed by ```$``` and the variable in that dataset, ```Lab```. The second argument is ```mu=```. Remember that mu is the population mean, so it is the number you want to test against. In this case, you want to see whether the mean sugar percentage is 37% or not, so that will be your ```mu=``` value here. This code yields the following output:

```text
        One Sample t-test

data:  frostedflakes$Lab
t = 2.4155, df = 99, p-value = 0.01755
alternative hypothesis: true mean is not equal to 37
95 percent confidence interval:
 37.10642 38.08558
sample estimates:
mean of x
   37.596
```

The object includes a lot of information. For your current purposes of doing a hypothesis test, you will focus on the following two lines:

```text
t = 2.4155, df = 99, p-value = 0.01755
alternative hypothesis: true mean is not equal to 37
```

The second line confirms that you have set up the test correctly: the alternative hypothesis is that mu, the true mean, is not equal to 37. The first line gives you a lot of information. The first part tells you that the *t*-score computed using the hypothesized mean of 37 is 2.4155. The middle part gives the degrees of freedom, 99, which for a single sample *t*-test is the sample size minus one.

The last part gives you the *p*-value for this test: 0.01755. The *p*-value is an indication of whether the data provide evidence that you should reject the null hypothesis. The smaller the *p*-value, the stronger the evidence that you should reject the null hypothesis.

Generally, data scientists compare the *p*-value to a threshold of 0.05 to determine whether to reject the null hypothesis; in this case, the *p*-value is 0.01755, which is smaller than 0.05, so you reject the null hypothesis. You conclude that the data provide evidence that the mean percentage of sugar in Frosted Flakes is not equal to 37%.

The fact that the mean of the measured values is 37.596 gives you some evidence that the percentage of sugar is somewhat higher than 37%.
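
To see where the numbers on the ```t = ..., df = ..., p-value = ...``` line come from, the sketch below recomputes them by hand and compares them with ```t.test()```. It uses a simulated vector ```x``` and a hypothesized mean ```mu0``` as stand-ins (both are assumptions for illustration), since the calculation is the same for any numeric sample.

```{r}
# Simulated stand-in for a column such as frostedflakes$Lab (hypothetical data)
set.seed(1)
x   <- rnorm(100, mean = 37.5, sd = 2.5)
mu0 <- 37                                # hypothesized population mean

n      <- length(x)
se     <- sd(x) / sqrt(n)                # standard error of the sample mean
t_stat <- (mean(x) - mu0) / se           # t-score
df     <- n - 1                          # degrees of freedom
p_val  <- 2 * pt(-abs(t_stat), df)       # two-sided p-value

res <- t.test(x, mu = mu0)

# The manual values should agree with the components stored in the result
print(c(manual_t = t_stat, t_test_t = unname(res$statistic)))
print(c(manual_p = p_val, t_test_p = res$p.value))
```

The *t*-score measures how many standard errors the sample mean lies from the hypothesized mean, which is why a larger distance gives a smaller *p*-value.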

You can also see a graphical interpretation of this test. The plot below is a histogram of the data with the 95% confidence interval computed by ```t.test()``` in red and the value of mu for the null hypothesis in green. You've seen the histogram code before; the three ```geom_vline()``` lines are new. The function ```geom_vline()``` plots vertical lines on the graph at the values given by the ```xintercept=``` argument. You can pull these values directly from the object returned by ```t.test()``` by providing the object name, then ```$```, then the name of the component. ```conf.int[1]``` is the lower confidence limit, ```conf.int[2]``` is the upper confidence limit, and ```null.value``` is the hypothesized mean.

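```{r}
library(ggplot2)  # provides ggplot(), geom_histogram(), and geom_vline()
d <- ggplot(frostedflakes, aes(x = Lab))
d + geom_histogram(binwidth = 1) +
  geom_vline(xintercept = t_obj$conf.int[1], color = "red") +   # lower confidence limit
  geom_vline(xintercept = t_obj$conf.int[2], color = "red") +   # upper confidence limit
  geom_vline(xintercept = t_obj$null.value, color = "green")    # hypothesized mean
```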

This code will yield this image:

![A histogram of data with the 95% confidence interval computed by t dot test open parentheses close parentheses in red. The value of mu for the null hypothesis is in green.](Media/L08-HistPlusCIT-Test.png)

Note that the value of mu for the null hypothesis (which is 37) is not in the 95% confidence interval. A 95% confidence interval contains exactly those values of mu that a two-sided test would not reject at the 0.05 level, so 37 falling outside the interval matches the *p*-value being below 0.05. This tells you that you have evidence to reject the null hypothesis, which states that the true value of the mean is 37.
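
The link between the confidence interval and the *p*-value can be checked directly. This sketch, run on a simulated sample ```x``` (an assumption made only for illustration), tests several hypothesized means and compares two things for each: whether the value falls outside the 95% confidence interval, and whether the two-sided *p*-value is below 0.05.

```{r}
# Simulated sample (hypothetical data)
set.seed(7)
x <- rnorm(100, mean = 37.6, sd = 2.5)

# The confidence interval for the mean does not depend on the mu= argument
ci <- t.test(x)$conf.int

# Try several hypothesized means and record the p-value for each
candidates <- c(36.5, 37, 37.5, 38, 38.5)
p_values   <- sapply(candidates, function(m) t.test(x, mu = m)$p.value)

outside_ci <- candidates < ci[1] | candidates > ci[2]
rejected   <- p_values < 0.05

print(data.frame(mu = candidates, p_value = round(p_values, 4),
                 outside_ci, rejected))
```

The ```outside_ci``` and ```rejected``` columns should match row for row, which is why checking the interval and checking the *p*-value lead to the same decision.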

When using the *t*-test, it is always a good idea to check to see if your data have an approximate normal distribution. You can check to see if the ```Lab``` variable in the ```frostedflakes``` data set is normal by creating the normal probability plot for it:

```{r}
ggplot(frostedflakes, aes(sample = Lab)) + geom_qq()
```

![A normal probability plot. The x axis is labeled theoretical and runs from approximately negative two point seven five to approximately two point seven five in increments of one starting at negative two. The y axis is labeled sample and runs from approximately twenty eight to forty eight. Data is plotted in more or less a straight line running from the bottom left to the upper right.](Media/L08-NormalProbPlotFlakes.png)

The data fall pretty much on a straight line, so it is reasonable to treat them as approximately normal, and the hypothesis test is built on solid assumptions.  This can also be seen in the histogram above; it looks approximately bell shaped.
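
If you want a numeric companion to the visual check, one option is to correlate the points of the Q-Q plot: values close to 1 mean the points lie close to a straight line. The sketch below does this with base R's ```qqnorm()``` on simulated data. The helper name ```qq_correlation``` is made up for this illustration, and the result is an informal guide rather than a formal test.

```{r}
# Hypothetical helper: correlation between theoretical and sample quantiles
qq_correlation <- function(x) {
  qq <- qqnorm(x, plot.it = FALSE)   # compute the Q-Q points without plotting
  cor(qq$x, qq$y)
}

# Simulated examples: one roughly normal sample and one clearly skewed sample
set.seed(3)
normal_like <- rnorm(100, mean = 37.5, sd = 2.5)
skewed      <- rexp(100, rate = 1)

print(c(normal_like = qq_correlation(normal_like),
        skewed      = qq_correlation(skewed)))
```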

---

## Review
Below is a quiz to review the recently covered material. Quizzes are _not_ graded.

In [1]:
try:
    from DS_Students import MultipleChoice
    from ipynb.fs.full.DS102Questions import *
except:
    !pip install DS_Students
    from DS_Students import MultipleChoice
    from ipynb.fs.full.DS102Questions import *

In [2]:
try:
    display(L7P2Q1, L7P2Q2, L7P2Q3, L7P2Q4)
except:
    pass

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 3 - Independent t-Tests<a class="anchor" id="DS102L7_page_3"></a>

[Back to Top](#DS102L7_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


In [3]:
from IPython.display import VimeoVideo
# Tutorial Video Name: T-Tests
VimeoVideo('328682607', width=720, height=480)

The transcript for the above topic tutorial video **[is located here](https://repo.exeterlms.com/documents/V2/DataScience/Video-Transcripts/DSO102-L07-pg3tutorial.zip)**.

# *t*-Test for Two Independent Samples

You will use the frosted flakes data again. Suppose you now want to determine if the measurements made by the IA-400 give us the same average values as the lab measurements. To do this, you will think of the measurements in the data set as coming from two populations: one population is the lab measurements, and the other is the IA-400 measurements. 

Define the null and alternative hypotheses as follows:

![H sub zero, mu sub one equals mu sub two. H sub one, mu sub one does not equal mu sub two.](Media/L08-TwoPopH.png)

The null hypothesis is that the two population means are equal, and the alternative hypothesis is that the two population means differ.  This is a two-sided *t*-test.

You will use ```t.test()``` to test the two means; in this case, each sample is passed as a separate argument to ```t.test()```. As before, you save the object returned by ```t.test()``` into a variable (this one is named ```t_ind```), then print that object. Two other arguments are included in ```t.test()``` as well.  The first is ```alternative=```.  Your options are ```"two.sided"``` for a two-tailed hypothesis test, or ```"greater"``` and ```"less"``` for a one-tailed test, depending on the direction you are hypothesizing.  The last argument is ```var.equal=```, and its only options are ```TRUE``` and ```FALSE```.  This is very similar to the independent *t*-test options in MS Excel: one type is for when the two groups have equal variances (homogeneity of variance), and the other is for when they have unequal (heterogeneous) variances.  If you don't know whether the variances are equal, use the ```FALSE``` option.  If you leave off the ```var.equal=``` argument altogether, ```FALSE``` is the default.

```{r}
t_ind <- t.test(frostedflakes$Lab, frostedflakes$IA400, alternative="two.sided", var.equal=FALSE)
print(t_ind)
```

The code above yields the following output.  You get a reminder that you have chosen unequal variance, because this is a *Welch Two Sample t-Test*:

```text
        Welch Two Sample t-test

data:  frostedflakes$Lab and frostedflakes$IA400
t = -1.6699, df = 195.08, p-value = 0.09654
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1.3565905  0.1125905
sample estimates:
mean of x mean of y
   37.596    38.218
```

The output here reminds you what data you ran (handy if you are running lots of tests in quick succession!) and then provides information about the *t*-test itself, including the *t* and *p* values associated with the test and the degrees of freedom. Remember that if the *p* value is less than 0.05, you can reject the null hypothesis and conclude that there is a significant difference between the groups.  Since your *p* value is 0.09654, which is greater than 0.05, you fail to reject the null hypothesis: the data do not show a statistically significant difference between the sugar content measured in the lab and with the machine.

The next line helps you determine that you have set up the test correctly. Your alternate hypothesis is that the two population means are not equal; if they are not equal, the difference of one subtracted from the other will not be zero.
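
To get a feel for what ```var.equal=``` and ```alternative=``` change, the sketch below runs the same comparison three ways. The two groups are simulated with different sizes and spreads, and the names ```group_a``` and ```group_b``` are placeholders for illustration.

```{r}
# Simulated groups with unequal sizes and variances (hypothetical data)
set.seed(11)
group_a <- rnorm(40, mean = 37.5, sd = 2)
group_b <- rnorm(50, mean = 38.2, sd = 3)

# Welch test (unequal variances) versus pooled test (equal variances assumed)
welch  <- t.test(group_a, group_b, alternative = "two.sided", var.equal = FALSE)
pooled <- t.test(group_a, group_b, alternative = "two.sided", var.equal = TRUE)

# One-tailed version: is the mean of group_a less than the mean of group_b?
one_sided <- t.test(group_a, group_b, alternative = "less", var.equal = FALSE)

# The pooled test uses n1 + n2 - 2 degrees of freedom; Welch adjusts them
print(c(welch_df = unname(welch$parameter), pooled_df = unname(pooled$parameter)))
print(c(two_sided_p = welch$p.value, less_p = one_sided$p.value))
```

The fractional degrees of freedom in the Welch result are the same feature you see in the ```df = 195.08``` output above.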

---

## Graphing Data for an Independent *t* Test

You can see graphically what is going on in this test by creating a box plot for the ```Lab``` values and comparing it to a box plot for the ```IA400``` values. To create this plot with ```ggplot()```, you need the data values to be in one column of the data frame, and a label of whether it was a measurement taken in the lab or by machine in the other column of the data frame. The simplest way to create such a data frame is with the ```melt()``` function from the ```reshape2``` package. 

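First install the ```reshape2``` package if you have not already done so, then load it and reshape the data:

```{r}
# install.packages("reshape2")  # run once if the package is not yet installed
library(reshape2)
# Stack the Lab and IA400 columns into variable/value pairs
ff <- melt(frostedflakes, measure.vars = c("Lab", "IA400"))
```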

The result from this command is: 

![The result, shown in a data frame, from using the melt function from the re shape two package. There are two columns, variable and value, and seventeen rows. The variable for each row is always lab, and the value for each row varies. The values range from a low of thirty three point two to a high of forty three point five.](Media/correlation12.png)

See how the columns are no longer labeled ```Lab``` and ```IA400```? The column each measurement came from is now recorded in the ```variable``` column, and the measurement itself is in a new column labeled ```value```.
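
For a feel of what this reshaping does without installing anything, here is a small base R sketch using ```stack()```, which produces a similar long layout (its columns are named ```values``` and ```ind```, so they are renamed here to match). The data frame ```wide``` holds just the first three rows of the ```head()``` output shown earlier and is built by hand for the illustration.

```{r}
# First three rows of frostedflakes, typed in by hand for illustration
wide <- data.frame(Lab   = c(36.3, 33.2, 39.0),
                   IA400 = c(35.1, 35.9, 40.1))

# stack() puts all values in one column and the source column name in another
long <- stack(wide)
names(long) <- c("value", "variable")   # rename to match melt()'s column names

print(long)
```

Each of the three original rows now appears twice, once per measurement method, which is the layout ```ggplot()``` needs to draw one box per group.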

With your data happily reformatted, you can then proceed onto the box plot: 

```{r}
ggplot(ff) + geom_boxplot(aes(x = variable, y = value)) +
xlab("Test Method") + ylab("Percentage of Sugar")
```

![A box plot for two groups, lab and I A four hundred. The y axis is labeled percentage of sugar and ranges from thirty to just above forty five. There are two box plots, one for the lab group and one for the I A four hundred group. The IA400 group has a somewhat different median value than the Lab group, and the IA400 group has a larger variation as evidenced by the longer whiskers and the larger box. However, there is not a large difference in the median values.](Media/L08-BoxPlotsFlakes.png)

As you can see from this plot, the ```IA400``` group has a somewhat different median value than the ```Lab``` group, and the ```IA400``` group has a larger variation as evidenced by the longer whiskers and the larger box. However, there is not a large difference in the median values. So it makes sense that the *t*-test would not indicate that there is strong evidence that the two means are different.

---

## Review
Below is a quiz to review the recently covered material. Quizzes are _not_ graded.

In [3]:
try:
    display(L7P3Q1, L7P3Q2, L7P3Q3, L7P3Q4)
except:
    pass

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 4 - Dependent t-Tests<a class="anchor" id="DS102L7_page_4"></a>

[Back to Top](#DS102L7_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


In [4]:
from IPython.display import VimeoVideo
# Tutorial Video Name: T-Tests
VimeoVideo('328682569', width=720, height=480)

The transcript for the above topic tutorial video **[is located here](https://repo.exeterlms.com/documents/V2/DataScience/Video-Transcripts/DSO102-L07-pg4tutorial.zip)**.

# Paired *t*-Test for Two Samples

Sometimes, you have two samples in which each value in one sample has a connection with a value in the other sample. In this section you will use the following example of two connected populations: 

>An instructor is required to find out how much students learn in her course. At the beginning of the course, she randomly selects a group of students in her course, and gives each student in this group a pre-test over the class material and records their scores. At the end of the course, she gives each student in the group the same test as a post-test.

She wants to see if the difference in scores is statistically significant; in other words, she wants to see if, on average, each student's scores improved. In this case, you have two populations: her class at the beginning of the course and her class at the end of the course. The populations are made up of the same people, but hopefully their level of expertise is different at the end of the course than at the beginning.

When there is a one-to-one relationship between elements of two populations, a dependent, or paired, *t*-test is appropriate to determine whether the means of the two populations are different. In this case, you will call the students at the beginning of the course Population 1, and the students at the end of the course Population 2. Your hypothesis test looks the same as in the previous section:

![H sub zero, mu sub two minus mu sub one equals zero. H sub one, mu sub two minus mu sub one does not equal zero.](Media/L08-TwoPopHPaired.png)

The difference, however, is that the two samples are paired.

The scores for the tests are in **[this file](https://repo.exeterlms.com/documents/V2/DataScience/Stat-Prog-R/scores.zip)**. 

Read it into a data frame using code or the import wizard, and store this data frame in the variable ```scores``` as follows:

```{r}
scores <- read.csv("scores.csv")
head(scores)
```

<div class="panel panel-danger">
    <div class="panel-heading">
        <h3 class="panel-title">Caution!</h3>
    </div>
    <div class="panel-body">
        <p>You will get an error when you import the data if the file is not in your current working directory. As you learned previously, you can change your working directory by using the menu: Session -> Set Working Directory -> Choose Directory.</p>
    </div>
</div>

Here's the head of the data:

```text
  Student.Number pretest postest
1              1      13      24
2              2      10      17
3              3      11      22
4              4      14      21
5              5      16      25
6              6      10      23
```

This shows you that the data frame has three variables: ```Student.Number```, ```pretest```, and ```postest```. ```pretest``` and ```postest``` contain the beginning and ending test scores.

Use ```t.test()``` to perform the test on the two samples.  The very crucial part of this code, which sets it apart from its Single Sample and Independent fellows, is the ```paired=``` argument. It must be set to ```TRUE``` or an independent *t* test will be run instead. Here is the code:

```{r}
t_dep <- t.test(scores$postest, scores$pretest, paired = TRUE)
t_dep
```

And the output that is provided by R: 

```text
        Paired t-test

data:  scores$postest and scores$pretest
t = 18.569, df = 33, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 7.411558 9.235501
sample estimates:
mean of the differences
               8.323529
```

The meat and potatoes of this output is really these two lines:

```text
t = 18.569, df = 33, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
```

The second line verifies that the alternative hypothesis is that the difference in the population means is not equal to zero, or equivalently that the two population means are not equal to each other. The first line gives you the *p*-value; you reject the null hypothesis if the *p*-value is sufficiently small. The *p*-value for this test is so small that R reports only an upper bound in scientific notation; written as a decimal, it is less than 0.00000000000000022. This is a very small number, and in particular, it is much smaller than 0.05, so you can say that the data give you strong evidence that you can reject the null hypothesis and that the two population means are not equal.
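
One way to understand what ```paired = TRUE``` does is to note that a dependent *t*-test gives the same result as a single sample *t*-test on the per-student differences, tested against a mean of zero. The sketch below checks this on simulated scores; ```pre``` and ```post``` are made-up vectors, not the ```scores``` data.

```{r}
# Simulated pre/post scores for 34 students (hypothetical data)
set.seed(21)
pre  <- round(rnorm(34, mean = 12, sd = 2.5))
post <- pre + round(rnorm(34, mean = 8, sd = 2.5))

# Dependent t-test on the two paired samples
paired_res <- t.test(post, pre, paired = TRUE)

# Single sample t-test on the differences, against a mean of zero
diff_res <- t.test(post - pre, mu = 0)

print(c(paired_t = unname(paired_res$statistic),
        diff_t   = unname(diff_res$statistic)))
print(c(paired_df = unname(paired_res$parameter),
        diff_df   = unname(diff_res$parameter)))
```

This is also why the degrees of freedom for a dependent *t*-test are the number of pairs minus one, rather than being based on the combined size of both samples.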

---

## Graphing Data for a Dependent *t* Test

You will graph the dependent *t*-test data the same way you did the independent *t*-test. First you'll need to reshape the data using ```melt()```.  In the ```frostedflakes``` dataset, the only columns to stack were ```Lab``` and ```IA400```.  Here, the added variable ```Student.Number``` identifies each student and should not be stacked along with the scores, so you need to let R know that you want to fill the ```variable``` column with ```pretest``` and ```postest``` only.  This is done using the ```measure.vars=``` argument; every other column, such as ```Student.Number```, is kept as an identifier.

```{r}
library(reshape2)
ss <- melt(scores, measure.vars = c("pretest", "postest"))
```

And here is how the data ends up being formatted: 

![A data frame showing three columns, student number, variable, and value. There are seventeen rows. The student number for each row is the same as the row number, one through seventeen. The variable for each row is pretest. The value for each row varies, from a low of seven to a high of sixteen.](Media/correlation13.png)

See how ```Student.Number``` is untouched, but you now have those same ```variable``` and ```value``` columns? Now you're ready to plot your data. You can make box plots of the pretest and postest data values as follows:

```{r}
ggplot(ss) + geom_boxplot(aes(x = variable, y = value)) +
xlab("Test") + ylab("Score")
```

And here is the result! You can see that rejecting the null hypothesis was clearly the right decision. The median of the postest scores is much higher than the median of the pretest scores. 

![A box plot for two tests, pretest and postest. The y axis is labeled score and runs from just under five to twenty five. The median of the pretest scores is approximately eleven. The median of the post test scores is much higher, approximately twenty two.](Media/L08-BoxPlotsScores.png)

---

### Graphing the Difference Scores

Another way to graph these data is to compute the difference between the ```postest``` score and the ```pretest``` score for each student, and create a histogram for this difference. You can do this as follows:

```{r}
dd <- scores$postest - scores$pretest
df <- data.frame(dd)
ggplot(df, aes(x = dd)) + geom_histogram(binwidth = 1) +
xlab("Difference between postest and pretest")
```

The first part of the code subtracts the ```pretest``` scores from the ```postest``` scores, and then it is turned into a data frame with the command ```data.frame```. Then it's business as usual for a histogram, with this result: 

![A histogram. The x axis is labeled difference between postest and pretest and runs from zero to approximately eighteen. The y axis is labeled count and runs from zero to six. Vertical bars of various heights are plotted.](Media/L08-HistScores.png)

From this histogram of differences, you can see that the post-test score for most students was between 5 and 13 points higher than the pretest score.

---

## Summary

*t*-tests are quite handy when your sample size is relatively small and you are looking to determine a difference between means.  The type of *t*-test depends on the means you want to compare.  If you are comparing a sample to a population mean, then choose a single sample *t*-test.  If you are comparing two unrelated samples, then choose an independent *t*-test.  And if you are comparing two related samples, then you'll want to choose a dependent *t*-test.

R makes *t*-testing so very easy with the function ```t.test()```, which can handle all three types. It also provides you with means and confidence intervals for your *t*-tests, which is a big help.   

---

## Review
Below is a quiz to review the recently covered material. Quizzes are _not_ graded.

In [5]:
try:
    display(L7P4Q1, L7P4Q2)
except:
    pass

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 5 - Key Terms and R Commands <a class="anchor" id="DS102L7_page_5"></a>

[Back to Top](#DS102L7_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Key Terms

Below is a list and short description of the important keywords learned in this lesson. Please read through and go back and review any concepts you do not fully understand. Great Work!

<table class="table table-striped">
    <tr>
        <th>Keyword</th>
        <th>Description</th>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Welch's Two Sample t-Test</td>
        <td>A type of t-test done when variance is unequal.</td>
    </tr>
</table>

---

# Key R Commands

<table class="table table-striped">
    <tr>
        <th>Keyword</th>
        <th>Description</th>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>t.test()</td>
        <td>Computes a t-test.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>mu=</td>
        <td>Argument to t.test() that specifies the population mean for a single sample t-test.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>geom_vline()</td>
        <td>Layer added to a ggplot() plot that creates a vertical line on the plot.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>xintercept=</td>
        <td>Argument to geom_vline() that specifies where on the x axis the vertical line should be placed.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>alternative=</td>
        <td>Argument to t.test() that specifies whether the hypothesis is one or two sided. Options are: "two.sided", "greater", or "less". </td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>var.equal=</td>
        <td>Argument to t.test() that specifies whether the test should assume equal variance or not.  FALSE does not assume equal variance, while TRUE does.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>library(reshape2)</td>
        <td>Loads the package reshape2, which has many functions for wrangling data into the right format. </td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>melt()</td>
        <td>Reshapes data from wide to long format: the values from several labeled columns are stacked into one column, and a second column records which original column each value came from.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>paired=TRUE</td>
        <td>Argument to t.test() that specifies this is a dependent t-test. </td>
    </tr>
</table>


<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 6 - Lesson 7 Hands On <a class="anchor" id="DS102L7_page_6"></a>

[Back to Top](#DS102L7_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

In [6]:
from IPython.display import VimeoVideo
# Tutorial Video Name: T-Tests
VimeoVideo('411258489', width=720, height=480)

For this Hands-On, you will answer several questions. This Hands-On **will** be graded, so be sure you complete all requirements. Please provide your R script file as well as a document that discusses the bullet points below.

<div class="panel panel-danger">
    <div class="panel-heading">
        <h3 class="panel-title">Caution!</h3>
    </div>
    <div class="panel-body">
        <p>Do not submit your project until you have completed all requirements, as you will not be able to resubmit.</p>
    </div>
</div>

---
## Requirements

The ```nhtemp``` built-in data set gives the mean annual temperature in New Haven, CT, for the years 1912 to 1971.

Suppose you want to test whether the average over the first 25 years (1912 to 1936) of the data is statistically significantly different from the average over the last 25 years (1947 to 1971) of the data set. You create two vectors from the data set: ```first25``` and ```last25```, using the following code:

```{r}
first25 <- nhtemp[1:25]
last25 <- nhtemp[36:60]
```

Compute a test to see if these two vectors have the same mean. 

What type of *t*-test should you use? 

* Single sample
* Independent 
* Dependent 

<div class="cc-content-answer">
   <p><strong>Hover your mouse pointer here to check your answer</strong>.</p>
   <div class="well cc-content-answer-hidden">
       The correct answer is Independent t-test! The two vectors cover different years, so there is no one-to-one pairing between their values.
   </div>
</div>

Within a text document, discuss the following: 

* The problem to be solved
* The hypotheses 
* The results of the hypothesis test and the conclusion
>What is the *p*-value for this test? Based on this *p*-value, do you reject the null hypothesis?

<div class="panel panel-danger">
    <div class="panel-heading">
        <h3 class="panel-title">Caution!</h3>
    </div>
    <div class="panel-body">
        <p>Be sure to zip and submit all your files when finished!</p>
    </div>
</div>

<div class="panel panel-info">
    <div class="panel-heading">
        <h3 class="panel-title">Tip!</h3>
    </div>
    <div class="panel-body">
        <p>To zip your file on <b>Windows</b>, right click on the file and select "Send to", then select "Compressed (zipped) folder". For <b>Mac</b> users, right click on the file and select "Compress", then select your file from the options.</p>
    </div>
</div>