# Demo of Subsetting in R

We'll look at two studies that attempted to answer causal questions, the first on sleep deprivation and the second on nearsightedness. 

For a causal question, we will need to compare the ''effect'' between two groups, one with the potential cause and one without.  Because we're dealing with variable ''effects'', we need to summarize over the observational units in a group, for example, by computing the average effect.

One way to split the data into subgroups and then summarize each one is with the ''tapply'' function in R.
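As a minimal sketch of the idea, the code below applies ''tapply'' to a tiny made-up data frame. The name ''toy'' and its columns and values are invented for illustration and are not part of either study.

```r
# Made-up toy data (not from either study): a numeric outcome and a grouping variable
toy <- data.frame(
  score = c(10, 12, 14, 20, 22),
  group = c("a", "a", "a", "b", "b")
)

# Split the scores by group, then compute the mean of each piece
group_means <- tapply(toy$score, toy$group, mean)
group_means   # a: 12, b: 21
```

The result has one value per group, labeled with the group name.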

## Example 1: Reaction time and sleep deprivation

Dataset Description: 21 students took a test to determine their speed in responding to a visual stimulus.  They were then randomly assigned to be either sleep deprived or allowed to sleep without restriction.  Three days later, the students took the reaction time test again and the  improvement in reaction time was recorded.

Step 1: Load the data in R via the ''read.csv'' function. The input to this function is the name of the file containing the dataset.  The output is a data.frame in R that is stored in ''sleep''. 

In [None]:
sleep <- read.csv("sleep.csv")

Step 2: Check that the dataset was loaded correctly in R.  There are a couple of ways to do this: for example, the ''str'' and ''head'' functions both show the first few values of each variable. 

In [None]:
str(sleep)

In [None]:
head(sleep)

Step 3: If the dataset wasn't read in correctly, make the appropriate corrections.  Generally, it's better to make the correction in R rather than editing the dataset file. 

In this example, no corrections need to be made.

Step 4: Compute the mean improvement in reaction time by sleep group.

There are many ways to do this in R.  One way is to first create a subset of the dataset for each group and then compute the mean for each subset. The code below illustrates this method:

In [None]:
sleep_deprived <- subset(sleep, subset = (sleepcondition == "deprived"))

In [None]:
sleep_deprived

In [None]:
sleep_deprived$improvement

In [None]:
mean(sleep_deprived$improvement)

Another subset would need to be created for the group of unrestricted sleepers and the mean computed.  Try copying the code above and edit it to do this.

A second method is with the ''tapply'' function.  This method looks complicated initially, but can save time once you understand the idea of multiple inputs to a function.

In [None]:
tapply(sleep$improvement, sleep$sleepcondition, mean)
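To make the "multiple inputs" idea concrete, here is a small sketch that writes out the three inputs of ''tapply'' by name (''X'', ''INDEX'', ''FUN'') and then swaps only the third one. The data frame ''toy_sleep'' and its values are invented; only the column names mirror the sleep example.

```r
# Hypothetical toy data; the values are invented for illustration
toy_sleep <- data.frame(
  improvement    = c(5, 7, 9, 20, 24),
  sleepcondition = c("deprived", "deprived", "deprived",
                     "unrestricted", "unrestricted")
)

# X = values to summarize, INDEX = grouping variable, FUN = summary function
mean_by_group <- tapply(X = toy_sleep$improvement,
                        INDEX = toy_sleep$sleepcondition,
                        FUN = mean)

# Same values and groups, different summary: count the rows in each group
n_by_group <- tapply(X = toy_sleep$improvement,
                     INDEX = toy_sleep$sleepcondition,
                     FUN = length)

mean_by_group   # deprived: 7, unrestricted: 22
n_by_group      # deprived: 3, unrestricted: 2
```

Only the function in the third input changes between the two calls, which is why this form can save typing compared with building one subset per group.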

Briefly comment on what you learned about sleep deprivation from this dataset.

## Example 2: Nearsightedness and Nightlight Use with Infants

Now let's turn to nearsightedness and nightlight use with infants...

In this dataset, the variable Nightlight was coded as 1 if a nightlight was used, 0 if not.



Step 1: Read the dataset ''nightlight.csv'' into R using the ''read.csv'' function.  

The code below gives an error.  Use example 1 to figure out and correct the error.

In [None]:
night <- read("nightlight.csv")

Step 2: Check that the dataset was correctly loaded in R using the ''str'' and ''head'' functions. Please refer to example 1 as needed. 

In [None]:
str(night)

Question: what is one difference between the nightlight and sleep datasets? What is one similarity?

Step 3: Correct any mistakes when the dataset was loaded in R.   

In this example, one variable is NOT correctly loaded in R: the variable ''Nightlight'' has possible values of 0 or 1, so R treats it as an ''int'', or integer variable.  In fact, the values don't represent the numbers 0 and 1; they are just placeholders for the two types of nightlight use.

The code below shows how to change the nightlight variable to a ''factor'' in R, with appropriate and informative labels.

In [None]:
night$Nightlight <- factor(night$Nightlight)

In [None]:
levels(night$Nightlight)

In [None]:
levels(night$Nightlight) <- c("no nightlight", "some nightlight")

In [None]:
levels(night$Nightlight)
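As an alternative sketch, the conversion and the relabeling can be done in one call by giving ''factor'' both the stored codes (''levels'') and the names to display (''labels''). The vector ''codes'' below is made up and simply stands in for a 0/1 variable like ''Nightlight''.

```r
# Invented 0/1 codes standing in for a variable like Nightlight
codes <- c(0, 1, 1, 0, 1)

# levels = the codes as stored; labels = the names to show, in the same order
use <- factor(codes,
              levels = c(0, 1),
              labels = c("no nightlight", "some nightlight"))

levels(use)   # "no nightlight" "some nightlight"
table(use)    # no nightlight: 2, some nightlight: 3
```

Pairing each code with its label explicitly makes it harder to attach the labels in the wrong order.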

Now let's double-check our dataset...

In [None]:
str(night)

Step 4: Summarize the outcome variable per group.

First, let's try the shortest method of summarizing from example 1 with the ''tapply'' function:

In [None]:
tapply(night$Nearsighted, night$Nightlight, mean)

“argument is not numeric or logical: returning NA”
“argument is not numeric or logical: returning NA”


What warning did you get?  Why does it make sense?

The code below shows an appropriate way to summarize non-numeric (categorical) variables in R.

In [None]:
table(night$Nearsighted, night$Nightlight)

In [None]:
tab <- table(night$Nearsighted, night$Nightlight)
tab

In [None]:
prop.table(tab)

In [None]:
prop.table(tab, margin=1)

In [None]:
prop.table(tab, margin=2)

Three ways of going from counts to proportions are shown in the code above.  Which line do you think gives the most appropriate summary for the research question, "Does nightlight use increase nearsightedness?"
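To see how the three calls differ, it may help to try them on a small table where the totals are easy to check by hand. The counts below are made up and are not the nightlight data; the names ''toy_tab'', ''outcome'', and ''group'' are hypothetical.

```r
# Made-up counts: rows are an outcome, columns are a group
toy_tab <- as.table(matrix(
  c(10, 30,
    20, 40),
  nrow = 2, byrow = TRUE,
  dimnames = list(outcome = c("no", "yes"), group = c("A", "B"))
))

overall <- prop.table(toy_tab)              # each cell divided by the grand total (100)
by_row  <- prop.table(toy_tab, margin = 1)  # each cell divided by its row total
by_col  <- prop.table(toy_tab, margin = 2)  # each cell divided by its column total

sum(overall)      # the whole table sums to 1
rowSums(by_row)   # each row sums to 1
colSums(by_col)   # each column sums to 1
```

Checking which totals add up to 1 shows which group each proportion is "out of".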