# Data Tidying and Manipulation in R
## by Diya Das, David DeTomaso, and Andrey Indukaev

### The goal
Data tidying is a necessary first step for data analysis - it's the process of taking your messily formatted data (missing values, unwieldy coding/organization, etc.) and literally tidying it up so it can be easily used for downstream analyses. To quote Hadley Wickham, "Tidy datasets are easy to manipulate, model and visualise, and have a specific structure:
each variable is a column, each observation is a row, and each type of observational unit
is a table."

These data are actually pretty tidy, so we're going to be focusing on cleaning and manipulation, but these manipulations will give you some idea of how to tidy untidy data.

### The datasets
We are going to be using the data from the R package [`nycflights13`](https://cran.r-project.org/web/packages/nycflights13/nycflights13.pdf). It contains five datasets describing flights that departed NYC in 2013. We will load the data directly into R from the package, but the repository also includes CSV files that we created for the Python demo; these can also be used to load the data into an R session.



*** If you've never run Jupyter notebooks with R, please run `conda install -c r r-essentials`

In [None]:
options(repos=structure(c(CRAN="http://cran.cnr.berkeley.edu/", 
BioCsoft="http://www.bioconductor.org/packages/release/bioc/")))
ipak <- function(pkg){
     new.pkg <- pkg[!(pkg %in% installed.packages()[, "Package"])]
     if (length(new.pkg))
         install.packages(new.pkg, dependencies = TRUE)
     sapply(pkg, require, character.only = TRUE)
 } #https://gist.github.com/stevenworthington/3178163

pkgs <- c("nycflights13",
       "tidyr",
       "dplyr",
       "reshape",
       "ggplot2",
       "data.table")
ipak(pkgs)

In [None]:
#invisible(sapply(pkgs, library, character.only=TRUE ))

## Reading data from a file
Let's read data from a file, though we won't be using it for this exercise.

In [None]:
unzip("nycflights13.zip")
list.files()
read.delim("airlines.txt")

## Inspecting a dataframe // What's in the `flights` dataset?
Let's run through an example using the `flights` dataset. This dataset includes...well what does it include? You could read the documentation, but let's take a look first.

In [None]:
data(flights)
flights <- data.frame(flights) ## nycflights13 stores its tables as tibbles (the data format dplyr introduced); convert to a plain data.frame

message('What are the first 6 rows?')
print(head(flights))
message('What are the last 6 rows?')
print(tail(flights))

message('What does the sample function do?')
print(sample(1:6,2))

message('What happens when I use sample for indexing?')
print(flights[sample(1:nrow(flights),10),]) ## what is this doing?

## Identifying and removing NAs in a dataset
We noticed some NAs above (hopefully). How do you find them and remove observations for which there are NAs? 

In [None]:
message('What are the dimensions of the flights dataframe?\n')
print(dim(flights))

message('Are there any NAs in the flights dataframe?\nAre all values NA?')
print(any(is.na(flights)))
print(all(is.na(flights)))

message('Selecting for flights where there is complete data, what are the dimensions?\n')
## complete.cases returns a logical vector indicating whether all observations in a row are not-NA.
message('Using base R...')
flights_complete <- flights[complete.cases(flights),]
print(dim(flights_complete))
message('Using tidyr...')
flights_complete2 <- drop_na(flights)
message('Are they the same datasets?')
print(identical(flights_complete,flights_complete2))

message('How might I obtain a summary of the original dataset?')
print(summary(flights))

## Performing a function along an axis // Calculating mean times

R makes it easy to apply a descriptive function along an axis of a table.

`any` and `all`, which we used earlier, are related examples: they collapse a series of logical values into a single value. `any` returns `TRUE` if *any* of the values are `TRUE`; `all` returns `TRUE` only if *all* of the values are `TRUE`.
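
Here is a minimal sketch of how `any` and `all` behave, using a made-up logical vector and matrix (`flags` and `m` are illustrative names, not part of the flights data). On their own they collapse a whole object to a single value; wrapping them in `apply` restricts the collapse to rows or columns.

```r
# Toy values invented for this illustration.
flags <- c(FALSE, TRUE, FALSE)
any_flag <- any(flags)   # TRUE: at least one element is TRUE
all_flag <- all(flags)   # FALSE: not every element is TRUE

# A 2 x 3 matrix, filled column by column, with some missing values.
m <- matrix(c(NA, 1, 2, 3, NA, NA), nrow = 2)

# apply(..., 2, f) works per column; apply(..., 1, f) works per row.
col_has_na <- apply(is.na(m), 2, any)   # does each column contain an NA?
row_all_na <- apply(is.na(m), 1, all)   # is each row entirely NA?

print(col_has_na)
print(row_all_na)
```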

What's the mean air time?

In [None]:
mean(flights_complete$air_time)

Can we calculate the mean for multiple subsets of our data at once?

In [None]:
subset <- select(flights_complete, air_time, dep_delay, arr_delay)
message('Find mean of each row...')
print(head(apply(subset,1, mean)))
message('Find mean of each column...')
print(apply(subset,2, mean))
message('Find mean of each column...yet another way...')
print(lapply(subset, mean))
message('Let\'s fix the formatting...')
print(sapply(subset, mean))
message('Using dplyr...')
subset %>%  summarise_all(mean) %>% print()

## Performing column-wise operations while grouping by other columns // Departure delay by airport of origin
Sometimes you may want to perform some aggregate function on data by category, which is encoded in another column. Here we calculate the statistics for departure delay, grouping by origin of the flight - remember this is the greater NYC area, so there are only three origins!

In [None]:
message('Using tapply...')
print(tapply(flights_complete$dep_delay, flights_complete$origin, summary))

message('That code is a bit messy, so using the with command for indicating the parent dataframe...')
print(with(flights_complete, 
           tapply(dep_delay, origin, summary)
          ))

message('Using dplyr...')
flights_complete %>% group_by(origin) %>% summarise(avg_dep_delay=mean(dep_delay)) %>% print()
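
To see the two grouping idioms side by side on something small enough to check by hand, here is a sketch on a toy data frame (the name `toy` and its values are made up for illustration). Both approaches should report the same per-origin means.

```r
library(dplyr)

# Four invented flights from two origins.
toy <- data.frame(origin = c("EWR", "JFK", "EWR", "JFK"),
                  dep_delay = c(10, 0, 20, 4),
                  stringsAsFactors = FALSE)

# Base R: tapply(values, grouping, function) returns a named vector.
base_means <- tapply(toy$dep_delay, toy$origin, mean)

# dplyr: group_by() + summarise() returns one row per group.
dplyr_means <- toy %>%
    group_by(origin) %>%
    summarise(avg_dep_delay = mean(dep_delay))

print(base_means)
print(dplyr_means)
```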

## Pipes in R: making code readable
The last sections have used the operator `%>%`. This symbol is called a pipe. It was introduced in the `magrittr` package, but `dplyr` and `tidyr` also take advantage of this syntax.

Pipes `%>%` exist because they help tidy up commands that chain several operations. When `function1` takes the output of `function2`, which in turn takes the output of `function3`, we can end up with deeply nested, difficult-to-read commands:
`function1(function2(function3(data,parameters3),parameters2),parameters1)`

Sometimes, `with` may help simplify your commands, as above, but piping can be more direct. 

`data %>% function(parameters)` is exactly the same as `function(data,parameters)`

But piped code is read (and written) from left to right rather than inside-out, which makes it easier to understand.
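
The equivalence between a nested call and a piped call can be checked directly. This is a small sketch on a made-up data frame (`toy` is an illustrative name, not one of the flights tables); it assumes `dplyr` is installed, which also makes `%>%` available.

```r
library(dplyr)  # provides filter() and re-exports %>%

# Toy data frame invented for this illustration.
toy <- data.frame(origin = c("EWR", "JFK", "LGA", "JFK"),
                  dep_delay = c(5, -2, 30, 12),
                  stringsAsFactors = FALSE)

# Nested call: read from the inside out.
nested <- head(filter(toy, dep_delay > 0), 2)

# Piped call: the same two steps, read from left to right.
piped <- toy %>% filter(dep_delay > 0) %>% head(2)

print(identical(nested, piped))
```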


## Removing NAs // Getting a `flights` dataset with no missing measurements
Let's remove rows with missing values (NA) from the `flights` dataset and then summarise the departure delay by origin in a single call, first using standard syntax and then using pipes.


In [None]:
message('Standard R Syntax (using with)')
print(with(flights[complete.cases(flights),], 
           tapply(dep_delay, origin, summary)
          )) 

message('Same operation using pipes and `dplyr`')
flights %>% filter(complete.cases(.)) %>% 
    group_by(origin) %>% 
        summarise(avg_dep_delay = mean(dep_delay)) %>% 
            print()

flights %>% drop_na() %>%
    group_by(origin) %>% 
        summarise(avg_dep_delay = mean(dep_delay)) %>% 
            print()

# Note that . stands for the piped-in data frame, so data %>% function1(function2(.)) is function1(data, function2(data))
# In a multi-line call, %>% should be at the end of each line

There are different schools of thought. Some prefer to make code readable by doing all operations step-by-step.

## Merging tables 'vertically' // Subsetting and re-combining flights from different airlines
You will likely need to combine datasets at some point. R provides quite a few tools to do that, and as you've seen, it's possible to do something many different ways.

Here, we present a simple case of 'vertical' merging, using `rbind`. Let's create a data frame with information on flights by United Airlines and American Airlines only. We subset the data for each airline into its own data frame and then combine the two.

The main requirement is that both data frames have the same column names (the columns may be in a different order).

In [None]:
message('Subsetting the dataset to have 2 dataframes')
flightsUA <- flights[flights$carrier == 'UA',]
flightsAA <- flights[flights$carrier == 'AA',]
message('Checking the number of rows in two dataframes')
print(nrow(flightsUA) + nrow(flightsAA))
message('Combining two dataframes then checking the number of rows in the resulting data frame')
flightsUAandAA <- rbind(flightsUA,flightsAA)
print(nrow(flightsUAandAA))


Nothing special; just be sure the dataframes have columns with the same names and types.
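
A small sketch of the name-matching behaviour, using two made-up one-row data frames (`a`, `b` and the `airline` column are illustrative, not part of the flights data). For data frames, `rbind` lines columns up by name, so a different column order is fine, while a different column name is an error.

```r
# Same column names, different order.
a <- data.frame(carrier = "UA", dep_delay = 4, stringsAsFactors = FALSE)
b <- data.frame(dep_delay = 11, carrier = "AA", stringsAsFactors = FALSE)

# Columns are matched by name; the result uses the order of the first data frame.
combined <- rbind(a, b)
print(combined)

# A mismatched column name ("airline" instead of "carrier") makes rbind() fail.
# tryCatch() captures the error message instead of stopping the script.
mismatch <- tryCatch(
    rbind(a, data.frame(airline = "DL", dep_delay = 1, stringsAsFactors = FALSE)),
    error = function(e) conditionMessage(e)
)
print(mismatch)
```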

A useful tip is to use ``do.call`` in order to merge more than two data frames.
``do.call`` calls a function with the elements of a list as its arguments.

In [None]:
message('rBinding 3 data frames and checking the number of rows')
print(nrow(do.call(rbind, list(flightsUA,flightsAA,flightsUAandAA))))
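
As a smaller illustration of what `do.call` does, the sketch below compares it with writing the call out by hand. The list `parts` and its contents are made up for this example.

```r
# Three tiny data frames held in a list.
parts <- list(data.frame(x = 1:2),
              data.frame(x = 3:4),
              data.frame(x = 5:6))

# do.call() passes each list element as a separate argument to rbind() ...
via_do_call <- do.call(rbind, parts)

# ... which is the same as spelling out the arguments one by one.
by_hand <- rbind(parts[[1]], parts[[2]], parts[[3]])

print(identical(via_do_call, by_hand))
```

The advantage of `do.call` is that it works unchanged however many data frames the list holds.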

## A useful tip for populating dataframes within a loop
`rbind` is really useful for populating a dataframe, but it can be slow within loops. Each time we append a row to a dataframe within a loop, a new copy of the dataframe is stored in memory :( 

The solution is to create a list of lists and then merge them with the `do.call` + `rbind` combo. But since ``rbind`` can be slow and memory-hungry on large inputs, for large datasets one may want to use the
``rbindlist`` function from the ``data.table`` package, which does the same operation, but faster. 

Let's compare these approaches using the `system.time` function to see the execution times.

In [None]:
message('execution time for rbind')
system.time(do.call(rbind, list(flightsUA,flightsAA,flightsUAandAA)))

message('execution time for rbindlist, written in C')
system.time(rbindlist(list(flightsUA,flightsAA,flightsUAandAA)))

And now an example of using `rbindlist` to populate a data frame.
Let's pretend we forgot how to use `tapply` or `group_by` (`dplyr`) and we want to calculate the average arrival and departure delays per airline.

In [None]:
Start <- Sys.time()
carriers  <- unique(flights_complete$carrier)
resList <- list()
for (i in 1:length(carriers))
{
    meanDepDelay <- mean(flights_complete[flights_complete$carrier == carriers[i],]$dep_delay)
    meanArrDelay <- mean(flights_complete[flights_complete$carrier == carriers[i],]$arr_delay) 
    resList[[i]] <- list(carriers[i],meanDepDelay,meanArrDelay)
}
DelaysByAirline <- rbindlist(resList)
colnames(DelaysByAirline) <- c("carrier","meanDepDelay","meanArrDelay")
End <- Sys.time()
message('It took us')
print(End-Start)
message('and here is the result for American Airlines')
print(DelaysByAirline[DelaysByAirline$carrier == 'AA',])

message('Same result without messing with loops')
Start <- Sys.time()
flights_complete %>% group_by(carrier)%>%
    summarize(meanDepDelay = mean(dep_delay), meanArrDelay = mean(arr_delay))%>%
        filter(carrier == 'AA') %>% print()
End <- Sys.time()
message('And it took us')
print(End-Start)

In most cases loops are possible to avoid, but if you have to write one, the "list of lists" + `rbindlist` approach may save you a lot of time.

## Merge two tables by a single column // What are the most common destination airports?
The `flights` dataset codes destination airports as three-letter airport codes. I'm pretty good at decoding them, but you don't have to be. 

In [None]:
print(head(airports))

The `airports` table gives us a key! Let's merge the `flights` data with the `airports` data, using `dest` in `flights` and `faa` in `airports`.

In [None]:
message('This is pretty easy in base R...')
flights_readdest <- merge(flights_complete, airports, by.x='dest', by.y = 'faa', all.x=TRUE)
print(head(flights_readdest))

message('And you can do it in dplyr too...')
flights_readdest2 <- left_join(flights_complete, airports, by = c("dest" = "faa"))
print(head(flights_readdest2))

**Why did we use `all.x = TRUE` and `left_join`?**

In [None]:
print(setdiff(flights$dest, airports$faa))
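
One way to see what is at stake is to compare a left join with an inner join on toy data. In this sketch the tables `toy_flights` and `toy_airports`, the code `XYZ`, and the airport names are all made up; the point is only that a destination missing from the lookup table survives a left join (with `NA` for the name) but is dropped by an inner join.

```r
library(dplyr)

# Three invented flights; "XYZ" has no entry in the lookup table below.
toy_flights <- data.frame(dest = c("ATL", "XYZ", "ORD"),
                          air_time = c(110, 200, 120),
                          stringsAsFactors = FALSE)
toy_airports <- data.frame(faa = c("ATL", "ORD"),
                           name = c("Airport A", "Airport O"),
                           stringsAsFactors = FALSE)

# Left join: keep every flight, fill unmatched lookups with NA.
left <- left_join(toy_flights, toy_airports, by = c("dest" = "faa"))

# Inner join: keep only flights whose destination is in the lookup table.
inner <- inner_join(toy_flights, toy_airports, by = c("dest" = "faa"))

# Base R equivalent of the left join.
base_left <- merge(toy_flights, toy_airports,
                   by.x = "dest", by.y = "faa", all.x = TRUE)

print(left)
print(inner)
```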

Well this merged dataset is nice, but do we really need all of this information?

In [None]:
print(colnames(flights_readdest))

In [None]:
flights_sm <- select(flights_readdest,origin, dest=name, year, month, day, air_time)
print(head(flights_sm))

Why would you ever want to use `select`? `dplyr` lets you chain operations using the pipes, as discussed above. Let's calculate the average air time for various flight paths, using origin and the readable version of destination airport.

In [None]:
airtime <- left_join(flights_complete, airports, by = c("dest" = "faa")) %>% 
    select(origin, dest=name, air_time) %>% 
    group_by(origin, dest) %>% 
    summarize(avg_air_time = mean(air_time))

print(head(airtime))
print(dim(airtime))

What's the longest flight from each airport, on average?

In [None]:
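## split by origin first, so which.max() indexes rows within each origin's own subset
airtime_df <- as.data.frame(airtime)
lapply(split(airtime_df, airtime_df$origin), function(d) d[which.max(d$avg_air_time), ]) %>% print()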
airtime %>% group_by(origin) %>% summarise(max(avg_air_time)) %>% print()

## Pivot Table // Average flight time from origin to destination

Let's put destinations in rows and origins in columns, and have `air_time` as values.

In [None]:
pvt_airtime <- spread(airtime, origin, avg_air_time)
summary(apply(pvt_airtime, 1, function(x) all(is.na(x))))
print(pvt_airtime)
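
The same reshaping on a toy table makes the result easy to check by eye. The data frame `toy_airtime` and its values are made up. `spread` is the function used above; `pivot_wider` is its newer counterpart and is shown here on the assumption that a recent `tidyr` (1.0.0 or later) is installed.

```r
library(tidyr)

# Long format: one row per (origin, dest) pair. There is no JFK -> B route.
toy_airtime <- data.frame(origin = c("EWR", "JFK", "EWR"),
                          dest = c("A", "A", "B"),
                          avg_air_time = c(100, 105, 200),
                          stringsAsFactors = FALSE)

# Origins become columns; missing combinations are filled with NA.
wide <- spread(toy_airtime, origin, avg_air_time)
print(wide)

# Newer tidyr spelling of the same reshaping.
wide2 <- pivot_wider(toy_airtime, names_from = origin, values_from = avg_air_time)
print(wide2)
```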

## Multi-column merge // What's the weather like for departing flights?
Flights...get delayed. What's the first step if you want to know if the departing airport's weather is at all responsible for the delay? Luckily, we have a `weather` dataset for that.

Let's take a look.

In [None]:
head(weather)
print(intersect(colnames(flights_complete), colnames(weather)))

In [None]:
flights_weather <- merge(flights_complete, weather, 
                         by=c("year", "month","day","hour", "origin"))
print(dim(flights_complete))
print(dim(flights_weather))
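
A toy version of a multi-column merge shows why the row counts can differ and why every key column matters. The tables `toy_f` and `toy_w` and their values are made up. By default `merge` keeps only rows whose full key combination appears in both tables.

```r
# Three invented flights and three invented weather readings.
toy_f <- data.frame(origin = c("EWR", "EWR", "JFK"),
                    hour = c(5, 6, 5),
                    dep_delay = c(2, 40, -1),
                    stringsAsFactors = FALSE)
toy_w <- data.frame(origin = c("EWR", "JFK", "JFK"),
                    hour = c(5, 5, 6),
                    wind_speed = c(10, 3, 8),
                    stringsAsFactors = FALSE)

# Merge on both keys: only (EWR, 5) and (JFK, 5) appear in both tables.
merged <- merge(toy_f, toy_w, by = c("origin", "hour"))

# all.x = TRUE keeps the unmatched flight and fills wind_speed with NA.
merged_left <- merge(toy_f, toy_w, by = c("origin", "hour"), all.x = TRUE)

# Merging on origin alone pairs every flight with every reading at that airport.
by_origin_only <- merge(toy_f, toy_w, by = "origin")

print(merged)
print(nrow(by_origin_only))
```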

In [None]:
flights_weather_posdelays <- filter(flights_weather, dep_delay>200)
print(dim(flights_weather_posdelays))

## Arranging a dataframe // What's the weather like for the most and least delayed flights?

Let's sort the `flights_weather` dataframe on `dep_delay` and get data for the top 10 and bottom 10 delays.

In [None]:
flights_weather %>% arrange(desc(dep_delay)) %>% head(10)

In [None]:
flights_weather %>% arrange(dep_delay) %>% head(10)

## Some other tidying
## Capitalization issues.

In [None]:
print(head(tolower(flights_complete$dest)))
print(head(toupper(tolower(flights_complete$dest))))

## Wide to long formatted data
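
Before working on the full table, here is a minimal sketch of what wide-to-long reshaping does, on a made-up two-row data frame (`toy_wide` and the `flight` column are illustrative). The two delay columns are stacked into a single `value` column, with a new column recording which kind of delay each value is.

```r
library(tidyr)

# Wide format: one row per flight, one column per kind of delay.
toy_wide <- data.frame(flight = c(1, 2),
                       dep_delay = c(5, -3),
                       arr_delay = c(10, -8))

# Long format: one row per (flight, kind of delay) pair.
toy_long <- gather(toy_wide, dep_delay, arr_delay,
                   key = "type_of_delay", value = "value")

print(toy_long)
```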

In [None]:
print(head(flights_complete))

In [None]:
message('Using reshape')
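## data.table (attached last) also exports melt(), which can mask reshape's version, so call reshape::melt explicitly
day_delay <- reshape::melt(flights_complete, id.vars=c("time_hour"), 
                measure.vars=c("dep_delay","arr_delay"), variable_name = "type_of_delay")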
print(head(day_delay))
message('Using tidyr')
day_delay <- gather(flights_complete, `dep_delay`,`arr_delay`, 
                    key = "type_of_delay", value="value")
print(head(day_delay))
ggplot(day_delay,aes(x=time_hour,y=value,colour=type_of_delay, group=type_of_delay)) + geom_point()

Well this is a bit hard to read. What about the first entry for each type of delay in each hour? 

## Removing duplicates

In [None]:
day_delay_first <- distinct(day_delay, time_hour, type_of_delay, .keep_all = TRUE)
print(head(day_delay_first))
ggplot(day_delay_first,aes(x=time_hour,y=value,colour=type_of_delay, group=type_of_delay)) + geom_point()
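
To make the behaviour of `distinct` with `.keep_all = TRUE` concrete, here is a sketch on a made-up long table (`toy_long` and its values are illustrative). For each combination of the listed columns, only the first row encountered is kept, along with all of its other columns.

```r
library(dplyr)

# Two rows share the combination (time_hour = 1, type_of_delay = "dep").
toy_long <- data.frame(time_hour = c(1, 1, 1, 2),
                       type_of_delay = c("dep", "dep", "arr", "dep"),
                       value = c(5, 9, 7, 3),
                       stringsAsFactors = FALSE)

# Keep the first row for each (time_hour, type_of_delay) pair.
first_only <- distinct(toy_long, time_hour, type_of_delay, .keep_all = TRUE)

print(first_only)
```

The duplicate row with `value = 9` is dropped because an earlier row with the same key already exists.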

## An incomplete investigation of NAs 

Let's examine where there are NAs in the `flights` dataset.

In [None]:
ind <- which(is.na(flights), arr.ind = TRUE)
print(head(ind))
ind2 <- table(ind[,2])
print(ind2)

message('But what do those numbers mean?')
names(ind2) <- colnames(flights)[as.numeric(names(ind2))]
print(ind2)

In [None]:
flights_incomplete <- flights[!complete.cases(flights),]
print(dim(flights_incomplete))

Do flights with NA departure time also have an NA departure delay?

In [None]:
print(table(is.na(flights_incomplete$dep_time) & is.na(flights_incomplete$dep_delay)))
print(table(is.na(flights_incomplete$dep_time) | is.na(flights_incomplete$dep_delay)))

Yes.