# Lab 2: ggplot and dplyr




In [None]:
install.packages('ggplot2')
install.packages('dplyr')
install.packages('palmerpenguins')
install.packages('nycflights13')

In [None]:
options(warn=-1)

library(palmerpenguins)
library(ggplot2)
library(dplyr)
library(nycflights13)

# Introducing ggplot

Every ggplot2 plot has three key components:

- Data,

- A set of aesthetic mappings between variables in the data and visual properties, and

- At least one layer which describes how to render each observation. Layers are usually created with a geom function.

(Source: https://ggplot2-book.org/getting-started.html)

In [None]:
# The first basic scatter plot
ggplot(data=penguins, mapping=aes(x=bill_length_mm, y=bill_depth_mm)) + 
    geom_point() # continue code on a new line with "+" operator

In [None]:
# An equivalent way of plotting the above
ggplot(data=penguins) + 
    geom_point(aes(x=bill_length_mm, y=bill_depth_mm))

In [None]:
#  A more professionally formatted plot
ggplot(data=penguins) + 
    geom_point(aes(x=bill_length_mm, y=bill_depth_mm)) +
    labs(x='bill length (in mm)', y='bill depth (in mm)') + # add axes' labels
    ggtitle('Scatter plot : bill length vs bill depth') +  # add plot's title
    theme_bw() # add a theme layer

### Layering Geometric Objects

Suppose we are interested in identifying trends in our data. We can add a smoothed trend line on top of the points as follows.

In [None]:
ggplot(data=penguins, mapping=aes(x=bill_length_mm, y=bill_depth_mm)) + 
    geom_point() +
    geom_smooth() + # add a second layer: a smoothed trend line
    labs(x='bill length (in mm)', y='bill depth (in mm)') +
    ggtitle('Scatter plot for bill length and bill depth') + 
    theme_bw() 
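
To my understanding, geom_smooth() with no arguments fits a flexible smoother (loess for a dataset of this size) rather than a straight line. If a straight least-squares line is wanted instead, a minimal sketch is to pass method = "lm"; setting se = FALSE to hide the confidence band is an optional choice made here only to keep the plot uncluttered.

```r
library(ggplot2)
library(palmerpenguins)

# Same scatter plot, but with a straight least-squares line instead of the default smoother
p <- ggplot(data = penguins, mapping = aes(x = bill_length_mm, y = bill_depth_mm)) +
    geom_point() +
    geom_smooth(method = "lm", se = FALSE) + # linear fit, no confidence band
    theme_bw()
p
```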

### **Exercise 1**

Make a scatterplot of bill_depth_mm vs. bill_length_mm and color the points by species. What does adding coloring by species reveal about the relationship between these two variables?

In [None]:
#@title Answer to exercise 1

ggplot(data=penguins) + 
    geom_point(aes(x=bill_length_mm, y=bill_depth_mm,  color = species)) +
    labs(x='bill length (in mm)', y='bill depth (in mm)') + 
    ggtitle('Scatter plot : bill length vs bill depth') +  
    theme_bw() 

# The plot suggests that the three species can be told apart by their
# combination of bill length and bill depth.

### Bar plots, histograms and box plots with ggplot

In [None]:
# bar plot for  penguin species' counts
ggplot(penguins, aes(x = species)) +
  geom_bar() +
  ggtitle('Penguin species count') +
  theme_bw()

penguins_nona <- penguins %>% na.omit() # dropping rows with missing values

# using bar plot to visualise the relationship between species and sex
ggplot(penguins_nona, aes(x = species, fill = sex)) +
  geom_bar() +
  ggtitle('Species vs Sex') +
  theme_bw()

# using proportional bar plot to visualise the relationship between species and sex
ggplot(penguins_nona, aes(x = species, fill = sex)) +
  ggtitle('Species vs Sex (Proportional) ') +
  geom_bar(position = "fill")

In [None]:
# histogram plot for bill_depth_mm
ggplot(penguins, aes(x = bill_depth_mm)) +
  geom_histogram(binwidth = 0.25) +
  theme_bw()

# density plot for bill_depth_mm
ggplot(penguins, aes(x = bill_depth_mm)) +
  geom_density() +
  theme_bw()

In [None]:
# Use a box plot to inspect penguins' body mass across islands
ggplot(penguins, aes(x = island, y = body_mass_g)) +
  geom_boxplot() +
  theme_bw()

### **Exercise 2**

Create the following graphics:

1. A bar plot of species counts subdivided by island, where **the bars do not overlap**.
2. A histogram of body_mass_g.
3. A box plot that explores the relationship between penguins' sex and body_mass_g, **with sex on the y-axis**.

In [None]:
#@title Answer to exercise 2
ggplot(penguins, aes(x = species, fill = island)) +
  geom_bar( position = "dodge") +
  theme_bw()

ggplot(penguins, aes(x = body_mass_g)) +
  geom_histogram(binwidth = 100) +
  theme_bw()


penguins_nona = penguins %>% na.omit()
ggplot(penguins_nona, aes(x = sex, y = body_mass_g)) +
  geom_boxplot() +
  coord_flip() +
  theme_bw()



# dplyr for data manipulation

In the next part of the lab, we work with the `flights` data frame from the nycflights13 package.

In [None]:
head(flights) # look at the first few rows of the dataframe

In [None]:
help(flights) # read the dataset description

This lab covers five core dplyr verbs: select, filter, arrange, mutate and summarise. All of them have the following properties:
1. The first argument is a data frame.
2. The subsequent arguments describe what to do with the data frame, using the variable names (without quotes).
3. The result is a new data frame.
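
The three properties can be seen on a tiny example. This is only a sketch: `toy` and its columns are made up for illustration and are not part of the lab data.

```r
library(dplyr)

# A small, made-up data frame used only for illustration
toy <- tibble(id = 1:3, score = c(10, 20, 30))

# 1. data frame first; 2. unquoted column name; 3. a new data frame comes back
high <- filter(toy, score > 15)

nrow(toy)   # the original data frame is unchanged
nrow(high)  # the result is a separate, smaller data frame
```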

## Select
select() reduces the number of columns we are dealing with. This is especially useful for wide datasets, such as genetic data, that can have hundreds of columns.

In [None]:
names(flights)

In [None]:
df <- select(flights, time_hour, carrier, flight, origin, dest, distance)
print(df)

##### **Pipe**

Because the first argument is a data frame and the output is a data frame, dplyr verbs work well with the pipe, %>% (or |>). The pipe takes the object on its left and passes it as the first argument to the function on its right.

In [None]:
flights %>%  
  select(time_hour, carrier, flight, origin, dest, distance) %>%  # select columns by name
  print() # print the resulting dataframe
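
To make the pipe concrete, the sketch below writes the same two-step operation once as nested function calls and once as a pipeline. The data frame `toy` and its columns are hypothetical and exist only for this illustration.

```r
library(dplyr)

# A small, made-up data frame used only for illustration
toy <- tibble(id = 1:3, label = c("x", "y", "z"), keep = c(TRUE, FALSE, TRUE))

# Nested calls have to be read from the inside out
nested <- select(filter(toy, keep), id, label)

# The pipeline reads top to bottom: filter first, then select
piped <- toy %>%
  filter(keep) %>%
  select(id, label)

all.equal(nested, piped)  # both forms should give the same data frame
```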

In [None]:
flights %>% 
  select(carrier:distance) %>% # select the contiguous range of columns from carrier to distance
  print() 

In [None]:
flights %>% 
  select( -c(year, month, day) ) %>% # drop the columns year, month and day
  print() 

Use rename(), which is a variant of select(), to rename a column and keep all the variables that aren't explicitly mentioned:

In [None]:
flights %>% 
  rename(x = carrier) %>%
  print()

Another option is to use select() in conjunction with the everything() helper. This is useful if you have a handful of variables you would like to move to the start of the dataframe.

In [None]:
flights %>% 
  select(carrier, tailnum, everything()) %>%
  print()

There are some helper functions for select():

- starts_with()
- ends_with()
- contains()
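
A short sketch of what each helper matches, using a made-up one-row data frame whose column names mimic those in flights:

```r
library(dplyr)

# Hypothetical data frame; only the column names matter here
toy <- tibble(dep_time = 1, dep_delay = 2, arr_time = 3, arr_delay = 4, air_time = 5)

names(select(toy, starts_with("dep")))  # dep_time, dep_delay
names(select(toy, ends_with("delay")))  # dep_delay, arr_delay
names(select(toy, contains("time")))    # dep_time, arr_time, air_time
```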
    
### Exercise 3.a:
Write code that puts tailnum as the first column, followed by all the columns whose names start with the letter 'd'.



In [None]:
#@title Answer to exercise 3.a
flights %>% 
  select(tailnum, starts_with('d')) %>%
  print()

## Filter
filter() keeps only the rows that satisfy a condition. Use it to view or store a subset of the full dataset.

In [None]:
flights %>% 
  filter(carrier == 'UA') %>% # select only United Airlines records
  print()

Usually you want to store the newly subsetted data in memory. 

In [None]:
ua_flights = flights %>% filter(carrier == 'UA')

Make sure to use '==' instead of '='. The former tests for equality, while the latter assigns a value.

In [None]:
4 == 6
x = 6
x
x == 5


Use %in% if the filtering variable can take multiple values.

In [None]:
 flights %>% 
   filter(carrier %in% c('UA', 'AA', 'DL')) %>% # get only rows with these specific carriers
   print()

## Logical conditions

In [None]:
summer_flights = flights %>% filter(month == 6 | month == 7 | month == 8) 
# filtering for rows that satisfy at least one of the conditions

ua_december = flights %>% filter(carrier == 'UA' & month == 12) 
# filtering for rows that satisfy both conditions

short_flights = flights %>% filter(air_time < 60) 
# can do this because air_time is a numeric variable
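
A chain of `==` comparisons joined by `|`, like the one used for summer_flights, can also be written with %in%. The sketch below checks this on a small made-up vector of month numbers; it deliberately contains no missing values, since the two forms treat NA differently.

```r
# Made-up month numbers, used only for illustration
month <- c(1, 6, 7, 8, 12)

with_or <- month == 6 | month == 7 | month == 8
with_in <- month %in% c(6, 7, 8)

identical(with_or, with_in)  # TRUE: both mark the 2nd, 3rd and 4th values
```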

In R, if you want to find if a variable's value is missing, use the is.na() function. In particular, do not check for equality with NA:

In [None]:
x = 4

In [None]:
x == NA
is.na(x)
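
The reason is that any comparison involving NA evaluates to NA rather than TRUE or FALSE, and filter() keeps only rows where the condition is TRUE. A minimal sketch on a made-up data frame:

```r
library(dplyr)

NA == NA    # NA, not TRUE
is.na(NA)   # TRUE

# Hypothetical data frame with one missing value
toy <- tibble(x = c(1, NA, 3))

filter(toy, x == NA)   # zero rows: the condition is NA for every row
filter(toy, x > 1)     # one row: the NA row is dropped silently
filter(toy, is.na(x))  # one row: the row with the missing value
```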

This function can be used with filter() to remove the rows with NA values.

In [None]:
flights_nona = flights %>% filter( !is.na(dep_time))

In [None]:
print(dim(flights))
print(dim(flights_nona))

### Exercise 3.b

Write code using filter() that selects all the rows for carrier 'AA' on the last day of each month.

In [None]:
#@title Answer to exercise 3.b
month31 = c(1, 3, 5, 7, 8, 10, 12)
month30 = c(4, 6, 9, 11)
month28 = c(2)

# use filter with complex conditions
flights %>% 
  filter(carrier == 'AA') %>%
  filter( ( (month %in% month31) & (day == 31) ) | 
          ( (month %in% month30) & (day == 30) ) | 
          ( (month %in% month28) & (day == 28) ) ) %>%
  print()

## Arrange

Useful for ordering rows of the dataframe.

In [None]:
flights %>%
  arrange(year, month, day, dep_time) %>% 
  print()
# rows are sorted by year first; month, day and dep_time break ties in turn

In [None]:
# arrange in increasing order of year, then in descending order of month and day
flights %>%
  arrange(year, desc(month), desc(day))

Missing values are always sorted at the end:

In [None]:
df = tibble(x = c(5, NA, 2))
df %>% arrange(x)

In [None]:
df %>% arrange(desc(x))

### Exercise 3.c
Use arrange() to sort the flights dataset in descending order of distance divided by air_time.

In [None]:
#@title Answer to exercise 3.c

flights %>% 
  arrange(desc(distance / air_time)) %>%
  print()

## Mutate

The role of mutate() is to add new columns that are calculated from the existing columns. By default, mutate() adds new columns on the right-hand side of your dataset.

In [None]:
flights %>% 
  mutate(speed = distance / air_time * 60) %>%
  print()

To add the variables to the left-hand side instead, use the .before argument.

In [None]:
flights %>% 
  mutate(speed = distance / air_time * 60, .before = 1) %>%
  print()

If you only want to keep the new variables, use transmute().

In [None]:
flights %>% 
  transmute(speed = distance / air_time * 60) %>%
  print()

#### **if_else with mutate**

Use if_else() to create a new column that takes one value when a condition is TRUE and another value when it is FALSE.

In [None]:
flights %>% 
  mutate(
    status = if_else(
      is.na(arr_delay), "cancelled", "scheduled"),
    .keep = "used"
  ) %>%
  print()

#### **case_when()**

Use case_when() when the new column can take more than two values.

In [None]:
flights %>% 
  mutate(
    status = case_when(
      is.na(arr_delay)      ~ "cancelled",
      arr_delay < -30       ~ "very early",
      arr_delay < -15       ~ "early",
      abs(arr_delay) <= 15  ~ "on time",
      arr_delay <= 60       ~ "late",
      arr_delay > 60        ~ "very late",
    ),
    .keep = "used"
  ) %>%
  print()
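
case_when() checks its conditions from top to bottom and uses the first one that is TRUE, so the order matters whenever conditions overlap. The sketch below shows this on two made-up delay values.

```r
library(dplyr)

# Made-up delays in minutes, used only for illustration
delay <- c(20, 90)

# Broad condition first: 90 already matches "> 15", so "very late" is never reached
broad_first <- case_when(
  delay > 15 ~ "late",
  delay > 60 ~ "very late"
)
broad_first   # "late" "late"

# Narrow condition first: each value gets the intended label
narrow_first <- case_when(
  delay > 60 ~ "very late",
  delay > 15 ~ "late"
)
narrow_first  # "late" "very late"
```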

### Exercise 3.d

Create a new column type_of_flight that takes the values "short-haul", "medium-haul" or "long-haul". A short-haul flight is one whose air_time lasts anywhere from 30 minutes to 3 hours. Medium-haul flights last between 3 and 6 hours. Long-haul flights are those that extend beyond 6 hours. Note that you should remove all the rows where air_time is NA.

In [None]:
#@title Answer to exercise 3.d

flights %>% 
  filter(!is.na(air_time)) %>%
  mutate(
    type_of_flight = case_when(
      air_time <  180       ~ "short-haul",
      air_time <  360       ~ "medium-haul",
      air_time >= 360       ~ "long-haul",
    ),
    .keep = "used"
  ) %>% print()


## summarise()
summarise() is generally used together with group_by(). It is the most important grouped operation: it collapses each group to a single row. Group summaries are seen in many applications.
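
A minimal sketch of the collapse, on a made-up data frame with two groups (the names `toy`, `g` and `v` are hypothetical):

```r
library(dplyr)

# Three rows in two groups: "a" appears twice, "b" once
toy <- tibble(g = c("a", "a", "b"), v = c(1, 2, 10))

per_group <- toy %>%
  group_by(g) %>%
  summarise(n = n(), total = sum(v))

per_group  # one row per group: a (n = 2, total = 3) and b (n = 1, total = 10)
```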

**count rows with n()**

In [None]:
# count number of records for each carrier
flights %>% 
  count(carrier, sort = TRUE) %>%
  print()

# An equivalent way using group_by and summarize()
flights %>%
  group_by(carrier) %>% 
  summarize(
    n = n(),
  ) %>% 
  arrange(desc(n)) %>%
  print()

**sum()**

In [None]:
flights %>% 
  group_by(carrier, tailnum) %>% 
  summarize(miles = sum(distance), .groups = "drop") %>% # total miles flown by each plane
  print()

**Minimum, maximum, and quantiles**

min() and max() will give you the smallest and largest values, respectively. Another powerful tool is quantile(), which is a generalization of the median: quantile(x, 0.25) will find the value of x that is greater than 25% of the values.
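
As a quick check of these definitions, the sketch below applies them to the numbers 1 to 9 (made-up data). quantile() returns a named vector, so unname() is used only to make the comparison with median() cleaner.

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)

min(x)  # 1
max(x)  # 9

median(x)                  # 5
unname(quantile(x, 0.5))   # 5: the 0.5 quantile is the median
unname(quantile(x, 0.25))  # 3: the first quartile
```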

Let's inspect the departure delay for each day in the flights dataset in more detail.

In [None]:
flights %>%
  group_by(year, month, day) %>%
  summarize(
    max = max(dep_delay, na.rm = TRUE), # max departure delay
    min = min(dep_delay, na.rm = TRUE), # min departure delay
    q5 = quantile(dep_delay, 0.05, na.rm = TRUE), # 5th percentile of departure delay
    q95 = quantile(dep_delay, 0.95, na.rm = TRUE), # 95th percentile of departure delay
    .groups = "drop"
) %>% print()

**Center and spread**

We often use mean() to summarize the center of a vector of values. Because the mean is the sum divided by the count, it is sensitive to even just a few unusually high or low values. An alternative is to use median(), which finds a value that lies in the “middle” of the vector, i.e. 50% of the values are above it and 50% are below it.

Two commonly used summaries of the spread of data values are the standard deviation, sd(x), and the inter-quartile range, IQR(x), which is quantile(x, 0.75) - quantile(x, 0.25).
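
The sketch below illustrates both points on made-up numbers: a single extreme value moves the mean a lot but leaves the median unchanged, and IQR() matches the difference between the two quartiles.

```r
x <- c(10, 12, 11, 13, 12)
x_out <- c(x, 500)  # the same values plus one extreme observation

mean(x)        # 11.6
median(x)      # 12
mean(x_out)    # 93: pulled up by the single large value
median(x_out)  # 12: unchanged

# IQR is the distance between the 75th and 25th percentiles
iqr_by_hand <- unname(quantile(x_out, 0.75) - quantile(x_out, 0.25))
all.equal(IQR(x_out), iqr_by_hand)
```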

In [None]:
# air time statistics for each airplane
flights %>% 
  filter(!is.na(air_time)) %>%
  group_by(carrier, tailnum) %>%
  summarize(
    air_time_mean = mean(air_time), # mean air time
    air_time_median = median(air_time), # median air time
    air_time_iqr = IQR(air_time), # inter-quartile range of air time
    air_time_sd = sd(air_time), # standard deviation of air time
    .groups = "drop"
  ) %>%  print() 

### Exercise 3.e

Which destinations show the greatest variation in air speed? Air speed is defined as distance divided by air time (miles/hour).

In [None]:
#@title Answer to exercise 3.e

flights_var =  flights %>% 
    filter(!is.na(air_time) & !is.na(distance)) %>%
    mutate(air_speed = distance / air_time * 60) %>% # create the new column air_speed with mutate()
    group_by(dest) %>%
    summarise(
      speed_sd = sd(air_speed),
      speed_iqr = IQR(air_speed)
    )
  
flights_var %>% arrange(desc(speed_sd)) %>% head(3)
flights_var %>% arrange(desc(speed_iqr)) %>% head(3)

# OKC has the highest variation in terms of air speed SD.
# HOU has the highest variation in terms of air speed IQR.