# L03-2-Reading Excel Files
## Assignment Instructions
Rename with your name in place of Studentname and make your edits and updates here.



# Reading Excel Files

Excel is a common type of source data. It can get a bit messy if that Excel file was generated by a human rather than an automated process, but this is a welcome data wrangling challenge. When things are going all wrong, I sometimes simply open the Excel file, make any edits or column formatting changes, and save it as CSV. There, now I have a CSV file that I can easily import.

That is a last resort. R has numerous ways to import Excel files, just as it does for flat files. In fact, there are packages that both read AND write Excel files. Yes, you can programmatically edit Excel files using R. In this course, however, we will stick to the tidyverse Excel import, which is pretty simple and basic, but the package is under active development, so more features are coming.

In this exercise, you will read in Excel files, both the older xls format and the newer xlsx format. We will use the readxl package. It was developed by Hadley Wickham, the author of the readr package we used in the prior exercise, so the syntax will be familiar. readxl doesn't load with the tidyverse, so we will have to load the package with its own library command.

There are only two functions in the readxl package at this time: excel_sheets() and read_excel(). The first returns the sheet names for a given file. The second returns a data frame for a given Excel file and sheet. Pretty simple.
The readxl package comes with two example files. We'll take this opportunity to learn how to find and copy these files to our current working directory.

We will then use excel_sheets() to get a list of sheet names and use that with read_excel() to load sheets into R data frames.
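As a rough preview of that workflow, here is a minimal sketch. It assumes the installed readxl bundles datasets.xlsx in its extdata folder and reads the file straight from the package folder rather than from a copy in the working directory.

```r
library(readxl)

# Assumption: readxl ships datasets.xlsx in its extdata folder
path <- system.file("extdata", "datasets.xlsx", package = "readxl")

# Character vector of sheet names, in workbook order
sheets <- excel_sheets(path)
print(sheets)

# Read one sheet by name into a data frame
df_mtcars <- read_excel(path, sheet = "mtcars")
head(df_mtcars)
```

The sheet names returned by excel_sheets() can be passed directly to the sheet argument, which is handy when you don't want to type them by hand.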

There are two read_csv() features I miss the most that are not yet in read_excel(). The first is the ability to distinguish between double and integer data types. The second is trimming whitespace. Both of these tasks can easily be performed after the data is in a data frame. We'll see how to convert to integer in this exercise and we'll leave string manipulation to a later exercise. At the time of writing, read_csv() does both of these things by default. What I find myself doing is reading in the Excel file, exporting to CSV, and reading the CSV back in. Then I get all the features I was missing from read_excel(). Call me lazy, but for all but the biggest Excel sheets this double import runs in under a second. Anything that saves me some data wrangling time is good in my book.
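The after-the-fact integer conversion might look something like the sketch below. The small tibble is a made-up stand-in for a sheet imported by read_excel(), not the course data.

```r
library(dplyr)

# Hypothetical stand-in for an imported sheet:
# whole numbers arrive as doubles
df_raw <- tibble(
  model = c("a4", "civic"),
  year  = c(1999, 2008),
  cyl   = c(4, 4)
)

# Convert the whole-number columns to integer after the import
df_int <- df_raw %>%
  mutate(year = as.integer(year),
         cyl  = as.integer(cyl))

glimpse(df_int)
```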

We will import an Excel file where the user (okay, it was me pretending to be a user) filtered the data on two different Excel table sheets to different years. We import both sheets. After we get two data frames, one for the rows from year 1999 and the other for year 2008, we combine them using bind_rows(), a handy tidyverse dplyr function. This is similar to SQL UNION ALL. If you had a number of spreadsheets where the columns are all the same but filtered by some variable, R can do a great job automating the combining of this data into a single data set.
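To make the UNION ALL comparison concrete, here is a minimal sketch with two tiny made-up data frames that share column names and types. The values are invented for illustration only.

```r
library(dplyr)

# Hypothetical per-year data frames with identical columns
df_1999 <- tibble(model = c("a4", "civic"),
                  year  = c(1999L, 1999L),
                  hwy   = c(29L, 33L))
df_2008 <- tibble(model = c("a4", "civic"),
                  year  = c(2008L, 2008L),
                  hwy   = c(30L, 36L))

# Append the rows of the second data frame below the first
df_all <- bind_rows(df_1999, df_2008)
df_all
```

Rows keep the order in which the data frames were supplied, so all the 1999 rows come before the 2008 rows.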

We then tackle a more challenging FORMATTED spreadsheet that has a merged cell title and some empty rows and columns, as well as some miscellaneous manual cell formatting and notes outside of the main table of data. These manually prettified spreadsheets don't load very well by default. The default import will "shrink-wrap" the spreadsheet into an enclosing rectangle that contains all of the information, which includes the title at the top of the sheet and any other straggling non-empty cells. The idea is that it does not ignore any data from the sheet, so you get a mess of string columns and missing values all over the place, but all the data from the sheet is technically in a data frame. Skipping the first few rows and using select() to keep only the columns we want allows us to recover the original data set quite easily, much more easily than wrangling the out-of-control data frame we would otherwise have.
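Here is a hedged sketch of the skip-then-select idea on a different file. It assumes the installed readxl bundles deaths.xlsx, an example sheet with notes above and below the real table, and that the header sits on row 5. It is not the course workbook, so the numbers will differ.

```r
library(readxl)
library(dplyr)

# Assumption: readxl ships deaths.xlsx with notes above and below the table
path <- system.file("extdata", "deaths.xlsx", package = "readxl")

# Default import: the notes get mixed in with the data
df_raw <- read_excel(path)

# Skip the 4 note rows above the header, then keep the first 3 columns
df_trimmed <- read_excel(path, skip = 4) %>%
  select(1:3)

glimpse(df_trimmed)
```

Any note rows below the table are still present after this step and need separate cleanup.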


## R Features
* library() - load libraries needed
* system.file()
* dir() - list of files in current directory
* file.copy()
* getwd() - gets current working directory
* ? - help
* excel_sheets()
* read_excel()
* glimpse()
* mutate()
* as.integer()
* filter()
* bind_rows()
* select()
* drop_na()

## Datasets
* datasets.xls(x)
* multi_sheet_xlsx.xlsx


In [None]:
# One time package installation
# No need to run, should already be installed
# For your reference only
# install.packages("readxl", repos = "https://cloud.r-project.org")

# Load libraries
library(tidyverse)

# readxl is not loaded as part of the tidyverse
# but it will likely be added in the future
# So we load it separately
___


In [None]:
# readxl like many packages
# contains sample data
# Let's copy the sample Excel files

# The files are copied as part of 
# package installation
# Let's figure out where that turned
# out to be

# The sample files are in a subfolder
# named extdata
# Use system.file() to display the path
___("extdata", package = "___")

# We could save this path as a variable if we like
# but if it is only used one time
# I try to use the %>% instead of assigning a
# variable to the result

In [None]:
# Recall that we can display the contents
# of a folder using dir()

# Let's display the names of the
# files in the readxl extdata folder
# Since I want the full path along with the 
# filenames I will set parameter full.names = TRUE
___("extdata", package = "___") %>% 
   ___(___ = ___)

# Notice there are two files, one xls and the other xlsx
# Both contain the same data

# Open these files from within Excel
# and look at the data and sheets
# Be sure to close Excel when you are done so
# it doesn't lock the files and block R from reading them

In [None]:
# Although the code is getting a bit long
# let's take the prior code that produced two full 
# file paths and pass them to the file.copy() function
# to copy the files to our current working directory
___(from = ___("extdata", package = "___") %>% 
             ___(full.names = ___), 
          to = ___())

# Notice the output of TRUE TRUE
# This indicates that both files were copied successfully

# Run this code block a second time and 
# it should output FALSE FALSE indicating that 
# the files were not copied. This is expected
# because the files already exist in the destination
# and the default behavior of file.copy() is to not overwrite

In [None]:
# List the files in the current working directory
# to confirm the sample files were copied
___()

# To display just the Excel files
# we can pass a filter pattern to
# dir(). The pattern string is 
# interpreted as a regular expression
# See if you can create a pattern 
# that returns only xls and xlsx files
dir(pattern = "___")

# For you regex folks, the below one is more accurate
# as it looks for the extension at the end of the file 
# only
# The . is escaped by two backslashes
# so it matches the period character and not
# any character
# The {0,1} (no space inside the braces) means 0 to 1 occurrences of the prior character
# so the x could be missing or occur once
# The $ means end of string
dir(pattern = "___")

In [None]:
# Now that we have the files in our 
# current working directory
# We first need to look at the sheets
# to figure out which ones to import

# Let's pull up help on 
# excel_sheets()
___

# Notice there is a single argument, path
# In R, path refers to the filename and 
# if desired the folder path

In [None]:
# Let's display the sheet names of
# both datasets Excel files

# file: datasets.xls
___("___")

# file: datasets.xlsx
___("___")

# Notice that both files contain the same sheet names
# They are listed in the order in which they 
# appear in Excel

In [None]:
# We know the filename and sheet names
# So let's look at read_excel()
# Pull up help on this
___

# Notice its similarity to read_csv()
# That is no coincidence since both packages
# were written by Hadley Wickham
# readxl is not yet as full featured as readr package

## read_excel Arguments
### path	
Path to the xls/xlsx file
### sheet	
Sheet to read. Either a string (the name of a sheet), or an integer (the position of the sheet). Defaults to the first sheet.
### col_names	
Either TRUE to use the first row as column names, FALSE to number columns sequentially from X1 to Xn, or a character vector giving a name for each column.
### col_types	
Either NULL to guess from the spreadsheet or a character vector containing "blank", "numeric", "date" or "text".
### na	
Missing value. By default readxl converts blank cells to missing data. Set this value if you have used a sentinel value for missing values.
### skip	
Number of rows to skip before reading any data.
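A minimal sketch of a few of these arguments working together, assuming the installed readxl bundles datasets.xlsx with mtcars as its second sheet:

```r
library(readxl)

# Assumption: readxl ships datasets.xlsx and mtcars is the second sheet
path <- system.file("extdata", "datasets.xlsx", package = "readxl")

# Pick the sheet by position, skip the header row,
# and let readxl generate placeholder column names
df_noheader <- read_excel(path, sheet = 2, skip = 1, col_names = FALSE)

dim(df_noheader)
names(df_noheader)
```

The exact placeholder names depend on the readxl version, so check names() rather than relying on a particular pattern.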

In [None]:
# Let's import datasets.xlsx
# and store the resulting data frame
# in a variable named df
__ <- ___("___")

# Take a glimpse at df
___

In [None]:
# Let's try importing the mtcars sheet
# of the older xls type spreadsheet
# Store the result in df
# Use sheet = "name of sheet" parameter
df <- ___("___", ___ = "___")

# glimpse df
___

# glimpse mtcars
___

# Notice that both have the same
# number of rows and columns
# and the same variable names and order
# and the same data types

In [None]:
# Let's turn to a less tidy 
# Excel spreadsheet, multi_sheet_xlsx.xlsx

# Open this workbook in Excel and browse it
# Be careful not to change anything
# Notice the filters
# Look through all the sheets
# Close the workbook without saving.

# List the sheets
# Use excel_sheets()
___("___")



In [None]:
# Excel spreadsheet, multi_sheet_xlsx.xlsx
# Import each sheet into its own data frame

# Import the first sheet by name
# variable: df_year1999
df_year1999 <- ___("___", sheet = "___")

# Glimpse the results
___

# Import the second sheet by position instead of name
# Sheet position starts with 1 for the first sheet
# variable: df_year2008
df_year2008 <- ___("___", sheet = ___)

# Glimpse the results
___

# Import the third sheet by name
# variable: df_offset
df_offset <- ___("___", ___ = "___")

# Glimpse the results
___

# Compare them to mpg
___

# Did the Excel filters have any effect?
# How well did the data types match the original?
# How did the Offset sheet get handled?

In [None]:
# Let's work on the first two sheets
# and fix the issues before we tackle the third one

# Let's correct the data types
# Excel treats all numbers as floating point
# values, and we could do that in R too, 
# but the purist in me wants to use the 
# most appropriate data type which is
# integer for some variables

# Let's remind ourselves what variables should be integers
# Review the prior output of glimpse(mpg)
# The int columns are:
# year, cyl, cty, and hwy

# Unfortunately, we don't have 
# integer available in col_types
# We only have "blank", "numeric", "date" or "text"
# So, we need to fix it after the import
# We will use the as.integer() function
# inside a mutate() function to do this
# We'll assign it to a new data frame variable
# but normally we could overwrite the same 
# variable and reuse it
df_year1999_datatypes <- ___ %>%
   mutate(year = as.integer(___), 
          ___ = as.integer(cyl), 
          cty = ___(cty), 
          hwy = ___)

# Glimpse the new data frame
___

# Notice the data types look good. 
# It still has years other than 1999
# Let's fix that next.


In [None]:
# We want to filter the data so our 
# data frame has only year 1999 to match the 
# spreadsheet. Normally, I would rather have a single 
# data frame with all the data
# but this is a good learning opportunity

# Let's use another variable to store the results
# df_year1999_fixed
# Use filter() to keep only year 1999
df_year1999_fixed <- ___ %>% 
   filter(___ == ___)

# Glimpse the results
___

# Notice the number of rows compared to the full dataset

In [None]:
# Let's fix the data types and filter 
# for year all together using the second sheet
# Use the previous code block as a guide
# This time pipe the filter() after the mutate
# so no new variable is needed
df_year2008_fixed <- df_year2008 %>%
   mutate(___, 
          ___, 
          ___, 
          ___) %>% 
   filter(___)

# Glimpse result
___

# Notice the integer data types and the 
# number of rows compared to the full dataset

In [None]:
# Pretending that we had two spreadsheets
# One with the data for one year
# and the other for the data from the other year
# We really would like to get them into a 
# single data frame. 
# When you have a situation where the 
# column names and data types match up and you simply want to 
# append them together, then
# bind_rows() does the job well.

# Combine the two data frames together into a 
# single data frame named df_allyears
df_allyears <- df_year1999_fixed %>%
   ___(___)

# Glimpse the result
___

# Compare to the original mpg
___

# Look at the data in the output carefully
# All the data is there but it is in a different
# order than the original. 
# In our combined data frame, all 1999 rows are first
# followed by the 2008 rows.

In [None]:
# Getting to the last spreadsheet, Offset
# The default import was a disaster
# Let's review what it looked like
# using glimpse
___(df_offset)

# Notice no column names 
# NAs for entire rows and columns
# All data types are chr (character strings)
# Sure we could clean this up after it is
# in a data frame, but if we can give 
# the read_excel a hint about what we want to import
# that may save us a lot of clean up work

# To see this more clearly let's look at the 
# top 5 rows of the data frame
df_offset %>% ___(___)

# Now look at the bottom 10 rows
df_offset %>% ___(___)

# Notice the first 3 rows should be ignored
# and there is some junk at the bottom of the spreadsheet
# that should also be ignored

In [None]:
# Let's skip the first 3 rows and see how much that helps
df_offset <- read_excel("multi_sheet_xlsx.xlsx", sheet = "Offset", ___ = ___)

# Glimpse the results
___

# Notice the warning message. 
# It comes from the notes at the bottom of the spreadsheet.

# It looks like if we remove the last column and change the data types
# it will be all fixed up

In [None]:
# Remove the last column using select()
# Hint: use column number range
df_offset_columns <- df_offset %>%
   ___(___:___)

# Glimpse the result
___

In [None]:
# Now let's fix the double to integer data types
df_offset_fixed <- df_offset_columns %>%
   mutate(___, 
          ___, 
          ___, 
          ___)

# Glimpse result
glimpse(df_offset_fixed)

# Compare to original mpg
glimpse(mpg)


# drop_na()
Drop rows containing missing values. 

drop_na(data, ...)
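A minimal sketch of drop_na() on a tiny made-up data frame (the values are invented for illustration):

```r
library(dplyr)
library(tidyr)

# Hypothetical data frame with one completely empty row
df_gaps <- tibble(model = c("a4", NA, "civic"),
                  hwy   = c(29L, NA, 33L))

# With no columns named, any row containing an NA is dropped
df_complete <- df_gaps %>%
  drop_na()

df_complete
```

Naming columns, as in drop_na(hwy), restricts the check to those columns only.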


In [None]:
# Pull up help on drop_na()
___

In [None]:
# Now let's remove the na rows
df_offset_fixed <- df_offset_fixed %>%
   ___()

# Glimpse result
___

# Compare to original mpg
___

# Perfect match, including row order

The readxl package is under active development. Features for handling this situation are being added, allowing you to provide a cell range, as in df_offset <- read_excel("multi_sheet_xlsx.xlsx", sheet = "Offset", range = "B4:L238"), and even anchor points, so you only need to specify the top-left corner cell, which is useful when you don't know how many rows and columns there will be.
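If your installed readxl already supports the range argument, a minimal sketch looks like this. It assumes the bundled datasets.xlsx file and its iris sheet rather than the course workbook.

```r
library(readxl)

# Assumption: readxl ships datasets.xlsx and supports the range argument
path <- system.file("extdata", "datasets.xlsx", package = "readxl")

# Read only cells A1:C5 of the iris sheet:
# one header row plus 4 data rows, 3 columns
df_range <- read_excel(path, sheet = "iris", range = "A1:C5")

dim(df_range)
```

The first row of the range is treated as the header, so the data frame has one row fewer than the range.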

For now, we can manually clean up the data frame and use skip as we did earlier. If that doesn't work well, consider manually selecting the data in Excel and pasting it into cell A1 on a new sheet, or, if you are going that far, just save the sheet as a CSV file and import it via read_csv() to give yourself a few more features during the import.
