# Reading Flat Files
## Exercise Instructions

* Complete all cells as instructed, replacing any ??? with the appropriate code

* Execute Jupyter **Kernel** > **Restart & Run All** and ensure that all code blocks run without error

## Description

It doesn’t get more universal than importing the typical ‘flat file’. As you can imagine, R offers many functions and packages for this. The process has been streamlined as much as possible while still offering a long list of customizable options. The good news is that a single line of code usually does the trick: we provide a minimum of information and let R make the remaining assumptions for us.

### read_csv()
Fortunately, there is a tidyverse version of file import, so that is what we will use. Hopefully, as we explore more and more of the tidyverse, it will start to become more intuitive. I can often guess what a function name would be called or how it will return a result. It is like guessing the meaning of a word you hadn’t heard of before, but being reasonably close to its definition because it is consistent with the rest of the language that you already learned. 
In this exercise, you will read comma separated value (csv) files into R using read_csv(). Note the underscore between read and csv. This version is from the readr package, a part of the tidyverse. The function that uses a period in place of the underscore, read.csv(), is the base R version. Remember to use the underscore version of the function.
read_csv() has several parameters, most with default values. The only required parameter is the filename. That is the only thing it couldn’t guess. It can’t get much simpler than that. The remaining parameters adjust the default behavior when a file needs special handling.

### read_tsv()
There is also a read_tsv() to use for tab separated values. Just substitute tsv for csv in the function call. There is also a read_delim() function where you can specify your own column delimiter in case the file uses something other than a comma or tab. read_csv() and read_tsv() are called wrapper functions around read_delim(), meaning that they set default values to make it easy to use for a specific context. So instead of setting the column delimiter in read_delim(), you simply call the read_csv() function, which sets that parameter for you. This is just like geom_jitter(), which we encountered before as a wrapper function around geom_point() with the position set to "jitter".
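
To make the wrapper relationship concrete, here is a minimal sketch. It assumes a small throwaway file created with tempfile() (the file and its contents are made up and not part of this exercise) and reads it once with read_csv() and once with read_delim() with the delimiter spelled out.

```r
library(readr)

# Hypothetical three-line csv file, written to a temporary location
tmp <- tempfile(fileext = ".csv")
writeLines(c("x,y", "1,a", "2,b"), tmp)

# The wrapper sets the delimiter for us...
via_wrapper <- read_csv(tmp, col_types = "ic")

# ...while read_delim() needs it spelled out
via_delim <- read_delim(tmp, delim = ",", col_types = "ic")

# Both calls should return the same columns
identical(via_wrapper$x, via_delim$x)
identical(via_wrapper$y, via_delim$y)
```

If the file used a different separator, such as a pipe, only the delim value in the read_delim() call would need to change.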

### read_delim()
I find the wrapper functions easier to read than parameters, so I use those when I can. But since they are wrapper functions, that means you can call the underlying read_delim() function and have access to all the parameters for complete customization.

### write_csv(), write_tsv()
There are also write versions of the read functions, like write_csv() and write_tsv(), which take a data frame and write it to a file. This is a useful way to export a copy of the data, especially after doing a bunch of data wrangling.

### skip, col_names, col_types
We will explore skipping rows, handling header rows, renaming columns, and setting data types. We will also see what can go wrong during data type conversion and how to get details about the errors with problems(). 

### Process strings after import
If you cannot resolve a data type conversion issue during import, just bring the column in as a string column and then handle it after it is in a data frame where we have the full power of R at our disposal. This is most common when dealing with various date formats. If we are lucky we can convert to a date during import, but sometimes the import tools cannot handle the wide diversity of dates. So, we bring it in as a string and then use different R packages that specialize in date manipulation to coerce them into dates.
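
As a rough sketch of that workflow, the example below imports a date column as character and converts it afterwards. The file, the column names, and the month/day/year layout are all assumptions made up for illustration, and base R's as.Date() stands in for the specialized date packages mentioned above.

```r
library(readr)

# Hypothetical file with dates written as month/day/year
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,visit", "1,03/15/2021", "2,12/01/2020"), tmp)

# Import the date column as plain character ("c")
visits <- read_csv(tmp, col_types = "ic")

# Convert after import, spelling out the month/day/year layout
visits$visit <- as.Date(visits$visit, format = "%m/%d/%Y")

class(visits$visit)
```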


## R Features
* library()
* write_csv()
* write_tsv()
* getwd()
* dir()
* read_csv()
* read_tsv()
* glimpse()
* names()
* print()
* c()
* head()
* spec()
* problems()

## Datasets
* mpg


In [4]:
# Load the tidyverse library
library('tidyverse')

── Attaching packages ─────────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 3.2.1     ✔ purrr   0.3.2
✔ tibble  2.1.3     ✔ dplyr   0.8.3
✔ tidyr   1.0.0     ✔ stringr 1.4.0
✔ readr   1.3.1     ✔ forcats 0.4.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()


Notice that readr is one of the tidyverse packages that were attached.

## write_csv(), write_tsv(), write_excel_csv()
Save a data frame to a delimited file. This is about twice as fast as write.csv, and never writes row names.

write_csv(x, path, na = "NA", append = FALSE, col_names = !append)

write_tsv(x, path, na = "NA", append = FALSE, col_names = !append)

write_excel_csv(x, path, na = "NA", append = FALSE, col_names = !append)


# write_csv()
We are going to create the files we need to import by first exporting them or writing them to disk. Let's start with help on write_csv()

In [6]:
# Display help on write_csv()
?write_csv()

# write_delim {readr}	R Documentation
Write a data frame to a delimited file
## Description
This is about twice as fast as write.csv(), and never writes row names. output_column() is a generic method used to coerce columns to suitable output.

## Usage
write_delim(x, path, delim = " ", na = "NA", append = FALSE,
  col_names = !append)

write_csv(x, path, na = "NA", append = FALSE, col_names = !append)

write_excel_csv(x, path, na = "NA", append = FALSE, col_names = !append)

write_tsv(x, path, na = "NA", append = FALSE, col_names = !append)
## Arguments
x - A data frame to write to disk

path - Path or connection to write to.

delim	- Delimiter used to separate values. Defaults to " ". Must be a single character.

na	- String used for missing values. Defaults to NA. Missing values will never be quoted; strings with the same value as na will always be quoted.

append	- If FALSE, will overwrite existing file. If TRUE, will append to existing file. In both cases, if file does not exist a new file is created.

col_names	- Write columns names at the top of the file?

Notice there are a few versions. In this lesson we will explore the csv and tsv options. We pass these functions a data frame and a filename string.

# Write a csv file
Let's export the mpg data to a csv file. Now that you are aware of the pipe operator %>% I'll tend to use that syntax. I still use the other way sometimes, especially for short simple function calls. You are welcome to use either way. In the case for writing a csv file you can use dataframe_name %>% write_csv("myfile.csv") or write_csv(dataframe_name, "myfile.csv")

In [7]:
# Write a csv file
# data: mpg 
# file: mpg.csv
# Hint: write_csv()
mpg %>% write_csv("mpg.csv")

If all went according to plan, there was no visible output. R wrote the mpg dataframe to a file named mpg.csv in the current working directory. We should confirm that...but where is the current working directory? Let's find out.

In [8]:
# Display help on getwd()
?getwd()

# getwd {base}	R Documentation
Get or Set Working Directory
## Description
getwd returns an absolute filepath representing the current working directory of the R process; setwd(dir) is used to set the working directory to dir.

## Usage
getwd()
setwd(dir)
## Arguments
dir	- A character string: tilde expansion will be done.



# getwd()
Equipped with getwd(), let's find where R created this file. Unless we specify the full path when executing the write_csv() function, it will create the file in its current working directory. That is where we will find our file.

In [9]:
# Display the current working directory
# Hint: getwd()
getwd()

This resulted in a folder path which might be unfamiliar if you are running this notebook in a virtual cloud environment. Yours will...hopefully...be different than my result as we don't want to overwrite each other's files.

# Listing the contents of a directory
Now that we know where our exported file is, it would be nice to see it somehow. The dir() command will display the files in a folder from within R. Let's check out its usage statement.

In [10]:
# Display help on dir()
?dir()

# list.files {base}	R Documentation
List the Files in a Directory/Folder
## Description
These functions produce a character vector of the names of files or directories in the named directory.

## Usage
list.files(path = ".", pattern = NULL, all.files = FALSE,
           full.names = FALSE, recursive = FALSE,
           ignore.case = FALSE, include.dirs = FALSE, no.. = FALSE)

       dir(path = ".", pattern = NULL, all.files = FALSE,
           full.names = FALSE, recursive = FALSE,
           ignore.case = FALSE, include.dirs = FALSE, no.. = FALSE)

list.dirs(path = ".", full.names = TRUE, recursive = TRUE)
## Arguments
path	- a character vector of full path names; the default corresponds to the working directory, getwd(). Tilde expansion (see path.expand) is performed. Missing values will be ignored.

pattern	- an optional regular expression. Only file names which match the regular expression will be returned.

all.files	- a logical value. If FALSE, only the names of visible files are returned. If TRUE, all file names will be returned.

full.names	- a logical value. If TRUE, the directory path is prepended to the file names to give a relative file path. If FALSE, the file names (rather than paths) are returned.

recursive	- logical. Should the listing recurse into directories?

ignore.case	- logical. Should pattern-matching be case-insensitive?

include.dirs	- logical. Should subdirectory names be included in recursive listings? (They always are in non-recursive ones).

no..	- logical. Should both "." and ".." be excluded also from non-recursive listings?

How can something so simple be made so complex!! Don't worry, a simple dir() is all we need.

In [11]:
# Let's list the files in the 
# current working directory
# Hint: dir()
dir()

Notice that mpg.csv should be listed in there.
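
If you would rather have R confirm a file's existence than scan the listing by eye, something along these lines may help. This is only a sketch: it writes a made-up file name into a temporary folder (tempdir()) instead of the working directory, and uses the built-in mtcars data so that it does not depend on the exercise files.

```r
library(readr)

# Hypothetical file in a temporary folder, so the working directory is left alone
out_dir <- tempdir()
csv_path <- file.path(out_dir, "example_cars.csv")
write_csv(head(mtcars), csv_path)

# TRUE if the file was written
file.exists(csv_path)

# List only the csv files in that folder using a regular expression
dir(out_dir, pattern = "\\.csv$")
```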

# write_tsv()
Let's do the same thing with tsv.

In [13]:
# Now write it as a tsv file
# Make sure the file extension is tsv
# data: mpg 
# file: mpg.tsv
# Hint: write_tsv()
mpg %>% write_tsv("mpg.tsv")

# Confirm that the file is listed in the current working directory
# Hint: dir()
dir()

Confirm that mpg.tsv is one of the files listed.

# read_csv()
Now that we have some files of known data to experiment on, let's read them in.

In [15]:
# Read mpg.csv, then display help on read_csv()
read_csv('mpg.csv')
?read_csv()

Parsed with column specification:
cols(
  manufacturer = col_character(),
  model = col_character(),
  displ = col_double(),
  year = col_double(),
  cyl = col_double(),
  trans = col_character(),
  drv = col_character(),
  cty = col_double(),
  hwy = col_double(),
  fl = col_character(),
  class = col_character()
)


| manufacturer | model | displ | year | cyl | trans | drv | cty | hwy | fl | class |
|---|---|---|---|---|---|---|---|---|---|---|
| <chr> | <chr> | <dbl> | <dbl> | <dbl> | <chr> | <chr> | <dbl> | <dbl> | <chr> | <chr> |
| audi | a4 | 1.8 | 1999 | 4 | auto(l5) | f | 18 | 29 | p | compact |
| audi | a4 | 1.8 | 1999 | 4 | manual(m5) | f | 21 | 29 | p | compact |
| audi | a4 | 2.0 | 2008 | 4 | manual(m6) | f | 20 | 31 | p | compact |
| audi | a4 | 2.0 | 2008 | 4 | auto(av) | f | 21 | 30 | p | compact |
| audi | a4 | 2.8 | 1999 | 6 | auto(l5) | f | 16 | 26 | p | compact |
| audi | a4 | 2.8 | 1999 | 6 | manual(m5) | f | 18 | 26 | p | compact |
| audi | a4 | 3.1 | 2008 | 6 | auto(av) | f | 18 | 27 | p | compact |
| audi | a4 quattro | 1.8 | 1999 | 4 | manual(m5) | 4 | 18 | 26 | p | compact |
| audi | a4 quattro | 1.8 | 1999 | 4 | auto(l5) | 4 | 16 | 25 | p | compact |
| audi | a4 quattro | 2.0 | 2008 | 4 | manual(m6) | 4 | 20 | 28 | p | compact |


# read_delim {readr}	R Documentation
Read a delimited file (including csv & tsv) into a tibble
## Description
read_csv() and read_tsv() are special cases of the general read_delim(). They're useful for reading the most common types of flat file data, comma separated values and tab separated values, respectively. read_csv2() uses ; for separators, instead of ,. This is common in European countries which use , as the decimal separator.

## Usage
read_delim(file, delim, quote = "\"", escape_backslash = FALSE,
  escape_double = TRUE, col_names = TRUE, col_types = NULL,
  locale = default_locale(), na = c("", "NA"), quoted_na = TRUE,
  comment = "", trim_ws = FALSE, skip = 0, n_max = Inf,
  guess_max = min(1000, n_max), progress = show_progress())

read_csv(file, col_names = TRUE, col_types = NULL,
  locale = default_locale(), na = c("", "NA"), quoted_na = TRUE,
  quote = "\"", comment = "", trim_ws = TRUE, skip = 0, n_max = Inf,
  guess_max = min(1000, n_max), progress = show_progress())

read_csv2(file, col_names = TRUE, col_types = NULL,
  locale = default_locale(), na = c("", "NA"), quoted_na = TRUE,
  quote = "\"", comment = "", trim_ws = TRUE, skip = 0, n_max = Inf,
  guess_max = min(1000, n_max), progress = show_progress())

read_tsv(file, col_names = TRUE, col_types = NULL,
  locale = default_locale(), na = c("", "NA"), quoted_na = TRUE,
  quote = "\"", comment = "", trim_ws = TRUE, skip = 0, n_max = Inf,
  guess_max = min(1000, n_max), progress = show_progress())

Notice it has csv, tsv, and the generic delim functions. read_csv2() is for locales that use the comma as a decimal separator.

There are a number of options to handle all sorts of conditions. Help provides more details on the arguments, if you need to change the default behavior of the function. 

We'll try out a few of the options in a moment, but first let's read in the mpg.csv file.

In [16]:
# file: mpg.csv
# variable: df_csv
# Hint: read_csv()
df_csv <- "mpg.csv" %>% read_csv()

Parsed with column specification:
cols(
  manufacturer = col_character(),
  model = col_character(),
  displ = col_double(),
  year = col_double(),
  cyl = col_double(),
  trans = col_character(),
  drv = col_character(),
  cty = col_double(),
  hwy = col_double(),
  fl = col_character(),
  class = col_character()
)


Notice the column specification output. The default behavior is to review up to the first 1000 rows and determine the most appropriate data type for each column. If there are a large number of columns, the output mentions a default type and lists only the columns that differ from it, which makes the output more readable. In the above case, there weren't enough columns to trigger this.

# Compare imported csv to original dataframe
Let's compare the imported csv file to the original dataframe. Oftentimes something gets lost in translation, because a csv file is all character data and some of the rich metadata contained in an R dataframe cannot be exported to a file format such as csv.

In [17]:
# Compare dataframes
# Hint: glimpse()

# Explore df_csv
df_csv %>% glimpse()

# Explore mpg
mpg %>% glimpse()

Observations: 234
Variables: 11
$ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi", …
$ model        <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro", …
$ displ        <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2…
$ year         <dbl> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 20…
$ cyl          <dbl> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8,…
$ trans        <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "aut…
$ drv          <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4", "…
$ cty          <dbl> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, …
$ hwy          <dbl> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, …
$ fl           <c

Notice that both have the same number of rows and columns and the same variable names in the same order. The data types are close but not identical: in the original mpg data frame, year, cyl, cty, and hwy are integers, while read_csv() guessed double for them. Otherwise, write_csv() and read_csv() did a good job of preserving the data in this case. Note that the data types were not stored in the csv file; rather, read_csv() inspected the values and converted each column to the type it guessed.
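
One way to check the types programmatically rather than by eye is to compare the class of each column before and after the round trip. The sketch below uses a tiny made-up data frame and a temporary file (both assumptions for illustration) with one integer column, which readr is expected to read back as a double when left to guess.

```r
library(readr)

# Hypothetical data frame with a character and an integer column
original <- data.frame(name = c("a", "b"), n = c(1L, 2L),
                       stringsAsFactors = FALSE)

# Write it out and read it back, letting read_csv() guess the types
tmp <- tempfile(fileext = ".csv")
write_csv(original, tmp)
round_trip <- read_csv(tmp)

# Column classes before and after the round trip
sapply(original, class)
sapply(round_trip, class)
```

The csv file itself only holds text, so any difference between the two listings comes from the type guessing in read_csv().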

# Compare imported tsv to original dataframe
Similar to the csv file, let's compare the imported tsv file to the original dataframe.

In [18]:
# file: mpg.tsv
# variable: df_tsv
# Hint: read_tsv()
df_tsv <- "mpg.tsv" %>% read_tsv()

Parsed with column specification:
cols(
  manufacturer = col_character(),
  model = col_character(),
  displ = col_double(),
  year = col_double(),
  cyl = col_double(),
  trans = col_character(),
  drv = col_character(),
  cty = col_double(),
  hwy = col_double(),
  fl = col_character(),
  class = col_character()
)


In [19]:
# Compare dataframes
# Hint: glimpse()

# Explore df_tsv
df_tsv %>% glimpse()

# Explore mpg
mpg %>% glimpse()

Observations: 234
Variables: 11
$ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi", …
$ model        <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro", …
$ displ        <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2…
$ year         <dbl> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 20…
$ cyl          <dbl> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8,…
$ trans        <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "aut…
$ drv          <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4", "…
$ cty          <dbl> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, …
$ hwy          <dbl> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, …
$ fl           <c

Notice any differences? Apart from the integer columns of mpg coming back as doubles, hopefully not.

# read_csv() options

Let's focus on a few useful options for read_csv(). They are the same for read_tsv() as well. 

read_csv(file, col_names = TRUE, col_types = NULL,
  locale = default_locale(), na = c("", "NA"), quoted_na = TRUE,
  comment = "", trim_ws = TRUE, skip = 0, n_max = Inf,
  guess_max = min(1000, n_max), progress = interactive())
  
Some notable parameters are listed below. We will explore some of them in more detail.

### trim_ws 
Trims whitespace from the beginning and end of each value. I always do this cleaning step, and it is on by default.

### guess_max
This is the maximum number of rows used to guess the data types. The default is 1000, but you can go higher if you think it would make a difference.

### col_names
Specify the column names. This is handy when the file doesn't provide them in a header row. The parameter serves two purposes. First, you can set it to TRUE or FALSE to state whether or not the first row is a header row. Second, you can provide a vector of column names, and read_csv() will use those names instead. Note that when you supply a vector, it treats the first row as a data row, not a header row.

### skip
This allows you to skip the first n rows you specify. I have found files that have summary info at the top that need to be skipped. skip can also be used when you want to use col_names to rename the columns and skip the header row at the same time.
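
A minimal sketch of the first situation, using a made-up file with two lines of summary text above the real header (the file contents and column names are assumptions for illustration):

```r
library(readr)

# Hypothetical file: two summary lines, then the header, then the data
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "Report generated for demonstration only",
  "Source: made-up numbers",
  "id,value",
  "1,10",
  "2,20"
), tmp)

# Skip the two summary lines; the third line is then used as the header
report <- read_csv(tmp, skip = 2, col_types = "ii")

names(report)
nrow(report)
```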

### col_types
This allows you to define the column types upon import. An automatic attempt is made, but you may want a different data type or if you are like me, you want to be more deterministic and not leave it to the data to determine its type. Another useful feature is to drop or skip columns you don't want. This is really a trade off of when to do the data wrangling, right now or after it is already in a data frame variable. As always, it is a personal preference and often a balance.

# col_names parameter
Let's set the column names parameter to FALSE when reading in the csv file. What do you think will happen?

In [20]:
# Read mpg.csv with col_names set to false
# Hint: read_csv() with col_names parameter
df_csv_no_col_names <- "mpg.csv" %>% read_csv(col_names = FALSE)

# Compare the result to mpg
# Hint: glimpse()
df_csv_no_col_names %>% glimpse()
mpg %>% glimpse()

Parsed with column specification:
cols(
  X1 = col_character(),
  X2 = col_character(),
  X3 = col_character(),
  X4 = col_character(),
  X5 = col_character(),
  X6 = col_character(),
  X7 = col_character(),
  X8 = col_character(),
  X9 = col_character(),
  X10 = col_character(),
  X11 = col_character()
)


Observations: 235
Variables: 11
$ X1  <chr> "manufacturer", "audi", "audi", "audi", "audi", "audi", "audi", "…
$ X2  <chr> "model", "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro", …
$ X3  <chr> "displ", "1.8", "1.8", "2", "2", "2.8", "2.8", "3.1", "1.8", "1.8…
$ X4  <chr> "year", "1999", "1999", "2008", "2008", "1999", "1999", "2008", "…
$ X5  <chr> "cyl", "4", "4", "4", "4", "6", "6", "6", "4", "4", "4", "4", "6"…
$ X6  <chr> "trans", "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "aut…
$ X7  <chr> "drv", "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4", "4"…
$ X8  <chr> "cty", "18", "21", "20", "21", "16", "18", "18", "18", "16", "20"…
$ X9  <chr> "hwy", "29", "29", "31", "30", "26", "26", "27", "26", "25", "28"…
$ X10 <chr>

* Notice that the column names are automatically generated X1 to X11. 
* Notice that all columns came in as character. Do you know why?
* Notice that there is an extra row in df_csv_no_col_names. If the data in the file didn't change, where did the extra row of data come from?

The first line in the file contains the column name strings. This is known as the header row. Since we told read_csv() that there was no header row with col_names=FALSE, it treated this first row as a data row resulting in an extra row of data. Because there was a string value detected in each column, all the column data types were changed to character in order to save this string data.

# skip parameter
We created a simulated troubleshooting opportunity in the prior cell!  Let's start fixing this by skipping the first line. Use the skip parameter and set it to the number of rows you want to skip.

In [22]:
# Skip the first line of the file
# Hint: read_csv() with skip parameter
df_skip <- read_csv("mpg.csv", col_names = FALSE, skip = 1)

# Compare the output to mpg
# Hint: glimpse()
df_skip %>% glimpse()
mpg %>% glimpse()

Parsed with column specification:
cols(
  X1 = col_character(),
  X2 = col_character(),
  X3 = col_double(),
  X4 = col_double(),
  X5 = col_double(),
  X6 = col_character(),
  X7 = col_character(),
  X8 = col_double(),
  X9 = col_double(),
  X10 = col_character(),
  X11 = col_character()
)


Observations: 234
Variables: 11
$ X1  <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi", "audi", "…
$ X2  <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro", "a4 quatt…
$ X3  <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, …
$ X4  <dbl> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 2008, 2008,…
$ X5  <dbl> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8, 8, 8, 8,…
$ X6  <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "auto(l5)", "…
$ X7  <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4", "4", "4", …
$ X8  <dbl> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, 15, 15, 1…
$ X9  <dbl> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, 25, 24, 2…
$ X10 <chr>

Notice row count and data types are corrected. Column names still need work.

# Specify column names
You can use the col_names parameter to specify column names. So this parameter is doing double duty. It can be a boolean to specify whether column names exist or not. It can also be used to provide column names if set to a vector of column names. 

In [25]:
# Let's add our own column names
# Display the column names from mpg.
# Hint: names()
names(mpg)

# Now let's store that into a variable
# which is called a vector in R
v_mpg_col_names <- names(mpg)

# To see what is in a variable 
# Hint: print()
print(v_mpg_col_names)

# or you can just type the variable name
# Hint: v_mpg_col_names
v_mpg_col_names

 [1] "manufacturer" "model"        "displ"        "year"         "cyl"         
 [6] "trans"        "drv"          "cty"          "hwy"          "fl"          
[11] "class"       


In [29]:
# Let's use our new names variable to add the column
# names back in
# Hint: read_csv() with col_names parameter
df_my_col_names <- read_csv("mpg.csv", col_names = v_mpg_col_names, skip = 1)

# Compare the column names to mpg
# Hint: names()
df_my_col_names %>% names()
mpg %>% names()

Parsed with column specification:
cols(
  manufacturer = col_character(),
  model = col_character(),
  displ = col_double(),
  year = col_double(),
  cyl = col_double(),
  trans = col_character(),
  drv = col_character(),
  cty = col_double(),
  hwy = col_double(),
  fl = col_character(),
  class = col_character()
)


Notice the column names match.

# c() for combine
Let's say we had done our homework and knew what column names we wanted. We can create our own names vector using the c() function short for combine, which combines comma separated values into a vector.

In [31]:
# Combine strings into a vector
# Hint: c()
v_mpg_col_names <- c('vehicle', 'model', 'disp', 'year', 
                     'cyl', 'trans', 'drv', 'cty', 'hwy', 'fl', 'class')

* You can use single quotes or double quotes. Since I copied and pasted the above from the earlier output window, it had single quotes so I used them. I normally use double quotes so they don't clash with SQL syntax, which uses single quotes, when I am embedding SQL strings in R. Again, it is a personal preference.

* Notice that I added a carriage return after 'year'. Any time there is whitespace outside of a string, you can add a carriage return (newline) for readability.

* Notice that I changed column names manufacturer to vehicle and displ to disp so we can determine if our code is really working.

In [33]:
# Let's try with the new names
# Hint: read_csv() with col_names parameter
df_custom_col_names <- read_csv("mpg.csv", col_names = v_mpg_col_names, skip = 1)

# Compare the output to mpg
# Hint: glimpse()
df_custom_col_names %>% glimpse()
mpg %>% glimpse()

Parsed with column specification:
cols(
  vehicle = col_character(),
  model = col_character(),
  disp = col_double(),
  year = col_double(),
  cyl = col_double(),
  trans = col_character(),
  drv = col_character(),
  cty = col_double(),
  hwy = col_double(),
  fl = col_character(),
  class = col_character()
)


Observations: 234
Variables: 11
$ vehicle <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi", "audi…
$ model   <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro", "a4 q…
$ disp    <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2.8, 2…
$ year    <dbl> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 2008, 2…
$ cyl     <dbl> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8, 8, 8…
$ trans   <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "auto(l5)…
$ drv     <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4", "4", "…
$ cty     <dbl> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, 15, 1…
$ hwy     <dbl> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, 25, 2…
$ fl      <chr>

# Changing data types
Now that we can easily rename our columns, let's set the data types manually.

We will use the col_types parameter and a convenient shorthand letter for each column:
* c = character, 
* i = integer, 
* n = number, parsed leniently (non-numeric characters around the number are dropped) and stored as a double, 
* d = double, 
* l = logical, 
* D = date, 
* T = date time, 
* t = time, 
* ? = guess, 
* _ or - to skip the column
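
As a small illustration of the shorthand, the sketch below reads a made-up three-column file (the file and its column names are assumptions) and drops the middle column with `_`. The second call spells out what I understand to be the equivalent long form using cols() and the col_*() helpers.

```r
library(readr)

# Hypothetical file with three columns
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,name,score", "1,ann,9.5", "2,bob,7"), tmp)

# Shorthand: integer, skip, double
compact <- read_csv(tmp, col_types = "i_d")

# Long form of the same specification
verbose <- read_csv(tmp, col_types = cols(
  id = col_integer(),
  name = col_skip(),
  score = col_double()
))

# The skipped column is absent from both results
names(compact)
names(verbose)
```

The shorthand string is positional, one letter per column in file order, whereas the long form refers to columns by name.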


In [36]:
# One letter for each column
v_mpg_col_type <- "ccdiicciicc"

# set the data types
# Hint: read_csv() with col_types parameter
df_col_types <- read_csv("mpg.csv", col_names = v_mpg_col_names, skip = 1, col_types = v_mpg_col_type)

# Compare the output to mpg
# Hint: glimpse()
df_col_types %>% glimpse()
mpg %>% glimpse()

Observations: 234
Variables: 11
$ vehicle <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi", "audi…
$ model   <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro", "a4 q…
$ disp    <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2.8, 2…
$ year    <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 2008, 2…
$ cyl     <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8, 8, 8…
$ trans   <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "auto(l5)…
$ drv     <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4", "4", "…
$ cty     <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, 15, 1…
$ hwy     <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, 25, 2…
$ fl      <chr>

Notice we got it all back and we have demonstrated full control over the column names and data types. Also notice that the specification information isn't displayed when specifying the column type.

# spec()

Examine the column specifications for a data frame. spec extracts the full column specifications. cols_condense takes a spec object and condenses its definition by setting the default column type to the most frequent type and only listing columns with a different type.

spec(x)

In [37]:
# Pull up help on spec()
?readr::spec

In [39]:
# Display the specification for df_col_types
# Hint: spec()
df_col_types %>% spec()

cols(
  vehicle = col_character(),
  model = col_character(),
  disp = col_double(),
  year = col_integer(),
  cyl = col_integer(),
  trans = col_character(),
  drv = col_character(),
  cty = col_integer(),
  hwy = col_integer(),
  fl = col_character(),
  class = col_character()
)

# Handling a parsing error
What happens when there is a parsing error? What does that look like, and how can we best deal with it?

Let's attempt to create an error by taking a string and asking for a double on the first column and a date on the last column.

In [43]:
# Take a string and ask for a double
# on the first column and a 
# date on the last column
# Hint: d - double, D - date
v_mpg_col_names <- c('vehicle', 'model', 'disp', 'year', 
                     'cyl', 'trans', 'drv', 'cty', 'hwy', 'fl', 'class')
v_mpg_col_type <- "dcdiicciicD"

# Hint: read_csv() with parameters col_names and col_types
df_problem <- read_csv("mpg.csv", col_names = v_mpg_col_names, skip = 1, col_types = v_mpg_col_type)

# Compare the output to mpg
# Hint: glimpse()
df_problem %>% glimpse()
mpg %>% glimpse()

“468 parsing failures.
row     col   expected  actual      file
  1 vehicle a double   audi    'mpg.csv'
  1 class   date like  compact 'mpg.csv'
  2 vehicle a double   audi    'mpg.csv'
  2 class   date like  compact 'mpg.csv'
  3 vehicle a double   audi    'mpg.csv'
... ....... .......... ....... .........
See problems(...) for more details.
”

Observations: 234
Variables: 11
$ vehicle <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ model   <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro", "a4 q…
$ disp    <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2.8, 2…
$ year    <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 2008, 2…
$ cyl     <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8, 8, 8…
$ trans   <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "auto(l5)…
$ drv     <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4", "4", "…
$ cty     <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, 15, 1…
$ hwy     <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, 25, 2…
$ fl      <chr>

* Notice the warning message including problems() function.
* Notice the NA values for the first and last columns. It put NA as a placeholder meaning Not Available.

# problems()
We can use problems(dataframe_name) to view any parsing problems that were encountered.

In [44]:
# Let's run problems()
# pass it in your data frame
# Let's also just view the top 10 rows
# Hint: problems(), head()
df_problem %>% problems() %>% head(10)

| row | col | expected | actual | file |
|---|---|---|---|---|
| <int> | <chr> | <chr> | <chr> | <chr> |
| 1 | vehicle | a double | audi | 'mpg.csv' |
| 1 | class | date like | compact | 'mpg.csv' |
| 2 | vehicle | a double | audi | 'mpg.csv' |
| 2 | class | date like | compact | 'mpg.csv' |
| 3 | vehicle | a double | audi | 'mpg.csv' |
| 3 | class | date like | compact | 'mpg.csv' |
| 4 | vehicle | a double | audi | 'mpg.csv' |
| 4 | class | date like | compact | 'mpg.csv' |
| 5 | vehicle | a double | audi | 'mpg.csv' |
| 5 | class | date like | compact | 'mpg.csv' |


Notice for each instance it has the column name, what it expects, and what was provided.
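
Once you know which values failed, a common next step is to look at the affected rows in the data itself. The sketch below uses a made-up file with one bad value (the file, column names, and the value "ten" are assumptions for illustration) and finds the row through the NA placeholder rather than through the row numbers reported by problems(), since how those numbers are counted may depend on the readr version.

```r
library(readr)

# Hypothetical file where one amount is spelled out instead of numeric
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,amount", "1,10", "2,ten", "3,30"), tmp)

# Ask for integer and double; the warning is silenced for this demo only
payments <- suppressWarnings(read_csv(tmp, col_types = "id"))

# One parsing problem is recorded for the bad value
probs <- problems(payments)
nrow(probs)

# The failed value was replaced with NA, so filter on that
payments[is.na(payments$amount), ]
```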

# Summary
We learned how to import csv and tsv files and how to handle the header row, column names, and column types. We also viewed parsing problems.