# Module 1 | Part 4: Using the download.file() Function in R 

As seen previously, a data file can be directly read into R via a URL.  That is, the read.csv() function allows for the use of a URL.  R also has the ability to do some basic file management tasks.  In Module 1 | Part 4, the download.file() function will be used to download a dataset that has been zipped to reduce its overall size.

The following steps are necessary:

1.   Use download.file() to download a zipped dataset
2.   Unzip the file so that its contents are accessible
3.   Read the data file into R
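The three steps above can be sketched as a single helper function. This is only a minimal sketch: the name `read_zipped_table()` and its arguments are hypothetical (they are not part of base R), and it assumes the zipped file contains a delimited text file whose name is known in advance.

```r
# Hypothetical helper: download a zip file, unzip it, and read one
# delimited text file from its contents.
read_zipped_table <- function(url, member, sep = "\t", header = FALSE) {
  zip_path <- tempfile(fileext = ".zip")  # destination for the download
  out_dir  <- tempfile()                  # separate folder for the contents
  dir.create(out_dir)

  # Step 1: download (mode = "wb" keeps a binary file intact on Windows)
  download.file(url = url, destfile = zip_path, mode = "wb")

  # Step 2: unzip the file into out_dir
  unzip(zipfile = zip_path, exdir = out_dir)

  # Step 3: read the requested file into a data.frame
  read.table(file = file.path(out_dir, member), sep = sep, header = header)
}

# Example call (requires an internet connection):
# zip_data <- read_zipped_table("http://download.geonames.org/export/zip/US.zip", "US.txt")
```

The remainder of this part walks through each of these steps one at a time.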


<table width='100%' ><tr><td bgcolor='green'></td></tr></table>

### Example 1.4.1

Consider information about postal codes for this example. This information can be found on the Geonames.org website, which contains postal code information for many countries around the world. This example will focus on zipcodes in the United States.

<i>Geonames.org Website</i>: [Link](http://download.geonames.org/)

<i>Link for US.zip file</i>: http://download.geonames.org/export/zip/US.zip

<table width='100%' ><tr><td bgcolor='green'></td></tr></table>

## Using the download.file() function

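The file to be downloaded for this example is a zipped file.  Therefore, a temporary directory will be used to hold the zipped file and its contents.  The tempdir() function returns the path to the temporary directory that R creates for the current session.

In [None]:
#Path to the temporary directory for the download
Zipcode_Directory <- tempdir()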

Again, let us take advantage of some utility functions in R.  Here, <strong>file.info()</strong> will give us information about this temporary directory.

In [None]:
#The file.info() function provides information regarding this temporary directory
file.info(Zipcode_Directory)

|  | size | isdir | mode | mtime | ctime | atime | uid | gid | uname | grname |
|---|---|---|---|---|---|---|---|---|---|---|
|  | `<dbl>` | `<lgl>` | `<octmode>` | `<dttm>` | `<dttm>` | `<dttm>` | `<int>` | `<int>` | `<chr>` | `<chr>` |
| /tmp/RtmpkoGaCD | 4096 | TRUE | 700 | 2021-01-21 17:42:04 | 2021-01-21 17:42:04 | 2021-01-21 17:42:14 | 0 | 0 | root | root |
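A quick check, sketched here with hypothetical variable names, suggests that tempdir() does not make a new folder on each call. It returns the path of the one temporary directory that R set up for the current session.

```r
# Two calls to tempdir() within the same R session
dir_first  <- tempdir()
dir_second <- tempdir()

# Both calls return the same path
identical(dir_first, dir_second)   # TRUE

# The directory already exists on disk
dir.exists(dir_first)              # TRUE
```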


Next, a temporary file name must be created; it will serve as the destination for the file to be downloaded. 

In [None]:
#Generating a path for a temporary file
Zipcode_File <- tempfile()

Again, let us apply file.info() to this new path.  The tempfile() function only generates a unique file name; it does not create the file itself.  NA values are produced because no file exists at this path yet.

In [None]:
file.info(Zipcode_File)

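|  | size | isdir | mode | mtime | ctime | atime | uid | gid | uname | grname |
|---|---|---|---|---|---|---|---|---|---|---|
|  | `<dbl>` | `<lgl>` | `<octmode>` | `<dttm>` | `<dttm>` | `<dttm>` | `<int>` | `<int>` | `<chr>` | `<chr>` |
| /tmp/RtmpkoGaCD/file3e625a95dc | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |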
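The behavior of tempfile() can be checked directly. The sketch below uses a hypothetical file, `demo_file`, to show that a path returned by tempfile() does not exist on disk until something is written to it.

```r
# tempfile() returns a unique path; nothing is written to disk yet
demo_file <- tempfile(fileext = ".txt")
exists_before <- file.exists(demo_file)      # FALSE
size_before   <- file.info(demo_file)$size   # NA

# Writing to the path creates the file
writeLines("hello", demo_file)
exists_after <- file.exists(demo_file)       # TRUE
size_after   <- file.info(demo_file)$size    # a positive number of bytes

# Remove the demonstration file
unlink(demo_file)
```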


Finally, we are ready to use the <strong>download.file()</strong> function to download the zipped file.  The temporary file created above will be the destination file passed into this function.

In [None]:
#Download the file; it is a zip file and must be unpacked afterwards
#mode="wb" writes the file in binary mode, which is needed for zip files on Windows
download.file(url="http://download.geonames.org/export/zip/US.zip", destfile=Zipcode_File, mode="wb")

One more time, use the file.info() function to verify that the file was downloaded successfully. 

In [None]:
file.info(Zipcode_File)

|  | size | isdir | mode | mtime | ctime | atime | uid | gid | uname | grname |
|---|---|---|---|---|---|---|---|---|---|---|
|  | `<dbl>` | `<lgl>` | `<octmode>` | `<dttm>` | `<dttm>` | `<dttm>` | `<int>` | `<int>` | `<chr>` | `<chr>` |
| /tmp/RtmpkoGaCD/file3e625a95dc | 633982 | FALSE | 644 | 2021-01-21 18:33:02 | 2021-01-21 18:33:02 | 2021-01-21 18:33:01 | 0 | 0 | root | root |
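The same check can be done in code rather than by eye. download.file() returns 0 (invisibly) on success, so one possible approach is a small wrapper that reports TRUE or FALSE. The function name `safe_download()` is hypothetical, and this is only a sketch of one way to guard against a failed download.

```r
# Hypothetical wrapper: TRUE if the download succeeded and the
# destination file is non-empty, FALSE otherwise.
safe_download <- function(url, destfile) {
  status <- tryCatch(
    download.file(url = url, destfile = destfile, mode = "wb", quiet = TRUE),
    error   = function(e) 1L,   # treat any error as a failed download
    warning = function(w) 1L    # treat any warning as a failed download
  )
  identical(as.integer(status), 0L) &&
    file.exists(destfile) &&
    file.info(destfile)$size > 0
}

# Example call (requires an internet connection):
# safe_download("http://download.geonames.org/export/zip/US.zip", tempfile())
```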


Notice that Zipcode_File is simply a character string in R whose value is the path to this file.

In [None]:
Zipcode_File

## Using the unzip() function

After the zip file has been downloaded, the next step is to unzip the file so that R can retrieve the dataset.  The utility function unzip() will be used to accomplish this task.  The temporary file and directory will be passed into the unzip() function.

In [None]:
#Unzipping the file
unzip(zipfile=Zipcode_File, exdir=Zipcode_Directory)

The <strong>list.files()</strong> function is used here to list all the files contained in this temporary directory.

In [None]:
#Get a list of all the files in the Zipcode_Directory
list.files(Zipcode_Directory)
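It is also possible to look inside a zip file without extracting it. When `list = TRUE` is passed to unzip(), the function returns a data.frame describing the contents (with columns Name, Length and Date) instead of unpacking the files. A minimal sketch follows; the helper name `zip_contents()` is hypothetical.

```r
# Hypothetical helper: return the names of the files stored in a zip file
zip_contents <- function(zipfile) {
  listing <- unzip(zipfile = zipfile, list = TRUE)  # nothing is extracted
  listing$Name
}

# Example call, using the zip file downloaded earlier:
# zip_contents(Zipcode_File)
```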

## Reading the unzipped (text) file into R

In Colab, the temporary directory is contained within the tmp directory.


<p align='center'><img src="https://drive.google.com/uc?export=view&id=1MUSRJLuriJ0VAvZFWa7ndbiWrJ7XhotB"></p>

 

Here the <strong>file.path()</strong> utility function will be used to create the necessary path to our dataset.

In [None]:
#Using file.path() function to create the necessary path to the file.
filelocation <- file.path(Zipcode_Directory,"US.txt")

In [None]:
#Look at the newly created variable to verify the path
filelocation
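As a small illustration of what file.path() does, the sketch below builds paths from their pieces and pulls the file name back out with basename(). The variable names are hypothetical.

```r
# file.path() joins its arguments with "/" as the separator
nested_path <- file.path("data", "zip", "US.txt")
nested_path                 # "data/zip/US.txt"

# Path to a file inside the temporary directory
demo_path <- file.path(tempdir(), "US.txt")

# The result is an ordinary character string
is.character(demo_path)     # TRUE

# basename() returns the final component of the path
basename(demo_path)         # "US.txt"

# file.exists() reports whether a file is actually present at that path
file.exists(demo_path)
```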

Before reading the US.txt file into R, let us first consider the structure of this file.  Note that this file is *not* a CSV file.  The file appears to have a column structure, although not all of the columns are aligned.


<p align='center'><img src="https://drive.google.com/uc?export=view&id=1FsXX6gbWBoY0PcCV2k1LN9x1fvkK3ZKh"></p>


This particular text file is <strong>tab delimited</strong>.  The tab is a hidden character and is usually not shown when viewing the file.  Almost all text files contain hidden characters, e.g. the characters that mark the end of a line.

Some more advanced text editors, e.g. Notepad++, allow you to reveal the hidden characters in a file.  Here the tab character is indicated by an arrow and the end-of-line characters are indicated by LF.



<p align='center'><img src="https://drive.google.com/uc?export=view&id=1HriLPzOjb-eUeHK5YSzPAUn2ieXN_SYD"></p>
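R itself can also reveal the tab characters. In the sketch below, `sample_line` is a hypothetical line typed in by hand to resemble a row of the file; for the real file, a line could be obtained with readLines().

```r
# A hand-typed line in which the fields are separated by tabs
sample_line <- "US\t99553\tAkutan\tAlaska\tAK"

# print() displays each tab as the escape sequence \t
print(sample_line)

# cat() writes the actual tab characters, so the fields appear spaced apart
cat(sample_line, "\n")

# Splitting on the tab character recovers the individual fields
fields <- strsplit(sample_line, split = "\t", fixed = TRUE)[[1]]
length(fields)   # 5
```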

The <strong>read.table()</strong> function can be used to read in this file.  Parameters being passed into this function include:


1.   The location of the file
2.   The separator used to separate the various fields in this data file
3.   A logical variable to indicate that field names are not present in this data file 



In [None]:
Zipcode_Data <- read.table(file=filelocation, sep="\t", header=FALSE)
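The effect of these parameters can be tried on a tiny example before running them on the full file. The sketch below uses the `text` argument of read.table() to read two hand-typed, tab-separated lines; the names `sample_lines` and `sample_data` are hypothetical.

```r
# Two hand-typed lines with tab-separated fields and no header row
sample_lines <- c("US\t99553\tAkutan\tAlaska\tAK",
                  "US\t99571\tCold Bay\tAlaska\tAK")

# sep = "\t" splits on tabs only, so "Cold Bay" stays in a single field;
# header = FALSE makes R supply the default names V1, V2, ...
sample_data <- read.table(text = sample_lines, sep = "\t", header = FALSE,
                          stringsAsFactors = FALSE)

dim(sample_data)     # 2 rows and 5 columns
names(sample_data)
```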

In [None]:
#Verify that Zipcode_Data now appears in the list of objects in the workspace
ls()

In [None]:
#Use head() to verify the top few rows of the data.frame
head(Zipcode_Data)

|  | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | V10 | V11 | V12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
|  | `<chr>` | `<int>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<int>` | `<chr>` | `<int>` | `<dbl>` | `<dbl>` | `<int>` |
| 1 | US | 99553 | Akutan | Alaska | AK | Aleutians East | 13 |  | NA | 54.143 | -165.7854 | 1 |
| 2 | US | 99571 | Cold Bay | Alaska | AK | Aleutians East | 13 |  | NA | 55.1858 | -162.7211 | 1 |
| 3 | US | 99583 | False Pass | Alaska | AK | Aleutians East | 13 |  | NA | 54.8542 | -163.4113 | 1 |
| 4 | US | 99612 | King Cove | Alaska | AK | Aleutians East | 13 |  | NA | 55.0628 | -162.3056 | 1 |
| 5 | US | 99661 | Sand Point | Alaska | AK | Aleutians East | 13 |  | NA | 55.3192 | -160.4914 | 1 |
| 6 | US | 99546 | Adak | Alaska | AK | Aleutians West (CA) | 16 |  | NA | 51.874 | -176.634 | 1 |


Next, we will add in appropriate names for the 12 fields in this data.frame.  Information regarding these 12 fields is provided in the readme.txt file that was contained in the zipped file that was downloaded. 

<p align='center'><img src="https://drive.google.com/uc?export=view&id=1HhWzpoJ3z3OAZNIKvJHRttOXIHcGCJ86"></p>



The <strong>names()</strong> function can be used to provide names to fields contained in a data.frame.

In [None]:
#Existing names for the 12 fields in this data.frame
names(Zipcode_Data)

In [None]:
#Creating new names for the Zipcode_Data data.frame
names(Zipcode_Data) <- c("Country", "Zipcode", "Location", "StateName", "StateAbbrevation","CountyName", "CountyAbbreviation", "CommunityName", "CommunityAbbreviation", "Latitude", "Longitude", "Accuracy")

In [None]:
#Verify that new names were correctly applied to this data.frame
names(Zipcode_Data)
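As an alternative to renaming after the fact, read.table() accepts a `col.names` argument so that names are applied while the file is being read. A minimal sketch on a hand-typed, three-field line; `named_data` and the shortened list of names are for illustration only.

```r
# A hand-typed line with three tab-separated fields
named_data <- read.table(text = "US\t99553\tAkutan", sep = "\t", header = FALSE,
                         col.names = c("Country", "Zipcode", "Location"))

# The names are in place as soon as the data is read
names(named_data)   # "Country" "Zipcode" "Location"
```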

The <strong>str()</strong> function can be used to identify the structure of the various columns in the data.frame.

In [None]:
str(Zipcode_Data)

'data.frame':	40268 obs. of  12 variables:
 $ Country              : chr  "US" "US" "US" "US" ...
 $ Zipcode              : int  99553 99571 99583 99612 99661 99546 99547 99591 99638 99660 ...
 $ Location             : chr  "Akutan" "Cold Bay" "False Pass" "King Cove" ...
 $ StateName            : chr  "Alaska" "Alaska" "Alaska" "Alaska" ...
 $ StateAbbrevation     : chr  "AK" "AK" "AK" "AK" ...
 $ CountyName           : chr  "Aleutians East" "Aleutians East" "Aleutians East" "Aleutians East" ...
 $ CountyAbbreviation   : int  13 13 13 13 13 16 16 16 16 16 ...
 $ CommunityName        : chr  "" "" "" "" ...
 $ CommunityAbbreviation: int  NA NA NA NA NA NA NA NA NA NA ...
 $ Latitude             : num  54.1 55.2 54.9 55.1 55.3 ...
 $ Longitude            : num  -166 -163 -163 -162 -160 ...
 $ Accuracy             : int  1 1 1 1 1 1 1 1 1 1 ...


Notice that Zipcode was read in as an integer.  Although a zipcode consists of five digits, it is an identifier for a location rather than a quantity.  For example, it does not make sense to add two zipcodes together.  For this reason, a string type is more meaningful for the Zipcode field.  The <strong>as.character()</strong> function can be used to convert a vector into a string vector.

In [None]:
Zipcode_Data$Zipcode <- as.character(Zipcode_Data$Zipcode)

In [None]:
#Verify that Zipcode is indeed a string vector
str(Zipcode_Data)

'data.frame':	40268 obs. of  12 variables:
 $ Country              : chr  "US" "US" "US" "US" ...
 $ Zipcode              : chr  "99553" "99571" "99583" "99612" ...
 $ Location             : chr  "Akutan" "Cold Bay" "False Pass" "King Cove" ...
 $ StateName            : chr  "Alaska" "Alaska" "Alaska" "Alaska" ...
 $ StateAbbrevation     : chr  "AK" "AK" "AK" "AK" ...
 $ CountyName           : chr  "Aleutians East" "Aleutians East" "Aleutians East" "Aleutians East" ...
 $ CountyAbbreviation   : int  13 13 13 13 13 16 16 16 16 16 ...
 $ CommunityName        : chr  "" "" "" "" ...
 $ CommunityAbbreviation: int  NA NA NA NA NA NA NA NA NA NA ...
 $ Latitude             : num  54.1 55.2 54.9 55.1 55.3 ...
 $ Longitude            : num  -166 -163 -163 -162 -160 ...
 $ Accuracy             : int  1 1 1 1 1 1 1 1 1 1 ...
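One caveat is worth a short illustration: an integer cannot store leading zeros, so a zipcode that begins with 0 loses digits when it is read as an integer, and as.character() does not bring them back. The sketch below uses 00501 purely as an example value and shows two possible remedies, padding with sprintf() or reading the column as character from the start via `colClasses`. The variable names are hypothetical.

```r
# Leading zeros are dropped when a zipcode is stored as an integer
zip_int <- as.integer("00501")             # 501
zip_chr <- as.character(zip_int)           # "501", the zeros are not restored

# Remedy 1: pad back to five digits
zip_padded <- sprintf("%05d", zip_int)     # "00501"

# Remedy 2: read every column as character so nothing is lost
zip_demo <- read.table(text = "US\t00501\tHoltsville", sep = "\t",
                       header = FALSE, colClasses = "character")
zip_demo$V2                                # "00501"
```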


In [None]:
#The as.integer() function can be used to convert back to an integer type
Zipcode_Data$Zipcode <- as.integer(Zipcode_Data$Zipcode)
str(Zipcode_Data)

'data.frame':	40268 obs. of  12 variables:
 $ Country              : chr  "US" "US" "US" "US" ...
 $ Zipcode              : int  99553 99571 99583 99612 99661 99546 99547 99591 99638 99660 ...
 $ Location             : chr  "Akutan" "Cold Bay" "False Pass" "King Cove" ...
 $ StateName            : chr  "Alaska" "Alaska" "Alaska" "Alaska" ...
 $ StateAbbrevation     : chr  "AK" "AK" "AK" "AK" ...
 $ CountyName           : chr  "Aleutians East" "Aleutians East" "Aleutians East" "Aleutians East" ...
 $ CountyAbbreviation   : int  13 13 13 13 13 16 16 16 16 16 ...
 $ CommunityName        : chr  "" "" "" "" ...
 $ CommunityAbbreviation: int  NA NA NA NA NA NA NA NA NA NA ...
 $ Latitude             : num  54.1 55.2 54.9 55.1 55.3 ...
 $ Longitude            : num  -166 -163 -163 -162 -160 ...
 $ Accuracy             : int  1 1 1 1 1 1 1 1 1 1 ...


<strong>Questions</strong>


1.   Currently there are a little over 41,000 zipcodes in the United States. How many zipcodes are included in the file that was read in by R?



In [None]:
#Number of rows and columns in the Zipcode_Data data.frame
dim(Zipcode_Data)
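One way to check whether every line of a text file made it into a data.frame is to compare the number of lines in the file with the number of rows that were read. The sketch below does this for a small hypothetical file, `toy_file`; for the real data, filelocation and Zipcode_Data would take the place of `toy_file` and `toy_data`.

```r
# Write a small tab-delimited file with three lines
toy_file <- tempfile(fileext = ".txt")
writeLines(c("US\t99553\tAkutan",
             "US\t99571\tCold Bay",
             "US\t99583\tFalse Pass"), toy_file)

# Number of lines in the raw text file
n_lines <- length(readLines(toy_file))

# Number of rows after reading the file into a data.frame
toy_data <- read.table(toy_file, sep = "\t", header = FALSE)
n_rows <- nrow(toy_data)

# If these two counts differ, some lines were not read as separate rows
n_lines == n_rows

unlink(toy_file)
```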

<strong>Task</strong>:

R did not correctly read in all the zipcodes in the US.txt file.  Your task is to figure out why and fix it.  Hint:  The first occurrence of the issue occurred when reading in the lines for Iowa.

<br><br><br><strong>Note:</strong> The following command can be used to save a csv version of the Zipcode_Data into our working R directory.

In [None]:
write.csv(Zipcode_Data,file="/content/Zipcode_Data.csv")
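By default, write.csv() also writes the row names as an extra first column. A minimal sketch, using a small hypothetical data.frame and a temporary file rather than the /content folder, shows how `row.names = FALSE` leaves that column out and how `colClasses` keeps a zipcode column as text when the file is read back.

```r
# A small data.frame with zipcodes stored as strings
demo_data <- data.frame(Zipcode  = c("00501", "99553"),
                        Location = c("Holtsville", "Akutan"),
                        stringsAsFactors = FALSE)

# Write without the row-name column
csv_path <- tempfile(fileext = ".csv")
write.csv(demo_data, file = csv_path, row.names = FALSE)

# Read back, keeping Zipcode as character so leading zeros survive
demo_back <- read.csv(csv_path, colClasses = c(Zipcode = "character"))
names(demo_back)     # "Zipcode" "Location"
demo_back$Zipcode    # "00501" "99553"

unlink(csv_path)
```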