# Module 2 | Part 1: Pre-Processing of Data Files

This handout covers some common tasks for pre-processing data files before the data are read into R or Python.  In particular, the processing done here happens at the operating system level.


<table width='100%' ><tr><td bgcolor='green'></td></tr></table>

### Example 2.1.1

Consider data on births in the United States. This information is provided by the Centers for Disease Control and Prevention (CDC) through the Natality data within its WONDER online databases.

Two different datasets were obtained for the most recent year available. 

*   Number of Births that took place At Home for each State
*   Number of Births that took place in a Hospital for each State

<i>Data Sources</i>: https://wonder.cdc.gov/natality-current.html

<table width='100%' ><tr><td bgcolor='green'></td></tr></table>

These two data files have been saved into the following folder in Colab.

<p align='center'><img src="https://drive.google.com/uc?export=view&id=1jieTL2t7jEocOtJAas1Svn8nOKNByKBr "></p>



## Utility Functions to View Contents of a File
Colab / IPython Notebook allows one to run commands at the operating system level from ordinary code blocks.  The commands must simply be preceded with <strong>!</strong>, i.e. an exclamation point.

For example, to get a listing of the files in a particular directory, use <strong> !ls *\<path to directory\>* </strong>

In [None]:
#A listing of the files in the /content/sample_data/ folder
!ls '/content/sample_data'

To see the contents of a file, use <strong> !cat *\<filename\>* </strong>

WARNING: Do *not* run this command on a large file.

In [None]:
!cat /content/sample_data/athome.txt

For large files, use <strong> !head *\<filename\>* </strong> instead of !cat.  By default, head prints only the first 10 lines of the file.

In [None]:
!head /content/sample_data/athome.txt

To view the last 15 lines of a file, use <strong> !tail -n 15 *\<filename\>* </strong>

In [None]:
#Print last 15 lines of a file
!tail -n 15 /content/sample_data/athome.txt

Next, print all lines in a file -- starting with Line #2.

In [None]:
#Don't print the first line, use + with !tail
!tail -n +2 /content/sample_data/athome.txt

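Finally, print the lines from the start of the file up to a certain point -- dropping the last 22 lines here.  A typical use case is excluding footer information.  Note that a negative line count is a GNU extension to head; it is available on Linux systems such as Colab.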

In [None]:
#Do not print the last 22 lines, use - with !head
!head -n -22 /content/sample_data/athome.txt
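
The viewing commands above can be tried safely on a small throwaway file. The following is a minimal sketch, assuming GNU head and tail (as on Colab); the file name `demo.txt` and its contents are hypothetical. In Colab, a multi-line script like this can be run in a cell that begins with `%%bash`.

```shell
# Build a hypothetical 6-line file: 1 header line, 3 data lines, 2 footer lines
demo="$(mktemp -d)/demo.txt"
printf 'header\nrow1\nrow2\nrow3\nfooter1\nfooter2\n' > "$demo"

# Skip the header: print everything starting with line 2
no_header=$(tail -n +2 "$demo")

# Drop the footer: print everything except the last 2 lines (GNU head)
no_footer=$(head -n -2 "$demo")

# Combine both to keep only the data lines
data_only=$(tail -n +2 "$demo" | head -n -2)

echo "$data_only"
```

Chaining the two commands with a pipe keeps only the data rows without knowing the exact line numbers, only how many header and footer lines there are.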

## Using SED to Manipulate Files

A text editor can be used to change the contents of a file. SED (Stream EDitor) is a common text editor that can be used to manipulate text files from a command line.

Wiki Page for sed: https://en.wikipedia.org/wiki/Sed

A text editor such as sed is certainly more powerful than commands used to print the contents of a file.  For example, the following sed command can be used to print a sequence of lines in the middle of a file. 

In [None]:
#Printing lines 2 through 52 of a file
!sed -n 2,52p /content/sample_data/athome.txt

The output that sed prints to the screen can instead be *pushed* into a second file.  This action requires the shell's output redirection operator, i.e. <strong>></strong>, followed by a name for the new file.  If the named file already exists, its contents are overwritten.

In [15]:
#Printing lines 2 through 52 of a file AND pushing contents from print into a new file
!sed -n 2,52p /content/sample_data/athome.txt > /content/sample_data/AtHome_Births_StateLevel_v2.txt

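In [None]:
#Remove the first occurrence of --- on each line and print the result to the screen
#NOTE: with -n and no p flag, sed prints nothing, so redirecting that output into the v2 file would leave it empty
!sed 's/---//' /content/sample_data/athome.txt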

In [None]:
#Checking to see if new file was created
!ls '/content/sample_data'

In [None]:
#Printing entire file to view changes
!cat /content/sample_data/AtHome_Births_StateLevel_v2.txt
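
How the -n option interacts with printing and redirection can be checked on a small example. This is a minimal sketch, not the handout's data: the file `in.txt`, its contents, and the line range 2 through 4 are hypothetical. In Colab, it can be run in a cell that begins with `%%bash`.

```shell
# Hypothetical 5-line input file in a temporary directory
work=$(mktemp -d)
printf 'title\na---1\nb---2\nc---3\nnotes\n' > "$work/in.txt"

# -n suppresses automatic printing; 2,4p then prints only lines 2 through 4.
# The shell's > operator sends that output to a new file.
sed -n '2,4p' "$work/in.txt" > "$work/middle.txt"

# With -n, a substitution prints nothing unless the p flag is added
silent=$(sed -n 's/---//' "$work/in.txt")
changed=$(sed -n 's/---//p' "$work/in.txt")

cat "$work/middle.txt"
echo "$changed"
```

Because `>` replaces the target file even when the command prints nothing, it is worth confirming that a sed command prints what is expected before redirecting its output into a file that is needed later.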

## Using SED to Find/Replace Contents

Notice that each line in this data file begins with a tab character.  This is evident when the file is opened in Notepad++ with Show All Characters enabled.

<p align='center'><img src="https://drive.google.com/uc?export=view&id=1JR_XPxdAl2RPsJAod1s7BW8V7vdUXkU_"></p>



The sed command shown below does the following:

*   Searches each line of the file named *filename*
*   Finds the first occurrence of the tab character, i.e. \t
*   Replaces the first occurrence of \t with *nothing*
*   Creates a new file, named *filename2*, for the output




<p align='center'><img src="https://drive.google.com/uc?export=view&id=1sg9zYH1c2i6Ucrxm-2i6uiVy7SYvb_1_"></p>

In [19]:
#Run the command above to remove the first tab occurrence on each line in the AtHome_Births_StateLevel_v2 file
!sed 's/\t//' /content/sample_data/AtHome_Births_StateLevel_v2.txt > /content/sample_data/AtHome_Births_StateLevel_v3.txt

In [None]:
#Verify that this command produces the desired output
!cat /content/sample_data/AtHome_Births_StateLevel_v3.txt
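
The substitution removes the first tab wherever it occurs on the line, which is only safe when every line really starts with a tab. The sketch below, assuming GNU sed (for the \t escape), compares it with a version anchored to the start of the line using ^. The file `tabs.txt` and the "Total" record are hypothetical. In Colab, it can be run in a cell that begins with `%%bash`.

```shell
# Two hypothetical tab-delimited lines: one with a leading tab, one without
work=$(mktemp -d)
printf '\t"Alabama"\t"01"\n"Total"\t"99"\n' > "$work/tabs.txt"

# Unanchored: removes the first tab on every line, wherever it occurs
first_tab=$(sed 's/\t//' "$work/tabs.txt")

# Anchored with ^: removes a tab only when it is the first character
leading_tab=$(sed 's/^\t//' "$work/tabs.txt")

# sed's l command prints each line with tabs shown as \t and line ends as $
echo "$leading_tab" | sed -n 'l'
```

On the second line, the unanchored version deletes the separator between the two fields and merges them, while the anchored version leaves that line untouched.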

The removal of the first tab occurrence can be verified in Notepad++, as shown here.

<p align='center'><img src="https://drive.google.com/uc?export=view&id=1-7ALLdTSvPTjrFrH4u3q4maTkdenzkEX"></p>

SED can be used to easily convert this tab-delimited file into the more common comma-delimited (CSV) format.  In this situation, all occurrences of the tab character will be replaced with a comma.

*   Searches each line of the file named *filename*
*   Finds each occurrence of the tab character, i.e. \t
*   Replaces each occurrence of \t with ,
*   The trailing g performs a global replacement, i.e. all occurrences on the line
*   Creates a new file, named *filename2*, for the output

<p align='center'><img src="https://drive.google.com/uc?export=view&id=1fUmMPbEsRXSNdld3gn-gcTfPlQhXtQ95"></p>

In [42]:
#Convert from tab delimited to comma delimited
!sed 's/\t/,/g' /content/sample_data/AtHome_Births_StateLevel_v3.txt > /content/sample_data/AtHome_Births_StateLevel_v4.txt

In [None]:
#Verify that conversion was completed successfully
!cat /content/sample_data/AtHome_Births_StateLevel_v4.txt
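
The effect of the trailing g can be seen by running the same substitution with and without it. This is a minimal sketch on a single hypothetical record, assuming GNU sed for the \t escape. In Colab, it can be run in a cell that begins with `%%bash`.

```shell
# One hypothetical tab-delimited record
line=$(printf '"Alabama"\t"01"\t"2019"')

# Without g, only the first tab on the line is replaced
first_only=$(printf '%s\n' "$line" | sed 's/\t/,/')

# With g, every tab on the line is replaced
all_tabs=$(printf '%s\n' "$line" | sed 's/\t/,/g')

echo "$first_only"
echo "$all_tabs"
```

Only the second result is a valid comma-delimited record; the first still contains a tab between the last two fields.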

## Using SED to Insert/Delete Lines

Next, sed will be used to add field or variable names to the first line of this data file. 

*   Insert a line at line 1
*   Insert the field/variable names into Line 1; the contents are enclosed in backslashes
*   Create a new file, named *filename2* for the output

<p align='center'><img src="https://drive.google.com/uc?export=view&id=1Rdzm5SCRiKKdt67JBcX5uJ6toNk6CVan"></p>

In [47]:
#Insert variable names into first line of data file
## NOTE: Be careful not to run this line multiple times         ##
##       Can delete lines using '1d' option of sed (see below)  ##

!sed '1i \State, StateCode, Year, YearCode, AtHomeCount\' /content/sample_data/AtHome_Births_StateLevel_v4.txt > /content/sample_data/AtHome_Births_StateLevel_v5.txt 

In [None]:
#Verify that header was successfully added to the data file
!cat /content/sample_data/AtHome_Births_StateLevel_v5.txt

Comment:  The -i option in sed permits one to edit the file <strong>in-place</strong>.  In the command shown below, the header information is put into the *same* file -- no new file is created for the output.

<p align='center'><img src="https://drive.google.com/uc?export=view&id=1-HSWFO_nvam13ZBI3peDbB8a2G60k-ep"></p>

The following command can be used to <strong>delete</strong> line 1 from an existing data file.

In [46]:
#The following can be used to delete the first line in a data file
!sed -i '1d' /content/sample_data/AtHome_Births_StateLevel_v5.txt
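
Because an in-place insert adds another header every time it runs, one option is to check line 1 before inserting. The following is a minimal sketch of that idea, assuming GNU sed (for -i and the one-line form of the i command). The file `births.csv`, the two-field header, and the helper name `add_header` are all hypothetical. In Colab, it can be run in a cell that begins with `%%bash`.

```shell
# Hypothetical CSV file without a header
work=$(mktemp -d)
csv="$work/births.csv"
printf '"Alabama","01"\n"Alaska","02"\n' > "$csv"

header='State,StateCode'

# Insert the header only when line 1 does not already match it
add_header() {
  if [ "$(head -n 1 "$csv")" != "$header" ]; then
    sed -i "1i $header" "$csv"
  fi
}

add_header
add_header   # second call leaves the file unchanged

cat "$csv"
```

With the guard in place, re-running the cell is harmless, so there is no need to follow up with the '1d' deletion to undo a duplicate header.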

## Complete the following task:


1.   Clean up the hospital data file, e.g. remove footer info, remove initial tabs on each line, etc.
2.   Convert the file from tab delimited to comma delimited
3.   Add an appropriate header to label each field
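
The three steps can also be combined into a single sed call with several -e expressions. The sketch below shows one possible approach on a tiny hypothetical file, assuming GNU sed. The file names `raw.txt` and `clean.csv`, the reduced set of fields, and the line range 2 through 3 are illustrative and would need to be adapted to the real file. In Colab, it can be run in a cell that begins with `%%bash`.

```shell
# Hypothetical raw file: header, two data lines with a leading tab, one footer line
work=$(mktemp -d)
printf '"Notes"\t"State"\t"State Code"\tBirths\n' >  "$work/raw.txt"
printf '\t"Alabama"\t"01"\t58292\n'              >> "$work/raw.txt"
printf '\t"Alaska"\t"02"\t9148\n'                >> "$work/raw.txt"
printf '"---"\n'                                 >> "$work/raw.txt"

# One sed call with several -e expressions, applied in order:
#   2,3!d     keep only lines 2 through 3 (delete everything else)
#   s/^\t//   drop the leading tab
#   s/\t/,/g  turn the remaining tabs into commas
# The header is written first, then the cleaned rows are appended with >>
echo 'State,StateCode,HospitalCount' > "$work/clean.csv"
sed -e '2,3!d' -e 's/^\t//' -e 's/\t/,/g' "$work/raw.txt" >> "$work/clean.csv"

cat "$work/clean.csv"
```

Writing the header with `>` and appending the rows with `>>` avoids editing the file in place, so the whole script can be re-run from the raw file without producing duplicate headers.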



Get the needed lines from the file, save them to a new file, and print the original file.


In [127]:

!sed -n 2,52p /content/sample_data/hospital.txt > /content/sample_data/hospital_v2.txt
!cat /content/sample_data/hospital.txt

"Notes"	"State"	"State Code"	"Year"	"Year Code"	Births
	"Alabama"	"01"	"2019"	"2019"	58292
	"Alaska"	"02"	"2019"	"2019"	9148
	"Arizona"	"04"	"2019"	"2019"	78020
	"Arkansas"	"05"	"2019"	"2019"	36169
	"California"	"06"	"2019"	"2019"	441421
	"Colorado"	"08"	"2019"	"2019"	61134
	"Connecticut"	"09"	"2019"	"2019"	33941
	"Delaware"	"10"	"2019"	"2019"	10304
	"District of Columbia"	"11"	"2019"	"2019"	8967
	"Florida"	"12"	"2019"	"2019"	216112
	"Georgia"	"13"	"2019"	"2019"	125133
	"Hawaii"	"15"	"2019"	"2019"	16442
	"Idaho"	"16"	"2019"	"2019"	20822
	"Illinois"	"17"	"2019"	"2019"	139203
	"Indiana"	"18"	"2019"	"2019"	78743
	"Iowa"	"19"	"2019"	"2019"	37087
	"Kansas"	"20"	"2019"	"2019"	34643
	"Kentucky"	"21"	"2019"	"2019"	52199
	"Louisiana"	"22"	"2019"	"2019"	58683
	"Maine"	"23"	"2019"	"2019"	11518
	"Maryland"	"24"	"2019"	"2019"	69214
	"Massachusetts"	"25"	"2019"	"2019"	68539
	"Michigan"	"26"	"2019"	"2019"	106215
	"Minnesota"	"27"	"2019"	"2019"	64466
	"Mississippi"	"28"	"2019"

In [128]:
!cat /content/sample_data/hospital_v2.txt

	"Alabama"	"01"	"2019"	"2019"	58292
	"Alaska"	"02"	"2019"	"2019"	9148
	"Arizona"	"04"	"2019"	"2019"	78020
	"Arkansas"	"05"	"2019"	"2019"	36169
	"California"	"06"	"2019"	"2019"	441421
	"Colorado"	"08"	"2019"	"2019"	61134
	"Connecticut"	"09"	"2019"	"2019"	33941
	"Delaware"	"10"	"2019"	"2019"	10304
	"District of Columbia"	"11"	"2019"	"2019"	8967
	"Florida"	"12"	"2019"	"2019"	216112
	"Georgia"	"13"	"2019"	"2019"	125133
	"Hawaii"	"15"	"2019"	"2019"	16442
	"Idaho"	"16"	"2019"	"2019"	20822
	"Illinois"	"17"	"2019"	"2019"	139203
	"Indiana"	"18"	"2019"	"2019"	78743
	"Iowa"	"19"	"2019"	"2019"	37087
	"Kansas"	"20"	"2019"	"2019"	34643
	"Kentucky"	"21"	"2019"	"2019"	52199
	"Louisiana"	"22"	"2019"	"2019"	58683
	"Maine"	"23"	"2019"	"2019"	11518
	"Maryland"	"24"	"2019"	"2019"	69214
	"Massachusetts"	"25"	"2019"	"2019"	68539
	"Michigan"	"26"	"2019"	"2019"	106215
	"Minnesota"	"27"	"2019"	"2019"	64466
	"Mississippi"	"28"	"2019"	"2019"	36451
	"Missouri"	"29"	"2019"	"2019"	70422
	"M

Remove the first tab on each line.


In [129]:
!sed -i 's/\t//' /content/sample_data/hospital_v2.txt
!cat /content/sample_data/hospital_v2.txt

"Alabama"	"01"	"2019"	"2019"	58292
"Alaska"	"02"	"2019"	"2019"	9148
"Arizona"	"04"	"2019"	"2019"	78020
"Arkansas"	"05"	"2019"	"2019"	36169
"California"	"06"	"2019"	"2019"	441421
"Colorado"	"08"	"2019"	"2019"	61134
"Connecticut"	"09"	"2019"	"2019"	33941
"Delaware"	"10"	"2019"	"2019"	10304
"District of Columbia"	"11"	"2019"	"2019"	8967
"Florida"	"12"	"2019"	"2019"	216112
"Georgia"	"13"	"2019"	"2019"	125133
"Hawaii"	"15"	"2019"	"2019"	16442
"Idaho"	"16"	"2019"	"2019"	20822
"Illinois"	"17"	"2019"	"2019"	139203
"Indiana"	"18"	"2019"	"2019"	78743
"Iowa"	"19"	"2019"	"2019"	37087
"Kansas"	"20"	"2019"	"2019"	34643
"Kentucky"	"21"	"2019"	"2019"	52199
"Louisiana"	"22"	"2019"	"2019"	58683
"Maine"	"23"	"2019"	"2019"	11518
"Maryland"	"24"	"2019"	"2019"	69214
"Massachusetts"	"25"	"2019"	"2019"	68539
"Michigan"	"26"	"2019"	"2019"	106215
"Minnesota"	"27"	"2019"	"2019"	64466
"Mississippi"	"28"	"2019"	"2019"	36451
"Missouri"	"29"	"2019"	"2019"	70422
"Montana"	"30"	"2019"	"2019"	

Find all tabs, replace each with a comma, and print the result.


In [130]:
!sed -i 's/\t/,/g' /content/sample_data/hospital_v2.txt

In [131]:
!cat /content/sample_data/hospital_v2.txt

"Alabama","01","2019","2019",58292
"Alaska","02","2019","2019",9148
"Arizona","04","2019","2019",78020
"Arkansas","05","2019","2019",36169
"California","06","2019","2019",441421
"Colorado","08","2019","2019",61134
"Connecticut","09","2019","2019",33941
"Delaware","10","2019","2019",10304
"District of Columbia","11","2019","2019",8967
"Florida","12","2019","2019",216112
"Georgia","13","2019","2019",125133
"Hawaii","15","2019","2019",16442
"Idaho","16","2019","2019",20822
"Illinois","17","2019","2019",139203
"Indiana","18","2019","2019",78743
"Iowa","19","2019","2019",37087
"Kansas","20","2019","2019",34643
"Kentucky","21","2019","2019",52199
"Louisiana","22","2019","2019",58683
"Maine","23","2019","2019",11518
"Maryland","24","2019","2019",69214
"Massachusetts","25","2019","2019",68539
"Michigan","26","2019","2019",106215
"Minnesota","27","2019","2019",64466
"Mississippi","28","2019","2019",36451
"Missouri","29","2019","2019",70422
"Montana","30","2019","2019",

Insert the header on the first line.

In [132]:
!sed -i '1i \State, StateCode, Year, YearCode, HospitalCount\' /content/sample_data/hospital_v2.txt
!cat /content/sample_data/hospital_v2.txt

State, StateCode, Year, YearCode, HospitalCount
"Alabama","01","2019","2019",58292
"Alaska","02","2019","2019",9148
"Arizona","04","2019","2019",78020
"Arkansas","05","2019","2019",36169
"California","06","2019","2019",441421
"Colorado","08","2019","2019",61134
"Connecticut","09","2019","2019",33941
"Delaware","10","2019","2019",10304
"District of Columbia","11","2019","2019",8967
"Florida","12","2019","2019",216112
"Georgia","13","2019","2019",125133
"Hawaii","15","2019","2019",16442
"Idaho","16","2019","2019",20822
"Illinois","17","2019","2019",139203
"Indiana","18","2019","2019",78743
"Iowa","19","2019","2019",37087
"Kansas","20","2019","2019",34643
"Kentucky","21","2019","2019",52199
"Louisiana","22","2019","2019",58683
"Maine","23","2019","2019",11518
"Maryland","24","2019","2019",69214
"Massachusetts","25","2019","2019",68539
"Michigan","26","2019","2019",106215
"Minnesota","27","2019","2019",64466
"Mississippi","28","2019","2019",36451
"Missouri","29","2

## Reading the processed data files into R

This is an IPython Notebook; thus, to run R commands, R Magic will need to be loaded.

In [29]:
%load_ext rpy2.ipython



Once again, R commands can be run in code blocks when %%R is used in the first line of the code block.

In [133]:
%%R

AtHome <- read.csv('/content/sample_data/AtHome_Births_StateLevel_v5.txt')
Hospital <- read.csv('/content/sample_data/hospital_v2.txt')

In [134]:
%%R

head(AtHome)


       State StateCode Year YearCode AtHomeCount
1    Alabama         1 2019     2019         243
2     Alaska         2 2019     2019         195
3    Arizona         4 2019     2019         706
4   Arkansas         5 2019     2019         326
5 California         6 2019     2019        3081
6   Colorado         8 2019     2019         899


In [135]:
%%R

head(Hospital)


       State StateCode Year YearCode HospitalCount
1    Alabama         1 2019     2019         58292
2     Alaska         2 2019     2019          9148
3    Arizona         4 2019     2019         78020
4   Arkansas         5 2019     2019         36169
5 California         6 2019     2019        441421
6   Colorado         8 2019     2019         61134


In [136]:
#Print the first 10 lines of each processed file in the Colab directory
!head /content/sample_data/AtHome_Births_StateLevel_v5.txt
!head /content/sample_data/hospital_v2.txt

State, StateCode, Year, YearCode, AtHomeCount
"Alabama","01","2019","2019",243
"Alaska","02","2019","2019",195
"Arizona","04","2019","2019",706
"Arkansas","05","2019","2019",326
"California","06","2019","2019",3081
"Colorado","08","2019","2019",899
"Connecticut","09","2019","2019",217
"Delaware","10","2019","2019",66
"District of Columbia","11","2019","2019",72
State, StateCode, Year, YearCode, HospitalCount
"Alabama","01","2019","2019",58292
"Alaska","02","2019","2019",9148
"Arizona","04","2019","2019",78020
"Arkansas","05","2019","2019",36169
"California","06","2019","2019",441421
"Colorado","08","2019","2019",61134
"Connecticut","09","2019","2019",33941
"Delaware","10","2019","2019",10304
"District of Columbia","11","2019","2019",8967


In [137]:
%%R 

dim(AtHome)


[1] 51  5


In [138]:
%%R

dim(Hospital)


[1] 51  5
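
The dimensions reported by R can also be cross-checked at the operating system level before a file is read in. The following is a minimal sketch, assuming a hypothetical three-field file `check.csv`. It splits on every comma, so it would miscount if a quoted field contained a comma. In Colab, it can be run in a cell that begins with `%%bash`.

```shell
# Hypothetical processed CSV: 1 header line plus 2 data rows, 3 fields each
work=$(mktemp -d)
printf 'State,StateCode,Count\n"Alabama","01",243\n"Alaska","02",195\n' > "$work/check.csv"

# Number of data rows: total lines minus the header line
rows=$(( $(wc -l < "$work/check.csv") - 1 ))

# Distinct field counts per line; a clean file yields exactly one value
# (simple comma split, so this assumes no commas inside quoted fields)
cols=$(awk -F',' '{ print NF }' "$work/check.csv" | sort -u)

echo "rows=$rows cols=$cols"
```

If `cols` contains more than one value, some line has a different number of fields than the rest, which usually points to leftover footer text or a stray delimiter.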
