# 110 - read data from file

In this cookbook we will assume that we can acquire tidy data from external files.  
To get the data into Python we have to read the files.

Pandas has a [wealth of functions](http://pandas.pydata.org/pandas-docs/stable/io.html) to do that. Here we will show some of the most useful.

# 0 - setup notebook

In [1]:
import warnings
warnings.filterwarnings('ignore')
import pandas as pd

# 1 - read data from file

Pandas has a number of functions that can read files that contain tabular data.  
All the pandas-read-functions have the form: pd.read_xxx() where the xxx is the type of file.  

xxx can be:
- csv (comma separated file)
- excel (excel sheets)
- sql (tables or query results from an sql database)
- json (JavaScript Object Notation)
- html (hypertext markup language tables)
- sas, stata or spss (commercial statistical packages)
- etc.

See the [documentation](http://pandas.pydata.org/pandas-docs/stable/io.html) for a full list.

All the read functions work about the same.  
Here we will demonstrate their workings with read_csv, read_excel and read_html.
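As a minimal sketch of that shared pattern, the cell below reads the same small table once with read_csv and once with read_json. The data is made up for this illustration and is held in memory with io.StringIO; with real files you would pass a file path instead.

```python
import io
import pandas as pd

# Hypothetical in-memory data; a real call would pass a file path instead.
csv_text = "a,b\n1,2\n3,4\n"
json_text = '[{"a": 1, "b": 2}, {"a": 3, "b": 4}]'

# Both readers take a path or a file-like object and return a DataFrame.
from_csv = pd.read_csv(io.StringIO(csv_text))
from_json = pd.read_json(io.StringIO(json_text))

print(from_csv.shape, from_json.shape)
```

Both calls return a dataframe with the same columns, which is why the checks described later in this notebook apply to every reader.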

# 2 - read_csv()

Files in comma-separated values format ([csv](https://en.wikipedia.org/wiki/Comma-separated_values) files) are a very common way to store data.   
Almost all data-processing programs can read and write csv files.   
So csv files are currently the lingua franca for data exchange (although in 10 years that role might be taken over by [json](https://en.wikipedia.org/wiki/JSON) or [xml](https://en.wikipedia.org/wiki/XML); for other formats see this [comparison](https://en.wikipedia.org/wiki/Comparison_of_data_serialization_formats)). 

csv files are simple text files (**note**: read_csv() assumes [UTF-8](https://en.wikipedia.org/wiki/UTF-8) text by default; a file saved in another encoding, such as ANSI, needs the argument **encoding=...**).  
Here is part of the first three lines of the ctw.csv file.  

    code,country,region,pop,PPP,GDP,PPPpc,GDPpc, ...
    AFG,Afghanistan,Southern Asia, 30552.0, NA, 20496.8, NA, ... 
    ALB,Albania,Southern Europe, 3173.0, 28211.4, 12648.1, 9962.6, 4466.9, ...

The first row contains the names of the columns/variables.  
Each subsequent row represents a row of data (each row has the variable-values for one observed case).  
The values of the different variables (i.e. the columns) are separated by a comma.  
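To make this structure concrete, here is a small sketch that parses a shortened, hypothetical version of those lines from a string (only six columns, spaces after the commas removed). It shows how the header row becomes the column names, each later row becomes one case, and the code NA becomes a missing value.

```python
import io
import pandas as pd

# Shortened, hypothetical version of the first lines of ctw.csv.
sample = (
    "code,country,region,pop,PPP,GDP\n"
    "AFG,Afghanistan,Southern Asia,30552.0,NA,20496.8\n"
    "ALB,Albania,Southern Europe,3173.0,28211.4,12648.1\n"
)

df = pd.read_csv(io.StringIO(sample))

print(df.columns.tolist())      # names taken from the first row
print(df.shape)                 # (rows, columns)
print(df["PPP"].isna().tolist())  # NA is read as a missing value
```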

Below we will demonstrate the basics of reading csv files with the pandas [read_csv()](http://pandas.pydata.org/pandas-docs/stable/io.html#io-read-csv-table) function.   
When the file is indeed tidy, the simplest form of the read_csv() function is often all we need.  
Here is an example:

In [2]:
dat = pd.read_csv('./dat/ctw.csv')
print(dat.shape)
dat.head(2)

(152, 35)


Unnamed: 0,code,country,region,pop,PPP,GDP,PPPpc,GDPpc,HDI,KOFec,...,gini,voice,demo,stab,govEff,regQual,ROLwb,ROLwjp,CORwb,fragil
0,AFG,Afghanistan,Southern Asia,30552.0,,20496.8,,,0.374,,...,,-1.32,2.48,-2.42,-1.4,-1.21,-1.72,0.34,-1.41,4.22
1,ALB,Albania,Southern Europe,3173.0,28211.4,12648.1,9962.6,4466.9,0.749,61.9,...,,0.01,5.67,-0.16,-0.28,0.17,-0.57,0.49,-0.72,2.19


The table contains data about 152 countries (i.e. each row/observed case is a country).  
Each country is described by 35 variables: country name, population, gdp, etc.  
The appendix at the end of this notebook gives a short explanation of each variable (more [info](https://github.com/vilkoos/CTWdata)).

#### After reading a file it is good practice to check the resulting dataframe.  
Problems that frequently occur are:
- the row containing the column names is not recognized (solution: add the argument **header=...**)
- one of the columns should be the index (solution: add the argument **index_col=...**)
- one of the codes for missing values is not recognized as such (solution: add the argument **na_values=[...]**)
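The three arguments in this list can be sketched together on a small made-up file. The contents below (a comment line before the header and -999 as the missing-value code) are assumptions chosen for the illustration; they are not taken from ctw.csv.

```python
import io
import pandas as pd

# Hypothetical file: one junk line before the header, -999 marks missing values.
raw = (
    "exported by a hypothetical tool\n"
    "code,country,pop\n"
    "AFG,Afghanistan,30552.0\n"
    "ALB,Albania,-999\n"
)

df = pd.read_csv(
    io.StringIO(raw),
    header=1,            # the column names are on the second line
    index_col=0,         # use the first column (code) as the index
    na_values=["-999"],  # treat -999 as a missing value
)

print(df)
```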

In our case the read_csv got the column-names perfectly and there are no problems with missing values codes.  
However, the index should be set to the first column, i.e. the one that contains the uniquely identifying country code.   
We can do this by adding the argument **index_col=0** (the column code has column-index number 0).

In [3]:
dat = pd.read_csv('./dat/ctw.csv', index_col=0 )
print(dat.shape)
dat.head(2)

(152, 34)


Unnamed: 0_level_0,country,region,pop,PPP,GDP,PPPpc,GDPpc,HDI,KOFec,KOFsoc,...,gini,voice,demo,stab,govEff,regQual,ROLwb,ROLwjp,CORwb,fragil
code,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
AFG,Afghanistan,Southern Asia,30552.0,,20496.8,,,0.374,,17.4,...,,-1.32,2.48,-2.42,-1.4,-1.21,-1.72,0.34,-1.41,4.22
ALB,Albania,Southern Europe,3173.0,28211.4,12648.1,9962.6,4466.9,0.749,61.9,42.1,...,,0.01,5.67,-0.16,-0.28,0.17,-0.57,0.49,-0.72,2.19


### more problems

Another problem that frequently occurs is that fields are not separated by a comma (tab or ; are common alternatives).  
Some countries use the comma as decimal point (i.e. they write 100,000.111 as 100.000,111).  
Here is an example of such a csv file.

> col1;col2;col3   
> aaa;0,111;10.000   
> bbb;0,222;20.000  
> ccc;0,333;30.000   

Here the **;** is used as separator, the **,** is the decimal point and the **.** is used to separate the thousands.  
Let's see what happens when we read this csv file.

In [4]:
dat2 = pd.read_csv('./dat/csv2.csv')
dat2

Unnamed: 0,col1;col2;col3
aaa;0,111;10.000
bbb;0,222;20.000
ccc;0,333;30.000


The result is a bit of a mess. We can repair this easily:
- add the argument sep=';' to specify the used separator
- add the argument decimal=',' to specify the symbol for indicating the decimal
- add the argument thousands="." to specify the symbol for indicating the thousands

In [5]:
dat2 = pd.read_csv('./dat/csv2.csv', sep=';', decimal=',', thousands='.')
dat2

Unnamed: 0,col1,col2,col3
0,aaa,0.111,10000
1,bbb,0.222,20000
2,ccc,0.333,30000
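Tab-separated files, mentioned earlier as another common alternative, can be handled with the same sep argument. The sketch below uses a small made-up table held in memory, so the names tsv_text and dat3 are hypothetical.

```python
import io
import pandas as pd

# Hypothetical tab-separated data; \t is the tab character.
tsv_text = "col1\tcol2\naaa\t0.111\nbbb\t0.222\n"

dat3 = pd.read_csv(io.StringIO(tsv_text), sep="\t")

print(dat3)
```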


## good practice

After reading a file, check the result. When there are problems, add arguments to read_csv().  
See the [documentation](http://pandas.pydata.org/pandas-docs/stable/io.html#io-read-csv-table) for a list of available arguments.
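What such a check might look like is sketched below on a small made-up table: the shape, the type of each column, the number of missing values per column and whether the index is unique. These four checks are one possible selection, not a complete list.

```python
import io
import pandas as pd

# Hypothetical data with one missing population value.
raw = "code,country,pop\nAFG,Afghanistan,30552.0\nALB,Albania,\n"
df = pd.read_csv(io.StringIO(raw), index_col=0)

print(df.shape)            # number of rows and columns
print(df.dtypes)           # numbers read as text show up as object
print(df.isna().sum())     # missing values per column
print(df.index.is_unique)  # an identifying index should have no duplicates
```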

# 3 - read_excel()

# Appendix - reading tables in html files

In [6]:
#-- install html5lib if needed (not in the standard Anaconda distribution) ----------
import html5lib

tables = pd.read_html('./dat/ctw_code_book.htm')
codebook = tables[0]
codebook.head()

Unnamed: 0,0,1,2,3
0,nr,Var name,name,link
1,1,code,Country ISO code,ISO 3166-1
2,2,country,Country name,
3,3,region,Region,
4,4,pop,Population 2012 (in thousands),


In [7]:
codebook.tail(7)

Unnamed: 0,0,1,2,3
34,34.0,CORwb,Control of Corruption 2012 (World Bank) range...,
35,35.0,fragil,State Fragility (Internal peace Index 2012),
36,,,,
37,,,,
38,,,,
39,,,,
40,,,,


Problems:
- row 0 has the column names; these are read as data
- column 0 is the old index; it can be dropped
- column 1, the variable code, should be the index
- column 3 can be dropped
- we only need the rows up to row 35 (all rows beyond 35 are empty)

In [8]:
# -- add header=0 to indicate that the column names are on the first row
# -- add index_col=1 to indicate that the second column should be used as the index
# -- nrows=35 to read only the first 35 lines does not work in read_html()
# -- usecols=[1,2] does not work for read_html 
tables = pd.read_html('./dat/ctw_code_book.htm', header=0, index_col=1)
codebook = tables[0]
codebook.head()

Unnamed: 0_level_0,nr,name,link
Var name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
code,1.0,Country ISO code,ISO 3166-1
country,2.0,Country name,
region,3.0,Region,
pop,4.0,Population 2012 (in thousands),
PPP,5.0,GDP based on PPP 2011 in dollars,PPP


In [9]:
#-- keep only the name column
#-- NOTE a single column from a dataframe results in a series, 
#-- we must use pd.DataFrame() to get a dataframe
codebook = pd.DataFrame(codebook['name'])
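The difference between a series and a dataframe can be seen in the sketch below, which uses a two-row stand-in for the codebook (the values are copied from the output above, the variable names are hypothetical). It also shows two alternatives to pd.DataFrame(): selecting with a list of column names, and the to_frame() method.

```python
import pandas as pd

# Two-row stand-in for the codebook read above.
codebook = pd.DataFrame(
    {"nr": [1, 2], "name": ["Country ISO code", "Country name"]},
    index=pd.Index(["code", "country"], name="Var name"),
)

as_series = codebook["name"]             # single brackets give a Series
as_frame = codebook[["name"]]            # a list of names gives a DataFrame
also_frame = codebook["name"].to_frame() # convert a Series to a DataFrame

print(type(as_series).__name__, type(as_frame).__name__)
```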

In [13]:
codebook.head()

Unnamed: 0_level_0,name
Var name,Unnamed: 1_level_1
code,Country ISO code
country,Country name
region,Region
pop,Population 2012 (in thousands)
PPP,GDP based on PPP 2011 in dollars


In [16]:
# --- keep the first 35 lines with meaningful info --
codebook = codebook[0:35]
# --- show the total contents of the code book ------------
codebook.head(35)

Unnamed: 0_level_0,name
Var name,Unnamed: 1_level_1
code,Country ISO code
country,Country name
region,Region
pop,Population 2012 (in thousands)
PPP,GDP based on PPP 2011 in dollars
GDP,GDP in current Dollars 2012 (in millions)
PPPpc,GDP Per Capita based on PPP 2011
GDPpc,GDP Per Capita in current Dollars 2012
HDI,Human Development Index (HDI) value 2012
KOFec,Economic Globalization index (KOF) 2011
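The slice codebook[0:35] used above depends on knowing in advance how many rows are meaningful. A possible alternative, sketched here on a small stand-in table with made-up empty rows, is to drop every row that contains only missing values with dropna(how='all').

```python
import numpy as np
import pandas as pd

# Stand-in for the codebook: two real rows followed by two empty rows.
codebook = pd.DataFrame(
    {"name": ["Country ISO code", "Country name", np.nan, np.nan]},
    index=pd.Index(["code", "country", np.nan, np.nan], name="Var name"),
)

# Drop the rows in which all values are missing.
cleaned = codebook.dropna(how="all")

print(cleaned)
```

This keeps working when rows are added to the codebook, at the cost of also removing any row that is empty by accident.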
