# 110 - read data from file

In this cookbook we will assume that we can acquire tidy data from external files.  
To get the data into Python we have to read the files.

Pandas has a [wealth of functions](http://pandas.pydata.org/pandas-docs/stable/io.html) to do that. Here we will show some of the most useful.

# 0 - setup notebook

In [1]:
import warnings
warnings.filterwarnings('ignore')
import pandas as pd

# 1 - read data from file

Pandas has a number of functions that can read files that contain tabular data.  
All the pandas-read-functions have the form: pd.read_xxx() where the xxx is the type of file.  

xxx can be:
- csv (comma separated file)
- excel (excel sheets)
- sql (tables or query results from a SQL database)
- json (JavaScript Object Notation)
- html (hypertext markup language tables)
- sas, stata or spss (commercial statistical packages)
- etc.

See the [documentation](http://pandas.pydata.org/pandas-docs/stable/io.html) for a full list.

All the read functions work about the same.  
Here we will demonstrate their workings with read_csv, read_excel and read_html.
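
The shared calling pattern can be tried without any files on disk. The sketch below is only an illustration: the two tiny tables are made up, and `io.StringIO` stands in for a file path.

```python
import io
import pandas as pd

# Made-up data; io.StringIO lets a reader treat a string as if it were a file.
csv_text = "code,pop\nAFG,30552.5\nALB,3173.5\n"
json_text = '[{"code": "AFG", "pop": 30552.5}, {"code": "ALB", "pop": 3173.5}]'

from_csv = pd.read_csv(io.StringIO(csv_text))
from_json = pd.read_json(io.StringIO(json_text))

# Both readers return an ordinary DataFrame with the same columns and shape.
print(from_csv.columns.tolist(), from_csv.shape)
print(from_json.columns.tolist(), from_json.shape)
```

Whatever the source format, the result is a dataframe, so everything after the read step works the same way.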

# 2 - read_csv()

Files in comma-separated values format ([csv](https://en.wikipedia.org/wiki/Comma-separated_values) files) are a very common way to store data.   
Almost all data-processing programs can read and write csv files.   
Csv files are currently the lingua franca for data exchange  
(but note that in 10 years that role might be occupied by [json](https://en.wikipedia.org/wiki/JSON) or [xml](https://en.wikipedia.org/wiki/XML), for other formats [see](https://en.wikipedia.org/wiki/Comparison_of_data_serialization_formats)). 

csv files are plain text files (**note**: read_csv assumes [UTF-8](https://en.wikipedia.org/wiki/UTF-8) encoding by default; for a file in another encoding, such as ANSI, convert it first or pass the **encoding=...** argument).  
Here are the first three lines (truncated) of the ctw.csv file.  

```
code,country,region,pop,PPP,GDP,PPPpc,GDPpc, ...
AFG,Afghanistan,Southern Asia, 30552.0, NA, 20496.8, NA, ... 
ALB,Albania,Southern Europe, 3173.0, 28211.4, 12648.1, 9962.6, 4466.9, ...
```

The first row contains the names of the columns/variables.  
Each subsequent row represents a row of data (each row has the variable-values for one observed case).  
The values of the different variables (i.e. the columns) are separated by a comma.  

Below we will demonstrate the basics of reading csv files with the pandas [read_csv()](http://pandas.pydata.org/pandas-docs/stable/io.html#io-read-csv-table) function.   
When the file is tidy, the simplest form of the read_csv() function is often all we need.  
Here is an example:

In [2]:
dat = pd.read_csv('./dat/ctw.csv')
print(dat.shape)
dat.head(2)

(152, 35)


Unnamed: 0,code,country,region,pop,PPP,GDP,PPPpc,GDPpc,HDI,KOFec,...,gini,voice,demo,stab,govEff,regQual,ROLwb,ROLwjp,CORwb,fragil
0,AFG,Afghanistan,Southern Asia,30552.0,,20496.8,,,0.374,,...,,-1.32,2.48,-2.42,-1.4,-1.21,-1.72,0.34,-1.41,4.22
1,ALB,Albania,Southern Europe,3173.0,28211.4,12648.1,9962.6,4466.9,0.749,61.9,...,,0.01,5.67,-0.16,-0.28,0.17,-0.57,0.49,-0.72,2.19


The table contains data about 152 countries (i.e. each row/observed case is a country).  
Each country is described by 35 variables: country name, population, gdp, etc.  
Section 4 of this notebook reads a file with a short explanation of each variable (more [info](https://github.com/vilkoos/CTWdata)).

#### After reading a file it is good practice to check the resulting dataframe.  
Problems that frequently occur are:
- the row containing the column names is not recognized (solution: add the argument **header=...**)
- one of the columns should be the index (solution: add the argument **index_col=...**)
- one of the codes for missing values is not recognized as such (solution: add the argument **na_values=[...]**)
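
The three arguments can be combined in one call. Here is a minimal sketch on made-up data (an extra line before the header, and -1 as a hypothetical missing-value code); it is not one of the cookbook files.

```python
import io
import pandas as pd

# Made-up file content: one junk line before the header, -1 used as a missing-value code.
raw = "exported 2012\ncode,country,pop\nAFG,Afghanistan,-1\nALB,Albania,3173\n"

df = pd.read_csv(
    io.StringIO(raw),
    header=1,           # the column names are on the second line (counting from 0)
    index_col="code",   # use the country code as the index
    na_values=["-1"],   # treat -1 as a missing value
)
print(df)
```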

In our case the read_csv got the column-names perfectly and there are no problems with missing values codes.  
However, the index should be set to the first column, i.e. the one that contains the uniquely identifying country code.   
We can do this by adding the argument **index_col=0** (the column code has column-index number 0).

In [3]:
dat = pd.read_csv('./dat/ctw.csv', index_col=0 )
print(dat.shape)
dat.head(2)

(152, 34)


code,country,region,pop,PPP,GDP,PPPpc,GDPpc,HDI,KOFec,KOFsoc,...,gini,voice,demo,stab,govEff,regQual,ROLwb,ROLwjp,CORwb,fragil
AFG,Afghanistan,Southern Asia,30552.0,,20496.8,,,0.374,,17.4,...,,-1.32,2.48,-2.42,-1.4,-1.21,-1.72,0.34,-1.41,4.22
ALB,Albania,Southern Europe,3173.0,28211.4,12648.1,9962.6,4466.9,0.749,61.9,42.1,...,,0.01,5.67,-0.16,-0.28,0.17,-0.57,0.49,-0.72,2.19


## more problems

Another problem that frequently occurs is that fields are not separated by a comma (tab or ; are common alternatives).  
Some countries use the comma as decimal point (i.e. they write 100,000.111 as 100.000,111).  
Here is an example of such a csv file.

> col1;col2;col3   
> aaa;0,111;10.000   
> bbb;0,222;20.000  
> ccc;0,333;30.000   

Here the **;** is used as separator, the **,** is the decimal point and the **.** is used to separate the thousands.  
Let's see what happens when we read this csv file.

In [4]:
dat2 = pd.read_csv('./dat/csv2.csv')
dat2

Unnamed: 0,col1;col2;col3
aaa;0,111;10.000
bbb;0,222;20.000
ccc;0,333;30.000


The result is a bit of a mess. We can repair this easily:
- add the argument sep=';' to specify the used separator
- add the argument decimal=',' to specify the symbol for indicating the decimal
- add the argument thousands="." to specify the symbol for indicating the thousands

In [5]:
dat2 = pd.read_csv('./dat/csv2.csv', sep=';', decimal=',', thousands='.')
dat2

Unnamed: 0,col1,col2,col3
0,aaa,0.111,10000
1,bbb,0.222,20000
2,ccc,0.333,30000
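
The same arguments work for tab-separated files. A small sketch, using made-up inline data (the values mimic csv2.csv, but the tab-separated text itself is an assumption for illustration):

```python
import io
import pandas as pd

# Made-up tab-separated text with a decimal comma and a dot as thousands separator.
tsv_text = "col1\tcol2\tcol3\naaa\t0,111\t10.000\nbbb\t0,222\t20.000\n"

dat_tab = pd.read_csv(io.StringIO(tsv_text), sep="\t", decimal=",", thousands=".")

# Checking the dtypes shows whether the numbers were really parsed as numbers.
print(dat_tab.dtypes)
```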


## problems with missing values

Pandas uses NaN to mark missing values.  
This is the nan (not-a-number) floating-point value from numpy.  
It is the marker you will see after reading numeric data (pandas also has NaT for missing dates and pd.NA for the newer nullable dtypes).

In the input files the missing values can be marked by several codes.  
Commonly used codes are:
- NA (not available, the R way of marking missing values)
- NULL (the SQL code)
- the empty cell or empty string "" (the Excel way)
- None (the standard Python code for missing values)

The pandas read_xxx functions recognize all (except None) as missing values.  
In the produced dataframe these values will be replaced by NaN.

Sometimes other codes are used to represent the missing value (e.g. 999).  
Here is an example file nans.csv

```
col1,col2,col3
xxx,0,000
aaa,NA,111
bbb,NaN,222
ccc,NULL,333
ddd,None,444
eee,,555
fff,999,666
```
**NOTE** there are **no spaces** between the comma and the values,  
when spaces are present the 'NA' is read as ' NA' etc.
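
If a file does have spaces after the commas, the read_csv argument `skipinitialspace=True` is one way to deal with them. A minimal sketch on made-up data (this is not nans.csv, which has no such spaces):

```python
import io
import pandas as pd

# Made-up data with a space after each comma.
spaced = "col1, col2\naaa, NA\nbbb, 1\n"

plain = pd.read_csv(io.StringIO(spaced))
stripped = pd.read_csv(io.StringIO(spaced), skipinitialspace=True)

print(plain.columns.tolist())       # the second column name keeps its leading space
print(stripped.columns.tolist())    # leading spaces are skipped
print(stripped["col2"].isna())      # 'NA' is now recognized as missing
```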

Let's see what happens when we read nans.csv.

In [11]:
nans = pd.read_csv('./dat/nans.csv')
nans

Unnamed: 0,col1,col2,col3
0,xxx,0.0,0
1,aaa,,111
2,bbb,,222
3,ccc,,333
4,ddd,,444
5,eee,,555
6,fff,999.0,666


To tell read_csv that None and 999 are also missing values, specify the na_values argument.

In [12]:
nans = pd.read_csv('./dat/nans.csv', na_values=['None','999'])
nans

Unnamed: 0,col1,col2,col3
0,xxx,0.0,0
1,aaa,,111
2,bbb,,222
3,ccc,,333
4,ddd,,444
5,eee,,555
6,fff,,666


That worked as expected. Note that the integer 0 in the first row of col2 is shown as the float 0.0:   
NaN is numpy.nan, which is a float, so a numeric column that contains NaN is stored as float64.
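
If integers should stay integers despite missing values, the nullable integer dtype of pandas is one option. A minimal sketch, assuming a pandas version that supports the "Int64" dtype in read_csv; the data are a shortened, inlined copy of nans.csv so the example is self-contained.

```python
import io
import pandas as pd

# Shortened copy of the nans.csv values, inlined for the example.
text = "col1,col2\nxxx,0\naaa,NA\nfff,999\n"

# "Int64" (capital I) is the nullable integer dtype; missing values are kept as <NA>.
ints = pd.read_csv(io.StringIO(text), na_values=["999"], dtype={"col2": "Int64"})
print(ints["col2"].dtype)
print(ints)
```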

## good practice

After reading a file, check the result. When there are problems add arguments to the read_csv().  
See the [documentation](http://pandas.pydata.org/pandas-docs/stable/io.html#io-read-csv-table) for a list of available arguments.
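
What "check the result" can look like in practice is sketched below. The small inline table is made up; with a real file the same three checks apply to the dataframe returned by read_csv().

```python
import io
import pandas as pd

# Made-up stand-in for a freshly read file.
text = "code,pop,HDI\nAFG,30552.0,0.374\nALB,3173.0,\n"
df = pd.read_csv(io.StringIO(text), index_col="code")

print(df.dtypes)            # did the numeric columns end up numeric?
print(df.isna().sum())      # how many missing values per column?
print(df.index.is_unique)   # does the index uniquely identify the rows?
```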

# 3 - read_excel()

Let's read data from the Excel workbook city.xlsx.   
(Tip: have a look at it first, the file is located in the subdirectory cookbook\dat\)

The spreadsheet city.xlsx has two sheets:
- **citydata** has data for 69 large cities, each city is described by 42 variables
- **dictionary** has an extensive definition of each of the 42 variables.

This excel file contains data that have been meticulously cleaned (i.e. they are definitely tidy).  
There should be no problems with reading it.   

In [6]:
#-- read the Excel sheet named citydata -----------------
#-- (the argument is sheet_name; the older spelling sheetname was removed from pandas)
cities = pd.read_excel('./dat/city.xlsx', sheet_name='citydata')
print(cities.shape)
cities.head()

(69, 42)


Unnamed: 0,city,areaC,areaM,popC,popM,fornB,growP,ppp,share,unempR,...,lfExpF,nrMDs,nrHosp,asLegi,nrMus,nrArts,greens,airQ,effLaw,retFit
0,London,321.0,1584.0,8.2,9.01,0.37,0.009,52.0,0.032,0.083,...,83.3,,255.0,1.0,237.0,307.0,0.14,29.0,1.0,1.0
1,Amsterdam,165.0,807.0,0.76,1.4,0.473,0.012,46.0,0.01,0.054,...,80.8,269.0,7.0,1.0,68.0,141.0,0.57,24.0,1.0,0.0
2,Ankara,31.0,25437.0,3.54,4.77,,0.257,21.2,,0.121,...,,,7.0,1.0,36.0,,0.07,46.0,0.0,0.0
3,Athens,39.0,381.0,0.66,4.01,0.22,,30.5,,0.162,...,,,23.0,1.0,47.0,,,41.0,1.0,
4,Bangkok,1569.0,7762.0,5.72,6.5,0.2,0.031,23.4,0.002,0.022,...,74.0,,173.0,,27.0,,0.24,54.0,1.0,1.0


As we expect when we read tidy data, the dataframe can be read with a straightforward  read command.  
The dataframe is ready for use (we do not have to solve problems).  
The only thing we might want to change is that we could use the city names as the index (we will not do that here). 
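
For completeness, here is a sketch of how the city names could be made the index. It uses a two-row stand-in dataframe (values copied from the output above) rather than the real cities dataframe, so it does not change anything used later in this notebook.

```python
import pandas as pd

# Stand-in for the cities dataframe: two rows and two columns from the output above.
cities_small = pd.DataFrame({"city": ["London", "Amsterdam"], "areaC": [321.0, 165.0]})

# set_index returns a new dataframe with the city column as index.
by_city = cities_small.set_index("city")
print(by_city.loc["Amsterdam", "areaC"])
```

Passing index_col=0 to read_excel should give the same result at read time, as it did for read_csv.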

Let's put the contents of the dictionary worksheet in a dataframe city_cb (city code book).

In [7]:
#-- read the dictionary into dataframe city_cb----------
city_cb = pd.read_excel('./dat/city.xlsx', sheet_name='dictionary')
print(city_cb.shape)
city_cb.head()

(42, 7)


Unnamed: 0,variable name,description,variable group,domain type,domain constraint,measurement unit,missing values
0,city,English name of the city,,string,identifier (unique and not null),,not allowed
1,areaC,City Area (km2),Geography,real,>=0 with 0 decimals,km^2,empty cell
2,areaM,Metro Area (km2),Geography,real,>=0 with 0 decimals,km^2,empty cell
3,popC,City Population (millions),People,real,>=0 up to 2 decimals,in milions of people,empty cell
4,popM,Metro Population (millions),People,real,>=0 up to 2 decimals,in milions of people,empty cell


As expected all went well in one pass.

# 4 - reading tables in html files with read_html()

Webpages are [html](https://en.wikipedia.org/wiki/HTML) files. Html files may contain one or more tables.  
The data in these tables can be read into a dataframe.  

Note: tables on webpages are not meant to store data for later retrieval.  
Chances are that the data in such tables are not tidy, so we probably have to do some data-wrangling. 

The ctw.csv file we read in section 2 also has a code book (alas, far simpler than the city_cb).  
We can find it in the subdirectory cookbook\dat\ctw_code_book.htm.  
ctw_code_book.htm is a html file that contains one table, here we will read this table into a dataframe.

In [8]:
#-- install html5lib if needed (not in the standard Anaconda distribution) ----------
import html5lib

tables = pd.read_html('./dat/ctw_code_book.htm')
#-- the read produces a list of dataframes, one for each table on the html page-
#-- we need the first dataframe from the list -----
codebook = tables[0]
codebook.head()

Unnamed: 0,0,1,2,3
0,nr,Var name,name,link
1,1,code,Country ISO code,ISO 3166-1
2,2,country,Country name,
3,3,region,Region,
4,4,pop,Population 2012 (in thousands),


In [9]:
codebook.tail(7)

Unnamed: 0,0,1,2,3
34,34.0,CORwb,Control of Corruption 2012 (World Bank) range...,
35,35.0,fragil,State Fragility (Internal peace Index 2012),
36,,,,
37,,,,
38,,,,
39,,,,
40,,,,


Problems:
- row 0 has the column names; these are read as data
- column 0 is the old index; it can be dropped
- column 3 can be dropped
- we only need the rows up to row 35 (all rows beyond it are empty)

In [10]:
# -- add header=0 to indicate that the column names are on the first row
# -- nrows=35 to read only the first 35 lines does not work in read_html()
# -- usecols=[1,2] does not work for read_html 
tables = pd.read_html('./dat/ctw_code_book.htm', header=0)
codebook = tables[0]
codebook.head()

Unnamed: 0,nr,Var name,name,link
0,1.0,code,Country ISO code,ISO 3166-1
1,2.0,country,Country name,
2,3.0,region,Region,
3,4.0,pop,Population 2012 (in thousands),
4,5.0,PPP,GDP based on PPP 2011 in dollars,PPP


In [11]:
#-- keep only the columns Var name and name
codebook = codebook[['Var name','name']]

In [12]:
codebook.head()

Unnamed: 0,Var name,name
0,code,Country ISO code
1,country,Country name
2,region,Region
3,pop,Population 2012 (in thousands)
4,PPP,GDP based on PPP 2011 in dollars


In [13]:
# --- keep the first 35 lines with meaningful info --
codebook = codebook[0:35]
# --- show the total contents of the code book ------------
codebook.head(35)

Unnamed: 0,Var name,name
0,code,Country ISO code
1,country,Country name
2,region,Region
3,pop,Population 2012 (in thousands)
4,PPP,GDP based on PPP 2011 in dollars
5,GDP,GDP in current Dollars 2012 (in millions)
6,PPPpc,GDP Per Capita based on PPP 2011
7,GDPpc,GDP Per Capita in current Dollars 2012
8,HDI,Human Development Index (HDI) value 2012
9,KOFec,Economic Globalization index (KOF) 2011


As expected we had to do some extra work to get to our result.

The reason for this extra work is that we read data from a file that is not tidy.  
If the files are not tidy, you will probably run into puzzles like these.
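
The slicing step hard-codes the number of meaningful rows. An alternative is to drop every row that is completely empty. The sketch below shows this on a made-up four-row stand-in for the code book, not on the real table.

```python
import pandas as pd

# Stand-in for the code book after column selection: two real rows plus empty padding rows.
codebook_small = pd.DataFrame({
    "Var name": ["code", "country", None, None],
    "name": ["Country ISO code", "Country name", None, None],
})

# Drop rows in which every column is missing, then renumber the index.
cleaned = codebook_small.dropna(how="all").reset_index(drop=True)
print(cleaned)
```

This keeps working if rows are added to or removed from the table, as long as the padding rows are entirely empty.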