<div class="licence">
<span>Licence CC BY-NC-ND</span>
<span>Valérie Roy</span>
<span><img src="media/ensmp-25-alpha.png" /></span>
</div>

# Importing data in pandas

In [None]:
import numpy as np
import pandas as pd

## common file formats

   - **pandas** can **import** files in **many formats**
      - CSV, JSON, HTML, Excel, ...
   - let us play with the **csv** format  
   - [RTFM about the others](http://pandas.pydata.org/pandas-docs/stable/user_guide/io.html)

## writing a csv file

*csv* stands for "comma-separated values"

   - to write a **csv** use the **method** *pandas.DataFrame.to_csv*

In [None]:
# just preparing a dataframe...

distance = pd.Series(
    [0.387, 0.723, 30, 1., 5.203, 1.523, 9.6, 19.19],
    index=['Mercury', 'Venus', 'Neptune', 'Earth', 
           'Jupiter', 'Mars', 'Saturn', 'Uranus'])

lowest_temp = pd.Series(
    [-200.0, 446.0,  -90.0, -125.0, -140.0],
    index=['Mercury', 'Venus', 'Earth', 'Jupiter', 'Mars'])

highest_temp = pd.Series(
    [430.0, 490.0, 60.0, 17.0, 20.0],
    index=['Mercury', 'Venus', 'Earth', 'Jupiter', 'Mars'])

planets = pd.DataFrame(
   {'distance': distance,
    'lowest temperature': lowest_temp, 
    'highest temperature': highest_temp, 
    'origin':'solar system'})

In [None]:
planets

In [None]:
planets.index

In [None]:
# this will store our data in file "planets.csv"
planets.to_csv('planets.csv', index_label='names', float_format='%.3f')

   - a file **planets.csv** has been **created** in your current folder
   - we gave a **name** to the **rows** index

   - the csv **format** is very **simple**: a 2-dimensional table, where:
      - by default, the **first** line is the **columns** header  
        (**labels** if any, else **indexes**)
      - the **other** lines are **rows**, one per line,  
        with values **separated by** ','

In [None]:
# here is the actual file contents

!cat planets.csv
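
`!cat` relies on a Unix shell being available. As a portable sketch of the same inspection, `to_csv` called without a path returns the csv text as a string. The dataframe `small` below is a hypothetical two-row stand-in for `planets`, using values from the cells above.

```python
import pandas as pd

# a tiny stand-in for the planets dataframe (hypothetical name `small`)
small = pd.DataFrame({'distance': [1.0, 1.523]}, index=['Earth', 'Mars'])

# with no path given, to_csv returns the csv text instead of writing a file
text = small.to_csv(index_label='names', float_format='%.3f')

lines = text.splitlines()
print(lines[0])        # the header line: index label, then column names
for row in lines[1:]:  # one line per row, values separated by ','
    print(row)
```

The header line carries the `index_label` first, followed by the column names; each following line starts with the row's index value.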

## reading a *csv* file

- to **read** a **csv** use the **function** *pandas.read_csv*

In [None]:
# this returns a dataframe all right
df_from_file = pd.read_csv('planets.csv')

# BUT: the file does have names for the columns
# however nothing says that the first column
# should be the index..
# so let us enforce that

df_from_file = df_from_file.set_index('names')     

In [None]:
df_from_file
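
As an alternative to calling `set_index` afterwards, `read_csv` accepts an `index_col` parameter that picks the index column at read time. The sketch below reads from an in-memory string (via `io.StringIO`) rather than from `planets.csv`, only so that it runs on its own; the csv text is a hypothetical excerpt.

```python
import io
import pandas as pd

# a hypothetical excerpt of a csv file, held in memory for the example
csv_text = "names,distance\nEarth,1.000\nMars,1.523\n"

# index_col tells read_csv which column to use as the rows index
df = pd.read_csv(io.StringIO(csv_text), index_col='names')

print(df.index)
print(df.columns)
```

With a real file, the same call would presumably read `pd.read_csv('planets.csv', index_col='names')`.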

**digression**

In [None]:
# NOTE that along the way we have lost a bit of precision

df_from_file == planets

**2 phenomena are at work here**

- we had explicitly used the format `%.3f`, which keeps only 3 decimals

> `planets.to_csv('planets.csv', index_label='names', float_format='%.3f')`

- even without that, floats **always** come with precision issues; [see e.g. this issue on GH](https://github.com/pandas-dev/pandas/issues/17154)

> *trying to get exact equality out of floating points is generally a losing battle*

In [None]:
planets.loc['Mars', 'distance'], df_from_file.loc['Mars', 'distance']
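
Since exact equality is fragile, a comparison up to a tolerance is usually more informative. The sketch below uses `np.isclose` on a small, purely numeric dataframe with hypothetical values; a column of strings such as `origin` would have to be left out first. Note also that missing values (`NaN`) never compare equal with `==`, which is why `equal_nan=True` is passed here.

```python
import io
import numpy as np
import pandas as pd

# hypothetical numeric data, with one value that has more than 3 decimals
original = pd.DataFrame(
    {'distance': [1.0, 1.523456, 30.0],
     'lowest temperature': [-90.0, -140.0, np.nan]},
    index=['Earth', 'Mars', 'Neptune'])

# round trip through csv text, keeping 3 decimals only
csv_text = original.to_csv(index_label='names', float_format='%.3f')
reloaded = pd.read_csv(io.StringIO(csv_text), index_col='names')

# exact comparison: False where decimals were cut, and where a NaN sits
exact = reloaded == original

# comparison up to an absolute tolerance, treating NaN as equal to NaN
close = pd.DataFrame(
    np.isclose(reloaded.to_numpy(), original.to_numpy(),
               atol=1e-3, equal_nan=True),
    index=original.index, columns=original.columns)

print(exact)
print(close)
```

The tolerance `atol=1e-3` is chosen to match the 3 decimals kept by `float_format`; another format would call for another tolerance.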

**See also**

function *pandas.read_csv*
   - has many optional **parameters** that you can **set**
   - see the help

In [None]:
#pd.read_csv?

In [None]:
#pd.DataFrame.to_csv?
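
To give an idea of the optional parameters of `read_csv`, here is a minimal sketch that combines a few of them: `sep` for another separator, `usecols` to keep only some columns, and `nrows` to read only the first rows. The csv text is a hypothetical example using ';' as separator.

```python
import io
import pandas as pd

# a hypothetical csv text that uses ';' instead of ',' as separator
csv_text = (
    "names;distance;origin\n"
    "Earth;1.000;solar system\n"
    "Mars;1.523;solar system\n"
    "Venus;0.723;solar system\n"
)

df = pd.read_csv(
    io.StringIO(csv_text),
    sep=';',                        # the separator used in the text
    index_col='names',              # the column to use as rows index
    usecols=['names', 'distance'],  # keep only these columns
    nrows=2)                        # read only the first 2 data rows

print(df)
```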