In [1]:
%pylab inline
import numpy as np
import pandas as pd

Populating the interactive namespace from numpy and matplotlib


* # [Read From Text File](#read)
    * [usecols](#usecols)
    * [index_col](#index_col)
    * [handle missing values](#missing)
    * [reading pieces instead of reading all](#readpiece)
        * [chunksize](#chunksize)
* # [Write To Text File](#write)
    * [deal with inconsistency between to_csv and read_csv](#inconsistency)
        * [easier way to solve the inconsistency: use from_csv classmethod](#from_csv)
    * [write into string](#write-into-string)

<a id="read"></a>
## Read From Text File

In [2]:
!type example1.csv

a,b,c,d,message
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo


by default, <span style="font-size:2em;">**read_csv**</span> will
1. use the first row as the column names
2. use the row number as the index

In [3]:
pd.read_csv("example1.csv")

,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


we can use <span style="font-size:2em;">**read_table**</span> and specify the delimiter to achieve the same goal

In [4]:
pd.read_table("example1.csv",sep=",")

,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


<a id="usecols"></a>
### usecols
Use **usecols** to read only a subset of the columns. <span style="color:red;font-weight:bold">Note that if you also specify index_col, usecols must include that column.</span>

In [5]:
pd.read_csv("example1.csv", index_col="message", usecols=["message","d","a"])

message,a,d
hello,1,4
world,5,8
foo,9,12
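
The output above lists the columns as a, d even though usecols asked for message, d, a: the columns come back in file order, not in the order given. A minimal sketch of this, written for Python 3 and a recent pandas, with an in-memory string standing in for example1.csv (an assumption made so the snippet is self-contained):

```python
import io
import pandas as pd

# In-memory stand-in for example1.csv (assumption, to avoid touching the disk)
csv_text = "a,b,c,d,message\n1,2,3,4,hello\n5,6,7,8,world\n9,10,11,12,foo\n"

wanted = ["message", "d", "a"]
df = pd.read_csv(io.StringIO(csv_text), usecols=wanted)
print(list(df.columns))          # columns follow the order in the file

# To get the requested order, reorder explicitly after reading
df_ordered = df[wanted]
print(list(df_ordered.columns))
```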


when the file doesn't have a header row
1. set **header=None** to let pandas assign default integer column names. **The header parameter is the row number(s) to use as the column names; it also marks the start of the data.**
2. supply the column names yourself by passing a list to the **names** parameter

In [6]:
!type example2.csv

1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo


In [7]:
pd.read_csv("example2.csv",header=None)

,0,1,2,3,4
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


In [8]:
pd.read_csv("example2.csv",names=["a","b","c","d","message"])

,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo
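
The names parameter can also replace a header row that already exists, provided header=0 is passed as well. A small sketch under the same assumptions as elsewhere in these notes (Python 3, recent pandas, in-memory text instead of a file; the new column names are made up for illustration):

```python
import io
import pandas as pd

# Stand-in for a file that already has a header row (assumption)
csv_text = "a,b,c,d,message\n1,2,3,4,hello\n5,6,7,8,world\n9,10,11,12,foo\n"
new_names = ["w", "x", "y", "z", "text"]   # hypothetical replacement names

# header=0 says "row 0 is the header", names overrides what it contains
renamed = pd.read_csv(io.StringIO(csv_text), names=new_names, header=0)
print(renamed)

# With names alone, pandas assumes there is no header, so the old
# header line is read as the first data row
with_header_row = pd.read_csv(io.StringIO(csv_text), names=new_names)
print(with_header_row)
```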


<a id="index_col"></a>
### index_col
Specify which column becomes the row index by setting the index_col parameter. You can pass either a column number or a column name.

In [9]:
pd.read_csv("example1.csv",index_col="message")

message,a,b,c,d
hello,1,2,3,4
world,5,6,7,8
foo,9,10,11,12
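
As a quick check that a column number and a column name are interchangeable here, the sketch below reads the same data both ways. It also passes a list to index_col, which builds a hierarchical index; that last part goes beyond what the cell above shows. In-memory text stands in for example1.csv (an assumption), and the code targets Python 3 with a recent pandas.

```python
import io
import pandas as pd

csv_text = "a,b,c,d,message\n1,2,3,4,hello\n5,6,7,8,world\n9,10,11,12,foo\n"

by_name = pd.read_csv(io.StringIO(csv_text), index_col="message")
by_position = pd.read_csv(io.StringIO(csv_text), index_col=4)  # 0-based position
print(by_name.equals(by_position))

# A list of columns produces a MultiIndex
multi = pd.read_csv(io.StringIO(csv_text), index_col=["message", "a"])
print(multi.index.nlevels)
```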


skip comment lines by listing their row numbers in the **skiprows** argument

In [10]:
!type example3.csv

# hey!
a,b,c,d,message
# just wanted to make things more difficult for you
# who reads CSV files with computers, anyway?
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo


In [11]:
pd.read_csv("example3.csv",skiprows=[0,2,3])

,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo
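
When every comment line starts with the same character, the comment parameter of read_csv is an alternative to counting row numbers. A hedged sketch, with the content of example3.csv reproduced as an in-memory string (an assumption) and written for Python 3 and a recent pandas:

```python
import io
import pandas as pd

csv_text = (
    "# hey!\n"
    "a,b,c,d,message\n"
    "# just wanted to make things more difficult for you\n"
    "# who reads CSV files with computers, anyway?\n"
    "1,2,3,4,hello\n"
    "5,6,7,8,world\n"
    "9,10,11,12,foo\n"
)

by_rows = pd.read_csv(io.StringIO(csv_text), skiprows=[0, 2, 3])
by_comment = pd.read_csv(io.StringIO(csv_text), comment="#")
print(by_rows.equals(by_comment))
```

Keep in mind that comment also cuts off the remainder of a data line once the character appears, so it is only safe when "#" never occurs inside real values.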


<a id="missing"></a>
### deal with missing values
Missing data is usually either 
* not present (empty string) or 
* marked by some sentinel value. By default, pandas uses a set of commonly occurring sentinels, such as **NA, -1.#IND, and NULL <span style="color:red">(case sensitive)</span>**

In [12]:
!type example4.csv

index,a,b,c,d,message
one,-1,2,3,4,NA
two,5,6,,-999,world
three,NULL,10,11,12,foo


In [13]:
pd.read_csv("example4.csv")

,index,a,b,c,d,message
0,one,-1.0,2,3.0,4,
1,two,5.0,6,,-999,world
2,three,,10,11.0,12,foo
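
If strings such as NA or NULL are real data rather than missing-value markers, the default sentinels can be switched off with keep_default_na=False. The sketch below compares both behaviours on the content of example4.csv, held in an in-memory string (an assumption); it targets Python 3 and a recent pandas.

```python
import io
import pandas as pd

csv_text = (
    "index,a,b,c,d,message\n"
    "one,-1,2,3,4,NA\n"
    "two,5,6,,-999,world\n"
    "three,NULL,10,11,12,foo\n"
)

default = pd.read_csv(io.StringIO(csv_text))
raw = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

print(default["message"].isna().tolist())  # "NA" was turned into NaN
print(raw["message"].tolist())             # "NA" kept as an ordinary string
print(raw["a"].tolist())                   # "NULL" kept, so the column holds strings
```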


The **na_values** option accepts a list or set of values to treat as missing, or a dict that maps column names to the sentinels for that column only

In [14]:
pd.read_csv("example4.csv",na_values=[-1])# all columns will regard -1 as NaN

,index,a,b,c,d,message
0,one,,2,3.0,4,
1,two,5.0,6,,-999,world
2,three,,10,11.0,12,foo


In [15]:
pd.read_csv("example4.csv",na_values={'a':[-1],'d':[-999]})# per-column sentinels: -1 is NaN only in column a, -999 only in column d

,index,a,b,c,d,message
0,one,,2,3.0,4.0,
1,two,5.0,6,,,world
2,three,,10,11.0,12.0,foo


<a id="readpiece"></a>
### reading pieces instead of reading all

In [16]:
bigdf = pd.read_csv("example5.csv")
bigdf.head()

,one,two,three,four,key
0,0.467976,-0.038649,-0.295344,-1.824726,L
1,-0.358893,1.404453,0.704965,-0.200638,B
2,-0.50184,0.659254,-0.421691,-0.057688,G
3,0.204886,1.074134,1.388361,-0.982404,R
4,0.354628,-0.133116,0.283763,-0.837063,Q


In [17]:
# first count the keys the inefficient way, with the whole file loaded at once; the result is used to verify the chunked computation later
bigdf["key"].value_counts().order(ascending=False)[:5]

E    368
X    364
L    346
O    343
Q    340
dtype: int64

<a id="chunksize"></a>
### chunksize
When you pass a **chunksize** argument, read_csv returns an iterable parser object instead of a DataFrame (a **TextParser** in the pandas version used here; newer releases name it TextFileReader). Looping over it with "for" is equivalent to calling its get_chunk() method repeatedly: each step yields a DataFrame with at most chunksize rows. This lets us walk through a big file piece by piece instead of loading all of its content into memory.

In [18]:
chunker = pd.read_csv("example5.csv",chunksize=1000)
total = pd.Series([]) # hold and count the intermediate results during loop
for index,piece in enumerate(chunker):
    print "\n*******************\n%d-th piece: %s"%(index+1,piece.shape)
    current_counts = piece["key"].value_counts()
    total = total.add(current_counts,fill_value=0)
    
    print "\ttop 5 in current piece:\n%s"%(current_counts.order(ascending=False)[:5])


*******************
1-th piece: (1000, 5)
	top 5 in current piece:
S    48
O    44
F    40
J    39
Q    39
dtype: int64

*******************
2-th piece: (1000, 5)
	top 5 in current piece:
O    48
L    44
X    40
I    39
R    38
dtype: int64

*******************
3-th piece: (1000, 5)
	top 5 in current piece:
O    40
A    40
X    39
E    39
M    38
dtype: int64

*******************
4-th piece: (1000, 5)
	top 5 in current piece:
X    43
J    41
V    38
D    38
Q    38
dtype: int64

*******************
5-th piece: (1000, 5)
	top 5 in current piece:
E    54
Q    42
L    40
H    39
K    38
dtype: int64

*******************
6-th piece: (1000, 5)
	top 5 in current piece:
Y    42
K    41
F    41
X    38
V    37
dtype: int64

*******************
7-th piece: (1000, 5)
	top 5 in current piece:
C    41
L    41
Z    39
W    39
X    38
dtype: int64

*******************
8-th piece: (1000, 5)
	top 5 in current piece:
J    42
M    40
E    39
D    39
X    39
dtype: int64

*******************
9-th piece:

In [19]:
# check the final result
total.order(ascending=False)[:5]

E    368
X    364
L    346
O    343
Q    340
dtype: float64
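
The cells above rely on Python 2 print statements and on Series.order, which is no longer available in recent pandas (sort_values does the same job). A sketch of the same chunked counting for Python 3 and a recent pandas follows. The data is generated in memory instead of being read from example5.csv, and the column names and sizes are assumptions made for illustration.

```python
import io
import random
import pandas as pd

# Build a small stand-in for example5.csv: 2500 rows with a random key column
random.seed(0)
rows = ["value,key"] + ["%d,%s" % (i, random.choice("ABCDE")) for i in range(2500)]
csv_text = "\n".join(rows) + "\n"

total = pd.Series(dtype="float64")   # running counts across chunks
n_chunks = 0
for piece in pd.read_csv(io.StringIO(csv_text), chunksize=1000):
    # add() with fill_value=0 treats keys missing on either side as 0
    total = total.add(piece["key"].value_counts(), fill_value=0)
    n_chunks += 1

print(total.sort_values(ascending=False).head())

# Verification against the load-everything approach
full_counts = pd.read_csv(io.StringIO(csv_text))["key"].value_counts()
print(total.astype(int).to_dict() == full_counts.to_dict())
```

The running total is a float Series because add() with fill_value works on floats, which matches the float64 dtype in the result above.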

<a id="write"></a>
## Write To Text File

<a id="inconsistency"></a>
### pay attention to inconsistency between to_csv and read_csv
<span style="color:red">Note that calling to_csv without any extra parameters writes the index as the first column with an empty header. If we then **read the file back with read_csv and its default arguments, the result is not what we want**: the old index comes back as an ordinary column named "Unnamed: 0".</span>

In [20]:
df = pd.DataFrame(np.arange(1,10).reshape(3,3), columns = ["a","b","c"],   index = ["record1","record2","record3"])
df.to_csv("example6.csv")
!type "example6.csv"

,a,b,c
record1,1,2,3
record2,4,5,6
record3,7,8,9


In [21]:
pd.read_csv("example6.csv")

,Unnamed: 0,a,b,c
0,record1,1,2,3
1,record2,4,5,6
2,record3,7,8,9


To solve this problem, we can use <span style="color:MediumVioletRed">**"index_label"**</span> to give the index column a header when writing, and when reading back specify <span style="color:MediumVioletRed">**"index_col"**</span> to indicate which column is the index.

In [22]:
df.to_csv("example6.csv",index_label="index")
!type example6.csv

index,a,b,c
record1,1,2,3
record2,4,5,6
record3,7,8,9


In [23]:
pd.read_csv("example6.csv",index_col=0)# indicate the first column is the index column

index,a,b,c
record1,1,2,3
record2,4,5,6
record3,7,8,9


<a id="from_csv"></a>
### <span style="color:IndianRed">easier way to solve the inconsistency: use from_csv classmethod</span>

In [24]:
df = pd.DataFrame(np.arange(1,10).reshape(3,3), columns = ["a","b","c"],   index = ["record1","record2","record3"])
df.to_csv("example6.csv")
!type "example6.csv"

,a,b,c
record1,1,2,3
record2,4,5,6
record3,7,8,9


The CSV file above is not parsed the way we want by pd.read_csv with its default arguments. However, we can use the <span style="color:darkorange">**"DataFrame.from_csv"**</span> classmethod to read it back

In [25]:
pd.DataFrame.from_csv("example6.csv")

,a,b,c
record1,1,2,3
record2,4,5,6
record3,7,8,9
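
DataFrame.from_csv was deprecated and later removed from pandas, so on a recent installation the call above fails. The replacement is read_csv with index_col=0 (plus parse_dates=True if the index holds dates, which from_csv tried by default). A minimal sketch for Python 3, using an in-memory buffer instead of example6.csv (an assumption):

```python
import io
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(1, 10).reshape(3, 3),
                  columns=["a", "b", "c"],
                  index=["record1", "record2", "record3"])

buf = io.StringIO()
df.to_csv(buf)          # index written as first column, with an empty header
buf.seek(0)

# index_col=0 turns the unlabeled first column back into the index
restored = pd.read_csv(buf, index_col=0)
print(restored)
```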


### deal with NaN
By default, NaN is written as an empty string. You can use **na_rep** to specify another representation.

In [26]:
df = pd.DataFrame([[1., 6.5, 3.],                            
                   [1., np.nan, np.nan],                            
                   [np.nan, np.nan, np.nan],                            
                   [np.nan, 6.5, 3.]],columns=list("abc"),index=["r%d" % i for i in xrange(1,5)])
df.to_csv("example7_nan.csv",index_label="index")# export NaN as empty string
!type example7_nan.csv

index,a,b,c
r1,1.0,6.5,3.0
r2,1.0,,
r3,,,
r4,,6.5,3.0


In [27]:
df.to_csv("example7_nan.csv",index_label="index",na_rep="nan")# export NaN as the string "nan"
!type example7_nan.csv

index,a,b,c
r1,1.0,6.5,3.0
r2,1.0,nan,nan
r3,nan,nan,nan
r4,nan,6.5,3.0
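
If the chosen na_rep is one of the sentinels that read_csv recognizes by default (NULL is one of them, as mentioned in the missing-values section), the missing cells survive a round trip. A sketch for Python 3 and a recent pandas, writing to a string instead of example7_nan.csv (an assumption):

```python
import io
import numpy as np
import pandas as pd

df = pd.DataFrame([[1., 6.5, 3.],
                   [1., np.nan, np.nan],
                   [np.nan, np.nan, np.nan],
                   [np.nan, 6.5, 3.]],
                  columns=list("abc"),
                  index=["r%d" % i for i in range(1, 5)])

# With no path, to_csv returns the CSV text
text = df.to_csv(index_label="index", na_rep="NULL")
print(text)

restored = pd.read_csv(io.StringIO(text), index_col="index")
# The NaN pattern is the same before and after the round trip
print(restored.isna().values.tolist() == df.isna().values.tolist())
```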


In [28]:
### partial export
df = pd.DataFrame(np.arange(1,10).reshape(3,3), columns = ["a","b","c"],   index = ["record1","record2","record3"])
df.to_csv("example8.csv",header=False,index=False)# no header row, no index column
!type example8.csv

1,2,3
4,5,6
7,8,9


In [29]:
# specify which columns to export and in what order
df.to_csv("example8.csv",columns=["c","a"],index=False)
!type example8.csv

c,a
3,1
6,4
9,7


<a id="write-into-string"></a>
### write into string

In [31]:
df = pd.DataFrame(np.arange(1,10).reshape(3,3), columns = ["a","b","c"],   index = ["record1","record2","record3"])
df

,a,b,c
record1,1,2,3
record2,4,5,6
record3,7,8,9


In [34]:
import cStringIO
csv = cStringIO.StringIO()
df.to_csv(csv)
csv.getvalue()

',a,b,c\nrecord1,1,2,3\nrecord2,4,5,6\nrecord3,7,8,9\n'

In [36]:
pd.read_csv(cStringIO.StringIO(csv.getvalue()))

,Unnamed: 0,a,b,c
0,record1,1,2,3
1,record2,4,5,6
2,record3,7,8,9


In [37]:
pd.DataFrame.from_csv(cStringIO.StringIO(csv.getvalue()))

,a,b,c
record1,1,2,3
record2,4,5,6
record3,7,8,9
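
cStringIO exists only in Python 2. On Python 3 the same steps can be sketched with io.StringIO, and to_csv called without a target returns the CSV text directly. The snippet assumes a recent pandas, where from_csv is gone and read_csv with index_col=0 is used to read the text back.

```python
import io
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(1, 10).reshape(3, 3),
                  columns=["a", "b", "c"],
                  index=["record1", "record2", "record3"])

# Option 1: write into an in-memory text buffer
buf = io.StringIO()
df.to_csv(buf)

# Option 2: omit the target and get the string back
text = df.to_csv()
print(text)

# Read the string back, treating the first column as the index
restored = pd.read_csv(io.StringIO(text), index_col=0)
print(restored)
```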
