## Ch06 Data Loading, Storage, and File Formats

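### Reading Data in Text Format
#### Basic Reading
- read_csv
- read_table (deprecated in pandas 0.24, with its interface folded into read_csv; later releases reinstated it)
- read_fwf
- read_clipboard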

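**Things to watch for when reading data**
- Indexing
- Type inference and data conversion
- Date parsing
- Iterating over large files
- Unclean data issues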
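
A minimal sketch of the first three points, assuming a small made-up table held in memory rather than one of the chapter's files (the column names `day`, `city`, and `temp` are invented for illustration):

```python
import io

import pandas as pd

# Hypothetical inline data standing in for a CSV file on disk.
raw = io.StringIO(
    "day,city,temp\n"
    "2000-01-01,Oslo,1.5\n"
    "2000-01-02,Rome,12\n"
)

# index_col picks the row index; parse_dates=True asks pandas to parse that index as dates.
df = pd.read_csv(raw, index_col="day", parse_dates=True)

print(df.index)   # a DatetimeIndex built from the `day` column
print(df.dtypes)  # `temp` is inferred as float64 because one value has a decimal point
```

Type inference happens per column, so a single `1.5` is enough to make the whole `temp` column floating point.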


In [None]:
import pandas as pd
from pandas import DataFrame, Series
import numpy as np

In [None]:
! cat ../pydata/ch06/ex1.csv
df = pd.read_csv('../pydata/ch06/ex1.csv')
print( df )
# read_table (deprecated at the time of these notes) needs sep=',' to match read_csv
pd.read_table('../pydata/ch06/ex1.csv', sep=',')


In [None]:
! cat ../pydata/ch06/ex2.csv

print('-'*32)
df = pd.read_csv('../pydata/ch06/ex2.csv', header=None)
print( df )

print('-'*32)
df = pd.read_csv('../pydata/ch06/ex2.csv', 
            names=['a','b','c','d', 'message']
           )
print( df )

In [None]:
! cat ../pydata/ch06/csv_mindex.csv

print('-'*32)
parsed = pd.read_csv('../pydata/ch06/csv_mindex.csv')
print( parsed )

print('-'*32)
parsed = pd.read_csv('../pydata/ch06/csv_mindex.csv', index_col=['key1', 'key2'])
print(parsed)
parsed

**Using a regular expression as the delimiter for read_table/read_csv**

In [None]:
with open('../pydata/ch06/ex3.txt') as f:
    l = list(f)

# raw string so the regex backslash is not treated as a string escape
result = pd.read_csv('../pydata/ch06/ex3.txt', sep=r'\s+')

[ l, result ]
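
The same idea can be tried without the data file. This is a sketch on an invented whitespace-aligned table; the values and labels are made up, and only `sep=r"\s+"` comes from the cell above:

```python
import io

import pandas as pd

# Hypothetical table with a variable amount of whitespace between fields.
# The header has one field fewer than the data rows, so pandas treats the
# first column of each data row as the index.
raw = io.StringIO(
    "A      B\n"
    "aaa  0.10   -1.5\n"
    "bbb  2.25    3.0\n"
)

table = pd.read_csv(raw, sep=r"\s+")
print(table)
```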

In [None]:
! cat ../pydata/ch06/ex4.csv
r1 = pd.read_csv('../pydata/ch06/ex4.csv')
r2 = pd.read_csv('../pydata/ch06/ex4.csv', skiprows=[0,2,3])
[r1, r2]

In [None]:
! cat ../pydata/ch06/ex5.csv
result = pd.read_csv('../pydata/ch06/ex5.csv')
result
pd.isnull(result)

In [None]:
result = pd.read_csv('../pydata/ch06/ex5.csv', na_values=['NULL'])
result

In [None]:
# A dict can assign different NA sentinel values to each column
sentinels = {
    'message': ['foo', 'NA'],
    'something': ['two']
}
result = pd.read_csv('../pydata/ch06/ex5.csv', na_values=sentinels)
result
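
A self-contained sketch of per-column sentinels, assuming a made-up three-row table in place of ex5.csv. The default markers such as `NA` stay active alongside the ones passed in:

```python
import io

import pandas as pd

# Hypothetical data; only the idea of a per-column dict comes from the cell above.
raw = "something,a,message\none,1,NA\ntwo,5,foo\nthree,9,world\n"

per_column = {"message": ["foo"], "something": ["two"]}
df = pd.read_csv(io.StringIO(raw), na_values=per_column)

# 'foo' is missing only in `message`, 'two' only in `something`.
print(df.isnull())
```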

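**read_csv parameters (book p. 178)**

Parameter | Description
--- | ---
path | Filesystem location or URL of the data
sep/delimiter | Field separator
header | Row number to use as column names. Defaults to 0 (the first row)
index_col | Column numbers or names to use as the row index
names | List of column names for the result; combine with header=None
skiprows | Number of rows to skip at the start of the file, or a list of row numbers to skip
na_values | Values to treat as NA (missing)
comment | Character that splits comments off the end of lines
parse_dates | Parse dates
keep_date_col | When joining several columns to parse a date, keep the joined columns
converters | Dict mapping column numbers/names to functions. For example, {'foo': f} applies function f to column foo
dayfirst | When parsing ambiguous dates, treat them as international (day-first) format
nrows | Number of rows to read
iterator | Return a TextParser (TextFileReader in current pandas) for reading the file piece by piece
chunksize | Size of each file chunk
skip_footer | Number of lines to ignore, counted from the end (spelled skipfooter in newer pandas)
verbose | Print various parser information
encoding | Text encoding for Unicode, for example utf-8
squeeze | If the result has only one column, return a Series
thousands | Thousands separator, for example , or .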
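
A few of these parameters combined in one call, as a rough sketch. The semicolon-separated data and its column names are invented for illustration:

```python
import io

import pandas as pd

# Hypothetical export: a comment line, ';' as separator, ',' as thousands mark.
raw = (
    "# made-up export\n"
    "name;amount;code\n"
    "alpha;1,250;007\n"
    "beta;2,000;042\n"
    "gamma;3,500;113\n"
)

df = pd.read_csv(
    io.StringIO(raw),
    sep=";",                   # field separator
    comment="#",               # drop the comment line
    thousands=",",             # read 1,250 as the integer 1250
    converters={"code": str},  # keep leading zeros by forcing strings
    nrows=2,                   # read only the first two data rows
)
print(df)
```

Without the converter, `code` would be inferred as an integer column and `007` would become `7`.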
 


#### Reading Text Files in Pieces

In [None]:
! head -n 3 ../pydata/ch06/ex6.csv
result = pd.read_csv('../pydata/ch06/ex6.csv')
result.tail()

In [None]:
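chunker = pd.read_csv('../pydata/ch06/ex6.csv', chunksize=1000)
# give the empty accumulator an explicit dtype to avoid a pandas warning
tot = Series([], dtype='float64')

for piece in chunker:
    tot = tot.add(piece['key'].value_counts(), fill_value=0)

# sort by count (not by key) so the slice below shows the most frequent keys
tot = tot.sort_values(ascending=False)
print( tot[:10] )
print( '-'*32 )
print(  tot.sum() )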
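
The chunked aggregation can be checked on a tiny in-memory input. This sketch assumes 20 made-up rows with a single `key` column and a chunk size of 6, so the reader yields four pieces:

```python
import io

import pandas as pd

# 20 hypothetical rows: the letters of "abcab" repeated four times, one per line.
raw = io.StringIO("key\n" + "\n".join("abcab" * 4) + "\n")

counts = pd.Series(dtype="float64")
n_chunks = 0
for piece in pd.read_csv(raw, chunksize=6):
    n_chunks += 1
    # Add this chunk's counts to the running total; keys not seen yet start at 0.
    counts = counts.add(piece["key"].value_counts(), fill_value=0)

counts = counts.sort_values(ascending=False)
print(n_chunks, counts.to_dict())
```

Only one chunk is in memory at a time, which is the point of `chunksize` for files too large to load whole.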

#### Writing Data Out to Text Format

In [None]:

data = pd.read_csv('../pydata/ch06/ex5.csv')
print(data)

print('-'*32)
data.to_csv('../pydata/ch06/out.csv')
! cat ../pydata/ch06/out.csv

print('-'*32)
import sys
data.to_csv(sys.stdout, sep='|')

print('-'*32)
data.to_csv(sys.stdout, na_rep='NULL')

print('-'*32)
data.to_csv(sys.stdout, index=False, header=False)

print('-'*32)
data.to_csv(sys.stdout, index=False, header=False,
            columns=list('abd')
           )
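
When no path is given, `to_csv` returns the text instead of writing it, which makes the options easy to inspect. A small sketch on an invented frame:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value.
df = pd.DataFrame({"a": [1, 2], "b": [np.nan, 4.0], "c": ["x", "y"]})

# No path argument: the CSV text comes back as a string.
text = df.to_csv(index=False, na_rep="NULL", columns=["a", "b"])
lines = text.splitlines()
print(lines)
```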


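**Series also has a to_csv method**
The matching reader, Series.from_csv, is deprecated and has been removed from newer pandas releases; pd.read_csv does the same job.

In [None]:
dates = pd.date_range('1/1/2000', periods=7)
ts = Series(np.arange(7), index=dates)
ts.to_csv(sys.stdout)

# header=False keeps the file to plain date,value rows
ts.to_csv('../pydata/ch06/out.csv', header=False)
# read the two columns back and squeeze the single data column into a Series
pd.read_csv('../pydata/ch06/out.csv', header=None, index_col=0, parse_dates=True).squeeze('columns')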

#### Working with Delimited Formats by Hand

In [None]:
! cat ../pydata/ch06/ex7.csv

In [None]:
import csv
f = open('../pydata/ch06/ex7.csv')
reader = csv.reader(f)
for line in reader:
    print(line)
f.close()

In [None]:
with open('../pydata/ch06/ex7.csv') as f:
    lines = list(csv.reader(f))
header, values = lines[0], lines[1:]
data_dict = {
    h:v for h,v in zip(header, zip(*values))
}
data_dict

**CSV files come in many flavors; defining a new format only takes a subclass of csv.Dialect**
- Delimiter
- String quoting convention
- Line terminator

In [None]:
class my_dialect(csv.Dialect):
    lineterminator = '\n'
    delimiter = ';'
    quotechar = '"'
    quoting = csv.QUOTE_MINIMAL  # same value as the literal 0

reader = csv.reader(
    open('../pydata/ch06/ex7.csv'), 
    dialect=my_dialect)

lines = list(reader)
print(lines)

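**CSV dialect options**
- delimiter        field separator
- lineterminator   line terminator
- quotechar        quote character for fields
- quoting          quoting convention
- skipinitialspace ignore whitespace after each delimiter
- doublequote      how a quote character inside a field is handled; if True, it is doubled
- escapechar       escape character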
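
To see several of these options interact, here is a sketch that writes and reads in memory with the standard csv module. The dialect name `SemiColon` and the sample rows are made up:

```python
import csv
import io


class SemiColon(csv.Dialect):  # hypothetical dialect for illustration
    delimiter = ";"
    quotechar = '"'
    doublequote = True        # a quote inside a field is written as ""
    skipinitialspace = True   # ignore a space right after the delimiter
    lineterminator = "\n"
    quoting = csv.QUOTE_MINIMAL


# Write one header row and one row that needs quoting.
buf = io.StringIO()
writer = csv.writer(buf, dialect=SemiColon)
writer.writerow(["id", "note"])
writer.writerow(["1", 'says "hi"; leaves'])
written = buf.getvalue()
print(written)

# Reading the same text back restores the original fields.
rows = list(csv.reader(io.StringIO(written), dialect=SemiColon))

# skipinitialspace drops the space that follows each ';'.
spaced = list(csv.reader(io.StringIO("a; b; c\n"), dialect=SemiColon))
```

`QUOTE_MINIMAL` quotes only the field that contains the delimiter or the quote character, so the header row is written bare.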


In [None]:
with open('mydata.csv', 'w', newline='') as f:  # newline='' as the csv docs recommend
    writer = csv.writer(f, dialect=my_dialect)
    writer.writerow(('one','two','three'))
    writer.writerow(('1','2','3'))

!cat mydata.csv

#### JSON Data
P184

In [None]:
obj = """ 
{
"name": "Wes",
"places_lived": ["United States", "Spain", "Germany"], 
"pet": null,
"siblings": [{"name": "Scott", "age": 25, "pet": "Zuko"},
             {"name": "Katie", "age": 33, "pet": "Cisco"}]
} 
"""

import json
result = json.loads(obj)
result

In [None]:
asjson = json.dumps(result)
asjson

In [None]:
siblings = DataFrame(result['siblings'], columns=['name', 'age'])
siblings
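
How JSON types map onto Python objects can be checked with the standard library alone. A short sketch using a trimmed-down version of the record above:

```python
import json

text = (
    '{"name": "Wes", "pet": null, '
    '"siblings": [{"name": "Scott", "age": 25}, {"name": "Katie", "age": 33}]}'
)

record = json.loads(text)

# JSON null becomes None, arrays become lists, objects become dicts.
ages = {s["name"]: s["age"] for s in record["siblings"]}

# dumps followed by loads gives back an equal structure.
round_trip = json.loads(json.dumps(record))
print(record["pet"], ages)
```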

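#### XML and HTML: Web Scraping
Python has many libraries for reading HTML and XML; lxml is a commonly used one.
- lxml.html
- lxml.objectify


Download some information from Yahoo Finance: find the URL of the data you want, open it with urllib2 (urllib.request in Python 3), then parse the resulting stream with lxml.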

P186

In [None]:
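from lxml.html import parse
from urllib.request import urlopen  # urllib2.urlopen in Python 2

# `url` is not given in these notes; set it to the page you want to scrape
parsed = parse(urlopen(url))
doc = parsed.getroot()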

links = doc.findall('.//a')
links[15:20]



#### Parsing XML with lxml.objectify

In [None]:
from lxml import objectify 
path = 'Performance_MNR.xml'
parsed = objectify.parse(open(path))
root = parsed.getroot()

data = []
skip_fields = ['PARENT_SEQ', 'INDICATOR_SEQ', 'DESIRED_CHANGE','DECIMAL_PLACES']

for elt in root.INDICATOR:
    el_data = {}
    for child in elt.getchildren():
        if child.tag in skip_fields:
            continue
        el_data[child.tag] = child.pyval
    data.append(el_data)
    
perf = DataFrame(data)
perf

In [None]:
from io import StringIO  # the StringIO module in Python 2
tag = '<a href="http://www.google.com">Google</a>'
root = objectify.parse(StringIO(tag)).getroot()
print( root )

root.get('href')
root.text


### Binary Data Formats

One of the simplest ways to store data in binary form is Python's built-in pickle serialization.

In [None]:
frame = pd.read_csv('../pydata/ch06/ex1.csv')
print ( frame )
frame.to_pickle('../pydata/ch06/frame_pickle')  # frame.save was removed from pandas
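
A round trip through pickle, sketched with a throwaway directory so nothing is left on disk. The frame contents are made up and the file name is arbitrary:

```python
import os
import tempfile

import pandas as pd

# Hypothetical frame standing in for the one read from ex1.csv.
frame = pd.DataFrame({"a": [1, 5], "b": [2, 6], "message": ["hello", "world"]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "frame_pickle")  # arbitrary file name
    frame.to_pickle(path)             # write the binary pickle
    restored = pd.read_pickle(path)   # load it back

print(restored.equals(frame))
```

Pickle files should only be loaded from sources you trust, since unpickling can run arbitrary code.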

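#### Using the HDF5 Format
- HDF stands for hierarchical data format
- HDF5 efficiently reads and writes scientific data stored in binary form on disk
- For very large datasets, PyTables and h5py are good choices

pandas has a minimal dict-like HDFStore class that stores pandas objects via PyTables: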

In [None]:
import pandas as pd
store  = pd.HDFStore('mydata.h5')
store['obj1'] = frame
store['obj1_col'] = frame['a']

print( store )
print( store['obj1'])

### Reading Microsoft Excel Files

In [None]:
xls_file = pd.ExcelFile('data.xls')
table = xls_file.parse('Sheet1')

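### Working with HTML and Web APIs
Many websites offer public APIs that serve data as JSON or in some other format.
A simple, recommended way to call them is the **requests package**.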

In [None]:
import requests
url = 'http://search.twitter.com/search.json?q=python%20pandas'
resp = requests.get(url)
resp

In [None]:
import json
data = json.loads(resp.text)
data.keys()

In [None]:

tweet_fields = ['created_at', 'from_user', 'id', 'text']
tweets = DataFrame(data['results'], columns=tweet_fields)
print ( tweets )
print ( tweets.loc[7] )

### Working with Databases

In [None]:
import sqlite3
query = """
CREATE TABLE test
(a VARCHAR(20), b VARCHAR(20),
c REAL, d INTEGER );"""

con = sqlite3.connect(':memory:') 
con.execute(query)
con.commit()


In [None]:
data = [('Atlanta', 'Georgia', 1.25, 6), ('Tallahassee', 'Florida', 2.6, 3), ('Sacramento', 'California', 1.7, 5)]
stmt = "INSERT INTO test VALUES(?, ?, ?, ?)"
con.executemany(stmt, data) 
con.commit()

In [None]:
cursor = con.execute('select * from test')
rows = cursor.fetchall()
rows

In [None]:
cursor.description

In [None]:
DataFrame(rows, columns=[col[0] for col in cursor.description])

**SQL**

In [None]:
import pandas.io.sql as sql

In [None]:
sql.read_sql('select * from test', con)  # read_frame was removed; read_sql replaces it
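
The two routes shown above, building the frame by hand from `cursor.description` and letting pandas run the query, can be compared side by side. This sketch recreates the in-memory table with two of the rows so it runs on its own:

```python
import sqlite3

import pandas as pd

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE test (a VARCHAR(20), b VARCHAR(20), c REAL, d INTEGER)")
con.executemany(
    "INSERT INTO test VALUES (?, ?, ?, ?)",
    [("Atlanta", "Georgia", 1.25, 6), ("Sacramento", "California", 1.7, 5)],
)
con.commit()

# Route 1: fetch tuples and take column names from cursor.description.
cursor = con.execute("SELECT * FROM test")
columns = [col[0] for col in cursor.description]
by_hand = pd.DataFrame(cursor.fetchall(), columns=columns)

# Route 2: let pandas run the query and build the frame.
via_pandas = pd.read_sql("SELECT * FROM test", con)
con.close()

print(via_pandas)
```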

#### Working with Data in MongoDB

In [None]:
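import pymongo
# pymongo.Connection was replaced by MongoClient
con = pymongo.MongoClient('localhost', port=27017)
tweets = con.db.tweets

import requests, json
url = 'http://search.twitter.com/search.json?q=python%20pandas'
data = json.loads(requests.get(url).text)
for tweet in data['results']:
    # Collection.save is gone from recent pymongo; insert_one stores the document
    tweets.insert_one(tweet)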

cursor = tweets.find({'from_user': 'wesmckinn'})

tweet_fields = ['created_at', 'from_user', 'id', 'text'] 
result = DataFrame(list(cursor), columns=tweet_fields)