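## Reading Text Files in Chunks

| Function | Description |
| -------- | :---------- |
| read_csv | Load delimited data from a file, URL, or file-like object. The default delimiter is a comma |
| read_table | Load delimited data from a file, URL, or file-like object. The default delimiter is a tab (`\t`) |
| read_fwf | Read data in fixed-width column format (that is, with no delimiters) |
| read_clipboard | Read data from the clipboard; can be seen as the clipboard version of read_table. Useful for converting tables on web pages |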
    
  
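The options of these functions fall into the following broad categories:

- Indexing: treat one or more columns as the index of the returned DataFrame, and decide whether column names come from the file or from the user
- Type inference and data conversion: includes user-defined value conversions, custom lists of missing-value markers, and so on
- Date parsing: includes combining capability, such as merging date and time information spread over several columns into a single column in the result
- Iterating: support for iterating over large files in chunks
- Unclean data issues: skipping rows, a footer, comments, or other minor things (such as numeric data that uses commas as thousands separators)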
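To make these categories concrete, here is a minimal sketch that parses an in-memory string instead of a file. The sample data, the column names, and the `missing` marker are invented for illustration; the keyword arguments themselves (`comment`, `index_col`, `parse_dates`, `na_values`, `thousands`) are regular `read_csv` parameters.

```python
import io

import pandas as pd

# Hypothetical sample data: a comment line, a custom missing-value marker,
# and a number that uses commas as thousands separators.
raw = '''# exported for illustration
date,city,population
2020-01-01,Springfield,"1,250,000"
2020-01-02,Shelbyville,missing
'''

df = pd.read_csv(
    io.StringIO(raw),
    comment='#',            # unclean data: ignore comment lines
    index_col='date',       # indexing: use the date column as the index
    parse_dates=True,       # date parsing: convert the index to datetimes
    na_values=['missing'],  # type inference: treat "missing" as NaN
    thousands=',',          # data conversion: parse "1,250,000" as a number
)
print(df)
print(df.dtypes)
```

Because one value is missing, the `population` column comes back as floating point rather than integer.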

In [9]:
!cat test.csv

a,b,c,message
1,3,2,hello
2,6,32,world



In [2]:
import pandas as pd
data = pd.read_csv('test.csv',header=0)
data

|   | a | b | c | message |
| - | - | - | - | ------- |
| 0 | 1 | 3 | 2 | hello |
| 1 | 2 | 6 | 32 | world |


In [15]:
pd.read_table('test.csv',sep=',')

|   | a | b | c | message |
| - | - | - | - | ------- |
| 0 | 1 | 3 | 2 | hello |
| 1 | 2 | 6 | 32 | world |


In [16]:
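# Use the message column as the index. The file already has a header row,
# so header=0 is needed for the supplied names to replace it; without it
# the header line would be read as a data row.
names = ['a', 'b', 'c', 'message']
pd.read_csv('test.csv', names=names, header=0, index_col='message')

| message | a | b | c |
| ------- | - | - | - |
| hello | 1 | 3 | 2 |
| world | 2 | 6 | 32 |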


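Some tables are not separated by a fixed delimiter but by, for example, a variable amount of whitespace or some other pattern. In that case a regular expression can be passed as the separator for read_table.

In [None]:
list(open('test3.txt'))
# A raw string keeps the backslash in the regular expression intact
pd.read_table('test3.txt', sep=r'\s+')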
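Since the contents of `test3.txt` are not shown, the following sketch uses made-up whitespace-delimited text to illustrate the same idea. It calls `read_csv` with an explicit `sep`, which should behave like `read_table` here. Note that the header has one field fewer than the data rows, so pandas infers that the first column is the index.

```python
import io

import pandas as pd

# Hypothetical whitespace-delimited data: fields are separated by a varying
# number of spaces, and the header row has one field fewer than the data rows.
raw = '''        A     B      C
aaa  -0.26  -1.02  -0.61
bbb   0.92   0.30  -0.03
ccc  -0.26  -0.38   0.75
'''

# r'\s+' matches one or more whitespace characters.
df = pd.read_csv(io.StringIO(raw), sep=r'\s+')
print(df)
```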

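When processing very large files, or when figuring out the right set of arguments for processing a large file later on, we may want to read only a small piece of the file or iterate through it in chunks.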

In [34]:
import pandas as pd
# import pandas.io.data as web
from pandas import Series, DataFrame
import datetime
from pandas_datareader import data as pdr
import yfinance as yf

yf.pdr_override()  # this function has to be called first

# Panel is a pandas data structure that can be thought of as a three-dimensional DataFrame
# pdata = pd.Panel(dict((symbol, pdr.DataReader(symbol, data_source='yahoo', start='2009-1-1', end='2012-1-31')) for symbol in ['AAPL', 'GOOG', 'MSFT']))

# Download one DataFrame per ticker symbol and keep them in a plain dict
ddata = dict((symbol, pdr.DataReader(symbol, data_source='yahoo', start='2009-1-1', end='2012-1-31')) for symbol in ['AAPL', 'GOOG', 'MSFT'])
# Each item of a Panel is a DataFrame
# pdata = pd.Panel(ddata)
 

[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed


Uncommenting the `pd.Panel(ddata)` line fails in this environment (pandas 0.25.1) with:

TypeError: Panel() takes no arguments

In [35]:
ddata['AAPL'].to_csv('test3.csv')

In [37]:
# pdata = pdata.swapaxes('items', 'minor')  # transpose
# Save the Adj Close column to a file
# pdata['Adj Close'].to_csv('test4.csv')
ddata['AAPL']['Adj Close'].to_csv('test4.csv')



In [33]:
pd.show_versions()


INSTALLED VERSIONS
------------------
commit           : None
python           : 3.7.7.final.0
python-bits      : 64
OS               : Linux
OS-release       : 3.10.0-957.1.3.el7.x86_64
machine          : x86_64
processor        : 
byteorder        : little
LC_ALL           : None
LANG             : C.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 0.25.1
numpy            : 1.16.5
pytz             : 2019.3
dateutil         : 2.8.1
pip              : 20.0.2
setuptools       : 46.0.0
Cython           : 0.29.16
pytest           : None
hypothesis       : None
sphinx           : 2.4.4
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : 4.5.0
html5lib         : None
pymysql          : 0.9.3
psycopg2         : None
jinja2           : 2.11.1
IPython          : 7.13.0
pandas_datareader: 0.8.1
bs4              : None
bottleneck       : None
fastparquet      : None
gcsfs            : None
lxml.etree       : 4.5.0
matplotlib       : 3.0.1
numexpr  

In [39]:
# Read only the first few rows
pd.read_csv('test3.csv', nrows=5)

|   | Date | Open | High | Low | Close | Adj Close | Volume |
| - | ---- | ---- | ---- | --- | ----- | --------- | ------ |
| 0 | 2008-12-31 | 12.281428 | 12.534286 | 12.191428 | 12.192857 | 10.583897 | 151885300 |
| 1 | 2009-01-02 | 12.268572 | 13.005714 | 12.165714 | 12.964286 | 11.253528 | 186503800 |
| 2 | 2009-01-05 | 13.31 | 13.74 | 13.244286 | 13.511429 | 11.728474 | 295402100 |
| 3 | 2009-01-06 | 13.707143 | 13.881429 | 13.198571 | 13.288571 | 11.535025 | 322327600 |
| 4 | 2009-01-07 | 13.115714 | 13.214286 | 12.894286 | 13.001429 | 11.285772 | 188262200 |


In [17]:
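# Read a large file in chunks; chunksize is the number of rows per chunk
chunker = pd.read_csv('test3.csv', chunksize=1000)

In [19]:
# Count the occurrences of each Date value, accumulating across chunks
tot = pd.Series(dtype='float64')
for piece in chunker:
    tot = tot.add(piece['Date'].value_counts(), fill_value=0)
# Series.order() no longer exists; sort_values() is its replacement
tot = tot.sort_values(ascending=False)

In [20]:
tot[:10]

Series([], dtype: float64)

An empty result like the one above most likely means that `chunker` had already been consumed by an earlier loop. The object returned by read_csv with chunksize is a one-shot iterator, so it has to be recreated before iterating again; with a fresh iterator, tot holds one count per distinct Date value.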
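The chunked aggregation can be tried without any file on disk. The sketch below substitutes a short in-memory string for the large file; the `key` column and its values are invented for illustration, and `chunksize=3` is deliberately tiny so that several chunks are produced.

```python
import io

import pandas as pd

# Hypothetical data standing in for a large file: a single 'key' column.
raw = 'key\n' + '\n'.join(['a', 'b', 'a', 'c', 'a', 'b', 'a'])

# With chunksize=3, read_csv returns an iterator that yields DataFrames
# of at most 3 rows each instead of one big DataFrame.
chunker = pd.read_csv(io.StringIO(raw), chunksize=3)

tot = pd.Series(dtype='float64')
for piece in chunker:
    # Align on the key; keys missing from either side count as 0.
    tot = tot.add(piece['key'].value_counts(), fill_value=0)

tot = tot.sort_values(ascending=False)
print(tot)
```

Only one chunk is held in memory at a time, which is the point of the technique; the running total is the only state carried across iterations.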

## Working with Delimited Formats Manually


In [23]:
import csv

In [25]:
f = open('test3.csv')
reader = csv.reader(f)
for line in reader:
    print(line)

['Date', 'Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume']
['2008-12-31', '12.281428337097168', '12.534285545349121', '12.191428184509277', '12.192856788635254', '10.58389663696289', '151885300']
['2009-01-02', '12.268571853637695', '13.005714416503906', '12.165714263916016', '12.964285850524902', '11.253527641296387', '186503800']
['2009-01-05', '13.3100004196167', '13.739999771118164', '13.244285583496094', '13.511428833007812', '11.728473663330078', '295402100']
['2009-01-06', '13.70714282989502', '13.881428718566895', '13.19857120513916', '13.28857135772705', '11.535024642944336', '322327600']
['2009-01-07', '13.115714073181152', '13.214285850524902', '12.894286155700684', '13.001428604125977', '11.285772323608398', '188262200']
['2009-01-08', '12.918571472167969', '13.307143211364746', '12.8628568649292', '13.242856979370117', '11.495339393615723', '168375200']
['2009-01-09', '13.315713882446289', '13.34000015258789', '12.877142906188965', '12.9399995803833', '11.232447624206

In [28]:
lines = list(csv.reader(open('test3.csv')))

In [29]:
header,values = lines[0],lines[1:]


In [31]:
data_dict = {h:v for h,v in zip(header,zip(*values))}

In [33]:
data_dict

{'Date': ('2008-12-31',
  '2009-01-02',
  '2009-01-05',
  '2009-01-06',
  '2009-01-07',
  '2009-01-08',
  '2009-01-09',
  '2009-01-12',
  '2009-01-13',
  '2009-01-14',
  '2009-01-15',
  '2009-01-16',
  '2009-01-20',
  '2009-01-21',
  '2009-01-22',
  '2009-01-23',
  '2009-01-26',
  '2009-01-27',
  '2009-01-28',
  '2009-01-29',
  '2009-01-30',
  '2009-02-02',
  '2009-02-03',
  '2009-02-04',
  '2009-02-05',
  '2009-02-06',
  '2009-02-09',
  '2009-02-10',
  '2009-02-11',
  '2009-02-12',
  '2009-02-13',
  '2009-02-17',
  '2009-02-18',
  '2009-02-19',
  '2009-02-20',
  '2009-02-23',
  '2009-02-24',
  '2009-02-25',
  '2009-02-26',
  '2009-02-27',
  '2009-03-02',
  '2009-03-03',
  '2009-03-04',
  '2009-03-05',
  '2009-03-06',
  '2009-03-09',
  '2009-03-10',
  '2009-03-11',
  '2009-03-12',
  '2009-03-13',
  '2009-03-16',
  '2009-03-17',
  '2009-03-18',
  '2009-03-19',
  '2009-03-20',
  '2009-03-23',
  '2009-03-24',
  '2009-03-25',
  '2009-03-26',
  '2009-03-27',
  '2009-03-30',
  '2009-03-31',
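The transposition step is easier to follow on a tiny input. This sketch repeats the same `zip(*values)` idea on a made-up three-column sample (not the real `test3.csv`) and then converts the string values explicitly, since `csv.reader` always returns strings.

```python
import csv
import io

# Hypothetical three-column sample used in place of test3.csv.
raw = ('Date,Open,Volume\n'
       '2009-01-02,12.27,186503800\n'
       '2009-01-05,13.31,295402100\n')

with io.StringIO(raw) as f:
    lines = list(csv.reader(f))

header, values = lines[0], lines[1:]

# zip(*values) transposes the rows into columns; each column is a tuple
# of strings keyed by its header.
data_dict = {h: v for h, v in zip(header, zip(*values))}
print(data_dict)

# csv.reader performs no type conversion, so numeric columns are converted here.
opens = [float(x) for x in data_dict['Open']]
volumes = [int(x) for x in data_dict['Volume']]
print(opens, volumes)
```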


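CSV files come in many different flavors. To define a new format (with its own delimiter, string quoting convention, line terminator, and so on), it is enough to define a subclass of csv.Dialect.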
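Subclassing is not the only route. As a sketch of two alternatives in the standard `csv` module, individual dialect parameters can be passed as keyword arguments, or a set of parameters can be registered under a name with `csv.register_dialect`. The pipe-delimited sample and the dialect name `pipes` are made up for illustration.

```python
import csv
import io

# Hypothetical pipe-delimited sample; the last field contains the delimiter
# itself and is therefore quoted.
raw = 'a|b|c\n1|2|"x|y"\n'

# Option 1: pass individual dialect parameters as keyword arguments.
rows = list(csv.reader(io.StringIO(raw), delimiter='|', quotechar='"'))
print(rows)

# Option 2: register the same settings under a name and refer to them by name.
csv.register_dialect('pipes', delimiter='|', quotechar='"',
                     quoting=csv.QUOTE_MINIMAL, lineterminator='\n')
out = io.StringIO()
writer = csv.writer(out, dialect='pipes')
writer.writerows(rows)
print(out.getvalue())
```

With `QUOTE_MINIMAL`, only fields that contain special characters such as the delimiter are quoted on output, so the written text matches the input.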


In [35]:
class my_dialect(csv.Dialect):
    lineterminator = '\n'
    delimiter = ';'
    quotechar = '"'

reader = csv.reader(f,dialect=my_dialect,quoting = csv.QUOTE_ALL)
reader

<_csv.reader at 0x7efdc0feb2d0>

In [37]:
with open('mydata.csv', 'w', newline='') as f:  # newline='' lets the csv module control line endings
    writer = csv.writer(f,dialect = my_dialect,quoting=csv.QUOTE_ALL)
    writer.writerow(('one','two','three'))
    writer.writerow(('1','2','3'))
    writer.writerow(('4','5','6'))
    writer.writerow(('7','8','9'))

In [38]:
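# The file is semicolon-delimited, so the separator has to be passed explicitly;
# with the default comma separator each line would be read as a single column
pd.read_csv('mydata.csv', sep=';')

|   | one | two | three |
| - | --- | --- | ----- |
| 0 | 1 | 2 | 3 |
| 1 | 4 | 5 | 6 |
| 2 | 7 | 8 | 9 |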


## JSON Data


In [39]:
obj = """
 { "name":"Limei",
   "places_lived":["China","UK","Germany"],
   "pet":null,
   "siblings":[{"name":"Liming","age":23,"pet":"Xiaobai"},
               {"name":"Lifang","age":33,"pet":"Xiaohei"}]
 }
 """

In [40]:
import json
result = json.loads(obj)
result

{'name': 'Limei',
 'places_lived': ['China', 'UK', 'Germany'],
 'pet': None,
 'siblings': [{'name': 'Liming', 'age': 23, 'pet': 'Xiaobai'},
  {'name': 'Lifang', 'age': 33, 'pet': 'Xiaohei'}]}

In [41]:
# Serialize the Python object back to a JSON string
asjson = json.dumps(result)
asjson

'{"name": "Limei", "places_lived": ["China", "UK", "Germany"], "pet": null, "siblings": [{"name": "Liming", "age": 23, "pet": "Xiaobai"}, {"name": "Lifang", "age": 33, "pet": "Xiaohei"}]}'
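`json.dumps` and `json.loads` work on strings; their counterparts `json.dump` and `json.load` work on file objects. The sketch below round-trips a small record through a file. The file name `record.json` and the temporary directory are assumptions made only to keep the example self-contained.

```python
import json
import os
import tempfile

record = {'name': 'Limei', 'pet': None, 'places_lived': ['China', 'UK']}

# Hypothetical file name inside a temporary directory that is removed afterwards.
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'record.json')

    # json.dump writes to a file object, whereas json.dumps returns a string.
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(record, f, ensure_ascii=False, indent=2)

    # json.load reads from a file object, whereas json.loads parses a string.
    with open(path, encoding='utf-8') as f:
        restored = json.load(f)

# Python None is written as JSON null and read back as None.
print(restored)
```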

In [43]:
# Convert a JSON object, or a list of them, into a DataFrame,
# keeping only the name and age fields
siblings = pd.DataFrame(result['siblings'], columns=['name', 'age'])

In [44]:
siblings

|   | name | age |
| - | ---- | --- |
| 0 | Liming | 23 |
| 1 | Lifang | 33 |


## XML and HTML

In [48]:
from lxml.html import parse

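# Python 2 equivalent: from urllib2 import urlopen
# Import urlopen directly instead of aliasing urllib.request as `re`,
# which would shadow the name of the standard regular-expression module
from urllib.request import urlopen
parsed = parse(urlopen('http://finance.yahoo.com/q/hp?s=AAPL+Historical+Prices'))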

doc = parsed.getroot()

In [50]:
# Find all link (<a>) elements in the document
links = doc.findall('.//a')

In [51]:
links[:3]

[<Element a at 0x7efdc0798d70>,
 <Element a at 0x7efdc0798dd0>,
 <Element a at 0x7efdc0798e30>]

In [55]:
# These are objects representing HTML elements. To get the URL and the link text,
# use each element's get method (for the URL) and text_content method (for the display text):
lnk = links[5]
lnk

<Element a at 0x7efdc0798f50>

In [57]:
lnk.get('href')

'/quote/AAPL/profile?p=AAPL'

In [59]:
lnk.text_content()

'Profile'

In [61]:
# So the following list comprehension collects every URL in the document:
urls = [lnk.get('href') for lnk in doc.findall('.//a')]

In [62]:
urls[:3]

['https://finance.yahoo.com/',
 'https://mail.yahoo.com/?.intl=us&.lang=en-US&.partner=none&.src=finance',
 '/quote/AAPL?p=AAPL']
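The output mixes absolute URLs with site-relative ones such as `/quote/AAPL?p=AAPL`. One possible way to normalize them is `urllib.parse.urljoin` from the standard library. This is a sketch: the base URL is an assumption (the site the page was fetched from), and the `None` entry stands in for an `<a>` element without an `href` attribute, for which `get('href')` returns `None`.

```python
from urllib.parse import urljoin

# Assumed base URL: the site the page was fetched from.
base = 'https://finance.yahoo.com/'

# The first three hrefs from the output above, plus None to stand in for
# a link element that has no href attribute.
urls = [
    'https://finance.yahoo.com/',
    'https://mail.yahoo.com/?.intl=us&.lang=en-US&.partner=none&.src=finance',
    '/quote/AAPL?p=AAPL',
    None,
]

# urljoin leaves absolute URLs unchanged and resolves relative ones
# against the base; missing hrefs are skipped.
absolute = [urljoin(base, u) for u in urls if u is not None]
for u in absolute:
    print(u)
```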