Python for Data Analysis
----

# Book
- CN: 利用Python进行数据分析 78.4MB.pdf
- EN: Python for Data Analysis 2nd Edition.pdf

![cover](images/cover1.png)


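# Overview
The original English edition was published by O'Reilly in 2013; the Chinese edition was published by China Machine Press (机械工业出版社).

The book has 12 chapters:
- Preliminaries
- Introductory Examples
- IPython
- NumPy Basics
- Getting Started with pandas
- Data Loading, Storage, and File Formats
- Data Wrangling
- Plotting and Visualization
- Data Aggregation and Group Operations
- Time Series
- Financial and Economic Data Applications
- Advanced NumPy
- Appendix
 - Python Language Essentials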


# Source code

 `git clone https://github.com/pydata/pydata-book -b 1st-edition`

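# Reading notes
## Ch01 Preliminaries
As the title suggests, this chapter covers the groundwork for analyzing data with Python.
- What: what is being processed? Mainly structured, tabular data
- How: which tools? Python with NumPy/Matplotlib/IPython/pandas/SciPy
- Why: Python is simple to use and has a rich ecosystem of public libraries
- Setup: prepare an environment for writing and debugging code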

## Ch02 Introductory Examples
### JSON dataset

In [None]:
#### Load JSON
import json
path = '/opt/Work/ML/pydata-book/ch02/usagov_bitly_data2012-03-16-1331923249.txt'
records = [json.loads(line) for line in open(path)]
records[0]

In [None]:
type(records[0])

#time_zones = [rec['tz'] for rec in records]
time_zones = [rec['tz'] for rec in records if 'tz' in rec]
time_zones[:5]

In [None]:
def get_counts(sequence):
    counts = {} # dict
    for x in sequence:
        if x in counts.keys():
            counts[x] += 1
        else:
            counts[x] = 1
    return counts
            
get_counts(time_zones)

In [None]:
from collections import defaultdict

def get_counts2(sequence):
    counts = defaultdict(int)
    for x in sequence:
        counts[x] += 1
    return counts

get_counts2(time_zones)

In [None]:
counts = get_counts2(time_zones)

In [None]:
type(counts) #collections.defaultdict
type(counts.items()) # dict_items

In [None]:
def top_counts(count_dict, n=10):
    value_key_pairs = [(count, tz) for tz, count in count_dict.items()]
    value_key_pairs.sort()
    return value_key_pairs[-n:]
counts = get_counts2(time_zones)
top_counts(counts)
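
The counting helpers above can also be expressed with the standard library's `collections.Counter`, which counts and ranks in one step. A minimal sketch, assuming a small hypothetical `sample_zones` list in place of the real `time_zones` data:

```python
from collections import Counter

# Hypothetical sample standing in for the `time_zones` list built above.
sample_zones = ['America/New_York', 'America/Denver', 'America/New_York',
                '', 'America/Sao_Paulo', 'America/New_York', '']

# Counter tallies each distinct value; most_common(n) returns the n largest
# (value, count) pairs in descending order of count.
zone_counts = Counter(sample_zones)
top_two = zone_counts.most_common(2)
print(top_two)
```

Unlike `top_counts`, `most_common` returns the largest counts first, so no reversal is needed.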

In [None]:
import pandas as pd; import numpy as np
from pandas import DataFrame, Series
frame = DataFrame(records)

In [None]:
tz_counts = frame['tz'].value_counts()

In [None]:
clean_tz = frame['tz'].fillna('Missing')

In [None]:
type(clean_tz)

In [None]:
# Note: empty strings are not NaN, so fillna leaves them; replace them with a boolean mask
clean_tz[clean_tz==''] = 'Unknown'

In [None]:
tz_counts = clean_tz.value_counts()
tz_counts[:10]

In [None]:
frame['a'].head()
frame.a.head()

In [None]:
results = Series(x.split()[0] for x in frame.a.dropna())
print(results.head(5))
print( results.value_counts()[:8] )

In [None]:
# Drop rows without an agent string, then label each remaining row as Windows or not
cframe = frame[frame.a.notnull()]
operating_system = np.where(cframe['a'].str.contains('Windows'),
                            'Windows', 'Not Windows')

In [None]:
print( operating_system[:5])

In [None]:
by_tz_os = cframe.groupby(['tz', operating_system])
agg_counts = by_tz_os.size().unstack().fillna(0)
agg_counts[:10]

In [None]:
# Indirect sort: positions that put the row totals in ascending order
indexer = agg_counts.sum(1).argsort()
indexer[:10]

In [None]:
count_subset = agg_counts.take(indexer)[-10:]
count_subset

In [None]:
count_subset.plot(kind='barh', stacked=True)

In [None]:
normed_subset = count_subset.div(count_subset.sum(1), axis=0)
normed_subset.plot(kind='barh',stacked=True)

### MovieLens 1M dataset


In [None]:
import pandas as pd
# '::' is a multi-character separator, which only the Python parsing engine supports
unames = ['user_id','gender', 'age', 'occupation', 'zip']
users = pd.read_table('ml-1m/users.dat', sep='::', header=None, names=unames, engine='python')

rnames = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_table('ml-1m/ratings.dat', sep='::', header=None, names=rnames, engine='python')

mnames = ['movie_id', 'title', 'genres']
movies=pd.read_table('ml-1m/movies.dat', sep='::', header=None, names=mnames, engine='python')


In [None]:
users.head()

In [None]:
ratings.head()

In [None]:
movies.head()

In [None]:
data = pd.merge(pd.merge(ratings, users), movies)
data.head()

In [None]:
mean_ratings = data.pivot_table('rating', index='title',columns='gender', aggfunc='mean')
mean_ratings[:5]

In [None]:
ratings_by_title = data.groupby('title').size()
ratings_by_title[:10]

In [None]:
active_titles = ratings_by_title.index[ratings_by_title >= 250]
active_titles

In [None]:
mean_ratings = mean_ratings.loc[active_titles]
mean_ratings.head()

In [None]:
mean_ratings['diff'] = mean_ratings['M'] - mean_ratings['F']
sorted_by_diff = mean_ratings.sort_values(by='diff')
sorted_by_diff.head()

In [None]:
sorted_by_diff[::-1].head()

In [None]:
# Standard deviation of ratings, grouped by movie title
rating_std_by_title = data.groupby('title')['rating'].std()

# Filter down to active_titles
rating_std_by_title = rating_std_by_title.loc[active_titles]

# Sort the Series by value in descending order
rating_std_by_title.sort_values(ascending=False).head()

### US baby names analysis
#### US baby names, 1880-2010

In [None]:
! head -n 10 names/yob1881.txt

In [None]:
import pandas as pd
names1880 = pd.read_csv('names/yob1880.txt', names=['name', 'sex', 'births'] )
names1880.head()

In [None]:
names1880.groupby('sex')['births'].sum()

names1880.groupby('sex').births.sum()

In [None]:
# 1880-2010
years = range(1880, 2011)
pieces = []
columns = ['name', 'sex', 'births']

for year in years:
    path = 'names/yob%d.txt' % year
    frame = pd.read_csv(path, names=columns)
    
    frame['year'] = year
    pieces.append(frame)
    
names = pd.concat(pieces, ignore_index=True)
names.head()

In [None]:
names.groupby('sex')['births'].sum()

In [None]:
total_births = names.pivot_table('births', index='year', columns='sex', aggfunc=sum)
total_births.tail()

In [None]:
total_births.plot(title='Total births by sex and years')

In [None]:
def add_prop(group):
    births = group.births.astype(float)
    group['prop'] = births / births.sum()
    return group
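# group_keys=False keeps the original row index instead of adding year/sex index levels
names = names.groupby(['year', 'sex'], group_keys=False).apply(add_prop)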
names.head()

In [None]:
np.allclose(names.groupby(['year', 'sex']).prop.sum(), 1)

In [None]:
g=names.groupby(['year', 'sex'])
type(g)

 
# Note: the [:1000] slice did not work as desired in the attempts below

def get_top1000(group):
    return group.sort_values(by='births', ascending=False)[:1000]

grouped = names.groupby(['year', 'sex'])
top1000 = grouped.apply(get_top1000)
top1000.head()

Recorded output (rows ordered by births across all years and sexes):

year (index) | sex (index) | row | name | sex | births | year | prop
--- | --- | --- | --- | --- | --- | --- | ---
1947 | F | 431022 | Linda | F | 99651 | 1947 | 0.056229
1948 | F | 441381 | Linda | F | 96185 | 1948 | 0.056657
1947 | M | 437125 | James | M | 94601 | 1947 | 0.051768
1957 | M | 544528 | Michael | M | 92700 | 1957 | 0.043008
1947 | M | 437126 | Robert | M | 91557 | 1947 | 0.050102

def get_top1000(group):
    #return group.sort_values(by='births', ascending=False)[:1000]
    return group

grouped = names.groupby(['year', 'sex'])
top1000 = grouped.apply(get_top1000)
top1000.head()
# Sorting and slicing after the groupby keeps the 1000 largest rows overall, not 1000 per group
top1000 = top1000.sort_values(by='births', ascending=False)[:1000]
top1000.head()

In [None]:
def get_top1000(group):
    return group.sort_values(by='births', ascending=False)[:1000]

grouped = names.groupby(['year', 'sex'],as_index=False)
top1000 = grouped.apply(get_top1000)
top1000.head()

#### Analyzing naming trends

In [None]:
boys = top1000[top1000.sex == 'M']
girls= top1000[top1000.sex == 'F']

In [None]:
total_births = top1000.pivot_table('births', index='year', columns='name',aggfunc=sum)
#totle_births = names.pivot_table('births', index='year', columns='name',aggfunc=sum)
total_births.head()

In [None]:
subset = total_births[['John', 'Harry', 'Mary', 'Marilyn']]
#subset = total_births[['John', 'Mary']]
subset.plot(subplots=True, figsize=(12,10), grid=False, title="Number of births per year")

#### Measuring the increase in naming diversity

In [None]:
table        = top1000.pivot_table('prop', index='year', columns='sex',aggfunc=sum)
table.plot(title='Sum of the table1000.prop by year and sex', yticks=np.linspace(0, 1.2,13), xticks=range(1880,2020,10) )

In [None]:
boys[boys.year==1947].head()


In [None]:
df = boys[boys.year==1947]

prop_cumsum = df.sort_values(by='prop', ascending=False).prop.cumsum()
prop_cumsum.head(10)

In [None]:
prop_cumsum.searchsorted(0.5)

In [None]:
df = boys[boys.year == 1900]
in1900 = df.sort_values(by='prop', ascending=False).prop.cumsum()
in1900.searchsorted(0.5)+1

In [None]:
def get_quantile_count(group, q=0.5):
    group = group.sort_values(by='prop', ascending=False)
    return group.prop.cumsum().searchsorted(q)+1
diversity = top1000.groupby(['year', 'sex']).apply(get_quantile_count)
diversity = diversity.unstack('sex')
diversity.head()

In [None]:
diversity.plot(title='Number of popular names in top 50%')

#### The last-letter revolution

In [None]:
get_last_letter = lambda x: x[-1]
last_letters = names.name.map(get_last_letter)
last_letters.name = 'last_letters'
table = names.pivot_table('births', index=last_letters, columns=['sex', 'year'], aggfunc=sum)

In [None]:
# Pick three representative years
subtable = table.reindex(columns=[1910,1960,2010], level='year')
subtable.head()

In [None]:
subtable.sum()

In [None]:
letter_prop = subtable/subtable.sum().astype(float)

In [None]:
import matplotlib.pyplot as plt
fig, axes = plt.subplots(2,1, figsize=(10,8))
letter_prop['M'].plot(kind='bar', rot=0, ax=axes[0], title='Male')
letter_prop['F'].plot(kind='bar', rot=0, ax=axes[1], title='Female', legend=False)


In [None]:
# Pick a few typical letters: d, n, y
letter_prop = table/table.sum().astype(float)
dny_ts = letter_prop.loc[['d','n','y'], 'M'].T
dny_ts.head()

In [None]:
dny_ts.plot()

#### Boy names that became girl names


In [None]:
all_names = top1000.name.unique()
mask = np.array(['lesl' in x.lower() for x in all_names])

In [None]:
all_names.shape

In [None]:
lesley_like = all_names[mask]
lesley_like

In [None]:
filtered = top1000[top1000.name.isin(lesley_like)]
filtered.groupby('name').births.sum()

In [None]:
table = filtered.pivot_table('births', index='year', columns='sex', aggfunc=sum)
table = table.div(table.sum(1), axis=0)
table.tail()

In [None]:
table.plot(style={'M': 'k-', 'F':'k--'})

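### Summary

This chapter mainly introduces how to use DataFrame; its routine operations solve many problems:
- grouping
- aggregation
- pivot tables
- plotting
    - DataFrame.plot
    - matplotlib.pyplot.subplots
    
Problems only surface when you actually write (or copy out) the code. For example:
+ 1. The difference between:
    * groupby(['year', 'sex'])
    * groupby(['year', 'sex'], as_index=False)
+ 2. Selecting a column:
    * names.year
    * names['year']
+ 3. Filtering rows:
    * names[names.year==1880]
+ 4. Reading a file:
    * [ json.loads(line) for line in open('some/file/path') ]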
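
The first point in the list can be made concrete with a toy frame. This is a minimal sketch, assuming a hypothetical four-row `toy` table rather than the real baby-names data: without `as_index=False` the group keys become a MultiIndex, while with it they stay as ordinary columns.

```python
import pandas as pd

# Hypothetical miniature of the baby-names table.
toy = pd.DataFrame({
    'year': [1880, 1880, 1881, 1881],
    'sex': ['F', 'M', 'F', 'M'],
    'births': [10, 20, 30, 40],
})

# Default: the result is a Series indexed by a (year, sex) MultiIndex.
keyed = toy.groupby(['year', 'sex'])['births'].sum()

# as_index=False: the result is a flat DataFrame with year and sex as columns.
flat = toy.groupby(['year', 'sex'], as_index=False)['births'].sum()

print(type(keyed).__name__, keyed.index.nlevels)
print(type(flat).__name__, list(flat.columns))
```

The flat form is convenient when the keys are needed again for filtering such as `flat[flat.sex == 'M']`.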

## IPython

## NumPy Basics

### ndarray: a multidimensional array object
- create ndarray
    - array
    - asarray
    - arange
    - ones/ ones_like
    - zeros/ zeros_like
    - empty/ empty_like
    - eye/ identity
- methods
    - random
    - shape
    - reshape
    - ndim
    - dtype

In [None]:
import numpy as np
data = np.random.rand(2,3)

In [None]:
data
data*10
data.shape
data.dtype

In [None]:
data1 = [6, 7.5, 8, 0, 1]
arr1 = np.array(data1)
arr1
arr1.shape

In [None]:

data2 = [[1,2,3,4], [5,6,7,8]]
arr2 = np.array(data2) 
arr2.ndim  # 2
arr2.shape # (2,4)
arr2.dtype #int64

In [None]:
np.zeros((3,6))

In [None]:
np.empty((2,3,2))

In [None]:
np.arange(15)

In [None]:
arr1 = np.array([1,2,3], dtype=np.float64)
arr2 = np.array([1,2,3], dtype=np.int32)

In [None]:
arr1.dtype

In [None]:
arr2.dtype

In [None]:
arr = np.array([1,2,3,4,5])
arr.dtype

In [None]:
float_arr = arr.astype(np.float64)
float_arr.dtype

In [None]:
# np.string_ was removed in NumPy 2.0; np.bytes_ is the same type under its current name
numeric_strings = np.array(['1.25','-9.6', '42'], dtype=np.bytes_)
numeric_strings.astype(float)

#### Operations between arrays and scalars

In [None]:
import numpy as np
arr = np.array([[1.,2,3],[4.,5.,6.]])
arr

In [None]:
arr*arr

In [None]:
1/arr

#### Basic indexing and slicing

In [None]:
arr = np.arange(10)
arr

In [None]:
arr[5:8]

In [None]:
arr[5:8] = 12
arr

In [None]:
arr_slice = arr[5:8]
arr_slice[1] = 12345
arr

In [None]:
arr_slice[:] = 666
arr

In [None]:
arr2d = np.array([[1,2,3],[4,5,6],[7,8,9]])
arr2d

In [None]:
arr2d[1][2]  ==  arr2d[1, 2]

In [None]:
arr3d = np.array([ 
    [ [1,2,3],[4,5,6] ], 
    [ [7,8,9], [10,11,12] ]
])
arr3d

In [None]:
arr3d.shape

In [None]:
arr3d[0]

In [None]:
arr3d[0] = 666
arr3d
arr3d[1,0]

##### Indexing with slices

In [None]:
arr[1:6]
arr2d[:2]
arr2d[:2, 1:]
arr2d[:, :1]

#### Boolean indexing

In [None]:
import math
names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Joe', 'Will','Joe'])
data = np.random.randn(7,4)

In [None]:
names

In [None]:
data

In [None]:
names == 'Bob'

In [None]:
mask = (names=='Bob') | (names=='Joe')
print(mask)
print(data.shape)
print(mask.shape)
data[mask]

In [None]:
data[data < 0 ] = 0
data

#### Fancy indexing
Unlike slicing, fancy indexing always copies the data into a new array.
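
The difference between a slice (a view) and a fancy index (a copy) can be checked directly. A minimal sketch, with the array names `base`, `view`, and `picked` chosen here for illustration:

```python
import numpy as np

base = np.arange(10)

view = base[2:5]          # basic slice: a view onto base's memory
picked = base[[2, 3, 4]]  # fancy index: an independent copy

view[0] = -1     # writes through to base[2]
picked[1] = -99  # changes only the copy; base[3] is untouched

print(base)
print(np.shares_memory(base, view), np.shares_memory(base, picked))
```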

In [None]:
arr = np.empty((8,4))
for i in range(8):
    arr[i] = i
arr

In [None]:
arr[[4, 3, 0, 6]]

In [None]:
arr[[-3, -5, -7]]

In [None]:
arr = np.arange(32).reshape((8,4))
arr

In [None]:
arr[[1,5,7,2], [0,3,1,2]]

In [None]:
arr[[1,5,7,2]] [: , [0,3,1,2]]

In [None]:
arr[ np.ix_([1,5,7,2], [0,3,1,2]) ]

#### Transposing arrays and swapping axes
- transpose
- swapping axes (swapaxes)
- inner product: np.dot

In [None]:
arr = np.arange(15).reshape((3,5))
arr

In [None]:
arr.T

In [None]:
arr = np.arange(16).reshape((2,2,4))
arr.transpose((1,0,2))

In [None]:
arr = np.arange(16).reshape((2,2,4))
arr.swapaxes(1,2)

### Universal functions: fast element-wise array functions
- Unary ufuncs
    - abs, fabs
    - sqrt
    - square
    - exp
    - log, log10, log2, log1p
    - sign
    - ceil
    - floor
    - rint
    - modf
    - isnan
    - isfinite, isinf
    - cos, cosh, sin, sinh,tan, tanh
    - arccos, arccosh, arcsin, arcsinh, arctan, arctanh
    - logical_not
    
- Binary ufuncs
    - add
    - subtract
    - multiply
    - divide, floor_divide
    - power
    - maximum, fmax
    - minimum, fmin
    - mod
    - copysign
    - greater, greater_equal
    - less, less_equal
    - equal, not_equal
    - logical_and, logical_or, logical_xor
    

In [None]:
arr = np.arange(10)
np.sqrt(arr)

In [None]:
x = np.random.randn(8)
y = np.random.randn(8)
print(x)
print(y)
np.maximum(x,y)

In [None]:
arr = np.random.randn(7)*5
np.modf(arr)

### Data processing using arrays

In [None]:
points = np.arange(-5, 5 , 0.01)
xs, ys = np.meshgrid(points, points)
z = np.sqrt(xs**2+ys**2)

import matplotlib.pyplot as plt
plt.imshow(z, cmap=plt.cm.gray);plt.colorbar()
plt.title(r'Image plot of $\sqrt{x^2+y^2}$ for a grid of values')


#### Expressing conditional logic as array operations
Boolean values can be treated as 0 or 1 in arithmetic.


In [None]:
xarr = np.array([1.1,1.2,1.3,1.4,1.5])
yarr = np.array([2.1,2.2,2.3,2.4,2.5])
cond = np.array([True, False, True,True, False])
result = [
    (x if c else y)
          for x, y, c in zip(xarr, yarr, cond)
         ]
result
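
The list comprehension above runs in pure Python and only handles one-dimensional input. As a sketch of the vectorized equivalent, `np.where` picks from `xarr` where the condition is true and from `yarr` otherwise (the name `result_vec` is introduced here for illustration):

```python
import numpy as np

xarr = np.array([1.1, 1.2, 1.3, 1.4, 1.5])
yarr = np.array([2.1, 2.2, 2.3, 2.4, 2.5])
cond = np.array([True, False, True, True, False])

# Element-wise selection: xarr[i] if cond[i] else yarr[i]
result_vec = np.where(cond, xarr, yarr)
print(result_vec)
```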

In [None]:
arr = np.random.randn(4,4)
arr
np.where(arr>0, 2, -2) # set positive values to 2 and all other values to -2
np.where(arr>0, 2, arr)

In [None]:
cond1 = np.array([True, False, False])
cond2 = np.array([False, True, False])
cond3 = np.array([False, False, True])

n =3
result = []
for i in range(n):
    if cond1[i] and cond2[i]:
        result.append(0)
    elif cond1[i]:
        result.append(1)
    elif cond2[i]:
        result.append(2)
    else:
        result.append(3)
print(result)

# Equivalent to
np.where(cond1 & cond2, 0,
        np.where(cond1, 1, 
                 np.where(cond2, 2, 3)
        )
)

# Also equivalent: boolean arithmetic, using ~ (not unary minus) to negate boolean arrays
result = 1 * (cond1 & ~cond2) + 2 * (cond2 & ~cond1) + 3 * ~(cond1 | cond2)
print(result)
    
        

In [None]:
a = False
b = True
a-b
c = np.array([a,b])
np.logical_not(c)

#### Mathematical and statistical methods
- sum
- mean
- std, var
- min, max
- argmin, argmax
- cumsum, cumprod


In [None]:
arr = np.random.randn(5,4)
arr.mean() == np.mean(arr)
arr.sum()

arr.mean(axis=1) == arr.mean(1)
arr.sum(axis=0) == arr.sum(0)

In [None]:
arr = np.array([ [0,1,2], [3,4,5], [6,7,8] ])
arr.cumsum(0)

#### Methods for boolean arrays

In [None]:
arr = np.random.randn(100)
(arr>0).sum()

In [None]:
bools = np.array( [False, False, True, False] )
bools.any()
bools.all()

#### Sorting

In [None]:
arr  = np.random.randn(8)
arr.sort()
arr

In [None]:
arr = np.random.randn(5,3)
arr.sort(1)
arr

In [None]:
large_arr = np.random.randn(1000) 
large_arr.sort()
large_arr[int(0.05*len(large_arr))] # 5% quantile

#### Unique and other set logic
- unique(X)
- intersect1d(X,Y)
- union1d(X,Y)
- in1d(x, A)
- setdiff1d(X, Y)
- setxor1d(X, Y)

In [None]:
names = np.array(['Bob', 'Joe', 'Will', 'Bob','Will', 'Joe'])
np.unique(names)
sorted(set(names))
(np.unique(names) == sorted(set(names))).all()



### File input and output with arrays
#### Saving arrays to disk in binary format

In [None]:
arr = np.arange(10)
np.save('some_array', arr)
np.load('some_array.npy')

In [None]:
# Save several arrays into one archive (np.savez does not compress; np.savez_compressed does)
np.savez('array_archive.npz', a=arr, b=arr)
arch = np.load('array_archive.npz')
arch['a']

#### Saving and loading text files

arr = np.loadtxt('array_ex.txt', delimiter=',')
pf = pd.read_csv('array_ex.txt')
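
The two calls above assume an existing `array_ex.txt`. A self-contained round trip can be sketched with `np.savetxt` and `np.loadtxt`, here assuming a hypothetical file name `grid_example.txt` inside a temporary directory:

```python
import os
import tempfile

import numpy as np

grid = np.arange(6, dtype=float).reshape(2, 3)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name, used only for this illustration.
    txt_path = os.path.join(tmp_dir, 'grid_example.txt')
    np.savetxt(txt_path, grid, delimiter=',')       # write comma-separated text
    restored = np.loadtxt(txt_path, delimiter=',')  # read it back as float64

print(restored)
```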

### Linear algebra

In [None]:
x = np.array(
    [
        [1,2,3],
        [4,5,6]
    ],
    dtype=np.float64
)

y = np.array(
    [
        [6,-23],
        [-1, 7],
        [8, 9]
    ],
    dtype = np.float64
)

x.shape
y.shape
x.dot(y)
np.dot(x,y)

In [None]:
np.dot(x, np.ones(3))

**numpy.linalg provides a standard set of matrix decompositions, plus operations such as inverse and determinant**

- diag
- dot
- trace
- det
- eig
- inv
- pinv
- qr
- svd
- solve
- lstsq

In [None]:
from numpy.linalg import inv, qr
X = np.random.randn(5,5)

In [None]:
mat = X.T.dot(X)
mat

In [None]:
inv(mat)

In [None]:
np.dot(mat, inv(mat))

In [None]:
q,r = qr(mat)

In [None]:
r
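
The QR factors can be sanity-checked numerically. A minimal sketch, assuming a seeded generator and the illustrative names `X_demo`, `mat_demo`, `q_demo`, and `r_demo` so that the result is reproducible:

```python
import numpy as np

rng = np.random.default_rng(0)       # seeded so the example is repeatable
X_demo = rng.standard_normal((5, 5))
mat_demo = X_demo.T @ X_demo         # symmetric 5x5 matrix, as in the cells above

q_demo, r_demo = np.linalg.qr(mat_demo)

# Q has orthonormal columns, R is upper triangular, and Q @ R rebuilds the input.
print(np.allclose(q_demo @ r_demo, mat_demo))
print(np.allclose(q_demo.T @ q_demo, np.eye(5)))
print(np.allclose(r_demo, np.triu(r_demo)))
```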

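### Random number generation

- seed
- permutation
- shuffle
- rand: uniform distribution
- randint: random integers
- randn: standard normal distribution (mean 0, standard deviation 1)
- binomial: binomial distribution
- normal
- beta
- chisquare: chi-square distribution
- gamma
- uniform: uniform distribution on [0, 1)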


In [None]:
sample = np.random.normal(size=(4,4))

In [None]:
sample

In [None]:
from random import normalvariate
N = 1000000
%timeit samples = [normalvariate(0,1) for _ in range(N)]

In [None]:
%timeit samples = np.random.normal(size=N)

### Random walks

import random
position = 0
walk = [position]
steps = 1000
for i in range(steps):
    step = 1 if random.randint(0,1) else -1
    position += step
    walk.append(position)


nsteps = 1000
draws = np.random.randint(0,2,size=nsteps)
steps = np.where(draws>0, 1, -1)
walk = steps.cumsum()
print (walk.min())
print (walk.max())
(np.abs(walk) >= 10).argmax()
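
One caveat with the `argmax` trick: on an all-False boolean array it returns 0, which is indistinguishable from a crossing at step 0. A minimal sketch of a guarded version, using a hypothetical helper `first_crossing` and a short hand-written walk:

```python
import numpy as np

steps_demo = np.array([1, 1, -1, 1, 1, 1, -1])
walk_demo = steps_demo.cumsum()  # positions: 1, 2, 1, 2, 3, 4, 3

def first_crossing(walk, threshold):
    """Index of the first step with |position| >= threshold, or None if never reached."""
    hits = np.abs(walk) >= threshold
    # argmax returns the first True; check any() first so "never" is not reported as 0.
    return int(hits.argmax()) if hits.any() else None

print(first_crossing(walk_demo, 3))
print(first_crossing(walk_demo, 10))
```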

#### Simulating many random walks at once

nwalks = 5000
nsteps = 1000
draws = np.random.randint(0, 2, size=(nwalks, nsteps)) # 0 or 1
steps = np.where(draws>0, 1, -1)
walks = steps.cumsum(1)
walks
walks.max()
walks.min()
hits30 = (np.abs(walks) >= 30).any(1)
hits30
hits30.sum()
crossing_times = (np.abs(walks[hits30]) >= 30).argmax(1)
crossing_times.mean()

steps = np.random.normal(loc=0, scale=0.25, size=(nwalks, nsteps))

## Getting Started with pandas


In [None]:
from pandas import Series, DataFrame
import pandas as pd

### Introduction to pandas data structures
#### Series

In [None]:
obj = Series([4,7,-5,3])
obj

In [None]:
obj.values

In [None]:
obj.index

In [None]:
obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']
obj

In [None]:
obj2 = Series([4,7,-5,3], index=['d','b','a','c'])
obj2

In [None]:
obj2.index

In [None]:
obj2['a']

In [None]:
obj2[obj2>0]

In [None]:
obj2*2

In [None]:
np.exp(obj2)

In [None]:
'b' in obj2
'e' in obj2

In [None]:
#dict -> Series
sdata = {'Ohio':3500, 'Texas':71000, 'Oregon':16000, 'Utah':5000}
obj3 = Series(sdata)
obj3
#obj3.index

In [None]:
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = Series(sdata, index=states)
obj4

In [None]:
pd.isnull(obj4)  
#pd.notnull(obj4)

In [None]:
r = pd.isnull(obj4)
type(r)

In [None]:
r = obj4.isnull()
type(r) 

In [None]:
obj4.name = 'population'
obj4.index.name = 'state'
obj4

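#### DataFrame
A DataFrame is a tabular data structure containing an ordered collection of columns, each of which can hold a different value type (numeric, string, boolean, etc.). **A DataFrame has both a row index and a column index; it can be thought of as a dict of Series.**

Data that can be passed to the DataFrame constructor:
- 2D ndarray
- dict of arrays, lists, or tuples
- NumPy structured/record array
- dict of Series
- dict of dicts
- list of dicts or Series
- list of lists or tuples
- another DataFrame
- NumPy MaskedArray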
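
The "dict of Series" view can be sketched directly. Assuming two small hypothetical Series, `ohio` and `nevada`, with partly overlapping year labels, the constructor aligns them on the union of their indexes and fills the gaps with NaN:

```python
import pandas as pd

# Hypothetical population figures keyed by year.
ohio = pd.Series({2000: 1.5, 2001: 1.7, 2002: 3.6})
nevada = pd.Series({2001: 2.4, 2002: 2.9})

# Each dict key becomes a column; row labels are the union of the Series indexes.
pop_frame = pd.DataFrame({'Ohio': ohio, 'Nevada': nevada})

print(pop_frame)
print(type(pop_frame['Ohio']).__name__)  # selecting a column gives back a Series
```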


#### The most common way to build a DataFrame is to pass a dict of equal-length lists or NumPy arrays.

In [None]:
data = {
    'state':['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
    'year' :[2000,2001,2002,2001,2002] ,
    'pop'  :[1.5,1.7,3.6,2.4,2.9]
}
frame = DataFrame(data)
frame

In [None]:
DataFrame(data, columns=['year', 'state', 'pop'])

In [None]:
frame2 = DataFrame(data, columns=['year', 'state', 'pop', 'debt'])
frame2

In [None]:
frame2 = DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
                   index = ['one', 'two', 'three', 'four', 'five']
                  )
frame2

In [None]:
frame2['state'] == frame2.state

In [None]:
frame2.loc['three']

In [None]:
frame2.debt=16.5
frame2

In [None]:
val = Series([-1.2,-1.5, -1.7], index=['two', 'four','five'])
frame2.debt = val
frame2

In [None]:
frame2['eastern'] = frame2.state=='Ohio'
frame2

In [None]:
print(frame2.columns)
del frame2['eastern']
print(frame2.columns)

#### Another common form of data is a nested dict

In [None]:
pop ={
    'Nevada' : {
        2001:2.4, 
        2002:2.9
    },
    'Ohio'   : {
        2000:1.5, 
        2001:1.7, 
        2002:3.6
    }
}

frame3 = DataFrame(pop)
print(frame3.index)
print(frame3.columns)
frame3


In [None]:
frame3.T

In [None]:
frame3['Ohio']

In [None]:
pdata = {
    'Ohio' : frame3['Ohio'][:-1],
    'Nevada': frame3['Nevada'][:3]
}
DataFrame(pdata)

In [None]:
frame3.index.name = 'year'
frame3.columns.name = 'state'
frame3

In [None]:
frame3.values

In [None]:
frame2.values

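### Index objects
pandas Index objects hold the axis labels and other metadata (such as axis names). Any array or other sequence of labels used when constructing a Series or DataFrame is converted to an Index.

**Main Index classes in pandas**

- Index
- Int64Index (removed in pandas 2.0, where a plain Index holds integers)
- MultiIndex
- DatetimeIndex
- PeriodIndex

**Index methods and properties**
- append
- difference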
- intersection
- union
- isin
- delete
- drop
- insert
- is_monotonic
- is_unique
- unique

In [None]:
obj = Series(range(3), index=['a', 'b', 'c'])
index = obj.index
index

In [None]:
index[1:]

In [None]:
try:
    index[1] = 'd'
except TypeError:
    print('pandas Index objects are immutable')

In [None]:
index = pd.Index(np.arange(3))
obj2 = Series([1.5, -2.5,0], index=index)
print( obj2.index == index )
print( obj2.index is index )

**Besides being array-like, an Index also behaves like a fixed-size set**

In [None]:
print(  frame3  )
print( 'Ohio' in frame3.columns  )
print(  2003 in frame3.index)
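
Beyond membership tests, the set-like behaviour includes the usual set operations. A minimal sketch, assuming two small illustrative indexes named `left` and `right`:

```python
import pandas as pd

left = pd.Index(['a', 'b', 'c'])
right = pd.Index(['b', 'c', 'd'])

common = left.intersection(right)    # labels present in both
combined = left.union(right)         # labels present in either
only_left = left.difference(right)   # labels in left but not in right

print(list(common), list(combined), list(only_left))
print('a' in left, 'z' in left)
```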

### Essential functionality

#### Reindexing

In [None]:
obj = Series([4.5,7.2,-5.3,3.6], index=['d', 'b','a','c'])
obj

In [None]:
obj2 = obj.reindex(['a','b','c','d','e'])
obj2

In [None]:
obj2 = obj.reindex(['a','b','c','d','e'], fill_value=0)
obj2

**method='ffill' forward-fills values**

Interpolation options for the method argument of reindex:
- ffill/pad: fill forward
- bfill/backfill: fill backward

In [None]:
obj3 = Series(['blue', 'purple','yellow'], index=[0,2,4])
obj3.reindex(range(9), method='ffill')
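
To see how the two fill directions differ, the same Series can be reindexed both ways. A minimal sketch, with the names `colors`, `forward`, and `backward` chosen for illustration:

```python
import pandas as pd

colors = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

# ffill carries the last seen value forward into the new labels.
forward = colors.reindex(range(6), method='ffill')

# bfill pulls the next value backward; label 5 has nothing after it and stays NaN.
backward = colors.reindex(range(6), method='bfill')

print(forward)
print(backward)
```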

**reindex can alter rows, columns, or both**

In [None]:
frame = DataFrame(np.arange(9).reshape((3,3)), index=['a','c','b'],
                  columns=['Ohio', 'Texas', 'California'])
frame

In [None]:
frame2 = frame.reindex(['a','b','c','d'])
frame2

In [None]:
states =['Texas','Ohio', 'California', 'Utah']
frame.reindex(columns=states)

In [None]:
states =['Texas','Utah', 'California' ]
#frame.reindex(index=['a','b','c','d'], method='ffill', columns=states)
frame.reindex(index=['a','b','c','d'],  columns=states).ffill()

In [None]:
#frame.ix[['a'],states]
frame.loc[['a','b','c','d'],states]

In [None]:
'''
Arguments of reindex:
index,columns
method
fill_value
limit
level
copy
'''

#help(df.reindex)


#### Dropping entries from an axis

In [None]:
obj = Series(np.arange(5), index=['a','b','c','d','e'])
new_obj = obj.drop('c')
print (new_obj)
print(obj)

obj.drop(['d','c'])

In [None]:
data = DataFrame(np.arange(16).reshape((4,4)),
                index=['Ohio','Colorado', 'Utah','NewYork'],
                 columns=['one','two', 'three','four']
                 )
data.drop(['Colorado', 'Ohio'])

In [None]:
print( data.drop('two',axis=1) )
print( data.drop(['two', 'four'], axis=1) )

#### Indexing, selection, and filtering
**Series indexing works like NumPy array indexing, except that the index values need not be integers**

In [None]:
obj = Series(np.arange(4,8), index=['a','b','c','d'])

print ( obj ) ;print ('-'*16)
print ( obj['b'] ) ; print ('-'*16)
print ( obj[2:4] ) ;print ('-'*16)
print ( obj[['b','a','d']] )  ;print ('-'*16)
print ( obj.iloc[[1,3]]) ;print ('-'*16)  # positional selection on a labelled Series needs iloc
print ( obj[obj>5]) ;print ('-'*16)

In [None]:
obj['b':'d'] = 666
obj

In [None]:
data = DataFrame(np.arange(16).reshape((4,4)),
                index=['Ohio','Colorado', 'Utah','NewYork'],
                 columns=['one','two', 'three','four']
                 )
print ( data ) ;print('-'*16)
print ( data['two'] ) ;print('-'*16)
print ( data[['three', 'one']])  ;print('-'*16)
print ( data[:2] ) ;print('-'*16)


In [None]:
data < 5

In [None]:
data[data<5] = 0
data

In [None]:
#data.ix['Colorado', ['two','three']]
data.loc['Colorado', ['two','three']]

In [None]:
# Named subset rather than slice, to avoid shadowing the built-in slice type
subset = data.loc[
    ['Colorado', 'Utah'],
    ['four','one','two']
]
print (subset) ;print('-'*32)

subset = data.loc[['Colorado', 'Utah']]
print (subset) ;print('-'*32)

subset = data.loc[data.three>5][:3]
print (subset) ;print('-'*32)

#### Arithmetic and data alignment

- Arithmetic operations
    - +
    - -
    - *
    - /
    - add
    - sub
    - div
    - mul
- NB    
**Automatic data alignment introduces NA values at labels that do not overlap**


In [None]:
s1 = Series( 
    [7.3,-2.5,3.4,1.5],
    index = ['a','c','d','e']
)

s2 = Series(
    [-2.1, 3.6,-1.5,4,3.1],
    index=['a','c','e','f','g']
)
print (s1) ; print('-'*32)
print (s2) ; print('-'*32)
print (s1+s2) ; print('-'*32)


In [None]:
df1 = DataFrame(np.arange(9).reshape((3,3)),
                columns=list('bcd'),
                index=['Ohio','Texas', 'Colorado']
               )
df2 = DataFrame(np.arange(12).reshape((4,3)),
                columns=list('bde'),
                index=['Utah', 'Ohio', 'Texas', 'Oregon']
               )
print(df1) ;print('-'*32)
print(df2) ;print('-'*32)
print(df1+df2) ;print('-'*32)
print(df1-df2) ;print('-'*32)

#### Arithmetic methods with fill values

In [None]:
df1 = DataFrame(np.arange(12).reshape((3,4)), columns=list('abcd'),dtype=int)
df2 = DataFrame(np.arange(20).reshape((4,5)), columns=list('abcde'),dtype=int)
print(df1) ;print('1-'*32)
print(df2) ;print('2-'*32)
print(df1+df2) ;print('3-'*32)
d3 = DataFrame.add(df1,df2,fill_value=0)
print(d3) ;print('4-'*32)
d4 = df1.add(df2, fill_value=0)
print(d4) ;print('5-'*32)
r = d3 == d4
print(r) ;print('6-'*32)
print(r.any());print('7-'*32)


####  Operations between DataFrame and Series

**Broadcasting**

In [None]:
arr = np.arange(12).reshape( (3,4) )
print(arr) ; print('1-'*16)
print(arr[0]) ; print('2-'*16)
print(arr-arr[0]) ;print('3-'*16)

In [None]:
frame = DataFrame( np.arange(12).reshape( (4,3) ),
                 columns=list('bde'),
                  index=['Utah', 'Texas','Ohio', 'Oregon']
                 )
#series = frame.ix[0]
series = frame.loc['Utah']
print( frame ) ; print( '-'*32 )
print( series ) ; print( '-'*32 )

print(frame - series) ; print( '-'*32 )

In [None]:
series2 = Series(range(3), index=['b','e','f'] )
frame + series2

In [None]:
series3 = frame['d']
print (frame) ; print('-'*32)
print(series3) ; print('-'*32)
print(frame.sub(series3, axis=0) ) ; print('-'*32)

#### Function application and mapping

In [None]:
frame = DataFrame( np.arange(12).reshape( (4,3) ),
                 columns=list('bde'),
                  index=['Utah', 'Texas','Ohio', 'Oregon']
                 )
np.abs(frame)

In [None]:
f = lambda x : x.max() - x.min()
print ( frame.apply(f) )
print ( frame.apply(f, axis=0) )
print ( frame.apply(f, axis=1) )

In [None]:
def f(x):
    return Series(
        [x.min(), x.max()],
        index=['min', 'max']
    )
frame.apply(f)

In [None]:
format = lambda x: '%.2f' %x
frame.applymap(format)

It is called applymap because Series has a map method for applying an element-wise function.

In [None]:
frame['e'].map(format)

#### Sorting and ranking
To sort by row or column index, use the sort_index method, which returns **a new, sorted object**

In [None]:
obj = Series(range(4), index=list('dabc'))
print(obj) ; print('-'*32)
print(obj.sort_index())

**A DataFrame can be sorted by index on either axis**

In [None]:
frame = DataFrame(np.arange(8).reshape((2,4)), 
                  index=['three','one'],
                  columns=['d','a','b','c']
                 )
frame.sort_index()

In [None]:
frame.sort_index(axis=1, ascending=False)

In [None]:
obj = Series([4,7,-3,2])
obj.sort_values()

In [None]:
obj = Series([4,np.nan, 7, np.nan, -3, 2])
obj.sort_values()

In [None]:
frame = DataFrame(
    {
        'b': [4,7,-3,2],
        'a': [0,1,0,1]
    }
)

print(frame) ; print('-'*32)
print(frame.sort_values(by='a')); print('-'*32)
print(frame.sort_values(by='b')) ; print('-'*32)
print(frame.sort_values(by=['a','b'])); print('-'*32)

#print(frame.sort_index(by='a')) ; print('-'*32)
#print(frame.sort_index(by='b')) ; print('-'*32)
#print(frame.sort_index(by=['a','b'])); print('-'*32)

In [None]:
obj = Series([7,-5,7,4,2,0,4])
print( obj )
print( obj.rank() )
print( obj.rank(method='first') )
print( obj.rank(ascending=False, method='max'))

In [None]:
frame = DataFrame(
    {
        'b' : [4.3, 7, -3, 2],
        'a' : [0,1,0,1],
        'c' : [-2, 5, 8, -2.5]
    }
)

print( frame )
print( frame.rank( axis=1 ) )

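**method options for breaking ties when ranking**
- 'average': the default; every tied entry gets the mean of the tied ranks
- 'min': every tied entry gets the lowest rank in the group
- 'max': every tied entry gets the highest rank in the group
- 'first': ranks are assigned in the order the values appear in the data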
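
Putting the four tie-breaking methods side by side makes the difference easier to read. A minimal sketch, reusing the values from the ranking cell above under the illustrative names `scores` and `comparison`:

```python
import pandas as pd

scores = pd.Series([7, -5, 7, 4, 2, 0, 4])

# One column per tie-breaking method; the two 7s and the two 4s are the ties.
comparison = pd.DataFrame(
    {method: scores.rank(method=method)
     for method in ['average', 'min', 'max', 'first']}
)
comparison.insert(0, 'value', scores)
print(comparison)
```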

#### Axis indexes with duplicate values

In [None]:
obj = Series( range(5), index=list('aabbc'))
print ( obj )
print ( obj.index.is_unique )
print ( obj.a ); print ( type( obj.a ) )
print ( obj.c ); print ( type( obj.c ) )

In [None]:
df = DataFrame(np.random.randn(4,3), index=list('aabb'))
print ( df )
print ( df.loc['b'] )

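### Summarizing and computing descriptive statistics

### Handling missing data

### Hierarchical indexing

### Other pandas topics

## Data Loading, Storage, and File Formats

### Reading data in text format
#### Basic reading
- read_csv
- read_table (like read_csv, but with tab as the default delimiter; it was briefly deprecated in pandas 0.24)
- read_fwf
- read_clipboard

**Things to watch when reading data**
- Indexing
- Type inference and data conversion
- Date parsing
- Iterating (reading in chunks)
- Unclean data issues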

In [None]:
! cat pydata/ch06/ex1.csv
df = pd.read_csv('pydata/ch06/ex1.csv')
print( df )
# read_table defaults to tab-separated input, so a CSV file needs an explicit sep=','
pd.read_table('pydata/ch06/ex1.csv', sep=',')


In [None]:
! cat pydata/ch06/ex2.csv

print('-'*32)
df = pd.read_csv('pydata/ch06/ex2.csv', header=None)
print( df )

print('-'*32)
df = pd.read_csv('pydata/ch06/ex2.csv', 
            names=['a','b','c','d', 'message']
           )
print( df )

In [None]:
! cat pydata/ch06/csv_mindex.csv

print('-'*32)
parsed = pd.read_csv('pydata/ch06/csv_mindex.csv')
print( parsed )

print('-'*32)
parsed = pd.read_csv('pydata/ch06/csv_mindex.csv', index_col=['key1', 'key2'])
print(parsed)
parsed

**A regular expression can be used as the separator for read_table/read_csv** 

In [None]:
l = list( open( 'pydata/ch06/ex3.txt') )

# Raw string, so that \s reaches the regex engine unchanged
result = pd.read_csv('pydata/ch06/ex3.txt', sep=r'\s+')

[ l, result ]

In [None]:
! cat pydata/ch06/ex4.csv
r1 = pd.read_csv('pydata/ch06/ex4.csv')
r2 = pd.read_csv('pydata/ch06/ex4.csv', skiprows=[0,2,3])
[r1, r2]

In [None]:
! cat pydata/ch06/ex5.csv
result = pd.read_csv('pydata/ch06/ex5.csv')
result
pd.isnull(result)

In [None]:
result = pd.read_csv('pydata/ch06/ex5.csv', na_values=['NULL'])
result

In [None]:
'''A dict can specify different NA sentinels for each column'''
sentinels = {
    'message': ['foo', 'NA'],
    'something': ['two']
}
result = pd.read_csv('pydata/ch06/ex5.csv', na_values=sentinels)
result

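**Arguments of read_csv (book p. 178)**

Argument | Description
--- | ---
path | File path or URL
sep/delimiter | Field separator
header | Row number to use as column names; defaults to 0 (the first row)
index_col | Column numbers or names to use as the row index
names | List of column names for the result; combine with header=None
skiprows | Number of rows to skip at the start of the file, or a list of row numbers to skip
na_values | Values to treat as NA
comment | Character that splits comments off the end of lines
parse_dates | Attempt to parse data as dates
keep_date_col | When joining multiple columns to parse a date, keep the joined columns
converters | Dict mapping column numbers or names to functions, e.g. {'foo': f} applies f to every value in column foo
dayfirst | Treat ambiguous dates as international format (day first)
nrows | Number of rows to read
iterator | Return a TextParser for reading the file piecemeal
chunksize | Size of the file chunks when iterating
skip_footer | Number of lines to ignore at the end of the file (spelled skipfooter in newer pandas)
verbose | Print parser diagnostic output
encoding | Text encoding, e.g. utf-8
squeeze | If the parsed data has only one column, return a Series
thousands | Thousands separator, e.g. , or .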
 


#### Reading text files in pieces

In [None]:
! head -n 3 pydata/ch06/ex6.csv
result = pd.read_csv('pydata/ch06/ex6.csv')
result.tail()

In [None]:
chunker = pd.read_csv('pydata/ch06/ex6.csv', chunksize=1000)
tot = Series([], dtype='float64')

for piece in chunker:
    tot = tot.add(piece['key'].value_counts(), fill_value=0)

# Sort by count, so the slice below shows the most frequent keys
tot = tot.sort_values(ascending=False)
print( tot[:10] )
print( '-'*32 )
print(  tot.sum() )

#### Writing data out to text format

In [None]:

data = pd.read_csv('pydata/ch06/ex5.csv')
print(data)

print('-'*32)
data.to_csv('pydata/ch06/out.csv')
! cat pydata/ch06/out.csv

print('-'*32)
import sys
data.to_csv(sys.stdout, sep='|')

print('-'*32)
data.to_csv(sys.stdout, na_rep='NULL')

print('-'*32)
data.to_csv(sys.stdout, index=False, header=False)

print('-'*32)
data.to_csv(sys.stdout, index=False, header=False,
            columns=list('abd')
           )


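**Series also has a to_csv method**
Series.from_csv was deprecated and has been removed from pandas; read the file back with pd.read_csv instead.

In [None]:
dates = pd.date_range('1/1/2000', periods=7)
ts = Series(np.arange(7), index=dates)
ts.to_csv(sys.stdout)

ts.to_csv('pydata/ch06/out.csv', header=False)
pd.read_csv('pydata/ch06/out.csv', header=None, index_col=0, parse_dates=True).squeeze('columns')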

#### Manually working with delimited formats

In [None]:
! cat pydata/ch06/ex7.csv

In [None]:
import csv
f = open('pydata/ch06/ex7.csv')
reader = csv.reader(f)
for line in reader:
    print(line)
f.close()

In [None]:
lines = list(csv.reader(open('pydata/ch06/ex7.csv' )))
header, values = lines[0], lines[1:]
data_dict = {
    h:v for h,v in zip(header, zip(*values))
}
data_dict

**CSV files come in many flavors; a new format is defined by subclassing csv.Dialect**
- delimiter
- string quoting convention
- line terminator

In [None]:
class my_dialect(csv.Dialect):
    lineterminator = '\n'
    delimiter = ';'
    quotechar = '\"'
    quoting = csv.QUOTE_MINIMAL

reader = csv.reader(
    open('pydata/ch06/ex7.csv'), 
    dialect=my_dialect)

lines = list(reader)
print(lines)

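**CSV dialect options**
- delimiter        character that separates fields
- lineterminator   line terminator used when writing
- quotechar        quote character for fields containing special characters
- quoting          quoting convention
- skipinitialspace ignore whitespace after each delimiter
- doublequote      how to handle the quote character inside a field; if True, it is doubled
- escapechar       escape character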
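
The options above can be sketched with a round trip through an in-memory buffer. `SemicolonDialect` and the sample rows are hypothetical names and data chosen for illustration; only the standard-library `csv` module is used.

```python
import csv
from io import StringIO


# Hypothetical dialect: ';' as delimiter, double quotes for quoting.
class SemicolonDialect(csv.Dialect):
    delimiter = ';'
    quotechar = '"'
    quoting = csv.QUOTE_MINIMAL   # quote only the fields that need it
    doublequote = True            # a quote inside a field is written as ""
    skipinitialspace = True       # drop spaces that directly follow a delimiter
    lineterminator = '\n'


# Write two rows; the second row has a field containing the delimiter and quotes.
buf = StringIO()
writer = csv.writer(buf, dialect=SemicolonDialect)
writer.writerow(['id', 'note'])
writer.writerow(['1', 'say "hi"; then leave'])
text = buf.getvalue()

# Read the same text back with the same dialect.
rows = list(csv.reader(StringIO(text), dialect=SemicolonDialect))

print(text)
print(rows)
```

Because the field contains both the delimiter and the quote character, the writer quotes it and doubles the inner quotes, and the reader restores the original string.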


In [None]:
with open('mydata.csv', 'w', newline='') as f:  # newline='' lets the csv module control line endings
    writer = csv.writer(f, dialect=my_dialect)
    writer.writerow(('one','two','three'))
    writer.writerow(('1','2','3'))

!cat mydata.csv

#### JSON Data
Book p. 184

In [None]:
obj = """ 
{
"name": "Wes",
"places_lived": ["United States", "Spain", "Germany"], 
"pet": null,
"siblings": [{"name": "Scott", "age": 25, "pet": "Zuko"},
             {"name": "Katie", "age": 33, "pet": "Cisco"}]
} 
"""

import json
result = json.loads(obj)
result

In [None]:
asjson = json.dumps(result)
asjson

In [None]:
import pandas as pd
from pandas import DataFrame, Series
siblings = DataFrame(result['siblings'], columns=['name', 'age'])
siblings

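#### XML and HTML: Web Scraping
Python has many libraries for reading HTML and XML; lxml is a commonly used one:
- lxml.html
- lxml.objectify


Download some information from Yahoo! Finance: find the URL you want to get data from, open it with urllib2 (`urllib.request` in Python 3), then parse the resulting stream with lxml.

Book p. 186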

In [None]:
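from lxml.html import parse
from urllib.request import urlopen  # Python 3 replacement for urllib2

# The URL is not recorded in these notes; set `url` and uncomment the next line.
#parsed = parse(urlopen(url))
doc = parsed.getroot()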

links = doc.findall('.//a')
links[15:20]



#### Parsing XML with lxml.objectify

In [None]:
from lxml import objectify 
path = 'Performance_MNR.xml'
parsed = objectify.parse(open(path))
root = parsed.getroot()

data = []
skip_fields = ['PARENT_SEQ', 'INDICATOR_SEQ', 'DESIRED_CHANGE','DECIMAL_PLACES']

for elt in root.INDICATOR:
    el_data = {}
    for child in elt.getchildren():
        if child.tag in skip_fields:
            continue
        el_data[child.tag] = child.pyval
    data.append(el_data)
    
perf = DataFrame(data)
perf

In [None]:
from io import StringIO  # Python 3 location of StringIO
tag = '<a href="http://www.google.com">Google</a>'
root = objectify.parse(StringIO(tag)).getroot()
print( root )

root.get('href')
root.text


### Binary Data Formats

One of the easiest ways to store data in binary format is Python's built-in pickle serialization.
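
As a minimal sketch of what pickle serialization does, independent of pandas, the standard-library `pickle` module can round-trip an ordinary Python object. The `record` dictionary is made up for illustration.

```python
import pickle

# Hypothetical record standing in for any Python object worth persisting.
record = {'name': 'Wes', 'places_lived': ['United States', 'Spain'], 'pet': None}

# dumps() serializes to bytes; dump() and load() do the same against a file object.
blob = pickle.dumps(record)
restored = pickle.loads(blob)

print(type(blob))
print(restored == record)
```

Only unpickle data from a source you trust, since loading a pickle can execute arbitrary code.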

In [None]:
frame = pd.read_csv('pydata/ch06/ex1.csv')
print ( frame )
frame.to_pickle('pydata/ch06/frame_pickle')  # replaces the old frame.save; read back with pd.read_pickle

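#### Using HDF5 Format
- HDF stands for hierarchical data format
- HDF5 can efficiently read and write scientific data stored on disk in binary format
- PyTables and h5py are good choices when you need to handle very large datasets

pandas has a minimal dict-like HDFStore class that uses PyTables to store pandas objects: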

In [None]:
import pandas as pd
store  = pd.HDFStore('mydata.h5')
store['obj1'] = frame
store['obj1_col'] = frame['a']

print( store )
print( store['obj1'])

### Reading Microsoft Excel Files

In [None]:
xls_file = pd.ExcelFile('data.xls')
table = xls_file.parse('Sheet1')

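### Interacting with HTML and Web APIs
Many websites have public APIs that provide data as JSON or in other formats.
A simple, recommended way to access them is the **requests package**.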

In [None]:
import requests
url = 'http://search.twitter.com/search.json?q=python%20pandas'
resp = requests.get(url)
resp

In [None]:
import json
data = json.loads(resp.text)
data.keys()

In [None]:

tweet_fields = ['created_at', 'from_user', 'id', 'text']
tweets = DataFrame(data['results'], columns=tweet_fields)
print ( tweets )
print ( tweets.loc[7] )

### Interacting with Databases

In [None]:
import sqlite3
query = """
CREATE TABLE test
(a VARCHAR(20), b VARCHAR(20),
c REAL, d INTEGER );"""

con = sqlite3.connect(':memory:') 
con.execute(query)
con.commit()


In [None]:
data = [('Atlanta', 'Georgia', 1.25, 6), ('Tallahassee', 'Florida', 2.6, 3), ('Sacramento', 'California', 1.7, 5)]
stmt = "INSERT INTO test VALUES(?, ?, ?, ?)"
con.executemany(stmt, data) 
con.commit()

In [None]:
cursor = con.execute('select * from test')
rows = cursor.fetchall()
rows

In [None]:
cursor.description

In [None]:
# In Python 3 zip() is not subscriptable, so take the column names directly
DataFrame(rows, columns=[col[0] for col in cursor.description])

**SQL**

In [None]:
import pandas.io.sql as sql

In [None]:
sql.read_sql('select * from test', con)  # read_frame is the older name for this function

#### Storing and Loading Data in MongoDB

In [None]:
import pymongo
con = pymongo.MongoClient('localhost', port=27017)  # MongoClient replaces pymongo.Connection
tweets = con.db.tweets

import requests, json
url = 'http://search.twitter.com/search.json?q=python%20pandas'
data = json.loads(requests.get(url).text)
for tweet in data['results']:
    tweets.insert_one(tweet)  # Collection.save is deprecated

cursor = tweets.find({'from_user': 'wesmckinn'})

tweet_fields = ['created_at', 'from_user', 'id', 'text'] 
result = DataFrame(list(cursor), columns=tweet_fields)

## Data Wrangling: Clean, Transform, Merge, Reshape

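### Combining and Merging Data

Data held in pandas objects can be combined in several built-in ways:
- pandas.merge connects rows of different DataFrames based on one or more keys
- pandas.concat stacks multiple objects together along an axis
- combine_first splices overlapping data together, filling missing values in one object with values from another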
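
Since the cells in this section focus on `merge`, here is a minimal sketch of the other two operations, `pandas.concat` and `combine_first`. The Series values and index labels are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data, made up for illustration.
s1 = pd.Series([0, 1], index=['a', 'b'])
s2 = pd.Series([2, 3, 4], index=['c', 'd', 'e'])

# concat stacks objects along an axis (rows by default).
stacked = pd.concat([s1, s2])

a = pd.Series([np.nan, 2.5, np.nan], index=['x', 'y', 'z'])
b = pd.Series([1.0, 9.9, 3.0], index=['x', 'y', 'z'])

# combine_first keeps the values of `a` and fills its gaps from `b`.
patched = a.combine_first(b)

print(stacked)
print(patched)
```

Note that `combine_first` only uses `b` where `a` is missing, so the value 9.9 at label `y` is ignored.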

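#### Database-style DataFrame Merges
- merge arguments
    - left: the left DataFrame
    - right: the right DataFrame
    - on: column names to join on; if not specified, the intersection of the column names in left and right is used
    - left_on: columns in left to use as join keys
    - right_on: columns in right to use as join keys
    - left_index: use the row index of left as its join key
    - right_index: use the row index of right as its join key
    - sort: sort the result by the join keys (the book lists True as the default; recent pandas versions default to False)
    - suffixes: tuple of strings appended to overlapping column names, ('_x', '_y') by default
    - copy: always copies by default; pass False to avoid copying in some cases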
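
A minimal sketch combining several of these arguments (`left_on`, `right_index`, `how`, `suffixes`). The frames and column names are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical frames, made up for illustration.
left = pd.DataFrame({'key': ['a', 'b', 'a', 'c'], 'value': [0, 1, 2, 3]})
right = pd.DataFrame({'value': [10, 20]}, index=['a', 'b'])

# Join the `key` column of `left` against the row index of `right`.
# Both frames have a `value` column, so suffixes disambiguate them.
merged = pd.merge(left, right, left_on='key', right_index=True,
                  how='left', suffixes=('_left', '_right'))

print(merged)
```

With `how='left'`, the row whose key is `c` is kept even though `right` has no matching index label; its `value_right` is NaN.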
    

In [None]:
from pandas import DataFrame, Series
df1 = DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'], 
                 'data1': range(7)})

df2 = DataFrame({'key': ['a', 'b', 'd'], 
                 'data2': range(3)})

print ( df1 ) ;print( '-'*32 ) 
print ( df2 ) ; print( '-'*32 )
print ( pd.merge(df1, df2)) ; print( '-'*32 )
print ( pd.merge(df1, df2, on='key') )

In [None]:
df3 = DataFrame({'lkey': ['b', 'b', 'a', 'c', 'a', 'a', 'b'], 
                 'data1': range(7)})

df4 = DataFrame({'rkey': ['a', 'b', 'd'], 
                 'data2': range(3)})

pd.merge(df3, df4, left_on='lkey', right_on='rkey')


In [None]:
print ( pd.merge(df1, df2, how='inner') ) ; print('-'*32)
print ( pd.merge(df1, df2, how='outer') ) ; print('-'*32)
print ( pd.merge(df1, df2, how='left') )  ; print('-'*32)
print ( pd.merge(df1, df2, how='right') ) ; print('-'*32)

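- Many-to-many merges are straightforward and need no extra work
- A many-to-many join produces the **Cartesian product** of the rows that share a key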
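
The Cartesian product can be checked by counting rows. In this sketch, with made-up frames, key `b` appears 3 times on the left and 2 times on the right, so it contributes 3 * 2 = 6 rows to the result.

```python
import pandas as pd

# Hypothetical frames: 'b' appears 3 times on the left and 2 times on the right.
left = pd.DataFrame({'key': ['b', 'b', 'b', 'a'], 'lval': [0, 1, 2, 3]})
right = pd.DataFrame({'key': ['b', 'b', 'a'], 'rval': [10, 11, 12]})

merged = pd.merge(left, right, on='key', how='inner')

# Each left 'b' row pairs with each right 'b' row: 3 * 2 = 6 rows,
# plus 1 * 1 = 1 row for 'a'.
counts = merged['key'].value_counts().to_dict()

print(merged)
print(counts)
```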

In [None]:
df1 = DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'], 
                 'data1': range(6)})
df2 = DataFrame({'key': ['a', 'b', 'a', 'b', 'd'], 
                 'data2': range(5)})

print ( df1 ) ;print('-'*32)
print ( df2 ) ;print('-'*32)
print ( pd.merge(df1, df2, on='key', how='left') )

To merge on multiple keys, pass a list of column names.

In [None]:
left = DataFrame({'key1': ['foo', 'foo', 'bar'], 
                  'key2': ['one', 'two', 'one'],
                  'lval': [1, 2, 3]})

right = DataFrame({'key1': ['foo', 'foo', 'bar', 'bar'], 
                   'key2': ['one', 'one', 'one', 'two'],
                   'rval': [4, 5, 6, 7]})

print ( left )
print ( right )
pd.merge(left, right, on=['key1', 'key2'], how='outer')

In [None]:
pd.merge(left, right, on='key1')

In [None]:
pd.merge(left, right, on='key1', suffixes=('_left', '_right'))

### Reshaping and Pivoting

### Data Transformation

### String Manipulation

### Example: USDA Food Database

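## Plotting and Visualization

### A Brief matplotlib API Primer

### Plotting Functions in pandas

### Plotting Maps: Visualizing Haiti Earthquake Crisis Data

### Python Visualization Tool Ecosystem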

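## Data Aggregation and Group Operations

### GroupBy Mechanics

### Data Aggregation

### Group-wise Operations and Transformations

### Pivot Tables and Cross-Tabulation

### Example: 2012 Federal Election Commission Database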

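## Time Series

### Date and Time Data Types and Tools

### Time Series Basics

### Date Ranges, Frequencies, and Shifting

### Time Zone Handling

### Periods and Period Arithmetic

### Resampling and Frequency Conversion

### Time Series Plotting

### Moving Window Functions

### Performance and Memory Usage Notes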

## Financial and Economic Data Applications

### Data Munging Topics

### Group Transforms and Analysis

### More Example Applications

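## Advanced NumPy

### ndarray Object Internals

### Advanced Array Manipulation

### Broadcasting

### Advanced ufunc Usage

### Structured and Record Arrays

### More About Sorting

### NumPy Matrix Class

### Advanced Array Input and Output

### Performance Tips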

## Appendix: Python Language Essentials