
<font size=17>Pandas (Advanced) Study Notes</font> 

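# Hierarchical Indexing
Hierarchical indexing: a hierarchical index is represented by a MultiIndex object.

Setting multiple index columns:

set_index(['a', 'b'], inplace=True): column a becomes the first-level (outermost) row index and column b becomes the second-level (inner) row index. The order of a and b matters.

Selecting subsets:

1. Outer-level selection: loc['outer_index'], where outer_index is a label in the outer index level. For example, if the row index country has the labels ["A", "B", "C", "D"], then loc["B"] returns the group of rows whose outer label is B.

2. Inner-level selection: loc["outer_index", "inner_index"] selects, from the group whose outer label is outer_index, the rows whose inner label is inner_index.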
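
The two selection forms can be tried on a small frame before moving to the real dataset. This is a minimal sketch: the `toy` DataFrame and its values are made up for illustration, and only the column names follow the happiness data used in these notes.

```python
import pandas as pd

# Hypothetical toy data; rows are already ordered by Region, then Country.
toy = pd.DataFrame({
    "Region": ["North America", "North America", "Western Europe", "Western Europe"],
    "Country": ["Canada", "United States", "Denmark", "Norway"],
    "Happiness Score": [7.404, 7.104, 7.526, 7.498],
})

# Region becomes the outer level, Country the inner level.
indexed = toy.set_index(["Region", "Country"])

# Outer-level selection: all rows whose Region is "North America".
outer = indexed.loc["North America"]

# Outer + inner selection: one row, returned as a Series.
inner = indexed.loc[("Western Europe", "Norway")]

print(type(indexed.index).__name__)
print(list(outer.index))
print(inner["Happiness Score"])
```

Note that set_index is called here without inplace=True, so `toy` keeps its original default index and `indexed` is a separate object.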

In [26]:
import numpy as np
import pandas as pd

In [4]:
file = "2016_happiness.csv"
data = pd.read_csv(file, usecols = ['Country','Region','Happiness Rank','Happiness Score'])

In [5]:
data.head()

,Country,Region,Happiness Rank,Happiness Score
0,Denmark,Western Europe,1,7.526
1,Switzerland,Western Europe,2,7.509
2,Iceland,Western Europe,3,7.501
3,Norway,Western Europe,4,7.498
4,Finland,Western Europe,5,7.413


In [6]:
data.set_index(['Region','Country'], inplace = True)
data

Region,Country,Happiness Rank,Happiness Score
Western Europe,Denmark,1,7.526
Western Europe,Switzerland,2,7.509
Western Europe,Iceland,3,7.501
Western Europe,Norway,4,7.498
Western Europe,Finland,5,7.413
...,...,...,...
Sub-Saharan Africa,Benin,153,3.484
Southern Asia,Afghanistan,154,3.360
Sub-Saharan Africa,Togo,155,3.303
Middle East and Northern Africa,Syria,156,3.069


### Selecting Subsets

In [7]:
data.loc['Western Europe']

Country,Happiness Rank,Happiness Score
Denmark,1,7.526
Switzerland,2,7.509
Iceland,3,7.501
Norway,4,7.498
Finland,5,7.413
Netherlands,7,7.339
Sweden,10,7.291
Austria,12,7.119
Germany,16,6.994
Belgium,18,6.929


In [9]:
data.loc['Australia and New Zealand','New Zealand']

Happiness Rank     8.000
Happiness Score    7.334
Name: (Australia and New Zealand, New Zealand), dtype: float64

In [11]:
data.swaplevel()

Country,Region,Happiness Rank,Happiness Score
Denmark,Western Europe,1,7.526
Switzerland,Western Europe,2,7.509
Iceland,Western Europe,3,7.501
Norway,Western Europe,4,7.498
Finland,Western Europe,5,7.413
...,...,...,...
Benin,Sub-Saharan Africa,153,3.484
Afghanistan,Southern Asia,154,3.360
Togo,Sub-Saharan Africa,155,3.303
Syria,Middle East and Northern Africa,156,3.069


In [12]:
data
# swaplevel() returned a new object; data itself keeps its original level order

Region,Country,Happiness Rank,Happiness Score
Western Europe,Denmark,1,7.526
Western Europe,Switzerland,2,7.509
Western Europe,Iceland,3,7.501
Western Europe,Norway,4,7.498
Western Europe,Finland,5,7.413
...,...,...,...
Sub-Saharan Africa,Benin,153,3.484
Southern Asia,Afghanistan,154,3.360
Sub-Saharan Africa,Togo,155,3.303
Middle East and Northern Africa,Syria,156,3.069


In [13]:
data.sort_index()

Region,Country,Happiness Rank,Happiness Score
Australia and New Zealand,Australia,9,7.313
Australia and New Zealand,New Zealand,8,7.334
Central and Eastern Europe,Albania,109,4.655
Central and Eastern Europe,Armenia,121,4.360
Central and Eastern Europe,Azerbaijan,81,5.291
...,...,...,...
Western Europe,Portugal,94,5.123
Western Europe,Spain,37,6.361
Western Europe,Sweden,10,7.291
Western Europe,Switzerland,2,7.509
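
Both swaplevel() and sort_index() return new objects and leave the original untouched, which is easy to verify on a small frame. A minimal sketch, assuming a made-up `toy` DataFrame with the same index levels as above:

```python
import pandas as pd

# Hypothetical toy data, deliberately unsorted.
toy = pd.DataFrame({
    "Region": ["Western Europe", "North America", "Western Europe"],
    "Country": ["Norway", "Canada", "Denmark"],
    "Happiness Rank": [4, 6, 1],
}).set_index(["Region", "Country"])

swapped = toy.swaplevel()      # new object with Country as the outer level
sorted_toy = toy.sort_index()  # new object sorted by Region, then by Country

print(list(toy.index.names))      # the original level order is unchanged
print(list(swapped.index.names))
print(list(sorted_toy.index))
```

Sorting compares the outer level first and uses the inner level only to break ties, which is why Denmark precedes Norway inside Western Europe.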


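## Grouping and Aggregation
Grouping: split a dataset into groups, then compute statistics for each group.

How a group operation works:

split->apply->combine

(1) Split: partition the data according to the grouping key

(2) Apply: run the chosen computation on each group

(3) Combine: merge the per-group results into a single result

**A group operation proceeds in three steps: split, apply, combine**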
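
The three steps can be written out by hand and compared with the built-in shortcut. This is only an illustrative sketch on made-up data (`toy`, `Region`, `Score` are hypothetical names), not how pandas implements groupby internally.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    "Region": ["A", "B", "A", "B", "A"],
    "Score": [1.0, 2.0, 3.0, 4.0, 8.0],
})

# Split: iterating over a groupby yields (key, sub-DataFrame) pairs.
# Apply: compute the mean score of each piece.
pieces = {key: group["Score"].mean() for key, group in toy.groupby("Region")}

# Combine: assemble the per-group results into one Series.
manual = pd.Series(pieces, name="Score")

# The built-in aggregation performs all three steps in one call.
builtin = toy.groupby("Region")["Score"].mean()

print(manual.to_dict())
print(builtin.to_dict())
```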

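Aggregation:

The process of reducing an array of values to a single scalar, e.g. mean(), count()…

It is typically applied to the data of each group after grouping.

Built-in aggregation functions: sum(), mean(), max(), count(), size()

count(): the number of non-null values in each group (zeros are counted, NaN is not)

size(): the number of rows in each group, including null values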
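
The difference between count() and size() only shows up when a group contains missing values. A minimal sketch on made-up data (the `toy` frame and its column names are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one zero and two missing values.
toy = pd.DataFrame({
    "key": ["one", "one", "two", "two", "two"],
    "value": [0.0, np.nan, 1.5, np.nan, 2.5],
})

grouped = toy.groupby("key")["value"]

counts = grouped.count()  # non-null values only; the 0.0 is counted
sizes = grouped.size()    # all rows in the group, NaN included

print(counts.to_dict())
print(sizes.to_dict())
```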

In [14]:
# Reload the data to undo the earlier set_index
data = pd.read_csv(file, usecols = ['Country','Region','Happiness Rank','Happiness Score'])

In [15]:
obj1 = data.groupby('Region')

In [18]:
# numeric_only=True skips the string column Country, which recent pandas versions no longer drop silently
obj1.mean(numeric_only=True)

Region,Happiness Rank,Happiness Score
Australia and New Zealand,8.5,7.3235
Central and Eastern Europe,78.448276,5.37069
Eastern Asia,67.166667,5.624167
Latin America and Caribbean,48.333333,6.10175
Middle East and Northern Africa,78.105263,5.386053
North America,9.5,7.254
Southeastern Asia,80.0,5.338889
Southern Asia,111.714286,4.563286
Sub-Saharan Africa,129.657895,4.136421
Western Europe,29.190476,6.685667


In [19]:
obj1.max()

Region,Country,Happiness Rank,Happiness Score
Australia and New Zealand,New Zealand,9,7.334
Central and Eastern Europe,Uzbekistan,129,6.596
Eastern Asia,Taiwan,101,6.379
Latin America and Caribbean,Venezuela,136,7.087
Middle East and Northern Africa,Yemen,156,7.267
North America,United States,13,7.404
Southeastern Asia,Vietnam,140,6.739
Southern Asia,Sri Lanka,154,5.196
Sub-Saharan Africa,Zimbabwe,157,5.648
Western Europe,United Kingdom,99,7.526


## Custom Grouping

In [20]:
# Custom grouping rule: map a score to a label

def get_score_group(score):
    if score <= 4:
        score_group = 'low'
    elif score <= 6:
        score_group = 'middle'
    else:
        score_group = 'high'
    return score_group

In [22]:
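# Pass the custom grouping function to groupby
# groupby calls the function on each index label,
# so the column of interest must first be set as the index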

data2 = data.set_index('Happiness Score')
data2.groupby(get_score_group).size()

high      47
low       21
middle    89
dtype: int64

In [23]:
# Approach 2: store the group label in a new column, then group by that column
data['score group'] = data['Happiness Score'].apply(get_score_group)
data.head()

,Country,Region,Happiness Rank,Happiness Score,score group
0,Denmark,Western Europe,1,7.526,high
1,Switzerland,Western Europe,2,7.509,high
2,Iceland,Western Europe,3,7.501,high
3,Norway,Western Europe,4,7.498,high
4,Finland,Western Europe,5,7.413,high


In [24]:
data.groupby('score group').size()

score group
high      47
low       21
middle    89
dtype: int64
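
The two approaches above give the same group sizes. A third variation, sketched below on made-up data, passes the mapped labels to groupby directly as a Series, so neither the index nor the columns of the frame need to change. The `toy` frame is hypothetical; `get_score_group` is repeated so the example runs on its own.

```python
import pandas as pd

def get_score_group(score):
    """Map a happiness score to 'low', 'middle' or 'high'."""
    if score <= 4:
        return "low"
    elif score <= 6:
        return "middle"
    return "high"

# Hypothetical toy data.
toy = pd.DataFrame({
    "Country": ["A", "B", "C", "D", "E"],
    "Happiness Score": [7.5, 5.2, 3.4, 6.8, 3.9],
})

# Approach 1: the function is applied to the index labels.
by_index = toy.set_index("Happiness Score").groupby(get_score_group).size()

# Variation: group by a Series of labels aligned with the rows of toy.
labels = toy["Happiness Score"].apply(get_score_group)
by_series = toy.groupby(labels).size()

print(by_index.to_dict())
print(by_series.to_dict())
```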

### Aggregation Operations

In [27]:
data.groupby('Region').agg(np.max)

Region,Country,Happiness Rank,Happiness Score,score group
Australia and New Zealand,New Zealand,9,7.334,high
Central and Eastern Europe,Uzbekistan,129,6.596,middle
Eastern Asia,Taiwan,101,6.379,middle
Latin America and Caribbean,Venezuela,136,7.087,middle
Middle East and Northern Africa,Yemen,156,7.267,middle
North America,United States,13,7.404,high
Southeastern Asia,Vietnam,140,6.739,middle
Southern Asia,Sri Lanka,154,5.196,middle
Sub-Saharan Africa,Zimbabwe,157,5.648,middle
Western Europe,United Kingdom,99,7.526,middle


In [28]:
data.groupby('Region').max()

Region,Country,Happiness Rank,Happiness Score,score group
Australia and New Zealand,New Zealand,9,7.334,high
Central and Eastern Europe,Uzbekistan,129,6.596,middle
Eastern Asia,Taiwan,101,6.379,middle
Latin America and Caribbean,Venezuela,136,7.087,middle
Middle East and Northern Africa,Yemen,156,7.267,middle
North America,United States,13,7.404,high
Southeastern Asia,Vietnam,140,6.739,middle
Southern Asia,Sri Lanka,154,5.196,middle
Sub-Saharan Africa,Zimbabwe,157,5.648,middle
Western Europe,United Kingdom,99,7.526,middle


In [30]:
# Pass a list of functions (recent pandas versions prefer the string names 'max', 'min', 'mean')

data.groupby('Region')['Happiness Score'].agg([np.max,np.min,np.mean])

Region,amax,amin,mean
Australia and New Zealand,7.334,7.313,7.3235
Central and Eastern Europe,6.596,4.217,5.37069
Eastern Asia,6.379,4.907,5.624167
Latin America and Caribbean,7.087,4.028,6.10175
Middle East and Northern Africa,7.267,3.069,5.386053
North America,7.404,7.104,7.254
Southeastern Asia,6.739,3.907,5.338889
Southern Asia,5.196,3.36,4.563286
Sub-Saharan Africa,5.648,2.905,4.136421
Western Europe,7.526,5.033,6.685667


In [31]:
# Pass a dict that maps each column to its own aggregation function
data.groupby('Region').agg({'Happiness Score':np.mean, 'Happiness Rank': np.max})


Region,Happiness Score,Happiness Rank
Australia and New Zealand,7.3235,9
Central and Eastern Europe,5.37069,129
Eastern Asia,5.624167,101
Latin America and Caribbean,6.10175,136
Middle East and Northern Africa,5.386053,156
North America,7.254,13
Southeastern Asia,5.338889,140
Southern Asia,4.563286,154
Sub-Saharan Africa,4.136421,157
Western Europe,6.685667,99
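
The dict form can also be written with string function names, and pandas additionally offers named aggregation, which lets you choose the output column names. The sketch below uses made-up data; `toy`, `mean_score` and `worst_rank` are hypothetical names chosen for illustration.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    "Region": ["A", "A", "B", "B"],
    "Happiness Score": [7.0, 6.0, 5.0, 4.0],
    "Happiness Rank": [1, 2, 3, 4],
})

# Dict form: one aggregation per column, given by name.
by_dict = toy.groupby("Region").agg(
    {"Happiness Score": "mean", "Happiness Rank": "max"}
)

# Named aggregation: output_name=(input_column, function).
named = toy.groupby("Region").agg(
    mean_score=("Happiness Score", "mean"),
    worst_rank=("Happiness Rank", "max"),
)

print(by_dict)
print(named)
```

With the dict form the result keeps the original column names, so named aggregation is the more readable choice when the same column is aggregated in several ways.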


In [35]:
# Pass a custom aggregation function: it receives each group's values and returns a scalar
def max_min_diff(x):
    return x.max() - x.min()

data.groupby('Region')['Happiness Rank'].agg(max_min_diff)

Region
Australia and New Zealand            1
Central and Eastern Europe         102
Eastern Asia                        67
Latin America and Caribbean        122
Middle East and Northern Africa    145
North America                        7
Southeastern Asia                  118
Southern Asia                       70
Sub-Saharan Africa                  91
Western Europe                      98
Name: Happiness Rank, dtype: int64
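
A custom function can be mixed with built-in aggregations in one list; the resulting column is labelled with the function's name. A minimal sketch on made-up data (the `toy` frame is hypothetical, and `max_min_diff` is repeated so the example is self-contained):

```python
import pandas as pd

def max_min_diff(x):
    """Range of a group's values: largest minus smallest."""
    return x.max() - x.min()

# Hypothetical toy data.
toy = pd.DataFrame({
    "Region": ["A", "A", "A", "B", "B"],
    "Happiness Rank": [1, 10, 4, 20, 25],
})

# Built-in aggregations by name, plus the custom function.
result = toy.groupby("Region")["Happiness Rank"].agg(["max", "min", max_min_diff])

print(result)
```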

In [38]:
df = pd.DataFrame({'key': ['one', 'three', 'two', 'two', 'one', 'three', 'three', 'two', 'one', 'one'],
                   'data1': np.random.randint(25, 75, size=10),
                   'data2': np.random.randint(1, 50, size=10),
                   'data3': np.random.randint(50, 100, size=10),
                   'data4': np.random.randint(100, 150, size=10)})
df

,key,data1,data2,data3,data4
0,one,51,34,83,118
1,three,34,12,94,119
2,two,32,5,76,141
3,two,47,42,62,107
4,one,30,38,61,137
5,three,41,28,74,147
6,three,53,8,83,139
7,two,34,47,79,106
8,one,34,8,52,122
9,one,28,4,94,131
