
<font size=17>Pandas (Advanced) Study Notes</font> 

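# Hierarchical Indexing
Hierarchical indexing: the index of a hierarchically indexed object is a MultiIndex.

Setting several index columns:

set_index(['a', 'b'], inplace=True) makes column a the first-level (outermost) row index and column b the second-level row index, nested inside it. The order of a and b matters.

Selecting a subset:

1. Outer selection, loc['outer_index']: outer_index is a label in the outer index level. For example, if the outer row index country has the labels ["A", "B", "C", "D"], then loc["B"] returns all rows whose outer label is B.

2. Inner selection, loc["outer_index", "inner_index"]: from the rows whose outer label is outer_index, select those whose inner label is inner_index.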
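
The two selection rules can be tried on a small frame first. This is a minimal sketch; the toy values and the column names region, country and score are made up for illustration and are not part of the happiness dataset.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from the happiness dataset)
toy = pd.DataFrame({
    "region":  ["East", "East", "West", "West"],
    "country": ["A", "B", "C", "D"],
    "score":   [1.0, 2.0, 3.0, 4.0],
})

# 'region' becomes the outer level, 'country' the inner level
indexed = toy.set_index(["region", "country"])

# Outer selection: all rows whose outer label is "West"
west = indexed.loc["West"]

# Inner selection: the single row at outer "West", inner "D"
west_d = indexed.loc[("West", "D")]

print(west)
print(west_d)
```

The outer selection returns a DataFrame indexed by the remaining inner level, while the inner selection returns one row as a Series.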

In [1]:
import numpy as np
import pandas as pd

In [2]:
file = "2016_happiness.csv"
data = pd.read_csv(file, usecols = ['Country','Region','Happiness Rank','Happiness Score'])

In [3]:
data.head()

,Country,Region,Happiness Rank,Happiness Score
0,Denmark,Western Europe,1,7.526
1,Switzerland,Western Europe,2,7.509
2,Iceland,Western Europe,3,7.501
3,Norway,Western Europe,4,7.498
4,Finland,Western Europe,5,7.413


In [4]:
data.set_index(['Region','Country'], inplace = True)
data

Region,Country,Happiness Rank,Happiness Score
Western Europe,Denmark,1,7.526
Western Europe,Switzerland,2,7.509
Western Europe,Iceland,3,7.501
Western Europe,Norway,4,7.498
Western Europe,Finland,5,7.413
...,...,...,...
Sub-Saharan Africa,Benin,153,3.484
Southern Asia,Afghanistan,154,3.360
Sub-Saharan Africa,Togo,155,3.303
Middle East and Northern Africa,Syria,156,3.069


### Selecting a subset

In [5]:
data.loc['Western Europe']

Country,Happiness Rank,Happiness Score
Denmark,1,7.526
Switzerland,2,7.509
Iceland,3,7.501
Norway,4,7.498
Finland,5,7.413
Netherlands,7,7.339
Sweden,10,7.291
Austria,12,7.119
Germany,16,6.994
Belgium,18,6.929


In [6]:
data.loc['Australia and New Zealand','New Zealand']

Happiness Rank     8.000
Happiness Score    7.334
Name: (Australia and New Zealand, New Zealand), dtype: float64

In [7]:
data.swaplevel()

Country,Region,Happiness Rank,Happiness Score
Denmark,Western Europe,1,7.526
Switzerland,Western Europe,2,7.509
Iceland,Western Europe,3,7.501
Norway,Western Europe,4,7.498
Finland,Western Europe,5,7.413
...,...,...,...
Benin,Sub-Saharan Africa,153,3.484
Afghanistan,Southern Asia,154,3.360
Togo,Sub-Saharan Africa,155,3.303
Syria,Middle East and Northern Africa,156,3.069


In [8]:
data
# swaplevel() returned a new object; data itself is unchanged

Region,Country,Happiness Rank,Happiness Score
Western Europe,Denmark,1,7.526
Western Europe,Switzerland,2,7.509
Western Europe,Iceland,3,7.501
Western Europe,Norway,4,7.498
Western Europe,Finland,5,7.413
...,...,...,...
Sub-Saharan Africa,Benin,153,3.484
Southern Asia,Afghanistan,154,3.360
Sub-Saharan Africa,Togo,155,3.303
Middle East and Northern Africa,Syria,156,3.069


In [9]:
data.sort_index()

Region,Country,Happiness Rank,Happiness Score
Australia and New Zealand,Australia,9,7.313
Australia and New Zealand,New Zealand,8,7.334
Central and Eastern Europe,Albania,109,4.655
Central and Eastern Europe,Armenia,121,4.360
Central and Eastern Europe,Azerbaijan,81,5.291
...,...,...,...
Western Europe,Portugal,94,5.123
Western Europe,Spain,37,6.361
Western Europe,Sweden,10,7.291
Western Europe,Switzerland,2,7.509
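
The behaviour of swaplevel() and sort_index() seen in the cells above can be checked on a tiny frame. A minimal sketch with hypothetical labels and values:

```python
import pandas as pd

# Toy frame with a deliberately unsorted two-level index (hypothetical labels)
toy = pd.DataFrame(
    {"score": [3.0, 1.0, 2.0]},
    index=pd.MultiIndex.from_tuples(
        [("West", "C"), ("East", "B"), ("East", "A")],
        names=["region", "country"],
    ),
)

swapped = toy.swaplevel()          # new object; toy keeps (region, country)
ordered = toy.sort_index()         # sorts by region first, then by country
east = ordered.loc["East":"East"]  # label slice on the outer level of the sorted index

print(swapped.index.names, toy.index.names)
print(ordered)
```

Both methods return new objects, so the result has to be assigned if it is needed later.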


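## Grouping and Aggregation
Grouping: split the dataset into groups, then compute statistics for each group.

How a grouped operation works:

split -> apply -> combine

(1) Split: the key(s) that decide which group each row belongs to

(2) Apply: the computation carried out on each group

(3) Combine: the per-group results are merged into one result

**A grouped operation is split, apply, combine**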
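
To make the three steps concrete, the sketch below performs them by hand on a toy frame (hypothetical data) and compares the result with a single groupby call. It only illustrates the idea and is not how pandas implements groupby internally.

```python
import pandas as pd

toy = pd.DataFrame({"key": ["a", "b", "a", "b"], "value": [1, 2, 3, 4]})

# Split: one sub-frame per distinct key
pieces = {key: group for key, group in toy.groupby("key")}

# Apply: compute one number for each piece
sums = {key: group["value"].sum() for key, group in pieces.items()}

# Combine: assemble the per-group results into one Series
manual = pd.Series(sums, name="value")

# groupby performs the same three steps in one expression
builtin = toy.groupby("key")["value"].sum()

print(manual)
print(builtin)
```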

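Aggregation:

Any operation that reduces an array to a scalar, such as mean() or count().

It is most often applied to grouped data.

Built-in aggregation functions: sum(), mean(), max(), count(), size()

count(): the number of non-missing (non-NaN) values; zeros are counted
size(): the number of rows in the group, missing values included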
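
The difference between count() and size() shows up only when a group contains missing values. A minimal sketch on made-up data:

```python
import numpy as np
import pandas as pd

# Group "a" holds a zero and a missing value (hypothetical data)
toy = pd.DataFrame({
    "key":   ["a", "a", "b", "b"],
    "value": [0.0, np.nan, 5.0, 6.0],
})

grouped = toy.groupby("key")["value"]
counts = grouped.count()  # non-missing values per group; the 0.0 is counted
sizes = grouped.size()    # rows per group, NaN included

print(counts)
print(sizes)
```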

In [10]:
# Reload the data
data = pd.read_csv(file, usecols = ['Country','Region','Happiness Rank','Happiness Score'])

In [11]:
obj1 = data.groupby('Region')

In [12]:
obj1.mean(numeric_only=True)  # skips the text column Country; pandas 2.x raises a TypeError without it

Region,Happiness Rank,Happiness Score
Australia and New Zealand,8.5,7.3235
Central and Eastern Europe,78.448276,5.37069
Eastern Asia,67.166667,5.624167
Latin America and Caribbean,48.333333,6.10175
Middle East and Northern Africa,78.105263,5.386053
North America,9.5,7.254
Southeastern Asia,80.0,5.338889
Southern Asia,111.714286,4.563286
Sub-Saharan Africa,129.657895,4.136421
Western Europe,29.190476,6.685667


In [13]:
obj1.max()

Region,Country,Happiness Rank,Happiness Score
Australia and New Zealand,New Zealand,9,7.334
Central and Eastern Europe,Uzbekistan,129,6.596
Eastern Asia,Taiwan,101,6.379
Latin America and Caribbean,Venezuela,136,7.087
Middle East and Northern Africa,Yemen,156,7.267
North America,United States,13,7.404
Southeastern Asia,Vietnam,140,6.739
Southern Asia,Sri Lanka,154,5.196
Sub-Saharan Africa,Zimbabwe,157,5.648
Western Europe,United Kingdom,99,7.526


## Custom Grouping

In [14]:
# Custom grouping rule

def get_score_group(score):
    if score <= 4:
        score_group = 'low'
    elif score <= 6:
        score_group = 'middle'
    else:
        score_group = 'high'
    return score_group

In [15]:
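# Pass the custom function to groupby
# groupby calls the function on the index labels, so the column of interest has to be set as the index first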

data2 = data.set_index('Happiness Score')
data2.groupby(get_score_group).size()

high      47
low       21
middle    89
dtype: int64

In [16]:
# Method 2: store the group label in a new column, then group by that column
data['score group'] = data['Happiness Score'].apply(get_score_group)
data.head()

,Country,Region,Happiness Rank,Happiness Score,score group
0,Denmark,Western Europe,1,7.526,high
1,Switzerland,Western Europe,2,7.509,high
2,Iceland,Western Europe,3,7.501,high
3,Norway,Western Europe,4,7.498,high
4,Finland,Western Europe,5,7.413,high


In [17]:
data.groupby('score group').size()

score group
high      47
low       21
middle    89
dtype: int64
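
As an alternative to a hand-written function, the same three bins could be produced with pd.cut. This is a different technique from the two methods above, shown here as a sketch on hypothetical scores; the bin edges 4 and 6 mirror get_score_group, and the default right-closed intervals match its <= comparisons.

```python
import numpy as np
import pandas as pd

# Hypothetical scores, including the boundary values 4.0 and 6.0
scores = pd.Series([3.5, 4.0, 5.2, 6.0, 7.1])

# Intervals are (-inf, 4], (4, 6], (6, inf), the same rule as get_score_group
groups = pd.cut(scores, bins=[-np.inf, 4, 6, np.inf],
                labels=["low", "middle", "high"])

# Group sizes, listed in category order (low, middle, high)
sizes = scores.groupby(groups, observed=False).size()

print(groups)
print(sizes)
```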

### Aggregation

In [18]:
data.groupby('Region').agg(np.max)

Region,Country,Happiness Rank,Happiness Score,score group
Australia and New Zealand,New Zealand,9,7.334,high
Central and Eastern Europe,Uzbekistan,129,6.596,middle
Eastern Asia,Taiwan,101,6.379,middle
Latin America and Caribbean,Venezuela,136,7.087,middle
Middle East and Northern Africa,Yemen,156,7.267,middle
North America,United States,13,7.404,high
Southeastern Asia,Vietnam,140,6.739,middle
Southern Asia,Sri Lanka,154,5.196,middle
Sub-Saharan Africa,Zimbabwe,157,5.648,middle
Western Europe,United Kingdom,99,7.526,middle


In [19]:
data.groupby('Region').max()

Region,Country,Happiness Rank,Happiness Score,score group
Australia and New Zealand,New Zealand,9,7.334,high
Central and Eastern Europe,Uzbekistan,129,6.596,middle
Eastern Asia,Taiwan,101,6.379,middle
Latin America and Caribbean,Venezuela,136,7.087,middle
Middle East and Northern Africa,Yemen,156,7.267,middle
North America,United States,13,7.404,high
Southeastern Asia,Vietnam,140,6.739,middle
Southern Asia,Sri Lanka,154,5.196,middle
Sub-Saharan Africa,Zimbabwe,157,5.648,middle
Western Europe,United Kingdom,99,7.526,middle


In [20]:
# Pass a list of functions

data.groupby('Region')['Happiness Score'].agg([np.max,np.min,np.mean])

Region,amax,amin,mean
Australia and New Zealand,7.334,7.313,7.3235
Central and Eastern Europe,6.596,4.217,5.37069
Eastern Asia,6.379,4.907,5.624167
Latin America and Caribbean,7.087,4.028,6.10175
Middle East and Northern Africa,7.267,3.069,5.386053
North America,7.404,7.104,7.254
Southeastern Asia,6.739,3.907,5.338889
Southern Asia,5.196,3.36,4.563286
Sub-Saharan Africa,5.648,2.905,4.136421
Western Europe,7.526,5.033,6.685667


In [21]:
# Pass a dict that maps column names to functions
data.groupby('Region').agg({'Happiness Score':np.mean, 'Happiness Rank': np.max})


Region,Happiness Score,Happiness Rank
Australia and New Zealand,7.3235,9
Central and Eastern Europe,5.37069,129
Eastern Asia,5.624167,101
Latin America and Caribbean,6.10175,136
Middle East and Northern Africa,5.386053,156
North America,7.254,13
Southeastern Asia,5.338889,140
Southern Asia,4.563286,154
Sub-Saharan Africa,4.136421,157
Western Europe,6.685667,99


In [22]:
# Pass a custom function
def max_min_diff(x):
    return x.max() - x.min()

data.groupby('Region')['Happiness Rank'].agg(max_min_diff)

Region
Australia and New Zealand            1
Central and Eastern Europe         102
Eastern Asia                        67
Latin America and Caribbean        122
Middle East and Northern Africa    145
North America                        7
Southeastern Asia                  118
Southern Asia                       70
Sub-Saharan Africa                  91
Western Europe                      98
Name: Happiness Rank, dtype: int64
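
The list, dict and custom-function forms above can also be combined in one call using named aggregation, where each keyword names an output column and maps to a (column, function) pair. A minimal sketch on hypothetical data; the output names score_mean, rank_max and rank_range are chosen only for this example.

```python
import pandas as pd

# Hypothetical data with two regions
toy = pd.DataFrame({
    "region": ["N", "N", "S", "S"],
    "rank":   [1, 4, 10, 12],
    "score":  [7.5, 7.0, 5.0, 4.0],
})

def max_min_diff(x):
    # Spread between the largest and smallest value in the group
    return x.max() - x.min()

summary = toy.groupby("region").agg(
    score_mean=("score", "mean"),       # built-in aggregation given by name
    rank_max=("rank", "max"),
    rank_range=("rank", max_min_diff),  # custom function
)

print(summary)
```

Passing function names as strings avoids the amax/amin column labels that appear when NumPy functions are passed directly.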

In [23]:
df = pd.DataFrame({'key':['one', 'three', 'two', 'two', 'one','three','three','two','one','one'],

     'data1':np.random.randint(25,75,size=10),

    'data2':np.random.randint(1,50,size=10),

    'data3':np.random.randint(50,100,size=10),

    'data4':np.random.randint(100,150,size=10)})
df

,key,data1,data2,data3,data4
0,one,60,1,93,137
1,three,51,13,70,102
2,two,64,36,79,125
3,two,28,16,89,137
4,one,37,35,60,135
5,three,45,25,54,147
6,three,44,7,61,149
7,two,46,19,52,130
8,one,74,4,80,136
9,one,69,48,53,106


# Pivot Tables

In [24]:
# Create the example DataFrame
d = {
    'Name':['Alisa','Bobby','Cathrine','Alisa','Bobby','Cathrine',
            'Alisa','Bobby','Cathrine','Alisa','Bobby','Cathrine'],
    
    'Semester':['Semester 1','Semester 1','Semester 1','Semester 1','Semester 1','Semester 1',
            'Semester 2','Semester 2','Semester 2','Semester 2','Semester 2','Semester 2'],
     
    'Subject':['Mathematics','Mathematics','Mathematics','Science','Science','Science',
               'Mathematics','Mathematics','Mathematics','Science','Science','Science'],
    'Score':[62,47,55,74,31,77,85,63,42,67,89,81]}
 
df = pd.DataFrame(d)
df

,Name,Semester,Subject,Score
0,Alisa,Semester 1,Mathematics,62
1,Bobby,Semester 1,Mathematics,47
2,Cathrine,Semester 1,Mathematics,55
3,Alisa,Semester 1,Science,74
4,Bobby,Semester 1,Science,31
5,Cathrine,Semester 1,Science,77
6,Alisa,Semester 2,Mathematics,85
7,Bobby,Semester 2,Mathematics,63
8,Cathrine,Semester 2,Mathematics,42
9,Alisa,Semester 2,Science,67


## Hierarchical index

In [27]:
df.groupby(['Semester', 'Subject'])['Score'].mean().to_frame()

Semester,Subject,Score
Semester 1,Mathematics,54.666667
Semester 1,Science,60.666667
Semester 2,Mathematics,63.333333
Semester 2,Science,79.0


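## Pivot table

df.pivot_table(values, index, columns, aggfunc, margins)

Meaning:

Aggregates the data by one or more keys and arranges the results in a rectangular grid, with one set of group keys along the rows and another along the columns.

Parameters:

values: the column(s) to aggregate; the aggregated results fill the cells of the pivot table

index: the keys used as the row index of the pivot table

columns: the keys used as the column index of the pivot table

aggfunc: the aggregation function (mean by default); several functions may be given

margins: whether to add an All row and an All column holding the aggregate over all the data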
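
The parameters can be seen together in the sketch below, which uses a small hypothetical score table. It also builds the same grid with groupby followed by unstack, and shows that the All margins are computed from the raw rows rather than by averaging the cell means.

```python
import pandas as pd

# Hypothetical scores; (S1, Math) has two rows so that aggregation matters
toy = pd.DataFrame({
    "Semester": ["S1", "S1", "S1", "S2", "S2"],
    "Subject":  ["Math", "Math", "Sci", "Math", "Sci"],
    "Score":    [50, 70, 70, 80, 90],
})

# One cell per (Semester, Subject) pair, filled with the mean score
pivot = toy.pivot_table(values="Score", index="Semester",
                        columns="Subject", aggfunc="mean")

# The same grid built by grouping on both keys and unstacking one of them
same = toy.groupby(["Semester", "Subject"])["Score"].mean().unstack("Subject")

# margins=True appends an "All" row and column computed over the raw rows
with_totals = toy.pivot_table(values="Score", index="Semester",
                              columns="Subject", aggfunc="mean", margins=True)

print(pivot)
print(with_totals)
```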

In [28]:
df

,Name,Semester,Subject,Score
0,Alisa,Semester 1,Mathematics,62
1,Bobby,Semester 1,Mathematics,47
2,Cathrine,Semester 1,Mathematics,55
3,Alisa,Semester 1,Science,74
4,Bobby,Semester 1,Science,31
5,Cathrine,Semester 1,Science,77
6,Alisa,Semester 2,Mathematics,85
7,Bobby,Semester 2,Mathematics,63
8,Cathrine,Semester 2,Mathematics,42
9,Alisa,Semester 2,Science,67


In [29]:
df.pivot_table(values = 'Score', index = 'Semester', columns = 'Subject')

Subject,Mathematics,Science
Semester,,
Semester 1,54.666667,60.666667
Semester 2,63.333333,79.0


In [31]:
df.pivot_table(values = 'Score', index = 'Semester', columns = 'Subject', aggfunc = np.max)

Subject,Mathematics,Science
Semester,,
Semester 1,62,77
Semester 2,85,89


In [30]:
df.pivot_table(values = 'Score', index = 'Semester', columns = 'Subject', margins = True)

Subject,Mathematics,Science,All
Semester,,,
Semester 1,54.666667,60.666667,57.666667
Semester 2,63.333333,79.0,71.166667
All,59.0,69.833333,64.416667


**With margins=True, aggfunc is also applied across each row and down each column**

In [32]:
cars = pd.read_csv('cars.csv')
cars

,YEAR,Make,Model,Size,(kW),Unnamed: 5,TYPE,CITY (kWh/100 km),HWY (kWh/100 km),COMB (kWh/100 km),CITY (Le/100 km),HWY (Le/100 km),COMB (Le/100 km),(g/km),RATING,(km),TIME (h)
0,2012,MITSUBISHI,i-MiEV,SUBCOMPACT,49,A1,B,16.9,21.4,18.7,1.9,2.4,2.1,0,,100,7
1,2012,NISSAN,LEAF,MID-SIZE,80,A1,B,19.3,23.0,21.1,2.2,2.6,2.4,0,,117,7
2,2013,FORD,FOCUS ELECTRIC,COMPACT,107,A1,B,19.0,21.1,20.0,2.1,2.4,2.2,0,,122,4
3,2013,MITSUBISHI,i-MiEV,SUBCOMPACT,49,A1,B,16.9,21.4,18.7,1.9,2.4,2.1,0,,100,7
4,2013,NISSAN,LEAF,MID-SIZE,80,A1,B,19.3,23.0,21.1,2.2,2.6,2.4,0,,117,7
5,2013,SMART,FORTWO ELECTRIC DRIVE CABRIOLET,TWO-SEATER,35,A1,B,17.2,22.5,19.6,1.9,2.5,2.2,0,,109,8
6,2013,SMART,FORTWO ELECTRIC DRIVE COUPE,TWO-SEATER,35,A1,B,17.2,22.5,19.6,1.9,2.5,2.2,0,,109,8
7,2013,TESLA,MODEL S (40 kWh battery),FULL-SIZE,270,A1,B,22.4,21.9,22.2,2.5,2.5,2.5,0,,224,6
8,2013,TESLA,MODEL S (60 kWh battery),FULL-SIZE,270,A1,B,22.2,21.7,21.9,2.5,2.4,2.5,0,,335,10
9,2013,TESLA,MODEL S (85 kWh battery),FULL-SIZE,270,A1,B,23.8,23.2,23.6,2.7,2.6,2.6,0,,426,12


In [35]:
cars.pivot_table(values = '(kW)',index = 'YEAR', columns = 'Make')

Make,BMW,CHEVROLET,FORD,KIA,MITSUBISHI,NISSAN,SMART,TESLA
YEAR,,,,,,,,
2012,,,,,49.0,80.0,,
2013,,,107.0,,49.0,80.0,35.0,280.0
2014,,104.0,107.0,,49.0,80.0,35.0,268.333333
2015,125.0,104.0,107.0,81.0,49.0,80.0,35.0,320.666667
2016,125.0,104.0,107.0,81.0,49.0,80.0,35.0,409.7


In [36]:
data = pd.DataFrame({'key1':['a', 'c', 'b', 'a', 'b','b','c','a','b','c'],

     'key2':['one', 'three', 'two', 'two', 'one','three','three','two','one','one'],

     'key3':list("ABCBACBACB"),

     'data_random':np.random.randint(1,10,size=10)})

data.pivot_table(index=['key1','key2'],columns='key3')

,,data_random,data_random,data_random
,key3,A,B,C
key1,key2,,,
a,one,2.0,,
a,two,3.0,7.0,
b,one,7.0,,2.0
b,three,,,5.0
b,two,,,6.0
c,one,,4.0,
c,three,,4.0,


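# Data Wrangling

## pd.concat( objs, axis )

Meaning:

Concatenates several data objects along the given axis.

Parameters:

objs: the data objects to combine, for example a list of DataFrames

axis: 0 concatenates along the index (vertically); 1 concatenates along the columns (horizontally)

**Vertical concatenation adds rows; horizontal concatenation adds columns**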
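
A compact sketch of both axes on two hypothetical frames that overlap only partly in labels and columns. The join="inner" and ignore_index=True arguments are not covered in the notes above; they are included as optional extras of pd.concat.

```python
import pandas as pd

left = pd.DataFrame({"A": [1, 2], "B": [3, 4]}, index=[0, 1])
right = pd.DataFrame({"B": [5, 6], "C": [7, 8]}, index=[1, 2])

# axis=0: rows are appended; the columns are the union A, B, C, with NaN where missing
stacked = pd.concat([left, right], axis=0)

# axis=1: columns are placed side by side; the rows are the union of labels 0, 1, 2
side_by_side = pd.concat([left, right], axis=1)

# join="inner" keeps only the shared column B; ignore_index renumbers the rows
common_cols = pd.concat([left, right], axis=0, join="inner", ignore_index=True)

print(stacked)
print(side_by_side)
print(common_cols)
```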

In [37]:
# Create the DataFrames
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']},
                    index=[0, 1, 2, 3])

df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
                    'B': ['B4', 'B5', 'B6', 'B7'],
                    'C': ['C4', 'C5', 'C6', 'C7'],
                    'D': ['D4', 'D5', 'D6', 'D7']},
                    index=[4, 5, 6, 7])

df3 = pd.DataFrame({'C': ['A8', 'A9', 'A10', 'A11'],
                    'D': ['B8', 'B9', 'B10', 'B11'],
                    'E': ['C8', 'C9', 'C10', 'C11'],
                    'F': ['D8', 'D9', 'D10', 'D11']},
                    index=[0, 1, 2, 3])

In [39]:
# Vertical concatenation
pd.concat([df1,df2],axis = 0)

,A,B,C,D
0,A0,B0,C0,D0
1,A1,B1,C1,D1
2,A2,B2,C2,D2
3,A3,B3,C3,D3
4,A4,B4,C4,D4
5,A5,B5,C5,D5
6,A6,B6,C6,D6
7,A7,B7,C7,D7


In [41]:
# Vertical concatenation; columns missing from one frame are added and filled with NaN
pd.concat([df1,df3],axis = 0)

,A,B,C,D,E,F
0,A0,B0,C0,D0,,
1,A1,B1,C1,D1,,
2,A2,B2,C2,D2,,
3,A3,B3,C3,D3,,
0,,,A8,B8,C8,D8
1,,,A9,B9,C9,D9
2,,,A10,B10,C10,D10
3,,,A11,B11,C11,D11


In [42]:
df1 = pd.DataFrame({"score1":np.random.randint(1,5,size=5),

                   "score2":np.random.randint(5,10,size=5),

                   "score3":np.random.randint(10,15,size=5)},

                  index=[1,2,3,4,5])

df2 = pd.DataFrame({"score1":np.random.randint(1,5,size=3),

                   "score2":np.random.randint(5,10,size=3),

                   "score3":np.random.randint(10,15,size=3)},

                  index=[4,5,6])

pd.concat([df1,df2],axis=1)

,score1,score2,score3,score1,score2,score3
1,4.0,5.0,13.0,,,
2,3.0,9.0,11.0,,,
3,4.0,9.0,13.0,,,
4,3.0,6.0,12.0,3.0,8.0,11.0
5,2.0,9.0,14.0,3.0,5.0,14.0
6,,,,4.0,8.0,14.0


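## Merge

merge joins the rows of different DataFrames on one or more keys

merge(how, on, suffixes)

Parameters:

how: the join type; the common ones are inner, left, right, and outer (full outer join)

on: the shared column name(s) on which the DataFrames are joined, playing the role of a "foreign key"

(1) on names the join key explicitly;

(2) left_on names the key in the left DataFrame and right_on the key in the right one; they are mostly used when the two columns hold the same kind of data under different names;

(3) to join on the index, pass left_index=True or right_index=True

suffixes: handles duplicate column names. If the DataFrames still share a column name besides the join key, the two columns get the suffixes _x and _y by default; custom suffixes can be given instead.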
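
The sketch below combines left_on/right_on with custom suffixes on two hypothetical frames whose key columns are named differently and which share a non-key column city. All names and values are invented for illustration.

```python
import pandas as pd

staff = pd.DataFrame({"name": ["Ann", "Bob"],
                      "city": ["Tianjin", "Beijing"]})
students = pd.DataFrame({"student_name": ["Ann", "Cid"],
                         "city": ["Tianjin", "Guangzhou"]})

# The key columns have different names, so each side names its own key.
# Both frames also contain "city", which receives the custom suffixes.
merged = pd.merge(staff, students, how="outer",
                  left_on="name", right_on="student_name",
                  suffixes=("_staff", "_student"))

print(merged)
```

Because the join is an outer join, Bob and Cid each appear once with NaN in the columns that come from the other frame.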

In [43]:
# Create the DataFrames (姓名 = name, 部门 = department, 专业 = major)
staff_df = pd.DataFrame([{'姓名': '张三', '部门': '研发部'},
                        {'姓名': '李四', '部门': '财务部'},
                        {'姓名': '赵六', '部门': '市场部'}])


student_df = pd.DataFrame([{'姓名': '张三', '专业': '计算机'},
                        {'姓名': '李四', '专业': '会计'},
                        {'姓名': '王五', '专业': '市场营销'}])

### Outer join

In [44]:
pd.merge(staff_df,student_df, how = 'outer', on = '姓名')

,姓名,部门,专业
0,张三,研发部,计算机
1,李四,财务部,会计
2,赵六,市场部,
3,王五,,市场营销


In [45]:
staff_df.merge(student_df, how = 'outer', on = '姓名')

,姓名,部门,专业
0,张三,研发部,计算机
1,李四,财务部,会计
2,赵六,市场部,
3,王五,,市场营销


### Inner join

In [50]:
pd.merge(staff_df,student_df, how = 'inner', on = '姓名')

,姓名,部门,专业
0,张三,研发部,计算机
1,李四,财务部,会计


### Right join

In [52]:
pd.merge(staff_df,student_df, how = 'right', on = '姓名')

,姓名,部门,专业
0,张三,研发部,计算机
1,李四,财务部,会计
2,王五,,市场营销


### Left join

In [53]:
pd.merge(staff_df,student_df, how = 'left', on = '姓名')

,姓名,部门,专业
0,张三,研发部,计算机
1,李四,财务部,会计
2,赵六,市场部,


In [54]:
# Add a new column (地址 = address)
staff_df['地址'] = ['天津', '北京', '上海']
student_df['地址'] = ['天津', '上海', '广州']

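### Handling duplicate column names

In [55]:
# Handle duplicate column names
# When both frames contain a column with the same name that is not a join key, merge appends suffixes to tell them apart
pd.merge(staff_df, student_df, how='left', left_on='姓名', right_on='姓名')

,姓名,部门,地址_x,专业,地址_y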
0,张三,研发部,天津,计算机,天津
1,李四,财务部,北京,会计,上海
2,赵六,市场部,上海,,


### Merging on the index

In [56]:
# Set "姓名" as the index
staff_df.set_index('姓名', inplace=True)
student_df.set_index('姓名', inplace=True)

In [57]:
pd.merge(staff_df, student_df, how='left', left_index=True, right_index=True)
# or
# staff_df.merge(student_df, how='left', left_index=True, right_index=True)

姓名,部门,地址_x,专业,地址_y
张三,研发部,天津,计算机,天津
李四,财务部,北京,会计,上海
赵六,市场部,上海,,


### Converting the index to a column

In [58]:
staff_df

姓名,部门,地址
张三,研发部,天津
李四,财务部,北京
赵六,市场部,上海


In [60]:
staff_df.reset_index() # moves every index level back into ordinary columns

,姓名,部门,地址
0,张三,研发部,天津
1,李四,财务部,北京
2,赵六,市场部,上海


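# Stacking Data: stack & unstack

Reshaping data: stack and unstack

stack() and unstack() are designed for hierarchically indexed objects. They only rearrange how the data is laid out and do not aggregate it.

1> stack(level=n): pivots a level of the column index into the row index. n is the index level; the default is -1, the innermost level, which can be written as -1 or (number of levels - 1). Levels are numbered from the outside in: 0, 1, ..., -1

2> unstack(level=n): pivots a level of the row index into the column index.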
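
A minimal sketch of the level argument on a hypothetical 2 x 4 frame with two column levels. Recent pandas releases may print a FutureWarning about a new stack implementation; the result for this dense example is the same either way.

```python
import pandas as pd

cols = pd.MultiIndex.from_product([["S1", "S2"], ["Math", "Sci"]])
wide = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8]],
                    index=["Ann", "Bob"], columns=cols)

long = wide.stack()          # innermost column level (Math/Sci) moves into the row index
outer = wide.stack(level=0)  # the outer column level (S1/S2) moves instead
back = long.unstack()        # innermost row level moves back into the columns

print(long)
print(outer)
print(back)
```

The number of values is the same before and after reshaping, which is what "no aggregation" means here.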

In [61]:
# Create the DataFrame
header = pd.MultiIndex.from_product([['Semester1','Semester2'],['Maths','Science']])
d = [[12,45,67,56],[78,89,45,67],[45,67,89,90],[67,44,56,55]]
 
df = pd.DataFrame(d, index=['Alisa','Bobby','Cathrine','Jack'], columns=header)
df

,Semester1,Semester1,Semester2,Semester2
,Maths,Science,Maths,Science
Alisa,12,45,67,56
Bobby,78,89,45,67
Cathrine,45,67,89,90
Jack,67,44,56,55


In [63]:
header

MultiIndex([('Semester1',   'Maths'),
            ('Semester1', 'Science'),
            ('Semester2',   'Maths'),
            ('Semester2', 'Science')],
           )

In [62]:
df.stack()

,,Semester1,Semester2
Alisa,Maths,12,67
Alisa,Science,45,56
Bobby,Maths,78,45
Bobby,Science,89,67
Cathrine,Maths,45,89
Cathrine,Science,67,90
Jack,Maths,67,56
Jack,Science,44,55


In [65]:
df = pd.DataFrame(np.random.randint(0,150,size=(4,4)),

               index = ['第一季度','第二季度','第三季度','第四季度'],

               columns=[['python','python','机器学习','机器学习'],['初级','高阶','初级','高阶']])

stacked_df=df.stack()
stacked_df

,,python,机器学习
第一季度,初级,116,136
第一季度,高阶,27,111
第二季度,初级,26,97
第二季度,高阶,18,44
第三季度,初级,133,51
第三季度,高阶,118,61
第四季度,初级,143,56
第四季度,高阶,27,132


In [66]:
stacked_df.unstack(0)

,python,python,python,python,机器学习,机器学习,机器学习,机器学习
,第一季度,第三季度,第二季度,第四季度,第一季度,第三季度,第二季度,第四季度
初级,116,133,26,143,136,51,97,56
高阶,27,118,18,27,111,61,44,132


In [70]:
stacked_sl_df = stacked_df.swaplevel()
stacked_sl_df

,,python,机器学习
初级,第一季度,116,136
高阶,第一季度,27,111
初级,第二季度,26,97
高阶,第二季度,18,44
初级,第三季度,133,51
高阶,第三季度,118,61
初级,第四季度,143,56
高阶,第四季度,27,132
