# How to Merge DataFrames in Pandas

Pandas merge is the equivalent of a SQL join: it combines different tables into a single table by matching on a key.

merge syntax:
- pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)
- left, right: the DataFrames, or named Series, to merge
- how: the join type, one of 'left', 'right', 'outer', 'inner'
- on: the join key; both left and right must contain it
- left_on: the key column(s) of the left DataFrame or Series
- right_on: the key column(s) of the right DataFrame or Series
- left_index, right_index: join on the index instead of an ordinary column
- suffixes: a pair of suffixes that are appended automatically when column names overlap; the default is ('_x', '_y')
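
The key-selection parameters above can be sketched with two tiny tables. The tables `orders` and `customers` and all of their column names are made up for illustration. The first merge uses left_on/right_on because the key columns have different names; the second matches against the right table's index instead.

```python
import pandas as pd

# Hypothetical tables: the key is "cust_id" on the left and "id" on the right.
orders = pd.DataFrame({"order_id": [1, 2, 3], "cust_id": [10, 20, 10]})
customers = pd.DataFrame({"id": [10, 20], "name": ["Ann", "Bob"]})

# left_on / right_on: the key columns have different names, so both are kept.
by_column = pd.merge(orders, customers, left_on="cust_id", right_on="id")

# right_index=True: match the left column against the right table's index.
customers_indexed = customers.set_index("id")
by_index = pd.merge(orders, customers_indexed, left_on="cust_id", right_index=True)

print(by_column.columns.tolist())
print(by_index.columns.tolist())
```

Joining on the index avoids carrying a second key column in the result, which is why `by_index` has one column fewer than `by_column`.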

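Documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html

Outline of this walkthrough:
- 1. A join example on the movie dataset
- 2. Understanding one-to-one, one-to-many, and many-to-many row alignment in merge
- 3. Understanding the differences between left join, right join, inner join, and outer join
- 4. What to do when non-key columns share the same name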

Dataset: https://grouplens.org/datasets/movielens/

## 1. A join example on the movie dataset

In [3]:
import pandas as pd

In [18]:
# Skip the file's own header row (skiprows=1) and supply the column names explicitly.
df_ratings = pd.read_csv('./movies/ratings.csv',
                         header=None,
                         skiprows=1,
                         names="UserID::MovieID::Rating::Timestamp".split("::")
                        )

In [19]:
df_ratings.head()

Unnamed: 0,UserID,MovieID,Rating,Timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [21]:
df_users = pd.read_excel('./movies/users.xlsx',
                      header=None,
                      skiprows=1,
                      names='UserID::Age::Gender::Occupation::Zip-code'.split('::')
                      )

In [22]:
df_users.head()

Unnamed: 0,UserID,Age,Gender,Occupation,Zip-code
0,1,24,M,technician,85711
1,2,53,F,other,94043
2,3,23,M,writer,32067
3,4,24,M,technician,43537
4,5,33,F,other,15213


In [43]:
df_movies = pd.read_csv("./movies/movies.csv",
                        header=None,
                        skiprows=1,
                        names=["movield", "title", "genres"])

In [44]:
df_movies.head()

Unnamed: 0,movield,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [45]:
# Use an inner join, which keeps only the rows whose key matches in both DataFrames.
df_ratings_users = pd.merge(
    df_ratings, df_users, left_on='UserID', right_on='UserID', how='inner'
)

In [46]:
df_ratings_users.head()

Unnamed: 0,UserID,MovieID,Rating,Timestamp,Age,Gender,Occupation,Zip-code
0,1,1,4.0,964982703,24,M,technician,85711
1,1,3,4.0,964981247,24,M,technician,85711
2,1,6,4.0,964982224,24,M,technician,85711
3,1,47,5.0,964983815,24,M,technician,85711
4,1,50,5.0,964982931,24,M,technician,85711


In [50]:
df_ratings_users_movies = pd.merge(
    df_ratings_users, df_movies, left_on='MovieID', right_on='movield', how='inner'
)

In [52]:
df_ratings_users_movies.head()

Unnamed: 0,UserID,MovieID,Rating,Timestamp,Age,Gender,Occupation,Zip-code,movield,title,genres
0,1,1,4.0,964982703,24,M,technician,85711,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,1,3,4.0,964981247,24,M,technician,85711,3,Grumpier Old Men (1995),Comedy|Romance
2,1,6,4.0,964982224,24,M,technician,85711,6,Heat (1995),Action|Crime|Thriller
3,1,47,5.0,964983815,24,M,technician,85711,47,Seven (a.k.a. Se7en) (1995),Mystery|Thriller
4,1,50,5.0,964982931,24,M,technician,85711,50,"Usual Suspects, The (1995)",Crime|Mystery|Thriller
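
Because the two key columns have different names (MovieID and movield), both survive the merge and carry identical values. The following is a small sketch with made-up rows that only mimic the column layout above, assuming you want to keep just one of the two key columns.

```python
import pandas as pd

# Made-up rows that mimic the column layout above; not the real MovieLens data.
ratings = pd.DataFrame({"UserID": [1, 1, 2],
                        "MovieID": [1, 3, 1],
                        "Rating": [4.0, 4.0, 5.0]})
movies = pd.DataFrame({"movield": [1, 3],
                       "title": ["Toy Story (1995)", "Grumpier Old Men (1995)"]})

merged = pd.merge(ratings, movies, left_on="MovieID", right_on="movield", how="inner")

# After an inner join the two key columns hold the same values,
# so one of them is redundant and can be dropped.
keys_identical = bool((merged["MovieID"] == merged["movield"]).all())
merged = merged.drop(columns="movield")

print(keys_identical, merged.columns.tolist())
```

An alternative with the same effect is to rename the right-hand key to match the left one before merging and then pass a single `on=` key.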


## 2. Understanding row-count alignment in merge

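Make sure you understand the following relationships:
- one-to-one: the join key is unique on both sides, e.g. (student ID, name) merge (student ID, age). Result row count: 1*1
- one-to-many: the key is unique on the left but not on the right, e.g. (student ID, name) merge (student ID, [Chinese score, math score, English score]). Result row count: 1*N
- many-to-many: the key is unique on neither side, e.g. (student ID, [Chinese score, math score, English score]) merge (student ID, [basketball, football, table tennis]). Result row count: M*N

These counts are per key value: a key that appears M times on the left and N times on the right contributes M*N rows to the result.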
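
As a quick check of these row counts, the sketch below predicts the size of an inner merge from the key frequencies on each side and compares it with the actual result. The two small frames and their column names are invented for illustration.

```python
import pandas as pd

# Illustrative data: key 11 appears 2x on the left and 3x on the right.
left = pd.DataFrame({"id": [11, 11, 12], "hobby": ["a", "b", "c"]})
right = pd.DataFrame({"id": [11, 11, 11, 12, 13], "score": [1, 2, 3, 4, 5]})

# Count how often each key occurs on each side.
left_counts = left["id"].value_counts()
right_counts = right["id"].value_counts()

# Multiplying aligns on the key. Keys present on only one side give NaN,
# which sum() skips, so they contribute 0 rows, as in an inner join.
expected_rows = int((left_counts * right_counts).sum())

merged = pd.merge(left, right, on="id")
print(expected_rows, len(merged))
```

Running this kind of check before merging large tables can reveal an unexpected many-to-many blow-up early.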

### 2.1 One-to-one merge

In [56]:
left = pd.DataFrame({'id':[11,12,13,14],'name':['na','nb','nc','nd']})
left

Unnamed: 0,id,name
0,11,na
1,12,nb
2,13,nc
3,14,nd


In [57]:
right = pd.DataFrame({'id':[11,12,13,14],'age':[22,23,24,25]})
right

Unnamed: 0,id,age
0,11,22
1,12,23
2,13,24
3,14,25


In [58]:
pd.merge(left,right,on='id')

Unnamed: 0,id,name,age
0,11,na,22
1,12,nb,23
2,13,nc,24
3,14,nd,25


### 2.2 One-to-many merge

In [60]:
left = pd.DataFrame({'id':[11,12,13,14],'name':['na','nb','nc','nd']})
left

Unnamed: 0,id,name
0,11,na
1,12,nb
2,13,nc
3,14,nd


In [61]:
# Each grade value is a subject plus a score: 语文 = Chinese, 数学 = math, 英语 = English.
right = pd.DataFrame({'id':[11,11,11,12,12,13],'grade':['语文99','数学90','英语1','语文89','数学2','英语99']})
right

Unnamed: 0,id,grade
0,11,语文99
1,11,数学90
2,11,英语1
3,12,语文89
4,12,数学2
5,13,英语99


In [63]:
pd.merge(left,right,on='id')

Unnamed: 0,id,name,grade
0,11,na,语文99
1,11,na,数学90
2,11,na,英语1
3,12,nb,语文89
4,12,nb,数学2
5,13,nc,英语99
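
The `validate` parameter from the signature above can assert the relationship you expect. The sketch below reuses the shape of this example with placeholder grade values: the one-to-many check passes, while a one-to-one check raises `pd.errors.MergeError` because the right side repeats keys.

```python
import pandas as pd

left = pd.DataFrame({"id": [11, 12, 13, 14], "name": ["na", "nb", "nc", "nd"]})
# Placeholder grade values; only the repeated ids matter here.
right = pd.DataFrame({"id": [11, 11, 11, 12, 12, 13],
                      "grade": ["g1", "g2", "g3", "g4", "g5", "g6"]})

# Passes: the key is unique on the left and repeated on the right.
checked = pd.merge(left, right, on="id", validate="one_to_many")

# Fails: the right side has duplicate keys, so this is not one-to-one.
try:
    pd.merge(left, right, on="id", validate="one_to_one")
    one_to_one_ok = True
except pd.errors.MergeError:
    one_to_one_ok = False

print(len(checked), one_to_one_ok)
```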


### 2.3 Many-to-many merge

Note: the row counts multiply here. A key that appears n times on the left and m times on the right yields n*m rows.

In [65]:
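# 爱好 means hobby; the values are abbreviated sports, e.g. 足 = football, 羽毛 = badminton, 乒乓 = table tennis (蓝 presumably stands for basketball).
left = pd.DataFrame({'id':[11,11,12,12,12],'爱好':['蓝','足','羽毛','乒乓','足']})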
left

Unnamed: 0,id,爱好
0,11,蓝
1,11,足
2,12,羽毛
3,12,乒乓
4,12,足


In [67]:
right = pd.DataFrame({'id':[11,11,11,12,12,13],'grade':['语文99','数学90','英语1','语文89','数学2','英语99']})
right

Unnamed: 0,id,grade
0,11,语文99
1,11,数学90
2,11,英语1
3,12,语文89
4,12,数学2
5,13,英语99


In [68]:
pd.merge(left,right,on='id')

Unnamed: 0,id,爱好,grade
0,11,蓝,语文99
1,11,蓝,数学90
2,11,蓝,英语1
3,11,足,语文99
4,11,足,数学90
5,11,足,英语1
6,12,羽毛,语文89
7,12,羽毛,数学2
8,12,乒乓,语文89
9,12,乒乓,数学2
10,12,足,语文89
11,12,足,数学2


## 3. Understanding the differences between left join, right join, inner join, and outer join

In [69]:
left = pd.DataFrame({'key': ['K0','K1','K2','K3'],
                      'A': ['A0','A1','A2','A3'],
                      'B': ['B0','B1','B2','B3']})
right = pd.DataFrame({'key': ['K0','K1','K4','K5'],
                      'C': ['C0','C1','C4','C5'],
                      'D': ['D0','D1','D4','D5']})

In [70]:
left

Unnamed: 0,key,A,B
0,K0,A0,B0
1,K1,A1,B1
2,K2,A2,B2
3,K3,A3,B3


In [71]:
right

Unnamed: 0,key,C,D
0,K0,C0,D0
1,K1,C1,D1
2,K4,C4,D4
3,K5,C5,D5


### 3.1 inner join (the default)

A key appears in the result only if it exists on both the left and the right.

In [72]:
pd.merge(left,right,how='inner')

Unnamed: 0,key,A,B,C,D
0,K0,A0,B0,C0,D0
1,K1,A1,B1,C1,D1


### 3.2 left join

Every row from the left appears in the result; where the right has no match, its columns are NaN (null).

In [73]:
pd.merge(left,right,how='left')

Unnamed: 0,key,A,B,C,D
0,K0,A0,B0,C0,D0
1,K1,A1,B1,C1,D1
2,K2,A2,B2,,
3,K3,A3,B3,,
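
One side effect of these missing values is worth a small sketch: when a default integer column picks up NaN for unmatched rows, pandas stores it as float64. The frames and column names below are invented for illustration.

```python
import pandas as pd

left = pd.DataFrame({"key": ["K0", "K1", "K2"], "a": [1, 2, 3]})
right = pd.DataFrame({"key": ["K0", "K1"], "c": [10, 11]})

merged = pd.merge(left, right, how="left", on="key")

# "c" held integers in right; the unmatched K2 row introduces NaN,
# so the merged column is stored as float64.
print(right["c"].dtype, merged["c"].dtype)
print(merged["c"].isna().sum())
```

If integers are needed afterwards, one option is to fill the gaps first, for example `merged["c"].fillna(0).astype(int)`, assuming 0 is an acceptable placeholder for your data.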


### 3.3 right join

Every row from the right appears in the result; where the left has no match, its columns are NaN (null).

In [75]:
pd.merge(left,right,how='right')

Unnamed: 0,key,A,B,C,D
0,K0,A0,B0,C0,D0
1,K1,A1,B1,C1,D1
2,K4,,,C4,D4
3,K5,,,C5,D5


### 3.4 outer join

Rows from both the left and the right appear in the result; wherever one side has no match, its columns are NaN (null).

In [76]:
pd.merge(left,right,how='outer')

Unnamed: 0,key,A,B,C,D
0,K0,A0,B0,C0,D0
1,K1,A1,B1,C1,D1
2,K2,A2,B2,,
3,K3,A3,B3,,
4,K4,,,C4,D4
5,K5,,,C5,D5
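
To see which side each row of an outer join came from, the `indicator` parameter listed in the signature can help. The sketch below uses a trimmed copy of the frames above and filters for keys found only on the left.

```python
import pandas as pd

left = pd.DataFrame({"key": ["K0", "K1", "K2", "K3"], "A": ["A0", "A1", "A2", "A3"]})
right = pd.DataFrame({"key": ["K0", "K1", "K4", "K5"], "C": ["C0", "C1", "C4", "C5"]})

# indicator=True adds a "_merge" column: "both", "left_only", or "right_only".
merged = pd.merge(left, right, how="outer", on="key", indicator=True)

# Keep only the rows whose key exists on the left but not on the right.
left_only = merged[merged["_merge"] == "left_only"]
print(sorted(left_only["key"]))
```

Filtering on `_merge` in this way is a common idiom for an anti-join, that is, finding rows in one table that have no partner in the other.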


## 4. What if non-key columns share the same name?

In [77]:
left = pd.DataFrame({'key': ['K0','K1','K2','K3'],
                      'A': ['A0','A1','A2','A3'],
                      'B': ['B0','B1','B2','B3']})
right = pd.DataFrame({'key': ['K0','K1','K4','K5'],
                      'A': ['A10','A11','A14','A15'],
                      'D': ['D0','D1','D4','D5']})

In [78]:
left

Unnamed: 0,key,A,B
0,K0,A0,B0
1,K1,A1,B1
2,K2,A2,B2
3,K3,A3,B3


In [79]:
right

Unnamed: 0,key,A,D
0,K0,A10,D0
1,K1,A11,D1
2,K4,A14,D4
3,K5,A15,D5


In [80]:
pd.merge(left,right,on='key')

Unnamed: 0,key,A_x,B,A_y,D
0,K0,A0,B0,A10,D0
1,K1,A1,B1,A11,D1


In [81]:
pd.merge(left,right,on='key',suffixes=('_left','_right'))

Unnamed: 0,key,A_left,B,A_right,D
0,K0,A0,B0,A10,D0
1,K1,A1,B1,A11,D1


In [82]:
pd.merge(left,right,how='left',on='key',suffixes=('_left','_right'))

Unnamed: 0,key,A_left,B,A_right,D
0,K0,A0,B0,A10,D0
1,K1,A1,B1,A11,D1
2,K2,A2,B2,,
3,K3,A3,B3,,
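
Suffixes are not the only way to deal with a name clash. As an alternative sketch, assuming you prefer to keep the left column name unchanged, you can rename the clashing column on one side before merging; the new name `A_right` is just an illustrative choice.

```python
import pandas as pd

left = pd.DataFrame({"key": ["K0", "K1"], "A": ["A0", "A1"], "B": ["B0", "B1"]})
right = pd.DataFrame({"key": ["K0", "K1"], "A": ["A10", "A11"], "D": ["D0", "D1"]})

# Rename the clashing column on one side so that no suffix is needed.
right_renamed = right.rename(columns={"A": "A_right"})
merged = pd.merge(left, right_renamed, on="key")

print(merged.columns.tolist())
```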
