In this lecture we're going to address how you can bring multiple dataframe objects together, either by
merging them horizontally, or by concatenating them vertically. Before we jump into the code, we need to
address a little relational theory and to get some language conventions down. I'm going to bring in an image
to help explain some concepts.

![Venn Diagram](merging1.png)


Ok, this is a Venn diagram. A Venn diagram is traditionally used to show set membership. For example, the
circle on the left is the population of students at a university. The circle on the right is the population
of staff at a university. And the overlapping region in the middle is all of those students who are also
staff. Maybe these students run tutorials for a course, or grade assignments, or engage in running research
experiments.

So, this diagram shows two populations whom we might have data about, but there is overlap between those 
populations.

When it comes to translating this to pandas, we can think of the case where we might have these two 
populations as indices in separate DataFrames, maybe with the label of Person Name. When we want to join the
DataFrames together, we have some choices to make. First, what if we want a list of all the people regardless
of whether they're staff or student, and all of the information we can get on them? In database terminology,
this is called a full outer join. And in set theory, it's called a union. In the Venn diagram, it represents
everyone in any circle.

Here's an image of what that would look like in the Venn diagram.

![Union](merging2.png)

It's quite possible though that we only want those people who we have maximum information for, those people
who are both staff and students. Maybe being a staff member and a student involves getting a tuition waiver,
and we want to calculate the cost of this. In database terminology, this is called an inner join. Or in set
theory, the intersection. It is represented in the Venn diagram as the overlapping parts of each circle.

Here's what that looks like: ![Intersection](merging3.png)
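To make the set language concrete before we touch `merge`, here is a minimal sketch that treats the two populations as pandas `Index` objects. The names are placeholders chosen only for illustration; `Index.union` and `Index.intersection` stand in for the two regions of the Venn diagram.

```python
import pandas as pd

# Placeholder populations: the two circles of the Venn diagram
staff = pd.Index(['Mehmet', 'Nihan', 'Semih'], name='Name')
students = pd.Index(['Nihan', 'Semih', 'Mert'], name='Name')

# Union: everyone in either circle (the rows a full outer join keeps)
everyone = staff.union(students)

# Intersection: only people in both circles (the rows an inner join keeps)
both = staff.intersection(students)

print(list(everyone))
print(list(both))
```

A join does more than this, since it also brings along the columns from each side, but which rows survive is decided by exactly this kind of set operation on the keys.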


In [None]:
# With that background, let's see an example of how we would do this in pandas, where we would use the merge
# function.
import pandas as pd

# First we create two DataFrames, staff and students.
staff_df = pd.DataFrame([{'Name': 'Mehmet', 'Role': 'Instructor'},
                         {'Name': 'Nihan', 'Role': 'Chair'},
                         {'Name': 'Semih', 'Role': 'Vice Chair'}])
# And let's index these staff by name
staff_df = staff_df.set_index('Name')
# Now we'll create a student dataframe
student_df = pd.DataFrame([ {'Name': 'Nihan', 'School': 'ITU'},
                            {'Name': 'Semih', 'School': 'YTU'},
                            {'Name': 'Mert', 'School': 'ODTU'}
                          ])
# And we'll index this by name too
student_df = student_df.set_index('Name')
staff_df.head()


In [None]:
# And let's just print out the student DataFrame as well

student_df.head()

### Please note the indices. They are the names, i.e., the names form the index of each DataFrame.

In [None]:
# There's some overlap in these DataFrames in that Nihan and Semih are both students and staff,
# but Mehmet and Mert are not.
# Importantly, both DataFrames are indexed along the value we want to merge them on,
# which is called Name.

In [None]:
# If we want the union of these, we call merge(), passing in the DataFrame on the left and the DataFrame
# on the right and telling merge that we want it to use an outer join.
# Recall that in database terminology this is called a full outer join, and in set theory a union.
# In the Venn diagram, it represents everyone in any circle.
# We want to use the left and right indices as the joining keys.

pd.merge(staff_df, student_df, how='outer', left_index=True, right_index=True) #Outer = union

In [None]:
# We see in the resulting DataFrame that everyone is listed. And since Mert does not have a role, and Mehmet
# does not have a school, those cells are listed as missing values (NaN).

# If we wanted to get the intersection, that is, just those who are both a student AND staff, we could set the
# how parameter to 'inner'. Again, we set both left_index and right_index to True to join on the indices.


pd.merge(staff_df, student_df, how='inner', left_index=True, right_index=True) #inner = intersection
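One way to see how the outer and inner joins relate is the `indicator` parameter of `pd.merge`, which adds a `_merge` column recording where each row came from. The sketch below rebuilds the two small DataFrames so it runs on its own; the variable names `merged` and `in_both` are just for illustration.

```python
import pandas as pd

staff_df = pd.DataFrame({'Name': ['Mehmet', 'Nihan', 'Semih'],
                         'Role': ['Instructor', 'Chair', 'Vice Chair']}).set_index('Name')
student_df = pd.DataFrame({'Name': ['Nihan', 'Semih', 'Mert'],
                           'School': ['ITU', 'YTU', 'ODTU']}).set_index('Name')

# indicator=True adds a _merge column: 'left_only', 'right_only' or 'both'
merged = pd.merge(staff_df, student_df, how='outer',
                  left_index=True, right_index=True, indicator=True)
print(merged)

# The rows flagged 'both' are the ones an inner join would keep
in_both = merged[merged['_merge'] == 'both']
print(in_both)
```

This can be a handy sanity check when a join returns more or fewer rows than expected.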

In [None]:
# And we see the resulting DataFrame has only Nihan and Semih in it.
# Now there are two other common use cases
# when merging DataFrames, and both are examples of what we would call set addition. The first is when we
# would want to get a list of all staff regardless of whether they were students or not. But if they were
# students, we would want to get their student details as well. To do this we would use a left join. It is
# important to note the order of dataframes in this function: the first dataframe is the left dataframe and
# the second is the right

pd.merge(staff_df, student_df, how='left', left_index=True, right_index=True) 
# Keep every row of the left DataFrame and fetch matching columns from the right one where they exist

In [None]:
# You could probably guess what comes next. We want a list of all of the students and their roles if they were
# also staff. To do this we would do a right join.
pd.merge(staff_df, student_df, how='right', left_index=True, right_index=True)

# Keep every row of the right DataFrame and fetch matching columns from the left one where they exist
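When both DataFrames are already indexed by the key, the `DataFrame.join` method is a shorter way to write the same index-on-index merge. The sketch below rebuilds the sample data so it is self-contained, and the names `left_via_merge` and `left_via_join` are only for illustration.

```python
import pandas as pd

staff_df = pd.DataFrame({'Name': ['Mehmet', 'Nihan', 'Semih'],
                         'Role': ['Instructor', 'Chair', 'Vice Chair']}).set_index('Name')
student_df = pd.DataFrame({'Name': ['Nihan', 'Semih', 'Mert'],
                           'School': ['ITU', 'YTU', 'ODTU']}).set_index('Name')

# Left join written with merge, as in the cells above
left_via_merge = pd.merge(staff_df, student_df, how='left',
                          left_index=True, right_index=True)

# The same left join written with join, which uses the indices by default
left_via_join = staff_df.join(student_df, how='left')

print(left_via_join)
print(left_via_merge.equals(left_via_join))
```

Which spelling to use is mostly a matter of taste; `merge` is the more general of the two because it can also join on columns.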


In [None]:
# We can also do it another way. The merge function has a couple of other interesting parameters. First, you
# don't need to use indices to join on, you can use columns as well. Here we have a parameter called "on",
# and we can assign to it a column that both DataFrames have, to be used as the joining column.

# First, let's remove our index from both of our DataFrames
staff_df   = staff_df.reset_index()
student_df = student_df.reset_index()

staff_df


In [None]:
student_df

In [None]:
# Now let's merge using the on parameter
pd.merge(staff_df, student_df, how='outer', on='Name')

#### Joining on a column with the on parameter is the form you will see most often.
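The `on` parameter requires the key column to have the same name in both DataFrames. When the names differ, `pd.merge` also accepts `left_on` and `right_on`. In this sketch the column names `Staff Name` and `Student Name` are hypothetical and were chosen only to show the case where the keys are named differently.

```python
import pandas as pd

# Hypothetical variants of the sample data with differently named key columns
staff_df = pd.DataFrame({'Staff Name': ['Mehmet', 'Nihan', 'Semih'],
                         'Role': ['Instructor', 'Chair', 'Vice Chair']})
student_df = pd.DataFrame({'Student Name': ['Nihan', 'Semih', 'Mert'],
                           'School': ['ITU', 'YTU', 'ODTU']})

# left_on names the key in the left DataFrame, right_on the key in the right one
merged = pd.merge(staff_df, student_df, how='inner',
                  left_on='Staff Name', right_on='Student Name')
print(merged)
```

Note that both key columns are kept in the result, so you may want to drop one of them afterwards.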

### Conflicts

In [None]:
# So what happens when we have conflicts between the DataFrames, that is, a column with the same name in both?

# Let's take a look by creating new staff and
# student DataFrames that have location information added to them.

staff_df = pd.DataFrame([{'Name': 'Nihan', 'Role': 'Chair', 
                          'Location': 'Gemi Faculty'},
                         {'Name': 'Selcuk', 'Role': 'Instructor', 
                          'Location': 'O block'},
                         {'Name': 'Coskun', 'Role': 'Vice Manager', 
                          'Location': 'E block'}])

student_df = pd.DataFrame([{'Name': 'Selcuk', 'School': 'YTU', 
                            'Location': 'O Block'},
                          {'Name': 'Coskun', 'School': 'ITU', 
                            'Location': 'F block'},
                           {'Name': 'Mert', 'School': 'Ege', 
                            'Location': 'F Block'}])

# In the staff DataFrame, this is an office location where we can find the staff person. And we can see the
# Chair is in Gemi Faculty, while the Instructor is in O block and the Vice Manager is in E block.
# But for the student DataFrame, the location information is actually their home address, so the same
# column name carries different information in each DataFrame.
staff_df



In [None]:
student_df

In [None]:
# The merge function preserves this information, but appends _x or _y to the conflicting column names to show
# which DataFrame each column came from. The _x is always the left DataFrame information, and the _y is always
# the right DataFrame information.

# Here, suppose we want all the staff information regardless of whether they were students or not, but if they
# were students, we want to get their student details as well. Then we can do a left join on the column
# Name.


pd.merge(staff_df, student_df, how='left', on='Name')
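If `_x` and `_y` are not descriptive enough, `pd.merge` takes a `suffixes` parameter that controls what is appended to conflicting column names. The sketch below repeats the location data so it runs on its own; the suffix strings `_staff` and `_student` are just one possible choice.

```python
import pandas as pd

staff_df = pd.DataFrame({'Name': ['Nihan', 'Selcuk', 'Coskun'],
                         'Role': ['Chair', 'Instructor', 'Vice Manager'],
                         'Location': ['Gemi Faculty', 'O block', 'E block']})
student_df = pd.DataFrame({'Name': ['Selcuk', 'Coskun', 'Mert'],
                           'School': ['YTU', 'ITU', 'Ege'],
                           'Location': ['O Block', 'F block', 'F Block']})

# suffixes replaces the default ('_x', '_y') for overlapping column names
merged = pd.merge(staff_df, student_df, how='left', on='Name',
                  suffixes=('_staff', '_student'))
print(merged.columns.tolist())
print(merged)
```

Only columns that clash get a suffix, so `Role` and `School` keep their original names.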

In [None]:
# From the output, we can see there are columns Location_x and Location_y. Location_x refers to the Location
# column in the left DataFrame, which is the staff DataFrame, and Location_y refers to the Location column in
# the right DataFrame, which is the student DataFrame.

# Before we leave merging of DataFrames, let's talk about multi-indexing and multiple columns. It's quite
# possible that the first name for students and staff might overlap, but the last name might not. In this
# case, we use a list of the multiple columns that should be used to join keys from both dataframes on the on
# parameter. Recall that the column name(s) assigned to the on parameter needs to exist in both dataframes.

# Here's an example with some new student and staff data

staff_df = pd.DataFrame([{'Name': 'Nihan', 'Last Name': 'Demirel', 'Role': 'Chair',  
                          'Location': 'Gemi Faculty'},
                         {'Name': 'Selcuk', 'Last Name': 'Alp', 'Role': 'Instructor', 
                          'Location': 'O block'},
                         {'Name': 'Coskun', 'Last Name': 'Ozkan', 'Role': 'Vice Manager',  
                          'Location': 'E block'}])

student_df = pd.DataFrame([{'Name': 'Selcuk', 'Last Name': 'Cebi', 'School': 'YTU', 
                            'Location': 'O Block'},
                          {'Name': 'Coskun',  'Last Name': 'Ozkan',  'School': 'ITU',
                            'Location': 'F block'},
                           {'Name': 'Mert',  'Last Name': 'Edali', 'School': 'Ege',
                            'Location': 'F Block'}])

# As you see here, Selcuk Alp and Selcuk Cebi don't match on both keys since they have different last
# names. So we would expect that an inner join doesn't include these individuals in the output, and only
# Coskun Ozkan will be retained.
staff_df

In [None]:
student_df

In [None]:
pd.merge(staff_df, student_df, how='inner', on=['Name','Last Name'])

In [None]:
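# For comparison, join on Name alone: Selcuk now matches even though the last names differ, and the
# overlapping Last Name and Location columns get the _x and _y suffixes.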

pd.merge(staff_df, student_df, how='inner', on=['Name']) 
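To see the difference between the two joins side by side, here is a small sketch that runs both and compares which rows survive. The data is a trimmed copy of the DataFrames above (the Location column is left out to keep the output narrow), and the names `both_keys` and `name_only` are just for illustration.

```python
import pandas as pd

staff_df = pd.DataFrame({'Name': ['Nihan', 'Selcuk', 'Coskun'],
                         'Last Name': ['Demirel', 'Alp', 'Ozkan'],
                         'Role': ['Chair', 'Instructor', 'Vice Manager']})
student_df = pd.DataFrame({'Name': ['Selcuk', 'Coskun', 'Mert'],
                           'Last Name': ['Cebi', 'Ozkan', 'Edali'],
                           'School': ['YTU', 'ITU', 'Ege']})

# Both keys must match, so only Coskun Ozkan survives
both_keys = pd.merge(staff_df, student_df, how='inner', on=['Name', 'Last Name'])

# Only the first name must match, so the two different Selcuks are paired up too
name_only = pd.merge(staff_df, student_df, how='inner', on='Name')

print(both_keys)
print(name_only)
```

The `Last Name_x` and `Last Name_y` columns in the second result make the false match on Selcuk easy to spot.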

In [None]:
# Joining dataframes through merging is incredibly common, 
# and you'll need to know how to pull data from
# different sources, clean it, and join it for analysis. 
# This is a staple not only of pandas, but of database
# technologies as well.

### Merging horizontally vs concatenating vertically

In [None]:
# If we think of merging as joining "horizontally", meaning we join on similar values in a column found in two
# DataFrames, then concatenating is joining "vertically", meaning we stack the DataFrames one on top of
# the other.



In [73]:
import pandas as pd
# First DataFrame

staff_df1 = pd.DataFrame([{'Name': 'Nihan', 'Last Name': 'Demirel', 'Role': 'Chair',  
                          'Location': 'Gemi Faculty'},
                         {'Name': 'Selcuk', 'Last Name': 'Alp', 'Role': 'Instructor', 
                          'Location': 'O block'},
                         {'Name': 'Coskun', 'Last Name': 'Ozkan', 'Role': 'Vice Manager',  
                          'Location': 'E block'}])

staff_df2 = pd.DataFrame([{'Name': 'Mehmet', 'Last Name': 'Guler', 'Role': 'Instructor',  
                          'Location': 'B Roof'},
                         {'Name': 'Selcuk', 'Last Name': 'Cebi', 'Role': 'Vice Manager',  
                          'Location': 'B Roof'},
                         {'Name': 'Alev', 'Last Name': 'Gumus', 'Role': 'Instructor',  
                          'Location': 'V  Block'},
                         {'Name': 'Umut', 'Last Name': 'Tuzkaya', 'Role': 'Vice President',  
                          'Location': 'Tas Bina'},])
frames = [staff_df1, staff_df2]




In [74]:
staff_df1

,Name,Last Name,Role,Location
0,Nihan,Demirel,Chair,Gemi Faculty
1,Selcuk,Alp,Instructor,O block
2,Coskun,Ozkan,Vice Manager,E block


In [75]:
staff_df1 = staff_df1.set_index('Name')

In [76]:
staff_df2

,Name,Last Name,Role,Location
0,Mehmet,Guler,Instructor,B Roof
1,Selcuk,Cebi,Vice Manager,B Roof
2,Alev,Gumus,Instructor,V Block
3,Umut,Tuzkaya,Vice President,Tas Bina


In [77]:
staff_df2 = staff_df2.set_index('Name')


In [78]:
staff_df2

Name,Last Name,Role,Location
Mehmet,Guler,Instructor,B Roof
Selcuk,Cebi,Vice Manager,B Roof
Alev,Gumus,Instructor,V Block
Umut,Tuzkaya,Vice President,Tas Bina


In [79]:
pd.concat([staff_df1,staff_df2])


Name,Last Name,Role,Location
Nihan,Demirel,Chair,Gemi Faculty
Selcuk,Alp,Instructor,O block
Coskun,Ozkan,Vice Manager,E block
Mehmet,Guler,Instructor,B Roof
Selcuk,Cebi,Vice Manager,B Roof
Alev,Gumus,Instructor,V Block
Umut,Tuzkaya,Vice President,Tas Bina


In [80]:
staff_df1

Name,Last Name,Role,Location
Nihan,Demirel,Chair,Gemi Faculty
Selcuk,Alp,Instructor,O block
Coskun,Ozkan,Vice Manager,E block


In [81]:
staff_df1   = staff_df1.reset_index()
staff_df2   = staff_df2.reset_index()

pd.concat([staff_df1,staff_df2])


,Name,Last Name,Role,Location
0,Nihan,Demirel,Chair,Gemi Faculty
1,Selcuk,Alp,Instructor,O block
2,Coskun,Ozkan,Vice Manager,E block
0,Mehmet,Guler,Instructor,B Roof
1,Selcuk,Cebi,Vice Manager,B Roof
2,Alev,Gumus,Instructor,V Block
3,Umut,Tuzkaya,Vice President,Tas Bina
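Notice that the concatenated result above repeats the index labels 0, 1, 2 because each DataFrame keeps its own index. Two `pd.concat` parameters help here: `ignore_index=True` renumbers the rows, and `keys` labels each block with its source. The sketch below uses a trimmed copy of the staff data, and the key labels `first` and `second` are arbitrary.

```python
import pandas as pd

staff_df1 = pd.DataFrame({'Name': ['Nihan', 'Selcuk', 'Coskun'],
                          'Role': ['Chair', 'Instructor', 'Vice Manager']})
staff_df2 = pd.DataFrame({'Name': ['Mehmet', 'Selcuk', 'Alev', 'Umut'],
                          'Role': ['Instructor', 'Vice Manager',
                                   'Instructor', 'Vice President']})

# ignore_index=True discards the original labels and numbers the rows 0..6
combined = pd.concat([staff_df1, staff_df2], ignore_index=True)
print(combined)

# keys adds an outer index level recording which DataFrame each row came from
labelled = pd.concat([staff_df1, staff_df2], keys=['first', 'second'])
print(labelled.loc['second'])
```

Renumbering is usually what you want when the old index carries no meaning, while `keys` is useful when you need to trace rows back to their source.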


Now you know how to merge and concatenate datasets together. You will find these functions very useful for
combining data into richer datasets for analysis. A solid understanding of how to merge data is absolutely
essential when you are procuring, cleaning, and manipulating data. It's worth knowing how to join different
datasets quickly, and the different options you can use when joining datasets, and I would encourage you to
check out the pandas docs for joining and concatenating data.