# Joining DataFrames

In this chapter, we will join two DataFrames horizontally using the `merge` method. The DataFrame `merge` method performs the same task as a SQL join. Many people reading this book will be familiar with SQL and already understand SQL joins.

For those unfamiliar with SQL, it is a language used to work with data stored in relational databases. A relational database is an organized collection of two-dimensional tables - the same two-dimensional tables that we've worked with throughout this book and read in as CSV files. The tables are **related** to one another based on specific columns. The tables can be **joined** together whenever the values in these columns are the same. Much greater detail will be presented in the upcoming Fundamentals of SQL part of the book.

A DataFrame `join` method also exists and is similar to `merge`, but it joins on the index by default rather than offering the full SQL-like join mechanism of `merge`. The `join` method will be covered in an upcoming chapter. It's a bit unfortunate (in my opinion) that the `merge` method was not named "join" to match SQL. In fact, if you [look at the documentation][0] for `merge`, you'll see "SQL" mentioned in several places.


## Comparing the `merge` method to `pd.concat`

In the last chapter, we learned how to join together two (or more) DataFrames horizontally using the `pd.concat` function. The `merge` method is similar to `pd.concat` but with the following differences:

* `merge` only joins DataFrames horizontally (`pd.concat` joins both vertically and horizontally)
* `merge` joins together two and only two DataFrames. There is no limit to how many DataFrames `pd.concat` may join together
* `merge` aligns the DataFrames on any number of columns and/or index levels. `pd.concat` aligns only on the index
* `merge` allows for the type of join to be either "inner", "left", "right", "outer", or "cross". `pd.concat` only allows "inner" or "outer"
* `merge` preserves the number of index levels while `pd.concat` can add a new index level (through its `keys` parameter) to label the original DataFrames

[0]: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html
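
The alignment difference in the list above can be seen with a small sketch. The two DataFrames here (`left_demo` and `right_demo`) are made up for illustration and are not part of the book's data. `pd.concat` with `axis=1` lines rows up by index label, while `merge` lines them up by the values in the `key` column.

```python
import pandas as pd

# Hypothetical data: the same keys appear in a different row order
left_demo = pd.DataFrame({'key': ['a', 'b'], 'x': [1, 2]})
right_demo = pd.DataFrame({'key': ['b', 'a'], 'y': [20, 10]})

# concat aligns on the index (0, 1), so 'a' ends up next to 'b'
by_index = pd.concat([left_demo, right_demo], axis=1)

# merge aligns on the values of the key column
by_column = left_demo.merge(right_demo, on='key')

print(by_index)
print(by_column)
```

Note that the concatenated result keeps both `key` columns, while the merged result keeps a single one.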

## Left and right tables

Before calling the `merge` method, we introduce the terminology **left table** and **right table**, as the order of the DataFrames matters when joining. The left table will be the one calling the `merge` method and the right table will be the one passed as the first argument. The courses and professors datasets (found in the `data/joins` folder and shown in the image below) will be used to showcase the `merge` method.

![0]

[0]: images/left_right.png

## Inner join

There are five different types of joins possible with the `merge` method - "inner", "left", "right", "outer", and "cross". Each of these terms comes from SQL, which is why the `merge` method is often described as completing a "SQL-like" join.

When joining two tables together using the `merge` method, you choose the type of join and the joining column of each DataFrame. Whenever the values in the joining columns match, the rows from the DataFrames are joined together.

By default, the `merge` method completes an inner join, which keeps only the rows that have a match in both DataFrames. The image below highlights the rows that align during an inner join on the `professor_id` column of each DataFrame.

![0]

[0]: images/inner_join_color.png

The DataFrames are read in and the `merge` method is called. The first parameter is the other "right" table. The `how` parameter must be one of the join types while the `on` parameter must be set to the joining column that is common to both. In this particular join, each row of the left table joins zero or one rows in the right table. Only the last row in the courses DataFrame has no match in the professors DataFrame.

In [None]:
import pandas as pd
courses = pd.read_csv('../data/joins/courses.csv')
professors = pd.read_csv('../data/joins/professors.csv')
courses.merge(professors, how='inner', on='professor_id')
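
If the CSV files are not at hand, the same behavior can be reproduced with a pair of tiny DataFrames. The tables below are hypothetical stand-ins (the names `courses_toy` and `professors_toy` and all of their values are invented for this sketch), not the contents of the book's files.

```python
import pandas as pd

# Hypothetical stand-ins for the courses and professors tables
courses_toy = pd.DataFrame({
    'course_id': [1, 2, 3],
    'course': ['Algebra', 'Biology', 'Chemistry'],
    'professor_id': [10, 20, 99],   # 99 has no matching professor
})
professors_toy = pd.DataFrame({
    'professor_id': [10, 20, 30],   # 30 teaches no course
    'professor': ['Ada', 'Ben', 'Cy'],
})

# Only rows whose professor_id appears in both tables survive
inner = courses_toy.merge(professors_toy, how='inner', on='professor_id')

# Leaving out how gives the same result, since 'inner' is the default
default = courses_toy.merge(professors_toy, on='professor_id')

print(inner)
```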

## Left join

In a left join, all of the rows in the left table that do not have a match in the right table are kept in the result along with all of the rows from the inner join. Each of these non-matching rows will have missing values in the result for the columns that come from the right table.

![0]

[0]: images/left_join_color.png

Setting the `how` parameter to "left" forces a left join. Because the last row did not align with any `professor_id`, it has missing values for the last two columns.

In [None]:
courses.merge(professors, how='left', on='professor_id')
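
A quick way to confirm which left rows went unmatched is to look for missing values in a column that comes from the right table. This is a minimal sketch on invented data (`courses_toy` and `professors_toy` are hypothetical), assuming the right-table column has no missing values of its own.

```python
import pandas as pd

# Hypothetical stand-ins for the courses and professors tables
courses_toy = pd.DataFrame({
    'course_id': [1, 2, 3],
    'course': ['Algebra', 'Biology', 'Chemistry'],
    'professor_id': [10, 20, 99],   # 99 has no matching professor
})
professors_toy = pd.DataFrame({
    'professor_id': [10, 20, 30],
    'professor': ['Ada', 'Ben', 'Cy'],
})

left = courses_toy.merge(professors_toy, how='left', on='professor_id')

# Every left row is kept; here the row count is unchanged because each
# professor_id appears only once in the right table
print(len(left) == len(courses_toy))

# Unmatched left rows show up as missing values in the right-table columns
unmatched = left[left['professor'].isna()]
print(unmatched)
```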

## Right join

In a right join, all of the rows in the right table that do not have a match in the left table are kept in the result along with all of the rows from the inner join.

![0]

[0]: images/right_join_color.png

Rows with professor_id 3 and 40 do not appear in the left table and therefore have missing values for the left table columns in the result.

In [None]:
courses.merge(professors, how='right', on='professor_id')
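
One way to think about a right join is as a left join with the two tables swapped. The sketch below checks this on invented data (`courses_toy` and `professors_toy` are hypothetical); only the column order differs between the two results, so the columns are reordered before comparing.

```python
import pandas as pd

# Hypothetical stand-ins for the courses and professors tables
courses_toy = pd.DataFrame({
    'course_id': [1, 2, 3],
    'course': ['Algebra', 'Biology', 'Chemistry'],
    'professor_id': [10, 20, 99],
})
professors_toy = pd.DataFrame({
    'professor_id': [10, 20, 30],
    'professor': ['Ada', 'Ben', 'Cy'],
})

right_join = courses_toy.merge(professors_toy, how='right', on='professor_id')
swapped_left = professors_toy.merge(courses_toy, how='left', on='professor_id')

# Put both results in the same column and row order before comparing
a = (right_join[swapped_left.columns]
     .sort_values('professor_id').reset_index(drop=True))
b = swapped_left.sort_values('professor_id').reset_index(drop=True)
print(a.equals(b))
```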

## Outer join

In an outer join (also called a "full outer join"), all of the non-matching rows in either table are kept in the result along with all of the rows from the inner join.

![0]

[0]: images/outer_join_color.png

In this outer join, the first four rows represent those from the inner join where there is a match in professor_id. The next row (course_id equal to 5) is only in the left table. The last two rows are found only in the right table. 

Setting the `indicator` parameter to `True` adds a `_merge` column to the resulting DataFrame with values of `left_only`, `right_only`, or `both`, indicating which table each row is found in.

In [None]:
courses.merge(professors, how='outer', on='professor_id', indicator=True)
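
The `_merge` column is handy for more than display. As a sketch on invented data (`courses_toy` and `professors_toy` are hypothetical), it can be counted to summarize the join, or filtered to isolate the rows found in only one table.

```python
import pandas as pd

# Hypothetical stand-ins for the courses and professors tables
courses_toy = pd.DataFrame({
    'course_id': [1, 2, 3],
    'course': ['Algebra', 'Biology', 'Chemistry'],
    'professor_id': [10, 20, 99],   # 99 has no matching professor
})
professors_toy = pd.DataFrame({
    'professor_id': [10, 20, 30],   # 30 teaches no course
    'professor': ['Ada', 'Ben', 'Cy'],
})

outer = courses_toy.merge(professors_toy, how='outer',
                          on='professor_id', indicator=True)

# Summarize how many rows came from each side
print(outer['_merge'].value_counts())

# Keep only the rows found in the left table but not the right
left_only = outer[outer['_merge'] == 'left_only']
print(left_only)
```

Filtering on `left_only` or `right_only` answers questions of the form "which rows in one table have no partner in the other?".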

## Cross join

In a cross join, there is no joining column. Each row in the left DataFrame is aligned with each row of the right DataFrame creating a Cartesian product. The number of rows in the result will be the product of the number of rows in each DataFrame. The image below shows a single row in the left DataFrame aligning with all of the rows in the right. This process happens for every row in the left DataFrame.

![1]

[1]: images/cross_join_arrow.png

There will be 30 rows in the result (five rows in the left times six rows in the right). The `on` parameter has been removed as there is no joining column. Unlike the other joins, when performing a cross join, all of the columns from each of the DataFrames are placed in the result. In this instance, both `professor_id` columns are present in the result. To avoid confusion, pandas appends a suffix to any column name that appears in both DataFrames. By default, the suffixes are `'_x'` and `'_y'`, but they can be changed with the `suffixes` parameter.

In [None]:
courses.merge(professors, how='cross', 
              suffixes=('_left', '_right')).head(10)
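
The row-count rule can be checked directly. This sketch uses invented three-row tables (`courses_toy` and `professors_toy` are hypothetical) and assumes a pandas version that supports `how='cross'` (1.2 or later).

```python
import pandas as pd

# Hypothetical stand-ins for the courses and professors tables
courses_toy = pd.DataFrame({
    'course_id': [1, 2, 3],
    'professor_id': [10, 20, 99],
})
professors_toy = pd.DataFrame({
    'professor_id': [10, 20, 30],
    'professor': ['Ada', 'Ben', 'Cy'],
})

cross = courses_toy.merge(professors_toy, how='cross',
                          suffixes=('_left', '_right'))

# The number of rows is the product of the two table lengths
print(len(cross), len(courses_toy) * len(professors_toy))

# Both professor_id columns are kept, each with its suffix
print(list(cross.columns))
```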

## Multiple matches per row

In each of the above examples, every row in the left table matched at most one row in the right table, and the joining columns had the same name in each table. It's possible for a row to have any number of matches in the other table and for the joining columns to have different names. Take a look at the procedure and doctor tables below. We will align the left table's `department` column with the right table's `specialty` column.

![0]

[0]: images/proc_doc.png

Use the `left_on` and `right_on` parameters to specify the joining columns when the column names are different in each table. Because the `clinic` column (which we are not joining on in this example) appears in both DataFrames, the default suffixes are used to indicate which DataFrame each one originated from. An outer join is performed, so all rows from each table are kept in the result.

In [None]:
procedure = pd.read_csv('../data/joins/procedure.csv')
doctor = pd.read_csv('../data/joins/doctor.csv')
procedure.merge(doctor, how='outer', left_on='department', 
                right_on='specialty', indicator=True)
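
To see how multiple matches multiply rows, here is a minimal sketch with invented data (`procedure_toy` and `doctor_toy` and their values are hypothetical, not the book's files). A left row with two matches in the right table appears twice in the result.

```python
import pandas as pd

# Hypothetical stand-ins for the procedure and doctor tables
procedure_toy = pd.DataFrame({
    'procedure': ['X-ray', 'Stent'],
    'department': ['Radiology', 'Cardiology'],
})
doctor_toy = pd.DataFrame({
    'doctor': ['Lee', 'Kim', 'Roy'],
    'specialty': ['Cardiology', 'Cardiology', 'Oncology'],
})

result = procedure_toy.merge(doctor_toy, how='outer',
                             left_on='department',
                             right_on='specialty', indicator=True)

# 'Stent' matches two cardiologists, so it appears in two rows;
# 'X-ray' and 'Roy' have no partner and appear once each
print(result)
```

Both joining columns, `department` and `specialty`, are kept in the result because they have different names.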

## Joining on multiple columns

All of the above examples have a single joining column, so rows join when the values in that one column match. It's possible to have multiple joining columns, in which case the values in every one of them must match for two rows to join. Here, the department and clinic columns in the left table must match the specialty and clinic columns in the right table. An inner join is performed, so only the matched rows are kept in the result.

In [None]:
procedure.merge(doctor, how='inner', left_on=['department', 'clinic'],
                right_on=['specialty', 'clinic'], indicator=True)
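
The effect of the extra joining column is easiest to see by comparing row counts. This is a sketch on invented data (`procedure_toy` and `doctor_toy` are hypothetical): joining on one column pairs every cardiology procedure with every cardiologist, while adding `clinic` keeps only the pairs at the same clinic.

```python
import pandas as pd

# Hypothetical stand-ins for the procedure and doctor tables
procedure_toy = pd.DataFrame({
    'procedure': ['X-ray', 'Stent', 'Stent'],
    'department': ['Radiology', 'Cardiology', 'Cardiology'],
    'clinic': ['North', 'North', 'South'],
})
doctor_toy = pd.DataFrame({
    'doctor': ['Lee', 'Kim', 'Roy'],
    'specialty': ['Cardiology', 'Cardiology', 'Radiology'],
    'clinic': ['North', 'East', 'South'],
})

# One joining column: clinic is not a key, so it receives suffixes
one_key = procedure_toy.merge(doctor_toy, how='inner',
                              left_on='department', right_on='specialty')

# Two joining columns: department/specialty AND clinic must both match
two_keys = procedure_toy.merge(doctor_toy, how='inner',
                               left_on=['department', 'clinic'],
                               right_on=['specialty', 'clinic'])

print(len(one_key), len(two_keys))
print(two_keys)
```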

## Joining tables from SQL databases

Tables from SQL databases are usually created so that they may be joined together. Anytime you find yourself working with a SQL database (an organized collection of related tables), you should try to find its **database diagram**, which will make joins a much easier task. A database diagram shows all of the tables, the column names for each table, the data type of each column, and the relations between the tables.

Take a look at the database diagram below for the Chinook database, which contains many tables of data for a music store. Each rectangular box represents a two-dimensional table with its name centered at the top. The column names and their corresponding data types follow below the table name. The lines depict which tables can be joined, with the joining column written above the line. The golden key symbol to the left of some of the column names marks the **primary key** of each table, which is an integer that uniquely identifies each row. Primary keys usually start at one and increment by one each row.

![0]

[0]: images/chinook_erd.png

### Reading in tables as DataFrames from a SQL database

In order to read in a table from a database, you must use a **connection string**, which contains information on how to connect to the database (username, password, location, etc.). The Chinook database is stored as a SQLite database (a specific database software) and its connection string is assigned to the variable `CS`.

A longer discussion of database diagrams and connection strings follows in the upcoming Fundamentals of SQL part of the book and it is strongly suggested you read it to get a deeper understanding of relational databases and how tables are related to one another. We will continue to focus on the mechanics of joining two tables together with the `merge` method in this chapter.

In [None]:
CS = 'sqlite:///../data/databases/chinook.db'

Entire tables from a database may be read in as a pandas DataFrame using the `read_sql` function by passing it the table name and connection string as the first two arguments. Here, we read in the tracks table.

In [None]:
tracks = pd.read_sql('tracks', CS)
tracks.head(3)
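
For readers without the Chinook file, the sketch below builds a throwaway in-memory SQLite database and reads a table from it. The table contents are invented, and the variable names `con` and `genres_demo` are hypothetical. It passes a `sqlite3` connection and a `SELECT` query rather than a connection string and table name, since the connection-string form relies on the SQLAlchemy library being installed.

```python
import sqlite3
import pandas as pd

# Create a temporary database that lives only in memory
con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE genres (GenreId INTEGER PRIMARY KEY, Name TEXT)')

# Hypothetical rows; the primary key is assigned automatically (1, 2, ...)
con.executemany('INSERT INTO genres (Name) VALUES (?)',
                [('Rock',), ('Jazz',)])
con.commit()

# Read the whole table into a DataFrame with a SQL query
genres_demo = pd.read_sql('SELECT * FROM genres', con)
con.close()

print(genres_demo)
```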

The genres, albums, and artists tables are read in as well.

![0]

[0]: images/genres_albums_artists.png

In [None]:
genres = pd.read_sql('genres', CS)
albums = pd.read_sql('albums', CS)
artists = pd.read_sql('artists', CS)

With the help of the database diagram, we can determine which tables are related and how to join them. Let's join the tracks and genres tables on the `GenreId` column. Both tables have a column named `Name`, so custom suffixes are provided.

In [None]:
tracks.merge(genres, how='inner', on='GenreId', 
             suffixes=('_tracks', '_genres')).head()

## Joining multiple tables together

In SQL databases, it's often the case that multiple tables will need to be joined together. For instance, the tracks table contains neither the album name nor the artist name. We'll need to join the tracks, albums, and artists tables together to get the track name, album name, and artist name in the same table. Looking at the database diagram, it's not possible to join the tracks table directly with the artists table. We first have to join the tracks and albums tables, which we do below. We slim the result down to a few select columns and sort by `TrackId` to preserve the original order of the tracks table.

In [None]:
cols = ['TrackId', 'Name', 'AlbumId', 'GenreId', 'Title', 'ArtistId']
(tracks.merge(albums, how='inner', on='AlbumId')[cols]
      .sort_values('TrackId')
      .head())

We can now join the artists table to this newly created table to get the track name, album title, and artist name in the same table.

In [None]:
(tracks.merge(albums, how='inner', on='AlbumId')[cols]
       .merge(artists, how='inner', on='ArtistId')
       .sort_values('TrackId')
       .head())

The default suffixes are added to the `Name` column, which appears in both the tracks and artists tables. Below, we join one more table, genres, to add the genre name of each track and then drop all the Id columns.

In [None]:
(tracks.merge(albums, how='inner', on='AlbumId')[cols]
       .merge(artists, how='inner', on='ArtistId')
       .merge(genres, how='inner', on='GenreId')
       .sort_values('TrackId')
       .drop(columns=['TrackId', 'AlbumId', 'ArtistId', 'GenreId'])
       .head())

Notice how the `Name` column of the genres table does not have a suffix. This is because the other `Name` columns had already been renamed with suffixes before the last `merge` call, so there was no longer a clash. To help understand the data better, each column is renamed.

In [None]:
df_names = (
    tracks.merge(albums, how='inner', on='AlbumId')[cols]
       .merge(artists, how='inner', on='ArtistId')
       .merge(genres, how='inner', on='GenreId')
       .rename(columns={'Name_x': 'TrackName', 
                        'Name_y': 'ArtistName', 
                        'Title': 'AlbumTitle',
                        'Name': 'GenreName'})
       .sort_values('TrackId')
       .drop(columns=['TrackId', 'AlbumId', 'ArtistId', 'GenreId']))
df_names.head()
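
The suffix behavior across chained merges can be traced step by step on a miniature version of these tables. Everything below is invented for illustration (the `*_toy` names and their rows are hypothetical), but the column names mirror the ones used in this chapter.

```python
import pandas as pd

# Hypothetical miniature versions of the four tables
tracks_toy = pd.DataFrame({'TrackId': [1, 2], 'Name': ['Song A', 'Song B'],
                           'AlbumId': [1, 1], 'GenreId': [1, 2]})
albums_toy = pd.DataFrame({'AlbumId': [1], 'Title': ['Album One'],
                           'ArtistId': [7]})
artists_toy = pd.DataFrame({'ArtistId': [7], 'Name': ['Some Band']})
genres_toy = pd.DataFrame({'GenreId': [1, 2], 'Name': ['Rock', 'Jazz']})

# Step 1: no shared non-key column names, so no suffixes are added
step1 = tracks_toy.merge(albums_toy, how='inner', on='AlbumId')

# Step 2: Name is in both tables, so it becomes Name_x and Name_y
step2 = step1.merge(artists_toy, how='inner', on='ArtistId')

# Step 3: step2 no longer has a plain Name column, so the genres
# Name column arrives without a suffix
step3 = step2.merge(genres_toy, how='inner', on='GenreId')

print(list(step2.columns))
print(list(step3.columns))
```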

## Further analysis after join

Data analyses typically do not end after joining tables together. Let's find the artists with the most tracks.

In [None]:
df_names['ArtistName'].value_counts().head(10)

Let's see if there are any artists that have multiple different genres.

In [None]:
(df_names.groupby('ArtistName')['GenreName'].nunique()
         .sort_values(ascending=False).head())

Let's verify this result by returning the first track of each genre for the artist with the greatest number of unique genres.

In [None]:
df_names.query('ArtistName == "Iron Maiden"').drop_duplicates(subset='GenreName')

## Exercises

Answer the exercises regarding the Chinook database using pandas. Read in tables with the following syntax.

In [None]:
CS = 'sqlite:///../data/databases/chinook.db'
tracks = pd.read_sql('tracks', CS)

### Exercise 1

<span style="color:green; font-size:16px">Find the occurrences of each media type in the tracks table. Use the name of the media type.</span>

### Exercise 2

<span style="color:green; font-size:16px">Are there any playlists that have no tracks? If so, which ones are they? Use `merge` in your solution.</span>

### Exercise 3

<span style="color:green; font-size:16px">Find the number of tracks per playlist. Use the playlist name in the result. Some playlists have the same name. Make sure not to combine them.</span>

### Exercise 4

<span style="color:green; font-size:16px">Find the number of invoices per customer. Show the customer id, first name, last name, and count of invoices.</span>

### Exercise 5

<span style="color:green; font-size:16px">How many customers is each employee responsible for?</span>

### Exercise 6

<span style="color:green; font-size:16px">Find all of the tracks with the same name as the album title.</span>

### Exercise 7

<span style="color:green; font-size:16px">Find the top 10 tracks by length of song (Milliseconds).</span>

### Exercise 8

<span style="color:green; font-size:16px">Are there any genres that do not appear in the tracks table? If so, which ones are they? Use `merge` in your solution.</span>

### Exercise 9

<span style="color:green; font-size:16px">Count the number of albums per artist. Make sure to include artists that do not have any albums.</span>

### Exercise 10

<span style="color:green; font-size:16px">Find the cost of each playlist. Include playlists with zero tracks.</span>

### Exercise 11

<span style="color:green; font-size:16px">Count the total number of times each track was sold and return the top 10 tracks.</span>

### Exercise 12

<span style="color:green; font-size:16px">Create a pivot table with billing country and genre as the index and columns and the number of tracks sold as the values.</span>

### Exercise 13

<span style="color:green; font-size:16px">Find the name and email of each employee's boss. Use the `suffixes` parameter to better label the merged data. Be sure to include employees that don't have bosses. This is called a recursive relationship.</span>

### Exercise 14

<span style="color:green; font-size:16px">Find the average length of tracks for each artist for those with at least 10 tracks. Return five artists with the longest average track length.</span>