# Joining Data

> It’s rare that a data analysis involves only a single table of data. Typically you have many tables of data, and you must combine them to answer the questions that you’re interested in.
>
> \- Garrett Grolemund, Master Instructor, RStudio

# General Model

## Combining Data

* We frequently want to use more than one table at once, so we need to combine them in some way

* Because tables are two-dimensional, we can combine them **vertically** and **horizontally**

* Combining data **vertically** is known as **appending**/**unioning**/**concatenating**

* Combining data **horizontally** is known as **joining**/**merging**

## Appending Data Vertically

* When we combine data **vertically**, we are stacking tables on top of one another:

<center>
<img src="https://raw.githubusercontent.com/pp-ct/scg_python/main/notebooks/images/combine-vertically.png" height="300">
</center>

* Note that this is particularly useful when all columns are the same between the two tables

## Joining Data Horizontally

* When we combine data **horizontally**, we are attaching the tables at their sides

* The joining occurs by matching rows on a **key column**:

<center>
<img src="https://raw.githubusercontent.com/pp-ct/scg_python/main/notebooks/images/combine-horizontally-key.png">
</center>

# Combining DataFrames

## Appending DataFrames

* When we combine DataFrames vertically, we want to stack two DataFrames on top of one another

* Let's start by creating two DataFrames with the same variables:

In [None]:
import pandas as pd

In [None]:
df_1 = pd.DataFrame({'x': [1, 2], 'y': ['a', 'b']})
df_1

In [None]:
df_2 = pd.DataFrame({'x': [3, 4], 'y': ['c', 'd']})
df_2

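We can stack `df_1` and `df_2` on top of one another by passing them as a list to the `concat()` function from `pandas`.

We can also pass `ignore_index=True` so that the result gets a fresh Index instead of keeping each DataFrame's original index labels:

In [None]:
# Stack df_1 on top of df_2 and renumber the rows
df_3 = pd.concat([df_1, df_2], ignore_index=True)
df_3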
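
To see what `ignore_index` changes, here is a minimal sketch using toy DataFrames shaped like `df_1` and `df_2`. The names `top`, `bottom`, and `other` are illustrative and not part of the lesson's data. The last step shows what happens when the columns do not match, which is why appending works best when both tables share the same columns.

```python
import pandas as pd

# Toy DataFrames shaped like df_1 and df_2 (illustrative names)
top = pd.DataFrame({'x': [1, 2], 'y': ['a', 'b']})
bottom = pd.DataFrame({'x': [3, 4], 'y': ['c', 'd']})

# Without ignore_index, each DataFrame keeps its original index labels
kept = pd.concat([top, bottom])
print(list(kept.index))   # [0, 1, 0, 1]

# With ignore_index=True, the rows are renumbered from 0
reset = pd.concat([top, bottom], ignore_index=True)
print(list(reset.index))  # [0, 1, 2, 3]

# A column found in only one DataFrame is kept, with NaN for the other rows
other = pd.DataFrame({'x': [5], 'z': ['e']})
mixed = pd.concat([top, other], ignore_index=True)
print(mixed)
```

Duplicate index labels such as `[0, 1, 0, 1]` are legal in pandas, but they make label-based selection ambiguous, so resetting the index is usually the safer choice when stacking.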

# Joining DataFrames

* Joining DataFrames may be one of the most important skills to learn in Python

* As a reminder, joining DataFrames is the horizontal combining of two DataFrames on some **key column**:

<center>
<img src="https://raw.githubusercontent.com/pp-ct/scg_python/main/notebooks/images/combine-horizontally-key.png" height="300">
</center>


* We have `flights_df`, but we need another DataFrame to join to `flights_df` that has a common **key column**

* As an example, assume we want to know which airline carried each flight in `flights_df`:

In [None]:
airlines_df = pd.read_csv('https://raw.githubusercontent.com/pp-ct/scg_python/main/data/airlines.csv')
flights_df = pd.read_csv('https://raw.githubusercontent.com/pp-ct/scg_python/main/data/flights.csv')

In [None]:
airlines_df.head()

In [None]:
flights_df.head()

The `carrier` column is our key because it's in both DataFrames.

We can join/merge the DataFrames together using the `merge()` function:

In [None]:
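# Attach the airline information to each flight by matching on `carrier`
pd.merge(flights_df, airlines_df, on='carrier')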
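
The same idea can be tried on a small scale without downloading anything. The sketch below uses made-up rows. The column names `flight` and `name` are assumptions for illustration and may not match the real CSV files.

```python
import pandas as pd

# Made-up stand-ins for flights_df and airlines_df (hypothetical rows)
toy_flights = pd.DataFrame({
    'flight': [1545, 1714, 1141],
    'carrier': ['UA', 'UA', 'AA'],
})
toy_airlines = pd.DataFrame({
    'carrier': ['AA', 'UA'],
    'name': ['American Airlines', 'United Air Lines'],
})

# Each flight row picks up the airline name whose carrier code matches
joined = pd.merge(toy_flights, toy_airlines, on='carrier')
print(joined)
```

Because several flights can share one carrier, the airline's row is repeated for every matching flight. The result has one row per flight, not one row per airline.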

# Join Types

### Inner Joins

The join above was an **inner join**, which is the default for `merge()`:

<center>
<img src="https://raw.githubusercontent.com/pp-ct/scg_python/main/notebooks/images/inner-join.png" alt="inner-join.png" height="500">
</center>

**Inner joins** only keep rows where the key is in *both tables*.
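
As a minimal sketch, assume two toy tables whose keys only partly overlap. The names `left`, `right`, `key`, `left_val`, and `right_val` are illustrative.

```python
import pandas as pd

# Keys 'b' and 'c' appear in both tables; 'a' and 'd' appear in only one
left = pd.DataFrame({'key': ['a', 'b', 'c'], 'left_val': [1, 2, 3]})
right = pd.DataFrame({'key': ['b', 'c', 'd'], 'right_val': [20, 30, 40]})

# An inner join drops 'a' (left only) and 'd' (right only)
inner = pd.merge(left, right, on='key', how='inner')
print(inner)
```

Only the rows for `'b'` and `'c'` survive, so an inner join can return fewer rows than either input.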

### Left Joins

Sometimes we only want to include data that is **in the left table** regardless of whether it's in the right table:

<center>
<img src="https://raw.githubusercontent.com/pp-ct/scg_python/main/notebooks/images/left-outer-join.png" alt="left-outer-join.png" height="500">
</center>

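Left outer joins, or simply **left joins**, keep every row of the left table. Where a key has no match in the right table, the columns coming from the right table are filled with missing values (`NaN`).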
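
A small sketch of a left join, again on illustrative toy tables (`left`, `right`, and their column names are assumptions, not part of the lesson's data):

```python
import pandas as pd

left = pd.DataFrame({'key': ['a', 'b', 'c'], 'left_val': [1, 2, 3]})
right = pd.DataFrame({'key': ['b', 'c', 'd'], 'right_val': [20, 30, 40]})

# Every left row is kept; 'a' has no match, so its right_val is NaN
left_join = pd.merge(left, right, on='key', how='left')
print(left_join)

# Rows that found no partner in the right table
print(left_join[left_join['right_val'].isna()])
```

Filtering on the missing values afterwards is a quick way to find the keys that failed to match.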

### Right Joins

Sometimes we only want to include data that is **in the right table** regardless of whether it's in the left table:

<center>
<img src="https://raw.githubusercontent.com/pp-ct/scg_python/main/notebooks/images/right-outer-join.png" alt="right-outer-join.png" height="500">
</center>

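Right outer joins, or simply **right joins**, keep every row of the right table. Where a key has no match in the left table, the columns coming from the left table are filled with missing values (`NaN`).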
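
A right join is the mirror image of a left join. The sketch below, on illustrative toy tables, compares a right join with a left join whose arguments are swapped. Both keep the same keys; only the column order differs.

```python
import pandas as pd

left = pd.DataFrame({'key': ['a', 'b', 'c'], 'left_val': [1, 2, 3]})
right = pd.DataFrame({'key': ['b', 'c', 'd'], 'right_val': [20, 30, 40]})

# Every right row is kept; 'd' has no match, so its left_val is NaN
right_join = pd.merge(left, right, on='key', how='right')
print(right_join)

# Swapping the tables and using how='left' keeps the same keys
swapped = pd.merge(right, left, on='key', how='left')
print(swapped)
```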

### Outer Joins

Sometimes we want to include **all rows** in either the left table or the right table:

<center>
<img src="https://raw.githubusercontent.com/pp-ct/scg_python/main/notebooks/images/full-outer-join.png" alt="full-outer-join.png" height="500">
</center>

**Full outer joins** keep all rows from both tables, filling in missing values (`NaN`) wherever a key appears in only one of them.
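
As a sketch on illustrative toy tables, the example below also passes `indicator=True`. This `merge()` option is not otherwise covered in this lesson. It adds a `_merge` column recording whether each row came from the left table, the right table, or both.

```python
import pandas as pd

left = pd.DataFrame({'key': ['a', 'b', 'c'], 'left_val': [1, 2, 3]})
right = pd.DataFrame({'key': ['b', 'c', 'd'], 'right_val': [20, 30, 40]})

# All four keys are kept; _merge records where each row came from
outer = pd.merge(left, right, on='key', how='outer', indicator=True)
print(outer)
```

Here `'a'` is marked `left_only`, `'d'` is marked `right_only`, and `'b'` and `'c'` are marked `both`.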

# Applying Different Join Types

We can apply these different join types using the `how` parameter of the `merge()` function.

While `how='inner'` is the default, we can also use `'left'`, `'right'`, and `'outer'`:

In [None]:
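# Keep every flight, even if its carrier is missing from airlines_df
pd.merge(flights_df, airlines_df, on='carrier', how='left')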
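
To compare the four options side by side, here is a small sketch on made-up data. The carrier code `'ZZ'` is a hypothetical code with no airline entry, `'DL'` is an airline with no flights in the toy table, and the `flight` and `name` columns are assumptions for illustration.

```python
import pandas as pd

toy_flights = pd.DataFrame({
    'flight': [1545, 1141, 9999],
    'carrier': ['UA', 'AA', 'ZZ'],   # 'ZZ' has no entry in toy_airlines
})
toy_airlines = pd.DataFrame({
    'carrier': ['AA', 'UA', 'DL'],   # 'DL' has no flights in toy_flights
    'name': ['American Airlines', 'United Air Lines', 'Delta Air Lines'],
})

# Run the same merge with each join type and record how many rows come back
row_counts = {}
for how in ['inner', 'left', 'right', 'outer']:
    result = pd.merge(toy_flights, toy_airlines, on='carrier', how=how)
    row_counts[how] = len(result)

print(row_counts)  # {'inner': 2, 'left': 3, 'right': 3, 'outer': 4}
```

Checking the row count before and after a merge is a cheap sanity check that the chosen join type kept or dropped the rows you expected.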