
# Comparison with SQL - Part 2

This notebook is based on the pandas guide [Comparison with SQL](https://pandas.pydata.org/docs/getting_started/comparison/comparison_with_sql.html#compare-with-sql).

**Contents of this notebook**:
* JOIN

In [1]:
import pandas as pd
import numpy as np

In [2]:
url = (
    "https://raw.githubusercontent.com/pandas-dev"
    "/pandas/main/pandas/tests/io/data/csv/tips.csv"
)

## JOIN

`JOIN`s can be performed with `join()` or `merge()`. By default, `join()` will join the DataFrames on their indices. Each method has parameters allowing you to specify the type of join to perform (`LEFT`, `RIGHT`, `INNER`, `FULL`) or the columns to join on (column names or indices).
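As a minimal sketch of the difference between the two methods, the example below runs the same inner join once with `join()` and once with `merge()`. The frames `left` and `right` and their fixed values are hypothetical, chosen only so that the result is reproducible; they are not the random data used later in this notebook.

```python
import pandas as pd

# Hypothetical frames with fixed values so the result is reproducible.
left = pd.DataFrame({"key": ["A", "B", "C", "D"], "value": [1.0, 2.0, 3.0, 4.0]})
right = pd.DataFrame({"key": ["B", "D", "D", "E"], "value": [10.0, 20.0, 30.0, 40.0]})

# join() aligns on the index, so move the key column into the index first.
# Both frames have a "value" column, so suffixes are needed to tell them apart.
joined = left.set_index("key").join(
    right.set_index("key"), how="inner", lsuffix="_x", rsuffix="_y"
)

# merge() can join on the column directly and adds _x/_y suffixes by default.
merged = pd.merge(left, right, on="key", how="inner")

print(joined)
print(merged)
```

Note that `join()` defaults to `how="left"` while `merge()` defaults to `how="inner"`, so the join type is passed explicitly here to keep the two calls comparable.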

In [3]:
df1 = pd.DataFrame({"key": ["A", "B", "C", "D"], "value": np.random.randn(4)})
df2 = pd.DataFrame({"key": ["B", "D", "D", "E"], "value": np.random.randn(4)})

In [4]:
df1

|   | key | value |
|---|-----|-------|
| 0 | A | -2.857523 |
| 1 | B | -0.829377 |
| 2 | C | -1.362061 |
| 3 | D | -1.156072 |


In [5]:
df2

|   | key | value |
|---|-----|-------|
| 0 | B | -0.325485 |
| 1 | D | -0.649698 |
| 2 | D | 0.855154 |
| 3 | E | -0.148022 |


Assume we have two database tables of the same name and structure as our DataFrames.
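One way to make that assumption concrete is to copy two frames into an in-memory SQLite database and run the SQL there. This is only a sketch: the frames `left_tbl` and `right_tbl` and their fixed values are hypothetical stand-ins, and SQLite is used here simply because Python ships with the `sqlite3` module.

```python
import sqlite3

import pandas as pd

# Hypothetical fixed values stand in for the random ones used in this notebook.
left_tbl = pd.DataFrame({"key": ["A", "B", "C", "D"], "value": [1.0, 2.0, 3.0, 4.0]})
right_tbl = pd.DataFrame({"key": ["B", "D", "D", "E"], "value": [10.0, 20.0, 30.0, 40.0]})

# Copy both frames into an in-memory SQLite database as tables df1 and df2.
conn = sqlite3.connect(":memory:")
left_tbl.to_sql("df1", conn, index=False)
right_tbl.to_sql("df2", conn, index=False)

# Alias the value columns so the result has unique column names,
# and order the rows so the comparison with pandas is deterministic.
query = """
SELECT df1.key, df1.value AS value_x, df2.value AS value_y
FROM df1
INNER JOIN df2
  ON df1.key = df2.key
ORDER BY df1.key, df2.value;
"""
sql_result = pd.read_sql_query(query, conn)
conn.close()

# The same join expressed in pandas.
pandas_result = pd.merge(left_tbl, right_tbl, on="key")

print(sql_result)
print(pandas_result)
```

With these inputs both results contain the same three rows, one for key `B` and two for key `D`.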

Now let's go over the various types of `JOIN`s.

### INNER JOIN

```sql
-- SQL code
SELECT *
FROM df1
INNER JOIN df2
  ON df1.key = df2.key;
```

In [6]:
# merge performs an INNER JOIN by default
pd.merge(df1, df2, on="key")

|   | key | value_x | value_y |
|---|-----|---------|---------|
| 0 | B | -0.829377 | -0.325485 |
| 1 | D | -1.156072 | -0.649698 |
| 2 | D | -1.156072 | 0.855154 |


`merge()` also offers parameters for cases when you'd like to join one DataFrame's column with another DataFrame's index.

In [7]:
indexed_df2 = df2.set_index("key")
indexed_df2

| key | value |
|-----|-------|
| B | -0.325485 |
| D | -0.649698 |
| D | 0.855154 |
| E | -0.148022 |


In [8]:
pd.merge(df1, indexed_df2, left_on="key", right_index=True)

|   | key | value_x | value_y |
|---|-----|---------|---------|
| 1 | B | -0.829377 | -0.325485 |
| 3 | D | -1.156072 | -0.649698 |
| 3 | D | -1.156072 | 0.855154 |


### LEFT OUTER JOIN

Show all records from `df1`.

```sql
-- SQL code
SELECT *
FROM df1
LEFT OUTER JOIN df2
  ON df1.key = df2.key;
```

In [9]:
pd.merge(df1, df2, on="key", how="left")

|   | key | value_x | value_y |
|---|-----|---------|---------|
| 0 | A | -2.857523 | NaN |
| 1 | B | -0.829377 | -0.325485 |
| 2 | C | -1.362061 | NaN |
| 3 | D | -1.156072 | -0.649698 |
| 4 | D | -1.156072 | 0.855154 |


### RIGHT JOIN

Show all records from `df2`.

```sql
-- SQL code
SELECT *
FROM df1
RIGHT OUTER JOIN df2
  ON df1.key = df2.key;
```

In [10]:
pd.merge(df1, df2, on="key", how="right")

|   | key | value_x | value_y |
|---|-----|---------|---------|
| 0 | B | -0.829377 | -0.325485 |
| 1 | D | -1.156072 | -0.649698 |
| 2 | D | -1.156072 | 0.855154 |
| 3 | E | NaN | -0.148022 |


### FULL JOIN

pandas also allows for `FULL JOIN`s, which return every row from both sides of the join, whether or not the join columns find a match. As of writing, `FULL JOIN`s are not supported by every RDBMS (MySQL, for example).

```sql
-- SQL code
SELECT *
FROM df1
FULL OUTER JOIN df2
  ON df1.key = df2.key;
```

In [11]:
pd.merge(df1, df2, on="key", how="outer")

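|   | key | value_x | value_y |
|---|-----|---------|---------|
| 0 | A | -2.857523 | NaN |
| 1 | B | -0.829377 | -0.325485 |
| 2 | C | -1.362061 | NaN |
| 3 | D | -1.156072 | -0.649698 |
| 4 | D | -1.156072 | 0.855154 |
| 5 | E | NaN | -0.148022 |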
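In a full join, missing values show which side had no match. A possible way to make that explicit is the `indicator` parameter of `merge()`, which adds a `_merge` column. The sketch below uses hypothetical frames `left` and `right` with fixed values rather than the random data above.

```python
import pandas as pd

# Hypothetical frames with fixed values so the result is reproducible.
left = pd.DataFrame({"key": ["A", "B", "C", "D"], "value": [1.0, 2.0, 3.0, 4.0]})
right = pd.DataFrame({"key": ["B", "D", "D", "E"], "value": [10.0, 20.0, 30.0, 40.0]})

# indicator=True adds a "_merge" column whose values are
# "left_only", "right_only", or "both".
outer = pd.merge(left, right, on="key", how="outer", indicator=True)

# Keys that appear on only one side of the join.
left_only_keys = outer.loc[outer["_merge"] == "left_only", "key"].tolist()
right_only_keys = outer.loc[outer["_merge"] == "right_only", "key"].tolist()

print(outer)
print("only in left:", left_only_keys)
print("only in right:", right_only_keys)
```

Filtering on `_merge` is also a common way to express an anti-join, for example keeping only the rows of `left` that have no partner in `right`.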
