<a href="https://colab.research.google.com/github/stevenkhwun/P4DS/blob/main/SQL_in_Python_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Comparison with SQL - Part 3

This notebook is based on the pandas guide [Comparison with SQL](https://pandas.pydata.org/docs/getting_started/comparison/comparison_with_sql.html#compare-with-sql).

**Contents of this notebook**:
* UNION
* LIMIT
* pandas equivalents for some SQL analytic and aggregate functions
* FULL JOIN

In [18]:
import pandas as pd
import numpy as np

In [19]:
url = (
    "https://raw.githubusercontent.com/pandas-dev"
    "/pandas/main/pandas/tests/io/data/csv/tips.csv"
)

In [26]:
tips = pd.read_csv(url)

## UNION

`UNION ALL` can be performed using `concat()`.

In [20]:
df1 = pd.DataFrame(
      {"city": ["Chicago", "San Francisco", "New York City"], "rank": range(1, 4)}
      )
 
df2 = pd.DataFrame(
      {"city": ["Chicago", "Boston", "Los Angeles"], "rank": [1, 4, 5]}
      )

In [21]:
df1

,city,rank
0,Chicago,1
1,San Francisco,2
2,New York City,3


In [22]:
df2

,city,rank
0,Chicago,1
1,Boston,4
2,Los Angeles,5


```SQL
-- SQL code
SELECT city, rank
FROM df1
UNION ALL
SELECT city, rank
FROM df2;
```

In [23]:
pd.concat([df1, df2])

,city,rank
0,Chicago,1
1,San Francisco,2
2,New York City,3
0,Chicago,1
1,Boston,4
2,Los Angeles,5

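The concatenated output keeps each frame's original index labels, so 0, 1 and 2 appear twice. If a fresh 0-to-5 index is preferred, one option is to pass `ignore_index=True`. The sketch below re-creates `df1` and `df2` so it runs on its own; the name `union_all` is only illustrative.

```python
import pandas as pd

df1 = pd.DataFrame(
    {"city": ["Chicago", "San Francisco", "New York City"], "rank": range(1, 4)}
)
df2 = pd.DataFrame(
    {"city": ["Chicago", "Boston", "Los Angeles"], "rank": [1, 4, 5]}
)

# ignore_index=True discards the original labels and renumbers the rows from 0
union_all = pd.concat([df1, df2], ignore_index=True)
print(union_all)
```

The duplicate `Chicago` row is still present, as `UNION ALL` requires; only the index labels change.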

SQL's `UNION` is similar to `UNION ALL`, but `UNION` removes duplicate rows.

```SQL
-- SQL code
SELECT city, rank
FROM df1
UNION
SELECT city, rank
FROM df2;
```

In pandas, you can use `concat()` in conjunction with `drop_duplicates()`.

In [24]:
pd.concat([df1, df2]).drop_duplicates()

,city,rank
0,Chicago,1
1,San Francisco,2
2,New York City,3
1,Boston,4
2,Los Angeles,5

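To see which rows `drop_duplicates()` would remove before removing them, `duplicated()` can be used as a boolean mask. This is a small sketch under the same `df1` and `df2` definitions (repeated here so the block is self-contained); `combined`, `dup_mask`, `dropped` and `kept` are illustrative names.

```python
import pandas as pd

df1 = pd.DataFrame(
    {"city": ["Chicago", "San Francisco", "New York City"], "rank": range(1, 4)}
)
df2 = pd.DataFrame(
    {"city": ["Chicago", "Boston", "Los Angeles"], "rank": [1, 4, 5]}
)

combined = pd.concat([df1, df2], ignore_index=True)

# duplicated() marks a row True when an identical row (all columns) appeared earlier
dup_mask = combined.duplicated()

dropped = combined[dup_mask]   # rows that UNION would discard
kept = combined[~dup_mask]     # same rows as combined.drop_duplicates()
print(dropped)
```

Rows count as duplicates only when every column matches, so a `Chicago` row with a different `rank` would be kept.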

## LIMIT

```SQL
-- SQL code
SELECT * FROM tips
LIMIT 10;
```

In [27]:
tips.head(10)

,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4
5,25.29,4.71,Male,No,Sun,Dinner,4
6,8.77,2.0,Male,No,Sun,Dinner,2
7,26.88,3.12,Male,No,Sun,Dinner,4
8,15.04,1.96,Male,No,Sun,Dinner,2
9,14.78,3.23,Male,No,Sun,Dinner,2


## pandas equivalents for some SQL analytic and aggregate functions

### Top n rows with offset

```MySQL
# MySQL code
SELECT * FROM tips
ORDER BY tip DESC
LIMIT 10 OFFSET 5;
```

In [29]:
tips.nlargest(10 + 5, columns="tip").tail(10)

,total_bill,tip,sex,smoker,day,time,size
183,23.17,6.5,Male,Yes,Sun,Dinner,4
214,28.17,6.5,Female,Yes,Sat,Dinner,3
47,32.4,6.0,Male,No,Sun,Dinner,4
239,29.03,5.92,Male,No,Sat,Dinner,3
88,24.71,5.85,Male,No,Thur,Lunch,2
181,23.33,5.65,Male,Yes,Sun,Dinner,2
44,30.4,5.6,Male,No,Sun,Dinner,4
52,34.81,5.2,Female,No,Sun,Dinner,4
85,34.83,5.17,Female,No,Thur,Lunch,4
211,25.89,5.16,Male,Yes,Sat,Dinner,4

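The `nlargest(...).tail(...)` idiom takes the top `n + offset` rows and keeps the last `n`. An alternative that maps more literally onto `ORDER BY ... LIMIT ... OFFSET` is to sort and then slice by position. The sketch below uses a small made-up frame (`sample`, with distinct tip values) instead of the real `tips` data so that it runs without a download; when tip values tie, the two approaches may order the tied rows differently.

```python
import pandas as pd

# Hypothetical stand-in for tips; the values are invented and contain no ties
sample = pd.DataFrame({"tip": [1.0, 5.5, 3.2, 4.1, 2.7, 6.0, 0.5, 3.9]})

n, offset = 3, 2

# Idiom used above: take the top n + offset rows, keep the last n
via_nlargest = sample.nlargest(n + offset, columns="tip").tail(n)

# Alternative: sort descending, then slice positions offset .. offset + n
via_sort = sample.sort_values("tip", ascending=False).iloc[offset : offset + n]

print(via_sort)
```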

### Top n rows per group

```Oracle
-- Oracle's ROW_NUMBER() analytic function
SELECT * FROM (
  SELECT
    t.*,
    ROW_NUMBER() OVER(PARTITION BY day ORDER BY total_bill DESC) AS rn
  FROM tips t
)
WHERE rn < 3
ORDER BY day, rn;
```

In [None]:
# Rank rows within each day by total_bill (1 = largest), mirroring ROW_NUMBER()
ranked = tips.assign(rn=tips.groupby("day")["total_bill"].rank(method="first", ascending=False))
ranked.query("rn < 3").sort_values(["day", "rn"])


### FULL JOIN

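pandas also supports `FULL JOIN`, which keeps rows from both sides of the dataset whether or not the join columns find a match. At the time of writing, not every RDBMS supports `FULL JOIN` (MySQL, for example, does not). The `df1` and `df2` in this section each need a `key` and a `value` column, as the output shows, unlike the city/rank frames defined in the UNION section.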

```SQL
-- SQL code
SELECT *
FROM df1
FULL OUTER JOIN df2
  ON df1.key = df2.key;
```

In [None]:
pd.merge(df1, df2, on="key", how="outer")

,key,value_x,value_y
0,A,-2.857523,
1,B,-0.829377,-0.325485
2,C,-1.362061,
3,D,-1.156072,-0.649698
4,D,-1.156072,0.855154
5,E,,-0.148022

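The merge above needs frames that have a `key` column. As a self-contained sketch, the block below builds two such frames with the same keys as the output shown (the names `left` and `right` and the numeric values are invented for illustration) and adds `indicator=True` so each row reports which side it came from.

```python
import pandas as pd

# Hypothetical frames; keys match the output above, values are made up
left = pd.DataFrame({"key": ["A", "B", "C", "D"], "value": [1.0, 2.0, 3.0, 4.0]})
right = pd.DataFrame({"key": ["B", "D", "D", "E"], "value": [5.0, 6.0, 7.0, 8.0]})

# how="outer" keeps unmatched rows from both sides; indicator=True adds a
# _merge column with the values left_only, right_only or both
full = pd.merge(left, right, on="key", how="outer", indicator=True)
print(full)
```

Rows marked `left_only` or `right_only` are the ones an inner join would have dropped, and their missing side is filled with `NaN`.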