# Chapter 1

### SQL in pandas

```
from pandasql import sqldf
import pandas as pd

# Create helper function for easier query execution
execute = lambda q: sqldf(q, globals())

# Load your CSV files into DataFrames
df1 = pd.read_csv("file1.csv")
df2 = pd.read_csv("file2.csv")

# Execute query with a join and store the result
query = """
    SELECT *
    FROM df1
    JOIN df2 ON df1.common_column = df2.common_column
"""
result_df = execute(query)

# Show results
result_df.head()
```

### Window Functions

```
SELECT
    col1,
    AVG(col2 + col3) OVER (
        PARTITION BY cat_col1, cat_col2
        ORDER BY col2 DESC
        ROWS BETWEEN <start> AND <finish>
    ) AS partitioned_avg
FROM table_name;
-- <start> : n PRECEDING, UNBOUNDED PRECEDING, CURRENT ROW (n is a non-negative integer)
-- <finish> : n FOLLOWING, UNBOUNDED FOLLOWING, CURRENT ROW (n is a non-negative integer)
-- PARTITION BY : the calculation restarts for every new combination of values in the listed columns
-- ROWS : the frame is counted in physical rows; RANGE can be used instead to frame by ORDER BY value
-- BETWEEN <start> AND <finish> : the frame boundaries (window size)
-- Some other window functions : ROW_NUMBER(), RANK(), DENSE_RANK(), NTILE(n), LEAD(col_name), LAG(col_name), SUM(), MIN(), MAX(), FIRST_VALUE(), LAST_VALUE() etc.
```
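
To make the template concrete, here is a minimal sketch that runs a moving average through Python's built-in `sqlite3` module. The `sales` table, its columns, and its rows are made up for illustration, and the sketch assumes the SQLite bundled with Python is version 3.25 or newer (the first release with window functions).

```python
import sqlite3

# Hypothetical sample data; table and column names are invented for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, day INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("east", 1, 10.0), ("east", 2, 20.0), ("east", 3, 30.0),
     ("west", 1, 5.0), ("west", 2, 15.0)],
)

# Average of the current row and the row before it, restarted for each region.
query = """
    SELECT region, day, amount,
           AVG(amount) OVER (
               PARTITION BY region
               ORDER BY day
               ROWS BETWEEN 1 PRECEDING AND CURRENT ROW
           ) AS moving_avg
    FROM sales
    ORDER BY region, day
"""
moving_avg_rows = conn.execute(query).fetchall()
for row in moving_avg_rows:
    print(row)
```

The first row of each region averages only itself, because the partition has no preceding row to include in the frame.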

# Chapter 2

### NTILE

- Used for paging: `NTILE(n)` splits the ordered rows into n pages (buckets) that are as equal in size as possible
- Each page contains about total_rows / n rows; when the division is not exact, the earlier pages get one extra row
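
A small sketch of the page sizes, using Python's built-in `sqlite3` module (assuming SQLite 3.25 or newer). The `items` table is hypothetical: ten rows split into three pages.

```python
import sqlite3
from collections import Counter

# Hypothetical table with ids 1..10.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER)")
conn.executemany("INSERT INTO items VALUES (?)", [(i,) for i in range(1, 11)])

# Assign each row to one of 3 pages, in id order.
ntile_rows = conn.execute(
    "SELECT id, NTILE(3) OVER (ORDER BY id) AS page FROM items ORDER BY id"
).fetchall()

# Count how many rows landed on each page.
page_sizes = Counter(page for _, page in ntile_rows)
print(dict(page_sizes))  # {1: 4, 2: 3, 3: 3}
```

Ten rows do not divide evenly by three, so the first page takes the extra row.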

# Chapter 3

### Row vs Range

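- Both define the window frame; the major difference is how they treat rows with equal `ORDER BY` values (ties)
- ROWS = counts physical rows, so each row is treated separately even when it ties with its neighbours
- RANGE = treats rows with the same `ORDER BY` value as a single entity, so tied rows share one result
- ROWS is usually preferred over RANGE
- ROWS behaves like ROW_NUMBER (tied rows get distinct results) whereas RANGE behaves like RANK (tied rows share a result)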

<center><img src="images/03.01.png" style="width: 400px; height: 300px;"/></center>
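
The difference only shows up when the `ORDER BY` column has ties. The sketch below computes a running sum both ways with Python's built-in `sqlite3` module (assuming SQLite 3.25 or newer); the `scores` table and its values are invented for illustration.

```python
import sqlite3

# Hypothetical data with one tie: the value 20 appears twice.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (score INTEGER)")
conn.executemany("INSERT INTO scores VALUES (?)", [(10,), (20,), (20,), (30,)])

query = """
    SELECT score,
           SUM(score) OVER (
               ORDER BY score
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
           ) AS rows_sum,
           SUM(score) OVER (
               ORDER BY score
               RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
           ) AS range_sum
    FROM scores
    ORDER BY score, rows_sum
"""
frame_rows = conn.execute(query).fetchall()
for row in frame_rows:
    print(row)
# (10, 10, 10)
# (20, 30, 50)
# (20, 50, 50)
# (30, 80, 80)
```

With ROWS the two rows scoring 20 get different running sums (30 and 50); with RANGE both are summed together and each shows 50.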

# Chapter 4

### Pivoting a table

```
-- Original query before pivoting
SELECT
Country, Year, COUNT(*) AS Awards
FROM Summer_Medals
WHERE
Country IN ('CHN', 'RUS', 'USA')
AND Year IN (2008, 2012)
AND Medal = 'Gold'
GROUP BY Country, Year
ORDER BY Country ASC, Year ASC;

-- Using pivoting
CREATE EXTENSION IF NOT EXISTS tablefunc;
SELECT * FROM CROSSTAB($$
 -- Start original query
SELECT
Country, Year, COUNT(*) :: INTEGER AS Awards -- make sure to cast
FROM Summer_Medals
WHERE
Country IN ('CHN', 'RUS', 'USA')
AND Year IN (2008, 2012)
AND Medal = 'Gold'
GROUP BY Country, Year
ORDER BY Country ASC, Year ASC;
-- End original query
$$) AS ct (Country VARCHAR, "2008" INTEGER, "2012" INTEGER) -- Country column remains, others are new columns
ORDER BY Country ASC;
```
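
`CROSSTAB` comes from PostgreSQL's `tablefunc` extension. Where that extension is not available, the same shape can be produced with conditional aggregation, which is a different technique from the one above. The sketch below tries it in SQLite through Python's `sqlite3` module; the rows are toy data invented for this example, not real medal counts.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Summer_Medals (Country TEXT, Year INTEGER, Medal TEXT)")
# Toy rows invented for this sketch; not real medal counts.
conn.executemany(
    "INSERT INTO Summer_Medals VALUES (?, ?, ?)",
    [("CHN", 2008, "Gold"), ("CHN", 2008, "Gold"), ("CHN", 2012, "Gold"),
     ("USA", 2008, "Gold"), ("USA", 2012, "Gold"), ("USA", 2012, "Gold"),
     ("USA", 2012, "Silver")],
)

# One SUM(CASE ...) per pivoted year turns row values into columns.
pivot_query = """
    SELECT Country,
           SUM(CASE WHEN Year = 2008 THEN 1 ELSE 0 END) AS "2008",
           SUM(CASE WHEN Year = 2012 THEN 1 ELSE 0 END) AS "2012"
    FROM Summer_Medals
    WHERE Medal = 'Gold'
    GROUP BY Country
    ORDER BY Country
"""
pivot_rows = conn.execute(pivot_query).fetchall()
print(pivot_rows)  # [('CHN', 2, 1), ('USA', 1, 2)]
```

The trade-off is that every output column has to be spelled out by hand, just as the `AS ct (...)` column list must be in the `CROSSTAB` version.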

### Aggregation

```
-- nonagg_col = the columns that are not aggregated
-- agg_col = the columns that are being aggregated with agg funcs
-- agg funcs = count, sum, avg, min, max
SELECT nonagg_col1, nonagg_col2,  COUNT(agg_col) 
FROM table_name
WHERE nonagg_col1 IN ('A','B')
GROUP BY nonagg_col1, nonagg_col2 -- or ROLLUP(nonagg_col1, nonagg_col2) or CUBE(nonagg_col1, nonagg_col2)
HAVING COUNT(agg_col) >10
ORDER BY nonagg_col2 DESC
LIMIT 5;
-- ROLLUP : hierarchical subtotals; grouping columns are dropped one at a time from right to left, ending with a grand total
-- CUBE : subtotals for every possible combination of the grouping columns
```
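
To see which rows `ROLLUP(nonagg_col1, nonagg_col2)` adds, the sketch below emulates it with `UNION ALL` in SQLite, which has no `ROLLUP` keyword of its own. This is an emulation for illustration, not how a database with native `ROLLUP` evaluates it; the `orders` table and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, product TEXT, qty INTEGER)")
# Hypothetical rows invented for this sketch.
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [("east", "a", 1), ("east", "a", 2), ("east", "b", 3), ("west", "a", 4)],
)

# ROLLUP(region, product) is equivalent to three grouping levels stacked together.
rollup_query = """
    SELECT region, product, SUM(qty) FROM orders GROUP BY region, product
    UNION ALL
    SELECT region, NULL, SUM(qty) FROM orders GROUP BY region      -- per-region subtotal
    UNION ALL
    SELECT NULL, NULL, SUM(qty) FROM orders                        -- grand total
"""
rollup_rows = conn.execute(rollup_query).fetchall()
for row in rollup_rows:
    print(row)
```

`CUBE(region, product)` would add one more level on top of these: a per-product subtotal with `region` set to NULL.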

### COALESCE

- Returns the first non-null value in the argument list
- e.g. `SELECT COALESCE(NULL, NULL, NULL, 'HELLO WORLD', NULL, 'Example.com');` returns 'HELLO WORLD'
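
A quick check of that example, plus the common use of filling in a default for a NULL column, sketched with Python's built-in `sqlite3` module. The `contacts` table and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The literal example: the first non-NULL argument wins.
first_non_null = conn.execute(
    "SELECT COALESCE(NULL, NULL, NULL, 'HELLO WORLD', NULL, 'Example.com')"
).fetchone()[0]
print(first_non_null)  # HELLO WORLD

# Hypothetical table: substitute a default where a column is NULL.
conn.execute("CREATE TABLE contacts (name TEXT, phone TEXT)")
conn.executemany(
    "INSERT INTO contacts VALUES (?, ?)",
    [("Ann", "555-0100"), ("Bob", None)],
)
contact_rows = conn.execute(
    "SELECT name, COALESCE(phone, 'n/a') FROM contacts ORDER BY name"
).fetchall()
print(contact_rows)  # [('Ann', '555-0100'), ('Bob', 'n/a')]
```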

### STRING_AGG

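- Concatenates the values of a group into a single string, separated by the given delimiter
- e.g. `SELECT STRING_AGG(Country, ', ') AS ConcatenatedCountries FROM Country_Medals;` returns all countries as one comma-separated string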
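
`STRING_AGG` can be tried locally with SQLite's counterpart, `GROUP_CONCAT`, which takes the same two arguments (value and separator). This is a swapped-in function rather than `STRING_AGG` itself, and the `Country_Medals` rows below are made up for the sketch. SQLite does not promise an order for the concatenated values here, so the code sorts them before printing.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Country_Medals (Country TEXT)")
# Hypothetical rows invented for this sketch.
conn.executemany(
    "INSERT INTO Country_Medals VALUES (?)", [("CHN",), ("RUS",), ("USA",)]
)

# GROUP_CONCAT(value, separator) is SQLite's analogue of STRING_AGG(value, separator).
concatenated = conn.execute(
    "SELECT GROUP_CONCAT(Country, ', ') AS ConcatenatedCountries FROM Country_Medals"
).fetchone()[0]

# The element order is not guaranteed, so compare on the sorted parts.
countries = sorted(concatenated.split(", "))
print(countries)  # ['CHN', 'RUS', 'USA']
```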

### Splitting values

```
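-- T-SQL (SQL Server) syntax
-- Assume you have the CTE
WITH cte AS (
    SELECT STRING_AGG(Country, ', ') AS ConcatenatedCountries
    FROM Country_Medals
)
-- Use STRING_SPLIT with CROSS APPLY
-- STRING_SPLIT only accepts a single-character separator,
-- so split on ',' and trim the leading space from each value
SELECT LTRIM(value) AS Country
FROM cte
CROSS APPLY STRING_SPLIT(cte.ConcatenatedCountries, ',');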
```