# Introduction

## Overview

In this course, we aim to become comfortable with SQL `SELECT` queries of the form:

```sql
SELECT [ ALL | DISTINCT ]
    * | <expression> [ [ AS ] <alias> ] [ , ... ]
    [ FROM <from> [ , ... ] ]
    [ WHERE <filter-condition> ]
    [ GROUP BY <group-by> ]
    [ HAVING <having-condition> ]
    [ UNION [ ALL | DISTINCT ] <select> ]
    [ ORDER BY <order-by> [ ASC | DESC ] [ , ... ] ]
    [ LIMIT <end> ]
    [ OFFSET <start> ]
```
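To make the template concrete, here is a minimal sketch that exercises most of its clauses in one query. It uses Python's built-in `sqlite3` module as a stand-in so that it runs without any setup; it is not the engine used in this course, and dialect details may differ. The `listings` table and its rows are hypothetical values invented for illustration.

```python
import sqlite3

# Hypothetical rows, invented for illustration (not real Inside Airbnb data).
rows = [
    ("amsterdam", "Entire home/apt", 200.0),
    ("amsterdam", "Entire home/apt", 100.0),
    ("amsterdam", "Private room", 80.0),
    ("paris", "Entire home/apt", 120.0),
    ("paris", "Entire home/apt", 140.0),
    ("new-york-city", "Entire home/apt", 300.0),
]

# In-memory database as a stand-in for a real query engine.
conn = sqlite3.connect(":memory:")
conn.execute("create table listings (city text, room_type text, price real)")
conn.executemany("insert into listings values (?, ?, ?)", rows)

query = """
select
    city,
    count(*) as n_listings,
    avg(price) as avg_price
from listings
where room_type = 'Entire home/apt'
group by city
having count(*) >= 2
order by avg_price desc
limit 1
"""

# where filters rows, group by/having aggregates and filters groups,
# order by/limit sorts and truncates the final result.
result = conn.execute(query).fetchall()
print(result)
```

Reading the query top to bottom mirrors the template: filter rows, group them, filter the groups, then sort and truncate.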

Some SQL courses drill this syntax without approaching realistic usage.  We want to go beyond one-step exercises that amount to mechanical translations from English to SQL.  We will try to simulate the sort of iterative analysis and data wrangling that is commonplace in real-life data science.

Our primary data source will be public data from [Inside Airbnb](http://insideairbnb.com/get-the-data).

Our SQL implementation will be [dask-sql](https://dask-sql.readthedocs.io/en/latest/index.html).

## Software

dask-sql requires Java 8 or later.  Confirm your Java installation by running:

```sh
java -version
```
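If you would rather check the version requirement programmatically, the following is a minimal sketch. It assumes the usual `java -version` output format, in which Java 8 and earlier report themselves as `1.8`, `1.7`, and so on, while later releases report the major version first. The function names are hypothetical helpers, not part of this repository.

```python
import re
import subprocess


def parse_java_major(version_output: str) -> int:
    """Extract the major Java version from `java -version` output."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        raise ValueError("could not find a version string")
    major = int(match.group(1))
    # Legacy scheme: Java 8 and earlier are reported as "1.8", "1.7", ...
    if major == 1 and match.group(2) is not None:
        return int(match.group(2))
    return major


def installed_java_major() -> int:
    """Run `java -version` and return the major version found."""
    # `java -version` writes its report to stderr, not stdout.
    result = subprocess.run(["java", "-version"], capture_output=True, text=True)
    return parse_java_major(result.stderr)
```

Calling `installed_java_major() >= 8` then gives a quick yes-or-no answer on a machine where `java` is on the `PATH`.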

Now we can install our database-like server in a new virtual environment.  Navigate to the root of this repository — the directory containing `requirements-sql.txt`.  Run the following commands:

```sh
python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip setuptools wheel
pip install -r ./requirements-sql.txt
```

For working through the chapter notebooks and conducting your own analysis, any JupyterLab setup will suffice.  In my local setup, I have a user-global JupyterLab install:

```sh
pip install --user jupyterlab
```

with the following extras:

```sh
pip install SQLAlchemy PyHive ipython-sql jupyterlab-lsp 'python-lsp-server[all]'
jupyter server extension enable --user --py jupyter_lsp

# in dir where jupyter will run (for me, $HOME)
jlpm add --dev sql-language-server
```


## Data

This repository includes a script for downloading data from Inside Airbnb.  Navigate to their [get the data](http://insideairbnb.com/get-the-data) page and mouse over the `listings.csv.gz` links under various cities.  Notice the URLs.  For example:

* Amsterdam: <br/> <http://data.insideairbnb.com/the-netherlands/north-holland/amsterdam/2022-06-05/data/listings.csv.gz>
* Paris: <br/> <http://data.insideairbnb.com/france/ile-de-france/paris/2022-06-06/data/listings.csv.gz>
* New York City: <br/> <http://data.insideairbnb.com/united-states/ny/new-york-city/2022-06-03/data/listings.csv.gz>

Our download script supports any data location of the form `country/state/city/date`.  For example, to fetch data from these three cities, we can use the command:

```sh
# make sure the virtualenv is activated:
# source .venv/bin/activate

./scripts/fetch_airbnb.py \
    the-netherlands/north-holland/amsterdam/2022-06-05 \
    france/ile-de-france/paris/2022-06-06 \
    united-states/ny/new-york-city/2022-06-03
```
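The relationship between a `country/state/city/date` location and the URLs listed above can be sketched as follows. This is an illustration inferred from the three example URLs, not the actual implementation of `fetch_airbnb.py`; `BASE_URL` and `listings_url` are hypothetical names.

```python
# Assumption: the host and path layout are taken from the example URLs above.
BASE_URL = "http://data.insideairbnb.com"


def listings_url(location: str) -> str:
    """Build a listings.csv.gz URL from a `country/state/city/date` location."""
    parts = location.strip("/").split("/")
    if len(parts) != 4:
        raise ValueError(f"expected country/state/city/date, got {location!r}")
    path = "/".join(parts)
    return f"{BASE_URL}/{path}/data/listings.csv.gz"


print(listings_url("the-netherlands/north-holland/amsterdam/2022-06-05"))
```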

There's no shame in working with exactly the same data as the author, but by all means feel free to make it fun and change it up!

## Query engine server

Once we have some data and all the relevant tooling in place, we can start our server:

```sh
# make sure the virtualenv is activated:
# source .venv/bin/activate

./scripts/serve_airbnb.py
```

This starts a server on port 8080 by default (the port is configurable with `--port`).
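Before connecting, it can help to confirm that something is actually listening on that port. The following is a minimal sketch using only the standard library; `server_is_listening` is a hypothetical helper, and a successful TCP connection only shows that the port is open, not that the server is ready to answer queries.

```python
import socket


def server_is_listening(host: str = "localhost", port: int = 8080,
                        timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False
```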

We have a number of options for connecting to this server.  In plain Python, we can connect with [SQLAlchemy](https://www.sqlalchemy.org/) and [Pandas](https://pandas.pydata.org/):

```python
import pandas as pd
from sqlalchemy import create_engine

conn = create_engine('presto://localhost:8080/')
QUERY = lambda q, *a, **kw: pd.read_sql_query(q, conn, *a, **kw)

QUERY("""
show tables
""")
```

|   | Table    |
|---|----------|
| 0 | hosts    |
| 1 | calendar |
| 2 | listings |
| 3 | reviews  |


Throughout this course, we will rely primarily on the `%%sql` magic:

```python
import pandas as pd
import pyhive.sqlalchemy_presto

# always show every column
pd.set_option('display.max_columns', None)
# suppress a SQLAlchemy warning
pyhive.sqlalchemy_presto.PrestoDialect.supports_statement_cache = False

# load and configure SQL extension
%load_ext sql
%config SqlMagic.autocommit = False
%config SqlMagic.displaycon = False
%config SqlMagic.autopandas = True
%config SqlMagic.feedback = False

# connect
%sql presto://localhost:8080/
```

```sql
%%sql

show tables
```

|   | Table    |
|---|----------|
| 0 | hosts    |
| 1 | calendar |
| 2 | listings |
| 3 | reviews  |


You can also configure any other SQL-capable tools to query the local dask-sql server.  For example, Visual Studio Code can run queries using the [SQLTools](https://marketplace.visualstudio.com/items?itemName=mtxr.sqltools) and [Trino Driver](https://marketplace.visualstudio.com/items?itemName=mtxr.sqltools) plugins:

<center>
    <img src="_assets/00-vscode-presto-config.png" width=600/>
</center>

<center>
    <img src="_assets/00-vscode-presto-query.png" width=600 />
</center>

## Code style

A colleague of mine recently told me something to the effect of, "code style is the most boring f\*\*\*ing thing in the world and I never want to think or talk about it ever again."  With that in mind, I'll try to keep this brief.

SQL is mostly case-insensitive, especially for its reserved keywords.  It is extremely common to find SQL code formatted with all reserved keywords in uppercase.  This was typical in many older languages, since computers didn't always have universal support for lowercase letters!  But it can be argued, as in [this Stack Overflow response](https://stackoverflow.com/a/11944733/638083), that SQL has _so many_ keywords, and _so much_ variation across implementations/dialects, that manually uppercasing keywords remains a helpful cue from writer to reader.

You may one day find yourself working as a professional data scientist, and you may find yourself on a team that requires following the uppercasing convention.  This course (and my current position) does not.

Instead, we rely on syntax highlighting (despite its dialect-dependent shortcomings) and, more importantly, judicious use of indentation to clarify our intent.