# Metadata Queries

When we approach a query engine or database for the first time, our first task is to understand what data is available.  Ideally we would have access to some documentation, but we can learn a great deal directly from the database.

Metadata interfaces vary across SQL dialects.  This course relies on a [dask-sql](https://dask-sql.readthedocs.io/en/latest/index.html) query engine, which understands most of the [Presto](https://prestodb.io/) SQL dialect.

In this chapter we explore the metadata syntax supported by dask-sql.  We also briefly note alternative syntax in other popular dialects.

## Setup

Each notebook in this series starts with some boilerplate to import pandas, configure PyHive, and load and configure the `sql` magic extension for Jupyter:

In [1]:
import pandas as pd
import pyhive.sqlalchemy_presto

# always show every column
pd.set_option('display.max_columns', None)
# suppress a SQLAlchemy warning
pyhive.sqlalchemy_presto.PrestoDialect.supports_statement_cache = False

%load_ext sql
%config SqlMagic.autocommit = False
%config SqlMagic.displaycon = False
%config SqlMagic.autopandas = True
%config SqlMagic.feedback = False

%sql presto://localhost:8080/

In subsequent chapters we will suppress this setup in the rendered book.

## `show schemas`

Databases are organized into _schemas_ (or _schemata_).  Each [schema](https://en.wikipedia.org/wiki/SQL/Schemata) specifies the set of available tables, and the names and datatypes of the columns in each table.

Depending on the SQL implementation, a schema may also specify constraints enforced on the values stored in some columns.  dask-sql does not support such constraints, so we defer this topic until later.

In dask-sql we can list the available schemas with `show schemas`:

In [3]:
%%sql

show schemas

Unnamed: 0,Schema
0,root
1,information_schema


In our setup, we have just two schemas:

* **"root"**: This is the default schema — the search path when a query does not specify a schema.
* **"information_schema"**: In many implementations, this is where we find the details about any given schema.  However, in dask-sql, this schema is not populated.

## `show tables`

We can list the available tables with `show tables from <schema-name>`:

In [4]:
%%sql

show tables from root

Unnamed: 0,Table
0,hosts
1,calendar
2,listings
3,reviews


If no schema is specified, `show tables` lists the tables in the "root" schema:

In [5]:
%%sql

show tables

Unnamed: 0,Table
0,hosts
1,calendar
2,listings
3,reviews


As mentioned above, in dask-sql "information_schema" is not populated, so listing its tables fails:

In [6]:
%%sql

show tables from information_schema

(pyhive.exc.DatabaseError) {'message': 'Schema information_schema is not defined.', 'errorCode': 0, 'errorName': "<class 'AttributeError'>", 'errorType': 'USER_ERROR'}
[SQL: show tables from information_schema]
(Background on this error at: https://sqlalche.me/e/14/4xp6)


:::{.callout-note}
At the time of writing, dask-sql appears to be inconsistent here: `show schemas` lists "information_schema", yet `show tables` reports that it is not defined.  If the schema is not defined, it seems it should not be listed.
:::

## `show columns` and `describe`

We can list the columns of a table with `show columns from <table-name>`:

In [7]:
%%sql

show columns from hosts

Unnamed: 0,Column,Type,Extra,Comment
0,host_id,bigint,,
1,url,varchar,,
2,name,varchar,,
3,since,timestamp_with_local_time_zone,,
4,location,varchar,,
5,about,varchar,,
6,response_time,varchar,,
7,response_rate,float,,
8,acceptance_rate,float,,
9,is_superhost,boolean,,


Equivalently, we can `describe <table-name>`:

In [8]:
%%sql

describe hosts

Unnamed: 0,Column,Type,Extra,Comment
0,host_id,bigint,,
1,url,varchar,,
2,name,varchar,,
3,since,timestamp_with_local_time_zone,,
4,location,varchar,,
5,about,varchar,,
6,response_time,varchar,,
7,response_rate,float,,
8,acceptance_rate,float,,
9,is_superhost,boolean,,
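
To inspect every table in one pass, the two statements can be combined from Python. The following is only a sketch under stated assumptions: `run_query` is a hypothetical callable (it is not part of dask-sql or PyHive) that takes a SQL string and returns rows as dictionaries keyed by the column headers shown above. A stub with a small subset of the output above stands in for the query engine so that the example is self-contained.

```python
from typing import Callable, Dict, List

Row = Dict[str, str]


def collect_columns(run_query: Callable[[str], List[Row]]) -> Dict[str, List[Row]]:
    """Map each table listed by `show tables` to its `show columns` rows."""
    catalog = {}
    for row in run_query("show tables"):
        table = row["Table"]
        catalog[table] = run_query(f"show columns from {table}")
    return catalog


def fake_engine(sql: str) -> List[Row]:
    # Hypothetical stand-in for a real connection; it returns a small
    # subset of the output shown in this chapter.
    responses = {
        "show tables": [{"Table": "hosts"}],
        "show columns from hosts": [
            {"Column": "host_id", "Type": "bigint"},
            {"Column": "is_superhost", "Type": "boolean"},
        ],
    }
    return responses[sql]


catalog = collect_columns(fake_engine)
for table, columns in catalog.items():
    for column in columns:
        print(table, column["Column"], column["Type"])
```

In a notebook, `fake_engine` would be replaced by a function that sends the statement to the actual engine and converts the result into rows.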


## Metadata in other implementations

In most SQL implementations, we can query tables under "information_schema" to learn about the available data.  The equivalent of `show tables` would be something like

```sql
select table_name
from information_schema.tables
```

and the equivalent of `show columns from hosts` would be something like

```sql
select column_name, data_type
from information_schema.columns
where table_name = 'hosts'
```

We will spend most of this course studying queries expressed as `select` statements.  There is great beauty in exposing metadata through the same tables/rows/columns model as the data itself: the full power of the SQL language becomes available for metadata analysis.
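
As a small illustration of metadata queried like ordinary data, the sketch below uses SQLite through Python's standard `sqlite3` module. SQLite is only a convenient stand-in here: it does not provide "information_schema", but it exposes the same idea through its `sqlite_master` table and the `pragma_table_info` table-valued function (available in SQLite 3.16 and later). The two toy tables and their columns are made up for the example and are not the course data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table hosts (host_id integer primary key, name text);
    create table listings (listing_id integer primary key, host_id integer, price real);
""")

# One row per (table, column), built entirely from SQLite's metadata.
columns = conn.execute("""
    select m.name as table_name, c.name as column_name
    from sqlite_master as m
    join pragma_table_info(m.name) as c
    where m.type = 'table'
    order by m.name, c.cid
""").fetchall()

# Because metadata is just rows, ordinary aggregation applies:
# count the columns of each table.
column_counts = conn.execute("""
    select m.name as table_name, count(*) as n_columns
    from sqlite_master as m
    join pragma_table_info(m.name) as c
    where m.type = 'table'
    group by m.name
    order by m.name
""").fetchall()

print(columns)
print(column_counts)
```

The second query uses nothing beyond `group by` and `count`, which is the point: no special metadata commands are needed once the metadata is available as a table.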

## Exercises

1. What are the columns of the "listings", "reviews", and "calendar" tables?

2. dask-sql does not track or enforce inter-table relationships.  Can we guess any relationships based solely on column names and types?