# Basic Select

One of the simplest things we can do with SQL is extract data.  In this chapter we will get our first look at the contents of our Airbnb dataset, while trying out the following syntax elements:

* `SELECT`
* `FROM`
* `LIMIT`
* `ALL`, `DISTINCT`
* `ORDER BY`
* Comments

We will also introduce syntax for storing result dataframes into Python variables.

In [2]:
#| echo: false

import pandas as pd
import pyhive.sqlalchemy_presto

# always show every column
pd.set_option('display.max_columns', None)
# suppress a SQLAlchemy warning
pyhive.sqlalchemy_presto.PrestoDialect.supports_statement_cache = False

%load_ext sql
%config SqlMagic.autocommit = False
%config SqlMagic.displaycon = False
%config SqlMagic.autopandas = True
%config SqlMagic.feedback = False

%sql presto://localhost:8080/

## `select`

In SQL, virtually all data analysis and extraction queries are expressed as `select` statements.  We begin with the keyword `select` followed by some column specification:

```sql
select
    [column specification]
[...]
```

Perhaps the most intuitive column specification involves one or more columns in the source table(s).  For example:

```sql
select
    id,
    name,
    color
[...]
```

If we are only querying a single table, we can use the column specification `*` to read all columns in the source table:

```sql
select
    *
[...]
```

## `from`

The second step in any query is to specify the source table(s):

```sql
select
    [column specification]
from [source specification]
[...]
```

For example, if we are reading all columns from the "hosts" table (which we will soon be doing) then our query would start like:

```sql
select *
from hosts
[...]
```

:::{.callout-warning}

We almost never want to run a query of exactly the form `select * from <table-name>`.

Instead we will want some sort of filtering and/or aggregation.  Much of this course will be devoted to these topics!

:::

## `limit`

The simplest filter we can apply is `limit <n>`, which restricts the output to at most `n` rows.  Without an `order by` clause, SQL does not guarantee which rows these are, although in practice most engines simply return the _first_ `n` rows they read.  An example would be:

```sql
select *
from hosts
limit 5
```
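
To try the `select` / `from` / `limit` pattern without a running database server, here is a minimal sketch using Python's built-in `sqlite3` module.  SQLite stands in for the course's query engine, and the three-column `hosts` table and its rows are hypothetical toy data, not rows from the Airbnb dataset.

```python
import sqlite3

# Hypothetical toy table; SQLite is a stand-in for the course's query engine.
conn = sqlite3.connect(":memory:")
conn.execute("create table hosts (host_id integer, name text, city text)")
conn.executemany(
    "insert into hosts values (?, ?, ?)",
    [
        (1, "Ada", "paris"),
        (2, "Ben", "amsterdam"),
        (3, "Cleo", "paris"),
        (4, "Dev", "new-york-city"),
    ],
)

# Select two of the three columns and cap the result at two rows.
rows = conn.execute(
    """
    select host_id, name
    from hosts
    limit 2
    """
).fetchall()

print(len(rows))  # 2, even though the table holds 4 rows
```

The sketch only checks how many rows come back, since without `order by` it should not rely on which two rows are returned.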

## First Queries

Now we finally have a query that is _not_ downright irresponsible to execute.  Let's give it a try!

In [2]:
%%sql

select *
from hosts
limit 5

Unnamed: 0,host_id,url,name,since,location,about,response_time,response_rate,acceptance_rate,is_superhost,thumbnail_url,picture_url,neighbourhood,listings_count,total_listings_count,has_profile_pic,identity_verified,calculated_listings_count,calculated_listings_count_entire_homes,calculated_listings_count_private_rooms,calculated_listings_count_shared_rooms,has_email_verification,has_phone_verification,has_work_email_verification,access_date,country,state,city
0,2438,https://www.airbnb.com/users/show/2438,Tasos,2008-08-22T00:00:00+00:00,"New York, New York, United States",,,0.0,0.0,False,https://a0.muscache.com/im/users/2438/profile_...,https://a0.muscache.com/im/users/2438/profile_...,Williamsburg,0.0,0.0,True,True,1,1,0,0,True,True,True,2022-06-03,united-states,ny,new-york-city
1,2571,https://www.airbnb.com/users/show/2571,Teedo,2008-08-27T00:00:00+00:00,"New York, New York, United States",We shared our previous penthouse apartment wit...,within an hour,1.0,0.21,True,https://a0.muscache.com/im/users/2571/profile_...,https://a0.muscache.com/im/users/2571/profile_...,Bedford-Stuyvesant,1.0,1.0,True,True,1,1,0,0,True,True,False,2022-06-03,united-states,ny,new-york-city
2,2782,https://www.airbnb.com/users/show/2782,Matthew,2008-09-07T00:00:00+00:00,"New York, New York, United States",The Basics: \nOutgoing. Curious. Social. Consc...,within a day,0.5,0.18,False,https://a0.muscache.com/im/pictures/user/2e675...,https://a0.muscache.com/im/pictures/user/2e675...,,2.0,2.0,True,True,1,1,0,0,True,True,False,2022-06-03,united-states,ny,new-york-city
3,2787,https://www.airbnb.com/users/show/2787,John,2008-09-07T00:00:00+00:00,"Yonkers, New York, United States",Educated professional living in Brooklyn. I l...,within an hour,1.0,0.92,False,https://a0.muscache.com/im/pictures/user/86745...,https://a0.muscache.com/im/pictures/user/86745...,Gravesend,7.0,7.0,True,True,7,1,4,2,True,True,False,2022-06-03,united-states,ny,new-york-city
4,2845,https://www.airbnb.com/users/show/2845,Jennifer,2008-09-09T00:00:00+00:00,"New York, New York, United States",A New Yorker since (Phone number hidden by Air...,a few days or more,0.39,0.19,False,https://a0.muscache.com/im/pictures/user/50fc5...,https://a0.muscache.com/im/pictures/user/50fc5...,Midtown,6.0,6.0,True,True,3,3,0,0,True,True,True,2022-06-03,united-states,ny,new-york-city


This is roughly equivalent to the Pandas expression `hosts.head()`.

We can easily select just a subset of columns:

In [3]:
%%sql

select
    host_id,
    name,
    since,
    country,
    state,
    city
from hosts
limit 5

Unnamed: 0,host_id,name,since,country,state,city
0,2438,Tasos,2008-08-22T00:00:00+00:00,united-states,ny,new-york-city
1,2571,Teedo,2008-08-27T00:00:00+00:00,united-states,ny,new-york-city
2,2782,Matthew,2008-09-07T00:00:00+00:00,united-states,ny,new-york-city
3,2787,John,2008-09-07T00:00:00+00:00,united-states,ny,new-york-city
4,2845,Jennifer,2008-09-09T00:00:00+00:00,united-states,ny,new-york-city


## `select distinct`

In the above queries, we have relied on the implicit default, `all`, which returns every row, duplicates included.  For example, the last query above can also be written as:

In [4]:
%%sql

select all
    host_id,
    name,
    since,
    country,
    state,
    city
from hosts
limit 5

Unnamed: 0,host_id,name,since,country,state,city
0,2438,Tasos,2008-08-22T00:00:00+00:00,united-states,ny,new-york-city
1,2571,Teedo,2008-08-27T00:00:00+00:00,united-states,ny,new-york-city
2,2782,Matthew,2008-09-07T00:00:00+00:00,united-states,ny,new-york-city
3,2787,John,2008-09-07T00:00:00+00:00,united-states,ny,new-york-city
4,2845,Jennifer,2008-09-09T00:00:00+00:00,united-states,ny,new-york-city


The alternative to `select all` is `select distinct`, which does not return duplicate rows:

In [5]:
%%sql

select distinct
    host_id,
    name,
    since,
    country,
    state,
    city
from hosts
limit 5

Unnamed: 0,host_id,name,since,country,state,city
0,2438,Tasos,2008-08-22T00:00:00+00:00,united-states,ny,new-york-city
1,2571,Teedo,2008-08-27T00:00:00+00:00,united-states,ny,new-york-city
2,2782,Matthew,2008-09-07T00:00:00+00:00,united-states,ny,new-york-city
3,2787,John,2008-09-07T00:00:00+00:00,united-states,ny,new-york-city
4,2845,Jennifer,2008-09-09T00:00:00+00:00,united-states,ny,new-york-city


When a unique key like `host_id` is among the selected columns, no two rows can be identical, so `select distinct` and `select all` return the same rows.  `select distinct` becomes extremely useful when looking at columns that only have a few values over an entire table:

In [6]:
%%sql

select distinct
    country
from hosts

Unnamed: 0,country
0,united-states
1,the-netherlands
2,france


Using `select distinct` we can quickly discern the scope of a table:

In [7]:
%%sql

select distinct
    country,
    state
from hosts

Unnamed: 0,country,state
0,united-states,ny
1,the-netherlands,north-holland
2,france,ile-de-france


In [8]:
%%sql

select distinct
    country,
    state,
    city
from hosts

Unnamed: 0,country,state,city
0,united-states,ny,new-york-city
1,the-netherlands,north-holland,amsterdam
2,france,ile-de-france,paris
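
The difference between `select all` and `select distinct`, and the effect of including a unique key, can be checked on a tiny example.  This is a sketch using Python's built-in `sqlite3` module as a stand-in engine; the table contents are hypothetical toy rows.

```python
import sqlite3

# Hypothetical toy table with two hosts sharing the same country and city.
conn = sqlite3.connect(":memory:")
conn.execute("create table hosts (host_id integer, country text, city text)")
conn.executemany(
    "insert into hosts values (?, ?, ?)",
    [
        (1, "france", "paris"),
        (2, "france", "paris"),
        (3, "the-netherlands", "amsterdam"),
    ],
)

# `all` keeps duplicate (country, city) pairs; `distinct` collapses them.
all_rows = conn.execute("select all country, city from hosts").fetchall()
distinct_rows = conn.execute("select distinct country, city from hosts").fetchall()

# With the unique host_id selected, every row is already distinct.
with_key = conn.execute(
    "select distinct host_id, country, city from hosts"
).fetchall()

print(len(all_rows), len(distinct_rows), len(with_key))  # 3 2 3
```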


## `order by`

We can add an `order by` specification to request the result rows to be sorted:

In [9]:
%%sql

select distinct
    country,
    state,
    city
from hosts
order by
    city

Unnamed: 0,country,state,city
0,the-netherlands,north-holland,amsterdam
1,united-states,ny,new-york-city
2,france,ile-de-france,paris


By default the sort is in ascending order, but we can request descending order by adding `desc` after the column name:

In [10]:
%%sql

select distinct
    country,
    state,
    city
from hosts
order by
    city desc

Unnamed: 0,country,state,city
0,france,ile-de-france,paris
1,united-states,ny,new-york-city
2,the-netherlands,north-holland,amsterdam


We can also sort on multiple columns, applying `desc` only to a subset of those columns:

In [11]:
%%sql

select distinct
    city,
    response_time
from hosts
order by
    city,
    response_time desc

Unnamed: 0,city,response_time
0,amsterdam,
1,amsterdam,within an hour
2,amsterdam,within a few hours
3,amsterdam,within a day
4,amsterdam,a few days or more
5,new-york-city,
6,new-york-city,within an hour
7,new-york-city,within a few hours
8,new-york-city,within a day
9,new-york-city,a few days or more
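
Here is a small sketch of the same mixed-direction sort, run against hypothetical toy rows with Python's built-in `sqlite3` module.  One caveat: where missing values land in a sort depends on the engine.  The output above lists the empty `response_time` first within each city, whereas SQLite treats `NULL` as smaller than any other value and so places it last under `desc`.

```python
import sqlite3

# Hypothetical toy rows; SQLite is a stand-in for the course's query engine.
conn = sqlite3.connect(":memory:")
conn.execute("create table hosts (city text, response_time text)")
conn.executemany(
    "insert into hosts values (?, ?)",
    [
        ("paris", "within an hour"),
        ("amsterdam", "within a day"),
        ("paris", None),
        ("amsterdam", "within an hour"),
        ("paris", "within a day"),
    ],
)

# city sorts ascending (the default); response_time sorts descending within each city.
rows = conn.execute(
    """
    select distinct city, response_time
    from hosts
    order by
        city,
        response_time desc
    """
).fetchall()

for row in rows:
    print(row)
```

Note that `response_time` is sorted as text, so "descending" means reverse alphabetical order rather than slowest to fastest.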


`order by` sorting is applied prior to `limit`.  Thus we can use `order by` as an extremely inefficient way to find the minimum or maximum value of a column (we'll look at the efficient way in the next chapter):

In [12]:
%%sql

select
    calculated_listings_count
from hosts
order by
    calculated_listings_count desc
limit 1

Unnamed: 0,calculated_listings_count
0,391
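
As a sanity check on the `order by ... desc limit 1` trick, the sketch below runs it on hypothetical toy counts with Python's built-in `sqlite3` module and compares the result with Python's own `max`.

```python
import sqlite3

# Hypothetical toy counts; not values from the Airbnb dataset.
counts = [(1, 3), (2, 17), (3, 1), (4, 8)]

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table hosts (host_id integer, calculated_listings_count integer)"
)
conn.executemany("insert into hosts values (?, ?)", counts)

# Sort descending, then keep only the first row: the largest value.
(top_count,) = conn.execute(
    """
    select calculated_listings_count
    from hosts
    order by calculated_listings_count desc
    limit 1
    """
).fetchone()

assert top_count == max(count for _, count in counts)
print(top_count)  # 17
```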


More usefully, we can combine `order by` and `limit` to find out the properties of the top or bottom few entries in some regard.  For example, here are the hosts with the most listings:

In [3]:
%%sql

select
    host_id,
    name,
    since,
    city,
    calculated_listings_count
from hosts
order by
    calculated_listings_count desc
limit 5

Unnamed: 0,host_id,name,calculated_listings_count
0,107434423,Blueground,391
1,33889201,Veeve,233
2,158969505,Untitled,208
3,402191311,GuestReady,204
4,314994947,Blueground,199


## Comments

Like most programming languages, SQL supports code comments.  The most common syntax is the _trailing comment_, which in SQL is indicated by `--`.  For example:

In [4]:
%%sql

select
    host_id,  -- uniquely identifying
    name      -- not uniquely identifying
from hosts
order by
    calculated_listings_count desc
limit 5

Unnamed: 0,host_id,name
0,107434423,Blueground
1,33889201,Veeve
2,158969505,Untitled
3,402191311,GuestReady
4,314994947,Blueground


C-style multi-line _block comments_ of the form `/* [commented content here] */` are also supported by some implementations, including dask-sql:

In [5]:
%%sql

select
    host_id,
    name,
    /*
    skip these fields for now?
    since,
    city,
    */
    calculated_listings_count
from hosts
order by
    calculated_listings_count desc
limit 5

Unnamed: 0,host_id,name,calculated_listings_count
0,107434423,Blueground,391
1,33889201,Veeve,233
2,158969505,Untitled,208
3,402191311,GuestReady,204
4,314994947,Blueground,199
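
Both comment styles also work when a query is embedded in a Python string.  The sketch below uses Python's built-in `sqlite3` module, which accepts `--` and `/* ... */` comments, and a hypothetical one-row table; inspecting the result's column names confirms that the commented-out column never reaches the engine.

```python
import sqlite3

# Hypothetical toy table; SQLite is a stand-in for the course's query engine.
conn = sqlite3.connect(":memory:")
conn.execute("create table hosts (host_id integer, name text, city text)")
conn.execute("insert into hosts values (1, 'Ada', 'paris')")

cursor = conn.execute(
    """
    select
        host_id,  -- trailing comment: ignored up to the end of the line
        /*
        block comment: this column is skipped
        city,
        */
        name
    from hosts
    """
)

# cursor.description holds one entry per result column; the name comes first.
columns = [col[0] for col in cursor.description]
print(columns)  # ['host_id', 'name']
```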


:::{.callout-warning}

This author does not recommend the use of block comments.  Here is [some nerd](https://futhark-lang.org/blog/2017-10-10-block-comments-are-a-bad-idea.html) (I say with love!) arguing convincingly against the inclusion of block comments during language design.  And here is a StackOverflow thread discussing [why Python doesn't have multiline comments](https://stackoverflow.com/questions/397148/why-doesnt-python-have-multiline-comments).

:::

## Storing in Python

The `%%sql` magic offers a strange but adequate syntax for storing query results in Python variables rather than merely displaying them:

In [14]:
%%sql hosts_head <<

select *
from hosts
limit 100

Returning data to local variable hosts_head


From here we can use Pandas to analyze the data.  Let's check a couple of facts about these 100 rows:

In [15]:
hosts_head.head()

Unnamed: 0,host_id,url,name,since,location,about,response_time,response_rate,acceptance_rate,is_superhost,thumbnail_url,picture_url,neighbourhood,listings_count,total_listings_count,has_profile_pic,identity_verified,calculated_listings_count,calculated_listings_count_entire_homes,calculated_listings_count_private_rooms,calculated_listings_count_shared_rooms,has_email_verification,has_phone_verification,has_work_email_verification,access_date,country,state,city
0,2438,https://www.airbnb.com/users/show/2438,Tasos,2008-08-22T00:00:00+00:00,"New York, New York, United States",,,0.0,0.0,False,https://a0.muscache.com/im/users/2438/profile_...,https://a0.muscache.com/im/users/2438/profile_...,Williamsburg,0.0,0.0,True,True,1,1,0,0,True,True,True,2022-06-03,united-states,ny,new-york-city
1,2571,https://www.airbnb.com/users/show/2571,Teedo,2008-08-27T00:00:00+00:00,"New York, New York, United States",We shared our previous penthouse apartment wit...,within an hour,1.0,0.21,True,https://a0.muscache.com/im/users/2571/profile_...,https://a0.muscache.com/im/users/2571/profile_...,Bedford-Stuyvesant,1.0,1.0,True,True,1,1,0,0,True,True,False,2022-06-03,united-states,ny,new-york-city
2,2782,https://www.airbnb.com/users/show/2782,Matthew,2008-09-07T00:00:00+00:00,"New York, New York, United States",The Basics: \nOutgoing. Curious. Social. Consc...,within a day,0.5,0.18,False,https://a0.muscache.com/im/pictures/user/2e675...,https://a0.muscache.com/im/pictures/user/2e675...,,2.0,2.0,True,True,1,1,0,0,True,True,False,2022-06-03,united-states,ny,new-york-city
3,2787,https://www.airbnb.com/users/show/2787,John,2008-09-07T00:00:00+00:00,"Yonkers, New York, United States",Educated professional living in Brooklyn. I l...,within an hour,1.0,0.92,False,https://a0.muscache.com/im/pictures/user/86745...,https://a0.muscache.com/im/pictures/user/86745...,Gravesend,7.0,7.0,True,True,7,1,4,2,True,True,False,2022-06-03,united-states,ny,new-york-city
4,2845,https://www.airbnb.com/users/show/2845,Jennifer,2008-09-09T00:00:00+00:00,"New York, New York, United States",A New Yorker since (Phone number hidden by Air...,a few days or more,0.39,0.19,False,https://a0.muscache.com/im/pictures/user/50fc5...,https://a0.muscache.com/im/pictures/user/50fc5...,Midtown,6.0,6.0,True,True,3,3,0,0,True,True,True,2022-06-03,united-states,ny,new-york-city


In [16]:
hosts_head['neighbourhood'].value_counts().head()

Bedford-Stuyvesant    13
Upper West Side        7
Crown Heights          6
Park Slope             5
Williamsburg           4
Name: neighbourhood, dtype: int64

In [17]:
hosts_head['response_time'].value_counts()

within an hour        27
within a day          23
within a few hours    18
a few days or more     1
Name: response_time, dtype: int64

In [18]:
hosts_head[['response_rate', 'acceptance_rate']].mean()

response_rate      0.6518
acceptance_rate    0.4991
dtype: float64

We can even (very optimistically / recklessly) find the approximate data volume in this way:

In [19]:
%%sql host_ids <<

select host_id
from hosts

Returning data to local variable host_ids


In [20]:
len(host_ids)

73805

In [21]:
approx_hosts_mb = (
    # memory usage in bytes
    hosts_head.memory_usage(deep=True).sum()
    # scale from 100 rows to total length
    * (len(host_ids) / len(hosts_head))
    # convert to megabytes
    / 10**6
)
print(f'The "hosts" table contains about {approx_hosts_mb:.2f} MB of data.')

The "hosts" table contains about 110.92 MB of data.
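
The estimate above is a linear extrapolation: bytes per sampled row, multiplied by the total row count.  Here is a sketch of that arithmetic as a reusable function.  The function name `estimate_total_mb` and the 150,000-byte sample footprint are hypothetical, chosen for illustration; only the row count 73805 comes from the output above.

```python
def estimate_total_mb(sample_bytes: int, sample_rows: int, total_rows: int) -> float:
    """Scale a sample's memory footprint linearly to the full table, in MB."""
    if sample_rows <= 0:
        raise ValueError("sample_rows must be positive")
    bytes_per_row = sample_bytes / sample_rows
    return bytes_per_row * total_rows / 10**6


# 150_000 bytes is a hypothetical footprint for a 100-row sample.
approx = estimate_total_mb(sample_bytes=150_000, sample_rows=100, total_rows=73_805)
print(f"{approx:.2f} MB")  # 110.71 MB
```

The extrapolation assumes that the sampled rows are representative of the whole table.  If the first 100 rows have unusually short or long text fields, the estimate will be off accordingly.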


:::{.callout-warning}

This approach — running a simple `select * from whatever` and then conducting all data manipulation in Python — is widely considered to be **a noob move**.

In the coming chapters, we will learn **how** and **why** to use SQL to prepare the data _before_ bringing it into Python for visualization and other data science tasks.

:::

## Exercises

1. What does the top of the "listings" table look like?

2. What do the tops of the "calendar" and "reviews" tables look like?

3. Based on these few-row previews, which columns (of which tables) contain free-form prose?

4. In which city do we find the listing with the largest `number_of_reviews`?

5. In which city do we find the review with the earliest `review_date`?

6. What value does your chosen dataset lead to in `approx_hosts_mb`?  Can you calculate similar estimates `approx_listings_mb`, `approx_calendar_mb`, `approx_reviews_mb`?

7. How robust are these total memory usage estimates?  How large would a table have to be for this estimation method to fail with a `MemoryError`?

    **Hint:** `select * ... limit 100` is relatively safe, but what is the memory usage of `host_ids`?