# SQL Exercises - Fugue

This exercise is about using SQL to retrieve information from a database.

To simplify things, instead of connecting to an actual database, we will work with a pandas dataframe and treat it as if it were a database table.

We will be using a library called [Fugue](https://fugue-tutorials.readthedocs.io/tutorials/fugue_sql/index.html) which will enable us to execute SQL commands on the dataframe directly in a notebook.
To run these examples, you will first have to install it with `pip install "fugue[sql]"` (the quotes stop shells such as zsh from interpreting the square brackets).

## About Fugue

Fugue is one way of running SQL commands from a notebook. One of its strengths is that it lets us use SQL-like syntax on different data sources, such as dataframes.

The syntax we show here is almost identical to SQL, except that we need to add the `PRINT` keyword to see the output. By default, `PRINT` shows at most 10 rows for each query, even if there are more results.
Otherwise, the commands we give and the results we get are very similar to what we would see in a real RDBMS.

## Examples

We start with some examples of setting up and using the library.

In [1]:
# Unfortunately `fugue` is going to cause some `FutureWarnings`, so turning them off for now to keep things cleaner
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
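
The cell above silences `FutureWarning` for the rest of the session. If we only wanted to silence it around specific calls, a narrower option is the standard library's `warnings.catch_warnings()` context manager. The sketch below is illustrative only: `noisy_call` is a hypothetical stand-in for a library call that emits a `FutureWarning`, not part of `fugue`.

```python
import warnings


def noisy_call():
    # Hypothetical stand-in for a library function that warns about a future change.
    warnings.warn("this behaviour will change", FutureWarning)
    return "done"


# The filter applies only inside the `with` block;
# the previous warning filters are restored when the block exits.
with warnings.catch_warnings():
    warnings.simplefilter(action="ignore", category=FutureWarning)
    result = noisy_call()

print(result)
```

For a notebook where every cell may trigger the warning, the global filter used above is simpler, which is why we keep it here.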

This next cell sets up `fugue`, which we must do before using it:

In [2]:
import pandas as pd
from fugue_notebook import setup
setup()


As a quick demonstration, we will now create a simple dataframe with `pandas`, and then use `fugue` to play around with it.

In [3]:
df = pd.DataFrame({"numeric": [0, 1, 2], "text": ["A", "B", "C"]})

In [4]:
df.head()

Unnamed: 0,numeric,text
0,0,A
1,1,B
2,2,C


To recreate this with SQL, let's retrieve all values.

In [5]:
%%fsql
SELECT *
FROM df
PRINT

Unnamed: 0,numeric,text
0,0,A
1,1,B
2,2,C


Note the line at the bottom of the output, which gives us the **schema** (the data type for each column).

We can retrieve particular rows that match a condition as usual:

In [6]:
%%fsql
SELECT *
FROM df
WHERE numeric=0
PRINT

Unnamed: 0,numeric,text
0,0,A


We can also select particular columns:

In [7]:
%%fsql
SELECT numeric
FROM df
PRINT

Unnamed: 0,numeric
0,0
1,1
2,2
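
For comparison, here is a minimal sketch of the same three queries run against an actual database. It assumes an in-memory SQLite database (via Python's built-in `sqlite3` module) and a table named `df` that we fill by hand with the same three rows; neither is part of the Fugue setup above. Apart from dropping `PRINT`, the SQL is unchanged.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for a "real" RDBMS.
conn = sqlite3.connect(":memory:")

# Recreate the small dataframe as a table with the same name and columns.
conn.execute("CREATE TABLE df (numeric INTEGER, text TEXT)")
conn.executemany("INSERT INTO df VALUES (?, ?)", [(0, "A"), (1, "B"), (2, "C")])

# The same queries as in the cells above, without the PRINT keyword.
all_rows = conn.execute("SELECT * FROM df").fetchall()
matching_rows = conn.execute("SELECT * FROM df WHERE numeric=0").fetchall()
numeric_column = conn.execute("SELECT numeric FROM df").fetchall()

print(all_rows)
print(matching_rows)
print(numeric_column)

conn.close()
```

Each query returns its rows as a list of tuples, one tuple per row, rather than as a formatted table.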


Now let's get started with the exercises. First, let's download a dataset from `scikit-learn`.

In [8]:
from sklearn.datasets import fetch_california_housing
data_california = fetch_california_housing()

Let's convert this to a dataframe so we can play with it:

In [9]:
california = pd.DataFrame(data=data_california.data, columns=data_california.feature_names)
california['target'] = data_california.target
california.head()

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,target
0,8.3252,41.0,6.984127,1.02381,322.0,2.555556,37.88,-122.23,4.526
1,8.3014,21.0,6.238137,0.97188,2401.0,2.109842,37.86,-122.22,3.585
2,7.2574,52.0,8.288136,1.073446,496.0,2.80226,37.85,-122.24,3.521
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25,3.413
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25,3.422


## Finding rows with a high target value

We can query this to find the rows with a `target` value greater than 4. Note that `fugue` prints only the first 10 rows by default.

In [10]:
%%fsql
SELECT *
FROM california
WHERE target > 4
PRINT

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,target
0,8.3252,41.0,6.984127,1.02381,322.0,2.555556,37.88,-122.23,4.526
1,1.2434,52.0,2.929412,0.917647,396.0,4.658824,37.8,-122.27,5.00001
2,7.5544,40.0,7.631498,1.030581,1616.0,2.470948,37.83,-122.21,4.115
3,6.3624,30.0,5.615385,0.730769,126.0,2.423077,37.81,-122.18,4.833
4,8.8793,52.0,8.972868,1.131783,861.0,3.337209,37.81,-122.23,4.103
5,6.9414,52.0,8.067194,1.067194,671.0,2.652174,37.9,-122.27,4.175
6,8.8342,52.0,8.345205,1.021918,975.0,2.671233,37.89,-122.27,4.305
7,5.7979,52.0,5.981432,1.087533,823.0,2.183024,37.88,-122.26,4.153
8,1.1696,52.0,2.436,0.944,1349.0,5.396,37.87,-122.25,5.00001
9,3.0417,48.0,4.690632,1.126362,1656.0,3.607843,37.86,-122.25,4.896


We can also find how many such rows there are:

In [11]:
%%fsql
SELECT COUNT(*) AS TotalHighTarget
FROM california
WHERE target > 4
PRINT

Unnamed: 0,TotalHighTarget
0,1744
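
Because only the first 10 rows are printed, `COUNT(*)` is how we learn the full size of a result. The sketch below illustrates the relationship on a toy table: the count equals the number of rows the matching `SELECT *` would return. The table name `housing` and the six target values are made up for illustration, and SQLite is used here only as a convenient stand-in.

```python
import sqlite3

# Toy data: six made-up target values, not taken from the California dataset.
toy_targets = [0.5, 1.2, 4.1, 4.8, 5.0, 2.3]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE housing (target REAL)")
conn.executemany("INSERT INTO housing VALUES (?)", [(t,) for t in toy_targets])

# Fetch every matching row, then ask the database to count them instead.
matching_rows = conn.execute("SELECT * FROM housing WHERE target > 4").fetchall()
total_high_target = conn.execute(
    "SELECT COUNT(*) AS TotalHighTarget FROM housing WHERE target > 4"
).fetchone()[0]

print(len(matching_rows), total_high_target)

conn.close()
```

Both numbers agree, but the `COUNT(*)` version only transfers a single value, which matters once the result has thousands of rows.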


To get an idea of the distribution of values in the target column, we can use some aggregate SQL functions:

In [12]:
%%fsql
SELECT AVG(target) AS AverageTarget, MIN(target) AS MinTarget, MAX(target) AS MaxTarget
FROM california
PRINT

Unnamed: 0,AverageTarget,MinTarget,MaxTarget
0,2.068558,0.14999,5.00001


The results show that the target values range from approximately 0.15 to 5, so our choice of 4 as the threshold for a "high" target may be reasonable.
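
Another way to sanity-check the threshold is to ask what share of the rows it selects. The short calculation below uses the count of 1744 reported earlier. The total of 20,640 rows is an assumption taken from the scikit-learn documentation for this dataset; on the live dataframe, `len(california)` should give the same figure.

```python
# Count of rows with target > 4, as reported by the COUNT(*) query earlier.
high_target_rows = 1744

# Assumption: dataset size as documented by scikit-learn for California housing.
total_rows = 20640

share = high_target_rows / total_rows
print(f"{share:.1%} of rows have target > 4")
```

Under that assumption, roughly 8% of the rows count as "high", so the threshold picks out a small but not negligible tail.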

## Focus on older buildings

Find the rows where `HouseAge` is greater than 50 and `Population` is more than 1000.

In [13]:
%%fsql
SELECT *
FROM california
WHERE HouseAge > 50
AND Population > 1000
PRINT

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,target
0,3.6591,52.0,4.931907,0.951362,1094.0,2.128405,37.84,-122.25,2.992
1,3.12,52.0,4.797527,1.061824,1157.0,1.788253,37.84,-122.25,2.414
2,3.6912,52.0,4.970588,0.990196,1551.0,2.172269,37.84,-122.25,2.611
3,3.2705,52.0,4.77248,1.024523,1504.0,2.049046,37.85,-122.26,2.418
4,3.075,52.0,5.32265,1.012821,1098.0,2.346154,37.85,-122.26,2.135
5,1.9167,52.0,4.262903,1.009677,1212.0,1.954839,37.85,-122.26,1.592
6,1.725,52.0,5.096234,1.131799,1015.0,2.123431,37.84,-122.27,1.139
7,2.6,52.0,5.270142,1.035545,1006.0,2.383886,37.84,-122.27,1.326
8,1.808,52.0,4.780856,1.060453,1102.0,2.775819,37.85,-122.28,1.055
9,1.9615,52.0,4.882086,1.090703,1168.0,2.648526,37.84,-122.28,1.152


We can also count how many such rows there are:

In [14]:
%%fsql
SELECT COUNT(*) AS OldAndBusy
FROM california
WHERE HouseAge > 50
AND Population > 1000
PRINT

Unnamed: 0,OldAndBusy
0,521
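
To see how `AND` narrows a result, the sketch below compares each condition on its own with the combined condition. The table name `housing` and its five rows are made up for illustration (they are not California data), and SQLite is again only a stand-in. One row has `HouseAge` exactly 50 to show that `>` is a strict comparison.

```python
import sqlite3

# Toy rows of (HouseAge, Population); all values are made up for illustration.
toy_rows = [(52, 1094), (52, 800), (30, 1500), (51, 1001), (50, 2000)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE housing (HouseAge REAL, Population REAL)")
conn.executemany("INSERT INTO housing VALUES (?, ?)", toy_rows)


def count_where(condition):
    # Helper for this sketch only: count the rows matching a fixed WHERE clause.
    query = f"SELECT COUNT(*) FROM housing WHERE {condition}"
    return conn.execute(query).fetchone()[0]


old_count = count_where("HouseAge > 50")
busy_count = count_where("Population > 1000")
old_and_busy_count = count_where("HouseAge > 50 AND Population > 1000")

print(old_count, busy_count, old_and_busy_count)

conn.close()
```

The combined count can never exceed either individual count, because a row must satisfy both conditions to be included.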


## More advanced keywords

Among the rows where `HouseAge` is less than 30, find the 5 with the highest average number of bedrooms (`AveBedrms`).

**Hint:** You will need the `ORDER BY` and `LIMIT` keywords.
`ORDER BY` is followed by a column name and a sorting direction (`ASC` or `DESC` for ascending or descending, respectively).
`LIMIT` is followed by the maximum number of results we want to retrieve.
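
As a warm-up for the two keywords in the hint, here is a minimal sketch on a toy table. The table name `rooms`, its `name` column, and the four values are made up for illustration, and SQLite stands in for the database.

```python
import sqlite3

# Toy rows of (name, AveBedrms); all values are made up for illustration.
toy_rows = [("a", 1.0), ("b", 3.5), ("c", 2.0), ("d", 0.9)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rooms (name TEXT, AveBedrms REAL)")
conn.executemany("INSERT INTO rooms VALUES (?, ?)", toy_rows)

# Sort from largest to smallest, then keep at most 2 rows.
top_two = conn.execute(
    "SELECT * FROM rooms ORDER BY AveBedrms DESC LIMIT 2"
).fetchall()

# Sort from smallest to largest, then keep at most 2 rows.
bottom_two = conn.execute(
    "SELECT * FROM rooms ORDER BY AveBedrms ASC LIMIT 2"
).fetchall()

print(top_two)     # [('b', 3.5), ('c', 2.0)]
print(bottom_two)  # [('d', 0.9), ('a', 1.0)]

conn.close()
```

The sort is applied before the limit, so `LIMIT` keeps the first rows of the sorted result. In the exercise, a `WHERE` clause goes before `ORDER BY` to filter the rows first.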

In [15]:
%%fsql
SELECT *
FROM california
WHERE HouseAge < 30
ORDER BY AveBedrms DESC
LIMIT 5
PRINT

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,target
0,2.625,25.0,59.875,15.3125,28.0,1.75,40.27,-121.25,0.675
1,1.6154,17.0,62.422222,14.111111,83.0,1.844444,33.97,-114.49,0.875
2,3.2431,14.0,52.848214,11.410714,265.0,2.366071,37.64,-119.02,2.214
3,3.125,11.0,47.515152,11.181818,82.0,2.484848,38.42,-120.19,0.775
4,4.0714,19.0,61.8125,11.0,112.0,2.333333,39.01,-120.06,4.375
