In [2]:
import pymysql
from sqlalchemy import create_engine
import pandas as pd
import getpass  # To get the password without showing the input

In [3]:
password = getpass.getpass()
connection_string = 'mysql+pymysql://root:' + password + '@localhost/bank'
engine = create_engine(connection_string)
%load_ext sql
%sql {connection_string}

 ······


'Connected: root@bank'

# Introduction to subqueries

In this section, we introduce the concept of subqueries and show an example of a _self-contained subquery_. Correlated subqueries will be covered in later sessions.

Introduction to subqueries

- Self-contained subqueries
- Correlated subqueries
- Using the results of another query as a table in a database

Subqueries are convenient when you want one query to operate on the result of another and you prefer not to use intermediate objects such as variables for this purpose. Primarily there are two kinds of subqueries:

- `self-contained` - queries that are independent of the outer query.
- `correlated` - queries that have references (correlations) to columns of tables from the outer query.
- [OPTIONAL] `scalar` - queries that return a single value and that are allowed where a single-valued expression is expected.

One of the advantages of self-contained subqueries compared to correlated ones is the ease of troubleshooting. You can always copy the inner query to a separate window and troubleshoot it independently. When you are done troubleshooting and fixing what you need, you can paste the subquery back into the host query. It is also important to note that `self-contained` (`uncorrelated`) subqueries _run only once_ for the entire query.
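The troubleshooting workflow described above can be sketched in a few lines. This is a minimal illustration, not the course's own setup: it assumes an in-memory SQLite database (via Python's standard `sqlite3` module) with a hypothetical four-row `loan` table standing in for `bank.loan`.

```python
import sqlite3

# Hypothetical toy data: an in-memory SQLite table standing in for bank.loan.
conn = sqlite3.connect(":memory:")
conn.execute("create table loan (loan_id integer, amount integer)")
conn.executemany(
    "insert into loan values (?, ?)",
    [(1, 100), (2, 200), (3, 300), (4, 600)],
)

# The inner query is self-contained, so it can be run and checked on its own.
inner_sql = "select avg(amount) from loan"
average = conn.execute(inner_sql).fetchone()[0]

# Pasted back into the host query, it still does not reference any outer row.
outer_sql = (
    f"select loan_id, amount from loan "
    f"where amount > ({inner_sql}) order by loan_id"
)
above_average = conn.execute(outer_sql).fetchall()

print(average)        # average of the four toy amounts
print(above_average)  # rows whose amount exceeds that average
```

Because the inner query returns a single value, it also serves as an example of a `scalar` subquery used where a single-valued expression is expected.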


Let's use the `loan` table from the `bank` database. We want to identify the customers whose loan amount is higher than the average loan amount across all customers. This is not possible with the simple queries that we have used before (without using variables, which we will look at later in the course). For this we will use a subquery.

In [4]:
%%sql
-- step 1: calculate the average
select avg(amount) from bank.loan;

 * mysql+pymysql://root:***@localhost/bank
1 rows affected.


avg(amount)
151169.238


-- step 2: pseudocode for the main goal of this step

select * from bank.loan where amount > "AVERAGE";

In [9]:
%%sql
-- step 3 ... create the query
select * from bank.loan
where amount > (
  select avg(amount)
  from bank.loan
);

 * mysql+pymysql://root:***@localhost/bank
285 rows affected.


loan_id,account_id,date,amount,duration,payments,status
5316,1801,930711,165960,36,4610.0,A
7240,11013,930906,274740,60,4579.0,A
6111,5428,930924,174744,24,7281.0,B
7235,10973,931013,154416,48,3217.0,A
6228,6034,931201,464520,60,7742.0,B
7104,10320,931213,259740,60,4329.0,A
5170,1071,940120,253200,60,4220.0,C
7226,10940,940223,197748,36,5493.0,A
6087,5313,940227,300660,60,5011.0,C
7262,11135,940301,182628,36,5073.0,A


In [10]:
%%sql
-- step 4 - Prettify the result. Let's find the top 5 such loans
select * from bank.loan
where amount > (select avg(amount) from bank.loan)
order by amount desc
limit 5;

 * mysql+pymysql://root:***@localhost/bank
5 rows affected.


loan_id,account_id,date,amount,duration,payments,status
6534,7542,971019,590820,60,9847.0,C
6791,8926,980123,566640,60,9444.0,C
5447,2335,971112,541200,60,9020.0,D
5132,817,950217,538500,60,8975.0,C
5569,2936,980120,504000,60,8400.0,C


# 3.04 Activity 4

Compare the top 10 loans with the top 10% of loans in the `loan` table from the `bank` database.

### Solution:

In [12]:
%%sql
select loan_id, amount from bank.loan
order by amount desc
limit 10

 * mysql+pymysql://root:***@localhost/bank
10 rows affected.


loan_id,amount
6534,590820
6791,566640
5447,541200
5132,538500
5569,504000
6436,495180
7142,482940
6415,475680
6625,473280
5043,468060


In [21]:
%%sql 
select count(loan_id) from bank.loan

 * mysql+pymysql://root:***@localhost/bank
1 rows affected.


count(loan_id)
685


In [48]:
%%sql
select loan_id, amount, (row_number() over (order by amount desc)) / (select count(loan_id) from bank.loan) * 100 as top10percent 
from bank.loan 
limit 5;

 * mysql+pymysql://root:***@localhost/bank
5 rows affected.


loan_id,amount,top10percent
6534,590820,0.146
6791,566640,0.292
5447,541200,0.438
5132,538500,0.5839
5569,504000,0.7299
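The `top10percent` values above are simply each row number divided by the total row count, times 100. As a quick sanity check, the arithmetic can be reproduced in plain Python, assuming the total of 685 loans reported by the `count(loan_id)` query and rounding to four decimal places to match the displayed output:

```python
# Total number of loans, taken from the count(loan_id) output above.
total_loans = 685

# Percentile position for the first five row numbers, rounded to four decimals.
positions = [round(row_number / total_loans * 100, 4) for row_number in range(1, 6)]
print(positions)  # [0.146, 0.292, 0.438, 0.5839, 0.7299]
```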


First try (this query gives an error):

In [61]:
%%sql
select loan_id, amount, (row_number() over (order by amount desc)) / (select count(loan_id) from bank.loan) * 100 as top10percent 
from bank.loan 
where top10percent >= 10
limit 5;

 * mysql+pymysql://root:***@localhost/bank
(pymysql.err.OperationalError) (1054, "Unknown column 'top10percent' in 'where clause'")
[SQL: select loan_id, amount, (row_number() over (order by amount desc)) / (select count(loan_id) from bank.loan) * 100 as top10percent 
from bank.loan 
where top10percent >= 10
limit 5;]
(Background on this error at: http://sqlalche.me/e/13/e3q8)


The error tells us that there is an unknown column _top10percent_. This is because the WHERE clause is evaluated before both the window functions and the SELECT clause, where the alias _top10percent_ is defined, in the logical order of SQL:

1. FROM, JOIN
2. WHERE
3. GROUP BY
4. Aggregation functions
5. HAVING
6. Window functions
7. SELECT
8. DISTINCT
9. UNION/INTERSECT/EXCEPT
10. ORDER BY
11. OFFSET
12. LIMIT/FETCH/TOP
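One way to observe that WHERE runs before window functions is to filter rows and number them in the same query: the numbering is computed only over the rows that survive the filter. The sketch below assumes an in-memory SQLite database (version 3.25 or later, for window function support) and a hypothetical four-row `loan` table rather than the real `bank.loan`.

```python
import sqlite3

# Hypothetical toy data standing in for bank.loan.
conn = sqlite3.connect(":memory:")
conn.execute("create table loan (loan_id integer, amount integer)")
conn.executemany(
    "insert into loan values (?, ?)",
    [(1, 100), (2, 200), (3, 300), (4, 400)],
)

# WHERE removes the largest loan first; row_number() then starts at the
# largest remaining loan, showing that the window function ran afterwards.
rows = conn.execute(
    """
    select loan_id, amount, row_number() over (order by amount desc) as rn
    from loan
    where amount < 400
    order by rn
    """
).fetchall()

print(rows)  # [(3, 300, 1), (2, 200, 2), (1, 100, 3)]
```

If the window function were evaluated first, loan 3 would have been numbered 2 instead of 1.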

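Solution: use a subquery, written here as a common table expression (`with ... as`), so that the outer query can filter on the computed column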

In [60]:
%%sql
with loans_rated as (
    select loan_id, amount, (row_number() over (order by amount desc)) / (select count(loan_id) from bank.loan) * 100 as top10percent 
    from bank.loan 
)

select loan_id, top10percent 
from loans_rated
where top10percent <= 10;

 * mysql+pymysql://root:***@localhost/bank
68 rows affected.


loan_id,top10percent
6534,0.146
6791,0.292
5447,0.438
5132,0.5839
5569,0.7299
6436,0.8759
7142,1.0219
6415,1.1679
6625,1.3139
5043,1.4599
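The same filter can also be written with a subquery in the FROM clause (a derived table) instead of a `with` clause. The following is a minimal sketch under stated assumptions: it uses an in-memory SQLite database (3.25 or later) with 20 hypothetical loans, and it multiplies by `100.0` first because SQLite, unlike MySQL, performs integer division on two integers.

```python
import sqlite3

# Hypothetical toy data: 20 loans with amounts 1000, 2000, ..., 20000.
conn = sqlite3.connect(":memory:")
conn.execute("create table loan (loan_id integer, amount integer)")
conn.executemany(
    "insert into loan values (?, ?)",
    [(i, i * 1000) for i in range(1, 21)],
)

# Inner query shared by both variants: rank every loan as a percentage.
ranked = """
    select loan_id, amount,
           100.0 * row_number() over (order by amount desc)
                 / (select count(loan_id) from loan) as top10percent
    from loan
"""

# Variant 1: common table expression, as in the cell above.
cte_sql = f"""
    with loans_rated as ({ranked})
    select loan_id, top10percent from loans_rated
    where top10percent <= 10
    order by top10percent
"""

# Variant 2: the same inner query used as a derived table in FROM.
derived_sql = f"""
    select loan_id, top10percent from ({ranked}) as loans_rated
    where top10percent <= 10
    order by top10percent
"""

cte_rows = conn.execute(cte_sql).fetchall()
derived_rows = conn.execute(derived_sql).fetchall()
print(cte_rows)      # the top 2 of 20 loans
print(derived_rows)  # same rows as the CTE variant
```

Both forms let the outer WHERE see `top10percent` as an ordinary column, because the window function has already been evaluated inside the inner query.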
