In [2]:
import pymysql
from sqlalchemy import create_engine
import pandas as pd
import getpass  # To get the password without showing the input

In [3]:
from urllib.parse import quote_plus  # escapes special characters such as '@' or '/'

password = getpass.getpass()
connection_string = 'mysql+pymysql://root:' + quote_plus(password) + '@localhost/bank'
engine = create_engine(connection_string)
%load_ext sql
%sql {connection_string}


'Connected: root@bank'

### Lesson 4 key concepts

> :clock10: 20 min

- Introduction to correlated subqueries
- Writing simple correlated subqueries

Correlated subqueries contain references, known as _correlations_, to columns from tables in the outer query. They tend to be harder to troubleshoot because you can't run the inner query on its own. To make it runnable in a separate window, you have to substitute each correlation with a constant representing a sample value from your data, and once the fix is in place you have to put the correlations back. That extra round trip makes troubleshooting correlated subqueries more complex and more error-prone.
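
As a minimal sketch of that troubleshooting step, the cell below runs an inner query on its own by binding a sample value where the correlation would normally be. It uses Python's built-in `sqlite3` with a tiny in-memory `loan` table instead of the MySQL `bank` database, and the four rows are made up for illustration only.

```python
import sqlite3

# Toy stand-in for bank.loan (hypothetical rows, not the real data).
conn = sqlite3.connect(":memory:")
conn.execute("create table loan (loan_id integer, amount real, status text)")
conn.executemany(
    "insert into loan values (?, ?, ?)",
    [(1, 100, "A"), (2, 300, "A"), (3, 1000, "B"), (4, 3000, "B")],
)

# The inner query cannot run alone while it refers to l1.status,
# so a sample value is bound in place of the correlation.
inner_debug = "select avg(amount) from loan l2 where l2.status = ?"
avg_for_a = conn.execute(inner_debug, ("A",)).fetchone()[0]
print(avg_for_a)  # average of the two status 'A' loans
```

Binding the sample value as a parameter, rather than pasting it into the SQL text, leaves only one place to change when the correlation is restored.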

Unlike self-contained subqueries, which are executed only once during the execution of the query, correlated subqueries are conceptually evaluated once for each row processed by the outer query. The picture below shows how they work:

![Correlated Subqueries](https://education-team-2020.s3-eu-west-1.amazonaws.com/data-analytics/3.6-correlated_subqueries.png)
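
The row-by-row behaviour can be imitated by hand. The sketch below loops over the outer rows in Python and runs the inner query once per row, then checks that a correlated subquery returns the same loans. It assumes a toy in-memory SQLite table with invented rows; a real database engine is free to optimize the evaluation differently.

```python
import sqlite3

# Toy stand-in for bank.loan (hypothetical rows, not the real data).
conn = sqlite3.connect(":memory:")
conn.execute("create table loan (loan_id integer, amount real, status text)")
conn.executemany(
    "insert into loan values (?, ?, ?)",
    [(1, 100, "A"), (2, 300, "A"), (3, 1000, "B"), (4, 3000, "B")],
)

# Manual emulation: one run of the inner query per outer row.
inner = "select avg(amount) from loan l2 where l2.status = ?"
outer_rows = conn.execute(
    "select loan_id, amount, status from loan order by loan_id"
).fetchall()
selected = []
for loan_id, amount, status in outer_rows:
    group_avg = conn.execute(inner, (status,)).fetchone()[0]
    if amount > group_avg:
        selected.append(loan_id)

# The same logic written as a single correlated subquery.
correlated = """
select loan_id from loan l1
where amount > (
  select avg(amount) from loan l2 where l2.status = l1.status
)
order by loan_id
"""
from_sql = [row[0] for row in conn.execute(correlated)]
print(selected, from_sql)
```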


<summary>Code Sample</summary>

Here we build on the example we used for the self-contained subquery, where we extracted only those loans whose amount was greater than the overall average. Here is the self-contained subquery:

In [7]:
%%sql
select * from bank.loan
where amount > (
  select avg(amount)
  from bank.loan
)
order by amount desc
limit 10;

 * mysql+pymysql://root:***@localhost/bank
10 rows affected.


loan_id,account_id,date,amount,duration,payments,status
6534,7542,971019,590820,60,9847.0,C
6791,8926,980123,566640,60,9444.0,C
5447,2335,971112,541200,60,9020.0,D
5132,817,950217,538500,60,8975.0,C
5569,2936,980120,504000,60,8400.0,C
6436,7049,980522,495180,60,8253.0,C
7142,10451,941219,482940,60,8049.0,D
6415,6950,970212,475680,48,9910.0,C
6625,7966,970907,473280,60,7888.0,D
5043,339,971225,468060,60,7801.0,C


Now we want to find the loans whose amounts are greater than the average within the same status group; i.e., we want to compute the average for each status group and compare each loan's amount with the average of its own group.

In [4]:
%%sql
select * from bank.loan l1
where amount > (
  select avg(amount)
  from bank.loan l2
  where l1.status = l2.status
)
order by amount desc
limit 10;

 * mysql+pymysql://root:***@localhost/bank
10 rows affected.


loan_id,account_id,date,amount,duration,payments,status
6534,7542,971019,590820,60,9847.0,C
6791,8926,980123,566640,60,9444.0,C
5447,2335,971112,541200,60,9020.0,D
5132,817,950217,538500,60,8975.0,C
5569,2936,980120,504000,60,8400.0,C
6436,7049,980522,495180,60,8253.0,C
7142,10451,941219,482940,60,8049.0,D
6415,6950,970212,475680,48,9910.0,C
6625,7966,970907,473280,60,7888.0,D
5043,339,971225,468060,60,7801.0,C
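
The ten rows shown are the same as for the self-contained subquery, because the largest loans exceed both the overall average and the average of their status group. The difference shows up on smaller loans. The sketch below illustrates this on a toy in-memory SQLite table (invented rows, not the `bank` data), and also shows one common alternative formulation: a join against a grouped derived table, here aliased `g` (a name chosen for this example).

```python
import sqlite3

# Toy stand-in for bank.loan (hypothetical rows, not the real data).
conn = sqlite3.connect(":memory:")
conn.execute("create table loan (loan_id integer, amount real, status text)")
conn.executemany(
    "insert into loan values (?, ?, ?)",
    [(1, 100, "A"), (2, 300, "A"), (3, 1000, "B"), (4, 3000, "B")],
)

def loan_ids(sql):
    """Run a query and return the first column of every row as a list."""
    return [row[0] for row in conn.execute(sql)]

# Self-contained subquery: compare with the overall average.
above_overall = loan_ids("""
select loan_id from loan
where amount > (select avg(amount) from loan)
order by loan_id
""")

# Correlated subquery: compare with the average of the same status group.
above_group = loan_ids("""
select loan_id from loan l1
where amount > (
  select avg(amount) from loan l2 where l2.status = l1.status
)
order by loan_id
""")

# Alternative: join against per-status averages computed once.
above_group_join = loan_ids("""
select l.loan_id
from loan l
join (
  select status, avg(amount) as avg_amount
  from loan
  group by status
) g on g.status = l.status
where l.amount > g.avg_amount
order by l.loan_id
""")

print(above_overall, above_group, above_group_join)
```

In this toy data, loan 2 is below the overall average but above the average of its own status group, so it appears only in the per-group results.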


# 3.06 Activity 4

Select the loans whose amount is greater than the average loan amount in their district.

### Solution:

In [5]:
%%sql
select loan_id, account_id, amount
from bank.loan l1
inner join bank.account a1
using (account_id)
where amount > (
  select avg(amount)
  from bank.loan l2
  inner join bank.account a2
  using (account_id)
  where a1.district_id = a2.district_id
)
order by amount desc
limit 5;

 * mysql+pymysql://root:***@localhost/bank
5 rows affected.


loan_id,account_id,amount
6534,7542,590820
6791,8926,566640
5447,2335,541200
5132,817,538500
5569,2936,504000
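
One way to sanity-check a query of this shape is to run it on a few rows where the district averages are easy to compute by hand. The sketch below does that with an in-memory SQLite database; the `account` and `loan` tables contain only the columns the query needs, and all values are invented for illustration.

```python
import sqlite3

# Toy stand-ins for bank.account and bank.loan (hypothetical rows).
conn = sqlite3.connect(":memory:")
conn.executescript("""
create table account (account_id integer, district_id integer);
create table loan (loan_id integer, account_id integer, amount real);
insert into account values (1, 10), (2, 10), (3, 20), (4, 20);
insert into loan values
  (101, 1, 100), (102, 2, 300), (103, 3, 1000), (104, 4, 3000);
""")

# Same structure as the solution: the inner query is correlated
# to the outer one through the district of the outer account.
query = """
select l1.loan_id, a1.district_id, l1.amount
from loan l1
join account a1 using (account_id)
where l1.amount > (
  select avg(l2.amount)
  from loan l2
  join account a2 using (account_id)
  where a2.district_id = a1.district_id
)
order by l1.amount desc
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

District 10 averages 200 and district 20 averages 2000 in this toy data, so only loans 104 and 102 should be returned, in that order.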
