In [1]:
%load_ext sql

In [2]:
%sql postgresql://postgres:postgres@localhost:5432/analysis

'Connected: postgres@analysis'

# Advanced Query Techniques

* a *subquery* is a query nested inside another query
* it typically performs a calculation or logical test whose result, a single value or a set of rows, is passed to the main portion of the query
* syntax:
    * enclose the subquery in parentheses and use it where needed

* UPDATE table
* SET column = (SELECT column FROM table_b WHERE table.column = table_b.column)


* the SELECT subquery supplies the value that SET assigns to the column in each updated row

* WHERE EXISTS (SELECT column FROM table_b WHERE table.column = table_b.column);

* similarly, the second subquery, inside WHERE EXISTS, restricts the update to rows that have a match in table_b
    * without it, rows with no match would have the column set to NULL
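The UPDATE pattern above can be tried end to end with a small sketch. It uses Python's built-in `sqlite3` module instead of the PostgreSQL connection used in this notebook, and the tables `table_a` and `table_b`, their columns, and their values are all made up for illustration.

```python
import sqlite3

# In-memory SQLite database with two hypothetical toy tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_a (id INTEGER PRIMARY KEY, label TEXT);
    CREATE TABLE table_b (id INTEGER PRIMARY KEY, label TEXT);
    INSERT INTO table_a VALUES (1, 'old'), (2, 'old'), (3, 'old');
    INSERT INTO table_b VALUES (1, 'new-1'), (3, 'new-3');
""")

# The first subquery supplies the new value for each row;
# the EXISTS subquery limits the update to rows with a match in table_b.
conn.execute("""
    UPDATE table_a
    SET label = (SELECT label FROM table_b WHERE table_a.id = table_b.id)
    WHERE EXISTS (SELECT 1 FROM table_b WHERE table_a.id = table_b.id)
""")

rows = conn.execute("SELECT id, label FROM table_a ORDER BY id").fetchall()
print(rows)  # [(1, 'new-1'), (2, 'old'), (3, 'new-3')]
```

Row 2 has no match in `table_b`, so the EXISTS filter leaves it untouched.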

## Filtering with Subqueries in a WHERE Clause

* a subquery placed in a **WHERE** clause supplies the value, or set of values, that the filter compares against

### Generating Values for a Query Expression

In [82]:
%%sql

SELECT geo_name,
        state_us_abbreviation,
        p0010001
FROM us_counties_2010
WHERE p0010001 >= (
        SELECT percentile_cont(.9) WITHIN GROUP (ORDER BY p0010001)
        FROM us_counties_2010
        )
ORDER BY p0010001 ASC
LIMIT 15;    

 * postgresql://postgres:***@localhost:5432/analysis
15 rows affected.


geo_name,state_us_abbreviation,p0010001
Sangamon County,IL,197465
Elkhart County,IN,197559
Saginaw County,MI,200169
Mohave County,AZ,200186
St. Louis County,MN,200226
Richmond County,GA,200549
Broome County,NY,200600
Yolo County,CA,200849
Champaign County,IL,201081
Whatcom County,WA,201140


* Using a subquery in a WHERE clause
* we want a query to show which U.S. counties are at or above the 90th percentile, or top 10 percent, for population
* rather than writing two queries, one to calculate the 90th percentile and another to filter the counties by that value,
* you can do both at once using a subquery in a WHERE clause
    * the WHERE clause, which filters on the total population column p0010001, doesn't include a literal value like it normally would
    * instead, after the >= comparison, we provide a second query in parentheses
    * this second query uses the percentile_cont() function to produce the single value that the main query compares against

In [8]:
%%sql

SELECT percentile_cont(.9) WITHIN GROUP (ORDER BY p0010001)
        FROM us_counties_2010

 * postgresql://postgres:***@localhost:5432/analysis
1 rows affected.


percentile_cont
197465.0


* if we run the subquery separately, we see just the result of the percentile function, which is the value the query above compares against
* run the query above with ASC and then with DESC to see this clearly: every county returned has a population at or above this value
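The same shape, a scalar subquery on the right-hand side of a comparison, can be sketched on a toy table. This uses SQLite through Python's `sqlite3` module, which has no `percentile_cont()`, so `avg()` stands in as the threshold calculation. The `counties` table and its values are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE counties (name TEXT, population INTEGER);
    INSERT INTO counties VALUES
        ('A', 100), ('B', 200), ('C', 300), ('D', 400), ('E', 1000);
""")

# Run the subquery on its own to see the single value it produces.
threshold = conn.execute("SELECT avg(population) FROM counties").fetchone()[0]

# Use the same subquery in place of a literal value in the WHERE clause.
filtered = conn.execute("""
    SELECT name, population
    FROM counties
    WHERE population >= (SELECT avg(population) FROM counties)
    ORDER BY population ASC
""").fetchall()

print(threshold)  # 400.0
print(filtered)   # [('D', 400), ('E', 1000)]
```

With ascending order, the first row returned sits right at the threshold, which mirrors what the county query shows.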

### Using a Subquery to Identify Rows to Delete

* a subquery in the WHERE clause of a **DELETE** statement specifies which rows we want to remove from a table

In [None]:
%%sql

CREATE TABLE us_counties_2010_top10 AS
SELECT * FROM us_counties_2010;

DELETE FROM us_counties_2010_top10
WHERE p0010001 < (
    SELECT percentile_cont(.9) WITHIN GROUP (ORDER BY p0010001)
    FROM us_counties_2010_top10
    );

* using a subquery in a WHERE clause with DELETE
* we create a copy of the original table and name it us_counties_2010_top10
* we then delete from the copy every row whose population is below the 90th percentile, so only the top 10 percent by population remain

In [80]:
%%sql

SELECT count(*) FROM us_counties_2010_top10;

 * postgresql://postgres:***@localhost:5432/analysis
1 rows affected.


count
630


* 630 rows remain in the table: these are the counties at or above the 90th-percentile population value, and all rows below it were deleted
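A minimal sketch of the copy-then-delete pattern, again on SQLite via Python's `sqlite3` with an invented `counties` table. Because SQLite lacks `percentile_cont()`, `avg()` is used as a stand-in threshold.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE counties (name TEXT, population INTEGER);
    INSERT INTO counties VALUES
        ('A', 100), ('B', 200), ('C', 300), ('D', 400), ('E', 1000);

    -- Work on a copy so the original table stays intact.
    CREATE TABLE counties_top AS SELECT * FROM counties;
""")

before = conn.execute("SELECT count(*) FROM counties_top").fetchone()[0]

# Delete every row below the threshold computed by the subquery.
conn.execute("""
    DELETE FROM counties_top
    WHERE population < (SELECT avg(population) FROM counties_top)
""")

after = conn.execute("SELECT count(*) FROM counties_top").fetchone()[0]
remaining = conn.execute(
    "SELECT name FROM counties_top ORDER BY name"
).fetchall()

print(before, after, remaining)
```

Checking `count(*)` before and after, as the notebook does, is a quick way to confirm how many rows survived the delete.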

In [79]:
%%sql

SELECT count(*) AS CNT
FROM us_counties_2010
HAVING COUNT(*) > 1;

 * postgresql://postgres:***@localhost:5432/analysis
1 rows affected.


cnt
3143


## Creating Derived Tables with Subqueries

* if your subquery returns rows and columns of data, you can treat that data as a table by placing the subquery in a FROM clause; the result is known as a *derived table*
* this approach is useful when a single query can't perform all the operations you need
* we want to compare the average and the median population: a large gap between them indicates a skewed distribution rather than a roughly symmetric, bell-shaped one
* we do that with a subquery because there is more than one step: first calculate both values, then subtract one from the other

In [78]:
%%sql

SELECT round(calcs.average, 0) AS average,
        calcs.median,
        round(calcs.average - calcs.median, 0) AS median_average_diff
FROM (
        SELECT avg(p0010001) AS average,
                percentile_cont(.5)
                    WITHIN GROUP (ORDER BY p0010001)::numeric(10,1) AS median
        FROM us_counties_2010
        )
AS calcs;

 * postgresql://postgres:***@localhost:5432/analysis
1 rows affected.


average,median,median_average_diff
98233,25857.0,72376


* we have the average
    * which comes from the calcs subquery
* we have the median
    * which also comes from the calcs subquery
* we have the difference between the two
    * which the outer query computes from the two calcs columns
* **the calcs derived table**
    * is a subquery in the **FROM** clause
    * which takes the avg of the population
    * and the 50th percentile (the median)
    * for the median we use percentile_cont(.5) with **WITHIN GROUP** and cast the result to numeric
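The derived-table idea can be sketched on a toy table. This is an illustration under assumptions, not the notebook's PostgreSQL setup: it runs on SQLite through Python's `sqlite3`, the `counties` table is invented, and `py_median` is a hypothetical aggregate registered from Python to stand in for `percentile_cont(.5)`.

```python
import sqlite3
import statistics


class Median:
    """Hypothetical aggregate standing in for percentile_cont(.5)."""

    def __init__(self):
        self.values = []

    def step(self, value):
        # Collect non-NULL inputs, one call per row.
        if value is not None:
            self.values.append(value)

    def finalize(self):
        return statistics.median(self.values) if self.values else None


conn = sqlite3.connect(":memory:")
conn.create_aggregate("py_median", 1, Median)
conn.executescript("""
    CREATE TABLE counties (name TEXT, population INTEGER);
    INSERT INTO counties VALUES
        ('A', 100), ('B', 200), ('C', 300), ('D', 400), ('E', 1000);
""")

# The subquery in FROM acts as a one-row table named calcs;
# the outer query then does arithmetic on its columns.
row = conn.execute("""
    SELECT calcs.average,
           calcs.median,
           calcs.average - calcs.median AS median_average_diff
    FROM (
        SELECT avg(population) AS average,
               py_median(population) AS median
        FROM counties
    ) AS calcs
""").fetchone()

print(row)
```

The single large value ('E') pulls the average above the median, the same kind of skew the county data shows.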

## Joining Derived Tables

In [77]:
%%sql

SELECT census.state_us_abbreviation AS st,
       census.st_population,
       plants.plant_count,
       round((plants.plant_count/census.st_population::numeric(10,1)) * 1000000, 1)
           AS plants_per_million
FROM
    (
         SELECT st,
                count(*) AS plant_count
         FROM meat_poultry_egg_inspect
         GROUP BY st
    )
    AS plants
JOIN
    (
        SELECT state_us_abbreviation,
               sum(p0010001) AS st_population
        FROM us_counties_2010
        GROUP BY state_us_abbreviation
    )
    AS census
ON plants.st = census.state_us_abbreviation
ORDER BY plants_per_million DESC
LIMIT 10;

 * postgresql://postgres:***@localhost:5432/analysis
10 rows affected.


st,st_population,plant_count,plants_per_million
NE,1826341,110,60.2
IA,3046355,149,48.9
VT,625741,27,43.1
HI,1360301,47,34.6
ND,672591,22,32.7
WI,5686986,185,32.5
MN,5303925,161,30.4
AR,2915918,87,29.8
SD,814180,24,29.5
PA,12702379,364,28.7


* the first subquery counts the plants for each state (st)
* the second one sums the county populations to give the population of each state
* we join the two derived tables on the columns st and state_us_abbreviation
* we do the calculation in the SELECT clause at the top
    * we divide the number of plants by the population and then multiply that quotient by 1 million
    * the population is cast to numeric first, because dividing one integer by another would discard the fractional part

## Generating Columns with Subqueries

In [76]:
%%sql

SELECT count(*)
FROM us_counties_2010;

 * postgresql://postgres:***@localhost:5432/analysis
1 rows affected.


count
3143
