# Multiple WHERE clauses

You've learned about semi joins in the form of nested subqueries within the `WHERE` clause of the main query. In this exercise, you'll familiarize yourself with semi join syntax by thinking through and re-ordering the lines of code provided. Note that subqueries are queries in their own right, so they can have a `WHERE` clause of their own! This is why you see two `WHERE` clauses here.

Your task is to construct a semi join that pulls all records from `economies2019` for countries whose `gross_savings` in the `economies2015` table was below the 2015 global average. The global average `gross_savings` in 2015 was 22.5, and this value is already filled in for you in the lines of code provided.

In [1]:
from pandasql import sqldf
import pandas as pd

# Create helper function for easier query execution
execute = lambda q: sqldf(q, globals())

In [3]:
import pandas as pd
economies2015 = pd.read_csv("dataset/countries/economies2015.csv")
economies2019 = pd.read_csv("dataset/countries/economies2019.csv")
currencies = pd.read_csv("dataset/countries/currencies.csv")
cities = pd.read_csv("dataset/countries/cities.csv")
populations = pd.read_csv("dataset/countries/populations.csv")
languages = pd.read_csv("dataset/countries/languages.csv")
countries = pd.read_csv("dataset/countries/countries.csv")
economies = pd.read_csv("dataset/countries/economies.csv")
countries.rename(columns={'country_name':'name'}, inplace=True)
economies2019.head()


Unnamed: 0,code,year,income_group,gross_savings
0,AGO,2019,Lower middle income,25.524848
1,ALB,2019,Upper middle income,14.499826
2,ARG,2019,Upper middle income,14.285295
3,ARM,2019,Upper middle income,9.815574
4,ATG,2019,High income,26.383427


In [5]:
query = """
SELECT * 
FROM economies2019
WHERE code IN
( SELECT code
FROM economies2015
WHERE gross_savings<22.5)
"""
result_df = execute(query)

# Show results
result_df.head()

Unnamed: 0,code,year,income_group,gross_savings
0,ALB,2019,Upper middle income,14.499826
1,ARG,2019,Upper middle income,14.285295
2,ARM,2019,Upper middle income,9.815574
3,ATG,2019,High income,26.383427
4,BEN,2019,Lower middle income,21.719427
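
The difference between a semi join and an `INNER JOIN` can be sketched on toy data. The example below is only an illustration: it uses Python's built-in `sqlite3` module rather than `pandasql`, and the tables `econ2019` and `econ2015` with their rows are made up, not the course data. It shows two things: the filter applies to the 2015 values (which is why a country such as ATG, with 2019 savings above 22.5, can still appear in the output above), and a semi join never duplicates rows of the outer table.

```python
import sqlite3

# Hypothetical toy tables, not the course data
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE econ2019 (code TEXT, gross_savings REAL);
CREATE TABLE econ2015 (code TEXT, gross_savings REAL);
INSERT INTO econ2019 VALUES ('AAA', 30.0), ('BBB', 12.0), ('CCC', 25.0);
-- 'BBB' appears twice in 2015 on purpose, to show the duplication effect
INSERT INTO econ2015 VALUES ('AAA', 20.0), ('BBB', 10.0), ('BBB', 11.0), ('CCC', 40.0);
""")

# Semi join: keep 2019 rows whose code passes the 2015 filter
semi = conn.execute("""
SELECT code
FROM econ2019
WHERE code IN
  (SELECT code
   FROM econ2015
   WHERE gross_savings < 22.5)
ORDER BY code
""").fetchall()

# Inner join with the same filter: one output row per matching pair
inner = conn.execute("""
SELECT a.code
FROM econ2019 AS a
INNER JOIN econ2015 AS b
ON a.code = b.code
WHERE b.gross_savings < 22.5
ORDER BY a.code
""").fetchall()

print(semi)   # 'AAA' is kept even though its 2019 value is 30.0
print(inner)  # 'BBB' shows up twice
```

The semi join returns each qualifying `econ2019` row once, while the inner join repeats `'BBB'` for every matching 2015 row.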


# Semi join

Let's say you are interested in identifying languages spoken in the Middle East. The `languages` table contains information about languages and countries, but it does not tell you what region the countries belong to. You can build up a semi join by filtering the `countries` table by a particular region, and then using this to further filter the `languages` table.

You'll build up your semi join as you did in the video exercise, block by block, starting with a selection of countries from the `countries` table, and then leveraging a `WHERE` clause to filter the `languages` table by this selection.

In [6]:
query = """
-- Select country code for countries in the Middle East
SELECT code
FROM countries
WHERE region =  'Middle East'
"""
result_df = execute(query)

# Show results
result_df.head()

Unnamed: 0,code
0,ARE
1,ARM
2,AZE
3,BHR
4,GEO


In [7]:
query = """
-- Select unique language names
SELECT DISTINCT name
FROM languages
-- Order by the name of the language
ORDER BY name;
"""
result_df = execute(query)

# Show results
result_df.head()

Unnamed: 0,name
0,Afar
1,Afrikaans
2,Akyem
3,Albanian
4,Alsatian


In [8]:
query = """
SELECT DISTINCT name
FROM languages
-- Add syntax to use bracketed subquery below as a filter
WHERE code IN
    (SELECT code
    FROM countries
    WHERE region = 'Middle East')
ORDER BY name;
"""
result_df = execute(query)

# Show results
result_df.head()

Unnamed: 0,name
0,Arabic
1,Aramaic
2,Armenian
3,Azerbaijani
4,Azeri
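
A semi join written with `IN` can usually also be written with a correlated `EXISTS` subquery. The sketch below compares the two forms on made-up rows, using the built-in `sqlite3` module instead of `pandasql`; the codes and language names are placeholders, not the course data.

```python
import sqlite3

# Hypothetical toy tables, not the course data
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE countries (code TEXT, region TEXT);
CREATE TABLE languages (code TEXT, name TEXT);
INSERT INTO countries VALUES
  ('X1', 'Middle East'), ('X2', 'Middle East'), ('X3', 'Elsewhere');
INSERT INTO languages VALUES
  ('X1', 'Lang A'), ('X2', 'Lang A'), ('X2', 'Lang B'), ('X3', 'Lang C');
""")

# Semi join using IN, as in the exercise
in_query = """
SELECT DISTINCT name
FROM languages
WHERE code IN
  (SELECT code
   FROM countries
   WHERE region = 'Middle East')
ORDER BY name;
"""

# The same filter expressed with a correlated EXISTS subquery
exists_query = """
SELECT DISTINCT name
FROM languages
WHERE EXISTS
  (SELECT 1
   FROM countries
   WHERE countries.code = languages.code
     AND countries.region = 'Middle East')
ORDER BY name;
"""

in_rows = [row[0] for row in conn.execute(in_query)]
exists_rows = [row[0] for row in conn.execute(exists_query)]
print(in_rows, exists_rows)
```

Both queries keep only languages whose `code` belongs to a Middle East country, and `DISTINCT` collapses `'Lang A'`, which is spoken in two of them, into a single row.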


# Diagnosing problems using anti join

Nice work on semi joins! The anti join is a related and powerful joining tool. It can be particularly useful for identifying whether an incorrect number of records appears in a join.

Say you are interested in identifying currencies of Oceanian countries. You have written the following `INNER JOIN`, which returns 15 records. Now, you want to ensure that all Oceanian countries from the `countries` table are included in this result. You'll do this in the first step.

```
SELECT c1.code, name, basic_unit AS currency
FROM countries AS c1
INNER JOIN currencies AS c2
ON c1.code = c2.code
WHERE c1.continent = 'Oceania';
```


If there are any Oceanian countries excluded in this `INNER JOIN`, you want to return the names of these countries. You'll write an anti join to do this in the second step!

In [9]:
query = """
SELECT c1.code, name, basic_unit AS currency
FROM countries AS c1
INNER JOIN currencies AS c2
ON c1.code = c2.code
WHERE c1.continent = 'Oceania';
"""
result_df = execute(query)

# Show results
result_df.head()

Unnamed: 0,code,name,currency
0,AUS,Australia,Australian dollar
1,KIR,Kiribati,Australian dollar
2,MHL,Marshall Islands,United States dollar
3,NRU,Nauru,Australian dollar
4,PLW,Palau,United States dollar


In [10]:
query = """
-- Select code and name of countries from Oceania
SELECT code, name
FROM countries
WHERE continent = 'Oceania'
"""
result_df = execute(query)

# Show results
result_df.head()

Unnamed: 0,code,name
0,ASM,American Samoa
1,AUS,Australia
2,FJI,Fiji Islands
3,GUM,Guam
4,KIR,Kiribati


In [11]:
query = """
SELECT code, name
FROM countries
WHERE continent = 'Oceania'
-- Filter for countries not included in the bracketed subquery
  AND code NOT IN
    (SELECT code
    FROM currencies);
"""
result_df = execute(query)

# Show results
result_df.head()

Unnamed: 0,code,name
0,ASM,American Samoa
1,FJI,Fiji Islands
2,GUM,Guam
3,FSM,"Micronesia, Federated States of"
4,MNP,Northern Mariana Islands
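
One caveat with `NOT IN` anti joins is worth knowing: if the subquery returns a `NULL`, the `NOT IN` comparison is never true and the query returns no rows at all. The sketch below shows this on made-up rows with the built-in `sqlite3` module, and contrasts it with a `NOT EXISTS` anti join. Whether the course's `currencies` table contains `NULL` codes is not known; the `NULL` row here is an assumption added for illustration.

```python
import sqlite3

# Hypothetical toy tables; the NULL code is added on purpose
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE countries (code TEXT, name TEXT);
CREATE TABLE currencies (code TEXT, basic_unit TEXT);
INSERT INTO countries VALUES ('X1', 'Country One'), ('X2', 'Country Two');
INSERT INTO currencies VALUES ('X1', 'Unit'), (NULL, 'Orphan unit');
""")

# Anti join with NOT IN: the NULL in the subquery makes every test unknown
not_in = conn.execute("""
SELECT name
FROM countries
WHERE code NOT IN
  (SELECT code
   FROM currencies)
""").fetchall()

# Anti join with NOT EXISTS: unaffected by the NULL row
not_exists = conn.execute("""
SELECT name
FROM countries
WHERE NOT EXISTS
  (SELECT 1
   FROM currencies
   WHERE currencies.code = countries.code)
""").fetchall()

print(not_in)      # empty
print(not_exists)  # the country without a currency row
```

If the column in the subquery can contain `NULL`, either filter those rows out inside the subquery (`WHERE code IS NOT NULL`) or use the `NOT EXISTS` form.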


# Subquery inside WHERE

In this exercise, you will nest a subquery on the `populations` table inside another query on the same table. Your goal is to figure out which countries had high average life expectancies in 2015.

You can use SQL to do calculations for you. Suppose you only want records from 2015 with `life_expectancy` above `1.15 * avg_life_expectancy`. You could use the following SQL query.
```
SELECT *
FROM populations
WHERE life_expectancy > 1.15 * avg_life_expectancy
  AND year = 2015;
```
In the first step, you'll write a query to calculate a value for `avg_life_expectancy`. In the second step, you will nest this calculation into another query.

In [14]:
query = """
-- Select average life_expectancy from the populations table
SELECT AVG(life_expectancy)
FROM populations
-- Filter for the year 2015
WHERE year = 2015
"""
result_df = execute(query)

# Show results
result_df.head()

Unnamed: 0,AVG(life_expectancy)
0,71.676342


In [15]:
query = """
SELECT *
FROM populations
-- Keep only records where life expectancy exceeds 1.15 times the 2015 average
WHERE life_expectancy > 1.15 *
  (SELECT AVG(life_expectancy)
   FROM populations
   WHERE year = 2015)
  AND year = 2015;
"""
result_df = execute(query)

# Show results
result_df.head()

Unnamed: 0,pop_id,country_code,year,fertility_rate,life_expectancy,size
0,21,AUS,2015,1.833,82.45122,23789752.0
1,376,CHE,2015,1.54,83.197561,8281430.0
2,356,ESP,2015,1.32,83.380488,46443994.0
3,134,FRA,2015,2.01,82.670732,66538391.0
4,170,HKG,2015,1.195,84.278049,7305700.0
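
To see that the nested query is just the two steps combined, the sketch below runs both versions on made-up rows with the built-in `sqlite3` module: first the average is computed and passed in as a parameter, then the same average is computed by a scalar subquery. The table contents are hypothetical. The extra 2010 row is there to show why the outer `year = 2015` filter is still needed.

```python
import sqlite3

# Hypothetical toy table, not the course data
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE populations (country_code TEXT, year INTEGER, life_expectancy REAL);
INSERT INTO populations VALUES
  ('X1', 2015, 50.0), ('X2', 2015, 60.0), ('X3', 2015, 100.0),
  ('X3', 2010, 200.0);
""")

# Step 1: compute the 2015 average on its own
avg_2015 = conn.execute("""
SELECT AVG(life_expectancy)
FROM populations
WHERE year = 2015
""").fetchone()[0]

# Step 2a: plug the pre-computed value in as a query parameter
two_step = conn.execute("""
SELECT country_code
FROM populations
WHERE life_expectancy > 1.15 * ?
  AND year = 2015
""", (avg_2015,)).fetchall()

# Step 2b: let a scalar subquery compute the same value inline
nested = conn.execute("""
SELECT country_code
FROM populations
WHERE life_expectancy > 1.15 *
  (SELECT AVG(life_expectancy)
   FROM populations
   WHERE year = 2015)
  AND year = 2015
""").fetchall()

print(avg_2015, two_step, nested)
```

Without the outer `AND year = 2015`, the 2010 row would also pass the threshold and appear in the result.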


# WHERE do people live?

In this exercise, you will strengthen your knowledge of subquerying by identifying capital cities in order of largest to smallest population.

Follow the instructions below to get the urban area population for capital cities only. You'll use the `countries` and `cities` tables displayed in the console to help identify columns of interest as you build your query.

In [16]:
query = """
-- Select relevant fields from cities table
SELECT  name, country_code, urbanarea_pop
-- Filter using a subquery on the countries table
FROM cities
WHERE name IN ( SELECT capital FROM countries)
ORDER BY urbanarea_pop DESC;
"""
result_df = execute(query)

# Show results
result_df.head()

Unnamed: 0,name,country_code,urbanarea_pop
0,Beijing,CHN,21516000
1,Dhaka,BGD,14543124
2,Tokyo,JPN,13513734
3,Moscow,RUS,12197596
4,Cairo,EGY,10230350
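
Matching on the city name alone assumes that no non-capital city shares its name with a capital elsewhere. Whether the course data contains such collisions is not known, so the sketch below uses made-up rows (with the built-in `sqlite3` module) to show what would happen if it did, and how a correlated subquery that also matches the country code avoids it.

```python
import sqlite3

# Hypothetical toy tables with a deliberate name collision
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE countries (code TEXT, capital TEXT);
CREATE TABLE cities (name TEXT, country_code TEXT, urbanarea_pop INTEGER);
INSERT INTO countries VALUES ('X1', 'Alpha'), ('X2', 'Beta');
-- 'Alpha' exists in both countries but is the capital of X1 only
INSERT INTO cities VALUES ('Alpha', 'X1', 500), ('Alpha', 'X2', 300), ('Beta', 'X2', 900);
""")

# Filter on name only, as in the exercise
by_name = conn.execute("""
SELECT name, country_code
FROM cities
WHERE name IN (SELECT capital FROM countries)
ORDER BY urbanarea_pop DESC
""").fetchall()

# Filter on name and country code together
by_name_and_code = conn.execute("""
SELECT name, country_code
FROM cities
WHERE EXISTS
  (SELECT 1
   FROM countries
   WHERE countries.capital = cities.name
     AND countries.code = cities.country_code)
ORDER BY urbanarea_pop DESC
""").fetchall()

print(by_name)
print(by_name_and_code)
```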


# Subquery inside SELECT

There are often multiple ways to produce the same result in SQL. You saw that subqueries can provide an alternative to joins to obtain the same result.

In this exercise, you'll go further in exploring how some queries can be written using either a join or a subquery.

In Step 1, you'll begin with a `LEFT JOIN` combined with a `GROUP BY` to select the nine countries with the most cities appearing in the `cities` table, along with the counts of these cities. In Step 2, you'll write a query that returns the same result as the join, but leveraging a nested query instead.

In [17]:
query = """
-- Find top nine countries with the most cities
SELECT countries.name AS country, COUNT(cities.name) AS cities_num
FROM countries
LEFT JOIN cities
ON countries.code = cities.country_code
-- Group by country, then order by count of cities as cities_num
GROUP BY country
ORDER BY cities_num DESC
LIMIT 9;
"""
result_df = execute(query)

# Show results
result_df.head()

Unnamed: 0,country,cities_num
0,China,36
1,India,18
2,Japan,11
3,Brazil,10
4,United States,9


In [18]:
query = """
SELECT countries.name AS country,
-- Subquery that provides the count of cities 
  (SELECT COUNT(*) 
  FROM cities 
  WHERE countries.code = cities.country_code) AS cities_num
FROM countries

ORDER BY cities_num DESC, country
LIMIT 9;
"""
result_df = execute(query)

# Show results
result_df.head()

Unnamed: 0,country,cities_num
0,China,36
1,India,18
2,Japan,11
3,Brazil,10
4,Pakistan,9
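
The two outputs above differ in their fifth row (United States versus Pakistan, both with 9 cities). This is most likely a tie being broken differently: the join version orders by `cities_num` only, so the order among tied countries is not guaranteed. The sketch below, on made-up rows with the built-in `sqlite3` module, uses the same tie-breaker (`country`) in both queries so that the join and the subquery return identical results.

```python
import sqlite3

# Hypothetical toy tables: two countries tied on 2 cities, one with none
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE countries (code TEXT, name TEXT);
CREATE TABLE cities (name TEXT, country_code TEXT);
INSERT INTO countries VALUES ('X1', 'Beta'), ('X2', 'Alpha'), ('X3', 'Gamma');
INSERT INTO cities VALUES ('c1', 'X1'), ('c2', 'X1'), ('c3', 'X2'), ('c4', 'X2');
""")

# Version 1: LEFT JOIN with GROUP BY
join_rows = conn.execute("""
SELECT countries.name AS country, COUNT(cities.name) AS cities_num
FROM countries
LEFT JOIN cities
ON countries.code = cities.country_code
GROUP BY countries.name
ORDER BY cities_num DESC, country
""").fetchall()

# Version 2: correlated subquery inside SELECT
subquery_rows = conn.execute("""
SELECT countries.name AS country,
  (SELECT COUNT(*)
   FROM cities
   WHERE countries.code = cities.country_code) AS cities_num
FROM countries
ORDER BY cities_num DESC, country
""").fetchall()

print(join_rows)
print(subquery_rows)
```

Note that `COUNT(cities.name)` in the join version ignores the `NULL` produced by the `LEFT JOIN` for a country without cities, so both versions report 0 for it.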


# Subquery inside FROM

Subqueries inside `FROM` can help select columns from multiple tables in a single query.

Say you are interested in determining the number of languages spoken for each country. You want to present this information alongside each country's `local_name`, which is a field only present in the `countries` table and not in the `languages` table. You'll use a subquery inside `FROM` to bring information from these two tables together!

In [19]:
query = """
-- Select code, and language count as lang_num
SELECT languages.code, COUNT(languages.name) AS lang_num
FROM languages
GROUP BY languages.code
"""
result_df = execute(query)

# Show results
result_df.head()

Unnamed: 0,code,lang_num
0,ABW,7
1,AFG,4
2,AGO,12
3,AIA,1
4,ALB,4


In [20]:
query = """
-- Select local_name and lang_num from appropriate tables
SELECT local_name, lang_num
FROM countries,
  (SELECT code, COUNT(*) AS lang_num
  FROM languages
  GROUP BY code) AS sub
-- Where codes match
WHERE countries.code = sub.code
ORDER BY lang_num DESC;
"""
result_df = execute(query)

# Show results
result_df.head()

Unnamed: 0,local_name,lang_num
0,Zambia,19
1,YeItyop´iya,16
2,Zimbabwe,16
3,Bharat/India,14
4,Nepal,14
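
The comma in `FROM countries, (...) AS sub` together with the matching condition in `WHERE` behaves like an inner join. As a sketch, the example below runs the comma form and an explicit `INNER JOIN ... ON` form on made-up rows with the built-in `sqlite3` module; the local names are placeholders, not the course data.

```python
import sqlite3

# Hypothetical toy tables; 'X3' has no languages recorded
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE countries (code TEXT, local_name TEXT);
CREATE TABLE languages (code TEXT, name TEXT);
INSERT INTO countries VALUES
  ('X1', 'Local One'), ('X2', 'Local Two'), ('X3', 'Local Three');
INSERT INTO languages VALUES
  ('X1', 'a'), ('X1', 'b'), ('X1', 'c'), ('X2', 'a');
""")

# Subquery in FROM, joined with a comma and a WHERE condition
comma_rows = conn.execute("""
SELECT local_name, lang_num
FROM countries,
  (SELECT code, COUNT(*) AS lang_num
   FROM languages
   GROUP BY code) AS sub
WHERE countries.code = sub.code
ORDER BY lang_num DESC, local_name;
""").fetchall()

# Same subquery, joined with explicit INNER JOIN ... ON
join_rows = conn.execute("""
SELECT local_name, lang_num
FROM countries
INNER JOIN
  (SELECT code, COUNT(*) AS lang_num
   FROM languages
   GROUP BY code) AS sub
ON countries.code = sub.code
ORDER BY lang_num DESC, local_name;
""").fetchall()

print(comma_rows)
print(join_rows)
```

In both forms, a country with no rows in `languages` drops out of the result, because the subquery produces no `code` for it to match.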


# Subquery challenge

You're near the finish line! Test your understanding of subquerying with a challenge problem.

Suppose you're interested in analyzing inflation and unemployment rate for certain countries in 2015. You are not interested in countries with "Republic" or "Monarchy" as their form of government, but are interested in all other forms of government, such as emirate federations, socialist states, and commonwealths.

You will use the field `gov_form`, which represents a country's form of government, to filter out these two categories. You can review the different entries for `gov_form` in the `countries` table.

In [21]:
query = """
-- Select relevant fields
SELECT code, inflation_rate, unemployment_rate
FROM economies
WHERE year = 2015 
  AND code NOT IN
-- Subquery returning country codes filtered on gov_form
  (SELECT code
   FROM countries
   WHERE gov_form LIKE '%epublic%' OR gov_form LIKE '%onarch%')
ORDER BY inflation_rate;
"""
result_df = execute(query)

# Show results
result_df.head()

Unnamed: 0,code,inflation_rate,unemployment_rate
0,AFG,-1.549,
1,CHE,-1.14,3.178
2,PRI,-0.751,12.0
3,ROU,-0.596,6.812
4,TLS,0.553,
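
The `%` wildcards in the `LIKE` patterns match any text before and after the fragment, so longer entries that merely contain the fragment are excluded too. The sketch below illustrates this with the built-in `sqlite3` module; the `gov_form` values are made up for the example and may not match the entries in the course data.

```python
import sqlite3

# Hypothetical gov_form values, not the course data
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE countries (code TEXT, gov_form TEXT);
INSERT INTO countries VALUES
  ('X1', 'Federal Republic'),
  ('X2', 'Constitutional Monarchy'),
  ('X3', 'Emirate Federation'),
  ('X4', 'Socialistic State');
""")

# Codes matched by the two patterns (these get excluded)
excluded = [row[0] for row in conn.execute("""
SELECT code
FROM countries
WHERE gov_form LIKE '%epublic%' OR gov_form LIKE '%onarch%'
ORDER BY code
""")]

# Codes that survive the NOT IN filter
kept = [row[0] for row in conn.execute("""
SELECT code
FROM countries
WHERE code NOT IN
  (SELECT code
   FROM countries
   WHERE gov_form LIKE '%epublic%' OR gov_form LIKE '%onarch%')
ORDER BY code
""")]

print(excluded, kept)
```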


# Final challenge

You've made it to the final challenge problem! Get ready to tackle this step-by-step.

Your task is to determine the top 10 capital cities in Europe and the Americas by `city_perc`, a metric you'll calculate. `city_perc` expresses the "proper" population of a city as a percentage of the total population in the wider metro area, calculated as follows:

`city_proper_pop / metroarea_pop * 100`

Do not use table aliasing in this exercise.

In [22]:
query = """
-- Select fields from cities
SELECT cities.name, cities.country_code, cities.city_proper_pop, cities.metroarea_pop,
  (cities.city_proper_pop / cities.metroarea_pop * 100) AS city_perc
FROM cities
-- Use subquery to filter city name
WHERE cities.name IN
  (SELECT capital
   FROM countries
   WHERE countries.continent LIKE 'Europe' OR countries.continent LIKE '%America')
-- Add filter condition such that metroarea_pop does not have null values
  AND cities.metroarea_pop IS NOT NULL
-- Sort and limit the result
ORDER BY city_perc DESC
LIMIT 10
"""
result_df = execute(query)

# Show results
result_df.head()

Unnamed: 0,name,country_code,city_proper_pop,metroarea_pop,city_perc
0,Lima,PER,8852000,10750000.0,82.344186
1,Bogota,COL,7878783,9800000.0,80.395745
2,Moscow,RUS,12197596,16170000.0,75.433494
3,Vienna,AUT,1863881,2600000.0,71.687731
4,Montevideo,URY,1305082,1947604.0,67.009618
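
One detail behind the `city_perc` formula: in SQLite, which `pandasql` uses by default, dividing one integer by another truncates the result. The query above still produces decimals because `metroarea_pop` appears to be stored as a floating-point column (note the `.0` in the output). The sketch below, using the built-in `sqlite3` module with made-up numbers, shows what would happen if both operands were integers, and two ways to force real division.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Integer / integer truncates before the multiplication
int_div = conn.execute("SELECT 3 / 4 * 100").fetchone()[0]

# Multiplying by a real number first forces real arithmetic
real_div = conn.execute("SELECT 3 * 100.0 / 4").fetchone()[0]

# An explicit cast has the same effect
cast_div = conn.execute("SELECT CAST(3 AS REAL) / 4 * 100").fetchone()[0]

print(int_div, real_div, cast_div)
```

If both population columns were integers, writing the formula as `city_proper_pop * 100.0 / metroarea_pop` would be a safer form.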
