# Remembering what is LEFT

To write queries faster, it helps to memorize their structure. In this exercise, you will reconstruct the order of the steps of a `LEFT JOIN` from memory!

<center><img src="images/02.05.png" style="width: 400px; height: 300px;"/></center>


# This is a LEFT JOIN, right?

As before, you will be using the `cities` and `countries` tables.

You'll begin with an `INNER JOIN` with the `cities` table (left) and `countries` table (right). This helps if you are interested only in records where a country is present in both tables.

You'll then change to a `LEFT JOIN`. This helps if you're interested in returning every record in the `cities` table, whether or not its country code has a match in the `countries` table.

In [1]:
from pandasql import sqldf
import pandas as pd

# Helper that runs a SQL query against the DataFrames defined in this notebook
def execute(q):
    return sqldf(q, globals())

In [21]:
import pandas as pd
currencies = pd.read_csv("dataset/countries/currencies.csv")
cities = pd.read_csv("dataset/countries/cities.csv")
populations = pd.read_csv("dataset/countries/populations.csv")
languages = pd.read_csv("dataset/countries/languages.csv")
countries = pd.read_csv("dataset/countries/countries.csv")
economies = pd.read_csv("dataset/countries/economies.csv")
countries.rename(columns={'country_name':'name'}, inplace=True)
countries.head()


Unnamed: 0,code,name,continent,region,surface_area,indep_year,local_name,gov_form,capital,cap_long,cap_lat
0,AFG,Afghanistan,Asia,Southern and Central Asia,652090.0,1919.0,Afganistan/Afqanestan,Islamic Emirate,Kabul,69.1761,34.5228
1,NLD,Netherlands,Europe,Western Europe,41526.0,1581.0,Nederland,Constitutional Monarchy,Amsterdam,4.89095,52.3738
2,ALB,Albania,Europe,Southern Europe,28748.0,1912.0,Shqiperia,Republic,Tirane,19.8172,41.3317
3,DZA,Algeria,Africa,Northern Africa,2381740.0,1962.0,Al-Jazair/Algerie,Republic,Algiers,3.05097,36.7397
4,ASM,American Samoa,Oceania,Polynesia,199.0,,Amerika Samoa,US Territory,Pago Pago,-170.691,-14.2846


In [3]:
query = """
SELECT 
    c1.name AS city,
    code,
    c2.name AS country,
    region,
    city_proper_pop
FROM cities AS c1
-- Perform an inner join with cities as c1 and countries as c2 on country code
INNER JOIN countries AS c2
ON c1.country_code = c2.code
ORDER BY code DESC;
"""
result_df = execute(query)

# Show results
result_df.head()

Unnamed: 0,city,code,country,region,city_proper_pop
0,Harare,ZWE,Zimbabwe,Eastern Africa,1606000
1,Lusaka,ZMB,Zambia,Eastern Africa,1742979
2,Cape Town,ZAF,South Africa,Southern Africa,3740026
3,Durban,ZAF,South Africa,Southern Africa,3442361
4,Ekurhuleni,ZAF,South Africa,Southern Africa,3178470


In [4]:
query = """
SELECT 
    c1.name AS city,
    code,
    c2.name AS country,
    region,
    city_proper_pop
FROM cities AS c1
-- Join right table (with alias)
LEFT JOIN countries AS c2
ON c1.country_code = c2.code
ORDER BY code DESC;
"""
result_df = execute(query)

# Show results
result_df.head()

Unnamed: 0,city,code,country,region,city_proper_pop
0,Harare,ZWE,Zimbabwe,Eastern Africa,1606000
1,Lusaka,ZMB,Zambia,Eastern Africa,1742979
2,Cape Town,ZAF,South Africa,Southern Africa,3740026
3,Durban,ZAF,South Africa,Southern Africa,3442361
4,Ekurhuleni,ZAF,South Africa,Southern Africa,3178470
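
Both previews above show the same five rows, so the difference between the two joins is easy to miss. The following is a minimal sketch of what changes when a city's country code has no match. It assumes a hypothetical three-city toy dataset in an in-memory SQLite database rather than the course data, and it uses the standard-library `sqlite3` module as a stand-in for `sqldf`.

```python
import sqlite3

# Hypothetical toy tables for illustration only (not the course data).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE cities (name TEXT, country_code TEXT);
CREATE TABLE countries (code TEXT, name TEXT);
INSERT INTO cities VALUES ('Harare', 'ZWE'), ('Lusaka', 'ZMB'), ('Nowhere', 'XXX');
INSERT INTO countries VALUES ('ZWE', 'Zimbabwe'), ('ZMB', 'Zambia');
""")

# Same query text for both joins; only the join type changes.
join_template = """
SELECT c1.name AS city, c2.code, c2.name AS country
FROM cities AS c1
{join_type} JOIN countries AS c2
ON c1.country_code = c2.code
ORDER BY c1.name;
"""

inner_rows = conn.execute(join_template.format(join_type="INNER")).fetchall()
left_rows = conn.execute(join_template.format(join_type="LEFT")).fetchall()

print(len(inner_rows))  # the unmatched city is dropped
print(left_rows[-1])    # the unmatched city is kept, with NULLs from countries
```

In this toy case the `INNER JOIN` returns two rows, while the `LEFT JOIN` returns three, the last of which has `None` in the columns that come from `countries`.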


# Building on your LEFT JOIN

You'll now revisit the use of the `AVG()` function introduced in a previous course.

Being able to build more than one SQL function into your query will enable you to write compact, supercharged queries.

You will use `AVG()` in combination with a `LEFT JOIN` to determine the average gross domestic product (GDP) per capita by `region` in 2010.

In [5]:
query = """
SELECT name, region, gdp_percapita
FROM countries AS c
LEFT JOIN economies AS e
-- Match on code fields
USING(code)
-- Filter for the year 2010
WHERE year = 2010;
"""
result_df = execute(query)

# Show results
result_df.head()

Unnamed: 0,name,region,gdp_percapita
0,Afghanistan,Southern and Central Asia,539.667
1,Angola,Central Africa,3599.27
2,Albania,Southern Europe,4098.13
3,United Arab Emirates,Middle East,34628.63
4,Argentina,South America,10412.95
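
One subtlety of the query above is that `WHERE year = 2010` filters on a column from the right table. Countries without any `economies` record have `NULL` in `year`, so they fail the filter and disappear, much as they would with an `INNER JOIN`. The sketch below contrasts this with putting the year condition in the `ON` clause. It assumes hypothetical toy tables and uses `sqlite3` directly instead of `sqldf`.

```python
import sqlite3

# Hypothetical toy tables for illustration only (not the course data).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE countries (code TEXT, name TEXT, region TEXT);
CREATE TABLE economies (code TEXT, year INTEGER, gdp_percapita REAL);
INSERT INTO countries VALUES ('AAA', 'Alpha', 'North'), ('BBB', 'Beta', 'South');
INSERT INTO economies VALUES ('AAA', 2010, 100.0), ('AAA', 2015, 120.0);
""")

# Year filter in WHERE: the unmatched country (Beta) is removed after the join.
where_rows = conn.execute("""
SELECT name, gdp_percapita
FROM countries AS c
LEFT JOIN economies AS e
USING (code)
WHERE year = 2010
ORDER BY name;
""").fetchall()

# Year filter in ON: the unmatched country is kept, with NULL for gdp_percapita.
on_rows = conn.execute("""
SELECT name, gdp_percapita
FROM countries AS c
LEFT JOIN economies AS e
ON c.code = e.code AND e.year = 2010
ORDER BY name;
""").fetchall()

print(where_rows)
print(on_rows)
```

Which version is appropriate depends on whether countries without 2010 data should appear in the result at all.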


In [6]:
query = """
-- Select region, and average gdp_percapita as avg_gdp
SELECT region, AVG(gdp_percapita) AS avg_gdp
FROM countries AS c
LEFT JOIN economies AS e
USING(code)
WHERE year = 2010
-- Group by region
GROUP BY region;
"""
result_df = execute(query)

# Show results
result_df.head()

Unnamed: 0,region,avg_gdp
0,Australia and New Zealand,44792.385
1,Baltic Countries,12631.03
2,British Islands,43588.33
3,Caribbean,11413.339462
4,Central Africa,4797.239889
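
When a `LEFT JOIN` does keep unmatched rows, it is worth knowing that `AVG()` skips `NULL` values, and that it returns `NULL` for a group with no non-`NULL` values. The following is a small sketch under the assumption of hypothetical toy tables (the toy `economies` table holds a single year, so no year filter is needed), run with `sqlite3` rather than `sqldf`.

```python
import sqlite3

# Hypothetical toy tables for illustration only (not the course data).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE countries (code TEXT, region TEXT);
CREATE TABLE economies (code TEXT, gdp_percapita REAL);
INSERT INTO countries VALUES
    ('AAA', 'North'), ('BBB', 'North'), ('CCC', 'South'), ('DDD', 'West');
INSERT INTO economies VALUES ('AAA', 100.0), ('CCC', 300.0);
""")

# COUNT(*) counts every joined row; AVG() only uses rows with a non-NULL value.
avg_rows = conn.execute("""
SELECT region, COUNT(*) AS n_rows, AVG(gdp_percapita) AS avg_gdp
FROM countries AS c
LEFT JOIN economies AS e
USING (code)
GROUP BY region
ORDER BY region;
""").fetchall()

for row in avg_rows:
    print(row)
```

Here the `North` group contains two joined rows but its average is based on one value, and the `West` group, which has no economic data, gets an average of `None`.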


In [7]:
query = """
SELECT region, AVG(gdp_percapita) AS avg_gdp
FROM countries AS c
LEFT JOIN economies AS e
USING(code)
WHERE year = 2010
GROUP BY region
-- Order by descending avg_gdp
ORDER BY avg_gdp DESC
-- Return only first 10 records
LIMIT 10
"""
result_df = execute(query)

# Show results
result_df.head()

Unnamed: 0,region,avg_gdp
0,Western Europe,58130.962857
1,Nordic Countries,57073.998
2,North America,47911.51
3,Australia and New Zealand,44792.385
4,British Islands,43588.33


# Is this RIGHT?

Right joins are used less often than left joins. A key reason is that any right join can be rewritten as a left join, and because queries are typically typed from left to right, joining from the left feels more intuitive when constructing them.

It can be tricky to wrap one's head around when left and right joins return equivalent results. You'll explore this in this exercise!

In [9]:
# query = """
# -- Modify this query to use RIGHT JOIN instead of LEFT JOIN
# SELECT countries.name AS country, languages.name AS language, percent
# FROM languages
# RIGHT JOIN  countries
# USING(code)
# ORDER BY language;
# """
# result_df = execute(query)

# # Show results
# result_df.head()
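
The `RIGHT JOIN` cell above is commented out, possibly because older SQLite versions do not support `RIGHT JOIN` (this reason is an assumption). The rewrite described earlier can be sketched as follows: swap the order of the two tables and use a `LEFT JOIN`. The toy tables and their contents are hypothetical, and `sqlite3` stands in for `sqldf`.

```python
import sqlite3

# Hypothetical toy tables for illustration only (not the course data).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE countries (code TEXT, name TEXT);
CREATE TABLE languages (code TEXT, name TEXT, percent REAL);
INSERT INTO countries VALUES ('AAA', 'Alpha'), ('BBB', 'Beta');
INSERT INTO languages VALUES ('AAA', 'Alphan', 90.0), ('AAA', 'Betan', 10.0);
""")

# Intended query:   FROM languages RIGHT JOIN countries USING (code)
# Equivalent query: FROM countries LEFT JOIN languages USING (code)
rows = conn.execute("""
SELECT countries.name AS country, languages.name AS language, percent
FROM countries
LEFT JOIN languages
USING (code)
ORDER BY language;
""").fetchall()

for row in rows:
    print(row)
```

Every country is preserved, including the one without any language record, which is what the `RIGHT JOIN` version would return as well.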

# Comparing joins

In this exercise, you'll examine how results can differ when performing a full join compared to a left join and inner join by joining the `countries` and `currencies` tables. You'll be focusing on the North American `region` and records where the name of the `country` is missing.

You'll begin with a full join with `countries` on the left and `currencies` on the right.

In [12]:
# query = """
# SELECT name AS country, code, region, basic_unit
# FROM countries
# -- Join to currencies
# FULL JOIN currencies 
# USING (code)
# -- Where region is North America or name is null
# WHERE region = 'North America' OR name IS NULL
# ORDER BY region;
# """
# result_df = execute(query)

# # Show results
# result_df.head()

In [15]:
query = """
SELECT name AS country, code, region, basic_unit
FROM countries
-- Join to currencies
LEFT JOIN currencies 
USING (code)
-- Where region is North America or name is null
WHERE region = 'North America' OR name IS NULL
UNION
SELECT name AS country, code, region, basic_unit
FROM currencies
-- Join to countries
LEFT JOIN countries
USING (code)
-- Where region is North America or name is null
WHERE region = 'North America' OR name IS NULL
ORDER BY region;
"""
result_df = execute(query)

# Show results
result_df.head()

Unnamed: 0,country,code,region,basic_unit
0,,AIA,,East Caribbean dollar
1,,CCK,,Australian dollar
2,,COK,,New Zealand dollar
3,,FLK,,Falkland Islands pound
4,,HKG,,Hong Kong dollar
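
The `UNION` query above stands in for the commented-out `FULL JOIN`. The idea can be sketched on a small scale: a left join from each side, combined with `UNION`, keeps unmatched rows from both tables. This is a sketch under stated assumptions, not a general replacement. The toy tables are hypothetical, `sqlite3` is used instead of `sqldf`, and because `UNION` removes duplicate rows the result can differ from a true `FULL JOIN` when the inputs contain genuine duplicates.

```python
import sqlite3

# Hypothetical toy tables for illustration only (not the course data).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE countries (code TEXT, name TEXT);
CREATE TABLE currencies (code TEXT, basic_unit TEXT);
INSERT INTO countries VALUES ('AAA', 'Alpha'), ('BBB', 'Beta');
INSERT INTO currencies VALUES ('AAA', 'Alpha dollar'), ('CCC', 'Gamma pound');
""")

# First branch keeps every country; second branch keeps every currency.
# UNION merges the two and drops the matched rows that appear in both.
full_rows = conn.execute("""
SELECT c.code AS code, c.name AS country, cur.basic_unit AS basic_unit
FROM countries AS c
LEFT JOIN currencies AS cur ON c.code = cur.code
UNION
SELECT cur.code AS code, c.name AS country, cur.basic_unit AS basic_unit
FROM currencies AS cur
LEFT JOIN countries AS c ON cur.code = c.code
ORDER BY code;
""").fetchall()

for row in full_rows:
    print(row)
```

The result has one matched row, one country without a currency, and one currency without a country.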


In [16]:
query = """
SELECT name AS country, code, region, basic_unit
FROM countries
-- Join to currencies
LEFT JOIN  currencies
USING (code)
WHERE region = 'North America' 
	OR name IS NULL
ORDER BY region;
"""
result_df = execute(query)

# Show results
result_df.head()

Unnamed: 0,country,code,region,basic_unit
0,Bermuda,BMU,North America,Bermudian dollar
1,Greenland,GRL,North America,
2,Canada,CAN,North America,Canadian dollar
3,United States,USA,North America,United States dollar


In [17]:
query = """
SELECT name AS country, code, region, basic_unit
FROM countries
-- Join to currencies
INNER JOIN currencies 
USING (code)
WHERE region = 'North America' 
	OR name IS NULL
ORDER BY region;
"""
result_df = execute(query)

# Show results
result_df.head()

Unnamed: 0,country,code,region,basic_unit
0,Bermuda,BMU,North America,Bermudian dollar
1,Canada,CAN,North America,Canadian dollar
2,United States,USA,North America,United States dollar


# Chaining FULL JOINs

As you have seen in the previous chapter on `INNER JOIN`, it is possible to chain joins in SQL, such as when looking to connect data from more than two tables.

Suppose you are doing some research on Melanesia and Micronesia, and are interested in pulling information about languages and currencies into the data we see for these regions in the `countries` table. Since languages and currencies exist in separate tables, this will require two consecutive full joins involving the `countries`, `languages`, and `currencies` tables.

In [22]:
# query = """
# SELECT 
# 	c1.name AS country, 
#     region, 
#     l.name AS language,
# 	basic_unit, 
#     frac_unit
# FROM countries as c1 
# -- Full join with languages (alias as l)
# FULL JOIN languages AS l 
# USING (code)
# -- Full join with currencies (alias as c2)
# FULL JOIN currencies AS c2
# USING (code)
# WHERE region LIKE 'M%esia';
# """
# result_df = execute(query)

# # Show results
# result_df.head()

# Histories and languages

Well done getting to know all about `CROSS JOIN`! As you have learned, `CROSS JOIN` can be incredibly helpful when asking questions that involve looking at all possible combinations or pairings between two sets of data.

Imagine you are a researcher interested in the languages spoken in two countries: Pakistan and India. You are interested in asking:

- What are the languages presently spoken in the two countries?
- Given the shared history between the two countries, what languages could potentially have been spoken in either country over the course of their history?

In [23]:
query = """
SELECT c.name AS country, l.name AS language
-- Inner join countries as c with languages as l on code
FROM countries AS c
INNER JOIN languages AS l
USING (code)
WHERE c.code IN ('PAK','IND')
    AND l.code IN ('PAK','IND');
"""
result_df = execute(query)

# Show results
result_df.head()

Unnamed: 0,country,language
0,India,Assamese
1,India,Bengali
2,India,Gujarati
3,India,Hindi
4,India,Kannada


In [24]:
query = """
SELECT c.name AS country, l.name AS language
FROM countries AS c        
-- Perform a cross join to languages (alias as l)
CROSS JOIN languages AS l
WHERE c.code IN ('PAK','IND')
    AND l.code IN ('PAK','IND');
"""
result_df = execute(query)

# Show results
result_df.head()

Unnamed: 0,country,language
0,India,Hindi
1,India,Bengali
2,India,Telugu
3,India,Marathi
4,India,Tamil
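
Since `head()` shows only five rows, the two previews above do not reveal how different the results are in size. A minimal sketch with a toy subset (three hypothetical language rows, not the full course tables, run through `sqlite3` instead of `sqldf`) makes the row counts explicit: the inner join returns one row per actual country-language pair, while the cross join returns every combination.

```python
import sqlite3

# Toy subset for illustration only; the rows are assumptions, not course data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE countries (code TEXT, name TEXT);
CREATE TABLE languages (code TEXT, name TEXT);
INSERT INTO countries VALUES ('PAK', 'Pakistan'), ('IND', 'India');
INSERT INTO languages VALUES ('PAK', 'Urdu'), ('IND', 'Hindi'), ('IND', 'Bengali');
""")

# Languages paired only with the country they are recorded for.
inner_rows = conn.execute("""
SELECT c.name AS country, l.name AS language
FROM countries AS c
INNER JOIN languages AS l
USING (code);
""").fetchall()

# Every country paired with every language: 2 countries x 3 languages.
cross_rows = conn.execute("""
SELECT c.name AS country, l.name AS language
FROM countries AS c
CROSS JOIN languages AS l;
""").fetchall()

print(len(inner_rows), len(cross_rows))
```

With two countries and three language rows, the inner join yields three rows and the cross join yields six.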


# Choosing your join

Now that you're fully equipped to use joins, try a challenge problem to test your knowledge!

You will determine the names of the five countries and their respective regions with the lowest life expectancy for the year 2010. Use your knowledge about joins, filtering, sorting and limiting to create this list!

In [25]:
# query = """
# SELECT 
# 	c.name AS country,
#     region,
#     life_expectancy AS life_exp
# FROM countries AS c
# -- Join to populations (alias as p) using an appropriate join
# FULL JOIN populations AS p 
# ON c.code = p.country_code
# -- Filter for only results in the year 2010
# WHERE p.year = 2010
# -- Sort by life_exp
# ORDER BY life_exp
# -- Limit to five records
# LIMIT 5;
# """
# result_df = execute(query)

# # Show results
# result_df.head()
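
The cell above is commented out and uses `FULL JOIN`. One possible runnable variant, assuming only countries present in both tables are of interest, is an `INNER JOIN` followed by the same filter, sort, and limit. The sketch below uses hypothetical toy tables with three countries and limits to two records for brevity; `sqlite3` stands in for `sqldf`.

```python
import sqlite3

# Hypothetical toy tables for illustration only (not the course data).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE countries (code TEXT, name TEXT, region TEXT);
CREATE TABLE populations (country_code TEXT, year INTEGER, life_expectancy REAL);
INSERT INTO countries VALUES
    ('AAA', 'Alpha', 'North'), ('BBB', 'Beta', 'South'), ('CCC', 'Gamma', 'South');
INSERT INTO populations VALUES
    ('AAA', 2010, 70.5), ('BBB', 2010, 55.0), ('CCC', 2010, 62.3), ('BBB', 2015, 58.0);
""")

# Join, keep only 2010, sort ascending by life expectancy, keep the lowest two.
lowest = conn.execute("""
SELECT c.name AS country, region, life_expectancy AS life_exp
FROM countries AS c
INNER JOIN populations AS p
ON c.code = p.country_code
WHERE p.year = 2010
ORDER BY life_exp
LIMIT 2;
""").fetchall()

for row in lowest:
    print(row)
```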

# Comparing a country to itself

Self joins are very useful for comparing data from one part of a table with another part of the same table. Suppose you are interested in finding out how much the population of each country changed from 2010 to 2015. You can make this change visible by performing a self join.

In this exercise, you'll work to answer this question by joining the `populations` table with itself. Recall that, with self joins, tables must be aliased. Use this as an opportunity to practice your aliasing!

Since you'll be joining the `populations` table to itself, you can alias `populations` first as `p1` and again as `p2`. This is good practice whenever you are aliasing tables with the same first letter.

In [26]:
query = """
-- Select aliased fields from populations as p1
SELECT p1.country_code, p1.size AS size2010, p2.size AS size2015
-- Join populations as p1 to itself, alias as p2, on country code
-- (no year filter yet, so every pair of years per country is returned)
FROM populations AS p1
INNER JOIN populations AS p2
USING (country_code)
"""
result_df = execute(query)

# Show results
result_df.head()

Unnamed: 0,country_code,size2010,size2015
0,ABW,101597.0,101597.0
1,ABW,101597.0,103889.0
2,ABW,103889.0,101597.0
3,ABW,103889.0,103889.0
4,AFG,27962207.0,27962207.0


In [27]:
query = """
SELECT 
    p1.country_code,
    p1.size AS size2010, 
    p2.size AS size2015
FROM populations AS p1
INNER JOIN populations AS p2
ON p1.country_code = p2.country_code
WHERE p1.year = 2010
-- Filter such that p1.year is always five years before p2.year
    AND p2.year = p1.year + 5
"""
result_df = execute(query)

# Show results
result_df.head()

Unnamed: 0,country_code,size2010,size2015
0,ABW,101597.0,103889.0
1,AFG,27962207.0,32526562.0
2,AGO,21219954.0,25021974.0
3,ALB,2913021.0,2889167.0
4,AND,84419.0,70473.0
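
To answer the original question of how much each population changed, the filtered self join can be extended with a computed column. The following is a sketch under stated assumptions: the toy rows are hypothetical, the column name `growth_perc` is introduced here for illustration, and `sqlite3` is used instead of `sqldf`.

```python
import sqlite3

# Hypothetical toy table for illustration only (not the course data).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE populations (country_code TEXT, year INTEGER, size REAL);
INSERT INTO populations VALUES
    ('AAA', 2010, 1000.0), ('AAA', 2015, 1100.0),
    ('BBB', 2010, 2000.0), ('BBB', 2015, 1900.0);
""")

# Pair each country's 2010 row with its 2015 row and compute the percent change.
growth = conn.execute("""
SELECT
    p1.country_code,
    p1.size AS size2010,
    p2.size AS size2015,
    (p2.size - p1.size) * 100.0 / p1.size AS growth_perc
FROM populations AS p1
INNER JOIN populations AS p2
ON p1.country_code = p2.country_code
WHERE p1.year = 2010
    AND p2.year = p1.year + 5
ORDER BY p1.country_code;
""").fetchall()

for row in growth:
    print(row)
```

In the toy data, the first country grows by 10 percent and the second shrinks by 5 percent.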


# All joins on deck

Excellent work! You've made it to the end of the chapter. In this exercise, you will test your knowledge on all the joins you've learned so far.

For each of the problems presented, think carefully about what types of tables are involved and how each of the joins you have learned relates to `NULL` values.

<center><img src="images/02.06.png" style="width: 400px; height: 300px;"/></center>
