# 1. Introduction to joins
**In this chapter, you'll be introduced to the concept of joining tables, and will explore the different ways you can enrich your queries using inner joins and self joins. You'll also see how to use the case statement to split up a field into different categories.**

In [1]:
%load_ext sql

In [2]:
%sql sqlite://

In [30]:
%sql sqlite:////Users/sj501/Documents/Jupyter/Jupyter_lab/17_Joining_Data_in_SQL/leaders.sqlite

## Introduction to INNER JOIN
As the name suggests, the focus of this course is using SQL to join two or more database tables together into a single table, an essential skill for data scientists. In this chapter, you'll learn about the `INNER JOIN`, which, along with `LEFT JOIN`, is probably one of the two most common joins. You'll see diagrams throughout this course that are designed to help you understand the mechanics of the different joins. Let's begin with a diagram showing the layout of some data and then how an `INNER JOIN` can be applied to that data.

### Initial data diagram
In this chapter and the next, we'll often work with two tables, `left_table` and `right_table`, referred to as the left and right tables.
- `left_table`

id | val
:---|:---
1 | L1
2 | L2
3 | L3
4 | L4

- `right_table`

id | val
:---|:---
1 | R1
4 | R2
5 | R3
6 | R4

The id field is known as a KEY field since it can be used to reference one table to another. Both the left and right tables also have another field named val. This will be useful in helping you see specifically which records and values are included in each join.

### INNER JOIN diagram
An `INNER JOIN` only includes records in which the key is in both tables. You can see here that the id field matches for values of 1 and 4 only. With inner joins we look for matches in the right table corresponding to all entries in the key field in the left table.

So the focus here shifts to only those records with a match in terms of the id field. The records not of interest to `INNER JOIN` have been faded.

Here is the single table that results from the `INNER JOIN`: it pairs the `val` field from the left table with the `val` field from the right table, but only for the records with an `id` value of 1 or 4.
- `INNER JOIN`

L_id | L_val | R_val
:---|:---|:---
1 | L1 | R1
4 | L4 | R2
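
To make the diagram concrete, here is a minimal Python sketch that rebuilds `left_table` and `right_table` in an in-memory SQLite database using the standard-library `sqlite3` module and then runs the join. The in-memory database, the variable names, and the `ORDER BY` clause are assumptions added for illustration; they are not part of the course material.

```python
import sqlite3

# Build the two diagram tables in a throwaway in-memory database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE left_table  (id INTEGER, val TEXT);
    CREATE TABLE right_table (id INTEGER, val TEXT);
    INSERT INTO left_table  VALUES (1, 'L1'), (2, 'L2'), (3, 'L3'), (4, 'L4');
    INSERT INTO right_table VALUES (1, 'R1'), (4, 'R2'), (5, 'R3'), (6, 'R4');
""")

# Only ids present in BOTH tables (1 and 4) survive the inner join.
inner_rows = conn.execute("""
    SELECT left_table.id AS L_id, left_table.val AS L_val, right_table.val AS R_val
    FROM left_table
        INNER JOIN right_table
            ON left_table.id = right_table.id
    ORDER BY L_id;
""").fetchall()

print(inner_rows)  # [(1, 'L1', 'R1'), (4, 'L4', 'R2')]
```

Records with `id` 2 and 3 (left only) and 5 and 6 (right only) have no partner, so they are dropped from the result.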

Now that you have a sense for how `INNER JOIN` works, let's try an example in SQL.

### prime_ministers table
The `prime_ministers` table is one of the tables in the leaders database. It is displayed here.

In [4]:
%%sql
SELECT *
FROM prime_ministers

   sqlite://
 * sqlite:////Users/sj501/Documents/Jupyter/Jupyter_lab/17_Joining_Data_in_SQL/leaders.sqlite
Done.


country,continent,prime_minister
Egypt,Africa,Sherif Ismail
Portugal,Europe,Antonio Costa
Vietnam,Asia,Nguyen Xuan Phuc
Haiti,North America,Jack Guy Lafontant
India,Asia,Narendra Modi
Australia,Oceania,Malcolm Turnbull
Norway,Europe,Erna Solberg
Brunei,Asia,Hassanal Bolkiah
Oman,Asia,Qaboos bin Said al Said
Spain,Europe,Mariano Rajoy


Note the countries that are included. Suppose you were interested in determining nations that have both a prime minister and a president AND putting the results into a single table. Next you'll see the presidents table.

### presidents table
How did I display all of the `prime_ministers` table in the previous slide? Recall the use of SELECT and FROM clauses as is shown for the `presidents` table here. 

In [5]:
%%sql
SELECT *
FROM presidents



country,continent,president
Egypt,Africa,Abdel Fattah el-Sisi
Portugal,Europe,Marcelo Rebelo de Sousa
Haiti,North America,Jovenel Moise
Uruguay,South America,Jose Mujica
Liberia,Africa,Ellen Johnson Sirleaf
Chile,South America,Michelle Bachelet
Vietnam,Asia,Tran Dai Quang


Which countries appear in both tables? With small tables like these, it is easy to notice that Egypt, Portugal, Vietnam, and Haiti appear in both tables. For larger tables, it isn't as simple as just picking these countries out visually. So what does the syntax look like for SQL to get the results of countries with a prime minister and a president from these two tables into one?

### INNER JOIN in SQL
The syntax for an `INNER JOIN` from the `prime_ministers` table to the `presidents` table, based on the key field `country`, is shown below. Note the aliases: `prime_ministers` as `p1` and `presidents` as `p2`. Aliases simplify your code, especially with longer table names like these. A `SELECT` statement picks specific fields from the two tables. Because `country` and `continent` exist in both tables, we must prefix them with a table alias (here `p1.`) to avoid an ambiguous-column error. Next we list the table on the left of the inner join after `FROM`, and then the table on the right after `INNER JOIN`. Lastly, we specify the keys in the two tables that we would like to match on.

In [6]:
%%sql
SELECT p1.country, p1.continent, prime_minister, president
FROM prime_ministers AS p1
INNER JOIN presidents AS p2
ON p1.country = p2.country



country,continent,prime_minister,president
Egypt,Africa,Sherif Ismail,Abdel Fattah el-Sisi
Portugal,Europe,Antonio Costa,Marcelo Rebelo de Sousa
Vietnam,Asia,Nguyen Xuan Phuc,Tran Dai Quang
Haiti,North America,Jack Guy Lafontant,Jovenel Moise
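
The need for the `p1.` prefix is easy to reproduce. The sketch below assumes a two-row sample of each table loaded into an in-memory SQLite database; it selects `country` first without an alias and then with one. The variable names are illustrative, and the exact error text differs between database engines.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE prime_ministers (country TEXT, continent TEXT, prime_minister TEXT);
    CREATE TABLE presidents      (country TEXT, continent TEXT, president TEXT);
    INSERT INTO prime_ministers VALUES
        ('Egypt', 'Africa', 'Sherif Ismail'),
        ('India', 'Asia', 'Narendra Modi');
    INSERT INTO presidents VALUES
        ('Egypt', 'Africa', 'Abdel Fattah el-Sisi'),
        ('Chile', 'South America', 'Michelle Bachelet');
""")

join_clause = """
    FROM prime_ministers AS p1
        INNER JOIN presidents AS p2
            ON p1.country = p2.country
"""

# Unqualified `country` exists in both tables, so SQLite cannot resolve it.
try:
    conn.execute("SELECT country, prime_minister, president" + join_clause)
    error_message = None
except sqlite3.OperationalError as exc:
    error_message = str(exc)

# Qualifying the shared field with the alias removes the ambiguity.
qualified_rows = conn.execute(
    "SELECT p1.country, prime_minister, president" + join_clause
).fetchall()

print(error_message)
print(qualified_rows)
```

Fields that exist in only one table, such as `prime_minister` and `president`, can stay unqualified, which is why the query above only prefixes the shared ones.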


## Inner join
Although this course focuses on PostgreSQL, you'll find that these joins and the material here apply to other forms of SQL as well.

Throughout this course, you'll be working with the `countries` database containing information about the most populous world cities as well as country-level economic data, population data, and geographic data. This `countries` database also contains information on languages spoken in each country.

You can see the different tables in this database by clicking on the corresponding tabs. Click through them to get a sense for the types of data that each table contains before you continue with the course! Take note of the fields that appear to be shared across the tables.

Recall from the video the basic syntax for an `INNER JOIN`, here including all columns in **both** tables:
```sql
SELECT *
FROM left_table
INNER JOIN right_table
ON left_table.id = right_table.id;
```
You'll start off with a `SELECT` statement and then build up to an `INNER JOIN` with the `cities` and `countries` tables.

In [26]:
%sql sqlite:////Users/sj501/Documents/Jupyter/Jupyter_lab/17_Joining_Data_in_SQL/countries.sqlite

- Begin by selecting all columns from the `cities` table.

In [8]:
%%sql
SELECT *
FROM cities
LIMIT 10



name,country_code,city_proper_pop,metroarea_pop,urbanarea_pop
Abidjan,CIV,4765000,,4765000
Abu Dhabi,ARE,1145000,,1145000
Abuja,NGA,1235880,6000000.0,1235880
Accra,GHA,2070463,4010054.0,2070463
Addis Ababa,ETH,3103673,4567857.0,3103673
Ahmedabad,IND,5570585,,5570585
Alexandria,EGY,4616625,,4616625
Algiers,DZA,3415811,5000000.0,3415811
Almaty,KAZ,1703481,,1703481
Ankara,TUR,5271000,4585000.0,5271000


- Inner join the `cities` table on the left to the `countries` table on the right, keeping all of the fields in both tables.
- You should match the tables on the `country_code` field in `cities` and the `code` field in `countries`.
- **Do not** alias your tables here or in the next step. Using `cities` and `countries` is fine for now.

In [9]:
%%sql
SELECT * 
FROM cities
    INNER JOIN countries
        ON cities.country_code = countries.code
LIMIT 10;    



name,country_code,city_proper_pop,metroarea_pop,urbanarea_pop,code,country_name,continent,region,surface_area,indep_year,local_name,gov_form,capital,cap_long,cap_lat
Abidjan,CIV,4765000,,4765000,CIV,Cote d'Ivoire,Africa,Western Africa,322463,1960,Cote dIvoire,Republic,Yamoussoukro,-4.0305,5.332
Abu Dhabi,ARE,1145000,,1145000,ARE,United Arab Emirates,Asia,Middle East,83600,1971,Al-Imarat al-´Arabiya al-Muttahida,Emirate Federation,Abu Dhabi,54.3705,24.4764
Abuja,NGA,1235880,6000000.0,1235880,NGA,Nigeria,Africa,Western Africa,923768,1960,Nigeria,Federal Republic,Abuja,7.48906,9.05804
Accra,GHA,2070463,4010054.0,2070463,GHA,Ghana,Africa,Western Africa,238533,1957,Ghana,Republic,Accra,-0.20795,5.57045
Addis Ababa,ETH,3103673,4567857.0,3103673,ETH,Ethiopia,Africa,Eastern Africa,1104300,-1000,YeItyop´iya,Republic,Addis Ababa,38.7468,9.02274
Ahmedabad,IND,5570585,,5570585,IND,India,Asia,Southern and Central Asia,3287260,1947,Bharat/India,Federal Republic,New Delhi,77.225,28.6353
Alexandria,EGY,4616625,,4616625,EGY,Egypt,Africa,Northern Africa,1001450,1922,Misr,Republic,Cairo,31.2461,30.0982
Algiers,DZA,3415811,5000000.0,3415811,DZA,Algeria,Africa,Northern Africa,2381740,1962,Al-Jazair/Algerie,Republic,Algiers,3.05097,36.7397
Almaty,KAZ,1703481,,1703481,KAZ,Kazakhstan,Asia,Southern and Central Asia,2724900,1991,Qazaqstan,Republic,Astana,71.4382,51.1879
Ankara,TUR,5271000,4585000.0,5271000,TUR,Turkey,Asia,Middle East,774815,1923,Turkiye,Republic,Ankara,32.3606,39.7153


- Modify the `SELECT` statement to keep only the name of the city, the name of the country, and the name of the region the country resides in.
- Alias the name of the city `AS city` and the name of the country `AS country`.

In [10]:
%%sql
SELECT cities.name AS city, countries.country_name AS country, countries.region 
FROM cities
    INNER JOIN countries
        ON cities.country_code = countries.code
LIMIT 10;



city,country,region
Abidjan,Cote d'Ivoire,Western Africa
Abu Dhabi,United Arab Emirates,Middle East
Abuja,Nigeria,Western Africa
Accra,Ghana,Western Africa
Addis Ababa,Ethiopia,Eastern Africa
Ahmedabad,India,Southern and Central Asia
Alexandria,Egypt,Northern Africa
Algiers,Algeria,Northern Africa
Almaty,Kazakhstan,Southern and Central Asia
Ankara,Turkey,Middle East


## Inner join (2)
Instead of writing the full table name, you can use table aliasing as a shortcut. For tables you also use `AS` to add the alias immediately after the table name with a space. Check out the aliasing of `cities` and `countries` below.
```sql
SELECT c1.name AS city, c2.name AS country
FROM cities AS c1
INNER JOIN countries AS c2
ON c1.country_code = c2.code;
```
Notice that to select a field in your query that appears in multiple tables, you'll need to identify which table/table alias you're referring to by using a `.` in your `SELECT` statement.

You'll now explore a way to get data from both the `countries` and `economies` tables to examine the inflation rate for both 2010 and 2015.

Sometimes it's easier to write SQL code out of order: you write the `SELECT` statement after you've done the `JOIN`.

- Join the tables `countries` (left) and `economies` (right), aliasing `countries AS c` and `economies AS e`.
- Specify the field to match the tables `ON`.
- From this join, `SELECT`:
    - `c.code`, aliased as `country_code`.
    - `name`, `year`, and `inflation_rate`, not aliased.

In [11]:
%%sql
SELECT c.code AS country_code, c.country_name, e.year, e.inflation_rate
FROM countries AS c
    INNER JOIN economies AS e
        ON c.code = e.code
LIMIT 10;



country_code,country_name,year,inflation_rate
AFG,Afghanistan,2010,2.179
AFG,Afghanistan,2015,-1.549
NLD,Netherlands,2010,0.932
NLD,Netherlands,2015,0.22
ALB,Albania,2010,3.605
ALB,Albania,2015,1.896
DZA,Algeria,2010,3.913
DZA,Algeria,2015,4.784
AGO,Angola,2010,14.48
AGO,Angola,2015,10.287


## Inner join (3)
The ability to combine multiple joins in a single query is a powerful feature of SQL, for example:
```sql
SELECT *
FROM left_table
  INNER JOIN right_table
    ON left_table.id = right_table.id
  INNER JOIN another_table
    ON left_table.id = another_table.id;
```
As you can see here it becomes tedious to continually write long table names in joins. This is when it becomes useful to alias each table using the first letter of its name (e.g. `countries AS c`). It is standard practice to alias in this way and, if you choose to alias tables or are asked to specifically for an exercise in this course, you should follow this protocol.

Now, for each country, you want to get the country name, its region, the fertility rate, and the unemployment rate for both 2010 and 2015.

Note that results should work throughout this course with or without table aliasing unless specified differently.

- Inner join `countries` (left) and `populations` (right) on the `code` and `country_code` fields respectively.
- Alias `countries AS c` and `populations AS p`.
- Select `code`, `name`, and `region` from `countries` and also select `year` and `fertility_rate` from `populations` (5 fields in total).

In [12]:
%%sql
SELECT c.code, c.country_name, c.region, p.year, p.fertility_rate
FROM countries AS c
    INNER JOIN populations AS p
        ON c.code = p.country_code
LIMIT 10;



code,country_name,region,year,fertility_rate
AFG,Afghanistan,Southern and Central Asia,2010,5.746
AFG,Afghanistan,Southern and Central Asia,2015,4.653
NLD,Netherlands,Western Europe,2010,1.79
NLD,Netherlands,Western Europe,2015,1.71
ALB,Albania,Southern Europe,2010,1.663
ALB,Albania,Southern Europe,2015,1.793
DZA,Algeria,Northern Africa,2010,2.873
DZA,Algeria,Northern Africa,2015,2.805
ASM,American Samoa,Polynesia,2010,
ASM,American Samoa,Polynesia,2015,


- Add an additional `INNER JOIN` with `economies` to your previous query by joining on `code`.
- Include the `unemployment_rate` column that became available through joining with `economies`.
- Note that `year` appears in both `populations` and `economies`, so you have to explicitly use `e.year` instead of `year` as you did before.

In [13]:
%%sql
SELECT c.code, c.country_name, c.region, e.year, p.fertility_rate, e.unemployment_rate
FROM countries AS c
    INNER JOIN populations AS p
        ON c.code = p.country_code
    INNER JOIN economies AS e
        ON c.code = e.code
LIMIT 10;



code,country_name,region,year,fertility_rate,unemployment_rate
AFG,Afghanistan,Southern and Central Asia,2010,4.653,
AFG,Afghanistan,Southern and Central Asia,2015,4.653,
AFG,Afghanistan,Southern and Central Asia,2010,5.746,
AFG,Afghanistan,Southern and Central Asia,2015,5.746,
NLD,Netherlands,Western Europe,2010,1.71,4.995
NLD,Netherlands,Western Europe,2015,1.71,6.891
NLD,Netherlands,Western Europe,2010,1.79,4.995
NLD,Netherlands,Western Europe,2015,1.79,6.891
ALB,Albania,Southern Europe,2010,1.663,14.0
ALB,Albania,Southern Europe,2015,1.663,17.1


- Scroll down the query result and take a look at the results for Albania from your previous query. Does something seem off to you?
- The trouble with doing your last join on `c.code = e.code` and not also including `year` is that e.g. the 2010 value for `fertility_rate` is also paired with the 2015 value for `unemployment_rate`.
- Fix your previous query: in your last `ON` clause, use `AND` to add an additional joining condition. In addition to joining on `code` in `c` and `e`, also join on `year` in `e` and `p`.

In [14]:
%%sql
SELECT c.code, c.country_name, c.region, e.year, p.fertility_rate, e.unemployment_rate
FROM countries AS c
    INNER JOIN populations AS p
        ON c.code = p.country_code
    INNER JOIN economies AS e
        ON c.code = e.code AND p.year = e.year
ORDER BY c.code
LIMIT 10;



code,country_name,region,year,fertility_rate,unemployment_rate
AFG,Afghanistan,Southern and Central Asia,2010,5.746,
AFG,Afghanistan,Southern and Central Asia,2015,4.653,
AGO,Angola,Central Africa,2010,6.416,
AGO,Angola,Central Africa,2015,5.996,
ALB,Albania,Southern Europe,2010,1.663,14.0
ALB,Albania,Southern Europe,2015,1.793,17.1
ARE,United Arab Emirates,Middle East,2010,1.868,
ARE,United Arab Emirates,Middle East,2015,1.767,
ARG,Argentina,South America,2010,2.37,7.75
ARG,Argentina,South America,2015,2.308,
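
The effect of the missing `year` condition can be checked on a tiny sample. The sketch below is a simplified illustration: it assumes only the two Albania rows from each of `populations` and `economies`, leaves out the `countries` table, and uses an in-memory SQLite database. The helper names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE populations (country_code TEXT, year INTEGER, fertility_rate REAL);
    CREATE TABLE economies   (code TEXT, year INTEGER, unemployment_rate REAL);
    INSERT INTO populations VALUES ('ALB', 2010, 1.663), ('ALB', 2015, 1.793);
    INSERT INTO economies   VALUES ('ALB', 2010, 14.0),  ('ALB', 2015, 17.1);
""")

select_from = """
    SELECT p.year AS p_year, e.year AS e_year, p.fertility_rate, e.unemployment_rate
    FROM populations AS p
        INNER JOIN economies AS e
"""

# Matching on code alone pairs every population row with every economy row.
code_only = conn.execute(
    select_from + " ON p.country_code = e.code ORDER BY p_year, e_year"
).fetchall()

# Adding the year condition keeps only rows that describe the same year.
code_and_year = conn.execute(
    select_from + " ON p.country_code = e.code AND p.year = e.year ORDER BY p_year"
).fetchall()

print(len(code_only))   # 4 rows: 2 population years x 2 economy years
print(code_and_year)    # [(2010, 2010, 1.663, 14.0), (2015, 2015, 1.793, 17.1)]
```

With two years per table, the join on code alone returns four rows per country, two of which mix a 2010 value with a 2015 value. The extra `AND` condition removes exactly those mismatched rows.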


---
## INNER JOIN via USING
You'll next learn about the USING keyword in SQL and how it can be used in joins.

### The INNER JOIN diagram again
Recall the `INNER JOIN` diagram you saw in the last video. Think about the SQL code needed to complete this diagram. Let's check it out. We select and alias three fields, place `left_table` on the left of the join and `right_table` on the right, and match on the entries of the `id` key field.

In [15]:
%sql sqlite:////Users/sj501/Documents/Jupyter/Jupyter_lab/17_Joining_Data_in_SQL/diagrams.sqlite

In [16]:
%%sql
SELECT left_table.id AS L_id, left_table.val AS L_val, right_table.val AS R_val
FROM left_table
    INNER JOIN right_table
        ON left_table.id = right_table.id;



L_id,L_val,R_val
1,L1,R1
4,L4,R2


### The INNER JOIN diagram with USING
When the key field you'd like to join on has the same name in both tables, you can use a `USING` clause instead of the `ON` clause you have seen so far. Since `id` has the same name in both the left table and the right table, we can specify `USING` instead of `ON` here. Note that the parentheses are required around the key field with `USING`.


In [17]:
%%sql
SELECT left_table.id AS L_id, left_table.val AS L_val, right_table.val AS R_val
FROM left_table
    INNER JOIN right_table
    USING (id);



L_id,L_val,R_val
1,L1,R1
4,L4,R2
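
One practical difference between `ON` and `USING` shows up with `SELECT *`. The sketch below assumes SQLite (the engine this notebook runs on) and an in-memory copy of the two diagram tables. It compares the columns returned by both forms; other engines may label or arrange the columns differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE left_table  (id INTEGER, val TEXT);
    CREATE TABLE right_table (id INTEGER, val TEXT);
    INSERT INTO left_table  VALUES (1, 'L1'), (2, 'L2'), (3, 'L3'), (4, 'L4');
    INSERT INTO right_table VALUES (1, 'R1'), (4, 'R2'), (5, 'R3'), (6, 'R4');
""")

# Join with ON: SELECT * returns the id column from both tables.
on_cursor = conn.execute("""
    SELECT * FROM left_table
        INNER JOIN right_table ON left_table.id = right_table.id
    ORDER BY left_table.id;
""")
on_columns = [col[0] for col in on_cursor.description]
on_rows = on_cursor.fetchall()

# Join with USING: the shared id column appears only once.
using_cursor = conn.execute("""
    SELECT * FROM left_table
        INNER JOIN right_table USING (id)
    ORDER BY left_table.id;
""")
using_columns = [col[0] for col in using_cursor.description]
using_rows = using_cursor.fetchall()

print(on_columns, on_rows)
print(using_columns, using_rows)
```

Both forms match the same records. When you list the fields explicitly, as in the cells above, the two queries return identical results.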


### Countries with prime ministers and presidents
Let's revisit the example of joining the `prime_ministers` table to the presidents table to determine countries with both types of leaders. How could you fill in the blanks to get the result with `USING`? 
```sql
SELECT p1.country, p1.continent, prime_minister, president
FROM ___ AS p1
INNER JOIN ___ AS p2
___ (___);
```


In [24]:
%%sql
SELECT p1.country, p1.continent, prime_minister, president
FROM presidents AS p1
INNER JOIN prime_ministers AS p2
USING (country);



country,continent,prime_minister,president
Egypt,Africa,Sherif Ismail,Abdel Fattah el-Sisi
Portugal,Europe,Antonio Costa,Marcelo Rebelo de Sousa
Haiti,North America,Jack Guy Lafontant,Jovenel Moise
Vietnam,Asia,Nguyen Xuan Phuc,Tran Dai Quang


Since an `INNER JOIN` keeps only the entries found in both tables, and both tables contain the countries listed, the order in which we place the tables in the join doesn't matter if we `SELECT` these columns. You'll be told in the exercises which table to use on the left and on the right to avoid this confusion. Note again the use of the parentheses around `country` after `USING`.

## Inner join with using
When joining tables with a common field name, e.g.
```sql
SELECT *
FROM countries
  INNER JOIN economies
    ON countries.code = economies.code
```
You can use `USING` as a shortcut:
```sql
SELECT *
FROM countries
  INNER JOIN economies
    USING(code)
```
You'll now explore how this can be done with the `countries` and `languages` tables.

- Inner join `countries` on the left and `languages` on the right with `USING(code)`.
- Select the fields corresponding to:
    - country name `AS country`,
    - continent name,
    - language name `AS language`, and
    - whether or not the language is official.
    
Remember to alias your tables using the first letter of their names.

In [28]:
%%sql
SELECT c.country_name AS country, c.continent, l.name AS language, l.official
    FROM countries AS c
    INNER JOIN languages AS l
    USING(code)
LIMIT 10;



country,continent,language,official
Afghanistan,Asia,Dari,True
Afghanistan,Asia,Other,False
Afghanistan,Asia,Pashto,True
Afghanistan,Asia,Turkic,False
Netherlands,Europe,Dutch,True
Albania,Europe,Albanian,True
Albania,Europe,Greek,False
Albania,Europe,Other,False
Albania,Europe,unspecified,False
Algeria,Africa,Arabic,True


---
## Self-ish joins, just in CASE
You'll now dive into inner joins where a table is joined with itself. Sounds a little selfish, doesn't it? These types of joins, as you may have guessed, are called self joins. You'll also explore how to slice a numerical field into categories using the CASE command. 

### Join a table to itself?
Joining a table to itself may seem like a bit of a crazy, strange thing to ever want to do. Self-joins are used to compare values in a field to other values of the same field from within the same table. Let's further explore this with an example. Recall the `prime_ministers` table from earlier. 

In [36]:
%sql sqlite:////Users/sj501/Documents/Jupyter/Jupyter_lab/17_Joining_Data_in_SQL/leaders.sqlite

In [37]:
%%sql
SELECT *
FROM prime_ministers



country,continent,prime_minister
Egypt,Africa,Sherif Ismail
Portugal,Europe,Antonio Costa
Vietnam,Asia,Nguyen Xuan Phuc
Haiti,North America,Jack Guy Lafontant
India,Asia,Narendra Modi
Australia,Oceania,Malcolm Turnbull
Norway,Europe,Erna Solberg
Brunei,Asia,Hassanal Bolkiah
Oman,Asia,Qaboos bin Said al Said
Spain,Europe,Mariano Rajoy


What if you wanted to create a new table showing countries that are in the same continent matched as pairs? Let's explore a chunk of `INNER JOIN` code using the `prime_ministers` table.

### Join prime_ministers to itself?
The country column is selected twice as well as continent. The `prime_ministers` table is on both the left and the right. The vital step here is setting the key columns by which we match the table to itself. For each country, we will have a match if the country in the "right table" (that is also prime_ministers) is in the same continent. Lastly, since the results of this query are more than can fit on the slide, you'll only see the first 14 records. See how we have exactly this in the result.

In [42]:
%%sql
SELECT p1.country AS country1, p2.country AS country2, p1.continent
FROM prime_ministers AS p1
    INNER JOIN prime_ministers AS p2
        ON p1.continent = p2.continent
LIMIT 14;



country1,country2,continent
Egypt,Egypt,Africa
Portugal,Norway,Europe
Portugal,Portugal,Europe
Portugal,Spain,Europe
Vietnam,Brunei,Asia
Vietnam,India,Asia
Vietnam,Oman,Asia
Vietnam,Vietnam,Asia
Haiti,Haiti,North America
India,Brunei,Asia


It's a pairing of each country with every country in the same continent. But do you see a problem here? We don't want to list a country with itself after all. In the next slide, you'll see a way to do this. Pause to think about how to get around this before continuing.

### Finishing off the self-join on prime_ministers
We don't want to include rows where the country is the same in the `country1` and `country2` fields. The `AND` operator can check that multiple conditions are met. Here a match will not be made between `prime_ministers` and itself if the countries match. You, thus, have the correct table now; the results here are again limited in order for them to fit on the slide.

In [43]:
%%sql
SELECT p1.country AS country1, p2.country AS country2, p1.continent
FROM prime_ministers AS p1
    INNER JOIN prime_ministers AS p2
        ON p1.continent = p2.continent AND p1.country <> p2.country
LIMIT 13;



country1,country2,continent
Portugal,Norway,Europe
Portugal,Spain,Europe
Vietnam,Brunei,Asia
Vietnam,India,Asia
Vietnam,Oman,Asia
India,Brunei,Asia
India,Oman,Asia
India,Vietnam,Asia
Norway,Portugal,Europe
Norway,Spain,Europe
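
To see what each condition contributes, here is a small sketch that assumes a four-row subset of `prime_ministers` in an in-memory SQLite database. The `pairs` helper is hypothetical, and the last query (using `<` instead of `<>`) is a variation that is not part of the course material: it keeps each pair only once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE prime_ministers (country TEXT, continent TEXT, prime_minister TEXT);
    INSERT INTO prime_ministers VALUES
        ('Egypt', 'Africa', 'Sherif Ismail'),
        ('Portugal', 'Europe', 'Antonio Costa'),
        ('Norway', 'Europe', 'Erna Solberg'),
        ('Spain', 'Europe', 'Mariano Rajoy');
""")

def pairs(extra_condition=""):
    """Self-join prime_ministers on continent, with an optional extra ON condition."""
    query = f"""
        SELECT p1.country AS country1, p2.country AS country2, p1.continent
        FROM prime_ministers AS p1
            INNER JOIN prime_ministers AS p2
                ON p1.continent = p2.continent {extra_condition}
        ORDER BY country1, country2;
    """
    return conn.execute(query).fetchall()

all_pairs = pairs()                                     # includes Egypt-Egypt, Spain-Spain, ...
no_self_pairs = pairs("AND p1.country <> p2.country")   # drops a country paired with itself
unordered_pairs = pairs("AND p1.country < p2.country")  # variation: each pair listed once

print(len(all_pairs), len(no_self_pairs))
print(unordered_pairs)
```

With `<>`, Portugal-Spain and Spain-Portugal both remain in the result, which matches the output above. Whether you want both directions depends on the question you are asking.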


Notice that a self-join doesn't have a syntax quite as simple as `INNER JOIN`: you can't just write `SELF JOIN` in SQL code.

### CASE WHEN and THEN
The next command isn't a join, but is a useful tool in your repertoire. You'll be introduced to using `CASE` with another table in the leaders database. The states table contains numeric data about different countries in the six inhabited world continents. We'll focus on the field `indep_year` now. Suppose we'd like to group the year of independence into categories of before 1900, between 1900 and 1930, and after 1930. `CASE` will get us there. `CASE` is a way to do multiple if-then-else statements in a simplified way in SQL.

### Preparing indep_year_group in states
You can now see the basic layout for creating a new field containing the groupings. 

```sql
SELECT name, continent, indep_year,
    CASE WHEN ___ < ___ THEN 'before 1900'
        WHEN indep_year <= 1930 THEN '___'
        ELSE '___' END
        AS indep_year_group
FROM states
ORDER BY indep_year_group;
```

How might we fill them in? After the first `WHEN`, we should specify that we want to check for `indep_year` being less than 1900. Next, we want `indep_year_group` to contain 'between 1900 and 1930' in the next blank. Lastly, any other record not matching these conditions will be assigned the value of 'after 1930' for `indep_year_group`.

### Creating indep_year_group in states
Check out the completed query with completed blanks. Notice how the values of `indep_year` are grouped in `indep_year_group`. Also observe how continent relates to `indep_year_group`.

In [46]:
%%sql
SELECT name, continent, indep_year,
    CASE WHEN indep_year < 1900 THEN 'before 1900'
        WHEN indep_year <= 1930 THEN 'between 1900 and 1930'
        ELSE 'after 1930' END
        AS indep_year_group
FROM states
ORDER BY indep_year;



name,continent,indep_year,indep_year_group
Portugal,Europe,1143,before 1900
Spain,Europe,1492,before 1900
Haiti,North America,1804,before 1900
Chile,South America,1810,before 1900
Uruguay,South America,1828,before 1900
Liberia,Africa,1847,before 1900
Australia,Oceania,1901,between 1900 and 1930
Norway,Europe,1905,between 1900 and 1930
Egypt,Africa,1922,between 1900 and 1930
Vietnam,Asia,1945,after 1930
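
`CASE` checks its `WHEN` conditions in order and stops at the first one that is true, which is why the second condition does not need to repeat `indep_year >= 1900`. The sketch below illustrates this on an in-memory SQLite table. It assumes three rows taken from the output above plus three clearly hypothetical rows that probe the boundaries and a missing year.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE states (name TEXT, indep_year INTEGER);
    INSERT INTO states VALUES
        ('Portugal', 1143),
        ('Australia', 1901),
        ('Vietnam', 1945),
        ('Hypothetical A', 1900),   -- boundary value, not real data
        ('Hypothetical B', 1930),   -- boundary value, not real data
        ('Hypothetical C', NULL);   -- missing year, not real data
""")

rows = conn.execute("""
    SELECT name,
        CASE WHEN indep_year < 1900 THEN 'before 1900'
            WHEN indep_year <= 1930 THEN 'between 1900 and 1930'
            ELSE 'after 1930' END
            AS indep_year_group
    FROM states;
""").fetchall()

# Map each name to its group for easy inspection.
groups = dict(rows)
for name, group in groups.items():
    print(name, "->", group)
```

Both 1900 and 1930 land in the middle group. A missing year falls through to `ELSE`, because a comparison with `NULL` is never true, so the label 'after 1930' can be misleading for rows without a year.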


## Self-join
In this exercise, you'll use the `populations` table to perform a self-join to calculate the percentage increase in population from 2010 to 2015 for each country code!

Since you'll be joining the `populations` table to itself, you can alias `populations` as `p1` and also `populations` as `p2`. This is good practice whenever you are aliasing and your tables have the same first letter. Note that you are required to alias the tables with self-joins.

- Join `populations` with itself `ON` `country_code`.
- Select the `country_code` from `p1` and the `size` field from both `p1` and `p2`. Two output fields both named `size` would be indistinguishable, so alias `p1.size` as `size2010` and `p2.size` as `size2015`.

In [48]:
%sql sqlite:////Users/sj501/Documents/Jupyter/Jupyter_lab/17_Joining_Data_in_SQL/countries.sqlite

In [49]:
%%sql
SELECT p1.country_code, p1.size AS size2010, p2.size AS size2015
FROM populations AS p1
    INNER JOIN populations AS p2
        ON p1.country_code = p2.country_code
LIMIT 10;



country_code,size2010,size2015
ABW,101597,101597
ABW,101597,103889
ABW,103889,101597
ABW,103889,103889
AFG,27962207,27962207
AFG,27962207,32526562
AFG,32526562,27962207
AFG,32526562,32526562
AGO,21219954,21219954
AGO,21219954,25021974


- Notice from the result that for each `country_code` you have four entries laying out all combinations of 2010 and 2015.
- Extend the `ON` in your query to include only those records where the `p1.year` (2010) matches with `p2.year - 5` (2015 - 5 = 2010). This will omit the three entries per `country_code` that you aren't interested in.

In [50]:
%%sql
SELECT p1.country_code,
       p1.size AS size2010,
       p2.size AS size2015
FROM populations as p1
    INNER JOIN populations as p2
        ON p1.country_code = p2.country_code
            AND p1.year = (p2.year - 5)
LIMIT 10;



country_code,size2010,size2015
ABW,101597,103889
AFG,27962207,32526562
AGO,21219954,25021974
ALB,2913021,2889167
AND,84419,70473
ARE,8329453,9156963
ARG,41222875,43416755
ARM,2963496,3017712
ASM,55636,55538
ATG,87233,91818


As you just saw, you can also use SQL to calculate values like `p2.year - 5` for you. With two fields like `size2010` and `size2015`, you may want to determine the percentage increase from one field to the next:

With two numeric fields $A$ and $B$, the percentage growth from $A$ to $B$ can be calculated as $(B - A) / A \times 100.0$.

- Add a new field to `SELECT`, aliased as `growth_perc`, that calculates the percentage population growth from 2010 to 2015 for each country, using `p2.size` and `p1.size`.

In [61]:
%%sql
SELECT p1.country_code, p1.size AS size2010, p2.size AS size2015,
        ((p2.size - p1.size)*100.0/p1.size) AS growth_perc
FROM populations AS p1
    INNER JOIN populations AS p2
        ON p1.country_code = p2.country_code
                AND p1.year = (p2.year - 5)
LIMIT 10;

   sqlite://
 * sqlite:////Users/sj501/Documents/Jupyter/Jupyter_lab/17_Joining_Data_in_SQL/countries.sqlite
   sqlite:////Users/sj501/Documents/Jupyter/Jupyter_lab/17_Joining_Data_in_SQL/diagrams.sqlite
   sqlite:////Users/sj501/Documents/Jupyter/Jupyter_lab/17_Joining_Data_in_SQL/leaders.sqlite
Done.


country_code,size2010,size2015,growth_perc
ABW,101597,103889,2.255972125161176
AFG,27962207,32526562,16.3233002316305
AGO,21219954,25021974,17.917192468937493
ALB,2913021,2889167,-0.8188749754979453
AND,84419,70473,-16.519977730131842
ARE,8329453,9156963,9.934746015134488
ARG,41222875,43416755,5.321996585633583
ARM,2963496,3017712,1.82946087998769
ASM,55636,55538,-0.1761449421238047
ATG,87233,91818,5.256038425825089


## CASE WHEN and THEN
Often it's useful to look at a numerical field not as raw data, but instead as being in different categories or groups.

You can use `CASE` with `WHEN`, `THEN`, `ELSE`, and `END` to define a new grouping field.

Using the `countries` table, create a new field, aliased `AS geosize_group`, that assigns each country to one of three groups:

- If `surface_area` is greater than 2 million, `geosize_group` is `'large'`.
- If `surface_area` is greater than 350 thousand but not larger than 2 million, `geosize_group` is `'medium'`.
- Otherwise, `geosize_group` is `'small'`.

In [63]:
%%sql
SELECT country_name, continent, code, surface_area,
    CASE WHEN surface_area > 2000000 THEN 'large'
        WHEN surface_area > 350000 THEN 'medium'
        ELSE 'small' END
        AS geosize_group
FROM countries
LIMIT 10;

   sqlite://
 * sqlite:////Users/sj501/Documents/Jupyter/Jupyter_lab/17_Joining_Data_in_SQL/countries.sqlite
   sqlite:////Users/sj501/Documents/Jupyter/Jupyter_lab/17_Joining_Data_in_SQL/diagrams.sqlite
   sqlite:////Users/sj501/Documents/Jupyter/Jupyter_lab/17_Joining_Data_in_SQL/leaders.sqlite
Done.


country_name,continent,code,surface_area,geosize_group
Afghanistan,Asia,AFG,652090,medium
Netherlands,Europe,NLD,41526,small
Albania,Europe,ALB,28748,small
Algeria,Africa,DZA,2381740,large
American Samoa,Oceania,ASM,199,small
Andorra,Europe,AND,468,small
Angola,Africa,AGO,1246700,medium
Antigua and Barbuda,North America,ATG,442,small
United Arab Emirates,Asia,ARE,83600,small
Argentina,South America,ARG,2780400,large
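
The order of the `WHEN` branches matters, because `CASE` returns the result of the first condition that is true. The following sketch illustrates this with three surface areas taken from the output above. It assumes Python's built-in `sqlite3` module and a cut-down, in-memory `countries` table; `in_order` and `swapped` are illustrative names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Cut-down stand-in for `countries` (assumption: two columns suffice here).
conn.execute("CREATE TABLE countries (code TEXT, surface_area REAL)")
conn.executemany(
    "INSERT INTO countries VALUES (?, ?)",
    [("AFG", 652090), ("ALB", 28748), ("DZA", 2381740)],
)

# Largest threshold first: each country falls into the intended group.
in_order = dict(
    conn.execute(
        """
        SELECT code,
            CASE WHEN surface_area > 2000000 THEN 'large'
                WHEN surface_area > 350000 THEN 'medium'
                ELSE 'small' END AS geosize_group
        FROM countries
        """
    ).fetchall()
)

# Thresholds swapped: anything above 350000 matches the first branch,
# so the 'large' branch is never reached.
swapped = dict(
    conn.execute(
        """
        SELECT code,
            CASE WHEN surface_area > 350000 THEN 'medium'
                WHEN surface_area > 2000000 THEN 'large'
                ELSE 'small' END AS geosize_group
        FROM countries
        """
    ).fetchall()
)

print(in_order)
print(swapped)
```

In the swapped version Algeria (DZA) is labelled `'medium'`, which is why the strictest condition is tested first.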


## Inner challenge
The table you created with the added `geosize_group` field has been loaded for you here under the name `countries_plus`. Note the use and placement of the `INTO` clause, which creates the `countries_plus` table:
```sql
SELECT name, continent, code, surface_area,
    CASE WHEN surface_area > 2000000
            THEN 'large'
       WHEN surface_area > 350000
            THEN 'medium'
       ELSE 'small' END
       AS geosize_group
INTO countries_plus
FROM countries;
```

You will now explore the relationship between the size of a country in terms of surface area and in terms of population using grouping fields created with `CASE`.

By the end of this exercise, you'll be writing two queries back-to-back in a single script.

- Using the `populations` table, restricted to the `year` 2015, create a new field aliased as `popsize_group` to organize population `size` into
    - `'large'` (> 50 million),
    - `'medium'` (> 1 million), and
    - `'small'` groups.

    Select only the country code, population size, and this new `popsize_group` as fields.

In [64]:
%%sql
SELECT country_code, size,
    CASE WHEN size > 50000000 THEN 'large'
        WHEN size > 1000000 THEN 'medium'
        ELSE 'small' END
        AS popsize_group
FROM populations
WHERE year = 2015
LIMIT 10;

   sqlite://
 * sqlite:////Users/sj501/Documents/Jupyter/Jupyter_lab/17_Joining_Data_in_SQL/countries.sqlite
   sqlite:////Users/sj501/Documents/Jupyter/Jupyter_lab/17_Joining_Data_in_SQL/diagrams.sqlite
   sqlite:////Users/sj501/Documents/Jupyter/Jupyter_lab/17_Joining_Data_in_SQL/leaders.sqlite
Done.


country_code,size,popsize_group
ABW,103889,small
AFG,32526562,medium
AGO,25021974,medium
ALB,2889167,medium
AND,70473,small
ARE,9156963,medium
ARG,43416755,medium
ARM,3017712,medium
ASM,55538,small
ATG,91818,small


- Use `INTO` to save the result of the previous query as `pop_plus`. You can see an example of this in the `countries_plus` code in the assignment text. Make sure to include a `;` at the end of your `WHERE` clause.

- Then add a second query below the first, `SELECT * FROM pop_plus;`, to display all the records in `pop_plus` as the query result.

In [69]:
%%sql
CREATE TABLE pop_plus AS
SELECT country_code, size,
    CASE WHEN size > 50000000 THEN 'large'
        WHEN size > 1000000 THEN 'medium'
        ELSE 'small' END
        AS popsize_group
FROM populations
WHERE year = 2015;

SELECT * 
FROM pop_plus
LIMIT 10;

   sqlite://
 * sqlite:////Users/sj501/Documents/Jupyter/Jupyter_lab/17_Joining_Data_in_SQL/countries.sqlite
   sqlite:////Users/sj501/Documents/Jupyter/Jupyter_lab/17_Joining_Data_in_SQL/diagrams.sqlite
   sqlite:////Users/sj501/Documents/Jupyter/Jupyter_lab/17_Joining_Data_in_SQL/leaders.sqlite
Done.
Done.


country_code,size,popsize_group
ABW,103889,small
AFG,32526562,medium
AGO,25021974,medium
ALB,2889167,medium
AND,70473,small
ARE,9156963,medium
ARG,43416755,medium
ARM,3017712,medium
ASM,55538,small
ATG,91818,small


***SQLite does not support `SELECT ... INTO`, so I used `CREATE TABLE ___ AS` instead.***
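
The substitution can be checked in isolation. The sketch below assumes Python's built-in `sqlite3` module and an in-memory `populations` table holding a few rows from the outputs above. It first tries the `INTO` form, which SQLite is expected to reject, and then builds `pop_plus` with `CREATE TABLE ... AS`. The `DROP TABLE IF EXISTS` line is an addition, not part of the original cell; it makes the code safe to re-run, since `CREATE TABLE` fails if the table already exists.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE populations (country_code TEXT, year INTEGER, size INTEGER)"
)
conn.executemany(
    "INSERT INTO populations VALUES (?, ?, ?)",
    [("ABW", 2010, 101597), ("ABW", 2015, 103889), ("AFG", 2015, 32526562)],
)

# SELECT ... INTO is not part of SQLite's SQL dialect.
try:
    conn.execute(
        "SELECT country_code, size INTO pop_plus "
        "FROM populations WHERE year = 2015"
    )
    into_supported = True
except sqlite3.Error:
    into_supported = False

# SQLite's equivalent: create the table directly from the query result.
conn.execute("DROP TABLE IF EXISTS pop_plus")
conn.execute(
    """
    CREATE TABLE pop_plus AS
    SELECT country_code, size,
        CASE WHEN size > 50000000 THEN 'large'
            WHEN size > 1000000 THEN 'medium'
            ELSE 'small' END AS popsize_group
    FROM populations
    WHERE year = 2015
    """
)

rows = conn.execute("SELECT * FROM pop_plus ORDER BY country_code").fetchall()
print(into_supported)
print(rows)
```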

- Write a query to join `countries_plus AS c` on the left with `pop_plus AS p` on the right matching on the country code fields.
- Sort the data based on `geosize_group`, in ascending order so that `large` appears on top.
- Select the `name`, `continent`, `geosize_group`, and `popsize_group` fields.
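
A minimal sketch of the join and sort described in these steps, assuming Python's built-in `sqlite3` module and hand-built, in-memory versions of `countries_plus` and `pop_plus` holding three countries from the earlier outputs. Ascending order puts `large` on top only because the labels happen to sort alphabetically (`'large'` < `'medium'` < `'small'`).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Small stand-ins for the two tables (assumption: group labels are entered
# by hand here instead of being derived with CASE).
conn.executescript(
    """
    CREATE TABLE countries_plus (
        name TEXT, continent TEXT, code TEXT, geosize_group TEXT
    );
    CREATE TABLE pop_plus (
        country_code TEXT, size INTEGER, popsize_group TEXT
    );
    INSERT INTO countries_plus VALUES
        ('Albania', 'Europe', 'ALB', 'small'),
        ('Angola', 'Africa', 'AGO', 'medium'),
        ('Argentina', 'South America', 'ARG', 'large');
    INSERT INTO pop_plus VALUES
        ('ALB', 2889167, 'medium'),
        ('AGO', 25021974, 'medium'),
        ('ARG', 43416755, 'medium');
    """
)

# Inner join on the country code, sorted by the text label (ascending).
rows = conn.execute(
    """
    SELECT c.name, c.continent, c.geosize_group, p.popsize_group
    FROM countries_plus AS c
        INNER JOIN pop_plus AS p
            ON c.code = p.country_code
    ORDER BY c.geosize_group
    """
).fetchall()

for row in rows:
    print(row)
```

Sorting on a text label like this only works while the alphabetical order matches the intended ranking; sorting on the underlying numeric field, such as `surface_area`, avoids that dependency.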

In [82]:
%%sql
SELECT c.name, c.continent, c.geosize_group, p.popsize_group, c.surface_area
FROM countries_plus AS c
    JOIN pop_plus AS p
        ON c.code = p.country_code
ORDER BY c.surface_area DESC
LIMIT 20;

   sqlite://
 * sqlite:////Users/sj501/Documents/Jupyter/Jupyter_lab/17_Joining_Data_in_SQL/countries.sqlite
   sqlite:////Users/sj501/Documents/Jupyter/Jupyter_lab/17_Joining_Data_in_SQL/diagrams.sqlite
   sqlite:////Users/sj501/Documents/Jupyter/Jupyter_lab/17_Joining_Data_in_SQL/leaders.sqlite
Done.


name,continent,geosize_group,popsize_group,surface_area
Russian Federation,Europe,large,large,17075400
Canada,North America,large,medium,9970610
China,Asia,large,large,9572900
United States,North America,large,large,9363520
Brazil,South America,large,large,8547400
Australia,Oceania,large,medium,7741220
India,Asia,large,large,3287260
Argentina,South America,large,medium,2780400
Kazakhstan,Asia,large,medium,2724900
Sudan,Africa,large,medium,2505810
