# Your first join

Throughout this course, you'll be working with the `countries` database, which contains information about the most populous cities in the world, along with country-level economic, population, and geographic data. The database also contains information on the languages spoken in each country.

You can see the different tables in this database to get a sense of what they contain by clicking on the corresponding tabs. Click through them and familiarize yourself with the fields that seem to be shared across tables before you continue with the course.

In this exercise, you'll use the `cities` and `countries` tables to build your first inner join. You'll start off by selecting all columns in step 1, performing your join in step 2, and then refining your join to choose specific columns in step 3.

In [1]:
from pandasql import sqldf
import pandas as pd

# Create helper function for easier query execution
execute = lambda q: sqldf(q, globals())

In [2]:
cities = pd.read_csv("dataset/countries/cities.csv")
populations = pd.read_csv("dataset/countries/populations.csv")
languages = pd.read_csv("dataset/countries/languages.csv")
countries = pd.read_csv("dataset/countries/countries.csv")
economies = pd.read_csv("dataset/countries/economies.csv")
# Rename country_name to name so the queries below can refer to countries.name
countries.rename(columns={'country_name':'name'}, inplace=True)
populations.head()


Unnamed: 0,pop_id,country_code,year,fertility_rate,life_expectancy,size
0,20,ABW,2010,1.704,74.953537,101597.0
1,19,ABW,2015,1.647,75.573585,103889.0
2,2,AFG,2010,5.746,58.970829,27962207.0
3,1,AFG,2015,4.653,60.717171,32526562.0
4,12,AGO,2010,6.416,50.654171,21219954.0


In [3]:
query = """
-- Select all columns from cities
SELECT *
FROM cities
"""
result_df = execute(query)

# Show results
result_df.head()

Unnamed: 0,name,country_code,city_proper_pop,metroarea_pop,urbanarea_pop
0,Abidjan,CIV,4765000,,4765000
1,Abu Dhabi,ARE,1145000,,1145000
2,Abuja,NGA,1235880,6000000.0,1235880
3,Accra,GHA,2070463,4010054.0,2070463
4,Addis Ababa,ETH,3103673,4567857.0,3103673


In [4]:
query = """
SELECT * 
FROM cities
-- Inner join to countries
INNER JOIN countries
-- Match on country codes
ON cities.country_code = countries.code
"""
result_df = execute(query)

# Show results
result_df.head()

Unnamed: 0,name,country_code,city_proper_pop,metroarea_pop,urbanarea_pop,code,name.1,continent,region,surface_area,indep_year,local_name,gov_form,capital,cap_long,cap_lat
0,Abidjan,CIV,4765000,,4765000,CIV,Cote d'Ivoire,Africa,Western Africa,322463.0,1960.0,Cote dIvoire,Republic,Yamoussoukro,-4.0305,5.332
1,Abu Dhabi,ARE,1145000,,1145000,ARE,United Arab Emirates,Asia,Middle East,83600.0,1971.0,Al-Imarat al-´Arabiya al-Muttahida,Emirate Federation,Abu Dhabi,54.3705,24.4764
2,Abuja,NGA,1235880,6000000.0,1235880,NGA,Nigeria,Africa,Western Africa,923768.0,1960.0,Nigeria,Federal Republic,Abuja,7.48906,9.05804
3,Accra,GHA,2070463,4010054.0,2070463,GHA,Ghana,Africa,Western Africa,238533.0,1957.0,Ghana,Republic,Accra,-0.20795,5.57045
4,Addis Ababa,ETH,3103673,4567857.0,3103673,ETH,Ethiopia,Africa,Eastern Africa,1104300.0,-1000.0,YeItyop´iya,Republic,Addis Ababa,38.7468,9.02274


In [5]:
query = """
-- Select name fields (with alias) and region 
SELECT cities.name AS city, countries.name AS country, region
FROM cities
INNER JOIN countries
ON cities.country_code = countries.code;
"""
result_df = execute(query)

# Show results
result_df.head()

Unnamed: 0,city,country,region
0,Abidjan,Cote d'Ivoire,Western Africa
1,Abu Dhabi,United Arab Emirates,Middle East
2,Abuja,Nigeria,Western Africa
3,Accra,Ghana,Western Africa
4,Addis Ababa,Ethiopia,Eastern Africa
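
An inner join keeps only the rows that have a match in both tables. The sketch below illustrates this with the standard-library `sqlite3` module instead of `pandasql`, on a tiny made-up dataset: the rows `Atlantis`/`XXX` and `Peru`/`PER` are invented for illustration and are not taken from the course data.

```python
import sqlite3

# In-memory database with toy versions of the two tables (hypothetical rows).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE cities (name TEXT, country_code TEXT);
    CREATE TABLE countries (code TEXT, name TEXT);
    INSERT INTO cities VALUES ('Accra', 'GHA'), ('Abuja', 'NGA'), ('Atlantis', 'XXX');
    INSERT INTO countries VALUES ('GHA', 'Ghana'), ('NGA', 'Nigeria'), ('PER', 'Peru');
""")

# Same shape as the step 3 query: alias both name fields and match on the code.
inner_rows = conn.execute("""
    SELECT cities.name AS city, countries.name AS country
    FROM cities
    INNER JOIN countries
    ON cities.country_code = countries.code
    ORDER BY city
""").fetchall()

print(inner_rows)
```

The city with code `XXX` has no matching country, and the country with code `PER` has no matching city, so neither appears in the result.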


# Joining with aliased tables

In this exercise, you'll practice joining with aliased tables. You'll use data from both the `countries` and `economies` tables to examine the inflation rate in 2010 and 2015.

When writing joins, many SQL users prefer to write the `SELECT` statement after writing the join code, in case the `SELECT` statement requires using table aliases.

In [6]:
query = """
-- Select fields with aliases
SELECT c.code AS country_code, name, year, inflation_rate 
FROM countries AS c
-- Join to economies (alias e)
INNER JOIN economies AS e
-- Match on code field using table aliases
ON c.code = e.code
"""
result_df = execute(query)

# Show results
result_df.head()

Unnamed: 0,country_code,name,year,inflation_rate
0,AFG,Afghanistan,2010,2.179
1,AFG,Afghanistan,2015,-1.549
2,NLD,Netherlands,2010,0.932
3,NLD,Netherlands,2015,0.22
4,ALB,Albania,2010,3.605
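
The query above qualifies `code` with the alias `c` because both tables have a `code` column. As a rough illustration of what happens without the qualifier, the sketch below uses the standard-library `sqlite3` module on a made-up one-country dataset (the `AAA`/`Toyland` rows are hypothetical). Other database engines may word the error differently.

```python
import sqlite3

# Toy tables that both contain a column named code (hypothetical rows).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE countries (code TEXT, name TEXT);
    CREATE TABLE economies (code TEXT, year INTEGER, inflation_rate REAL);
    INSERT INTO countries VALUES ('AAA', 'Toyland');
    INSERT INTO economies VALUES ('AAA', 2010, 1.5), ('AAA', 2015, 2.5);
""")

join_clause = """
    FROM countries AS c
    INNER JOIN economies AS e
    ON c.code = e.code
"""

# Unqualified code is ambiguous: SQLite cannot tell which table it belongs to.
try:
    conn.execute("SELECT code, name, year " + join_clause)
    error_message = None
except sqlite3.OperationalError as exc:
    error_message = str(exc)

# Qualifying the shared column with a table alias resolves the ambiguity.
# name, year and inflation_rate each exist in only one table, so they need no prefix.
rows = conn.execute(
    "SELECT c.code, name, year, inflation_rate " + join_clause + " ORDER BY year"
).fetchall()

print(error_message)
print(rows)
```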


# USING in action

In the previous exercises, you performed your joins using the `ON` keyword. Recall that when the fields being joined on have the same name in both tables, you can take advantage of the `USING` clause.

You'll now explore the `languages` table from our database. Which languages are official languages, and which ones are unofficial?

In [7]:
query = """
SELECT c.name AS country, l.name AS language, official
FROM countries AS c
INNER JOIN languages AS l
-- Match using the code column
USING(code)
"""
result_df = execute(query)

# Show results
result_df.head()

Unnamed: 0,country,language,official
0,Afghanistan,Dari,1
1,Afghanistan,Other,0
2,Afghanistan,Pashto,1
3,Afghanistan,Turkic,0
4,Netherlands,Dutch,1
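
To see how `USING` relates to `ON`, here is a minimal sketch with the standard-library `sqlite3` module and a made-up dataset (the `AAA`/`Toyland`/`Toyish` rows are hypothetical). It assumes SQLite's behaviour, where `SELECT *` with `USING` lists the shared column only once; other engines may differ in the details.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE countries (code TEXT, name TEXT);
    CREATE TABLE languages (code TEXT, name TEXT, official INTEGER);
    INSERT INTO countries VALUES ('AAA', 'Toyland');
    INSERT INTO languages VALUES ('AAA', 'Toyish', 1), ('AAA', 'Other', 0);
""")

select_list = "SELECT c.name AS country, l.name AS language, official "
on_join = "FROM countries AS c INNER JOIN languages AS l ON c.code = l.code "
using_join = "FROM countries AS c INNER JOIN languages AS l USING (code) "

# Both spellings of the join return the same rows.
with_on = conn.execute(select_list + on_join + "ORDER BY language").fetchall()
with_using = conn.execute(select_list + using_join + "ORDER BY language").fetchall()

# With SELECT *, ON keeps both code columns while USING keeps a single one.
on_columns = [col[0] for col in conn.execute("SELECT * " + on_join).description]
using_columns = [col[0] for col in conn.execute("SELECT * " + using_join).description]

print(with_on == with_using)
print(len(on_columns), len(using_columns))
```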


# Relationships in our database

Now that you know more about the different types of relationships that can exist between tables, it's time to examine a few relationships in the `countries` database!

What best describes the relationship between `code` in the `countries` table and `country_code` in the `cities` table?

In [8]:
cities['country_code'].value_counts()

CHN    36
IND    18
JPN    11
BRA    10
USA     9
       ..
SDN     1
COD     1
MYS     1
PER     1
ARM     1
Name: country_code, Length: 81, dtype: int64

In [9]:
countries['code'].value_counts()


AFG    1
PRT    1
NOR    1
CIV    1
OMN    1
      ..
AUT    1
JAM    1
JPN    1
YEM    1
PSE    1
Name: code, Length: 205, dtype: int64

- This is a one-to-many relationship.

Which of these options best describes the relationship between the `countries` table and the `languages` table?

- This is a many-to-many relationship.
- One country can use many languages.
- One language can be used in many countries.
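
The `value_counts()` checks above can also be expressed in SQL. The sketch below is one possible way to do it with the standard-library `sqlite3` module: it counts how many rows share each key value on either side of the relationship. The helper `max_rows_per_key` and the toy rows are hypothetical and only serve the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE countries (code TEXT, name TEXT);
    CREATE TABLE cities (name TEXT, country_code TEXT);
    INSERT INTO countries VALUES ('AAA', 'Toyland'), ('BBB', 'Blockland');
    INSERT INTO cities VALUES ('Alpha', 'AAA'), ('Beta', 'AAA'), ('Gamma', 'BBB');
""")

def max_rows_per_key(connection, table, column):
    """Return the largest number of rows sharing one value of `column`.

    Hypothetical helper: `table` and `column` are interpolated directly,
    so only pass trusted identifiers.
    """
    query = (
        f"SELECT MAX(n) FROM "
        f"(SELECT COUNT(*) AS n FROM {table} GROUP BY {column})"
    )
    return connection.execute(query).fetchone()[0]

countries_max = max_rows_per_key(conn, "countries", "code")
cities_max = max_rows_per_key(conn, "cities", "country_code")

# Each code appears once in countries but can repeat in cities: one-to-many.
print(countries_max, cities_max)
```

If both sides returned a value greater than 1, the key would repeat in both tables, which is what you would expect in a many-to-many relationship.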

# Inspecting a relationship

You've just identified that the `countries` table has a many-to-many relationship with the `languages` table. That is, many languages can be spoken in a country, and a language can be spoken in many countries.

This exercise looks at each of these in turn. First, what is the best way to query all the different languages spoken in a country? And second, how is this different from the best way to query all the countries that speak each language?

In [10]:
query = """
-- Select country and language names, aliased
SELECT c.name AS country, l.name AS language
-- From countries (aliased)
FROM countries c
-- Join to languages (aliased)
INNER JOIN languages l
-- Use code as the joining field with the USING keyword
USING (code);
"""
result_df = execute(query)

# Show results
result_df.head()

Unnamed: 0,country,language
0,Afghanistan,Dari
1,Afghanistan,Other
2,Afghanistan,Pashto
3,Afghanistan,Turkic
4,Netherlands,Dutch


In [11]:
query = """
-- Rearrange SELECT statement, keeping aliases
SELECT l.name AS language, c.name AS country
FROM countries AS c
INNER JOIN languages AS l
USING(code)
-- Order the results by language
ORDER BY language
"""
result_df = execute(query)

# Show results
result_df.head()

Unnamed: 0,language,country
0,Afar,Djibouti
1,Afar,Eritrea
2,Afar,Ethiopia
3,Afrikaans,South Africa
4,Afrikaans,Namibia


In [12]:
query = """
-- Rearrange SELECT statement, keeping aliases
SELECT l.name AS language, c.name AS country
FROM countries AS c
INNER JOIN languages AS l
USING(code)
-- Find specific language
WHERE language IN ('Alsatian','Bhojpuri')
-- Find specific country
OR country = 'Armenia'
-- Order the results by language
ORDER BY language
"""
result_df = execute(query)

# Show results
result_df.head()

Unnamed: 0,language,country
0,Alsatian,France
1,Armenian,Armenia
2,Bhojpuri,Mauritius
3,Bhojpuri,Nepal
4,Kurdish,Armenia


# Joining multiple tables

Suppose you are interested in the relationship between fertility and unemployment rates. Your task in this exercise is to join tables to return the country name, year, fertility rate, and unemployment rate in a single result from the `countries`, `populations` and `economies` tables.

In [13]:
query = """
-- Select relevant fields
SELECT name, year, fertility_rate 
-- Inner join countries and populations, aliased, on code
FROM countries AS c
INNER JOIN populations AS p
ON c.code = p.country_code
"""
result_df = execute(query)

# Show results
result_df.head()

Unnamed: 0,name,year,fertility_rate
0,Afghanistan,2010,5.746
1,Afghanistan,2015,4.653
2,Netherlands,2010,1.79
3,Netherlands,2015,1.71
4,Albania,2010,1.663


In [14]:
query = """
-- Select fields
SELECT c.name, e.year, fertility_rate, e.unemployment_rate
FROM countries AS c
INNER JOIN populations AS p
ON c.code = p.country_code
-- Join to economies (as e)
JOIN economies AS e
-- Match on country code
ON p.country_code = e.code ;
"""
result_df = execute(query)

# Show results
result_df.head()

Unnamed: 0,name,year,fertility_rate,unemployment_rate
0,Afghanistan,2010,4.653,
1,Afghanistan,2015,4.653,
2,Afghanistan,2010,5.746,
3,Afghanistan,2015,5.746,
4,Netherlands,2010,1.71,4.995


# Checking multi-table joins

Have a look at the results of the previous query, for example the four Afghanistan rows in the output above. You can see that the 2015 `fertility_rate` has been paired with the 2010 `unemployment_rate`, and vice versa. Instead of four records per country, the query should return two: one for each year. The last join matched only on the country code, without also joining on `year`. Your task in this exercise is to fix your query by explicitly stating that both the country `code` and `year` should match!

In [15]:
query = """
SELECT name, e.year, fertility_rate, unemployment_rate
FROM countries AS c
INNER JOIN populations AS p
ON c.code = p.country_code
INNER JOIN economies AS e
ON c.code = e.code
-- Add an additional joining condition such that you are also joining on year
	AND p.year = e.year;
"""
result_df = execute(query)

# Show results
result_df.head()

Unnamed: 0,name,year,fertility_rate,unemployment_rate
0,Afghanistan,2010,5.746,
1,Afghanistan,2015,4.653,
2,Netherlands,2010,1.79,4.995
3,Netherlands,2015,1.71,6.891
4,Albania,2010,1.663,14.0
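
The effect of the extra joining condition is easy to reproduce on a small scale. The following sketch uses the standard-library `sqlite3` module with invented numbers for a single hypothetical country `AAA`: matching on the code alone pairs every population row with every economy row, while also matching on `year` keeps one row per year.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE populations (country_code TEXT, year INTEGER, fertility_rate REAL);
    CREATE TABLE economies (code TEXT, year INTEGER, unemployment_rate REAL);
    INSERT INTO populations VALUES ('AAA', 2010, 2.0), ('AAA', 2015, 1.8);
    INSERT INTO economies VALUES ('AAA', 2010, 7.0), ('AAA', 2015, 6.0);
""")

# Matching on the country code only: 2 population rows x 2 economy rows = 4 rows.
code_only = conn.execute("""
    SELECT p.year, e.year, fertility_rate, unemployment_rate
    FROM populations AS p
    INNER JOIN economies AS e
    ON p.country_code = e.code
""").fetchall()

# Matching on both code and year: one row per country and year.
code_and_year = conn.execute("""
    SELECT p.year, e.year, fertility_rate, unemployment_rate
    FROM populations AS p
    INNER JOIN economies AS e
    ON p.country_code = e.code
        AND p.year = e.year
    ORDER BY p.year
""").fetchall()

print(len(code_only), len(code_and_year))
```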
