**Imports**

In [None]:
from sqlalchemy import create_engine
from sqlalchemy import MetaData
from sqlalchemy import Table
from sqlalchemy import Column
from sqlalchemy import Integer, String
from sqlalchemy import inspect
import pandas as pd
from pprint import pprint as pp

**PostgreSQL Connection**

In [None]:
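# The '@' inside the password is percent-encoded as %40 so it is not parsed as the host separator
engine = create_engine('postgresql://postgres:Gr%40vitics3980@localhost/postgres')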
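Because `@` separates the credentials from the host in a database URL, a password that contains `@` (or `/`) generally needs percent-encoding before it is embedded in the connection string. A minimal sketch using only the standard library; the user name, password, and database below are placeholder assumptions, not real credentials:

```python
from urllib.parse import quote_plus

# Placeholder credentials for illustration only (assumed values).
user = "postgres"
password = "p@ss/word"

# quote_plus percent-encodes characters such as '@' and '/' that would
# otherwise be read as URL delimiters.
safe_password = quote_plus(password)
db_url = f"postgresql://{user}:{safe_password}@localhost/postgres"
print(db_url)
```

The resulting string can be passed to `create_engine`. Reading the password from an environment variable rather than hard-coding it would also keep it out of the notebook.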

In [None]:
meta = MetaData(schema="countries")

In [None]:
conn = engine.connect()

**Example(s) without pd.DataFrames - use fetchall**

In [None]:
result = conn.execute("""SELECT datname from pg_database""")

In [None]:
rows = result.fetchall()

In [None]:
[x for x in rows]

In [None]:
cities = conn.execute("""select * from countries.countries inner join countries.cities on countries.cities.country_code = countries.code""")

In [None]:
cities_res = cities.fetchall()

In [None]:
cities_list = cities_res[:10]  # keep only the first 10 joined rows

In [None]:
cities_list

# Introduction to joins

In this chapter, you'll be introduced to the concept of joining tables, and explore the different ways you can enrich your queries using inner joins and self-joins. You'll also see how to use the case statement to split up a field into different categories.

## Introduction to INNER JOIN

In [None]:
cities = conn.execute("select * from countries.cities")

In [None]:
cities_df = pd.read_sql("select * from countries.cities", conn)

In [None]:
cities_df.head()

In [None]:
sql_stmt = "SELECT * FROM countries.cities INNER JOIN countries.countries ON countries.cities.country_code = countries.countries.code"
pd.read_sql(sql_stmt, conn).head()

In [None]:
sql_stmt = "SELECT countries.cities.name as city, countries.countries.name as country, \
countries.countries.region FROM countries.cities INNER JOIN countries.countries ON \
countries.cities.country_code = countries.countries.code"
pd.read_sql(sql_stmt, conn).head()

## INNER JOIN via USING

![](https://github.com/trenton3983/DataCamp/blob/master/Images/joining_data_in_sql/inner_join_diagram.JPG?raw=true)

```sql
SELECT left_table.id as L_id,
       left_table.val as L_val,
       right_table.val as R_val
FROM left_table
INNER JOIN right_table
ON left_table.id = right_table.id;
```

* When the key field you'd like to join on has the same name in both tables, you can use a `USING` clause instead of the `ON` clause.

```sql
SELECT left_table.id as L_id,
       left_table.val as L_val,
       right_table.val as R_val
FROM left_table
INNER JOIN right_table
USING (id);
```
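To check that the `ON` and `USING` forms return the same rows, here is a small sketch against an in-memory SQLite database. SQLite stands in for PostgreSQL here, and the table contents are made-up assumptions that only mirror the shape of the diagram:

```python
import sqlite3

# Toy tables; the names follow the snippets above, the values are assumed.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE left_table  (id INTEGER, val TEXT);
    CREATE TABLE right_table (id INTEGER, val TEXT);
    INSERT INTO left_table  VALUES (1, 'L1'), (2, 'L2'), (3, 'L3');
    INSERT INTO right_table VALUES (1, 'R1'), (3, 'R3'), (4, 'R4');
""")

select = """
    SELECT left_table.id AS L_id,
           left_table.val AS L_val,
           right_table.val AS R_val
    FROM left_table
    INNER JOIN right_table
"""
# Same SELECT, two ways of naming the join key.
rows_on = con.execute(select + " ON left_table.id = right_table.id ORDER BY L_id").fetchall()
rows_using = con.execute(select + " USING (id) ORDER BY L_id").fetchall()

print(rows_on)
print(rows_using)
con.close()
```

Only ids 1 and 3 exist in both toy tables, so both queries keep just those two rows; id 2 and id 4 are dropped by the inner join.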

### Countries with prime ministers and presidents

```sql
SELECT p1.country, p1.continent, prime_minister, president
FROM leaders.presidents AS p1
INNER JOIN leaders.prime_ministers as p2
USING (country);
```

In [None]:
sql_stmt = "SELECT p1.country, p1.continent, prime_minister, president \
FROM leaders.presidents AS p1 \
INNER JOIN leaders.prime_ministers as p2 \
USING (country)"

pd.read_sql(sql_stmt, conn).head()

### Exercises

#### Review inner join using on

Why does the following code result in an error?

```sql
SELECT c.name AS country, l.name AS language
FROM countries AS c
  INNER JOIN languages AS l;
```

* `INNER JOIN` requires a specification of the key field (or fields) in each table.

#### Inner join with using

When joining tables with a common field name, e.g.

```sql
SELECT *
FROM countries
  INNER JOIN economies
    ON countries.code = economies.code
```

You can use `USING` as a shortcut:

```sql
SELECT *
FROM countries
  INNER JOIN economies
    USING(code)
```

You'll now explore how this can be done with the `countries` and `languages` tables.

**Instructions**

* Inner join `countries` on the left and `languages` on the right with `USING(code)`.
* Select the fields corresponding to:
    * country name `AS country`,
    * continent name,
    * language name `AS language`, and
    * whether or not the language is official.
* Remember to alias your tables using the first letter of their names.

```sql
-- 4. Select fields
SELECT c.name as country, c.continent, l.name as language, l.official
  -- 1. From countries (alias as c)
  FROM countries as c
  -- 2. Join to languages (as l)
  INNER JOIN languages as l
    -- 3. Match using code
    USING (code)
```

In [None]:
sql_stmt = "SELECT c.name as country, c.continent, l.name as language, l.official \
FROM countries.countries as c \
INNER JOIN countries.languages as l \
USING (code)"

pd.read_sql(sql_stmt, conn).head()

## Self-ish joins, just in CASE

### self-join on prime_ministers

In [None]:
sql_stmt = "SELECT * \
FROM leaders.prime_ministers"

pm_df = pd.read_sql(sql_stmt, conn)
pm_df.head()

* A self-join is an inner join where a table is joined with itself.
* Self-joins are used to compare values in a field to other values of the same field within the same table.
* This section also explores how to slice a numerical field into categories using the `CASE` command.
* Recall the prime ministers table:
    * What if you wanted to create a new table showing countries that are in the same continent, matched as pairs?

```sql
SELECT p1.country AS country1, p2.country AS country2, p1.continent
FROM leaders.prime_ministers as p1
INNER JOIN leaders.prime_ministers as p2
ON p1.continent = p2.continent;
```

In [None]:
sql_stmt = "SELECT p1.country AS country1, p2.country AS country2, p1.continent \
FROM leaders.prime_ministers as p1 \
INNER JOIN leaders.prime_ministers as p2 \
ON p1.continent = p2.continent"

pm_df_1 = pd.read_sql(sql_stmt, conn)
pm_df_1.head()

* The country column is selected twice (once from each alias), along with continent.
* The prime ministers table is on both the left and the right.
* The vital step is setting the key columns by which we match the table to itself.
    * For each country, there will be a match if the country in the "right table" (that's also prime_ministers) is in the same continent.
* The result pairs each country with every country in its continent, including itself.
    * Rows where country1 = country2 should not be included in the table.

### Finishing off the self-join on prime_ministers

```sql
SELECT p1.country AS country1, p2.country AS country2, p1.continent
FROM leaders.prime_ministers as p1
INNER JOIN leaders.prime_ministers as p2
ON p1.continent = p2.continent AND p1.country <> p2.country;
```

In [None]:
sql_stmt = "SELECT p1.country AS country1, p2.country AS country2, p1.continent \
FROM leaders.prime_ministers as p1 \
INNER JOIN leaders.prime_ministers as p2 \
ON p1.continent = p2.continent AND p1.country <> p2.country"

pm_df_2 = pd.read_sql(sql_stmt, conn)
pm_df_2.head()

In [None]:
pm_df_1.equals(pm_df_2)

* An `AND` clause can check that multiple conditions are met.
* Now a country is no longer matched with itself, because rows where both country values are equal are excluded.
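A toy illustration of what the extra `<>` condition removes, using an in-memory SQLite table in place of `leaders.prime_ministers`. The four rows are illustrative assumptions, not the notebook's data:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE prime_ministers (country TEXT, continent TEXT);
    INSERT INTO prime_ministers VALUES
        ('India', 'Asia'), ('Pakistan', 'Asia'), ('Vietnam', 'Asia'),
        ('Spain', 'Europe');
""")

base = """
    SELECT p1.country AS country1, p2.country AS country2, p1.continent
    FROM prime_ministers AS p1
    INNER JOIN prime_ministers AS p2
    ON p1.continent = p2.continent
"""
# Without the extra condition every country also pairs with itself.
all_pairs = con.execute(base).fetchall()
# With it, only pairs of two different countries remain.
distinct_pairs = con.execute(base + " AND p1.country <> p2.country").fetchall()

print(len(all_pairs), len(distinct_pairs))
con.close()
```

With three Asian rows and one European row, the first query yields 3×3 + 1×1 = 10 rows and the second 3×2 = 6. A country that is alone in its continent (Spain in this toy data) disappears from the filtered result entirely.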

### CASE WHEN and THEN

* The states table contains numeric data about different countries in the six inhabited world continents
* Group the year of independence into categories of:
    * before 1900
    * between 1900 and 1930
    * and after 1930
* `CASE` is SQL's way of chaining multiple if-then-else conditions; the first `WHEN` that matches determines the result.

```sql
SELECT name, continent, indep_year,
    CASE WHEN indep_year < 1900 THEN 'before 1900'
    WHEN indep_year <= 1930 THEN 'between 1900 and 1930'
    ELSE 'after 1930' END
    AS indep_year_group
FROM states
ORDER BY indep_year_group;
```

In [None]:
sql_stmt = "SELECT name, continent, indep_year, \
CASE WHEN indep_year < 1900 THEN 'before 1900' \
WHEN indep_year <= 1930 THEN 'between 1900 and 1930' \
ELSE 'after 1930' END \
AS indep_year_group \
FROM leaders.states \
ORDER BY indep_year_group"

pd.read_sql(sql_stmt, conn)
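The `WHEN` branches are tested top to bottom, and a row whose `indep_year` is `NULL` satisfies neither comparison, so it falls through to `ELSE`. A small sketch on an in-memory SQLite table; the state names and years are hypothetical test values, not rows from `leaders.states`:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE states (name TEXT, indep_year INTEGER);
    INSERT INTO states VALUES
        ('state_a', 1810), ('state_b', 1900), ('state_c', 1930),
        ('state_d', 1947), ('state_e', NULL);
""")

# Same CASE expression as above, applied to the toy rows.
groups = dict(con.execute("""
    SELECT name,
        CASE WHEN indep_year < 1900 THEN 'before 1900'
             WHEN indep_year <= 1930 THEN 'between 1900 and 1930'
             ELSE 'after 1930' END AS indep_year_group
    FROM states
""").fetchall())

for name, group in groups.items():
    print(name, group)
con.close()
```

The boundary years 1900 and 1930 both land in the middle group, and the row with no independence year is labelled `'after 1930'`. If that is not wanted, a leading `WHEN indep_year IS NULL THEN ...` branch would handle it explicitly.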

### Exercises

#### Self-join

In this exercise, you'll use the `populations` table to perform a self-join to calculate the percentage increase in population from 2010 to 2015 for each country code!

Since you'll be joining the `populations` table to itself, you can alias `populations` as `p1` and also `populations` as `p2`. This is good practice whenever you are aliasing and your tables have the same first letter. Note that you are required to alias the tables with self-joins.

**Instructions 1/3**
* Join `populations` with itself `ON country_code`.
* Select the `country_code` from `p1` and the `size` field from both `p1` and `p2`. SQL won't allow same-named fields, so alias `p1.size as size2010` and `p2.size as size2015`.

```sql
-- 4. Select fields with aliases
SELECT p1.size as size2010,
    p1.country_code,
    p2.size as size2015
-- 1. From populations (alias as p1)
FROM countries.populations as p1
  -- 2. Join to itself (alias as p2)
    INNER JOIN countries.populations as p2
    -- 3. Match on country code
    ON p1.country_code = p2.country_code
```

In [None]:
sql_stmt = "SELECT p1.size as size2010, p1.country_code, p2.size as size2015 \
FROM countries.populations as p1 \
INNER JOIN countries.populations as p2 \
ON p1.country_code = p2.country_code"

pd.read_sql(sql_stmt, conn).head()

**Instructions 2/3**

Notice from the result that for each country_code you have four entries laying out all combinations of 2010 and 2015.

* Extend the `ON` in your query to include only those records where the `p1.year` (2010) matches with `p2.year - 5` (2015 - 5 = 2010). This will omit the three entries per `country_code` that you aren't interested in.

```sql
-- 5. Select fields with aliases
SELECT p1.country_code,
       p1.size as size2010,
       p2.size as size2015
-- 1. From populations (alias as p1)
FROM countries.populations as p1
  -- 2. Join to itself (alias as p2)
    INNER JOIN countries.populations as p2
    -- 3. Match on country code
    ON p1.country_code = p2.country_code
      -- 4. and year (with calculation)
      AND p1.year = (p2.year - 5)
```

In [None]:
sql_stmt = "SELECT p1.size as size2010, p1.country_code, p2.size as size2015 \
FROM countries.populations as p1 \
INNER JOIN countries.populations as p2 \
ON p1.country_code = p2.country_code \
AND p1.year = (p2.year - 5)"

pd.read_sql(sql_stmt, conn).head()
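To see the "four entries per country" effect and how the year condition removes three of them, here is a sketch on an in-memory SQLite table. The country codes and sizes are invented for illustration and assume exactly two years (2010 and 2015) per country:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE populations (country_code TEXT, year INTEGER, size REAL);
    INSERT INTO populations VALUES
        ('AAA', 2010, 100.0), ('AAA', 2015, 120.0),
        ('BBB', 2010, 50.0),  ('BBB', 2015, 55.0);
""")

join_sql = """
    SELECT p1.country_code, p1.size AS size2010, p2.size AS size2015
    FROM populations AS p1
    INNER JOIN populations AS p2
    ON p1.country_code = p2.country_code
"""
# Joining on country_code alone gives every (year, year) combination.
unfiltered = con.execute(join_sql).fetchall()
# Adding the year condition keeps only the 2010 -> 2015 pairing.
filtered = con.execute(
    join_sql + " AND p1.year = p2.year - 5 ORDER BY p1.country_code"
).fetchall()

print(len(unfiltered), filtered)
con.close()
```

Each toy country contributes 2×2 = 4 rows to the unfiltered join (8 in total) and exactly one row once `p1.year = p2.year - 5` is added.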

**Instructions 3/3**

As you just saw, you can also use SQL to calculate values like `p2.year - 5` for you. With two fields like `size2010` and `size2015`, you may want to determine the percentage increase from one field to the next:

With two numeric fields `A` and `B`, the percentage growth from `A` to `B` can be calculated as $$\frac{(B - A)}{A} \times 100.0$$.

Add a new field to `SELECT`, aliased as `growth_perc`, that calculates the percentage population growth from 2010 to 2015 for each country, using `p2.size` and `p1.size`.


```sql
SELECT p1.country_code,
       p1.size AS size2010, 
       p2.size AS size2015,
       -- 1. calculate growth_perc
       ((p2.size - p1.size)/p1.size * 100.0) AS growth_perc
-- 2. From populations (alias as p1)
FROM countries.populations as p1
  -- 3. Join to itself (alias as p2)
  INNER JOIN countries.populations as p2
    -- 4. Match on country code
    ON p1.country_code = p2.country_code
        -- 5. and year (with calculation)
        AND p1.year = (p2.year - 5);
```

In [None]:
sql_stmt = "SELECT p1.size as size2010, p1.country_code, p2.size as size2015, \
((p2.size - p1.size)/p1.size * 100.0) AS growth_perc \
FROM countries.populations as p1 \
INNER JOIN countries.populations as p2 \
ON p1.country_code = p2.country_code \
AND p1.year = (p2.year - 5)"

pd.read_sql(sql_stmt, conn).head()
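One thing to watch with this formula is operand type. The notebook does not show the type of `size`; if it were an integer column (an assumption), dividing before multiplying by `100.0` would truncate, since both SQLite and PostgreSQL use integer division for integer operands. A sketch with literal values in SQLite:

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Growth from 200 to 250 should be 25 percent.
row = con.execute("""
    SELECT (250 - 200) / 200 * 100.0 AS divide_first,
           (250 - 200) * 100.0 / 200 AS multiply_first
""").fetchone()

divide_first, multiply_first = row
print(divide_first, multiply_first)
con.close()
```

Dividing first computes `50 / 200` as integer `0` before the float multiplication, while multiplying by `100.0` first promotes the whole expression to floating point. With a `real` or `numeric` column, the order used in the query above gives the expected result.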

#### Case when and then

Often it's useful to look at a numerical field not as raw data, but instead as being in different categories or groups.

You can use `CASE` with `WHEN`, `THEN`, `ELSE`, and `END` to define a new grouping field.

**Instructions**

Using the `countries` table, create a new field `AS geosize_group` that groups the countries into three groups:

* If `surface_area` is greater than 2 million, `geosize_group` is `'large'`.
* If `surface_area` is greater than 350 thousand but not larger than 2 million, `geosize_group` is `'medium'`.
* Otherwise, `geosize_group` is `'small'`.

```sql
SELECT name, continent, code, surface_area,
    -- 1. First case
    CASE WHEN ___ > ___ THEN '___'
        -- 2. Second case
        WHEN ___ > ___ THEN '___'
        -- 3. Else clause + end
        ELSE ___ END
        -- 4. Alias name
        AS ___
-- 5. From table
FROM ___;
```
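One possible way to fill in the template, following the three rules in the instructions, checked against an in-memory SQLite table. The row names and areas are hypothetical values chosen to sit on and around the thresholds; this is a sketch, not the course's reference solution:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE countries (name TEXT, surface_area REAL);
    INSERT INTO countries VALUES
        ('big', 3000000), ('edge', 2000000), ('mid', 400000), ('tiny', 350000);
""")

geosize = dict(con.execute("""
    SELECT name,
        -- 1. First case
        CASE WHEN surface_area > 2000000 THEN 'large'
             -- 2. Second case
             WHEN surface_area > 350000 THEN 'medium'
             -- 3. Else clause + end
             ELSE 'small' END
             -- 4. Alias name
             AS geosize_group
    -- 5. From table
    FROM countries
""").fetchall())

print(geosize)
con.close()
```

Because both comparisons are strict, a surface area of exactly 2 million is `'medium'` and exactly 350 thousand is `'small'`, matching "greater than" in the instructions.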

## Trimet

![](Images/postgres_trimet.JPG)

In [None]:
sql_stmt = "SELECT * FROM trimet.route"

In [None]:
trimet_route = pd.read_sql(sql_stmt, conn)

In [None]:
trimet_route

In [None]:
sql_stmt = "SELECT * FROM trimet.agency"

In [None]:
trimet_agency = pd.read_sql(sql_stmt, conn)

In [None]:
trimet_agency

In [None]:
sql_stmt = "select tr.route_number, ta.agency_name \
from trimet.route tr \
inner join trimet.agency ta \
on tr.agency_id = ta.id where agency_name = 'Trimet'"

In [None]:
pd.read_sql(sql_stmt, conn)

In [None]:
conn.close()