![sql](images/sql-logo.jpg)

# From SQL Queries to Pandas

First, let's discuss: what does querying a database actually look like in the real world, or on the job?

One tool for browsing SQLite databases without writing code: [DB Browser for SQLite](https://sqlitebrowser.org/)

## Let's Explore a Database!

In [None]:
# of course, need an import
import sqlite3

#### Load a database object with `connect` and `cursor`

In [None]:
con = sqlite3.connect('data/flights.db')
cursor = con.cursor()

Our cursor is what we'll use to execute queries on a database.
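As a minimal sketch of how a cursor is used, the example below runs against a throwaway in-memory database rather than `data/flights.db`. The `demo` table and its rows are made up for illustration.

```python
import sqlite3

# ":memory:" creates a temporary database that lives only in RAM
demo_con = sqlite3.connect(":memory:")
demo_cursor = demo_con.cursor()

# Hypothetical table, not part of flights.db
demo_cursor.execute("CREATE TABLE demo (id INTEGER, name TEXT)")
demo_cursor.executemany("INSERT INTO demo VALUES (?, ?)", [(1, "a"), (2, "b")])

# execute() runs the query; fetchall() returns the rows as a list of tuples
demo_cursor.execute("SELECT id, name FROM demo ORDER BY id")
rows = demo_cursor.fetchall()
print(rows)

demo_con.close()
```

Each row comes back as a plain Python tuple, in the order the columns were listed in `SELECT`.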

## The Structure of a SQL Query

![structure of a sql query](images/sql_statement.jpg)

### Constructing SQL queries

**`SELECT`**:  The columns you want

options:
 - `DISTINCT`
 - using `AS` to rename columns, called *aliasing*
 - aggregates that return a single number, like `COUNT`, `SUM` or `AVG`

**`FROM`**: the source tables

options:
- can also alias tables with `AS`
- this is also where we join other tables, with `[LEFT|INNER|RIGHT|FULL] JOIN ___ [ON|USING]`

**`WHERE`**: your filters

options: 
- comparators like `=` & `>=`
- `BETWEEN`, `IN`, `LIKE` (with wildcards `%`)
- booleans like `AND`, `OR`, `NOT`

**`ORDER BY`**: sorting

options: 
 - `ASC` (default) and `DESC`

**`LIMIT`**: the number of rows to return (pair with `OFFSET` to skip rows first)

There are more! But those are most of the pieces of a SQL query that we'll use for now.
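To see those clauses working together, here is a small sketch against an in-memory database. The `airports_demo` table, its columns, and its rows are invented for this example and are not the real `airports` table.

```python
import sqlite3

demo_con = sqlite3.connect(":memory:")
demo_cur = demo_con.cursor()

# Hypothetical table with made-up rows
demo_cur.execute(
    "CREATE TABLE airports_demo (name TEXT, country TEXT, altitude INTEGER)"
)
demo_cur.executemany(
    "INSERT INTO airports_demo VALUES (?, ?, ?)",
    [
        ("Alpha", "Canada", 100),
        ("Bravo", "Canada", 900),
        ("Charlie", "Chile", 500),
        ("Delta", "China", 50),
        ("Echo", "France", 300),
    ],
)

clause_rows = demo_cur.execute("""
    SELECT name AS airport, altitude           -- columns, with an alias
    FROM airports_demo                         -- source table
    WHERE country LIKE 'C%' AND altitude >= 100  -- filters
    ORDER BY altitude DESC                     -- sorting
    LIMIT 2;                                   -- cap on rows returned
""").fetchall()
print(clause_rows)

demo_con.close()
```

Three rows pass the `WHERE` filter, they are sorted from highest to lowest altitude, and `LIMIT 2` keeps only the top two.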

**NOTE:** SQL doesn't care about spacing, and doesn't care about the capitalization of keywords like `SELECT` and `FROM`. Writing keywords in uppercase is just convention, but it makes your queries easier to read, for yourself and others.

### Using `PRAGMA`

Note that [`PRAGMA`](https://www.sqlite.org/pragma.html) is a query statement specific to SQLite - in some ways similar to the 

**output:**

`(column id, column name, data type, whether the column has a NOT NULL constraint, the default value for the column, and whether the column is part of the primary key)`
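As a quick sketch of what each field in that tuple means, the example below builds a small made-up `demo` table in memory and unpacks the six values that SQLite's documentation names `cid`, `name`, `type`, `notnull`, `dflt_value` and `pk`. The table definition is hypothetical.

```python
import sqlite3

demo_con = sqlite3.connect(":memory:")
demo_cur = demo_con.cursor()

# Hypothetical table that exercises the different fields
demo_cur.execute("""
    CREATE TABLE demo (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        country TEXT DEFAULT 'unknown'
    )
""")

table_info = demo_cur.execute("PRAGMA table_info(demo)").fetchall()

# Unpack each row into the six documented fields
for cid, name, dtype, notnull, default, pk in table_info:
    print(f"{cid}: {name} ({dtype}) notnull={notnull} default={default} pk={pk}")

demo_con.close()
```

Here `name` reports `notnull=1` because of its `NOT NULL` constraint, and `id` reports `pk=1` because it is the primary key.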

In [None]:
cursor.execute("PRAGMA table_info(airports)")
info = cursor.fetchall()
print(*info, sep="\n")  # cool trick: unpack the list so each row prints on its own line

Now let's get the descriptive data for the other two tables, `airlines` and `routes`

In [None]:
cursor.execute("PRAGMA table_info(airlines)")
info = cursor.fetchall()
print(*info, sep = "\n")

In [None]:
cursor.execute("PRAGMA table_info(routes)")
info = cursor.fetchall()
print(*info, sep = "\n")

## Time to Practice!

#### Write a query that will join the **latitude** and **longitude** data from the `airports` table to the information on the `routes` table
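Before writing this against the real tables, it may help to see the shape of a join on toy data. This is only a sketch: the `airports_demo` and `routes_demo` tables, and the `source_id` and `id` columns used as the join key, are assumptions made up here. Check the `PRAGMA table_info` output above for the real column names.

```python
import sqlite3

demo_con = sqlite3.connect(":memory:")
demo_cur = demo_con.cursor()

# Hypothetical tables and join key, not the real flights.db schema
demo_cur.execute(
    "CREATE TABLE airports_demo (id INTEGER, name TEXT, latitude REAL, longitude REAL)"
)
demo_cur.execute("CREATE TABLE routes_demo (airline TEXT, source_id INTEGER)")
demo_cur.executemany(
    "INSERT INTO airports_demo VALUES (?, ?, ?, ?)",
    [(1, "Alpha", 45.5, -73.6), (2, "Bravo", -33.4, -70.8)],
)
demo_cur.executemany(
    "INSERT INTO routes_demo VALUES (?, ?)",
    [("AA", 1), ("BB", 2), ("CC", 3)],
)

# LEFT JOIN keeps every route, even when no airport matches
joined = demo_cur.execute("""
    SELECT r.airline, a.name, a.latitude, a.longitude
    FROM routes_demo AS r
    LEFT JOIN airports_demo AS a
        ON r.source_id = a.id
    ORDER BY r.airline;
""").fetchall()
print(*joined, sep="\n")

demo_con.close()
```

The route with no matching airport still appears, with `None` in the airport columns. An `INNER JOIN` would drop that row instead.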

#### Which countries have the most active airlines?

Return the 25 countries with the most active airlines

#### What about inactive airlines?

Return all the countries that have more than 10 inactive airlines.
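Questions like these rely on `GROUP BY`, and filtering on a count needs `HAVING`, neither of which is in the clause list earlier in the lesson. Below is a rough sketch on made-up data, with a lower threshold so the toy table stays small. The `airlines_demo` table and its rows are hypothetical, though the `country` and `active` columns mirror the ones used on the real `airlines` table later in this lesson.

```python
import sqlite3

demo_con = sqlite3.connect(":memory:")
demo_cur = demo_con.cursor()

# Hypothetical table with made-up rows
demo_cur.execute("CREATE TABLE airlines_demo (name TEXT, country TEXT, active TEXT)")
demo_cur.executemany(
    "INSERT INTO airlines_demo VALUES (?, ?, ?)",
    [
        ("A1", "Canada", "N"),
        ("A2", "Canada", "N"),
        ("A3", "Chile", "N"),
        ("A4", "Chile", "Y"),
        ("A5", "France", "N"),
        ("A6", "France", "N"),
        ("A7", "France", "N"),
    ],
)

# WHERE filters rows before grouping; HAVING filters groups after counting
inactive = demo_cur.execute("""
    SELECT country, COUNT(*) AS inactive_count
    FROM airlines_demo
    WHERE active = 'N'
    GROUP BY country
    HAVING COUNT(*) > 1
    ORDER BY inactive_count DESC;
""").fetchall()
print(inactive)

demo_con.close()
```

Chile is dropped by `HAVING` because it has only one inactive airline in the toy data.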

#### How many airports are there in each timezone?

### New Stuff! CASE Statements

CASE statements are SQL's version of `if ... then ... else`. Every `CASE` must be closed with an `END`, usually followed by `AS` to name the new column.

Useful reference: https://mode.com/sql-tutorial/sql-case/

In [None]:
# Example - create a new column that shows an airport's hemisphere
cursor.execute("""
    SELECT name, city, country,
    CASE WHEN latitude > 0 THEN 'northern'
        ELSE 'southern'
        END AS hemisphere
    FROM airports
    LIMIT 10;
    """).fetchall()

What's happening?

1. The CASE statement checks each row to see if the conditional statement (`latitude > 0`) is true
2. If that conditional statement is true for that row, the word "northern" is placed in the column that we have named `hemisphere`
3. If the conditional statement is not true for that row, the word "southern" is placed in the `hemisphere` column instead
4. At the same time, SQL retrieves the values in the `name`, `city` and `country` columns for each row, and `LIMIT 10` caps the result at 10 rows
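A CASE can also chain several `WHEN` branches, which are checked in order until one is true. The sketch below uses a made-up `airports_demo` table to show this. Note that with only a single `latitude > 0` test, a latitude of exactly 0 or a missing (`NULL`) latitude would fall through to the `ELSE` and be labeled 'southern'.

```python
import sqlite3

demo_con = sqlite3.connect(":memory:")
demo_cur = demo_con.cursor()

# Hypothetical table; Delta has no latitude recorded
demo_cur.execute("CREATE TABLE airports_demo (name TEXT, latitude REAL)")
demo_cur.executemany(
    "INSERT INTO airports_demo VALUES (?, ?)",
    [("Alpha", 45.5), ("Bravo", -33.4), ("Charlie", 0.0), ("Delta", None)],
)

hemispheres = demo_cur.execute("""
    SELECT name,
    CASE WHEN latitude > 0 THEN 'northern'
        WHEN latitude < 0 THEN 'southern'
        WHEN latitude = 0 THEN 'equator'
        ELSE 'unknown'
        END AS hemisphere
    FROM airports_demo
    ORDER BY name;
""").fetchall()
print(*hemispheres, sep="\n")

demo_con.close()
```

Comparisons against `NULL` are never true, so the row with no latitude skips every `WHEN` and lands in the `ELSE` branch.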

It's always a good idea to close our connections when we're done

In [None]:
# Closing those connections
cursor.close()
con.close()
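If remembering to call `close()` feels error-prone, one option is `contextlib.closing` from the standard library, which closes the object when the `with` block ends, even if a query raises an error. This is a sketch using an in-memory database, not a required pattern. A bare `with sqlite3.connect(...)` block manages transactions but does not close the connection.

```python
import sqlite3
from contextlib import closing

# closing() calls .close() on the object when the with block exits
with closing(sqlite3.connect(":memory:")) as demo_con:
    with closing(demo_con.cursor()) as demo_cursor:
        answer = demo_cursor.execute("SELECT 1 + 1").fetchone()[0]

print(answer)

# The connection is now closed, so using it again raises an error
try:
    demo_con.execute("SELECT 1")
    still_open = True
except sqlite3.ProgrammingError:
    still_open = False
print(still_open)
```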

## Moving from SQLite3 to pandas

In [None]:
# need to import pandas!
import pandas as pd

In [None]:
# Can use either pd.read_sql_query or pd.read_sql (both need an open connection)
pd_con = sqlite3.connect('data/flights.db')

#### Convert one of the earlier queries in the lesson to a pandas data frame
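As a sketch of the pattern, here is `pd.read_sql_query` run against a made-up in-memory table rather than `data/flights.db`. The `airlines_demo` table and its rows are hypothetical. With the real database, you would pass your own query string and your open connection instead.

```python
import sqlite3
import pandas as pd

demo_con = sqlite3.connect(":memory:")

# Hypothetical table with made-up rows
demo_con.execute("CREATE TABLE airlines_demo (name TEXT, country TEXT, active TEXT)")
demo_con.executemany(
    "INSERT INTO airlines_demo VALUES (?, ?, ?)",
    [
        ("A1", "Canada", "Y"),
        ("A2", "France", "Y"),
        ("A3", "France", "Y"),
        ("A4", "Chile", "N"),
    ],
)
demo_con.commit()

# read_sql_query takes the SQL string and an open connection,
# and returns a DataFrame with column names already filled in
demo_df = pd.read_sql_query("""
    SELECT country, COUNT(*) AS active_airline_count
    FROM airlines_demo
    WHERE active = 'Y'
    GROUP BY country
    ORDER BY active_airline_count DESC;
""", demo_con)
print(demo_df)

demo_con.close()
```

Unlike `cursor.execute(...).fetchall()`, the column names come along automatically, so there is no need to set `df.columns` by hand.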

In [None]:
# Now close that pandas connection
pd_con.close()

Another way to move results into a pandas dataframe:

In [None]:
# closed our connections before, need to open them back up
con = sqlite3.connect('data/flights.db')
cursor = con.cursor()

In [None]:
res = cursor.execute("""
    SELECT country, COUNT(*) AS active_airline_count
    FROM airlines
    WHERE active = 'Y'
    GROUP BY country
    ORDER BY active_airline_count DESC;
    """).fetchall()

In [None]:
df = pd.DataFrame(res)
df.columns = [desc[0] for desc in cursor.description]

In [None]:
df.head()

In [None]:
cursor.close()
con.close()

## Additional Resources

Reading Resources:

- [MariaDB's list of relational database terms, which also helps explain table relationships](https://mariadb.com/kb/en/relational-databases-basic-terms/)
- [History of SQL Article](https://www.businessnewsdaily.com/5804-what-is-sql.html)
- [E. F. Codd's original relational model paper from 1970, the foundation that SQL was later built on](https://www.seas.upenn.edu/~zives/03f/cis550/codd.pdf)

Free SQL Courses: 

- [Kaggle's Courses](https://www.kaggle.com/learn/overview) on Intro to SQL and Advanced SQL - will include connecting to a Google BigQuery database
- [Khan Academy's SQL Course](https://www.khanacademy.org/computing/computer-programming/sql), which includes using more complicated query commands like CASE
- [Coursera Course on Modern Big Data Analysis with SQL](https://www.coursera.org/specializations/cloudera-big-data-analysis-sql), which was recommended to me via the data science subreddit - covers SQL queries with specific considerations for very large datasets stored in clusters in the cloud (specifically Hive and Impala). I'll likely be taking this course for fun over the next few weeks, if anyone wants to join me!