# SQL Introduction

Structured Query Language (**SQL**) is a [declarative language](http://en.wikipedia.org/wiki/Declarative_programming) for working with data. This differs from the [imperative style](https://en.wikipedia.org/wiki/Imperative_programming) of programming that we have used with Python so far. The biggest impact of this difference is that it is much harder to step "line-by-line" through an SQL statement to debug it. The upside is that your focus is on _what_ you want to do rather than _how_ to do it.
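To make the declarative/imperative contrast concrete, here is a minimal sketch that computes the same totals both ways. It uses Python's built-in `sqlite3` module with an in-memory database as a stand-in for a real database server; the `births` table and its rows are made up for illustration.

```python
import sqlite3

# Hypothetical sample data: (name, num) pairs.
rows = [("Ada", 3), ("Grace", 5), ("Ada", 4), ("Linus", 2)]

# Imperative: spell out *how* to accumulate the totals, step by step.
totals = {}
for name, num in rows:
    totals[name] = totals.get(name, 0) + num
imperative_result = sorted(totals.items())

# Declarative: describe *what* result we want and let the database work out how.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE births (name TEXT, num INTEGER)")
conn.executemany("INSERT INTO births VALUES (?, ?)", rows)
declarative_result = conn.execute(
    "SELECT name, SUM(num) FROM births GROUP BY name ORDER BY name"
).fetchall()
conn.close()

print(imperative_result)
print(declarative_result)
```

Both approaches give the same totals per name. In the Python loop you can inspect `totals` after every iteration, while the SQL statement is handed to the database as a single unit.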

The most common use cases for SQL are the "CRUD" operations on data:

* Creating
* Reading
* Updating
* Deleting

The majority of the time, data scientists and data analysts will be concerned with **READING** data from the database to use in their modeling.
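As a rough illustration, each CRUD operation maps onto one SQL statement. The sketch below again uses an in-memory `sqlite3` database rather than Postgres, and the `students` table is a hypothetical example, not part of today's data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Create: define a table and add records to it.
conn.execute("CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT, cohort TEXT)")
conn.executemany(
    "INSERT INTO students (name, cohort) VALUES (?, ?)",
    [("Ana", "spring"), ("Ben", "spring"), ("Cy", "fall")],
)

# Read: the operation analysts use most often.
spring = conn.execute(
    "SELECT name FROM students WHERE cohort = ? ORDER BY name", ("spring",)
).fetchall()

# Update: change existing records in place.
conn.execute("UPDATE students SET cohort = ? WHERE name = ?", ("fall", "Ben"))

# Delete: remove the records that match a condition.
conn.execute("DELETE FROM students WHERE name = ?", ("Cy",))

remaining = conn.execute("SELECT name, cohort FROM students ORDER BY name").fetchall()
conn.close()

print(spring)
print(remaining)
```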

## Why SQL?

The data in an SQL database is organized into _tables_, which can be thought of as similar to a Pandas DataFrame. One of the ongoing struggles is convincing students (**like YOU!**) to use SQL, when it seems to do the same thing as reading a CSV into a DataFrame.

#### The typical Pandas workflow:
* Grab a CSV on your machine. It has all the different columns that you want, and is a "frozen" set of the data as it was when the CSV was made.
* Load it into a DataFrame
* Do the analysis.

This is simple and straightforward, but has several drawbacks:
1. It is difficult to get up-to-date data. You could try to solve this by grabbing CSVs that are automatically created online and loading them from a web address.
2. If working on a team, your CSVs will become out of date and out of sync with one another. Any changes made to the data won't be reflected in your work until you download a new CSV.
3. You need to load in _all_ the columns from the CSV. You can drop them later, but Pandas holds the whole DataFrame in memory, so this isn't memory efficient.
4. Related to the last point, having to download _all_ the data before you start filtering it and throwing away what you don't want is very inefficient. For example, if you were looking for average traffic per hour over the last week, and each location has a sensor that took data every minute, you would be downloading
  $$(1\text{ week}) \times (10080\text{ minutes/week}) = 10080\text{ records}$$
  per sensor to get 168 numbers (one average for each of the $7 \times 24$ hours in the week)!
5. If you need to make a change to the data, you cannot just modify a CSV and push it up to a server, because everyone on the team would be overwriting each other's changes.
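The traffic example can be sketched as follows: the database stores all 10080 per-minute readings, but a `GROUP BY` query returns only the 168 hourly averages. This is an illustration under assumptions: the `traffic` table, its columns, and the synthetic car counts are invented, and an in-memory `sqlite3` database stands in for a real server.

```python
import sqlite3

MINUTES_PER_WEEK = 7 * 24 * 60  # 10080 readings per sensor

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE traffic (sensor_id INTEGER, minute INTEGER, cars INTEGER)")

# Synthetic readings for one sensor: minute m of the week records (m % 7) cars.
conn.executemany(
    "INSERT INTO traffic VALUES (1, ?, ?)",
    [(m, m % 7) for m in range(MINUTES_PER_WEEK)],
)

# minute / 60 is integer division in SQLite, giving the hour of the week (0 to 167).
hourly = conn.execute(
    """
    SELECT minute / 60 AS hour, AVG(cars) AS avg_cars
    FROM traffic
    WHERE sensor_id = 1
    GROUP BY hour
    ORDER BY hour
    """
).fetchall()
conn.close()

print(len(hourly))  # number of rows that actually leave the database
```

Only the summarized rows cross the connection; the per-minute records never have to be downloaded.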

#### Database workflow
Databases address these issues by
* Being a single source of truth
* Allowing multiple users to modify them at the same time while ensuring consistency
* Living "close" to the data (i.e. processing happens inside the database, so you worry less about moving large amounts of data over the network)
* Being designed to scale with large data volumes

A workflow for using SQL and Pandas together is:
1. Download a sample of your data from SQL into Pandas.
2. Do EDA on the sample. In particular:
    1. What sorts of missing data do you have? How are you going to impute?
    2. What is the data type of each column?
    3. Are there any obvious outliers? Are you going to correct them? Limit them? Filter them out?
3. Determine, where possible, how to either select cleaned data, or update the data in batches and push it back to the database.
4. In a *new* notebook, select only the columns that you want, filter to the rows that you want, and retrieve that data from the database. Load this data into a DataFrame in order to create visualizations or run your models on it.
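The steps above might translate into queries like the ones in this sketch. The `name_counts` table, its columns, and its rows are hypothetical, and an in-memory `sqlite3` database stands in for Postgres so the example runs on its own. In practice you would likely pass the same SQL strings to `pandas.read_sql_query` to get DataFrames back.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE name_counts (state TEXT, name TEXT, year INTEGER, births INTEGER)")
conn.executemany(
    "INSERT INTO name_counts VALUES (?, ?, ?, ?)",
    [
        ("NY", "Mary", 1950, 120),
        ("NY", "John", 1950, None),  # a missing value to discover during EDA
        ("CA", "Mary", 1951, 95),
        ("CA", "Linda", 1951, 210),
        ("TX", "John", 1952, 80),
    ],
)

# Step 1: pull a small sample rather than the whole table.
sample = conn.execute("SELECT * FROM name_counts LIMIT 3").fetchall()

# Step 2: some EDA questions can be answered in the database, e.g. counting missing values.
missing = conn.execute(
    "SELECT COUNT(*) FROM name_counts WHERE births IS NULL"
).fetchone()[0]

# Step 4: retrieve only the columns and rows you need, with the cleaning rule applied.
clean = conn.execute(
    "SELECT name, births FROM name_counts "
    "WHERE births IS NOT NULL AND year >= ? ORDER BY births DESC",
    (1951,),
).fetchall()
conn.close()

print(sample)
print(missing)
print(clean)
```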

#### Warning

Because the projects you work on at Metis prior to project 4 typically deal with small datasets (i.e. all the data can fit into memory) and only have a single developer (**YOU!**), it can be difficult to see what the benefits of using a database are in your own project.

With **large** data and **multiple** consumers of the data, using CSVs just won't cut it. SQL is **NOT** optional!

## Pandas to SQL translator

Pandas and SQL have a lot of similarities, and we can use this to familiarize ourselves with the standard objects in SQL:

| Pandas | SQL equivalent |
| --- | --- |
| DataFrame | Table |
| Column (of a DataFrame) | Field |
| Row (of a DataFrame) | Record |

#### Pandas:
* Can feed data directly into sklearn's models
* Easy to do visualization

#### SQL:
* Scales well with huge amounts of data
* Can summarize / group data close to source, lowering network traffic
* Allows consistency when working with data that is updating (multiple team members, or streaming updates)


## *Setup SQL (to do!)*

Today we are going to set up Postgres on our local machine and load some data into it. Today's focus is on learning how to query the database, so we will run some setup scripts to create a `names` database and put some information in there.

In a terminal in the directory for today's lectures, run the following commands (for OSX):
```bash
# unzips the data file into a 90 MB CSV
gunzip data/all_state_1950.csv.gz

# installs postgres 
brew install postgresql

# start postgres so it "listens" for connections
# (service name matches the formula installed above)
brew services start postgresql

# allows the current user to access postgres.
# it does this by creating a database for the active user
createdb

# run the setup script to load the data into postgres
psql -f setup.sql
```

You can check that this works by typing `psql` at the prompt. You should see something similar to:
```
psql (10.4)
Type "help" for help.

damien=# 
```
Type `\q` to quit.

## Next up....

Now that we have Postgres installed and data loaded, we are ready to look at [some exercises that use SQL](../sql-intro/02_SQL_Exercises.ipynb).