## Lesson 2.2 - Advanced SQL

Here's the situation: you're working with a PostgreSQL database at a large wine distributor that needs you to maintain it. You'll use some of your advanced SQL skills to take care of customer cases. Let's begin!

First, let's load the `ipython-sql` extension so that we can run SQL inside the IPython notebook.

In [1]:
# !pip uninstall psycopg2
# !conda install psycopg2
# !pip install ipython-sql

In [2]:
import pandas as pd
import numpy as np
from sqlalchemy import create_engine
from sqlalchemy.engine.url import URL
import psycopg2

In [3]:
PATH = '../../assets/datasets/'
df = pd.read_csv(PATH + 'wine.csv')
df.columns = [c.lower().replace(' ','') for c in df.columns] #postgres doesn't like capitals or spaces


In [12]:
%reload_ext sql

In [13]:
df.columns

Index([u'fixedacidity', u'volatileacidity', u'citricacid', u'residualsugar',
       u'chlorides', u'freesulfurdioxide', u'totalsulfurdioxide', u'density',
       u'ph', u'sulphates', u'alcohol', u'quality'],
      dtype='object')

Connect to the database. Note: enter your own connection string. For help loading the raw CSV file into a PostgreSQL database, refer to the documentation in the lesson plans for the previous SQL lessons.

Create your database in PostgreSQL. From **this folder**, launch psql in your terminal:
```bash
psql
```
```
user=# create database wine;
user=# \quit
```


In [14]:
engine = create_engine('postgresql://localhost:5432')

In [15]:
df.to_sql('wine', engine)  # raises ValueError if the table already exists; pass if_exists='replace' to overwrite it

ValueError: Table 'wine' already exists.

In [16]:
%%sql postgresql://localhost:5432/
        
SELECT * FROM wine LIMIT 5

5 rows affected.


index,fixedacidity,volatileacidity,citricacid,residualsugar,chlorides,freesulfurdioxide,totalsulfurdioxide,density,ph,sulphates,alcohol,quality
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5


Select all of the wines and add a new column `high_alc` that is 1 when the alcohol content is above 10. Otherwise the value should be `NULL`.

In [25]:
%%sql 

SELECT *, CASE WHEN alcohol > 10 THEN 1 ELSE NULL END AS high_alc FROM wine LIMIT 5;

5 rows affected.


index,fixedacidity,volatileacidity,citricacid,residualsugar,chlorides,freesulfurdioxide,totalsulfurdioxide,density,ph,sulphates,alcohol,quality,high_alc
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5,
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5,
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5,
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6,
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5,
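
The first five rows all fall below the threshold, so `high_alc` is empty in every row shown. A minimal sketch of the same `CASE` expression with both outcomes visible follows. It assumes an in-memory SQLite table as a stand-in for the PostgreSQL `wine` table (so it runs without a server), and the three rows are made up for illustration.

```python
import sqlite3

# In-memory SQLite stands in for the PostgreSQL wine table (an assumption
# made so the sketch runs without a server); the rows are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE wine (alcohol REAL, quality INTEGER)")
conn.executemany(
    "INSERT INTO wine (alcohol, quality) VALUES (?, ?)",
    [(9.4, 5), (10.0, 8), (12.8, 8)],
)

# CASE yields 1 when the condition holds; with no ELSE branch it yields NULL,
# which Python receives as None.
flagged = conn.execute(
    "SELECT alcohol, CASE WHEN alcohol > 10 THEN 1 END AS high_alc "
    "FROM wine ORDER BY alcohol"
).fetchall()

for alcohol, high_alc in flagged:
    print(alcohol, high_alc)
```

Note that a wine with an alcohol content of exactly 10 is not flagged, because the comparison is strict.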


Someone decided that they wanted to purchase *all* of these high alcohol wines for their restaurants, so make sure to mark them as *sold*. Your predecessor forgot to add a column for *sales date*, so you will have to add this to the database table as well.

In [35]:
%%sql
ALTER table wine
ADD sale_date date

(psycopg2.ProgrammingError) column "sale_date" of relation "wine" already exists
 [SQL: 'ALTER table wine\nADD sale_date date']
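
The error above appears when the cell is run a second time, because the column was already added on the first run. In PostgreSQL 9.6 and later, `ALTER TABLE wine ADD COLUMN IF NOT EXISTS sale_date date;` makes the statement safe to repeat. The sketch below shows the same idea by checking the column list first. It assumes an in-memory SQLite table as a stand-in for PostgreSQL, and `add_column_if_missing` is a hypothetical helper, not part of any library.

```python
import sqlite3

# In-memory SQLite stands in for the PostgreSQL wine table (an assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE wine (alcohol REAL, quality INTEGER)")


def add_column_if_missing(conn, table, column, col_type):
    """Hypothetical helper: add a column only if the table lacks it.

    The table and column names are interpolated into the SQL text, so they
    must come from trusted code, never from user input.
    """
    # PRAGMA table_info returns one row per column; the name is at index 1.
    existing = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
    if column in existing:
        return False
    conn.execute(f"ALTER TABLE {table} ADD COLUMN {column} {col_type}")
    return True


first = add_column_if_missing(conn, "wine", "sale_date", "DATE")
second = add_column_if_missing(conn, "wine", "sale_date", "DATE")
print(first, second)  # the second call is a no-op instead of an error
```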


Set their sale date to today.

In [36]:
%%sql

UPDATE wine SET sale_date = CURRENT_DATE WHERE alcohol >10

852 rows affected.


[]
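
An `UPDATE` returns no rows, only a count of the rows it changed, so it is worth checking that count against a `SELECT`. The following is a sketch of that check, assuming an in-memory SQLite table with made-up rows in place of the PostgreSQL `wine` table.

```python
import sqlite3

# In-memory SQLite stands in for the PostgreSQL wine table (an assumption);
# the three rows are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE wine (alcohol REAL, sale_date TEXT)")
conn.executemany(
    "INSERT INTO wine (alcohol) VALUES (?)", [(9.4,), (10.5,), (12.8,)]
)

# rowcount reports how many rows the UPDATE touched.
cur = conn.execute(
    "UPDATE wine SET sale_date = CURRENT_DATE WHERE alcohol > 10"
)
updated = cur.rowcount

# Cross-check: every updated row now has a sale date, the rest are still NULL.
sold = conn.execute(
    "SELECT COUNT(*) FROM wine WHERE sale_date IS NOT NULL"
).fetchone()[0]
unsold = conn.execute(
    "SELECT COUNT(*) FROM wine WHERE sale_date IS NULL"
).fetchone()[0]
print(updated, sold, unsold)
```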

Select all of the wines with high alcohol content to confirm that the sale date was set.

In [46]:
%%sql
SELECT * FROM wine WHERE alcohol > 10 LIMIT 5;

5 rows affected.


index,fixedacidity,volatileacidity,citricacid,residualsugar,chlorides,freesulfurdioxide,totalsulfurdioxide,density,ph,sulphates,alcohol,quality,high_alc,sale_date
9,7.5,0.5,0.36,6.1,0.071,17.0,102.0,0.9978,3.35,0.8,10.5,5,,2016-07-19
11,7.5,0.5,0.36,6.1,0.071,17.0,102.0,0.9978,3.35,0.8,10.5,5,,2016-07-19
16,8.5,0.28,0.56,1.8,0.092,35.0,103.0,0.9969,3.3,0.75,10.5,7,,2016-07-19
30,6.7,0.675,0.07,2.4,0.089,17.0,82.0,0.9958,3.35,0.54,10.1,5,,2016-07-19
31,6.9,0.685,0.0,2.5,0.105,22.0,37.0,0.9966,3.46,0.57,10.6,6,,2016-07-19


Now, for our analysis we want to take a look at all the high quality wines. Select the wines with ratings above 7 and save the result as a pandas DataFrame.

In [48]:
%%sql
SELECT * FROM wine WHERE quality > 7

18 rows affected.


index,fixedacidity,volatileacidity,citricacid,residualsugar,chlorides,freesulfurdioxide,totalsulfurdioxide,density,ph,sulphates,alcohol,quality,high_alc,sale_date
440,12.6,0.31,0.72,2.2,0.072,6.0,29.0,0.9987,2.88,0.82,9.8,8,,
1403,7.2,0.33,0.33,1.7,0.061,3.0,13.0,0.996,3.23,1.1,10.0,8,,
267,7.9,0.35,0.46,3.6,0.078,15.0,37.0,0.9973,3.35,0.86,12.8,8,,2016-07-19
278,10.3,0.32,0.45,6.4,0.073,5.0,13.0,0.9976,3.23,0.82,12.6,8,,2016-07-19
390,5.6,0.85,0.05,1.4,0.045,12.0,88.0,0.9924,3.56,0.82,12.9,8,,2016-07-19
455,11.3,0.62,0.67,5.2,0.086,6.0,19.0,0.9988,3.22,0.69,13.4,8,,2016-07-19
481,9.4,0.3,0.56,2.8,0.08,6.0,17.0,0.9964,3.15,0.92,11.7,8,,2016-07-19
495,10.7,0.35,0.53,2.6,0.07,5.0,16.0,0.9972,3.15,0.65,11.0,8,,2016-07-19
498,10.7,0.35,0.53,2.6,0.07,5.0,16.0,0.9972,3.15,0.65,11.0,8,,2016-07-19
588,5.0,0.42,0.24,2.0,0.06,19.0,50.0,0.9917,3.72,0.74,14.0,8,,2016-07-19


In [50]:
hQuality = pd.read_sql_query('SELECT * FROM wine WHERE quality >7;', engine)

In [52]:
hQuality.head()

Unnamed: 0,index,fixedacidity,volatileacidity,citricacid,residualsugar,chlorides,freesulfurdioxide,totalsulfurdioxide,density,ph,sulphates,alcohol,quality,high_alc,sale_date
0,440,12.6,0.31,0.72,2.2,0.072,6.0,29.0,0.9987,2.88,0.82,9.8,8,,
1,1403,7.2,0.33,0.33,1.7,0.061,3.0,13.0,0.996,3.23,1.1,10.0,8,,
2,267,7.9,0.35,0.46,3.6,0.078,15.0,37.0,0.9973,3.35,0.86,12.8,8,,2016-07-19
3,278,10.3,0.32,0.45,6.4,0.073,5.0,13.0,0.9976,3.23,0.82,12.6,8,,2016-07-19
4,390,5.6,0.85,0.05,1.4,0.045,12.0,88.0,0.9924,3.56,0.82,12.9,8,,2016-07-19


But wait! You just received a call: we want to view not just high quality wines, but high quality wines with low acidity and medium alcohol content. Remember, we cannot include the wines already sold in this query.

In [56]:
%%sql
SELECT * FROM wine WHERE quality > 7 AND fixedacidity < 7.5 AND sale_date is NULL; 

1 rows affected.


index,fixedacidity,volatileacidity,citricacid,residualsugar,chlorides,freesulfurdioxide,totalsulfurdioxide,density,ph,sulphates,alcohol,quality,high_alc,sale_date
1403,7.2,0.33,0.33,1.7,0.061,3.0,13.0,0.996,3.23,1.1,10.0,8,,


In [60]:
q = 'SELECT * FROM wine WHERE quality > 7 AND fixedacidity < 7.5 AND sale_date is NULL;'
hQuality2 = pd.read_sql_query(q, engine)

In [61]:
hQuality2.head()

Unnamed: 0,index,fixedacidity,volatileacidity,citricacid,residualsugar,chlorides,freesulfurdioxide,totalsulfurdioxide,density,ph,sulphates,alcohol,quality,high_alc,sale_date
0,1403,7.2,0.33,0.33,1.7,0.061,3.0,13.0,0.996,3.23,1.1,10.0,8,,
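
Because the thresholds in this query changed after a single phone call, it may help to pass them as parameters instead of editing the SQL string each time. The sketch below assumes an in-memory SQLite table with a few illustrative rows as a stand-in for PostgreSQL. SQLite uses `?` placeholders, psycopg2 uses `%s`, and `pd.read_sql_query` accepts the values through its `params` argument.

```python
import sqlite3

# In-memory SQLite stands in for the PostgreSQL wine table (an assumption);
# the rows are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE wine (fixedacidity REAL, quality INTEGER, sale_date TEXT)"
)
conn.executemany(
    "INSERT INTO wine VALUES (?, ?, ?)",
    [
        (12.6, 8, None),          # high quality, unsold, but too acidic
        (7.2, 8, None),           # high quality, low acidity, unsold
        (5.6, 8, "2016-07-19"),   # high quality, low acidity, already sold
        (7.4, 5, None),           # low acidity, unsold, but low quality
    ],
)

# The thresholds are bound as parameters, so the SQL text never changes.
query = (
    "SELECT fixedacidity, quality FROM wine "
    "WHERE quality > ? AND fixedacidity < ? AND sale_date IS NULL"
)
rows = conn.execute(query, (7, 7.5)).fetchall()
print(rows)
```

Only the row that passes all three conditions comes back; each of the other rows fails exactly one of them.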


Lastly, we want to round the density column to two decimal places within the database.

In [81]:
%%sql
SELECT *, ROUND(density, 1) as rounded_density FROM WINE;

(psycopg2.ProgrammingError) function round(double precision, integer) does not exist
LINE 1: SELECT *, ROUND(density, 1) as rounded_density FROM WINE;
                  ^
HINT:  No function matches the given name and argument types. You might need to add explicit type casts.
 [SQL: 'SELECT *, ROUND(density, 1) as rounded_density FROM WINE;']
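
The hint in the error points at the cause: PostgreSQL defines the two-argument form of `ROUND` only for `numeric` values, while `density` was stored as `double precision`. Casting the column first should resolve it, for example `SELECT *, ROUND(CAST(density AS numeric), 2) AS rounded_density FROM wine;` (with the precision set to 2 to match the task). The sketch below runs the same cast-then-round expression against an in-memory SQLite table, which is an assumption made so the example runs without a server. SQLite accepts the cast but does not require it, so this illustrates the query shape, not the PostgreSQL type check.

```python
import sqlite3

# In-memory SQLite stands in for the PostgreSQL wine table (an assumption);
# the two density values are taken from rows shown earlier in the lesson.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE wine (density REAL)")
conn.executemany("INSERT INTO wine VALUES (?)", [(0.9924,), (0.9978,)])

# Cast to NUMERIC first, then round to two decimal places.
rounded = conn.execute(
    "SELECT density, ROUND(CAST(density AS NUMERIC), 2) AS rounded_density "
    "FROM wine ORDER BY density"
).fetchall()

for density, rounded_density in rounded:
    print(density, rounded_density)
```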
