<a href="https://colab.research.google.com/github/ranjukhanal11/ranjukhanal11/blob/main/03_SQL.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Learning Some SQL with BigQuery

The setup steps below come directly from Google.  You'll need to complete each of them for this notebook to work.

## Before you begin


1.   Use the [Cloud Resource Manager](https://console.cloud.google.com/cloud-resource-manager) to create a Cloud Platform project if you do not already have one.
2.   [Enable billing](https://support.google.com/cloud/answer/6293499#enable-billing) for the project.
3.   [Enable BigQuery](https://console.cloud.google.com/flows/enableapi?apiid=bigquery) APIs for the project.

In [10]:
from google.colab import auth
auth.authenticate_user()
print('Authenticated')

Authenticated


Now that I am authenticated, I can start to play around in the dataset.  I am going to look at the liquor sales data from Iowa and try to find the highest and lowest sales by city.  I have a project called `pic-math` in my BigQuery interface.  You'll need to make your own project; keep the name simple but identifiable!  The cells below pass a project ID to `--project`, so substitute your own there.

## Why do we use SQL?

Below you'll see a basic SQL call.  It illustrates why Excel is not useful here: a worksheet tops out at 1,048,576 rows, so this table's nearly 23 million rows are about 21.9 million more than Excel can handle!  Essentially, SQL does the data manipulation on the database server instead of on your machine (or in the cloud with Colab).

In [11]:
%%bigquery --project white-device-278509
SELECT 
  COUNT(*) as total_rows
FROM `bigquery-public-data.iowa_liquor_sales.sales`

Unnamed: 0,total_rows
0,22972250
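To make the "work happens on the server" idea concrete, here is a minimal sketch that uses Python's built-in `sqlite3` as a stand-in for BigQuery.  The tiny `sales` table and its three rows are made up for illustration; only the row count 22,972,250 comes from the output above.

```python
import sqlite3

# Toy stand-in for the sales table; SQLite plays the role of the database server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (city TEXT, sale_dollars REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("Waterloo", 365.0), ("Sigourney", 7.25), ("Des Moines", 6.75)],
)

# The database scans the rows and hands back a single number,
# so the rows themselves never have to be loaded into Python.
(total_rows,) = conn.execute("SELECT COUNT(*) AS total_rows FROM sales").fetchone()
print(total_rows)

# Arithmetic behind the Excel comparison.
EXCEL_MAX_ROWS = 1_048_576   # Excel's worksheet row limit
IOWA_ROWS = 22_972_250       # total_rows reported by the BigQuery cell above
rows_over_limit = IOWA_ROWS - EXCEL_MAX_ROWS
print(rows_over_limit)
```

The same pattern holds in BigQuery: the query returns one row no matter how large the table is.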


That is a lot of rows.  We really don't want to try to hold all of that in memory!  Let's have a peek at the data.

The only two required parts of an SQL call against a table are `SELECT` and `FROM`.  `SELECT` picks the columns you want by name.  `FROM` picks the table you want to look at.  Both columns and tables can be given a short alias with `AS`, and sometimes it is necessary to prefix the column with the table name.  Below I do the same thing in two different ways.  Do you see a difference in the output?

In [12]:
%%bigquery --project white-device-278509
SELECT 
  AVG(sale_dollars)
FROM `bigquery-public-data.iowa_liquor_sales.sales`

Unnamed: 0,f0_
0,138.881749


In [13]:
%%bigquery --project white-device-278509
SELECT AVG(table.sale_dollars) as average_sale_dollars
FROM `bigquery-public-data.iowa_liquor_sales.sales` as table

Unnamed: 0,average_sale_dollars
0,138.881749


See any difference?  You should be asking yourself why it would be advantageous to name your tables.  Well, we will see shortly that joining tables (remember relational databases?) is going to be an important task!  Sometimes some of the info we want will be in one table and some of it in another.
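Here is a small sketch of why table aliases pay off once a join is involved.  It runs on Python's built-in `sqlite3` rather than BigQuery, and the `stores` table, the aliases `s` and `st`, and all the numbers are hypothetical, made up just to have two tables to join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Two hypothetical tables: one row per sale, one row per store.
conn.executescript("""
    CREATE TABLE sales (store_number INTEGER, sale_dollars REAL);
    CREATE TABLE stores (store_number INTEGER, city TEXT);
    INSERT INTO sales VALUES (2538, 365.5), (2538, 100.0), (5315, 7.25);
    INSERT INTO stores VALUES (2538, 'Waterloo'), (5315, 'Sigourney');
""")

# Both tables have a store_number column, so the aliases s and st
# are what tell the database which one we mean.
query = """
    SELECT st.city AS city, SUM(s.sale_dollars) AS total_sales
    FROM sales AS s
    JOIN stores AS st ON s.store_number = st.store_number
    GROUP BY st.city
    ORDER BY st.city
"""
cursor = conn.execute(query)
column_names = [col[0] for col in cursor.description]
rows = cursor.fetchall()
print(column_names)
print(rows)
```

The column aliases (`AS city`, `AS total_sales`) are what show up as the output column names, just like `average_sale_dollars` did above.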

Here is a command that lets you peek at the whole table (like `head`!).  The `*` gives you all the columns and `LIMIT` returns only as many rows as you specify.  Without an `ORDER BY`, SQL does not guarantee any particular row order.

In [14]:
%%bigquery --project white-device-278509
SELECT *
FROM `bigquery-public-data.iowa_liquor_sales.sales`
LIMIT 5

Unnamed: 0,invoice_and_item_number,date,store_number,store_name,address,city,zip_code,store_location,county_number,county,category,category_name,vendor_number,vendor_name,item_number,item_description,pack,bottle_volume_ml,state_bottle_cost,state_bottle_retail,bottles_sold,sale_dollars,volume_sold_liters,volume_sold_gallons
0,INV-37586600012,2021-06-17,2538,Hy-Vee Food Store #3 / Waterloo,1422 Flammang Dr,Waterloo,50702.0,POINT (-92.327917 42.459938),7,BLACK HAWK,1091100.0,American Distilled Spirit Specialty,481.0,Sugarlands Distilling Company LLC,977332,Sugarlands Shine Butterscotch Gold,6,750,13.52,20.28,18,365.04,13.5,3.56
1,INV-37765400077,2021-06-23,5315,Brother's Market Wine and Spirits,110 South Main Street,Sigourney,52591.0,POINT (-92.20510500000002 41.333293),54,KEOKUK,1081500.0,Triple Sec,434.0,LUXCO INC,86251,Juarez Triple Sec,12,1000,2.42,3.63,2,7.26,2.0,0.52
2,INV-20093700131,2019-06-19,2629,Hy-Vee Food Store #2 / Council Bluffs,1745 Madison Ave,Council Bluffs,51503.0,POINT (-95.825137 41.242732),78,POTTAWATTA,1081100.0,Coffee Liqueurs,370.0,PERNOD RICARD USA,67524,Kahlua Coffee,24,375,6.49,9.74,3,29.22,1.12,0.29
3,INV-13397100003,2018-07-23,2629,Hy-Vee Food Store #2 / Council Bluffs,1745 Madison Ave,Council Bluffs,51503.0,POINT (-95.825137 41.242732),78,POTTAWATTA,1062300.0,Aged Dark Rum,380.0,Phillips Beverage,43686,Cross Keys Rum,12,750,10.07,15.11,36,543.96,27.0,7.13
4,INV-14045300014,2018-08-24,2190,"Central City Liquor, Inc.",1460 2ND AVE,Des Moines,50314.0,POINT (-93.619787 41.60566),77,POLK,1062100.0,Gold Rum,35.0,BACARDI USA INC,43034,Bacardi Gold Rum,24,375,4.5,6.75,1,6.75,0.37,0.09
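A quick sketch of how `LIMIT` behaves with and without `ORDER BY`, again on a made-up table in Python's built-in `sqlite3` (the store names `A` to `D` and the amounts are placeholders, not real data).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (store_name TEXT, sale_dollars REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("A", 29.0), ("B", 543.5), ("C", 6.75), ("D", 365.0)],
)

# LIMIT alone: we get two rows, but which two is up to the database.
peek = conn.execute("SELECT * FROM sales LIMIT 2").fetchall()

# ORDER BY plus LIMIT: the two largest sales, every time.
top_two = conn.execute(
    "SELECT * FROM sales ORDER BY sale_dollars DESC LIMIT 2"
).fetchall()

print(peek)
print(top_two)
```

So `LIMIT 5` is fine for a quick look, but add `ORDER BY` whenever the choice of rows matters.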


Not sure how much we might use this, but if I want the result in a `pandas` DataFrame, I add a variable name for it right after `%%bigquery` (here `df`).  The magic then stores the result in that variable instead of displaying it.

In [15]:
%%bigquery df --project white-device-278509
SELECT 
  city, 
  store_name,
  SUM(sale_dollars) as total_sales
FROM `bigquery-public-data.iowa_liquor_sales.sales`
WHERE city is not null
GROUP BY city, store_name

In [16]:
df

Unnamed: 0,city,store_name,total_sales
0,Waterloo,Hy-Vee Food Store #3 / Waterloo,6547896.92
1,Sigourney,Brother's Market Wine and Spirits,647180.36
2,Council Bluffs,Hy-Vee Food Store #2 / Council Bluffs,9380511.23
3,Des Moines,"Central City Liquor, Inc.",11778799.03
4,CEDAR RAPIDS,Sam's Club 8162 / Cedar Rapids,16905097.38
...,...,...,...
4166,DeWitt,TYCOGA Vineyard & Winery,9205.20
4167,Raymond,Thome Enterprises LLC,288.00
4168,Bettendorf,"Cats Eye Distillery, LLC",693.12
4169,Dubuque,"3-Oaks Distillery, LLC",388.80


In [17]:
import pandas as pd

# Group the per-store totals by city so stores can be compared within each city.
groupeddf = df.groupby('city')

In [None]:
# max() is taken column by column, so store_name and total_sales may come from different rows.
maxdf = groupeddf.max()

In [None]:
maxdf

In [None]:
mindf = groupeddf.min()

mindf

I am clearly just showing off now.  I have left more along this line at the bottom but let's get your assignment up!

## Assignment for today

1. Start a notebook and get BigQuery working.  Feel free to use the authentication cell at the top.
2. Navigate to the table `austin_bikeshare.bikeshare_trips` in `bigquery-public-data`
3. Compute how many entries are in the dataset
4. Compute the longest trip from 'duration_minutes'
5. Compute the average time for a trip

## Q3: Compute how many entries are in the dataset

In [19]:
%%bigquery --project white-device-278509
SELECT COUNT(*) AS Entries
FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips`

Unnamed: 0,Entries
0,1424786


## Q4: Compute the longest trip from 'duration_minutes'

In [20]:
%%bigquery --project white-device-278509
SELECT MAX(duration_minutes) AS Longest_Trip
FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips`

Unnamed: 0,Longest_Trip
0,34238


## Q5: Compute the average time for a trip


In [21]:
%%bigquery --project white-device-278509
SELECT AVG(duration_minutes) AS Average
FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips`

Unnamed: 0,Average
0,30.870428
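The three answers can also come back from a single query, since `COUNT`, `MAX`, and `AVG` are all aggregates over the same table.  Here is a rough sketch on a three-row toy table in Python's built-in `sqlite3`; the trips and their durations are invented, so the numbers will not match the real dataset.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE bikeshare_trips (trip_id INTEGER, duration_minutes INTEGER)"
)
conn.executemany(
    "INSERT INTO bikeshare_trips VALUES (?, ?)",
    [(1, 10), (2, 20), (3, 60)],
)

# One pass over the table, three aggregates in one result row.
entries, longest_trip, average = conn.execute("""
    SELECT
      COUNT(*) AS Entries,
      MAX(duration_minutes) AS Longest_Trip,
      AVG(duration_minutes) AS Average
    FROM bikeshare_trips
""").fetchone()

print(entries, longest_trip, average)
```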


## More on liquor sales (not needed today)

I noticed a few things while attempting this.  While I think I have a solution, it is clearly not the best.  Zwingle and ZWINGLE are probably the same town, and SNK may just be the only store there, but the fact that it appears four times in my lists is disappointing!

In [None]:
maxdf.sort_values('total_sales',ascending=False)
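One possible way to merge spellings like Zwingle and ZWINGLE is to normalize the case before grouping.  This is only a sketch on toy rows in Python's built-in `sqlite3`: the column name `city_key`, the Dubuque store names, and every dollar amount are made up, and I am assuming `UPPER()` is enough to reconcile the spellings.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (city TEXT, store_name TEXT, sale_dollars REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("Zwingle", "SNK", 100.0),
    ("ZWINGLE", "SNK", 50.0),
    ("Dubuque", "Store A", 10.0),
    ("Dubuque", "Store B", 30.0),
])

# Grouping on UPPER(city) puts both spellings of the town in one group.
rows = conn.execute("""
    SELECT UPPER(city) AS city_key, store_name, SUM(sale_dollars) AS total_sales
    FROM sales
    WHERE city IS NOT NULL
    GROUP BY UPPER(city), store_name
    ORDER BY city_key, store_name
""").fetchall()

for row in rows:
    print(row)
```

With the case folded, SNK shows up once with the combined total instead of once per spelling.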

I want to try the extra challenge **and** do it all in SQL.  I'll try to find which store had the most sales on each date!

In [None]:
%%bigquery --project white-device-278509

WITH bestday as (
SELECT 
  date, 
  store_name,
  city,
  SUM(sale_dollars) as total_sales,
  RANK() over (PARTITION BY date ORDER BY SUM(sale_dollars) desc) as top_sales_rank
FROM `bigquery-public-data.iowa_liquor_sales.sales`
WHERE city is not null
GROUP BY date, store_name, city
)

SELECT
  date,
  store_name,
  city,
  total_sales
FROM bestday
WHERE top_sales_rank = 1
ORDER BY date
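To see what the `WITH` clause and `RANK()` are doing, here is a small sketch of the same idea on toy data in Python's built-in `sqlite3` (this assumes SQLite 3.25 or newer, which is when window functions arrived).  I split it into two steps, `daily` and `ranked`, so the aggregation and the ranking are visible separately; those two names, the store names, and the amounts are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (date TEXT, store_name TEXT, sale_dollars REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("2021-06-17", "Store A", 100.0),
    ("2021-06-17", "Store A", 50.0),
    ("2021-06-17", "Store B", 120.0),
    ("2021-06-23", "Store B", 10.0),
    ("2021-06-23", "Store A", 5.0),
])

best_days = conn.execute("""
    WITH daily AS (
        -- Step 1: total sales per store per date.
        SELECT date, store_name, SUM(sale_dollars) AS total_sales
        FROM sales
        GROUP BY date, store_name
    ),
    ranked AS (
        -- Step 2: rank the stores within each date, best first.
        SELECT
            date,
            store_name,
            total_sales,
            RANK() OVER (PARTITION BY date ORDER BY total_sales DESC) AS top_sales_rank
        FROM daily
    )
    SELECT date, store_name, total_sales
    FROM ranked
    WHERE top_sales_rank = 1
    ORDER BY date
""").fetchall()

for row in best_days:
    print(row)
```

Because `RANK()` gives tied stores the same rank, a date with a tie for first place returns more than one row.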