***

<center><font size="6"><b>Build and Optimize Data Warehouses with BigQuery</b></font></center>
***
<center><font size="2">Prepared by: Sitsawek Sukorn</font></center>

### BigQuery: Qwik Start - Command Line

### Examine a table

In [None]:
# Run in shell 
bq show bigquery-public-data:samples.shakespeare

+ bq invokes the BigQuery command-line tool.
+ show is the action.
+ The last argument names the table you want to inspect, in the form project:dataset.table (here, the shakespeare table in the samples dataset of the bigquery-public-data project).
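
As a rough illustration of the project:dataset.table form, the sketch below splits such a reference into its three parts. The helper name split_table_ref is hypothetical and is not part of the bq tool; it only mirrors the naming pattern described above.

```python
from typing import NamedTuple


class TableRef(NamedTuple):
    project: str
    dataset: str
    table: str


def split_table_ref(ref: str) -> TableRef:
    """Split 'project:dataset.table' into its parts (hypothetical helper)."""
    project, colon, rest = ref.partition(":")
    dataset, dot, table = rest.partition(".")
    if not (colon and dot and project and dataset and table):
        raise ValueError(f"expected 'project:dataset.table', got {ref!r}")
    return TableRef(project, dataset, table)


ref = split_table_ref("bigquery-public-data:samples.shakespeare")
print(ref)
```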

### Run the help command

In [None]:
bq help query

### Run a query

Now you'll run a query to see how many times the substring "raisin" appears in Shakespeare's works.

+ To run a query, run the command bq query "[SQL_STATEMENT]":

+ Escape any quotation marks inside the [SQL_STATEMENT] with a \ mark, or

+ Use a different quotation mark type than the surrounding marks (" versus ').

+ Run the following standard SQL query in Cloud Shell to count the number of times that the substring "raisin" appears in all of Shakespeare's works:

In [None]:
bq query --use_legacy_sql=false \
'SELECT
   word,
   SUM(word_count) AS count
 FROM
   `bigquery-public-data`.samples.shakespeare
 WHERE
   word LIKE "%raisin%"
 GROUP BY
   word'

+ --use_legacy_sql=false makes standard SQL the default query syntax.

+ If you search for a word that isn't in Shakespeare's works, no results are returned.

+ Run the following search for "huzzah", which returns no matches:



In [None]:
bq query --use_legacy_sql=false \
'SELECT
   word
 FROM
   `bigquery-public-data`.samples.shakespeare
 WHERE
   word = "huzzah"'
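
The two quoting styles described under "Run a query" can be compared without touching BigQuery. The sketch below uses Python's shlex module, which approximates POSIX shell word splitting, to show that both styles hand the same SQL text to bq. The SQL string is a simplified stand-in for the lab query, and nothing is actually executed.

```python
import shlex

# Style 1: double quotes around the statement, inner double quotes escaped with \.
escaped = r'bq query "SELECT word FROM samples.shakespeare WHERE word = \"huzzah\""'

# Style 2: single quotes around the statement, inner double quotes left as-is.
mixed = "bq query 'SELECT word FROM samples.shakespeare WHERE word = \"huzzah\"'"

args_escaped = shlex.split(escaped)
args_mixed = shlex.split(mixed)

print(args_escaped[2])              # the SQL text that bq would receive
print(args_escaped == args_mixed)   # both styles produce the same arguments
```

Note that shlex does not model command substitution. In a real shell, backticks inside double quotes are executed as a command, which is a good reason to wrap queries containing backtick-quoted names in single quotes, as the cells above do.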

### Create a new table

#### Create a new dataset
+ Use the bq ls command to list any existing datasets in your project:

In [None]:
bq ls

+ Run bq ls with the bigquery-public-data project ID, followed by a colon (:), to list the datasets in that specific project:

In [None]:
bq ls bigquery-public-data:

+ Use the bq mk command to create a new dataset named babynames in your Qwiklabs project:

In [None]:
bq mk babynames

+ Run bq ls to confirm that the dataset now appears as part of your project:

In [None]:
bq ls

#### Upload the dataset

+ Run this command to download the baby names zip file, using the URL for the data file:

In [None]:
curl -LO http://www.ssa.gov/OACT/babynames/names.zip

+ List the file:

In [None]:
ls

+ Now unzip the file:

In [None]:
unzip names.zip

+ That's a pretty big list of text files! List the files again:

In [None]:
ls

+ Use bq load to create the names2010 table from yob2010.txt, supplying the schema inline as name:type pairs:

In [None]:
bq load babynames.names2010 yob2010.txt name:string,gender:string,count:integer

+ Run bq ls babynames to confirm that the table now appears in your dataset:

In [None]:
bq ls babynames

+ Run bq show with your dataset.table to see the schema:

In [None]:
bq show babynames.names2010
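
To make the inline schema string passed to bq load more concrete, here is a minimal sketch that parses name:string,gender:string,count:integer and applies it to CSV rows. The helper parse_inline_schema and the two sample rows are hypothetical; real type handling is done by BigQuery and covers more types than shown here.

```python
import csv
import io


def parse_inline_schema(schema):
    """Turn 'name:string,count:integer' into [('name', 'STRING'), ('count', 'INTEGER')]."""
    fields = []
    for item in schema.split(","):
        name, sep, type_name = item.partition(":")
        if not (sep and name and type_name):
            raise ValueError(f"expected name:type, got {item!r}")
        fields.append((name.strip(), type_name.strip().upper()))
    return fields


# Only the two types used in this lab are mapped (an assumption for the sketch).
CASTS = {"STRING": str, "INTEGER": int}

schema = parse_inline_schema("name:string,gender:string,count:integer")

# Hypothetical rows in the same name,gender,count layout as yob2010.txt.
sample_csv = "Ava,F,100\nBen,M,5\n"
rows = [
    {name: CASTS[type_name](value) for (name, type_name), value in zip(schema, record)}
    for record in csv.reader(io.StringIO(sample_csv))
]

print(schema)
print(rows)
```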

+ Note: By default, when you load data, BigQuery expects UTF-8 encoded data. If you have data that is in ISO-8859-1 (or Latin-1) encoding and are having problems with your loaded data, you can tell BigQuery to treat your data as Latin-1 explicitly, using the -E flag. Learn more about Character Encodings from the Introduction to loading data guide.
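
The encoding note can be illustrated with a short sketch. It shows why Latin-1 bytes may fail when read as UTF-8, and how re-encoding the file before loading is an alternative to the -E flag. The sample name is hypothetical.

```python
# "José" written with an escape so the source file encoding does not matter.
name = "Jos\u00e9"

latin1_bytes = name.encode("latin-1")  # one byte for the accented letter
utf8_bytes = name.encode("utf-8")      # two bytes for the accented letter

# Reading Latin-1 bytes as UTF-8 fails for this input.
try:
    latin1_bytes.decode("utf-8")
    decoded_as_utf8 = True
except UnicodeDecodeError:
    decoded_as_utf8 = False

# Decoding with the correct source encoding and re-encoding yields valid UTF-8.
reencoded = latin1_bytes.decode("latin-1").encode("utf-8")

print(decoded_as_utf8)
print(reencoded == utf8_bytes)
```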

### Run queries

+ Run the following command to return the top 5 most popular girls' names:

In [None]:
bq query "SELECT name,count FROM babynames.names2010 WHERE gender = 'F' ORDER BY count DESC LIMIT 5"

+ Run the following command to see the top 5 most unusual boys' names:

In [None]:
bq query "SELECT name,count FROM babynames.names2010 WHERE gender = 'M' ORDER BY count ASC LIMIT 5"

+ Note: The minimum count is 5 because the source data omits names with fewer than 5 occurrences.

### Clean up

+ Run the bq rm command with the -r flag to remove the babynames dataset and all tables in it:

In [None]:
bq rm -r babynames

+ Confirm the delete command by typing Y.

***

<center><font size="6"><b>Creating a Data Warehouse Through Joins and Unions</b></font></center>
***

### Create a new dataset to store your tables

First, create a new dataset titled ecommerce in BigQuery.

+ In the left pane, click on the name of your BigQuery project (qwiklabs-gcp-xxxx).

+ Click on the three dots next to your project name, then select CREATE DATASET.

+ The Create dataset dialog opens.

+ Set the Dataset ID to ecommerce, leave all other options at their default values.

+ Click Create dataset.

+ Click on the Disable Editor Tabs link to enable the Query Editor.

### Explore the product sentiment dataset

+ First, create a copy of the table that the data science team made so you can read it:



In [None]:
CREATE OR REPLACE TABLE ecommerce.products AS
SELECT
*
FROM
`data-to-insights.ecommerce.products`

+ Note: This copy is only for you to review; the queries in this lab use the data-to-insights project.

+ Click on the ecommerce dataset to display the products table.

#### Create a query that shows the top 5 products with the most positive sentiment

+ In the Query Editor, write your SQL query.

In [None]:
SELECT
  SKU,
  name,
  sentimentScore,
  sentimentMagnitude
FROM
  `data-to-insights.ecommerce.products`
ORDER BY
  sentimentScore DESC
LIMIT 5

+ Revise your query to show the top 5 products with the most negative sentiment and filter out NULL values.

In [None]:
SELECT
  SKU,
  name,
  sentimentScore,
  sentimentMagnitude
FROM
  `data-to-insights.ecommerce.products`
WHERE sentimentScore IS NOT NULL
ORDER BY
  sentimentScore
LIMIT 5
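
The NULL filter matters because an ascending sort places NULL values first, so unscored products would otherwise crowd out the genuinely negative ones. The sketch below shows the effect with Python's built-in sqlite3 module standing in for BigQuery (an assumption made so the example is runnable; both engines sort NULLs first in ascending order). The rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (SKU TEXT, sentimentScore REAL)")
conn.executemany(
    "INSERT INTO products VALUES (?, ?)",
    [("A", 0.9), ("B", -0.6), ("C", None), ("D", 0.1)],  # hypothetical rows
)

# Without the filter, the product with no score sorts ahead of the negative one.
unfiltered = conn.execute(
    "SELECT SKU FROM products ORDER BY sentimentScore LIMIT 2"
).fetchall()

# With the filter, only scored products compete for the lowest positions.
filtered = conn.execute(
    "SELECT SKU FROM products WHERE sentimentScore IS NOT NULL "
    "ORDER BY sentimentScore LIMIT 2"
).fetchall()

print(unfiltered)
print(filtered)
```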

### Join datasets to find insights

#### Calculate daily sales volume by productSKU

+ Create a new table in your ecommerce dataset with the below requirements:

- Title it sales_by_sku_20170801
- Source the data from data-to-insights.ecommerce.all_sessions_raw
- Include only distinct results
- Return productSKU
- Return the total quantity ordered (productQuantity). Hint: Use a SUM() with an IFNULL condition
- Filter for only sales on 20170801
- ORDER BY the SKUs with the most orders first

In [None]:
# pull what sold on 08/01/2017
CREATE OR REPLACE TABLE ecommerce.sales_by_sku_20170801 AS
SELECT DISTINCT
  productSKU,
  SUM(IFNULL(productQuantity,0)) AS total_ordered
FROM
  `data-to-insights.ecommerce.all_sessions_raw`
WHERE date = '20170801'
GROUP BY productSKU
ORDER BY total_ordered DESC #462 skus sold

+ Click on the sales_by_sku_20170801 table, then click the Preview tab.
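
The IFNULL hint is easiest to see on a SKU whose quantities are all NULL: a plain SUM returns NULL for that group, while SUM(IFNULL(..., 0)) returns 0. The sketch below reproduces this with sqlite3 as a stand-in for BigQuery (an assumption for runnability; the table and rows are hypothetical).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sessions (productSKU TEXT, productQuantity INTEGER)")
conn.executemany(
    "INSERT INTO sessions VALUES (?, ?)",
    [("SKU1", 2), ("SKU1", None), ("SKU2", None), ("SKU2", None)],  # hypothetical rows
)

rows = conn.execute(
    """
    SELECT
      productSKU,
      SUM(productQuantity) AS plain_sum,
      SUM(IFNULL(productQuantity, 0)) AS total_ordered
    FROM sessions
    GROUP BY productSKU
    ORDER BY productSKU
    """
).fetchall()

for row in rows:
    print(row)
```

For SKU1 both sums agree, because SUM skips individual NULLs; only the all-NULL group differs.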

#### Join sales data and inventory data


+ Using a JOIN, enrich the website ecommerce data with the following fields from the product inventory dataset:

- name
- stockLevel
- restockingLeadTime
- sentimentScore
- sentimentMagnitude

+ Complete the partially written query:

In [None]:
# join against product inventory to get name
SELECT DISTINCT
  website.productSKU,
  website.total_ordered,
  inventory.name,
  inventory.stockLevel,
  inventory.restockingLeadTime,
  inventory.sentimentScore,
  inventory.sentimentMagnitude
FROM
  ecommerce.sales_by_sku_20170801 AS website
  LEFT JOIN `data-to-insights.ecommerce.products` AS inventory
ORDER BY total_ordered DESC

+ Possible solution:

In [None]:
# join against product inventory to get name
SELECT DISTINCT
  website.productSKU,
  website.total_ordered,
  inventory.name,
  inventory.stockLevel,
  inventory.restockingLeadTime,
  inventory.sentimentScore,
  inventory.sentimentMagnitude
FROM
  ecommerce.sales_by_sku_20170801 AS website
  LEFT JOIN `data-to-insights.ecommerce.products` AS inventory
  ON website.productSKU = inventory.SKU
ORDER BY total_ordered DESC
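
A LEFT JOIN keeps every row from the left table (the sales data) even when no inventory record matches, filling the inventory columns with NULL. The sketch below contrasts it with an INNER JOIN, using sqlite3 as a stand-in for BigQuery; the tables and rows are hypothetical and much smaller than the lab data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE website (productSKU TEXT, total_ordered INTEGER);
    CREATE TABLE inventory (SKU TEXT, name TEXT, stockLevel INTEGER);
    INSERT INTO website VALUES ('SKU1', 10), ('SKU2', 4);
    INSERT INTO inventory VALUES ('SKU1', 'Mug', 20);
    """
)

join_sql = """
    SELECT website.productSKU, website.total_ordered, inventory.name
    FROM website
    {join_type} JOIN inventory ON website.productSKU = inventory.SKU
    ORDER BY website.total_ordered DESC
"""

left_rows = conn.execute(join_sql.format(join_type="LEFT")).fetchall()
inner_rows = conn.execute(join_sql.format(join_type="INNER")).fetchall()

print(left_rows)   # SKU2 is kept, with None for the missing inventory name
print(inner_rows)  # SKU2 is dropped because it has no inventory match
```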

+ Modify the query you wrote to now include:

+ A calculated field of (total_ordered / stockLevel) and alias it "ratio". Hint: Use SAFE_DIVIDE(field1,field2) to ***avoid divide by 0 errors*** when the stock level is 0.
+ Filter the results to only include products that have gone through 50% or more of their inventory already at the beginning of the month

+ Possible solution:

In [None]:
# calculate ratio and filter
SELECT DISTINCT
  website.productSKU,
  website.total_ordered,
  inventory.name,
  inventory.stockLevel,
  inventory.restockingLeadTime,
  inventory.sentimentScore,
  inventory.sentimentMagnitude,
  SAFE_DIVIDE(website.total_ordered, inventory.stockLevel) AS ratio
FROM
  ecommerce.sales_by_sku_20170801 AS website
  LEFT JOIN `data-to-insights.ecommerce.products` AS inventory
  ON website.productSKU = inventory.SKU
# gone through 50% or more of inventory
WHERE SAFE_DIVIDE(website.total_ordered,inventory.stockLevel) >= .50
ORDER BY total_ordered DESC
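
The behaviour of SAFE_DIVIDE combined with the ratio filter can be sketched in plain Python. The function safe_divide below is a hypothetical imitation written for this example, not BigQuery's implementation: it returns None (standing in for SQL NULL) when the divisor is zero or missing, and rows whose ratio is None fail the filter, just as NULL >= 0.50 does not pass a WHERE clause.

```python
from typing import Optional


def safe_divide(numerator: Optional[float], denominator: Optional[float]) -> Optional[float]:
    """Imitate SAFE_DIVIDE: None instead of an error on a zero or missing divisor."""
    if numerator is None or denominator is None or denominator == 0:
        return None
    return numerator / denominator


# Hypothetical (productSKU, total_ordered, stockLevel) rows.
rows = [("SKU1", 10, 20), ("SKU2", 4, 0), ("SKU3", 3, 10), ("SKU4", 7, None)]

flagged = []
for sku, total_ordered, stock_level in rows:
    ratio = safe_divide(total_ordered, stock_level)
    # A None ratio is excluded, mirroring how a NULL comparison filters the row out.
    if ratio is not None and ratio >= 0.50:
        flagged.append((sku, ratio))

print(flagged)
```

One consequence worth noticing: SKUs without an inventory match (stockLevel NULL after the LEFT JOIN) are removed by the filter as well.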

### Append additional records

#### Create a new empty table to store sales by productSKU for 08/02/2017

- Name the table ecommerce.sales_by_sku_20170802 and specify the following fields for its schema:

- productSKU as a STRING field
- total_ordered as an INT64 field

+ Possible solution:

In [None]:
CREATE OR REPLACE TABLE ecommerce.sales_by_sku_20170802
(
productSKU STRING,
total_ordered INT64
);

+ Confirm you now have two date-sharded sales tables - use the dropdown menu next to the sales_by_sku table name in the table results, or refresh your browser to see it listed in the left menu:

+ Insert the sales record provided to you by your sales team:

In [None]:
INSERT INTO ecommerce.sales_by_sku_20170802
(productSKU, total_ordered)
VALUES('GGOEGHPA002910', 101)

+ Confirm the record appears by previewing the table - click on the table name to see the results.

#### Append together historical data

- There are multiple ways to append together data that has the same schema. Two common ways are using UNIONs and table wildcards.

- Union is an SQL operator that appends together rows from different result sets.

- Table wildcards enable you to query multiple tables using concise SQL statements. Wildcard tables are available only in standard SQL.

- Write a UNION query that will result in all records from the below two tables:

- ecommerce.sales_by_sku_20170801

- ecommerce.sales_by_sku_20170802

In [None]:
SELECT * FROM ecommerce.sales_by_sku_20170801
UNION ALL
SELECT * FROM ecommerce.sales_by_sku_20170802
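
To see how UNION ALL differs from a deduplicating union, the sketch below runs both against two small tables that share one identical row. It uses sqlite3 as a stand-in for BigQuery, with hypothetical table names and rows. SQLite spells the deduplicating form as a bare UNION, whereas BigQuery standard SQL writes it as UNION DISTINCT.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE sales_a (productSKU TEXT, total_ordered INTEGER);
    CREATE TABLE sales_b (productSKU TEXT, total_ordered INTEGER);
    INSERT INTO sales_a VALUES ('SKU1', 5), ('SKU2', 3);
    INSERT INTO sales_b VALUES ('SKU1', 5), ('SKU3', 101);
    """
)

# UNION ALL appends every row, including the repeated ('SKU1', 5).
union_all = conn.execute(
    "SELECT * FROM sales_a UNION ALL SELECT * FROM sales_b"
).fetchall()

# The deduplicating union returns each distinct row once.
union_distinct = conn.execute(
    "SELECT * FROM sales_a UNION SELECT * FROM sales_b"
).fetchall()

print(len(union_all), len(union_distinct))
```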

+ Note: The difference between a UNION and UNION ALL is that a ***UNION will not include duplicate records***. In BigQuery standard SQL, the deduplicating form is written UNION DISTINCT.

+ What is a pitfall of having many daily sales tables? You will have to write many UNION statements chained together.

+ A better solution is to use a table wildcard together with the _TABLE_SUFFIX filter.

+ Write a query that uses the (*) table wildcard to select all records from ecommerce.sales_by_sku_ for the year 2017:

In [None]:
SELECT * FROM `ecommerce.sales_by_sku_2017*`

+ Modify the previous query to add a filter to limit the results to just 08/02/2017.

In [None]:
SELECT * FROM `ecommerce.sales_by_sku_2017*`
WHERE _TABLE_SUFFIX = '0802'
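
Conceptually, a wildcard query scans every table whose name starts with the given prefix, and _TABLE_SUFFIX holds whatever part of the name the * matched. The sketch below imitates that matching in plain Python. The function query_wildcard and the in-memory tables are hypothetical; this is not how BigQuery implements wildcard tables.

```python
# Hypothetical stand-in for a dataset: table name -> list of rows.
tables = {
    "sales_by_sku_20170801": [("SKU1", 5)],
    "sales_by_sku_20170802": [("GGOEGHPA002910", 101)],
    "products": [("SKU1", 0)],
}


def query_wildcard(tables, prefix, suffix=None):
    """Yield (table_suffix, row) for each table whose name starts with prefix."""
    for name in sorted(tables):
        if not name.startswith(prefix):
            continue
        table_suffix = name[len(prefix):]  # the part matched by the * wildcard
        if suffix is not None and table_suffix != suffix:
            continue
        for row in tables[name]:
            yield table_suffix, row


# Like: SELECT * FROM `ecommerce.sales_by_sku_2017*`
all_2017 = list(query_wildcard(tables, "sales_by_sku_2017"))

# Like adding: WHERE _TABLE_SUFFIX = '0802'
only_0802 = list(query_wildcard(tables, "sales_by_sku_2017", suffix="0802"))

print(all_2017)
print(only_0802)
```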

+ Note: Another option to consider is to create a partitioned table, which can automatically ingest daily sales data into the correct partition.

***