***

**<center><font size = "6">Build and Optimize Data Warehouses with BigQuery</font></center>**
***
<center><font size = "2">Prepared by: Sitsawek Sukorn</font></center>

### BigQuery: Qwik Start - Command Line

### Examine a table

In [None]:
# Run in shell 
bq show bigquery-public-data:samples.shakespeare

+ bq to invoke the BigQuery command line tool
+ show is the action
+ Then comes the name of the table you want to see, written as project:dataset.table (here, the public project bigquery-public-data, the samples dataset, and the shakespeare table).

### Run the help command

In [None]:
bq help query

### Run a query

Now you'll run a query to see how many times the substring "raisin" appears in Shakespeare's works.

+ To run a query, run the command bq query "[SQL_STATEMENT]":

+ Escape any quotation marks inside the [SQL_STATEMENT] with a \ mark, or

+ Use a different quotation mark type than the surrounding marks (" versus ').

+ Run the following standard SQL query in Cloud Shell to count the number of times that the substring "raisin" appears in all of Shakespeare's works:

In [None]:
bq query --use_legacy_sql=false \
'SELECT
   word,
   SUM(word_count) AS count
 FROM
   `bigquery-public-data`.samples.shakespeare
 WHERE
   word LIKE "%raisin%"
 GROUP BY
   word'
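The quoting rules above can be checked locally without running bq. The sketch below is only an illustration: it uses Python's standard `shlex` module to build the command string and then splits it the way a POSIX shell would. The shortened query text is an assumption made for the example, and nothing is sent to BigQuery.

```python
import shlex

# The query text itself contains double quotes and backticks.
sql = (
    "SELECT word FROM `bigquery-public-data`.samples.shakespeare "
    'WHERE word LIKE "%raisin%"'
)

# shlex.quote wraps the argument in single quotes, so the shell
# passes the inner double quotes and backticks through unchanged.
command = "bq query --use_legacy_sql=false " + shlex.quote(sql)
print(command)

# Splitting the command like a POSIX shell recovers the SQL as one argument.
args = shlex.split(command)
print(args[-1] == sql)
```

Because the query contains no single quotes, wrapping it in single quotes is enough, which is the "different quotation mark type" option described above.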

+ --use_legacy_sql=false makes standard SQL the default query syntax.

+ If you search for a word that isn't in Shakespeare's works, no results are returned.

+ Run the following search for "huzzah", which returns no matches:



In [None]:
bq query --use_legacy_sql=false \
'SELECT
   word
 FROM
   `bigquery-public-data`.samples.shakespeare
 WHERE
   word = "huzzah"'

### Create a new table

#### Create a new dataset
+ Use the bq ls command to list any existing datasets in your project:

In [None]:
bq ls

+ Run bq ls with the bigquery-public-data project ID, followed by a colon (:), to list the datasets in that specific project:

In [None]:
bq ls bigquery-public-data:

+ Use the bq mk command to create a new dataset named babynames in your Qwiklabs project:

In [None]:
bq mk babynames

+ Run bq ls to confirm that the dataset now appears as part of your project:

In [None]:
bq ls

#### Upload the dataset

+ Run this command to add the baby names zip file to your project, using the URL for the data file:

In [None]:
curl -LO http://www.ssa.gov/OACT/babynames/names.zip

+ List the file:

In [None]:
ls

+ Now unzip the file:

In [None]:
unzip names.zip

+ That's a pretty big list of text files! List the files again:

In [None]:
ls

+ Create your table:

In [None]:
bq load babynames.names2010 yob2010.txt name:string,gender:string,count:integer

+ Run bq ls and babynames to confirm that the table now appears in your dataset:

In [None]:
bq ls babynames

+ Run bq show and your dataset.table to see the schema:

In [None]:
bq show babynames.names2010

+ Note: By default, when you load data, BigQuery expects UTF-8 encoded data. If you have data that is in ISO-8859-1 (or Latin-1) encoding and are having problems with your loaded data, you can tell BigQuery to treat your data as Latin-1 explicitly, using the -E flag. Learn more about Character Encodings from the Introduction to loading data guide.
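To make the inline schema in the bq load command more concrete, here is a rough Python sketch of how `name:string,gender:string,count:integer` maps onto the columns of each row. It assumes the file is comma-separated, as the default CSV load implies; the two sample rows are hypothetical and are not taken from yob2010.txt.

```python
import csv
import io

schema_spec = "name:string,gender:string,count:integer"

# Map each declared type to a Python converter (only the two types used here).
converters = {"string": str, "integer": int}
schema = [tuple(field.split(":")) for field in schema_spec.split(",")]

# Hypothetical sample rows in a name,gender,count layout.
sample = io.StringIO("Alice,F,120\nBob,M,95\n")

rows = [
    {name: converters[kind](value) for (name, kind), value in zip(schema, record)}
    for record in csv.reader(sample)
]
print(rows)
```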

### Run queries

+ Run the following command to return the top 5 most popular girls' names:

In [None]:
bq query "SELECT name,count FROM babynames.names2010 WHERE gender = 'F' ORDER BY count DESC LIMIT 5"

+ Run the following command to see the top 5 most unusual boys' names:

In [None]:
bq query "SELECT name,count FROM babynames.names2010 WHERE gender = 'M' ORDER BY count ASC LIMIT 5"

+ Note: The minimum count is 5 because the source data omits names with fewer than 5 occurrences.

### Clean up

+ Run the bq rm command to remove the babynames dataset with the -r flag to delete all tables in the dataset:

In [None]:
bq rm -r babynames

+ Confirm the delete command by typing Y.

***

**<center><font size = "6">Creating a Data Warehouse Through Joins and Unions</font></center>**
***

### Create a new dataset to store your tables

First, create a new dataset titled ecommerce in BigQuery.

+ In the left pane, click on the name of your BigQuery project (qwiklabs-gcp-xxxx).

+ Click on the three dots next to your project name, then select CREATE DATASET.

+ The Create dataset dialog opens.

+ Set the Dataset ID to ecommerce, leave all other options at their default values.

+ Click Create dataset.

+ Click on the Disable Editor Tabs link to enable the Query Editor.

### Explore the product sentiment dataset

+ First, create a copy of the table that the data science team made so you can read it:



In [None]:
CREATE OR REPLACE TABLE ecommerce.products AS
SELECT
*
FROM
`data-to-insights.ecommerce.products`

+ Note: This copy is only for you to review; the queries in this lab use the data-to-insights project.

+ Click on the ecommerce dataset to display the products table.

#### Create a query that shows the top 5 products with the most positive sentiment

+ In the Query Editor, write your SQL query.

In [None]:
SELECT
  SKU,
  name,
  sentimentScore,
  sentimentMagnitude
FROM
  `data-to-insights.ecommerce.products`
ORDER BY
  sentimentScore DESC
LIMIT 5

+ Revise your query to show the top 5 products with the most negative sentiment and filter out NULL values.

In [None]:
SELECT
  SKU,
  name,
  sentimentScore,
  sentimentMagnitude
FROM
  `data-to-insights.ecommerce.products`
WHERE sentimentScore IS NOT NULL
ORDER BY
  sentimentScore
LIMIT 5
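The NULL filter matters because an ascending sort in BigQuery places NULL values first, so without it the "most negative" list could be filled with products that have no score at all. SQLite orders NULLs the same way in ascending sorts, so the effect can be sketched locally. The table below is a small stand-in with hypothetical rows, not the real products table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (SKU TEXT, sentimentScore REAL)")
conn.executemany(
    "INSERT INTO products VALUES (?, ?)",
    [("A", 0.9), ("B", -0.6), ("C", None), ("D", 0.1)],  # hypothetical rows
)

# Without the filter, the product with no score sorts to the top.
unfiltered = conn.execute(
    "SELECT SKU FROM products ORDER BY sentimentScore LIMIT 2"
).fetchall()

# With the filter, only scored products compete for the lowest spots.
filtered = conn.execute(
    "SELECT SKU FROM products WHERE sentimentScore IS NOT NULL "
    "ORDER BY sentimentScore LIMIT 2"
).fetchall()

print(unfiltered, filtered)
```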

### Join datasets to find insights

#### Calculate daily sales volume by productSKU

+ Create a new table in your ecommerce dataset with the below requirements:

- Title it sales_by_sku_20170801
- Source the data from data-to-insights.ecommerce.all_sessions_raw
- Include only distinct results
- Return productSKU
- Return the total quantity ordered (productQuantity). Hint: Use a SUM() with an IFNULL condition
- Filter for only sales on 20170801
- ORDER BY the SKUs with the most orders first

In [None]:
# pull what sold on 08/01/2017
CREATE OR REPLACE TABLE ecommerce.sales_by_sku_20170801 AS
SELECT DISTINCT
  productSKU,
  SUM(IFNULL(productQuantity,0)) AS total_ordered
FROM
  `data-to-insights.ecommerce.all_sessions_raw`
WHERE date = '20170801'
GROUP BY productSKU
ORDER BY total_ordered DESC #462 skus sold

+ Click on the sales_by_sku table, then click the Preview tab.
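As a side note on the SUM(IFNULL(...)) hint: SUM already skips NULL inputs, but when every quantity for a SKU is NULL the plain SUM returns NULL rather than 0. The sketch below shows that difference in SQLite, which behaves the same way for this case. The table and rows are hypothetical stand-ins for all_sessions_raw.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sessions (productSKU TEXT, productQuantity INTEGER)")
conn.executemany(
    "INSERT INTO sessions VALUES (?, ?)",
    [("SKU1", 2), ("SKU1", None), ("SKU1", 3), ("SKU2", None), ("SKU2", None)],
)

totals = conn.execute(
    """
    SELECT
      productSKU,
      SUM(productQuantity) AS plain_sum,
      SUM(IFNULL(productQuantity, 0)) AS total_ordered
    FROM sessions
    GROUP BY productSKU
    ORDER BY total_ordered DESC
    """
).fetchall()

print(totals)
```

SKU2 has only NULL quantities, so its plain sum is NULL while the IFNULL version reports 0.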

#### Join sales data and inventory data


+ Using a JOIN, enrich the website ecommerce data with the following fields from the product inventory dataset:

+ name

+ stockLevel

+ restockingLeadTime

+ sentimentScore

+ sentimentMagnitude

+ Complete the partially written query:

In [None]:
# join against product inventory to get name
SELECT DISTINCT
  website.productSKU,
  website.total_ordered,
  inventory.name,
  inventory.stockLevel,
  inventory.restockingLeadTime,
  inventory.sentimentScore,
  inventory.sentimentMagnitude
FROM
  ecommerce.sales_by_sku_20170801 AS website
  LEFT JOIN `data-to-insights.ecommerce.products` AS inventory
ORDER BY total_ordered DESC

+ Possible solution:

In [None]:
# join against product inventory to get name
SELECT DISTINCT
  website.productSKU,
  website.total_ordered,
  inventory.name,
  inventory.stockLevel,
  inventory.restockingLeadTime,
  inventory.sentimentScore,
  inventory.sentimentMagnitude
FROM
  ecommerce.sales_by_sku_20170801 AS website
  LEFT JOIN `data-to-insights.ecommerce.products` AS inventory
  ON website.productSKU = inventory.SKU
ORDER BY total_ordered DESC

+ Modify the query you wrote to now include:

+ A calculated field of (total_ordered / stockLevel) and alias it "ratio". Hint: Use SAFE_DIVIDE(field1,field2) to ***avoid divide by 0 errors*** when the stock level is 0.
+ Filter the results to only include products that have gone through 50% or more of their inventory already at the beginning of the month

+ Possible solution:

In [None]:
# calculate ratio and filter
SELECT DISTINCT
  website.productSKU,
  website.total_ordered,
  inventory.name,
  inventory.stockLevel,
  inventory.restockingLeadTime,
  inventory.sentimentScore,
  inventory.sentimentMagnitude,
  SAFE_DIVIDE(website.total_ordered, inventory.stockLevel) AS ratio
FROM
  ecommerce.sales_by_sku_20170801 AS website
  LEFT JOIN `data-to-insights.ecommerce.products` AS inventory
  ON website.productSKU = inventory.SKU
# gone through 50% or more of inventory for the month
WHERE SAFE_DIVIDE(website.total_ordered,inventory.stockLevel) >= .50
ORDER BY total_ordered DESC
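The behavior of SAFE_DIVIDE in this filter can be sketched in plain Python: a division by zero (or by a missing value) yields no result instead of an error, and rows without a ratio drop out of the `>= .50` filter. The helper name `safe_divide` and the sample rows are hypothetical and only mirror the query above.

```python
from typing import Optional


def safe_divide(numerator: float, denominator: Optional[float]) -> Optional[float]:
    """Return numerator / denominator, or None when the division is not possible."""
    if denominator is None or denominator == 0:
        return None
    return numerator / denominator


# Hypothetical (productSKU, total_ordered, stockLevel) rows.
rows = [("SKU1", 10, 20), ("SKU2", 3, 0), ("SKU3", 1, 50), ("SKU4", 7, None)]

low_stock = []
for sku, ordered, stock in rows:
    ratio = safe_divide(ordered, stock)
    # A missing ratio never satisfies the filter, like NULL >= .50 in SQL.
    if ratio is not None and ratio >= 0.50:
        low_stock.append((sku, ratio))

print(low_stock)
```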

### Append additional records

#### Create a new empty table to store sales by productSKU for 08/02/2017

- For the schema, specify the following fields:

- table name is ecommerce.sales_by_sku_20170802
- productSKU STRING
- total_ordered as an INT64 field

+ Possible solution:

In [None]:
CREATE OR REPLACE TABLE ecommerce.sales_by_sku_20170802
(
productSKU STRING,
total_ordered INT64
);

+ Confirm you now have two date-sharded sales tables - use the dropdown menu next to the sales_by_sku table name in the table results, or refresh your browser to see it listed in the left menu:

+ Insert the sales record provided to you by your sales team:

In [None]:
INSERT INTO ecommerce.sales_by_sku_20170802
(productSKU, total_ordered)
VALUES('GGOEGHPA002910', 101)

+ Confirm the record appears by previewing the table - click on the table name to see the results.

#### Append together historical data

- There are multiple ways to append together data that has the same schema. Two common ways are using UNIONs and table wildcards.

- Union is an SQL operator that appends together rows from different result sets.

- Table wildcards enable you to query multiple tables using concise SQL statements. Wildcard tables are available only in standard SQL.

- Write a UNION query that will result in all records from the below two tables:

- ecommerce.sales_by_sku_20170801

- ecommerce.sales_by_sku_20170802

In [None]:
SELECT * FROM ecommerce.sales_by_sku_20170801
UNION ALL
SELECT * FROM ecommerce.sales_by_sku_20170802
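To see how UNION ALL differs from a deduplicating union, the sketch below runs both against two tiny in-memory SQLite tables. The table names and the `SKU_A` row are hypothetical; SQLite spells the deduplicating form as plain `UNION`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE sales_0801 (productSKU TEXT, total_ordered INTEGER);
    CREATE TABLE sales_0802 (productSKU TEXT, total_ordered INTEGER);
    INSERT INTO sales_0801 VALUES ('GGOEGHPA002910', 101), ('SKU_A', 5);
    INSERT INTO sales_0802 VALUES ('GGOEGHPA002910', 101);
    """
)

# UNION ALL keeps every row, including the identical record from both days.
union_all = conn.execute(
    "SELECT * FROM sales_0801 UNION ALL SELECT * FROM sales_0802"
).fetchall()

# UNION removes the duplicate record.
union_distinct = conn.execute(
    "SELECT * FROM sales_0801 UNION SELECT * FROM sales_0802"
).fetchall()

print(len(union_all), len(union_distinct))
```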

+ Note: The difference between a UNION and UNION ALL is that a ***UNION will not include duplicate records***. In BigQuery standard SQL, the deduplicating form is written UNION DISTINCT.

+ What is a pitfall of having many daily sales tables? You will have to write many UNION statements chained together.

+ A better solution is to use the table wildcard together with a _TABLE_SUFFIX filter.

+ Write a query that uses the (*) table wildcard to select all records from ecommerce.sales_by_sku_ for the year 2017:

In [None]:
SELECT * FROM `ecommerce.sales_by_sku_2017*`

+ Modify the previous query to add a filter to limit the results to just 08/02/2017.

In [None]:
SELECT * FROM `ecommerce.sales_by_sku_2017*`
WHERE _TABLE_SUFFIX = '0802'
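Conceptually, the wildcard matches every table whose name starts with the given prefix, and _TABLE_SUFFIX holds whatever the `*` matched. The Python sketch below imitates that matching on a list of table names. The function `match_wildcard` and the table list are hypothetical and not part of any BigQuery API.

```python
# Hypothetical table names in the ecommerce dataset.
tables = ["sales_by_sku_20170801", "sales_by_sku_20170802", "products"]


def match_wildcard(table_names, pattern):
    """Yield (table, suffix) for tables matching a trailing-* pattern."""
    prefix = pattern.rstrip("*")
    for name in table_names:
        if name.startswith(prefix):
            # The suffix is the part of the name that the * stands for.
            yield name, name[len(prefix):]


matched = list(match_wildcard(tables, "sales_by_sku_2017*"))

# Equivalent of WHERE _TABLE_SUFFIX = '0802'.
only_0802 = [name for name, suffix in matched if suffix == "0802"]

print(matched)
print(only_0802)
```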

+ Note: Another option to consider is to create a Partitioned Table which automatically can ingest daily sales data into the correct partition.

***

**<center><font size = "6">Creating Date-Partitioned Tables in BigQuery</font></center>**
***

### Creating tables with date partitions

#### Query webpage analytics for a sample of visitors in 2017

+ In the Query Editor, add the below query:

In [None]:
#standardSQL
SELECT DISTINCT
  fullVisitorId,
  date,
  city,
  pageTitle
FROM `data-to-insights.ecommerce.all_sessions_raw`
WHERE date = '20170708'
LIMIT 5

+ Click Run.

#### Query webpage analytics for a sample of visitors in 2018

+ Let's modify the query to look at visitors for 2018 now.

+ Click COMPOSE NEW QUERY to clear the Query Editor, then add this new query. Note the WHERE date parameter is changed to 20180708:

In [None]:
#standardSQL
SELECT DISTINCT
  fullVisitorId,
  date,
  city,
  pageTitle
FROM `data-to-insights.ecommerce.all_sessions_raw`
WHERE date = '20180708'
LIMIT 5

+ Click Run.

#### Common use-cases for date-partitioned tables
+ Scanning through the entire dataset every time to compare rows against a WHERE condition is wasteful. This is especially true if you only really care about records for a specific period of time like:

+ All transactions for the last year
+ All visitor interactions within the last 7 days
+ All products sold in the last month
+ Instead of scanning the entire dataset and filtering on a date field as in the earlier queries, set up a date-partitioned table. This lets BigQuery skip partitions entirely when they are irrelevant to your query.

#### Create a new partitioned table based on date

+ Click COMPOSE NEW QUERY, add the below query, then click Run:

In [None]:
#standardSQL
 CREATE OR REPLACE TABLE ecommerce.partition_by_day
 PARTITION BY date_formatted
 OPTIONS(
   description="a table partitioned by date"
 ) AS
 SELECT DISTINCT
 PARSE_DATE("%Y%m%d", date) AS date_formatted,
 fullvisitorId
 FROM `data-to-insights.ecommerce.all_sessions_raw`

+ Click on the ecommerce dataset, then select the new partition_by_day table:

+ Click on the Details tab.

+ Note: Partitions within partitioned tables on your lab account will auto-expire after 60 days from the value in your date column. Your personal Google Cloud account with billing-enabled will let you have partitioned tables that don't expire.

### View data processed with a partitioned table

+ Run the below query, and note the total bytes to be processed:



In [None]:
#standardSQL
SELECT *
FROM `data-to-insights.ecommerce.partition_by_day`
WHERE date_formatted = '2016-08-01'

+ Now run the below query, and note the total bytes to be processed:

In [None]:
#standardSQL
SELECT *
FROM `data-to-insights.ecommerce.partition_by_day`
WHERE date_formatted = '2018-07-08'

+ You should see "This query will process 0 B when run."
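The 0 B estimate follows from partition pruning: a filter on the partitioning column lets the engine open only the matching partition, and a date with no partition means nothing is read. The following is a simplified Python model of that idea, with a dictionary standing in for the partitions. The data and the function `query_partition` are hypothetical.

```python
from datetime import date

# Hypothetical partitions: each date maps to the rows stored for that day.
partitions = {
    date(2016, 8, 1): ["visitor_1", "visitor_2"],
    date(2016, 8, 2): ["visitor_3"],
}


def query_partition(day):
    """Return (rows, rows_scanned) while touching only the requested partition."""
    rows = partitions.get(day, [])
    return rows, len(rows)


hit = query_partition(date(2016, 8, 1))   # partition exists, two rows scanned
miss = query_partition(date(2018, 7, 8))  # no such partition, nothing scanned

print(hit, miss)
```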



### Creating an auto-expiring partitioned table

#### Explore the available NOAA weather data tables

+ In the left menu, in Explorer, click on Add Data and select Explore public datasets.

+ Search for "GSOD NOAA" then select the dataset.

+ Click on View Dataset.

+ (Note: you may need to copy the dataset before you can see the data; the lab instructions do not mention this.)

+ Scroll through the tables in the noaa_gsod dataset (which are manually sharded and not partitioned):

Your goal is to create a table that:

+ Queries on weather data from 2018 onward

+ Filters to only include days that have had some precipitation (rain, snow, etc.)

+ Only stores each partition of data for 60 days from that partition's date (rolling window)

+ First, copy and paste this below query:

In [None]:
#standardSQL
 SELECT
   DATE(CAST(year AS INT64), CAST(mo AS INT64), CAST(da AS INT64)) AS date,
   (SELECT ANY_VALUE(name) FROM `bigquery-public-data.noaa_gsod.stations` AS stations
    WHERE stations.usaf = stn) AS station_name,  -- Stations may have multiple names
   prcp
 FROM `bigquery-public-data.noaa_gsod.gsod*` AS weather
 WHERE prcp < 99.9  -- Filter unknown values
   AND prcp > 0      -- Filter stations/days with no precipitation
   AND _TABLE_SUFFIX >= '2018'
 ORDER BY date DESC -- Where has it rained/snowed recently
 LIMIT 10

+ Note: The table wildcard * in the FROM clause, combined with the _TABLE_SUFFIX filter, limits the number of tables the query refers to.

+ Note: Although a LIMIT 10 was added, this still does not reduce the total amount of data scanned (about 1.83 GB) since there are no partitions yet.


+ Click Run.

+ Confirm the date is properly formatted and the precipitation field is showing non-zero values.
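The date construction and the two precipitation filters in the query can be mirrored in a few lines of Python. This is a sketch on hypothetical rows, assuming (as the CAST calls suggest) that year, mo, and da arrive as strings.

```python
from datetime import date

# Hypothetical rows shaped like the gsod tables: year, mo, da are strings.
raw = [
    {"year": "2018", "mo": "03", "da": "07", "prcp": 0.25},
    {"year": "2018", "mo": "03", "da": "08", "prcp": 99.99},  # unknown value
    {"year": "2018", "mo": "03", "da": "09", "prcp": 0.0},    # dry day
]

# Same logic as DATE(CAST(year AS INT64), ...) plus prcp > 0 AND prcp < 99.9.
rainy = [
    (date(int(r["year"]), int(r["mo"]), int(r["da"])), r["prcp"])
    for r in raw
    if 0 < r["prcp"] < 99.9
]

print(rainy)
```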

### Your turn: create a partitioned table

+ Modify the previous query to create a table with the below specifications:

- Table name: ecommerce.days_with_rain
- Use the date field as your PARTITION BY
- For OPTIONS, specify partition_expiration_days = 60
- Add the table description = "weather stations with precipitation, partitioned by day"

In [None]:
#standardSQL
 CREATE OR REPLACE TABLE ecommerce.days_with_rain
 PARTITION BY date
 OPTIONS (
   partition_expiration_days=60,
   description="weather stations with precipitation, partitioned by day"
 ) AS
 SELECT
   DATE(CAST(year AS INT64), CAST(mo AS INT64), CAST(da AS INT64)) AS date,
   (SELECT ANY_VALUE(name) FROM `bigquery-public-data.noaa_gsod.stations` AS stations
    WHERE stations.usaf = stn) AS station_name,  -- Stations may have multiple names
   prcp
 FROM `bigquery-public-data.noaa_gsod.gsod*` AS weather
 WHERE prcp < 99.9  -- Filter unknown values
   AND prcp > 0      -- Filter stations/days with no precipitation
   AND _TABLE_SUFFIX >= '2018'

#### Confirm data partition expiration is working

+ To confirm you are only storing data from 60 days in the past up until today, run the DATE_DIFF query to get the age of your partitions, which are set to expire after 60 days.

+ Below is a query which tracks the average rainfall for the NOAA weather station in Wakayama, Japan which has significant precipitation.

In [None]:
#standardSQL
# avg monthly precipitation
SELECT
  AVG(prcp) AS average,
  station_name,
  date,
  CURRENT_DATE() AS today,
  DATE_DIFF(CURRENT_DATE(), date, DAY) AS partition_age,
  EXTRACT(MONTH FROM date) AS month
FROM ecommerce.days_with_rain
WHERE station_name = 'WAKAYAMA' #Japan
GROUP BY station_name, date, today, month, partition_age
ORDER BY date DESC; # most recent days first

### Confirm the oldest partition_age is at or below 60 days

+ Update the ORDER BY clause to show the oldest partitions first.

In [None]:
#standardSQL
# avg monthly precipitation
SELECT
  AVG(prcp) AS average,
  station_name,
  date,
  CURRENT_DATE() AS today,
  DATE_DIFF(CURRENT_DATE(), date, DAY) AS partition_age,
  EXTRACT(MONTH FROM date) AS month
FROM ecommerce.days_with_rain
WHERE station_name = 'WAKAYAMA' #Japan
GROUP BY station_name, date, today, month, partition_age
ORDER BY partition_age DESC

+ Note: Your results will vary if you re-run the query in the future, as the weather data, and your partitions, are continuously updated.
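The partition_age column is just the number of days between today and the partition date, so the expiration check can be sketched with Python's `datetime`. The fixed `today` value is hypothetical, chosen so the output is reproducible, and the exact moment BigQuery drops a partition is simplified here to a whole-day comparison.

```python
from datetime import date

EXPIRATION_DAYS = 60  # matches partition_expiration_days in the table options


def partition_age(partition_date, today):
    """Same idea as DATE_DIFF(CURRENT_DATE(), date, DAY)."""
    return (today - partition_date).days


def is_expired(partition_date, today):
    """Simplified check: a partition older than the window is gone."""
    return partition_age(partition_date, today) > EXPIRATION_DAYS


today = date(2024, 1, 31)  # hypothetical "current" date

print(partition_age(date(2023, 12, 2), today), is_expired(date(2023, 12, 2), today))
print(partition_age(date(2023, 12, 1), today), is_expired(date(2023, 12, 1), today))
```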

***

**<center><font size = "6">Troubleshooting and Solving Data Join Pitfalls</font></center>**
***

### Pin the lab project in BigQuery

+ The project with the new dataset is data-to-insights.

+ Click Navigation menu > BigQuery.
The Welcome to BigQuery in the Cloud Console message box opens.

+ Note: The Welcome to BigQuery in the Cloud Console message box provides a link to the quickstart guide and UI updates.
Click Done.

+ BigQuery public datasets are not displayed by default in the BigQuery web UI. To open the public datasets project, copy data-to-insights.

+ Click Add Data > Pin a project > Enter Project Name, then paste in the data-to-insights name. Click Pin.

### Examine the fields

Next, get familiar with the products and fields on the website you can use to create queries to analyze the dataset.

+ In the left pane in the Resources section, navigate to data-to-insights > ecommerce > all_sessions_raw.

+ On the right, under the Query editor, click the Schema tab to see the Fields and information about each field.

### Identify a key field in your ecommerce dataset

#### Examine the records

In this section you find how many product names and product SKUs are on your website and whether either one of those fields is unique.

+ Find how many product names and product SKUs are on the website. Copy and paste the query below into the BigQuery editor:

In [None]:
#standardSQL
# how many products are on the website?
SELECT DISTINCT
productSKU,
v2ProductName
FROM `data-to-insights.ecommerce.all_sessions_raw`

+ Click Run.

+ Clear the previous query and run the below query to list the distinct SKUs using DISTINCT:

In [None]:
#standardSQL
# find the count of unique SKUs
SELECT
DISTINCT
productSKU
FROM `data-to-insights.ecommerce.all_sessions_raw`

#### Examine the relationship between SKU & Name

+ Clear the previous query and run the below query to determine if some product names have more than one SKU. Notice we use the STRING_AGG() function to aggregate all the product SKUs that are associated with one product name into ***comma separated values***.

In [None]:
SELECT
  v2ProductName,
  COUNT(DISTINCT productSKU) AS SKU_count,
  STRING_AGG(DISTINCT productSKU LIMIT 5) AS SKU
FROM `data-to-insights.ecommerce.all_sessions_raw`
  WHERE productSKU IS NOT NULL
  GROUP BY v2ProductName
  HAVING SKU_count > 1
  ORDER BY SKU_count DESC

+ Click Run.

So we have seen that 1 Product can have 12 SKUs. What about 1 SKU? Should it be allowed to belong to more than 1 product?

+ Clear the previous query and run the below query to find out:

In [None]:
SELECT
  productSKU,
  COUNT(DISTINCT v2ProductName) AS product_count,
  STRING_AGG(DISTINCT v2ProductName LIMIT 5) AS product_name
FROM `data-to-insights.ecommerce.all_sessions_raw`
  WHERE v2ProductName IS NOT NULL
  GROUP BY productSKU
  HAVING product_count > 1
  ORDER BY product_count DESC

In [None]:
SELECT
  productSKU,
  COUNT(DISTINCT v2ProductName) AS product_count,
  STRING_AGG(DISTINCT v2ProductName LIMIT 5) AS product_name,
  ARRAY_AGG(DISTINCT v2ProductName LIMIT 5) AS product_name2
FROM `data-to-insights.ecommerce.all_sessions_raw`
  WHERE v2ProductName IS NOT NULL
  GROUP BY productSKU
  HAVING product_count > 1
  ORDER BY product_count DESC

+ Note: Try replacing STRING_AGG() with ARRAY_AGG() instead. Pretty cool, right? BigQuery natively supports nested array values. You can learn more from the Work with arrays guide.

### Pitfall: non-unique key

In inventory tracking, a SKU is designed to uniquely identify one and only one product. Here, it will be the basis of your JOIN condition when you look up information from other tables. Having a non-unique key can cause serious data issues, as you will see.

+ Write a query to identify all the product names for the SKU 'GGOEGPJC019099'.

In [None]:
SELECT DISTINCT
  v2ProductName,
  productSKU
FROM `data-to-insights.ecommerce.all_sessions_raw`
WHERE productSKU = 'GGOEGPJC019099'

+ Click Run.

### Joining website data against your product inventory list

Let's see the impact of joining on a dataset with multiple products for a single SKU. First explore the product inventory dataset (the products table) to see if this SKU is unique there.

In [None]:
SELECT
  SKU,
  name,
  stockLevel
FROM `data-to-insights.ecommerce.products`
WHERE SKU = 'GGOEGPJC019099'

### Join pitfall: Unintentional many-to-one SKU relationship

We now have two datasets: one for inventory stock level and the other for our website analytics. Let's JOIN the inventory dataset against your website product names and SKUs so you can have the inventory stock level associated with each product for sale on the website.

+ Clear the previous query and run the below query:

In [None]:
SELECT DISTINCT
  website.v2ProductName,
  website.productSKU,
  inventory.stockLevel
FROM `data-to-insights.ecommerce.all_sessions_raw` AS website
JOIN `data-to-insights.ecommerce.products` AS inventory
  ON website.productSKU = inventory.SKU
  WHERE productSKU = 'GGOEGPJC019099'

Next, let's expand our previous query to simply SUM the inventory available by product.

+ Clear the previous query and run the below query:

In [None]:
WITH inventory_per_sku AS (
  SELECT DISTINCT
    website.v2ProductName,
    website.productSKU,
    inventory.stockLevel
  FROM `data-to-insights.ecommerce.all_sessions_raw` AS website
  JOIN `data-to-insights.ecommerce.products` AS inventory
    ON website.productSKU = inventory.SKU
    WHERE productSKU = 'GGOEGPJC019099'
)
SELECT
  productSKU,
  SUM(stockLevel) AS total_inventory
FROM inventory_per_sku
GROUP BY productSKU

+ Oh no! It is 154 x 3 = 462 or triple counting the inventory! This is called an unintentional cross join (a topic we'll revisit later).
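The triple counting can be reproduced on a toy dataset. The sketch below uses in-memory SQLite tables as stand-ins for all_sessions_raw and products. The three product names are hypothetical placeholders; only the SKU and the stock level of 154 come from the text above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE website (v2ProductName TEXT, productSKU TEXT);
    CREATE TABLE inventory (SKU TEXT, stockLevel INTEGER);
    INSERT INTO website VALUES
      ('Name variant 1', 'GGOEGPJC019099'),
      ('Name variant 2', 'GGOEGPJC019099'),
      ('Name variant 3', 'GGOEGPJC019099'),
      ('Name variant 1', 'GGOEGPJC019099');
    INSERT INTO inventory VALUES ('GGOEGPJC019099', 154);
    """
)

# DISTINCT removes the repeated session row but keeps one row per name,
# so the single inventory record is counted once for each of the 3 names.
total = conn.execute(
    """
    WITH inventory_per_sku AS (
      SELECT DISTINCT website.v2ProductName, website.productSKU, inventory.stockLevel
      FROM website
      JOIN inventory ON website.productSKU = inventory.SKU
    )
    SELECT productSKU, SUM(stockLevel) AS total_inventory
    FROM inventory_per_sku
    GROUP BY productSKU
    """
).fetchone()

print(total)
```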



### Join pitfall solution: use distinct SKUs before joining

What are the options to solve your triple counting dilemma? First you need to only select distinct SKUs from the website before joining on other datasets.

We know that there can be more than one product name (like 7" Dog Frisbee) that can share a single SKU.

+ Let's gather all the possible names into an array:

In [None]:
SELECT
  productSKU,
  ARRAY_AGG(DISTINCT v2ProductName) AS push_all_names_into_array
FROM `data-to-insights.ecommerce.all_sessions_raw`
WHERE productSKU = 'GGOEGAAX0098'
GROUP BY productSKU

Now instead of having a row for every Product Name, we only have a row for each unique SKU.

+ If you wanted to deduplicate the product names, you could even LIMIT the array like so:

In [None]:
SELECT
  productSKU,
  ARRAY_AGG(DISTINCT v2ProductName LIMIT 1) AS push_all_names_into_array
FROM `data-to-insights.ecommerce.all_sessions_raw`
WHERE productSKU = 'GGOEGAAX0098'
GROUP BY productSKU
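Collapsing the website side to one row per SKU before joining is what removes the inflated totals. Here is a minimal SQLite sketch of that idea. SQLite has no ARRAY_AGG, so `group_concat` stands in for it, and the product names and the stock level of 10 are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE website (v2ProductName TEXT, productSKU TEXT);
    CREATE TABLE inventory (SKU TEXT, stockLevel INTEGER);
    INSERT INTO website VALUES
      ('Name variant 1', 'GGOEGAAX0098'),
      ('Name variant 2', 'GGOEGAAX0098');
    INSERT INTO inventory VALUES ('GGOEGAAX0098', 10);
    """
)

# One row per SKU first, then join: the inventory row is matched exactly once.
fixed = conn.execute(
    """
    WITH names_per_sku AS (
      SELECT productSKU, group_concat(DISTINCT v2ProductName) AS names
      FROM website
      GROUP BY productSKU
    )
    SELECT n.productSKU, n.names, SUM(i.stockLevel) AS total_inventory
    FROM names_per_sku AS n
    JOIN inventory AS i ON n.productSKU = i.SKU
    GROUP BY n.productSKU, n.names
    """
).fetchone()

print(fixed)
```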

#### Join pitfall: losing data records after a join

Now you're ready to join against your product inventory dataset again.

In [None]:
#standardSQL
SELECT DISTINCT
website.productSKU
FROM `data-to-insights.ecommerce.all_sessions_raw` AS website
JOIN `data-to-insights.ecommerce.products` AS inventory
ON website.productSKU = inventory.SKU

It seems we lost 819 SKUs after joining the datasets. Let's investigate by adding more specificity in your fields (one SKU column from each dataset):

In [None]:
#standardSQL
# pull ID fields from both tables
SELECT DISTINCT
website.productSKU AS website_SKU,
inventory.SKU AS inventory_SKU
FROM `data-to-insights.ecommerce.all_sessions_raw` AS website
JOIN `data-to-insights.ecommerce.products` AS inventory
ON website.productSKU = inventory.SKU
# IDs are present in both tables, how can we dig deeper?

+ It appears the SKUs are present in both of those datasets after the join for these 1,090 records. How can we find the missing records?

#### Join pitfall solution: selecting the correct join type and filtering for NULL

The default JOIN type is an INNER JOIN which returns records only if there is a SKU match on both the left and the right tables that are joined.

+ Rewrite the previous query to use a different join type to include all records from the website table, regardless of whether there is a match on a product inventory SKU record. Join type options: INNER JOIN, LEFT JOIN, RIGHT JOIN, FULL JOIN, CROSS JOIN.

In [None]:
#standardSQL
# the secret is in the JOIN type
# pull ID fields from both tables
SELECT DISTINCT
website.productSKU AS website_SKU,
inventory.SKU AS inventory_SKU
FROM `data-to-insights.ecommerce.all_sessions_raw` AS website
LEFT JOIN `data-to-insights.ecommerce.products` AS inventory
ON website.productSKU = inventory.SKU

+ Click Run.

How many SKUs are missing from your product inventory set?

+ Write a query to filter on NULL values from the inventory table.

In [None]:
#standardSQL
# find product SKUs in website table but not in product inventory table
SELECT DISTINCT
website.productSKU AS website_SKU,
inventory.SKU AS inventory_SKU
FROM `data-to-insights.ecommerce.all_sessions_raw` AS website
LEFT JOIN `data-to-insights.ecommerce.products` AS inventory
ON website.productSKU = inventory.SKU
WHERE inventory.SKU IS NULL

+ Click Run.

Question: How many products are missing?

Answer: 819 products are missing (SKU IS NULL) from your product inventory dataset.
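The same LEFT JOIN plus IS NULL pattern can be tried on a tiny example. The sketch below uses SQLite with hypothetical single-letter SKUs: the inner join keeps only matching SKUs, while the left join keeps every website SKU and exposes the unmatched ones as NULL on the inventory side.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE website (productSKU TEXT);
    CREATE TABLE inventory (SKU TEXT);
    INSERT INTO website VALUES ('A'), ('A'), ('B'), ('C');
    INSERT INTO inventory VALUES ('A'), ('Z');
    """
)

# Default (inner) join: SKUs B and C silently disappear.
inner = conn.execute(
    "SELECT DISTINCT w.productSKU FROM website AS w "
    "JOIN inventory AS i ON w.productSKU = i.SKU"
).fetchall()

# Left join filtered on NULL: exactly the website SKUs missing from inventory.
missing = conn.execute(
    "SELECT DISTINCT w.productSKU, i.SKU FROM website AS w "
    "LEFT JOIN inventory AS i ON w.productSKU = i.SKU "
    "WHERE i.SKU IS NULL ORDER BY w.productSKU"
).fetchall()

print(inner, missing)
```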

+ Clear the previous query and run the below query to confirm using one of the specific SKUs from the website dataset:

In [None]:
#standardSQL
# you can even pick one and confirm
SELECT * FROM `data-to-insights.ecommerce.products`
WHERE SKU = 'GGOEGATJ060517'
# query returns zero results

Now, what about the reverse situation? Are there any products in the product inventory dataset but missing from the website?

+ Write a query using a different join type to investigate.

In [None]:
#standardSQL
# reverse the join
# find records in inventory but not in website
SELECT DISTINCT
website.productSKU AS website_SKU,
inventory.SKU AS inventory_SKU
FROM `data-to-insights.ecommerce.all_sessions_raw` AS website
RIGHT JOIN `data-to-insights.ecommerce.products` AS inventory
ON website.productSKU = inventory.SKU
WHERE website.productSKU IS NULL

Answer: Yes. There are two product SKUs missing from the website dataset.

Next, add more fields from the product inventory dataset for more details.

+ Clear the previous query and run the below query:

In [None]:
#standardSQL
# what are these products?
# add more fields in the SELECT STATEMENT
SELECT DISTINCT
website.productSKU AS website_SKU,
inventory.*
FROM `data-to-insights.ecommerce.all_sessions_raw` AS website
RIGHT JOIN `data-to-insights.ecommerce.products` AS inventory
ON website.productSKU = inventory.SKU
WHERE website.productSKU IS NULL

+ Note: You typically will not see RIGHT JOINs in production queries. You would simply just do a LEFT JOIN and switch the ordering of the tables.

What if you wanted one query that listed all products missing from either the website or inventory?

+ Write a query using a different join type.


In [None]:
#standardSQL
SELECT DISTINCT
website.productSKU AS website_SKU,
inventory.SKU AS inventory_SKU
FROM `data-to-insights.ecommerce.all_sessions_raw` AS website
FULL JOIN `data-to-insights.ecommerce.products` AS inventory
ON website.productSKU = inventory.SKU
WHERE website.productSKU IS NULL OR inventory.SKU IS NULL

LEFT JOIN + RIGHT JOIN = FULL JOIN, which returns all records from both tables regardless of matching join keys. You then filter for the rows where there is a mismatch on either side.
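The "left plus right" idea can be shown literally. Older SQLite versions lack FULL JOIN (it arrived in 3.39), so this sketch emulates the mismatch query with two LEFT JOINs combined by UNION ALL, the second with the tables swapped in place of a RIGHT JOIN. The single-letter SKUs are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE website (productSKU TEXT);
    CREATE TABLE inventory (SKU TEXT);
    INSERT INTO website VALUES ('A'), ('A'), ('B'), ('C');
    INSERT INTO inventory VALUES ('A'), ('Z');
    """
)

mismatches = conn.execute(
    """
    -- website SKUs with no inventory record
    SELECT w.productSKU AS website_SKU, i.SKU AS inventory_SKU
    FROM (SELECT DISTINCT productSKU FROM website) AS w
    LEFT JOIN inventory AS i ON w.productSKU = i.SKU
    WHERE i.SKU IS NULL
    UNION ALL
    -- inventory SKUs never seen on the website (tables swapped)
    SELECT w.productSKU, i.SKU
    FROM inventory AS i
    LEFT JOIN (SELECT DISTINCT productSKU FROM website) AS w
      ON w.productSKU = i.SKU
    WHERE w.productSKU IS NULL
    """
).fetchall()

print(mismatches)
```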

#### Join pitfall: unintentional cross join

Not knowing the relationship between data table keys (1:1, 1:N, N:N) can return unexpected results and also significantly reduce query performance.

The last join type is the CROSS JOIN.

Create a new table with a site-wide discount percent that you want applied across products in the Clearance category.

+ Clear the previous query and run the below query:

In [None]:
#standardSQL
CREATE OR REPLACE TABLE ecommerce.site_wide_promotion AS
SELECT .05 AS discount;

In the left pane, site_wide_promotion is now listed in the Resource section under your project and dataset.

+ Clear the previous query and run the below query to find out how many products are in clearance:

+ Note: For a CROSS JOIN you will notice there is no join condition (e.g. ON or USING). The field is simply multiplied against the first dataset or .05 discount across all items.

Let's see the impact of unintentionally adding more than one record in the discount table.

+ Clear the previous query and run the below query to insert two more records into the promotion table:

In [None]:
INSERT INTO ecommerce.site_wide_promotion (discount)
VALUES (.04),
       (.03);

+ Clear the previous query and run the below query:

In [None]:
SELECT discount FROM ecommerce.site_wide_promotion

What happens when you apply the discount again across all 82 clearance products?



In [None]:
SELECT DISTINCT
productSKU,
v2ProductCategory,
discount
FROM `data-to-insights.ecommerce.all_sessions_raw` AS website
CROSS JOIN ecommerce.site_wide_promotion
WHERE v2ProductCategory LIKE '%Clearance%'

Answer: Instead of 82, you now have 246 returned which is more records than your original table started with.

Let's investigate the underlying cause by examining one product SKU.

In [None]:
#standardSQL
SELECT DISTINCT
productSKU,
v2ProductCategory,
discount
FROM `data-to-insights.ecommerce.all_sessions_raw` AS website
CROSS JOIN ecommerce.site_wide_promotion
WHERE v2ProductCategory LIKE '%Clearance%'
AND productSKU = 'GGOEGOLC013299'

What was the impact of the CROSS JOIN?

Answer: Since there are 3 discount codes to cross join on, you are multiplying the original dataset by 3.

+ Note: This behavior isn't limited to cross joins. With a normal join, you can unintentionally cross join when the data relationships are many-to-many, which can easily return millions or even billions of records.
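The row multiplication is easy to verify on a small scale. In this SQLite sketch, a hypothetical `clearance` table with four SKUs stands in for the 82 clearance products: with one discount row the cross join returns four rows, and after two more discounts are inserted it returns twelve.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE clearance (productSKU TEXT);
    CREATE TABLE site_wide_promotion (discount REAL);
    INSERT INTO clearance VALUES ('SKU1'), ('SKU2'), ('SKU3'), ('SKU4');
    INSERT INTO site_wide_promotion VALUES (0.05);
    """
)

query = "SELECT productSKU, discount FROM clearance CROSS JOIN site_wide_promotion"

# One discount row: every product appears exactly once.
before = len(conn.execute(query).fetchall())

# Two extra discount rows: every product now appears three times.
conn.execute("INSERT INTO site_wide_promotion VALUES (0.04), (0.03)")
after = len(conn.execute(query).fetchall())

print(before, after)
```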

***