# Data Warehouse Through Joins and Unions

### Explore ecommerce website data and create a reporting table using SQL JOINS and UNIONS

We want to examine the Google Merchandise Store dataset in light of product reviews. To do this, we build a data warehouse that joins data from three sources:
- Website ecommerce data
- Product inventory stock levels and lead times
- Product review sentiment analysis

### 1. Connecting BigQuery to a Jupyter Notebook

Set the `GOOGLE_APPLICATION_CREDENTIALS` environment variable so the notebook can connect to BigQuery.

In [21]:
import os 
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'D:/Agra/Data-Engineer/GCP-DataEngineerLearningPath/Quest-DataWarehouses/Quest-1-Data-Warehouse-Using-Joins-and-Unions/qwiklabs-gcp-02-da01f39c2182-ffbc63f077b8.json'
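
Before running any query, it can help to confirm that the variable actually points to a key file on disk. The following is a minimal sketch; `credentials_file_exists` is a hypothetical helper written for this notebook, not part of any Google library.

```python
import os

def credentials_file_exists(env=None):
    """Return True if GOOGLE_APPLICATION_CREDENTIALS names an existing file.

    `env` defaults to os.environ; a plain dict can be passed for testing.
    """
    if env is None:
        env = os.environ
    path = env.get("GOOGLE_APPLICATION_CREDENTIALS", "")
    # An unset or empty variable counts as "not configured".
    return bool(path) and os.path.isfile(path)

if not credentials_file_exists():
    print("GOOGLE_APPLICATION_CREDENTIALS is unset or does not point to a file")
```

This only checks that the file exists; it does not verify that the key is valid or has BigQuery permissions.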

### 2. Create a New Dataset

We need a dataset to store the tables for our insights. A new dataset named `ecommerce` can be created with a SQL query.

In [20]:
%%bigquery
CREATE SCHEMA ecommerce



The `ecommerce` dataset has been created.

### 3. Explore Product Sentiment Dataset

Product reviews are summarized by an average sentiment score and magnitude for each product.

First, copy the products table from the public dataset (which holds the average sentiment score and magnitude) into the dataset we just created.

In [22]:
%%bigquery
CREATE OR REPLACE TABLE ecommerce.products AS
SELECT
*
FROM
`data-to-insights.ecommerce.products`



Now we can read the data.

Let's check the data types of the columns in our products table.

In [24]:
%%bigquery
SELECT column_name, data_type
FROM `ecommerce.INFORMATION_SCHEMA.COLUMNS`
WHERE table_name = 'products'
ORDER BY ordinal_position



|   | column_name | data_type |
|---|---|---|
| 0 | SKU | STRING |
| 1 | name | STRING |
| 2 | orderedQuantity | INT64 |
| 3 | stockLevel | INT64 |
| 4 | restockingLeadTime | INT64 |
| 5 | sentimentScore | FLOAT64 |
| 6 | sentimentMagnitude | FLOAT64 |


Both `sentimentScore` and `sentimentMagnitude` are of type `FLOAT64`.

Next, find the top 5 products with the most positive sentiment.

In [26]:
%%bigquery
SELECT
  SKU,
  name AS product_name,
  sentimentScore AS sentiment_score,
  sentimentMagnitude AS sentiment_magnitude
FROM
  `qwiklabs-gcp-02-da01f39c2182.ecommerce.products`
ORDER BY
  sentimentScore DESC
LIMIT 5



|   | SKU | product_name | sentiment_score | sentiment_magnitude |
|---|---|---|---|---|
| 0 | GGOBJGOWUSG69402 | USB wired soundbar - in store only | 1.0 | 1.0 |
| 1 | GGOEGEVB071799 | Pocket Bluetooth Speaker | 0.9 | 0.2 |
| 2 | GGOEGADJ056816 | Men's Watershed Full Zip Hoodie Grey | 0.9 | 1.4 |
| 3 | GGOEGDHC018299 | 22 oz Water Bottle | 0.9 | 1.3 |
| 4 | GGOEGOAB021499 | Metal Texture Roller Pen | 0.9 | 1.4 |


The `USB wired soundbar - in store only` product has the highest sentiment score (1.0).

Revise the query to show the top 5 products with the most negative sentiment.

In [27]:
%%bigquery
SELECT
  SKU,
  name AS product_name,
  sentimentScore AS sentiment_score,
  sentimentMagnitude AS sentiment_magnitude
FROM
  `qwiklabs-gcp-02-da01f39c2182.ecommerce.products`
WHERE sentimentScore IS NOT NULL
ORDER BY
  sentimentScore ASC
LIMIT 5



|   | SKU | product_name | sentiment_score | sentiment_magnitude |
|---|---|---|---|---|
| 0 | GGOEGAAX0098 | 7" Dog Frisbee | -0.6 | 0.2 |
| 1 | GGOEGAAX0344 | Women's Vintage Hero Tee Platinum | -0.5 | 1.1 |
| 2 | GGOEGAAX0351 | Men's Vintage Henley | -0.5 | 1.4 |
| 3 | GGOEGAAX0607 | Women's Convertible Vest-Jacket Sea Foam Green | -0.5 | 1.8 |
| 4 | GGOEGAAX0595 | Men's Microfiber 1/4 Zip Pullover Blue/Indigo | -0.5 | 0.6 |


These are the 5 products with the most negative sentiment, excluding products whose sentiment score is NULL.
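
The `IS NOT NULL` filter matters because an ascending sort places NULL values first by default, so unreviewed products would otherwise crowd out the genuinely negative ones. The sketch below reproduces that behavior locally with Python's built-in `sqlite3` module, which also sorts NULLs first in ascending order. It is an illustration under that assumption, not a BigQuery call, and the `Unreviewed item` row is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (name TEXT, sentimentScore REAL)")
conn.executemany(
    "INSERT INTO products VALUES (?, ?)",
    [
        ("USB wired soundbar - in store only", 1.0),
        ('7" Dog Frisbee', -0.6),
        ("Unreviewed item", None),  # hypothetical row with no sentiment score
    ],
)

# Without the filter, the NULL row sorts first.
unfiltered = conn.execute(
    "SELECT name FROM products ORDER BY sentimentScore ASC LIMIT 1"
).fetchone()[0]

# With the filter, the most negative scored product comes first.
filtered = conn.execute(
    "SELECT name FROM products WHERE sentimentScore IS NOT NULL "
    "ORDER BY sentimentScore ASC LIMIT 1"
).fetchone()[0]

print(unfiltered)  # Unreviewed item
print(filtered)    # 7" Dog Frisbee
```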

### 4. Join Datasets To Find Insights

The inventory team wants to know the total sales by product each day and compare that against the current stock level in the inventory to see which products need to be resupplied first.

So we create a table that holds the total sales volume per `productSKU` for a given date, here 1 August 2017.

In [30]:
%%bigquery
CREATE OR REPLACE TABLE ecommerce.sales_by_sku_20170801 AS
SELECT
  productSKU,
  SUM(IFNULL(productQuantity,0)) AS total_ordered
FROM
  `data-to-insights.ecommerce.all_sessions_raw`
WHERE date = '20170801'
GROUP BY productSKU
ORDER BY total_ordered DESC

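
The `SUM(IFNULL(productQuantity, 0))` expression guards against SKUs whose quantity is NULL on every row: a plain `SUM` over only NULLs yields NULL rather than 0. The sketch below shows the difference with Python's built-in `sqlite3` module, whose `SUM` and `IFNULL` behave the same way in this respect. The `sessions` table and its SKUs are hypothetical toy data, not rows from `all_sessions_raw`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sessions (date TEXT, productSKU TEXT, productQuantity INTEGER)"
)
conn.executemany(
    "INSERT INTO sessions VALUES (?, ?, ?)",
    [
        ("20170801", "SKU_A", 2),
        ("20170801", "SKU_A", None),  # page view without a purchase
        ("20170801", "SKU_B", None),  # SKU_B was never purchased that day
        ("20170802", "SKU_A", 5),     # different date, filtered out below
    ],
)

rows = conn.execute(
    """
    SELECT productSKU,
           SUM(productQuantity)            AS raw_sum,
           SUM(IFNULL(productQuantity, 0)) AS total_ordered
    FROM sessions
    WHERE date = '20170801'
    GROUP BY productSKU
    ORDER BY total_ordered DESC
    """
).fetchall()

for row in rows:
    print(row)
```

For `SKU_B` the plain sum is `None` while `total_ordered` is `0`, which is why products with no orders appear with a total of 0 in the sales table.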


After creating the table, enrich the sales data with product inventory information by joining the two tables.

In [31]:
%%bigquery
SELECT DISTINCT
  website.productSKU,
  website.total_ordered,
  inventory.name,
  inventory.stockLevel,
  inventory.restockingLeadTime,
  inventory.sentimentScore,
  inventory.sentimentMagnitude
FROM
  ecommerce.sales_by_sku_20170801 AS website
  LEFT JOIN `ecommerce.products` AS inventory
  ON website.productSKU = inventory.SKU
ORDER BY total_ordered DESC



|   | productSKU | total_ordered | name | stockLevel | restockingLeadTime | sentimentScore | sentimentMagnitude |
|---|---|---|---|---|---|---|---|
| 0 | GGOEGOAQ012899 | 456 | Ballpoint LED Light Pen | 2098 | 11 | 0.4 | 0.7 |
| 1 | GGOEGDHC074099 | 334 | 17oz Stainless Steel Sport Bottle | 1390 | 13 | 0.8 | 1.3 |
| 2 | GGOEGOCB017499 | 319 | Leatherette Journal | 4978 | 36 | 0.5 | 0.9 |
| 3 | GGOEGOCC077999 | 290 | Spiral Journal with Pen | 4668 | 10 | 0.1 | 0.3 |
| 4 | GGOEGFYQ016599 | 253 | Foam Can and Bottle Cooler | 4495 | 10 | 0.7 | 1.2 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 457 | 9182780 | 0 | Women's Shell Jacket Blue/Black | 0 | 16 | 0.6 | 1.0 |
| 458 | 9180757 | 0 | Yoga Block | 0 | 13 | 0.1 | 0.3 |
| 459 | 9180850 | 0 | Ballpoint Stylus Pen | 0 | 6 | 0.3 | 0.5 |
| 460 | GGOEGAAX0348 | 0 | Android BTTF Cosmos Graphic Tee | 4 | 15 | 0.7 | 1.9 |

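
The query uses a `LEFT JOIN` so that every SKU in the sales table is kept, even when the products table has no matching row; the inventory columns are then NULL for that SKU. The sketch below contrasts this with an inner join using Python's built-in `sqlite3` module. The `SKU_NOT_IN_INVENTORY` row is a made-up example; the other row is taken from the output above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE website (productSKU TEXT, total_ordered INTEGER);
    CREATE TABLE inventory (SKU TEXT, name TEXT, stockLevel INTEGER);
    INSERT INTO website VALUES
        ('GGOEGOAQ012899', 456),
        ('SKU_NOT_IN_INVENTORY', 3);  -- hypothetical SKU with no inventory row
    INSERT INTO inventory VALUES
        ('GGOEGOAQ012899', 'Ballpoint LED Light Pen', 2098);
    """
)

select = """
    SELECT w.productSKU, w.total_ordered, i.name, i.stockLevel
    FROM website AS w
    {join} inventory AS i ON w.productSKU = i.SKU
    ORDER BY w.total_ordered DESC
"""

left_rows = conn.execute(select.format(join="LEFT JOIN")).fetchall()
inner_rows = conn.execute(select.format(join="INNER JOIN")).fetchall()

print(len(left_rows))   # 2: the unmatched SKU is kept with NULL inventory columns
print(len(inner_rows))  # 1: the unmatched SKU is dropped
```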

Modify the query to include the ratio of total ordered to stock level and filter the results to only include products that have gone through 50% or more of the inventory.

In [32]:
%%bigquery
SELECT DISTINCT
  website.productSKU,
  website.total_ordered,
  inventory.name,
  inventory.stockLevel,
  inventory.restockingLeadTime,
  inventory.sentimentScore,
  inventory.sentimentMagnitude,
  SAFE_DIVIDE(website.total_ordered, inventory.stockLevel) AS ratio
FROM
  ecommerce.sales_by_sku_20170801 AS website
  LEFT JOIN `ecommerce.products` AS inventory
  ON website.productSKU = inventory.SKU
WHERE SAFE_DIVIDE(website.total_ordered,inventory.stockLevel) >= .50
ORDER BY total_ordered DESC



|   | productSKU | total_ordered | name | stockLevel | restockingLeadTime | sentimentScore | sentimentMagnitude | ratio |
|---|---|---|---|---|---|---|---|---|
| 0 | GGOEGOCB078299 | 250 | Leather Journal-Black | 354 | 10 | 0.5 | 0.8 | 0.706215 |
| 1 | GGOEADHH073999 | 167 | Android 17oz Stainless Steel Sport Bottle | 283 | 8 | 0.3 | 0.5 | 0.590106 |
| 2 | GGOEYAAJ033014 | 30 | Men's Long & Lean Tee Charcoal | 42 | 11 | 0.4 | 0.6 | 0.714286 |
| 3 | GGOEGAEJ031315 | 18 | Tri-blend Hoodie Grey | 34 | 12 | 0.2 | 0.3 | 0.529412 |
| 4 | GGOEAAWJ062548 | 7 | Android Infant Short Sleeve Tee Pewter | 2 | 14 | 0.1 | 0.3 | 3.5 |
| 5 | GGOEGAYB068025 | 4 | Youth Baseball Raglan Heather/Black | 7 | 14 | 0.2 | 0.4 | 0.571429 |
| 6 | GGOEGAYB068056 | 3 | Youth Baseball Raglan Heather/Black | 2 | 13 | 0.3 | 0.6 | 1.5 |
| 7 | GGOEGAAC035016 | 2 | Men's Bayside Graphic Tee | 3 | 14 | 0.9 | 1.3 | 0.666667 |
| 8 | GGOEGATH060717 | 1 | Women's Convertible Vest-Jacket Sea Foam Green | 1 | 12 | 0.1 | 0.2 | 1.0 |
| 9 | GGOEGAYR068225 | 1 | Youth Short Sleeve Tee Red | 1 | 9 | 0.0 | 0.1 | 1.0 |


Among the products that have gone through at least 50% of their inventory, `Leather Journal-Black` had the most orders on 1 August 2017: 250 ordered against 354 in stock.
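
`SAFE_DIVIDE` is used instead of `/` because some products have a stock level of 0, and a plain division would raise a division-by-zero error. `SAFE_DIVIDE` returns NULL in that case, and a NULL ratio never passes the `>= .50` filter. The following Python sketch mimics that logic on a few rows copied from the outputs above; `safe_divide` is a hypothetical stand-in written for illustration, with `None` playing the role of NULL.

```python
def safe_divide(numerator, denominator):
    """Divide like SAFE_DIVIDE: return None instead of failing on a zero divisor."""
    if numerator is None or denominator is None or denominator == 0:
        return None
    return numerator / denominator

# (name, total_ordered, stockLevel) taken from the query outputs above.
rows = [
    ("Leather Journal-Black", 250, 354),
    ("Android Infant Short Sleeve Tee Pewter", 7, 2),
    ("Ballpoint LED Light Pen", 456, 2098),
    ("Yoga Block", 0, 0),  # zero stock: the ratio is None and is filtered out
]

low_stock = []
for name, total_ordered, stock_level in rows:
    ratio = safe_divide(total_ordered, stock_level)
    if ratio is not None and ratio >= 0.50:
        low_stock.append((name, round(ratio, 6)))

print(low_stock)
```

A ratio above 1, such as 3.5 for the infant tee, means more units were ordered that day than were in stock.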

### 5. Read All Daily Sales Tables



The sales team has already made in-store sales on 2 August 2017, which we want to record in our daily sales tables.

Create an empty table to store sales data.

In [33]:
%%bigquery
CREATE OR REPLACE TABLE ecommerce.sales_by_sku_20170802
(
productSKU STRING,
total_ordered INT64
)



The schema of the 2 August 2017 sales table is the same as that of the 1 August 2017 sales table.

Insert the sales record provided by the sales team.

In [34]:
%%bigquery
INSERT INTO ecommerce.sales_by_sku_20170802
(productSKU, total_ordered)
VALUES('GGOEGHPA002910', 101)



We now have 2 daily sales tables. There are multiple ways to read all daily sales at once.

A common way is to combine the tables with `UNION ALL`.

In [35]:
%%bigquery 
SELECT * FROM ecommerce.sales_by_sku_20170801
UNION ALL
SELECT * FROM ecommerce.sales_by_sku_20170802



|   | productSKU | total_ordered |
|---|---|---|
| 0 | GGOEGHPA002910 | 101 |
| 1 | GGOEGOAQ012899 | 456 |
| 2 | GGOEGDHC074099 | 334 |
| 3 | GGOEGOCB017499 | 319 |
| 4 | GGOEGOCC077999 | 290 |
| ... | ... | ... |
| 458 | 9182780 | 0 |
| 459 | 9180757 | 0 |
| 460 | 9180850 | 0 |
| 461 | GGOEGAAX0348 | 0 |


What is the drawback of using UNION? With many daily sales tables, we would have to write one UNION statement for every additional table.
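
To make the cost concrete, the sketch below generates the `UNION ALL` query text for a range of dates. `union_all_query` is a hypothetical helper, and it assumes one table per day named `ecommerce.sales_by_sku_YYYYMMDD`, following the naming used in this notebook.

```python
from datetime import date, timedelta

def union_all_query(start, end, dataset="ecommerce", prefix="sales_by_sku_"):
    """Build one SELECT per day between start and end, joined by UNION ALL."""
    days = (end - start).days + 1
    selects = [
        f"SELECT * FROM {dataset}.{prefix}{(start + timedelta(days=i)):%Y%m%d}"
        for i in range(days)
    ]
    return "\nUNION ALL\n".join(selects)

# Two days reproduce the query used above.
print(union_all_query(date(2017, 8, 1), date(2017, 8, 2)))

# A full month already needs 31 SELECT statements.
august = union_all_query(date(2017, 8, 1), date(2017, 8, 31))
print(august.count("SELECT"))
```

The generated text grows by one `SELECT` for every extra day, and it has to be regenerated whenever a new daily table is added.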

A better solution is to use a table wildcard filter. (Use it only when the tables have the same schema.)

In [37]:
%%bigquery
SELECT * FROM `ecommerce.sales_by_sku_2017*`



|   | productSKU | total_ordered |
|---|---|---|
| 0 | GGOEGHPA002910 | 101 |
| 1 | GGOEGOAQ012899 | 456 |
| 2 | GGOEGDHC074099 | 334 |
| 3 | GGOEGOCB017499 | 319 |
| 4 | GGOEGOCC077999 | 290 |
| ... | ... | ... |
| 458 | 9182780 | 0 |
| 459 | 9180757 | 0 |
| 460 | 9180850 | 0 |
| 461 | GGOEGAAX0348 | 0 |


Modify the query to limit the results to daily sales on 2 August 2017.

In [38]:
%%bigquery
SELECT * FROM `ecommerce.sales_by_sku_2017*`
WHERE _TABLE_SUFFIX = '0802'



|   | productSKU | total_ordered |
|---|---|---|
| 0 | GGOEGHPA002910 | 101 |
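
The value compared against `_TABLE_SUFFIX` is whatever part of the table name the `*` matched, so it depends on where the wildcard is placed. With the prefix `sales_by_sku_2017`, the suffix for 2 August is `0802`, not `20170802`. The sketch below illustrates that relationship in plain Python; `table_suffix` is a hypothetical helper that only mimics the name matching and does not query BigQuery.

```python
def table_suffix(table_name, wildcard):
    """Return the part of table_name matched by a trailing '*', or None if no match."""
    if not wildcard.endswith("*"):
        raise ValueError("this sketch only handles a trailing '*' wildcard")
    prefix = wildcard[:-1]
    if table_name.startswith(prefix):
        return table_name[len(prefix):]
    return None

tables = ["sales_by_sku_20170801", "sales_by_sku_20170802"]

# Same wildcard as in the query above.
print([table_suffix(t, "sales_by_sku_2017*") for t in tables])  # ['0801', '0802']

# A shorter prefix yields a longer suffix, so the filter value must change too.
print([table_suffix(t, "sales_by_sku_*") for t in tables])
```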
