# Troubleshooting and Solving Data Join Pitfalls

Joining data tables can provide meaningful insight. When we join data, however, common pitfalls can corrupt the results. This notebook shows how to avoid them.

### 1. Connecting BigQuery to a Jupyter Notebook

Set the environment variable that lets the notebook authenticate to BigQuery.

In [1]:
import os 
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'D:/Agra/Data-Engineer/GCP-DataEngineerLearningPath/Quest-DataWarehouses/Quest-3-Solving-Data-Join-Pitfalls/qwiklabs-gcp-02-71dac6548842-820b1794753e.json'

Load the BigQuery magics from the client library by executing the command below.

In [2]:
%load_ext google.cloud.bigquery

### 2. Create a New Dataset

The dataset stores the tables used for the insights. A new dataset titled `ecommerce` can be created with a SQL query.

In [3]:
%%bigquery
CREATE SCHEMA ecommerce

The `ecommerce` dataset has been created.

### 3. Identify a Key Field

In this quest we want to analyze the inventory stock levels for each product for sale on the ecommerce website.

To become familiar with the products on the website, first check the schema of the `all_sessions_raw` table.

In [6]:
%%bigquery
SELECT column_name, data_type
FROM `data-to-insights.ecommerce.INFORMATION_SCHEMA.COLUMNS`
WHERE table_name = 'all_sessions_raw'
ORDER BY ordinal_position

Unnamed: 0,column_name,data_type
0,fullVisitorId,STRING
1,channelGrouping,STRING
2,time,INT64
3,country,STRING
4,city,STRING
5,totalTransactionRevenue,INT64
6,transactions,INT64
7,timeOnSite,INT64
8,pageviews,INT64
9,sessionQualityDim,INT64


Find how many product names and product SKUs are on the website and whether either of those fields is unique. Run the query below to list the distinct product name and SKU combinations.

In [7]:
%%bigquery
SELECT DISTINCT
productSKU,
v2ProductName
FROM `data-to-insights.ecommerce.all_sessions_raw`

Unnamed: 0,productSKU,v2ProductName
0,9180750,Android 24 oz Contigo Bottle
1,9180833,Rubber Grip Ballpoint Pen 4 Pack
2,9180842,Maze Pen
3,9181019,Google Tri-blend Hoodie Grey
4,9182569,Google Men's Zip Hoodie
...,...,...
2268,9184616,Nest® Protect Smoke + CO White Battery Alarm - CA
2269,GGOEGCLB020832,BRIGHTtravels Set of 3 Nested Travel Cases
2270,A12345,My T-Shirt
2271,9180841,Bic Tri-Tone Twist Pen


There are 2273 distinct combinations of product name and SKU. Does that mean there are that many unique product SKUs? We cannot conclude that yet.

We need to look at the number of distinct SKUs.

In [8]:
%%bigquery
SELECT
DISTINCT
productSKU
FROM `data-to-insights.ecommerce.all_sessions_raw`

Unnamed: 0,productSKU
0,9180750
1,9180793
2,9180833
3,9180838
4,9180844
...,...
1904,10 55402
1905,10 93149
1906,10 55418
1907,9182956


The query returns 1909 distinct SKUs, fewer than the number of distinct product name and SKU combinations. Why does this happen? Because the first query also returned the product name, and it appears that multiple product names can share the same SKU.

Next, examine the relationship between product name and SKU: determine which product names have more than one SKU and which SKUs have more than one product name.

In [9]:
%%bigquery
SELECT
  v2ProductName,
  COUNT(DISTINCT productSKU) AS SKU_count,
  STRING_AGG(DISTINCT productSKU LIMIT 5) AS SKU
FROM `data-to-insights.ecommerce.all_sessions_raw`
  WHERE productSKU IS NOT NULL
  GROUP BY v2ProductName
  HAVING SKU_count > 1
  ORDER BY SKU_count DESC

Unnamed: 0,v2ProductName,SKU_count,SKU
0,Waze Women's Typography Short Sleeve Tee,12,"GGOEWALJ083415,9184705,9184708,GGOEWXXX0834,GG..."
1,Google Sunglasses,10,"GGOEGAAX0037,9180826,GGOEGHGR019499,9180827,91..."
2,Google Men's Watershed Full Zip Hoodie Grey,10,"GGOEGAAX0568,GGOEGADJ056814,GGOEGADJ056818,918..."
3,Google Women's Insulated Thermal Vest Navy,10,"GGOEGAAX0585,9182760,GGOEGAPL058515,GGOEGAPL05..."
4,Android Women's Short Sleeve Badge Tee Dark He...,10,"9182176,GGOEAAEJ028213,9182177,GGOEAAEJ028215,..."
...,...,...,...
488,Google Sports Bag,2,"GGOEGBMJ013399,9180766"
489,Google Stretch Fit Hat M/L Navy,2,"GGOEGHPL003214,9181575"
490,YouTube Sticker Sheet,2,"GGOEYFKQ020699,10 51122"
491,Android Stretch Fit Hat Charcoal,2,"GGOEGAAX0042,9182706"


Some product names have more than one SKU. The ecommerce website shows that a product name may come in multiple options, such as size or color, which are sold as separate SKUs.

What about the reverse? Should one SKU be allowed to belong to more than one product name? Let's check.

In [12]:
%%bigquery
SELECT
  productSKU,
  COUNT(DISTINCT v2ProductName) AS product_count,
  STRING_AGG(DISTINCT v2ProductName LIMIT 5) AS product_name
FROM `data-to-insights.ecommerce.all_sessions_raw`
  WHERE v2ProductName IS NOT NULL
  GROUP BY productSKU
  HAVING product_count > 1
  ORDER BY product_count DESC

Unnamed: 0,productSKU,product_count,product_name
0,GGOEGAAX0098,3,"7&quot; Dog Frisbee,Google 7-inch Dog Flying D..."
1,GGOEGBMC056599,3,"Waterproof Gear Bag,Waterpoof Gear Bag,Google ..."
2,GGOEGCLB020832,3,"Set of 3 Nested Travel Cases,BRIGHTtravels Set..."
3,GGOEGEVA022399,3,"Micro Wireless Earbud,Micro Wireless Earbuds,A..."
4,GGOENEBJ079499,3,"Nest® Learning Thermostat 3rd Gen-USA,Nest® Le..."
...,...,...,...
342,9182768,2,Google Women's Short Sleeve Performance Tee Bl...
343,9180763,2,"Collapsible Shopping Bag,Latitudes Foldaway Sh..."
344,9182769,2,Google Women's Short Sleeve Performance Tee Pe...
345,9182747,2,Google Men's Short Sleeve Performance Badge Te...


A SKU can have more than one product name. Most product names that share a SKU are similar but not exactly the same. Together with the previous result, we can infer that the relationship between product name and SKU is many-to-many.
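
The two uniqueness checks can also be tried locally. The following is a minimal sketch, assuming an in-memory SQLite database as a stand-in for BigQuery; the `website` table and its four rows are made up for illustration and are not taken from `all_sessions_raw`.

```python
import sqlite3

# Toy stand-in for the website table; the rows are hypothetical.
rows = [
    ("SKU-1", "Dog Frisbee"),
    ("SKU-1", "Dog Flying Disc"),  # same SKU, different name
    ("SKU-2", "Maze Pen"),
    ("SKU-3", "Maze Pen"),         # same name, different SKU
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE website (productSKU TEXT, v2ProductName TEXT)")
conn.executemany("INSERT INTO website VALUES (?, ?)", rows)

# SKUs that map to more than one product name.
skus_with_many_names = conn.execute(
    """
    SELECT productSKU, COUNT(DISTINCT v2ProductName) AS product_count
    FROM website
    GROUP BY productSKU
    HAVING COUNT(DISTINCT v2ProductName) > 1
    """
).fetchall()

# Product names that map to more than one SKU.
names_with_many_skus = conn.execute(
    """
    SELECT v2ProductName, COUNT(DISTINCT productSKU) AS sku_count
    FROM website
    GROUP BY v2ProductName
    HAVING COUNT(DISTINCT productSKU) > 1
    """
).fetchall()

print(skus_with_many_names)  # [('SKU-1', 2)]
print(names_with_many_skus)  # [('Maze Pen', 2)]
```

When both queries return rows, neither field can serve as a unique key on its own, which is the many-to-many situation described above.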

### 4. Pitfall: Non-Unique Key

A SKU is designed to uniquely identify one product. A non-unique key can cause serious data issues, as we will see later.

Identify the product names for the SKU `GGOEGPJC019099`.

In [13]:
%%bigquery
SELECT DISTINCT
  v2ProductName,
  productSKU
FROM `data-to-insights.ecommerce.all_sessions_raw`
WHERE productSKU = 'GGOEGPJC019099'

Unnamed: 0,v2ProductName,productSKU
0,7&quot; Dog Frisbee,GGOEGPJC019099
1,"7"" Dog Frisbee",GGOEGPJC019099
2,Google 7-inch Dog Flying Disc Blue,GGOEGPJC019099


The product names are mostly the same except for a few characters.

Next, explore the product inventory table to see if the SKU is unique there.

In [14]:
%%bigquery
SELECT
  SKU,
  name,
  stockLevel
FROM `data-to-insights.ecommerce.products`
WHERE SKU = 'GGOEGPJC019099'

Unnamed: 0,SKU,name,stockLevel
0,GGOEGPJC019099,"7"" Dog Frisbee",154


In the product inventory table, this SKU is unique because the query returns only 1 record.

Now join the website table to the product inventory table, so that we have the inventory stock level associated with each product on the website.

In [15]:
%%bigquery
SELECT DISTINCT
  website.v2ProductName,
  website.productSKU,
  inventory.stockLevel
FROM `data-to-insights.ecommerce.all_sessions_raw` AS website
JOIN `data-to-insights.ecommerce.products` AS inventory
  ON website.productSKU = inventory.SKU
  WHERE productSKU = 'GGOEGPJC019099'

Unnamed: 0,v2ProductName,productSKU,stockLevel
0,Google 7-inch Dog Flying Disc Blue,GGOEGPJC019099,154
1,"7"" Dog Frisbee",GGOEGPJC019099,154
2,7&quot; Dog Frisbee,GGOEGPJC019099,154


After joining the tables, we have the inventory stock level for the product, but the stockLevel appears three times, once for each product name.

Now sum the inventory available by product.

In [16]:
%%bigquery
WITH inventory_per_sku AS (
  SELECT DISTINCT
    website.v2ProductName,
    website.productSKU,
    inventory.stockLevel
  FROM `data-to-insights.ecommerce.all_sessions_raw` AS website
  JOIN `data-to-insights.ecommerce.products` AS inventory
    ON website.productSKU = inventory.SKU
    WHERE productSKU = 'GGOEGPJC019099'
)
SELECT
  productSKU,
  SUM(stockLevel) AS total_inventory
FROM inventory_per_sku
GROUP BY productSKU

Unnamed: 0,productSKU,total_inventory
0,GGOEGPJC019099,462


The `GGOEGPJC019099` SKU now shows a total inventory of 462, three times the actual stock of 154 (154 × 3 = 462). The join duplicated the inventory row once per product name, so the sum is wrong.
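
The triple counting can be reproduced without BigQuery. This is a small sketch, assuming an in-memory SQLite database with two toy tables named `website` and `products`; only the SKU, the three product names, and the stock level of 154 come from the outputs above, everything else is illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE website (productSKU TEXT, v2ProductName TEXT)")
conn.execute("CREATE TABLE products (SKU TEXT, stockLevel INTEGER)")

# One SKU listed under three slightly different product names.
conn.executemany(
    "INSERT INTO website VALUES (?, ?)",
    [
        ("GGOEGPJC019099", '7" Dog Frisbee'),
        ("GGOEGPJC019099", "7&quot; Dog Frisbee"),
        ("GGOEGPJC019099", "Google 7-inch Dog Flying Disc Blue"),
    ],
)
# The inventory table holds exactly one row for that SKU.
conn.execute("INSERT INTO products VALUES ('GGOEGPJC019099', 154)")

# Joining on the non-unique key repeats the inventory row three times.
naive_total = conn.execute(
    """
    SELECT SUM(inventory.stockLevel)
    FROM website
    JOIN products AS inventory ON website.productSKU = inventory.SKU
    """
).fetchone()[0]

# The true stock comes from the inventory table alone.
true_total = conn.execute("SELECT SUM(stockLevel) FROM products").fetchone()[0]

print(naive_total, true_total)  # 462 154
```

Comparing an aggregate before and after a join, as done here, is a cheap way to notice that a join has fanned out.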

### 5. Join Pitfall Solution

To solve the previous problem, we need to either select only the distinct SKUs or gather all the possible product names into an array. We use another SKU to demonstrate the solution.

In [17]:
%%bigquery
SELECT
  productSKU,
  ARRAY_AGG(DISTINCT v2ProductName) AS push_all_names_into_array
FROM `data-to-insights.ecommerce.all_sessions_raw`
WHERE productSKU = 'GGOEGAAX0098'
GROUP BY productSKU

Unnamed: 0,productSKU,push_all_names_into_array
0,GGOEGAAX0098,"[7"" Dog Frisbee, 7&quot; Dog Frisbee, Google 7..."


Instead of having a row for every product name, we only have a row for each unique SKU.
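
To see why one row per SKU matters for the join, here is a hedged sketch that collapses the website rows to one row per SKU before joining. It assumes an in-memory SQLite database with toy `website` and `products` tables, and it uses `COUNT(DISTINCT ...)` with `GROUP BY` in place of BigQuery's `ARRAY_AGG`, which SQLite does not provide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE website (productSKU TEXT, v2ProductName TEXT)")
conn.execute("CREATE TABLE products (SKU TEXT, stockLevel INTEGER)")
conn.executemany(
    "INSERT INTO website VALUES (?, ?)",
    [
        ("GGOEGPJC019099", '7" Dog Frisbee'),
        ("GGOEGPJC019099", "7&quot; Dog Frisbee"),
        ("GGOEGPJC019099", "Google 7-inch Dog Flying Disc Blue"),
    ],
)
conn.execute("INSERT INTO products VALUES ('GGOEGPJC019099', 154)")

# Collapse the website table to one row per SKU first, then join.
deduped_rows = conn.execute(
    """
    WITH website_skus AS (
      SELECT productSKU, COUNT(DISTINCT v2ProductName) AS name_count
      FROM website
      GROUP BY productSKU
    )
    SELECT website_skus.productSKU, name_count, inventory.stockLevel
    FROM website_skus
    JOIN products AS inventory ON website_skus.productSKU = inventory.SKU
    """
).fetchall()

total_inventory = sum(stock for _sku, _names, stock in deduped_rows)

print(deduped_rows)     # [('GGOEGPJC019099', 3, 154)]
print(total_inventory)  # 154
```

Because the left side of the join now has a unique key, the stock level appears once and the total matches the inventory table.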

If we want to keep a single product name per SKU, we can limit the array to one element.

In [18]:
%%bigquery
SELECT
  productSKU,
  ARRAY_AGG(DISTINCT v2ProductName LIMIT 1) AS push_all_names_into_array
FROM `data-to-insights.ecommerce.all_sessions_raw`
WHERE productSKU = 'GGOEGAAX0098'
GROUP BY productSKU

Unnamed: 0,productSKU,push_all_names_into_array
0,GGOEGAAX0098,"[7"" Dog Frisbee]"


Now that the SKU is unique, we can join to the product inventory table.

In [19]:
%%bigquery
SELECT DISTINCT
website.productSKU
FROM `data-to-insights.ecommerce.all_sessions_raw` AS website
JOIN `data-to-insights.ecommerce.products` AS inventory
ON website.productSKU = inventory.SKU

Unnamed: 0,productSKU
0,9180781
1,9180824
2,9182569
3,9182575
4,9182593
...,...
1085,GGOEYAEJ029616
1086,GGOEAAEJ033417
1087,GGOEAAEB028316
1088,GGOEAAEJ028215


The previous query returned only 1090 records, but there are 1909 distinct SKUs on the website, so 819 SKUs were lost in the join.

Try another query that shows the SKU from both tables.

In [20]:
%%bigquery
SELECT DISTINCT
website.productSKU AS website_SKU,
inventory.SKU AS inventory_SKU
FROM `data-to-insights.ecommerce.all_sessions_raw` AS website
JOIN `data-to-insights.ecommerce.products` AS inventory
ON website.productSKU = inventory.SKU

Unnamed: 0,website_SKU,inventory_SKU
0,9180793,9180793
1,9180824,9180824
2,9181019,9181019
3,9182569,9182569
4,9182593,9182593
...,...,...
1085,GGOEGAWH061354,GGOEGAWH061354
1086,GGOEGATB060216,GGOEGATB060216
1087,GGOEYAEJ029017,GGOEYAEJ029017
1088,GGOEGAPC058812,GGOEGAPC058812


The results look the same. Why does this happen? Because we chose the wrong join type. The default JOIN type is an INNER JOIN, which returns records only if there is a SKU match in both the left and the right tables.
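
The difference between the join types can be sketched on toy data. The example below assumes an in-memory SQLite database; the single-letter SKUs are hypothetical and stand in for the website and inventory tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE website (productSKU TEXT)")
conn.execute("CREATE TABLE products (SKU TEXT)")
conn.executemany("INSERT INTO website VALUES (?)", [("A",), ("B",), ("C",)])
conn.executemany("INSERT INTO products VALUES (?)", [("A",), ("D",)])

# INNER JOIN (the default): only SKUs present in both tables survive.
inner_rows = conn.execute(
    """
    SELECT w.productSKU, p.SKU
    FROM website AS w
    JOIN products AS p ON w.productSKU = p.SKU
    ORDER BY w.productSKU
    """
).fetchall()

# LEFT JOIN: every website SKU is kept; missing matches become NULL (None).
left_rows = conn.execute(
    """
    SELECT w.productSKU, p.SKU
    FROM website AS w
    LEFT JOIN products AS p ON w.productSKU = p.SKU
    ORDER BY w.productSKU
    """
).fetchall()

print(inner_rows)  # [('A', 'A')]
print(left_rows)   # [('A', 'A'), ('B', None), ('C', None)]
```

The inner join silently drops `B` and `C`, which mirrors how website SKUs without an inventory record disappeared from the result above.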

A possible solution is to use a LEFT JOIN, which keeps every row from the left (website) table.

In [21]:
%%bigquery
SELECT DISTINCT
website.productSKU AS website_SKU,
inventory.SKU AS inventory_SKU
FROM `data-to-insights.ecommerce.all_sessions_raw` AS website
LEFT JOIN `data-to-insights.ecommerce.products` AS inventory
ON website.productSKU = inventory.SKU

Unnamed: 0,website_SKU,inventory_SKU
0,9180750,
1,9180833,9180833
2,9180838,9180838
3,9180842,
4,9180844,9180844
...,...,...
1904,GGOEGAPL058516,
1905,GGOEGAAC075815,
1906,GGOEGAEJ028815,
1907,10 14154,


Let's check which SKUs are missing from the product inventory.

In [22]:
%%bigquery
SELECT DISTINCT
website.productSKU AS website_SKU,
inventory.SKU AS inventory_SKU
FROM `data-to-insights.ecommerce.all_sessions_raw` AS website
LEFT JOIN `data-to-insights.ecommerce.products` AS inventory
ON website.productSKU = inventory.SKU
WHERE inventory.SKU IS NULL

Unnamed: 0,website_SKU,inventory_SKU
0,9182838,
1,9183211,
2,10 75152,
3,GGOEGCGB022199,
4,GGOEGBJR018199,
...,...,...
814,10 75150,
815,GGOEGAEJ030115,
816,GGOEWAAJ083517,
817,A12345,


From the previous results, confirm one of the missing SKUs using the query below.

In [23]:
%%bigquery
#standardSQL
SELECT * FROM `data-to-insights.ecommerce.products`
WHERE SKU = 'GGOEGATJ060517'

Unnamed: 0,SKU,name,orderedQuantity,stockLevel,restockingLeadTime,sentimentScore,sentimentMagnitude


The query returns no rows. This can happen because some SKUs could be digital products that are not stored in warehouse inventory.

What about the reverse situation? Are there any products in the product inventory table that are missing from the website? Run the query below to investigate.

In [24]:
%%bigquery
SELECT DISTINCT
website.productSKU AS website_SKU,
inventory.SKU AS inventory_SKU
FROM `data-to-insights.ecommerce.all_sessions_raw` AS website
RIGHT JOIN `data-to-insights.ecommerce.products` AS inventory
ON website.productSKU = inventory.SKU
WHERE website.productSKU IS NULL

Unnamed: 0,website_SKU,inventory_SKU
0,,GGADFBSBKS42347
1,,GGOBJGOWUSG69402


Yes, 2 product SKUs are missing from the website.

Let's look at the details of these 2 products.

In [25]:
%%bigquery
SELECT DISTINCT
website.productSKU AS website_SKU,
inventory.*
FROM `data-to-insights.ecommerce.all_sessions_raw` AS website
RIGHT JOIN `data-to-insights.ecommerce.products` AS inventory
ON website.productSKU = inventory.SKU
WHERE website.productSKU IS NULL

Unnamed: 0,website_SKU,SKU,name,orderedQuantity,stockLevel,restockingLeadTime,sentimentScore,sentimentMagnitude
0,,GGADFBSBKS42347,PC gaming speakers,0,100,1,,
1,,GGOBJGOWUSG69402,USB wired soundbar - in store only,10,15,2,1.0,1.0


Why would these 2 products be missing from the website?
1. It is a new product (0 orders, no sentimentScore, and no sentimentMagnitude).
2. The product is available in store only.

To list all products missing from either the website or the inventory, run the query below.

In [26]:
%%bigquery
SELECT DISTINCT
website.productSKU AS website_SKU,
inventory.SKU AS inventory_SKU
FROM `data-to-insights.ecommerce.all_sessions_raw` AS website
FULL JOIN `data-to-insights.ecommerce.products` AS inventory
ON website.productSKU = inventory.SKU
WHERE website.productSKU IS NULL OR inventory.SKU IS NULL

Unnamed: 0,website_SKU,inventory_SKU
0,GGOEGAAX0332,
1,GGOEGAEE031014,
2,GGOEGATB060215,
3,GGOEGAWC062150,
4,9182732,
...,...,...
816,GGOEGAAX0228,
817,GGOEGDHC017999,
818,GGOEGAAL059014,
819,GGOEGAEJ030114,


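A FULL JOIN keeps all rows from both tables, so it combines the results of a LEFT JOIN and a RIGHT JOIN; rows without a match on the other side are filled with NULL.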
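
The unmatched rows of a FULL JOIN can be thought of as two set differences glued together. The following sketch uses plain Python sets with hypothetical SKUs; it illustrates the idea and is not how BigQuery executes the join.

```python
# Hypothetical SKUs on each side of the join.
website_skus = {"A", "B", "C"}
inventory_skus = {"A", "D"}

# LEFT JOIN ... WHERE inventory.SKU IS NULL: on the website, not in inventory.
only_on_website = website_skus - inventory_skus

# RIGHT JOIN ... WHERE website.productSKU IS NULL: in inventory, not on the website.
only_in_inventory = inventory_skus - website_skus

# FULL JOIN ... WHERE either side IS NULL: both groups, padded with None.
full_join_unmatched = (
    [(sku, None) for sku in sorted(only_on_website)]
    + [(None, sku) for sku in sorted(only_in_inventory)]
)

print(full_join_unmatched)  # [('B', None), ('C', None), (None, 'D')]
```

Each tuple mirrors a `(website_SKU, inventory_SKU)` row, with `None` playing the role of NULL.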

Lastly, let's try the CROSS JOIN. Create a promotion table first.

In [27]:
%%bigquery
CREATE OR REPLACE TABLE ecommerce.site_wide_promotion AS
SELECT .05 AS discount

How many products are on clearance?

In [28]:
%%bigquery
SELECT DISTINCT
productSKU,
v2ProductCategory,
discount
FROM `data-to-insights.ecommerce.all_sessions_raw` AS website
CROSS JOIN ecommerce.site_wide_promotion
WHERE v2ProductCategory LIKE '%Clearance%'

Unnamed: 0,productSKU,v2ProductCategory,discount
0,GGOEACCQ017299,Home/Clearance Sale/,0.05
1,GGOEAOCB077499,Home/Clearance Sale/,0.05
2,GGOEGACB023699,Home/Clearance Sale/,0.05
3,GGOEGEHQ024099,Home/Clearance Sale/,0.05
4,GGOEGOAB021499,Home/Clearance Sale/,0.05
...,...,...,...
77,GGOEGAAX0327,Home/Clearance Sale/,0.05
78,GGOEGAAX0606,Home/Clearance Sale/,0.05
79,GGOEGAAX0363,Home/Clearance Sale/,0.05
80,GGOEGAAX0570,Home/Clearance Sale/,0.05


There are 82 clearance products.

Add new discounts to the promotion table.

In [29]:
%%bigquery
INSERT INTO ecommerce.site_wide_promotion (discount)
VALUES (.04),
       (.03)

In [30]:
%%bigquery
SELECT discount FROM ecommerce.site_wide_promotion

Unnamed: 0,discount
0,0.05
1,0.04
2,0.03


The promotion table now has 3 discount values.

Let's see the impact of using CROSS JOIN.

In [31]:
%%bigquery
SELECT DISTINCT
productSKU,
v2ProductCategory,
discount
FROM `data-to-insights.ecommerce.all_sessions_raw` AS website
CROSS JOIN ecommerce.site_wide_promotion
WHERE v2ProductCategory LIKE '%Clearance%'

Unnamed: 0,productSKU,v2ProductCategory,discount
0,GGOEAHPA004110,Home/Clearance Sale/,0.05
1,GGOEAOCB077499,Home/Clearance Sale/,0.05
2,GGOEGCBB074399,Home/Clearance Sale/,0.05
3,GGOEGCBB074399,Home/Clearance Sale/,0.04
4,GGOEGCBQ016499,Home/Clearance Sale/,0.05
...,...,...,...
241,GGOEGAAX0593,Home/Clearance Sale/,0.04
242,GGOEGAAX0331,Home/Clearance Sale/,0.05
243,GGOEGAAX0334,Home/Clearance Sale/,0.05
244,GGOEGAAX0602,Home/Clearance Sale/,0.05


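The query returned 246 records instead of 82. The CROSS JOIN pairs every row of the website table with every row of the promotion table, so each clearance product now appears once per discount value (82 × 3 = 246).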
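
The multiplication is easy to confirm on a small scale. This sketch assumes an in-memory SQLite database; the `clearance` table with two made-up SKUs stands in for the filtered website rows, and `cross_join_count` is a hypothetical helper introduced only for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clearance (productSKU TEXT)")
conn.execute("CREATE TABLE site_wide_promotion (discount REAL)")
conn.executemany("INSERT INTO clearance VALUES (?)", [("A",), ("B",)])
conn.execute("INSERT INTO site_wide_promotion VALUES (0.05)")


def cross_join_count(connection):
    """Count the rows produced by crossing the two toy tables."""
    return connection.execute(
        "SELECT COUNT(*) FROM clearance CROSS JOIN site_wide_promotion"
    ).fetchone()[0]


# With a single discount row, the CROSS JOIN keeps one row per product.
count_with_one_discount = cross_join_count(conn)

# Adding two more discount rows triples the output.
conn.executemany(
    "INSERT INTO site_wide_promotion VALUES (?)", [(0.04,), (0.03,)]
)
count_with_three_discounts = cross_join_count(conn)

print(count_with_one_discount, count_with_three_discounts)  # 2 6
```

A CROSS JOIN is only safe for this kind of lookup when the second table is guaranteed to hold exactly one row.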

To check the effect of the CROSS JOIN on a single product, run the query below.

In [32]:
%%bigquery
SELECT DISTINCT
productSKU,
v2ProductCategory,
discount
FROM `data-to-insights.ecommerce.all_sessions_raw` AS website
CROSS JOIN ecommerce.site_wide_promotion
WHERE v2ProductCategory LIKE '%Clearance%'
AND productSKU = 'GGOEGOLC013299'

Unnamed: 0,productSKU,v2ProductCategory,discount
0,GGOEGOLC013299,Home/Clearance Sale/,0.03
1,GGOEGOLC013299,Home/Clearance Sale/,0.05
2,GGOEGOLC013299,Home/Clearance Sale/,0.04
