# Project 2, Part 4, Validate data in the staging tables using SQL

University of California, Berkeley

Master of Information and Data Science (MIDS) program

w205 - Fundamentals of Data Engineering

Student: Sophie Yeh

Year: 2022

Semester: Spring

Section: 8


# Included Modules and Packages

Code cell containing your includes for modules and packages

In [2]:
import csv
import numpy as np
import pandas as pd
import psycopg2
import math
import json
import pprint
from datetime import datetime as dt

# Supporting code

Code cells containing any supporting code, such as connecting to the database, any functions, etc.  

Remember you can freely use any code from the labs. You do not need to cite code from the labs.

In [3]:
#
# function to run a select query and return rows in a pandas dataframe
# pandas converts numeric values from postgres to float
# if every value in a float column is a whole number, change the column to integer
#
def my_select_query_pandas(query, rollback_before_flag, rollback_after_flag):
    "function to run a select query and return rows in a pandas dataframe"
    
    if rollback_before_flag:
        connection.rollback()
    
    df = pd.read_sql_query(query, connection)
    
    if rollback_after_flag:
        connection.rollback()
    
    # fix the float columns that really should be integers
    
    for column in df:
    
        if df[column].dtype == "float64":

            fraction_flag = False

            for value in df[column].values:
                
                # skip nulls (NaN); one value with a fractional part keeps the column as float
                if not np.isnan(value):
                    if value - math.floor(value) != 0:
                        fraction_flag = True
                        break

            # Int64 is the pandas nullable integer type, so nulls are preserved
            if not fraction_flag:
                df[column] = df[column].astype('Int64')
    
    return df


#connect to postgres database
connection = psycopg2.connect(
    user = "postgres",
    password = "ucb",
    host = "postgres",
    port = "5432",
    database = "postgres"
)
# create cursor for connections
cursor = connection.cursor()


# 2.4.1 Validate the data types in the staging table stage_1_peak_sales

Generally, we do not expect any issues with data types.  Write a simple query that validates the numeric and date columns.

* sale_id - validate that it is numeric
* sale_date - validate that it is a date
* sub_total - validate that it is numeric
* tax - validate that it is numeric
* total_amount - validate that it is numeric

Hint: make use of the operators: 
* xxxx::numeric
* xxxx::date

Sort by stage_id

Pattern your code after the examples in the labs.  You may use as many code cells as you need.

In [4]:
query = """

select stage_id, sale_id::numeric, sale_date::date, sub_total::numeric, tax::numeric, total_amount::numeric
from stage_1_peak_sales
order by stage_id;

"""
my_select_query_pandas(query, True, True).style.set_caption("Validate data types in stage_1_peak_sales")


Unnamed: 0,stage_id,sale_id,sale_date,sub_total,tax,total_amount
0,1,5763728874,2020-10-03,12,0,12
1,2,5763729036,2020-10-03,72,0,72
2,3,5763728904,2020-10-03,24,0,24
3,4,5763728973,2020-10-03,96,0,96
4,5,5763728757,2020-10-03,108,0,108
5,6,5763729051,2020-10-03,144,0,144
6,7,5763729153,2020-10-03,24,0,24
7,8,5763728608,2020-10-03,96,0,96
8,9,5763728696,2020-10-03,84,0,84
9,10,5763728768,2020-10-03,24,0,24
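
A cast such as `xxxx::numeric` or `xxxx::date` raises an error on the first value it cannot convert, so the query above either returns every row or fails without saying which rows are bad. The following is a minimal sketch of one way to locate the offending values in Python, assuming the staging rows have been fetched as strings. The helper names and the second sample row (with deliberately bad values) are hypothetical, and the checks only approximate what PostgreSQL accepts.

```python
from datetime import date
from decimal import Decimal, InvalidOperation


def is_numeric(text):
    """Return True if text parses as a finite decimal number."""
    try:
        return Decimal(text.strip()).is_finite()
    except (InvalidOperation, AttributeError):
        return False


def is_iso_date(text):
    """Return True if text parses as an ISO date such as 2020-10-03."""
    try:
        date.fromisoformat(text.strip())
        return True
    except (ValueError, AttributeError):
        return False


def find_bad_values(rows, numeric_columns, date_columns):
    """Return (stage_id, column, value) for every value that fails its check."""
    problems = []
    for row in rows:
        for column in numeric_columns:
            if not is_numeric(row[column]):
                problems.append((row["stage_id"], column, row[column]))
        for column in date_columns:
            if not is_iso_date(row[column]):
                problems.append((row["stage_id"], column, row[column]))
    return problems


# Row 1 mirrors the staging data; row 2 is a hypothetical row with two bad values.
sample_rows = [
    {"stage_id": 1, "sale_id": "5763728874", "sale_date": "2020-10-03",
     "sub_total": "12", "tax": "0", "total_amount": "12"},
    {"stage_id": 2, "sale_id": "5763729036", "sale_date": "2020-13-03",
     "sub_total": "7x", "tax": "0", "total_amount": "72"},
]

problems = find_bad_values(
    sample_rows,
    numeric_columns=["sale_id", "sub_total", "tax", "total_amount"],
    date_columns=["sale_date"],
)
print(problems)
```

A row-by-row check like this is only needed once a cast has failed. When the cast query succeeds, as it does here, the data types are already validated.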


# 2.4.2 Validate the data types in the staging table stage_1_peak_stores

Generally, we do not expect any issues with data types.  Write a simple query that validates the numeric columns.

* sale_id - validate that it is numeric
* location_id - validate that it is numeric

Hint: make use of the operator xxxx::numeric

Sort by stage_id

Pattern your code after the examples in the labs.  You may use as many code cells as you need.

In [5]:
query = """

select stage_id, sale_id::numeric, location_id::numeric
from stage_1_peak_stores
order by stage_id
"""
my_select_query_pandas(query, True, True).style.set_caption("Validate data types in stage_1_peak_stores")


Unnamed: 0,stage_id,sale_id,location_id
0,1,5763728874,12573
1,2,5763729036,12573
2,3,5763728904,12573
3,4,5763728973,12573
4,5,5763728757,12573
5,6,5763729051,12573
6,7,5763729153,12573
7,8,5763728608,12573
8,9,5763728696,12573
9,10,5763728768,12573


# 2.4.3 Validate the data types in the staging table stage_1_peak_customers

Generally, we do not expect any issues with data types.  Write a simple query that validates the numeric columns.

* sale_id - validate that it is numeric
* customer_id - validate that it is numeric

Hint: make use of the operator xxxx::numeric

Sort by stage_id

Pattern your code after the examples in the labs.  You may use as many code cells as you need.

In [6]:
query = """

select stage_id, sale_id::numeric, customer_id::numeric
from stage_1_peak_customers
order by stage_id
"""
my_select_query_pandas(query, True, True).style.set_caption("Validate data types in stage_1_peak_customers")


Unnamed: 0,stage_id,sale_id,customer_id
0,1,5763728874,3728404
1,2,5763729036,3729309
2,3,5763728904,3728508
3,4,5763728973,3728534
4,5,5763728757,3729188
5,6,5763729051,3729276
6,7,5763729153,3729242
7,8,5763728608,3728705
8,9,5763728696,3729340
9,10,5763728768,3729016


# 2.4.4 Validate the data types in the staging table stage_1_peak_line_items

Generally, we do not expect any issues with data types.  Write a simple query that validates the numeric columns.

* sale_id - validate that it is numeric
* line_item_id - validate that it is numeric
* product_id - validate that it is numeric
* price - validate that it is numeric
* quantity - validate that it is numeric

Hint: make use of the operator xxxx::numeric

Sort by stage_id

Pattern your code after the examples in the labs.  You may use as many code cells as you need.

In [7]:
query = """

select stage_id, sale_id::numeric, line_item_id::numeric, product_id::numeric, price::numeric, quantity::numeric
from stage_1_peak_line_items
order by stage_id
"""
my_select_query_pandas(query, True, True).style.set_caption("Validate data types in stage_1_peak_line_items")


Unnamed: 0,stage_id,sale_id,line_item_id,product_id,price,quantity
0,1,5763728874,1,42314780,12,1
1,2,5763729036,2,42314677,12,1
2,3,5763729036,3,42314782,12,3
3,4,5763729036,4,42314784,12,2
4,5,5763728904,5,42314780,12,1
5,6,5763728904,6,42314784,12,1
6,7,5763728973,7,42314677,12,2
7,8,5763728973,8,42314780,12,2
8,9,5763728973,9,42314782,12,2
9,10,5763728973,10,42314784,12,2


# 2.4.5 Validate the math on sub_total, tax, and total_amount in stage_1_peak_sales

Generally, we do not expect any issues with the math.  Write a simple query that validates the math:

total_amount = sub_total + tax

It's usually best to write a query that will return rows with errors.  In our case, the query should return nothing.

Remember that with staging tables, we need to convert varchar to numeric using column::numeric before math will work.

Sort by stage_id

Pattern your code after the examples in the labs.  You may use as many code cells as you need.

In [47]:
query = """

select total_amount::numeric, sub_total::numeric, tax::numeric
from stage_1_peak_sales
where total_amount::numeric != sub_total::numeric + tax::numeric
order by stage_id

"""
my_select_query_pandas(query, True, True).style.set_caption("Invalid total_amounts")


Unnamed: 0,total_amount,sub_total,tax
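
An empty result is only reassuring if the query would actually return a row when the math is wrong. The sketch below checks that on a small in-memory table. It assumes SQLite (from the Python standard library) as a stand-in for PostgreSQL, so `cast(... as numeric)` replaces `::numeric`, and the third row is a hypothetical error planted for the test.

```python
import sqlite3

# In-memory stand-in for the staging table; all columns that hold amounts are text.
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table stage_1_peak_sales "
    "(stage_id integer, sub_total text, tax text, total_amount text)"
)
conn.executemany(
    "insert into stage_1_peak_sales values (?, ?, ?, ?)",
    [
        (1, "12", "0", "12"),
        (2, "72", "0", "72"),
        (3, "24", "0", "25"),  # hypothetical bad row: 24 + 0 is not 25
    ],
)

# Return only the rows where total_amount differs from sub_total + tax.
error_rows = conn.execute(
    """
    select stage_id, total_amount, sub_total, tax
    from stage_1_peak_sales
    where cast(total_amount as numeric)
          != cast(sub_total as numeric) + cast(tax as numeric)
    order by stage_id
    """
).fetchall()

print(error_rows)
```

Selecting `stage_id` alongside the amounts makes each reported error traceable to its staging row.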


# 2.4.6 Validate the math between stage_1_peak_sales and stage_1_peak_line_items

Generally, we do not expect any issues with the math.  Write a simple query that validates the math:

sub_total in stage_1_peak_sales matches the sum of (price x quantity) in stage_1_peak_line_items

It's usually best to write a query that will return rows with errors.  In our case, the query should return nothing.

Remember that with staging tables, we need to convert varchar to numeric using column::numeric before math will work.

Sort by stage_id

Pattern your code after the examples in the labs.  You may use as many code cells as you need.

In [45]:
query = """

select s.stage_id, s.sale_id, s.sub_total::numeric, sum(l.price::numeric * l.quantity::numeric) as line_item_sum
from stage_1_peak_sales as s 
    join stage_1_peak_line_items as l
    on s.sale_id = l.sale_id
group by s.stage_id, s.sale_id, s.sub_total
having s.sub_total::numeric != sum(l.price::numeric * l.quantity::numeric)
order by s.stage_id

"""
my_select_query_pandas(query, True, True).style.set_caption("Invalid sub_totals")


Unnamed: 0,stage_id,sale_id,sub_total,line_item_sum
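
An inner join between sales and line items only compares sales that have at least one line item, so a sale with no line items at all would not appear as an error. The sketch below shows one possible variant that uses a left join and treats a missing sum as 0. It assumes SQLite as a stand-in for PostgreSQL, and the sale ids and amounts are hypothetical test data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    create table stage_1_peak_sales (stage_id integer, sale_id text, sub_total text);
    create table stage_1_peak_line_items
        (stage_id integer, sale_id text, price text, quantity text);

    -- sale A is correct, sale B is off by 12, sale C has no line items
    insert into stage_1_peak_sales values (1, 'A', '12'), (2, 'B', '36'), (3, 'C', '24');
    insert into stage_1_peak_line_items values (1, 'A', '12', '1'), (2, 'B', '12', '2');
    """
)

# Left join keeps every sale; coalesce turns a missing line item sum into 0.
mismatches = conn.execute(
    """
    select s.stage_id, s.sale_id,
           cast(s.sub_total as numeric) as sub_total,
           coalesce(sum(cast(l.price as numeric) * cast(l.quantity as numeric)), 0)
               as line_item_sum
    from stage_1_peak_sales as s
        left join stage_1_peak_line_items as l
        on s.sale_id = l.sale_id
    group by s.stage_id, s.sale_id, s.sub_total
    having cast(s.sub_total as numeric)
           != coalesce(sum(cast(l.price as numeric) * cast(l.quantity as numeric)), 0)
    order by s.stage_id
    """
).fetchall()

print(mismatches)
```

With the inner join, only sale B would be reported. The left join also surfaces sale C, whose sub_total has no line items to support it.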


# 2.4.7 Validate the tax is always zero in stage_1_peak_sales

It's usually best to write a query that will return rows with errors.  In our case, the query should return nothing.

Remember that with staging tables, we need to convert varchar to numeric using column::numeric before math will work.

Sort by stage_id

Pattern your code after the examples in the labs.  You may use as many code cells as you need.

In [44]:
query = """

select stage_id, tax::numeric
from stage_1_peak_sales 
where tax::numeric != 0
order by stage_id

"""
my_select_query_pandas(query, True, True).style.set_caption("Invalid tax")


Unnamed: 0,stage_id,tax


# 2.4.8 Validate the price is always 12 in stage_1_peak_line_items

It's usually best to write a query that will return rows with errors.  In our case, the query should return nothing.

Remember that with staging tables, we need to convert varchar to numeric using column::numeric before math will work.

Sort by stage_id

Pattern your code after the examples in the labs.  You may use as many code cells as you need.

In [48]:
query = """

select stage_id, price
from stage_1_peak_line_items 
where price::numeric != 12
order by stage_id

"""
my_select_query_pandas(query, True, True).style.set_caption("Invalid price")


Unnamed: 0,stage_id,price


# 2.4.9 Validate taxable is always N in stage_1_peak_line_items

It's usually best to write a query that will return rows with errors.  In our case, the query should return nothing.

Remember that with staging tables, we need to convert varchar to numeric using column::numeric before math will work.

Sort by stage_id

Pattern your code after the examples in the labs.  You may use as many code cells as you need.

In [52]:
query = """

select stage_id, taxable
from stage_1_peak_line_items 
where taxable != 'N'
order by stage_id

"""
my_select_query_pandas(query, True, True).style.set_caption("Invalid taxable")


Unnamed: 0,stage_id,taxable


# 2.4.10 Validate the store is the same for all in stage_1_peak_stores

It's usually best to write a query that will return rows with errors.  In our case, the query should return nothing.

Remember that with staging tables, we need to convert varchar to numeric using column::numeric before math will work.

Sort by stage_id

Pattern your code after the examples in the labs.  You may use as many code cells as you need.

In [59]:
query = """

select stage_id, location_id
from stage_1_peak_stores
where location_id::numeric != 12573
order by stage_id

"""
my_select_query_pandas(query, True, True).style.set_caption("Invalid store")


Unnamed: 0,stage_id,location_id
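
Comparing against a fixed location_id works when the expected store is known in advance. A possible alternative, sketched below, avoids the hard-coded value by flagging every row whose location_id differs from the most common one. It assumes SQLite as a stand-in for PostgreSQL and assumes the most frequent location_id is the correct store; the third row is a hypothetical outlier.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table stage_1_peak_stores (stage_id integer, sale_id text, location_id text)"
)
conn.executemany(
    "insert into stage_1_peak_stores values (?, ?, ?)",
    [
        (1, "5763728874", "12573"),
        (2, "5763729036", "12573"),
        (3, "5763728904", "99999"),  # hypothetical row from a different store
    ],
)

# The subquery picks the most common location_id; the outer query returns the exceptions.
odd_stores = conn.execute(
    """
    select stage_id, location_id
    from stage_1_peak_stores
    where location_id != (
        select location_id
        from stage_1_peak_stores
        group by location_id
        order by count(*) desc, location_id
        limit 1
    )
    order by stage_id
    """
).fetchall()

print(odd_stores)
```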


# 2.4.11 Validate the product id in stage_1_peak_line_items against peak_product_mapping

It's usually best to write a query that will return rows with errors.  In our case, the query should return nothing.

Remember that with staging tables, we need to convert varchar to numeric using column::numeric before math will work.

Sort by stage_id

Pattern your code after the examples in the labs.  You may use as many code cells as you need.

In [65]:
query = """

select l.stage_id, p.peak_product_id, l.product_id
from stage_1_peak_line_items as l
    left join peak_product_mapping as p
    on p.peak_product_id = l.product_id::numeric
where p.peak_product_id is null
order by l.stage_id

"""
my_select_query_pandas(query, True, True).style.set_caption("Invalid product_id")


Unnamed: 0,stage_id,peak_product_id,product_id
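
Checking product ids against a mapping table calls for a join that keeps the unmatched rows. An inner join on equality followed by a filter on inequality can never return a row, because the join has already discarded everything that does not match. The sketch below contrasts that with a left join filtered on null (an anti-join). It assumes SQLite as a stand-in for PostgreSQL, and product id 99999999 is a hypothetical unmapped value.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    create table peak_product_mapping (peak_product_id integer);
    create table stage_1_peak_line_items (stage_id integer, product_id text);

    insert into peak_product_mapping values (42314780), (42314677);
    -- the third line item uses a product id that is not in the mapping table
    insert into stage_1_peak_line_items
        values (1, '42314780'), (2, '42314677'), (3, '99999999');
    """
)

# Inner join on equality, then filter on inequality: always empty.
inner_join_rows = conn.execute(
    """
    select l.stage_id, p.peak_product_id, l.product_id
    from stage_1_peak_line_items as l
        join peak_product_mapping as p
        on p.peak_product_id = cast(l.product_id as integer)
    where p.peak_product_id != cast(l.product_id as integer)
    order by l.stage_id
    """
).fetchall()

# Anti-join: keep every line item, then return those with no mapping row.
anti_join_rows = conn.execute(
    """
    select l.stage_id, p.peak_product_id, l.product_id
    from stage_1_peak_line_items as l
        left join peak_product_mapping as p
        on p.peak_product_id = cast(l.product_id as integer)
    where p.peak_product_id is null
    order by l.stage_id
    """
).fetchall()

print(inner_join_rows)
print(anti_join_rows)
```

The first query stays empty even though line item 3 is unmapped, while the anti-join reports it with a null peak_product_id.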
