# **Hands-on Lab: Verifying Data Quality for a Data Warehouse**

https://www.coursera.org/learn/getting-started-with-data-warehousing-and-bi-analytics/ungradedLti/RDzaq/hands-on-lab-verifying-data-quality-for-a-data-warehouse

## **Purpose of the Lab:**

The primary purpose of this lab is to instruct participants on the process of conducting thorough data quality checks in a data warehousing environment. It focuses on using a Python-based framework within a PostgreSQL database to validate data integrity. Key areas of emphasis include identifying null values, duplicates, and invalid entries, as well as verifying data ranges. The lab aims to equip learners with the necessary skills to set up and utilize a testing framework for data validation, ensuring data accuracy and consistency.

## **Benefits of Learning the Lab:**

Engaging in this lab offers several benefits, particularly in enhancing one's capabilities in data management and quality assurance. Learners will gain hands-on experience in implementing automated data quality checks, a skill crucial for maintaining the reliability of data in real-world applications. This proficiency is especially beneficial for professionals working with large datasets, as it ensures the integrity of data used for analysis and decision-making. Moreover, understanding these concepts is essential for anyone aspiring to specialize in data science, database administration, or any field that relies heavily on accurate and reliable data.

## **Objectives**

In this lab, you will:

- Check for null values
- Check for duplicate values
- Check that values fall within a minimum and maximum range
- Check for invalid values
- Generate a report on data quality

# **Exercise 1 - Getting the environment ready**

Download the staging area setup script.

Run the command below to download the staging area setup script.

In [1]:
!curl -O https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DB0260EN-SkillsNetwork/labs/Verifying%20Data%20Quality%20for%20a%20Data%20Warehouse/setup_staging_area.sh

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   816  100   816    0     0   1051      0 --:--:-- --:--:-- --:--:--  1  0056


Run the setup script.

Run the command below to execute the staging area setup script.

In [3]:
#dropdb -h localhost -U postgres -p 5432 billingDW

In [2]:
#bash setup_staging_area.sh

# **Exercise 2 - Getting the testing framework ready**

You can perform most of the data quality checks by manually running SQL queries on the data warehouse.

However, it is a good idea to automate these checks using custom programs or tools. Automation makes it easy to:

- create new tests,
- run tests,
- and schedule tests.

We will use a Python-based framework to run the data quality tests.
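As a rough illustration of why automation helps, the sketch below keeps a few hand-written SQL checks in a dictionary and runs them in a loop. It is not the lab's framework: it assumes an in-memory SQLite database from the Python standard library as a stand-in for the PostgreSQL warehouse, and the `DimMonth` rows are made up for the example.

```python
import sqlite3

# Stand-in for the warehouse: an in-memory SQLite database with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE DimMonth (monthid INTEGER, month INTEGER, year INTEGER)")
conn.executemany(
    "INSERT INTO DimMonth VALUES (?, ?, ?)",
    [(20191, 1, 2019), (20192, 2, 2019), (20193, 3, None)],
)

# Each manual check is a query that counts offending rows; zero means "pass".
manual_checks = {
    "nulls in year": "SELECT COUNT(*) FROM DimMonth WHERE year IS NULL",
    "month outside 1-12": "SELECT COUNT(*) FROM DimMonth WHERE month NOT BETWEEN 1 AND 12",
}

results = {}
for name, sql in manual_checks.items():
    bad_rows = conn.execute(sql).fetchone()[0]
    results[name] = bad_rows == 0
    status = "PASS" if bad_rows == 0 else "FAIL"
    print(f"{name}: {status} ({bad_rows} offending rows)")

conn.close()
```

Adding a new check here means adding one dictionary entry, which is the same idea the framework's test definitions build on.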

Step 1: Download the framework.

Run the commands below to download the framework.

In [4]:
!curl -O https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DB0260EN-SkillsNetwork/labs/Verifying%20Data%20Quality%20for%20a%20Data%20Warehouse/dataqualitychecks.py
!curl -O https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DB0260EN-SkillsNetwork/labs/Verifying%20Data%20Quality%20for%20a%20Data%20Warehouse/dbconnect.py
!curl -O https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DB0260EN-SkillsNetwork/labs/Verifying%20Data%20Quality%20for%20a%20Data%20Warehouse/mytests.py
!curl -O https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DB0260EN-SkillsNetwork/labs/Verifying%20Data%20Quality%20for%20a%20Data%20Warehouse/generate-data-quality-report.py

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  2054  100  2054    0     0   2735      0 --:--:-- --:--:-- --:--:--  2764
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   433  100   433    0     0    778      0 --:--:-- --:--:-- --:--:--   784
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   748  100   748    0     0   1194      0 --:--:-- --:--:-- --:--:--  1200
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1124  100  1124    0     0   1853      0 --:--:-- --:--:-- --:--:--  1864


Step 2: Install the Python driver for PostgreSQL.

Run the command below to install `psycopg2`, the Python driver for PostgreSQL.

In [5]:
!python3 -m pip install psycopg2

Collecting psycopg2
  Downloading psycopg2-2.9.9.tar.gz (384 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 384.9/384.9 kB 1.5 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: psycopg2
  Building wheel for psycopg2 (setup.py) ... done
  Created wheel for psycopg2: filename=psycopg2-2.9.9-cp39-cp39-macosx_11_0_arm64.whl size=133030 sha256=cbcf08e50ee483ad90121da48be18c8777276a7dc11d8a8d90dc6a357a407c1a
  Stored in directory: /Users/sanhuezalejandro/Library/Caches/pip/wheels/3a/06/25/adb124afd8c8346e45c455f6586f7289cde2b4e339dfbcd9e9
Successfully built psycopg2
DEPRECATION: pytorch-lightning 1.6.5 has a non-standard dependency specifier torch>=1.8.*. pip 24.0 will enforce this behaviour change. A possible replacement is to upgrade to a newer version of pytorch-lightning or contact the author to suggest that they release a version with a conformin

Step 3: Test database connectivity.

Now we need to check that:

- the PostgreSQL Python driver is installed properly,
- the PostgreSQL server is up and running,
- our micro framework can connect to the database.

The command below checks all of the above.

In [8]:
!python3 dbconnect.py

Successfully connected to warehouse
Connection closed


If all goes well, you should see the message `Successfully connected to warehouse`.

The command also disconnects from the server with a message `Connection closed`.
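The contents of `dbconnect.py` are not shown in this lab, so the following is only a sketch of how such a connectivity check could be written. The helper name `check_connection` is hypothetical, and SQLite stands in for PostgreSQL so that the example runs without a server. The commented call at the end shows how the same helper might be used with `psycopg2.connect`, with placeholder credentials.

```python
import sqlite3


def check_connection(connect, **params):
    """Try to open and then close a connection; return True on success."""
    try:
        conn = connect(**params)
    except Exception as error:  # driver problem, server down, bad credentials, ...
        print(f"Could not connect: {error}")
        return False
    print("Successfully connected to warehouse")
    conn.close()
    print("Connection closed")
    return True


def unreachable(**params):
    """Simulates a server that is not running."""
    raise ConnectionError("server is not running")


# Demo: SQLite stands in for PostgreSQL.
ok = check_connection(sqlite3.connect, database=":memory:")
failed = check_connection(unreachable)

# With the real driver the call might look like this (placeholder password):
# import psycopg2
# check_connection(psycopg2.connect, host="localhost", port=5432,
#                  user="postgres", password="<your password>", dbname="billingDW")
```

Passing the connect function in as an argument keeps the check independent of the driver, so a missing driver, a stopped server, and wrong credentials all surface as the same failed result.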

# **Exercise 3 - Create a sample data quality report**

Run the command below to install pandas and tabulate.

In [10]:
#!python3 -m pip install pandas tabulate

Run the command below to generate a sample data quality report.

In [12]:
!python3 generate-data-quality-report.py

Connected to data warehouse
**************************************************
Tue May  7 00:02:02 2024
Starting test Check for nulls
Finished test Check for nulls
Test Passed True
Test Parameters
column = monthid
table = DimMonth

Duration :  0.05770707130432129
Tue May  7 00:02:02 2024
**************************************************
**************************************************
Tue May  7 00:02:02 2024
Starting test Check for min and max
Finished test Check for min and max
Test Passed True
Test Parameters
column = month
table = DimMonth
minimum = 1
maximum = 12

Duration :  0.0024340152740478516
Tue May  7 00:02:02 2024
**************************************************
**************************************************
Tue May  7 00:02:02 2024
Starting test Check for valid values
{'Company', 'Individual'}
Finished test Check for valid values
Test Passed True
Test Parameters
column = category
table = DimCustomer
valid_values = {'Company', 'Individual'}

Duration :  0.02281379

# **Exercise 4 - Explore the data quality tests**

Open the file `mytests.py` in the editor by using the steps below.

<img src= https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DB0260EN-SkillsNetwork/labs/Verifying%20Data%20Quality%20for%20a%20Data%20Warehouse/images/mytests.py.png>

The file `mytests.py` contains all the data quality tests.

It provides a quick and easy way to author and run new data quality tests.

The testing framework provides the following tests:

- check_for_nulls - this test will check for nulls in a column
- check_for_min_max - this test will check if the values in a column are within a range of min and max values
- check_for_valid_values - this test will check for any invalid values in a column
- check_for_duplicates - this test will check for duplicates in a column

Each test is defined with a minimum of 4 parameters:

- testname - The human readable name of the test for reporting purposes
- test - The actual test name that the testing micro framework provides
- table - The table name on which the test is to be performed
- column - The column name on which the test is to be performed
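To make the role of these four parameters concrete, here is a minimal sketch of how a runner could consume such a definition. It is an assumption about the general pattern, not the framework's actual code: `check_non_empty` and `run_test` are hypothetical names, and a plain dictionary of rows stands in for the database.

```python
import time


def check_non_empty(rows, column):
    """Hypothetical check: every row has a non-empty value for `column`."""
    return all(row.get(column) not in (None, "") for row in rows)


def run_test(data, test):
    """Run one test definition and return a small result record."""
    # Everything except the bookkeeping keys is passed to the check as keyword arguments.
    params = {k: v for k, v in test.items() if k not in ("testname", "test", "table")}
    start = time.time()
    passed = test["test"](data[test["table"]], **params)
    return {
        "testname": test["testname"],
        "table": test["table"],
        "passed": passed,
        "duration": time.time() - start,
        **params,
    }


# Made-up rows standing in for the DimMonth table.
data = {"DimMonth": [{"monthid": 20191}, {"monthid": 20192}]}

test1 = {
    "testname": "Check for nulls",
    "test": check_non_empty,
    "column": "monthid",
    "table": "DimMonth",
}

result = run_test(data, test1)
print(result["testname"], "passed:", result["passed"])
```

Because `test` holds a function object rather than a string, extra keys such as `minimum` or `valid_values` can be forwarded to the check without the runner knowing about them.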

# **Exercise 5 - Check for nulls**

Let us now see what a `check_for_nulls` test looks like.

Here is a sample `check_for_nulls` test:

In [14]:
# test1={
#     "testname":"Check for nulls",
#     "test":check_for_nulls,
#     "column": "monthid",
#     "table": "DimMonth"
# }

All tests must be named `test` followed by a unique number that identifies the test.

- Give an easy-to-understand description for `testname`.
- Set `test` to `check_for_nulls`.
- Set `column` to the name of the column you wish to check for nulls.
- Set `table` to the name of the table that contains this column.
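One plausible reason for the `test` plus number naming rule is that it lets a report generator find every test automatically. The sketch below shows that idea under stated assumptions: `collect_tests` is a hypothetical helper, and a throwaway module object stands in for `mytests.py`. Whether the lab's framework discovers tests this way is not confirmed by the lab text.

```python
import re
import types


def collect_tests(module):
    """Return (name, definition) pairs for module attributes named test<number>."""
    found = [
        (name, value)
        for name, value in vars(module).items()
        if re.fullmatch(r"test\d+", name) and isinstance(value, dict)
    ]
    # Sort numerically so that test10 comes after test2.
    return sorted(found, key=lambda item: int(item[0][4:]))


# Stand-in for mytests.py, built in memory for the example.
fake_tests = types.ModuleType("fake_tests")
fake_tests.test10 = {"testname": "Check for duplicates"}
fake_tests.test2 = {"testname": "Check for min and max"}
fake_tests.helper = "not a test"

names = [name for name, _ in collect_tests(fake_tests)]
print(names)
```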

Let us now create a new `check_for_nulls` test and run it.

The test below checks if there are any null values in the column `year` in the table `DimMonth`.

The test fails if nulls exist.

Copy and paste the code below at the end of the `mytests.py` file.

In [None]:
# test5={
#     "testname":"Check for nulls",
#     "test":check_for_nulls,
#     "column": "year",
#     "table": "DimMonth"
# }

Save the file using Menu -> File -> Save

Run the command below to generate the new data quality report.

In [15]:
#python3 generate-data-quality-report.py
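For intuition about what a null check does, the sketch below counts `NULL` values with a single query. It is an illustrative re-implementation, not the code in `dataqualitychecks.py`: the function name `sketch_check_for_nulls` is hypothetical, SQLite stands in for PostgreSQL, and the rows are made up. With `psycopg2` you would run the query through a cursor rather than directly on the connection.

```python
import sqlite3


def sketch_check_for_nulls(conn, column, table):
    """Return True when `column` in `table` contains no NULLs."""
    # Table and column names cannot be bound as query parameters, so they are
    # interpolated into the SQL; only pass trusted names here.
    sql = f"SELECT COUNT(*) FROM {table} WHERE {column} IS NULL"
    null_count = conn.execute(sql).fetchone()[0]
    return null_count == 0


# Made-up data: one row is missing its year.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE DimMonth (monthid INTEGER, year INTEGER)")
conn.executemany("INSERT INTO DimMonth VALUES (?, ?)", [(20191, 2019), (20192, None)])

monthid_ok = sketch_check_for_nulls(conn, column="monthid", table="DimMonth")
year_ok = sketch_check_for_nulls(conn, column="year", table="DimMonth")
print("monthid passes:", monthid_ok)
print("year passes:", year_ok)
conn.close()
```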

# **Exercise 6 - Check for min max range**

Let us now see what a `check_for_min_max` test looks like.

Here is a sample `check_for_min_max` test:

In [None]:
# test2={
#     "testname":"Check for min and max",
#     "test":check_for_min_max,
#     "column": "month",
#     "table": "DimMonth",
#     "minimum":1,
#     "maximum":12
# }

Save the file using Menu -> File -> Save

Run the command below to generate the new data quality report.

In [16]:
#python3 generate-data-quality-report.py
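A range check can be expressed as "count the rows that fall outside the range". The sketch below does that under the same assumptions as before: the function name `sketch_check_for_min_max` is hypothetical, SQLite stands in for PostgreSQL, and the data is made up. SQLite uses `?` as the parameter placeholder, whereas `psycopg2` uses `%s`.

```python
import sqlite3


def sketch_check_for_min_max(conn, column, table, minimum, maximum):
    """Return True when every value of `column` lies within [minimum, maximum]."""
    # Identifiers are interpolated (trusted input only); the bounds are bound parameters.
    sql = f"SELECT COUNT(*) FROM {table} WHERE {column} < ? OR {column} > ?"
    out_of_range = conn.execute(sql, (minimum, maximum)).fetchone()[0]
    return out_of_range == 0


# Made-up data: months 1 through 12.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE DimMonth (month INTEGER)")
conn.executemany("INSERT INTO DimMonth VALUES (?)", [(m,) for m in range(1, 13)])

full_range_ok = sketch_check_for_min_max(conn, "month", "DimMonth", minimum=1, maximum=12)
half_range_ok = sketch_check_for_min_max(conn, "month", "DimMonth", minimum=1, maximum=6)
print("1 to 12 passes:", full_range_ok)
print("1 to 6 passes:", half_range_ok)
conn.close()
```

Note that rows where the column is `NULL` are not counted as out of range by this query, so a range check is usually paired with a null check.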

# **Exercise 7 - Check for any invalid entries**

Let us now see what a `check_for_valid_values` test looks like.

Here is a sample `check_for_valid_values` test:

In [17]:
# test3={
#     "testname":"Check for valid values",
#     "test":check_for_valid_values,
#     "column": "category",
#     "table": "DimCustomer",
#     "valid_values":{'Individual','Company'}
# }

In addition to the usual fields, this test takes one more field:

- Use the field `valid_values` to list the valid values for this column.

Let us now create a new `check_for_valid_values` test and run it.

The test below checks for valid values in the column `quartername` in the table `DimMonth`.

The valid values are Q1, Q2, Q3, and Q4.

The test fails if the column contains any value outside this set.

Copy and paste the code below at the end of the `mytests.py` file.

In [None]:
# test7={
#     "testname":"Check for valid values",
#     "test":check_for_valid_values,
#     "column": "quartername",
#     "table": "DimMonth",
#     "valid_values":{'Q1','Q2','Q3','Q4'}
# }

Save the file using Menu -> File -> Save

Run the command below to generate the new data quality report.

In [18]:
#python3 generate-data-quality-report.py
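One way to implement a valid-values check is to fetch the distinct values of the column and compare them with the allowed set. The sketch below takes that approach; it is an assumption about the technique, not the framework's source. The name `sketch_check_for_valid_values` is hypothetical, SQLite stands in for PostgreSQL, and the rows are made up.

```python
import sqlite3


def sketch_check_for_valid_values(conn, column, table, valid_values):
    """Return True when every distinct value of `column` is in `valid_values`."""
    # Identifiers are interpolated into the SQL; only pass trusted names here.
    sql = f"SELECT DISTINCT {column} FROM {table}"
    actual_values = {row[0] for row in conn.execute(sql)}
    # Set comparison: passes only if actual_values is a subset of valid_values.
    return actual_values <= set(valid_values)


# Made-up data: one row per quarter.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE DimMonth (quartername TEXT)")
conn.executemany("INSERT INTO DimMonth VALUES (?)", [("Q1",), ("Q2",), ("Q3",), ("Q4",)])

all_quarters_ok = sketch_check_for_valid_values(
    conn, "quartername", "DimMonth", valid_values={"Q1", "Q2", "Q3", "Q4"}
)
two_quarters_ok = sketch_check_for_valid_values(
    conn, "quartername", "DimMonth", valid_values={"Q1", "Q2"}
)
print("Q1 to Q4 allowed:", all_quarters_ok)
print("only Q1 and Q2 allowed:", two_quarters_ok)
conn.close()
```

With this approach a `NULL` in the column shows up as `None` in the distinct values and fails the check unless `None` is listed as valid.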

# **Exercise 8 - Check for duplicate entries**

Let us now see what a `check_for_duplicates` test looks like.

Here is a sample `check_for_duplicates` test:

In [None]:
# test4={
#     "testname":"Check for duplicates",
#     "test":check_for_duplicates,
#     "column": "monthid",
#     "table": "DimMonth"
# }

Let us now create a new `check_for_duplicates` test and run it.

The test below checks for any duplicate values in the column `customerid` in the table `DimCustomer`.

The test fails if duplicates exist.

Copy and paste the code below at the end of the `mytests.py` file.

In [None]:
# test8={
#     "testname":"Check for duplicates",
#     "test":check_for_duplicates,
#     "column": "customerid",
#     "table": "DimCustomer"
# }

Save the file using Menu -> File -> Save

Run the command below to generate the new data quality report.

In [19]:
#python3 generate-data-quality-report.py
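Duplicates are commonly detected by grouping on the column and keeping the groups that occur more than once. The sketch below illustrates that query pattern; the function name `sketch_check_for_duplicates` is hypothetical, SQLite stands in for PostgreSQL, and the customer rows are made up. It is not taken from the framework's source.

```python
import sqlite3


def sketch_check_for_duplicates(conn, column, table):
    """Return True when no value of `column` appears more than once in `table`."""
    # Identifiers are interpolated into the SQL; only pass trusted names here.
    sql = (
        f"SELECT {column}, COUNT(*) FROM {table} "
        f"GROUP BY {column} HAVING COUNT(*) > 1"
    )
    duplicates = conn.execute(sql).fetchall()
    return len(duplicates) == 0


# Made-up data: unique customer ids, repeated categories.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE DimCustomer (customerid INTEGER, category TEXT)")
conn.executemany(
    "INSERT INTO DimCustomer VALUES (?, ?)",
    [(1, "Company"), (2, "Individual"), (3, "Company")],
)

customerid_ok = sketch_check_for_duplicates(conn, "customerid", "DimCustomer")
category_ok = sketch_check_for_duplicates(conn, "category", "DimCustomer")
print("customerid passes:", customerid_ok)
print("category passes:", category_ok)
conn.close()
```

A duplicate check only makes sense on columns that are expected to be unique, such as keys; a descriptive column like `category` fails it by design.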

# **Practice exercises**


> Create a `check_for_nulls` test on the column `billedamount` in the table `FactBilling`.

- Hint: Use the `check_for_nulls` test with `column=billedamount` and `table=FactBilling`.
- Solution: Copy and paste the code below at the end of the `mytests.py` file.

In [None]:
# test9={
#     "testname":"Check for nulls",
#     "test":check_for_nulls,
#     "column": "billedamount",
#     "table": "FactBilling"
# }

> Create a `check_for_duplicates` test on the column `billid` in the table `FactBilling`.

- Hint: Use the `check_for_duplicates` test with `column=billid` and `table=FactBilling`.
- Solution: Copy and paste the code below at the end of the `mytests.py` file.

In [20]:
# test10={
#     "testname":"Check for duplicates",
#     "test":check_for_duplicates,
#     "column": "billid",
#     "table": "FactBilling"
# }

> Create a `check_for_valid_values` test on the column `quarter` in the table `DimMonth`. The valid values are 1, 2, 3, and 4.

- Hint: Use the `check_for_valid_values` test with `column=quarter`, `table=DimMonth`, and `valid_values={1, 2, 3, 4}`.
- Solution: Copy and paste the code below at the end of the `mytests.py` file.

In [None]:
# test11={
#     "testname":"Check for valid values",
#     "test":check_for_valid_values,
#     "column": "quarter",
#     "table": "DimMonth",
#     "valid_values":{1,2,3,4}
# }

Congratulations!! You have successfully finished this lab.