# Data Engineer Assessment
## UC02: TTD_DE_UC02_DQC: Add Data Quality Checks to the Data Engineering Pipeline. Address Any Data Quality Issues

## Goal
As a data engineer, I want to set up quality checks to ensure that the data provided to the business is of good quality and fit for use.

## Summary
It is important to ensure that the data provided meets the data quality standards set by the organization and is fit for use.


* __Problem Statement__ - Business has experienced data issues such as duplicate records and data types that differ from those reported by the source.

Business has asked the Data Engineering team to implement controls that ensure that the data is fit for use and detect any issues with the pipeline if this is not the case.

* __Description__ - Lately, business has reported that the data provided by the sources has several issues. This has caused incorrect reports to be generated and a poor customer experience. Business has lost trust in the data due to the poor data quality.

* __Data Issues Reported__ : Business has documented and highlighted the following issues with the data over the last 3 months.

  a. Data source systems changing the schema of the data without informing business.
  b. Numerous duplicates in the sales_rep_master data, resulting in higher processing time and duplicates in the Semantic asset.
  c. Vehicle VIN columns with missing values or values of the wrong length. A VIN must be 17 characters long.
  d. Model brand values outside the expected brand types (BMW, MINI, Rolls Royce), or strings not matching the expected format.
  e. Data is not updated during the weekly build due to the source system not having the data available.

* __Expected Outcome__ : Your manager has asked you to perform a Proof of Concept (PoC) of the Great Expectations Quality framework and demonstrate to the team the various features of the framework.

You need to evaluate the features and functionality of the [`Great Expectations Quality framework`](https://greatexpectations.io/).

1. The framework must support integration with the AWS platform and be deployable to AWS. For the PoC, run the demonstration locally on your laptop.

2. The framework must perform schema validation (including data type validation), VIN length validation (17 characters), and not-null validation on the VIN columns. Other checks that are deemed suitable can also be added.
3. The framework must also detect duplicates in the data.
4. Provide a compliance report for the sample datasets.
5. Cost of the check is a concern for the IT department. Ensure that this is taken into account when implementing the checks.
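
Requirement 2 asks for schema validation, which is the direct counter to source systems changing the schema without notice. As a minimal sketch of what such a check compares, the plain-Python snippet below diffs an expected column list against the header of an incoming CSV. It is not Great Expectations code; in the framework the same rule would be expressed with a table-level column expectation such as `expect_table_columns_to_match_ordered_list`. The column names are hypothetical, since the real headers live in the `data` folder, and the sketch covers column names only, not data types.

```python
import csv
import io

# Hypothetical expected header for vehicles.csv; replace with the agreed schema.
EXPECTED_COLUMNS = ["vin", "brand", "model", "order_number"]


def diff_schema(expected, actual):
    """Return columns that are missing, unexpected, or out of order."""
    missing = [c for c in expected if c not in actual]
    unexpected = [c for c in actual if c not in expected]
    # Order only matters once both sides contain the same columns.
    reordered = not missing and not unexpected and list(expected) != list(actual)
    return {"missing": missing, "unexpected": unexpected, "reordered": reordered}


def read_header(text_stream):
    """Read only the first CSV row, so the check stays cheap on large files."""
    return next(csv.reader(text_stream), [])


# Simulated source file in which `brand` was renamed to `make` without notice.
incoming = io.StringIO("vin,make,model,order_number\nV1,BMW,X5,1001\n")
drift = diff_schema(EXPECTED_COLUMNS, read_header(incoming))
print(drift)
```

Because only the header row is read, a check like this costs almost nothing to run before the heavier row-level checks, which is one way to respect the cost concern in requirement 5.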

## Code Complexity
- Low / Medium

## `Diagram - Also refer to the PDF in the folder`

![Data Quality Checks](./TTD_DE_UC02_DQC.png "Great Expectations Data Quality Checks")



## Datasets:

`File Location`: Refer to the attached `data` folder for information

* Vehicles (vehicles.csv) are built at the plants (plants.csv) to fulfil the orders placed (orders.csv) - order_number
* Customers (customers.csv) provide reviews (welcome_call.csv) 60 to 80 days after their vehicles (vin) are delivered.
* Sales representatives (sales_rep.csv) are linked to dealerships (dealers.csv) and carry the dealership names.




## Expected Outcomes:

1. The Great Expectations framework must be used to perform the data quality checks.
2. At least 5 different types of checks must be implemented on each dataset. An explanation of why each check is appropriate for the dataset must be provided.

### Libraries or Options used
* Jupyter Notebook - Install and run locally on your laptop or device.
* Great Expectations Framework (Note: Install if required)
* PySpark, pandas, and Matplotlib or similar plotting libraries




## `Acceptance Criteria`

1. Only the Great Expectations framework can be used for this exercise.
2. Implement different types of checks on the three datasets (vehicles, customers, sales_rep) and provide your findings.
3. Explain the checks you have implemented and how they would help the business detect data quality issues. Refer to the current challenges that business has highlighted.
4. Explain how the framework would be useful when data is stored in an RDBMS such as MySQL. Illustrate the workflow using **draw.io** and export the output in PDF format. Expected Output: draw.io diagram
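
For criterion 4, one way to picture the RDBMS case is that the checks run as SQL inside the database instead of pulling every row into pandas, which also speaks to the cost concern. The sketch below uses Python's built-in `sqlite3` as a stand-in for MySQL, and hand-written queries as a stand-in for what a SQL-backed validation would compute. It does not use Great Expectations itself, and the columns `rep_id` and `dealer_name` are hypothetical.

```python
import sqlite3

# In-memory SQLite stands in for MySQL; the table layout is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_rep_master (rep_id TEXT, dealer_name TEXT)")
conn.executemany(
    "INSERT INTO sales_rep_master VALUES (?, ?)",
    [("R1", "Dealer A"), ("R2", "Dealer B"), ("R1", "Dealer A"), ("R3", None)],
)

# Duplicate check: any fully repeated row is reported with its count.
duplicates = conn.execute(
    """
    SELECT rep_id, dealer_name, COUNT(*) AS copies
    FROM sales_rep_master
    GROUP BY rep_id, dealer_name
    HAVING COUNT(*) > 1
    """
).fetchall()

# Not-null check: count rows where a required column is missing.
(null_dealers,) = conn.execute(
    "SELECT COUNT(*) FROM sales_rep_master WHERE dealer_name IS NULL"
).fetchone()

print("duplicate rows:", duplicates)
print("rows with missing dealer_name:", null_dealers)
conn.close()
```

Only the aggregated counts leave the database here, so the work scales with the query and not with the volume of data transferred to the notebook.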


# Implementation

Provide all the implementation steps in the sections that follow. Ensure that you provide detailed explanations of the approach.


### Step 1: Import the libraries that you need for the Great Expectations framework.

In [None]:
# Import any relevant libraries
import os
import re

import pandas as pd
# Import the Great Expectations library below.
# Note: You can pip install any libraries that you need into your environment.



#### List of expected dataframes to be loaded


  * Vehicles (vehicles.csv) - Provides information on the vehicles.
  * Customer (customers.csv) - Provides information about the customers.
  * Sales (sales_rep.csv) - Provides information about the sales_reps at the dealerships.


### Step 2: Load the data from the data folder into data frames.

In [None]:
# Write your code below to load the relevant data into a pandas dataframe and make it available to Great Expectations.
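
One possible shape for the loading step is sketched below. It assumes the notebook runs next to the `data` folder and that the three files named above are plain comma-separated CSVs; the helper name `load_datasets` is made up for illustration.

```python
from pathlib import Path

import pandas as pd

# Assumption: the notebook runs next to the `data` folder described above.
DATA_DIR = Path("data")
DATASETS = {
    "vehicles": "vehicles.csv",
    "customers": "customers.csv",
    "sales_rep": "sales_rep.csv",
}


def load_datasets(data_dir, datasets=DATASETS):
    """Load each CSV as strings so identifiers such as VINs keep leading zeros."""
    frames = {}
    for name, filename in datasets.items():
        path = Path(data_dir) / filename
        # dtype=str defers type inference; type checks can then be explicit.
        frames[name] = pd.read_csv(path, dtype=str)
    return frames


if DATA_DIR.is_dir():
    frames = load_datasets(DATA_DIR)
    for name, frame in frames.items():
        print(name, frame.shape)
```

Reading everything as strings is a debatable choice: it protects identifiers from being mangled, but it also hides the types pandas would infer. Drop `dtype=str` if you would rather validate the inferred types directly.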

#### Question: Did you face any challenges when setting up the environment or working with the datasets? How did you resolve them?

#### *Answer*: Replace with your response



### Step 3: Explore your data and provide some column and statistical information
Provide some statistical information about the data you just loaded.


In [None]:
# Write code to provide statistical information about each dataframe that you just loaded.
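
As a starting point for the exploration, the sketch below builds a compact per-column profile with pandas. The helper name `profile` and the sample rows are invented for illustration; in the notebook you would pass each loaded dataframe instead.

```python
import pandas as pd


def profile(frame):
    """Return one row per column: dtype, null count, and distinct count."""
    return pd.DataFrame(
        {
            "dtype": frame.dtypes.astype(str),
            "nulls": frame.isna().sum(),
            "distinct": frame.nunique(dropna=True),
        }
    )


# Tiny made-up frame; the column names and values are illustrative only.
sample = pd.DataFrame(
    {
        "vin": ["WBA" + "0" * 13 + "1", "WBA" + "0" * 13 + "2", None],
        "brand": ["BMW", "MINI", "MINI"],
    }
)
print(profile(sample))
print("fully duplicated rows:", int(sample.duplicated().sum()))
```

Null and distinct counts per column are a quick way to decide which columns deserve not-null, uniqueness, or set-membership expectations in the next step.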



### Step 4: Add your expectation logic and, for each expectation, explain why it would be useful to the business.

In [None]:
# Explore your data and add the relevant Expectations. Repeat these code cells as required for each expectation.

### Step 5: Review your expectations and document your findings.

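* Note: You can check your expectations by using `df.get_expectation_suite()`, where `df` is a Great Expectations dataset wrapping a pandas dataframe (for example, one created with `ge.from_pandas` in the legacy API). A plain pandas dataframe does not have this method.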
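
For the compliance report, validation results can be flattened into one row per expectation. The sketch below works on a plain dictionary and assumes the legacy result layout, in which each entry carries `success`, `expectation_config` (with `expectation_type` and `kwargs`), and `result`. The sample dictionary is a hand-written stand-in with made-up numbers, not real output, and the helper name `compliance_rows` is hypothetical.

```python
def compliance_rows(validation):
    """Flatten a validation result dict into one row per expectation."""
    rows = []
    for item in validation["results"]:
        config = item["expectation_config"]
        rows.append(
            {
                "expectation": config["expectation_type"],
                # Table-level expectations have no `column` keyword.
                "column": config["kwargs"].get("column", "(table)"),
                "passed": bool(item["success"]),
                "unexpected": item.get("result", {}).get("unexpected_count", 0),
            }
        )
    return rows


# Hand-written stand-in for a serialized validation result; numbers are made up.
validation = {
    "results": [
        {
            "success": False,
            "expectation_config": {
                "expectation_type": "expect_column_value_lengths_to_equal",
                "kwargs": {"column": "vin", "value": 17},
            },
            "result": {"unexpected_count": 3},
        },
        {
            "success": True,
            "expectation_config": {
                "expectation_type": "expect_table_columns_to_match_ordered_list",
                "kwargs": {"column_list": ["vin", "brand"]},
            },
            "result": {},
        },
    ]
}

rows = compliance_rows(validation)
passed = sum(row["passed"] for row in rows)
print(f"{passed}/{len(rows)} expectations passed")
for row in rows:
    print(row)
```

The resulting list of rows can be handed to `pd.DataFrame(rows)` for display or plotting in the notebook.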




In [None]:
# Review the expectation suite and the validation results for each dataframe.

#### Question: Explain the different expectations that you implemented above and provide your findings.

#### *Answer*: Replace with your response

### Report all your Findings:

<REPLACE TEXT BELOW>

Report your findings in bullet points.

Example (for illustration purposes only; replace with your own findings and support them with evidence):
1. The data length check expectation would be useful because VINs are 17 characters long. The check was implemented in the following manner ...
