# Validation 
***

## Learning Objectives

# Links

# Additional Material

# Sources 


## Qualities of good validation data
### Accuracy 
For data to be of good quality, it needs to be accurate: correct and free from errors. In practice, each value should fall within a defined set of boundary conditions. For example, if you are collecting data on people's heights, you would expect every value to fall within a plausible range. A height of 3 meters is outside that range, so you know the value is invalid and the data contains an error.

Here's an example: 

In [19]:
import csv

# Define the data
data = [
    ["StudentID", "exam_score"],
    [1, 85],
    [2, 92],
    [3, 105],
    [4, 78],
    [5, 110],
    [5, "err"],  # deliberately invalid: repeated StudentID and non-numeric score
]

# Specify the file name
file_name = "./files/student_scores.csv"

# Write the data to the CSV file
with open(file_name, mode='w', newline='') as file:
    writer = csv.writer(file)
    writer.writerows(data)

print(f"CSV file '{file_name}' has been created.")


CSV file './files/student_scores.csv' has been created.


In [22]:
import pandas as pd

# Read the CSV file into a DataFrame
df = pd.read_csv(file_name)

# Display DataFrame
df


   StudentID exam_score
0          1         85
1          2         92
2          3        105
3          4         78
4          5        110
5          5        err


Check whether all the values in `exam_score` are numeric. Note the row at index 5.

In [24]:
non_numeric_rows = df[~df['exam_score'].str.isnumeric()]

# Display the rows with non-numeric values in the 'exam_score' column
if not non_numeric_rows.empty:
    print("Rows with non-numeric values in 'exam_score':")
    print(non_numeric_rows)
else:
    print("All values in 'exam_score' are numeric.")

Rows with non-numeric values in 'exam_score':
   StudentID exam_score
5          5        err
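
A numeric value can still be inaccurate. The boundary-condition idea from the start of this section can be sketched as a range check. This is a minimal sketch that assumes valid exam scores lie between 0 and 100; that range, and the names `MIN_SCORE`, `MAX_SCORE`, and `invalid_rows`, are assumptions for illustration and do not come from the dataset itself.

```python
import pandas as pd

# Same rows as the student_scores.csv example above
df = pd.DataFrame({
    "StudentID": [1, 2, 3, 4, 5, 5],
    "exam_score": ["85", "92", "105", "78", "110", "err"],
})

# Assumed boundary conditions: a valid score lies between 0 and 100
MIN_SCORE, MAX_SCORE = 0, 100

# Convert to numbers; anything that cannot be parsed becomes NaN
scores = pd.to_numeric(df["exam_score"], errors="coerce")

# A row is invalid if the score is unparseable or outside the assumed range
invalid_mask = scores.isna() | (scores < MIN_SCORE) | (scores > MAX_SCORE)
invalid_rows = df[invalid_mask]

print(invalid_rows)
```

Under these assumptions the check flags the scores `105` and `110` as well as `err`, which the `isnumeric()` test alone would not catch.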


### Uniqueness
Data should be unique: each record should appear only once, so identifying columns must not contain duplicate values. A duplicate usually signals an error, such as the same record being entered twice.
Did you spot the duplicate value in `StudentID`?


In [25]:
# Find duplicate values in the 'StudentID' column
duplicates = df[df.duplicated(subset=['StudentID'], keep=False)]

# Display the rows with duplicate 'StudentID' values
if not duplicates.empty:
    print("Duplicate StudentID values:")
    print(duplicates)
else:
    print("No duplicate StudentID values found.")

Duplicate StudentID values:
   StudentID exam_score
4          5        110
5          5        err
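
Once duplicates are found, they need to be resolved. The following is a minimal sketch of one possible policy, keeping the first row for each `StudentID`. Which row to keep is an assumption here; in practice it depends on which record you trust.

```python
import pandas as pd

# Same rows as the student_scores.csv example above
df = pd.DataFrame({
    "StudentID": [1, 2, 3, 4, 5, 5],
    "exam_score": ["85", "92", "105", "78", "110", "err"],
})

# Quick yes/no check: is every StudentID unique?
ids_are_unique = df["StudentID"].is_unique
print(f"StudentID unique: {ids_are_unique}")

# Assumed policy: keep the first row seen for each StudentID
deduplicated = df.drop_duplicates(subset=["StudentID"], keep="first")
print(deduplicated)
```

`is_unique` is convenient as a quick guard, while `duplicated(..., keep=False)` is more useful when you want to inspect every conflicting row.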


### Completeness
Data should be complete: there should be no missing values. If values are missing, the data is incomplete and should be treated as containing an error.

Here is an example: 


In [26]:
import pandas as pd
import numpy as np

# Create a DataFrame with missing values
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', np.nan],
    'Age': [25, 30, np.nan, 22, 28],
    'Salary': [50000, 60000, 75000, np.nan, 80000]
}

df = pd.DataFrame(data)

In [27]:
df

      Name   Age   Salary
0    Alice  25.0  50000.0
1      Bob  30.0  60000.0
2  Charlie   NaN  75000.0
3    David  22.0      NaN
4      NaN  28.0  80000.0


`Name`, `Age`, and `Salary` each have one missing value.

In [30]:
# Check for missing values in the entire DataFrame
missing_values = df.isnull()

# Check for completeness
if missing_values.any().any():
    print("DataFrame contains missing values:")
    print(missing_values)
else:
    print("DataFrame is complete; it contains no missing values.")

DataFrame contains missing values:
    Name    Age  Salary
0  False  False   False
1  False  False   False
2  False   True   False
3  False  False    True
4   True  False   False


We can count the missing values in each column using the `isnull()` method and the `sum()` method.

In [29]:
missing_values.sum()

Name      1
Age       1
Salary    1
dtype: int64
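
Counting missing values is usually followed by a decision about what to do with them. The sketch below shows two common options: dropping incomplete rows, or filling numeric gaps. Using the column median as the fill value is an assumption made for illustration, not a recommendation for every dataset.

```python
import numpy as np
import pandas as pd

# Same DataFrame as in the completeness example above
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David", np.nan],
    "Age": [25, 30, np.nan, 22, 28],
    "Salary": [50000, 60000, 75000, np.nan, 80000],
})

# Share of non-missing values per column (1.0 means fully complete)
completeness = df.notna().mean()
print(completeness)

# Option 1: keep only the rows that have no missing values
complete_rows = df.dropna()

# Option 2 (assumed policy): fill numeric gaps with the column median
filled = df.fillna({"Age": df["Age"].median(), "Salary": df["Salary"].median()})
print(filled)
```

Dropping rows is the safer choice when the missing field is essential, while filling keeps more rows at the cost of introducing estimated values.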

### Consistency
Data should be consistent. This means that the data should be in a standard format. For example, if you are collecting data on the height of people, you would expect the data to be in the same unit of measurement. If some of the data is in meters and some of the data is in feet, then the data is not consistent and there is an error in the data.

Here is an example that checks another kind of consistency, namely whether every salary falls within an agreed range:

In [33]:

# Create a DataFrame with employee information
data = {
    'EmployeeID': [101, 102, 103, 104, 105],
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Salary': [50000, 60000, 75000, 45000, 80000]
}

df = pd.DataFrame(data)

# Define the salary range (e.g., $40,000 to $80,000)
min_salary = 40000
max_salary = 80000

# Check for data consistency (salary within the defined range)
inconsistent_salaries = df[(df['Salary'] < min_salary) | (df['Salary'] > max_salary)]

Note that `|` means `or` in pandas: it combines the two conditions element by element. The line reads: where `Salary` is less than `min_salary` or `Salary` is greater than `max_salary`.

In [34]:

# Display rows with inconsistent salary values
if not inconsistent_salaries.empty:
    print("Inconsistent salary values:")
    print(inconsistent_salaries)
else:
    print("Data is consistent; all salary values are within the valid range.")

Data is consistent; all salary values are within the valid range.


The code above therefore checks whether all the values are between `40000` and `80000`, inclusive.
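
The unit example from the start of this section (heights recorded in both meters and feet) can be sketched as follows. The data, the column names, and the `TO_METERS` lookup table are all hypothetical; the only fixed fact used is that 1 ft equals 0.3048 m.

```python
import pandas as pd

# Hypothetical height measurements recorded in mixed units
heights = pd.DataFrame({
    "Person": ["Alice", "Bob", "Charlie"],
    "Height": [1.70, 5.90, 1.82],
    "Unit": ["m", "ft", "m"],
})

# Conversion factors to the standard unit (meters); 1 ft = 0.3048 m
TO_METERS = {"m": 1.0, "ft": 0.3048}

# Rows whose unit is not in the table cannot be converted and need review
unknown_units = heights[~heights["Unit"].isin(list(TO_METERS))]

# Normalize every height to meters so the column uses a single unit
heights["Height_m"] = heights["Height"] * heights["Unit"].map(TO_METERS)
print(heights)
```

Keeping the original `Height` and `Unit` columns alongside the normalized one makes it possible to audit the conversion later.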

### Timeliness
Data should be timely: it should be up to date, or fall within the time period you expect. Data that is stale or falls outside that period is not timely and should be treated as an error.

Here is an example:

In [3]:
import pandas as pd

# Create a DataFrame with timestamp data
data = {
    'Event': ['Event1', 'Event2', 'Event3', 'Event4', 'Event5'],
    'Timestamp': ['2023-08-10 09:00:00', '2023-08-10 09:30:00', '2023-08-10 10:15:00', '2023-08-10 08:45:00', '2023-08-10 11:00:00']
}

df = pd.DataFrame(data)

# Convert the 'Timestamp' column to a pandas datetime object
df['Timestamp'] = pd.to_datetime(df['Timestamp'])

# Define a time window (e.g., events should occur between 9:00 AM and 11:00 AM)
start_time = pd.Timestamp('2023-08-10 09:00:00')
end_time = pd.Timestamp('2023-08-10 11:00:00')

# Check for timeliness
timely_events = df[(df['Timestamp'] >= start_time) & (df['Timestamp'] <= end_time)]

# Display timely events
if not timely_events.empty:
    print("Timely events:")
    print(timely_events)
else:
    print("No timely events found.")


Timely events:
    Event           Timestamp
0  Event1 2023-08-10 09:00:00
1  Event2 2023-08-10 09:30:00
2  Event3 2023-08-10 10:15:00
4  Event5 2023-08-10 11:00:00


In the code above, an event is timely if it falls between `2023-08-10 09:00:00` and `2023-08-10 11:00:00`, inclusive. `Event4` (08:45) falls outside this window, so it is missing from the output; such an event is not timely and should be treated as an error in the data.
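
For validation it is often more useful to list the events that fail the check. The sketch below selects the untimely events and adds a simple freshness check. The `reference_time` and the two-hour `max_age` are assumed values, fixed here so the result is reproducible; a real pipeline would more likely compare against the current time.

```python
import pandas as pd

# Same events as in the timeliness example above
df = pd.DataFrame({
    "Event": ["Event1", "Event2", "Event3", "Event4", "Event5"],
    "Timestamp": pd.to_datetime([
        "2023-08-10 09:00:00", "2023-08-10 09:30:00", "2023-08-10 10:15:00",
        "2023-08-10 08:45:00", "2023-08-10 11:00:00",
    ]),
})

start_time = pd.Timestamp("2023-08-10 09:00:00")
end_time = pd.Timestamp("2023-08-10 11:00:00")

# between() includes both endpoints by default, like the >= and <= comparison
in_window = df["Timestamp"].between(start_time, end_time)
untimely_events = df[~in_window]
print(untimely_events)

# Freshness check against an assumed reference time and maximum age
reference_time = pd.Timestamp("2023-08-10 11:30:00")
max_age = pd.Timedelta(hours=2)
stale_events = df[reference_time - df["Timestamp"] > max_age]
print(stale_events)
```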

In [2]:
# Install great_expectations library - used for data validation
!pip install great_expectations



In [1]:
import csv

# Create a list of data
data = [
    ["John Doe", 30],
    ["Jane Doe", 25],
    ["Peter Smith", 40],
    ["Susan Jones", 35],
]

# Write the data to the CSV file; the with block closes the file when done
with open("files/sample_data.csv", "w", newline="") as file:
    csv_writer = csv.writer(file)
    csv_writer.writerows(data)

[Here](https://raw.githubusercontent.com/great-expectations/gx_tutorials/main/data/yellow_tripdata_sample_2019-01.csv) is the CSV file we'll be validating. Let's familiarize ourselves with the data.

First, we'll load the data into a Pandas DataFrame:


In [9]:
import pandas as pd
df = pd.read_csv("https://raw.githubusercontent.com/great-expectations/gx_tutorials/main/data/yellow_tripdata_sample_2019-01.csv")
df.head()

Unnamed: 0,vendor_id,pickup_datetime,dropoff_datetime,passenger_count,trip_distance,rate_code_id,store_and_fwd_flag,pickup_location_id,dropoff_location_id,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge
0,1,2019-01-15 03:36:12,2019-01-15 03:42:19,1,1.0,1,N,230,48,1,6.5,0.5,0.5,1.95,0.0,0.3,9.75,
1,1,2019-01-25 18:20:32,2019-01-25 18:26:55,1,0.8,1,N,112,112,1,6.0,1.0,0.5,1.55,0.0,0.3,9.35,0.0
2,1,2019-01-05 06:47:31,2019-01-05 06:52:19,1,1.1,1,N,107,4,2,6.0,0.0,0.5,0.0,0.0,0.3,6.8,
3,1,2019-01-09 15:08:02,2019-01-09 15:20:17,1,2.5,1,N,143,158,1,11.0,0.0,0.5,3.0,0.0,0.3,14.8,
4,1,2019-01-25 18:49:51,2019-01-25 18:56:44,1,0.8,1,N,246,90,1,6.5,1.0,0.5,1.65,0.0,0.3,9.95,0.0


Looks good. The Great Expectations library is used to validate data. Let's install it and import it.



In [10]:
!pip install great_expectations
import great_expectations as ge




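Load the file into Great Expectations (GE) to validate it. The calls below use the `context.sources` API from the 0.x quickstart, so they may need adjusting for other versions.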

In [12]:
# Create a Data Context
context = ge.get_context()
validator = context.sources.pandas_default.read_csv(
    "https://raw.githubusercontent.com/great-expectations/gx_tutorials/main/data/yellow_tripdata_sample_2019-01.csv"
)

Run some validations:

In [1]:
validator.expect_column_values_to_not_be_null("pickup_datetime")

validator.expect_column_values_to_be_between("passenger_count", auto=True)
validator.save_expectation_suite()

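
For intuition, the two expectations above correspond roughly to the plain-pandas checks below. This is only an analogy, not how Great Expectations computes its results. The sample rows are hypothetical, and the 1 to 6 passenger range is an assumed pair of explicit bounds, since the call above uses `auto=True` instead of supplying bounds.

```python
import pandas as pd

# Small hypothetical sample with the two columns the expectations touch
trips = pd.DataFrame({
    "pickup_datetime": ["2019-01-15 03:36:12", "2019-01-25 18:20:32", None],
    "passenger_count": [1, 2, 9],
})

# Rough analogue of expect_column_values_to_not_be_null("pickup_datetime")
null_pickups = trips[trips["pickup_datetime"].isna()]

# Rough analogue of expect_column_values_to_be_between with explicit bounds
MIN_PASSENGERS, MAX_PASSENGERS = 1, 6  # assumed bounds, for illustration only
in_range = trips["passenger_count"].between(MIN_PASSENGERS, MAX_PASSENGERS)
out_of_range = trips[~in_range]

print(f"null pickups: {len(null_pickups)}, out of range: {len(out_of_range)}")
```

The advantage of expressing these checks as expectations is that they are saved in a suite and can be rerun against new data without rewriting the logic.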

Create a checkpoint:

In [13]:
checkpoint = context.add_or_update_checkpoint(
    name="my_quickstart_checkpoint",
    validator=validator,
)

Validate the data by running the checkpoint:

In [14]:
checkpoint_result = checkpoint.run()


Look at the HTML report

In [15]:
context.view_validation_result(checkpoint_result)

The HTML report provides us with a profile of the data and a summary of the validation results. We can see that the data is accurate and that there are no errors.

### Uniqueness
As described earlier, unique data contains no duplicate values. Great Expectations provides a suite of expectations that can be used to validate the uniqueness of data. Let's use the unique value expectation to validate the uniqueness of the `VendorID` column.

```python
df = pd.read_csv("https://raw.githubusercontent.com/great-expectations/gx_tutorials/main/data/yellow_tripdata_sample_2019-01.csv")
df = df.rename(columns={"VendorID": "vendor_id"})
df["vendor_id"] = df["vendor_id"].astype(str)
df["vendor_id"] = df["vendor_id"].replace({"1": "Creative Mobile Technologies, LLC", "2": "VeriFone Inc"})
df["vendor_id"] = df["vendor_id"].astype("category")
    