# Validation 
***

## Learning Objectives

## Sources
- [Qualities of good data](https://www.precisely.com/blog/data-quality/5-characteristics-of-data-quality)



## Qualities of good validation data


### Accuracy 
For data to be of good quality, it needs to be accurate. This means that the data is correct and free from errors. In practice, each value needs to fall within a defined set of boundary conditions. For example, if you are collecting data on the height of people, you would expect every value to be within a certain range. If you find a height of 3 meters, you know that this is not a valid value and that there is an error in the data.

Here's an example: 

In [None]:
import csv

# Define the data
data = [
    ["StudentID", "exam_score"],
    [1, 85],
    [2, 92],
    [3, 105],
    [4, 78],
    [5, 110],
    [6, "err"]
]

# Specify the file name
file_name = "student_scores.csv"

# Write the data to the CSV file
with open(file_name, mode='w', newline='') as file:
    writer = csv.writer(file)
    writer.writerows(data)

print(f"CSV file '{file_name}' has been created.")

In [None]:
import pandas as pd

# Read the CSV file into a DataFrame
df = pd.read_csv(file_name)

# Display DataFrame
df

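Check whether all the values in `exam_score` are numeric. Because of the `err` entry, pandas reads the whole column as strings, which is why the `.str` accessor can be used below. Note the row at index 5.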

In [None]:
non_numeric_rows = df[~df['exam_score'].str.isnumeric()]

# Display the rows with non-numeric values in the 'exam_score' column
if not non_numeric_rows.empty:
    print("Rows with non-numeric values in 'exam_score':")
    print(non_numeric_rows)
else:
    print("All values in 'exam_score' are numeric.")
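
The accuracy description above talks about boundary conditions, while the check so far only tests whether the values are numeric. A minimal sketch of a range check follows, assuming that a valid exam score lies between 0 and 100. That range is an assumption for illustration; the original data does not state it.

```python
import pandas as pd

# Same scores as in the CSV above, held as strings the way read_csv
# loads a column that mixes numbers and text
df = pd.DataFrame({
    "StudentID": [1, 2, 3, 4, 5, 6],
    "exam_score": ["85", "92", "105", "78", "110", "err"],
})

# Assumed valid range for an exam score (not stated in the original data)
MIN_SCORE, MAX_SCORE = 0, 100

# Non-numeric entries become NaN instead of raising an error
scores = pd.to_numeric(df["exam_score"], errors="coerce")

# between() is inclusive on both ends; NaN evaluates to False,
# so non-numeric rows are flagged together with out-of-range rows
out_of_range = df[~scores.between(MIN_SCORE, MAX_SCORE)]
print(out_of_range)
```

With this sample, the rows holding `105`, `110` and `err` are flagged.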
    

### Uniqueness
Data should be unique: each record should appear only once. For an identifier column such as `StudentID`, a repeated value usually means that a record was entered twice or that two records were merged incorrectly.
The cells below add a duplicate row to the DataFrame and then detect it.


In [None]:
# Add a duplicate row to the DataFrame
# (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
df = pd.concat([df, df.iloc[[2]]], ignore_index=True)

In [None]:
df 

In [None]:
# Find duplicate values in the 'StudentID' column
duplicates = df[df.duplicated(subset=['StudentID'], keep=False)]

# Display the rows with duplicate 'StudentID' values
if not duplicates.empty:
    print("Duplicate StudentID values:")
    print(duplicates)
else:
    print("No duplicate StudentID values found.")
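
Once duplicates have been found, one possible follow-up is to remove them. The sketch below assumes that the first occurrence of each `StudentID` is the one to keep, which is a choice made here for illustration and may not suit every dataset.

```python
import pandas as pd

# Small sample in which StudentID 3 appears twice
df = pd.DataFrame({
    "StudentID": [1, 2, 3, 4, 5, 6, 3],
    "exam_score": [85, 92, 105, 78, 110, 70, 105],
})

# Keep the first row for each StudentID and drop later repeats
deduplicated = df.drop_duplicates(subset=["StudentID"], keep="first")

# Renumber the index so it has no gaps after rows were dropped
deduplicated = deduplicated.reset_index(drop=True)

# is_unique confirms that no StudentID is repeated any more
print(deduplicated["StudentID"].is_unique)
print(deduplicated)
```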

In [None]:
import pandas as pd
import numpy as np

# Create a DataFrame with missing values
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', np.nan],
    'Age': [25, 30, np.nan, 22, 28],
    'Salary': [50000, 60000, 75000, np.nan, 80000]
}

df = pd.DataFrame(data)

In [None]:
df 

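`Name`, `Age` and `Salary` each have one missing value (`NaN`). Missing values are a completeness problem, which is covered in more detail in the Completeness section.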

In [None]:
# Check for missing values in the entire DataFrame
missing_values = df.isnull()

# Check for completeness
if missing_values.any().any():
    print("DataFrame contains missing values:")
    print(missing_values)
else:
    print("DataFrame is complete; it contains no missing values.")

We can count the missing values in each column using the `isnull()` method and the `sum()` method.

In [None]:
missing_values.sum()
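
Counts are easier to compare across columns when expressed as a share of all rows. The sketch below rebuilds the same sample DataFrame and reports the fraction of missing values per column, then keeps only the fully populated rows. Dropping rows is just one possible way to handle missing values.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David", np.nan],
    "Age": [25, 30, np.nan, 22, 28],
    "Salary": [50000, 60000, 75000, np.nan, 80000],
})

# The mean of a boolean mask is the fraction of True values per column
missing_share = df.isnull().mean()
print(missing_share)

# Keep only the rows that have a value in every column
complete_rows = df.dropna()
print(complete_rows)
```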

### Consistency
Data should be consistent. This means that the data should be in a standard format. For example, if you are collecting data on the height of people, you would expect the data to be in the same unit of measurement. If some of the data is in meters and some of the data is in feet, then the data is not consistent and there is an error in the data.


Here is an example:

In [None]:

# Create a DataFrame with employee information
data = {
    'EmployeeID': [101, 102, 103, 104, 105 , 106],
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Fred'],
    'Salary': [50000, 60000, 75000, 45000, 80000 , 100000]
}

df = pd.DataFrame(data)

df 


In [None]:

# Define the salary range (e.g., $40,000 to $80,000)
min_salary = 40000
max_salary = 80000

# Check for data consistency (salary within the defined range)
inconsistent_salaries = df[(df['Salary'] < min_salary) | (df['Salary'] > max_salary)]

Note that `|` is the element-wise `or` operator for pandas boolean Series. The line reads: select the rows where `Salary` is less than `min_salary` or `Salary` is greater than `max_salary`.

In [None]:

# Display rows with inconsistent salary values
if not inconsistent_salaries.empty:
    print("Inconsistent salary values:")
    print(inconsistent_salaries)
else:
    print("Data is consistent; all salary values are within the valid range.")

The code above therefore flags every salary that falls outside the range `40000` to `80000`. In this sample, that is Fred's salary of `100000`.
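
The unit example from the start of this section (heights recorded in meters and in feet) can be sketched as follows. The column names `height` and `unit` and the sample values are hypothetical, chosen only for illustration.

```python
import pandas as pd

# Hypothetical sample: heights recorded in mixed units
heights = pd.DataFrame({
    "height": [1.75, 5.9, 1.62, 6.1],
    "unit": ["m", "ft", "m", "ft"],
})

# One foot is exactly 0.3048 meters
FEET_TO_METERS = 0.3048

# Map each unit to its conversion factor; unknown units would become NaN
factor = heights["unit"].map({"m": 1.0, "ft": FEET_TO_METERS})

# Store every height in a single unit (meters)
heights["height_m"] = heights["height"] * factor
print(heights)
```

Keeping the converted values in a separate column leaves the original measurements available for auditing.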

### Completeness
For data to be complete, it needs to have all the required values. If there are missing values, then the data is not complete and there is an error in the data.

In [None]:
import pandas as pd
import numpy as np

# Generate sample data
np.random.seed(123)
data = {'CustomerID': np.random.randint(1, 101, 10)}
df = pd.DataFrame(data)

df 


In [None]:

# Add name and last name columns
names = ['John', 'Jane', 'Bob', 'Alice', 'Eve', 'Mike', 'Sara', 'Tom', 'Kate', 'Alex']
last_names = ['Smith', 'Doe', 'Johnson', 'Brown', 'Lee', 'Garcia', 'Davis', 'Wilson', 'Taylor', 'Clark']
df['Name'] = np.random.choice(names, 10)
df['Last Name'] = np.random.choice(last_names, 10)

# Print the DataFrame
print(df)
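
A completeness check can also cover the structure of the data: whether every required column exists, and whether every row has a value in those columns. The sketch below uses a small hypothetical customer table; the `Email` column and the list of required columns are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical customer table with one missing first name
customers = pd.DataFrame({
    "CustomerID": [11, 42, 73],
    "Name": ["John", None, "Bob"],
    "Last Name": ["Smith", "Doe", "Johnson"],
})

# Assumed list of columns the dataset is required to have
REQUIRED_COLUMNS = ["CustomerID", "Name", "Last Name", "Email"]

# Required columns that are absent from the DataFrame
missing_columns = [c for c in REQUIRED_COLUMNS if c not in customers.columns]
print("Missing columns:", missing_columns)

# Rows with a missing value in any required column that is present
present = [c for c in REQUIRED_COLUMNS if c in customers.columns]
incomplete_rows = customers[customers[present].isnull().any(axis=1)]
print(incomplete_rows)
```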

### Reliability
For data to be reliable, it needs to be trustworthy. This means that the data should come from a dependable source and should not contradict other sources that describe the same thing. In the example below, two sources report different prices for the same products, so at least one of them cannot be trusted as it stands.

In [None]:
import pandas as pd

# Create two DataFrames with conflicting data
data1 = {
    'Product': ['Product1', 'Product2', 'Product3', 'Product4', 'Product5'],
    'Price': [10.99, 5.99, 8.99, 12.99, 7.99]
}
df1 = pd.DataFrame(data1)
df1

In [None]:

data2 = {
    'Product': ['Product1', 'Product2', 'Product3', 'Product4', 'Product5'],
    'Price': [9.99, 6.99, 7.99, 11.99, 8.99]
}
df2 = pd.DataFrame(data2)
df2


In [None]:

# Merge the two DataFrames on the 'Product' column
merged_df = pd.merge(df1, df2, on='Product')

# Calculate the difference in price between the two DataFrames
merged_df['Price Difference'] = merged_df['Price_x'] - merged_df['Price_y']

# Display the merged DataFrame
print(merged_df)

In [None]:
# Choose the greater price from the two DataFrames
merged_df['Price'] = merged_df[['Price_x', 'Price_y']].max(axis=1)
merged_df
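
Before deciding which value to keep, it can help to flag only the rows where the two sources actually disagree. The sketch below uses hypothetical prices and an assumed tolerance of one cent; the suffixes passed to `pd.merge` replace the default `_x` and `_y` column names.

```python
import pandas as pd

# Hypothetical prices from two sources; only Product2 differs
df1 = pd.DataFrame({
    "Product": ["Product1", "Product2", "Product3"],
    "Price": [10.99, 5.99, 8.99],
})
df2 = pd.DataFrame({
    "Product": ["Product1", "Product2", "Product3"],
    "Price": [10.99, 6.99, 8.99],
})

# Name the price columns after their source instead of _x and _y
merged = pd.merge(df1, df2, on="Product", suffixes=("_source1", "_source2"))

# Assumed tolerance: differences of up to one cent are not a conflict
TOLERANCE = 0.01
difference = (merged["Price_source1"] - merged["Price_source2"]).abs()
merged["conflict"] = difference > TOLERANCE

conflicts = merged[merged["conflict"]]
print(conflicts)
```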

### Relevance
For data to be relevant, it needs to be useful for the purpose for which it was collected. Columns that do not serve that purpose add noise and can be removed. In the example below, the analysis only needs the product, date and sales amount, so the other columns are dropped.


In [None]:
import pandas as pd
import numpy as np

# Generate random sales data
num_sales = 1000
products = ['Product A', 'Product B', 'Product C', 'Product D', 'Product E']
sales_data = {
    'Product': np.random.choice(products, num_sales),
    'Date': pd.date_range(start='2022-01-01', end='2022-12-31', periods=num_sales),
    'Sales Amount': np.random.normal(loc=100, scale=50, size=num_sales),
    'Region': np.random.choice(['North', 'South', 'East', 'West'], num_sales),
    'Customer ID': np.random.randint(low=1000, high=9999, size=num_sales),
    'Sales Rep': np.random.choice(['John', 'Jane', 'Bob', 'Sue'], num_sales)
}

df = pd.DataFrame(sales_data)
df

In [None]:
#  Remove region, customer ID, and sales rep columns
df = df.drop(columns=['Region', 'Customer ID', 'Sales Rep'])
df
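
An alternative to dropping unwanted columns is to list the columns that are relevant and select only those. The sketch below uses a tiny hypothetical sales table; which columns count as relevant is an assumption that depends on the analysis.

```python
import pandas as pd

# Tiny hypothetical sales table with the same columns as above
sales = pd.DataFrame({
    "Product": ["Product A", "Product B"],
    "Date": pd.to_datetime(["2022-01-01", "2022-01-02"]),
    "Sales Amount": [120.5, 80.0],
    "Region": ["North", "South"],
    "Customer ID": [1001, 1002],
    "Sales Rep": ["John", "Jane"],
})

# Assumed set of columns needed for the analysis
RELEVANT_COLUMNS = ["Product", "Date", "Sales Amount"]

# Selecting by name raises a KeyError if a relevant column is missing
relevant = sales[RELEVANT_COLUMNS]
print(relevant)
```

Selecting by an explicit list means that columns added to the source later are ignored, whereas `drop` would let them through.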

### Timeliness
Data should be timely. This means that the data should be up to date for the purpose at hand. Stale data may no longer reflect the real world, so records that fall outside an expected time window, or that are older than an agreed age, should be flagged.

Here is an example:

In [49]:
import pandas as pd

# Create a DataFrame with timestamp data
data = {
    'Event': ['Event1', 'Event2', 'Event3', 'Event4', 'Event5'],
    'Timestamp': ['2023-08-10 09:00:00', '2023-08-10 09:30:00', '2023-08-10 10:15:00', '2023-08-10 08:45:00', '2023-08-10 11:00:00']
}

df = pd.DataFrame(data)

# Convert the 'Timestamp' column to a pandas datetime object
df['Timestamp'] = pd.to_datetime(df['Timestamp'])

df 


    Event           Timestamp
0  Event1 2023-08-10 09:00:00
1  Event2 2023-08-10 09:30:00
2  Event3 2023-08-10 10:15:00
3  Event4 2023-08-10 08:45:00
4  Event5 2023-08-10 11:00:00


In [51]:
# Define a time window (e.g., events should occur between 9:00 AM and 11:00 AM)
start_time = pd.Timestamp('2023-08-10 09:00:00')
end_time = pd.Timestamp('2023-08-10 11:00:00')

# Check for timeliness
timely_events = df[(df['Timestamp'] >= start_time) & (df['Timestamp'] <= end_time)]

# Display timely events
if not timely_events.empty:
    print("Timely events:")
    print(timely_events)
else:
    print("No timely events found.")

Timely events:
    Event           Timestamp
0  Event1 2023-08-10 09:00:00
1  Event2 2023-08-10 09:30:00
2  Event3 2023-08-10 10:15:00
4  Event5 2023-08-10 11:00:00


In [52]:
import pandas as pd
import numpy as np

# Generate sample data
num_records = 1000
data = {
    'ID': np.arange(num_records),
    'Date': pd.date_range(start='2022-01-01', end='2022-12-31', periods=num_records),
    'Value': np.random.normal(loc=50, scale=10, size=num_records)
}

df = pd.DataFrame(data)

# Check for values older than 10 days
ten_days_ago = pd.Timestamp.now() - pd.Timedelta(days=10)

old_values = df[df['Date'] < ten_days_ago]

# Display old values
if not old_values.empty:
    print("Values older than 10 days:")
    print(old_values)
else:
    print("No values older than 10 days found.")

Values older than 10 days:
      ID                          Date      Value
0      0 2022-01-01 00:00:00.000000000  51.765424
1      1 2022-01-01 08:44:41.081081081  39.990055
2      2 2022-01-01 17:29:22.162162162  56.300634
3      3 2022-01-02 02:14:03.243243243  54.901732
4      4 2022-01-02 10:58:44.324324324  48.923777
..   ...                           ...        ...
995  995 2022-12-29 13:01:15.675675676  50.342561
996  996 2022-12-29 21:45:56.756756756  47.273165
997  997 2022-12-30 06:30:37.837837840  61.487841
998  998 2022-12-30 15:15:18.918918920  53.922712
999  999 2022-12-31 00:00:00.000000000  42.524444

[1000 rows x 3 columns]


In [53]:
import pandas as pd
import numpy as np

# Generate sample data
num_records = 1000
data = {
    'ID': np.arange(num_records),
    'Date': pd.date_range(start='2022-01-01', end='2022-12-31', periods=num_records),
    'Value': np.random.normal(loc=50, scale=10, size=num_records)
}

df = pd.DataFrame(data)

# Check for values older than 10 days, measured from a fixed reference date.
# Using pd.Timestamp.now() (as in the previous cell) flags every row, because
# all of the sample dates are in 2022.
ten_days_ago = pd.Timestamp('2022-12-31') - pd.Timedelta(days=10)

old_values = df[df['Date'] < ten_days_ago]

# Display old values
if not old_values.empty:
    print("Values older than 10 days:")
    print(old_values)
else:
    print("No values older than 10 days found.")

Values older than 10 days:
      ID                          Date      Value
0      0 2022-01-01 00:00:00.000000000  53.656261
1      1 2022-01-01 08:44:41.081081081  43.174119
2      2 2022-01-01 17:29:22.162162162  48.199763
3      3 2022-01-02 02:14:03.243243243  45.053514
4      4 2022-01-02 10:58:44.324324324  39.823869
..   ...                           ...        ...
967  967 2022-12-19 08:10:05.405405408  34.069457
968  968 2022-12-19 16:54:46.486486488  40.865303
969  969 2022-12-20 01:39:27.567567568  36.015819
970  970 2022-12-20 10:24:08.648648648  55.515140
971  971 2022-12-20 19:08:49.729729732  57.998957

[972 rows x 3 columns]
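
Timeliness can also be judged for a dataset as a whole by looking at its most recent record. The sketch below is one way to do that; the helper `latest_record_age`, the sample dates and the 10-day limit are hypothetical choices made for illustration.

```python
import pandas as pd

def latest_record_age(df, date_column, reference_time):
    """Return how long before reference_time the newest record was dated."""
    return reference_time - df[date_column].max()

# Hypothetical sample with three dated records
df = pd.DataFrame({
    "Date": pd.to_datetime(["2022-12-01", "2022-12-20", "2022-12-28"]),
})

# Fixed reference date so that the result is reproducible
reference_time = pd.Timestamp("2022-12-31")

age = latest_record_age(df, "Date", reference_time)

# Assumed freshness rule: the newest record is at most 10 days old
is_fresh = age <= pd.Timedelta(days=10)
print(age, is_fresh)
```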
