<a href="https://colab.research.google.com/github/zelal-Eizaldeen/data_qaulity_course/blob/main/3.timeliness.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

https://github.com/zelal-Eizaldeen/data_qaulity_course

We start by importing the required libraries:

In [1]:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta

We then generate a random dataset of 100 timestamps and values. The timestamps are drawn at random, at one-minute resolution, between 09:00 and 16:00 on 2025-06-25 to simulate real-world data:

In [2]:
np.random.seed(0) # For reproducibility
n_samples = 100
start_time = datetime(2025, 6, 25, 9, 0, 0)
end_time = datetime(2025, 6, 25, 16, 0, 0)
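# Length of the window in whole minutes (420); randint's upper bound is exclusive
total_minutes = int((end_time - start_time).total_seconds() // 60)
timestamps = [start_time + timedelta(minutes=int(np.random.randint(0, total_minutes))) for _ in range(n_samples)]
values = np.random.randint(50, 101, n_samples)  # integers from 50 to 100 inclusive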
df = pd.DataFrame({'Timestamp': timestamps, 'Value': values})
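
The same kind of dataset can also be built without a Python loop. The following is a minimal sketch, assuming NumPy's `default_rng` generator and `pd.to_timedelta`; the names `rng`, `offsets`, and `sketch_df` are illustrative and not part of the notebook, and because a different random generator is used, the drawn values will not match the ones printed later:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # assumption: Generator API instead of np.random.seed
n_samples = 100
start = pd.Timestamp("2025-06-25 09:00:00")
end = pd.Timestamp("2025-06-25 16:00:00")

# Whole minutes in the window (420); the upper bound of integers() is exclusive
total_minutes = int((end - start).total_seconds() // 60)
offsets = rng.integers(0, total_minutes, size=n_samples)

sketch_df = pd.DataFrame({
    # Adding a TimedeltaIndex to a Timestamp yields one timestamp per offset
    "Timestamp": start + pd.to_timedelta(offsets, unit="min"),
    "Value": rng.integers(50, 101, size=n_samples),  # 50 to 100 inclusive
})
```

Drawing all offsets in one call keeps the generation step short and makes the window boundaries explicit.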

We define a reference timestamp to compare the dataset’s timestamps to:

In [15]:
reference_timestamp = datetime(2025, 6, 25, 12, 0, 0)
reference_timestamp

datetime.datetime(2025, 6, 25, 12, 0)

We set a timeliness threshold of 30 minutes. A record is considered timely if its timestamp is at most 30 minutes older than the reference timestamp; with the check used below, records stamped after the reference also count as timely:

In [4]:
timeliness_threshold = 30  # minutes

We calculate the timeliness of each record as the time difference in minutes between the reference timestamp and the record’s timestamp (reference minus record, so positive values are older than the reference and negative values are newer). We also create a Boolean column that indicates whether the record is timely, meaning its timeliness does not exceed the threshold:

In [5]:
df['Timeliness'] = (reference_timestamp - df['Timestamp']).dt.total_seconds() / 60
df['Timely'] = df['Timeliness'] <= timeliness_threshold
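
The comparison above is one-sided: any record stamped after the reference has a negative timeliness and passes the check, however far away it is. If "within 30 minutes of the reference" is meant as a two-sided window, one option is to compare the absolute difference instead. The sketch below illustrates both variants on four timestamps; the frame `demo` and the column names `TimelyOneSided` and `TimelyTwoSided` are hypothetical, introduced only for this comparison:

```python
import pandas as pd

reference = pd.Timestamp("2025-06-25 12:00:00")
threshold_minutes = 30

demo = pd.DataFrame({"Timestamp": pd.to_datetime([
    "2025-06-25 11:52:00",  # 8 minutes before the reference
    "2025-06-25 09:47:00",  # 133 minutes before
    "2025-06-25 12:12:00",  # 12 minutes after
    "2025-06-25 14:23:00",  # 143 minutes after
])})

# Signed age in minutes: positive = older than the reference
demo["Timeliness"] = (reference - demo["Timestamp"]).dt.total_seconds() / 60

# One-sided check, as in the notebook: only the upper bound is tested
demo["TimelyOneSided"] = demo["Timeliness"] <= threshold_minutes
# Two-sided check (assumption about the intent): distance in either direction
demo["TimelyTwoSided"] = demo["Timeliness"].abs() <= threshold_minutes

print(demo)
```

Only the last row differs between the two columns: at 143 minutes after the reference it is timely under the one-sided check but not under the two-sided one.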

Next, we calculate the average timeliness of the dataset:

In [6]:
average_timeliness = df['Timeliness'].mean()

We also compute the percentage of timely records, that is, the count of timely records divided by the total number of records, multiplied by 100:

In [13]:
percentage_timely_records = df['Timely'].sum() / len(df) * 100
percentage_timely_records

np.float64(61.0)
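
Because `True` counts as 1 and `False` as 0, the mean of a Boolean column is already the fraction of `True` values. A short sketch on a made-up four-element series (the names `timely`, `pct_from_sum`, and `pct_from_mean` are illustrative) shows that both formulations agree:

```python
import pandas as pd

timely = pd.Series([True, False, True, True])  # made-up flags for illustration

pct_from_sum = timely.sum() / len(timely) * 100  # count of True / total * 100
pct_from_mean = timely.mean() * 100              # mean of a Boolean series = share of True

print(pct_from_sum, pct_from_mean)
```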

We then print the dataset together with both metrics, which produces the following output:

In [14]:
print("Dataset with Timestamps:")
print(df)
print(f"Average Timeliness (in minutes): {average_timeliness}")
print(f"Percentage of Timely Records: {percentage_timely_records} %")


Dataset with Timestamps:
             Timestamp  Value  Timeliness  Timely
0  2025-06-25 11:52:00     71         8.0    True
1  2025-06-25 09:47:00     98       133.0   False
2  2025-06-25 10:57:00     99        63.0   False
3  2025-06-25 12:12:00     55       -12.0    True
4  2025-06-25 14:23:00     91      -143.0    True
..                 ...    ...         ...     ...
95 2025-06-25 11:28:00     66        32.0   False
96 2025-06-25 12:47:00     69       -47.0    True
97 2025-06-25 13:39:00     83       -99.0    True
98 2025-06-25 12:27:00     90       -27.0    True
99 2025-06-25 15:37:00     82      -217.0    True

[100 rows x 4 columns]
Average Timeliness (in minutes): -23.8
Percentage of Timely Records: 61.0 %


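Timeliness here is a signed age in minutes: positive values mean the record is older than the reference timestamp, and negative values mean it was recorded after it. An average of -23.8 therefore says that the typical record falls about 24 minutes after the reference, and because the Timely flag only checks the upper bound, every record stamped after the reference counts as timely. A small average timeliness in absolute terms, combined with a high percentage of timely records, suggests that the dataset is current and aligns well with the reference timestamp. This is desirable in real-time applications or scenarios where up-to-date data is critical.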
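
A signed average can hide stale records, since old and new records cancel each other out. As a rough sketch, the ten timeliness values visible in the printed output can be split into three groups; the labels `stale`, `recent`, and `after_reference` are hypothetical names chosen for this illustration:

```python
import numpy as np
import pandas as pd

# Timeliness values (minutes) copied from the ten rows shown in the printed output
timeliness = pd.Series([8.0, 133.0, 63.0, -12.0, -143.0, 32.0, -47.0, -99.0, -27.0, -217.0])
threshold = 30

# np.select picks the label of the first condition that holds for each value
status = pd.Series(
    np.select(
        [timeliness > threshold, timeliness >= 0],
        ["stale", "recent"],
        default="after_reference",
    ),
    index=timeliness.index,
)

counts = status.value_counts()
signed_mean = timeliness.mean()          # old and new records offset each other
absolute_mean = timeliness.abs().mean()  # average distance from the reference

print(counts)
print(signed_mean, absolute_mean)
```

For these ten rows, three records are stale, one is recent, and six fall after the reference, so reporting the group counts alongside the average gives a clearer picture than the average alone.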

Here are some real-world applications of timeliness:
- Finance: In the financial sector, timeliness is crucial for stock trading, fraud detection, and risk management, where timely data can lead to better decisions and reduced risks.
- Healthcare: Timeliness is vital for healthcare data, particularly in patient monitoring and real-time health data analysis.
- E-commerce: Timely data is essential for e-commerce companies to monitor sales, customer behavior, and inventory in real time.