# Send Factory Plant Sensor Data to QuestDB

This notebook produces simulated factory plant sensor data and sends it to QuestDB. The script simulates a "realistic" stream of sensor readings with varying frequencies, data types, and behaviors, including downtime and value fluctuations. It can run in either real-time or historical mode.

## Data Generation Logic Summary

The script simulates sensor data for ingestion into QuestDB, following these key steps:

1. **Sensor Creation**:
   - **Total Sensors**: A specified number of sensors (`TOTAL_SENSORS`) are generated using a custom device ID generation function.
   - **Sensor IDs**: Each sensor is assigned a unique ID in the format of three letters followed by four hexadecimal digits (e.g., `AAA0001`).

2. **Sensor Categorization**:
   - **Frequency Categories**:
     - **High-Frequency Sensors**: 20% of sensors.
     - **Medium-Frequency Sensors**: 30% of sensors.
     - **Low-Frequency Sensors**: 50% of sensors.
   - **Data Types**:
     - **Numeric Sensors**: 80% of sensors.
     - **Text Sensors**: 20% of sensors.
   - Sensors are randomly assigned to these categories to simulate a diverse set of devices.

3. **Event Rate Calculation**:
   - **Total Event Rate**: Defined by `RATE_PER_SECOND`.
   - **Category Weights**:
     - **High**: Weight of 1.
     - **Medium**: Weight of 0.1.
     - **Low**: Weight of 0.01.
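   - **Base Rate**: Calculated by dividing `RATE_PER_SECOND` by the weighted sensor count, that is, the number of sensors in each category multiplied by that category's weight, summed over the three categories.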
   - **Per-Sensor Rate**:
     - **High-Frequency Sensors**: `base_rate`.
     - **Medium-Frequency Sensors**: `base_rate * 0.1`.
     - **Low-Frequency Sensors**: `base_rate * 0.01`.
   - This ensures that sensors in different categories send events at appropriate frequencies while respecting the total event rate.

4. **Event Generation**:
   - **Numeric Sensors**:
     - Generate values that change slowly using small random fluctuations.
     - Occasionally introduce larger jumps with a small probability (1% chance).
   - **Text Sensors**:
     - Randomly select messages from a predefined list of 100 messages (e.g., `"message001"` to `"message100"`).
   - **System Name**:
     - Each event includes a `system` symbol, randomly assigned from `"system000"` to `"system019"`.

5. **Downtime Simulation**:
   - **Downtime Probability**: Each sensor has a 1% chance of going down at each event.
   - **Downtime Duration**: Sensors stay down for a specified duration (`DOWNTIME_DURATION`).
   - **Status Handling**:
     - Sensors can have statuses: `'active'`, `'stopping'`, or `'starting'`.
     - When a sensor goes down, its status changes to `'stopping'`. The event that triggers the downtime is still sent with that status, after which the sensor stops generating events.
     - After the downtime period, the sensor status changes to `'starting'` before resuming normal operation.

6. **Timestamp Handling**:
   - **Simulation Modes**:
     - **Realtime Mode**:
       - Events are timestamped with the current system time.
       - The script sleeps as necessary to maintain the correct event rate.
     - **Historical Mode**:
       - Events are timestamped starting from a specified `START_DATETIME`.
       - Simulated time advances based on event intervals without real-time delays.
       - Optional delay (`DELAY_MS`) can be introduced to control ingestion speed.

7. **Event Scheduling**:
   - **Priority Queue**:
     - Uses a min-heap (priority queue) to schedule events based on the next event time for each sensor.
     - Efficiently processes sensors in order of their scheduled event times.
   - **Event Loop**:
     - Continues generating events until the total number of events (`TOTAL_NUMBER_OF_EVENTS`) is reached.
     - In historical mode, events are generated as fast as possible without waiting.

8. **Data Ingestion**:
   - **QuestDB Client**:
     - Uses the QuestDB Python client to send data directly to QuestDB.
     - The `sender.row()` method is used to ingest each event, specifying symbols, columns, and timestamp.
     - Relies on the client's automatic flushing mechanism for efficiency.

9. **Parallel Processing**:
   - **Multiprocessing Pool**:
     - The script uses a multiprocessing pool to run multiple sender processes in parallel.
     - Sensors are evenly distributed among the processes.
     - Each process handles its share of sensors and events independently.

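As a worked example of the rate calculation in step 3, the sketch below recomputes the per-sensor rates for the defaults used later in this notebook (18,000 sensors, 2,000 events per second). The function name `per_sensor_rates` and its return shape are illustrative and not part of the notebook's code.

```python
def per_sensor_rates(total_sensors, rate_per_second,
                     high_pct=0.2, medium_pct=0.3):
    """Return (base_rate, rates_by_category, counts_by_category)."""
    n_high = int(total_sensors * high_pct)
    n_medium = int(total_sensors * medium_pct)
    n_low = total_sensors - n_high - n_medium

    weights = {'high': 1.0, 'medium': 0.1, 'low': 0.01}
    counts = {'high': n_high, 'medium': n_medium, 'low': n_low}

    # Weighted sensor count: each sensor contributes its category weight
    total_weight = sum(counts[c] * weights[c] for c in counts)
    base_rate = rate_per_second / total_weight

    # Per-sensor rate in events per second for each category
    rates = {c: base_rate * weights[c] for c in counts}
    return base_rate, rates, counts


base_rate, rates, counts = per_sensor_rates(18000, 2000)
total = sum(rates[c] * counts[c] for c in counts)
print(f"base_rate={base_rate:.4f} events/s, total={total:.1f} events/s")
```

With these defaults the weighted count is 3600 + 540 + 90 = 4230, so a high-frequency sensor emits roughly one event every 2.1 seconds and a low-frequency sensor roughly one every 3.5 minutes, while the rates across all sensors still add up to `RATE_PER_SECOND`.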

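The numeric value generation in step 4 behaves like a random walk with rare larger jumps. The following is a minimal sketch of that idea. The helper name `next_value`, the seed, and the starting value of 50.0 are assumptions chosen for illustration.

```python
import random


def next_value(last_value, rng, jump_probability=0.01):
    """Return the next reading: a small drift plus an occasional larger jump."""
    value = last_value + rng.uniform(-1, 1)   # small fluctuation
    if rng.random() < jump_probability:       # rare larger jump
        value += rng.uniform(-10, 10)
    return value


rng = random.Random(42)  # seeded only to make the sketch repeatable
values = [50.0]
for _ in range(1000):
    values.append(next_value(values[-1], rng))

# Size of each change between consecutive readings
steps = [abs(b - a) for a, b in zip(values, values[1:])]
print(f"largest step: {max(steps):.2f}, steps above 1.0: {sum(s > 1 for s in steps)}")
```

Most consecutive readings differ by at most 1.0. Only the occasional jump produces a larger step, bounded by 11.0 in this sketch.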

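The scheduling in step 7 can be illustrated with a small, self-contained sketch that uses simulated time only, as in historical mode. The sensor names and rates below are made up for illustration. The notebook's script additionally handles downtime, real-time sleeping, and ingestion.

```python
import heapq


def schedule(rates, total_events):
    """Yield (simulated_time, sensor_id) pairs in time order.

    rates maps a sensor ID to its event rate in events per second.
    """
    # Every sensor is due at simulated time 0.0; ties break on sensor ID
    heap = [(0.0, sensor_id) for sensor_id in rates]
    heapq.heapify(heap)

    for _ in range(total_events):
        if not heap:
            break
        event_time, sensor_id = heapq.heappop(heap)
        yield event_time, sensor_id
        # Reschedule the sensor one interval (1 / rate) later
        heapq.heappush(heap, (event_time + 1 / rates[sensor_id], sensor_id))


# Hypothetical sensors: one fast (2 events/s) and one slow (0.5 events/s)
events = list(schedule({'fast': 2.0, 'slow': 0.5}, total_events=6))
for event_time, sensor_id in events:
    print(f"{event_time:.1f}s {sensor_id}")
```

Because the heap always returns the earliest pending event, the fast sensor fires several times before the slow sensor's second event at 2.0 seconds comes due, and no sensor list ever needs to be re-sorted.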

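The work split in step 9 can be sketched as follows: sensors are dealt out round-robin, and the event budget is divided evenly with any remainder going to the first senders. The helper name `split_work` is hypothetical. With the notebook's defaults (18,000 sensors, 10 senders) each sender gets 1,800 sensors, which matches the per-sender count reported in the run output.

```python
def split_work(sensor_ids, total_events, num_senders):
    """Return (sensor chunks, events per sender) for num_senders processes."""
    # Round-robin assignment keeps chunk sizes within one sensor of each other
    chunks = [sensor_ids[i::num_senders] for i in range(num_senders)]
    # Split the event budget evenly; the first `extra` senders take one more
    base, extra = divmod(total_events, num_senders)
    events = [base + 1 if i < extra else base for i in range(num_senders)]
    return chunks, events


chunks, events = split_work(list(range(18000)), 1_000_000, 10)
print([len(c) for c in chunks], events)
```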
In [1]:
# Ignore deprecation warnings in this demo
import warnings
warnings.simplefilter("ignore", category=DeprecationWarning)

In [1]:
import psycopg as pg
import os

# Fetch environment variables with defaults
host = os.getenv('QDB_CLIENT_HOST', 'questdb')
port = os.getenv('QDB_CLIENT_PORT', '8812')
user = os.getenv('QDB_CLIENT_USER', 'admin')
password = os.getenv('QDB_CLIENT_PASSWORD', 'quest')

# Create the connection string using the environment variables or defaults
conn_str = f'user={user} password={password} host={host} port={port} dbname=qdb'

with pg.connect(conn_str, autocommit=True) as connection:
    with connection.cursor() as cur:
        cur.execute(
            """
            CREATE TABLE IF NOT EXISTS 'plant_sensors' (
              timestamp TIMESTAMP,
              system SYMBOL CACHE,
              address SYMBOL CAPACITY 100000 CACHE,
              value DOUBLE,
              text VARCHAR,
              status SYMBOL NOCACHE
            ) TIMESTAMP(timestamp) PARTITION BY DAY WAL
              DEDUP UPSERT KEYS(timestamp, system, address);
            """
        )

## Sending the data to QuestDB

The script keeps sending data until `TOTAL_NUMBER_OF_EVENTS` is reached, or until you stop the cell or exit the notebook.

While the script is running, you can inspect the table in QuestDB's web console at http://localhost:9000.

In [2]:
from questdb.ingress import Sender, IngressError, TimestampNanos
import os
import sys
import random
import time
from multiprocessing import Pool
from datetime import datetime, timedelta
import heapq  # For efficient scheduling

# Device ID generation
def generate_device_id(index):
    letters = index // (16**4) % (26**3)
    letter_part = ''.join(chr(65 + (letters // (26**i) % 26)) for i in range(3)[::-1])
    hex_part = format(index % (16**4), '04x').upper()
    return f"{letter_part}{hex_part}"

# Constants
# QuestDB serves ILP over HTTP on port 9000 by default (9009 is the ILP TCP port)
HTTP_ENDPOINT = os.getenv('QUESTDB_HTTP_ENDPOINT', 'localhost:9000')
REST_TOKEN = os.getenv('QUESTDB_REST_TOKEN')

TOTAL_SENSORS = 18000  # Adjust as needed
RATE_PER_SECOND = 2000  # Total average number of events to generate per second
TOTAL_NUMBER_OF_EVENTS = 1000000 #56_050_000  # Total number of events to send across all processes

# Sensor frequency distribution
HIGH_FREQ_PERCENT = 0.2  # 20% high-frequency sensors
MEDIUM_FREQ_PERCENT = 0.3  # 30% medium-frequency sensors
LOW_FREQ_PERCENT = 0.5  # 50% low-frequency sensors

# Sensor data type distribution
NUMERIC_PERCENT = 0.8  # 80% numeric sensors
TEXT_PERCENT = 0.2     # 20% text sensors

# Downtime simulation constants
DOWNTIME_PROBABILITY = 0.01  # Probability of a sensor going down at each event
DOWNTIME_DURATION = 60  # Downtime duration in seconds

# Simulation mode: 'realtime' or 'historical'
SIMULATION_MODE = 'realtime'  # Set to 'historical' for historical simulation
START_DATETIME = datetime(2024, 9, 1, 0, 0, 0)  # Start datetime for historical simulation
DELAY_MS = 0  # Delay in milliseconds when generating data as fast as possible

NUM_SENDERS = 10  # Adjust as needed

# Generate sensor IDs
sensor_ids = [generate_device_id(i) for i in range(TOTAL_SENSORS)]

# Assign sensors to frequency categories
random.shuffle(sensor_ids)  # Shuffle sensor IDs to randomize assignments

NUM_HIGH_FREQ_SENSORS = int(TOTAL_SENSORS * HIGH_FREQ_PERCENT)
NUM_MEDIUM_FREQ_SENSORS = int(TOTAL_SENSORS * MEDIUM_FREQ_PERCENT)
NUM_LOW_FREQ_SENSORS = TOTAL_SENSORS - NUM_HIGH_FREQ_SENSORS - NUM_MEDIUM_FREQ_SENSORS

# Use sets so the membership tests below are O(1) instead of scanning a list
high_freq_sensors = set(sensor_ids[:NUM_HIGH_FREQ_SENSORS])
medium_freq_sensors = set(sensor_ids[NUM_HIGH_FREQ_SENSORS:NUM_HIGH_FREQ_SENSORS + NUM_MEDIUM_FREQ_SENSORS])
low_freq_sensors = set(sensor_ids[NUM_HIGH_FREQ_SENSORS + NUM_MEDIUM_FREQ_SENSORS:])

# Assign data types to sensors
random.shuffle(sensor_ids)  # Shuffle again before assigning data types
NUM_NUMERIC_SENSORS = int(TOTAL_SENSORS * NUMERIC_PERCENT)
NUM_TEXT_SENSORS = TOTAL_SENSORS - NUM_NUMERIC_SENSORS

numeric_sensors = set(sensor_ids[:NUM_NUMERIC_SENSORS])
text_sensors = set(sensor_ids[NUM_NUMERIC_SENSORS:])

# Build sensor info dictionary
sensor_info = {}
for sensor_id in sensor_ids:
    if sensor_id in high_freq_sensors:
        category = 'high'
    elif sensor_id in medium_freq_sensors:
        category = 'medium'
    else:
        category = 'low'
    if sensor_id in numeric_sensors:
        data_type = 'numeric'
        last_value = random.uniform(0, 100)
    else:
        data_type = 'text'
        last_value = None
    sensor_info[sensor_id] = {
        'category': category,
        'data_type': data_type,
        'status': 'active',
        'next_event_time': 0.0,  # Initialize to 0 for historical mode
        'last_value': last_value,
    }

# Calculate per-sensor event rates to match RATE_PER_SECOND
# Define weights for each category
total_weight = (NUM_HIGH_FREQ_SENSORS * 1 +
                NUM_MEDIUM_FREQ_SENSORS * 0.1 +
                NUM_LOW_FREQ_SENSORS * 0.01) or 1e-6  # Prevent division by zero

base_rate = RATE_PER_SECOND / total_weight

# Assign rates to sensors
for sensor_id, info in sensor_info.items():
    if info['category'] == 'high':
        info['rate'] = base_rate
    elif info['category'] == 'medium':
        info['rate'] = base_rate * 0.1
    else:  # 'low'
        info['rate'] = base_rate * 0.01
    # Ensure rate is not zero
    if info['rate'] <= 0:
        info['rate'] = 1e-6  # Small non-zero rate

# List of possible text messages
TEXT_MESSAGES = [f"message{str(i).zfill(3)}" for i in range(1, 101)]

def send(sender_id, sensor_ids_chunk, total_events, http_endpoint: str = None, auth=None):
    sys.stdout.write(f"Sender {sender_id} will send data for {len(sensor_ids_chunk)} sensors\n")
    events_sent = 0  # Total events sent by this sender

    try:
        if auth:
            conf = f'https::addr={http_endpoint};tls_verify=unsafe_off;token={auth};'
        else:
            conf = f'http::addr={http_endpoint};'

        sys.stdout.write(conf + '\n')
        with Sender.from_conf(conf) as sender:
            # Initialize a priority queue (min-heap) based on next_event_time
            sensor_heap = []
            for sensor_id in sensor_ids_chunk:
                info = sensor_info[sensor_id]
                # Initialize next_event_time to 0 for historical mode
                if SIMULATION_MODE == 'historical':
                    info['next_event_time'] = 0.0
                else:
                    info['next_event_time'] = time.time()
                heapq.heappush(sensor_heap, (info['next_event_time'], sensor_id))

            while events_sent < total_events and sensor_heap:
                next_event_time, sensor_id = heapq.heappop(sensor_heap)
                info = sensor_info[sensor_id]

                # In historical mode, no need to sleep; process events as fast as possible
                if SIMULATION_MODE != 'historical':
                    current_time = time.time()
                    sleep_time = next_event_time - current_time
                    if sleep_time > 0:
                        time.sleep(sleep_time)
                        current_time = time.time()
                else:
                    current_time = next_event_time  # Use simulated time

                # Check if sensor is active
                if info['status'] == 'active' or info['status'] == 'starting':
                    # Generate data
                    system_name = f"system{str(random.randint(0, 19)).zfill(3)}"
                    value = None
                    text = None

                    if info['data_type'] == 'text':
                        # Text sensor
                        text = random.choice(TEXT_MESSAGES)
                    else:
                        # Numeric sensor
                        last_value = info['last_value']
                        # Small change
                        value = last_value + random.uniform(-1, 1)
                        # Occasional larger jump
                        if random.random() < 0.01:
                            value += random.uniform(-10, 10)
                        info['last_value'] = value

                    # Simulate downtime. Capture the status to report before
                    # resetting it, so a restarted sensor reports 'starting' once.
                    if random.random() < DOWNTIME_PROBABILITY:
                        info['status'] = 'stopping'
                        info['downtime_remaining'] = DOWNTIME_DURATION
                        reported_status = 'stopping'
                    else:
                        reported_status = info['status']  # 'active' or 'starting'
                        info['status'] = 'active'

                    # Determine timestamp
                    if SIMULATION_MODE == 'historical':
                        timestamp = START_DATETIME + timedelta(seconds=next_event_time)
                        timestamp_nanos = TimestampNanos(int(timestamp.timestamp() * 1e9))
                    else:
                        timestamp_nanos = TimestampNanos(int(current_time * 1e9))

                    # Send data to QuestDB
                    sender.row(
                        'plant_sensors',
                        symbols={
                            'system': system_name,
                            'address': sensor_id,
                            'status': reported_status
                        },
                        columns={
                            'value': value,
                            'text': text
                        },
                        at=timestamp_nanos
                    )

                    events_sent += 1

                    # Update next event time
                    interval = 1 / info['rate']
                    info['next_event_time'] = next_event_time + interval

                    # Add sensor back to the heap with updated next_event_time
                    heapq.heappush(sensor_heap, (info['next_event_time'], sensor_id))

                elif info['status'] == 'stopping':
                    # Decrease downtime remaining
                    if SIMULATION_MODE == 'historical':
                        # In historical mode, advance downtime_remaining based on event intervals
                        downtime_elapsed = 1 / info['rate']
                    else:
                        downtime_elapsed = current_time - info.get('last_update_time', current_time)
                    info['downtime_remaining'] -= downtime_elapsed
                    if info['downtime_remaining'] <= 0:
                        info['status'] = 'starting'
                        info['next_event_time'] = next_event_time  # Resume immediately
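                        del info['downtime_remaining']
                        # Drop the stale timestamp so the next downtime starts counting from zero
                        info.pop('last_update_time', None)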
                        # Add sensor back to the heap
                        heapq.heappush(sensor_heap, (info['next_event_time'], sensor_id))
                    else:
                        info['last_update_time'] = current_time
                        # Re-add sensor to the heap with updated downtime
                        info['next_event_time'] = next_event_time + (1 / info['rate'])
                        heapq.heappush(sensor_heap, (info['next_event_time'], sensor_id))

                if DELAY_MS > 0 and SIMULATION_MODE == 'historical':
                    time.sleep(DELAY_MS / 1000.0)  # Optional delay to control ingestion speed

            sys.stdout.write(f"Sender {sender_id} finished sending {events_sent} events\n")

    except IngressError as e:
        sys.stderr.write(f'Sender {sender_id} encountered an error: {e}\n')

def parallel_send(total_events, num_senders: int, http_endpoint, auth):
    events_per_sender = total_events // num_senders
    remaining_events = total_events % num_senders

    sender_events = [events_per_sender] * num_senders
    for i in range(remaining_events):  # Distribute remaining events
        sender_events[i] += 1

    # Distribute sensors among senders
    sensor_ids_chunks = [[] for _ in range(num_senders)]
    for idx, sensor_id in enumerate(sensor_info.keys()):
        sensor_ids_chunks[idx % num_senders].append(sensor_id)

    with Pool(processes=num_senders) as pool:
        sender_ids = range(num_senders)
        pool.starmap(
            send,
            [
                (sender_id, sensor_ids_chunks[sender_id], sender_events[sender_id], http_endpoint, auth)
                for sender_id in sender_ids
            ]
        )

if __name__ == '__main__':
    sys.stdout.write(f'Ingestion started. Connecting to {HTTP_ENDPOINT}\n')
    parallel_send(TOTAL_NUMBER_OF_EVENTS, NUM_SENDERS, HTTP_ENDPOINT, REST_TOKEN)


Ingestion started. Connecting to 172.31.42.41:9000
Sender 0 will send data for 1800 sensors
Sender 1 will send data for 1800 sensors
Sender 3 will send data for 1800 sensors
Sender 2 will send data for 1800 sensors
Sender 5 will send data for 1800 sensors
Sender 6 will send data for 1800 sensors
Sender 4 will send data for 1800 sensors
Sender 7 will send data for 1800 sensors
Sender 8 will send data for 1800 sensors
https::addr=172.31.42.41:9000;tls_verify=unsafe_off;token=qt1H8Mi_vPUpt_R7ByDdJ7_kRIiieNb_BFKdjBDqxp6wh8;
https::addr=172.31.42.41:9000;tls_verify=unsafe_off;token=qt1H8Mi_vPUpt_R7ByDdJ7_kRIiieNb_BFKdjBDqxp6wh8;
https::addr=172.31.42.41:9000;tls_verify=unsafe_off;token=qt1H8Mi_vPUpt_R7ByDdJ7_kRIiieNb_BFKdjBDqxp6wh8;
Sender 9 will send data for 1800 sensors
https::addr=172.31.42.41:9000;tls_verify=unsafe_off;token=qt1H8Mi_vPUpt_R7ByDdJ7_kRIiieNb_BFKdjBDqxp6wh8;
https::addr=172.31.42.41:9000;tls_verify=unsafe_off;token=qt1H8Mi_vPUpt_R7ByDdJ7_kRIiieNb_BFKdjBDqxp6wh8;
https::ad

Process ForkPoolWorker-15:
Process ForkPoolWorker-19:
Process ForkPoolWorker-13:
Process ForkPoolWorker-16:
Process ForkPoolWorker-14:
Process ForkPoolWorker-18:
Process ForkPoolWorker-20:


KeyboardInterrupt: 