<div>
    <div>
        <img src="https://scidx.sci.utah.edu/wp-content/uploads/2024/12/logo-sm.png" alt="scidx Logo"/>
        <img src="https://nationaldataplatform.org/National_Data_Platform_horiz_stacked.svg" alt="NDP Logo" width="400" style="padding-left:100px"/>
    </div>
</div>

# SciDX Streaming Capabilities Demonstration: API Streaming 

This demonstration showcases the **SciDX Streaming capabilities**, leveraging both the **SciDX NDP Endpoint Library** for managing data objects and the **Streaming Library** for real-time data streaming and processing of an online API stream.

**`SciDX NDP Endpoint Library`:** Registers, discovers, and manages data objects through the NDP endpoint API (acts as the Data Provider). <br>
**`Streaming Library`:** Creates, manages, and consumes real-time data streams, including applying filters and consuming messages (acts as the Data Consumer).

<br>

## **Setup**: Client preparation

1. **Import necessary modules** for handling data streams:

   a. Import `StreamingClient` from the `scidx_streaming` module to handle streaming functionality <br>
   b. Import `APIClient` from the `pointofpresence` module for API interactions


In [None]:
from scidx_streaming import StreamingClient
from pointofpresence import APIClient

2. **API Authentication Token Setup**

    1. Navigate to https://token.ndp.utah.edu
    
    2. If not already authenticated:
       - Select the `CILogon` button
       - Choose your institution from the Identity Provider list
       - Complete the institutional login process
    
    3. Upon successful authentication, you will be redirected to token.ndp.utah.edu
    
    4. Locate the `Access Token` field and copy the token value
    
    5. Replace `<your_token>` in the configuration below with your copied access token

In [None]:
TOKEN="<your_token>"
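Hard-coding a token in a notebook makes it easy to leak when the notebook is shared. As a minimal sketch, assuming the token was exported beforehand in an environment variable called `NDP_TOKEN` (a hypothetical name chosen here, not something the libraries require), it could be read like this:

```python
import os

def load_token(env_var="NDP_TOKEN", placeholder="<your_token>"):
    """Return the token from the environment, or the placeholder if unset.

    `NDP_TOKEN` is a hypothetical variable name chosen for this sketch.
    """
    token = os.environ.get(env_var, "").strip()
    return token if token else placeholder

TOKEN = load_token()
if TOKEN == "<your_token>":
    print("No token found in the environment; paste it manually instead.")
```

If the variable is not set, `TOKEN` keeps the placeholder value, so the manual copy-and-paste route described above still applies.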

3. **Initialize the API and streaming clients** to register, discover, and consume data streams.

    a. `APIClient`: handle data registration and discovery. <br>
    b. `StreamingClient`: handle real-time data streams.

In [None]:
# The URL of the API Client to connect to the NDP endpoint service
API_URL="155.101.6.191:8003"

# Initialize the NDP endpoint client for data registration and discovery
client = APIClient(base_url=API_URL, token=TOKEN)

# Initialize the Streaming client for real-time data streaming
streaming = StreamingClient(client)
print(f"Streaming Client initialized. User ID: {streaming.user_id}")

## **Basic Usage**

**Data Provider**: Register data stream 

1. Registering the data source metadata.
2. Verifying its discoverability.

> **Note:**  
> Each registration requires defining `metadata`, which includes:  
> - Basic information: Name, title, and organization ID.  
> - Source details: Specific configurations related to the data source.  
> - Mapping: Selecting and renaming relevant fields.  
> - Processing rules: Extracting data from structured responses.

**Data Consumer**:
1. Discover the registered data sources and optionally apply filters to create a custom data stream.
2. Subscribe to and consume the custom data stream in real-time.
3. Process and visualize the incoming data dynamically.


<br>


#### **1**. Register an **`API Stream`** from an external API

In this step, we use the `NDP endpoint client` and a metadata dictionary to register an online **API Stream** with our NDP endpoint. Once registered, the stream can be discovered and consumed dynamically.

In [None]:
# Register the Streaming data with the NDP endpoint client
api_stream_metadata = {
    "resource_name": "api_stream_example_sage",
    "resource_title": "Example API Stream",
    "owner_org": "saleem_test",
    "resource_url": "https://data.sagecontinuum.org/api/v0/stream?name=wxt.wind.direction",
    "file_type": "stream",
    "notes": "Some additional notes about the resource."
}

try:
    print(client.register_url(api_stream_metadata))
    print('Correctly registered')
except ValueError as e: # If the dataset already exists just show the error
    print(e)
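Before calling `register_url`, it can help to check the metadata dictionary locally. The sketch below only inspects the keys used in the example above. Treating all of them as mandatory is an assumption made here, and `missing_metadata_keys` is a hypothetical helper, not part of either library.

```python
# Keys used by the registration example above. Treating all of them as
# mandatory is an assumption of this sketch, not a documented rule.
EXPECTED_KEYS = ("resource_name", "resource_title", "owner_org",
                 "resource_url", "file_type")

def missing_metadata_keys(metadata, expected=EXPECTED_KEYS):
    """Return the expected keys that are absent or empty in `metadata`."""
    return [key for key in expected if not metadata.get(key)]

example = {
    "resource_name": "api_stream_example_sage",
    "resource_title": "Example API Stream",
    "owner_org": "saleem_test",
    "resource_url": "https://data.sagecontinuum.org/api/v0/stream?name=wxt.wind.direction",
    "file_type": "stream",
}

problems = missing_metadata_keys(example)
print("Missing keys:", problems or "none")
```

A non-empty result points at a typo in the dictionary before any request is sent to the endpoint.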

#### **2**. Search for the registered entry

Now that we have registered a data source, we will:
1. Use the **NDP endpoint client** to search for datasets using the `search_datasets` method.
2. Verify that the **registered data stream** is correctly stored and available for discovery.
3. Confirm metadata accuracy before data consumption.

This ensures the dataset is discoverable for use by the Data Consumers.

In [None]:
# Search for the registered API stream
search_results = client.search_datasets("api_stream_example_sage", server="local")
print(f"Number of datasets found: {len(search_results)}")

#### **3**. Create a Data Stream from the registered entry

Now we will leverage the functionalities of `Streaming Client` to consume the data stream registered by `API Client`.

The `create_kafka_stream` function searches for datasets matching the provided keywords, applies filtering semantics, and creates a real-time Kafka stream for consumption.

Function parameters:
- **`keywords`**: List of keywords to filter relevant datasets.
- `filter_semantics`: Optional. Defines filtering criteria for datasets.
- `match_all`: Optional. If `True`, only data sources matching all the keywords are selected.
- `username`: Optional username for authentication in protected data sources.
- `password`: Optional password for authentication in protected data sources.


In [None]:
# Create a Kafka stream without filters
stream = await streaming.create_kafka_stream(
    keywords=["api_stream_example_sage"]
)

# Retrieve the stream's topic name
topic = stream.data_stream_id
print(f"Stream created: {topic}")

#### **4**. Consume the Streamed Data 

Now that we have successfully created a Kafka data stream, we transition to real-time data consumption. This step involves:

1. Initializing a Kafka consumer: Passing the data stream topic to the `consume_kafka_messages` function.
2. Listening for incoming messages: Continuously receiving new data in real time.
3. Processing and updating the data dynamically: Messages are appended to a DataFrame for analysis or visualization.

In [None]:
# Start consuming the Kafka stream (no filters were applied in this section)
consumer = streaming.consume_kafka_messages(topic)

**Note**: It may take a few seconds for data to populate due to real-time processing.

In [None]:
# After a few seconds, inspect the messages collected so far
consumer.dataframe
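Because the data takes a few seconds to arrive, re-running the cell by hand can be replaced with a small polling helper. This is a sketch that relies only on the `dataframe` attribute shown above and assumes it supports `len()`, as a pandas DataFrame does. `wait_for_rows` is a hypothetical helper, not part of the Streaming Library.

```python
import asyncio
import time

async def wait_for_rows(consumer, min_rows=1, timeout=30.0, poll=0.5):
    """Poll `consumer.dataframe` until it holds `min_rows` rows or time runs out.

    Returns True if the rows arrived, False on timeout.
    """
    def has_rows():
        frame = consumer.dataframe
        return frame is not None and len(frame) >= min_rows

    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if has_rows():
            return True
        # Yield to the event loop so background tasks can keep running
        await asyncio.sleep(poll)
    return has_rows()
```

In a notebook cell this would be used as `await wait_for_rows(consumer)`, followed by `consumer.dataframe` once it returns `True`. The helper is asynchronous so that waiting does not block other work on the notebook's event loop.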

#### **5**. Stop Data Consumption and Clean up

To wrap up, we will: 
1. Stop the data consumer to halt data processing.
2. Delete the created stream from the Kafka topic using the Streaming client.
3. Remove the registered dataset using the NDP endpoint client.

This ensures all resources and background tasks are properly released.

In [None]:
# Stop the Kafka consumer
consumer.stop()

# Delete the Kafka stream
await streaming.delete_stream(stream)

# Delete the registered dataset from the NDP endpoint system
client.delete_resource_by_id(search_results[0]["id"])
print("Cleanup completed: Stream and registered dataset deleted.")
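The cleanup cell indexes `search_results[0]`, which raises an `IndexError` if the earlier search returned nothing. A minimal guard, assuming only that the results form a list of dictionaries with an `id` key as used above (`first_result_id` is a hypothetical helper):

```python
def first_result_id(results):
    """Return the `id` of the first search result, or None if there is none."""
    if not results:
        return None
    return results[0]["id"]

# Hypothetical search results, shaped like the `search_results` list above
sample_results = [{"id": "abc-123"}]
print(first_result_id(sample_results))
print(first_result_id([]))
```

In the notebook, the returned value can be checked for `None` before it is passed to `client.delete_resource_by_id`.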

<br><br>

## **Advanced Usage**: Apply transformations and filters to the consumed data

`SciDX` provides a powerful filtering system that allows users to refine their data streams by applying custom filtering conditions before consuming them. These filters enable real-time data selection based on specific rules, comparisons, and logical expressions.

In this section, we will: 
1. **Register custom data streams** (via API) using the SciDX framework.
2. **Discover and apply filters** to customize the data stream before consumption.
3. **Consume and visualize real-time data streams**.

#### **1**. Register and Pre-process a stream

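This time we add a `mapping` to the metadata. It selects the fields of interest from each record and renames them; for example, the `value` field becomes the `direction` column.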

In [None]:
# Register the Streaming data with the NDP endpoint client
api_stream_metadata = {
    "resource_name": "api_stream_example_sage_advanced",
    "resource_title": "Example API Stream",
    "owner_org": "saleem_test",
    "resource_url": "https://data.sagecontinuum.org/api/v0/stream",
    "file_type": "stream",
    "notes": "Some additional notes about the resource.",
    "mapping": {
        "timestamp": "timestamp",
        "name": "name",
        "direction": "value",
        "vsn": "meta.vsn",
        "sensor": "meta.sensor",
        "units": "meta.units"
    }
}

try:
    print(client.register_url(api_stream_metadata))
    print('Correctly registered')
except ValueError as e: # If the dataset already exists just show the error
    print(e)
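To make the `mapping` easier to read, the sketch below applies it to a single record in plain Python. It assumes that each key is an output column name and each value is a dotted path into the incoming JSON record, which is what the example suggests. How the endpoint resolves the mapping internally is not shown in this notebook, and the sample record is made up for illustration.

```python
def get_path(record, dotted_path):
    """Follow a dotted path such as "meta.vsn" through nested dicts."""
    value = record
    for key in dotted_path.split("."):
        if not isinstance(value, dict) or key not in value:
            return None  # Path does not exist in this record
        value = value[key]
    return value

def apply_mapping(record, mapping):
    """Build a flat row: output column name -> value found at the source path."""
    return {column: get_path(record, path) for column, path in mapping.items()}

# Hypothetical record; the field values are made up for illustration
record = {
    "timestamp": "2024-01-01T00:00:00Z",
    "name": "wxt.wind.direction",
    "value": 182.0,
    "meta": {"vsn": "W000", "sensor": "example-sensor", "units": "degrees"},
}
mapping = {
    "timestamp": "timestamp",
    "name": "name",
    "direction": "value",
    "vsn": "meta.vsn",
    "sensor": "meta.sensor",
    "units": "meta.units",
}

row = apply_mapping(record, mapping)
print(row)
```

The result is a flat row in which the nested `meta` fields become top-level columns and `value` appears under the name `direction`.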

#### **2**. Search for the registered entry

Now that we have registered a data source, we will:
1. Use the **NDP endpoint client** to search for datasets using the `search_datasets` method.
2. Verify that the **registered data stream** is correctly stored and available for discovery.
3. Confirm metadata accuracy before data consumption.

This ensures the dataset is discoverable for use by the Data Consumers.

In [None]:
# Search for the registered API stream
search_results = client.search_datasets("api_stream_example_sage_advanced", server="local")
print(f"Number of datasets found: {len(search_results)}")

#### **3**. Create a Data Stream from the registered entry with Filters

Now we will leverage the functionalities of `Streaming Client` to consume the data stream registered by `API Client`.

The `create_kafka_stream` function searches for datasets matching the provided keywords, applies filtering semantics, and creates a real-time Kafka stream for consumption.

Function parameters:
- **`keywords`**: List of keywords to filter relevant datasets.
- **`filter_semantics`**: Defines filtering criteria for datasets.
- **`match_all`**: Optional. If `True`, only data sources matching all the keywords are selected.
- `username`: Optional username for authentication in protected data sources.
- `password`: Optional password for authentication in protected data sources.

The filtering capabilities allow us to refine the data stream by applying conditions, alerts, and transformations.

#### Filtering capabilities: 

| **Type**                        | **Explanation**                                             | **Example**                                       |
|---------------------------------|-------------------------------------------------------------|---------------------------------------------------|
| Column Comparisons              | Column-to-column comparisons                                | `x > y`                                           |
| Mathematical Operations         | Addition, subtraction, multiplication and division          | `x > 10*y`                                        |
| IN Operator                     | Check if values are in a list                               | `station IN ['A', 'B']`                           |
| Conditional Logic (IF-THEN-ELSE)| Apply rules based on conditional statements                 | `IF x > 20 THEN alert = High ELSE y = 10`         |
| Logical Operators (AND, OR)     | Combine multiple conditions using AND and OR operators       | `IF x > 10 OR z = 20 THEN alert = High ELSE alert = Low` |
| Window-Based Filtering          | Calculate aggregates (mean, sum, max, min) over sliding windows | `IF window_filter(9, sum, x > 20) THEN alert = High` |


In [None]:
filters = [
    "name = wxt.wind.direction"
]

# Create a filtered Kafka stream from the registered API stream
stream = await streaming.create_kafka_stream(
    keywords=["api_stream_example_sage_advanced"],
    match_all=True,
    filter_semantics=filters
)

# Retrieve the stream's topic name
topic = stream.data_stream_id
print(f"Stream created: {topic}")

`SciDX` does not remove rows from the data stream by default. Instead:

- If a row does not meet the filtering criteria, its affected column values are set to null.
- All rows are still sent, even if they contain null values.
- Users can later apply actions to tailor the response, such as `delete_null_rows`, which removes rows where some columns contain null values.
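The behavior described above can be illustrated in plain Python. This is an emulation of the stated semantics on made-up rows, not SciDX's implementation. `apply_condition` and `drop_null_rows` are hypothetical names, and which columns count as "affected" by a real filter is assumed here to be the column the condition tests.

```python
def apply_condition(rows, column, predicate):
    """Keep every row, but null out `column` where the predicate fails."""
    result = []
    for row in rows:
        updated = dict(row)  # Copy so the input rows stay untouched
        if not predicate(row):
            updated[column] = None
        result.append(updated)
    return result

def drop_null_rows(rows):
    """Remove rows in which any column is None."""
    return [row for row in rows if all(v is not None for v in row.values())]

# Made-up rows for illustration
rows = [
    {"station": "A", "x": 25},
    {"station": "B", "x": 5},
]

nulled = apply_condition(rows, "x", lambda row: row["x"] > 20)
kept = drop_null_rows(nulled)
print(nulled)
print(kept)
```

After the condition, both rows are still present but the second has a null `x`; only the follow-up step removes it.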

#### **4**. Consume the Streamed Data

Now that we have successfully created a **filtered** Kafka data stream, we transition to real-time data consumption. This step involves:

1. Initializing a Kafka consumer: Passing the data stream topic to the `consume_kafka_messages` function.
2. Listening for incoming messages: Continuously receiving new filtered data in real time.
3. Processing and updating the data dynamically: Messages are appended to a DataFrame for analysis or visualization.

In [None]:
# Start consuming the filtered Kafka stream
consumer = streaming.consume_kafka_messages(topic)

**Note**: It may take a few seconds for data to populate due to real-time processing.

In [None]:
consumer.dataframe

#### **5**. Stop Data Consumption and Clean up

To wrap up, we will: 
1. Stop the data consumer to halt data processing.
2. Delete the created stream from the Kafka topic using the Streaming client.
3. Remove the registered dataset using the NDP endpoint client.

This ensures all resources and background tasks are properly released.

In [None]:
# Stop the Kafka consumer
consumer.stop()

# Delete the Kafka stream
await streaming.delete_stream(stream)

# Delete the registered dataset from the NDP endpoint system
client.delete_resource_by_id(search_results[0]["id"])
print("Cleanup completed: Stream and registered dataset deleted.")
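If one cleanup step fails, the remaining steps in the cell above never run, which can leave a stream or dataset behind. The sketch below runs each step independently and collects the failures. `run_cleanup` is a hypothetical helper that accepts both regular and awaitable steps, since `delete_stream` is awaited in this notebook.

```python
import inspect

async def run_cleanup(steps):
    """Run every (label, callable) step in order, even if some fail.

    Returns a list of (label, exception) pairs for the steps that raised.
    """
    failures = []
    for label, step in steps:
        try:
            result = step()
            if inspect.isawaitable(result):
                await result
        except Exception as exc:  # Keep going so later resources are still released
            failures.append((label, exc))
    return failures
```

In a notebook cell it could be called as `await run_cleanup([("consumer", consumer.stop), ("stream", lambda: streaming.delete_stream(stream)), ("dataset", lambda: client.delete_resource_by_id(search_results[0]["id"]))])`, and an empty return value means every step succeeded.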