In this tutorial, we walk through building a real-time anomaly detection pipeline for stock price data using TurboML. Our setup enables continuous ingestion of intraday stock data from a data source (e.g., Alpha Vantage), transformation of that data into meaningful features, and immediate Random Cut Forest (RCF)-based anomaly scoring.
Financial markets move quickly. Anomalies in stock prices—such as sudden spikes, precipitous drops, or unusual volume—can indicate breaking news, market manipulation, or buy/sell opportunities. Traditional batch ML approaches take hours or days to process new data, missing time-sensitive signals. In contrast, TurboML provides a streaming-first approach:
- Pull or push data streams in near real-time.
- Continuously compute features, such as rolling price-volume metrics.
- Run incremental, online ML algorithms and produce immediate anomaly scores.
The outcome: a highly responsive pipeline for finance professionals to spot anomalies and act quickly.
Use Case Overview
Goal: Detect anomalies in stock trading data in near real time.
Data: Intraday stock quotes (price, volume) retrieved from Alpha Vantage or a similar financial API.
Approach:
- Ingest streaming stock data into TurboML.
- Compute rolling aggregates and transformations in real time.
- Deploy an unsupervised Random Cut Forest model.
- Retrieve anomaly scores with minimal latency.
We assume:
- Alpha Vantage (or another finance API) provides a continuous stream of intraday data. Each record contains:
  - timestamp (e.g., 1-minute intervals)
  - symbol (e.g., AAPL, MSFT)
  - price
  - volume
- TurboML can ingest these records in two main ways:
- Push-based ingestion (you explicitly upload new data frames or single records).
- Pull-based ingestion (TurboML connectors watch for new data).
For demonstration, we’ll use push-based ingestion from the API, as sketched below.
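Here’s a minimal push-based ingestion sketch in Python. It assumes a TurboML client API along the lines of the quickstart (`OnlineDataset.from_pd` to register a dataset, `add_pd` to append new rows); the `fetch_intraday` helper and the `stock_quotes` dataset id are hypothetical, and argument names may differ across versions.

```python
import pandas as pd
import turboml as tb

def fetch_intraday(symbol: str) -> pd.DataFrame:
    """Hypothetical helper: call the Alpha Vantage intraday endpoint and
    return a DataFrame with timestamp, symbol, price, and volume columns."""
    ...

# Register the dataset once (dataset id and key field are illustrative).
quotes = tb.OnlineDataset.from_pd(
    df=fetch_intraday("AAPL"),
    id="stock_quotes",
    key_field="timestamp",
    load_if_exists=True,
)

# Then, on a schedule (e.g., once per minute), push newly fetched records.
quotes.add_pd(fetch_intraday("AAPL"))
```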
Anomaly detection often benefits from derived features. For stock data, we might add:
- price_volume_prod = price * volume
- rolling_volume_1h = sum of volume in the past hour
- Possibly more advanced transformations or lookups, such as:
  - Rolling standard deviations of price
  - Weighted averages or volatility indicators
In TurboML, these transformations can be defined once, then seamlessly applied to both batch (historical) and streaming (real-time) data.
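As a sketch of what those definitions might look like (mirroring the SQL and aggregate feature calls in TurboML’s quickstart; exact signatures may vary by version, and the `quotes` dataset comes from the ingestion sketch above):

```python
# Register the event-time column so windowed aggregates can be computed.
quotes.feature_engineering.register_timestamp(
    column_name="timestamp", format_type="epoch_seconds"
)

# price_volume_prod = price * volume, defined as a SQL expression feature.
quotes.feature_engineering.create_sql_features(
    sql_definition='"price" * "volume"',
    new_feature_name="price_volume_prod",
)

# rolling_volume_1h = sum of volume over a 1-hour window, per symbol.
quotes.feature_engineering.create_aggregate_features(
    column_to_operate="volume",
    column_to_group="symbol",
    operation="SUM",
    new_feature_name="rolling_volume_1h",
    timestamp_column="timestamp",
    window_duration=1,
    window_unit="hours",
)
```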
Before going live, you’ll likely test and refine your feature definitions and model offline using historical data, for example:
- Pull stock price data from a past time window (e.g., last month).
- Define features in TurboML with SQL or rolling aggregates.
- Export the computed features to a local DataFrame.
- Train an RCF or other anomaly detection model on that snapshot.
- Evaluate the model (if you have partial ground-truth anomalies) or do an unsupervised evaluation (e.g., distribution analysis).
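A compressed offline sketch of the export-and-train steps above (assuming a local dataset and batch-learning interface like TurboML’s `LocalDataset` with `learn`/`predict`; RCF is unsupervised, so no labels are passed here, though your version may expect a labels argument):

```python
# Export the locally computed features to a pandas DataFrame for inspection.
features_df = quotes.feature_engineering.get_local_features()

# Wrap the snapshot as a local dataset and select the model inputs.
numerical_fields = ["price", "volume", "price_volume_prod", "rolling_volume_1h"]
local_quotes = tb.LocalDataset.from_pd(df=features_df, key_field="timestamp")
inputs = local_quotes.get_model_inputs(numerical_fields=numerical_fields)

# Train an RCF on the historical snapshot and score it for evaluation.
model = tb.RCF(number_of_trees=50)
trained = model.learn(inputs)
scores = trained.predict(inputs)
```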
A key advantage of TurboML is that the same feature definitions (SQL expressions, rolling windows, Python UDFs, etc.) used offline can be used unchanged for real-time data streams. This eliminates duplicating logic between batch prototypes and production streaming code.
After validating the approach in batch mode, you can deploy a streaming pipeline for real-time data:
- Data Source: Configure TurboML to ingest live or near-live intraday data from the Alpha Vantage endpoint (or from a local job that periodically calls the API).
- Materialize Features: once the feature definitions are materialized, TurboML computes rolling aggregates and SQL transformations continuously as new records arrive.
- Deploy RCF Model: The model receives each new feature vector and outputs an anomaly score immediately.
- Retrieve: Subscribe to real-time outputs, or call the model’s synchronous prediction endpoint for single-record inference.
Because we rely on the same definitions and the same pipeline, there’s no discrepancy between offline and real-time data transformations.
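A deployment sketch under the same assumptions (the `materialize_features` and `deploy` calls mirror TurboML’s quickstart; your version may additionally require a labels argument on `deploy`, even for unsupervised models):

```python
# Promote the feature definitions from local to continuous, streaming computation.
quotes.feature_engineering.materialize_features(
    ["price_volume_prod", "rolling_volume_1h"]
)

# Deploy the RCF model against the live feature stream.
features = quotes.get_model_inputs(numerical_fields=numerical_fields)
deployed_rcf = model.deploy(name="stock_rcf", input=features)
```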
- Stock Data continuously arrives from your API.
- TurboML ingests it and materializes the rolling and SQL features.
- The RCF model immediately processes the new features and scores each record.
- You retrieve the anomaly scores in near real time, enabling you to raise alerts or display them on a dashboard.
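Retrieving scores might look like this (the `get_outputs()` accessor and the `score` attribute are taken from TurboML’s quickstart example; treat the `key` attribute as an assumption):

```python
# Fetch the latest batch of anomaly scores from the deployed model.
outputs = deployed_rcf.get_outputs()
for record in outputs[-10:]:
    print(record.key, record.score)
```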
- Enhance Feature Engineering: Add more domain-specific features, e.g., rolling volatility or aggregated news sentiment.
- Evaluate Accuracy: Compare anomaly scores against partial ground-truth anomalies, or label certain abrupt price changes as a reference set.
- Implement Drift Detection: Keep an eye on data distribution changes or shifts in the RCF anomaly output.
- Set Alerts: Integrate with Slack, PagerDuty, or custom dashboards to notify stakeholders of high anomaly scores.
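For alerting, a minimal plain-Python loop could forward high scores to a Slack webhook (the webhook URL and the 0.9 threshold are placeholders you would tune against your own score distribution):

```python
import requests

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/..."  # placeholder
SCORE_THRESHOLD = 0.9  # illustrative; calibrate against observed scores

for record in deployed_rcf.get_outputs():
    if record.score > SCORE_THRESHOLD:
        requests.post(
            SLACK_WEBHOOK_URL,
            json={"text": f"Anomaly score {record.score:.3f} for {record.key}"},
        )
```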
Happy Streaming & Detecting!