# 📈 Real-Time Stock Price ELT & Analytics Pipeline with Kafka, Snowflake & Streamlit

This project illustrates how to stream real-time stock price data from Yahoo Finance into Snowflake using Kafka and visualize it using a Snowflake-hosted Streamlit app.

---

## 🚀 Overview

- **Setup**: Set up Kafka and the Snowflake Kafka Connector for real-time data streaming.
- **Ingest**: Publish live stock data from Yahoo Finance to a Kafka topic.
- **Load**: Stream the data into a Snowflake table using the Kafka Connector, and Snowflake Streams and Tasks.
- **Transform**: Run real-time or near-real-time analytics on stock prices in Snowflake.
- **Visualize**: Deliver live metrics in a Streamlit dashboard.

---


## Set up Kafka and Snowflake Kafka Connector for real-time data streaming


1. In a Snowflake worksheet, create the database, schema, and target table, then create a dedicated Kafka role and Kafka Connect user and grant the required privileges.

```sql
USE ROLE SYSADMIN;

CREATE OR REPLACE DATABASE KAFKA_STREAMING;

CREATE OR REPLACE SCHEMA YAHOO_FINANCE;

-- Create target table
CREATE OR REPLACE TABLE KAFKA_STREAMING.YAHOO_FINANCE.STOCK_PRICES (
  symbol STRING,
  price FLOAT,
  currency STRING,
  time STRING
);

-- Create and grant a custom kafka role

USE ROLE ACCOUNTADMIN;

CREATE ROLE kafka_role;

-- Grant required permissions
GRANT ROLE KAFKA_ROLE TO ROLE SYSADMIN;


GRANT USAGE ON DATABASE KAFKA_STREAMING TO ROLE kafka_role;
GRANT USAGE ON SCHEMA KAFKA_STREAMING.YAHOO_FINANCE TO ROLE kafka_role;
GRANT INSERT ON TABLE KAFKA_STREAMING.YAHOO_FINANCE.STOCK_PRICES TO ROLE kafka_role;

GRANT OWNERSHIP ON DATABASE KAFKA_STREAMING TO ROLE kafka_role REVOKE CURRENT GRANTS;
GRANT OWNERSHIP ON SCHEMA KAFKA_STREAMING.YAHOO_FINANCE TO ROLE kafka_role REVOKE CURRENT GRANTS;
GRANT OWNERSHIP ON TABLE KAFKA_STREAMING.YAHOO_FINANCE.STOCK_PRICES TO ROLE kafka_role REVOKE CURRENT GRANTS;

-- Create kafka connector user
CREATE USER kafka_connector_user
  PASSWORD = '****'
  DEFAULT_ROLE = kafka_role
  MUST_CHANGE_PASSWORD = FALSE;

-- Assign role to user
GRANT ROLE kafka_role TO USER kafka_connector_user;

SHOW USERS IN ACCOUNT;
```

2. Download Kafka to your local machine from https://kafka.apache.org/downloads.

3. Start `zookeeper` in a new terminal.

```bash
cd kafka/bin
./zookeeper-server-start.sh ../config/zookeeper.properties
```

4. Start the `kafka server` in a new terminal.

```bash
cd kafka/bin
./kafka-server-start.sh ../config/server.properties
```

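5. Create the `kafka topic` in a new terminal. The command connects to the broker with `--bootstrap-server`, because the `--zookeeper` flag was removed from `kafka-topics.sh` in Kafka 3.0.

```bash
cd kafka/bin
./kafka-topics.sh --create --topic yahoo-finance-topic --bootstrap-server localhost:9092 --partitions 2 --replication-factor 1
```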







6. Set up private key `Authentication` for the Snowflake Kafka Connector

- Generate the `private key`. In a new terminal:

```bash
openssl genrsa 2048 | openssl pkcs8 -topk8 -v2 des3 -inform PEM -out snowflake-kafka-connector-private-rsa-key.p8 -nocrypt
```

Remove the `-nocrypt` option to obtain an encrypted private key; OpenSSL then prompts for an encryption passphrase.

- Generate `public key`.

```bash
openssl rsa -in snowflake-kafka-connector-private-rsa-key.p8 -pubout -out snowflake-kafka-connector-public-rsa-key.pub
```

Save the keys for later use.

7. Add the public key to the `kafka_connector_user`. In the SQL worksheet from step 1:

```sql
ALTER USER kafka_connector_user SET RSA_PUBLIC_KEY='MIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIB.....';
```
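The value passed to `RSA_PUBLIC_KEY` here, and to `snowflake.private.key` in the connector configuration, is the Base64 body of the key file on a single line, without the `-----BEGIN ...-----` and `-----END ...-----` delimiter lines. A minimal sketch of that clean-up follows; `pem_body` is a hypothetical helper name, and the sample key text is a shortened placeholder rather than a real key.

```python
def pem_body(pem_text: str) -> str:
    """Return the Base64 body of a PEM file as a single line.

    Drops the '-----BEGIN ...-----' / '-----END ...-----' delimiter lines
    and joins the remaining lines without separators.
    """
    lines = [line.strip() for line in pem_text.splitlines()]
    return "".join(line for line in lines if line and not line.startswith("-----"))


# Shortened placeholder, not a usable key.
sample = """-----BEGIN PUBLIC KEY-----
MIIBIjANBgkq
hkiG9w0BAQEF
-----END PUBLIC KEY-----
"""

print(pem_body(sample))  # MIIBIjANBgkqhkiG9w0BAQEF
```

To use it on the real files, read `snowflake-kafka-connector-public-rsa-key.pub` (or the `.p8` private key) into a string and pass it to `pem_body`.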

8. Set up the Snowflake Kafka Connector

- Download the `Snowflake Kafka Connector` jar file from the [Maven Repository](https://mvnrepository.com/artifact/com.snowflake/snowflake-kafka-connector/3.1.0) and copy it into the `kafka/libs/` folder.

- Start the `Kafka Connector` on your local machine. In a new terminal:

```bash
cd kafka/bin
./connect-distributed.sh ../config/connect-distributed.properties
```

- Create the Kafka Connect `Sink Connector Configuration` file (`snowflake-kafka-connector-config.json`). Replace `ACCOUNT-IDENTIFIER` in `snowflake.url.name` with your Snowflake account identifier. JSON does not allow comments, so keep the file free of inline notes.

```json
{
  "name": "snowflake-kafka-yahoo-finance-connector",
  "config": {
    "connector.class": "com.snowflake.kafka.connector.SnowflakeSinkConnector",
    "tasks.max": "1",
    "topics": "yahoo-finance-topic",

    "snowflake.url.name": "ACCOUNT-IDENTIFIER.snowflakecomputing.com:443",
    "snowflake.user.name": "kafka_connector_user",
    "snowflake.private.key": "MIIEuwIBADANBgkqhkiG9w0BAQEFAASCBKUwggShA.....",
    "snowflake.database.name": "KAFKA_STREAMING",
    "snowflake.schema.name": "YAHOO_FINANCE",
    "snowflake.table.name": "stock_prices",
    "snowflake.role.name": "kafka_role",

    "key.converter": "org.apache.kafka.connect.storage.StringConverter",
    "value.converter": "org.apache.kafka.connect.json.JsonConverter",
    "value.converter.schemas.enable": "false",

    "buffer.count.records": "1000",
    "buffer.flush.time": "10",
    "buffer.size.bytes": "5000000",

    "behavior.on.null.values": "IGNORE"
  }
}
```

9. Deploy the configuration and create the Kafka Connector via the REST API. In a new terminal:

```bash
curl -X POST -H "Content-Type: application/json" --data @snowflake-kafka-connector-config.json http://localhost:8083/connectors
```
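After the `POST`, the connector's state can be checked through the same Kafka Connect REST API with `GET /connectors/<name>/status`. The sketch below assumes the worker address and connector name used above; `is_healthy` and `fetch_status` are hypothetical helper names.

```python
import json
from urllib.request import urlopen

CONNECT_URL = "http://localhost:8083"  # same worker address as the curl call
CONNECTOR = "snowflake-kafka-yahoo-finance-connector"  # "name" from the config file


def is_healthy(status: dict) -> bool:
    """True when the connector and every listed task report RUNNING."""
    states = [status["connector"]["state"]]
    states += [task["state"] for task in status.get("tasks", [])]
    return all(state == "RUNNING" for state in states)


def fetch_status(name: str = CONNECTOR, base_url: str = CONNECT_URL) -> dict:
    """Call GET /connectors/<name>/status on the Kafka Connect REST API."""
    with urlopen(f"{base_url}/connectors/{name}/status") as response:
        return json.load(response)
```

With the Connect worker running, `print(is_healthy(fetch_status()))` gives a quick yes/no answer. A `FAILED` task usually carries a `trace` field in the status response that explains the error.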

## Publish live stock data from Yahoo Finance to yahoo-finance-topic Kafka topic


Create `kafka producer` to fetch live stock prices and publish them to Kafka.

1. Install required Python packages.

```bash
pip install yfinance confluent_kafka
```

2. Create the Python script `kafka-producer.py` to fetch the data and publish it to the topic.

```python
import yfinance as yf 
from confluent_kafka import Producer 
import json
import time

# Kafka config
producer = Producer({'bootstrap.servers': 'localhost:9092'})
topic = 'yahoo-finance-topic'

# List of stock symbols to track
symbols = ['SNOW', 'AMZN', 'GOOGL', 'MSFT']

def acked(err, msg):
    if err is not None:
        print(f"Failed to deliver message: {err}")
    else:
        print(f"Published to {msg.topic()} [{msg.partition()}] @ offset {msg.offset()}")

while True:
    for symbol in symbols:
        stock = yf.Ticker(symbol)
        data = stock.info  # full info
        
        message = {
            'symbol': symbol,
            'price': data.get('regularMarketPrice'),
            'currency': data.get('currency'),
            'time': time.strftime('%Y-%m-%d %H:%M:%S'),
        }

        producer.produce(topic, value=json.dumps(message), key=symbol, callback=acked)
    
    producer.flush()
    time.sleep(30)  # Fetch every 30 seconds
```

3. Run kafka-producer.py script.

```bash
python kafka-producer.py
```
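The producer script publishes whatever `data.get('regularMarketPrice')` returns, so a missing field ends up as a `null` price in the topic and a `NULL` in Snowflake. One possible guard, sketched under the assumption that skipping such a symbol for one polling cycle is acceptable, is to build the payload in a small function that returns `None` when no price is available. `build_message` is a hypothetical name, not part of the original script.

```python
import time
from typing import Optional


def build_message(symbol: str, info: dict) -> Optional[dict]:
    """Build the Kafka payload, or return None when no price is available."""
    price = info.get("regularMarketPrice")
    if price is None:
        return None  # skip this symbol for this polling cycle
    return {
        "symbol": symbol,
        "price": float(price),
        "currency": info.get("currency"),
        "time": time.strftime("%Y-%m-%d %H:%M:%S"),
    }
```

In the loop, the call to `producer.produce(...)` would then run only when `build_message(symbol, stock.info)` returns a dictionary.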

## Stream the data into a Snowflake table using the Kafka Connector


Once the Kafka Producer is publishing Yahoo Finance data to the Kafka topic and the Snowflake Kafka Connector is running, Snowflake will automatically ingest data.

By default, the Snowflake Kafka Connector does not write directly to the specified target table `stock_prices` unless the Kafka message schema exactly matches the table schema.

If schema matching is not enforced, the connector creates the following objects by default in the `KAFKA_STREAMING.YAHOO_FINANCE` schema:

- A `staging table`:
```sql
CREATE OR REPLACE TABLE KAFKA_STREAMING.YAHOO_FINANCE.YAHOO_FINANCE_TOPIC_1140052305 (
    RECORD_METADATA VARIANT,
    RECORD_CONTENT VARIANT);
```

- An `internal stage` with client-side encryption, and with the directory table disabled:
```sql
CREATE OR REPLACE STAGE SNOWFLAKE_KAFKA_CONNECTOR_SNOWFLAKE_KAFKA_YAHOO_FINANCE_CONNECTOR_430186219_STAGE_YAHOO_FINANCE_TOPIC_1140052305;
```

- A `Pipe`: 

```sql
CREATE OR REPLACE PIPE KAFKA_STREAMING.YAHOO_FINANCE.SNOWFLAKE_KAFKA_CONNECTOR_SNOWFLAKE_KAFKA_YAHOO_FINANCE_CONNECTOR_430186219_PIPE_YAHOO_FINANCE_TOPIC_1140052305_0 auto_ingest=false as copy into yahoo_finance_topic_1140052305(RECORD_METADATA, RECORD_CONTENT) from (select $1:meta, $1:content from @SNOWFLAKE_KAFKA_CONNECTOR_snowflake_kafka_yahoo_finance_connector_430186219_STAGE_yahoo_finance_topic_1140052305 t) file_format = (type = 'json');
```

Automate the movement of data from the Kafka Connector auto-created staging table into the stock_prices target table using Snowflake Streams and Tasks. 

This approach:

- Avoids directly ingesting into the stock_prices curated table

- Keeps staging and production concerns cleanly separated

- Runs automatically at intervals (e.g., every 1 minute)

1. Identify the Auto-Generated Table

```sql
SHOW TABLES LIKE '%_TOPIC_%';

SELECT * FROM YAHOO_FINANCE_TOPIC_1140052305;

-- TABLE name from SHOW TABLES
SET kafka_staging_table = 'YAHOO_FINANCE_TOPIC_1140052305';
```

2. Create a stream on the auto-generated table

The stream tracks new rows inserted into the staging table by the Kafka Connector.

```sql
CREATE OR REPLACE STREAM kafka_finance_stream
ON TABLE IDENTIFIER($kafka_staging_table);
```

3. Create a Task to Copy Data Every Minute

```sql
CREATE OR REPLACE TASK move_kafka_data_to_snowflake_stock_prices
  WAREHOUSE = COMPUTE_WH
  SCHEDULE = '1 MINUTE'
  -- Skip the run (and the warehouse resume) when the stream has no new rows
  WHEN SYSTEM$STREAM_HAS_DATA('kafka_finance_stream')
AS
INSERT INTO stock_prices (symbol, price, currency, time)
SELECT
  RECORD_CONTENT:"symbol"::STRING,
  RECORD_CONTENT:"price"::FLOAT,
  RECORD_CONTENT:"currency"::STRING,
  RECORD_CONTENT:"time"::STRING
FROM kafka_finance_stream;
```

4. Start the task

```sql
ALTER TASK move_kafka_data_to_snowflake_stock_prices RESUME;
```
Now, Snowflake will:

- Continuously ingest data into the auto-generated Kafka staging table

- Use the stream to detect changes

- Copy those rows into the curated stock_prices table every minute

5. Verify the result after about one minute

```sql
SELECT * FROM stock_prices ORDER BY time DESC;
```
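The task's `INSERT ... SELECT` pulls four fields out of `RECORD_CONTENT` and casts them. The same projection over a single staging row can be sketched in Python to check how a message will land in `stock_prices`. The sample row is made up, the metadata fields shown are only an assumed subset, and `to_stock_price_row` is a hypothetical helper.

```python
import json

# Hypothetical staging row, shaped like the two VARIANT columns of the staging table.
staging_row = {
    "RECORD_METADATA": {"topic": "yahoo-finance-topic", "partition": 0, "offset": 42, "key": "SNOW"},
    "RECORD_CONTENT": json.loads(
        '{"symbol": "SNOW", "price": 150.25, "currency": "USD", "time": "2024-01-01 10:00:00"}'
    ),
}


def to_stock_price_row(row: dict) -> tuple:
    """Mirror the task's SELECT: pull and cast the four target columns."""
    content = row["RECORD_CONTENT"]
    price = content.get("price")
    return (
        content.get("symbol"),
        float(price) if price is not None else None,  # a missing price stays NULL
        content.get("currency"),
        content.get("time"),
    )


print(to_stock_price_row(staging_row))
# ('SNOW', 150.25, 'USD', '2024-01-01 10:00:00')
```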





## Real-time or near-real-time analytics on stock prices, ranging from simple aggregations to more advanced time-series and anomaly detection


```sql
USE ROLE ACCOUNTADMIN;
USE DATABASE KAFKA_STREAMING;
USE SCHEMA YAHOO_FINANCE;
```

1. Latest price per symbol

```sql
CREATE OR REPLACE VIEW vw_latest_stock_prices AS
SELECT symbol, price, time
FROM (
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY symbol ORDER BY time DESC) AS rn
  FROM stock_prices
)
WHERE rn = 1;
```

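2. Moving average (5-minute window)

```sql
CREATE OR REPLACE VIEW vw_5min_moving_avg AS
WITH per_minute AS (
  -- Average price per symbol per one-minute bucket over the last hour
  SELECT
    symbol,
    DATE_TRUNC('minute', time::TIMESTAMP_NTZ) AS minute_bucket,
    AVG(price) AS avg_price_1min
  FROM stock_prices
  WHERE time::TIMESTAMP_NTZ >= DATEADD(HOUR, -1, CURRENT_TIMESTAMP)
  GROUP BY symbol, minute_bucket
)
-- Average the current bucket with up to four preceding buckets
SELECT symbol, minute_bucket,
  AVG(avg_price_1min) OVER (PARTITION BY symbol ORDER BY minute_bucket
                            ROWS BETWEEN 4 PRECEDING AND CURRENT ROW) AS avg_price_5min
FROM per_minute;
```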

3. Anomaly detection - Price spike/dip detection (5% deviation from 5-row moving avg)

```sql
CREATE OR REPLACE VIEW vw_price_anomalies AS
WITH recent AS (
  SELECT symbol, price, time,
         AVG(price) OVER (PARTITION BY symbol ORDER BY time ROWS BETWEEN 4 PRECEDING AND CURRENT ROW) AS moving_avg
  FROM stock_prices
)
SELECT *
FROM recent
WHERE ABS(price - moving_avg) / NULLIF(moving_avg, 0) > 0.05;
```
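The same rule can be reproduced outside Snowflake to see which prices the view would flag. This is a plain-Python sketch of the logic for a single symbol, with prices already ordered by time; `find_anomalies` is a hypothetical name and the price series is made up.

```python
from typing import List, Tuple


def find_anomalies(
    prices: List[float], window: int = 5, threshold: float = 0.05
) -> List[Tuple[int, float, float]]:
    """Return (index, price, moving_avg) for prices that deviate more than
    `threshold` from the trailing moving average (current row included)."""
    anomalies = []
    for i, price in enumerate(prices):
        recent = prices[max(0, i - window + 1): i + 1]  # up to 4 preceding rows + current
        moving_avg = sum(recent) / len(recent)
        # Like NULLIF(moving_avg, 0) in the view: a zero average never matches.
        if moving_avg != 0 and abs(price - moving_avg) / moving_avg > threshold:
            anomalies.append((i, price, moving_avg))
    return anomalies


print(find_anomalies([100.0, 100.0, 100.0, 100.0, 100.0, 110.0]))
# [(5, 110.0, 102.0)]
```

Because the current row is part of its own moving average, a jump has to be somewhat larger than 5% to be flagged: after four flat prices, the threshold is crossed at roughly a 6.3% move.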

4. Hourly trend (avg price) - time-series windowed comparison to compare average prices hour-over-hour

```sql
CREATE OR REPLACE VIEW vw_hourly_avg_prices AS
SELECT 
  symbol,
  DATE_TRUNC('hour', time::TIMESTAMP_NTZ) AS hour_bucket,
  AVG(price) AS avg_price_hour
FROM stock_prices
WHERE time::TIMESTAMP_NTZ >= DATEADD(DAY, -1, CURRENT_TIMESTAMP)
GROUP BY symbol, hour_bucket;
```

5. Stock leaderboard by latest price - ranks symbols by their most recent price, from highest to lowest

```sql
CREATE OR REPLACE VIEW vw_price_leaderboard AS
SELECT symbol, price, RANK() OVER (ORDER BY price DESC) AS price_rank
FROM (
  SELECT symbol, price
  FROM stock_prices
  QUALIFY ROW_NUMBER() OVER (PARTITION BY symbol ORDER BY time DESC) = 1
);
```

The Snowflake `QUALIFY` clause filters rows on the result of a window function, much as `HAVING` filters on the result of an aggregate. Window functions compute a value across a set of rows related to the current row, and `QUALIFY` is applied after those values are computed. Here it keeps only the most recent row per symbol without wrapping the window function in an extra subquery, which is what `vw_latest_stock_prices` does with its `rn = 1` filter. The main benefit is a shorter, more readable query; any performance gain from avoiding the intermediate result set depends on the query and the data volume.
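To make the two steps of the leaderboard view concrete, the sketch below mimics them in plain Python: first keep the latest row per symbol (the `QUALIFY ROW_NUMBER() ... = 1` part), then rank those rows by price (the `RANK()` part). The sample rows and the helper names `latest_per_symbol` and `leaderboard` are made up for illustration.

```python
from typing import Dict, List, Tuple

Row = Tuple[str, float, str]  # (symbol, price, time)

rows: List[Row] = [  # made-up sample data
    ("SNOW", 150.0, "2024-01-01 10:00:00"),
    ("SNOW", 152.0, "2024-01-01 10:00:30"),
    ("MSFT", 400.0, "2024-01-01 10:00:00"),
    ("MSFT", 399.0, "2024-01-01 10:00:30"),
]


def latest_per_symbol(rows: List[Row]) -> List[Row]:
    """Keep one row per symbol: the one with the greatest time value."""
    latest: Dict[str, Row] = {}
    for row in rows:
        symbol, _, ts = row
        if symbol not in latest or ts > latest[symbol][2]:
            latest[symbol] = row
    return list(latest.values())


def leaderboard(rows: List[Row]) -> List[Tuple[str, float, int]]:
    """Rank the latest prices from highest to lowest."""
    ordered = sorted(latest_per_symbol(rows), key=lambda r: r[1], reverse=True)
    ranked: List[Tuple[str, float, int]] = []
    for position, (symbol, price, _) in enumerate(ordered, start=1):
        # Tied prices share a rank and the following ranks are skipped.
        rank = ranked[-1][2] if ranked and ranked[-1][1] == price else position
        ranked.append((symbol, price, rank))
    return ranked


print(leaderboard(rows))
# [('MSFT', 399.0, 1), ('SNOW', 152.0, 2)]
```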

## Deliver live metrics in a Streamlit dashboard


```python
import streamlit as st
import pandas as pd
import altair as alt
from snowflake.snowpark.context import get_active_session

# Get the Snowpark session provided by Streamlit in Snowflake
session = get_active_session()

# Load data
latest = session.table("vw_latest_stock_prices").to_pandas()
moving_avg = session.table("vw_5min_moving_avg").to_pandas()
anomalies = session.table("vw_price_anomalies").to_pandas()
hourly = session.table("vw_hourly_avg_prices").to_pandas()
leaderboard = session.table("vw_price_leaderboard").to_pandas()

# Streamlit UI components
st.title("📈 Yahoo Finance Dashboard")
st.dataframe(latest)
st.dataframe(leaderboard)

# Moving Average Chart
symbol = st.selectbox("Select Symbol", moving_avg['SYMBOL'].unique())
filtered_data = moving_avg[moving_avg['SYMBOL'] == symbol]

chart = alt.Chart(filtered_data).mark_line(point=True).encode(
    x=alt.X('MINUTE_BUCKET:T', title='Time'),
    y=alt.Y('AVG_PRICE_5MIN:Q', title='5-Min Avg Price'),
    tooltip=['MINUTE_BUCKET:T', 'AVG_PRICE_5MIN:Q']
).properties(
    title=f"5-Minute Average for {symbol}",
    height=300
)

st.altair_chart(chart)

# Anomalies Table
st.dataframe(anomalies[anomalies['SYMBOL'] == symbol])
```