In [1]:
!uname -a


Linux rajesh 5.15.153.1-microsoft-standard-WSL2 #1 SMP Fri Mar 29 23:14:13 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux


## Initialize Spark Session for Streaming

This section creates a `SparkSession` configured for local streaming jobs. The `local[*]` master runs Spark locally with one worker thread per logical core. Setting `spark.streaming.stopGracefullyOnShutdown` asks Spark to shut the streaming context down gracefully, rather than immediately, when the application stops. The log level is set to `"ERROR"` to suppress informational and warning messages during execution.



In [2]:
from pyspark.sql import SparkSession

# Create SparkSession for streaming (local[*] uses all local cores)
spark = SparkSession.builder \
    .appName("BitcoinStreaming") \
    .master("local[*]") \
    .config("spark.streaming.stopGracefullyOnShutdown", True) \
    .getOrCreate()

spark.sparkContext.setLogLevel("ERROR")


25/05/01 09:25:58 WARN Utils: Your hostname, rajesh resolves to a loopback address: 127.0.1.1; using 10.255.255.254 instead (on interface lo)
25/05/01 09:25:58 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/05/01 09:25:59 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


## Fetch Real-Time Bitcoin Price from CoinGecko API

This block defines the CoinGecko API endpoint and the query parameters used to retrieve the current price of Bitcoin in USD. It also names the local output directory in which fetched data is stored as JSON files.



In [7]:
import requests, json, time
from datetime import datetime

api_url = "https://api.coingecko.com/api/v3/simple/price"
params = {'ids': 'bitcoin', 'vs_currencies': 'usd'}

output_dir = "/home/ubuntu/bitcoin_stream/"  # directory for JSON files
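
To make the endpoint and parameters above concrete, here is a minimal sketch of the URL they produce and of how the price can be read out of the response. The helper names `build_url` and `parse_price` are hypothetical, and the response shape `{"bitcoin": {"usd": ...}}` is assumed from the way the later socket server cell indexes the result.

```python
from urllib.parse import urlencode

API_URL = "https://api.coingecko.com/api/v3/simple/price"


def build_url(coin="bitcoin", currency="usd"):
    """Return the full request URL for one coin and one quote currency (hypothetical helper)."""
    query = urlencode({"ids": coin, "vs_currencies": currency})
    return f"{API_URL}?{query}"


def parse_price(payload, coin="bitcoin", currency="usd"):
    """Extract the price from a decoded response (hypothetical helper).

    Assumes the payload is nested as {coin: {currency: price}}.
    """
    return payload[coin][currency]


# Sample payload in the assumed shape; no network request is made here
sample_payload = {"bitcoin": {"usd": 97106}}
url = build_url()
price = parse_price(sample_payload)
```

With `requests`, the same payload would come from `requests.get(api_url, params=params, timeout=10).json()`, which can then be passed to `parse_price`.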


## Setup for Storing Real-Time Bitcoin Data Locally

This section sets up a local directory (`~/bitcoin_stream`) to store the real-time Bitcoin price data fetched from the CoinGecko API. It ensures the directory exists before saving any files.


In [9]:
#Dummy
import os
import json
import time
import requests
from datetime import datetime, timezone

output_dir = os.path.expanduser("~/bitcoin_stream")
os.makedirs(output_dir, exist_ok=True)  # Ensure directory exists
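
As a minimal sketch of how one price record could be written into such a directory, the hypothetical helper `save_record` below stores a record as its own JSON file named after the current UTC time. The function name and the file-naming scheme are assumptions for illustration, not part of the original notebook, and the demo writes to a temporary directory so it does not touch `~/bitcoin_stream`.

```python
import json
import os
import tempfile
from datetime import datetime, timezone


def save_record(record, directory):
    """Write one record to its own JSON file and return the path (hypothetical helper)."""
    os.makedirs(directory, exist_ok=True)
    # A microsecond-resolution UTC stamp keeps file names sortable by time
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S%fZ")
    path = os.path.join(directory, f"bitcoin_{stamp}.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(record, f)
    return path


# Demo: write a sample record to a temporary directory and read it back
demo_dir = tempfile.mkdtemp()
sample = {"timestamp": datetime.now(timezone.utc).isoformat(), "price": 97106}
saved_path = save_record(sample, demo_dir)
with open(saved_path, encoding="utf-8") as f:
    loaded = json.load(f)
```

Passing `output_dir` instead of `demo_dir` would place the file in the directory created above.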


## Fetch and Stream Real-Time Bitcoin Prices

The following script (`socket_server.py`) opens a TCP socket on `localhost:5002` and waits for a single client to connect. It then polls the CoinGecko API every 5 seconds, adds a UTC timestamp to each price, and sends the record to the connected client as one line of JSON. Requests that fail are reported and skipped.


In [26]:
# socket_server.py
import socket
import time
import json
import requests
from datetime import datetime, timezone

HOST = "localhost"
PORT = 5002
api_url = "https://api.coingecko.com/api/v3/simple/price"
params = {"ids": "bitcoin", "vs_currencies": "usd"}

s = socket.socket()
s.bind((HOST, PORT))
s.listen(1)
print("Socket server listening on port", PORT)
conn, addr = s.accept()
print(f"Connection from {addr}")

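while True:
    try:
        # The timeout keeps a stalled request from blocking the loop indefinitely
        response = requests.get(api_url, params=params, timeout=10)
        response.raise_for_status()  # treat HTTP error statuses as failures
        data = response.json()
        record = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "price": data["bitcoin"]["usd"]
        }
        # One JSON object per line so the client can split records on "\n"
        conn.sendall((json.dumps(record) + "\n").encode("utf-8"))
        print("Sent:", record)
    except requests.RequestException:
        print("Failed to fetch data")
    time.sleep(5)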


Socket server listening on port 5002
Connection from ('127.0.0.1', 36664)
Sent: {'timestamp': '2025-05-02T01:28:46.571674+00:00', 'price': 97106}
Sent: {'timestamp': '2025-05-02T01:28:51.752836+00:00', 'price': 97106}
Sent: {'timestamp': '2025-05-02T01:28:56.833625+00:00', 'price': 97106}
Sent: {'timestamp': '2025-05-02T01:29:01.898484+00:00', 'price': 97106}
Sent: {'timestamp': '2025-05-02T01:29:06.966206+00:00', 'price': 97106}
Sent: {'timestamp': '2025-05-02T01:29:12.057395+00:00', 'price': 97106}
Failed to fetch data
Failed to fetch data
Failed to fetch data
Failed to fetch data
Failed to fetch data
Failed to fetch data
Failed to fetch data
Failed to fetch data
Failed to fetch data
Failed to fetch data
Failed to fetch data
Failed to fetch data
Sent: {'timestamp': '2025-05-02T01:30:17.825550+00:00', 'price': 97093}
Sent: {'timestamp': '2025-05-02T01:30:22.899975+00:00', 'price': 97093}
Sent: {'timestamp': '2025-05-02T01:30:27.983159+00:00', 'price': 97093}
Sent: {'timestamp': '2025-

KeyboardInterrupt: 
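
The server above frames its output as one JSON object per line. As a minimal sketch of how a client can consume that framing, the hypothetical generator `iter_records` below reads newline-terminated lines from a connected socket and decodes each one. The demo uses `socket.socketpair()` as a stand-in for the real server and client, with two records copied from the output above.

```python
import json
import socket


def iter_records(conn):
    """Yield one dict per newline-terminated JSON line received on `conn` (hypothetical helper)."""
    # makefile() buffers the byte stream and splits it into text lines
    with conn.makefile("r", encoding="utf-8") as reader:
        for line in reader:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)


# Demo: a connected socket pair stands in for the server and its client
server_end, client_end = socket.socketpair()
for rec in (
    {"timestamp": "2025-05-02T01:28:46.571674+00:00", "price": 97106},
    {"timestamp": "2025-05-02T01:30:17.825550+00:00", "price": 97093},
):
    server_end.sendall((json.dumps(rec) + "\n").encode("utf-8"))
server_end.close()  # closing the sending side ends the client's iteration

received = list(iter_records(client_end))
client_end.close()
```

Against the running server, the connection could instead come from `socket.create_connection(("localhost", 5002))`; the loop then continues until the server closes the socket.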

## Streaming Bitcoin Data from Socket to HDFS

A separate script, run in parallel from a different file, reads the Bitcoin records from this socket and streams them to HDFS.