Link to Streamlit application: [https://ind320-tereseivesdal.streamlit.app/](https://ind320-tereseivesdal.streamlit.app/)

Link to Github repository: [https://github.com/teresemyhre/IND320-tereseivesdal](https://github.com/teresemyhre/IND320-tereseivesdal)


# AI Usage

AI tools were used to speed up development and resolve issues throughout the assignment. GitHub Copilot assisted with small code completions in VS Code, while ChatGPT was used for debugging, explanations, and improving the structure of the Streamlit pages. It helped clarify errors, align time series, implement the July–June snow-season logic, build the Sliding-Window Correlation and SARIMAX interfaces, and refine Plotly visualisations and UI decisions.

In [13]:
# Imports
import requests
import pandas as pd
from datetime import datetime, timedelta

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

from cassandra.cluster import Cluster

from pymongo.mongo_client import MongoClient
from pymongo.server_api import ServerApi

### Helper

In [14]:
# Function to generate month ranges for a given year
def month_ranges(year):
    ranges = []
    for month in range(1, 13):
        # Calculate start and end datetime for each month
        start = datetime(year, month, 1)
        if month == 12:
            end = datetime(year + 1, 1, 1) - timedelta(hours=1)
        else:
            end = datetime(year, month + 1, 1) - timedelta(hours=1)
        # Append the tuple (start, end, month) to the list
        ranges.append((start, end, month))
    return ranges

### Spark setup

In [23]:
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("ElhubProject")
    .config("spark.jars.packages", "com.datastax.spark:spark-cassandra-connector_2.12:3.5.1")
    .config("spark.cassandra.connection.host", "127.0.0.1")   # Cassandra in Docker
    .config("spark.cassandra.connection.port", "9042")
    .config("spark.cassandra.output.consistency.level", "LOCAL_ONE")
    .config("spark.cassandra.connection.keep_alive_ms", "60000")
    .config("spark.cassandra.output.batch.size.rows", "100")
    .config("spark.cassandra.output.batch.size.bytes", "524288")   # 512 KB
    .getOrCreate()
)

print("Spark session connected to Cassandra.")
spark

Spark session connected to Cassandra.


25/11/27 00:37:10 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.


### Cassandra setup

In [16]:
# Cassandra setup
cluster = Cluster(["localhost"])
session = cluster.connect()

# Create keyspace and tables
session.execute("""
CREATE KEYSPACE IF NOT EXISTS elhub 
WITH replication = {'class':'SimpleStrategy','replication_factor':1};
""")

# Use the keyspace
session.set_keyspace("elhub")

# Create tables
session.execute("""
CREATE TABLE IF NOT EXISTS production (
    pricearea text,
    productiongroup text,
    starttime timestamp,
    endtime timestamp,
    quantitykwh double,
    lastupdatedtime timestamp,
    PRIMARY KEY (pricearea, starttime, productiongroup)
);
""")

session.execute("""
CREATE TABLE IF NOT EXISTS consumption (
    pricearea text,
    consumptiongroup text,
    starttime timestamp,
    endtime timestamp,
    quantitykwh double,
    lastupdatedtime timestamp,
    meteringpointcount bigint,
    PRIMARY KEY (pricearea, starttime, consumptiongroup)
);
""")

# Confirm creation
print("Cassandra keyspace and tables created.")

Cassandra keyspace and tables created.


### MongoDB setup

In [17]:
import toml
from pymongo.mongo_client import MongoClient
from pymongo.server_api import ServerApi

# Load secrets.toml
secrets = toml.load("/Users/teresemyhre/Documents/NMBU-iCloud/IND320/IND320-tereseivesdal/.streamlit/secrets.toml")

MONGO_URI = secrets["MONGO"]["uri"]
MONGO_DB = secrets["MONGO"]["database"]
# Not using the default collection from secrets.toml
# because in Part 4 we need *two* collections
print("Loaded MongoDB config:")
print("URI:", MONGO_URI[:25] + "...")
print("DB:", MONGO_DB)

# Connect to MongoDB
client = MongoClient(MONGO_URI, server_api=ServerApi('1'))
db = client[MONGO_DB]

# Collections for Part 4
collection_prod = db["production"]    # Production table
collection_cons = db["consumption"]   # New consumption table

print("MongoDB connection established.")

Loaded MongoDB config:
URI: mongodb+srv://teresemyhre...
DB: elhub
MongoDB connection established.


### Download function

In [18]:
# Constants
BASE_URL = "https://api.elhub.no/energy-data/v0/price-areas"
DATASET_PROD = "PRODUCTION_PER_GROUP_MBA_HOUR"
DATASET_CONS = "CONSUMPTION_PER_GROUP_MBA_HOUR"

YEARS_PROD = [2022, 2023, 2024]   # new years to fetch
YEARS_CONS = [2021, 2022, 2023, 2024]  

# Fetch production data function
def fetch_production(years):
    all_rows = []
    failures = []

    # Iterate over years and months to fetch production data
    for year in years:
        for start, end, m in month_ranges(year):

            # Format start and end datetime strings with timezone offset
            start_str = start.strftime("%Y-%m-%dT%H:%M:%S") + "%2B01:00"
            end_str = end.strftime("%Y-%m-%dT%H:%M:%S") + "%2B01:00"

            # Construct the URL for the API request
            url = (
                f"{BASE_URL}?dataset={DATASET_PROD}"
                f"&startDate={start_str}&endDate={end_str}"
            )

            # Make the API request
            r = requests.get(url, timeout=60)

            if r.ok:
                j = r.json()
                rows = [
                    rec for item in j.get("data", [])
                    for rec in item.get("attributes", {}).get(
                        "productionPerGroupMbaHour", []
                    )
                ]
                all_rows.extend(rows)
                print(f"{year}-{m:02d}: {len(rows)} rows")
            else:
                failures.append((year, m, r.status_code))
                print(f"{year}-{m:02d}: FAILED {r.status_code}")

    # Create DataFrame from all rows
    df = pd.DataFrame(all_rows)
    if not df.empty:
        df.columns = [c.lower() for c in df.columns]

    return df, failures

# Fetch consumption data function
def fetch_consumption(years):
    all_rows = []
    failures = []

    # Iterate over years and months to fetch consumption data
    for year in years:
        for start, end, m in month_ranges(year):
            # Format start and end datetime strings with timezone offset
            start_str = start.strftime("%Y-%m-%dT%H:%M:%S") + "%2B01:00"
            end_str = end.strftime("%Y-%m-%dT%H:%M:%S") + "%2B01:00"

            # Construct the URL for the API request
            url = (
                f"{BASE_URL}?dataset={DATASET_CONS}"
                f"&startDate={start_str}&endDate={end_str}"
            )
            # Make the API request
            r = requests.get(url, timeout=60)

            if r.ok:
                j = r.json()
                rows = [
                    rec for item in j.get("data", [])
                    for rec in item.get("attributes", {}).get(
                        "consumptionPerGroupMbaHour", []
                    )
                ]
                all_rows.extend(rows)
                print(f"{year}-{m:02d}: {len(rows)} rows")
            else:
                failures.append((year, m, r.status_code))
                print(f"{year}-{m:02d}: FAILED {r.status_code}")
    
    # Create DataFrame from all rows
    df = pd.DataFrame(all_rows)
    if not df.empty:
        df.columns = [c.lower() for c in df.columns]

    return df, failures
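
The `%2B01:00` suffix in the two fetch functions is the percent-encoded form of the `+01:00` UTC offset: a raw `+` in a query string is commonly decoded as a space. A minimal sketch of the same formatting using the standard library instead of a hand-written suffix is shown below; the helper name `to_api_timestamp` is hypothetical and not part of the notebook.

```python
from datetime import datetime
from urllib.parse import quote


def to_api_timestamp(dt, offset="+01:00"):
    """Format a naive datetime as ISO 8601 with a percent-encoded UTC offset."""
    raw = dt.strftime("%Y-%m-%dT%H:%M:%S") + offset
    # Keep ':' readable, but encode '+' as %2B so it is not decoded as a space.
    return quote(raw, safe=":")


print(to_api_timestamp(datetime(2022, 1, 1)))
# 2022-01-01T00:00:00%2B01:00
```

The result is identical to the string built in `fetch_production` and `fetch_consumption`, so it could replace the manual concatenation without changing the request.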



### Fetch the data

In [19]:
# Fetch new production data
df_production_new, failures_prod = fetch_production(YEARS_PROD)
df_production_new.head()

2022-01: 18575 rows
2022-02: 16775 rows
2022-03: 18550 rows
2022-04: 17975 rows
2022-05: 18575 rows
2022-06: 17975 rows
2022-07: 18575 rows
2022-08: 18575 rows
2022-09: 17975 rows
2022-10: 18600 rows
2022-11: 17975 rows
2022-12: 18575 rows
2023-01: 18575 rows
2023-02: 16775 rows
2023-03: 18550 rows
2023-04: 17975 rows
2023-05: 18575 rows
2023-06: 17975 rows
2023-07: 18575 rows
2023-08: 18575 rows
2023-09: 17975 rows
2023-10: 18600 rows
2023-11: 17975 rows
2023-12: 18575 rows
2024-01: 18575 rows
2024-02: 17375 rows
2024-03: 18550 rows
2024-04: 17975 rows
2024-05: 18575 rows
2024-06: 17975 rows
2024-07: 18575 rows
2024-08: 18575 rows
2024-09: 17975 rows
2024-10: 18600 rows
2024-11: 17975 rows
2024-12: 18575 rows


| | endtime | lastupdatedtime | pricearea | productiongroup | quantitykwh | starttime |
|---|---|---|---|---|---|---|
| 0 | 2022-01-01T01:00:00+01:00 | 2025-02-01T18:02:57+01:00 | NO1 | hydro | 1291422.4 | 2022-01-01T00:00:00+01:00 |
| 1 | 2022-01-01T02:00:00+01:00 | 2025-02-01T18:02:57+01:00 | NO1 | hydro | 1246209.4 | 2022-01-01T01:00:00+01:00 |
| 2 | 2022-01-01T03:00:00+01:00 | 2025-02-01T18:02:57+01:00 | NO1 | hydro | 1271757.0 | 2022-01-01T02:00:00+01:00 |
| 3 | 2022-01-01T04:00:00+01:00 | 2025-02-01T18:02:57+01:00 | NO1 | hydro | 1204251.8 | 2022-01-01T03:00:00+01:00 |
| 4 | 2022-01-01T05:00:00+01:00 | 2025-02-01T18:02:57+01:00 | NO1 | hydro | 1202086.9 | 2022-01-01T04:00:00+01:00 |


In [None]:
# Fetch consumption data
df_consumption, failures_cons = fetch_consumption(YEARS_CONS)
df_consumption.head()

2021-01: 18575 rows
2021-02: 16775 rows
2021-03: 18550 rows
2021-04: 17975 rows
2021-05: 18575 rows
2021-06: 17975 rows
2021-07: 18575 rows
2021-08: 18575 rows
2021-09: 17975 rows
2021-10: 18600 rows
2021-11: 17975 rows
2021-12: 18575 rows
2022-01: 18575 rows
2022-02: 16775 rows
2022-03: 18550 rows
2022-04: 17975 rows
2022-05: 18575 rows
2022-06: 17975 rows
2022-07: 18575 rows
2022-08: 18575 rows
2022-09: 17975 rows
2022-10: 18600 rows
2022-11: 17975 rows
2022-12: 18575 rows
2023-01: 18575 rows
2023-02: 16775 rows
2023-03: 18550 rows
2023-04: 17975 rows
2023-05: 18575 rows
2023-06: 17975 rows
2023-07: 18575 rows
2023-08: 18575 rows
2023-09: 17975 rows
2023-10: 18600 rows
2023-11: 17975 rows
2023-12: 18575 rows
2024-01: 18575 rows
2024-02: 17375 rows
2024-03: 18550 rows
2024-04: 17975 rows
2024-05: 18575 rows
2024-06: 17975 rows
2024-07: 18575 rows
2024-08: 18575 rows
2024-09: 17975 rows
2024-10: 18600 rows
2024-11: 17975 rows
2024-12: 18575 rows


| | consumptiongroup | endtime | lastupdatedtime | meteringpointcount | pricearea | quantitykwh | starttime |
|---|---|---|---|---|---|---|---|
| 0 | cabin | 2021-01-01T01:00:00+01:00 | 2024-12-20T10:35:40+01:00 | 100607 | NO1 | 177071.56 | 2021-01-01T00:00:00+01:00 |
| 1 | cabin | 2021-01-01T02:00:00+01:00 | 2024-12-20T10:35:40+01:00 | 100607 | NO1 | 171335.12 | 2021-01-01T01:00:00+01:00 |
| 2 | cabin | 2021-01-01T03:00:00+01:00 | 2024-12-20T10:35:40+01:00 | 100607 | NO1 | 164912.02 | 2021-01-01T02:00:00+01:00 |
| 3 | cabin | 2021-01-01T04:00:00+01:00 | 2024-12-20T10:35:40+01:00 | 100607 | NO1 | 160265.77 | 2021-01-01T03:00:00+01:00 |
| 4 | cabin | 2021-01-01T05:00:00+01:00 | 2024-12-20T10:35:40+01:00 | 100607 | NO1 | 159828.69 | 2021-01-01T04:00:00+01:00 |


### Insert into Cassandra

In [21]:
# Convert the new production data to a Spark DataFrame for insertion into Cassandra
df_spark_prod = spark.createDataFrame(df_production_new)

# Convert columns to appropriate types
df_spark_prod = (
    df_spark_prod
    .withColumn("starttime", col("starttime").cast("timestamp"))
    .withColumn("endtime", col("endtime").cast("timestamp"))
    .withColumn("lastupdatedtime", col("lastupdatedtime").cast("timestamp"))
)

# Write to Cassandra
(
    df_spark_prod.write
    .format("org.apache.spark.sql.cassandra")
    .mode("append")
    .options(table="production", keyspace="elhub")
    .save()
)

print("Production data inserted into Cassandra.")

25/11/27 00:30:40 WARN TaskSetManager: Stage 2 contains a task of very large size (8829 KiB). The maximum recommended task size is 1000 KiB.

Production data inserted into Cassandra.


                                                                                

In [24]:
# Convert the consumption data to a Spark DataFrame for insertion into Cassandra
df_spark_cons = spark.createDataFrame(df_consumption)

# Convert columns to appropriate types
df_spark_cons = (
    df_spark_cons
    .withColumn("starttime", col("starttime").cast("timestamp"))
    .withColumn("endtime", col("endtime").cast("timestamp"))
    .withColumn("lastupdatedtime", col("lastupdatedtime").cast("timestamp"))
)

# Write to Cassandra
(
    df_spark_cons.write
    .format("org.apache.spark.sql.cassandra")
    .mode("append")
    .options(table="consumption", keyspace="elhub")
    .save()
)

print("Consumption data inserted into Cassandra.")

25/11/27 00:38:35 WARN DeprecatedConfigParameter: spark.cassandra.connection.keep_alive_ms is deprecated (DSE 6.0.0) and has been automatically replaced with parameter spark.cassandra.connection.keepAliveMS. 
25/11/27 00:38:36 WARN TaskSetManager: Stage 4 contains a task of very large size (12344 KiB). The maximum recommended task size is 1000 KiB.

Consumption data inserted into Cassandra.


                                                                                

### Insert into MongoDB

In [25]:
df_mongo_prod = df_spark_prod.select(
    "pricearea",
    "productiongroup",
    "starttime",
    "endtime",
    "quantitykwh",
    "lastupdatedtime"
).toPandas()

collection_prod.insert_many(df_mongo_prod.to_dict("records"))
print("Production data inserted into MongoDB.")

25/11/27 00:43:25 WARN TaskSetManager: Stage 5 contains a task of very large size (8829 KiB). The maximum recommended task size is 1000 KiB.
                                                                                

Production data inserted into MongoDB.


In [26]:
df_mongo_cons = df_spark_cons.select(
    "pricearea",
    "consumptiongroup",
    "starttime",
    "endtime",
    "quantitykwh",
    "lastupdatedtime",
    "meteringpointcount"
).toPandas()

collection_cons.insert_many(df_mongo_cons.to_dict("records"))
print("Consumption data inserted into MongoDB.")

25/11/27 00:49:07 WARN TaskSetManager: Stage 6 contains a task of very large size (12344 KiB). The maximum recommended task size is 1000 KiB.
                                                                                

Consumption data inserted into MongoDB.



# Log of Compulsory Work

For this assignment, I expanded my Streamlit application with several analytical pages and reorganised the entire structure into three clear sections: Explorative, Anomalies, and Predictive. This restructuring helped keep the workflow logical as the app grew more complex.

I began with the Map Explorer, where I downloaded and prepared the GeoJSON file for Norwegian price areas. I built a Folium choropleth that colors each area based on mean production or consumption for the selected groups and date range. To make the map interactive, I added polygon detection using Shapely, persistent click markers, and a log that stores snapshot information: coordinates, selected groups, time interval, and the mean at the moment of the click. I also made the log scroll only when needed and ensured that clicking inside an area updates the global sidebar state. Fine-tuning the colormap, preventing it from breaking when vmin = vmax, and handling clicks outside Norway all required multiple adjustments.
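
The two edge cases mentioned above (a click outside every price area, and a colormap whose minimum equals its maximum) can be sketched in plain Python. This is only an illustration under stated assumptions: the app uses Shapely for polygon detection, whereas the ray-casting test here is a dependency-free stand-in, and the names `point_in_ring`, `find_area`, `safe_color_range` and the rectangular "NO1" outline are hypothetical.

```python
def point_in_ring(x, y, ring):
    """Ray-casting test: is (x, y) inside the polygon given by ring?"""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def find_area(lon, lat, areas):
    """Return the name of the area containing the click, or None if outside all."""
    for name, ring in areas.items():
        if point_in_ring(lon, lat, ring):
            return name
    return None


def safe_color_range(values, pad=1.0):
    """Return (vmin, vmax), widened when all values are equal."""
    values = list(values)
    vmin, vmax = min(values), max(values)
    if vmin == vmax:
        # A zero-width range would break a linear colormap, so pad it.
        vmin, vmax = vmin - pad, vmax + pad
    return vmin, vmax


# Hypothetical rectangular outline standing in for a real GeoJSON polygon
areas = {"NO1": [(10, 59), (12, 59), (12, 61), (10, 61)]}
print(find_area(10.75, 59.9, areas))     # NO1
print(find_area(0.0, 0.0, areas))        # None (click outside every area)
print(safe_color_range([500.0, 500.0]))  # (499.0, 501.0)
```

Returning `None` for an outside click lets the caller skip the sidebar update instead of raising an error.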

Next, I implemented the Snow Drift page. I moved all snow-drift calculations into a dedicated module and rewrote the seasonal logic so that each snow year runs from 1 July to 30 June, as required. I fixed several issues related to ERA5 downloads, time zones, and incorrect season boundaries. The combined yearly–monthly Qt plot went through several redesigns until the yearly bars matched the snow-year width and the monthly line extended correctly to June of the next year. I also added an overlay/grouped toggle and applied my custom Plotly colors to match the visual style of the rest of the app.
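
The July-to-June season boundary described above can be expressed in a few lines. The sketch below is not the code from the snow-drift module; the function names `snow_year` and `snow_year_label` are hypothetical and only illustrate the rule that a timestamp in July or later belongs to the season starting that calendar year, while January to June belongs to the season that started the year before.

```python
from datetime import datetime


def snow_year(ts):
    """Return the calendar year in which the snow season containing ts started.

    A season runs from 1 July (inclusive) to 30 June of the following year.
    """
    return ts.year if ts.month >= 7 else ts.year - 1


def snow_year_label(ts):
    """Human-readable season label such as '2022-2023'."""
    start = snow_year(ts)
    return f"{start}-{start + 1}"


print(snow_year_label(datetime(2023, 6, 30)))  # 2022-2023
print(snow_year_label(datetime(2023, 7, 1)))   # 2023-2024
```

Grouping hourly records by `snow_year` rather than by calendar year keeps each winter in a single bucket.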

The Sliding-Window Correlation (SWC) page was inspired by the course example but recreated fully in Plotly. I added selectors for meteorology variable, energy group (based on what the user chose in the global sidebar), month, lag, and window length. Aligning weather and energy time series required removing timezone awareness and reindexing both frames. The final three-panel layout includes highlightable windows and a dynamically updated correlation marker.
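
To make the role of the lag and window length concrete, here is a minimal dependency-free sketch of a sliding-window Pearson correlation. It assumes both series are already aligned on the same hourly index and that the lag is a non-negative number of steps by which energy trails weather; the app itself works on pandas frames, and the function names here are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (nan if constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / sqrt(sxx * syy)


def sliding_window_corr(weather, energy, window, lag=0):
    """Correlate weather[t] with energy[t + lag] over consecutive windows."""
    # Shift the series against each other; lag=0 leaves both unchanged.
    w = weather[: len(weather) - lag]
    e = energy[lag:]
    return [
        pearson(w[i : i + window], e[i : i + window])
        for i in range(len(w) - window + 1)
    ]


# Hypothetical values: demand rises as temperature falls
temperature = [0, -2, -4, -6, -8, -10]
consumption = [10, 12, 14, 16, 18, 20]
print(sliding_window_corr(temperature, consumption, window=3))
```

For this perfectly linear toy data every window yields a correlation of approximately -1; real series produce a curve that varies over time, which is what the three-panel layout visualises.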

Finally, I developed the SARIMAX forecasting page under the Predictive section. The user can choose the full training window, forecast horizon, and multiple exogenous meteorological variables. Getting SARIMAX to run reliably required solving several index-alignment errors. I rewrote the exogenous-variable preparation so that training and forecast matrices always have the correct shapes and timestamps. Once the model fit successfully, I added a Plotly forecast plot with confidence intervals as well as AIC/BIC reporting.
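
The index-alignment problem mentioned above comes from the fact that a model with exogenous regressors needs one exogenous row per training timestamp and one per forecast step. The sketch below illustrates that bookkeeping with plain dictionaries; it is an assumption-laden illustration rather than the app's code, and the function name `align_exog` and the sample values are hypothetical.

```python
from datetime import datetime, timedelta


def align_exog(endog, exog, horizon_hours):
    """Split exogenous rows into training and forecast parts that match endog.

    endog: dict mapping timestamp -> target value
    exog:  dict mapping timestamp -> list of exogenous values
    """
    # Train only on timestamps present in both series.
    train_times = sorted(t for t in endog if t in exog)
    if not train_times:
        raise ValueError("no overlapping timestamps between endog and exog")

    # The forecast needs one exogenous row for every hour after training ends.
    last = train_times[-1]
    future_times = [last + timedelta(hours=h) for h in range(1, horizon_hours + 1)]
    missing = [t for t in future_times if t not in exog]
    if missing:
        raise ValueError(f"exog is missing {len(missing)} forecast timestamps")

    y_train = [endog[t] for t in train_times]
    x_train = [exog[t] for t in train_times]
    x_future = [exog[t] for t in future_times]
    return y_train, x_train, x_future


# Hypothetical hourly data: 4 hours of consumption, 6 hours of temperature
t0 = datetime(2024, 1, 1)
endog = {t0 + timedelta(hours=h): 100.0 + h for h in range(4)}
exog = {t0 + timedelta(hours=h): [-5.0 - h] for h in range(6)}

y_train, x_train, x_future = align_exog(endog, exog, horizon_hours=2)
print(len(y_train), len(x_train), len(x_future))  # 4 4 2
```

Failing early with a clear message when forecast-period weather is missing is easier to debug than a shape mismatch raised inside the model.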

Overall, the assignment involved a lot of debugging, refactoring, and careful alignment of time series, but the final result is a well-structured app with clear explorative, anomaly-detection, and forecasting components.


## SWC Findings

When exploring the sliding-window correlation for NO1 in January 2024, I generally observed a negative relationship between temperature and electricity consumption: colder periods tended to coincide with higher demand. During the coldest days, this negative correlation appeared to become stronger (sometimes around -0.8), which fits reasonably well with the sharp temperature drop that month.

In normal conditions, adjusting the lag did not seem to change the correlation much. However, around the extreme-cold period, the strongest correlations often occurred with a 6–12 hour lag, which might reflect slower thermal responses in buildings. For thermal production, the strongest correlation usually appeared at lag = 0, suggesting little or no delayed effect.

Overall, the correlation patterns seemed more pronounced during extreme weather, and lag effects became more noticeable.