# UV-Index Modeling

This notebook wrangles the UV index datasets for Adelaide, Brisbane, Canberra, Melbourne, Perth, and Sydney (2007-2022), published by ARPANSA, into a single city-labelled dataset for the Monash University FIT5120 Onboarding project. The wrangling process assumes that the datasets are saved in the root folder of the project.

This dataset is later used to model the UV index for each city with an LSTM model built on `tensorflow`.

The data sources for each city are as follows:
* Melbourne data: https://data.gov.au/dataset/ds-dga-fb836013-f300-4f92-aa1e-fb5014aea40e/details?q=Ultraviolet%20Radiation%20Index
* Adelaide data: https://data.gov.au/dataset/ds-dga-026d4974-9efb-403d-9b39-27aee31a6439/details?q=Ultraviolet%20Radiation%20Index
* Perth data: https://data.gov.au/dataset/ds-dga-1b55352e-c0d8-48c8-9828-ef12885c9797/details?q=Ultraviolet%20Radiation%20Index
* Canberra data: https://data.gov.au/dataset/ds-dga-154d4d3b-2e8d-4dc2-b8ac-8f9805f99826/details?q=Ultraviolet%20Radiation%20Index
* Brisbane data: https://data.gov.au/dataset/ds-dga-2a1a2e49-de97-450e-9d0a-482adec68b22/details?q=Ultraviolet%20Radiation%20Index
* Sydney data: https://data.gov.au/dataset/ds-dga-c31a759c-a4d4-455f-87a7-98576be14f11/details?q=Ultraviolet%20Radiation%20Index

## Pre-process data
This section wrangles the individual `.csv` files containing each city's UV index data into a single dataset that serves as the training set for the model. The wrangling process includes:
1. Loading the data
2. Cleaning the data
3. Merging the data
4. Saving the data in a single dataframe

In [1]:
import os
import pandas as pd

# List the cities to be combined
cities = ['uv-adelaide','uv-brisbane','uv-melbourne','uv-canberra','uv-perth','uv-sydney']
cities_name = ['Adelaide','Brisbane','Melbourne','Canberra','Perth','Sydney']

# Create empty dataframe to store the combined data
combined_data = pd.DataFrame()

# Loop through the files in the root folder (sorted, because os.listdir() returns them in arbitrary order)
for file in sorted(os.listdir()):
    for city_index in range(len(cities)):
        if file.endswith('.csv') and file.startswith(cities[city_index]):
            # Print the file name for debugging
            print(f'Reading {file}')

            # Read the CSV file into a DataFrame
            df = pd.read_csv(file)

            # Drop the 'Lat' and 'Lon' columns
            df = df.drop(columns=['Lat', 'Lon'])
            
            # Depending on whether the file contains the 'Date-Time' or 'timestamp' column, convert it to datetime format
            if 'Date-Time' in df.columns:
                df['Date-Time'] = pd.to_datetime(df['Date-Time'])
            elif 'timestamp' in df.columns:
                df['timestamp'] = pd.to_datetime(df['timestamp'])
                df = df.rename(columns={'timestamp': 'Date-Time'})

            # Add a column to the DataFrame to store the city name
            df['city'] = cities_name[city_index]

            # Extract data from the 'Date-Time' column
            df['Day'] = df['Date-Time'].dt.day
            df['Month'] = df['Date-Time'].dt.month
            df['Year'] = df['Date-Time'].dt.year
            df['Hour'] = df['Date-Time'].dt.hour
            df['Minute'] = df['Date-Time'].dt.minute

            # Drop the 'Date-Time' column
            df = df.drop(columns=['Date-Time'])
            
            # Append the DataFrame to the combined data
            combined_data = pd.concat([combined_data, df], ignore_index=True)

Reading uv-adelaide-2007.csv
Reading uv-adelaide-2008.csv
Reading uv-adelaide-2009.csv
Reading uv-adelaide-2010.csv
Reading uv-adelaide-2011.csv
Reading uv-adelaide-2012.csv
Reading uv-adelaide-2013.csv
Reading uv-adelaide-2014.csv
Reading uv-adelaide-2015.csv
Reading uv-adelaide-2016.csv
Reading uv-adelaide-2017.csv
Reading uv-adelaide-2018.csv
Reading uv-adelaide-2019.csv
Reading uv-adelaide-2020.csv
Reading uv-adelaide-2021.csv
Reading uv-adelaide-2022.csv
Reading uv-brisbane-2007.csv
Reading uv-brisbane-2008.csv
Reading uv-brisbane-2009.csv
Reading uv-brisbane-2010.csv
Reading uv-brisbane-2011.csv
Reading uv-brisbane-2012.csv
Reading uv-brisbane-2013.csv
Reading uv-brisbane-2014.csv
Reading uv-brisbane-2015.csv
Reading uv-brisbane-2016.csv
Reading uv-brisbane-2017.csv
Reading uv-brisbane-2018.csv
Reading uv-brisbane-2019.csv
Reading uv-brisbane-2020.csv
Reading uv-brisbane-2021.csv
Reading uv-brisbane-2022.csv
Reading uv-canberra-2010.csv
Reading uv-canberra-2011.csv
Reading uv-can
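
The loading cell above grows `combined_data` by calling `pd.concat` once per file, which copies the accumulated rows on every iteration. A minimal sketch of an alternative is shown below: tidy each frame, collect the results in a list, and concatenate once. The helper `tidy_uv_frame` and the two toy frames are hypothetical and stand in for the real `.csv` files; the values are made up for illustration.

```python
import pandas as pd

def tidy_uv_frame(df: pd.DataFrame, city: str) -> pd.DataFrame:
    """Hypothetical helper: bring one raw UV frame into the notebook's column layout."""
    # Coordinates are not needed for modeling; ignore them if they are absent
    df = df.drop(columns=['Lat', 'Lon'], errors='ignore')
    # Some files name the time column 'timestamp' instead of 'Date-Time'
    df = df.rename(columns={'timestamp': 'Date-Time'})
    stamps = pd.to_datetime(df['Date-Time'])
    return df.drop(columns=['Date-Time']).assign(
        city=city,
        Day=stamps.dt.day,
        Month=stamps.dt.month,
        Year=stamps.dt.year,
        Hour=stamps.dt.hour,
        Minute=stamps.dt.minute,
    )

# Toy stand-ins for two raw files (made-up values, not real measurements)
raw_a = pd.DataFrame({
    'Date-Time': ['2007-03-27 00:01:00', '2007-03-27 00:02:00'],
    'Lat': [-34.92, -34.92],
    'Lon': [138.62, 138.62],
    'UV_Index': [0.0, 0.0],
})
raw_b = pd.DataFrame({
    'timestamp': ['2010-01-01 12:01:00', '2010-01-01 12:02:00'],
    'Lat': [-31.92, -31.92],
    'Lon': [115.96, 115.96],
    'UV_Index': [9.5, 9.6],
})

# Collect the tidy frames first, then concatenate once at the end
frames = [tidy_uv_frame(raw_a, 'Adelaide'), tidy_uv_frame(raw_b, 'Perth')]
combined = pd.concat(frames, ignore_index=True)
print(combined)
```

The resulting columns match the layout of `combined_data` in this notebook, so the later cells would not need to change.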

In [2]:
# Peek at the combined data
combined_data.head()

Unnamed: 0,UV_Index,city,Day,Month,Year,Hour,Minute
0,0.0,Adelaide,27,3,2007,0,1
1,0.0,Adelaide,27,3,2007,0,2
2,0.0,Adelaide,27,3,2007,0,3
3,0.0,Adelaide,27,3,2007,0,4
4,0.0,Adelaide,27,3,2007,0,5


In [3]:
# Print length of dataframe
print(len(combined_data))

45601736


Since we're working on a minute-by-minute dataframe with over 45 million rows, we'll downsample the dataframe to an hourly frequency to reduce the size of the dataset.

In [4]:
# Keep only the reading from the first recorded minute of each hour (Minute == 1)
combined_data = combined_data[combined_data['Minute'] == 1].reset_index(drop=True)

combined_data.head()

Unnamed: 0,UV_Index,city,Day,Month,Year,Hour,Minute
0,0.0,Adelaide,27,3,2007,0,1
1,0.0,Adelaide,27,3,2007,1,1
2,0.0,Adelaide,27,3,2007,2,1
3,0.0,Adelaide,27,3,2007,3,1
4,0.0,Adelaide,27,3,2007,4,1


In [5]:
print(len(combined_data))

762039
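
Filtering on `Minute == 1` keeps a single instantaneous reading per hour. An alternative, which this notebook does not use, is to average all readings within each hour. The sketch below shows that idea on a small made-up frame with the same column names; it is an illustration under those assumptions, not a replacement for the cell above.

```python
import pandas as pd

# Toy minute-level data: two readings in each of two hours (made-up values)
toy = pd.DataFrame({
    'UV_Index': [1.0, 3.0, 2.0, 4.0],
    'city': ['Adelaide'] * 4,
    'Day': [27] * 4,
    'Month': [3] * 4,
    'Year': [2007] * 4,
    'Hour': [12, 12, 13, 13],
    'Minute': [1, 2, 1, 2],
})

# Group by everything that identifies one hour in one city, then average
keys = ['city', 'Year', 'Month', 'Day', 'Hour']
hourly_mean = toy.groupby(keys, as_index=False, sort=False)['UV_Index'].mean()
print(hourly_mean)
```

Averaging smooths out short spikes and dips, at the cost of a full pass over the minute-level rows; whether that trade-off suits the model is a judgment call.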


With the dataframe at a more manageable size, we'll continue with the pre-processing steps:

In [6]:
# Remove the Minute column
combined_data = combined_data.drop(columns=['Minute'])

combined_data.head()

Unnamed: 0,UV_Index,city,Day,Month,Year,Hour
0,0.0,Adelaide,27,3,2007,0
1,0.0,Adelaide,27,3,2007,1
2,0.0,Adelaide,27,3,2007,2
3,0.0,Adelaide,27,3,2007,3
4,0.0,Adelaide,27,3,2007,4


Now that we have extracted all the necessary input for the model, we can proceed to process the data for model training.

In [7]:
# Encode the city names
processed_data = pd.get_dummies(combined_data, columns=['city'])

# Convert the dummies into 1 if True and 0 if False
for city in cities_name:
    processed_data[f'city_{city}'] = processed_data[f'city_{city}'].astype(int)

processed_data.head()

Unnamed: 0,UV_Index,Day,Month,Year,Hour,city_Adelaide,city_Brisbane,city_Canberra,city_Melbourne,city_Perth,city_Sydney
0,0.0,27,3,2007,0,1,0,0,0,0,0
1,0.0,27,3,2007,1,1,0,0,0,0,0
2,0.0,27,3,2007,2,1,0,0,0,0,0
3,0.0,27,3,2007,3,1,0,0,0,0,0
4,0.0,27,3,2007,4,1,0,0,0,0,0
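
The encoding cell above converts the dummy columns to integers in a separate loop. As a small sketch on made-up data, `pd.get_dummies` can also produce integer columns directly through its `dtype` argument:

```python
import pandas as pd

# Toy frame with the same 'city' column name as the notebook (made-up values)
toy = pd.DataFrame({
    'UV_Index': [0.0, 5.2, 7.1],
    'city': ['Adelaide', 'Perth', 'Adelaide'],
})

# dtype=int yields 0/1 columns without a follow-up astype() loop
encoded = pd.get_dummies(toy, columns=['city'], dtype=int)
print(encoded)
```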


In [8]:
# Move the 'UV_Index' column to the end
processed_data = processed_data[[col for col in processed_data.columns if col != 'UV_Index'] + ['UV_Index']]
processed_data.head()

Unnamed: 0,Day,Month,Year,Hour,city_Adelaide,city_Brisbane,city_Canberra,city_Melbourne,city_Perth,city_Sydney,UV_Index
0,27,3,2007,0,1,0,0,0,0,0,0.0
1,27,3,2007,1,1,0,0,0,0,0,0.0
2,27,3,2007,2,1,0,0,0,0,0,0.0
3,27,3,2007,3,1,0,0,0,0,0,0.0
4,27,3,2007,4,1,0,0,0,0,0,0.0


In [9]:
from sklearn.preprocessing import MinMaxScaler

# Normalize features
scaler = MinMaxScaler(feature_range=(0, 1))
scaled_data = scaler.fit_transform(processed_data[['Day','Month','Year','Hour']])
processed_data[['Scaled_Day','Scaled_Month','Scaled_Year','Scaled_Hour']] = pd.DataFrame(scaled_data, columns=['Day','Month','Year','Hour'])

# Drop the original columns except 'Year', which is still needed for the train/test split
processed_data = processed_data.drop(columns=['Day','Month','Hour'])

# Peek at the scaled data
processed_data.head()

Unnamed: 0,Year,city_Adelaide,city_Brisbane,city_Canberra,city_Melbourne,city_Perth,city_Sydney,UV_Index,Scaled_Day,Scaled_Month,Scaled_Year,Scaled_Hour
0,2007,1,0,0,0,0,0,0.0,0.866667,0.181818,0.0,0.0
1,2007,1,0,0,0,0,0,0.0,0.866667,0.181818,0.0,0.043478
2,2007,1,0,0,0,0,0,0.0,0.866667,0.181818,0.0,0.086957
3,2007,1,0,0,0,0,0,0.0,0.866667,0.181818,0.0,0.130435
4,2007,1,0,0,0,0,0,0.0,0.866667,0.181818,0.0,0.173913
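
The scaled values in the table above follow from the min-max formula `(x - min) / (max - min)`. The sketch below reproduces three of them by hand, assuming the observed ranges are 1 to 31 for `Day`, 1 to 12 for `Month`, and 0 to 23 for `Hour`; the helper `min_max` is introduced here for illustration only.

```python
def min_max(value, lowest, highest):
    """Scale a value to [0, 1] given the observed minimum and maximum."""
    return (value - lowest) / (highest - lowest)

# Day 27 on a 1..31 range
scaled_day = round(min_max(27, 1, 31), 6)
# Month 3 on a 1..12 range
scaled_month = round(min_max(3, 1, 12), 6)
# Hour 1 on a 0..23 range
scaled_hour = round(min_max(1, 0, 23), 6)

print(scaled_day, scaled_month, scaled_hour)  # 0.866667 0.181818 0.043478
```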


## Model Training
To predict the UV index, we build a Keras LSTM model, since LSTMs are well suited to time-series data. The model is trained on UV index data from 2007-2021 and evaluated on the 2022 data.

In [10]:
# Split the Year 2022 data as test set, and the rest as training set
train_data = processed_data[processed_data['Year'] != 2022]
test_data = processed_data[processed_data['Year'] == 2022]

# Drop the 'Year' column
train_data = train_data.drop(columns=['Year'])
test_data = test_data.drop(columns=['Year'])

# Split the features and target variable
X_train = train_data.drop(columns=['UV_Index'])
y_train = train_data['UV_Index']
X_test = test_data.drop(columns=['UV_Index'])
y_test = test_data['UV_Index']

In [11]:
import numpy as np

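def reshape_data(data, timesteps):
    """Stack overlapping windows of `timesteps` consecutive rows into a 3D array."""
    X_reshaped = []
    # Each window covers rows i .. i + timesteps - 1
    for i in range(len(data) - timesteps + 1):
        X_reshaped.append(data[i:i + timesteps])
    return np.array(X_reshaped)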

# Define timesteps -> 24 hours
timesteps = 24

# Reshape the data
X_train_reshaped = reshape_data(X_train.values, timesteps)
X_test_reshaped = reshape_data(X_test.values, timesteps)
y_train_reshaped = y_train.values[timesteps - 1:]
y_test_reshaped = y_test.values[timesteps - 1:]

In [12]:
X_train_reshaped.shape, y_train_reshaped.shape

((709498, 24, 10), (709498,))
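
To see where the shape above comes from, the windowing can be traced on a tiny made-up array. With `n` rows and a window of `timesteps` rows there are `n - timesteps + 1` windows, and each target is aligned with the last row of its window. The sketch repeats `reshape_data` so that it runs on its own; the toy arrays are assumptions for illustration.

```python
import numpy as np

def reshape_data(data, timesteps):
    # Build overlapping windows of `timesteps` consecutive rows
    windows = [data[i:i + timesteps] for i in range(len(data) - timesteps + 1)]
    return np.array(windows)

# 6 rows with 2 features each, and one target per row (made-up values)
features = np.arange(12).reshape(6, 2)
targets = np.arange(6) * 10
timesteps = 3

X = reshape_data(features, timesteps)
y = targets[timesteps - 1:]

print(X.shape, y.shape)  # (4, 3, 2) (4,)
print(X[0][-1], y[0])    # last row of the first window and its target
```

In the notebook the same arithmetic gives 709498 windows of 24 hourly rows with 10 features each.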

In [13]:
from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

# Define the LSTM model
model = Sequential()
model.add(LSTM(units=64, return_sequences=True, input_shape=(timesteps, X_train.shape[1])))  # Adjust units
model.add(LSTM(units=32))  # Adjust units
model.add(Dense(1))

model.compile(loss="mse", optimizer="adam")

# Train the LSTM Model
model.fit(X_train_reshaped, y_train_reshaped, epochs=10, batch_size=32, validation_data=(X_test_reshaped, y_test_reshaped))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10

<keras.src.callbacks.History at 0x208201234d0>
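
The model is compiled with an MSE loss, so one way to report its error on the 2022 data in UV index units is the root mean squared error. The sketch below is a minimal illustration with made-up numbers; in the notebook the predictions would come from `model.predict(X_test_reshaped)`, which returns an array of shape `(n, 1)`, and the true values would be `y_test_reshaped`. The helper `rmse` is hypothetical.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error; flattens inputs so (n, 1) predictions are accepted."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up stand-ins for the true values and for the model's (n, 1) predictions
y_true_toy = np.array([0.0, 2.0, 6.0, 9.0])
y_pred_toy = np.array([[0.0], [3.0], [5.0], [9.0]])

score = rmse(y_true_toy, y_pred_toy)
print(f'RMSE: {score:.4f}')
```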

In [14]:
# Save model in .keras format for future deployment
model.save('uv-predict.keras')

In [15]:
# BACKUP: Save model in .h5 format for future deployment
model.save('uv-predict.h5')



**References**
* https://machinelearningmastery.com/time-series-prediction-lstm-recurrent-neural-networks-python-keras/