# UV-Index Modeling

This notebook wrangles the UV-index datasets from 2007-2022 for Adelaide, Brisbane, Canberra, Melbourne, Perth, and Sydney, published by ARPANSA (the Australian Radiation Protection and Nuclear Safety Agency), into a single dataset labelled by city for the Monash University FIT5120 Onboarding project. The wrangling process assumes that the datasets are saved in the root folder of the project.

This dataset is later used to model the UV-Index for each city using LSTM models powered by `tensorflow`.

The data sources for each city are as follows:
* Melbourne data: https://data.gov.au/dataset/ds-dga-fb836013-f300-4f92-aa1e-fb5014aea40e/details?q=Ultraviolet%20Radiation%20Index
* Adelaide data: https://data.gov.au/dataset/ds-dga-026d4974-9efb-403d-9b39-27aee31a6439/details?q=Ultraviolet%20Radiation%20Index
* Perth data: https://data.gov.au/dataset/ds-dga-1b55352e-c0d8-48c8-9828-ef12885c9797/details?q=Ultraviolet%20Radiation%20Index
* Canberra data: https://data.gov.au/dataset/ds-dga-154d4d3b-2e8d-4dc2-b8ac-8f9805f99826/details?q=Ultraviolet%20Radiation%20Index
* Brisbane data: https://data.gov.au/dataset/ds-dga-2a1a2e49-de97-450e-9d0a-482adec68b22/details?q=Ultraviolet%20Radiation%20Index
* Sydney data: https://data.gov.au/dataset/ds-dga-c31a759c-a4d4-455f-87a7-98576be14f11/details?q=Ultraviolet%20Radiation%20Index

## Pre-process data
This section wrangles the individual `.csv` files containing each city's UV-index data into a single dataset that serves as the training set for the model. The wrangling process includes:
1. Loading the data
2. Cleaning the data
3. Merging the data
4. Saving the data in a single dataframe

In [1]:
import os
import pandas as pd

# List the file-name prefixes to be combined and the matching city names
cities = ['uv-adelaide','uv-brisbane','uv-melbourne','uv-canberra','uv-perth','uv-sydney']
cities_name = ['Adelaide','Brisbane','Melbourne','Canberra','Perth','Sydney']

# Collect one DataFrame per file; concatenating once at the end avoids
# re-copying the growing dataset on every iteration
frames = []

# Loop through the files in the root folder (sorted for a reproducible order)
for file in sorted(os.listdir()):
    for prefix, city_name in zip(cities, cities_name):
        if file.endswith('.csv') and file.startswith(prefix):
            # Print the file name for debugging
            print(f'Reading {file}')

            # Read the CSV file into a DataFrame
            df = pd.read_csv(file)

            # Get rid of the 'Lat' and 'Lon' columns
            df = df.drop(columns=['Lat', 'Lon'])

            # The time column is named either 'Date-Time' or 'timestamp':
            # normalise the name, then convert the column to datetime format
            if 'timestamp' in df.columns:
                df = df.rename(columns={'timestamp': 'Date-Time'})
            df['Date-Time'] = pd.to_datetime(df['Date-Time'])

            # Add a column to the DataFrame to store the city name
            df['city'] = city_name

            # Extract data from the 'Date-Time' column
            df['Day'] = df['Date-Time'].dt.day
            df['Month'] = df['Date-Time'].dt.month
            df['Year'] = df['Date-Time'].dt.year
            df['Hour'] = df['Date-Time'].dt.hour
            df['Minute'] = df['Date-Time'].dt.minute

            # Drop the 'Date-Time' column
            df = df.drop(columns=['Date-Time'])

            # Keep the cleaned DataFrame for the final concatenation
            frames.append(df)

# Combine the data from all files into a single DataFrame
combined_data = pd.concat(frames, ignore_index=True)

Reading uv-adelaide-2007.csv
Reading uv-adelaide-2008.csv
Reading uv-adelaide-2009.csv
Reading uv-adelaide-2010.csv
Reading uv-adelaide-2011.csv
Reading uv-adelaide-2012.csv
Reading uv-adelaide-2013.csv
Reading uv-adelaide-2014.csv
Reading uv-adelaide-2015.csv
Reading uv-adelaide-2016.csv
Reading uv-adelaide-2017.csv
Reading uv-adelaide-2018.csv
Reading uv-adelaide-2019.csv
Reading uv-adelaide-2020.csv
Reading uv-adelaide-2021.csv
Reading uv-adelaide-2022.csv
Reading uv-brisbane-2007.csv
Reading uv-brisbane-2008.csv
Reading uv-brisbane-2009.csv
Reading uv-brisbane-2010.csv
Reading uv-brisbane-2011.csv
Reading uv-brisbane-2012.csv
Reading uv-brisbane-2013.csv
Reading uv-brisbane-2014.csv
Reading uv-brisbane-2015.csv
Reading uv-brisbane-2016.csv
Reading uv-brisbane-2017.csv
Reading uv-brisbane-2018.csv
Reading uv-brisbane-2019.csv
Reading uv-brisbane-2020.csv
Reading uv-brisbane-2021.csv
Reading uv-brisbane-2022.csv
Reading uv-canberra-2010.csv
Reading uv-canberra-2011.csv
Reading uv-can
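
With this many files, memory use during loading can become a concern. The following is a minimal sketch, not the notebook's actual loader, of how unused columns could be skipped while parsing and the readings stored as `float32`. The helper name `load_uv_file`, the sample CSV contents, and the coordinate values are hypothetical; the real files' exact layout is assumed from the column names used above.

```python
import io
import pandas as pd

# Hypothetical CSV contents standing in for one real file (layout is an assumption)
sample_csv = io.StringIO(
    "timestamp,Lat,Lon,UV_Index\n"
    "2007-03-27 00:01:00,-34.92,138.62,0.0\n"
    "2007-03-27 00:02:00,-34.92,138.62,0.1\n"
)

def load_uv_file(source, city):
    """Read one UV file, skipping 'Lat'/'Lon' and storing readings as float32."""
    df = pd.read_csv(
        source,
        # Skip the unused columns while parsing instead of dropping them afterwards
        usecols=lambda col: col not in ('Lat', 'Lon'),
        # float32 halves the memory needed for the readings compared with float64
        dtype={'UV_Index': 'float32'},
    )
    # Normalise the time column name, then parse it as datetime
    df = df.rename(columns={'timestamp': 'Date-Time'})
    df['Date-Time'] = pd.to_datetime(df['Date-Time'])
    df['city'] = city
    return df

sample = load_uv_file(sample_csv, 'Adelaide')
print(sample.dtypes)
```

Whether `float32` precision is sufficient depends on how the readings are used later, so treat the dtype choice as something to verify rather than a recommendation.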

In [2]:
# Peek at the combined data
combined_data.head()

Unnamed: 0,UV_Index,city,Day,Month,Year,Hour,Minute
0,0.0,Adelaide,27,3,2007,0,1
1,0.0,Adelaide,27,3,2007,0,2
2,0.0,Adelaide,27,3,2007,0,3
3,0.0,Adelaide,27,3,2007,0,4
4,0.0,Adelaide,27,3,2007,0,5


In [3]:
# Print length of dataframe
print(len(combined_data))

45601736


Since the combined dataframe holds minute-by-minute readings and has over 45 million rows, we downsample it to an hourly frequency to reduce its size.

In [4]:
# Keep only the reading taken at minute 1 of each hour
combined_data = combined_data[combined_data['Minute'] == 1]

combined_data.head()

Unnamed: 0,UV_Index,city,Day,Month,Year,Hour,Minute
0,0.0,Adelaide,27,3,2007,0,1
60,0.0,Adelaide,27,3,2007,1,1
120,0.0,Adelaide,27,3,2007,2,1
180,0.0,Adelaide,27,3,2007,3,1
240,0.0,Adelaide,27,3,2007,4,1


In [5]:
print(len(combined_data))

762039
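
Keeping a single reading per hour discards the remaining minutes of that hour. As a possible alternative, shown only as a sketch, the readings could be averaged per city and hour with a `groupby`. The helper name `hourly_mean` and the toy values are hypothetical; the column names follow the combined dataframe above.

```python
import pandas as pd

# Hypothetical toy data: three minute-level readings in each of two hours
toy = pd.DataFrame({
    'UV_Index': [0.0, 1.0, 2.0, 4.0, 6.0, 8.0],
    'city': ['Adelaide'] * 6,
    'Day': [27] * 6,
    'Month': [3] * 6,
    'Year': [2007] * 6,
    'Hour': [10, 10, 10, 11, 11, 11],
    'Minute': [1, 2, 3, 1, 2, 3],
})

def hourly_mean(data: pd.DataFrame) -> pd.DataFrame:
    """Average UV_Index over every (city, date, hour) group."""
    keys = ['city', 'Year', 'Month', 'Day', 'Hour']
    # as_index=False keeps the grouping keys as ordinary columns
    return data.groupby(keys, as_index=False)['UV_Index'].mean()

hourly = hourly_mean(toy)
print(hourly)
```

The result has one row per city and hour and no `Minute` column. An hourly mean smooths short-lived spikes, which may or may not be desirable for the model.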


In [3]:
# Encode the city names
processed_data = pd.get_dummies(combined_data, columns=['city'])
processed_data.head()

Unnamed: 0,UV_Index,Day,Month,Year,Hour,Minute,city_Adelaide,city_Brisbane,city_Canberra,city_Melbourne,city_Perth,city_Sydney
0,0.0,27,3,2007,0,1,True,False,False,False,False,False
1,0.0,27,3,2007,0,2,True,False,False,False,False,False
2,0.0,27,3,2007,0,3,True,False,False,False,False,False
3,0.0,27,3,2007,0,4,True,False,False,False,False,False
4,0.0,27,3,2007,0,5,True,False,False,False,False,False


In [5]:
# Move the 'UV_Index' column to the end
processed_data = processed_data[[col for col in processed_data.columns if col != 'UV_Index'] + ['UV_Index']]
processed_data.head()

Unnamed: 0,Day,Month,Year,Hour,Minute,city_Adelaide,city_Brisbane,city_Canberra,city_Melbourne,city_Perth,city_Sydney,UV_Index
0,27,3,2007,0,1,True,False,False,False,False,False,0.0
1,27,3,2007,0,2,True,False,False,False,False,False,0.0
2,27,3,2007,0,3,True,False,False,False,False,False,0.0
3,27,3,2007,0,4,True,False,False,False,False,False,0.0
4,27,3,2007,0,5,True,False,False,False,False,False,0.0


Now that we have extracted all the necessary inputs for the model, we can first sequence the data and then build the model.

In [6]:
from sklearn.preprocessing import MinMaxScaler

# Normalize UV-index to the [0, 1] range
scaler = MinMaxScaler()
processed_data['UV_Index_scaled'] = scaler.fit_transform(processed_data[['UV_Index']])

# Create sequences
look_back = 12  # Considering 12 past UV-index readings for prediction
sequences = []
# Slide a window of length look_back over the whole processed dataset
for i in range(look_back, len(processed_data)):
    sequence = processed_data[['UV_Index_scaled']].iloc[i-look_back:i].values
    sequences.append(sequence)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  processed_data['UV_Index_scaled'] = scaler.fit_transform(processed_data[['UV_Index']])


MemoryError: Unable to allocate 348. MiB for an array with shape (45601736, 1) and data type float64
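
Building sequences row by row in a Python loop is slow and memory-hungry at this scale. The sequencing step could instead be sketched with NumPy's `sliding_window_view`, which exposes the windows as a view without copying the data. This is an illustration under assumptions, not the notebook's method: the helper name `make_sequences` is hypothetical, and pairing each window with the reading that follows it as the prediction target is an assumption about how the model would be trained.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def make_sequences(values, look_back):
    """Return (X, y): windows of look_back readings and the reading after each."""
    # float32 halves the memory footprint compared with float64
    values = np.asarray(values, dtype=np.float32)
    # Every window of length look_back + 1; the last element is the target
    windows = sliding_window_view(values, look_back + 1)
    X = windows[:, :-1, np.newaxis]  # shape (n_samples, look_back, 1)
    y = windows[:, -1]               # shape (n_samples,)
    return X, y

# Hypothetical scaled series of 20 readings in [0, 1]
toy_series = np.arange(20) / 19.0
X, y = make_sequences(toy_series, look_back=12)
print(X.shape, y.shape)
```

Because the combined dataframe stacks all cities, applying such a helper to each city separately (for example, on a time-sorted subset per city) would keep windows from spanning two cities.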

**References**
* https://machinelearningmastery.com/time-series-prediction-lstm-recurrent-neural-networks-python-keras/