<h3 style="text-align: center"><i>Regular Python Lists when Computing a Loss Function</i></h3>

**Author:** Russel Marcelo

**Date:** December 1, 2023

---

**Goal:** 
This assignment focuses on examining how using Numpy [[2]](#2) affects numerical computation performance, while also <br /> reinforcing skills in Pandas [[1]](#1). It introduces the concept and application of a loss function.

**Problem and Input:** A machine learning model devised by weather researchers forecasts rainfall and evaporation <br />  across various locations in Australia on different days. [Table 1](#t1) displays the data collected from both experimental observations and model predictions.

**The observables are described by the following:**

- **Date** - The observation date.

- **Location** - The weather station location.
  
- **MinTemp** - The minimum temperature (°C).
  
- **MaxTemp** - The maximum temperature (°C).
  
- **Rainfall** - The rainfall amount in 24 hours (mm).
  
- **Evaporation** - The evaporation amount in 24 hours (mm).
  
- **Sunshine** - The sunshine amount in 24 hours (h).
  
- **WindGustSpeed** - The maximum wind gust speed in 24 hours (km/h).
  
- **RainToday** - Did it rain on that day? yes: if precipitation >= 1 mm, no: if precipitation < 1 mm.
  
- **RainTomorrow** - Did it rain on the following day? yes: if precipitation >= 1 mm, no: if precipitation < 1 mm.

**References**

<a id="1">[1]</a> Pandas: Python Data Analysis Library: https://johnfoster.pge.utexas.edu/numerical-methods-book/ScientificPython_Pandas.html
<br />
<a id="2">[2]</a> Numpy: https://www.geeksforgeeks.org/introduction-to-numpy/
<br />
<a id="3">[3]</a> Loss function: https://en.wikipedia.org/wiki/Loss_function
<br />
<a id="4">[4]</a> zip function: https://realpython.com/python-zip-function/
<br /> 
<a id="5">[5]</a> timeit: https://docs.python.org/3/library/timeit.html

<a id="t1" style="text-align: center">
<h3>Table 1</h3>
</a>

| Location: Australia     |             |                |                |               |              | Experimental     |                     |            |               | Model Prediction     |               |
| ------------ | ---------- |:--------------:|:--------------:|:-------------:|:------------:|:------------:|:-------------------:|:----------:|:-------------:|:-------------:|:-------------:|
| *Date*         | *Location*    | *Min. Temp (°C)* | *Max. Temp (°C)* | *Rainfall (mm)* | *Evapor. (mm)* | *Sunshine (h)* | *Wind speed (km/h)* | *Rain Today* | *Rain Tomorrow* | *Rainfall (mm)* | *Evapor. (mm)* |
| 2009-01-02 | Cobar | 18.4 | 28.9 | 0.0 | 14.8 | 13.0 | 37.0 | No | No | 1.16457 | 7.564111 |
| 2009-01-04 | Cobar | 19.4 | 37.6 | 0.0 | 10.8 | 10.6 | 46.0 | No | No | 1.077602 | 2.872613 |
| 2009-01-05 | Cobar | 21.9 | 38.4 | 0.0 | 11.4 | 12.2 | 31.0 | No | No | 2.082352 | 8.060459 |
| 2009-01-06 | Cobar | 24.2 | 41.0 | 0.0 | 11.2 | 8.4 | 35.0 | No | No | 7.453461 | 7.468973 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |

<div style="text-align: center">

#### Loss function for evaluating the predicted rainfall and evaporation

A loss function [[3]](#3) is a mathematical method used to measure how well a machine learning model performs <br /> by quantifying the difference between predicted and actual values, guiding the model towards improvement.

</div>

<div style="text-align: center"><a id="loss">

#### $$Loss = \alpha \cdot \left| R^{Pred.} - R^{Exp.} \right| + \beta \cdot \left| E^{Pred.} - E^{Exp.} \right|$$

</a>
</div>

$\alpha$ is the rainfall weighting factor.
<br /> 
$\beta$ is the evaporation weighting factor.
<br />
$R^{Pred.}$ and $R^{Exp.}$ are the predicted and experimental rainfall values.
<br />
$E^{Pred.}$ and $E^{Exp.}$ are the predicted and experimental evaporation values.
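
As a quick sanity check, the formula can be evaluated by hand for a single observation. The sketch below uses the values from the first row of [Table 1](#t1) and assumes $\alpha = \beta = 0.5$ (the values used later in Task 3). The variable names are illustrative only.

```python
# Worked example of the loss for one observation (first row of Table 1).
# alpha = beta = 0.5 is an assumption here; the variable names are illustrative.
alpha, beta = 0.5, 0.5

r_exp, e_exp = 0.0, 14.8             # experimental rainfall / evaporation (mm)
r_pred, e_pred = 1.16457, 7.564111   # predicted rainfall / evaporation (mm)

# 0.5 * |1.16457 - 0.0| + 0.5 * |7.564111 - 14.8|
loss = alpha * abs(r_pred - r_exp) + beta * abs(e_pred - e_exp)
print(round(loss, 4))  # 4.2002
```

Most of the loss for this row comes from the evaporation term, since the predicted evaporation is about 7.2 mm away from the measured value.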

**Import**

Let's import the essential libraries needed for the assignment, `pandas` [[1]](#1), `numpy` [[2]](#2) and `timeit` [[5]](#5), <br />so that we can use them later in the code blocks.

In [1]:
import pandas as pd
import numpy as np
import timeit


---

<h3> Task 1: </h3>

We'll extract data from the files `weather_experiment.csv` and `weather_prediction.csv` using `pd.read_csv` from <br /> the Pandas library. After importing the data, we'll use `.dropna()` to **remove rows with missing or NaN values** <br /> and `.drop_duplicates()` to **eliminate any duplicate entries**.

In [2]:
data_exp = pd.read_csv("../docs/weather_experiment.csv", header=0, sep=",")
data_exp.dropna(axis=0, how='any', inplace=True)
data_exp.drop_duplicates(inplace=True)
data_exp

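| Unnamed: 0 | Date | Location | MinTemp | MaxTemp | Rainfall | Evaporation | Sunshine | WindGustSpeed | RainToday | RainTomorrow |
| ---------- | ---------- | -------- |:-------:|:-------:|:--------:|:-----------:|:--------:|:-------------:|:---------:|:------------:|
| 0 | 2009-01-01 | Cobar | 17.9 | 35.2 | 0.0 | 12.0 | 12.3 | 48.0 | No | No |
| 1 | 2009-01-02 | Cobar | 18.4 | 28.9 | 0.0 | 14.8 | 13.0 | 37.0 | No | No |
| 2 | 2009-01-04 | Cobar | 19.4 | 37.6 | 0.0 | 10.8 | 10.6 | 46.0 | No | No |
| 3 | 2009-01-05 | Cobar | 21.9 | 38.4 | 0.0 | 11.4 | 12.2 | 31.0 | No | No |
| 4 | 2009-01-06 | Cobar | 24.2 | 41.0 | 0.0 | 11.2 | 8.4 | 35.0 | No | No |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 55242 | 2017-06-20 | Darwin | 19.3 | 33.4 | 0.0 | 6.0 | 11.0 | 35.0 | No | No |
| 55243 | 2017-06-21 | Darwin | 21.2 | 32.6 | 0.0 | 7.6 | 8.6 | 37.0 | No | No |
| 55244 | 2017-06-22 | Darwin | 20.7 | 32.8 | 0.0 | 5.6 | 11.0 | 33.0 | No | No |
| 55245 | 2017-06-23 | Darwin | 19.5 | 31.8 | 0.0 | 6.2 | 10.6 | 26.0 | No | No |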
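
To see what the two cleaning calls do in isolation, here is a minimal sketch on a hypothetical toy frame (not the real weather data) that contains one row with a missing value and one exact duplicate.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: row 2 duplicates row 1, row 3 has a missing value.
toy = pd.DataFrame({
    "Date": ["2009-01-01", "2009-01-02", "2009-01-02", "2009-01-03"],
    "Rainfall": [0.0, 1.2, 1.2, np.nan],
})

# Drop rows containing any NaN, then drop rows that repeat an earlier row.
cleaned = toy.dropna(axis=0, how="any").drop_duplicates()
print(len(toy), "->", len(cleaned))  # 4 -> 2
```

Note that the surviving rows keep their original index labels, which is why the index of a cleaned frame can contain gaps.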


Next, we repeat the process with the second `.csv` file, which contains the predicted rainfall and evaporation data.

In [3]:
data_pred = pd.read_csv("../docs/weather_prediction.csv",header=0, sep=",")
data_pred.dropna(axis=0, how='any', inplace=True)
data_pred.drop_duplicates(inplace=True)
data_pred

| Unnamed: 0 | Date | Location | Rainfall Pred. | Evaporation Pred. |
| ---------- | ---------- | -------- |:--------------:|:-----------------:|
| 0 | 2009-01-01 | Cobar | 7.098998 | 6.179719 |
| 1 | 2009-01-02 | Cobar | 1.433238 | 6.375806 |
| 2 | 2009-01-04 | Cobar | 0.914834 | 5.687946 |
| 3 | 2009-01-05 | Cobar | 5.285904 | 6.897139 |
| 4 | 2009-01-06 | Cobar | 0.993975 | 0.050364 |
| ... | ... | ... | ... | ... |
| 55242 | 2017-06-20 | Darwin | 5.693780 | 3.400099 |
| 55243 | 2017-06-21 | Darwin | 1.548031 | 1.696780 |
| 55244 | 2017-06-22 | Darwin | 1.516136 | 2.945245 |
| 55245 | 2017-06-23 | Darwin | 1.158509 | 4.960711 |
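
Because the two files are cleaned independently, a row dropped from one frame but not from the other would shift the element-wise pairing that the loss function relies on. One possible safeguard, sketched here on hypothetical toy frames rather than the real data, is to join both tables on `Date` and `Location` with `pd.merge` before extracting the columns. This is an optional extra and not part of the original workflow.

```python
import pandas as pd

# Hypothetical toy frames standing in for data_exp and data_pred.
exp = pd.DataFrame({
    "Date": ["2009-01-01", "2009-01-02", "2009-01-04"],
    "Location": ["Cobar", "Cobar", "Cobar"],
    "Rainfall": [0.0, 0.0, 0.0],
})
pred = pd.DataFrame({
    "Date": ["2009-01-02", "2009-01-04"],  # 2009-01-01 is missing here
    "Location": ["Cobar", "Cobar"],
    "Rainfall Pred.": [1.4, 0.9],
})

# An inner join keeps only the (Date, Location) pairs present in both frames,
# so each experimental value is paired with its own prediction.
merged = pd.merge(exp, pred, on=["Date", "Location"], how="inner")
print(merged)
```

Comparing `len(merged)` with the lengths of the two inputs is a cheap way to check whether any rows failed to match.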


---

<h3> Task 2: </h3>

In this task, we need to develop user-defined functions that encode and calculate the [loss function](#loss) with the following criteria:

1. Perform calculations using standard Python lists exclusively (avoiding the use of Numpy or ndarrays).
  <br />
  <br />
2. Perform calculations using Numpy, maximizing the utilization of its library functionalities for enhanced performance.

<div style="text-align: center">
<p><b>standard Python lists</b></p>

To perform the calculation using regular Python lists:

</div>

In [6]:
def loss_func_regular(alpha:float, beta:float, r_pred:list, r_exp:list, e_pred:list, e_exp:list) -> list:
    return [alpha * abs(r_p - r_e) + beta * abs(e_p - e_e) 
            for r_p, r_e, e_p, e_e in zip(r_pred, r_exp, e_pred, e_exp)]

In this context, `zip` [[4]](#4) walks through the lists `r_pred`, `r_exp`, `e_pred` and `e_exp` in parallel and yields one tuple of matching elements per observation. <br /> The list comprehension applies the loss formula to each tuple, so the function returns a list with **one loss value per observation**.
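
A small usage sketch with hypothetical numbers, chosen so the result is easy to verify by hand. The function is repeated here only so that the snippet runs on its own.

```python
def loss_func_regular(alpha, beta, r_pred, r_exp, e_pred, e_exp):
    # Same element-wise computation as in the cell above.
    return [alpha * abs(r_p - r_e) + beta * abs(e_p - e_e)
            for r_p, r_e, e_p, e_e in zip(r_pred, r_exp, e_pred, e_exp)]

# Hypothetical inputs with two observations.
losses = loss_func_regular(0.5, 0.5,
                           r_pred=[1.0, 2.0], r_exp=[0.0, 4.0],
                           e_pred=[3.0, 5.0], e_exp=[1.0, 5.0])
print(losses)  # [1.5, 1.0]
```

Keep in mind that `zip` stops at the shortest input, so lists of unequal length are silently truncated rather than raising an error.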

<div style="text-align: center">
<p><b>Numpy Array</b></p>

To use Numpy for the calculation, the data lists must first be converted into Numpy arrays with `np.array()` (this conversion is done in Task 3). The function itself then operates on whole arrays.

</div>

In [7]:
def loss_function_np(alpha: float, beta: float,
                     r_pred_np: np.ndarray, r_exp_np: np.ndarray,
                     e_pred_np: np.ndarray, e_exp_np: np.ndarray) -> np.ndarray:
    """Return the element-wise loss for the given Numpy arrays."""
    # The arithmetic is vectorized: each operation acts on whole arrays at once
    return alpha * np.abs(r_pred_np - r_exp_np) + beta * np.abs(e_pred_np - e_exp_np)
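
The same hypothetical numbers can be pushed through the Numpy version to confirm that both implementations agree. The function is repeated so the snippet is self-contained, and taking the mean at the end is only an illustration of how the per-observation values could be summarized, not a step required by the assignment.

```python
import numpy as np

def loss_function_np(alpha, beta, r_pred_np, r_exp_np, e_pred_np, e_exp_np):
    # Vectorized: subtraction, absolute value and scaling act on whole arrays.
    return alpha * np.abs(r_pred_np - r_exp_np) + beta * np.abs(e_pred_np - e_exp_np)

# Hypothetical inputs with two observations.
r_pred_np, r_exp_np = np.array([1.0, 2.0]), np.array([0.0, 4.0])
e_pred_np, e_exp_np = np.array([3.0, 5.0]), np.array([1.0, 5.0])

losses_np = loss_function_np(0.5, 0.5, r_pred_np, r_exp_np, e_pred_np, e_exp_np)
print(losses_np.tolist())  # [1.5, 1.0]
print(losses_np.mean())    # 1.25
```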

---

<h3>Task 3:</h3>

In this task, we aim to check how fast the functions from Task 2 work using the `timeit` [[5]](#5) library. We'll do this by <br /> measuring how long it takes for the functions to calculate the loss value when both $\alpha$ and $\beta$ are set to `0.5`.

The essential data will be stored as `lists` initially and then transformed into `Numpy arrays`. <br /> These arrays will serve as parameters for the two custom functions mentioned earlier.

In [8]:
r_pred = data_pred['Rainfall Pred.'].values.tolist()
r_exp = data_exp['Rainfall'].values.tolist()
e_pred = data_pred['Evaporation Pred.'].values.tolist()
e_exp = data_exp['Evaporation'].values.tolist()

# Then we convert the data from lists to Numpy arrays
r_pred_np = np.array(r_pred)
r_exp_np = np.array(r_exp)
e_pred_np = np.array(e_pred)
e_exp_np = np.array(e_exp)
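
The round trip through a Python list is only needed for the list-based function. As an aside, and assuming a reasonably recent Pandas version, `Series.to_numpy()` returns the Numpy array directly. The toy frame below is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical one-column frame standing in for data_exp.
toy = pd.DataFrame({"Rainfall": [0.0, 1.2, 3.4]})

as_list = toy["Rainfall"].tolist()     # plain Python floats for the list version
as_array = toy["Rainfall"].to_numpy()  # Numpy array for the vectorized version

print(type(as_list).__name__, type(as_array).__name__)  # list ndarray
```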

#### **Python lists:**

In [9]:
time_py_lists = timeit.timeit(lambda: loss_func_regular(0.5, 0.5, r_pred, r_exp, e_pred, e_exp), number=100)

time_py_lists

0.8761188520002179

#### **Numpy array:**

In [10]:
time_py_numpy = timeit.timeit(lambda: loss_function_np(0.5, 0.5, r_pred_np, r_exp_np, e_pred_np, e_exp_np), number=100)

time_py_numpy

0.026985819000401534
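
A single `timeit.timeit` call can be affected by whatever else the machine is doing. A slightly more robust variant, sketched below on synthetic random data with a hypothetical size of 10,000 elements, uses `timeit.repeat` and keeps the minimum of several repetitions. The helper names `abs_diff_list` and `abs_diff_np` are made up for this example, and the timings will differ from machine to machine.

```python
import random
import timeit

import numpy as np

random.seed(0)
n = 10_000  # hypothetical problem size
a = [random.random() for _ in range(n)]
b = [random.random() for _ in range(n)]
a_np, b_np = np.array(a), np.array(b)

def abs_diff_list():
    # Python-level loop over individual float objects.
    return [abs(x - y) for x, y in zip(a, b)]

def abs_diff_np():
    # Vectorized equivalent on whole arrays.
    return np.abs(a_np - b_np)

# repeat() returns one total time per repetition; the minimum is least affected by noise.
t_list = min(timeit.repeat(abs_diff_list, number=20, repeat=5))
t_np = min(timeit.repeat(abs_diff_np, number=20, repeat=5))
print(f"lists: {t_list:.4f} s, Numpy: {t_np:.4f} s, ratio: {t_list / t_np:.1f}x")
```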

#### **Conclusion:**

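`Numpy arrays` were about **32 times faster** than `Python lists` here: 100 evaluations of the loss function with $\alpha$ and $\beta$ set to `0.5` took roughly 0.88 s with lists versus 0.027 s with Numpy arrays. <br /> The speedup comes from Numpy performing the element-wise arithmetic in optimized, compiled loops over contiguous arrays instead of running a Python-level loop over individual float objects, which makes it the better choice for numerical computations of this kind.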

---