# Sensor Simulator

In the previous notebook
[`bayesian_optimization_blooper.ipynb`](bayesian_optimization_blooper.ipynb), we ran
into some unexpected results. Bayesian optimization was on par with grid search, and
random search was the best? See below.

![blooper](bayesian_optimization_blooper.png)

To help troubleshoot the source of the unexpected
behavior, we'll move the schedule up and use a simulator. The large stochasticity for
identical inputs shown at the end of the last notebook is a cause for concern.

|  | mean | std |
|---|---|---|
| ch415_violet | 17928.500000 | 25288.711738 |
| ch445_indigo | 8746.200000 | 9379.586248 |
| ch480_blue | 19396.100000 | 17832.023101 |
| ch515_cyan | 4870.900000 | 12126.813252 |
| ch560_green | 11908.300000 | 18944.797128 |
| ch615_yellow | 18406.400000 | 19626.981147 |
| ch670_orange | 30584.100000 | 20868.707815 |
| ch720_red | 5776.100000 | 9066.354228 |
| ch_clear | 1294.100000 | 2017.116669 |
| ch_nir | 0.000000 | 0.000000 |
| mae | 13798.390000 | 2959.743242 |
| rmse | 23851.888103 | 3422.810056 |

Is the stochasticity due to excessive noise from too short of integration time or is it
a problem with the hardware such as a sensor malfunction? Based on inspection from my
own eyes, fixed inputs appeared to produce very similar colors. Last, this could be due
to (intentional) naive design decision of starting out with mean absolute error for the
objective function rather than something more tuned to this problem, such as Wasserstein
distance between the two discrete spectra. In reality, it's probably some mix and
interplay of the previous issues:

- **(epistemic) uncertainty**
  - i.e. just a part of the system, solved by a longer integration time (more data)
- **sensor malfunction/degradation**
  - due to blasting it repeatedly with a bright DotStar
  - and/or hard device resets due to system crashes
  - and/or or (just remembered!) a series
  of recent unexpected power outages for my housing community
- **Poorly defined objective function**
  - It was defined in a way that it doesn't handle stochastic bias when
  a signal is measured in an adjacent channel, solved by using a robust distance metric
  for discrete distributions (e.g. Wasserstein)
- **Something else?**

My wife kindly reminded me of past lessons learned during experimentation, and
encouraged me to try out the simulation. So, let's do that here! We'll take a look at
our domain knowledge / assumptions for the simulation, briefly describe the approach,
and then dig into a snapshot of the `SensorSimulator` class. Finally, we'll demonstrate
the usage and run our grid vs. random vs. Bayesian search algorithm comparison with our
simulation.


## Domain Knowledge

> TLDR; extract the data from the DotStar RGB spectrum listed in the [manufacturer's datasheet](https://cdn-shop.adafruit.com/product-files/2343/SK9822_SHIJI.pdf)

A basic physics course teaches the wave-particle duality of light. Light can be thought
of as a wave (signal with frequency/wavelength) or as a particle (photon). In the case
of designing a (very basic) simulation for an RGB LED spectrum being read by a
discrete-channel sensor, we focus mainly on the wave portion of wave-particle duality of
light. Most light-emitting-diodes (LEDs) have fairly narrow distributions of wavelengths
that they emit like what's shown in the following image from the [DotStar manufacturer's datasheet](https://cdn-shop.adafruit.com/product-files/2343/SK9822_SHIJI.pdf).

<img src="../reports/figures/dotstar/rgb-relative-emission-vs-wavelength.png" width=350>

Controlling the brightness and RGB values of our LED controls the relative contribution
of each of the three distributions portrayed above. Meanwhile, our spectrophotometer (a device that measures the intensity of light at various
wavelengths) measures the light intensity at 8 wavelengths (it also has "clear" and
"near infrared" channels that we'll ignore):

| wavelength | intensity |
|---|---|
415 nm | Violet
445 nm | Indigo or Blue
480 nm | Blue or Blue-Green
515 nm | Blue-Green or Green
555 nm | Green or Yellow-Green 
590 nm | Yellow-Green or Yellow
630 nm | Orange or Orange-Red or Red
680 nm | Red

We can digitize the data from the manufacturer datasheet using
[WebPlotDigitizer](https://automeris.io/WebPlotDigitizer/) and then use that for
calculating/mixing an arbitrary spectrum based on brightness and RGB settings. We'll make the
reasonable approximation/assumption that the [superposition principle](https://en.wikipedia.org/wiki/Superposition_principle) applies*. In other
words, we assume that the individual spectra from the red, green, and blue LEDs can be
added together. Below is a screenshot of the datapoints extracted via the
WebPlotDigitizer interface for each of the three curves. Corresponding `.csv` files are
also saved: [`red.csv`](../src/self_driving_lab_demo/data/red.csv), [`green.csv`](../src/self_driving_lab_demo/data/green.csv), and [`blue.csv`](../src/self_driving_lab_demo/data/blue.csv).

<img src="../src/self_driving_lab_demo/data/wpd-rgb-spectrum-points-overlay.png"
width=350>

See [RGB LEDs vs. having 10+ monochromatic light
sources](https://github.com/sparks-baird/self-driving-lab-demo/issues/6) for more
details on the hardware design considerations for the LEDs and sensor.

<p><sup>
*While it would be interesting to look at optimization setups that
involve light cancellation, the equipment required is likely not within the budget
and time constraints of the self-driving-lab-demo project (less than 100 USD and less
than an hour of setup time).
</p></sup>


## Simulation Details

As mentioned, we'll mix the three spectra extracted from the manufacturer datasheet
according to the simulation inputs (brightness
and RGB values). After loading the CSV data, we'll clip any data below 0.0 and average
the intensities that are not one-to-one (i.e. multiple intensities for the same
wavelength) due to the data extraction process. Next, we'll use `scipy`'s `interp1d`
function for each of the three color spectra to create linear interpolation functions
which are set to zero outside of the range of the original data (extrapolation).
Finally, we'll calculate a weighted sum of each of the interpolators as a function of
brightness and RGB sampled at each of the wavelengths mentioned above.

A more sophisticated setup might involve sampling a distribution of wavelengths for each
channel; however, we'll keep it simple for now.

## Sensor Simulator Python class
Below is a snapshot of the `SensorSimulator` class, broken into chunks.

First, we need our imports and define the wavelengths we'll be sampling at as a constant.

```python
from importlib.resources import open_text
import numpy as np
import pandas as pd
from scipy.interpolate import interp1d
from self_driving_lab_demo import data as data_module

CHANNEL_WAVELENGTHS = [
    415,
    445,
    480,
    515,
    560,
    615,
    670,
    720,
]
```

The class doesn't take any keyword arguments (again, keeping it simple), and when the
class is instantiated, it creates an interpolator.

```python
class SensorSimulator(object):
    def __init__(self):
        self.red_interp = self.create_interpolator("red.csv")
        self.green_interp = self.create_interpolator("green.csv")
        self.blue_interp = self.create_interpolator("blue.csv")

```

We make it easy to get the channel wavelengths (making it a constant outside the class
makes it easily accessible for other modules).

```python
    @property
    def channel_wavelengths(self):
        return CHANNEL_WAVELENGTHS
```

The data is read using best practices (`open_text` using Python modules), negative
values are clipped, and y-values with repeat x-values are averaged. Finally, the
`interp1d` uses a linear interpolation (`cubic` gave some outlandish values during
testing) and zero-ing extrapolation (zero anywhere outside the range of the original
dataset). This function operates on a single color curve.

```python
    def create_interpolator(self, fname):
        df = pd.read_csv(
            open_text(data_module, fname),
            header=None,
            names=["wavelength", "relative_intensity"],
        )

        df["relative_intensity"].clip(lower=0.0, inplace=True)

        # average y-values for repeat x-values
        # see also https://stackoverflow.com/a/51258988/13697228
        df = df.groupby("wavelength", as_index=False).mean()

        return interp1d(
            df["wavelength"],
            df["relative_intensity"],
            kind="linear",
            bounds_error=False,
            fill_value=0.0,
        )

```

To perform the weighted average (mixing) of spectra, we divide the RGB values by 255
(this is arbitrary) and multiply by the brightness. This is then multipled by the
interpolated value at each of the wavelengths. Finally, we sum the contribution of each
of the wavelengths.

```python
    def _simulate_sensor_data(self, wavelengths, brightness, R, G, B):
        rI, gI, bI = brightness * np.array([R, G, B]) / 255
        channel_data = np.sum(
            [
                self.red_interp(wavelengths) * rI,
                self.green_interp(wavelengths) * gI,
                self.blue_interp(wavelengths) * bI,
            ],
            axis=0,
        )
        return tuple(channel_data)
```

Then, we define a class method that fixes the wavelengths to the constant mentioned before.

```python
    def simulate_sensor_data(self, brightness, R, G, B):
        return self._simulate_sensor_data(self.channel_wavelengths, brightness, R, G, B)
```

Last, let's take a look at how this is integrated into the `SelfDrivingLabDemo` class
(another reasonable design choice might be to create separate class using class
inheritance).

It's important that we only use the RPi-specific modules when running on RPi, otherwise
an error such as `NotImplementedError` will be thrown.

```python
try:
    import board
    from adafruit_as7341 import AS7341
    from blinkt import clear, set_brightness, set_pixel, show
except NotImplementedError as e:
    print(e)
    _logger.warning(
        "Safe to ignore if this is CI or not on a Raspberry Pi. However, only the simulator will be available."  # noqa: E501
    )
```

We add a `simulation` boolean to the `__init__` constructor.

```python
class SelfDrivingLabDemo(object):
    def __init__(
        self,
        ...
        simulation=False,
    ):
```

We store the kwarg as a class attribute, instantiate a `SensorSimulator` object, and
store the object as a class attribute.
```python
        ...
        self.simulation = simulation
        self.simulator = SensorSimulator()
```

It's important that we don't use the `board` and `AS7341` modules if we're not running
on a Raspberry Pi, so we assign `i2c` and `sensor` as `None` if those modules aren't
available (see `try` statement above).

```python
        # uses board.SCL and board.SDA
        self.i2c = None if "board" not in sys.modules else board.I2C()
        self.sensor = None if "AS7341" not in sys.modules else AS7341(self.i2c)
        ...
```

We make a slight modification to `observe_sensor_data` that calls the
`simulate_sensor_data` function if `self.simulation` is `True`.

```python
    def observe_sensor_data(self, brightness, R, G, B):
        if self.simulation:
            return self.simulate_sensor_data(brightness, R, G, B)
        ...

    def simulate_sensor_data(self, brightness, R, G, B):
        return self.simulator.simulate_sensor_data(brightness, R, G, B)
    ...
```

Note that in later versions, the details may differ somewhat from the above example, and
more features might be added.

In [1]:
%load_ext autoreload
%autoreload 2
from self_driving_lab_demo.core import SensorSimulator
sim = SensorSimulator()
sim.simulate_sensor_data(0.5, 12, 24, 48)

(3.6003093860216055e-05,
 0.005718901898901329,
 0.005328294858512386,
 0.0040458290797878394,
 0.0,
 0.0035949178953476857,
 0.00015071481701215217,
 0.0)

In [3]:
from self_driving_lab_demo.core import SelfDrivingLabDemo
sdl = SelfDrivingLabDemo(autoload=True, simulation=True)
sdl.evaluate(0.5, 50, 150, 250)

{'ch415_violet': 0.00018751611385529196,
 'ch445_indigo': 0.02978594739011109,
 'ch480_blue': 0.02911825030328232,
 'ch515_cyan': 0.025286431748673996,
 'ch560_green': 0.0,
 'ch615_yellow': 0.01497882456394869,
 'ch670_orange': 0.0006279784042173006,
 'ch720_red': 0.0,
 'mae': 0.010248416435380638,
 'rmse': 0.014591586880756832}

## Algorithm Comparison

We'll skip over some of the content that was covered in more detail in
[`bayesian_optimization_blooper.ipynb`](bayesian_optimization_blooper.ipynb).

### Setup

We'll instantiate our 10 `SelfDrivingLabDemo` instances, but this time with the
`simulation` flag set to `True`.

In [48]:
import numpy as np
import pandas as pd

num_iter = 81
num_repeats = 10
SEEDS = range(10, 10 + num_repeats)

sdls = [
    SelfDrivingLabDemo(autoload=True, simulation=True, target_seed=seed)
    for seed in SEEDS
]


The target inputs used to create the target spectrum are shown below.

In [49]:
target_inputs = [sdl.get_target_inputs() for sdl in sdls]
pd.DataFrame(target_inputs, columns=["brightness", "R", "G", "B"])


Unnamed: 0,brightness,R,G,B
0,0.334601,52,211,38
1,0.045,127,153,7
2,0.087789,241,48,45
3,0.302679,218,206,66
4,0.290844,92,179,219
5,0.24246,208,87,11
6,0.198421,109,23,88
7,0.295776,41,142,93
8,0.139757,182,71,21
9,0.147133,236,69,15


### Grid and Random Search

We'll use some helper functions for grid and random search based on the implementation
in [`bayesian_optimization_blooper.ipynb`](bayesian_optimization_blooper.ipynb). Notice
how much faster our simple simulation runs compared with the experiment! (a few seconds vs. ~10 minutes)

In [50]:
%%time
from self_driving_lab_demo.utils.search import grid_search, random_search
grid_results = [grid_search(sdl, num_iter) for sdl in sdls]
random_results = [random_search(sdls[0], num_iter) for sdl in sdls]

CPU times: user 2.61 s, sys: 17.3 ms, total: 2.63 s
Wall time: 2.62 s


#### Post-processing

Separate the results into input tuples and output dictionaries.

In [51]:
grid_inputs, grid_data = zip(*grid_results)
random_inputs, random_data = zip(*random_results)

np.array(grid_data).shape

(10, 81)

Extract the mean absolute error (MAE) from each of the output dictionaries.

In [52]:
grid_mae = np.array([[g["mae"] for g in gd] for gd in grid_data])
random_mae = np.array([[r["mae"] for r in rd] for rd in random_data])

grid_mae.shape

(10, 81)

Compute statistics (mean and standard deviation) across the `num_repeats` campaigns.

In [53]:
grid_avg_mae = np.mean(grid_mae, axis=0)
grid_std_mae = np.std(grid_mae, axis=0)

random_avg_mae = np.mean(random_mae, axis=0)
random_std_mae = np.std(random_mae, axis=0)

grid_avg_mae.shape

(81,)

How did we do? It's difficult to say since our simulation data is unitless and therefore
not directly comparable to the experiments. However, it's perhaps surprising that grid
search is performing  somewhat better than random search.

In [54]:
[np.min(grid_avg_mae), np.min(random_avg_mae)]

[0.002866803251013429, 0.0030098903090408683]

Since it's so fast, let's try to do it again, but this time with more repeats.

In [55]:
num_repeats = 100
SEEDS = range(10, 10 + num_repeats)

sdls = [
    SelfDrivingLabDemo(autoload=True, simulation=True, target_seed=seed)
    for seed in SEEDS
]

grid_results = [grid_search(sdl, num_iter) for sdl in sdls]
random_results = [random_search(sdls[0], num_iter) for sdl in sdls]

grid_inputs, grid_data = zip(*grid_results)
random_inputs, random_data = zip(*random_results)

grid_mae = np.array([[g["mae"] for g in gd] for gd in grid_data])
random_mae = np.array([[r["mae"] for r in rd] for rd in random_data])

grid_avg_mae = np.mean(grid_mae, axis=0)
grid_std_mae = np.std(grid_mae, axis=0)

random_avg_mae = np.mean(random_mae, axis=0)
random_std_mae = np.std(random_mae, axis=0)

[np.min(grid_avg_mae), np.min(random_avg_mae)]

[0.002852020706353647, 0.003943137789678828]

So, for this simulation, grid search seems to be outperforming random search! Typically,
random search is better, so digging back into the `get_random_inputs` function from
`SelfDrivingLabDemo`, it turns out there is a bug:

Wrong: `R, G, B = RGB.astype(int)`

Correct: `R, G, B = np.round(RGB).astype(int)`