# Toqua Technical Interview

Ryan Cushen

Tuesday, 2 May 2023

## Table of Contents

1. [Introduction](#introduction)
2. [Technical Investigations](#technical-investigations)
    1. [Wind Turbines: A Brief Overview](#wind-turbines-a-brief-overview)
    2. [The Dataset](#the-dataset)
    3. [My Big Idea](#my-big-idea)
    4. [But: Oh No](#but-oh-no)
    5. [An Alternative Approach](#an-alternative-approach)
3. [Engie Proposition](#engie-proposition)
4. [Conclusions](#conclusions)
5. [Questions](#questions)

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import folium
import windrose
import os

from sqlalchemy import create_engine
from dotenv import load_dotenv

## Introduction

Set the scene: Toqua wants to expand its services to the wind industry (wind turbines) and needs some ideas on how to do this.

We have seen that Engie has an open data set available online and Engie’s head of Wind Turbines has agreed to take a meeting with Toqua in 2 weeks out of curiosity...

The exam question: How can we provide value to Engie?

To answer this question, we will need to develop an understanding of the wind turbine industry and the available dataset, and demonstrate some concrete ideas for how Toqua can help Engie improve its operations.

## Technical Investigations

Let's get into some details...

### Wind Turbines: A Brief Overview

Wind turbines convert kinetic wind energy into electrical energy.

* Wind spins the rotor, which is connected to a gearbox, which is connected to a generator. Low speed, high torque goes to high speed, low torque.
* It's all just basic physics: $P = \tau \omega$. In particular, power increases with the cube of wind speed (and torque with the square). Air density (i.e. temperature and pressure) also plays a role.
* But the rotors are not spun arbitrarily fast! Instead, they typically operate around fixed bands, so power output is mostly a function of torque.
* The rotor's control systems will ensure that the rotor is spun at the optimal speed, given the wind speed.
* The input to the generator is therefore a given rotor speed and torque. The control systems of the generator will then manage the excitation level of the generator, to ensure that the maximum amount of power is being generated whilst also maintaining the stability of the grid. These outputs are captured by the active power and reactive power variables.
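
The cubic relationship above can be made concrete with the standard wind power equation, $P = \tfrac{1}{2} \rho A C_p v^3$, together with $P = \tau \omega$. The sketch below is illustrative only: the 82 m rotor diameter, the air density of 1.225 kg/m³ and the power coefficient of 0.4 are assumed values, not figures from the dataset.

```python
import math

def wind_power_kw(wind_speed, rotor_diameter=82.0, air_density=1.225, cp=0.4):
    """Power extracted from the wind in kW: P = 0.5 * rho * A * Cp * v^3.

    The default rotor diameter, air density and Cp are assumptions.
    """
    swept_area = math.pi * (rotor_diameter / 2) ** 2  # m^2
    return 0.5 * air_density * swept_area * cp * wind_speed ** 3 / 1000

def torque_nm(power_kw, rotor_rpm):
    """Rotor torque in Nm from P = tau * omega, with omega in rad/s."""
    omega = rotor_rpm * 2 * math.pi / 60
    return power_kw * 1000 / omega

p_low = wind_power_kw(5.0)
p_high = wind_power_kw(10.0)
ratio = p_high / p_low  # doubling the wind speed multiplies power by roughly 8
print(ratio)
```

At a fixed rotor speed, any change in power shows up entirely in the torque, which is why torque is the more informative variable once the rotor sits in its operating band.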


What about Engie's particular wind turbines?

* Our wind turbines are the *Senvion MM82* model
* This model starts working at a wind speed of 3.5 m/s and has a cut-out speed of 25 m/s.
* It has a rated wind speed of 14.5 m/s and a rated power of 2,050 kW. This is equivalent to about one four-hundredth of the power of a nuclear reactor.
* It also has a maximum rotor speed of 17 rpm and a tip speed of 73 m/s (about 263 km/h).
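
As a quick consistency check on these figures, the tip speed follows from the maximum rotor speed and the blade radius. The 82 m rotor diameter used here is an assumption read off the model name rather than a value given above.

```python
import math

rotor_diameter_m = 82.0  # assumed from the MM82 model name
max_rotor_rpm = 17.0     # maximum rotor speed quoted above

omega = max_rotor_rpm * 2 * math.pi / 60      # angular speed in rad/s
tip_speed_ms = omega * rotor_diameter_m / 2   # tip speed in m/s
tip_speed_kmh = tip_speed_ms * 3.6            # same speed in km/h

print(round(tip_speed_ms, 1), round(tip_speed_kmh, 1))
```

The result agrees with the quoted 73 m/s, which suggests the rotor speed and tip speed figures are mutually consistent under the assumed diameter.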

Here is a photo:

![the wind turbine](references/turbine-senvion_mm82-DP71t6EmCB2.jpg)

### The Dataset

The primary dataset is composed of 10-minute observations of 14 mechanical metrics taken from four wind turbines located in the Grand Est region of France, paired with hourly weather observations across 7 dimensions. This means a total of 21 numeric variables, indexed by timestamp and turbine.

There are just over one million observations, with roughly 261,000 observations for each of the four turbines, dated between January 2013 and January 2018. In fact, the four turbines have exactly the same date range, but each is missing some records across the period.
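
To see how many records each turbine is missing, the observed timestamps can be compared against a regular grid. The sketch below is illustrative: the `timestamp` column name is hypothetical, and the 10-minute interval is an assumption inferred from the record count (five calendar years at 10-minute spacing gives 262,944 slots, close to the roughly 261,000 records per turbine).

```python
import pandas as pd

# Assumption: full calendar years 2013-2017 on a 10-minute grid
expected_slots = len(pd.date_range("2013-01-01 00:00", "2017-12-31 23:50", freq="10min"))

def missing_records(df, freq="10min"):
    """Number of grid timestamps absent from each turbine's record."""
    grid = pd.date_range(df["timestamp"].min(), df["timestamp"].max(), freq=freq)
    observed = df.groupby("wind_turbine_name")["timestamp"].nunique()
    return len(grid) - observed

# Toy data (not from the Engie dataset): turbine B is missing one record
toy = pd.DataFrame({
    "wind_turbine_name": ["A", "A", "A", "B", "B"],
    "timestamp": pd.to_datetime([
        "2013-01-01 00:00", "2013-01-01 00:10", "2013-01-01 00:20",
        "2013-01-01 00:00", "2013-01-01 00:20",
    ]),
})
gaps = missing_records(toy)
print(gaps)
```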


In [2]:
load_dotenv()
DATABASE_URL = os.getenv("DATABASE_URL")

engine = create_engine(DATABASE_URL)
conn = engine.connect()

In [3]:
static_information = pd.read_sql("""
SELECT * FROM wind_turbines.static_information
""", conn)

coords = static_information[['wind_turbine_name', 'gps']].to_dict(orient='records')
for coord in coords:
    coord['gps'] = [float(x) for x in coord['gps'].split(',')]
    coord['lat'] = coord['gps'][0]
    coord['lon'] = coord['gps'][1]

mean_lat = sum([coord['lat'] for coord in coords]) / len(coords)
mean_lon = sum([coord['lon'] for coord in coords]) / len(coords)

m = folium.Map(location=[mean_lat, mean_lon], zoom_start=13)

for coord in coords:
    folium.Marker(
        location=[coord['lat'], coord['lon']],
        popup=coord['wind_turbine_name'],
        icon=folium.Icon(color='green', icon='ok-sign')
    ).add_to(m)

In [4]:
m

In [5]:
conn.close()

The 14 mechanical metrics and 7 weather metrics can be grouped into a few categories:

* Mechanical – Wind
    * `ws`: wind speed in m/s
    * `ws_1`: wind speed in m/s for the first anemometer on the nacelle
    * `ws_2`: wind speed in m/s for the second anemometer on the nacelle
    * `wa`: wind direction in degrees
* Mechanical – Positioning
    * `ba`: pitch angle of the blades in degrees
    * `va`: vane position, which is the angular position of the wind vane, in degrees
    * `ya`: nacelle angle in degrees
* Mechanical – Temperatures
    * `ot`: outdoor temperature in degrees Celsius
    * `yt`: nacelle temperature in degrees Celsius
    * `rbt`: rotor bearing temperature in degrees Celsius
* Mechanical – Power
    * `rm`: torque in Nm
    * `rs`: rotor speed in rpm
    * `p`: active power (i.e. "real power") in kW
    * `q`: reactive power in kVAr


* Weather
    * `temp`: the outside temperature in degrees Celsius
    * `pressure`: the outside pressure in hPa
    * `humidity`: the outside humidity in %
    * `wind_speed`: the wind speed in m/s
    * `wind_deg`: the wind direction in degrees
    * `rain_1h`: the quantity of rain in the last hour in mm
    * `snow_1h`: the quantity of snow in the last hour in mm

Let's create some plots!

We need to do some data cleaning... I chose parametric truncation. This is a bit of a hack, but it's quick and easy.
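
The cleaning code is not shown here, so the following is only one plausible reading of "parametric truncation": drop any row in which a variable falls outside a plausible range. The bounds in `BOUNDS` are hypothetical placeholders, not the values actually used.

```python
import pandas as pd

# Hypothetical (lower, upper) bounds per column; tune these to the real data
BOUNDS = {
    "ws": (0.0, 40.0),      # wind speed, m/s
    "rs": (0.0, 20.0),      # rotor speed, rpm
    "p": (-100.0, 2200.0),  # active power, kW
}

def truncate(df, bounds=BOUNDS):
    """Keep only rows where every bounded column lies within its limits."""
    mask = pd.Series(True, index=df.index)
    for col, (lower, upper) in bounds.items():
        mask &= df[col].between(lower, upper)  # inclusive; NaN counts as out of range
    return df[mask]

# Toy rows (invented): the second has a negative wind speed, the third an impossible power
toy = pd.DataFrame({
    "ws": [5.0, -1.0, 12.0],
    "rs": [10.0, 9.0, 15.0],
    "p": [500.0, 300.0, 9999.0],
})
clean = truncate(toy)
print(clean)
```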

And let's also create a few new variables:

* `ws_sq`: wind speed squared, since rotor torque is proportional to the square of wind speed
* `ws_cb`: wind speed cubed, since rotor power is proportional to the cube of wind speed
* `ws_a`: actual wind speed, since the wind speed measured by the anemometer is not the same as that experienced by the turbine if it is not facing directly into the wind
* `temp_6hr`: a rolling average of temperature over the past 6 hours, to capture the persistent effect of temperature on the turbine
* `rho`: pressure divided by temperature, as a proxy for air density
* `rs_n`: normalised rotor speed, as a ratio to maximum rotor speed (17 rpm)
* `ro`: whether the rotor is running without wind power, i.e. wind speed is zero but rotor speed is not
* `ts`: tip speed ratio, the ratio of blade tip speed (rotor speed times blade radius) to wind speed
* `rsm`: rotor speed-torque gradient
* `o`: a boolean, indicating whether the wind speed is within the operational range of the turbine (i.e. between cut-in and cut-out speeds)
* `tg`: the ratio of temperatures outside and inside the nacelle, as a proxy for the efficiency of the cooling system
* `s`: apparent power, defined as $S = \sqrt{P^2 + Q^2}$
* `pf`: the power factor, defined as $\cos \phi = \frac{P}{S}$
* Four indicators, one for each six-hour block of the day, to capture varying grid demand:
    * `d1`: 00:00 - 06:00
    * `d2`: 06:00 - 12:00
    * `d3`: 12:00 - 18:00
    * `d4`: 18:00 - 00:00
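
A few of the derived variables above could be computed along the following lines. This is a sketch covering only a subset of the list; it assumes a `DatetimeIndex` and the column names from the metric tables, and the conversion of `temp` to kelvin before forming `rho` is an assumption rather than something stated above.

```python
import numpy as np
import pandas as pd

MAX_ROTOR_RPM = 17.0         # from the turbine specification
CUT_IN, CUT_OUT = 3.5, 25.0  # m/s, from the turbine specification

def add_features(df):
    """Return a copy of df with a subset of the derived variables added."""
    out = df.copy()
    out["ws_sq"] = out["ws"] ** 2
    out["ws_cb"] = out["ws"] ** 3
    # Assumption: temp is in degrees Celsius, so convert to kelvin first
    out["rho"] = out["pressure"] / (out["temp"] + 273.15)
    out["rs_n"] = out["rs"] / MAX_ROTOR_RPM
    out["ro"] = (out["ws"] == 0) & (out["rs"] > 0)
    out["o"] = out["ws"].between(CUT_IN, CUT_OUT)
    out["s"] = np.sqrt(out["p"] ** 2 + out["q"] ** 2)
    # Power factor is undefined when apparent power is zero
    out["pf"] = out["p"] / out["s"].replace(0, np.nan)
    # One indicator per six-hour block of the day (d1..d4)
    block = out.index.hour // 6
    for i in range(4):
        out[f"d{i + 1}"] = block == i
    return out

# Toy rows with invented values
toy = pd.DataFrame(
    {"ws": [0.0, 10.0], "rs": [1.0, 15.0], "p": [0.0, 1200.0],
     "q": [0.0, 500.0], "temp": [5.0, 10.0], "pressure": [1000.0, 1010.0]},
    index=pd.to_datetime(["2013-01-01 03:00", "2013-01-01 13:00"]),
)
features = add_features(toy)
print(features[["s", "pf", "ro", "o", "d1", "d3"]])
```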

### My Big Idea

So we know how wind turbines work, and we have some data on four wind turbines. What can we do with this?

Let's build a digital twin! This could function as a dynamic 'lookup table', which could be used for two important purposes:

* **Forecasting**: Given a set of inputs (e.g. a weather forecast), we could use the digital twin to predict the expected output of the turbine by "looking up" the closest historical analogue.
* **Anomaly detection**: Given a set of inputs (e.g. the current weather), we could use the digital twin to predict the expected output of the turbine, and then compare this to the actual output. If there is a significant difference, then we could flag this as an anomaly.
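
Both uses can be illustrated with a deliberately naive lookup: predict power from the nearest historical wind speeds, then flag an observation that strays too far from that prediction. The function names, the toy history and the 200 kW tolerance are all invented for illustration.

```python
import numpy as np

def lookup_power(history_ws, history_p, ws, k=3):
    """Predict power as the mean of the k historical records with the closest wind speed."""
    nearest = np.argsort(np.abs(history_ws - ws))[:k]
    return history_p[nearest].mean()

def is_anomalous(history_ws, history_p, ws, observed_p, k=3, tolerance_kw=200.0):
    """Flag an observation whose power differs from the lookup estimate by more than the tolerance."""
    estimate = lookup_power(history_ws, history_p, ws, k=k)
    return abs(observed_p - estimate) > tolerance_kw

# Toy history (made-up values, not from the Engie data)
history_ws = np.array([4.0, 6.0, 8.0, 10.0, 12.0])
history_p = np.array([100.0, 400.0, 900.0, 1500.0, 1900.0])

expected = lookup_power(history_ws, history_p, ws=9.0, k=2)
flagged = is_anomalous(history_ws, history_p, ws=9.0, observed_p=300.0, k=2)
print(expected, flagged)
```

A point estimate with a fixed tolerance is the crudest version of this idea; the probabilistic formulation replaces both with a conditional distribution.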

Mathematically, a digital twin defined in this way is just a big joint probability distribution, defined over the state space of the turbine. "Looking up" a value is then equivalent to computing conditional probabilities. This formulation also has the added benefit of being flexible; we can compute conditional probabilities using just one variable, or many variables.
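
In the simplest case, where the joint distribution is a single multivariate Gaussian, "looking up" a value reduces to the standard conditioning formulas $\mu_{h|o} = \mu_h + \Sigma_{ho} \Sigma_{oo}^{-1} (x_o - \mu_o)$ and $\Sigma_{h|o} = \Sigma_{hh} - \Sigma_{ho} \Sigma_{oo}^{-1} \Sigma_{oh}$. The sketch below assumes that simplification; the mean and covariance are invented numbers.

```python
import numpy as np

def condition_gaussian(mean, cov, observed_idx, observed_values):
    """Mean and covariance of the unobserved variables given the observed ones."""
    hidden_idx = np.setdiff1d(np.arange(len(mean)), observed_idx)
    s_hh = cov[np.ix_(hidden_idx, hidden_idx)]
    s_ho = cov[np.ix_(hidden_idx, observed_idx)]
    s_oo = cov[np.ix_(observed_idx, observed_idx)]
    gain = s_ho @ np.linalg.inv(s_oo)
    cond_mean = mean[hidden_idx] + gain @ (observed_values - mean[observed_idx])
    cond_cov = s_hh - gain @ s_ho.T
    return cond_mean, cond_cov

# Toy joint distribution over (wind_speed, power); the numbers are invented
mean = np.array([8.0, 900.0])
cov = np.array([[4.0, 300.0],
                [300.0, 40000.0]])

# Distribution of power given an observed wind speed of 10 m/s
cond_mean, cond_cov = condition_gaussian(mean, cov, np.array([0]), np.array([10.0]))
print(cond_mean, cond_cov)
```

Conditioning on more variables only changes `observed_idx`, which is the flexibility described above.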

But this is a very high-dimensional state space... and we know there are some physically-informed causal relationships between the variables. So instead of just doing a big Gaussian mixture model, we could model the joint probability distribution using a Bayesian network.

We can also make this Bayesian network dynamic, allowing the digital twin to evolve over time (i.e. learn 'online') as new data from the physical twin is received.
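
The online-learning idea can be illustrated with a much simpler object than a dynamic Bayesian network: a running estimate of a mean and covariance that is updated one observation at a time (Welford's algorithm). This is not the proposed model, only a sketch of incremental updating; the class name is hypothetical.

```python
import numpy as np

class RunningGaussian:
    """Track the mean and covariance of a stream of observation vectors."""

    def __init__(self, dim):
        self.n = 0
        self.mean = np.zeros(dim)
        self._m2 = np.zeros((dim, dim))  # running sum of outer products of deviations

    def update(self, x):
        """Fold one new observation into the running statistics."""
        x = np.asarray(x, dtype=float)
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += np.outer(delta, x - self.mean)

    @property
    def cov(self):
        """Sample covariance; requires at least two observations."""
        return self._m2 / (self.n - 1)

# Synthetic stream of three-dimensional observations
rng = np.random.default_rng(0)
stream = rng.normal(size=(50, 3))
tracker = RunningGaussian(dim=3)
for row in stream:
    tracker.update(row)
```

The running statistics match what a batch computation over the full history would give, without needing to store that history.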

To make this discussion a little more tangible, we can specify a rough outline of the desired API for interacting with this digital twin. Ideally, it would look something like:

```python
from toqua import DigitalTwin

# Create a new wind turbine
wt = DigitalTwin(type="wind_turbine")

# Train the wind turbine on some data
wt.train(data)

# Query the wind turbine
wt.query(
    inputs={
        'wind_speed': 10,
        'rotor_speed': 10,
        'air_pressure': 1000
    },
    output='power'
)
# -> returns a probability distribution over power (and a MAP estimate)
```


```python
wt.query(
    inputs={
        'wind_speed': 10,
    },
    output='power'
)
# -> returns a probability distribution over power (and a MAP estimate)

# Assess an observation for anomalous behaviour
wt.assess(
    inputs={
        'wind_speed': 10,
        'rotor_speed': 10,
        'air_pressure': 1000,
        'power': 1000
    }
)
# -> returns a judgement on whether the observation is anomalous

# Update the twin with new data
wt.update(data)
```

### But: Oh No

I couldn't find any Python libraries for doing this! I found a few that could do discrete variables, but none that could run the updates and queries for continuous variables.

But I still think this is a really powerful idea... and I would have implemented it myself...

### An Alternative Approach

Okay, so we can't build a Bayesian Network just yet. Perhaps we can just build an ensemble of regression models that can predict any variables of interest? This could kind of function the same way... and at least we could demonstrate that the problem is soluble.

#### The Simplest Model

Let's start with the simplest possible model: weather -> power. We know that there is a direct causal relationship here, so we should be able to attain strong model performance.

We can denote this simple model as

$$ P = f(W) $$

where $P \in \mathcal{P} \subset \mathbb{R}$ is the active power output, $W \in \mathcal{W} \subset \mathbb{R}^5$ is a vector of weather variables, composed of wind speed, wind direction (away from the nacelle), temperature, air pressure and humidity, and $f: \mathcal{W} \rightarrow \mathbb{R}$ is a representation of the wind turbine system.

However, we will immediately run into a problem here! The wind turbine system does not run in a vacuum – it is also attached to the power grid, meaning that the observed real power generated will also be a function of grid demand. But we do not have access to grid demand data...

We can, however, use the reactive power output as a proxy for grid demand, since:

* Reactive power is correlated with grid demand (I think...)
* Reactive power is not nominally correlated with the weather and, a priori, not correlated with active power.

Our updated model is therefore

$$ P = f(W, Q) $$

where $Q \in \mathcal{Q} \subset \mathbb{R}$ is the reactive power output.

Fitting two model classes to this specification gives:

* Linear model: $R^2 = 80\%$
* Gradient boosting: $R^2 = 99\%$

#### A Physics-Informed Model

But let's apply some knowledge of physics. For example, we know that active power generation is almost perfectly linear in rotor torque... and that rotor torque increases in the square of wind speed and linearly in air density. So let's include these fields and run the same models again.

* Linear model: $R^2 = 92\%$, an improvement of 12 percentage points
* Gradient boosting: $R^2 = 99\%$, little to no change

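The large improvement in the linear model suggests that there is some 'information gain' in the new variables. The lack of improvement in gradient boosting is less clear: it may point to overfitting, but a tree-based model can also learn these nonlinear relationships from the raw inputs, and at $R^2 = 99\%$ there is little headroom left.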
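
The effect on the linear model can be reproduced on synthetic data. The sketch below does not use the Engie dataset or the models fitted above: it generates an invented cubic power curve with noise, then compares an ordinary least squares fit on raw wind speed against one that also includes the squared and cubed terms. Both $R^2$ values are in-sample.

```python
import numpy as np

def r_squared(y, y_hat):
    """Coefficient of determination."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1 - ss_res / ss_tot

def fit_linear(X, y):
    """Ordinary least squares with an intercept; returns in-sample predictions."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return A @ coef

rng = np.random.default_rng(0)
ws = rng.uniform(3.5, 14.5, size=2000)            # synthetic wind speeds, m/s
p = 0.7 * ws ** 3 + rng.normal(0, 50, size=2000)  # invented cubic power curve plus noise

r2_raw = r_squared(p, fit_linear(ws[:, None], p))
r2_physics = r_squared(p, fit_linear(np.column_stack([ws, ws ** 2, ws ** 3]), p))
print(r2_raw, r2_physics)
```

A linear model can only exploit the cubic relationship if it is handed the transformed features, whereas a flexible learner can approximate it from the raw input, which would be consistent with the pattern in the results above.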

But this is clearly a very soluble problem! Given a small vector of observables, we can predict active power generation to a high degree of accuracy.

## Engie Proposition

So my technical investigations were not quite as fruitful as I had hoped... but I think we could still form a strong pitch for Engie.

*We are working on an innovative new approach to modelling Digital Twins, and we want to partner with you for a proof of concept. Our model will provide a probability network defined over each physical asset, allowing you to answer questions like:*

* "Oil temperature looks like it is running hot. How likely is this, given the current weather conditions and grid demand?"
* "Here is a weather forecast for the next week. What is the range of power generation outcomes that I should expect?"
* "Have there been any recent anomalous events that I should investigate?"
* "Has asset performance remained consistent over the past month? If not, what has deteriorated?"

*This will include an intuitive and semantic API that allows you to easily train, query and update the model based on the data you have already made publicly available.*

*We don't yet have a working version, but to demonstrate that we are serious, we have built some physics-informed predictive models that take a more classical approach and achieve very high levels of performance. Concretely, given a few weather observations, we can predict active power generation with an $R^2$ of 99%. These will be embedded in our probability network model, so you can be confident that our new approach will be performant.*

## Conclusions

So not everything went to plan... but I think there is still a lot of potential in this space.

My next steps would be:
* Try to build the continuous Bayesian Network model! Shouldn't be too hard...
* ...assuming that this idea is indeed valid. I can't be the first person to think of this...
* Get some better measures of grid demand. This is a crucial input in the power generation model.
* Talk to a wind turbine engineer

## Questions