# Terra Validator Churn
> "Validator Uptime Insights"

- toc:true
- branch: master
- badges: true
- comments: false
- author: Scott Simpson
- categories: [terra, validators]

The Terra network relies on a set of proof-of-stake validator nodes to secure the network.  Validators are judged on their uptime and on the quality of their validation - any deviation from expectations can result in slashing and loss of funds for the validator operator.  This post explores uptime and downtime patterns among the Terra validator nodes.

In [7]:
#hide
#Imports & settings
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
sns.set_theme(style="ticks", color_codes=True)
%matplotlib inline
#%load_ext google.colab.data_table
%load_ext rpy2.ipython
%R options(tidyverse.quiet = TRUE)
%R options(lubridate.quiet = TRUE)
%R options(jsonlite.quiet = TRUE)
%R suppressMessages(library(tidyverse))
%R suppressMessages(library(lubridate))
%R suppressMessages(library(jsonlite))
%R suppressMessages(options(dplyr.summarise.inform = FALSE))



In [47]:
#hide
%%R
#Grab base query from Flipside
df_group1 = fromJSON('https://api.flipsidecrypto.com/api/v2/queries/5eebcedf-5edd-4afd-931a-5932d5fbf964/data/latest', simplifyDataFrame = TRUE)
df_group2 = fromJSON('https://api.flipsidecrypto.com/api/v2/queries/9c6e455a-66eb-45f9-b87a-c76d716e0040/data/latest', simplifyDataFrame = TRUE)
df_group3 = fromJSON('https://api.flipsidecrypto.com/api/v2/queries/7013252f-a0fe-43b2-abea-b4f42e319210/data/latest', simplifyDataFrame = TRUE)
df_vals = fromJSON('https://api.flipsidecrypto.com/api/v2/queries/5ac709ee-887c-4522-9bc8-15cec97720e0/data/latest', simplifyDataFrame = TRUE)

#union all the three query groups
df <- df_group1 %>%
  bind_rows(df_group2) %>%
  bind_rows(df_group3)

rm(list = c("df_group1", "df_group2", "df_group3"))

#Change the date to date format
df$BLOCK_TIMESTAMP <- parse_datetime(df$BLOCK_TIMESTAMP)
df$FIRST_BLOCK_TIMESTAMP <- parse_datetime(df$FIRST_BLOCK_TIMESTAMP)
df_vals$FIRST_BLOCK_TIMESTAMP <- parse_datetime(df_vals$FIRST_BLOCK_TIMESTAMP)
df_vals$LAST_BLOCK_TIMESTAMP <- parse_datetime(df_vals$LAST_BLOCK_TIMESTAMP)

#Rename & reorder columns
names(df)<-tolower(names(df))
df <- df  %>%
  select(block_timestamp, block_id, first_block_timestamp, first_block_id,
         validator, label, missed_blocks, downtime)

names(df_vals)<-tolower(names(df_vals))
df_vals <- df_vals %>% rename(validator = valcons)

#add week and month columns based on first_block_timestamp
df <- df %>%
  mutate(week = floor_date(first_block_timestamp, "weeks"),
         month = floor_date(first_block_timestamp, "months"))


#time running - event rates & MTBF (hours) for all time
event_rates_all_time <- df_vals %>% select(validator, label, runtime) %>%
  left_join(df %>% count(validator)) %>%
  rename(event_count = n) %>%
  #validators with no recorded events have no row in df, so the join leaves NA
  replace_na(list(event_count = 0)) %>%
  mutate(event_rate = if_else(runtime == 0, 0, event_count / runtime),
         average_time_per_event = if_else(event_count == 0, Inf, runtime / event_count / 60 / 60))


#weeks-validators
df_val_weeks <- df %>% distinct(week) %>%
  full_join(df_vals$validator, by = character(), copy = TRUE) %>%
  rename(validator = y) %>%
  left_join(df %>% count(week, validator)) %>%
  left_join(df_vals %>% select(validator, first_block_timestamp, last_block_timestamp)) %>%
  left_join(df %>% group_by(week, validator) %>% summarise(downtime = sum(downtime))) %>%
  filter(first_block_timestamp < week) %>%
  filter(last_block_timestamp > week)  %>%
  rename(event_count = n) %>%
  replace_na(list(event_count = 0, downtime = 0)) %>%
  #events per second over a 7-day week
  mutate(event_rate = event_count / (7 * 24 * 60 * 60),
         average_time_per_event = if_else(event_count == 0, Inf, 7 * 24 / event_count))


#weeks only
df_weeks <- df_val_weeks %>%
  group_by(week) %>%
  summarise(event_count = sum(event_count),
            downtime = sum(downtime),
            validators = n(),
            average_time_per_event = 7 * 24 / event_count * validators,
            events_per_validator = event_count / validators,
            downtime_per_validator = downtime / validators / 60 / 60) %>%
  ungroup() %>%
  filter(week != max(week))


#top 10 validators: lowest event rate (fewest downtime events per second of runtime)
top_10_validators <- event_rates_all_time %>%
  arrange(event_rate) %>%
  ungroup() %>%
  head(10)

#bottom 10 validators: highest event rate (most downtime events per second of runtime)
bottom_10_validators <- event_rates_all_time %>%
  arrange(event_rate) %>%
  ungroup() %>%
  tail(10)

# by week for top 10
df_weeks_top10 <- df_val_weeks %>%
  filter(validator %in% top_10_validators$validator) %>%
  group_by(week) %>%
  summarise(event_count = sum(event_count),
            downtime = sum(downtime),
            validators = n(),
            average_time_per_event = 7 * 24 / event_count * validators) %>%
  ungroup() %>%
  filter(week != max(week))



# by week for bottom 10
df_weeks_bottom10 <- df_val_weeks %>%
  filter(validator %in% bottom_10_validators$validator) %>%
  group_by(week) %>%
  summarise(event_count = sum(event_count),
            downtime = sum(downtime),
            validators = n(),
            average_time_per_event = 7 * 24 / event_count * validators) %>%
  ungroup() %>%
  filter(week != max(week))



# How Often Do Validators Go Down?
How often do validators that have been online at least once in the past 6 months go offline?  On average, it happens surprisingly often: once every 1.6 hours for each validator (see table below).  This single statistic doesn't tell the whole story though - the average is made up of some very reliable validators and some rather unreliable ones.

We took a dataset of all validators who collected rewards over the last 6 months and looked for the *liveness* events recorded on chain.  The protocol collects these events and uses them in slashing calculations for nodes that haven't met their validation requirements.  Each event records how many blocks a validator missed.

Taking these liveness events, we count how many each validator has had over this period and divide the time the validator has been running by that count.  This gives an average time between downtime events for each validator, a metric also known as Mean Time Between Failure (MTBF) and commonly used in asset management.  The distribution of MTBF across validators is plotted below.
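The per-validator calculation can be sketched in a few lines of Python.  This is a minimal illustration rather than the notebook's own code (which does this step in R): the `mtbf_hours` helper and the sample figures are hypothetical, and it assumes runtime is measured in seconds.

```python
import math


def mtbf_hours(runtime_seconds: float, event_count: int) -> float:
    """Mean time between failures, in hours, for a single validator."""
    if event_count == 0:
        # No recorded downtime events: treat the MTBF as unbounded.
        return math.inf
    return runtime_seconds / event_count / 3600


# Hypothetical validator: online for 30 days with 120 liveness events.
thirty_days = 30 * 24 * 60 * 60
print(mtbf_hours(thirty_days, 120))  # 6.0
```

Returning infinity for a validator with no events keeps it at the "most reliable" end of any ranking instead of raising a division error.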



In [51]:
#hide_input
%%R
event_rates_all_time %>%
  summarise(rt = sum(runtime),
            ev = sum(event_count),
            val=n()) %>%
  mutate(MTBF = round(rt / ev / 60 / 60, 1) ) %>%
  select(MTBF)

  MTBF
1  1.6
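The figure in the table is a pooled average: total running time across all validators divided by total events.  The sketch below, using made-up numbers for three hypothetical validators, shows how a pooled MTBF can sit close to the least reliable operators while hiding a very reliable one.

```python
# Hypothetical (runtime_hours, event_count) pairs; not real Terra data.
validators = {
    "steady": (1000.0, 10),
    "average": (1000.0, 500),
    "flaky": (1000.0, 2000),
}

total_runtime = sum(runtime for runtime, _ in validators.values())
total_events = sum(events for _, events in validators.values())

# Pooled MTBF: total validator-hours divided by total events.
pooled_mtbf = total_runtime / total_events

# Per-validator MTBF, which the pooled figure hides.
per_validator = {
    name: runtime / events for name, (runtime, events) in validators.items()
}

print(round(pooled_mtbf, 2))  # 1.2
print(per_validator)  # {'steady': 100.0, 'average': 2.0, 'flaky': 0.5}
```

Because the pooled figure weights every event equally, validators with many events dominate it.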


In [36]:
#hide_input
#Histo of time between events for all validators
df_p = %R event_rates_all_time
fig = px.histogram(df_p
             , x = "average_time_per_event"
             #, y = "n"
             , labels=dict(average_time_per_event="MTBF", count="Number of Validators")
             , title= 'Mean Time Between Failure for Last 6 Months'
             , template="simple_white", width=800, height=800/1.618
             )
fig.update_yaxes(title_text='Number of Validators')
fig.update_xaxes(title_text='Mean Time Between Failure (hours)')
fig.update_layout(showlegend=False)
fig.show()

We see that there is a large variation between validators, which suggests that the network has a number of very professional outfits alongside some less skilled or less well-resourced node operators.  There are 4 validators that operate with an MTBF of greater than 60 hours (one downtime event every 2-3 days), and nearly 50 that operate with less than 2 hours between downtime events.

# Change in MTBF Over Time

To see how the MTBF metric has changed over the last 6 months, we plot the MTBF (per validator) for each week below.  There has been a marked decrease in MTBF over the 6 months, from as high as 6 hours between downtime events in the early weeks to less than 1 hour between events per validator.  This indicates a decrease in the overall reliability of the validator nodes over this time period.
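The weekly per-validator figure is the number of validator-hours in the week divided by the number of downtime events in that week.  A minimal sketch of that arithmetic follows; the helper name and the validator and event counts are hypothetical, chosen only so that the results land on the 6-hour and 1-hour levels mentioned above.

```python
HOURS_PER_WEEK = 7 * 24


def weekly_mtbf_hours(event_count: int, validators: int) -> float:
    """Hours between downtime events per validator within one week."""
    if event_count == 0:
        return float("inf")
    return HOURS_PER_WEEK * validators / event_count


# Hypothetical weeks with 130 active validators.
print(weekly_mtbf_hours(event_count=3640, validators=130))   # 6.0
print(weekly_mtbf_hours(event_count=21840, validators=130))  # 1.0
```

Scaling by the number of active validators keeps weeks comparable even when the size of the validator set changes.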

In [38]:
#hide_input
#bar chart of weekly MTBF across all validators
df_p = %R df_weeks
fig = px.bar(df_p
             , x = "week"
             , y = "average_time_per_event"
             , labels=dict(week="Week", average_time_per_event="MTBF")
             , title= 'Mean Time Between Failure over Time'
             , template="simple_white", width=800, height=800/1.618
             )
fig.update_yaxes(title_text='Mean Time Between Failure (hours)')
fig.show()

# High Performing Validators

We saw earlier that there was a distinct difference between the best and worst performing validators.  The graph below looks at just the 10 best performing validators (by number of downtime events relative to their operating time) over the last 6 months.  Here we also see a downward trend.  In the early weeks, MTBF for this group was in the hundreds of hours.  It has since trended down to around 30 hours between downtime events.  Notice that this is still more than an order of magnitude better than the current average of around 1 hour.
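Selecting the best and worst performers amounts to sorting validators by their event rate (downtime events per unit of operating time) and taking each end of the list.  The sketch below illustrates this with hypothetical validator names and rates; it is not the notebook's own R code.

```python
# Hypothetical (validator, events_per_hour) pairs; not real Terra data.
event_rates = [
    ("val-a", 0.02), ("val-b", 1.50), ("val-c", 0.40),
    ("val-d", 7.00), ("val-e", 0.01), ("val-f", 0.90),
]


def rank_by_event_rate(rates, n):
    """Return the n lowest-rate (best) and n highest-rate (worst) entries."""
    ordered = sorted(rates, key=lambda pair: pair[1])
    return ordered[:n], ordered[-n:]


best, worst = rank_by_event_rate(event_rates, 2)
print([name for name, _ in best])   # ['val-e', 'val-a']
print([name for name, _ in worst])  # ['val-b', 'val-d']
```

Ranking by rate rather than by raw event count avoids penalising validators simply for having been online longer.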


In [48]:
#hide_input
#bar chart of weekly MTBF for the top 10 validators
df_p = %R df_weeks_top10
fig = px.bar(df_p
             , x = "week"
             , y = "average_time_per_event"
             , labels=dict(week="Week", average_time_per_event="MTBF")
             , title= 'Mean Time Between Failure Top 10 Validators'
             , template="simple_white", width=800, height=800/1.618
             )
fig.update_yaxes(title_text='Mean Time Between Failure (hours)')
fig.show()

# Low Performing Validators

Performing the same analysis on the worst 10 validators (by downtime events relative to their operating time), we can see why the overall average is so low.  The MTBF for this group in the most recent week is around 8 minutes, and there is a distinct downtrend from 6 months ago, when this number was as high as 1 hour.  This mirrors what we have seen in the rest of the dataset - reliability is trending downwards.
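To put these MTBF figures in perspective, they can be converted into an expected number of downtime events per validator per week.  The short sketch below does that arithmetic for the 8-minute figure above and for the roughly 30-hour figure of the top 10 group; the helper name is hypothetical.

```python
MINUTES_PER_WEEK = 7 * 24 * 60


def events_per_week(mtbf_minutes: float) -> float:
    """Expected downtime events per validator per week for a given MTBF."""
    return MINUTES_PER_WEEK / mtbf_minutes


print(events_per_week(8))        # 1260.0 (bottom 10, MTBF of ~8 minutes)
print(events_per_week(30 * 60))  # 5.6 (top 10, MTBF of ~30 hours)
```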

In [49]:
#hide_input
#bar chart of weekly MTBF for the bottom 10 validators
df_p = %R df_weeks_bottom10
fig = px.bar(df_p
             , x = "week"
             , y = "average_time_per_event"
             , labels=dict(week="Week", average_time_per_event="MTBF")
             , title= 'Mean Time Between Failure Bottom 10 Validators'
             , template="simple_white", width=800, height=800/1.618
             )
fig.update_yaxes(title_text='Mean Time Between Failure (hours)')
fig.show()

# Conclusions

This post has explored downtime events among Terra validators.  We have seen that, averaged over the entire validator set, validators miss blocks around once every 1.6 hours.  There is a vast difference between validators however - the best 10 validators currently have a downtime event around every 30 hours, while the worst 10 have one around every 8 minutes.  Across all groups there has been a trend of decreasing reliability compared with 6 months ago.

## References

All data sourced from [Flipside Crypto](https://flipsidecrypto.com/).  Data sources:

- https://api.flipsidecrypto.com/api/v2/queries/5eebcedf-5edd-4afd-931a-5932d5fbf964/data/latest
- https://api.flipsidecrypto.com/api/v2/queries/9c6e455a-66eb-45f9-b87a-c76d716e0040/data/latest
- https://api.flipsidecrypto.com/api/v2/queries/7013252f-a0fe-43b2-abea-b4f42e319210/data/latest
- https://api.flipsidecrypto.com/api/v2/queries/5ac709ee-887c-4522-9bc8-15cec97720e0/data/latest

