Many thanks to Phillip Cloud at Voltron Data for helping me out. Here's his answer to my [gist](https://gist.github.com/pybokeh/9fd661dd3c430da2a8dcbb65c8e3d007?permalink_comment_id=4722957#gistcomment-4722957).

## BACKGROUND

I have a scenario where I need to calculate a value that does not come from an existing table column. It is an intermediate value, used only to derive a new column in a table. In other words, I need a variable that holds a running value, and that value has to be updated row by row based on a column of an ibis table. Ibis' ordinary expressions and built-in functions do not support this. I learned that solving this problem requires ibis' UDF abstraction. Below are the specifics of what I am trying to achieve.

In [1]:
import ibis
import pandas as pd
import pyarrow as pa
from ibis import _, udf
ibis.options.interactive = True

# create a DuckDB client
client = ibis.duckdb.connect()

In [2]:
failures = client.read_csv('data/rivet_failures.csv')

In [3]:
failures

I need to create or add 4 additional columns:

- Add `status` column to represent status of failure. Valid values are "FAILED" or "SUSPENDED". I am only interested in the "Flair Failure" mode, so rows with this failure mode get a status of "FAILED" and all other failure modes get "SUSPENDED".
- Add `rank` column to represent rank of the failed unit by failure time sorted in ascending order
- Add `reverse_rank` column to represent the rank in reverse
- Add `adjusted_rank` column to represent adjusted rank

Where `adjusted_rank` is based on the formula below:

<center>$\large{\text{Adjusted Rank} = \frac{(\text{Reverse Rank})(\text{Previous Adjusted Rank}) + (N + 1)}{(\text{Reverse Rank}) + 1}}$</center>

where N is the total number of rows in the data set, that is, the number of units regardless of failure mode.
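
As a worked example, take the eight-row data set used in this notebook, so N = 8, with the previous adjusted rank starting at 0. The first failed unit has a reverse rank of 7 and the second failed unit has a reverse rank of 5:

<center>$\large{\frac{(7)(0) + (8 + 1)}{7 + 1} = \frac{9}{8} = 1.125}$</center>

<center>$\large{\frac{(5)(1.125) + (8 + 1)}{5 + 1} = \frac{14.625}{6} = 2.4375}$</center>

Suspended units are skipped: they get no adjusted rank and do not update the previous adjusted rank.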

Using ibis, I can add the first 3 columns: `status`, `rank`, and `reverse_rank`

In [4]:
(
    failures
    .mutate(
        status=(
            ibis.case()
            .when(_.failure_mode == 'Flair Failure', 'FAILED')
            .else_('SUSPENDED')
            .end()
        )
    )
    .order_by(_.failure_time_minutes)
    .mutate(rank=ibis.row_number()+1)
    .mutate(reverse_rank=failures.count()+1 - _.rank)
)

However, I don't know how to add the fourth column, `adjusted_rank`, using ibis. It requires a variable (`prev_adj_rank`) that is not part of any table expression and that has to be updated based on values from an actual ibis table column. I'm guessing I need an ibis UDF for this.

Using pandas, I can create the `adjusted_rank` column. First, I'll convert my ibis table expression to a pandas dataframe:

In [5]:
pdf = (
    failures
    .mutate(
        status=(
            ibis.case()
            .when(_.failure_mode == 'Flair Failure', 'FAILED')
            .else_('SUSPENDED')
            .end()
        )
    )
    .order_by(_.failure_time_minutes)
    .mutate(rank=ibis.row_number()+1)
    .mutate(reverse_rank=failures.count()+1 - _.rank)
).to_pandas()

In [6]:
pdf

Unnamed: 0,serial_number,failure_time_minutes,failure_mode,status,rank,reverse_rank
0,7,10,Lug failed,SUSPENDED,1,8
1,4,30,Flair Failure,FAILED,2,7
2,6,45,Flair loosened,SUSPENDED,3,6
3,5,49,Flair Failure,FAILED,4,5
4,8,82,Flair Failure,FAILED,5,4
5,1,90,Flair Failure,FAILED,6,3
6,2,96,Flair Failure,FAILED,7,2
7,3,100,Flair loosened,SUSPENDED,8,1


<center>$\large{\text{Adjusted Rank} = \frac{(\text{Reverse Rank})(\text{Previous Adjusted Rank}) + (N + 1)}{(\text{Reverse Rank}) + 1}}$</center>

Below is a custom function that adds the `adjusted_rank` column using a pandas idiom:

In [7]:
def add_adjusted_rank(df: pd.DataFrame, col_status: str, col_rev_rank: str):
    """
    Adds adjusted rank column

    Parameters
    ----------
    df : pd.DataFrame
        pandas dataframe containing failure data
    col_status: str
        column containing the status of the unit.  Must only contain "FAILED" or "SUSPENDED"
    col_rev_rank : str
        column containing the reverse rank
    """

    # Previous adjusted rank initialized to zero
    prev_adj_rank = [0]
    
    def adj_rank(series):
        if series[col_status] == "SUSPENDED":
            return "SUSPENSION"
        else:
            adjusted_rank = (series[col_rev_rank] * 1.0 * prev_adj_rank[0] + (len(df) + 1))/(series[col_rev_rank] + 1)
            # Update previous adjusted rank to the current adjusted rank
            prev_adj_rank[0] = adjusted_rank
            return adjusted_rank

    df = df.assign(adjusted_rank=df.apply(adj_rank, axis=1))

    return df

Below is what I get using the custom function:

In [8]:
add_adjusted_rank(pdf, 'status', 'reverse_rank')

Unnamed: 0,serial_number,failure_time_minutes,failure_mode,status,rank,reverse_rank,adjusted_rank
0,7,10,Lug failed,SUSPENDED,1,8,SUSPENSION
1,4,30,Flair Failure,FAILED,2,7,1.125
2,6,45,Flair loosened,SUSPENDED,3,6,SUSPENSION
3,5,49,Flair Failure,FAILED,4,5,2.4375
4,8,82,Flair Failure,FAILED,5,4,3.75
5,1,90,Flair Failure,FAILED,6,3,5.0625
6,2,96,Flair Failure,FAILED,7,2,6.375
7,3,100,Flair loosened,SUSPENDED,8,1,SUSPENSION


I need to create this `adjusted_rank` column using ibis. I assume I need ibis' UDF support for this. I looked at the [documentation](https://ibis-project.org/reference/scalar-udfs) for UDFs, but I'm still not sure how to use an ibis UDF with the DuckDB backend to create this `adjusted_rank` column.

#### The solution provided by Phillip Cloud using an ibis idiom

In [9]:
@udf.scalar.pyarrow
def adjusted_rank(n: int, col_status: str, col_rev_rank: int) -> float:
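    # The annotations tell ibis the column types; at call time each argument
    # is a pyarrow array covering a batch of rows, not a single value.
    # Previous adjusted rank initialized to zero
    prev_adj_rank = [0]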

    def adj_rank(n, status, rev_rank):
        if status == "SUSPENDED":
            return None
        else:
            adjusted_rank = (rev_rank * prev_adj_rank[0] + (n + 1)) / (rev_rank + 1)
            # Update previous adjusted rank to the current adjusted rank
            prev_adj_rank[0] = adjusted_rank
            return adjusted_rank

    return pa.array(
        map(adj_rank, n.to_numpy(), col_status.to_numpy(), col_rev_rank.to_numpy())
    )

In [10]:
ranks = (
    failures.mutate(
        status=(
            ibis.case()
            .when(_.failure_mode == "Flair Failure", "FAILED")
            .else_("SUSPENDED")
            .end()
        )
    )
    .order_by(_.failure_time_minutes)
    .mutate(rank=ibis.row_number() + 1)
    .mutate(reverse_rank=failures.count() + 1 - _.rank)
    .mutate(adjusted_rank=adjusted_rank(_.count(), _.status, _.reverse_rank))
)

In [11]:
ranks
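
The UDF works because `adj_rank` carries state from one row to the next. A minimal plain-Python sketch of the same recurrence, without ibis or pyarrow, makes that state explicit and shows why the `order_by` step matters. The helper name `adjusted_ranks` and the hard-coded `rows` list are mine, copied from the `status` and `reverse_rank` columns of the eight-row data set above. I'm also assuming the backend hands rows to the UDF in the sorted order, which is worth verifying on larger tables.

```python
from typing import Iterable, List, Optional, Tuple


def adjusted_ranks(rows: Iterable[Tuple[str, int]], n: int) -> List[Optional[float]]:
    """Apply the adjusted-rank recurrence to (status, reverse_rank) pairs in order."""
    prev_adj_rank = 0.0  # previous adjusted rank starts at zero
    result: List[Optional[float]] = []
    for status, rev_rank in rows:
        if status == "SUSPENDED":
            # Suspended units get no adjusted rank and leave the state untouched
            result.append(None)
        else:
            prev_adj_rank = (rev_rank * prev_adj_rank + (n + 1)) / (rev_rank + 1)
            result.append(prev_adj_rank)
    return result


# (status, reverse_rank) pairs sorted by ascending failure time
rows = [
    ("SUSPENDED", 8), ("FAILED", 7), ("SUSPENDED", 6), ("FAILED", 5),
    ("FAILED", 4), ("FAILED", 3), ("FAILED", 2), ("SUSPENDED", 1),
]

in_order = adjusted_ranks(rows, n=len(rows))
# Feeding the same rows in a different order changes every computed value
out_of_order = adjusted_ranks(list(reversed(rows)), n=len(rows))
```

Here `in_order` reproduces the `adjusted_rank` column from the pandas version (with `None` in place of "SUSPENSION"), while `out_of_order` does not, because each value depends on the one computed before it.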

I actually need to add a 5th column called `median_rank` based on the following formula:

<center>$\large{\text{Median Rank} = \frac{\text{Adjusted Rank} - 0.3}{N + 0.4}}$</center>

where N is the total number of rows in our data set, and `median_rank` is Null/NaN whenever the adjusted rank is Null/NaN.

Since the `median_rank` calculation does not rely on an external variable, we don't need to resort to using a UDF.
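
As a quick sanity check of the formula, here is a minimal plain-Python sketch. The helper name `median_rank` is mine, and `None` stands in for Null/NaN. It is only meant to show the arithmetic and the null propagation, not how ibis evaluates the expression.

```python
from typing import List, Optional


def median_rank(adjusted_rank: Optional[float], n: int) -> Optional[float]:
    """Return (adjusted_rank - 0.3) / (n + 0.4), passing None through unchanged."""
    if adjusted_rank is None:
        return None
    return (adjusted_rank - 0.3) / (n + 0.4)


# Adjusted ranks from the eight-row example; None marks a suspended unit
adjusted: List[Optional[float]] = [None, 1.125, None, 2.4375, 3.75, 5.0625, 6.375, None]
medians = [median_rank(a, n=len(adjusted)) for a in adjusted]
```

Each entry of `medians` depends only on the value in the same row, which is why an ordinary column expression is enough for this step.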

In [12]:
ranks2 = (
    failures.mutate(
        status=(
            ibis.case()
            .when(_.failure_mode == "Flair Failure", "FAILED")
            .else_("SUSPENDED")
            .end()
        )
    )
    .order_by(_.failure_time_minutes)
    .mutate(rank=ibis.row_number() + 1)
    .mutate(reverse_rank=failures.count() + 1 - _.rank)
    .mutate(adjusted_rank=adjusted_rank(_.count(), _.status, _.reverse_rank))
    .mutate(
        median_rank=(
            ibis.case()
            .when(_.adjusted_rank.isnull(), None)
            .else_((_.adjusted_rank - 0.3) / (_.count() + 0.4))
            .end()
        )
    )
)

In [13]:
ranks2