Skip to content

Filtering Framework Proposal

Sirma edited this page Feb 25, 2016 · 1 revision

This proposal has been implemented into RTBKit with the exception of the dynamic configuration scheme. A tutorial is available to get you up and running with filtering in RTBKit.

This proposal is one of two aimed at improving the filtering mechanism within RTBKit. This proposal is focused on laying some ground work which will eventually be useful to implement the sparse matrix filtering proposal.

Problem Statement

The current filtering scheme is a monolithic function where we read a set of predefined filtering criteria specified in various formats (segment filters, include/exclude, regex, size filters, etc.) and apply them in a static order. This scheme has several problems:

  • It's difficult to extend: requires the modification of a monolithic function and modification to a predetermined interface.
  • It's inflexible: all filters have to be executed regardless of whether they are needed for the current set of agents or not.
  • It doesn't scale with the number of agents: requires that we iterate over every agent configuration for each bid request it receives.
  • It's the biggest bottleneck: filtering consistently occupies the top spot in perf metrics taken on production routers.

There are two major directions we can take which revolve around how the agent configuration is specified and influences the format of the internal bid request object and the filtering algorithms involved. We will propose each solutions in turn along with an analysis of their impacts.

Filtering Framework

This framework comes from the idea that we should give full control to the user's of RTBKit over the format of the agent configuration. To do this, we need to develop a filtering framework capable of handling a wide range of filter types while still allowing for efficient filtering.

Note that an ongoing proof of concept can be found in the dynamic-filters branch.

Framework Overview

The idea is to break up the old monolithic filtering function into many smaller filtering objects with a common interface. We then assemble these filters into a dynamic pool which represent the filters required by the currently live set of agent configurations. Each filters would take a bid request as input and output the set of agent configuration that matches the bid request. The picture below gives an overview of how the main filtering loop would work.

Filtering framework

The filter pool contains a set of filter functions which take a bid request as inputs and outputs a set of configs that match the bid request. The output of each filter is successively combined with each other using the intersect operator until the set of configuration is either empty or we run out of filters to apply.

While performance improvements will be explored in the next two sections, there are still many advantages to this filtering framework:

  1. Filters are simple self-contained units which makes them easier to reason about and test.
  2. The available filter pool can be easily extended to adapt to future requirements.
  3. Makes it easier to experiment with filters. For example, it's possible to implement the sparse matrix on top of this framework.

Note that our old monolithic filtering scheme could be implemented as is within this framework which makes for a gentle transition.

Performance Improvements

As is, the proposed scheme is no more scalable then the current scheme because a naive filter implementation would still need iterate over every configuration to apply the filters. So where's our gain?

  1. Filters that are not required by a user are not executed.
  2. We can easily modify the order in which the filters are executed such that filters that are more likely to lead to an empty set are executed earlier. The order can even be adapted at runtime using live stats.
  3. If we allow for pre-processing of configurations, most filters can be reduced to a few table look-ups.

Point 3. might be a little surprising so here's a quick example of how the language filter could be expressed:

typedef bitset<256> ConfigSet;
struct LanguageFilter
{
    map<string, ConfigSet> data;
    void addConfig(const AgentConfig& config, unsigned configIndex)
    {
        for (const auto& lang : config.languages)
            data[lang].set(configIndex);
    }
    ConfigSet filter(const BidRequest& br)
    {
        return data[br.language];
    }
}

The addConfig function is called whenever a new config is added to the router and updates a filter cache to indicate that a given config is associated with a given language. Now that we have this cache, all that remains to do during filtering is do quick look-ups for the corresponding bitset in our cache and we're done. The biggest performance gain here is that the filter will scale independently as the number of agents grows.

Notice that the previous example treats its config sets as bitsets. This makes the intersect operation of the main filtering loop extremely fast since it can be implemented as a simple AND of the various outputs of the filters. Using bitsets is possible because it's very easy to associate a unique sequence id to an agent configuration.

Agent Configuration

This is where the dynamic nature of the filtering framework truly shines. If we were to associate each filter with a set of attribute within the agent configuration, we would be able to dynamically determine the exact set of filters required to cover every agent configuration. Furthermore, if we were to do-away with the pre-defined settings within the agent configuration, users of RTBKit would be able to define their own custom filters and associated agent configurations and RTBKit would automatically adjust the filter set accordingly. Simple and powerful.

This does have major downsides:

  • We can't assume that any RTBKit installation will have a uniform agent configuration which makes it difficult to create third-party utilities that produces agent configuration.
  • Give your users a gun and they're guaranteed to shoot themselves in the foot. In other words, creating efficient filters is not trivial and this opens the door for lots of support headaches. This is even more of an issue since we've established in the sparse matrix section that it's possible to support most types of filtering that users care about using predefined filters.

As a first iteration, I'd recommend not implementing this feature just yet because it's much easier to relax a strict interface then it is to do the inverse. Additionally, the sparse matrix filtering proposal makes a case for a uniform filter which removes the need for a dynamic filter pool.

Conclusions

Advantages

  • Self-contained filters allows for a more configurable, malleable and extensible filter setup.
  • Dynamic set of filters allows for runtime adaptation of the filter pool.
  • Agent configuration pre-processing allows filtering to scale independently of the number of agents.
  • Can express any type of filters.

Disadvantages

  • Uncontrolled agent configuration API can be a support nightmare.
  • Maintaining a dynamic filter pool introduces overhead in the filtering due to indirections and synchronization.
  • Additional pre-processing of agent configuration could overload the main router loop.
Clone this wiki locally