
Sparse Matrix Filtering

Rémi Attab edited this page Mar 12, 2014 · 8 revisions

This document is a work in progress and will probably undergo several radical changes before it is considered worthy. You have been warned.

This proposal is one of two aimed at improving the filtering mechanism within RTBKit. The filtering framework proposal is focused on laying some ground work which will eventually be useful to implement this proposal.

In this proposal, we make a rather bold recommendation which is worth noting from the get go:

We recommend that all existing filters except the segment filter be deprecated.

Read on to know our rationale behind this decision.

Problem Statement

The agent configuration is currently fairly messy and hard to understand because it contains various types of filters, each of which works in its own unique and special way. This makes it harder to guarantee a certain level of performance for our filters and complicates the development of third-party libraries on top of the agent configuration interface.

Sparse Matrix Filtering

The idea is to cut down on the variety of filters until we're left with a single filter type which can express all the others. To do this, we first need to express all bid requests in a unified format which can be interpreted by our uniform filter. The unification of the internal bid request and agent configuration formats has some nice side effects:

  • The format of agent configurations is now much simpler and can easily be consumed by third-party libraries.
  • The internal bid request format can be filtered on any exchange-specific fields without having to extend the bid request object. Note that, while this problem is unrelated to filtering, it's still a problem that will have to be solved at some point in RTBKit.

A representation that meets these requirements is the segment filter, which can express just about any filter we'd be interested in.

Sparse Matrix

Unfortunately, using a map of strings to strings or integers could prove to be quite slow because of all the string manipulations that would be required. Fortunately, we can use a more efficient representation to create lightning-fast filters: a sparse matrix.

[Figure: sparse matrix representation]

The idea is to hash the keys and values into 64-bit integers, as shown in the figure above. A simple implementation of this sparse matrix would be a hash map from integers to hash sets of integers, which can be implemented quite efficiently.
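
As a rough sketch, the structure described above might look like this in Python (the class, the choice of hash function, and the method names are illustrative assumptions, not RTBKit's actual implementation):

```python
import hashlib

def hash64(s: str) -> int:
    """Hash a string into a 64-bit integer (any fast 64-bit hash would do)."""
    return int.from_bytes(hashlib.blake2b(s.encode(), digest_size=8).digest(), "big")

class SparseMatrix:
    """Hash map of hashed segment keys to hash sets of hashed values."""
    def __init__(self):
        self.rows = {}  # key hash -> set of value hashes

    def add(self, key: str, value: str):
        self.rows.setdefault(hash64(key), set()).add(hash64(value))

    def contains_any(self, key: str, values) -> bool:
        row = self.rows.get(hash64(key), set())
        return any(hash64(v) in row for v in values)

# Converting a bid request's segments into the matrix:
bid_request = SparseMatrix()
bid_request.add("language", "FR")
bid_request.add("language", "EN")
bid_request.add("country", "CA")
```

Once everything is hashed, filter evaluation reduces to integer set membership tests, with no string manipulation on the hot path.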

Converting bid requests to this sparse matrix is relatively straightforward. For agent configurations, there are several directions we could take to describe the filters, but they all boil down to defining a boolean expression over the values of each segment.

As an example, let's take our current include/exclude filter for segments, which boils down to the following boolean expression (we'll ignore excludeIfNotPresent for now):

segment.contains-any(includes) && ~segment.contains-any(excludes)

This filter could be implemented as two sparse matrices: one for the includes and one for the excludes. These sparse matrices could then be combined using the boolean expression described above to determine whether a bid request passes the filter or not.
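
A minimal sketch of that combination in Python, with plain sets standing in for the two hashed sparse matrices (the function name is an assumption made for the sketch):

```python
def passes_segment_filter(segment_values: set, includes: set, excludes: set) -> bool:
    """segment.contains-any(includes) && ~segment.contains-any(excludes)"""
    def contains_any(wanted):
        return bool(wanted & segment_values)
    return contains_any(includes) and not contains_any(excludes)
```

A request passes only when it matches at least one include and no excludes, exactly the boolean expression above.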

Note that there's an ongoing debate as to whether it's a good idea to use include/exclude as our primary filter. The issue is that it contains quite a few edge cases (see the segment filtering test) and in our experience it's notoriously difficult to get right. If this proves to be the case, we could eventually switch to a more generic and straightforward boolean expression mechanism (e.g. A or (B and !C)).

Dynamic Filters

So far we've only addressed static filters which come directly from the bid request. What about augmentors or more complex regex filters?

For augmentors, the filters are already set up as segment filters, so the leap is pretty easy to make. Note that while we could apply the filter on the combined sparse matrix of both the static filters and the augmentation filters, this is a bad idea because it would require sending every bid request to the augmenter, which would be a scalability nightmare. Instead we'll create two sets of filters: one for the static filters, which can be evaluated early, and one for the augmentation filters, which would be evaluated just before sending the bid request to the agents.
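
The two-phase split might look roughly like this (the filter and agent classes and the augmenter callback are assumptions made for the sketch, not RTBKit's API):

```python
class ContainsAnyFilter:
    """Passes when the request's values for `key` intersect `wanted`."""
    def __init__(self, key, wanted):
        self.key, self.wanted = key, set(wanted)

    def passes(self, segments):
        return bool(self.wanted & segments.get(self.key, set()))

class Agent:
    def __init__(self, name, static, augmentation):
        self.name, self.static, self.augmentation = name, static, augmentation

def route(bid_request, agents, send_to_augmenters):
    # Phase 1: static filters run on the raw bid request.
    survivors = [a for a in agents if a.static.passes(bid_request)]
    # Only the survivors reach the augmenters, keeping augmenter load low.
    augmented = send_to_augmenters(bid_request, survivors)
    # Phase 2: augmentation filters run on the augmented segments.
    return [a for a in survivors if a.augmentation.passes(augmented)]
```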

Regex filters, on the other hand, don't seem to fit at all with segment filtering, so how will this work out? Since most of our current regex filters were only added to emulate boolean logic, most of them are pretty easy to convert into segment filters. Take the language filter as an example: it would usually take the form ^FR|EN$ and could easily be converted into an include filter for FR or EN. In other cases, like URLs, we could transform the URL into several segments by applying various levels of canonization. As an example, http://foo.bar.com/folder?arg=value could generate the following segments:

http://foo.bar.com/folder?arg=value
foo.bar.com/folder?arg=value
foo.bar.com/folder
foo.bar.com
bar.com
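
A sketch of such a canonization step (the function name and the exact set of forms generated are assumptions; a real implementation would also hash each form into the sparse matrix):

```python
from urllib.parse import urlsplit

def url_segments(url: str) -> list:
    """Generate progressively less specific forms of a URL."""
    parts = urlsplit(url)
    host, path, query = parts.netloc, parts.path, parts.query
    segments = [url]
    if query:
        segments.append(f"{host}{path}?{query}")
    segments.append(f"{host}{path}")
    segments.append(host)
    # Strip subdomains one label at a time: foo.bar.com -> bar.com
    labels = host.split(".")
    for i in range(1, len(labels) - 1):
        segments.append(".".join(labels[i:]))
    return segments
```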

Unfortunately, the number of possible combinations to generate could quickly grow out of control for some filter types. This is where we need dynamic segments. The idea is that when we execute the dynamic filters for a set of segments, we would first execute a set of functions that generate the various segments, which would then be progressively filtered using the same mechanism as the static filters.

Let's take URL filtering as an example, where the regex is too complicated to be captured by canonization. In this case, once the static filters are processed, the router would call our dynamic filter with the bid request. We can then iterate over the various agent configurations and apply the associated regex filters, which will yield a set of pass or fail outputs. These outputs can then be hashed and stored in a segment named is-good-url. Now that we have an ordinary segment, we can go ahead and apply our regular segment filter.
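
Sketched in Python, the dynamic step might look like this (the agent names, regexes, and segment layout are illustrative only):

```python
import re

# Each agent configuration carries its own URL regex (hypothetical examples).
agent_url_regexes = {
    "agent-1": re.compile(r"bar\.com"),
    "agent-2": re.compile(r"baz\.net"),
}

def dynamic_url_segment(url: str) -> dict:
    """Evaluate each agent's regex once and record the passes as a segment."""
    passed = {agent for agent, rx in agent_url_regexes.items() if rx.search(url)}
    return {"is-good-url": passed}
```

The resulting is-good-url segment can then be filtered and forwarded to the agents like any other segment.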

Note that the whole process seems a bit convoluted: why not just take the output of the regex and do the filtering with just that? By turning the output into a segment, we can deliver it to the agent along with the bid request, and the agent can then use it to make bid decisions. The whole process is very similar to augmenters, except that it's done entirely within the router.

Integration

In this section we'll discuss how this sparse matrix filter can be integrated into the filtering framework.

First off, segment filtering can be efficiently implemented as bitops and cache look-ups (an early proof of concept is available here). This implies that sparse matrix filtering can be implemented efficiently on top of this framework and will therefore scale well with the number of agents. We will therefore use the sparse matrix proposal in its entirety and build it on top of the filtering framework. While the dynamic nature of the filtering framework is no longer as appealing, it still provides a convenient mechanism for pre-processing agent configurations, which is a requirement of the sparse matrix filter.

Conclusions

Advantages:

  • Single point of optimization for all RTBKit filters. In other words, there's a single filter we need to optimize to ensure that all RTBKit filtering is fast.
  • Normalizes the bid request for downstream users. As an example, machine learning can be efficiently implemented on top of the sparse matrix.
  • Exchange connectors can now easily extend the internal bid request representation to support exchange specific fields that can in turn be used to filter.
  • Agent configurations are uniform and the interface is stable across all RTBKit users. This enables third-party tools to be developed for RTBKit which will be compatible across the entire RTBKit user base.

Disadvantages:

  • Filters that would otherwise be simple become difficult to express as segment filters. HourOfTheWeek is a good example.
  • Breaks backwards compatibility for the agent configuration.
  • Requires that the segmentation be applied in the exchange connectors which breaks backwards compatibility (not a big deal at this stage).