Unnatural Categories Are Optimized for Deception

Followup to: Where to Draw the Boundaries?

There is an important difference between having a utility function defined over a statistical model's performance against specific real-world data (even if another mind with different values would be interested in different data), and having a utility function defined over features of the model itself.

Arbitrariness in the map doesn't correspond to arbitrariness in the territory. Whatever criterion your brain is using to decide which word you want, is your non-arbitrary reason ...

So the one comes back to you and says:

That seems wrong—why wouldn't I care about the utility of having a particular model? I agree that categories derive much of their usefulness from "carving reality at the joints"—that's one very important kind of consequence of choosing to draw category boundaries in a particular way. But other consequences might matter too, if we have some moral reason to value drawing our categories a particular way. I don't see why I shouldn't be willing to trade off one unit of categorizational nonawkwardness for $X$ units of morality, even if trading off a million units of categorizational nonawkwardness for the same $X$ units of morality would be bad.

I once read about an analogy between category boundaries and national borders. Imagine a diplomat trying to come up with a proposal for a two-state solution to the Israeli–Palestinian conflict. There's no such thing as the "correct" border between Israel and Palestine, but there are consequences of choosing one border or another. For example, awarding territory to one side risks angering the other. For another, if the West Bank and Gaza Strip are to be part of Palestine, but Tel-Aviv and the southern city of Eilat are to be part of Israel, then topology forces you to decide which of Israel and Palestine gets to be continuous, and which will be split into two parts. Since borders can't be "true" or "false", the diplomat's task is and can only be to weigh these kinds of trade-offs.

Analogously, I think of language, following Eliezer Yudkowsky's "A Human's Guide to Words", as being a human-made project intended to help people understand each other. It draws on the structure of reality, but has many free variables, so that the structure of reality doesn't constrain it completely. This forces us to make decisions, and since these are not about factual states of the world—what the definition of a word really is, in God's dictionary—we have nothing to make those decisions on except consequences.

... okay, I think I see the problem. I see how one might have gotten that out of "A Human's Guide to Words"—if you skipped all the parts with math. I am now prepared to explain exactly what's wrong here in more detail than my previous attempt: not just that this position is not in harmony with the hidden Bayesian structure of language and cognition, but how the hidden Bayesian structure of language and cognition explains why an intelligent system might find this particular mistake tempting in the first place, and what breaks as a result.

Category "boundaries" are a useful visual metaphor for helping explain the cognitive function of categorization. If you have the visualization but you don't have the math, you might think you have the freedom to "redraw" the category "boundaries". Simple, compact boundaries might tend to be more useful, but more complicated boundaries aren't false and therefore aren't forbidden if you have some non-epistemic reason to prefer them ... right?

Only in the sense that no hypothesis is "false"! Categories, words, correspond to hypotheses—probabilistic models that make predictions. If I see a dolphin in the water, and I say, "Hey, there's a dolphin!", and you understand me, that enables you to predict quite a lot about there being this-and-such kind of aquatic mammal with fins, a tail, &c. in the water.

This AI technology whereby the cause-and-effect evidential entanglement of photons bouncing off the dolphin hitting my eyes and causing me to emit sound waves hitting your eardrum, enabled you to make predictions about what's in the water, works because we happen to live in a world where the distribution of creatures has cluster-structure whereby dolphins have lots of things in common with each other, such that humans with a shared interest in communicating with each other learned a convention that maps the signal "dolphin" to a concept of dolphins in each human's mind.

But the dolphin concept/model/hypothesis is subject to the universal mathematical laws of reasoning under uncertainty. In particular, probability-mass flows between hypotheses: as long as you never assign a probability of zero (which is a log-odds of negative infinity), nothing you believe can ever be definitively (infinitely) "falsified"—it "just" makes worse predictions as compared to other hypotheses.
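To make the "never definitively falsified" point concrete, here is a minimal sketch in Python. The two-hypothesis setup and the 1:9 likelihood ratio are made up for illustration and are not taken from the factory example. Exact fractions are used so that floating-point underflow cannot masquerade as a probability of zero.

```python
from fractions import Fraction as F
from math import log2

def update(prior_h, likelihood_h, likelihood_alt):
    """One Bayesian update of P(H) against a single alternative hypothesis."""
    numerator = prior_h * likelihood_h
    return numerator / (numerator + (1 - prior_h) * likelihood_alt)

# Hypothetical numbers: each observation is 9 times likelier under the alternative.
p = F(1, 2)
for _ in range(10):
    p = update(p, F(1, 10), F(9, 10))

log_odds_bits = log2(p / (1 - p))
print(p, log_odds_bits)  # p is tiny but strictly positive; the log-odds are finite
```

After ten unfavorable observations the hypothesis has lost about 32 bits of log-odds, but it would take infinitely many bits to drive it to a probability of exactly zero.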

Because category "boundaries" are merely a visualization for a probabilistic model that makes predictions about the real world, you can't "redraw the boundaries" associated with a communication signal without messing with the model that generates them, which means messing with your predictions about the real world.

Might there be some non-epistemic reason for an agent to prefer a model that makes worse predictions? Sure! Correct maps are useful for steering reality into configurations ranked higher in your preference ordering—but causing a different agent to have incorrect maps might make them mis-navigate reality in a way that benefits you! We call this deception.

In a closely related phenomenon, a poorly-designed agent might get confused and end up manipulating its own beliefs: optimizing its map to inaccurately portray a high-value territory, rather than optimizing the territory to be high-value by using a map that reflects the territory—a kind of self-deception. We call this wireheading.

The laws of probability and information theory allow us to calculate how information can be efficiently encoded and transmitted from one place to another. Given some distribution of random variables, and some specification of what information about those variables you want to transmit, some encodings—some ways of "drawing" category "boundaries"—quantitatively perform better than others. Agents that want to communicate with each other will tend to invent or discover conventions that efficiently encode the information they're trying to communicate. Agents that communicate in ways that systematically depart from efficient encodings are better modeled as trying to deceive each other or wirehead themselves.

Let's walk through a simple example. Imagine that you have a peculiar job in a peculiar factory: specifically, you're a machine-learning engineer tasked with automating away the jobs of humans who sort objects from a mysterious conveyor belt.

Another engineer has already written a system that processes camera and sensor data about the objects into more convenient "features": color (measured on an eight-point blueness scale), shape (measured on an eight-point "eggness" scale), and vanadium content (a boolean Yes or No). Your task is to further process this information into a format suitable for giving commands to other systems—for example, the robot arm that will physically move the objects into appropriate bins.

The feature data consists of the blueness–eggness–vanadium-content joint distribution given by this 128-entry table:

This seems like ... not the most useful representation? The data is all there, so in principle, you could code whatever you needed to do based off the full table, but it seems like it would be an unmaintainable mess: you'd sooner resign than write a 128-case switch statement. Furthermore, when the system is deployed, you hope to typically be able to give the binning robot messages based on only the color and shape observations, because the Sorting Scanner that the vanadium readings come from is expensive to run. You could just do a Bayesian update on the entire joint distribution, of course, but it seems like it should be possible to be more efficient by exploiting regularities in the data, not entirely unlike how your colleague's system has already made your job much simpler by giving you blueness and eggness feature scores rather than raw camera data. Eyeballing the table, you notice it seems to have a lot of redundancy: most of the probability-mass is concentrated in two regions where the blueness and eggness scores are either both high or both low—and vanadium is only found when both blueness and eggness are high.

O tragedy O the stars! If only there were some more convenient and flexible way to represent this knowledge—some kind of deep structural insight to rescue you from this cruel predicament!

... alright, dear reader—I shouldn't patronize. You already know how this story ends. The distribution factorizes!

$$P(\mathrm{blueness}, \mathrm{eggness}, \mathrm{vanadium}) = \sum_{\mathrm{category}} P(\mathrm{category}) \cdot P(\mathrm{blueness}|\mathrm{category}) \cdot P(\mathrm{eggness}|\mathrm{category}) \cdot P(\mathrm{vanadium}|\mathrm{category})$$

(The distribution in this made-up toy example factorizes exactly, but in a messy real-world application, you might have a spectrum of approximate models to choose from.)
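As a sketch of what the factorization buys you, the following Python rebuilds a full 128-entry joint table from a handful of small factors. The category names and the 0.48/0.48/0.04 prior are the ones used in this post. The conditional tables are assumptions for illustration: the blegg tables are chosen to agree with the worked numbers in this post (for example, P(eggness = 7 | blegg) = 1/4), the rube tables are assumed to be their mirror image, the "other" class is assumed uniform over scores, and P(vanadium | other) is assumed to be 0.

```python
from fractions import Fraction as F
from itertools import product

# Prior over categories (0.48 / 0.48 / 0.04).
prior = {"blegg": F(12, 25), "rube": F(12, 25), "other": F(1, 25)}

# ASSUMED conditional tables over the 0-7 scores (see the caveats above).
high = [0, 0, 0, 0, 0, F(1, 4), F(1, 2), F(1, 4)]
low = high[::-1]
flat = [F(1, 8)] * 8
blueness = {"blegg": high, "rube": low, "other": flat}
eggness = {"blegg": high, "rube": low, "other": flat}
# P(vanadium = True | category); the value for "other" is an assumption.
vanadium = {"blegg": F(1), "rube": F(0), "other": F(0)}

def joint(b, e, v):
    """P(blueness=b, eggness=e, vanadium=v), summing the category out."""
    total = F(0)
    for c in prior:
        p_v = vanadium[c] if v else 1 - vanadium[c]
        total += prior[c] * blueness[c][b] * eggness[c][e] * p_v
    return total

table = {
    (b, e, v): joint(b, e, v)
    for b, e, v in product(range(8), range(8), (False, True))
}
print(len(table), sum(table.values()))
```

The table is derived from the factors rather than stored, so changing one conditional probability updates every affected cell consistently.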

We can simplify our representation of our observations by using a naïve Bayes model, a "star-shaped" Bayesian network where a central "category" node is posited to underlie all of our observations: we believe that each object either "is a blegg" (and therefore contains vanadium and has high blueness and eggness scores) with probability 0.48, "is a rube" (and therefore has no vanadium and low blueness and eggness scores) with probability 0.48, or belongs to a catch-all "other"/error class with probability 0.04. (Maybe the camera is buggy sometimes, or maybe there are some other random objects mixed in with the rubes and bleggs?)

Whereas the full joint distribution had 127 degrees of freedom (a table of $8 \cdot 8 \cdot 2 = 128$ separate probabilities, constrained to add up to 1), the naïve-Bayes representation only has 47 ($3 - 1 = 2$ free prior probabilities for the categories, plus $3 \cdot 7 = 21$ free parameters each for the blueness and eggness conditional tables and $3 \cdot 1 = 3$ for vanadium, because each conditional distribution is constrained to add up to 1). The advantage would be much larger for more complicated problems—the joint distribution table grows exponentially with more features, quickly becoming infeasible to store and represent, let alone learn.

It must be stressed that our "categories" here are a specific mathematical model that makes specific (probabilistic) predictions. Suppose we see a black-and-white photo of an egg-shaped object: specifically, one with an eggness score of 7. Given that observation of $\mathrm{eggness} = 7$, we can update our probabilities of category-membership.

$$P(\mathrm{category} = c | \mathrm{eggness} = 7) =$$

$$\frac{P(\mathrm{eggness} = 7 | \mathrm{category} = c) P(\mathrm{category} = c)}{\sum_{d \in \{\mathrm{blegg}, \mathrm{rube}, \mathrm{??}\}} P(\mathrm{eggness} = 7 | \mathrm{category} = d) P(\mathrm{category} = d)}$$

We think the egg-shaped object is almost certainly a blegg (specifically, with probability 0.96), even if the black-and-white photo doesn't directly tell us how blue it is, because

$$P(\mathrm{category} = \mathrm{blegg} | \mathrm{eggness} = 7) = \frac{\frac{1}{4} \cdot \frac{12}{25}}{\frac{1}{4} \cdot \frac{12}{25} + 0 \cdot \frac{12}{25} + \frac{1}{8} \cdot \frac{1}{25}} = \frac{24}{25} = 0.96$$

We can then use our updated beliefs about category membership (0.96 blegg/0 rube/0.04 unknown, as contrasted to the 0.48/0.48/0.04 prior) to get our updated posterior distribution on the 0–7 blueness score (0.005/0.005/0.005/0.005/0.005/0.245/0.485/0.245—left as an exercise for the reader).
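For readers who would rather check the exercise with Python than with pencil, here is a minimal sketch. The blegg and "other" tables are the ones implied by the numbers above (1/4 for a blegg at eggness 7, 1/8 for the error class, and a 1/4, 1/2, 1/4 spread over blueness 5–7). The rube table is an assumed mirror image, which does not affect this particular result because the rube posterior is zero.

```python
from fractions import Fraction as F

prior = {"blegg": F(12, 25), "rube": F(12, 25), "other": F(1, 25)}

high = [0, 0, 0, 0, 0, F(1, 4), F(1, 2), F(1, 4)]
low = high[::-1]            # ASSUMED shape for rubes
flat = [F(1, 8)] * 8        # error class: uniform over scores
eggness = {"blegg": high, "rube": low, "other": flat}
blueness = {"blegg": high, "rube": low, "other": flat}

def posterior_given_eggness(score):
    """Bayes' rule: P(category | eggness = score)."""
    unnormalized = {c: prior[c] * eggness[c][score] for c in prior}
    evidence = sum(unnormalized.values())
    return {c: w / evidence for c, w in unnormalized.items()}

def predictive_blueness(category_probs):
    """P(blueness = b | observation), marginalizing over the category."""
    return [
        sum(category_probs[c] * blueness[c][b] for c in category_probs)
        for b in range(8)
    ]

post = posterior_given_eggness(7)
print({c: float(p) for c, p in post.items()})
print([float(p) for p in predictive_blueness(post)])
```

The category node is doing all the work here: the eggness observation never touches the blueness table directly, only the category probabilities that sit between them.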

In addition to categories facilitating efficient probabilistic inference within the system that you're currently programming, labels for categories turn out to be useful for communicating with other systems. Maybe there's a robot arm in the Sorting room that puts bleggs in a blegg bin, which gets taken to a room elsewhere in the factory where there's sophisticated vanadium-ore-processing machinery that has to handle both bleggs and gretrahedrons.

Maybe the binning arm doesn't need to know about the blueness and eggness scores: it can close its claws around rubes and bleggs alike, and you only need to program it to pick up an object from a certain spot on the conveyor belt and place it into the correct bin. But maybe the vanadium-ore-processing machine does need to do further information processing before it can operate on an object—maybe it needs to vary its drill speed in proportion to the density of a particular blegg's flexible outer material (which it can estimate based on how brightly the blegg glows in the dark), but it uses a different drilling pattern for gretrahedrons.

If you need to send commands to the binning arm and the ore-processing machine, it's a more efficient communication protocol to just be able to send the 28-byte JSON payload {"object_category": "BLEGG"} and let the other machines do their work using their own models of bleggs, rather than having to send over the raw camera data plus the binary code of the Bayesian network and feature extractors that the feature-extraction and classifier systems initially used to identify bleggs.
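A small sketch of that protocol in Python, using only the standard `json` module. The function names `encode_category` and `decode_category` are hypothetical, invented here for illustration.

```python
import json

def encode_category(category: str) -> bytes:
    """Sender side: wrap the category label in the JSON message format."""
    return json.dumps({"object_category": category}).encode("utf-8")

def decode_category(message: bytes) -> str:
    """Receiver side: recover the label, then consult the receiver's own model."""
    return json.loads(message.decode("utf-8"))["object_category"]

message = encode_category("BLEGG")
print(message, len(message))
```

With the default separators of `json.dumps`, the encoded message is exactly the 28 bytes quoted above. Everything else the receiver knows about bleggs has to come from its own model.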

The {"object_category": "BLEGG"} message is a useful shorthand for "linking up" the models between different machines. Different machines might not use the same model: the classifier system uses blueness and eggness scores to identify bleggs, but the ore-processing machine, having been told that an object is a blegg, can take its blueness and eggness for granted and only needs to reason about its luminescence and vanadium content.

But this trick of using a signal to correlate the models between different machines only works because and insofar as both models are pointing to the same cluster-structure in reality. If the model in the classifier system doesn't meaningfully match the model in the ore-processing system—if the classifier code sends the BLEGG message given an object with eggness score between 5 and 7, but the ore-processor, upon receiving the BLEGG message, positions its drills in the expectation of processing an object with an eggness score between 0 and 2—then the factory doesn't work.

This is also, at an abstract high level, how human natural language works—how it has to work. There is an idea in my head. When I speak, my larynx creates pressure waves in the air. When the waves hit your eardrum and are decoded by your auditory cortex, that hopefully triggers a similar idea in your head. You don't have direct access to the computations going on in my head, but your brain can infer something about them from the noises I make, and that's what it means to speak and to listen in a reductionist universe.

I am not a cognitive scientist and I don't know the details of how language works in the brain. But I do know a little information theory and probability theory—enough to glimpse a bit of how the laws of mathematics constrain how language has to work, to the extent that it works.

In studying or explaining the math, I like to focus on simple examples with explicit probability distributions that I can do my own calculations for with pencil or Python. If I want to tell a story to go along with the math, I want to make the story about factory machines that I could actually program.

The actual implementation of natural language in human brains is going to be much more complicated, of course. Telling a story problem about computer programs controlling factory machines has two advantages. It is a simple illustration of the math that we can trust governs the more complicated real-world phenomenon, and it is less tempting to rationalize about factory machines than it is to think directly about language.

(As it is written of the tenth virtue of precision: even if you cannot do the math, knowing that the math exists tells you that the dance step is precise and has no room in it for your whims.)

Humans are designed to deceive each other—it's always tempting to speak in a way that propagates misinformation while retaining deniability that we weren't lying—it's the other guy's fault for misinterpreting what I really meant. When we think about designing messages for computer programs to give commands to each other about quantifiable observables, this excuse vanishes: if there's a bug in deterministic computer code such that the robot arm puts an object in the rube bin when it gets the BLEGG message, then that's what happens. There's no room to use the complexity of humans and their political games to obscure the behavior of the physical system and how it's processing information.

[TODO: more cleanly distinguish two reasons for example choice: simplicity, and sidestepping political intuitions. Link to "Toolbox-thinking and law-thinking": https://www.lesswrong.com/posts/CPP2uLcaywEokFKQG/toolbox-thinking-and-law-thinking ]

As a human learning math, it's helpful to examine multiple representations of the same mathematical object. We've already seen our blueness–eggness–vanadium model represented as a table, and factorized into a graphical model. We've also done some algebraic calculations with it. But we can also visualize it: the set of camera observations that the model classifies as a blegg with probability $\ge 0.96$ can be thought of as an area with a boundary in two-dimensional blueness–eggness space:
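One rough way to render that region is as a text grid. The sketch below assumes the same shape of conditional tables for both features (a 1/4, 1/2, 1/4 spread over scores 5–7 for bleggs, a mirror-image spread over 0–2 for rubes, and a uniform error class). Only the blegg and error-class numbers are pinned down by the worked example above; the rest is an illustrative assumption.

```python
from fractions import Fraction as F

prior = {"blegg": F(12, 25), "rube": F(12, 25), "other": F(1, 25)}
high = [0, 0, 0, 0, 0, F(1, 4), F(1, 2), F(1, 4)]
# ASSUMED: blueness and eggness share the same per-category score table.
score = {"blegg": high, "rube": high[::-1], "other": [F(1, 8)] * 8}

def p_blegg(b, e):
    """P(category = blegg | blueness = b, eggness = e) under the assumed tables."""
    weights = {c: prior[c] * score[c][b] * score[c][e] for c in prior}
    return weights["blegg"] / sum(weights.values())

THRESHOLD = F(24, 25)  # 0.96
region = {
    (b, e) for b in range(8) for e in range(8) if p_blegg(b, e) >= THRESHOLD
}

# Rows are eggness 7 down to 0; columns are blueness 0 to 7.
for e in range(7, -1, -1):
    print(e, "".join("#" if (b, e) in region else "." for b in range(8)))
```

Under these assumptions the region is a compact square in the high-blueness, high-eggness corner. The "boundary" is just the edge of the set of observations where the posterior clears the threshold, which is why it cannot be moved without changing the model.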

[TODO: highlight the significance of "with probability > p"—inside/outside the boundaries is a simplification that loses information if we're using a model where the classes generate overlapping observations in whatever subspace we're making observations in]

If you were trying to teach someone about the hidden Bayesian structure of language and cognition, but thought your audience was too stupid or lazy to understand the actual math, you might be tempted to skip the part about factorizing a joint distribution into a star-shaped Bayesian network and just talk about "drawing" "boundaries" in configuration space for human convenience, perhaps with a hokey metaphor about national borders. Then the audience might walk away with the idea that there's no reason not to replace the old blegg concept and its boring compact boundary, with a new blegg* concept that has an exciting squiggly border.

Alaska isn't even contiguous with the rest of the United States. If that's okay, why can't the borders of bleggness be a little squiggly?

Because the "national borders" metaphor is just a metaphor. It immediately breaks down as soon as you try to do any calculations.

When we say that the United States purchased Alaska from the Russian Empire, that means that this-and-such physical area on the Earth's surface went from being the territory of the Russian government, to being territory of the United States government, where land being the "territory of" a "government" is a complicated idea that has something to do with Schelling points over who gives orders to policemen and soldiers in that area.

When you reprogram your machine-learning system to send an {"object_category": "BLEGG"} message when it sees an object with an eggness score of 2 and a blueness score of 1, then your vanadium-ore-processing machine wears down its drill bits trying to process a rube.

Other than the fact that some aspects of both of these situations can be usefully visualized as changes to a two-dimensional diagram depicting an area with a boundary, what do these situations have to do with each other? They don't. Countries aren't Bayesian networks. They just aren't. When we depict a country on a map, we're not talking about a cognitive system that can use observations of latitude to estimate probabilities of country-membership and then use that distribution on country-membership to get an updated probability distribution on longitude. (I mean, given a world map, you could program such a thing, but it seems kind of useless—it's not clear why anyone would want that particular program.) Why would you expect to understand an AI-theory concept by telling a story about national borders?

So, that's what's wrong with the national-borders metaphor. But we haven't yet really explained the problem with "unnatural" categories—those that you would visualize as a squiggly, "gerrymandered" boundary. The squiggly blegg* boundary doesn't have the nice property of corresponding to the category labels in our nice factorized naïve Bayes model, but it still contains information. You can still do a Bayesian update on being told that an object lies within a squiggly boundary in configuration space. If that update eliminates half of your probability-mass, that's one information-theoretic bit, no matter how the category is shaped in Thingspace.

If you only care about how much probability you assign to the exact answer, then a bit is a bit. But if an approximate answer is approximately as good—if your answerspace has a metric on it, so that "approximate" can mean something—then some bits can be more valuable than others.

[TODO: rewrite based on better math]

Suppose some random variable $X$ is uniformly distributed on the set $\{1, 2, 3, 4, 5, 6, 7, 8\}$. You have the option of being told whether an observation $x$ is even or odd, or whether $x$ is greater or less than 4.5. Either way, you eliminate half of your hypotheses: the entropy of your probability distribution goes from $\log_2 8 = 3$ to $\log_2 4 = 2$. You've learned 1 bit.

But if you learn whether $x$ is even or odd, your mean squared error only goes down from 10.5 to 10, whereas if you learn whether $x$ is 1–4 or 5–8, your mean squared error plummets to 2.5. (The squared error has nicer mathematical properties than the absolute error.) By being compact, the "1–4 or 5–8" category system is much more useful for getting close to the right answer than the "even/odd" category system, even though they both provide the same amount of information about the exact answer.
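These figures can be checked with a few lines of Python. The sketch below reads "mean squared error" as the expected squared difference between a guess drawn from your current distribution and the true value, which works out to twice the variance. That reading is an assumption, but it reproduces the figures 10.5, 10, and 2.5 quoted above.

```python
from fractions import Fraction as F
from math import log2

def entropy_bits(outcomes):
    """Entropy of a uniform distribution over the given outcomes."""
    return log2(len(outcomes))

def expected_squared_error(outcomes):
    """E[(guess - truth)^2], guess and truth independent and uniform."""
    n = len(outcomes)
    return F(sum((g - t) ** 2 for g in outcomes for t in outcomes), n * n)

def error_after_partition(cells):
    """Average error once you are told which cell the true value is in."""
    total = sum(len(cell) for cell in cells)
    return sum(F(len(cell), total) * expected_squared_error(cell) for cell in cells)

everything = list(range(1, 9))
parity = [[1, 3, 5, 7], [2, 4, 6, 8]]
halves = [[1, 2, 3, 4], [5, 6, 7, 8]]

print(entropy_bits(everything), entropy_bits(parity[0]), entropy_bits(halves[0]))
print(float(expected_squared_error(everything)))
print(float(error_after_partition(parity)), float(error_after_partition(halves)))
```

Both partitions leave four equally likely candidates, so the entropy calculation cannot tell them apart. Only the error calculation, which uses the metric on the answer space, registers that the compact cells are more useful.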

[TODO: more on strictly proper scoring rules?!]

The same goes for natural categories vs. squiggly category "boundaries" in higher dimensions. For our blueness–eggness–vanadium distribution, your mean squared error before being told anything about an object is about 27.26 (with respect to Euclidean distance on blueness-score ✕ eggness-score ✕ 1-if-vanadium-present-else-0). On being told that an object is a blegg, your mean squared error plummets to about 0.46. On being told that an object is a blegg*, your mean squared error only goes down to about 4.13.

In this sense, the gerrymandered blegg* concept is quantitatively less informative than the original, compact blegg concept. The metric we assigned to blueness–eggness–vanadium space was our choice, and could depend on our values: if we simply don't care about predicting how blue an object is, we could disregard the blueness score and only define a concept on the eggness–vanadium subspace. Or if we don't care about predicting blueness very much, we could calculate our error score with respect to a metric that gave blueness very little weight.

But given a metric on the variables that you care about predicting and using to inform predictions, which categories are cognitively useful depends on the distribution of data in the world (not on your values). You can't define a word any way you want.

[TODO: rewrite cleaner code and double-check and footnote-hyperlink calculations; quote how the squared error changes with different metrics]

A caveat: the dependence on a choice of metric on configuration space—and really, a choice of the space—gives a sense in which optimal categories are value-laden, but it's a specific kind of lawful dependence between your values and the distribution of data in the world, not an atomic preference for using a particular encoding for its own sake.

The cognitive function of categorization is to group similar things together so that we can make similar decisions about them. A function measuring the extent to which things are "similar" has to take the things as input, but the extent to which things are decision-relevantly similar also depends on what you're trying to accomplish with your decisions, and that can be algorithmically complex. It might not be just a matter of only looking at some decision-relevant subspace of a natural, "obvious" configuration space that's available to all possible minds (like not caring what color your toothbrush handle is¹); the dimensions of the space you do your similarity-clustering in might themselves be complicated features (in the sense of machine learning) of which agents with different values would have no reason to logically pinpoint that particular criterion by which things may be judged. How you should define words depends on what you want, but that's not the same as defining words any way you want.

For example, poison isn't a natural category to a generic mind studying chemistry: we group cyanide and hemlock together as poison because we value human health, and so we want to have a category for scary chemicals that disrupt human metabolism, causing death or serious illness. But this determination depends on the intricate details of human biochemistry. (The theobromine in chocolate is okay for humans at typical doses, but potentially fatal to dogs, which are actually pretty similar to us in animalspace.) The compact category "boundary" that minimizes predictive error on human-healthspace, corresponds to a squiggly "boundary" in the chemicalspace you would be looking at if you've never seen a human and just want to make predictions about the chemicals themselves.

Or tiny molecular smileyfaces and real human smiles might be grouped together as similar as far as an image-classifier's curve detector is concerned, even if they're not similar as far as the abstracted idealized dynamic of human morality is concerned.

End caveat. The technical sense in which optimal categories can be value-laden doesn't alter the basic morals of our basic Bayesian philosophy of language. Your values can give you a particular configuration space and a metric on the space, but given that, sane agents want to "carve it at the joints" in order to get a communication system that minimizes predictive error. If you're trying to find an efficient encoding of your observations, there's no reason to want squiggly, gerrymandered categories in the decision-relevant space.

The one replies:

You're still not addressing my crux! I don't doubt what you say about minimizing prediction error with respect to some metric thingy. But what if that's not what I care about? My utility function assigns high value to using the squiggly blegg* category boundary—such that the utility of using my preferred category outweighs the disutility of making less accurate predictions. You can define a word any way you want—if you're willing to pay the costs.

So, what, you just intrinsically assign high utility to using the same communication signal to encode eggness-2/blueness-1 observations as eggness-6/blueness-6 observations, given the joint distribution specified in my story problem about sorting objects in a factory? Really?

"... yes!"

Okay, but where would that kind of exotic utility function come from? How would it arise naturally in an intelligent system?

There's a trivial sense in which you can interpret any action taken by an agent as being taken because the agent values taking that action. This theory is compatible with all possible behaviors and therefore explains nothing.

The value of decision-theoretic utility functions isn't that "Because utility!" serves as an all-purpose excuse for any possible behavior. It's that simple coherence desiderata imply that an agent's behavior should be describable as maximizing expected utility for some utility function—with corresponding constraints on the shape of that behavior.

Situations like the Allais paradox illustrate what these constraints look like. Consider an AI faced with playing the following game. There's a switch that can be turned On or Off, that starts out in the Off position.

At midnight, a coin is flipped. If the coin comes up Tails, the game ends. If the coin comes up Heads, then at a quarter past midnight, if the switch is Off, then the AI gets paid $100, and if the switch is On, a six-sided die is rolled, and the AI gets paid $110 if the die doesn't come up 6.

Suppose that, before midnight, the AI is willing to pay a dollar to flip the switch On (as if it thought that winning $110 with a probability of 5/12 is better than winning $100 with a probability of 1/2). Suppose the coin comes up Heads, and the AI is then willing to pay another dollar to flip the switch Off again (as if it thought that $100 with certainty is better than $110 with probability 5/6). Then the AI is two dollars poorer in exchange for the switch being in the same position it started in.
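A sketch in Python of why this pattern of switch-flipping cannot come from maximizing expected utility over money. For any utility function, the expected-utility gap between On and Off before midnight is exactly the probability of Heads times the gap after Heads, so the preferred switch position cannot change. The three utility functions below are arbitrary illustrations, and the one-dollar fees are left out of the utility calculation for simplicity.

```python
from fractions import Fraction as F

P_HEADS, P_NOT_SIX = F(1, 2), F(5, 6)

def eu_before(switch_on, u):
    """Expected utility before midnight, before the coin is flipped."""
    if switch_on:
        p_win = P_HEADS * P_NOT_SIX  # 5/12
        return p_win * u(110) + (1 - p_win) * u(0)
    return P_HEADS * u(100) + (1 - P_HEADS) * u(0)

def eu_after_heads(switch_on, u):
    """Expected utility once the coin has come up Heads."""
    if switch_on:
        return P_NOT_SIX * u(110) + (1 - P_NOT_SIX) * u(0)
    return u(100)

# Arbitrary illustrative utility functions over dollars won.
utilities = {
    "linear": lambda x: F(x),
    "concave": lambda x: F(x) / (x + 50),
    "convex": lambda x: F(x) ** 2,
}
gaps = {}
for name, u in utilities.items():
    before = eu_before(True, u) - eu_before(False, u)
    after = eu_after_heads(True, u) - eu_after_heads(False, u)
    gaps[name] = (before, after)
    print(name, float(before), float(after))

# The money pump itself: pay a dollar to flip On, then a dollar to flip Off.
switch, wealth = "Off", 0
wealth, switch = wealth - 1, "On"   # before midnight
wealth, switch = wealth - 1, "Off"  # after Heads
print(switch, wealth)
```

In each row the two gaps have the same sign, so no single utility function over money recommends On before the flip and Off after it. The agent in the story ends with the switch where it started and two dollars fewer.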

[TODO: sentence about violating the independence axiom]

If, for some reason, you specifically programmed the AI to prefer options it considers "certain", or to want switches to be "On" before midnight but "Off" after midnight, then it would be functioning as designed.

What we can say about such an AI, is that it's not coherently optimizing for acquiring money. (We say that a system is an optimizer if it systematically steers the future into configurations that rank higher with respect to some preference ordering. This helps us make predictions about what effects the system has, without having to model the details of how it brings those effects about.) A well-designed agent that was optimizing for acquiring money would be expected to obey the independence axiom.

If the AI playing this game isn't coherently optimizing for acquiring money, what is it optimizing for? To tell, we'd need to observe its behavior in different environments. If it is trying to acquire money but is just biased to prefer certainty (in violation of the von Neumann–Morgenstern axioms), then we'd expect it to make choices that result in money but continue to exhibit Allais-like glitches around gambles involving probabilities close to 1. If it just likes switches to be off after midnight, then we'd expect it to turn switches off at that time even if there's no gambling game going on.

This methodology for attributing goals to an agent—consider it to be "optimizing for" outcomes that it systematically achieves across a variety of environments—applies to the behavior of sending communication signals, just as it does to the behavior of flipping switches.

Back to the factory. Our classifier system that sends a BLEGG message when it gets camera data corresponding to the compact blegg concept is optimized for sending messages that allow other systems to minimize the squared error of their predictions of objects with respect to our standard metric on blueness–eggness–vanadium space. We don't intrinsically assign utility to using that particular category system; the category is the solution to an optimization problem about how to efficiently get blueness–eggness–vanadium information from one place to another.

A system that sends a BLEGG message when it gets camera data corresponding to the gerrymandered blegg* concept would be optimized for ... what? If you don't intrinsically assign utility to using that particular category system, then why would you program the system that way? What could possibly be the problem for which the gerrymandered category is an optimized solution?

Well. Suppose that, besides your day job as a machine-learning engineer, you also happen to own a side interest in the firm that supplies rubes and bleggs to this very factory. And suppose that vanadium fetches higher market prices than palladium, such that the factory is to pay the supplier $2 per blegg but only $1 per rube—and that the accounts-payable records are to be compiled based on how much the classifier you're currently programming sends BLEGG and RUBE messages, not how much metal actually gets harvested.

You can't help but notice that you stand to make more money if the system you're programming sends BLEGG messages more often. You can't just make it send BLEGG all the time—someone would notice and you'd get fired. But the ore-processing room can cope with a few suboptimally-sorted objects. Surely it's no big deal if you just ... adjusted the category boundary of BLEGG-ness a bit?

We saw earlier that the blegg concept does better than the blegg* concept with respect to mean squared error (given a metric on the feature space).

That's not the only possible scoring function. Suppose instead we score our category system by which one best minimizes the squared error minus [TODO: specify exact function after rewriting code] revenue to the supplier. [TODO: run the numbers; rewrite cleaner code and double-check and footnote-hyperlink calculations]

So with respect to that scoring function, the blegg* category "boundary" is preferable.

The one says:

But now it sounds like you're agreeing with me! The compact blegg category serves the factory owner's goals better, which you formalized in terms of minimizing average squared error. The squiggly blegg* boundary makes the factory perform less well, but it serves the moonlighting engineer's goals better, which you formalized in terms of minimizing squared error minus supplier revenue. There's no rule of rationality against the engineer programming the system using the blegg* category boundary if it suits their goals better.

Only in the sense that there's no rule of rationality against lying! Suppose I'm selling you some number of gold and silver bars, but you can't examine the metal yourself until later; you can only hope that the receipt I give you is accurate. Consider the following two scenarios.

In the first scenario, I lie: the receipt says I delivered 60 gold bars and 20 silver bars, but I actually delivered 40 gold bars and 40 silver bars. (Incidentally, the silver bars happen to all have odd serial numbers, but that's not important in this scenario.) You live in a low-trust world where lying is very common and contract enforcement isn't really a thing: about a third of the time someone claims something is gold, it turns out to be silver. So when you discover the fraud, you feel disappointed but not surprised: you would have preferred to get what you paid for, but you can't say you anticipated it.

In the second scenario, I tell the truth—with respect to a category system that suits my goals. The receipt says I delivered 60 gold bars and 20 silver bars—and I did. It's just that what I prefer to call "gold bars", you prefer to call "gold bars, or silver bars with odd serial numbers", and what I call "silver bars", you call "silver bars with even serial numbers". You know this, so when you examine the actual contents of the delivery, you're disappointed but not surprised: you would have preferred to transact under your definitions of 'gold' and 'silver', but you can't say you anticipated it.

We might question whether these are two different scenarios, or two descriptions of the same scenario: the same physical receipt, the same physical metal, the same buyer anticipations about the metal conditional on observing the receipt. If we just pay attention to the evidential entanglements instead of being confused by words, then there's no functional difference between saying "I reserve the right to lie p% of the time about whether something belongs to category C", and adopting a new, less-accurate category system that misclassifies p% of instances with respect to the old system.
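The equivalence can be checked by counting. The following is a minimal sketch, not a calculation from the original post: it encodes both receipts as lists of (actual metal, word on the receipt) pairs and computes how often a bar labeled with each word is really gold. The names `lying`, `redefined`, `seller_label`, and `p_actually_gold` are made up for illustration, and the split of the silver bars into 20 odd and 20 even serial numbers in the second scenario is an assumption implied by the 60/20 receipt.

```python
from fractions import Fraction

# Scenario 1: (actual metal, word on the receipt) for each of the 80 bars.
# The seller uses the ordinary categories but lies about 20 silver bars.
lying = (
    [("gold", "gold")] * 40
    + [("silver", "gold")] * 20
    + [("silver", "silver")] * 20
)


def seller_label(metal, serial_parity):
    """Scenario 2: the seller's redefined category system, applied honestly."""
    if metal == "gold" or serial_parity == "odd":
        return "gold"
    return "silver"


# Assumption: 20 of the 40 silver bars have odd serial numbers, which is what
# makes the receipt read 60/20. The parity of the gold bars does not matter.
bars = (
    [("gold", "even")] * 40
    + [("silver", "odd")] * 20
    + [("silver", "even")] * 20
)
redefined = [(metal, seller_label(metal, parity)) for metal, parity in bars]


def p_actually_gold(deliveries, word):
    """Fraction of bars labeled `word` on the receipt that are really gold."""
    labeled = [metal for metal, label in deliveries if label == word]
    return Fraction(labeled.count("gold"), len(labeled))


for word in ("gold", "silver"):
    print(word, p_actually_gold(lying, word), p_actually_gold(redefined, word))
```

In both scenarios, two thirds of the bars called "gold" are gold and none of the bars called "silver" are, so the receipt licenses exactly the same anticipations either way.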

Minimizing the squared error score is about map–territory correspondence: ways of communicating that help the factory machines make better predictions about the objects, get a higher score.

Minimizing the squared-error-minus-supplier-revenue score is a compromise between map–territory correspondence and saying whatever makes the supplier the most money.

The degree of compromise is quantitative: there's a continuum of possible scoring functions between "minimize squared error, only" (for which the naïve-Bayes categorizer is a good solution), and "maximize supplier revenue, only" (for which "always say BLEGG" is the optimal solution).
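To make the continuum concrete, here is a rough sketch under loudly stated assumptions. The exact scoring function is not pinned down above, so the weighted form `(1 - weight) * error - weight * revenue` is only one hypothetical way to interpolate between the two extremes. The data are invented: each object is reduced to a single "blueness" number, and the category boundary is a simple threshold rather than a naïve-Bayes categorizer. Only the $2 and $1 prices come from the text.

```python
# Hypothetical data: one "blueness" score per object (invented for illustration).
OBJECTS = [0.05, 0.10, 0.15, 0.20, 0.45, 0.80, 0.85, 0.90, 0.95]
BLEGG_PRICE, RUBE_PRICE = 2.0, 1.0  # dollars per message, as in the text


def blegg_count(threshold):
    """Number of objects reported as BLEGG (blueness at or above threshold)."""
    return sum(1 for x in OBJECTS if x >= threshold)


def evaluate(threshold):
    """Return (mean squared error, mean revenue) for a given boundary."""
    groups = (
        [x for x in OBJECTS if x >= threshold],  # reported as BLEGG
        [x for x in OBJECTS if x < threshold],   # reported as RUBE
    )
    squared_error = 0.0
    for group in groups:
        if group:
            guess = sum(group) / len(group)  # receiver's best guess given the message
            squared_error += sum((x - guess) ** 2 for x in group)
    revenue = BLEGG_PRICE * len(groups[0]) + RUBE_PRICE * len(groups[1])
    return squared_error / len(OBJECTS), revenue / len(OBJECTS)


def best_threshold(weight):
    """Boundary minimizing (1 - weight) * error - weight * revenue (assumed form)."""
    xs = sorted(OBJECTS)
    midpoints = [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    candidates = [0.0] + midpoints + [xs[-1] + 1.0]  # from all-BLEGG to all-RUBE

    def score(threshold):
        error, revenue = evaluate(threshold)
        return (1 - weight) * error - weight * revenue

    return min(candidates, key=score)


for weight in (0.0, 0.1, 1.0):
    print(weight, blegg_count(best_threshold(weight)))
```

With these made-up numbers, a weight of 0 reports only the four clearly blue objects as BLEGG, a small weight pulls the borderline object across the boundary, and a weight of 1 reports everything as BLEGG. The specific weights at which the boundary moves depend entirely on the invented data and the assumed scoring form.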

If always saying whatever profits you and not revealing any information about the territory is deception pure and simple, then the intermediate points on the continuum leading to it must be considered partially deceptive.

Depending on your goals, deception can be rational! If you don't care about other agents having accurate models and just want them to believe whatever makes them behave in a way that benefits you—or whatever makes them happy—then you can do that! There's no God to stop you. But in order to help you decide whether deceiving people is the right thing to do, it helps to notice that what you're doing is deceiving people.

It helps to notice what you're doing—if you're trying to be an agent that coherently steers the future in some direction. But who does that, really? Maybe you just want to feel good! And not even coherently steer the universe into configurations where you feel good, either!

Rational agents should want to have true beliefs: the map that reflects the territory, is the map that is useful for navigating the territory. But you don't—can't—have unmediated access to the world; you can only infer what the world is like from sensory data, and effectively live in your model of the world. Given the tricky indirection involved, it's not surprising that poorly-designed agents sometimes get confused and "wirehead" themselves: if you don't notice the difference, it's tempting to fabricate a fake map that falsely portrays the territory as being good, instead of making a map that reflects the territory (which you can use to figure out how to improve the territory).

Similarly, if you don't notice the difference, it's tempting to choose language that makes the world sound good, rather than to have your language accurately describe the world (which description you can use to figure out how to make the world better).

Suppose I want to be attractive. Attractiveness is a value-laden concept in the sense described in the earlier caveat: non-human agents would have no motive to evaluate that particular fixed computation. It's also a fuzzy concept: we don't have a simple membership test to precisely measure in standard units exactly how attractive someone is, but there's enough regularity in how people use the word "attractive" for the word to be a useful communication signal. It's also a two-place concept: what I find attractive isn't necessarily the same as what you find attractive (even if there's likely to be a lot of overlap as contrasted to an alien's analogue of attractiveness).

Given all these complications, one could imagine being tempted to think that attractiveness is "subjective", and that therefore I can define it any way I want, and that therefore, if I feel sad about not being "attractive", I can fix that by changing my definition of the word "attractive" such that it includes me. Because definitions can't be "false", right!? There's no rule of rationality prohibiting this boundary-redrawing project—and since I want so desperately to be "attractive", there's every rule of human decency in favor of it, right?!

So, this obviously doesn't work. (Okay, it "works" if you deliberately choose to define the word "work" such that it works, but it doesn't actually work.) The motion to redefine the word "attractive" came with the purported justification that words don't have intrinsic meanings, so it can't be "wrong" to redefine it. But precisely because words don't have intrinsic meanings, there's no reason to want to redefine an existing word, except to piggyback off the meaning people are already using that word for.

(Note that this, in itself, isn't necessarily deceptive. Sometimes, coining new senses of a word that piggyback off an existing meaning can be a powerful tool for extending our vocabulary to cover new phenomena that we don't already have words for—as long as we're careful to specify which meaning is intended when it's not clear from context.)

It's not plausible to suppose that I want to be "attractive" because I like ten-letter words that start with the letter a; I want to be attractive because of what that word already refers to in common usage. The redefinition might (or might not) succeed at making me feel better about myself, but if it does, it only works by means of confusing me: using algorithmically–strategic equivocation to arbitrage the hedonic gap between my new definition, and the old definition (which I still mentally associate with the word).

If it does succeed at making me feel better about myself, is the redefinition "rational"? Happiness is good, right? Should not rationalists win?

I do not frame an answer: that would depend on how you draw the category boundaries of "rational", which is not an interesting question. (As it is written of a virtue which is nameless: if you speak overmuch of the Way, you will not attain it.)

What I can say, however, is that choosing a suboptimal use of categories is not following a cognitive algorithm that uses a map that reflects the territory to systematically achieve goals across a wide range of environments. If there's anything I can do to become more attractive (going to the gym, buying new clothes, addressing deep-seated personality flaws?), I would seem less likely to notice and execute on such a plan after having sabotaged the concept I would need to notice the problem in the first place.

The map is not the territory ... but for real agents embedded in the physical universe, the map is part of the territory. This presents some complications to applications of our anti-wireheading moral. We don't want to wirehead ourselves by making the map look good at the expense of undermining our ability to navigate the territory—but there's no bright-line distinction demarcating which configurations of atoms are "the map". From the perspective of the eternal, it's all just territory.

In the previous post, we considered the case of an assembly line (well, sorting line) worker in the blegg–rube factory being excited about an ostensible promotion to the position of Vice President of Sorting—only to be aggrieved on finding out that it's a promotion literally in name only, with no changes in pay, authority, or work tasks.

If we interpret the title as part of "the map", a communication signal with the function of encoding information about the person's job, then we want to say that the new title is substantively misleading: when you hear that someone's job is being a "Vice President", you predict that their work involves managing people and making high-level executive decisions for the firm. Your probability that the "Vice President" has to spend all day moving objects from a conveyor belt into one of two bins based on the object's color and shape (a task that should probably be automated), is lower than before you heard the person's title: hearing the title made you update in the wrong direction.

But if we interpret the title as part of "the territory", a feature of the job itself, rather than a communication signal about the job—then it's not misleading and can't be misleading. The job happens to be one that has the symbols "Vice President" printed on the accompanying business cards and employee roster, much like how bleggs are objects that happen to be blue. You can't say the blue is "lying"; that doesn't make any sense!

The function of words is to serve as communication symbols, so it seems safe to say that language should usually be construed as part of "the map". Changing names and only names, without altering the things that the names refer to, as in the phony "Vice President" example, can only be intended to deceive. But for other features associated with a category, it may not always be obvious when we should construe them as "map" rather than "territory": using a feature to infer category-membership is formally equivalent to regarding it as a signal sent by senders of that category. Is that man pretending to be a doctor, or does he just happen to be wearing a lab coat?

The concept we're groping towards, and hoping to formulate an elegant reduction of, is that of mimicry. Suppose there is some existing category of entity, an original, typified by some cluster of traits. A mimic is an entity optimized to approximately match the distribution of the original in many, but not all traits, thereby being part of the same cluster as the original in some subspace of the space the original category is defined in, but not the space as a whole. For example, if the vector $[4, 4, 4, 4, 4] \in \mathbb{R}^5$ is the original, then $[4, 0, 0, 4, 4]$ would be a mimic: it looks the same if you project into the subspace spanned by $x_1$, $x_4$, and $x_5$.
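As a small illustration of the vector example, the sketch below compares the two vectors in the observed subspace and in the full space. The helper names `project` and `squared_distance` are hypothetical, and squared Euclidean distance is just one convenient way to measure "looking the same".

```python
ORIGINAL = [4, 4, 4, 4, 4]
MIMIC = [4, 0, 0, 4, 4]


def project(vector, coordinates):
    """Keep only the listed coordinates (1-based, to match x_1 ... x_5)."""
    return [vector[i - 1] for i in coordinates]


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


observed = [1, 4, 5]  # the subspace spanned by x_1, x_4, and x_5

# Indistinguishable in the observed subspace ...
print(squared_distance(project(ORIGINAL, observed), project(MIMIC, observed)))  # 0
# ... but far apart in the full five-dimensional space.
print(squared_distance(ORIGINAL, MIMIC))  # 32
```

An observer who only measures $x_1$, $x_4$, and $x_5$ sees no difference at all; the entire gap lives in the coordinates that were not measured.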

[TODO: rewrite to make it clear that 4,0,0,4,4 is being chosen on the basis of "looking like"]

We can find examples in nature. Suppose one type of butterfly has evolved to be toxic to a type of predator, and also has distinctive wing markings that function as an honest warning signal to that predator: this butterfly is not good to eat. This provides an "opportunity" (in evolutionary time) for a second species of butterfly to develop similar wing markings, so that predators will confuse it for the first type of butterfly, despite the second butterfly not paying the metabolic cost of producing toxins. This kind of situation is called Batesian mimicry.

Is Batesian mimicry deceptive? (In our usual functionalist sense, which is obviously not a claim about butterfly psychology.) Is the second butterfly's very existence a kind of lie?

In some sense, yes! The mimic butterfly has been optimized by evolution to look like the first butterfly because of the fitness payoff of being categorized by the predator as the first, toxic, kind of butterfly. The "recognized by the predator as toxic" category is a natural, compact region in wing-marking-space, but "comes apart" into two clusters in the broader wing-markings–actual-toxicity space.

Furthermore, the evolutionary dynamics create an asymmetric relationship between the two categories that isn't captured by just the two trait-clusters themselves. The reason for the mimic butterfly to have those particular wing-markings is in order to increase the predator's expected squared error on toxicity (which is learned from encounters with the original), so if the original's wing-markings were to change as a result of some new selection pressure, the mimic would be subjected to selection pressure to "keep up" by changing its wing-markings accordingly.

That's not true in the other direction: if the mimic's markings were to change, the original wouldn't "follow": rather, the original would benefit from the probabilistic strength of its warning signal not being parasitically diluted by the mimic anymore. Thus, the asymmetric terminology "original" and "mimic" is appropriate: it's not just that these two species happen to look like each other; one of them was there first, and the other looks like it.

Is mimicry always deceptive? Not necessarily—there might be some situations where the relevant variables are all among those on which the mimic matches the distribution of the original.

Suppose you and I are watching some ducks in the park. I say, "I love watching these ducks!"

You say, "Wrong! These aren't all ducks. This park is where a local inventor tests out his Anatidoid robots that are designed to look and act like ducks. Therefore, you can't say, 'I love watching these ducks'; you need to say 'I love watching these ducks and Anatidoid robots'."

"Wow, they're so realistic!" I say. "I can't even tell which ones are really robots! In fact," I continue, "since I can't tell, I'm inclined to just keep calling them all ducks; it would be pretty awkward to refer to each one as a duck-or-Anatidoid-robot."

"But it is possible to tell," you claim. "For example, if you get really close to one of the Anatidoid robots, and there's not a lot of ambient noise, you can hear the gears inside, turning."

"Okay," I say, "but I can't hear the gears from here. Since I have no way of telling the difference between ducks and Anatidoid robots without doing the more expensive evidence-gathering of cornering one in a quiet place, it makes sense for me to talk and think about the robots as being a kind of duck."

"But that's a lie! Ducks and Anatidoid robots may look and act similarly, but they're actually very different! Ducks are made of flesh and blood inside, eat organic matter, and are fated to die, whereas Anatidoid robots have a plastic interior, recharge wirelessly in the inventor's lab, and are immortal."

"Sure," I agree. "And if I were interacting with these entities in a context where I wanted to minimize the expected squared error of my predictions about their internal makeup, energy sources, or ultimate fate, then I would want to make that distinction. But I just want to watch some cool ducks in the park, and in the context of that activity, I only need to minimize the expected squared error of my predictions about appearance and behavior."

This is the origin of the famous duck test: if it looks like a duck, and quacks like a duck, and you can model it as a duck without making any grievous prediction errors, then it makes sense to consider it a member of the category duck in the range of circumstances where your model continues to perform well.

The features for which mimics fail to match the original need not be hidden (like gear sounds that you can't hear in a noisy park) in order for mimics to not be deceptive; they only need to be irrelevant in the context in which the category is being used. Squirt guns aren't guns—and are usually manufactured in unrealistic colors specifically to prevent being confused with real guns—but in the context of a water fight, the utterance "Don't point that gun at me" (without the privative adjective squirt) is understood perfectly well.

Nondeceptive mimicry is fragile, however: it works only in contexts where all the relevant features are ones where the mimic matches the original. Mimics that don't match the distribution of the original along relevant features are deceptive: agents that observe the mimic and assign it to the same mental category as the original on the basis of the matching features, will use that categorization to make predictions about unobserved but nonmatching features, and be wrong. And they'll be wrong because the mimic is optimized to "look like" the original (to match on many features).

[TODO: work in https://www.lesswrong.com/posts/4mEsPHqcbRWxnaE5b/typicality-and-asymmetrical-similarity ]

If different agents using a shared language disagree on what features are "relevant", they may have an incentive to fight about how rare and valuable short codewords should be defined in their common language, in order to exert control over what inferences and decisions agents using that language can easily make and coordinate on.

Consider the

[ https://slate.com/technology/2018/07/should-lab-grown-meat-be-called-meat.html ] [... vegan meat]

For these reasons it is written of the third virtue of lightness: you cannot make a true map of the category by drawing lines upon paper according to impulse; you must observe the joint distribution and draw lines on paper that correspond to what you see. If, seeing the category unclearly, you think that you can shift a boundary just a little to the right, just a little to the left, according to your caprice, this is just the same mistake.

And as it is written of a virtue which is nameless: perhaps your conception of rationality is that it is rational to believe the words of the Great Teacher, who lives in an area where claiming that the sky is blue would be political suicide.

And the Great Teacher says, "Some people I usually respect for their willingness to publicly die on a hill of facts, now seem to be talking as if color references are necessarily a factual statement about frequencies of light. But using language in a way you dislike, is not lying. You're not standing in defense of Truth if you insist on a word, brought explicitly into question, being used with some particular meaning." And you look up at the sky and see blue.

If you think: "It may look like the sky is blue, such that I'd ordinarily think that someone who said 'The sky is green' was being deceptive. But surely the Great Teacher wouldn't egregiously mislead people about the philosophy of language when being egregiously misleading happens to be politically convenient," you lose a chance to discover your mistake.

How will you discover your mistake? Not by comparing your description to itself.

But by comparing it to that which you did not name.

(Thanks to Jessica Taylor for discussion.)

Footnotes

This example isn't quite right—actually, most possible minds aren't going to have human-like color vision! But something like color vision—making inferences about objects based on what frequencies of light they reflect—is going to be pretty broadly useful. ↩
