This repository has been archived by the owner on Dec 3, 2019. It is now read-only.

Differential privacy of example. #1

Closed
frankmcsherry opened this issue Sep 15, 2017 · 6 comments

Comments

@frankmcsherry

frankmcsherry commented Sep 15, 2017

I may be misunderstanding, but the com.uber.engsec.dp.util.DPExample example you have doesn't seem to correspond to differential privacy.

object DPExample extends App {
  // Use the table schemas and metadata defined by the test classes
  System.setProperty("schema.config.path", "src/test/resources/schema.yaml")

  // example query: How many US customers ordered product #1?
  val query = """
    SELECT COUNT(*) FROM orders
    JOIN customers ON orders.customer_id = customers.customer_id
    WHERE orders.product_id = 1 AND customers.address LIKE '%United States%'
  """

  // query result when executed on the database
  val QUERY_RESULT = 100000

  // privacy budget
  val EPSILON = 0.1

  println(s"Query: $query")
  println(s"Private result: $QUERY_RESULT\n")

  (1 to 10).foreach { i =>
    val noisyResult = DPUtils.addNoise(query, QUERY_RESULT, EPSILON)
    println(s"Noisy result (run $i): %.0f".format(noisyResult))
  }
}

Obviously, it is hard to tell because QUERY_RESULT is just fabricated here, but if we are expected to view this as "run the query; imagine you got these results", it does not appear to have the property of differential privacy.

It looks like you are assuming that orders.customer_id has maximum multiplicity one. Otherwise I could have a customers relation with one customer, FRANK, and an orders relation with 100,000 orders, one for each of the Ted Cruz sex dolls in my collection. The reported results, all of which are tightly concentrated around 100,000, would reveal that I am present in the customers relation; had I withheld my record from customers, the answers would be tightly concentrated around zero.

I'm not sure how the rest of your pipeline works, but perhaps you could clean up the example or explain the rationale for why it has differential privacy for the stated query.


EDIT: Oho! You do have assumptions about the data, laid out in https://github.com/uber/sql-differential-privacy/blob/master/src/test/resources/schema.yaml. This would be super helpful to point out! :D What happens when these assumptions are violated?


EDIT2: It looks like in the orders relation the customer_id field has maxFreq: 100. So let's assume I order 100 Ted Cruz blow-up dolls. My presence or absence in customers will then change the output by +/-100, which it looks like would not be masked by your noise addition; when I run the example, it produces results like

Query: 
    SELECT COUNT(*) FROM orders
    JOIN customers ON orders.customer_id = customers.customer_id
    WHERE orders.product_id = 1 AND customers.address LIKE '%United States%'
  
Private result: 100000

Noisy result (run 1): 100006
Noisy result (run 2): 100008
Noisy result (run 3): 100000
Noisy result (run 4): 99999
Noisy result (run 5): 100001
Noisy result (run 6): 99999
Noisy result (run 7): 100007
Noisy result (run 8): 99992
Noisy result (run 9): 100018
Noisy result (run 10): 100002

These are clearly consistent with 100,000 and certainly not with 99,900, which is what the result would be without my hypothetical record. What am I missing?
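
EDIT3: To put numbers on it, here is a quick simulation. I am assuming the mechanism adds Laplace noise at scale 1/eps = 10, which is what the spread above suggests; the object and helper names below are mine, not the repo's.

object MaskingCheck extends App {
  val eps = 0.1
  val scale = 1 / eps // Laplace scale suggested by the spread above

  // Draw one Laplace(0, scale) sample by inverse transform sampling.
  def laplace(scale: Double): Double = {
    val u = scala.util.Random.nextDouble() - 0.5
    -scale * math.signum(u) * math.log(1 - 2 * math.abs(u))
  }

  // Counts with and without my hypothetical 100 orders:
  val withMe    = (1 to 5).map(_ => 100000 + laplace(scale))
  val withoutMe = (1 to 5).map(_ =>  99900 + laplace(scale))
  println(withMe.map(x => f"$x%.0f").mkString("with my record:    ", ", ", ""))
  println(withoutMe.map(x => f"$x%.0f").mkString("without my record: ", ", ", ""))

  // A shift of 100 against noise of scale 10 gives a likelihood ratio of
  // about exp(100 / 10) = exp(10), nowhere near the exp(0.1) that
  // differential privacy promises for one record's presence or absence.
}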

@noahj
Contributor

noahj commented Sep 19, 2017

This behavior is due to the isUnique flag on column customers.customer_id.

In short, the flag allows elastic sensitivity to use constraints of the data model in order to eliminate invalid candidates from the set of neighboring databases that must be considered in the sensitivity calculation. We describe this optimization in our paper in Section 3.6 ("Using Data Models for Tighter Bounds on Sensitivity").
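
In caricature, the effect on this query looks like the sketch below. This is just the shape of the bound, not the actual elastic-sensitivity computation (which also applies smoothing), and the names are illustrative:

object UniqueKeySketch extends App {
  val eps = 0.1
  val maxFreqOrdersCustomerId = 100.0 // bound on orders.customer_id in schema.yaml

  // Without the uniqueness constraint, a neighboring database may add or
  // remove a customer whose key matches up to maxFreq order rows, so one
  // customers row can move the count by up to 100. With isUnique: true,
  // those neighbors are ruled out and the bound collapses to 1.
  def sensitivityBound(customersKeyUnique: Boolean): Double =
    if (customersKeyUnique) 1.0 else maxFreqOrdersCustomerId

  println(s"isUnique: true  -> Laplace scale ~ ${sensitivityBound(true) / eps}")  // ~10
  println(s"isUnique: false -> Laplace scale ~ ${sensitivityBound(false) / eps}") // ~1000
}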

If you change the flag isUnique: true to maxFreq: 1, you’ll see the behavior you expect -- namely, the addition of enough noise to protect the user who purchased 100 dolls:

Query:
   SELECT COUNT(*) FROM orders
   JOIN customers ON orders.customer_id = customers.customer_id
   WHERE orders.product_id = 1 AND customers.address LIKE '%United States%'
 
Private result: 100000

Noisy result (run 1): 97256
Noisy result (run 2): 102718
Noisy result (run 3): 99308
Noisy result (run 4): 96859
Noisy result (run 5): 103095
Noisy result (run 6): 102176
Noisy result (run 7): 102561
Noisy result (run 8): 101989
Noisy result (run 9): 100635
Noisy result (run 10): 99848

@frankmcsherry
Author

frankmcsherry commented Sep 19, 2017

Thanks! I appreciate the information. I'm reading your text in section 3.6, and I still can't square it with differential privacy:

Unique join keys in private tables. Many private tables have primary key columns that will never contain duplicates, making many neighboring databases impossible. For example, in the users table, each user corresponds to a unique ID. In the trips table, however, a user ID may appear multiple times—once for each trip.
A naive (but correct) sensitivity bound for a relation joining these two tables considers a neighboring database in which the users table contains a duplicate user, resulting in duplication of each of that user’s trips in the output of the join. Our data model optimization recognizes that this neighboring database violates data integrity conditions and therefore cannot occur.

In this case I'm not worried about an adjacent dataset with duplicate customer_id values; I'm worried about the difference between adjacent datasets with and without a record (mine), or with the one record present but its data changed.

For example, let's think about how well the fact that my address field contains "United States" is protected. If I move from my current location in the US to somewhere in Europe, the count of the query above will change from 100,000 to 99,900. How does the "uniqueness" of the key signify that my address (and the other fields in the customers relation) should be less protected?

@noahj
Contributor

noahj commented Sep 20, 2017

The optimization here actually requires slightly stronger assumptions: that the key is unique and the database is append-only. The scenario you described therefore falls outside the set of valid adjacent databases.

This variant of the optimization was informed by several production systems at Uber with exactly these properties (e.g. logging systems, event streams), where records are only added and each protected record is identified by a unique key. Our code implements a flag for enabling this optimization in those situations where it applies; when enabled, it effectively allows analysts to join on unique keys from these records without being penalized for it.

The example query is admittedly not an ideal showcase for the optimization -- as you correctly point out, in a typical setting customers may be added or deleted, and the optimization would thus be turned off in a real deployment. To avoid confusion we'll remove the optimization in the example code and rename the flag to clarify the assumptions being made.

@frankmcsherry
Author

Thanks again!

I'm afraid this still doesn't line up with differential privacy. If I submit my data with address United States and then tell the IRS that I've been in Europe for years, I don't care whether Uber's database is append-only or not. I care about the fact that a hypothetical alternate database, in which my address was Europe, would produce results distinguishable from what you actually produce, and would thereby inform the IRS that Uber thinks my address is in the US.

The "adjacent" datasets you need to consider involve record removal as well as introduction. "Adjacent" does not mean "datasets that may result from further interaction", it means "hypothetical alternatives that I might worry you will distinguish between to others".

Furthermore, differential privacy doesn't really have a provision for "assumptions about the data", such as a maximum multiplicity for certain keys. I understand that you might have those constraints in your production systems, but if you simply ignore datasets violating these constraints you end up with a weaker privacy guarantee than differential privacy.

In particular, you lose the "large step" corollary that

Pr[M(A) = x] <= exp(eps ||A - B||) * Pr[M(B) = x]

If your mechanism doesn't satisfy this property, it doesn't have differential privacy.
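
Concretely, for the address-change example above (the Laplace scale of 10 is my inference from the first batch of noisy results; the arithmetic is mine, not the repo's):

object LargeStepCheck extends App {
  val eps = 0.1
  val distAB     = 2.0   // changing my address swaps one customers record
  val countShift = 100.0 // ...but moves the count by my 100 orders
  val noiseScale = 1 / eps // scale 10, as the first run's spread suggests

  val bound  = math.exp(eps * distAB)            // what the corollary allows
  val actual = math.exp(countShift / noiseScale) // Laplace density ratio
  println(f"corollary allows <= $bound%.2f, mechanism exhibits ~ $actual%.0f")
}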

I think it is a fair question to ask about the topology of the space of datasets, and to allow the distance between two datasets to be larger than their symmetric difference if the shortest path between them goes through infeasible space, but you would be getting pretty exotic here, and differential privacy is pretty bound up in symmetric difference.

What you can do, which is what PINQ does, is to have the computations enforce the constraints. From arbitrary data, you can take the first 100 orders for each customer, ensuring that there are at most 100 orders per customer. However, this introduces a factor of 2 in the privacy analysis (adding or removing one order in the input can result in both the addition and the removal of an order in the truncated output).
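
A sketch of that enforcement step, with names and types of my own invention:

object TruncationSketch extends App {
  final case class Order(customerId: Int, productId: Int)

  // Keep at most maxPerCustomer orders per customer, so the multiplicity
  // bound holds for arbitrary input data rather than being assumed.
  def truncate(orders: Seq[Order], maxPerCustomer: Int): Seq[Order] =
    orders.groupBy(_.customerId).values.flatMap(_.take(maxPerCustomer)).toSeq

  // Adding or removing one input order changes the truncated output by at
  // most one addition and one removal (a new order may displace another),
  // which is where the factor of 2 in the privacy analysis comes from.
  val demo = Seq.tabulate(250)(i => Order(customerId = i % 2, productId = i))
  println(truncate(demo, maxPerCustomer = 100).size) // 200
}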

@noahj
Contributor

noahj commented Oct 16, 2017

Thanks for the clarification. This feature has been removed in 702578a.

noahj closed this as completed Oct 16, 2017
@frankmcsherry
Author

Cool, thanks. Don't forget that the assumed maxFreq upper bounds are in conflict with differential privacy too. I'm not clear on how you resolve them in a data-agnostic way, but I'm happy to let you all take a swing at that before trying things again.
