# Part 6: Build an Encrypted, Decentralized Database

In the last section (Part 5), we learned about the basic tools PySyft supports for encrypted computation. In this section, we're going to give one example of how to use those tools to build an encrypted, decentralized database. 

# Encrypted

The database will be encrypted because BOTH the values in the database will be encrypted AND all queries to the database will be encrypted.

# Decentralized

The database will be decentralized because, using SMPC, all values will be "shared" amongst a variety of owners, meaning that all owners must agree to allow a query to be performed. It has no central "owner".
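
To make "shared amongst a variety of owners" concrete, here is a minimal plain-Python sketch of additive secret sharing. It is an illustration only, not PySyft's implementation: the `FIELD` modulus and the helper names `share`, `reconstruct`, and `add_shared` are assumptions made for this example.

```python
import secrets

# Hypothetical modulus for illustration; PySyft's actual field size may differ.
FIELD = 2**31 - 1

def share(value, n_owners):
    """Split an integer into n_owners additive shares modulo FIELD."""
    shares = [secrets.randbelow(FIELD) for _ in range(n_owners - 1)]
    last = (value - sum(shares)) % FIELD
    return shares + [last]

def reconstruct(shares):
    """Recombine shares; every owner's share is required."""
    return sum(shares) % FIELD

def add_shared(xs, ys):
    """Add two shared values share-by-share, without reconstructing either."""
    return [(x + y) % FIELD for x, y in zip(xs, ys)]

secret_shares = share(42, 3)          # e.g. one share each for bob, alice, bill
total = add_shared(secret_shares, share(8, 3))
print(reconstruct(secret_shares))     # 42
print(reconstruct(total))             # 50
```

Each individual share is a uniformly random number, so no single owner learns anything about the value on its own. This is why every owner has to take part before a result can be revealed.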

# The Schema:

While we could construct a variety of database types, for this first tutorial we're going to focus on a simple key-value store, where both the keys and values are strings.


Authors:
- Andrew Trask - Twitter: [@iamtrask](https://twitter.com/iamtrask)

In [1]:
import syft as sy
hook = sy.TorchHook()

bob = sy.VirtualWorker(id="bob")
alice = sy.VirtualWorker(id="alice")
bill = sy.VirtualWorker(id="bill")

# Section 1: Constructing a Key System

In this section, we're going to show how to use the equality operation to build a simple key system. The only tricky part is choosing the datatype we want to use for keys. The most common use case is probably strings, so that's what we're going to use here.

Now, one thing you'll notice about our SMPC techniques is that they operate exclusively on numbers. Thus, we have an issue: we need to decide how to encode our strings as numbers so that we can query them efficiently as "keys". The fastest way is to map every key to an integer hash and then key on that. (The hash is not guaranteed to be unique, since two different strings can collide after the modulo, but for a tutorial-sized database this is unlikely.) Let's use that approach.

In [2]:
# Note that sy.mpc.securenn.field is the max value that we can encode using SMPC by default
# This is, however, somewhat configurable in the system.
def string2key(input_str):
    return sy.LongTensor([hash(input_str) % sy.mpc.securenn.field])

In [3]:
string2key("hello")


 7.3583e+08
[syft.core.frameworks.torch.tensor.LongTensor of size 1]

In [4]:
string2key("world")


 1.4739e+09
[syft.core.frameworks.torch.tensor.LongTensor of size 1]
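
One caveat: Python's built-in `hash()` is salted per process for strings by default, so the same key can hash to a different integer in a later session. Below is a minimal sketch of a deterministic alternative. The `FIELD` constant is a hypothetical stand-in for `sy.mpc.securenn.field`, and `stable_key_int` is a helper name invented for this example.

```python
import hashlib

FIELD = 2**31 - 1  # hypothetical stand-in for sy.mpc.securenn.field

def stable_key_int(input_str, field=FIELD):
    """Map a string to an integer in [0, field) that is stable across runs."""
    digest = hashlib.sha256(input_str.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % field

print(stable_key_int("hello") == stable_key_int("hello"))  # True
```

Distinct strings can still collide after the modulo, so a production key system would need a strategy for handling collisions.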

# Section 2: Constructing a Value Storage System

We can now convert our string "keys" to integers for our database, but we still need to figure out how to encode the values in our database using numbers as well. For this, we're going to simply encode each string as a list of numbers, one integer per character, like so.

In [8]:
import string
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [9]:
import string
char2int = {}
int2char = {}
for i, c in enumerate(' ' + string.ascii_letters + '0123456789' + string.punctuation):
    char2int[c] = i
    int2char[i] = c

In [10]:
def string2values(input_str):
    values = list()
    for char in input_str:
        values.append(char2int[char])
    return sy.LongTensor(values)

def values2string(input_values):
    s = ""
    for v in input_values:
        s += int2char[int(v)]
    return s

In [11]:
vs = string2values("hello world")
vs


  8
  5
 12
 12
 15
  0
 23
 15
 18
 12
  4
[syft.core.frameworks.torch.tensor.LongTensor of size 11]

In [12]:
values2string(vs)

'hello world'

# Section 3: Creating the Tensor Based Key-Value Store

Now for our next operation, we want to write some logic which will allow us to query this database using ONLY addition, multiplication, and comparison operations. For this we will use a simple strategy. 

The database will be a list of integer keys and a list of integer arrays (values).

In [13]:
keys = list()
values = list()

To add a value to the database, we'll just add its key and value to the lists.

In [14]:
def add_entry(string_key, string_value):
    keys.append(string2key(string_key))
    values.append(string2values(string_value))

In [15]:
add_entry("Bob","(123) 456-7890")
add_entry("Bill", "(234) 567-8901")
add_entry("Sue","(345) 678-9012")

In [16]:
keys

[
  1.8915e+09
 [syft.core.frameworks.torch.tensor.LongTensor of size 1], 
  1.2644e+09
 [syft.core.frameworks.torch.tensor.LongTensor of size 1], 
  1.2281e+09
 [syft.core.frameworks.torch.tensor.LongTensor of size 1]]

In [17]:
values

[
  70
  54
  55
  56
  71
   0
  57
  58
  59
  75
  60
  61
  62
  53
 [syft.core.frameworks.torch.tensor.LongTensor of size 14], 
  70
  55
  56
  57
  71
   0
  58
  59
  60
  75
  61
  62
  53
  54
 [syft.core.frameworks.torch.tensor.LongTensor of size 14], 
  70
  56
  57
  58
  71
   0
  59
  60
  61
  75
  62
  53
  54
  55
 [syft.core.frameworks.torch.tensor.LongTensor of size 14]]

# Section 4: Querying the Key->Value Store

Our query will proceed in four steps:

- 1) Check for equality between the query key and every key in the database, returning a 1 or 0 for each row. We'll call each row's result its "key_match" integer.

- 2) Multiply each row's "key_match" integer by all the values in its corresponding row. This will zero out all rows in the database which don't have matching keys.

- 3) Sum all the masked rows in the database together. 

- 4) Return the result.

In [18]:
# this is our query
query = "Bob"

# convert our query to a hash
qhash = string2key(query)
qhash[0]

1891469763

In [19]:
# see if our query matches any key
key_match = list()
for key in keys:
    key_match.append((key == qhash).long())
key_match

[
  1
 [syft.core.frameworks.torch.tensor.LongTensor of size 1], 
  0
 [syft.core.frameworks.torch.tensor.LongTensor of size 1], 
  0
 [syft.core.frameworks.torch.tensor.LongTensor of size 1]]

In [20]:
# Multiply each row's value by its corresponding keymatch
value_match = list()
for i, value in enumerate(values):
    value_match.append(key_match[i].expand(value.shape) * value)

In [21]:
# sum the values together
final_value = value_match[0]
for v in value_match[1:]:
    final_value = final_value + v

In [22]:
# Decipher final value
values2string(final_value)

'(123) 456-7890'

# Section 5: Putting It Together

Here's what this logic looks like when put together in a simple database class.

In [23]:
import string

char2int = {}
int2char = {}

for i, c in enumerate(' ' + string.ascii_letters + '0123456789' + string.punctuation):
    char2int[c] = i
    int2char[i] = c

def string2key(input_str):
    return sy.LongTensor([hash(input_str) % sy.mpc.securenn.field])

def string2values(input_str):
    values = list()
    for char in input_str:
        values.append(char2int[char])
    return sy.LongTensor(values)

def values2string(input_values):
    s = ""
    for v in input_values:
        s += int2char[int(v)]
    return s

class TensorDB:

    def __init__(self):
        self.keys = list()
        self.values = list()
        
    def add_entry(self, string_key, string_value):
        self.keys.append(string2key(string_key))
        self.values.append(string2values(string_value))
        
    def query(self, str_query):
        # hash the query string
        qhash = string2key(str_query)
        
        # see if our query matches any key
        key_match = list()
        for key in self.keys:
            key_match.append((key == qhash).long())

        # Multiply each row's value by its corresponding keymatch
        value_match = list()
        for i,value in enumerate(self.values):
            value_match.append(key_match[i].expand(value.shape) * value)
            
        # sum the values together
        final_value = value_match[0]
        for v in value_match[1:]:
            final_value = final_value + v
            
        # Decipher final value
        return values2string(final_value)

In [24]:
db = TensorDB()

In [25]:
db.add_entry("Bob","(123) 456-7890")
db.add_entry("Bill", "(234) 567-8901")
db.add_entry("Sue","(345) 678-9012")

In [26]:
db.query("hey")

'              '

In [27]:
db.query("Bob")

'(123) 456-7890'

In [28]:
db.query("Bill")

'(234) 567-8901'

In [29]:
db.query("Sue")

'(345) 678-9012'
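
To see why `db.query("hey")` above comes back as a run of spaces rather than an error, here is a small plain-Python sketch of the mask-and-sum step. Plain lists stand in for tensors, and `masked_sum` is a helper name made up for this illustration.

```python
import string

alphabet = ' ' + string.ascii_letters + '0123456789' + string.punctuation
int2char = dict(enumerate(alphabet))

def masked_sum(masks, rows):
    """Multiply each row by its 0/1 mask, then add the rows element-wise."""
    width = len(rows[0])
    return [sum(m * row[j] for m, row in zip(masks, rows)) for j in range(width)]

rows = [[8, 9], [2, 15]]              # two stored values, already encoded
hit = masked_sum([0, 1], rows)        # second key matches
miss = masked_sum([0, 0], rows)       # no key matches
print(repr(''.join(int2char[v] for v in hit)))   # 'bo'
print(repr(''.join(int2char[v] for v in miss)))  # '  '
```

When no key matches, every row is zeroed out, and index 0 decodes to the space character. The same arithmetic also shows why a key collision would be harmful: two rows with a mask of 1 would be added together, producing a garbled value.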

# Section 6: Building an Encrypted, Decentralized Database

Now, the interesting thing here is that we have not used a single operation other than addition, multiplication, and comparison (equality). Thus, we can trivially create an encrypted database by simply encrypting all of our keys and values!
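
Addition on secret-shared values can be done share-by-share, but multiplication needs some extra help. One standard textbook technique is a Beaver triple, sketched below in plain Python. This is a generic illustration, not a description of PySyft's internals. The `FIELD` modulus and all function names are assumptions, and private equality (a harder protocol) is left out.

```python
import secrets

FIELD = 2**31 - 1  # hypothetical modulus for illustration

def share(value, n):
    """Split an integer into n additive shares modulo FIELD."""
    shares = [secrets.randbelow(FIELD) for _ in range(n - 1)]
    return shares + [(value - sum(shares)) % FIELD]

def reconstruct(shares):
    return sum(shares) % FIELD

def beaver_multiply(x_shares, y_shares):
    """Multiply two shared values using a one-time triple (a, b, a*b)."""
    n = len(x_shares)
    # In a real deployment a trusted dealer or an offline protocol supplies the triple.
    a, b = secrets.randbelow(FIELD), secrets.randbelow(FIELD)
    a_sh, b_sh, c_sh = share(a, n), share(b, n), share(a * b % FIELD, n)
    # The owners open only the masked differences d = x - a and e = y - b.
    d = reconstruct([(x - s) % FIELD for x, s in zip(x_shares, a_sh)])
    e = reconstruct([(y - s) % FIELD for y, s in zip(y_shares, b_sh)])
    # Each owner computes its share of x*y locally.
    z = [(c + d * bs + e * as_) % FIELD for c, bs, as_ in zip(c_sh, b_sh, a_sh)]
    z[0] = (z[0] + d * e) % FIELD  # the public d*e term is added exactly once
    return z

product = beaver_multiply(share(6, 3), share(7, 3))
print(reconstruct(product))  # 42
```

The opened values `d` and `e` are masked by the random `a` and `b`, so they reveal nothing about `x` or `y`. Every multiplication consumes a fresh triple and a round of communication, which is part of why queries over shared data are slow.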

In [30]:
import string

char2int = {}
int2char = {}

for i, c in enumerate(' ' + string.ascii_letters + '0123456789' + string.punctuation):
    char2int[c] = i
    int2char[i] = c

def string2key(input_str):
    return sy.LongTensor([(hash(input_str)+1234) % int(sy.mpc.securenn.field)])

def string2values(input_str):
    values = list()
    for char in input_str:
        values.append(char2int[char])
    return sy.LongTensor(values)

def values2string(input_values):
    s = ""
    for v in input_values:
        if(int(v) in int2char):
            s += int2char[int(v)]
        else:
            s += "."
    return s


class DecentralizedDB:
    
    def __init__(self, *owners):
        self.owners = owners
        self.keys = list()
        self.values = list()
        
    def add_entry(self, string_key, string_value):
        key = string2key(string_key).share(*self.owners)
        value = string2values(string_value).share(*self.owners)
        
        self.keys.append(key)
        self.values.append(value)
        
    def query(self, str_query):
        # hash the query string
        qhash = sy.LongTensor([string2key(str_query)])
        qhash = qhash.share(*self.owners)
        
        # see if our query matches any key
        key_match = list()
        for key in self.keys:
            key_match.append((key == qhash))

        # Multiply each row's value by its corresponding keymatch
        value_match = list()
        for i, value in enumerate(self.values):
            shape = list(value.get_shape())
            km = key_match[i]
            expanded_key = km.expand(1,shape[0])[0]
            value_match.append(expanded_key * value)
            
        # sum the values together
        final_value = value_match[0]
        for v in value_match[1:]:
            final_value = final_value + v
        
        result = values2string(final_value.get())
        
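        # there is a certain element of randomness
        # which can cause the database to return a garbled result
        # (values2string renders unknown codes as "."), so if this
        # happens, just try again.
        # NOTE: this check looks at one arbitrary element of the set,
        # so a partly garbled result can slip through without a retry
        if(list(set(result))[0] == '.'):
            return self.query(str_query)
            
        # Decipher final value
        return result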

In [31]:
db = DecentralizedDB(bob, alice)
db.add_entry("Bob","(123) 456-7890")
db.add_entry("Bill", "(234) 567-8901")
db.add_entry("Sam","(345) 678-9012")

In [32]:
db.query("Bob")

'..... ........'

In [33]:
db.query("Bill")

'(234) 567-8901'

In [34]:
db.query("Sam")

'(345) 678-9012'

### Success!!!

And there you have it! We now have a key-value store capable of storing string keys and values in an encrypted, decentralized state, such that even the queries are private/encrypted.

# Section 7: Increasing Performance


### Strategy 1: One-hot Encoded Keys

As it turns out, comparisons (like ==) can be very expensive to compute, which makes the query take a long time. Thus, we have another option: we can encode our key strings using one-hot encodings. This lets our database query avoid comparisons entirely and use only multiplication and addition, as the updated `DecentralizedDB` class further below shows.
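
Here is a minimal plain-Python sketch of the one-hot idea, with lists standing in for shared tensors. The helper names `encode` and `key_match` and the 26-letter `alphabet` are assumptions for this illustration, and both keys are assumed to have the same length.

```python
def one_hot(index, length):
    vec = [0] * length
    vec[index] = 1
    return vec

def encode(key, alphabet):
    """One one-hot row per character of the key."""
    return [one_hot(alphabet.index(ch), len(alphabet)) for ch in key]

def key_match(stored, query):
    """Return 1 if every character position matches, else 0, using only * and +."""
    match = 1
    for stored_row, query_row in zip(stored, query):
        # Dot product of two one-hot rows: 1 if same character, 0 otherwise.
        position_match = sum(s * q for s, q in zip(stored_row, query_row))
        match = match * position_match
    return match

alphabet = "abcdefghijklmnopqrstuvwxyz"
print(key_match(encode("bob", alphabet), encode("bob", alphabet)))  # 1
print(key_match(encode("bob", alphabet), encode("sam", alphabet)))  # 0
```

A single mismatched position contributes a 0 to the running product, so the final result is 1 only when every character agrees.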

### Strategy 2: Fixed Length Values
By using fixed-length values, we can encode the whole database as a single tensor, which lets the underlying hardware work a bit faster.
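
As a rough sketch of what Strategy 2 buys us: once every value is padded or truncated to the same length, the values stack into one matrix, and selecting a row becomes a single vector-matrix product. The helpers `pad_or_truncate` and `select_row` are hypothetical names, and plain lists stand in for tensors.

```python
def pad_or_truncate(codes, length, pad_code=0):
    """Force a list of integer codes to exactly `length` entries."""
    return (codes + [pad_code] * length)[:length]

def select_row(masks, matrix):
    """Vector-matrix product: sum over rows of mask[i] * matrix[i][j]."""
    return [sum(m * row[j] for m, row in zip(masks, matrix))
            for j in range(len(matrix[0]))]

values = [[8, 9], [8, 5, 12, 12, 15], [2, 25, 5]]
matrix = [pad_or_truncate(v, 4) for v in values]
print(matrix)                          # [[8, 9, 0, 0], [8, 5, 12, 12], [2, 25, 5, 0]]
print(select_row([0, 0, 1], matrix))   # [2, 25, 5, 0]
```

The trade-off is that values longer than the fixed length are silently truncated, as the second row shows.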

In [256]:
import string

char2int = {}
int2char = {}

for i, c in enumerate(' ' + string.ascii_lowercase + '0123456789' + string.punctuation):
    char2int[c] = i
    int2char[i] = c

def one_hot(index, length):
    vect = sy.zeros(length).long()
    vect[index]  = 1
    return vect
    
def string2one_hot_matrix(str_input, max_len=8):
    # truncate strings longer than max_len
    str_input = str_input[:max_len].lower()
    
    # pad strings shorter than max_len
    if(len(str_input) < max_len):
        str_input = str_input + "." * (max_len - len(str_input))
    
    char_vectors = list()
    for char in str_input:
        char_vectors.append(one_hot(char2int[char],len(int2char)).unsqueeze(0))
    
    return sy.cat(char_vectors,dim=0)

def string2values(str_input, max_len=128):
    # truncate strings longer than max_len
    str_input = str_input[:max_len].lower()
    
    # pad strings shorter than max_len
    if(len(str_input) < max_len):
        str_input = str_input + "." * (max_len - len(str_input))
    
    
    values = list()
    for char in str_input:
        values.append(char2int[char])
        
    return sy.LongTensor(values)

In [257]:
one_hots = string2one_hot_matrix("hey")

In [298]:
class DecentralizedDB:
    
    def __init__(self, *owners, max_key_len=8, max_value_len=256):
        self.max_key_len = max_key_len
        self.max_value_len = max_value_len
        self.owners = owners
        self.keys = list()
        self.values = list()
        
    def add_entry(self, string_key, string_value):
        key = string2one_hot_matrix(string_key, self.max_key_len).share(*self.owners)
        value = string2values(string_value, self.max_value_len).share(*self.owners)
        
        self.keys.append(key)
        self.values.append(value)
        
    def query(self,query_str):
        query = string2one_hot_matrix(query_str, self.max_key_len).send(*self.owners)
        
        # see if our query matches any key
        # note: this is the slowest part of the program
        # it could probably be made much faster with minimal improvements
        key_match = list()
        for key in self.keys:
            vect = (key * query).sum(1)
            # multiply the per-character matches together: 1 only if all match
            x = vect[0]
            # start at 1: vect[0] is already in x
            for i in range(1, vect.get_shape()[0]):
                x = x * vect[i]
            key_match.append(x)

        # Multiply each row's value by its corresponding keymatch
        value_match = list()
        for i, value in enumerate(self.values):
            shape = list(value.get_shape())
            km = key_match[i]
            expanded_key = km.expand(1,shape[0])[0]
            value_match.append(expanded_key * value)

        # NOTE: everything before this line could (in theory) happen in full parallel
        # on different threads.
            
        # sum the values together
        final_value = value_match[0]
        for v in value_match[1:]:
            final_value = final_value + v

        result = values2string(final_value.get())
        
        return result.replace(".","")

In [292]:
db  = DecentralizedDB(bob, alice, bill, max_key_len=3)
db.add_entry("Bob","(123) 456 7890")
db.add_entry("Bill", "(234) 567 8901")
db.add_entry("Sam","(345) 678 9012")
db.add_entry("Key","really big json value")

In [293]:
db.query("Bob")

'(123) 456 7890'

In [294]:
db.query("Bill")

'(234) 567 8901'

In [295]:
db.query("Sam")

'(345) 678 9012'

In [296]:
db.query("Not a Person")

'                                                                                                                                                                                                                                                                '

In [297]:
db.query("Key")

'really big json value'
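
One side effect of padding with "." and then stripping it in `query` is that genuine periods in a stored value are removed too. A possible workaround, sketched here in plain Python, is to reserve a padding code that no real character uses. The `PAD` constant and the helpers `encode_fixed` and `decode_fixed` are hypothetical names and are not part of the class above.

```python
import string

alphabet = ' ' + string.ascii_lowercase + '0123456789' + string.punctuation
char2int = {c: i for i, c in enumerate(alphabet)}
int2char = {i: c for i, c in enumerate(alphabet)}
PAD = len(alphabet)  # hypothetical reserved code that no real character uses

def encode_fixed(text, max_len):
    """Truncate to max_len, lowercase, then pad with the reserved PAD code."""
    codes = [char2int[c] for c in text[:max_len].lower()]
    return codes + [PAD] * (max_len - len(codes))

def decode_fixed(codes):
    """Drop PAD codes and map the rest back to characters."""
    return ''.join(int2char[c] for c in codes if c != PAD)

print(decode_fixed(encode_fixed("v1.2.3", 10)))  # v1.2.3
```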

# Success!!

And there we have it - a marginally more performant version. We could further improve performance by running the query on all the rows in parallel, but we'll leave that for someone else to work on :).
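
For anyone who wants to pick that up, here is a rough plain-Python sketch of the shape such a parallel query could take. Plain integers stand in for shared tensors, and `match_and_mask` and `parallel_query` are hypothetical names. Whether PySyft worker operations are safe to call from multiple threads is not something this tutorial establishes.

```python
from concurrent.futures import ThreadPoolExecutor

def match_and_mask(row):
    """Per-row work: compare keys, then mask the value. Rows are independent."""
    key, value, query = row
    mask = 1 if key == query else 0
    return [mask * v for v in value]

def parallel_query(keys, values, query, max_workers=4):
    rows = [(k, v, query) for k, v in zip(keys, values)]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        masked = list(pool.map(match_and_mask, rows))
    # Only the final element-wise sum needs every row's result.
    return [sum(col) for col in zip(*masked)]

print(parallel_query([11, 22, 33], [[1, 2], [3, 4], [5, 6]], 22))  # [3, 4]
```

Threads mainly help when the per-row work spends its time waiting on communication between owners. For CPU-bound work in pure Python, the gain would be limited.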


Note: we can add as many owners to the database as we want! (although the more owners you have the slower queries will be)

In [289]:
import syft as sy
hook = sy.TorchHook()

bob = sy.VirtualWorker(id="bob")
alice = sy.VirtualWorker(id="alice")
bill = sy.VirtualWorker(id="bill")
sue = sy.VirtualWorker(id="sue")
tara = sy.VirtualWorker(id="tara")

db  = DecentralizedDB(bob, alice, bill, sue, tara, max_key_len=3)
db.add_entry("Bob","(123) 456 7890")
db.add_entry("Bill", "(234) 567 8901")
db.add_entry("Sam","(345) 678 9012")
db.add_entry("Key","really big json value")



In [290]:
db.query("Bob")

'(123) 456 7890'

# Congratulations!!! - Time to Join the Community!

Congratulations on completing this notebook tutorial! If you enjoyed this and would like to join the movement toward privacy-preserving, decentralized ownership of AI and the AI supply chain (data), you can do so in the following ways!

### Star PySyft on GitHub

The easiest way to help our community is just by starring the repos! This helps raise awareness of the cool tools we're building.

- [Star PySyft](https://github.com/OpenMined/PySyft)

### Join our Slack!

The best way to keep up to date on the latest advancements is to join our community! You can do so by filling out the form at [http://slack.openmined.org](http://slack.openmined.org)

### Join a Code Project!

The best way to contribute to our community is to become a code contributor! At any time you can go to the PySyft GitHub Issues page and filter for "Projects". This will show you all the top-level tickets giving an overview of what projects you can join! If you don't want to join a project, but you would like to do a bit of coding, you can also look for more "one off" mini-projects by searching for GitHub issues marked "good first issue".

- [PySyft Projects](https://github.com/OpenMined/PySyft/issues?q=is%3Aopen+is%3Aissue+label%3AProject)
- [Good First Issue Tickets](https://github.com/OpenMined/PySyft/issues?q=is%3Aopen+is%3Aissue+label%3A%22good+first+issue%22)

### Donate

If you don't have time to contribute to our codebase, but would still like to lend support, you can also become a Backer on our Open Collective. All donations go toward our web hosting and other community expenses such as hackathons and meetups!

[OpenMined's Open Collective Page](https://opencollective.com/openmined)