# All aboard the blocktrain
## Making a basic blockchain to improve understanding

We've all heard about blockchain and we've all discussed it, but few of us really understand it.

To make matters worse, we often confuse it with things *built on* blockchains (like Bitcoin and other cryptocurrencies). Furthermore, when we're talking about blockchains, we really mean **distributed** blockchains, since a blockchain in and of itself is no more useful than any other database.

Let's build the simplest possible (useful) blockchain, to make sure we really understand what it is. Then, we'll look to make it distributed.

## What the Block?
A **Block** is just another data structure (like a list, dict, etc.) that holds some data.  
It also has 
- a timestamp, so we know when it was created
- a Unique ID so we can know if it's different 

In [1]:
from datetime import datetime
from uuid import uuid4

class Block:
    def __init__(self, data):
        self.uid = uuid4()
        self.timestamp = datetime.now()
        self.data = data
        
    def __repr__(self):
        return f'Block({str(self.uid)[:5]}...)'
        
a_block = Block(data={'balance': 250})

## The "chain" in blockchain

A **Blockchain** is just a list (more or less) of Blocks.  
There's something special about Blocks in a blockchain though. 

Each block contains a special value which is a *summary* of the block before it.  
And since the previous block also contains a *summary* of the block before, the *summary* in *this* block contains a *summary* of the previous block's *summary*, and so on, creating a chain of blocks.

This recursive behaviour continues all the way back to the very first block!

## What does this summary look like?

One naive way to get a summary could just be to count the number of characters in the data.


In [2]:
def summary(data):
    return len(str(data))

summary('hello')

5

We need our summary to always give the same value if we give it the same data.

In [3]:
tests = [
    "summary('hello') == summary('hello')",
    "summary(10) == summary(10)",
    "summary({'a': 'dict'}) == summary({'a': 'dict'})"
]

for test in tests:
    print(eval(test), test)

True summary('hello') == summary('hello')
True summary(10) == summary(10)
True summary({'a': 'dict'}) == summary({'a': 'dict'})


And we also need it to give a different value **most of the time** when the data is changed.

In [4]:
tests = [
    "summary('hello') != summary('helloo')",
    "summary('hello') != summary('hullo')",
]

for test in tests:
    print(eval(test), test)

True summary('hello') != summary('helloo')
False summary('hello') != summary('hullo')


OK, so our length counter summary isn't perfect.

Let's try something a little more sophisticated... we could somehow convert each character into a value, and then we would be able to detect character changes even when the length doesn't change.

We could get each character's numeric code (its Unicode code point, which is the same as its ASCII value for plain English text) using the `ord()` function in Python, and then sum those values.

In [5]:
def summary(data):
    data_string = str(data)
    return sum(ord(char) for char in data_string)

summary('hello')

532

In [6]:
tests = [
    "summary('hello') != summary('hullo')",
    "summary('hello') != summary('ehllo')",
]

for test in tests:
    print(eval(test), test)

True summary('hello') != summary('hullo')
False summary('hello') != summary('ehllo')


Better! But still pretty easy to trick.
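
Why is it easy to trick? Addition ignores order, so any rearrangement of the same characters gives the same sum. As a rough sketch (the name `weighted_summary` is made up for this illustration), we could multiply each character's value by its position. That catches the swap above, though it is still nowhere near a real hash function:

```python
def weighted_summary(data):
    # Hypothetical variant: weight each character by its 1-based position,
    # so swapping two different characters changes the total.
    data_string = str(data)
    return sum(pos * ord(char) for pos, char in enumerate(data_string, start=1))

print(weighted_summary('hello') != weighted_summary('ehllo'))
```

Collisions are still easy to build by hand: for example, `'ca'` and `'ab'` produce the same value.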

Luckily there are some well known algorithms that make good summaries!


## Hash it out
A *summary* that meets these criteria reasonably well is called a **hash** in computing terms. 

The function used to make a **hash** is called a *hash function*.

Python has a hash function built in, so we'll just use that *for now*. 

In [7]:
import json

class Block(Block):  # Inherits from the old Block definition so we don't have to re-define stuff
    def __init__(self, data, previous_block):
        self.uid = uuid4()
        self.timestamp = datetime.now()
        self.data = data
        self.hash = hash(previous_block)
        
    # In order for a Python object to be hashable, it must define a __hash__ method
    def __hash__(self):
        data = json.dumps(self.data) # you can't hash a dictionary, let's make it a string first
        attrs = (self.uid, self.timestamp, data, self.hash)
        return hash(attrs)
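
One caveat with the built-in `hash()`: for strings, Python randomises the result per process by default, so the same data can hash differently the next time you start the interpreter. That is fine inside one notebook session, but not for nodes running on different machines. A minimal sketch of a stable alternative, assuming SHA-256 from the standard library's `hashlib` (the helper name `stable_summary` is made up here, and this is not what the rest of the notebook uses):

```python
import hashlib
import json

def stable_summary(data):
    # Hypothetical helper: serialise to JSON with sorted keys so that equal
    # dicts always produce the same text, then take the SHA-256 hex digest.
    encoded = json.dumps(data, sort_keys=True).encode('utf-8')
    return hashlib.sha256(encoded).hexdigest()

same = stable_summary({'balance': 250}) == stable_summary({'balance': 250})
swapped = stable_summary('hello') != stable_summary('ehllo')
print(same, swapped)
```

Unlike the sum of `ord()` values, this also tells `'hello'` and `'ehllo'` apart.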


## Why chain?
The whole point of this chain, this summary-of-a-summary back to the start, is to make it harder to **modify** data.

> But how does it make it harder to modify?

Because the function we use to make the *summary* is not a secret. Anyone can look at the previous block, use the same function to make a summary, and check if it is equal to the current block's summary value. 

If it's equal, you know that the data in the previous block hasn't been changed.  
And you can perform this process all the way back along a blockchain to verify that none of the data in the blocks has changed.



In [8]:
def chain_is_consistent(chain):
    previous_block = chain[0]
    for block in chain[1:]:
        matches = hash(previous_block) == block.hash
        
        if not matches:
            print(f'hash({previous_block}) does not match {block}.hash')
            return False
        
        previous_block = block
        
    return True

In [9]:
block0 = Block(data={'balance': 100}, previous_block=None)
block1 = Block(data={'balance': 120}, previous_block=block0)
block2 = Block(data={'balance': 135}, previous_block=block1)
block3 = Block(data={'balance': 150}, previous_block=block2)

chain = [block0, block1, block2, block3]

chain_is_consistent(chain)

True

But what happens if I (e.g. a bad person with access to the database) decide to change the value of one block?

In [10]:
block1.data = {'balance': 200}

chain_is_consistent(chain)

hash(Block(de4ed...)) does not match Block(b851b...).hash


False

Great news, we can easily tell that someone has tampered with the blockchain.

## Blockchained
OK, great, we've made our own simple blockchain.

Now let's make it distributed!


# Distributed Blockchain
Why do we want a distributed blockchain?

Well, because our `chain_is_consistent()` function has a bit of a weakness if you run it on only one computer:

In [11]:
block2.hash = hash(block1)
block3.hash = hash(block2)

chain_is_consistent(chain)

True

I, the malicious person with access to the database, can just update all the hashes to make the chain consistent again.

## Consensus
But if this information was stored across many computers (that I don't have access to), then I wouldn't be able to edit the copies of the blockchain on those other computers.

So we could imagine that I have tampered with the blockchain on my computer, but there are 4 other copies of the blockchain on other people's computers that still show the original blockchain.

These 5 computers could somehow talk to each other and decide which is the original version.  
This is called reaching a **consensus**.

There are many different ways to reach a consensus. In a minute we'll explore a basic one.
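
Before that, here is a deliberately naive sketch of the idea, assuming a simple majority vote. This is an illustration only, not the approach built in the rest of this notebook; `majority_chain` is a made-up name and the chains are plain lists of strings rather than Blocks:

```python
from collections import Counter

def majority_chain(copies):
    # Hypothetical helper: count how many nodes hold each version of the
    # chain and return the most common version with its number of votes.
    counts = Counter(tuple(copy) for copy in copies)
    version, votes = counts.most_common(1)[0]
    return list(version), votes

original = ['genesis', 'balance=100', 'balance=120']
tampered = ['genesis', 'balance=100', 'balance=200']

# my tampered copy, plus 4 untouched copies on other people's computers
copies = [tampered, original, original, original, original]
winner, votes = majority_chain(copies)
print(winner == original, votes)
```

With 4 of the 5 copies agreeing, the tampered version loses the vote.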


## Decentralisation and Trust

Another benefit of having the blockchain copied on to several computers (i.e. decentralised) is that there is no one party that I have to trust.

Juxtapose this with your bank. You have to trust your bank to correctly maintain your account balance. A rogue employee with access to the bank's database could edit your balance if they wanted to.
With a decentralised blockchain, this is much harder to do, because the other copies would no longer agree with the edited one. Therefore, decentralised blockchains are described as "trustless" - you don't have to trust a single party to correctly maintain your balance.

## Nodes
Let's say that each computer that keeps a copy of this blockchain is called a **Node**.

In [12]:
class Node:
    def __init__(self):
        self.uid = uuid4()
        self.other_nodes = []
        self.chain = []
        
    def __repr__(self):
        return f'Node({str(self.uid)[:5]}...)'

Each node needs to be aware of the other nodes on the network, so that it knows who it has to reach a consensus with.

> Some consensus algorithms don't require each node to be aware of **all** other nodes... but ours does!  

So let's find a way to join the network:

In [13]:
class Node(Node):  # Inherits from the old Node definition so we don't have to re-define stuff
        
    def join_network(self, other):
        # copy the other nodes from your peer
        self.other_nodes = other.other_nodes[:]
        
        # but your peer isn't in that list, so add them
        self.other_nodes.append(other)
        
        # now tell all your peers that you've arrived!
        for other_node in self.other_nodes:
            other_node.add_peer(self)
                    
    def add_peer(self, other):
        # here we can just do this as a method call, since all our
        # nodes are running in the same process on my computer.
        # But I'm making this its own method so that we could just
        # replace this implementation with an HTTP call for example.
        self.other_nodes.append(other)

Remember, the whole point of distributed blockchain is that it is decentralised. So you *can't* have some central registry of all the nodes in the network. When you join the network, you need to ask just one peer, and then they help you discover the whole network.
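
To check that asking a single peer really is enough, here is a small self-contained sketch. `Peer` is a hypothetical stand-in for `Node` that keeps only the peer list, with the same joining logic as above:

```python
class Peer:
    """Hypothetical stand-in for Node, keeping only the peer list."""

    def __init__(self, name):
        self.name = name
        self.other_nodes = []

    def join_network(self, other):
        # copy the peer's list, add the peer itself, then announce ourselves
        self.other_nodes = other.other_nodes[:] + [other]
        for other_node in self.other_nodes:
            other_node.add_peer(self)

    def add_peer(self, other):
        self.other_nodes.append(other)

a, b, c = Peer('a'), Peer('b'), Peer('c')
b.join_network(a)
c.join_network(b)  # c only ever asks b, yet ends up knowing a as well

known = {p.name: sorted(q.name for q in p.other_nodes) for p in (a, b, c)}
print(known)
```

Every peer ends up knowing every other peer, without any central registry.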

## What are you syncing about?
Nodes need a way to *agree* on what the current state of the blockchain is. This is called reaching a consensus.

One type of disagreement that can happen is that one node is ahead of another, because the other isn't aware of a more recent transaction yet.

If one node `is_ahead_of()` another, it's safe to just fast forward the node that is behind.


In [14]:
class Node(Node):  # Inherits from the old Node definition so we don't have to re-define stuff
    
    def join_network(self, other):
        self.other_nodes = other.other_nodes[:]
        self.other_nodes.append(other)
        for other_node in self.other_nodes:
            other_node.add_peer(self)
            
        # after joining the network, first thing we should do is sync!
        self.sync()
        
    def sync(self):
        for other_node in self.other_nodes:
            self.sync_with(other_node)        
            
    def sync_with(self, other):

        if not self.chain:
            # We've just joined the network, let's
            # just copy this node's chain.
            self.chain = other.chain[:]
            
        if self.is_synced_with(other):
            # nice
            return
            
        elif other.is_ahead_of(self):
            # we need to catch up
            self.chain = other.chain[:]
            
        elif self.is_ahead_of(other):
            # they need to catch up
            other.sync_with(self)
            
        else:  # ???
            # Not sure how to handle this yet.
            pass
            
    def is_synced_with(self, other):
        return self.chain == other.chain
            
    def is_ahead_of(self, other):
        return other.last_block in self.chain and len(self.chain) > len(other.chain)
            
    @property
    def last_block(self):
        return self.chain[-1]



That leaves the `else` branch. There are two kinds of disagreement that can happen between nodes:

1. One node is ahead of another, because the other isn't aware of a more recent transaction yet.
2. The chain has forked. There is a Block X which is followed by Block Y on one Node, but followed by Block Z on another Node.

The first case is handled by fast forwarding. However, if there is a forked chain, at least one of the nodes needs to have its chain corrected.
This could be done many different ways. To keep it simple, we will revert both nodes to the latest common ancestor in this case.

In [15]:
class Node(Node):  # Inherits from the old Node definition so we don't have to re-define stuff

    def sync_with(self, other):
        if not self.chain:
            self.chain = other.chain[:]
            
        elif self.is_synced_with(other):
            pass
            
        elif other.is_ahead_of(self):
            self.chain = other.chain[:]
            
        elif self.is_ahead_of(other):
            other.sync_with(self)
            
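        else:  # chains have forked...
            # revert both to the latest common ancestor
            lcb_idx = self.latest_common_block_idx(other)
            # the slice end is exclusive, so +1 keeps the common block itself
            self.chain = self.chain[:lcb_idx + 1]
            other.chain = other.chain[:lcb_idx + 1]
            print(f'Reverted {self} and {other}')

    def latest_common_block_idx(self, other):
        # index of the last position at which both chains hold the same block
        for idx, (block, other_block) in enumerate(zip(self.chain, other.chain)):
            if block.uid != other_block.uid:
                return idx - 1
        # no difference found: the shorter chain is a prefix of the longer one
        return min(len(self.chain), len(other.chain)) - 1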
                

So, it's definitely not the best way to handle a fork, but it's a simple way. We simply look for the last block that the two chains have in common, and revert both chains back to it.
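
The same idea can be sketched on plain lists, using the Block X / Y / Z fork described earlier. The helper name `common_prefix_length` is made up for this example, and strings stand in for Blocks:

```python
def common_prefix_length(chain_a, chain_b):
    # Hypothetical helper: count how many leading blocks the chains share.
    length = 0
    for block_a, block_b in zip(chain_a, chain_b):
        if block_a != block_b:
            break
        length += 1
    return length

chain_a = ['genesis', 'X', 'Y']
chain_b = ['genesis', 'X', 'Z']

keep = common_prefix_length(chain_a, chain_b)
print(chain_a[:keep], chain_b[:keep])
```

Both chains are cut back to end at Block X, so the data in Y and Z is lost. That is the price of keeping the fork handling this simple.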

## Get on board, man.

The next thing a node needs to know how to do is create a block from some data it is given.

In [16]:
class Node(Node):  # Inherits from the old Node definition so we don't have to re-define stuff
    def create_block(self, data):
        self.sync()
        block = Block(data, previous_block=self.last_block)
        self.chain.append(block)
        self.sync()
        return self.last_block == block

Cool. Looks like we can now create a block.

Note that we sync:
- Before adding the block, to ensure that we have the correct previous_block, and
- After adding the block, to ensure that all other nodes are up to date

## Test Test Test

Let's make the first node, and give it a genesis block

In [17]:
genesis_block = Block('genesis', previous_block=None)
node1 = Node()
node1.chain.append(genesis_block)

node1.chain

[Block(6c793...)]

OK. But we can't say it's a distributed blockchain until there is more than one node!

In [18]:
node2 = Node()
node2.join_network(node1)

assert node1.chain == node2.chain
node2.chain

[Block(6c793...)]

It's good to see that we have the genesis block already in the chain on `node2`.

How did it get there? Thanks to the `sync()` which is performed when nodes join the network.

Let's make a block!

In [19]:
data = {'test': 'data'}
created = node1.create_block(data)

assert created
assert node1.chain == node2.chain
node2.chain

[Block(6c793...), Block(a8782...)]

Great, looks like the block has shown up on both nodes!

Let's make it bigger!

In [20]:
node3 = Node()
node3.join_network(node2)

node4 = Node()
node4.join_network(node1)

node5 = Node()
node5.join_network(node3)

5 Nodes!

Notice that each of them joined through some node that was already in the network, and it didn't matter which one.

Let's make some blocks!

In [21]:
data2 = {'test': 'more data!'}
created = node2.create_block(data2)
assert created
assert node1.chain == node2.chain == node3.chain == node4.chain == node5.chain
node5.chain

[Block(6c793...), Block(a8782...), Block(c516b...)]

In [22]:
data3 = {'test': 'even more data!!'}
created = node1.create_block(data3)
assert created
assert node1.chain == node2.chain == node3.chain == node4.chain == node5.chain
node4.chain

[Block(6c793...), Block(a8782...), Block(c516b...), Block(f3677...)]

## Bad Actors

How do we know if someone has manipulated the data?

Coming soon...

## Limitations
- Scalability (Nodes sync with every other node)
- Does not handle forks very well
- Doesn't prevent blocks from being manipulated (yet)

More coming soon...