# Bitwise Integrity

Before we get into more complex definitions of data integrity, let's consider a simple thought experiment. Suppose we went around the classroom and asked every student to write down an integer on a strip of paper. We put all of these strips of paper into a bag and give them to one student to keep safe for a week. The next week, we want to confirm that all of the strips of paper are still there (and have their original values!). What if the student lost one of the strips and made up a value to ensure the count would be the same? Short of writing down every number and putting it in a safe place, how would we efficiently test whether the strips of paper were modified (either a lost strip or a manipulated strip)?

## Naive Technique
Let's consider the following algorithm. As students add their strips of paper to the bag, we incrementally calculate the running total of all of the numbers. Once all of the students have added their strips, we write the final sum (called a checksum) on a specially colored strip of paper and add it to the bag. When we retrieve the bag next week, we simply sum up all of the values again and compare the result to the checksum. If the checksum itself is missing, then the question is moot, as the bag is already untrustworthy.

In [18]:
def checksum(data):
    return (data, sum(data))

def verify(data):
    return (sum(data[0]) == data[1])  

v = checksum([1,3,4,5,6])
print('Test original', verify(v))

#manipulated
v[0][3] = 7
print('Test manipulated',verify(v))

Test original True
Test manipulated False


Essentially, this scheme takes a large amount of data and summarizes it into a signature, which we use for data verification. The verification is substantially more storage efficient than redundantly recording every piece of information. Why won't this particular signature work? Most obviously, we can easily trick this system by strategically modifying the data:

In [19]:
v = checksum([1,3,4,5,6])

#manipulated
v[0][3] = 7
v[0][4] = 4
print('Test strategic',verify(v))

Test strategic True


This problem is called a "collision": two different data sets have the same checksum. In a sense, collisions are guaranteed to happen because the checksum is much smaller than the original data. Let's ignore this issue for a bit and think of an equally important but more subtle problem with the above scheme.

There is a glaring systems problem with this approach. The sum of two integers is not guaranteed to fit in an integer: on a system with fixed-width integers, the result may need more bits than the integer type provides.
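Python's own integers grow as needed, so the overflow never shows up in the notebook cells above. The following is a minimal sketch that simulates a 32-bit signed integer to show what a fixed-width system would store; the helper name `to_int32` and the choice of 32 bits are assumptions made for illustration.

```python
def to_int32(x):
    """Interpret x as a 32-bit two's-complement integer (hypothetical helper)."""
    x &= 0xFFFFFFFF                      # keep only the low 32 bits
    return x - 2**32 if x >= 2**31 else x

INT32_MAX = 2**31 - 1

exact = INT32_MAX + 1                    # Python ints never overflow
wrapped = to_int32(INT32_MAX + 1)        # what a 32-bit register would hold

print(exact)    # 2147483648
print(wrapped)  # -2147483648
```

The exact sum needs 33 bits in signed form, so the 32-bit version silently wraps around to a negative number.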

## Cyclic Checksum
We can define a new checksum function that restricts the range of the checksum to a fixed number of bits by using the modulo operator:

In [20]:
def checksum(data, length):
    #length is the size in bits needed to store the checksum
    return (data, (sum(data) % (2**length - 1) ))

v = checksum([1,3,4,5,6], 2)
print(v)

v = checksum([1,3,4,5,6], 3)
print(v)

([1, 3, 4, 5, 6], 1)
([1, 3, 4, 5, 6], 5)


The obvious consequence of this is that you are even more susceptible to missing errors. Random errors that happen to leave the sum unchanged modulo the divisor produce the same checksum. The more bits you have, the safer you are, but the scheme is still unreliable.
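As a small illustration, the sketch below uses the same rule as the cell above (sum modulo `2**length - 1`) under the assumed name `mod_checksum`, and corrupts a single value by exactly the modulus. The specific corrupted list is made up for the example.

```python
def mod_checksum(data, length):
    # Same rule as the cyclic checksum above: sum reduced modulo 2**length - 1
    return sum(data) % (2**length - 1)

original = [1, 3, 4, 5, 6]
corrupted = [1, 3, 4, 12, 6]   # one value is off by exactly 7 = 2**3 - 1

# With 3 bits the error is invisible; with 8 bits it is caught
print(mod_checksum(original, 3), mod_checksum(corrupted, 3))  # 5 5
print(mod_checksum(original, 8), mod_checksum(corrupted, 8))  # 19 26
```

Widening the checksum catches this particular error, but an error that is a multiple of 255 would slip past the 8-bit version in the same way.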

## A Better Approach

The key issue is that summing up a bunch of numbers washes out the variation in those numbers, i.e., 2 + 4 is the same as 3 + 3. The natural thing to do is to concatenate the numbers instead of summing them up. Let's assume that every number has the same number of digits:

In [21]:
def concat(data):
    strs = [str(d) for d in data] #turn it into strings
    return int(''.join(strs))

def checksum(data, length):
    #length is the size in bits needed to store the checksum
    return (data, (concat(data) % (2**length - 1) ))

Because of the concatenation, it is harder to predict how a small change affects the checksum. Decrementing one of the numbers leads to a drastic change in the checksum:

In [22]:
v = checksum([1,3,4,5,6], 4)
print(v)

v = checksum([1,2,4,5,6], 4)
print(v)

([1, 3, 4, 5, 6], 1)
([1, 2, 4, 5, 6], 6)


This particular approach is the idea behind a cyclic redundancy check (CRC), which can be implemented entirely in Boolean arithmetic: instead of digits you work with bits, and the summations and divisions are constructed with the XOR and AND operators.

https://en.wikipedia.org/wiki/Cyclic_redundancy_check
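To give a feel for the bit-level version, here is a minimal sketch of a bit-by-bit CRC-32 that uses only shifts, XOR, and AND. It assumes the common reflected CRC-32 variant (polynomial constant `0xEDB88320`, initial value and final XOR of `0xFFFFFFFF`), which is the one computed by Python's `binascii.crc32`; the function name `crc32_bitwise` is made up for this example.

```python
import binascii

def crc32_bitwise(data: bytes) -> int:
    """Bit-by-bit CRC-32 (reflected form), using only shift, XOR and AND."""
    crc = 0xFFFFFFFF                       # conventional initial value
    for byte in data:
        crc ^= byte                        # fold the next byte into the low bits
        for _ in range(8):                 # process one bit at a time
            if crc & 1:                    # low bit set: "subtract" the divisor
                crc = (crc >> 1) ^ 0xEDB88320
            else:
                crc >>= 1
    return crc ^ 0xFFFFFFFF                # conventional final XOR

msg = b'123456789'
print(hex(crc32_bitwise(msg)))             # 0xcbf43926
print(crc32_bitwise(msg) == binascii.crc32(msg))  # True
```

The inner loop is long division in binary, with XOR playing the role of subtraction. Real implementations usually replace it with a lookup table or a hardware instruction.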

While substantially better than simply summing up the numbers, it is still not robust to strategic manipulation. It is possible to use mathematics to determine which numbers to manipulate so that the final checksum stays the same. In the next lecture, we'll start to dig deeper into this problem.
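To make the weakness concrete, here is a small sketch of one such manipulation against the concatenation scheme above with a 4-bit checksum (modulus 15). The name `concat_checksum` and the forged list are assumptions chosen for illustration: adding a multiple of the modulus to the concatenated number leaves the remainder unchanged.

```python
def concat_checksum(data, length):
    # Concatenate the decimal digits, then reduce modulo 2**length - 1
    return int(''.join(str(d) for d in data)) % (2**length - 1)

original = [1, 3, 4, 5, 6]   # concatenates to 13456
forged = [1, 3, 4, 7, 1]     # concatenates to 13471 = 13456 + 15

print(concat_checksum(original, 4))  # 1
print(concat_checksum(forged, 4))    # 1
```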


## Hashing
The CRC is a particularly efficient calculation and can be executed in hardware. Sometimes, we may want stronger checks that are a bit less efficient. The study of designing such functions is part of the general area of "hashing" in computer science.

Like a checksum, a hash function is any function that can be used to map data of arbitrary size onto data of a fixed size. The key to designing a good hash function is that it must be disorderly: small changes in the input map to very large and unpredictable changes in the output. Thus, while collisions will certainly exist, they will be evenly spread out over the entire domain of the data. Why does this work? While the potential domain of the data could be huge, its practical domain is often small (i.e., the realized values in the dataset). So if we are careful about how we hash the data, there is often a unique encoding for each value.
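The "disorderly" property can be observed with any well-designed hash function. The sketch below uses SHA-256 from Python's standard `hashlib` module purely as an illustration (it is not the hash function developed in these notes) and counts how many output bits change when a single input character changes. The helper `bit_difference` is a made-up name.

```python
import hashlib

def bit_difference(a: bytes, b: bytes) -> int:
    """Count the bit positions where two equal-length byte strings differ."""
    return sum(bin(x ^ y).count('1') for x, y in zip(a, b))

h1 = hashlib.sha256(b'the bear').digest()
h2 = hashlib.sha256(b'the beat').digest()   # one character changed

print(len(h1) * 8)             # 256 bits of output, regardless of input size
print(bit_difference(h1, h2))  # number of output bits that flipped
```

For a good hash function, the second number is typically close to half of the output bits, even though the inputs differ in a single character.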

Simple hash functions often take a form similar to the CRC checks that we saw above. They sum and multiply the bits in a data point and take the modulus. Here is a hash function that hashes strings to integer values:

In [2]:
import binascii
import random

# maximum integer value
MAXINT = 2**32-1

# We need the next largest prime number above MAXINT
NEXTPRIME = 4294967311

# Two numbers that parametrize the hash
A = random.randint(0, MAXINT)
B = random.randint(0, MAXINT)

def hashcode(st):
    val = binascii.crc32(bytes(str(st),'utf-8')) & 0xffffffff
    return (A * val + B) % NEXTPRIME

print(hashcode('the bear'))
print(hashcode('a bear'))
print(hashcode('the bear'))

1223513802
2366007334
1223513802


Notice a couple of differences from the CRC scheme above. First, the modulus is a prime number. Intuitively, prime numbers make sense as a modulus because there are fewer ways to accidentally hit numbers that divide cleanly into them. We might have to further restrict these hash functions to a smaller domain (for example, to fit into a fixed-size list):

In [3]:
N = 10
lst = [None]* N
lst[hashcode('the bear') % N] = ['the bear']
lst[hashcode('a bear') % N] = ['a bear']

print(lst)

[None, None, ['the bear'], None, ['a bear'], None, None, None, None, None]


The above data structure is called a hashmap and illustrates the power of hashing. To determine whether we have already inserted a string into our data structure before, we can simply test to see if there is an entry corresponding to its hash code and only compare to those elements that have the same code. 
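The lookup described above can be sketched as follows. This is a minimal illustration rather than a full hashmap: the names `buckets`, `bucket_index`, `insert`, and `contains` are made up here, and plain CRC-32 stands in for `hashcode` so that the example is deterministic. Each slot holds a list of all strings that share the same code, so a collision costs only a few extra comparisons.

```python
import binascii

N = 10
buckets = [[] for _ in range(N)]   # one list ("chain") per slot

def bucket_index(st):
    # Stand-in for hashcode(st) % N; CRC-32 keeps the example deterministic
    return (binascii.crc32(st.encode('utf-8')) & 0xffffffff) % N

def insert(st):
    chain = buckets[bucket_index(st)]
    if st not in chain:            # compare only against same-bucket entries
        chain.append(st)

def contains(st):
    return st in buckets[bucket_index(st)]

insert('the bear')
insert('a bear')

print(contains('the bear'))   # True
print(contains('a bird'))     # False
```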

Next, the weights A and B are randomly generated. In fact, different A's and B's provide an entire family of hash functions. If the A's and B's are picked independently, we can almost model the resulting hash functions as independent random variables: the probability of collision from each one is independent. Thus, even if we have "weak" hash functions, as long as we can generate a independent functions