## 11.09 Find the missing IP address

Suppose you were given a file containing roughly one billion IP addresses, each of which is a 32-bit quantity. How would you programmatically find an IP address that is not in the file? Assume that you have unlimited drive space but only a few megabytes of RAM at your disposal.

### Hint:
Can you be sure that there is an address which is not in the file?
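
A quick sanity check on the hint, assuming "roughly one billion" means 10^9 entries (that exact figure is my assumption). There are far more 32-bit values than entries in the file, so at least one address has to be missing:

```python
# Pigeonhole check: size of the 32-bit address space versus the file size.
ADDRESS_SPACE = 2 ** 32      # 4,294,967,296 possible 32-bit values
FILE_ENTRIES = 10 ** 9       # assumed reading of "roughly one billion"

# Even if every entry in the file were distinct, at least this many are absent.
guaranteed_missing = ADDRESS_SPACE - FILE_ENTRIES
print(guaranteed_missing)    # 3294967296
```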


### Initial remarks:

A dictionary would work, but the real constraint here is memory: every address is a 32-bit integer. If I only have a few megabytes of RAM, say 2^21 bytes, and each address takes 4 bytes (2^2), that is at most 2^21 / 2^2 = 2^19 raw keys, or about 524,000, and a real dictionary holds far fewer once per-entry overhead is counted. That's not nearly enough for a direct dictionary of a billion addresses. There are probabilistic data structures like Bloom filters that could be used, but let's ignore those for now.

I could take advantage of my unlimited drive space by splitting the file into multiple output files as I read the first file. Let's pretend that I want to split 2^31 (about 2 billion) IP addresses into files that each hold no more than 2^17 entries, a size I'm assuming fits comfortably in a dictionary. That means 2^31 / 2^17 = 2^14 files, each of which could be loaded into a dictionary on its own. This seems plausible.

Instead, let's try counting the IP addresses in 2^16 groups during a first pass, keyed by the first 16 bits. Each group can contain at most 2^16 distinct addresses, so if one of those counts is less than 2^16, that group must be missing at least one address. A second pass can then build a dictionary over the 2^16 possible trailing 16-bit values for that group.
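
To see why the counting approach fits in a few megabytes, here is a rough budget. It assumes packed 4-byte counters and a one-bit-per-suffix bitmap (both are my assumptions; a plain Python list or dict would use noticeably more per entry):

```python
# Rough memory budget for the two-pass counting idea.
PREFIX_BITS = 16
SUFFIX_BITS = 16
COUNTER_BYTES = 4                       # assumed: one packed 32-bit counter per bucket

num_buckets = 1 << PREFIX_BITS          # 65,536 prefixes
counter_table_bytes = num_buckets * COUNTER_BYTES
bitmap_bytes = (1 << SUFFIX_BITS) // 8  # one bit per possible suffix

print(counter_table_bytes)  # 262144 (256 KiB)
print(bitmap_bytes)         # 8192 (8 KiB)
```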


In [15]:
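# One counter per 16-bit prefix.
buckets = {a: 0 for a in range(0, 2**16)}
print(len(buckets))

import random
random.seed(42)

# Small stand-in for the real file: 2^16 distinct 32-bit addresses.
ip_addresses = random.sample(range(2**32), k=2**16)
print(len(ip_addresses))

# First pass: count addresses by their upper 16 bits.
for ip in ip_addresses:
    key = ip >> 16
    buckets[key] += 1

print(buckets)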


65536
65536
{0: 2, 1: 0, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1, 9: 0, 10: 2, 11: 1, 12: 0, 13: 0, 14: 1, 15: 0, 16: 1, 17: 2, 18: 2, 19: 0, 20: 1, 21: 0, 22: 0, 23: 1, 24: 0, 25: 0, 26: 2, 27: 1, 28: 0, 29: 0, 30: 0, 31: 0, 32: 1, 33: 1, 34: 1, 35: 2, 36: 0, 37: 2, 38: 1, 39: 0, 40: 0, 41: 0, 42: 2, 43: 1, 44: 4, 45: 0, 46: 0, 47: 0, 48: 0, 49: 1, 50: 2, 51: 1, 52: 1, 53: 1, 54: 1, 55: 1, 56: 2, 57: 2, 58: 1, 59: 2, 60: 0, 61: 0, 62: 2, 63: 1, 64: 4, 65: 0, 66: 0, 67: 0, 68: 1, 69: 1, 70: 1, 71: 1, 72: 0, 73: 0, 74: 0, 75: 3, 76: 2, 77: 0, 78: 2, 79: 2, 80: 1, 81: 0, 82: 0, 83: 2, 84: 1, 85: 2, 86: 1, 87: 0, 88: 1, 89: 0, 90: 1, 91: 1, 92: 0, 93: 1, 94: 3, 95: 3, 96: 3, 97: 0, 98: 1, 99: 0, 100: 2, 101: 0, 102: 1, 103: 3, 104: 0, 105: 0, 106: 0, 107: 0, 108: 2, 109: 0, 110: 1, 111: 1, 112: 0, 113: 0, 114: 0, 115: 0, 116: 1, 117: 1, 118: 0, 119: 0, 120: 1, 121: 0, 122: 0, 123: 2, 124: 0, 125: 2, 126: 0, 127: 1, 128: 1, 129: 1, 130: 1, 131: 0, 132: 1, 133: 0, 134: 3, 135: 2, 136: 3, 1

### Yeah, that was fun!

I can see where this is going, and it would work, but let's look at the book solution.

### Book Solution 
I modified the book solution slightly, but this is basically it. To keep the run small, my stream holds 2^24 - 1 IP addresses (every 24-bit value except one) instead of 2^32 - 1.


In [1]:
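import itertools
def find_missing_element(stream):
    # Pass 1: count how many addresses share each 16-bit prefix.
    NUM_BUCKET = 1 << 16
    counter = [0] * NUM_BUCKET
    # tee() lets the stream be read twice, but it buffers every item in memory
    # until the second iterator consumes it, so it ignores the RAM limit.
    stream, stream_copy = itertools.tee(stream)
    for x in stream:
        upper_part_x = int(x) >> 16
        counter[upper_part_x] += 1

    # A bucket holding fewer than 2^16 addresses must be missing at least one.
    BUCKET_CAPACITY = 1 << 16
    candidate_bucket = next(i for i, c in enumerate(counter) if c < BUCKET_CAPACITY)

    # Pass 2: mark which 16-bit suffixes occur in the candidate bucket.
    candidates = [0] * BUCKET_CAPACITY
    for x in stream_copy:
        upper_part_x = int(x) >> 16
        if candidate_bucket == upper_part_x:
            lower_part_x = ((1 << 16) - 1) & int(x)
            candidates[lower_part_x] = 1

    # The first unmarked suffix, joined with the prefix, is a missing address.
    for i, v in enumerate(candidates):
        if v == 0:
            return (candidate_bucket << 16) | i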

import random
random.seed(42)

BITS = 24
data = random.sample(range(2**BITS), k=2**BITS-1)
with open("stream.txt", "w") as outfile:
    outfile.writelines("%s\n" % number for number in data)

with open("stream.txt") as infile:
    print(find_missing_element(infile))

2113359
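
One thing I noticed: `itertools.tee` keeps the items read by the first pass in memory until the second iterator consumes them, which works against the RAM limit. Below is a sketch of a variant that rewinds a seekable file instead. The name `find_missing_in_file` and its `bits` / `prefix_bits` parameters are my own, not from the book, and the toy run uses 8-bit values so it finishes instantly:

```python
import os
import random
import tempfile


def find_missing_in_file(infile, bits=32, prefix_bits=16):
    """Hypothetical variant: two passes over a seekable text file, one integer per line."""
    low_bits = bits - prefix_bits
    capacity = 1 << low_bits          # distinct values that fit in one bucket

    # Pass 1: count entries per prefix.
    counts = [0] * (1 << prefix_bits)
    for line in infile:
        counts[int(line) >> low_bits] += 1
    target = next(i for i, c in enumerate(counts) if c < capacity)

    # Rewind the file instead of buffering the stream in memory.
    infile.seek(0)

    # Pass 2: one byte per possible suffix in the target bucket.
    seen = bytearray(capacity)
    for line in infile:
        x = int(line)
        if x >> low_bits == target:
            seen[x & (capacity - 1)] = 1
    return (target << low_bits) | seen.index(0)


# Toy run: every 8-bit value except 77, shuffled, with a 4-bit prefix.
values = [v for v in range(256) if v != 77]
random.Random(0).shuffle(values)
with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "stream.txt")
    with open(path, "w") as outfile:
        outfile.writelines(f"{v}\n" for v in values)
    with open(path) as infile:
        result = find_missing_in_file(infile, bits=8, prefix_bits=4)
print(result)  # 77
```

Like the version above, this raises `StopIteration` if every bucket is full, which cannot happen when the file holds fewer than 2^bits entries.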


### Remarks
I think this could also have been done by writing into 2^16 files, since drive space is free in this task.
One pass could write the stream out to individual files keyed by the first 16 bits, with a counter dictionary to tell you which file holds fewer than 2^16 entries. Only that file then needs to be searched for the gap.
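
Here is a scaled-down sketch of that file-splitting idea. The function name, the `bucket_N.txt` file names, and the 8-bit / 4-bit sizes are all my own choices for illustration. I kept it to 16 files because operating systems often cap the number of simultaneously open files well below 2^16 by default, so the full-size version would likely need to open files in batches or in append mode:

```python
import os
import tempfile


def find_missing_by_partition(addresses, workdir, bits=8, prefix_bits=4):
    """Hypothetical sketch: split addresses into one file per prefix, then scan one file."""
    low_bits = bits - prefix_bits
    capacity = 1 << low_bits
    num_files = 1 << prefix_bits
    counts = [0] * num_files
    paths = [os.path.join(workdir, f"bucket_{b}.txt") for b in range(num_files)]

    # Single pass over the input: route each address to its bucket file.
    files = [open(p, "w") for p in paths]
    try:
        for ip in addresses:
            b = ip >> low_bits
            files[b].write(f"{ip}\n")
            counts[b] += 1
    finally:
        for f in files:
            f.close()

    # Only the first under-full bucket file has to be loaded into memory.
    target = next(b for b, c in enumerate(counts) if c < capacity)
    with open(paths[target]) as f:
        present = {int(line) for line in f}
    base = target << low_bits
    return next(x for x in range(base, base + capacity) if x not in present)


# Toy run: every 8-bit value except 200.
data = [x for x in range(256) if x != 200]
with tempfile.TemporaryDirectory() as workdir:
    missing = find_missing_by_partition(data, workdir)
print(missing)  # 200
```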

In [None]:
0 1 1 0 1 1 0 1

bucket = 0 1 1 0 = 6 

buckets = {} -> {6: 1}

1 0 0 0 0 0 0 1
bucket = 1 0 0 0 = 8

buckets -> {6: 1, 8: 1}
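
The scratch work above uses 8-bit values with the top 4 bits as the bucket. The same two values in Python (the helper name `bucket_of` is just for illustration):

```python
def bucket_of(value, total_bits=8, prefix_bits=4):
    """Return the top `prefix_bits` bits of a `total_bits`-wide value."""
    return value >> (total_bits - prefix_bits)


buckets = {}
for value in (0b01101101, 0b10000001):
    key = bucket_of(value)
    buckets[key] = buckets.get(key, 0) + 1

print(buckets)  # {6: 1, 8: 1}
```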
