## Stanford Algo, Part 1. Programming Assignment 6.

### 0. The 2-SUM problem

Download the following text file: algo1-programming_prob-2sum.txt

The goal of this problem is to implement a variant of the 2-SUM algorithm (covered in the Week 6 lecture on hash table applications).

The file contains 1 million integers, both positive and negative (there might be some repetitions!). This is your array of integers, with the **_i-th_** row of the file specifying the **_i-th_** entry of the array.

Your task is to compute the number of target values t in the interval [-10000,10000] (inclusive) such that there are distinct numbers **_x, y_** in the input file that satisfy **_x + y = t_**. (NOTE: ensuring distinctness requires a one-line addition to the algorithm from lecture.)

Write your numeric answer (an integer between 0 and 20001) in the space provided.

OPTIONAL CHALLENGE: If this problem is too easy for you, try implementing your own hash table for it. For example, you could compare performance under the chaining and open addressing approaches to resolving collisions.
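
For the optional challenge, a hand-rolled hash table with chaining could look roughly like the sketch below. This is only an illustration: the class name `ChainedHashSet`, the initial bucket count, and the resize rule (double the buckets once the load factor exceeds 2) are assumptions made here and are not part of the assignment or of the solution that follows.

```python
class ChainedHashSet:
    """Minimal hash set of integers using separate chaining (illustrative only)."""

    def __init__(self, num_buckets=1024):
        # Each bucket is a plain list holding the keys that hash to it.
        self.buckets = [[] for _ in range(num_buckets)]
        self.size = 0

    def _bucket(self, key):
        # Python's % yields a non-negative result for a positive modulus,
        # so negative keys still map to a valid bucket index.
        return self.buckets[hash(key) % len(self.buckets)]

    def add(self, key):
        bucket = self._bucket(key)
        if key not in bucket:
            bucket.append(key)
            self.size += 1
            # Assumed policy: grow when the average chain length exceeds 2.
            if self.size > 2 * len(self.buckets):
                self._resize()

    def __contains__(self, key):
        return key in self._bucket(key)

    def _resize(self):
        # Rehash every key into a table with twice as many buckets.
        old_keys = [k for bucket in self.buckets for k in bucket]
        self.buckets = [[] for _ in range(2 * len(self.buckets))]
        for k in old_keys:
            self._bucket(k).append(k)
```

An open-addressing variant would replace the per-bucket lists with a single flat array and a probing rule, which is the comparison the challenge suggests.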

In [1]:
# import necessary modules
from time import time
from bisect import bisect

### 1. Load and explore input data

In [2]:
def loadArray(filename):
    '''Load the input array and split it into two sorted sublists:
        L1 with negative values and L2 with non-negative values'''
    L1, L2 = [], []
    with open(filename) as f:
        for line in f:
            k = int(line)
            if k < 0:
                L1.append(k)
            else:
                L2.append(k)
    return sorted(L1), sorted(L2)

In [3]:
# explore input data: they are represented by large negatives and positives
start = time()
L1, L2 = loadArray('algo1-programming_prob-2sum.txt')
finish = time()
print(f'Array is loaded in {finish - start:.2f} secs')
print(f'len(L1) = {len(L1)}, len(L2) = {len(L2)}')
print(f'min(L1) = {min(L1)}, max(L1) = {max(L1)}')
print(f'min(L2) = {min(L2)}, max(L2) = {max(L2)}')

Array is loaded in 0.95 secs
len(L1) = 499990, len(L2) = 500010
min(L1) = -99999887310, max(L1) = -87405
min(L2) = 112177, max(L2) = 99999662302


**Conclusion:** The input data consist of large negative and large positive numbers: the largest negative is -87405 and the smallest positive is 112177. The two groups are roughly equal in size and are stored in the sorted subarrays L1 and L2. Two positives sum to at least 2 × 112177 = 224354 and two negatives to at most 2 × (-87405) = -174810, so a sum x+y can fall in the range -10000 to +10000 only if one number is negative and the other positive. Hence it is enough to iterate x over the negative subarray L1 and look up y in the positive subarray L2 within the interval -x-10000 to -x+10000. That is a single traversal of L1 with one binary-search lookup into L2 per x, and because the values are spread over a range of about ±1e11, the interval usually contains no items or very few.
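
The windowed lookup can be tried on a toy example first. The sketch below is an illustration under assumptions: the function name `count_targets`, the `width` parameter, and the sample values are made up here, and the function is only correct when no two numbers from the same list can sum into the target range (as is the case for the data explored above).

```python
from bisect import bisect_left, bisect_right

def count_targets(neg, pos, width):
    """Count distinct t in [-width, width] with t = x + y, x in neg, y in pos.

    Assumes neg and pos are sorted and that no pair drawn from the same
    list can sum into [-width, width].
    """
    found = set()
    for x in neg:
        lo = bisect_left(pos, -x - width)    # first y with y >= -x - width
        hi = bisect_right(pos, -x + width)   # one past the last y with y <= -x + width
        for y in pos[lo:hi]:
            found.add(x + y)
    return len(found)

# Hypothetical toy data, not taken from the assignment file.
neg = [-500, -300, -120]
pos = [110, 290, 310, 495, 900]
print(count_targets(neg, pos, width=10))  # prints 3: the sums -10, -5 and 10
```

Using `bisect_left` for the lower bound and `bisect_right` for the upper bound keeps both ends of the interval inclusive.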

**Note:** My first implementation traversed x over the whole dataset and looked up y=t-x in a hash table for each t in the range -10000 to 10000. That means about 1e6 x 20e3 = 20e9, or 20 billion, lookups! It produced the correct result, but the code ran for over 8300 sec (2.3 hours). The current implementation produces the correct answer in less than a second!
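
For comparison, the brute-force idea described in the note can be sketched as follows. This is a reconstruction under assumptions, not the original first implementation: the function name `count_targets_naive` and its parameters are hypothetical, and Python's built-in `set` stands in for the hash table.

```python
def count_targets_naive(numbers, lo, hi):
    """Count t in [lo, hi] for which distinct x, y in numbers satisfy x + y = t."""
    values = set(numbers)  # hash-based membership test, duplicates collapse
    count = 0
    for t in range(lo, hi + 1):
        # One hash lookup per (t, x) pair; the x != t - x check enforces distinctness.
        if any((t - x) in values and (t - x) != x for x in values):
            count += 1
    return count

# Hypothetical toy data, not taken from the assignment file.
print(count_targets_naive([-500, -300, -120, 110, 290, 310, 495, 900], -10, 10))  # prints 3
```

The work grows with the number of targets times the number of values, which is why this approach becomes so slow on the full file.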

### 2. Efficient algorithm for 2-SUM problem based on sorted arrays

In [4]:
# if x is in L1, then y must be in L2 and within [-x-10000, -x+10000]
from bisect import bisect_left  # lower bound must include y == -x-10000
start = time()
target = len({x + y for x in L1 for y in L2[bisect_left(L2, -x-10000):bisect(L2, -x+10000)]})
finish = time()
print(f'The number of target values = xxx was calculated in {finish-start:.2f} secs')

The number of target values = xxx was calculated in 0.74 secs
