# Heaps

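A heap is a binary tree that satisfies an ordering criterion (the heap property):
- in a min-heap, every child node is greater than or equal to its parent, so the root is the minimum
- in a max-heap, every child node is less than or equal to its parent, so the root is the maximum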
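
Python's `heapq` stores the tree in a flat list, where the children of index `i` sit at indices `2*i + 1` and `2*i + 2`. As a minimal sketch of the min-heap property under that layout (the helper name `is_min_heap` is hypothetical and not part of `heapq`):

```python
from heapq import heapify

def is_min_heap(arr):
    """Return True if every parent is <= each of its children."""
    n = len(arr)
    for i in range(n):
        for child in (2 * i + 1, 2 * i + 2):
            if child < n and arr[i] > arr[child]:
                return False
    return True

data = [6, 2, 9, 4, 7, 1]
print(is_min_heap(data))   # the raw list violates the property (6 > 2)
heapify(data)
print(is_min_heap(data))   # heapify rearranges the list in place
```

Note that a valid heap is usually not a sorted list: only the parent/child relation is guaranteed.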

This makes heaps useful whenever we have to keep track of the minimum or maximum of a collection that changes over time. For a heap with N elements, the operations and their costs are:

- push: O(log N)
- pop: O(log N)
- get min/max: O(1)
- heapify (i.e. turn an array into a heap): O(N)
- nlargest (the k largest out of N elements): O(N log k)
- nsmallest (the k smallest out of N elements): O(N log k)
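
The constant-time "get min" has no dedicated function in `heapq`: the smallest item is simply the first element of the list. A short sketch contrasting peeking with popping:

```python
from heapq import heapify, heappop

heap = [6, 2, 9, 4, 7, 1]
heapify(heap)

smallest = heap[0]            # peek: O(1), the heap is left unchanged
size_after_peek = len(heap)

popped = heappop(heap)        # pop: O(log N), removes the smallest item
print(smallest, size_after_peek, popped, len(heap))
```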

In [5]:
from heapq import heapify, heappush, heappop, nlargest, nsmallest

In [18]:
heap = [6,2,9,4,7,1]
heapify(heap)
for _ in range(len(heap)):
    print( heappop(heap) )

1
2
4
6
7
9


In [14]:
# pushing the items one by one gives the same result,
# but costs O(N log N) in total instead of O(N) for heapify:

heap = []
for i in [6,2,9,4,7,1]:
    heappush(heap,i)
    
for _ in range(len(heap)):
    print( heappop(heap) )

1
2
4
6
7
9


Note that the `heapq` functions used here implement a min-heap only. A common workaround is to negate the values, which turns the min-heap into a proxy max-heap:

In [11]:
heap = [6,2,9,4,7,1]
heap = [-n for n in heap]
heapify(heap)
for _ in range(len(heap)):
    print( -heappop(heap) )

9
7
6
4
2
1


## Using nsmallest and nlargest
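In Python, we can call `nsmallest()` and `nlargest()` directly on any iterable, such as a list; it does not have to be heapified first. Conceptually, `nlargest(k, ...)` scans the input once while keeping a heap of the k largest items seen so far (`nsmallest` does the mirror image), and then returns those k items in sorted order.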

In [33]:
heap = [6,2,9,4,7,1]
print( nsmallest(3,heap) )
print( nlargest(3,heap) )

[1, 2, 4]
[9, 7, 6]
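
To make the bounded-heap idea concrete, here is a simplified sketch of how a "k largest" function can be built from a min-heap of size k. The name `k_largest` is hypothetical, and the real `heapq.nlargest` differs in its details (key functions, tie handling, special cases):

```python
from heapq import heapify, heappushpop
from itertools import islice

def k_largest(items, k):
    """Return the k largest items in descending order, using a size-k min-heap."""
    if k <= 0:
        return []
    it = iter(items)
    heap = list(islice(it, k))   # start with the first k items
    heapify(heap)                # heap[0] is the smallest of the current top k
    for x in it:
        if x > heap[0]:
            heappushpop(heap, x) # push x, then drop the smallest: O(log k)
    return sorted(heap, reverse=True)

print(k_largest([6, 2, 9, 4, 7, 1], 3))
```

Because the heap never holds more than k items, each of the N input items costs at most O(log k), which gives the O(N log k) total.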


## Problem: get the top-k most frequent elements in a list

We could solve this problem with frequency counting + sorting, which would be O(N log N), where N is the number of unique elements in the list. However, we don't need to sort everything; we only need the top-k elements. That's why a heap-based solution wins here. The solution shown below runs in O(N log k), after a counting pass that is linear in the length of the list.

In [40]:
from collections import Counter
def topk(nums,k):
    counts = Counter(nums)
    return nlargest(k,counts.keys(), key=lambda n: counts[n])

topk([1,1,1,2,2,3],2)

[1, 2]
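
As an aside, `Counter` has a built-in `most_common(k)` method that returns the k `(element, count)` pairs with the highest counts (CPython implements it on top of `heapq.nlargest`, though that is an implementation detail). A sketch of the same function written that way, with the hypothetical name `topk_most_common`:

```python
from collections import Counter

def topk_most_common(nums, k):
    counts = Counter(nums)
    # most_common(k) yields (element, count) pairs; keep only the elements
    return [value for value, _ in counts.most_common(k)]

print(topk_most_common([1, 1, 1, 2, 2, 3], 2))
```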

## Problem: number of meeting rooms

Given a list of meeting intervals, what's the number of conference rooms needed to accommodate all meetings?

This problem can be elegantly solved with a min-heap: 
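1. sort the intervals by start time
2. keep a min-heap of the end times of the meetings that currently occupy a room: the head of the heap is then the earliest time at which a room becomes free
3. loop through the intervals. If a meeting starts after the head of the min-heap, that room is free again, so pop the head. In either case, push the meeting's end time onto the heap.

Each meeting causes at most one pop and exactly one push, so the heap never shrinks, and its final size is the number of rooms needed.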

Example:\
Input: intervals = [[0,30],[5,10],[15,20]]\
Output: 2

In [10]:
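def num_rooms(intervals):
    # no meetings, no rooms
    if not intervals:
        return 0

    # process meetings by start time (sorted() leaves the input untouched)
    intervals = sorted(intervals, key=lambda x: x[0])
    heap = [intervals[0][1]]  # min-heap of end times of occupied rooms

    for start, end in intervals[1:]:
        # the earliest-ending meeting is over, so its room can be reused
        # (use >= instead if back-to-back meetings may share a room)
        if start > heap[0]:
            heappop(heap)
        heappush(heap, end)
    return len(heap)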

In [11]:
num_rooms( [[0,30],[5,10],[15,20]] )

2
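
To see why the final heap size is the answer, it can help to watch the heap evolve. The following sketch (the generator name `trace_rooms` is hypothetical) applies the same steps and reports the end times held in the heap after each meeting is placed:

```python
from heapq import heappush, heappop

def trace_rooms(intervals):
    """Yield (interval, sorted end times in the heap) after each meeting."""
    heap = []  # end times of meetings currently holding a room
    for start, end in sorted(intervals, key=lambda x: x[0]):
        if heap and start > heap[0]:
            heappop(heap)        # earliest-ending meeting is over: reuse its room
        heappush(heap, end)
        yield (start, end), sorted(heap)

for interval, end_times in trace_rooms([[0, 30], [5, 10], [15, 20]]):
    print(interval, end_times)
```

The meeting `[15, 20]` starts after the earliest end time (10), so it takes over that room and the heap stays at size 2.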


## Problem: get the k closest points to origin

This is another problem of the type "get the k smallest" or "get the k largest". In these cases, a heap is usually preferred when k is much smaller than the length of the array. Why? It's a waste to sort the entire array!

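Let's say the array has length 1M and we want the 5 smallest elements. Sorting costs O(N log N) while the heap costs O(N log k), so in terms of comparison counts the heap would be roughly log(1M)/log(5) ~ 9X cheaper. This is a back-of-the-envelope estimate that ignores constant factors.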

In [3]:
import numpy as np
np.log(1e6)/np.log(5)

8.58405934844036

In [4]:
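import heapq

def kClosest(points, k):
    '''
    max-heap solution, O(N log k)
    Note we use the negative squared distance here because python's heapq is a min-heap:
    with negated keys, heap[0] is the farthest of the k closest points found so far.
    '''

    def calc_dist(p):
        # return the negative squared distance
        return -(p[0]**2 + p[1]**2)

    k = min(k, len(points))  # guard against k > len(points)
    heap = []
    for i in range(k):
        dist = calc_dist(points[i])
        heapq.heappush(heap, (dist, i))  # O(log k)

    for i in range(k, len(points)):
        dist = calc_dist(points[i])
        # closer than the farthest point kept so far: swap it in
        if heap and dist > heap[0][0]:
            heapq.heappop(heap)
            heapq.heappush(heap, (dist, i))  # O(log k)

    # the result follows heap order, which is not sorted by distance
    return [points[i] for _, i in heap]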
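
For comparison, the same result can be obtained in one call with `nsmallest` and a `key` function. This is a sketch with the hypothetical name `k_closest`; unlike the manual version, it returns the points ordered from closest to farthest:

```python
from heapq import nsmallest

def k_closest(points, k):
    # squared distance is enough for ranking, so no square root is needed
    return nsmallest(k, points, key=lambda p: p[0] ** 2 + p[1] ** 2)

print(k_closest([[1, 3], [-2, 2], [5, 8], [0, 1]], 2))
```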