### Problem: Use a heap to generate a list of the top k pairs of websites by usage similarity

To do this, we'll use the Jaccard index as the similarity score and a min-heap to keep track of the top-valued pairs by similarity. We'll pop the smallest values and push larger ones if they are available. Note that we first have to fill the heap with placeholder entries, which sets the number of pairs we want to report, and then call heapq.heappushpop to handle the ordering.
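As a minimal sketch of the fixed-size heap idea on plain numbers: the helper name `top_k` and the use of `float("-inf")` as the placeholder are assumptions made for this illustration, not part of the solution below, which uses `(0, ('', ''))` placeholders instead.

```python
import heapq

def top_k(values, k):
    """Return the k largest values via a fixed-size min-heap (hypothetical helper)."""
    # k identical placeholders; a list of equal values already satisfies the heap property.
    heap = [float("-inf")] * k
    for v in values:
        # Push v, then pop the smallest entry, so the heap always holds exactly k items.
        heapq.heappushpop(heap, v)
    # The heap itself is not sorted, so sort before returning.
    return sorted(heap, reverse=True)

print(top_k([5, 1, 9, 3, 7], 2))  # prints [9, 7]
```

If fewer than k values are supplied, some placeholders remain in the result, so a caller would likely want to filter them out.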

In [6]:
import heapq
from collections import defaultdict

def similarity(a,b,visitors):

  ## Similarity score (Jaccard index) = size of intersection / size of union
  return len(visitors[a] & visitors[b])/len(visitors[a] | visitors[b])

def top_pairs(log,k):

  visitors = defaultdict(set)
  for site, user in log:
    visitors[site].add(user)

  #pairs is a plain list that heapq will maintain as a min-heap
  pairs = []
  sites = list(visitors.keys())

  #Get the heap ready by pre-filling it with k placeholder entries.
  #Each entry is (score, (site_a, site_b)); pushing k identical entries leaves a valid heap of size k
  for _ in range(k):
    heapq.heappush(pairs,(0,('','')))
  
  for i in range(len(sites)-1):
    for j in range(i+1,len(sites)):
      score = similarity(sites[i],sites[j],visitors)
      #This pushes a new value to the heap and pops the smallest value so that 
      #we maintain a heap of only the largest. Note that heapq only supports min-heap
      #see https://docs.python.org/3/library/heapq.html
      heapq.heappushpop(pairs,(score,(sites[i],sites[j])))
    
  print(pairs)

In [7]:
sites = [("google.com",1),("google.com",3),("google.com",5),
 ("pets.com",1),("pets.com",2),("yahoo.com",6),
 ("yahoo.com",2),("yahoo.com",3),("yahoo.com",4),("yahoo.com",5),
 ("wikipedia.org",4),("wikipedia.org",5),("wikipedia.org",6)]

In [10]:
top_pairs(sites,k=3)

[(0.25, ('google.com', 'pets.com')), (0.3333333333333333, ('google.com', 'yahoo.com')), (0.6, ('yahoo.com', 'wikipedia.org'))]
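The list above is printed in heap order: only the first entry is guaranteed to be the smallest, and the rest are not necessarily sorted. A minimal sketch of ranking the result from most to least similar, assuming the heap contents shown above; the name `ranked` and the placeholder filter are introduced here for illustration only.

```python
# Heap contents copied from the output above.
pairs = [(0.25, ('google.com', 'pets.com')),
         (0.3333333333333333, ('google.com', 'yahoo.com')),
         (0.6, ('yahoo.com', 'wikipedia.org'))]

# Drop any leftover placeholders (empty site names), which remain when there
# are fewer than k real pairs, then sort by score in descending order.
ranked = sorted((p for p in pairs if p[1] != ('', '')), reverse=True)

for score, (a, b) in ranked:
    print(f"{a} / {b}: {score:.2f}")
```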
