## CS 2302 - Lab 5 - Hash Tables + Heaps





## **Before you start**

Make a copy of this Colab by clicking on File > Save a Copy in Drive.

Name:  Salvador Robles Herrera

Student ID:  80683116

### Grading
As stated in the syllabus, your lab consists of two parts: the source code and the report. This colab counts as your source code submission only. However, for your report submission, you are more than welcome to extend your colab to include what is required for the report. Alternatively, you can use any other text editor to write your lab report (Google Docs, Word, etc.). I personally recommend sticking with Google Colab, since you can write code to draw the required plots, which makes the whole process simpler.

Each subsection in this colab is marked with point values, totaling 100 points.


## Problem 1 

### [50 points] Least Recently Used Cache

Your job is to design and implement a data structure called a [Least Recently Used (LRU)](https://en.wikipedia.org/wiki/Cache_replacement_policies#LRU) cache. This data structure supports the following operations:

- get(key): Returns the value of the key if the key exists in the cache; otherwise returns None.

- put(key, value): Inserts the key/value pair if the key is not already in the cache; otherwise replaces the value. When the cache reaches its maximum capacity, it should invalidate the least recently used item before inserting a new item.

- size(): Returns the number of key/value pairs currently stored in the cache.

- max_capacity(): Returns the maximum capacity of the cache.

All operations MUST run in O(1) time complexity to receive credit. You are free to use Python’s set and/or dictionary data structures. If you need to use a doubly linked list (hint), you need to code it yourself.
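
The four operations above can be pinned down with a small reference model. The sketch below is only a behavioural cross-check, not a valid submission: it swaps in `collections.OrderedDict` for the hand-written doubly linked list, and the class name `ReferenceLRU` and the zero-capacity behaviour are assumptions made for illustration.

```python
from collections import OrderedDict


class ReferenceLRU:
    """Reference model only: OrderedDict stands in for the hand-written list."""

    def __init__(self, max_capacity=4):
        self._max_capacity = max_capacity
        self._items = OrderedDict()  # oldest entry first, newest entry last

    def get(self, key):
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key, value):
        if key in self._items:
            self._items[key] = value
            self._items.move_to_end(key)
            return
        if self._max_capacity <= 0:
            return  # assumption: a zero-capacity cache stores nothing
        if len(self._items) >= self._max_capacity:
            self._items.popitem(last=False)  # evict the least recently used entry
        self._items[key] = value

    def size(self):
        return len(self._items)

    def max_capacity(self):
        return self._max_capacity


cache = ReferenceLRU(max_capacity=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")     # "a" becomes the most recently used key
cache.put("c", 3)  # cache is full, so "b" is evicted
print(cache.get("b"), cache.get("a"), cache.get("c"))  # None 1 3
```

Running the same sequence of calls against a hand-written cache and against this model is one way to spot eviction-order mistakes.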
 

In [14]:
class nodeLL:
    def __init__(self, key = 0, item=0):
        self.key = key
        self.item = item
        self.next = None
        self.prev = None

class doublyLL:
    def __init__(self):
        self.head = None
        self.tail = None

    def insertEnd(self, node):
        if self.head is None and self.tail is None:
          self.head = node
          self.tail = node
        else:
          self.tail.next = node
          node.prev = self.tail
          self.tail = node

    def changeHead(self, node):
        #Replaces the current head node with the given node
        if self.head is None or self.head == self.tail: #Empty or single-element list
          self.head = node
          self.tail = node
          return
        node.next = self.head.next
        self.head.next.prev = node
        self.head = node
    
    def moveToEnd(self, node):
        if self.head is None: #If empty
          return
        if self.head == self.tail: #if only one element
          return 
        if self.head == node: #If at the head
          self.head.next.prev = None
          self.head = self.head.next
          node.next = None
          node.prev = None
          self.insertEnd(node)
          return
        if self.tail == node: #If it is at the tail no changes needed
          return
        node.next.prev = node.prev 
        node.prev.next = node.next
        node.next = None
        node.prev = None
        self.insertEnd(node) 

    def removeHead(self): 
        if self.head is None:
          return
        if self.head == self.tail:
          self.head = None
          self.tail = None
          return
        self.head.next.prev = None
        self.head = self.head.next

    def printLL(self):
        items = []
        iter = self.head    
        while iter:
          if iter.prev != None:
            print("Prev: ", iter.prev.item)
          if iter != None:
            print("Curr: ", iter.item)
          if iter.next != None:
            print("Next: ", iter.next.item)
          items.append(iter.item)
          iter = iter.next
          print()    
        print(items)
        print()
        print()

class LRUCache:
    def __init__(self, max_capacity=4):
        self._max_capacity = max_capacity # maximum memory capacity
        # TODO: Feel free to add more lines here
        self.keys = set()
        self.pairs = {}
        self.linkedList = doublyLL()
        self._size = 0

    # TODO: Implement this method - Required Time Complexity: O(1)
    # Gets the value of the key if the key exists 
    # in the cache, otherwise return None.
    def get(self, key):
        """Returns the value stored for key and marks it as most recently used.
        Returns None if key is not in the cache."""
        if key not in self.keys:
          return None
        
        self.linkedList.moveToEnd(self.pairs[key])

        return self.pairs[key].item
        
    # TODO: Implement this method - Required Time Complexity: O(1)
    # Insert the key/value pair if the key is not already in the cache;
    # otherwise replace its value.
    # When the cache reaches its maximum capacity, it should invalidate the 
    # least recently used item before inserting a new item.
    def put(self, key, value):
        """Inserts or updates key and marks it as most recently used.
        Evicts the least recently used pair first if the cache is full."""

        if key in self.keys: #If the key is already in the lru
          self.pairs[key].item = value
          self.linkedList.moveToEnd(self.pairs[key])
        else: #ADDS PAIR KEY-VALUE
          if self._max_capacity <= 0: #A cache with no capacity stores nothing
            return
          if self._max_capacity <= self._size: #If full, delete lru
            lru_key = self.linkedList.head.key
            del self.pairs[lru_key]
            self.keys.discard(lru_key) #Otherwise get() would hit a stale key
            self.linkedList.removeHead()
            self._size -= 1
          #Add key-value pair to lru, and node to linked list
          node = nodeLL(key, value) 
          self.pairs[key] = node
          self.keys.add(key)
          self.linkedList.insertEnd(node)
          self._size += 1


    # TODO: Implement this method - Required Time Complexity: O(1)
    # Returns the number of key/value pairs currently stored in the cache
    def size(self):
        """Returns the number of key/value pairs currently stored."""
        return self._size
    
    # TODO: Implement this method - Required Time Complexity: O(1)
    # Returns the maximum capacity of the cache
    def max_capacity(self):
        """Returns the maximum number of pairs the cache can hold."""
        return self._max_capacity
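
The linked list above handles the empty, single-node, head, and tail cases separately. A common alternative, sketched here under assumed names (`Node`, `SentinelList`, and the `keys` helper are hypothetical and not part of the lab), keeps two dummy sentinel nodes so that every real node always has both neighbours and unlinking needs no `None` checks.

```python
class Node:
    def __init__(self, key=None, item=None):
        self.key = key
        self.item = item
        self.prev = None
        self.next = None


class SentinelList:
    """Doubly linked list with dummy head/tail nodes (hypothetical helper)."""

    def __init__(self):
        self.head = Node()  # sentinel placed before the least recently used node
        self.tail = Node()  # sentinel placed after the most recently used node
        self.head.next = self.tail
        self.tail.prev = self.head

    def unlink(self, node):
        # A real node always sits between two other nodes, so both links exist.
        node.prev.next = node.next
        node.next.prev = node.prev
        node.prev = node.next = None

    def insert_end(self, node):
        last = self.tail.prev
        last.next = node
        node.prev = last
        node.next = self.tail
        self.tail.prev = node

    def move_to_end(self, node):
        self.unlink(node)
        self.insert_end(node)

    def remove_head(self):
        first = self.head.next
        if first is self.tail:  # the list holds no real nodes
            return None
        self.unlink(first)
        return first

    def keys(self):
        out, cur = [], self.head.next
        while cur is not self.tail:
            out.append(cur.key)
            cur = cur.next
        return out


lst = SentinelList()
nodes = [Node(k, k * 10) for k in range(3)]
for n in nodes:
    lst.insert_end(n)
lst.move_to_end(nodes[0])
print(lst.keys())               # [1, 2, 0]
evicted = lst.remove_head()
print(evicted.key, lst.keys())  # 1 [2, 0]
```

Each operation touches a fixed number of pointers, so the O(1) requirement still holds; the trade-off is two extra nodes per list.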
        

In [15]:
# Use this code cell to test the implementation of your LRUCache (test cases)
linked = doublyLL()
node1 = nodeLL(0,10)
node2 = nodeLL(1,9)
node3 = nodeLL(2,8)
node4 = nodeLL(3,20)

linked.insertEnd(node1)
linked.insertEnd(node2)
linked.insertEnd(node3)
linked.printLL()
linked.changeHead(node4)
linked.printLL()
linked.moveToEnd(node2)
linked.removeHead()
linked.printLL()

lru = LRUCache()

lru.put(1,10)
lru.put(2,20)
lru.put(3,30)

print(lru.pairs)
print(lru.get(1))
print(lru.get(2))
print(lru.get(3))
lru.put(3,50)
print(lru.get(3))
print(lru.get(5))
lru.put(10,100)
print(lru.get(10))
print(lru.pairs)
lru.put(5,50)
print(lru.get(5))
print(lru.pairs)

Curr:  10
Next:  9

Prev:  10
Curr:  9
Next:  8

Prev:  9
Curr:  8

[10, 9, 8]


Curr:  20
Next:  9

Prev:  20
Curr:  9
Next:  8

Prev:  9
Curr:  8

[20, 9, 8]


Curr:  8
Next:  9

Prev:  8
Curr:  9

[8, 9]


{1: <__main__.nodeLL object at 0x7ff45c5e86d8>, 2: <__main__.nodeLL object at 0x7ff45c5e8710>, 3: <__main__.nodeLL object at 0x7ff45c5e8940>}
10
20
30
50
None
100
{1: <__main__.nodeLL object at 0x7ff45c5e86d8>, 2: <__main__.nodeLL object at 0x7ff45c5e8710>, 3: <__main__.nodeLL object at 0x7ff45c5e8940>, 10: <__main__.nodeLL object at 0x7ff45c58aa20>}
50
{2: <__main__.nodeLL object at 0x7ff45c5e8710>, 3: <__main__.nodeLL object at 0x7ff45c5e8940>, 10: <__main__.nodeLL object at 0x7ff45c58aa20>, 5: <__main__.nodeLL object at 0x7ff45c5e86d8>}


In [16]:
#REAL TEST CASES

#LRU test cases with maximum capacity of 4
lru2 = LRUCache()
print("What I got: ", lru2.max_capacity(), "Expected: 4")
print("What I got: ", lru2.get(0), " Expected: None")
lru2.put(0,10)
print("What I got: ", lru2.get(0), " Expected: 10")
print("What I got: ", lru2.size(), " Expected: 1")
lru2.put(1,20)
lru2.put(2,500)
lru2.put(3,70)
print("What I got: ", lru2.get(0), lru2.get(1), lru2.get(2), lru2.get(3), " Expected: 10 20 500 70")

lru2.put(3,100)
lru2.put(2,100)
lru2.put(0,100)

print("What I got: ", lru2.size(), " Expected: 4")
print("What I got: ", lru2.get(0), lru2.get(2), lru2.get(3), " Expected: 100 100 100")

lru2.put(5,100)
#Key 1 is the least recently used, so it is the one evicted
print(lru2.pairs)

What I got:  4 Expected: 4
What I got:  None  Expected: None
What I got:  10  Expected: 10
What I got:  1  Expected: 1
What I got:  10 20 500 70  Expected: 10 20 500 70
What I got:  4  Expected: 4
What I got:  100 100 100  Expected: 100 100 100
{0: <__main__.nodeLL object at 0x7ff46ff5b630>, 2: <__main__.nodeLL object at 0x7ff46ff5b710>, 3: <__main__.nodeLL object at 0x7ff46ff5b0b8>, 5: <__main__.nodeLL object at 0x7ff46ff5be80>}


## Problem 2 

### [50 points] Passwords

In [Lab 2](https://colab.research.google.com/drive/1BINN7dw1b0nIXZuAGp6qQhBPIk-IZ9Cl#scrollTo=MKsRDH5ZUdfasdv), you used multiple sorting algorithms to find the *k* most used passwords in a given data set. In this problem, you are asked to write another solution to the problem that uses a heap to find the *k* most used passwords (in descending order). When sorting, sort the passwords by the number of times they appear in the data set. If two passwords have the same frequency, the password with lower alphabetical order should come first. Your solution must use a heap (coded by yourself) and a dictionary (the one that comes with Python). If your code takes a long time to run, only use a subset of the passwords. 

Make sure the name of your heap class is "Heap" and that the method is named "heap_sort" or the auto-grader will have trouble grading it automatically.

#### Hints:
* Use a dictionary to count the number of occurrences of each password. That is, the key would be a string (password), and the value would be an integer (count/frequency). 

* Once the dictionary is created, create a heap. Traverse the dictionary, and as you visit the key, value pairs, insert them into the heap using the frequency/count as the attribute used by the heap to order the nodes.

* Once the heap is populated, perform the "extract" operation k times to find the k most used passwords.
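
The three hints can be sketched end to end as follows. This is a minimal illustration, not the required `Heap`/`heap_sort` interface: `TinyHeap` and `k_most_used` are hypothetical names, the heap stores `(-count, password)` tuples so that the most frequent password is extracted first, and ties are broken so that the smaller string comes out first, which is one assumed reading of the tie-break rule.

```python
class TinyHeap:
    """Array-backed min-heap of comparable entries (hypothetical helper)."""

    def __init__(self):
        self.tree = []

    def insert(self, entry):
        self.tree.append(entry)
        i = len(self.tree) - 1
        while i > 0:  # percolate up until the parent is no larger
            parent = (i - 1) // 2
            if self.tree[parent] <= self.tree[i]:
                break
            self.tree[parent], self.tree[i] = self.tree[i], self.tree[parent]
            i = parent

    def extract(self):
        if not self.tree:
            return None
        root = self.tree[0]
        last = self.tree.pop()
        if self.tree:
            self.tree[0] = last
            i, n = 0, len(self.tree)
            while True:  # percolate down toward the smaller child
                smallest = i
                for child in (2 * i + 1, 2 * i + 2):
                    if child < n and self.tree[child] < self.tree[smallest]:
                        smallest = child
                if smallest == i:
                    break
                self.tree[i], self.tree[smallest] = self.tree[smallest], self.tree[i]
                i = smallest
        return root


def k_most_used(passwords, k):
    # Hint 1: count occurrences with a dictionary.
    counts = {}
    for password in passwords:
        counts[password] = counts.get(password, 0) + 1

    # Hint 2: insert every (count, password) pair into the heap.
    heap = TinyHeap()
    for password, count in counts.items():
        heap.insert((-count, password))  # negated count: most frequent is smallest

    # Hint 3: extract k times.
    result = []
    for _ in range(min(k, len(counts))):
        neg_count, password = heap.extract()
        result.append((password, -neg_count))
    return result


sample = ["123456", "qwerty", "123456", "abc", "qwerty", "123456", "abc"]
print(k_most_used(sample, 2))  # [('123456', 3), ('abc', 2)]
```

Building the heap over all n distinct passwords costs O(n log n), and the k extractions cost O(k log n).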


In [17]:
# Your code goes here
# Import the files package
from google.colab import files

# Get a list of the zip files that have been uploaded into your colab
# environment.
zip_uploaded = !ls *.zip

# If the zip file is not already in the colab environment, upload it
if ('10-million-combos.zip' not in zip_uploaded):
  uploaded = files.upload()

# Unzip file
import zipfile
with zipfile.ZipFile('10-million-combos.zip', 'r') as zip_ref:
    zip_ref.extractall()

# Read the resulting txt file and print the first 15 lines 
passwords_file = open("10-million-combos.txt", "r", encoding="ISO-8859-1")

for i in range(15):
  line = passwords_file.readline()
  print(line) 

class PasswordTuple(object):
  def __init__(self, password, count):
    self.password = password
    self.count = count

  def __str__(self):
    return "password: " + str(self.password) + ", count: " + str(self.count)

# Read the passwords txt file 
passwords_file = open("10-million-combos.txt", "r", encoding="ISO-8859-1")

# List of distinct passwords, in order of first appearance
passwords_lst = []

password_dict = {}

for line in passwords_file:
  # The username and password are separated by \t. 
  # Extract password only from each line
  try:
    password = line.split("\t")[1]  
  except IndexError:  # no tab in the line, so there is no password field
    print("Skipping line as it does not contain username and/or password: ", line)
    continue  # skip the line

  # Remove new line character \n from the end of the line
  password = password.replace("\n","")

  if password in password_dict: 
    password_dict[password] += 1
  else:
    password_dict[password] = 1
    passwords_lst.append(password)


0000	00000000

0000	00001

0000	00001111

0000	000099

0000	00009999

0000	0000w

0000	5927499

0000	634252

0000	6911703

0000	701068

0000	721010

_0000	7227545yfnfif

0000	77777777

0000	8888

0000	99999

Skipping line as it does not contain username and/or password:  markcgilberteternity2969

Skipping line as it does not contain username and/or password:  sailer1216soccer1216
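
The counting step can also be tried without the 10-million-line file. The sketch below is an illustration under assumptions: `count_passwords` is a hypothetical helper, the sample lines are made up in the same `username<TAB>password` shape as the output above, and `str.partition` is used instead of `split`, so anything after the first tab is treated as the password.

```python
def count_passwords(lines):
    """Return (counts, skipped) for an iterable of 'username<TAB>password' lines."""
    counts = {}
    skipped = []
    for line in lines:
        username, sep, password = line.rstrip("\n").partition("\t")
        if not sep:  # no tab found, so there is no separate password field
            skipped.append(line)
            continue
        counts[password] = counts.get(password, 0) + 1
    return counts, skipped


sample_lines = [
    "0000\t00000000\n",
    "0000\t8888\n",
    "_0000\t8888\n",
    "markcgilberteternity2969\n",
]
counts, skipped = count_passwords(sample_lines)
print(counts)   # {'00000000': 1, '8888': 2}
print(skipped)  # ['markcgilberteternity2969\n']
```

Returning the skipped lines instead of printing them keeps the function easy to test on small inputs.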



In [63]:
import math

#I implemented a min heap that holds at most k passwords, so the smallest of the current top k is always at the root (the first element)
class Heap():
  def __init__(self,k):
      self.tree = []
      self.max = k
      self.size = 0

  def remove(self,pairs):
      if len(self.tree) < 1: #If there are no elements, don't remove
        return
      originalRoot = self.tree[0] #First element/root
      password = self.tree[len(self.tree)-1]
      self.tree.pop(-1) #Remove the last element; it replaces the root below
      self.size -= 1
      if self.size != 0:
        self.tree[0] = password
        self.percolate_down(0, pairs) #Put it in its correct place
      return originalRoot #Return that element

  def parent(self, i):
      if i == 0:
        return None
      return self.tree[(i-1) // 2]

  def leftChild(self, i):
      left = 2 * i + 1
      if left >= len(self.tree):
        return math.inf
      return self.tree[left]

  def rightChild(self, i):
      right = 2 * i + 2
      if right >= len(self.tree):
        return math.inf
      return self.tree[right]

  def insert(self, value, password, pairs):
      if self.size == self.max:
        if value < pairs[self.tree[0]]:
          return
        else:
          if value == pairs[self.tree[0]]:
            if password <= self.tree[0]:
              return
          self.remove(pairs)
      
      self.tree.append(password)
      self.percolate_up(len(self.tree)-1, pairs)
      self.size += 1

  def percolate_up(self, i, pairs):
      if i == 0:
        return
      
      parent = (i-1) // 2

      if pairs[self.tree[parent]] == pairs[self.tree[i]]:
        if self.tree[parent] < self.tree[i]:
          return 
      
      if pairs[self.tree[parent]] >= pairs[self.tree[i]]:
        self.tree[i], self.tree[parent] = self.tree[parent], self.tree[i]
        self.percolate_up(parent, pairs)

  def percolate_down(self,i,pairs):
      #Nodes are ordered by (count, password), the same order percolate_up uses:
      #smaller count first, and for equal counts the alphabetically smaller password first
      def key(index):
        return (pairs[self.tree[index]], self.tree[index])

      n = len(self.tree)
      while True:
        smallest = i
        for child in (2 * i + 1, 2 * i + 2):
          if child < n and key(child) < key(smallest):
            smallest = child
        if smallest == i: #Heap property holds, don't percolate down
          return
        self.tree[i], self.tree[smallest] = self.tree[smallest], self.tree[i]
        i = smallest


def heap_sort(password_dict, k):
  if k > len(password_dict):
    print("Note that k is larger than the number of passwords")
    return []
  
  arr = []
  heap = Heap(k)

  for password in password_dict:
    heap.insert(password_dict[password], password, password_dict)
    #print("Insert: ", heap.tree)

  for i in range(k):
    arr.insert(0,heap.remove(password_dict))
    #print("Remove: ", heap.tree)
  
  return arr


In [65]:
# Your test cases go here
sub_dictionary = {}
sub_dictionary2 = {}
for i in range(5):
  sub_dictionary[passwords_lst[i]] = password_dict[passwords_lst[i]]
  sub_dictionary2[passwords_lst[i+5]] = password_dict[passwords_lst[i+5]]

sub_dictionary3 = {'hello': 1, 'hellos': 1, 'hellp': 1, 'helli': 1, 'hella': 1}

print("Sub: ", sub_dictionary)
print("Sub2: ", sub_dictionary2)
print("Sub3: ", sub_dictionary3)

ret = heap_sort(sub_dictionary, 5)
print("Returned list: ", ret)

ret2 = heap_sort(sub_dictionary2, 5)
print("Returned list: ", ret2)

ret3 = heap_sort(sub_dictionary3, 5)
print("Returned list: ", ret3)


Sub:  {'00000000': 388, '00001': 23, '00001111': 47, '000099': 17, '00009999': 17}
Sub2:  {'0000w': 1, '5927499': 1, '634252': 2, '6911703': 3, '701068': 1}
Sub3:  {'hello': 1, 'hellos': 1, 'hellp': 1, 'helli': 1, 'hella': 1}
Returned list:  ['00000000', '00001111', '00001', '00009999', '000099']
Returned list:  ['6911703', '634252', '701068', '5927499', '0000w']
Returned list:  ['hellp', 'hellos', 'hello', 'helli', 'hella']
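
The returned lists above can be cross-checked against a plain sort. The sketch below is not a heap and is not meant as a solution; `top_k_reference` is a hypothetical helper that reproduces the ordering printed above, where equal counts come out in descending string order. If the tie-break rule is read the other way (smaller string first), the key would need to change accordingly.

```python
def top_k_reference(counts, k):
    """Sort by (count, password) in descending order; a cross-check, not a heap."""
    ranked = sorted(counts, key=lambda p: (counts[p], p), reverse=True)
    return ranked[:k]


sub = {'00000000': 388, '00001': 23, '00001111': 47, '000099': 17, '00009999': 17}
print(top_k_reference(sub, 5))
# ['00000000', '00001111', '00001', '00009999', '000099']

ties = {'hello': 1, 'hellos': 1, 'hellp': 1, 'helli': 1, 'hella': 1}
print(top_k_reference(ties, 5))
# ['hellp', 'hellos', 'hello', 'helli', 'hella']
```

Comparing this reference with the heap-based result on random sub-dictionaries is a quick way to catch percolation bugs.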


## How to Submit This Lab

1. File > Download .ipynb
2. Go to Blackboard, find the lab submission page, and upload the .ipynb file you just downloaded.

## Grading Rubric

|     Criteria    	|     Proficient    	|     Satisfactory    	|     Unsatisfactory    	|
|-	|-	|-	|-	|
|     Correctness    	|     The code compiles, runs, and solves the problem.                	|     The code compiles and runs, but does not solve the problem (partial implementation).    	|     The code does not compile/run, or little progress was made.          	|
|     Space and Time <br> complexities    	|     Appropriate for the problem.    	|     Can be greatly improved.    	|     Space and time complexity not analyzed     	|
|     Problem Decomposition    	|     Operations are broken down into loosely coupled, highly cohesive   methods    	|     Operations are broken down into methods, but they are not loosely   coupled/highly cohesive    	|     Most of the logic is inside a couple of big methods          	|
|     Style    	|     Variables and methods have meaningful/appropriate names     	|     Only a subset of the variables and methods have   meaningful/appropriate names     	|     Few or none of the variables and methods have meaningful/appropriate   names     	|
|     Robustness    	|     Program handles erroneous or unexpected input gracefully    	|     Program handles some erroneous or unexpected input gracefully    	|     Program does not handle erroneous or unexpected input gracefully    	|
|     Documentation    	|     Non-obvious code segments are well documented    	|     Some non-obvious code segments are documented    	|     Few or none of the non-obvious code segments are documented    	|
|     Report     	|     Covers all required material in a concise and clear way with proper   grammar and spelling.    	|     Covers a subset of the required material in a concise and clear way   with proper grammar and spelling.    	|     Does not cover enough material and/or the material is not presented   in a concise and clear way with proper grammar and spelling.    	|