- Random Access Memory
- Binary Numbers
- Fixed-Width Integers
- Arrays
- Strings
- Pointers
- Dynamic Arrays
- Linked Lists
- Hash Tables

Summary

- Arrays have O(1)-time lookups. But you need enough uninterrupted space in RAM to store the whole array. And the array items need to be the same size.

- But if your array stores pointers to the actual array items (like we did with our list of baby names), you can get around both those weaknesses. You can store each array item wherever there's space in RAM, and the array items can be different sizes. The tradeoff is that now your array is slower because it's not cache-friendly.

- Another problem with arrays is you have to specify their sizes ahead of time. There are two ways to get around this: dynamic arrays and linked lists. Linked lists have faster appends and prepends than dynamic arrays, but dynamic arrays have faster lookups.

- Fast lookups are really useful, especially if you can look things up not just by indices (0, 1, 2, 3, etc.) but by arbitrary keys ("lies", "foes"...any string). That's what hash tables are for. The only problem with hash tables is they have to deal with hash collisions, which means some lookups could be a bit slow.

- Each data structure has tradeoffs. You can't have it all. So you have to know what's important in the problem you're working on. What does your data structure need to do quickly? Is it lookups by index? Is it appends or prepends? Once you know what's important, you can pick the data structure that does it best. 

### 1. Array
Some languages (including Python) don't have these bare-bones arrays. 

Worst Case
- __space__:  O(n)
- __lookup__:  O(1)
- __append__:  O(1)
- __insert__:  O(n)
- __delete__:  O(n)
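Since Python has no bare-bones array, the O(1) lookup can be imitated with a flat block of bytes. This is only a sketch: `ITEM_SIZE`, `make_array`, and `lookup` are names made up for illustration, and it assumes every item is a 4-byte unsigned integer. Because every item has the same size, the position of item `index` is computed directly instead of being searched for.

```python
import struct

ITEM_SIZE = 4  # assumed: every item is a 4-byte unsigned integer


def make_array(values):
    # Pack the values back to back into one contiguous block of bytes
    return bytearray(struct.pack(f"<{len(values)}I", *values))


def lookup(memory, index):
    # Same-size items let us jump straight to the item's offset: O(1)
    offset = index * ITEM_SIZE
    (value,) = struct.unpack_from("<I", memory, offset)
    return value


memory = make_array([10, 20, 30])
print(lookup(memory, 2))  # 30
```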

### 2. Dynamic Array
Other names: array list, growable array, resizable array, mutable array 

Average Case 	
- __space__: 	O(n) 
- __lookup__: 	O(1)
- __append__: 	O(1)
- __insert__: 	O(n)
- __delete__: 	O(n)


Worst Case
- __space__: 	O(n)
- __lookup__: 	O(1)
- __append__: 	O(n)
- __insert__: 	O(n)
- __delete__: 	O(n)
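The gap between the O(1) average append and the O(n) worst-case append comes from resizing. Here is a minimal sketch, assuming the array doubles its capacity whenever it fills up (a common strategy, though not the only one). `DynamicArray` and `resize_count` are names made up for illustration.

```python
class DynamicArray:
    """Toy dynamic array that doubles its capacity when it runs out of room."""

    def __init__(self):
        self._capacity = 1
        self._size = 0
        # A fixed-length list stands in for a fixed-size block of memory
        self._items = [None] * self._capacity
        self.resize_count = 0

    def __len__(self):
        return self._size

    def __getitem__(self, index):
        if not 0 <= index < self._size:
            raise IndexError("index out of range")
        return self._items[index]

    def append(self, item):
        # Worst case: the block is full, so copy everything over first, O(n)
        if self._size == self._capacity:
            self._resize(2 * self._capacity)
        self._items[self._size] = item
        self._size += 1

    def _resize(self, new_capacity):
        new_items = [None] * new_capacity
        for i in range(self._size):
            new_items[i] = self._items[i]
        self._items = new_items
        self._capacity = new_capacity
        self.resize_count += 1


arr = DynamicArray()
for n in range(10):
    arr.append(n)
print(len(arr), arr.resize_count)  # 10 4
```

Ten appends trigger only four resizes (to capacities 2, 4, 8, and 16), which is why the cost averages out to O(1) per append.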


### 3. Hash Table
In Python 3.6, hash tables are called dictionaries. Python sets are built on the same kind of hash table as dictionaries, but they store only keys, with no values.

Average Case
- __space__: O(n) 
- __insert__: O(1) 
- __lookup__: O(1) 
- __delete__: O(1) 

Worst Case
- __space__: O(n)
- __insert__: O(n)
- __lookup__: O(n)
- __delete__: O(n)
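The O(n) worst case comes from hash collisions. The sketch below resolves collisions by chaining (each bucket holds a list of key-value pairs), which is one textbook approach; it is not how Python's own `dict` works internally, since CPython uses open addressing. `ChainedHashTable` and its method names are made up for illustration.

```python
class ChainedHashTable:
    """Toy hash table: each bucket is a list of (key, value) pairs."""

    def __init__(self, num_buckets=8):
        self._buckets = [[] for _ in range(num_buckets)]

    def _bucket_for(self, key):
        return self._buckets[hash(key) % len(self._buckets)]

    def insert(self, key, value):
        bucket = self._bucket_for(key)
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:
                bucket[i] = (key, value)  # overwrite an existing key
                return
        bucket.append((key, value))

    def lookup(self, key):
        # Scans one bucket: O(1) on average, O(n) if every key collided
        for existing_key, value in self._bucket_for(key):
            if existing_key == key:
                return value
        raise KeyError(key)

    def delete(self, key):
        bucket = self._bucket_for(key)
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:
                del bucket[i]
                return
        raise KeyError(key)


# With a single bucket, every key collides and lookups become a linear scan
table = ChainedHashTable(num_buckets=1)
table.insert("lies", 1)
table.insert("foes", 2)
print(table.lookup("foes"))  # 2
```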

### 4. Linked List

Worst Case
- __space__: O(n)
- __prepend__: O(1)
- __append__: O(1)
- __lookup__: O(n)
- __insert__: O(n)
- __delete__: O(n)

Most languages (including Python 3.6) don't provide a linked list implementation. Assuming we've already implemented our own, here's how we'd construct a linked list holding 5, 1, and 9:

In [1]:
class LinkedListNode(object):
    def __init__(self,val):
        self.key = val
        self.next = None

a = LinkedListNode(5)
b = LinkedListNode(1)
c = LinkedListNode(9)

a.next = b
b.next = c
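To make the O(n) lookup and O(1) prepend costs concrete, here is a small sketch built on the same node class. The helpers `lookup` and `prepend` are made up for illustration and are not part of any standard library.

```python
class LinkedListNode:
    def __init__(self, val):
        self.key = val
        self.next = None


def lookup(head, index):
    # Walk `index` hops from the head: O(n) in the worst case
    node = head
    for _ in range(index):
        if node is None:
            break
        node = node.next
    if node is None:
        raise IndexError("index out of range")
    return node.key


def prepend(head, val):
    # Only one pointer changes, no matter how long the list is: O(1)
    new_head = LinkedListNode(val)
    new_head.next = head
    return new_head


a = LinkedListNode(5)
a.next = LinkedListNode(1)
a.next.next = LinkedListNode(9)

print(lookup(a, 2))     # 9
head = prepend(a, 7)
print(lookup(head, 0))  # 7
```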

__Doubly Linked Lists__

In a basic linked list, each item stores a single pointer to the next element. In a doubly linked list, items have pointers to the next and the previous nodes. 

In [3]:
class DoubleLinkedListNode(object):
    def __init__(self,val):
        self.key = val
        self.next = None
        self.prev = None

a = DoubleLinkedListNode(5)
b = DoubleLinkedListNode(1)
c = DoubleLinkedListNode(9)

a.next = b
b.prev = a
b.next = c
c.prev = b
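One thing the extra `prev` pointer buys is the ability to walk the list from tail to head. A short sketch, where `link` and `keys_backward` are helper names made up for illustration:

```python
class DoubleLinkedListNode:
    def __init__(self, val):
        self.key = val
        self.next = None
        self.prev = None


def link(first, second):
    # Set both directions at once so the pointers never disagree
    first.next = second
    second.prev = first


def keys_backward(tail):
    # Follow prev pointers from the tail back to the head
    keys = []
    node = tail
    while node is not None:
        keys.append(node.key)
        node = node.prev
    return keys


a, b, c = (DoubleLinkedListNode(v) for v in (5, 1, 9))
link(a, b)
link(b, c)
print(keys_backward(c))  # [9, 1, 5]
```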

### 5. Queue

Worst Case
- __space__: O(n)
- __enqueue__: O(1)
- __dequeue__: O(1)
- __peek__: O(1)

__Implementation__

Queues are easy to implement with linked lists:
- To enqueue, insert at the tail of the linked list.
- To dequeue, remove at the head of the linked list.
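The two bullets above can be sketched as follows. This assumes we keep a pointer to both the head and the tail so that each operation stays O(1). `LinkedQueue` and `_Node` are names made up for illustration.

```python
class _Node:
    def __init__(self, item):
        self.item = item
        self.next = None


class LinkedQueue:
    def __init__(self):
        self._head = None  # next item to dequeue
        self._tail = None  # most recently enqueued item

    def enqueue(self, item):
        # Insert at the tail of the linked list
        node = _Node(item)
        if self._tail is None:
            self._head = node
        else:
            self._tail.next = node
        self._tail = node

    def dequeue(self):
        # Remove at the head of the linked list
        if self._head is None:
            raise IndexError("dequeue from empty queue")
        node = self._head
        self._head = node.next
        if self._head is None:
            self._tail = None
        return node.item

    def peek(self):
        if self._head is None:
            raise IndexError("peek at empty queue")
        return self._head.item


queue = LinkedQueue()
for item in (1, 2, 3):
    queue.enqueue(item)
print(queue.dequeue(), queue.peek())  # 1 2
```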

### 6. Stack
A stack stores items in a last-in, first-out (LIFO) order. You can implement a stack with either a linked list or a dynamic array—they both work pretty well.

Worst Case
- __space__: O(n)
- __push__: O(1)
- __pop__: O(1)
- __peek__: O(1)
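A minimal sketch of the dynamic-array option, using a Python list as the underlying storage. The class name `Stack` is made up for illustration. Note that with a dynamic array, push is O(1) on average rather than in the strict worst case, since an occasional push triggers a resize.

```python
class Stack:
    def __init__(self):
        self._items = []  # the end of the list is the top of the stack

    def push(self, item):
        self._items.append(item)

    def pop(self):
        if not self._items:
            raise IndexError("pop from empty stack")
        return self._items.pop()

    def peek(self):
        if not self._items:
            raise IndexError("peek at empty stack")
        return self._items[-1]

    def __len__(self):
        return len(self._items)


stack = Stack()
for item in (1, 2, 3):
    stack.push(item)
print(stack.pop(), stack.peek())  # 3 2
```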

### 7. Binary Tree 
A binary tree is a tree where every node has two or fewer children. The children are usually called left and right.

In [4]:
class BinaryTreeNode(object):

    def __init__(self, value):
        self.value = value
        self.left  = None
        self.right = None

In a perfect binary tree (every level completely full):

- Property 1: the number of nodes on each "level" doubles as we move down the tree.
    - Level 0: 2<sup>0</sup> nodes,
    - Level 1: 2<sup>1</sup> nodes,
    - Level 2: 2<sup>2</sup> nodes,
    - Level 3: 2<sup>3</sup> nodes,
    - etc.
- Property 2: the number of nodes on the last level is equal to the sum of the number of nodes on all other levels, plus 1.
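Both properties can be checked on a small tree whose levels are all full. In this sketch, `build_perfect_tree` and `nodes_per_level` are helper names made up for illustration. The level counts come from a breadth-first walk.

```python
from collections import deque


class BinaryTreeNode:
    def __init__(self, value):
        self.value = value
        self.left = None
        self.right = None


def build_perfect_tree(num_levels):
    # Every node gets two children until we run out of levels
    if num_levels == 0:
        return None
    node = BinaryTreeNode(num_levels)
    node.left = build_perfect_tree(num_levels - 1)
    node.right = build_perfect_tree(num_levels - 1)
    return node


def nodes_per_level(root):
    # Walk the tree one level at a time, recording each level's size
    counts = []
    queue = deque([root] if root is not None else [])
    while queue:
        counts.append(len(queue))
        for _ in range(len(queue)):
            node = queue.popleft()
            if node.left is not None:
                queue.append(node.left)
            if node.right is not None:
                queue.append(node.right)
    return counts


counts = nodes_per_level(build_perfect_tree(4))
print(counts)                              # [1, 2, 4, 8]
print(counts[-1] == sum(counts[:-1]) + 1)  # True
```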

### 8. Graph
- A graph organizes items in an interconnected network. 
- Most graph algorithms are O(n lg n) or even slower. Depending on the size of your graph, running algorithms across your nodes may not be feasible.

__Edge list:__ A list of all the edges in the graph: 

In [5]:
#Since node 3 has edges to nodes 1 and 2, [1, 3] and [2, 3] are in the edge list. 
graph = [[0, 1], [1, 2], [1, 3], [2, 3]]

__Adjacency list:__ A list where the index represents the node and the value at that index is a list of the node's neighbors: 

In [6]:
#Since node 3 has edges to nodes 1 and 2, graph[3] has the adjacency list [1, 2]. 
graph = [[1],
         [0, 2, 3],
         [1, 3],
         [1, 2],]

In [8]:
graph = {0: [1],
         1: [0, 2, 3],
         2: [1, 3],
         3: [1, 2],}

__Adjacency matrix:__  A matrix of 0s and 1s indicating whether node x connects to node y (0 means no, 1 means yes). 

In [9]:
#Since node 3 has edges to nodes 1 and 2, graph[3][1] and graph[3][2] have value 1. 
graph = [[0, 1, 0, 0],
         [1, 0, 1, 1],
         [0, 1, 0, 1],
         [0, 1, 1, 0],]
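The three representations describe the same undirected graph, so each can be derived from another. A sketch of the conversions, where the function names are made up for illustration and nodes are assumed to be numbered from 0:

```python
def edge_list_to_adjacency_list(edges, num_nodes):
    adjacency = [[] for _ in range(num_nodes)]
    for a, b in edges:
        # Undirected: record the edge in both directions
        adjacency[a].append(b)
        adjacency[b].append(a)
    return adjacency


def adjacency_list_to_matrix(adjacency):
    n = len(adjacency)
    matrix = [[0] * n for _ in range(n)]
    for node, neighbors in enumerate(adjacency):
        for neighbor in neighbors:
            matrix[node][neighbor] = 1
    return matrix


edges = [[0, 1], [1, 2], [1, 3], [2, 3]]
adjacency = edge_list_to_adjacency_list(edges, num_nodes=4)
print(adjacency)  # [[1], [0, 2, 3], [1, 3], [1, 2]]
print(adjacency_list_to_matrix(adjacency)[3])  # [0, 1, 1, 0]
```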

__Algorithms__

___BFS and DFS___
Lots of graph problems can be solved using just these traversals:

- Is there a path between two nodes in this undirected graph? Run DFS or BFS from one node and see if you reach the other one.
- What's the shortest path between two nodes in this undirected, unweighted graph? Run BFS from one node and backtrack once you reach the second. Note: BFS always finds the shortest path, assuming the graph is undirected and unweighted. DFS does not always find the shortest path.
- Can this undirected graph be colored with two colors? Run BFS, assigning colors as nodes are visited. Abort if we ever try to assign a node a color different from the one it was assigned earlier.
- Does this undirected graph have a cycle? Run BFS, keeping track of the number of times we're visiting each node. If we ever visit a node twice, then we have a cycle. 
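As an example of the shortest-path case, here is a sketch of BFS with backtracking on an adjacency-list graph stored as a dictionary. The function name `bfs_shortest_path` and the `came_from` bookkeeping are choices made for illustration. The same loop also answers the "is there a path?" question, since it returns `None` when the end node is never reached.

```python
from collections import deque


def bfs_shortest_path(graph, start, end):
    # came_from doubles as the visited set and the backtracking trail
    came_from = {start: None}
    queue = deque([start])

    while queue:
        node = queue.popleft()
        if node == end:
            # Backtrack from the end node to the start node
            path = []
            while node is not None:
                path.append(node)
                node = came_from[node]
            return path[::-1]
        for neighbor in graph[node]:
            if neighbor not in came_from:
                came_from[neighbor] = node
                queue.append(neighbor)

    return None  # end is not reachable from start


graph = {0: [1],
         1: [0, 2, 3],
         2: [1, 3],
         3: [1, 2]}
print(bfs_shortest_path(graph, 0, 3))  # [0, 1, 3]
```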

___Advanced graph algorithms___
- __Dijkstra's Algorithm:__ Finds the shortest path from one node to all other nodes in a weighted graph.
- __Topological Sort:__ Arranges the nodes in a directed, acyclic graph in a special order based on incoming edges.
- __Minimum Spanning Tree:__ Finds the cheapest set of edges needed to reach all nodes in a weighted graph.
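For a feel of what Dijkstra's algorithm involves, here is a compact sketch using Python's `heapq` module as the priority queue. It assumes all edge weights are non-negative and that the graph is a dictionary mapping each node to a list of `(neighbor, weight)` pairs. Both the representation and the sample `weighted_graph` are made up for illustration.

```python
import heapq


def dijkstra(graph, source):
    # Best known distance from source to each node reached so far
    distances = {source: 0}
    heap = [(0, source)]

    while heap:
        dist, node = heapq.heappop(heap)
        # Skip stale heap entries that were superseded by a shorter path
        if dist > distances.get(node, float("inf")):
            continue
        for neighbor, weight in graph[node]:
            new_dist = dist + weight
            if new_dist < distances.get(neighbor, float("inf")):
                distances[neighbor] = new_dist
                heapq.heappush(heap, (new_dist, neighbor))

    return distances


weighted_graph = {
    "a": [("b", 4), ("c", 1)],
    "b": [("d", 1)],
    "c": [("b", 2), ("d", 5)],
    "d": [],
}
print(dijkstra(weighted_graph, "a"))  # {'a': 0, 'b': 3, 'c': 1, 'd': 4}
```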


## TEST PROBLEMS

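Problem: sort a list of scores from highest to lowest, given the highest possible score.

O(n lg n) is the time to beat, since that's what a general-purpose comparison sort costs. We can't do better than O(n): even if our list of scores were already sorted, we'd have to do a full walk through the list to confirm that it was in fact fully sorted.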

In [32]:
unsorted_scores = [37,37, 89, 41, 65, 91, 53]
HIGHEST_POSSIBLE_SCORE = 100
def sort_scores(unsorted_scores, highest_possible_score):
    # List of 0s at indices 0..highest_possible_score
    score_counts = [0] * (highest_possible_score+1)

    # Populate score_counts
    for score in unsorted_scores:
        score_counts[score] += 1

    # Populate the final sorted list
    sorted_scores = []

    # For each item in score_counts
    for score in range(len(score_counts) - 1, -1, -1):
        count = score_counts[score]
        sorted_scores += [score]*count

    return sorted_scores

sort_scores(unsorted_scores,HIGHEST_POSSIBLE_SCORE)

[91, 89, 65, 53, 41, 37, 37]

In [52]:
#number_one = "193283492420348904832902348908239048823480823"
#number_two = "3248234890238902348823940990234"

#Question:
#1) I need to multiply this and get the answer
#2) DO NOT CONVERT TO INT AND DO THE MULTIPLICATION

#ord() returns an integer representing the Unicode code point of a character
print([(char, ':', ord(char)) for char in '0123456789'])

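def string_multiplication(a, b):
    # Multiply every digit of a by every digit of b, scale each partial
    # product by the place values of both digits, and sum the results
    result = 0
    for i in range(len(a)):
        digit_a = ord(a[i]) - ord('0')
        place_a = 10 ** (len(a) - i - 1)
        for j in range(len(b)):
            digit_b = ord(b[j]) - ord('0')
            place_b = 10 ** (len(b) - j - 1)
            result += digit_a * place_a * digit_b * place_b
    return result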

number_one = 193283492420348904832902348908239048823480823
number_two = 3248234890238902348823940990234
print(number_one*number_two)
print(string_multiplication(str(number_one), str(number_two)))

[('0', ':', 48), ('1', ':', 49), ('2', ':', 50), ('3', ':', 51), ('4', ':', 52), ('5', ':', 53), ('6', ':', 54), ('7', ':', 55), ('8', ':', 56), ('9', ':', 57)]
627830183787003738979638778171212515677536395487409726836477465973329282582
627830183787003738979638778171212515677536395487409726836477465973329282582
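The version above still leans on Python's arbitrary-precision integers to hold the running total. If the constraint is read more strictly, one alternative is grade-school long multiplication, which keeps every intermediate value to a couple of digits and builds the answer as a string. This is a sketch of that alternative, not the notebook's original solution. The name `multiply_digit_strings` is made up for illustration.

```python
def multiply_digit_strings(a, b):
    # The product of an m-digit and an n-digit number has at most m + n digits
    result = [0] * (len(a) + len(b))

    # Work from the least significant digit of each number
    for i in range(len(a) - 1, -1, -1):
        digit_a = ord(a[i]) - ord('0')
        for j in range(len(b) - 1, -1, -1):
            digit_b = ord(b[j]) - ord('0')
            position = i + j + 1
            total = digit_a * digit_b + result[position]
            result[position] = total % 10         # digit that stays here
            result[position - 1] += total // 10   # carry to the left

    digits = ''.join(chr(ord('0') + d) for d in result).lstrip('0')
    return digits or '0'


print(multiply_digit_strings("12", "34"))  # 408
```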


Write a function that returns a list of all the duplicate files. We'll check them by hand before actually deleting them, since programmatically deleting files is really scary. To help us confirm that two files are actually duplicates, return a list of tuples where:

- the first item is the duplicate file
- the second item is the original file

In [None]:
def find_duplicate_files(starting_directory):
    files_seen_already = {}
    stack = [starting_directory]

    # We'll track tuples of (duplicate_file, original_file)
    duplicates = []

    while len(stack) > 0:
        current_path = stack.pop()

    return duplicates

In [None]:
import os

def find_duplicate_files(starting_directory):
    files_seen_already = {}
    stack = [starting_directory]

    # We'll track tuples of (duplicate_file, original_file)
    duplicates = []

    while len(stack) > 0:
        current_path = stack.pop()

        # If it's a directory, put the contents in our stack
        if os.path.isdir(current_path):
            for path in os.listdir(current_path):
                full_path = os.path.join(current_path, path)
                stack.append(full_path)

        # If it's a file
        else:
            # Get its contents (read as bytes so binary files work too)
            with open(current_path, 'rb') as file:
                file_contents = file.read()

            # Get its last edited time
            current_last_edited_time = os.path.getmtime(current_path)

            # If we've seen it before
            if file_contents in files_seen_already:
                existing_last_edited_time, existing_path = files_seen_already[file_contents]
                if current_last_edited_time > existing_last_edited_time:
                    # Current file is the dupe!
                    duplicates.append((current_path, existing_path))
                else:
                    # Old file is the dupe!
                    duplicates.append((existing_path, current_path))
                    # But also update files_seen_already to have the new file's info
                    files_seen_already[file_contents] = (current_last_edited_time, current_path)

            # If it's a new file, throw it in files_seen_already and record its path and last edited time,
            # so we can tell later if it's a dupe
            else:
                files_seen_already[file_contents] = (current_last_edited_time, current_path)

    return duplicates

In [None]:
import os
import hashlib

def find_duplicate_files(starting_directory):
    files_seen_already = {}
    stack = [starting_directory]

    # We'll track tuples of (duplicate_file, original_file)
    duplicates = []

    while len(stack) > 0:
        current_path = stack.pop()

        # If it's a directory,
        # put the contents in our stack
        if os.path.isdir(current_path):
            for path in os.listdir(current_path):
                full_path = os.path.join(current_path, path)
                stack.append(full_path)

        # If it's a file
        else:
            # Get its hash
            file_hash = sample_hash_file(current_path)

            # Get its last edited time
            current_last_edited_time = os.path.getmtime(current_path)

            # If we've seen it before
            if file_hash in files_seen_already:
                existing_last_edited_time, existing_path = files_seen_already[file_hash]
                if current_last_edited_time > existing_last_edited_time:
                    # Current file is the dupe!
                    duplicates.append((current_path, existing_path))
                else:
                    # Old file is the dupe!
                    duplicates.append((existing_path, current_path))
                    # But also update files_seen_already to have the new file's info
                    files_seen_already[file_hash] = (current_last_edited_time, current_path)

            # If it's a new file, throw it in files_seen_already and record its path and last edited time,
            # so we can tell later if it's a dupe
            else:
                files_seen_already[file_hash] = (current_last_edited_time, current_path)

    return duplicates


def sample_hash_file(path):
    num_bytes_to_read_per_sample = 4000
    total_bytes = os.path.getsize(path)
    hasher = hashlib.sha512()

    with open(path, 'rb') as file:
        # If the file is too short to take 3 samples, hash the entire file
        if total_bytes < num_bytes_to_read_per_sample * 3:
            hasher.update(file.read())
        else:
            # Floor division keeps the offset an int; file.seek() rejects floats
            num_bytes_between_samples = ((total_bytes - num_bytes_to_read_per_sample * 3) // 2)

            # Read first, middle, and last bytes
            for offset_multiplier in range(3):
                start_of_sample = (offset_multiplier
                    * (num_bytes_to_read_per_sample + num_bytes_between_samples))
                file.seek(start_of_sample)
                sample = file.read(num_bytes_to_read_per_sample)
                hasher.update(sample)

    return hasher.hexdigest()


Complexity

Each "fingerprint" takes O(1) time and space, so our total time and space costs are O(n), where n is the number of files on the file system.

If we add the last-minute check to see if two files with the same fingerprints are actually the same files (which we probably should), then in the worst case all the files are the same and we have to read their full contents to confirm this. That gives us a runtime on the order of the total size of our files on disk.
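One way that last-minute check might look is a filter over the `(duplicate_file, original_file)` tuples, using the standard library's `filecmp.cmp` with `shallow=False` to compare actual file contents. The helper names `files_are_identical` and `confirm_duplicates` are made up for illustration.

```python
import filecmp
import os


def files_are_identical(path_a, path_b):
    # Cheap check first: files of different sizes can't be duplicates
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    # shallow=False forces a comparison of the file contents themselves
    return filecmp.cmp(path_a, path_b, shallow=False)


def confirm_duplicates(duplicates):
    # Keep only the pairs whose full contents really match
    return [
        (duplicate_path, original_path)
        for duplicate_path, original_path in duplicates
        if files_are_identical(duplicate_path, original_path)
    ]
```

Calling `confirm_duplicates(find_duplicate_files(starting_directory))` would then drop any pairs that only matched on their sampled fingerprint.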
