## Rubric

| Criteria                    | Ratings                                                                                                                                      | Pts    |
| --------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------- | ------ |
| **Sort Algorithms**         | - 20 pts Full Marks<br>- 10 pts data description (The arrangement of the data needs to be described)<br>- 0 pts No Marks                     | 20 pts |
| **Insertion Sort**          | - 20 pts Full Marks<br>- 0 pts No Marks                                                                                                      | 20 pts |
| **Missing Numbers**         | - 20 pts Full Marks<br>- 15 pts slower than O(n) solution<br>- 10 pts only one function implemented<br>- 0 pts No Marks                      | 20 pts |
| **Anagram Clusters**        | - 20 pts Full Marks (O(n) solution)<br>- 15 pts does not work with all test cases<br>- 10 pts slower than O(kn) solution<br>- 0 pts No Marks | 20 pts |
| **Longest Subarray Length** | - 20 pts Full Marks<br>- 15 pts does not work with certain cases<br>- 10 pts could be more efficient                                         | 20 pts |
| **Total Points**            |                                                                                                                                              | 100    |

## 1. Dataset Sorting Cases

Describe the worst case data and the best case data for each of the following sorting algorithms. Also, include the big O notation for each case.

- Bubble Sort
- Selection Sort
- Insertion Sort
- Merge Sort
- Quicksort

---
### Answers
**Bubble Sort**

- **Best Case Data**:
    - Already sorted array (ascending order).
    - With the early-exit optimization (stop as soon as a pass makes no swaps), it needs only **one pass**.
    - **Best Case Complexity**: **O(n)**
- **Worst Case Data**:
    - Array sorted in **reverse order**.
    - Every comparison leads to a swap in every pass.
    - **Worst Case Complexity**: **O(n²)**
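
As a rough illustration of the two cases above, the sketch below counts how many passes an early-exit bubble sort makes. The function name `bubble_sort_passes` is hypothetical and not part of the assignment.

```python
def bubble_sort_passes(arr: list[int]) -> tuple[list[int], int]:
    """Bubble sort with the early-exit check; returns (sorted copy, passes made)."""
    data = arr.copy()
    passes = 0
    for end in range(len(data) - 1, 0, -1):
        passes += 1
        swapped = False
        for i in range(end):
            if data[i] > data[i + 1]:
                data[i], data[i + 1] = data[i + 1], data[i]
                swapped = True
        if not swapped:  # a pass with no swaps means the data is sorted
            break
    return data, passes


print(bubble_sort_passes([1, 2, 3, 4, 5]))  # ([1, 2, 3, 4, 5], 1)
print(bubble_sort_passes([5, 4, 3, 2, 1]))  # ([1, 2, 3, 4, 5], 4)
```

The sorted input stops after a single pass, while the reverse-sorted input needs all n - 1 passes.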


**Selection Sort**

- **Best Case Data**:
    - Unfortunately, **Selection Sort doesn’t improve with input order**.
    - Even if the array is already sorted, it still scans the entire unsorted section each pass to find the minimum.
    - **Best Case Complexity**: **O(n²)**
- **Worst Case Data**:
    - As per above, the input order does not affect the performance of Selection Sort.
    - **Worst Case Complexity**: **O(n²)**
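
A small sketch can make the "input order does not matter" claim concrete by counting comparisons. The helper `selection_sort_comparisons` is a hypothetical name used only for this illustration.

```python
def selection_sort_comparisons(arr: list[int]) -> tuple[list[int], int]:
    """Selection sort; returns (sorted copy, number of element comparisons)."""
    data = arr.copy()
    n = len(data)
    comparisons = 0
    for i in range(n - 1):
        min_idx = i
        for j in range(i + 1, n):  # always scans the whole unsorted section
            comparisons += 1
            if data[j] < data[min_idx]:
                min_idx = j
        if min_idx != i:
            data[i], data[min_idx] = data[min_idx], data[i]
    return data, comparisons


print(selection_sort_comparisons([1, 2, 3, 4, 5]))  # ([1, 2, 3, 4, 5], 10)
print(selection_sort_comparisons([5, 4, 3, 2, 1]))  # ([1, 2, 3, 4, 5], 10)
```

Both inputs cost n(n - 1)/2 = 10 comparisons for n = 5, regardless of their initial order.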


**Insertion Sort**

- **Best Case Data**:
    - Already sorted array.
    - Only **n – 1 comparisons** and no shifts required.
    - **Best Case Complexity**: **O(n)**
- **Worst Case Data**:
    - Array sorted in **reverse order**.
    - Each element must be compared against and shifted past all earlier elements.
    - **Worst Case Complexity**: **O(n²)**
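
To check the comparison and shift counts stated above, here is an instrumented sketch. The name `insertion_sort_counts` is hypothetical and exists only for this illustration.

```python
def insertion_sort_counts(arr: list[int]) -> tuple[list[int], int, int]:
    """Insertion sort; returns (sorted copy, comparisons, shifts)."""
    data = arr.copy()
    comparisons = shifts = 0
    for i in range(1, len(data)):
        key = data[i]
        j = i - 1
        while j >= 0:
            comparisons += 1
            if data[j] <= key:  # key is already in place relative to data[j]
                break
            data[j + 1] = data[j]  # shift the larger element right
            shifts += 1
            j -= 1
        data[j + 1] = key
    return data, comparisons, shifts


print(insertion_sort_counts([1, 2, 3, 4, 5]))  # ([1, 2, 3, 4, 5], 4, 0)
print(insertion_sort_counts([5, 4, 3, 2, 1]))  # ([1, 2, 3, 4, 5], 10, 10)
```

The sorted input costs n - 1 = 4 comparisons and no shifts; the reverse-sorted input costs n(n - 1)/2 = 10 of each.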


**Merge Sort**

- **Best Case Data**:
    - Merge sort always divides and merges regardless of input.
    - Even if already sorted, it still performs the divide-and-merge process.
    - **Best Case Complexity**: **O(n log n)**
- **Worst Case Data**:
    - Same as best case: input order doesn’t matter.
    - Still splits and merges every time.
    - **Worst Case Complexity**: **O(n log n)**
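
The sketch below is one way to see that merge sort does the same amount of merging work for any input order. It counts how many elements are written during merges. The function `merge_sort_writes` is a hypothetical helper, and this top-down, non-in-place variant is an assumption rather than a required implementation.

```python
def merge_sort_writes(arr: list[int]) -> tuple[list[int], int]:
    """Top-down merge sort; returns (sorted list, elements written while merging)."""
    if len(arr) <= 1:
        return arr.copy(), 0
    mid = len(arr) // 2
    left, left_writes = merge_sort_writes(arr[:mid])
    right, right_writes = merge_sort_writes(arr[mid:])

    merged: list[int] = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])   # at most one of these two is non-empty
    merged.extend(right[j:])
    return merged, left_writes + right_writes + len(merged)


print(merge_sort_writes([1, 2, 3, 4, 5, 6, 7, 8])[1])  # 24
print(merge_sort_writes([8, 7, 6, 5, 4, 3, 2, 1])[1])  # 24
```

For n = 8 both inputs write 8 elements at each of the log2(8) = 3 merge levels, 24 in total.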


**Quicksort**

- **Best Case Data**:
    - Data is arranged so that the chosen pivot **always splits the array into two equal halves** (e.g., median pivot).
    - Balanced partitions reduce recursion depth.
    - **Best Case Complexity**: **O(n log n)**
- **Worst Case Data**:
    - If pivot is consistently the **smallest or largest element** (bad pivot choice).
    - Happens with sorted or reverse-sorted arrays if pivot is chosen as first/last element.
    - Partitions become highly unbalanced, leading to a recursion depth of n.
    - **Worst Case Complexity**: **O(n²)**
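
The effect of pivot choice on recursion depth can be sketched as follows. This assumes a simple, non-in-place quicksort that always picks the first element as the pivot. `quicksort_depth` is a hypothetical name for illustration.

```python
def quicksort_depth(arr: list[int]) -> tuple[list[int], int]:
    """Quicksort with first-element pivot; returns (sorted list, max recursion depth)."""
    if len(arr) <= 1:
        return arr.copy(), 0
    pivot = arr[0]
    smaller = [x for x in arr[1:] if x < pivot]
    larger = [x for x in arr[1:] if x >= pivot]
    left, left_depth = quicksort_depth(smaller)
    right, right_depth = quicksort_depth(larger)
    return left + [pivot] + right, 1 + max(left_depth, right_depth)


# Sorted input: the pivot is always the minimum, so partitions are maximally unbalanced.
print(quicksort_depth([1, 2, 3, 4, 5, 6, 7])[1])  # 6
# Input arranged so each pivot is the median of its partition.
print(quicksort_depth([4, 2, 1, 3, 6, 5, 7])[1])  # 2
```

With seven elements, the sorted input recurses n - 1 = 6 levels deep, while the median-pivot arrangement needs only 2.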



## 2. Insertion Sort

Implement an insertion sort function.

---
### Answers

The `insertion_sort` function below takes a list of integers and sorts it in place in ascending order. In the worst case (a reverse-sorted array) it takes $O(n^2)$ time, since each element must be shifted past the entire sorted section. In the best case (an already sorted array) it takes $O(n)$ time, performing a single comparison of each element against its predecessor. In all cases the extra space is $O(1)$, since all operations are done in place.

In [1]:
def insertion_sort(arr: list[int]) -> list[int]:
    """Insertion Sort
    Takes an input list of integers and sorts it in ascending order
    Time Complexity: O(n^2) time in the worst case, O(n) in best case
    Space Complexity: O(1) since all operations are done in-place
    """

    for i in range(1, len(arr)):
        key = arr[i]
        j = i - 1
        while j >= 0 and arr[j] > key:  # bounds check first; loop body is skipped if already in order
            arr[j + 1] = arr[j]  # move elements up one position
            j -= 1

        arr[j + 1] = key  # insert key at correct position

    return arr


# --- Test cases ---
test_cases = [
    ("already_sorted", [1, 2, 3, 4, 5], [1, 2, 3, 4, 5]),
    ("reverse_sorted", [5, 4, 3, 2, 1], [1, 2, 3, 4, 5]),
    ("random_order", [3, 1, 4, 1, 5, 9], [1, 1, 3, 4, 5, 9]),
    ("with_negatives", [0, -1, 3, -2, 2], [-2, -1, 0, 2, 3]),
    ("single_element", [42], [42]),
    ("empty_list", [], []),
    ("duplicates", [5, 5, 5, 5], [5, 5, 5, 5]),
]


def test_insertion_sort():
    for name, input_list, expected in test_cases:
        print(f"Running test: {name}")
        result = insertion_sort(input_list.copy())
        if result == expected:
            print(f"  ✅ PASS | Input={input_list} | Result={result}")
        else:
            print(f"  ❌ FAIL | Input={input_list} | Result={result} | Expected={expected}")

test_insertion_sort()

Running test: already_sorted
  ✅ PASS | Input=[1, 2, 3, 4, 5] | Result=[1, 2, 3, 4, 5]
Running test: reverse_sorted
  ✅ PASS | Input=[5, 4, 3, 2, 1] | Result=[1, 2, 3, 4, 5]
Running test: random_order
  ✅ PASS | Input=[3, 1, 4, 1, 5, 9] | Result=[1, 1, 3, 4, 5, 9]
Running test: with_negatives
  ✅ PASS | Input=[0, -1, 3, -2, 2] | Result=[-2, -1, 0, 2, 3]
Running test: single_element
  ✅ PASS | Input=[42] | Result=[42]
Running test: empty_list
  ✅ PASS | Input=[] | Result=[]
Running test: duplicates
  ✅ PASS | Input=[5, 5, 5, 5] | Result=[5, 5, 5, 5]


## 3. Missing Number Tracker

Implement **two** versions of a function that identifies missing numbers from an input array.

- **First Function:** Use a hashtable or hashset to track missing numbers.
- **Second Function:** Use an array-based approach instead of a hash structure.

Input:

- An array of size `n`, containing random integers in the range `[0, n-1]`.
- The array may contain duplicates and is not necessarily sorted.

Output:

- An array containing all missing numbers from the range `[0, n-1]`.

Both implementations must run in `O(n)` time complexity to receive full credit.

Example:

```python
find_missing([0, 3, 6, 7, 3, 3, 0, 4]) 

# Returns
[1, 2, 5]
```
---
### Answers

#### Version 1: HashSet Approach

- Traverse array, insert elements into a set.
- Then iterate over $[0..n-1]$ and collect numbers not in the set.
- Complexity: $O(n)$ time average, $O(n^2)$ worst; $O(n)$ space.

In [2]:
def find_missing_hashset(arr: list[int]) -> list[int]:
    """
    Return the numbers missing from the range [0, n), where n = len(arr).
    Input may contain duplicates and need not be sorted.

    Time: O(n) average (set build + membership); O(n^2) worst with collisions.
    Space: O(n) for the set and result list. 
    """
    n = len(arr)
    seen: set[int] = set(arr)  # O(n) time and space

    # NOTE: the 'in seen' operation on sets takes O(1) time in the best
    # and average cases, since it uses hash tables. In the worst case
    # with multiple hash collisions it takes O(n) time.
    missing = [x for x in range(n) if x not in seen]  # O(n) not considering hash collisions, O(n^2) otherwise
    return missing

#### Version 2: Array-Based Approach

- Use an auxiliary boolean/int list of length n to mark presence of a number.
- Traverse input, mark seen values.
- Collect unmarked indices.
- Complexity: $O(n)$ time, $O(n)$ space (but avoids hash overhead).
- *Note:* unlike the HashSet implementation, this approach does not risk the worst-case $O(n^2)$ scenario caused by hash collisions, although that scenario is unlikely in practice.

In [3]:
def find_missing_array(arr: list[int]) -> list[int]:
    """
    Return the numbers missing from the range [0, n), where n = len(arr).
    Input may contain duplicates and need not be sorted.
    
    Time: O(n) for list traversals
    Space: O(n) extra space for auxiliary list
    """
    n = len(arr)
    seen = [False] * n   # O(n) space
    
    for num in arr:      # O(n) pass
        if 0 <= num < n: # ignore values outside the range [0, n)
            seen[num] = True
    
    missing = [i for i, present in enumerate(seen) if not present]  # O(n)
    return missing

#### Tests

In [4]:
# --- Test cases (name, input, expected) ---
test_cases = [
    ("no_missing", [0, 1, 2, 3], []),          # complete range
    ("one_missing", [0, 1, 3], [2]),           # missing one
    ("multiple_missing", [0, 2, 2, 4], [1, 3]),# missing several (4 is out of range)
    ("empty", [], []),                         # empty input
    ("two_missing", [0,0,0], [1, 2]),          # all duplicates of one value
    ("unordered", [3, 1, 0], [2]),             # unsorted input
]


def test_find_missing_hashset():
    for name, input_list, expected in test_cases:
        result = find_missing_hashset(input_list)
        if result == expected:
            print(f"test_find_missing_hashset [{name}] ✅ PASS | Input={input_list} | Result={result}")
        else:
            print(f"test_find_missing_hashset [{name}] ❌ FAIL | Input={input_list} | Result={result} | Expected={expected}")


def test_find_missing_array():
    for name, input_list, expected in test_cases:
        result = find_missing_array(input_list)
        if result == expected:
            print(f"test_find_missing_array [{name}] ✅ PASS | Input={input_list} | Result={result}")
        else:
            print(f"test_find_missing_array [{name}] ❌ FAIL | Input={input_list} | Result={result} | Expected={expected}")

test_find_missing_hashset()
test_find_missing_array()

test_find_missing_hashset [no_missing] ✅ PASS | Input=[0, 1, 2, 3] | Result=[]
test_find_missing_hashset [one_missing] ✅ PASS | Input=[0, 1, 3] | Result=[2]
test_find_missing_hashset [multiple_missing] ✅ PASS | Input=[0, 2, 2, 4] | Result=[1, 3]
test_find_missing_hashset [empty] ✅ PASS | Input=[] | Result=[]
test_find_missing_hashset [two_missing] ✅ PASS | Input=[0, 0, 0] | Result=[1, 2]
test_find_missing_hashset [unordered] ✅ PASS | Input=[3, 1, 0] | Result=[2]
test_find_missing_array [no_missing] ✅ PASS | Input=[0, 1, 2, 3] | Result=[]
test_find_missing_array [one_missing] ✅ PASS | Input=[0, 1, 3] | Result=[2]
test_find_missing_array [multiple_missing] ✅ PASS | Input=[0, 2, 2, 4] | Result=[1, 3]
test_find_missing_array [empty] ✅ PASS | Input=[] | Result=[]
test_find_missing_array [two_missing] ✅ PASS | Input=[0, 0, 0] | Result=[1, 2]
test_find_missing_array [unordered] ✅ PASS | Input=[3, 1, 0] | Result=[2]


## 4. Anagram Clusters

Write a function that accepts a list of words and groups them into clusters of anagrams.

An anagram is a word formed by rearranging the letters of another word, such as "listen" becoming "silent." 

**Requirements**:

- Input: A list of lowercase words with no spaces, symbols, or non-alphabetic characters
- Output: A list of lists, where each inner list contains anagram words grouped together
    - The order of words within each group is not important
- Make your code as time-efficient as possible
- State its time complexity using n as the number of words and k as the average word length

  
For example, given
```python
["listen", "silent", "enlist", "google", "gooegl", "elbow", "below", "bored", "robed"]
```

Output
```python
[  
  ["listen", "silent", "enlist"],  
  ["google", "gooegl"],  
  ["elbow", "below"],  
  ["bored", "robed"]  
]
```

---
### Answers

- For each word, generate a canonical key that uniquely identifies its anagram group.
    - Option 1: sorted(word) → but sorting costs $O(k \log k)$.
    - Option 2 (faster): Count characters (26 letters only). Use a tuple of counts as the key → $O(k)$ per word.
- Group words by this key using a dictionary.
- Collect dictionary values as the result.
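
The two key options listed above can be sketched side by side. The helper names `sorted_key` and `count_key` are hypothetical and used only for illustration; both assume lowercase a-z input, as the problem states.

```python
def sorted_key(word: str) -> str:
    """Option 1: sort the letters, O(k log k) per word."""
    return "".join(sorted(word))


def count_key(word: str) -> tuple[int, ...]:
    """Option 2: count each of the 26 letters, O(k) per word."""
    counts = [0] * 26
    for ch in word:
        counts[ord(ch) - ord("a")] += 1
    return tuple(counts)


print(sorted_key("listen"))                         # eilnst
print(sorted_key("listen") == sorted_key("silent"))  # True
print(count_key("listen") == count_key("silent"))    # True
print(count_key("listen") == count_key("google"))    # False
```

Either key maps every anagram of a word to the same value, so both produce the same grouping; they differ only in per-word cost.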


The function `group_anagrams` groups words into lists of anagrams using a dictionary, where the key is a tuple representing the count of each character in a word. Two words share a key exactly when they have the same character composition, which guarantees correctness. Because building this character-count key takes $O(k)$ per word, the function runs in $O(n * k)$ time in the average case, which is linear in the total number of input characters.

**Complexity Analysis:**
- Time: $O(n * k)$ on average, dominated by iterating over each word and its characters; $O(n^2 * k)$ in the unlikely worst case of many dictionary hash collisions, where n = number of words and k = maximum word length
- Space: $O(n * k)$ for storing the resulting groups and the temporary character count arrays

In [5]:
from collections import defaultdict

def group_anagrams(words: list[str]) -> list[list[str]]:
    """
    Returns groups of words that are anagrams.
    Input is a list of lowercase words, where the order within groups is arbitrary.

    Time: O(n * k) average, O(n^2 * k) worst-case for dictionary hash collisions,
          where n = number of words, k = max word length
    Space: O(n * k) extra space for storing groups and temporary counts
    """
    groups: dict[tuple[int, ...], list[str]] = defaultdict(list)

    for word in words: # O(n)
        # Character frequency (26 letters)
        count = [0] * 26
        for ch in word: # O(k)
            count[ord(ch) - ord('a')] += 1 # O(1)
        key = tuple(count)  # immutable, usable as dict key
        groups[key].append(word) # O(1) avg. case, O(n) worst case with hash collisions

    return list(groups.values()) # O(n)


test_cases = [
    ("simple", ["eat", "tea", "tan", "ate", "nat", "bat"],
     [["eat", "tea", "ate"], ["tan", "nat"], ["bat"]]),
    ("empty", [], []),
    ("single", ["abc"], [["abc"]]),
    ("no_anagrams", ["a", "b", "c"], [["a"], ["b"], ["c"]]),
    ("all_anagrams", ["abc", "bca", "cab"], [["abc", "bca", "cab"]]),
]

def test_group_anagrams():
    for name, input_list, expected in test_cases:
        result = group_anagrams(input_list)

        # Sorting inner lists and outer list for comparison, since order is arbitrary
        result_sorted = sorted([sorted(group) for group in result])
        expected_sorted = sorted([sorted(group) for group in expected])

        if result_sorted == expected_sorted:
            print(f"test_group_anagrams [{name}] ✅ PASS | Input={input_list} | Result={result}")
        else:
            print(f"test_group_anagrams [{name}] ❌ FAIL | Input={input_list} | Result={result} | Expected={expected}")

test_group_anagrams()

test_group_anagrams [simple] ✅ PASS | Input=['eat', 'tea', 'tan', 'ate', 'nat', 'bat'] | Result=[['eat', 'tea', 'ate'], ['tan', 'nat'], ['bat']]
test_group_anagrams [empty] ✅ PASS | Input=[] | Result=[]
test_group_anagrams [single] ✅ PASS | Input=['abc'] | Result=[['abc']]
test_group_anagrams [no_anagrams] ✅ PASS | Input=['a', 'b', 'c'] | Result=[['a'], ['b'], ['c']]
test_group_anagrams [all_anagrams] ✅ PASS | Input=['abc', 'bca', 'cab'] | Result=[['abc', 'bca', 'cab']]


## 5. Longest Subarray Length

Write a function that returns the length of the longest contiguous subarray whose sum equals a given target.

- Input: an array of integers and a target value
- Output: an integer representing the length of the longest subarray with a sum equal to the target
- Your solution must run in O(n) time for full credit

Given an array: 
```python
[3, 1, -1, 2, -1, 5, -2, 3]
```

and a target value of 3, the longest subarray length is 

```txt
5 
Length of  [-1, 2, -1, 5, -2] (sum of 3)  
```

---
### Answers

- Use a running prefix sum as we iterate.
- This approach essentially checks whether `prefix_sum[i] - prefix_sum[j] = target` for some earlier index `j < i`.
- Store the first index where each prefix sum appears in a hashmap.
- At each step:
    - If `prefix_sum - target` has been seen before, then the subarray between that index+1 and the current index sums to target.
    - Update the max length if this subarray is longer.
- Also handle the case when the prefix sum itself equals the target (subarray from start).

#### Proof
Let  

$$
\text{prefix}[i] = \sum_{m=0}^{i} \text{arr}[m], \quad \text{with } \text{prefix}[-1] = 0
$$

The sum of a subarray from index $j+1$ to $i$ is  

$$
\text{sum}(j+1 \dots i) = \text{prefix}[i] - \text{prefix}[j]
$$

We want  

$$
\text{sum}(j+1 \dots i) = \text{target}
$$

So  

$$
\text{prefix}[i] - \text{prefix}[j] = \text{target}
$$

Rearrange:  

$$
\text{prefix}[j] = \text{prefix}[i] - \text{target}
$$

Thus, if at index $i$ the running sum is $\text{prefix}[i]$, and we have already seen a previous prefix sum equal to $\text{prefix}[i] - \text{target}$, then the subarray $(j+1 \dots i)$ must sum to the target.  

Its length is  

$$
i - j
$$
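
As a sanity check of the identity above, the following sketch computes the prefix sums of the example array from the problem statement and verifies the identity for every pair of indices. The helper names `prefix_sums` and `prefix_at` are hypothetical.

```python
from itertools import accumulate


def prefix_sums(arr: list[int]) -> list[int]:
    """prefix[i] = arr[0] + ... + arr[i]."""
    return list(accumulate(arr))


arr = [3, 1, -1, 2, -1, 5, -2, 3]
target = 3
prefix = prefix_sums(arr)
print(prefix)  # [3, 4, 3, 5, 4, 9, 7, 10]


def prefix_at(idx: int) -> int:
    """Prefix sum with the convention prefix[-1] = 0."""
    return 0 if idx == -1 else prefix[idx]


# Verify sum(j+1 .. i) == prefix[i] - prefix[j] for every pair j < i.
for j in range(-1, len(arr)):
    for i in range(j + 1, len(arr)):
        assert prefix_at(i) - prefix_at(j) == sum(arr[j + 1 : i + 1])

# The example's answer: j = 1, i = 6 gives prefix[6] - prefix[1] = 7 - 4 = 3.
j, i = 1, 6
print(arr[j + 1 : i + 1], i - j)  # [-1, 2, -1, 5, -2] 5
```

Here `prefix[6] - target = 4` first appears at index 1, which gives the length-5 subarray from the problem statement.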


#### Implementation

The implementation of `longest_subarray_sum` below accepts an integer list and a target integer, and returns the length of the longest contiguous subarray that sums to the target. It uses a dictionary to store and look up the first occurrence of each prefix sum, in accordance with the proof above; the result is the maximum length $i - j$ over all index pairs that satisfy $\text{prefix}[j] = \text{prefix}[i] - \text{target}$.

**Complexity Analysis:**
- Time: $O(n)$ average, $O(n^2)$ worst-case due to dictionary hash collisions
- Space: $O(n)$ extra space is required for storing the prefix sums in a dictionary


In [6]:
def longest_subarray_sum(nums: list[int], target: int) -> int:
    """
    Returns the length of the longest contiguous subarray whose sum equals `target`.

    Time: O(n) average, O(n^2) worst-case due to dictionary hash collisions,
          where n = len(nums)
    Space: O(n) extra space for storing prefix sums in a dictionary
    """
    prefix_sum = 0
    first_occurrence: dict[int, int] = {}  # prefix_sum -> earliest index
    max_len = 0

    for i, num in enumerate(nums): # O(n)
        prefix_sum += num

        # Case 1: subarray from start
        if prefix_sum == target: # O(1)
            max_len = max(max_len, i + 1)

        # Case 2: subarray between two indices
        if (prefix_sum - target) in first_occurrence: # O(1) avg. case, O(n) worst case with hash collisions
            length = i - first_occurrence[prefix_sum - target]
            max_len = max(max_len, length)

        # Store prefix sum index only if not seen (to maximize length)
        if prefix_sum not in first_occurrence: # O(1) avg. case, O(n) worst case with hash collisions
            first_occurrence[prefix_sum] = i

    return max_len

def test_longest_subarray_sum():
    test_cases = [
        ("simple", [1, 2, 3, 4, 5], 9, 3),           # subarray [2,3,4]
        ("all_positive", [1, 1, 1, 1], 2, 2),        # multiple subarrays sum to 2
        ("single_element", [5], 5, 1),               # single element matches
        ("none", [1, 2, 3], 10, 0),                  # no subarray sums to 10
        ("negatives", [1, -1, 5, -2, 3], 3, 4),      # subarray [1,-1,5,-2]
        ("entire_array", [2, 3, 1], 6, 3),           # entire array
        ("empty", [], 0, 0),                          # empty array
    ]

    for name, nums, target, expected in test_cases:
        result = longest_subarray_sum(nums, target)
        if result == expected:
            print(f"test_longest_subarray_sum [{name}] ✅ PASS | Result={result}")
        else:
            print(f"test_longest_subarray_sum [{name}] ❌ FAIL | Result={result} | Expected={expected}")


test_longest_subarray_sum()

test_longest_subarray_sum [simple] ✅ PASS | Result=3
test_longest_subarray_sum [all_positive] ✅ PASS | Result=2
test_longest_subarray_sum [single_element] ✅ PASS | Result=1
test_longest_subarray_sum [none] ✅ PASS | Result=0
test_longest_subarray_sum [negatives] ✅ PASS | Result=4
test_longest_subarray_sum [entire_array] ✅ PASS | Result=3
test_longest_subarray_sum [empty] ✅ PASS | Result=0
