## Comparing Algorithms

Which algorithm is better? Why?

### `dup1`

| operation | symbolic count| count, N = 10,000|
| --- | --- |--- |
| `i = 0` | 1 | 1 |
| `j = i + 1` | 1 to $N$ | 1 to 10,000 |
| less than (`<`) | 2 to $\frac{N^2 + 3N + 2}{2}$| 2 to 50,015,001 |
| increment (`+=1`) | 0 to $\frac{N^2 + N}{2}$| 0 to 50,005,000 |
| equals (`==`) | 1 to $\frac{N^2 - N}{2}$ |1 to 49,995,000 |
| array accesses| 2 to $N^2 - N$ | 2 to 99,990,000|

### `dup2`

| operation | symbolic count| count, N = 10,000 |
| --- | --- | --- |
| `i = 0` | 1 | 1 |
| less than (`<`) | 0 to $N$ | 0 to 10,000|
| increment (`+=1`) | 0 to $N - 1$| 0 to 9,999 |
| equals (`==`) | 1 to $N - 1$ | 1 to 9,999 |
| array accesses| 2 to $2N - 2$ | 2 to 19,998 |

* Good answer: Fewer operations to do the same work (e.g. 50,015,001 vs. 10,000 operations)
* Better answer: Algorithm **scales better** in the worst case 
    * $\frac{N^2 + 3N + 2}{2}$ vs. $N$
* Even better answer: Parabolas ($N^2$) grow faster than lines ($N$)
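The notes above give only the operation counts, not the code for `dup1` and `dup2`. As a rough sketch, assume `dup1` compares every pair of elements and `dup2` compares only neighbors in an already sorted array. Both bodies are assumptions reconstructed from the tables, not the original implementations. Counting the `==` comparisons on a duplicate-free input reproduces the worst-case formulas $\frac{N^2 - N}{2}$ and $N - 1$.

```python
def dup1(a):
    """Assumed all-pairs check. Returns (found_duplicate, number of == comparisons)."""
    equals = 0
    n = len(a)
    for i in range(n):
        for j in range(i + 1, n):
            equals += 1
            if a[i] == a[j]:
                return True, equals
    return False, equals


def dup2(a):
    """Assumed neighbor check; only correct if `a` is sorted."""
    equals = 0
    for i in range(len(a) - 1):
        equals += 1
        if a[i] == a[i + 1]:
            return True, equals
    return False, equals


N = 1000
worst_case = list(range(N))  # sorted, no duplicates: neither function can exit early

_, dup1_equals = dup1(worst_case)
_, dup2_equals = dup2(worst_case)

print(dup1_equals, (N * N - N) // 2)  # both values are the (N^2 - N) / 2 count
print(dup2_equals, N - 1)             # both values are the N - 1 count
```

A smaller `N` is used here only to keep the quadratic version quick to run; the same formulas give the table entries at $N = 10{,}000$.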

## Asymptotic Behavior

In most cases, we only care about **asymptotic behavior**
* i.e. what happens when $N$ is very large. Some real-life examples:
    * Simulation of billions of interacting particles
    * Social network with billions of users
    * Logging of billions of transactions
    * Encoding of billions of bytes of video data
    
Algorithms that scale well (e.g. look like lines) have better asymptotic runtime behavior than algorithms that scale relatively poorly (e.g. look like parabolas).

## Parabolas vs. Lines

Suppose we have two algorithms that zerpify a collection of $N$ items.
* `zerp1` takes $2N^2$ operations
* `zerp2` takes $500N$ operations

![](images/zerp.png)

* For small $N$, `zerp1` might be faster
* As the dataset size grows, the parabolic algorithm (`zerp1`) falls further and further behind (it takes more time to complete)
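To make "small $N$" concrete, the two operation counts can be compared directly. The helper names `zerp1_ops` and `zerp2_ops` are made up for this sketch; they simply evaluate the formulas $2N^2$ and $500N$ given above.

```python
def zerp1_ops(n):
    """Operation count for zerp1: 2 * N^2."""
    return 2 * n * n


def zerp2_ops(n):
    """Operation count for zerp2: 500 * N."""
    return 500 * n


# Smallest N at which zerp1 stops being cheaper than zerp2.
crossover = next(n for n in range(1, 10_000) if zerp1_ops(n) >= zerp2_ops(n))
print(crossover)

# Ratio of the two counts is N / 250, so the gap keeps widening as N grows.
for n in (100, 1_000, 10_000, 100_000):
    print(n, zerp1_ops(n) / zerp2_ops(n))
```

Solving $2N^2 = 500N$ gives the same crossover, $N = 250$. Below it the parabola is cheaper; above it the line wins by a factor of $N / 250$.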

## Scaling Across Many Domains

We'll refer to the shape of a runtime function as its **order of growth**
* Effect is dramatic! This often tells us whether a problem can be solved at all

![](images/table.png)

* For `n = 10`, the shape of the runtime function might not really matter
* But once `n = 100,000`, the shape matters. A linear-time algorithm takes less than a second, while a quadratic (parabolic) algorithm takes about 3 hours.
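A back-of-the-envelope calculation shows where numbers like these can come from. The rate of one million operations per second is an assumption picked for illustration, not a value read from the table, and the algorithms are assumed to take exactly $n$ and $n^2$ operations.

```python
OPS_PER_SECOND = 1_000_000  # hypothetical machine speed, chosen for illustration


def seconds(ops):
    """Time to run `ops` operations at the assumed rate."""
    return ops / OPS_PER_SECOND


n = 100_000
linear_seconds = seconds(n)            # n operations
quadratic_seconds = seconds(n * n)     # n^2 operations
quadratic_hours = quadratic_seconds / 3600

print(f"linear:    {linear_seconds} s")
print(f"quadratic: {quadratic_seconds} s (about {quadratic_hours:.1f} hours)")
```

Under that assumption the linear algorithm finishes in a tenth of a second, while the quadratic one needs 10,000 seconds, a little under 3 hours, which is in line with the figures quoted above.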

## Duplicate Finding

Our goal is to somehow **characterize the runtimes** of the `dup` functions. Our characterization so far is simple and mathematically rigorous.

### `dup1` - parabolic a.k.a. quadratic

| operation | symbolic count| count, N = 10,000|
| --- | --- |--- |
| `i = 0` | 1 | 1 |
| `j = i + 1` | 1 to $N$ | 1 to 10,000 |
| less than (`<`) | 2 to $\frac{N^2 + 3N + 2}{2}$| 2 to 50,015,001 |
| increment (`+=1`) | 0 to $\frac{N^2 + N}{2}$| 0 to 50,005,000 |
| equals (`==`) | 1 to $\frac{N^2 - N}{2}$ |1 to 49,995,000 |
| array accesses| 2 to $N^2 - N$ | 2 to 99,990,000|

### `dup2` - linear

| operation | symbolic count| count, N = 10,000 |
| --- | --- | --- |
| `i = 0` | 1 | 1 |
| less than (`<`) | 0 to $N$ | 0 to 10,000|
| increment (`+=1`) | 0 to $N - 1$| 0 to 9,999 |
| equals (`==`) | 1 to $N - 1$ | 1 to 9,999 |
| array accesses| 2 to $2N - 2$ | 2 to 19,998 |

However, what we want is a characterization that demonstrates the superiority of `dup2` over `dup1`.