# Algorithms 1: Linear Search

An **algorithm** is a finite sequence of precise instructions that solves a
computational problem. 

For example, suppose you want to add up a list of numbers. Here's one algorithm
to do that:

1. Define a variable `sum` and initialize it to 0.
2. If the list is empty, then you're done: return `sum` and stop.
3. Add the first number in the list to `sum`, and then remove that number from
   the list.
4. Go to step 2.

There are many things we can study about algorithms. For example:

- Does it *always* return the correct sum of the numbers in the list?
  - For the summing algorithm, it is always correct.
- How fast does it run? Does it do any unnecessary work?
  - For the summing algorithm, there does not *seem* to be any way to
    significantly speed it up.
- How much memory does it use?
  - For the summing algorithm, it uses a small, constant amount of extra memory
    above and beyond the memory needed by the list itself, e.g. it uses one
    variable `sum`.
- Does it modify its inputs? Or just read them?
  - For the summing algorithm, it modifies the list by removing numbers from it.
    It would be better if it just read the numbers without modifying the list.
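
To see the last point concretely, here is a sketch of a literal Python translation of the four steps. The name `destructive_sum` is made up for illustration. After it runs, the caller's list is empty:

```python
def destructive_sum(numbers):
    """Literal translation of the four steps; empties the list it is given."""
    total = 0                      # step 1: initialize the running sum
    while len(numbers) > 0:        # step 2: stop when the list is empty
        total += numbers.pop(0)    # step 3: add the first number, then remove it
    return total                   # step 2: return the sum

data = [6, 4, 5]
print(destructive_sum(data))  # 15
print(data)                   # []
```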

We usually write algorithms in pseudocode, i.e. a code-like notation designed to
be read and understood by humans. We can then translate the pseudocode into a
programming language. For example:

In [2]:
def sum_list(numbers):
    """Return the sum of all numbers in list numbers.
    """
    total = 0
    for n in numbers:
        total += n
    return total

print(sum_list([6, 4, 5]))  # 15

15


`sum_list` does *not* modify the list. Instead, it uses a for-loop to iterate
over the list without changing it.

The same function in other languages can look different. For example, in the C
language we might write:

```c
int sum(int* numbers, int n) {
    int sum = 0;
    for (int i = 0; i < n; i++) {
        sum += numbers[i];
    }
    return sum;
}
```

Or in [Haskell](https://www.haskell.org/), we could implement the algorithm like
this (using recursion!):

```haskell
import Prelude hiding (sum)  -- avoid a name clash with the built-in sum

sum :: [Int] -> Int
sum [] = 0
sum (x:xs) = x + sum xs
```

All three of these functions are implementations of the same algorithm.

## Linear Search

Suppose you have a list of values $[a_0, a_1, \ldots, a_{n-1}]$ that are all the
same type. They could be numbers, strings, letters, or lists --- *any* value
that can be compared using `==`.

The **list search problem** is this:

> Where in $[a_0, a_1, \ldots, a_{n-1}]$ is a given *target value* $x$?

Here's an algorithm that solves it:

1. Define an index variable `i` and initialize it to 0.
2. If $i = n$, then you're done: return -1 and stop. -1 means $x$ was *not*
   found.
3. If $a_i == x$, then you're done: return `i` and stop.
4. Otherwise, increment `i` by 1 and go to step 2.

This algorithm is called **linear search**, or **sequential search**, because it
checks the items, one at a time, to see if they equal $x$. If it finds $x$, it
immediately returns the index position, $i$. If it reaches the end of the list
without finding $x$, it returns -1.

This is the standard algorithm for solving the list search problem. By looking
at the algorithm we can see that:

- **It is pretty efficient**. It doesn't appear to do any unnecessary work, and
  it stops as soon as it's done.
- **It is space-efficient**. It requires only a small, constant amount of extra
  memory, i.e. the variable `i`.
- **It is not destructive**. The list is *not* modified.
- **It is quite general**. The items in the list can be any type that can be
  compared using `==`.
- **The algorithm is *correct*, i.e. it always returns the right answer**. This
  can be proven mathematically, although in practice we usually rely on careful
  design and testing to make sure our algorithms are correct.

### Linear Search Performance

When discussing the performance of an algorithm, we usually don't talk about
*time* because different computers run at different speeds. Instead, we usually
pick a **key operation** and count how many times it is performed. A well-chosen
key operation will give us a good idea of how fast the algorithm is, and how it
scales as the size of the input grows.

For linear search, a good key operation is the comparison operator `==`. By
counting the number of times linear search calls `==`, we can get a good idea of
its run-time performance without needing to run experiments.

For example:

- At *best*, `==` is called 1 time (or 0 times if the list is empty).
  - This occurs when $x$ is the first item, or if there's only one item in the
    list.
- At *worst*, `==` is called $n$ times. 
  - This occurs when $x$ is either not in the list, or when $x$ is the last item
    of the list.
- On *average*, `==` is called about $\frac{n}{2}$ times.
  - This assumes that $x$ is in the list and is equally likely to be at any
    position. $x$ will be near the start of the list just as often as it is near
    the end, and over many searches this averages out to be as if we searched
    halfway through the list each time.
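
The three cases above can be checked with a small sketch. The helper `count_comparisons` is a hypothetical name introduced here. It counts how many times `==` is evaluated by a left-to-right scan, and the example searches a list of 10 distinct values once for each value in it:

```python
def count_comparisons(x, lst):
    """Return how many times == is evaluated by a left-to-right linear search."""
    comps = 0
    for item in lst:
        comps += 1          # one comparison per item examined
        if item == x:
            break           # stop as soon as x is found
    return comps

n = 10
lst = list(range(n))        # [0, 1, ..., 9]
counts = [count_comparisons(x, lst) for x in lst]

print(min(counts))                  # 1    (best case: x is first)
print(max(counts))                  # 10   (worst case: x is last)
print(sum(counts) / n)              # 5.5  (average over all positions)
print(count_comparisons(-1, lst))   # 10   (worst case: x is not in the list)
```

The exact average is $\frac{n+1}{2}$, which is about $\frac{n}{2}$ when $n$ is large.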

In practice, it is usually a good idea to assume the worst-case performance. If
your program runs well with the worst-case performance, then it will also run
well with the average-case or best-case performance.

> **Rule of thumb**: Hope for the best, but plan for the worst.

## Implementation 1: index

Now let's see a couple of different ways to implement linear search.

Python's `index` method is a built-in way to do linear search:

In [2]:
nums = [3, 9, -2, 4, -2]
print(nums.index(-2))  # 2
print(nums.index(9))   # 1
# print(nums.index(5))   # ValueError: 5 is not in list

2
1


Or:

In [3]:
vehicles = ['car', 'ebike', 'foot', 'scooter']
print(vehicles.index('scooter'))  # 3
print(vehicles.index('bike'))     # ValueError: bike is not in list

3


ValueError: 'bike' is not in list

`index` is a pretty good implementation of linear search. It's bug-free, fast,
and easy to use. 

Notice that it raises a `ValueError` if $x$ is not found. Our algorithm says
that -1 should be returned, but most of Python's built-in functions raise
exceptions in error cases.
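
If the -1 convention from the algorithm is preferred, one option is to wrap `index` in a small helper that catches the exception. This is only a sketch, and `index_or_minus_one` is a hypothetical name, not a Python built-in:

```python
def index_or_minus_one(x, lst):
    """Return the position of the left-most x in lst, or -1 if x is not in lst."""
    try:
        return lst.index(x)
    except ValueError:      # raised by index() when x is not in lst
        return -1

nums = [3, 9, -2, 4, -2]
print(index_or_minus_one(-2, nums))  # 2
print(index_or_minus_one(5, nums))   # -1
```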

## Implementation 2: while loop

If Python didn't provide an `index` method, then one standard implementation is
to use a while-loop:

In [12]:
def while_linear_search(x, lst):
    """Returns the position of the left-most x in lst.
    If x is not in lst, returns -1.
    Order of the elements in lst doesn't matter.
    """
    i = 0
    while i < len(lst):
        if lst[i] == x:  # x found at location i
            return i 
        i += 1
    return -1            # x not in lst

nums = [3, 9, -2, 4, -2]
print(while_linear_search(-2, nums))  # 2
print(while_linear_search(5, nums))   # -1

vehicles = ['car', 'ebike', 'foot', 'scooter']
print(while_linear_search('scooter', vehicles))  # 3
print(while_linear_search('bike', vehicles))     # -1

2
-1
3
-1


A nice feature of this implementation is that each step of the original
algorithm is clearly spelled out in code.

## Implementation 3: reverse while loop

**Reverse linear search** is the same as regular linear search, except it scans
the list from right to left:

In [2]:
def reverse_while_linear_search(x, lst):
    """Returns the position of the right-most x in lst.
    If x is not in lst, returns -1.
    Order of the elements in lst doesn't matter.
    """
    i = len(lst) - 1     # starts at right end
    while i >= 0:
        if lst[i] == x:  # x found at location i
            return i 
        i -= 1           # i is decremented
    return -1            # x not in lst

nums = [3, 9, -2, 4, -2]
print(reverse_while_linear_search(-2, nums))  # 4 (not 2!)
print(reverse_while_linear_search(5, nums))   # -1

vehicles = ['car', 'ebike', 'foot', 'scooter']
print(reverse_while_linear_search('scooter', vehicles))  # 3
print(reverse_while_linear_search('bike', vehicles))     # -1

4
-1
3
-1


## Which Implementation is Fastest?

We've seen three implementations of linear search: `index`, a while-loop, and a
reverse while-loop. 

Which one do you think is fastest?

We know from the algorithm analysis above that they all call `==` from 0 to `n`
times. So one guess is that all three are about the same speed.

But `index` is special: it is implemented internally in Python, and could be
very efficient using features not available to regular Python programmers. So
another guess is that `index` is faster than the other two.

Or, maybe `index` is special in a way that makes it slower than the other two?
Maybe, say, it is more general-purpose, or maybe the fact that it raises an
exception makes it slower? It's hard to say without running an experiment.

Finally, it seems like the while-loop and reverse while-loop implementations
should be about the same speed. Surely the direction of the searching does not
make a difference?

So let's keep these hypotheses in mind:

- **Hypothesis 1**: All three are about the same speed.
- **Hypothesis 2**: `index` is significantly *faster* than the other two.
- **Hypothesis 3**: `index` is significantly *slower* than the other two.
- **Hypothesis 4**: The while-loop and reverse while-loop are about the same
  speed.

### Running an Experiment

Let's create experiment functions that run the various linear search functions
on the same data and record how long they take to run.

To measure the running time, we will use the `time.time()` function from the
`time` module like this:

In [4]:
import time

start_time = time.time()
print("Hello, World!")
end_time = time.time()

print(f'Elapsed time: {end_time - start_time} seconds')

Hello, World!
Elapsed time: 9.512901306152344e-05 seconds
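
`time.time()` is fine for a rough measurement like this. As an alternative sketch, the standard library also provides `time.perf_counter()`, which is intended for timing short intervals. The helper name `time_call` below is an assumption made for illustration, not part of the `time` module:

```python
import time

def time_call(fn, *args):
    """Return (result, elapsed_seconds) for one call of fn(*args).
    time_call is a hypothetical helper, not a standard library function.
    """
    start = time.perf_counter()
    result = fn(*args)
    elapsed = time.perf_counter() - start
    return result, elapsed

result, elapsed = time_call(sum, range(1_000_000))
print(result)                               # 499999500000
print(f'Elapsed time: {elapsed} seconds')   # varies from run to run
```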


The experiment we'll run on each implementation is as follows:

1. Read the first 1000 words from `words.txt` into a *search* list.
2. Read all the words of `austenPandP.txt` (Jane Austen's *Pride and
   Prejudice*) into a *text* list.
3. Time how long it takes to find each search list word in the list of words for
   `austenPandP.txt`.
4. Print the total number of times `==` was called and the elapsed time.

We've also created modified versions of the while-loop implementations that
count the number of times `==` is called.

In [20]:
import time

def while_linear_search_counted(x, lst):
    """Returns the position of the left-most x in lst.
    If x is not in lst, returns -1.
    Order of the elements in lst doesn't matter.
    """
    comps = 0
    i = 0
    while i < len(lst):
        comps += 1
        if lst[i] == x:  # x found at location i
            return i, comps
        i += 1
    return -1, comps     # x not in lst


def speed_test_while_linear_search():
    print('\nRunning speed_test_while_linear_search ...')
    words = open('words.txt').read().split()
    words = words[:1000]  # use only first 1000 words
    text = open('austenPandP.txt').read().lower().split()
 
    start_time = time.time()                    # start timing
    
    total_comps = 0
    for w in words:
        result, num_comparisons = while_linear_search_counted(w, text)
        total_comps += num_comparisons
        
    end_time = time.time()                      # stop timing
    elapsed_seconds = end_time - start_time 
    
    print(f'while_linear_search elapsed time: {elapsed_seconds} seconds')
    print(f'           total times == called: {total_comps}')

def reverse_while_linear_search_counted(x, lst):
    """Returns the position of the right-most x in lst.
    If x is not in lst, returns -1.
    Order of the elements in lst doesn't matter.
    """
    comps = 0
    i = len(lst) - 1     # starts at right end
    while i >= 0:
        comps += 1
        if lst[i] == x:  # x found at location i
            return i, comps
        i -= 1           # i is decremented
    return -1, comps     # x not in lst

def speed_test_reverse_while_linear_search():
    print('\nRunning speed_test_reverse_while_linear_search ...')
    words = open('words.txt').read().split()
    words = words[:1000]  # use only first 1000 words
    text = open('austenPandP.txt').read().lower().split()
 
    start_time = time.time()                    # start timing
    
    total_comps = 0
    for w in words:
        # Careful! Reverse linear search starts at the right end, so the
        # number of times `==` is called is not given by the returned index.
        result, num_comparisons = reverse_while_linear_search_counted(w, text)
        total_comps += num_comparisons
        
    end_time = time.time()                      # stop timing
    elapsed_seconds = end_time - start_time 
    
    print(f'reverse_while_linear_search elapsed time: {elapsed_seconds} seconds')
    print(f'                   total times == called: {total_comps}')

def speed_test_index_linear_search():
    print('\nRunning speed_test_index_linear_search ...')
    words = open('words.txt').read().split()
    words = words[:1000]  # use only first 1000 words
    text = open('austenPandP.txt').read().lower().split()
 
    start_time = time.time()                    # start timing
    
    # Note that we don't know how index() works internally, so we don't know if
    # it calls `==` at all. Maybe it uses a different comparison function. So we
    # can't count calls to `==`.
    for w in words:
        # When index() is called, it raises a ValueError if `w` is not found. We
        # need to use a try-except block to catch the exception, otherwise the
        # program will crash.
        try:
            result = text.index(w)
        except ValueError:
            pass  # do nothing if w is not found
            
    end_time = time.time()                      # stop timing
    elapsed_seconds = end_time - start_time 
    
    print(f'index_linear_search elapsed time: {elapsed_seconds} seconds')

speed_test_while_linear_search()
speed_test_reverse_while_linear_search()
speed_test_index_linear_search()


Running speed_test_while_linear_search ...
while_linear_search elapsed time: 3.219709873199463 seconds
           total times == called: 116930120

Running speed_test_reverse_while_linear_search ...
reverse_while_linear_search elapsed time: 4.441780090332031 seconds
                   total times == called: 116861613

Running speed_test_index_linear_search ...
index_linear_search elapsed time: 0.6090891361236572 seconds


In summary, the experiment seems to show that `index` is the fastest, roughly 5
to 7 times faster than the other two. Surprisingly, the reverse while-loop is
noticeably slower than the regular while-loop (about 4.4 seconds versus 3.2
seconds), even though both call `==` almost the same number of times.

Here are our original hypotheses:

- **Hypothesis 1**: All three are about the same speed.
  <br> *False*: the algorithms are clearly not the same speed.
- **Hypothesis 2**: `index` is significantly *faster* than the other two.
  <br> *True*: `index` is significantly faster than the other two.
- **Hypothesis 3**: `index` is significantly *slower* than the other two.
  <br> *False*
- **Hypothesis 4**: The while-loop and reverse while-loop are about the same
  speed. <br> *False*: In this run the reverse while-loop was noticeably slower
  than the regular while-loop. It's not clear why this is the case. More testing
  on other machines and different data might help us understand why.
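
One way to do that kind of follow-up testing is to repeat each measurement several times on controlled data. The sketch below is an assumption about how such a follow-up could look, not part of the original experiment. It uses the standard `timeit` module on a synthetic list and searches for a value that is absent, so both directions perform exactly the same number of comparisons. No timings are shown because they vary by machine.

```python
import timeit

def while_linear_search(x, lst):
    """Left-to-right linear search; returns -1 if x is not in lst."""
    i = 0
    while i < len(lst):
        if lst[i] == x:
            return i
        i += 1
    return -1

def reverse_while_linear_search(x, lst):
    """Right-to-left linear search; returns -1 if x is not in lst."""
    i = len(lst) - 1
    while i >= 0:
        if lst[i] == x:
            return i
        i -= 1
    return -1

data = list(range(10_000))   # synthetic data, not the text files used above
missing = -1                 # not in data, so every search scans the whole list

for fn in (while_linear_search, reverse_while_linear_search):
    # Each of the 5 trials times 20 calls; the minimum is the least noisy trial.
    times = timeit.repeat(lambda: fn(missing, data), number=20, repeat=5)
    print(f'{fn.__name__}: best of 5 trials = {min(times):.4f} seconds')
```

Taking the minimum over several trials reduces the effect of other programs running on the machine at the same time.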


## Questions

1. What is an algorithm?
2. What is the difference between an algorithm and an implementation?
3. Would the linear search algorithm given in the notes still work correctly if
   in step 2 it returned 0 instead of -1? Why or why not? What if it returned $n$
   instead of -1?
4. What is the list search problem?
5. When analyzing the performance of the linear search algorithm (not
   implementation!), why do we count the number of times `==` is called instead
   of the time it takes to run the algorithm?
6. What is the *best-case*, *average-case*, and *worst-case* performance of the
   linear search algorithm?
7. Why does Python's `index` method *not* return -1 if the target value being
   searched for is not found? What does it do instead?
8. What are some reasons you might want to implement your own linear search
   algorithm instead of using Python's `index` method?
9. What was the slowest implementation of linear search in the experiment? The
   fastest?