# Algorithms 1: Linear Search

Algorithms are one of the key ideas in computer science.

An **algorithm** is a finite sequence of precise instructions that solves a
computational problem. For example, suppose you have a list of numbers and want
to add them all up. One algorithm for doing that is this:

1. Set the variable `sum` to 0.
2. If there are no numbers left in the list, then you're done: return `sum` and
   stop.
3. Otherwise, add the first number in the list to `sum`, and then remove that
   number from the list.
4. Go to step 2.

Given an algorithm, we can study it without necessarily implementing it in a
program. For instance:

- It always returns the correct sum of the numbers in the list, even if the list
  is empty.
- The run-time performance is quite good. It doesn't do any unnecessary work,
  and it stops as soon as it's done.
- It's quite space-efficient. The variable `sum` is the only extra memory it
  requires. 
- However, it is destructive. The list is modified, i.e. the numbers are removed
  from it.
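
To make the "destructive" point concrete, here is a minimal sketch that follows the four steps literally. The function name `destructive_sum` and the variable name `total` (used in place of `sum` so that Python's built-in `sum` is not shadowed) are illustrative choices, not part of the algorithm itself.

```python
def destructive_sum(numbers):
    """Sum the numbers by following the four steps literally.

    Warning: this empties the list that is passed in.
    """
    total = 0                    # step 1: start the running sum at 0
    while len(numbers) > 0:      # step 2: keep going while numbers are left
        total += numbers.pop(0)  # step 3: add the first number, then remove it
        # step 4: the loop goes back to the check in step 2
    return total                 # step 2: no numbers left, so return the sum

nums = [6, 4, 5]
print(destructive_sum(nums))  # 15
print(nums)                   # [] (the list has been emptied)
```

After the call, `nums` is empty, which is exactly the side effect the last bullet point warns about.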

Given an algorithm, we can also translate it to a real programming language. For
example:

In [2]:
def sum_list(numbers):
    """Return the sum of all numbers in list numbers.
    """
    sum = 0
    for n in numbers:
        sum += n
    return sum

print(sum_list([6, 4, 5]))  # 15

15


Notice that this function does *not* modify the list. Instead, it uses a
for-loop to iterate over the list without changing it. This is an easy and
common pattern in Python, and we might say it is a *Pythonic* implementation of
the algorithm.

Other languages might do it differently. For example, in the C language we might
implement the algorithm like this:

```c
int sum(int *numbers, int n) {
    int sum = 0;
    for (int i = 0; i < n; i++) {
        sum += numbers[i];
    }
    return sum;
}
```

Or in [Haskell](https://www.haskell.org/), we could implement the algorithm like
this (using recursion!):

```haskell
-- Named sumList to avoid clashing with the Prelude's built-in sum.
sumList :: [Int] -> Int
sumList [] = 0
sumList (x:xs) = x + sumList xs
```

All three of these functions are implementations of the same algorithm. 

## The Linear Search Problem

Suppose you have a list of values $[a_0, a_1, \ldots, a_{n-1}]$ that are all the
same type. They could be numbers, strings, letters, or lists --- *any* value
that can be compared using `==`.

The **list search problem** is this:

> Where in $[a_0, a_1, \ldots, a_{n-1}]$ is a given *target value* $x$?

Here's an algorithm for solving it:

1. Set the index variable `i` to 0.
2. If $i = n$, then you're done: return -1 and stop. -1 means $x$ was not found.
3. If $a_i == x$, then you're done: return `i` and stop.
4. Otherwise, increment `i` by 1 and go to step 2.

This algorithm is called **linear search**, or **sequential search**, because it
checks the items, one at a time, to see if they equal $x$. If it finds $x$, it
immediately returns the index position, $i$. If it reaches the end of the list
without finding $x$, it returns -1.

This is the standard algorithm for solving the list search problem. By looking
at the algorithm we can see that:
- It is pretty efficient. It doesn't appear to do any unnecessary work, and it
  stops as soon as it's done.
- It is space-efficient. The only extra memory it requires is the variable `i`.
- It is not destructive. The list is *not* modified.
- It is quite general. The items in the list can be any type that can be
  compared using `==`.
- We can also see that the algorithm is *correct*, i.e. that it always returns
  the right answer. Proving that mathematically is actually a challenge, and so
  in practice we usually rely on careful design and testing to make sure our
  algorithms are correct.

How fast is linear search? If we had an implementation of linear search we
could run an experiment with some data and time how long it takes. But we only
have the algorithm, written in pseudocode, and so we have nothing to run.

But we can say some useful things about the performance of the linear search
algorithm itself:

- At *most*, it calls `==` $n$ times, where $n$ is the length of the list.
- At *best*, it calls `==` once, when the first item in the list is $x$. Also,
  if the list is empty, `==` is called 0 times.
- On *average*, if $x$ is in the list and has the same chance of being at any
  position, it calls `==` approximately $\frac{n}{2}$ times. In general,
  figuring out the average performance of an algorithm is a bit tricky because
  it often depends on the data given to the algorithm.

So, just by looking at the algorithm, we see that linear search calls `==` from
0 to $n$ times. It is probably wise to assume the worst case, that it will call
`==` $n$ times every time. We say that the **worst-case performance** of linear
search is about $n$ calls to `==`.

Notice we said *nothing* about time. We only talked about the number of calls to
`==`. That's because the time it takes for any step of an algorithm to run
depends on the speed of the computer running it, and we have no idea what that
is since there is no computer yet!

But no matter how fast the computer is, or how much memory it has, it will
always do somewhere between 0 and $n$ calls to `==`.
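
These counts can be checked with a small sketch once an implementation exists. The helper `count_comparisons` below is a hypothetical name introduced only for illustration. It runs a linear search and also reports how many times `==` was called.

```python
def count_comparisons(x, lst):
    """Return (index, number of == calls) for a linear search for x in lst."""
    comparisons = 0
    for i, item in enumerate(lst):
        comparisons += 1          # one call to == per item examined
        if item == x:
            return i, comparisons
    return -1, comparisons        # x not found: every item was examined

lst = [10, 20, 30, 40, 50]        # n = 5, all items distinct
print(count_comparisons(10, lst))  # (0, 1): best case, x is first
print(count_comparisons(50, lst))  # (4, 5): x is last, n comparisons
print(count_comparisons(99, lst))  # (-1, 5): x not found, n comparisons
print(count_comparisons(99, []))   # (-1, 0): empty list, no comparisons

# Average number of comparisons when x is equally likely to be at any position.
average = sum(count_comparisons(x, lst)[1] for x in lst) / len(lst)
print(average)  # 3.0, which is (n + 1) / 2
```

For this five-item list the average is $(n + 1)/2 = 3$, which is close to the $\frac{n}{2}$ estimate and gets relatively closer as $n$ grows.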

## Implementation 1: index

Now let's see a couple of different ways to implement linear search.

Python's `index` method is a built-in way to do linear search:

In [4]:
nums = [3, 9, -2, 4, -2]
print(nums.index(-2))  # 2
print(nums.index(9))   # 1
print(nums.index(5))   # ValueError: 5 is not in list

2
1


ValueError: 5 is not in list

Or:

In [3]:
vehicles = ['car', 'ebike', 'foot', 'scooter']
print(vehicles.index('scooter'))  # 3
print(vehicles.index('bike'))     # ValueError: 'bike' is not in list

3


ValueError: 'bike' is not in list

`index` is a pretty good implementation of linear search, and you should use it
when you can. It's bug-free, fast, and easy to use. 

Notice that it raises a `ValueError` if $x$ is not found, and that exception
crashes the program unless it is caught. Our algorithm says that -1 should be
returned, but the "Pythonic" way of handling errors is to raise an exception.
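
If the -1 convention from the algorithm is preferred, the exception can be caught and converted. This is a minimal sketch, and the wrapper name `index_or_minus_one` is made up for illustration.

```python
def index_or_minus_one(x, lst):
    """Return the index of the left-most x in lst, or -1 if x is not in lst."""
    try:
        return lst.index(x)
    except ValueError:   # raised by index when x is not in lst
        return -1

nums = [3, 9, -2, 4, -2]
print(index_or_minus_one(-2, nums))  # 2
print(index_or_minus_one(5, nums))   # -1
```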

## Implementation 2: while loop

If Python already provides a good implementation of linear search, why would we
want to write our own? Well, there are a few reasons:

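- `index` is not always flexible enough. It can limit the search to a range of
  positions (for example, `lst.index(x, 10, 20)` searches index locations 10 to
  19), but it always scans from left to right and raises an exception when the
  target is missing. If you want different behavior, such as returning -1 or
  finding the right-most match, you need to write it yourself.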
- Someone needs to implement Python's `index` method. A programmer needs to
  implement it, and that programmer could be you.

So here's a while-loop implementation of linear search:

In [6]:
def while_linear_search(x, lst):
    """Returns the position of the left-most x in lst.
    If x is not in lst, returns -1.
    Order of the elements in lst doesn't matter.
    """
    i = 0
    while i < len(lst):
        if lst[i] == x:  # x found at location i
            return i 
        i += 1
    return -1            # x not in lst

nums = [3, 9, -2, 4, -2]
print(while_linear_search(-2, nums))  # 2
print(while_linear_search(5, nums))   # -1

vehicles = ['car', 'ebike', 'foot', 'scooter']
print(while_linear_search('scooter', vehicles))  # 3
print(while_linear_search('bike', vehicles))     # -1

2
-1
3
-1


A nice feature of this implementation is that each step of the algorithm is
clearly spelled out in code.
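
Because every step is explicit, the loop is also easy to adapt. As a sketch of that flexibility, the hypothetical variant `range_linear_search` below examines only the positions from `start` up to, but not including, `stop`. The parameter names and the handling of an out-of-range `stop` are assumptions made for this example.

```python
def range_linear_search(x, lst, start, stop):
    """Return the position of the left-most x in lst[start:stop], or -1.

    The returned position is relative to the whole list. Assumes start >= 0;
    a stop past the end of the list is treated as the end of the list.
    """
    i = start
    end = min(stop, len(lst))
    while i < end:
        if lst[i] == x:  # x found at location i
            return i
        i += 1
    return -1            # x not in the searched range

nums = [3, 9, -2, 4, -2]
print(range_linear_search(-2, nums, 3, 5))  # 4
print(range_linear_search(9, nums, 2, 5))   # -1
print(range_linear_search(-2, nums, 0, 2))  # -1
```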

## Implementation 3: reverse while loop

Reverse linear search is the same as regular linear search, except it scans the
list from right to left:

In [7]:
def reverse_while_linear_search(x, lst):
    """Returns the position of the right-most x in lst.
    If x is not in lst, returns -1.
    Order of the elements in lst doesn't matter.
    """
    i = len(lst) - 1     # starts at right end
    while i >= 0:
        if lst[i] == x:  # x found at location i
            return i 
        i -= 1           # i is decremented
    return -1            # x not in lst

nums = [3, 9, -2, 4, -2]
print(reverse_while_linear_search(-2, nums))  # 4 (not 2!)
print(reverse_while_linear_search(5, nums))   # -1

vehicles = ['car', 'ebike', 'foot', 'scooter']
print(reverse_while_linear_search('scooter', vehicles))  # 3
print(reverse_while_linear_search('bike', vehicles))     # -1

4
-1
3
-1
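
Earlier we said that in practice correctness is usually checked by testing. One way to do that, sketched below, is to compare both hand-written searches against Python's `index` on many small random lists. The function `check_searches`, the list sizes, and the fixed random seed are illustrative assumptions. The two search functions are repeated so the block runs on its own.

```python
import random

def while_linear_search(x, lst):
    i = 0
    while i < len(lst):
        if lst[i] == x:
            return i
        i += 1
    return -1

def reverse_while_linear_search(x, lst):
    i = len(lst) - 1
    while i >= 0:
        if lst[i] == x:
            return i
        i -= 1
    return -1

def check_searches(trials=1000):
    """Compare both searches with list.index on random data."""
    rng = random.Random(0)  # fixed seed so the test is repeatable
    for _ in range(trials):
        lst = [rng.randint(0, 9) for _ in range(rng.randint(0, 12))]
        x = rng.randint(0, 9)
        if x in lst:
            expected_left = lst.index(x)
            # right-most position: search the reversed copy, then convert back
            expected_right = len(lst) - 1 - lst[::-1].index(x)
        else:
            expected_left = expected_right = -1
        assert while_linear_search(x, lst) == expected_left
        assert reverse_while_linear_search(x, lst) == expected_right
    return trials

print(check_searches())  # 1000 (no assertion failed)
```

Passing tests like these do not prove the functions are correct, but they make many kinds of bugs much less likely.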


## Which Implementation is Fastest?

We've seen three implementations of linear search: `index`, a while-loop, and a
reverse while-loop. 

Which one do you think is fastest? Let's think about the possible outcomes.

We know from the algorithm analysis we did at the beginning that each of them
calls `==` at most $n$ times per search. So one guess is that all three are
about the same speed.

But `index` is special: it is implemented inside the Python interpreter itself
(in C, in the standard CPython interpreter), and so it could be very efficient
by using features that are not available to Python code. So another guess is
that `index` is faster than the other two.

But, maybe `index` is special in a way that makes it slower than the other two?
Maybe, say, it is more general-purpose, or maybe the fact that it raises an
exception makes it slower? It's hard to say without running an experiment.

Finally, it seems like the while-loop and reverse while-loop implementations
should be about the same speed. Surely the direction of the searching does not
make a difference?

So let's keep these hypotheses in mind:

- **Hypothesis 1**: all three are about the same speed.
- **Hypothesis 2**: `index` is significantly *faster* than the other two.
- **Hypothesis 3**: `index` is significantly *slower* than the other two.
- **Hypothesis 4**: the while-loop and reverse while-loop are about the same
  speed.

### Running an Experiment

Let's create, for each implementation, a test function that calls it lots of
times on some data. Each test function prints the time it takes to run all of
the searches.

To measure the running time, we will use the `time.time()` function from the
`time` module like this:

In [12]:
import time

start_time = time.time()
print("Hello, World!")
end_time = time.time()

print(f'Elapsed time: {end_time - start_time} seconds')

Hello, World!
Elapsed time: 0.0005843639373779297 seconds


The experiment we run on each implementation is to search the text of *Pride
and Prejudice* for each word in a list of words:

In [27]:
def speed_test_while_linear_search():
    print('\nRunning speed_test_while_linear_search ...')
    words = open('words.txt').read().split()
    text = open('austenPandP.txt').read().lower().split()
    words = words[:len(words)//100]  # use only 1/100 of the words
 
    start_time = time.time()
    total = 0
    for w in words:
        total += while_linear_search(w, text)
    end_time = time.time()
    elapsed_seconds = end_time - start_time
    
    print(f'while_linear_search elapsed time: {elapsed_seconds} seconds')
    print(f'total: {total}')

def speed_test_reverse_while_linear_search():
    print('\nRunning speed_test_reverse_while_linear_search ...')
    words = open('words.txt').read().split()
    text = open('austenPandP.txt').read().lower().split()
    words = words[:len(words)//100]  # use only 1/100 of the words
 
    start_time = time.time()
    total = 0
    for w in words:
        total += reverse_while_linear_search(w, text)
    end_time = time.time()
    elapsed_seconds = end_time - start_time
    
    print(f'reverse_while_linear_search elapsed time: {elapsed_seconds} seconds')
    print(f'total: {total}')

def speed_test_index_linear_search():
    print('\nRunning speed_test_index_linear_search ...')
    words = open('words.txt').read().split()
    text = open('austenPandP.txt').read().lower().split()
    words = words[:len(words)//100]  # use only 1/100 of the words
 
    start_time = time.time()
    total = 0
    for w in words:
        try:
            total += text.index(w)
        except ValueError:
            total += -1
    end_time = time.time()
    elapsed_seconds = end_time - start_time
    
    print(f'index_linear_search elapsed time: {elapsed_seconds} seconds')
    print(f'total: {total}')

speed_test_while_linear_search()
speed_test_reverse_while_linear_search()
speed_test_index_linear_search()


Running speed_test_while_linear_search ...
while_linear_search elapsed time: 9.102183818817139 seconds
total: 4142304

Running speed_test_reverse_while_linear_search ...
reverse_while_linear_search elapsed time: 6.664890289306641 seconds
total: 9267421

Running speed_test_index_linear_search ...
index_linear_search elapsed time: 0.907773494720459 seconds
total: 4142304
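
The three test functions above differ only in which search they call, so they could be folded into a single helper that takes the search function as a parameter. The sketch below is one possible way to do that, not the code that produced the timings above. The names `time_search` and `index_search` are hypothetical, `time.perf_counter()` is swapped in for `time.time()` because it is intended for measuring short intervals, and a tiny made-up text stands in for the data files.

```python
import time

def time_search(search_fn, words, text):
    """Call search_fn(w, text) for each w in words.

    Returns (elapsed_seconds, total), where total is the sum of the
    returned positions, as in the speed tests above.
    """
    start_time = time.perf_counter()
    total = 0
    for w in words:
        total += search_fn(w, text)
    elapsed_seconds = time.perf_counter() - start_time
    return elapsed_seconds, total

def index_search(x, lst):
    """Adapter so that list.index returns -1 instead of raising ValueError."""
    try:
        return lst.index(x)
    except ValueError:
        return -1

# Tiny made-up data so the sketch runs without words.txt or austenPandP.txt.
text = 'it is a truth universally acknowledged'.split()
words = ['truth', 'a', 'zebra']

elapsed, total = time_search(index_search, words, text)
print(f'index_search: total = {total}, elapsed = {elapsed} seconds')
```

Any function with the signature `search_fn(x, lst)` can be passed in, including the two while-loop searches. Note that wrapping `index` in an adapter function adds a small amount of call overhead to each search.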


In summary, the experiment seems to show that `index` is the fastest, roughly 7
to 10 times faster than the others. But, perhaps surprisingly, the reverse
while-loop is noticeably faster than the regular while-loop.

Here are our original hypotheses:

- **Hypothesis 1**: all three are about the same speed.
  <br> *False*: the algorithms are clearly not the same speed.
- **Hypothesis 2**: `index` is significantly *faster* than the other two.
  <br> *True*: `index` is significantly faster than the other two.
- **Hypothesis 3**: `index` is significantly *slower* than the other two.
  <br> *False*
- **Hypothesis 4**: the while-loop and reverse while-loop are about the same
  speed. <br> *False*: the reverse while-loop is noticeably faster than the
  regular while-loop (about 6.7 seconds versus 9.1 seconds in this run).

## Questions

1. What is an algorithm?
2. What is the difference between an algorithm and an implementation?
3. Would the linear search algorithm given in the notes still work correctly if
   in step 2 it returned 0 instead of -1? Why or why not? What if it returned
   $n$ instead of -1?
4. What is the list search problem?
5. When analyzing the performance of the linear search algorithm (not
   implementation!), why do we count the number of times `==` is called instead
   of the time it takes to run the algorithm?
6. What is the *best-case*, *average-case*, and *worst-case* performance of the
   linear search algorithm?
7. Why does Python's `index` method *not* return -1 if the target value being
   searched for is not found? What does it do instead?
8. What are some reasons you might want to implement your own linear search
   algorithm instead of using Python's `index` method?
9. What was the slowest implementation of linear search in the experiment? The
   fastest?