##Workshop plan

- Part 1: introducing the example and writing a first test
- Part 2: building a test suite and fixing some bugs
- Part 3: refactoring and testing
- Part 4: introducing `nose`
- Part 5: interpreting test output
- Part 6: fixing regressions
- Part 7: wrapping up
- Part 8: special types of testing

###All about n-grams

If we take a list of words and split it into overlapping pairs, we get *digrams* (also called *2-grams*). For example, starting with

```
what is your favourite colour
```

gives us

```
what is
     is your
        your favourite
             favourite colour
```             

Here is how to do it in code:

In [238]:
text = "it was the best of times it was the worst of times"
words = text.split(' ')
# pair each word with the word that follows it
for start in range(len(words) - 1):
    digram = words[start] + ' ' + words[start + 1]
    print(digram)

it was
was the
the best
best of
of times
times it
it was
was the
the worst
worst of
of times


We can do the same for *3-grams*, *4-grams*, and so on, i.e. print all *n-grams* for any value of n:

In [239]:
text = "it was the best of times it was the worst of times"
words = text.split(' ')
n = 4
# each n-gram is a slice of n consecutive words
for start in range(len(words) + 1 - n):
    ngram = ' '.join(words[start:start + n])
    print(ngram)

it was the best
was the best of
the best of times
best of times it
of times it was
times it was the
it was the worst
was the worst of
the worst of times
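
The loop above prints its results, which makes them awkward to test. As a minimal sketch, the same slicing can be wrapped in a helper that returns the n-grams as a list. The name `make_ngrams` is hypothetical and is not part of the workshop code.

```python
def make_ngrams(words, n):
    """Return every run of n consecutive words, joined by single spaces."""
    # range() is empty when n is larger than len(words), so the result is []
    return [' '.join(words[start:start + n])
            for start in range(len(words) + 1 - n)]


words = "it was the best of times".split(' ')
trigrams = make_ngrams(words, 3)
print(trigrams)
```

Returning a list rather than printing means the result can be compared directly against an expected value.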


To analyse a piece of text, we might be interested in which n-grams occur most frequently. Let's write a function which takes a piece of text, a cutoff number, and a value for n, and returns all n-grams that occur at least the cutoff number of times in the text. First attempt:

In [240]:
def find_common_ngrams(text, cutoff, n):
    
    # split the text into a list of words
    words = text.lower().split(' ')
    
    # create an empty list to hold the result
    result = []

    # iterate over the start positions
    for start in range(len(words) +1 -n):
        
        # extract the n-gram from the list of words
        ngram = ' '.join(words[start:start+n])
        
        # count the number of times the ngram appears in the original text
        if text.count(ngram) >= cutoff:
            result.append(ngram)
            
    return result


Now a very simple test: can our function correctly figure out that 

`it was the best of times it was the worst of times`

contains the digrams

```
it was
was the
of times
```

twice?

In [241]:
text = "it was the best of times it was the worst of times"
find_common_ngrams(text, 2, 2)

['it was', 'was the', 'of times', 'it was', 'was the', 'of times']

Hmm, the elements in the answer are not unique. Is this correct? There is no way to tell from the description; it depends on what we want to use the output for.

**Testing forces us to think carefully about how we want our code to behave.**
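
To see where the duplicates come from, it can help to tally the digrams by start position. This is only an illustrative sketch: it uses `collections.Counter` on the list of digrams, which is a different counting method from the `text.count()` call in the function above, and the variable names are made up for this example.

```python
from collections import Counter

text = "it was the best of times it was the worst of times"
words = text.split(' ')

# one digram per start position, so repeated digrams appear more than once
digrams = [' '.join(words[start:start + 2]) for start in range(len(words) - 1)]

# tally how many start positions produce each digram
position_counts = Counter(digrams)

# keeping every position whose digram occurs at least twice keeps the duplicates
repeated = [d for d in digrams if position_counts[d] >= 2]
print(repeated)
```

Each start position is checked on its own, so a digram that occurs twice passes the check twice and is added twice.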

###Write the test before you fix the bug

Let's say that we only want unique digrams in the output. We can test this:

In [242]:
assert find_common_ngrams(text, 2, 2) == ['it was', 'was the', 'of times']

AssertionError: 

As expected, the test fails. Let's edit the code to fix it:

In [243]:
def find_common_ngrams(text, cutoff, n):
    words = text.lower().split(' ')
    result = []
    for start in range(len(words) +1 - n):
        ngram = ' '.join(words[start:start+n])
        
        # only add an n-gram to the result if it's not already in there
        if text.count(ngram) >= cutoff and ngram not in result:
            result.append(ngram)
            
    return result

assert find_common_ngrams(text, 2, 2) == ['it was', 'was the', 'of times']

Now the assertion passes without error. Why bother writing the test if we're going to fix the bug anyway? Because bugs have a habit of re-emerging when you start editing the code, and a test that is kept around will catch them when they do.
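
One way to keep the test around is to wrap the assertion in a function that can be re-run after every edit. The sketch below repeats the fixed function from above so that it is self-contained; the name `test_digrams_are_unique` is an assumption made for illustration, not something the workshop has defined.

```python
def find_common_ngrams(text, cutoff, n):
    words = text.lower().split(' ')
    result = []
    for start in range(len(words) + 1 - n):
        ngram = ' '.join(words[start:start + n])
        # only add an n-gram to the result if it's not already in there
        if text.count(ngram) >= cutoff and ngram not in result:
            result.append(ngram)
    return result


def test_digrams_are_unique():
    # hypothetical test name; raises AssertionError if duplicates come back
    text = "it was the best of times it was the worst of times"
    assert find_common_ngrams(text, 2, 2) == ['it was', 'was the', 'of times']


# re-run this call whenever find_common_ngrams is edited
test_digrams_are_unique()
```

If a later edit reintroduces the duplicates, calling the function again raises an `AssertionError` straight away.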



###How many tests to write?

As soon as we start thinking about testing, it becomes obvious that the number of possible tests is infinite. An efficient approach is to test extreme inputs: if the function works for n=1 (i.e. single words) and n=12 (i.e. the complete sentence), then it probably works for the values in between.
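
Extreme inputs also include values just outside the sensible range. As a hedged sketch, the checks below assume that the function should return an empty list when n is larger than the number of words, or when the cutoff is higher than any n-gram can reach. The workshop description does not specify that behaviour, so treat it as one possible choice. The function is repeated from above so that the example is self-contained.

```python
def find_common_ngrams(text, cutoff, n):
    words = text.lower().split(' ')
    result = []
    for start in range(len(words) + 1 - n):
        ngram = ' '.join(words[start:start + n])
        if text.count(ngram) >= cutoff and ngram not in result:
            result.append(ngram)
    return result


text = "it was the best of times it was the worst of times"

# n = 13 is one more than the number of words, so there are no start positions
too_long = find_common_ngrams(text, 1, 13)
assert too_long == []

# no digram appears anywhere near 100 times in a 12-word sentence
too_rare = find_common_ngrams(text, 100, 2)
assert too_rare == []
```

Writing these cases down forces a decision about what the function should do at its boundaries, which is the same point made earlier about the duplicate digrams.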

####Do this: work out what the correct output should be for the following function calls:

```
find_common_ngrams(text, 2, 1)
find_common_ngrams(text, 2, 3)
find_common_ngrams(text, 1, 12)
```
####and write `assert` statements to test these behaviours.

[click here for part 2](Testing for scientists part 2.html)
