# Assignment 8: kmers

## Finding Common K-mers

In this assignment, you will create a Python program called `kmers.py` that accepts two readable text files and an optional `-k|--kmer` argument. The argument takes an integer value greater than 0 and defaults to 3.

In [None]:
# Run this cell to make sure this assignment is up to date
%cd ~/be434-Spring2024
!git pull --no-edit upstream main

## Getting Started with new.py

Let's start by using `new.py` to create a program template.


In [None]:
# Generate the `kmers.py` using `new.py`
%cd ~/be434-Spring2024/assignments/08_kmers
!new.py -p 'Find common kmers' kmers.py

You should see the following:

```
$ new.py -p 'Find common kmers' kmers.py
Done, see new script "kmers.py."
```

## Instructions

### Usage

When provided no arguments, the program should print a brief usage:

```
$ ./kmers.py
usage: kmers.py [-h] [-k int] FILE1 FILE2
kmers.py: error: the following arguments are required: FILE1, FILE2
```

When run with `-h|--help`, it should print a longer help message:

```
$ ./kmers.py -h
usage: kmers.py [-h] [-k int] FILE1 FILE2

Find common kmers

positional arguments:
  FILE1               Input file 1
  FILE2               Input file 2

optional arguments:
  -h, --help          show this help message and exit
  -k int, --kmer int  K-mer size (default: 3)
```

### Output

The program should generate errors if either of the file arguments is invalid:

```
$ ./kmers.py foo bar
usage: kmers.py [-h] [-k int] FILE1 FILE2
kmers.py: error: argument FILE1: can't open 'foo': 
[Errno 2] No such file or directory: 'foo'

$ ./kmers.py ./inputs/foo.txt bar
usage: kmers.py [-h] [-k int] FILE1 FILE2
kmers.py: error: argument FILE2: can't open 'bar': 
[Errno 2] No such file or directory: 'bar'
```

The program should reject a non-integer value for `-k|--kmer`:

```
$ ./kmers.py ./inputs/foo.txt ./inputs/bar.txt -k foo
usage: kmers.py [-h] [-k int] FILE1 FILE2
kmers.py: error: argument -k/--kmer: invalid int value: 'foo'
```

Any non-positive value for `-k|--kmer` should likewise be rejected.
Consider manually checking the value of `args.kmer` in the `get_args()` function and using `parser.error()` to create this error:

```
$ ./kmers.py ./inputs/foo.txt ./inputs/bar.txt -k 0
usage: kmers.py [-h] [-k int] FILE1 FILE2
kmers.py: error: --kmer "0" must be > 0
```
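One way to set this up is sketched below. It assumes `argparse.FileType('rt')` for the two file arguments, which produces the "can't open" errors shown earlier, and `ArgumentDefaultsHelpFormatter` for the "(default: 3)" text in the help. The `argv` parameter is a hypothetical addition so the function can be called without a real command line; the template from `new.py` may be organized differently.

```python
import argparse


def get_args(argv=None):
    """Parse arguments (argv is a hypothetical hook for testing)."""

    parser = argparse.ArgumentParser(
        description='Find common kmers',
        formatter_class=argparse.ArgumentDefaultsHelpFormatter)

    parser.add_argument('file1', metavar='FILE1',
                        type=argparse.FileType('rt'), help='Input file 1')
    parser.add_argument('file2', metavar='FILE2',
                        type=argparse.FileType('rt'), help='Input file 2')
    parser.add_argument('-k', '--kmer', metavar='int', type=int,
                        default=3, help='K-mer size')

    args = parser.parse_args(argv)

    # argparse has already rejected non-integers; reject 0 and negatives here
    if args.kmer < 1:
        parser.error(f'--kmer "{args.kmer}" must be > 0')

    return args
```

`parser.error()` prints the short usage plus the message and exits with a non-zero status, so nothing after it in `get_args()` runs for a bad value.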

When run with the default `-k|--kmer` of 3, the program should find two shared 3-mers between the _inputs/foo.txt_ and _inputs/bar.txt_ files.
The output should be each shared k-mer followed by the number of times it was found in the first file and then in the second file.
The three columns should be formatted with widths of 10, 5, and 5, respectively.
The order of the rows is not important:

```
$ ./kmers.py ./inputs/foo.txt ./inputs/bar.txt
bar            1     1
foo            1     1
```
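The column widths can be expressed with Python's format specifications. The sketch below assumes a single space between columns, which is inferred from the spacing in the sample output rather than stated in the instructions. The `format_row` helper is a hypothetical name used only for illustration.

```python
def format_row(kmer, count1, count2):
    """Format one output row: k-mer in 10 columns, each count in 5."""

    # Strings are left-justified and integers right-justified by default
    return f'{kmer:10} {count1:5} {count2:5}'


print(format_row('bar', 1, 1))
print(format_row('foo', 1, 1))
```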

Change the size of `-k` to 2 and notice the difference:

```
$ ./kmers.py ./inputs/foo.txt ./inputs/bar.txt -k 2
ar             1     1
ba             2     1
fo             1     1
oo             1     1
```

Try it on the DNA sequences:

```
$ ./kmers.py inputs/sample1.txt inputs/sample2.txt
AAA            4     2
AAT            3     1
ATA            1     1
CCC            2     2
TAA            1     1
TCC            2     2
TTC            1     1
TTT            4     4
```

Try it on the American and British language files:

```
$ ./kmers.py inputs/american.txt inputs/british.txt -k 4 | head -n 5
abou           1     2
ally           1     1
alog           2     2
anal           1     1
atal           1     1
```


## Time to write some code!

Open the script here in VS Code in be434-Spring2024 -> assignments -> 08_kmers -> kmers.py 

Write/edit the code using the instructions above.

## Writing the Program

To get started, make sure that all of the arguments provided by the user
are valid. Your program should pass the following tests before you start
writing the rest of the program:

* test_exists
* test_usage
* test_no_args
* test_bad_file1
* test_bad_file2
* test_bad_kmer_string
* test_bad_kmer_not_positive

You can always open the test.py file to see what tests are being performed and 
determine how to "pass" them by checking the input arguments.

In this program, you will want to break each "word" or sequence into k-mers.
I suggest you incorporate this function into your program to do this:

```
def find_kmers(seq, k):
    """ Find k-mers in string """

    n = len(seq) - k + 1
    return [] if n < 1 else [seq[i:i + k] for i in range(n)]
```

It might help to add a _unit test_ for this function.
Add the following to your `kmers.py` program just after the `find_kmers` function:

```
def test_find_kmers():
    """ Test find_kmers """

    assert find_kmers('', 1) == []
    assert find_kmers('ACTG', 1) == ['A', 'C', 'T', 'G']
    assert find_kmers('ACTG', 2) == ['AC', 'CT', 'TG']
    assert find_kmers('ACTG', 3) == ['ACT', 'CTG']
    assert find_kmers('ACTG', 4) == ['ACTG']
    assert find_kmers('ACTG', 5) == []
```



In [None]:
# You can now run `pytest` to check that the function works:
%cd ~/be434-Spring2024/assignments/08_kmers
!pytest -v kmers.py

You should see something like the following:

```
$ pytest -v kmers.py
============================= test session starts ==============================
...
collected 1 item

kmers.py::test_find_kmers PASSED                                         [100%]

============================== 1 passed in 0.00s ===============================
```

To integrate this function into your code, consider something like the following:

```
words1 = {}
for line in args.file1:
    for word in line.split():
        for kmer in find_kmers(word, args.kmer):
            # increment the count of this "kmer" in "words1"
            words1[kmer] = words1.get(kmer, 0) + 1
```

Try to get your program to print two dictionaries, one per input file, each holding the frequency of the k-mers found in that file:

```
$ ./kmers.py inputs/foo.txt inputs/bar.txt
{'foo': 1, 'bar': 1, 'baz': 1}
{'quu': 1, 'uux': 1, 'bar': 1, 'fli': 1, 'lip': 1, 'foo': 1}
```

If you change `-k` to 2, it should print this:

```
$ ./kmers.py inputs/foo.txt inputs/bar.txt -k 2
{'fo': 1, 'oo': 1, 'ba': 2, 'ar': 1, 'az': 1}
{'qu': 1, 'uu': 1, 'ux': 1, 'ba': 1, 'ar': 1, 'fl': 1, 'li': 1, 'ip': 1, 
 'fo': 1, 'oo': 1}
```

If you change `-k` to 4, it will find no k-mers in the first file because all the words are three characters:

```
$ ./kmers.py inputs/foo.txt inputs/bar.txt -k 4
{}
{'quux': 1, 'flip': 1}
```
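As an alternative to incrementing a plain dictionary by hand, the counting step could use `collections.Counter` from the standard library. This is only a sketch: `count_kmers` is a hypothetical helper name, and the `io.StringIO` text is an assumed stand-in for _inputs/foo.txt_, inferred from the dictionaries printed above rather than taken from the actual file.

```python
import io
from collections import Counter


def find_kmers(seq, k):
    """Find k-mers in string (same function as suggested above)."""

    n = len(seq) - k + 1
    return [] if n < 1 else [seq[i:i + k] for i in range(n)]


def count_kmers(fh, k):
    """Count the k-mers of every whitespace-separated word in a file handle."""

    counts = Counter()
    for line in fh:
        for word in line.split():
            # Counter.update() adds 1 for each k-mer in the list
            counts.update(find_kmers(word, k))

    # Convert back to a plain dict so printing it looks like the examples
    return dict(counts)


# Assumed stand-in for inputs/foo.txt, inferred from the output above
fake_file = io.StringIO('foo bar baz\n')
print(count_kmers(fake_file, 3))
```

Because `argparse` hands the program open file handles, a function that takes a file handle works the same way on `args.file1` as it does on the `io.StringIO` object used here.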

In [None]:
# Try running the examples above
# Example 1, default k value
%cd ~/be434-Spring2024/assignments/08_kmers
!./kmers.py inputs/foo.txt inputs/bar.txt

In [None]:
# Example 2: -k 2
!./kmers.py inputs/foo.txt inputs/bar.txt -k 2

In [None]:
# Example 3: -k 4, no k-mers found
!./kmers.py inputs/foo.txt inputs/bar.txt -k 4



Next, find the shared keys of the two dictionaries.
First, just get your program to print these shared k-mers:

```
$ ./kmers.py inputs/foo.txt inputs/bar.txt
bar
foo
```

Then add the counts from the first and second files, formatted in columns of 10, 5, and 5:

```
$ ./kmers.py inputs/foo.txt inputs/bar.txt
bar            1     1
foo            1     1
```
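One possible way to find the shared keys is a set intersection on the dictionaries' key views. The sketch below uses the two dictionaries printed earlier for the default `k` of 3; `shared_kmers` is a hypothetical helper name, and the sorting is optional since the row order does not matter.

```python
def shared_kmers(counts1, counts2):
    """Return (kmer, count1, count2) for each k-mer present in both dicts."""

    # dict.keys() views support set operations such as intersection (&)
    common = counts1.keys() & counts2.keys()
    return [(kmer, counts1[kmer], counts2[kmer]) for kmer in sorted(common)]


# The two dictionaries printed earlier for the default k of 3
words1 = {'foo': 1, 'bar': 1, 'baz': 1}
words2 = {'quu': 1, 'uux': 1, 'bar': 1, 'fli': 1, 'lip': 1, 'foo': 1}

for kmer, count1, count2 in shared_kmers(words1, words2):
    print(kmer, count1, count2)
```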

At this point, your program should pass all the tests.

## Testing

As you write your code, you can test it along the way to make sure that you are passing all of the tests for the homework. 
We will use the test suite that is included with the assignment to test that you are meeting all of the requirements in the instructions above. 
The steps below test your code. You can run these commands from a shell within VS Code, or you can run them here.

In [None]:
# Format your code to make it beautiful (this is called formatting; the linters run as part of the tests)
!black ~/be434-Spring2024/assignments/08_kmers/kmers.py

In [None]:
# Now run the tests on your code
%cd ~/be434-Spring2024/assignments/08_kmers
!make test

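A passing test suite looks something like this. The test names in this sample are from a different assignment; your run will list the tests for `kmers.py` instead: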

```
============================= test session starts ==============================
...
--------------------------------------------------------------------------------
Linting files
.
--------------------------------------------------------------------------------

test.py::PYLINT PASSED                                                   [ 11%]
test.py::FLAKE8 PASSED                                                   [ 22%]
test.py::test_exists PASSED                                              [ 33%]
test.py::test_usage PASSED                                               [ 44%]
test.py::test_defaults PASSED                                            [ 55%]
test.py::test_greeting PASSED                                            [ 66%]
test.py::test_name PASSED                                                [ 77%]
test.py::test_excited PASSED                                             [ 88%]
test.py::test_all_options PASSED                                         [100%]

============================== 9 passed in 0.51s ===============================
```

Your grade is whatever percentage of tests your code passes.

## Uploading your code to GitHub

Once you have written the code for your assignment, and are passing all of the tests above, you are ready to submit the assignment for grading. Use the steps below to submit your code to GitHub.

* Note: if you are having trouble passing tests and need help, you can also submit the code with a different commit message like the following:

```
git commit -m "test_bad_file1 failing for 08_kmers"
```

Once you have done that, send a private Slack message to me (@bhurwitz) to let me know you submitted code and need help.


In [None]:
# Submit your code to GitHub
%cd
%cd be434-Spring2024
!git add -A && git commit -m "Submitting 08_kmers for grading"
!git push

Great job! You are done with this assignment.

## Authors

Bonnie Hurwitz <bhurwitz@arizona.edu> and Ken Youens-Clark <kyclark@gmail.com>