# Writing Functions

## Functions with variable numbers of arguments

The task here is to take the names of any number of FASTA files and return each name together with its sequence length, sorted by length in ascending order. (I have deliberately avoided Biopython on this occasion.)

I begin by defining two simple functions that do part of the work: 

1. `is_fasta()` checks whether a file has a FASTA extension. This could easily be extended to perform other checks, e.g. whether the file actually exists.

2. `get_fasta_seq_len()` calculates the length of the sequence in a FASTA file.

Note that the second function invokes the first function.

In [1]:
import os
import operator

def is_fasta(file_name):
    """
    If file is FASTA, return True, otherwise False.
    
    Note: function only checks for file extension .fasta.
    """

    ext = os.path.splitext(file_name)[1]
    return ext == '.fasta'
    
def get_fasta_seq_len(fasta_fname):
    """Returns length of FASTA sequence or zero if not FASTA."""
 
    if not is_fasta(fasta_fname):
        return 0

    with open(fasta_fname, 'r') as f:
        f.readline()                      # skip the header line
        seq = f.read().replace('\n', '')  # join the remaining lines into one string
    return len(seq)
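
The list above mentions that the extension check could be extended to test whether the file actually exists. A minimal sketch of that idea follows. The function name `is_existing_fasta` and the extra `.fa` extension are my own assumptions for illustration, not part of the original functions.

```python
import os

def is_existing_fasta(file_name, extensions=('.fasta', '.fa')):
    """
    Return True if file_name has a FASTA-like extension and exists on disk.

    Hypothetical variant of is_fasta(); the accepted extensions are an assumption.
    """
    ext = os.path.splitext(file_name)[1].lower()
    # Both conditions must hold: a recognised extension and a real file
    return ext in extensions and os.path.isfile(file_name)

# A name with the right extension that does not exist on disk is rejected
print(is_existing_fasta('no_such_directory/no_such_file.fasta'))
```

With a check like this, `get_fasta_seq_len()` could return zero for a missing file instead of raising an error from `open()`.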

Now I write a function that will read in as many FASTA files as I want and return their names and lengths sorted by length. The structure returned here is a list of tuples.

In [2]:
def sort_fasta_names_by_length(*args):
    """Returns list of FASTA files sorted by sequence length."""
 
    fasta_lengths = {}   
    for a in args:
        this_len = get_fasta_seq_len(a)
        if this_len > 0:
            fasta_lengths[a] = this_len
  
    return sorted( fasta_lengths.items(), key=operator.itemgetter(1) )
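
To make the `*args` mechanism explicit, here is a small sketch using a toy function (`count_args` is a hypothetical name, used only for illustration). It shows that the positional arguments arrive as a tuple, and that a list of file names can be unpacked into separate arguments with `*` at the call site.

```python
def count_args(*args):
    """Return how many positional arguments were passed and the type that holds them."""
    return len(args), type(args).__name__

file_names = ['a.fasta', 'b.fasta', 'c.fasta']

print(count_args('a.fasta', 'b.fasta'))  # (2, 'tuple')
print(count_args(*file_names))           # (3, 'tuple'): the list is unpacked
print(count_args(file_names))            # (1, 'tuple'): the whole list is one argument
```

The same unpacking would work for `sort_fasta_names_by_length(*file_names)` when the file names are already held in a list.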

Having written the function, I can test it out using different arguments:

In [3]:
fasta1 = '../data/A0A0G2RR03.fasta'
fasta2 = '../data/A0A0G2RZ64.fasta'
fasta3 = '../data/FA8_HUMAN.fasta'
fasta4 = '../data/P03437.fasta'

help(sort_fasta_names_by_length)
print('With 3 FASTA files:')
print(sort_fasta_names_by_length(fasta1, fasta2, fasta3))
print('\nWith 4 FASTA files (ordered differently) + non-existent file:')
print(sort_fasta_names_by_length(fasta3, fasta4, 'dummy.txt', fasta2, fasta1))
print('\nWith single arguments:')
print(sort_fasta_names_by_length(fasta1))

Help on function sort_fasta_names_by_length in module __main__:

sort_fasta_names_by_length(*args)
    Returns list of FASTA files sorted by sequence length.

With 3 FASTA files:
[('../data/A0A0G2RR03.fasta', 376), ('../data/A0A0G2RZ64.fasta', 566), ('../data/FA8_HUMAN.fasta', 2351)]

With 4 FASTA files (ordered differently) + non-existent file:
[('../data/A0A0G2RR03.fasta', 376), ('../data/P03437.fasta', 566), ('../data/A0A0G2RZ64.fasta', 566), ('../data/FA8_HUMAN.fasta', 2351)]

With single arguments:
[('../data/A0A0G2RR03.fasta', 376)]
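
In the second output above, two files share the length 566, and they appear in the order in which they were passed to the function. This happens because `sorted()` is stable and dictionaries preserve insertion order. The sketch below reproduces the effect with a hand-written dictionary (the keys are shortened for readability) and shows one possible way to break ties by name instead, which is my own suggestion rather than part of the original function.

```python
import operator

# Same lengths as in the output above, inserted in the order P03437, A0A0G2RZ64, A0A0G2RR03
lengths = {'P03437': 566, 'A0A0G2RZ64': 566, 'A0A0G2RR03': 376}

# Sorting on length only: ties keep their insertion order
by_length = sorted(lengths.items(), key=operator.itemgetter(1))
print(by_length)
# [('A0A0G2RR03', 376), ('P03437', 566), ('A0A0G2RZ64', 566)]

# Sorting on (length, name): ties are resolved alphabetically by name
by_length_then_name = sorted(lengths.items(), key=operator.itemgetter(1, 0))
print(by_length_then_name)
# [('A0A0G2RR03', 376), ('A0A0G2RZ64', 566), ('P03437', 566)]
```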


## Functions with keyword arguments

Now I'm going to modify the preceding function to add a couple of additional options using keyword parameters:

1. `reverse` parameter: returns the output in reverse order if set to `True` (default `False`).

2. `names_only` parameter: returns a list of the sorted filenames only, rather than a list of tuples (default `True`).

In [4]:
def sort_fasta_names_by_length(*args, reverse=False, names_only=True):
    """Returns list of FASTA files sorted by sequence length."""
 
    fasta_lengths = {}   
    for a in args:
        this_len = get_fasta_seq_len(a)
        if this_len > 0:
            fasta_lengths[a] = this_len
  
    # sorted() accepts the reverse flag directly
    sorted_by_length = sorted(
        fasta_lengths.items(),
        key=operator.itemgetter(1),
        reverse=reverse
    )

    # check whether only names are wanted
    if names_only:
        return [name for name, _ in sorted_by_length]
    return sorted_by_length
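
Parameters that follow `*args` in a signature are keyword-only: they can only be set by name. The toy function below (`show_call` is a hypothetical name) has the same signature shape as the function above and simply reports what ended up where.

```python
def show_call(*args, reverse=False):
    """Return the collected positional arguments and the value of reverse."""
    return args, reverse

# Passed by keyword: reverse is set as intended
print(show_call('a.fasta', 'b.fasta', reverse=True))
# (('a.fasta', 'b.fasta'), True)

# Passed positionally: True is swallowed by *args and reverse keeps its default
print(show_call('a.fasta', 'b.fasta', True))
# (('a.fasta', 'b.fasta', True), False)
```

In the real function, a positional `True` would therefore be treated as one more file name rather than as the flag, so the options should always be written as `reverse=...` and `names_only=...`.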

Having written the revised function, I can test it out using different combinations of arguments. 

**Note:** The revised function has the same name as the first version in the preceding section. It is crucial that the revised definition is executed before running the next cell.    

In [5]:
help(sort_fasta_names_by_length)

print('\n4 FASTA files, arbitrary order, plus non-existent file:')
print(sort_fasta_names_by_length(fasta3, fasta1, 'dummy.txt', fasta2, fasta4))
print('\nnames_only set to False:')
print(sort_fasta_names_by_length(fasta2, fasta4, fasta1, fasta3, names_only=False))
print('\nreverse set to True:')
print(sort_fasta_names_by_length(fasta1, fasta4, fasta3, fasta2, reverse=True))
print('\nnames_only=False, reverse=True:')
print(sort_fasta_names_by_length(fasta4, fasta2, fasta3, fasta1, names_only=False, reverse=True))

Help on function sort_fasta_names_by_length in module __main__:

sort_fasta_names_by_length(*args, reverse=False, names_only=True)
    Returns list of FASTA files sorted by sequence length.


4 FASTA files, arbitrary order, plus non-existent file:
['../data/A0A0G2RR03.fasta', '../data/A0A0G2RZ64.fasta', '../data/P03437.fasta', '../data/FA8_HUMAN.fasta']

names_only set to False:
[('../data/A0A0G2RR03.fasta', 376), ('../data/A0A0G2RZ64.fasta', 566), ('../data/P03437.fasta', 566), ('../data/FA8_HUMAN.fasta', 2351)]

reverse set to True:
['../data/FA8_HUMAN.fasta', '../data/P03437.fasta', '../data/A0A0G2RZ64.fasta', '../data/A0A0G2RR03.fasta']

names_only=False, reverse=True:
[('../data/FA8_HUMAN.fasta', 2351), ('../data/P03437.fasta', 566), ('../data/A0A0G2RZ64.fasta', 566), ('../data/A0A0G2RR03.fasta', 376)]


## Write two functions

Wrap each of the strings below (the DNA and amino-acid alphabets) in a function that accepts a biological sequence (stored in a one-line string without newlines) as an argument. Each function should return `True` if every character in the sequence is one of the letters in the string, and `False` otherwise. The functions should be named as follows:

1. `is_dna(seq)`
2. `is_aa(seq)`

In [54]:
def is_dna(seq: str) -> bool: 
    dna_letters = 'ACGT'
    for i in seq.upper():
        if i not in dna_letters:
            return False
    return True

def is_aa(seq: str) -> bool: 
    aa_letters = 'ACDEFGHIKLMNPQRSTVWY'
    for i in seq.upper():
        if i not in aa_letters:
            return False
    return True

Now test the functions using the following list of sequences:

In [55]:
seqs = [
    'ATGTCGATAGCCAGCGACCCATTGATTGCC',
    'MSIASDPLIAGLDDQQREAVLAPRGPVCVLAGAGTGKTRTITHR',
    'atgtcgatagccagcgacccattgattgcc',
    'msiasdpliaglddqqreavlaprgpvcvlagagtgktrtithr'
]

dna_checker = [is_dna(i) for i in seqs]
aa_checker = [is_aa(i) for i in seqs]

print(dna_checker)
print(aa_checker)

[True, False, True, False]
[True, True, True, True]
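
The second result is `True` for every sequence because the four DNA letters are also valid amino-acid letters, so `is_aa()` cannot tell DNA from protein. The sketch below checks this with sets and shows two alternative ways of writing the same test. The names `is_dna_all` and `is_aa_subset` are hypothetical, chosen so that they do not overwrite the functions above.

```python
DNA_LETTERS = set('ACGT')
AA_LETTERS = set('ACDEFGHIKLMNPQRSTVWY')

def is_dna_all(seq: str) -> bool:
    """Same test as is_dna(), written with all() and a generator expression."""
    return all(ch in DNA_LETTERS for ch in seq.upper())

def is_aa_subset(seq: str) -> bool:
    """Same test as is_aa(), written as a subset comparison between sets."""
    return set(seq.upper()) <= AA_LETTERS

# Every DNA letter is also an amino-acid letter
print(DNA_LETTERS <= AA_LETTERS)   # True

print(is_dna_all('atgtcgatagcc'))  # True
print(is_dna_all('MSIASDPLIAGL'))  # False
print(is_aa_subset('ATGTCGATAGCC'))  # True
```

Like the loop versions, both alternatives return `True` for an empty string.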


If you have time, add a keyword parameter `ignore_case` to the functions and revise them so that case-insensitive matching can be switched on or off. Then try them out.
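
One possible solution is sketched below. The default value `ignore_case=True` is my own assumption, chosen so that the functions behave as before when the parameter is omitted; the exercise does not prescribe a default.

```python
def is_dna(seq: str, ignore_case: bool = True) -> bool:
    """Return True if seq contains only the letters A, C, G and T."""
    if ignore_case:
        seq = seq.upper()
    return all(ch in 'ACGT' for ch in seq)

def is_aa(seq: str, ignore_case: bool = True) -> bool:
    """Return True if seq contains only the 20 standard amino-acid letters."""
    if ignore_case:
        seq = seq.upper()
    return all(ch in 'ACDEFGHIKLMNPQRSTVWY' for ch in seq)

seqs = [
    'ATGTCGATAGCCAGCGACCCATTGATTGCC',
    'MSIASDPLIAGLDDQQREAVLAPRGPVCVLAGAGTGKTRTITHR',
    'atgtcgatagccagcgacccattgattgcc',
    'msiasdpliaglddqqreavlaprgpvcvlagagtgktrtithr'
]

print([is_dna(s) for s in seqs])                     # [True, False, True, False]
print([is_dna(s, ignore_case=False) for s in seqs])  # [True, False, False, False]
print([is_aa(s, ignore_case=False) for s in seqs])   # [True, True, False, False]
```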

## Docstrings

Here are examples of three popular docstring format styles:

In [8]:
# reST style
def calc_hypotenuse_a(a, b):
    """ 
    Return hypotenuse of right-angled triangle. 

    :param a: one (non-hypotenuse) side of the triangle 
    :param b: other (non-hypotenuse) side of the triangle 
    :returns: the hypotenuse 
    """

# Google style
def calc_hypotenuse_b(a, b):
    """ 
    Return hypotenuse of right-angled triangle.
    
    Args:
        a: one (non-hypotenuse) side of the triangle. 
        b: other (non-hypotenuse) side of the triangle. 

    Returns: 
        The hypotenuse. 
    """

# NumPy doc style
def calc_hypotenuse_c(a, b):
    """ 
    Return hypotenuse of right-angled triangle.

    Parameters 
    ---------- 
    a : float
        one (non-hypotenuse) side of the triangle. 
    b : float
        other (non-hypotenuse) side of the triangle.
        
    Returns
    -------
    float
        hypotenuse of right-angled triangle
    """

In [15]:
help(calc_hypotenuse_c)

Help on function calc_hypotenuse_c in module __main__:

calc_hypotenuse_c(a, b)
    Return hypotenuse of right-angled triangle.
    
    Parameters 
    ---------- 
    a : float
        one (non-hypotenuse) side of the triangle. 
    b : float
        other (non-hypotenuse) side of the triangle.
        
    Returns
    -------
    float
        hypotenuse of right-angled triangle

