<a href="https://colab.research.google.com/github/sokrypton/ws2023/blob/main/Day1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Exercise 1: Parse FASTA file (MSA)

Write a function that reads a FASTA file and returns two lists: the names (taken from the header lines that start with ">", with the ">" removed) and the sequences.

```python
def parse_fasta(filename):
  # something
  return names, seqs
```

For example, it takes an input file that looks like this
(note that a sequence may be split across multiple lines and that sequences may vary in length):
```
>name_1
-ABC--ABC--ABC
>name_2
-ABACCCBC
AAADC
>name_3
-----CCBC-----
>name_4
-AACC-CBC--
ACC
```
and returns the following two lists.
```python
['name_1','name_2','name_3','name_4']
['-ABC--ABC--ABC','-ABACCCBCAAADC','-----CCBC-----','-AACC-CBC--ACC']
```

In [None]:
%%writefile test.fasta
>name_1
-ABC--ABC--ABC
>name_2
-ABACCCBC
AAADC
>name_3
-----CCBC-----
>name_4
-AACC-CBC--
ACC

In [None]:
#@title answer
def parse_fasta(filename):
  '''function to parse fasta'''
  # create empty lists to append names/seqs
  names = []
  seqs = []
  # open file (closed automatically at the end of the "with" block)
  with open(filename, "r") as lines:
    # go through file, line by line
    for line in lines:
      # remove linebreak
      line = line.rstrip()
      # skip blank lines (line[0] would raise an IndexError on them)
      if not line:
        continue
      # if the first character is ">"
      if line[0] == ">":
        # save name (without the ">")
        names.append(line[1:])
        # start empty string
        seqs.append("")
      else:
        # add to existing string
        seqs[-1] += line
  return names, seqs

In [None]:
names, seqs = parse_fasta("test.fasta")
print(names, seqs)
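
The two details that are easy to miss are sequences split over several lines and stray blank lines. The sketch below checks both without touching the disk. It assumes a hypothetical helper, `parse_fasta_lines`, that applies the same logic to any iterable of lines (here an `io.StringIO`); it is an illustration, not part of the exercise answer.

```python
import io

def parse_fasta_lines(lines):
  """Hypothetical helper: same idea as parse_fasta, but for any iterable of lines."""
  names, seqs = [], []
  for line in lines:
    line = line.strip()
    # skip blank lines
    if not line:
      continue
    if line.startswith(">"):
      # header line: store the name and start a new sequence
      names.append(line[1:])
      seqs.append("")
    elif seqs:
      # sequence line: append to the most recent sequence
      seqs[-1] += line
  return names, seqs

# name_2 is split over two lines and followed by a blank line
example = """>name_1
-ABC--ABC--ABC
>name_2
-ABACCCBC
AAADC

>name_3
-----CCBC-----
"""

names_check, seqs_check = parse_fasta_lines(io.StringIO(example))
print(names_check)
print(seqs_check)
```

The two pieces of `name_2` should come back joined as a single string, and the blank line should be ignored.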

# Exercise 2: Filter the MSA

Write a function that:
1. Removes positions that are gaps ("-") in the query (first) sequence.
2. Removes sequences in which more than 25% of the remaining positions are gaps ("-").

```python
def filt_seqs(names, seqs):
  # something
  return new_names, new_seqs
```

Input example:
```
-ABC--ABC--ABC
-ABACCCBCAAADC
-----CCBC-----
-AACC-CBC--ACC
```
Output example:
```
ABCABCABC
ABACBCADC
AACCBCACC
```

In [None]:
#@title answer
import numpy as np
def filt_seqs(names, seqs):
  
  # get query (first) sequence
  query_seq = seqs[0]
  
  # convert sequence into numpy array of characters
  query_array = np.array(list(query_seq))
  
  # check which characters are not "-"
  query_non_gap = query_array != "-"
  
  # the length of query
  query_length = sum(query_non_gap)
  
  # make a new list of names/sequences
  new_names = []
  new_seqs = []
    
  # for each name and sequence
  for name,seq in zip(names,seqs):
    
    # convert sequence into numpy array of characters
    seq_array = np.array(list(seq))

    # select only positions that are non-gap in query
    seq_array = seq_array[query_non_gap]

    # count number of gaps remaining in sequence
    seq_gap_count = sum(seq_array == "-")
    
    # keep the sequence only if at most 25% of its positions are gaps
    if seq_gap_count/query_length <= 0.25:
      new_names.append(name)
      new_seqs.append("".join(seq_array))
            
  return new_names, new_seqs

In [None]:
new_names, new_seqs = filt_seqs(names,seqs)
print(new_names, new_seqs)
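
To see why the third sequence is dropped, it may help to print the gap fraction of every sequence after the query-gap columns are removed. This is a minimal pure-Python sketch on the four example sequences; the names `example_seqs`, `keep` and `rows` are made up for the illustration.

```python
# aligned sequences from the example above (all the same length)
example_seqs = [
  "-ABC--ABC--ABC",
  "-ABACCCBCAAADC",
  "-----CCBC-----",
  "-AACC-CBC--ACC",
]

# columns where the query (first) sequence is not a gap
query = example_seqs[0]
keep = [i for i, char in enumerate(query) if char != "-"]

rows = []
for seq in example_seqs:
  # keep only the query's non-gap columns
  trimmed = "".join(seq[i] for i in keep)
  # fraction of the remaining positions that are gaps
  gap_fraction = trimmed.count("-") / len(keep)
  status = "keep" if gap_fraction <= 0.25 else "drop"
  rows.append((trimmed, gap_fraction, status))
  print(trimmed, round(gap_fraction, 2), status)
```

After trimming, the third sequence is `---CBC---`: 6 of its 9 positions are gaps (about 67%), so it is the only one above the 25% threshold.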

# Exercise 3: Parse and Filter BLAST output

In [None]:
!wget -qnc https://raw.githubusercontent.com/sokrypton/ws2023/main/day1/example.fasta

In [None]:
names, seqs = parse_fasta("example.fasta")
new_names, new_seqs = filt_seqs(names,seqs)

In [None]:
def save_fasta(filename,names,seqs):
  # open file (closed automatically at the end of the "with" block)
  with open(filename,"w") as new_fasta:
    # for each name and seq, write a ">" header line followed by the sequence
    for name,seq in zip(names,seqs):
      new_fasta.write(">"+name+"\n"+seq+"\n")

In [None]:
save_fasta("example_filt.fasta",new_names,new_seqs)

# logomaker (weblogo inside notebook)
https://logomaker.readthedocs.io/

In [None]:
!pip -q install logomaker

In [None]:
import logomaker as logo
import matplotlib.pyplot as plt

In [None]:
bits = logo.alignment_to_matrix(new_seqs,to_type='information')
plot = logo.Logo(bits, color_scheme="hydrophobicity", figsize=(20,2))
plot.style_xticks(anchor=0, spacing=5)
plot.ax.set_xlabel("positions")
plot.ax.set_ylabel("information (bits)")
plot.fig.tight_layout()
plt.savefig("tmp.pdf")
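
For intuition about the "information (bits)" axis, the sketch below computes a per-column information value by hand: the maximum possible entropy for the observed alphabet minus the entropy of the column. This is a simplified illustration and not logomaker's own code. The helper `column_information` is hypothetical, gaps are simply ignored, and no pseudocounts are used, so the numbers are not expected to match `alignment_to_matrix` exactly.

```python
import math
from collections import Counter

def column_information(seqs, ignore="-"):
  """Hypothetical helper: information (bits) per alignment column, no pseudocounts."""
  # alphabet = characters observed anywhere in the alignment, minus gaps
  alphabet = sorted(set("".join(seqs)) - set(ignore))
  # entropy of a uniform distribution over the alphabet (the maximum possible)
  max_bits = math.log2(len(alphabet)) if alphabet else 0.0
  info = []
  for column in zip(*seqs):
    counts = Counter(char for char in column if char not in ignore)
    total = sum(counts.values())
    # a column with only gaps carries no information in this sketch
    if total == 0:
      info.append(0.0)
      continue
    # Shannon entropy of the observed character frequencies
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    info.append(max_bits - entropy)
  return info

info = column_information(["ABCABCABC", "ABACBCADC", "AACCBCACC"])
print([round(value, 2) for value in info])
```

With four observed characters (A, B, C, D) the maximum is 2 bits, which a fully conserved column such as the first one reaches; mixed columns score lower.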