# Computational Skills for Biocuration

## Programming Skills with Python

**UniProt Example**: [Human TP53 Fasta file](https://www.uniprot.org/uniprot/Q53FA7.fasta)

- Download the FASTA file for human TP53 locally and save it as 'Q53FA7.fasta'.
- Download the FASTA file for all TP53 entries and save it as 'uniprot-gene_tp53.fasta'.

## Introduction to Biopython

The [Biopython](http://biopython.org/DIST/docs/tutorial/Tutorial.html) project is a collaborative effort of developers who create free, open-source [Python](https://www.python.org) tools for computational molecular biology.

We will use Biopython's `Bio.SeqIO` module, which provides a uniform interface for reading, writing, and indexing different sequence file formats.

Let's start by loading Biopython's SeqIO module with the import command:

```python
from Bio import SeqIO
```

- `SeqIO.read()` takes a file handle (or filename) and a format name, and returns a single `SeqRecord` object whose attributes, such as `.id` and `.seq`, describe the sequence. It expects the file to contain exactly one record.
- Try it on 'Q53FA7.fasta'.
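As a minimal sketch of `SeqIO.read()`, the block below parses a single-record FASTA held in memory. The record ID, description, and sequence are made up for illustration; with the downloaded file you would pass `"Q53FA7.fasta"` instead of the `StringIO` handle.

```python
from io import StringIO

from Bio import SeqIO

# Toy single-record FASTA (made-up ID and sequence) standing in for 'Q53FA7.fasta'.
toy_fasta = ">sp|X00001|TOY1_HUMAN Hypothetical toy protein\nMKTAYIAKQRQISFVK\n"

# SeqIO.read() expects exactly one record and returns a single SeqRecord.
record = SeqIO.read(StringIO(toy_fasta), "fasta")

print(record.id)    # identifier: the first word of the header line
print(record.seq)   # the sequence as a Seq object
print(len(record))  # sequence length
```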

- `SeqIO.parse()` reads a FASTA file that contains multiple sequences and returns an iterator of `SeqRecord` objects.
- Use a `for` loop to read every record from the file.
- Try it on 'uniprot-gene_tp53.fasta'.
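The loop described above could look like the following sketch. It uses an in-memory FASTA with two made-up records, so the IDs and sequences are assumptions; replace the `StringIO` handle with `"uniprot-gene_tp53.fasta"` to run it on the downloaded file.

```python
from io import StringIO

from Bio import SeqIO

# Toy multi-record FASTA (made-up records) standing in for 'uniprot-gene_tp53.fasta'.
toy_fasta = (
    ">sp|X00001|TOY1_HUMAN Hypothetical toy protein 1\nMKTAYIAKQR\n"
    ">sp|X00002|TOY2_MOUSE Hypothetical toy protein 2\nMEEPQSDPSV\n"
)

ids = []
# SeqIO.parse() yields one SeqRecord at a time.
for record in SeqIO.parse(StringIO(toy_fasta), "fasta"):
    print(record.id, len(record.seq))
    ids.append(record.id)
```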

If you only want the first record, you can pass the iterator to Python's built-in `next()` function: `next(SeqIO.parse(...))`.
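A short sketch of taking just the first record with `next()`, again on made-up in-memory records rather than the downloaded file:

```python
from io import StringIO

from Bio import SeqIO

# Toy multi-record FASTA (made-up records) for illustration only.
toy_fasta = (
    ">sp|X00001|TOY1_HUMAN Hypothetical toy protein 1\nMKTAYIAKQR\n"
    ">sp|X00002|TOY2_MOUSE Hypothetical toy protein 2\nMEEPQSDPSV\n"
)

# next() pulls the first SeqRecord from the iterator and leaves the rest unread.
first_record = next(SeqIO.parse(StringIO(toy_fasta), "fasta"))
print(first_record.id)
```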

#### Reading entries into a list

- You can also load all records from a multi-sequence file into a list with Python's built-in `list()` function: `list(SeqIO.parse(...))`.
- You can then access individual entries by list index (e.g. `record_list[0]` for the first entry).
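The list approach might be sketched as follows. The records are made up, and `record_list` is simply the name used in the bullet above.

```python
from io import StringIO

from Bio import SeqIO

# Toy multi-record FASTA (made-up records) for illustration only.
toy_fasta = (
    ">sp|X00001|TOY1_HUMAN Hypothetical toy protein 1\nMKTAYIAKQR\n"
    ">sp|X00002|TOY2_MOUSE Hypothetical toy protein 2\nMEEPQSDPSV\n"
)

# list() consumes the iterator and keeps every SeqRecord in memory.
record_list = list(SeqIO.parse(StringIO(toy_fasta), "fasta"))

print(len(record_list))    # number of records
print(record_list[0].id)   # first entry
print(record_list[-1].id)  # last entry
```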

#### Exercise:

Expand your code to read `.name` and `.description` of the entry in your file.

#### Connecting with biological databases

In bioinformatics we often retrieve information from biological databases programmatically, which avoids the repetitive task of downloading files manually. Biopython makes several online databases accessible from Python scripts. There are many ways to access the UniProt and NCBI databases; here we will use Biopython's `Entrez` module.

- Hints:
    - Import the `Entrez` module: `from Bio import Entrez`
    - You should provide your email address: `Entrez.email = ""`
    - Use the example UniProt ID for dog TP53: Q29537
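Putting the hints together, one possible sketch is shown below. The helper name `fetch_protein_record` is hypothetical, and the choice of the NCBI `protein` database with `rettype="fasta"` is an assumption about how the entry is retrieved; whether a given UniProt accession resolves there depends on NCBI's cross-referencing. The function is only defined here, not called, because it needs network access.

```python
from Bio import Entrez, SeqIO


def fetch_protein_record(accession, email):
    """Fetch one protein entry from NCBI as FASTA and parse it (hypothetical helper)."""
    Entrez.email = email  # NCBI asks for a contact address with each request
    # Assumption: the accession is resolvable in NCBI's 'protein' database.
    handle = Entrez.efetch(db="protein", id=accession, rettype="fasta", retmode="text")
    try:
        return SeqIO.read(handle, "fasta")
    finally:
        handle.close()
```

With a network connection, a call such as `fetch_protein_record("Q29537", "your.name@example.org")` should return a `SeqRecord` whose `.id` and `.seq` you can inspect as before.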

#### Sequence Output

`SeqIO.write()` takes a sequence record (or a list of records), an output handle (or filename), and a format string ('fasta' in this case):

- Syntax:

    ```
    SeqIO.write(record, filename, "fasta")
    ```
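As a small illustration of the syntax above, this sketch writes one made-up record to an in-memory handle so that nothing is created on disk; passing a filename such as `"output.fasta"` (a hypothetical name) instead would write a file. `SeqIO.write()` returns the number of records written.

```python
from io import StringIO

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Made-up record for illustration only.
record = SeqRecord(Seq("MKTAYIAKQR"), id="X00001", description="Hypothetical toy protein")

out_handle = StringIO()
# The return value is the number of records written.
count = SeqIO.write(record, out_handle, "fasta")

print(count)
print(out_handle.getvalue())
```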

### Accessing more than one entry

#### Exercise:

Given a list of UniProt IDs, how would you access multiple entries?

- List of protein IDs: Q9WUR6, P61260, P02340, Q9TUB2

So far we have only worked with FASTA files, but other Biopython modules let you fetch and parse files in many different formats. Have a look at the official [Biopython Documentation](http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc14) for more after the course.

#### Reading entries into a dictionary

- You can index your records (values) by identifiers (keys). 
- `SeqIO.to_dict(SeqIO.parse(...))` turns a SeqRecord iterator (or list) into a dictionary (in memory).
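A minimal sketch of the dictionary approach, using made-up records so that it runs without the downloaded file:

```python
from io import StringIO

from Bio import SeqIO

# Toy multi-record FASTA (made-up records) for illustration only.
toy_fasta = (
    ">sp|X00001|TOY1_HUMAN Hypothetical toy protein 1\nMKTAYIAKQR\n"
    ">sp|X00002|TOY2_MOUSE Hypothetical toy protein 2\nMEEPQSDPSV\n"
)

# Keys are the record IDs; values are the SeqRecord objects.
record_dict = SeqIO.to_dict(SeqIO.parse(StringIO(toy_fasta), "fasta"))

print("sp|X00001|TOY1_HUMAN" in record_dict)    # membership test by ID
print(record_dict["sp|X00002|TOY2_MOUSE"].seq)  # look up one record by its ID
```

Note that `SeqIO.to_dict()` raises an error if two records share the same ID, which is one reason to check that your keys are unique.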

#### Exercise

- Print out all the unique keys (ids).
- Check the value of `sp|Q53FA7|QORX_HUMAN`.

- Holding everything in memory is not ideal when dealing with larger files.
- `SeqIO.index()` indexes the file instead, giving dictionary-like access to any record without loading all of them at once.
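Because `SeqIO.index()` needs a file on disk, the sketch below first writes two made-up records to a temporary file and then indexes it. The file name `toy.fasta` and the records are assumptions for illustration; with the downloaded data you would index `"uniprot-gene_tp53.fasta"` directly.

```python
import os
import tempfile

from Bio import SeqIO

# Toy multi-record FASTA (made-up records) for illustration only.
toy_fasta = (
    ">sp|X00001|TOY1_HUMAN Hypothetical toy protein 1\nMKTAYIAKQR\n"
    ">sp|X00002|TOY2_MOUSE Hypothetical toy protein 2\nMEEPQSDPSV\n"
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.fasta")
    with open(path, "w") as fh:
        fh.write(toy_fasta)

    # The index stores file offsets; each record is parsed only when requested.
    record_index = SeqIO.index(path, "fasta")
    n_records = len(record_index)
    second_seq = str(record_index["sp|X00002|TOY2_MOUSE"].seq)
    record_index.close()  # release the file before the directory is removed

print(n_records, second_seq)
```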

However, an index gives us only a limited ability to explore the data. You will see how to fetch data from databases in the next sessions.

See this [workshop by Peter Cock](https://github.com/peterjc/biopython_workshop) for further lessons.