File formats and working with data 

Let's look at the FASTA-format file for green fluorescent protein (GFP):

\>sp|P42212.1|GFP_AEQVI RecName: Full=Green fluorescent protein
MSKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFICTTGKLPVPWPTLVTTFS
YGVQCFSRYPDHMKQHDFFKSAMPEGYVQERTIFFKDDGNYKTRAEVKFEGDTLVNRIELKGIDF
KEDGNILGHKLEYNYNSHNVYIMADKQKNGIKVNFKIRHNIEDGSVQLADHYQQNTPIGDGPVLL
PDNHYLSTQSALSKDPNEKRDHMVLLEFVTAAGITHGMDELYK                                           

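The first line (the header) always begins with a greater-than symbol (>), followed by an identifier and, optionally, a description of the protein. The lines that follow contain the sequence in single-letter amino acid code; long sequences are usually wrapped across several lines, as they are here.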

Let's go ahead and create a sample FASTA record using this data.

In the code box below copy this text:
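
One way to do this without any extra libraries is sketched here. It writes the GFP record shown above to a file and reads it back with a small hand-written parser. The file name `gfp_sample.fasta`, the use of the temporary directory, and the helper `read_fasta` are assumptions made for illustration; Biopython's `SeqIO.parse`, used in the cells that follow, does the same job more robustly.

```python
import tempfile
from pathlib import Path

# The GFP record from above: one header line, then the wrapped sequence.
GFP_FASTA = """\
>sp|P42212.1|GFP_AEQVI RecName: Full=Green fluorescent protein
MSKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFICTTGKLPVPWPTLVTTFS
YGVQCFSRYPDHMKQHDFFKSAMPEGYVQERTIFFKDDGNYKTRAEVKFEGDTLVNRIELKGIDF
KEDGNILGHKLEYNYNSHNVYIMADKQKNGIKVNFKIRHNIEDGSVQLADHYQQNTPIGDGPVLL
PDNHYLSTQSALSKDPNEKRDHMVLLEFVTAAGITHGMDELYK
"""


def read_fasta(path):
    """Yield (header, sequence) tuples from a FASTA file (illustrative helper)."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # A new header closes the previous record, if there was one.
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            else:
                # Sequence lines are joined so the wrapping disappears.
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Hypothetical location for the sample file.
sample_path = Path(tempfile.gettempdir()) / "gfp_sample.fasta"
sample_path.write_text(GFP_FASTA)

gfp_records = list(read_fasta(sample_path))
header, sequence = gfp_records[0]
print(header)
print(len(sequence), "residues")
```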



In [2]:
from Bio import SeqIO

# Read every record in the FASTA file into a list so it can be counted and indexed.
records = list(SeqIO.parse("files/uniprot-dtxr.fasta", "fasta"))
print("There are %i sequences in your file.\n" % len(records))

print("The first 10 sequence record ids are:\n")
# Slicing avoids an IndexError if the file holds fewer than 10 records.
for record in records[:10]:
    print(record.id)

print("\nThe record: %s has a sequence of: %s\n" % (records[0].id, records[0].seq))


There are 31263 sequences in your file.

The first 10 sequence record ids are:

sp|P9WMH1|IDER_MYCTU
sp|P0DJL7|DTXR_CORDI
sp|Q8NP95|DTXR_CORGL
sp|Q8FPG6|DTXR_COREF
sp|Q4JV96|DTXR_CORJK
sp|H2I233|DTXR_CORDW
sp|P0A673|IDER_MYCBO
sp|P9WMH0|IDER_MYCTO
sp|P54512|MNTR_BACSU
tr|A0A7C2F0J4|A0A7C2F0J4_9BACT

The record: sp|P9WMH1|IDER_MYCTU has a sequence of: MNELVDTTEMYLRTIYDLEEEGVTPLRARIAERLDQSGPTVSQTVSRMERDGLLRVAGDRHLELTEKGRALAIAVMRKHRLAERLLVDVIGLPWEEVHAEACRWEHVMSEDVERRLVKVLNNPTTSPFGNPIPGLVELGVGPEPGADDANLVRLTELPAGSPVAVVVRQLTEHVQGDIDLITRLKDAGVVPNARVTVETTPGGGVTIVIPGHENVTLPHEMAHAVKVEKV



In [10]:
# The id looks like "sp|P9WMH1|IDER_MYCTU"; the accession is the middle field.
name = records[0].id.split("|")[1]
print(name)

P9WMH1
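
UniProt FASTA ids such as `sp|P9WMH1|IDER_MYCTU` have three pipe-separated fields: the database (`sp` for Swiss-Prot, `tr` for TrEMBL), the accession, and the entry name. The sketch below names each field instead of relying on index `[1]`, and fails loudly on ids that do not have this shape. `split_uniprot_id` is a hypothetical helper, not part of Biopython.

```python
def split_uniprot_id(record_id):
    """Split a UniProt-style id into (database, accession, entry_name)."""
    parts = record_id.split("|")
    if len(parts) != 3:
        raise ValueError("expected 'db|accession|entry_name', got %r" % record_id)
    database, accession, entry_name = parts
    return database, accession, entry_name


database, accession, entry_name = split_uniprot_id("sp|P9WMH1|IDER_MYCTU")
print(accession)  # P9WMH1
```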


In [15]:
original_file = "files/uniprot-dtxr.fasta"
corrected_file = "files/uniprot-dtxr_corr.fasta"

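with open(original_file) as original, open(corrected_file, 'w') as corrected:
    # Parse from the handle opened above rather than reopening the file by name.
    records = SeqIO.parse(original, 'fasta')

    for record in records:
        # Keep only the accession, the middle field of e.g. "sp|P9WMH1|IDER_MYCTU".
        record.id = record.id.split("|")[1]
        # record.description is left as parsed; only the id changes.
        SeqIO.write(record, corrected, 'fasta')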

In [19]:
for record in SeqIO.parse("files/uniprot-dtxr_corr.fasta", "fasta"):
    # Sanity check: iterating confirms the rewritten file parses; uncomment to list the ids.
    #print(record.id)
    pass

In [21]:
!cd-hit -i files/uniprot-dtxr_corr.fasta -o files/uniprot-dtxr_corr_90.fasta -c 0.9

Program: CD-HIT, V4.8.1 (+OpenMP), Apr 07 2021, 10:57:21
Command: cd-hit -i files/uniprot-dtxr_corr.fasta -o
         files/uniprot-dtxr_corr_90.fasta -c 0.9

Started: Fri Jun  4 14:20:37 2021
                            Output                              
----------------------------------------------------------------
total seq: 31263
longest and shortest : 1366 and 24
Total letters: 6159848
Sequences have been sorted

Approximated minimal memory consumption:
Sequence        : 10M
Buffer          : 1 X 11M = 11M
Table           : 1 X 65M = 65M
Miscellaneous   : 0M
Total           : 87M

Table limit with the given memory limit:
Max number of representatives: 2217219
Max number of word counting entries: 89067386

comparing sequences from          0  to      31263
..........    10000  finished       4548  clusters
..........    20000  finished       9579  clusters
..........    30000  finished      13758  clusters
.
    31263  finished      14294  clusters

Approximated maximum memory 

In [22]:
!cd-hit -i files/uniprot-dtxr_corr.fasta -o files/uniprot-dtxr_corr_70.fasta -c 0.7

Program: CD-HIT, V4.8.1 (+OpenMP), Apr 07 2021, 10:57:21
Command: cd-hit -i files/uniprot-dtxr_corr.fasta -o
         files/uniprot-dtxr_corr_70.fasta -c 0.7

Started: Fri Jun  4 14:21:23 2021
                            Output                              
----------------------------------------------------------------
total seq: 31263
longest and shortest : 1366 and 24
Total letters: 6159848
Sequences have been sorted

Approximated minimal memory consumption:
Sequence        : 10M
Buffer          : 1 X 11M = 11M
Table           : 1 X 65M = 65M
Miscellaneous   : 0M
Total           : 87M

Table limit with the given memory limit:
Max number of representatives: 2217219
Max number of word counting entries: 89067386

comparing sequences from          0  to      31263
..........    10000  finished       2531  clusters
..........    20000  finished       5177  clusters
..........    30000  finished       7286  clusters
.
    31263  finished       7589  clusters

Approximated maximum memory 
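
The `-c` option is CD-HIT's sequence identity threshold, so lowering it from 0.9 to 0.7 merges more sequences into each cluster. As a rough worked check, the snippet below computes what fraction of the 31263 input sequences survive as cluster representatives in each run. The counts are copied from the two logs above; the variable names are made up for illustration.

```python
# Counts taken from the CD-HIT logs above.
total_sequences = 31263
clusters_by_threshold = {0.9: 14294, 0.7: 7589}

fraction_kept = {}
for threshold, n_clusters in clusters_by_threshold.items():
    # Each cluster contributes exactly one representative to the output file.
    fraction_kept[threshold] = n_clusters / total_sequences
    print("-c %.1f: %i representatives (%.1f%% of the input)"
          % (threshold, n_clusters, 100 * fraction_kept[threshold]))
```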

In [25]:
records = list(SeqIO.parse("files/uniprot-dtxr_corr_70.fasta", "fasta"))
print("There are %i sequences in your file.\n" % len(records))

print("The first 10 sequence record ids are:\n")
for record in records[:10]:
    print(record.id)

    
print("\nThe record: %s has a sequence of: %s\n" % (records[0].id, records[0].seq))

There are 7589 sequences in your file.

The first 10 sequence record ids are:

Q4JV96
A0A7C2F0J4
Q8NQ98
F3B501
A0A6B8VWX5
A0A497M387
F8DXY3
C4LJ14
A0A076NPQ6
I7L8E4

The record: Q4JV96 has a sequence of: MRDLVDTTEMYLRTIYELEEEGIPPLRARIAERLDQSGPTVSQTVARMERDELLTVEKDRSLKLSAQGRALATAVMRKHRLAERLLTDVIGLPWEKVHDEACRWEHVMGDEVEVQLVKVLSEYATSPFGNPIPGLDELMEGIPDSERAELQQKIDNLQVVTSQRASDIEPPEPIQVKILSINEIIQVEHKLMAKFHALGMRPGSVVDLVATEDGLEFSNDNGAMVVPEELGHAVRVEKVN



In [6]:
from Bio import SeqIO

original_file = "files/uniprot-dtxr_corr_70.fasta"
corrected_file = "files/uniprot-dtxr_corr_70_trim.fasta"

with open(original_file) as original, open(corrected_file, 'w') as corrected:
    # Parse from the handle opened above rather than reopening the file by name.
    records = SeqIO.parse(original, 'fasta')

    for record in records:
        # Keep sequences strictly longer than 120 and strictly shorter than 300 residues.
        if 120 < len(record.seq) < 300:
            SeqIO.write(record, corrected, 'fasta')
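
Both comparisons in the filter are strict, so a sequence of exactly 120 or exactly 300 residues is dropped. A minimal sketch of that boundary behaviour follows, using a hypothetical helper `in_length_window` and plain integers rather than real records; 24 and 1366 are the shortest and longest lengths reported in the CD-HIT log.

```python
def in_length_window(length, lower=120, upper=300):
    """Return True when lower < length < upper (both bounds excluded)."""
    return lower < length < upper


for length in (24, 120, 121, 299, 300, 1366):
    print(length, in_length_window(length))
```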

In [7]:
records = list(SeqIO.parse("files/uniprot-dtxr_corr_70_trim.fasta", "fasta"))
print("There are %i sequences in your file.\n" % len(records))

print("The first 10 sequence record ids are:\n")
for record in records[:10]:
    print(record.id)

    
print("\nThe record: %s has a sequence of: %s\n" % (records[0].id, records[0].seq))

There are 6510 sequences in your file.

The first 10 sequence record ids are:

Q4JV96
A0A6B8VWX5
A0A497M387
F8DXY3
C4LJ14
A0A076NPQ6
I7L8E4
C0XSL1
A0A524MTU1
A0A550H4H4

The record: Q4JV96 has a sequence of: MRDLVDTTEMYLRTIYELEEEGIPPLRARIAERLDQSGPTVSQTVARMERDELLTVEKDRSLKLSAQGRALATAVMRKHRLAERLLTDVIGLPWEKVHDEACRWEHVMGDEVEVQLVKVLSEYATSPFGNPIPGLDELMEGIPDSERAELQQKIDNLQVVTSQRASDIEPPEPIQVKILSINEIIQVEHKLMAKFHALGMRPGSVVDLVATEDGLEFSNDNGAMVVPEELGHAVRVEKVN



Next, add in the knowns (the sequences with solved structures), which are stored in files/dtxr_pdbs.fasta.

In [8]:
!cat files/uniprot-dtxr_corr_70_trim.fasta files/dtxr_pdbs.fasta > files/final.fasta
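
`cat` copies the files byte for byte. If the first file did not end with a newline, the first header of the second file would be glued onto the last sequence line. The sketch below is a pure-Python alternative that guards against this. `concat_fasta` is a hypothetical helper, and the commented call reuses the file names from the cell above.

```python
def concat_fasta(input_paths, output_path):
    """Concatenate FASTA files, making sure each one ends with a newline."""
    with open(output_path, "w") as out:
        for path in input_paths:
            with open(path) as handle:
                text = handle.read()
            out.write(text)
            # Add the missing line break so the next header starts on its own line.
            if text and not text.endswith("\n"):
                out.write("\n")


# Example call with the file names used in this notebook:
# concat_fasta(
#     ["files/uniprot-dtxr_corr_70_trim.fasta", "files/dtxr_pdbs.fasta"],
#     "files/final.fasta",
# )
```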

In [11]:
records = list(SeqIO.parse("files/final.fasta", "fasta"))
print("There are %i sequences in your file.\n" % len(records))

print("The first 10 sequence record ids are:\n")
for record in records[:10]:
    print(record.id)

    
print("\nThe record: %s has a sequence of: %s\n" % (records[0].id, records[0].seq))

There are 6516 sequences in your file.

The first 10 sequence record ids are:

Q4JV96
A0A6B8VWX5
A0A497M387
F8DXY3
C4LJ14
A0A076NPQ6
I7L8E4
C0XSL1
A0A524MTU1
A0A550H4H4

The record: Q4JV96 has a sequence of: MRDLVDTTEMYLRTIYELEEEGIPPLRARIAERLDQSGPTVSQTVARMERDELLTVEKDRSLKLSAQGRALATAVMRKHRLAERLLTDVIGLPWEKVHDEACRWEHVMGDEVEVQLVKVLSEYATSPFGNPIPGLDELMEGIPDSERAELQQKIDNLQVVTSQRASDIEPPEPIQVKILSINEIIQVEHKLMAKFHALGMRPGSVVDLVATEDGLEFSNDNGAMVVPEELGHAVRVEKVN



In [13]:
!makeblastdb -in files/final.fasta -dbtype prot -out files/finalpro



Building a new DB, current time: 06/04/2021 14:57:31
New DB name:   /home/jovyan/files/finalpro
New DB title:  files/final.fasta
Sequence type: Protein
Deleted existing Protein BLAST database named /home/jovyan/files/finalpro
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 6516 sequences in 0.1796 seconds.




In [None]:
!blastp -db files/finalpro -query files/final.fasta -outfmt 6 -out files/BLASTe22_out -num_threads 4 -evalue 10e-22
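
With `-outfmt 6`, blastp writes one tab-separated line per hit with twelve default columns: qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore. A minimal sketch for loading that table with the standard library follows. `read_blast_tabular` is a hypothetical helper, and the guarded call only runs if the output file from the command above exists. Note also that `10e-22` is the same number as `1e-21`.

```python
import csv
import os

# Default column order for BLAST+ tabular output (-outfmt 6).
BLAST_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def read_blast_tabular(path):
    """Read BLAST -outfmt 6 output into a list of dicts with typed values."""
    hits = []
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            if not row:
                continue
            hit = dict(zip(BLAST_COLUMNS, row))
            for key in ("pident", "evalue", "bitscore"):
                hit[key] = float(hit[key])
            for key in ("length", "mismatch", "gapopen",
                        "qstart", "qend", "sstart", "send"):
                hit[key] = int(hit[key])
            hits.append(hit)
    return hits


blast_path = "files/BLASTe22_out"
if os.path.exists(blast_path):
    hits = read_blast_tabular(blast_path)
    # Self-hits appear because the query file is also the database.
    non_self = [hit for hit in hits if hit["qseqid"] != hit["sseqid"]]
    print("%i hits, %i between different sequences" % (len(hits), len(non_self)))
```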