# Importing some sequence data!

In this exercise we will download a set of diphtheria toxin repressor (DTXR)-like protein sequences from UniProt and upload them to Binder.

Click this link to head to https://uniprot.org

***
At the UniProt site...

Type dtxr in the search bar and hit return.

Check out the results and consider these questions:

    1. How many results did you get from this search?
    2. How many results are reviewed?
    3. How many sequences are unreviewed?

Hopefully you found something like 31,263 total sequences. I would certainly call this <b>big data</b> - way too many sequences to look at manually!

Near the top of the sequences you will see the download button. Download all the sequences in FASTA (canonical) format and ensure <b>uncompressed</b> is checked. I named the file uniprot-dtxr.fasta. That is easiest, but you could name it something else.<br> <img src="images/download.png" width=200>

Next, we will upload this file to Binder. In the files panel at the left, <b>double click the files folder to open it</b>. You should see two existing files in there: dtxr_pdbs.fasta and dtxr.tfa.

Now you may either drag the uniprot-dtxr.fasta file into that folder or click the Upload files button that looks like this:<img src="images/upload.png" width=50>

***

# Working with Sequence files using Biopython

The code box below is the most complicated we have seen. Comments in the code begin with #s. Read these if you want help understanding the code.

The first line "turns on" Biopython, a set of tools built for biology and biochemistry!

You can learn more at https://biopython.org/

The second line uses SeqIO (think: sequence input and output) to read a fasta file and stores the information as a list of records.

The rest of the code prints the number of sequences, the ids of the first ten records in the file, and finally the id and sequence of the first record.
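
To make "a list of records" more concrete, here is a minimal pure-Python sketch of what reading a FASTA file involves. It is not how Biopython implements SeqIO.parse; the function name read_fasta and the two short sequences are made up for illustration. The only fact it relies on is the FASTA layout: each record starts with a header line beginning with >, followed by one or more lines of sequence.

```python
from io import StringIO

# Two made-up records standing in for a real FASTA file (hypothetical data).
fasta_text = """>seq1 first example
MKDLVD
TTEMYL
>seq2 second example
MSELIE
"""

def read_fasta(handle):
    """Return a list of (id, sequence) pairs from an open FASTA file (illustration only)."""
    found = []
    current_id, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):          # a header line starts a new record
            if current_id is not None:
                found.append((current_id, "".join(chunks)))
            header = line[1:].split()
            current_id = header[0] if header else ""   # the id is the first word of the header
            chunks = []
        else:
            chunks.append(line)           # sequence lines are joined together
    if current_id is not None:            # do not forget the final record
        found.append((current_id, "".join(chunks)))
    return found

example_records = read_fasta(StringIO(fasta_text))
print("There are %i sequences." % len(example_records))
for rec_id, seq in example_records:
    print(rec_id, seq)
```

Biopython's records carry more information than these simple pairs (for example .id, .seq and .description), which is why the exercise uses SeqIO rather than a hand-written reader.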

The command below won't run correctly (you can try it!) until you enter the name of the file to read. <b>Between the quotes, replace <<\<your file here>>> with: files/uniprot-dtxr.fasta</b>

Then press shift+enter to run the code.


In [None]:
from Bio import SeqIO # imports the SeqIO module from Biopython

records = list(SeqIO.parse("<<<your file here>>>", "fasta"))     # reads the fasta file into a list of records 
print("There are %i sequences in your file.\n" % len(records))   # prints the number of sequences, that is, the length of the list, named records

print("The first 10 sequence record ids are:\n")
for i in range(10):                                              # i takes the values 0, 1, ..., 9 (ten ids in total)
    print(records[i].id)                                         # prints the id for record i
    
print("\nThe record: %s has a sequence of: %s\n" % (records[0].id, records[0].seq))  # prints the record id and its sequence!


Great! The code below prints the id of the first record. Python starts counting at zero, so records[0].id is the identifier of the first record. Edit the code to print the id of the 100th record (Hint: remember to subtract 1).

We can also look at the last record id. You could put in 31262, but -1 is easier! An index of -1 counts from the end of the list, so you don't need to know how many records you have.
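
As a quick illustration of these indexing rules, here is a small sketch that uses a made-up list of ids (hypothetical values, not taken from the UniProt file) instead of the real records:

```python
# A made-up list of ids standing in for the real records (hypothetical values).
ids = ["id_A", "id_B", "id_C", "id_D", "id_E"]

first = ids[0]                   # the 1st item has index 0
third = ids[3 - 1]               # the Nth item has index N - 1
last = ids[-1]                   # -1 counts back from the end of the list
also_last = ids[len(ids) - 1]    # same item, but you need to know the length

print(first, third, last, also_last)
```

The same pattern applies to the records list: only the number inside the square brackets changes.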

In [None]:
print(records[0].id)

***
Before you move on, check your knowledge by answering these questions:

    1. What is the sequence id for the first record in the file?
    2. What is the sequence id for the 100th record in the file?
    3. What is the sequence id for the last record in the file?

***

As you have seen above, the ids are a little long and redundant. The code below keeps only the accession (the part between the two | characters) as each record's id and writes the result to a new, corrected file.
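
To see what the simplification does to a single id before running it on the whole file, here is a small sketch. The id string is a made-up placeholder that follows the UniProt layout of database|accession|entry name, and the helper name simplify_id is hypothetical; it leaves ids without that layout unchanged, which is an extra safety check assumed here rather than something the exercise requires.

```python
# Hypothetical UniProt-style id: database | accession | entry name.
long_id = "tr|X0X0X0|X0X0X0_EXAMPLE"

def simplify_id(record_id):
    """Return the middle field of a db|accession|name id, or the id unchanged otherwise."""
    parts = record_id.split("|")
    return parts[1] if len(parts) >= 3 else record_id

parts = long_id.split("|")       # split the id at every | character
short_id = simplify_id(long_id)  # keep only the accession

print(parts)
print(short_id)
print(simplify_id("already_short"))
```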

In [None]:
original_file = "files/uniprot-dtxr.fasta"
corrected_file = "files/uniprot-dtxr_corr.fasta"

with open(corrected_file, 'w') as corrected:
    # SeqIO.parse opens and reads the original file itself, one record at a time
    for record in SeqIO.parse(original_file, 'fasta'):
        # ids look like db|accession|entry_name; keep only the middle part
        record.id = record.id.split("|")[1]
        SeqIO.write(record, corrected, 'fasta')   # append the record to the new file

In [None]:
for record in SeqIO.parse("files/uniprot-dtxr_corr.fasta", "fasta"):
    #print(record.id)    # uncomment this line to list every simplified id (the output is long!)
    continue

In [None]:
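# Cluster the sequences with CD-HIT at a 40% identity threshold (-c 0.4, word size -n 2), keeping one representative per cluster
!cd-hit -i files/uniprot-dtxr_corr.fasta -o files/uniprot-dtxr_corr_40.fasta -c 0.4 -n 2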

In [None]:
records = list(SeqIO.parse("files/uniprot-dtxr_corr_40.fasta", "fasta"))
print("There are %i sequences in your file.\n" % len(records))

print("The first 10 sequence record ids are:\n")
for i in range(10):
    print(records[i].id)

    
print("\nThe record: %s has a sequence of: %s\n" % (records[0].id, records[0].seq))

In [None]:
from Bio import SeqIO

original_file = "files/uniprot-dtxr_corr_40.fasta"
corrected_file = "files/uniprot-dtxr_corr_40_trim.fasta"

with open(corrected_file, 'w') as corrected:
    for record in SeqIO.parse(original_file, 'fasta'):
        # keep only sequences longer than 120 and shorter than 280 residues
        if 120 < len(record.seq) < 280:
            SeqIO.write(record, corrected, 'fasta')

Next, add in the known proteins (those with structures!) from dtxr_pdbs.fasta.

In [None]:
!cat files/uniprot-dtxr_corr_40_trim.fasta files/dtxr_pdbs.fasta > files/final_40.fasta

In [None]:
records = list(SeqIO.parse("files/final_40.fasta", "fasta"))   # the combined file written by the cat command above
print("There are %i sequences in your file.\n" % len(records))

print("The first 10 sequence record ids are:\n")
for i in range(10):
    print(records[i].id)

    
print("\nThe record: %s has a sequence of: %s\n" % (records[0].id, records[0].seq))

In [None]:
# Build a protein BLAST database from the combined sequences
!makeblastdb -in files/final_40.fasta -dbtype prot -out files/finalpro_40

In [None]:
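# Search every sequence against that database; the output is tab-separated: query id, subject id, e-value
!blastp -db files/finalpro_40 -query files/final_40.fasta -outfmt "6 qseqid sseqid evalue" -out files/BLASTe40_out -num_threads 4 -evalue 10e-40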
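
The output format requested above ("6 qseqid sseqid evalue") produces three tab-separated columns per hit. As a rough sketch of how such a table could be read back into Python, the example below parses a few made-up rows; the ids and e-values are hypothetical and are not results from this search.

```python
import csv
from io import StringIO

# Hypothetical rows in the "qseqid sseqid evalue" layout (three tab-separated columns).
example_output = "seqA\tseqA\t1e-90\nseqA\tseqB\t3e-45\nseqB\tseqA\t2e-44\n"

hits = []
for qseqid, sseqid, evalue in csv.reader(StringIO(example_output), delimiter="\t"):
    if qseqid != sseqid:                          # skip each sequence's match to itself
        hits.append((qseqid, sseqid, float(evalue)))

for hit in hits:
    print(hit)
```

To try this on the real results, StringIO(example_output) could be replaced with open("files/BLASTe40_out").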