---
# Lecture 07 : COMMUNICATING WITH THE OUTSIDE (Reading and Writing to Files)
---

### Reading and Writing Files
To read or write files, use the built-in function `open(filename, mode)`.
- **Reading from file**:
> `f = open(filename,'r')`

`'r'` is the default value for the `mode` parameter, so we can just omit it:

> `f = open(filename)`

- **Writing to file**:
> `f = open(filename,'w')`

If the file already exists, mode `'w'` truncates its content first. To append to the end of an existing file instead, use mode `'a'` (the file is created if it does not exist):

> `f = open(filename,'a')`
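The difference between the modes can be checked with a small experiment. This is only a sketch: the file name `demo.txt` and the temporary directory are assumptions made for illustration, chosen so that no real data file is touched.

```python
import os
import tempfile

# Hypothetical demo file inside a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")

f = open(path, 'w')      # creates the file (or truncates an existing one)
f.write("first\n")
f.close()

f = open(path, 'w')      # 'w' again: the previous content is discarded
f.write("second\n")
f.close()

f = open(path, 'a')      # 'a': new text goes after the existing content
f.write("third\n")
f.close()

f = open(path)           # default mode is 'r'
content = f.read()
f.close()
print(content)
```

Only `second` and `third` survive, because the second `'w'` open wiped out `first` before anything was appended.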

### Errors When Opening a File
If you attempt to open a file that does not exist, Python will produce an error message:

In [1]:
f = open('fasta.txt', 'r')

FileNotFoundError: [Errno 2] No such file or directory: 'fasta.txt'

A common way to let your program handle this type of error properly is to specify what to do in case of errors:

In [2]:
try:
    f = open("fasta.txt")
except IOError:
    print("The file does not exist")

The file does not exist
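The same pattern can be wrapped in a small helper so that the rest of the program only runs when the file was opened. The sketch below is one possible arrangement, not the only one; the function name `read_first_line` and the missing path are made up for illustration.

```python
import os
import tempfile

def read_first_line(path):
    """Return the first line of a file, or None if the file does not exist."""
    try:
        f = open(path)
    except FileNotFoundError:
        # FileNotFoundError is a subclass of OSError; IOError is an alias of OSError
        print("The file does not exist:", path)
        return None
    else:
        # This branch runs only when open() succeeded
        first = f.readline()
        f.close()
        return first

# Hypothetical path inside a fresh temporary directory, so it cannot exist.
missing = os.path.join(tempfile.mkdtemp(), "no_such_file.txt")
result = read_first_line(missing)
```

Keeping the file-processing code in the `else` branch avoids using `f` when it was never assigned.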


## Reading From a File
An efficient and fast way to read the content of a file is by looping over the file object:


In [15]:
file_path = 'data/fasta_small.txt'
f = open(file_path,'r')

for line in f:
    print(line.strip())



>gi|142022655|gb|EQ086233.1|43 marine metagenome JCVI_SCAF_1096627390048 genomic scaffold, whole genome shotgun sequence
TCGGGCGAAGGCGGCAGCAAGTCGTCCACGCGCAGCGCGGCACCGCGGGCCTCTGCCGTGCGCTGCTTGG
CCATGGCCTCCAGCGCACCGATCGGATCAAAGCCGCTGAAGCCTTCGCGCATCAGGCGGCCATAGTTGGC
GCCAGTGACCGTACCAACCGCCTTGATGCGGCGCTCGGTCATCGCTGCATTGATCGAGTAGCCACCGCCG
CCGCAAATGCCCAGCACGCCAATGCGTTCTTCATCCACATAGGGGAGCGTTACGAGGTAGTCGCAGACCA
CGCGGAAATCCTCGACGCGCAGTGTCGGGTCTTCGGTAAAACGTGGTTCGCCGCCGCTGGCACCCTGGAA
GCTGGCGTCGAAGGCGATGACGACGAAACCTTCCTTGGCCAGCGCCTCGCCATACACGTTCCCCGATGTT
TGCTCCTTGCAGCTGCCGATCGGATGCGCGCTGATGATGGCGGGATATTTCTTGCCTTCGTCGAAGTTCG
GCGGGAAGTGGATGTCGGCTGCGATATCCCAATACACATTCTTGATCTTGACGCTTTTCATGACAGCTCC
GTTCAGGGGGAGGGGGTAAGTTCGCCAGGCCGAATCGTTGGTAGCCAAGCGGCAACGACTCGAATATAGA
GAGCCGATTGGAATTCCGTAAGATCGCAATCTGGACTACAGTGGTATCTTCAAATTGACAATGGCACCTA
CATGGATCCCTCACTGCTTCCGTCTCTCGCGTGGTTCGCCCACGTCGCACATCATCGTAGCTTCACGAAA
GCGGCTGCGGAAATGGGCGTTTCTCGAGCAAACCTGTCGCAGAACGTGAAGGCGCTCGAACGCCGGTTGA
ACGTCAAGCTGCTGTATCGAACGACTC

### The content of a file can also be read using the `read()` method of the file object:

In [9]:
f.read() # Returns an empty string. Why?

''

### Changing Positions Within a File Object
To change the file object’s position, use `f.seek(offset, from_what)`. The position is computed by adding `offset` to a reference point; the reference point is selected by the `from_what` argument, which in text files is only allowed to be 0 (the default), signifying the beginning of the file:

In [10]:
f.seek(0) # Go to the first position of the file

0

In [11]:
f.read()

'>gi|142022655|gb|EQ086233.1|43 marine metagenome JCVI_SCAF_1096627390048 genomic scaffold, whole genome shotgun sequence\nTCGGGCGAAGGCGGCAGCAAGTCGTCCACGCGCAGCGCGGCACCGCGGGCCTCTGCCGTGCGCTGCTTGG\nCCATGGCCTCCAGCGCACCGATCGGATCAAAGCCGCTGAAGCCTTCGCGCATCAGGCGGCCATAGTTGGC\nGCCAGTGACCGTACCAACCGCCTTGATGCGGCGCTCGGTCATCGCTGCATTGATCGAGTAGCCACCGCCG\nCCGCAAATGCCCAGCACGCCAATGCGTTCTTCATCCACATAGGGGAGCGTTACGAGGTAGTCGCAGACCA\nCGCGGAAATCCTCGACGCGCAGTGTCGGGTCTTCGGTAAAACGTGGTTCGCCGCCGCTGGCACCCTGGAA\nGCTGGCGTCGAAGGCGATGACGACGAAACCTTCCTTGGCCAGCGCCTCGCCATACACGTTCCCCGATGTT\nTGCTCCTTGCAGCTGCCGATCGGATGCGCGCTGATGATGGCGGGATATTTCTTGCCTTCGTCGAAGTTCG\nGCGGGAAGTGGATGTCGGCTGCGATATCCCAATACACATTCTTGATCTTGACGCTTTTCATGACAGCTCC\nGTTCAGGGGGAGGGGGTAAGTTCGCCAGGCCGAATCGTTGGTAGCCAAGCGGCAACGACTCGAATATAGA\nGAGCCGATTGGAATTCCGTAAGATCGCAATCTGGACTACAGTGGTATCTTCAAATTGACAATGGCACCTA\nCATGGATCCCTCACTGCTTCCGTCTCTCGCGTGGTTCGCCCACGTCGCACATCATCGTAGCTTCACGAAA\nGCGGCTGCGGAAATGGGCGTTTCTCGAGCAAACCTGTCGCAGAACGTGAAGGCGCTCGAACGCCGGTTGA\nACGTCAAGCTGCT

### You can also read a single line from the file:

In [16]:
f.seek(0)
f.readline()


'>gi|142022655|gb|EQ086233.1|43 marine metagenome JCVI_SCAF_1096627390048 genomic scaffold, whole genome shotgun sequence\n'

In [17]:
f.close() # close the file
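The interplay of `read()`, `seek()` and `readline()` can be reproduced on a tiny file. This is a minimal sketch; the two-line file `two_lines.txt` is a made-up stand-in for the FASTA file used above.

```python
import os
import tempfile

# Hypothetical demo file with one header line and one sequence line.
path = os.path.join(tempfile.mkdtemp(), "two_lines.txt")
f = open(path, 'w')
f.write(">header\nACGT\n")
f.close()

f = open(path)
everything = f.read()      # reads up to the end of the file
leftover = f.read()        # the position is now at the end, so nothing is left
f.seek(0)                  # rewind to the beginning
header = f.readline()      # one line, including its trailing newline
rest = f.readline()
f.close()

print(repr(everything), repr(leftover), repr(header), repr(rest))
```

The empty `leftover` string is the same effect seen earlier: once a file object has been read to the end, further reads return `''` until the position is moved back.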

## Writing Into a File
`f.write(string)` writes the contents of `string` to the file and, in Python 3.x, returns the number of characters written.

In [30]:
file_path = 'data/fasta_small.txt'
f=open(file_path,'a')

In [31]:
f.write("\n>gi|142022655|gb|EQ086233.1|160 marine metagenome JCVI_SCAF_1096627390048 genomic scaffold, whole genome shotgun sequence\n")

123

In [32]:
f.write("TCGGGCGAAGGCGGCAGCAAGTCGTCCACGCGCAGCGCGGCACCGCGGGCCTCTGCCGTGCGCTGCTTGG\nCCATGGCCTCCAGCGCACCGATCGGATCAAAGCCGCTGAAGCCTTCGCGCATCAGGCGGCCATAGTTGGC\nGCCAGTGACCGTACCAACCGCCTTGATGCGGCGCTCGGTCATCGCTGCATTGATCGAGTAGCCACCGCCG\nCCGCAAATGCCCAGCACGCCAATGCGTTCTTCATCCACATAGGGGAGCGTTACGAGGTAGTCGCAGACCA\nCGCGGAAATCCTCGACGCGCAGTGTCGGGTCTTCGGTAAAACGTGGTTCGCCGCCGCTGGCACCCTGGAA\nGCTGGCGTCGAAGGCGATGACGACGAAACCTTCCTTGGCCAGCGCCTCGCCATACACGTTCCCCGATGTT\nTGCTCCTTGCAGCTGCCGATCGGATGCGCGCTGATGATGGCGGGATATTTCTTGCCTTCGTCGAAGTTCG\nGATCGGATGCGCGCTGATGATGGCGGGATATTTCTTGCCTTCGTCGAAG")

546

In [33]:
f.close()

In [35]:
# After Modification
file_path = 'data/fasta_small.txt'
f = open(file_path,'r')

for line in f:
    print(line.strip())



>gi|142022655|gb|EQ086233.1|43 marine metagenome JCVI_SCAF_1096627390048 genomic scaffold, whole genome shotgun sequence
TCGGGCGAAGGCGGCAGCAAGTCGTCCACGCGCAGCGCGGCACCGCGGGCCTCTGCCGTGCGCTGCTTGG
CCATGGCCTCCAGCGCACCGATCGGATCAAAGCCGCTGAAGCCTTCGCGCATCAGGCGGCCATAGTTGGC
GCCAGTGACCGTACCAACCGCCTTGATGCGGCGCTCGGTCATCGCTGCATTGATCGAGTAGCCACCGCCG
CCGCAAATGCCCAGCACGCCAATGCGTTCTTCATCCACATAGGGGAGCGTTACGAGGTAGTCGCAGACCA
CGCGGAAATCCTCGACGCGCAGTGTCGGGTCTTCGGTAAAACGTGGTTCGCCGCCGCTGGCACCCTGGAA
GCTGGCGTCGAAGGCGATGACGACGAAACCTTCCTTGGCCAGCGCCTCGCCATACACGTTCCCCGATGTT
TGCTCCTTGCAGCTGCCGATCGGATGCGCGCTGATGATGGCGGGATATTTCTTGCCTTCGTCGAAGTTCG
GCGGGAAGTGGATGTCGGCTGCGATATCCCAATACACATTCTTGATCTTGACGCTTTTCATGACAGCTCC
GTTCAGGGGGAGGGGGTAAGTTCGCCAGGCCGAATCGTTGGTAGCCAAGCGGCAACGACTCGAATATAGA
GAGCCGATTGGAATTCCGTAAGATCGCAATCTGGACTACAGTGGTATCTTCAAATTGACAATGGCACCTA
CATGGATCCCTCACTGCTTCCGTCTCTCGCGTGGTTCGCCCACGTCGCACATCATCGTAGCTTCACGAAA
GCGGCTGCGGAAATGGGCGTTTCTCGAGCAAACCTGTCGCAGAACGTGAAGGCGCTCGAACGCCGGTTGA
ACGTCAAGCTGCTGTATCGAACGACTC

## Closing a File Object
When you’re done with a file, call `f.close()` to close it and free up any system resources taken up by the open file:

In [None]:
f.close() # close the file

In [36]:
f.read() # Raises a ValueError because the file has been closed

ValueError: I/O operation on closed file.
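An alternative to calling `f.close()` by hand is the `with` statement, which closes the file when the block ends, even if an error occurs inside it. The sketch below assumes a throwaway file named `closing_demo.txt`; the name is purely illustrative.

```python
import os
import tempfile

# Hypothetical demo file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "closing_demo.txt")

with open(path, 'w') as f:
    f.write("ACGT\n")

with open(path) as f:        # the file is closed when this block ends
    text = f.read()

closed_after_with = f.closed

# Reading from a closed file object raises ValueError
try:
    f.read()
except ValueError as err:
    message = str(err)
    print("Reading failed:", message)
```

After the block, `f` still exists as a name, but the underlying file is closed, so any further read fails.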

## Reading a FASTA File
## **Exercise: Build a dictionary containing all sequences from a FASTA file.**

FASTA file:
```    
>id1| description of id1|
ATGTGTGTCCGTTGTGTAAAGTGTGTCcccgtgttATggtagatttttga

>id2| description of id2|
ccccagtggggagtagggcAAAcgtatAA
```

In [43]:
seqs = {}
try:
    f = open("data/myfile.fasta")
except IOError:
    print("File data/myfile.fasta does not exist!!")
else:
    for line in f:
        # let's discard the newline at the end (if any)
        line = line.rstrip()
        # skip blank lines between records
        if not line:
            continue
        # distinguish header from sequence
        if line.startswith('>'):
            words = line.split()
            name = words[0][1:]
            seqs[name] = ""
        else:  # sequence, not header
            seqs[name] = seqs[name] + line
    f.close()

### Retrieving Data From Dictionaries
We can retrieve the key and corresponding value from our dictionary using the `items()` method:

In [44]:
for name,seq in seqs.items():
    print(name,seq)

id1| ATGTGTGTCCGTTGTGTAAAGTGTGTCcccgtgttATggtagatttttga
id2| ccccagtggggagtagggcAAAcgtatAA
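The exercise can also be written as a reusable function that accepts any iterable of lines, which makes it easy to try without a file on disk. This is a sketch under that assumption: the function name `parse_fasta` is hypothetical, and `io.StringIO` is used as an in-memory stand-in for an open file holding the two example records.

```python
import io

def parse_fasta(handle):
    """Build {name: sequence} from any iterable of FASTA lines."""
    seqs = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue                      # ignore blank lines
        if line.startswith('>'):
            name = line.split()[0][1:]    # first word of the header, without '>'
            seqs[name] = ""
        elif name is not None:
            seqs[name] += line            # a sequence may span several lines
    return seqs

# In-memory stand-in for a file, using the example records from the exercise.
example = io.StringIO(
    ">id1| description of id1|\n"
    "ATGTGTGTCCGTTGTGTAAAGTGTGTCcccgtgttATggtagatttttga\n"
    "\n"
    ">id2| description of id2|\n"
    "ccccagtggggagtagggcAAAcgtatAA\n"
)
seqs = parse_fasta(example)
for name, seq in seqs.items():
    print(name, seq)
```

Because an open file is itself an iterable of lines, the same function could be called as `parse_fasta(f)` on a real file object.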


### Command Line Arguments
Scripts often need to process command line arguments. Suppose a script that parses a FASTA file is called `processfasta.py`, and you want to run it on a file whose name is given as an argument on the command line:
> `python processfasta.py myfile.fa`

The arguments of the above command are stored in the `sys` module’s `argv` attribute as a list:

```
import sys
print(sys.argv)
```
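Since `sys.argv` is an ordinary list, the logic that reads it can be tried with a hand-written list. The helper name `input_filename` below is an assumption made for this sketch.

```python
import sys

def input_filename(argv):
    """Return the first command-line argument, or None if there is none."""
    # argv[0] is the script name; the real arguments start at index 1
    if len(argv) < 2:
        return None
    return argv[1]

# Simulated argument list for: python processfasta.py myfile.fa
print(input_filename(["processfasta.py", "myfile.fa"]))

# With only the script name present, there is no file argument
print(input_filename(sys.argv[:1]))
```

In a real script the call would simply be `input_filename(sys.argv)`.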

### Parsing Command Line Arguments With `getopt`
Python’s `getopt` module can help with processing the arguments in `sys.argv`. Suppose the `processfasta.py` script reads a FASTA file but only stores in the dictionary the sequences longer than a given length provided on the command line:

> `python processfasta.py -l 250 myfile.fa`

### Usage Definition For `processfasta.py`

In [None]:
#!/usr/bin/python
import sys
import getopt

def usage():
    print("""
    processfasta.py : reads a FASTA file and builds a dictionary with all sequences bigger than a given length.

    processfasta.py [-h] [-l <length>] <filename>

    -h              print this message

    -l <length>     filter all sequences with a length
                    smaller than <length>
                    (default <length>=0)

    <filename>      the file has to be in FASTA format

    """)

# o: list of (option, value) pairs; a: remaining positional arguments
o, a = getopt.getopt(sys.argv[1:], 'l:h')
opts = {}
seqlen = 0

for k, v in o:
    opts[k] = v
if '-h' in opts:
    usage()
    sys.exit()
if len(a) < 1:
    usage()
    sys.exit("Input fasta file is missing!")
if '-l' in opts:
    # getopt returns option values as strings, so convert before comparing
    if int(opts['-l']) < 0:
        print("Length of sequence should be positive!")
        sys.exit()
    seqlen = int(opts['-l'])
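To see what `getopt.getopt` returns without running a script from the command line, it can be given an explicit argument list. The list below simulates `sys.argv[1:]` for the example command and is only an illustration.

```python
import getopt

# Simulated sys.argv[1:] for: python processfasta.py -l 250 myfile.fa
args = ['-l', '250', 'myfile.fa']

# 'l:' means -l takes a value; 'h' is a flag without a value
options, remaining = getopt.getopt(args, 'l:h')
print(options)     # [('-l', '250')]
print(remaining)   # ['myfile.fa']

# An option that was not declared raises getopt.GetoptError
try:
    getopt.getopt(['-k', '250'], 'l:h')
except getopt.GetoptError as err:
    error_text = str(err)
    print("Bad option:", error_text)
```

Note that the option value comes back as the string `'250'`, so it has to be converted with `int()` before it can be compared with a sequence length.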





### Using the System Environment

**Reminder:** When we run a script/program in the UNIX environment there are standard streams recognized by a computer program:

- **Standard input** or `stdin` is stream data (often text) going into a program. Unless redirected, standard input is expected from the keyboard which started the program.
- **Standard output** or `stdout` is the stream where a program writes its output data. Unless redirected, standard output is the text terminal which initiated the program.
- **Standard error** or `stderr` is another output stream typically used by programs to output error messages or diagnostics. It is a stream independent of standard output and can provide error messages even when `stdout` has been redirected. `stderr` can also be redirected separately:

> my_program | my_script.sh 1>program_output.txt 2>error_messages.t

The `sys` module in Python provides file handles for the standard input, output and error:

In [52]:
sys.stdin.read()
"a line \n another line"

'a line \n another line'

In [53]:
sys.stdout.write("Some useful output.\n")

Some useful output.


In [54]:
sys.stderr.write("Warning: input file was not found\n")
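The point of keeping results and diagnostics on separate streams can be illustrated with a function that receives its streams as arguments. This is a sketch: the function name `count_bases` is hypothetical, and `io.StringIO` objects stand in for `sys.stdin`, `sys.stdout` and `sys.stderr` so the example runs without keyboard input.

```python
import io
import sys

def count_bases(instream, outstream, errstream):
    """Read sequence lines from instream; write the total and warnings separately."""
    total = 0
    for line in instream:
        line = line.strip()
        if line.startswith('>'):
            # diagnostics go to the error stream
            errstream.write("Skipping header: " + line + "\n")
        else:
            total += len(line)
    # the actual result goes to the output stream
    outstream.write(str(total) + "\n")
    return total

# In-memory streams stand in for the real standard streams here.
fake_in = io.StringIO(">id1\nACGT\nAC\n")
fake_out, fake_err = io.StringIO(), io.StringIO()
total = count_bases(fake_in, fake_out, fake_err)

sys.stdout.write(fake_out.getvalue())
sys.stderr.write(fake_err.getvalue())
```

In a real script the call would be `count_bases(sys.stdin, sys.stdout, sys.stderr)`, and redirecting `stdout` to a file would then capture only the count, not the warnings.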



### Interfacing With External Programs
- You can call/execute an external program from within your script.
- This helps you automate certain tasks that would be difficult to do within Python.

Use the `call()` function in the **subprocess module** to run an external program:


In [58]:
import subprocess
subprocess.call(["dir"],shell=True)

0

In [None]:
subprocess.call(["tophat","genome_mouse_idx","PE_reads_1.fq.gz","PE_reads_2.fq.gz"])
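When the output of the external program is needed inside the script, the same module also offers `subprocess.run()`, which can capture it. The sketch below uses the current Python interpreter as the "external program" purely so that it runs on any platform; that choice is an assumption for illustration, and `capture_output` requires Python 3.7 or later.

```python
import subprocess
import sys

# Run the current Python interpreter as a stand-in for an external tool.
completed = subprocess.run(
    [sys.executable, "-c", "print('hello from a child process')"],
    capture_output=True,   # collect stdout and stderr instead of printing them
    text=True,             # decode the output as str rather than bytes
)
print(completed.returncode)
print(completed.stdout.strip())
```

As with `call()`, a return code of 0 conventionally means that the program finished without errors.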

---
# Lecture 08: BIOPYTHON
---

### The Biopython Project

- [http://www.biopython.org](http://www.biopython.org): an online resource for modules, scripts, and web links for developers of Python-based software for bioinformatics use and research.


- Biopython includes parsers for various bioinformatics file formats (such as FASTA and GenBank), access to online services like the NCBI Entrez or PubMed databases, and interfaces to common bioinformatics programs such as BLAST, ClustalW, and others.


In [60]:
# To install Biopython 
# !pip install biopython

Collecting biopython
  Downloading biopython-1.79-cp38-cp38-win_amd64.whl (2.3 MB)
Installing collected packages: biopython
Successfully installed biopython-1.79


In [61]:
import Bio
print(Bio.__version__)

1.79


## Running BLAST over the Internet

In [67]:
from Bio.Blast import NCBIWWW
filename = "data/myseq.fa"
fasta_string = open(filename, "r").read() 
result_handle = NCBIWWW.qblast("blastn", "nt", fasta_string)


In [68]:
# To find out more information:
help(NCBIWWW.qblast)

Help on function qblast in module Bio.Blast.NCBIWWW:

qblast(program, database, sequence, url_base='https://blast.ncbi.nlm.nih.gov/Blast.cgi', auto_format=None, composition_based_statistics=None, db_genetic_code=None, endpoints=None, entrez_query='(none)', expect=10.0, filter=None, gapcosts=None, genetic_code=None, hitlist_size=50, i_thresh=None, layout=None, lcase_mask=None, matrix_name=None, nucl_penalty=None, nucl_reward=None, other_advanced=None, perc_ident=None, phi_pattern=None, query_file=None, query_believe_defline=None, query_from=None, query_to=None, searchsp_eff=None, service=None, threshold=None, ungapped_alignment=None, word_size=None, short_query=None, alignments=500, alignment_view=None, descriptions=500, entrez_links_new_window=None, expect_low=None, expect_high=None, format_entrez_query=None, format_object=None, format_type='XML', ncbi_gi=None, results_file=None, show_overview=None, megablast=None, template_type=None, template_length=None)
    BLAST search using NCBI's

In [69]:
# The BLAST Record
from Bio.Blast import NCBIXML
blast_record = NCBIXML.read(result_handle)

In [70]:
# Parsing BLAST Output
len(blast_record.alignments)   

50

In [71]:
E_VALUE_THRESH = 0.01

for alignment in blast_record.alignments:
    for hsp in alignment.hsps:
        if hsp.expect < E_VALUE_THRESH:
            print('****Alignment****')                     
            print('sequence:', alignment.title)                     
            print('length:', alignment.length)                     
            print('e value:', hsp.expect)                     
            print(hsp.query)                     
            print(hsp.match)                    
            print(hsp.sbjct)



****Alignment****
sequence: gi|1503253460|gb|MK114118.1| Zaire ebolavirus isolate Ebola virus/H.sapiens-tc/COD/1976/Yambuku-Mayinga, partial genome
length: 18936
e value: 1.72617e-28
CATGCTACGGTGCTAAAAGCATTACGCCCTATAGTGATTTTCGAGACATACTGTGTTTTTAAATATAGTATTGCC
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
CATGCTACGGTGCTAAAAGCATTACGCCCTATAGTGATTTTCGAGACATACTGTGTTTTTAAATATAGTATTGCC
****Alignment****
sequence: gi|1282621605|gb|MG572235.1| Zaire ebolavirus isolate Ebola virus/H.sapiens-tc/COD/1995/Kikwit-9510621, complete genome
length: 18957
e value: 1.72617e-28
CATGCTACGGTGCTAAAAGCATTACGCCCTATAGTGATTTTCGAGACATACTGTGTTTTTAAATATAGTATTGCC
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
CATGCTACGGTGCTAAAAGCATTACGCCCTATAGTGATTTTCGAGACATACTGTGTTTTTAAATATAGTATTGCC
****Alignment****
sequence: gi|1500174811|gb|MK044561.1| Zaire ebolavirus isolate ZEBOV/Human/DRC/2014/BOE_036, partial genome
length: 18898
e value: 1.72617e-28
CATGCTACGGTGCTAA

## Lecture 8 Quiz

### Question 3: Using Biopython find out what species the following unknown DNA sequence comes from:
`TGGGCCTCATATTTATCCTATATACCATGTTCGTATGGTGGCGCGATGTTCTACGTGAATCCACGTTCGAAGGACATCATACCAAAGTCGTACAATTAGGACCTCGATATGGTTTTATTCTGTTTATCGTATCGGAGGTTATGTTCTTTTTTGCTCTTTTTCGGGCTTCTTCTCATTCTTCTTTGGCACCTACGGTAGAG`

In [1]:
from Bio.Blast import NCBIWWW
fasta_string = 'TGGGCCTCATATTTATCCTATATACCATGTTCGTATGGTGGCGCGATGTTCTACGTGAATCCACGTTCGAAGGACATCATACCAAAGTCGTACAATTAGGACCTCGATATGGTTTTATTCTGTTTATCGTATCGGAGGTTATGTTCTTTTTTGCTCTTTTTCGGGCTTCTTCTCATTCTTCTTTGGCACCTACGGTAGAG'
result_handle = NCBIWWW.qblast("blastn", "nt", fasta_string)

In [2]:
# The BLAST Record
from Bio.Blast import NCBIXML
blast_record = NCBIXML.read(result_handle)

In [3]:
len(blast_record.alignments)  

50

In [4]:
E_VALUE_THRESH = 0.01

for alignment in blast_record.alignments:
    for hsp in alignment.hsps:
        if hsp.expect < E_VALUE_THRESH:
            print('****Alignment****')                     
            print('sequence:', alignment.title)                     
            print('length:', alignment.length)                     
            print('e value:', hsp.expect)                     
            print(hsp.query)                     
            print(hsp.match)                    
            print(hsp.sbjct)

****Alignment****
sequence: gi|1783584753|gb|MN651324.1| Nicotiana tabacum strain zhongyan90 cytoplasmic male sterility(CMS) line cultivar MSzhongyan90 mitochondrion, complete genome
length: 530869
e value: 1.02446e-95
TGGGCCTCATATTTATCCTATATACCATGTTCGTATGGTGGCGCGATGTTCTACGTGAATCCACGTTCGAAGGACATCATACCAAAGTCGTACAATTAGGACCTCGATATGGTTTTATTCTGTTTATCGTATCGGAGGTTATGTTCTTTTTTGCTCTTTTTCGGGCTTCTTCTCATTCTTCTTTGGCACCTACGGTAGAG
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
TGGGCCTCATATTTATCCTATATACCATGTTCGTATGGTGGCGCGATGTTCTACGTGAATCCACGTTCGAAGGACATCATACCAAAGTCGTACAATTAGGACCTCGATATGGTTTTATTCTGTTTATCGTATCGGAGGTTATGTTCTTTTTTGCTCTTTTTCGGGCTTCTTCTCATTCTTCTTTGGCACCTACGGTAGAG
****Alignment****
sequence: gi|1783584659|gb|MN651323.1| Nicotiana tabacum strain zhongyan90 maintainer line cultivar zhongyan90 mitochondrion, complete genome
length: 472218
e v

### Question 5: Create a Biopython Seq object that represents the following sequence:
`TGGGCCTCATATTTATCCTATATACCATGTTCGTATGGTGGCGCGATGTTCTACGTGAATCCACGTTCGAAGGACATCATACCAAAGTCGTACAATTAGGACCTCGATATGGTTTTATTCTGTTTATCGTATCGGAGGTTATGTTCTTTTTTGCTCTTTTTCGGGCTTCTTCTCATTCTTCTTTGGCACCTACGGTAGAG`

In [8]:
from Bio.Seq import Seq
coding_dna = Seq('TGGGCCTCATATTTATCCTATATACCATGTTCGTATGGTGGCGCGATGTTCTACGTGAATCCACGTTCGAAGGACATCATACCAAAGTCGTACAATTAGGACCTCGATATGGTTTTATTCTGTTTATCGTATCGGAGGTTATGTTCTTTTTTGCTCTTTTTCGGGCTTCTTCTCATTCTTCTTTGGCACCTACGGTAGAG')

coding_dna.translate()





Seq('WASYLSYIPCSYGGAMFYVNPRSKDIIPKSYN*DLDMVLFCLSYRRLCSFLLFF...LR*')