# Sequence input/output

## Writing and converting sequence files

Convert a multi-line FASTA file (sequence wrapped at 50 bases per line) to [Fasta-2line](https://codechalleng.es/bites/298/), which has exactly one header line and one sequence line per record, using [SeqIO](https://biopython.org/wiki/SeqIO).

In [2]:
import os

from Bio import Entrez, SeqIO
Entrez.email = 'sbwiecko@free.fr'

In [7]:
handle = Entrez.esearch(
    db='nucleotide',
    term='HBB[Gene Name] AND RefSeq[Keyword] AND Homo sapiens[Organism]',
    idtype='acc',
)

record = Entrez.read(handle)
handle.close()

fetch_list = []

for ID in record['IdList']:
    if 'NM_' in ID:
        fetch = Entrez.efetch(
            db='nucleotide',
            id=ID,
            rettype='fasta',
            retmode='text',
        )

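        # efetch returns plain FASTA text here, so read it as a string
        # (Entrez.read() is only for parsing XML results such as esearch)
        read_fetch = fetch.read()
        fetch.close()
        fetch_list.append(read_fetch)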

print(f"Item count after search = {record['Count']}")
print(f"Item count after filtering = {len(fetch_list)}")

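# open the file once in write mode: re-running the cell overwrites the file
# instead of appending duplicate records to it
with open('HBB-human.fasta', 'w') as saved_file:
    for fasta_text in fetch_list:
        saved_file.write(fasta_text)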

Item count after search = 7
Item count after filtering = 1


In [4]:
with open('HBB-human.fasta2lines', 'w+') as converted_file:
    SeqIO.convert(
        'HBB-human.fasta', # original file
        'fasta',           # format of the original file
        converted_file,    # destination file
        'fasta-2line',     # format converted
    )
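
To make the difference between the two formats concrete, here is a minimal plain-Python sketch of what the conversion amounts to: the wrapped sequence lines of each record are joined into a single line. This is not how `SeqIO.convert` is implemented; `fasta_to_two_line` and the toy records are hypothetical names and data used only for illustration.

```python
def fasta_to_two_line(text):
    """Join wrapped sequence lines so each record is one header line plus one sequence line."""
    records = []  # list of [header, sequence] pairs
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                    # ignore blank lines
        if line.startswith(">"):
            records.append([line, ""])  # a '>' line starts a new record
        elif records:
            records[-1][1] += line      # append the next wrapped chunk
    return "".join(f"{header}\n{seq}\n" for header, seq in records)


# toy input: two records, each wrapped over two sequence lines
wrapped = ">seq1 toy record\nACGTAC\nGTTT\n>seq2\nGG\nCC\n"
two_line = fasta_to_two_line(wrapped)
print(two_line)
```

For real files, `SeqIO.convert` remains the better choice because it validates the input and handles every supported format.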

Conversion from GenBank to fasta-2line

In [5]:
with open('HBB.fasta2lines', 'w+') as converted_file:
    SeqIO.convert(
        './resources/HBB.gb',
        'gb',
        converted_file,
        'fasta-2line',
    )

In [17]:
import pandas as pd
table = pd.read_html("https://biopython.org/wiki/SeqIO")
#print(table[0].set_index('Format name', drop=True).to_markdown())

### File formats

| Format name | Read | Write | Index | Notes |
|:--|--:|:--|:--|:--|
| abi | 1.58 | No | N/A | Reads the ABI "Sanger" capillary sequence traces files, including the PHRED quality scores for the base calls. This allows ABI to FASTQ conversion. Note each ABI file contains one and only one sequence (so there is no point in indexing the file). |
| abi-trim | 1.71 | No | N/A | Same as "abi" but with quality trimming with Mott's algorithm. |
| ace | 1.47 | No | 1.52 | Reads the contig sequences from an ACE assembly file. Uses Bio.Sequencing.Ace internally. |
| cif-atom | 1.73 | No | No | Uses Bio.PDB.MMCIFParser to determine the (partial) protein sequence as it appears in the structure based on the atomic coordinates. |
| cif-seqres | 1.73 | No | No | Reads a macromolecular Crystallographic Information File (mmCIF) file to determine the complete protein sequence as defined by the `_pdbx_poly_seq_scheme` records. |
| clustal | 1.43 | 1.43 | No | The alignment format of Clustal X and Clustal W. |
| embl | 1.43 | 1.54 | 1.52 | The EMBL flat file format. Uses Bio.GenBank internally. |
| fasta | 1.43 | 1.43 | 1.52 | This refers to the input FASTA file format introduced for Bill Pearson's FASTA tool, where each record starts with a ">" line. |
| fasta-2line | 1.71 | 1.71 | No | FASTA format variant with no line wrapping and exactly two lines per record. |
| fastq-sanger or fastq | 1.50 | 1.50 | 1.52 | FASTQ files are a bit like FASTA files but also include sequencing qualities. In Biopython, "fastq" (or the alias "fastq-sanger") refers to Sanger style FASTQ files which encode PHRED qualities using an ASCII offset of 33. See also the incompatible "fastq-solexa" and "fastq-illumina" variants used in early Solexa/Illumina pipelines; Illumina pipeline 1.8 produces Sanger FASTQ. |
| fastq-solexa | 1.50 | 1.50 | 1.52 | In Biopython, "fastq-solexa" refers to the original Solexa/Illumina style FASTQ files which encode Solexa qualities using an ASCII offset of 64. See also what we call the "fastq-illumina" format. |
| fastq-illumina | 1.51 | 1.51 | 1.52 | In Biopython, "fastq-illumina" refers to early Solexa/Illumina style FASTQ files (from pipeline version 1.3 to 1.7) which encode PHRED qualities using an ASCII offset of 64. For good quality reads, PHRED and Solexa scores are approximately equal, so the "fastq-solexa" and "fastq-illumina" variants are almost equivalent. |
| gck | 1.75 | No | No | The native format used by Gene Construction Kit. |
| genbank or gb | 1.43 | 1.48 / 1.51 | 1.52 | The GenBank or GenPept flat file format. Uses Bio.GenBank internally for parsing. Biopython 1.48 to 1.50 wrote basic GenBank files with only minimal annotation, while 1.51 onwards will also write the features table. |
| ig | 1.47 | No | 1.52 | This refers to the IntelliGenetics file format, apparently the same as the MASE alignment format. |
| imgt | 1.56 | 1.56 | 1.56 | This refers to the IMGT variant of the EMBL plain text file format. |
| nexus | 1.43 | 1.48 | No | The NEXUS multiple alignment format, also known as PAUP format. Uses Bio.Nexus internally. |
| pdb-seqres | 1.61 | No | No | Reads a Protein Data Bank (PDB) file to determine the complete protein sequence as it appears in the header (no dependency on Bio.PDB and NumPy). |
| pdb-atom | 1.61 | No | No | Uses Bio.PDB to determine the (partial) protein sequence as it appears in the structure based on the atom coordinate section of the file (requires NumPy). |
| phd | 1.46 | 1.52 | 1.52 | PHD files are output from PHRED, used by PHRAP and CONSED for input. Uses Bio.Sequencing.Phd internally. |
| phylip | 1.43 | 1.43 | No | PHYLIP files. Truncates names at 10 characters. |
| pir | 1.48 | 1.71 | 1.52 | A "FASTA like" format introduced by the National Biomedical Research Foundation (NBRF) for the Protein Information Resource (PIR) database, now part of UniProt. |
| seqxml | 1.58 | 1.58 | No | Simple sequence XML file format. |
| sff | 1.54 | 1.54 | 1.54 | Standard Flowgram Format (SFF) binary files produced by Roche 454 and IonTorrent/IonProton sequencing machines. |
| sff-trim | 1.54 | No | 1.54 | Standard Flowgram Format applying the trimming listed in the file. |
| snapgene | 1.75 | No | No | The native format used by SnapGene. |
| stockholm | 1.43 | 1.43 | No | The Stockholm alignment format is also known as PFAM format. |
| swiss | 1.43 | No | 1.52 | Swiss-Prot aka UniProt format. Uses Bio.SwissProt internally. See also the UniProt XML format. |
| tab | 1.48 | 1.48 | 1.52 | Simple two column tab separated sequence files, where each line holds a record's identifier and sequence. For example, this is used by Agilent's eArray software when saving microarray probes in a minimal tab delimited text file. |
| qual | 1.50 | 1.50 | 1.52 | Qual files are a bit like FASTA files but instead of the sequence, record space separated integer sequencing values as PHRED quality scores. A matched pair of FASTA and QUAL files are often used as an alternative to a single FASTQ file. |
| uniprot-xml | 1.56 | No | 1.56 | UniProt XML format, successor to the plain text Swiss-Prot format. |
| xdna | 1.75 | 1.75 | No | The native format used by Christian Marck's DNA Strider and Serial Cloner. |
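
The ASCII offsets mentioned for the FASTQ variants can be illustrated with a short plain-Python sketch. It encodes the same PHRED scores with offset 33 ("fastq-sanger") and offset 64 ("fastq-illumina") and shows what happens when a string is decoded with the wrong offset. The helper names and the scores are hypothetical, chosen only for this example; "fastq-solexa" is left out because it stores Solexa scores rather than PHRED scores.

```python
def encode_quality(scores, offset):
    """Turn integer quality scores into a FASTQ quality string."""
    return "".join(chr(score + offset) for score in scores)


def decode_quality(quality_string, offset):
    """Turn a FASTQ quality string back into integer scores."""
    return [ord(char) - offset for char in quality_string]


scores = [2, 20, 40]                        # toy PHRED scores
sanger = encode_quality(scores, 33)         # '#5I'
illumina = encode_quality(scores, 64)       # 'BTh'

print(sanger, decode_quality(sanger, 33))   # correct offset recovers the scores
print(illumina, decode_quality(illumina, 33))  # wrong offset: every score is 31 too high
```

This is why the format name passed to `SeqIO` matters: the same characters mean different qualities under each variant.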

### Quiz

Convert the attached sample.ab1 file (you can view the trace using [Chromas](http://technelysium.com.au/wp/chromas/)) to a FASTQ file and then answer the following:

1. What is the name of the header?
2. Does the third line contain any letters or signs?
3. Does quality vary between nucleotides?

In [8]:
with open('sample.fastq', 'w+') as converted_file:
    SeqIO.convert(
        './resources/sample.abi',
        'abi',
        converted_file,
        'fastq',
    )

# read back the file that was just written by the conversion above
with open('sample.fastq', 'r') as f:
    print(f.read())

@M2927_AU-NFP-NFP5
CTCCGGCATTAAACAANCAAGATTNTTGTNCTGACGTCTCCCCTTTATATNTATCTCTTCTTTGGACATCAAATATCTTTTGGTTTGATATTAAAAATATATGCTGATGTCTCTGTCATTAGTTCTACTATTTTTCTTTTTCAAATCTTAGACACTGTTGGCAAATCCCAAACAAGTATGAAATCATGCGCAGAATATAATAAAGATGGTAAGAATAATATGGTATTATAACCCAATTTGGTGTCCATACACGATGTCAACTAGCTGGCTGGTACCTAAAGACCATCTAACTTGATGCCTTTACCAAATTTTTAAGGGGCAAACTAGCAAAAGATATGGTTTAACGTTTAACAAAGGCTAATGCCATTATATATATGCAATTATAATATAATACAGGCAATGAACAGCTTTCAATGTNGACACTTCCAGTTACATTGGTTTCAATATATTAAAACACCTTGACAAACTCNCCATTTNTATTATTTGTNCCCTATTCCTAGTNCATAGCTAATTCGGATAGAAACNGTGTTNACATCGNCANGTAAGCNTGACNTCTAATATATTNAAANANTANGATGCCCTTCTNACCNGCTCGCCTNGGCTTGNGCGTAGCTTTNGGTNCNNCATANNCANCTANCGCGNNGCNNGNCCNGNNCGATTGANTGAGNATGGCNGGGGGCNCTGCNCNCCGCTNGCNNGACCCGCNGNNGCCGAGGCCGACCNNCGGCGANACGANNNTGACAGNCGGTGGCGCTCCGCAGGNNNCGNNNGNGGNCGCTNCNGGGCNCGGGNGNNCCNGNNCCNGCGNNCNNACCCCNCAGGGCGCGGGGGAGGCCGGNTGGCCCCNGGNNNNCNNCGNNNNNCAGGNNGNCGNGGCNGGNNGGNGCCCCCCGGCCNCNNTGGGTNGGCNANGNGCNNNTTGGGCCCNNGNGCNTCGNTGCGNGCNGCNCTGCAGGGTGGNCNNTNGNGNGGGNGCGNGGN
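
As a reminder of what to look for when answering the quiz, a FASTQ record normally has four lines: the `@` header, the sequence, a `+` separator, and a quality string with one character per base. The sketch below splits a toy record (hypothetical data, not the content of sample.fastq) and decodes the qualities assuming the Sanger offset of 33.

```python
# toy four-line FASTQ record, invented for illustration
fastq_text = "@toy_read\nACGTN\n+\nII5#!\n"

header, sequence, separator, quality = fastq_text.splitlines()

read_id = header[1:]                          # drop the leading '@'
phred = [ord(char) - 33 for char in quality]  # Sanger FASTQ uses ASCII offset 33

print("header   :", read_id)
print("3rd line :", separator)
print("qualities:", phred)
print("quality varies between bases:", len(set(phred)) > 1)
```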

## Parsing FASTA files

In [9]:
record = SeqIO.parse( # parse all the sequences even if only 1 sequence
    'HBB-human.fasta',
    'fasta',
)

for element in record:
    print(element)
    print('-'*15)
    print(element.id) # parse has created an object with multiple attributes
    print(element.name)
    print(element.description)
    print(element.features)
    print(element.seq)

ID: NM_000518.5
Name: NM_000518.5
Description: NM_000518.5 Homo sapiens hemoglobin subunit beta (HBB), mRNA
Number of features: 0
Seq('ACATTTGCTTCTGACACAACTGTGTTCACTAGCAACCTCAAACAGACACCATGG...CAA')
---------------
NM_000518.5
NM_000518.5
NM_000518.5 Homo sapiens hemoglobin subunit beta (HBB), mRNA
[]
ACATTTGCTTCTGACACAACTGTGTTCACTAGCAACCTCAAACAGACACCATGGTGCATCTGACTCCTGAGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAGTTGGTGGTGAGGCCCTGGGCAGGCTGCTGGTGGTCTACCCTTGGACCCAGAGGTTCTTTGAGTCCTTTGGGGATCTGTCCACTCCTGATGCTGTTATGGGCAACCCTAAGGTGAAGGCTCATGGCAAGAAAGTGCTCGGTGCCTTTAGTGATGGCCTGGCTCACCTGGACAACCTCAAGGGCACCTTTGCCACACTGAGTGAGCTGCACTGTGACAAGCTGCACGTGGATCCTGAGAACTTCAGGCTCCTGGGCAACGTGCTGGTCTGTGTGCTGGCCCATCACTTTGGCAAAGAATTCACCCCACCAGTGCAGGCTGCCTATCAGAAAGTGGTGGCTGGTGTGGCTAATGCCCTGGCCCACAAGTATCACTAAGCTCGCTTTCTTGCTGTCCAATTTCTATTAAAGGTTCCTTTGTTCCCTAAGTCCAACTACTAAACTGGGGGATATTATGAAGGGCCTTGAGCATCTGGATTCTGCCTAATAAAAAACATTTATTTTCATTGCAA
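
The output above shows that, for a FASTA record, `id` and `name` hold the first word of the header line while `description` holds the whole header. A minimal plain-Python sketch of that split, using a hypothetical helper `split_fasta_header` (this only mimics the observed result, it is not Biopython's parser):

```python
def split_fasta_header(header_line):
    """Return (record_id, description) from a FASTA header line starting with '>'."""
    description = header_line[1:].strip()  # drop the leading '>'
    record_id = description.split(None, 1)[0] if description else ""
    return record_id, description


header = ">NM_000518.5 Homo sapiens hemoglobin subunit beta (HBB), mRNA"
record_id, description = split_fasta_header(header)
print(record_id)
print(description)
```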


In [10]:
# same for FASTA files that contain multiple sequences
record = SeqIO.parse(
    'OCA2.fasta',
    'fasta',
)

for element in record:
    print('-'*15)
    print(element.id)
    print(len(element.seq))

---------------
NM_001300984.2
3071
---------------
NM_000275.3
3143
---------------
NM_001182895.3
594
---------------
NM_214094.2
3267
---------------
NM_021879.3
3133
---------------
NM_001271493.1
2752
---------------
NM_001131493.2
3181
---------------
NM_001104792.1
3424
---------------
NM_001320209.1
2673
---------------
NM_001022940.2
2486


### Quiz

What is the length of the sequence in the converted file (fastq) in the third quiz?

In [11]:
record = SeqIO.parse(
    handle='sample.fastq',
    format='fastq',

)

for element in record:
    print(f"Length of the sequence = {len(element.seq)}")

Length of the sequence = 1451


## Parsing GenBank files

A GenBank file contains more information than a FASTA file, so each record exposes more attributes, such as annotations and features.

In [12]:
record = SeqIO.parse(
    handle = './resources/HBB-human.gb',
    format='gb',
)

for element in record:
    print(element)

ID: NM_000518.5
Name: NM_000518
Description: Homo sapiens hemoglobin subunit beta (HBB), mRNA
Number of features: 48
/molecule_type=mRNA
/topology=linear
/data_file_division=PRI
/date=20-APR-2020
/accessions=['NM_000518']
/sequence_version=5
/keywords=['RefSeq', 'MANE Select']
/source=Homo sapiens (human)
/organism=Homo sapiens
/taxonomy=['Eukaryota', 'Metazoa', 'Chordata', 'Craniata', 'Vertebrata', 'Euteleostomi', 'Mammalia', 'Eutheria', 'Euarchontoglires', 'Primates', 'Haplorrhini', 'Catarrhini', 'Hominidae', 'Homo']
/references=[Reference(title='[Analysis of beta-globin gene variants in Liuzhou area of Guangxi]', ...), Reference(title='Genetic polymorphisms of HbE/beta thalassemia related to clinical presentation: implications for clinical diversity', ...), Reference(title='Use of >100,000 NHLBI Trans-Omics for Precision Medicine (TOPMed) Consortium whole genome sequences improves imputation quality and detection of rare variant associations in admixed African and Hispanic/Latino po

In [31]:
record = SeqIO.parse(
    handle = './resources/HBB-human.gb',
    format='gb',
)

for element in record:
    for key in element.annotations.keys():
        print(key, ':', element.annotations[key])

molecule_type : mRNA
topology : linear
data_file_division : PRI
date : 20-APR-2020
accessions : ['NM_000518']
sequence_version : 5
keywords : ['RefSeq', 'MANE Select']
source : Homo sapiens (human)
organism : Homo sapiens
taxonomy : ['Eukaryota', 'Metazoa', 'Chordata', 'Craniata', 'Vertebrata', 'Euteleostomi', 'Mammalia', 'Eutheria', 'Euarchontoglires', 'Primates', 'Haplorrhini', 'Catarrhini', 'Hominidae', 'Homo']
references : [Reference(title='[Analysis of beta-globin gene variants in Liuzhou area of Guangxi]', ...), Reference(title='Genetic polymorphisms of HbE/beta thalassemia related to clinical presentation: implications for clinical diversity', ...), Reference(title='Use of >100,000 NHLBI Trans-Omics for Precision Medicine (TOPMed) Consortium whole genome sequences improves imputation quality and detection of rare variant associations in admixed African and Hispanic/Latino populations', ...), Reference(title='Effect of N(Epsilon)-(carboxymethyl)lysine on Laboratory Parameters and

In [38]:
record = SeqIO.parse(
    handle = './resources/HBB-human.gb',
    format='gb',
)

for element in record:
    for feature in element.features:
        print(feature.type, feature.location.start, feature.location.end)

source 0 628
gene 0 628
exon 0 142
misc_feature 29 32
CDS 50 494
misc_feature 53 56
misc_feature 53 56
misc_feature 53 56
misc_feature 71 77
misc_feature 74 77
misc_feature 77 80
misc_feature 86 89
misc_feature 101 104
misc_feature 125 131
misc_feature 137 143
mat_peptide 146 176
mat_peptide 146 167
misc_feature 155 161
misc_feature 161 167
misc_feature 182 185
misc_feature 185 191
misc_feature 200 203
misc_feature 206 212
misc_feature 218 224
misc_feature 227 230
misc_feature 227 230
misc_feature 248 251
misc_feature 263 269
misc_feature 272 278
misc_feature 296 299
misc_feature 296 299
misc_feature 302 308
misc_feature 311 314
misc_feature 326 332
misc_feature 329 332
misc_feature 335 338
misc_feature 362 368
misc_feature 380 386
misc_feature 407 413
misc_feature 410 413
misc_feature 416 422
misc_feature 434 440
misc_feature 470 476
misc_feature 482 488
misc_feature 482 485
misc_feature 482 485
exon 142 365
exon 365 628


In [41]:
# filtering a specific feature type
record = SeqIO.parse(
    handle = './resources/HBB-human.gb',
    format='gb',
)

for element in record:
    for feature in element.features:
        if feature.type == 'CDS':
            print(feature.type, feature.location)
            # extract the sequence of the feature knowing the location coordinates:
            print(element.seq[feature.location.start:feature.location.end])

CDS [50:494](+)
ATGGTGCATCTGACTCCTGAGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAGTTGGTGGTGAGGCCCTGGGCAGGCTGCTGGTGGTCTACCCTTGGACCCAGAGGTTCTTTGAGTCCTTTGGGGATCTGTCCACTCCTGATGCTGTTATGGGCAACCCTAAGGTGAAGGCTCATGGCAAGAAAGTGCTCGGTGCCTTTAGTGATGGCCTGGCTCACCTGGACAACCTCAAGGGCACCTTTGCCACACTGAGTGAGCTGCACTGTGACAAGCTGCACGTGGATCCTGAGAACTTCAGGCTCCTGGGCAACGTGCTGGTCTGTGTGCTGGCCCATCACTTTGGCAAAGAATTCACCCCACCAGTGCAGGCTGCCTATCAGAAAGTGGTGGCTGGTGTGGCTAATGCCCTGGCCCACAAGTATCACTAA
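
Biopython reports feature locations as 0-based, half-open intervals, which is why `[50:494]` can be used directly as a Python slice; in GenBank notation (1-based, inclusive) the same range is written `51..494`. The sketch below shows the conversion on a toy sequence. The helper `genbank_to_slice` and the toy sequence are hypothetical and only illustrate the arithmetic.

```python
def genbank_to_slice(start, end):
    """Convert a 1-based inclusive GenBank range to a 0-based half-open Python slice."""
    return start - 1, end


# the CDS shown above
cds_start, cds_end = genbank_to_slice(51, 494)
cds_length = cds_end - cds_start
print(cds_start, cds_end, cds_length, cds_length // 3)

# toy sequence: the range 3..11 in GenBank notation is the slice [2:11]
toy_seq = "GGATGAAATAACC"
toy_start, toy_end = genbank_to_slice(3, 11)
toy_cds = toy_seq[toy_start:toy_end]
print(toy_cds)
```

Plain slicing is enough for a simple forward-strand feature like this one. For reverse-strand or joined locations, `feature.extract(element.seq)` is the safer route because it takes the strand and the location parts into account.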


In [45]:
# same for GenBank file containing many sequences
record = SeqIO.parse(
    handle = './resources/HBB.gb',
    format='gb',
)

for elements in record:
    print('\n'.join(elements.annotations.keys()))

molecule_type
topology
data_file_division
date
accessions
sequence_version
keywords
source
organism
taxonomy
references
comment
structured_comment
molecule_type
topology
data_file_division
date
accessions
sequence_version
keywords
source
organism
taxonomy
references
comment
structured_comment
molecule_type
topology
data_file_division
date
accessions
sequence_version
keywords
source
organism
taxonomy
references
comment
structured_comment
molecule_type
topology
data_file_division
date
accessions
sequence_version
keywords
source
organism
taxonomy
references
comment
structured_comment
molecule_type
topology
data_file_division
date
accessions
sequence_version
keywords
source
organism
taxonomy
references
comment
structured_comment
molecule_type
topology
data_file_division
date
accessions
sequence_version
keywords
source
organism
taxonomy
references
comment
structured_comment
molecule_type
topology
data_file_division
date
accessions
sequence_version
keywords
source
organism
taxonomy
reference

In [49]:
record = SeqIO.parse(
    handle = './resources/HBB.gb',
    format='gb',
)

for element in record:
    print(element.annotations['organism'])

Papio anubis
Rattus norvegicus
Homo sapiens
Esox lucius
Esox lucius
Esox lucius
Esox lucius
Cricetulus griseus
Xenopus laevis
Sus scrofa
Ovis aries
Bos taurus
Danio rerio
Oryctolagus cuniculus
Oryctolagus cuniculus
Macaca fascicularis
Equus caballus
Salmo salar
Chlorocebus sabaeus
Ailuropoda melanoleuca
Condylura cristata
Ictalurus punctatus
Macaca mulatta
Ailuropoda melanoleuca
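
Several organisms appear more than once in this list. A quick way to summarise such output is `collections.Counter`; the sketch below uses the first ten organism names printed above, copied into a list by hand rather than read from the file.

```python
from collections import Counter

# first ten organisms from the output above
organisms = [
    "Papio anubis", "Rattus norvegicus", "Homo sapiens",
    "Esox lucius", "Esox lucius", "Esox lucius", "Esox lucius",
    "Cricetulus griseus", "Xenopus laevis", "Sus scrofa",
]

counts = Counter(organisms)
for organism, count in counts.most_common():
    print(f"{organism}: {count}")
```

With the parsed file, the same counter could be fed directly from `element.annotations['organism']` inside the loop.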


### Exercise on parsing a GenBank file

This exercise uses a file called HBB.gb, which you will find in the udemybiopython.zip archive in the resources of the tenth lecture.

```python
r = SeqIO.parse("HBB.gb", 'gb')
for i in r:
    if "references" in i.annotations.keys():
        for j in i.annotations["references"]:
            print(j.pubmed_id)
```

Explain what the previous code does and what results it prints.
Then write code that extracts the titles of all the references in the file. It will be very similar to the code above.

In [69]:
# look at ALL the references from ALL the sequences in the GenBank file
# and print only the PubMed ID
r = SeqIO.parse("./resources/HBB.gb", 'gb')
for i in r:
    if "references" in i.annotations.keys():
        for j in i.annotations["references"]:
            print(j.pubmed_id)


10723742
28066926
24930900
24333691
23952145
23382103
1512262
1520632
2239966
1272328
242324
32219817
32078010
31869403
31636731
19372376
20301599
20301551
429365
68958
67897
20433749
20433749
20433749
20433749
1562610
12454917
2999708
6093050
6688076
23487454
17145712
15679890
14681463
10713517
9250869
7990139
7764326
2988373
565742
16962804
2494347
6161931
23991043
21269357
19393038
17718518
11369847
10449916
8411160
6322113
4561255
6048711
27189481
24611545
24123299
25603810
23594743
16709914
16379021
15169611
10207156
9002973
29100196
9692979
3365379
7350525
7357610
264241
63556
61580
4530669
5789874
29100196
9692979
3365379
7350525
7357610
264241
63556
61580
4530669
5789874
22002653
17194215
10723742
3110424
5499429
27377386
26107351
20383026
7409745
561852
5529282
4876811
5659637
5659617
20433749
8924215
8151712
22726727
22246260
20634964
19959874


In [13]:
# print the titles instead of the pubmed_id
r = SeqIO.parse("./resources/HBB.gb", 'gb')
for i in r:
    if "references" in i.annotations.keys():
        for j in i.annotations["references"]:
            print(j.title)

Strand symmetry around the beta-globin origin of replication in primates
Low affinity hemoglobinopathy (Hb Vigo) due to a new mutation of beta globin gene (c200 A>T; Lys>Ile). A cause of rare anemia misdiagnosis
Changes in hematological parameters in alpha-thalassemia individuals co-inherited with erythroid Kruppel-like factor mutations
A mitochondrial location for haemoglobins--dynamic distribution in ageing and Parkinson's disease
Acute chest syndrome is associated with single nucleotide polymorphism-defined beta globin cluster haplotype in children with sickle cell anaemia
Platelet proteome analysis reveals integrin-dependent aggregation defects in patients with myelodysplastic syndromes
A third quaternary structure of human hemoglobin A at 1.7-A resolution
Beta 141 Leu is not deleted in the unstable haemoglobin Atlanta-Coventry but is replaced by a novel amino acid of mass 129 daltons
Rapid detection of the hemoglobin C mutation by allele-specific polymerase chain reaction
Hemoglob

## Creating FASTA files

In [14]:
# from GenBank to FASTA
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Bio.Alphabet was removed in Biopython 1.78, so no alphabet is passed anymore
# see https://biopython.org/wiki/Alphabet
# from Bio.Alphabet import DNAAlphabet

In [15]:
record = SeqIO.parse(
    handle = './resources/HBB.gb',
    format='gb',
)

for elements in record:
    new_record = SeqRecord(
        seq=elements.seq, # or manually Seq(str(any_sequence)))
        id=elements.id, # or type in manually, e.g. "NM_001168847"
        name=elements.name, # or manually, e.g. "HBB"
        description=elements.description, # or manually, e.g. "Papio anubis hemoglobin, beta (HBB), mRNA."
    )

    print(new_record.format('fasta'))

>NM_001168847.1 Papio anubis hemoglobin, beta (HBB), mRNA
ATGGTGCATCTGACTCCTGAGGAGAAGAATGCCGTTACCGCCCTGTGGGGCAAAGTGAAC
GTGGATGAAGTTGGTGGTGAGGCCCTGGGCAGGTTGCTGGTGGTCTACCCTTGGACCCAG
AGGTTCTTTGATTCCTTTGGGGATCTGTCCTCTCCTGCTGCTGTTATGGGCAACCCTAAG
GTGAAGGCTCATGGCAAGAAAGTGCTTGGTGCCTTTAGTGATGGCCTGAATCACCTGGAC
AACCTCAAGGGCACCTTTGCCCAGCTCAGTGAGCTGCACTGTGACAAGCTGCATGTGGAT
CCTGAGAACTTCAAGCTCCTGGGCAACGTGCTGGTGTGTGTGCTGGCCCATCACTTTGGC
AAAGAATTCACCCCGCAAGTGCAGGCTGCCTATCAGAAAGTGGTGGCTGGTGTGGCTAAT
GCCCTGGCCCACAAGTACCACTAA

>NM_033234.1 Rattus norvegicus hemoglobin subunit beta (Hbb), mRNA
TGCTTCTGACATAGTTGTGTTGACTCACAAACTCAGAAACAGACACCATGGTGCACCTGA
CTGATGCTGAGAAGGCTGCTGTTAATGGCCTGTGGGGAAAGGTGAACCCTGATGATGTTG
GTGGCGAGGCCCTGGGCAGGCTGCTGGTTGTCTACCCTTGGACCCAGAGGTACTTTGATA
GCTTTGGGGACCTGTCCTCTGCCTCTGCTATCATGGGTAACCCTAAGGTGAAGGCCCATG
GCAAGAAGGTGATAAACGCCTTCAATGATGGCCTGAAACACTTGGACAACCTCAAGGGCA
CCTTTGCTCATCTGAGTGAACTCCACTGTGACAAGCTGCATGTGGATCCTGAGAACTTCA
GGCTCCTGGGCAATATGATTGTGATTGTGTTGGGCCACCACCTGGGCAAGGAATTC
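
In the output above, `format('fasta')` wraps the sequence at 60 characters per line. The plain-Python sketch below reproduces that layout for a toy sequence; `format_fasta` is a hypothetical helper written for illustration, not the Biopython writer.

```python
def format_fasta(header, sequence, width=60):
    """Return a FASTA record with the sequence wrapped at `width` characters per line."""
    chunks = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return f">{header}\n" + "\n".join(chunks) + "\n"


toy_record = format_fasta("toy_id toy description", "ACGT" * 35)  # 140 bases
print(toy_record)

line_lengths = [len(line) for line in toy_record.splitlines()[1:]]
print(line_lengths)
```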

## Creating GenBank Files

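Creating a GenBank file works the same way as creating a FASTA file, except for the output format, and more information can be provided. When that information cannot be copied from the source record (a FASTA record carries no features), we need to build the features ourselves.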

Some sequence file formats require the molecule type when writing a file. This was previously recorded with a Bio.Alphabet object as the .alphabet attribute of the Seq object; it is now recorded as a molecule type string in the annotations dictionary of the SeqRecord object instead. See the [INSDC feature table definition](https://www.insdc.org/files/feature_table.html) for more details about the values for `mol_type`.

In [16]:
record = SeqIO.parse(
    handle='HBB-human.fasta',
    format='fasta',
)

for elements in record:
    new_record = SeqRecord(
        seq=elements.seq, # already of Seq type
        id=elements.id,
        name=elements.name,
        description=elements.description,
        annotations={"molecule_type": "mRNA"},
    )

    print(new_record.format('genbank'))

LOCUS       NM_000518.5              628 bp    mRNA             UNK 01-JAN-1980
DEFINITION  NM_000518.5 Homo sapiens hemoglobin subunit beta (HBB), mRNA.
ACCESSION   NM_000518
VERSION     NM_000518.5
KEYWORDS    .
SOURCE      .
  ORGANISM  .
            .
FEATURES             Location/Qualifiers
ORIGIN
        1 acatttgctt ctgacacaac tgtgttcact agcaacctca aacagacacc atggtgcatc
       61 tgactcctga ggagaagtct gccgttactg ccctgtgggg caaggtgaac gtggatgaag
      121 ttggtggtga ggccctgggc aggctgctgg tggtctaccc ttggacccag aggttctttg
      181 agtcctttgg ggatctgtcc actcctgatg ctgttatggg caaccctaag gtgaaggctc
      241 atggcaagaa agtgctcggt gcctttagtg atggcctggc tcacctggac aacctcaagg
      301 gcacctttgc cacactgagt gagctgcact gtgacaagct gcacgtggat cctgagaact
      361 tcaggctcct gggcaacgtg ctggtctgtg tgctggccca tcactttggc aaagaattca
      421 ccccaccagt gcaggctgcc tatcagaaag tggtggctgg tgtggctaat gccctggccc
      481 acaagtatca ctaagctcgc tttcttgctg tccaatttct attaaaggtt cctttgttcc
      541 ct

In [17]:
from Bio.SeqFeature import SeqFeature, FeatureLocation, Reference

record = SeqIO.parse(
    handle='HBB-human.fasta',
    format='fasta',
)

for elements in record:
    # we first build each individual feature
    f1 = SeqFeature(
        location=FeatureLocation(0, 628, strand=1), # positive strand
        type='gene',
    )

    f2 = SeqFeature(
        location=FeatureLocation(50, 494, strand=1),
        type='CDS',
        qualifiers={
            'product': "hemoglobin subunit beta",
            'protein_id': "NP_000509.1",
            # qualifier values are normally plain strings, hence str()
            'translation': str(elements.seq[50:494].translate()),
        }
    )

    # Reference is a separate class in Bio.SeqFeature, not a subclass of SeqFeature
    r1 = Reference()
    r1.authors = "Sebastien WIECKOWSKI"
    r1.journal = 'Nature'

    # we then add the features into the SeqRecord
    new_record = SeqRecord(
        seq=elements.seq, # already of Seq type
        id=elements.id,
        name=elements.name,
        description=elements.description,
        # also adding more annotations e.g. organism, source, etc.
        annotations={
            "molecule_type": "mRNA",
            'organism': 'Homo sapiens',
            'source': 'Homo sapiens (human)',
            'references': [r1], # list of references
        },
        features=[f1, f2], # list of Features
    )

    print(new_record.format('genbank'))

LOCUS       NM_000518.5              628 bp    mRNA             UNK 01-JAN-1980
DEFINITION  NM_000518.5 Homo sapiens hemoglobin subunit beta (HBB), mRNA.
ACCESSION   NM_000518
VERSION     NM_000518.5
KEYWORDS    .
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
            .
REFERENCE   1
  AUTHORS   Sebastien WIECKOWSKI
  JOURNAL   Nature
FEATURES             Location/Qualifiers
     gene            1..628
     CDS             51..494
                     /product="hemoglobin subunit beta"
                     /protein_id="NP_000509.1"
                     /translation="MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFES
                     FGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENF
                     RLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH*"
ORIGIN
        1 acatttgctt ctgacacaac tgtgttcact agcaacctca aacagacacc atggtgcatc
       61 tgactcctga ggagaagtct gccgttactg ccctgtgggg caaggtgaac gtggatgaag
      121 ttggtggtga ggccctgggc aggctgctgg tggtctaccc ttgga
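
Note that `FeatureLocation(50, 494)` is printed as `51..494` in the GenBank output: `FeatureLocation` uses Python-style coordinates (0-based start, end excluded), whereas GenBank uses 1-based inclusive positions. The sketch below shows this on a hypothetical 13-base toy sequence and uses `SeqFeature.extract` to pull out the feature sequence instead of slicing by hand. The `to_stop=True` argument is a variation on the cell above: it drops the trailing `*` from the translation.

```python
from Bio.Seq import Seq
from Bio.SeqFeature import SeqFeature, FeatureLocation

# Hypothetical toy sequence: 2 flanking bases, ATG GCC TAA, 2 flanking bases
toy_seq = Seq("CCATGGCCTAAGG")

toy_cds = SeqFeature(
    location=FeatureLocation(2, 11, strand=1),  # same span as the slice [2:11]
    type="CDS",
)

# extract() returns the part of the parent sequence covered by the feature
cds_seq = toy_cds.extract(toy_seq)

# to_stop=True stops at the first stop codon and leaves out the '*'
protein = cds_seq.translate(to_stop=True)

print(toy_cds.location)
print(cds_seq)
print(protein)
```

In a GenBank file this toy feature would be written as `3..11`, by the same rule that turns `FeatureLocation(50, 494)` into `51..494`.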

In [17]:
print(new_record)

ID: NM_000518.5
Name: NM_000518.5
Description: NM_000518.5 Homo sapiens hemoglobin subunit beta (HBB), mRNA
Number of features: 2
/molecule_type=mRNA
/organism=Homo sapiens
/source=Homo sapiens (human)
/references=[Reference(title='', ...)]
Seq('ACATTTGCTTCTGACACAACTGTGTTCACTAGCAACCTCAAACAGACACCATGG...CAA')


#### Saving into a file

In [18]:
file = 'new_HBB-human.gb'
with open(file, 'w') as output_handle:
    SeqIO.write(
        sequences=new_record,
        handle=output_handle,
        format='genbank',
    )


In [19]:
file = 'new_HBB-human.fa'
with open(file, 'w') as output_handle:
    SeqIO.write(
        sequences=new_record,
        handle=output_handle,
        format='fasta',
    )
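
A quick way to check a saved file is to read it back. The following is a minimal sketch, assuming Biopython 1.78 or later; the record `demo` and the file name `demo.gb` are hypothetical, and a temporary directory is used so that nothing is left on disk. `SeqIO.write` returns the number of records written, and `SeqIO.read` expects the file to contain exactly one record.

```python
import os
import tempfile

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Hypothetical record, used only for illustration.
demo = SeqRecord(
    Seq("ATGGTGCATCTGACTCCTGAG"),
    id="EXAMPLE_1",
    name="EXAMPLE_1",
    description="hypothetical HBB fragment",
    annotations={"molecule_type": "mRNA"},
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo.gb")
    # SeqIO.write accepts a file name as well as an open handle
    written = SeqIO.write(demo, path, "genbank")
    # SeqIO.read raises an error unless the file holds exactly one record
    reloaded = SeqIO.read(path, "genbank")

print(written)
print(reloaded.annotations["molecule_type"])
print(str(reloaded.seq).upper() == str(demo.seq))
```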