# Bioinformatics Vignette: Using For Loops and Dictionaries to Compare Nucleotide Composition in Pandemic and Non-Pandemic Causing Influenza Strains

## *Vignette Author*: Nia Prabhu



<img src="./nia_prabhu_vignette_picture.png" width="200" align="left" style="margin: 0px 10px 0px 0px;" description="A professional picture of Nia Prabhu, the author of this vignette"> 

## About Me 
I am passionate about medicine and research! Bioinformatics intrigues me because it blends together the diverse fields of computer science, biology, mathematics, and physics. The beauty of bioinformatics is that it uses computational methods to take large and complex data sets like the human genome and output very specific biological data which we can then use to test a hypothesis. Something that I would like to further study in this field is the analysis of data from genome sequencing and gene variants in a patient’s response to a particular drug. 









## Background

Influenza A pandemics occur when the human population is exposed to a new Influenza A virus strain that it has little or no pre-existing immunity to. Antigenic shifts in Influenza A viruses result in a new HA or HA/NA protein subtype, which then gains the ability to infect humans. This novel subtype can cause an influenza pandemic. 

These shifts occur when an influenza virus from an animal (most commonly of swine or avian origin) infects humans with the novel subtype. Because the population lacks immunity to it, the disease spreads rapidly. Given the harm that influenza viruses have caused, it’s important to study these viruses and their mutations in order to forecast future outbreaks of disease. 

My research focused on two specific glycoproteins on the surface of Influenza A viruses, HA and NA, in both pandemic and non-pandemic strains. I analyzed nucleotide composition to study similarities and patterns between the sequences encoding these two proteins in pandemic and non-pandemic strains. This analysis was possible because other researchers had already used genetic characterization to study many influenza viruses and shared their sequences in online databases. 

Genetic characterization is the process of comparing genetic sequences, and genomic sequencing is a type of genetic characterization in which we determine the nucleotides present in each gene of the virus’s genome. Unlike our double-stranded DNA genome, Influenza A's genome is an RNA genome (A, U, C, and G) that is single-stranded and broken up into pieces called segments. However, this segmented RNA genome is usually converted into DNA using a reverse transcriptase enzyme before being sequenced, so our data was still presented as DNA (A, T, C, and G). 
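As a small illustration of the alphabet difference described above, the sketch below rewrites an RNA string using DNA letters by swapping U for T. The function name `rna_to_dna_alphabet` and the example sequence are made up for this illustration. It only mimics how the sequence is written down; it does not model reverse transcription itself, which produces a complementary strand.

```python
def rna_to_dna_alphabet(rna):
    """Rewrite an RNA string using DNA letters (U becomes T)."""
    return rna.upper().replace("U", "T")


example_rna = "AUGGCUUAA"  # made-up sequence, not from the dataset
print(rna_to_dna_alphabet(example_rna))
```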

I was specifically interested in comparing the nucleotide compositions of pandemic H1N1 strains with those of non-pandemic H1N1 strains to look for patterns among Influenza A viruses. 

## Applying the Technique

### Data Collection
The database that we used to collect our genomic sequences was the [Influenza Research Database](https://www.fludb.org/brc/influenza_sequence_search_segment_display.spg?method=ShowCleanSearch&decorator=influenza). For data collection, I used a dataset of HA and NA segments from pandemic H1N1 reference strains and non-pandemic H1N1 reference strains. The dataset consisted of 40 sequences in total: 10 HA segments from pandemic strains, 10 NA segments from pandemic strains, 10 HA segments from non-pandemic strains, and 10 NA segments from non-pandemic strains. 

**Table 1. HA and NA nucleotide sequences from pandemic and non-pandemic causing strains.**
<table>
    <thead>
        <tr>
            <th>Link to Database                                                                                       </th>
            <th>  Protein Type </th>
            <th> Pandemic causing? </th>
            <th> Strain Name                            </th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td> https://www.fludb.org/brc/fluSegmentDetails.spg?ncbiGenomicAccession=GQ280797&amp;decorator=influenza&amp;context=1628317532186 </td>
            <td> HA           </td>
            <td> yes               </td>
            <td> A/California/04/2009                   </td>
        </tr>
        <tr>
            <td> https://www.fludb.org/brc/fluSegmentDetails.spg?ncbiGenomicAccession=FJ969517&amp;decorator=influenza&amp;context=1628317555652 </td>
            <td> NA           </td>
            <td> yes               </td>
            <td> A/California/04/2009                   </td>
        </tr>
        <tr>
            <td> https://www.fludb.org/brc/fluSegmentDetails.spg?ncbiGenomicAccession=HM138501&amp;decorator=influenza&amp;context=1628317639251 </td>
            <td> HA           </td>
            <td> yes               </td>
            <td> A/Germany-BY/74/2009                   </td>
        </tr>
        <tr>
            <td> https://www.fludb.org/brc/fluSegmentDetails.spg?ncbiGenomicAccession=HM138500&amp;decorator=influenza&amp;context=1628317679677 </td>
            <td> NA           </td>
            <td> yes               </td>
            <td> A/Germany-BY/74/2009                   </td>
        </tr>
        <tr>
            <td> https://www.fludb.org/brc/fluSegmentDetails.spg?ncbiGenomicAccession=CY040888&amp;decorator=influenza&amp;context=1628317752983 </td>
            <td> HA           </td>
            <td> yes               </td>
            <td> A/Mexico/47N/2009(H1N1)                </td>
        </tr>
        <tr>
            <td> https://www.fludb.org/brc/fluSegmentDetails.spg?ncbiGenomicAccession=CY040890&amp;decorator=influenza&amp;context=1628317817620 </td>
            <td> NA           </td>
            <td> yes               </td>
            <td> A/Mexico/47N/2009(H1N1)                </td>
        </tr>
        <tr>
            <td> https://www.fludb.org/brc/fluSegmentDetails.spg?ncbiGenomicAccession=GQ402188&amp;decorator=influenza&amp;context=1628317850766 </td>
            <td> HA           </td>
            <td> yes               </td>
            <td> A/Mexico/InDRE4115/2009                </td>
        </tr>
        <tr>
            <td> https://www.fludb.org/brc/fluSegmentDetails.spg?ncbiGenomicAccession=GQ402228&amp;decorator=influenza&amp;context=1628317893386 </td>
            <td> NA           </td>
            <td> yes               </td>
            <td> A/Mexico/InDRE4115/2009                </td>
        </tr>
        <tr>
            <td> https://www.fludb.org/brc/fluSegmentDetails.spg?ncbiGenomicAccession=GQ894830&amp;decorator=influenza&amp;context=1628317943477 </td>
            <td> HA           </td>
            <td> yes               </td>
            <td> A/Rhode Island/08/2009                 </td>
        </tr>
        <tr>
            <td> https://www.fludb.org/brc/fluSegmentDetails.spg?ncbiGenomicAccession=GQ894831&amp;decorator=influenza&amp;context=1628317971411 </td>
            <td> NA           </td>
            <td> yes               </td>
            <td> A/Rhode Island/08/2009                 </td>
        </tr>
        <tr>
            <td> https://www.fludb.org/brc/fluSegmentDetails.spg?ncbiGenomicAccession=CY053744&amp;decorator=influenza&amp;context=1628318065357 </td>
            <td> HA           </td>
            <td> yes               </td>
            <td> A/Russia/190/2009                      </td>
        </tr>
        <tr>
            <td> https://www.fludb.org/brc/fluSegmentDetails.spg?ncbiGenomicAccession=CY053746&amp;decorator=influenza&amp;context=1628318084753 </td>
            <td> NA           </td>
            <td> yes               </td>
            <td> A/Russia/190/2009                      </td>
        </tr>
        <tr>
            <td> https://www.fludb.org/brc/fluSegmentDetails.spg?ncbiGenomicAccession=CY071063&amp;decorator=influenza&amp;context=1628318131697 </td>
            <td> HA           </td>
            <td> yes               </td>
            <td> A/Tallinn/INS372/2009                  </td>
        </tr>
        <tr>
            <td> https://www.fludb.org/brc/fluSegmentDetails.spg?ncbiGenomicAccession=CY071065&amp;decorator=influenza&amp;context=1628318173047 </td>
            <td> NA           </td>
            <td> yes               </td>
            <td> A/Tallinn/INS372/2009                  </td>
        </tr>
        <tr>
            <td> https://www.fludb.org/brc/fluSegmentDetails.spg?ncbiGenomicAccession=CY074982&amp;decorator=influenza&amp;context=1628318303335 </td>
            <td> HA           </td>
            <td> yes               </td>
            <td> A/Thailand/CU-H1222/2010               </td>
        </tr>
        <tr>
            <td> https://www.fludb.org/brc/fluSegmentDetails.spg?ncbiGenomicAccession=CY074984&amp;decorator=influenza&amp;context=1628318830367 </td>
            <td> NA           </td>
            <td> yes               </td>
            <td> A/Thailand/CU-H1222/2010               </td>
        </tr>
        <tr>
            <td> https://www.fludb.org/brc/fluSegmentDetails.spg?ncbiGenomicAccession=CY080420&amp;decorator=influenza&amp;context=1628319077054 </td>
            <td> HA           </td>
            <td> yes               </td>
            <td> A/Ulaanbaatar/190/2011                 </td>
        </tr>
        <tr>
            <td> https://www.fludb.org/brc/fluSegmentDetails.spg?ncbiGenomicAccession=CY080574&amp;decorator=influenza&amp;context=1628319197987 </td>
            <td> NA           </td>
            <td> yes               </td>
            <td> A/Ulaanbaatar/190/2011                 </td>
        </tr>
        <tr>
            <td> https://www.fludb.org/brc/fluSegmentDetails.spg?ncbiGenomicAccession=GQ397278&amp;decorator=influenza&amp;context=1628319283045 </td>
            <td> HA           </td>
            <td> yes               </td>
            <td> A/Yamaguchi/22/2009                    </td>
        </tr>
        <tr>
            <td> https://www.fludb.org/brc/fluSegmentDetails.spg?ncbiGenomicAccession=GQ397279&amp;decorator=influenza&amp;context=1628319315358 </td>
            <td> NA           </td>
            <td> yes               </td>
            <td> A/Yamaguchi/22/2009                    </td>
        </tr>
        <tr>
            <td> https://www.fludb.org/brc/fluSegmentDetails.spg?ncbiGenomicAccession=CY021709&amp;decorator=influenza&amp;context=1628320312832 </td>
            <td> HA           </td>
            <td> no                </td>
            <td> A/AA/Huston/1945                       </td>
        </tr>
        <tr>
            <td> https://www.fludb.org/brc/fluSegmentDetails.spg?ncbiGenomicAccession=CY021711&amp;decorator=influenza&amp;context=1628320337871 </td>
            <td> NA           </td>
            <td> no                </td>
            <td> A/AA/Huston/1945                       </td>
        </tr>
        <tr>
            <td> https://www.fludb.org/brc/fluSegmentDetails.spg?ncbiGenomicAccession=CY020285&amp;decorator=influenza&amp;context=1628320421148 </td>
            <td> HA           </td>
            <td> no                </td>
            <td> A/AA/Marton/1943                       </td>
        </tr>
        <tr>
            <td> https://www.fludb.org/brc/fluSegmentDetails.spg?ncbiGenomicAccession=CY020285&amp;decorator=influenza&amp;context=1628320461651 </td>
            <td> NA           </td>
            <td> no                </td>
            <td> A/AA/Marton/1943                       </td>
        </tr>
        <tr>
            <td> https://www.fludb.org/brc/fluSegmentDetails.spg?ncbiGenomicAccession=CY021901&amp;decorator=influenza&amp;context=1628320507754 </td>
            <td> HA           </td>
            <td> no                </td>
            <td> A/Albany/1618/1951                     </td>
        </tr>
        <tr>
            <td> https://www.fludb.org/brc/fluSegmentDetails.spg?ncbiGenomicAccession=CY021903&amp;decorator=influenza&amp;context=1628320551208 </td>
            <td> NA           </td>
            <td> no                </td>
            <td> A/Albany/1618/1951                     </td>
        </tr>
        <tr>
            <td> https://www.fludb.org/brc/fluSegmentDetails.spg?ncbiGenomicAccession=CY026411&amp;decorator=influenza&amp;context=1628320634213 </td>
            <td> HA           </td>
            <td> no                </td>
            <td> A/Albany/8/1979                        </td>
        </tr>
        <tr>
            <td> https://www.fludb.org/brc/fluSegmentDetails.spg?ncbiGenomicAccession=CY026413&amp;decorator=influenza&amp;context=1628320694306 </td>
            <td> NA           </td>
            <td> no                </td>
            <td> A/Albany/8/1979                        </td>
        </tr>
        <tr>
            <td> https://www.fludb.org/brc/fluSegmentDetails.spg?ncbiGenomicAccession=CY074243&amp;decorator=influenza&amp;context=1628320744651 </td>
            <td> HA           </td>
            <td> no                </td>
            <td> A/California/VRDL225/2009              </td>
        </tr>
        <tr>
            <td> https://www.fludb.org/brc/fluSegmentDetails.spg?ncbiGenomicAccession=CY074245&amp;decorator=influenza&amp;context=1628320777620 </td>
            <td> NA           </td>
            <td> no                </td>
            <td> A/California/VRDL225/2009              </td>
        </tr>
        <tr>
            <td> https://www.fludb.org/brc/fluSegmentDetails.spg?ncbiGenomicAccession=CY010260&amp;decorator=influenza&amp;context=1628320807968 </td>
            <td> HA           </td>
            <td> no                </td>
            <td> A/Canterbury/48/2001                   </td>
        </tr>
        <tr>
            <td> https://www.fludb.org/brc/fluSegmentDetails.spg?ncbiGenomicAccession=CY010262&amp;decorator=influenza&amp;context=1628320840500 </td>
            <td> NA           </td>
            <td> no                </td>
            <td> A/Canterbury/48/2001                   </td>
        </tr>
        <tr>
            <td> https://www.fludb.org/brc/fluSegmentDetails.spg?ncbiGenomicAccession=AJ289702&amp;decorator=influenza&amp;context=1628320921483 </td>
            <td> HA           </td>
            <td> no                </td>
            <td> A/Fiji/15899/83                        </td>
        </tr>
        <tr>
            <td> https://www.fludb.org/brc/fluSegmentDetails.spg?ncbiGenomicAccession=AJ006954&amp;decorator=influenza&amp;context=1628320944073 </td>
            <td> NA           </td>
            <td> no                </td>
            <td> A/Fiji/15899/83                        </td>
        </tr>
        <tr>
            <td> https://www.fludb.org/brc/fluSegmentDetails.spg?ncbiGenomicAccession=U02085&amp;decorator=influenza&amp;context=1628321015721   </td>
            <td> HA           </td>
            <td> no                </td>
            <td> A/Fort Monmouth/1/1947                 </td>
        </tr>
        <tr>
            <td> https://www.fludb.org/brc/fluSegmentDetails.spg?ncbiGenomicAccession=CY009614&amp;decorator=influenza&amp;context=1628321060539 </td>
            <td> NA           </td>
            <td> no                </td>
            <td> A/Fort Monmouth/1/1947                 </td>
        </tr>
        <tr>
            <td> https://www.fludb.org/brc/fluSegmentDetails.spg?ncbiGenomicAccession=CY020461&amp;decorator=influenza&amp;context=1628321098549 </td>
            <td> HA           </td>
            <td> no                </td>
            <td> A/Iowa/1943                            </td>
        </tr>
        <tr>
            <td> https://www.fludb.org/brc/fluSegmentDetails.spg?ncbiGenomicAccession=CY020463&amp;decorator=influenza&amp;context=1628321120028 </td>
            <td> NA           </td>
            <td> no                </td>
            <td> A/Iowa/1943                            </td>
        </tr>
        <tr>
            <td> https://www.fludb.org/brc/fluSegmentDetails.spg?ncbiGenomicAccession=CY019971&amp;decorator=influenza&amp;context=1628321172939 </td>
            <td> HA           </td>
            <td> no                </td>
            <td> A/Roma/1949                            </td>
        </tr>
        <tr>
            <td> https://www.fludb.org/brc/fluSegmentDetails.spg?ncbiGenomicAccession=CY019973&amp;decorator=influenza&amp;context=1628321192613 </td>
            <td> NA           </td>
            <td> no                </td>
            <td> A/Roma/1949</td>
        </tr>
    </tbody>
</table>

### Sequence Data Analysis

I compared the sequences in two ways: by comparing the nucleotide composition of each sequence, and by comparing the sequence similarity of each pair of sequences. This vignette includes code for calculating nucleotide composition of each sequence.

To calculate nucleotide composition, I parsed FASTA files using for loops and if statements to separate identifier lines (which start with `>`) from the sequence lines that follow them. I then used for loops and a dictionary to count how many times each nucleotide appears in each sequence. Finally, I converted each count into a percentage of the sequence length.
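Before looking at the full functions, here is a minimal sketch of the two ideas on a tiny in-memory example: an if statement that tells identifier lines from sequence lines, and a dictionary that tallies nucleotides one character at a time. The names `example_lines`, `records`, and `count_nucleotides`, and the FASTA labels, are invented for this sketch and are not part of the analysis code.

```python
# A made-up FASTA-style example: lines starting with ">" are identifiers,
# and the lines after each identifier hold that record's sequence.
example_lines = [
    ">gene1 made-up label",
    "ATGC",
    "ATTA",
    ">gene2 made-up label",
    "GGCC",
]


def count_nucleotides(sequence):
    """Tally each character of the sequence in a dictionary."""
    counts = {}
    for nuc in sequence:
        counts[nuc] = counts.get(nuc, 0) + 1
    return counts


records = {}
current_id = None
for line in example_lines:
    line = line.strip()
    if line.startswith(">"):  # identifier line: start a new record
        current_id = line[1:]
        records[current_id] = ""
    elif current_id is not None:  # sequence line: extend the current record
        records[current_id] += line

for seq_id, seq in records.items():
    print(seq_id, count_nucleotides(seq))
```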

Separately, additional code (not included in this vignette) was used to test how similar each pair of sequences in the analysis was.
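Since the similarity code is not part of this vignette, the sketch below shows only one simple way such a comparison could be set up, and it should not be read as the method used in the project. It assumes the sequences have already been aligned to the same length, and it scores each pair by the percentage of positions that match. The function name `percent_identity` and the example sequences are hypothetical.

```python
from itertools import combinations


def percent_identity(seq_a, seq_b):
    """Percentage of positions at which two equal-length sequences match."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be the same length (align them first)")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    matches = 0
    for base_a, base_b in zip(seq_a, seq_b):
        if base_a == base_b:
            matches += 1
    return matches / len(seq_a) * 100


# Made-up, pre-aligned example sequences
example = {"s1": "ATGCAT", "s2": "ATGCTT", "s3": "TTGCTT"}

# Compare every pair of sequences exactly once
for (id_a, seq_a), (id_b, seq_b) in combinations(example.items(), 2):
    print(id_a, id_b, round(percent_identity(seq_a, seq_b), 1))
```

Real influenza segments differ in length, so in practice an alignment step would be needed before a position-by-position comparison like this makes sense.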

The code used to calculate nucleotide compositions is shown below. 

## Code 

I wrote Python code to perform this analysis with my project partner Shubham Kamboj. For the nucleotide composition analysis, I defined two functions. The first function is called `writeFileToDict`. It takes the location of a FASTA file, along with a dictionary. It parses the FASTA file and adds each DNA sequence in it to the `dict` passed in as `myDict`. 

In [None]:
def writeFileToDict(myFile, myDict):
    """Function to write the file contents to a dictionary
    """
    seqID = None
    sequence = []
    with open(myFile) as infile:  #the file is closed automatically when the block ends
        for line in infile:
            line = line.strip()  #remove surrounding whitespace, including the newline
            if line.startswith(">"):  #if it is an identifier line
                if seqID is not None:  #if a previous record exists, put it in the dictionary
                    myDict[seqID] = ''.join(sequence)

                #Set each ID to the filename plus the FASTA label line
                seqID = str(myFile) + "_" + line[1:]
                sequence = []
                continue

            sequence.append(line)  #collect sequence lines for the current record

    if seqID is not None:  #store the last record (skipped if the file had no identifier lines)
        myDict[seqID] = ''.join(sequence)
    

The second function, `findPercentNucleotides`, calculates the percentage of each nucleotide in a sequence.

In [5]:
def findPercentNucleotides(sequence):
    """Finds the percentage of nucleotides in a sequence
    """ 
    nucleotides = ["A", "G", "T", "C"] #nucleotides 
    nucleotidePercent = {} #corresponding dictionary 
    length = len(sequence)
    for nuc in nucleotides:
        count = sequence.count(nuc)
        nucPercent = (count/length)*100
        nucleotidePercent[nuc] = nucPercent
    return nucleotidePercent
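
One thing to be aware of is that dividing by the sequence length raises a `ZeroDivisionError` when the sequence is empty, which could happen if a FASTA record has an identifier line but no sequence. The sketch below is one possible defensive variant, not the version used in the analysis; the name `percent_nucleotides_safe` and the choice to report 0.0 for every nucleotide in an empty sequence are assumptions made for illustration.

```python
def percent_nucleotides_safe(sequence):
    """Percentage of each nucleotide, returning zeros for an empty sequence."""
    nucleotides = ["A", "G", "T", "C"]
    length = len(sequence)
    if length == 0:  # avoid dividing by zero
        return {nuc: 0.0 for nuc in nucleotides}
    return {nuc: sequence.count(nuc) / length * 100 for nuc in nucleotides}


print(percent_nucleotides_safe("ATAA"))
print(percent_nucleotides_safe(""))
```

For `'ATAA'`, three of the four characters are A and one is T, so the percentages are 75.0 for A and 25.0 for T.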
    

Now we can use these functions to calculate our results. For this vignette, we'll run the code on a test example in which the pandemic sequences are all 'A' or all 'T' nucleotides, and the non-pandemic sequences are all 'G' or all 'C' nucleotides.

In [8]:
#load external tools we'll need to find and access files from our directory
from os import listdir

nonPandemicDict = {}
PandemicDict = {}

nonPandDir = "NonPandemicStrains"

#reads the non-pandemic strain files (in sorted order) into nonPandemicDict
for file in sorted(listdir(nonPandDir)):
    writeFileToDict("./" + nonPandDir + "/" + file, nonPandemicDict)

PandDir = "PandemicStrains"

#reads the pandemic strain files (in sorted order) into PandemicDict
for file in sorted(listdir(PandDir)):
    writeFileToDict("./" + PandDir + "/" + file, PandemicDict)

#prints the percent nucleotides in each sequence next to its ID
for seqID, seq in PandemicDict.items():
    print(seqID, findPercentNucleotides(seq))

for seqID, seq in nonPandemicDict.items():
    print(seqID, findPercentNucleotides(seq))

./PandemicStrains/test_pandemic_strain1.txt_gene1_all_A {'A': 100.0, 'G': 0.0, 'T': 0.0, 'C': 0.0}
./PandemicStrains/test_pandemic_strain1.txt_gene2_all_T {'A': 0.0, 'G': 0.0, 'T': 100.0, 'C': 0.0}
./PandemicStrains/test_pandemic_strain2.txt_gene1_all_A_strain2 {'A': 100.0, 'G': 0.0, 'T': 0.0, 'C': 0.0}
./PandemicStrains/test_pandemic_strain2.txt_gene2_all_T_strain2 {'A': 0.0, 'G': 0.0, 'T': 100.0, 'C': 0.0}
./NonPandemicStrains/test_nonpandemic_strain1.txt_gene1_all_A {'A': 0.0, 'G': 100.0, 'T': 0.0, 'C': 0.0}
./NonPandemicStrains/test_nonpandemic_strain1.txt_gene2_all_T {'A': 0.0, 'G': 0.0, 'T': 0.0, 'C': 100.0}
./NonPandemicStrains/test_nonpandemic_strain2.txt_gene1_all_A_strain2 {'A': 0.0, 'G': 100.0, 'T': 0.0, 'C': 0.0}
./NonPandemicStrains/test_nonpandemic_strain2.txt_gene2_all_T_strain2 {'A': 0.0, 'G': 0.0, 'T': 0.0, 'C': 100.0}


The output of the script lists each gene in each strain, followed by the percentage of each nucleotide in that gene.
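To compare the two groups rather than individual sequences, one possible next step is to average the per-sequence percentages within each group. The sketch below is only an illustration of that idea, not part of the original analysis: the helper name `mean_composition` is hypothetical, and the input dictionaries are copied from the first two pandemic and first two non-pandemic lines of the test output above.

```python
def mean_composition(percent_dicts):
    """Average a list of per-sequence nucleotide percentage dictionaries."""
    if not percent_dicts:
        raise ValueError("need at least one sequence to average")
    totals = {}
    for percents in percent_dicts:
        for nuc, value in percents.items():
            totals[nuc] = totals.get(nuc, 0.0) + value
    return {nuc: total / len(percent_dicts) for nuc, total in totals.items()}


# Values copied from the test output (two sequences per group)
pandemic_example = [
    {"A": 100.0, "G": 0.0, "T": 0.0, "C": 0.0},
    {"A": 0.0, "G": 0.0, "T": 100.0, "C": 0.0},
]
non_pandemic_example = [
    {"A": 0.0, "G": 100.0, "T": 0.0, "C": 0.0},
    {"A": 0.0, "G": 0.0, "T": 0.0, "C": 100.0},
]

print("pandemic mean:", mean_composition(pandemic_example))
print("non-pandemic mean:", mean_composition(non_pandemic_example))
```

With real data, the same helper could be fed the dictionaries returned by `findPercentNucleotides` for each group.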

## Exercises

1. The `writeFileToDict` function doesn't return a value like a traditional Python function. Yet in this example, we're still getting data out of the file. Carefully read the code for this function, and see if you can figure out how data is getting *out*. *Hint*: mutable objects such as lists and dicts that are passed to a function using its arguments can be changed inside functions and those changes will be preserved even after the function ends (see https://www.dataquest.io/blog/tutorial-functions-modify-lists-dictionaries-python/ for examples, and https://realpython.com/python-pass-by-reference/ for a detailed discussion contrasting Python with C#).
2. Try modifying the `writeFileToDict` function to pass data out in a dictionary using the `return` keyword.
3. Imagine that you liked this approach, and wanted to do something similar. However, you noticed that your DNA sequences were all represented in lowercase, whereas the sequences that Nia was studying were all uppercase. Do you think the results of running this code would be the same if the DNA sequence files you used had sequences in lowercase (e.g. like 'atgtacgatcgtagc')? Why or why not? *Hint*: you can try out example sequences that you come up with by passing them to the `findPercentNucleotides` function (e.g. `findPercentNucleotides('ATAA')`).

## [Reading Responses & Feedback](https://docs.google.com/forms/d/e/1FAIpQLSeUQPI_JbyKcX1juAFLt5z1CLzC2vTqaCYySUAYCNElNwZqqQ/viewform?usp=pp_url&entry.2118603224=Bioinformatics+Vignette+-+Nia+Prabhu+-+Using+For+Loops+and+Dictionaries+to+Compare+Nucleotide+Composition+in+Pandemic+and+Non-Pandemic+Causing+Influenza+Strains)


## References

1.	Al-Muharrmi, Z. Understanding the Influenza A H1N1 2009 Pandemic. *Sultan Qaboos Univ. Med. J.* 10, 187–195 (2010).

2.	Jilani, T. N., Jamil, R. T. & Siddiqui, A. H. H1N1 Influenza. in *StatPearls* (StatPearls Publishing, 2021).