# Introduction to bioinformatics

In [2]:
from IPython import display

### Course objectives (Period: 30th Sep - 11th Oct)
- General understanding of:
  - bioinformatics concepts and approaches
  - how biological sequences and structures are used in bioinformatics
  - commonly used bioinformatics tools, web servers and databases
  - sequence alignment approaches
- Formative assessment
  - MCQs (RUConnected)
  - Test (hand-written)
  - Exam (1 essay type question, 15 marks)

### Schedule:
|Day|Date|Time|Topic|Venue|
|--|--|--|--|--|
|Monday|30th Sep|11:25 - 12:30|An introduction to bioinformatics / NCBI| Eden Grove Computer Lab |
|Tuesday|1st Oct|07:45 - 08:30|Digital representation and storage of biomolecules / NCBI| Eden Grove Computer Lab |
|Wednesday|2nd Oct|08:40 - 09:25|Sequence alignment approaches| Union Computer Lab |
|Thursday|3rd Oct|09:30 - 10:20|Sequence alignment approaches| Eden Grove Computer Lab |
|Friday|4th Oct|10:30 - 11:15|Practical session| Jacaranda Computer Lab|
|Monday|7th Oct|11:25 - 13:00|Structural bioinformatics tools and databases| Eden Grove Computer Lab |
|Tuesday|8th Oct|07:45 - 08:30|Structural bioinformatics tools and databases| Eden Grove Computer Lab |
|Wednesday|9th Oct|08:40 - 09:25|Structural bioinformatics tools and databases| Union Computer Lab |
|Thursday|10th Oct|09:30 - 10:20|Practical session | Eden Grove Computer Lab |
|Friday|11th Oct|10:30 - 11:15|Written test | Jacaranda Computer Lab |



---

## What is bioinformatics?
- Various definitions
  - IT applied to the management and analysis of biological data (Attwood & Parry-Smith, 1999)
  - The collection, archiving, organization and interpretation of biological data (Thornton, 2003)
  - Multidisciplinary research area interfacing computer science and biological science (Xiong, 2006)
  - Science of using computer tools to analyze biological data in order to formulate hypotheses about life processes (Tramontano, 2007)
- It is a relatively young field
  - Term first appeared in the mid 1980s, even though computers were already used to analyze biological data in the 1960s
- Bioinformatics is a rapidly advancing field, accelerated by
  - leaps in technology, improved algorithms
  - growth in experimental data
  - artificial intelligence

---

## Simply put, bioinformatics today
- Vast amounts of biological data of various kinds have been generated across the world
  - DNA, RNA, protein, small molecules (natural & synthetic)
  - Organisms constantly evolve
  - Added complexities such as interactions, time/context variance and location
    - These molecules interact in a dynamic system
    - Rule book (DNA) and its products (protein, RNA), and the environment
      - Data recorded from various experimental conditions
- The main goal is to improve our understanding of biology
  - Life is extremely complex, but science tries to improve our understanding of it
    - Scientific tools are used to test hypotheses using various measurements (data)
  - But how do we navigate or make sense of this huge amount of complex information?
    - This is what bioinformatics is about

---

## Bioinformatics can/has to be approached from diverse skill sets 
- Because of the complexity, bioinformatics integrates various sciences (and is thus vast):
    - Biology, mathematics, statistics, physics, computer science, engineering, chemistry, etc
- It can be approached and practiced from various scientific disciplines, e.g.:
  - Biologist + computational skills
  - Biochemist + computational skills
  - Chemist + biology knowledge + computational skills
  - Computer scientist + biology knowledge
  - Physicist + biology knowledge + computational skills
  - etc
- You do not need to be an expert in all these fields! No-one is.
  - You specialize in a certain aspect
  - I am a structural bioinformaticist

---

## Convergence towards bioinformatics, and its expansion
  - **Patterns** in living things
    - Macromolecules such as DNA sequences, protein structures, etc
    - **Biological data** has been gathered and **annotated** over decades
  - Various technologies have generated **huge amounts of biological information** over the years
    - We have a sea of data.
    - How do we make sense of all this - aka find various needles in various haystacks?
  - Greater access to **high performance hardware** and **software** 
    - Decreasing cost of computation
    - Cluster computing access, GPU availability (e.g. [Google Colab](https://colab.google))
    - Improved algorithms
    - Open source software (e.g. [Linux OS](https://www.youtube.com/watch?v=q5yM4ZYwB_s&t=46s), [FOSS](https://en.wikipedia.org/wiki/Free_and_open-source_software#:~:text=Free%20and%20open%2Dsource%20software%20(FOSS)%20is%20software%20that,necessary%20but%20not%20sufficient%20condition.))
  - More data enables **data-driven insights** and improves the understanding of biology
      - Available data **empowers the discovery of new therapeutics**
      - _In silico_ calculations using pre-existing data can **reduce the number of samples** to be evaluated (**lowers cost**)

---

## Examples of tasks performed in bioinformatics

### A. Gene, or amplicon sequencing
- Let's say you did a [PCR](https://cdn.gentaur.co.uk/wp-content/uploads/2021/10/MB_pcr_cycle_pcr_protocol.png) [assay](https://www.sigmaaldrich.com/deepweb/assets/sigmaaldrich/marketing/global/videos/components-to-successfull-pcr-ms/components-to-successfull-pcr-ms.mp4) on an unknown sample, and you see a unique band of some molecular weight on your [electrophoresis gel](https://images.app.goo.gl/t6x9YEG7jj6jjckf9)
  - You could stop there, or you could ask more questions about the fragment
    - what is its composition?
    - which organism is it from?
    - where is it found in the organism?
    - what is its function in the organism?
    - are there other organisms that have the same/similar sequence?
      - are they related - e.g. different virus strains?
    - does it code for a protein?
  - All these questions can be answered by bioinformatics techniques
    - Various wet lab and _in silico_ data stored in **databases** + various **tools** for analysis
  - Now, how would one determine the sequence of a single short fragment (~800 bp)? Most likely with the **Sanger sequencing** method
    - The workflow is typically as follows:
      - Template PCR amplification & purification
      - PCR amplification (cycle sequencing) with fluorescent labels, normal dNTPs, and special nucleotides (ddNTPs) that interrupt extension
      - Capillary electrophoresis + fluorescence detection
    - Well established, and widely regarded as a highly accurate sequencing method
    - Not cost effective if sequencing multiple sequences in parallel
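
Once a fragment has been sequenced, the first question above ("what is its composition?") can be answered with a few lines of code. The following is a minimal sketch, not a standard tool: the function names and the example fragment are made up for illustration.

```python
from collections import Counter


def base_composition(sequence: str) -> dict:
    """Count each base in a DNA sequence (case-insensitive)."""
    return dict(Counter(sequence.upper()))


def gc_content(sequence: str) -> float:
    """Return the fraction of bases that are G or C (0.0 for an empty sequence)."""
    seq = sequence.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


# Made-up fragment, for illustration only (not a real amplicon)
fragment = "ATGCGCGATTACGCTA"
print(base_composition(fragment))
print(f"GC content: {gc_content(fragment):.2f}")
```

The remaining questions (organism, function, related sequences) generally need a database search rather than a local calculation.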

### B. Genome sequencing
- Let's say there's a new pandemic
  - you're tasked to identify what it is.
  - you're also asked how similar it is to known pathogens
- These could qualify for a genome sequencing project
  - What is genome sequencing?
    - We wish to determine the entire set of nucleotide sequences that compose the organism's genetic material
    - Technical limitations mean that sequences are determined in fragments
      - these are assembled into **contigs** (i.e. contiguous sequences)
  - But then, what do we do with a sequence written in a four-letter alphabet?
    - Is that it? ---ATCGATCGACTAGCTAGCTAGCTAGCTAGC---
    - Do we try to find words, sentences and punctuation in it?
    - What does this sequence data mean?
  - If you're determining an entire genome, most likely you'd be using [**Next Generation Sequencing (NGS)**](https://www.illumina.com/science/technology/next-generation-sequencing/beginners.html)
    - Mainly based on the following workflow:
      - fragmentation -> sequencing short reads -> contig assembly -> scaffold

In [26]:
display.Image(url="https://knowgenetics.org/wp-content/uploads/2012/12/kevin2.png", height=600)

For more info, see [here](https://knowgenetics.org/whole-genome-sequencing)
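
The "short reads -> contig assembly" step can be illustrated with a toy greedy overlap merge. This is only a sketch under strong assumptions (error-free reads, a single strand, no repeats); real assemblers use far more sophisticated methods. The function names and the reads are hypothetical.

```python
def overlap_length(left: str, right: str, min_overlap: int = 3) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(reads: list[str], min_overlap: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences with the largest overlap."""
    contigs = list(reads)
    while len(contigs) > 1:
        best = (0, None, None)  # (overlap size, index of left, index of right)
        for i, left in enumerate(contigs):
            for j, right in enumerate(contigs):
                if i != j:
                    size = overlap_length(left, right, min_overlap)
                    if size > best[0]:
                        best = (size, i, j)
        size, i, j = best
        if size == 0:
            break  # no overlaps left: the remaining contigs stay separate
        merged = contigs[i] + contigs[j][size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (i, j)] + [merged]
    return contigs


# Made-up overlapping reads, for illustration only
reads = ["ATGGCGTA", "GCGTACCTT", "CCTTGAAC"]
print(greedy_assemble(reads))
```

When no overlap of at least `min_overlap` bases remains, the function returns several contigs, which mirrors how gaps in coverage leave an assembly fragmented.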

---

### C. Genome annotation
- Genomic features can be annotated
  - It is the next logical step following a genome assembly
  - We determine features/landmarks from the genomic sequence
    - Chromosomes, genes, introns/exons, forward/reverse strands, control elements, etc
  - But now, what can we do with an annotated genome?
    - How does this information help us?
      - Knowing genomic locations and compositions of genetic elements, we can better understand what the organism does
      - We can study pathways in disease, improve 
      - Essentially, it is a map meant to decipher the genomic content and its meaning
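
To make the idea of an annotation as a "map" concrete, the sketch below stores a few features as coordinate ranges and asks which ones cover a given position. The `Feature` class, the feature names and all coordinates are hypothetical; real annotations are usually distributed in formats such as GenBank or GFF.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    name: str
    kind: str     # e.g. "gene" or "exon"
    start: int    # 1-based, inclusive
    end: int      # 1-based, inclusive
    strand: str   # "+" (forward) or "-" (reverse)


def features_at(position: int, features: list[Feature]) -> list[Feature]:
    """Return every feature whose coordinate range covers `position`."""
    return [f for f in features if f.start <= position <= f.end]


# Made-up annotation of a short stretch of genome, for illustration only
annotation = [
    Feature("geneA", "gene", 100, 900, "+"),
    Feature("geneA_exon1", "exon", 100, 300, "+"),
    Feature("geneA_exon2", "exon", 600, 900, "+"),
    Feature("geneB", "gene", 1200, 1800, "-"),
]

for feature in features_at(250, annotation):
    print(feature.kind, feature.name, feature.strand)
```

Position 450 in this toy annotation lies inside `geneA` but in neither exon, which is how an intronic position would show up.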

#### A very quick detour to the NCBI - viewing genomic data
- National Center for Biotechnology Information (NCBI)
  - Over 3 million visitors daily to its website
  - ~27 TB of data downloaded/day


**Quick demonstration:** 

Accessing the genome annotation data through NCBI
- Go to [https://www.ncbi.nlm.nih.gov](https://www.ncbi.nlm.nih.gov)
- Look for "Data & Software"
- Click on "Genome Data Viewer (GDV)"
- Look for "Glutathione S transferase" in the human genome

---

### D. Molecular detection, and evolutionary (phylogenetic) studies
  - With well-characterised regions of several genomes, we can examine their similarities and differences
    - Is the SARS-CoV-2 strain that you contracted similar to the one originally from Wuhan?
  - Sequences can be used to guide molecular diagnostics
    - Is pathogen X present in this patient?
      - e.g. genetic markers, PCR primers/multiplexing, etc
  - To track the provenance and evolution of disease
    - e.g. the similarities/differences among SARS-CoV-2 sequences since the start of the COVID-19 pandemic
  - To develop nucleic acid-based vaccines using their known protein-coding sequences
    - Quite new (COVID-19 mRNA vaccines, Pfizer-BioNTech and Moderna)
  - To discover and characterise unculturable microorganisms
    - Difficult to grow in standard growth media
  - To predict the effectiveness of therapy (HAART)
    - Which HIV subtype(s) is a patient infected with?
    - Is this ARV still effective for the patient (e.g. [Stanford HIVDB](https://hivdb.stanford.edu/hivdb/by-patterns/))?
  - To elucidate the genetic cause of certain disorders
    - sickle-cell anemia (typically a single DNA mutation)
    - certain types of diabetes 
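
Many of the questions above come down to comparing sequences position by position. The following is a minimal sketch that assumes the two sequences are already aligned and of equal length; the function names and both sequences are made up for illustration.

```python
def differences(seq_a: str, seq_b: str) -> list[tuple[int, str, str]]:
    """List (1-based position, base in seq_a, base in seq_b) for each mismatch."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (index + 1, a, b)
        for index, (a, b) in enumerate(zip(seq_a, seq_b))
        if a != b
    ]


def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percentage of aligned positions that carry the same base."""
    if not seq_a:
        raise ValueError("sequences must not be empty")
    mismatches = len(differences(seq_a, seq_b))
    return 100.0 * (len(seq_a) - mismatches) / len(seq_a)


# Made-up sequences differing by a single substitution
reference = "ATGACCGTTAGCCTGA"
variant = "ATGACCGTAAGCCTGA"

print(differences(reference, variant))
print(f"{percent_identity(reference, variant):.2f}% identical")
```

A single mismatch like this one is the kind of change meant by "single DNA mutation" above; sequences of different lengths first need a proper alignment step.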

---

**Evolutionary data from the [nextstrain](https://nextstrain.org/) resource**

**Example 1: SARS-CoV-2 evolutionary analysis**

In [7]:
display.IFrame(src="https://nextstrain.org/ncov/gisaid/africa/1m?dmin=2023-01-01", width="100%", height="600px")

**Example 2: MPOX evolutionary analysis**

In [8]:
display.IFrame(src="https://nextstrain.org/mpox/all-clades", width="100%", height="600px")

---

### E. Gene expression analysis
- If all cells contain the same genome, how do they specialise into different tissues and organs?
  - The cells of your retina carry the same DNA as your skin cells
- Different portions of the genome are switched on and off at different times, and in different locations
  - e.g. in preparation for sleep, melatonin production goes up while your insulin levels go down
- Cellular environment also changes in different states (healthy vs diseased state)
  - A viral infection will trigger an immune response upon recognition of an epitope
- Measuring these changes can highlight coexpression/differential expression of genes
  - Similarities in disease states
  - Development of novel drug targeting approaches
- The NCBI GEO Profiles database can be used to investigate gene expression profiles
  - [CYP1A1 expression levels with and without arsenate](https://www.ncbi.nlm.nih.gov/geoprofiles?term=CYP1A1[Gene+Symbol])
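
Differential expression is often summarised as a log2 fold change between two conditions. The sketch below shows the arithmetic only. The gene names and expression values are made up, and the pseudocount (added to avoid division by zero) is an assumption of this example, not a fixed convention.

```python
import math


def log2_fold_change(control: float, treated: float, pseudocount: float = 1.0) -> float:
    """log2 of the treated/control ratio; a pseudocount guards against zeros."""
    return math.log2((treated + pseudocount) / (control + pseudocount))


# Made-up expression values as (control, treated) pairs, for illustration only
expression = {
    "geneA": (100.0, 400.0),  # higher after treatment
    "geneB": (250.0, 250.0),  # unchanged
    "geneC": (80.0, 10.0),    # lower after treatment
}

for gene, (control, treated) in expression.items():
    change = log2_fold_change(control, treated)
    print(f"{gene}: log2 fold change = {change:+.2f}")
```

A value of +1 means a doubling and -1 a halving. Real analyses also need replicates and statistical testing before a gene is called differentially expressed.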

---

### F. Genes are not the only way to analyse biology; their products add another level of complexity
- Proteins are translated from mRNA and are themselves regulated
- Their structures can be determined experimentally through NMR and X-ray crystallography
  - Lengthy and costly
  - Recent developments in AI are closing the gap on the number of proteins without a solved structure
    - AlphaFold, RoseTTAFold, ColabFold

---

### G. Biomolecular docking
- Protein 3D structures are often solved in isolation or with a limited number of possible compounds
- We may want to determine if the protein interacts with another protein/compound
- These interactions can be predicted _in silico_
  - Protein-protein (HADDOCK, ClusPro, ZDOCK, etc)
  - Protein-ligand (AutoDock, Vina, etc)


---

### H. Bioinformatics software development
- You've found a new way of thinking about some biological data, discovered something, and the approach is generalisable?
  - Great, you can convert it into a tool!
    - E.g. analysis, databases, web resources
  - You can then decide how to make it accessible
    - open/closed source
  - Interface
    - Web application, Graphical User Interface, Command-Line Application
- Programming is a very good skill to have in any bioinformaticist's toolkit
  - Beware: Garbage In Garbage Out
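
One practical defence against "Garbage In Garbage Out" is to validate input before analysing it. The following is a minimal sketch of such a check for DNA sequences. The function name and the choice to accept only A, C, G, T and N are assumptions of this example; a real tool might accept the full set of ambiguity codes.

```python
VALID_DNA = set("ACGTN")  # assumption: only these symbols are accepted


def validate_dna(sequence: str) -> str:
    """Return the cleaned, upper-case sequence or raise ValueError if it is invalid."""
    cleaned = sequence.strip().upper()
    if not cleaned:
        raise ValueError("empty sequence")
    invalid = sorted(set(cleaned) - VALID_DNA)
    if invalid:
        raise ValueError(f"unexpected characters: {', '.join(invalid)}")
    return cleaned


print(validate_dna("  acgtnACGT "))

try:
    validate_dna("ACGU")  # U belongs to RNA, so this is rejected
except ValueError as error:
    print(f"Rejected: {error}")
```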

---

## Biological information has to be digitized to be used computationally
- What information should be stored?
   - Biomolecules (data) & their accompanying information (**metadata**)
     - "metadata" is data that describes data
   - Sequence data
     - Genes, genomes, protein sequences, sequencing reads, etc
     - Common file formats (FASTA, GenBank, FASTQ, ...)
   - **Aligned** sequence data
     - Sequences stacked over each other
     - Common file formats (FASTA, CLUSTAL, ...)
   - Protein structural data (**Static**)
     - Structures solved by X-ray crystallography, NMR or cryo-EM
     - Common file formats (PDB, mmCIF, ...)
   - Protein structural data (**Dynamic**)
     - 3D snapshots of proteins observed over time
     - Common file formats (PDB, XTC, TRR, ...)
   - Small molecules
     - Common metabolites, and any other small compound
     - Common file formats (PDB, SMILES, MOL2, ...)
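
As an example of how a text format maps to data in a program, FASTA stores each record as a header line starting with `>` followed by one or more sequence lines. The parser below is a minimal sketch with made-up records; for real work an established library such as Biopython is a safer choice.

```python
def parse_fasta(text: str) -> dict[str, str]:
    """Map each record identifier (first word of the header) to its sequence."""
    records: dict[str, list[str]] = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            words = line[1:].split()
            current = words[0] if words else ""
            records[current] = []
        elif current is not None:
            records[current].append(line)  # sequences may span several lines
    return {name: "".join(chunks) for name, chunks in records.items()}


# Made-up FASTA content, for illustration only
example = """>seq1 toy DNA record
ATGCGT
ACGT
>seq2 another toy record
GGGCCC
"""

for name, sequence in parse_fasta(example).items():
    print(name, len(sequence), sequence)
```

The text after the identifier on the header line is the record's description, a simple case of the metadata mentioned above. This sketch discards it.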

---

## Biological information also has to be stored
- Why store data?
    - Allow data archival and re-use
    - Facilitate discovery of previously unknown relationships
- Databases
    - Can be as simple as tables stored in flat text files
    - Or more complex collections of related tables
- Various data formats
    - Agreed upon conventions for representing data
    - Formats may have different advantages and disadvantages
- These can be stored on-site or remotely
    - Popular public web resources for retrieving bioinformatics data
        
        |Web resource|URL|
        |--|--|
        | **NCBI** | [https://www.ncbi.nlm.nih.gov](https://www.ncbi.nlm.nih.gov)|
        | **UniProt** | [http://uniprot.org/](http://uniprot.org/)|
        | **RCSB PDB** | [https://www.rcsb.org](https://www.rcsb.org)|
        | **ENSEMBL** |[http://ensembl.org/](http://ensembl.org/)|
        | **ChEMBL** |[https://www.ebi.ac.uk/chembl/](https://www.ebi.ac.uk/chembl/)|
        | **AlphaFold Protein Structure Database** |[https://alphafold.ebi.ac.uk](https://alphafold.ebi.ac.uk)|
        | **EMBL-EBI** |[https://www.ebi.ac.uk](https://www.ebi.ac.uk)|
        | There are many more | |

    - You can also make your own database(s), for privacy
        - Can be as simple as a CSV file, an Excel sheet, or
        - Your own database for [BLAST](https://blast.ncbi.nlm.nih.gov/Blast.cgi) 
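
A private database does not need special infrastructure. The sketch below uses Python's built-in `sqlite3` module to hold a small table of sequences in memory and query it. The table layout, function names, accessions and organisms are all made up for illustration.

```python
import sqlite3


def build_toy_database(rows):
    """Create an in-memory SQLite database holding one table of sequences."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sequences ("
        "accession TEXT PRIMARY KEY, organism TEXT, sequence TEXT)"
    )
    conn.executemany("INSERT INTO sequences VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def sequence_lengths(conn, organism):
    """Return (accession, sequence length) pairs for one organism."""
    query = (
        "SELECT accession, length(sequence) FROM sequences "
        "WHERE organism = ? ORDER BY accession"
    )
    return conn.execute(query, (organism,)).fetchall()


# Made-up records, for illustration only
toy_rows = [
    ("TOY001", "Organism A", "ATGCGT"),
    ("TOY002", "Organism B", "GGGCCCAA"),
    ("TOY003", "Organism A", "ATGC"),
]

connection = build_toy_database(toy_rows)
print(sequence_lengths(connection, "Organism A"))
```

Replacing `":memory:"` with a file path keeps the data on disk between sessions, which suits the archival and re-use goals listed above.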

---