## A survey of iodine transporters, to identify candidates for recombination into a microbe

Procedure:
- scrape human and mouse [Na-I symporters](https://en.wikipedia.org/wiki/Sodium/iodide_cotransporter)
- verify that they share conserved sequences
- scrape other human and mouse [solute carrier membrane transport proteins](https://en.wikipedia.org/wiki/Solute_carrier_family), to get a sense of which active domains might make the Na-I symporter specific for iodine.
- loop:
  - identify peptide sequences of interest.
  - search databases for other matching domains.
  - if results are satisfactory and the crucial domains are clear, exit loop.
  - else, expand the search to other species across the phylogenetic tree (moving in the direction of microbes and kelp) and use literature research to find other proteins across life that are specific to iodine.
  - identify candidate species proteins, and find sequencing data.
- end with a protein of interest that is a likely candidate for being a pump that's viable for 
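
The "verify that they have conserved sequences" step can be sketched as a per-column identity score. This is a minimal sketch, assuming the sequences have already been aligned to equal length with `-` marking gaps; the function name `column_conservation` and the toy peptides are hypothetical and not taken from the survey data.

```python
from collections import Counter

def column_conservation(aligned):
    """Fraction of sequences sharing the most common residue in each column."""
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    scores = []
    for column in zip(*aligned):
        residues = [r for r in column if r != "-"]  # gaps never count as conserved
        top = Counter(residues).most_common(1)[0][1] if residues else 0
        scores.append(top / len(column))
    return scores

# Toy, made-up aligned peptides (not real NIS sequences).
toy = ["MEAV-TG", "MEAVETG", "MDAVETG"]
scores = column_conservation(toy)
conserved = [i for i, s in enumerate(scores) if s == 1.0]
print(conserved)
```

Columns that score 1.0 across human and mouse would be the first places to look for candidate functional domains; a proper multiple sequence alignment tool would be needed to produce the aligned input.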

General links:
- [Capnocytophaga canimorsus](https://en.wikipedia.org/wiki/Capnocytophaga_canimorsus)
- [Na-I Symporter](https://en.wikipedia.org/wiki/Sodium/iodide_cotransporter)
- [Solute Carrier Family](https://en.wikipedia.org/wiki/Solute_carrier_family)
- [Iodine in biology](https://en.wikipedia.org/wiki/Iodine_in_biology)
- [Iodine](https://en.wikipedia.org/wiki/Iodine#Biological_role)

In [40]:
# Run this code inside a Jupyter cell. If lines 1 and 3 point to your .venv folder in WSL, the setup is correct.
# In this project (iodine transporter survey), the venv lives in ../comparative-sequence-analysis,
# in case it needs to be located manually.

import sys, os
print(f"1. Executable: {sys.executable}") 
print(f"2. Version: {sys.version}")
print(f"3. Env Path: {os.getenv('VIRTUAL_ENV')}")

1. Executable: /home/morgan/projects/bioinfo-projects/comparative-sequence-analysis/.venv/bin/python
2. Version: 3.11.2 (main, Apr 28 2025, 14:11:48) [GCC 12.2.0]
3. Env Path: /home/morgan/projects/bioinfo-projects/comparative-sequence-analysis/.venv


### SLC5A5 / NIS / TDH1 - solute carrier family 5 member 5 - gene id 6528

https://www.ncbi.nlm.nih.gov/datasets/gene/6528/
- RefSeq summary: This gene encodes a member of the sodium glucose cotransporter family. The encoded protein is responsible for the uptake of iodine in tissues such as the thyroid and lactating breast tissue. The iodine taken up by the thyroid is incorporated into the metabolic regulators triiodothyronine (T3) and tetraiodothyronine (T4). Mutations in this gene are associated with thyroid dyshormonogenesis 1. [provided by RefSeq, Sep 2009]
- 8 proteins, 8 transcripts, downloaded as ncbi_dataset(1).zip from https://www.ncbi.nlm.nih.gov/datasets/gene/6528/#transcripts-and-proteins

https://www.ncbi.nlm.nih.gov/gene?cmd=retrieve&dopt=default&rn=1&list_uids=6528

https://www.ncbi.nlm.nih.gov/protein/KAI4041341.1 (partial) sequence.gp
https://www.ncbi.nlm.nih.gov/protein/KAI2589675.1 (partial?) sequence(1).gp

https://www.uniprot.org/uniprot/Q92911

In [None]:
from Bio import SeqIO

# Collect every human SLC5A5 protein record: the NCBI dataset download
# plus the two separately downloaded sequence files.
fasta_paths = [
    'data/SLC5A5-h-sapiens/ncbi_dataset/data/protein.faa',
    'data/SLC5A5-h-sapiens/sequence.fasta',
    'data/SLC5A5-h-sapiens/sequence(1).fasta',
]
sequences = []
for path in fasta_paths:
    sequences.extend(SeqIO.parse(path, 'fasta'))
sequences  # display the loaded records


[SeqRecord(seq=Seq('MEAVETGERPTFGAWDYGVFALMLLVSTGIGLWVGLARGGQRSAEDFFTGGRRL...TNL'), id='NP_000444.1', name='NP_000444.1', description='NP_000444.1 SLC5A5 [organism=Homo sapiens] [GeneID=6528] [isoform=1]', dbxrefs=[]),
 SeqRecord(seq=Seq('MCLGQLLNSVLTALLFMPVFYRLGLTSTYEYLEMRFSRAVRLCGTLQYIVATML...TNL'), id='NP_001427636.1', name='NP_001427636.1', description='NP_001427636.1 SLC5A5 [organism=Homo sapiens] [GeneID=6528] [isoform=2]', dbxrefs=[]),
 SeqRecord(seq=Seq('MEAVETGERPTFGAWDYGVFALMLLVSTGIGLWVGLARGGQRSAEDFFTGGRRL...TNL'), id='XP_011526494.1', name='XP_011526494.1', description='XP_011526494.1 SLC5A5 [organism=Homo sapiens] [GeneID=6528] [isoform=X1]', dbxrefs=[]),
 SeqRecord(seq=Seq('MCLGQLLNSVLTALLFMPVFYRLGLTSTYEYLEMRFSRAVRLCGTLQYIVATML...TNL'), id='XP_011526495.1', name='XP_011526495.1', description='XP_011526495.1 SLC5A5 [organism=Homo sapiens] [GeneID=6528] [isoform=X2]', dbxrefs=[]),
 SeqRecord(seq=Seq('MRFSRAVRLCGTLQYIVATMLYTGIVIYAPALILNQVTGLDIWASLLSTGIICT...TNL'), id='XP_0115
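
Several records in the output above begin with the same residues, so it may be worth checking for exact duplicates before aligning. The following is a minimal sketch using plain `(id, sequence)` tuples as stand-ins for the `SeqRecord` objects; the helper name `group_identical` and the toy ids are hypothetical.

```python
from collections import defaultdict

def group_identical(records):
    """Map each distinct sequence to the list of record ids that carry it."""
    groups = defaultdict(list)
    for rec_id, seq in records:
        groups[seq].append(rec_id)
    return dict(groups)

# Hypothetical stand-ins; with the Biopython records loaded above, use
# [(r.id, str(r.seq)) for r in sequences] instead.
toy_records = [("iso1", "MEAVETG"), ("pred_X1", "MEAVETG"), ("iso2", "MCLGQLL")]
groups = group_identical(toy_records)
for seq, ids in groups.items():
    print(len(seq), ids)
```

Any group with more than one id is a set of identical proteins, which would show up as identical rows and columns in a pairwise score matrix.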

In [111]:

from Bio import Align
import numpy as np

# Upper-triangular matrix of pairwise alignment scores; entries never filled stay at inf.
arr = np.ones([len(sequences), len(sequences)]) * np.inf

# Local mode worked better than global or FOGSAA for these long sequences.
aligner = Align.PairwiseAligner(mode='local')
all_aligns = {}
# Note: si1 stops one short of the last index, so the last sequence's
# self-alignment is skipped and arr[-1, -1] stays at inf.
for si1 in range(len(sequences) - 1):
    for si2 in range(si1, len(sequences)):
        alignments = aligner.align(sequences[si1], sequences[si2])
        all_aligns[(si1, si2)] = alignments[0]  # keep only the first optimal alignment
        arr[si1, si2] = alignments[0].score
arr

array([[643., 554., 632., 543., 510., 632., 543., 510., 643., 643.],
       [ inf, 554., 543., 543., 510., 543., 543., 510., 554., 554.],
       [ inf,  inf, 654., 565., 532., 654., 565., 532., 632., 632.],
       [ inf,  inf,  inf, 565., 532., 565., 565., 532., 543., 543.],
       [ inf,  inf,  inf,  inf, 532., 532., 532., 532., 510., 510.],
       [ inf,  inf,  inf,  inf,  inf, 654., 565., 532., 632., 632.],
       [ inf,  inf,  inf,  inf,  inf,  inf, 565., 532., 543., 543.],
       [ inf,  inf,  inf,  inf,  inf,  inf,  inf, 532., 510., 510.],
       [ inf,  inf,  inf,  inf,  inf,  inf,  inf,  inf, 643., 643.],
       [ inf,  inf,  inf,  inf,  inf,  inf,  inf,  inf,  inf,  inf]])
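
One way to read these scores: if the aligner is scoring with match = 1, mismatch = 0 and no gap penalties, then each score is simply the length of the longest common subsequence (LCS) of the two proteins. That scoring is an assumption about the installed Biopython version's settings, and `print(aligner)` shows the values actually in use. The sketch below computes the LCS length in plain Python on made-up peptides; `lcs_length` is a hypothetical helper, not part of Biopython.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of a and b (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

# Toy, made-up peptides: the second lacks the "GER" chunk of the first.
full = "MEAVETGERPTFGAW"
short = "MEAVETPTFGAW"
print(lcs_length(full, full), lcs_length(full, short))
```

Under that assumption, a diagonal entry is the sequence's own length, and an off-diagonal entry equal to the shorter sequence's diagonal value (for example 554 at row 0, column 1, matching the 554 at row 1, column 1) means the shorter protein is contained in the longer one as a subsequence.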

In [112]:
print(arr-np.min(arr))

[[133.  44. 122.  33.   0. 122.  33.   0. 133. 133.]
 [ inf  44.  33.  33.   0.  33.  33.   0.  44.  44.]
 [ inf  inf 144.  55.  22. 144.  55.  22. 122. 122.]
 [ inf  inf  inf  55.  22.  55.  55.  22.  33.  33.]
 [ inf  inf  inf  inf  22.  22.  22.  22.   0.   0.]
 [ inf  inf  inf  inf  inf 144.  55.  22. 122. 122.]
 [ inf  inf  inf  inf  inf  inf  55.  22.  33.  33.]
 [ inf  inf  inf  inf  inf  inf  inf  22.   0.   0.]
 [ inf  inf  inf  inf  inf  inf  inf  inf 133. 133.]
 [ inf  inf  inf  inf  inf  inf  inf  inf  inf  inf]]


By this analysis, the sequences seem to differ only by small excised chunks, which might even be an artifact of sequencing. There are no small mutations that could indicate differences in function, which makes sense among individuals of the same species. TODO: look at mouse and other species data.
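
The "excised chunks versus small mutations" reading can be checked directly by listing the blocks where two sequences differ. This is a minimal sketch using the standard library's `difflib`, which is a text-diff heuristic and not a biological alignment; `differing_blocks` is a hypothetical helper and the peptides are made up.

```python
from difflib import SequenceMatcher

def differing_blocks(a, b):
    """Return (tag, a_start, a_end, b_start, b_end) for every non-matching block."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return [op for op in matcher.get_opcodes() if op[0] != "equal"]

ref = "MEAVETGERPTFGAW"
variant = "MEAVETPTFGAW"
for tag, i1, i2, j1, j2 in differing_blocks(ref, variant):
    print(tag, repr(ref[i1:i2]), "->", repr(variant[j1:j2]))
```

A handful of long `delete` or `insert` blocks would support the excised-chunk interpretation, while many short `replace` blocks would point to substitutions instead. On real records, pass `str(record.seq)` for each sequence.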