# Example Using emergenet.domseq
- Computes the current season's dominant strain and predicts the next season's dominant strain using Emergenet
- Analyzes northern hemisphere H1N1 sequences from the 2021-2022 season

In [27]:
%%capture
!pip install emergenet --upgrade

In [3]:
import pandas as pd
from emergenet.domseq import DomSeq, save_model, load_model

DATA_DIR = 'example_data/domseq/'

In [4]:
# initialize the DomSeq
domseq = DomSeq(seq_trunc_length=565, random_state=42)

## Dominant Strains 2021-2022 Season
`DomSeq` clusters the strain population with MeanShift on the precomputed Levenshtein distance matrix, then takes the Levenshtein centroid of each cluster as that cluster's dominant strain.

Levenshtein Centroid: $$\widehat{x}_i^{\text{dom}} = \arg\min_{x\in P_i^t} \sum_{y \in P_i^t} \theta(x,y)$$
- Where $P_i^t$ is a cluster of the strain population at time $t$
- $\bigcup P_i^t = P^t$, the total population at time $t$
- $\theta(x,y)$ is the edit distance (Levenshtein) between $x$ and $y$
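
To make the centroid definition concrete, here is a minimal pure-Python sketch that picks the Levenshtein centroid of one cluster. It is not the `emergenet` implementation: the helper names `levenshtein` and `levenshtein_centroid` and the toy sequences are hypothetical, and the cluster is assumed to be given already (the MeanShift step is not shown).

```python
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit cost for insertion, deletion, and substitution."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def levenshtein_centroid(cluster: Sequence[str]) -> str:
    """Return the cluster member with the smallest total edit distance to all members."""
    return min(cluster, key=lambda x: sum(levenshtein(x, y) for y in cluster))


# Toy cluster of short, made-up sequences (illustration only, not real HA data)
toy_cluster = ["MKAILVV", "MKAILVA", "MKAVLVV", "MEAKLFV"]
centroid = levenshtein_centroid(toy_cluster)
print(centroid)  # MKAILVV
```

The brute-force search is quadratic in cluster size, which is why a precomputed distance matrix helps on real data: with the matrix in hand, the centroid is just the row with the smallest sum.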

In [17]:
# load data from current time period
df = pd.read_csv(DATA_DIR+'north_h1n1_21_22.csv')

print('Number of sequences:', len(df))
df.head()

Number of sequences: 735


Unnamed: 0,acc,name,date,sequence,acc_na,sequence_na
0,EPI1877640,A/Togo/0146/2021,2021-02-15,MKAILVVLLYTFTTANADTLCIGYHANNSTDTVDTVLEKNVTVTHS...,EPI1877636,MNPNQKIITIGSICMTIGMANLILQIGNIISIWVSHSIQIGNQSQI...
1,EPI1877399,A/Togo/0093/2021,2021-02-15,MKAILVVLLYTFTTANADTLCIGYHANNSTDTVDTVLEKNVTVTHS...,EPI1877398,MNPNQKIITIGSICMTIGMANLILQIGNIISIWVSHSIQIGNQSQI...
2,EPI1877567,A/Togo/0094/2021,2021-02-15,MKAILVVLLYTFTTANADTLCIGYHANNSTDTVDTVLEKNVTVTHS...,EPI1877566,MNPNQKIITIGSICMTIGMANLILQIGNIISIWVSHSIQIGNQSQI...
3,EPI1877448,A/Togo/0079/2021,2021-02-15,MKAILVVLLYTFTTANADTLCIGYHANNSTDTVDTVLEKNVTVTHS...,EPI1877447,MNPNQKIITIGSICMTIGMANLILQIGNIISIWVSHSIQIGNQSQI...
4,EPI1877427,A/Togo/0169/2021,2021-02-16,MKAILVVLLYTFTTANADTLCIGYHANNSTDTVDTVLEKNVTVTHS...,EPI1877426,MNPNQKIITIGSICMAIGTANLILQIGNIISIWVSHSIQIGNQSQI...


In [18]:
%%time
# compute dominant sequences
dom_seqs = domseq.compute_domseq(seq_df=df)
dom_seqs.to_csv(DATA_DIR + 'dom_seqs_21_22.csv', index=False)
dom_seqs

CPU times: user 1min 15s, sys: 3min 14s, total: 4min 30s
Wall time: 14.7 s


Unnamed: 0,acc,name,date,sequence,acc_na,sequence_na,cluster_size
15,EPI1877316,A/Togo/0172/2021,2021-02-18,MKAILVVLLYTFTTANADTLCIGYHANNSTDTVDTVLEKNVTVTHS...,EPI1877315,MNPNQKIITIGSICMAIGTANLILQIGNIISIWVSHSIQIGNQSQI...,531
197,EPI1925191,A/Bangladesh/9004/2021,2021-09-08,MKAILVVMLYTFTTANADTLCIGYHANNSTDTVDTVLEKNVTVTHS...,EPI1925190,MNPNQKIITIGSICMTIGTANLILQIGNIISIWVSHSIQIGNQSQI...,193
165,EPI1927635,A/Wisconsin/04/2021,2021-07-29,MKAVLVVLLYTVTNANADTLCIGYHANNSTDTVDTVLEKNVTVTHS...,EPI1927637,MNTNQRIITIGTVCLIVGIISLLLQIGNIVSLWVSHSIQTKWENHT...,2
89,EPI1868537,A/Mecklenburg-Vorpommern/1/2021,2021-04-19,MEAKLFVLFCVFTALKADTICVGYHANNSTDTVDTIMEKNVTVTHS...,EPI1868536,MNPNQKIIIISSICMTNGIASLILQIGNIISIWISHSIQIGNQNQT...,1
60,EPI1921238,A/Gansu-Xifeng/1194/2021,2021-03-10,MKARLFILFCAFTALKADTICVGYHANNSTDTVDTILEKNVTVTHS...,EPI1921237,MNPNQKIITISSICMTIGIASLILQIGNIISIWISHSIQTKKQNQS...,1
82,EPI1868840,A/Wisconsin/03/2021,2021-04-01,MKAVLVVLLYTFTTANADTLCIGYHANNSTDTVDTVLEKNVTVTHS...,EPI1868839,MNTNQRIITIGTVCLIVGIISLLLQIGNIVSLWVSHSIQTRWENHT...,1
516,EPI1958412,A/Denmark/36/2021,2021-11-24,MKAILVALLYTFATANADTLCIGYHANNSTDTVDTILEKNVTVTHS...,EPI1958414,MNPNQKIITIGLICMTSGIASLILQIGNMILMWTSHSIQNENQTQS...,1
372,EPI1946030,A/SouthAfrica/PET20744/2021,2021-10-14,MKAILVVLLYTFTTANADTLCIGYHANNSTDTVDTVLEKNVTVTHS...,EPI1946032,MNPNQKIITIGSICMAIGTANLILQIGNIISIWVSHSIQIGNQSQI...,1
24,EPI1921246,A/Gansu-Xifeng/1143/2021,2021-02-22,MKARLFILFCAFTALKADTICVGYHANNSTDTVDTILEKNVTVTHS...,EPI1921245,MNPNQKIITISSICMTIGIASLILQIGNIISIWISHSIQTKKQNQS...,1
9,EPI1853135,A/Parana/10835/2021,2021-02-16,MKAILVVLLYTFATTNADTLCIGYHANNSTDTVDTVLEKNVTVTHS...,EPI1853137,MNPNQKIITIGSICMTIGMVNLTLQIGNIISIWISHSIQLRNQSQI...,1


## Predictions for Dominant Strain 2022-2023 Season
E-Centroid: $$x_{*}^{t+\delta} = \arg\min_{y\in \bigcup_{\tau\leq t}H^{\tau}} \sum_{x \in H^t} \theta(x,y) - |H^t| A \ln\omega_y$$
- $x_{*}^{t+\delta}$ is the predicted dominant strain for the upcoming flu season, at time $t+\delta$
- $H^t$ is the sequence population at time $t$
- $\theta(x,y)$ is the e-distance between $x$ and $y$ in their respective Enets
- $A = \frac{1-\alpha}{\sqrt{8}N^2}$, where $\alpha$ is a fixed significance level and $N$ is the sequence length considered
- $\omega_y$ is the membership degree of sequence $y$; its natural log $\ln\omega_y$ enters the objective
- We do this computation for each cluster (clusters defined by e-distance matrix)
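
The E-Centroid objective above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the `emergenet` implementation: the function `e_centroid`, the placeholder distance `toy_theta` (a simple mismatch fraction standing in for the e-distance), the membership degrees in `toy_omega`, and the value `alpha=0.05` are all hypothetical.

```python
import math
from typing import Callable, Mapping, Sequence


def e_centroid(
    population: Sequence[str],
    candidates: Sequence[str],
    theta: Callable[[str, str], float],
    omega: Mapping[str, float],
    alpha: float,
    seq_length: int,
) -> str:
    """Return the candidate y minimizing sum_x theta(x, y) - |H| * A * ln(omega_y)."""
    a_const = (1 - alpha) / (math.sqrt(8) * seq_length ** 2)

    def objective(y: str) -> float:
        first_term = sum(theta(x, y) for x in population)
        second_term = len(population) * a_const * math.log(omega[y])
        return first_term - second_term

    return min(candidates, key=objective)


def toy_theta(x: str, y: str) -> float:
    """Placeholder distance: fraction of mismatched positions (NOT the e-distance)."""
    return sum(cx != cy for cx, cy in zip(x, y)) / len(x)


# Made-up population, candidate pool, and membership degrees (illustration only)
toy_population = ["AAAA", "AAAB", "AABB"]
toy_candidates = ["AAAA", "AAAB", "BBBB"]
toy_omega = {"AAAA": 0.9, "AAAB": 0.8, "BBBB": 0.1}

best = e_centroid(toy_population, toy_candidates, toy_theta, toy_omega,
                  alpha=0.05, seq_length=4)
print(best)  # AAAB
```

Because $\ln\omega_y \leq 0$ for membership degrees in $(0, 1]$, the second term acts as a penalty on candidates with low membership degree; with $A$ scaling as $1/N^2$, the distance term typically dominates.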

In [20]:
%%time
# train enet
enet = domseq.train(seq_df=df, sample_size=3000, n_jobs=1)
# save enet
save_model(enet=enet, outfile=DATA_DIR+'enet.joblib')

CPU times: user 4min 39s, sys: 1.89 s, total: 4min 41s
Wall time: 4min 50s


In [23]:
# load enet
enet_model = load_model(filepath=DATA_DIR+'enet.joblib')

In [6]:
# load candidate sequences for recommendation
candidate_df = pd.read_csv(DATA_DIR+'north_h1n1_21_22_pred.csv')

print('Number of sequences:', len(candidate_df))
candidate_df.head()

Number of sequences: 18057


Unnamed: 0,acc,name,date,sequence,acc_na,sequence_na
0,AJK00834.1,A/Memphis/1/2001,2001-09-25,MKAKLLVLLCTFTATYADTICIGYHANNSTDTVDTVLEKNVTVTHS...,ABO38024.1,MNPNQKIITIGSISIAIGIISLMLQIGNIISIWASHSIQTGSQNHT...
1,AJK02489.1,A/Memphis/7/2001,2001-09-25,MKAKLLVLLCTFTATYADTICIGYHANNSTDTVDTVLEKNVTVTHS...,ABO32962.1,MNPNQKIITIGSISIAIGIISLMLQIGNIISIWASHSIQTGSKNHT...
2,AJK02965.1,A/Memphis/8/2001,2001-09-25,MKAKLLVLLCTFTATYADTICIGYHANNSTDTVDTVLEKNVTVTHS...,ABN51080.1,MNPNQKIITIGSISIAIGIISLMLQIGNIISIWASHSIQTGSQNHT...
3,AJK03129.1,A/Memphis/6/2001,2001-09-25,MRAKLLVLLCTFTATYADTICIGYHANNSTDTVDTVLEKNVTVTHS...,ABO32951.1,MNPNQKIITIGSISIAIGIISLMLQIGNIISIWASHSIQTGSQNHT...
4,AFQ90527.1,A/Chile/8885/2001,2001-09-25,MKAKLLVLLCTFTATYADTICIGYHANNSTDTVDTVLEKNVTVTHS...,AFO66161.1,MNPNQKIITIGSISIAIGIISLMLQIGNIISIWASHSIQTGSQNHT...


In [25]:
%%time
# compute prediction sequences (return predictions from top 3 largest clusters)
pred_df = domseq.predict_domseq(
    seq_df=df,
    pred_seq_df=candidate_df,
    enet=enet_model,
    n_clusters=3,
    sample_size=3000,
)
pred_df = pred_df.sort_values(by=['cluster_size'], ascending=False)
pred_df.to_csv(DATA_DIR+'predictions_for_22_23.csv', index=False)
pred_df

symmetric case
CPU times: user 3h 10min 3s, sys: 2min 28s, total: 3h 12min 32s
Wall time: 1h 3min 28s


Unnamed: 0,acc,name,date,sequence,acc_na,sequence_na,first_term,second_term,sum,cluster_size
17528,EPI1716610,A/Netherlands/00475/2020,2020-03-07,MKAILVVLLYTFTTANADTLCIGYHANNSTDTVDTVLEKNVTVTHS...,EPI1716609,MNPNQKIITIGSICMTIGMANLILQIGNIISIWVSHSIQTGNQSQI...,0.196503,-0.001843,0.198347,527
17834,EPI1925191,A/Bangladesh/9004/2021,2021-09-08,MKAILVVMLYTFTTANADTLCIGYHANNSTDTVDTVLEKNVTVTHS...,EPI1925190,MNPNQKIITIGSICMTIGTANLILQIGNIISIWVSHSIQIGNQSQI...,0.074637,-0.004569,0.079207,193
17824,EPI1918841,A/North_Dakota/12226/2021,2021-09-01,MKAILVVLLYTFTTANADTLCIGYHANNSTDTVDTVLEKNVTVTHS...,EPI1918840,MNPNQKIITIGSICMTIGMANLILQIGNIISIWVSHSIQIGNQNQI...,0.011633,-0.000199,0.011833,3
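
As a quick sanity check on the output columns, the sketch below confirms that `sum` equals `first_term - second_term` for the three rows shown above, which matches the sign convention of the E-Centroid objective. Reading `first_term` as the distance sum and `second_term` as the $|H^t| A \ln\omega_y$ term is an assumption based on the column names, not something the output states.

```python
import math

# (first_term, second_term, sum) copied from the prediction table above
rows = [
    (0.196503, -0.001843, 0.198347),
    (0.074637, -0.004569, 0.079207),
    (0.011633, -0.000199, 0.011833),
]

# The displayed values are rounded to six decimals, so allow a small tolerance.
checks = [
    math.isclose(first_term - second_term, total, abs_tol=2e-6)
    for first_term, second_term, total in rows
]
print(all(checks))  # True
```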
