# Example Using emergenet.domseq
- Computes the dominant strain of the current season and predicts the dominant strain of the next season using Emergenet
- Analyzes the H1N1 HA segment for the 2021-2022 season, northern hemisphere
- These predictions reproduce results from the paper; see `emergenet/qnet_predictions/dominant_sequences_2021_2022.ipynb` and `emergenet/qnet_predictions/influenza_qnet_predictions_2022_2023.ipynb`
- Data sources:
    - GISAID: https://platform.epicov.org/epi3/cfrontend#586f5f

In [27]:
%%capture
!pip install emergenet --upgrade

In [1]:
from emergenet.domseq import DomSeq, save_model, load_model

DATA_DIR = 'example_data/domseq/'

In [2]:
# initialize the DomSeq
domseq = DomSeq(seq_trunc_length=566, random_state=42)

## Dominant Strain 2021-2022 Season
Levenshtein Centroid: $$\widehat{x}^{dom} = \operatorname*{argmin}_{x \in P^t} \sum_{y \in P^t} \theta(x,y)$$
- $P^t$ is the sequence population at time $t$
- $\theta(x,y)$ is the edit (Levenshtein) distance between $x$ and $y$
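
The centroid definition above can be illustrated on a toy population. This is a minimal sketch, not the code behind `compute_domseq`: the helper names `levenshtein` and `levenshtein_centroid` and the five short sequences are made up for illustration, and the brute-force loop compares every pair, so it only suits small inputs.

```python
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance via dynamic programming (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def levenshtein_centroid(population: Sequence[str]) -> int:
    """Index of the sequence with the smallest summed distance to the population."""
    totals = [sum(levenshtein(x, y) for y in population) for x in population]
    return min(range(len(population)), key=totals.__getitem__)


# Hypothetical toy population (not real HA sequences).
toy_population = ["MKAIL", "MKAIV", "MKTIL", "MKAIL", "QKAIV"]
centroid_index = levenshtein_centroid(toy_population)
print(centroid_index, toy_population[centroid_index])  # prints: 0 MKAIL
```

Ties are broken by taking the first minimizer, which is an arbitrary choice made for this sketch.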

In [13]:
# load fasta data
df = domseq.load_data(filepath=DATA_DIR+'north_h1n1_ha_21.fasta')

print('Number of sequences:', len(df))
df.head()

Number of sequences: 976


,id,sequence
0,A/Ile_de_France/52803/2021|A_/_H1N1|$SEGMENT_N...,MKAILVVLLYTFTTANADTLCIGYHANNSTDTVDTVLEKNVTVTHS...
1,A/Ile_de_France/50420/2021|A_/_H1N1|$SEGMENT_N...,MKAILVVLLYTFTTANADTLCIGYHANNSTDTVDTVLEKNVTVTHS...
2,A/DIJON/48658/2021|A_/_H1N1|$SEGMENT_NAME|2021...,MKAILVVLLYTFTTANADTLCIGYHANNSTDTVDTVLEKNVTVTHS...
3,A/SAINT-DENIS/48408/2021|A_/_H1N1|$SEGMENT_NAM...,MKAILVVLLYTFTTANADTLCIGYHANNSTDTVDTVLEKNVTVTHS...
4,A/TOURS/37554/2021|A_/_H1N1|$SEGMENT_NAME|2021...,MKAILVVLLYTFTTANADTLCIGYHANNSTDTVDTVLEKNVTVTHS...
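
For readers unfamiliar with FASTA, the sketch below shows how records could be turned into the `id` and `sequence` pairs seen in the table above. It is an assumption-laden illustration, not the implementation of `load_data`: the function `parse_fasta`, its `trunc_length` argument (mirroring the role `seq_trunc_length` presumably plays), and the toy records are all hypothetical.

```python
from typing import Iterable, List, Tuple


def parse_fasta(lines: Iterable[str], trunc_length: int) -> List[Tuple[str, str]]:
    """Collect (id, sequence) pairs, keeping the first `trunc_length` residues."""
    records: List[Tuple[str, str]] = []
    header = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)[:trunc_length]))
            header, chunks = line[1:], []
        else:
            chunks.append(line)  # sequences may wrap over several lines
    if header is not None:
        records.append((header, "".join(chunks)[:trunc_length]))
    return records


# Hypothetical two-record FASTA text.
toy_fasta = """>strain_a|2021-01-01
MKAILVV
LLYTF
>strain_b|2021-02-01
MKAILVVLL
""".splitlines()

records = parse_fasta(toy_fasta, trunc_length=10)
for record_id, sequence in records:
    print(record_id, sequence)
```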


In [20]:
%%time
# compute dominant sequence
dom_id, dom_seq = domseq.compute_domseq(seq_df=df, sample_size=1000)

print('Name:', dom_id)
print()
print('Sequence:', dom_seq)

Name: A/Ireland/20935/2022|A_/_H1N1|$SEGMENT_NAME|2022-04-10|EPI2069085|

Sequence: MKAILVVLLYTFTTANADTLCIGYHANNSTDTVDTVLEKNVTVTHSVNLLEDKHNGKLCKLRGVAPLHLGKCNIAGWILGNPECESLSTARSWSYIVETSNSDNGTCYPGDFINYEELREQLSSVSSFERFEIFPKTSSWPNHDSDKGVTAACPHAGAKSFYKNLIWLVKKGNSYPKLNQTYINDKGKEVLVLWGIHHPPTIAAQESLYQNADAYVFVGTSRYSKKFKPEIATRPKVRDQEGRMNYYWTLVEPGDKITFEATGNLVVPRYAFTMERDAGSGIIISDTPVHDCNTTCQTPEGAINTSLPFQNVHPITIGKCPKYVKSTKLRLATGLRNVPSIQSRGLFGAIAGFIEGGWTGMVDGWYGYHHQNEQGSGYAADLKSTQNAIDKITNKVNSVIEKMNTQFTAVGKEFNHLEKRIENLNKKVDDGFLDIWTYNAELLVLLENERTLDYHDSNVKNLYEKVRNQLKNNAKEIGNGCFEFYHKCDNTCMESVKNGTYDYPKYSEEAKLNREKIDGVKLESTRIYQILAIYSTVASSLVLVVSLGAISFWMCSNGSLQCRICI
CPU times: user 5min 29s, sys: 181 ms, total: 5min 29s
Wall time: 5min 37s


## Prediction for Dominant Strain 2022-2023 Season
E-Centroid: $$x_{*}^{t+\delta} = \operatorname*{argmin}_{y \in \bigcup_{\tau \leq t} H^{\tau}} \sum_{x \in H^t} \theta(x,y) - |H^t| A \ln \omega_y$$
- $x_{*}^{t+\delta}$ is the dominant strain in the upcoming flu season at time $t+\delta$
- $H^t$ is the sequence population at time $t$
- $\theta(x,y)$ is the q-distance between $x$ and $y$ in their respective Qnets
- $A = \frac{1-\alpha}{\sqrt{8}N^2}$, where $\alpha$ is a fixed significance level and $N$ is the sequence length considered
- $\ln \omega_y$ is the membership degree of sequence $y$
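
The objective above can be sketched in a few lines once a distance and a membership function are available. The code below is only an illustration of the formula, not the internals of `predict_domseq`: the function `e_centroid`, the normalized Hamming distance standing in for the q-distance, and the constant log-membership value are all assumptions made for this toy example.

```python
import math
from typing import Callable, Sequence


def e_centroid(candidates: Sequence[str],
               current_population: Sequence[str],
               distance: Callable[[str, str], float],
               log_membership: Callable[[str], float],
               alpha: float,
               seq_length: int) -> str:
    """Return the candidate y minimizing sum_x theta(x, y) - |H^t| * A * ln(omega_y)."""
    a_const = (1 - alpha) / (math.sqrt(8) * seq_length ** 2)

    def objective(y: str) -> float:
        total_distance = sum(distance(x, y) for x in current_population)
        return total_distance - len(current_population) * a_const * log_membership(y)

    return min(candidates, key=objective)


def normalized_hamming(x: str, y: str) -> float:
    """Stand-in for the q-distance (assumption): fraction of mismatched positions."""
    return sum(a != b for a, b in zip(x, y)) / len(x)


# Hypothetical toy data; here the candidate pool equals the current population.
population = ["AAAA", "AAAT", "AATT"]
prediction = e_centroid(
    candidates=population,
    current_population=population,
    distance=normalized_hamming,
    log_membership=lambda y: -1.0,  # assumed constant log-membership
    alpha=0.05,
    seq_length=4,
)
print(prediction)  # prints: AAAT
```

With a constant membership term the result reduces to the plain distance centroid; a sequence-specific `log_membership` would shift the objective toward sequences the model considers more typical.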

In [22]:
# load fasta data
import pandas as pd
df_north = domseq.load_data(filepath=DATA_DIR+'north_h1n1_ha_21.fasta')
df_south = domseq.load_data(filepath=DATA_DIR+'south_h1n1_ha_21.fasta')
df1 = pd.concat([df_north, df_south])

print('Number of sequences:', len(df1))
df1.head()

Number of sequences: 1257


,id,sequence
0,A/Ile_de_France/52803/2021|A_/_H1N1|$SEGMENT_N...,MKAILVVLLYTFTTANADTLCIGYHANNSTDTVDTVLEKNVTVTHS...
1,A/Ile_de_France/50420/2021|A_/_H1N1|$SEGMENT_N...,MKAILVVLLYTFTTANADTLCIGYHANNSTDTVDTVLEKNVTVTHS...
2,A/DIJON/48658/2021|A_/_H1N1|$SEGMENT_NAME|2021...,MKAILVVLLYTFTTANADTLCIGYHANNSTDTVDTVLEKNVTVTHS...
3,A/SAINT-DENIS/48408/2021|A_/_H1N1|$SEGMENT_NAM...,MKAILVVLLYTFTTANADTLCIGYHANNSTDTVDTVLEKNVTVTHS...
4,A/TOURS/37554/2021|A_/_H1N1|$SEGMENT_NAME|2021...,MKAILVVLLYTFTTANADTLCIGYHANNSTDTVDTVLEKNVTVTHS...


In [24]:
%%time
# train enet
enet = domseq.train(seq_df=df_north, sample_size=1000, n_jobs=1)
# save enet
save_model(enet=enet, outfile=DATA_DIR+'enet.joblib')

CPU times: user 3min 52s, sys: 2.53 s, total: 3min 55s
Wall time: 3min 58s


In [25]:
# load enet
enet = load_model(filepath=DATA_DIR+'enet.joblib')

In [26]:
%%time
# compute prediction sequence
pred_id, pred_seq = domseq.predict_domseq(seq_df=df1, enet=enet, sample_size=1000)

print('Name:', pred_id)
print()
print('Sequence:', pred_seq)

Name: A/Netherlands/00068/2022|A_/_H1N1|$SEGMENT_NAME|2022-02-11|EPI1988870|

Sequence: MKAILVVLLYTFTTANADTLCIGYHANNSTDTVDTVLEKNVTVTHSVNLLEDKHNGKLCKLRGVAPLHLGKCNIAGWILGNPECESLSTARSWSYIVETSNSDNGTCYPGDFINYEELREQLSSVSSFERFEIFPKTSSWPNHDSDKGVTAACPHAGAKSFYKNLIWLVKKGNSYPKLNQTYINDKGKEVLVLWGIHHPPTIAAQESLYQNADAYVFVGTSRYSKKFKPEIATRPKVRDQEGRMNYYWTLVEPGDKITFEATGNLVVPRYAFTMERDAGSGIIISDTPVHDCNTTCQTPEGAINTSLPFQNVHPITIGKCPKYVKSTKLRLATGLRNVPSIQSRGLFGAIAGFIEGGWTGMVDGWYGYHHQNEQGSGYAADLKSTQNAIDKITNKVNSVIEKMNTQFTAVGKEFNHLEKRIENLNKKVDDGFLDIWTYNAELLVLLENERTLDYHDSNVKNLYEKVRNQLKNNAKEIGNGCFEFYHKCDNTCMESVKNGTYDYPKYSEEAKLNREKIDGVKLESTRTYQILAIYSTVASSLVLVVSLGAISFWMCSNGSLQCRICI
CPU times: user 26min 7s, sys: 3.25 s, total: 26min 10s
Wall time: 26min 40s
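
To see how a predicted sequence relates to the observed dominant sequence, one option is to list the positions where two equal-length sequences differ. The helper `differing_positions` below is a hypothetical utility, not part of `emergenet`; the two fragments are short excerpts of the sequences printed above, so the reported position is relative to the fragment, not to the full HA sequence.

```python
from typing import List, Tuple


def differing_positions(seq_a: str, seq_b: str) -> List[Tuple[int, str, str]]:
    """Return (1-based position, residue in seq_a, residue in seq_b) for each mismatch."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return [
        (position, res_a, res_b)
        for position, (res_a, res_b) in enumerate(zip(seq_a, seq_b), start=1)
        if res_a != res_b
    ]


# Excerpts from the printed dominant and predicted sequences.
dom_fragment = "ESTRIYQ"
pred_fragment = "ESTRTYQ"
print(differing_positions(dom_fragment, pred_fragment))  # prints: [(5, 'I', 'T')]
```

Calling `differing_positions(dom_seq, pred_seq)` on the full sequences from the cells above would give the complete list of substitutions, provided both have the same length.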
