# Notebook for demonstrating evidence matching between assayed fusions and categorical fusions

In [1]:
import warnings
from os import environ

warnings.filterwarnings("ignore")

# Connection settings for the UTA and SeqRepo databases. Adjust these values to
# match where the databases are located on your system.
environ["UTA_DB_URL"] = "postgresql://anonymous@localhost:5432/uta/uta_20241220"
environ["SEQREPO_ROOT_DIR"] = "/usr/local/share/seqrepo/2024-12-20"
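
The cell above overwrites both variables unconditionally. If the databases are already configured in the shell, a minimal sketch like the one below would only fill in values that are missing. The helper name `apply_defaults` and the `demo_env` dictionary are hypothetical and are not part of FUSOR; the default values are copied from the cell above.

```python
# Defaults copied from the configuration cell; adjust for your own setup.
DEFAULTS = {
    "UTA_DB_URL": "postgresql://anonymous@localhost:5432/uta/uta_20241220",
    "SEQREPO_ROOT_DIR": "/usr/local/share/seqrepo/2024-12-20",
}


def apply_defaults(env, defaults):
    """Set each missing key in env and return the list of keys that were set."""
    applied = []
    for key, value in defaults.items():
        if key not in env:
            env[key] = value
            applied.append(key)
    return applied


# Stand-in for os.environ with one variable already set (hypothetical value).
demo_env = {"UTA_DB_URL": "postgresql://example-host/uta"}
applied = apply_defaults(demo_env, DEFAULTS)
print(applied)
```

In the notebook itself, the same helper could be called as `apply_defaults(os.environ, DEFAULTS)` after `import os`, so that any value exported before launching Jupyter is left untouched.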

### Load FUSOR and CIViCpy modules
Run the cell below to import CIViCpy and initialize FUSOR.

In [2]:
from civicpy import civic

from fusor.fusor import FUSOR

fusor = FUSOR()

***Using Gene Database Endpoint: http://localhost:8000***


### Generate list of AssayedFusion objects from STAR-Fusion file
Run the cell below to generate a list of AssayedFusion objects from a file of STAR-Fusion output

In [3]:
# Generate AssayedFusion list from STAR-Fusion file
from pathlib import Path

from cool_seq_tool.schemas import Assembly

from fusor.harvester import StarFusionHarvester

path = Path("../../tests/fixtures/star-fusion.fusion_predictions.abridged.tsv")
harvester = StarFusionHarvester(fusor=fusor, assembly=Assembly.GRCH38.value)
fusions_list = await harvester.load_records(path)

assayed_fusion_star_fusion = [fusions_list[1]] # Use EML4::ALK fusion as testing example

Unable to get MANE Transcript data for gene: RN7SKP80
Could not find a transcript for RN7SKP80 on NC_000022.11
Unable to get MANE Transcript data for gene: RN7SKP118
Could not find a transcript for RN7SKP118 on NC_000016.10
Gene does not exist in UTA: AC021660.2
Unable to get MANE Transcript data for gene: EEF1A1P13
Could not find a transcript for EEF1A1P13 on NC_000005.10
Gene does not exist in UTA: AC098590.1
Gene does not exist in UTA: AC099789.1
Unable to get MANE Transcript data for gene: USP27X-DT
38584945 on NC_000021.9 occurs more than 150 bp outside the exon boundaries of the NM_182918.4 transcript, indicating this may not be a chimeric transcript junction and is unlikely to represent a contiguous coding sequence. Confirm that the genomic position 38584945 is being used to represent transcript junction and not DNA breakpoint.
Unable to get MANE Transcript data for gene: LINC00158
Gene does not exist in UTA: AP001341.1
Gene does not exist in UTA: AC021660.2
Gene does not exist 

### Load CIViC fusion variants
Run the cell below to load accepted fusion variants from the CIViC knowledgebase

In [4]:
# Load in accepted fusion variants
variants = civic.get_all_fusion_variants(include_status="accepted")

In [5]:
partners = ("EML4", "ALK")
for fusion in variants:
    if any(partner in fusion.vicc_compliant_name for partner in partners):
        print(fusion.vicc_compliant_name)

EML4(entrez:27436)::ALK(entrez:238)
v::ALK(entrez:238)
NPM1(entrez:4869)::ALK(entrez:238)
RANBP2(entrez:5903)::ALK(entrez:238)
CLTC(entrez:1213)::ALK(entrez:238)
STRN(entrez:6801)::ALK(entrez:238)
CAD(entrez:790)::ALK(entrez:238)
KANK4(entrez:163782)::ALK(entrez:238)
EML4(entrez:27436)::NTRK3(entrez:4916)
HIP1(entrez:3092)::ALK(entrez:238)
ENST00000318522.5(EML4):e.20::ENST00000389048.3(ALK):e.20
ENST00000318522.5(EML4):e.2::ENST00000389048.3(ALK):e.20
ENST00000318522.5(EML4):e.6::ENST00000389048.3(ALK):e.20


The output above lists every accepted categorical fusion whose name contains
EML4 or ALK. For the assayed EML4::ALK fusion, we expect a match to
EML4(entrez:27436)::ALK(entrez:238), because this categorical fusion describes the
joining of exon 13 of EML4 with exon 20 of ALK, which is also the junction in the
assayed fusion. The other EML4::ALK categorical fusions join exons that do not
match the assayed fusion. v::ALK(entrez:238) is also an expected match, because it
describes an unspecified 5' partner joined at exon 20 of the ALK transcript, which
matches the 3' side of the assayed fusion.
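
The filter in cell 5 uses `any` with substring checks, so it prints fusions that involve either partner. As a rough sketch, the names can instead be split into their 5' and 3' partners so that only fusions with EML4 on the 5' side and ALK on the 3' side are kept. The helper `partner_symbols` is hypothetical, and its parsing rules are inferred only from the names printed above, not from the full VICC nomenclature.

```python
import re

# A subset of the names printed by the cell above.
names = [
    "EML4(entrez:27436)::ALK(entrez:238)",
    "v::ALK(entrez:238)",
    "NPM1(entrez:4869)::ALK(entrez:238)",
    "EML4(entrez:27436)::NTRK3(entrez:4916)",
    "ENST00000318522.5(EML4):e.20::ENST00000389048.3(ALK):e.20",
    "ENST00000318522.5(EML4):e.2::ENST00000389048.3(ALK):e.20",
    "ENST00000318522.5(EML4):e.6::ENST00000389048.3(ALK):e.20",
]


def partner_symbols(name):
    """Return the partner symbols of a fusion name, in 5' to 3' order."""
    symbols = []
    for side in name.split("::"):
        # Gene-level form, e.g. "EML4(entrez:27436)"
        gene_level = re.fullmatch(r"([^()]+)\(entrez:\d+\)", side)
        if gene_level:
            symbols.append(gene_level.group(1))
            continue
        # Transcript-level form, e.g. "ENST00000318522.5(EML4):e.20"
        transcript_level = re.search(r"\(([^()]+)\)", side)
        # Anything else (such as "v") is kept as written.
        symbols.append(transcript_level.group(1) if transcript_level else side)
    return tuple(symbols)


eml4_alk = [n for n in names if partner_symbols(n) == ("EML4", "ALK")]
any_5prime_alk = [n for n in names if partner_symbols(n) == ("v", "ALK")]

for name in eml4_alk + any_5prime_alk:
    print(name)
```

This only narrows the list by partner name. Deciding which exon junctions match the assayed fusion is left to FusionMatcher in the next section.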

### Run FusionMatcher to gather objects containing standardized fusion knowledge
Run the cell below to use FusionMatcher to extract standardized knowledge for the fusion extracted from the STAR-Fusion file (EML4::ALK). The score for each matching CategoricalFusion is printed at the bottom of the cell.

In [6]:
# Generate list of matches, report match score
from fusor.fusion_matching import FusionMatcher
from fusor.harvester import CIVICHarvester
from fusor.config import config
from fusor.models import save_fusions_cache

# Generate categorical fusions list
harvester = CIVICHarvester(fusor=fusor, local_cache_path="civic_cache.pkl")
harvester.fusions_list = variants
civic_fusions = await harvester.load_records()

# Save cache for later
save_fusions_cache(civic_fusions, cache_dir=config.data_root, cache_name="civic_translated_fusions.pkl")

# Initialize FusionMatcher and define sources to match against
fm = FusionMatcher(assayed_fusions=assayed_fusion_star_fusion,
                   categorical_fusions=civic_fusions)

# Generate list of matching fusions
matches = await fm.match_fusion()
for matching_output in matches:
    for match in matching_output:
        print(f"Match Score: {match[1]}")



Match Score: 10
Match Score: 5
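
The loop above treats `matches` as one list per assayed fusion, where each entry is a pair whose second item is the score. Under that assumption, the sketch below keeps only pairs at or above a chosen score and orders them from highest to lowest. The helper `filter_matches`, the threshold of 6, and the placeholder strings standing in for CategoricalFusion objects are all hypothetical.

```python
def filter_matches(matches, min_score):
    """Keep (fusion, score) pairs scoring at least min_score, best first."""
    filtered = []
    for per_fusion in matches:
        kept = [pair for pair in per_fusion if pair[1] >= min_score]
        kept.sort(key=lambda pair: pair[1], reverse=True)
        filtered.append(kept)
    return filtered


# Placeholder data shaped like the output above: one assayed fusion, two matches.
demo_matches = [[("EML4::ALK placeholder", 10), ("v::ALK placeholder", 5)]]
strong = filter_matches(demo_matches, min_score=6)
print(strong)
```

The sort is included because this sketch does not assume that the matcher already returns results in score order.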


### View matching categorical fusions
Run the cells below to view the matching CategoricalFusion objects for the queried AssayedFusion object.

#### EML4::ALK

In [7]:
# Print highest quality match for EML4::ALK
matches[0][0][0].model_dump(exclude_none=True)

{'type': <FUSORTypes.CATEGORICAL_FUSION: 'CategoricalFusion'>,
 'structure': [{'type': <FUSORTypes.TRANSCRIPT_SEGMENT_ELEMENT: 'TranscriptSegmentElement'>,
   'transcript': 'refseq:NM_019063.5',
   'strand': <Strand.POSITIVE: 1>,
   'exonEnd': 13,
   'exonEndOffset': 0,
   'gene': {'conceptType': 'Gene',
    'name': 'EML4',
    'primaryCoding': {'id': 'hgnc:1316',
     'system': 'https://www.genenames.org/data/gene-symbol-report/#!/hgnc_id/',
     'code': 'HGNC:1316'}},
   'elementGenomicEnd': {'id': 'ga4gh:SL.PQzV-kfeCQ4MBmxD5mSHqZmId3I_f-Ib',
    'type': 'SequenceLocation',
    'digest': 'PQzV-kfeCQ4MBmxD5mSHqZmId3I_f-Ib',
    'sequenceReference': {'id': 'refseq:NC_000002.12',
     'type': 'SequenceReference',
     'refgetAccession': 'SQ.pnAqCRBrTsUoBghSD1yp_jXWSmlbdh4g'},
    'end': 42295516}},
  {'type': <FUSORTypes.TRANSCRIPT_SEGMENT_ELEMENT: 'TranscriptSegmentElement'>,
   'transcript': 'refseq:NM_004304.5',
   'strand': <Strand.NEGATIVE: -1>,
   'exonStart': 20,
   'exonStartOff

In [8]:
# Print second match for EML4::ALK
matches[0][1][0].model_dump(exclude_none=True)

{'type': <FUSORTypes.CATEGORICAL_FUSION: 'CategoricalFusion'>,
 'structure': [{'type': <FUSORTypes.MULTIPLE_POSSIBLE_GENES_ELEMENT: 'MultiplePossibleGenesElement'>},
  {'type': <FUSORTypes.TRANSCRIPT_SEGMENT_ELEMENT: 'TranscriptSegmentElement'>,
   'transcript': 'refseq:NM_004304.5',
   'strand': <Strand.NEGATIVE: -1>,
   'exonStart': 20,
   'exonStartOffset': 0,
   'gene': {'conceptType': 'Gene',
    'name': 'ALK',
    'primaryCoding': {'id': 'hgnc:427',
     'system': 'https://www.genenames.org/data/gene-symbol-report/#!/hgnc_id/',
     'code': 'HGNC:427'}},
   'elementGenomicStart': {'id': 'ga4gh:SL.Eu_igVd9zOahn3tFN-pyxtphUmrSlRAh',
    'type': 'SequenceLocation',
    'digest': 'Eu_igVd9zOahn3tFN-pyxtphUmrSlRAh',
    'sequenceReference': {'id': 'refseq:NC_000002.12',
     'type': 'SequenceReference',
     'refgetAccession': 'SQ.pnAqCRBrTsUoBghSD1yp_jXWSmlbdh4g'},
    'end': 29223528}}],
 'viccNomenclature': 'v::NM_004304.5(ALK):e.20',
 'civicMolecularProfiles': [<CIViC molecular_pr

### View standardized evidence for each matching CategoricalFusion object
Run the cells below to view an associated evidence item for a matching CategoricalFusion object.

#### EML4::ALK

In [9]:
# View evidence item linked to matched EML4::ALK categorical fusion
matches[0][0][0].civicMolecularProfiles[0].evidence_items[0].__dict__

{'_assertions': [<CIViC assertion 3>],
 '_therapies': [<CIViC therapy 12>],
 '_phenotypes': [],
 '_incomplete': {'phenotypes', 'therapies'},
 '_partial': False,
 'type': 'evidence',
 'id': 262,
 'variant_origin': 'SOMATIC',
 'therapy_interaction_type': None,
 'therapy_ids': [12],
 'status': 'accepted',
 'source_id': 166,
 'significance': 'SENSITIVITYRESPONSE',
 'rating': 4,
 'phenotype_ids': [],
 'name': 'EID262',
 'molecular_profile_id': 5,
 'evidence_type': 'PREDICTIVE',
 'evidence_level': 'C',
 'evidence_direction': 'SUPPORTS',
 'disease_id': 30,
 'description': 'A 28 year-old patient with non-small cell lung cancer that failed conventional therapy was found to harbor the EML4-ALK (E13;A20) fusion using reverse transcription PCR. Treatment with 250mg crizotinib twice daily resulted in rapid improvement of symptoms and disease control for 5 months.',
 'assertion_ids': [3],
 '_include_status': ['accepted', 'submitted', 'rejected']}
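
Dumping `__dict__` shows every attribute, including private bookkeeping fields. A short sketch such as the following pulls out only a few fields of interest. The helper `summarize_evidence` and the choice of fields are hypothetical; the attribute names are taken from the output above, and a `SimpleNamespace` stands in for a CIViC evidence item so the example runs without a CIViC connection.

```python
from types import SimpleNamespace

# Attribute names as they appear in the evidence item output above.
FIELDS = (
    "name",
    "evidence_type",
    "evidence_level",
    "evidence_direction",
    "significance",
    "rating",
)


def summarize_evidence(item):
    """Return a small dict of selected attributes; missing ones become None."""
    return {field: getattr(item, field, None) for field in FIELDS}


# Stand-in object populated with values from the output above.
demo_item = SimpleNamespace(
    name="EID262",
    evidence_type="PREDICTIVE",
    evidence_level="C",
    evidence_direction="SUPPORTS",
    significance="SENSITIVITYRESPONSE",
    rating=4,
    source_id=166,
)
summary = summarize_evidence(demo_item)
print(summary)
```

In the notebook, the same helper could be applied to `matches[0][0][0].civicMolecularProfiles[0].evidence_items[0]`.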

In [10]:
# View evidence item linked to the second matched categorical fusion (v::ALK)
matches[0][1][0].civicMolecularProfiles[0].evidence_items[0].__dict__

{'_assertions': [<CIViC assertion 3>],
 '_therapies': [<CIViC therapy 12>],
 '_phenotypes': [],
 '_incomplete': {'phenotypes', 'therapies'},
 '_partial': False,
 'type': 'evidence',
 'id': 1187,
 'variant_origin': 'SOMATIC',
 'therapy_interaction_type': None,
 'therapy_ids': [12],
 'status': 'accepted',
 'source_id': 819,
 'significance': 'SENSITIVITYRESPONSE',
 'rating': 5,
 'phenotype_ids': [],
 'name': 'EID1187',
 'molecular_profile_id': 495,
 'evidence_type': 'PREDICTIVE',
 'evidence_level': 'A',
 'evidence_direction': 'SUPPORTS',
 'disease_id': 8,
 'description': 'In the Phase I study PROFILE 1001 (NCT00585195), a recommended crizotinib dose of 250 mg twice daily for 28 day cycles was established. Among 1,500 advanced NSCLC patients who were screened for ALK-rearrangement using a break-apart FISH assay, 82 patients were eligible for crizotinib treatment. Overall response rate was 57%, with 46 partial responses and one complete response. Since crizotinib inhibits MET, 33 patients w

### Evidence matching against CIViC and Molecular Oncology Almanac (MOA)
The example below shows how a detected BCR::ABL1 fusion can be matched against the CIViC and MOA knowledgebases.

#### Load and standardize patient fusion

In [11]:
from fusor.harvester import ArribaHarvester

path = Path("../../tests/fixtures/fusions_arriba_test.tsv")
harvester = ArribaHarvester(fusor=fusor, assembly=Assembly.GRCH37.value)
fusions_list = await harvester.load_records(path)

#### Load and standardize data from Molecular Oncology Almanac (MOA)

In [12]:
from fusor.harvester import MOAHarvester
harvester = MOAHarvester(fusor=fusor, cache_dir=Path.cwd())
moa_fusions = harvester.load_records()

# Save cache for later
save_fusions_cache(moa_fusions, cache_dir=config.data_root, cache_name="moa_translated_fusions.pkl")



### Run FusionMatcher to gather objects containing standardized fusion knowledge from CIViC and MOA
Run the cell below to use FusionMatcher to extract standardized knowledge for the fusion extracted from the Arriba file (BCR::ABL1). The score for each matching CategoricalFusion is printed at the bottom of the cell. This query matches against both the CIViC and MOA knowledgebases.

In [13]:
# Generate list of matches, report match score
from fusor.fusion_matching import FusionMatcher

# Initialize FusionMatcher and define sources to match against. This time, pass
# cache_dir and cache_files to load the cached pickle files saved earlier.
fm = FusionMatcher(assayed_fusions=fusions_list,
                   cache_dir=config.data_root,
                   cache_files=["civic_translated_fusions.pkl", "moa_translated_fusions.pkl"])

# Generate list of matching fusions
matches = await fm.match_fusion()
for matching_output in matches:
    for match in matching_output:
        print(f"Match Score: {match[1]}")

Match Score: 10
Match Score: 2
Match Score: 2
Match Score: 2
Match Score: 2
Match Score: 2
Match Score: 2
Match Score: 2
Match Score: 2
Match Score: 2
Match Score: 2
Match Score: 2
Match Score: 2
Match Score: 2
Match Score: 2
Match Score: 2
Match Score: 2
Match Score: 2
Match Score: 2
Match Score: 2
Match Score: 2
Match Score: 2
Match Score: 2
Match Score: 2
Match Score: 2
Match Score: 2
Match Score: 2
Match Score: 2
Match Score: 2


The cell below can be uncommented to view the corresponding matches. Assertion and evidence information from CIViC can be accessed through the civicMolecularProfiles attribute, while assertion information from MOA can be accessed through the moaAssertion attribute.

In [14]:
#matches
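
Because the results now come from two knowledgebases, it can help to label each match by source before inspecting it. The sketch below checks which of the two attributes named above is populated. The helper `knowledge_sources` is hypothetical, and the `SimpleNamespace` objects and the scores paired with them are placeholders rather than real FusionMatcher output.

```python
from types import SimpleNamespace


def knowledge_sources(fusion):
    """Return the knowledgebases that have data attached to a matched fusion."""
    sources = []
    if getattr(fusion, "civicMolecularProfiles", None):
        sources.append("CIViC")
    if getattr(fusion, "moaAssertion", None):
        sources.append("MOA")
    return sources


# Placeholder matches: one assayed fusion with two (fusion, score) pairs.
demo_matches = [[
    (SimpleNamespace(civicMolecularProfiles=["profile placeholder"], moaAssertion=None), 10),
    (SimpleNamespace(civicMolecularProfiles=None, moaAssertion="assertion placeholder"), 2),
]]

labels = [
    (score, knowledge_sources(fusion))
    for per_fusion in demo_matches
    for fusion, score in per_fusion
]
print(labels)
```

Using `getattr` with a default means the helper does not assume that both attributes are defined on every matched object.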