# Topic Matching Tutorial

This notebook demonstrates how to use the topic matching functionality in `atac_mapper` to infer, for each query cell, a distribution over the topics learned from a reference dataset.

In [16]:
import pandas as pd
from scipy.io import mmread
from atac_mapper.topic_matching import TopicMatch

## Load Data

We need two inputs:
1. A region-topic matrix (regions x topics) from the reference dataset, produced by a cisTopic analysis
2. A query fragment matrix (regions x cells)
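
Both inputs are indexed by region on their first axis, so it can help to check that they agree before running inference. The sketch below is a hypothetical helper (`check_inputs` is not part of `atac_mapper`) run on made-up toy data. It only compares region counts, because a Matrix Market file carries no row names; it assumes the query rows are already in the same order as the reference regions.

```python
import numpy as np
import pandas as pd
from scipy import sparse


def check_inputs(region_topic: pd.DataFrame, query) -> None:
    """Raise ValueError if the two inputs cannot share a region axis.

    Hypothetical helper for illustration; not part of atac_mapper.
    """
    n_ref_regions = region_topic.shape[0]
    n_query_regions = query.shape[0]
    if n_ref_regions != n_query_regions:
        raise ValueError(
            f"Region count mismatch: reference has {n_ref_regions}, "
            f"query has {n_query_regions}"
        )
    if not region_topic.index.is_unique:
        raise ValueError("Reference region names are not unique")


# Toy data: 5 regions, 3 topics, 4 cells (all values are made up).
rng = np.random.default_rng(0)
toy_reference = pd.DataFrame(
    rng.random((5, 3)),
    index=[f"chr1:{i * 500}-{i * 500 + 401}" for i in range(5)],
    columns=["Topic1", "Topic2", "Topic3"],
)
toy_query = sparse.csr_matrix(rng.integers(0, 2, size=(5, 4)))

check_inputs(toy_reference, toy_query)  # passes silently when shapes agree
```

A matching row count does not prove that the regions are in the same order, so this check is a necessary condition only.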

In [11]:
# Load reference topic distributions
region_topic_df = pd.read_csv(
    "/Users/nazbukina/code/test_data_atac_mapper/cistopic_loading_mannens.tsv", sep="\t", index_col=0
)

# Load query fragment matrix (regions x cells)
query_matrix = mmread("/Users/nazbukina/code/test_data_atac_mapper/FM_atlas_test_20_subset.mtx").tocsr()

print(f"Reference shape (regions x topics): {region_topic_df.shape}")
print(f"Query shape (regions x cells): {query_matrix.shape}")

Reference shape (regions x topics): (410863, 175)
Query shape (regions x cells): (410863, 20)


## Initialize TopicMatch and Run Inference

Now we can use the TopicMatch class to infer topic distributions for our query cells.

In [None]:
# Initialize topic matcher
topic_matcher = TopicMatch(region_topic_df)

# Run inference
topic_distributions = topic_matcher.infer_topics(
    query=query_matrix,
    njobs=-1,  # Use all available cores
    n_iterations=100,
    tol=1e-4,
)

# Convert to a DataFrame (cells x topics) for easier inspection.
# Cell names are placeholders: the .mtx file carries no cell barcodes.
results_df = pd.DataFrame(
    topic_distributions,
    index=[f"cell_{i}" for i in range(topic_distributions.shape[0])],
    columns=region_topic_df.columns,
)

print("\nFirst few cells and their topic distributions:")
print(results_df.head())

Input shapes - tf_vector: (410863,), topic_word: (175, 410863)
Converged after 83 iterations (delta=9.88e-05)
Converged after 99 iterations (delta=9.95e-05)
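
The log output suggests an iterative, per-cell procedure: each cell contributes a count vector over regions (`tf_vector`), the topic-by-region matrix (`topic_word`) stays fixed, and iteration stops once a change measure (`delta`) drops below `tol`. One common way to do this kind of inference is "folding in" with EM updates. The sketch below illustrates that general technique on toy data. It is not the `atac_mapper` implementation: the function name `fold_in`, the uniform initialization, and the L1 definition of `delta` are assumptions made for illustration.

```python
import numpy as np


def fold_in(tf_vector, topic_word, n_iterations=100, tol=1e-4):
    """Estimate one cell's topic mixture with topic_word held fixed.

    Illustrative EM fold-in; not the atac_mapper implementation.
    tf_vector:  (n_regions,) counts for one cell
    topic_word: (n_topics, n_regions), each row a distribution over regions
    """
    n_topics = topic_word.shape[0]
    theta = np.full(n_topics, 1.0 / n_topics)  # uniform starting mixture
    if tf_vector.sum() == 0:
        return theta  # no counts: nothing to learn from this cell
    eps = 1e-12  # guards against division by zero
    for _ in range(n_iterations):
        # Probability of each region under the current mixture.
        mix = theta @ topic_word + eps
        # E-step and M-step combined: reweight topics by explained counts.
        new_theta = theta * (topic_word @ (tf_vector / mix))
        new_theta /= new_theta.sum()
        # Assumed convergence measure: L1 change in the mixture.
        delta = np.abs(new_theta - theta).sum()
        theta = new_theta
        if delta < tol:
            break
    return theta


# Toy example: 3 topics with disjoint support over 6 regions.
toy_topic_word = np.array(
    [
        [0.5, 0.5, 0.0, 0.0, 0.0, 0.0],
        [0.0, 0.0, 0.5, 0.5, 0.0, 0.0],
        [0.0, 0.0, 0.0, 0.0, 0.5, 0.5],
    ]
)
toy_counts = np.array([0.0, 0.0, 3.0, 4.0, 1.0, 0.0])

theta = fold_in(toy_counts, toy_topic_word)
print(theta)
```

Because the toy topics do not overlap, the estimate is simply the fraction of counts falling in each topic's regions: about 0 for the first topic, 7/8 for the second, and 1/8 for the third. With overlapping topics, as in real data, the updates need more iterations, which is consistent with the 83 and 99 iterations reported in the log.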

## Analyze Results

Let's look at some basic statistics of the inferred topic distributions.

In [13]:
# Basic statistics
print("Average topic probability per topic:")
print(results_df.mean())

print("\nMost common dominant topic for cells:")
dominant_topics = results_df.idxmax(axis=1).value_counts()
print(dominant_topics)

Average topic probability per topic:
topic_0      0.005722
topic_1      0.001145
topic_2      0.004587
topic_3      0.000797
topic_4      0.004191
               ...   
topic_170    0.003773
topic_171    0.002717
topic_172    0.008476
topic_173    0.003091
topic_174    0.005285
Length: 175, dtype: float64

Most common dominant topic for cells:
topic_141    2
topic_164    2
topic_83     2
topic_34     2
topic_132    2
topic_46     1
topic_125    1
topic_111    1
topic_37     1
topic_144    1
topic_9      1
topic_7      1
topic_107    1
topic_63     1
topic_159    1
Name: count, dtype: int64
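
Dominant-topic counts do not show how strongly a topic dominates a cell, and if every row sums to 1 the per-topic means must average to 1/175 (about 0.0057) whatever the data looks like. A per-cell summary can be more informative. The sketch below uses a made-up stand-in for `results_df` (the name `toy_results` and its values are illustrative only) to check that rows sum to 1 and to report each cell's dominant topic, its weight, and a normalized entropy.

```python
import numpy as np
import pandas as pd

# Toy stand-in for results_df: 4 cells x 5 topics, each row sums to 1.
rng = np.random.default_rng(0)
toy_results = pd.DataFrame(
    rng.dirichlet(np.ones(5) * 0.3, size=4),
    index=[f"cell_{i}" for i in range(4)],
    columns=[f"topic_{j}" for j in range(5)],
)

# Each row should be a probability distribution over topics.
assert np.allclose(toy_results.sum(axis=1), 1.0)

# Dominant topic and its weight for each cell.
summary = pd.DataFrame(
    {
        "dominant_topic": toy_results.idxmax(axis=1),
        "dominant_weight": toy_results.max(axis=1),
    }
)

# Normalized Shannon entropy: 0 means one topic, 1 means uniform.
probs = toy_results.to_numpy()
log_probs = np.log(probs, where=probs > 0, out=np.zeros_like(probs))
entropy = -(probs * log_probs).sum(axis=1)
summary["normalized_entropy"] = entropy / np.log(probs.shape[1])

print(summary)
```

A cell with a high dominant weight and low entropy is assigned to its topic with more confidence than one whose weight is spread over many topics.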


In [15]:
region_topic_df

Unnamed: 0,Topic1,Topic2,Topic3,Topic4,Topic5,Topic6,Topic7,Topic8,Topic9,Topic10,...,Topic166,Topic167,Topic168,Topic169,Topic170,Topic171,Topic172,Topic173,Topic174,Topic175
chr10:100006329-100006730,0.000000,0.000000,0.000000,0.000000,0.000000e+00,0.00000,0.000000,0.000000,0.000000e+00,0.0,...,0.000000,0.000000e+00,0.000000,0.000000,0.000000,0.000000,0.000000,0.000003,0.000000,0.000000e+00
chr10:100009751-100010152,0.000033,0.000049,0.000026,0.000008,2.845689e-05,0.00005,0.000021,0.000006,4.173635e-05,0.0,...,0.000034,1.648887e-05,0.000007,0.000032,0.000021,0.000039,0.000044,0.000012,0.000032,2.577879e-05
chr10:100016741-100017142,0.000000,0.000000,0.000000,0.000000,0.000000e+00,0.00000,0.000000,0.000000,0.000000e+00,0.0,...,0.000000,0.000000e+00,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000e+00
chr10:100019766-100020167,0.000000,0.000000,0.000000,0.000000,2.294911e-07,0.00000,0.000000,0.000000,0.000000e+00,0.0,...,0.000000,0.000000e+00,0.000000,0.000000,0.000030,0.000000,0.000000,0.000000,0.000000,0.000000e+00
chr10:100020276-100020677,0.000000,0.000000,0.000000,0.000000,0.000000e+00,0.00000,0.000000,0.000000,3.185981e-07,0.0,...,0.000000,0.000000e+00,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000022,0.000000e+00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
chrY:7703790-7704191,0.000000,0.000000,0.000000,0.000000,0.000000e+00,0.00000,0.000000,0.000000,0.000000e+00,0.0,...,0.000000,0.000000e+00,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000e+00
chrY:7714477-7714878,0.000000,0.000000,0.000000,0.000000,0.000000e+00,0.00000,0.000000,0.000000,0.000000e+00,0.0,...,0.000000,0.000000e+00,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000e+00
chrY:7724233-7724634,0.000000,0.000000,0.000000,0.000000,0.000000e+00,0.00000,0.000000,0.000000,0.000000e+00,0.0,...,0.000000,0.000000e+00,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000e+00
chrY:7725786-7726187,0.000000,0.000000,0.000000,0.000000,0.000000e+00,0.00000,0.000000,0.000000,0.000000e+00,0.0,...,0.000000,6.011566e-07,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,4.296465e-07
