# 1. Topic matching tutorial

This notebook demonstrates how to use the topic matching functionality in atac_mapper to infer topic distributions for query data using topic loadings from reference data. Topic modelling of scATAC-seq data can be performed with [cisTopic](https://github.com/aertslab/cisTopic).


In [1]:
import pandas as pd
from scipy.io import mmread
from atac_mapper.topic_matching import TopicMatch

## Load Data

We need two main pieces of data:
1. **Region-topic matrix** from the reference dataset (from a [cisTopic](https://github.com/aertslab/cisTopic) analysis). Normally you can find it in `cistopic_obj.selected_model.region_topic`.
2. **Query fragment matrix** (regions x cells)

The region-topic matrix should contain the topic loadings for each genomic region, typically saved as a TSV file from cisTopic. The query fragment matrix contains accessibility counts for each region in your query cells.

💡 **Tip**: Make sure the regions (rows) in both matrices correspond to the same genomic coordinates, in the same order. If needed, you can use tools like [pyranges](https://github.com/biocore-ntnu/pyranges) or Signac's [FeatureMatrix](https://stuartlab.org/signac/reference/featurematrix) to match the region coordinates of the query and reference data.
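
When the query and reference already use identical region names and differ only in row order or in a few extra regions, the alignment can be done with pandas alone. The following is a minimal sketch on toy data; the region names, `region_topic_toy`, and `query_toy` are made up for illustration and are not part of atac_mapper. It does not handle regions that merely overlap; for that case, a tool such as pyranges or FeatureMatrix is still needed.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Toy reference: region-topic loadings indexed by region name.
ref_regions = ["chr1:100-600", "chr1:900-1400", "chr2:50-550"]
region_topic_toy = pd.DataFrame(
    np.array([[0.7, 0.1], [0.2, 0.3], [0.1, 0.6]]),
    index=ref_regions,
    columns=["Topic1", "Topic2"],
)

# Toy query (regions x cells): same regions in a different order, plus one extra.
query_regions = ["chr2:50-550", "chr1:100-600", "chrX:10-510", "chr1:900-1400"]
query_toy = sparse.csr_matrix(np.array([[1, 0], [0, 2], [5, 5], [3, 0]]))

# Position of each reference region in the query (-1 if it is absent).
positions = pd.Index(query_regions).get_indexer(region_topic_toy.index)
missing = region_topic_toy.index[positions == -1]
if len(missing) > 0:
    raise ValueError(f"{len(missing)} reference regions are missing from the query")

# Reorder and subset the query rows so that row i matches reference row i.
query_aligned = query_toy[positions, :]
print(query_aligned.shape)
```

After this step, `query_aligned` has one row per reference region, in the reference order, so its shape agrees with the region-topic matrix along the region axis.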

In [5]:
# Load reference region-topic loadings (regions x topics)
region_topic_df = pd.read_csv("../../../../test_data_atac_mapper/cistopic_loading_mannens.tsv", sep="\t", index_col=0)

# Load query fragment matrix (regions x cells)
query_matrix = mmread("../../../../test_data_atac_mapper/FM_atlas_test_20_subset.mtx").tocsr()

print(f"Reference shape (regions x topics): {region_topic_df.shape}")
print(f"Query shape (regions x cells): {query_matrix.shape}")

Reference shape (regions x topics): (410863, 175)
Query shape (regions x cells): (410863, 20)


## Initialize TopicMatch and Run Inference

Now we can use the `TopicMatch` class to infer topic distributions for our query cells. The topic inference implementation was inspired by the [lda package](https://lda.readthedocs.io/en/latest/).
We strongly recommend allocating as many cores as possible so that inference can run in parallel, especially for larger datasets.
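
To give an intuition for what inferring topics with fixed loadings means, here is a generic EM-style "fold-in" sketch for a single cell. It is an illustration only and not the atac_mapper or lda implementation: the function `fold_in_cell` is hypothetical, it assumes each topic column of the region-topic matrix sums to 1, and it reuses the names `n_iterations` and `tol` only to mirror the arguments used in this notebook.

```python
import numpy as np


def fold_in_cell(counts, region_topic, n_iterations=100, tol=1e-4):
    """Estimate one cell's topic distribution with fixed region-topic loadings.

    counts:       (n_regions,) accessibility counts for the cell
    region_topic: (n_regions, n_topics), each column assumed to sum to 1
    """
    n_topics = region_topic.shape[1]
    theta = np.full(n_topics, 1.0 / n_topics)  # start from a uniform guess
    observed = counts > 0
    if not observed.any():
        return theta, 0  # no signal: keep the uniform guess
    c = counts[observed]
    phi = region_topic[observed]
    for iteration in range(1, n_iterations + 1):
        # E-step: share of each observed region attributed to each topic.
        resp = phi * theta
        resp /= resp.sum(axis=1, keepdims=True) + 1e-12
        # M-step: re-estimate theta from count-weighted responsibilities.
        theta_new = (c[:, None] * resp).sum(axis=0)
        theta_new /= theta_new.sum()
        delta = np.abs(theta_new - theta).sum()
        theta = theta_new
        if delta < tol:
            break
    return theta, iteration


# Toy loadings: topic 0 favours the first two regions, topic 1 the last two.
region_topic_toy = np.array([[0.4, 0.1], [0.4, 0.1], [0.1, 0.4], [0.1, 0.4]])
counts_toy = np.array([1.0, 0.0, 5.0, 6.0])  # reads fall mostly in topic-1 regions

theta, n_iter = fold_in_cell(counts_toy, region_topic_toy)
print(theta, n_iter)
```

Because each cell is processed independently of the others in a scheme like this, the work can be split across cores, which is why parallelization pays off on larger datasets.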

In [6]:
# Initialize topic matcher
topic_matcher = TopicMatch(region_topic_df)

# Run inference
topic_distributions = topic_matcher.infer_topics(
    query=query_matrix,
    njobs=-1,  # Use all available cores
    n_iterations=100,
    tol=1e-4,
)

# Convert to DataFrame for easier inspection
results_df = pd.DataFrame(
    topic_distributions,
    columns=region_topic_df.columns,
)

print("\nFirst few cells and their topic distributions:")
print(results_df.head())

Converged after 83 iterations (delta=9.88e-05)
Converged after 99 iterations (delta=9.95e-05)
Converged after 94 iterations (delta=9.80e-05)
Converged after 92 iterations (delta=9.89e-05)
Converged after 100 iterations (delta=9.99e-05)

First few cells and their topic distributions:
         Topic1        Topic2        Topic3        Topic4    Topic5  \
0  2.176386e-08  1.664294e-11  1.746968e-05  1.384006e-13  0.010057   
1  7.951094e-03  2.679191e-03  1.049767e-02  5.712014e-03  0.001593   
2  2.590506e-12  2.809658e-07  1.422678e-03  1.744847e-10  0.010116   
3  3.265713e-02  4.445763e-04  1.816033e-02  6.666232e-05  0.003741   
4  2.483622e-04  6.483961e-10  1.280752e-07  6.682240e-03  0.013579   

         Topic6        Topic7        Topic8    Topic9       Topic10  ...  \
0  4.184981e-03  7.211357e-07  6.976538e-04  0.021367  1.185928e-18  ...   
1  4.078344e-04  6.189188e-03  3.997446e-06  0.012007  1.544286e-04  ...   
2  2.267622e-05  1.827246e-10  8.665125e-09  0.004133  6.7260

## Analyze Results

Let's look at some basic statistics of the inferred topic distributions.

In [13]:
# Basic statistics
print("Average topic probability per topic:")
print(results_df.mean())

print("\nMost common dominant topic for cells:")
dominant_topics = results_df.idxmax(axis=1).value_counts()
print(dominant_topics)

Average topic probability per topic:
topic_0      0.005722
topic_1      0.001145
topic_2      0.004587
topic_3      0.000797
topic_4      0.004191
               ...   
topic_170    0.003773
topic_171    0.002717
topic_172    0.008476
topic_173    0.003091
topic_174    0.005285
Length: 175, dtype: float64

Most common dominant topic for cells:
topic_141    2
topic_164    2
topic_83     2
topic_34     2
topic_132    2
topic_46     1
topic_125    1
topic_111    1
topic_37     1
topic_144    1
topic_9      1
topic_7      1
topic_107    1
topic_63     1
topic_159    1
Name: count, dtype: int64


In [7]:
# Save results to a file
results_df.to_csv("topic_inference_results.tsv", index=False, sep="\t")

To continue with reference matching, save this matrix as a layer in the query `adata` object.
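
When the results are attached to another object later, each row has to stay tied to its cell. The sketch below shows one way to do that with pandas: write the table with cell identifiers as the index, read it back, and reorder it to a target cell order. The barcodes, `results_toy`, and `target_order` are hypothetical placeholders, and an in-memory buffer stands in for a file on disk.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical cell barcodes, in the same order as the query matrix columns.
cell_names = ["cell_a", "cell_b", "cell_c"]
results_toy = pd.DataFrame(
    np.array([[0.7, 0.3], [0.2, 0.8], [0.5, 0.5]]),
    index=cell_names,
    columns=["Topic1", "Topic2"],
)

# Write with the index so that every row keeps its cell identifier.
buffer = io.StringIO()
results_toy.to_csv(buffer, sep="\t", index=True)

# Read back and reorder the rows to the cell order of the target object
# (for example, the obs_names of the query adata).
buffer.seek(0)
loaded = pd.read_csv(buffer, sep="\t", index_col=0)
target_order = ["cell_c", "cell_a", "cell_b"]
aligned = loaded.loc[target_order]
print(aligned)
```

Indexing with `.loc` raises a `KeyError` if a requested cell is absent from the saved table, which surfaces mismatches early instead of silently misaligning rows.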