This issue was moved to a discussion.

I am confused, and need your help #162

Closed
saramoein372 opened this issue Jul 11, 2022 · 27 comments
Labels
question Further information is requested

Comments

@saramoein372

Description of the question

Hi Kelvin,

I have some basic questions about how dandelion works, and I am trying to understand the biological meaning of each step. I am asking these questions to complete the puzzle.

Would you please help me to answer these questions:

1- is each node in the dandelion network a clone?

2- how the clone network is generated? I have already read all the tutorials and papers, but I think there are some inconsistencies between the paper and the tutorial. It would be great if you could briefly outline the steps. I am very confused.

3- why after generating the .tsv file, some of the cells have different cluster_id?

4- We expected to see the same germline in all the cells in the network. But the germlines of cells in the network are different. Why?

Thank you,
Sara

Minimal example

NA

Any error message produced by the code above

NA

OS information

NA

Version information

NA

Additional context

No response

@saramoein372 saramoein372 added the question Further information is requested label Jul 11, 2022
@zktuong
Owner

zktuong commented Jul 11, 2022

Hi Sara,

1- is each node in the dandelion network a clone?

Each node is a single cell:
image
Each connected component (network) is most often one clone. In some situations a network can comprise multiple clones, because some cells have multiple BCRs/TCRs and dandelion merges them into a single network purely for visualisation.

2- how the clone network is generated?

In a simple example, all cells that were assigned a clone id of 1_1_1_1, including cells with clone ids such as 1_2_3_4|1_1_1_1 (an example of a single cell expressing two pairs of BCRs), are selected, and pairwise Levenshtein distances are calculated for every pair of cells within this subset. The calculation is performed on each IGH/IGK/IGL layer separately. The layers are then simply summed (simple matrix addition), forming a distance matrix like this:
image

I've coloured the upper triangle grey because it just mirrors the lower triangle.

a minimum spanning tree is then calculated, which will form something like this:
image
I've coloured the edge weights (levenshtein distance) blue

In the constructed minimum spanning tree, a special circumstance is that Cell 1 being connected to Cell 4 is totally random: Cell 3 and Cell 2 have equal chances of being selected for Cell 1's position because they are the same distance apart. So I added a step to 'rescue' those connections/edges, making it look like:
image

I've coloured the rescued edges as orange.

That's it.
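
The steps described above can be sketched in plain Python. This is only an illustration under stated assumptions, not dandelion's implementation: the toy sequences, the `IGH`/`IGK` dictionary keys and all helper names are hypothetical, and the rescue rule shown here (re-adding any non-tree edge whose distance ties with the lightest tree edge at one of its endpoints) is just one possible reading of the 'rescue' step.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


# Hypothetical toy cells from one clone, one sequence per chain layer.
cells = {
    "cell1": {"IGH": "CARDYW", "IGK": "CQQSYT"},
    "cell2": {"IGH": "CARDYW", "IGK": "CQQSYS"},
    "cell3": {"IGH": "CARDFW", "IGK": "CQQSYT"},
    "cell4": {"IGH": "CARDFW", "IGK": "CQQSYS"},
}

# Distance per layer, then summed across layers (the "matrix addition").
dist = {
    frozenset((a, b)): sum(levenshtein(cells[a][layer], cells[b][layer])
                           for layer in ("IGH", "IGK"))
    for a, b in combinations(cells, 2)
}


def minimum_spanning_tree(nodes, dist):
    """Prim's algorithm on the complete graph defined by `dist`."""
    nodes = list(nodes)
    in_tree = {nodes[0]}
    tree = set()
    while len(in_tree) < len(nodes):
        u, v = min(((u, v) for u in in_tree for v in nodes if v not in in_tree),
                   key=lambda e: (dist[frozenset(e)], e))
        tree.add(frozenset((u, v)))
        in_tree.add(v)
    return tree


def rescue_tied_edges(tree, dist):
    """ASSUMED rule: add back edges that tie with a node's lightest tree edge."""
    lightest = {}
    for edge in tree:
        for node in edge:
            lightest[node] = min(lightest.get(node, dist[edge]), dist[edge])
    return {edge for edge, d in dist.items()
            if edge not in tree and any(d == lightest[node] for node in edge)}


mst = minimum_spanning_tree(cells, dist)
rescued = rescue_tied_edges(mst, dist)
network_edges = mst | rescued
```

In this toy case the four cells sit on a square of distance-1 edges: the spanning tree can only keep three of them, and which one it drops is arbitrary, so the rescue pass adds the fourth back.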

3- why after generating the .tsv file, some of the cells have different cluster_id?

I'm not sure what you mean by this. If you are asking why the numbers change each time you run it, it has to do with a random argsort whenever lists of V/J and lengths are sorted. The numbers don't have any particular meaning other than to say whether or not two different clones share a similar criterion, so I've never enforced that the numbers stay identical every time.

4- We expected to see the same germline in all the cells in the network. But the germlines of cells in the network are different. Why?

I'm unsure how this can happen, other than the possibility I described above where a cell can have multiple BCRs, and also when cells have multiple light chains. Are you sure that the different germlines you are seeing are not simply the IGH/IGK/IGL germlines? Otherwise, I'll need an example where you've observed this.

@saramoein372
Author

saramoein372 commented Jul 12, 2022 via email

@zktuong
Owner

zktuong commented Jul 12, 2022

Hi Sara,

1- My confusion is how we can biologically justify the intra-connections of
clones. Do you have any comments about the justification of
intra-cluster edges from a biological perspective?

The network structure should look like this:
image

Just a side note: in the latest update (v0.2.4), .edges has been removed because the choice of which nodes were selected as source/target was a bit random, which can make the edge table unstable, and it was only used at an intermediate step. In the latest version, the networkx graph holds the final edge list, which should hopefully be more consistent.

2- Most of the clones in my dandelion file have unassigned clone-id. Why
can this happen?

Can you try updating your dandelion version and see if this persists?

@saramoein372
Author

saramoein372 commented Jul 14, 2022 via email

@saramoein372
Author

saramoein372 commented Jul 14, 2022 via email

@saramoein372
Author

saramoein372 commented Jul 15, 2022 via email

@zktuong
Owner

zktuong commented Jul 18, 2022

1- From the dandelion network, how can I extract the single cell ID's in
the biggest clone?

The biggest clone should have a clone_id_by_size of 1. So you can just use the cell ids from the metadata that matches that.

2- How can get the size of clones?

run ddl.tl.clone_size
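
As an illustration of what these two answers amount to, here is a minimal stdlib sketch that counts cells per clone, ranks clones by size, and pulls out the cell ids of the largest one. It is not how `ddl.tl.clone_size` is implemented; the barcodes and clone ids are made up, and the plain dictionary stands in for the `clone_id` column of the metadata.

```python
from collections import Counter

# Hypothetical metadata: cell barcode -> clone_id.
metadata = {
    "AAAC-1": "1_1_1_1",
    "AAAG-1": "1_1_1_1",
    "AACT-1": "2_1_1_1",
    "AAGT-1": "1_1_1_1",
    "ACGT-1": "2_1_1_1",
    "ATTT-1": "3_1_1_1",
}

# Number of cells in each clone.
clone_size = Counter(metadata.values())

# Rank clones from largest to smallest; rank 1 is the biggest clone.
clone_id_by_size = {
    clone: rank
    for rank, (clone, _) in enumerate(clone_size.most_common(), start=1)
}

# Cell ids belonging to the biggest clone.
largest_clone_cells = sorted(
    cell for cell, clone in metadata.items() if clone_id_by_size[clone] == 1
)
```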

2- Are you saying there is no biological justification to use the "edges"
that was in the previous version?

that's correct. no justification

3- why in the same clone, I see different VDJs?

I'm not sure how this can happen. can you show me an example?

@saramoein372
Author

saramoein372 commented Jul 18, 2022 via email

@zktuong
Owner

zktuong commented Jul 18, 2022

i see.

for that you need to extract from the graph itself:
vdj.graph[0] or vdj.graph[1] - either will work.

you would want to follow the instructions here:
https://networkx.org/documentation/stable/reference/algorithms/generated/networkx.algorithms.components.connected_components.html

which basically should look like:

G = vdj.graph[1]
# find the largest connected network
largest_cc = max(nx.connected_components(G), key=len)
# subset to largest_cc
S = [G.subgraph(c).copy() for c in nx.connected_components(G)]

# this should give you the list of nodes that are this network
S.nodes

Then you should be able to just match them against the metadata?

newvdj = vdj[vdj.metadata_names.isin(list(S.nodes))].copy()
newvdj.metadata

@saramoein372
Author

saramoein372 commented Jul 18, 2022 via email

@saramoein372
Author

saramoein372 commented Jul 18, 2022 via email

@zktuong
Owner

zktuong commented Jul 18, 2022 via email

@saramoein372
Author

saramoein372 commented Jul 19, 2022 via email

@zktuong
Owner

zktuong commented Jul 19, 2022

Yes, because the whole point is to remove all ambiguous BCR chains.

You can also use scirpy's method to define clones and see if that makes a difference

@saramoein372
Author

saramoein372 commented Jul 19, 2022 via email

@saramoein372
Author

saramoein372 commented Jul 19, 2022 via email

@zktuong
Owner

zktuong commented Jul 19, 2022

One more question: how this can happen that my cell ranger results has the
v-call and j-call information for each cell. But dandelion has put empty
for v and j genotypes columns, and also empty column for clone-id? Then I
have unassigned clone and my bcr network is showing all these cells in a
big clone.

The pre-processing will reannotate the V and J calls using igblastn and blastn. Where a call was deemed too low confidence, dandelion will remove the V/J call annotation, but the result should largely be consistent with how igblastn is performed (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3692102/).

During post-processing, i.e. filter_contigs or check_contigs, a contig-level QC assessment is performed where I ask whether the assignments make sense:

e.g. IGHV must pair with an IGHJ in the same contig; if either is missing, then it's not a good productive contig.
There are several other logical checks like that along the way, to ensure that what we end up with are good sets of contigs.

Where filter_contigs and check_contigs differ is that filter_contigs is stricter and also performs a hard cell-level QC where it checks whether a cell has one or many sets of heavy+light chains. If many, filter_contigs will remove them. For check_contigs, the cell-level QC is a soft check that just populates the chain_status column in .metadata to indicate whether particular cells display ambiguous contigs.
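
The hard versus soft distinction can be pictured with a toy example. The sketch below is not dandelion's code: the two helper functions, the "more than one heavy chain means ambiguous" rule and the `"ok"`/`"ambiguous"` labels are all assumptions made for illustration, and the real `chain_status` values may differ.

```python
from collections import Counter

# Hypothetical contigs as (cell barcode, locus) pairs.
contigs = [
    ("cell1", "IGH"), ("cell1", "IGK"),
    ("cell2", "IGH"), ("cell2", "IGH"), ("cell2", "IGL"),
    ("cell3", "IGH"), ("cell3", "IGL"),
]


def heavy_chain_counts(contigs):
    """Count heavy-chain contigs per cell barcode."""
    return Counter(barcode for barcode, locus in contigs if locus == "IGH")


def hard_cell_qc(contigs):
    """Hard check (in the spirit of filter_contigs): drop ambiguous cells."""
    counts = heavy_chain_counts(contigs)
    return [(b, locus) for b, locus in contigs if counts[b] <= 1]


def soft_cell_qc(contigs):
    """Soft check (in the spirit of check_contigs): keep everything, only label."""
    counts = heavy_chain_counts(contigs)
    return {b: ("ambiguous" if counts[b] > 1 else "ok") for b, _ in contigs}


kept = hard_cell_qc(contigs)    # cell2 is gone
status = soft_cell_qc(contigs)  # cell2 is kept but flagged
```

With the soft check nothing is lost, so a large drop in row count between the two modes points at the cell-level QC rather than at clone definition.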

clone_id thus relies on all these checks to succeed.

  1. It MUST have a V gene, a J gene, CDR3 sequence
  2. It MUST have at least 1 heavy chain
    If a cell only has light chains, then clone id will not be defined. The rationale is that biologically, IGH rearrangement occurs prior to IGK/IGL rearrangement i.e. you must have a productive heavy chain before light chain will be rearranged.
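
The two requirements above can be written down as a small predicate. This is a sketch of the logic only, not the library's implementation: the function name is hypothetical, and the field names (`locus`, `v_call`, `j_call`, `junction`) are assumed to follow AIRR-style column naming, with `junction` standing in for the CDR3 sequence.

```python
def can_assign_clone_id(cell_contigs):
    """Return True if a cell has at least one fully annotated heavy chain.

    `cell_contigs` is a list of dicts, one per contig of a single cell.
    """
    # Rule 1: a usable contig needs a V gene, a J gene and a CDR3/junction.
    complete = [
        c for c in cell_contigs
        if c.get("v_call") and c.get("j_call") and c.get("junction")
    ]
    # Rule 2: at least one of the usable contigs must be a heavy chain.
    return any(c.get("locus") == "IGH" for c in complete)


# Hypothetical cells for illustration.
heavy = {"locus": "IGH", "v_call": "IGHV1-2", "j_call": "IGHJ4",
         "junction": "TGTGCGAGAGATTACTGG"}
light = {"locus": "IGK", "v_call": "IGKV1-39", "j_call": "IGKJ1",
         "junction": "TGTCAACAGAGTTACACT"}

cells = {
    "paired": [heavy, light],
    "light_only": [light],
    "missing_j": [{**heavy, "j_call": ""}, light],
}

assignable = {name: can_assign_clone_id(c) for name, c in cells.items()}
```

Under these assumptions, a cell whose V/J calls were blanked during reannotation ends up in the same position as a light-chain-only cell: no clone id.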

So unless you are still using an older version of dandelion, I'm not sure it's possible to form a network of unassigned clones. Regardless, this is still a bug, and it should be removed/ignored. I'll need a more concrete example to be able to diagnose this bug.

And one more question is: can I ask the correct singularity command in
preprocessing step, that has all the necessary parameters for correct
filtering, including contig filtering and everything?

The current singularity script only does the pre-processing. All the filtering steps are considered post-processing, and you'll have to follow the tutorial for those.

@saramoein372
Author

saramoein372 commented Jul 19, 2022 via email

@saramoein372
Author

saramoein372 commented Jul 19, 2022 via email

@saramoein372
Author

saramoein372 commented Jul 19, 2022 via email

@zktuong
Owner

zktuong commented Jul 19, 2022

During filter_contig "vdj, adata2 = ddl.pp.filter_contigs(new_vdj, adata, library_type ='tr-ab', filter_rna = True)" I get this error: TypeError: update_metadata() got an unexpected keyword argument 'library_type'

You are not using the correct version of dandelion. Please uninstall and reinstall it. dandelion.__version__ has to be 0.2.4

I am going to generate a network of all the edges from nx package (like the
graph that you sent me a few days ago) and you mentioned that the 'edges'
from the dandelion package is not reliable. I need a way that gives me
edges.
But it is not clear for me how to do that.

I would suggest that you learn how to use the networkx package, because this isn't the place to ask questions related to it.
https://networkx.org/documentation/stable/reference/generated/networkx.convert_matrix.to_pandas_edgelist.html

Would you please provide a short explanation about graph[0] and graph[1]?
It looks after plotting all clones are connected together. I am confused
about how they are connected?

graph[0] contains all nodes (includes singleton) and graph[1] contains only connected nodes.

Sorry the code i used above is wrong. should be:

S = G.subgraph(largest_cc)
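
Putting the corrected line in context, a self-contained sketch of the whole extraction might look like the following. The toy graph is a hypothetical stand-in for `vdj.graph[1]` (the node names and weights are made up), and `Graph.edges(data=True)` is used to list the edges so that no edge table from dandelion is needed.

```python
import networkx as nx

# Hypothetical stand-in for vdj.graph[1]: nodes are cells, weights are distances.
G = nx.Graph()
G.add_weighted_edges_from([
    ("cell1", "cell2", 1),
    ("cell2", "cell3", 0),
    ("cell3", "cell1", 1),
    ("cell4", "cell5", 2),
])

# Find the node set of the largest connected component (network).
largest_cc = max(nx.connected_components(G), key=len)

# Subset the graph to that component; .copy() detaches it from G.
S = G.subgraph(largest_cc).copy()

# Cell ids in the largest network.
nodes = sorted(S.nodes)

# Edge list with weights, each edge written as (smaller id, larger id, weight).
edges = sorted((min(u, v), max(u, v), d["weight"])
               for u, v, d in S.edges(data=True))
```

The `nodes` list is what would then be matched against the metadata index to subset the object.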

@saramoein372
Author

saramoein372 commented Jul 20, 2022 via email

@saramoein372
Author

saramoein372 commented Jul 20, 2022 via email

@saramoein372
Author

saramoein372 commented Jul 20, 2022 via email

@zktuong
Owner

zktuong commented Jul 20, 2022

How can we tell dandelion to consider both heavy and LIGHT chains?
Currently, it is only generating clone_id based on the heavy chain, but
we need to look at both chains.

This assertion is not true. Dandelion will consider both heavy and light chains if they are there.
Thus, your description is only possible if your light chain rows are not there (because they were filtered away because of quality issues), or are not formed properly (and thus filtered away because of quality issues).

One question I have: how does the filter_contigs function work?
Does dandelion remove the light chain?

It does not remove normally.

Do you see a lot of situations where a single cell barcode has more than two contigs assigned to it?

If so, then your original data needs to be assessed if it's correct and of high quality.

We want to see which criteria filter_contigs looks at to filter
contigs.

This is in the documentation. Please read it.

How can I have the original version of dandelion?

You can pip install an earlier version as they are all on pypi

However, earlier versions should not change this behavior of missing clone_ids as i highly suspect that your issue is with your data, rather than the tool itself.

Please provide a screenshot of your data/error, or send the data to my email so I can diagnose whether it's a genuine problem. I wouldn't need the full data; just a couple of the rows with which you are experiencing issues will suffice. If that is not possible, then I suggest that you start from the original cellranger outputs and just read them in with ddl.read_10x_vdj or ddl.read_10x_airr.

@saramoein372
Author

saramoein372 commented Jul 21, 2022 via email

@zktuong
Owner

zktuong commented Jul 21, 2022

1- what is the "criteria of connecting" of one cluster to another cluster?
Do clusters connect each other from cells that "have one base nucleotide"
difference?

As I've explained above, this is determined by whether the clone_id entry is found to be shared by the different cells/clusters.
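
To make the "shared clone_id entry" idea concrete, here is a small sketch of how a cell carrying two clone ids can bridge two clones into one network. It only illustrates the grouping logic, under the assumption (taken from the `1_2_3_4|1_1_1_1` example earlier in the thread) that multiple clone ids are separated by `|`; the cell names are hypothetical and this is not dandelion's code.

```python
from collections import defaultdict

# Hypothetical cells; cellB expresses two BCR pairs and so carries two clone ids.
clone_ids = {
    "cellA": "1_1_1_1",
    "cellB": "1_1_1_1|1_2_3_4",
    "cellC": "1_2_3_4",
    "cellD": "5_1_1_1",
}

# Collect the cells belonging to each individual clone id.
members = defaultdict(set)
for cell, entry in clone_ids.items():
    for clone_id in entry.split("|"):
        members[clone_id].add(cell)

# Merge clones that share at least one cell into a single network.
networks = []
for group in members.values():
    merged = set(group)
    for existing in [n for n in networks if n & merged]:
        merged |= existing
        networks.remove(existing)
    networks.append(merged)

result = sorted(sorted(n) for n in networks)
```

Here clones `1_1_1_1` and `1_2_3_4` end up in the same network only because `cellB` belongs to both, which is why cells in one plotted network can carry different V/J calls.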

2- I was running one dataset with dandelion 1.12 and the output vdj had around
6000 rows (in vdj.metadata). But with dandelion 2.4, running on the same
data generates a vdj.metadata with around 300 rows. How do these two
dandelion versions differ?

you can see the various code changes here: https://github.com/zktuong/dandelion/releases

The largest difference between v0.1.12 and 0.2.x is that the preprocessing step has a 'strict' mode by default, which could be why your dataset is now reduced. The rest of the changes are to do with speed upgrades.

So, instead of using filter_contigs, can you use check_contigs and report back if you still only see ~300 rows?

Repository owner locked and limited conversation to collaborators Jul 21, 2022
@zktuong zktuong converted this issue into discussion #164 Jul 21, 2022
