Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

When inputing parts of genes, gene SGR00005418.1 could not find anchors in another species #566

Open
YANGMEI0127 opened this issue May 10, 2023 · 2 comments

Comments

@YANGMEI0127
Copy link

Hi @tanghaibao,

Thanks for this great tool.

We encountered a problem while analyzing data using jcvi. Below are the steps and issues we analyzed.

Firstly, we prepared bed and cds files for 2 species. The format of the files for 2 species is the same. One species was named LHG and another species named Species2.

The format of bed is shown below (LHG_scaffold_3):

Hic_scaffold_3 31125864 31140016 SGR00007415.1 0 +
Hic_scaffold_3 31141175 31144628 SGR00007416.1 0 -
Hic_scaffold_3 31149146 31150574 SGR00007417.1 0 +
Hic_scaffold_3 31153232 31155626 SGR00007418.1 0 -

The format of cds is shown below (LHG_scaffold_3):

>SGR00007415.1
ATGGATTCAGTGGAGGCTATAGAGGAGTTGGCTCAGCTGTCGGACTCGATGCGGCAAGCTGCGGCTTTGCTTGCCGATGAGGATGTCGACGAAACTACCACCTCCGGTACTTCCTCTCGTAGGCTTTCCACTTTCCTCAACGTCGTGACACTGGGGAATGTCGGAGCTGGTAAATCTGCAGTGTTAAATAGCTTG……
>SGR00007416.1
ATGCAGCGGTCACGTAGAGCTCTTCTTCAGAGAAGAGCTTTGGAGAAAACTATCAGTGGGAGGAATCGTTTGTATATGTTTTCTCTGACCTTAGTTTTTGTTTTGTGGGGGCTTTTCTTCCTTTTTAGCTTATGGATCAGACAGGGCAACGGTCGTAGAGATGGATGTACTGCACTACCAGATAGCATATCAACT……

Secondly, we performed ortholog analysis for LHG with all other species using code like python -m jcvi.compara.catalog ortholog {main_species} {other_species} --no_strip_names to extract collinear information for all genes in LHG_scaffold_3 with all other species.

We have obtained the collinearity information file of LHG_scaffold_3 compared to species. We display the content of each file (displaying gene SGR00005418.1 information as it exists in each file) such as:

  • LHG_scaffold_3.Species2_v2_5.last
SGR00005418.1   Cla97C02G033340.2       80.71   1405    262     2       11      1406    14      1418    0       1.34e+03
SGR00005418.1   Cla97C02G033330.1       67.11   1420    440     3       7       1402    13      1429    4.1e-212        737
SGR00005418.1   Cla97C02G033320.1       65.94   1415    455     3       11      1401    17      1428    2.2e-195        682
……
  • LHG_scaffold_3.Species2_v2_5.last.filtered
SGR00005418.1   Cla97C02G033340.2   80.71   1405    262 2   11  1406    14  1418    0   1.34e+03
……

*LHG_scaffold_3.Species2_v2_5.anchors

SGR00005418.1   Cla97C02G033340.2       1340
……

*LHG_scaffold_3.Species2_v2_5.lifted.anchors

SGR00005418.1   Cla97C02G033340.2   1340
……

Thirdly, we extracted collinear information for several genes (include SGR00005418.1) in LHG_scaffold_3 instead of all genes in LHG_scaffold_3 and rerun the analysis.

  • LHG_UGTscaffold_3.Species2_v2_5.last:
SGR00005418.1   Cla97C02G033340.2       80.71   1405    262     2       11      1406    14      1418    0       1.34e+03
  • LHG_UGTscaffold_3.Species2_v2_5.last.filtered:
SGR00005418.1   Cla97C02G033340.2       80.71   1405    262     2       11      1406    14      1418    0       1.34e+03
  • LHG_UGTscaffold_3.Species2_v2_5.lifted.anchors: were not formed.
  • LHG_UGTscaffold_3.Species2_v2_5.anchors: The file have no content, and gene SGR00005418.1 A is not in the file.

The question is: in the second step, when inputting all genes, gene SGR00005418.1 could find anchors in another species. But in the third step when inputing parts of genes, gene SGR00005418.1 could not find anchors in another species. Is there any fltering steps for generating the anchor file from the last.filtered file. (By the way, I also tried to change --min_size to 1 or 0, still no luck.)

Would you please give me some suggestion? When we focus on specific gene sets, should we feed only these gene sets or whole scaffold to the jcvi.compara.catalog step?

Thank you very much.

@tanghaibao
Copy link
Owner

@YANGMEI0127

In your third step, did you include the neighbors that formed the synteny block close to SGR00005418.1?

Filtering the input files to only subsets of the genes is not recommended, since jcvi relies on the LAST hits from other genes (even those that aren't forming the blocks) as part of the filtering decision. So subsetting may lead to unexpected results?

Would you please explain a bit why you want to run on subsets, rather than extracting what you need from the full run?

@YANGMEI0127
Copy link
Author

YANGMEI0127 commented May 11, 2023

Hi @tanghaibao,

Thank you very much for your reply.

In my third step, I include the neighbors that formed the synteny block close to SGR00005418.1 , however, only two genes are close to the SGR00005418.1. Such as:

Hic_scaffold_3  5284539 5286279 SGR00005419.1   0   +
Hic_scaffold_3  5274281 5275706 SGR00005418.1   0   +
Hic_scaffold_3  5253515 5254979 SGR00005417.1   0   -

The reason I want to run on subsets is that the full data is abundant, I want to simplify the data, and analyze the data for the genes I am interested in. This could be more convenient.
To my surprise, the data processing results were different.

Thank you again for your reply.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants