Lots and lots of duplication #267

Closed
cbizon opened this issue Sep 2, 2021 · 4 comments

@cbizon
Contributor

cbizon commented Sep 2, 2021

Standup query:

query = {
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "ids": ["NCBIGene:6656"],
                    "categories": ["biolink:Gene"]
                },
                "n1": {
                    "categories": ["biolink:BiologicalProcessOrActivity"]
                },
                "n2": {
                    "ids": ["NCBIGene:6657"],
                    "categories": ["biolink:Gene"]
                }
            },
            "edges": {
                "e0": {
                    "subject": "n0",
                    "object": "n1",
                    "predicates": ["biolink:related_to"]
                },
                "e1": {
                    "subject": "n2",
                    "object": "n1",
                    "predicates": ["biolink:related_to"]
                }
            }
        }
    }
}

Given 2 genes, what processes do they have in common? We return 1174 results, but only 62 are unique.

Oddly, each result happens either 1, 5, 10, 25, 50, 75, 100, or 150 times.
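
A rough sketch of how these copies could be tallied, assuming each result is a dict with `node_bindings` and `edge_bindings` shaped like the example in this issue. The helper names `result_key` and `count_copies` are hypothetical and not part of the project's code; scores are ignored when comparing results.

```python
import json
from collections import Counter


def result_key(result):
    """Build a canonical string for a result from its bindings only.

    Assumption: two results are "the same" when their node and edge
    bindings match, regardless of dict key order or binding order.
    """
    def canon(bindings):
        # Map each query-graph id to a sorted list of bound ids.
        return {
            qid: sorted(binding["id"] for binding in bound)
            for qid, bound in bindings.items()
        }

    return json.dumps(
        {
            "nodes": canon(result.get("node_bindings", {})),
            "edges": canon(result.get("edge_bindings", {})),
        },
        sort_keys=True,
    )


def count_copies(results):
    """Return a Counter mapping each canonical result key to its copy count."""
    return Counter(result_key(result) for result in results)
```

Calling `Counter(count_copies(results).values())` on the returned message's results would then show how many distinct results occur at each multiplicity.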

The most common result (150 copies) is this:

{
 "edge_bindings": {
  "e0": [
   {
    "id": "NCBIGene:6656-biolink:participates_in-GO:0006355"
   }
  ],
  "e1": [
   {
    "id": "NCBIGene:6657-biolink:participates_in-GO:0006355"
   }
  ]
 },
 "node_bindings": {
  "n0": [
   {
    "id": "NCBIGene:6656"
   }
  ],
  "n1": [
   {
    "id": "GO:0006355"
   }
  ],
  "n2": [
   {
    "id": "NCBIGene:6657"
   }
  ]
 },
 "score": null
}

I suspect KP funkiness, but even so, I think we should deduplicate these. It's possible that some of the deduplication could happen earlier in the process, allowing some speedup as well.
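
One way the final deduplication might look, as a sketch only: it assumes results are plain dicts shaped like the example above and that identical bindings mean identical results. `binding_key` and `unique_results` are hypothetical names, not existing functions in this codebase, and this says nothing about where in the pipeline the step would best live.

```python
import json


def binding_key(result):
    """Canonical, hashable key built from a result's node and edge bindings."""
    def canon(bindings):
        # Sort bound ids so binding order does not affect equality.
        return {
            qid: sorted(binding["id"] for binding in bound)
            for qid, bound in bindings.items()
        }

    return json.dumps(
        [canon(result.get("node_bindings", {})),
         canon(result.get("edge_bindings", {}))],
        sort_keys=True,
    )


def unique_results(results):
    """Keep the first copy of each distinct result, preserving input order.

    Assumption: the score is not part of a result's identity, so later
    copies with a different score are dropped along with exact copies.
    """
    seen = set()
    unique = []
    for result in results:
        key = binding_key(result)
        if key not in seen:
            seen.add(key)
            unique.append(result)
    return unique
```

Because the key is computed per result, the same check could be applied incrementally as results arrive rather than once at the end, which is where any speedup would come from.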

cbizon added the Priority: Medium, standup, and Type: Enhancement labels Sep 2, 2021
@cbizon
Contributor Author

cbizon commented Sep 2, 2021

The other thing is that this causes trouble for AC, which is trying to count things....

@cbizon
Contributor Author

cbizon commented Sep 8, 2021

There are also repeated values in D.1.

@uhbrar
Collaborator

uhbrar commented Oct 5, 2021

This may be fixed via the smarter message merging Alon has been working on implementing. Once that's in, I'll check back to see whether or not that addresses the problem.

@richakanwar13

Can be closed when #276 is closed.
