Solidago for ICML24 #1879
Conversation
…L24 submission. Additionally, I would like to include a synthetic data generation process and an evaluation of the pipeline on synthetic data. I am also considering generating attack data to test the resilience of the pipeline.
The pretrusts and the vouches do not seem to be exported in the public data. I would like to have the vouches as a CSV with the columns "voucher_id", "vouchee_id" and "vouch". The pretrusts could be just a list of user_id.
Adding a "vouchers.csv" file sounds reasonable, as we already describe vouching as public information on the website. I am not so sure about "pretrusts". We could add a column "is_pretrusted" in the file "users.csv", but is the fact that a user registered with a trusted address supposed to be public? (Although that can be guessed easily from the current "trust_score" values.)
I also included the public export in /solidago/tests/data to easily test implementations. I re-implemented LipschiTrust, which is now linear in the number of vouches, as opposed to quadratic in the number of users. However, I do not get the same trust scores (I assumed that if a user has a trust score >= 0.8, then they were pretrusted).
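For context, here is a rough sketch of how a vouch-based trust propagation can be made linear in the number of vouches, using the column names proposed above ("user_id"/"is_pretrusted" and "voucher_id"/"vouchee_id"/"vouch"). This is only an illustration with a made-up decay factor and update rule, not the actual LipschiTrust implementation:

```python
import pandas as pd
from collections import defaultdict

def propagate_trust(users: pd.DataFrame, vouches: pd.DataFrame,
                    decay: float = 0.8, n_iter: int = 10) -> dict[int, float]:
    """Toy trust propagation over the vouch graph.

    Each iteration scans every vouch edge exactly once, so the cost is
    O(n_iter * |vouches|) rather than O(|users|^2).
    """
    # Pretrusted users start with full trust (toy assumption).
    pretrust = {u: 1.0 for u, p in zip(users["user_id"], users["is_pretrusted"]) if p}
    trust = dict(pretrust)
    edges = list(zip(vouches["voucher_id"], vouches["vouchee_id"], vouches["vouch"]))
    for _ in range(n_iter):
        received = defaultdict(float)
        for voucher, vouchee, weight in edges:
            received[vouchee] += decay * weight * trust.get(voucher, 0.0)
        # Trust never drops below the pretrust level and is capped at 1.
        trust = {u: min(1.0, pretrust.get(u, 0.0) + received[u])
                 for u in set(pretrust) | set(received)}
    return trust
```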
WIP Constructing a generative model pipeline
Also cleaned LipschiTrust code
…due to always having a<b
There is still more room for acceleration for Mehestan
… active user comparisons.
backend/ml/mehestan/parameters.py
Outdated
from solidago.pipeline import PipelineParameters

from tournesol.utils.constants import COMPARISON_MAX, MEHESTAN_MAX_SCALED_SCORE


class MehestanParameters(PipelineParameters):
    r_max = COMPARISON_MAX
    max_squashed_score = MEHESTAN_MAX_SCALED_SCORE
(1) As this is extending the dataclass PipelineParameters, I think it should include the type annotations:
r_max: float = COMPARISON_MAX
max_squashed_score: float = MEHESTAN_MAX_SCALED_SCORE
(2) Shouldn't we define this type in Solidago and only use it in the Tournesol backend?
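To illustrate the first point, here is a minimal standalone example with made-up default values (not the actual Tournesol constants): in a dataclass subclass, an assignment without a type annotation creates a plain class attribute rather than a dataclass field.

```python
from dataclasses import dataclass, fields

@dataclass
class PipelineParameters:
    r_max: float = 10.0  # made-up default for the example

@dataclass
class MehestanParameters(PipelineParameters):
    # With the annotation, this overrides the inherited field's default;
    # without it, the line would only set a class attribute and would not
    # be picked up by fields() or the generated __init__.
    r_max: float = 20.0

print([(f.name, f.default) for f in fields(MehestanParameters)])
# [('r_max', 20.0)]
```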
import pandas as pd

class PipelineOutput(ABC):
This PipelineOutput interface only makes sense if this object is passed to the pipeline and gets called along the way. I see that this is how it is used inside Tournesol, which does not yet use the pipeline as defined by Solidago. Do I understand this correctly?
If yes, do we want to adapt solidago's Pipeline to use solidago's PipelineOutput?
Pipeline.__call__() now accepts an optional output to customize the output and write scores during the pipeline execution:
tournesol/solidago/src/solidago/pipeline/pipeline.py
Lines 122 to 131 in 631257d
def __call__(
    self,
    users: pd.DataFrame,
    vouches: pd.DataFrame,
    entities: pd.DataFrame,
    privacy: PrivacySettings,
    judgments: Judgments,
    init_user_models: Optional[dict[int, ScoringModel]] = None,
    output: Optional[PipelineOutput] = None,
) -> tuple[pd.DataFrame, VotingRights, dict[int, ScoringModel], ScoringModel]:
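Based on that signature, a call with a custom output could look roughly like this. This is only a sketch: it assumes a default-constructed Pipeline, all input variables are placeholders prepared by the caller, and the names of the returned values are guesses from the annotated return type:

```python
pipeline = Pipeline()  # assumption: default construction; actual setup may differ

users_df, voting_rights, user_models, global_model = pipeline(
    users=users,
    vouches=vouches,
    entities=entities,
    privacy=privacy,
    judgments=judgments,
    output=my_output,  # optional PipelineOutput; scores are written through it during the run
)
```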
solidago/experiments/engagement.py
Outdated
@lenhoanglnh On this file I created a flame graph to investigate how we can improve the compute time of the pipeline. Maybe it is not the right place?
Install flameprof
$ source venv/bin/activate
$ pip install flameprof
Create a profiling stats file, then use flameprof to turn it into a nice SVG for visualization
$ python -m cProfile -o stats.prof experiments/engagement.py
$ python -m flameprof stats.prof > output.svg
(Open the SVG in a new tab; it gives a hover overlay that helps readability.)
Analyzing the graph:
- Nothing takes >80% of the time, so we don't have an obvious optimization issue (although this could change if we run the profiling with a larger dataset to better show cubic or quadratic complexity in some parts of the pipeline)
- The multiple scaling steps take about 50%. I don't really understand what all the substeps are, but once again, no single substep takes the majority of the time. After confirming what the profile looks like on a bigger dataset, that's where we should spend time investigating performance improvements.
- The second most time-consuming part is the computation of individual scores, which spends 90% of its time running brentq. So we could look closely at how brentq is implemented and whether it can be sped up (see the sketch below).
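For reference, a minimal standalone example of what brentq does: scipy.optimize.brentq finds a root of a scalar function within a bracketing interval. The function below is a toy stand-in, not solidago's actual loss derivative; one cheap experiment would be loosening the tolerances and checking whether the resulting scores still match:

```python
from scipy.optimize import brentq

# Toy stand-in for a per-user loss derivative whose root gives the score.
def loss_derivative(theta: float) -> float:
    return theta**3 + theta - 1.0

# Default tolerances.
score = brentq(loss_derivative, -10.0, 10.0)

# Looser tolerance and fewer iterations: an accuracy/speed trade-off to validate.
fast_score = brentq(loss_derivative, -10.0, 10.0, xtol=1e-4, maxiter=50)
```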
Fixed doc typos in pipeline.
…on as in the ICML paper
…mentation to legacy2023 (#1915)
I merged #1915 into this branch, to more clearly separate the previous implementation from the new pipeline modules, deduplicate algorithms where possible, and apply various optimizations. Solidago version is now v0.1.0.
Anything else to check before merging? @lenhoanglnh A discussion is still open about how to apply the post_process function in "scoring_model".
I think it is ready to merge, right?
Related issues: #1781 (also contributes to solidago)
Description
The goal of this PR is to make solidago run the full pipeline for the ICML24 submission.
Additionally, I would like to include a synthetic data generation process and an evaluation of the pipeline on synthetic data.
I am also considering generating attack data to test the resilience of the pipeline.
Checklist