
scaling... again #116

Open · lorenzozanisi opened this issue Jul 1, 2022 · 6 comments

lorenzozanisi (Collaborator) commented Jul 1, 2022

Hi @zcapjdb @thandi1908

I was trying to reproduce your results because some CL results seemed odd, and I'm seeing something strange with the scaling (again!).

Looking at main (which I assume is where the paper results are coming from), it seems that the line where the regressor's scaler is updated is commented out.

Un-commenting that line has quite a drastic effect on the paper results:

[image]

The left plot is similar to what is in the paper, but it's obtained with the scaler line commented out, which I thought we had settled on keeping uncommented (this is how I have been doing the CL all this time) - but maybe I don't recall correctly? Could you please confirm when you can?
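
For reference, the behaviour in question boils down to something like this in the AL loop (a minimal sketch with illustrative names, assuming a sklearn StandardScaler - not the actual pipeline code):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

def extend_training_targets(train_y, acquired_y, scaler, update_scaler=False):
    """Append newly acquired labels; optionally refit the output scaler."""
    all_y = np.concatenate([train_y, acquired_y]).reshape(-1, 1)
    if update_scaler:
        scaler.fit(all_y)           # the line that is commented out on main
    return scaler.transform(all_y)  # the scaled targets the regressor trains on

scaler = StandardScaler().fit(np.array([[1.0], [2.0], [3.0]]))  # initial stats
z = extend_training_targets(np.array([1.0, 2.0, 3.0]), np.array([50.0]), scaler)
```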

zcapjdb (Owner) commented Jul 3, 2022

Hi Lorenzo,

The final paper results should have been produced on main, I believe, so that probably means that, yes, the line was left commented out (I need to check tomorrow to confirm it is also commented out in my local version).

The with-scaling versions don't seem to have fully converged and look like they require more iterations. I would be interested to see what happens with more, but it is still not clear to me why rescaling gives worse performance.

lorenzozanisi (Collaborator, Author) commented

I'm still not sure what's going on. The right one looks like it will converge at some point, but it's odd that it's not as fast as the other one. It is possible that AL is prone to picking points at the outskirts of the output manifold, so the variance of the scaler becomes very large and is not offset enough by the training. It would be helpful to visualise that (i.e. the distribution of the flux of the acquired points).

The original scaler instead contains info about points that are less biased, and so it gives a better generalisation MSE over a random test set.
The problem is that the scaler ought to be updated during the CL iterations, because we know the data distribution changes. So if the output scaling is this sensitive to the AL acquisition function, then we have a problem.

At the same time, I remember that rescaling everything with the new scaler after each acquisition gave much better unscaled MSEs... Just dumping these points here for the moment, as I'm sure I'll have forgotten in a week when I'm back.

It's also possible that the NN starts from a point from which it struggles to generalise when it sees more data. One quick experiment could be to retrain the regressor NN from scratch after every acquisition; something like the sketch below. In this case using the same seed would be crucial (as we want the same net as before, but trained with the extra data).
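
A minimal sketch of that experiment (hypothetical names, assuming a PyTorch model):

```python
import torch

def retrain_from_scratch(make_model, train_fn, dataset, seed=0):
    # Re-seeding before rebuilding gives identical initial weights at every
    # acquisition step, so the only difference between iterations is the data.
    torch.manual_seed(seed)
    model = make_model()              # fresh, untrained network
    return train_fn(model, dataset)   # original + newly acquired points
```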

I'm more keen on the first idea though.

lorenzozanisi (Collaborator, Author) commented Jul 11, 2022

So, I confirm that using the scaling results in identical performance for the scaled loss (left panel) but completely different behaviour for the unscaled loss (right). Interestingly, both orange lines follow the very same trend, while the blue ones don't.

[image]

So there must be something going on in the scaler itself that is just not right. I therefore looked at the mean and variance of the scaler for the ion heat flux at each iteration:

[image]

Note that they both increase, at least for the first iterations. That's telling us that the network is looking to label points that are at the extremes of the distribution (similar trends are there for the input variables). But there's more to it.
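
(For reference, the statistics above can be extracted with something like the following - illustrative names, assuming a sklearn StandardScaler refit on the cumulative labels after each acquisition:)

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

def scaler_stats_per_iteration(label_batches):
    """label_batches: list of 1-D label arrays, one per AL iteration (hypothetical)."""
    scaler, stats, seen = StandardScaler(), [], np.empty(0)
    for batch in label_batches:
        seen = np.concatenate([seen, batch])
        scaler.fit(seen.reshape(-1, 1))
        stats.append((scaler.mean_[0], scaler.var_[0]))  # mean and variance so far
    return stats

stats = scaler_stats_per_iteration([np.array([1.0, 2.0]), np.array([50.0])])
```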

The unscaled MSE is sensitive to the scale of the scaler. The scaled variables are

z = (y - mu) / sigma

Each individual contribution to the average unscaled MSE is

(y_i - yhat_i)^2 = sigma^2 * (z_i - zhat_i)^2

If the scale increases, then the lower convergence rate of the unscaled loss compared to the scaled loss is explained for the blue lines above (i.e. when the regressor's scaler is updated). By never changing the regressor's scaler (orange line), we are always using a scale of ~22, which makes things look better than they are. This kinda begs the question of how good a metric the MSE is for this problem, and I'm not sure I know the answer - thoughts?
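
A quick numerical check of the point above (the mu/sigma values are made up; 22 mirrors the fixed scale mentioned):

```python
import numpy as np

rng = np.random.default_rng(0)
z, z_hat = rng.normal(size=1000), rng.normal(size=1000)  # scaled truth / prediction
for mu, sigma in [(0.0, 22.0), (10.0, 60.0)]:            # fixed vs grown scaler stats
    y, y_hat = sigma * z + mu, sigma * z_hat + mu        # unscale both consistently
    assert np.isclose(np.mean((y - y_hat) ** 2),
                      sigma**2 * np.mean((z - z_hat) ** 2))  # mu cancels out
```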

lorenzozanisi (Collaborator, Author) commented Jul 11, 2022

@zcapjdb @thandi1908 Please have a look at my comment before tomorrow if you can, so we can discuss how to proceed with the AL part of the poster.

If what I say above makes sense, I think that the AL experiments should be run again with the correct scaler.

Hopefully we'll be able to show that we can still be competitive with the full-dataset training scenario with a small number of labels (now it's 0.4%, but I reckon we'll get to at least ~5%, which is not bad, but also not great). We will need more iterations to be competitive with the full-dataset training scenario, which implies that the candidates not selected must all be added back to the unlabelled pool (I don't think that's the case now?), which otherwise gets depleted of potentially interesting points - see the sketch below.
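
A sketch of what I mean (illustrative; `pool` is a set of candidate indices and `score_fn` a hypothetical acquisition score):

```python
def acquire(pool, score_fn, n_candidates, n_select):
    # Rank the pool, shortlist candidates, but only remove the points we
    # actually label - rejected candidates stay available for later rounds.
    candidates = sorted(pool, key=score_fn, reverse=True)[:n_candidates]
    selected = candidates[:n_select]        # points sent for labelling
    remaining_pool = pool - set(selected)   # unselected candidates go back
    return selected, remaining_pool

selected, pool = acquire(pool={0, 1, 2, 3, 4}, score_fn=lambda i: -i,
                         n_candidates=3, n_select=1)
```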

Judging by my CL figures where I ran AL (with scaler) and random sampling, I think the former should still do better than the latter, so at least we should have that.

lorenzozanisi pushed a commit that referenced this issue Jul 15, 2022
lorenzozanisi (Collaborator, Author) commented Jul 15, 2022

Hey @zcapjdb @thandi1908, just a quick update to move forward.

This has nearly driven me insane, but the y vs yhat plots seem to reveal that the case with no scaling of the regressor leads to predicted negative fluxes:

No scaling regressor:
[image]

With scaling regressor:
[image]

You can reproduce this on the checkscaler branch. The config I used is in src/pipeline/pipeline_config_debug.yaml. There's a "scaler" keyword: set it to "noscaler" or "withscaler" (what they mean should be obvious).

The MSE depends only on sigma, as I showed in my previous post, which might make it seem that having a fixed scaler is a good thing. This is not true.

This is because the ys depend on mu as well. When you compute y = sigma*z + mu with the unchanged regressor scaler, you'll get lower (and potentially negative) ys, because your z was scaled during training with much larger parameters (mu increases, see my previous post). Note that this happens for both y and yhat - so mu cancels out in the MSE, which therefore looks nicer. So, effectively, if the regressor scaler is not updated, we are not looking at the fluxes we are actually trying to fit. The better MSEs are a complete artifact of the way the scaling is currently done and should not be used in the paper.
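
A toy illustration of the artefact (all numbers made up):

```python
import numpy as np

y_true = np.array([5.0, 30.0, 80.0])   # positive fluxes we want to fit
mu_new, sigma_new = 40.0, 30.0         # dataset stats after acquisitions (grown)
mu_old, sigma_old = 10.0, 22.0         # stale regressor scaler

z = (y_true - mu_new) / sigma_new      # what the network is actually trained on
y_back = sigma_old * z + mu_old        # unscaled with the stale scaler
print(y_back)                          # [-15.67   2.67  39.33] -> negative flux
```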

Also, I have tried again the test from this discussion: I confirm that the performance in terms of unscaled loss is the same for the two cases: 1. the case where we do the scaling and update the regressor's scaler as well (teal line); 2. the case where the unscale/rescale block is commented out and the scaler is kept fixed (magenta line):

[image]

In summary:

  1. The way the MSE is calculated without updating the regressor scaler is wrong.
  2. All the experiments in the paper must be run again making sure that the regressor scaler is updated, as I had indicated before the original submission.
  3. The scaled loss is better for the teal line in the last plot. I don't know why. Perhaps rescaling the inputs only during the AL loop might still help with performance. For future reference, one could exclude the output variable from the scaling during the AL loop and not scale the regressor either.

I have been busy figuring this out over the last couple of days, so I haven't done any work on the CL. I'll try to plough through it today and early next week (my parents are visiting this weekend). In particular, I'll use the "scaling regressor" option (i.e. uncomment this line).

If you can, please update all the plots, tables and results pertaining to AL in the paper - it would really be great. You can use the "scaling regressor" option and, if you've got the time, check whether my suggestion in point 3 above works.

I appreciate this is quite stressful and annoying, but unfortunately it's necessary. Let me know if you need any help! Also, could you please acknowledge that you've seen this? Otherwise I won't know whether you're on it.

zcapjdb (Owner) commented Jul 15, 2022

Hi Lorenzo, I am still mid-transit (I ended up missing my connecting flight from NYC yesterday). I get into Baltimore in the afternoon, so it'll probably be tomorrow before I'm able to try running your new changes.
