
How to derive the architecture? Running train_search on cifar10, but the architecture is different #32

Closed
Ian09 opened this issue Aug 14, 2018 · 12 comments

Ian09 commented Aug 14, 2018

Hi
I have 2 questions about how to derive the final architecture:

  1. I used the default search script on cifar10, python train_search.py --unrolled --seed 0, to search for an architecture. The architecture I get is different from the one reported in the paper (and in the genotypes.py of this repo), and if I change the seed, the architecture changes again. In the paper, the authors mention that the results were obtained from 4 runs. So my questions: do the 4 runs use the same architecture, or 4 different architectures? And how was the architecture illustrated in the paper selected?

  2. In my own architecture search run, I found that the probability of the zero op is the highest. However, in the paper the authors mention that the zero op is not used in the final architecture (Sec. 2.4), and the code confirms this. My question: if the zero op is never used, why add it to the search space at all? It seems strange, because if we did not exclude the zero op, all the ops would be zero ;-(. Did the authors run into the same problem? For example, the alphas for the normal cell are

[[0.1838, 0.0982, 0.081 , 0.1736, 0.1812, 0.0846, 0.091 , 0.1066],
[0.4717, 0.0458, 0.0496, 0.0945, 0.1113, 0.0556, 0.0953, 0.0762],
[0.2946, 0.1425, 0.0855, 0.1768, 0.0837, 0.0735, 0.0731, 0.0704],
[0.3991, 0.0631, 0.0581, 0.1053, 0.1307, 0.0577, 0.1043, 0.0817],
[0.6298, 0.0382, 0.035 , 0.0658, 0.0435, 0.0551, 0.0605, 0.0721],
[0.3526, 0.0974, 0.0693, 0.1346, 0.1245, 0.0697, 0.091 , 0.061 ],
[0.4829, 0.06 , 0.0612, 0.115 , 0.0969, 0.065 , 0.0624, 0.0565],
[0.6591, 0.0303, 0.0282, 0.0558, 0.0578, 0.054 , 0.0581, 0.0568],
[0.7612, 0.0199, 0.0207, 0.0294, 0.0343, 0.0442, 0.0431, 0.0472],
[0.3519, 0.1231, 0.0692, 0.1381, 0.0925, 0.076 , 0.0748, 0.0744],
[0.4767, 0.0781, 0.0679, 0.1216, 0.0679, 0.0701, 0.0548, 0.0629],
[0.6769, 0.032 , 0.0292, 0.0547, 0.0533, 0.0427, 0.0614, 0.0498],
[0.7918, 0.0191, 0.0199, 0.0279, 0.0423, 0.0223, 0.0392, 0.0375],
[0.8325, 0.0153, 0.0158, 0.0199, 0.0284, 0.0255, 0.0313, 0.0313]]

Each row gives the probabilities of ['none', 'max_pool_3x3', 'avg_pool_3x3', 'skip_connect', 'sep_conv_3x3', 'sep_conv_5x5', 'dil_conv_3x3', 'dil_conv_5x5'] for one edge.
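
For reference, here is a minimal sketch of how such probabilities can be dumped during search, assuming the search model exposes alphas_normal as in this repo's model_search.py:

```python
import torch.nn.functional as F

# Mixing probabilities for the normal cell: one row per edge, one column
# per candidate op (the first column is 'none'). `model` is assumed to be
# the Network defined in model_search.py.
probs = F.softmax(model.alphas_normal, dim=-1)
print(probs.data.cpu().numpy())
```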

quark0 commented Aug 14, 2018

  1. Very likely you won't get the same cell each time due to cuDNN and GPU parallelism, even with a fixed seed. We repeated the search 4 times to find the best resulting cell based on the validation curves in Figure 3, and then also evaluated that single best cell 4 times to obtain its mean and standard deviation.

  2. Zeros are auxiliary ops that allow different edges to have different total strengths. Their absolute strengths do not matter, because multiplying all edges in the cell by the same factor does not really affect its topology. Note that in this work we are not trying to learn the sparsity of the architecture at all, as we always assume each node has 2 predecessors (as in NASNets, AmoebaNets, etc.).

Ian09 commented Aug 15, 2018

Hi

Thanks very much for the reply!

About answer 1, may I ask how you split the validation set? In your initial code, it seems that you used the "test" data as the "validation" data during search. In commit a00cad0, you fixed this problem. So do you just split the real training set half-and-half into "train" and "valid" sets, and use the performance on the "valid" set (half of the real training set) to select the best architecture?

quark0 commented Aug 15, 2018

The test set was never used, even before that commit (valid_queue was never actually used for arch search in the old version). The variable names were indeed a bit confusing, which is why I fixed them.

Yes, your understanding about architecture selection is correct.
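
For concreteness, a minimal sketch of that half-and-half split, following the pattern used in this repo's train_search.py (batch size and data path are illustrative):

```python
import numpy as np
import torch
import torchvision.datasets as dset
import torchvision.transforms as transforms

# Load the CIFAR-10 training set once; it is split into two halves below.
train_data = dset.CIFAR10(root='./data', train=True, download=True,
                          transform=transforms.ToTensor())

num_train = len(train_data)              # 50000
indices = list(range(num_train))
split = int(np.floor(0.5 * num_train))   # train_portion = 0.5

# First half trains the weights w; second half trains alpha and doubles
# as the "valid" set whose accuracy is used to pick the best search run.
train_queue = torch.utils.data.DataLoader(
    train_data, batch_size=64,
    sampler=torch.utils.data.sampler.SubsetRandomSampler(indices[:split]))
valid_queue = torch.utils.data.DataLoader(
    train_data, batch_size=64,
    sampler=torch.utils.data.sampler.SubsetRandomSampler(indices[split:]))
```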

quark0 closed this as completed Aug 15, 2018
Ian09 commented Aug 15, 2018

Hi

And about answer 2, could you please explain why you want to "allow different edges to have different total strengths"? When you generate the final architecture, you take a softmax over the alphas, so the total strengths of these edges are not used. Is there any reason for that? Thanks!

quark0 commented Aug 15, 2018

To derive the final architecture, we retain the top-2 strongest predecessors for each intermediate node (Sect. 2.4). The edge strengths are needed here to rank the candidate predecessors: while the argmax tells us which op to put on an edge, it does not tell us whether that particular edge should be retained.
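
A minimal sketch of this derivation, mirroring the parsing logic in genotype() in this repo's model_search.py (function and variable names here are illustrative):

```python
import numpy as np

PRIMITIVES = ['none', 'max_pool_3x3', 'avg_pool_3x3', 'skip_connect',
              'sep_conv_3x3', 'sep_conv_5x5', 'dil_conv_3x3', 'dil_conv_5x5']

def parse(weights, steps=4):
    """weights: (num_edges, num_ops) softmaxed alphas, with edges ordered as
    in model_search.py (intermediate node i has i + 2 candidate predecessors).
    Returns a list of (op, predecessor) pairs, two per intermediate node."""
    none_idx = PRIMITIVES.index('none')
    gene, start = [], 0
    for i in range(steps):
        end = start + i + 2
        W = weights[start:end]  # one row per candidate predecessor of node i
        # Rank predecessors by their strongest non-'none' op; keep the top 2.
        edges = sorted(range(i + 2),
                       key=lambda x: -max(W[x][k] for k in range(len(W[x]))
                                          if k != none_idx))[:2]
        for j in edges:
            # Best op on each retained edge, again excluding 'none'.
            k_best = max((k for k in range(len(W[j])) if k != none_idx),
                         key=lambda k: W[j][k])
            gene.append((PRIMITIVES[k_best], j))
        start = end
    return gene
```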

Ian09 commented Aug 15, 2018

Hi
One thing I don't quite understand:
What is the definition of "edge strength"? Is it the sum of all alphas for that edge? If so, even without the zero op, the "edge strengths" would still differ. So I don't quite understand why "zeros are auxiliary ops that allow different edges to have different total strengths".

DARTS is really nice work! I just want to understand the model better; no offense intended.
I appreciate your help!

quark0 commented Aug 15, 2018

Sure, I hope the following helps (the "strength" I am referring to here is slightly different from that in the paper):

strength of op_i
  = softmax_{i over all ops}(\alpha_i)
  = softmax_{i over non-zero ops}(\alpha_i) * (1 - p_zero)
  = softmax_{i over non-zero ops}(\alpha_i) * strength_of_the_edge

(where p_zero is the mixing probability of the zero op on that edge)

In other words, strength means the sum of the mixing probabilities of all non-zero ops on a given edge.

Also, suppose you had only a single non-zero op (e.g. conv). Without introducing the zero op, all edges would be equally important, making it impossible to determine the two predecessors for each node.
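
A quick numeric check against the first and last rows of the alphas posted above:

```python
import numpy as np

# First and last rows of the normal-cell alphas from the original post.
edge_first = np.array([0.1838, 0.0982, 0.081, 0.1736, 0.1812, 0.0846, 0.091, 0.1066])
edge_last  = np.array([0.8325, 0.0153, 0.0158, 0.0199, 0.0284, 0.0255, 0.0313, 0.0313])

# Edge strength = sum of the non-zero-op probabilities = 1 - p_zero.
print(edge_first[1:].sum(), 1 - edge_first[0])  # ~0.8162 ~0.8162
print(edge_last[1:].sum(),  1 - edge_last[0])   # ~0.1675 ~0.1675
```

So the first edge is much "stronger" (~0.82) than the last one (~0.17), which is exactly the information used to rank candidate predecessors.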

Ian09 commented Aug 15, 2018

Got it. Thanks!

Ian09 commented Aug 16, 2018

Hi
One more question: you split the training set half-and-half, use one half to train the parameters w and the other half to train the architecture alpha, and then also use that second half as the validation set to select the best architecture. Couldn't that be a problem, since the second half is used both to train alpha and to validate the performance?

quark0 commented Aug 16, 2018

"Overfitting" \alpha to the validation set is precisely what we want. While there could be better selection strategies, exploring them is beyond the scope of this paper.

PipiZong commented May 9, 2020

> Very likely you won't get the same cell each time due to cuDNN and GPU parallelism even with a fixed seed. [...]

Hi @quark0,

About question 1, I found that even if we disable cuDNN and set num_workers to 0 in the dataloader, we still cannot reproduce the same search result with a fixed seed across multiple runs. What other factors do you think affect reproducibility?
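
For anyone debugging this, a minimal sketch of the usual PyTorch determinism settings (this covers the common sources of randomness, though some CUDA kernels remain nondeterministic depending on the PyTorch version, which may explain the residual variation):

```python
import random
import numpy as np
import torch

def set_deterministic(seed=0):
    # Seed every RNG the training loop might touch.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Force cuDNN to pick deterministic kernels and disable autotuning,
    # which can otherwise select a different algorithm on each run.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
```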

d12306 commented Oct 6, 2020

@PipiZong, have you been able to reproduce the same structure as in the paper? I have not been able to reproduce it either.
