
How to derive the architecture? Running train_search on cifar10, but the architecture is different #32

Closed
Ian09 opened this issue Aug 14, 2018 · 12 comments

Ian09 commented Aug 14, 2018

Hi
I have 2 questions about how to derive the final architecture:

  1. I used the default search script on cifar10, python train_search.py --unrolled --seed 0, to search for an architecture. The architecture I get is different from the one reported in the paper (and in the genotypes.py of this repo), and if I change the seed, the architecture changes again. In the paper, the authors mention that the results were obtained from 4 runs. So my questions: do the 4 runs use the same architecture, or 4 different architectures? And how was the architecture illustrated in the paper selected?

  2. In my own architecture search run, I found that the probability of the zero op is the highest. However, in the paper the authors mention that the zero op is not used in the final architecture (Sec. 2.4), and the code confirms this. My question: if the zero op is never used, why add it to the search space at all? It seems strange, because if we did not exclude the zero op, all the ops would be zero ;-(. Did the authors run into the same problem? For example, the alphas for the normal cell are

[[0.1838, 0.0982, 0.081 , 0.1736, 0.1812, 0.0846, 0.091 , 0.1066],
[0.4717, 0.0458, 0.0496, 0.0945, 0.1113, 0.0556, 0.0953, 0.0762],
[0.2946, 0.1425, 0.0855, 0.1768, 0.0837, 0.0735, 0.0731, 0.0704],
[0.3991, 0.0631, 0.0581, 0.1053, 0.1307, 0.0577, 0.1043, 0.0817],
[0.6298, 0.0382, 0.035 , 0.0658, 0.0435, 0.0551, 0.0605, 0.0721],
[0.3526, 0.0974, 0.0693, 0.1346, 0.1245, 0.0697, 0.091 , 0.061 ],
[0.4829, 0.06 , 0.0612, 0.115 , 0.0969, 0.065 , 0.0624, 0.0565],
[0.6591, 0.0303, 0.0282, 0.0558, 0.0578, 0.054 , 0.0581, 0.0568],
[0.7612, 0.0199, 0.0207, 0.0294, 0.0343, 0.0442, 0.0431, 0.0472],
[0.3519, 0.1231, 0.0692, 0.1381, 0.0925, 0.076 , 0.0748, 0.0744],
[0.4767, 0.0781, 0.0679, 0.1216, 0.0679, 0.0701, 0.0548, 0.0629],
[0.6769, 0.032 , 0.0292, 0.0547, 0.0533, 0.0427, 0.0614, 0.0498],
[0.7918, 0.0191, 0.0199, 0.0279, 0.0423, 0.0223, 0.0392, 0.0375],
[0.8325, 0.0153, 0.0158, 0.0199, 0.0284, 0.0255, 0.0313, 0.0313]]

Each row gives the probabilities of ['none', 'max_pool_3x3', 'avg_pool_3x3', 'skip_connect', 'sep_conv_3x3', 'sep_conv_5x5', 'dil_conv_3x3', 'dil_conv_5x5'] for one edge.
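
For reference, here is a minimal sketch of how such probabilities can be dumped during search, assuming the search model exposes alphas_normal as in this repo's model_search.py:

```python
import torch.nn.functional as F

# Mixing probabilities for the normal cell: one row per edge, one column
# per candidate op (the first column is 'none'). `model` is assumed to be
# the Network defined in model_search.py.
probs = F.softmax(model.alphas_normal, dim=-1)
print(probs.data.cpu().numpy())
```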

quark0 commented Aug 14, 2018

  1. Very likely you won't get the same cell each time due to cuDNN and GPU parallelism, even with a fixed seed. We repeated the search 4 times to find the best resulting cell based on the validation curves in Figure 3, and then also evaluated that single best cell 4 times to obtain its mean and standard deviation.

  2. Zeros are auxiliary ops that allow different edges to have different total strengths. Their absolute strengths do not matter, because multiplying all edges in the cell by the same factor does not really affect its topology. Note that in this work we are not trying to learn the sparsity of the architecture at all, as we always assume each node has 2 predecessors (as in NASNets, AmoebaNets, etc.).

Ian09 commented Aug 15, 2018

Hi

Thanks very much for the reply!

About answer 1, may I ask how you split the validation set? In your initial code, it seems that you used the "test" data as the "validation" data during search. In commit a00cad0, you fixed this problem. So do you just split the real training set half-and-half into "train" and "valid" sets, and use the performance on the "valid" set (half of the real training set) to select the best architecture?

quark0 commented Aug 15, 2018

The test set was never used, even before that commit (valid_queue was never actually used for arch search in the old version). The variable names were indeed a bit confusing, which is why I fixed them.

Yes, your understanding about architecture selection is correct.
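
For concreteness, a minimal sketch of that half-and-half split, following the pattern used in this repo's train_search.py (batch size and data path are illustrative):

```python
import numpy as np
import torch
import torchvision.datasets as dset
import torchvision.transforms as transforms

# Load the CIFAR-10 training set once; it is split into two halves below.
train_data = dset.CIFAR10(root='./data', train=True, download=True,
                          transform=transforms.ToTensor())

num_train = len(train_data)              # 50000
indices = list(range(num_train))
split = int(np.floor(0.5 * num_train))   # train_portion = 0.5

# First half trains the weights w; second half trains alpha and doubles
# as the "valid" set whose accuracy is used to pick the best search run.
train_queue = torch.utils.data.DataLoader(
    train_data, batch_size=64,
    sampler=torch.utils.data.sampler.SubsetRandomSampler(indices[:split]))
valid_queue = torch.utils.data.DataLoader(
    train_data, batch_size=64,
    sampler=torch.utils.data.sampler.SubsetRandomSampler(indices[split:]))
```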

quark0 closed this as completed Aug 15, 2018
Ian09 commented Aug 15, 2018

Hi

And about answer 2, could you please explain why you want to "allow different edges to have different total strengths"? When you generate the final architecture, you take a softmax over the alphas, so the total strengths of these edges are not used. Is there any reason for that? Thanks!

quark0 commented Aug 15, 2018

To derive the final architecture, we retain the top-2 strongest predecessors for each intermediate node (Sect. 2.4). The edge strengths are needed here to rank the candidate predecessors: while the argmax tells us which op to put on an edge, it does not tell us whether that particular edge should be retained.
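
A minimal sketch of this derivation, mirroring the parsing logic in genotype() in this repo's model_search.py (function and variable names here are illustrative):

```python
import numpy as np

PRIMITIVES = ['none', 'max_pool_3x3', 'avg_pool_3x3', 'skip_connect',
              'sep_conv_3x3', 'sep_conv_5x5', 'dil_conv_3x3', 'dil_conv_5x5']

def parse(weights, steps=4):
    """weights: (num_edges, num_ops) softmaxed alphas, with edges ordered as
    in model_search.py (intermediate node i has i + 2 candidate predecessors).
    Returns a list of (op, predecessor) pairs, two per intermediate node."""
    none_idx = PRIMITIVES.index('none')
    gene, start = [], 0
    for i in range(steps):
        end = start + i + 2
        W = weights[start:end]  # one row per candidate predecessor of node i
        # Rank predecessors by their strongest non-'none' op; keep the top 2.
        edges = sorted(range(i + 2),
                       key=lambda x: -max(W[x][k] for k in range(len(W[x]))
                                          if k != none_idx))[:2]
        for j in edges:
            # Best op on each retained edge, again excluding 'none'.
            k_best = max((k for k in range(len(W[j])) if k != none_idx),
                         key=lambda k: W[j][k])
            gene.append((PRIMITIVES[k_best], j))
        start = end
    return gene
```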

Ian09 commented Aug 15, 2018

Hi
One thing I don't quite understand:
What is the definition of "edge strength"? Is it the sum of all alphas for that edge? If so, even without the zero op, the "edge strengths" would still differ. So I don't quite understand why "zeros are auxiliary ops that allow different edges to have different total strengths".

DARTS is really nice work! I just want to understand the model better; no offense intended.
I appreciate your help!

quark0 commented Aug 15, 2018

Sure, I hope the following helps (the "strength" I am referring to here is slightly different from that in the paper):

strength of op_i
  = softmax_{i over all ops}(\alpha_i)
  = softmax_{i over non-zero ops}(\alpha_i) * (1 - p_zero)
  = softmax_{i over non-zero ops}(\alpha_i) * strength_of_the_edge

(where p_zero is the mixing probability of the zero op on that edge)

In other words, strength means the sum of the mixing probabilities of all non-zero ops on a given edge.

Also, suppose you had only a single non-zero op (e.g. conv). Without introducing the zero op, all edges would be equally important, making it impossible to determine the two predecessors for each node.
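
A quick numeric check against the first and last rows of the alphas posted above:

```python
import numpy as np

# First and last rows of the normal-cell alphas from the original post.
edge_first = np.array([0.1838, 0.0982, 0.081, 0.1736, 0.1812, 0.0846, 0.091, 0.1066])
edge_last  = np.array([0.8325, 0.0153, 0.0158, 0.0199, 0.0284, 0.0255, 0.0313, 0.0313])

# Edge strength = sum of the non-zero-op probabilities = 1 - p_zero.
print(edge_first[1:].sum(), 1 - edge_first[0])  # ~0.8162 ~0.8162
print(edge_last[1:].sum(),  1 - edge_last[0])   # ~0.1675 ~0.1675
```

So the first edge is much "stronger" (~0.82) than the last one (~0.17), which is exactly the information used to rank candidate predecessors.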

Ian09 commented Aug 15, 2018

Got it. Thanks!

Ian09 commented Aug 16, 2018

Hi
One more question: you split the training set half-and-half, use one half to train the parameters w and the other half to train the architecture alpha, and then also use that second half as the validation set to select the best architecture. Couldn't that be a problem, since the second half is used both to train alpha and to validate the performance?

quark0 commented Aug 16, 2018

"Overfitting" \alpha to the validation set is precisely what we want. While there could be better selection strategies, exploring them is beyond the scope of this paper.

PipiZong commented May 9, 2020

> Very likely you won't get the same cell each time due to cuDNN and GPU parallelism even with a fixed seed. [...]

Hi @quark0,

About question 1, I found that even if we disable cuDNN and set num_workers to 0 in the dataloader, we still cannot reproduce the same search result with a fixed seed across multiple runs. What other factors do you think affect reproducibility?
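
For anyone debugging this, a minimal sketch of the usual PyTorch determinism settings (this covers the common sources of randomness, though some CUDA kernels remain nondeterministic depending on the PyTorch version, which may explain the residual variation):

```python
import random
import numpy as np
import torch

def set_deterministic(seed=0):
    # Seed every RNG the training loop might touch.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Force cuDNN to pick deterministic kernels and disable autotuning,
    # which can otherwise select a different algorithm on each run.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
```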

d12306 commented Oct 6, 2020

@PipiZong, have you been able to reproduce the same structure as in the paper? I have not been able to reproduce it either.
