
Paper: Case study: Real-world machine learning application for hardware failure detection #468

Open · wants to merge 20 commits into base: 2019

Conversation · 3 participants
@hongsups commented May 22, 2019

If you are creating this PR in order to submit a draft of your paper,
see http://procbuild.scipy.org/ for logs generated by the build
process.

See the project readme
for more information.

@hongsups changed the title from "Hongsup shin" to "Paper: Case study: Real-world machine learning application for hardware failure detection" on May 22, 2019

@deniederhut (Member) commented Jun 13, 2019

Hi @hongsups ! It looks like none of our reviewers were able to complete their reviews for your paper by the deadline, so I'll be stepping in to complete the review myself. Please expect comments from me some time in the next handful of days 🙂.


.. class:: abstract

When designing microprocessors, engineers must verify whether the proposed design, defined in hardware description language, do what is intended. Duering this verification process, engineers run simulation tests and can fix bugs if tests have failed. Due to the complexity of design, the baseline approach is to provide random stimuli to verify random parts of the design. However, this method is time-consuming and redundant especially when the design becomes mature and thus failure rate is low. To increase efficiency and detect failures faster, we can build machine learning models by using previously run tests, and address the likelihood of failure of new tests. This way, instead of running random tests agnostically, engineers use the model prediction on a new set of tests and run a subset of tests (i.e., "filtering" the tests) that are more likely to fail. Due to the severe imbalance, I trained an ensemble of supervised (classification) and unsupervised models and used the union of the prediction from both models to catch more failures. The tool has been deployed as a complementary workflow early this year, which does not interfere the existing workflow. After the deployment, I found that the the "filtering" approach has limitations due to the randomness in test generation. In addition to introducing the relatively new data-driven approach in hardware design verification, this study also discusses the details of post-deployment evaluation such as retraining, and working around real-world constrations, which are not often discussed in machine learning and data science research.

@deniederhut (Member) commented Jun 14, 2019

I think you mean to say:

"does what is intended"
"During this verification"
"complexity of the design"
"real-world constraints"

What do you mean by "severe imbalanced"? Imbalance in the labels?

which are not often discussed in machine learning and data science research

😅 I don't know that I would say "often", but yes it is nice to see what can be done without infinite time and infinite money. Maybe we could rephrase this to "which are sometimes not discussed"?

@mepa commented Jun 14, 2019

@deniederhut, @hongsups, sorry for the duplicate comments from me. I had reviewed earlier this week but neglected to hit the "Finish review" button. Just did so now :)

@hongsups (Author) commented Jun 17, 2019

@deniederhut I added specific information about the class imbalance (as @mepa also suggested). Also, I rephrased the "often" according to your suggestion :)

@hongsups (Author) commented Jun 17, 2019

@deniederhut @mepa Btw, I am making adjustments to the manuscript while replying to your comments. I will make commits in batches, so there may be a delay between my answers here and the actual changes in the manuscript.

Simulation-based hardware verification
######################################

Hardware verification is the process of checking that a given design correctly implements the specification, which are the technical descriptions of the computer's components and capabilities. It is recoginsed as the largest task in silicon development and as such has the biggest impact on the key business drivers of quality, schedule and cost. In the computer hardware design cycle, microprocessor manufactoring companies often spend 60-70% of the cycle dedicated to the verification procedure. Traditionally, two techniques have been used: formal and simulation-based (random-constraint) methods [Ioa12]_. The former adopts mathematical approach such as theorem proving and requirement checks [Wil05]_, which provides exhaustiveness but doesn't scale well with design complexity. For instance, if a design has :math:`N` flip-flops (two states for each), the total number states is :math:`2^N`. Thus, more widely used approach is the simulation-based testing, which simulates the design (i.e., each line in hardware description language) by providing stimuli to tests. During the simulation-based testing, engineers provide a set of constraints to the stimuli so that they can direct the tests to a certain direction. However, it is never possible to target certain design part deterministically and engineers often depend on previous knowledge or intuition.

@deniederhut (Member) commented Jun 14, 2019

I think you might mean to say:

"implements a specification"
"which is the technical description"
"adopts a mathematical approach"
"the more widely used approach"

the total number states is :math:2^N

another way to say this is that the complexity grows exponentially

@hongsups (Author) commented Jun 17, 2019

@deniederhut I corrected the errors as you suggested. About the last one, the new sentence is: "Due to the exponentially-growing design complexity, the more widely used approach is the simulation-based testing, ..."

Previous machine-learning based approach
########################################

The ultimate goal of hardware verification is to have a (close-to) bug-free design. From the simulation-based testing perspective, this is an exploration problem where machine learning can be useful. For instance, reinforcement learning algorithm can be used to explore complex parameter space by learning the reward function [Ioa12]_. However, the main roadblock of the approach is that the simulation-based testing is not deterministic at all and it is extremely challenging to estimate the level of stochasiticity. Most of the previous research on hardware verification with the simulation-based testing approach has focused on supervised learning [Mam16]_, [Bar08]_, [Wag07]_ and evolutionary algorithms [Ber13]_, [Cru13]_. However, some approaches are based on probability-based methods, which are difficult to be scalable and most of the studies lack the mention of how their machine learning approach is actually implemented and used.

@deniederhut (Member) commented Jun 14, 2019

Did you mean to write "these approaches are based on probability-based methods"? Or maybe "these approaches tend to be based on..."?

I might replace "difficult to be scalable" with "difficult to scale". I might also replace "lack the mention" with "fail to mention".

@hongsups (Author) commented Jun 17, 2019

@deniederhut I rephrased the part about probabilistic methods and provided more details because I thought that part was somewhat vague (especially why it's challenging). Changed the rest according to the suggestion.

:scale: 40%
:align: center

Proportional relationship between the number of failures (x axis) and the number of unique fail signatures (UFS) on the y axis (mean and standard error from 100 random draws; among 250k simlated tests, I drew :math:`N` failed tests and counted the number of UFS, and repeated the process 100 times). The more failures occur, the more UFS are found. :label:`Fig.2`

@deniederhut (Member) commented Jun 14, 2019

"simulated"

Simulation-based testing in practice
####################################

In practice, a testbench is built to make the random simulation of tests efficient and be automated to a degree such as running a batch of tests regularly to explore random parts of the design. Once engineers run tests with certain input constraints, *settings*, and the simulation is finished, the results are obtained. These show which test has passed and failed (binary label) and specific signature of failure *unique failure signature (UFS)*, created by a hash function which converts a failure log into a hexadecimal code. The way engineers control the input settings vary widely. In an extreme case, they only control the seed number of a pseudo-random number generator for the entire test. Normally engineers have input settings of a test that either controls specific binary options (i.e. turning a specific condition on and off) or controls the range of stimuli values. For example, at a given input setting *A*, which decides the range of stimulu values, if *A = '1-5'*, this means, the final stimulus value is *a number between integers 1 and 5*. Unfortunately, there is no way to identify the final value. Even if we figure this out, it is impossible to guide the testbench to specifically generate a desired value because the testbench is designed to only behave stochastically. This means, even if engineers tweak certain settings (instead of just using the seed) to guide the tests, the randomness still remains because the remaining downstream process is randomized, which makes directing tests to certain design parts is challenging. This is also difficult for building a machine learning model because two tests with the exact same input settings can result in completely different outcome. Finally, the design changes almost everyday by engineers, which can potentially create rapid data drift, which may require model to be retrained frequently.

@deniederhut (Member) commented Jun 14, 2019

Did you mean to write:
"and includes the specific signature of the failure"
"which may require the model to be retrained"

@hongsups (Author) commented Jun 18, 2019

@deniederhut Fixed and changed the wording.

Deployment
##########

Other engineers and I wrote a Python script with in my group, which is a command-line tool that engineers can run without changing their main *random* flow. The script takes test candidates as input and by using the pre-trained models, make a binary prediction on whether a test candidate will fail or not. Note that whenever new test candidates are provided, we run a separate script that preprocesses the new data to be ready to be consumed by the pre-trained models. The test candidates are randomly generated by using the testbench and normally we generated about 1k test candidates so that at the end about 400 tests are filtered, which is the upper limit of the number of additional tests we can run. We decided to adjust the number of tests as we have better assessment of the model performance after the deployment. Finally, the script returns the unique identifier of the test candidates that are flagged as failure by the models. Then the script invokes a testbench simulation where it runs the filtered tests. After the deployment, we found that model performance had high variability. Figure :ref:`Fig.4` shows the model performance of the first 17 days (no retraining). The efficiency values were often larger than 1 but sometimes they changed dramatically. In the following sessions, I will address how I attempted to resolve this issue and found caveats of the "filtering" approach.

@deniederhut (Member) commented Jun 14, 2019

"script with in my group" -> "script within my group"
"In the following sessions" -> "in the following sections"

@hongsups (Author) commented Jun 18, 2019

Fixed.

:scale: 45%
:align: center

The effect of the number of tests that are provided to the models and the performance variability. Each vertical line represnets a single simulated run. Since we use the models to filter out the test candidates, the fewer tests we provide to the models, more likely that performance depends on how good the initial test candidates are. The more tests we provide, the performance becomes less variable. :label:`Fig.6`

@deniederhut (Member) commented Jun 14, 2019

"the performance becomes less variable" -> "the less variable the performance becomes"

@hongsups (Author) commented Jun 18, 2019

Fixed.

Data for retraining
###################

During the initial deployment stage, we retrained the models manually whenever we made major changes in tool for instance how we preprocess data or whenever the production engineers announced that there was a major change in the testbench or the design. In order to decide how much training data we would use to optimize the performance, we conducted an experiment by varying the size and the weight of the training data. Theoretically, it's possible to use the entire suite of tests that were every run. However, this requires long training time and it's possible that very old test data would be useless if the design has changed a lot since then. Hence, in the experiment, we implemented a varying size of rolling window and weight decay. The rolling window size decides the number of :math:`N` consecutive days to look back to build a training dataset. For instance, if :math:`N=7`, we use the past 7 days worth of simulated tests as our training data. The weight decay takes into account the recency effect of changes in the testbench; the data that was generated more recently has higher significance in training. We used 5 different windows (:math:`N = 3, 5, 7, 10, 14`) and multiplicative power decay with various power parameters to compute the weight :math:`w`, (:math:`w(t) = x^t` where :math:`x` is the power parameter (0.3, 0.6, 0.9, 1 (=no decay)) and :math:`t` is the number of days counting from today). For instance, if :math:`x=0.9`, tests that were run 2 days before today are 10% less important than yesterday's tests. These weights are applied to objective function during training by using ``sample_weight`` parameter in scikit-learn models’ ``fit()`` function, which allows users to assign weights during model fitting for every single data point. Since every day multiple tests are generated, same weights are assigned to data points if they were generated on a same day. Note that this weight adjustment was added on top of the class weight adjustment (``class_weight='balanced'``).

@deniederhut (Member) commented Jun 14, 2019

"whenever we made major changes in tool for instance" -> "whenever we made major changes in the tool, for instance"
"These weights are applied to objective function" -> "These weights are applied to the objective function"
"Since every day multiple tests are generated, same weights are assigned to data points if they were generated on a same day" -> "Since multiple tests are generated for each day, they each get the same weights."

@hongsups (Author) commented Jun 18, 2019

Fixed.
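
As a concrete illustration of the rolling window and weight decay described in the excerpt above, here is a minimal sketch. The ``days_ago`` array is a hypothetical per-row age column, and logistic regression stands in for the tuned supervised model; the actual production code is not shown in the paper.

.. code-block:: python

   import numpy as np
   from sklearn.linear_model import LogisticRegression

   def fit_with_decay(X, y, days_ago, window=14, power=0.9):
       """Train only on the last `window` days and down-weight older days by
       w(t) = power**t, where t counts days back from today."""
       days_ago = np.asarray(days_ago)
       keep = days_ago < window                         # rolling window of N days
       weights = power ** days_ago[keep].astype(float)  # multiplicative power decay
       clf = LogisticRegression(class_weight="balanced", max_iter=1000)
       # sample_weight stacks on top of the class-weight adjustment
       clf.fit(np.asarray(X)[keep], np.asarray(y)[keep], sample_weight=weights)
       return clf

The experiment discussed next then amounts to looping this over the window sizes (3, 5, 7, 10, 14) and the power parameters (0.3, 0.6, 0.9, 1).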



All combinatorial scenarios were tested via simulation across multiple datasets (Fig. :ref:`Fig.5`). When the rolling window is too small (e.g., :math:`N=3`), performance was low in both UFS recovery and efficiency metrics, which suggests 3-day dataset might not be enough for training. Having more dramatic decay tends to mimic the effect of having a smaller rolling window and generally degraded performance. In terms of performance stability over time, naturally, having a longer rolling window seemed better. As showed in Fig. :ref:`Fig.5` as orange box, we decided to use 14-day window without any decay even though the efficiency value was slightly higher in 7-day without any decay. This was to consider the fact that we might have to run a smaller number of tests in the future and thus 7-day window might not provide enough tests for training.

@deniederhut (Member) commented Jun 14, 2019

"When the rolling window is too small" -> "When the rolling window was too small"

@hongsups (Author) commented Jun 18, 2019

Fixed.

-----------
In real-world scenarios, it is often the case where one just does not have the complete freedom of algorithms or inifite amount of training resource. In hardware verification, the fact that tests are generated randomly challenge building machine learnig models because we can neither guide test generation nor measure stochasiticity easily. In addition, machine-learning approach is only useful when the design is mature and the majority of the tests that are run are pass but engineers are looking for failures, meaning the severe class imbalance of the training data. Finally, we cannot rely on single metric because our complementary flow competes against the existing workflow.

To address these issues, I have built a prototype that provide test candidates and filters out failure-prone tests instead of trying to guide the testbench itself, used both supervised and unsupervised models to address the problem as classification and outlier detection at the same time, customized the process of how to select the best model by looking at multiple metrics, and explore the idea of using continuous predictions instead of the binary to filter fewer but better candidates. I have also conducted experimetns to address the details of retraining and identifying the cause of performance instabilty, which are often overlooked but crucial in post-depoyment process. In summary, this work provides practical information when building a machine learning engineering product for hardware verification, where machine learning approaches are still relatively new.

@deniederhut (Member) commented Jun 14, 2019

"I have also conducted experimetns" -> "I have also conducted experiments"

@hongsups (Author) commented Jun 18, 2019

Fixed.

@deniederhut (Member) commented Jun 14, 2019

This is a great paper! I really enjoyed reading about your approach. I've left a couple of inline comments about spelling and grammar -- I think I caught most of the mistakes, but you may want to run this through a spellchecker just to be sure.

Resolved review thread: papers/Hongsup_Shin/Hongsup_Shin.rst (Outdated)
Simulation-based hardware verification
######################################


@mepa commented Jun 14, 2019

  • "specification" -> specifications
  • "recoginsed" -> recognized
  • "approach" -> approaches
  • "Thus, the more widely..."
  • "During the simulation-based testing" -> Please drop "the" in this case and all cases below (unless referring to a particular "testing" in which case it could be clarified which testing is in mind)
  • "part" -> parts

@hongsups (Author) commented Jun 17, 2019

Thanks @mepa, I corrected typos and grammatical errors.

Previous machine-learning based approach
########################################


@mepa commented Jun 14, 2019

  • "bug-free design" -> It may be interesting to a non-specialized audience to read specific examples of bugs met during hardware verification.
  • "For instance, reinforcement learning algorithm..." -> Correct singular/plural agreement in this sentence.
  • "of the approach" -> of this approach

@hongsups (Author) commented Jun 17, 2019

@mepa I wrote a short paragraph that describes what a hardware bug means and also addresses a couple of general examples. The grammatical errors are fixed.

Simulation-based testing in practice
####################################


@mepa commented Jun 14, 2019

  • "stimulu" -> stimulus
  • "parts is challenging" -> parts challenging
  • "everyday" -> every day
  • "require model" -> require the model

@hongsups (Author) commented Jun 18, 2019

@mepa Fixed the errors. Also added more detailed information about how a testbench works so that the readers have a better idea of how a test is generated.

Post-deployment analysis
########################

I used both supervised and unsupervised models to address the severe class imbalance problem and used the union of the prediction from both models. With this approach, for a set of independent testing datasets, it was possible to find 80% of unique failure signatures (Fig. :ref:`Fig.3`) by running only 40% of tests on average, compared to running tests based on the original simulation-based method. The tool has been deployed and been used since early this year. It is not common in both machine learning and hardware verification literature to find how suggested models perform in real-world setting; often machine learning papers show performance based on a single limited dataset or use commonly used benchmark datasets. In this paper, I address this and attempt to provide practical insights to post-deployment process such as decisions regarding the automation of model retraining and addressing randomness in post-deployment datasets.

@mepa commented Jun 14, 2019

  • I suggest clarifying "the severe class imbalance problem" on first time meeting the phrase, though I see that it's spelled out under the Models paragraph.
  • "has been deployed" -> to where? used by whom and how often? or, if it will be described further in a later section it may be nice to state that here
  • "insights to post-deployment" -> insights to the post-deployment

@hongsups (Author) commented Jun 18, 2019

@mepa

  • I added the details of the class imbalance (class ratio) to the very beginning when I first mention this.
  • Also added more details about the deployment ("The tool has been deployed in our internal cluster as a part of daily verification workflow, which is used by verification engineers in the production team.")
  • Fixed the error.
Data
####

Simulation-based testing is run almost every day via testbench. Every simulated test and the result, such as whether the test has passed or failed (and its UFS), is stored in a database. ngineers push various commits to the testbench almost daily, which suggests changes in data generation process on a daily basis. This may include new implementation or modification in the design or even bug fixes. Depending on the amount of changes, the data drift might be significant. To address this issue, we collected two datasets. The first dataset ("snapshot") was generated from a same version of testbench (115k tests). For the second set, we collected a month worth of data (ca. 6k tests per day). The second dataset ("1-month") is collected specifically to simulate retraining scenarios and to challenge our model for every-day changes in the testbench (150k). Both datasets are from a specific unit of a microprocessor with a specific test scenario. The input data have individual tests as rows and test settings (stimuli) as columns. The total number of columns are in the range of several hundreds. The data were preprocessed based on the consultation with domain experts and stakeholders. The output data have tests as rows and two columns, one for pass/fail binary label and the other for UFS for the failed tests.

@mepa commented Jun 14, 2019

  • "ngineers" -> engineers

@hongsups (Author) commented Jun 18, 2019

Fixed.
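
For readers who want a mental model of the tables described in the Data section above, a toy layout might look as follows; the setting columns and test identifiers are hypothetical (only ``setting_A_on`` is mentioned later in the discussion).

.. code-block:: python

   import pandas as pd

   # One row per simulated test, a few hundred preprocessed setting columns,
   # and two output columns (pass/fail label plus UFS for failed tests).
   X = pd.DataFrame(
       {"setting_A_on": [1, 0, 1], "setting_B_range": [3, 5, 1]},
       index=["test_001", "test_002", "test_003"],
   )
   y = pd.DataFrame(
       {"failed": [0, 1, 0],
        "ufs": [None, "0x1a2b3c4d", None]},
       index=X.index,
   )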

Models
######

I used an ensemble of a supervised and an unsupervised learning model. Due to the severe class imbalance between passes and failures (near 99% pass and 1% failure ratio) in the training data, we can train a supervised model with adjusted class weight or either build an unsupervised model to detects outliers (i.e. failures). In a preliminary analysis, I found that supervised and the unsupervised models provided predictions that are qualitatively different. The fail signatures (UFS), which describe the reason of failures, from the supervised model’s prediction and the unsupervised one’s were not identical although there were some overlaps. Thus, when we computed the union of both predictions, we did see a small increase of fail signature recovery across many testing datasets. Due to the frequent changes in data generation process (near-daily change in the testbench), I decided to use algorithms robust to frequent retraining and tuning. We used a group of non-neural-net scikit-learn (v0.20.2) classifiers as supervised and isolation forest as unsupervised learning algorithms. For both cases, I conducted randomized search to tune the hyperparameters and select the best model.

@mepa commented Jun 14, 2019

  • "we can train a supervised model with adjusted class weight or either" -> we can either train a supervised model with adjusted class weight or
  • "I found that supervised and the unsupervised" -> I found that the supervised and unsupervised

@hongsups (Author) commented Jun 18, 2019

Fixed.

Metrics
#######

For both supervised and unsupervised models, I used recall and precision as basic metric but also used more practical metrics. For the unsupervised, I treated the outliers and failures and computed the metrics.

@mepa commented Jun 14, 2019

  • "as basic metric" -> as basic metrics
  • "I treated the outliers and failures" -> what does "treated" mean here?

@hongsups (Author) commented Jun 18, 2019
  • Fixed it to "metrics"
  • I added description in the first paragraph under "Models" section: "For the unsupervised, because the majority of our training data is passed tests, it is possible to consider the failures as outliers or abnormalities. " and removed the sentence, "treated the outliers..."
@hongsups (Author) commented Jun 14, 2019


Thanks, Dillon. Would you want me to wait until you complete the review to make changes in the manuscript, or is it okay if I start working on it as soon as possible? Honestly, the draft was submitted in a rush, so I think I can fix some problems by myself.

@deniederhut (Member) commented Jun 14, 2019

My review is already submitted - you should be able to see the inline comments here on this page.

@mepa commented Jun 15, 2019


Hi @hongsups, @deniederhut - apologies for not getting my initial review in on time. I think you should go ahead and make changes in the manuscript; then we can continue adding more content-based review comments before the final deadline.

@hongsups (Author) commented Jun 17, 2019


@mepa Will do. I will upload the updated manuscript in 2-3 days this week!

@deniederhut (Member) commented Jun 18, 2019

Awesome! Please @ us when you are ready for us to have a second look 😄

@hongsups (Author) commented Jun 18, 2019


Will do! I was using @ for every comment to reply mainly because I wasn't sure whether I should refer or not, but I will just use it once the manuscript is ready for the second look. I will also try to commit in batches. Thanks for the feedback :)

@hongsups (Author) commented Jun 18, 2019


Thank you so much :) I submitted in a rush, so there were a lot of typos and errors. I will make sure I fix as many as possible. I am also running spellcheckers!

Hongsup Shin added some commits Jun 18, 2019

Hongsup Shin: updated the manuscript by addressing all the comments made by both reviewers. Currently working on editing the second half of the manuscript. Will notify the update once it's done.
@hongsups (Author) commented Jun 19, 2019

@mepa @deniederhut Hi! I just uploaded the updated manuscript. It's ready for the second review. Btw, is the deadline June 25? Anyway, I will be frequently checking this page, so please feel free to leave any comments and questions. I will do my best to answer!

@deniederhut (Member) left a comment

This is looking really good! I've left some more phrasing suggestions. Also, it looks like your citations are missing DOIs. Could I ask you to add those in?

Simulation-based hardware verification
######################################

Hardware verification is the process of checking that a given design correctly implements the specifications, which is the technical description of the computer's components and capabilities. It is recognized as the largest task in silicon development and as such has the biggest impact on the key business drivers of quality, schedule and cost. In the computer hardware design cycle, microprocessor manufacturing companies often spend 60-70% of the cycle dedicated to the verification procedure. Traditionally, two techniques have been used: formal and simulation-based (random-constraint) methods [Ioa12]_. The former adopts a mathematical approach such as theorem proving and requirement checks [Wil05]_, which provides exhaustiveness but doesn't scale well with design complexity. Due to the exponentially-growing design complexity, the more widely used approach is the simulation-based testing, which simulates a design by providing stimuli to tests. During simulation-based testing, verification engineers provide a set of constraints to stimuli so that they can direct tests to a certain direction. However, it is not easy to target certain design parts deterministically and engineers often depend on previous knowledge or intuition.

@deniederhut (Member) commented Jun 20, 2019

"specifications, which are"

@hongsups (Author) commented Jun 20, 2019

Fixed.

@mepa commented Jun 23, 2019

  • what do "stimuli" mean in this context?
  • "direct tests toward a certain direction", or "in"

Overall this paragraph is clear, informative, and motivates the approach taken in the paper. Really nice intro!

@hongsups (Author) commented Jun 23, 2019
  • I added a sentence to provide the details on "stimuli". "These stimuli can be considered as arbitrary values that control certain functionalities of the design that were expressed in hardware description language such as whether to turn on or off a specific setting."
  • Fixed. Used "toward"
  • Thank you! It took me a while to understand the background since I am not a hardware engineer, so your feedback means a lot!
Previous machine-learning based approach
########################################

The ultimate goal of hardware verification is to have a (close-to) failure-free design. From the simulation-based testing perspective, this is an exploration problem where machine learning can be useful. For instance, reinforcement learning algorithms can be used to explore the complex space of test stimuli by learning a reward function [Ioa12]_. However, this approach is not feasible because the simulation-based testing is non-deterministic and intractable, which makes it difficult to estimate the level of stochasticity. This is mainly because the motivation for the simulation-based approach is randomization, often implemented in multiple steps (i.e., a value in an input setting randomizes a value in the next step, which then randomizes a value of a different setting in the following step, etc.). The testing tools have not been built to track these setting values and the information on probability distributions used in the randomization process have been left out. To address this, a few studies [Bar08]_, [Fin09]_ adopted probabilistic approach but they failed to mention actual implementation in production cycle and scalability issue. The majority of the previous research on hardware verification with the simulation-based testing approach has focused on supervised learning [Mam16]_, [Bar08]_, [Wag07]_ and evolutionary algorithms [Ber13]_, [Cru13]_. [Mam16]_ has shown a study that is the closest to this study in nature but the authors focused on high-level instruction set simulator (ISS), which generates instructions at a higher level (related to hardware performance, a high-level metric) than the design level.

@deniederhut (Member) commented Jun 20, 2019

"a few studies have adopted a probabilistic approach, but they failed"

@hongsups (Author) commented Jun 20, 2019

Fixed.

Simulation-based testing in practice
####################################

In practice, engineers build a testbench to house all the components that are needed for the verification process: test generator, interface, driver, monitor, model, and scoreboard. To run tests, verification engineers define a set of values as *input settings*, which can be compared to input arguments to a function of a code. These values are passed to the test generator, and under certain constraints, a series of subsequent values that stimulate various parts of the design are *randomly generated*. This information is then passed to the interface through the driver. The interface interacts with a design part (register-transfer level (RTL) design written in HDL) and then the returned output is fed into the monitor. To evaluate the result, the desirable output should be retrieved. This information is stored in the model, which is connected to the driver. A test is identified as failure when the the desirable output from the driver (through the model) and the output from the monitor do not match. In addition to the binary label of pass or failure, testbench also returns a log file of failure, if the test has failed. This log contains detailed information of the failure. Each failure log is encoded as an 8-digit hexadecimal code by a hash function. This code is called *unique failure signature (UFS)*. In general, instead of inspecting every failure log, engineers are more interested in maximizing the number of UFS that are collected after a batch of tests. Collecting a large number of UFS means failures with a great variety have been hunted down.

@deniederhut (Member) commented Jun 20, 2019

"input arguments to a function."

@hongsups (Author) commented Jun 20, 2019

Fixed.

@mepa commented Jun 23, 2019

  • "when the desirable output" --> remove the second "the"
  • "or failure, the testbench ... of the failure"

Why do engineers want to see a large variety of failures? What does absence of variety imply?

@hongsups (Author) commented Jun 23, 2019
  • Fixed.
  • Really good question! I added a sentence to elaborate the reason at the end of the paragraph. "Having a larger variety of failures is important because it means the tests have explored various parts of the design and thus, it's likely to discover failures associated with rare edge cases or problems overlooked before."

@mepa commented Jun 24, 2019

Thanks, I think I now understand the reason verification engineers want to maximize UFS. Is it because various inputs/stimuli to tests can identify the same bug (UFS) in the HDL code, which assumedly requires just one fix? Hence, we want to find all unique bugs, and it's a waste of resources/time to create tests that identify the same UFS.



Random generation of the test settings in the test generator is intended to run a batch of tests automatically almost daily to explore random parts of the design with efficiency. In practice, engineers run tests with certain input settings and collect the results after the tests are simulated. The way that engineers control the input settings varies widely. In an extreme case, they only control the seed number of a pseudo-random number generator in the test generator for the entire set of the input settings. Normally for a test, engineers have a set of input settings, not just the seed, which either turn on and off a setting or controls the stochastic behavior of a setting by defining what kind of values the setting can take. For instance, if a certain input setting has a string value of *"1-5"*, it indicates that the actual stimulus that goes into the simulation can be *any integer from 1 to 5*. Unfortunately, testbench does not track this information and it is not possible to know which value has ended up chosen eventually. Hence, it is extremely challenging to guide a testbench to generate a specific value of the input settings. This is why building a machine learning model is challenging: two tests with the exact same values of an input setting can result in two different outcomes. Additionally, engineers make changes to the design almost every day, which includes a new implementation or modification in the design, or bug fixes. This affects the test behavior and, in turn, the data generation process, which implies, the data distribution can potentially change almost daily (i.e., frequent data drift).

@deniederhut (Member) commented Jun 20, 2019

"in the test generator is used to run"

@hongsups (Author) commented Jun 20, 2019

Fixed.

@mepa commented Jun 23, 2019

  • "which either turns a setting on and off"
  • "Unfortunately, the testbench"
  • "which value ended up getting chosen eventually"
  • "which implies that the data"

It might be worth italicizing the sentence "This is why building a machine learning model is challenging..." as this is a key point.

@hongsups (Author) commented Jun 23, 2019

Fixed. And italicized the sentence. Good idea!
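
A toy illustration of the range-style settings described above (not the actual testbench): the generator draws the final stimulus from the stated range and never reports the drawn value back, which is why two tests with identical settings can behave differently.

.. code-block:: python

   import random

   def expand_setting(value: str, rng: random.Random) -> int:
       """Toy stand-in for the test generator: a setting such as "1-5" is
       expanded into a uniformly drawn integer from that range."""
       low, high = (int(v) for v in value.split("-"))
       return rng.randint(low, high)

   rng = random.Random()              # seeded inside the testbench, not by the user
   print(expand_setting("1-5", rng))  # two runs with the same setting string
   print(expand_setting("1-5", rng))  # can stimulate different design parts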

Working around the stochastic test generation
#############################################

This situation requires a unique approach. It is impossible to eliminate randomness in the test generation step, which makes it difficult to guide the testbench to test specific input values or parts of the system. Instead, we leave the inputs to be generated randomly and filter them afterward. By using the labeled data from previous tests (i.e., tests that were already simulated), a machine learning model (classifier) can be trained to predict whether a test will fail or pass with a given set of input settings. Then, it is possible to provide a large set of test *candidates* (a number of tests with random input setting values, i.e., providing the new input values) to the trained model that assesses which subset of the test candidates will fail. This way, it is possible to run the subset of tests only, instead of running the entire test candidates agnostically. This can save cluster resources and make the verification process more efficient. However, the existing simulation-based testing with random constraints *should remain* because we still have to explore new design parts, which in turn provide new training data for model updates. Hence, two parallel pathways can be proposed (Fig. :ref:`Fig.1`); one with the default randomized testing and the other with machine learning models, where an additional set of test candidates are provided and then only the tests flagged by the models are filtered and run. This way, it is possible to continue collecting novel data from the first pathway to explore a new input space while utilizing the data from previous tests via the ML flow.

@deniederhut (Member) commented Jun 20, 2019

"the entire set of test candidates"

@hongsups (Author) commented Jun 20, 2019

For this one, I modified it to "the entire set of the input settings of test candidates" because I want to specify that the randomization is specifically for the input settings.
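
The filtering pathway described above can be sketched in a few lines; this is a simplified stand-in (supervised model only, untuned) rather than the deployed tool.

.. code-block:: python

   import numpy as np
   from sklearn.linear_model import LogisticRegression

   def filter_candidates(X_past, y_past, X_candidates, candidate_ids):
       """Train on already-simulated tests (settings -> pass/fail), then keep
       only the candidates the classifier flags as likely failures; the flagged
       subset is what would be handed back to the testbench, while the default
       random flow keeps running in parallel to supply fresh training data."""
       clf = LogisticRegression(class_weight="balanced", max_iter=1000)
       clf.fit(X_past, y_past)
       flagged = clf.predict(X_candidates) == 1
       return np.asarray(candidate_ids)[flagged]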


I used an ensemble of a supervised and an unsupervised learning model. Due to the severe class imbalance between passes and failures (near 99% pass and 1% failure rate) in the training data, it is possible to either train a supervised model with adjusted class weight or train an unsupervised model that detects outliers (i.e. failures). For the unsupervised, because the majority of the training data is passed tests, it is possible to consider the failures as outliers or abnormalities. In a preliminary analysis, I found that the supervised and the unsupervised models provided predictions that were qualitatively different; the unique failure signatures (UFS) from the supervised model’s and the unsupervised one’s predictions were not identical although there were some overlaps. Thus, when the union of both predictions was computed, there was a small increase in UFS recovery across many testing datasets. Hence, I decided to use both models and take the union of the predictions. This means that when test candidates are passed to the models for prediction, a candidate will be flagged as a failure if either the supervised or the unsupervised model predicts it as a failure.

Due to the frequent changes in the data generation process, I decided to use algorithms robust to frequent retraining and tuning (i.e., with faster training times). I used a group of non-neural-net scikit-learn (v0.20.2) classifiers as the supervised models and isolation forest as the unsupervised learning algorithm. For both cases, I conducted a randomized search to tune the hyperparameters and select the best model. For the supervised, I used algorithms such as logistic regression and tree-based ensemble methods (random forest, gradient boosting, and extra trees). The winning algorithm was the logistic regression with L2 regularization, potentially because the preprocessed input data had high sparsity (more than 50%).

@deniederhut (Member) commented Jun 20, 2019

The winning algorithm was the logistic regression with L2 regularization

It's interesting how often the simple models perform the best 😄

@mepa commented Jun 23, 2019

What does a null value mean in these tests? Is it that a particular stimulus is not possible/relevant/used for a particular test, so the field is empty?

@hongsups (Author) commented Jun 23, 2019

So often if a setting is null, it means one of the two things:

  • The setting is absent or turned off.
  • The setting is not specified by users.
    In the first case, this happened for binary columns ([0, 1]), so it was pretty straightforward to encode. For the second one, I had to ask the engineers to track down which specific default values their code uses because it is not specified because their tests were just using default values. Even for the first case, the engineers' default value matched to how I handled the data (because the columns have names such as "setting_A_on", etc.).

I added this information in the "data preprocessing" part as such:
"Using a software analogy, this is similar to not having to specify an input argument value in a function, if it already has a default value for that argument. The engineers were able to obtain the default values, which fixed the missing data issue."

If you think I need more clarification, please let me know!

@mepa commented Jun 23, 2019

I see. So the data had "more than 50%" sparsity, but you were able to fill in missing values based on gaining knowledge of the default values. Were the input data to the model completely dense then, or were there still missing values?

@hongsups (Author) commented Jun 24, 2019

Ah, so when I meant 50% sparsity, it meant literally more than 50% were 0s. This was after the imputation. I will add this to the manuscript. These 0s often mean a setting is not present or the value is set to 0.
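
A minimal sketch of the supervised/unsupervised union described in the Models paragraphs above; the hyperparameters are placeholders, not the values found by the randomized search.

.. code-block:: python

   from sklearn.ensemble import IsolationForest
   from sklearn.linear_model import LogisticRegression

   def fit_union_ensemble(X_train, y_train):
       clf = LogisticRegression(class_weight="balanced", max_iter=1000)
       clf.fit(X_train, y_train)
       # failures (~1% of tests) are treated as outliers by the isolation forest
       iso = IsolationForest(contamination=0.01, random_state=0)
       iso.fit(X_train)
       return clf, iso

   def predict_union(clf, iso, X_new):
       sup_fail = clf.predict(X_new) == 1     # supervised: predicted failure
       unsup_fail = iso.predict(X_new) == -1  # isolation forest: flagged outlier
       return sup_fail | unsup_fail           # flag if either model says "fail"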


.. math::

   \text{UFS recovery rate} = \frac{\mathrm{card}(S_{\hat{y}=1})}{\mathrm{card}(S_{y=1})},

@deniederhut (Member) commented Jun 20, 2019

Is there a reason you didn't use Jaccard similarity?

@hongsups (Author) commented Jun 20, 2019

@deniederhut Oh! I just realized that what I was computing is essentially the same as Jaccard similarity. The union would be the total set of UFS and the intersection would be the ones that are found by the models, which is a subset of the total set.

I guess when I think of Jaccard similarity as a metric for arbitrary sets A and B, I often automatically assume |A - B| or |B - A| is non-zero, meaning A doesn't belong to B or vice versa. In this case, if we define the UFS recovery rate as Jaccard similarity, the set of UFS found by the models would completely belong to the total set of UFS, so I think that's why I used the current formula. I am also thinking that the current formula might be more straightforward to the audience than Jaccard similarity, so I would like to keep it as it is.

@deniederhut (Member) commented Jun 22, 2019

Ah okay -- I had assumed that the sets would be non-overlapping. Could I ask you to add this as a footnote to the formula?

@hongsups (Author) commented Jun 23, 2019

Ah I missed this. Added this as a footnote to the formula.
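
One way to read the formula above in code, assuming one signature string per failed test and that the flagged-and-failed signatures form a subset of all failed signatures, as discussed in the thread:

.. code-block:: python

   def ufs_recovery_rate(ufs, y_true, y_pred):
       """card(S_{y_hat=1}) / card(S_{y=1}): unique fail signatures recovered
       by running only the flagged tests, over all unique fail signatures."""
       all_ufs = {s for s, t in zip(ufs, y_true) if t == 1}
       found_ufs = {s for s, t, p in zip(ufs, y_true, y_pred) if t == 1 and p == 1}
       return len(found_ufs) / len(all_ufs) if all_ufs else float("nan")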

Results
-------

For the *snapshot* dataset, the testing data (50% holdout data in 10 different sets; each set is generated independently) shows that the union predictions from the trained supervised and unsupervised models achieved the UFS recovery rate of :math:`82 \pm 2` % (mean :math:`\pm` sem) and the efficiency of :math:`1.8 \pm 0.1` (mean :math:`\pm` sem). Similar results were obtained in the *1-month* dataset (Fig. :ref:`Fig.3`). Note that in the figure, the UFS recovery rate increased for the union approach but the efficiency got lower because the union approach naturally required running more tests. Since the precision score was very low (due to the class imbalance), I ran a permutation test as a sanity check (100 runs) and found the model performance was significantly different from the permuted runs (:math:`p=0.010` for the *snapshot* dataset). Overall, in both datasets, on average, the union approach flagged about 40% of the tests and was able to retrieve 80% of the unique failure signatures. This suggests that with the ML flow, it is possible to find 80% of UFS by running only 40% of the tests, compared to the existing random flow.

This comment has been minimized.

Copy link
@deniederhut

deniederhut Jun 20, 2019

Member

"achieved a UFS recovery rate"
"and an efficiency of"

I ran a permutation test as a sanity check

I'm really glad you included this detail; I think permutation tests are underrated. Phil Stark used to joke about writing a book called "Permutation tests can be used for everything".

This comment has been minimized.

Copy link
@hongsups

hongsups Jun 20, 2019

Author

Fixed.

Haha, thanks. I found permutation tests easy to execute, and they also provide a good, simple sanity check for any metric.


Several production engineers and I wrote Python and shell scripts to build a command-line tool that verification engineers can run without changing their main *random* flow. The script takes test candidates as input and makes a binary prediction on each candidate's outcome (pass or failure) based on the pre-trained models (both the supervised and the unsupervised models, and then their union). Whenever new test candidates are provided to the tool, their input settings are preprocessed so that they are consistent with the training data. The test candidates are randomly generated from the testbench. The tool generates about 1k test candidates and flags about 400 tests on average. The number of test candidates generated depends on the computational resources available in the internal cluster. In the flagging process, the script returns the unique identifiers of the flagged test candidates. Then it invokes a testbench simulation where the filtered tests are run. Our scripts are deployed as part of the production team's continuous integration.
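
A rough sketch of the union-flagging step described above, assuming two pre-trained scikit-learn estimators (a classifier and an outlier detector); the function name and the label encoding (1 = predicted failure, -1 = outlier, as in scikit-learn's outlier-detection convention) are assumptions, not the production code.

.. code-block:: python

    import numpy as np

    def flag_candidates(candidate_ids, X, classifier, outlier_detector):
        """Return the IDs of test candidates flagged as likely failures by
        either the supervised classifier or the unsupervised outlier detector."""
        supervised_flag = classifier.predict(X) == 1           # 1 = predicted failure (assumed encoding)
        unsupervised_flag = outlier_detector.predict(X) == -1  # -1 = outlier (scikit-learn convention)
        union_flag = supervised_flag | unsupervised_flag
        return [cid for cid, flagged in zip(candidate_ids, union_flag) if flagged]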

After the deployment, model performance started showing high variability, sometimes very different from the pre-deployment model performance. Figure :ref:`Fig.4` shows the model performance of the first 17 days of post-deployment period. Note that the models were not retrained during this period. During this period, the efficiency scores were often larger than 1 but they changed dramatically sometimes. In the following sections, I will discuss how to identify the cause of the performance variability in the context of model retraining, and other issues found post-deployment.

This comment has been minimized.

Copy link
@deniederhut

deniederhut Jun 20, 2019

Member

"scores were often larger than one"

This comment has been minimized.

Copy link
@hongsups

hongsups Jun 20, 2019

Author

Fixed.

Random-draw effect
##################

It is suspected that the fluctuation in model performance (Fig. :ref:`Fig.4`) might be related to the quality of the test candidates. This is because the test candidates are generated randomly and independently in the testbench, and we have no control over this process. Hence, by chance, it is possible that the test candidates on a certain day might be more challenging to the models (e.g., samples that are closer to the decision boundary), which may result in low performance (i.e., "random-draw" effect). To test this idea, I simulated the effect of the random draw by varying the number of test candidates provided to the models (Fig. :ref:`Fig.6`). I found that when more candidates were provided, model performance was more stable for both UFS recovery rate and efficiency. In the actual deployment, about 1000 test candidates were provided to the tool. As shown in Fig. :ref:`Fig.6`, it is very much possible that with 1000 candidates, the efficiency can be lower than 1 or as high as 4 in certain draws. For the simulation in Fig. :ref:`Fig.6`, I drew tests from a pool of 25k tests, but considering that the actual number of possible test candidates that can ever be generated is astronomical, variability in the model performance in reality could be more severe.
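
A schematic of this random-draw simulation, under the assumption that the test pool is available as arrays ``pool_X`` / ``pool_y`` and that the metric of interest (e.g., efficiency or UFS recovery rate) is passed in as a callable; all names are illustrative, not the actual experiment code.

.. code-block:: python

    import numpy as np

    def simulate_random_draw(pool_X, pool_y, model, metric,
                             n_candidates=1000, n_repeats=50, seed=42):
        """Repeatedly subsample `n_candidates` tests from the pool and report
        the spread of `metric(y_subset, flagged_subset)` across the draws."""
        rng = np.random.default_rng(seed)
        scores = []
        for _ in range(n_repeats):
            idx = rng.choice(len(pool_y), size=n_candidates, replace=False)
            flagged = model.predict(pool_X[idx]) == 1  # 1 = predicted failure (assumed encoding)
            scores.append(metric(pool_y[idx], flagged))
        return float(np.mean(scores)), float(np.std(scores))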

This comment has been minimized.

Copy link
@deniederhut

deniederhut Jun 20, 2019

Member

"lower than one or as high as four"

This comment has been minimized.

Copy link
@hongsups

hongsups Jun 20, 2019

Author

Fixed.

This comment has been minimized.

Copy link
@mepa

mepa Jun 23, 2019

Ok, this paragraph and Fig. 6 more or less answer my earlier question about the relationship between number of test candidates and model performance. Nice results.

@hongsups

This comment has been minimized.

Copy link
Author

commented Jun 20, 2019

This is looking really good! I've left some more phrasing suggestions. Also, it looks like your citations are missing DOIs. Could I ask you to add those in?

Thanks! Yes, I will add those.

@hongsups

This comment has been minimized.

Copy link
Author

commented Jun 20, 2019

@deniederhut Thanks for the quick response, Dillon. I added the doi info and incorporated your feedback. Just updated the manuscript.

@hongsups

This comment has been minimized.

Copy link
Author

commented Jun 21, 2019

@deniederhut @mepa Hey all, I made a few minor changes to polish the manuscript. If you have any questions or comments, please let me know!

@deniederhut
Copy link
Member

left a comment

I think we're almost there! Just a few language suggestions in some of the new sections.


.. class:: abstract

When designing microprocessors, engineers must verify whether the proposed design, defined in hardware description language, does what is intended. During this verification process, engineers run simulation tests and can fix bugs if the tests have failed. Due to the complexity of the design, the baseline approach is to provide random stimuli to verify random parts of the design. However, this method is time-consuming and redundant, especially when the design becomes mature and thus the failure rate is low. To increase efficiency and detect failures faster, it is possible to train machine learning models by using previously run tests, and assess the likelihood of failure of new test candidates before running them. This way, instead of running random tests agnostically, engineers use the model prediction on a new set of test candidates and run a subset of them (i.e., "filtering" the tests) that are more likely to fail. Due to the severe imbalance (1% failure rate), I trained an ensemble of supervised (classification) and unsupervised models (outlier detection) and used the union of the prediction from both models to catch more failures. The tool has been deployed in the internal cluster early this year, as a complementary workflow which does not interfere the existing workflow. After the deployment, I found instability in post-deployment model performance and ran various experiments to address the issue such as by identifying the effect of the randomness in the test generation process. In addition to introducing the relatively new data-driven approach in hardware design verification, this study also discusses the details of post-deployment evaluation such as retraining, and working around real-world constraints, which are sometimes not discussed in machine learning and data science research.

This comment has been minimized.

Copy link
@deniederhut

deniederhut Jun 22, 2019

Member

"address the issue, such as identifying"

This comment has been minimized.

Copy link
@hongsups

hongsups Jun 22, 2019

Author

Fixed.

This comment has been minimized.

Copy link
@mepa

mepa Jun 23, 2019

  • "and unsupervised (outlier detection) models"
  • "does not interfere with the existing"
  • "in the internal cluster" -> in which internal cluster?

This comment has been minimized.

Copy link
@hongsups

hongsups Jun 23, 2019

Author

Hi @mepa thanks for the review! I fixed the suggestions, and for the last bullet point, it's the HPC that our production team uses, so I just rephrased it as "in an internal high performance computing (HPC) cluster"

Simulation-based testing in practice
####################################

In practice, engineers build a testbench to house all the components that are needed for the verification process: test generator, interface, driver, monitor, model, and scoreboard. To run tests, verification engineers define a set of values as *input settings*, which can be compared to input arguments to a function. These values are passed to the test generator, and under certain constraints, a series of subsequent values that stimulate various parts of the design are *randomly generated*. This information is then passed to the interface through the driver. The interface interacts with a design part (register-transfer level (RTL) design written in HDL) and then the returned output is fed into the monitor. To evaluate the result, the desirable output should be retrieved. This information is stored in the model, which is connected to the driver. A test is identified as failure when the desirable output from the driver (through the model) and the output from the monitor do not match. In addition to the binary label of pass or failure, testbench also returns a log file of failure, if the test has failed. This log contains detailed information of the failure. Each failure log is encoded as an 8-digit hexadecimal code by a hash function. This code is called *unique failure signature (UFS)*. In general, instead of inspecting every failure log, engineers are more interested in maximizing the number of UFS that are collected after a batch of tests. Collecting a large number of UFS means failures with a great variety have been hunted down.
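
The hash function used by the testbench is internal and not described further; the snippet below only illustrates the general idea of collapsing a failure log into an 8-digit hexadecimal signature, with ``sha1`` as an arbitrary stand-in.

.. code-block:: python

    import hashlib

    def failure_signature(failure_log: str) -> str:
        """Illustration only: map a failure log to an 8-digit hex signature."""
        return hashlib.sha1(failure_log.encode("utf-8")).hexdigest()[:8]

    # Tests that fail for the same reason produce the same log content and
    # therefore the same signature, which is what allows counting *unique* failures.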

This comment has been minimized.

Copy link
@deniederhut

deniederhut Jun 22, 2019

Member

"A test is identified as a failure"

This comment has been minimized.

Copy link
@hongsups

hongsups Jun 22, 2019

Author

Fixed.

Post-deployment analysis
########################

I used both supervised and unsupervised models to address the severe class imbalance problem and used the union of the prediction from both models. This means, a test is predicted as failure when at least one of the two models predict the test as failure. With this approach, for a set of independent testing datasets, it was possible to find 80% of unique failure signatures (Fig. :ref:`Fig.3`) by running only 40% of the tests on average, compared to running tests based on the original simulation-based method. The tool has been deployed in production since early this year in our internal cluster as a part of the daily verification workflow, which is used by verification engineers in the production team. It is not common in either the machine learning or the hardware verification literature to find how suggested models perform in a real-world setting. Often machine learning studies show performance based on a single limited dataset or commonly used benchmark datasets. In this paper, I address this and attempt to provide practical insights into the post-deployment process, such as decisions regarding the automation of model retraining and addressing randomness in the post-deployment period.

This comment has been minimized.

Copy link
@deniederhut

deniederhut Jun 22, 2019

Member

"This means, a test is predicted to fail when at least one of the two models predict it will fail."

This comment has been minimized.

Copy link
@hongsups

hongsups Jun 22, 2019

Author

Fixed.

Data preprocessing
##################

The input data was preprocessed based on the domain knowledge of the verification engineers. In the raw data, roughly 70% of the data was missing. This is because when the value of an input setting in a test is the same as a default value of the setting, which was not specified by the engineers. Using software analogy, this is similar to not having to specify an input argument value in a function, if it already has a default value for that argument. The engineers were able to obtain the default values, which fixed the missing data issue. About 20% of the columns were object (i.e., non-numerical) columns. Some of them were nominal columns (e.g., "name1", "name2") but the majority turned out to be numerical values in quotes (e.g., "5", "100"), quoted ranges (e.g., "1-5", "50-100") or a dictionary with key-value pairs in quotes. For the quoted numerical values, I simply stripped the quotes and converted them to numbers. For the quoted ranges, it was not straightforward because these columns have uncertainty information in them. For instance, "1-5" means any values from 1 to 5 and there was no way to know which value was chosen in the end and also what type of probability distribution was used for the random draw. Although I initially considered treating them as nominal, I decided to take the mean of the minimum and maximum values of a range value: for "1-5", it would be represented as (1+5)/2 = 3. This way, it might be possible to preserve some numerical information about the range in the input data. For the quoted dictionaries, I parsed them and expanded them into multiple columns so that each key became a column in the input dataset. Finally, I dropped columns that were non-informative (i.e., single unique value) and duplicates. This resulted in about a 10% increase in the number of columns, which was still in the range of several hundreds. The output datasets did not require preprocessing.
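
A condensed pandas sketch of the value-parsing rules described above (quoted numbers, quoted ranges, constant and duplicate columns); the parsing details are illustrative, and the expansion of quoted dictionaries into separate columns is omitted for brevity.

.. code-block:: python

    import pandas as pd

    def parse_value(v):
        """Illustrative parser for the object columns described above."""
        if isinstance(v, str):
            v = v.strip('"')
            parts = v.split("-", 1)
            if len(parts) == 2 and all(p.isdigit() for p in parts):
                lo, hi = int(parts[0]), int(parts[1])
                return (lo + hi) / 2       # quoted range, e.g. "1-5" -> 3.0
            try:
                return float(v)            # quoted number, e.g. "100" -> 100.0
            except ValueError:
                return v                   # leave nominal strings (e.g. "name1") as-is
        return v

    def preprocess(df: pd.DataFrame) -> pd.DataFrame:
        df = df.applymap(parse_value)
        df = df.loc[:, df.nunique() > 1]   # drop single-unique-value columns
        df = df.loc[:, ~df.T.duplicated()] # drop duplicated columns
        return df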

This comment has been minimized.

Copy link
@deniederhut

deniederhut Jun 22, 2019

Member

"In the raw data, roughly 70% of the data was missing, which corresponds to input settings that were not modified from the defaults."

"Using a software analogy"

This comment has been minimized.

Copy link
@hongsups

hongsups Jun 22, 2019

Author

Fixed.


Although the predictions from the supervised and the unsupervised models are binary in the deployed tool, in fact both models (logistic regression and isolation forest) can return a continuous score, which can be used as a measure of the likelihood of failure. For the supervised model, this is the prediction probability, and for the unsupervised model, this is the anomaly score. In the default setting (as in the deployed tool), the supervised model classifies the candidates with a probability of failure larger than 0.5 as failures, and the unsupervised model flags the ones with negative anomaly scores as outliers.

To address the random-draw effect, it might be better to use these likelihood metrics. With these metrics, the test candidates can be ranked and the tool can choose the top candidates, which are more likely to fail (prediction probability for a supervised model) or more abnormal (anomaly score of an unsupervised model) than other candidates. Then it is possible to provide a larger number of test candidates to the models, which can simply choose the top :math:`K` candidates. This allows the models to see more test candidates, which can potentially reduce the random-draw effect. It also works well with the deployed tool because the test candidate generation is very fast and doesn't cost much resource. Assuming that enough test candidates are provided to the models, it is not necessary to set specific cut-offs for the likelihood measures; instead, the tool can simply pick the top :math:`K` tests, where :math:`K` depends on the cluster resource constraints, which is more straightforward.
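
A sketch of this top-:math:`K` alternative using scikit-learn's ``predict_proba`` (logistic regression) and ``score_samples`` (isolation forest, where lower scores are more anomalous); how the two rankings are merged afterwards is a design choice and is left open here, and the names are illustrative.

.. code-block:: python

    import numpy as np

    def rank_candidates(X, classifier, outlier_detector, k=400):
        """Return the indices of the k most failure-prone candidates according
        to the classifier and the k most anomalous according to the detector."""
        p_fail = classifier.predict_proba(X)[:, 1]    # probability of the failure class
        anomaly = -outlier_detector.score_samples(X)  # higher = more anomalous
        top_supervised = np.argsort(p_fail)[::-1][:k]
        top_unsupervised = np.argsort(anomaly)[::-1][:k]
        return top_supervised, top_unsupervised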

This comment has been minimized.

Copy link
@deniederhut

deniederhut Jun 22, 2019

Member

"is very fast and doesn't cost much" or "is very fast and doesn't consume many resources"

This comment has been minimized.

Copy link
@hongsups

hongsups Jun 22, 2019

Author

Decided to go with the first suggestion! Fixed.

- *Top K without any retraining*: Top :math:`K` candidates flagged as failure by the models. The models saw the input settings of the *tests that were run in the existing random flow* (the same 3k-4k tests from the *Random K*). Using the same tests as in the "Random :math:`K`" is important to make the comparison fair and consistent. In this scenario, both supervised and unsupervised models were never retrained. This is shown as blue dots in Fig. :ref:`Fig.7`.
- *Top K with retraining under "three-strikes" rule*: Same as the previous but both models were retrained whenever model performance was lower than the *Random K*'s, three days in a row. It is shown as orange crosses in Fig. :ref:`Fig.7`.

Since all scenarios have the same :math:`K=400`, it is possible to compare the unique failure signature (UFS) counts (the y axis in Fig. :ref:`Fig.7`) instead of the UFS recovery rate. Although the models did not always perform better than the baseline, when they did (the middle section of the figure), retraining the models based on the "three-strikes rule" did help. This rule kept the models relatively fresh but also helped keep good models without retraining too frequently. In the middle section of the figure, it was possible to use the same models without retraining for almost two weeks. Theoretically, it is possible to retrain the models every day. However, model retraining is not free and it still takes up resource in the internal cluster. This means that too-frequent retraining can undermine the benefit of using the ML models.
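
A minimal sketch of the "three-strikes" trigger, assuming daily UFS counts for the model flow and the random baseline are logged as lists (most recent day last); the actual retraining job is outside the snippet and all names are illustrative.

.. code-block:: python

    def should_retrain(model_ufs_counts, baseline_ufs_counts, strikes=3):
        """Retrain only after the model flow has found fewer unique failure
        signatures than the random baseline for `strikes` days in a row."""
        recent = list(zip(model_ufs_counts, baseline_ufs_counts))[-strikes:]
        return len(recent) == strikes and all(m < b for m, b in recent)

    # Example: the model found 3, 2, 1 UFS on the last three days while the
    # baseline found 4, 4, 2 -> three consecutive losses, so retraining triggers.
    assert should_retrain([5, 3, 2, 1], [4, 4, 4, 2]) is True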

This comment has been minimized.

Copy link
@deniederhut

deniederhut Jun 22, 2019

Member

"it still consumes computational resources"

This comment has been minimized.

Copy link
@hongsups

hongsups Jun 22, 2019

Author

Fixed.

@hongsups

This comment has been minimized.

Copy link
Author

commented Jun 22, 2019

@deniederhut Thanks for another review! Just updated the manuscript based on your comments!

@deniederhut

This comment has been minimized.

Copy link
Member

commented Jun 22, 2019

Awesome! And just in case you missed this in the giant thread of comments:

#468 (comment)

@hongsups

This comment has been minimized.

Copy link
Author

commented Jun 23, 2019

Awesome! And just in case you missed this in the giant thread of comments:

#468 (comment)

@deniederhut Thanks for pointing it out! I missed it indeed. Just updated the manuscript with the footnote.


.. class:: abstract

When designing microprocessors, engineers must verify whether the proposed design, defined in hardware description language, does what is intended. During this verification process, engineers run simulation tests and can fix bugs if the tests have failed. Due to the complexity of the design, the baseline approach is to provide random stimuli to verify random parts of the design. However, this method is time-consuming and redundant, especially when the design becomes mature and thus the failure rate is low. To increase efficiency and detect failures faster, it is possible to train machine learning models by using previously run tests, and assess the likelihood of failure of new test candidates before running them. This way, instead of running random tests agnostically, engineers use the model prediction on a new set of test candidates and run a subset of them (i.e., "filtering" the tests) that are more likely to fail. Due to the severe imbalance (1% failure rate), I trained an ensemble of supervised (classification) and unsupervised models (outlier detection) and used the union of the prediction from both models to catch more failures. The tool has been deployed in the internal cluster early this year, as a complementary workflow which does not interfere the existing workflow. After the deployment, I found instability in post-deployment model performance and ran various experiments to address the issue such as by identifying the effect of the randomness in the test generation process. In addition to introducing the relatively new data-driven approach in hardware design verification, this study also discusses the details of post-deployment evaluation such as retraining, and working around real-world constraints, which are sometimes not discussed in machine learning and data science research.

This comment has been minimized.

Copy link
@mepa

mepa Jun 23, 2019

  • "and unsupervised (outlier detection) models"
  • "does not interfere with the existing"
  • "in the internal cluster" -> in which internal cluster?
Simulation-based hardware verification
######################################

Hardware verification is the process of checking that a given design correctly implements the specifications, which are the technical description of the computer's components and capabilities. It is recognized as the largest task in silicon development and as such has the biggest impact on the key business drivers of quality, schedule and cost. In the computer hardware design cycle, microprocessor manufacturing companies often dedicate 60-70% of the cycle to the verification procedure. Traditionally, two techniques have been used: formal and simulation-based (random-constraint) methods [Ioa12]_. The former adopts a mathematical approach such as theorem proving and requirement checks [Wil05]_, which provides exhaustiveness but doesn't scale well with design complexity. Due to the exponentially-growing design complexity, the more widely used approach is simulation-based testing, which simulates a design by providing stimuli to tests. During simulation-based testing, verification engineers provide a set of constraints to stimuli so that they can direct tests to a certain direction. However, it is not easy to target certain design parts deterministically and engineers often depend on previous knowledge or intuition.

This comment has been minimized.

Copy link
@mepa

mepa Jun 23, 2019

  • what do "stimuli" mean in this context?
  • "direct tests toward a certain direction", or "in"

Overall this paragraph is clear, informative, and motivates the approach taken in the paper. Really nice intro!

:scale: 35%
:align: center

Overview of the prototype pipeline. Top: the existing workflow (randomized testing). Bottom: the complementary machine learning (ML) flow. In the final deployed version, approximately 1000 test candidates are provided to the ML flow, which passes about 400 tests. This corresponds to the 10% of the number of the tests in the top flow. The cubes correspond to the pre-trained machine learning models (blue: a supervised model, green: an unsupervised model). :label:`Fig.1`

This comment has been minimized.

Copy link
@mepa

mepa Jun 23, 2019

  • "This corresponds to 10% of the number of tests"

This comment has been minimized.

Copy link
@hongsups

hongsups Jun 23, 2019

Author

Fixed.

Failures (bugs) in hardware verification
########################################

Hardware verification can be compared to unit testing in software engineering, especially since design functionalities are realized in a hardware description language (HDL) like Verilog. Similar to software testing, the hardware verification process involves checking whether simulations of the code written in HDL with a set of given input values (i.e., tests with certain inputs) show desirable behavior. If a test returns undesirable output, it is considered a failure (bug). To fix the failures, engineers modify the HDL source code, for example by fixing "assign" statements or by correcting or adding conditions (e.g., "if" statements) [Sud08]_. The HDL-level hardware verification is one of the many steps in hardware testing, which precedes physical design implementation. This low-level verification is a critical step in hardware testing because fixing a bug in a higher level (e.g., in physical design or even in a product) is more costly and challenging, since it is hard to identify which previous steps have bugs.

This comment has been minimized.

Copy link
@mepa

mepa Jun 23, 2019

I like the analogy :)

This comment has been minimized.

Copy link
@hongsups

hongsups Jun 23, 2019

Author

Thank you! :)

Simulation-based testing in practice
####################################

In practice, engineers build a testbench to house all the components that are needed for the verification process: test generator, interface, driver, monitor, model, and scoreboard. To run tests, verification engineers define a set of values as *input settings*, which can be compared to input arguments to a function of a code. These values are passed to the test generator, and under certain constraints, a series of subsequent values that stimulate various parts of the design are *randomly generated*. This information is then passed to the interface through the driver. The interface interacts with a design part (register-transfer level (RTL) design written in HDL) and then the returned output is fed into the monitor. To evaluate the result, the desirable output should be retrieved. This information is stored in the model, which is connected to the driver. A test is identified as failure when the the desirable output from the driver (through the model) and the output from the monitor do not match. In addition to the binary label of pass or failure, testbench also returns a log file of failure, if the test has failed. This log contains detailed information of the failure. Each failure log is encoded as an 8-digit hexadecimal code by a hash function. This code is called *unique failure signature (UFS)*. In general, instead of inspecting every failure log, engineers are more interested in maximizing the number of UFS that are collected after a batch of tests. Collecting a large number of UFS means failures with a great variety have been hunted down.

This comment has been minimized.

Copy link
@mepa

mepa Jun 23, 2019

  • "when the desirable output" --> remove the second "the"
  • "or failure, the testbench ... of the failure"

Why do engineers want to see a large variety of failures? What does absence of variety imply?


During the initial post-deployment period, the models were manually retrained whenever major changes were made either in the tool or in the design. To automate the retraining process, I tested several ideas related to model retraining. First, for any retraining, the size of the training data should be determined. Technically, it is possible to use the entire historic data from the very beginning of the testing process, but this is not a good idea because the training data would be too big and very old tests would be useless, since the design would have changed a lot since then. To determine how much training data is needed, I conducted an experiment by considering these two factors: rolling window size and weight decay. The rolling window size corresponds to :math:`N` consecutive days (:math:`N = 3, 5, 7, 10, 14`) to look back to build a training dataset. For instance, if :math:`N=7`, tests that were run for the past 7 days become the training data.

The weight decay is related to how *fresh* the data is. If tests were generated more recently, they might be more important because the design then is more similar to the today's compared to older tests. The multiplicative power decay is used to compute the weight :math:`w`, (:math:`w(t) = x^t`, where :math:`x` is the power parameter (0.3, 0.6, 0.9, 1 (=no decay)) and :math:`t` is the number of days counting from today). Using the power law, :math:`x=0.9` would mean tests from yesterday are 10% less important than today's. Once the weights computed, then are applied to the objective function during training by using the ``sample_weight`` parameter in scikit-learn models' ``fit()`` method. This parameter allows users to assign a weight to every data point during model fitting. Since multiple tests were generated on a day, they each get the same weights and the weights only vary on the day-level. Note that this weight adjustment was added on top of the class weight adjustment (``class_weight='balanced'``).
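
A sketch of how the day-level decay can be passed to ``sample_weight``, using logistic regression as in the deployed supervised model; ``days_old`` (the age of each training test in days, 0 = today) and the other names are illustrative.

.. code-block:: python

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def fit_with_decay(X, y, days_old, power=0.9):
        """Fit the supervised model with multiplicative power-decay weights,
        w(t) = power**t, on top of the balanced class weights."""
        weights = np.power(power, np.asarray(days_old, dtype=float))
        model = LogisticRegression(class_weight="balanced", max_iter=1000)
        model.fit(X, y, sample_weight=weights)
        return model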

This comment has been minimized.

Copy link
@mepa

mepa Jun 23, 2019

  • "to the current day's"
  • "Once the weights are computed, they are"
  • "multiple tests are generated"
  • "weight adjustment is added on top"

This comment has been minimized.

Copy link
@hongsups

hongsups Jun 23, 2019

Author

Fixed.

Random-draw effect
##################

It is suspected that the fluctuation in model performance (Fig. :ref:`Fig.4`) might be related to the quality of the test candidates. This is because the test candidates are generated randomly and independently in the testbench, and we have no control over this process. Hence, by chance, it is possible that the test candidates on a certain day might be more challenging to the models (e.g., samples that are closer to the decision boundary), which may result in low performance (i.e., "random-draw" effect). To test this idea, I simulated the effect of the random draw by varying the number of test candidates provided to the models (Fig. :ref:`Fig.6`). I found that when more candidates were provided, model performance was more stable for both UFS recovery rate and efficiency. In the actual deployment, about 1000 test candidates were provided to the tool. As shown in Fig. :ref:`Fig.6`, it is very much possible that with 1000 candidates, the efficiency can be lower than 1 or as high as 4 in certain draws. For the simulation in Fig. :ref:`Fig.6`, I drew tests from a pool of 25k tests, but considering that the actual number of possible test candidates that can ever be generated is astronomical, variability in the model performance in reality could be more severe.

This comment has been minimized.

Copy link
@mepa

mepa Jun 23, 2019

Ok, this paragraph and Fig. 6 more or less answer my earlier question about the relationship between number of test candidates and model performance. Nice results.


Although the predictions from the supervised and the unsupervised models are binary in the deployed tool, in fact both models (logistic regression and isolation forest) can return a continuous score, which can be used as a measure of the likelihood of failure. For the supervised model, this is the prediction probability, and for the unsupervised model, this is the anomaly score. In the default setting (as in the deployed tool), the supervised model classifies the candidates with a probability of failure larger than 0.5 as failures, and the unsupervised model flags the ones with negative anomaly scores as outliers.

To address the random-draw effect, it might be better to use these likelihood metrics. With these metrics, the test candidates can be ranked and the tool can choose the top candidates, which are more likely to fail (prediction probability for a supervised model) or more abnormal (anomaly score of an unsupervised model) than other candidates. Then it is possible to provide a larger number of test candidates to the models, which can simply choose the the top :math:`K` candidates. This allows the models to see more test candidates, which can potentially reduce the random-draw effect. It also works well with the deployed tool because the test candidate generation is very fast and doesn't cost much resource. Assuming that enough test candidates are provided to the models, it is not necessary to set specific cut-offs for the likelihood measures but to pick the top :math:`K` tests where :math:`K` will simply depend on the cluster resource constraints, which is more straightforward.

This comment has been minimized.

Copy link
@mepa

mepa Jun 23, 2019

  • "can simply choose the top" -> remove the second "the"

This comment has been minimized.

Copy link
@hongsups

hongsups Jun 23, 2019

Author

Fixed.


Since all scenarios have the same :math:`K=400`, it is possible to compare the unique failure signature (UFS) counts (the y axis in Fig. :ref:`Fig.7`) instead of the UFS recovery rate. Although the models do not always perform better than the baseline, when they do (the mid section of the figure), retraining the models based on the "three-strikes rule" did help. This rule kept the models relatively fresh but also helped keep good models without retraining too frequently. In the mid section of the figure, it was possible to use the same models without retraining for almost two weeks. Theoretically, it is possible to retrain the models every day. However, model retraining is not free and it still takes up resource in the internal cluster. This means that too-frequent retraining can undermine the benefit of using the ML models.

This simulation was based on the 3k-4k tests that were run daily. To compare the model performance and the random-testing results, it was important to use the same set of tests for the simulation; the models saw the input settings of the same 3k-4k tests and made predictions, and then the predictions and actual results were compared. Thus, in this case, the models had only seen several thousand test candidates. However, in reality, if the top :math:`K` approach is adopted, it will be possible to increase the number of candidates provided to the models, which may potentially improve model performance, given that the models see a larger number of candidates. During the mid two-week period in Fig. :ref:`Fig.7`, on average, the "top K with retraining" approach was able to obtain :math:`2.62 \pm 1.21` (mean :math:`\pm` std) more UFS compared to the random flow. According to the verification engineers that I worked with, in the mature-design stage, even a single additional UFS is valuable. Hence, if the top :math:`K` approach is applied with a larger number of test candidates, it will be possible to find even more UFS.

This comment has been minimized.

Copy link
@mepa

mepa Jun 23, 2019

What action is taken upon discovering a "new" UFS?

This comment has been minimized.

Copy link
@hongsups

hongsups Jun 23, 2019

Author

Once a new UFS is found, the rest of the process is very manual. Engineers have to check the failure log to locate the bug, find the cause and fix it. I am adding this information to the introduction: "Once a new UFS is found, engineers start a debugging process to fix the failure." (in the "Simulation-based testing in practice" part)


In real-world scenarios, it is often the case that a machine learning approach faces many practical constraints. Hardware verification turns out to be a good example. Verification tests are randomly generated and the information about the randomization is intractable, which makes it difficult to control test generation or to measure the degree of stochastic behavior. Also, ML models are only useful in the later stage of hardware verification, when it is not easy to find failures by running random tests; in the beginning, random testing can find a number of failures easily. This means that, to use machine learning for failure detection in hardware verification, one will inevitably face severe class imbalance. Modifying the objective function so that it actually considers the metric of interest, the unique failure signature, is not easy simply because there are not enough training examples for each signature. On top of this, the design and even the testbench itself change frequently, suggesting that the data generation process goes through frequent changes.

To address these issues, this study shows a prototype that provides test candidates and filters out failure-prone tests instead of trying to guide the testbench itself. To work around the class imbalance issue, I used both supervised and unsupervised models, treating the problem as classification and outlier detection at the same time. For model selection, I chose a customized approach that evaluates multiple metrics, which is more practical and makes it possible to compromise between the metrics. Finally, I conducted experiments in the post-deployment process to address the details of retraining and to identify the cause of performance variability, which are often overlooked but crucial in deployment. In summary, this study proves that machine learning approach can be used for failure detection in hardware verification and provides practical guidance to work under constraints, to identify and investigate performance-related issues for building a machine learning engineering product.

This comment has been minimized.

Copy link
@mepa

mepa Jun 23, 2019

Could you please correct grammar and rephrase the last sentence for clarity? For example, I'm not sure what "machine learning engineering product" refers to.

Finally, a question just to ensure my understanding: Does the approach presented effectively down-sample test candidates yielding the same UFS, which are not valuable tests to run because once a particular UFS is known it can be fixed, no matter the number of times it is met?

This comment has been minimized.

Copy link
@mepa

mepa Jun 23, 2019

@hongsups, this is an interesting paper and your investigation is quite thorough. I enjoyed reading about the post-deployment aspects which, I agree, are not as often presented but are crucial in building actual machine learning products.

This comment has been minimized.

Copy link
@hongsups

hongsups Jun 23, 2019

Author
  • Yes, you are correct. I added "This indicates that engineers can run fewer tests to retrieve a similar number of unique failure signatures." in the first paragraph of Conclusions.

  • I changed the last part of the Conclusions like this:
    "In summary, this study proves that machine learning approach can be used for failure detection in hardware verification. It also provides an example to work under practical constraints and investigate performance-related issues in building actual machine learning products."

This comment has been minimized.

Copy link
@hongsups

hongsups Jun 23, 2019

Author

@mepa Thanks for many excellent comments, Meghann! It's always such a rare opportunity to get feedback on my work from other data scientists, so I genuinely appreciate it!

This comment has been minimized.

Copy link
@mepa

mepa Jun 24, 2019

  • "that a machine learning approach"

@hongsups Very glad for the opportunity to review your work!

This comment has been minimized.

Copy link
@hongsups

hongsups Jun 24, 2019

Author

Fixed. Thank you!

@hongsups

This comment has been minimized.

Copy link
Author

commented Jun 23, 2019

@mepa @deniederhut Hey y'all, I just updated the manuscript based on the recent review from @mepa. I noticed that my replies to @mepa's comments don't always appear in the main PR page but they do appear when I click the "View changes" button, so please keep this in mind. I think I answered every comment but if there's anything missing, please let me know. Thanks for working on this even during the weekend!

@deniederhut

This comment has been minimized.

Copy link
Member

commented Jun 25, 2019

Thanks for being so diligent about responding to our suggestions!

@mepa do you feel this paper is ready for inclusion in the proceedings?
