
Bayesian Data Analysis for Software Engineering

Some pointers to help you get started doing Bayesian analysis on your SE data.

ICSE 2021 Technical Briefing

We (Richard Torkar, Carlo Furia and yours truly) held a technical briefing (short tutorial) on Bayesian Data Analysis for SE at ICSE 2021 ("in" Madrid, i.e. online/virtual).

Cordoba 2022 SMILES Summer School Seminar

Robert Feldt held a seminar on "Bayesian Analysis of SE Data" at the International Summer School on Search- and Machine Learning-based Software Engineering (SMILES), in Cordoba, Spain, on June 22, 2022.

FAQ

Q1. BDA sounds great, but how do I get started?

We recommend you buy and read/follow the book Statistical Rethinking by Richard McElreath. We strongly recommend the 2nd edition, since it incorporates and builds on causal analysis, which will be of great importance for science (and SE) in the long term.

Q2. What if I want more SE-specific BDA starting points?

You might find it useful to start with our "trifecta" of papers arguing for and providing SE-specific processes and examples:

  1. Furia, C. A., R. Feldt, and R. Torkar. "Bayesian data analysis in empirical software engineering research." IEEE Transactions on Software Engineering (2019). IEEE link
    • Summarizes some disadvantages of traditional, frequentist statistics
    • Argues that Bayesian statistical analysis should have a more prominent role in SE
    • High-level overview of Bayesian statistics
    • Re-analyses two SE datasets
  2. Torkar, R., C. A. Furia, R. Feldt, ... "A Method to Assess and Argue for Practical Significance in Software Engineering." IEEE Transactions on Software Engineering (2020). IEEE link
    • Method for analysing practical significance of SE results based on BDA, with showcase on SE data
  3. Furia, C. A., R. Torkar, and R. Feldt. "Applying Bayesian Analysis Guidelines to Empirical Software Engineering Data: The Case of Programming Languages and Code Quality." Accepted in ACM TOSEM, October 2021.
    • Condensed BDA guidelines for SE, with showcase on one SE data set

If you find them useful and go on to use BDA in a published paper, we would appreciate it if you cite our work.

Q3. Where can I find examples of BDA on SE data sets?

Our published papers arguing for BDA have analyzed a few different SE-related data sets:

  1. Effectiveness of autogenerated vs manual testing: Section 3.3 of paper 1 analyses data from the paper Ceccato2015, "Do Automatically Generated Test Cases Make Debugging Easier? An Experimental Assessment of Debugging Effectiveness and Efficiency".
  2. Run time performance of programming languages: Section 4.1 of paper 1 analyses data from the paper Nanz2015, "A Comparative Study of Programming Languages in Rosetta Code".
  3. Effectiveness of exploratory vs scripted testing: Section 3 of paper 2 analyses data from the paper Afzal2015, "An experiment on the effectiveness and efficiency of exploratory testing".
  4. Programming language and code quality: Section 3 of paper 3 analyses data from the paper Ray2014, "A large scale study of programming languages and code quality in Github".

Other papers in SE include:

  1. Setting software metric thresholds: Ernst2018, "Bayesian Hierarchical Modelling for Tailoring Metric Thresholds", MSR 2018.
  2. Affective states and technical debt: Olsson2020, "Measuring affective states from technical debt: A psychoempirical software engineering experiment", accepted for publication in the EMSE journal.
  3. Fault localization algorithm: Scholz2020, "An empirical study of Linespots: A novel past-fault algorithm", in submission.
  4. Requirements prioritization criteria: BerntssonSvensson2021, "Not all requirements prioritization criteria are equal at all times: A quantitative analysis", in submission.

One early re-analysis was in Carlo's arXiv report "Bayesian Statistics in Software Engineering: Practical Guide and Case Studies" but its analysis (on performance of programs implemented in different languages) was superseded by the one in paper 1.

There are most likely other, earlier examples as well, and we would like to collect them. If you know of papers that do Bayesian analysis of SE-related data, please contact us or make a pull request to this page. Thanks!

Q4. Which tools do you recommend?

Find an up-to-date library that is easy for you to work with, in a language and with tools you already know. The main workhorse of modern BDA is the Stan tool, but it is easier to use it from a library in your language of choice (a minimal PyStan sketch follows below):

  1. brms for R
  2. Stan.jl for Julia
  3. PyStan for Python
  4. There are also Stan interfaces for Matlab, Stata, Mathematica etc, see Stan interfaces

For the future we are optimistic about the Turing.jl library for Julia since it has the potential to be even more flexible, powerful, and scalable/fast than solutions based on Stan. However, it is not yet as mature as libraries and tools based on Stan.
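
As a concrete taste of the typical pattern, here is a minimal sketch of fitting a model with PyStan 3 (the pip package pystan, imported as stan). The model, priors, and data are made up purely for illustration, not taken from any of the papers above:

```python
# Minimal PyStan 3 sketch; requires `pip install pystan` (and pandas
# for to_frame()). Model and data are hypothetical, for illustration only.
import stan

model_code = """
data {
  int<lower=0> N;
  vector[N] y;            // e.g. faults found per hour in N test sessions
}
parameters {
  real mu;                // mean fault-finding rate
  real<lower=0> sigma;    // variation between sessions
}
model {
  mu ~ normal(5, 5);      // weakly informative priors (assumed)
  sigma ~ exponential(1);
  y ~ normal(mu, sigma);
}
"""

data = {"N": 6, "y": [4.2, 6.1, 5.5, 3.9, 7.0, 5.2]}

posterior = stan.build(model_code, data=data, random_seed=1)
fit = posterior.sample(num_chains=4, num_samples=1000)

# The fit can be exported as a table of posterior samples (see Q7).
df = fit.to_frame()
print(df["mu"].describe())
```

The pattern is essentially the same with brms in R or Stan.jl in Julia: specify the model, hand over the data, sample, and then work with the resulting table of posterior samples.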

Q5. Are there guidelines and recommended workflows for doing BDA in SE?

Workflows and guidelines are currently under development in statistics and there is not yet a clear consensus. For a detailed and up-to-date guide see the current version of the "Bayesian Workflow" book being written by Gelman et al. We have tried to present a shorter, condensed workflow for BDA in SE in Section 2 of paper 3.
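
One early step that recurs in these workflows is the prior predictive check: simulate data implied by the priors alone, before any fitting, to see whether the priors are plausible. A minimal sketch in Python (the priors and the scale of the data are assumed, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(42)

# Prior predictive check: draw parameters from the (assumed) priors,
# then simulate datasets from the model, before touching the real data.
n_sims, n_obs = 1000, 30
mu = rng.normal(5, 5, size=n_sims)          # assumed prior for the mean
sigma = rng.exponential(1.0, size=n_sims)   # assumed prior for the sd
y_sim = rng.normal(mu[:, None], sigma[:, None], size=(n_sims, n_obs))

# If the simulated values (say, faults found per hour) look wildly
# implausible, revise the priors before fitting the model.
print(np.percentile(y_sim, [2.5, 50, 97.5]))
```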

Q6. Is it true that statisticians discourage the use of p-values and "statistically significant"?

The ASA (American Statistical Association), in its 2016 "ASA Statement on P-Values", discouraged declarations of "statistical significance". In a 2019 editorial in the ASA journal The American Statistician (TAS), the editors of the special issue on "Moving to a world beyond p < 0.05" then said:

"We conclude, based on our review of the articles in this special issue and the broader literature, that it is time to stop using the term 'statistically significant' entirely. Nor should variants such as 'statistically different', 'p < 0.05' and 'nonsignificant' survive, whether expressed in words, by asterisk in a table, or in some other way".

See slides 20 and 21 in Robert Feldt's ESEM 2019 keynote for links and the actual quote. For what we should do instead, see Robert's thoughts on slide 23, but you should also check the editorial and some papers of the above-mentioned special issue to get the statisticians' own views.

Q7. How can BDA help analyze practical significance of SE results?

The output of a Bayesian analysis is a posterior distribution: a large number of samples of all the parameters of your model. It summarizes the knowledge you have gained about the model, i.e. "What are plausible parameter values of my model given the empirical data I collected?" Importantly, it also carries the uncertainty of that knowledge. We can perform computations on this posterior to calculate quantities of interest for practitioners and then answer concrete questions they have, and these computations carry the uncertainty of the posterior with them.

This is important for practitioners since it gives them a more nuanced view of what a research result would actually mean in their context. Rather than the typically dichotomous, or at least coarse-grained, answer from a traditional study using frequentist statistics, e.g. "Exploratory testing should be preferred over scripted test cases", we can give practitioners answers like "If we hire testers with low experience and let them use exploratory testing, we can expect them to find 6-8 faults per hour of testing" (an actual example from our practical significance paper, see its Table 3). We use the posterior to simulate specific scenarios and calculate what our model, and the knowledge the posterior captures, implies. We can also answer "What if?" questions. In this example we could answer the practical question "What if we instead hired highly experienced testers (and still let them use exploratory testing)?" with "We expect them to find 8-12 faults per hour of testing". The practitioner could use this knowledge to weigh the extra cost of hiring more experienced staff against the increased efficiency of their testing.

In concrete terms, the posterior is simply a large table with many samples of all the parameters of your model: each column corresponds to one parameter, and each row corresponds to one joint sample of what the values of these parameters could have been at the same time. To simulate, we come up with a way to calculate a quantity of practical interest (QPI) from the model, with parameter values taken from one row of the table. We then loop over all the rows of the table and get a distribution of the QPI's values. By asking practitioners to estimate costs and efforts, we can then often also calculate concrete utilities, such as ROI (Return-On-Investment), and their distributions. For examples of this, see paper 2.
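
A minimal sketch of this procedure in Python follows. The "posterior table" here is random stand-in data, and the scenario, parameters, and cost/value numbers are all hypothetical, only loosely inspired by the exploratory testing example above; in practice the samples would be exported from your fitted model:

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for the posterior table: each row is one joint sample of the
# model's parameters. (Values are made up for illustration; in practice,
# export these samples from your fitted model.)
n = 4000
baseline = rng.normal(7.0, 0.5, size=n)    # faults/hour, low-experience testers
exp_effect = rng.normal(3.0, 1.0, size=n)  # added faults/hour with high experience

# QPI computed row by row (vectorized here), so the posterior's
# uncertainty carries over into the QPI's distribution.
qpi = baseline + exp_effect  # expected faults/hour, high-experience scenario

# With practitioner estimates of costs and value we can go further,
# e.g. to a (hypothetical) ROI distribution.
value_per_fault = 200.0  # assumed value of finding one fault
cost_per_hour = 900.0    # assumed cost of one experienced tester-hour
roi = (qpi * value_per_fault - cost_per_hour) / cost_per_hour

lo, hi = np.percentile(qpi, [2.5, 97.5])
print(f"QPI 95% interval: {lo:.1f}-{hi:.1f} faults/hour")
print(f"P(ROI > 0) = {(roi > 0).mean():.2f}")
```

Because every quantity is computed per posterior row, the answers come out as distributions rather than point values, which is exactly what lets us report ranges like "6-8 faults per hour" with honest uncertainty.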