## Defect Prediction Models
Defect prediction models have been proposed to predict the most risky areas of source code that are likely to have post-release defects~\cite{menzies2007data,nagappan2010change,d2010extensive,tantithamthavorn2018optimization,tantithamthavorn2016automated,wang2016automatically,wang2018deep}.
A defect prediction model is a classification model that estimates the likelihood that a file will have post-release defects.
One of their main purposes is to help practitioners spend their limited SQA resources on the most risky areas of code in a cost-effective manner.
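
As a minimal sketch of how such predictions could drive SQA prioritization, the snippet below ranks files by predicted risk and selects the riskiest ones that fit within an inspection budget. The file names, risk scores, the `FilePrediction` type, the `prioritize` helper, and the use of lines of code as an effort proxy are all hypothetical and serve only as illustration.

```python
from dataclasses import dataclass


@dataclass
class FilePrediction:
    path: str    # hypothetical file name
    risk: float  # predicted probability of a post-release defect
    loc: int     # lines of code, assumed here as a proxy for inspection effort


def prioritize(predictions, loc_budget):
    """Pick the riskiest files that fit within a line-inspection budget."""
    ranked = sorted(predictions, key=lambda p: p.risk, reverse=True)
    selected, spent = [], 0
    for p in ranked:
        if spent + p.loc > loc_budget:
            continue  # file does not fit; a smaller, less risky one still may
        selected.append(p)
        spent += p.loc
    return selected, spent


# Invented predictions, not the output of any real model.
predictions = [
    FilePrediction("Parser.java", 0.91, 400),
    FilePrediction("Cache.java", 0.15, 120),
    FilePrediction("Scheduler.java", 0.78, 900),
    FilePrediction("Logger.java", 0.40, 200),
]

selected, spent = prioritize(predictions, loc_budget=700)
print([p.path for p in selected], spent)
```

With a budget of 700 lines, the sketch selects `Parser.java` and `Logger.java` (600 lines in total) and skips `Scheduler.java`, which is risky but too large to fit.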

## The modelling pipeline of defect prediction models

The predictive accuracy of a defect prediction model heavily relies on its modelling pipeline~\cite{tantithamthavorn2015icse,tantithamthavorn2016comments,tantithamthavorn2018pitfalls,menzies2019,ghotra2015revisiting,agrawal2018better,tantithamthavorn2016icseds}.
To accurately predict defective areas of code, prior studies conducted comprehensive evaluations to identify the best technique for each step of the modelling pipeline.
For example, these studies evaluated feature selection techniques~\cite{ghotra2017large,jiarpakdee2018autospearman,jiarpakdee2020featureselection},
collinearity analysis~\cite{jiarpakdee2018autospearman,jiarpakdee2016study,jiarpakdee2018impact},
class rebalancing techniques~\cite{tantithamthavorn2018impact},
classification techniques~\cite{ghotra2015revisiting},
parameter optimization~\cite{tantithamthavorn2016automated,fu2016tuning,tantithamthavorn2018optimization,agrawal2018better},
model validation~\cite{tantithamthavorn2017empirical}, and model interpretation~\cite{jiarpakdee2018impact,jiarpakdee2020empirical}.
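
To make one of these pipeline steps concrete, the following is a minimal sketch of a collinearity filter based on Spearman rank correlation. It is a simplified greedy filter written for illustration, not the implementation used in the cited studies; the metric names, the metric values, and the 0.7 threshold are assumptions.

```python
def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's rho is the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def drop_correlated(metrics, threshold=0.7):
    """Keep a metric only if |rho| < threshold against every metric kept so far."""
    kept = []
    for name in metrics:
        if all(abs(spearman(metrics[name], metrics[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical metric values for six files (invented for illustration).
metrics = {
    "loc": [120, 300, 80, 950, 410, 60],
    "statements": [100, 260, 70, 800, 350, 50],  # same ordering as loc
    "churn": [30, 5, 12, 8, 40, 22],
}

print(drop_correlated(metrics))
```

Because `statements` orders the files exactly as `loc` does, the filter drops it and keeps `loc` and `churn`. Note that a greedy filter depends on the order in which metrics are visited, so a different ordering could keep `statements` instead of `loc`.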
Despite the recent advances in the modelling pipelines for defect prediction models, the cost-effectiveness of the SQA resource prioritization still relies on the granularity of the predictions.

## The granularity levels of defect prediction models

The cost-effectiveness of the SQA resource prioritization heavily relies on the granularity levels of defect prediction.
Prior studies argued that prioritizing software modules at a finer granularity is more cost-effective~\cite{pascarella2019fine,kamei2010revisiting,hata2012bug}.
For example, Kamei~\ea~\cite{kamei2010revisiting} found that file-level defect prediction is more effective than package-level defect prediction.
Hata~\ea~\cite{hata2012bug} found that method-level defect prediction is more effective than file-level defect prediction.
Defect models at various granularity levels have been proposed, e.g., packages~\cite{kamei2010revisiting}, components~\cite{thongtanunam2016revisiting}, modules~\cite{kamei2007effects}, files~\cite{kamei2010revisiting,mende2010effort}, and methods~\cite{hata2012bug}.
However, developers could still waste SQA effort on manually identifying the most risky lines, since the current prediction granularity is still perceived as coarse-grained~\cite{wan2018perceptions}.
Hence, line-level defect prediction should help SQA teams spend their effort more efficiently when identifying and analyzing defects.
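
The effect of granularity on inspection effort can be illustrated with a small sketch. All numbers below are invented and are not evidence for either approach: the file sizes, the truly defective lines, and the lines flagged by an imagined line-level predictor are hypothetical, as are the two helper functions.

```python
# Hypothetical data: file size, truly defective line numbers, and the line
# numbers flagged by an imagined line-level predictor.
files = {
    "Parser.java": {
        "loc": 400,
        "defective": {17, 212},
        "flagged": {17, 18, 212, 390},
    },
    "Scheduler.java": {
        "loc": 900,
        "defective": {640, 702},
        "flagged": {101, 640, 641},
    },
}


def file_level_effort(files):
    """Assume every line of each predicted-defective file is inspected."""
    inspected = sum(f["loc"] for f in files.values())
    found = sum(len(f["defective"]) for f in files.values())
    return inspected, found


def line_level_effort(files):
    """Assume only flagged lines are inspected; unflagged defects are missed."""
    inspected = sum(len(f["flagged"]) for f in files.values())
    found = sum(len(f["defective"] & f["flagged"]) for f in files.values())
    return inspected, found


print(file_level_effort(files))
print(line_level_effort(files))
```

In this toy setting, file-level inspection reads 1300 lines to find all 4 defective lines, whereas line-level inspection reads 7 lines and finds 3 of them. The sketch also shows the trade-off: a line-level predictor can miss defective lines that a full file inspection would cover.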

## The Risks of Unsound Software Analytics

However, the predictions and explanations derived from such models may be invalid if practitioners do not consider the risks of building unsound software analytics, which can lead to outdated analytical models and to erroneous and costly business decisions.
Thus, if care is not taken when analyzing and modeling software data, the predictions and insights that are derived from analytical models may be inaccurate and unreliable. 

Below, we provide a hands-on tutorial that guides practitioners on how to (1) analyze software data using statistical techniques such as correlation analysis, hypothesis testing, effect size analysis, and multiple comparisons, (2) develop accurate, reliable, and reproducible analytical models, and (3) interpret the models to uncover relationships and insights. The tutorial also (4) discusses pitfalls associated with analytical techniques, including hands-on examples with real software data.