# (Step 4) Model Evaluation


\subsection{Evaluation Measures}\label{sec:eval_measure}
To evaluate the approaches, we use five performance measures preferred by practitioners~\cite{wan2018perceptions}, i.e., recall, false alarm rate, a combination of recall and false alarm rate, initial false alarm, and Top k\%LOC Recall.\footnote{Note that we have confirmed with one of the authors of the survey study~\cite{wan2018perceptions} that top k\%LOC Recall is one of the top-5 measures, not top k\%LOC Precision as reported in the paper.}
In addition, we use the Matthews Correlation Coefficient (MCC) to evaluate the overall predictive accuracy, since it is suitable for unbalanced data such as our line-level defect datasets~\cite{shepperd2014researcher,bowes2016mutation}.
\revisedTextOnly{Below, we} describe each of our performance measures.

\smallsection{Recall} Recall measures the ratio of the number of lines that are correctly identified as defective to the number of actual defective lines.
\revised{R2.4}{More specifically, we compute recall using a calculation of $\frac{TP}{(TP+FN)}$, where $TP$ is the number of actual defective lines that are predicted as defective and $FN$ is the number of actual defective lines that are predicted as clean.}
A high recall value indicates that the approach can identify more defective lines.


\smallsection{False alarm rate (FAR)} FAR measures the ratio of the number of clean lines that are identified as defective to the number of actual clean lines.
\revised{R2.4}{More specifically, we measure FAR using a calculation of $\frac{FP}{(FP+TN)}$, where $FP$ is the number of actual clean lines that are predicted as defective and $TN$ is the number of actual clean lines that are predicted as clean.}
The lower the FAR value is, the fewer the clean lines that are identified as defective.
In other words, a low FAR value indicates that developers spend less effort when inspecting the defect-prone lines identified by an approach.
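
The recall and FAR definitions above can be illustrated with a minimal sketch. The function names, the label encoding (1 for defective, 0 for clean), the toy data, and the choice to return 0 when a denominator is empty are all assumptions made for illustration, not part of the original study.

```python
def confusion_counts(actual, predicted):
    """Count TP, FP, TN, FN for binary line labels (assumed: 1 = defective, 0 = clean)."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    return tp, fp, tn, fn


def recall(tp, fn):
    """Recall = TP / (TP + FN); returns 0.0 if there are no defective lines (assumption)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def false_alarm_rate(fp, tn):
    """FAR = FP / (FP + TN); returns 0.0 if there are no clean lines (assumption)."""
    return fp / (fp + tn) if (fp + tn) else 0.0


# Hypothetical toy data: eight lines, three of which are actually defective.
actual = [1, 1, 0, 0, 0, 1, 0, 0]
predicted = [1, 0, 1, 0, 0, 1, 0, 0]

tp, fp, tn, fn = confusion_counts(actual, predicted)
print(recall(tp, fn), false_alarm_rate(fp, tn))
```

In this toy example, two of the three defective lines are found and one of the five clean lines is flagged, so recall is 2/3 and FAR is 1/5.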

\smallsection{A combination of recall and FAR} In this work, we use \emph{Distance-to-heaven (d2h)} of Agrawal and Menzies~\cite{agrawal2018better} to combine the recall and FAR values.
D2h is the root mean square of the distances between the achieved recall and FAR values and their ideal values of 1 and 0, respectively (i.e., $\sqrt{\frac{(1-Recall)^2 + (0-FAR)^2}{2}}$)~\cite{agrawal2018better,agrawal2019dodge}.
A d2h value of 0 indicates that an approach achieves a perfect identification, i.e., an approach can identify all defective lines (Recall $= 1$) without any false positives (FAR $= 0$).
A high d2h value indicates that the performance of an approach is far from perfect, e.g., achieving a high recall value but also a high FAR value, or vice versa.
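
The d2h formula can be written as a short function. This is a sketch that follows the formula as stated in the text; the function name `d2h` is chosen here for illustration.

```python
import math


def d2h(recall, far):
    """Distance-to-heaven: root mean square distance from the ideal point (Recall = 1, FAR = 0)."""
    return math.sqrt(((1 - recall) ** 2 + (0 - far) ** 2) / 2)


print(d2h(1.0, 0.0))  # perfect identification -> 0.0
print(d2h(0.0, 1.0))  # worst case -> 1.0
print(d2h(0.9, 0.8))  # high recall but also high FAR
```

Because of the division by 2 inside the square root, the value stays between 0 and 1 whenever recall and FAR are both in that range.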

%\smallsection{Top k\% LOC Precision} Top k\% LOC precision measures how many defective lines found when inspecting the top k\% of lines ranked by the defect-proneness estimated by the approach\cite{huang2017supervised}.
%A high value of top k\% LOC precision indicates that the approach can rank many defective lines at the top and many defective lines can be found given the fixed amount of effort (i.e., k\% of LOC).
%On the other hand, the low value of top k\% LOC precision indicates many clean lines are in the top k\% LOC and developers need to inspect more lines to identify defects.
%Similar to prior work~\cite{mende2010effort,kamei2010revisiting,rahman2014comparing,ray2016naturalness}, we use 20\% of LOC as a fixed cutoff for an effort.

\smallsection{Top k\%LOC Recall} Top k\%LOC recall measures how many actual defective lines are found given a fixed amount of effort, i.e., the top k\% of lines ranked by their defect-proneness~\cite{huang2017supervised}.
A high value of top k\%LOC recall indicates that an approach can rank many actual defective lines at the top and many actual defective lines can be found given a fixed amount of effort.
On the other hand, a low value of top k\%LOC recall indicates that many clean lines are in the top k\%LOC and that developers need to spend more effort to identify defective lines.
Similar to prior work~\cite{mende2010effort,kamei2010revisiting,rahman2014comparing,ray2016naturalness}, we use 20\% of LOC as a fixed effort cutoff.
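
A minimal sketch of the top k\%LOC recall computation follows. It reads the measure as the share of all actual defective lines that fall within the top k\% of ranked lines. The function name, the floor rounding of the cutoff, the tie-breaking by original line order, and the toy scores are assumptions for illustration only.

```python
def top_k_loc_recall(scores, actual, k_percent=20):
    """Share of actual defective lines found in the top k% of lines ranked by score.

    scores: defect-proneness score per line (higher = more defect-prone).
    actual: 1 for an actual defective line, 0 for a clean line (assumed encoding).
    """
    n_inspect = len(scores) * k_percent // 100  # assumed: round the cutoff down
    # Rank line indices by descending score; ties keep their original order.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    total_defective = sum(actual)
    if total_defective == 0:
        return 0.0  # assumed convention when there is nothing to find
    found = sum(actual[i] for i in ranked[:n_inspect])
    return found / total_defective


# Hypothetical toy data: ten lines, three of which are actually defective.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
actual = [1, 0, 1, 0, 0, 0, 1, 0, 0, 0]
print(top_k_loc_recall(scores, actual, 20))
```

With a 20\% cutoff, only the two highest-ranked lines are inspected, and one of the three defective lines is among them, giving 1/3.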

\smallsection{Initial False Alarm (IFA)} IFA measures the number of clean lines on which developers spend SQA effort until the first defective line is found when lines are ranked by their defect-proneness~\cite{huang2017supervised}.
A low IFA value indicates that few clean lines are ranked at the top, while a high IFA value indicates that developers will spend unnecessary effort on clean lines.
The intuition behind this measure is that developers may stop inspecting if they cannot get promising results (i.e., find defective lines) within the first few inspected lines~\cite{parnin2011automated}.
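
IFA can be sketched as a count over the ranked list. The function name, the label encoding, and the handling of a file with no defective lines (returning the total number of lines) are assumptions made for this illustration.

```python
def initial_false_alarm(scores, actual):
    """Number of clean lines inspected before the first actual defective line is reached.

    scores: defect-proneness score per line (higher = inspected earlier).
    actual: 1 for an actual defective line, 0 for a clean line (assumed encoding).
    """
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    clean_seen = 0
    for i in ranked:
        if actual[i] == 1:
            return clean_seen
        clean_seen += 1
    # Assumed convention: with no defective line, every inspected line is a false alarm.
    return clean_seen


# Hypothetical toy data: the only defective line has the lowest score.
print(initial_false_alarm([0.2, 0.9, 0.7, 0.4], [1, 0, 0, 0]))  # 3
# Here the defective line is ranked first, so no clean line is inspected before it.
print(initial_false_alarm([0.2, 0.9, 0.7, 0.4], [0, 1, 0, 0]))  # 0
```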

\smallsection{Matthews Correlation Coefficient (MCC)} MCC measures the correlation between actual and predicted outcomes using the following calculation:
\begin{equation}
\frac{TP \times TN - FP \times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}
\end{equation}
An MCC value ranges from $-1$ to $+1$, where an MCC value of $+1$ indicates a perfect prediction, and $-1$ indicates total disagreement between the predicted and actual outcomes.
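
The MCC equation can be checked numerically with a small sketch. The function name and the choice to return 0 when the denominator is zero are assumptions made here; the text does not state how that case is handled.

```python
import math


def mcc(tp, fp, tn, fn):
    """Matthews Correlation Coefficient computed from confusion-matrix counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # assumed convention when any marginal count is zero
    return (tp * tn - fp * fn) / denominator


print(mcc(5, 0, 5, 0))  # every line classified correctly -> 1.0
print(mcc(0, 5, 0, 5))  # every line classified incorrectly -> -1.0
print(mcc(2, 1, 4, 1))  # a mixed case between the two extremes
```

The mixed case uses hypothetical counts and evaluates to 7/15, since the numerator is 2*4 - 1*1 = 7 and the denominator is sqrt(3*3*5*5) = 15.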

\subsection{Validation Settings}
In this paper, we use both within-release and cross-release validation settings.
Below, we describe each of our validation settings.