Model evaluation and evaluation metrics

During the model-building process, understanding model performance and setting an appropriate single-value evaluation metric are critical, as they guide how to improve the model.

When debugging the learning algorithm, a few options to consider include:

  • Obtain more training data (increase m)
  • Add more features (increase n) through feature extraction
  • Remove some features (decrease n) through feature selection
  • Adjust hyperparameters (e.g. regularization parameters, type of kernel)

In general: Overfitting (high variance) vs. Underfitting (high bias)

Rule of thumb:

  1. Fit the model on the training set
  2. Select the best model on the cross-validation set
  3. Evaluate model performance on the test set (see the sketch below)
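
A minimal scikit-learn sketch of this three-way split. The synthetic dataset and the roughly 60/20/20 proportions are assumptions for illustration, not a prescription:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a real dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Hold out the test set first, then carve a cross-validation set
# out of the remaining data (roughly 60/20/20 overall).
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
X_train, X_cv, y_train, y_cv = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=0)
```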

Tool #1: Validation curve (for hyperparameter tuning)

For a particular hyperparameter, a validation curve helps determine whether the model is overfitting or underfitting as that hyperparameter's value varies (see the sketch after this list).

  • If the error is high for both the training and the CV sets, the model is underfitting - High bias
  • If the error is high for the CV set but low for the training set, the model is overfitting - High variance
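
A minimal sketch using scikit-learn's `validation_curve`, assuming an RBF-kernel SVM whose regularization parameter C is being tuned; the estimator, dataset, and parameter range are illustrative choices:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Score the model on the training and CV folds for each value of C.
param_range = np.logspace(-3, 3, 7)
train_scores, cv_scores = validation_curve(
    SVC(kernel="rbf"), X, y,
    param_name="C", param_range=param_range, cv=5)

# Low training and CV scores -> underfitting (high bias);
# high training score with a much lower CV score -> overfitting (high variance).
print(train_scores.mean(axis=1))
print(cv_scores.mean(axis=1))
```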

Tool #2: Learning curve

A learning curve helps determine whether the model is overfitting or underfitting as the number of training samples (m) varies (see the sketch after this list).

  • If the errors of the training and the CV sets converge to a high value, the model is underfitting - High bias (Fig. a)
  • If the error of the CV set is much higher than that of the training set, the model is overfitting - High variance (Fig. b)
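
A minimal sketch using scikit-learn's `learning_curve`, assuming a logistic-regression classifier on synthetic data; both are illustrative choices:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Train on increasing fractions of the data and score on training / CV folds.
train_sizes, train_scores, cv_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

# Both curves converging to a low score -> underfitting (high bias);
# a persistent gap between the curves -> overfitting (high variance).
print(train_sizes)
print(train_scores.mean(axis=1))
print(cv_scores.mean(axis=1))
```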

Some options to consider for improvement

  1. If a model is overfitting (high variance), consider:
    • Increasing m (more training data)
    • Decreasing n by removing less important features
    • More regularization
  2. If a model is underfitting (high bias), consider:
    • Increasing n by adding more features
    • Less regularization
    • Note: increasing m does not help improve the model performance

For a classification model (binary)

Model performance can be evaluated in various ways (e.g. accuracy, user satisfaction, patient survival rate). Accuracy is the most popular default choice, but in many scenarios it gives only a partial picture of performance.

  • Typical example: imbalanced or skewed classes (only a small subset of the data belongs to the positive class)
    • Possible scenarios: occurrence of fire in a city, occurrence of malignant tumors, credit fraud detection, etc.
    • Problem: even an untrained model (dummy classifier) can achieve high accuracy by predicting every instance as negative
  • Solution: consider other evaluation metrics such as precision and recall (confusion matrix)
    • **P**recision = % True among those **P**redicted as True (important when False **P**ositives must be avoided)
    • **R**ecall = % True among the **R**eal True (important when False Negatives must be avoided)
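
A minimal sketch of the imbalanced-class pitfall, using a synthetic dataset with roughly 5% positives; the dataset and both classifiers are illustrative choices:

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)
from sklearn.model_selection import train_test_split

# Imbalanced toy data: ~95% negative, ~5% positive
X, y = make_classification(n_samples=2000, weights=[0.95], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A dummy classifier that always predicts the majority (negative) class
# still reaches roughly 95% accuracy on this data.
dummy = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
print(accuracy_score(y_test, dummy.predict(X_test)))

# Precision, recall, and the confusion matrix expose the difference.
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(precision_score(y_test, y_pred), recall_score(y_test, y_pred))
```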

  • Precision and Recall trade-off

    | When? | Action | Precision | Recall |
    | --- | --- | --- | --- |
    | To avoid False Positives | Raise the decision threshold (> 0.5) | Increases | Decreases |
    | To avoid False Negatives | Lower the decision threshold (< 0.5) | Decreases | Increases |
  • Two separate metrics can be difficult to weigh against each other when choosing the better model; consider instead:

    • F1 = 2 × (Precision × Recall) / (Precision + Recall)
    • Fβ = (1 + β²) × (Precision × Recall) / (β² × Precision + Recall)
      • β is set high when recall is important (e.g. β ≈ 2)
      • β is set low when precision is important (e.g. β ≈ 0.5)
    • Receiver operating characteristic (ROC) curve and the area under the curve (AUC)
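
A minimal sketch of the threshold trade-off and the single-number summaries above, on an illustrative imbalanced dataset; the 0.8 threshold is an arbitrary "strict" choice:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (f1_score, fbeta_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.9], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Raising the decision threshold above 0.5 trades recall for precision.
proba = clf.predict_proba(X_test)[:, 1]
for threshold in (0.5, 0.8):
    y_pred = (proba >= threshold).astype(int)
    print(threshold,
          precision_score(y_test, y_pred),
          recall_score(y_test, y_pred))

# Single-number summaries of the trade-off.
y_pred = (proba >= 0.5).astype(int)
print(f1_score(y_test, y_pred))
print(fbeta_score(y_test, y_pred, beta=2))    # recall-weighted
print(fbeta_score(y_test, y_pred, beta=0.5))  # precision-weighted
print(roc_auc_score(y_test, proba))           # threshold-independent
```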

For a classification model (multi-class)

As an extension of the simple binary case, overall evaluation metrics are averaged across classes. There are different ways of averaging, and depending on the class distribution they may lead to different results.

  • Type of averaging
    • Macro-average: each class has equal weight. Compute metrics within each class, then average them.
    • Micro-average: each instance has equal weight. Aggregate the outcome first, then compute metrics. In this case, large classes have more influence.
| When? | Averaging to use |
| --- | --- |
| To weight more toward the small classes | Macro |
| To weight more toward the large classes | Micro |
  • Compare two averaging results:
    • if micro << macro, the large classes have poor metrics
    • if micro >> macro, the small classes have poor metrics
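
A minimal sketch comparing macro- and micro-averaged precision on an illustrative imbalanced 3-class dataset; the class proportions are an assumption:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score
from sklearn.model_selection import train_test_split

# Imbalanced 3-class toy data (roughly 80/15/5 split)
X, y = make_classification(n_samples=2000, n_classes=3, n_informative=5,
                           weights=[0.8, 0.15, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
y_pred = LogisticRegression(max_iter=1000).fit(X_train, y_train).predict(X_test)

# Macro: average the per-class precisions (small classes count equally).
# Micro: pool all instances first (large classes dominate).
print(precision_score(y_test, y_pred, average="macro"))
print(precision_score(y_test, y_pred, average="micro"))
```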

For a regression model

Typically, the R² score (variance explained by the model / total variance) is satisfactory. If not, consider minimizing one of the following:

  • Mean Absolute Error (MAE)
    • average magnitude of the differences (errors) between the predicted and true target values
    • has the same unit as the target variable and is easy to interpret
    • does not distinguish between over- and under-estimation
  • Mean Squared Error (MSE)
    • similar to MAE, but penalizes large errors more heavily (desirable when large errors must be avoided)
    • has a different unit from the target variable (the unit squared)
    • does not distinguish between over- and under-estimation
  • Root Mean Squared Error (RMSE)
    • like MSE, penalizes large errors more heavily (desirable when large errors must be avoided)
    • like MAE, has the same unit as the target variable, though the result is somewhat harder to interpret
    • does not distinguish between over- and under-estimation
  • Median Absolute Error
    • robust to outliers
  • etc.
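
A minimal sketch computing these regression metrics with scikit-learn; the synthetic dataset and linear model are illustrative choices:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import (mean_absolute_error, mean_squared_error,
                             median_absolute_error, r2_score)
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
y_pred = LinearRegression().fit(X_train, y_train).predict(X_test)

print(r2_score(y_test, y_pred))                     # variance explained
print(mean_absolute_error(y_test, y_pred))          # same unit as target
print(mean_squared_error(y_test, y_pred))           # penalizes large errors
print(np.sqrt(mean_squared_error(y_test, y_pred)))  # RMSE, same unit as target
print(median_absolute_error(y_test, y_pred))        # robust to outliers
```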