When a dataset is split up and not evaluated all at once (for example, in cross-validation), some classes may be missing from a given evaluation. Metric implementations get around classes that appear in only one of `y_true` and `y_pred` by considering the union of their labels. However, this is insufficient if a label that existed in the training set for a fold is absent from both the predicted and the true test targets.
This is at least a problem for the P/R/F family of metrics with `average='macro'` and `labels` unspecified: the absent class is left out of the average, so the score is taken over fewer classes than the model was trained on. It should be documented (though a user shouldn't be using 'macro' if there are infrequent labels). I haven't yet thought about whether it is an issue elsewhere, or whether it can reasonably be tested.
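To make the macro-averaging case concrete, here is a minimal sketch. The toy fold, the label set `all_labels`, and the choice of `zero_division=0` (available in recent scikit-learn releases) are assumptions for illustration, not part of the report above. Class 3 is assumed to have been seen in training but is absent from both the true and the predicted test targets.

```python
from sklearn.metrics import f1_score

# Hypothetical test fold: class 3 exists in training but not here.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 0, 2, 2]
all_labels = [0, 1, 2, 3]  # assumed full label set from the training data

# labels unspecified: the average runs over the union of labels
# found in y_true and y_pred, i.e. classes 0, 1 and 2 only.
macro_seen = f1_score(y_true, y_pred, average="macro")

# labels given explicitly: class 3 is included and scores 0
# (zero_division=0 makes that explicit and suppresses the warning).
macro_all = f1_score(
    y_true, y_pred, labels=all_labels, average="macro", zero_division=0
)

print(macro_seen, macro_all)
```

The two values differ only because the denominator of the macro average changes from three classes to four, so per-fold scores are not comparable unless `labels` is passed consistently.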
Where P/R/F handles the binary case specially, this is also a problem for other values of `average`. By this I mean that if one or more missing classes reduce the problem from multiclass to binary, the expected result is completely different.
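The sketch below illustrates the multiclass-to-binary collapse. It assumes a recent scikit-learn release, where the special binary handling is selected by the default `average='binary'`; the behaviour in the version this report was written against may differ. The data and variable names are hypothetical.

```python
from sklearn.metrics import f1_score

# Hypothetical full problem with three classes.
full_true = [0, 0, 1, 1, 2]
full_pred = [0, 1, 1, 1, 2]

# Hypothetical fold in which class 2 is missing from both targets.
fold_true = [0, 0, 1, 1]
fold_pred = [0, 1, 1, 1]

# The fold looks binary, so the default average='binary' reports
# the F1 of the positive class (pos_label=1) only.
fold_binary_f1 = f1_score(fold_true, fold_pred)

# The same call on the multiclass data is rejected outright.
try:
    f1_score(full_true, full_pred)
    multiclass_rejected = False
except ValueError:
    multiclass_rejected = True

# Passing the assumed full label set with an explicit average keeps
# the fold on the same footing as the multiclass problem.
fold_macro_f1 = f1_score(
    fold_true, fold_pred, labels=[0, 1, 2], average="macro", zero_division=0
)

print(fold_binary_f1, multiclass_rejected, fold_macro_f1)
```

The same call therefore means different things depending on which classes happen to land in a fold, which is the discrepancy described above.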