Skip to content

HTTPS clone URL

Subversion checkout URL

You can clone with HTTPS or Subversion.

Download ZIP

Loading…

Many redundant prediction probabilities for test instances with sparse SVM #3266

Closed
cbmeyer opened this Issue · 45 comments

7 participants

@cbmeyer

I’m having an issue using the prediction probabilities for sparse SVM, where many of the predictions come out the same for my test instances. These probabilities are produced during cross validation, and when I plot an ROC curve for the folds, the results look very strange, as there are a handful of clustered points on the graph. Here is my cross validation code, I based it off of the samples on the scikit website:

skf = StratifiedKFold(y, n_folds=numfolds)

for train_index, test_index in skf:
         #split the training and testing sets
         X_train, X_test = X_scaled[train_index], X_scaled[test_index]
         y_train, y_test = y[train_index], y[test_index]

         #train on the subset for this fold
         print 'Training on fold ' + str(fold)
         classifier = svm.SVC(C=C_val, kernel='rbf', gamma=gamma_val, probability=True)
         probas_ = classifier.fit(X_train, y_train).predict_proba(X_test)

         #Compute ROC curve and area the curve
         fpr, tpr, thresholds = roc_curve(y_test, probas_[:, 1])
         mean_tpr += interp(mean_fpr, fpr, tpr)
         mean_tpr[0] = 0.0
         roc_auc = auc(fpr, tpr)

I’m just trying to figure out if there’s something I’m obviously missing here, since I used this same training set and SVM parameters with libsvm and got much better results. When I used libsvm and printed out the distances from the hyperplane for the CV test instances and then plotted the ROC, it came out much more like I expected, and a much better AUC. Since the decision_function() method is not supported for sparse matrices, I cannot recreate this functionality in scikit, and therefore have to rely on the prediction probabilities.

There are 20k instances total, 10k positive and 10k negative, and I'm using 5-fold cross-validation. In the cross-validation results, there are several prediction values for which there are 1k-2k samples that all have the same prediction value, and there are only 3600 distinct prediction values over all of the folds for cross-validation. The resulting ROC looks like five big stair steps, with some little bits of fuzziness around the inner corners.

I have many sparse features, so I'm hashing those into index ranges for different types of feature subsets, so one feature subset will be in the index range 1 million to 2 million, the next will be in the range 2 million to 3 million, etc.

@agramfort
Owner

I am guessing you're hitting the issue that scipy sparse matrices indices are in32 so have an overflow...

I am not sure of the status of scipy.sparse with int64 indices.

I would use dim reduction with univariate feature selection to avoid it.

@larsmans
Owner

scipy.sparse supports 64-bit indices now, but we haven't updated our code to deal with that yet. That means you're limited to 2bln features, 2bln samples and 2bln non-zeros per matrix.

@cbmeyer

I don't think that's the issue, as I have 20k training instances, and almost all of them have < 5k features that are populated, with the majority having only hundreds populated. There are 4 million total features, so I could understand if the vectors weren't so sparse, but since a majority of that 4 million is zeroes, I don't see that approaching those limits.

@ogrisel
Owner

@cbmeyer can you confirm that #3268 is a fix for your issue?

@cbmeyer

I don't believe that #3268 is a fix for the particular issue I'm seeing, as the issue lies not in the roc_curve function, but rather in the predict_proba function for sparse SVM. The ROC calculation is correct, it is the predictions that seem to have an issue.

@amueller
Owner

The predict_proba uses Platt scaling for recallibration. Could you try to do probablility=False and use just the decision function? Could you also please add images of the roc curves?

@cbmeyer

It seems like the Platt scaling is the likely culprit here, but I can't use the decision function, as it's not been implemented for sparse matrices, though I would much prefer to just use that instead. Here's a sample ROC curve showing the issue:
all_folds_sampled2_round2 txt-100p-roc

@cbmeyer

For comparison, when I used LIBSVM instead, and used the distance from the hyperplane values from the CV results, I got a much more reasonable ROC curve. This ROC curve is from a LIBSVM classifier trained using the same data, same C and gamma values, and the same 5-fold CV scheme.
sampled_round2_va-100p-roc

@ogrisel
Owner

and the same 5-fold CV scheme

To clarify this point: do you average the ROC curves of the 5CV or do you concatenate the test predictions of the 5 folds before computing the roc curve?

@ogrisel
Owner

Since the decision_function() method is not supported for sparse matrices,

Sounds like a mission for @hamsal. @arjoly has this particular already been considered to be part of his GSoC?

@cbmeyer

That plot was produced by concatenating the probability predictions for all instances, as I was first alerted to the problem after averaging the ROC curves for each fold.

@ogrisel
Owner

@cbmeyer can you isolate the problem by computing the probabilities with the -b flag of libsvm and check that they match those of SVC.predict_proba?

@ogrisel
Owner

I am not sure whether we can fix the randomness of the internal CV used by libsvm to perform the Platt scaling though.

@arjoly
Owner

Since the decision_function() method is not supported for sparse matrices,

Sounds like a mission for @hamsal. @arjoly has this particular already been considered to be part of his GSoC?

No, we have mostly discussed about sparse input support in ensemble method and bring sparse output support.

@ogrisel, @larsmans, @amueller What is the state of #1586?

@arjoly
Owner

What is your feeling about this @hamsal?

@cbmeyer

Ran the same data through libsvm with the -b flag, and the probability prediction values definitely don't match. In the output of SVC.predict_proba there are many redundant prediction values, only 3585 distinct values for 20,000 samples. In libsvm with -b, there are 17960 distinct prediction values for the same 20,000 samples. Also, just comparing the roc curves is very telling, here is the roc curve from the results of the -b flag in libsvm:
sampled_round2_proba-100p-roc
Compared to the curve I posted above, they are extremely different.

@ogrisel
Owner

There is probably a bug in predict_proba then. I don't have the time to investigate further right now though. If you find the cause please feel free to issue a pull request.

@ogrisel ogrisel added the Bug label
@ogrisel
Owner

@ogrisel, @larsmans, @amueller What is the state of #1586?

No idea. It would be great to finish it but based on the last comments it does not look easy.

@amueller amueller added this to the 0.15.1 milestone
@amueller
Owner

Hum, that is not good :-/ I'll try to have a look later today.

@amueller amueller modified the milestone: 0.15.1
@amueller
Owner

I can not reproduce the step function behavior. Can you get that behavior with a dataset you can share? Or can you share your input matrices?

@ogrisel
Owner

I cannot reproduce either. Here is the script I used:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.metrics import roc_curve
from sklearn.cross_validation import StratifiedKFold
from sklearn.preprocessing import scale
from sklearn import svm

digits = load_digits()
X_scaled = scale(digits.data)
y = digits.target == 0
y[np.random.rand(y.shape[0]) > 0.9] = 1
C_val = 100
gamma_val = 1e-6

skf = StratifiedKFold(y, n_folds=5)

predictions = "predict_proba"

for fold, (train_index, test_index) in enumerate(skf):
     # split the training and testing sets
     X_train, X_test = X_scaled[train_index], X_scaled[test_index]
     y_train, y_test = y[train_index], y[test_index]

     # train on the subset for this fold
     print('Training on fold ' + str(fold))
     classifier = svm.SVC(C=C_val, kernel='rbf', gamma=gamma_val, probability=True)
     if predictions == 'predict_proba':
         y_pred = classifier.fit(X_train, y_train).predict_proba(X_test)[:, 1]
     elif predictions == 'decision_function':
         y_pred = classifier.fit(X_train, y_train).decision_function(X_test)
     else:
         raise ValueError(predictions)

     # Compute ROC curve and area the curve
     fpr, tpr, thresholds = roc_curve(y_test, y_pred)
     plt.plot(fpr, tpr, label='ROC curve of fold {0}'.format(fold))

plt.title("ROC curves computed with " + predictions)
plt.legend(loc='best')
plt.show()

and the outcome with predict proba:

roc_predict_proba

with decision function:

roc_decision_function

@ogrisel
Owner

I retried with sparse input as follows:

import numpy as np
from scipy.sparse import csr_matrix
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.metrics import roc_curve
from sklearn.cross_validation import StratifiedKFold
#from sklearn.preprocessing import scale
from sklearn import svm

digits = load_digits()
X_scaled = csr_matrix(digits.data)
y = digits.target == 0
y[np.random.rand(y.shape[0]) > 0.9] = 1
C_val = 100
gamma_val = 1e-6

skf = StratifiedKFold(y, n_folds=5)

predictions = "predict_proba"

for fold, (train_index, test_index) in enumerate(skf):
     # split the training and testing sets
     X_train, X_test = X_scaled[train_index], X_scaled[test_index]
     y_train, y_test = y[train_index], y[test_index]

     # train on the subset for this fold
     print('Training on fold ' + str(fold))
     classifier = svm.SVC(C=C_val, kernel='rbf', gamma=gamma_val, probability=True)
     if predictions == 'predict_proba':
         y_pred = classifier.fit(X_train, y_train).predict_proba(X_test)[:, 1]
     elif predictions == 'decision_function':
         y_pred = classifier.fit(X_train, y_train).decision_function(X_test)
     else:
         raise ValueError(predictions)

     # Compute ROC curve and area the curve
     fpr, tpr, thresholds = roc_curve(y_test, y_pred)
     plt.plot(fpr, tpr, label='ROC curve of fold {0}'.format(fold))

plt.title("ROC curves computed with " + predictions)
plt.legend(loc='best')
plt.show()

the output looks very similar.

@ogrisel ogrisel closed this
@ogrisel ogrisel reopened this
@ogrisel
Owner

Sorry I did not mean to click on the close button...

@cbmeyer

Here is a link to the feature vectors, in LIBSVM format:

https://www.dropbox.com/s/ovqg5t3v0o0mnn8/sampled_round2.data

I should also note that I'm scaling the data like so:

#load SVM format file
X, y = load_svmlight_file(inputfile)

#scale training data
std_scaler = preprocessing.StandardScaler(with_mean=False)
X_scaled = std_scaler.fit_transform(X)
@amueller
Owner

Thanks :)

@amueller
Owner

@cbmeyer what where your gamma and C? (sorry for the very slow reply)

@amueller
Owner

Also reproducing with a smaller dataset would be helpful, this one is huge (which is why I didn't do a grid-search ;)

@cbmeyer

The C value is 94.1605007514 and gamma value is 0.00456579534043. I'll see if I can put together a smaller dataset that will reproduce the issue.

@amueller
Owner

I used these parameters and could reproduce a lot of duplicate probability estimates.
Then I subsampled by a factor of 100 and also observed duplicate values. Then I ran libsvm and it gave me similarly many duplicate values.
So I conclude it is the platt scaling, and there is no difference between libsvm and sklearn.
Can you provide a single train/test set split that gives very different results on libsvm and sklearn, with the same scaling and parameters?

@cbmeyer

Interesting, the plot I posted above on June 25 uses the -b flag in libsvm, and shows drastically different results. I'm not sure how you got the same result output, but mine is definitely much different (see above post). Just to reiterate from the first post, the reason I discovered this behavior was that the decision_function() method is not supported for sparse vectors, so I was forced to used this route to produce results. If the I could use distance from the boundary instead, it would be a far better solution than incurring the overhead of the Platt scaling.

@ogrisel
Owner

I discovered this behavior was that the decision_function() method is not supported for sparse vectors.

Indeed. If you feel motivated a PR to add support for this would be very well received I think (with tests and all).

Coming back to the original issue, let us know if you can find a tuple of (dataset, C, gamma) that yields very distinct behaviors between sklearn and libsvm. Otherwise I think we should close this issue.

@cbmeyer

The dataset, C, and gamma I've provided produced drastically different results for me between sklearn and libsvm, as outlined in my earlier posts.

@ogrisel
Owner

@amueller @cbmeyer which version of scikit-learn the libsvm commandline tool are you both using?

@cbmeyer can you please give the exact command line for libsvm and python script you used for sklearn?

@amueller
Owner

@cbmeyer I am working on implementing the decision_function, but I can't promise anything.
Your sklearn script only uses one file, which is then split. How did you split the data for libsvm? Also, your sklearn script uses scaling. How did you scale the data for libsvm? Your sklearn code uses scaling.

Without knowing how you split or scaled the data I can not reproduce your results.

@amueller
Owner

I used libsvm-3.12-1 as bundled with Ubuntu (I think, I won't get to the laptop for a week now).

@cbmeyer

I'm using libsvm 3.18. For doing the scaling and cross fold splits, I do something like this in a bash script:

# scale the data
libsvm-3.18/svm-scale -l 0 -s sampled_round2.scalefactors sampled_round2.data > sampled_round2.scaled

# split out positive and negative instances for sampling
cat sampled_round2.scaled | awk '{if ($1==-1) print}' > negatives
cat sampled_round2.scaled | awk '{if ($1==1) print}' > positives

# sample for each fold
for fold in {1..$num_folds}
do
    cat negatives | awk '{if (NR%'${num_folds}'=='${fold}') print}' > sampled_round2_test_${fold}.data
    cat negatives | awk '{if (NR%'${num_folds}'!='${fold}') print}' > sampled_round2_train_${fold}.data

    cat positives | awk '{if (NR%'${num_folds}'=='${fold}') print}' >> sampled_round2_test_${fold}.data
    cat positives | awk '{if (NR%'${num_folds}'!='${fold}') print}' >> sampled_round2_train_${fold}.data
done

# train on each fold and predict
for fold in {1..$num_folds}
do 
   libsvm-3.18/svm-train -t 2 -g 0.00456579534043 -c 94.1605007514 -w1 1 -w-1 1 -b 1 sampled_round2_train_${fold}.data model_round2_${fold}
   libsvm-3.18/svm-predict -b 1 sampled_round2_test_${fold}.data model_round2_${fold} res_${fold}
   cat res_${fold} >> res_all_folds
done
@cbmeyer

Here's a more complete version of my sklearn Python code:

C_val = 94.1605007514
gamma_val = 0.00456579534043

#load SVM format file
X, y = load_svmlight_file(inputfile)

#scale training data
std_scaler = preprocessing.StandardScaler(with_mean=False)
X_scaled = std_scaler.fit_transform(X)

skf = StratifiedKFold(y, n_folds=numfolds)

fold = 1

for train_index, test_index in skf:
   #split the training and testing sets
   X_train, X_test = X_scaled[train_index], X_scaled[test_index]
   y_train, y_test = y[train_index], y[test_index]

   #train on the subset for this fold
   print 'Training on fold ' + str(fold)
   classifier = svm.SVC(C=C_val, kernel='rbf', gamma=gamma_val, probability=True)
   probas_ = classifier.fit(X_train, y_train).predict_proba(X_test)

   #Compute ROC curve and area of the curve
   fpr, tpr, thresholds = roc_curve(y_test, probas_[:, 1])
   mean_tpr += interp(mean_fpr, fpr, tpr)
   mean_tpr[0] = 0.0
   roc_auc = auc(fpr, tpr)

   #predict on the testing subset for this fold
   res = probas_[:, 1]

   #write out the prediction probabilities for each fold
   out = open('cv_fold_' + str(fold) + '_prob', 'w')
   for i in range(0, len(y_test)):
      out.write(str(y_test[i]) + '\t' + str(res[i]) + '\n')

   fold += 1
@amueller
Owner

Thanks.
The scaling you use is different. You use min/max scaling for libsvm and mean/std scaling for scikit-learn. Can you try using the MinMaxScaler instead of the StandardScaler? That should give you identical results.

@amueller
Owner

hm not sure MinMaxScaler currently supports sparse data :-/ you can either do it "by hand" or load the data you scaled with svm-scale. But either way, I'm petty sure that is the source of the discrepancy.

@cbmeyer

Thanks for the input, I'll try loading in the libsvm scaled data directly and see how that fares.

@cbmeyer

After feeding the libsvm scaled data in directly, looks like I get the prediction values I expect, so the StandardScaler looks like the culprit. Thanks for the input, this was very helpful!

@cbmeyer cbmeyer closed this
@Karthick333031

Hi,
I am totally new to these topics. Slowly trying to learn and pickup scikit. I encountered the same issue and I am not able to understand what causes the same probabilities. Please find the details below.

I am using 0.15.2 version.

import sklearn
sklearn.version
'0.15.2'

I took the data from the following Kaggle challenge: http://www.kaggle.com/c/datasciencebowl

Per my reading online from scikit documentation and references, I used the below code to extract features, train/test and predict probabilities for the test data set provided.

#Coding Starts
#! /usr/bin/python

#Variables
train_data_path='/home/ubuntu/kaggle/nationaldatasciencebowl-feb2015/train/'
test_data_path='/home/ubuntu/kaggle/nationaldatasciencebowl-feb2015/test/'

total_train_list=[]
target_list=[]

#Imports
import os, Image, numpy as np
import pandas
from scipy.ndimage import convolve
from sklearn import metrics
from sklearn.svm import SVC
from sklearn import preprocessing
from sklearn.neural_network import BernoulliRBM
from sklearn.pipeline import Pipeline
from sklearn.externals import joblib
import time

#Nudge data by convolving
#Took this fn from scikit example
def nudge_dataset(X, Y):
"""
This produces a dataset 5 times bigger than the original one,
by moving the 8x8 images in X around by 1px to left, right, down, up
"""
direction_vectors = [
[[0, 1, 0],
[0, 0, 0],
[0, 0, 0]],

    [[0, 0, 0],
     [1, 0, 0],
     [0, 0, 0]],

    [[0, 0, 0],
     [0, 0, 1],
     [0, 0, 0]],

    [[0, 0, 0],
     [0, 0, 0],
     [0, 1, 0]]]

shift = lambda x, w: convolve(x.reshape((64, 64)), mode='constant',
                              weights=w).ravel()
X = np.concatenate([X] +
                   [np.apply_along_axis(shift, 1, X, vector)
                    for vector in direction_vectors])
Y = np.concatenate([Y for _ in range(5)], axis=0)
return X, Y

#Extract Train Data
print "Feature extraction on Train data Started"
print time.time()
for plankton in os.listdir(train_data_path):
for imgs in os.listdir(os.path.join(train_data_path,plankton)):
img = Image.open(os.path.join(train_data_path, plankton, imgs))
img = img.resize((64,64))
img_arr = np.asarray(img)
img_arr_1d = np.zeros((1,4096))
img_arr_1d = img_arr.flatten()
total_train_list.append(img_arr_1d)
target_list.append(plankton)

total_train_array = np.asarray(total_train_list, 'float32')
train, target = total_train_array, target_list
print len(train), len(target)
print "Feature extraction on Train data Completed"
print time.time()

#Scale the data
nudged_train, nudged_target = nudge_dataset(train, target)
min_max_scaler = preprocessing.MinMaxScaler()
nudged_scaled_train = min_max_scaler.fit_transform(nudged_train)

#Instantiate model pipeline
svc = SVC(kernel='rbf', probability=True, gamma=0.1)
rbm = BernoulliRBM(random_state=0, verbose=True)

classifier = Pipeline(steps=[('rbm', rbm), ('svm', svc)])
###############################################################################

Training

rbm.learning_rate = 0.06
rbm.n_iter = 20
rbm.n_components = 200

print "Model Building Started"
print time.time()

Training RBM-SVM Pipeline

classifier.fit(nudged_scaled_train, nudged_target)

print "Model Building Completed"
print time.time()

joblib.dump(classifier, '/home/ubuntu/kaggle/nationaldatasciencebowl-feb2015/model/kaggle-plankton-cnn-svm-1.pkl')
print "Model saved"
print time.time()
print "***************************************************************************"

#Extract Test Data
def extract_test_data_from_images(test_data_path):
total_test_list=[]
test_input_list=[]
for imgs in os.listdir(test_data_path):
img = Image.open(os.path.join(test_data_path, imgs))
img = img.resize((64,64))
img_arr = np.asarray(img)
img_arr_1d = np.zeros((1,4096))
img_arr_1d=img_arr.flatten()
total_test_list.append(img_arr_1d)
test_input_list.append(imgs)
total_test_array = np.asarray(total_test_list, 'float32')
return total_test_array, test_input_list

print "Feature extraction on Test data Started"
print time.time()
test_data, input_list = extract_test_data_from_images(test_data_path)
print "Feature extraction on Test data Completed"
print time.time()

#Scale Test Data
scaled_test = min_max_scaler.transform(test_data)

#Load the model
rbmsvm_clf = joblib.load('/home/ubuntu/kaggle/nationaldatasciencebowl-feb2015/model/kaggle-plankton-cnn-svm-1.pkl')

#Predict the probabilities and class
print "Predict on Test data Started"
print time.time()
pred_class = rbmsvm_clf.predict(scaled_test)
pred_class_prob = rbmsvm_clf.predict_proba(scaled_test)
print "Predict on Test data Completed"
print time.time()

dist_target = set(target)
print len(scaled_test), len(pred_class), len(pred_class_prob), len(dist_target)
#Write the output
planktons_output=[]
for ind in range(0, len(input_list)):
row=[]
row.append(input_list[ind])
row.extend(list(pred_class_prob[ind]))
planktons_output.append(row)

planktons_output_class=[]
for ind in range(0, len(input_list)):
row=[]
row.append(input_list[ind])
row.append(pred_class[ind])
planktons_output_class.append(row)

df1 = pandas.DataFrame(planktons_output)
df2 = pandas.DataFrame(planktons_output_class)

df1.to_csv("planktons_output.csv")
df2.to_csv("planktons_class_out.csv")`


The output probabilities predicted by predict_proba is marginally different but they are broadly the same for the first 6 - 7 decimal values.

3980.jpg,0.029336030196937507,0.0004290310973272736,.......(other probs for other classes)
64551.jpg,0.029336029371341908,0.00042903114570667965,......(other probs for other classes)

It would also be helpful to know the class name for each of the probability predicted. Is it possible to tag the class name associated with each probability predicted above?

Please help me understand. Any help would be much appreciated.

@amueller
Owner

Can you be more specific about what your problem is?
It can totally happen that points have similiar / the same probability estimates.
You can find which class each column corresponds to in the svm.classes_ attribute.

@Karthick333031

Thanks for the note.
I get the same probabilities predicted. As shared in the example, I see the same probabilities predicted for all the input data.
While it is certainly valid, I know for a fact that it can't be correct. I wanted to know if I am missing any thing in the logic.

@amueller
Owner

I'm not sure what you mean by correct. Under the true distribution?

Well, as mentioned in many places, the predict_proba method uses platt scaling, which is not great.
If you are not using sparse data, you can use the decision_function and if you want calibration you can use the new calibration module:
http://scikit-learn.org/dev/modules/calibration.html

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Something went wrong with that request. Please try again.