Skip to content


Switch branches/tags

Name already in use

A tag already exists with the provided branch name. Many Git commands accept both tag and branch names, so creating this branch may cause unexpected behavior. Are you sure you want to create this branch?

Latest commit


Git stats


Failed to load latest commit information.
Latest commit message
Commit time

Fast Random Kernelized Features

Support Vector Machine Classification for High-Dimensional IDC Dataset


Random feature maps provide low-dimensional kernel approximations, thereby accelerating the training of support vector machines for large-scale datasets. Through a process of dimensionality reduction via k-means clustering, lifting via a random feature map, and subsequent linear SVM classification in this feature space, we outperform standard Gaussian kernel SVM on the high-dimensional invasive ductal carcinoma dataset (7500 dimensions) in both accuracy and speed. We explore applying two random maps (random Fourier features and random binning features) and experiment with different pre-processing methods such as k-means clustering, HSV transform, and histogram of gradients.


Tianshu Huang Nimit Kalra
Department of Electrical and Computer Engineering Department of Computer Science
University of Texas at Austin University of Texas at Austin


Run the module with python arg1=value arg2=value ..., where parameters are passed as keyword arguments joined by '=' and separated by spaces.

Parameter Default Description
ptrain 0.01 Percent of training images to use
ptest 0.1 Percent of test images to use
fdim 5000 Feature space dimensionality
knn False Use Color k-means?
k 5 Number of means to use
kernel G Kernel type ("G", "L", or "C")
ntrain -25 Number of patients used for training
ntest 25 Number of patients used for testing
cores 8 Number of cores (processes) to use
ftype F Type of feature to use ("F" or "B")

The terminal output is mirrored to the file results_<command line args>.txt.

Example Session

Output file results_ptrain=0.1_ptest=1_fdim=2500.txt of python ptrain=0.1 ptest=1 fdim=2500:


                                 Random Feature Support Vector Classifier


    |Parameter|Value|Description                                                 |
    |ptrain   |0.1  |Percent of training images to use (default=0.01)            |
    |ptest    |1.0  |Percent of test images to use (default=0.1)                 |
    |fdim     |2500 |Feature space dimensionality (default=5000)                 |
    |knn      |False|Use Color k-means? (default=False)                          |
    |k        |5    |Number of means to use (default=5)                          |
    |kernel   |G    |Kernel type ("G", "L", or "C") (default=G)                  |
    |ntrain   |-25  |Number of patients used for training (default=-25)          |
    |ntest    |25   |Number of patients used for testing (default=25)            |
    |cores    |8    |Number of cores (processes) to use (8 available) (default=8)|
    |ftype    |F    |Type of feature to use ("F" or "B") (default=F)             |

-- Main Program ------------------------------------------------------------------------------------------
<RF> Random Feature Support Vector Classifier
[0.94s | 75.02MB] <Random Fourier Feature> 7500->2500 Random Fourier Feature created
  | <test data> Loading Images...
  |   | <IDC Dataset> started accounting
  |   |   | <Loader> Loading patient 10253...
  |   | [0.27s | 1.10MB] <Loader> loaded patient 10253
  |   |   | <Loader> Loading patient 10254...
  |   |   | <Loader> Loading patient 10261...

  			[Details omitted]

  |   | [6.37s | 4.66MB] <Loader> loaded patient 13693
  |   |   | <Loader> Loading patient 13694...
  |   | <IDC Dataset> stopped accounting
  | [113.93s | 502.89MB] <IDC Dataset> 25143 images (10.0%) sampled from 254 patients
  | <training data> Loading Images...
  |   | <IDC Dataset> started accounting
  |   |   | <Loader> Loading patient 9257...
  |   |   | <Loader> Loading patient 9258...

  			[Details omitted]

  |   | <IDC Dataset> stopped accounting
  | [84.02s | 516.57MB] <IDC Dataset> 25827 images (100.0%) sampled from 25 patients
  |   | <RF SVC> Computing RF SVC Classifier
  |   | [8.40s] <RF SVC> RFF SVC Computed
<Task> Classification experiment on new patients
  | <Tester> True Negative: 18126
  | <Tester> True Positive: 4116
  | <Tester> False Negative: 1588
  | <Tester> False Positive: 1997
  | <Tester> Total Tests: 25827
  | <Tester> Incorrect Tests: 3585.0
  | <Tester> Correct Tests: 22242.0
  | <Tester> Percent Error: 13.880822395167847
  | <Tester> Constant Classifier Baseline: 22.08541448871336
  | <Tester> Relative Improvement over Constant: 0.3714936886395512
[0.36s] <Tester> Done running tests.
  | <Tester> Time per test: 1.378298848930882e-05s
<Task> Classification verification on training data
  | <Tester> True Negative: 16185
  | <Tester> True Positive: 4704
  | <Tester> False Negative: 2715
  | <Tester> False Positive: 1539
  | <Tester> Total Tests: 25143
  | <Tester> Incorrect Tests: 4254.0
  | <Tester> Correct Tests: 20889.0
  | <Tester> Percent Error: 16.919222049874715
  | <Tester> Constant Classifier Baseline: 29.507218708984606
  | <Tester> Relative Improvement over Constant: 0.42660735948241
[0.35s] <Tester> Done running tests.
  | <Tester> Time per test: 1.3938018597950777e-05s
[213.03s] <RF> Program finished.



[1] Ali Rahimi and Benjamin Recht. "Random Features for Large-Scale Kernel Machines." Advances in Neural Information Processing Systems. 2008.

[2] Wu, Lingfei, et al. "Revisiting Random Binning Features: Fast Convergence and Strong Parallelizability." Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2016.

[3] Janowczyk, Andrew. "Invasive Ductal Carcinoma Identification Dataset."

[4] Pedregosa, F. and Varoquaux, G. and Gramfort, A. and Michel, V. and Thirion, B. and Grisel, O. and Blondel, M. and Prettenhofer, P. and Weiss, R. and Dubourg, V. and Vanderplas, J. and Passos, A. and Cournapeau, D. and Brucher, M. and Perrot, M. and Duchesnay, E. "Scikit-learn: Machine Learning in Python." Journal of Machine Learning Research. JMLR, 2011.

[5] van der Walt, Stefan, Schonberger, Johannes L., Nunez-Iglesias, Juan, Boulogne, Franc¸ois, Warner, Joshua D., Yager, Neil, Gouillart, Emmanuelle, Yu, Tony, and the scikit-image contributors. scikit-image: image processing in Python. PeerJ, 2:e453, 6 2014. ISSN 2167-8359. doi: 10.7717/peerj.453. URL


Fast Random Kernelized Features: Support Vector Machine Classification for High-Dimensional IDC Dataset







No releases published


No packages published