GSoC 2016 Proposal: Adding fused types to Cython files

Proposal Title

scikit-learn: Adding fused types to Cython files

Sub-organization

scikit-learn

Abstract

The current implementations of many algorithms, such as Stochastic Gradient Descent and Coordinate Descent, only allow input with float64 and int64 dtypes, because adopting Cython fused types risked an explosion in the size of the generated C code while that code was committed to the repository. However, now that scikit-learn has removed the generated C files from the repo and regenerates them at every build, there is a good opportunity to refactor some of the ".pyx" files by introducing Cython fused types. This will allow those algorithms to support float32 and int32 input, which is currently cast to float64 and int64 respectively, and thereby avoid wasting memory.

Cython Fused Types Introduction

Cython fused types allow you to have one type definition that can refer to multiple types, so a programmer can write a single statically typed Cython algorithm that operates on values of several types. Fused types thus enable generic programming and are akin to templates in C++ or generics in languages like Java and C#.
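
As a minimal, self-contained sketch (the names below are illustrative, not taken from the scikit-learn code base), the fused type floating makes Cython compile one specialization of the function for float and one for double; Cython also ships a built-in cython.floating fused type with exactly these two members:

```cython
# One type definition that refers to both C floating-point types.
ctypedef fused floating:
    float
    double

# A single source definition, compiled once per specialization.
cpdef floating arr_sum(floating[:] data):
    cdef floating total = 0
    cdef Py_ssize_t i
    for i in range(data.shape[0]):
        total += data[i]
    return total
```

Calling arr_sum with a float32 NumPy array dispatches to the float specialization at run time, while a float64 array picks the double one, so neither input needs to be copied or up-cast.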

Plan

The parts of scikit-learn to which I plan to add fused types are listed below:

1. Sparse Function Utils

Motivation:

Dealing with sparse data is fairly common when analyzing large datasets. However, the sparse function utils in scikit-learn currently don't support float32 and convert input data to float64 internally, so this extra memory cost is invisible to the user. Introducing Cython fused types here removes that hidden memory waste.

Goal:

Make sparse function utils support Cython fused types.

Implementation Plan:

According to #6430, sparsefuncs_fast.pyx needs to be modified first so that other classes can handle float32 sparse input as float32 internally. I will also add tests for the classes that use functions defined in sparsefuncs_fast.pyx, to make sure they keep working correctly. I've already started working on this issue in PR #6539.
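
To illustrate the kind of change involved, here is a hedged sketch of a fused-type, in-place CSR scaling helper in the spirit of sparsefuncs_fast.pyx (the name and signature are illustrative, not the actual scikit-learn API):

```cython
from cython cimport floating  # built-in fused type: float or double

def inplace_csr_row_scale(floating[:] data, int[:] indptr, floating[:] scale):
    # Scale each row of a CSR matrix in place; because data and scale
    # share the fused type, both arrays must have the same dtype.
    cdef Py_ssize_t i, j
    for i in range(indptr.shape[0] - 1):
        for j in range(indptr[i], indptr[i + 1]):
            data[j] *= scale[i]
```

Since the operation happens in place on the original float32 buffer, no float64 copy is ever allocated.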

Main Related Files:

sklearn/
  utils/
    sparsefuncs_fast.pyx

2. Cluster

Motivation:

Clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar to each other than to those in other groups (clusters). This common technique is used in many fields, including image analysis and unsupervised document classification. However, many clustering algorithms in scikit-learn need to convert float32 input data to float64 before the fitting step, which can cause memory issues.

Goal:

Make clustering algorithms support Cython fused types.

Implementation Plan:

After enhancing the sparse function utils in scikit-learn, I will refactor KMeans/MiniBatchKMeans and modify _hierarchical.pyx to make these commonly used clustering algorithms support Cython fused types as well. I'll also modify test_k_means.py and test_hierarchical.py to check that everything actually works.
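
The .pyx changes are only half of the story: the Python entry points in k_means_.py must also stop up-casting float32 input. Below is a hedged sketch of the validation pattern (the wrapper name is hypothetical); with a dtype list, check_array preserves the input dtype when it appears in the list and otherwise converts to the first entry:

```cython
# k_means_.py is plain Python, so this snippet is Python-level code.
import numpy as np
from sklearn.utils import check_array

def _check_fit_data(X):
    # Keep float32 input as float32; everything else becomes float64.
    return check_array(X, accept_sparse='csr', dtype=[np.float64, np.float32])
```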

Main Related Files:

sklearn/
  cluster/
    k_means_.py
    _k_means.pyx
    hierarchical.py
    _hierarchical.pyx

3. SGD

Motivation:

Stochastic Gradient Descent (SGD) is a simple yet very efficient approach to discriminative learning of linear classifiers under convex loss functions, such as (linear) Support Vector Machines and Logistic Regression. In scikit-learn, however, the current implementation of SGDClassifier and SGDRegressor doesn't let the user choose the dtype and may cause a MemoryError during training, since the input data is copied into np.float64. (See #5776)

Implementation Plan:

I'll start by modifying the base classes in sgd_fast.pyx, including LossFunction, Classification, and Regression; then I'll make loss functions such as Hinge and Log support Cython fused types as well. Finally, I'll refactor stochastic_gradient.py to make it accept both float32 and float64 input.
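
As a hedged sketch of the direction (not the actual sgd_fast.pyx API), here is a standalone hinge-style loss written against a fused type. The real code defines the losses as methods on a cdef class hierarchy, which needs extra care because fused types cannot be stored as extension-type attributes:

```cython
from cython cimport floating  # float or double

cdef floating hinge_loss(floating p, floating y, floating threshold) nogil:
    # Hinge loss for prediction p and true label y, in either precision.
    cdef floating z = p * y
    if z <= threshold:
        return threshold - z
    return 0
```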

Goal:

Make SGD algorithms and SGD related learning methods support Cython fused types.

Main Related Files:

sklearn/
  linear_model/
    stochastic_gradient.py
    sgd_fast.pyx

4. SAG

Motivation:

Stochastic Average Gradient (SAG) is another variant of SGD that uses an averaging strategy, and it is useful for classification with a logistic loss. Like SGD, the SAG implementation in scikit-learn copies input data into np.float64 internally.

Implementation Plan:

I'll first modify the classes and functions defined in sag_fast.pyx, such as MultinomialLogLoss and _multinomial_grad_loss_all_samples; then I'll refactor sag.py so that the learning methods defined in it support Cython fused types and can therefore handle both float32 and float64 input.
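
As a small illustration of the kind of inner-loop helper that would become dtype-generic in sag_fast.pyx (the name and signature below are hypothetical):

```cython
from cython cimport floating

cdef void weight_update(floating[:] weights, floating[:] gradient,
                        floating step_size):
    # In-place weights -= step_size * gradient, in the input's precision.
    cdef Py_ssize_t j
    for j in range(weights.shape[0]):
        weights[j] -= step_size * gradient[j]
```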

Goal:

Make SAG and SAG related learning methods support Cython fused types.

Main Related Files:

sklearn/
  linear_model/
    sag.py
    sag_fast.pyx

5. CD

Motivation:

Coordinate Descent (CD) is a derivative-free optimization algorithm that, in each iteration, performs a line search along one coordinate direction at the current point in order to find a local minimum of a function. Nonetheless, the current implementation of ElasticNet and Lasso in scikit-learn constrains the input to be float64, which is a waste of space. (See #5464)

Implementation Plan:

I'll first modify the functions related to enet_coordinate_descent in cd_fast.pyx; then I'll refactor coordinate_descent.py so that learning methods defined in it, such as ElasticNet and Lasso, also support Cython fused types and therefore accept both float32 and float64 input.
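
For instance, the soft-thresholding operator at the heart of enet_coordinate_descent could be written once against a fused type; this is a sketch, and the real cd_fast.pyx organizes the computation differently:

```cython
from cython cimport floating

cdef inline floating soft_threshold(floating x, floating threshold) nogil:
    # sign(x) * max(|x| - threshold, 0), in the input's precision.
    if x > threshold:
        return x - threshold
    elif x < -threshold:
        return x + threshold
    return 0
```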

Goal:

Make CD and CD related learning methods support Cython fused types.

Main Related Files:

sklearn/
  linear_model/
    coordinate_descent.py
    cd_fast.pyx

6. Neighbor Trees

Motivation:

The neighbors module in scikit-learn provides functionality for neighbors-based learning methods. The principle behind these methods is to find a predefined number of training samples closest in distance to a new test point, and to predict its label from them. Neighbors-based methods simply “remember” all of their training data and may leverage a fast indexing structure, such as a Ball Tree or KD Tree, to accelerate prediction. Ball Tree and KD Tree currently work well, but they only support np.float64; this wastes memory, since results of the same quality should be obtainable with float32.

Goal:

Make neighbors trees and related neighbors-based learning methods support Cython fused types.

Implementation Plan:

I'll introduce Cython fused types into the functions defined in ball_tree.pyx and kd_tree.pyx to make them support both float64 and float32. Then I'll modify dist_metrics.pyx so that it adopts Cython fused types as well.
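
A hedged sketch of the pattern for dist_metrics.pyx: distance kernels that take raw pointers and get compiled for both precisions (the helper below is illustrative, not the actual DistanceMetric API):

```cython
from cython cimport floating
from libc.math cimport sqrt

cdef floating euclidean_dist(floating* x1, floating* x2,
                             Py_ssize_t size) nogil:
    # Euclidean distance between two dense vectors of either precision.
    cdef floating tmp, d = 0
    cdef Py_ssize_t j
    for j in range(size):
        tmp = x1[j] - x2[j]
        d += tmp * tmp
    return <floating> sqrt(d)
```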

Main Related Files:

sklearn/
  neighbors/
    ball_tree.pyx
    kd_tree.pyx
    dist_metrics.pyx
    typedefs.pyx

7. LDA

Motivation:

Latent Dirichlet Allocation (LDA) is a generative statistical model that represents documents as mixtures of topics, where each topic emits words with certain probabilities. It is also an example of a topic model and is very popular for topic discovery. In practice, document and text datasets tend to be really large, so it would be really useful if LDA in scikit-learn could handle both float32 and float64 input.

Goal:

Make LDA in scikit-learn support Cython fused types.

Implementation Plan:

I'll replace the original float64_t arguments defined in _online_lda.pyx with Cython fused types in order to support float32 and prevent potential memory issues. I'll also add tests to guarantee that the class works for both float32 and float64.
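
For example, the helper in _online_lda.pyx that computes the mean absolute change between two arrays is currently fixed to float64_t; a fused-type version could look like the following sketch (the exact signature in the real file differs):

```cython
from cython cimport floating
from libc.math cimport fabs

def mean_change(floating[:] arr_1, floating[:] arr_2):
    # Mean absolute difference between two arrays of the same dtype.
    cdef floating total = 0
    cdef Py_ssize_t i
    for i in range(arr_1.shape[0]):
        total += <floating> fabs(arr_1[i] - arr_2[i])
    return total / arr_1.shape[0]
```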

Main Related Files:

sklearn/
  decomposition/
    _online_lda.pyx

Notes

  1. The svm module is not included in this plan since it is tightly coupled to its underlying C++ source code. However, I'll make it adopt Cython fused types after the GSoC program if the scikit-learn community thinks it is important.

  2. The tree module is not included in this plan since it already converts training input samples to float32. See the comments here.

Time Line

  • Warm up (March 26, 2016 - April 22, 2016)
  • Community Bonding Period (April 22, 2016 - May 22, 2016)
    • Modify sparsefuncs_fast.pyx to make it handle sparse float32 data
    • Add tests for classes that use functions in sparsefuncs_fast.pyx
  • Week 1-3 (May 23, 2016 - June 12, 2016)
    • Refactor KMeans/MiniBatchKMeans to handle sparse float32 data
    • Make other clustering algorithms support fused types
    • Add tests for the cluster module
  • Week 4-5 (June 13, 2016 - June 26, 2016)
    • Make SGD-related learning methods support fused types
    • Add tests for SGD-related learning methods
  • Week 6-7 (June 27, 2016 - July 10, 2016)
    • Make SAG-related learning methods support fused types
    • Add tests for SAG-related learning methods
  • Week 8-9 (July 11, 2016 - July 24, 2016)
    • Make CD-related learning methods support fused types
    • Add tests for CD-related learning methods
  • Week 10-12 (July 25, 2016 - August 14, 2016)
    • Make neighbor trees support fused types
    • Add tests for neighbor trees
  • Week 13 (August 15, 2016 - August 23, 2016)
    • Make LDA support fused types
    • Add tests
    • Wrap up

Other Commitments

I'll graduate in June and will start working as a research assistant in September; therefore, I can work full time on the Google Summer of Code program with scikit-learn.

References

  1. #5492 Removing generated C files, adding build cache to Travis
  2. #5776 Working with float32 data
  3. #5464 Coordinate descent should work on float32 data (in addition to float64 data)
  4. #6430 Allows KMeans/MiniBatchKMeans to use float32 internally by using cython fused types
  5. #5932 Allow float32 to pass through with copy
  6. #6539 Use fused type in inplace normalize
  7. #5973 Add fused type to Cython files

Student Information:

1. Personal Information

Name: Yen-Chen Lin

Alternate Names: Yen

Email: yenchenlin1994@gmail.com

Telephone: +886 905136913

Time Zone: GMT/UTC+8

Blog: http://www.yclin.me/

2. University Information

University: National Tsing Hua University

Major: Computer Science

Current Year: 4th year

Expected Graduation date: 2016

Degree: Bachelor

3. About Me

I am Yen-Chen Lin, a final-year undergraduate studying Computer Science at National Tsing Hua University, Taiwan. I started doing Machine Learning / Computer Vision research one year ago and have published a workshop paper at CVPR.

In addition to my research experience, I also build interesting side projects, such as DeepLearningFlappyBird, which was featured on Hacker News, and I work as a contract iOS developer for a startup called Forest.

I started using scikit-learn around September 2014 and have been contributing to it since January 2016, improving documentation and fixing minor bugs, etc.

4. Past Works

  1. [MRG+1] Use fused type in inplace normalize (Open)
  2. [WIP] Add ensemble selection algorithm (Open)
  3. [MRG+1] Make dump_svmlight_file support sparse y (fixes #6301) (Merged)
  4. [MRG] Add homogeneous time series cross validation (Open)
  5. [MRG] Make GradientBoostingClassifier error message more informative (fixes 6433) (Open)
  6. [MRG] DOC Add LabelKFold in _BaseKFold's docstring (Open)
  7. Example document_clustering.py has broken (Open)
  8. [MRG] DOC Add doc of attributes for VotingClassifier (Merged)
  9. [MRG] DOC Fix misuse of "it's" (Merged)
  10. [MRG] DOC Remove redundant words in sklearn (Merged)
  11. Add test for local linear embedding (Merged)
  12. [MRG] Remove redundant assignment of variable in test_locally_linear.py (Merged)
  13. [MRG] Fix broken link of Wainwright's paper (fixes #6383) (Merged)
  14. [MRG+1] Add test for voting classifier's init errors (Merged)
  15. [MRG+1] Remove redundant assignment of parameter in _split.py (Merged)
  16. [MRG] DOC Remove redundant that in test_ranking.py (Merged)
  17. BUG: Normalize LDA transform's return value (fixes #6320) (Merged)
  18. [MRG] DOC Remove redundant "and" in binary_tree.pxi (Merged)
  19. [MRG] Fix check_cv error message (Merged)
  20. [MRG+1] Doc Add doc in DictVectorizer when categorical features are numeric values (fixes #4413) (Merged)
  21. [MRG] DOC: Fix typo in LDA parameter's doc (Merged)
  22. [MRG+1]Fix axis label in the plot of multioutput regression tree example (Merged)
  23. [MRG+1] Make doc for parameter in hierarchical.py consistent (Merged)