@@ -12,12 +12,362 @@ features in a format supported by machine learning algorithms from datasets
consisting of formats such as text and image.


Loading features from dicts
===========================

The class :class:`DictVectorizer` can be used to convert feature arrays
represented as lists of standard Python ``dict`` objects to the NumPy/SciPy
representation used by scikit-learn estimators.

While not particularly fast to process, Python's ``dict`` has the advantages
of being convenient to use, being sparse (absent features need not be
stored) and storing feature names in addition to values.

``DictVectorizer`` implements what is called one-of-K or "one-hot" coding for
categorical (aka nominal, discrete) features. For a dictionary such as::

{"word-2": "guitar",
"pos-2": "NN",
"word-1": "and",
"pos-1": "CC",
"word+1": "player",
"pos+1": "NN",
"word+2": "stand",
"pos+2": "VB"}

it will construct new, binary features ``"word-2=guitar"``, ``"pos-2=NN"``, etc.
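The construction of these binary features can be sketched in pure Python (a toy illustration of the one-of-K scheme, not the actual ``DictVectorizer`` implementation; the function name is made up):

```python
def one_hot_encode(dicts):
    """Toy one-of-K encoder for lists of feature dicts.

    String values are expanded into binary "name=value" features;
    numeric values are kept as-is.
    """
    # collect the sorted set of output feature names
    names = sorted({
        "%s=%s" % (k, v) if isinstance(v, str) else k
        for d in dicts for k, v in d.items()
    })
    index = {name: j for j, name in enumerate(names)}
    rows = []
    for d in dicts:
        row = [0.0] * len(names)
        for k, v in d.items():
            if isinstance(v, str):
                row[index["%s=%s" % (k, v)]] = 1.0
            else:
                row[index[k]] = float(v)
        rows.append(row)
    return names, rows

names, X = one_hot_encode([{"pos-2": "NN", "word-2": "guitar"}])
```

Each distinct (feature, value) pair becomes one binary column, which is exactly why absent features need not be stored in the input dicts.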


.. _text_feature_extraction:

Text feature extraction
=======================

.. currentmodule:: sklearn.feature_extraction.text


The Bag of Words representation
-------------------------------

Text analysis is a major application field for machine learning
algorithms. However, the raw data, a sequence of symbols, cannot be fed
directly to the algorithms themselves, as most of them expect numerical
feature vectors with a fixed size rather than raw text documents
with variable length.

In order to address this, scikit-learn provides utilities for the most
common ways to extract numerical features from text content, namely:

- **tokenizing** strings and giving an integer id for each possible token,
for instance by using whitespace and punctuation as token separators.

- **counting** the occurrences of tokens in each document.

- **normalizing** and weighting with diminishing importance tokens that
occur in the majority of samples / documents.

In this scheme, features and samples are defined as follows:

- each **individual token occurrence frequency** (normalized or not)
is treated as a **feature**.

- the vector of all the token frequencies for a given **document** is
considered a multivariate **sample**.

A corpus of documents can thus be represented by a matrix with one row
per document and one column per token (e.g. word) occurring in the corpus.

We call **vectorization** the general process of turning a collection
of text documents into numerical feature vectors. This specific strategy
(tokenization, counting and normalization) is called the **Bag of Words**
or "Bag of n-grams" representation. Documents are described by word
occurrences while completely ignoring the relative position information
of the words in the document.
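The tokenize-and-count steps above can be sketched with the standard library alone (a toy illustration of the Bag of Words idea; the helper names are made up and this is not scikit-learn's implementation):

```python
import re
from collections import Counter

def tokenize(doc):
    # lowercase and extract words of at least 2 characters, as a crude tokenizer
    return re.findall(r"\b\w\w+\b", doc.lower())

def count_matrix(corpus):
    """Return (vocabulary, rows) where rows[i][j] counts token j in doc i."""
    vocabulary = sorted({tok for doc in corpus for tok in tokenize(doc)})
    index = {tok: j for j, tok in enumerate(vocabulary)}
    rows = [[0] * len(vocabulary) for _ in corpus]
    for i, doc in enumerate(corpus):
        for tok, n in Counter(tokenize(doc)).items():
            rows[i][index[tok]] = n
    return vocabulary, rows

vocab, X = count_matrix(['This is the first document.',
                         'Is this the first document?'])
```

Note how word order is discarded: the two documents above receive identical count vectors.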

When combined with :ref:`tfidf`, the bag of words encoding is also known
as the `Vector Space Model
<https://en.wikipedia.org/wiki/Vector_space_model>`_.


Sparsity
--------

As most documents will typically use only a small subset of the words used in
the corpus, the resulting matrix will have many feature values that are
zeros (typically more than 99% of them).

For instance a collection of 10,000 short text documents (such as emails)
will use a vocabulary with a size in the order of 100,000 unique words in
total while each document will use 100 to 1000 unique words individually.

In order to be able to store such a matrix in memory and to speed up
algebraic matrix / vector operations, implementations will typically use a
sparse representation such as those available in the ``scipy.sparse``
package.
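The saving can be illustrated by keeping only the nonzero entries of each row, e.g. as column-index to value mappings (a sketch of the idea behind sparse storage, not the actual ``scipy.sparse`` layout):

```python
def to_sparse_rows(dense_rows):
    # keep only (column, value) pairs for nonzero entries of each row
    return [{j: v for j, v in enumerate(row) if v} for row in dense_rows]

dense = [[0, 1, 0, 0, 2],
         [0, 0, 0, 0, 0],
         [3, 0, 0, 0, 0]]
sparse = to_sparse_rows(dense)
# only the 3 nonzero values out of 15 cells need to be stored
stored = sum(len(r) for r in sparse)
```

With 99%+ sparsity, this kind of representation reduces memory use by roughly two orders of magnitude.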


Common Vectorizer usage
-----------------------

:class:`CountVectorizer` implements both tokenization and occurrence
counting in a single class::

>>> from sklearn.feature_extraction.text import CountVectorizer

This model has many parameters; however, the default values are quite
reasonable (please see the :ref:`reference documentation
<text_feature_extraction_ref>` for the details)::

>>> vectorizer = CountVectorizer()
>>> vectorizer
CountVectorizer(analyzer='word', binary=False, charset='utf-8',
charset_error='strict', dtype=<type 'long'>, input='content',
lowercase=True, max_df=1.0, max_features=None, max_n=1, min_n=1,
preprocessor=None, stop_words=None, strip_accents=None,
token_pattern=u'\\b\\w\\w+\\b', tokenizer=None, vocabulary=None)

Let's use it to tokenize and count the word occurrences of a minimalistic
corpus of text documents::

>>> corpus = [
... 'This is the first document.',
... 'This is the second second document.',
... 'And the third one.',
... 'Is this the first document?',
... ]
>>> X = vectorizer.fit_transform(corpus)
>>> X # doctest: +NORMALIZE_WHITESPACE
<4x9 sparse matrix of type '<type 'numpy.int64'>'
with 19 stored elements in COOrdinate format>

The default configuration tokenizes the string by extracting words of
at least 2 letters. The specific function that does this step can be
requested explicitly::

>>> analyze = vectorizer.build_analyzer()
>>> analyze("This is a text document to analyze.")
[u'this', u'is', u'text', u'document', u'to', u'analyze']

Each term found by the analyzer during the fit is assigned a unique
integer index corresponding to a column in the resulting matrix. This
interpretation of the columns can be retrieved as follows::

>>> vectorizer.get_feature_names()
[u'and', u'document', u'first', u'is', u'one', u'second', u'the', u'third', u'this']

>>> X.toarray() # doctest: +ELLIPSIS
array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
[0, 1, 0, 1, 0, 2, 1, 0, 1],
[1, 0, 0, 0, 1, 0, 1, 1, 0],
[0, 1, 1, 1, 0, 0, 1, 0, 1]]...)

The converse mapping from feature name to column index is stored in the
``vocabulary_`` attribute of the vectorizer::

>>> vectorizer.vocabulary_.get('document')
1

Hence words that were not seen in the training corpus will be completely
ignored in future calls to the transform method::

>>> vectorizer.transform(['Something completely new.']).toarray()
... # doctest: +ELLIPSIS
array([[0, 0, 0, 0, 0, 0, 0, 0, 0]]...)

Note that in the previous corpus, the first and the last documents have
exactly the same words hence are encoded as equal vectors. In particular
we lose the information that the last document is an interrogative form. To
preserve some of the local ordering information we can extract 2-grams
of words in addition to the 1-grams (the words themselves)::

>>> bigram_vectorizer = CountVectorizer(min_n=1, max_n=2,
... token_pattern=ur'\b\w+\b')
>>> analyze = bigram_vectorizer.build_analyzer()
>>> analyze('Bi-grams are cool!')
[u'bi', u'grams', u'are', u'cool', u'bi grams', u'grams are', u'are cool']

The vocabulary extracted by this vectorizer is hence much bigger and
can now resolve ambiguities encoded in local positioning patterns::

>>> X_2 = bigram_vectorizer.fit_transform(corpus).toarray()
>>> X_2
... # doctest: +ELLIPSIS
array([[0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0],
[0, 0, 1, 0, 0, 1, 1, 0, 0, 2, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0],
[1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0],
[0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1]]...)


In particular the interrogative form "Is this" is only present in the
last document::

>>> feature_index = bigram_vectorizer.vocabulary_.get(u'is this')
>>> X_2[:, feature_index] # doctest: +ELLIPSIS
array([0, 0, 0, 1]...)
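The n-gram extraction step performed by the analyzer can be sketched as follows (a simplified stand-in, assuming tokens have already been extracted; the function name is made up):

```python
def word_ngrams(tokens, min_n=1, max_n=2):
    """Return all contiguous n-grams joined by spaces, for min_n <= n <= max_n."""
    return [" ".join(tokens[i:i + n])
            for n in range(min_n, max_n + 1)
            for i in range(len(tokens) - n + 1)]

grams = word_ngrams(["bi", "grams", "are", "cool"])
```

The output grows roughly linearly with ``max_n``, which is why the bigram vocabulary above is so much larger than the unigram one.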


.. _tfidf:

TF-IDF normalization
--------------------

In a large text corpus, some words will be very frequent (e.g. "the", "a",
"is" in English) hence carry very little meaningful information about
the actual contents of the document. If we were to feed the raw count
data directly to a classifier those very frequent terms would shadow
the frequencies of rarer yet more interesting terms.

In order to re-weight the count features into floating point values
suitable for usage by a classifier it is very common to use the tf–idf
transform.

Tf means **term-frequency** while tf–idf means term-frequency times
**inverse document-frequency**. This is originally a term weighting
scheme developed for information retrieval (as a ranking function
for search engine results) that has also found good use in document
classification and clustering.

This normalization is implemented by the :class:`TfidfTransformer` class::

>>> from sklearn.feature_extraction.text import TfidfTransformer
>>> transformer = TfidfTransformer()
>>> transformer
TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True)

Again please see the :ref:`reference documentation
<text_feature_extraction_ref>` for the details on all the parameters.

Let's take an example with the following counts. The first term is present
100% of the time, hence not very interesting. The two other features are
present in less than 50% of the documents, hence probably more
representative of the content of the documents::

>>> counts = [[3, 0, 1],
... [2, 0, 0],
... [3, 0, 0],
... [4, 0, 0],
... [3, 2, 0],
... [3, 0, 2]]
...
>>> tfidf = transformer.fit_transform(counts)
>>> tfidf # doctest: +NORMALIZE_WHITESPACE
<6x3 sparse matrix of type '<type 'numpy.float64'>'
with 9 stored elements in Compressed Sparse Row format>

>>> tfidf.toarray() # doctest: +ELLIPSIS
array([[ 0.85..., 0. ..., 0.52...],
[ 1. ..., 0. ..., 0. ...],
[ 1. ..., 0. ..., 0. ...],
[ 1. ..., 0. ..., 0. ...],
[ 0.55..., 0.83..., 0. ...],
[ 0.63..., 0. ..., 0.77...]])

Each row is normalized to have unit Euclidean norm. The weights of each
feature computed by the ``fit`` method call are stored in a model
attribute::

>>> transformer.idf_ # doctest: +ELLIPSIS
array([ 1. ..., 2.25..., 1.84...])
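Under the default settings (``smooth_idf=True``, ``norm='l2'``), these numbers can be reproduced by hand: idf(t) = ln((1 + n_documents) / (1 + df(t))) + 1, and each row of tf times idf is then l2-normalized. A pure-Python sketch for checking the values, not the library code:

```python
import math

def tfidf(counts):
    """Smoothed tf-idf with l2 row normalization, from a dense count matrix."""
    n_docs = len(counts)
    n_features = len(counts[0])
    # document frequency: number of documents containing each feature
    df = [sum(1 for row in counts if row[j]) for j in range(n_features)]
    # smoothed inverse document frequency
    idf = [math.log((1.0 + n_docs) / (1.0 + d)) + 1.0 for d in df]
    out = []
    for row in counts:
        weighted = [tf * w for tf, w in zip(row, idf)]
        norm = math.sqrt(sum(x * x for x in weighted)) or 1.0
        out.append([x / norm for x in weighted])
    return idf, out

idf, weighted = tfidf([[3, 0, 1], [2, 0, 0], [3, 0, 0],
                       [4, 0, 0], [3, 2, 0], [3, 0, 2]])
# idf is approximately [1.0, 2.25, 1.84]
```

The ubiquitous first term gets idf = ln(7/7) + 1 = 1, while the rarer features receive larger weights.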


As tf–idf is very often used for text features, there is also another
class called :class:`TfidfVectorizer` that combines all the options of
:class:`CountVectorizer` and :class:`TfidfTransformer` in a single model::

>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> vectorizer = TfidfVectorizer()
>>> vectorizer.fit_transform(corpus)
... # doctest: +NORMALIZE_WHITESPACE
<4x9 sparse matrix of type '<type 'numpy.float64'>'
with 19 stored elements in Compressed Sparse Row format>

While the tf–idf normalization is often very useful, there might
be cases where binary occurrence markers offer better
features. This can be achieved by using the ``binary`` parameter
of :class:`CountVectorizer`. In particular, some estimators such as
:ref:`bernoulli_naive_bayes` explicitly model discrete boolean random
variables. Also, very short texts are likely to have noisy tf–idf values
while the binary occurrence info is more stable.

As usual, the best way to adjust the feature extraction parameters
is to use a cross-validated grid search, for instance by pipelining the
feature extractor with a classifier:

* :ref:`example_grid_search_text_feature_extraction.py`


Applications and examples
-------------------------

The bag of words representation is quite simplistic but surprisingly
useful in practice.

In particular in a **supervised setting** it can be successfully combined
with fast and scalable linear models to train **document classifiers**,
for instance:

* :ref:`example_document_classification_20newsgroups.py`

In an **unsupervised setting** it can be used to group similar documents
together by applying clustering algorithms such as :ref:`k_means`:

* :ref:`example_document_clustering.py`

Finally it is possible to discover the main topics of a corpus by
relaxing the hard assignment constraint of clustering, for instance by
using :ref:`NMF`:

* :ref:`example_applications_topics_extraction_with_nmf.py`


Limitations of the Bag of Words representation
----------------------------------------------

While some local positioning information can be preserved by extracting
n-grams instead of individual words, Bag of Words and Bag of n-grams
destroy most of the inner structure of the document and hence most of
the meaning carried by that internal structure.

In order to address the wider task of Natural Language Understanding,
the local structure of sentences and paragraphs should thus be taken
into account. Many such models will thus be cast as "Structured output"
problems which are currently outside of the scope of scikit-learn.


Customizing the vectorizer classes
-----------------------------------

It is possible to customize the behavior by passing callables as
parameters of the vectorizer::

>>> def my_tokenizer(s):
... return s.split()
...
>>> vectorizer = CountVectorizer(tokenizer=my_tokenizer)
>>> vectorizer.build_analyzer()(u"Some... punctuation!")
[u'some...', u'punctuation!']

In particular we name:

* ``preprocessor`` a callable that takes a string as input and returns
another string (removing HTML tags or converting to lower case, for
instance)

* ``tokenizer`` a callable that takes a string as input and outputs a
sequence of feature occurrences (a.k.a. the tokens).

* ``analyzer`` a callable that wraps calls to the preprocessor and
tokenizer and further performs some filtering or n-gram extraction
on the tokens.

To make the preprocessor, tokenizer and analyzer aware of the model
parameters it is possible to derive from the class and override the
``build_preprocessor``, ``build_tokenizer`` and ``build_analyzer``
factory methods instead.

Customizing the vectorizer can be very useful to handle Asian languages
that do not use an explicit word separator such as whitespace.


Image feature extraction
@@ -73,7 +73,7 @@ parameters or alternatively it uses the given parameters.
>>> y = f(X).ravel()
>>> x = np.atleast_2d(np.linspace(0, 10, 1000)).T
>>> gp = gaussian_process.GaussianProcess(theta0=1e-2, thetaL=1e-4, thetaU=1e-1)
>>> gp.fit(X, y) # doctest: +ELLIPSIS
>>> gp.fit(X, y) # doctest: +ELLIPSIS +NORMALIZE_WHITESPACE
GaussianProcess(beta0=None, corr=<function squared_exponential at 0x...>,
normalize=True, nugget=array(2.22...-15),
optimizer='fmin_cobyla', random_start=1,
@@ -4,18 +4,121 @@
Hidden Markov Models
====================

.. warning::
.. currentmodule:: sklearn.hmm

This module is not actively maintained and might be removed in a future
release. If you are interested in working on this module, please contact
the mailing list.
`sklearn.hmm` implements the algorithms of the Hidden Markov Model (HMM).
An HMM is a generative probabilistic model, in which a sequence of observable
:math:`\mathbf{X}` variables is generated by a sequence of internal hidden
states :math:`\mathbf{Z}`. The hidden states cannot be observed directly.
The transitions between hidden states are assumed to follow a first-order
Markov chain. They are specified by the start probability vector
:math:`\boldsymbol{\Pi}` and the transition probability matrix
:math:`\mathbf{A}`.
The emission probability of an observable can be any distribution with
parameters :math:`\boldsymbol{{\Theta}_i}` conditioned on the current hidden
state index (e.g. Multinomial, Gaussian).
The HMM is thus completely determined by
:math:`\boldsymbol{\Pi, \mathbf{A}}` and :math:`\boldsymbol{{\Theta}_i}`.


.. currentmodule:: sklearn.hmm
There are three fundamental problems for HMMs:

* Given the model parameters and observed data, estimate the optimal
sequence of hidden states.

* Given the model parameters and observed data, calculate the likelihood
of the data.

* Given just the observed data, estimate the model parameters.


The first and the second problem can be solved by the dynamic programming
algorithms known as
the Viterbi algorithm and the Forward-Backward algorithm, respectively.
The last one can be solved by an iterative Expectation-Maximization (EM)
algorithm, known as the Baum-Welch algorithm.
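As an illustration of the first problem, the Viterbi recursion can be sketched in a few lines for a discrete-emission HMM (a toy log-space implementation under simplifying assumptions; the module's internal version is more general):

```python
import math

def viterbi(startprob, transmat, emitprob, observations):
    """Most likely hidden state sequence for a discrete-emission HMM."""
    n_states = len(startprob)
    # delta[i] = best log-probability of any state path ending in state i
    delta = [math.log(startprob[i]) + math.log(emitprob[i][observations[0]])
             for i in range(n_states)]
    backpointers = []
    for obs in observations[1:]:
        prev = delta
        delta, back = [], []
        for j in range(n_states):
            best_i = max(range(n_states),
                         key=lambda i: prev[i] + math.log(transmat[i][j]))
            delta.append(prev[best_i] + math.log(transmat[best_i][j])
                         + math.log(emitprob[j][obs]))
            back.append(best_i)
        backpointers.append(back)
    # backtrack from the best final state
    state = max(range(n_states), key=lambda i: delta[i])
    path = [state]
    for back in reversed(backpointers):
        state = back[state]
        path.append(state)
    path.reverse()
    return path
```

For a 2-state chain where state 0 mostly emits symbol 0 and state 1 mostly emits symbol 1, the decoded path tracks the observed symbols as expected.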

See the references listed below for further detailed information.

.. topic:: References:

[Rabiner89] `A tutorial on hidden Markov models and selected applications in speech recognition <http://www.cs.ubc.ca/~murphyk/Bayes/rabiner.pdf>`_
Lawrence, R. Rabiner, 1989


Using HMM
=========

Classes in this module include :class:`MultinomialHMM`, :class:`GaussianHMM`,
and :class:`GMMHMM`. They implement HMMs with emission probabilities given
by a multinomial distribution, a Gaussian distribution and a mixture of
Gaussian distributions, respectively.


Building HMM and generating samples
------------------------------------

You can build an HMM instance by passing the parameters described above to the
constructor. Then, you can generate samples from the HMM by calling `sample`::

>>> import numpy as np
>>> from sklearn import hmm
>>> startprob = np.array([0.6, 0.3, 0.1])
>>> transmat = np.array([[0.7, 0.2, 0.1], [0.3, 0.5, 0.2], [0.3, 0.3, 0.4]])
>>> means = np.array([[0.0, 0.0], [3.0, -3.0], [5.0, 10.0]])
>>> covars = np.tile(np.identity(2), (3, 1, 1))
>>> model = hmm.GaussianHMM(3, "full", startprob, transmat)
>>> model.means_ = means
>>> model.covars_ = covars
>>> X, Z = model.sample(100)
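The sampling of the hidden state chain from :math:`\boldsymbol{\Pi}` and :math:`\mathbf{A}` can be sketched without the library (a toy illustration; the real `sample` method also draws the Gaussian observations):

```python
import random

def sample_states(startprob, transmat, n_steps, seed=0):
    """Draw a hidden state sequence from a first-order Markov chain."""
    rng = random.Random(seed)
    states = list(range(len(startprob)))
    # random.choices performs weighted sampling over the state indices
    z = [rng.choices(states, weights=startprob)[0]]
    for _ in range(n_steps - 1):
        z.append(rng.choices(states, weights=transmat[z[-1]])[0])
    return z

Z = sample_states([0.6, 0.3, 0.1],
                  [[0.7, 0.2, 0.1], [0.3, 0.5, 0.2], [0.3, 0.3, 0.4]], 100)
```

Each step depends only on the previous state, which is exactly the first-order Markov assumption described above.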


.. figure:: ../auto_examples/images/plot_hmm_sampling_1.png
:target: ../auto_examples/plot_hmm_sampling.html
:align: center
:scale: 75%

.. topic:: Examples:

* :ref:`example_plot_hmm_sampling.py`

Training HMM parameters and inferring the hidden states
--------------------------------------------------------

You can train an HMM by calling the `fit` method. The input is a list of
observed sequences. Note that since the EM algorithm is a gradient-based
optimization method, it will generally get stuck in a local optimum. You
should try to run `fit` with various initializations and select the model
with the highest score. The score of the model can be calculated by the
`score` method.
The inferred optimal hidden states can be obtained by calling the `predict`
method. The decoder algorithm used by `predict` can be specified; currently
the Viterbi algorithm (`viterbi`) and maximum a posteriori estimation
(`map`) are supported.
This time, the input is a single sequence of observed values::

>>> model2 = hmm.GaussianHMM(3, "full")
>>> model2.fit([X])
GaussianHMM(algorithm='viterbi', covariance_type='full', covars_prior=0.01,
covars_weight=1, means_prior=None, means_weight=0, n_components=3,
random_state=None, startprob=None, startprob_prior=1.0,
transmat=None, transmat_prior=1.0)
>>> Z2 = model.predict(X)


.. topic:: Examples:

* :ref:`example_plot_hmm_stock_analysis.py`


Implementing HMMs with other emission probabilities
---------------------------------------------------

If you want to implement other emission probabilities (e.g. Poisson), you have
to make your own HMM class by inheriting from :class:`_BaseHMM` and overriding
the necessary methods. These are `__init__`, `_compute_log_likelihood`,
`_set` and `_get` for additional parameters,
`_initialize_sufficient_statistics`, `_accumulate_sufficient_statistics` and
`_do_mstep`.


@@ -592,7 +592,7 @@ zero) model.

.. topic:: Examples:

* :ref:`example_linear_model_logistic_l1_l2_sparsity.py`
* :ref:`example_linear_model_plot_logistic_l1_l2_sparsity.py`

* :ref:`example_linear_model_plot_logistic_path.py`

@@ -85,7 +85,7 @@ training samples::
>>> clf.fit(X, Y) # doctest: +NORMALIZE_WHITESPACE
SVC(C=None, cache_size=200, class_weight=None, coef0=0.0, degree=3,
gamma=0.5, kernel='rbf', probability=False, scale_C=True, shrinking=True,
tol=0.001)
tol=0.001, verbose=False)

After being fitted, the model can then be used to predict new values::

@@ -124,7 +124,7 @@ classifiers are constructed and each one trains data from two classes::
>>> clf.fit(X, Y) # doctest: +NORMALIZE_WHITESPACE
SVC(C=None, cache_size=200, class_weight=None, coef0=0.0, degree=3,
gamma=1.0, kernel='rbf', probability=False, scale_C=True, shrinking=True,
tol=0.001)
tol=0.001, verbose=False)
>>> dec = clf.decision_function([[1]])
>>> dec.shape[1] # 4 classes: 4*3/2 = 6
6
@@ -271,7 +271,7 @@ floating point values instead of integer values::
>>> clf.fit(X, y) # doctest: +NORMALIZE_WHITESPACE
SVR(C=None, cache_size=200, coef0=0.0, degree=3,
epsilon=0.1, gamma=0.5, kernel='rbf', probability=False, scale_C=True,
shrinking=True, tol=0.001)
shrinking=True, tol=0.001, verbose=False)
>>> clf.predict([[1, 1]])
array([ 1.5])

@@ -469,7 +469,7 @@ vectors and the test vectors must be provided.
>>> clf.fit(gram, y) # doctest: +NORMALIZE_WHITESPACE
SVC(C=None, cache_size=200, class_weight=None, coef0=0.0, degree=3,
gamma=0.0, kernel='precomputed', probability=False, scale_C=True,
shrinking=True, tol=0.001)
shrinking=True, tol=0.001, verbose=False)
>>> # predict on training examples
>>> clf.predict(gram)
array([ 0., 1.])
@@ -114,12 +114,20 @@ def generate_example_rst(app):
os.makedirs(root_dir)

# we create an index.rst with all examples
fhindex = file(os.path.join(root_dir, 'index.rst'), 'w')
fhindex = file(os.path.join(root_dir, 'index.rst'), 'w')
#Note: The sidebar button has been removed from the examples page for now
# due to how it messes up the layout. Will be fixed at a later point
fhindex.write("""\
.. raw:: html
<style type="text/css">
div#sidebarbutton {
display: none;
}
.figure {
float: left;
margin: 10px;
@@ -5,11 +5,18 @@
Layout for scikit-learn, after a design made by Angel Soler
(http://webylimonada.org)

Update: Collapsable sidebar added - 13/03/2012 - Jaques Grobler


:copyright: Fabian Pedregosa
:license: BSD
#}
{% extends "basic/layout.html" %}

{% if theme_collapsiblesidebar|tobool %}
{% set script_files = script_files + ['_static/sidebar.js'] %}
{% endif %}

{% block extrahead %}
<script type="text/javascript">

@@ -85,12 +92,12 @@
{% block content %}
<div class="content-wrapper">

<div class="sphinxsidebar">

{%- if pagename != 'index' %}
{%- if parents %}
<div class="rel">
{% else %}
<div class="sphinxsidebar">
<div class="sphinxsidebarwrapper">
{%- if pagename != 'index' %}
{%- if parents %}
<div class="rel">
{% else %}
<div class="rel rellarge">
{% endif %}
<!-- rellinks[1:] is an ugly hack to avoid link to module
@@ -188,6 +195,8 @@ <h3>{{ _('This page') }}</h3>
{{ toc }}
{% endif %}
</div>
</div>


<div class="content">
{%- block document %}
@@ -196,6 +205,7 @@ <h3>{{ _('This page') }}</h3>
<div class="clearer"></div>
</div>
</div>

{% endblock %}


@@ -226,3 +236,5 @@ <h3>{{ _('This page') }}</h3>
{%- endif %}
</div>
{%- endblock %}



@@ -225,13 +225,14 @@ div.sphinxsidebar {
width: 200px;
float: left;
margin-left: 0;
margin-right: 0;
background-color: inherit;
border-top-left-radius: 15px;
-moz-border-radius:15px;
border-top-width: 0;
border-left-width: 0;
border-bottom-width: 0;
margin-top: 5px;
margin-top: 0;
}

div.sphinxsidebar h3 {
@@ -248,6 +249,17 @@ div.sphinxsidebar {
line-height: 1.5em;
}

div.sphinxsidebarwrapper {
padding: 0 0 0 0;
}

{% if theme_collapsiblesidebar|tobool %}
/* for collapsible sidebar */
div#sidebarbutton {
background-color: #F0F0F0;
}
{% endif %}


input {
border: 1px solid #ccc;
@@ -299,6 +311,12 @@ div.body h3 {
text-align: left;
}

div.bodywrapper {
margin: 0 0 0 0;

}


div.bodywrapper h1 {
margin: 0 -10px 0 -10px;
text-align: center;
@@ -0,0 +1,151 @@
/*
* sidebar.js
* ~~~~~~~~~~
*
* This script makes the Sphinx sidebar collapsible.
*
* .sphinxsidebar contains .sphinxsidebarwrapper. This script adds
 * in .sphinxsidebar, after .sphinxsidebarwrapper, the #sidebarbutton
* used to collapse and expand the sidebar.
*
* When the sidebar is collapsed the .sphinxsidebarwrapper is hidden
* and the width of the sidebar and the margin-left of the document
* are decreased. When the sidebar is expanded the opposite happens.
* This script saves a per-browser/per-session cookie used to
* remember the position of the sidebar among the pages.
* Once the browser is closed the cookie is deleted and the position
* reset to the default (expanded).
*
* :copyright: Copyright 2007-2011 by the Sphinx team, see AUTHORS.
* :license: BSD, see LICENSE for details.
*
*/

$(function() {
// global elements used by the functions.
// the 'sidebarbutton' element is defined as global after its
// creation, in the add_sidebar_button function
var bodywrapper = $('.bodywrapper');
var sidebar = $('.sphinxsidebar');
var sidebarwrapper = $('.sphinxsidebarwrapper');

// for some reason, the document has no sidebar; do not run into errors
if (!sidebar.length) return;

// original margin-left of the bodywrapper and width of the sidebar
// with the sidebar expanded
var bw_margin_expanded = bodywrapper.css('margin-left');
var ssb_width_expanded = sidebar.width();

// margin-left of the bodywrapper and width of the sidebar
// with the sidebar collapsed
var bw_margin_collapsed = '-190px';
var ssb_width_collapsed = '.8em';

// colors used by the current theme
var dark_color = $('.related').css('background-color');
var light_color = $('.footer').css('color');

function sidebar_is_collapsed() {
return sidebarwrapper.is(':not(:visible)');
}

function toggle_sidebar() {
if (sidebar_is_collapsed())
expand_sidebar();
else
collapse_sidebar();
}

function collapse_sidebar() {
sidebarwrapper.hide();
sidebar.css('width', ssb_width_collapsed);
bodywrapper.css('margin-left', bw_margin_collapsed);
sidebarbutton.css({
'margin-left': '0',
'height': bodywrapper.height()
});
sidebarbutton.find('span').text('»');
sidebarbutton.attr('title', _('Expand sidebar'));
document.cookie = 'sidebar=collapsed';
}

function expand_sidebar() {
bodywrapper.css('margin-left', bw_margin_expanded);
sidebar.css('width', ssb_width_expanded);
sidebarwrapper.show();
sidebarbutton.css({
'margin-left': ssb_width_expanded-12,
'height': bodywrapper.height()
});
sidebarbutton.find('span').text('«');
sidebarbutton.attr('title', _('Collapse sidebar'));
document.cookie = 'sidebar=expanded';
}

function add_sidebar_button() {
sidebarwrapper.css({
'float': 'left' ,
'margin-right': '0',
'width': ssb_width_expanded - 10
});
// create the button
sidebar.append(
'<div id="sidebarbutton"><span>&laquo;</span></div>'
);
var sidebarbutton = $('#sidebarbutton');
light_color = sidebarbutton.css('background-color');
// find the height of the viewport to center the '<<' in the page
var viewport_height;
if (window.innerHeight)
viewport_height = window.innerHeight;
else
viewport_height = $(window).height();
sidebarbutton.find('span').css({
'display': 'block',
'margin-top': (viewport_height - sidebar.position().top + 60) / 2
});

sidebarbutton.click(toggle_sidebar);
sidebarbutton.attr('title', _('Collapse sidebar'));
sidebarbutton.css({
'border-left': '1px solid ' + dark_color,
'border-top-left-radius' : '.8em',
'font-size': '1.2em',
'cursor': 'pointer',
'height': bodywrapper.height(),
'padding-top': '1px',
'margin-left': ssb_width_expanded - 12
});

sidebarbutton.hover(
function () {
$(this).css('background-color', dark_color);
},
function () {
$(this).css('background-color', light_color);
}
);
}

function set_position_from_cookie() {
if (!document.cookie)
return;
var items = document.cookie.split(';');
for(var k=0; k<items.length; k++) {
var key_val = items[k].split('=');
var key = key_val[0];
if (key == 'sidebar') {
var value = key_val[1];
if ((value == 'collapsed') && (!sidebar_is_collapsed()))
collapse_sidebar();
else if ((value == 'expanded') && (sidebar_is_collapsed()))
expand_sidebar();
}
}
}

add_sidebar_button();
var sidebarbutton = $('#sidebarbutton');
set_position_from_cookie();
});
@@ -5,3 +5,4 @@ pygments_style = tango

[options]
oldversion = False
collapsiblesidebar = True
@@ -154,7 +154,7 @@ one::
>>> clf.fit(digits.data[:-1], digits.target[:-1])
SVC(C=100.0, cache_size=200, class_weight=None, coef0=0.0, degree=3,
gamma=0.001, kernel='rbf', probability=False, scale_C=True,
shrinking=True, tol=0.001)
shrinking=True, tol=0.001, verbose=False)

Now you can predict new values, in particular, we can ask to the
classifier what is the digit of our last image in the `digits` dataset,
@@ -191,7 +191,7 @@ persistence model, namely `pickle <http://docs.python.org/library/pickle.html>`_
>>> clf.fit(X, y)
SVC(C=None, cache_size=200, class_weight=None, coef0=0.0, degree=3,
gamma=0.25, kernel='rbf', probability=False, scale_C=True,
shrinking=True, tol=0.001)
shrinking=True, tol=0.001, verbose=False)

>>> import pickle
>>> s = pickle.dumps(clf)
@@ -18,6 +18,9 @@ Changelog
- Regressors can now be used as base estimator in the :ref:`multiclass`
module by `Mathieu Blondel`_.

- Simple dict-based feature loader with support for categorical variables
(:class:`feature_extraction.DictVectorizer`) by `Lars Buitinck`_.

- Added Matthews correlation coefficient (:func:`metrics.matthews_corrcoef`)
and added macro and micro average options to
:func:`metrics.precision_score`, :func:`metrics.recall_score` and
@@ -55,6 +58,17 @@ Changelog
Ridge regression, esp. for the ``n_samples > n_features`` case, in
:class:`linear_model.RidgeCV`, by Reuben Fletcher-Costin.

- Refactoring and simplification of the :ref:`text_feature_extraction`
API and fixed a bug that caused possible negative IDF,
by `Olivier Grisel`_.

- Beam pruning option in :class:`_BaseHMM` module has been removed since it
is difficult to cythonize. If you are interested in contributing a cython
version, you can use the python version in the git history as a reference.

- Added :class:`sklearn.cross_validation.StratifiedShuffleSplit`, which is
a :class:`sklearn.cross_validation.ShuffleSplit` with balanced splits,
by `Yannick Schwartz`_.


API changes summary
@@ -91,6 +105,43 @@ API changes summary
Options now are 'ovr' and 'crammer_singer', with 'ovr' being the default.
This does not change the default behavior but hopefully is less confusing.

- Class :class:`feature_selection.text.Vectorizer` is deprecated and
replaced by :class:`feature_selection.text.TfidfVectorizer`.

- The preprocessor / analyzer nested structure for text feature
extraction has been removed. All those features are
now directly passed as flat constructor arguments
to :class:`feature_selection.text.TfidfVectorizer` and
:class:`feature_selection.text.CountVectorizer`, in particular the
following parameters are now used:

- ``analyzer`` can be `'word'` or `'char'` to switch the default
analysis scheme, or use a specific python callable (as previously).

- ``tokenizer`` and ``preprocessor`` have been introduced to make it
still possible to customize those steps with the new API.

- ``input`` explicitly controls how to interpret the sequence passed to
``fit`` and ``predict``: filenames, file objects or direct (byte or
unicode) strings.

- charset decoding is explicit and strict by default.

- the ``vocabulary``, fitted or not, is now stored in the
``vocabulary_`` attribute to be consistent with the project
conventions.

- Class :class:`feature_selection.text.TfidfVectorizer` now derives directly
from :class:`feature_selection.text.CountVectorizer` to make grid
search trivial.

- methods `rvs` in :class:`_BaseHMM` module are now deprecated.
`sample` should be used instead.

- Beam pruning option in :class:`_BaseHMM` module is removed since it is
difficult to Cythonize. If you are interested, you can use the Python
version in the git history as a reference.

.. _changes_0_10:

0.10
@@ -67,10 +67,10 @@
print "done in %0.3fs." % (time() - t0)

# Inverse the vectorizer vocabulary to be able
inverse_vocabulary = dict((v, k) for k, v in vectorizer.vocabulary.iteritems())
feature_names = vectorizer.get_feature_names()

for topic_idx, topic in enumerate(nmf.components_):
print "Topic #%d:" % topic_idx
print " ".join([inverse_vocabulary[i]
print " ".join([feature_names[i]
for i in topic.argsort()[:-n_top_words - 1:-1]])
print
@@ -80,8 +80,7 @@ def uniform_labelings_scores(score_func, n_samples, n_clusters_range,
scores = uniform_labelings_scores(score_func, n_samples, n_clusters_range)
print "done in %0.3fs" % (time() - t0)
plots.append(pl.errorbar(
# n_clusters_range, scores.mean(axis=1), scores.std(axis=1)))
n_clusters_range, np.median(scores, axis=1), scores.std(axis=1)))
n_clusters_range, np.median(scores, axis=1), scores.std(axis=1))[0])
names.append(score_func.__name__)

pl.title("Clustering measures for 2 random uniform labelings\n"
@@ -112,7 +111,7 @@ def uniform_labelings_scores(score_func, n_samples, n_clusters_range,
fixed_n_classes=n_classes)
print "done in %0.3fs" % (time() - t0)
plots.append(pl.errorbar(
n_clusters_range, scores.mean(axis=1), scores.std(axis=1)))
n_clusters_range, scores.mean(axis=1), scores.std(axis=1))[0])
names.append(score_func.__name__)

pl.title("Clustering measures for random uniform labeling\n"
@@ -18,6 +18,9 @@

print __doc__

import shutil
import tempfile

import numpy as np
import pylab as pl
from scipy import linalg, ndimage
@@ -59,7 +62,8 @@
# Compute the coefs of a Bayesian Ridge with GridSearch
cv = KFold(len(y), 2) # cross-validation generator for model selection
ridge = BayesianRidge()
mem = Memory(cachedir='.', verbose=1)
cachedir = tempfile.mkdtemp()
mem = Memory(cachedir=cachedir, verbose=1)

# Ward agglomeration followed by BayesianRidge
A = grid_to_graph(n_x=size, n_y=size)
@@ -99,3 +103,6 @@
pl.title("Feature Agglomeration")
pl.subplots_adjust(0.04, 0.0, 0.98, 0.94, 0.16, 0.26)
pl.show()

# Attempt to remove the temporary cachedir, but don't worry if it fails
shutil.rmtree(cachedir, ignore_errors=True)
@@ -24,14 +24,13 @@

import logging
import numpy as np
from operator import itemgetter
from optparse import OptionParser
import sys
from time import time
import pylab as pl

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import Vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import RidgeClassifier
from sklearn.svm import LinearSVC
@@ -107,7 +106,7 @@

print "Extracting features from the training dataset using a sparse vectorizer"
t0 = time()
vectorizer = Vectorizer(sublinear_tf=True)
vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words='english')
X_train = vectorizer.fit_transform(data_train.data)
print "done in %fs" % (time() - t0)
print "n_samples: %d, n_features: %d" % X_train.shape
@@ -130,15 +129,16 @@
print "done in %fs" % (time() - t0)
print

vocabulary = np.array([t for t, i in sorted(vectorizer.vocabulary.iteritems(),
key=itemgetter(1))])


def trim(s):
"""Trim string to fit on terminal (assuming 80-column display)"""
return s if len(s) <= 80 else s[:77] + "..."


# mapping from integer feature name to original token string
feature_names = vectorizer.get_feature_names()


###############################################################################
# Benchmark classifiers
def benchmark(clf):
@@ -166,7 +166,8 @@ def benchmark(clf):
print "top 10 keywords per class:"
for i, category in enumerate(categories):
top10 = np.argsort(clf.coef_[i])[-10:]
print trim("%s: %s" % (category, " ".join(vocabulary[top10])))
print trim("%s: %s" % (
category, " ".join(feature_names[top10])))
print

if opts.print_report:
@@ -17,7 +17,7 @@
# License: Simplified BSD

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import Vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import metrics

from sklearn.cluster import KMeans, MiniBatchKMeans
@@ -75,7 +75,8 @@

print "Extracting features from the training dataset using a sparse vectorizer"
t0 = time()
vectorizer = Vectorizer(max_df=0.95, max_features=10000)
vectorizer = TfidfVectorizer(max_df=0.5, max_features=10000,
stop_words='english')
X = vectorizer.fit_transform(dataset.data)

print "done in %fs" % (time() - t0)
@@ -24,7 +24,7 @@
'clf__n_iter': (10, 50, 80),
'clf__penalty': ('l2', 'elasticnet'),
'tfidf__use_idf': (True, False),
'vect__analyzer__max_n': (1, 2),
'vect__max_n': (1, 2),
'vect__max_df': (0.5, 0.75, 1.0),
'vect__max_features': (None, 5000, 10000, 50000)}
done in 1737.030s
@@ -35,7 +35,7 @@
clf__n_iter: 50
clf__penalty: 'elasticnet'
tfidf__use_idf: True
vect__analyzer__max_n: 2
vect__max_n: 2
vect__max_df: 0.75
vect__max_features: 50000
@@ -94,29 +94,33 @@
# increase processing time in a combinatorial way
'vect__max_df': (0.5, 0.75, 1.0),
# 'vect__max_features': (None, 5000, 10000, 50000),
'vect__analyzer__max_n': (1, 2), # words or bigrams
'vect__max_n': (1, 2), # words or bigrams
# 'tfidf__use_idf': (True, False),
# 'tfidf__norm': ('l1', 'l2'),
'clf__alpha': (0.00001, 0.000001),
'clf__penalty': ('l2', 'elasticnet'),
# 'clf__n_iter': (10, 50, 80),
}

# find the best parameters for both the feature extraction and the
# classifier
grid_search = GridSearchCV(pipeline, parameters, n_jobs=1)

print "Performing grid search..."
print "pipeline:", [name for name, _ in pipeline.steps]
print "parameters:"
pprint(parameters)
t0 = time()
grid_search.fit(data.data, data.target)
print "done in %0.3fs" % (time() - t0)
print

print "Best score: %0.3f" % grid_search.best_score
print "Best parameters set:"
best_parameters = grid_search.best_estimator.get_params()
for param_name in sorted(parameters.keys()):
print "\t%s: %r" % (param_name, best_parameters[param_name])
if __name__ == "__main__":
# multiprocessing requires the fork to happen in a __main__ protected
# block

# find the best parameters for both the feature extraction and the
# classifier
grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1)

print "Performing grid search..."
print "pipeline:", [name for name, _ in pipeline.steps]
print "parameters:"
pprint(parameters)
t0 = time()
grid_search.fit(data.data, data.target)
print "done in %0.3fs" % (time() - t0)
print

print "Best score: %0.3f" % grid_search.best_score
print "Best parameters set:"
best_parameters = grid_search.best_estimator.get_params()
for param_name in sorted(parameters.keys()):
print "\t%s: %r" % (param_name, best_parameters[param_name])
@@ -46,7 +46,7 @@
import pylab as pl

from sklearn.datasets import load_mlcomp
from sklearn.feature_extraction.text import Vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
@@ -66,7 +66,7 @@

print "Extracting features from the dataset using a sparse vectorizer"
t0 = time()
vectorizer = Vectorizer()
vectorizer = TfidfVectorizer(charset='latin1')
X_train = vectorizer.fit_transform((open(f).read()
for f in news_train.filenames))
print "done in %fs" % (time() - t0)
@@ -0,0 +1,61 @@
"""
==================================
Demonstration of sampling from HMM
==================================
This script shows how to sample points from a Hidden Markov Model (HMM):
we use a 4-component HMM with specified means and covariances.
The plot shows the sequence of observations generated, along with the
transitions between them. We can see that, as specified by our transition
matrix, there are no transitions between components 1 and 3.
"""

import numpy as np
import matplotlib.pyplot as plt

from sklearn import hmm

##############################################################
# Prepare parameters for a 4-component HMM
# Initial population probability
start_prob = np.array([0.6, 0.3, 0.1, 0.0])
# The transition matrix; note that there are no transitions possible
# between components 1 and 3
trans_mat = np.array([[0.7, 0.2, 0.0, 0.1],
[0.3, 0.5, 0.2, 0.0],
[0.0, 0.3, 0.5, 0.2],
[0.2, 0.0, 0.2, 0.6]])
# The means of each component
means = np.array([[0.0, 0.0],
[0.0, 11.0],
[9.0, 10.0],
[11.0, -1.0],
])
# The covariance of each component
covars = .5*np.tile(np.identity(2), (4, 1, 1))

# Build an HMM instance and set parameters
model = hmm.GaussianHMM(4, "full", start_prob, trans_mat,
random_state=42)

# Instead of fitting it from the data, we directly set the estimated
# parameters, the means and covariance of the components
model.means_ = means
model.covars_ = covars
###############################################################

# Generate samples
X, Z = model.sample(500)

# Plot the sampled data
plt.plot(X[:, 0], X[:, 1], "-o", label="observations", ms=6,
mfc="orange", alpha=0.7)

# Indicate the component numbers
for i, m in enumerate(means):
plt.text(m[0], m[1], 'Component %i' % (i + 1),
size=17, horizontalalignment='center',
bbox=dict(alpha=.7, facecolor='w'))
plt.legend(loc='best')
plt.show()
@@ -0,0 +1,95 @@
"""
==========================
Gaussian HMM of stock data
==========================
This script shows how to use a Gaussian HMM.
It uses stock price data, which can be obtained from Yahoo Finance.
For more information on how to get stock prices with matplotlib, please refer
to date_demo1.py of matplotlib.
"""
print __doc__

import datetime
import numpy as np
import pylab as pl
from matplotlib.finance import quotes_historical_yahoo
from matplotlib.dates import YearLocator, MonthLocator, DateFormatter
from sklearn.hmm import GaussianHMM

###############################################################################
# Downloading the data
date1 = datetime.date(1995, 1, 1) # start date
date2 = datetime.date(2012, 1, 6) # end date
# get quotes from yahoo finance
quotes = quotes_historical_yahoo("INTC", date1, date2)
if len(quotes) == 0:
raise SystemExit

# unpack quotes
dates = np.array([q[0] for q in quotes], dtype=int)
close_v = np.array([q[2] for q in quotes])
volume = np.array([q[5] for q in quotes])[1:]

# take diff of close value
# this makes len(diff) = len(close_v) - 1
# therefore, the other quantities also need to be shifted
diff = close_v[1:] - close_v[:-1]
dates = dates[1:]
close_v = close_v[1:]

# pack diff and volume for training
X = np.column_stack([diff, volume])

###############################################################################
# Run Gaussian HMM
print "fitting to HMM and decoding ...",
n_components = 5

# make an HMM instance and execute fit
model = GaussianHMM(n_components, "diag")
model.fit([X], n_iter=1000)

# predict the optimal sequence of internal hidden state
hidden_states = model.predict(X)

print "done\n"

###############################################################################
# print trained parameters and plot
print "Transition matrix"
print model.transmat_
print ""

print "means and vars of each hidden state"
for i in xrange(n_components):
print "%dth hidden state" % i
print "mean = ", model.means_[i]
print "var = ", np.diag(model.covars_[i])
print ""

years = YearLocator() # every year
months = MonthLocator() # every month
yearsFmt = DateFormatter('%Y')
fig = pl.figure()
ax = fig.add_subplot(111)

for i in xrange(n_components):
# use fancy indexing to plot data in each state
idx = (hidden_states == i)
ax.plot_date(dates[idx], close_v[idx], 'o', label="%dth hidden state" % i)
ax.legend()

# format the ticks
ax.xaxis.set_major_locator(years)
ax.xaxis.set_major_formatter(yearsFmt)
ax.xaxis.set_minor_locator(months)
ax.autoscale_view()

# format the coords message box
ax.fmt_xdata = DateFormatter('%Y-%m-%d')
ax.fmt_ydata = lambda x: '$%1.2f' % x
ax.grid(True)

fig.autofmt_xdate()
pl.show()

@@ -0,0 +1,123 @@
import numpy as np
cimport numpy as np
cimport cython

cdef extern from "math.h":
double exp(double)
double log(double)

ctypedef np.float64_t dtype_t


@cython.boundscheck(False)
def _logsum(int N, np.ndarray[dtype_t, ndim=1] X):
cdef int i
cdef double maxv, Xsum
Xsum = 0.0
maxv = X.max()
for i in xrange(N):
Xsum += exp(X[i] - maxv)
return log(Xsum) + maxv


@cython.boundscheck(False)
def _forward(int n_observations, int n_components, \
np.ndarray[dtype_t, ndim=1] log_startprob, \
np.ndarray[dtype_t, ndim=2] log_transmat, \
np.ndarray[dtype_t, ndim=2] framelogprob, \
np.ndarray[dtype_t, ndim=2] fwdlattice):

cdef int t, i, j
cdef double logprob
cdef np.ndarray[dtype_t, ndim = 1] work_buffer
work_buffer = np.zeros(n_components)

for i in xrange(n_components):
fwdlattice[0, i] = log_startprob[i] + framelogprob[0, i]

for t in xrange(1, n_observations):
for j in xrange(n_components):
for i in xrange(n_components):
work_buffer[i] = fwdlattice[t - 1, i] + log_transmat[i, j]
fwdlattice[t, j] = _logsum(n_components, work_buffer) \
+ framelogprob[t, j]


@cython.boundscheck(False)
def _backward(int n_observations, int n_components, \
np.ndarray[dtype_t, ndim=1] log_startprob, \
np.ndarray[dtype_t, ndim=2] log_transmat, \
np.ndarray[dtype_t, ndim=2] framelogprob, \
np.ndarray[dtype_t, ndim=2] bwdlattice):

cdef int t, i, j
cdef double logprob
cdef np.ndarray[dtype_t, ndim = 1] work_buffer
work_buffer = np.zeros(n_components)

for i in xrange(n_components):
bwdlattice[n_observations - 1, i] = 0.0

for t in xrange(n_observations - 2, -1, -1):
for i in xrange(n_components):
for j in xrange(n_components):
work_buffer[j] = log_transmat[i, j] + framelogprob[t + 1, j] \
+ bwdlattice[t + 1, j]
bwdlattice[t, i] = _logsum(n_components, work_buffer)


@cython.boundscheck(False)
def _compute_lneta(int n_observations, int n_components, \
np.ndarray[dtype_t, ndim=2] fwdlattice, \
np.ndarray[dtype_t, ndim=2] log_transmat, \
np.ndarray[dtype_t, ndim=2] bwdlattice, \
np.ndarray[dtype_t, ndim=2] framelogprob, \
double logprob, \
np.ndarray[dtype_t, ndim=3] lneta):

cdef int i, j, t
for t in xrange(n_observations - 1):
for i in xrange(n_components):
for j in xrange(n_components):
lneta[t, i, j] = fwdlattice[t, i] + log_transmat[i, j] \
+ framelogprob[t + 1, j] + bwdlattice[t + 1, j] - logprob


@cython.boundscheck(False)
def _viterbi(int n_observations, int n_components, \
np.ndarray[dtype_t, ndim=1] log_startprob, \
np.ndarray[dtype_t, ndim=2] log_transmat, \
np.ndarray[dtype_t, ndim=2] framelogprob):

cdef int i, j, t, max_pos
cdef np.ndarray[dtype_t, ndim = 2] viterbi_lattice
cdef np.ndarray[np.int_t, ndim = 1] state_sequence
cdef double logprob
cdef np.ndarray[dtype_t, ndim = 1] work_buffer

# Initialization of the state sequence
state_sequence = np.zeros(n_observations, dtype=np.int)
work_buffer = np.zeros(n_components)
viterbi_lattice = np.zeros((n_observations, n_components))

# viterbi_lattice[0,:] = log_startprob[:] + framelogprob[0,:]
for i in xrange(n_components):
viterbi_lattice[0, i] = log_startprob[i] + framelogprob[0, i]

# Induction
for t in xrange(1, n_observations):
for j in xrange(n_components):
work_buffer[:] = viterbi_lattice[t - 1, :] + log_transmat[:, j]
viterbi_lattice[t, j] = np.max(work_buffer[:]) + framelogprob[t, j]

# observation traceback
max_pos = np.argmax(viterbi_lattice[n_observations - 1, :])
state_sequence[n_observations - 1] = max_pos
logprob = viterbi_lattice[n_observations - 1, max_pos]

for t in xrange(n_observations - 2, -1, -1):
max_pos = np.argmax(viterbi_lattice[t, :] \
+ log_transmat[:, state_sequence[t + 1]])
state_sequence[t] = max_pos

return state_sequence, logprob
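The ``_logsum`` helper above relies on the standard log-sum-exp trick:
factoring out the maximum before exponentiating keeps ``exp`` from
overflowing. A pure-NumPy sketch of the same computation:

```python
import numpy as np

def logsum(x):
    # log(sum(exp(x))) computed stably: subtract the max before
    # exponentiating, then add it back outside the log.
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

x = np.array([1000.0, 1000.0])
# The naive np.log(np.exp(x).sum()) overflows to inf here; the stable
# version returns 1000 + log(2).
print(logsum(x))
```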

@@ -24,16 +24,6 @@ cdef extern from "cblas.h":
double ddot "cblas_ddot"(int N, double *X, int incX, double *Y, int incY)


@cython.profile(False)
@cython.wraparound(False)
cdef inline DOUBLE array_ddot(int n,
np.ndarray[DOUBLE, ndim=2] a, int a_idx,
np.ndarray[DOUBLE, ndim=2] b, int b_idx):
"""Fast dot product of rows of 2D arrays with blas"""
return ddot(n, <DOUBLE*>(a.data + a_idx * n * sizeof(DOUBLE)), 1,
<DOUBLE*>(b.data + b_idx * n * sizeof(DOUBLE)), 1)


@cython.boundscheck(False)
@cython.wraparound(False)
@cython.cdivision(True)
@@ -63,16 +53,17 @@ cpdef DOUBLE _assign_labels_array(np.ndarray[DOUBLE, ndim=2] X,
store_distances = 1

for center_idx in range(n_clusters):
center_squared_norms[center_idx] = array_ddot(
n_features, centers, center_idx, centers, center_idx)
center_squared_norms[center_idx] = ddot(
n_features, &centers[center_idx, 0], 1, &centers[center_idx, 0], 1)

for sample_idx in range(n_samples):
min_dist = -1
for center_idx in range(n_clusters):
dist = 0.0
# hardcoded: minimize euclidean distance to cluster center:
# ||a - b||^2 = ||a||^2 + ||b||^2 -2 <a, b>
dist += array_ddot(n_features, X, sample_idx, centers, center_idx)
dist += ddot(n_features, &X[sample_idx, 0], 1,
&centers[center_idx, 0], 1)
dist *= -2
dist += center_squared_norms[center_idx]
dist += x_squared_norms[sample_idx]
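The identity in the comment above, ``||a - b||^2 = ||a||^2 + ||b||^2 - 2 <a, b>``,
is what lets the loop reuse precomputed squared norms plus a single dot
product per sample/center pair instead of forming the difference vector. A
quick NumPy check of the identity on made-up vectors:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, -1.0])

direct = np.sum((a - b) ** 2)                  # ||a - b||^2
expanded = a.dot(a) + b.dot(b) - 2 * a.dot(b)  # squared norms + dot product
print(direct, expanded)  # both 29.0
```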
@@ -118,8 +109,8 @@ cpdef DOUBLE _assign_labels_csr(X, np.ndarray[DOUBLE, ndim=1] x_squared_norms,
store_distances = 1

for center_idx in range(n_clusters):
center_squared_norms[center_idx] = array_ddot(
n_features, centers, center_idx, centers, center_idx)
center_squared_norms[center_idx] = ddot(
n_features, &centers[center_idx, 0], 1, &centers[center_idx, 0], 1)

for sample_idx in range(n_samples):
min_dist = -1
@@ -288,14 +288,15 @@ def fit(self, X):
if isinstance(memory, basestring):
memory = Memory(cachedir=memory)

if not sparse.issparse(self.connectivity):
raise TypeError("`connectivity` should be a sparse matrix, got: %r"
% type(self.connectivity))

if (self.connectivity.shape[0] != X.shape[0] or
self.connectivity.shape[1] != X.shape[0]):
raise ValueError("`connectivity` does not have shape "
"(n_samples, n_samples)")
if not self.connectivity is None:
if not sparse.issparse(self.connectivity):
raise TypeError("`connectivity` should be a sparse matrix or "
"None, got: %r" % type(self.connectivity))

if (self.connectivity.shape[0] != X.shape[0] or
self.connectivity.shape[1] != X.shape[0]):
raise ValueError("`connectivity` does not have shape "
"(n_samples, n_samples)")

# Construct the tree
self.children_, self.n_components, self.n_leaves_ = \
@@ -5,6 +5,7 @@
from numpy.testing import assert_equal
from numpy.testing import assert_array_equal
from numpy.testing import assert_array_almost_equal
from nose import SkipTest
from nose.tools import assert_almost_equal
from nose.tools import assert_raises
from nose.tools import assert_true
@@ -154,7 +155,17 @@ def test_k_means_plus_plus_init():
_check_fitted_model(k_means)


def _get_mac_os_version():
import platform
mac_version, _, _ = platform.mac_ver()
if mac_version:
# turn something like '10.7.3' into '10.7'
return '.'.join(mac_version.split('.')[:2])


def test_k_means_plus_plus_init_2_jobs():
if _get_mac_os_version() == '10.7':
raise SkipTest('Multi-process bug in Mac OS X Lion (see issue #636)')
k_means = KMeans(init="k-means++", k=n_clusters, n_jobs=2,
random_state=42).fit(X)
_check_fitted_model(k_means)
@@ -17,7 +17,7 @@

from .base import is_classifier, clone
from .utils import check_arrays, check_random_state
from .utils.fixes import unique
from .utils.fixes import unique, in1d
from .externals.joblib import Parallel, delayed


@@ -779,6 +779,168 @@ def __len__(self):
return self.n_iterations


def _validate_stratified_shuffle_split(y, test_size, train_size):
y = unique(y, return_inverse=True)[1]
if np.min(np.bincount(y)) < 2:
raise ValueError("The least populated class in y has only 1"
" member, which is too few. The minimum"
" number of labels for any class cannot"
" be less than 2.")

if isinstance(test_size, float) and test_size >= 1.:
raise ValueError(
'test_size=%f should be smaller '
'than 1.0 or be an integer' % test_size)
elif isinstance(test_size, int) and test_size >= y.size:
raise ValueError(
'test_size=%d should be smaller '
'than the number of samples %d' % (test_size, y.size))

if train_size is not None:
if isinstance(train_size, float) and train_size >= 1.:
raise ValueError("train_size=%f should be smaller "
"than 1.0 or be an integer" % train_size)
elif isinstance(train_size, int) and train_size >= y.size:
raise ValueError("train_size=%d should be smaller "
"than the number of samples %d" %
(train_size, y.size))

if isinstance(test_size, float):
n_test = ceil(test_size * y.size)
else:
n_test = float(test_size)

if train_size is None:
if isinstance(test_size, float):
n_train = y.size - n_test
else:
n_train = float(y.size - test_size)
else:
if isinstance(train_size, float):
n_train = floor(train_size * y.size)
else:
n_train = float(train_size)

if n_train + n_test > y.size:
raise ValueError('The sum of n_train and n_test = %d, should '
'be smaller than the number of samples %d. '
'Reduce test_size and/or train_size.' %
(n_train + n_test, y.size))

return n_train, n_test


class StratifiedShuffleSplit(object):
"""Stratified ShuffleSplit cross validation iterator
Provides train/test indices to split data in train test sets.
This cross-validation object is a merge of StratifiedKFold and
ShuffleSplit, which returns stratified randomized folds. The folds
are made by preserving the percentage of samples for each class.
Note: like the ShuffleSplit strategy, stratified random splits
do not guarantee that all folds will be different, although this is
still very likely for sizeable datasets.
Parameters
----------
y: array, [n_samples]
Labels of samples.
n_iterations : int (default 10)
Number of re-shuffling & splitting iterations.
test_size : float (default 0.1) or int
If float, should be between 0.0 and 1.0 and represent the
proportion of the dataset to include in the test split. If
int, represents the absolute number of test samples.
train_size : float, int, or None (default is None)
If float, should be between 0.0 and 1.0 and represent the
proportion of the dataset to include in the train split. If
int, represents the absolute number of train samples. If None,
the value is automatically set to the complement of the test fraction.
indices: boolean, optional (default True)
Return train/test split as arrays of indices, rather than a boolean
mask array. Integer indices are required when dealing with sparse
matrices, since those cannot be indexed by boolean masks.
Examples
--------
>>> from sklearn.cross_validation import StratifiedShuffleSplit
>>> X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
>>> y = np.array([0, 0, 1, 1])
>>> sss = StratifiedShuffleSplit(y, 3, test_size=0.5, random_state=0)
>>> len(sss)
3
>>> print sss # doctest: +ELLIPSIS
StratifiedShuffleSplit(labels=[0 0 1 1], n_iterations=3, ...)
>>> for train_index, test_index in sss:
... print "TRAIN:", train_index, "TEST:", test_index
... X_train, X_test = X[train_index], X[test_index]
... y_train, y_test = y[train_index], y[test_index]
TRAIN: [0 3] TEST: [1 2]
TRAIN: [0 2] TEST: [1 3]
TRAIN: [1 2] TEST: [0 3]
"""

def __init__(self, y, n_iterations=10, test_size=0.1,
train_size=None, indices=True, random_state=None):

self.y = np.asarray(y)
self.n = self.y.shape[0]
self.n_iterations = n_iterations
self.test_size = test_size
self.train_size = train_size
self.random_state = random_state
self.indices = indices
self.n_train, self.n_test = \
_validate_stratified_shuffle_split(y, test_size, train_size)

def __iter__(self):
rng = check_random_state(self.random_state)

y = self.y.copy()
n = y.size
k = ceil(n / self.n_test)
l = floor((n - self.n_test) / self.n_train)

for i in xrange(self.n_iterations):
ik = i % k
permutation = rng.permutation(self.n)
idx = np.argsort(y[permutation])
ind_test = permutation[idx[ik::k]]
inv_test = np.setdiff1d(idx, idx[ik::k])
train_idx = idx[np.where(in1d(idx, inv_test))[0]]
ind_train = permutation[train_idx[::l]][:self.n_train]
test_index = ind_test
train_index = ind_train

if not self.indices:
test_index = np.zeros(n, dtype=np.bool)
test_index[ind_test] = True
train_index = np.zeros(n, dtype=np.bool)
train_index[ind_train] = True

yield train_index, test_index

def __repr__(self):
return ('%s(labels=%s, n_iterations=%d, test_size=%s, indices=%s, '
'random_state=%s)' % (
self.__class__.__name__,
self.y,
self.n_iterations,
str(self.test_size),
self.indices,
self.random_state,
))

def __len__(self):
return self.n_iterations


##############################################################################

def _cross_val_score(estimator, X, y, score_func, train, test):
@@ -6,6 +6,7 @@
import os
from os.path import join, exists
import re
import scipy as sp
from scipy import io
from shutil import copyfileobj
import urllib2
@@ -184,6 +185,7 @@ def fetch_mldata(dataname, target_name='label', data_name='data',
if transpose_data:
dataset['data'] = dataset['data'].T
if 'target' in dataset:
dataset['target'] = dataset['target'].squeeze()
if not sp.sparse.issparse(dataset['target']):
dataset['target'] = dataset['target'].squeeze()

return Bunch(**dataset)
@@ -4,5 +4,6 @@
images.
"""

from .dict_vectorizer import DictVectorizer
from .image import img_to_graph, grid_to_graph
from . import text
@@ -0,0 +1,249 @@
# Author: Lars Buitinck <L.J.Buitinck@uva.nl>
# License: BSD-style.

from collections import Mapping, Sequence
from operator import itemgetter

import numpy as np
import scipy.sparse as sp

from ..base import BaseEstimator, TransformerMixin
from ..utils import atleast2d_or_csr


def _tosequence(X):
"""Turn X into a sequence or ndarray, avoiding a copy if possible."""
if isinstance(X, Mapping):
return [X]
elif isinstance(X, (Sequence, np.ndarray)):
return X
else:
return list(X)


class DictVectorizer(BaseEstimator, TransformerMixin):
"""Transforms lists of feature-value mappings to vectors.
This transformer turns lists of mappings (dict-like objects) of feature
names to feature values into Numpy arrays or scipy.sparse matrices for use
with scikit-learn estimators.
When feature values are strings, this transformer will do a binary one-hot
(aka one-of-K) coding: one boolean-valued feature is constructed for each
of the possible string values that the feature can take on. For instance,
a feature "f" that can take on the values "ham" and "spam" will become two
features in the output, one signifying "f=ham", the other "f=spam".
Features that do not occur in a sample (mapping) will have a zero value
in the resulting array/matrix.
Parameters
----------
dtype : callable, optional
The type of feature values. Passed to Numpy array/scipy.sparse matrix
constructors as the dtype argument.
separator: string, optional
Separator string used when constructing new features for one-hot
coding.
sparse: boolean, optional
Whether transform should produce scipy.sparse matrices.
Examples
--------
>>> from sklearn.feature_extraction import DictVectorizer
>>> v = DictVectorizer(sparse=False)
>>> D = [{'foo': 1, 'bar': 2}, {'foo': 3, 'baz': 1}]
>>> X = v.fit_transform(D)
>>> X
array([[ 1., 2., 0.],
[ 3., 0., 1.]])
>>> v.inverse_transform(X) == \
[{'bar': 2.0, 'foo': 1.0}, {'baz': 1.0, 'foo': 3.0}]
True
>>> v.transform({'foo': 4, 'quux': 3})
array([[ 4., 0., 0.]])
"""

def __init__(self, dtype=np.float64, separator="=", sparse=True):
self.dtype = dtype
self.separator = separator
self.sparse = sparse

def fit(self, X, y=None):
"""Learn a list of feature name -> indices mappings.
Parameters
----------
X : Mapping or iterable over Mappings
Dict(s) or Mapping(s) from feature names (arbitrary Python
objects) to feature values (must be convertible to dtype).
y : (ignored)
Returns
-------
self
"""
X = _tosequence(X)
vocab = {}

for x in X:
for f, v in x.iteritems():
if isinstance(v, basestring):
f = "%s%s%s" % (f, self.separator, v)
vocab.setdefault(f, len(vocab))

self.vocabulary_ = vocab

return self

def fit_transform(self, X, y=None):
"""Learn a list of feature name -> indices mappings and transform X.
Like fit(X) followed by transform(X).
Parameters
----------
X : Mapping or iterable over Mappings
Dict(s) or Mapping(s) from feature names (arbitrary Python
objects) to feature values (must be convertible to dtype).
y : (ignored)
Returns
-------
Xa : {array, sparse matrix}
Feature vectors; always 2-d.
"""
X = _tosequence(X)
self.fit(X)
return self.transform(X)

def inverse_transform(self, X, dict_type=dict):
"""Transform array or sparse matrix X back to feature mappings.
X must have been produced by this DictVectorizer's transform or
fit_transform method; it may only have passed through transformers
that preserve the number of features and their order.
In the case of one-hot/one-of-K coding, the constructed feature
names and values are returned rather than the original ones.
Parameters
----------
X : {array-like, sparse matrix}, shape = [n_samples, n_features]
Sample matrix.
dict_type : callable, optional
Constructor for feature mappings. Must conform to the
collections.Mapping API.
Returns
-------
D : list of dict_type objects, length = n_samples
Feature mappings for the samples in X.
"""
X = atleast2d_or_csr(X) # COO matrix is not subscriptable

names = self.get_feature_names()
Xd = [dict_type() for _ in xrange(X.shape[0])]

if sp.issparse(X):
for i, j in zip(*X.nonzero()):
Xd[i][names[j]] = X[i, j]
else:
for i in xrange(X.shape[0]):
d = Xd[i]
for j, v in enumerate(X[i, :]):
if v != 0:
d[names[j]] = X[i, j]

return Xd

def transform(self, X, y=None):
"""Transform feature->value dicts to array or sparse matrix.
Named features not encountered during fit or fit_transform will be
silently ignored.
Parameters
----------
X : Mapping or iterable over Mappings, length = n_samples
Dict(s) or Mapping(s) from feature names (arbitrary Python
objects) to feature values (must be convertible to dtype).
y : (ignored)
Returns
-------
Xa : {array, sparse matrix}
Feature vectors; always 2-d.
"""
dtype = self.dtype
vocab = self.vocabulary_

if self.sparse:
i_ind = []
j_ind = []
values = []

for i, x in enumerate(X):
for f, v in x.iteritems():
if isinstance(v, basestring):
f = "%s%s%s" % (f, self.separator, v)
v = 1
try:
j = vocab[f]
i_ind.append(i)
j_ind.append(j)
values.append(dtype(v))
except KeyError:
pass

shape = (i + 1, len(vocab))
return sp.coo_matrix((values, (i_ind, j_ind)),
shape=shape, dtype=dtype)

else:
X = _tosequence(X)
Xa = np.zeros((len(X), len(vocab)), dtype=dtype)

for i, x in enumerate(X):
for f, v in x.iteritems():
if isinstance(v, basestring):
f = "%s%s%s" % (f, self.separator, v)
v = 1
try:
Xa[i, vocab[f]] = dtype(v)
except KeyError:
pass

return Xa

def get_feature_names(self):
"""Returns a list of feature names, ordered by their indices.
If one-of-K coding is applied to categorical features, this will
include the constructed feature names but not the original ones.
"""
return [f for f, i in sorted(self.vocabulary_.iteritems(),
key=itemgetter(1))]

def restrict(self, support, indices=False):
"""Restrict the features to those in support.
Parameters
----------
support : array-like
Boolean mask or list of indices (as returned by the get_support
member of feature selectors).
indices : boolean, optional
Whether support is a list of indices.
"""
if not indices:
support = np.where(support)[0]

names = self.get_feature_names()
new_vocab = {}
for i in support:
new_vocab[names[i]] = len(new_vocab)

self.vocabulary_ = new_vocab

return self
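A brief sketch of how ``restrict`` might be used with a boolean support
mask (the mask here is built by hand rather than taken from a feature
selector's ``get_support``, and the toy dicts are made up):

```python
import numpy as np
from sklearn.feature_extraction import DictVectorizer

v = DictVectorizer(sparse=False)
D = [{"foo": 1, "bar": 2}, {"foo": 3, "baz": 1}]
v.fit_transform(D)

# Feature names ordered by their column indices
names = sorted(v.vocabulary_, key=v.vocabulary_.get)
# Keep every feature except "baz" (a boolean mask, shaped like what a
# selector's get_support() would return)
support = np.array([name != "baz" for name in names])
v.restrict(support)

print(sorted(v.vocabulary_))  # ['bar', 'foo']
print(v.transform(D).shape)   # (2, 2)
```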
@@ -0,0 +1,45 @@
# This list of English stop words is taken from the "Glasgow Information
# Retrieval Group". The original list can be found at
# http://ir.dcs.gla.ac.uk/resources/linguistic_utils/stop_words
ENGLISH_STOP_WORDS = frozenset([
"a", "about", "above", "across", "after", "afterwards", "again", "against",
"all", "almost", "alone", "along", "already", "also", "although", "always",
"am", "among", "amongst", "amoungst", "amount", "an", "and", "another",
"any", "anyhow", "anyone", "anything", "anyway", "anywhere", "are",
"around", "as", "at", "back", "be", "became", "because", "become",
"becomes", "becoming", "been", "before", "beforehand", "behind", "being",
"below", "beside", "besides", "between", "beyond", "bill", "both",
"bottom", "but", "by", "call", "can", "cannot", "cant", "co", "con",
"could", "couldnt", "cry", "de", "describe", "detail", "do", "done",
"down", "due", "during", "each", "eg", "eight", "either", "eleven", "else",
"elsewhere", "empty", "enough", "etc", "even", "ever", "every", "everyone",
"everything", "everywhere", "except", "few", "fifteen", "fify", "fill",
"find", "fire", "first", "five", "for", "former", "formerly", "forty",
"found", "four", "from", "front", "full", "further", "get", "give", "go",
"had", "has", "hasnt", "have", "he", "hence", "her", "here", "hereafter",
"hereby", "herein", "hereupon", "hers", "herself", "him", "himself", "his",
"how", "however", "hundred", "i", "ie", "if", "in", "inc", "indeed",
"interest", "into", "is", "it", "its", "itself", "keep", "last", "latter",
"latterly", "least", "less", "ltd", "made", "many", "may", "me",
"meanwhile", "might", "mill", "mine", "more", "moreover", "most", "mostly",
"move", "much", "must", "my", "myself", "name", "namely", "neither",
"never", "nevertheless", "next", "nine", "no", "nobody", "none", "noone",
"nor", "not", "nothing", "now", "nowhere", "of", "off", "often", "on",
"once", "one", "only", "onto", "or", "other", "others", "otherwise", "our",
"ours", "ourselves", "out", "over", "own", "part", "per", "perhaps",
"please", "put", "rather", "re", "same", "see", "seem", "seemed",
"seeming", "seems", "serious", "several", "she", "should", "show", "side",
"since", "sincere", "six", "sixty", "so", "some", "somehow", "someone",
"something", "sometime", "sometimes", "somewhere", "still", "such",
"system", "take", "ten", "than", "that", "the", "their", "them",
"themselves", "then", "thence", "there", "thereafter", "thereby",
"therefore", "therein", "thereupon", "these", "they", "thick", "thin",
"third", "this", "those", "though", "three", "through", "throughout",
"thru", "thus", "to", "together", "too", "top", "toward", "towards",
"twelve", "twenty", "two", "un", "under", "until", "up", "upon", "us",
"very", "via", "was", "we", "well", "were", "what", "whatever", "when",
"whence", "whenever", "where", "whereafter", "whereas", "whereby",
"wherein", "whereupon", "wherever", "whether", "which", "while", "whither",
"who", "whoever", "whole", "whom", "whose", "why", "will", "with",
"within", "without", "would", "yet", "you", "your", "yours", "yourself",
"yourselves"])
@@ -0,0 +1,74 @@
# Author: Lars Buitinck <L.J.Buitinck@uva.nl>
# License: BSD-style.

import numpy as np
import scipy.sparse as sp

from nose.tools import assert_equal, assert_true, assert_false
from numpy.testing import assert_array_equal

from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_selection import SelectKBest, chi2


def test_dictvectorizer():
    D = [{"foo": 1, "bar": 3},
         {"bar": 4, "baz": 2},
         {"bar": 1, "quux": 1, "quuux": 2}]

    for sparse in (True, False):
        for dtype in (int, np.float32, np.int16):
            v = DictVectorizer(sparse=sparse, dtype=dtype)
            X = v.fit_transform(D)

            assert_equal(sp.issparse(X), sparse)
            assert_equal(X.shape, (3, 5))
            assert_equal(X.sum(), 14)
            assert_equal(v.inverse_transform(X), D)

            if sparse:
                # COO matrices can't be compared for equality
                assert_array_equal(X.A, v.transform(D).A)
            else:
                assert_array_equal(X, v.transform(D))
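
The behaviour this test exercises can be sketched in a few lines of pure Python. This toy stand-in is an assumption for illustration, not the real ``DictVectorizer`` (which also handles sparse output, dtypes, and ``inverse_transform``); it only shows the fit/transform split and the one-of-K expansion of string values.

```python
def fit_vocab(dicts):
    """Collect the sorted union of feature names, expanding string values
    into one-of-K names like "version=1" (toy DictVectorizer stand-in)."""
    names = set()
    for d in dicts:
        for k, v in d.items():
            names.add("%s=%s" % (k, v) if isinstance(v, str) else k)
    return sorted(names)


def transform(dicts, vocab):
    """Build one dense row per dict; features unseen at fit time are dropped."""
    index = {name: j for j, name in enumerate(vocab)}
    rows = []
    for d in dicts:
        row = [0] * len(vocab)
        for k, v in d.items():
            name = "%s=%s" % (k, v) if isinstance(v, str) else k
            if name in index:  # silently ignore unseen features
                row[index[name]] = 1 if isinstance(v, str) else v
        rows.append(row)
    return rows


D = [{"foo": 1, "bar": 3}, {"bar": 4, "baz": 2}]
vocab = fit_vocab(D)
print(vocab)                 # ['bar', 'baz', 'foo']
print(transform(D, vocab))   # [[3, 0, 1], [4, 2, 0]]

# String values become binary one-of-K features:
print(fit_vocab([{"version": "1"}, {"version": "2"}]))  # ['version=1', 'version=2']
```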


def test_feature_selection():
    # make two feature dicts with two useful features and a bunch of useless
    # ones, in terms of chi2
    d1 = dict([("useless%d" % i, 10) for i in xrange(20)],
              useful1=1, useful2=20)
    d2 = dict([("useless%d" % i, 10) for i in xrange(20)],
              useful1=20, useful2=1)

    for indices in (True, False):
        v = DictVectorizer().fit([d1, d2])
        X = v.transform([d1, d2])
        sel = SelectKBest(chi2, k=2).fit(X, [0, 1])

        v.restrict(sel.get_support(indices=indices), indices=indices)
        assert_equal(v.get_feature_names(), ["useful1", "useful2"])


def test_one_of_k():
    D_in = [{"version": "1", "ham": 2},
            {"version": "2", "spam": .3},
            {"version=3": True, "spam": -1}]
    v = DictVectorizer()
    X = v.fit_transform(D_in)
    assert_equal(X.shape, (3, 5))

    D_out = v.inverse_transform(X)
    assert_equal(D_out[0], {"version=1": 1, "ham": 2})

    names = v.get_feature_names()
    assert_true("version=2" in names)
    assert_false("version" in names)


def test_unseen_features():
    D = [{"camelot": 0, "spamalot": 1}]
    v = DictVectorizer(sparse=False).fit(D)
    X = v.transform({"push the pram a lot": 2})

    assert_array_equal(X, np.zeros((1, 2)))

@@ -62,8 +62,7 @@ def l1_cross_distances(X):


class GaussianProcess(BaseEstimator, RegressorMixin):
"""
The Gaussian Process model class.
"""The Gaussian Process model class.
Parameters
----------
@@ -171,9 +170,8 @@ class GaussianProcess(BaseEstimator, RegressorMixin):
>>> X = np.array([[1., 3., 5., 6., 7., 8.]]).T
>>> y = (X * np.sin(X)).ravel()
>>> gp = GaussianProcess(theta0=0.1, thetaL=.001, thetaU=1.)
>>> gp.fit(X, y) # doctest: +ELLIPSIS
GaussianProcess(beta0=None, corr=...,
normalize=..., nugget=...,
>>> gp.fit(X, y) # doctest: +ELLIPSIS
GaussianProcess(beta0=None...
...
Notes

@@ -200,10 +200,8 @@ def auc(x, y):
        raise ValueError('At least 2 points are needed to compute'
                         ' area under curve, but x.shape = %s' % x.shape)

    # reorder the data points according to the x axis
    order = np.argsort(x)
    x = x[order]
    y = y[order]
    # reorder the data points according to the x axis, using y to break ties
    x, y = np.array(sorted(points for points in zip(x, y))).T

    h = np.diff(x)
    area = np.sum(h * (y[1:] + y[:-1])) / 2.0
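
The corrected computation above can be sketched in pure Python: sort the (x, y) pairs so that ties on x are broken by y, then apply the trapezoidal rule. The function name here is an illustrative stand-in, not the scikit-learn API.

```python
def trapezoid_auc(x, y):
    """Trapezoidal area under the curve, sorting points by x and
    breaking ties on x by y (tuple comparison in sorted())."""
    pts = sorted(zip(x, y))
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Tied x values are no longer shuffled arbitrarily by argsort:
print(trapezoid_auc([0., 0., 0.5, 1.], [0.5, 1., 1., 1.]))  # 1.0
```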
@@ -134,6 +134,20 @@ def test_auc():
assert_array_almost_equal(auc(x, y), 0.5)


def test_auc_duplicate_values():
    """Test Area Under Curve (AUC) computation with duplicate values

    auc() was previously sorting the x and y arrays according to the indices
    from numpy.argsort(x), which was reordering the tied 0's in this example
    and resulting in an incorrect area computation. This test detects the
    error.
    """
    x = [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.5, 1.]
    y = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9,
         1., 1., 1., 1., 1., 1., 1., 1.]
    assert_array_almost_equal(auc(x, y), 1.)


def test_precision_recall_f1_score_binary():
    """Test Precision Recall and F1 Score for binary classification task"""
    y_true, y_pred, _ = make_prediction(binary=True)
@@ -124,8 +124,8 @@ class GMM(BaseEstimator):
use. Must be one of 'spherical', 'tied', 'diag', 'full'.
Defaults to 'diag'.
rng : numpy.random object, optional
Must support the full numpy random number generator API.
random_state: RandomState or an int seed (0 by default)
A random number generator instance
min_covar : float, optional
Floor on the diagonal of the covariance matrix to prevent
@@ -30,6 +30,7 @@
Example
-------
>>> from sklearn import datasets
>>> from sklearn.semi_supervised import LabelPropagation
>>> label_prop_model = LabelPropagation()
>>> iris = datasets.load_iris()
>>> random_unlabeled_points = np.where(np.random.random_integers(0, 1,
@@ -276,6 +277,7 @@ class LabelPropagation(BaseLabelPropagation):
Examples
--------
>>> from sklearn import datasets
>>> from sklearn.semi_supervised import LabelPropagation
>>> label_prop_model = LabelPropagation()
>>> iris = datasets.load_iris()
>>> random_unlabeled_points = np.where(np.random.random_integers(0, 1,
@@ -340,6 +342,7 @@ class LabelSpreading(BaseLabelPropagation):
Examples
--------
>>> from sklearn import datasets
>>> from sklearn.semi_supervised import LabelSpreading
>>> label_prop_model = LabelSpreading()
>>> iris = datasets.load_iris()
>>> random_unlabeled_points = np.where(np.random.random_integers(0, 1,
@@ -5,6 +5,7 @@
def configuration(parent_package='', top_path=None):
    from numpy.distutils.misc_util import Configuration
    from numpy.distutils.system_info import get_info, BlasNotFoundError
    import numpy

    config = Configuration('sklearn', parent_package, top_path)

@@ -42,6 +43,13 @@ def configuration(parent_package='', top_path=None):
    config.add_subpackage('metrics/cluster')
    config.add_subpackage('metrics/cluster/tests')

    # add cython extension module for hmm
    config.add_extension(
        '_hmmc',
        sources=['_hmmc.c'],
        include_dirs=[numpy.get_include()],
    )

    # some libs need cblas; fortran-compiled BLAS will not be sufficient
    blas_info = get_info('blas_opt', 0)
    if (not blas_info) or (
@@ -75,7 +75,7 @@ class BaseLibSVM(BaseEstimator):

def __init__(self, impl, kernel, degree, gamma, coef0,
tol, C, nu, epsilon, shrinking, probability, cache_size,
scale_C, sparse, class_weight):
scale_C, sparse, class_weight, verbose):

if not impl in LIBSVM_IMPL:
raise ValueError("impl should be one of %s, %s was given" % (
@@ -104,6 +104,7 @@ def __init__(self, impl, kernel, degree, gamma, coef0,
self.scale_C = scale_C
self.sparse = sparse
self.class_weight = class_weight
self.verbose = verbose

def fit(self, X, y, class_weight=None, sample_weight=None):
"""Fit the SVM model according to the given training data.
@@ -187,6 +188,8 @@ def _dense_fit(self, X, y, sample_weight=None):
if epsilon is None:
epsilon = 0.1

libsvm.set_verbosity_wrap(self.verbose)

# we don't pass **self.get_params() to allow subclasses to
# add other parameters to __init__
self.support_, self.support_vectors_, self.n_support_, \
@@ -279,6 +282,8 @@ def _sparse_fit(self, X, y, sample_weight=None):

self.scaled_C_ = C

libsvm_sparse.set_verbosity_wrap(self.verbose)

self.support_vectors_, dual_coef_data, self.intercept_, self.label_, \
self.n_support_, self.probA_, self.probB_ = \
libsvm_sparse.libsvm_sparse_train(
@@ -41,7 +41,7 @@ class LinearSVC(BaseLibLinear, ClassifierMixin, SelectorMixin):
two classes.
`ovr` trains n_classes one-vs-rest classifiers, while `crammer_singer`
optimizes a joint objective over all classes.
While `crammer_singer` is interesting from a theoretical perspective
as it is consistent, it is seldom used in practice, rarely leads to
better accuracy, and is more expensive to compute.
If `crammer_singer` is chosen, the options loss, penalty and dual will
@@ -192,6 +192,11 @@ class frequencies.
of the number of samples. To match libsvm commandline one should use
scale_C=False. WARNING: scale_C will disappear in version 0.12.
verbose : bool, default: False
Enable verbose output. Note that this setting takes advantage of a
per-process runtime setting in libsvm that, if enabled, may not work
properly in a multithreaded context.
Attributes
----------
`support_` : array-like, shape = [n_SV]
@@ -233,7 +238,7 @@ class frequencies.
>>> clf.fit(X, y) #doctest: +NORMALIZE_WHITESPACE
SVC(C=None, cache_size=200, class_weight=None, coef0=0.0, degree=3,
gamma=0.5, kernel='rbf', probability=False, scale_C=True,
shrinking=True, tol=0.001)
shrinking=True, tol=0.001, verbose=False)
>>> print clf.predict([[-0.8, -1]])
[ 1.]
@@ -251,11 +256,12 @@ class frequencies.

def __init__(self, C=None, kernel='rbf', degree=3, gamma=0.0,
coef0=0.0, shrinking=True, probability=False,
tol=1e-3, cache_size=200, scale_C=True, class_weight=None):
tol=1e-3, cache_size=200, scale_C=True, class_weight=None,
verbose=False):

super(SVC, self).__init__('c_svc', kernel, degree, gamma, coef0, tol,
C, 0., 0., shrinking, probability, cache_size, scale_C,
sparse="auto", class_weight=class_weight)
"auto", class_weight, verbose)


class NuSVC(BaseLibSVM, ClassifierMixin):
@@ -310,6 +316,11 @@ class NuSVC(BaseLibSVM, ClassifierMixin):
automatically adjust weights inversely proportional to
class frequencies.
verbose : bool, default: False
Enable verbose output. Note that this setting takes advantage of a
per-process runtime setting in libsvm that, if enabled, may not work
properly in a multithreaded context.
Attributes
----------
@@ -351,7 +362,7 @@ class frequencies.
>>> clf = NuSVC()
>>> clf.fit(X, y)
NuSVC(cache_size=200, coef0=0.0, degree=3, gamma=0.5, kernel='rbf', nu=0.5,
probability=False, shrinking=True, tol=0.001)
probability=False, shrinking=True, tol=0.001, verbose=False)
>>> print clf.predict([[-0.8, -1]])
[ 1.]
@@ -367,11 +378,11 @@ class frequencies.

def __init__(self, nu=0.5, kernel='rbf', degree=3, gamma=0.0,
coef0=0.0, shrinking=True, probability=False,
tol=1e-3, cache_size=200):
tol=1e-3, cache_size=200, verbose=False):

super(NuSVC, self).__init__('nu_svc', kernel, degree, gamma, coef0,
tol, 0., nu, 0., shrinking, probability, cache_size,
scale_C=True, sparse="auto", class_weight=None)
True, "auto", None, verbose)


class SVR(BaseLibSVM, RegressorMixin):
@@ -428,6 +439,11 @@ class SVR(BaseLibSVM, RegressorMixin):
of the number of samples. To match libsvm commandline one should use
scale_C=False. WARNING: scale_C will disappear in version 0.12.
verbose : bool, default: False
Enable verbose output. Note that this setting takes advantage of a
per-process runtime setting in libsvm that, if enabled, may not work
properly in a multithreaded context.
Attributes
----------
`support_` : array-like, shape = [n_SV]
@@ -463,7 +479,8 @@ class SVR(BaseLibSVM, RegressorMixin):
>>> clf = SVR(C=1.0, epsilon=0.2)
>>> clf.fit(X, y)
SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.2, gamma=0.2,
kernel='rbf', probability=False, scale_C=True, shrinking=True, tol=0.001)
kernel='rbf', probability=False, scale_C=True, shrinking=True, tol=0.001,
verbose=False)
See also
--------
@@ -474,11 +491,12 @@ class SVR(BaseLibSVM, RegressorMixin):
"""
def __init__(self, kernel='rbf', degree=3, gamma=0.0, coef0=0.0,
tol=1e-3, C=None, epsilon=0.1, shrinking=True,
probability=False, cache_size=200, scale_C=True):
probability=False, cache_size=200, scale_C=True,
verbose=False):

super(SVR, self).__init__('epsilon_svr', kernel, degree, gamma, coef0,
tol, C, 0., epsilon, shrinking, probability, cache_size,
scale_C, sparse="auto", class_weight=None)
scale_C, "auto", None, verbose)


class NuSVR(BaseLibSVM, RegressorMixin):
@@ -536,6 +554,11 @@ class NuSVR(BaseLibSVM, RegressorMixin):
of the number of samples. To match libsvm commandline one should use
scale_C=False. WARNING: scale_C will disappear in version 0.12.
verbose : bool, default: False
Enable verbose output. Note that this setting takes advantage of a
per-process runtime setting in libsvm that, if enabled, may not work
properly in a multithreaded context.
Attributes
----------
`support_` : array-like, shape = [n_SV]
@@ -571,7 +594,8 @@ class NuSVR(BaseLibSVM, RegressorMixin):
>>> clf = NuSVR(C=1.0, nu=0.1)
>>> clf.fit(X, y)
NuSVR(C=1.0, cache_size=200, coef0=0.0, degree=3, gamma=0.2, kernel='rbf',
nu=0.1, probability=False, scale_C=True, shrinking=True, tol=0.001)
nu=0.1, probability=False, scale_C=True, shrinking=True, tol=0.001,
verbose=False)
See also
--------
@@ -586,11 +610,11 @@ class NuSVR(BaseLibSVM, RegressorMixin):
def __init__(self, nu=0.5, C=None, kernel='rbf', degree=3,
gamma=0.0, coef0=0.0, shrinking=True,
probability=False, tol=1e-3, cache_size=200,
scale_C=True):
scale_C=True, verbose=False):

super(NuSVR, self).__init__('nu_svr', kernel, degree, gamma, coef0,
tol, C, nu, 0., shrinking, probability, cache_size, scale_C,
sparse="auto", class_weight=None)
"auto", None, verbose)


class OneClassSVM(BaseLibSVM):
@@ -638,6 +662,11 @@ class OneClassSVM(BaseLibSVM):
of the number of samples. To match libsvm commandline one should use
scale_C=False. WARNING: scale_C will disappear in version 0.12.
verbose : bool, default: False
Enable verbose output. Note that this setting takes advantage of a
per-process runtime setting in libsvm that, if enabled, may not work
properly in a multithreaded context.
Attributes
----------
`support_` : array-like, shape = [n_SV]
@@ -664,11 +693,11 @@ class OneClassSVM(BaseLibSVM):
"""
def __init__(self, kernel='rbf', degree=3, gamma=0.0, coef0=0.0, tol=1e-3,
nu=0.5, shrinking=True, cache_size=200):
nu=0.5, shrinking=True, cache_size=200, verbose=False):

super(OneClassSVM, self).__init__('one_class', kernel, degree, gamma,
coef0, tol, 0., nu, 0., shrinking, False, cache_size,
scale_C=True, sparse="auto", class_weight=None)
True, "auto", None, verbose)

def fit(self, X, sample_weight=None, **params):
"""
@@ -7,15 +7,15 @@
class SparseBaseLibSVM(BaseLibSVM):
def __init__(self, impl, kernel, degree, gamma, coef0,
tol, C, nu, epsilon, shrinking, probability, cache_size,
scale_C, class_weight):
scale_C, class_weight, verbose):

assert kernel in self._sparse_kernels, \
"kernel should be one of %s, "\
"%s was given." % (self._kernel_types, kernel)

super(SparseBaseLibSVM, self).__init__(impl, kernel, degree, gamma,
coef0, tol, C, nu, epsilon, shrinking, probability, cache_size,
scale_C, sparse=True, class_weight=class_weight)
scale_C, True, class_weight, verbose)

def fit(self, X, y, sample_weight=None):
X = scipy.sparse.csr_matrix(X, dtype=np.float64)
@@ -25,18 +25,19 @@ class SVC(SparseBaseLibSVM, ClassifierMixin):
>>> clf.fit(X, y) #doctest: +NORMALIZE_WHITESPACE
SVC(C=None, cache_size=200, class_weight=None, coef0=0.0, degree=3,
gamma=0.5, kernel='rbf', probability=False, scale_C=True,
shrinking=True, tol=0.001)
shrinking=True, tol=0.001, verbose=False)
>>> print clf.predict([[-0.8, -1]])
[ 1.]
"""

def __init__(self, C=None, kernel='rbf', degree=3, gamma=0.0,
coef0=0.0, shrinking=True, probability=False,
tol=1e-3, cache_size=200, scale_C=True, class_weight=None):
tol=1e-3, cache_size=200, scale_C=True, class_weight=None,
verbose=False):

super(SVC, self).__init__('c_svc', kernel, degree, gamma, coef0, tol,
C, 0., 0., shrinking, probability,
cache_size, scale_C, class_weight)
cache_size, scale_C, class_weight, verbose)


class NuSVC(SparseBaseLibSVM, ClassifierMixin):
@@ -60,18 +61,19 @@ class NuSVC(SparseBaseLibSVM, ClassifierMixin):
>>> clf.fit(X, y) #doctest: +NORMALIZE_WHITESPACE
NuSVC(cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.5,
kernel='rbf', nu=0.5, probability=False, scale_C=True,
shrinking=True, tol=0.001)
shrinking=True, tol=0.001, verbose=False)
>>> print clf.predict([[-0.8, -1]])
[ 1.]
"""

def __init__(self, nu=0.5, kernel='rbf', degree=3, gamma=0.0,
coef0=0.0, shrinking=True, probability=False,
tol=1e-3, cache_size=200, scale_C=True, class_weight=None):
tol=1e-3, cache_size=200, scale_C=True, class_weight=None,
verbose=False):

super(NuSVC, self).__init__('nu_svc', kernel, degree, gamma, coef0,
tol, 0., nu, 0., shrinking, probability,
cache_size, scale_C, class_weight)
cache_size, scale_C, class_weight, verbose)


class SVR(SparseBaseLibSVM, RegressorMixin):
@@ -96,16 +98,18 @@ class SVR(SparseBaseLibSVM, RegressorMixin):
>>> clf = SVR(C=1.0, epsilon=0.2)
>>> clf.fit(X, y)
SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.2, gamma=0.2,
kernel='rbf', probability=False, scale_C=True, shrinking=True, tol=0.001)
kernel='rbf', probability=False, scale_C=True, shrinking=True, tol=0.001,
verbose=False)
"""

def __init__(self, kernel='rbf', degree=3, gamma=0.0, coef0=0.0,
tol=1e-3, C=None, epsilon=0.1, shrinking=True,
probability=False, cache_size=200, scale_C=True):
probability=False, cache_size=200, scale_C=True,
verbose=False):

super(SVR, self).__init__('epsilon_svr', kernel, degree, gamma, coef0,
tol, C, 0., epsilon, shrinking, probability,
cache_size, scale_C, class_weight=None)
cache_size, scale_C, None, verbose)


class NuSVR(SparseBaseLibSVM, RegressorMixin):
@@ -131,16 +135,17 @@ class NuSVR(SparseBaseLibSVM, RegressorMixin):
>>> clf.fit(X, y)
NuSVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma=0.2,
kernel='rbf', nu=0.1, probability=False, scale_C=True, shrinking=True,
tol=0.001)
tol=0.001, verbose=False)
"""

def __init__(self, nu=0.5, C=None, kernel='rbf', degree=3,
gamma=0.0, coef0=0.0, shrinking=True, epsilon=0.1,
probability=False, tol=1e-3, cache_size=200, scale_C=True):
probability=False, tol=1e-3, cache_size=200, scale_C=True,
verbose=False):

super(NuSVR, self).__init__('nu_svr', kernel, degree, gamma, coef0,
tol, C, nu, epsilon, shrinking, probability, cache_size,
scale_C, class_weight=None)
scale_C, None, verbose)


class OneClassSVM(SparseBaseLibSVM):
@@ -157,11 +162,13 @@ class OneClassSVM(SparseBaseLibSVM):

def __init__(self, kernel='rbf', degree=3, gamma=0.0, coef0=0.0,
tol=1e-3, nu=0.5, shrinking=True,
probability=False, cache_size=200, scale_C=True):
probability=False, cache_size=200, scale_C=True,
verbose=False):

super(OneClassSVM, self).__init__('one_class', kernel, degree, gamma,
coef0, tol, 0.0, nu, 0.0, shrinking,
probability, cache_size, scale_C)
probability, cache_size, scale_C,
verbose)

def fit(self, X, sample_weight=None):
super(OneClassSVM, self).fit(
@@ -2122,6 +2122,7 @@ static const char *solver_type_table[]=
"L1R_L2LOSS_SVC", "L1R_LR", "L2R_LR_DUAL", NULL
};

#if 0
int save_model(const char *model_file_name, const struct model *model_)
{
int i;
@@ -2264,6 +2265,7 @@ struct model *load_model(const char *model_file_name)

return model_;
}
#endif

int get_nr_feature(const model *model_)
{