documentation for the LFW dataset loaders

1 parent 4be3a60 commit bef60efabc73fd26235f6b733a1fa4085707479e @ogrisel committed Feb 25, 2011
Showing with 179 additions and 11 deletions.
  1. +2 −1 Makefile
  2. +1 −0 doc/contents.rst
  3. +30 −10 doc/modules/classes.rst
  4. +134 −0 doc/modules/datasets.rst
  5. +12 −0 doc/modules/datasets_fixture.py
Makefile
@@ -30,7 +30,8 @@ inplace:
test: in
$(NOSETESTS) scikits/learn
test-doc:
- $(NOSETESTS) --with-doctest --doctest-tests --doctest-extension=rst doc/ doc/modules/
+ $(NOSETESTS) --with-doctest --doctest-tests --doctest-extension=rst \
+ --doctest-fixtures=_fixture doc/modules/
test-coverage:
$(NOSETESTS) --with-coverage
doc/contents.rst
@@ -11,4 +11,5 @@
supervised_learning.rst
unsupervised_learning.rst
model_selection.rst
+ Dataset loading utilities <modules/datasets.rst>
Class Reference <modules/classes.rst>
doc/modules/classes.rst
@@ -23,7 +23,7 @@ Support Vector Machines
svm.OneClassSVM
For sparse data
------------------
+---------------
.. autosummary::
:toctree: generated/
@@ -76,7 +76,7 @@ For sparse data
linear_model.sparse.ElasticNet
linear_model.sparse.SGDClassifier
linear_model.sparse.SGDRegressor
-
+
Bayesian Regression
===================
@@ -86,7 +86,8 @@ Bayesian Regression
:template: class.rst
linear_model.BayesianRidge
- linear_model.ARDRegression
+ linear_model.ARDRegression
+
Naive Bayes
===========
@@ -115,6 +116,7 @@ Nearest Neighbors
ball_tree.knn_brute
+
Gaussian Mixture Models
=======================
@@ -153,7 +155,6 @@ Clustering
Covariance Estimators
=====================
-
.. autosummary::
:toctree: generated/
:template: class.rst
@@ -169,9 +170,8 @@ Covariance Estimators
covariance.ledoit_wolf
-
Signal Decomposition
-=======================
+====================
.. autosummary::
:toctree: generated/
@@ -189,12 +189,12 @@ Signal Decomposition
fastica.fastica
Cross Validation
-===================
+================
.. autosummary::
:toctree: generated/
:template: class.rst
-
+
cross_val.LeaveOneOut
cross_val.LeavePOut
cross_val.KFold
@@ -256,8 +256,8 @@ For sparse data
:template: class.rst
feature_extraction.text.sparse.TfidfTransformer
- feature_extraction.text.sparse.CountVectorizer
- feature_extraction.text.sparse.Vectorizer
+ feature_extraction.text.sparse.CountVectorizer
+ feature_extraction.text.sparse.Vectorizer
Pipeline
@@ -268,3 +268,23 @@ Pipeline
:template: class.rst
pipeline.Pipeline
+
+
+Dataset generation and loading
+==============================
+
+.. currentmodule:: scikits.learn.datasets
+
+.. autosummary::
+ :toctree: generated/
+ :template: class.rst
+
+ get_data_home
+ load_diabetes
+ load_digits
+ load_files
+ load_iris
+ load_lfw_pairs
+ load_lfw_people
+ load_mlcomp
+
doc/modules/datasets.rst
@@ -0,0 +1,134 @@
+=========================
+Dataset loading utilities
+=========================
+
+.. currentmodule:: scikits.learn.datasets
+
+The ``scikits.learn.datasets`` package embeds some small toy datasets
+as introduced in the "Getting Started" section.
+
+To evaluate the impact of the scale of the dataset (``n_features`` and
+``n_samples``) while controlling the statistical properties of the data
+(typically the correlation and informativeness of the features), it is
+also possible to generate synthetic data.
+
+This package also features helpers to fetch larger datasets commonly
+used by the machine learning community to benchmark algorithms on data
+that comes from the 'real world'.
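+
+For instance, loading one of the embedded toy datasets is a single function
+call (a minimal sketch; most of the other ``load_*`` helpers follow a
+similar pattern)::
+
+ from scikits.learn.datasets import load_iris
+
+ iris = load_iris()
+ # the samples are stored in ``iris.data`` and the labels in ``iris.target``
+ n_samples, n_features = iris.data.shape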
+
+
+Dataset generators
+==================
+
+TODO
+
+
+
+The Labeled Faces in the Wild face recognition dataset
+======================================================
+
+This dataset is a collection of JPEG pictures of famous people collected
+over the internet. All details are available on the official website:
+
+ http://vis-www.cs.umass.edu/lfw/
+
+Each picture is centered on a single face. The typical task is called
+Face Verification: given a pair of two pictures, a binary classifier
+must predict whether the two images are from the same person.
+
+An alternative task, Face Recognition or Face Identification, is:
+given the picture of the face of an unknown person, identify the name
+of the person by referring to a gallery of previously seen pictures of
+identified persons.
+
+Both Face Verification and Face Recognition are tasks that are typically
+performed on the output of a model trained to perform Face Detection. The
+most popular model for Face Detection is called Viola-Jones and is
+implemented in the OpenCV library. The LFW faces were extracted by this
+face detector from various online websites.
+
+``scikit-learn`` provides two loaders that will automatically download,
+cache, parse the metadata files, decode the JPEG files and convert the
+interesting slices into memmapped numpy arrays. The size of this dataset
+is more than 200 MB. The first load typically takes more than a couple of
+minutes to fully decode the relevant part of the JPEG files into numpy
+arrays. Once the dataset has been loaded, subsequent loads take less than
+200ms by using a memmapped version memoized on the disk in the
+``~/scikit_learn_data/lfw_home/`` folder using ``joblib``.
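+
+The location of this cache folder can be introspected programmatically,
+e.g. to locate or remove the cached files (a minimal sketch using the
+``get_data_home`` helper)::
+
+ from scikits.learn.datasets import get_data_home
+
+ # folder used to cache the downloaded datasets,
+ # ~/scikit_learn_data by default
+ print get_data_home()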
+
+The first loader is used for the Face Identification task: a multi-class
+classification task (hence supervised learning)::
+
+ >>> from scikits.learn.datasets import load_lfw_people
+ >>> lfw_people = load_lfw_people(min_faces_per_person=100)
@GaelVaroquaux (Owner), Mar 6, 2011:

Running doctests on this module forces downloading the data. Are we really sure that we want to do that? (maybe, maybe not, I don't know).

@ogrisel (Owner), Mar 6, 2011:

No, it should not be the case if you don't have a ~/scikit_learn_data folder, thanks to the fixture (see the setup_module function and the Makefile).

@GaelVaroquaux (Owner), Mar 6, 2011:

OK, so maybe we need a finer fixture (I did see the fixture and wondered what it was for, if it was not for preventing the download), as I was still getting lengthy downloads.

@ogrisel (Owner), Mar 6, 2011:

OK, the fixture should be updated to test explicitly for ~/scikit_learn_data/lfw_home and ~/scikit_learn_data/20news_home instead. That should avoid triggering the downloads from the doctests.

+
+ >>> for name in lfw_people.class_names:
+ ... print name
+ ...
+ Colin Powell
+ Donald Rumsfeld
+ George W Bush
+ Gerhard Schroeder
+ Tony Blair
+
+The default slice is a rectangular shape around the face, removing
+most of the background::
+
+ >>> lfw_people.data.dtype
+ dtype('float32')
+
+ >>> lfw_people.data.shape
+ (1140, 62, 47)
+
+Each of the ``1140`` faces is assigned to a single person id in the ``target``
+array::
+
+ >>> lfw_people.target.shape
+ (1140,)
+
+ >>> lfw_people.target[:10]
+ memmap([2, 3, 1, 4, 1, 0, 2, 0, 2, 1])
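+
+Once flattened into 2D feature vectors, these arrays can be fed to most
+``scikit-learn`` estimators. A minimal, illustrative sketch (using a linear
+SVM; any other classifier could be substituted and no tuning is attempted)::
+
+ from scikits.learn.svm import SVC
+
+ n_samples = lfw_people.data.shape[0]
+ # flatten each 62 x 47 face image into a single feature vector
+ X = lfw_people.data.reshape((n_samples, -1))
+ y = lfw_people.target
+
+ # fit a simple linear SVM on the flattened faces
+ clf = SVC(kernel='linear').fit(X, y)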
+
+The second loader is typically used for the face verification task: each
+sample is a pair of two pictures that either belong or not to the same
+person::
+
+ >>> from scikits.learn.datasets import load_lfw_pairs
+ >>> lfw_pairs_train = load_lfw_pairs(subset='train')
+
+ >>> list(lfw_pairs_train.class_names)
+ ['Different persons', 'Same person']
+
+ >>> lfw_pairs_train.data.shape
+ (2200, 2, 62, 47)
+
+ >>> lfw_pairs_train.target.shape
+ (2200,)
+
+For both the ``load_lfw_people`` and ``load_lfw_pairs`` functions it is
+possible to get an additional dimension with the RGB color channels by
+passing ``color=True``::
+
+ >>> lfw_pairs_train = load_lfw_pairs(subset='train', color=True)
+ >>> lfw_pairs_train.data.shape
+ (2200, 2, 62, 47, 3)
+
+The ``load_lfw_pairs`` dataset is subdivided into 3 subsets: the development
+``train`` set, the development ``test`` set and an evaluation ``10_folds``
+set meant for computing performance metrics using a 10-folds cross
+validation scheme.
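+
+Each subset is selected with the ``subset`` parameter shown above, for
+instance (a minimal sketch)::
+
+ # evaluation set for the official 10-folds cross validation protocol
+ lfw_pairs_10_folds = load_lfw_pairs(subset='10_folds')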
+
+
+Examples
+--------
+
+:ref:`example_applications_plot_face_recognition.py`
+
+
+The 20 newsgroups text dataset
+==============================
+
+TODO
+
+
+
+
doc/modules/datasets_fixture.py
@@ -0,0 +1,12 @@
+"""Fixture module to skip the datasets loading when offline
+
+Doctests are skipped if the datasets have not already been downloaded
+and cached in the past.
+"""
+from os.path import exists
+from nose import SkipTest
+from scikits.learn.datasets import get_data_home
+
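+# nose calls setup_module before running the doctests of the matching
+# documentation file (wired up via --doctest-fixtures=_fixture in the
+# Makefile), so the doctests are skipped rather than triggering a download
+# when the data is not cached locally.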
+def setup_module(module):
+ if not exists(get_data_home()):
+ raise SkipTest("Skipping dataset loading doctests")
