# Basic Workflow

This file illustrates a basic scikit-learn (sklearn) workflow, starting with how to import data.

---

## Import Data

This example imports two of the standard datasets bundled with sklearn (iris and digits). (In practice, you would typically use pandas to import your own data.)

In [2]:
from sklearn import datasets
iris = datasets.load_iris()
digits = datasets.load_digits()
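For data that does not ship with sklearn, the pandas route mentioned above might look like the following minimal sketch. The CSV content and the column names are hypothetical and stand in for a real file; only `pandas.read_csv` and `DataFrame.to_numpy` are assumed.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real file on disk.
csv_text = """sepal_length,sepal_width,species
5.1,3.5,setosa
7.0,3.2,versicolor
6.3,3.3,virginica
"""

df = pd.read_csv(io.StringIO(csv_text))

# Split the table into a feature matrix X and a target vector y.
X = df[["sepal_length", "sepal_width"]].to_numpy()
y = df["species"].to_numpy()

print(X.shape)
print(y)
```

With a real file, `io.StringIO(csv_text)` would be replaced by the file path.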

A dataset is a dictionary-like object that holds all the data and some metadata about the data. The data is stored in the .data member, which is an array of shape (n_samples, n_features). For the digits dataset, the data consists of 1797 images of size 8 $\times$ 8, each flattened into one row, so the .data member has 1797 rows and 64 features. You can see this in the result of `print(digits.data.shape)`. In the case of supervised problems, one or more response variables are stored in the .target member. (For digits, the .target member has 1797 elements, one per image.)

For instance, in the case of the digits dataset, digits.data gives access to the features that can be used to classify the digits samples:

In [14]:
print(digits.data)
print(digits.data.shape)

[[ 0.  0.  5. ...  0.  0.  0.]
 [ 0.  0.  0. ... 10.  0.  0.]
 [ 0.  0.  0. ... 16.  9.  0.]
 ...
 [ 0.  0.  1. ...  6.  0.  0.]
 [ 0.  0.  2. ... 12.  0.  0.]
 [ 0.  0. 10. ... 12.  1.  0.]]
(1797, 64)


and digits.target gives the ground truth for the digits dataset, that is, the number corresponding to each digit image that we are trying to learn:

In [15]:
digits.target

array([0, 1, 2, ..., 8, 9, 8])
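The relationship between the 8 $\times$ 8 images and the 64-feature rows can be checked directly. The sketch below assumes the `digits.images` member of the digits dataset, which holds the unflattened images, and compares it against `digits.data`.

```python
import numpy as np
from sklearn import datasets

digits = datasets.load_digits()

# Each sample is available both as an 8x8 image and as a flat 64-element row.
print(digits.images.shape)
print(digits.data.shape)

# Flatten every image into one row and compare with the .data member.
flattened = digits.images.reshape(len(digits.images), -1)
same = np.array_equal(flattened, digits.data)
print(same)
```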

## Learning and predicting

The goal here is to fit an estimator on the current data set so that it can predict the classes to which unseen samples belong.

In scikit-learn, an estimator for classification is a Python object that implements the methods fit(X, y) and predict(T).

An example of an estimator is the class sklearn.svm.SVC, which implements support vector classification. The estimator’s constructor takes as arguments the model’s parameters.

Here is a demo for fitting an estimator:

In [17]:
from sklearn import svm
clf = svm.SVC(gamma=0.001, C=100.)

Here, gamma and C are hyperparameters of the SVC estimator. To optimize the model, we usually need to tune these hyperparameters based on validation results.
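One possible way to do this tuning is a cross-validated grid search. The following is a minimal sketch using `GridSearchCV`; the candidate values in `param_grid` and the choice of 3 folds are assumptions made for illustration, not recommendations.

```python
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV

digits = datasets.load_digits()

# Hypothetical candidate values, chosen only for illustration.
param_grid = {"gamma": [0.0001, 0.001], "C": [1.0, 100.0]}

# Fit one SVC per parameter combination and per fold, then keep the best.
search = GridSearchCV(svm.SVC(), param_grid, cv=3)
search.fit(digits.data, digits.target)

print(search.best_params_)
print(search.best_score_)
```

After fitting, `search.best_estimator_` holds an SVC refit on the full data with the selected parameters.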

Then we need to pass the data set to the estimator to fit the model. In this example, we use all the images from our dataset except for the last one, which we’ll reserve for prediction. We select the training set with the [:-1] Python syntax, which produces a new array that contains all but the last item from digits.data:

In [19]:
clf.fit(digits.data[:-1], digits.target[:-1])

Now you can predict new values. In this case, you’ll predict using the last image from digits.data. By predicting, you’ll determine which digit class, as learned from the training set, best matches the last image.

In [27]:
clf.predict(digits.data[-1:])

array([8])
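Reserving a single image shows the mechanics but says little about how well the model generalizes. A minimal sketch of holding out a larger test set follows; the 25% split and `random_state=0` are arbitrary choices made for this example.

```python
from sklearn import datasets, svm
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()

# Hold out a quarter of the samples; the split ratio is an arbitrary choice.
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0
)

clf = svm.SVC(gamma=0.001, C=100.0)
clf.fit(X_train, y_train)

# Compare predictions on unseen samples against the ground truth.
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(accuracy)
```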


## Conventions

### Type Casting

Scikit-learn estimators follow certain rules to make their behavior more predictable. One example is type casting: where possible, input of type float32 will maintain its data type; otherwise, input will be cast to float64:

In [28]:
import numpy as np
from sklearn import kernel_approximation

rng = np.random.RandomState(0)
X = rng.rand(10, 2000)
X = np.array(X, dtype='float32')
X.dtype

transformer = kernel_approximation.RBFSampler()
X_new = transformer.fit_transform(X)
X_new.dtype

dtype('float32')

In this example, X is float32, and is unchanged by fit_transform(X).

Using float32-typed training (or testing) data is often more efficient than using the usual float64 dtype: it reduces memory usage and sometimes also reduces processing time by leveraging the vector instructions of the CPU. However, it can sometimes lead to numerical stability problems, making the algorithm more sensitive to the scale of the values and requiring adequate preprocessing.
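The memory saving can be seen with plain NumPy. This small sketch reuses the array shape from the example above and compares the number of bytes each dtype needs.

```python
import numpy as np

rng = np.random.RandomState(0)
X64 = rng.rand(10, 2000)       # NumPy produces float64 by default
X32 = X64.astype(np.float32)   # same values at half the precision

print(X64.nbytes)  # 10 * 2000 * 8 bytes
print(X32.nbytes)  # 10 * 2000 * 4 bytes
```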

Some transformers will always cast their input to float64 and return float64 transformed values as a result.

Regression targets are cast to float64 and classification targets are maintained:

In [30]:
from sklearn import datasets
from sklearn.svm import SVC
iris = datasets.load_iris()
clf = SVC()

In [31]:
clf.fit(iris.data, iris.target)
list(clf.predict(iris.data[:3]))

[0, 0, 0]

In [32]:
clf.fit(iris.data, iris.target_names[iris.target])
list(clf.predict(iris.data[:3]))

['setosa', 'setosa', 'setosa']

Here, the first predict() returns an integer array, since iris.target (an integer array) was used in fit. The second predict() returns a string array, since iris.target_names[iris.target] (a string array) was used for fitting.
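The regression half of this rule can be illustrated in the same way. The sketch below uses `LinearRegression` with a made-up integer-valued target (both are choices for this example only) and inspects the dtype of the predictions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data: the target is integer-valued but treated as a regression target.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 2, 4, 6])

reg = LinearRegression().fit(X, y)
pred = reg.predict(X)

# The target was integer, but the predictions come back as floats.
print(y.dtype, pred.dtype)
```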

### Refitting and Updating Parameters

We can use set_params to change the hyperparameters of an estimator after it has been constructed. Calling fit() again then refits the model, overwriting what was learned by the previous fit().

In [33]:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import SVC
X, y = load_iris(return_X_y=True)

clf = SVC()
clf.set_params(kernel='linear').fit(X, y)
clf.predict(X[:5])

clf.set_params(kernel='rbf').fit(X, y)
clf.predict(X[:5])

array([0, 0, 0, 0, 0])
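To confirm which hyperparameters are currently set, the companion method get_params can be used. A minimal sketch, with `C=10.0` chosen arbitrarily for illustration:

```python
from sklearn.svm import SVC

clf = SVC()
print(clf.get_params()["kernel"])  # kernel before any change

# set_params accepts several hyperparameters at once and returns the estimator.
clf.set_params(kernel="linear", C=10.0)
params = clf.get_params()
print(params["kernel"], params["C"])
```

Note that set_params alone does not retrain anything; the new values take effect at the next call to fit().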

## Multiclass vs. Multilabel Fitting


### Difference between Multiclass and Multilabel fitting

In multi-class problems the classes are mutually exclusive, whereas for multi-label problems each label represents a different classification task, but the tasks are somehow related (so there is a benefit in tackling them together rather than separately). For example, in the well-known Leptograpsus crabs dataset there are examples of males and females of two colour forms of crab. You could approach this as a multi-class problem with four classes (male-blue, female-blue, male-orange, female-orange) or as a multi-label problem, where one label would be male/female and the other blue/orange. Essentially, in multi-label problems a pattern can belong to more than one class.
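The two framings of the crabs example correspond to two target layouts. The sketch below uses four made-up specimens (not the real dataset) and sklearn's `type_of_target` helper to show how each layout is interpreted.

```python
import numpy as np
from sklearn.utils.multiclass import type_of_target

# Hypothetical specimens, not taken from the real crabs dataset.
sex = np.array([0, 1, 0, 1])     # 0 = male, 1 = female
colour = np.array([0, 0, 1, 1])  # 0 = blue, 1 = orange

# Multiclass: one label per sample, four mutually exclusive classes.
y_multiclass = 2 * colour + sex

# Multilabel: one binary column per task (sex, colour).
y_multilabel = np.column_stack([sex, colour])

print(type_of_target(y_multiclass))
print(type_of_target(y_multilabel))
```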

### Multiclass classification

When using multiclass classifiers, the learning and prediction task that is performed is dependent on the format of the target data fit upon:

In [1]:
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import LabelBinarizer

X = [[1, 2], [2, 4], [4, 5], [3, 2], [3, 1]]
y = [0, 0, 1, 1, 2]

classif = OneVsRestClassifier(estimator=SVC(random_state=0))
classif.fit(X, y).predict(X)

array([0, 0, 1, 1, 2])

In the above case, the classifier is fit on a 1d array of multiclass labels and the predict() method therefore provides corresponding multiclass predictions.

The one-vs-rest (OvR, also called one-vs-all) strategy consists of fitting one classifier per class. For each classifier, the class is fitted against all the other classes. In addition to its computational efficiency (only n_classes classifiers are needed), one advantage of this approach is its interpretability. Since each class is represented by one and only one classifier, it is possible to gain knowledge about the class by inspecting its corresponding classifier. This is the most commonly used strategy for multiclass classification and is a fair default choice.
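The per-class classifiers can be inspected after fitting. This sketch refits the small example from above and looks at the `classes_` and `estimators_` attributes of `OneVsRestClassifier`.

```python
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X = [[1, 2], [2, 4], [4, 5], [3, 2], [3, 1]]
y = [0, 0, 1, 1, 2]

classif = OneVsRestClassifier(estimator=SVC(random_state=0)).fit(X, y)

# One underlying binary SVC is fitted for each class label.
print(classif.classes_)
print(len(classif.estimators_))
```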

### Multilabel Problem

The one-vs-rest (OvR) classifier can also be used for multilabel problems.

In [2]:
y = LabelBinarizer().fit_transform(y)
classif.fit(X, y).predict(X)

array([[1, 0, 0],
       [1, 0, 0],
       [0, 1, 0],
       [0, 0, 0],
       [0, 0, 0]])

Here, the classifier is fit() on a 2d binary label representation of y, using the LabelBinarizer. In this case predict() returns a 2d array representing the corresponding multilabel predictions.

Note that the fourth and fifth instances returned all zeroes, indicating that they matched none of the three labels fit upon. With multilabel outputs, it is similarly possible for an instance to be assigned multiple labels:

In [3]:
from sklearn.preprocessing import MultiLabelBinarizer
y = [[0, 1], [0, 2], [1, 3], [0, 2, 3], [2, 4]]
y = MultiLabelBinarizer().fit_transform(y)
classif.fit(X, y).predict(X)

array([[1, 1, 0, 0, 0],
       [1, 0, 1, 0, 0],
       [0, 1, 0, 1, 0],
       [1, 0, 1, 0, 0],
       [1, 0, 1, 0, 0]])

In this case, the classifier is fit upon instances each assigned multiple labels. The MultiLabelBinarizer is used to binarize the 2d array of multilabels to fit upon. As a result, predict() returns a 2d array with multiple predicted labels for each instance.
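A binarized 2d array can be mapped back to label sets with the binarizer's inverse_transform. The sketch below keeps a reference to the fitted `MultiLabelBinarizer` (named `mlb` here, a name introduced for this example) and round-trips the original labels.

```python
from sklearn.preprocessing import MultiLabelBinarizer

y = [[0, 1], [0, 2], [1, 3], [0, 2, 3], [2, 4]]

# Keep the fitted binarizer so the encoding can be reversed later.
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(y)

print(mlb.classes_)  # the labels, one per column of Y
recovered = mlb.inverse_transform(Y)
print(recovered)     # one tuple of labels per instance
```

The same call can be applied to the output of predict() to read predictions as label sets rather than indicator rows.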