Goal:

Become familiar with machine learning vocabulary

Learning problem:
1. Start from a set of n samples of data
2. Try to predict properties of unknown data

Categories of learning problems:
1. Supervised: the data comes with additional attributes that we want to predict (classification, regression).
2. Unsupervised: the training data consists of a set of input vectors x without any corresponding target values. The goal may be to discover groups of similar examples (clustering), to determine the distribution of data within the input space (density estimation), or to project the data from a high-dimensional space down to 2 or 3 dimensions for the purpose of visualization.
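
As an illustration of projecting high-dimensional data down to 2 dimensions, here is a minimal sketch. It assumes PCA as the projection method (the notes do not name one) and uses the digits dataset that is loaded in the next section; the variable names are illustrative only.

```python
from sklearn import datasets
from sklearn.decomposition import PCA

digits = datasets.load_digits()

# Assumption: PCA is used as the projection; other methods would also work.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(digits.data)

print(digits.data.shape)  # one row per sample, 64 features each
print(X_2d.shape)         # same number of rows, 2 columns
```

The two resulting columns could then be passed to a scatter plot, colored by digits.target.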

# Loading the dataset

In [1]:
from sklearn import datasets
iris=datasets.load_iris()
digits=datasets.load_digits()

In [2]:
digits.data

array([[  0.,   0.,   5., ...,   0.,   0.,   0.],
       [  0.,   0.,   0., ...,  10.,   0.,   0.],
       [  0.,   0.,   0., ...,  16.,   9.,   0.],
       ..., 
       [  0.,   0.,   1., ...,   6.,   0.,   0.],
       [  0.,   0.,   2., ...,  12.,   0.,   0.],
       [  0.,   0.,  10., ...,  12.,   1.,   0.]])

In [3]:
digits.target

array([0, 1, 2, ..., 8, 9, 8])

In [4]:
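# Shape of the data arrays
# The data is always a 2D array of shape (n_samples, n_features), although the original data may have had a different shape.
# For digits, each original sample is an 8x8 image: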
digits.images[0]

array([[  0.,   0.,   5.,  13.,   9.,   1.,   0.,   0.],
       [  0.,   0.,  13.,  15.,  10.,  15.,   5.,   0.],
       [  0.,   3.,  15.,   2.,   0.,  11.,   8.,   0.],
       [  0.,   4.,  12.,   0.,   0.,   8.,   8.,   0.],
       [  0.,   5.,   8.,   0.,   0.,   9.,   8.,   0.],
       [  0.,   4.,  11.,   0.,   1.,  12.,   7.,   0.],
       [  0.,   2.,  14.,   5.,  10.,  12.,   0.,   0.],
       [  0.,   0.,   6.,  13.,  10.,   0.,   0.,   0.]])
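
To make the relation between the 8x8 images and the 2D data array concrete, the following sketch flattens each image into a row vector and compares the result with digits.data. The name flattened is illustrative.

```python
import numpy as np
from sklearn import datasets

digits = datasets.load_digits()

# Flatten each 8x8 image into a single row of 64 values.
flattened = digits.images.reshape((digits.images.shape[0], -1))

print(digits.images.shape)  # (n_samples, 8, 8)
print(digits.data.shape)    # (n_samples, 64)
print(np.array_equal(flattened, digits.data))
```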

# Learning and predicting 

In the digits dataset, the task is to predict, given an image, which digit it represents. We can fit an estimator on labeled samples so that it can predict the classes of unseen samples.

One example of an estimator is the class sklearn.svm.SVC, which implements support vector classification. The constructor of an estimator takes the parameters of the model as arguments.

In [7]:
from sklearn import svm
clf=svm.SVC(gamma=0.001,C=100.)

# Gamma is an example of a parameter. It is possible to automatically find good values
# for the parameters by using tools such as grid search and cross validation
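
A minimal sketch of such a search, assuming GridSearchCV with 3-fold cross validation. The candidate values in param_grid and the 500-sample subset are hypothetical choices made to keep the example quick, not recommended settings.

```python
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV

digits = datasets.load_digits()
X, y = digits.data[:500], digits.target[:500]  # subset keeps the sketch fast

# Hypothetical candidate values for the two SVC parameters.
param_grid = {"gamma": [0.0001, 0.001, 0.01], "C": [1.0, 10.0, 100.0]}

# Every combination is scored with 3-fold cross validation.
search = GridSearchCV(svm.SVC(), param_grid, cv=3)
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```

After fitting, search.best_estimator_ holds an SVC refit on the full subset with the best combination found.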


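clf is the estimator instance, here a classifier. It must be fitted to the data, that is, it must learn from the data. This is done by passing the training set to the fit method. Here the training set is every sample except the last one, which is held out for prediction.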

In [8]:
clf.fit(digits.data[:-1],digits.target[:-1])


SVC(C=100.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma=0.001, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [12]:
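# Now predict the digit for the last sample
clf.predict(digits.data[-1:])
# digits.data[-1:] selects only the last sample (kept 2D), which the [:-1] slice left out of training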

array([8])

In [15]:
clf.predict(digits.data[:-1])

array([0, 1, 2, ..., 0, 8, 9])
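
Predicting on digits.data[:-1] reuses the samples the classifier was trained on, so it says little about how well the model generalizes. A rough sketch of a held-out evaluation follows, assuming train_test_split and accuracy_score; the 25% test fraction and random_state are arbitrary choices for illustration.

```python
from sklearn import datasets, svm
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()

# Hold out a quarter of the samples; the fraction is an arbitrary choice.
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0
)

clf = svm.SVC(gamma=0.001, C=100.0)
clf.fit(X_train, y_train)

# Score only on samples the classifier has never seen.
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(accuracy)
```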

# Model persistence

It is possible to save a model in scikit-learn by using Python's built-in persistence module, pickle:

In [16]:
from sklearn import svm
from sklearn import datasets
clf=svm.SVC()
iris=datasets.load_iris()
X,y=iris.data,iris.target
clf.fit(X,y)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [17]:
import pickle
# serialize the previously trained classifier to a byte string, then restore it
s=pickle.dumps(clf)
clf2=pickle.loads(s)
clf2.predict(X[0:1])

array([0])

In [18]:
y[0]

0

We can also use joblib's replacement for pickle (joblib.dump and joblib.load), which is more efficient on big data, but can only pickle to disk and not to a string:

In [19]:
import joblib  # formerly sklearn.externals.joblib, which was removed in scikit-learn 0.23
joblib.dump(clf,"filename.pkl")

['filename.pkl']

In [20]:
# later the model can be loaded back with the following
clf=joblib.load("filename.pkl")

The joblib.dump and joblib.load functions also accept file-like objects instead of filenames. More information on data persistence with Joblib is available in the Joblib documentation.
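
A small sketch of the file-like variant, assuming an in-memory io.BytesIO buffer stands in for an open file. The names buffer and restored are illustrative.

```python
import io

import joblib
from sklearn import datasets, svm

iris = datasets.load_iris()
clf = svm.SVC().fit(iris.data, iris.target)

# Dump into an in-memory binary buffer instead of a named file.
buffer = io.BytesIO()
joblib.dump(clf, buffer)

# Rewind before reading the model back.
buffer.seek(0)
restored = joblib.load(buffer)

print(restored.predict(iris.data[:1]))
```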

# Conventions

scikit-learn estimators follow certain rules to make their behavior more predictable.

## Type casting

In [22]:
# Unless otherwise specified, input will be cast to float64 (newer scikit-learn releases may keep float32 where possible):
import numpy as np
from sklearn import random_projection
rng=np.random.RandomState(0)
X=rng.rand(10,2000)
X=np.array(X,dtype="float32")
X.dtype

dtype('float32')

In [24]:
transformer=random_projection.GaussianRandomProjection()
X_new=transformer.fit_transform(X)
X_new.dtype

dtype('float64')

In [25]:
# X is float32 which is cast to float64
# Regression targets are cast to float64, classification targets are maintained:
from sklearn import datasets
from sklearn.svm import SVC
iris=datasets.load_iris()
clf=SVC() # instantiate a support vector classifier
clf.fit(iris.data,iris.target)


SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [26]:
list(clf.predict(iris.data[:3]))

[0, 0, 0]

In [27]:
clf.fit(iris.data,iris.target_names[iris.target]) # training using target names

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [28]:
list(clf.predict(iris.data[:3]))

['setosa', 'setosa', 'setosa']
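
The cells above show classification targets being maintained. For the other half of the statement, that regression targets are cast to float64, here is a minimal sketch. It assumes LinearRegression as the regressor and a tiny made-up dataset, neither of which appears in the notes.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data: integer targets on purpose.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 2, 4, 6])

reg = LinearRegression().fit(X, y)
pred = reg.predict(X)

print(y.dtype)     # an integer dtype
print(pred.dtype)  # predictions come back as floats
```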

## Refitting and updating parameters 

Hyper-parameters of an estimator can be updated after it has been constructed via its set_params method. Calling fit() more than once will overwrite what was learned by any previous fit():

In [29]:
import numpy as np
from sklearn.svm import SVC
rng=np.random.RandomState(0)
X=rng.rand(100,10)
y=rng.binomial(1,0.5,100)
X_test=rng.rand(5,10)

In [30]:
clf=SVC()
clf.set_params(kernel="linear").fit(X,y)
# set_params changes the kernel hyper-parameter before fitting

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [31]:
clf.predict(X_test)


array([1, 0, 1, 1, 0])

In [32]:
clf.set_params(kernel="rbf").fit(X,y) # kernel selects the kernel function used by the SVM; refitting discards the linear fit

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [33]:
clf.predict(X_test)

array([0, 0, 0, 1, 0])

Here, the default kernel rbf is first changed to linear after the estimator has been constructed via SVC(), and changed back to rbf to refit the estimator and to make a second prediction.
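
The sequence of kernel changes can be checked with get_params, the counterpart of set_params. This sketch reuses the same random data as above; the three variable names holding the kernel values are illustrative.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.rand(100, 10)
y = rng.binomial(1, 0.5, 100)

clf = SVC()
default_kernel = clf.get_params()["kernel"]

clf.set_params(kernel="linear").fit(X, y)
kernel_after_first = clf.get_params()["kernel"]

clf.set_params(kernel="rbf").fit(X, y)
kernel_after_second = clf.get_params()["kernel"]

print(default_kernel, kernel_after_first, kernel_after_second)
# → rbf linear rbf
```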

## Multiclass vs. multilabel fitting
When using multiclass classifiers, the learning and prediction task that is performed depends on the format of the target data fit upon:

In [34]:
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import LabelBinarizer

X = [[1, 2], [2, 4], [4, 5], [3, 2], [3, 1]]
y=[0,0,1,1,2]

In [36]:
classif=OneVsRestClassifier(estimator=SVC(random_state=0))
classif.fit(X,y).predict(X)

array([0, 0, 1, 1, 2])

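In the case above, the classifier is fit on a 1d array of multiclass labels, so predict() returns the corresponding multiclass predictions. It is also possible to fit on a 2d binary label representation of y, produced with the LabelBinarizer. In that case predict() returns a 2d array representing the corresponding multilabel predictions.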
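
A minimal sketch of fitting on the 2d binary representation, reusing X and y from the cell above. The names y_binary and predictions are illustrative, and the predicted values are left unprinted here because they may vary with the scikit-learn version.

```python
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import LabelBinarizer
from sklearn.svm import SVC

X = [[1, 2], [2, 4], [4, 5], [3, 2], [3, 1]]
y = [0, 0, 1, 1, 2]

# One column per class; each row has a single 1 marking its class.
y_binary = LabelBinarizer().fit_transform(y)
print(y_binary)

classif = OneVsRestClassifier(estimator=SVC(random_state=0))
predictions = classif.fit(X, y_binary).predict(X)

# One row per sample, one column per label.
print(predictions.shape)
```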

When y is fit in this binary form, the fourth and fifth instances return all zeroes, indicating that they matched none of the three labels fit upon. With multilabel outputs, it is similarly possible for an instance to be assigned multiple labels:

In [37]:
from sklearn.preprocessing import MultiLabelBinarizer
y = [[0, 1], [0, 2], [1, 3], [0, 2, 3], [2, 4]]
y = MultiLabelBinarizer().fit_transform(y)
classif.fit(X,y).predict(X)

array([[1, 1, 0, 0, 0],
       [1, 0, 1, 0, 0],
       [0, 1, 0, 1, 0],
       [1, 0, 1, 0, 0],
       [1, 0, 1, 0, 0]])
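
To read an indicator matrix like the one above back as label sets, MultiLabelBinarizer offers inverse_transform. The sketch below applies it to the binarized targets rather than to the predictions, so the round trip can be checked exactly; the variable names are illustrative.

```python
from sklearn.preprocessing import MultiLabelBinarizer

y = [[0, 1], [0, 2], [1, 3], [0, 2, 3], [2, 4]]

mlb = MultiLabelBinarizer()
y_indicator = mlb.fit_transform(y)

# Column j of the indicator matrix corresponds to mlb.classes_[j].
print(mlb.classes_)
print(y_indicator)

# Map each indicator row back to a tuple of labels.
label_sets = mlb.inverse_transform(y_indicator)
print(label_sets)
```

Keeping a reference to the fitted binarizer (mlb here) is what makes the reverse mapping possible; calling MultiLabelBinarizer().fit_transform(y) inline discards it.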