# Classify
Supervised learning example: linear classification with [scikit-learn] and [pandas].

[scikit-learn]: https://scikit-learn.org/stable/supervised_learning.html
[pandas]: https://pandas.pydata.org/

In [1]:
from classify import Classifier
from tools import datasplit, irisdata, zscores

## get example data

In [2]:
# load Fisher's iris dataset
data = irisdata()

# normalize numerical columns
data = zscores(data)

# partition into training and testing data
trainrows, testrows = datasplit(data, 100)
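
The helpers `zscores` and `datasplit` come from the local `tools` module, whose source is not shown here. As a rough sketch of what they might do, assuming plain pandas operations (the function names, the `seed` parameter, and the toy data below are illustrative, not the module's actual code):

```python
import pandas as pd


def zscores_sketch(data: pd.DataFrame) -> pd.DataFrame:
    """Standardize each numeric column; leave other columns untouched."""
    out = data.copy()
    numeric = out.select_dtypes('number').columns
    # pandas .std() defaults to the sample standard deviation (ddof=1)
    out[numeric] = (out[numeric] - out[numeric].mean()) / out[numeric].std()
    return out


def datasplit_sketch(data: pd.DataFrame, ntrain: int, seed: int = 0):
    """Randomly pick ntrain rows for training; the rest become test rows."""
    train = data.sample(n=ntrain, random_state=seed).sort_index()
    test = data.drop(index=train.index)
    return train, test


# toy data, made up for illustration
demo = pd.DataFrame({
    'length': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    'width': [2.0, 4.0, 4.0, 6.0, 8.0, 6.0],
    'label': list('aabbcc'),
})
scaled = zscores_sketch(demo)
train, test = datasplit_sketch(scaled, 4)
print(len(train), len(test))
```

The second argument of `datasplit(data, 100)` is read here as the number of training rows, which is consistent with the 50 test rows counted in the confusion matrices later in this document.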

In [3]:
trainrows.tail()

| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 141 | 1.276066 | 0.097889 | 0.760211 | 1.443994 | virginica |
| 142 | -0.052331 | -0.819823 | 0.760211 | 0.919223 | virginica |
| 143 | 1.155302 | 0.327318 | 1.213393 | 1.443994 | virginica |
| 144 | 1.034539 | 0.556746 | 1.100097 | 1.706379 | virginica |
| 149 | 0.068433 | -0.131539 | 0.760211 | 0.788031 | virginica |


In [4]:
testrows.tail()

| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 133 | 0.551486 | -0.590395 | 0.760211 | 0.394453 | virginica |
| 145 | 1.034539 | -0.131539 | 0.816859 | 1.443994 | virginica |
| 146 | 0.551486 | -1.27868 | 0.703564 | 0.919223 | virginica |
| 147 | 0.793012 | -0.131539 | 0.816859 | 1.050416 | virginica |
| 148 | 0.430722 | 0.786174 | 0.930154 | 1.443994 | virginica |


## train a Classifier object

In [5]:
# input training data and name of column to predict
classy = Classifier(trainrows, 'species')
classy

Classifier(LogisticRegression)

In [6]:
# model coefficients
classy.coefs

| | sepal_length | sepal_width | petal_length | petal_width |
|---|---|---|---|---|
| setosa | -0.946284 | 0.984589 | -1.679294 | -1.596498 |
| versicolor | 0.321063 | -0.350691 | -0.261631 | -0.514258 |
| virginica | 0.625221 | -0.633897 | 1.940925 | 2.110756 |
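
One plausible way to build a table like this, assuming `classy.coefs` wraps the fitted estimator's `coef_` array (the toy data and column names below are made up for illustration):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# toy training data with three well-separated classes
train = pd.DataFrame({
    'x1': [-2.0, -1.5, -1.8, 0.0, 0.2, -0.1, 1.9, 2.1, 1.7],
    'x2': [1.0, 1.2, 0.8, 0.0, -0.1, 0.1, -1.0, -0.9, -1.1],
    'kind': ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c'],
})
features = ['x1', 'x2']

model = LogisticRegression().fit(train[features], train['kind'])

# coef_ has one row per class and one column per feature
coefs = pd.DataFrame(model.coef_, index=model.classes_, columns=features)
print(coefs.round(3))
```

Each row holds the weights for one class. Because the features were z-scored, the magnitudes within a row can be compared directly.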


In [7]:
# all possible classes
classy.classes

['setosa', 'versicolor', 'virginica']

In [8]:
# columns used to train the model
classy.features

['sepal_length', 'sepal_width', 'petal_length', 'petal_width']

In [9]:
# access the scikit-learn model directly
classy.model

In [10]:
# parameters used to train the model
classy.params

{'C': 1.0,
 'class_weight': None,
 'dual': False,
 'fit_intercept': True,
 'intercept_scaling': 1,
 'l1_ratio': None,
 'max_iter': 100,
 'multi_class': 'auto',
 'n_jobs': None,
 'penalty': 'l2',
 'random_state': None,
 'solver': 'lbfgs',
 'tol': 0.0001,
 'verbose': 0,
 'warm_start': False}

In [11]:
# column the model will attempt to predict
classy.target

'species'

## predict classes

In [12]:
# call with testing data
cats = classy(testrows)
cats.tail()

133    versicolor
145     virginica
146     virginica
147     virginica
148     virginica
Name: predicted, dtype: category
Categories (3, object): ['setosa', 'versicolor', 'virginica']
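
The output keeps all three categories even when some of them never appear among the displayed predictions. A minimal pandas sketch of that behaviour, assuming the result is built with `pd.Categorical` (the labels and index values are copied from the output above; the construction itself is a guess):

```python
import pandas as pd

classes = ['setosa', 'versicolor', 'virginica']
labels = ['versicolor', 'virginica', 'virginica', 'virginica', 'virginica']

# fixing the categories keeps every known class, seen or not
predicted = pd.Series(
    pd.Categorical(labels, categories=classes),
    index=[133, 145, 146, 147, 148],
    name='predicted',
)

# counts per class, including classes with zero predictions
counts = predicted.value_counts(sort=False)
print(counts)
```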

## predict class probabilities
*Caution:* Not all models support this. `RidgeClassifier`, for example, does not provide probability estimates.

In [13]:
probs = classy.probs(testrows)
probs.round(2).tail()

| | setosa | versicolor | virginica |
|---|---|---|---|
| 133 | 0.0 | 0.51 | 0.48 |
| 145 | 0.0 | 0.06 | 0.94 |
| 146 | 0.0 | 0.2 | 0.8 |
| 147 | 0.0 | 0.15 | 0.85 |
| 148 | 0.0 | 0.07 | 0.93 |
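
In scikit-learn, class probabilities come from an estimator's `predict_proba` method, which some linear models lack. A small check, assuming that this is what limits `classy.probs` (the helper name `can_predict_probs` and the toy data are hypothetical):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression, RidgeClassifier


def can_predict_probs(model) -> bool:
    """True if the estimator offers class probability estimates."""
    return hasattr(model, 'predict_proba')


print(can_predict_probs(LogisticRegression()))
print(can_predict_probs(RidgeClassifier()))

# toy data, made up for illustration
toy = pd.DataFrame({
    'x1': [-2.0, -1.5, -1.8, 0.0, 0.2, -0.1, 1.9, 2.1, 1.7],
    'x2': [1.0, 1.2, 0.8, 0.0, -0.1, 0.1, -1.0, -0.9, -1.1],
    'kind': ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c'],
})
features = ['x1', 'x2']
logit = LogisticRegression().fit(toy[features], toy['kind'])

# one column per class; each row sums to 1
probs = pd.DataFrame(
    logit.predict_proba(toy[features]),
    index=toy.index,
    columns=logit.classes_,
)
print(probs.round(2))
```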


## test with different models and parameters
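Show a [confusion matrix] to compare predicted classes against the actual classes in the test data. Each row is an actual species and each column is a predicted species, so correct predictions fall on the diagonal.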

[confusion matrix]: https://en.wikipedia.org/wiki/Confusion_matrix

In [14]:
Classifier(trainrows, 'species').confusion(testrows)

| species (rows) / predicted (columns) | setosa | versicolor | virginica |
|---|---|---|---|
| setosa | 13 | 0 | 0 |
| versicolor | 0 | 22 | 1 |
| virginica | 0 | 2 | 12 |
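
A table in this shape can be produced with `pandas.crosstab`. Whether `confusion` does so internally is an assumption, and the short label lists below are invented for illustration:

```python
import pandas as pd

# actual labels (rows) and model predictions (columns) for five observations
actual = pd.Series(
    ['setosa', 'versicolor', 'versicolor', 'virginica', 'virginica'],
    name='species',
)
predicted = pd.Series(
    ['setosa', 'versicolor', 'virginica', 'virginica', 'versicolor'],
    name='predicted',
)

# cross-tabulate: each cell counts observations with that (actual, predicted) pair
matrix = pd.crosstab(actual, predicted)
print(matrix)
```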


In [15]:
Classifier(trainrows, 'species', solver='liblinear').confusion(testrows)

| species (rows) / predicted (columns) | setosa | versicolor | virginica |
|---|---|---|---|
| setosa | 13 | 0 | 0 |
| versicolor | 0 | 14 | 9 |
| virginica | 0 | 1 | 13 |


In [16]:
Classifier(trainrows, 'species', model='RidgeClassifier').confusion(testrows)

| species (rows) / predicted (columns) | setosa | versicolor | virginica |
|---|---|---|---|
| setosa | 13 | 0 | 0 |
| versicolor | 0 | 10 | 13 |
| virginica | 0 | 1 | 13 |


In [17]:
params = {
    'model': 'SGDClassifier',
    'loss': 'log_loss',
    'penalty': 'elasticnet',
    'l1_ratio': 0.5,
}
Classifier(trainrows, 'species', **params).confusion(testrows)

| species (rows) / predicted (columns) | setosa | versicolor | virginica |
|---|---|---|---|
| setosa | 13 | 0 | 0 |
| versicolor | 0 | 23 | 0 |
| virginica | 0 | 2 | 12 |
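
To compare the four runs with a single number, overall accuracy is the sum of the diagonal divided by the total count. The counts below are copied from the four confusion matrices above; the dictionary keys are just descriptive labels:

```python
# rows: actual setosa, versicolor, virginica; columns: predicted, same order
matrices = {
    'LogisticRegression (default)': [[13, 0, 0], [0, 22, 1], [0, 2, 12]],
    'LogisticRegression (liblinear)': [[13, 0, 0], [0, 14, 9], [0, 1, 13]],
    'RidgeClassifier': [[13, 0, 0], [0, 10, 13], [0, 1, 13]],
    'SGDClassifier (elasticnet)': [[13, 0, 0], [0, 23, 0], [0, 2, 12]],
}


def accuracy(matrix):
    """Fraction of observations on the diagonal (correct predictions)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


scores = {name: accuracy(matrix) for name, matrix in matrices.items()}
for name, score in scores.items():
    print(f'{name}: {score:.0%}')
```

On this split the accuracies come to 94%, 80%, 72%, and 96% respectively. The `SGDClassifier` figure may vary between runs, since no `random_state` is set in its parameters.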


## help

In [18]:
help(Classifier)

Help on class Classifier in module classify:

class Classifier(builtins.object)
 |  Classifier(data, target, model='LogisticRegression', **kwargs)
 |  
 |  Scikit-learn classifier with pandas inputs and outputs.
 |  Input training data to create and train a model.
 |  Call with new feature data to predict classes.
 |  Output is a Series with datatype 'category'.
 |  
 |  Constructor inputs:
 |      data    DataFrame: observations to use for training
 |      target  string: name of column to predict
 |      model   optional str: name of an sklearn.linear_model
 |      kwargs  are passed to the selected sklearn.linear_model
 |  
 |  Call inputs:
 |      data    DataFrame: features to use for prediction
 |  
 |  Methods defined here:
 |  
 |  __call__(self, data)
 |      Call self as a function.
 |  
 |  __init__(self, data, target, model='LogisticRegression', **kwargs)
 |      Initialize self.  See help(type(self)) for accurate signature.
 |  
 |  __repr__(self)
 |      Return repr(self)
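
Based only on the help text above, the core of such a wrapper might look like the following. This is a hedged sketch, not the actual `classify` module: the class name `ClassifierSketch`, the `getattr` lookup, and the toy data are all assumptions.

```python
import pandas as pd
from sklearn import linear_model


class ClassifierSketch:
    """Hypothetical stand-in for the Classifier class documented above."""

    def __init__(self, data, target, model='LogisticRegression', **kwargs):
        self.target = target
        self.features = [col for col in data.columns if col != target]
        # look up the estimator class by name in sklearn.linear_model
        estimator = getattr(linear_model, model)
        # remaining keyword arguments go straight to the estimator
        self.model = estimator(**kwargs).fit(data[self.features], data[target])
        self.classes = list(self.model.classes_)

    def __call__(self, data):
        # predict from the training feature columns only
        labels = self.model.predict(data[self.features])
        cats = pd.Categorical(labels, categories=self.classes)
        return pd.Series(cats, index=data.index, name='predicted')

    def __repr__(self):
        return f'{type(self).__name__}({type(self.model).__name__})'


# toy data, made up for illustration
toy = pd.DataFrame({
    'x1': [-2.0, -1.5, -1.8, 0.0, 0.2, -0.1, 1.9, 2.1, 1.7],
    'x2': [1.0, 1.2, 0.8, 0.0, -0.1, 0.1, -1.0, -0.9, -1.1],
    'kind': ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c'],
})
default = ClassifierSketch(toy, 'kind')
ridge = ClassifierSketch(toy, 'kind', model='RidgeClassifier', alpha=0.5)
cats = default(toy)
print(default, ridge)
print(cats.tail())
```

Selecting the estimator by name keeps the constructor signature small while still exposing every parameter of the underlying scikit-learn model through `**kwargs`.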