OneR is a simple algorithm that predicts the class of a sample from the value of a
single feature: for each value of that feature, it predicts the class that occurs most
often among the training samples with that value. OneR is shorthand for One Rule,
indicating that we use only a single rule for classification, choosing the feature
with the best performance. While some of the later algorithms are significantly more
complex, this simple algorithm has been shown to perform well on a
number of real-world datasets.

The algorithm starts by iterating over every value of every feature. For each value,
count the number of samples from each class that have that feature value. Record
the most frequent class for the feature value, and the error of that prediction,
that is, the number of samples with that value that belong to any other class.

OneR shows how far even basic statistics can go in many applications. The algorithm is:



For each variable
    For each value of the variable
        Predict the most frequent class among samples with this value
        Compute the error of this prediction
    Sum the prediction errors for all values of the variable
Use the variable with the lowest error
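
The steps above can be sketched in a few lines of plain Python. This is only an illustrative sketch on a hypothetical toy dataset (the names `one_r`, `rows` and `labels` are made up for this example and are not used elsewhere in the notebook); it assumes the features are already categorical.

```python
from collections import Counter

def one_r(rows, labels):
    """Return (best_feature, rule, errors) for categorical rows; a toy sketch."""
    n_features = len(rows[0])
    best = None
    for feature in range(n_features):
        # Count the classes seen for each value of this feature
        counts = {}
        for row, label in zip(rows, labels):
            counts.setdefault(row[feature], Counter())[label] += 1
        # Rule: each value predicts its most frequent class
        rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        # Error: samples that do not belong to the predicted class, summed over values
        errors = sum(sum(c.values()) - c.most_common(1)[0][1] for c in counts.values())
        # Keep the feature with the lowest total error (first one wins on ties)
        if best is None or errors < best[2]:
            best = (feature, rule, errors)
    return best

# Hypothetical toy data: two binary features, two classes
rows = [(0, 0), (0, 1), (1, 0), (1, 1), (1, 1), (0, 0)]
labels = [0, 0, 1, 1, 1, 0]
feature, rule, errors = one_r(rows, labels)
print(feature, rule, errors)  # 0 {0: 0, 1: 1} 0
```

Here feature 0 separates the two classes perfectly (0 errors), while feature 1 misclassifies two samples, so the rule is built on feature 0.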

In [3]:
import numpy as np
from sklearn.datasets import load_iris
dataset = load_iris()
X = dataset.data
y = dataset.target
print(dataset.DESCR)
n_samples, n_features = X.shape

Iris Plants Database

Notes
-----
Data Set Characteristics:
    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
    :Summary Statistics:

                    Min  Max   Mean    SD   Class Correlation
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :Date: July, 1988

This is a copy of UCI ML iris d

While the features in this dataset are continuous, the algorithm we will use in this
example requires categorical features. Turning a continuous feature into a categorical
feature is a process called discretization.

A simple discretization algorithm is to choose a threshold: any value below the
threshold is given the value 0, and any value at or above it is given the value 1.
For our threshold, we will use the mean (average) value of that feature. To
start with, we compute the mean for each feature:

In [6]:
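# Mean of each feature (column), used as that feature's threshold
attribute_means = X.mean(axis = 0)
#https://stackoverflow.com/questions/5142418/what-is-the-use-of-assert-in-python
assert attribute_means.shape == (n_features,)
# 1 where the value is at or above the feature mean, 0 otherwise
X_d = np.array(X >= attribute_means,dtype='int')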
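
To see what this thresholding does, here is a small sketch on a hypothetical 3x2 array (`X_toy` is invented for illustration and is not part of the Iris data). Each column is compared against its own mean through NumPy broadcasting.

```python
import numpy as np

# Hypothetical array standing in for the feature matrix
X_toy = np.array([[1.0, 10.0],
                  [2.0, 20.0],
                  [6.0, 30.0]])
means = X_toy.mean(axis=0)               # per-column means: [3.0, 20.0]
X_toy_d = (X_toy >= means).astype(int)   # 1 where value >= column mean
print(X_toy_d)
# [[0 0]
#  [0 1]
#  [1 1]]
```

Note that a value exactly equal to the mean (20.0 in the second column) is mapped to 1, because the comparison is `>=`.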

In [8]:
from sklearn.model_selection import train_test_split
random_state = 14
X_train,X_test,y_train,y_test = train_test_split(X_d,y,random_state = random_state)
print("There are {} training samples".format(y_train.shape[0]))
print("There are {} test samples".format(y_test.shape[0]))

There are 112 training samples
There are 38 test samples


In [28]:
#https://stackoverflow.com/questions/5900578/how-does-collections-defaultdict-work
from collections import defaultdict
#https://docs.python.org/3/howto/sorting.html
from operator import itemgetter

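def train(X,y_true,feature):
    """Build the OneR rule for one feature.

    Returns a dict mapping each feature value to its predicted class,
    and the total number of training samples the rule misclassifies.
    """
    n_samples, n_features = X.shape
    assert 0 <= feature < n_features
    # Distinct values this feature takes in the training data
    values = set(X[:,feature])
    predictors = dict()
    errors = []
    for current_value in values:
        # train_feature_value is defined in a later cell (In [37])
        most_frequent_class, error = train_feature_value(X,y_true,feature,current_value)
        predictors[current_value] = most_frequent_class
        errors.append(error)
    total_error = sum(errors)
    return predictors, total_error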

In [29]:
a = [1,2,3,4,5]
b = [2,2,9,0,9]

In [30]:
#merge two and pick the largest
def pick_the_largest(a,b):
    result = []
    list_length = len(a)
    for i in range(list_length):
        result.append(max(a[i],b[i]))
    return result

In [31]:
pick_the_largest(a,b)

[2, 2, 9, 4, 9]

In [32]:
#zip : This function takes two equal-length collections, and merges them together in pairs. 
list(zip(a,b))

[(1, 2), (2, 2), (3, 9), (4, 0), (5, 9)]

In [33]:
# lambda is just a shorthand to create an anonymous function.
# lambda <input> : expression

lambda pair: max(pair)

<function __main__.<lambda>>

In [34]:
#map takes a function, and applies it to each item in an iterable (such as a list). 

#map(some_function, some_iterable)

list(map(lambda pair : max(pair), zip(a,b)))


[2, 2, 9, 4, 9]

In [35]:
list(map(max,a,b))

[2, 2, 9, 4, 9]

In [36]:
#http://www.python-course.eu/lambda.php

In [37]:
def train_feature_value(X,y_true,feature,value):
    # Count how often each class occurs among samples with this feature value
    class_counts = defaultdict(int)
    #https://bradmontgomery.net/blog/pythons-zip-map-and-lambda/
    for sample, y in zip(X,y_true):
        if sample[feature] == value:
            class_counts[y] += 1
    # Sort classes by count, most frequent first
    sorted_class_counts = sorted(class_counts.items(), key = itemgetter(1), reverse = True)
    most_frequent_class = sorted_class_counts[0][0]
    # The error is the number of samples with this value that belong to any other class
    error = sum([class_count for class_value,class_count in class_counts.items()
                if class_value != most_frequent_class])
    return most_frequent_class, error
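
The same count can also be written without an explicit loop. The following is an alternative sketch, not the version used in the rest of this notebook: `train_feature_value_np` is a hypothetical name, and it assumes the class labels are non-negative integers in a NumPy array and that at least one sample has the given value. On ties it picks the lowest class label, which may differ from the loop-based version.

```python
import numpy as np

def train_feature_value_np(X, y_true, feature, value):
    # Boolean mask of samples whose feature equals the given value
    mask = X[:, feature] == value
    # counts[c] is the number of selected samples with class c
    counts = np.bincount(y_true[mask])
    most_frequent_class = int(counts.argmax())
    # Everything outside the most frequent class counts as an error
    error = int(counts.sum() - counts.max())
    return most_frequent_class, error

# Hypothetical toy data: 4 samples, 2 binary features, classes 0..2
X_toy = np.array([[0, 1], [0, 0], [1, 1], [0, 1]])
y_toy = np.array([2, 2, 1, 0])
print(train_feature_value_np(X_toy, y_toy, 0, 0))  # (2, 1)
```

Three samples have feature 0 equal to 0, with classes 2, 2 and 0, so the prediction is class 2 with one error.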

In [41]:
all_predictors = {variable: train(X_train,y_train,variable) for variable in range(X_train.shape[1])}
print(all_predictors)
errors = {variable: error for variable, (mapping, error) in all_predictors.items() }
print(errors)
best_variable, best_error = sorted(errors.items(), key = itemgetter(1))[0]

print("The best model is based on variable {0} and has error {1: .2f}".format(best_variable,best_error))
model = {'variable' : best_variable,
        'predictor': all_predictors[best_variable][0]}
print(model)

{0: ({0: 0, 1: 2}, 41), 1: ({0: 1, 1: 0}, 58), 2: ({0: 0, 1: 2}, 37), 3: ({0: 0, 1: 2}, 37)}
{0: 41, 1: 58, 2: 37, 3: 37}
The best model is based on variable 2 and has error  37.00
{'variable': 2, 'predictor': {0: 0, 1: 2}}


In [46]:
def predict(X_test,model):
    variable = model['variable']
    predictor = model['predictor']
    y_predicted = np.array([predictor[int(sample[variable])] for sample in X_test])
    return y_predicted

In [47]:
y_predicted = predict(X_test, model)
print(y_predicted)

[0 0 0 2 2 2 0 2 0 2 2 0 2 2 0 2 0 2 2 2 0 0 0 2 0 2 0 2 2 0 0 0 2 0 2 0 2
 2]


In [48]:
accuracy = np.mean(y_predicted == y_test) * 100
print("The test accuracy is {: .1f}%".format(accuracy))

The test accuracy is  65.8%
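
Part of the reason for this accuracy is structural: each feature was discretized into only two values, so a single rule can predict at most two of the three classes. A small sketch makes this visible, using the model dictionary copied from the printed output above and assuming the class labels 0, 1 and 2 of the Iris target.

```python
# Model copied from the printed output above
model = {'variable': 2, 'predictor': {0: 0, 1: 2}}
all_classes = {0, 1, 2}  # class labels of the Iris target

# Classes that appear on the right-hand side of the rule
reachable = set(model['predictor'].values())
unreachable = all_classes - reachable
print("Classes the rule can predict:", sorted(reachable))          # [0, 2]
print("Classes the rule can never predict:", sorted(unreachable))  # [1]
```

Every test sample of class 1 is therefore misclassified, whatever its feature values are.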


In [49]:
from sklearn.metrics import classification_report
print(classification_report(y_test,y_predicted))

             precision    recall  f1-score   support

          0       0.94      1.00      0.97        17
          1       0.00      0.00      0.00        13
          2       0.40      1.00      0.57         8

avg / total       0.51      0.66      0.55        38



