In [1]:
import numpy as np
from sklearn.datasets import load_iris

dataset = load_iris()
X = dataset.data
y = dataset.target

print(dataset.DESCR)
n_samples, n_features = X.shape

Iris Plants Database

Notes
-----
Data Set Characteristics:
    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
    :Summary Statistics:

                    Min  Max   Mean    SD   Class Correlation
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20  0.76     0.9565  (high!)

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :Date: July, 1988

This is a copy of UCI ML iris d

### Discrete categorical features

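Continuous data can be discretized with a threshold: values below the threshold are labeled `0`, and values at or above it are labeled `1`. Here the threshold for each feature is that feature's mean.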

In [2]:
attribute_means = X.mean(axis=0)

# The line above computes the mean of each of the 4 features.
# Next, use these means to convert the continuous features into discrete ones.
X_d = np.array(X >= attribute_means, dtype='int')

print(X_d.shape)

(150, 4)
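
To make the thresholding step concrete, here is a minimal sketch on a hypothetical 3×2 array. The values in `X_toy` are made up for illustration and are not taken from the Iris data; the names `X_toy`, `toy_means`, and `X_toy_d` are likewise assumptions.

```python
import numpy as np

# Hypothetical toy data: 3 samples, 2 features (not from the Iris dataset).
X_toy = np.array([[1.0, 10.0],
                  [2.0, 20.0],
                  [6.0, 30.0]])

# Per-feature means, computed down the rows (axis=0): [3.0, 20.0].
toy_means = X_toy.mean(axis=0)

# Broadcasting compares every column against its own mean;
# values at or above the mean become 1, the rest become 0.
X_toy_d = np.array(X_toy >= toy_means, dtype='int')

print(X_toy_d)
# [[0 0]
#  [0 1]
#  [1 1]]
```

Note that a value exactly equal to the mean (such as `20.0` in the second column) is labeled `1`, because the comparison is `>=`.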


## The OneR algorithm

OneR (One Rule) is a simple algorithm that predicts the class of a sample by finding the most frequent class for each value of a feature.

In other words, it classifies using a single rule based on just one feature, choosing the feature that gives the best performance.

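The algorithm iterates over every value of each feature and counts how often that value occurs in each class. The most frequent class becomes the prediction for that value, and samples with that value that belong to any other class are counted as errors. Finally, the feature with the lowest total error is selected.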
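
The counting described above can be sketched on a tiny hand-made example. This is an illustrative sketch only: the data and the helper `rule_for_value` are hypothetical, and it uses `collections.Counter` rather than the `defaultdict` approach in the cells that follow.

```python
from collections import Counter

# Hypothetical toy data: one binary feature value and one class label per sample.
feature_values = [0, 0, 0, 1, 1, 1, 1]
labels = [0, 0, 1, 1, 1, 2, 1]


def rule_for_value(feature_values, labels, value):
    """Return (most frequent class, error count) for samples whose feature equals `value`."""
    counts = Counter(label for v, label in zip(feature_values, labels) if v == value)
    most_frequent_class, hits = counts.most_common(1)[0]
    # Every sample with this value that is not in the winning class is an error.
    error = sum(counts.values()) - hits
    return most_frequent_class, error


# One rule per feature value, plus the total error for this feature.
rules = {v: rule_for_value(feature_values, labels, v) for v in set(feature_values)}
total_error = sum(error for _, error in rules.values())

print(rules)        # {0: (0, 1), 1: (1, 1)}
print(total_error)  # 2
```

For value `0` the classes seen are `0, 0, 1`, so the rule predicts class `0` with one error; for value `1` the classes are `1, 1, 2, 1`, so the rule predicts class `1` with one error. Repeating this for every feature and keeping the one with the smallest total error gives the single rule OneR uses.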

In [3]:
from collections import defaultdict
from operator import itemgetter

In [4]:
def train_feature_value(X, y_true, feature_index, value):
    class_counts = defaultdict(int)
    for sample, y in zip(X, y_true):
        if sample[feature_index] == value:
            class_counts[y] += 1
    # Sort class_counts to find the most frequent class for this feature value
    sorted_class_counts = sorted(class_counts.items(),
                                 key=itemgetter(1), reverse=True)
    most_frequent_class = sorted_class_counts[0][0]
    
    # Finally, count the errors: samples with this value that are not in the most frequent class
    incorrect_predictions = [class_count for class_value, class_count
                             in class_counts.items() 
                             if class_value != most_frequent_class]
    error = sum(incorrect_predictions)
    
    return most_frequent_class, error

In [None]:
values = set(X[:, fea])