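# The Problem with Classification Accuracy
For highly skewed data, classification accuracy alone is not a sufficient metric: a model that always predicts the majority class can score a high accuracy while never detecting the class we care about.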
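To see why accuracy can mislead on skewed data, here is a minimal sketch on a made-up label vector. The 10-in-10,000 ratio and the always-predict-0 "model" are assumptions chosen for illustration; they are not part of the notebook's data.

```python
import numpy as np

# Hypothetical skewed data set: 10 positives among 10,000 samples.
y_true = np.zeros(10_000, dtype=int)
y_true[:10] = 1

# A "classifier" that ignores its input and always predicts 0.
y_pred = np.zeros_like(y_true)

accuracy = np.mean(y_true == y_pred)                          # fraction of correct predictions
positives_found = int(np.sum((y_true == 1) & (y_pred == 1)))  # true positives

print(accuracy)         # 0.999
print(positives_found)  # 0
```

The accuracy is 0.999 even though not a single positive sample is found, which is why the metrics below look at the positive class specifically.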

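## Confusion Matrix
> Confusion matrix for a binary classification problem

Rows are the true labels and columns are the predicted labels. For true label 0 the entries are TN (true negatives) and FP (false positives); for true label 1 they are FN (false negatives) and TP (true positives).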

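### Precision
Among the samples predicted as 1 (the class we actually care about), the proportion that truly are 1.

Example: if the model predicts that 100 people have cancer, precision is the fraction of those 100 who really have cancer.

$precision=\frac{TP}{TP+FP}$

- Stock prediction puts more weight on precision, since acting on a wrong positive prediction is costly.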

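### Recall
Among the samples that truly are 1 (the events we care about actually happened), the proportion that the model predicted correctly.

A recall of 80% means that out of every 100 cancer patients, the system successfully identifies 80.

- Medical diagnosis puts more weight on recall: a false alarm is acceptable, a missed patient is not.

$recall=\frac{TP}{TP+FN}$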
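The two formulas can give very different answers for the same predictions. A minimal sketch on ten made-up labels (both arrays are hypothetical and only serve to illustrate the definitions):

```python
import numpy as np

# Hypothetical labels for 10 samples: 1 = the class we care about.
y_true = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])
# A cautious model that only flags the two cases it is most sure about.
y_pred = np.array([1, 1, 0, 0, 0, 0, 0, 0, 0, 0])

tp = np.sum((y_true == 1) & (y_pred == 1))  # 2 positives correctly flagged
fp = np.sum((y_true == 0) & (y_pred == 1))  # 0 negatives wrongly flagged
fn = np.sum((y_true == 1) & (y_pred == 0))  # 3 positives missed

precision = tp / (tp + fp)  # 2 / 2 = 1.0
recall = tp / (tp + fn)     # 2 / 5 = 0.4
print(precision, recall)
```

Every flagged sample is correct (precision 1.0), yet most positives are missed (recall 0.4).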

In [1]:
import numpy as np
from sklearn import datasets
digits = datasets.load_digits()
X = digits.data
y = digits.target.copy()  # without copy(), y is just an alias for digits.target, so modifying y would also modify digits

y[digits.target==9] = 1
y[digits.target!=9] = 0

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=666)

In [2]:
from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)
log_reg.score(X_test, y_test)



0.9755555555555555

In [3]:
y_log_predict = log_reg.predict(X_test)

In [4]:
def TN(y_true, y_predict):
    assert len(y_true) == len(y_predict)
    
    return np.sum((y_true==0) & (y_predict==0))  # element-wise AND: True only where both conditions are True
    # return np.sum(y_predict[y_true==0] == 0)

def FP(y_true, y_predict):
    assert len(y_true) == len(y_predict)
    
    return np.sum((y_true==0) & (y_predict==1))

def FN(y_true, y_predict):
    assert len(y_true) == len(y_predict)
    
    return np.sum((y_true==1) & (y_predict==0))

def TP(y_true, y_predict):
    assert len(y_true) == len(y_predict)
    
    return np.sum((y_true==1) & (y_predict==1))

print(TN(y_test, y_log_predict))
print(FP(y_test, y_log_predict))
print(FN(y_test, y_log_predict))
print(TP(y_test, y_log_predict))

403
2
9
36


In [5]:
def confusion_matrix(y_true, y_predict):
    return np.array([
        [TN(y_true, y_predict), FP(y_true, y_predict)],
        [FN(y_true, y_predict), TP(y_true, y_predict)]
    ])

confusion_matrix(y_test, y_log_predict)

array([[403,   2],
       [  9,  36]])
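As a sanity check, the accuracy reported by `log_reg.score` can be recovered from these four counts, and the same counts show what a trivial model would score. A minimal sketch with the counts hard-coded from the output above; the always-predict-0 baseline is an illustration, not something the notebook trains.

```python
import numpy as np

# Counts copied from the confusion matrix above: [[TN, FP], [FN, TP]].
cm = np.array([[403, 2],
               [9, 36]])
tn, fp, fn, tp = cm.ravel()

total = cm.sum()               # 450 test samples
accuracy = (tn + tp) / total   # correct predictions / all predictions
baseline = (tn + fp) / total   # accuracy of always predicting 0
print(accuracy, baseline)
```

The first value reproduces the 0.9756 score from earlier, while always predicting 0 would already reach 0.9 on this test set.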

In [6]:
def precision_score(y_true, y_predict):
    tp = TP(y_true, y_predict)
    fp = FP(y_true, y_predict)
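    # tp and fp are NumPy integers, so 0/0 yields nan with a warning instead of
    # raising an exception; check the denominator explicitly.
    if tp + fp == 0:
        return 0.0
    return tp / (tp + fp)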
    
precision_score(y_test, y_log_predict)

0.9473684210526315

In [7]:
def recall_score(y_true, y_predict):
    tp = TP(y_true, y_predict)
    fn = FN(y_true, y_predict)
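    # tp and fn are NumPy integers, so 0/0 yields nan with a warning instead of
    # raising an exception; check the denominator explicitly.
    if tp + fn == 0:
        return 0.0
    return tp / (tp + fn)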
    
recall_score(y_test, y_log_predict)

0.8

## sklearn

In [8]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
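The scikit-learn versions take the same `(y_true, y_pred)` arguments as the hand-written functions. A small sketch, using synthetic label arrays rebuilt from the counts reported earlier (an assumption made so the block runs without the trained model):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Synthetic labels rebuilt from the earlier counts: TN=403, FP=2, FN=9, TP=36.
y_true = np.array([0] * 403 + [0] * 2 + [1] * 9 + [1] * 36)
y_pred = np.array([0] * 403 + [1] * 2 + [0] * 9 + [1] * 36)

cm = confusion_matrix(y_true, y_pred)        # rows: true label, columns: predicted label
precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)

print(cm)
print(precision, recall)
```

In the notebook itself the same calls would simply be made with `y_test` and `y_log_predict`.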

# F1 Score
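F1 balances precision and recall: it is the harmonic mean of the two.

$\frac{1}{F1} = \frac{1}{2}(\frac{1}{precision} + \frac{1}{recall})$

$F1 = \frac{2\cdot precision\cdot recall}{precision + recall}$

The value lies in the range [0, 1].

If either one is low, F1 is low; F1 is high only when both are high.

With an arithmetic mean, by contrast, a single high value is enough to pull the result up.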
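The difference between the two means is easy to check numerically. A minimal sketch; the helper names `arithmetic_mean` and `harmonic_mean` and the sample values are illustrative assumptions:

```python
def arithmetic_mean(p, r):
    return (p + r) / 2

def harmonic_mean(p, r):
    # Same formula as F1; defined as 0 when both inputs are 0.
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0

# Compare both means on balanced and unbalanced (precision, recall) pairs.
for p, r in [(0.5, 0.5), (0.1, 0.9), (0.9, 0.9)]:
    print(p, r, arithmetic_mean(p, r), harmonic_mean(p, r))
```

For the pair (0.1, 0.9) the arithmetic mean is still 0.5, while the harmonic mean drops to about 0.18.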

In [9]:
import numpy as np

In [10]:
def f1_score(precision, recall):
    # Guard the 0/0 case explicitly instead of relying on a bare except.
    if precision + recall == 0:
        return 0.0
    return 2*precision*recall / (precision + recall)

In [11]:
precision = 0.5
recall = 0.5
f1_score(precision, recall)

0.5

In [14]:
precision = 0.1
recall = 0.9
f1_score(precision, recall)

0.18000000000000002

In [16]:
precision = 0
recall = 1
f1_score(precision, recall)

0.0

In [18]:
confusion_matrix(y_test, y_log_predict)

array([[403,   2],
       [  9,  36]], dtype=int64)

In [20]:
from sklearn.metrics import f1_score

f1_score(y_test, y_log_predict) # the data is skewed, so F1 reflects
# the model's performance better than plain accuracy does

0.8674698795180723
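The same value can be reproduced by hand from the confusion-matrix counts. Substituting the precision and recall definitions into the F1 formula gives the equivalent form $F1 = \frac{2TP}{2TP+FP+FN}$. A short sketch with the counts hard-coded from the matrix above:

```python
# Counts from the confusion matrix above.
tp, fp, fn = 36, 2, 9

precision = tp / (tp + fp)
recall = tp / (tp + fn)

# F1 as the harmonic mean of precision and recall.
f1_from_pr = 2 * precision * recall / (precision + recall)
# Equivalent closed form in terms of the raw counts: 72 / 83.
f1_from_counts = 2 * tp / (2 * tp + fp + fn)

print(f1_from_pr, f1_from_counts)
```

Both expressions agree with the scikit-learn result of roughly 0.8675, noticeably lower than the 0.9756 accuracy.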