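## Balancing Precision and Recall
Here the decision boundary uses 0 as the cutoff. Replacing 0 with a custom threshold shifts the decision boundary, which in turn changes the resulting precision and recall.
https://windmising.gitbook.io/liu-yu-bo-play-with-machine-learning/10-1/10-5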
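
As a minimal sketch of that idea, the snippet below turns one set of scores into labels with three different cutoffs. The `scores` values are made up for illustration (they are not from the digits data), and `apply_threshold` is a hypothetical helper; it uses `>=` in the same way as the thresholding cells later in this notebook.

```python
# Hypothetical decision scores for six samples (not taken from the digits data).
scores = [-7.5, -2.0, -0.5, 1.5, 4.0, 9.0]

def apply_threshold(scores, threshold):
    """Label a sample 1 when its score reaches the threshold, else 0."""
    return [1 if s >= threshold else 0 for s in scores]

labels_default = apply_threshold(scores, 0)   # cutoff at 0
labels_strict = apply_threshold(scores, 5)    # cutoff moved right: fewer positives
labels_lenient = apply_threshold(scores, -5)  # cutoff moved left: more positives

print(labels_default)
print(labels_strict)
print(labels_lenient)
```

Fewer predicted positives usually means fewer false positives but more missed positives, which is why precision and recall tend to move in opposite directions.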

![image.png](attachment:image.png)

In [1]:
import numpy as np
from sklearn import datasets

In [2]:
digits = datasets.load_digits()
X = digits.data
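y = digits.target.copy()  # copy so the edits below do not touch digits.target

# Turn the task into a binary classification problem by hand.
# The resulting classes are heavily skewed.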
y[digits.target == 9] = 1
y[digits.target != 9] = 0

In [3]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,random_state = 666)

In [6]:
from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression()
log_reg.fit(X_train,y_train)
y_predict = log_reg.predict(X_test)



### Compute f1_score with the default threshold

In [7]:
from sklearn.metrics import f1_score
f1_score(y_test,y_predict)

0.8674698795180723

In [9]:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test,y_predict)

array([[403,   2],
       [  9,  36]])

In [10]:
from sklearn.metrics import precision_score

precision_score(y_test, y_predict)  # precision

0.9473684210526315

In [11]:
from sklearn.metrics import recall_score

recall_score(y_test, y_predict)  # recall

0.8
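
These scores can be checked by hand from the confusion matrix printed earlier, which scikit-learn lays out as `[[TN, FP], [FN, TP]]`. The sketch below redoes the arithmetic in plain Python; the variable names are illustrative.

```python
# Counts read from the confusion matrix printed earlier: [[TN, FP], [FN, TP]].
tn, fp = 403, 2
fn, tp = 9, 36

precision = tp / (tp + fp)  # share of predicted positives that are truly positive
recall = tp / (tp + fn)     # share of true positives that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(round(precision, 4), round(recall, 4), round(f1, 4))
```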

### Inspect the logistic regression model's decision_function (confidence scores for the predicted samples)

In [12]:
log_reg.decision_function(X_test)[:10]  # scores of the first 10 samples

array([-22.05700185, -33.02943631, -16.21335414, -80.37912074,
       -48.25121102, -24.54004847, -44.39161228, -25.0429358 ,
        -0.97827574, -19.71740779])

Compare these scores with the output of predict: samples whose decision_function value is greater than 0 are labeled 1, and the rest are labeled 0.

In [13]:
log_reg.predict(X_test)[:10]

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
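
For a two-class logistic regression the decision score is the log-odds of class 1, so passing it through the sigmoid gives the predicted probability, and a score of 0 corresponds to a probability of 0.5. The sketch below applies that mapping to two of the scores printed above; `sigmoid` is a helper defined here, not a scikit-learn function, and the labeling rule is an assumption meant to mirror what predict does.

```python
import math

def sigmoid(score):
    """Map a log-odds score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-score))

for score in (-22.05700185, -0.97827574, 0.0):
    p = sigmoid(score)
    label = 1 if score > 0 else 0  # assumed to mirror predict's rule for two classes
    print(f"score={score:>13.8f}  P(class 1)={p:.6f}  label={label}")
```

Seen this way, the cutoff at 0 is the same as requiring a probability above 0.5, and moving the threshold amounts to demanding more or less confidence before predicting class 1.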

### Use decision_function (confidence scores) together with a decision threshold

In [14]:
decisio_scores = log_reg.decision_function(X_test)

In [15]:
np.min(decisio_scores)  # minimum score

-85.68612416749173

In [17]:
np.max(decisio_scores)  # maximum score

19.889606885682948

### Move the decision threshold and recompute precision and recall
- The trade-off between precision and recall

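1. threshold = 5 (keep the threshold between the minimum and maximum scores; outside that range every sample lands on the same side of the boundary)
- Moving the cutoff to the right raises precision and lowers recall.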
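
The range restriction matters because a threshold outside the span of the scores gives a degenerate classifier. The sketch below shows both cases on made-up data; `scores`, `y_true`, and `count_predictions` are hypothetical and only serve the illustration.

```python
# Hypothetical scores and true labels, for illustration only.
scores = [-8.0, -3.0, -1.0, 2.0, 6.0]
y_true = [0, 0, 1, 1, 1]

def count_predictions(scores, y_true, threshold):
    """Return (TP, FP, FN) when samples scoring >= threshold are labeled 1."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, fn

# Above the maximum score: nothing is predicted positive, so TP + FP = 0,
# precision (TP / (TP + FP)) is undefined, and recall drops to 0.
above_max = count_predictions(scores, y_true, max(scores) + 1)

# Below the minimum score: everything is predicted positive, so recall is 1
# and precision falls to the share of positives in the data.
below_min = count_predictions(scores, y_true, min(scores) - 1)

print(above_max, below_min)
```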

In [18]:
y_predict_2 = np.array(decisio_scores >= 5,dtype='int')

In [19]:
confusion_matrix(y_test,y_predict_2)

array([[404,   1],
       [ 21,  24]])

In [20]:
precision_score(y_test,y_predict_2)

0.96

In [25]:
recall_score(y_test,y_predict_2)

0.5333333333333333

2. threshold = -5 (again within the range between the minimum and maximum scores)
- Moving the cutoff to the left lowers precision and raises recall.

In [21]:
y_predict_3 = np.array(decisio_scores>=-5,dtype='int')

In [22]:
confusion_matrix(y_test,y_predict_3)

array([[390,  15],
       [  5,  40]])

In [23]:
precision_score(y_test,y_predict_3)

0.7272727272727273

In [24]:
recall_score(y_test,y_predict_3)

0.8888888888888888
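
Putting the three confusion matrices from this section side by side makes the trade-off explicit. The sketch below recomputes precision, recall, and F1 for each threshold from the counts reported above; the dictionary layout and the `summarize` helper are illustrative choices, not part of scikit-learn.

```python
# (TN, FP, FN, TP) copied from the confusion matrices printed in this section.
matrices = {
    -5: (390, 15, 5, 40),
    0: (403, 2, 9, 36),
    5: (404, 1, 21, 24),
}

def summarize(tn, fp, fn, tp):
    """Return (precision, recall, f1) for one confusion matrix."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

results = {t: summarize(*counts) for t, counts in matrices.items()}
for threshold in sorted(results):
    precision, recall, f1 = results[threshold]
    print(f"threshold={threshold:>2}  precision={precision:.4f}  "
          f"recall={recall:.4f}  f1={f1:.4f}")
```

On these counts the default threshold of 0 has the highest F1 of the three, so shifting the cutoff pays off mainly when one of the two metrics matters more than the other.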