<a href="https://colab.research.google.com/github/wannasmile/colab_code_note/blob/main/MLPU002.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

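This notebook follows one of the more representative PU Learning papers. When labeled samples are scarce, PU Learning offers an alternative way to approach the problem.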


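PU Learning targets data made up of Positive and Unlabeled examples. Unlike the usual positive/negative setup, the data consist of positives plus unlabeled samples, and datasets like this are common in production and in everyday life.
For example, if a patient diagnosed with a disease counts as a positive, a large part of the population is simply Unlabeled. Being unlabeled does not mean being healthy, and such people certainly cannot be treated as negatives.
Recommender systems are another case: a click is usually treated as a positive and a non-click as a negative, which rests on the prior assumption that an impression means the user actually saw the item. The user's gaze may never have reached that position at all, in which case the item is really Unlabeled. Games work the same way: if a coin appears at some position in the field of view, that coordinate can be treated as a positive, while the unexplored parts of the map remain Unlabeled.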


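In these scenarios we want the model to learn, from positives and unlabeled samples, the true distribution of positives over the whole sample space. For example: estimating the prevalence of a disease in the whole population from confirmed cases plus data on people who never came in for diagnosis; estimating the true CTR from click data and impression logs; or inferring the distribution of coins over the whole map from the coins already collected and the unexplored map.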


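The problems above can be summarized as follows: train a classifier on a non-standard positive-unlabeled dataset (a standard dataset being positive-negative) that comes close to one trained on a standard dataset, thereby mitigating the bias in the sample distribution. We want this learning process to perform about as well as training directly on positives and negatives. In practice, the value comes from the classifier's predicted probability that an unlabeled sample is positive.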



How to classify unlabeled data when you only have a few positive samples


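Suppose you have a dataset of payment transactions. Some transactions are labeled as fraudulent and the rest as genuine, and you need to design a model that tells them apart. With enough data and good features, this looks like a straightforward classification task. But suppose only 15% of the data is labeled, and the labeled samples all belong to one class, so your training set contains 15% of samples labeled as genuine while the rest are unlabeled and may be either genuine or fraudulent. How would you classify them? Does this change in requirements simply turn the task into an unsupervised learning problem? Well, not necessarily.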


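This problem is usually called the PU (positive and unlabeled) classification problem. It should first be distinguished from two similar and common "labeling problems" that complicate many classification tasks.

The first and most common labeling problem is the small training set. It arises when you have a fair amount of data but only a small fraction of it is actually labeled. This problem comes in many variants, and there are many specific training methods for it.

Another common labeling problem (often conflated with the PU problem) is the case where the training dataset is fully labeled but contains only one class. For example, suppose all we have is a dataset of non-fraudulent transactions, and we need to use it to train a model that separates (similar) non-fraudulent transactions from fraudulent ones. This is also a common problem and is usually treated as unsupervised outlier detection, although the ML world has plenty of widely used tools designed specifically for these scenarios (OneClassSVM is probably the best known).

In contrast, the PU classification problem involves a training set in which only part of the data is labeled positive, while the rest is unlabeled and may be either positive or negative. For example, suppose your employer is a bank that can give you a large amount of transaction data but can only confirm that part of it is 100% genuine. The example used here involves a similar scenario with counterfeit banknotes: a dataset of 1,372 banknotes, most of them unlabeled and only some confirmed as genuine. Although PU problems are also common, they are discussed far less than the two problems mentioned above, and widely available hands-on examples or libraries are scarce.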

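Learning Classifiers from Only Positive and Unlabeled Data (Elkan and Noto, 2008)

In essence, the paper claims that, given a dataset of positive and unlabeled data, the probability that a sample is positive, P(y=1|x), equals the probability that the sample is labeled, P(s=1|x), divided by the probability that a positive sample is labeled in our dataset, P(s=1|y=1).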
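The identity relies on the paper's assumption that labeled positives are picked completely at random from all positives, so that P(s=1|y=1) is one constant c regardless of x. The sketch below is a toy simulation of that assumption, not part of the original notebook: the three-valued feature, the probabilities in `p_y_given_x`, and the helper `simulate_scar` are all hypothetical. It checks that dividing the empirical P(s=1|x) by c lands close to the true P(y=1|x).

```python
import random

def simulate_scar(n=200_000, c=0.25, seed=0):
    """Simulate PU data in which each positive is labeled with the same probability c."""
    rng = random.Random(seed)
    p_y_given_x = {0: 0.1, 1: 0.5, 2: 0.9}     # hypothetical true P(y=1|x)
    counts = {x: [0, 0] for x in p_y_given_x}  # per x: [n_samples, n_labeled]
    for _ in range(n):
        x = rng.choice([0, 1, 2])
        y = rng.random() < p_y_given_x[x]      # true (hidden) class
        s = y and rng.random() < c             # only positives can receive a label
        counts[x][0] += 1
        counts[x][1] += int(s)
    p_s_given_x = {x: labeled / total for x, (total, labeled) in counts.items()}
    return p_y_given_x, p_s_given_x

c = 0.25
p_y, p_s = simulate_scar(c=c)
recovered = {x: p_s[x] / c for x in p_s}       # P(s=1|x) / P(s=1|y=1)
for x in sorted(p_y):
    print(f"x={x}: true P(y=1|x)={p_y[x]:.2f}  recovered={recovered[x]:.3f}")
```

In the notebook itself neither P(s=1|x) nor c is known exactly; both are estimated by a classifier, which is where the fitting set and the held-out set come in.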



In [1]:
!wget https://raw.githubusercontent.com/wannasmile/colab_code_note/main/data_banknote_authentication.txt

--2024-03-24 16:55:24--  https://raw.githubusercontent.com/wannasmile/colab_code_note/main/data_banknote_authentication.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 46400 (45K) [text/plain]
Saving to: ‘data_banknote_authentication.txt.2’


2024-03-24 16:55:24 (21.6 MB/s) - ‘data_banknote_authentication.txt.2’ saved [46400/46400]



In [2]:
import pandas as pd
import numpy as np
data = pd.read_csv('data_banknote_authentication.txt', header=None)

Load the data

The data has 4 feature columns and 1 label column

In [3]:
print(data.shape)
data.head(10)

(1372, 5)


Unnamed: 0,0,1,2,3,4
0,3.6216,8.6661,-2.8073,-0.44699,0
1,4.5459,8.1674,-2.4586,-1.4621,0
2,3.866,-2.6383,1.9242,0.10645,0
3,3.4566,9.5228,-4.0112,-3.5944,0
4,0.32924,-4.4552,4.5718,-0.9888,0
5,4.3684,9.6718,-3.9606,-3.1625,0
6,3.5912,3.0129,0.72888,0.56421,0
7,2.0922,-6.81,8.4636,-0.60216,0
8,3.2032,5.7588,-0.75345,-0.61251,0
9,1.5356,9.1772,-2.2718,-0.73535,0


In [4]:
data.iloc[:, 1].value_counts()

-4.45520    6
-3.26330    5
 0.70980    4
-3.79710    4
-0.02480    4
           ..
 8.81100    1
 6.40230    1
 7.27970    1
 2.10860    1
-0.65804    1
Name: 1, Length: 1256, dtype: int64

Ratio of negative to positive samples

In [5]:
data.iloc[:, -1].value_counts()

0    762
1    610
Name: 4, dtype: int64

Train a baseline model

In [6]:
from sklearn.model_selection import train_test_split

x_data = data.iloc[:,:-1]
y_data = data.iloc[:,-1]

x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.2, random_state=7)

In [7]:
import xgboost as xgb

model = xgb.XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=3,
              min_child_weight=1, n_estimators=100, n_jobs=1,
              missing=np.nan,  # NaN marks missing values; missing=1 would treat the value 1 as missing
              objective='binary:logistic', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
              subsample=1, verbosity=1)  # deprecated silent/nthread/seed arguments dropped

model.fit(x_train, y_train)

In [8]:
y_predict = model.predict(x_test)

In [9]:
from sklearn.metrics import recall_score, precision_score, roc_auc_score, accuracy_score, f1_score

def evaluate_results(y_test, y_predict):
    print('Classification results:')
    f1 = f1_score(y_test, y_predict)
    print("f1: %.2f%%" % (f1 * 100.0))
    roc = roc_auc_score(y_test, y_predict)
    print("roc: %.2f%%" % (roc * 100.0))
    rec = recall_score(y_test, y_predict, average='binary')
    print("recall: %.2f%%" % (rec * 100.0))
    prc = precision_score(y_test, y_predict, average='binary')
    print("precision: %.2f%%" % (prc * 100.0))


evaluate_results(y_test, y_predict)

Classification results:
f1: 99.57%
roc: 99.57%
recall: 99.15%
precision: 100.00%
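
One caveat about the "roc" figure: `evaluate_results` passes hard 0/1 predictions to `roc_auc_score`. With only two distinct score values, ROC AUC reduces to the average of the recall on each class, so ranking information is lost. The pure-Python sketch below illustrates the difference on made-up numbers; the helper `auc` and the toy labels and probabilities are assumptions for illustration only. In the notebook, one option would be to pass `model.predict_proba(x_test)[:, 1]` to `roc_auc_score` instead of `y_predict`.

```python
def auc(y_true, scores):
    """Rank-based AUC: share of (positive, negative) pairs ordered correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1]
proba = [0.10, 0.40, 0.45, 0.90]        # hypothetical predicted probabilities
hard = [int(p > 0.5) for p in proba]    # thresholded at 0.5, as model.predict would do

auc_from_proba = auc(y_true, proba)     # uses the full ranking
auc_from_hard = auc(y_true, hard)       # equals (recall on 1s + recall on 0s) / 2
print(auc_from_proba, auc_from_hard)
```

Here the probabilities rank every positive above every negative, while the thresholded labels miss one positive, so the two numbers differ.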


Testing the PU Learning approach

In [10]:
mod_data = data.copy()
# indices of all positive samples
pos_ind = np.where(mod_data.iloc[:,-1].values == 1)[0]
# shuffle them randomly
np.random.shuffle(pos_ind)

# keep 25% of the positives as the labeled set
pos_sample_len = int(np.ceil(0.25 * len(pos_ind)))
print(f'Using {pos_sample_len}/{len(pos_ind)} as positives and unlabeling the rest')
pos_sample = pos_ind[:pos_sample_len]

Using 153/610 as positives and unlabeling the rest


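Build the target column 'class_test':

1 marks a labeled positive

0 marks an unlabeled sample (the code below sets 0; the -1 variant is commented out)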

In [11]:
# use 25% of the positives as the labeled positives
# treat all other samples as unlabeled

#mod_data['class_test'] = -1
mod_data['class_test'] = 0
mod_data.loc[pos_sample, 'class_test'] = 1
print('target variable:\n', mod_data.iloc[:,-1].value_counts())

target variable:
 0    1219
1     153
Name: class_test, dtype: int64
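
Because the labels were hidden by hand, this notebook happens to know quantities that a real PU problem would not reveal. The arithmetic sketch below uses only the counts printed above (153 labeled positives, 610 true positives, 1372 rows) to compute the true label frequency P(s=1|y=1) and to check the relation P(s=1) = P(s=1|y=1) * P(y=1). The variable names are illustrative. The true value is a handy reference for the hold-out estimates printed later in the notebook.

```python
labeled_pos = 153    # rows with class_test == 1
total_pos = 610      # rows whose original class is 1
total = 1372         # all rows

c_true = labeled_pos / total_pos   # label frequency P(s=1|y=1)
prior = total_pos / total          # class prior P(y=1)
p_s = labeled_pos / total          # overall labeled share P(s=1)

print(f"P(s=1|y=1) = {c_true:.4f}")
print(f"P(y=1)     = {prior:.4f}")
print(f"P(s=1)     = {p_s:.4f}")
```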


In [12]:
mod_data.head(10)

Unnamed: 0,0,1,2,3,4,class_test
0,3.6216,8.6661,-2.8073,-0.44699,0,0
1,4.5459,8.1674,-2.4586,-1.4621,0,0
2,3.866,-2.6383,1.9242,0.10645,0,0
3,3.4566,9.5228,-4.0112,-3.5944,0,0
4,0.32924,-4.4552,4.5718,-0.9888,0,0
5,4.3684,9.6718,-3.9606,-3.1625,0,0
6,3.5912,3.0129,0.72888,0.56421,0,0
7,2.0922,-6.81,8.4636,-0.60216,0,0
8,3.2032,5.7588,-0.75345,-0.61251,0,0
9,1.5356,9.1772,-2.2718,-0.73535,0,0


iloc[:, -2] is the original class label (positive vs. negative)

iloc[:, -1] is the new class label (positive vs. unlabeled)

In [13]:
x_data = mod_data.iloc[:,:-2].values # just the X
y_labeled = mod_data.iloc[:,-1].values # new class (just the P & U)
y_positive = mod_data.iloc[:,-2].values # original class (just the P & N)

In [14]:
print(x_data.shape)
print(y_labeled.shape)
print(y_positive.shape)

(1372, 4)
(1372,)
(1372,)


In [15]:
pd.Series(y_positive).value_counts()

0    762
1    610
dtype: int64

In [16]:
pd.Series(y_labeled).value_counts()

0    1219
1     153
dtype: int64

The training set is split into:

* fitting set: used to fit P(s=1|X)

* held-out set: used to estimate P(s=1|y=1)


In [17]:
def fit_PU_estimator(X, y, hold_out_ratio, estimator):

    # find the indices of the positive & labeled elements
    assert isinstance(y, np.ndarray), "Must pass np.ndarray rather than list as y"
    positives = np.where(y == 1.)[0]
    print("Number of labeled positives:", len(positives))
    # hold_out_size = the number of positive & labeled samples
    # that we will use later to estimate P(s=1|y=1)
    hold_out_size = int(np.ceil(len(positives) * hold_out_ratio))
    np.random.shuffle(positives)
    # hold_out = the indices of the positive elements
    # that we will use later to estimate P(s=1|y=1)
    hold_out = positives[:hold_out_size]

    print("Number of held-out positives:", len(hold_out))
    # the actual positive elements that we will keep aside
    X_hold_out = X[hold_out]


    # remove the held out elements from X and y
    X = np.delete(X, hold_out, 0)
    y = np.delete(y, hold_out)

    # fit the estimator on the unlabeled samples + (part of the) positive and labeled ones
    # in order to estimate P(s=1|X)
    estimator.fit(X, y)

    # use the estimator for prediction of the positive held-out set
    # in order to estimate P(s=1|y=1)
    hold_out_predictions = estimator.predict_proba(X_hold_out)

    # take the probability that it is 1
    hold_out_predictions = hold_out_predictions[:,1]

    # save the mean probability
    c = np.mean(hold_out_predictions)
    return estimator, c

# predict_proba() returns an array with the predicted probability of each class for every sample
def predict_PU_prob(X, estimator, prob_s1y1):
    predicted_s = estimator.predict_proba(X)
    predicted_s = predicted_s[:,1]
    print(predicted_s)
    # P(y=1|x) = P(s=1|x) / P(s=1|y=1); the ratio is not clipped and can exceed 1
    return predicted_s / prob_s1y1

In [18]:
pu_estimator, probs1y1 = fit_PU_estimator(x_data, y_labeled, 0.3, xgb.XGBClassifier())

Number of labeled positives: 153
Number of held-out positives: 46


In [19]:
predicted = predict_PU_prob(x_data, pu_estimator, probs1y1)

[6.3478896e-05 8.6529508e-05 2.3621214e-03 ... 5.6797959e-02 1.0673341e-02
 7.0716180e-02]


In [20]:
predicted

array([4.5280310e-04, 6.1722606e-04, 1.6849315e-02, ..., 4.0514714e-01,
       7.6134317e-02, 5.0442761e-01], dtype=float32)

In [21]:
predicted = np.zeros(len(x_data))
learning_iterations = 24
for index in range(learning_iterations):
    pu_estimator, probs1y1 = fit_PU_estimator(x_data, y_labeled, 0.3, xgb.XGBClassifier())
    predicted += predict_PU_prob(x_data, pu_estimator, probs1y1)
    if index % 4 == 0:
        # format to two decimals; round() on a float32 prints values like 0.17000000178813934
        print(f'Learning Iteration::{index}/{learning_iterations} => P(s=1|y=1)={probs1y1:.2f}')

Number of labeled positives: 153
Number of held-out positives: 46
[0.00018149 0.00013006 0.00017582 ... 0.09066085 0.06442959 0.01045341]
Learning Iteration::0/24 => P(s=1|y=1)=0.17
Number of labeled positives: 153
Number of held-out positives: 46
[7.16482173e-05 1.17729214e-04 1.09127257e-04 ... 1.01397641e-01
 5.91486134e-02 7.32010901e-02]
Number of labeled positives: 153
Number of held-out positives: 46
[4.0079434e-05 2.8727738e-05 7.9490995e-04 ... 8.9179613e-02 3.5139001e-03
 4.7335889e-02]
Number of labeled positives: 153
Number of held-out positives: 46
[1.7231259e-04 1.1733560e-04 3.4272610e-04 ... 1.5227357e-01 9.4989769e-02
 6.6200554e-02]
Number of labeled positives: 153
Number of held-out positives: 46
[9.4017247e-05 4.2619209e-05 6.5062306e-04 ... 2.4688619e-01 4.8192274e-02
 8.7408513e-02]
Learning Iteration::4/24 => P(s=1|y=1)=0.18
Number of labeled positives: 153
Number of held-out positives: 46
[7.4189418e-05 1.1595671e-04 5.2375969e-04 ... 1.2034549e-01 3.9205749e-02
 5.5454303e-02]
Number of labeled positives: 153
Number of held-out positives: 46
[9.9506986e-05 1.3272073e-04 2.9706670e-04 ... 3.4322675e-02 5.5902790e-02
 4.3699939e-02]
Number of labeled positives: 153
Number of held-out positives: 46
[7.4014526

In [22]:
y_predict = [1 if x > 0.5 else 0 for x in (predicted/learning_iterations)]
evaluate_results(y_positive, y_predict)

Classification results:
f1: 59.45%
roc: 71.15%
recall: 42.30%
precision: 100.00%


In [23]:
y_predict = [1 if x > 0.02 else 0 for x in (predicted/learning_iterations)]
evaluate_results(y_positive, y_predict)

Classification results:
f1: 97.40%
roc: 97.74%
recall: 98.36%
precision: 96.46%
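
The two runs above show how sensitive the result is to the cutoff, and both are scored against `y_positive`, the true labels that a real PU problem would not have. One label-free heuristic, offered here as an assumption rather than something the notebook does, is to pick the threshold so that a target share of the labeled positives score at or above it. The helper `threshold_for_recall` and the scores are hypothetical; in the notebook the scores would ideally come from held-out labeled positives, since scores on samples the model was trained on tend to be optimistic.

```python
import math

def threshold_for_recall(labeled_pos_scores, target_recall=0.9):
    """Return a threshold t such that at least `target_recall` of the given scores are >= t."""
    ranked = sorted(labeled_pos_scores, reverse=True)
    k = max(1, math.ceil(target_recall * len(ranked)))  # how many positives must pass
    return ranked[k - 1]

# hypothetical PU scores of ten labeled positives
scores = [0.90, 0.80, 0.60, 0.40, 0.30, 0.05, 0.70, 0.50, 0.20, 0.10]
t = threshold_for_recall(scores, target_recall=0.9)
passed = sum(score >= t for score in scores) / len(scores)
print(f"threshold={t}, share of labeled positives kept={passed:.2f}")
```

This controls recall on known positives only; it says nothing about precision, which still needs some labeled negatives or domain knowledge to assess.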


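Comparison with the Bagging approach

Each round trains a classifier on all labeled positives plus a bootstrap sample of the unlabeled data treated as negatives, then scores the unlabeled samples left out of that round. The out-of-bag scores are averaged at the end.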

In [24]:
mod_data = data.copy()
mod_data['class_test'] = 0
mod_data.loc[pos_sample,'class_test'] = 1
print('target variable:\n', mod_data.iloc[:,-1].value_counts())

target variable:
 0    1219
1     153
Name: class_test, dtype: int64


In [25]:
y_data = mod_data.iloc[:,-1]
df_orig_positive = mod_data.iloc[y_data.values == 1]
df_orig_unlabeled = mod_data.iloc[y_data.values != 1]

In [26]:
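# NOTE: with columns [0, 1, 2, 3, 4, class_test], `:-3` keeps only features 0-2
# and drops feature 3; use `:-2` to train on all four features.
x_data_pos = df_orig_positive.iloc[:,:-3].values
x_data_unl = df_orig_unlabeled.iloc[:,:-3].values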

In [27]:
len_pos = x_data_pos.shape[0] # size of positive
len_unl = x_data_unl.shape[0] # size of unlabeled
learners_num = 128 # learners

# draw as many unlabeled samples as there are positives
bootstrap_sample_size = len_pos # random bootstrap sample size

In [28]:
#create a label set for each learning cycle
train_labels = np.zeros(shape=(len_pos+bootstrap_sample_size,))
#populate the first part of the set with the positive label
train_labels[:len_pos] = 1.0

# number of times each data point received an out-of-bag prediction
n_oob = np.zeros(shape=(len_unl,))

# running sum of the predicted class probabilities for each data point
f_oob = np.zeros(shape=(len_unl, 2))
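
The final score of an unlabeled sample is `f_oob / n_oob`, so every sample needs at least one round in which it was left out of the bootstrap. The pure-Python sketch below replays only the sampling step with the notebook's sizes (1219 unlabeled samples, bootstrap size 153, 128 learners) and counts how often each sample ends up out of bag. The helper `bootstrap_and_oob` and the fixed seed are assumptions for illustration, and no model is trained.

```python
import random

def bootstrap_and_oob(n_unlabeled, sample_size, rng):
    """Draw indices with replacement; return them with the sorted out-of-bag indices."""
    in_bag = [rng.randrange(n_unlabeled) for _ in range(sample_size)]
    oob = sorted(set(range(n_unlabeled)) - set(in_bag))
    return in_bag, oob

rng = random.Random(0)
n_unl, size, rounds = 1219, 153, 128
n_oob_counts = [0] * n_unl
for _ in range(rounds):
    _, oob = bootstrap_and_oob(n_unl, size, rng)
    for i in oob:
        n_oob_counts[i] += 1

print("fewest out-of-bag rounds for any sample:", min(n_oob_counts))
print("most out-of-bag rounds for any sample:", max(n_oob_counts))
```

With a small bootstrap relative to the unlabeled pool, each sample is out of bag in most rounds, so the division is safe here. With very few learners or a bootstrap close to the pool size, a zero count becomes possible and the division would need a guard.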

In [29]:
import lightgbm as lgb

for i in range(learners_num):
    # Bootstrap resample
    bootstrap_sample = np.random.choice(np.arange(len_unl), replace=True, size=bootstrap_sample_size)
    # Positive set + Bootstrapped unlabeled set
    data_bootstrap = np.concatenate((x_data_pos, x_data_unl[bootstrap_sample, :]), axis=0)

    # Train model
    model = lgb.LGBMClassifier(verbosity=2)
    model.fit(data_bootstrap, train_labels)

    # Index for the out of the bag (oob) samples
    idx_oob = sorted(set(range(len_unl)) - set(np.unique(bootstrap_sample)))

    # Transductive learning of oob samples
    f_oob[idx_oob] += model.predict_proba(x_data_unl[idx_oob])
    n_oob[idx_oob] += 1

    if(i%10 == 0): print(f'learner {i}/{learners_num} completed')

predicted = f_oob[:, 1]/n_oob

Streaming output truncated to the last 5000 lines.
[LightGBM] [Debug] Trained a tree with leaves = 15 and depth = 10
[LightGBM] [Debug] Trained a tree with leaves = 16 and depth = 10
[LightGBM] [Info] Number of positive: 153, number of negative: 153
[LightGBM] [Debug] Dataset::GetMultiBinFromAllFeatures: sparse rate 0.000000
[LightGBM] [Debug] init for col-wise cost 0.000004 seconds, init for row-wise cost 0.000712 seconds
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000730 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 307
[LightGBM] [Info] Number of data points in the train set: 306, number of used features: 3
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=0.000000
[LightGBM] [Debug] Trained a tree with leaves = 10 and depth = 6
[LightGBM] [Debug] Trained a tree with leaves = 11 and depth = 6
[LightGBM] [Debug] Trained a tree with leaves = 11 and depth = 6
[LightGBM] [Debug] Train

In [30]:
df_orig_predicted = df_orig_unlabeled.copy()
df_orig_predicted['pred'] = [1 if x > 0.5 else 0 for x in predicted]
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
df_orig_positive = df_orig_positive.assign(pred=1)
df_outcome = pd.concat([df_orig_positive, df_orig_predicted])


In [31]:
df_outcome

Unnamed: 0,0,1,2,3,4,class_test,pred
762,-1.39710,3.31910,-1.3927,-1.99480,1,1,1
765,-3.84830,-12.80470,15.6824,-1.28100,1,1,1
767,-2.28040,-0.30626,1.3347,1.37630,1,1,1
768,-1.75820,2.73970,-2.5323,-2.23400,1,1,1
769,-0.89409,3.19910,-1.8219,-2.94520,1,1,1
...,...,...,...,...,...,...,...
1367,0.40614,1.34920,-1.4501,-0.55949,1,0,0
1368,-1.38870,-4.87730,6.4774,0.34179,1,0,1
1369,-3.75030,-13.45860,17.5932,-2.77710,1,0,1
1370,-3.56370,-8.38270,12.3930,-1.28230,1,0,1


In [32]:
evaluate_results(df_orig_predicted.iloc[:,-3].values, df_orig_predicted.iloc[:,-1].values)

Classification results:
f1: 95.67%
roc: 95.89%
recall: 91.90%
precision: 99.76%
