# 🤦🏻 用法

In [1]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

X, Y = fetch_california_housing(return_X_y=True, as_frame=True)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)

In [2]:
from ngboost import NGBRegressor

ngb = NGBRegressor().fit(X_train, Y_train)
Y_preds = ngb.predict(X_test)
Y_dists = ngb.pred_dist(X_test)

[iter 0] loss=1.5670 val_loss=0.0000 scale=1.0000 norm=1.1092
[iter 100] loss=1.1400 val_loss=0.0000 scale=2.0000 norm=1.5566
[iter 200] loss=0.9289 val_loss=0.0000 scale=1.0000 norm=0.7054
[iter 300] loss=0.7743 val_loss=0.0000 scale=1.0000 norm=0.6778
[iter 400] loss=0.7006 val_loss=0.0000 scale=1.0000 norm=0.6768


In [3]:
from sklearn.metrics import mean_squared_error

# 计算mse
test_MSE = mean_squared_error(Y_preds, Y_test)
print('Test MSE', test_MSE)

# 计算负似然对数
test_NLL = -Y_dists.logpdf(Y_test).mean()
print('Test NLL', test_NLL)

Test MSE 0.31496311191332554
Test NLL 0.7043259666977987


In [4]:
# 获取一组点的估计分布参数
Y_dists[0:5].params

{'loc': array([1.42271054, 2.86045558, 2.14404235, 2.06860307, 2.14030303]),
 'scale': array([0.31714014, 0.51342792, 0.63051949, 0.55790625, 0.37466855])}

NGBoost可以与多种分布一起使用。包括回归分布和分类分布

回归分布：通过`NGBRegressor()`构造函数传递`Dist`参数，默认是`Normal`

| 分布          | 参数           | 评分指标               | 参考                                                         |
| ------------- | -------------- | ---------------------- | ------------------------------------------------------------ |
| `Normal`      | `loc`，`scale` | `LogScore`，`CRPScore` | [`scipy.stats.norm`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.norm.html) |
| `LogNormal`   | `s`，`scale`   | `LogScore`，`CRPScore` | [`scipy.stats.lognorm`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.lognorm.html) |
| `Exponential` | `scale`        | `LogScore`，`CRPScore` | [`scipy.stats.expon`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.expon.html) |

In [5]:
from ngboost.distns import Exponential,Normal

X_reg_train, X_reg_test, Y_reg_train, Y_reg_test = train_test_split(X, Y, test_size=0.2)

ngb_norm = NGBRegressor(Dist=Normal, verbose=False).fit(X_reg_train, Y_reg_train)
ngb_exp = NGBRegressor(Dist=Exponential, verbose=False).fit(X_reg_train, Y_reg_train)

两种预测方法：
1. `predict()`，返回点预测
2. `pred_dist()`，返回分布对象

In [6]:
ngb_norm.predict(X_reg_test)[0:5]
ngb_exp.predict(X_reg_test)[0:5]
ngb_exp.pred_dist(X_reg_test)[0:5].params

{'scale': array([4.10204883, 3.67173243, 1.88355812, 3.62738977, 1.79807525])}

NGBoost使用删失似然作为评分规则来处理右删失数据，使得NGBoost能够估计生存时间分布。

目前(2024.08)已经实现`LogNormal`、`Exponential`的右删失版本

> 右删失数据：知道某个时间发生在时间点后，但是不知道确切的时间。如：一项研究跟踪患者5年，一些患者在跟踪结束时依然活着，那么这些患者的生存时间是右删失的

In [7]:
import numpy as np
from ngboost import NGBSurvival
from ngboost.distns import LogNormal

X_surv_train, X_surv_test, Y_surv_train, Y_surv_test = train_test_split(X, Y, test_size=0.2)

# 引入管理审查模拟生存数据
T_surv_train = np.minimum(Y_surv_train, 3) # 事件或审查的时间
E_surv_train = Y_surv_train > 3 # 如果是事件时间则为1，如果是审查时间则为0

ngb = NGBSurvival(Dist=LogNormal).fit(X_surv_train, T_surv_train, E_surv_train)

[iter 0] loss=0.9572 val_loss=0.0000 scale=4.0000 norm=2.4033
[iter 100] loss=0.3118 val_loss=0.0000 scale=2.0000 norm=0.5934
[iter 200] loss=0.0546 val_loss=0.0000 scale=4.0000 norm=1.0242
[iter 300] loss=-0.1323 val_loss=0.0000 scale=2.0000 norm=0.4097
[iter 400] loss=-0.2071 val_loss=0.0000 scale=1.0000 norm=0.1794


In [8]:
ngb.predict(X_surv_test)

array([2.34018201, 3.05263929, 3.14951764, ..., 2.29411069, 2.97176873,
       2.85810047])

### 2. 分类分布

分类分布：通过`NGBClassifier()`构造函数，并通过`Dist`参数传入分类分布，默认是`Bernoulli`，相当于`k_categorical(2)`

| 分布               | 参数                  | 评分指标   | 参考                                                         |
| ------------------ | --------------------- | ---------- | ------------------------------------------------------------ |
| `k_categorical(K)` | `p0`，`p1`...`p{K-1}` | `LogScore` | [Categorical distribution on Wikipedia](https://en.wikipedia.org/wiki/Categorical_distribution) |
| `Bernoulli`        | `p`                   | `LogScore` | [Bernoulli distribution on Wikipedia](https://en.wikipedia.org/wiki/Bernoulli_distribution) |

`NGBClassifier`对象有三种预测方法：

* `predict()`：返回最可能的类
* `predict_proba()`：返回类概率
* `pred_dist()`：返回分布对象

In [9]:
from ngboost import NGBClassifier
from ngboost.distns import k_categorical, Bernoulli
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True,as_frame=True)
y[0:15] = 2 # 手动将2分类问题修改成3分类问题对应下面的`Dist=k_categorical(3)`
X_cls_train, X_cls_test, Y_cls_train, Y_cls_test  = train_test_split(X, y, test_size=0.2)

ngb_cat = NGBClassifier(Dist=k_categorical(3), verbose=False)
_ = ngb_cat.fit(X_cls_train, Y_cls_train) # Y should have only 3 values: {0,1,2}

In [10]:
ngb_cat.predict(X_cls_test)[0:5]#%%
ngb_cat.predict_proba(X_cls_test)[0:5]
ngb_cat.pred_dist(X_cls_test)[0:5].params

{'p0': array([0.0045647 , 0.99269303, 0.00541314, 0.00409154, 0.09781032]),
 'p1': array([0.99517851, 0.00686771, 0.99302614, 0.99553533, 0.90158408]),
 'p2': array([0.00025679, 0.00043926, 0.00156072, 0.00037313, 0.00060561])}

## 评价分数

`NGBoost`支持`LogScore`(对数分数，也称负对数似然)和`CRPScore`，由构造函数中的`Score`参数指定

In [11]:
from ngboost.scores import LogScore, CRPScore

NGBRegressor(Dist=Exponential, Score=CRPScore, verbose=False).fit(X_reg_train, Y_reg_train)
NGBClassifier(Dist=k_categorical(3), Score=LogScore, verbose=False).fit(X_cls_train, Y_cls_train)

## 基学习器

`NGBoost`可以与任何`sklearn`回归器使用作为基础学习器，由构造函数中`Base`参数指定。默认是深度为3的回归树

In [12]:
from sklearn.tree import DecisionTreeRegressor

learner = DecisionTreeRegressor(criterion='friedman_mse', max_depth=5)
NGBSurvival(Dist=Exponential,Score=CRPScore,Base=learner,verbose=True).fit(X_surv_train,T_surv_train,E_surv_train)

[iter 0] loss=0.5616 val_loss=0.0000 scale=2.0000 norm=0.9975
[iter 100] loss=0.4483 val_loss=0.0000 scale=2.0000 norm=0.4613
[iter 200] loss=0.4358 val_loss=0.0000 scale=2.0000 norm=0.3779
[iter 300] loss=0.4329 val_loss=0.0000 scale=1.0000 norm=0.1789
[iter 400] loss=0.4308 val_loss=0.0000 scale=1.0000 norm=0.1714


## 其他参数

构造函数中配置：

`learning_rate`：学习率

`n_estimators`：估计器数量

`minibatch_frac`：小批量分数

`col_sample`：列子采样

`fit`函数中配置：

`sample_weight`：样本权重

In [13]:
ngb = NGBRegressor(n_estimators=100, learning_rate=0.01,
                   minibatch_frac=0.5, col_sample=0.5)
ngb.fit(X_reg_train, Y_reg_train)

[iter 0] loss=1.5589 val_loss=0.0000 scale=1.0000 norm=1.1022


In [14]:
ngb = NGBRegressor(n_estimators=100, learning_rate=0.01,
                   minibatch_frac=0.5, col_sample=0.5)
weights = np.random.random(Y_reg_train.shape)
ngb.fit(X_reg_train, Y_reg_train, sample_weight=weights)

[iter 0] loss=1.5539 val_loss=0.0000 scale=1.0000 norm=1.0987
