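# 3. Logistic Regression (logit regression)
Consider a binary classification task whose output label is $y\in \{0,1\}$. A linear regression model produces a real-valued prediction $z=\omega^T\mathbf{x}+b$, so $z$ must be converted to a 0/1 value. An intuitive choice is the "unit step function" (Heaviside function):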
$$
y=\begin{cases}
0, & z<0;\\
0.5, & z=0;\\
1, & z>0.
\end{cases}
$$
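That is, a sample is classified as positive if the prediction $z$ is greater than 0 and as negative if it is less than 0. The unit step function is discontinuous, however, so we fall back on the logistic function, which is continuous and differentiable: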
$$
y=\frac{1}{1+e^{-z}}.
$$
The logistic function maps $z$ to a value in $(0,1)$ that is close to 0 or 1 when $|z|$ is large, and it changes steeply near $z=0$.
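
To make the shape concrete, the following minimal sketch evaluates the logistic function at a few points with NumPy. The helper name `sigmoid` and the sample points are illustrative choices, not part of the original notebook.

```python
import numpy as np

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z)), applied elementwise."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
y = sigmoid(z)
print(np.round(y, 4))

# The slope of the logistic function is y * (1 - y);
# it is largest at z = 0, where it equals 0.25.
slope = y * (1.0 - y)
print(np.round(slope, 4))
```

The outputs are symmetric around 0.5, which is why thresholding $y$ at 0.5 is the same as thresholding $z$ at 0.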

Substituting $z=\omega^T\mathbf{x}+b$ into the logistic function gives
$$
y=\frac{1}{1+e^{-(\omega^T\mathbf{x}+b)}}.
$$
which can be rearranged into
$$
\mathrm{ln}\frac{y}{1-y}=\mathbf{\omega^Tx+b}.
$$
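If $y$ is interpreted as the probability that sample $\mathbf{x}$ is a positive example, then $1-y$ is the probability that it is a negative example. The ratio of the two,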
$$
\frac{y}{1-y}
$$
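is called the odds; it reflects the relative likelihood that $\mathbf{x}$ is a positive example. Taking the natural logarithm of the odds gives the log odds (also called the logit):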
$$
\mathrm{ln}\frac{y}{1-y}
$$
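
The relationship between the two functions can be checked numerically: the log odds is the inverse of the logistic function, so applying one after the other recovers $z$. The sketch below assumes NumPy; the helper names `sigmoid` and `log_odds` are illustrative.

```python
import numpy as np

def sigmoid(z):
    """Logistic function, maps a real z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def log_odds(y):
    """ln(y / (1 - y)) for probabilities strictly between 0 and 1."""
    return np.log(y / (1.0 - y))

z = np.linspace(-4.0, 4.0, 9)
recovered = log_odds(sigmoid(z))   # should reproduce z up to rounding error
print(np.allclose(recovered, z))

# A probability of 0.8 corresponds to odds of about 4 (four to one).
print(0.8 / (1.0 - 0.8))
```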
The parameters $\omega$ and $b$ are estimated by maximum likelihood. Given a dataset $\{(\mathbf{x}_i,y_i)\}^m_{i=1}$, we maximize the log-likelihood
$$
\max_{\omega,b}\ \ell(\omega,b)=\sum_{i=1}^m \ln p(y_i|\mathbf{x}_i;\omega,b)
$$
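that is, we want each sample's probability of belonging to its true label to be as large as possible. This is equivalent to minimizing the negative log-likelihood
$$
(\omega, b)^* = \underset{\omega,b}{\arg\min}\ \sum_{i=1}^m\left(-y_i(\omega^T\mathbf{x}_i+b)+\ln\left(1+e^{\omega^T\mathbf{x}_i+b}\right)\right)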
$$
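
The step from the log-likelihood to this objective can be spelled out for a single sample. The shorthand $z_i=\omega^T\mathbf{x}_i+b$, $p_1(\mathbf{x}_i)=\frac{e^{z_i}}{1+e^{z_i}}$ for the probability of the positive class, and $p_0(\mathbf{x}_i)=\frac{1}{1+e^{z_i}}$ for the negative class is introduced here for the derivation only:

$$
\begin{aligned}
\ln p(y_i|\mathbf{x}_i;\omega,b) &= y_i\ln p_1(\mathbf{x}_i)+(1-y_i)\ln p_0(\mathbf{x}_i)\\
&= y_i\left(z_i-\ln(1+e^{z_i})\right)-(1-y_i)\ln(1+e^{z_i})\\
&= y_i z_i-\ln(1+e^{z_i}).
\end{aligned}
$$

Negating this expression and summing over $i=1,\dots,m$ gives the objective above.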

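In matrix form, with $\mathbf{X}=(\mathbf{x}_1,\dots,\mathbf{x}_m)$ holding the samples as columns, $\mathbf{Y}=(y_1,\dots,y_m)^T$, $\mathbf{1}$ the all-ones vector, and $\ln$ and the exponential applied elementwise, the objective is
$$
-(\mathbf{X}^T\omega+b\mathbf{1})^T\mathbf{Y} + \left(\ln\left(1+e^{\mathbf{X}^T\omega+b\mathbf{1}}\right)\right)^T\mathbf{1}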
$$
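
As a sanity check, the sum form and the matrix form of the negative log-likelihood can be compared on random data. This is a minimal NumPy sketch; the toy sizes, the random seed, and the variable names are assumptions made for illustration, and the samples are stored as columns of `X` to match the $\mathbf{X}^T\omega$ notation.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 3, 5                                   # feature dimension and sample count (toy sizes)
X = rng.normal(size=(d, m))                   # columns are the samples x_i
Y = rng.integers(0, 2, size=m).astype(float)  # 0/1 labels
omega = rng.normal(size=d)
b = 0.3

# Sum form: one term per sample.
loss_sum = sum(
    -Y[i] * (omega @ X[:, i] + b) + np.log(1.0 + np.exp(omega @ X[:, i] + b))
    for i in range(m)
)

# Matrix form: z = X^T omega + b, followed by two inner products.
z = X.T @ omega + b
loss_matrix = -z @ Y + np.log(1.0 + np.exp(z)) @ np.ones(m)

# np.logaddexp(0, z) evaluates ln(1 + e^z) without overflow for large z.
loss_stable = -z @ Y + np.logaddexp(0.0, z).sum()

print(np.isclose(loss_sum, loss_matrix), np.isclose(loss_matrix, loss_stable))
```

The `np.logaddexp` variant is worth preferring in practice, since `np.exp(z)` overflows once $z$ grows large.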

In [14]:
%matplotlib inline
from IPython import display
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score
from torch.utils.data import TensorDataset, DataLoader
import torch
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

- SMS spam dataset

In [5]:
df = pd.read_csv('../dataset/smsspamcollection/SMSSpamCollection', delimiter='\t', header=None, names=['category', 'message'])

In [6]:
df.head()

| | category | message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [7]:
print('Number of spam messages: %d ' % np.sum(df.category == 'spam'))
print('Number of ham messages: %d ' % np.sum(df.category == 'ham'))

Number of spam messages: 747 
Number of ham messages: 4825 


## 1. Computing TF-IDF scores with scikit-learn

In [8]:
X = df.message.values
y = df.category.values

X_train_raw, X_test_raw, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [9]:
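vectorizer = TfidfVectorizer()  # converts raw text into TF-IDF features
X_train = vectorizer.fit_transform(X_train_raw)  # learn the vocabulary on the training set only
X_test = vectorizer.transform(X_test_raw)  # reuse the training vocabulary
# vectorizer.vocabulary_  # term-to-column-index dictionary
# vectorizer.get_feature_names_out()  # feature names (scikit-learn >= 1.0)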
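
To see what the vectorizer produces, here is a small sketch on a hypothetical three-message corpus (the messages and the name `toy_vectorizer` are made up for illustration and are not taken from the dataset). It assumes scikit-learn's default `TfidfVectorizer` settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical three-message corpus, for illustration only.
corpus = [
    "free prize call now",
    "call me later",
    "see you later",
]

toy_vectorizer = TfidfVectorizer()
tfidf = toy_vectorizer.fit_transform(corpus)   # sparse matrix: (documents, vocabulary terms)

print(tfidf.shape)
print(sorted(toy_vectorizer.vocabulary_))      # the learned vocabulary

# Words that were not seen during fitting are ignored at transform time,
# so a message made only of unseen words becomes an all-zero row.
unseen = toy_vectorizer.transform(["hello world"])
print(unseen.nnz)
```

This is also why the notebook calls `fit_transform` on the training messages and only `transform` on the test messages: the test set must be encoded with the vocabulary learned from the training set.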

In [10]:
classifier = LogisticRegression(solver='lbfgs')  # logistic regression classifier
classifier.fit(X_train, y_train)

LogisticRegression()

In [11]:
predictions = classifier.predict(X_test)
for i, prediction in enumerate(predictions[:10]):
    print('Predicted: %s, True: %s, Message: %s' % (prediction, y_test[i], X_test_raw[i]))

Predicted: ham, True: ham, Message: Alright, we're all set here, text the man
Predicted: ham, True: ham, Message: Yup. Thk of u oso boring wat.
Predicted: ham, True: ham, Message: This weekend is fine (an excuse not to do too much decorating)
Predicted: ham, True: ham, Message: If you hear a loud scream in about &lt;#&gt; minutes its cause my Gyno will be shoving things up me that don't belong :/
Predicted: ham, True: ham, Message: * Am on a train back from northampton so i'm afraid not!
Predicted: ham, True: ham, Message: I got it before the new year cos yetunde said she wanted to surprise you with it but when i didnt see money i returned it mid january before the  &lt;#&gt; day return period ended.
Predicted: ham, True: ham, Message: Hi babe its me thanks for coming even though it didnt go that well!i just wanted my bed! Hope to see you soon love and kisses xxx
Predicted: ham, True: ham, Message: I am in a marriage function
Predicted: ham, True: ham, Message: 1Apple/Day=No Doctor. 1T

In [12]:
np.sum(classifier.predict(X_train) != y_train)  # number of misclassified training samples

111

In [13]:
np.sum(classifier.predict(X_test) != y_test)  # number of misclassified test samples

22
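
Because the dataset is imbalanced (747 spam messages against 4825 ham messages), a raw error count does not show which class the errors fall in. One possible way to break the errors down is sketched below with `sklearn.metrics`; the label lists are hypothetical stand-ins, and in the notebook they would be `y_test` and `classifier.predict(X_test)`.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels for illustration only.
y_true = ["ham", "ham", "ham", "ham", "spam", "spam", "spam", "ham"]
y_pred = ["ham", "ham", "ham", "ham", "spam", "ham", "ham", "ham"]

# Rows are true classes, columns are predicted classes, in the order given by `labels`.
cm = confusion_matrix(y_true, y_pred, labels=["ham", "spam"])
print(cm)

# Precision and recall for the spam class.
spam_precision = precision_score(y_true, y_pred, pos_label="spam")
spam_recall = recall_score(y_true, y_pred, pos_label="spam")
print(spam_precision, spam_recall)
```

In this toy case every message flagged as spam really is spam, yet two of the three spam messages are missed, which a single error count would not reveal.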

In [15]:
def f(z):
    if z < 0:
        return 0
    elif z == 0:
        return 0.5
    else:
        return 1

def g(z):
    return 1/(1+np.exp(-z))

## 2. Implementing logit regression with `torch`

Vectorizing the functions

In [16]:
v_f = np.vectorize(f)
v_g = np.vectorize(g)

In [17]:
z = np.linspace(-5, 5, num=200)
y1 = v_f(z)
y2 = v_g(z)

In [None]:
display.set_matplotlib_formats('svg')
fig = plt.figure(figsize=(8, 4))
ax = fig.add_subplot(1, 1, 1)
ax.plot(z, y1, 'r-', label='Heaviside')
ax.plot(z, y2, 'g-', label='logit')
ax.scatter([0], [0.5], s=50, alpha=0.5)
ax.set_xlim([-5, 5])
ax.set_xlabel("z")
ax.set_ylabel("y")
ax.legend()

In [39]:
# sigmoid function applied to the linear model
def logit(w, x, b):
    return torch.sigmoid(x@w + b)  # equivalent to 1 / (1 + torch.exp(-(x@w + b)))

# negative log-likelihood
def neglikelihood(x, y, w, b):
    z = x@w + b  # linear predictions, shape (n, 1)
    llike = -y.reshape(1, -1)@z + torch.ones_like(z).reshape(1, -1)@torch.log(1 + torch.exp(z))
    return llike  # shape (1, 1)
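
The expression `torch.log(1 + torch.exp(z))` can overflow when `z` is large. As a hedged alternative, the sketch below compares the direct formula with two overflow-safe PyTorch equivalents, `F.softplus` and `F.binary_cross_entropy_with_logits`. The tensors `x_demo`, `y_demo`, `w_demo` and `b_demo` are hypothetical toy values, not the notebook's data.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x_demo = torch.randn(6, 2)                                   # hypothetical toy features
y_demo = torch.tensor([[1.], [0.], [1.], [1.], [0.], [0.]])  # hypothetical 0/1 labels
w_demo = torch.randn(2, 1)
b_demo = torch.zeros(1)

z = x_demo @ w_demo + b_demo                                 # linear predictions, shape (6, 1)

# Direct formula from the text: sum of -y*z + ln(1 + e^z).
direct = (-y_demo * z + torch.log(1 + torch.exp(z))).sum()

# softplus(z) = ln(1 + e^z), evaluated in an overflow-safe way.
stable = (-y_demo * z + F.softplus(z)).sum()

# Built-in equivalent: binary cross-entropy on the raw z values, summed over samples.
builtin = F.binary_cross_entropy_with_logits(z, y_demo, reduction="sum")

print(direct.item(), stable.item(), builtin.item())
```

All three agree on moderate inputs; only the last two remain finite when `z` becomes very large.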

In [128]:
def precision(w, b, feature, true_label):
    # Despite its name, this returns accuracy: the fraction of correctly classified samples.
    z = logit(w, feature, b)
    y = (z >= 0.5).float()
    return torch.sum(y == true_label).numpy() / len(y)

In [81]:
true_w = torch.FloatTensor([1, -2]).reshape(-1, 1)
true_b = torch.FloatTensor([1])
x = torch.randn(size=(1000, 2)).float()

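# generate the dataset
z = logit(true_w, x, true_b)  # probability of being a positive example
y = z >= 0.5  # boolean labels (True/False)
y = y.float()  # convert to float; otherwise the training iterations below raise an error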

In [136]:
x_train = x[:int(len(x)*0.8)]
x_test = x[int(len(x)*0.8):]

y_train = y[:int(len(y)*0.8)]
y_test = y[int(len(y)*0.8):]

# initialize the parameters
w = torch.rand(size=true_w.shape)
b = torch.FloatTensor([0.0])
w.requires_grad_(True)
b.requires_grad_(True)

lr = 0.05
num_epochs = 10
batch_size = 10  # each mini-batch holds 10 samples
dataset = TensorDataset(x_train, y_train)
data_iter = DataLoader(dataset=dataset, batch_size=batch_size, shuffle=True)

for epoch in range(num_epochs):
    for t_x, t_y in data_iter:
        l = neglikelihood(t_x, t_y, w, b)        
        l.backward()  # compute the gradient of the loss with respect to w and b
        w.data.sub_(lr*w.grad/batch_size)
        w.grad.data.zero_()
        b.data.sub_(lr*b.grad/batch_size)
        b.grad.data.zero_()
        
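    with torch.no_grad():  # disable gradient tracking to speed up evaluation
        train_l = neglikelihood(x, y, w, b)  # negative log-likelihood on the full dataset (train + test) after this epoch
        est_w = [u[0] for u in w.detach().numpy()]  # detach returns a tensor that shares the same data but is excluded from the graph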
        est_b = [v for v in b.detach().numpy()]
        train_accu_ratio = precision(w, b, x_train, y_train)
        test_accu_ratio = precision(w, b, x_test, y_test)
        
        print(f'epoch {epoch + 1}, neglikelihood: {train_l.numpy()[0][0]:.4f}')
        print(f'    train accuracy: {train_accu_ratio}, test accuracy: {test_accu_ratio}')
#         print(f'    w0: {est_w[0]:.4f}, w1: {est_w[1]:.4f},  b: {est_b[0]:.4f}')

epoch 1, neglikelihood: 413.5895
    train accuracy: 0.9575, test accuracy: 0.965
epoch 2, neglikelihood: 306.1644
    train accuracy: 0.99375, test accuracy: 0.995
epoch 3, neglikelihood: 257.0077
    train accuracy: 1.0, test accuracy: 1.0
epoch 4, neglikelihood: 227.7614
    train accuracy: 1.0, test accuracy: 1.0
epoch 5, neglikelihood: 207.6567
    train accuracy: 1.0, test accuracy: 1.0
epoch 6, neglikelihood: 192.8063
    train accuracy: 1.0, test accuracy: 1.0
epoch 7, neglikelihood: 181.2397
    train accuracy: 1.0, test accuracy: 1.0
epoch 8, neglikelihood: 171.8677
    train accuracy: 1.0, test accuracy: 1.0
epoch 9, neglikelihood: 164.0813
    train accuracy: 1.0, test accuracy: 1.0
epoch 10, neglikelihood: 157.4612
    train accuracy: 1.0, test accuracy: 1.0
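
The loop above relies on autograd, but for this objective the gradient also has a closed form: with $z_i=\omega^T\mathbf{x}_i+b$ and $\sigma$ the logistic function, the gradient with respect to $\omega$ is $\sum_i(\sigma(z_i)-y_i)\mathbf{x}_i$ and the gradient with respect to $b$ is $\sum_i(\sigma(z_i)-y_i)$. The following NumPy sketch uses that formula with full-batch gradient descent on synthetic data generated like the notebook's. The seed, learning rate, iteration count and zero initialization are assumptions chosen for illustration, so the numbers will not match the output above.

```python
import numpy as np

rng = np.random.default_rng(0)
true_w, true_b = np.array([1.0, -2.0]), 1.0    # same values as the notebook's true_w, true_b
x = rng.normal(size=(1000, 2))
y = (x @ true_w + true_b >= 0).astype(float)   # sigmoid(z) >= 0.5 is the same as z >= 0

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nll(w, b):
    """Negative log-likelihood, using logaddexp for ln(1 + e^z)."""
    z = x @ w + b
    return np.sum(-y * z + np.logaddexp(0.0, z))

w, b, lr = np.zeros(2), 0.0, 0.5
initial = nll(w, b)
for _ in range(200):
    residual = sigmoid(x @ w + b) - y           # sigma(z_i) - y_i for every sample
    w -= lr * (x.T @ residual) / len(y)         # averaged gradient with respect to w
    b -= lr * residual.mean()                   # averaged gradient with respect to b

accuracy = np.mean((x @ w + b >= 0) == (y == 1))
print(f"nll: {initial:.2f} -> {nll(w, b):.2f}, accuracy: {accuracy:.3f}")
```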


### Verifying with `LogisticRegression`

In [91]:
classifier = LogisticRegression(solver='lbfgs')
classifier.fit(x.data.numpy(), y.data.squeeze().numpy())

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [92]:
p_y = classifier.predict(x.data.numpy())  # pass a NumPy array, as in the fit call

In [114]:
np.sum(p_y == y.data.squeeze().numpy())

1000