\documentclass[11pt]{article}
\usepackage[margin=0.75in]{geometry}
\usepackage{amsmath}
\usepackage{amsfonts}
\title{Regularized logistic regression gradient with weighted samples}
\begin{document}
\maketitle
Scikit-learn's batch logistic regression (which uses LIBLINEAR) doesn't support weighted
samples. The SGD classifier does support weighted samples, but it can be tricky to tune.
For my application, solving the optimization problem in batch with L-BFGS worked best.
\section*{No regularization, no weights}
In other words, all samples have weight $1.0$. $M$ training examples and $N$ features,
one of which is a dummy feature for the intercept. \\
$\theta \in \mathbb{R}^N$ - coefficients \\
$y \in \{0, 1\}^M$ - response variable ($M \times 1$ vector) \\
$X \in \{0, 1\}^{M \times N}$ - $M \times N$ design matrix \\
$e \in \{1\}^M$ - all-ones vector \\
$f(x) = \frac{1}{1+e^{-x}}$ - logistic function \\
$l(\theta)$ - loss function \\
$l(\theta) = - e^T \left(y \odot \log(p) + (1-y) \odot \log(1-p)\right) / M $,
where $p = f(X \theta)$,
i.e.\ the average log-loss.
$$r = f(X \theta) - y$$
$$\nabla l = \frac{X^T r}{M} $$
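As a sanity check, the average log-loss and its gradient $X^T r / M$ can be verified against a finite-difference approximation. This is an illustrative sketch (variable names and the random toy data are made up), using \texttt{scipy.optimize.check\_grad}:

```python
import numpy as np
from scipy.optimize import check_grad

def loss(theta, X, y):
    """Average log-loss: -e^T (y*log(p) + (1-y)*log(1-p)) / M."""
    p = 1.0 / (1.0 + np.exp(-(X @ theta)))  # p = f(X theta)
    return -np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)) / len(y)

def grad(theta, X, y):
    """Gradient X^T r / M, where r = f(X theta) - y."""
    p = 1.0 / (1.0 + np.exp(-(X @ theta)))
    return X.T @ (p - y) / len(y)

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(50, 4)).astype(float)  # binary design matrix
y = rng.integers(0, 2, size=50).astype(float)
theta0 = 0.1 * rng.normal(size=4)  # small so p stays away from 0 and 1

# check_grad returns the norm of (analytic - numeric) gradient difference
err = check_grad(loss, grad, theta0, X, y)
```

A small `err` (on the order of the finite-difference step) confirms the analytic gradient matches the loss.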
\section*{Standard regularization}
No regularization on the intercept. $M$ training examples and $N$ features. \\
$\theta \in \mathbb{R}^N$ - coefficients \\
$\theta_0 \in \mathbb{R}$ - intercept \\
$y \in \{0, 1\}^M$ - response variable ($M \times 1$ vector) \\
$w \in (0, \infty)^M $ - per-example weights ($M \times 1$ vector) \\
$X \in \{0, 1\}^{M \times N}$ - $M \times N$ design matrix \\
$\lambda \ge 0$ - regularization strength \\
$f(x) = \frac{1}{1+e^{-x}}$ - logistic function \\
$l(\theta)$ - loss function \\
$l(\theta) = \left( -w^T \left(y \odot \log(p) + (1-y) \odot \log(1-p)\right) + \frac{\lambda}{2} \theta^T \theta \right) / \sum w $,
where $p = f(X \theta + \theta_0)$,
i.e.\ the average of the usual log-loss, weighted by $w$, plus an $L_2$ penalty
that excludes the intercept.
$$r = f(X \theta + \theta_0) - y$$
$$\nabla_0 l = \frac{r^T w}{\sum w} $$
$$\nabla l = \frac{X^T (w \odot r) + \lambda \theta}{ \sum w} $$
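The weighted, regularized loss and both gradients above plug directly into a batch L-BFGS solver. The sketch below (names and toy data are illustrative, not from the original) packs $\theta_0$ and $\theta$ into one parameter vector and hands the pair (loss, gradient) to \texttt{scipy.optimize.minimize}:

```python
import numpy as np
from scipy.optimize import minimize

def loss_and_grad(params, X, y, w, lam):
    """Weighted, L2-regularized log-loss and its gradient.

    params[0] is the intercept theta_0 (not regularized); params[1:] is theta.
    """
    theta0, theta = params[0], params[1:]
    p = 1.0 / (1.0 + np.exp(-(X @ theta + theta0)))  # p = f(X theta + theta_0)
    sw = w.sum()
    eps = 1e-12  # keeps log() finite if p hits exactly 0 or 1
    loss = (-(w @ (y * np.log(p + eps) + (1.0 - y) * np.log(1.0 - p + eps)))
            + 0.5 * lam * theta @ theta) / sw
    r = p - y
    g0 = (r @ w) / sw                       # nabla_0 l
    g = (X.T @ (w * r) + lam * theta) / sw  # nabla l
    return loss, np.concatenate(([g0], g))

# Toy problem with made-up per-example weights, purely for illustration.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 5)).astype(float)
y = (X[:, 0] + 0.3 * rng.normal(size=200) > 0.5).astype(float)
w = rng.uniform(0.5, 2.0, size=200)

res = minimize(loss_and_grad, np.zeros(X.shape[1] + 1),
               args=(X, y, w, 1.0), jac=True, method="L-BFGS-B")
theta_hat = res.x  # [theta_0, theta_1, ..., theta_N]
```

Passing \texttt{jac=True} tells the solver the objective returns the gradient alongside the loss, so nothing is recomputed; dividing the penalty by $\sum w$ keeps the effective regularization strength stable when the weight scale changes.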
\end{document}