
Commit

update
tingleshao committed Apr 3, 2013
1 parent 7a542fe commit 50217da
Showing 3 changed files with 37 additions and 14 deletions.
18 changes: 9 additions & 9 deletions text.aux
@@ -9,15 +9,15 @@
\@writefile{toc}{\contentsline {subsection}{\numberline {2.2}PAC Model}{4}}
\@writefile{toc}{\contentsline {subsection}{\numberline {2.3}Examples}{5}}
\@writefile{toc}{\contentsline {subsection}{\numberline {2.4}Rademacher Complexity}{8}}
\@writefile{toc}{\contentsline {subsection}{\numberline {2.5}Growth Function}{9}}
\@writefile{toc}{\contentsline {subsection}{\numberline {2.5}Growth Function}{10}}
\@writefile{toc}{\contentsline {subsection}{\numberline {2.6}VC-dimension}{10}}
\@writefile{toc}{\contentsline {section}{\numberline {3}Support Vector Machines}{11}}
\@writefile{toc}{\contentsline {subsection}{\numberline {3.1}Formulation}{11}}
\@writefile{toc}{\contentsline {subsection}{\numberline {3.2}Leave-one-out analysis}{12}}
\@writefile{toc}{\contentsline {subsection}{\numberline {3.3}Soft Margin SVM Formulation}{13}}
\@writefile{toc}{\contentsline {subsection}{\numberline {3.4}Margin Theory}{13}}
\@writefile{toc}{\contentsline {subsection}{\numberline {3.5}Generalization Bound}{13}}
\@writefile{toc}{\contentsline {section}{\numberline {4}Distance Weighted Discrimination}{14}}
\@writefile{toc}{\contentsline {subsection}{\numberline {4.1}Formulation}{14}}
\@writefile{toc}{\contentsline {subsection}{\numberline {4.2}Weighted DWD}{15}}
\@writefile{toc}{\contentsline {section}{\numberline {5}References}{16}}
\@writefile{toc}{\contentsline {subsection}{\numberline {3.2}Leave-one-out analysis}{13}}
\@writefile{toc}{\contentsline {subsection}{\numberline {3.3}Soft Margin SVM Formulation}{14}}
\@writefile{toc}{\contentsline {subsection}{\numberline {3.4}Margin Theory}{14}}
\@writefile{toc}{\contentsline {subsection}{\numberline {3.5}Generalization Bound}{14}}
\@writefile{toc}{\contentsline {section}{\numberline {4}Distance Weighted Discrimination}{15}}
\@writefile{toc}{\contentsline {subsection}{\numberline {4.1}Formulation}{15}}
\@writefile{toc}{\contentsline {subsection}{\numberline {4.2}Weighted DWD}{16}}
\@writefile{toc}{\contentsline {section}{\numberline {5}References}{17}}
12 changes: 8 additions & 4 deletions text.tex
@@ -69,8 +69,8 @@ \subsection{PAC Model}
(It can be shown that this inequality can be used to bound the generalization error.)
\subsection{Examples}
1. Finite $H$, consistent case \\[0.2cm]
Finite $H$ means that the number of hypotheses $h$ that can be returned is finite. Consistent means that the returned hypothesis makes no error on the training sample $S$.\\[0.2cm]
[2] Let $H$ be a finite set of functions mapping from $\mathcal{X}$ to $\mathcal{Y}$. Let $\mathcal{A}$ be an algorithm that, for any target concept $c \in H$ and \emph{i.i.d.} sample $S$, returns a consistent hypothesis $h_{S}$: $\widehat{R}(h_S) = 0$. Then, for any $\epsilon, \delta > 0$, the inequality $\text{Pr}_{S \sim D^m}[R(h_S) \leq \epsilon] \geq 1-\delta$ holds if
Finite $H$ means that the number of hypotheses $h$ that can be returned is finite. Consistent means that the returned hypothesis makes no error on the training sample $S$ [2].\\[0.2cm]
Let $H$ be a finite set of functions mapping from $\mathcal{X}$ to $\mathcal{Y}$. Let $\mathcal{A}$ be an algorithm that, for any target concept $c \in H$ and \emph{i.i.d.} sample $S$, returns a consistent hypothesis $h_{S}$: $\widehat{R}(h_S) = 0$. Then, for any $\epsilon, \delta > 0$, the inequality $\text{Pr}_{S \sim D^m}[R(h_S) \leq \epsilon] \geq 1-\delta$ holds if
\[m \geq \frac{1}{\epsilon}\left(\log|H|+\log\frac{1}{\delta}\right)\] \\[0.2cm]
Then we can obtain an inequality for $R(h_S)$:
\[R(h_S) \leq \frac{1}{m}\left(\log|H| + \log\frac{1}{\delta}\right)\]
@@ -90,8 +90,12 @@ \subsection{Examples}
\[m\epsilon \geq \log|H| + \log\frac{1}{\delta}\]
\[m \geq \frac{1}{\epsilon}\left(\log|H| + \log\frac{1}{\delta}\right)\]
\[\text{Q.E.D.}\]\\[0.2cm]
2. Finite $H$, inconsistent case (later) \\[0.2cm]
We can get an inequality for $R(h)$. \\[0.2cm]
This shows that the concept class is learnable, and that the required sample size depends on the size of the hypothesis set, the degree of confidence the learner requires, and the target accuracy. In particular, any concept class that admits a consistent learning algorithm using a finite hypothesis set is PAC-learnable.\\[0.2cm]
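As a quick illustration of the sample-complexity bound (the numbers here are made up, not taken from the text): with $|H| = 1000$, $\delta = 0.05$ and $\epsilon = 0.1$,
\[m \geq \frac{1}{0.1}\left(\log 1000 + \log\frac{1}{0.05}\right) \approx \frac{6.91 + 3.00}{0.1} \approx 100,\]
so roughly one hundred training examples suffice for a consistent learner to reach error at most $0.1$ with confidence $95\%$.\\[0.2cm]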
2. Finite $H$, inconsistent case\\[0.2cm]
When no hypothesis in $H$ achieves zero error on the training data, we are in the inconsistent case. In this situation the bound must involve the empirical error $\widehat{R}(h)$; such an inequality is called a \textbf{generalization bound}.\\[0.2cm]
Let $H$ be a finite hypothesis set. Then, for any $\delta > 0$, with probability at least $1-\delta$ the following inequality holds:
\[\forall h \in H, \quad R(h) \leq \widehat{R}(h) + \sqrt{\frac{\log|H|+\log\frac{2}{\delta}}{2m}}\]
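To see how this bound behaves numerically, the short Python snippet below evaluates its right-hand side; the values of $|H|$, $m$, $\delta$ and $\widehat{R}(h)$ are illustrative assumptions only, not taken from the text.
\begin{verbatim}
import math

# Illustrative values only (assumptions for the example).
H_size   = 1000    # |H|, size of the finite hypothesis set
m        = 500     # number of training examples
delta    = 0.05    # confidence parameter
emp_risk = 0.08    # empirical error R_hat(h) of the returned hypothesis

# Inconsistent-case generalization bound:
# R(h) <= R_hat(h) + sqrt((log|H| + log(2/delta)) / (2m))
slack = math.sqrt((math.log(H_size) + math.log(2.0 / delta)) / (2 * m))
print("R(h) <=", round(emp_risk + slack, 3))   # prints R(h) <= 0.183
\end{verbatim}
Doubling $m$ shrinks the square-root term by a factor of $\sqrt{2}$, the usual $O(1/\sqrt{m})$ behaviour of such bounds.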

\textbf{Link to the idea of Occam's razor principle} (later)\\[0.2cm]
The basic idea behind invoking Occam's razor is that, of two learning algorithms that perform equally well in terms of empirical error on a given data set, the less complex one is preferred. For example, an algorithm that produces a linear separating hyperplane is preferred to one that yields a nonlinear separating boundary. This is made precise in the next section by examining the generalization bound based on Rademacher complexity: for a learning algorithm whose function family has smaller Rademacher complexity, the bound is tighter.
\subsection{Rademacher Complexity}
21 changes: 20 additions & 1 deletion text.tex.bak
@@ -74,8 +74,27 @@ Finite $H$ means that the number of possible returned hypothesis $h$ is finite.
\[m \geq \frac{1}{\epsilon}\left(\log|H|+\log\frac{1}{\delta}\right)\] \\[0.2cm]
Then we can obtain an inequality for $R(h_S)$:
\[R(h_S) \leq \frac{1}{m}\left(\log|H| + \log\frac{1}{\delta}\right)\]
\textbf{Proof:}\\[0.2cm]
Fix $\epsilon > 0$ and write $H = \{h_1, h_2, \dots\}$. We bound the probability that some hypothesis in $H$ is consistent with the sample yet has generalization error larger than $\epsilon$:
\[\text{Pr}[\exists h \in H: \widehat{R}(h) = 0 \wedge R(h) > \epsilon]\]
\[= \text{Pr}[(\widehat{R}(h_1)=0 \wedge R(h_1) > \epsilon) \vee (\widehat{R}(h_2) = 0 \wedge R(h_2) > \epsilon) \vee \dots]\]
\[\leq \sum_{h\in H}\text{Pr}[\widehat{R}(h) = 0 \wedge R(h)> \epsilon]\]
\[\leq \sum_{h\in H}\text{Pr}[\widehat{R}(h) = 0 \,|\, R(h)> \epsilon]\]
where the first inequality is the union bound and the second uses $\text{Pr}[A \wedge B] = \text{Pr}[A \,|\, B]\,\text{Pr}[B] \leq \text{Pr}[A \,|\, B]$.
For any $h$ with $R(h) > \epsilon$, each of the $m$ \emph{i.i.d.} training points misses the error region of $h$ (a region of probability mass greater than $\epsilon$) with probability at most $1-\epsilon$, so the probability of having $\widehat{R}(h) = 0$ is bounded by:
\[\text{Pr}[\widehat{R}(h)=0 \,|\, R(h)>\epsilon] \leq (1-\epsilon)^m\]
Applying this bound over the finite hypothesis set $H$ and using $1-\epsilon \leq e^{-\epsilon}$:
\[\text{Pr}[\exists h \in H: \widehat{R}(h) = 0 \wedge R(h) > \epsilon] \leq |H|(1-\epsilon)^m \leq |H|e^{-m\epsilon}\]
Hence $\text{Pr}[R(h_S) \leq \epsilon] \geq 1 - |H|e^{-m\epsilon}$, and requiring the right-hand side to be at least $1-\delta$ gives
\[\delta \geq |H|e^{-m\epsilon}\]
\[1 \geq \frac{|H|}{\delta}e^{-m\epsilon}\]
\[m\epsilon \geq \log|H| + \log\frac{1}{\delta}\]
\[m \geq \frac{1}{\epsilon}\left(\log|H| + \log\frac{1}{\delta}\right)\]
\[\text{Q.E.D.}\]\\[0.2cm]
This shows that the concept class is learnable, and that the required sample size depends on the size of the hypothesis set, the degree of confidence the learner requires, and the target accuracy. In particular, any concept class that admits a consistent learning algorithm using a finite hypothesis set is PAC-learnable.
2. Finite $H$, inconsistent case (later) \\[0.2cm]
We can get an inequality for $R(h)$. \\[0.2cm]
3. \textbf{Generalization bound}

\textbf{Link to the idea of Occam's razor principle} (later)\\[0.2cm]
The basic idea behind invoking Occam's razor is that, of two learning algorithms that perform equally well in terms of empirical error on a given data set, the less complex one is preferred. For example, an algorithm that produces a linear separating hyperplane is preferred to one that yields a nonlinear separating boundary. This is made precise in the next section by examining the generalization bound based on Rademacher complexity: for a learning algorithm whose function family has smaller Rademacher complexity, the bound is tighter.
\subsection{Rademacher Complexity}
@@ -115,7 +134,7 @@ For a linearly separable data set, the goal is to find a hyperplane that success
\[\underset{\mathbf{w}, b}{\mathrm{min}} \frac{1}{2}\|\mathbf{w}\|^2\]
\[\text{subject to: } y_i(\mathbf{w\cdot x_i}+b) \geq 1, \forall i \in [1,m]\]
Note that we can rescale $\mathbf{w}$ and $b$ so that $\underset{(x,y)\in S}{\mathrm{min}}|\mathbf{w\cdot x}+b| = 1$; with this normalization the geometric margin equals $1/\|\mathbf{w}\|$, so minimizing $\frac{1}{2}\|\mathbf{w}\|^2$ maximizes the margin.
The objective function is quadratic and the constraints are affine, so this is a quadratic programming (QP) problem. \\[0.2cm]
The objective function is quadratic and the constraints are affine, so this is a quadratic programming (QP) problem, which can be solved efficiently with standard off-the-shelf optimization software. \\[0.2cm]
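As one way to see this concretely, the problem can be passed almost verbatim to a generic convex solver. The sketch below uses Python with the \texttt{cvxpy} package on a small made-up linearly separable data set; the data and variable names are assumptions for illustration only.
\begin{verbatim}
import numpy as np
import cvxpy as cp

# Toy linearly separable data in R^2 (made up for illustration).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
m, d = X.shape

w = cp.Variable(d)
b = cp.Variable()

# Hard-margin SVM primal:
#   minimize (1/2)||w||^2  subject to  y_i (w . x_i + b) >= 1.
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

print("w =", w.value, "b =", b.value)
\end{verbatim}
At the optimum only the support vectors satisfy the constraints with equality, $y_i(\mathbf{w\cdot x_i}+b)=1$.\\[0.2cm]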
\textbf{Lagrangian, KKT conditions and dual form: (later)} \\[0.2cm]
For this problem, we can write the Lagrangian:
\[\mathcal{L}(\mathbf{w},b,\mathbf{\alpha})=\frac{1}{2}\|\mathbf{w}\|^2-\sum_{i=1}^{m}\alpha_i [y_i(\mathbf{w\cdot x_i}+b)-1]\]
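As a brief sketch of the step deferred above (standard Lagrangian duality, outlined here only for orientation): setting the gradients of $\mathcal{L}$ with respect to $\mathbf{w}$ and $b$ to zero gives
\[\nabla_{\mathbf{w}}\mathcal{L} = \mathbf{w} - \sum_{i=1}^{m}\alpha_i y_i \mathbf{x}_i = 0 \;\Rightarrow\; \mathbf{w} = \sum_{i=1}^{m}\alpha_i y_i \mathbf{x}_i, \qquad \frac{\partial \mathcal{L}}{\partial b} = -\sum_{i=1}^{m}\alpha_i y_i = 0,\]
and substituting these back into $\mathcal{L}$ leaves a maximization problem over the multipliers $\alpha_i \geq 0$ alone, which is the dual form referred to above.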
