First commit: finished writing the perceptron chapter.
soulmachine committed May 15, 2013
1 parent 0d3c3a2 commit eb96d6f
Showing 26 changed files with 8,140 additions and 2 deletions.
22 changes: 20 additions & 2 deletions README.md
@@ -1,4 +1,22 @@
machine-learning-cheat-sheet
Machine learning cheat sheet
============================

classical equations and diagrams of machine learning
This cheat sheet contains many classical equations and diagrams of machine learning, which will help you quickly recall the key knowledge and ideas.

The cheat sheet will also appeal to anyone preparing for a job interview related to machine learning.

##LaTeX template
This open-source book adopts the [Springer LaTeX template](http://www.springer.com/authors/book+authors?SGWID=0-154102-12-970131-0).

##How to compile on Windows
1. Install [TeX Live 2012](http://www.tug.org/texlive/), then add its `bin` path, for example `D:\texlive\2012\bin\win32`, to the PATH environment variable.
2. Install [TeXstudio](http://texstudio.sourceforge.net/).
3. Configure TeXstudio.
Run TeXstudio, click `Options-->Configure Texstudio-->Commands`, and set `XeLaTeX` to `xelatex -synctex=1 -interaction=nonstopmode %.tex`.

Click `Options-->Configure Texstudio-->Build`,
set `Build & View` to `Compile & View`,
set `Default Compiler` to `XeLaTeX`, and
set `PDF Viewer` to `Internal PDF Viewer(windowed)`, so that the preview opens in a standalone window, which is convenient.
4. Compile. Open `main.tex` with TeXstudio, click the green arrow on the menu bar, and compilation will start.
In the messages window below you can see that the compilation command TeXstudio uses is `xelatex -synctex=1 -interaction=nonstopmode "ACM-cheat-sheet".tex`.
11 changes: 11 additions & 0 deletions acknow.tex
@@ -0,0 +1,11 @@
%%%%%%%%%%%%%%%%%%%%%%acknow.tex%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% sample acknowledgement chapter
%
% Use this file as a template for your own input.
%
%%%%%%%%%%%%%%%%%%%%%%%% Springer %%%%%%%%%%%%%%%%%%%%%%%%%%

\extrachap{Acknowledgements}

Use the template \emph{acknow.tex} together with the Springer document class SVMono (monograph-type books) or SVMult (edited books) if you prefer to set your acknowledgement section as a separate chapter instead of including it as last part of your preface.

18 changes: 18 additions & 0 deletions acronym.tex
@@ -0,0 +1,18 @@
%%%%%%%%%%%%%%%%%%%%%%acronym.tex%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% sample list of acronyms
%
% Use this file as a template for your own input.
%
%%%%%%%%%%%%%%%%%%%%%%%% Springer %%%%%%%%%%%%%%%%%%%%%%%%%%

\extrachap{Acronyms}

Use the template \emph{acronym.tex} together with the Springer document class SVMono (monograph-type books) or SVMult (edited books) to style your list(s) of abbreviations or symbols in the Springer layout.

Lists of abbreviations\index{acronyms, list of}, symbols\index{symbols, list of} and the like are easily formatted with the help of the Springer-enhanced \verb|description| environment.

\begin{description}[CABR]
\item[ABC]{Spelled-out abbreviation and definition}
\item[BABI]{Spelled-out abbreviation and definition}
\item[CABR]{Spelled-out abbreviation and definition}
\end{description}
79 changes: 79 additions & 0 deletions appendix.tex
@@ -0,0 +1,79 @@
%%%%%%%%%%%%%%%%%%%%% appendix.tex %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% sample appendix
%
% Use this file as a template for your own input.
%
%%%%%%%%%%%%%%%%%%%%%%%% Springer-Verlag %%%%%%%%%%%%%%%%%%%%%%%%%%

\chapter{Chapter Heading}
\label{introA} % Always give a unique label
% use \chaptermark{}
% to alter or adjust the chapter heading in the running head

Use the template \emph{appendix.tex} together with the Springer document class SVMono (monograph-type books) or SVMult (edited books) to style the appendix of your book in the Springer layout.


\section{Section Heading}
\label{sec:A1}
% Always give a unique label
% and use \ref{<label>} for cross-references
% and \cite{<label>} for bibliographic references
% use \sectionmark{}
% to alter or adjust the section heading in the running head
Instead of simply listing headings of different levels we recommend to let every heading be followed by at least a short passage of text. Further on please use the \LaTeX\ automatism for all your cross-references and citations.


\subsection{Subsection Heading}
\label{sec:A2}
Instead of simply listing headings of different levels we recommend to let every heading be followed by at least a short passage of text. Further on please use the \LaTeX\ automatism for all your cross-references and citations as has already been described in Sect.~\ref{sec:A1}.

For multiline equations we recommend to use the \verb|eqnarray| environment.
\begin{eqnarray}
\vec{a}\times\vec{b}=\vec{c} \nonumber\\
\vec{a}\times\vec{b}=\vec{c}
\label{eq:A01}
\end{eqnarray}

\subsubsection{Subsubsection Heading}
Instead of simply listing headings of different levels we recommend to let every heading be followed by at least a short passage of text. Further on please use the \LaTeX\ automatism for all your cross-references and citations as has already been described in Sect.~\ref{sec:A2}.

Please note that the first line of text that follows a heading is not indented, whereas the first lines of all subsequent paragraphs are.

% For figures use
%
\begin{figure}[t]
\sidecaption[t]
% Use the relevant command for your figure-insertion program
% to insert the figure file.
% For example, with the graphicx style use
\includegraphics[scale=.65]{figure}
%
% If no graphics program available, insert a blank space i.e. use
%\picplace{5cm}{2cm} % Give the correct figure height and width in cm
%
\caption{Please write your figure caption here}
\label{fig:A1} % Give a unique label
\end{figure}

% For tables use
%
\begin{table}
\caption{Please write your table caption here}
\label{tab:A1} % Give a unique label
%
% Follow this input for your own table layout
%
\begin{tabular}{p{2cm}p{2.4cm}p{2cm}p{4.9cm}}
\hline\noalign{\smallskip}
Classes & Subclass & Length & Action Mechanism \\
\noalign{\smallskip}\hline\noalign{\smallskip}
Translation & mRNA$^a$ & 22 (19--25) & Translation repression, mRNA cleavage\\
Translation & mRNA cleavage & 21 & mRNA cleavage\\
Translation & mRNA & 21--22 & mRNA cleavage\\
Translation & mRNA & 24--26 & Histone and DNA Modification\\
\noalign{\smallskip}\hline\noalign{\smallskip}
\end{tabular}
$^a$ Table foot note (with superscript)
\end{table}
%
16 changes: 16 additions & 0 deletions cblist.tex
@@ -0,0 +1,16 @@
%%%%%%%%%%%%%%%%%%%%clist.tex %%%%%%%%%%%%%%%%%%%%%%%%
%
% sample list of contributors and their addresses
%
% Use this file as a template for your own input.
%
%%%%%%%%%%%%%%%%%%%%%%%% Springer %%%%%%%%%%%%%%%%%%%%
\contributors

\begin{thecontriblist}
Firstname Surname
\at ABC Institute, 123 Prime Street, Daisy Town, NA 01234, USA, \email{smith@smith.edu}
\and
Firstname Surname
\at XYZ Institute, Technical University, Albert-Schweitzer-Str. 34, 1000 Berlin, Germany, \email{meier@tu.edu}
\end{thecontriblist}
127 changes: 127 additions & 0 deletions chapterIntroduction.tex
@@ -0,0 +1,127 @@
\chapter{Introduction}

\section{Types of machine learning}
\begin{equation}\nonumber
\text{Machine learning}\begin{cases}
\text{Supervised learning} \begin{cases} \text{Classification} \\ \text{Regression} \end{cases}\\
\text{Unsupervised learning} \begin{cases} \text{Discovering clusters} \\ \text{Discovering latent factors} \\ \text{Discovering graph structure} \\ \text{Matrix completion} \end{cases}\\
\end{cases}
\end{equation}

\section{Three elements of a machine learning method}

\textbf{method = model + strategy + algorithm}

\subsection{Model}
In supervised learning, a model is the decision function or conditional probability distribution to be learned. The model's hypothesis space contains all candidate decision functions $f(x)$ or conditional probability distributions $P(y|\vec{x})$.

\subsection{Strategy}
Given a model's hypothesis space, we need a strategy to select which hypothesis is optimal.

\subsubsection{Loss function and risk function}

\begin{definition}
In order to measure how well a function fits the training data, a \textbf{loss function} $L:Y \times Y \rightarrow [0,+\infty)$ is defined. For a training example $(x_i,y_i)$, the loss of predicting the value $\widehat{y}$ is $L(y_i,\widehat{y})$.
\end{definition}

The following are some common loss functions:
\begin{enumerate}
\item 0-1 loss function $L(Y,f(X))=I(Y \neq f(X))=\begin{cases} 1, & Y \neq f(X) \\ 0, & Y=f(X) \end{cases}$
\item Quadratic loss function $L(Y,f(X))=\left(Y-f(X)\right)^2$
\item Absolute loss function $L(Y,f(X))=\abs{Y-f(X)}$
\item Logarithmic loss function $L(Y,P(Y|X))=-\log{P(Y|X)}$
\end{enumerate}
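
As a quick worked example (with hypothetical numbers): if $y_i=1$ and a regressor predicts $\widehat{y}=0.8$, the quadratic loss is $(1-0.8)^2=0.04$ and the absolute loss is $0.2$; if a probabilistic classifier outputs $P(y_i=1|\vec{x}_i)=0.8$, the logarithmic loss is $-\log 0.8 \approx 0.22$ (natural logarithm), while the 0-1 loss is $0$ whenever the predicted label equals $y_i$.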

\begin{definition}
The risk of function $f$ is defined as the expected loss of $f$:
\begin{equation}
R_{exp}(f)=E_p\left[L\left(Y,f(X)\right)\right]=\int _{X \times Y} L\left(y,f(x)\right)P(x,y)dxdy
\end{equation}
which is also called expected loss or \textbf{risk function}.
\end{definition}

\begin{definition}
The risk function $R_{exp}(f)$ can be estimated from the training data as
\begin{equation}
R_{emp}(f)=\dfrac{1}{N}\sum\limits_{i=1}^{N} L\left(y_i,f(x_i)\right)
\end{equation}
which is also called empirical loss or \textbf{empirical risk}.
\end{definition}

You can define your own loss function, but if you're a novice, you're probably better off using one from the literature. There are conditions that loss functions should meet\footnote{\url{http://t.cn/zTrDxLO}}:
\begin{enumerate}
\item They should approximate the actual loss you're trying to minimize. The standard loss function for classification is the zero-one loss (misclassification rate), and the losses used for training classifiers are approximations of it.
\item The loss function should work with your intended optimization algorithm. That's why the zero-one loss is not used directly: it doesn't work with gradient-based optimization methods, since it has neither a well-defined gradient nor even a subgradient (unlike, say, the hinge loss used by SVMs).

The main algorithm that optimizes the zero-one loss directly is the old perceptron algorithm (Chapter~\ref{chap:Perceptron}).
\end{enumerate}

\subsubsection{ERM and SRM}
\begin{definition}
ERM (empirical risk minimization)
\begin{equation}
\min\limits _{f \in \mathcal{F}} R_{emp}(f)=\min\limits _{f \in \mathcal{F}} \dfrac{1}{N}\sum\limits_{i=1}^{N} L\left(y_i,f(x_i)\right)
\end{equation}
\end{definition}

\begin{definition}
Structural risk
\begin{equation}
R_{srm}(f)=\dfrac{1}{N}\sum\limits_{i=1}^{N} L\left(y_i,f(x_i)\right) +\lambda J(f)
\end{equation}
where $J(f)$ measures the complexity of the model and $\lambda \geq 0$ is a coefficient that trades off empirical risk against model complexity.
\end{definition}

\begin{definition}
SRM (structural risk minimization)
\begin{equation}
\min\limits _{f \in \mathcal{F}} R_{srm}(f)=\min\limits _{f \in \mathcal{F}} \dfrac{1}{N}\sum\limits_{i=1}^{N} L\left(y_i,f(x_i)\right) +\lambda J(f)
\end{equation}
\end{definition}
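
A familiar instance of SRM is $L_2$-regularized least squares (ridge regression, mentioned at the end of the linear regression section below), which combines the quadratic loss with the regularizer $J(f)=\abs{\abs{\vec{w}}}^2$.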

\subsection{Algorithm}
The algorithm here means the training (or learning) algorithm, i.e., the concrete procedure used to compute the hypothesis that the strategy deems optimal.

\section{Cross validation}
\begin{definition}
\textbf{Cross validation}, sometimes called \emph{rotation estimation}, is a \emph{model validation} technique for assessing how the results of a statistical analysis will generalize to an independent data set\footnote{\url{http://en.wikipedia.org/wiki/Cross-validation_(statistics)}}.
\end{definition}

Common types of cross-validation:
\begin{enumerate}
\item K-fold cross-validation. The original sample is randomly partitioned into k equal-sized subsamples. Of the k subsamples, a single subsample is retained as the validation data for testing the model, and the remaining k-1 subsamples are used as training data; the process is repeated k times so that each subsample is used exactly once for validation, and the k results are averaged.
\item 2-fold cross-validation. Also called simple cross-validation or the holdout method, this is the simplest variation of k-fold cross-validation, with k=2.
\item Leave-one-out cross-validation (\emph{LOOCV}): k=M, the number of original samples.
\end{enumerate}
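
For example (with made-up numbers), with $M=100$ samples and $k=5$, each fold holds 20 samples; in each of the 5 rounds the model is trained on the other 80 samples and validated on the held-out 20, and the 5 validation errors are averaged into a single estimate of generalization performance.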

\section{Linear Regression}
Given
\begin{equation}
\begin{array}{lcl}
\mathcal{D}=\left\{(\vec{x}_i,y_i) | i=1:M\right\} \\
\mathcal{H}=\left\{f(\vec{x}_i)=\vec{w}^T\vec{x}_i+b | i=1:M\right\}\\
L(\vec{w},b)=\sum\limits_{i=1}^{M} \left(y_i-f(\vec{x}_i)\right)^2\\
\end{array}
\end{equation}

Let $\widehat{\vec{w}}=\left(\vec{w}^T,b\right)^T$, and
\begin{equation}
\widehat{\vec{X}}=\left(\begin{array}{lcr}
\widehat{\vec{x}}_1^T\\
\widehat{\vec{x}}_2^T\\
\vdots \\
\widehat{\vec{x}}_M^T\\
\end{array}
\right), \text{ where } \widehat{\vec{x}}_i=\left(\vec{x}_i^T,1\right)^T
\end{equation}

We can get
\begin{equation}
\begin{array}{lcr}
L(\widehat{\vec{w}})=\left(\vec{y}-\widehat{\vec{X}}\widehat{\vec{w}}\right)^T\left(\vec{y}-\widehat{\vec{X}}\widehat{\vec{w}}\right)\\
\dfrac{\partial L}{\partial{\widehat{\vec{w}}}}=-2\widehat{\vec{X}}^T\vec{y}+2\widehat{\vec{X}}^T\widehat{\vec{X}}\widehat{\vec{w}}=0\\
\widehat{\vec{X}}^T\vec{y}=\widehat{\vec{X}}^T\widehat{\vec{X}}\widehat{\vec{w}}\\
\widehat{\vec{w}}=\left(\widehat{\vec{X}}^T\widehat{\vec{X}}\right)^{-1}\widehat{\vec{X}}^T\vec{y}
\end{array}
\end{equation}
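
The gradient step above uses the standard matrix calculus identities $\dfrac{\partial}{\partial \vec{a}}\left(\vec{b}^T\vec{a}\right)=\vec{b}$ and $\dfrac{\partial}{\partial \vec{a}}\left(\vec{a}^T\vec{B}\vec{a}\right)=2\vec{B}\vec{a}$ for symmetric $\vec{B}$, applied with $\vec{b}=\widehat{\vec{X}}^T\vec{y}$ and $\vec{B}=\widehat{\vec{X}}^T\widehat{\vec{X}}$.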

If $\widehat{\vec{X}}^T\widehat{\vec{X}}$ is singular, the pseudo-inverse can be used, or else the technique of ridge regression described below can be applied.
98 changes: 98 additions & 0 deletions chapterPerceptron.tex
@@ -0,0 +1,98 @@
\chapter{Perceptron}
\label{chap:Perceptron}

\section{Model}
\begin{equation}
\mathcal{H}:f(\vec{x})=\text{sign}(\vec{w} \cdot \vec{x}+b)
\end{equation}
where $\text{sign}(x)=\begin{cases}+1, & x \geq 0\\-1, & x<0\end{cases}$, see Fig.~\ref{fig:perceptron}\footnote{\url{https://en.wikipedia.org/wiki/Perceptron}}.
\begin{figure}[hbtp]
\centering
\includegraphics[scale=.50]{figures/perceptron.png}
\caption{Perceptron}
\label{fig:perceptron}
\end{figure}

The perceptron is a binary linear classifier; it is a discriminative model.

\section{Strategy}
\begin{eqnarray}
L(\vec{w},b)&=&-y_i(\vec{w} \cdot \vec{x}_i+b)\\
R_{emp}(f)&=&-\sum\limits_{\vec{x}_i \in \mathcal{M}} y_i(\vec{w} \cdot \vec{x}_i+b)
\end{eqnarray}
where $\mathcal{M}$ is the set of misclassified examples, i.e. those with $y_i(\vec{w} \cdot \vec{x}_i+b) \leq 0$.
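
A geometric remark: for a misclassified example, $-y_i(\vec{w} \cdot \vec{x}_i+b)=\abs{\vec{w} \cdot \vec{x}_i+b}$, which is $\abs{\abs{\vec{w}}}$ times the distance from $\vec{x}_i$ to the hyperplane $\vec{w} \cdot \vec{x}+b=0$. Minimizing $R_{emp}$ therefore pushes misclassified points toward, and eventually across, the decision boundary.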

\section{Learning algorithm}
\subsection{Primal form}
The learning algorithm is stochastic gradient descent; the pseudocode is as follows:
\begin{algorithm}[htbp]
%\SetAlgoLined
\SetAlgoNoLine

$\vec{w} \leftarrow 0;\; b \leftarrow 0;\; k \leftarrow 0$\;
\While{no mistakes made within the for loop}{
\For{$i\leftarrow 1$ \KwTo $N$}{
\If{$y_i(\vec{w}^T\vec{x}_i+b) \leq 0$}{
$\vec{w} \leftarrow \vec{w}+\eta y_i \vec{x}_i$\;
$b \leftarrow b+\eta y_i$\;
$k \leftarrow k+1$\;
}
}
}
\caption{Perceptron learning algorithm, primal form}
\end{algorithm}
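
A brief remark on why a single update helps: for the misclassified example $(\vec{x}_i,y_i)$ that triggered it, the updated parameters satisfy $y_i(\vec{w}' \cdot \vec{x}_i+b')=y_i(\vec{w} \cdot \vec{x}_i+b)+\eta\left(\abs{\abs{\vec{x}_i}}^2+1\right)$, so the functional margin on that example strictly increases. Each such step is exactly a stochastic gradient descent update on the per-example loss above, with learning rate $\eta$.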

\subsection{Convergence}
\begin{theorem}
(\textbf{Novikoff}) If the training data set $\mathcal{D}$ is linearly separable, then
\begin{enumerate}
\item There exists a hyperplane, written in extended form as $\widehat{\vec{w}}_{opt} \cdot \widehat{\vec{x}}=\vec{w}_{opt} \cdot \vec{x}+b_{opt}=0$ with $\abs{\abs{\widehat{\vec{w}}_{opt}}}=1$, which separates all samples correctly, and $\exists\gamma>0$ such that $\forall i,\ y_i(\vec{w}_{opt} \cdot \vec{x}_i+b_{opt}) \geq \gamma$
\item $k \leq \left(\dfrac{R}{\gamma}\right)^2$, where $R=\max\limits_{1 \leq i \leq N} \abs{\abs{\widehat{\vec{x}}_i}}$
\end{enumerate}
\end{theorem}

\begin{proof}
(1) Let $\gamma=\min\limits_{i} y_i(\vec{w}_{opt} \cdot \vec{x}_i+b_{opt})$; since $\mathcal{D}$ is finite and linearly separable, $\gamma>0$ and $y_i(\vec{w}_{opt} \cdot \vec{x}_i+b_{opt}) \geq \gamma$ for all $i$.

(2) The algorithm starts from $\widehat{\vec{w}}_0=0$; whenever an instance is misclassified, the weight is updated. Let $\widehat{\vec{w}}_{k-1}$ denote the extended weight vector before the $k$-th misclassified instance is encountered; then we get
\begin{eqnarray}
y_i(\widehat{\vec{w}}_{k-1} \cdot \widehat{\vec{x}_i})&=&y_i(\vec{w}_{k-1} \cdot \vec{x}_i+b_{k-1}) \leq 0\\
\widehat{\vec{w}}_k&=&\widehat{\vec{w}}_{k-1}+\eta y_i \widehat{\vec{x}_i}
\end{eqnarray}

From these we can infer the following two inequalities; both follow by induction on $k$, as sketched after this list.
\begin{enumerate}
\item $\widehat{\vec{w}}_k \cdot \widehat{\vec{w}}_{opt} \geq k\eta\gamma$
\item $\abs{\abs{\widehat{\vec{w}}_k}}^2 \leq k\eta^2R^2$
\end{enumerate}
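
A brief verification (using the margin condition $y_i(\widehat{\vec{w}}_{opt} \cdot \widehat{\vec{x}}_i) \geq \gamma$, the misclassification condition $y_i(\widehat{\vec{w}}_{k-1} \cdot \widehat{\vec{x}}_i) \leq 0$, and $\abs{\abs{\widehat{\vec{x}}_i}} \leq R$):
\begin{eqnarray}
\nonumber \widehat{\vec{w}}_k \cdot \widehat{\vec{w}}_{opt} &=& \widehat{\vec{w}}_{k-1} \cdot \widehat{\vec{w}}_{opt}+\eta y_i\left(\widehat{\vec{x}}_i \cdot \widehat{\vec{w}}_{opt}\right) \geq \widehat{\vec{w}}_{k-1} \cdot \widehat{\vec{w}}_{opt}+\eta\gamma \geq \cdots \geq k\eta\gamma \\
\nonumber \abs{\abs{\widehat{\vec{w}}_k}}^2 &=& \abs{\abs{\widehat{\vec{w}}_{k-1}}}^2+2\eta y_i\left(\widehat{\vec{w}}_{k-1} \cdot \widehat{\vec{x}}_i\right)+\eta^2\abs{\abs{\widehat{\vec{x}}_i}}^2 \leq \abs{\abs{\widehat{\vec{w}}_{k-1}}}^2+\eta^2R^2 \leq \cdots \leq k\eta^2R^2
\end{eqnarray}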

From the above two inequalities we get
\begin{eqnarray}
\nonumber k\eta\gamma & \leq & \widehat{\vec{w}}_k \cdot \widehat{\vec{w}}_{opt} \leq \abs{\abs{\widehat{\vec{w}}_k}}\abs{\abs{\widehat{\vec{w}}_{opt}}} \leq \sqrt k \eta R \\
\nonumber k^2\gamma^2 & \leq & kR^2 \\
\nonumber \text{i.e. } k & \leq & \left(\dfrac{R}{\gamma}\right)^2
\end{eqnarray}
\end{proof}

\subsection{Dual form}
\begin{eqnarray}
\vec{w}&=&\sum\limits_{i=1}^{N} \alpha_iy_i\vec{x}_i \\
b&=&\sum\limits_{i=1}^{N} \alpha_iy_i \\
f(\vec{x})&=&\text{sign}\left(\sum\limits_{j=1}^{N} \alpha_jy_j\vec{x}_j \cdot \vec{x}+b\right)
\end{eqnarray}

\begin{algorithm}[htbp]
%\SetAlgoLined
\SetAlgoNoLine

$\vec{\alpha} \leftarrow 0;\; b \leftarrow 0;\; k \leftarrow 0$\;
\While{no mistakes made within the for loop}{
\For{$i\leftarrow 1$ \KwTo $N$}{
\If{$y_i\left(\sum\limits_{j=1}^{N} \alpha_jy_j\vec{x}_j \cdot \vec{x}_i+b\right) \leq 0$}{
$\alpha_i \leftarrow \alpha_i+\eta$\;
$b \leftarrow b+\eta y_i$\;
$k \leftarrow k+1$\;
}
}
}
\caption{Perceptron learning algorithm, dual form}
\end{algorithm}
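
A remark on the dual form: $\alpha_i$ ends up equal to $\eta$ times the number of updates triggered by example $i$, and the training data enter the algorithm only through the inner products $\vec{x}_j \cdot \vec{x}_i$, which can be precomputed once and stored as the $N \times N$ Gram matrix $G=\left[\vec{x}_i \cdot \vec{x}_j\right]$.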