Comparing changes

  • 13 commits
  • 14 files changed
  • 0 commit comments
  • 3 contributors
Commits on Jan 06, 2012
Liang Zhang Adding wrapper and quick start tutorial(unfinished) c3c1d89
Commits on Jan 07, 2012
Liang Zhang Fix some small bugs in BST.R 595598e
Liang Zhang fix bug in BST.R 61829ec
Bee-Chung Chen Fix issue #2 and #6 77a115f
Bee-Chung Chen Merge branch 'master' of
https://beechung@github.com/beechung/Latent-Factor-Models.git

Conflicts:
	doc/tutorial.pdf
a1a1d16
Bee-Chung Chen Remove tutorial.pdf from git to prevent conflicts a19f259
Commits on Jan 09, 2012
Liang Zhang add more tutorial text 1064778
Commits on Jan 10, 2012
Bee-Chung Chen revise the tutorial & add predict.bst 8bebe78
beechung revise the tutorial 4fbc640
beechung revise error messages b83b4ec
Commits on Jan 11, 2012
Bee-Chung Chen update tutorial.pdf 1bcbdf7
Bee-Chung Chen Add copy right notice 53f7415
beechung Merge pull request #3 from beechung/master
revise the tutorial and add fit.bst wrapper
7321055
4 README
@@ -3,8 +3,8 @@
Research Code for Fitting Latent Factor Models
########################################################
-Authors: Bee-Chung Chen and Deepak Agarwal
- Yahoo! Research
+Authors: Bee-Chung Chen, Deepak Agarwal and Liang Zhang
+ Yahoo! Labs
I. Introduction
1  doc/.gitignore
@@ -4,3 +4,4 @@
/tutorial.ps
/tutorial.bbl
/tutorial.blg
+/tutorial.pdf
210 doc/quick-start.tex
@@ -1,8 +1,208 @@
-\subsection{Quick Start}
+\subsection{Model Fitting}
\label{sec:bst-quick-start}
-In this section, we describe how to fit latent factor models using this package without much need for familiarity of R.
-\begin{center}
-(To be completed)
-\end{center}
+In this section, we describe how to fit the BST model to the toy dataset using this package, without requiring a deep understanding of the fitting procedure. Before you run the sample code, please make sure you are in the top-level directory of the package (i.e., running the Linux command {\tt ls} there should list the files {\tt LICENSE} and {\tt README}).
+
+\subsubsection{Step 1: Read Data}
+\label{sec:read-data}
+
+We first read the training and test observation tables (named {\tt obs.train} and {\tt obs.test} in the following R script), their corresponding observation feature tables ({\tt x\_obs.train} and {\tt x\_obs.test}), the source feature table ({\tt x\_src}), the destination feature table ({\tt x\_dst}) and the edge context feature table ({\tt x\_ctx}) from the corresponding files. Note that if you replace these tables with your own data, you must not change the column names. If you remove some optional columns, make sure you also remove the corresponding column names. Assuming we use the dense format of the feature files, a sample R script follows.
+{\small\begin{verbatim}
+input.dir = "test-data/multicontext_model/simulated-mtx-uvw-10K"
+obs.train = read.table(paste(input.dir,"/obs-train.txt",sep=""),
+ sep="\t", header=FALSE, as.is=TRUE);
+names(obs.train) = c("src_id", "dst_id", "src_context",
+ "dst_context", "ctx_id", "y");
+x_obs.train = read.table(paste(input.dir,"/dense-feature-obs-train.txt",
+ sep=""), sep="\t", header=FALSE, as.is=TRUE);
+obs.test = read.table(paste(input.dir,"/obs-test.txt",sep=""),
+ sep="\t", header=FALSE, as.is=TRUE);
+names(obs.test) = c("src_id", "dst_id", "src_context",
+ "dst_context", "ctx_id", "y");
+x_obs.test = read.table(paste(input.dir,"/dense-feature-obs-test.txt",
+ sep=""), sep="\t", header=FALSE, as.is=TRUE);
+x_src = read.table(paste(input.dir,"/dense-feature-user.txt",sep=""),
+ sep="\t", header=FALSE, as.is=TRUE);
+names(x_src)[1] = "src_id";
+x_dst = read.table(paste(input.dir,"/dense-feature-item.txt",sep=""),
+ sep="\t", header=FALSE, as.is=TRUE);
+names(x_dst)[1] = "dst_id";
+x_ctx = read.table(paste(input.dir,"/dense-feature-ctxt.txt",sep=""),
+ sep="\t", header=FALSE, as.is=TRUE);
+names(x_ctx)[1] = "ctx_id";
+\end{verbatim}}
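+
+\noindent After reading the data, it is a good idea to sanity-check the loaded tables. For example:
+{\small\begin{verbatim}
+> head(obs.train);  # should show columns src_id, dst_id,
+                    # src_context, dst_context, ctx_id, y
+> dim(x_obs.train); # one feature row per training observation
+\end{verbatim}}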
+
+\subsubsection{Step 2: Fit Model(s)}
+We start by loading the function {\tt fit.bst} defined in {\tt src/R/BST.R}.
+{\small\begin{verbatim}
+> source("src/R/BST.R");
+\end{verbatim}}
+\noindent Then, we can fit a simple latent factor model without any feature using the following command.
+{\small\begin{verbatim}
+> ans = fit.bst(obs.train=obs.train, obs.test=obs.test,
+ out.dir = "/tmp/bst/quick-start", model.name="uvw3",
+ nFactors=3, nIter=10);
+\end{verbatim}}
+\noindent Or, we can fit a model using all the features.
+{\small\begin{verbatim}
+> ans = fit.bst(obs.train=obs.train, obs.test=obs.test,
+ x_obs.train=x_obs.train, x_obs.test=x_obs.test,
+ x_src=x_src, x_dst=x_dst, x_ctx=x_ctx,
+ out.dir = "/tmp/bst/quick-start",
+ model.name="uvw3-F", nFactors=3, nIter=10);
+\end{verbatim}}
+In the above examples, we pass all the loaded data to the fitting function, specify the output directory prefix {\tt /tmp/bst/quick-start}, and fit a model (named {\tt uvw3} or {\tt uvw3-F}). Note that the model name can be arbitrary, and the final output directory for model {\tt uvw3} is {\tt /tmp/bst/quick-start\_uvw3}. This model has 3 factors per node (i.e., $\bm{u}_i$, $\bm{v}_j$ and $\bm{w}_k$ are 3-dimensional vectors) and is fitted using 10 EM iterations.
+If you do not have test data, you can simply omit the input parameters {\tt obs.test} and {\tt x\_obs.test} when calling {\tt fit.bst}.
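+For instance, a minimal training-only call, with no test data and no feature tables, could look like the following (the model name {\tt uv2} is arbitrary).
+{\small\begin{verbatim}
+> ans = fit.bst(obs.train=obs.train,
+        out.dir = "/tmp/bst/quick-start", model.name="uv2",
+        nFactors=2, nIter=10);
+\end{verbatim}}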
+More options and control parameters will be introduced in Section~\ref{sec:fit.bst}.
+
+\subsubsection{Step 3: Check the Output}
+\label{sec:model-output}
+
+The two main output files in an output directory are {\tt summary} and {\tt model.last}.
+
+\parahead{Summary File}
+It records a number of statistics for each EM iteration. To read a summary file, use the following R command.
+{\small\begin{verbatim}
+> read.table("/tmp/bst/quick-start_uvw3-F/summary", header=TRUE);
+\end{verbatim}}
+\noindent The columns are explained below:
+\begin{itemize}
+\item {\tt Iter} specifies the iteration number.
+\item {\tt nSteps} records the number of Gibbs samples drawn in the E-step of that iteration.
+\item {\tt CDlogL}, {\tt TestLoss}, {\tt LossInTrain} and {\tt TestRMSE} record the complete data log likelihood, loss on the test data, loss on the training data and RMSE (root mean squared error) on the test data for the model at the end of that iteration. For the Gaussian response model, the loss is defined as RMSE. For the logistic response model, the loss is defined as negative average log likelihood per observation.
+\item {\tt TimeEStep}, {\tt TimeMStep} and {\tt TimeTest} record the number of seconds spent on the E-step, the M-step and computing predictions on the test data in that iteration.
+\end{itemize}
+
+\parahead{Sanity Check}
+\begin{itemize}
+\item Check {\tt CDlogL} to see whether it increases sharply during the first few iterations and then oscillates at the end.
+\item Check {\tt TestLoss} to see whether it converges. If not, more EM iterations are needed.
+\item Check {\tt TestLoss} and {\tt LossInTrain} to see whether the model overfits the data; i.e., TestLoss goes up, while LossInTrain goes down. If so, try to simplify the model by reducing the number of factors and parameters.
+\end{itemize}
+You can monitor the summary file while the code is running. Once you see that {\tt TestLoss} has converged, you can kill the running process.
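+For example, a quick way to check convergence is to plot {\tt TestLoss} against the iteration number (a minimal sketch, assuming model {\tt uvw3-F} from Step 2).
+{\small\begin{verbatim}
+> s = read.table("/tmp/bst/quick-start_uvw3-F/summary",
+                 header=TRUE);
+> plot(s$Iter, s$TestLoss, type="b");
+\end{verbatim}}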
+
+\parahead{Model Files}
+The fitted models are saved in {\tt model.last} and {\tt model.minTestLoss}, which are R binary data files. To load a model, run the following commands.
+{\small\begin{verbatim}
+> load("/tmp/bst/quick-start_uvw3-F/model.last");
+> str(factor);
+> str(param);
+> str(data.train);
+\end{verbatim}}
+\noindent After we load the model, the fitted prior parameters are in the object {\tt param} and the fitted latent factors are in the object {\tt factor}. The object {\tt data.train} contains the ID mappings (see Appendix~\ref{sec:index-data} for details) that are needed when applying this model to a new test dataset. Notice that {\tt data.train} does not contain actual data, just meta information. You do not need to understand these objects to use this model to predict the response of new test data.
+
+Object {\tt factor} is a list of factors. For example, $\alpha_{ip}$ = {\tt factor\$alpha[i,p]}, the {\tt src\_id} of source node index $i$ is {\tt data.train\$IDs\$SrcIDs[i]}, and the {\tt src\_context} of source context index $p$ is {\tt data.train\$IDs\$SrcContexts[p]}. As another example, $\bm{w}_k$ =
+{\tt factor\$w[k,]} and the {\tt ctx\_id} of edge context index $k$ is {\tt data.train\$IDs\$CtxIDs[k]}.
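+
+For instance, the following sketch looks up the factors of one source node; the IDs {\tt "user-123"} and {\tt "ctx-A"} are hypothetical and should be replaced by IDs that actually appear in your training data.
+{\small\begin{verbatim}
+> i = match("user-123", data.train$IDs$SrcIDs);
+> p = match("ctx-A", data.train$IDs$SrcContexts);
+> factor$alpha[i,p]; # alpha_{ip} of this node-context pair
+> factor$u[i,];      # the factor vector u_i of this node
+\end{verbatim}}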
+
+Object {\tt param} is a list of prior parameters. For example, $\sigma^2_{\alpha,p}$ = {\tt param\$var\_alpha[p]}. The format of the regression function parameters ($b$, $\bm{g}$, $\bm{d}$, $\bm{h}$, $G$, $D$ and $H$) depends on the regression model. See the following file for details.
+{\small\begin{verbatim}
+ src/R/model/Notation-multicontext.txt
+\end{verbatim}}
+
+\subsubsection{Step 4: Make Predictions}
+
+Once Step 2 finishes, we already have the predicted values of the response variable $y$ for the test data, since the test data was given as input to the fitting function. Check the file {\tt prediction} inside the output directory (in our example, {\tt /tmp/bst/quick-start\_uvw3-F/prediction}). The file has two columns:
+\begin{enumerate}
+\item {\tt y}: The ground-truth response $y$
+\item {\tt pred\_y}: The predicted response $y$
+\end{enumerate}
+Please note that the predicted values of $y$ for model {\tt uvw3-F} can also be found in {\tt ans\$pred.y[["uvw3-F"]]}.
+If you did not specify {\tt obs.test} and {\tt x\_obs.test} when calling {\tt fit.bst}, no prediction file is generated.
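+For example, assuming model {\tt uvw3-F} from Step 2, the test-set RMSE can be recomputed from the prediction file as follows.
+{\small\begin{verbatim}
+> pred = read.table("/tmp/bst/quick-start_uvw3-F/prediction",
+                    header=TRUE);
+> sqrt(mean((pred$y - pred$pred_y)^2)); # RMSE on the test data
+\end{verbatim}}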
+
+To make predictions for new test data, first read the new data (similar to Step 1) and then call {\tt predict.bst}.
+{\small\begin{verbatim}
+pred = predict.bst(
+ model.file="/tmp/bst/quick-start_uvw3-F/model.last",
+ obs.test=obs.test, x_obs.test=x_obs.test,
+ x_src=x_src, x_dst=x_dst, x_ctx=x_ctx);
+\end{verbatim}}
+\noindent Note that {\tt obs.test} is the test observation table, and {\tt x\_obs.test}, {\tt x\_src}, {\tt x\_dst} and {\tt x\_ctx} are the feature tables. You need to make sure that the test data uses the same set of features as the training data (i.e., the column names of a feature table in the training data must be the same as those in the test data). This prediction function does not perform sanity checks for feature consistency, and strange errors may occur if the training and test features are inconsistent. It is also important to note that, in the current implementation, {\tt x\_src}, {\tt x\_dst} and {\tt x\_ctx} must also include all of the source nodes, destination nodes and edge contexts that appear in the {\bf training data}.
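+
+Assuming that {\tt predict.bst} returns the predictions in {\tt pred\$pred.y} (in line with {\tt predict.multicontext} described in Appendix~\ref{sec:fitting}), the new predictions can be compared with the ground truth; e.g.,
+{\small\begin{verbatim}
+> sqrt(mean((obs.test$y - pred$pred.y)^2)); # RMSE on new data
+\end{verbatim}}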
+
+\subsubsection{Details of {\tt fit.bst}}
+\label{sec:fit.bst}
+
+\parahead{Fit Multiple Models in One Call}
+You can fit multiple models with a single call to {\tt fit.bst}. The following is an example.
+{\small\begin{verbatim}
+> ans = fit.bst(obs.train=obs.train, obs.test=obs.test,
+ x_obs.train=x_obs.train, x_obs.test=x_obs.test,
+ x_src=x_src, x_dst=x_dst, x_ctx=x_ctx,
+ out.dir = "/tmp/bst/quick-start",
+ model.name=c("uvw1", "uvw2"), nFactors=c(1,2), nIter=10);
+\end{verbatim}}
+\noindent Here, we fit two models, {\tt uvw1} and {\tt uvw2}, by setting {\tt model.name} and {\tt nFactors} to vectors of length 2; in this example, model {\tt uvw1} uses 1 factor and model {\tt uvw2} uses 2 factors. Both are fitted using 10 EM iterations; to keep comparisons between sibling models fair, {\tt nIter} is not allowed to differ across models. The model files, summary files and prediction files for the two models are in {\tt /tmp/bst/quick-start\_uvw1} and {\tt /tmp/bst/quick-start\_uvw2}.
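+\noindent The predictions of the individual models can then be accessed separately; e.g.,
+{\small\begin{verbatim}
+> str(ans$pred.y[["uvw1"]]);
+> str(ans$pred.y[["uvw2"]]);
+\end{verbatim}}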
+
+\parahead{Basic parameters}
+The basic input parameters of function {\tt fit.bst} are described below; an example call follows the list.
+\begin{itemize}
+\item {\tt code.dir} is the top-level directory where this package was installed. If you are already in this directory, the default (the empty string) can be used.
+\item {\tt obs.train}, {\tt obs.test}, {\tt x\_obs.train}, {\tt x\_obs.test}, {\tt x\_src}, {\tt x\_dst} and {\tt x\_ctx} are the training and test data. Please check Section~\ref{sec:data} for details. Note that only {\tt obs.train} is required to run this code; everything else is optional, depending on your problem. If {\tt obs.test}, {\tt x\_src}, {\tt x\_dst} and {\tt x\_ctx} are specified, then {\tt x\_src}, {\tt x\_dst} and {\tt x\_ctx} must also contain the features of the source nodes, destination nodes and edge contexts that appear only in the test data.
+\item {\tt out.dir} is the output directory prefix. The final output directory is {\tt out.dir\_model.name}.
+\item {\tt model.name} specifies the name(s) of the model(s) to be fitted; it can be a single arbitrary string or a vector of strings. Default is {\tt "model"}.
+\item {\tt nFactors} specifies the number of factors per node (i.e., the number of dimensions of vector $\bm{v}_j$; note that $\bm{u}_i$ and $\bm{w}_k$ have the same number of dimensions). It can be either a scalar or a vector of numbers with length equal to the number of models.
+\item {\tt nIter} specifies the number of EM iterations. All the models are fitted using the same number of iterations.
+\item {\tt nSamplesPerIter} specifies the number of Gibbs samples drawn in the E-step of a single EM iteration. It can be either a scalar, in which case every EM iteration uses the same {\tt nSamplesPerIter}, or a vector of length {\tt nIter} that specifies the number of Gibbs samples for each individual EM iteration. Note that all models use the same {\tt nSamplesPerIter}.
+\item {\tt is.logistic} specifies whether to use the logistic link function for binary response data. Default is {\tt FALSE}. It can be either a single boolean value shared by all models, or a vector of boolean values with length equal to the number of models.
+\item {\tt src.dst.same} specifies whether you want the model to have a single factor vector per node (ignoring the difference between source nodes and destination nodes). For example, if source nodes represent users and destination nodes represent items, {\tt src.dst.same} should be set to {\tt FALSE} because it does not make sense to use a single factor vector for both the $i$th user and the $i$th item. However, if both source and destination nodes represent users (e.g., users rate other users) and ${\tt src\_id} = A$ refers to the same user $A$ as ${\tt dst\_id} = A$, then {\tt src.dst.same} can be set to {\tt TRUE}. In this case, the following model will be fitted.
+\begin{equation}
+y_{ijkpq} \sim \bm{x}'_{ijk} \bm{b} + \alpha_{ip} + \beta_{jq} + \gamma_{k} + \left<\bm{v}_i, \bm{v}_j, \bm{w}_k\right>, \label{eq:vvw-model}
+\end{equation}
+Comparing the above model to the original model specified in Equation~\ref{eq:uvw-model}, note the difference between $\left<\bm{v}_i, \bm{v}_j, \bm{w}_k\right>$ and
+$\left<\bm{u}_i, \bm{v}_j, \bm{w}_k\right>$. Of course, in the case where both source nodes and destination nodes represent users, you can still set {\tt src.dst.same=FALSE} to fit the original model. The default of {\tt src.dst.same} is {\tt FALSE}.
+\item {\tt control} has a list of more advanced parameters that will be introduced later.
+\end{itemize}
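+For illustration, the following hypothetical call fits a single 2-factor model with a logistic link, drawing 100 Gibbs samples per E-step; the model name {\tt uv2-logistic} is arbitrary, and {\tt obs.train\$y} is assumed to contain only 0/1 values.
+{\small\begin{verbatim}
+> ans = fit.bst(obs.train=obs.train, obs.test=obs.test,
+        out.dir = "/tmp/bst/quick-start",
+        model.name="uv2-logistic", nFactors=2, nIter=10,
+        nSamplesPerIter=100, is.logistic=TRUE);
+\end{verbatim}}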
+
+\parahead{Advanced parameters} {\tt control=fit.bst.control(...)} contains the following advanced parameters; an example call follows the list.
+\begin{itemize}
+\item {\tt rm.self.link}: Whether to remove self-edges. If {\tt src.dst.same=TRUE}, you can choose to remove observations with ${\tt src\_id} = {\tt dst\_id}$ by setting {\tt rm.self.link=TRUE}. Otherwise, {\tt rm.self.link} should be set to {\tt FALSE}. The default of {\tt rm.self.link} is {\tt FALSE}.
+\item {\tt add.intercept}: Whether you want to add an intercept to each feature matrix. If {\tt add.intercept=TRUE}, a column of all 1s will be added to every feature matrix. The default of {\tt add.intercept} is {\tt TRUE}.
+\item {\tt has.gamma} specifies whether to include $\gamma_k$ in the model specified in Equation~\ref{eq:uvw-model}. If {\tt has.gamma=FALSE}, $\gamma_k$ is removed from the model. By default, {\tt has.gamma} is set to {\tt FALSE} unless the training data {\tt obs.train} has an edge context column but no source or destination context columns.
+\item {\tt reg.algo} and {\tt reg.control} specify how the regression priors are fitted. If they are set to {\tt NULL} (default), R's basic linear regression function {\tt lm} is used to fit the prior regression coefficients $\bm{g}, \bm{d}, \bm{h}, G, D$ and $H$. Currently, {\tt reg.algo} can only take one of three values: {\tt NULL}, {\tt "GLMNet"} or {\tt "RandomForest"} (the latter two are strings). Notice that if {\tt "RandomForest"} is used, the regression priors become nonlinear; see~\cite{gmf:recsys11} for more information.
+\item {\tt nBurnin} is the number of burn-in samples per E-step. The default is 10\% of {\tt nSamplesPerIter}.
+\item {\tt init.params} is a list of the initial values of all the variance component parameters at the beginning of the first EM iteration. The default value of {\tt init.params} is
+{\small\begin{verbatim}
+init.params = list(var_alpha=1, var_beta=1, var_gamma=1,
+ var_u=1, var_v=1, var_w=1, var_y=NULL,
+ relative.to.var_y=FALSE, var_alpha_global=1,
+ var_beta_global=1)
+\end{verbatim}}
+where {\tt var\_alpha} specifies the initial value of $\sigma^2_\alpha$ and so on. When {\tt var\_y=NULL}, the initial value of $\sigma^2_y$ is set to the sample variance of the response in the training data. {\tt relative.to.var\_y} specifies whether the specification of {\tt var\_alpha} and so on should be relative to {\tt var\_y}. For example, if {\tt relative.to.var\_y=TRUE}, {\tt var\_y=NULL} and {\tt var\_alpha=0.1}, then the initial value of $\sigma^2_\alpha$ will be set to 0.1 times the sample variance of the response.
+\item {\tt random.seed} is the random seed for the model fitting procedure.
+\end{itemize}
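+For example, the following hypothetical call uses random forest regression priors, 20 burn-in samples per E-step and a fixed random seed; it assumes that the {\tt randomForest} R package is installed (the model name {\tt uvw3-rf} is arbitrary).
+{\small\begin{verbatim}
+> ans = fit.bst(obs.train=obs.train, obs.test=obs.test,
+        x_obs.train=x_obs.train, x_obs.test=x_obs.test,
+        x_src=x_src, x_dst=x_dst, x_ctx=x_ctx,
+        out.dir = "/tmp/bst/quick-start", model.name="uvw3-rf",
+        nFactors=3, nIter=10,
+        control=fit.bst.control(reg.algo="RandomForest",
+                                nBurnin=20, random.seed=1));
+\end{verbatim}}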
+
+\subsubsection{Special Case Models}
+
+\parahead{Original BST Model} The original BST model defined in~\cite{bst:kdd11} can be fitted by setting {\tt src.dst.same=TRUE}, {\tt has.gamma=FALSE} and {\tt rm.self.link=TRUE}, and by making all the context columns identical in the input data.
+{\small\begin{verbatim}
+obs.train$src_context = obs.train$dst_context = obs.train$ctx_id
+obs.test$src_context = obs.test$dst_context = obs.test$ctx_id
+ans = fit.bst(obs.train=obs.train, obs.test=obs.test,
+ x_obs.train=x_obs.train, x_obs.test=x_obs.test,
+ x_src=x_src, x_dst=x_dst, x_ctx=x_ctx,
+ out.dir="/tmp/bst/quick-start", model.name="original-bst",
+ nFactors=3, nIter=10, src.dst.same=TRUE,
+ control=fit.bst.control(has.gamma=FALSE, rm.self.link=TRUE));
+\end{verbatim}}
+\noindent This setting gives the following model:
+$$
+y_{ijk} \sim \bm{x}'_{ijk} \bm{b} + \alpha_{ik} + \beta_{jk} + \left<\bm{v}_i, \bm{v}_j, \bm{w}_k\right>
+$$
+Notice that since all the context columns are identical, there is no need for a three-dimensional context vector $(k,p,q)$; it is sufficient to use $k$ alone to index the context in the above equation.
+
+\parahead{RLFM}
+The RLFM model defined in~\cite{rlfm:kdd09} can be fitted by removing all of the context columns.
+{\small\begin{verbatim}
+obs.train$src_context = obs.train$dst_context = obs.train$ctx_id = NULL;
+obs.test$src_context = obs.test$dst_context = obs.test$ctx_id = NULL;
+ans = fit.bst(obs.train=obs.train, obs.test=obs.test,
+ x_obs.train=x_obs.train, x_obs.test=x_obs.test,
+ x_src=x_src, x_dst=x_dst,
+ out.dir="/tmp/bst/quick-start", model.name="uvw3-F",
+ nFactors=3, nIter=10);
+\end{verbatim}}
+\noindent Notice that ${\tt x\_ctx}$ is also removed. This setting gives the following model:
+$$
+y_{ij} \sim \bm{x}'_{ij} \bm{b} + \alpha_{i} + \beta_{j} + \bm{u}'_i \bm{v}_j
+$$
+Notice that removing the context-related columns in an observation table disables the context-specific factors in the model.
BIN  doc/tutorial.pdf
Binary file not shown
174 doc/tutorial.tex
@@ -3,14 +3,15 @@
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{bm}
+\usepackage{comment}
\newcommand{\parahead}[1]{\vspace{0.15in}\noindent{\bf #1:}}
\begin{document}
\title{Tutorial on How to Fit Latent Factor Models}
-\author{Bee-Chung Chen}
+\author{Bee-Chung Chen and Liang Zhang}
\maketitle
-This paper describes how you can fit latent factor models (e.g., \cite{rlfm:kdd09,bst:kdd11,gmf:recsys11}) using the open source package developed in Yahoo! Labs.
+This tutorial describes how you can fit latent factor models (e.g., \cite{rlfm:kdd09,bst:kdd11,gmf:recsys11}) using the open source package developed in Yahoo! Labs.
{\small\begin{verbatim}
Stable repository: https://github.com/yahoo/Latent-Factor-Models
@@ -29,16 +30,17 @@ \subsection{Install R}
Alternatively, you can install R using Linux's package management software. In this case, please install {\tt r-base}, {\tt r-base-core}, {\tt r-base-dev} and {\tt r-recommended}.
-After installing R, enter R by simply typing {\tt R} and install the following R packages: {\tt Matrix} and {\tt glmnet}. Notice that these two packages are not required if you do not need to handle sparse feature vectors or matrices. To install these R packages, use the following commands in R.
+After installing R, enter R by simply typing {\tt R} and install the following R packages: {\tt Matrix}, {\tt glmnet} and {\tt randomForest}. Note that the R packages {\tt glmnet} and {\tt randomForest} are not required unless you want to use them in the regression priors of the model (the parameter {\tt reg.algo} in {\tt control} of {\tt fit.bst}). To install these R packages, use the following commands in R.
{\small\begin{verbatim}
> install.packages("Matrix");
> install.packages("glmnet");
+> install.packages("randomForest");
\end{verbatim}}
\noindent Make sure that you can run R by simply typing {\tt R}. Otherwise, please create an alias that points {\tt R} to your R executable; this is required for {\tt make} to work properly.
\subsection{Be Familiar with R}
-This tutorial assumes that you are familiar with R, at least comfortable reading R code. If not, please read \\
+This tutorial assumes that you are familiar with R, or at least comfortable calling R functions and reading R code. If not, please read \\
{\tt http://cran.r-project.org/doc/manuals/R-intro.pdf}.
\subsection{Compile C/C++ Code}
@@ -47,19 +49,19 @@ \subsection{Compile C/C++ Code}
\section{Bias-Smoothed Tensor Model}
-The bias-smoothed tensor (BST) model~\cite{bst:kdd11} includes the regression-based latent factor model (RLFM)~\cite{rlfm:kdd09} and regular matrix factorization models as special cases. In fact, the BST model presented here is more general than the model presented in~\cite{bst:kdd11}. In the following, I demonstrate how to fit such a model and its special cases. The R code of this section can be found in {\tt src/R/examples/tutorial-BST.R}.
+In this section, we demonstrate how to fit the bias-smoothed tensor (BST) model~\cite{bst:kdd11}, which includes the regression-based latent factor model (RLFM)~\cite{rlfm:kdd09} and regular matrix factorization models as special cases. In fact, the BST model presented here is more general than the model presented in~\cite{bst:kdd11}. It also provides the ability to use non-linear regression priors as described in~\cite{gmf:recsys11}. The R script of this section can be found in {\tt src/R/examples/tutorial-BST.R}.
\subsection{Model}
We first specify the model in its most general form and then describe special cases. Let $y_{ijkpq}$ denote the {\em response} (e.g., rating) that {\em source node} $i$ (e.g., user $i$) gives {\em destination node} $j$ (e.g., item $j$) in {\em context} $(k,p,q)$, where the context is specified by a three dimensional vector:
\begin{itemize}
\item {\em Edge context} $k$ specifies the context when the response occurs on the edge from node $i$ to node $j$; e.g., the rating on the edge from user $i$ to item $j$ was given when $i$ saw $j$ on web page $k$.
-\item {\em Source context} $p$ specifies the context (or mode) of the source node $i$ when this node gives the response; e.g., $p$ represents the category of item $j$, meaning that user $i$ are in different modes when rating items in different categories.
+\item {\em Source context} $p$ specifies the context (or mode) of the source node $i$ when this node gives the response; e.g., $p$ represents the category of item $j$, meaning that user $i$ is in different modes when rating items in different categories. Notice that, in this example, $p$ represents an item category, not the user segment that $i$ belongs to; in the latter case, the user ID would completely determine the context, making this context information unnecessary.
\item {\em Destination context} $q$ specifies the context (or mode) of the destination node $j$ when this node receives the response; e.g., $q$ represents the user segment that user $i$ belongs to, meaning that the response that an item receives depends on the segment that the user belongs to.
\end{itemize}
-Notice that the context $(k,p,q)$ is assumed to be given and each individual response is assumed to occur in a single context. Also note that when modeling a problem, we may not always need all the three components in the three dimensional context vector.
+Notice that the context $(k,p,q)$ is assumed to be given and each individual response is assumed to occur in a single context. Also note that when modeling a problem, we may not always need all three components of the context vector; some examples will be given later. It is important to note that, in the current implementation, the total number of source contexts and the total number of destination contexts cannot be too large (roughly between 2 and 100), whereas the total number of edge contexts can be large.
-Because $i$ always denotes a source node (e.g., a user), $j$ always denotes a destination node (e.g., an item) and $k$ always denotes an edge context, we slightly abuse our notation by using $\bm{x}_i$ to denote the feature vector of source node $i$, $\bm{x}_j$ to denote the feature vector of destination node $j$, $\bm{x}_k$ to denote the feature vector of edge context $k$, and $\bm{x}_{ijk}$ to denote the feature vector associated with the occasion when $i$ gives $j$ the response in context $k$.
+Because $i$ always denotes a source node (e.g., a user), $j$ always denotes a destination node (e.g., an item) and $k$ always denotes an edge context, we slightly abuse our notation by using $\bm{x}_i$ to denote the feature vector of source node $i$, $\bm{x}_j$ to denote the feature vector of destination node $j$, $\bm{x}_k$ to denote the feature vector of edge context $k$, and $\bm{x}_{ijk}$ to denote the feature vector associated with the occasion when $i$ gives $j$ the response in context $k$ (e.g., the time of day and day of week of the response). Notice that we do not consider features for source and destination contexts because the number of such contexts is expected to be small; since each such context would have a relatively large number of observations, it usually does not need a feature-based regression prior.
\parahead{Response model}
For numeric response, we use the Gaussian response model; for binary response, we use the logistic response model.
@@ -77,20 +79,21 @@ \subsection{Model}
\parahead{Regression Priors}
The priors of the latent factors are specified in the following:
\begin{align}
-\alpha_{ip} & \sim \mathcal{N}(\bm{g}_{p}^\prime \bm{x}_{i} + q_{p} \alpha_i, ~\sigma_{\alpha,p}^2),
+\alpha_{ip} & \sim \mathcal{N}(\bm{g}_{p}(\bm{x}_{i}) + q_{p} \alpha_i, ~\sigma_{\alpha,p}^2),
~~~~ \alpha_i \sim \mathcal{N}(0, 1) \label{eq:alpha} \\
-\beta_{jq} & \sim \mathcal{N}(\bm{d}_{q}^\prime \bm{x}_{j} + r_{q} \beta_j, ~\sigma_{\beta,q}^2),
+\beta_{jq} & \sim \mathcal{N}(\bm{d}_{q}(\bm{x}_{j}) + r_{q} \beta_j, ~\sigma_{\beta,q}^2),
~~~~ \beta_j \sim \mathcal{N}(0, 1) \label{eq:beta} \\
-\gamma_{k} & \sim \mathcal{N}(\bm{h}' \bm{x}_k, \,\sigma_{\gamma}^2 I), \\
-\bm{u}_{i} & \sim \mathcal{N}(G' \bm{x}_i, \,\sigma_{u}^2 I), ~~~
-\bm{v}_{j} \sim \mathcal{N}(D' \bm{x}_j, \,\sigma_{v}^2 I), ~~~
-\bm{w}_{k} \sim \mathcal{N}(H' \bm{x}_k, \,\sigma_{w}^2 I), \label{eq:uvw}
+\gamma_{k} & \sim \mathcal{N}(\bm{h}(\bm{x}_k), \,\sigma_{\gamma}^2 I), \\
+\bm{u}_{i} & \sim \mathcal{N}(G(\bm{x}_i), \,\sigma_{u}^2 I), ~~~
+\bm{v}_{j} \sim \mathcal{N}(D(\bm{x}_j), \,\sigma_{v}^2 I), ~~~
+\bm{w}_{k} \sim \mathcal{N}(H(\bm{x}_k), \,\sigma_{w}^2 I), \label{eq:uvw}
\end{align}
-where $\bm{g}_p$, $q_p$, $\bm{d}_q$, $r_q$, $G$, $D$ and $H$ are regression coefficient vectors and matrices. These regression coefficients will be learned from data and provide the ability to make predictions for users or items that do not appear in training data. The factors of these new users or items will be predicted based on their features through regression.
+where $q_p$ and $r_q$ are regression coefficients; $\bm{g}_p(\cdot)$, $\bm{d}_q(\cdot)$, $\bm{h}(\cdot)$, $G(\cdot)$, $D(\cdot)$ and $H(\cdot)$ are regression functions that can either be linear regression coefficient vectors/matrices, or non-linear regression functions such as random forests. These regression functions will be learned from data and provide the ability to make predictions for users or items that do not appear in training data. The factors of these new users or items will be predicted based on their features through regression.
-\subsection{Toy Dataset}
+\subsection{Data Format}
+\label{sec:data}
-In the following, we describe a toy dataset. You can put your data in the same format to fit the model to your data. This toy dataset is in the following directory:
+We introduce the input data format through the following toy dataset. You can put your own data in the same format to fit the model to your data. This toy dataset is in the following directory:
\begin{verbatim}
test-data/multicontext_model/simulated-mtx-uvw-10K
\end{verbatim}
@@ -98,7 +101,7 @@ \subsection{Toy Dataset}
\begin{verbatim}
src/unit-test/multicontext_model/create-simulated-data-1.R
\end{verbatim}
-This is a simulated dataset; i.e., the response values $y_{ijkpq}$ are generated according to a ground-truth model. To see the ground-truth, run the following commands in R.
+This is a simulated dataset; i.e., the response values $y_{ijkpq}$ are generated according to a known ground-truth model. To see the ground-truth, run the following commands in R.
{\small
\begin{verbatim}
> load("test-data/multicontext_model/simulated-mtx-uvw-10K/ground-truth.RData");
@@ -107,14 +110,14 @@ \subsection{Toy Dataset}
\end{verbatim}
}
-\parahead{Response Data}
-The response data, also called observation data, is in {\tt obs-train.txt} and {\tt obs-test.txt}. Each file has six columns:
+\parahead{Observation Data}
+The observation data, also called response data, is in {\tt obs-train.txt} and {\tt obs-test.txt}. Each file has six columns:
\begin{enumerate}
\item {\tt src\_id}: Source node ID (e.g., user $i$).
\item {\tt dst\_id}: Destination node ID (e.g., item $j$).
-\item {\tt src\_context}: Source context ID (e.g., source context $p$).
-\item {\tt dst\_context}: Destination context ID (e.g., destination context $q$).
-\item {\tt ctx\_id}: Edge context ID (e.g., edge context $k$).
+\item {\tt src\_context}: Source context ID (e.g., source context $p$). This is an optional column.
+\item {\tt dst\_context}: Destination context ID (e.g., destination context $q$). This is an optional column.
+\item {\tt ctx\_id}: Edge context ID (e.g., edge context $k$). This is an optional column.
\item {\tt y}: Response (e.g., the rating that user $i$ gives item $j$ in context $(k,p,q)$).
\end{enumerate}
Note that all of the above IDs can be numbers or character strings.
@@ -128,16 +131,20 @@ \subsection{Toy Dataset}
"dst_context", "ctx_id", "y");
\end{verbatim}
}
-It is important to note that the {\bf column names} of an observation table have to be exactly {\tt src\_id}, {\tt dst\_id}, {\tt src\_context}, {\tt dst\_context}, {\tt ctx\_id} and {\tt y}. The model fitting code does not recognize other names.
+It is important to note that the {\bf column names} of an observation table have to be exactly {\tt src\_id}, {\tt dst\_id}, {\tt src\_context}, {\tt dst\_context}, {\tt ctx\_id} and {\tt y}. The model fitting code looks for these column names (rather than relying on column order; i.e., {\tt src\_id} does not need to be the first column) to set up its internal data structures, and it does not recognize other column names. Also, note that {\tt src\_context}, {\tt dst\_context} and {\tt ctx\_id} are optional columns. When these columns are missing, a reduced model without the corresponding context-specific factors will be fitted. For example, an observation table with only the 3 columns {\tt src\_id}, {\tt dst\_id} and {\tt y} will set up the fitting procedure to fit the RLFM model introduced in \cite{rlfm:kdd09}; i.e.,
+$$
+y_{ij} \sim \bm{x}'_{ij} \bm{b} + \alpha_{i} + \beta_{j} + \bm{u}'_i \bm{v}_j,
+$$
+since $k$, $p$ and $q$ are missing.
\parahead{Source, Destination and Context Features}
-The features vectors of source nodes ($\bm{x}_i$), destination nodes ($\bm{x}_j$), edge contexts ($\bm{x}_k$) and training and test observations ($\bm{x}_{ijk}$) are in \\
+The feature vectors of source nodes ($\bm{x}_i$), destination nodes ($\bm{x}_j$) and edge contexts ($\bm{x}_k$) are in \\
\indent{\tt {\it type}-feature-user.txt}, \\
\indent{\tt {\it type}-feature-item.txt}, \\
\indent{\tt {\it type}-feature-ctxt.txt}, \\
-where {\it type} = "dense" for the dense format and {\it type} = "sparse" for the sparse format.
+where {\it type} = {\tt "dense"} for the dense format and {\it type} = {\tt "sparse"} for the sparse format.
-For the dense format, take {\tt dense-feature-user.txt} for example. The first column is {\tt src\_id} (the {\tt src\_id} column in the observation table refers to this column to get the feature vector of the source node for each observation). It is important to note that the {\bf name of the first column} has to be exactly {\tt src\_id}. The rest of the columns specify the feature values and the column names can be arbitrary.
+For the dense format, take {\tt dense-feature-user.txt} for example. The first column is {\tt src\_id} (the {\tt src\_id} column in the observation table refers to this column to get the feature vector of the source node for each observation). It is important to note that, after reading this table into R, the {\bf name of the first column} has to be set to {\tt src\_id} exactly. The rest of the columns specify the feature values and the column names can be arbitrary.
For the sparse format, take {\tt sparse-feature-user.txt} for example. It has three columns:
\begin{enumerate}
@@ -145,7 +152,7 @@ \subsection{Toy Dataset}
\item {\tt index}: Feature index (starting from 1, not 0)
\item {\tt value}: Feature value
\end{enumerate}
-It is important to note that the {\bf column names} have to be exactly {\tt src\_id}, {\tt index} and {\tt value}.
+It is important to note that, after reading this table into R, the {\bf column names} have to be set to {\tt src\_id}, {\tt index} and {\tt value} exactly. The following example shows the correspondence between the sparse and dense formats.
{\small\begin{verbatim}
sparse-feature-user.txt dense-feature-user.txt
SPARSE FORMAT <=> DENSE FORMAT
@@ -158,7 +165,7 @@ \subsection{Toy Dataset}
The feature vectors of training and test observations ($\bm{x}_{ijk}$) are in\\
\indent{\tt {\it type}-feature-obs-train.txt}, \\
\indent{\tt {\it type}-feature-obs-test.txt}, \\
-where {\it type} = "dense" for the dense format and {\it type} = "sparse" for the sparse format.
+where {\it type} = {\tt "dense"} for the dense format and {\it type} = {\tt "sparse"} for the sparse format.
For the dense format, take {\tt dense-feature-obs-train.txt} for example. The $n$th line specifies the feature vector of observation on the $n$th line of {\tt obs-train.txt}. Since there is a line-by-line correspondence, there is no need to have an ID column. Each column in this file represents a feature and the column names can be arbitrary.
@@ -168,7 +175,7 @@ \subsection{Toy Dataset}
\item {\tt index}: Feature index (starting from 1, not 0)
\item {\tt value}: Feature value
\end{enumerate}
-It is important to note that the {\bf column names} have to be exactly {\tt src\_id}, {\tt index} and {\tt value}.
+It is important to note that, after reading this table into R, the {\bf column names} have to be set to {\tt src\_id}, {\tt index} and {\tt value} exactly. An example is presented in the following.
{\small\begin{verbatim}
obs_id index value # MEANING
9 1 0.14 # 1st feature of line 9 in obs-train.txt = 0.14
@@ -178,39 +185,25 @@ \subsection{Toy Dataset}
\input{quick-start}
-\subsection{Model Fitting Details}
+\bibliographystyle{abbrv}
+\bibliography{references}
+
+\appendix
+
+\section{BST Model Fitting Details}
\label{sec:fitting}
-See Example 1 in {\tt src/R/examples/tutorial-BST.R} for the R script. For succinctness, we ignore some R commands in the following description.
+In this section, we provide more details on how BST models are fitted. Here we fit BST models without using the wrapper function {\tt fit.bst}; this may give insights into how to use this package in your own problem setting. All the R code in this section can be
+found in Appendix Example 1 in {\tt src/R/examples/tutorial-BST.R}. For succinctness, we omit some R commands in the following description.
-\parahead{Step 1}
-Read training and test observation tables ({\tt obs.train} and {\tt obs.test}), their corresponding observation feature tables ({\tt x\_obs.train} and {\tt x\_obs.test}), the source feature table ({\tt x\_src}), the destination feature table ({\tt x\_dst}) and the edge context feature table ({\tt x\_ctx}) from the corresponding files. Note that if you replace these tables with your data, you must not change the column names.
-{\small\begin{verbatim}
-input.dir = "test-data/multicontext_model/simulated-mtx-uvw-10K"
-obs.train = read.table(paste(input.dir,"/obs-train.txt",sep=""),
- sep="\t", header=FALSE, as.is=TRUE);
-names(obs.train) = c("src_id", "dst_id", "src_context",
- "dst_context", "ctx_id", "y");
-x_obs.train = read.table(paste(input.dir,"/dense-feature-obs-train.txt",
- sep=""), sep="\t", header=FALSE, as.is=TRUE);
-obs.test = read.table(paste(input.dir,"/obs-test.txt",sep=""),
- sep="\t", header=FALSE, as.is=TRUE);
-names(obs.test) = c("src_id", "dst_id", "src_context",
- "dst_context", "ctx_id", "y");
-x_obs.test = read.table(paste(input.dir,"/dense-feature-obs-test.txt",
- sep=""), sep="\t", header=FALSE, as.is=TRUE);
-x_src = read.table(paste(input.dir,"/dense-feature-user.txt",sep=""),
- sep="\t", header=FALSE, as.is=TRUE);
-names(x_src)[1] = "src_id";
-x_dst = read.table(paste(input.dir,"/dense-feature-item.txt",sep=""),
- sep="\t", header=FALSE, as.is=TRUE);
-names(x_dst)[1] = "dst_id";
-x_ctx = read.table(paste(input.dir,"/dense-feature-ctxt.txt",sep=""),
- sep="\t", header=FALSE, as.is=TRUE);
-names(x_ctx)[1] = "ctx_id";
-\end{verbatim}}
+\subsection{Read Data}
+
+Read all the data sets in the same way as described in Section \ref{sec:read-data}.
-\parahead{Step 2} Index the training and test data. Functions {\tt indexData} and {\tt indexTestData} (defined in {\tt rc/R/model/multicontext\_model\_utils.R}) convert the input data tables into the right data structure. In particular, they replace the original IDs ({\tt src\_id}, {\tt dst\_id}, {\tt src\_context}, {\tt dst\_context} and {\tt ctx\_id}) by consecutive index numbers, and convert feature tables (data frames) into feature matrices.
+\subsection{Index Data}
+\label{sec:index-data}
+
+Index the training and test data. Functions {\tt indexData} and {\tt indexTestData} (defined in {\tt src/R/model/multicontext\_model\_utils.R}) convert the input data tables into the right data structure. In particular, they replace the original IDs ({\tt src\_id}, {\tt dst\_id}, {\tt src\_context}, {\tt dst\_context} and {\tt ctx\_id}) with consecutive index numbers, and convert feature tables (data frames) into feature matrices.
{\small\begin{verbatim}
data.train = indexData(
obs=obs.train, src.dst.same=FALSE, rm.self.link=FALSE,
@@ -247,7 +240,8 @@ \subsection{Model Fitting Details}
\item {\tt data.train\$feature\$x\_obs[$m$,]} is the observation feature vector of this observation. Similarly, {\tt x\_src[$i$,]}, {\tt x\_dst[$j$,]} and {\tt x\_ctx[$k$,]} are the feature vectors of the source node, destination node and edge context of this observation.
\end{itemize}
-\parahead{Step 3}
+\subsection{Model Setting}
+
Fit the model(s). We first specify the settings of the models to be fitted.
{\small\begin{verbatim}
setting = data.frame(
@@ -286,29 +280,27 @@ \subsection{Model Fitting Details}
obs.test$src_context = obs.test$dst_context = obs.test$ctx_id = NULL;
x_ctx = NULL;
\end{verbatim}}
-This setting gives the following model:
+\noindent This setting gives the following model:
$$
y_{ij} \sim \bm{x}'_{ij} \bm{b} + \alpha_{i} + \beta_{j} + \bm{u}'_i \bm{v}_j
$$
Notice that setting the context-related objects to {\tt NULL} disables the context-specific factors in the model.
\end{itemize}
-\parahead{Step 4}
+\subsection{Model Fitting}
+
Run the model fitting procedure.
{\small\begin{verbatim}
out.dir = "/tmp/unit-test/simulated-mtx-uvw-10K";
ans = run.multicontext(
- obs=data.train$obs, # training observation table
- feature=data.train$feature, # training feature matrices
+ data.train=data.train, # training data
+ data.test=data.test, # test data (optional)
setting=setting, # setting specified in Step 3
nSamples=200, # number of Gibbs samples in each E-step
nBurnIn=20, # number of burn-in samples for the Gibbs sampler
nIter=10, # number of EM iterations
- test.obs=data.test$obs, # test observation table (optional)
- test.feature=data.test$feature, # test feature matrices (optional)
reg.algo=NULL, # regression algorithm; see below
reg.control=NULL, # control parameters for the regression algorithm
- IDs=data.test$IDs, # ID mappings (optional)
out.level=1, # see below
out.dir=out.dir, # output directory
out.overwrite=TRUE, # whether to overwrite the output directory
@@ -326,7 +318,7 @@ \subsection{Model Fitting Details}
\begin{itemize}
\item {\tt nSamples}, {\tt nBurnIn} and {\tt nIter} determine how long the procedure will run. In the above example, the procedure runs 10 EM iterations. In each iteration, it draws 220 Gibbs samples, where the first 20 are burn-in samples (which are thrown away) and the remaining 200 are used to compute the Monte Carlo means in the E-step of that iteration. In our experience, 10-20 EM iterations with 100-200 samples per iteration are usually sufficient.
\item {\tt reg.algo} and {\tt reg.control} specify how the regression priors will be fitted. If they are set to {\tt NULL}, R's basic linear regression function {\tt lm} will be used to fit the prior regression coefficients $\bm{g}, \bm{d}, \bm{h}, G, D$ and $H$. Currently, we only support two other algorithms, {\tt GLMNet} and {\tt RandomForest}. Notice that if {\tt RandomForest} is used, the regression priors become nonlinear; see~\cite{gmf:recsys11} for more information.
-\item {\tt out.level} and {\tt out.dir} specify what and where the fitting procedure will output. If {\tt out.level} > 0, each model specified in {\tt setting} (i.e., each row in the {\tt setting} table) will be output to a separate directory. The output directory name of the $m$th model is
+\item {\tt out.level} and {\tt out.dir} specify what and where the fitting procedure will output. If {\tt out.level} $> 0$, each model specified in {\tt setting} (i.e., each row in the {\tt setting} table) will be output to a separate directory. The output directory name of the $m$th model is
{\small\begin{verbatim}
paste(out.dir, "_", setting$name[m], sep="")
\end{verbatim}}
@@ -338,39 +330,6 @@ \subsection{Model Fitting Details}
If {\tt out.level=1}, the fitted models are stored in files {\tt model.last} and {\tt model.minTestLoss} in the output directories, where {\tt model.last} contains the model obtained at the end of the last EM iteration and {\tt model.minTestLoss} contains the model at the end of the EM iteration that gives the minimum loss on the test observations. {\tt model.minTestLoss} exists only when {\tt test.obs} is not {\tt NULL}. If the fitting procedure stops (e.g., the machine reboots) before it finishes all the EM iterations, the latest fitted models will still be saved in these two files. If {\tt out.level=2}, the model at the end of the $m$th EM iteration will be saved in {\tt model.$m$} for each $m$. We describe how to read the output in Section~\ref{sec:model-output}.
\end{itemize}
-\subsection{Output}
-\label{sec:model-output}
-
-The two main output files in an output directory are {\tt summary} and {\tt model.last}.
-
-\parahead{Summary File}
-It records a number of statistics for each EM iteration. To read a summary file, use the following R command.
-{\small\begin{verbatim}
-read.table(paste(out.dir,"_uvw2/summary",sep=""), header=TRUE);
-\end{verbatim}}
-\noindent Explanation of the columns are in the following:
-\begin{itemize}
-\item {\tt Iter} specifies the iteration number.
-\item {\tt nSteps} records the number of Gibbs samples drawn in the E-step of that iteration.
-\item {\tt CDlogL}, {\tt TestLoss}, {\tt LossInTrain} and {\tt TestRMSE} record the complete data log likelihood, loss on the test data, loss on the training data and RMSE (root mean squared error) on the test data for the model at the end of that iteration. For the Gaussian response model, the loss is defined as RMSE. For the logistic response model, the loss is defined as negative average log likelihood per observation.
-\item {\tt TimeEStep}, {\tt TimeMStep} and {\tt TimeTest} record the numbers of seconds used to compute the E-step, M-step and predictions on test data in that iteration.
-\end{itemize}
-
-\parahead{Sanity Check}
-\begin{itemize}
-\item Check {\tt CDlogL} to see whether it increases sharply during the first few iterations and then oscillates at the end.
-\item Check {\tt TestLoss} to see whether it converges. If not, more EM iterations are needed.
-\item Check {\tt TestLoss} and {\tt LossInTrain} to see whether the model overfits the data; i.e., TestLoss goes up, while LossInTrain goes down. If so, try to simplify the model by reducing the number of factors and parameters.
-\end{itemize}
-You can monitor the summary file when the code is running. When you see {\tt TestLoss} converges, kill the running process.
-
-\parahead{Model File}
-The fitted models are saved in {\tt model.last} and {\tt model.minTestLoss}, which are R data binary files. To load the models, run the following command.
-{\small\begin{verbatim}
-load(paste(out.dir,"_uvw2/model.last",sep=""));
-\end{verbatim}}
-\noindent After loading, the fitted prior parameters are in object {\tt param} and the fitted latent factors are in object {\tt factor}. Also, the object {\tt IDs} contains the ID mappings described in Step~2 of Section~\ref{sec:fitting}.
-
\subsection{Prediction}
To make predictions, use the following function.
@@ -380,7 +339,14 @@ \subsection{Prediction}
obs=data.test$obs, feature=data.test$feature, is.logistic=FALSE
);
\end{verbatim}}
-\noindent Now, {\tt pred\$pred.y} contains the predicted response for {\tt data.test\$obs}. Notice that the test data {\tt data.test} was created by call {\tt indexTestData} in Step 2 of Section~\ref{sec:fitting}.
+\noindent Now, {\tt pred\$pred.y} contains the predicted response for {\tt data.test\$obs}. Notice that the test data {\tt data.test} was created by calling {\tt indexTestData} in Section~\ref{sec:index-data}. If you have new test data, you can use the following command to index it.
+{\small\begin{verbatim}
+data.test = indexTestData(
+ data.train=data.train, obs=obs.test,
+ x_obs=x_obs.test, x_src=x_src, x_dst=x_dst, x_ctx=x_ctx
+);
+\end{verbatim}}
+\noindent where {\tt obs.test}, {\tt x\_obs.test}, {\tt x\_src}, {\tt x\_dst} and {\tt x\_ctx} contain the new data in the format described in Section~\ref{sec:data}.
\subsection{Other Examples}
@@ -391,8 +357,4 @@ \subsection{Other Examples}
\item Example 4: In this example, we demonstrate how to fit RLFM models with sparse features and the glmnet algorithm. Note that RLFM models do not fit this toy dataset well.
\end{itemize}
-
-\bibliographystyle{abbrv}
-\bibliography{references}
-
\end{document}
229 src/R/BST.R
@@ -0,0 +1,229 @@
+### Copyright (c) 2011, Yahoo! Inc. All rights reserved.
+### Copyrights licensed under the New BSD License. See the accompanying LICENSE file for terms.
+###
+### Author: Liang Zhang
+fit.bst <- function(
+ code.dir = "", # The top-level directory where the code is installed; "" if you are in that directory
+ obs.train, # The training response data
+ obs.test = NULL, # The testing response data
+ x_obs.train = NULL, # The data of training observation features
+ x_obs.test = NULL, # The data of testing observation features
+ x_src = NULL, # The data of context features for source nodes
+ x_dst = NULL, # The data of context features for destination nodes
+ x_ctx = NULL, # The data of context features for edges
+ out.dir = "", # The output directory prefix; each model is output to out.dir_<model.name>
+ model.name = "model", # The name of the model; can be any string or a vector of strings
+ nFactors, # Number of factors; can be any positive integer or a vector of positive integers with length=length(model.name)
+ nIter = 20, # Number of EM iterations
+ nSamplesPerIter = 200, # Number of Gibbs samples per E step, can be a vector of numbers with length=nIter
+ is.logistic = FALSE, # whether to use the logistic model for binary rating, default Gaussian, can be a vector of booleans with length=length(model.name)
+ src.dst.same = FALSE, # Whether src_id and dst_id are the same
+ control = fit.bst.control(), # A list of control params
+ ...
+) {
+ library(Matrix);
+ if (code.dir!="") code.dir = sprintf("%s/",code.dir);
+ #if (out.dir!="") out.dir = sprintf("%s/",out.dir);
+
+ if (length(nIter)!=1 || floor(nIter)!=nIter || nIter<=0) stop("When calling fit.bst, nIter must be a positive integer scalar!");
+ if (length(nSamplesPerIter)!=1 || floor(nSamplesPerIter)!=nSamplesPerIter || nSamplesPerIter<=0) stop("When calling fit.bst, nSamplesPerIter must be a positive integer scalar!");
+
+ # Load all the required libraries and source code
+ if (class(try(load.code(code.dir)))=="try-error") stop("When calling fit.bst, code.dir is not specified correctly. code.dir must be the path of the top-level directory of this package (which contains src/ as a sub-directory and file LICENSE).");
+
+ # Make sure all the data have required columns
+ if (is.null(obs.train$src_id) || is.null(obs.train$dst_id) || is.null(obs.train$y)) stop("When calling fit.bst, obs.train must have the following three columns: src_id, dst_id, y");
+ if (!is.null(obs.test)) {
+ if (is.null(obs.test$src_id) || is.null(obs.test$dst_id) || is.null(obs.test$y)) stop("When calling fit.bst, obs.test must have the following three columns: src_id, dst_id, y");
+ }
+
+ if (is.null(x_obs.train) && !is.null(x_obs.test)) stop("When calling fit.bst, x_obs.train does not exist while x_obs.test is used! Either set both to NULL or supply both observation feature tables.");
+ if (is.null(x_obs.test) && !is.null(x_obs.train) && !is.null(obs.test)) stop("When calling fit.bst, x_obs.test does not exist while x_obs.train is used! Either set both to NULL or supply both observation feature tables.");
+ if (!is.null(x_obs.train) && !is.null(x_obs.test)) {
+ if (ncol(x_obs.train)!=ncol(x_obs.test)) stop("When calling fit.bst, x_obs.train and x_obs.test have different numbers of columns! The number of features for training and test cases should be exactly the same!");
+ }
+ # Index data: Put the input data into the right form
+ # Convert IDs into numeric indices and
+ # Convert some data frames into matrices
+ # Index training data
+ if (length(src.dst.same)!=1) stop("When calling fit.bst, src.dst.same must be a scalar boolean!");
+ if (src.dst.same!=0 && src.dst.same!=1) stop("When calling fit.bst, src.dst.same must be boolean!");
+ data.train = indexData(
+ obs=obs.train, src.dst.same=src.dst.same, rm.self.link=control$rm.self.link,
+ x_obs=x_obs.train, x_src=x_src, x_dst=x_dst, x_ctx=x_ctx,
+ add.intercept=control$add.intercept
+ );
+ # Index test data
+ if (!is.null(obs.test)) {
+ data.test = indexTestData(
+ data.train=data.train, obs=obs.test,
+ x_obs=x_obs.test, x_src=x_src, x_dst=x_dst, x_ctx=x_ctx
+ );
+ } else {
+ data.test = NULL;
+ }
+
+ # Model Settings
+ if (is.null(control$has.gamma)) {
+ control$has.gamma = FALSE;
+ if (is.null(x_src) && is.null(x_dst) && !is.null(x_ctx)) control$has.gamma = TRUE;
+ }
+ if (length(nFactors)==1) nFactors = rep(nFactors, length(model.name));
+ if (length(control$has.gamma)==1) control$has.gamma = rep(control$has.gamma,length(model.name));
+ if (length(is.logistic)==1) is.logistic = rep(is.logistic,length(model.name));
+ if (length(model.name)!=length(nFactors)) stop("When calling fit.bst, model.name and nFactors must have the same length.");
+ if (length(model.name)!=length(control$has.gamma)) stop("When calling fit.bst, model.name and control$has.gamma must have the same length.");
+ if (length(model.name)!=length(is.logistic)) stop("When calling fit.bst, model.name and is.logistic must have the same length");
+ for (i in 1:length(is.logistic))
+ {
+ if (is.logistic[i]!=0 && is.logistic[i]!=1) stop("When calling fit.bst, is.logistic must be boolean!");
+ if (is.logistic[i] && length(which(obs.train$y!=0 & obs.train$y!=1))>0) stop("When calling fit.bst, logistic link function should not be used for non-binary training data! Please set is.logistic=FALSE");
+ if (is.logistic[i] && length(which(obs.test$y!=0 & obs.test$y!=1))>0) stop("When calling fit.bst, logistic link function should not be used for non-binary test data! Please set is.logistic=FALSE");
+ }
+
+ setting = data.frame(
+ name = model.name,
+ nFactors = nFactors, # number of interaction factors
+ has.u = rep(!src.dst.same,length(model.name)), # whether to use u_i' v_j or v_i' v_j
+ has.gamma = control$has.gamma,
+ nLocalFactors = rep(0,length(model.name)), # just set to 0
+ is.logistic = is.logistic # whether to use the logistic model for binary rating
+ );
+ if (is.null(control$nBurnin)) nBurnin = floor(nSamplesPerIter*0.1) else nBurnin=control$nBurnin;
+
+ if (!is.null(control$reg.algo)) {
+ if (control$reg.algo=="GLMNet") {
+ source(sprintf("%ssrc/R/model/GLMNet.R",code.dir));
+ reg.algo = GLMNet;
+ }
+ if (control$reg.algo=="RandomForest") {
+ source(sprintf("%ssrc/R/model/RandomForest.R",code.dir));
+ reg.algo = RandomForest;
+ }
+ } else {
+ reg.algo = NULL;
+ }
+ init.params = control$init.params;
+ ans = run.multicontext(
+ data.train=data.train, data.test=data.test,
+ setting=setting, # Model setting
+ nSamples=nSamplesPerIter, # Number of samples drawn in each E-step: could be a vector of size nIter.
+ nBurnIn=nBurnin, # Number of burn-in draws before take samples for the E-step: could be a vector of size nIter.
+ nIter=nIter, # Number of EM iterations
+ approx.interaction=TRUE, # In prediction, predict E[uv] as E[u]E[v].
+ reg.algo=reg.algo, # The regression algorithm to be used in the M-step (NULL => linear regression)
+ reg.control=control$reg.control, # The control parameter for reg.algo
+ # initialization parameters
+ var_alpha=init.params$var_alpha, var_beta=init.params$var_beta, var_gamma=init.params$var_gamma,
+ var_v=init.params$var_v, var_u=init.params$var_u, var_w=init.params$var_w, var_y=init.params$var_y,
+ relative.to.var_y=init.params$relative.to.var_y, var_alpha_global=init.params$var_alpha_global, var_beta_global=init.params$var_beta_global,
+ # others
+ IDs=data.test$IDs,
+ out.level=1, # out.level=1: Save the factor & parameter values to out.dir/model.last and out.dir/model.minTestLoss
+ out.dir=out.dir, # out.level=2: Save the factor & parameter values of each iteration i to out.dir/model.i
+ out.overwrite=TRUE, # whether to overwrite the output directory if it exists
+ debug=0, # Set to 0 to disable internal sanity checking; Set to 100 for most detailed sanity checking
+ verbose=1, # Set to 0 to disable console output; Set to 100 to print everything to the console
+ verbose.M=2,
+ rm.factors.without.obs.in.loglik=TRUE,
+ ridge.lambda=1, # Add diag(lambda) to X'X in linear regression
+ zero.mean=rep(0,0), # zero.mean["alpha"] = TRUE -> g = 0, etc.
+ fix.var=NULL, # fix.var[["u"]] = n -> var_u = n (NULL -> default, list() -> fix no var)
+ max.nObs.for.b=NULL,# maximum number of observations to be used to fit b
+ rnd.seed.init=control$random.seed, rnd.seed.fit=control$random.seed+1
+ );
+ # Do prediction
+ if (!is.null(data.test)) {
+ pred.y = list();
+ for (i in 1:length(model.name)) {
+ load(sprintf("%s_%s/model.last",out.dir,model.name[i]));
+ pred.model = predict.multicontext(
+ model=list(factor=factor, param=param),
+ obs=data.test$obs, feature=data.test$feature, is.logistic=is.logistic[i]
+ );
+ d = data.frame(y=obs.test$y,pred_y=pred.model$pred.y);
+ write.table(d,sprintf("%s_%s/prediction",out.dir,model.name[i]),row.names=F,col.names=T,quote=F,sep="\t");
+ pred.y = c(pred.y, list(pred.model$pred.y));
+ }
+ names(pred.y) = model.name;
+ ans$pred.y = pred.y;
+ }
+ ans
+}
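+
+# Note (a sketch of the output layout, inferred from the code above): for each
+# model name M, fit.bst writes its output (summary, model.last, prediction)
+# under the directory paste(out.dir, "_", M, sep="").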
+
+load.code <- function(code.dir)
+{
+ dyn.load(sprintf("%slib/c_funcs.so",code.dir));
+ source(sprintf("%ssrc/R/c_funcs.R",code.dir));
+ source(sprintf("%ssrc/R/util.R",code.dir));
+ source(sprintf("%ssrc/R/model/util.R",code.dir));
+ source(sprintf("%ssrc/R/model/multicontext_model_utils.R",code.dir));
+ source(sprintf("%ssrc/R/model/multicontext_model_MStep.R",code.dir));
+ source(sprintf("%ssrc/R/model/multicontext_model_EM.R",code.dir));
+}
+
+fit.bst.control <- function (
+ rm.self.link = FALSE, # Whether to remove self-links (observations with src_id == dst_id)
+ add.intercept = TRUE, # Whether to add an intercept to each feature matrix
+ has.gamma = NULL, # Whether to include the context main effect; can be a vector whose length equals the number of model names
+ reg.algo=NULL, # The regression algorithm to be used in the M-step (NULL => linear regression), "GLMNet", or "RandomForest"
+ reg.control=NULL, # The control parameter for reg.algo
+ nBurnin = NULL, # Default is 10% of nSamplesPerIter
+ init.params = list(var_alpha=1, var_beta=1, var_gamma=1,
+ var_u=1, var_v=1, var_w=1, var_y=NULL,
+ relative.to.var_y=FALSE, var_alpha_global=1, var_beta_global=1), # Initial params for all variance components
+ random.seed = 0, # The random seed
+ ...
+) {
+ if (length(rm.self.link)!=1) stop("When calling fit.bst, rm.self.link in control=fit.bst.control(..., rm.self.link, ...) must be a scalar boolean!");
+ if (length(add.intercept)!=1) stop("When calling fit.bst, add.intercept in control=fit.bst.control(..., add.intercept, ...) must be a scalar boolean!");
+ if (rm.self.link!=0 && rm.self.link!=1) stop("When calling fit.bst, rm.self.link in control=fit.bst.control(..., rm.self.link, ...) must be boolean!");
+ if (add.intercept!=0 && add.intercept!=1) stop("When calling fit.bst, add.intercept in control=fit.bst.control(..., add.intercept, ...) must be boolean!");
+ if (!is.null(has.gamma)) {
+ for (i in 1:length(has.gamma))
+ {
+ if (has.gamma[i]!=0 && has.gamma[i]!=1) stop("When calling fit.bst, has.gamma in control=fit.bst.control(..., has.gamma, ...) must be boolean!");
+ }
+ }
+ if (!is.null(reg.algo)) {
+ if (reg.algo!="GLMNet" && reg.algo!="RandomForest") stop("When calling fit.bst, reg.algo in control=fit.bst.control(..., reg.algo, ...) must be NULL, 'GLMNet', or 'RandomForest'. Make sure they are strings if not NULL.");
+ }
+ if (!is.null(nBurnin)) {
+ if (nBurnin<0 || floor(nBurnin)!=nBurnin || length(nBurnin)>1) stop("When calling fit.bst, nBurnin in control=fit.bst.control(..., nBurnin, ...) must be a single non-negative integer");
+ }
+ list(rm.self.link=rm.self.link,add.intercept=add.intercept, has.gamma=has.gamma, reg.algo=reg.algo, reg.control=reg.control, nBurnin=nBurnin, init.params=init.params, random.seed=random.seed)
+}
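+
+# A minimal usage sketch (argument values are illustrative, not
+# recommendations): build a control object and pass it to fit.bst.
+#   control = fit.bst.control(rm.self.link=TRUE, nBurnin=20);
+#   ans = fit.bst(obs.train=obs.train, obs.test=obs.test,
+#                 out.dir="/tmp/bst/example", model.name="uvw3",
+#                 nFactors=3, nIter=10, control=control);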
+
+predict.bst <- function(
+ model.file,
+ obs.test, # The testing response data
+ x_obs.test = NULL, # The data of testing observation features
+ x_src = NULL, # The data of context features for source nodes
+ x_dst = NULL, # The data of context features for destination nodes
+ x_ctx = NULL, # The data of context features for edges
+ code.dir = "" # The top-level directory of where code get installed, "" if you are in that directory
+){
+ library(Matrix);
+ if (code.dir!="") code.dir = sprintf("%s/",code.dir);
+ # Load all the required libraries and source code
+ if (class(try(load.code(code.dir)))=="try-error") stop("When calling predict.bst, code.dir is not specified correctly. code.dir must be the path of the top-level directory of this package (which contains src/ as a sub-directory and file LICENSE).");
+
+ if(!file.exists(model.file)) stop("When calling predict.bst, the file specified by model.file='",model.file,"' does not exist. Please specify an existing model file.");
+ if("factor" %in% ls()) rm(factor);
+ if("param" %in% ls()) rm(param);
+ if("data.train" %in% ls()) rm(data.train);
+ load(model.file);
+ if(!all(c("factor", "param", "data.train") %in% ls())) stop("When calling predict.bst, some problem with model.file='",model.file,"': The file is not a model file or is corrupted.");
+ data.test = indexTestData(
+ data.train=data.train, obs=obs.test,
+ x_obs=x_obs.test, x_src=x_src, x_dst=x_dst, x_ctx=x_ctx
+ );
+
+ pred = predict.multicontext(
+ model=list(factor=factor, param=param),
+ obs=data.test$obs, feature=data.test$feature, is.logistic=param$is.logistic
+ );
+
+ return(pred);
+}
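+
+# Usage sketch (the model file path is illustrative): load a model saved by
+# fit.bst and score a test set; pred$pred.y holds the predicted responses.
+#   pred = predict.bst(model.file="/tmp/bst/quick-start_uvw3-F/model.last",
+#                      obs.test=obs.test, x_obs.test=x_obs.test,
+#                      x_src=x_src, x_dst=x_dst, x_ctx=x_ctx);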
+
278 src/R/examples/tutorial-BST.R
@@ -1,6 +1,93 @@
+### Copyright (c) 2011, Yahoo! Inc. All rights reserved.
+### Copyrights licensed under the New BSD License. See the accompanying LICENSE file for terms.
+###
+### Authors: Bee-Chung Chen and Liang Zhang
###
-### Example 1: Fit the BST model with dense features
+### Quick Start
+###
+
+# (1) Read input data
+input.dir = "test-data/multicontext_model/simulated-mtx-uvw-10K"
+# (1.1) Training observations and observation features
+obs.train = read.table(paste(input.dir,"/obs-train.txt",sep=""),
+ sep="\t", header=FALSE, as.is=TRUE);
+names(obs.train) = c("src_id", "dst_id", "src_context",
+ "dst_context", "ctx_id", "y");
+x_obs.train = read.table(paste(input.dir,"/dense-feature-obs-train.txt",
+ sep=""), sep="\t", header=FALSE, as.is=TRUE);
+# (1.2) Test observations and observation features
+obs.test = read.table(paste(input.dir,"/obs-test.txt",sep=""),
+ sep="\t", header=FALSE, as.is=TRUE);
+names(obs.test) = c("src_id", "dst_id", "src_context",
+ "dst_context", "ctx_id", "y");
+x_obs.test = read.table(paste(input.dir,"/dense-feature-obs-test.txt",
+ sep=""), sep="\t", header=FALSE, as.is=TRUE);
+# (1.3) User/item/context features
+x_src = read.table(paste(input.dir,"/dense-feature-user.txt",sep=""),
+ sep="\t", header=FALSE, as.is=TRUE);
+names(x_src)[1] = "src_id";
+x_dst = read.table(paste(input.dir,"/dense-feature-item.txt",sep=""),
+ sep="\t", header=FALSE, as.is=TRUE);
+names(x_dst)[1] = "dst_id";
+x_ctx = read.table(paste(input.dir,"/dense-feature-ctxt.txt",sep=""),
+ sep="\t", header=FALSE, as.is=TRUE);
+names(x_ctx)[1] = "ctx_id";
+
+# (2) Fit Models
+source("src/R/BST.R");
+# (2.1) Fit a model without features
+ans = fit.bst(obs.train=obs.train, obs.test=obs.test,
+ out.dir="/tmp/bst/quick-start", model.name="uvw3",
+ nFactors=3, nIter=10);
+# (2.2) Fit a model with features
+ans = fit.bst(obs.train=obs.train, obs.test=obs.test, x_obs.train=x_obs.train,
+ x_obs.test=x_obs.test, x_src=x_src, x_dst=x_dst, x_ctx=x_ctx,
+ out.dir="/tmp/bst/quick-start",
+ model.name="uvw3-F", nFactors=3, nIter=10);
+
+# (3) Check the Output
+# (3.1) Check the summary of EM iterations
+read.table("/tmp/bst/quick-start_uvw3-F/summary", header=TRUE);
+# (3.2) Check the fitted model
+load("/tmp/bst/quick-start_uvw3-F/model.last");
+str(param);
+str(factor);
+str(data.train);
+
+# (4) Make Predictions
+pred = predict.bst(
+ model.file="/tmp/bst/quick-start_uvw3-F/model.last",
+ obs.test=obs.test, x_obs.test=x_obs.test,
+ x_src=x_src, x_dst=x_dst, x_ctx=x_ctx);
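+# (Sketch) pred$pred.y aligns with the rows of obs.test; inspect a few
+# predictions next to the observed responses.
+head(data.frame(y=obs.test$y, pred_y=pred$pred.y));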
+
+# Fit Multiple Models in One Call
+ans = fit.bst(obs.train=obs.train, obs.test=obs.test, x_obs.train=x_obs.train,
+ x_obs.test=x_obs.test, x_src=x_src, x_dst=x_dst, x_ctx=x_ctx,
+ out.dir = "/tmp/bst/quick-start",
+ model.name=c("uvw1", "uvw2"), nFactors=c(1,2), nIter=10);
+
+# Fit the Original BST Model
+obs.train$src_context = obs.train$dst_context = obs.train$ctx_id;
+obs.test$src_context = obs.test$dst_context = obs.test$ctx_id;
+ans = fit.bst(obs.train=obs.train, obs.test=obs.test,
+ x_obs.train=x_obs.train, x_obs.test=x_obs.test,
+ x_src=x_src, x_dst=x_dst, x_ctx=x_ctx,
+ out.dir="/tmp/bst/quick-start", model.name="original-bst",
+ nFactors=3, nIter=10, src.dst.same=TRUE,
+ control=fit.bst.control(has.gamma=FALSE, rm.self.link=TRUE));
+
+# Fit RLFM
+obs.train$src_context = obs.train$dst_context = obs.train$ctx_id = NULL;
+obs.test$src_context = obs.test$dst_context = obs.test$ctx_id = NULL;
+ans = fit.bst(obs.train=obs.train, obs.test=obs.test,
+ x_obs.train=x_obs.train, x_obs.test=x_obs.test,
+ x_src=x_src, x_dst=x_dst,
+ out.dir="/tmp/bst/quick-start", model.name="uvw3-F",
+ nFactors=3, nIter=10);
+
+###
+### Appendix Example 1: Fit the BST model with dense features
###
library(Matrix);
dyn.load("lib/c_funcs.so");
@@ -77,14 +164,12 @@ source("src/R/model/multicontext_model_EM.R");
set.seed(2);
out.dir = "/tmp/unit-test/simulated-mtx-uvw-10K";
ans = run.multicontext(
- obs=data.train$obs, # Observation table
- feature=data.train$feature, # Features
- setting=setting, # Model setting
- nSamples=200, # Number of samples drawn in each E-step: could be a vector of size nIter.
- nBurnIn=20, # Number of burn-in draws before take samples for the E-step: could be a vector of size nIter.
- nIter=10, # Number of EM iterations
- test.obs=data.test$obs, # Test data: Observations for testing (optional)
- test.feature=data.test$feature, # Features for testing (optional)
+ data.train=data.train, # training data
+ data.test=data.test, # test data (optional)
+ setting=setting, # Model setting
+ nSamples=200, # Number of samples drawn in each E-step: could be a vector of size nIter.
+ nBurnIn=20, # Number of burn-in draws before taking samples for the E-step: could be a vector of size nIter.
+ nIter=10, # Number of EM iterations
approx.interaction=TRUE, # predict E[uv] as E[u]E[v].
reg.algo=NULL, # The regression algorithm to be used in the M-step (NULL => linear regression)
reg.control=NULL, # The control parameter for reg.algo
@@ -93,7 +178,6 @@ ans = run.multicontext(
var_v=1, var_u=1, var_w=1, var_y=NULL,
relative.to.var_y=FALSE, var_alpha_global=1, var_beta_global=1,
# others
- IDs=data.test$IDs,
out.level=1, # out.level=1: Save the factor & parameter values to out.dir/model.last and out.dir/model.minTestLoss
out.dir=out.dir, # out.level=2: Save the factor & parameter values of each iteration i to out.dir/model.i
out.overwrite=TRUE, # whether to overwrite the output directory if it exists
@@ -123,8 +207,9 @@ str(pred);
###
-### Example 2: Fit the BST model with sparse features
-### glmnet is used to fit prior regression parameters
+### Appendix Example 2:
+### Fit the BST model with sparse features
+### glmnet is used to fit prior regression parameters
###
library(Matrix);
dyn.load("lib/c_funcs.so");
@@ -204,14 +289,12 @@ source("src/R/model/GLMNet.R");
set.seed(2);
out.dir = "/tmp/tutorial-BST/example-2";
ans = run.multicontext(
- obs=data.train$obs, # Observation table
- feature=data.train$feature, # Features
+ data.train=data.train, # training data
+ data.test=data.test, # test data (optional)
setting=setting, # Model setting
nSamples=200, # Number of samples drawn in each E-step: could be a vector of size nIter.
nBurnIn=20, # Number of burn-in draws before taking samples for the E-step: could be a vector of size nIter.
nIter=10, # Number of EM iterations
- test.obs=data.test$obs, # Test data: Observations for testing (optional)
- test.feature=data.test$feature, # Features for testing (optional)
approx.interaction=TRUE, # predict E[uv] as E[u]E[v].
reg.algo=GLMNet, # The regression algorithm to be used in the M-step (NULL => linear regression)
# initialization parameters
@@ -219,7 +302,6 @@ ans = run.multicontext(
var_v=1, var_u=1, var_w=1, var_y=NULL,
relative.to.var_y=FALSE, var_alpha_global=1, var_beta_global=1,
# others
- IDs=data.test$IDs,
out.level=1, # out.level=1: Save the factor & parameter values to out.dir/model.last and out.dir/model.minTestLoss
out.dir=out.dir, # out.level=2: Save the factor & parameter values of each iteration i to out.dir/model.i
out.overwrite=TRUE, # whether to overwrite the output directory if it exists
@@ -232,7 +314,7 @@ ans = run.multicontext(
###
-### Example 3: Add more EM iterations to an already fitted model
+### Appendix Example 3: Add more EM iterations to an already fitted model
###
### Example scenario: After running Example 2 with 10 EM iterations,
### you feel that the model has not yet converged and want to add
@@ -312,15 +394,12 @@ source("src/R/model/GLMNet.R");
out.dir = "/tmp/tutorial-BST/example-3_uvw2";
set.seed(2);
ans = fit.multicontext(
- obs=data.train$obs, # Observation table
- feature=data.train$feature, # Features
+ data.train=data.train, # training data
+ data.test=data.test, # test data (optional)
init.model=model, # Initial model = list(factor, param)
nSamples=200, # Number of samples drawn in each E-step: could be a vector of size nIter.
nBurnIn=20, # Number of burn-in draws before taking samples for the E-step: could be a vector of size nIter.
nIter=5, # Number of EM iterations
- test.obs=data.test$obs, # Test data: Observations for testing
- test.feature=data.test$feature, # Features for testing
- IDs=data.test$IDs,
is.logistic=FALSE,
out.level=1, # out.level=1: Save the factor & parameter values to out.dir/model.last and out.dir/model.minTestLoss
out.dir=out.dir, # out.level=2: Save the factor & parameter values of each iteration i to out.dir/model.i
@@ -340,8 +419,9 @@ str(param, max.level=2);
str(factor);
###
-### Example 4: Fit the RLFM model with sparse features
-### glmnet is used to fit prior regression parameters
+### Appendix Example 4:
+### Fit the RLFM model with sparse features
+### glmnet is used to fit prior regression parameters
###
library(Matrix);
dyn.load("lib/c_funcs.so");
@@ -422,14 +502,12 @@ source("src/R/model/GLMNet.R");
set.seed(2);
out.dir = "/tmp/tutorial-BST/example-4";
ans = run.multicontext(
- obs=data.train$obs, # Observation table
- feature=data.train$feature, # Features
+ data.train=data.train, # training data
+ data.test=data.test, # test data (optional)
setting=setting, # Model setting
nSamples=200, # Number of samples drawn in each E-step: could be a vector of size nIter.
nBurnIn=20, # Number of burn-in draws before taking samples for the E-step: could be a vector of size nIter.
nIter=10, # Number of EM iterations
- test.obs=data.test$obs, # Test data: Observations for testing (optional)
- test.feature=data.test$feature, # Features for testing (optional)
approx.interaction=TRUE, # predict E[uv] as E[u]E[v].
reg.algo=GLMNet, # The regression algorithm to be used in the M-step (NULL => linear regression)
# initialization parameters
@@ -437,7 +515,6 @@ ans = run.multicontext(
var_v=1, var_u=1, var_w=1, var_y=NULL,
relative.to.var_y=FALSE, var_alpha_global=1, var_beta_global=1,
# others
- IDs=data.test$IDs,
out.level=1, # out.level=1: Save the factor & parameter values to out.dir/model.last and out.dir/model.minTestLoss
out.dir=out.dir, # out.level=2: Save the factor & parameter values of each iteration i to out.dir/model.i
out.overwrite=TRUE, # whether to overwrite the output directory if it exists
@@ -447,3 +524,144 @@ ans = run.multicontext(
ridge.lambda=1, # Add diag(lambda) to X'X in linear regression
rnd.seed.init=0, rnd.seed.fit=1
);
+
+
+###
+### Appendix Example 5:
+### Fit the BST model with dense features
+### Fit the model without giving the test data to the fitting procedure;
+### then, in a later session, load the test data, index it correctly,
+### and predict the responses in the test data.
+###
+library(Matrix);
+dyn.load("lib/c_funcs.so");
+source("src/R/c_funcs.R");
+source("src/R/util.R");
+source("src/R/model/util.R");
+source("src/R/model/multicontext_model_utils.R");
+set.seed(0);
+
+# (1) Read only the training data (NOT the test data)
+input.dir = "test-data/multicontext_model/simulated-mtx-uvw-10K"
+# (1.1) Training observations and observation features
+obs.train = read.table(paste(input.dir,"/obs-train.txt",sep=""),
+ sep="\t", header=FALSE, as.is=TRUE);
+names(obs.train) = c("src_id", "dst_id", "src_context",
+ "dst_context", "ctx_id", "y");
+x_obs.train = read.table(paste(input.dir,"/dense-feature-obs-train.txt",
+ sep=""), sep="\t", header=FALSE, as.is=TRUE);
+# (1.2) User/item/context features
+x_src = read.table(paste(input.dir,"/dense-feature-user.txt",sep=""),
+ sep="\t", header=FALSE, as.is=TRUE);
+names(x_src)[1] = "src_id";
+x_dst = read.table(paste(input.dir,"/dense-feature-item.txt",sep=""),
+ sep="\t", header=FALSE, as.is=TRUE);
+names(x_dst)[1] = "dst_id";
+x_ctx = read.table(paste(input.dir,"/dense-feature-ctxt.txt",sep=""),
+ sep="\t", header=FALSE, as.is=TRUE);
+names(x_ctx)[1] = "ctx_id";
+
+# (2) Index the training data: Put the input data into the right form
+# Convert IDs into numeric indices and
+# Convert some data frames into matrices
+data.train = indexData(
+ obs=obs.train, src.dst.same=FALSE, rm.self.link=FALSE,
+ x_obs=x_obs.train, x_src=x_src, x_dst=x_dst, x_ctx=x_ctx,
+ add.intercept=TRUE
+);
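+# (Sketch) Inspect the structure of the indexed training data.
+str(data.train, max.level=2);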
+
+# (3) Setup the model(s) to be fitted
+setting = data.frame(
+ name = c("uvw1", "uvw2"),
+ nFactors = c( 1, 2), # number of interaction factors
+ has.u = c( TRUE, TRUE), # whether to use u_i' v_j or v_i' v_j
+ has.gamma = c( FALSE, FALSE), # whether to include gamma_k in the model
+ nLocalFactors = c( 0, 0), # just set to 0
+ is.logistic = c( FALSE, FALSE) # whether to use the logistic response model
+);
+
+# (4) Run the fitting code without supplying the test data
+# See src/R/model/multicontext_model_EM.R: run.multicontext() for details
+dyn.load("lib/c_funcs.so");
+source("src/R/c_funcs.R");
+source("src/R/util.R");
+source("src/R/model/util.R");
+source("src/R/model/multicontext_model_genData.R");
+source("src/R/model/multicontext_model_utils.R");
+source("src/R/model/multicontext_model_MStep.R");
+source("src/R/model/multicontext_model_EM.R");
+set.seed(2);
+out.dir = "/tmp/tutorial-BST/example-5";
+ans = run.multicontext(
+ data.train=data.train, # training data
+ setting=setting, # Model setting
+ nSamples=200, # Number of samples drawn in each E-step: could be a vector of size nIter.
+ nBurnIn=20, # Number of burn-in draws before taking samples for the E-step: could be a vector of size nIter.
+ nIter=10, # Number of EM iterations
+ approx.interaction=TRUE, # predict E[uv] as E[u]E[v].
+ reg.algo=NULL, # The regression algorithm to be used in the M-step (NULL => linear regression)
+ reg.control=NULL, # The control parameter for reg.algo
+ # initialization parameters
+ var_alpha=1, var_beta=1, var_gamma=1,
+ var_v=1, var_u=1, var_w=1, var_y=NULL,
+ relative.to.var_y=FALSE, var_alpha_global=1, var_beta_global=1,
+ # others
+ out.level=1, # out.level=1: Save the factor & parameter values to out.dir/model.last and out.dir/model.minTestLoss
+ out.dir=out.dir, # out.level=2: Save the factor & parameter values of each iteration i to out.dir/model.i
+ out.overwrite=TRUE, # whether to overwrite the output directory if it exists
+ debug=0, # Set to 0 to disable internal sanity checking; Set to 100 for most detailed sanity checking
+ verbose=1, # Set to 0 to disable console output; Set to 100 to print everything to the console
+ verbose.M=2,
+ ridge.lambda=1, # Add diag(lambda) to X'X in linear regression
+ rnd.seed.init=0, rnd.seed.fit=1
+);
+
+# Quit R and start a new R session.
+
+out.dir = "/tmp/tutorial-BST/example-5";
+# Check the output
+read.table(paste(out.dir,"_uvw2/summary",sep=""), header=TRUE, sep="\t", as.is=TRUE);
+
+# Load the model
+load(paste(out.dir,"_uvw2/model.last",sep=""));
+# It loads param, factor, data.train
+str(param);
+str(factor);
+str(data.train); # note: the saved data.train does not include the actual observation data
+
+# Read the test data
+input.dir = "test-data/multicontext_model/simulated-mtx-uvw-10K"
+obs.test = read.table(paste(input.dir,"/obs-test.txt",sep=""),
+ sep="\t", header=FALSE, as.is=TRUE);
+names(obs.test) = c("src_id", "dst_id", "src_context",
+ "dst_context", "ctx_id", "y");
+x_obs.test = read.table(paste(input.dir,"/dense-feature-obs-test.txt",
+ sep=""), sep="\t", header=FALSE, as.is=TRUE);
+x_src = read.table(paste(input.dir,"/dense-feature-user.txt",sep=""),
+ sep="\t", header=FALSE, as.is=TRUE);
+names(x_src)[1] = "src_id";
+x_dst = read.table(paste(input.dir,"/dense-feature-item.txt",sep=""),
+ sep="\t", header=FALSE, as.is=TRUE);
+names(x_dst)[1] = "dst_id";
+x_ctx = read.table(paste(input.dir,"/dense-feature-ctxt.txt",sep=""),
+ sep="\t", header=FALSE, as.is=TRUE);
+names(x_ctx)[1] = "ctx_id";
+
+# Index the test data
+dyn.load("lib/c_funcs.so");
+source("src/R/c_funcs.R");
+source("src/R/util.R");
+source("src/R/model/util.R");
+source("src/R/model/multicontext_model_utils.R");
+data.test = indexTestData(
+ data.train=data.train, obs=obs.test,
+ x_obs=x_obs.test, x_src=x_src, x_dst=x_dst, x_ctx=x_ctx
+);
+
+# Make prediction
+pred = predict.multicontext(
+ model=list(factor=factor, param=param),
+ obs=data.test$obs, feature=data.test$feature, is.logistic=FALSE
+);
+# Now, pred$pred.y contains the predicted rating for data.test$obs
+str(pred);
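+# (Sketch) Test-set RMSE; this assumes the response y is numeric.
+rmse = sqrt(mean((obs.test$y - pred$pred.y)^2));
+print(rmse);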
97 src/R/model/multicontext_model_EM.R
@@ -42,6 +42,12 @@
### You can also set feature$x_obs[] = 0 to obtain a zero-mean model.
### (2) To DISABLE fitting a FACTOR, set param$var_FACTOR == NULL. Both sampling for the
### FACTOR and regression for the FACTOR will be disabled.
+### (3) Object format for data.train and data.test:
+### data.train = list(
+### obs = data.frame(src.id, dst.id, src.context, dst.context, edge.context, y),
+### feature = list(x_src, x_dst, x_ctx, x_obs)
+### )
+### data.test has the same format.
###
### IMPORTANT NOTE:
### (1) If we predict using E[u_i]E[v_j], we should also use E[u_i]E[v_j] in fitting (instead of E[u_i v_j])
@@ -52,15 +58,12 @@
### if is.null(test.obs), then this output will NOT be available.
###
fit.multicontext <- function(
- obs, # Observation table
- feature, # Features
init.model, # Initial model = list(factor, param)
nSamples, # Number of samples drawn in each E-step: could be a vector of size nIter.
nBurnIn, # Number of burn-in draws before taking samples for the E-step: could be a vector of size nIter.
nIter=NULL, # Number of EM iterations
- test.obs=NULL, # Test data: Observations for testing
- test.feature=NULL, # Features for testing
- IDs=NULL,
+ data.train=NULL, # Training data
+ data.test=NULL, # Test data (optional)
is.logistic=FALSE,
out.level=0, # out.level=1: Save the factor & parameter values to out.dir/model.last and out.dir/model.minTestLoss
out.dir=NULL, # out.level=2: Save the factor & parameter values of each iteration i to out.dir/model.i
@@ -75,15 +78,47 @@ fit.multicontext <- function(
rm.factors.without.obs.in.loglik=TRUE,
ridge.lambda=c(b=1, g0=1, d0=1, h0=1, G=1, D=1, H=1), # (or a scalar) Add diag(lambda) to X'X in linear regression
approx.interaction=TRUE, # predict E[uv] as E[u]E[v].
- zero.mean=rep(0,0), # zero.mean["alpha"] = TRUE -> g = 0, etc.
- fix.var=NULL, # fix.var[["u"]] = n -> var_u = n (NULL -> default, list() -> fix no var)
- max.nObs.for.b=NULL # maximum number of observations to be used to fit b
+ zero.mean=rep(0,0), # zero.mean["alpha"] = TRUE -> g = 0, etc.
+ fix.var=NULL, # fix.var[["u"]] = n -> var_u = n (NULL -> default, list() -> fix no var)
+ max.nObs.for.b=NULL, # maximum number of observations to be used to fit b
+ # The following five are for backward compatibility when data.train=NULL and/or data.test=NULL
+ IDs=NULL,
+ obs=NULL, # Training data: Observation table
+ feature=NULL, # Features
+ test.obs=NULL, # Test data: Observations for testing
+ test.feature=NULL # Features for testing
){
factor = init.model$factor;
param = init.model$param;
param$approx.interaction = approx.interaction;
if(approx.interaction) test.obs.for.Estep = NULL
else test.obs.for.Estep = test.obs;
+ param$is.logistic = is.logistic;
+
+ # setup obs, feature, test.obs, test.feature
+ if(!is.null(data.train)){
+ if(!all(c("obs", "feature") %in% names(data.train))) stop("Please check input parameter 'data.train' when calling function fit.multicontext or run.multicontext: data.train$obs and data.train$feature cannot be NULL");
+ if(!is.null(obs)) stop("When calling function fit.multicontext or run.multicontext, if you already specified 'data.train', then you should set 'obs=NULL'");
+ if(!is.null(feature)) stop("When calling function fit.multicontext or run.multicontext, if you already specified 'data.train', then you should set 'feature=NULL'");
+ obs=data.train$obs;
+ feature=data.train$feature;
+ data.train$obs = NULL;
+ data.train$feature = NULL;
+ }else{
+ if(is.null(obs) || is.null(feature)) stop("Please specify input parameter 'data.train' when calling function fit.multicontext or run.multicontext");
+ }
+ if(!is.null(data.test)){
+ if(!all(c("obs", "feature") %in% names(data.test))) stop("Please check input parameter 'data.test' when calling function fit.multicontext or run.multicontext: data.test$obs and data.test$feature cannot be NULL");
+ if(!is.null(test.obs)) stop("When calling function fit.multicontext or run.multicontext, if you already specified 'data.test', then you should set 'test.obs=NULL'");
+ if(!is.null(test.feature)) stop("When calling function fit.multicontext or run.multicontext, if you already specified 'data.test', then you should set 'test.feature=NULL'");
+ test.obs=data.test$obs;
+ test.feature=data.test$feature;
+ if(is.null(IDs)) IDs = data.test$IDs;
+ }else{
+ if(( is.null(test.obs) && !is.null(test.feature)) ||
+ (!is.null(test.obs) && is.null(test.feature)))
+ stop("If you want to supply test data to the fitting code, please specify input parameter 'data.test' when calling function fit.multicontext or run.multicontext");
+ }
# Sanity check
if(out.level > 0 && is.null(out.dir)) stop("Please specify input parameter 'out.dir' when calling function fit.multicontext or run.multicontext with out.level > 0");
@@ -292,10 +327,10 @@ fit.multicontext <- function(
time.used.3 = proc.time() - b.time.test;
if(verbose >= 1){
- cat(" training loss: ", attr(loglik$CD,"loss"), " (",time.used.TestLoss[3]," sec)\n",sep="");
+ cat(" training loss: ", attr(loglik$CD,"loss"), "\n",sep="");
if(!is.null(test.obs))
cat(" test loss: ", prediction$test.loss, " (",time.used.TestLoss[3]," sec)\n",sep="");
- }
+ }
###
### Update the model.minTestLoss model if the TestLoss decreases
@@ -312,7 +347,8 @@ fit.multicontext <- function(
out.dir=out.dir, factor=factor, param=param, IDs=IDs,
prediction=prediction, loglik=loglik$CD,
minTestLoss=minTestLoss, nSamples=nSamples, iter=iter, out.level=out.level, out.overwrite=out.overwrite,
- TimeEStep=time.used.1[3], TimeMStep=time.used.2[3], TimeTest=time.used.3[3], verbose=verbose, name="model"
+ TimeEStep=time.used.1[3], TimeMStep=time.used.2[3], TimeTest=time.used.3[3], verbose=verbose, name="model",
+ data.train=data.train
);
}
@@ -338,15 +374,19 @@ fit.multicontext <- function(
# setting = data.frame(name, nFactors, has.u, has.gamma, nLocalFactors, is.logistic)
# T/F T/F 0 or n T/F
# Output dir is out.dir_name
+# data.train = list(
+# obs = data.frame(src.id, dst.id, src.context, dst.context, edge.context, y),
+# feature = list(x_src, x_dst, x_ctx, x_obs)
+# )
+# data.test has the same format
+#
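+# Example setting (illustrative values; scalars are recycled by data.frame):
+#   setting = data.frame(name=c("uvw1","uvw2"), nFactors=c(1,2),
+#                        has.u=TRUE, has.gamma=FALSE,
+#                        nLocalFactors=0, is.logistic=FALSE);
+#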
run.multicontext <- function(
- obs, # Observation table
- feature, # Features
+ data.train=NULL, # Training data
setting, # Model setting
nSamples, # Number of samples drawn in each E-step: could be a vector of size nIter.
nBurnIn, # Number of burn-in draws before taking samples for the E-step: could be a vector of size nIter.
nIter=NULL, # Number of EM iterations
- test.obs=NULL, # Test data: Observations for testing
- test.feature=NULL, # Features for testing
+ data.test=NULL, # Test data (optional)
return.models=FALSE,
approx.interaction=TRUE, # predict E[uv] as E[u]E[v].
reg.algo=NULL, # The regression algorithm to be used in the M-step (NULL => linear regression)
@@ -356,7 +396,6 @@ run.multicontext <- function(
var_v=1, var_u=1, var_w=1, var_y=NULL,
relative.to.var_y=FALSE, var_alpha_global=1, var_beta_global=1,
# others
- IDs=NULL,
out.level=0, # out.level=1: Save the factor & parameter values to out.dir/model.last and out.dir/model.minTestLoss
out.dir=NULL, # out.level=2: Save the factor & parameter values of each iteration i to out.dir/model.i
out.overwrite=FALSE,
@@ -372,7 +411,13 @@ run.multicontext <- function(
zero.mean=rep(0,0), # zero.mean["alpha"] = TRUE -> g = 0, etc.
fix.var=NULL, # fix.var[["u"]] = n -> var_u = n (NULL -> default, list() -> fix no var)
max.nObs.for.b=NULL,# maximum number of observations to be used to fit b
- rnd.seed.init=NULL, rnd.seed.fit=NULL
+ rnd.seed.init=NULL, rnd.seed.fit=NULL,
+ # The following five are for backward compatibility when data.train=NULL and/or data.test=NULL
+ IDs=NULL,
+ obs=NULL, # Training data: Observation table
+ feature=NULL, # Features
+ test.obs=NULL, # Test data: Observations for testing
+ test.feature=NULL # Features for testing
){
if(length(unique(setting$name)) != nrow(setting)) stop("Please check input parameter 'setting' when calling function run.multicontext: setting$name must be a column of unique identifiers");
if(!out.overwrite){
@@ -408,7 +453,7 @@ run.multicontext <- function(
if(!is.null(rnd.seed.init)) set.seed(rnd.seed.init);
begin.time = proc.time();
init = init.simple.random(
- obs=obs, feature=feature,
+ data.train=data.train, obs=obs, feature=feature,
nFactors=setting[k,"nFactors"], has.u=setting[k,"has.u"], has.gamma=setting[k,"has.gamma"],
nLocalFactors=setting[k,"nLocalFactors"], is.logistic=setting[k,"is.logistic"],
var_alpha=var_alpha, var_beta=var_beta, var_gamma=var_gamma,
@@ -420,6 +465,7 @@ run.multicontext <- function(
if(!is.null(rnd.seed.fit)) set.seed(rnd.seed.fit);
ans = fit.multicontext(
+ data.train=data.train, data.test=data.test,
obs=obs, feature=feature, init.model=init, nSamples=nSamples, nBurnIn=nBurnIn, nIter=nIter,
test.obs=test.obs, test.feature=test.feature,
IDs=IDs, is.logistic=setting[k,"is.logistic"],
@@ -633,12 +679,25 @@ MCEM_EStep.multicontext.C <- function(
### g0, d0, h0, G, D, H are all 0
###
init.simple.random <- function(
- obs, feature, nFactors, has.u, has.gamma,
+ data.train=NULL, obs=NULL, feature=NULL, nFactors, has.u, has.gamma,
nLocalFactors=NULL, is.logistic=FALSE,
var_alpha=1, var_beta=1, var_gamma=1,
var_v=1, var_u=1, var_w=1, var_y=1,
relative.to.var_y=FALSE, var_alpha_global=1, var_beta_global=1
){
+ # setup obs, feature, test.obs, test.feature
+ if(!is.null(data.train)){
+ if(!all(c("obs", "feature") %in% names(data.train))) stop("Please check input parameter 'data.train' when calling function fit.multicontext or run.multicontext: data.train$obs and data.train$feature cannot be NULL");
+ if(!is.null(obs)) stop("When calling function fit.multicontext or run.multicontext, if you already specified 'data.train', then you should set 'obs=NULL'");
+ if(!is.null(feature)) stop("When calling function fit.multicontext or run.multicontext, if you already specified 'data.train', then you should set 'feature=NULL'");
+ obs=data.train$obs;
+ feature=data.train$feature;
+ data.train$obs = NULL;
+ data.train$feature = NULL;
+ }else{
+ if(is.null(obs) || is.null(feature)) stop("Please specify input parameter 'data.train' when calling function fit.multicontext or run.multicontext");
+ }
+
nObs = nrow(obs); nSrcNodes = nrow(feature$x_src); nDstNodes = nrow(feature$x_dst);
nSrcFeatures = ncol(feature$x_src);
nDstFeatures = ncol(feature$x_dst);
2  src/R/model/multicontext_model_utils.R
@@ -281,7 +281,7 @@ syncheck.multicontext.spec <- function(
feature.name.allowed = c(),
param.name.required = c("b", "g0", "d0", "var_y", "var_alpha", "var_beta"),
param.name.optional = c("h0", "G", "D", "H", "q", "r", "var_alpha_global","var_beta_global", "var_gamma", "var_u", "var_v", "var_w"),
- param.name.allowed = c("nLocalFactors", "approx.interaction", "xi", "reg.algo", "reg.control")
+ param.name.allowed = c("nLocalFactors", "approx.interaction", "xi", "reg.algo", "reg.control", "is.logistic")
){
factor.name.all = c(factor.name.required, factor.name.optional, factor.name.allowed);
obs.name.all = c(obs.name.required, obs.name.optional, obs.name.allowed);
7 src/R/model/util.R
@@ -601,7 +601,7 @@ output.to.dir <- function(
out.dir, factor, param, IDs, prediction, loglik,
minTestLoss, nSamples, iter, out.level, out.overwrite,
TimeEStep, TimeMStep, TimeTest, verbose,
- other=NULL, name="est"
+ other=NULL, name="est", data.train=NULL
){
if(!is.null(prediction) && is.null(prediction$test.loss)) stop("prediction$test.loss does not exist");
if(!is.null(prediction) && is.null(prediction$rmse)) stop("prediction$rmse does not exist");
@@ -621,6 +621,7 @@ output.to.dir <- function(
}
thisTestLoss = if(is.null(prediction)) -1 else prediction$test.loss;
+ TestRMSE = if(is.null(prediction)) -1 else prediction$rmse;
if(iter == 0){
if(out.level >= 2) save(file=paste(out.dir,"/",name,".0",sep=""), list=c("factor", "param"));
@@ -631,7 +632,7 @@ output.to.dir <- function(
if(file.exists(file.prev)) file.remove(file.prev);
file.rename(file, file.prev);
}
- save(file=file, list=c("factor", "param", "prediction", "IDs", "other"));
+ save(file=file, list=c("factor", "param", "prediction", "IDs", "other", "data.train"));
if(!is.null(prediction)){
if(thisTestLoss == minTestLoss) file.copy(file, paste(out.dir,"/",name,".minTestLoss",sep=""), overwrite=TRUE);
}
@@ -641,7 +642,7 @@ output.to.dir <- function(
nSamples = if(iter > 0) nSamples[iter] else 0;
}
- summary = data.frame(Method="MCEM", Iter=iter, nSteps=nSamples, CDlogL=loglik, TestLoss=thisTestLoss, LossInTrain=attr(loglik,"loss"), TestRMSE=prediction$rmse, TimeEStep=TimeEStep, TimeMStep=TimeMStep, TimeTest=TimeTest);
+ summary = data.frame(Method="MCEM", Iter=iter, nSteps=nSamples, CDlogL=loglik, TestLoss=thisTestLoss, LossInTrain=attr(loglik,"loss"), TestRMSE=TestRMSE, TimeEStep=TimeEStep, TimeMStep=TimeMStep, TimeTest=TimeTest);
file = paste(out.dir,"/summary",sep="");
if(file.exists(file)) write.table(summary, file=file, append=TRUE, quote=FALSE, sep="\t", row.names=FALSE, col.names=FALSE)
else write.table(summary, file=file, append=FALSE, quote=FALSE, sep="\t", row.names=FALSE, col.names=TRUE);
4 src/unit-test/multicontext_model/create-simulated-data-1.R
@@ -1,3 +1,7 @@
+### Copyright (c) 2011, Yahoo! Inc. All rights reserved.
+### Copyrights licensed under the New BSD License. See the accompanying LICENSE file for terms.
+###
+### Author: Bee-Chung Chen
###
### Create a small test data for the multi-context model
60 src/unit-test/multicontext_model/regression-test-0.R
@@ -0,0 +1,60 @@
+### Copyright (c) 2011, Yahoo! Inc. All rights reserved.
+### Copyrights licensed under the New BSD License. See the accompanying LICENSE file for terms.
+###
+### Author: Liang Zhang
+
+source("src/R/BST.R");
+
+# (1) Read input data
+input.dir = "test-data/multicontext_model/simulated-mtx-uvw-10K"
+# (1.1) Training observations and observation features
+obs.train = read.table(paste(input.dir,"/obs-train.txt",sep=""), sep="\t", header=FALSE, as.is=TRUE);
+names(obs.train) = c("src_id", "dst_id", "src_context", "dst_context", "ctx_id", "y");
+x_obs.train = read.table(paste(input.dir,"/dense-feature-obs-train.txt",sep=""), sep="\t", header=FALSE, as.is=TRUE);
+# (1.2) Test observations and observation features
+obs.test = read.table(paste(input.dir,"/obs-test.txt",sep=""), sep="\t", header=FALSE, as.is=TRUE);
+names(obs.test) = c("src_id", "dst_id", "src_context", "dst_context", "ctx_id", "y");
+x_obs.test = read.table(paste(input.dir,"/dense-feature-obs-test.txt",sep=""), sep="\t", header=FALSE, as.is=TRUE);
+# (1.3) User/item/context features
+x_src = read.table(paste(input.dir,"/dense-feature-user.txt",sep=""), sep="\t", header=FALSE, as.is=TRUE);
+names(x_src)[1] = "src_id";
+x_dst = read.table(paste(input.dir,"/dense-feature-item.txt",sep=""), sep="\t", header=FALSE, as.is=TRUE);
+names(x_dst)[1] = "dst_id";
+x_ctx = read.table(paste(input.dir,"/dense-feature-ctxt.txt",sep=""), sep="\t", header=FALSE, as.is=TRUE);
+names(x_ctx)[1] = "ctx_id";
+
+# (2) Call BST
+ans = fit.bst(obs.train=obs.train, obs.test=obs.test, out.dir = "/tmp/unit-test/simulated-mtx-uvw-10K", model.name=c("uvw1", "uvw2"), nFactors=c(1,2), nIter=10);
+#ans = fit.bst(obs.train=obs.train, obs.test=obs.test, x_obs.train=x_obs.train, x_obs.test=x_obs.test, x_src=x_src, x_dst=x_dst, x_ctx=x_ctx,
+# out.dir = "/tmp/unit-test/simulated-mtx-uvw-10K", model.name=c("uvw1", "uvw2"), nFactors=c(1,2), nIter=10);
+
+
+# (3) Compare to the reference run
+warnings()
+out.dir = "/tmp/unit-test/simulated-mtx-uvw-10K"
+setting = data.frame(
+ name = c("uvw1", "uvw2"),
+ nFactors = c( 1, 2), # number of interaction factors
+ has.u = c( TRUE, TRUE), # whether to use u_i' v_j or v_i' v_j
+ has.gamma = c( FALSE, FALSE), # whether to include gamma_k in the model
+ nLocalFactors = c( 0, 0), # just set to 0
+ is.logistic = c( FALSE, FALSE) # whether to use the logistic model for binary rating
+);
+ok = TRUE;
+for(i in 1:nrow(setting)){
+ name = setting[i,"name"];
+ smry.1 = read.table(paste(input.dir,"/mtx-",name,".summary.txt",sep=""), header=TRUE, as.is=TRUE);
+ smry.2 = read.table(paste(out.dir,"_",name,"/summary",sep=""), header=TRUE, as.is=TRUE);
+ for(metric in c("TestLoss", "LossInTrain")){
+ diff = abs(smry.1[,metric] - smry.2[,metric]);
+ if(any(diff > 1e-10)){
+ ok=FALSE; cat("Observed difference in ",metric,"\n",sep="");
+ print(data.frame(smry.1["Iter"], smry.1[metric], smry.2[metric], Diff=diff, "."=c("", "***")[(diff>0)+1]));
+ }
+ }
+}
+if(ok){
+ cat("\nNo Problem Found!!\n\n",sep="")
+}else{
+ cat("\nSome Problems Found!!\n\n",sep="");
+}
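+
+# To run this regression test (a sketch; any equivalent way of sourcing the
+# script from the top-level directory of the package works):
+#   R --vanilla < src/unit-test/multicontext_model/regression-test-0.R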
5 src/unit-test/multicontext_model/regression-test-1.R
@@ -1,3 +1,8 @@
+### Copyright (c) 2011, Yahoo! Inc. All rights reserved.
+### Copyrights licensed under the New BSD License. See the accompanying LICENSE file for terms.
+###
+### Author: Bee-Chung Chen
+
###
### Test on simulated data to make sure no bug is introduced when rewriting some part of the code
###
4 test-data/multicontext_model/simulated-mtx-uvw-10K/README
@@ -7,8 +7,8 @@ FILE: obs-{train,test}.txt
Columns:
1. src_id: User ID
2. dst_id: Item ID
- 3. src_context: The mode of the user when rating the item
- 4. dst_context: The mode of the item when rated by the user
+ 3. src_context: The context of the user when rating the item
+ 4. dst_context: The context of the item when rated by the user
5. ctx_id: The context when the user gives the rating to the item
6. y: The rating value
