Probabilistic Semantic Analysis (baseline version)
Switch branches/tags
Nothing to show
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Type Name Latest commit message Commit time
Failed to load latest commit information.
config-aux
docs
AUTHORS
COPYING
ChangeLog
INSTALL
Makefile.am
Makefile.in
NEWS
README
VERSION
aclocal.m4
config.h.in
configure
configure.ac
debug.c
debug.h
em-estep.c
em-estep.h
em-mstep.c
em-mstep.h
input.c
input.h
main.c
output.c
output.h
parameters.c
parameters.h
plsa-defn.h
run.c
run.h
wmalloc.c
wmalloc.h

README

Probabilistic latent semantic analysis
(PLSA, baseline version)
======================================


Table of Contents
-----------------
  I.     Introduction
  II.    Compiling
  III.   Running PLSA
  IV.    Data format
  V.     Sample run
  VI.    Other issues
  VII.   About PLSA
  VIII.  Copyright and license



I.  Introduction
----------------

Probabilistic latent semantic analysis is a method for analyzing 
co-occurrence data.  This source code is a baseline version of PLSA which 
makes no use of any parallelization.  This work is described by the poster 
paper:

R. Wan, V. N. Anh, and H. Mamitsuka. Efficient Probabilistic Latent 
Semantic Analysis Through Parallelization. In Proc. 5th Asia Information 
Retrieval Symposium (poster session), volume 5839 of LNCS, pages 432-443, 
October 2009.

In that paper, there are five systems.  This source code generates one of 
them:  Baseline.  However, in the description below, we refer to this 
system as "PLSA".



II.  Compiling
--------------

The PLSA software is written in C and has been compiled using the GNU gcc 
compiler v4.1.2 under Linux.  The system was tested on a 64-bit system, 
but does not make use of any features specific to 64-bit architectures.

The PLSA system uses Autotools in order to create a Makefile which is 
configured to your system's settings.  This is done using the "configure" 
script.  Simply perform the following:

./configure;make



III  Running PLSA
-----------------

Run "plsa" without any arguments to get a list of possible options:

Probabilistic Latent Semantic Analysis (baseline)
=================================================

Usage:  ./plsa [options]

Options:
--base <file>      :  Base filename for output file.
--cooccur <file>   :  Co-occurrence filename.
--clusters <int>   :  Number of clusters.
--seed <int>       :  Random seed.
                   :    (Default:  current time).
--maxiter <int>    :  Maximum iterations.
--text             :  Text mode (I/O is in text, not binary).
--verbose          :  Verbose mode.
--debug            :  Debugging output.
--rounding         :  Round using 100000000 as the multiplication factor.
--nooutput         :  Suppress outputting p(x,y) to file.

PLSA version:  Mar  7 2010 (15:10:57)


From top-to-bottom, the options mean:

--base:      The filename, before the extension, of the output file.  The
               extension is fixed as ".plsa".
--cooccur:   The input co-occurrence file, whose format is described below.
--clusters:  The number of latent states.
--seed:      The random seed to use.  If none is provided, the current
               system time is used.
--maxiter:   The maximum number of iterations of the EM algorithm to
               perform.  One of two stopping criteria.
--text:      Indicate that the input file is in text and not binary;
               useful for debugging.
--verbose:   Verbose output.
--debug:     Debugging output.  Output is generated as each value is read
               from the input file.  (Note that a lot of output will be
               generated.)
--rounding:  Round the output values in p(x,y) using the specified
               rounding factor.  That is, if the factor is "1000", then
               three decimal places are used.  Useful for comparing
               methods due to the problem with floating point arithmetic
               (details below).
--nooutput:  Do not produce the final output file.  Eliminates the
               creation of a fairly large file.

Many of these parameters have no defaults (such as --maxiter and 
--clusters), so they will have to be explicitly given.



IV.   Data format
-----------------

The input file is provided to the system using the --cooccur switch.  It 
can be in one of two formats:  text or binary.  Binary mode is the default 
and suggested for large files to save space; text mode is good for testing.

In either mode, the format of the file is:

[rows][columns][row id+][column id+][w1 cos_count (w21 c21) ... (w2n c2n)]+

which means:

- rows --  The number of rows in the matrix (unsigned int)
- columns --  The number of columns in the matrix (unsigned int)
- row id --  A sequence of row id's.  The number of them should be
    exactly "rows".
- column id --  A sequence of column id's.  The number of them should be
    exactly "columns".
- [w1 cos_count (w21 c21) ... (w2n c2n)] -- A row in the matrix which
    describes only the non-zero positions.
    - w1 --  The ID of the row.
    - cos_count --  The number of non-zero values in the row.
    - w21 -- Column ID of the value.
    - c21 -- The value itself.

All values are unsigned integers.  The program does not make use of either 
"row id" or "column id" (it was included for a future feature of the 
program that has not yet been implemented).  Any value is fine (i.e., 0 or 
sequential integers).

In text format, the values are separated by a single whitespace (usually 
the tab character).  In binary mode,  they are unsigned integers (usually 
4 bytes in size each).

Remember to specify whether it is a text or binary file by using (or not 
using) the --text switch.  The program does NOT check for binary or text 
mode and problems will occur if the mode does not match the file.

Please see the source in input.c for further details on the file format.


V.    Sample run
----------------

Suppose we have a 3 by 4 matrix like this:

0	0	3	4
0	0	0	6
7	2	1	0

then the test file will appear like this:

3	4
0	1	2
0	1	2	3
0	2	2	3	3	4
1	1	3	6
2	3	0	7	1	2	2	1


The first 3 rows are self-explanatory.  The fourth row is interpretted as 
follows.  In row 0, there are 2 non-zero co-occurrences.  They are at 
position 2 (with a value of 3) and position 3 (with a value of 4).  
Therefore, we have:  (0, 2) = 3 and (0, 3) = 4.

We execute plsa as follows:

  ./plsa --cooccur test.cooccur --maxiter 30 --clusters 4 --text --verbose 
--debug --base out

Without --verbose and --debug, there would be no output generated.  Output 
will vary from system to system, but some sample output would look like 
the following:


Settings
--------
==	Base filename:                                  out
==	Co-occurrence filename:                         test.cooccur
==	Probability data type:                          double
==	Clusters:                                       4
==	Random seed:                                    [from time]
==	Exponent difference [utils.h::addLogsFloat]:    23.02585093
==	Termination conditions
==	  Maximum EM iterations:                        30
==	  Percentage difference:                        0.001000
==	Text mode:                                      yes
==	Rounding:                                       no
==	Suppress output to file:                        no


==	Reading from co-occurrence file...
==	Applying seed from time:                        1267944876
==		Read (0, 2) --> 3
==		Read (0, 3) --> 4
==		Read (1, 3) --> 6
==		Read (2, 0) --> 7
==		Read (2, 1) --> 2
==		Read (2, 2) --> 1
==	Maximum number of pairs:                        12
==	Actual number of pairs in data file:            6
==	Percentage of zeroes:                           50.00 % (6)
==	Sum of co-occurrence counts:                    23
==	Begin initialization...
==	Initialization complete...
==	m = 3; n = 4
[---]  Initial = -61.076698
[  1]  -61.076698 --> -52.328544	[8.748153, 14.3232 %]
[  2]  -52.328544 --> -50.542405	[1.786140, 3.4133 %]
[  3]  -50.542405 --> -47.318958	[3.223446, 6.3777 %]
[  4]  -47.318958 --> -42.657292	[4.661667, 9.8516 %]
[  5]  -42.657292 --> -38.997613	[3.659679, 8.5793 %]
[  6]  -38.997613 --> -37.772321	[1.225292, 3.1420 %]
[  7]  -37.772321 --> -37.573313	[0.199008, 0.5269 %]
[  8]  -37.573313 --> -37.535241	[0.038072, 0.1013 %]
[  9]  -37.535241 --> -37.523889	[0.011352, 0.0302 %]
[ 10]  -37.523889 --> -37.519834	[0.004055, 0.0108 %]
[ 11]  -37.519834 --> -37.518251	[0.001583, 0.0042 %]
[ 12]  -37.518251 --> -37.517603	[0.000648, 0.0017 %]
[ 13]  -37.517603 --> -37.517333	[0.000270, 0.0007 %]
==	Non-probabilities:                              0
==	Sum of p(x,y):                                  1.000000
==	Total output files printed                      1
==	Program execution:                              0.000 secs


The resulting matrix of p(x,y) is stored as out.plsa and look like the 
following:

-----
3	4	0	1	2	0	1	2	3	
-20.596084	-21.298047	-2.036939	-1.749157	
-102.984165	-103.935308	-11.724635	-1.343766	
-1.189584	-2.442347	-3.135510	-14.177547
-----

Whose output format is the same as the input format, except that the 
integral co-occurrence counts are replaced with probabilities in log-space 
as floating point values.



VI.  Other issues
-----------------

This section lists some issues relevant to running or extending the 
program, in no particular order.

Applicable to both versions:

1)  All values are stored in log-space to prevent underflow of floating 
point values.

2)  Probabilities are multiplied by adding values in log-space.  To 
prevent underflow, a macro called logSumsInline is used (defined in 
plsa-defn.h).  It is provided a cut-off called LN_LIMIT which stipulates 
what cut-off (as a ln value) to use when adding two numbers.  If the 
smallest number is "too small", then no addition is performed and the 
larger value is taken.  As a result, the order in which numbers are added 
matters and if   MPI is used a small, negligible difference will appear.  
In other words, if

  x = a + b + c + d

then when MPI with two processors is used, then

  x = y + w
  y = a + b
  w = c + d

If c is very small compared to (a + b) but not so when compared to d, then 
in the first case, it might be dropped.  Generally, the percentage 
difference in maximum likelihood will not change enough to matter.

3)  Also related to the previous point, p(x,y) values written to the 
output file may be different if the binary files are compared directly due 
to the number of digits of precision.  If sent to a text file with only 6 
digits of precision, there may not be any problems.  To solve this, the 
--rounding option was added.  It and the cut-off defined in plsa-defn.h 
will need to be tweaked if two runs are to be compared.

4)  Generating the output file could take half of the total execution 
time, depending on the matrix size.  To suppress printing this file, use 
the --nooutput switch.

5)  To get the time required for a single iteration of the loop (as 
reported in the paper cited in Section 1), use the "--maxiter 1" option.

6)  The four random seeds used for the paper cited in Section 1 were:  
"20444 3612 31325 17062".  Of course, this information alone is not enough 
to generate the exact same results since every system's random number 
generator might be slightly different.

7)  The number of latent states must be larger than the number of 
processors under MPI.  If this is not the case, then the number of latent 
states is increased  automatically.


Applicable to this version only:

1)  There is no maximum number of latent states.



VII.   About PLSA
-----------------

This software was implemented by Raymond Wan:
  E-mail:  r.wan@aist.go.jp
  Homepage:  http://www.cbrc.jp/~rwan/

and is available from his homepage.

This software is actively maintained.  If you have any information about 
bugs, suggestions for the documentation or just have some general 
comments, feel free to write to the above address.

Details about the method is provided in the poster paper given in Section 
1 of this README.



VIII.  Copyright and license
----------------------------

Probablistic latent semantic analysis (PLSA, baseline version)
Copyright (C) 2009-2010  by Raymond Wan (r.wan@aist.go.jp)

This software is distributed under the terms of the GNU General Public 
License
(GPL, version 3 or later) -- see the file COPYING for details.