Source code for the Social Network Viewpoint Discovery Model and baselines


tthonet/SNVDM


SNVDM: the Social Network Viewpoint Discovery Model

Introduction

This repository provides the source code, the executables, and the details needed to retrieve the data used in the paper Users Are Known by the Company They Keep: Topic Models for Viewpoint Discovery in Social Networks by Thibaut Thonet, Guillaume Cabanac, Mohand Boughanem, and Karen Pinel-Sauvagnat, published at CIKM '17. More details about this work can be found in the original paper, whose preprint is available at https://www.irit.fr/publis/IRIS/2017_CIKM_TCBPS.pdf.

The source code presented here is the Java implementation of a collapsed Gibbs sampler for the proposed model SNVDM/SNVDM-GPU and the baselines TAM, SN-LDA, and VODUM -- see our paper for full references to these latter models. The code is based on the JGibbLDA implementation of collapsed Gibbs sampling for LDA (http://jgibblda.sourceforge.net/). This repository also contains details about the Twitter datasets used for the evaluation of models and baselines in our paper. These datasets were introduced in Analyzing Discourse Communities with Distributional Semantic Models by Igor Brigadir, Derek Greene, and Pádraig Cunningham, published at WebSci '15 (https://dl.acm.org/citation.cfm?id=2786470). The original collections are available at http://dx.doi.org/10.6084/m9.figshare.1430449. For any use of these datasets, please cite Brigadir et al.'s paper. As we performed a noise filtering step on the datasets to eliminate irrelevant tweets, we provide in this repository the IDs of the tweets and retweets that were actually used in the evaluation -- tweet content is not provided directly because Twitter's Terms of Use strictly prohibit its redistribution.

Content

In this section, we detail the content of this repository.

  • The directory bin contains the runnable jar files snvdm-gpu.jar, tam.jar, sn-lda.jar, and vodum.jar that can be executed to perform collapsed Gibbs sampling on models SNVDM/SNVDM-GPU, TAM, SN-LDA, and VODUM, respectively. The next sections detail how to use these jar files.
  • The directory data contains 2 sub-directories, indyref (for the dataset on the 2014 Scottish Independence Referendum) and midterms (for the dataset on the 2014 US Midterm Elections), each containing 3 files: tweets.txt, retweets.txt, and user_labels.txt. The first two files give the IDs of the tweets and their retweets, respectively. Note that some of these tweets may no longer be available (and thus cannot be retrieved) if their authors deleted them. The third file maps each user's Twitter ID to their groundtruth viewpoint label: yes/no for indyref and dem/rep (i.e., Democrat and Republican) for midterms. This list of users is the same as that provided by Brigadir et al. Note that the total number of users does not exactly match the numbers given in our paper (Table 2) because some users were left without any tweets after preprocessing and were thus discarded.
  • The directory lib contains the libraries used in the code. It contains the files args4j-2.0.6.jar, commons-io-2.4.jar, and commons-math3-3.5.jar, which correspond to the Args4j library (http://args4j.kohsuke.org/), the Apache Commons IO library (http://commons.apache.org/proper/commons-io/), and the Apache Commons Math library (http://commons.apache.org/proper/commons-math/), respectively.
  • The directory src contains the source code for the different models' collapsed Gibbs samplers, compressed in jar files: snvdm-gpu-src.jar, tam-src.jar, sn-lda-src.jar, and vodum-src.jar.
  • The file LICENCE.txt describes the licence of our code.
  • The file README.md is the current file.

How to run the code

This section provides instructions on how to run collapsed Gibbs sampling for the different models, what is taken as input, and what is given as output.

Input file

The input files used to run our collapsed Gibbs sampling programs are the same across the different models -- although some models may not use all available information, e.g., social interactions will not be used by TAM and VODUM. An input file contains on the first line the number of users in the dataset. Then each line contains the data for one user using the following format:

<user_ID>	<user_groundtruth_label>	<word_11> ... <word_1N>;interactedUponBy:<sender_ID_11> ... <sender_ID_1M>|<word_21> ... <word_2N>;interactedUponBy:<sender_ID_21> ... <sender_ID_2M>|...	<recipient_ID_1> ... <recipient_ID_P>

The first field is the user ID (a string or a number). After a tab character, the second field gives the viewpoint label for this user -- it is used only in the evaluation performed at the end of the execution. Then, after another tab, follow the documents posted by the user. A document's words are separated by spaces. If a document was interacted upon by other users (i.e., senders, following the terminology of our paper), the document's last word is followed by a semicolon (";"), any keyword (e.g., interactedUponBy), and a colon (":"), after which the IDs of the interacting users are listed, separated by spaces. Documents posted by the user are separated by pipes ("|"). Finally, after another tab, comes the space-separated list of IDs of the users the line's user interacted with (i.e., the recipients).

Example:

@hillary	democrat	i hate #gop;interactedUponBy:@barack @john|obamacare ftw;interactedUponBy:@barack|i love thai food	@bill @barack
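
The format above can be read back with a few string splits. The following is a minimal Python sketch, not part of this repository: the names Document, UserRecord, and parse_user_line are hypothetical, and the sketch assumes that words and IDs never contain tabs, pipes, or semicolons.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Document:
    words: List[str]
    senders: List[str] = field(default_factory=list)  # users who interacted upon it


@dataclass
class UserRecord:
    user_id: str
    label: str
    documents: List[Document]
    recipients: List[str]


def parse_user_line(line: str) -> UserRecord:
    """Parse one user line (hypothetical helper, assumes well-formed input)."""
    fields = line.rstrip("\n").split("\t")
    user_id, label, docs_field = fields[0], fields[1], fields[2]
    # The recipient list is the last tab-separated field; treat it as optional.
    recipients = fields[3].split() if len(fields) > 3 else []
    documents = []
    for raw_doc in docs_field.split("|"):
        # Words come before the semicolon, interaction data (if any) after it.
        text, _, interaction = raw_doc.partition(";")
        # The keyword before the colon is arbitrary, so it is ignored.
        _, _, sender_part = interaction.partition(":")
        documents.append(Document(text.split(), sender_part.split()))
    return UserRecord(user_id, label, documents, recipients)


example = (
    "@hillary\tdemocrat\t"
    "i hate #gop;interactedUponBy:@barack @john|"
    "obamacare ftw;interactedUponBy:@barack|"
    "i love thai food\t@bill @barack"
)
record = parse_user_line(example)
print(record.user_id, len(record.documents), record.recipients)
```

Remember that the first line of an input file holds the number of users, so it should be skipped before calling such a parser on the remaining lines.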

Note that in the case of VODUM the input file also needs to contain the part-of-speech category (0 or 1) for each token. Therefore, in the input file for VODUM (only), each word is followed by its part-of-speech category, and both are separated by a colon (":"), e.g., <word_11>:<pos_category_11>.
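
As an illustration of the VODUM token format, here is a small Python sketch that tags the words of one document. It is only a sketch under assumptions: to_vodum_document and toy_pos_category are hypothetical names, and the toy rule that decides between 0 and 1 is arbitrary, since this README does not state which part-of-speech classes each category covers.

```python
def to_vodum_document(words, pos_category):
    """Join words as '<word>:<pos_category>' tokens separated by spaces.

    pos_category is a caller-supplied function returning 0 or 1 for a word
    (hypothetical; a real one would rely on a part-of-speech tagger).
    """
    tagged = []
    for word in words:
        category = pos_category(word)
        if category not in (0, 1):
            raise ValueError(f"category must be 0 or 1, got {category!r}")
        tagged.append(f"{word}:{category}")
    return " ".join(tagged)


def split_vodum_token(token):
    """Split a token on its last colon, so words containing ':' still work."""
    word, _, category = token.rpartition(":")
    return word, int(category)


def toy_pos_category(word):
    # Arbitrary placeholder rule, for illustration only.
    return 1 if word in {"hate", "love"} else 0


vodum_doc = to_vodum_document(["i", "love", "thai", "food"], toy_pos_category)
print(vodum_doc)
```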

SNVDM/SNVDM-GPU

Command line execution

The collapsed Gibbs sampler for SNVDM or SNVDM-GPU is run using the following command:

$ java -jar bin/snvdm-gpu.jar [-beta <double>] [-gamma0 <double>] [-gamma1 <double>] [-mu <double>] [-delta0 <double>] [-delta1 <double>] [-alpha <double>] [-eta <double>] [-lambda <double>] [-tau <int>] [-ntopics <int>] [-nviews <int>] [-niters <int>] [-burnin <int>] [-lag <int>] [-hypsamp] [-nchains <int>] [-savestep <int>] [-topwords <int>] -dir <string> -dfile <string>

To use SNVDM instead of SNVDM-GPU, simply set tau = 0. Alternatively, one could set lambda = 0 but doing so leads to less efficient code so we do not recommend it.

The meaning of each parameter is detailed below:

  • -beta <double>: Value of β, the concentration parameter for the symmetric Dirichlet prior on φ00, φ01, φ10, φ11 (distributions over words). Default value: 0.01.
  • -gamma0 <double> and -gamma1 <double>: Values of γ0 and γ1, the shape parameters for the Beta prior on ψ0 and ψ1 (distributions over routes). Default value: 1.0.
  • -mu <double>: Value of μ, the parameter for the symmetric Dirichlet prior on ξ (distribution over interacting users). Note that the concentration parameter actually used is μ/U. Default value: 1.0.
  • -delta0 <double> and -delta1 <double>: Values of δ0 and δ1, the shape parameters for the Beta prior on σ (distribution over levels). Default value: 1.0.
  • -alpha <double>: Value of α, the parameter for the symmetric Dirichlet prior on θ (distribution over topics). Note that the concentration parameter actually used is α/T. Default value: 1.0.
  • -eta <double>: Value of η, the parameter for the symmetric Dirichlet prior on π (distribution over viewpoints). Note that the concentration parameter actually used is η/V. Default value: 1.0.
  • -lambda <double>: Value (between 0 and 1) of the portion of ball (interaction) to add for related colors (users) in the Generalized Pólya Urn scheme. Default value: 0.5.
  • -tau <int>: Number of acquaintances to consider for each user (among those interacting the most with her) in the Generalized Pólya Urn scheme. Default value: 10.
  • -ntopics <int>: Number of topics (T). Default value: 10.
  • -nviews <int>: Number of viewpoints (V). Default value: 2.
  • -niters <int>: Number of iterations to perform for each chain. Default value: 1000.
  • -burnin <int>: Number of iterations before starting collecting samples. Default value: 500.
  • -lag <int>: Number of iterations between samples to collect. Default value: 50.
  • -hypsamp: Indicates that hyperparameters should be updated (using the auxiliary variable sampling technique) instead of being kept constant. Default value: true.
  • -nchains <int>: Number of chains (independent executions of the program, each with a different random initialization) to perform. Default value: 1.
  • -savestep <int>: Number of iterations (after burnin) between samples to save in the output files. If the savestep is greater than (niters - burnin), then only one sample (the sample for the last iteration) will be saved for each chain. Default value: 500.
  • -topwords <int>: Number of top words (most probable words for each word distribution) to output.
  • -dir <string>: Path to the directory containing the data file, and where the output files will be saved.
  • -dfile <string>: Name of the data file.
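
The roles of -lambda and -tau can be illustrated with a toy count update. The Python sketch below shows a generic Generalized Pólya Urn increment and is not the sampler's actual code: the function names, the dictionary-based counts, and the way acquaintances are ranked from raw interaction counts are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def top_acquaintances(interaction_counts, tau):
    """Keep, for each user, the tau users with the most interactions.

    interaction_counts maps a user to a Counter of other users (assumed layout).
    """
    return {
        user: [other for other, _ in counts.most_common(tau)]
        for user, counts in interaction_counts.items()
    }


def gpu_increment(counts, viewpoint, sender, acquaintances, lam):
    """Add one ball for the sender and a fraction lam for each acquaintance."""
    counts[viewpoint][sender] += 1.0
    for other in acquaintances.get(sender, []):
        counts[viewpoint][other] += lam


interactions = {"a": Counter({"b": 3, "c": 1, "d": 2})}
acquaintances = top_acquaintances(interactions, tau=2)

counts = defaultdict(lambda: defaultdict(float))
gpu_increment(counts, viewpoint=0, sender="a", acquaintances=acquaintances, lam=0.5)
print(dict(counts[0]))
```

With tau = 0 the acquaintance lists are empty and the update reduces to a plain increment, which matches the remark that tau = 0 yields SNVDM rather than SNVDM-GPU.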

Example:

$ java -jar "bin/snvdm-gpu.jar" -beta 0.01 -gamma0 1 -gamma1 1 -mu 1 -delta0 1 -delta1 1 -alpha 1 -eta 1 -lambda 0.5 -tau 10 -ntopics 15 -nviews 2 -niters 1000 -burnin 500 -lag 50 -hypsamp -nchains 1 -savestep 500 -topwords 20 -dir "data/midterms" -dfile "midterms.dat"
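
When many runs are launched from a script, the command can be assembled programmatically. The following Python sketch is one possible way to do so: build_command is a hypothetical helper, the flag names are taken from the list above, and the actual launch is left commented out because it requires Java and the jar file to be present.

```python
import shlex


def build_command(jar, directory, data_file, **options):
    """Build the argument list for one sampler run (hypothetical helper).

    Options set to True become bare flags (e.g., -hypsamp); options set to
    False or None are omitted; any other value is passed as '-name value'.
    """
    args = ["java", "-jar", jar]
    for name, value in options.items():
        if value is True:
            args.append(f"-{name}")
        elif value is not False and value is not None:
            args += [f"-{name}", str(value)]
    args += ["-dir", directory, "-dfile", data_file]
    return args


cmd = build_command(
    "bin/snvdm-gpu.jar", "data/midterms", "midterms.dat",
    ntopics=15, nviews=2, tau=0, hypsamp=True,
)
print(shlex.join(cmd))

# To actually launch the run:
# import subprocess
# subprocess.run(cmd, check=True)
```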

Output files

The execution of the collapsed Gibbs sampler for SNVDM and SNVDM-GPU outputs the following files for each savestep (corresponding to a saved model):

  • <model name>.others: This file contains the values of the hyperparameters (beta, gamma0, gamma1, mu, delta0, delta1, alpha, eta) and of the other parameters (lambda, tau, ntopics, nviews). It also specifies the number of users in the collection (nusers), the size of the vocabulary (nwords), as well as the results of the evaluation in terms of various metrics (perplexity, coherence, purity, inversePurity, nmi, bCubedPrecision, bCubedRecall, bCubedF).
  • <model name>.avgphi00, <model name>.avgphi01, <model name>.avgphi10, <model name>.avgphi11: These files contain the distributions over background words φ00, the distributions over viewpoint words φ01, the distributions over topic words φ10, and the distributions over viewpoint-topic words φ11, averaged over the different collected samples.
  • <model name>.avgpsi0, <model name>.avgpsi1: These files contain the general distribution over routes ψ0 and the topic-specific distributions over routes ψ1, averaged over the different collected samples.
  • <model name>.avgxi: This file contains the viewpoint-specific distributions over interacting users ξ, averaged over the different collected samples.
  • <model name>.avgsigma: This file contains the user-specific distributions over levels σ, averaged over the different collected samples.
  • <model name>.avgtheta: This file contains the user-specific distributions over topics θ, averaged over the different collected samples.
  • <model name>.avgpi: This file contains the user-specific distributions over viewpoints π, averaged over the different collected samples.
  • <model name>.words: This file contains the most probable words in the distribution over background words φ00.
  • <model name>.vwords: This file contains the most probable words in the distributions over viewpoint words φ01.
  • <model name>.twords: This file contains the most probable words in the distributions over topic words φ10.
  • <model name>.vtwords: This file contains the most probable words in the distributions over viewpoint-topic words φ11.
  • <model name>.classmap: This file contains the mapping between the groundtruth label strings (from the input file) and their index used in the output files. The first line corresponds to the number of different viewpoint labels in the collection.
  • <model name>.usermap: This file contains the mapping between the user ID strings (from the input file) and their index used in the output files. The first line corresponds to the number of different users in the collection.
  • <model name>.wordmap: This file contains the mapping between the word strings (from the input file) and their index used in the output files. The first line corresponds to the number of different words in the vocabulary.

TAM

Command line execution

The collapsed Gibbs sampler for TAM is run using the following command:

$ java -jar bin/tam.jar [-omega <double>] [-alpha <double>] [-beta <double>] [-delta0 <double>] [-delta1 <double>] [-gamma0 <double>] [-gamma1 <double>] [-naspects <int>] [-ntopics <int>] [-niters <int>] [-burnin <int>] [-lag <int>] [-hypsamp] [-nchains <int>] [-savestep <int>] [-topwords <int>] -dir <string> -dfile <string>

The meaning of each parameter is detailed below:

  • -omega <double>: Value of ω, the concentration parameter for the symmetric Dirichlet prior on φ00, φ01, φ10, φ11 (distributions over words). Default value: 0.01.
  • -alpha <double>: Value of α, the parameter for the symmetric Dirichlet prior on θ (distribution over topics). Note that the concentration parameter actually used is α/T. Default value: 1.0.
  • -beta <double>: Value of β, the parameter for the symmetric Dirichlet prior on π (distribution over aspects). Note that the concentration parameter actually used is β/A. Default value: 1.0.
  • -delta0 <double> and -delta1 <double>: Values of δ0 and δ1, the shape parameters for the Beta prior on σ (distribution over levels). Default value: 1.0.
  • -gamma0 <double> and -gamma1 <double>: Values of γ0 and γ1, the shape parameters for the Beta prior on ψ0 and ψ1 (distributions over routes). Default value: 1.0.
  • -naspects <int>: Number of aspects (A). Default value: 2.
  • -ntopics <int>: Number of topics (T). Default value: 10.
  • -niters <int>: Number of iterations to perform for each chain. Default value: 1000.
  • -burnin <int>: Number of iterations before starting collecting samples. Default value: 500.
  • -lag <int>: Number of iterations between samples to collect. Default value: 50.
  • -hypsamp: Indicates that hyperparameters should be updated (using the auxiliary variable sampling technique) instead of being kept constant. Default value: true.
  • -nchains <int>: Number of chains (independent executions of the program, each with a different random initialization) to perform. Default value: 1.
  • -savestep <int>: Number of iterations (after burnin) between samples to be saved in the output files. If the savestep is greater than (niters - burnin), then only one sample (the sample for the last iteration) will be saved for each chain. Default value: 500.
  • -topwords <int>: Number of top words (most probable words for each word distribution) to output. Default value: 20.
  • -dir <string>: Path to the directory containing the data file, and where the output files will be saved.
  • -dfile <string>: Name of the data file.
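
The notes on α/T and β/A mean that the value given on the command line is divided by the number of components before being used as the concentration parameter. A short Python sketch of that arithmetic with the default values follows; the function name is hypothetical.

```python
def per_component_concentration(parameter, n_components):
    """Return the concentration actually used: parameter / n_components."""
    if n_components <= 0:
        raise ValueError("n_components must be positive")
    return parameter / n_components


# Defaults listed above: alpha = 1.0 with T = 10 topics, beta = 1.0 with A = 2 aspects.
alpha_over_T = per_component_concentration(1.0, 10)
beta_over_A = per_component_concentration(1.0, 2)
print(alpha_over_T, beta_over_A)
```

A consequence of this scaling is that increasing the number of topics or aspects lowers the per-component concentration unless the command-line value is increased accordingly.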

Example:

$ java -jar "bin/tam.jar" -omega 0.01 -alpha 1 -beta 1 -delta0 1 -delta1 1 -gamma0 1 -gamma1 1 -naspects 2 -ntopics 15 -niters 1000 -burnin 500 -lag 50 -hypsamp -nchains 1 -savestep 500 -topwords 20 -dir "data/midterms" -dfile "midterms.dat"

Output files

The execution of the collapsed Gibbs sampler for TAM outputs the following files for each savestep (corresponding to a saved model):

  • <model name>.others: This file contains the values of the hyperparameters (omega, alpha, beta, delta0, delta1, gamma0, gamma1) and of the other parameters (ntopics, naspects). It also specifies the number of documents (in our setting this corresponds to the number of users) in the collection (ndocs), the size of the vocabulary (nwords), as well as the results of the evaluation in terms of various metrics (perplexity, coherence, purity, inversePurity, nmi, bCubedPrecision, bCubedRecall, bCubedF).
  • <model name>.avgphi00, <model name>.avgphi01, <model name>.avgphi10, <model name>.avgphi11: These files contain the distributions over background words φ00, the distributions over aspect words φ01, the distributions over topic words φ10, and the distributions over aspect-topic words φ11, averaged over the different collected samples.
  • <model name>.avgtheta: This file contains the document-specific (here, user-specific) distributions over topics θ, averaged over the different collected samples.
  • <model name>.avgpi: This file contains the document-specific (here, user-specific) distributions over aspects π, averaged over the different collected samples.
  • <model name>.avgsigma: This file contains the document-specific (here, user-specific) distributions over levels σ, averaged over the different collected samples.
  • <model name>.avgpsi0, <model name>.avgpsi1: These files contain the general distribution over routes ψ0 and the topic-specific distributions over routes ψ1, averaged over the different collected samples.
  • <model name>.words: This file contains the most probable words in the distribution over background words φ00.
  • <model name>.awords: This file contains the most probable words in the distributions over aspect words φ01.
  • <model name>.twords: This file contains the most probable words in the distributions over topic words φ10.
  • <model name>.atwords: This file contains the most probable words in the distributions over aspect-topic words φ11.
  • <model name>.classmap: This file contains the mapping between the groundtruth label strings (from the input file) and their index used in the output files. The first line corresponds to the number of different viewpoint labels in the collection.
  • <model name>.docmap: This file contains the mapping between the document (here, user) ID strings (from the input file) and their index used in the output files. The first line corresponds to the number of different documents in the collection.
  • <model name>.wordmap: This file contains the mapping between the word strings (from the input file) and their index used in the output files. The first line corresponds to the number of different words in the vocabulary.
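
Among the metrics reported in the .others file, purity and inverse purity have simple standard definitions. The Python sketch below follows those textbook definitions on in-memory lists; it is not taken from this repository's code, which may differ in details, and the toy assignments are invented for illustration.

```python
from collections import Counter, defaultdict


def purity(clusters, labels):
    """Fraction of items that carry the majority label of their cluster."""
    by_cluster = defaultdict(Counter)
    for cluster, label in zip(clusters, labels):
        by_cluster[cluster][label] += 1
    majority_total = sum(max(counts.values()) for counts in by_cluster.values())
    return majority_total / len(labels)


def inverse_purity(clusters, labels):
    """Purity computed with the roles of clusters and labels swapped."""
    return purity(labels, clusters)


# Toy data: predicted aspect per user and groundtruth label per user.
predicted = [0, 0, 0, 1, 1, 1]
groundtruth = ["dem", "dem", "rep", "rep", "rep", "dem"]
print(purity(predicted, groundtruth), inverse_purity(predicted, groundtruth))
```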

SN-LDA

Command line execution

The collapsed Gibbs sampler for SN-LDA is run using the following command:

$ java -jar bin/sn-lda.jar [-alpha <double>] [-beta <double>] [-delta <double>] [-gamma <double>] [-ntopics <int>] [-ncomms <int>] [-niters <int>] [-burnin <int>] [-lag <int>] [-hypsamp] [-nchains <int>] [-savestep <int>] [-topwords <int>] -dir <string> -dfile <string>

The meaning of each parameter is detailed below:

  • -alpha <double>: Value of α, the parameter for the symmetric Dirichlet prior on θ (distributions over topics). Note that the concentration parameter actually used is α/T. Default value: 1.0.
  • -beta <double>: Value of β, the concentration parameter for the symmetric Dirichlet prior on φ (distributions over words). Default value: 0.01.
  • -delta <double>: Value of δ, the parameter for the symmetric Dirichlet prior on η (distributions over interaction recipients). Default value: 1.0.
  • -gamma <double>: Value of γ, the parameter for the symmetric Dirichlet prior on π (distributions over communities). Default value: 1.0.
  • -ntopics <int>: Number of topics (T). Default value: 10.
  • -ncomms <int>: Number of communities (C). Default value: 2.
  • -niters <int>: Number of iterations to perform for each chain. Default value: 1000.
  • -burnin <int>: Number of iterations before starting collecting samples. Default value: 500.
  • -lag <int>: Number of iterations between samples to collect. Default value: 50.
  • -hypsamp: Indicates that hyperparameters should be updated (using the auxiliary variable sampling technique) instead of being kept constant. Default value: true.
  • -nchains <int>: Number of chains (independent executions of the program, each with a different random initialization) to perform. Default value: 1.
  • -savestep <int>: Number of iterations (after burnin) between samples to save in the output files. If the savestep is greater than (niters - burnin), then only one sample (the sample for the last iteration) will be saved for each chain. Default value: 500.
  • -topwords <int>: Number of top words (most probable words for each word distribution) to output.
  • -dir <string>: Path to the directory containing the data file, and where the output files will be saved.
  • -dfile <string>: Name of the data file.
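
The interplay of -niters, -burnin, and -lag can be sketched as follows. This Python snippet assumes that a sample is collected every lag iterations counted from the end of the burn-in period; the exact offset used by the sampler is not stated in this README, so treat the listed iterations as an illustration only. The function name is hypothetical.

```python
def collected_iterations(niters, burnin, lag):
    """List the iterations at which a sample would be collected.

    Assumption: collection happens whenever (iteration - burnin) is a
    positive multiple of lag.
    """
    return [i for i in range(burnin + 1, niters + 1) if (i - burnin) % lag == 0]


samples = collected_iterations(niters=1000, burnin=500, lag=50)
print(len(samples), samples[0], samples[-1])
```

Under this assumption, the default settings average the distributions over 10 samples per chain.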

Example:

$ java -jar "bin/sn-lda.jar" -alpha 1 -beta 0.01 -delta 1 -gamma 1 -ntopics 15 -ncomms 2 -niters 1000 -burnin 500 -lag 50 -hypsamp -nchains 1 -savestep 500 -topwords 20 -dir "data/midterms" -dfile "midterms.dat"

Output files

The execution of the collapsed Gibbs sampler for SN-LDA outputs the following files for each savestep (corresponding to a saved model):

  • <model name>.others: This file contains the values of the hyperparameters (alpha, beta, delta, gamma) and of the other parameters (ntopics, ncomms). It also specifies the number of users in the collection (nusers), the size of the vocabulary (nwords), as well as the results of the evaluation in terms of various metrics (perplexity, coherence, purity, inversePurity, nmi, bCubedPrecision, bCubedRecall, bCubedF).
  • <model name>.avgtheta: This file contains the user-specific distributions over topics θ, averaged over the different collected samples.
  • <model name>.avgphi: This file contains the distributions over topic words φ, averaged over the different collected samples.
  • <model name>.avgeta: This file contains the community-specific distributions over interaction recipients η, averaged over the different collected samples.
  • <model name>.avgpi: This file contains the user-specific distributions over communities π, averaged over the different collected samples.
  • <model name>.twords: This file contains the most probable words in the distributions over topic words φ.
  • <model name>.classmap: This file contains the mapping between the groundtruth label strings (from the input file) and their index used in the output files. The first line corresponds to the number of different viewpoint labels in the collection.
  • <model name>.usermap: This file contains the mapping between the user ID strings (from the input file) and their index used in the output files. The first line corresponds to the number of different users in the collection.
  • <model name>.wordmap: This file contains the mapping between the word strings (from the input file) and their index used in the output files. The first line corresponds to the number of different words in the vocabulary.
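
The B-Cubed scores listed in the .others file can be illustrated with the standard item-level definition. The Python sketch below is not this repository's implementation: it averages per-item precision and recall and then takes their harmonic mean, whereas the code here may combine them differently. The toy community assignments are invented.

```python
def bcubed(clusters, labels):
    """Return (precision, recall, F) following the standard B-Cubed definition."""
    n = len(labels)
    precision = recall = 0.0
    for i in range(n):
        same_cluster = [j for j in range(n) if clusters[j] == clusters[i]]
        same_label = [j for j in range(n) if labels[j] == labels[i]]
        # Items sharing both the cluster and the label of item i (including i).
        both = sum(1 for j in same_cluster if labels[j] == labels[i])
        precision += both / len(same_cluster)
        recall += both / len(same_label)
    precision /= n
    recall /= n
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


communities = [0, 0, 0, 1, 1, 1]
groundtruth = ["dem", "dem", "rep", "rep", "rep", "dem"]
p, r, f = bcubed(communities, groundtruth)
print(p, r, f)
```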

VODUM

Command line execution

The collapsed Gibbs sampler for VODUM is run using the following command:

$ java -jar bin/vodum.jar -est [-beta0 <double>] [-beta1 <double>] [-alpha <double>] [-eta <double>] [-ntopics <int>] [-nviews <int>] [-nchains <int>] [-niters <int>] [-savestep <int>] [-topwords <int>] -dir <string> -dfile <string>

The meaning of each parameter is detailed below:

  • -beta0 <double>: Value of β0, the concentration parameter for the symmetric Dirichlet prior on φ0 (distributions over topical words). Default value: 0.01.
  • -beta1 <double>: Value of β1, the concentration parameter for the symmetric Dirichlet prior on φ1 (distributions over opinion words). Default value: 0.01.
  • -alpha <double>: Value of α, the parameter for the symmetric Dirichlet prior on θ (distribution over topics). Note that the concentration parameter actually used is α/T. Default value: 1.0.
  • -eta <double>: Value of η, the parameter for the symmetric Dirichlet prior on π (distribution over viewpoints). Note that the concentration parameter actually used is η/V. Default value: 1.0.
  • -ntopics <int>: Number of topics (T). Default value: 10.
  • -nviews <int>: Number of viewpoints (V). Default value: 2.
  • -niters <int>: Number of iterations to perform for each chain. Default value: 1000.
  • -burnin <int>: Number of iterations before starting collecting samples. Default value: 500.
  • -lag <int>: Number of iterations between samples to collect. Default value: 50.
  • -hypsamp: Indicates that hyperparameters should be updated (using the auxiliary variable sampling technique) instead of being kept constant. Default value: true.
  • -nchains <int>: Number of chains (independent executions of the program, each with a different random initialization) to perform. Default value: 1.
  • -savestep <int>: Number of iterations (after burnin) between samples to save in the output files. If the savestep is greater than (niters - burnin), then only one sample (the sample for the last iteration) will be saved for each chain. Default value: 500.
  • -topwords <int>: Number of top words (most probable words for each word distribution) to output.
  • -dir <string>: Path to the directory containing the data file, and where the output files will be saved.
  • -dfile <string>: Name of the data file.

Example:

$ java -jar "bin/vodum.jar" -beta0 0.01 -beta1 0.01 -alpha 1 -eta 1 -ntopics 15 -nviews 2 -niters 1000 -burnin 500 -lag 50 -hypsamp -nchains 1 -savestep 500 -topwords 20 -dir "data/midterms" -dfile "midterms.dat"

Output files

The execution of the collapsed Gibbs sampler for VODUM outputs the following files for each savestep (corresponding to a saved model):

  • <model name>.others: This file contains the values of the hyperparameters (beta0, beta1, alpha, eta) and of the other parameters (ntopics, nviews). It also specifies the number of users in the collection (nusers), the size of the vocabulary (nwords), as well as the results of the evaluation in terms of various metrics (perplexity, coherence, purity, inversePurity, nmi, bCubedPrecision, bCubedRecall, bCubedF).
  • <model name>.avgphi0, <model name>.avgphi1: These files contain the distributions over topical words φ0, and the distributions over opinion words φ1, averaged over the different collected samples.
  • <model name>.avgtheta: This file contains the user-specific distributions over topics θ, averaged over the different collected samples.
  • <model name>.avgpi: This file contains the user-specific distributions over viewpoints π, averaged over the different collected samples.
  • <model name>.twords: This file contains the most probable words in the distributions over topical words φ0.
  • <model name>.vtwords: This file contains the most probable words in the distributions over opinion words φ1.
  • <model name>.classmap: This file contains the mapping between the groundtruth label strings (from the input file) and their index used in the output files. The first line corresponds to the number of different viewpoint labels in the collection.
  • <model name>.usermap: This file contains the mapping between the user ID strings (from the input file) and their index used in the output files. The first line corresponds to the number of different users in the collection.
  • <model name>.wordmap: This file contains the mapping between the word strings (from the input file) and their index used in the output files. The first line corresponds to the number of different words in the vocabulary.
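
One natural use of the user-specific distributions over viewpoints π is to assign each user to her most probable viewpoint. The Python sketch below does this on in-memory rows; the on-disk layout of the .avgpi file is not described in this README, so loading the file is left out, and the function name and toy values are hypothetical.

```python
def assign_viewpoints(pi):
    """Return, for each user row, the index of the most probable viewpoint.

    Ties are resolved in favor of the lowest viewpoint index.
    """
    return [max(range(len(row)), key=row.__getitem__) for row in pi]


# Toy distributions for three users and two viewpoints.
toy_pi = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
assignments = assign_viewpoints(toy_pi)
print(assignments)
```

The resulting indices are viewpoint indices, not groundtruth labels; comparing them with the labels indexed in the .classmap file requires a clustering metric such as purity or B-Cubed.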
