This is an open-source, Python-based framework for predicting future user communities in a text-streaming social network (e.g., Twitter) based on users' topics of interest. The framework has been benchmarked on a Twitter dataset and shows improvements over the state of the art in downstream applications such as news recommendation and user prediction.
We strongly recommend a Linux OS for installing and running the framework. To install the packages and dependencies, run the following command in your shell:

```shell
pip install -r requirements.txt
```
This command installs compatible versions of the following libraries:
- gensim
- networkx
- scikit-network
- dynamicgem
- tagme
- nltk
- numpy
- pandas
- scikit-learn
- scipy
- sklearn
- requests
- mysql-connector-python
- matplotlib
Our framework has six major layers: the Data Access Layer (DAL), Topic Modeling Layer (TML), User Modeling Layer (UML), Graph Embedding Layer (GEL), Community Prediction Layer (CPL), and the Application Layer. The Application Layer is the last layer and shows how our method improves the performance of an underlying application.
```
├── output
├── src
│   ├── cmn (common functions)
│   │   └── Common.py
│   ├── dal (data access layer)
│   │   ├── DataPreparation.py
│   │   └── DataReader.py
│   ├── tml (topic modeling layer)
│   │   └── TopicModeling.py
│   ├── uml (user modeling layer)
│   │   ├── UsersGraph.py
│   │   └── UserSimilarities.py
│   ├── gel (graph embedding layer)
│   │   ├── GraphEmbedding.py
│   │   └── GraphReconstruction.py
│   ├── cpl (community prediction layer)
│   │   └── GraphClustering.py
│   ├── application
│   │   ├── NewsTopicExtraction.py
│   │   ├── NewsRecommendation.py
│   │   └── ModelEvaluation.py
│   ├── main.py
│   └── params.py
└── requirements.txt
```
We crawled and stored Twitter posts (tweets) for two consecutive months. The data is available as SQL scripts, accessible through the following links. Please download and execute them against your local database engine, and make sure your SQL server is running before you start the framework.
Each of the framework's six layers is governed by multiple parameters. Some of these parameters are fixed in the code via trial and error; major parameters, such as the number of topics, can be adjusted by the user in the 'params.py' file. After modifying 'params.py', run the framework via 'main.py' with the following commands:
```shell
cd src
python main.py
```
```python
import random
import numpy as np

random.seed(0)
np.random.seed(0)

RunID = 1

# SQL settings. Should be set for each MySQL instance.
user = ''
password = ''
host = ''
database = ''

uml = {
    'Comment': '',  # Any comment to express more information about the configuration.
    'RunId': RunID,  # A unique number to identify the configuration per run.
    'start': '2010-12-17',  # First date of system activity.
    'end': '2010-12-17',  # Last date of system activity.
    'lastRowsNumber': 100000,  # Number of rows sampled from the dataset for the whole process.
    'num_topics': 25,  # Number of topics to extract from the corpus.
    'library': 'gensim',  # Library used to extract topics; either 'gensim' or 'mallet'.
    'mallet_home': '--------------',  # Path to the Mallet installation.
    # The following parameters control corpus generation from the dataset:
    'userModeling': True,  # Aggregate all tweets of a user into one document.
    'timeModeling': True,  # Aggregate all tweets of a specific day into one document.
    'preProcessing': False,  # Apply traditional pre-processing methods to the corpus.
    'TagME': False,  # Apply TagMe to the raw dataset. Set to False if the TagMe dataset is used.
    'filterExtremes': True,  # Filter very common and very rare terms across all documents.
    'JO': False,  # (JO := JustOne) If True, only one topic is chosen for each document.
    'Bin': True,  # (Bin := Binary) If True, scores above/below the threshold are set to 1/0 per topic.
    'Threshold': 0.2,  # Threshold for topic-score quantization.
    'UserSimilarityThreshold': 0.2  # Threshold for filtering low user-similarity scores.
}

evl = {
    'RunId': RunID,
    'Threshold': 0,  # Threshold for filtering low news-recommendation scores.
    'TopK': 20  # Number of top news-recommendation candidates to select.
}
```
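The `Bin`, `Threshold`, and `UserSimilarityThreshold` settings can be illustrated with a small sketch. This is our own illustration, not the framework's actual code; the function names `binarize_topic_scores` and `filter_user_similarities` are hypothetical:

```python
import numpy as np

def binarize_topic_scores(scores, threshold=0.2):
    """The 'Bin' option: scores at or above the threshold become 1, the rest 0."""
    return (np.asarray(scores) >= threshold).astype(int)

def filter_user_similarities(sim_matrix, threshold=0.2):
    """Zero out user-user similarity scores that fall below the threshold."""
    sim = np.asarray(sim_matrix, dtype=float)
    sim[sim < threshold] = 0.0
    return sim

# Example: per-document topic scores quantized with Threshold = 0.2.
print(binarize_topic_scores([0.05, 0.3, 0.2, 0.1]))  # [0 1 1 0]
```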
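These evaluation parameters can be read as: keep the `TopK` highest-scoring news candidates per user after dropping scores at or below `Threshold`. A minimal sketch of that selection, assuming list-of-scores input (the function name is hypothetical, not part of the framework):

```python
def top_k_recommendations(scores, k=20, threshold=0):
    """Return the indices of the top-k scores above the threshold, best first."""
    candidates = [(i, s) for i, s in enumerate(scores) if s > threshold]
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return [i for i, _ in candidates[:k]]

print(top_k_recommendations([0.9, 0.0, 0.5, 0.7], k=2, threshold=0))  # [0, 3]
```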
The first three columns (mrr, ndcg5, ndcg10) report news recommendation quality; the last three (Precision, Recall, f1-measure) report user prediction quality.

| Method | mrr | ndcg5 | ndcg10 | Precision | Recall | f1-measure |
|---|---|---|---|---|---|---|
| **Community Prediction** | | | | | | |
| Our approach | 0.255 | 0.108 | 0.105 | 0.012 | 0.035 | 0.015 |
| Appel et al. [PKDD'18] | 0.176 | 0.056 | 0.055 | 0.007 | 0.094 | 0.0105 |
| **Temporal community detection** | | | | | | |
| Hu et al. [SIGMOD'15] | 0.173 | 0.056 | 0.049 | 0.007 | 0.136 | 0.013 |
| Fani et al. [CIKM'17] | 0.065 | 0.040 | 0.040 | 0.007 | 0.136 | 0.013 |
| **Non-temporal link-based community detection** | | | | | | |
| Ye et al. [CIKM'18] | 0.139 | 0.056 | 0.055 | 0.008 | 0.208 | 0.014 |
| Louvain [JSTAT'08] | 0.108 | 0.048 | 0.055 | 0.004 | 0.129 | 0.007 |
| **Collaborative filtering** | | | | | | |
| rrn [WSDM'17] | 0.173 | 0.073 | 0.08 | 0.004 | 0.740 | 0.008 |
| timesvd++ [KDD'08] | 0.141 | 0.058 | 0.064 | 0.003 | 0.657 | 0.005 |
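For readers unfamiliar with the ranking metrics in the table, here is a hedged sketch of MRR and nDCG@k as they are commonly defined (binary relevance). This is our own illustration, not the framework's evaluation code:

```python
import math

def mrr(ranked_relevance):
    """Mean reciprocal rank: average over queries of 1 / rank of the first relevant item."""
    total = 0.0
    for rels in ranked_relevance:
        for pos, rel in enumerate(rels, start=1):
            if rel:
                total += 1.0 / pos
                break
    return total / len(ranked_relevance)

def ndcg_at_k(rels, k):
    """Normalized discounted cumulative gain at cutoff k."""
    dcg = sum(rel / math.log2(pos + 1) for pos, rel in enumerate(rels[:k], start=1))
    ideal = sorted(rels, reverse=True)
    idcg = sum(rel / math.log2(pos + 1) for pos, rel in enumerate(ideal[:k], start=1))
    return dcg / idcg if idcg > 0 else 0.0

# First query: first relevant item at rank 2; second query: at rank 1.
print(mrr([[0, 1, 0], [1, 0, 0]]))  # 0.75
```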
Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.
©2021. This work is licensed under a CC BY-NC-SA 4.0 license.
Email: ziaeines@uwindsor.ca - soroushziaeinejad@gmail.com
Project link: https://github.com/soroush-ziaeinejad/Community-Prediction
In this work, we use the dynamicgem library to temporally embed our user graphs. We would like to thank the authors of this library.