In this project we analyze the influence of a bounty on a question. We try to predict whether a question will receive a successful answer after a bounty has been set on it. Additionally, we predict whether a bounty question will receive its winning answer within 2.5 days of the bounty's creation.
The project consists of four parts:
- Scripts (Python / SQL) to import the Stack Overflow (SO) dump into a HANA database. The dump, given as XML data, is parsed and inserted into the database (DB).
- SQL scripts to calculate additional tables that provide easy access to deeper knowledge, e.g. about tags on a question.
- Code to calculate features and train a topic model (LDA) and an SVM, which serve as the knowledge base for the web server.
- A web server to enter questions and receive a prediction as output.
Content of the repository:
| Path | Description |
|---|---|
| `data_cleansing/` | SQL scripts to clean the SO dump of inconsistencies and to create additional tables with condensed information. |
| `data_converter/` | Python scripts to read the XML data dumps and yield them as batched rows (see the sketch below this table). |
| `data_crawling/` | Scala script to request timestamps from the SO API that are not contained in the original data dump; the training process needs the data to be as accurate as possible. |
| `feature_analysis/` | R scripts to analyze the data and generate plots. |
| `hana_nodejs_import/` | Deprecated HANA Node.js importer, replaced by `insert_data`. |
| `insert_data/` | Python script to bulk-write INSERT statements to HANA. |
| `prediction/` | Python code to calculate features, train topic models, and train an SVM classifier. |
| `web_server/` | Python web server that uses a trained SVM to do live predictions. |
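As a rough illustration of what `data_converter/` does, the following sketch reads one SO dump file and yields its rows in batches. The function name and batch size are illustrative, not the project's actual API:

```python
# Minimal sketch of batched XML reading, assuming the usual SO dump
# layout: a single root element wrapping many <row .../> elements.
import xml.etree.ElementTree as ET

def read_rows(path, batch_size=1000):
    """Yield lists of attribute dicts, one dict per <row> element."""
    batch = []
    for _, elem in ET.iterparse(path, events=("end",)):
        if elem.tag == "row":
            batch.append(dict(elem.attrib))
            elem.clear()  # drop the parsed element to keep memory flat
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:
        yield batch  # remainder that did not fill a whole batch
```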
Prerequisites that should be installed before running any installer or script:
- Java > 1.7
- Python >= 2.7
- easy_install / pip for Python
- HANA server
- LAPACK-compatible OS
There is an installer script `install.sh` located in the root of the project. It installs all necessary Python packages, sbt for Scala, and some required system libraries. It is written to be executed on SUSE SE 11 but should be easily adaptable to other operating systems.
System packages:
`gcc-fortran`, `p7zip`, `lapack`

Necessary Python packages:
`nltk`, `numpy`, `scipy`, `gensim`, `dateutil`, `scikit-learn`, `flask`, `requests`, `ordereddict`, `flask-cors`
The Python driver for HANA needs to be installed manually. Follow the guide connect-to-sap-hana-in-python to do so.
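As a quick sanity check of the driver installation, a minimal connection test might look like the following. This assumes the pure-Python `pyhdb` client and placeholder credentials, so adjust to your setup:

```python
import pyhdb  # assumed driver; adjust if you installed a different HANA client

connection = pyhdb.connect(
    host="localhost",
    port=30015,  # 3<instance>15, the usual HANA SQL port
    user="root",
    password="secret",
)
cursor = connection.cursor()
cursor.execute("SELECT 'connected' FROM DUMMY")  # DUMMY is built into HANA
print(cursor.fetchone()[0])
connection.close()
```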
There are two configuration files: `application.cfg` and `stackoverflow_data.cfg`.
| application.cfg | default | description |
|---|---|---|
| `[DB]` | | |
| `user` | root | User with access to the DB |
| `password` | | Password of the DB user |
| `host` | localhost | Host address of the server running the DB |
| `port` | 1337 | Port of the running DB server |
| `typ` | mysql | DB type, either `hana` or `mysql` |
| `database` | stackoverflow | Name of the DB to work on |
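For reference, a complete `application.cfg` using mostly the defaults above might look like this (`typ` is set to `hana` here and the password is a placeholder):

```ini
[DB]
user = root
password = secret
host = localhost
port = 1337
typ = hana
database = stackoverflow
```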
`stackoverflow_data.cfg` should contain one section for each `<TABLENAME>`, with the attributes described below.
| stackoverflow_data.cfg | example | description |
|---|---|---|
| `[<TABLENAME>]` | | |
| `input` | Votes.xml | Name of the input XML file in the SO dump |
| `table` | SO_VOTES | Name of the table in the DB |
| `columns` | Id CreationDate PostId VoteTypeId | Columns to transfer from XML to DB |
| `timestamps` | CreationDate | Columns that need timestamp reformatting |
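Putting the example column together, a section for the votes file would look like this (the section name `VOTES` is illustrative):

```ini
[VOTES]
input = Votes.xml
table = SO_VOTES
columns = Id CreationDate PostId VoteTypeId
timestamps = CreationDate
```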
There is a script called `run.sh` which downloads the most recent Stack Overflow dump from the archive. The data is then unzipped and inserted into the database. All scripts use the `output/` directory as a working directory; please make sure it is writable.
After insertion, the runner executes a crawler to download timestamps that are missing from the dump.
There are multiple scripts in the folder `data_cleansing/`. They need to be executed against the SO data that has been inserted into the database, in the order of their numeric prefixes. This creates additional tables such as SO_TAGS, which contains all tags that have been used at least once.
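If you prefer to script this step, one way to apply the scripts in index order is sketched below. It assumes a `pyhdb` connection as in the earlier snippet and that each file is a sequence of `;`-terminated statements (real scripts may need smarter splitting):

```python
import glob
import pyhdb

connection = pyhdb.connect(host="localhost", port=30015,
                           user="root", password="secret")
cursor = connection.cursor()

# sorted() respects the numeric prefixes, so scripts run in order
for path in sorted(glob.glob("data_cleansing/*.sql")):
    with open(path) as script:
        for statement in script.read().split(";"):
            if statement.strip():
                cursor.execute(statement)
connection.commit()
```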
After running the data cleansing scripts you should execute `train.sh`. It calls several Python scripts that calculate the various features needed to train the classifiers. All features are collected in the table SO_TRAINING_FEATURES.
Finally, the script trains two topic models: one on the whole SO question corpus and another on a smaller sample using only verb phrases (VPs). After the LDAs are trained, the final training of the two SVMs is started.
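The training step is roughly in the spirit of the following sketch; the toy data, topic count and kernel choice are placeholders, not the project's actual settings:

```python
from gensim import corpora, models
from sklearn import svm

# Toy corpus standing in for the tokenized SO question bodies.
tokenized_questions = [
    ["bounty", "question", "answer", "accepted"],
    ["python", "import", "error", "module"],
]
dictionary = corpora.Dictionary(tokenized_questions)
corpus = [dictionary.doc2bow(tokens) for tokens in tokenized_questions]
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2)

# Toy feature vectors / labels standing in for SO_TRAINING_FEATURES rows.
X = [[0.1, 3.0, 42.0], [0.9, 1.0, 7.0]]
y = [1, 0]  # 1 = bounty led to a successful answer
classifier = svm.SVC(kernel="rbf")
classifier.fit(X, y)
```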
The project comes with a web server to enter questions and present the prediction results. The server can render a simple HTML report or serve all of its data via a REST interface.
The web server relies on the Flask web framework and a number of other Python libraries; running the `install.sh` script should install them.
The web server relies on four data sources for its calculations:
- The Stack Overflow REST API (10,000 requests / day quota)
- A database with X tables containing pre-calculated statistics on tag usage
- Two trained SVMs located under `output/svms`, one for success prediction and one for time interval prediction
- Two trained LDAs located under `output/ldas`, one for a topic model trained on the whole dataset and one without verb phrases
Make sure the DB is up and running and `application.cfg` is set up properly to allow access to the DB. Then run `python server.py` from the root directory.

The web server supports multiple routes:
HTML results:

| Route | Description |
|---|---|
| `GET /` | index |
| `GET /:question_id` | prediction result & detailed feature report for an SO question |
| `POST /submitQuestion` | needs field `question` with a question id |

REST API:

| Route | Description |
|---|---|
| `/api/predictions/:question_id` | returns only the prediction results as JSON |
| `/api/features/:question_id` | returns only the features as JSON |
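Using the `requests` package (installed by `install.sh`), querying the REST API could look like this; host, port and question id are placeholders:

```python
import requests

base = "http://localhost:5000"  # wherever server.py is listening
question_id = 11227809          # any existing SO question id

prediction = requests.get("%s/api/predictions/%s" % (base, question_id)).json()
features = requests.get("%s/api/features/%s" % (base, question_id)).json()
print(prediction)
print(features)
```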
If you would like to embed the prediction into your site, we recommend using an iframe, as it correctly encapsulates all styles, URLs and requests. To help you with this, we provide a small JS file that dynamically renders the iframe into a container of your choice. For an example see:
- `web_server/embed-stackoverflow-example.html`
- `web_server/embed-stackoverflow.js`
Use the embedding script like this:

```html
<script type="text/javascript" src="embed-stackoverflow.js"></script>
<script type="text/javascript">
    renderIntoElement(selector, url_to_prediction_server);
</script>
```