
Stackoverflow bounty analysis

In this project we aim to analyze the influence of a bounty on a question. We try to predict whether or not a question will receive a successful answer after a bounty is set on it. Additionally, we predict whether such a bounty question will receive its winning answer within 2.5 days of the bounty being created.

The project consists of four parts:

  1. Scripts (Python / SQL) to import the Stackoverflow dump into a HANA database. The SO dump, given as XML data, is parsed and inserted into the database (DB).
  2. SQL scripts to calculate additional tables that provide easy access to deeper knowledge, e.g. about the tags on a question.
  3. Code to calculate features and train a topic model (LDA) and an SVM, which serve as the knowledge base for the web server.
  4. A web server to enter questions and receive a prediction as output.

Content of the repository:

Path                 Description
data_cleansing/      SQL scripts to remove inconsistencies from the SO dump and to create additional tables with condensed information.
data_converter/      Python scripts to read the XML data dumps and yield them as batched rows.
data_crawling/       Scala script to request timestamps from the SO API that are not contained in the original data dump. The training process needs the data to be as accurate as possible.
feature_analysis/    R scripts to analyze the data and generate plots.
hana_nodejs_import/  Deprecated HANA nodejs importer, replaced by insert_data/.
insert_data/         Python script to bulk write INSERT statements to HANA.
prediction/          Python code to calculate features, train topic models and train an SVM classifier.
web_server/          Python web server that uses a trained SVM to do live predictions.

Setup and Prediction Training

0. Requirements

Prerequisites that should be installed before running any installer / script:

  • Java > 1.7
  • Python >= 2.7
  • easy_install / pip for Python
  • HANA server
  • LAPACK compatible OS

1. Installation

There is an installer script install.sh located in the root of the project. It installs all necessary Python packages, sbt for Scala and some required system libraries. It is written to be executed on SUSE SE 11 but should be easy to adapt to other operating systems.

System packages:

gcc-fortran, p7zip, lapack

Necessary Python packages:

nltk, numpy, scipy, gensim, dateutil, scikit-learn, 
flask, requests, ordereddict, flask-cors 

The Python driver for HANA needs to be installed manually. Follow connect-to-sap-hana-in-python to do so.

2. Configuration

There are two configuration files, application.cfg and stackoverflow_data.cfg.

application.cfg (section [DB], with defaults):

Key       Default        Description
user      root           User with access to the DB
password                 Password of the DB user
host      localhost      Host address of the server running the DB
port      1337           Port of the running DB server
typ       mysql          DB type, either hana or mysql
database  stackoverflow  Name of the DB to work on
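
Written out as an actual config file, a filled-in application.cfg might look like the following (assuming the standard INI key = value syntax read by Python's ConfigParser; all values are examples):

  [DB]
  user = root
  password = secret
  host = localhost
  port = 1337
  typ = hana
  database = stackoverflow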

The stackoverflow_data.cfg should contain one section per <TABLENAME>, with the attributes described below.

stackoverflow_data.cfg example (section [<TABLENAME>]):

Key         Example value                       Description
input       Votes.xml                           Name of the input XML file in the SO dump
table       SO_VOTES                            Name of the table in the DB
columns     Id CreationDate PostId VoteTypeId   Columns to transfer from XML to DB
timestamps  CreationDate                        Columns that need timestamp reformatting
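
The same example rendered as a config section (the section name <TABLENAME> is your choice; [VOTES] here is a placeholder):

  [VOTES]
  input = Votes.xml
  table = SO_VOTES
  columns = Id CreationDate PostId VoteTypeId
  timestamps = CreationDate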

3. Getting the data

There is a script called run.sh which downloads the most recent Stackoverflow dump from the archive. Afterwards the data is unzipped and inserted into the database. All scripts use the output/ directory as their working directory; please make sure it is writable.

After insertion the runner will execute a crawler to download timestamps that are missing from the dump.
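
The conversion step (data_converter/) streams each XML file and yields its rows in batches ready for bulk insertion. A minimal sketch of the idea, assuming the usual SO dump layout of one <row> element per record (the function name and batch size are illustrative, not the project's actual API):

  import xml.etree.cElementTree as ElementTree

  def batched_rows(xml_file, columns, batch_size=1000):
      """Stream an SO dump file and yield its rows in batches."""
      batch = []
      # iterparse streams the file instead of loading the whole dump into memory
      for _, element in ElementTree.iterparse(xml_file):
          if element.tag == "row":
              batch.append(tuple(element.get(column) for column in columns))
              if len(batch) >= batch_size:
                  yield batch
                  batch = []
          element.clear()  # free memory held by already processed elements
      if batch:
          yield batch

  for rows in batched_rows("Votes.xml", ["Id", "CreationDate", "PostId", "VoteTypeId"]):
      pass  # e.g. turn each batch into one bulk INSERT statement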

4. Cleansing of the data and table generation

There are multiple scripts in the folder data_cleansing/. They need to be executed against the SO data that was inserted into the database, in the order of their prefixing indices. This creates additional tables like SO_TAGS, which contains all tags that have been assigned to at least one question.
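
Since the file names carry the execution order, sorting them lexicographically is sufficient. A minimal sketch of a runner for these scripts (the cursor is assumed to come from whatever DB driver matches your typ setting):

  import glob

  def run_cleansing_scripts(cursor):
      # sorted() runs the scripts in the order of their numeric prefixes
      # (assumes zero-padded file names; adjust the sort key otherwise)
      for script in sorted(glob.glob("data_cleansing/*.sql")):
          with open(script) as sql_file:
              # some drivers need multi-statement files split up first
              cursor.execute(sql_file.read())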

5. Feature calculation and model training

After running the data cleansing scripts you should execute train.sh. It calls several Python scripts that calculate the various features needed to train the classifiers. All features are collected in the table SO_TRAINING_FEATURES.

At the end the script trains two topic models (one on the whole SO question corpus and another one on a smaller sample using only verb phrases (VPs)). After the LDAs are trained, the final training of the two SVMs is started.
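
One plausible way these pieces fit together (the project's real feature set lives in SO_TRAINING_FEATURES): train a gensim LDA, use the per-document topic distributions as feature vectors, and fit a scikit-learn SVM on them. The corpus and labels below are toy stand-ins:

  from gensim import corpora, models
  from sklearn import svm

  documents = [["bounty", "question", "answer"],
               ["python", "flask", "server"],
               ["sql", "table", "index"]]
  labels = [1, 0, 0]  # e.g. 1 = question received a successful answer

  dictionary = corpora.Dictionary(documents)
  corpus = [dictionary.doc2bow(document) for document in documents]
  lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2)

  # dense topic distributions become (part of) the feature vectors
  features = [[probability for _, probability in
               lda.get_document_topics(bow, minimum_probability=0.0)]
              for bow in corpus]

  classifier = svm.SVC()
  classifier.fit(features, labels)
  print(classifier.predict(features))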

The Web Server

The project comes with a web server to enter questions and present the prediction results. The server can either show a simple HTML report or serve all of its data via a REST interface.

0. Requirements

The Web Server relies on the Flask web framework and a number of other Python libraries. Running the install.sh script should install them.

The Web Server relies on four data sources for its calculations:

  1. The Stackoverflow REST API (10000 requests / day quota)
  2. A database with X tables containing pre-calculated statistics on tag usage.
  3. Two trained SVMs located under output/svms, one for success prediction and one for time interval prediction.
  4. Two trained LDAs located under output/ldas, one topic model trained on the whole dataset and one trained without verb phrases.

1. Launching the server

Make sure the DB is up and running and that application.cfg is set up properly to allow access to the DB. Then run the following command from the root directory:

python server.py

2. Routes

The Web Server supports multiple routes:

HTML Results

GET /                 index
GET /:question_id     prediction result & detailed feature report for a SO question
POST /submitQuestion  needs field `question` with a question id

REST API

/api/predictions/:question_id     returns only the prediction results as JSON
/api/features/:question_id        returns only the features as JSON
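
For a quick test of the API, assuming the server runs on Flask's default port 5000 (4242 is a placeholder question id):

  import requests

  base_url = "http://localhost:5000"  # assumption: Flask's default port
  response = requests.get(base_url + "/api/predictions/4242")
  response.raise_for_status()
  print(response.json())  # the prediction results as JSON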

3. Embedding into a different page

If you would like to embed the prediction into your site, we recommend using an iFrame, as it correctly encapsulates all styles, URLs and requests. To help you with this, we provide a little JS file that dynamically renders the iFrame into a container of your choice. For an example see:

web_server/embed-stackoverflow-example.html
web_server/embed-stackoverflow.js

Use the embedding script like this:

  <script type="text/javascript" src="embed-stackoverflow.js"></script>
  <script type="text/javascript">
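    // selector: CSS selector of the container the iFrame is rendered into
    // url_to_prediction_server: address of the running prediction web server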
    renderIntoElement(selector, url_to_prediction_server);
  </script>

Data Statistics
