foss-heartbeat

Open source communities are made of people who contribute in many different ways. What makes a person more likely to continue participating in a project and move into roles of greater responsibility?

Identifying contributors

foss-heartbeat identifies seven major contribution types:

  • Issue reporter
  • Issue responder
  • Code contributor
  • Documentation contributor
  • Reviewer
  • Maintainer
  • Connector

This project uses contributor participation data (currently from GitHub) to categorize users into these seven roles.
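
As a rough illustration of the idea, the categorization boils down to mapping each GitHub event to a role based on who did what. The sketch below is only a hypothetical outline with made-up event fields; the real (and more involved) logic lives in ghcategorize.py:

# Hypothetical sketch: map a scraped GitHub event to a contribution role.
# The event fields ("type", "user", "paths", ...) and the maintainers set
# are assumptions for illustration, not ghcategorize.py's actual data model.
def categorize_event(event, issue_opener, maintainers):
    if event["type"] == "issue-opened":
        return "issue reporter"
    if event["type"] == "issue-comment" and event["user"] != issue_opener:
        return "issue responder"
    if event["type"] == "pull-request":
        if all(path.startswith("docs/") for path in event["paths"]):
            return "documentation contributor"
        return "code contributor"
    if event["type"] == "review-comment":
        return "reviewer"
    if event["type"] == "merge" and event["user"] in maintainers:
        return "maintainer"
    return None  # the connector role needs a broader view of the data and is omitted here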

Answering key contribution questions

Analyzing this participation data seeks to answer questions about which factors attract and retain each of these types of contributors.

While there are many different questions you could ask once you categorize contributors and examine their contributions, the first major goal of this project is to answer the question:

What impact does positive or negative language have on contributor participation?

foss-heartbeat seeks to answer that question by applying sentiment analysis to the comments community members make on others' contributions.

Install

Clone the repository and change into the repository's directory, then install the Python dependencies:

$ pip install -r requirements.txt

May require sudo.

Install Stanford CoreNLP

Stanford CoreNLP includes natural language processing tools and a neural network (a type of machine learning model) that can be trained to recognize sentiment at the sentence level.

These installation directions expand slightly on the official directions at: http://stanfordnlp.github.io/CoreNLP/index.html#download

Clone the git repo:

$ git clone git@github.com:stanfordnlp/CoreNLP.git

Download the default sentiment models into the liblocal directory:

$ cd CoreNLP/liblocal
$ wget http://nlp.stanford.edu/software/stanford-corenlp-models-current.jar
$ wget http://nlp.stanford.edu/software/stanford-english-corenlp-models-current.jar

(Note: the directions on the Stanford CoreNLP site for setting the classpath didn't work for me. Instead, I used the -Djava.ext.dirs=lib:liblocal flag to point java at the sentiment models I placed in CoreNLP/liblocal.)

Usage

Scrape information from GitHub

First, scrape information from GitHub for each repository you're analyzing. Note that this step may take several hours or even a day, due to GitHub API rate limits.

$ python ghscraper.py GITHUB_REPO_NAME GITHUB_OWNER_NAME FILE_WITH_CREDENTIALS

Or, if you prefer not to put your password in a file, or if you have turned on two-factor authentication for your GitHub account, use an access token instead:

$ python ghscraper.py GITHUB_REPO_NAME GITHUB_OWNER_NAME GITHUB_OAUTH_TOKEN

(Make sure to select the following scopes for your token: public_repo)
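
For reference, token authentication against the GitHub API is just an Authorization header, and each response carries rate-limit headers that show how much of your hourly quota remains. The snippet below only illustrates that mechanism with the requests library; it is not the scraper's actual code, and the OWNER/REPO/token values are placeholders:

# Illustration only: GitHub API token auth and rate-limit headers.
# ghscraper.py handles this itself; OWNER, REPO, and the token are placeholders.
import requests

token = "YOUR_GITHUB_OAUTH_TOKEN"  # needs the public_repo scope
resp = requests.get(
    "https://api.github.com/repos/OWNER/REPO/issues",
    headers={"Authorization": "token " + token},
)
print(resp.headers["X-RateLimit-Remaining"], "requests left this hour")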

Categorize

Next, run the script to categorize GitHub interactions into the different open source contribution types:

$ python ghcategorize.py GITHUB_REPO_NAME GITHUB_OWNER_NAME

Stats

Then generate HTML reports with statistics (note that ghstats.py imports functions from ghreport.py):

$ python ghstats.py GITHUB_REPO_NAME GITHUB_OWNER_NAME docs/

The HTML report will be created in docs/GITHUB_OWNER_NAME/GITHUB_REPO_NAME. You will need to hand-edit docs/index.html to link to docs/GITHUB_OWNER_NAME/GITHUB_REPO_NAME/foss-heartbeat.html.

(Optional) Train the Stanford CoreNLP sentiment model

Sentiment analysis relies on being trained with a large set of sentences that are relevant to the text you want to study. For example, hand-analyzed sentences from one open source project may be used to train the sentiment model to automatically analyze another project.

The Stanford CoreNLP sentiment models are trained on movie reviews and aren't very good at analyzing the sentiment of code reviews. They tend to rank the sentence structure of technical comments as negative in tone, even when there are no negative words, and they aren't trained on curse words or emoji.

Stanford CoreNLP includes a way to retrain the neural network to recognize the sentiment of sentence structures. You have to feed it a training set (their training set is ~8,000 sentences) and a development set that helps tune the parameters of the neural net. Both sets have to be sentences that are manually annotated in Penn tree format.

FOSS Heartbeat's training set can be found in empathy-model/train.txt and its development set in empathy-model/dev.txt.

The sentences in the training set are taken from open source projects: lkml, the debian-devel mailing list, glibc, angular, .NET, elm, react, fsharp, idris, jquery, vscode, node.js, rails, rust, servo, and bootstrap.

The set also includes around 10-20 simple sentences that I hoped would help train the model, along with sentiment for all the curse words found at http://www.noswearing.com/dictionary and all the short-hand codes for emoji at http://www.webpagefx.com/tools/emoji-cheat-sheet/

If you make changes to train.txt and dev.txt, you can retrain the model:

$ cd path/to/CoreNLP
$ java -cp stanford-corenlp.jar -Djava.ext.dirs=lib:liblocal -mx5g \
    edu.stanford.nlp.sentiment.SentimentTraining -numHid 25 \
    -trainPath path/to/foss-heartbeat/empathy-model/training.txt \
    -devPath path/to/foss-heartbeat/empathy-model/dev.txt -train \
    -model path/to/foss-heartbeat/empathy-model/empathy-model.ser.gz

Running the sentiment model in stdin mode

In the CoreNLP directory, you can run a test of the default sentiment model. This parses sentences from stdin after you hit enter, but be aware that it returns one line for multiple lines fed into it at once, rather than splitting the input with the sentence parser the way the -file option does.

$ cd path/to/CoreNLP
$ java -cp stanford-corenlp.jar -Djava.ext.dirs=lib:liblocal \
  -mx5g edu.stanford.nlp.sentiment.SentimentPipeline -stdin -output pennTrees

-mx5g specifies that 5GB is the maximum amount of RAM to use.

-output pennTrees specifies that Stanford CoreNLP should output the full sentence sentiment analysis. To get an idea of what the format means, take a look at the live demo. Removing that flag will change the output mode to only state the overall sentence tone (very negative, negative, neutral, positive, very positive).

If you wish to run the sentiment analysis using FOSS Heartbeat's empathy model, you should instead run:

$ cd path/to/CoreNLP
$ java -cp stanford-corenlp.jar -Djava.ext.dirs=lib:liblocal -mx5g \
    edu.stanford.nlp.sentiment.SentimentPipeline -stdin \
    -sentimentModel path/to/foss-heartbeat/empathy-model/empathy-model.ser.gz \
    -output pennTrees

language/substitutions.txt contains a list of word sentiment labels that need to be relabeled from the default Stanford CoreNLP Penn tree output. The default Stanford CoreNLP model was trained on movie reviews, so it incorrectly labels words we find in software development conversations. For example, 'Christian' is labeled as positive, since people may leave reviews about the positivity of Christian movies; in software development, 'Christian' is most likely someone's name. Since FOSS Heartbeat's model is trained to recognize empathy and praise as positive, and personal attacks as negative, we often have to shift the sentiment of specific words.

You can use substitutions.txt to change word sentiment labels in sentences from the default sentiment model. It involves stripping the '%' off the vim substitution commands in substitutions.txt, using the resulting file as a sed script, and piping the output from the sentiment model into sed:

$ cd path/to/CoreNLP
$ cat path/to/foss-heartbeat/language/substitutions.txt | \
    sed -e 's/^%//' > /tmp/subs.txt; \
    java -cp stanford-corenlp.jar -Djava.ext.dirs=lib:liblocal -mx5g \
    edu.stanford.nlp.sentiment.SentimentPipeline -stdin -output pennTrees | \
    sed -f /tmp/subs.txt

Once this is done, you can feed interesting examples through it and add them to empathy-model/train.txt or empathy-model/dev.txt to retrain FOSS Heartbeat's model. You will need to manually propagate any sentiment changes from the innermost sentence fragments up to the root of the sentence. This needs to be done by human eyes, since the sentence tone can change when different sentence fragments are combined.

Scrub comments for sentiment analysis

In order to cut down on the amount of time Stanford CoreNLP spends processing sentences, we need to drop any inline code, which is most likely to be ranked as neutral, or may be miscategorized because the model hasn't been trained on that particular programming language.

We also convert any unicode emoji into their short-hand codes (as described at http://www.webpagefx.com/tools/emoji-cheat-sheet/), which makes it easier for humans to read the analyzed plain-text sentences.

Stanford CoreNLP also takes time to load its models, so it is faster to write a batch of text to a file and parse it with the -file command line option than to re-run the command for each sentence. Thus, FOSS Heartbeat provides a script that generates a scrubbed file of all the comments in a repo that you can feed to Stanford CoreNLP.

The output file will contain the filenames (preceded by a hash mark) and the contents of the scrubbed comments. Sentences may span multiple lines, and Stanford CoreNLP will break them up using its sentence parser. This does mean that things like lists or sentences that don't end with punctuation will get joined with the next line.
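
The scrubbing described above amounts to dropping code spans and replacing unicode emoji with their short-hand codes. The sketch below shows the general idea only; ghsentiment.py is the real implementation and handles many more cases, and the tiny emoji table here is just a sample:

# Sketch only: strip fenced and inline code, and map a couple of unicode
# emoji to their short-hand codes. Not ghsentiment.py's actual logic.
import re

EMOJI_SHORTCODES = {"\U0001F44D": ":thumbsup:", "\U0001F389": ":tada:"}

def scrub_comment(text):
    text = re.sub(r"```.*?```", " ", text, flags=re.DOTALL)  # fenced code blocks
    text = re.sub(r"`[^`]*`", " ", text)                      # inline code spans
    for emoji, code in EMOJI_SHORTCODES.items():
        text = text.replace(emoji, code)
    return text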

To generate the scrubbed file, run:

$ python ghsentiment.py owner/repo/ owner/repo/all-comments.txt --recurse

Run the scrubbed data through the sentiment analysis

To use FOSS Heartbeat's retrained empathy model on the scrubbed comments file, run:

$ cd path/to/CoreNLP
$ java -cp stanford-corenlp.jar -Djava.ext.dirs=lib:liblocal -mx5g \
    edu.stanford.nlp.sentiment.SentimentPipeline -output pennTrees \
    -sentimentModel path/to/foss-heartbeat/empathy-model/empathy-model.ser.gz \
    -file path/to/owner/repo/all-comments.txt > \
    path/to/owner/repo/all-comments.empathy.txt

Modifying the sentiment training data

In order to retrain the sentiment model, you need to add parsed sentences with a Penn tree sentiment label for each word. You'll need to add about 1 sentence to empathy-model/dev.txt for every 8 similar sentences you add to empathy-model/train.txt.

Penn tree sentence format initially looks very strange:

(4 (2 Again) (4 (2 this) (4 (4 (2 is) (4 (4 super) (3 (3 great) (2 work)))) (2 .))))

Each word and each combined sentence part has an associated sentiment, from 0 to 4. In the empathy model, the following categorizations are used:

  • 4 (Very positive): Thank yous with emphasis (great or great!), or specific praise
  • 3 (Positive): Thanks, praise, encouragement, empathy, helping others, and apologies
  • 2 (Neutral): Any talk about code that includes opinions without expressing gratitude, empathy, cursing, or discriminatory language
  • 1 (Negative): Comments about code or people with mild cursing or ableist language
  • 0 (Very Negative): Comments with strong cursing, or sexist, racist, homophobic, transphobic, or other discriminatory language

It can sometimes be easier to see how the sentence sentiment changes as its parsed phrases are combined by putting the sentence in a tree format:

(4
   (2 Again)
   (4
      (2 this)
      (4
         (4
            (2 is)
            (4
               (4 super)
               (3
                  (3 great)
                  (2 work))))
         (2 .))))
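
Since the overall tone of a sentence is just the label on the root node, you can pull per-sentence scores out of the pennTrees output with very little code. A minimal sketch, assuming one flattened tree per line as in the example above:

# Sketch: read the root sentiment label from a flattened Penn tree line.
TONES = {"0": "very negative", "1": "negative", "2": "neutral",
         "3": "positive", "4": "very positive"}

def root_sentiment(line):
    line = line.strip()
    if line.startswith("(") and len(line) > 1 and line[1] in TONES:
        return TONES[line[1]]
    return None  # not a tree line

print(root_sentiment("(4 (2 Again) (4 (2 this) (4 (4 (2 is) (4 (4 super) "
                     "(3 (3 great) (2 work)))) (2 .))))"))  # very positive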

There's a good visualization tool by the Stanford CoreNLP developers, but it is not open source and it uses the default sentiment model trained on movie reviews.

Vim Tips and Tricks

In order for people to better "see" the sentiment in Penn tree text files, you can use this vim plugin to highlight the sentiment labels. You'll need to modify the plugins/highlight.csv file to have the following lines:

6,black,yellow,black,yellow
7,white,DarkRed,white,firebrick
8,white,DarkGreen,white,DarkGreen
9,white,DarkBlue,white,DarkSlateBlue

When you open a Penn tree file, you can run the following commands to highlight sentiment:

Highlight 7 0
Highlight 6 1
Highlight 9 3
Highlight 8 4

Most open source code talk is neutral, so I don't bother to highlight the 2 sentiment.

Additionally, when comparing the sentiment results from two different models, it's useful to have vimdiff highlight only the individual words (or in our case, the sentiment labels) that have changed, rather than highlighting the whole line. This vim plugin highlights only changed words when vim is in diff mode.