Skip to content

williamsdoug/GitAnalyticsDatasets

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

5 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Git Analytics Sample Datasets

Datasets are in CSV format, gzipped for storage efficiency. Each dataset consists of 3 files,

  • <project>_<importance>_feature_vectors.csv.gz - Numpy array saved using numpy.savetext()
  • <project>_<importance>_labels.csv.gz - Single column binary label for each feature vector. This value is the result of comparing the guilt value with a threshold value, where the threshold value is selected such that the number of labeled commits (value=1) roughly corresponds to the number of actual bugs (subject to minimum importance value)
  • <project>_<importance>_info.csv.gz - Other info each feature vector
  • Git Commit ID (SHA)
  • actual guilt value associated each commit
  • ordering number of each commit. 1 for first commit in repo.

The sample datasets are all derived from OpenStack. <project> values include:

  • cinder
  • glance
  • heat
  • nova
  • swift

The guilt value computed with each commit based based the association of the commit with any subsequent bug fix. Higher importance bugs fix (e.g.: Critical vs High vs Medium vs Low) get higher weight in the guilt calculation. Standard <importance> values include:

  • critical - only includes Criticalbug fixes
  • highplus - includes Critical and High bug fixes
  • mediumplus - includes Critical, High and Medium bug fixes
  • highplus - includes Critical, High, Medium and Low bug fixes

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published