[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/chaoss/wg-gmd/master?filepath=implementations/Code_Changes-Git.ipynb)
# Code_Changes-Git

This is the reference implementation for Code_Changes,
a metric specified by the
[GMD Working Group](https://github.com/chaoss/wg-gmd) of the
[CHAOSS project](https://chaoss.community).
This implementation is specific to Git repositories.

See [README.md](README.md) to find out how to run this notebook (and others in this directory).

The implementation is described in two parts (see below):

* Retrieving data from the data source
* Class for computing Code_Changes

The notebook also includes some auxiliary material:

* Examples of the use of the implementation
* Examples of how to check for specific peculiarities of git commits

## Retrieving data from the data source

Retrieve commit data from the Perceval, SortingHat and SirMordred Git repositories, which are located at the following links:

* http://github.com/chaoss/grimoirelab-perceval
* http://github.com/chaoss/grimoirelab-sortinghat
* https://github.com/chaoss/grimoirelab-sirmordred

```
josemasa$ perceval git --json-line http://github.com/chaoss/grimoirelab-perceval > git-commits.json
[2019-04-02 14:41:53,679] - Sir Perceval is on his quest.
[2019-04-02 14:41:53,681] - Fetching commits: 'http://github.com/chaoss/grimoirelab-perceval' git repository from 1970-01-01 00:00:00+00:00 to 2100-01-01 00:00:00+00:00; all branches
[2019-04-02 14:41:55,448] - Fetch process completed: 1354 commits fetched
[2019-04-02 14:41:55,448] - Sir Perceval completed his quest.
 
 
josemasa$ perceval git --json-line http://github.com/chaoss/grimoirelab-sortinghat >> git-commits.json
[2019-04-02 14:43:10,196] - Sir Perceval is on his quest.
[2019-04-02 14:43:11,438] - Fetching commits: 'http://github.com/chaoss/grimoirelab-sortinghat' git repository from 1970-01-01 00:00:00+00:00 to 2100-01-01 00:00:00+00:00; all branches
[2019-04-02 14:43:12,502] - Fetch process completed: 659 commits fetched
[2019-04-02 14:43:12,502] - Sir Perceval completed his quest.


josemasa$ perceval git --json-line "https://github.com/chaoss/grimoirelab-sirmordred" >> git-commits.json
[2019-04-02 14:48:41,198] - Sir Perceval is on his quest.
[2019-04-02 14:48:44,526] - Fetching commits: 'https://github.com/chaoss/grimoirelab-sirmordred' git repository from 1970-01-01 00:00:00+00:00 to 2100-01-01 00:00:00+00:00; all branches
[2019-04-02 14:48:45,579] - Fetch process completed: 909 commits fetched
[2019-04-02 14:48:45,579] - Sir Perceval completed his quest.
```
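
Because the three fetches append to the same file, a quick sanity check is to count the items per `origin` and compare the result with the totals Perceval logged above. The following is a minimal sketch, assuming each line of `git-commits.json` is one JSON document with an `origin` field (as used later in this notebook); the helper name `count_by_origin` is hypothetical.

```python
import json
from collections import Counter


def count_by_origin(path):
    """Count Perceval items per repository (the 'origin' field)."""
    counts = Counter()
    with open(path) as commits_file:
        for line in commits_file:
            if not line.strip():
                continue  # tolerate blank lines
            counts[json.loads(line)['origin']] += 1
    return counts


# Usage (assumes the file produced by the commands above exists):
# for origin, n in sorted(count_by_origin('git-commits.json').items()):
#     print(origin, n)
```

If the file was produced by the three commands shown, the per-repository counts should add up to the sum of the "commits fetched" figures in the log.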

## Class for computing Code_Changes-Git

(Taken from the example version.)

This implementation uses data retrieved as described above.
The implementation is encapsulated in the `Code_Changes` class,
which gets all commits for a set of repositories.

In [50]:
import json
import datetime

import pandas as pd

class Code_Changes:
    """Class for Code_Changes for Git repositories.
    
    Objects are instantiated by specifying a file with the
    commits obtained by Perceval from a set of repositories.
        
    :param path: Path to file with one Perceval JSON document per line
    """

    @staticmethod
    def _summary(repo, cdata):
        """Compute a summary of a commit, suitable as a row in a dataframe"""
        
        summary = {
            'repo': repo,
            'hash': cdata['commit'],
            'author': cdata['Author'],
            'author_date': datetime.datetime.strptime(cdata['AuthorDate'],
                                                      "%a %b %d %H:%M:%S %Y %z"),
            'commit': cdata['Commit'],
            'commit_date': datetime.datetime.strptime(cdata['CommitDate'],
                                                      "%a %b %d %H:%M:%S %Y %z"),
            'files_no': len(cdata['files'])
        }
        # Count the files that record an action (used to detect empty commits)
        summary['files_action'] = sum(1 for file in cdata['files']
                                      if 'action' in file)
        # A 'Merge' field in the commit data marks a merge commit
        summary['merge'] = 'Merge' in cdata
        return summary
    
    def __init__(self, path):
        """Initializes self.df, the dataframe with one row per commit.
        """

        columns = ['hash', 'author', 'author_date',
                   'commit', 'commit_date',
                   'files_no', 'files_action',
                   'merge', 'repo']
        commits = []
        with open(path) as commits_file:
            for line in commits_file:
                commit = json.loads(line)
                commits.append(self._summary(repo=commit['origin'],
                                             cdata=commit['data']))
        # Build the dataframe in one step: DataFrame.append() was removed
        # in pandas 2.0; passing columns keeps them even for an empty file
        self.df = pd.DataFrame(commits, columns=columns)
        self.df['author_date'] = pd.to_datetime(self.df['author_date'], utc=True)
        self.df['commit_date'] = pd.to_datetime(self.df['commit_date'], utc=True)
        
    def total_count(self):
        """Count all rows in the dataframe (one per commit read)"""

        return len(self.df.index)
    
    def count(self, since=None, until=None, empty=True, merge=True, date='author_date'):
        """Count number of distinct commits (by hash)
        
        :param since: Period start (inclusive)
        :param until: Period end (exclusive)
        :param empty: Include empty commits (those with no file actions)
        :param merge: Include merge commits
        :param date: Kind of date ('author_date' or 'commit_date')
        """
        
        df = self.df
        if since:
            df = df[df[date] >= since]
        if until:
            df = df[df[date] < until]
        if not empty:
            df = df[df['files_action'] != 0]
        if not merge:
            df = df[df['merge'] == False]
        return df['hash'].nunique()
    
    def by_month(self):
        """Count commits per (year, month) of their author date"""

        return self.df['author_date'] \
            .groupby([self.df.author_date.dt.year.rename('year'),
                      self.df.author_date.dt.month.rename('month')]) \
            .agg('count')
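
The date handling in `_summary()` can be tried in isolation. The sketch below parses a date string with the same format the class uses and converts it to UTC with the standard library, which is roughly what `pd.to_datetime(..., utc=True)` then does for the whole column. The sample value is made up for illustration, and the format names assume an English (C) locale.

```python
import datetime

# Same format string as in _summary()
GIT_DATE_FORMAT = "%a %b %d %H:%M:%S %Y %z"

# Hypothetical sample value, shaped like the 'AuthorDate' field
sample = "Tue Apr 2 14:41:53 2019 +0200"

parsed = datetime.datetime.strptime(sample, GIT_DATE_FORMAT)
as_utc = parsed.astimezone(datetime.timezone.utc)

print(parsed.isoformat())  # keeps the original UTC offset
print(as_utc.isoformat())  # same instant, expressed in UTC
```

Normalizing to UTC matters because `count()` compares dates across commits authored in different time zones.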


The `count()` method implements the `Count` aggregation for `Code_Changes`.
It accepts the parameters specified for the general metric:

* Period of time: `since` and `until`

It also accepts the parameters specified for the specific case of Git:
    
* Include merge commits: `merge`
* Include empty commits: `empty`
* Kind of date: `date`
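
How these parameters combine can be illustrated without pandas. The following is a simplified sketch of the same filtering logic over a list of dicts, not the reference implementation; the function `count_commits`, the helper `utc` and the sample rows are invented for illustration. As in `count()`, `since` is inclusive and `until` is exclusive.

```python
import datetime


def count_commits(rows, since=None, until=None, empty=True, merge=True,
                  date='author_date'):
    """Count distinct commit hashes, mirroring the filters of count()."""
    hashes = set()
    for row in rows:
        if since is not None and row[date] < since:
            continue  # before the period start
        if until is not None and row[date] >= until:
            continue  # at or after the period end (exclusive bound)
        if not empty and row['files_action'] == 0:
            continue  # empty commit: no file actions
        if not merge and row['merge']:
            continue  # merge commit
        hashes.add(row['hash'])
    return len(hashes)


def utc(year, month, day):
    """Build a timezone-aware datetime at midnight UTC."""
    return datetime.datetime(year, month, day, tzinfo=datetime.timezone.utc)


# Made-up rows with the same fields the class stores per commit
rows = [
    {'hash': 'a1', 'author_date': utc(2018, 1, 1), 'commit_date': utc(2018, 1, 2),
     'files_action': 2, 'merge': False},
    {'hash': 'b2', 'author_date': utc(2018, 3, 5), 'commit_date': utc(2018, 3, 5),
     'files_action': 0, 'merge': True},
    {'hash': 'c3', 'author_date': utc(2018, 7, 1), 'commit_date': utc(2018, 7, 1),
     'files_action': 1, 'merge': False},
]

print(count_commits(rows))                                                # 3
print(count_commits(rows, since=utc(2018, 1, 1), until=utc(2018, 7, 1)))  # 2
print(count_commits(rows, merge=False))                                   # 2
print(count_commits(rows, empty=False, date='commit_date'))               # 2
```

In the second call, the commit dated exactly 2018-07-01 is left out because the upper bound is exclusive, while the one dated 2018-01-01 is kept.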

## List of repositories based on JSON file

In [52]:
repos = set()
with open('git-commits.json') as commits_file:
    for line in commits_file:
        commit = json.loads(line)
        repos.add(commit['origin'])

print(repos)

{'https://github.com/chaoss/grimoirelab-sirmordred', 'http://github.com/chaoss/grimoirelab-perceval', 'http://github.com/chaoss/grimoirelab-sortinghat'}


## Examples of use of the implementation

In [53]:
changes = Code_Changes('./git-commits.json')
print("Code changes total count:", changes.total_count())
print("Code changes count all period:", changes.count())
print("Code changes count from 2018-01-01 to 2018-07-01:",
      changes.count(since="2018-01-01", until="2018-07-01"))
print("Code changes count from 2018-01-01 to 2018-07-01 (no merge commits):",
      changes.count(since="2018-01-01", until="2018-07-01", merge=False))
print("Code changes count from 2018-01-01 to 2018-07-01 (no empty commits):",
      changes.count(since="2018-01-01", until="2018-07-01", empty=False))

Code changes total count: 2922
Code changes count all period: 2922
Code changes count from 2018-01-01 to 2018-07-01: 601
Code changes count from 2018-01-01 to 2018-07-01 (no merge commits): 477
Code changes count from 2018-01-01 to 2018-07-01 (no empty commits): 478


## Checking data consistency

### Verifying different actions

In [55]:
actions = set()
with open('git-commits.json') as commits_file:
    for line in commits_file:
        commit = json.loads(line)
        # Skip commits with no files; only the first file of each commit is inspected
        if len(commit['data']['files']) > 0:
            if 'action' in commit['data']['files'][0]:
                actions.add(commit['data']['files'][0]['action'])

# Print the unique values
print(actions)

{'M', 'MM', 'A', 'C091', 'D', 'R100', 'R067', 'C062'}
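
Some of these values carry more than a single letter. In Git's raw diff output, the status letters `R` (rename) and `C` (copy) are followed by a similarity percentage, so `R067` reads as a rename with 67% similarity. A value such as `MM` presumably comes from a merge commit, where Git reports one status letter per parent. Below is a small sketch for splitting such codes under that interpretation; the helper `parse_action` is hypothetical and not part of Perceval.

```python
import re

# One or more status letters, optionally followed by a numeric score
ACTION_RE = re.compile(r'^([A-Z]+)(\d+)?$')


def parse_action(action):
    """Split an action code into (letters, similarity or None)."""
    match = ACTION_RE.match(action)
    if match is None:
        raise ValueError("unrecognized action: %r" % action)
    letters, score = match.groups()
    return letters, (int(score) if score is not None else None)


for code in ['M', 'MM', 'A', 'C091', 'D', 'R100', 'R067', 'C062']:
    print(code, parse_action(code))
```

Grouping by the letters alone (ignoring the score) gives a more compact view of which kinds of change appear in the data.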


### Verifying that the only category is 'commit'

In [56]:
categories = set()
with open('git-commits.json') as commits_file:
    for line in commits_file:
        commit = json.loads(line)
        categories.add(commit['category'])
        
# Print the unique values
print(categories)

{'commit'}


## Counting and displaying distinct committers to the repositories

In [60]:
distinct_authors = set()
with open('git-commits.json') as commits_file:
    for line in commits_file:
        commit = json.loads(line)
        distinct_authors.add(commit['data']['Author'])
        
print('NAMES AND EMAILS OF DIFFERENT COMMITTERS TO THE ANALYZED REPOS \n \n')

for author in distinct_authors:
    print(author)

print("TOTAL DISTINCT COMMITTERS " + str(len(distinct_authors)))

NAMES AND EMAILS OF DIFFERENT COMMITTERS TO THE ANALYZED REPOS 
 

J. Manrique Lopez de la Fuente <jsmanrique@bitergia.com>
Jesus M. Gonzalez-Barahona <jgb@gsyc.es>
Aniruddha Karajgi <akarajgi0@gmail.com>
david <david@starlab.io>
David Moreno <dmorenolumb@gmail.com>
Andre Klapper <a9016009@gmx.de>
dpose <dpose@bitergia.com>
zhquan <zhquan@gmail.com>
Miguel Ángel Fernández <mafesan@bitergia.com>
sumitskj <sumitjangirdss.1@gmail.com>
f2014169 <f2014169@hyderabad.bits-pilani.ac.in>
Andy Grunwald <andygrunwald@gmail.com>
Fil Maj <maj.fil@gmail.com>
Lukasz Gryglicki <lukaszgryglicki@o2.pl>
Daniel Izquierdo Cortazar <dicortazar@gmail.com>
camillem <camillem@users.noreply.github.com>
Harshal Mittal <harshalmittal4@gmail.com>
Maurizio Pillitu <maoo@apache.org>
Alvaro del Castillo <acs@bitergia.com>
Alberto Pérez García-Plaza <alpgarcia@bitergia.com>
quan <zhquan7@gmail.com>
Prabhat <prabhatsharma7298@gmail.com>
Miguel Angel Fernandez <mafesan@bitergia.com>
David Pose Fernández <dpose@bitergia.c