## Microtask-0

> Produce one notebook per data source (git, GitHub/GitLab issues, GitHub pull requests / GitLab merge requests) showing a summary of the contents of that file (number of items in it, and number of different identities in it counting authors/committers for git, submitters for issues and pull/merge requests). This microtask is mandatory, to show that you can retrieve data and produce a notebook showing it. In each notebook, include also the list of repositories retrieved, and the date of retrieval, using data available in the JSON file.

This notebook analyzes the [elasticsearch-py](https://github.com/elastic/elasticsearch-py) project. The source file is in the `data/` folder of this repository.

Date of Retrieval: 03/03/2019

In [1]:
#!pip install perceval

### Retrieving the data

To retrieve the commit data:

```bash
$ perceval git --json-line https://github.com/elastic/elasticsearch-py >> elasticsearch-py.json
[2019-03-04 03:39:27,394] - Sir Perceval is on his quest.
[2019-03-04 03:43:54,748] - Fetching commits: 'https://github.com/elastic/elasticsearch-py' git repository from 1970-01-01 00:00:00+00:00 to 2100-01-01 00:00:00+00:00; all branches
[2019-03-04 03:44:15,614] - Fetch process completed: 1135 commits fetched
[2019-03-04 03:44:15,614] - Sir Perceval completed his quest.
```
To retrieve the issues data:

```bash
$ perceval github elastic elasticsearch-py --sleep-for-rate -t xxxx --category issue >> elasticsearch-py.json
[2019-03-04 04:19:03,813] - Sir Perceval is on his quest.
[2019-03-04 04:19:07,895] - Getting info for https://api.github.com/users/nkvoll
...
...
[2019-03-04 04:43:23,674] - Sir Perceval completed his quest.
```
To retrieve the pull request data:

```bash
$ perceval github elastic elasticsearch-py --sleep-for-rate -t xxxx --category pull_request >> elasticsearch-py.json
[2019-03-04 05:19:03,813] - Sir Perceval is on his quest.
[2019-03-04 05:19:07,895] - Getting info for https://api.github.com/users/nkvoll
...
...
[2019-03-04 05:43:23,674] - Sir Perceval completed his quest.
```

You can run the [sh.sh](https://github.com/vchrombie/chaoss-microtasks/blob/master/sh.sh) script to automate the retrieval of the data. All you need to do is provide the GITHUB_TOKEN.

You can also download the data source files from the notebook itself: provide your `github_token` and run the cell below.

Since downloading the data for a large repository takes a long time, I have already downloaded the source file and use it directly; it is available in the `data/` folder.

In [2]:
# please provide your github personal token here
#github_token = ""

#!perceval git --json-line https://github.com/elastic/elasticsearch-py >> ../data/elasticsearch-py.json
#!perceval github elastic elasticsearch-py --sleep-for-rate -t $github_token --category issue >> ../data/elasticsearch-py.json
#!perceval github elastic elasticsearch-py --sleep-for-rate -t $github_token --category pull_request >> ../data/elasticsearch-py.json

### Importing the necessary modules

In [3]:
import json
import dateutil.parser  # count() calls dateutil.parser.parse; importing the submodule makes sure it is loaded

from datetime import datetime
from collections import defaultdict

In [4]:
with open('../data/elasticsearch-py.json') as file:
    line = json.loads(file.readline())

# extract the date of retrieval from the data source
print('TimeStamp:', datetime.utcfromtimestamp(int(line['timestamp'])).strftime('%Y-%m-%d %H:%M:%S'))
print('Project Retrieved:', line['origin'])

# reference: https://stackoverflow.com/questions/3682748/converting-unix-timestamp-string-to-readable-date

TimeStamp: 2019-03-05 20:33:11
Project Retrieved: https://github.com/elastic/elasticsearch-py
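
The cell above reads only the first line of the file. Since the file can hold items from several Perceval runs, a summary over all lines may be more representative. The following is a minimal sketch, assuming every line is a Perceval JSON document with `origin` and `timestamp` fields (as the first line has); the helper name `summarize_retrieval` and the sample values are hypothetical.

```python
import json
from datetime import datetime, timezone


def summarize_retrieval(lines):
    """Return (sorted origins, earliest retrieval, latest retrieval) in UTC."""
    origins = set()
    timestamps = []
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # skip blank lines
        item = json.loads(raw)
        origins.add(item['origin'])
        timestamps.append(float(item['timestamp']))
    if not timestamps:
        return [], None, None
    first = datetime.fromtimestamp(min(timestamps), tz=timezone.utc)
    last = datetime.fromtimestamp(max(timestamps), tz=timezone.utc)
    return sorted(origins), first, last


# Toy input standing in for the real file (values are made up)
sample = [
    '{"origin": "https://example.org/repo-a", "timestamp": 1551645567.0}',
    '{"origin": "https://example.org/repo-a", "timestamp": 1551645600.5}',
    '{"origin": "https://example.org/repo-b", "timestamp": 1551650000.0}',
]
origins, first, last = summarize_retrieval(sample)
print(origins)
print(first.isoformat(), last.isoformat())
```

With the real data, an open file object can be passed instead of the list, for example `summarize_retrieval(open('../data/elasticsearch-py.json'))`.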


### Class Code_Changes

This implementation uses data retrieved as described above.
The implementation is encapsulated in the `Code_Changes` class,
which gets all commits for a set of repositories.

In [5]:
class Code_Changes:
    """
    Class for Code_Changes for Git repositories.
    
    Objects are instantiated by specifying a file with the
    commits obtained by Perceval from a set of repositories.
        
    :param path: Path to file with one Perceval JSON document per line
    """
    
    @staticmethod
    def _summarize_commit(commit):
        """
        Compute a summary of a commit, suitable as a row in a dataframe
        """
        repo = commit['origin']
        cdata = commit['data']
        summary ={
                'repo': repo,
                'hash': cdata['commit'],
                'author': cdata['Author'],
                'author_date': datetime.strptime(cdata['AuthorDate'],"%a %b %d %H:%M:%S %Y %z"),
                'commit': cdata['Commit'],
                'commit_date': datetime.strptime(cdata['CommitDate'],"%a %b %d %H:%M:%S %Y %z"),
                'files_no': len(cdata['files'])
        }
        actions = 0
        for file in cdata['files']:
            if 'action' in file:
                actions += 1
        summary['files_action'] = actions
        summary['merge'] = 'Merge' in cdata
        return summary

    # reads the data source and splits it into three categories - commit, issue and pr
    def __init__(self, path):
        """
        Initializes self.content, a dictionary of columns with one entry per commit,
        and the item counts per category.
        """
        with open(path) as datafile:
            content =  defaultdict(list)
            commit_count = 0
            issue_count = 0
            pr_count = 0
            for line in datafile:
                line = json.loads(line)
                if line['category']=='commit':
                    commit_count+=1
                    summary = self._summarize_commit(line)
                    for field in summary:
                        content[field].append(summary[field])
                elif line['category']=='issue':
                    issue_count+=1
                elif line['category']=='pull_request':
                    pr_count+=1
        self.content = content
        self.commit_count = commit_count
        self.issue_count = issue_count
        self.pr_count = pr_count
        
    # method to return the total number of commits 
    def total_count(self):
        """
        Total number of commits
        """
        # as we computed the commit count already in __init__ and saved the instance
        # we can directly return it
        return self.commit_count
    
    
    def count(self, since=None, until=None, empty=True, merge=True, date='author_date'):
        """
        Count number of commits

        :param since: Period start
        :param until: Period end
        :param empty: Include empty commits
        :param merge: Include merge commits
        :param date: Kind of date ('author_date' or 'commit_date')
        """
        c = self.content
        # parse the period limits once, not once per commit
        since_date = dateutil.parser.parse(since) if since else None
        until_date = dateutil.parser.parse(until) if until else None
        count = 0
        seen = set()
        for commit_hash, commit_date, files_action, is_merge in zip(
                c['hash'], c[date], c['files_action'], c['merge']):
            # each hash is considered only once
            if commit_hash in seen:
                continue
            seen.add(commit_hash)
            # drop the UTC offset (no conversion) to compare with the naive limits
            commit_date = commit_date.replace(tzinfo=None)
            if since_date and commit_date < since_date:
                continue
            if until_date and commit_date > until_date:
                continue
            if not empty and files_action == 0:
                continue
            if not merge and is_merge:
                continue
            count += 1
        return count

Method `total_count()` implements the `Total Count` aggregation for `Code_Changes`.

Method `count()` implements the `Count` aggregation for `Code_Changes`.
It accepts the parameters specified for the general metric:

* Period of time: `since` and `until`
* Specific to Git: `merge` and `empty`
* Kind of date used for the period: `date` (`'author_date'` or `'commit_date'`)
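
One detail of the period filter is worth illustrating: `count()` removes the UTC offset from each commit date with `replace(tzinfo=None)`, so the comparison uses the author's (or committer's) local wall-clock time, not UTC. The sketch below uses a made-up date string in the same format as `AuthorDate`; converting to UTC first is shown only as a possible alternative, not as what the class does.

```python
from datetime import datetime, timezone

GIT_DATE_FORMAT = "%a %b %d %H:%M:%S %Y %z"

# Made-up date string in the format used for AuthorDate / CommitDate
raw = "Tue Jan 1 01:30:00 2019 +0200"
aware = datetime.strptime(raw, GIT_DATE_FORMAT)

# What count() does: keep the wall-clock time, discard the offset
local_naive = aware.replace(tzinfo=None)

# Alternative: convert to UTC first, then discard the offset
utc_naive = aware.astimezone(timezone.utc).replace(tzinfo=None)

boundary = datetime(2019, 1, 1)
print(local_naive >= boundary)  # True: 2019-01-01 01:30 local time
print(utc_naive >= boundary)    # False: 2018-12-31 23:30 in UTC
```

For commits made close to a period boundary, the two conventions can therefore put the same commit in different periods.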

## Examples of use of the implementation

In [6]:
# instantiate the class with the path to the data source
changes = Code_Changes('../data/elasticsearch-py.json')


print("Code changes total count:", changes.total_count())
print("Code changes count all period:", changes.count())
print("Code changes count from 2017-01-01 to 2019-07-01:",changes.count(since="2017-01-01", until="2019-07-01"))
print("Code changes count from 2017-01-01 to 2019-07-01 (no merge commits):",changes.count(since="2017-01-01", until="2019-07-01", merge=False))
print("Code changes count from 2017-01-01 to 2019-07-01 (no empty commits):",changes.count(since="2017-01-01", until="2019-07-01", empty=False))

Code changes total count: 1135
Code changes count all period: 1135
Code changes count from 2017-01-01 to 2019-07-01: 262
Code changes count from 2017-01-01 to 2019-07-01 (no merge commits): 234
Code changes count from 2017-01-01 to 2019-07-01 (no empty commits): 237


## Examples showing peculiarities of git commits

Let's take the dictionary `commits`, with the fields of all commits retrieved
from the `elasticsearch-py.json` file, and the commit count already computed by `changes`.

In [7]:
commits = changes.content
count = changes.commit_count

### Naive count of commits

Let's compute the number of commits the easiest way, by counting all of them:

In [8]:
print("Code Commits (naive):", count)

Code Commits (naive): 1135
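
The naive count treats every commit alike. As a minimal sketch of what it hides, the `merge` and `files_action` columns stored in `changes.content` could be used to see how many commits are merges or touch no files. The `commits` dictionary below is a made-up stand-in with the same keys, so the numbers are not those of elasticsearch-py.

```python
# Made-up stand-in with the same keys that Code_Changes stores in `content`
commits = {
    'hash': ['a1', 'b2', 'c3', 'd4'],
    'files_action': [3, 0, 1, 0],
    'merge': [False, True, False, False],
}

total = len(commits['hash'])
merges = sum(1 for is_merge in commits['merge'] if is_merge)
empty = sum(1 for actions in commits['files_action'] if actions == 0)
# A merge commit can list no file actions, so the two groups may overlap
empty_merges = sum(
    1 for is_merge, actions in zip(commits['merge'], commits['files_action'])
    if is_merge and actions == 0
)

print("Total:", total)
print("Merge commits:", merges)
print("Empty commits:", empty)
print("Empty merge commits:", empty_merges)
```

Because of the possible overlap, the counts with `merge=False` and with `empty=False` cannot simply be added or subtracted to get the number of "regular" commits.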


### Issues and PRs 

The `category` field of each item in the JSON file tells whether it is a commit, an issue or a pull request.

In [9]:
print("Summary:\n", "Total Commits", changes.commit_count, "\n", "Total Issues", changes.issue_count, "\n", "Total Pull Requests:", changes.pr_count)

Summary:
 Total Commits 1135 
 Total Issues 906 
 Total Pull Requests: 290
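
The task statement also asks for the number of different identities per data source. This is a minimal sketch of one way to count them. For commits it uses the `Author` and `Commit` fields already read by `_summarize_commit`. For issues and pull requests it assumes the submitter login is found at `data['user']['login']`, which is an assumption about the GitHub items and not something shown in this notebook. The helper `count_identities` and the sample items are hypothetical.

```python
import json


def count_identities(lines):
    """Count distinct identities per category in Perceval JSON lines."""
    identities = {'commit': set(), 'issue': set(), 'pull_request': set()}
    for raw in lines:
        item = json.loads(raw)
        category, data = item['category'], item['data']
        if category == 'commit':
            # Authors and committers, as "Name <email>" strings
            identities['commit'].update((data['Author'], data['Commit']))
        elif category in ('issue', 'pull_request'):
            # Assumption: the submitter login is at data['user']['login']
            user = data.get('user') or {}
            if user.get('login'):
                identities[category].add(user['login'])
    return {category: len(ids) for category, ids in identities.items()}


# Made-up items with only the fields the function reads
sample = [
    json.dumps({'category': 'commit',
                'data': {'Author': 'Ann <ann@example.org>',
                         'Commit': 'Bob <bob@example.org>'}}),
    json.dumps({'category': 'commit',
                'data': {'Author': 'Ann <ann@example.org>',
                         'Commit': 'Ann <ann@example.org>'}}),
    json.dumps({'category': 'issue', 'data': {'user': {'login': 'ann'}}}),
    json.dumps({'category': 'issue', 'data': {'user': {'login': 'carol'}}}),
    json.dumps({'category': 'pull_request', 'data': {'user': {'login': 'ann'}}}),
]
result = count_identities(sample)
print(result)
```

Note that git identities are name and email strings while GitHub identities are logins, so the same person can be counted once per data source; no attempt is made here to merge them.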


## Total Summary

In [10]:
print('-----Overall Summary-----'+'\n')

print("Code changes total count:", changes.total_count())
print("Code changes count all period:", changes.count())
print("Code changes count from 2017-01-01 to 2019-07-01:",changes.count(since="2017-01-01", until="2019-07-01"))
print("Code changes count from 2017-01-01 to 2019-07-01 (no merge commits):",changes.count(since="2017-01-01", until="2019-07-01", merge=False))
print("Code changes count from 2017-01-01 to 2019-07-01 (no empty commits):",changes.count(since="2017-01-01", until="2019-07-01", empty=False))
print("Code Commits (naive):", count)
print("Summary:\n", "Total Commits", changes.commit_count, "\n", "Total Issues", changes.issue_count, "\n", "Total Pull Requests:", changes.pr_count)

-----Overall Summary-----

Code changes total count: 1135
Code changes count all period: 1135
Code changes count from 2017-01-01 to 2019-07-01: 262
Code changes count from 2017-01-01 to 2019-07-01 (no merge commits): 234
Code changes count from 2017-01-01 to 2019-07-01 (no empty commits): 237
Code Commits (naive): 1135
Summary:
 Total Commits 1135 
 Total Issues 906 
 Total Pull Requests: 290
