# Microtask 4

## Aim

Produce a listing of repositories, as a table and as a CSV file, with the number of commits authored, issues opened, and pull/merge requests opened during the last three months, ordered by the total number (commits plus issues plus pull requests). Use plain Python 3 (e.g., no Pandas) for this.
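As a rough sketch of the target output, assuming the per-repository counts have already been computed, the ordering and both output formats can be produced with the standard library alone. The `counts` dictionary, its numbers, and the `build_rows` and `write_listing` helpers are made up for illustration and are not part of this notebook.

```python
import csv
import io

# Hypothetical per-repository counts: (commits, issues, pull requests)
counts = {
    "repo-a": (12, 3, 5),
    "repo-b": (40, 10, 8),
    "repo-c": (7, 0, 1),
}

def build_rows(counts):
    """Return rows [repo, commits, issues, prs, total], largest total first."""
    rows = [[repo, c, i, p, c + i + p] for repo, (c, i, p) in counts.items()]
    return sorted(rows, key=lambda row: row[-1], reverse=True)

def write_listing(rows, csv_file):
    """Print the rows as a text table and write them to an open CSV file."""
    header = ["Repository", "Commits", "Issues", "PRs", "Total"]
    table = [header] + rows
    # Column width = widest cell in that column
    widths = [max(len(str(row[k])) for row in table) for k in range(len(header))]
    for row in table:
        print("  ".join(str(cell).ljust(w) for cell, w in zip(row, widths)))
    writer = csv.writer(csv_file)
    writer.writerow(header)
    writer.writerows(rows)

rows = build_rows(counts)
buffer = io.StringIO()  # stand-in for open('listing.csv', 'w', newline='')
write_listing(rows, buffer)
```

Sorting on the precomputed total keeps the table and the CSV in the same order, since both are written from the same `rows` list.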

### Retrieving Data from OmegaUp

From the command line, run Perceval on the GitHub repositories to analyze. This produces a file with one JSON document per line for all issues (the list obtained also contains the pull requests). In the example below, that file is `data_source.json`.


Syntax for using Perceval with GitHub:
`perceval github owner repository [--sleep-for-rate] [-t XXXXX]`


Date of Retrieval: 1st March 2019

##### Example:

`$ perceval github --json-line --category pull_request omegaup omegaup --sleep-for-rate -t a247a6b7d506736da6d653cddc060a96bfbd9cb3 > data_source.json`

`$ perceval github --json-line --category issue omegaup omegaup --sleep-for-rate -t a247a6b7d506736da6d653cddc060a96bfbd9cb3 >> data_source.json`

These two commands fetch the pull requests and issues from GitHub.

###### Current Problem With Perceval

When the above command is run with the `issue` category, Perceval also fetches the pull requests, because the GitHub API returns pull requests under issues. For example:

https://api.github.com/repos/omegaup/omegaup/issues/2378

This is a pull request, but the URL shows that it is returned as an issue.

So the duplicates have to be removed (done in the `__init__()` function).
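A minimal sketch of that filter, assuming each Perceval item keeps the raw GitHub payload under `data` and that issue payloads for pull requests carry a `pull_request` key (the same check `__init__()` relies on). The sample items are invented and heavily trimmed, and `is_real_issue` is an illustrative helper.

```python
import json

# Invented, trimmed stand-ins for lines of the Perceval output file
lines = [
    json.dumps({"category": "issue", "uuid": "a1", "data": {"number": 10}}),
    json.dumps({"category": "issue", "uuid": "b2",
                "data": {"number": 11, "pull_request": {"url": "placeholder"}}}),
    json.dumps({"category": "pull_request", "uuid": "c3", "data": {"number": 11}}),
]

def is_real_issue(item):
    """True for issue items that are not pull requests returned as issues."""
    return item["category"] == "issue" and "pull_request" not in item["data"]

items = [json.loads(line) for line in lines]
issues = [item for item in items if is_real_issue(item)]
pull_requests = [item for item in items if item["category"] == "pull_request"]

print(len(issues), len(pull_requests))
```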
    
    

`$ perceval git --json-line https://github.com/omegaup/omegaup >> data_source.json`

This command fetches the commits from the Git repository.

----------------------------------------------------------------------------------------

Options used above:

- `--sleep-for-rate`: keeps Perceval from exiting when the rate limit is exceeded
- `-t`: token for the GitHub API
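Once the three commands have run, a quick sanity check is to count the items per category. The sketch below assumes every line is a JSON document with a top-level `category` field, which is also what the class further down relies on. The `count_categories` helper is illustrative, and the in-memory sample stands in for the real `data_source.json`.

```python
import io
import json
from collections import Counter

def count_categories(lines):
    """Count Perceval items per category from an iterable of JSON lines."""
    counter = Counter()
    for line in lines:
        if line.strip():  # tolerate blank lines
            counter[json.loads(line)["category"]] += 1
    return counter

# Invented sample standing in for open('data_source.json')
sample = io.StringIO(
    '{"category": "pull_request", "data": {}}\n'
    '{"category": "issue", "data": {}}\n'
    '{"category": "commit", "data": {}}\n'
    '{"category": "commit", "data": {}}\n'
)
category_counts = count_categories(sample)
print(dict(category_counts))
```

With the real file, the `issue` count includes the pull requests that the GitHub API reports as issues.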

In [60]:
import json
import datetime
import dateutil.relativedelta
import re
from copy import deepcopy
from dateutil import parser
import warnings

# Silence the "numpy.dtype size changed" warning raised while importing pandas;
# the filter has to be installed before the import to take effect
warnings.filterwarnings("ignore", message="numpy.dtype size changed")
import pandas as pd

## Summarize Function

#### @arguments 

<b>line</b>: item to be summarized (one Perceval JSON document)<br>
<b>type</b>: type of item (commit, issue or pull_request)

The function returns a summary dictionary of the form:

summary {<br>
    repo,<br>
    hash (for a commit) or uuid (for a pull request or issue),<br>
    author,<br>
    author_date,<br>
    ....<br>
}

In [2]:
def summarize(line,type):
    repo = line['origin']
    cdata = line['data']
    if(type=='commit'):    
        summary = {
                'repo': repo,
                'hash': cdata['commit'],
                'author': cdata['Author'],
                'author_date': datetime.datetime.strptime(cdata['AuthorDate'],
                                                          "%a %b %d %H:%M:%S %Y %z"),
                'commit': cdata['Commit'],
                'created_date': datetime.datetime.strptime(cdata['CommitDate'],
                                                          "%a %b %d %H:%M:%S %Y %z"),
                'files_no': len(cdata['files']),
        }
        # Number of files that carry an 'action' entry
        summary['files_action'] = sum(1 for file in cdata['files'] if 'action' in file)
        # A commit is flagged as a merge when its data has a 'Merge' field
        summary['merge'] = 'Merge' in cdata
    elif(type=='issue'):
        summary = {
                'repo': repo,
                'uuid': line['uuid'],
                'author': cdata['user']['login'],
                'created_date': datetime.datetime.strptime(cdata['created_at'],
                                            "%Y-%m-%dT%H:%M:%SZ"),
                'closed_date':datetime.datetime.strptime(cdata['closed_at'],
                                            "%Y-%m-%dT%H:%M:%SZ") if cdata['closed_at'] else None, 
                'comments': cdata['comments'],
                'labels': cdata['labels'],
                'url': cdata['html_url'],
                'state':cdata['state']
        }
    elif(type=='pull_request'):
        summary = {
                'repo': repo,
                'uuid': line['uuid'],
                'author': cdata['user']['login'],
                'created_date': datetime.datetime.strptime(cdata['created_at'],"%Y-%m-%dT%H:%M:%SZ"),
                'closed_date': datetime.datetime.strptime(cdata['closed_at'],"%Y-%m-%dT%H:%M:%SZ")
                                            if cdata['closed_at'] else None,
                'merged_date': datetime.datetime.strptime(cdata['merged_at'],"%Y-%m-%dT%H:%M:%SZ")
                                        if cdata['merged_at'] else None,
                'comments': cdata['comments'],
                'commits': cdata['commits'],
                'additions': cdata['additions'],
                'deletions': cdata['deletions'],
                'changed_files':cdata['changed_files'],
                'url': cdata['html_url'],
                'state':cdata['state']
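        }
    else:
        # Fail with a clear message instead of an UnboundLocalError on `summary`
        raise ValueError("Unknown item type: {}".format(type))

    return summary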
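The two date formats used in `summarize()` behave differently. The Git format carries a UTC offset (`%z`), so it parses to a timezone-aware `datetime`, while the GitHub format ends in a literal `Z` and parses to a naive one. Ordering comparisons between aware and naive values raise `TypeError`. A small sketch of normalizing both to UTC follows; the helper names and the sample strings are invented.

```python
import datetime

GIT_FORMAT = "%a %b %d %H:%M:%S %Y %z"
GITHUB_FORMAT = "%Y-%m-%dT%H:%M:%SZ"

def parse_git_date(text):
    """Parse a Git log date; the result is timezone-aware."""
    return datetime.datetime.strptime(text, GIT_FORMAT)

def parse_github_date(text):
    """Parse a GitHub API date; the trailing Z means UTC, so attach it."""
    naive = datetime.datetime.strptime(text, GITHUB_FORMAT)
    return naive.replace(tzinfo=datetime.timezone.utc)

# Invented sample values in the two formats (the same instant)
commit_date = parse_git_date("Fri Feb 15 20:30:00 2019 -0600")
issue_date = parse_github_date("2019-02-16T02:30:00Z")

# Both are aware datetimes now, so they can be compared directly
print(commit_date == issue_date)
print(commit_date.astimezone(datetime.timezone.utc).date())
```

Note that calling `.date()` on the unconverted commit date gives the author's local day, which can differ from the UTC day near midnight.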

## Class Code_Changes

Takes the path to the JSON file as its input parameter.

In [24]:
class Code_Changes:
    """Class for Code_Changes for Git repositories.

    Objects are instantiated by specifying a file with the
    items (commits, issues and pull requests) obtained by
    Perceval from a set of repositories.

    Contains individual lists for issues, pull requests and commits.

    :param path: Path to file with one Perceval JSON document per line
    """
    
    def __init__(self, path):
        
        self.changes = {'issue':[],'commit':[],'pull_request':[]}
        with open(path) as data_file:
            for data in data_file:
                line = json.loads(data)
                if(line['category'] ==  'commit'):
                    self.changes['commit'].append(summarize(line,'commit'))
                else:
                    if (line['category'] == 'pull_request'):
                        self.changes['pull_request'].append(summarize(line,'pull_request'))
                    elif ('pull_request' not in line['data']) and (line['category'] == 'issue'):
                        self.changes['issue'].append(summarize(line,'issue'))
        self.code_dataframe = {
            'commit':pd.DataFrame.from_dict(self.changes['commit']),
            'pull_request':pd.DataFrame.from_dict(self.changes['pull_request']),
            'issue':pd.DataFrame.from_dict(self.changes['issue'])
        }
    

#### Functions Available
- total_count(): returns the total number of issues to date
- count(): returns the number of issues created in a period of time
    - Parameters: Since, Until

## Example of the implementation

In [106]:
code = Code_Changes('data_source.json')

## Analysing Data for Last Three Months

The last three months do not include the current month: Dec 2018, Jan 2019 and Feb 2019.

In [115]:
today_date = datetime.datetime.now().date()
last_third_month = today_date - dateutil.relativedelta.relativedelta(months=3)
last_third_month_first_date = last_third_month.replace(day=1)
current_month_first_date = today_date.replace(day=1)
print("Last third month first date: ", last_third_month_first_date)
print("Current month first date: ", current_month_first_date)

Last third month first date:  2018-12-01
Current month first date:  2019-03-01


We have to analyse the data between the above dates: from 2018-12-01 (inclusive) up to, but not including, 2019-03-01.
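The same window can be computed with the standard library only, without `dateutil`. This is a sketch under the assumption that the window always covers whole calendar months; `month_window` is a hypothetical helper name.

```python
import datetime

def month_window(today, months=3):
    """Return (since, until) for the last `months` full calendar months.

    `until` is the first day of the current month and is meant to be
    exclusive; `since` is the first day of the month `months` earlier.
    """
    until = today.replace(day=1)
    # Number the months consecutively so that crossing a year boundary
    # is plain integer arithmetic
    index = until.year * 12 + (until.month - 1) - months
    year, month0 = divmod(index, 12)
    since = datetime.date(year, month0 + 1, 1)
    return since, until

since, until = month_window(datetime.date(2019, 3, 1))
print(since, until)
```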

In [117]:
frames = deepcopy(code.code_dataframe)
since = last_third_month_first_date
until = current_month_first_date

# Commits are dated by authorship; issues and pull requests by creation
date_columns = {'commit': 'author_date',
                'issue': 'created_date',
                'pull_request': 'created_date'}

# Keep the filtered frames instead of discarding them on each iteration
recent = {}
for change_type, frame in frames.items():
    dates = frame[date_columns[change_type]].apply(lambda x: x.date())
    # `until` is the first day of the current month, so it is excluded.
    # "Opened" means created in the window, whatever the current state is.
    recent[change_type] = frame[(since <= dates) & (dates < until)]
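The aim asks for plain Python, so the same selection can also be sketched without Pandas, working directly on summary dictionaries like those stored in `code.changes`. The helper names and the sample data below are invented, and the end of the window is treated as exclusive so that the first day of the current month is left out.

```python
import datetime
from collections import defaultdict

def in_window(day, since, until):
    """True when since <= day < until (the end of the window is exclusive)."""
    return since <= day < until

def count_per_repo(changes, since, until):
    """Count commits, issues and pull requests per repository in the window.

    `changes` maps 'commit', 'issue' and 'pull_request' to lists of summary
    dicts shaped like the ones built by summarize().
    """
    date_key = {"commit": "author_date", "issue": "created_date",
                "pull_request": "created_date"}
    counts = defaultdict(lambda: {"commit": 0, "issue": 0, "pull_request": 0})
    for kind, summaries in changes.items():
        for summary in summaries:
            if in_window(summary[date_key[kind]].date(), since, until):
                counts[summary["repo"]][kind] += 1
    return dict(counts)

# Invented summaries, reduced to the fields used here
d = datetime.datetime
changes = {
    "commit": [{"repo": "r1", "author_date": d(2019, 1, 5)},
               {"repo": "r1", "author_date": d(2018, 11, 30)}],
    "issue": [{"repo": "r1", "created_date": d(2019, 2, 28)}],
    "pull_request": [{"repo": "r1", "created_date": d(2019, 3, 1)}],
}
result = count_per_repo(changes, datetime.date(2018, 12, 1), datetime.date(2019, 3, 1))
print(result)
```

The resulting per-repository dictionary is the kind of input needed for the final listing ordered by total.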

## Activity List

Lists the number of commits, pull requests and issues per quarter in the given years.

In [87]:
for key,value in activity.items():
    print(key)
    print("---------")
    for i in range(len(value)):
        print("Quarter "+str(i+1)+": "+str(value[i]))

NameError: name 'activity' is not defined

## New Authors 

Lists the number of new contributors per quarter in the given years, that is, people who contributed commits, pull requests or issues for the first time.

In [114]:
for key,value in new_contributors.items():
    print(key)
    print("---------")
    for i in range(len(value)):
        print("Quarter "+str(i+1)+": "+str(value[i]))

issue
---------
Quarter 1: 13
Quarter 2: 12
Quarter 3: 6
Quarter 4: 5
Quarter 5: 9
Quarter 6: 2
Quarter 7: 6
Quarter 8: 3
commit
---------
Quarter 1: 10
Quarter 2: 5
Quarter 3: 6
Quarter 4: 4
Quarter 5: 12
Quarter 6: 7
Quarter 7: 1
Quarter 8: 1
pull_request
---------
Quarter 1: 10
Quarter 2: 5
Quarter 3: 7
Quarter 4: 2
Quarter 5: 14
Quarter 6: 6
Quarter 7: 0
Quarter 8: 1
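The code that builds `new_contributors` is not shown here. One possible way to compute such figures, assuming an author counts as new in the period that contains their earliest contribution, is sketched below. The `new_authors_per_period` function and the sample data are hypothetical.

```python
import datetime

def new_authors_per_period(summaries, periods, date_key="created_date"):
    """Count authors whose first contribution falls in each period.

    `periods` is a list of (since, until) date pairs, both ends inclusive.
    """
    first_seen = {}
    for summary in summaries:
        day = summary[date_key].date()
        author = summary["author"]
        # Remember only the earliest day on which each author appears
        if author not in first_seen or day < first_seen[author]:
            first_seen[author] = day
    return [sum(1 for day in first_seen.values() if since <= day <= until)
            for since, until in periods]

d, date = datetime.datetime, datetime.date
summaries = [
    {"author": "ana", "created_date": d(2017, 1, 10)},
    {"author": "ana", "created_date": d(2017, 5, 2)},
    {"author": "luis", "created_date": d(2017, 4, 20)},
]
periods = [(date(2017, 1, 1), date(2017, 3, 31)),
           (date(2017, 4, 1), date(2017, 6, 30))]
new_counts = new_authors_per_period(summaries, periods)
print(new_counts)
```

For commits, `date_key="author_date"` would be passed instead, since commit summaries are dated by authorship.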


## Writing CSV
The CSV file has the following fields: `Quarter`, `Since`, `Until`, `Commits`, `PRs`, `Issues`, `New PR Submitters`, `New Issue Submitters`, `New Commiters`.

In [144]:
dataframe_csv = pd.DataFrame()
fields = ['Quarter','Since','Until','Commits','PRs','Issues','New PR Submitters','New Issue Submitters','New Commiters']
dataframe_csv['Quarter'] = ["Quarter " + str(x) for x in range(len(quarter_durations))]
dataframe_csv['Since'] = [quarter_durations[x][0].strftime('%d %B %Y') for x in range(len(quarter_durations))]
dataframe_csv['Until'] = [quarter_durations[x][1].strftime('%d %B %Y') for x in range(len(quarter_durations))]
dataframe_csv['Commits'] = activity['commit']
dataframe_csv['PRs'] = activity['pull_request']
dataframe_csv['Issues'] = activity['issue']
dataframe_csv['New PR Submitters'] = new_contributors['pull_request']
dataframe_csv['New Issue Submitters'] = new_contributors['issue']
dataframe_csv['New Commiters'] = new_contributors['commit']

dataframe_csv.to_csv('microtask2-quarterwise-pandas.csv',index=None)

In [145]:
pd.read_csv('microtask2-quarterwise-pandas.csv')

| | Quarter | Since | Until | Commits | PRs | Issues | New PR Submitters | New Issue Submitters | New Commiters |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Quarter 0 | 01 January 2017 | 31 March 2017 | 102 | 73 | 75 | 10 | 13 | 10 |
| 1 | Quarter 1 | 01 April 2017 | 30 June 2017 | 93 | 76 | 104 | 5 | 12 | 5 |
| 2 | Quarter 2 | 01 July 2017 | 30 September 2017 | 65 | 63 | 63 | 7 | 6 | 6 |
| 3 | Quarter 3 | 01 October 2017 | 31 December 2017 | 107 | 98 | 90 | 2 | 5 | 4 |
| 4 | Quarter 4 | 01 January 2018 | 31 March 2018 | 130 | 114 | 111 | 14 | 9 | 12 |
| 5 | Quarter 5 | 01 April 2018 | 30 June 2018 | 115 | 120 | 128 | 6 | 2 | 7 |
| 6 | Quarter 6 | 01 July 2018 | 30 September 2018 | 61 | 51 | 36 | 0 | 6 | 1 |
| 7 | Quarter 7 | 01 October 2018 | 31 December 2018 | 58 | 63 | 29 | 1 | 3 | 1 |
