# Microtask 3

## Aim

Produce a notebook with charts showing the distribution of time-to-close for issues already closed, and opened during the last year, for each of the repositories analyzed, and for all of them together. Use Pandas for this, and the Python charting library of your choice (as long as it is a FOSS module).

### Retrieving Data from OmegaUp

From the command line, run Perceval on the GitHub repositories to analyze. This produces a file with one JSON document per line for each issue (the list obtained also contains the pull requests). The examples below write to `data_source.json`, the file loaded later in this notebook.


Syntax for using Perceval with GitHub:

`perceval github owner repository [--sleep-for-rate] [-t XXXXX]`

Date of Retrieval: 1st March 2019

##### Example:

`$ perceval github --json-line --category pull_request omegaup omegaup --sleep-for-rate -t a247a6b7d506736da6d653cddc060a96bfbd9cb3 > data_source.json`

`$ perceval github --json-line --category issue omegaup omegaup --sleep-for-rate -t a247a6b7d506736da6d653cddc060a96bfbd9cb3 >> data_source.json`

These two commands fetch the pull request and issue data from GitHub.

###### Current Problem With Perceval

When the command above is run with the `issue` category, it also fetches the pull requests, because the GitHub API returns pull requests under issues. For example:

https://api.github.com/repos/omegaup/omegaup/issues/2378

This is a pull request, but the URL shows that it is returned as an issue.

So the duplicates have to be removed (done in the `__init__()` function).
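
As a minimal sketch of that filter, assuming an issue item carries a `pull_request` key under `data` whenever it is really a pull request (the same check used in `__init__()` further down), the snippet below applies it to two made-up items. The items, their field values, and the `is_plain_issue` helper are hypothetical and only illustrate the idea.

```python
import json

# Two hypothetical Perceval-style items, reduced to the fields used here.
# Real items carry many more keys.
raw_lines = [
    json.dumps({"category": "issue", "uuid": "a1",
                "data": {"number": 1, "state": "open"}}),
    json.dumps({"category": "issue", "uuid": "b2",
                "data": {"number": 2, "state": "open",
                         "pull_request": {"url": "https://example.invalid/pulls/2"}}}),
]


def is_plain_issue(item):
    """Return True for issue items that are not pull requests."""
    return item["category"] == "issue" and "pull_request" not in item["data"]


items = [json.loads(line) for line in raw_lines]
plain_issues = [item for item in items if is_plain_issue(item)]
print([item["data"]["number"] for item in plain_issues])
```

Only the first item survives, since the second one carries the `pull_request` key.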
    
    

`$ perceval git --json-line https://github.com/omegaup/omegaup >> data_source.json`

Gets the commit information from the Git repository.

----------------------------------------------------------------------------------------

`--sleep-for-rate`: keeps Perceval from exiting when the rate limit is exceeded

`-t`: token for the GitHub API

In [3]:
import json
import datetime
from dateutil import parser
import pandas as pd

import warnings ## to ignore warnings that come in importing pandas
warnings.filterwarnings("ignore", message="numpy.dtype size changed")

## Summarize Function

#### @arguments 

<b>issue</b>: Perceval item (an issue) to be summarized

summary{
    repo,<br>
    uuid,<br>
    author,<br>
    created_date,<br>
    closed_date (None while the issue is still open),<br>
    ....<br>
}

In [2]:
def summarize(issue):
    repo = issue['origin']
    cdata = issue['data']
    summary = {
            'repo': repo,
            'uuid': issue['uuid'],
            'author': cdata['user']['login'],
            'created_date': datetime.datetime.strptime(cdata['created_at'],
                                           "%Y-%m-%dT%H:%M:%SZ"),
            'closed_date':datetime.datetime.strptime(cdata['closed_at'],
                                         "%Y-%m-%dT%H:%M:%SZ") if cdata['closed_at'] else None, 
            'comments': cdata['comments'],
            'labels': cdata['labels'],
            'url': cdata['html_url'],
            'state':cdata['state']
    }
    return summary
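
The date handling in `summarize` can be tried in isolation. The following is a small sketch, assuming GitHub timestamps in the `%Y-%m-%dT%H:%M:%SZ` format used above and `None` for issues that are still open. The `parse_github_date` helper and the sample timestamps are hypothetical.

```python
import datetime

GITHUB_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def parse_github_date(value):
    """Parse a GitHub timestamp string, passing None through unchanged."""
    if value is None:
        return None
    return datetime.datetime.strptime(value, GITHUB_FORMAT)


# Made-up timestamps for one closed issue and one open issue.
created = parse_github_date("2018-03-01T10:15:00Z")
closed = parse_github_date("2018-03-04T22:15:00Z")
still_open = parse_github_date(None)

# Subtracting two datetimes gives a timedelta: the time-to-close.
time_to_close = closed - created
print(time_to_close)
```

Subtracting the two parsed values gives a `datetime.timedelta`, which is the quantity the time-to-close analysis needs.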

## Class Code_Changes

Takes path to the JSON file as input parameter

In [7]:
class Code_Changes:
    """Class for Code_Changes for Git repositories.
    
    Objects are instantiated by specifying a file with the
    items obtained by Perceval from a set of repositories.
    
    Contains the list of issues; pull requests returned under the issue category are skipped.
        
    :param path: Path to file with one Perceval JSON document per line
    """
    
    def __init__(self, path):
        
        self.changes = {'issue':[],'commit':[],'pull_request':[]}
        with open(path) as data_file:
            for data in data_file:
                line = json.loads(data)
                ## keep plain issues only; the GitHub API also lists pull requests as issues
                if ('pull_request' not in line['data']) and (line['category'] == 'issue'):
                    self.changes['issue'].append(summarize(line))
        self.code_dataframe = {
            'issue':pd.DataFrame.from_dict(self.changes['issue'])
        }

#### Functions Available
- total_count(): returns the total number of issues to date
- count(): returns the number of issues created in a period of time
    - Parameters:
        - Since
        - Until


## Example of the implementation

In [84]:
code = Code_Changes('data_source.json')

In [85]:
print("Total Number Of Issues:",len(code.changes['issue']))

Total Number Of Issues: 1484


## First Issue in The Repository

In [86]:
code.changes['issue'].sort(key = lambda x:x['created_date'].replace(tzinfo=None))  ## sort the issues by creation date
first_issue = code.changes['issue'][0]
first_issue_year = first_issue['created_date'].year

print("Year of first issue: ",first_issue_year)

Year of first issue:  2011


## Number of Issues Closed and Opened Last Year

Shows the distribution of issues opened and closed last year.

In [75]:
current_year = datetime.datetime.now().year
frame = code.code_dataframe['issue']
created_year = frame['created_date'].apply(lambda x:x.year)
## keep only the issues created during the previous calendar year;
## copy so that columns can be added later without pandas warnings
frame = frame[created_year == current_year - 1].copy()

## frame contains the issues in the last year

In [93]:
total_issues = frame.shape[0]
closed_issues = frame[frame['state'] == 'closed'].shape[0]
open_issues = total_issues - closed_issues
print("Total Number of Issues in last year: ",total_issues)
print("Total Number of Closed Issues in last year: ",closed_issues)
print("Total Number of Open Issues in last year: ",open_issues)

Total Number of Issues in last year:  306
Total Number of Closed Issues in last year:  201
Total Number of Open Issues in last year:  105
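
The counts above can be extended to the time-to-close distribution named in the aim. The following is a minimal sketch of one way to derive it with pandas, assuming a frame with the `repo`, `created_date`, `closed_date`, and `state` columns produced by `summarize`. The sample rows, the `days_to_close` column, and the `plot_time_to_close` helper are hypothetical. The plotting helper assumes matplotlib is installed (one FOSS option among several) and is only defined here, not called.

```python
import pandas as pd

# Hypothetical sample rows standing in for `frame`; real data comes from
# the Perceval file and has more columns.
issues = pd.DataFrame({
    "repo": ["repo-a", "repo-a", "repo-b", "repo-b"],
    "created_date": pd.to_datetime([
        "2018-01-10 00:00", "2018-02-01 00:00",
        "2018-03-05 00:00", "2018-06-01 00:00"]),
    "closed_date": pd.to_datetime([
        "2018-01-12 00:00", "2018-03-03 00:00",
        None, "2018-06-01 12:00"]),
    "state": ["closed", "closed", "open", "closed"],
})

# Only closed issues have a time-to-close.
closed_issues = issues[issues["state"] == "closed"].copy()
elapsed = closed_issues["closed_date"] - closed_issues["created_date"]
closed_issues["days_to_close"] = elapsed.dt.total_seconds() / 86400

# Summary statistics per repository and for all repositories together.
per_repo = closed_issues.groupby("repo")["days_to_close"].describe()
overall = closed_issues["days_to_close"].describe()
print(per_repo[["count", "mean", "max"]])
print(overall[["count", "mean", "max"]])


def plot_time_to_close(days, title):
    """Draw a histogram of days-to-close (requires matplotlib)."""
    ax = days.plot.hist(bins=20, title=title)
    ax.set_xlabel("Days to close")
    return ax
```

In a notebook, calling `plot_time_to_close(closed_issues["days_to_close"], "All repositories")` would draw the combined chart. Calling it once per group of `closed_issues.groupby("repo")` would give one chart per repository.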


## Number of Issues Closed Per Month Last Year

In [138]:
frame['month'] = frame['created_date'].apply(lambda x:x.month)
## generate month names for indexing
months = []
for i in range(1,13):
    months.append(datetime.date(2008, i, 1).strftime('%B'))
## count the closed issues by the month in which they were created
closed = frame[frame['state'] == 'closed']
closed = closed.groupby(['month']).count()
closed = closed['state'].tolist()
closed

[27, 28, 25, 28, 37, 10, 8, 8, 8, 5, 9, 8]
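
A `groupby` count only returns rows for months that contain at least one closed issue, so a quiet month would silently shorten the list. As a sketch of one way to keep all twelve months and label them by name, assuming a frame with `created_date` and `state` columns (the sample rows and the `closed_per_month` name are made up for illustration):

```python
import calendar

import pandas as pd

# Hypothetical sample rows: closed issues exist only in January and March.
issues = pd.DataFrame({
    "created_date": pd.to_datetime(
        ["2018-01-05", "2018-01-20", "2018-03-02", "2018-03-15"]),
    "state": ["closed", "open", "closed", "closed"],
})
issues["month"] = issues["created_date"].dt.month

closed_per_month = (
    issues[issues["state"] == "closed"]
    .groupby("month")
    .size()
    # Reindex over months 1..12 so that missing months show up as 0.
    .reindex(range(1, 13), fill_value=0)
)
# Replace month numbers with month names for display and plotting.
closed_per_month.index = [calendar.month_name[m] for m in closed_per_month.index]
print(closed_per_month)
```

The resulting Series always has twelve entries, which keeps it aligned with a list of month names when it is used as chart input.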

## Activity QuarterWise

Number of commits, PRs, and issues per quarter.

In [95]:
quarter_durations = quarterize(2017,2018) ## get the quarter duration between the years passed as arguments
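
The `quarterize` helper is not defined in this section. A minimal sketch of what such a helper might look like is shown below, assuming it returns one `(since, until)` pair of `datetime` objects per calendar quarter for the inclusive range of years. The end-of-quarter convention (the last microsecond of the final day) is an assumption, chosen so that a `created_date <= until` comparison keeps items from the whole last day. The original helper may differ.

```python
import datetime


def quarterize(start_year, end_year):
    """Return (since, until) datetime pairs for every calendar quarter
    from start_year to end_year, both inclusive (hypothetical sketch)."""
    quarters = []
    for year in range(start_year, end_year + 1):
        for first_month in (1, 4, 7, 10):
            since = datetime.datetime(year, first_month, 1)
            # Start of the following quarter, rolling over the year after Q4.
            if first_month == 10:
                next_start = datetime.datetime(year + 1, 1, 1)
            else:
                next_start = datetime.datetime(year, first_month + 3, 1)
            # Assumed convention: the quarter ends just before the next starts.
            until = next_start - datetime.timedelta(microseconds=1)
            quarters.append((since, until))
    return quarters


quarters = quarterize(2017, 2018)
for number, (since, until) in enumerate(quarters, start=1):
    print("Quarter", number, since.strftime("%d %B %Y"), "to", until.strftime("%d %B %Y"))
```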

In [115]:
activity = {'issue':[],'pull_request':[],'commit':[] }
new_contributors = {'issue':[],'pull_request':[],'commit':[] }
existing_authors = {'issue':[],'pull_request':[],'commit':[] }
for quarter in quarter_durations:
    since = quarter[0]
    until = quarter[1]
    for change_type,frame in code.code_dataframe.items():
        ## drop timezone information so the dates compare with since/until
        frame['created_date'] = frame['created_date'].apply(lambda x:x.replace(tzinfo=None))
        ## keep the items created within the quarter
        frame = frame[(since<=frame['created_date']) & (frame['created_date']<=until)]
        ## authors not seen in any earlier quarter count as new contributors
        new_contributions_quarter = len(set(frame['author'].unique())-set(existing_authors[change_type]))
        new_contributors[change_type].append(new_contributions_quarter)
        activity[change_type].append(frame.shape[0])
        existing_authors[change_type] = existing_authors[change_type] + frame['author'].unique().tolist()
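
One consequence of the loop above is that `existing_authors` starts empty, so every author active in the first quarter is counted as new, even if they contributed before the analyzed period. As an alternative sketch (not the approach used in this notebook), the first contribution of each author can be taken from the full history and then bucketed by quarter. The `events` frame and its rows are made up for illustration.

```python
import pandas as pd

# Hypothetical contribution events: one row per issue, PR, or commit.
events = pd.DataFrame({
    "author": ["ana", "bo", "ana", "cy", "bo"],
    "created_date": pd.to_datetime(
        ["2017-01-10", "2017-02-03", "2017-04-20", "2017-05-01", "2017-08-09"]),
})

# Earliest contribution date per author over the whole history.
first_seen = events.groupby("author")["created_date"].min()

# Bucket those first dates by calendar quarter and count the authors in each.
new_per_quarter = first_seen.dt.to_period("Q").value_counts().sort_index()
print(new_per_quarter)
```

Quarters without any new author do not appear in the result, so the Series would need a reindex over the full quarter range before it is lined up with the activity lists.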

## Quarter Number and Their Duration

In [116]:
for number,quarter in enumerate(quarter_durations):
    print("Quarter "+str(number + 1) + ": ",quarter[0].strftime('%d %B %Y'),"to",quarter[1].strftime('%d %B %Y'))

Quarter 1:  01 January 2017 to 31 March 2017
Quarter 2:  01 April 2017 to 30 June 2017
Quarter 3:  01 July 2017 to 30 September 2017
Quarter 4:  01 October 2017 to 31 December 2017
Quarter 5:  01 January 2018 to 31 March 2018
Quarter 6:  01 April 2018 to 30 June 2018
Quarter 7:  01 July 2018 to 30 September 2018
Quarter 8:  01 October 2018 to 31 December 2018


## Activity List

Lists the number of commits, pull requests, and issues per quarter in the given years.

In [113]:
for key,value in activity.items():
    print(key)
    print("---------")
    for i in range(len(value)):
        print("Quarter "+str(i+1)+": "+str(value[i]))

issue
---------
Quarter 1: 75
Quarter 2: 104
Quarter 3: 63
Quarter 4: 90
Quarter 5: 111
Quarter 6: 128
Quarter 7: 36
Quarter 8: 29
commit
---------
Quarter 1: 102
Quarter 2: 93
Quarter 3: 65
Quarter 4: 107
Quarter 5: 130
Quarter 6: 115
Quarter 7: 61
Quarter 8: 58
pull_request
---------
Quarter 1: 73
Quarter 2: 76
Quarter 3: 63
Quarter 4: 98
Quarter 5: 114
Quarter 6: 120
Quarter 7: 51
Quarter 8: 63


## New Authors 

Lists the number of new contributors per quarter in the given years, counted separately for commits, pull requests, and issues.

In [114]:
for key,value in new_contributors.items():
    print(key)
    print("---------")
    for i in range(len(value)):
        print("Quarter "+str(i+1)+": "+str(value[i]))

issue
---------
Quarter 1: 13
Quarter 2: 12
Quarter 3: 6
Quarter 4: 5
Quarter 5: 9
Quarter 6: 2
Quarter 7: 6
Quarter 8: 3
commit
---------
Quarter 1: 10
Quarter 2: 5
Quarter 3: 6
Quarter 4: 4
Quarter 5: 12
Quarter 6: 7
Quarter 7: 1
Quarter 8: 1
pull_request
---------
Quarter 1: 10
Quarter 2: 5
Quarter 3: 7
Quarter 4: 2
Quarter 5: 14
Quarter 6: 6
Quarter 7: 0
Quarter 8: 1


## Writing CSV

Fields: `Quarter`, `Since`, `Until`, `Commits`, `PRs`, `Issues`, `New PR Submitters`, `New Issue Submitters`, `New Commiters`

In [144]:
dataframe_csv = pd.DataFrame()
fields = ['Quarter','Since','Until','Commits','PRs','Issues','New PR Submitters','New Issue Submitters','New Commiters']
dataframe_csv['Quarter'] = ["Quarter " + str(x) for x in range(len(quarter_durations))]
dataframe_csv['Since'] = [quarter_durations[x][0].strftime('%d %B %Y') for x in range(len(quarter_durations))]
dataframe_csv['Until'] = [quarter_durations[x][1].strftime('%d %B %Y') for x in range(len(quarter_durations))]
dataframe_csv['Commits'] = activity['commit']
dataframe_csv['PRs'] = activity['pull_request']
dataframe_csv['Issues'] = activity['issue']
dataframe_csv['New PR Submitters'] = new_contributors['pull_request']
dataframe_csv['New Issue Submitters'] = new_contributors['issue']
dataframe_csv['New Commiters'] = new_contributors['commit']

## index=False keeps the DataFrame index out of the CSV file
dataframe_csv.to_csv('microtask2-quarterwise-pandas.csv',index=False)

In [145]:
pd.read_csv('microtask2-quarterwise-pandas.csv')

| | Quarter | Since | Until | Commits | PRs | Issues | New PR Submitters | New Issue Submitters | New Commiters |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Quarter 0 | 01 January 2017 | 31 March 2017 | 102 | 73 | 75 | 10 | 13 | 10 |
| 1 | Quarter 1 | 01 April 2017 | 30 June 2017 | 93 | 76 | 104 | 5 | 12 | 5 |
| 2 | Quarter 2 | 01 July 2017 | 30 September 2017 | 65 | 63 | 63 | 7 | 6 | 6 |
| 3 | Quarter 3 | 01 October 2017 | 31 December 2017 | 107 | 98 | 90 | 2 | 5 | 4 |
| 4 | Quarter 4 | 01 January 2018 | 31 March 2018 | 130 | 114 | 111 | 14 | 9 | 12 |
| 5 | Quarter 5 | 01 April 2018 | 30 June 2018 | 115 | 120 | 128 | 6 | 2 | 7 |
| 6 | Quarter 6 | 01 July 2018 | 30 September 2018 | 61 | 51 | 36 | 0 | 6 | 1 |
| 7 | Quarter 7 | 01 October 2018 | 31 December 2018 | 58 | 63 | 29 | 1 | 3 | 1 |
