# Microtask 1

### Retrieving Data from OmegaUp

From the command line, run Perceval on the GitHub repository to be analyzed. It produces a file with one JSON document per line for every issue and pull request (the issue list returned by GitHub also contains the pull requests). The examples below write to `data_source.json`.


Syntax for running Perceval against GitHub:

`perceval github owner repository [--sleep-for-rate] [-t XXXXX]`


Date of Retrieval: 1st March 2019
##### Example:

`$ perceval github --json-line --category pull_request omegaup omegaup --sleep-for-rate -t a247a6b7d506736da6d653cddc060a96bfbd9cb3 > data_source.json`

`$ perceval github --json-line --category issue omegaup omegaup --sleep-for-rate -t a247a6b7d506736da6d653cddc060a96bfbd9cb3 >> data_source.json`

These commands retrieve the pull requests and issues from GitHub.

###### Current Problem With Perceval

When the command above runs with the `issue` category, it also fetches pull requests, because the GitHub API returns pull requests under issues. For example:

https://api.github.com/repos/omegaup/omegaup/issues/2378

This is a pull request, but the URL shows that it is returned as an issue. The duplicates therefore have to be removed, which is done in the `__init__()` function of the `Code_Changes` class.
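
The check that separates real issues from pull requests can be sketched on two made-up items. The field values below are hypothetical and `is_plain_issue` is an illustrative helper name; only the presence of a `pull_request` key inside `data` matters, which is the same test this notebook relies on.

```python
import json

# Two hypothetical Perceval-style items; real items carry many more fields.
raw_lines = [
    '{"category": "issue", "uuid": "a1", "data": {"number": 10}}',
    '{"category": "issue", "uuid": "b2", "data": {"number": 11, "pull_request": {"url": "https://example.invalid/pulls/11"}}}',
]

def is_plain_issue(item):
    """Return True for issue items that are not pull requests."""
    return item["category"] == "issue" and "pull_request" not in item["data"]

items = [json.loads(line) for line in raw_lines]
plain_issues = [item for item in items if is_plain_issue(item)]
print([item["uuid"] for item in plain_issues])
```

Only the first item survives the filter, because the second one carries the `pull_request` key.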
    
    

`$ perceval git --json-line https://github.com/omegaup/omegaup >> data_source.json`

This command retrieves the commits from the Git repository hosted on GitHub.

----------------------------------------------------------------------------------------
`--sleep-for-rate` keeps Perceval from exiting when the rate limit is exceeded; it sleeps until the limit resets.

`-t` is the token for the GitHub API.

In [7]:
import json
import datetime
import re
from dateutil import parser

## Summarize Function

#### @arguments 

<b>line</b>: item to be summarized<br>
<b>type</b>: type of item(commit,issue,pull_request)

Returns a dictionary of the form:

summary{
    repo,<br>
    hash (for a commit) or uuid (for a pull request or issue),<br>
    author,<br>
    author_date,<br>
    ....<br>
}

In [18]:
def summarize(line,type):
    repo = line['origin']
    cdata = line['data']
    if(type=='commit'):    
        summary = {
                'repo': repo,
                'hash': cdata['commit'],
                'author': cdata['Author'],
                'author_date': datetime.datetime.strptime(cdata['AuthorDate'],
                                                          "%a %b %d %H:%M:%S %Y %z"),
                'commit': cdata['Commit'],
                'created_date': datetime.datetime.strptime(cdata['CommitDate'],
                                                          "%a %b %d %H:%M:%S %Y %z"),
                'files_no': len(cdata['files']),
        }
        # count the files that record an action
        summary['files_action'] = sum(1 for file in cdata['files'] if 'action' in file)
        # the 'Merge' key is present only for merge commits
        summary['merge'] = 'Merge' in cdata
    elif(type=='issue'):
        summary = {
                'repo': repo,
                'uuid': line['uuid'],
                'author': cdata['user']['login'],
                'created_date': datetime.datetime.strptime(cdata['created_at'],
                                            "%Y-%m-%dT%H:%M:%SZ"),
                'closed_date':datetime.datetime.strptime(cdata['closed_at'],
                                            "%Y-%m-%dT%H:%M:%SZ") if cdata['closed_at'] else None, 
                'comments': cdata['comments'],
                'labels': cdata['labels'],
                'url': cdata['html_url'],
                'state':cdata['state']
        }
    elif(type=='pull_request'):
        summary = {
                'repo': repo,
                'uuid': line['uuid'],
                'author': cdata['user']['login'],
                'created_date': datetime.datetime.strptime(cdata['created_at'],"%Y-%m-%dT%H:%M:%SZ"),
                'closed_date': datetime.datetime.strptime(cdata['closed_at'],"%Y-%m-%dT%H:%M:%SZ")
                                            if cdata['closed_at'] else None,
                'merged_date': datetime.datetime.strptime(cdata['merged_at'],"%Y-%m-%dT%H:%M:%SZ")
                                        if cdata['merged_at'] else None,
                'comments': cdata['comments'],
                'commits': cdata['commits'],
                'additions': cdata['additions'],
                'deletions': cdata['deletions'],
                'changed_files':cdata['changed_files'],
                'url': cdata['html_url'],
                'state':cdata['state']
        }
    
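    else:
        # fail early instead of hitting an unbound 'summary' below
        raise ValueError("unknown item type: {}".format(type))

    return summary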
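
The two date formats used in `summarize` can be tried in isolation. The sketch below parses one made-up timestamp in each format and then normalizes the Git date to naive UTC so the two values can be compared. Converting to UTC first is an assumption made for this example; the notebook's own cells strip the offset with `.replace(tzinfo=None)` instead.

```python
import datetime

# Format of AuthorDate / CommitDate in the Git items (carries a UTC offset).
GIT_DATE_FORMAT = "%a %b %d %H:%M:%S %Y %z"
# Format of created_at / closed_at / merged_at in the GitHub items (UTC, no offset parsed).
GITHUB_DATE_FORMAT = "%Y-%m-%dT%H:%M:%SZ"

# Made-up sample timestamps that describe the same instant.
git_date = datetime.datetime.strptime("Thu Mar 14 10:15:30 2019 -0600", GIT_DATE_FORMAT)
github_date = datetime.datetime.strptime("2019-03-14T16:15:30Z", GITHUB_DATE_FORMAT)

# git_date is timezone-aware, github_date is naive; convert before comparing.
git_utc_naive = git_date.astimezone(datetime.timezone.utc).replace(tzinfo=None)
print(git_utc_naive == github_date)
```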

## Class Code_Changes

Takes the path to the JSON file as its input parameter; on initialization it builds the dictionary `changes`.
The dictionary has three keys, each of which maps to a list:

- `commit`: summaries of each commit
- `issue`: summaries of each issue
- `pull_request`: summaries of each pull request

In [19]:
class Code_Changes:
    """Class for Code_Changes for Git repositories.
    
    Objects are instantiated by specifying a file with the
    commits obtained by Perceval from a set of repositories.
    
    Contains individual list for Issues, Pull Requests and Commits
        
    :param path: Path to file with one Perceval JSON document per line
    """
    
    def __init__(self, path):
        
        self.changes = {'issue':[],'commit':[],'pull_request':[]}
        with open(path) as data_file:
            for data in data_file:
                line = json.loads(data)
                if line['category'] == 'commit':
                    self.changes['commit'].append(summarize(line,'commit'))
                elif line['category'] == 'pull_request':
                    self.changes['pull_request'].append(summarize(line,'pull_request'))
                # issue items that carry a 'pull_request' key are pull requests
                # already fetched under their own category, so they are skipped
                elif line['category'] == 'issue' and 'pull_request' not in line['data']:
                    self.changes['issue'].append(summarize(line,'issue'))
                            

## Quarter Definition

In [20]:
quarters = {} ## maps a quarter name to a (since, until) pair; each entry holds a month and a day
quarters['quarter1'] = ({'month':1,'day':1},{'month':3,'day':31})  #January 01 – March 31
quarters['quarter2'] = ({'month':4,'day':1},{'month':6,'day':30})  #April 01 – June 30
quarters['quarter3'] = ({'month':7,'day':1},{'month':9,'day':30})  #July 01 – September 30
quarters['quarter4'] = ({'month':10,'day':1},{'month':12,'day':31})  #October 01 – December 31
## sort the keys explicitly: older Python versions do not guarantee dictionary insertion order
quarters_keys = sorted(quarters.keys(), key=lambda x:x.lower())

## Quarter Function

### @arguments - since (year), until (year)
Divides the years from `since` to `until` (both inclusive) into quarters.<br>
Returns a list of `[start, end]` pairs, one per quarter, as datetime objects.

In [21]:
def quarterize(since,until):
    quarter_duration = []
    for year in range(since,until+1):
        for key in quarters_keys:
            start, end = quarters[key]
            quarter_duration.append([datetime.datetime(year, start['month'], start['day']),
                                     datetime.datetime(year, end['month'], end['day'])])
    return quarter_duration

In [22]:
quarter_durations = quarterize(2018,2019)
for i in quarter_durations:
    print(i)

[datetime.datetime(2018, 1, 1, 0, 0), datetime.datetime(2018, 3, 31, 0, 0)]
[datetime.datetime(2018, 4, 1, 0, 0), datetime.datetime(2018, 6, 30, 0, 0)]
[datetime.datetime(2018, 7, 1, 0, 0), datetime.datetime(2018, 9, 30, 0, 0)]
[datetime.datetime(2018, 10, 1, 0, 0), datetime.datetime(2018, 12, 31, 0, 0)]
[datetime.datetime(2019, 1, 1, 0, 0), datetime.datetime(2019, 3, 31, 0, 0)]
[datetime.datetime(2019, 4, 1, 0, 0), datetime.datetime(2019, 6, 30, 0, 0)]
[datetime.datetime(2019, 7, 1, 0, 0), datetime.datetime(2019, 9, 30, 0, 0)]
[datetime.datetime(2019, 10, 1, 0, 0), datetime.datetime(2019, 12, 31, 0, 0)]
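
Each quarter in the list above ends at 00:00 on its last day, so a timestamp later on that day falls outside a `since <= t <= until` check. One way around this, sketched here with a hypothetical helper `quarter_bounds` that is not part of the notebook, is to use a half-open interval whose end is the first instant of the next quarter.

```python
import datetime

def quarter_bounds(year, quarter):
    """Return (start, end) for a quarter numbered 1..4; end is exclusive."""
    start_month = 3 * (quarter - 1) + 1
    start = datetime.datetime(year, start_month, 1)
    if quarter == 4:
        # the fourth quarter ends where the next year begins
        end = datetime.datetime(year + 1, 1, 1)
    else:
        end = datetime.datetime(year, start_month + 3, 1)
    return start, end

start, end = quarter_bounds(2018, 1)
late_item = datetime.datetime(2018, 3, 31, 18, 0)  # evening of the quarter's last day
print(start <= late_item < end)
```

With the half-open comparison `start <= t < end`, the item from the evening of 31 March is still counted in the first quarter.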


## Example of the implementation

In [23]:
code = Code_Changes('../data_source.json')

In [9]:
print("Total Number Of Commits:",len(code.changes['commit']))
print("Total Number Of Pull Requests:",len(code.changes['pull_request']))
print("Total Number Of Issues:",len(code.changes['issue']))

Total Number Of Commits: 4224
Total Number Of Pull Requests: 923
Total Number Of Issues: 1484


## First Activity in The Repository

Find the first activity in the repository.<br>
Find the earliest created_date among the commits, the issues and the pull requests separately.<br>
Then take the earliest of the three.

In [10]:
code.changes['commit'].sort(key = lambda x:x['created_date'].replace(tzinfo=None))  ##sorting the commits with date 
first_commit = code.changes['commit'][0]
first_commit_year = first_commit['created_date'].year

code.changes['pull_request'].sort(key = lambda x:x['created_date'].replace(tzinfo=None))  ##sorting the pull requests by date 
first_pull_request = code.changes['pull_request'][0]
first_pull_request_year = first_pull_request['created_date'].year

code.changes['issue'].sort(key = lambda x:x['created_date'].replace(tzinfo=None))  ##sorting the issues by date 
first_issue = code.changes['issue'][0]
first_issue_year = first_issue['created_date'].year

first_activity_year = min(first_commit_year,first_issue_year,first_pull_request_year)
current_year = 2019
print("Year of first activity: ",first_activity_year)

Year of first activity:  2010
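
The earliest date per category can also be found without sorting the lists. The sketch below uses `min()` on a small made-up `changes` dictionary; the dates are invented for illustration and are not taken from the OmegaUp data.

```python
import datetime

# Made-up summaries; only 'created_date' matters for this example.
changes = {
    "commit": [
        {"created_date": datetime.datetime(2011, 5, 2)},
        {"created_date": datetime.datetime(2010, 11, 20)},
    ],
    "issue": [{"created_date": datetime.datetime(2013, 2, 7)}],
    "pull_request": [{"created_date": datetime.datetime(2012, 9, 30)}],
}

# Earliest year per category, skipping categories with no items.
first_years = {
    kind: min(item["created_date"] for item in items).year
    for kind, items in changes.items()
    if items
}
first_activity_year = min(first_years.values())
print(first_years, first_activity_year)
```

Each category reads its own list here, so the three years are computed independently before the overall minimum is taken.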


## Activity QuarterWise

Calculate the activity in the repository quarter by quarter.<br>
`activity` is a dictionary with the keys issue, pull_request and commit.<br>
Each key maps to a list in which each index holds the activity count for one quarter (indexing starts at 0).<br>
For example, `activity['issue'][1]` is the number of issues created in Quarter 2.<br>
`new_contributors` has the same layout and counts the authors seen for the first time within the analyzed period.<br>
The result is the number of commits, pull requests and issues per quarter.

In [11]:
quarter_durations = quarterize(2018,2019) ## get the quarter durations between the years passed as arguments

In [31]:
activity = {'issue':[],'pull_request':[],'commit':[] }
new_contributors = {'issue':[],'pull_request':[],'commit':[] }
existing_authors = {'issue':set(),'pull_request':set(),'commit':set() }
for quarter in quarter_durations:
    since = quarter[0]
    until = quarter[1]
    for change_type,items in code.changes.items():
        activity_per_quarter = 0                    ## reset the activity count for each quarter and each type
        new_author_per_quarter = 0                  ## reset the new-author count for each quarter and each type
        for item in items:
            ## until is midnight at the start of the quarter's last day, so later items on that day are not counted
            if since<=item['created_date'].replace(tzinfo=None)<=until:
                activity_per_quarter += 1
                if item['author'] not in existing_authors[change_type]:
                    new_author_per_quarter += 1
                    existing_authors[change_type].add(item['author'])
        activity[change_type].append(activity_per_quarter)
        new_contributors[change_type].append(new_author_per_quarter)
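
The same bookkeeping can be sketched in a single pass over the items by deriving the quarter from each date. The authors and dates below are made up, and `quarter_key` is a hypothetical helper; this is an alternative outline under those assumptions, not the notebook's method.

```python
import datetime
from collections import Counter

# Made-up items of a single type.
items = [
    {"author": "alice", "created_date": datetime.datetime(2018, 1, 15)},
    {"author": "bob", "created_date": datetime.datetime(2018, 2, 3)},
    {"author": "alice", "created_date": datetime.datetime(2018, 5, 9)},
    {"author": "carol", "created_date": datetime.datetime(2018, 6, 30, 22, 0)},
]

def quarter_key(moment):
    """Map a datetime to a (year, quarter) pair, with quarters numbered 1..4."""
    return moment.year, (moment.month - 1) // 3 + 1

activity = Counter()      # items created per quarter
new_authors = Counter()   # authors first seen per quarter
seen = set()
# Process in date order so an author is credited to the earliest quarter.
for item in sorted(items, key=lambda entry: entry["created_date"]):
    key = quarter_key(item["created_date"])
    activity[key] += 1
    if item["author"] not in seen:
        seen.add(item["author"])
        new_authors[key] += 1

print(dict(activity), dict(new_authors))
```

Because the quarter is derived from the month, the item created late on 30 June still lands in the second quarter.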

## Quarter Number and Their Duration

Displaying the quarter number and its starting date and ending date

In [51]:
for number,quarter in enumerate(quarter_durations):
    print("Quarter "+str(number + 1) + ": ",quarter[0].strftime('%d %B %Y'),"to",quarter[1].strftime('%d %B %Y'))

Quarter 1:  01 January 2018 to 31 March 2018
Quarter 2:  01 April 2018 to 30 June 2018
Quarter 3:  01 July 2018 to 30 September 2018
Quarter 4:  01 October 2018 to 31 December 2018
Quarter 5:  01 January 2019 to 31 March 2019
Quarter 6:  01 April 2019 to 30 June 2019
Quarter 7:  01 July 2019 to 30 September 2019
Quarter 8:  01 October 2019 to 31 December 2019


## Activity List

Shows the number of commits, pull requests and issues per quarter in the given years.

In [55]:
first_quarter_year = quarter_durations[0][0].year

In [56]:
for key,value in activity.items():
    print(key)
    print("---------")
    for i in range(len(value)):
        print("Year "+str(first_quarter_year + i//4) +" Quarter "+str(i%4 + 1)+": "+str(value[i]))
    print("\n")

commit
---------
Year 2018 Quarter 1: 130
Year 2018 Quarter 2: 115
Year 2018 Quarter 3: 61
Year 2018 Quarter 4: 58
Year 2019 Quarter 1: 70
Year 2019 Quarter 2: 0
Year 2019 Quarter 3: 0
Year 2019 Quarter 4: 0


issue
---------
Year 2018 Quarter 1: 111
Year 2018 Quarter 2: 128
Year 2018 Quarter 3: 36
Year 2018 Quarter 4: 29
Year 2019 Quarter 1: 45
Year 2019 Quarter 2: 0
Year 2019 Quarter 3: 0
Year 2019 Quarter 4: 0


pull_request
---------
Year 2018 Quarter 1: 114
Year 2018 Quarter 2: 120
Year 2018 Quarter 3: 51
Year 2018 Quarter 4: 63
Year 2019 Quarter 1: 63
Year 2019 Quarter 2: 0
Year 2019 Quarter 3: 0
Year 2019 Quarter 4: 0
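
The year and quarter labels in this output follow from the list index alone. A minimal sketch with `divmod`, using a hypothetical helper `quarter_label` and assuming four quarters per year starting in Quarter 1:

```python
def quarter_label(index, first_year):
    """Turn a 0-based quarter index into a 'Year Y Quarter Q' label."""
    year_offset, quarter_offset = divmod(index, 4)
    return "Year {} Quarter {}".format(first_year + year_offset, quarter_offset + 1)

labels = [quarter_label(i, 2018) for i in range(8)]
print(labels[0], "|", labels[4], "|", labels[7])
```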




## New Authors 

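Shows the number of new contributors per quarter in the given years, counted separately for commits, pull requests and issues. A contributor is new in the quarter of their first contribution within the analyzed period, so anyone who contributed before the first analyzed quarter is still counted as new.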

In [59]:
for key,value in new_contributors.items():
    print(key)
    print("---------")
    for i in range(len(value)):
        print("Year "+str(first_quarter_year + i//4) +" Quarter "+str(i%4 + 1)+": "+str(value[i]))
    print("\n")

commit
---------
Year 2018 Quarter 1: 20
Year 2018 Quarter 2: 10
Year 2018 Quarter 3: 1
Year 2018 Quarter 4: 2
Year 2019 Quarter 1: 3
Year 2019 Quarter 2: 0
Year 2019 Quarter 3: 0
Year 2019 Quarter 4: 0


issue
---------
Year 2018 Quarter 1: 28
Year 2018 Quarter 2: 7
Year 2018 Quarter 3: 7
Year 2018 Quarter 4: 3
Year 2019 Quarter 1: 10
Year 2019 Quarter 2: 0
Year 2019 Quarter 3: 0
Year 2019 Quarter 4: 0


pull_request
---------
Year 2018 Quarter 1: 21
Year 2018 Quarter 2: 9
Year 2018 Quarter 3: 0
Year 2018 Quarter 4: 3
Year 2019 Quarter 1: 3
Year 2019 Quarter 2: 0
Year 2019 Quarter 3: 0
Year 2019 Quarter 4: 0




## Writing CSV
Use the csv module to write the quarterly activity to a CSV file. Each row holds the quarter label, its start and end dates, the number of commits, pull requests and issues, and the number of new pull request submitters, new issue submitters and new committers.

In [52]:
import csv

In [53]:
fields = ['Quarter','Since','Until','Commits','PRs','Issues','New PR Submitters','New Issue Submitters','New Commiters']

with open('microtask1-quarterwise.csv', 'w', newline='') as file:
    writer = csv.writer(file,quoting=csv.QUOTE_MINIMAL) 
    writer.writerow(fields)
    for number,quarter in enumerate(quarter_durations):
        writer.writerow(["Quarter "+str(number + 1),quarter[0].strftime('%d %B %Y'),quarter[1].strftime('%d %B %Y'),
                         activity['commit'][number],activity['pull_request'][number],activity['issue'][number],
                         new_contributors['pull_request'][number],new_contributors['issue'][number],new_contributors['commit'][number]])
      

In [54]:
with open('microtask1-quarterwise.csv','r', newline='') as csv_file:
    reader = csv.reader(csv_file)
    for row in reader:
        print('%-15s '*(len(row))%tuple(row))


Quarter         Since           Until           Commits         PRs             Issues          New PR Submitters New Issue Submitters New Commiters   
Quarter 1       01 January 2018 31 March 2018   130             114             111             21              28              20              
Quarter 2       01 April 2018   30 June 2018    115             120             128             9               7               10              
Quarter 3       01 July 2018    30 September 2018 61              51              36              0               7               1               
Quarter 4       01 October 2018 31 December 2018 58              63              29              3               3               2               
Quarter 5       01 January 2019 31 March 2019   70              63              45              3               10              3               
Quarter 6       01 April 2019   30 June 2019    0               0               0               0               0       
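
Writing the rows by column name instead of by position is another option. The sketch below uses `csv.DictWriter` with an in-memory buffer in place of a file and a reduced, hypothetical set of columns; the two rows reuse the counts of the first two quarters shown above.

```python
import csv
import io

fieldnames = ["Quarter", "Commits", "PRs", "Issues"]
rows = [
    {"Quarter": "Quarter 1", "Commits": 130, "PRs": 114, "Issues": 111},
    {"Quarter": "Quarter 2", "Commits": 115, "PRs": 120, "Issues": 128},
]

# An in-memory text buffer stands in for the CSV file.
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(rows)

# Read the data back; csv returns every value as a string.
buffer.seek(0)
parsed = list(csv.DictReader(buffer))
print(parsed[0]["Commits"], parsed[1]["Issues"])
```

Naming the columns keeps each value tied to its header, which helps when the header order and the row order are maintained in different places.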