# Microtask 1

### Retrieving Data from OmegaUp

From the command line, run Perceval on the GitHub repository to be analyzed to produce a file of JSON documents, one per line (data_source.json). The list of issues returned by GitHub also contains the pull requests.


Syntax for using Perceval with GitHub:
`perceval github owner repository [--sleep-for-rate] [-t XXXXX]`


Date of Retrieval: 1st March 2019
##### Example:

`$ perceval github --json-line --category pull_request omegaup omegaup --sleep-for-rate -t a247a6b7d506736da6d653cddc060a96bfbd9cb3 > data_source.json`

`$ perceval github --json-line --category issue omegaup omegaup --sleep-for-rate -t a247a6b7d506736da6d653cddc060a96bfbd9cb3 >> data_source.json`

These two commands retrieve the pull requests and issues from GitHub and write them to the same file.

###### Current Problem With Perceval

When the command above is run with the `issue` category, it also fetches the pull requests, because the GitHub API returns pull requests under issues. For example:

https://api.github.com/repos/omegaup/omegaup/issues/2378

This is a pull request, but the URL shows that it is returned as an issue.

So the duplicates have to be removed (done in the `__init__()` function of `Code_Changes`).
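The filtering rule described above can be sketched as a small predicate. This is only a minimal sketch: the function name `is_plain_issue` is hypothetical, and the sample items are made up and reduced to the two keys the rule needs (`category` and `data`).

```python
def is_plain_issue(item):
    """Return True for issue items that are not pull requests in disguise."""
    # GitHub marks a pull request returned through the issues endpoint
    # with a 'pull_request' key inside the item data.
    return (item.get('category') == 'issue'
            and 'pull_request' not in item.get('data', {}))


# Made-up items, trimmed to the fields used by the predicate.
sample_items = [
    {'category': 'issue', 'data': {'number': 1}},
    {'category': 'issue', 'data': {'number': 2, 'pull_request': {}}},
    {'category': 'pull_request', 'data': {'number': 2}},
]

plain_issues = [item for item in sample_items if is_plain_issue(item)]
print(len(plain_issues))
```

Only the first sample item passes the filter: the second is a pull request listed under issues, and the third belongs to the `pull_request` category.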
    
    

`$ perceval git --json-line https://github.com/omegaup/omegaup >> data_source.json`

This command retrieves the commits from the Git repository and appends them to the same file.

----------------------------------------------------------------------------------------
`--sleep-for-rate`: keeps Perceval from exiting when the rate limit is exceeded

`-t`: token for the GitHub API

In [7]:
import json
import datetime
import re
from dateutil import parser

## Summarize Function

#### @arguments 

<b>line</b>: item to be summarized<br>
<b>type</b>: type of item (commit, issue, pull_request)

summary{
    repo,<br>
    hash (in case of a commit) or uuid (in case of a PR or issue),<br>
    author,<br>
    author_date,<br>
    ....<br>
}

In [30]:
def summarize(line,type):
    repo = line['origin']
    cdata = line['data']
    if(type=='commit'):    
        summary = {
                'repo': repo,
                'hash': cdata['commit'],
                'author': cdata['Author'],
                'author_date': datetime.datetime.strptime(cdata['AuthorDate'],
                                                          "%a %b %d %H:%M:%S %Y %z"),
                'commit': cdata['Commit'],
                'created_date': datetime.datetime.strptime(cdata['CommitDate'],
                                                          "%a %b %d %H:%M:%S %Y %z"),
                'files_no': len(cdata['files']),
        }
        actions = 0
        for file in cdata['files']:
            if 'action' in file:
                actions += 1
        summary['files_action'] = actions
        # A commit is a merge when Perceval includes a 'Merge' field
        summary['merge'] = 'Merge' in cdata
    elif(type=='issue'):
        summary = {
                'repo': repo,
                'uuid': line['uuid'],
                'author': cdata['user']['login'],
                'created_date': datetime.datetime.strptime(cdata['created_at'],
                                            "%Y-%m-%dT%H:%M:%SZ"),
                'closed_date':datetime.datetime.strptime(cdata['closed_at'],
                                            "%Y-%m-%dT%H:%M:%SZ") if cdata['closed_at'] else None, 
                'comments': cdata['comments'],
                'labels': cdata['labels'],
                'url': cdata['html_url'],
                'state':cdata['state']
        }
    elif(type=='pull_request'):
        summary = {
                'repo': repo,
                'uuid': line['uuid'],
                'author': cdata['user']['login'],
                'created_date': datetime.datetime.strptime(cdata['created_at'],"%Y-%m-%dT%H:%M:%SZ"),
                'closed_date': datetime.datetime.strptime(cdata['closed_at'],"%Y-%m-%dT%H:%M:%SZ")
                                            if cdata['closed_at'] else None,
                'merged_date': datetime.datetime.strptime(cdata['merged_at'],"%Y-%m-%dT%H:%M:%SZ")
                                        if cdata['merged_at'] else None,
                'comments': cdata['comments'],
                'commits': cdata['commits'],
                'additions': cdata['additions'],
                'deletions': cdata['deletions'],
                'changed_files':cdata['changed_files'],
                'url': cdata['html_url'],
                'state':cdata['state']
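        }
    else:
        # Fail with a clear message instead of an unbound 'summary'
        raise ValueError("Unknown item type: {}".format(type))

    return summary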
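`summarize` parses two different timestamp layouts: the Git-style dates of commits and the ISO-style dates of issues and pull requests. The short sketch below shows what each format string yields. The two sample strings are made up for illustration, and the constant names are hypothetical.

```python
import datetime

# Format used for commit fields such as AuthorDate and CommitDate
GIT_DATE_FORMAT = "%a %b %d %H:%M:%S %Y %z"
# Format used for GitHub fields such as created_at and closed_at
GITHUB_DATE_FORMAT = "%Y-%m-%dT%H:%M:%SZ"

git_date = datetime.datetime.strptime("Tue Aug 14 14:30:13 2018 -0700",
                                      GIT_DATE_FORMAT)
github_date = datetime.datetime.strptime("2019-03-01T12:34:56Z",
                                         GITHUB_DATE_FORMAT)

print(git_date.tzinfo is not None)   # the commit date carries a UTC offset
print(github_date.tzinfo is None)    # the literal 'Z' leaves the date naive
```

Because one kind of date is timezone-aware and the other is naive, the two cannot be compared directly, which is why later cells call `.replace(tzinfo=None)` before sorting or comparing.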

## Class Code_Changes

Takes the path to the JSON file as its input parameter.

In [31]:
class Code_Changes:
    """Class for Code_Changes for Git repositories.
    
    Objects are instantiated by specifying a file with the
    commits obtained by Perceval from a set of repositories.
    
    Contains individual list for Issues, Pull Requests and Commits
        
    :param path: Path to file with one Perceval JSON document per line
    """
    
    def __init__(self, path):
        
        self.changes = {'issue':[],'commit':[],'pull_request':[]}
        with open(path) as data_file:
            for data in data_file:
                line = json.loads(data)
                if(line['category'] ==  'commit'):
                    self.changes['commit'].append(summarize(line,'commit'))
                else:
                    if (line['category'] == 'pull_request'):
                        self.changes['pull_request'].append(summarize(line,'pull_request'))
                    elif ('pull_request' not in line['data']) and (line['category'] == 'issue'):
                        self.changes['issue'].append(summarize(line,'issue'))
    

#### Functions Available
- total_count(): returns the total number of issues to date
- count(): returns the number of issues created in a given period of time
    ###### Parameters
    - Since
    - Until
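The class shown above only defines `__init__`, so the following is a hedged sketch of how the two listed functions might look as standalone helpers working on one of the summary lists. The signatures are assumptions, not the notebook's actual code, and the sample data is made up.

```python
import datetime


def total_count(items):
    """Return the total number of summarized items."""
    return len(items)


def count(items, since=None, until=None):
    """Return the number of items created between since and until (inclusive).

    Either bound may be None, meaning no limit on that side (an assumption).
    """
    total = 0
    for item in items:
        # Drop the timezone so aware and naive dates can be compared
        created = item['created_date'].replace(tzinfo=None)
        if since is not None and created < since:
            continue
        if until is not None and created > until:
            continue
        total += 1
    return total


# Made-up summaries holding only the field the helpers need.
sample_issues = [
    {'created_date': datetime.datetime(2018, 1, 15)},
    {'created_date': datetime.datetime(2018, 5, 2)},
    {'created_date': datetime.datetime(2019, 2, 20)},
]

print(total_count(sample_issues))
print(count(sample_issues,
            since=datetime.datetime(2018, 1, 1),
            until=datetime.datetime(2018, 12, 31)))
```

With the real data, the same helpers could be called on `code.changes['issue']` once a `Code_Changes` object has been created.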
            

## Quarter Definition

In [83]:
quarters = {}  # maps a quarter name to a (since, until) tuple; each bound holds a month and a day
quarters['quarter1'] = ({'month':1,'day':1},{'month':3,'day':31})  #January 01 – March 31
quarters['quarter2'] = ({'month':4,'day':1},{'month':6,'day':30})  #April 01 – June 30
quarters['quarter3'] = ({'month':7,'day':1},{'month':9,'day':30})  #July 01 – September 30
quarters['quarter4'] = ({'month':10,'day':1},{'month':12,'day':31})  #October 01 – December 31
quarters_keys = sorted(quarters.keys(), key=lambda x:x.lower())  ## sort the keys explicitly, since dictionary order
                                                           ## is not guaranteed on older Python versions

## Quarter Function

### @arguments - since (year), until (year)
Divides the given range of years into quarters and returns a list of [start, end] pairs.

In [92]:
def quarterize(since,until):
    quarter_duration = []
    for year in range(since,until+1):
        for key in quarters_keys:
            value = quarters[key]
            to_append = [datetime.datetime(year, value[0]['month'], value[0]['day']),
                         datetime.datetime(year, value[1]['month'], value[1]['day'])]
            quarter_duration.append(to_append)
    return quarter_duration

In [97]:
quarter_durations = quarterize(2018,2019)
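One detail of `quarterize` is worth noting: `datetime.datetime(year, 3, 31)` is midnight at the start of March 31, so with an inclusive `<=` comparison an item created later on the last day of a quarter falls outside it. The sketch below shows one possible alternative using half-open intervals. The names `quarter_bounds` and `in_quarter` are hypothetical and not part of the notebook.

```python
import datetime


def quarter_bounds(year):
    """Return four (start, end) pairs; end is the start of the next quarter."""
    starts = [datetime.datetime(year, month, 1) for month in (1, 4, 7, 10)]
    starts.append(datetime.datetime(year + 1, 1, 1))
    return list(zip(starts[:-1], starts[1:]))


def in_quarter(moment, bounds):
    """Check membership with an inclusive start and an exclusive end."""
    start, end = bounds
    return start <= moment < end


q1_2018 = quarter_bounds(2018)[0]
late_on_last_day = datetime.datetime(2018, 3, 31, 15, 0)
print(in_quarter(late_on_last_day, q1_2018))
```

Switching to this scheme would change the per-quarter numbers slightly, so it is shown only as an option and is not applied to the results in this notebook.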

## Example of the implementation

In [106]:
code = Code_Changes('data_source.json')

In [108]:
print("Total Number Of Commits:",len(code.changes['commit']))
print("Total Number Of Pull Requests:",len(code.changes['pull_request']))
print("Total Number Of Issues:",len(code.changes['issue']))

Total Number Of Commits: 4224
Total Number Of Pull Requests: 923
Total Number Of Issues: 1484


## First Activity in The Repository

In [109]:
code.changes['commit'].sort(key = lambda x:x['created_date'].replace(tzinfo=None))  ## sort the commits by date
first_commit = code.changes['commit'][0]
first_commit_year = first_commit['created_date'].year

code.changes['pull_request'].sort(key = lambda x:x['created_date'].replace(tzinfo=None))  ## sort the pull requests by date
first_pull_request = code.changes['pull_request'][0]
first_pull_request_year = first_pull_request['created_date'].year

code.changes['issue'].sort(key = lambda x:x['created_date'].replace(tzinfo=None))  ## sort the issues by date
first_issue = code.changes['issue'][0]
first_issue_year = first_issue['created_date'].year

first_activity_year = min(first_commit_year,first_issue_year,first_pull_request_year)
current_year = 2019
print(first_activity_year)

2010
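The sorting above drops the timezone with `.replace(tzinfo=None)`, which keeps the local clock time of each commit rather than converting it. Near a year boundary this can shift an item into the wrong year. A minimal sketch of one alternative follows: convert aware dates to UTC first and treat naive dates as already being UTC (an assumption that holds for GitHub timestamps ending in `Z`). The helper name `to_naive_utc` is hypothetical and the dates are made up.

```python
import datetime


def to_naive_utc(moment):
    """Convert an aware datetime to naive UTC; pass naive values through."""
    if moment.tzinfo is None:
        return moment  # assumed to be UTC already
    return moment.astimezone(datetime.timezone.utc).replace(tzinfo=None)


# Made-up dates: a commit late on Dec 31 in UTC-7 and an issue early on Jan 1 UTC.
commit_date = datetime.datetime(
    2010, 12, 31, 23, 0,
    tzinfo=datetime.timezone(datetime.timedelta(hours=-7)))
issue_date = datetime.datetime(2011, 1, 1, 5, 0)

print(to_naive_utc(commit_date))
earliest = min(to_naive_utc(commit_date), to_naive_utc(issue_date))
print(earliest)
```

In this example the issue is the earlier event once both dates are expressed in UTC, whereas simply stripping the timezone would make the commit look earlier.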


## Activity QuarterWise

Number of Commits, PRs, Issues QuarterWise

In [121]:
quarter_durations = quarterize(2017,2018) ## get the quarter duration between the years passed as arguments

In [135]:
activity = {'issue':[],'pull_request':[],'commit':[] }
new_contributors = {'issue':[],'pull_request':[],'commit':[] }
existing_authors = {'issue':set(),'pull_request':set(),'commit':set() }
for quarter in quarter_durations:
    since = quarter[0]
    until = quarter[1]
    for change_type,items in code.changes.items():
        activity_per_quarter = 0                    ## reset the activity counter for every quarter and every type
        new_author_per_quarter = 0                    ## reset the new-author counter for every quarter and every type
        for item in items:
            if since<=item['created_date'].replace(tzinfo=None)<=until:
                activity_per_quarter += 1
                if item['author'] not in existing_authors[change_type]:
                    new_author_per_quarter += 1
                    existing_authors[change_type].add(item['author'])
        activity[change_type].append(activity_per_quarter)
        new_contributors[change_type].append(new_author_per_quarter)

In [136]:
print(activity) ## activity per quarter, from the 1st quarter of since (2017) to the last quarter of until (2018)
print(new_contributors) ## new contributors per quarter for commits, PRs and issues over the same quarters

{'pull_request': [73, 76, 63, 98, 114, 120, 51, 63], 'commit': [102, 93, 65, 107, 130, 115, 61, 58], 'issue': [75, 104, 63, 90, 111, 128, 36, 29]}
{'pull_request': [10, 5, 7, 2, 14, 6, 0, 1], 'commit': [10, 5, 6, 4, 12, 7, 1, 1], 'issue': [13, 12, 6, 5, 9, 2, 6, 3]}
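The counting logic behind these two lists can be tried on a tiny data set. The sketch below uses made-up items and a hypothetical helper name, `activity_by_period`, and handles a single list of items rather than the three categories. As in the loop above, an author counts as "new" the first time they appear within the periods examined, so contributors who were active before the first period are also counted as new.

```python
import datetime


def activity_by_period(items, periods):
    """Return (item counts, new-author counts), one entry per period."""
    counts, newcomers, seen = [], [], set()
    for since, until in periods:
        in_period = [i for i in items if since <= i['created_date'] <= until]
        authors = {i['author'] for i in in_period}
        counts.append(len(in_period))
        newcomers.append(len(authors - seen))  # authors not seen in earlier periods
        seen |= authors
    return counts, newcomers


# Made-up periods (first two quarters of 2018) and items.
periods = [
    (datetime.datetime(2018, 1, 1), datetime.datetime(2018, 3, 31)),
    (datetime.datetime(2018, 4, 1), datetime.datetime(2018, 6, 30)),
]
items = [
    {'author': 'alice', 'created_date': datetime.datetime(2018, 1, 10)},
    {'author': 'bob', 'created_date': datetime.datetime(2018, 2, 3)},
    {'author': 'alice', 'created_date': datetime.datetime(2018, 4, 20)},
    {'author': 'carol', 'created_date': datetime.datetime(2018, 5, 5)},
]

counts, newcomers = activity_by_period(items, periods)
print(counts)
print(newcomers)
```

Here each period contains two items; the first period introduces two authors and the second introduces one, since `alice` was already seen.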
