# Microtask 4

## Aim

Produce a listing of repositories, as a table and as a CSV file, with the number of commits authored, issues opened, and pull/merge requests opened during the last three months, ordered by the total number (commits plus issues plus pull requests). Use plain Python3 (e.g., no Pandas) for this.

### Retrieving Data from OmegaUp

From the command line, run Perceval on each GitHub repository to analyze. It produces a file of JSON documents, one per line, for all the repository's issues (the list obtained also contains the pull requests) and commits (`data_source.json` in the examples below).


Syntax for using Perceval with GitHub:
`perceval github owner repository [--sleep-for-rate] [-t XXXXX]`


Date of retrieval: 1 March 2019

##### Example:

`$ perceval github --json-line --category pull_request omegaup omegaup --sleep-for-rate -t a247a6b7d506736da6d653cddc060a96bfbd9cb3 > data_source.json`

`$ perceval github --json-line --category issue omegaup omegaup --sleep-for-rate -t a247a6b7d506736da6d653cddc060a96bfbd9cb3 >> data_source.json`

These commands retrieve the information about pull requests and issues from GitHub.

###### Current Problem With Perceval

When the above command is run with the `issue` category, Perceval also fetches the pull requests, because the GitHub API returns pull requests under issues. For example:

https://api.github.com/repos/omegaup/omegaup/issues/2378

This is a pull request, but the URL shows that it is returned as an issue.

So we have to remove the duplicates (done in the `__init__()` function).
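
The duplicate check can be sketched in isolation. This is a minimal sketch, assuming each Perceval item is a dict with `category` and `data` keys and that an issue-category item which is really a pull request carries a `pull_request` key inside `data` (the same test the class below relies on). The helper name `is_real_issue` and the trimmed sample items are hypothetical.

```python
def is_real_issue(item):
    """Return True for 'issue' items that are not pull requests in disguise."""
    return item['category'] == 'issue' and 'pull_request' not in item['data']


# Hypothetical, heavily trimmed items for illustration only
items = [
    {'category': 'issue', 'data': {'number': 1}},
    {'category': 'issue', 'data': {'number': 2, 'pull_request': {}}},
    {'category': 'pull_request', 'data': {'number': 2}},
]

issues = [item for item in items if is_real_issue(item)]
print([item['data']['number'] for item in issues])
```

Only the first item survives the filter: the second is a pull request returned under issues, and the third belongs to the `pull_request` category.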
    
    

`$ perceval git --json-line https://github.com/omegaup/omegaup >> data_source.json`

This command retrieves the information about commits from the Git repository hosted on GitHub.

----------------------------------------------------------------------------------------
- `--sleep-for-rate` : keeps Perceval from exiting when the rate limit is exceeded
- `-t` : token for the GitHub API
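
Since all three commands append to the same file, a quick sanity check is to count the items per category. The sketch below assumes only that every line is a JSON document with a `category` key. The function name `count_categories` and the in-memory sample (a stand-in for the real file) are hypothetical.

```python
import io
import json
from collections import Counter


def count_categories(lines):
    """Count Perceval items per category from an iterable of JSON lines."""
    counts = Counter()
    for raw in lines:
        raw = raw.strip()
        if not raw:  # skip blank lines
            continue
        counts[json.loads(raw)['category']] += 1
    return counts


# Hypothetical stand-in for open('data_source.json'); real items carry many more fields
sample = io.StringIO(
    '{"category": "pull_request", "origin": "https://github.com/omegaup/omegaup"}\n'
    '{"category": "issue", "origin": "https://github.com/omegaup/omegaup"}\n'
    '{"category": "commit", "origin": "https://github.com/omegaup/omegaup"}\n'
    '{"category": "commit", "origin": "https://github.com/omegaup/omegaup"}\n'
)
category_counts = count_categories(sample)
print(dict(category_counts))
```

With a real file, the same function can be called on the open file object, since iterating over a file yields one line at a time.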

In [146]:
import json
import datetime
import warnings
from copy import deepcopy

import dateutil.relativedelta

## register the filter before importing pandas so that it can suppress the import-time warning
warnings.filterwarnings("ignore", message="numpy.dtype size changed")
import pandas as pd

## Summarize Function

#### @arguments

<b>line</b>: item to be summarized<br>
<b>type</b>: type of item (`commit`, `issue` or `pull_request`)

The function returns a dictionary of this shape:

summary{
    repo,<br>
    hash (in case of a commit) or uuid (in case of a PR or issue),<br>
    author,<br>
    author_date,<br>
    ....<br>
}

In [147]:
def summarize(line,type):
    repo = line['origin']
    cdata = line['data']
    if(type=='commit'):    
        summary = {
                'repo': repo,
                'hash': cdata['commit'],
                'author': cdata['Author'],
                'author_date': datetime.datetime.strptime(cdata['AuthorDate'],
                                                          "%a %b %d %H:%M:%S %Y %z"),
                'commit': cdata['Commit'],
                'created_date': datetime.datetime.strptime(cdata['CommitDate'],
                                                          "%a %b %d %H:%M:%S %Y %z"),
                'files_no': len(cdata['files']),
        }
        ## count the files that record an action
        summary['files_action'] = sum(1 for file in cdata['files'] if 'action' in file)
        ## the 'Merge' key is present only for merge commits
        summary['merge'] = 'Merge' in cdata
    elif(type=='issue'):
        summary = {
                'repo': repo,
                'uuid': line['uuid'],
                'author': cdata['user']['login'],
                'created_date': datetime.datetime.strptime(cdata['created_at'],
                                            "%Y-%m-%dT%H:%M:%SZ"),
                'closed_date':datetime.datetime.strptime(cdata['closed_at'],
                                            "%Y-%m-%dT%H:%M:%SZ") if cdata['closed_at'] else None, 
                'comments': cdata['comments'],
                'labels': cdata['labels'],
                'url': cdata['html_url'],
                'state':cdata['state']
        }
    elif(type=='pull_request'):
        summary = {
                'repo': repo,
                'uuid': line['uuid'],
                'author': cdata['user']['login'],
                'created_date': datetime.datetime.strptime(cdata['created_at'],"%Y-%m-%dT%H:%M:%SZ"),
                'closed_date': datetime.datetime.strptime(cdata['closed_at'],"%Y-%m-%dT%H:%M:%SZ")
                                            if cdata['closed_at'] else None,
                'merged_date': datetime.datetime.strptime(cdata['merged_at'],"%Y-%m-%dT%H:%M:%SZ")
                                        if cdata['merged_at'] else None,
                'comments': cdata['comments'],
                'commits': cdata['commits'],
                'additions': cdata['additions'],
                'deletions': cdata['deletions'],
                'changed_files':cdata['changed_files'],
                'url': cdata['html_url'],
                'state':cdata['state']
        }
    
    return summary
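
The two date formats used in `summarize` can be tried in isolation. This is a small sketch, assuming the sample strings below (made up for illustration) follow the shapes found in the Git and GitHub items, and that the locale uses English day and month names. The constant names are hypothetical.

```python
import datetime

GIT_DATE_FORMAT = "%a %b %d %H:%M:%S %Y %z"   # AuthorDate / CommitDate of commits
GITHUB_DATE_FORMAT = "%Y-%m-%dT%H:%M:%SZ"     # created_at / closed_at / merged_at

# Hypothetical sample values
author_date = datetime.datetime.strptime("Fri Mar 1 10:15:30 2019 +0100", GIT_DATE_FORMAT)
created_at = datetime.datetime.strptime("2019-03-01T09:15:30Z", GITHUB_DATE_FORMAT)

print(author_date.date(), author_date.utcoffset())  # timezone-aware: %z keeps the offset
print(created_at.date(), created_at.tzinfo)         # naive: the trailing 'Z' is matched literally
```

Commit dates come out timezone-aware while issue and pull request dates come out naive. Comparing an aware datetime with a naive one raises a `TypeError`, so reducing both to plain dates with `.date()`, as the later cells do, sidesteps that.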

## Class Code_Changes

Takes the path to the JSON file as its input parameter. On initialization, a dictionary (`self.changes`) is created with three keys, each of which maps to a list:

- Commits : list with the summary of each commit
- Issues : list with the summary of each issue
- Pull Requests : list with the summary of each pull request

#### self.code_dataframe[item] contains the pandas DataFrame holding the information about the individual items of that type

In [148]:
class Code_Changes:
    """Class for Code_Changes for Git repositories.
    
    Objects are instantiated by specifying a file with the
    commits obtained by Perceval from a set of repositories.
    
    Contains individual list for Issues, Pull Requests and Commits
        
    :param path: Path to file with one Perceval JSON document per line
    """
    
    def __init__(self, path):
        
        self.changes = {'issue':[],'commit':[],'pull_request':[]}
        with open(path) as data_file:
            for data in data_file:
                line = json.loads(data)
                if line['category'] == 'commit':
                    self.changes['commit'].append(summarize(line,'commit'))
                elif line['category'] == 'pull_request':
                    self.changes['pull_request'].append(summarize(line,'pull_request'))
                elif line['category'] == 'issue' and 'pull_request' not in line['data']:
                    ## skip the pull requests that the GitHub API also returns as issues
                    self.changes['issue'].append(summarize(line,'issue'))
        self.code_dataframe = {
            'commit':pd.DataFrame.from_dict(self.changes['commit']),
            'pull_request':pd.DataFrame.from_dict(self.changes['pull_request']),
            'issue':pd.DataFrame.from_dict(self.changes['issue'])
        }
    

## Example of the implementation

In [149]:
code = Code_Changes('../data_source2.json')

## Creating Set Of Repositories

Initialize a set, iterate over all items, and add each item's repo to the set.

In [150]:
repos = set()
for change_type,items in code.changes.items():
    for item in items:
        repos.add(item['repo'])

In [151]:
print("List of Repos :")
print("---------------")
for i in repos:
    print(i)

List of Repos :
---------------
https://github.com/streetmix/streetmix
https://github.com/Submitty/Submitty
https://github.com/omegaup/omegaup
https://github.com/fossasia/susi_server


## Analysing Data for Last Three Months

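The window covers the last three full months and excludes the current month: Dec 2018, Jan 2019 and Feb 2019 for the retrieval date of 1 March 2019. The output shown below comes from a run in April 2019, so its window starts on 2019-01-01.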

In [152]:
today_date = datetime.datetime.now().date()
last_third_month = today_date - dateutil.relativedelta.relativedelta(months=3)
last_third_month_first_date = last_third_month.replace(day=1)
current_month_first_date = today_date.replace(day=1)
print("Last third month first Date: ",last_third_month_first_date)
print("Current month first Date: ",current_month_first_date)

Last third month first Date:  2019-01-01
Current month first Date:  2019-04-01
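
The same two dates can also be derived with the standard library alone, without `dateutil`. This is a minimal sketch. The function name `three_month_window` is hypothetical, and the day of the month in the first call is an arbitrary choice, since only the month matters.

```python
import datetime


def three_month_window(today):
    """Return (first day of the month three months back, first day of the current month)."""
    # Count months from year 0 so that crossing a year boundary is plain arithmetic
    months = today.year * 12 + (today.month - 1) - 3
    since = datetime.date(months // 12, months % 12 + 1, 1)
    until = today.replace(day=1)
    return since, until


print(three_month_window(datetime.date(2019, 4, 15)))  # a run in April 2019
print(three_month_window(datetime.date(2019, 3, 1)))   # the retrieval date
```

The first call reproduces the dates printed above. The second gives the window that starts in December 2018.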


The analysis covers the items dated between the two dates above.

For each dataframe, select the items that fall within that period.<br>
For `issue` and `pull_request`, also check the state, i.e. keep only the items whose state is `open`.

`analysis` is a dataframe indexed by repository (`Repo`), with the following columns:

`columns = ['No. of Commits', 'No. of PRs', 'No. of Issues']`

Initialize the data for the pandas dataframe:<br>
for each repo, every count is initialized to 0.

In [153]:
data = []
for i in repos:
    data.append([0,0,0])


In [154]:
analysis = pd.DataFrame(data, columns = ['No. of Commits', 'No. of PRs', 'No. of Issues'])
analysis.index = list(repos)  ## a set has no defined order, so fix the order once as a list

In [155]:
analysis 

| | No. of Commits | No. of PRs | No. of Issues |
|---|---|---|---|
| https://github.com/streetmix/streetmix | 0 | 0 | 0 |
| https://github.com/Submitty/Submitty | 0 | 0 | 0 |
| https://github.com/omegaup/omegaup | 0 | 0 | 0 |
| https://github.com/fossasia/susi_server | 0 | 0 | 0 |


In [257]:
frame = deepcopy(code.code_dataframe)  ## deep copy so that the original dataframes are left unmodified
since = last_third_month_first_date
until = current_month_first_date

## per change type: (date column, column to update in analysis, keep only open items)
settings = {
    'commit': ('author_date', 'No. of Commits', False),
    'issue': ('created_date', 'No. of Issues', True),
    'pull_request': ('created_date', 'No. of PRs', True),
}

for change_type,temp_frame in frame.items():
    date_column, count_column, only_open = settings[change_type]
    temp_frame[date_column] = temp_frame[date_column].apply(lambda x:x.date())
    temp_frame = temp_frame[(since<=temp_frame[date_column]) & (temp_frame[date_column]<=until)]
    if only_open:
        temp_frame = temp_frame[temp_frame['state']=='open']
    temp_frame = temp_frame.groupby('repo').count()
    temp_frame = pd.DataFrame(temp_frame['author'])     ## after grouping every column holds the count, taking author
                                                        ## to keep consistency in all
    temp_frame.rename(columns={'author':count_column},  ## changing column name so that update can be done
             inplace=True)
    analysis.update(temp_frame)
       

In this case, using pandas is a little tricky, but it is fast in comparison to list operations.
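
For comparison, the same per-repository counts can be sketched with plain Python and `collections.Counter`, in line with the stated aim. The function name `count_in_window`, its parameters, and the sample summaries are assumptions for illustration. The summaries carry only the keys of `summarize` that the filter needs, and both ends of the window are inclusive, as in the pandas cell.

```python
import datetime
from collections import Counter


def count_in_window(summaries, date_key, since, until, only_open=False):
    """Count summaries per repo whose date lies in [since, until]."""
    counts = Counter()
    for summary in summaries:
        day = summary[date_key].date()
        if not (since <= day <= until):
            continue
        if only_open and summary['state'] != 'open':
            continue
        counts[summary['repo']] += 1
    return counts


# Hypothetical summaries
issues = [
    {'repo': 'repo-a', 'created_date': datetime.datetime(2019, 2, 3, 12, 0), 'state': 'open'},
    {'repo': 'repo-a', 'created_date': datetime.datetime(2019, 2, 9, 8, 30), 'state': 'closed'},
    {'repo': 'repo-b', 'created_date': datetime.datetime(2018, 11, 20, 17, 45), 'state': 'open'},
]
since, until = datetime.date(2019, 1, 1), datetime.date(2019, 4, 1)

open_issues = count_in_window(issues, 'created_date', since, until, only_open=True)
print(dict(open_issues))
```

A `Counter` returns 0 for repositories it has never seen, so no explicit zero initialization is needed in this variant.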

In [None]:
analysis['Sum'] = analysis['No. of Commits'] + analysis['No. of PRs'] + analysis['No. of Issues']
## Summing Issues, PRs and Commits

In [258]:
analysis

| | No. of Commits | No. of PRs | No. of Issues | Sum |
|---|---|---|---|---|
| https://github.com/streetmix/streetmix | 329.0 | 4.0 | 11.0 | 344.0 |
| https://github.com/Submitty/Submitty | 398.0 | 31.0 | 63.0 | 492.0 |
| https://github.com/omegaup/omegaup | 143.0 | 2.0 | 45.0 | 190.0 |
| https://github.com/fossasia/susi_server | 37.0 | 0.0 | 5.0 | 42.0 |


#### Sorting the Dataframe by sum 

Use `sort_values`, passing the column to sort by.

In [270]:
analysis = analysis.sort_values(by=['Sum'])
analysis = analysis.astype(int)  ## Converting Float to int
analysis

| Repo | No. of Commits | No. of PRs | No. of Issues | Sum |
|---|---|---|---|---|
| https://github.com/fossasia/susi_server | 37 | 0 | 5 | 42 |
| https://github.com/omegaup/omegaup | 143 | 2 | 45 | 190 |
| https://github.com/streetmix/streetmix | 329 | 4 | 11 | 344 |
| https://github.com/Submitty/Submitty | 398 | 31 | 63 | 492 |
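
Without pandas, the same ordering can be sketched with the built-in `sorted` and a key function. The rows below are copied from the table above, and the tuple layout is an assumption made for this example.

```python
# (repo, commits, pull requests, issues), values taken from the table above
rows = [
    ('https://github.com/streetmix/streetmix', 329, 4, 11),
    ('https://github.com/Submitty/Submitty', 398, 31, 63),
    ('https://github.com/omegaup/omegaup', 143, 2, 45),
    ('https://github.com/fossasia/susi_server', 37, 0, 5),
]

# Sort ascending by the total of the three counts, like sort_values(by=['Sum'])
ranked = sorted(rows, key=lambda row: sum(row[1:]))

for repo, commits, prs, issues in ranked:
    print(repo, commits + prs + issues)
```

Passing `reverse=True` to `sorted` would list the most active repository first.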


## Writing CSV

#### Using the pandas function to convert the dataframe directly into a CSV file <br>
fields = ['Repo','No. of Commits','No. of PRs','No. of Issues','Sum']

In [271]:
analysis.index.name = "Repo" ## giving column name to the index column

In [272]:
analysis.to_csv('microtask5-analysis.csv')
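
The standard `csv` module can produce the same file without pandas. This is a minimal sketch using the field names listed above. The helper name `write_csv` is hypothetical, and an in-memory buffer stands in for the output file so the example has no side effects.

```python
import csv
import io

fields = ['Repo', 'No. of Commits', 'No. of PRs', 'No. of Issues', 'Sum']

# Two rows taken from the sorted table, as dictionaries keyed by field name
rows = [
    {'Repo': 'https://github.com/fossasia/susi_server',
     'No. of Commits': 37, 'No. of PRs': 0, 'No. of Issues': 5, 'Sum': 42},
    {'Repo': 'https://github.com/omegaup/omegaup',
     'No. of Commits': 143, 'No. of PRs': 2, 'No. of Issues': 45, 'Sum': 190},
]


def write_csv(rows, out):
    """Write a header line followed by one CSV line per row to a file-like object."""
    writer = csv.DictWriter(out, fieldnames=fields, lineterminator='\n')
    writer.writeheader()
    writer.writerows(rows)


buffer = io.StringIO()  # stand-in for a file opened with open(path, 'w', newline='')
write_csv(rows, buffer)
print(buffer.getvalue())
```

For a real file, opening it with `newline=''` lets the `csv` module control the line endings itself.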