# Microtask 3 (Implementing CHAOSS metrics with Perceval idea)

## Objective

<p>Produce a notebook with charts showing the distribution of time-to-close for issues that were opened during the last year and are already closed, both for each of the repositories analyzed and for all of them together. Use Pandas for this, and the Python charting library of your choice (as long as it is a FOSS module).</p>

## Retrieving the data

<p> For this task, information from the following GitHub repos will be analyzed:</p>
<ul>
    <li>Perceval (https://github.com/chaoss/grimoirelab-perceval)</li>
    <li>SortingHat (https://github.com/chaoss/grimoirelab-sortinghat)</li>
    <li>Kibiter (https://github.com/chaoss/grimoirelab-kibiter)</li>
</ul>

### Date of retrieval: April 3rd 2019

<p>The following commands were run in a terminal to write the retrieved data to the <code>issues.json</code> file. The first command creates the file and the other two append to it. Replace the <code>XXXX</code> after <code>-t</code> with a valid <a href="https://help.github.com/en/articles/creating-a-personal-access-token-for-the-command-line">GitHub API token</a>.</p>

````
perceval github --json-line --category issue grimoirelab perceval --sleep-for-rate -t XXXX > issues.json 

perceval github --json-line --category issue grimoirelab sortinghat --sleep-for-rate -t XXXX >> issues.json 

perceval github --json-line --category issue grimoirelab kibiter --sleep-for-rate -t XXXX >> issues.json 
````



## Cleaning the data

<p>As the <a href="https://chaoss.github.io/grimoirelab-tutorial/perceval/github.html#retrieving-from-github-with-no-credentials">Perceval documentation</a> indicates, "in GitHub every pull request is an issue, but not every issue is a pull request. Thus, the issues returned may contain pull request information (included in the field pull_request within the issue)."</p>
<p>The next step is therefore to keep only those items that have no <code>pull_request</code> key inside their <code>data</code> field.</p>

In [16]:
import json
import datetime
from dateutil import parser
import pandas as pd

In [35]:
# Build a list of issues that are not pull requests
clean_issues = []
with open('issues.json') as issues_file:
    for line in issues_file:
        issue = json.loads(line)
        # Items without a 'pull_request' field are plain issues, so keep them
        if 'pull_request' not in issue['data']:
            clean_issues.append(issue)
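
The filter can be checked on a small in-memory sample before running it on the full file. The sketch below uses two hand-written lines that only imitate the shape of Perceval's output (a `uuid` plus a nested `data` dictionary); the titles, URL and the helper name `is_plain_issue` are made up for illustration.

```python
import json

# Two hypothetical, heavily trimmed lines shaped like Perceval's JSON output
sample_lines = [
    '{"uuid": "a1", "data": {"title": "A bug report", "state": "open"}}',
    '{"uuid": "b2", "data": {"title": "A patch", "state": "open", '
    '"pull_request": {"url": "https://example.org/pull/2"}}}',
]


def is_plain_issue(item):
    """Return True when the item has no 'pull_request' key in its data."""
    return 'pull_request' not in item['data']


sample_items = [json.loads(line) for line in sample_lines]
plain_issues = [item for item in sample_items if is_plain_issue(item)]

# Only the first item should survive the filter
print([item['uuid'] for item in plain_issues])
```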
            

## Passing the issue's relevant information to a pandas dataframe

<p>The next step is to build a pandas dataframe in which each element of <code>clean_issues</code> is one row. To do so, we define a function called <code>summarizeIssue</code> that extracts from each issue only the fields relevant to the analysis; each of these fields becomes a dataframe column.</p>

In [13]:
# Function based on the _summary function from the Code_changes class in the microtask 0 example

def summarizeIssue(issue):
    '''
    Summarize the relevant information of an issue.

    Parameters:
    issue (dict): one line of the JSON file, describing an issue

    Returns:
    dict: a flat (non-nested) dictionary that can be used as a dataframe row
    '''
    date_format = "%Y-%m-%dT%H:%M:%SZ"
    cdata = issue['data']
    summary = {
        'repo': issue['origin'],
        'uuid': issue['uuid'],
        'author': cdata['user']['login'],
        'created_date': datetime.datetime.strptime(cdata['created_at'], date_format),
        # Open issues have no closing date yet
        'closed_date': (datetime.datetime.strptime(cdata['closed_at'], date_format)
                        if cdata['closed_at'] else None),
        'url': cdata['html_url'],
        'state': cdata['state']
    }
    return summary

In [12]:
print(summarizeIssue(clean_issues[0]))

{'repo': 'https://github.com/grimoirelab/perceval', 'uuid': 'c31e77a3e31bb86301ae7b9eb9f2c9a89ac0feb2', 'author': 'jgbarah', 'created_date': datetime.datetime(2016, 1, 24, 23, 35, 59), 'closed_date': datetime.datetime(2016, 1, 25, 13, 13, 32), 'url': 'https://github.com/chaoss/grimoirelab-perceval/issues/8', 'state': 'closed'}
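
Note that `strptime` with this format returns naive datetimes, even though the trailing `Z` in GitHub's timestamps marks UTC. As an alternative sketch (not what this notebook does), pandas can parse the raw strings column by column and keep the time zone. The first pair of timestamps below is taken from the output above; the second row is made up to model an issue that is still open.

```python
import pandas as pd

raw = pd.DataFrame({
    'created_at': ['2016-01-24T23:35:59Z', '2019-03-01T10:00:00Z'],
    'closed_at': ['2016-01-25T13:13:32Z', None],  # None models an open issue
})

# utc=True yields timezone-aware timestamps; missing values become NaT
created = pd.to_datetime(raw['created_at'], utc=True)
closed = pd.to_datetime(raw['closed_at'], utc=True)

print(closed.isna().tolist())    # which issues have no closing date
print(closed[0] - created[0])    # time to close of the first issue
```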


In [28]:
col_names = ['repo', 'uuid', 'author', 'created_date', 'closed_date', 'url', 'state']

# Build the dataframe in one step from the list of summaries
# (DataFrame.append was removed in pandas 2.0 and was slow when called in a loop)
issues_df = pd.DataFrame([summarizeIssue(issue) for issue in clean_issues],
                         columns=col_names)

In [40]:
issues_df.groupby('repo').count()

| repo | uuid | author | created_date | closed_date | url | state |
|---|---|---|---|---|---|---|
| https://github.com/grimoirelab/kibiter | 10 | 10 | 10 | 8 | 10 | 10 |
| https://github.com/grimoirelab/perceval | 184 | 184 | 184 | 144 | 184 | 184 |
| https://github.com/grimoirelab/sortinghat | 95 | 95 | 95 | 74 | 95 | 95 |
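
With `created_date` and `closed_date` available as columns, the time-to-close named in the objective can be derived by subtraction. The following is a minimal sketch on a made-up dataframe, not the notebook's actual data. It assumes that "the last year" means the 365 days before the retrieval date (April 3rd 2019); the repository names and the column `time_to_close_days` are hypothetical.

```python
import pandas as pd

# Made-up issues: two per repository, one of them still open
toy_df = pd.DataFrame({
    'repo': ['repo-a', 'repo-a', 'repo-b', 'repo-b'],
    'created_date': pd.to_datetime(
        ['2018-06-01', '2017-01-10', '2019-01-15', '2019-03-20']),
    'closed_date': pd.to_datetime(
        ['2018-06-11', '2017-02-10', '2019-01-18', None]),
})

reference_date = pd.Timestamp('2019-04-03')   # date of retrieval
one_year_ago = reference_date - pd.Timedelta(days=365)

# Keep issues that are closed and were opened within the last year
mask = toy_df['closed_date'].notna() & (toy_df['created_date'] >= one_year_ago)
recent_closed = toy_df[mask].copy()

# Time to close, expressed in days
elapsed = recent_closed['closed_date'] - recent_closed['created_date']
recent_closed['time_to_close_days'] = elapsed.dt.total_seconds() / 86400

print(recent_closed.groupby('repo')['time_to_close_days'].median())
```

Rows with a missing `closed_date` are dropped explicitly, so open issues never contribute a time-to-close value.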
