# New Contributors Pull Requests

This is the reference implementation for [New Contributors Pull Requests](), a metric specified by the [Evolution Working Group](https://github.com/chaoss/wg-evolution) of the [CHAOSS project](https://chaoss.community/). This implementation is specific to GitHub repositories.

Have a look at [README.md](../README.md) to find out how to run this notebook (and others in this directory) as well as to get a better understanding of the purpose of the implementations.

The implementation is described in two parts (see below):

* Class for computing New Contributors Pull Requests
* An explanatory analysis of the class' functionality

Some more auxiliary information in this notebook:

* Examples of the use of the implementation
* Visualizing the data extracted

As discussed in the [README.md](../README.md) file, the scripts required to analyze the data fetched by Perceval are located in a package outside the current directory (`implementations.scripts` in the imports below). Because of how Python's import system works, to import modules from a package that is not in the current directory we have to either add the package's parent directory to `PYTHONPATH` or append `../..` to `sys.path`, so that the `implementations` package can be imported.

In [5]:
from datetime import datetime

import sys

sys.path.append("../..")

from implementations.scripts.pullrequest_github import PullRequestGitHub
from implementations.scripts.utils import read_json_file

In [6]:
class NewContributorsPullRequestsGitHub(PullRequestGitHub):
    """
    Class for New Contributors Pull Requests in GitHub
    """

    def __init__(self, items, date_range=(None, None)):
        """
        Initializes self.items, the list with items (dictionary)
        as elements.

        :param items: A list of dictionaries.
            Each item is a Perceval dictionary, obtained from a JSON
            file or from Perceval directly.

        :param date_range: A tuple which represents the start and end date
            (since, until) between which new contributors will be considered.
            Either, or both, can be None. If, for example, since is None,
            every contributor whose first pull request was created on or
            before until is counted as a new contributor. If both are None,
            every unique contributor in the data is counted.
        """
        super().__init__(items, date_range)

        self._filter_items()

        if self.since:
            self.items = [item for item in self.items if item['created_date'] >= self.since]

        if self.until:
            self.items = [item for item in self.items if item['created_date'] <= self.until]

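    def _filter_items(self):
        """
        Keep only the earliest pull request of each author.
        """
        new_contributors = {}

        for item in self.items:
            author = item['author']
            created_date = item['created_date']

            # Store the item if the author is new or this pull request is older
            if author not in new_contributors or created_date < new_contributors[author]['created_date']:
                new_contributors[author] = item

        # Materialize the dict view so that self.items remains a list
        self.items = list(new_contributors.values())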

    def compute(self):
        """
        Count the number of new contributors who have created a pull request
        between the two dates of date_range.

        :returns count_of_new_contributors: the number of new contributors who
            created their first pull request between the dates of date_range

            Since self.items is reduced in __init__ to the earliest pull
            request of each author and then filtered by date_range, the number
            of remaining items is the number of new contributors between the
            given dates.
        """
        return len(self.items)

    def __str__(self):
        return "New Contributors of Pull Requests"

## Performing the Analysis

Using the above class, we can perform several kinds of analysis on the JSON data file, fetched by Perceval.

At its most basic, the `NewContributorsPullRequestsGitHub` class can be used to get the number of contributors over the entire interval for which pull requests are considered.

The `date_range` parameter specifies the period in which we look for new contributors: a contributor counts as new in that period only if their first pull request was created within it.
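
The counting logic can be sketched in isolation on a few made-up items. This is a minimal illustration, not the class itself: `toy_items` and `count_new_contributors` are hypothetical names, and the items carry only the `author` and `created_date` fields that the class reads, under the assumption that `created_date` is a `datetime`.

```python
from datetime import datetime

# Hypothetical, simplified items: only the two fields the class reads.
toy_items = [
    {"author": "alice", "created_date": datetime(2017, 11, 3)},
    {"author": "alice", "created_date": datetime(2018, 2, 10)},
    {"author": "bob", "created_date": datetime(2018, 3, 5)},
    {"author": "carol", "created_date": datetime(2018, 9, 1)},
]


def count_new_contributors(items, since=None, until=None):
    """Count authors whose earliest pull request lies in [since, until]."""
    # Step 1: find the date of the earliest pull request of each author.
    first_pr = {}
    for item in items:
        author = item["author"]
        created = item["created_date"]
        if author not in first_pr or created < first_pr[author]:
            first_pr[author] = created

    # Step 2: keep only the first pull requests inside the (inclusive) range.
    return sum(
        1
        for created in first_pr.values()
        if (since is None or created >= since)
        and (until is None or created <= until)
    )


print(count_new_contributors(toy_items))
print(count_new_contributors(toy_items, datetime(2018, 1, 1), datetime(2018, 7, 1)))
```

With no range, all three authors are counted. With the range 2018-01-01 to 2018-07-01, only `bob` is counted: `alice` opened a pull request in that period, but her first one dates from 2017, so she is not new.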

### Counting the total number of contributors

To start off, we read the JSON file into a list of items, which we will pass to `NewContributorsPullRequestsGitHub`.

In [7]:
items = read_json_file("../pull_requests.json")

Let's use the `compute` method to count the number of new contributors. First, we will do it without passing any `since` and `until` dates. When no values are passed for `date_range`, every contributor who created a pull request is a new contributor and hence we get the total number of unique contributors in the data considered.

In [8]:
new_contributors = NewContributorsPullRequestsGitHub(items)
print("New Contributors, total: {}".format(new_contributors.compute()))

New Contributors, total: 39


### Counting contributors in a specific range

Now, let's give some values to the `date_range` tuple, whose elements are `since` and `until`. We pass 2018-01-01 and 2018-07-01, so we will be looking for new contributors who made their first pull request between 2018-01-01 and 2018-07-01.

In [9]:
date_since = datetime.strptime("2018-01-01", "%Y-%m-%d")
date_until = datetime.strptime("2018-07-01", "%Y-%m-%d")
new_contributors_dated = NewContributorsPullRequestsGitHub(items, (date_since, date_until))
print("New Contributors, between 2018-01-01 and 2018-07-01: ", new_contributors_dated.compute())

New Contributors, between 2018-01-01 and 2018-07-01:  13
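
One detail to keep in mind with the dates used above: `datetime.strptime("2018-07-01", "%Y-%m-%d")` yields midnight at the start of that day, and the class compares with `<=`, so a pull request created later on 2018-07-01 would fall outside the range. The sketch below illustrates this with a made-up creation time, assuming `created_date` is a naive `datetime` like the values passed in `date_range`; `end_of_day` is a hypothetical helper variable, not part of the class.

```python
from datetime import datetime, time

date_until = datetime.strptime("2018-07-01", "%Y-%m-%d")

# Hypothetical creation time: a pull request opened in the afternoon
# of the end date.
created_date = datetime(2018, 7, 1, 15, 30)

# date_until is 2018-07-01 00:00:00, so the afternoon is past the limit.
print(created_date <= date_until)  # False

# To cover the whole end date, move the limit to the last instant of the day.
end_of_day = datetime.combine(date_until.date(), time.max)
print(created_date <= end_of_day)  # True
```

Whether the end date should be covered in full depends on how the metric is meant to be read; passing `end_of_day` as `until` is one way to make that choice explicit.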
