# Mining Software Repositories Assignment 1 (MSR1)

Using your newly minted repository mining skills, along with the `miner-utils` Python library ([link](https://github.com/EPICLab/miner-utils)), please answer the following questions using Python code in the empty code boxes below.

As in the tutorials and in-class demonstration, you will want to authenticate to the GitHub API in order to get a higher rate limit. ***NOTE***: Again, please do not include your authentication username or token in your submission.
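
One way to keep credentials out of a submitted notebook is to read them from environment variables instead of typing them into a cell. The sketch below is only an illustration: the helper `load_credentials` and the variable names `GITHUB_USER` and `GITHUB_TOKEN` are assumptions, not something the assignment or `miner-utils` prescribes.

```python
import os


def load_credentials(env=os.environ):
    """Return (username, token) read from environment variables.

    GITHUB_USER and GITHUB_TOKEN are hypothetical names chosen for this sketch.
    """
    user = env.get("GITHUB_USER")
    token = env.get("GITHUB_TOKEN")
    if not user or not token:
        # Fail early with a clear message instead of making unauthenticated calls.
        raise RuntimeError("Set GITHUB_USER and GITHUB_TOKEN before running the notebook")
    return user, token
```

With those variables set in the shell that launches Jupyter, a cell could then use `userName, token = load_credentials()` so that no secret is ever saved in the notebook itself.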

In [1]:
# install any missing dependencies (only needed if you haven't installed these already during tutorials)
# the requirements are left unquoted: shells that do not strip single quotes (e.g. Windows cmd)
# pass the quotes through to pip, which then rejects them as an invalid requirement
!pip install git+https://github.com/EPICLab/miner-utils
!pip install gitpython
!pip install pandas


In [2]:
userName = 'USER'
token = 'TOKEN'

In [3]:
# setup environment (import any needed dependencies)
from minerutils import GitHub
import pandas as pd
from git import Repo
import json
from os import path
from collections import Counter

## Part 1: `discourse/discourse` repo
For this part, we will investigate the [discourse/discourse](https://github.com/discourse/discourse) project.

In [4]:
# create github object
gh = GitHub(userName, token)

In [22]:
# load cached JSON data if the file exists; returns (found, data)
def open_file(file_name):
    if path.exists(file_name):
        with open(file_name, "r", encoding="utf8") as f:
            return True, json.load(f)
    return False, None

# cache API data in a JSON file, then read it back
def create_file(data, file_name):
    # gh.writeData writes the file itself, so no separate open()/write() is needed
    # (writing through a second open handle to the same file risks clobbering it)
    gh.writeData(file_name, data)

    # read the cached data back from file_name
    return gh.readData(file_name)

# initializing
fnames = ["dis_pulls.json", "dis_commits.json", "dis_file_commits.json"]
dis = "/repos/discourse/discourse"

# handling pull data
p_check, pulls = open_file(fnames[0])
if not p_check:
    # download the pull requests for discourse/discourse
    # (the GitHub API lists only open pull requests unless state=all is requested)
    pulls = gh.get(dis + "/pulls")
    # cache the result in dis_pulls.json
    pulls = create_file(pulls, fnames[0])
print("pull check complete")
    
# handling commit data
c_check, commits = open_file(fnames[1])
if not c_check:
    commits = gh.get(dis + "/commits")
    commits = create_file(commits, fnames[1])
print("commit check complete")

pull check complete
commit check complete
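
The caching pattern used above (read the JSON file if it exists, otherwise call the API and save the result) can also be written as a single helper. This is a minimal sketch using only the standard library; `load_or_fetch` and its `fetch` callback are hypothetical names and are not part of `miner-utils`.

```python
import json
from pathlib import Path


def load_or_fetch(file_name, fetch):
    """Return cached JSON from file_name, calling fetch() only on a cache miss."""
    cache = Path(file_name)
    if cache.exists():
        # Cache hit: no API call is made.
        return json.loads(cache.read_text(encoding="utf8"))
    # Cache miss: fetch once, then store the result for later runs.
    data = fetch()
    cache.write_text(json.dumps(data), encoding="utf8")
    return data
```

In this notebook it could be called as `load_or_fetch("dis_commits.json", lambda: gh.get(dis + "/commits"))`. Keeping the fetch step behind a callback means repeated runs do not spend rate limit on data that is already on disk.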


#### Question 1:
What is the total number of unique contributors for this project? (Contributions include commits and pull requests)

In [6]:
# list of usernames
contrib = []

# handling pull data
for p in pulls:
    username = p['user']['login']
    contrib.append(username)
    
# handling commit data
for c in commits: 
    # if the author metadata exists
    if c['author']:
        username = c['author']['login']
        contrib.append(username)    
   
    # if the author metadata doesn't exist
    else:
        username = c['commit']['author']['name']
        contrib.append(username)

# count contributions per user; the number of distinct keys is the unique total
data = Counter(contrib)
unique = len(data)

print("Total number of unique contributors:", unique)

Total number of unique contributors: 1251
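
The counting logic above can be checked on a small hand-made sample. The records below are hypothetical and contain only the fields the notebook reads (`user.login` for pull requests, `author.login` or `commit.author.name` for commits); the helper names are illustrative.

```python
from collections import Counter


def contributor_name(commit):
    """Prefer the GitHub login; fall back to the git author name when no account is linked."""
    author = commit.get("author")
    if author:
        return author["login"]
    return commit["commit"]["author"]["name"]


def count_contributions(pulls, commits):
    """Count pull requests and commits per contributor."""
    counts = Counter(p["user"]["login"] for p in pulls)
    counts.update(contributor_name(c) for c in commits)
    return counts


# Hypothetical sample data, not taken from discourse/discourse.
sample_pulls = [{"user": {"login": "alice"}}, {"user": {"login": "bob"}}]
sample_commits = [
    {"author": {"login": "alice"}, "commit": {"author": {"name": "Alice A."}}},
    {"author": None, "commit": {"author": {"name": "Carol C."}}},
]

counts = count_contributions(sample_pulls, sample_commits)
print(len(counts))             # 3
print(counts.most_common(1))   # [('alice', 2)]
```

Note that the fallback to the git author name can count one person twice, once under a login and once under a display name, so the total is best read as an upper bound on distinct people.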


#### Question 2:
Which user made the most contributions to the project?

In [7]:
# get the top contributor's username and contribution count
user, max_count = data.most_common(1)[0]

ans = "{} made {} contributions, which is the most in this project.".format(user, max_count)
print(ans)

eviltrout made 6556 contributions, which is the most in this project.


#### Question 3:
Which user made the most commits to the [discourse/app/models/badge.rb](https://github.com/discourse/discourse/blob/master/app/models/badge.rb) file? (You can use GitHub to find this information, but we still need to see code for automatically determining the answer)

In [8]:
# handling file commit data
f_check, file_commits = open_file(fnames[2])
if not f_check:
    file_commits = gh.get(dis + "/commits?path=app/models/badge.rb")
    file_commits = create_file(file_commits, fnames[2])
print("file check complete")

# store all the users who committed to badge.rb
badge_users = []

# get the usernames
for fc in file_commits:
    if fc['author']:
        badge_users.append(fc['author']['login'])
    else:
        badge_users.append(fc['commit']['author']['name'])

# get the top committer's username and commit count
badge = Counter(badge_users)
user, max_count = badge.most_common(1)[0]

badge_ans = "{} made {} commits to app/models/badge.rb, which is the most for this file.".format(user, max_count)
print(badge_ans)

file check complete
SamSaffron made 55 commits to app/models/badge.rb, which is the most for this file.


## Part 2: `gaearon` user
For this part, we will investigate the developer [Dan Abramov](https://github.com/gaearon) (prominent developer of [React](https://reactjs.org/), co-author of [Redux](https://redux.js.org/) and [Create React App](https://create-react-app.dev/)).

In [9]:
# get full list of Dan Abramov repositories
repos = gh.get("users/gaearon/repos")

# store Dan's repo names
repo_list = []
for r in repos:
    repo_list.append(r["name"])

#### Question 1:
Which of Dan's projects did he commit to most often in the past three years? (From June 1, 2017 to June 1, 2020)

In [20]:
# store repo name + num of commits
range_repo = {}

# get the list of commits for each repo within the time range
for r in repo_list:
    author_repo = "repos/gaearon/{}/commits?author=gaearon".format(r)
    clist_repo = gh.get(author_repo, params={"since": "2017-06-01T00:00:00Z", 
                                             "until": "2020-06-01T00:00:00Z"})        
    # if commit exists in that repo, append repo + the number of commits
    if clist_repo:
        range_repo[r] = len(clist_repo)           

# repo with the most commits
max_commits = max(range_repo, key=range_repo.get)
    
# print the results
print("max commits in repo:", max_commits)
print("number of max:", range_repo[max_commits])

max commits in repo: overreacted.io
number of max: 250


#### Question 2:
The [React](https://reactjs.org/) project was founded in 2013. When did Dan first make a release in the [facebook/react](https://github.com/facebook/react) project? And what was the version number of that release?

In [31]:
# get a list of all releases in the facebook project
r_check, release_data = open_file("all_release.json")
if not r_check:  
    all_release = gh.get("repos/facebook/react/releases")
    release_data = create_file(all_release, "all_release.json")
print("release check complete")

print(len(release_data))
# print(release_data)

release check complete
94


In [33]:
# dict for storing Dan's releases
dan_releases = {}

# get the version numbers and release dates
for r in release_data:
    if r['author']['login'] == 'gaearon':
        dan_releases[r['tag_name']] = r['published_at']
        
# get Dan's first release version number and date
first_rel = min(dan_releases, key=dan_releases.get)
date = dan_releases[first_rel]

rel_ans = "Dan made his first release on {}, version number: {}".format(date, first_rel)
print(rel_ans)

Dan made his first release on 2016-03-29T17:35:57Z, version number: v0.14.8
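
The lookup above compares `published_at` values as strings, which works because timestamps in this fixed UTC format sort the same way alphabetically and chronologically. A slightly more defensive variant, sketched below with a hypothetical helper and made-up sample records, parses the timestamps explicitly and skips releases that have no author or no publication date.

```python
from datetime import datetime


def first_release(releases, login):
    """Return (tag_name, published_at) of the earliest release by login, or None."""
    own = [
        r for r in releases
        if r.get("author") and r["author"]["login"] == login and r.get("published_at")
    ]
    if not own:
        return None
    # Compare parsed timestamps rather than raw strings.
    earliest = min(
        own,
        key=lambda r: datetime.strptime(r["published_at"], "%Y-%m-%dT%H:%M:%SZ"),
    )
    return earliest["tag_name"], earliest["published_at"]


# Hypothetical sample data, not taken from facebook/react.
sample_releases = [
    {"tag_name": "v2.0.0", "author": {"login": "dev"}, "published_at": "2019-02-01T10:00:00Z"},
    {"tag_name": "v1.0.0", "author": {"login": "dev"}, "published_at": "2018-07-15T09:30:00Z"},
    {"tag_name": "v0.9.0", "author": {"login": "other"}, "published_at": "2017-01-01T00:00:00Z"},
    {"tag_name": "v3.0.0-draft", "author": {"login": "dev"}, "published_at": None},
]
print(first_release(sample_releases, "dev"))  # ('v1.0.0', '2018-07-15T09:30:00Z')
```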
