# Sample Code for Fetching Git Repo Data Using the Command Line
## The code in this notebook and repository is a companion for the tutorial entitled:
## The Challenges and Opportunities Mining GitHub

The notebook is split into sections that show how different data collection tasks can be performed.

Do not forget to look into utils.py and parse_git_stats.py as well.

In [1]:
# fetch the functions we will use from utils.py
from utils import fetch_github_api_data, search_github
import pandas as pd
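# json_normalize is a top-level pandas function since pandas 1.0;
# importing it from pandas.io.json is deprecated
from pandas import json_normalize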

In [2]:
# let's search for some machine learning repos
search_query = "topic:machine-learning"
resource = "repositories"
res = search_github(resource, search_query)
df = json_normalize(res.get("items"))

Fetching data from https://api.github.com/search/repositories params {'q': 'topic:machine-learning', 'per_page': 100}
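The log line above shows the endpoint and parameters that `search_github` sends. As a rough sketch of what such a request could look like using only the standard library: the names `build_search_url` and `search` are hypothetical and are not the helpers defined in `utils.py`, which may handle authentication and paging differently.

```python
import json
import urllib.parse
import urllib.request

API_ROOT = "https://api.github.com"


def build_search_url(resource, query, per_page=100):
    """Build a GitHub search URL (hypothetical helper, not part of utils.py)."""
    params = urllib.parse.urlencode({"q": query, "per_page": per_page})
    return f"{API_ROOT}/search/{resource}?{params}"


def search(resource, query, per_page=100):
    """Send the search request and return the decoded JSON body.

    Assumption: no authentication token is sent, so the request is
    subject to GitHub's limits for unauthenticated clients.
    """
    url = build_search_url(resource, query, per_page)
    request = urllib.request.Request(
        url, headers={"Accept": "application/vnd.github+json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Example usage (needs network access):
# res = search("repositories", "topic:machine-learning")
# print(len(res.get("items", [])))
```

The URL builder is kept separate from the network call so that the query encoding can be checked without contacting the API.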


In [3]:
# here are a few machine-learning repos
# some of them are software repositories, while others could be
# examples, tutorials, or even lists
df.full_name

0                               tensorflow/tensorflow
1                                    keras-team/keras
2                           scikit-learn/scikit-learn
3                   aymericdamien/TensorFlow-Examples
4                                          BVLC/caffe
5                             tesseract-ocr/tesseract
6                           ageitgey/face_recognition
7     ZuzooVn/machine-learning-for-software-engineers
8                                     JuliaLang/julia
9                       Avik-Jain/100-Days-Of-ML-Code
10           imhuay/Algorithm_Interview_Notes-Chinese
11               terryum/awesome-deep-learning-papers
12                           GokuMohandas/practicalAI
13                             apache/incubator-mxnet
14                                     Microsoft/CNTK
15                                       dmlc/xgboost
16         donnemartin/data-science-ipython-notebooks
17                    oxford-cs-deepnlp-2017/lectures
18                          

# Working With Repo Metadata or Source Code
Before we can analyze a repository, we have to select the projects we want to work with.

We have searched for repositories that fall under the machine-learning topic.

To see how we could randomly sample these repositories, see the [github_api_examples notebook](./github_api_example.ipynb).

We will perform the analysis on `tensorflow/tensorflow`.

The main steps are:
1. Clone the repository. This requires dedicating a directory in which to store the cloned repositories.
2. Use Python's **subprocess** library to run the appropriate command-line git commands that display the information we need.
3. Extract the information from the command output using Python.
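The three steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in `utils.py` or `parse_git_stats.py`: the function names are hypothetical, the `owner_name` directory layout is inferred from the log output of this notebook, and the `git log` format string is just one possible choice.

```python
import subprocess
from pathlib import Path


def repo_dir_for(full_name, base_dir="."):
    """Map 'owner/name' to a local directory such as 'owner_name' (assumed layout)."""
    return Path(base_dir) / full_name.replace("/", "_")


def clone_if_missing(full_name, clone_url, base_dir="."):
    """Step 1: clone the repository unless the target directory already exists."""
    target = repo_dir_for(full_name, base_dir)
    if target.exists():
        print(f"Repo exists in {target}, not going to clone.")
    else:
        subprocess.run(["git", "clone", clone_url, str(target)], check=True)
    return target


def run_git(repo_dir, *args):
    """Step 2: run a git command inside repo_dir and return its standard output."""
    result = subprocess.run(
        ["git", "-C", str(repo_dir), *args],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


def parse_log_lines(log_output):
    """Step 3: turn '<hash>|<author email>|<ISO date>' lines into dictionaries."""
    records = []
    for line in log_output.splitlines():
        if not line.strip():
            continue  # skip blank lines
        sha, author, date = line.split("|", 2)
        records.append({"sha": sha, "author": author, "date": date})
    return records


# Example usage (needs git and network access):
# repo = clone_if_missing("tensorflow/tensorflow",
#                         "https://github.com/tensorflow/tensorflow.git")
# log = run_git(repo, "log", "--pretty=format:%H|%ae|%aI",
#               "--since=2019-01-01", "--until=2019-02-01")
# commits = parse_log_lines(log)
```

Passing the command as a list rather than a shell string avoids quoting problems when a repository name or path contains special characters.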

In [4]:
# To clone a git repository use utils.clone_git_repo
from utils import clone_git_repo
from parse_git_stats import get_git_stats_for_project

In [5]:
# df with single entry for tensorflow/tensorflow repo
repo_data = df[df.full_name == "tensorflow/tensorflow"]

In [6]:
# this is how you iterate rows in a DF
for idx, row in repo_data.iterrows():
    # we need the full name and the clone url
    full_name = row.get("full_name")
    clone_url = row.get("clone_url")
    
    # now we can clone the repository;
    # this call blocks until the clone has finished
    # if you already know the name and URL, this call is all you need to clone
    clone_git_repo(full_name, clone_url)
    
    # fetch project data
    git_stats_df = get_git_stats_for_project(full_name, start_year=2019, end_year=2020)
    
    # NOTE: a "problem in diff" message most likely means that there were
    # no changes to analyze between the two revisions

Repo exists in tensorflow_tensorflow, not going to clone.
working on period 1, start: 2019-01-01 00:00:00, end: 2019-02-01 00:00:00
problem in diff 96007e5022df35766b014a30d6fe4075cc1662cc, 76b2438b24ea1d541dd93af5c76067d5c901ea2a item: ['']
problem in diff 76b2438b24ea1d541dd93af5c76067d5c901ea2a, 4e06198e15ca222ac2d32a24ed4b7133f720118e item: ['']
problem in diff ef15adf25bd1d73840ab3dab68b5ed13b2bedb96, ce955e38598c47129910f4fd4cbef1402f5a3ddd item: ['']
problem in diff af820d5b364486165d4996a3eebe63f8c0a19a94, 898738bce44c0cfdaa36b1dfc865239eb48fa999 item: ['']
problem in diff 50acc0c4f34b28013fc2b122b68aa135ecb0b197, ac40a68cd5fac2227fb6c4086b2eb01a7dc726c4 item: ['']
working on period 2, start: 2019-02-01 00:00:00, end: 2019-03-01 00:00:00
working on period 3, start: 2019-03-01 00:00:00, end: 2019-04-01 00:00:00
problem in diff 718bdbdb48b7f383e3f6f4003fff298ad187f646, 5b8d7d0fa266361f6d3ab86fce5c9ae3107eab39 item: ['']
no revisions between 2019-04-01 00:00:00 and 2019-05-01 00:0
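The log output above walks through consecutive one-month periods. A minimal sketch of how such windows could be generated is shown below. It assumes that `end_year` is exclusive, which may not match the behaviour of `get_git_stats_for_project`, and both function names are hypothetical. The `rev-list` arguments are one possible way to count revisions in a window, not necessarily how `total_revs` is computed.

```python
from datetime import datetime


def monthly_periods(start_year, end_year):
    """Yield (period_number, start, end) for each calendar month.

    Assumption: covers January of start_year up to, but not including,
    January of end_year.
    """
    period = 1
    year, month = start_year, 1
    while year < end_year:
        start = datetime(year, month, 1)
        # advance to the first day of the next month
        month += 1
        if month > 12:
            year, month = year + 1, 1
        end = datetime(year, month, 1)
        yield period, start, end
        period += 1


def rev_count_args(start, end):
    """Build git arguments that count commits reachable from HEAD in a window."""
    return [
        "rev-list",
        "--count",
        f"--since={start.isoformat()}",
        f"--until={end.isoformat()}",
        "HEAD",
    ]


for number, start, end in monthly_periods(2019, 2020):
    print(f"working on period {number}, start: {start}, end: {end}")
```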

In [7]:
# The data is quite rich
# please take the time to review it
git_stats_df.T

| | 0 | 1 | 2 |
|---|---|---|---|
| full_name | tensorflow/tensorflow | tensorflow/tensorflow | tensorflow/tensorflow |
| date | 2019-01-01 00:00:00 | 2019-02-01 00:00:00 | 2019-03-01 00:00:00 |
| end_date | 2019-02-01 00:00:00 | 2019-03-01 00:00:00 | 2019-04-01 00:00:00 |
| period | 1 | 2 | 3 |
| total_revs | 1862 | 2070 | 1511 |
| first_rev | 31723c74caf0cf3effb46e12de00eb80ae86a899 | 0b5d09d515ea6f54f949c600a585291b9ff05ea8 | 36f817a9f3e7d2339cb53b91ddc508b3e25ab761 |
| last_rev | 15d49deb3c5d4048718af19a2e0e0279e8f204d5 | 9cb9320acefb13bf1d7983a2e116c0933d89358c | fdbaab6f506a1829cbadaf79482ffc95a7342b37 |
| new_contribs | 43 | 50 | 36 |
| new_committers | 0 | 0 | 0 |
| total_contribs | 210 | 218 | 189 |
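The notebook does not define `total_contribs` and `new_contribs`. One plausible reading is that `total_contribs` counts the distinct authors active in a period and `new_contribs` counts those not seen in any earlier period. The sketch below illustrates that interpretation only; the function name is hypothetical and `parse_git_stats.py` may compute these columns differently.

```python
def contributor_counts(authors_per_period):
    """Count distinct and first-time authors for each period.

    authors_per_period: list of lists of author identifiers, one inner
    list per period, in chronological order.
    """
    seen = set()  # authors observed in any earlier period
    rows = []
    for number, authors in enumerate(authors_per_period, start=1):
        current = set(authors)
        new = current - seen
        rows.append(
            {
                "period": number,
                "total_contribs": len(current),
                "new_contribs": len(new),
            }
        )
        seen |= current
    return rows


# Toy data: three periods of commit authors
example = [["ann", "bob", "ann"], ["bob", "cy"], []]
for row in contributor_counts(example):
    print(row)
```

Under this reading, an author counts as new only relative to the periods that were analyzed, so the first period tends to overstate the number of new contributors.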


In [8]:
# store as CSV without the pandas index
git_stats_df.to_csv("gitstats.csv", index=False)