# Summary

We're interested in GitHub project health. I first establish a baseline
using known successful projects and projects that were previously successful
but have since been retired. This notebook is intended to get a basic working
version together; we can scale up later.

# Objective function

Lines of code is a good indicator of project health, but we don't want to
just reward increasing project complexity. I want to
reward code additions and deletions equally, so we'll
use the [Manhattan norm](https://en.wikipedia.org/wiki/Taxicab_geometry),
also called the $l_1$ norm.

$$
\|x\|_1 := \sum_{i=1}^n |x_i|.
$$

We want to maximize the $l_1$ norm, so we define
the cost function $c(x)$ as

$$
c(x) = -\sum_{i=1}^n |x_i|
$$

so maximizing the $l_1$ norm is equivalent to minimizing $c(x)$.
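
To make the definition concrete, here is a minimal sketch of the cost function in plain Python. The function names `l1_norm` and `cost` and the input values are illustrative assumptions, not part of the notebook's pipeline.

```python
def l1_norm(x):
    """Sum of the absolute values of the entries of x."""
    return sum(abs(v) for v in x)


def cost(x):
    """Negative l1 norm, so minimizing the cost maximizes the norm."""
    return -l1_norm(x)


# Hypothetical month with 145 lines added and 48 lines deleted.
# Writing the deletion as +48 or -48 gives the same cost, which is
# the sense in which additions and deletions are rewarded equally.
print(cost([145, 48]))   # -193
print(cost([145, -48]))  # -193
```

A month with more total churn gets a lower (better) cost regardless of whether the churn came from additions or deletions.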


## Dependencies

In [22]:
import os
from urllib.parse import urljoin

import numpy as np
import pandas as pd
from okra.models import (Meta, Author, Contrib, CommitFile, Info)
from okra.models import DataAccessLayer
from okra.playbooks import local_persistance
from sqlalchemy import func

## Prepare Data

In [2]:
DATA = "/Users/tylerbrown/code/"
repos = [
    "torvalds/linux",
    "docker/docker-ce",
    'apache/attic-lucy',
    'apache/attic-wink',
    'apache/spark',
    'apache/lucene-solr'
]

In [3]:
# Persist repo info in database

for repo_name in repos:
    local_persistance(repo_name, DATA)

Issue with row 0, repo '/Users/tylerbrown/code/torvalds/linux'
Issue with row 0, repo '/Users/tylerbrown/code/docker/docker-ce'
Issue with row 0, repo '/Users/tylerbrown/code/apache/attic-lucy'
Issue with row 0, repo '/Users/tylerbrown/code/apache/attic-wink'
Issue with row 0, repo '/Users/tylerbrown/code/apache/spark'
Issue with row 0, repo '/Users/tylerbrown/code/apache/lucene-solr'


In [4]:
repodbs = {i : i.replace("/", "__REPODB__") + ".db" for i in repos}
repodbs

{'torvalds/linux': 'torvalds__REPODB__linux.db',
 'docker/docker-ce': 'docker__REPODB__docker-ce.db',
 'apache/attic-lucy': 'apache__REPODB__attic-lucy.db',
 'apache/attic-wink': 'apache__REPODB__attic-wink.db',
 'apache/spark': 'apache__REPODB__spark.db',
 'apache/lucene-solr': 'apache__REPODB__lucene-solr.db'}

# Exploratory Analysis: Linux Kernel

To get an idea of which features would be informative,
I start by exploring the Linux kernel. Some initial thoughts about
repo health indicators:

1. Number of commits per time period
2. Number of developers per time period


In [5]:
conn_string = "sqlite:///" + os.path.join(DATA, repodbs['torvalds/linux'])

In [6]:
dal = DataAccessLayer(conn_string)
dal.connect()
dal.session = dal.Session()
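
The connection string above relies on how SQLAlchemy SQLite URLs are written: `sqlite:///` followed by a file path, so an absolute path yields four slashes in total. A small sketch of building such a URL follows. The helper name `sqlite_url` and the `/tmp/data` paths are hypothetical. It also shows why `urljoin` only behaves like a path join when the base ends in a slash.

```python
import posixpath
from urllib.parse import urljoin


def sqlite_url(data_dir, db_name):
    """Build a SQLAlchemy-style SQLite URL for db_name inside data_dir."""
    return "sqlite:///" + posixpath.join(data_dir, db_name)


# Hypothetical directory and database file name.
print(sqlite_url("/tmp/data", "repo.db"))   # sqlite:////tmp/data/repo.db
print(sqlite_url("/tmp/data/", "repo.db"))  # sqlite:////tmp/data/repo.db

# urljoin resolves relative references the way a browser would, so a base
# without a trailing slash loses its last path segment.
print(urljoin("/tmp/data", "repo.db"))      # /tmp/repo.db
print(urljoin("/tmp/data/", "repo.db"))     # /tmp/data/repo.db
```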

### Compute objective function per month

Let's start by computing our objective function
once per month. 

In [7]:
q1 = dal.session.query(
    Meta.commit_hash, Author.authored, CommitFile.lines_added, CommitFile.lines_deleted
).join(Author).join(CommitFile)

In [8]:
items = []
for item in q1.all():
    r = {
        "commit_hash": item.commit_hash,
        "date_authored": item.authored,
        "lines_added": item.lines_added,
        "lines_deleted": item.lines_deleted
    }
    items.append(r)
objdf = pd.DataFrame(items)
objdf.shape

(1824726, 4)

In [11]:
per = objdf.date_authored.dt.to_period("M")

In [15]:
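# Sum only the numeric columns; hashes and dates cannot be summed meaningfully.
ok = objdf.groupby(per)[['lines_added', 'lines_deleted']].sum()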

In [36]:
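# Note: this is the negative Euclidean (l2) norm of (added, deleted), not the l1 norm defined above.
ok['costfunc'] = -np.sqrt(np.square(ok.lines_added.values) + np.square(ok.lines_deleted.values))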

In [37]:
ok.head()

| date_authored | lines_added | lines_deleted | costfunc |
|:---|---:|---:|---:|
| 1970-01 | 2 | 1 | -2.236068 |
| 2001-09 | 145 | 48 | -152.738338 |
| 2002-04 | 2272 | 632 | -2358.263768 |
| 2003-02 | 56 | 55 | -78.492038 |
| 2004-07 | 9 | 0 | -9.0 |
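
For comparison with the $l_1$ definition in the objective-function section, the following sketch groups toy per-file records by month and computes the negative $l_1$ norm. The rows in `toy` are made up (chosen so the monthly totals match the 2002-04 and 2003-02 rows above), and the column name `l1_cost` is an assumption for illustration.

```python
import pandas as pd

# Toy per-file change records; not taken from the Linux database.
toy = pd.DataFrame({
    "date_authored": pd.to_datetime(
        ["2002-04-01", "2002-04-15", "2003-02-10"]
    ),
    "lines_added": [2000, 272, 56],
    "lines_deleted": [600, 32, 55],
})

# Group by calendar month and sum only the numeric columns.
month = toy["date_authored"].dt.to_period("M")
monthly = toy.groupby(month)[["lines_added", "lines_deleted"]].sum()

# Negative l1 norm of (lines_added, lines_deleted) per month.
monthly["l1_cost"] = -(
    monthly["lines_added"].abs() + monthly["lines_deleted"].abs()
)
print(monthly)
```

Because line counts are non-negative, the $l_1$ cost here is simply the negated sum of additions and deletions for the month.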


### Check features against cost function

In [40]:
# Number of commits per time period
q2 = dal.session.query(
    Meta.commit_hash, Info.created
).join(Info)

items = []
for item in q2.all():
    r = {
        "commit_hash" : item.commit_hash,
        "date_created" : item.created,
    }
    items.append(r)
comdf = pd.DataFrame(items)

In [42]:
per = comdf.date_created.dt.to_period('M')
comdf.shape

(824976, 2)

In [43]:
comct = comdf.groupby(per).count()

In [44]:
comct.head()

| date_created | commit_hash | date_created |
|:---|---:|---:|
| 1970-01 | 1 | 1 |
| 2001-09 | 2 | 2 |
| 2002-04 | 12 | 12 |
| 2003-02 | 1 | 1 |
| 2004-07 | 1 | 1 |
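
One way to check a feature against the cost function is to align the two monthly series on their shared period index and compute a correlation. This is a minimal sketch, assuming Pearson correlation is an acceptable first check. The values and the names `cost`, `commits`, and `joined` are made up for illustration.

```python
import pandas as pd

# Toy monthly series sharing a PeriodIndex; the values are invented.
idx = pd.period_range("2002-01", periods=4, freq="M")
cost = pd.Series([-10.0, -20.0, -30.0, -40.0], index=idx, name="costfunc")
commits = pd.Series([1, 2, 3, 4], index=idx, name="commit_count")

# Keep only the months present in both series, then correlate.
joined = pd.concat([cost, commits], axis=1, join="inner")
r = joined["costfunc"].corr(joined["commit_count"])
print(r)
```

A strongly negative value would suggest that months with more commits also have a lower (better) cost under this definition.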


In [45]:
# Number of developers per time period

q3 = dal.session.query(
    Meta.commit_hash, Author.name, Author.email, 
    Author.authored
).join(Author)

items = []
for item in q3.all():
    r = {
        "commit_hash": item.commit_hash,
        "author_name": item.name,
        "author_email": item.email,
        "author_date": item.authored,
    }
    items.append(r)
authordf = pd.DataFrame(items)

In [46]:
authordf.shape

(824976, 4)

In [47]:
per = authordf.author_date.dt.to_period('M')

In [52]:
# Note: count() tallies commit rows per month; distinct developers would need nunique().
authct = authordf[['author_name','author_email']].groupby(per).count()

In [53]:
authct.head()

| author_date | author_name | author_email |
|:---|---:|---:|
| 1970-01 | 1 | 1 |
| 2001-09 | 2 | 2 |
| 2002-04 | 12 | 12 |
| 2003-02 | 1 | 1 |
| 2004-07 | 1 | 1 |
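
These counts match the commit counts per month, since `count()` counts rows rather than distinct people. A sketch of the difference between `count()` and `nunique()` on toy data follows. The e-mail addresses are placeholders, and using the e-mail as the developer identity is an assumption.

```python
import pandas as pd

# Toy commits: one author commits twice in April 2002, another once.
toy = pd.DataFrame({
    "author_email": [
        "a@example.org", "a@example.org", "b@example.org", "a@example.org"
    ],
    "author_date": pd.to_datetime(
        ["2002-04-01", "2002-04-02", "2002-04-03", "2003-02-10"]
    ),
})

month = toy["author_date"].dt.to_period("M")

# Rows per month: one row per commit.
commits_per_month = toy.groupby(month)["author_email"].count()

# Distinct e-mail addresses per month, used as a proxy for developers.
developers_per_month = toy.groupby(month)["author_email"].nunique()

print(commits_per_month.tolist())     # [3, 1]
print(developers_per_month.tolist())  # [2, 1]
```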
