# Summary

We're interested in GitHub project health. I first establish a baseline
using projects known to be successful and projects that were once successful
but have since been retired. This notebook is intended to get a basic working
version together; we can scale up later.

# Objective function

Lines of code is a good indicator, but we don't want to
simply reward increasing project complexity. I want to
reward code additions and deletions equally, so we'll
use the [Manhattan norm](https://en.wikipedia.org/wiki/Taxicab_geometry),
also called the $l_1$ norm.

$$
||x||_1 := \sum_{i=1}^n |x_i|.
$$

We want to maximize the $l_1$ norm, so we can define
the cost function $c(x)$ as

$$
c(x) = -\sum_{i=1}^n |x_i|
$$

which means we now want to minimize $c(x)$.
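As a quick sanity check on the definition, here is a minimal sketch of $c(x)$ in plain Python. It assumes $x$ is simply the collection of per-file line counts (added and subtracted) over some window; the notebook has not pinned that down yet, and `l1_cost` is a hypothetical helper name, not part of `okra`.

```python
def l1_cost(lines_added, lines_subtracted):
    """Return c(x) = -||x||_1, treating additions and deletions alike.

    Assumption: x is the concatenation of the added and subtracted
    line counts for every file change in the window.
    """
    x = list(lines_added) + list(lines_subtracted)
    return -sum(abs(v) for v in x)


# Two file changes: +10/-3 lines and +0/-7 lines
cost = l1_cost([10, 0], [3, 7])
print(cost)  # -20
```

A more active window gives a more negative cost, so minimizing $c(x)$ favors activity regardless of whether lines were added or removed.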


## Dependencies

In [13]:
import os
from urllib.parse import urljoin

import pandas as pd
from okra.models import DataAccessLayer
from okra.playbooks import local_persistance

## Prepare Data

In [2]:
DATA = "/Users/tylerbrown/code/"
repos = [
    "torvalds/linux",
    "docker/docker-ce",
    'apache/attic-lucy',
    'apache/attic-wink',
    'apache/spark',
    'apache/lucene-solr'
]

In [3]:
# Persist repo info in database

for repo_name in repos:
    local_persistance(repo_name, DATA)

Issue with row 0, repo '/Users/tylerbrown/code/torvalds/linux'
Issue with row 0, repo '/Users/tylerbrown/code/docker/docker-ce'
Issue with row 0, repo '/Users/tylerbrown/code/apache/attic-lucy'
Issue with row 0, repo '/Users/tylerbrown/code/apache/attic-wink'
Issue with row 0, repo '/Users/tylerbrown/code/apache/spark'
Issue with row 0, repo '/Users/tylerbrown/code/apache/lucene-solr'


In [14]:
repodbs = {i : i.replace("/", "__REPODB__") + ".db" for i in repos}
repodbs

{'torvalds/linux': 'torvalds__REPODB__linux.db',
 'docker/docker-ce': 'docker__REPODB__docker-ce.db',
 'apache/attic-lucy': 'apache__REPODB__attic-lucy.db',
 'apache/attic-wink': 'apache__REPODB__attic-wink.db',
 'apache/spark': 'apache__REPODB__spark.db',
 'apache/lucene-solr': 'apache__REPODB__lucene-solr.db'}

# Exploratory Analysis: Linux Kernel

To get an idea of which features would be informative,
I start by exploring the Linux kernel. Some initial thoughts about
repo health indicators:

1. Number of commits per time period
1. Number of developers per time period
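
Both indicators above can probably be read off the `author` table alone. The sketch below uses a toy stand-in with the same column names; the rows are made up for illustration, and treating a distinct `email` as a distinct developer is my assumption.

```python
import pandas as pd

# Toy stand-in for the `author` table (illustrative rows only)
author = pd.DataFrame({
    "commit_hash": ["a1", "b2", "c3", "d4"],
    "name": ["Ann", "Bob", "Ann", "Cy"],
    "email": ["ann@example.com", "bob@example.com",
              "ann@example.com", "cy@example.com"],
    "authored": pd.to_datetime(
        ["2019-01-03", "2019-01-20", "2019-01-25", "2019-02-15"]
    ),
})

# Bucket each commit by the calendar month it was authored in
month = author["authored"].dt.to_period("M")

# Count distinct commits and distinct developer emails per month
activity = author.groupby(month).agg(
    n_commits=("commit_hash", "nunique"),
    n_developers=("email", "nunique"),
)
print(activity)
```

Swapping `"M"` for another period alias such as `"W"` or `"Q"` changes the time window without touching the rest.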


In [18]:
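# os.path.join is the safer choice for filesystem paths; urljoin silently
# drops the last path segment when DATA has no trailing slash
conn_string = "sqlite:///" + os.path.join(DATA, repodbs['torvalds/linux'])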

In [20]:
dal = DataAccessLayer(conn_string)
dal.connect()

In [22]:
commits = pd.read_sql_table('commit_file', dal.engine)
info = pd.read_sql_table('info', dal.engine)
contrib = pd.read_sql_table('contrib', dal.engine)
author = pd.read_sql_table('author', dal.engine)
meta = pd.read_sql_table('meta', dal.engine)
inventory = pd.read_sql_table('inventory', dal.engine)

### Compute objective function per month

Let's start by computing our objective function
once per month. 

In [28]:
print("meta: {}, author: {}, commits: {}".format(meta.shape, author.shape, commits.shape))

meta: (824976, 3), author: (824976, 4), commits: (1824726, 5)


In [30]:
meta.dtypes

commit_hash     object
owner_name      object
project_name    object
dtype: object

In [37]:
author.columns

Index(['commit_hash', 'name', 'email', 'authored'], dtype='object')

In [33]:
commits.dtypes

file_id              int64
commit_hash         object
modified_file       object
lines_added          int64
lines_subtracted     int64
dtype: object

In [42]:
per = author.authored.dt.to_period("M")


In [43]:
author.join(commits, on="commit_hash")

ValueError: You are trying to merge on object and int64 columns. If you wish to proceed you should use pd.concat
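
The error comes from how `DataFrame.join` works: with `on="commit_hash"` it matches the caller's `commit_hash` column against the *index* of `commits`, which is a default integer index. `DataFrame.merge` matches column to column instead. Below is a minimal sketch on toy frames (made-up rows, same column names) that merges the two tables and then computes the cost per month. Defining the monthly cost as the negated sum of lines added and subtracted is my reading of the objective above, not a settled design.

```python
import pandas as pd

# Toy stand-ins for the `author` and `commit_file` tables
author = pd.DataFrame({
    "commit_hash": ["a1", "b2", "c3"],
    "authored": pd.to_datetime(["2019-01-03", "2019-01-20", "2019-02-11"]),
})
commits = pd.DataFrame({
    "file_id": [1, 2, 3, 4],
    "commit_hash": ["a1", "a1", "b2", "c3"],
    "lines_added": [10, 0, 5, 2],
    "lines_subtracted": [3, 7, 0, 1],
})

# merge matches the commit_hash column in both frames;
# one author row fans out to one row per modified file
merged = author.merge(commits, on="commit_hash", how="inner")

# Per-file contribution to the l1 norm
merged["l1"] = merged["lines_added"].abs() + merged["lines_subtracted"].abs()

# Cost per month: negative l1 norm of all line changes in that month
month = merged["authored"].dt.to_period("M")
monthly_cost = -merged.groupby(month)["l1"].sum()
print(monthly_cost)
```

On the real tables the same merge yields one row per (commit, file) pair, so it is worth checking memory before running it over all 1.8 million `commit_file` rows.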