# Summary

This notebook works through a small example before scaling up. The goal
here is to predict which repos at time $t_{i}$ will be healthy or unhealthy
at time $t_{i+1}$. The final algorithm has to run on Spark. The Spark port
comes later, once the game plan is settled.

In [1]:
ls ~/code/*.parquet

/Users/tylerbrown/code/author_2019-04-20_0.parquet
/Users/tylerbrown/code/commit_file_2019-04-20_0.parquet
/Users/tylerbrown/code/commit_file_2019-04-20_1.parquet
/Users/tylerbrown/code/contrib_2019-04-20_0.parquet
/Users/tylerbrown/code/info_2019-04-20_0.parquet
/Users/tylerbrown/code/meta_2019-04-20_0.parquet


## Dependencies

In [2]:
import pandas as pd

## Load Data

Databases have been consolidated and switched to Parquet files with
no more than 2e6 records in each.

In [5]:
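# DataFrame.append was removed in pandas 2.0; pd.concat stacks the two shards instead
commits = pd.concat([
    pd.read_parquet('/Users/tylerbrown/code/commit_file_2019-04-20_0.parquet', engine='pyarrow'),
    pd.read_parquet('/Users/tylerbrown/code/commit_file_2019-04-20_1.parquet', engine='pyarrow'),
])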
commits.shape

(2327964, 5)

In [8]:
meta = pd.read_parquet('/Users/tylerbrown/code/meta_2019-04-20_0.parquet', 'pyarrow')
meta.shape

(928779, 3)

In [17]:
author = pd.read_parquet('/Users/tylerbrown/code/author_2019-04-20_0.parquet')
author.shape

(928779, 4)

In [9]:
meta.head()

Unnamed: 0,commit_hash,owner_name,project_name
0,8c2ffd9174779014c3fe1f96d9dc3641d9175f00,torvalds,linux
1,17403fa277eda1328a7026dfca7e40249f27dc6b,torvalds,linux
2,231c807a60715312e2a93a001cc9be9b888bc350,torvalds,linux
3,49ef015632ab3fcc19b2cb37b199d6d7ebcfa5f8,torvalds,linux
4,19caf581ba441659f1a71e9a5baed032fdcfceef,torvalds,linux


# Hypotheses

We need to generate some hypotheses related to RepoHealth. We can
then test those hypotheses with some fairly straightforward statistical
modeling.

Starting from $F = ma$, we get $a = \frac{F}{m}$, where acceleration $a$ is
my definition of repo health. This is the shark hypothesis: software has to keep
moving or it dies. We can gauge the level of acceleration by measuring the amplitude.
We have to normalize acceleration by mass to compare across repos.

$H1_a:$ Amplitude $\uparrow$, repo health (acceleration) $\uparrow$

Since 'repo health' isn't something we can directly measure, we have
to use a proxy measure or instrumental variable. I use acceleration as a
proxy for repo health. Substantively, no project that we know to be
'healthy' should violate $H1_a$.
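
One way to make the analogy concrete, offered as an assumption rather than a settled definition: let $x_r(t)$ be the cumulative number of lines changed in repo $r$ through day $t$, and let $m_r$ be some measure of the repo's size (the choice of $m_r$ is left open here). Then velocity, force, and acceleration could be written as

$$
v_r(t) = x_r(t) - x_r(t-1), \qquad
F_r(t) = v_r(t) - v_r(t-1), \qquad
a_r(t) = \frac{F_r(t)}{m_r}
$$

so that $a_r(t)$ is the day-over-day change in activity scaled by repo size, which matches $a = \frac{F}{m}$ above.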

# Data Preprocessing

We need to change the representation of our data to find linear relationships, since a linear
relationship is what lets us tell someone to do more or less of something. Several steps
are needed here: normalize timeframes, normalize acceleration, and normalize amplitude. Then the same steps have to be redone on Spark.

In [19]:
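# Create working table: one row per modified file, joined on the shared commit_hash column

df = pd.merge(pd.merge(meta, commits, on='commit_hash'), author, on='commit_hash')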
df.shape

(2327964, 10)

In [24]:
print("Weird things didn't happen: {}".format(df.shape[0] == commits.shape[0]))

Weird things didn't happen: True
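
The row-count check above catches duplicated or dropped rows after the fact. pandas can also enforce the expected key relationship at merge time. Below is a minimal sketch on toy frames (the values are made up; the column names follow the tables above), assuming `commit_hash` is unique in `meta` and `author` and repeated in `commits`.

```python
import pandas as pd

# Toy stand-ins for the real tables (values are made up).
meta = pd.DataFrame({"commit_hash": ["a", "b"],
                     "owner_name": ["torvalds", "torvalds"],
                     "project_name": ["linux", "linux"]})
commits = pd.DataFrame({"commit_hash": ["a", "a", "b"],
                        "modified_file": ["Makefile", "README", "fs/ext4/ioctl.c"],
                        "lines_added": [1, 2, 7],
                        "lines_subtracted": [1, 0, 0]})
author = pd.DataFrame({"commit_hash": ["a", "b"],
                       "name": ["A", "B"]})

# validate= raises pandas.errors.MergeError when the key is not unique on the
# side declared as "one", so an accidental fan-out cannot pass silently.
toy = (meta.merge(commits, on="commit_hash", how="inner", validate="one_to_many")
           .merge(author, on="commit_hash", how="inner", validate="many_to_one"))

# Same sanity check as in the notebook: one output row per modified file.
assert len(toy) == len(commits)
```

If the uniqueness assumption turns out to be wrong for the real data, the merge fails loudly instead of silently multiplying rows.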


In [25]:
df.head()

Unnamed: 0,commit_hash,owner_name,project_name,file_id,modified_file,lines_added,lines_subtracted,name,email,authored
0,8c2ffd9174779014c3fe1f96d9dc3641d9175f00,torvalds,linux,0,Makefile,1,1,Linus Torvalds,torvalds@linux-foundation.org,2019-03-24 14:02:26.000000
1,2a6a8e2d9004b5303fcb494588ba3a3b87a256c3,torvalds,linux,1,drivers/clocksource/clps711x-timer.c,13,32,Alexander Shiyan,shc_work@mail.ru,2018-12-20 14:16:26.000000
2,18915b5873f07e5030e6fb108a050fa7c71c59fb,torvalds,linux,2,fs/ext4/ioctl.c,7,0,Darrick J. Wong,darrick.wong@oracle.com,2019-03-23 12:10:29.000000
3,5e86bdda41534e17621d5a071b294943cae4376e,torvalds,linux,3,fs/ext4/indirect.c,22,25,zhangyi (F),yi.zhang@huawei.com,2019-03-23 11:56:01.000000
4,674a2b27234d1b7afcb0a9162e81b2e53aeef217,torvalds,linux,4,fs/ext4/indirect.c,8,4,zhangyi (F),yi.zhang@huawei.com,2019-03-23 11:43:05.000000


In [30]:
df.authored = pd.to_datetime(df.authored)

In [31]:
per = df.authored.dt.to_period("D")

In [34]:
dfgrp = df[['owner_name', 'project_name', 'lines_added', 'lines_subtracted']].groupby(
    [per, 'owner_name', 'project_name']).sum()

In [37]:
df.shape

(2327964, 10)

In [36]:
dfgrp.shape

(16018, 2)

In [40]:
dfgrp = dfgrp.reset_index()

In [41]:
dfgrp.head()

Unnamed: 0,authored,owner_name,project_name,lines_added,lines_subtracted
0,1970-01-01,torvalds,linux,2,1
1,2001-09-11,apache,lucene-solr,1910,0
2,2001-09-17,torvalds,linux,145,48
3,2001-09-18,apache,lucene-solr,17247,202
4,2001-09-19,apache,lucene-solr,2,4
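
The first row above is dated 1970-01-01, the Unix epoch, which looks more like a placeholder timestamp than a real commit date, and the remaining days are irregularly spaced. A possible first pass at "normalize timeframes" is sketched below on toy data: drop rows before a cutoff and put each repo on a continuous daily grid. The cutoff `MIN_DATE` is a hypothetical value, and the toy frame uses timestamps rather than the `Period` values in `dfgrp` (those could be converted with `dfgrp['authored'].dt.to_timestamp()`).

```python
import pandas as pd

# Toy version of dfgrp (values are made up), with timestamps instead of periods.
toy = pd.DataFrame({
    "authored": pd.to_datetime(["1970-01-01", "2001-09-17", "2001-09-20",
                                "2001-09-18", "2001-09-19"]),
    "owner_name": ["torvalds", "torvalds", "torvalds", "apache", "apache"],
    "project_name": ["linux", "linux", "linux", "lucene-solr", "lucene-solr"],
    "lines_added": [2, 145, 10, 17247, 2],
    "lines_subtracted": [1, 48, 5, 202, 4],
})

MIN_DATE = pd.Timestamp("1990-01-01")  # hypothetical cutoff, not from the notebook
toy = toy[toy["authored"] >= MIN_DATE]

frames = []
for (owner, project), g in toy.groupby(["owner_name", "project_name"]):
    g = g.set_index("authored")[["lines_added", "lines_subtracted"]]
    # Every calendar day between the repo's first and last commit day;
    # days without commits get zero lines changed.
    days = pd.date_range(g.index.min(), g.index.max(), freq="D")
    g = g.reindex(days, fill_value=0).rename_axis("authored").reset_index()
    g.insert(1, "owner_name", owner)
    g.insert(2, "project_name", project)
    frames.append(g)

daily = pd.concat(frames, ignore_index=True)
```

Filling gaps with zeros treats a day without commits as a day of no activity, which is itself a modeling choice worth revisiting.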


In [12]:
# Compute acceleration

def acceleration_per_repo():
    pass
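
One possible way to fill in the stub above, sketched on toy data. Everything about the definitions is an assumption: velocity is taken to be lines touched per day, force is the day-over-day change in velocity, mass is the cumulative lines touched so far, and acceleration is force divided by mass, following $a = \frac{F}{m}$. The function name `acceleration_per_repo_sketch` is hypothetical, and the input is assumed to already have one row per repo per day.

```python
import pandas as pd

KEYS = ["owner_name", "project_name"]

def acceleration_per_repo_sketch(daily):
    """Hypothetical take on the stub; all definitions below are assumptions."""
    out = daily.sort_values(KEYS + ["authored"]).copy()
    # Velocity: lines touched per day.
    out["velocity"] = out["lines_added"] + out["lines_subtracted"]
    # Force: day-over-day change in velocity within each repo (NaN on a repo's first day).
    out["force"] = out.groupby(KEYS)["velocity"].diff()
    # Mass: cumulative lines touched so far, used to compare repos of different size.
    out["mass"] = out.groupby(KEYS)["velocity"].cumsum()
    # Acceleration: a = F / m; rows with zero mass become NaN instead of dividing by zero.
    out["acceleration"] = out["force"] / out["mass"].where(out["mass"] > 0)
    return out

# Toy data (made up): three days for one repo, one day for another.
toy = pd.DataFrame({
    "authored": pd.to_datetime(["2019-03-01", "2019-03-02", "2019-03-03", "2019-03-01"]),
    "owner_name": ["o1", "o1", "o1", "o2"],
    "project_name": ["p1", "p1", "p1", "p2"],
    "lines_added": [10, 30, 20, 5],
    "lines_subtracted": [0, 10, 0, 0],
})
result = acceleration_per_repo_sketch(toy)
```

Only grouped `diff` and `cumsum` are used, so the same logic should translate to window functions when this is redone on Spark.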

In [None]:
# Compute amplitude
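
The notebook does not yet say how amplitude is defined. As a placeholder, the sketch below assumes amplitude is half the peak-to-trough range of daily churn over a trailing window. The window length `WINDOW`, the `churn` column, and the toy values are all hypothetical.

```python
import pandas as pd

WINDOW = 3  # hypothetical trailing window length, in days

# Toy daily churn per repo (values are made up), already on a regular daily grid.
toy = pd.DataFrame({
    "project_name": ["a"] * 5 + ["b"] * 3,
    "churn": [10, 40, 20, 0, 60, 5, 5, 5],
})

def rolling_amplitude(s, window=WINDOW):
    # Half the distance between the highest and lowest churn in the window;
    # NaN until a full window of days is available.
    roll = s.rolling(window, min_periods=window)
    return (roll.max() - roll.min()) / 2

toy["amplitude"] = toy.groupby("project_name")["churn"].transform(rolling_amplitude)
```

A repo with perfectly steady churn (repo `b` here) gets amplitude zero under this definition, which is worth checking against the intent of $H1_a$.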



In [10]:
# Check regression



In [11]:
# Add any variables which could threaten validity as controls 
# see if the relationship still holds
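
A sketch of what this check could look like, on synthetic data only. It fits acceleration on amplitude with and without a control using ordinary least squares via `numpy.linalg.lstsq`. The control variable, the coefficients used to generate the data, and the helper name `ols` are all made up for illustration; none of this comes from the repo data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Synthetic data: the control drives both amplitude and acceleration,
# so leaving it out biases the amplitude coefficient.
control = rng.normal(size=n)
amplitude = 0.8 * control + rng.normal(size=n)
acceleration = 0.5 * amplitude + 1.0 * control + rng.normal(scale=0.1, size=n)

def ols(y, *columns):
    # Least-squares fit with an intercept; returns [intercept, coef_1, coef_2, ...].
    X = np.column_stack([np.ones(len(y)), *columns])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

b_simple = ols(acceleration, amplitude)            # no control
b_ctrl = ols(acceleration, amplitude, control)     # with control

print("amplitude coefficient without control:", b_simple[1])
print("amplitude coefficient with control:   ", b_ctrl[1])
```

If the amplitude coefficient keeps its sign and rough size once controls are added, the relationship in $H1_a$ is less likely to be an artifact of the omitted variable.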

