# GitHub WebHook Events

This project uses webhook events from GitHub published by **GH Archive**. These events include most activities on public repositories like commits, pull-requests, forks, issues, comments, and more. Interesting questions to consider include:

- Do people use GitHub differently in their personal life than in their jobs? Can you see a difference between weekday use and weekend use? (e.g. in the programming language used, whether users leave commit messages, whether a repository belongs to an organization).
- "Push Events" correspond to users pushing their commits to GitHub. What do users' commits look like? How many commits do people make per push event? What do people write in their commit messages (do text analysis)? What do usual vs. unusual commits look like?
- Investigate forking repositories:
> A fork is a copy of a repository. Forking a repository allows you to freely experiment with changes without affecting the original project. Most commonly, forks are used to either propose changes to someone else's project or to use someone else's project as a starting point for your own idea.

    - A forked repository is often a sign of a successful repository, as users fork a repository to make their own customizations or developments. When do users fork repositories? What are the languages of the repositories being forked?
- Investigate which programming languages are used in projects. Are the languages in weekday vs. weekend projects different? Are projects associated with an organization written in different languages from those that aren't? (In this dataset, the language is only available for "fork" events. However, you can follow URLs in this dataset to the GitHub API to find the language of any repository.)


### Getting the Data

This project is unusual in that getting the data is more involved, and the project is more about the *collection*. As such, if your data collection is more involved and clever, the data analysis portion can be a little lighter.

The primary data is explained on the [GH Archive](https://www.gharchive.org/) homepage. To get data for a single hour, in a terminal, type:
```
mkdir github_data
cd github_data
wget https://data.gharchive.org/2015-01-01-15.json.gz
gunzip *
cd ..
```
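The same download can be scripted for several consecutive hours. The following is a minimal sketch, not an official GH Archive client: the helper names `hourly_urls` and `download_hours` are hypothetical, and the only assumption about the server is the file-name pattern shown above, in which the hour is not zero-padded (for example `2015-01-02-0.json.gz`).

```python
from datetime import datetime, timedelta
from pathlib import Path
from urllib.request import urlretrieve

BASE_URL = "https://data.gharchive.org"


def hourly_urls(start, n_hours):
    """Return GH Archive URLs for n_hours consecutive hours, beginning at start."""
    urls = []
    for i in range(n_hours):
        t = start + timedelta(hours=i)
        # The hour in the file name is not zero-padded (e.g. "-0", "-15").
        urls.append("%s/%s-%d.json.gz" % (BASE_URL, t.strftime("%Y-%m-%d"), t.hour))
    return urls


def download_hours(start, n_hours, out_dir="github_data"):
    """Download each hourly archive into out_dir, skipping files already present."""
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    paths = []
    for url in hourly_urls(start, n_hours):
        target = out / url.rsplit("/", 1)[-1]
        if not target.exists():
            urlretrieve(url, str(target))
        paths.append(target)
    return paths
```

For example, `download_hours(datetime(2015, 1, 1, 0), 48)` would request two days of data. The files stay gzipped in this sketch, so they still need to be unzipped (or read with `gzip.open`) before parsing.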

The GH Archive website lists commands for pulling more data in one command. One day of data (24 hours) is about 500MB in size. **You should do analyses on at least a few days of data for this project**. Once unzipped, each JSON file contains one valid JSON object per line; the JSON is nested many levels deep. Use [pandas.json_normalize](https://pandas.pydata.org/docs/reference/api/pandas.json_normalize.html) (exposed as `pandas.io.json.json_normalize` in pandas versions before 1.0) to parse the lines of an hour or two of data. There will be approximately 600 columns.

The following pattern should be useful:

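```
import json
import pandas as pd
filepath = '2015-01-01-15.json'
with open('github_data/%s' % filepath) as f:  # one JSON object per line
    L = [json.loads(x) for x in f]
df = pd.json_normalize(L)  # pandas >= 1.0; formerly pd.io.json.json_normalize
```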

Pick 15-25 interesting fields (see below) and write a function that takes in a file (representing a single hour of GitHub activity) and returns a dataframe with only those fields. Then iterate through the files and concatenate all the data into a single dataframe.
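One possible shape for that function is sketched below. The `FIELDS` list is only a placeholder selection, and `load_hour` and `load_hours` are hypothetical names; the sketch assumes the hourly files have already been unzipped and that your pandas version provides `pd.json_normalize` (1.0 or later).

```python
import json

import pandas as pd

# Placeholder selection; swap in the 15-25 fields relevant to your question.
FIELDS = [
    "type", "actor.login", "repo.name", "repo.url",
    "created_at", "org.login", "payload.commits",
]


def load_hour(path, fields=FIELDS):
    """Parse one unzipped hourly file and keep only the requested fields."""
    with open(path) as f:
        events = [json.loads(line) for line in f]
    df = pd.json_normalize(events)
    # reindex fixes the column order and fills fields absent from this hour with NaN.
    return df.reindex(columns=fields)


def load_hours(paths, fields=FIELDS):
    """Concatenate several hourly files into a single dataframe."""
    return pd.concat([load_hour(p, fields) for p in paths], ignore_index=True)
```

Selecting columns inside `load_hour` keeps memory use down, since the roughly 600 flattened columns are discarded before the hourly frames are concatenated.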

A few interesting fields of note are (the 'dot' notation refers to the nested dictionary keys):
* type
    - The type of event (e.g. `PushEvent`, `PullRequestEvent`, `ForkEvent`).
* actor.login
    - The GitHub username (login) of the user.
* actor.url
    - The url for more information about the user (e.g. location, number of followers/followees, number of repos).
* repo.name
    - The name of the repository.
* repo.url
    - The url for more information about the repo (e.g. programming language, forks, watchers)
* created_at
    - When the event occurred.
* org.login
    - The name of the organization to which the repository belongs.
* payload.commits
    - The commits for the push-event (it is a list that must be separately parsed)
* payload.forkee.*
    - Information about fork events.
* payload.pull_request.*
    - Information about pull requests
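Because `payload.commits` holds a list of commit dictionaries, it needs its own flattening step. The sketch below shows one way to get one row per commit; `commits_table` is a hypothetical helper, and it assumes the dataframe has the columns `type`, `actor.login`, `created_at`, and `payload.commits`, with push events recorded under the type `PushEvent`.

```python
import pandas as pd


def commits_table(df):
    """Return one row per commit, keeping the event-level actor and timestamp."""
    pushes = df[df["type"] == "PushEvent"]
    cols = ["actor.login", "created_at", "payload.commits"]
    # explode turns each list element into its own row; empty lists become NaN.
    exploded = pushes[cols].explode("payload.commits")
    exploded = exploded.dropna(subset=["payload.commits"]).reset_index(drop=True)
    # Flatten each commit dictionary into columns prefixed with "commit.".
    commits = pd.json_normalize(exploded["payload.commits"].tolist()).add_prefix("commit.")
    return pd.concat([exploded.drop(columns="payload.commits"), commits], axis=1)
```

The number of commits per push event can then be read off with a `groupby` on the event-level columns, and the commit message column is available for text analysis.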


Every field whose name ends in `url` points to an endpoint of the *GitHub API* that returns more information about that user, repository, or event. You are encouraged to write requests to pull data from those endpoints as well!
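As a rough sketch of such a request, the helper below fetches a `repo.url` value and reads the repository's `language` field. The names `fetch_json` and `repo_language` are hypothetical, the optional `token` is assumed to be a personal access token you supply yourself, and the `opener` parameter exists only so the network call can be swapped out when testing.

```python
import json
from urllib.request import Request, urlopen


def fetch_json(url, token=None, opener=urlopen):
    """GET a GitHub API URL and return the decoded JSON body."""
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        # Authenticated requests get a higher rate limit than anonymous ones.
        headers["Authorization"] = "Bearer " + token
    with opener(Request(url, headers=headers)) as resp:
        return json.load(resp)


def repo_language(repo_url, token=None, opener=urlopen):
    """Return the primary language reported for a repository, or None."""
    return fetch_json(repo_url, token, opener).get("language")
```

The API rate-limits requests (quite strictly for unauthenticated clients), so it is worth fetching each distinct URL only once and saving the responses to disk. Repositories that have since been deleted or made private will return an error rather than data, which is itself a source of missingness.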


### Cleaning and EDA

- Create a single DataFrame with the relevant fields, containing the events over the chosen time range.
- Understand the data in ways relevant to your question using univariate and bivariate analysis of the data as well as aggregations.
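For the weekday vs. weekend questions, a first step might look like the sketch below. The helper names are hypothetical, and the code assumes the dataframe has `created_at` and `type` columns. The timestamps are parsed as UTC, so "weekend" here means the weekend in UTC rather than in each user's local time.

```python
import pandas as pd


def add_time_features(df):
    """Add the hour of day and a weekend flag derived from created_at (UTC)."""
    out = df.copy()
    ts = pd.to_datetime(out["created_at"], utc=True)
    out["hour"] = ts.dt.hour
    # dayofweek: Monday is 0, so Saturday and Sunday are 5 and 6.
    out["is_weekend"] = ts.dt.dayofweek >= 5
    return out


def event_type_shares(df):
    """Proportion of each event type within weekdays and within weekends."""
    return pd.crosstab(df["is_weekend"], df["type"], normalize="index")
```

Comparing the two rows of `event_type_shares` gives a quick bivariate view of whether the mix of activity shifts on weekends.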


### Assessment of Missingness


Many columns that contain `NaN` values are missing by design. For example, the `payload.forkee.*` fields are only present for fork events. You should assess the missingness of columns that are **not** missing by design, either in the GH Archive data or in supplementary data pulled from the GitHub API.
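One way to separate the two cases is to look at the missing rate of a column within each event type. The sketch below is a heuristic under stated assumptions, not a formal test: the helper names are hypothetical, the dataframe is assumed to have a `type` column, and a column is flagged as likely missing by design only when it is entirely absent for at least one event type and entirely present for the rest.

```python
import pandas as pd


def missing_rate_by_type(df, column):
    """Fraction of missing values in column, computed within each event type."""
    return df[column].isna().groupby(df["type"]).mean()


def looks_missing_by_design(df, column):
    """True if the column is all-missing for some types and never missing for others."""
    rates = missing_rate_by_type(df, column)
    return bool(rates.isin([0.0, 1.0]).all() and (rates == 1.0).any())
```

Columns for which this returns `False` while still containing `NaN` values are the candidates for a proper assessment of missingness, for example by testing whether their missingness depends on another column.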

### Hypothesis Test / Permutation Test
Find a hypothesis test or permutation test to perform. You can use the questions at the top of the notebook for inspiration.
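As a starting point, a two-sample permutation test can be sketched as follows. This is one generic formulation, using the absolute difference in group means as the test statistic; the function name, the choice of statistic, and the defaults for `n_permutations` and `seed` are all assumptions to adapt to your own question.

```python
import numpy as np


def permutation_test(values, labels, n_permutations=1000, seed=0):
    """Permutation test on the absolute difference in means between two groups.

    values: numeric observations (e.g. commits per push event).
    labels: booleans marking group membership (e.g. is_weekend).
    Returns the observed statistic and the estimated p-value.
    """
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    rng = np.random.default_rng(seed)

    def stat(lab):
        return abs(values[lab].mean() - values[~lab].mean())

    observed = stat(labels)
    # Shuffling the labels simulates the null hypothesis that group does not matter.
    simulated = np.array([stat(rng.permutation(labels)) for _ in range(n_permutations)])
    p_value = float((simulated >= observed).mean())
    return observed, p_value
```

For a categorical outcome such as programming language, the same structure applies with the statistic replaced by something like the total variation distance between the two groups' distributions.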

# Summary of Findings

### Introduction
TODO

### Cleaning and EDA
TODO

### Assessment of Missingness
TODO

### Hypothesis Test
TODO

# Code

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import seaborn as sns
%matplotlib inline
%config InlineBackend.figure_format = 'retina'  # Higher resolution figures

### Cleaning and EDA

In [None]:
# TODO

### Assessment of Missingness

In [None]:
# TODO

### Hypothesis Test / Permutation Test

In [None]:
# TODO