# Kernel git revision history EDA

So here we are: **12 years** of Linux kernel git history and changed files. This first notebook shows the basic properties of the dataset, including a time series analysis and an analysis of commit subject messages.

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8"))

In [None]:
def clean_ts(df):
    return df[(df['author_timestamp'] > 1104600000) & (df['author_timestamp'] < 1487807212)]
df = clean_ts(pd.read_csv('../input/linux_kernel_git_revlog.csv'))
df['author_dt'] = pd.to_datetime(df['author_timestamp'],unit='s')
df.head()

We have now read the data and removed some outliers that were either too far in the past or future.
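
To make the filter window concrete, the two cut-off values in `clean_ts` can be converted to calendar dates. This is only a small sanity check; it assumes the timestamps are Unix seconds interpreted as UTC, which is also what the `pd.to_datetime(..., unit='s')` call above assumes.

```python
import pandas as pd

# the two cut-off values used in clean_ts above, as Unix timestamps in seconds
lower_bound, upper_bound = 1104600000, 1487807212

# convert both bounds to datetimes (interpreted as UTC)
bounds = pd.to_datetime([lower_bound, upper_bound], unit='s')
print(bounds)

# length of the window in years
window_years = (bounds[1] - bounds[0]).days / 365.25
print("window length in years:", window_years)
```

The window runs from early 2005 to early 2017, which matches the roughly 12 years mentioned in the introduction.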

As a first step, we will look at the line additions and deletions over time to get an overview of the activity in the Linux kernel.

In [None]:
time_df = df.groupby(['author_timestamp', 'author_dt'])[['n_additions', 'n_deletions']].agg(np.sum).reset_index().sort_values('author_timestamp', ascending=True)
time_df['diff'] = time_df['n_additions'] - time_df['n_deletions']
time_df.head()

In [None]:
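# 'diff' is additions minus deletions, so the plot shows the net change in lines
t = pd.Series(time_df['diff'].values, index=time_df['author_dt'])
t.plot(title='net lines of code added (additions - deletions)', figsize=(12,8))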

It could also be interesting to look at the number of commits and files changed over time.

In [None]:
commits_over_time = df.groupby('author_dt')['commit_hash'].nunique().reset_index().sort_values('author_dt', ascending=True)
commits_series = pd.Series(commits_over_time['commit_hash'].values, index=commits_over_time['author_dt'])
commits_series.plot(title='number of commits on original time series', figsize=(12,8))

In [None]:
commits_series.resample('M').mean().plot(title='number of commits on monthly resampled data', figsize=(12,8))
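
One thing to keep in mind when reading the resampled plot: taking the mean averages the commit counts per distinct timestamp inside each month, whereas a sum gives the total number of commits per month. The sketch below contrasts the two on a hypothetical toy series (the values are made up for illustration). It groups by calendar month via `to_period('M')`, which is assumed here as an equivalent way to bucket by month.

```python
import pandas as pd

# hypothetical toy series: number of commits per distinct author timestamp
toy_commits = pd.Series(
    [2, 1, 4],
    index=pd.to_datetime(['2017-01-05', '2017-01-20', '2017-02-03']),
)

# group by calendar month
by_month = toy_commits.groupby(toy_commits.index.to_period('M'))
monthly_mean = by_month.mean()   # average commits per timestamp within a month
monthly_total = by_month.sum()   # total commits per month
print(pd.DataFrame({'mean': monthly_mean, 'total': monthly_total}))
```

Which of the two is more useful depends on the question: the total tracks overall activity, while the mean mostly reflects how many commits share the same timestamp.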

Now let's have a look at the number of files changed per commit over time. Is there something interesting to see there?

There are definitely a few spikes of activity where a lot of files were changed in a single commit!

In [None]:
files_changed_per_commit = df.groupby(['author_dt', 'commit_hash'])['filename'].agg('count').reset_index().sort_values('author_dt', ascending=True)
files_changed_per_commit = pd.Series(files_changed_per_commit['filename'].values, index=files_changed_per_commit['author_dt'])
files_changed_per_commit.plot(title='number of files changed per commit', figsize=(12,8))

# Changed files and their commit messages

Here we will look at how the number of changed files per commit is distributed, and whether we can learn something about the changed files from the commit subjects.

In [None]:
# trim distribution, there are a few heavy outliers in the data as we saw above 
n_files_changed_per_commit = df.groupby('commit_hash')['filename'].agg('count')
n_files_changed_per_commit = n_files_changed_per_commit[n_files_changed_per_commit < 20]
sns.distplot(n_files_changed_per_commit, kde=False)
plt.title('distribution of number of files changed per commit')
plt.xlabel('number of changed files')

In [None]:
# trim distribution, there are a few heavy outliers in the data as we saw above 
additions_per_commit = df.groupby('commit_hash')['n_additions'].agg(np.sum)
additions_per_commit = additions_per_commit[additions_per_commit < 100]
sns.distplot(additions_per_commit)

Now, let us transform the collection of commit subject messages into a vectorized representation.

HashingVectorizer does this by tokenizing each commit message and using a hash of each token to decide the integer index in the vector where the count of that token is stored. Naturally, this produces more and more collisions as the vector dimensionality is decreased.
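
The effect of the dimensionality on collisions can be illustrated with a small experiment. This is a sketch on a hypothetical document of 50 made-up tokens, not on the kernel data; `norm=None` and `alternate_sign=False` are set only so that the raw counts are easy to inspect.

```python
from sklearn.feature_extraction.text import HashingVectorizer

# hypothetical document made of 50 distinct tokens
tokens = ["tok%d" % i for i in range(50)]
document = " ".join(tokens)

occupied = {}
for n_features in (1024, 64, 8):
    # norm=None and alternate_sign=False keep raw, non-negative counts,
    # so every occupied bucket shows up as a non-zero entry
    vectorizer = HashingVectorizer(n_features=n_features, norm=None, alternate_sign=False)
    X = vectorizer.transform([document])
    occupied[n_features] = X.nnz
    print(n_features, "buckets:", X.nnz, "occupied,",
          len(tokens) - X.nnz, "tokens sharing a bucket with another token")
```

With only 8 buckets, at most 8 of the 50 tokens can have a bucket to themselves, so most of the information about individual tokens is lost.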

First, though, we need to **deduplicate** commit subjects. Normally we would do this by grouping by the commit hash and selecting any string-compatible aggregation like *MAX* or *MIN* of the subject to get hold of the subject of each commit.
This operation takes very long in pandas, so we will assume that no two different commits share exactly the same subject.
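
A possibly cheaper alternative to the group-by, not used in this notebook, is to keep the first row per commit hash with `drop_duplicates`. The sketch below runs on a hypothetical toy frame and also shows where the "unique subjects" shortcut differs from a true per-commit deduplication.

```python
import pandas as pd

# hypothetical toy frame: one row per changed file, so subjects repeat within a commit
toy = pd.DataFrame({
    'commit_hash': ['a1', 'a1', 'b2', 'c3', 'c3', 'c3'],
    'subject': ['fix typo', 'fix typo', 'add driver', 'fix typo', 'fix typo', 'fix typo'],
})

# keep the first row per commit hash, which yields one subject per commit
subject_per_commit = toy.drop_duplicates('commit_hash')[['commit_hash', 'subject']]
print(subject_per_commit)

# compare against the shortcut of taking unique subjects
n_commits = len(subject_per_commit)
n_unique_subjects = toy['subject'].nunique()
print(n_commits, "commits vs", n_unique_subjects, "unique subjects")
```

In the toy data two different commits share the subject "fix typo", so the shortcut undercounts; how often that happens in the real history is not checked here.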

In [None]:
from sklearn.feature_extraction.text import HashingVectorizer
# we will consider each unique subject
unique_subjects = np.sort(df['subject'].unique())
print(unique_subjects)
print(unique_subjects.shape)

# now vectorize each subject
hashed_subjects = HashingVectorizer(n_features=1024).fit_transform(unique_subjects)
hashed_subjects

The hashed subjects are now a sparse matrix with 1024 columns, exactly the number of buckets that we allowed in the hashing vectorizer. We have a sparsity of approx. **1 - 4399278 / (602739 * 1024) = 99.29%**, so although the matrix is very large, it does not occupy a lot of memory.
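
The sparsity figure can be computed directly from any scipy sparse matrix through its `nnz` and `shape` attributes. The helper name `sparsity` below is made up for this sketch, and the toy matrix is hypothetical; the last lines simply redo the arithmetic with the numbers quoted above.

```python
import numpy as np
from scipy.sparse import csr_matrix

def sparsity(matrix):
    """Fraction of cells in a scipy sparse matrix that are not stored."""
    n_rows, n_cols = matrix.shape
    return 1.0 - matrix.nnz / (n_rows * n_cols)

# hypothetical toy matrix with 2 stored values in 8 cells
toy_matrix = csr_matrix(np.array([[0, 1, 0, 0], [0, 0, 0, 3]]))
print(sparsity(toy_matrix))

# the same arithmetic with the figures quoted above
notebook_sparsity = 1.0 - 4399278 / (602739 * 1024)
print(round(notebook_sparsity * 100, 2))
```

Calling `sparsity(hashed_subjects)` in the notebook should reproduce the quoted percentage, assuming the matrix has the shape and number of stored elements reported in the cell output.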

It can be interesting to do inference on the **number of lines added** by looking only at the commit message, so let's prepare some data.
We do this by grouping the number of lines added in each commit into **five** bins, which turns this into a multi-class classification problem.

In [None]:
n_additions_per_subject = df.groupby('subject')['n_additions'].agg(np.sum).reset_index().sort_values('subject')

def bucketize(row):
    # map the number of added lines to one of five size classes
    if row.n_additions > 80:
        return 'XXL'
    elif row.n_additions > 60:
        return 'XL'
    elif row.n_additions > 40:
        return 'L'
    elif row.n_additions > 20:
        return 'M'
    else:
        # 20 lines or fewer; a value of exactly 20 must not fall through to None
        return 'S'

#y = n_additions_per_subject.apply(bucketize, axis=1)

#X = hashed_subjects
#X.shape, y.shape
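
The same five-way binning can also be written with `pd.cut`, which avoids a row-wise `apply`. This is a sketch on hypothetical toy values; it assumes right-closed bins, so a value of exactly 20 lands in `'S'`, exactly 40 in `'M'`, and so on.

```python
import numpy as np
import pandas as pd

# hypothetical toy values for the summed additions per subject
toy_additions = pd.Series([0, 20, 21, 40, 60, 61, 80, 81, 500])

# right-closed bins: (-inf, 20], (20, 40], (40, 60], (60, 80], (80, inf)
size_class = pd.cut(
    toy_additions,
    bins=[-np.inf, 20, 40, 60, 80, np.inf],
    labels=['S', 'M', 'L', 'XL', 'XXL'],
)
print(size_class.tolist())
```

On the notebook's data this would be applied to `n_additions_per_subject['n_additions']` to obtain the label vector.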

Now that we have some data, let us look at the ROC AUC for a few hashing vectorizer sizes. We will use multinomial logistic regression for this classification task.
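
As a rough idea of what such a comparison could look like, here is a minimal sketch on hypothetical toy subjects and labels, not on the kernel data. It fits scikit-learn's `LogisticRegression` for several `n_features` values and computes a one-vs-rest ROC AUC. For brevity the score is computed on the training data itself, which is an assumption of this sketch; a real evaluation would need a held-out split.

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# hypothetical toy data: three size classes, each with its own typical subject
subjects = (["fix typo in comment"] * 20
            + ["add new driver support"] * 20
            + ["merge branch for next release"] * 20)
labels = np.array(["S"] * 20 + ["L"] * 20 + ["XXL"] * 20)

auc_by_size = {}
for n_features in (4, 64, 1024):
    X = HashingVectorizer(n_features=n_features).transform(subjects)
    clf = LogisticRegression(max_iter=1000).fit(X, labels)
    proba = clf.predict_proba(X)  # one probability column per class in clf.classes_
    # one-vs-rest ROC AUC, averaged over the classes (scored on the training data)
    auc_by_size[n_features] = roc_auc_score(
        labels, proba, multi_class="ovr", labels=clf.classes_)

print(auc_by_size)
```

The toy classes are trivially separable, so the numbers themselves carry no meaning; the point is only the shape of the loop over vectorizer sizes.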

# Stay tuned, will continue tomorrow.

# Time zones and their share of activity in the Linux kernel project

Most active timezones of authors by number of files changed.

In [None]:
files_changed_per_utc_offset = df.groupby('commit_utc_offset_hours')['filename'].agg('count').reset_index().sort_values('filename', ascending=False)
sns.barplot(x = 'commit_utc_offset_hours', y = 'filename', data = files_changed_per_utc_offset)

Which timezones have the most active kernel authors?

In [None]:
n_authors_by_offset = df.groupby('commit_utc_offset_hours')['author_id'].nunique().reset_index().sort_values('author_id', ascending=False)
sns.barplot(x = 'commit_utc_offset_hours', y = 'author_id', data = n_authors_by_offset)

Comparing the two plots, the number of authors per UTC offset appears to be roughly proportional to the number of files changed.
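
This visual impression could be quantified by merging the two per-offset aggregates and computing their correlation. The sketch below uses hypothetical toy numbers shaped like `files_changed_per_utc_offset` and `n_authors_by_offset`; the values are made up and say nothing about the real data.

```python
import pandas as pd

# hypothetical toy aggregates, shaped like the two per-offset frames above
toy_files = pd.DataFrame({'commit_utc_offset_hours': [-8, 0, 1, 2],
                          'filename': [400, 250, 900, 300]})
toy_authors = pd.DataFrame({'commit_utc_offset_hours': [-8, 0, 1, 2],
                            'author_id': [45, 20, 95, 28]})

# align both aggregates on the UTC offset and compute the Pearson correlation
merged = toy_files.merge(toy_authors, on='commit_utc_offset_hours')
pearson_r = merged['filename'].corr(merged['author_id'])
print(pearson_r)
```

A value close to 1 on the real aggregates would support the proportionality claim, while a noticeably lower value would suggest that some offsets have unusually productive or unusually few authors.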

Let's now have a look at the most common words used in commit subjects. For that we will split each subject on the space character (0x20) and do a word count. To stay within the runtime limit of kernels, we only use a subsample of the rows.
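
The split-and-count step described above can also be expressed with pandas string methods instead of an explicit loop. This is a sketch on hypothetical toy subjects; `explode` requires pandas 0.25 or newer.

```python
import pandas as pd

# hypothetical toy subjects standing in for df['subject']
toy_subjects = pd.Series(['Fix typo in docs', 'fix build', 'Add docs'])

word_counts = (toy_subjects
               .str.lower()       # normalise case
               .str.split(' ')    # split on the space character only
               .explode()         # one row per word
               .value_counts())   # occurrences per word, most frequent first
print(word_counts.head())
```
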

In [None]:
from collections import Counter
import operator

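n_rows = 10000  # integer, so it can be used for positional slicing
subject_words = []
# DataFrame.ix was removed in pandas 1.0; iloc takes the first n_rows rows by position
for row_number, row in df.iloc[0:n_rows].iterrows():
    ws = row['subject'].split(" ")
    subject_words.extend(w.lower() for w in ws)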

words = []
counts = []
for word, count in sorted(Counter(subject_words).items(), key=operator.itemgetter(1), reverse=True):
    words.append(word)
    counts.append(count)

In [None]:
wcdf = pd.DataFrame({'word': words, 'count': counts})
sns.barplot(y = 'word', x = 'count', data = wcdf[0:20])

It's probably a good idea to remove stop words and redo the word counts, but first let's also have a look at a nice visualization of the word collection.

In [None]:
from wordcloud import WordCloud

wordcloud = WordCloud().generate(" ".join(subject_words))

plt.figure(figsize=(12,8))
plt.imshow(wordcloud)
plt.axis("off")

It would be interesting to know whether the length of commit subjects differs by UTC offset, so let's have a look.

What is deceptive in the plot below is the low confidence in the estimate for UTC offsets around +7: if you go back up, you can see that there are barely any commits from these timezones.

In [None]:
df['subject_char_len'] = df['subject'].str.len()

In [None]:
df.groupby('commit_utc_offset_hours')['subject_char_len'].agg(np.mean).plot()
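
The caveat about sparsely populated UTC offsets can be made visible by reporting the number of rows behind each mean. This is a sketch on a hypothetical toy frame with made-up values; on the notebook's data the same `agg(['mean', 'count'])` call would be applied to `df`.

```python
import pandas as pd

# hypothetical toy rows standing in for df
toy = pd.DataFrame({
    'commit_utc_offset_hours': [1, 1, 1, 1, 7],
    'subject_char_len': [40, 50, 45, 55, 90],
})

# mean subject length together with the number of rows behind each mean
summary = toy.groupby('commit_utc_offset_hours')['subject_char_len'].agg(['mean', 'count'])
print(summary)
```

An offset whose mean rests on only a handful of rows, like +7 in the toy data, should be read with much less confidence than one backed by many rows.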

In [None]:
df['commit_activity'] = df['n_additions'] + df['n_deletions']
cmap = plt.get_cmap('viridis')
sns.heatmap(df[['commit_utc_offset_hours', 'commit_activity', 'subject_char_len']].corr(), cmap=cmap)

In [None]:
#sns.pairplot(df[['commit_utc_offset_hours', 'commit_activity', 'subject_char_len']])

Distribution of length of subject words.

In [None]:
sns.distplot([len(w) for w in words])

Do authors of the Linux kernel in different time zones use words of different lengths to describe their commits?

In [None]:
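n_rows = 10000  # integer, so it can be used for positional slicing
word_lengths = []
timezones = []

# DataFrame.ix was removed in pandas 1.0; iloc takes the first n_rows rows by position
for row_number, row in df.iloc[0:n_rows].iterrows():
    ws = row['subject'].split(" ")
    word_lengths.extend(len(w) for w in ws)
    # repeat the commit's UTC offset once per word so both lists stay aligned
    timezones.extend([row['commit_utc_offset_hours']] * len(ws))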

tz_ws = pd.DataFrame({'tz': timezones, 'word_length': word_lengths})
tz_ws.head(5)

In [None]:
tz_ws.groupby('tz')['word_length'].agg(np.mean).plot()

Let's have a look at how many distinct words we have in the sampled commit subjects, with and without stop word removal.

In [None]:
len(np.unique(subject_words))

In [None]:
from stop_words import get_stop_words

stop_words = get_stop_words('english')

filtered_subject_words = [w for w in subject_words if w not in stop_words]

len(np.unique(filtered_subject_words))


In [None]:
words = []
counts = []
for word, count in sorted(Counter(subject_words).items(), key=operator.itemgetter(1), reverse=True):
    if word in stop_words:  # reuse the list loaded in the previous cell
        continue
    words.append(word)
    counts.append(count)

wcdf = pd.DataFrame({'word': words, 'count': counts})
sns.barplot(y = 'word', x = 'count', data = wcdf[0:20])