## Background

The goal of the MLapp project is to provide the following:

1. Illustrate how to build machine learning powered developer tools using the [GitHub Api](https://developer.github.com/v3/) and Flask.  We would like to show data scientists how to build exciting data products using machine learning on the GitHub marketplace, that developers can use.  Specifically, we will build an illustrative data product that will automatically label issues.  

2. Gather feedback and iterate 


The scope of this notebook is to addresses part of goal #1, by illustrating how we can acquire a dataset of GitHub issue labels and train a classifier.  

The top issues on GitHub by count are illustrated in [this spreadsheet](https://docs.google.com/spreadsheets/d/1NPacnVsyZMBneeewvPGhCx512A1RPYf8ktDN_RpKeS4/edit?usp=sharing).  To keep things simple, we will build a model to classify an issue as a `bug`, `feature` or `question`.  We use hueristics to collapse a set of issue labels into these three categories, which can be viewed [in this query](https://console.cloud.google.com/bigquery?sq=123474043329:01abf8866144486f932c756730ddaff1).  

The heueristic for these class labels are contained within the below case statement:

```{sql}
  CASE when labels like '%bug%' and labels not like '%not bug%' then True else False end as Bug_Flag,
  CASE when labels like '%feature%' or labels like '%enhancement%' or labels like '%improvement%' or labels like '%request%' then True else False end as Feature_Flag,
  CASE when labels like '%question%' or labels like '%discussion%' then True else False end as Question_Flag,
```
    the above case statement is located within [this query](https://console.cloud.google.com/bigquery?sq=123474043329:01abf8866144486f932c756730ddaff1)
    

The following alternative projects were tried before this task that we did not pursue further:
 - Transfer learning using the [GitHub Issue Summarizer](https://github.com/hamelsmu/Seq2Seq_Tutorial) to enable the prediction of custom labels on existing repos.  Found that this did not work well as there is a considerable amount of noise with regards to custom labels in repositories and often not enough data to adequately predict this.  
 - Tried to classify more than the above three classes, however the human-labeled issues are very subjective and it is not clear what is a question vs. a bug.  
 - Tried multi-label classification since labels can co-occur.  There is very little overlap between `bug`, `feature` and `question` labels, so we decided to simplify things and make this a multi-class classificaiton problem instead.  


Note: the code in this notebook was executed on a [p3.8xlarge](https://aws.amazon.com/ec2/instance-types/p3/) instance on AWS.

## Outline 

This notebook will follow these steps:

1. Download and partition dataset
2. Pre-process dataset
2. Build model architecture & Train Model
3. Evaluate Model


# Download and Partition Dataset

In [1]:
from sklearn.model_selection import train_test_split
import pandas as pd

pd.set_option('max_colwidth', 1000)

In [2]:
df = pd.concat([pd.read_csv(f'https://storage.googleapis.com/codenet/issue_labels/00000000000{i}.csv.gz')
                for i in range(1)])

for i, row in df.iterrows():
    if row['class_int'] == 2:
        df.at[i,'class_int'] = 1

#split data into train/test
traindf, testdf = train_test_split(df, test_size=.15)

traindf.to_pickle('traindf.pkl')
testdf.to_pickle('testdf.pkl')

#print out stats about shape of data
print(f'Train: {traindf.shape[0]:,} rows {traindf.shape[1]:,} columns')
print(f'Test: {testdf.shape[0]:,} rows {testdf.shape[1]:,} columns')

Train: 270,624 rows 10 columns
Test: 47,758 rows 10 columns


### Discussion of the data

In [3]:
# preview data
traindf.head(3)

Unnamed: 0,url,repo,title,body,num_labels,labels,c_bug,c_feature,c_question,class_int
118616,"""https://github.com/HypothesisWorks/hypothesis/issues/164""",HypothesisWorks/hypothesis,better stateful testing,"hypothesis’s stateful testing is extremely powerful and if you’re not using it you should be. but… i’ll be the first to admit it’s a little hard to use. the generic api is fine. it’s quite low level but it’s easy to use. the rule based stuff should be easy to use, but there’s a bit too much boiler plate and bundles of variables are a weird second class citizen. what i’d like to be able to do is make them just behave like strategies, where the strategy’s evaluation is deferred until execution time, so you can use them as if they were any other strategy and everything should just work. i would also like the syntax for using it to be more unified with the normal given syntax. ideally it would be nice if every given invocation had an implicit state machine associated with it so as to unify the two approaches.",1,"[""enhancement""]",False,True,False,1
53683,"""https://github.com/Nick-Lucas/EntryPoint/issues/29""",Nick-Lucas/EntryPoint,commands method.invoke will destroy an exception's stacktrace on re-throw,\r fix may be here here:\r http://stackoverflow.com/questions/57383/in-c-how-can-i-rethrow-innerexception-without-losing-stack-trace,1,"[""bug"", ""bug""]",True,False,False,0
173911,"""https://github.com/spinnaker/spinnaker/issues/2978""",spinnaker/spinnaker,orca/clouddriver triggers multiple createservergroup actions though only one was specified in the pipeline definition,"issue summary:\r \r we’re sometimes seeing pipelines fail after 10 sec due to two clouddriver threads both attempting the exact same atomic k8s operation, one of which fails and blows up the pipeline . \r \r the specific error that's raised is that create replica set <application-vxx> in default for account brisket failed: replicasets.extensions \ <application-vxx>\ already exists . this replica set never exists before the pipeline execution - it is created as a side effect of this pipeline, which promptly fails thereafter citing this error. \r \r on investigating further, we saw in the actual pipeline that kato was attempting to run two createservergroup pipeline stages from the same deploy stage parent, even though the pipeline configuration only specifies one. this feels like a bug in spinnaker 1.7.5.\r \r rerunning the pipeline eventually makes the problem disappear. when successful, our pipelines only have one createservergroup stage. \r \r this has been affecting all ...",5,"[""bug"", ""component/orca"", ""provider/kubernetes-v1"", ""stale"", ""to-be-closed""]",True,False,False,0


Discussion of the data:  

- url:        url where you can find this issue
- repo:       owner/repo name
- title:      title of the issue
- body:       body of the issue, not including comments
- num_labels: number of issue labels
- labels:     an array of labels applied a user manually applied to the issue (represented as a string)
- c_bug:      boolean flag that indicates if the issue label corresponds to a bug
- c_feature:  boolean flag that indicates if the issue label corresponds to a feature
- c_question: boolean flag that indicates if the issue label corresponds to a question
- class_int:  integer between 0 - 2 that corresponds to the class label.  **0=bug, 1=feature, 2=question**

### Summary Statistics

Class frequency **0=bug, 1=feature, 2=question**

In [4]:
print(traindf.groupby('class_int').size())
print(testdf.groupby('class_int').size())

class_int
0    121221
1    149403
dtype: int64
class_int
0    21295
1    26463
dtype: int64


number of unique repos

In [5]:
print(f' Avg # of issues per repo: {len(traindf) / traindf.repo.nunique():.1f}')
print(f" Avg # of issues per org: {len(traindf) / traindf.repo.apply(lambda x: x.split('/')[-1]).nunique():.1f}")

 Avg # of issues per repo: 2.6
 Avg # of issues per org: 2.8


Most popular repos by # of issues:

 - `pcnt` = percent of total issues in the dataset
 - `count` = number of issues in the dataset for that repo

In [6]:
pareto_df = pd.DataFrame({'pcnt': df.groupby('repo').size() / len(df), 'count': df.groupby('repo').size()})
pareto_df.sort_values('pcnt', ascending=False).head(20)

Unnamed: 0_level_0,pcnt,count
repo,Unnamed: 1_level_1,Unnamed: 2_level_1
Microsoft/vscode,0.005145,1638
rancher/rancher,0.002349,748
MicrosoftDocs/azure-docs,0.00206,656
godotengine/godot,0.001894,603
ansible/ansible,0.001866,594
hashicorp/terraform,0.001624,517
kubernetes/kubernetes,0.001504,479
lionheart/openradar-mirror,0.001432,456
dart-lang/sdk,0.001159,369
elastic/kibana,0.001156,368


# Pre-Process Data

To process the raw text data, we will use [ktext](https://github.com/hamelsmu/ktext)

In [7]:
from ktext.preprocess import processor
import dill as dpickle
import numpy as np

Using TensorFlow backend.


Clean, tokenize, and apply padding / truncating such that each document length = 75th percentile for the dataset.
Retain only the top keep_n words in the vocabulary and set the remaining words to 1 which will become common index for rare words.

**Warning:** the below block of code can take a long time to execute.

#### Learn the vocabulary from the training dataset

In [8]:
train_body_raw = traindf.body.tolist()
train_title_raw = traindf.title.tolist()

# Clean, tokenize, and apply padding / truncating such that each document length = 75th percentile for the dataset.
#  also, retain only the top keep_n words in the vocabulary and set the remaining words
#  to 1 which will become common index for rare words 

# process the issue body data
body_pp = processor(heuristic_pct_padding=.75, keep_n=8000)
train_body_vecs = body_pp.fit_transform(train_body_raw)

# process the title data
title_pp = processor(heuristic_pct_padding=.75, keep_n=4500)
train_title_vecs = title_pp.fit_transform(train_title_raw)

 See full histogram by insepecting the `document_length_stats` attribute.
 See full histogram by insepecting the `document_length_stats` attribute.


#### Apply transformations to Test Data

In [9]:

test_body_raw = testdf.body.tolist()
test_title_raw = testdf.title.tolist()

test_body_vecs = body_pp.transform_parallel(test_body_raw)
test_title_vecs = title_pp.transform_parallel(test_title_raw)



#### Extract Labels

Add an additional dimension to the end to facilitate compatibility with Keras.

In [10]:
train_labels = np.expand_dims(traindf.class_int.values, -1)
test_labels = np.expand_dims(testdf.class_int.values, -1)

#### Check shapes

In [11]:
# the number of rows in data for the body, title and labels should be the same for both train and test partitions
assert train_body_vecs.shape[0] == train_title_vecs.shape[0] == train_labels.shape[0]
assert test_body_vecs.shape[0] == test_title_vecs.shape[0] == test_labels.shape[0]

#### Save pre-processors and data to disk

In [12]:
# Save the preprocessor
with open('body_pp.dpkl', 'wb') as f:
    dpickle.dump(body_pp, f)

with open('title_pp.dpkl', 'wb') as f:
    dpickle.dump(title_pp, f)

# Save the processed data
np.save('train_title_vecs.npy', train_title_vecs)
np.save('train_body_vecs.npy', train_body_vecs)
np.save('test_body_vecs.npy', test_body_vecs)
np.save('test_title_vecs.npy', test_title_vecs)
np.save('train_labels.npy', train_labels)
np.save('test_labels.npy', test_labels)

# Build Architecture & Train Model

In [13]:
import tensorflow as tf
from tensorflow.keras.utils import multi_gpu_model
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, GRU, Dense, Embedding, BatchNormalization, Concatenate
from tensorflow.keras.optimizers import Adam
import numpy as np
import dill as dpickle

In [14]:
print(tf.__version__)

1.12.0


Load the data and shape information

In [15]:
with open('title_pp.dpkl', 'rb') as f:
    title_pp = dpickle.load(f)

with open('body_pp.dpkl', 'rb') as f:
    body_pp = dpickle.load(f)
    
#load the training data and labels
train_body_vecs = np.load('train_body_vecs.npy')
train_title_vecs = np.load('train_title_vecs.npy')
train_labels = np.load('train_labels.npy')

#load the test data and labels
test_body_vecs = np.load('test_body_vecs.npy')
test_title_vecs = np.load('test_title_vecs.npy')
test_labels = np.load('test_labels.npy')


issue_body_doc_length = train_body_vecs.shape[1]
issue_title_doc_length = train_title_vecs.shape[1]

body_vocab_size = body_pp.n_tokens + 1
title_vocab_size = title_pp.n_tokens + 1

num_classes = len(set(train_labels[:, 0]))
assert num_classes == 2

#### Build Model Architecture

We did very little hyperparameter tuning.  Keeping model simple as possible.

In [16]:
body_input = Input(shape=(issue_body_doc_length,), name='Body-Input')
title_input = Input(shape=(issue_title_doc_length,), name='Title-Input')

body = Embedding(body_vocab_size, 50, name='Body-Embedding')(body_input)
title = Embedding(title_vocab_size, 50, name='Title-Embedding')(title_input)

body = BatchNormalization()(body)
body = GRU(100, name='Body-Encoder')(body)

title = BatchNormalization()(title)
title = GRU(75, name='Title-Encoder')(title)

x = Concatenate(name='Concat')([body, title])
x = BatchNormalization()(x)
out = Dense(num_classes, activation='softmax')(x)

model = Model([body_input, title_input], out)

model.compile(optimizer=Adam(lr=0.001), 
              loss='sparse_categorical_crossentropy', 
              metrics=['accuracy'])

In [17]:
model.summary()

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
Body-Input (InputLayer)         (None, 105)          0                                            
__________________________________________________________________________________________________
Title-Input (InputLayer)        (None, 10)           0                                            
__________________________________________________________________________________________________
Body-Embedding (Embedding)      (None, 105, 50)      400100      Body-Input[0][0]                 
__________________________________________________________________________________________________
Title-Embedding (Embedding)     (None, 10, 50)       225100      Title-Input[0][0]                
__________________________________________________________________________________________________
batch_norm

#### Train Model

In [18]:
from tensorflow.keras.callbacks import CSVLogger, ModelCheckpoint

script_name_base = 'Issue_Label_v1'
csv_logger = CSVLogger('{:}.log'.format(script_name_base))
model_checkpoint = ModelCheckpoint('{:}_best_model.hdf5'.format(script_name_base),
                                   save_best_only=True)

In [19]:
batch_size = 900
epochs = 4
history = model.fit(x=[train_body_vecs, train_title_vecs], 
                   y=train_labels,
                    batch_size=batch_size,
                    epochs=epochs,
                    validation_data=[(test_body_vecs, test_title_vecs), test_labels], 
                    callbacks=[csv_logger, model_checkpoint])

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Train on 270624 samples, validate on 47758 samples
Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


# Evaluate Model

Compute a confusion matrix

In [20]:
from sklearn.metrics import classification_report, confusion_matrix
from tensorflow.keras.models import load_model

best_model = load_model('Issue_Label_v1_best_model.hdf5')

y_pred = np.argmax(best_model.predict(x=[test_body_vecs, test_title_vecs],
                                      batch_size=15000),
                   axis=1)

# get labels
y_test = test_labels[:, 0]

print(confusion_matrix(y_test, y_pred))

[[17593  3702]
 [ 4183 22280]]


# Make Predictions

In [2]:
from tensorflow.keras.models import load_model
import dill as dpickle

In [4]:
#load the best model
best_model = load_model('Issue_Label_v1_best_model.hdf5')

#load the pre-processors
with open('title_pp.dpkl', 'rb') as f:
    title_pp = dpickle.load(f)

with open('body_pp.dpkl', 'rb') as f:
    body_pp = dpickle.load(f)

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
Using TensorFlow backend.


In [7]:
def makePrediction(title, body):
    titleVec = title_pp.transform([body])
    bodyVec = body_pp.transform([title])
    class_names=['bug', 'feature_request/question']
    probs = best_model.predict(x=[bodyVec, titleVec]).tolist()[0]
    print({k:v for k,v in zip(class_names, probs)})

In [8]:
makePrediction('maybe error', 'I am not sure if it is my fault but it does not work. Am I missing something?')

{'bug': 0.7058060169219971, 'feature_request/question': 0.2941940128803253}


In [9]:
makePrediction('nothing works', 'It does` not work, I get bad errors')

{'bug': 0.9252945780754089, 'feature_request/question': 0.0747053399682045}


In [10]:
makePrediction('Add new button', 'Please add a new button')

{'bug': 0.012789001688361168, 'feature_request/question': 0.9872109889984131}


In [None]:
makePrediction('Help me pleas', 'Please add a new button')