# Text Summarization of GitHub issues
In this notebook I will create summaries of GitHub issues with the Seq2Seq model from Summarizer.py.

The model works impressively well in the end!


In [7]:
import os

import pandas as pd
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub
from collections import Counter
from langdetect import detect

import Summarizer
import summarizer_data_utils
import summarizer_model_utils

In [31]:
print(tf.__version__)

1.8.0


## The data

The dataset we will use here is the GitHub issues dataset from Kaggle. It contains over 5 million issue titles and descriptions from the year 2017. Unfortunately, we cannot use all of them because of limited resources. 
Our aim is, as before, to create a summary (here the issue title) from a given input (here the issue body). 

https://www.kaggle.com/davidshinn/github-issues

### Reading and exploring

In [11]:
# load csv file using pandas.
file_path = './github_issues.csv'
data = pd.read_csv(file_path, encoding='utf-8')
data.shape

(5332153, 3)

In [12]:
data.head()

Unnamed: 0,issue_url,issue_title,body
0,"""https://github.com/zhangyuanwei/node-images/i...",can't load the addon. issue to: https://github...,can't load the addon. issue to: https://github...
1,"""https://github.com/Microsoft/pxt/issues/2543""",hcl accessibility a11yblocking a11ymas mas4.2....,user experience: user who depends on screen re...
2,"""https://github.com/MatisiekPL/Czekolada/issue...",issue 1265: issue 1264: issue 1261: issue 1260...,┆attachments: <a href= https:& x2f;& x2f;githu...
3,"""https://github.com/MatisiekPL/Czekolada/issue...",issue 1266: issue 1263: issue 1262: issue 1259...,gitlo = github x trello\n---\nthis board is no...
4,"""https://github.com/MatisiekPL/Czekolada/issue...",issue 1288: issue 1285: issue 1284: issue 1281...,┆attachments: <a href= https:& x2f;& x2f;githu...


In [13]:
# check for missing values.
data.isnull().sum()

issue_url      0
issue_title    0
body           0
dtype: int64

In [14]:
# to make the transition from the amazon review example to this one as comfortable as possible,
# we rename the columns. the integer index is left untouched so that lookups by position keep working.
data.rename(columns={'issue_title': 'Summary', 'body': 'Text'}, inplace=True)

In [17]:
# let's see how long the texts and summaries are (in characters).
len_summaries = [len(summary) for summary in data.Summary]
len_texts = [len(text) for text in data.Text]

In [None]:
Counter(len_summaries).most_common(), Counter(len_texts).most_common()

In [20]:
# as mentioned above, we cannot use all of the training examples.
# to make training easier we only use shorter texts of similar length
# (101 to 107 characters) together with their summaries.
indices = [ind for ind, text in enumerate(data.Text) if 100 < len(text) < 108]
raw_summaries = data.Summary.iloc[indices]
raw_texts = data.Text.iloc[indices]

In [21]:
len(indices), len(raw_texts), len(raw_summaries)

(78668, 78668, 78668)
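
To get a feeling for how many examples a given length window keeps, the counting can be tried on toy data first. This is only a minimal sketch using the standard library; `count_in_window` and the toy strings are made up for illustration and are not part of the notebook's utilities.

```python
from collections import Counter


def count_in_window(texts, low, high):
    """Count texts whose character length lies strictly between low and high."""
    length_counts = Counter(len(text) for text in texts)
    return sum(n for length, n in length_counts.items() if low < length < high)


# Toy stand-in for data.Text; the real column holds millions of issue bodies.
toy_texts = ["a" * 99, "b" * 101, "c" * 105, "d" * 107, "e" * 108]
print(count_in_window(toy_texts, 100, 108))
```

Trying a few different windows this way shows quickly how much data a narrower or wider window would leave for training.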

In [23]:
# unfortunately the issues are written in different languages.
# to make learning easier for the model we only keep the English ones,
# which we detect with langdetect.
# with sufficient resources this filtering might not be necessary,
# but for our purposes it is the better choice.
en_raw_summaries = []
en_raw_texts = []

for s, t in zip(raw_summaries, raw_texts):
    try:
        lang = detect(t)
    except Exception:
        # langdetect raises when a text has no detectable features; skip those.
        continue
    if lang == 'en':
        en_raw_summaries.append(s)
        en_raw_texts.append(t)

In [25]:
len(en_raw_summaries), len(en_raw_texts)

(71508, 71508)
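
The same filtering can be wrapped in a small function that takes the detector as an argument, which makes it easy to try out without langdetect installed. A minimal sketch: `filter_english` and `fake_detect` are hypothetical names, and in the notebook one would pass `langdetect.detect` as `detect_fn`.

```python
def filter_english(summaries, texts, detect_fn):
    """Keep (summary, text) pairs whose text is detected as English.

    detect_fn is any callable mapping a string to a language code,
    for example langdetect.detect.
    """
    kept_summaries, kept_texts = [], []
    for summary, text in zip(summaries, texts):
        try:
            lang = detect_fn(text)
        except Exception:
            # The detector could not classify this text; skip the pair.
            continue
        if lang == 'en':
            kept_summaries.append(summary)
            kept_texts.append(text)
    return kept_summaries, kept_texts


# Hypothetical stand-in detector so the example runs without langdetect.
def fake_detect(text):
    if not text.strip():
        raise ValueError('no features in text')
    return 'de' if 'nicht' in text else 'en'


kept_s, kept_t = filter_english(['add a demo', 'fehler', 'empty'],
                                ['please add a demo', 'das geht nicht', '   '],
                                fake_detect)
print(kept_s, kept_t)
```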

In [26]:
for t, s in zip(en_raw_texts[:5], en_raw_summaries[:5]):
    print('Text:\n', t)
    print('Summary:\n', s, '\n\n')

Text:
 the project seems interesting you should add a demo , cuz i almost lost interest when i didnt find one
Summary:
 add a demo 


Text:
 this site should have a logo. probably nothing to special, but something small and neat would be cool.
Summary:
 add a logo 


Text:
 page 8, i would add a ; at the end of the functions to avoid the copy/past error. very cosmetic, i know ;-
Summary:
 adding a ; 


Text:
 description update element bulk docs to reflect all known limitations, expectations, and best practices
Summary:
 box - bulk 


Text:
 1.lock scan button after handling the plist. 2.alternative naming 3.version control for plist 4.ordering
Summary:
 to do list 




### Clean and prepare the data
 




In [28]:
# preprocess the texts and summaries (only the first 20000 examples are used).
# the keep_most flag controls how much of the text is kept. here we set it to False,
# i.e. we keep only letters and numbers.
# (to improve the model, this preprocessing step should be refined)
processed_texts, processed_summaries, words_counted = summarizer_data_utils.preprocess_texts_and_summaries(
    en_raw_texts[:20000],
    en_raw_summaries[:20000],
    keep_most=False)

Processing Time:  7.919019937515259


In [29]:
# take a quick look at a processed text, a processed summary and the ten most frequent words.
processed_texts[0], processed_summaries[0], words_counted[:10]

(['the',
  'project',
  'seems',
  'interesting',
  'you',
  'should',
  'add',
  'a',
  'demo',
  'cuz',
  'i',
  'almost',
  'lost',
  'interest',
  'when',
  'i',
  'didnt',
  'find',
  'one'],
 ['add', 'a', 'demo'],
 [('the', 18464),
  ('to', 16602),
  ('a', 8785),
  ('in', 7344),
  ('for', 6732),
  ('is', 6575),
  ('and', 6516),
  ('it', 5234),
  ('of', 5185),
  ('i', 4813)])
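
The code of `preprocess_texts_and_summaries` is not shown in this notebook. As a rough idea of what a "keep only letters and numbers" cleaning step can look like, here is a minimal sketch. It is an assumption about the behaviour, not the actual implementation in summarizer_data_utils, and `simple_preprocess` and `count_words` are illustrative names.

```python
import re
from collections import Counter


def simple_preprocess(text):
    """Lowercase, replace everything except letters and digits with spaces, split."""
    cleaned = re.sub(r'[^a-z0-9]+', ' ', text.lower())
    return cleaned.split()


def count_words(tokenized_texts):
    """Return (word, count) pairs sorted by descending frequency."""
    counts = Counter(word for tokens in tokenized_texts for word in tokens)
    return counts.most_common()


tokens = simple_preprocess("page 8, i would add a ; at the end")
print(tokens)
print(count_words([tokens, simple_preprocess("add a demo")])[:2])
```

Note that a rule like this also splits contractions such as "didn't" into two tokens, which is one of the places where the preprocessing could be refined.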

### Create lookup dicts

We cannot feed our network actual words, only numbers. So we first have to create our lookup dicts, in which each word gets an int value (frequent words get low values, rare words high ones). These help us to convert the texts into numbers later.

We also add special tokens. The EndOfSentence and StartOfSentence tokens are crucial for the Seq2Seq model we use later.
The pad token is needed because all summaries and texts in a batch must have the same length, and padding the shorter ones gets us there.

So we need 2 lookup dicts:
 - from word to index
 - from index to word

In [30]:
# create lookup dicts.
# most of the words appear only once.
# setting min_occurences to 2 reduces our vocabulary by more than half.
specials = ["<EOS>", "<SOS>","<PAD>","<UNK>"]
word2ind, ind2word,  missing_words = summarizer_data_utils.create_word_inds_dicts(words_counted,
                                                                                  specials = specials,
                                                                                  min_occurences=2)
print(len(word2ind), len(ind2word), len(missing_words))


15159 15159 15434
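
To illustrate the idea behind the lookup dicts, here is a minimal sketch of how such a function could work. It is an assumption, not the actual code of `create_word_inds_dicts`: the name `build_lookup_dicts` is made up, and I assume that the special tokens get the lowest indices and that words are indexed in order of decreasing frequency.

```python
def build_lookup_dicts(words_counted, specials, min_occurrences=2):
    """Build word->index and index->word dicts from (word, count) pairs.

    Special tokens get the lowest indices; words seen fewer than
    min_occurrences times are returned separately as missing words.
    """
    word2ind, ind2word, missing = {}, {}, []
    for token in specials:
        index = len(word2ind)
        word2ind[token] = index
        ind2word[index] = token
    for word, count in words_counted:
        if count < min_occurrences:
            missing.append(word)
        elif word not in word2ind:
            index = len(word2ind)
            word2ind[word] = index
            ind2word[index] = word
    return word2ind, ind2word, missing


toy_specials = ["<EOS>", "<SOS>", "<PAD>", "<UNK>"]
toy_counted = [('the', 3), ('to', 2), ('cuz', 1)]
toy_word2ind, toy_ind2word, toy_missing = build_lookup_dicts(toy_counted, toy_specials)
print(toy_word2ind['the'], toy_ind2word[5], toy_missing)
```

With four special tokens the most frequent word ends up at index 4 in this sketch.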


### Pretrained embeddings

Optionally we can use pretrained word embeddings. These often speed up training and improve accuracy.
I tried two options: GloVe embeddings and embeddings from TensorFlow Hub.
The ones from TensorFlow Hub worked better, so we use those. 

In [23]:
# glove_embeddings_path = './glove.6B.300d.txt'
# embedding_matrix_save_path = './embeddings/my_embedding_github.npy'
# emb = summarizer_data_utils.create_and_save_embedding_matrix(word2ind,
#                                                              glove_embeddings_path,
#                                                              embedding_matrix_save_path)

In [24]:
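# the embeddings from tf.hub.
# the words are passed in the order of word2ind, so row i of the result is the embedding
# of the i-th key (this assumes word2ind was filled in index order).
embed = hub.Module("https://tfhub.dev/google/Wiki-words-250/1")
emb = embed(list(word2ind.keys()))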

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(tf.tables_initializer())
    embedding = sess.run(emb)

In [25]:
embedding.shape

In [26]:
np.save('./embeddings/my_embedding_github.npy', embedding)
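
For the GloVe option, the important point is that row i of the embedding matrix has to hold the vector of the word with index i. Here is a minimal sketch of that alignment, assuming the plain-text GloVe format (one word followed by its vector per line). `build_embedding_matrix` is a made-up name and not the code of `create_and_save_embedding_matrix`. Plain lists are used to keep the example free of dependencies, and words without a pretrained vector get small random values, which is one common choice.

```python
import io
import random


def build_embedding_matrix(word2ind, glove_lines, dim, seed=0):
    """Build a list-of-lists embedding matrix whose row i belongs to index i.

    Words without a pretrained vector of the right size get small random values.
    """
    rng = random.Random(seed)
    pretrained = {}
    for line in glove_lines:
        parts = line.rstrip().split(' ')
        pretrained[parts[0]] = [float(x) for x in parts[1:]]

    matrix = [None] * len(word2ind)
    for word, index in word2ind.items():
        vector = pretrained.get(word)
        if vector is None or len(vector) != dim:
            vector = [rng.uniform(-0.1, 0.1) for _ in range(dim)]
        matrix[index] = vector
    return matrix


# Toy file in the GloVe text format.
toy_glove = io.StringIO("add 0.1 0.2 0.3\ndemo 0.4 0.5 0.6\n")
toy_vocab = {'<UNK>': 0, 'add': 1, 'demo': 2}
toy_matrix = build_embedding_matrix(toy_vocab, toy_glove, dim=3)
print(toy_matrix[1], toy_matrix[2], len(toy_matrix[0]))
```

In the notebook the result would be converted with `np.array` before saving it with `np.save`.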

### Convert text and summaries
As mentioned above, we cannot feed the words directly to our network; we first have to convert them to numbers. That is what we do here. We also add the SOS and EOS tokens to the summaries.

In [27]:
# converts words in texts and summaries to indices
converted_texts, unknown_words_in_texts = summarizer_data_utils.convert_to_inds(processed_texts,
                                                                                word2ind,
                                                                                eos = False)

In [28]:
converted_summaries, unknown_words_in_summaries = summarizer_data_utils.convert_to_inds(processed_summaries,
                                                                                        word2ind,
                                                                                        eos = True,
                                                                                        sos = True)

In [29]:
converted_texts[0]

[4,
 84,
 230,
 2241,
 29,
 21,
 18,
 6,
 529,
 13653,
 13,
 2306,
 1828,
 2994,
 22,
 13,
 5531,
 147,
 92]
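
The conversion itself is one dictionary lookup per word. A minimal sketch under assumptions: `to_indices` and `to_text` are illustrative names, and I assume that unknown words are mapped to `<UNK>`, which may differ from what `convert_to_inds` actually does.

```python
def to_indices(tokens, word2ind, sos=False, eos=False):
    """Map tokens to indices, replacing unknown words with <UNK>."""
    unknown = [w for w in tokens if w not in word2ind]
    inds = [word2ind.get(w, word2ind['<UNK>']) for w in tokens]
    if sos:
        inds = [word2ind['<SOS>']] + inds
    if eos:
        inds = inds + [word2ind['<EOS>']]
    return inds, unknown


def to_text(inds, ind2word):
    """Map indices back to a space-separated string."""
    return ' '.join(ind2word[i] for i in inds)


demo_word2ind = {'<EOS>': 0, '<SOS>': 1, '<PAD>': 2, '<UNK>': 3,
                 'add': 4, 'a': 5, 'demo': 6}
demo_ind2word = {i: w for w, i in demo_word2ind.items()}

demo_inds, demo_unknown = to_indices(['add', 'a', 'logo'], demo_word2ind,
                                     sos=True, eos=True)
print(demo_inds, demo_unknown)
print(to_text(demo_inds, demo_ind2word))
```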

In [None]:
# seems to have worked well. 
for t, s in zip(converted_texts[:5], converted_summaries[:5]):
    print(summarizer_data_utils.convert_inds_to_text(t, ind2word),
          summarizer_data_utils.convert_inds_to_text(s, ind2word))
    print('\n\n')
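
The converted sequences still have different lengths. Before they can be put into one batch, the shorter ones have to be filled up with the pad token. A minimal sketch of that step; `pad_batch` is an illustrative name, and padding on the right is an assumption about how the batches are built.

```python
def pad_batch(sequences, pad_index):
    """Right-pad every sequence in the batch to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_index] * (max_len - len(seq)) for seq in sequences]


toy_batch = [[1, 4, 5, 6, 0], [1, 4, 0]]
padded = pad_batch(toy_batch, pad_index=2)
print(padded)
```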


## The model

Now we can build and train our model. First we define the hyperparameters we want to use. Then we create our Summarizer and call .build_graph(), which, as the name suggests, builds the computation graph. 
Then we can train the model using .train().

After training we can try out our model using .infer().

### Training
Unfortunately I do not have the resources to search for the best hyperparameters, but these do pretty well. 

I trained the model for about 40 epochs. The training loss and the validation loss were both still declining.
I chose to use 90% of the data as training set and 10% as validation set. We could also have used sklearn's train_test_split here. 

In [32]:
# model hyperparameters
num_layers_encoder = 2
num_layers_decoder = 2
rnn_size_encoder = 500
rnn_size_decoder = 500

batch_size = 256
epochs = 200
clip = 5
keep_probability = 0.8
learning_rate = 0.0005
max_lr=0.005
learning_rate_decay_steps = 500
learning_rate_decay = 0.90


pretrained_embeddings_path = './embeddings/my_embedding_github.npy'
summary_dir = './tensorboard/github_issues'


use_cyclic_lr = True
inference_targets=True
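
How Summarizer uses `learning_rate`, `max_lr`, `learning_rate_decay_steps` and `learning_rate_decay` internally is not shown in this notebook. To make the two ideas concrete, here is a minimal sketch of an exponential decay and of a triangular cyclic learning rate. The function names and the `step_size` of 500 are assumptions for illustration.

```python
import math


def exponential_decay(base_lr, step, decay_steps, decay_rate):
    """Smooth exponential decay: base_lr * decay_rate ** (step / decay_steps)."""
    return base_lr * decay_rate ** (step / decay_steps)


def triangular_cyclic_lr(step, base_lr, max_lr, step_size):
    """Triangular cyclic schedule: rise linearly from base_lr to max_lr
    over step_size steps, then fall back over the next step_size steps."""
    cycle = math.floor(1 + step / (2 * step_size))
    x = abs(step / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1 - x)


for step in (0, 250, 500, 1000):
    print(step,
          exponential_decay(0.0005, step, 500, 0.90),
          triangular_cyclic_lr(step, 0.0005, 0.005, 500))
```

With these values the cyclic schedule starts at 0.0005, reaches 0.005 after 500 steps and is back at 0.0005 after 1000 steps, while the decayed rate shrinks by a factor of 0.90 every 500 steps.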


In [33]:
len(converted_summaries)

20000

In [34]:
round(20000*0.9)

18000
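
Instead of hard-coding 18000, the split point can be computed from the data. A minimal sketch that keeps the original order, like the slicing used for training in this notebook; `split_train_validation` is an illustrative name.

```python
def split_train_validation(inputs, targets, train_fraction=0.9):
    """Split paired lists into train and validation parts, preserving order."""
    if len(inputs) != len(targets):
        raise ValueError('inputs and targets must have the same length')
    n_train = round(len(inputs) * train_fraction)
    return (inputs[:n_train], targets[:n_train],
            inputs[n_train:], targets[n_train:])


toy_inputs = list(range(10))
toy_targets = [i * 10 for i in toy_inputs]
train_x, train_y, val_x, val_y = split_train_validation(toy_inputs, toy_targets)
print(len(train_x), len(val_x))
```

Unlike sklearn's train_test_split, this sketch does not shuffle, so it only gives a fair validation set if the examples are not sorted in some meaningful way.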

In [None]:
# build graph and train the model 
summarizer_model_utils.reset_graph()
summarizer = Summarizer.Summarizer(word2ind,
                                   ind2word,
                                   save_path='./models/github_issues/my_model',
                                   mode='TRAIN',
                                   num_layers_encoder = num_layers_encoder,
                                   num_layers_decoder = num_layers_decoder,
                                   rnn_size_encoder = rnn_size_encoder,
                                   rnn_size_decoder = rnn_size_decoder,
                                   batch_size = batch_size,
                                   clip = clip,
                                   keep_probability = keep_probability,
                                   learning_rate = learning_rate,
                                   max_lr=max_lr,
                                   learning_rate_decay_steps = learning_rate_decay_steps,
                                   learning_rate_decay = learning_rate_decay,
                                   epochs = epochs,
                                   pretrained_embeddings_path = pretrained_embeddings_path,
                                   use_cyclic_lr = use_cyclic_lr,)
#                                    summary_dir = summary_dir)           

summarizer.build_graph()
summarizer.train(converted_texts[:18000], 
                 converted_summaries[:18000],
                 validation_inputs=converted_texts[18000:],
                 validation_targets=converted_summaries[18000:])

# hidden training output.
# both train and validation loss decrease nicely.

### Inference

Now we can use our trained model to create summaries. 

In [36]:
summarizer_model_utils.reset_graph()
summarizer = Summarizer.Summarizer(word2ind,
                                   ind2word,
                                   './models/github_issues/my_model',
                                   'INFER',
                                   num_layers_encoder = num_layers_encoder,
                                   num_layers_decoder = num_layers_decoder,
                                   batch_size = len(converted_texts[:50]),
                                   clip = clip,
                                   keep_probability = 1.0,
                                   learning_rate = 0.0,
                                   beam_width = 5,
                                   rnn_size_encoder = rnn_size_encoder,
                                   rnn_size_decoder = rnn_size_decoder,
                                   inference_targets = True,
                                   pretrained_embeddings_path = pretrained_embeddings_path)

summarizer.build_graph()
preds = summarizer.infer(converted_texts[:50],
                         restore_path =  './models/github_issues/my_model',
                         targets = converted_summaries[:50])




Loaded pretrained embeddings.
Graph built.
INFO:tensorflow:Restoring parameters from ./models/github_issues/my_model
Done.
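
With `beam_width = 5` the decoder keeps the five most probable partial summaries at every step instead of only the single best one. The following is a minimal, framework-free sketch of beam search on a toy model. It is not the decoder used inside Summarizer: `beam_search` and `toy_model` are made-up names, the probabilities are invented, and the sketch does no length normalisation.

```python
import math


def beam_search(next_log_probs, sos, eos, beam_width=5, max_len=10):
    """Return the highest-scoring token sequence found by beam search.

    next_log_probs(prefix) must return a dict {token: log_probability}
    for the next token given the tokens generated so far.
    """
    beams = [([sos], 0.0)]  # (tokens, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for token, log_p in next_log_probs(tokens).items():
                candidates.append((tokens + [token], score + log_p))
        # Keep only the beam_width best hypotheses of this step.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_width]:
            if tokens[-1] == eos:
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
        if not beams:
            break
    best_tokens, _ = max(finished + beams, key=lambda c: c[1])
    return best_tokens


def toy_model(prefix):
    """Invented next-token probabilities; unknown prefixes always end the sequence."""
    table = {
        ('<SOS>',): {'add': 0.6, 'fix': 0.4},
        ('<SOS>', 'add'): {'a': 0.5, 'more': 0.5},
        ('<SOS>', 'fix'): {'bug': 0.9, '<EOS>': 0.1},
    }
    probs = table.get(tuple(prefix), {'<EOS>': 1.0})
    return {tok: math.log(p) for tok, p in probs.items()}


print(beam_search(toy_model, '<SOS>', '<EOS>', beam_width=1))
print(beam_search(toy_model, '<SOS>', '<EOS>', beam_width=2))
```

In this toy example a beam width of 1 (greedy decoding) commits to "add" because it is the most probable first word, while a beam width of 2 also follows "fix" and finds the sequence with the higher overall probability.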


In [37]:
# show results. 
summarizer_model_utils.sample_results(preds,
                                      ind2word,
                                      word2ind,
                                      converted_summaries[:50],
                                      converted_texts[:50])




 ----------------------------------------------------------------------------------------------------
Actual Text:
the project seems interesting you should add a demo cuz i almost lost interest when i didnt find one

Actual Summary:
add a demo

Created Summary:
add a demo demo





 ----------------------------------------------------------------------------------------------------
Actual Text:
this site should have a logo probably nothing to special but something small and neat would be cool

Actual Summary:
add a logo

Created Summary:
add a logo logo





 ----------------------------------------------------------------------------------------------------
Actual Text:
page 8 i would add a at the end of the functions to avoid the copy past error very cosmetic i know

Actual Summary:
adding a

Created Summary:
add more more





 ----------------------------------------------------------------------------------------------------
Actual Text:
description update element bulk docs to 

## Conclusion

Generally I am really impressed by how well the model works. 
We used only a limited amount of data, trained for a limited amount of time and used nearly random hyperparameters, and the model still delivers good results. 

However, we are clearly overfitting the training data and the model does not generalize perfectly.
Sometimes the summaries the model creates are good, sometimes bad, sometimes they are better than the original ones and sometimes they are just really funny.


Therefore it would be really interesting to scale it up and see how it performs. 

To sum up, I am impressed by seq2seq models. They perform well on many different tasks and I look forward to exploring more possible applications, such as speech recognition.