### Loosely following https://www.tensorflow.org/tutorials/text/text_generation, with parts of https://www.tensorflow.org/tutorials/text/nmt_with_attention

- abstractive summarization is a supervised task, so we want input/output pairs (which means we need `<start>`/`<end>` markers and masking/padding)


- **ideally**, we would also want a **char-level model** because sentences, numbers and capital letters are important too so let's see if we can do that

 - if we do a word-level model we would have to limit the vocabulary at some point
 
 
- also, while the Shakespeare language model is just trained on the whole text concatenated, we **need to split the data into pairs of text/summary** like in the NMT example
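
To make the char-level vs word-level trade-off concrete, here is a small sketch on a made-up sentence. The sentence and the regex-based word splitter are illustrative assumptions, not part of this notebook's pipeline.

```python
import re

# Made-up sample sentence, for illustration only
sample = "She gets $30,000 every time. Kelly Holmes ran the 800m."

# Char-level: every character (capitals, digits, punctuation) is a token
char_tokens = list(sample)

# Word-level: a naive splitter into words and single punctuation marks
word_tokens = re.findall(r"\w+|[^\w\s]", sample)

char_vocab = sorted(set(char_tokens))
word_vocab = sorted(set(word_tokens))

print("char-level:", len(char_tokens), "tokens,", len(char_vocab), "distinct")
print("word-level:", len(word_tokens), "tokens,", len(word_vocab), "distinct")
```

Char-level sequences are much longer, but the vocabulary is bounded by the character set. A word-level vocabulary keeps growing with the corpus and has to be capped at some point.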

In [1]:
import sys
from platform import python_version

sys.executable,python_version()

('/home/felipe/tf2-venv/bin/python3', '3.6.9')

In [114]:
import tensorflow as tf

import numpy as np
import os
import time
import pandas as pd
import re

from tensorflow.keras.preprocessing.text import Tokenizer

In [3]:
# general options
pd.set_option('display.max_columns',100)
pd.set_option('display.max_rows',100)

# for text columns
pd.set_option('display.max_colwidth',1000)
pd.set_option('display.html.use_mathjax',False)

In [4]:
df = pd.read_csv(
    "/home/felipe/tensorflow-sandbox/base-supervised-text-lstm/data/processed/df.csv",
    escapechar="\\")

In [5]:
df.sample(2,random_state=10)

Unnamed: 0,File,Text,Summary,source
1769,070.txt,"Record fails to lift lacklustre meet\n\nYelena Isinbayeva may have produced another world pole vault record, but her achievement could not hide the fact it was not the best meet we have ever seen in Birmingham.\n\nAnd hey, there are not many meets that go by without the Russian breaking a world record.\n\nApparently, Isinbayeva has cleared five metres in training and I would just love her to put us out of our misery and have a go at it rather than extending the indoor record by one centimetre at a time. Athletics to me is all about pushing the barriers and being the best you can, and I would like to see her have a go at 5m in competition. Mind you, every time she breaks the record she gets $30,000 so she can afford to be deliberate about it. World records aside, I thought it was a very encouraging evening's work for Kelly Holmes. She looked good and was very positive. Agnes Samaria, who came second, is in very good shape and is in the world's top three 800m runners this season. 
Yes...","Yelena Isinbayeva may have produced another world pole vault record, but her achievement could not hide the fact it was not the best meet we have ever seen in Birmingham.From an international perspective, I thought Meseret Defar was disappointing in the 3,000m, but I don't think the pace-making was great.World records aside, I thought it was a very encouraging evening's work for Kelly Holmes.She had a go but just could not hang in there.From a British point of view, Sarah Claxton's victory in the 60m hurdles was the best thing to come out of the meet.She looked good and was very positive.But he has only just come over from the USA, so he may not be that sharp and I still think he is in great shape.Apparently, Isinbayeva has cleared five metres in training and I would just love her to put us out of our misery and have a go at it rather than extending the indoor record by one centimetre at a time.Yes, Samaria let Kelly get away, but there was no coming back over the last 200m as Kell...",sport
90,185.txt,"US bank 'loses' customer details\n\nThe Bank of America has revealed it has lost computer tapes containing account details of more than one million customers who are US federal employees.\n\nSeveral members of the US Senate are among those affected, who could now be vulnerable to identity theft. Senate sources say the missing tapes may have been stolen from a plane by baggage handlers. The bank gave no details of how the records disappeared, but said they had probably not been misused. Customers' accounts were being monitoring and account holders would be notified if any ""unusual activity"" was detected, bank officials said.\n\nBank of America said the tapes went missing in December while being shipped to a back-up data centre. ""We, with federal law authorities, have done a very robust, thorough investigation on this and neither we nor they would make the statement lightly that we believe those tapes to be lost,"" Alexandra Tower, a spokeswoman for the North Carolina-based bank, told...","New York Senator Charles Schumer said he was told by the Senate Rules Committee that the tapes were probably stolen from a commercial plane.Bank of America said the tapes went missing in December while being shipped to a back-up data centre.But although there was no evidence of criminal activity, the bank said, the Secret Service - a federal agency whose brief includes investigations of serious financial crime - is said to be looking into the loss.Customers' accounts were being monitoring and account holders would be notified if any ""unusual activity"" was detected, bank officials said.The Bank of America has revealed it has lost computer tapes containing account details of more than one million customers who are US federal employees.",business


In [68]:
df_preproc = df.copy()

> periods in the Summary text don't have a space after them, let's fix that

In [69]:
def add_space_after_periods(input_str):
    
    # only periods that follow a word character and precede a capital letter
    return re.sub(r'(\w)\.([A-Z])',r'\1. \2',input_str)
    
add_space_after_periods("foo.Bar")

'foo. Bar'
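
A few edge cases are worth checking for this regex. The sketch below repeats the function and compares it with a hypothetical stricter variant (`add_space_after_periods_strict` is my own name, not something used in this notebook) that only fires when the period follows a lowercase letter or digit.

```python
import re

def add_space_after_periods(input_str):
    # same regex as in the cell above
    return re.sub(r'(\w)\.([A-Z])', r'\1. \2', input_str)

def add_space_after_periods_strict(input_str):
    # hypothetical variant: require a lowercase letter or digit before the
    # period, so that acronyms such as "U.S.A" are left alone
    return re.sub(r'(?<=[a-z0-9])\.(?=[A-Z])', '. ', input_str)

print(repr(add_space_after_periods("time.Yes")))          # 'time. Yes'
print(repr(add_space_after_periods("3.5 million")))       # '3.5 million' (unchanged)
print(repr(add_space_after_periods("the U.S.A")))         # 'the U. S.A'
print(repr(add_space_after_periods_strict("the U.S.A")))  # 'the U.S.A'
```

The stricter variant trades one error for another: it would miss a sentence boundary after an all-caps word (for example `USA.The`), so which one is preferable depends on the data.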

In [71]:
def add_start_end_markers(input_str):
    w = input_str.strip()

    # add a start and an end token to the text
    # so that the model knows when to start and stop predicting.
    w = '<start> ' + w + ' <end>'    
    
    return w

In [72]:
def remove_header_line(input_str):
    
    return re.sub(r'^[^\n]+\n\n','',input_str)
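
A quick check of what this regex removes, on made-up strings (the sample texts below are placeholders, not rows from the dataset). Without `re.MULTILINE`, `^` only matches at the very start of the string, so only the first line and the blank line after it are dropped.

```python
import re

def remove_header_line(input_str):
    # same regex as in the cell above
    return re.sub(r'^[^\n]+\n\n', '', input_str)

# placeholder article: headline, blank line, then two paragraphs
article = "Some headline\n\nFirst paragraph.\n\nSecond paragraph."
print(repr(remove_header_line(article)))

# a text with no blank line after its first line is returned unchanged
print(repr(remove_header_line("No header here.")))
```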

In [73]:
df_preproc['Summary_preproc'] = df_preproc['Summary'].apply(add_space_after_periods)
df_preproc['Summary_preproc'] = df_preproc['Summary_preproc'].apply(add_start_end_markers)

In [74]:
df_preproc["Text_preproc"] = df_preproc["Text"].apply(remove_header_line)
df_preproc["Text_preproc"] = df_preproc["Text_preproc"].apply(add_start_end_markers)

In [76]:
df_preproc.sample(2,random_state=10)[['Text_preproc','Summary_preproc']]

Unnamed: 0,Text_preproc,Summary_preproc
1769,"<start> Yelena Isinbayeva may have produced another world pole vault record, but her achievement could not hide the fact it was not the best meet we have ever seen in Birmingham.\n\nAnd hey, there are not many meets that go by without the Russian breaking a world record.\n\nApparently, Isinbayeva has cleared five metres in training and I would just love her to put us out of our misery and have a go at it rather than extending the indoor record by one centimetre at a time. Athletics to me is all about pushing the barriers and being the best you can, and I would like to see her have a go at 5m in competition. Mind you, every time she breaks the record she gets $30,000 so she can afford to be deliberate about it. World records aside, I thought it was a very encouraging evening's work for Kelly Holmes. She looked good and was very positive. Agnes Samaria, who came second, is in very good shape and is in the world's top three 800m runners this season. Yes, Samaria let Kelly get away, bu...","<start> Yelena Isinbayeva may have produced another world pole vault record, but her achievement could not hide the fact it was not the best meet we have ever seen in Birmingham. From an international perspective, I thought Meseret Defar was disappointing in the 3,000m, but I don't think the pace-making was great. World records aside, I thought it was a very encouraging evening's work for Kelly Holmes. She had a go but just could not hang in there. From a British point of view, Sarah Claxton's victory in the 60m hurdles was the best thing to come out of the meet. She looked good and was very positive. But he has only just come over from the USA, so he may not be that sharp and I still think he is in great shape. Apparently, Isinbayeva has cleared five metres in training and I would just love her to put us out of our misery and have a go at it rather than extending the indoor record by one centimetre at a time. 
Yes, Samaria let Kelly get away, but there was no coming back over the l..."
90,"<start> The Bank of America has revealed it has lost computer tapes containing account details of more than one million customers who are US federal employees.\n\nSeveral members of the US Senate are among those affected, who could now be vulnerable to identity theft. Senate sources say the missing tapes may have been stolen from a plane by baggage handlers. The bank gave no details of how the records disappeared, but said they had probably not been misused. Customers' accounts were being monitoring and account holders would be notified if any ""unusual activity"" was detected, bank officials said.\n\nBank of America said the tapes went missing in December while being shipped to a back-up data centre. ""We, with federal law authorities, have done a very robust, thorough investigation on this and neither we nor they would make the statement lightly that we believe those tapes to be lost,"" Alexandra Tower, a spokeswoman for the North Carolina-based bank, told Time magazine. But although...","<start> New York Senator Charles Schumer said he was told by the Senate Rules Committee that the tapes were probably stolen from a commercial plane. Bank of America said the tapes went missing in December while being shipped to a back-up data centre. But although there was no evidence of criminal activity, the bank said, the Secret Service - a federal agency whose brief includes investigations of serious financial crime - is said to be looking into the loss. Customers' accounts were being monitoring and account holders would be notified if any ""unusual activity"" was detected, bank officials said. The Bank of America has revealed it has lost computer tapes containing account details of more than one million customers who are US federal employees. <end>"


In [96]:
def create_dataset(input_df, num_examples=None):
    
    pairs = []
    
    for (i,(index, row)) in enumerate(input_df.iterrows()):
        
        # stop once exactly num_examples rows have been collected
        if (num_examples is not None) and (i >= num_examples):
            break
        
        pairs.append( [row["Text_preproc"], row["Summary_preproc"] ])   
              
        
    return zip(*pairs)
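
The `zip(*pairs)` idiom transposes a list of `[text, summary]` pairs into one tuple of texts and one tuple of summaries. A minimal sketch without pandas, using made-up rows and a hypothetical `create_pairs` helper that keeps exactly `num_examples` rows:

```python
# made-up rows standing in for DataFrame rows
rows = [
    {"Text_preproc": "<start> text one <end>", "Summary_preproc": "<start> sum one <end>"},
    {"Text_preproc": "<start> text two <end>", "Summary_preproc": "<start> sum two <end>"},
    {"Text_preproc": "<start> text three <end>", "Summary_preproc": "<start> sum three <end>"},
]

def create_pairs(rows, num_examples=None):
    pairs = []
    for i, row in enumerate(rows):
        # i counts from 0, so stopping at i == num_examples keeps num_examples rows
        if num_examples is not None and i >= num_examples:
            break
        pairs.append((row["Text_preproc"], row["Summary_preproc"]))
    # transpose [(t1, s1), (t2, s2)] into (t1, t2), (s1, s2)
    return zip(*pairs)

texts, summaries = create_pairs(rows, num_examples=2)
print(texts)
print(summaries)
```

Because the result is unpacked into two names, an empty input would raise a `ValueError` at the unpacking step.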

In [98]:
texts, summaries = create_dataset(df_preproc)

In [119]:
sample_text = texts[-1]
sample_summary = summaries[-1]

sample_text,sample_summary

('<start> High-speed net connections in the UK are proving more popular than ever.\n\nBT reports that more people signed up for broadband in the last three months than in any other quarter. The 600,000 connections take the total number of people in the UK signing up for broadband from BT to almost 3.3 million. Nationally more than 5 million browse the net via broadband. Britain now has among the highest number of broadband connections throughout the whole of Europe.\n\nAccording to figures gathered by industry watchdog, Ofcom, the growth means that the UK has now surpassed Germany in terms of broadband users per 100 people. The UK total of 5.3 million translates into 7.5 connections per 100 people, compared to 6.7 in Germany and 15.8 in the Netherlands. The numbers of people signing up to broadband include those that get their service direct from BT or via the many companies that re-sell BT lines under their own name. Part of the surge in people signing up was due to BT stretching the 

In [101]:
all_text = " ".join(df_preproc["Text_preproc"].values)

In [102]:
all_text = all_text + " ".join(df_preproc["Summary_preproc"].values)

In [103]:
len(all_text)

7264578

In [104]:
all_text[-100:]

'f getting broadband - beyond 6km. Nationally more than 5 million browse the net via broadband. <end>'

In [25]:
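# NOTE: this looks like a stale cell from an earlier run: `text_as_int` is not
# defined anywhere in this notebook, and the output below predates the preprocessing above
print('{} ---- characters mapped to int ---- > {}'.format(repr(all_text[:13]), text_as_int[:13]))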

'Q&A: Malcolm ' ---- characters mapped to int ---- > [48  7 32 27  1 44 61 72 63 75 72 73  1]


In [115]:
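# char-level tokenizer: keep case and keep every character
# (`filters` expects a string; it is not applied when char_level=True)
tk = Tokenizer(num_words=None, char_level=True, oov_token='UNK', lower=False, filters='')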

In [116]:
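# fit_on_texts expects a list of texts; wrapping the string avoids treating
# every single character as a separate document
tk.fit_on_texts([all_text])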

In [122]:
tk.word_index

{'UNK': 1,
 ' ': 2,
 'e': 3,
 't': 4,
 'a': 5,
 'o': 6,
 'i': 7,
 'n': 8,
 's': 9,
 'r': 10,
 'h': 11,
 'l': 12,
 'd': 13,
 'c': 14,
 'u': 15,
 'm': 16,
 'p': 17,
 'g': 18,
 'f': 19,
 'w': 20,
 'y': 21,
 'b': 22,
 '.': 23,
 'v': 24,
 ',': 25,
 'k': 26,
 '"': 27,
 'T': 28,
 'S': 29,
 'M': 30,
 '0': 31,
 '-': 32,
 "'": 33,
 'B': 34,
 '\n': 35,
 'I': 36,
 'A': 37,
 'C': 38,
 'x': 39,
 '1': 40,
 'P': 41,
 '2': 42,
 'D': 43,
 'H': 44,
 '<': 45,
 '>': 46,
 'L': 47,
 'W': 48,
 'E': 49,
 'F': 50,
 'G': 51,
 'R': 52,
 'U': 53,
 'j': 54,
 'N': 55,
 'O': 56,
 'J': 57,
 '5': 58,
 'K': 59,
 '3': 60,
 '4': 61,
 'q': 62,
 '9': 63,
 'z': 64,
 ')': 65,
 '(': 66,
 'V': 67,
 '6': 68,
 '7': 69,
 '%': 70,
 '8': 71,
 ':': 72,
 '£': 73,
 'Y': 74,
 '$': 75,
 ';': 76,
 '?': 77,
 'Z': 78,
 'Q': 79,
 'X': 80,
 '/': 81,
 '&': 82,
 '!': 83,
 '[': 84,
 ']': 85,
 '#': 86,
 '+': 87,
 '*': 88,
 '`': 89,
 '=': 90,
 '@': 91}
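
The mapping above reserves index 1 for the out-of-vocabulary token and numbers the characters by descending frequency, leaving 0 free for padding. The sketch below reproduces that idea in plain Python; `build_char_index` and `encode` are hypothetical helpers, not Keras's actual implementation, and the tiny corpus is made up.

```python
from collections import Counter

def build_char_index(text, oov_token="UNK"):
    # index 0 is left unused (padding), index 1 is the OOV token,
    # then characters follow in order of descending frequency
    counts = Counter(text)
    index = {oov_token: 1}
    for i, (ch, _) in enumerate(counts.most_common(), start=2):
        index[ch] = i
    return index

def encode(text, index, oov_token="UNK"):
    # unknown characters fall back to the OOV index
    return [index.get(ch, index[oov_token]) for ch in text]

corpus = "<start> aaa bb c <end>"
index = build_char_index(corpus)

marker_ids = encode("<start>", index)
print(len(marker_ids), marker_ids)
print(encode("Z", index))
```

One consequence for this notebook: with a char-level tokenizer, `<start>` and `<end>` are not single tokens. `<` and `>` are ordinary characters (45 and 46 above), so `<start>` becomes a sequence of 7 ids. If single-token markers turn out to be needed, one option would be to use reserved single characters that never occur in the corpus.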

In [137]:
input_tensor = tf.keras.preprocessing.sequence.pad_sequences(
    tk.texts_to_sequences(df_preproc["Text_preproc"]),
    padding='post')
output_tensor = tf.keras.preprocessing.sequence.pad_sequences(
    tk.texts_to_sequences(df_preproc["Summary_preproc"]),
    padding='post')

In [138]:
input_tensor.shape, output_tensor.shape

((2224, 25466), (2224, 12448))
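
With `padding='post'` and no `maxlen`, every sequence is padded with zeros up to the longest one, which is why a single very long article makes every row 25466 wide. A plain-Python sketch of the idea follows (`pad_post` is a hypothetical helper; note that it truncates at the end, which corresponds to `truncating='post'`, whereas `pad_sequences` truncates from the start by default):

```python
def pad_post(sequences, maxlen=None, value=0):
    # pad at the end up to maxlen (default: length of the longest sequence)
    if maxlen is None:
        maxlen = max(len(s) for s in sequences)
    padded = []
    for s in sequences:
        s = list(s)[:maxlen]  # truncate at the end if too long
        padded.append(s + [value] * (maxlen - len(s)))
    return padded

seqs = [[5, 3, 8], [7], [2, 2, 2, 2, 9]]

full = pad_post(seqs)
capped = pad_post(seqs, maxlen=4)

# mask: 1 for real tokens, 0 for padding (index 0 is never a real character)
mask = [[int(t != 0) for t in row] for row in full]

print(full)
print(capped)
print(mask)

# size of the input tensor above, assuming the default int32 dtype
input_bytes = 2224 * 25466 * 4
print(input_bytes / 1e6, "MB")
```

At roughly 227 MB for the inputs alone, capping `maxlen` may be worth considering before training.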

In [145]:
NUM_EXAMPLES = input_tensor.shape[0]
BATCH_SIZE = 32
STEPS_PER_EPOCH = NUM_EXAMPLES // BATCH_SIZE
EMBEDDING_DIM = 100
# +1 because index 0 is reserved for padding and never appears in word_index
VOCAB_SIZE = len(tk.word_index) + 1
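
A quick check of the arithmetic behind these constants, using the numbers printed earlier in this notebook (2224 examples, largest character index 91). The lowercase names exist only in this sketch, and the embedding table shape assumes these constants are later passed to an embedding layer.

```python
num_examples = 2224   # rows in input_tensor
batch_size = 32
largest_index = 91    # highest value in tk.word_index ('@')
embedding_dim = 100

# integer division: only full batches are counted
steps_per_epoch = num_examples // batch_size

# indices run from 0 (padding) to largest_index, hence the +1
vocab_size = largest_index + 1

print(steps_per_epoch)              # 69
print(vocab_size)                   # 92
print((vocab_size, embedding_dim))  # (92, 100), shape of an embedding table
```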

In [149]:
dataset = tf.data.Dataset.from_tensor_slices((input_tensor,output_tensor)).shuffle(NUM_EXAMPLES)
dataset = dataset.batch(BATCH_SIZE,drop_remainder=True)

In [150]:
example_input_batch, example_target_batch = next(iter(dataset))
example_input_batch.shape, example_target_batch.shape

(TensorShape([32, 25466]), TensorShape([32, 12448]))
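
What `shuffle` followed by `batch(..., drop_remainder=True)` does can be sketched in plain Python. `batches` is a hypothetical generator; unlike `tf.data`, which by default reshuffles on every iteration, it shuffles once with a fixed seed.

```python
import random

def batches(examples, batch_size, drop_remainder=True, seed=0):
    # shuffle a copy, then cut it into consecutive batches
    order = list(examples)
    random.Random(seed).shuffle(order)
    for start in range(0, len(order), batch_size):
        batch = order[start:start + batch_size]
        # an incomplete final batch is skipped when drop_remainder is set
        if drop_remainder and len(batch) < batch_size:
            break
        yield batch

toy = list(range(10))
result = list(batches(toy, batch_size=4))
print(result)

# for this notebook: examples left out of each epoch
leftover = 2224 - (2224 // 32) * 32
print(leftover)  # 16
```

Dropping the remainder keeps every batch at the same shape `(32, ...)`, at the cost of 16 examples per epoch. Since the data is shuffled, a different set of examples is left out each time when reshuffling is enabled.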