## Resources

In TensorFlow:
  - https://towardsdatascience.com/text-summarization-with-amazon-reviews-41801c2210b
  - https://github.com/Currie32/Text-Summarization-with-Amazon-Reviews/blob/master/summarize_reviews.ipynb

Using Attention: 
  - https://github.com/alesee/abstractive-text-summarization/blob/master/abstractive-text-summ.ipynb
  - https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html
  - https://www.kaggle.com/manuhg/summarizing-text-with-amazon-reviews (**see this one**)

## Getting Data

In [0]:
# !pip install kaggle

In [0]:
!mkdir .kaggle

In [0]:
!mkdir ~/.kaggle

In [0]:
import json
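# Placeholder credentials: substitute your own Kaggle username and API key, and never commit a real key.
token = {'username': 'YOUR_KAGGLE_USERNAME', 'key': 'YOUR_KAGGLE_API_KEY'}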
with open('/content/.kaggle/kaggle.json', 'w') as file:
    json.dump(token, file)

In [0]:
!cp /content/.kaggle/kaggle.json ~/.kaggle/kaggle.json

In [0]:
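# The braces are passed through literally, so the stored path is "{/content}" (see the output below).
# The download step passes -p /content explicitly, so it is unaffected.
!kaggle config set -n path -v{/content}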

- path is now set to: {/content}


In [0]:
!chmod 600 /root/.kaggle/kaggle.json
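
The credential steps above (create the directory, write `kaggle.json`, restrict it to mode 600) can also be done from a single Python cell. A minimal sketch, assuming placeholder credentials and a hypothetical helper named `write_kaggle_token`; for the demo it writes into a temporary directory rather than `~/.kaggle`:

```python
import json
import os
import stat
import tempfile
from pathlib import Path

def write_kaggle_token(config_dir, username, key):
    """Write kaggle.json into config_dir and make it readable by the owner only."""
    config_dir = Path(config_dir)
    config_dir.mkdir(parents=True, exist_ok=True)   # like `mkdir -p`
    token_path = config_dir / "kaggle.json"
    token_path.write_text(json.dumps({"username": username, "key": key}))
    os.chmod(token_path, 0o600)                     # like `chmod 600`
    return token_path

# Placeholder values; a temp dir stands in for ~/.kaggle in this demo
demo_dir = tempfile.mkdtemp()
token_path = write_kaggle_token(demo_dir, "YOUR_KAGGLE_USERNAME", "YOUR_KAGGLE_API_KEY")
mode = stat.S_IMODE(os.stat(token_path).st_mode)
print(token_path.name, oct(mode))
```

In the notebook itself, `config_dir` would be `Path.home() / ".kaggle"`.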

In [0]:
# !kaggle datasets list

In [0]:
!kaggle datasets download -d snap/amazon-fine-food-reviews -p /content

Downloading amazon-fine-food-reviews.zip to /content
 99% 249M/251M [00:05<00:00, 36.4MB/s]
100% 251M/251M [00:05<00:00, 48.4MB/s]


In [0]:
!unzip \*.zip

Archive:  amazon-fine-food-reviews.zip
  inflating: Reviews.csv             
  inflating: database.sqlite         
  inflating: hashes.txt              


## EDA

In [0]:
import os
import re
import string
import pickle
from collections import Counter

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from tqdm import tqdm
from sklearn.model_selection import train_test_split

import spacy
import torch
import torch.nn as nn
import torch.utils.data
from torch.nn import functional as F
from torch.autograd import Variable
# Field, BucketIterator and TabularDataset are the classic torchtext API
# (moved to torchtext.legacy in 0.9 and later removed).
import torchtext
from torchtext import data, vocab
from torchtext.data import Field, BucketIterator, TabularDataset
from keras.preprocessing.sequence import pad_sequences

# The 'en' shortcut only works on older spaCy releases; newer ones need the full model name (e.g. 'en_core_web_sm').
en = spacy.load('en')

Using TensorFlow backend.


In [0]:
df = pd.read_csv('Reviews.csv')[['Summary', 'Text']]
df.head()

|   | Summary | Text |
|---|---|---|
| 0 | Good Quality Dog Food | I have bought several of the Vitality canned d... |
| 1 | Not as Advertised | Product arrived labeled as Jumbo Salted Peanut... |
| 2 | "Delight" says it all | This is a confection that has been around a fe... |
| 3 | Cough Medicine | If you are looking for the secret ingredient i... |
| 4 | Great taffy | Great taffy at a great price. There was a wid... |


In [0]:
df.shape

(568454, 2)

In [0]:
df = df.dropna().reset_index(drop=True)

In [0]:
df.shape

(568427, 2)

## Tokenization

- https://mlexplained.com/2018/02/08/a-comprehensive-tutorial-to-torchtext/
- https://spacy.io/usage/linguistic-features#tokenization
- http://anie.me/On-Torchtext/

Text preprocessing in general:
- https://www.kdnuggets.com/2019/04/text-preprocessing-nlp-machine-learning.html
- https://medium.com/@makcedward/nlp-pipeline-word-tokenization-part-1-4b2b547e6a3

In [0]:
### All tokenization steps combined
# () is a capturing group; [] matches any single character listed inside the brackets
re_punc = re.compile(r"""(['!"#$%&()*+,\-./:;<=>?@\[\]^_`{|}~\\])""")  # punctuation to pad with spaces
re_apos = re.compile(r"n ' t ")    # n't, after punctuation has been spaced out
re_bpos = re.compile(r" ' s ")     # 's, after punctuation has been spaced out
re_mult_space = re.compile(r" +")  # collapse runs of spaces into one
replacements = {'\t':' ','\n':' ','\r':' ','\x0b':' ','\x0c':' '}

def simple_toks(sent):
    """Tokenize on whitespace and punctuation, lowercase, and mark case with xup/xwup."""
    sent = "".join([replacements.get(c,c) for c in sent])
    sent = re_punc.sub(r" \1 ", sent) # \1 is the captured punctuation character
    sent = re_apos.sub(r" n't ", sent)
    sent = re_bpos.sub(r" 's ", sent)
    sent = re_mult_space.sub(' ', sent)
    ret = ['xbos']                    # beginning-of-sequence marker
    for w in sent.split():
        if w.isupper():
            ret.append('xwup')        # whole word was upper case
        elif w[0].isupper():
            ret.append('xup')         # word was capitalized
        ret.append(w.lower())
    return ret

# Alternative: spaCy's tokenizer, en.tokenizer(text)
# TODO: handle repeated punctuation, e.g. represent '!!!' as 3 * '!'

In [0]:
### Testing
# Named `sample` rather than `str` to avoid shadowing the built-in
sample = 'aysha/is//a***good(person89<3) don\'t enter here       PLEASE!!!'
simple_toks(sample)

['xbos',
 'aysha',
 '/',
 'is',
 '/',
 '/',
 'a',
 '*',
 '*',
 '*',
 'good',
 '(',
 'person89',
 '<',
 '3',
 ')',
 'do',
 "n't",
 'enter',
 'here',
 'xwup',
 'please',
 '!',
 '!',
 '!']
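
The note about representing `!!!` as 3 * `!` is left unimplemented in the tokenizer. One possible sketch, assuming a hypothetical marker token `xrep` followed by the repeat count (neither is part of the original tokenizer), applied to a token list like the one `simple_toks` returns:

```python
from itertools import groupby
import string

PUNCT = set(string.punctuation)

def collapse_repeats(tokens, marker="xrep"):
    """Replace a run of n >= 2 identical punctuation tokens with [marker, str(n), token]."""
    out = []
    for tok, run in groupby(tokens):
        n = len(list(run))
        if n > 1 and tok in PUNCT:
            out.extend([marker, str(n), tok])
        else:
            out.extend([tok] * n)   # leave single tokens and repeated words untouched
    return out

tokens = ['xbos', 'a', '*', '*', '*', 'good', 'xwup', 'please', '!', '!', '!']
print(collapse_repeats(tokens))
# ['xbos', 'a', 'xrep', '3', '*', 'good', 'xwup', 'please', 'xrep', '3', '!']
```

This mirrors the existing `xup`/`xwup` convention: the information about repetition is kept as extra tokens while the sequence gets shorter.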

In [0]:
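# Note: unk_token and pad_token are both 'xunk', so padding and unknown words share one index and cannot be told apart later.
TEXT = Field(sequential=True, tokenize=simple_toks, lower=True, unk_token='xunk', pad_token='xunk')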

In [0]:
data_fields = [('Summary', TEXT), ('Text', TEXT)]

In [0]:
train, valid = train_test_split(df, test_size=0.2); len(train), len(valid)

(454741, 113686)
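
The two sizes follow from how scikit-learn rounds a fractional `test_size`: the test set gets the ceiling of `test_size * n`, and the training set gets the remainder. A quick check of that arithmetic with the row count from `df.shape` (the helper name `split_sizes` is only for illustration):

```python
import math

def split_sizes(n_samples, test_size):
    """Sizes produced by a fractional test_size: ceil for test, remainder for train."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test

print(split_sizes(568427, 0.2))
# (454741, 113686)
```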

In [0]:
train.to_csv("train.csv", index=False)
valid.to_csv("valid.csv", index=False)

Loading the datasets takes some time, because every row is tokenized with `simple_toks`.

In [0]:
train_ds, valid_ds = data.TabularDataset.splits(
    path='./', train='train.csv', validation='valid.csv',
    format='csv', fields=data_fields, skip_header=True)

Pretrained vector aliases available in torchtext (`glove.6B.100d` is the one used below):

- **charngram.100d**
- fasttext.en.300d
- fasttext.simple.300d
- glove.42B.300d 
- glove.840B.300d
- glove.twitter.27B.25d
- glove.twitter.27B.50d
- glove.twitter.27B.100d
- glove.twitter.27B.200d
- glove.6B.50d
- *glove.6B.100d*
- glove.6B.200d 
- glove.6B.300d

In [0]:
train_ds[0].Text

In [0]:
TEXT.build_vocab(train_ds, vectors="glove.6B.100d", max_size=60000 , min_freq=5)

In [0]:
len(TEXT.vocab.itos) , len(TEXT.vocab.stoi)

(40860, 40860)
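
The vocabulary ends up with 40,860 entries, below the `max_size` cap of 60,000, which suggests the `min_freq=5` filter is the binding limit here. A simplified sketch of that logic in plain Python (not torchtext's actual implementation; `build_vocab` and the toy corpus are made up for illustration):

```python
from collections import Counter

def build_vocab(token_lists, specials=("xunk",), max_size=None, min_freq=1):
    """Return (itos, stoi): specials first, then tokens by descending frequency."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    for s in specials:
        counts.pop(s, None)   # specials are placed first regardless of count
    # Sort by frequency (descending), breaking ties alphabetically
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    itos = list(specials)
    for tok, freq in ordered:
        too_rare = freq < min_freq
        full = max_size is not None and len(itos) >= max_size + len(specials)
        if too_rare or full:
            break
        itos.append(tok)
    stoi = {tok: i for i, tok in enumerate(itos)}
    return itos, stoi

corpus = [["xbos", "great", "taffy"], ["xbos", "great", "price"], ["xbos", "great", "coffee"]]
itos, stoi = build_vocab(corpus, max_size=60000, min_freq=2)
print(itos)
# Out-of-vocabulary tokens fall back to the index of 'xunk'
encoded = [stoi.get(tok, stoi["xunk"]) for tok in ["xbos", "great", "taffy"]]
print(encoded)
```

With `min_freq=2`, the three words that occur once are dropped, so the vocabulary stops well short of `max_size`, the same pattern as in the notebook.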

In [0]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
device

'cuda'

In [0]:
??BucketIterator

In [0]:
train_iter = BucketIterator(train_ds, batch_size=32, sort_key=lambda x: len(x.Text), shuffle=True, device=device, sort_within_batch=False)

In [0]:
valid_iter = BucketIterator(valid_ds, batch_size=32, sort_key=lambda x: len(x.Text), shuffle=False, device=device, sort_within_batch=False)

To build both iterators in one call, use `BucketIterator.splits`.
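
The `sort_key` passed to both iterators lets them group reviews of similar length, so each batch needs less padding. A rough plain-Python illustration of that idea (not torchtext's implementation; `make_buckets` and `padding_needed` are hypothetical helpers working on made-up lengths):

```python
import random

def make_buckets(lengths, batch_size, sort_by_length):
    """Split example indices into batches, optionally sorting by length first."""
    idx = list(range(len(lengths)))
    if sort_by_length:
        idx.sort(key=lambda i: lengths[i])
    return [idx[i:i + batch_size] for i in range(0, len(idx), batch_size)]

def padding_needed(lengths, batches):
    """Count pad tokens when each batch is padded to its longest example."""
    total = 0
    for batch in batches:
        longest = max(lengths[i] for i in batch)
        total += sum(longest - lengths[i] for i in batch)
    return total

rng = random.Random(0)
lengths = [rng.randint(5, 200) for _ in range(1000)]   # made-up review lengths

unsorted_pad = padding_needed(lengths, make_buckets(lengths, 32, sort_by_length=False))
sorted_pad = padding_needed(lengths, make_buckets(lengths, 32, sort_by_length=True))
print(unsorted_pad, sorted_pad)
```

The second number should be far smaller than the first, which is the motivation for bucketing by `len(x.Text)`.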

In [0]:
TEXT.vocab.itos[9]

'xbos'

In [0]:
TEXT.vocab.itos[:50]

['xunk',
 'xup',
 '.',
 'xwup',
 'the',
 ',',
 'i',
 'and',
 'a',
 'xbos',
 'it',
 'to',
 'of',
 '/',
 'is',
 'this',
 '>',
 '<',
 'br',
 '!',
 'for',
 'in',
 'my',
 '-',
 'that',
 'but',
 'you',
 'not',
 'with',
 'have',
 'are',
 'was',
 "'",
 'they',
 "'s",
 "n't",
 'as',
 'on',
 'like',
 '"',
 'so',
 'good',
 'these',
 'great',
 ')',
 'them',
 '(',
 'be',
 'coffee',
 'taste']

In [0]:
lst = next(iter(train_iter)).Text[:,0]

In [0]:
lst

tensor([   9,    1, 3418,  144,   43,    7,   14,   41,   20,   26,    2,    1,
         510,   11,  265,    1,   78,  603,  339,   10,   37,  665,    7,  426,
        1527,    2,    1,   22,  151,   14,    1,  324,    5,   25,   22, 6674,
         514,  167,    4,  272,    1, 1672,    1, 1415,   59,   46,  103,    3,
           6,  690,    1,   78,   72,  216,   11,  128,   50,  648,   44,    2,
           1,  548,   50,   19,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,   

In [0]:
' '.join([TEXT.vocab.itos[i] for i in lst])

'xbos xup orgain tastes great and is good for you . xup glad to see xup amazon finally having it on subscribe and save program . xup my favorite is xup vanilla , but my teenager son loves the new xup cafe xup mocha flavor ( which xwup i hope xup amazon will add to their product list ) . xup fantastic product ! xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xunk xun
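
Most of the decoded string is padding. A small sketch that trims trailing pad indices before joining, assuming index 0 is the pad token as in the output above (`decode` is a hypothetical helper and the tiny `itos` list is made up):

```python
def decode(indices, itos, pad_index=0):
    """Map indices back to tokens, dropping the run of padding at the end."""
    end = len(indices)
    while end > 0 and indices[end - 1] == pad_index:
        end -= 1
    return " ".join(itos[i] for i in indices[:end])

itos = ["xunk", "xup", ".", "great", "xbos", "product"]
decoded = decode([4, 1, 3, 0, 5, 2, 0, 0, 0], itos)
print(decoded)
# xbos xup great xunk product .
```

With the notebook's tensor this would be called as `decode(lst.tolist(), TEXT.vocab.itos)`. Because padding and unknown words share the token `xunk` here, an unknown word at the very end of a review would be trimmed as well.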

In [0]:
# from_pretrained is a classmethod, so no throwaway nn.Embedding(n_embed, embed_dim) instance is needed
# embedding = nn.Embedding.from_pretrained(TEXT.vocab.vectors)
# OR
# self.encoder_embedding_layer.weight.data.copy_(self.pre_trained_vector.weight.data)
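
Both options rely on `TEXT.vocab.vectors` having one row per vocabulary entry, in `itos` order. A plain-Python sketch of that alignment, assuming tokens without a pretrained vector get an all-zero row (as far as I know this is torchtext's default, but treat it as an assumption); `build_embedding_matrix` and the toy vectors are made up:

```python
def build_embedding_matrix(itos, pretrained, dim):
    """Row i holds the vector for itos[i]; tokens without a pretrained vector get zeros."""
    return [list(pretrained.get(tok, [0.0] * dim)) for tok in itos]

itos = ["xunk", "xup", "the", "coffee"]
pretrained = {"the": [0.1, 0.2, 0.3], "coffee": [0.4, 0.5, 0.6]}   # toy 3-d vectors
matrix = build_embedding_matrix(itos, pretrained, dim=3)
for tok, row in zip(itos, matrix):
    print(tok, row)
```

Special tokens such as `xunk`, `xup`, `xwup` and `xbos` do not exist in GloVe, so under this assumption their rows start at zero and only become useful if the embedding layer is trained.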

https://www.kaggle.com/naraque/abstractive-text-summarization -- up to the encoder