## Sentence Compression

`Task:`
- Develop models using RNN, GRU and LSTM to compress sentences.
- Analyze the efficiency of these models.
- Compress the following input sentences (15 questions) and manually compare the output with the given summary. Fine-tune the hyperparameters to generate a summary that includes all the information.
    - Input: New jobless numbers are a bit of a mixed bag for President Obama and his re-election bid.
    - Summary: New jobless numbers a mixed bag for Obama

### Import necessary libraries and read the data

In [1]:
import numpy as np
import pandas as pd
import json
import matplotlib.pyplot as plt
import re
from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split
from keras.layers import Input, LSTM, Embedding, Dense
from keras.models import Model

In [63]:
data = []
with open('train.jsonl','r') as file:
    for line in file:
        data.append(json.loads(line))

### Data Cleaning and Preprocessing

In [51]:
data[0]

{'id': 'gigaword-train-0',
 'text': "australia 's current account deficit shrunk by a record #.## billion dollars -lrb- #.## billion us -rrb- in the june quarter due to soaring commodity prices , figures released monday showed .",
 'summary': 'australian current account deficit narrows sharply'}

In [64]:
data = pd.DataFrame(data)

In [65]:
data.head()

| | id | text | summary |
|---|---|---|---|
| 0 | gigaword-train-0 | australia 's current account deficit shrunk by... | australian current account deficit narrows sha... |
| 1 | gigaword-train-1 | at least two people were killed in a suspected... | at least two dead in southern philippines blast |
| 2 | gigaword-train-2 | australian shares closed down #.# percent mond... | australian stocks close down #.# percent |
| 3 | gigaword-train-3 | south korea 's nuclear envoy kim sook urged no... | envoy urges north korea to restart nuclear dis... |
| 4 | gigaword-train-4 | south korea on monday announced sweeping tax r... | skorea announces tax cuts to stimulate economy |


In [66]:
data.drop(['id'], axis=1, inplace=True)

In [55]:
data.shape

(1000000, 2)

The full dataset (1,000,000 rows) is too large to process in a reasonable time, so I will use only the first 15,000 rows.

In [67]:
# .copy() avoids pandas' SettingWithCopyWarning when columns are reassigned later
data = data[:15000].copy()
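
If load time or memory is a concern, the subset could also be taken while reading the file, so the full million rows never need to be parsed. The following is a minimal sketch, assuming the same one-JSON-object-per-line layout; `read_first_n` is a hypothetical helper that is not part of the original notebook, and the sample records are made up for illustration.

```python
import io
import json
from itertools import islice


def read_first_n(lines, n):
    """Parse at most n JSON lines from an iterable of strings (e.g. an open file)."""
    return [json.loads(line) for line in islice(lines, n)]


# Stand-in for an open 'train.jsonl' handle; these records are invented.
fake_file = io.StringIO(
    '{"id": "a", "text": "first text", "summary": "first"}\n'
    '{"id": "b", "text": "second text", "summary": "second"}\n'
    '{"id": "c", "text": "third text", "summary": "third"}\n'
)
rows = read_first_n(fake_file, 2)
print(len(rows), rows[0]["summary"])
```

With the real file, the call would look like `with open('train.jsonl') as f: rows = read_first_n(f, 15000)`, and `pd.DataFrame(rows)` would then give the same 15,000-row frame.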

In [68]:
data['text'][0]

"australia 's current account deficit shrunk by a record #.## billion dollars -lrb- #.## billion us -rrb- in the june quarter due to soaring commodity prices , figures released monday showed ."

In [69]:
# lowercase
data['text'] = data['text'].str.lower()

In [70]:
# find special characters
special_chars = data['text'].apply(lambda x: re.findall(r'\W', x))
all_special_chars = [char for sublist in special_chars for char in sublist if char != ' ']

In [71]:
all_special_chars = list(set(all_special_chars)) 
print(all_special_chars, len(all_special_chars))

['.', '`', ':', '&', '=', "'", '\\', '/', ';', '+', '\xa0', ',', '#', '-'] 14


In [72]:
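# remove special characters
def remove_unwanted_symbols(text):
    # Each symbol is deleted outright, so runs of spaces can remain where symbols were.
    for symbol in all_special_chars:
        text = text.replace(symbol, '')
    return text

# Only 'text' is cleaned here; 'summary' keeps its symbols (e.g. '#.#').
data['text'] = data['text'].apply(remove_unwanted_symbols)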

In [73]:
data['text'][0]

'australia s current account deficit shrunk by a record  billion dollars lrb  billion us rrb in the june quarter due to soaring commodity prices  figures released monday showed '
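
The cleaned output above still contains doubled spaces and the bracket placeholders `lrb`/`rrb`. One possible tidy-up is sketched below, under the assumption that those placeholders carry no information for the summary. `clean_text` is a hypothetical helper, not what the notebook actually runs, and it replaces symbols with a space rather than deleting them so that neighbouring words do not merge.

```python
import re


def clean_text(text):
    """Lowercase, drop bracket placeholders and symbols, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"-lrb-|-rrb-", " ", text)   # bracket placeholder tokens
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # any remaining symbol becomes a space
    return re.sub(r"\s+", " ", text).strip()   # collapse runs of whitespace


sample = (
    "australia 's current account deficit shrunk by a record "
    "#.## billion dollars -lrb- #.## billion us -rrb- ."
)
print(clean_text(sample))
```

Whether to keep digits, or to map `#.##` to a dedicated number token instead of dropping it, is a design choice that depends on how much numeric detail the summaries should retain.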

In [74]:
# mark the start and end of each summary so the decoder knows where to begin and stop
data['summary'] = data['summary'].apply(lambda x: 'START_ ' + x + ' _END')

In [75]:
data.head()

| | text | summary |
|---|---|---|
| 0 | australia s current account deficit shrunk by ... | START_ australian current account deficit narr... |
| 1 | at least two people were killed in a suspected... | START_ at least two dead in southern philippin... |
| 2 | australian shares closed down percent monday ... | START_ australian stocks close down #.# percen... |
| 3 | south korea s nuclear envoy kim sook urged nor... | START_ envoy urges north korea to restart nucl... |
| 4 | south korea on monday announced sweeping tax r... | START_ skorea announces tax cuts to stimulate ... |
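
The `START_` and `_END` markers give a decoder a fixed first token and a signal for when to stop. In a typical encoder-decoder setup trained with teacher forcing, the decoder input is the summary without its last token, and the target is the same sequence shifted left by one position. The following plain-Python sketch shows that shift, assuming simple whitespace tokenization; `make_decoder_pair` is a hypothetical helper and not necessarily how this notebook builds its training arrays.

```python
def make_decoder_pair(summary):
    """Split a marked summary into (decoder_input, decoder_target) token lists."""
    tokens = summary.split()
    decoder_input = tokens[:-1]   # starts with START_, omits _END
    decoder_target = tokens[1:]   # omits START_, ends with _END
    return decoder_input, decoder_target


dec_in, dec_out = make_decoder_pair("START_ australian stocks close down _END")
print(dec_in)
print(dec_out)
```

At each position the decoder sees `dec_in[i]` and is trained to predict `dec_out[i]`, so the two lists always have the same length.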
