# Generating News Headlines

In this kernel, we will use the news headlines dataset from Kaggle (https://www.kaggle.com/sunnysai12345/news-summary) to train a text-generation language model that can generate news headlines.

## Week-1

### Data Preprocessing
### 1. Import Libraries

In [1]:
# Basic libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# keras module for building LSTM 
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Embedding, LSTM, Dense, Input
from keras.preprocessing.text import Tokenizer
from keras.callbacks import EarlyStopping
from keras.models import Sequential, Model
import keras.utils as ku 
import keras.preprocessing.text as text
# set seeds for reproducibility
# (set_random_seed is the TensorFlow 1.x API; in TensorFlow 2.x the equivalent is tf.random.set_seed)
from tensorflow import set_random_seed
from numpy.random import seed
set_random_seed(2)
seed(1)

import string, os


Using TensorFlow backend.


### 2. Load the Dataset
Read the file, which is in CSV (comma-separated values) format, using the pandas library.

In [2]:
#load dataset 
dataset = pd.read_csv("news_summary.csv", delimiter=',', encoding = "ISO-8859-1")

#data sample
dataset.head()

Unnamed: 0,author,date,headlines,read_more,text,ctext
0,Chhavi Tyagi,"03 Aug 2017,Thursday",Daman & Diu revokes mandatory Rakshabandhan in...,http://www.hindustantimes.com/india-news/raksh...,The Administration of Union Territory Daman an...,The Daman and Diu administration on Wednesday ...
1,Daisy Mowke,"03 Aug 2017,Thursday",Malaika slams user who trolled her for 'divorc...,http://www.hindustantimes.com/bollywood/malaik...,Malaika Arora slammed an Instagram user who tr...,"From her special numbers to TV?appearances, Bo..."
2,Arshiya Chopra,"03 Aug 2017,Thursday",'Virgin' now corrected to 'Unmarried' in IGIMS...,http://www.hindustantimes.com/patna/bihar-igim...,The Indira Gandhi Institute of Medical Science...,The Indira Gandhi Institute of Medical Science...
3,Sumedha Sehra,"03 Aug 2017,Thursday",Aaj aapne pakad liya: LeT man Dujana before be...,http://indiatoday.intoday.in/story/abu-dujana-...,Lashkar-e-Taiba's Kashmir commander Abu Dujana...,Lashkar-e-Taiba's Kashmir commander Abu Dujana...
4,Aarushi Maheshwari,"03 Aug 2017,Thursday",Hotel staff to get training to spot signs of s...,http://indiatoday.intoday.in/story/sex-traffic...,Hotels in Maharashtra will train their staff t...,Hotels in Mumbai and other Indian cities are t...


In [3]:
# Get dataset Information
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4514 entries, 0 to 4513
Data columns (total 6 columns):
author       4514 non-null object
date         4514 non-null object
headlines    4514 non-null object
read_more    4514 non-null object
text         4514 non-null object
ctext        4396 non-null object
dtypes: object(6)
memory usage: 211.7+ KB


### 3. Data Cleaning 
First, we drop the unnecessary columns, then remove all punctuation and convert the headlines and text to lower case.

In [4]:
# Drop unnecessary columns from the dataset
dataset = dataset.drop(["author", "date", "read_more", "ctext"], axis=1)

# After dropping the columns
dataset.head()

Unnamed: 0,headlines,text
0,Daman & Diu revokes mandatory Rakshabandhan in...,The Administration of Union Territory Daman an...
1,Malaika slams user who trolled her for 'divorc...,Malaika Arora slammed an Instagram user who tr...
2,'Virgin' now corrected to 'Unmarried' in IGIMS...,The Indira Gandhi Institute of Medical Science...
3,Aaj aapne pakad liya: LeT man Dujana before be...,Lashkar-e-Taiba's Kashmir commander Abu Dujana...
4,Hotel staff to get training to spot signs of s...,Hotels in Maharashtra will train their staff t...


In [5]:
# Data cleaning: remove punctuation, convert to lower case and drop non-ASCII characters
def clean_text(txt):
    txt = "".join(v for v in txt if v not in string.punctuation).lower()
    txt = txt.encode("utf8").decode("ascii",'ignore')
    return txt 

dataset['headlines'] = [clean_text(txt) for txt in dataset['headlines']]
dataset['text'] = [clean_text(txt) for txt in dataset['text']]

dataset.head()

Unnamed: 0,headlines,text
0,daman diu revokes mandatory rakshabandhan in ...,the administration of union territory daman an...
1,malaika slams user who trolled her for divorci...,malaika arora slammed an instagram user who tr...
2,virgin now corrected to unmarried in igims form,the indira gandhi institute of medical science...
3,aaj aapne pakad liya let man dujana before bei...,lashkaretaibas kashmir commander abu dujana wh...
4,hotel staff to get training to spot signs of s...,hotels in maharashtra will train their staff t...
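
As a quick check of what this cleaning step does to individual strings, the sketch below reuses the same `clean_text` logic on two short strings modelled on the rows shown above. It suggests why deleting punctuation outright fuses hyphenated words (`Lashkar-e-Taiba's` becomes `lashkaretaibas`) and can leave double spaces behind.

```python
import string

def clean_text(txt):
    # Same logic as the notebook cell: drop punctuation, lower-case, drop non-ASCII.
    txt = "".join(ch for ch in txt if ch not in string.punctuation).lower()
    return txt.encode("utf8").decode("ascii", "ignore")

# Illustrative inputs modelled on the sample rows
print(clean_text("Lashkar-e-Taiba's commander"))  # lashkaretaibas commander
print(clean_text("Daman & Diu"))                  # daman  diu (two spaces remain)
```

If the fused words or double spaces matter for a given use, replacing punctuation with a space and then collapsing whitespace is one possible alternative; the notebook itself does not do this.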


In [10]:
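# Keep only the first 1000 rows (copy so the assignments below do not act on a view)
dataset = dataset.iloc[:1000, :].copy()
dataset.info()

# Append the marker token "tt" to the first row so that it enters both vocabularies;
# decode_seq later uses it as the start token
dataset.loc[0, 'text'] = dataset.loc[0, 'text'] + " tt "
dataset.loc[0, 'headlines'] = dataset.loc[0, 'headlines'] + " tt "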

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 2 columns):
headlines    1000 non-null object
text         1000 non-null object
dtypes: object(2)
memory usage: 15.7+ KB
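
The `tt` marker above is appended to the first row only. That is enough to put it in the vocabulary, but the decoder hardly ever sees it during training. A more common arrangement in encoder-decoder setups is to wrap every target sentence in a start marker and an end marker. The sketch below shows that idea under stated assumptions: the marker strings `sos` and `eos` and the helper `add_markers` are hypothetical names, not part of the notebook.

```python
def add_markers(headlines, start="sos", end="eos"):
    # Hypothetical helper: wrap each cleaned headline in start/end markers
    # so the decoder can learn where a headline begins and ends.
    return [f"{start} {h.strip()} {end}" for h in headlines]

sample = [
    "hotel staff to get training",
    "virgin now corrected to unmarried in igims form",
]
marked = add_markers(sample)
print(marked[0])  # sos hotel staff to get training eos
```

With markers on every row, decoding could start from the start marker and stop when the end marker is predicted, instead of relying only on a length limit.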


## Week-2

### 4. Generating Sentence to Vector Using Tokenizer
The next step is tokenization, the process of extracting tokens (terms or words) from a corpus. Keras provides a built-in `Tokenizer` class that builds the vocabulary and assigns each token an integer index. After this step, every text document in the dataset is converted into a sequence of tokens, which the code in this section then encodes as one-hot vectors.
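
To make the idea concrete, here is a minimal pure-Python sketch of a word-to-index mapping. It only mimics the general behaviour (more frequent words get lower indices, and index 0 is left unused); it is not Keras's implementation, and `build_word_index` is a hypothetical helper.

```python
from collections import Counter

def build_word_index(corpus):
    # Count whitespace-separated words across all documents.
    counts = Counter(word for doc in corpus for word in doc.split())
    # Assign indices from 1 upward, most frequent word first; 0 stays reserved.
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

corpus = ["the court convicted the pirates", "the pirates held hostages"]
word_index = build_word_index(corpus)

# Convert each document into a sequence of integer token ids
sequences = [[word_index[w] for w in doc.split()] for doc in corpus]
print(word_index)
print(sequences)
```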

In [14]:
def get_sequence_of_tokens(corpus):
    ## tokenization: fit a fresh Tokenizer so each corpus gets its own vocabulary
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(corpus)
    total_words = len(tokenizer.word_index) + 1
    return tokenizer, total_words

# maximum length of input sentence
max_Len_Input = max([len(text.text_to_word_sequence(line)) for line in dataset['text']])

# maximum length of output sentence
max_Len_out = max([len(text.text_to_word_sequence(line)) for line in dataset['headlines']])

# tokenizer of input text
tokenizer, total_words1 = get_sequence_of_tokens(dataset['text'])

# dictionaries of input words (word -> index and index -> word)
inp_w2i_dict = tokenizer.word_index
inp_i2w_dic = {index: word for word, index in inp_w2i_dict.items()}

# input text vectorization
inp_sequences = np.zeros(shape=(len(dataset['text']), max_Len_Input, total_words1))
for i in range(len(dataset['text'])):
    for j, word in enumerate(text.text_to_word_sequence(dataset['text'][i])):
        inp_sequences[i,j,inp_w2i_dict[word]] = 1

# tokenizer of output text
tokenizer, total_words2 = get_sequence_of_tokens(dataset['headlines'])

# dictionaries of output words (word -> index and index -> word)
out_w2i_dict = tokenizer.word_index
out_i2w_dic = {index: word for word, index in out_w2i_dict.items()}

# output sentence vectorization
out_sequences = np.zeros(shape=(len(dataset['headlines']), max_Len_out, total_words2))
target_data = np.zeros(shape=(len(dataset['headlines']), max_Len_out, total_words2))

# output headline vectorization: the target is the decoder input shifted one step ahead
for i in range(len(dataset['headlines'])):
    for j, word in enumerate(text.text_to_word_sequence(dataset['headlines'][i])):
        out_sequences[i, j, out_w2i_dict[word]] = 1
        if j > 0:
            # the target at step j-1 is the word fed to the decoder at step j
            target_data[i, j-1, out_w2i_dict[word]] = 1
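
The intended relationship between the decoder input and `target_data` is the usual one-step offset used for teacher forcing: at position t the decoder is fed word t and is trained to predict word t+1. A small sketch with plain lists (the example headline and the helper name `shift_targets` are only illustrative):

```python
def shift_targets(tokens):
    # Decoder input at step t is tokens[t]; the target at step t is tokens[t + 1].
    decoder_input = tokens
    decoder_target = tokens[1:]
    # zip stops at the shorter list, so the last input word has no target
    return list(zip(decoder_input, decoder_target))

pairs = shift_targets("hotel staff to get training".split())
for fed, expected in pairs:
    print(f"{fed:>8} -> {expected}")
```

The final word has no successor, which corresponds to the last time step of `target_data` staying all zeros for that headline.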

## Week-3

In [15]:
# Encoder model
encoder_input = Input(shape=(None, total_words1))
encoder_LSTM = LSTM(256, return_state=True)
# keep only the final hidden and cell states; they initialise the decoder
encoder_outputs, encoder_h, encoder_c = encoder_LSTM(encoder_input)
encoder_states = [encoder_h, encoder_c]

# Decoder model
decoder_input = Input(shape=(None, total_words2))
decoder_LSTM = LSTM(256, return_sequences=True, return_state=True)
decoder_out, _, _ = decoder_LSTM(decoder_input, initial_state=encoder_states)
decoder_dense = Dense(total_words2, activation='softmax')
decoder_out = decoder_dense(decoder_out)
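
For a rough sense of model size, the number of trainable weights in a standard LSTM layer with bias follows from its four gates, each with an input kernel, a recurrent kernel and a bias: 4 * (units * (input_dim + units) + units). The sketch below applies that formula. The vocabulary sizes are hypothetical placeholders, since the notebook does not print `total_words1` or `total_words2`.

```python
def lstm_param_count(input_dim, units):
    # Four gates, each with an input kernel, a recurrent kernel and a bias vector.
    return 4 * (units * (input_dim + units) + units)

def dense_param_count(input_dim, units):
    # Weight matrix plus bias.
    return input_dim * units + units

# Hypothetical vocabulary sizes, for illustration only
vocab_in, vocab_out = 10_000, 4_000

print("encoder LSTM :", lstm_param_count(vocab_in, 256))
print("decoder LSTM :", lstm_param_count(vocab_out, 256))
print("softmax layer:", dense_param_count(256, vocab_out))
```

Because the inputs are one-hot vectors, the input kernel grows linearly with vocabulary size. That is one reason an `Embedding` layer (imported at the top but unused here) is often placed in front of the LSTM.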

In [16]:
# Define the model that maps [inp_sequences, out_sequences]
# (encoder input and decoder input) to target_data
model = Model([encoder_input, decoder_input], decoder_out)

In [17]:
# start training
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')
model.fit(x=[inp_sequences,out_sequences], 
          y=target_data,
          batch_size=64,
          epochs=50,
          validation_split=0.2)

Train on 800 samples, validate on 200 samples
Epoch 1/50
...
Epoch 50/50


<keras.callbacks.History at 0x24d19710588>

In [18]:
# Inference models for testing

# Encoder inference model
encoder_model_inf = Model(encoder_input, encoder_states)

# Decoder inference model
decoder_state_input_h = Input(shape=(256,))
decoder_state_input_c = Input(shape=(256,))
decoder_input_states = [decoder_state_input_h, decoder_state_input_c]

decoder_out, decoder_h, decoder_c = decoder_LSTM(decoder_input, 
                                                 initial_state=decoder_input_states)

decoder_states = [decoder_h , decoder_c]

decoder_out = decoder_dense(decoder_out)

decoder_model_inf = Model(inputs=[decoder_input] + decoder_input_states,
                          outputs=[decoder_out] + decoder_states )

In [19]:
def decode_seq(inp_seq):

    # Initial states come from the encoder
    states_val = encoder_model_inf.predict(inp_seq)

    # Start decoding from the 'tt' marker token
    target_seq = np.zeros((1, 1, total_words2))
    target_seq[0, 0, out_w2i_dict['tt']] = 1

    generated_words = []

    # Generate at most as many words as the longest training headline
    for _ in range(max_Len_out):

        decoder_out, decoder_h, decoder_c = decoder_model_inf.predict(x=[target_seq] + states_val)

        # Greedy choice: most probable word at the last time step
        max_val_index = np.argmax(decoder_out[0, -1, :])
        # Index 0 is reserved by the Tokenizer and maps to no word
        sampled_word = out_i2w_dic.get(max_val_index, '')
        if sampled_word:
            generated_words.append(sampled_word)

        # Feed the predicted word back in as the next decoder input
        target_seq = np.zeros((1, 1, total_words2))
        target_seq[0, 0, max_val_index] = 1

        states_val = [decoder_h, decoder_c]

    return ' '.join(generated_words)
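
The loop above is a greedy decoder: at each step it takes the most probable next word and feeds it back in. Stripped of the Keras models, the control flow can be sketched as follows. `fake_step` is a hypothetical stand-in for `decoder_model_inf.predict`, and the `eos` end marker is an assumption (the notebook does not define one), so the length cap is what guarantees termination.

```python
def greedy_decode(step_fn, start_token, max_len, end_token="eos"):
    # step_fn(token, state) -> (scores: dict of word -> float, new_state)
    words, token, state = [], start_token, None
    while len(words) < max_len:
        scores, state = step_fn(token, state)
        token = max(scores, key=scores.get)  # greedy: pick the best-scoring word
        if token == end_token:
            break
        words.append(token)
    return " ".join(words)

def fake_step(token, state):
    # Hypothetical stand-in for the decoder: follows a fixed chain of words.
    chain = {"tt": "court", "court": "convicts", "convicts": "pirates", "pirates": "eos"}
    nxt = chain[token]
    return {nxt: 0.9, "the": 0.1}, state

print(greedy_decode(fake_step, "tt", max_len=10))  # court convicts pirates
print(greedy_decode(fake_step, "tt", max_len=2))   # court convicts
```

Greedy selection is the simplest strategy. Alternatives such as beam search keep several candidate sequences per step, at the cost of more decoder calls.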

In [33]:
out = decode_seq(inp_sequences[1:2])

In [32]:
dataset['text'][50]

'a mumbai court has convicted 15 somali pirates to 7 years of imprisonment in a 2011 case the pirates were found guilty of attempt to murder and kidnapping for taking 22 people hostage on board a commercial ship from thailand this is one of the four cases registered against 120 somali pirates for holding 91 people from different countries hostage'