## News Category Prediction

FEATURES:

STORY: Part of the main content of an article to be published as a piece of news.
SECTION: The genre/category the STORY falls into.

Each story falls into one of four distinct sections, labelled as follows:

Politics: 0

Technology: 1

Entertainment: 2

Business: 3
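
For reference, the label scheme above can be kept in a small lookup table. This is only a sketch: the names `SECTION_NAMES`, `SECTION_IDS`, and `section_name` are illustrative and do not appear elsewhere in the notebook.

```python
# Label ids as listed above; the dictionary and function names are illustrative.
SECTION_NAMES = {0: "Politics", 1: "Technology", 2: "Entertainment", 3: "Business"}

# Reverse lookup from section name to integer label.
SECTION_IDS = {name: idx for idx, name in SECTION_NAMES.items()}


def section_name(label):
    """Return the readable section name for an integer label."""
    return SECTION_NAMES[int(label)]


print(section_name(2))
```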

In [37]:
import pandas as pd
import numpy as np
import string
from sklearn.model_selection import train_test_split

In [2]:
df = pd.read_excel('news_data.xlsx')

In [3]:
df.head()

,STORY,SECTION
0,But the most painful was the huge reversal in ...,3
1,How formidable is the opposition alliance amon...,0
2,Most Asian currencies were trading lower today...,3
3,"If you want to answer any question, click on ‘...",1
4,"In global markets, gold prices edged up today ...",3


In [4]:
X = df.iloc[:, 0:1].values
Y = df.iloc[:, 1:2].values

In [5]:
X.shape

(7628, 1)

In [6]:
Y.shape

(7628, 1)

In [7]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

In [13]:
maxlen = 300
classes = 4

## Reading GloVe vectors from a file

In [14]:
def read_glove(file):
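    # GloVe files are UTF-8 text; set the encoding explicitly so the read does not depend on the platform default.
    with open(file, 'r', encoding='utf-8') as f: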
        words = set()
        word_to_vec_map = {}
        for line in f:
            line = line.strip().split()
            curr_word = line[0]
            words.add(curr_word)
            word_to_vec_map[curr_word] = np.array(line[1:], dtype=np.float64)
        
        i = 1
        words_to_index = {}
        index_to_words = {}
        for w in sorted(words):
            words_to_index[w] = i
            index_to_words[i] = w
            i = i + 1
    return words_to_index, index_to_words, word_to_vec_map

In [15]:
words_to_index, index_to_words, word_to_vec_map = read_glove("glove.6B.50d.txt")

In [16]:
len(words_to_index)

400000
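
To make the file layout concrete, here is a minimal sketch of the same parsing applied to two made-up lines held in memory. The sample vectors and the names `sample`, `word_to_vec`, and `word_to_idx` are assumptions for illustration only; real GloVe lines carry 50 components in `glove.6B.50d.txt`.

```python
import io

import numpy as np

# Two invented lines in the GloVe text layout: a token followed by its vector components.
sample = io.StringIO("news 0.1 -0.2 0.3\nunk 0.0 0.5 -0.5\n")

word_to_vec = {}
for line in sample:
    parts = line.strip().split()
    word_to_vec[parts[0]] = np.array(parts[1:], dtype=np.float64)

# Indices start at 1, as in read_glove, which leaves 0 free (presumably for padding).
word_to_idx = {w: i for i, w in enumerate(sorted(word_to_vec), start=1)}

print(word_to_idx)
print(word_to_vec["news"].shape)
```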

## Converting sentences into index sequences using the GloVe vocabulary

In [17]:
def sen_to_index(X, words_to_index, maxlen):
    
    m = X.shape[0]
    X_indices = np.zeros((m, maxlen))
    for i in range(0, m):
        sent = X[i][0].replace('’', '').translate(str.maketrans('', '', string.punctuation)).lower().split()
        j = 0
        for w in sent:
            if w in words_to_index:
                X_indices[i, j] = words_to_index[w]
            else:
                X_indices[i, j] = words_to_index['unk']
            j = j + 1
            if j == maxlen:
                break
    return X_indices   
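
The effect of this conversion is easiest to see on a toy input. The sketch below is a compact single-sentence variant, not the function above: `toy_index`, `maxlen_demo`, and `to_indices` are hypothetical names, and the three-word vocabulary is invented. Unknown tokens map to the `unk` entry and unused positions stay at 0.

```python
import string

import numpy as np

toy_index = {"gold": 1, "prices": 2, "unk": 3}  # hypothetical vocabulary
maxlen_demo = 6


def to_indices(text, index, maxlen):
    """Strip ASCII punctuation, lower-case, map tokens to ids, zero-pad on the right."""
    tokens = text.translate(str.maketrans("", "", string.punctuation)).lower().split()
    row = np.zeros(maxlen)
    for j, tok in enumerate(tokens[:maxlen]):
        # Out-of-vocabulary tokens fall back to the 'unk' index.
        row[j] = index.get(tok, index["unk"])
    return row


print(to_indices("Gold prices edged up!", toy_index, maxlen_demo))
```

Sentences longer than `maxlen` are truncated, mirroring the `break` in the loop above.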

## Load required libraries

In [18]:
from keras.models import Model
from keras.layers import Dense, Input, Dropout, LSTM, Activation
from keras.layers.embeddings import Embedding
from keras.preprocessing import sequence
from keras.initializers import glorot_uniform

Using TensorFlow backend.


In [19]:
def embedding(word_to_vec_map, words_to_index):
    
    vocab = len(words_to_index) + 1
    vec_size = len(word_to_vec_map["news"])
    
    embedding_matrix = np.zeros((vocab, vec_size))
    
    for w, i in words_to_index.items():
        embedding_matrix[i, :] = word_to_vec_map[w]
    
    embedding_layer = Embedding(input_dim = vocab, output_dim = vec_size, trainable = False)
    embedding_layer.build((None,))
    embedding_layer.set_weights([embedding_matrix])
    
    return embedding_layer

## Preparing the classifier model

In [20]:
def news_classifier(input_shape, word_to_vec_map, words_to_index):
    
    sen_indices = Input(input_shape, dtype = 'int32')
    
    embedding_layer = embedding(word_to_vec_map, words_to_index)
    
    embeddings = embedding_layer(sen_indices)
    
    X = LSTM(128, return_sequences=True)(embeddings)
    X = Dropout(0.5)(X)
    X = LSTM(128, return_sequences=False)(X)
    X = Dropout(0.5)(X)
    X = Dense(4)(X)
    X = Activation("softmax")(X)
    
    model = Model(sen_indices, X)
    
    return model

In [25]:
model = news_classifier((maxlen,), word_to_vec_map, words_to_index)
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_2 (InputLayer)         (None, 300)               0         
_________________________________________________________________
embedding_2 (Embedding)      (None, 300, 50)           20000050  
_________________________________________________________________
lstm_3 (LSTM)                (None, 300, 128)          91648     
_________________________________________________________________
dropout_3 (Dropout)          (None, 300, 128)          0         
_________________________________________________________________
lstm_4 (LSTM)                (None, 128)               131584    
_________________________________________________________________
dropout_4 (Dropout)          (None, 128)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 4)                 516       
__________
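
The parameter counts in the summary above can be checked by hand. The sketch below uses the standard formulas for LSTM and dense layers; the helper names `lstm_params` and `dense_params` are illustrative. The embedding has one row per vocabulary word plus one extra row for index 0.

```python
def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and a bias vector.
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    # One weight per input-output pair plus one bias per output unit.
    return input_dim * units + units


embedding_params = (400000 + 1) * 50  # (vocabulary size + 1) rows of 50-d vectors

print(embedding_params)        # embedding layer
print(lstm_params(50, 128))    # first LSTM, fed by 50-d embeddings
print(lstm_params(128, 128))   # second LSTM, fed by the first LSTM's 128-d outputs
print(dense_params(128, 4))    # output layer over 4 classes
```

Since the embedding layer is created with `trainable = False`, its 20,000,050 weights are not updated during training.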

In [26]:
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

In [32]:
X_train_indices = sen_to_index(X_train, words_to_index, maxlen)
Y_train_oh = np.eye(classes)[Y_train.reshape(-1)]
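
The `np.eye(classes)[...]` expression builds one-hot targets by picking rows of an identity matrix. A small sketch with made-up labels (the names `labels`, `one_hot`, and `recovered` are illustrative):

```python
import numpy as np

labels = np.array([[3], [0], [2]])  # made-up labels in the same (m, 1) shape as Y_train

# Row k of the 4x4 identity matrix is the one-hot vector for class k.
one_hot = np.eye(4)[labels.reshape(-1)]
print(one_hot)

# argmax along each row recovers the original integer labels.
recovered = np.argmax(one_hot, axis=1)
print(recovered)
```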

In [30]:
model.fit(X_train_indices, Y_train_oh, epochs = 70, batch_size = 64, shuffle=True)

Instructions for updating:
Use tf.cast instead.
Epoch 1/70
...
Epoch 70/70


<keras.callbacks.History at 0x7f33ae398080>

In [35]:
X_test_indices = sen_to_index(X_test, words_to_index, maxlen)
Y_test_oh = np.eye(classes)[Y_test.reshape(-1)]
loss, acc = model.evaluate(X_test_indices, Y_test_oh)
print()
print("Test accuracy = ", acc)
print("Loss = ", loss)


Test accuracy =  0.9469200521902832
Loss =  0.19833273354668762
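
Overall accuracy does not show which sections get confused with each other. One possible follow-up is a confusion matrix with per-class recall, sketched below in plain NumPy. The arrays `y_true_demo` and `y_pred_demo` are invented for illustration and are not results from this notebook; with the real model, the predictions would come from `np.argmax(model.predict(X_test_indices), axis=1)`.

```python
import numpy as np


def confusion_matrix(y_true, y_pred, n_classes):
    """Count how often true class t (row) is predicted as class p (column)."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm


# Invented labels and predictions, not output from the trained model.
y_true_demo = np.array([0, 1, 2, 3, 3, 2])
y_pred_demo = np.array([0, 1, 2, 3, 1, 2])

cm = confusion_matrix(y_true_demo, y_pred_demo, 4)
per_class_recall = cm.diagonal() / cm.sum(axis=1)

print(cm)
print(per_class_recall)
```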


## Testing with an entertainment news story from India TV

In [52]:
# sen_to_index reads each story from X[i][0], so the input must be 2-D with shape (1, 1), like X_train; a 1-D array would pass only the first character of the text.
ex = np.array([['Vicky Kaushal and Nora Fatehi are all set to spread their charm on The Kapil Sharma Show. The duo will grace the set of the popular show to promote their latest music video Pachtaoge. Nora and Vicky have collaborated for the first time for Arijit Singh song. They had a gala time chatting with Kapil Sharma and his team. Nora also grooved to some of her hit dance numbers.']])
ex_in = sen_to_index(ex, words_to_index, maxlen)
ex_pred = model.predict(ex_in)
ex_pred = np.argmax(ex_pred, axis=1)
print(ex_pred)

[2]
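
Besides the top class, the softmax output can be ranked to see how close the runner-up is. The sketch below uses a made-up probability row rather than the model's actual output; `section_names`, `probs`, and `ranked` are illustrative names.

```python
import numpy as np

section_names = ["Politics", "Technology", "Entertainment", "Business"]

# Invented softmax output for one story, not a real prediction.
probs = np.array([[0.05, 0.10, 0.80, 0.05]])

# Sort class indices from most to least probable.
ranked = np.argsort(probs[0])[::-1]

for idx in ranked[:2]:
    print(section_names[idx], float(probs[0, idx]))
```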


## Save Model into file

In [53]:
model.save('news_classifier.h5')

## Load the pre-trained model

In [54]:
from keras.models import load_model

model = load_model("news_classifier.h5")

In [55]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_2 (InputLayer)         (None, 300)               0         
_________________________________________________________________
embedding_2 (Embedding)      (None, 300, 50)           20000050  
_________________________________________________________________
lstm_3 (LSTM)                (None, 300, 128)          91648     
_________________________________________________________________
dropout_3 (Dropout)          (None, 300, 128)          0         
_________________________________________________________________
lstm_4 (LSTM)                (None, 128)               131584    
_________________________________________________________________
dropout_4 (Dropout)          (None, 128)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 4)                 516       
__________