## News Category Classification

FEATURES:

STORY: Part of the main text of a news article.
SECTION: The genre/category the STORY falls into.

Each story falls into one of four sections, labelled as follows:

Politics: 0

Technology: 1

Entertainment: 2

Business: 3
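
The label scheme above can be captured as a small lookup table, which is handy when turning model outputs back into readable section names. This is only a sketch: the names `SECTION_NAMES` and `SECTION_IDS` are illustrative and do not appear in the notebook.

```python
# Hypothetical helper tables (not part of the original notebook).
SECTION_NAMES = {0: "Politics", 1: "Technology", 2: "Entertainment", 3: "Business"}

# Reverse lookup: section name -> integer label.
SECTION_IDS = {name: label for label, name in SECTION_NAMES.items()}

print(SECTION_NAMES[3])
print(SECTION_IDS["Technology"])
```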

In [1]:
import pandas as pd
import numpy as np
import string
from sklearn.model_selection import train_test_split

In [2]:
df = pd.read_excel('news_data.xlsx')

In [3]:
df.head()

| | STORY | SECTION |
|---|---|---|
| 0 | But the most painful was the huge reversal in ... | 3 |
| 1 | How formidable is the opposition alliance amon... | 0 |
| 2 | Most Asian currencies were trading lower today... | 3 |
| 3 | If you want to answer any question, click on ‘... | 1 |
| 4 | In global markets, gold prices edged up today ... | 3 |


In [4]:
X = df.iloc[:, 0:1].values
Y = df.iloc[:, 1:2].values

In [5]:
X.shape

(7628, 1)

In [6]:
Y.shape

(7628, 1)

In [7]:
X_train, X_test, Y_train ,Y_test = train_test_split(X, Y, test_size = 0.2, random_state = 42)

In [8]:
maxlen = 300
classes = 4

## Reading GloVe vectors from a file

Download the pre-trained word vectors from http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip

In [9]:
def read_glove(file):
    # GloVe files are UTF-8 text: one token per line, followed by its vector components.
    with open(file, 'r', encoding='utf-8') as f:
        words = set()
        word_to_vec_map = {}
        for line in f:
            line = line.strip().split()
            curr_word = line[0]
            words.add(curr_word)
            word_to_vec_map[curr_word] = np.array(line[1:], dtype=np.float64)

    # Number the words from 1 in sorted order; index 0 stays free for padding.
    words_to_index = {}
    index_to_words = {}
    for i, w in enumerate(sorted(words), start=1):
        words_to_index[w] = i
        index_to_words[i] = w
    return words_to_index, index_to_words, word_to_vec_map

In [10]:
words_to_index, index_to_words, word_to_vec_map = read_glove("glove.6B.50d.txt")

In [11]:
len(words_to_index)

400000
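
To make the file format concrete, the sketch below parses two made-up lines in the same layout (a token followed by its vector components) and builds the 1-based index. The sample text and the names `sample`, `word_to_vec`, and `word_to_index` are assumptions for illustration; the real file has 50 components per line.

```python
import io
import numpy as np

# Two invented lines in the GloVe text layout; 3 components instead of 50 for brevity.
sample = io.StringIO("the 0.1 0.2 0.3\ncat 0.4 0.5 0.6\n")

word_to_vec = {}
for line in sample:
    parts = line.strip().split()
    # First field is the token, the rest are the vector components.
    word_to_vec[parts[0]] = np.array(parts[1:], dtype=np.float64)

# Indices start at 1 in sorted order, so 0 remains available for padding.
word_to_index = {w: i for i, w in enumerate(sorted(word_to_vec), start=1)}

print(word_to_index)
print(word_to_vec["cat"])
```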

## Converting sentences into index sequences using the GloVe vocabulary

In [12]:
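def sen_to_index(X, words_to_index, maxlen):
    # X has shape (m, 1): one story per row. Returns an (m, maxlen) array of
    # word indices, zero-padded on the right and truncated to maxlen tokens.
    m = X.shape[0]
    X_indices = np.zeros((m, maxlen))
    for i in range(m):
        # Drop curly apostrophes and ASCII punctuation, lower-case, then split on whitespace.
        sent = X[i][0].replace('’', '').translate(str.maketrans('', '', string.punctuation)).lower().split()
        for j, w in enumerate(sent[:maxlen]):
            # Words missing from the GloVe vocabulary map to the 'unk' token.
            X_indices[i, j] = words_to_index.get(w, words_to_index['unk'])
    return X_indices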
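
A small worked example may help show what the conversion produces. The sketch below applies the same steps (strip punctuation, lower-case, look up, zero-pad) to a single string with a toy vocabulary. The vocabulary `toy_index`, the length `toy_maxlen`, and the helper `encode` are hypothetical and stand in for the real GloVe index and `sen_to_index`.

```python
import string
import numpy as np

# Toy vocabulary (hypothetical); the real one comes from read_glove.
toy_index = {"gold": 1, "prices": 2, "rose": 3, "unk": 4}
toy_maxlen = 6

def encode(text, words_to_index, maxlen):
    """Lower-case, strip ASCII punctuation, map tokens to indices, pad with 0."""
    tokens = text.translate(str.maketrans("", "", string.punctuation)).lower().split()
    row = np.zeros(maxlen)
    for j, w in enumerate(tokens[:maxlen]):
        # Unknown words fall back to the 'unk' index.
        row[j] = words_to_index.get(w, words_to_index["unk"])
    return row

encoded = encode("Gold prices rose sharply!", toy_index, toy_maxlen)
print(encoded)
```

Here "sharply" is not in the toy vocabulary, so it maps to the `unk` index 4, and the two unused positions stay at 0.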

## Load required libraries

In [13]:
from keras.models import Model
from keras.layers import Dense, Input, Dropout, LSTM, Activation, Embedding

Using TensorFlow backend.


In [14]:
def embedding(word_to_vec_map, words_to_index):
    
    vocab = len(words_to_index) + 1
    vec_size = len(word_to_vec_map["news"])
    
    embedding_matrix = np.zeros((vocab, vec_size))
    
    for w, i in words_to_index.items():
        embedding_matrix[i, :] = word_to_vec_map[w]
    
    embedding_layer = Embedding(input_dim = vocab, output_dim = vec_size, trainable = False)
    embedding_layer.build((None,))
    embedding_layer.set_weights([embedding_matrix])
    
    return embedding_layer
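
The role of the extra row in the embedding matrix can be seen without Keras. The sketch below builds the same kind of matrix with NumPy for a made-up three-word vocabulary and performs the lookup as plain row indexing. All names and vectors here are illustrative assumptions, not values from GloVe.

```python
import numpy as np

# Hypothetical 3-word vocabulary with 2-dimensional vectors.
toy_index = {"cat": 1, "dog": 2, "unk": 3}
toy_vectors = {
    "cat": np.array([0.1, 0.2]),
    "dog": np.array([0.3, 0.4]),
    "unk": np.array([0.0, 0.5]),
}

vocab = len(toy_index) + 1            # +1 because row 0 is reserved for padding
vec_size = len(toy_vectors["cat"])
matrix = np.zeros((vocab, vec_size))
for w, i in toy_index.items():
    matrix[i, :] = toy_vectors[w]

# An embedding lookup is row indexing: each index selects one row.
indices = np.array([2, 1, 0])
looked_up = matrix[indices]
print(looked_up.shape)
```

Because padded positions carry index 0, they select the all-zero row of the matrix.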

## Building the classifier model

In [15]:
def news_classifier(input_shape, word_to_vec_map, words_to_index):
    
    sen_indices = Input(input_shape, dtype = 'int32')
    
    embedding_layer = embedding(word_to_vec_map, words_to_index)
    
    embeddings = embedding_layer(sen_indices)
    
    X = LSTM(128, return_sequences=True)(embeddings)
    X = Dropout(0.5)(X)
    X = LSTM(128, return_sequences=False)(X)
    X = Dropout(0.5)(X)
    X = Dense(4)(X)
    X = Activation("softmax")(X)
    
    model = Model(sen_indices, X)
    
    return model

In [16]:
model = news_classifier((maxlen,), word_to_vec_map, words_to_index)
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         (None, 300)               0         
_________________________________________________________________
embedding_1 (Embedding)      (None, 300, 50)           20000050  
_________________________________________________________________
lstm_1 (LSTM)                (None, 300, 128)          91648     
_________________________________________________________________
dropout_1 (Dropout)          (None, 300, 128)          0         
_________________________________________________________________
lstm_2 (LSTM)                (None, 128)               131584    
_________________________________________________________________
dropout_2 (Dropout)  
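
The `Param #` column of the summary can be checked by hand. The sketch below uses the standard parameter-count formulas for an LSTM layer (four gates, each with an input kernel, a recurrent kernel, and a bias) and for a dense layer. The helper names `lstm_params` and `dense_params` are hypothetical.

```python
def lstm_params(input_dim, units):
    # Four gates, each with an input kernel, a recurrent kernel and a bias.
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # One weight per input-output pair plus one bias per output unit.
    return input_dim * units + units

embedding_params = (400000 + 1) * 50   # vocabulary size + padding row, 50-d vectors
lstm_1 = lstm_params(50, 128)          # input: 50-d embeddings
lstm_2 = lstm_params(128, 128)         # input: 128-d outputs of the first LSTM
dense_1 = dense_params(128, 4)         # 4 output classes

print(embedding_params, lstm_1, lstm_2, dense_1)
```

These evaluate to 20000050, 91648, 131584, and 516, matching the values Keras reports for the embedding, LSTM, and dense layers. The embedding weights are frozen (`trainable = False`), so only the LSTM and dense parameters are updated during training.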

In [17]:
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

In [18]:
X_train_indices = sen_to_index(X_train, words_to_index, maxlen)
Y_train_oh = np.eye(classes)[Y_train.reshape(-1)]
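
The `np.eye(classes)[...]` expression is a compact one-hot encoder: indexing the identity matrix with an integer label selects the row that has a 1 in that label's position. A minimal sketch with made-up labels (the names `num_classes`, `labels`, `one_hot`, and `recovered` are illustrative):

```python
import numpy as np

num_classes = 4
labels = np.array([[3], [0], [2]])      # shape (3, 1), the same layout as Y_train

# Row k of the identity matrix is the one-hot vector for class k.
one_hot = np.eye(num_classes)[labels.reshape(-1)]
print(one_hot)

# argmax along each row recovers the integer labels.
recovered = one_hot.argmax(axis=1)
print(recovered)
```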

In [19]:
model.fit(X_train_indices, Y_train_oh, epochs = 100, batch_size = 64, shuffle=True)

Epoch 1/100
...
Epoch 100/100


<keras.callbacks.History at 0x7fa98f140be0>

In [20]:
X_test_indices = sen_to_index(X_test, words_to_index, maxlen)
Y_test_oh = np.eye(classes)[Y_test.reshape(-1)]
loss, acc = model.evaluate(X_test_indices, Y_test_oh)
print()
print("Test accuracy = ", acc)
print("Loss = ", loss)


Test accuracy =  0.9567496721116462
Loss =  0.18685297507836654
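
Overall accuracy does not show which sections get confused with one another. One possible follow-up, sketched below with NumPy only, is a confusion matrix with per-class recall. The labels here are made up purely for illustration and are not results from the notebook; with the trained model one would presumably use `Y_test.reshape(-1)` as the true labels and the row-wise argmax of `model.predict(X_test_indices)` as the predictions.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes):
    """Rows are true labels, columns are predicted labels."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

# Invented labels for illustration only.
y_true = np.array([0, 1, 2, 3, 3, 2])
y_pred = np.array([0, 1, 2, 3, 0, 2])

cm = confusion_matrix(y_true, y_pred, 4)
accuracy = np.trace(cm) / cm.sum()                   # correct predictions / all predictions
per_class_recall = cm.diagonal() / cm.sum(axis=1)    # correct / true count, per class
print(cm)
print(accuracy, per_class_recall)
```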


## Testing with an entertainment news story from India TV

In [21]:
ex = np.array(['Vicky Kaushal and Nora Fatehi are all set to spread their charm on The Kapil Sharma Show. The duo will grace the set of the popular show to promote their latest music video Pachtaoge. Nora and Vicky have collaborated for the first time for Arijit Singh song. They had a gala time chatting with Kapil Sharma and his team. Nora also grooved to some of her hit dance numbers.'])
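# sen_to_index expects shape (m, 1) and reads X[i][0], so reshape the 1-D array into a column;
# otherwise X[i][0] would be only the first character of the story.
ex_in = sen_to_index(ex.reshape(-1, 1), words_to_index, maxlen)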
ex_pred = model.predict(ex_in)
ex_pred = np.argmax(ex_pred, axis=1)
print(ex_pred)

[2]
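
The printed value is a class index, and index 2 corresponds to Entertainment in the label scheme. The sketch below shows the decoding step on made-up softmax outputs for two stories. The array `probs` and the tuple `section_names` are assumptions for illustration; real probabilities come from `model.predict`.

```python
import numpy as np

section_names = ("Politics", "Technology", "Entertainment", "Business")

# Invented softmax outputs for two stories; each row sums to 1.
probs = np.array([[0.05, 0.10, 0.80, 0.05],
                  [0.70, 0.10, 0.10, 0.10]])

# The predicted class is the position of the largest probability in each row.
pred_ids = np.argmax(probs, axis=1)
pred_names = [section_names[i] for i in pred_ids]
print(pred_names)
```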


## Saving the model to a file

In [22]:
model.save('news_classifier.h5')

## Loading the pre-trained model

In [23]:
from keras.models import load_model

model = load_model("news_classifier.h5")

In [24]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         (None, 300)               0         
_________________________________________________________________
embedding_1 (Embedding)      (None, 300, 50)           20000050  
_________________________________________________________________
lstm_1 (LSTM)                (None, 300, 128)          91648     
_________________________________________________________________
dropout_1 (Dropout)          (None, 300, 128)          0         
_________________________________________________________________
lstm_2 (LSTM)                (None, 128)               131584    
_________________________________________________________________
dropout_2 (Dropout)          (None, 128)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 4)                 516       
__________