# BBC News Category Classification

### Exploring the Data

In [2]:
import pandas as pd
data=pd.read_csv("https://storage.googleapis.com/dataset-uploader/bbc/bbc-text.csv")
data

| | category | text |
|---|---|---|
| 0 | tech | tv future in the hands of viewers with home th... |
| 1 | business | worldcom boss left books alone former worldc... |
| 2 | sport | tigers wary of farrell gamble leicester say ... |
| 3 | sport | yeading face newcastle in fa cup premiership s... |
| 4 | entertainment | ocean s twelve raids box office ocean s twelve... |
| ... | ... | ... |
| 2220 | business | cars pull down us retail figures us retail sal... |
| 2221 | politics | kilroy unveils immigration policy ex-chatshow ... |
| 2222 | entertainment | rem announce new glasgow concert us band rem h... |
| 2223 | politics | how political squabbles snowball it s become c... |


In [3]:
data.category.value_counts()

sport            511
business         510
politics         417
tech             401
entertainment    386
Name: category, dtype: int64

The data consists of BBC news stories and their category labels. There are 2,225 observations and five categories in the target variable. The largest classes are sport (511) and business (510), followed by politics (417) and tech (401); entertainment is the smallest (386).
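To put the later accuracy figures in context, it helps to know how well a trivial classifier would do. The sketch below uses only the class counts printed above; the variable names are illustrative and not part of the original notebook.

```python
# Class counts copied from the value_counts() output above.
counts = {
    "sport": 511,
    "business": 510,
    "politics": 417,
    "tech": 401,
    "entertainment": 386,
}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# Accuracy of a trivial classifier that always predicts the largest class.
majority_label = max(counts, key=counts.get)
majority_baseline = counts[majority_label] / total

for label, share in shares.items():
    print(f"{label:<14}{share:.1%}")
print(f"Majority baseline ({majority_label}): {majority_baseline:.1%}")
```

The classes are fairly balanced, so always predicting the largest class would be right only about 23% of the time.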

### Preprocessing the Data

Two preprocessing steps are needed.

1) The independent variable, the text of each news story, is tokenized using only the 10,000 most frequent words, and each story is then cut or padded to 100 tokens. With its default settings, `pad_sequences` truncates from the front, so the last 100 tokens of each story are kept rather than the first 100.

2) The dependent variable needs to be one-hot encoded. I prefer to use pandas `get_dummies` for this rather than Keras.
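The two steps can be illustrated without Keras or pandas. The following is a minimal pure-Python sketch, assuming the default `'pre'` padding and `'pre'` truncation of `pad_sequences`; the helpers `pad_or_truncate` and `one_hot` are hypothetical stand-ins, not library functions.

```python
def pad_or_truncate(seq, maxlen, value=0):
    """Mimic the default 'pre' padding and 'pre' truncation of pad_sequences."""
    trimmed = seq[-maxlen:]  # keep the LAST maxlen tokens
    return [value] * (maxlen - len(trimmed)) + trimmed  # left-pad with zeros


def one_hot(label, classes):
    """Return a 0/1 indicator list, one position per class."""
    return [1 if label == c else 0 for c in classes]


short_story = [5, 8, 2]
long_story = list(range(1, 11))

print(pad_or_truncate(short_story, 5))  # [0, 0, 5, 8, 2]
print(pad_or_truncate(long_story, 5))   # [6, 7, 8, 9, 10]

# Columns sorted alphabetically, as in the get_dummies output.
classes = sorted(["tech", "business", "sport", "entertainment", "politics"])
print(classes)
print(one_hot("tech", classes))         # [0, 0, 0, 0, 1]
```

To keep the first 100 tokens of each story instead, `pad_sequences` accepts `truncating='post'`.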

In [4]:
text = data['text']

category=data['category']

In [5]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
import numpy as np

maxlen = 100
max_words = 10000  

tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(text)
sequences = tokenizer.texts_to_sequences(text) 

word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

text = pad_sequences(sequences, maxlen=maxlen)

category = np.asarray(category)

print('Shape of data tensor:',text.shape)
print('Shape of category tensor:',category.shape)



Using TensorFlow backend.


Found 29726 unique tokens.
Shape of data tensor: (2225, 100)
Shape of category tensor: (2225,)


In [6]:
X = text
y = pd.get_dummies(category)  # one-hot encode the five categories

display(y)

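| | business | entertainment | politics | sport | tech |
|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 1 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 1 | 0 |
| 3 | 0 | 0 | 0 | 1 | 0 |
| 4 | 0 | 1 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... |
| 2220 | 1 | 0 | 0 | 0 | 0 |
| 2221 | 0 | 0 | 1 | 0 | 0 |
| 2222 | 0 | 1 | 0 | 0 | 0 |
| 2223 | 0 | 0 | 1 | 0 | 0 |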


### Model A: A Model with an embedding layer and two dense layers (with no layers meant for sequential data)

In [9]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

print(X_train.shape)
print(y_train.shape)

(1668, 100)
(1668, 5)
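The sample counts that appear in this notebook follow from two defaults: `train_test_split` holds out 25% of the rows when no `test_size` is given, and `validation_split=0.2` in `fit` sets aside the last 20% of the training rows. The arithmetic below is a sketch of that bookkeeping, not a call into either library.

```python
import math

n_samples = 2225

# train_test_split holds out 25% by default and rounds the test size up.
n_test = math.ceil(0.25 * n_samples)
n_train = n_samples - n_test

# validation_split=0.2 takes the last 20% of the training rows for validation.
n_fit = int(n_train * (1 - 0.2))
n_val = n_train - n_fit

print(n_train, n_test)  # 1668 557
print(n_fit, n_val)     # 1334 334
```

These match the `(1668, 100)` training shape above and the "Train on 1334 samples, validate on 334 samples" lines in the training logs of the models that use a validation split.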


In [8]:
from keras.layers import Dense, Embedding, Flatten
from keras.models import Sequential

model = Sequential()
model.add(Embedding(10000, 8, input_length=maxlen))

model.add(Flatten())
model.add(Dense(5, activation='sigmoid'))
model.add(Dense(5, activation='sigmoid'))
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
model.summary()

history = model.fit(X_train, y_train,
                    epochs=10,
                    batch_size=32)

score, acc = model.evaluate(X_test, y_test)
print('Test score:', score)
print('Test accuracy:', acc)

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 100, 8)            80000     
_________________________________________________________________
flatten_2 (Flatten)          (None, 800)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 5)                 4005      
_________________________________________________________________
dense_3 (Dense)              (None, 5)                 30        
Total params: 84,035
Trainable params: 84,035
Non-trainable params: 0
_________________________________________________________________


  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test score: 0.41835008617791497
Test accuracy: 0.8157988786697388


Model A uses an embedding layer followed by two dense layers, with no layer designed for sequential data. Its test accuracy is 81.58%.
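One caveat applies to every accuracy figure in this notebook. All models are compiled with `loss='binary_crossentropy'` and `metrics=['acc']`, and older Keras versions resolve `'acc'` to element-wise binary accuracy in that case. Assuming that is what happened here, a model that predicts "no category" for every story already scores 80% on five one-hot classes. The pure-Python sketch below illustrates the difference; both metric functions are simplified stand-ins written for this example, not the Keras implementations.

```python
def binary_accuracy(y_true, y_pred, threshold=0.5):
    """Fraction of individual 0/1 entries predicted correctly."""
    hits = 0
    total = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            hits += int(t == (1 if p > threshold else 0))
            total += 1
    return hits / total


def categorical_accuracy(y_true, y_pred):
    """Fraction of rows whose highest-scoring class is the true class."""
    hits = 0
    for true_row, pred_row in zip(y_true, y_pred):
        hits += int(true_row.index(max(true_row)) == pred_row.index(max(pred_row)))
    return hits / len(y_true)


# First five one-hot rows from the table above.
y_true = [
    [0, 0, 0, 0, 1],
    [1, 0, 0, 0, 0],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 1, 0],
    [0, 1, 0, 0, 0],
]
# An uninformative model: every class scored below the 0.5 threshold.
y_pred = [[0.1] * 5 for _ in y_true]

print(binary_accuracy(y_true, y_pred))       # 0.8
print(categorical_accuracy(y_true, y_pred))  # 0.2
```

For a single-label, five-class problem, a softmax output layer with `loss='categorical_crossentropy'` would report per-story accuracy instead.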

### Model B: A model using an Embedding layer with Conv1D layers

In [11]:
from keras.models import Sequential
from keras import layers
from keras.optimizers import RMSprop

max_features = 10000  
max_len = 100 

model = Sequential()
model.add(layers.Embedding(max_features, 128, input_length=max_len))
model.add(layers.Conv1D(32, 7, activation='relu'))
model.add(layers.MaxPooling1D(5))  # downsample the sequence by a factor of 5
model.add(layers.Conv1D(32, 7, activation='relu'))
model.add(layers.GlobalMaxPooling1D())
model.add(layers.Dense(5))  # no activation here, so the outputs are unbounded scores

model.summary()

model.compile(optimizer=RMSprop(lr=1e-4),
              loss='binary_crossentropy',
              metrics=['acc'])

history = model.fit(X_train, y_train,
                    epochs=10,
                    batch_size=128,
                    validation_split=0.2)

score, acc = model.evaluate(X_test, y_test)
print('Test score:', score)
print('Test accuracy:', acc)

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 100, 128)          1280000   
_________________________________________________________________
conv1d_3 (Conv1D)            (None, 94, 32)            28704     
_________________________________________________________________
max_pooling1d_2 (MaxPooling1 (None, 18, 32)            0         
_________________________________________________________________
conv1d_4 (Conv1D)            (None, 12, 32)            7200      
_________________________________________________________________
global_max_pooling1d_2 (Glob (None, 32)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 5)                 165       
Total params: 1,316,069
Trainable params: 1,316,069
Non-trainable params: 0
____________________________________________

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Train on 1334 samples, validate on 334 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test score: 1.527776182874957
Test accuracy: 0.8000001311302185


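The above model, with an embedding layer and two Conv1D layers, yields 80% test accuracy. Note that its final Dense layer has no activation, unlike the other models, which may explain the much higher test loss (1.53).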
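The output shapes and parameter counts in the Model B summary can be reproduced by hand. The sketch below assumes unpadded, stride-1 convolutions and a pooling stride equal to the pool size, which are the Keras defaults; the helper functions are illustrative.

```python
def conv1d_out_len(length, kernel_size):
    """Output length of an unpadded, stride-1 1D convolution."""
    return length - kernel_size + 1


def conv1d_params(in_channels, filters, kernel_size):
    """Kernel weights plus one bias per filter."""
    return kernel_size * in_channels * filters + filters


def dense_params(inputs, units):
    """Weight matrix plus one bias per unit."""
    return inputs * units + units


len_conv1 = conv1d_out_len(100, 7)        # 94
len_pool = len_conv1 // 5                 # 18
len_conv2 = conv1d_out_len(len_pool, 7)   # 12

params = {
    "embedding": 10000 * 128,
    "conv1d_3": conv1d_params(128, 32, 7),
    "conv1d_4": conv1d_params(32, 32, 7),
    "dense": dense_params(32, 5),
}

print(len_conv1, len_pool, len_conv2)
for name, n in params.items():
    print(f"{name:<10}{n:>10,}")
print(f"{'total':<10}{sum(params.values()):>10,}")
```

Almost all of the 1,316,069 parameters sit in the embedding layer, which is a lot to fit from 1,334 training stories.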

### Model C: A model using an Embedding layer with one LSTM sequential layer

In [24]:
from keras.layers import LSTM, Embedding, Dense
model_c = Sequential()
model_c.add(Embedding(10000, 32))
model_c.add(LSTM(32))
model_c.add(Dense(5, activation='sigmoid'))

model_c.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
history = model_c.fit(X_train, y_train,
                    epochs=10,
                    batch_size=32,
                    validation_split=0.2)

score, acc = model_c.evaluate(X_test, y_test)
print('Test score:', score)
print('Test accuracy:', acc)

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Train on 1334 samples, validate on 334 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test score: 0.2923532754878381
Test accuracy: 0.8987432718276978


The above model, with a single LSTM layer, yields 90% test accuracy (89.87%), which is substantially higher than the previous two models.

### Model D: A model using an Embedding layer with stacked LSTM sequential layers

In [22]:
model_d = Sequential()
model_d.add(Embedding(10000, 32))
model_d.add(LSTM(32, return_sequences=True))
model_d.add(LSTM(32, return_sequences=True))
model_d.add(LSTM(32))
model_d.add(Dense(5, activation='sigmoid'))

model_d.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
history = model_d.fit(X_train, y_train,
                    epochs=10,
                    batch_size=32,
                    validation_split=0.2)

score, acc = model_d.evaluate(X_test, y_test)
print('Test score:', score)
print('Test accuracy:', acc)

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Train on 1334 samples, validate on 334 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test score: 0.45898516109644616
Test accuracy: 0.8305206298828125


The above model stacks three LSTM layers. Its test accuracy (83.05%) is considerably lower than Model C's.

### Model E: A model using an Embedding layer with bidirectional sequential layers

In [23]:
model_e = Sequential()
model_e.add(layers.Embedding(max_features, 32))
model_e.add(layers.Bidirectional(layers.LSTM(32)))
model_e.add(layers.Dense(5, activation='sigmoid'))

model_e.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
history = model_e.fit(X_train, y_train, epochs=10, batch_size=128, validation_split=0.2)

score, acc = model_e.evaluate(X_test, y_test)
print('Test score:', score)
print('Test accuracy:', acc)

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Train on 1334 samples, validate on 334 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test score: 0.349355810099487
Test accuracy: 0.8542190194129944


The above model has one bidirectional LSTM layer and reaches a test accuracy of 85.42%.

### Optimizing the best model (Model C) with dropout

In [27]:
model = Sequential()
model.add(Embedding(10000, 32))
model.add(LSTM(32, dropout=0.2, recurrent_dropout=0.2)) 
model.add(Dense(5, activation='sigmoid'))


model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
history = model.fit(X_train, y_train,
                    epochs=10,
                    batch_size=32,
                    validation_split=0.2)

score, acc = model.evaluate(X_test, y_test,
                            batch_size=32)
print('Test score:', score)
print('Test accuracy:', acc)

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Train on 1334 samples, validate on 334 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test score: 0.3390718218561135
Test accuracy: 0.8728905320167542


Retraining Model C with dropout lowers the test accuracy to 87.29%, down from 89.87%. Below I rerun the model with more epochs.

### Optimizing the best model (Model C) with increased epochs

In [28]:
model = Sequential()
model.add(Embedding(10000, 32))
model.add(LSTM(32, dropout=0.2, recurrent_dropout=0.2)) 
model.add(Dense(5, activation='sigmoid'))


model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
history = model.fit(X_train, y_train,
                    epochs=20,
                    batch_size=32,
                    validation_split=0.2)

score, acc = model.evaluate(X_test, y_test)
print('Test score:', score)
print('Test accuracy:', acc)

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Train on 1334 samples, validate on 334 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
Test score: 0.2097818600264657
Test accuracy: 0.9278275966644287


Increasing the epochs from 10 to 20 raised the test accuracy of the dropout model to 92.78%, above the 89.87% of the original Model C.

### Discussion

Among the five architectures, Model C, with one embedding layer and one LSTM layer, performed best. This is not what I expected, as I imagined that adding more sequential layers would improve performance.

However, the more complex variants (bidirectional LSTM, stacked LSTM layers, and dropout at 10 epochs) all performed worse than Model C. This might be because the dataset is not very large. Only the dropout model trained for 20 epochs scored higher (92.78%). I would like to try GloVe embeddings or GRU layers to see whether a different approach yields a more accurate model. Multiple Conv1D layers and multiple LSTM layers by themselves did not improve the model; perhaps combining them would improve predictive power.
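For reference, the test accuracies printed in the cells above can be collected and ranked in one place. The values are copied from the notebook outputs; the short model labels are my own shorthand. Each figure comes from a single training run, so small gaps between models may not be meaningful.

```python
# Test accuracies copied from the evaluate() outputs above.
results = {
    "A: Embedding + Dense": 0.8157988786697388,
    "B: Embedding + Conv1D": 0.8000001311302185,
    "C: Embedding + LSTM": 0.8987432718276978,
    "D: Embedding + stacked LSTM": 0.8305206298828125,
    "E: Embedding + bidirectional LSTM": 0.8542190194129944,
    "C + dropout, 10 epochs": 0.8728905320167542,
    "C + dropout, 20 epochs": 0.9278275966644287,
}

# Sort from highest to lowest test accuracy.
ranked = sorted(results.items(), key=lambda item: item[1], reverse=True)
best_name, best_acc = ranked[0]

for name, acc in ranked:
    print(f"{name:<36}{acc:.2%}")
```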