## News_Aggregator_Classification

## Objective

To predict the category (business, entertainment, science and technology, or health) of a news article given its headline.

## Datasets (Source & Acknowledgements)

The columns included in this dataset are:

- ID: the numeric ID of the article
- TITLE: the headline of the article
- URL: the URL of the article
- PUBLISHER: the publisher of the article
- CATEGORY: the category of the news item; one of:
  - b: business
  - t: science and technology
  - e: entertainment
  - m: health
- STORY: alphanumeric ID of the news story that the article discusses
- HOSTNAME: hostname where the article was posted
- TIMESTAMP: approximate timestamp of the article's publication

1. Publication Dataset: Lichman, M. (2013). UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science. https://archive.ics.uci.edu/ml/datasets/News+Aggregator

2. Kaggle Dataset: https://www.kaggle.com/datasets/uciml/news-aggregator-dataset
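As a quick orientation to the columns described above, the sketch below decodes a CATEGORY code and a sample TIMESTAMP value from the dataset. It assumes TIMESTAMP is expressed in milliseconds since the Unix epoch, which the 13-digit values suggest but the column description does not state. The names `CATEGORY_NAMES` and `decode_timestamp` are illustrative only.

```python
from datetime import datetime, timezone

# Single-letter CATEGORY codes and their meanings, as listed above
CATEGORY_NAMES = {
    "b": "business",
    "t": "science and technology",
    "e": "entertainment",
    "m": "health",
}

def decode_timestamp(ms):
    """Convert a TIMESTAMP value to a UTC datetime.

    Assumption: the value is milliseconds since the Unix epoch.
    """
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)

published = decode_timestamp(1394470370698)
print(CATEGORY_NAMES["b"], published.date())
```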

In [1]:
# Import libraries

import warnings

import numpy as np
import pandas as pd
import tensorflow as tf
from sklearn.model_selection import train_test_split
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.optimizers.schedules import PolynomialDecay
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

# Silence library warnings to keep the notebook output readable
warnings.filterwarnings("ignore")

## Read in the dataset

In [2]:
df = pd.read_csv('uci-news-aggregator.csv')
df

| | ID | TITLE | URL | PUBLISHER | CATEGORY | STORY | HOSTNAME | TIMESTAMP |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Fed official says weak data caused by weather,... | http://www.latimes.com/business/money/la-fi-mo... | Los Angeles Times | b | ddUyU0VZz0BRneMioxUPQVP6sIxvM | www.latimes.com | 1394470370698 |
| 1 | 2 | Fed's Charles Plosser sees high bar for change... | http://www.livemint.com/Politics/H2EvwJSK2VE6O... | Livemint | b | ddUyU0VZz0BRneMioxUPQVP6sIxvM | www.livemint.com | 1394470371207 |
| 2 | 3 | US open: Stocks fall after Fed official hints ... | http://www.ifamagazine.com/news/us-open-stocks... | IFA Magazine | b | ddUyU0VZz0BRneMioxUPQVP6sIxvM | www.ifamagazine.com | 1394470371550 |
| 3 | 4 | Fed risks falling 'behind the curve', Charles ... | http://www.ifamagazine.com/news/fed-risks-fall... | IFA Magazine | b | ddUyU0VZz0BRneMioxUPQVP6sIxvM | www.ifamagazine.com | 1394470371793 |
| 4 | 5 | Fed's Plosser: Nasty Weather Has Curbed Job Gr... | http://www.moneynews.com/Economy/federal-reser... | Moneynews | b | ddUyU0VZz0BRneMioxUPQVP6sIxvM | www.moneynews.com | 1394470372027 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 422414 | 422933 | Surgeons to remove 4-year-old's rib to rebuild... | http://www.cbs3springfield.com/story/26378648/... | WSHM-TV | m | dpcLMoJD69UYMXMxaoEFnWql9YjQM | www.cbs3springfield.com | 1409229190251 |
| 422415 | 422934 | Boy to have surgery on esophagus after battery... | http://www.wlwt.com/news/boy-to-have-surgery-o... | WLWT Cincinnati | m | dpcLMoJD69UYMXMxaoEFnWql9YjQM | www.wlwt.com | 1409229190508 |
| 422416 | 422935 | Child who swallowed battery to have reconstruc... | http://www.newsnet5.com/news/local-news/child-... | NewsNet5.com | m | dpcLMoJD69UYMXMxaoEFnWql9YjQM | www.newsnet5.com | 1409229190771 |
| 422417 | 422936 | Phoenix boy undergoes surgery to repair throat... | http://www.wfsb.com/story/26368078/phoenix-boy... | WFSB | m | dpcLMoJD69UYMXMxaoEFnWql9YjQM | www.wfsb.com | 1409229191071 |


In [3]:
df = df[['TITLE','CATEGORY']]
df

| | TITLE | CATEGORY |
|---|---|---|
| 0 | Fed official says weak data caused by weather,... | b |
| 1 | Fed's Charles Plosser sees high bar for change... | b |
| 2 | US open: Stocks fall after Fed official hints ... | b |
| 3 | Fed risks falling 'behind the curve', Charles ... | b |
| 4 | Fed's Plosser: Nasty Weather Has Curbed Job Gr... | b |
| ... | ... | ... |
| 422414 | Surgeons to remove 4-year-old's rib to rebuild... | m |
| 422415 | Boy to have surgery on esophagus after battery... | m |
| 422416 | Child who swallowed battery to have reconstruc... | m |
| 422417 | Phoenix boy undergoes surgery to repair throat... | m |


In [4]:
df['CATEGORY'].value_counts()

e    152469
b    115967
t    108344
m     45639
Name: CATEGORY, dtype: int64
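The counts above are noticeably imbalanced. The sketch below uses only the numbers from that output to show each category's share of the dataset, which may be one reason for drawing an equal number of rows per category afterwards. Variable names are illustrative.

```python
# Headline counts per category, copied from the value_counts() output above
counts = {"e": 152469, "b": 115967, "t": 108344, "m": 45639}

total = sum(counts.values())
shares = {code: n / total for code, n in counts.items()}

for code, share in shares.items():
    print(f"{code}: {share:.1%}")

# Ratio between the largest and the smallest class
imbalance_ratio = max(counts.values()) / min(counts.values())
print(f"largest / smallest = {imbalance_ratio:.2f}")
```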

### Sample 5000 rows for each category at random

In [5]:
e = df[df['CATEGORY'] == 'e'].sample(n=5000)
b = df[df['CATEGORY'] == 'b'].sample(n=5000)
t = df[df['CATEGORY'] == 't'].sample(n=5000)
m = df[df['CATEGORY'] == 'm'].sample(n=5000)
df_selected = pd.concat([e,b,t,m], ignore_index=True)
df_selected = df_selected.reindex(np.random.permutation(df_selected.index)).reset_index(drop=True)
df_selected 

| | TITLE | CATEGORY |
|---|---|---|
| 0 | 20 deadliest states for workers | m |
| 1 | 49ers QB Colin Kaepernick defends reputation a... | b |
| 2 | The Voice Final Performances — Duets win the day | e |
| 3 | AbbVie Bid for Shire Rejected as Low | b |
| 4 | Aereo's Supreme Court loss a big victory for T... | t |
| ... | ... | ... |
| 19995 | IMF says Russia Already in Recession | b |
| 19996 | Ikea Will Pay Its Workers a Living Wage | b |
| 19997 | Numsa plays hard ball as threat of investment ... | b |
| 19998 | Transparent rodents reveal details of inner an... | m |
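The sampling cell above draws different rows on every run because no seed is passed. A minimal sketch of a seeded variant follows; it is not the notebook's method. It relies on `DataFrameGroupBy.sample` (pandas 1.1 or later), and the helper name `balanced_sample` and the toy data are hypothetical.

```python
import pandas as pd

def balanced_sample(frame, label_col, n_per_class, seed):
    """Draw n_per_class rows per label, then shuffle, reproducibly."""
    sampled = frame.groupby(label_col).sample(n=n_per_class, random_state=seed)
    # frac=1 returns every row in a new random order
    return sampled.sample(frac=1, random_state=seed).reset_index(drop=True)

# Toy stand-in for the real DataFrame (hypothetical data)
toy = pd.DataFrame({
    "TITLE": [f"headline {i}" for i in range(12)],
    "CATEGORY": list("ebtm") * 3,
})

subset = balanced_sample(toy, "CATEGORY", n_per_class=2, seed=18)
print(subset["CATEGORY"].value_counts().to_dict())
```

On the real data, the equivalent call would presumably be `balanced_sample(df, 'CATEGORY', 5000, seed)` with a seed of your choosing.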


## Data Preparation

In [6]:
def encoded(category):
    """Return the integer label for a single-letter category code."""
    if category == "e":
        return 0
    elif category == "t":
        return 1
    elif category == "b":
        return 2
    elif category == "m":
        return 3

# Apply the encoder element-wise to the CATEGORY column
df_selected['TARGET'] = df_selected['CATEGORY'].map(encoded)

df_selected

| | TITLE | CATEGORY | TARGET |
|---|---|---|---|
| 0 | 20 deadliest states for workers | m | 3 |
| 1 | 49ers QB Colin Kaepernick defends reputation a... | b | 2 |
| 2 | The Voice Final Performances — Duets win the day | e | 0 |
| 3 | AbbVie Bid for Shire Rejected as Low | b | 2 |
| 4 | Aereo's Supreme Court loss a big victory for T... | t | 1 |
| ... | ... | ... | ... |
| 19995 | IMF says Russia Already in Recession | b | 2 |
| 19996 | Ikea Will Pay Its Workers a Living Wage | b | 2 |
| 19997 | Numsa plays hard ball as threat of investment ... | b | 2 |
| 19998 | Transparent rodents reveal details of inner an... | m | 3 |
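The same encoding can also be written as a lookup table, which makes the inverse mapping easy to derive. This is a sketch of an alternative, not the notebook's code; `CATEGORY_TO_ID` and `ID_TO_CATEGORY` are illustrative names.

```python
# Same integer codes as the encoded() function above
CATEGORY_TO_ID = {"e": 0, "t": 1, "b": 2, "m": 3}

# Inverse lookup, useful for turning predictions back into category codes
ID_TO_CATEGORY = {idx: code for code, idx in CATEGORY_TO_ID.items()}

codes = ["m", "b", "e", "b", "t"]
targets = [CATEGORY_TO_ID[c] for c in codes]
print(targets)  # [3, 2, 0, 2, 1]
```

Unlike the if/elif chain, which silently returns `None` for an unexpected code, the dictionary lookup in the list comprehension raises `KeyError`, so bad input is easier to spot.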


In [10]:
seed = 18
 
X = df_selected['TITLE']
y = df_selected['TARGET']

# Hold out 20% of the rows for testing, then 10% of the remainder for validation
# (72% train, 8% validation, 20% test overall)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=seed)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.10, random_state=seed)
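The two-stage split above can be checked with a little arithmetic. The sketch below assumes 20,000 input rows (4 categories of 5,000 each). `split_sizes` is a hypothetical helper, and it does not call scikit-learn.

```python
def split_sizes(n_rows, test_percent, val_percent):
    """Return (train, val, test) sizes for a two-stage hold-out split.

    Integer percentages keep the arithmetic exact; held-out sizes are
    rounded up, which is an assumption about how fractions are handled.
    """
    n_test = -(-n_rows * test_percent // 100)       # ceiling division
    n_remaining = n_rows - n_test
    n_val = -(-n_remaining * val_percent // 100)    # ceiling division
    n_train = n_remaining - n_val
    return n_train, n_val, n_test

n_train, n_val, n_test = split_sizes(20000, test_percent=20, val_percent=10)
print(n_train, n_val, n_test)  # 14400 1600 4000
```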

## Tokenization

In [11]:
tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')

In [12]:
print(f"Tokenizer vocab size = {tokenizer.vocab_size}")
print(list(tokenizer.vocab.keys())[6000:6020])

Tokenizer vocab size = 30522
['bees', '[unused91]', 'murmurs', 'intellect', 'interceptions', 'lust', '##can', 'compulsory', 'croix', 'titus', '##uj', 'clashed', 'scroll', 'confronting', 'rodriguez', 'mason', 'bingo', '##خ', 'nazis', 'ventures']
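Entries starting with `##` in the output above, such as `##can`, are continuation pieces: they can only appear inside a word, after another piece. The sketch below illustrates the idea with a simplified greedy longest-match split over a toy vocabulary. It is not the library's implementation, and `wordpiece` and `toy_vocab` are hypothetical.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one lowercase word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary, not the real DistilBERT one
toy_vocab = {"scroll", "mexi", "##can", "bingo", "##s"}

print(wordpiece("mexican", toy_vocab))  # ['mexi', '##can']
print(wordpiece("scrolls", toy_vocab))  # ['scroll', '##s']
print(wordpiece("zebra", toy_vocab))    # ['[UNK]']
```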


In [13]:
train_texts = X_train.to_list()
train_labels = y_train.tolist()
val_texts = X_val.to_list()
val_labels = y_val.tolist()
test_texts = X_test.to_list()
test_labels = y_test.tolist()

In [14]:
train_encodings = tokenizer(train_texts, padding=True, truncation=True)
val_encodings = tokenizer(val_texts, padding=True, truncation=True)
test_encodings = tokenizer(test_texts, padding=True, truncation=True)
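With `padding=True`, shorter headlines are padded up to the longest one in the call, and an attention mask records which positions hold real tokens. The sketch below imitates that behaviour on plain lists. It is a simplification (real truncation also preserves the special end token), the pad id of 0 is an assumption, and `pad_batch` is a hypothetical helper.

```python
def pad_batch(sequences, pad_id=0, max_length=None):
    """Pad (and optionally truncate) token-id lists to a common length."""
    if max_length is not None:
        sequences = [seq[:max_length] for seq in sequences]
    longest = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = longest - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Hypothetical token ids for three headlines of different lengths
batch = pad_batch([[101, 7, 8, 102], [101, 9, 102], [101, 4, 5, 6, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Because the three splits are tokenized in separate calls, each split ends up padded to its own longest headline.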

In [15]:
batch_size = 16

train_dataset = tf.data.Dataset.from_tensor_slices((
    dict(train_encodings),
    train_labels
)).batch(batch_size)

val_dataset = tf.data.Dataset.from_tensor_slices((
    dict(val_encodings),
    val_labels
)).batch(batch_size)

test_dataset = tf.data.Dataset.from_tensor_slices((
    dict(test_encodings),
    test_labels
)).batch(batch_size)

## Fine-tuning the model

In [58]:
model = TFAutoModelForSequenceClassification.from_pretrained(
        "distilbert-base-uncased",num_labels=4)

Some layers from the model checkpoint at distilbert-base-uncased were not used when initializing TFDistilBertForSequenceClassification: ['vocab_layer_norm', 'activation_13', 'vocab_projector', 'vocab_transform']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['dropout_59', 'classifier', 'pre_classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

In [17]:
num_epochs = 1

# The number of training steps is the number of samples in the dataset, divided by the batch size then multiplied
# by the total number of epochs. Since our dataset is already batched, we can simply take the len.
num_train_steps = len(train_dataset) * num_epochs

lr_scheduler = PolynomialDecay(
    initial_learning_rate=5e-5, end_learning_rate=0.0, decay_steps=num_train_steps
)
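With its default `power` of 1, `PolynomialDecay` reduces to a straight line from the initial to the final learning rate. The sketch below reproduces that formula in plain Python. It assumes 14,400 training rows (72% of 20,000) and the batch size of 16 used above; `polynomial_decay` is an illustrative stand-in, not the Keras class.

```python
import math

def polynomial_decay(step, initial_lr=5e-5, end_lr=0.0, decay_steps=900, power=1.0):
    """Learning rate at a given step for a polynomial decay schedule."""
    step = min(step, decay_steps)  # the rate stays at end_lr afterwards
    remaining = 1 - step / decay_steps
    return (initial_lr - end_lr) * remaining ** power + end_lr

# Assumption: 14,400 training rows and a batch size of 16
steps_per_epoch = math.ceil(14400 / 16)
print(steps_per_epoch)  # 900

for step in (0, 450, 900):
    print(step, polynomial_decay(step, decay_steps=steps_per_epoch))
```

With a single epoch, the learning rate therefore reaches zero exactly at the last training batch.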

In [52]:
opt = Adam(learning_rate=lr_scheduler)

loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
model.compile(optimizer=opt, loss=loss, metrics=["accuracy"])

model.fit(train_dataset, validation_data=val_dataset, epochs=num_epochs)



<keras.callbacks.History at 0x20edc6663a0>
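The loss is created with `from_logits=True` because the model outputs raw scores rather than probabilities. As a rough illustration of what that loss computes for a single example, here is a plain-Python sketch; `sparse_ce_from_logits` is a hypothetical helper, and Keras additionally averages the per-example values over the batch.

```python
import math

def sparse_ce_from_logits(logits, label):
    """Cross-entropy for one example given raw logits and an integer label."""
    shift = max(logits)  # subtracting the max avoids overflow in exp()
    log_sum_exp = shift + math.log(sum(math.exp(z - shift) for z in logits))
    return log_sum_exp - logits[label]

# A model with no preference among 4 classes scores ln(4) per example
print(round(sparse_ce_from_logits([0.0, 0.0, 0.0, 0.0], label=2), 4))  # 1.3863

# A confident, correct prediction gives a loss close to zero
print(round(sparse_ce_from_logits([0.1, 0.2, 6.0, 0.3], label=2), 4))
```

The value ln(4) ≈ 1.386 is a useful reference point: a four-class loss well below it indicates the model has learned something beyond guessing.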

In [53]:
model.evaluate(test_dataset)



[0.23979614675045013, 0.9172499775886536]
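The two numbers returned by `model.evaluate` are the test loss and the test accuracy, in that order. The sketch below turns the accuracy into a count of correct predictions, assuming the test split holds 4,000 headlines (20% of the 20,000 sampled rows).

```python
# Values copied from the model.evaluate() output above: [loss, accuracy]
test_loss, test_accuracy = 0.23979614675045013, 0.9172499775886536

# Assumption: the test split holds 20% of the 20,000 sampled headlines
n_test = 4000
n_correct = round(test_accuracy * n_test)
print(n_correct, n_test - n_correct)  # 3669 331

# With four roughly balanced classes, random guessing would score about 25%
chance_accuracy = 1 / 4
print(f"{test_accuracy:.1%} vs. {chance_accuracy:.0%} chance")
```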

## Testing the model on unseen headlines

In [54]:
text = "Pop star to start fashion company"
inputs = tokenizer(text, return_tensors="tf")
output = model(inputs)
pred_prob = tf.nn.softmax(output.logits, axis=-1)
print(pred_prob)
pred = np.argmax(pred_prob)  # index of the most probable class
labels = ['entertainment', 'science/tech', 'business', 'health']
print(labels[pred])

tf.Tensor([[0.96278346 0.01595769 0.01345383 0.00780493]], shape=(1, 4), dtype=float32)
entertainment


In [55]:
text = "Revolutionary methods for discovering new materials"
inputs = tokenizer(text, return_tensors="tf")
output = model(inputs)
pred_prob = tf.nn.softmax(output.logits, axis=-1)
print(pred_prob)
pred = np.argmax(pred_prob)  # index of the most probable class
labels = ['entertainment', 'science/tech', 'business', 'health']
print(labels[pred])

tf.Tensor([[0.04672822 0.6163599  0.15213916 0.18477274]], shape=(1, 4), dtype=float32)
science/tech


In [56]:
text = "Rebranded bank will target global growth"
inputs = tokenizer(text, return_tensors="tf")
output = model(inputs)
pred_prob = tf.nn.softmax(output.logits, axis=-1)
print(pred_prob)
pred = np.argmax(pred_prob)  # index of the most probable class
labels = ['entertainment', 'science/tech', 'business', 'health']
print(labels[pred])

tf.Tensor([[5.5752648e-04 6.5813232e-03 9.9111384e-01 1.7472489e-03]], shape=(1, 4), dtype=float32)
business


In [57]:
text = "A new sustainable vaccination against Ebola developed."
inputs = tokenizer(text, return_tensors="tf")
output = model(inputs)
pred_prob = tf.nn.softmax(output.logits, axis=-1)
print(pred_prob)
pred = np.argmax(pred_prob)  # index of the most probable class
labels = ['entertainment', 'science/tech', 'business', 'health']
print(labels[pred])

tf.Tensor([[4.3970757e-04 7.1494072e-04 1.1743810e-03 9.9767095e-01]], shape=(1, 4), dtype=float32)
health
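The four cells above repeat the same softmax-and-argmax step. A minimal sketch of how that step could be factored into a helper follows. It works on a plain list of logits so it runs without the model; `label_from_logits` and the example logits are hypothetical.

```python
import math

LABELS = ['entertainment', 'science/tech', 'business', 'health']

def label_from_logits(logits, labels=LABELS):
    """Return (label, probabilities) for one row of raw logits."""
    shift = max(logits)  # stabilises the exponentials
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs

# Hypothetical logits, not produced by the fine-tuned model
label, probs = label_from_logits([0.5, 3.0, 1.2, 1.4])
print(label, [round(p, 3) for p in probs])
```

With the notebook's objects, a row of logits could presumably be obtained with `model(tokenizer(text, return_tensors="tf")).logits[0].numpy().tolist()`.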


## End of Notebook