## News_Aggregator_Classification

## Objective

To predict the category (business, entertainment, science and technology, or health) of a news article given its headline.

## Datasets (Source & Acknowledgements)

The columns included in this dataset are:

- ID: the numeric ID of the article
- TITLE: the headline of the article
- URL: the URL of the article
- PUBLISHER: the publisher of the article
- CATEGORY: the category of the news item; one of:
  - b: business
  - t: science and technology
  - e: entertainment
  - m: health
- STORY: alphanumeric ID of the news story that the article discusses
- HOSTNAME: hostname where the article was posted
- TIMESTAMP: approximate timestamp of the article's publication
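
As a small illustration of the schema above, the sketch below decodes a `CATEGORY` code into its name and converts a `TIMESTAMP` value into a readable date. It assumes the timestamps are milliseconds since the Unix epoch, which fits the 13-digit values in the data preview but is not stated in the column description. The names `CATEGORY_NAMES` and `decode_row` are hypothetical.

```python
from datetime import datetime, timezone

# Category codes as listed in the column description above.
CATEGORY_NAMES = {
    "b": "business",
    "t": "science and technology",
    "e": "entertainment",
    "m": "health",
}

def decode_row(category_code, timestamp_ms):
    """Return (category name, UTC datetime) for one row.

    Assumption: TIMESTAMP is in milliseconds since the Unix epoch.
    """
    published = datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc)
    return CATEGORY_NAMES[category_code], published

# Values taken from the first row of the data preview.
name, published = decode_row("b", 1394470370698)
print(name, published.date())
```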

1. Publication Dataset:
Lichman, M. (2013). UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science. https://archive.ics.uci.edu/ml/datasets/News+Aggregator

2. Kaggle Dataset:
https://www.kaggle.com/datasets/uciml/news-aggregator-dataset

In [2]:
# Import libraries

import warnings

import numpy as np
import pandas as pd
import tensorflow as tf
import tensorflow.keras as keras
from sklearn.model_selection import train_test_split
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

# Silence library warnings to keep the notebook output readable
warnings.filterwarnings("ignore")

## Read in the dataset

In [3]:
df = pd.read_csv('./raw_data/uci-news-aggregator.csv')
df

| | ID | TITLE | URL | PUBLISHER | CATEGORY | STORY | HOSTNAME | TIMESTAMP |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Fed official says weak data caused by weather,... | http://www.latimes.com/business/money/la-fi-mo... | Los Angeles Times | b | ddUyU0VZz0BRneMioxUPQVP6sIxvM | www.latimes.com | 1394470370698 |
| 1 | 2 | Fed's Charles Plosser sees high bar for change... | http://www.livemint.com/Politics/H2EvwJSK2VE6O... | Livemint | b | ddUyU0VZz0BRneMioxUPQVP6sIxvM | www.livemint.com | 1394470371207 |
| 2 | 3 | US open: Stocks fall after Fed official hints ... | http://www.ifamagazine.com/news/us-open-stocks... | IFA Magazine | b | ddUyU0VZz0BRneMioxUPQVP6sIxvM | www.ifamagazine.com | 1394470371550 |
| 3 | 4 | Fed risks falling 'behind the curve', Charles ... | http://www.ifamagazine.com/news/fed-risks-fall... | IFA Magazine | b | ddUyU0VZz0BRneMioxUPQVP6sIxvM | www.ifamagazine.com | 1394470371793 |
| 4 | 5 | Fed's Plosser: Nasty Weather Has Curbed Job Gr... | http://www.moneynews.com/Economy/federal-reser... | Moneynews | b | ddUyU0VZz0BRneMioxUPQVP6sIxvM | www.moneynews.com | 1394470372027 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 422414 | 422933 | Surgeons to remove 4-year-old's rib to rebuild... | http://www.cbs3springfield.com/story/26378648/... | WSHM-TV | m | dpcLMoJD69UYMXMxaoEFnWql9YjQM | www.cbs3springfield.com | 1409229190251 |
| 422415 | 422934 | Boy to have surgery on esophagus after battery... | http://www.wlwt.com/news/boy-to-have-surgery-o... | WLWT Cincinnati | m | dpcLMoJD69UYMXMxaoEFnWql9YjQM | www.wlwt.com | 1409229190508 |
| 422416 | 422935 | Child who swallowed battery to have reconstruc... | http://www.newsnet5.com/news/local-news/child-... | NewsNet5.com | m | dpcLMoJD69UYMXMxaoEFnWql9YjQM | www.newsnet5.com | 1409229190771 |
| 422417 | 422936 | Phoenix boy undergoes surgery to repair throat... | http://www.wfsb.com/story/26368078/phoenix-boy... | WFSB | m | dpcLMoJD69UYMXMxaoEFnWql9YjQM | www.wfsb.com | 1409229191071 |


In [4]:
df = df[['TITLE','CATEGORY']]
df

| | TITLE | CATEGORY |
|---|---|---|
| 0 | Fed official says weak data caused by weather,... | b |
| 1 | Fed's Charles Plosser sees high bar for change... | b |
| 2 | US open: Stocks fall after Fed official hints ... | b |
| 3 | Fed risks falling 'behind the curve', Charles ... | b |
| 4 | Fed's Plosser: Nasty Weather Has Curbed Job Gr... | b |
| ... | ... | ... |
| 422414 | Surgeons to remove 4-year-old's rib to rebuild... | m |
| 422415 | Boy to have surgery on esophagus after battery... | m |
| 422416 | Child who swallowed battery to have reconstruc... | m |
| 422417 | Phoenix boy undergoes surgery to repair throat... | m |


In [5]:
df['CATEGORY'].value_counts()

CATEGORY
e    152469
b    115967
t    108344
m     45639
Name: count, dtype: int64
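
The counts above show that the four classes are uneven, which is presumably why the next step draws the same number of rows from each. A quick sketch of the arithmetic, with the counts copied by hand from the output:

```python
# Counts copied from the value_counts() output above.
counts = {"e": 152469, "b": 115967, "t": 108344, "m": 45639}

total = sum(counts.values())
shares = {code: n / total for code, n in counts.items()}

for code, share in shares.items():
    print(f"{code}: {share:.1%}")

# Ratio between the largest and the smallest class.
imbalance_ratio = max(counts.values()) / min(counts.values())
print(f"largest / smallest = {imbalance_ratio:.2f}")
```

Sampling a fixed number of rows per class, as the next cell does, removes this imbalance at the cost of discarding most of the data.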

### Sample 5000 rows for each category at random

In [6]:
e = df[df['CATEGORY'] == 'e'].sample(n=5000)
b = df[df['CATEGORY'] == 'b'].sample(n=5000)
t = df[df['CATEGORY'] == 't'].sample(n=5000)
m = df[df['CATEGORY'] == 'm'].sample(n=5000)
df_selected = pd.concat([e,b,t,m], ignore_index=True)
df_selected = df_selected.reindex(np.random.permutation(df_selected.index)).reset_index(drop=True)
df_selected 

| | TITLE | CATEGORY |
|---|---|---|
| 0 | L'Wren Scott death ruled a suicide | e |
| 1 | The Voice: Who Will Take Home The Trophy? | e |
| 2 | Asia stocks lackluster, US economic data awaited | e |
| 3 | Nonprofits help the dying make videos | m |
| 4 | Michelle Obama: Effort to weaken healthier sch... | m |
| ... | ... | ... |
| 19995 | Mosquito tests positive for West Nile virus in... | m |
| 19996 | Guinea Bans Bat Eating to Curb Ebola Spread, W... | m |
| 19997 | Assassin's Creed Unity is just what the franch... | t |
| 19998 | Feds release all cows gathered during NV roundup | e |
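
Note that `DataFrame.sample` draws a fresh random subset on every run unless `random_state` is set, and `np.random.permutation` likewise depends on NumPy's global seed, so the rows selected above will differ between runs. The library-free sketch below shows the same idea (a fixed number of rows per category, then a shuffle) driven by a single seed. The function `balanced_sample` and the toy data are hypothetical and stand in for the pandas calls.

```python
import random
from collections import defaultdict

def balanced_sample(rows, n_per_class, seed):
    """Draw n_per_class rows per category, then shuffle, reproducibly.

    rows is a list of (title, category) tuples.
    """
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for title, category in rows:
        by_category[category].append((title, category))

    selected = []
    # Iterate in sorted order so the result does not depend on input order of categories.
    for category in sorted(by_category):
        selected.extend(rng.sample(by_category[category], n_per_class))
    rng.shuffle(selected)
    return selected

# Hypothetical toy data: 10 headlines per category.
toy_rows = [(f"headline {i} ({c})", c) for c in "ebtm" for i in range(10)]
sample_a = balanced_sample(toy_rows, n_per_class=3, seed=18)
sample_b = balanced_sample(toy_rows, n_per_class=3, seed=18)
print(len(sample_a), sample_a == sample_b)
```

In the notebook itself, passing the same `random_state` to each `sample` call would be one way to get the equivalent effect.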


## Data Preparation

In [7]:
# Integer target for each category code; the order matches the label list used at prediction time
category_to_target = {"e": 0, "t": 1, "b": 2, "m": 3}

df_selected['TARGET'] = df_selected['CATEGORY'].map(category_to_target)

df_selected

| | TITLE | CATEGORY | TARGET |
|---|---|---|---|
| 0 | L'Wren Scott death ruled a suicide | e | 0 |
| 1 | The Voice: Who Will Take Home The Trophy? | e | 0 |
| 2 | Asia stocks lackluster, US economic data awaited | e | 0 |
| 3 | Nonprofits help the dying make videos | m | 3 |
| 4 | Michelle Obama: Effort to weaken healthier sch... | m | 3 |
| ... | ... | ... | ... |
| 19995 | Mosquito tests positive for West Nile virus in... | m | 3 |
| 19996 | Guinea Bans Bat Eating to Curb Ebola Spread, W... | m | 3 |
| 19997 | Assassin's Creed Unity is just what the franch... | t | 1 |
| 19998 | Feds release all cows gathered during NV roundup | e | 0 |


In [8]:
seed = 18
 
X = df_selected['TITLE']
y = df_selected['TARGET']

# Hold out 20% of the rows for testing, then 10% of the remainder for validation (a 72/8/20 train/validation/test split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=seed)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.10, random_state=seed)
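
Because the validation set is carved out of the rows left after the test split, the effective proportions are not 70/10/20. The sketch below works through the arithmetic, assuming `df_selected` holds 20,000 rows (4 categories × 5,000 samples). The helper `split_sizes` is hypothetical, and rounding the held-out size up is an assumption about how the fractions are resolved.

```python
import math

def split_sizes(n_rows, test_fraction, val_fraction):
    """Return (train, val, test) sizes for a two-stage hold-out split.

    The test set is taken from all rows; the validation set is taken
    from whatever remains after that.
    """
    n_test = math.ceil(n_rows * test_fraction)
    n_remaining = n_rows - n_test
    n_val = math.ceil(n_remaining * val_fraction)
    n_train = n_remaining - n_val
    return n_train, n_val, n_test

# Assumes 20,000 rows in df_selected.
n_train, n_val, n_test = split_sizes(20_000, test_fraction=0.20, val_fraction=0.10)
print(n_train, n_val, n_test)
```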

## Tokenization

In [9]:
tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')

In [10]:
print(f"Tokenizer vocab size = {tokenizer.vocab_size}")
print(list(tokenizer.vocab.keys())[6000:6020])

Tokenizer vocab size = 30522
['feeder', '##paper', '##ม', '##ᆨ', 'stil', 'currency', 'roche', 'spies', 'bureaucracy', '[unused55]', '1801', 'grouped', 'eh', 'covert', 'concern', 'cents', 'compute', '##hering', 'faust', 'accurate']
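
The `##` prefix in the vocabulary sample marks a piece that continues a word rather than starting one. The sketch below is a toy greedy longest-match splitter over a hand-picked vocabulary, meant only to illustrate that convention. It is not the `transformers` implementation, and `toy_vocab` and `split_word` are hypothetical names.

```python
def split_word(word, vocab, unk="[UNK]"):
    """Greedily split a word into the longest pieces found in vocab.

    Pieces after the first are looked up with a '##' prefix.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# Hand-picked toy vocabulary, not the real 30,522-entry one.
toy_vocab = {"news", "##paper", "feed", "##er", "grouped"}
print(split_word("newspaper", toy_vocab))
print(split_word("feeder", toy_vocab))
```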


In [11]:
train_texts = X_train.to_list()
train_labels = y_train.tolist()
val_texts = X_val.to_list()
val_labels = y_val.tolist()
test_texts = X_test.to_list()
test_labels = y_test.tolist()

In [12]:
train_encodings = tokenizer(train_texts, padding=True, truncation=True)
val_encodings = tokenizer(val_texts, padding=True, truncation=True)
test_encodings = tokenizer(test_texts, padding=True, truncation=True)
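
With `padding=True`, each tokenizer call pads its sequences to the longest one in that call and returns an attention mask alongside the token ids, so the three splits above may end up with different padded lengths. The sketch below illustrates the idea on made-up token ids. `pad_batch` is hypothetical, the pad id of 0 is an assumption, and the truncation here is a plain slice, which is simpler than what the real tokenizer does with special tokens.

```python
def pad_batch(sequences, pad_id=0, max_length=None):
    """Pad (and optionally truncate) token-id lists to a common length.

    Returns the padded ids and a matching attention mask in which 1 marks
    a real token and 0 marks padding.
    """
    if max_length is not None:
        sequences = [seq[:max_length] for seq in sequences]
    longest = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Hypothetical token ids for three headlines of different lengths.
batch = pad_batch([[101, 7, 8, 102], [101, 9, 102], [101, 4, 5, 6, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```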

In [13]:
batch_size = 16

train_dataset = tf.data.Dataset.from_tensor_slices((
    dict(train_encodings),
    train_labels
)).batch(batch_size)

val_dataset = tf.data.Dataset.from_tensor_slices((
    dict(val_encodings),
    val_labels
)).batch(batch_size)

test_dataset = tf.data.Dataset.from_tensor_slices((
    dict(test_encodings),
    test_labels
)).batch(batch_size)

## Fine-tuning the model

In [14]:
model = TFAutoModelForSequenceClassification.from_pretrained(
        "distilbert-base-uncased",num_labels=4)

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFDistilBertForSequenceClassification: ['vocab_layer_norm.bias', 'vocab_layer_norm.weight', 'vocab_projector.bias', 'vocab_transform.bias', 'vocab_transform.weight']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
Some weights or buffers of the TF 2.0 model TFDistilBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classifier.weight', 'classifier.bias']
You should 

In [15]:
num_epochs = 1

# The number of training steps is the number of samples in the dataset, divided by the batch size then multiplied
# by the total number of epochs. Since our dataset is already batched, we can simply take the len.
num_train_steps = len(train_dataset) * num_epochs

lr_scheduler = keras.optimizers.schedules.PolynomialDecay(
    initial_learning_rate=5e-5, end_learning_rate=0.0, decay_steps=num_train_steps
)
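
With its default `power` of 1, `PolynomialDecay` reduces to a straight line from the initial rate to the end rate over `decay_steps`. The sketch below reproduces that formula in plain Python. The step count assumes 14,400 training rows and a batch size of 16, and `polynomial_decay` is a hypothetical helper, not a Keras function.

```python
def polynomial_decay(step, initial_lr, end_lr, decay_steps, power=1.0):
    """Learning rate at a given step; steps past decay_steps stay at end_lr."""
    step = min(step, decay_steps)
    remaining = 1 - step / decay_steps
    return (initial_lr - end_lr) * remaining ** power + end_lr

# Assumes 14,400 training rows and batch_size = 16, i.e. 900 steps for one epoch.
decay_steps = 14_400 // 16
for step in (0, decay_steps // 2, decay_steps):
    print(step, polynomial_decay(step, 5e-5, 0.0, decay_steps))
```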

In [None]:
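loss = keras.losses.SparseCategoricalCrossentropy(from_logits=True)
# Pass the schedule to the optimizer so the learning rate decays at every training step.
# A LearningRateScheduler callback invokes its schedule only once per epoch, with the
# epoch index, so the per-step decay defined above would never take effect.
optimizer = keras.optimizers.Adam(learning_rate=lr_scheduler)
model.compile(optimizer=optimizer, loss=loss, metrics=["accuracy"])
model.fit(train_dataset, validation_data=val_dataset, epochs=num_epochs)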
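
The loss is configured with `from_logits=True` because the classification head outputs raw scores rather than probabilities, and the "sparse" variant takes integer labels such as the `TARGET` values instead of one-hot vectors. A minimal sketch of what that loss computes for a single example, using made-up logits (`sparse_cross_entropy_from_logits` is a hypothetical helper):

```python
import math

def sparse_cross_entropy_from_logits(logits, label):
    """Cross-entropy for one example given raw logits and an integer label.

    Equivalent to -log(softmax(logits)[label]), computed with the
    log-sum-exp trick for numerical stability.
    """
    shift = max(logits)
    log_sum_exp = shift + math.log(sum(math.exp(z - shift) for z in logits))
    return log_sum_exp - logits[label]

# Made-up logits for the four classes; class 0 has the highest score.
logits = [2.0, 0.5, 0.1, -1.0]
loss_if_label_0 = sparse_cross_entropy_from_logits(logits, 0)
loss_if_label_3 = sparse_cross_entropy_from_logits(logits, 3)
print(round(loss_if_label_0, 4), round(loss_if_label_3, 4))
```

The loss is small when the true label has the highest logit and grows as the model puts more weight on other classes.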

In [None]:
model.evaluate(test_dataset)

## Testing the model on unseen headlines

In [None]:
text = "Pop star to start fashion company"
inputs = tokenizer(text, return_tensors="tf")
output = model(inputs)
pred_prob = tf.nn.softmax(output.logits, axis=-1)
print(pred_prob)
labels = ['entertainment', 'science/tech', 'business', 'health']
print(labels[np.argmax(pred_prob)])

In [None]:
text = "Revolutionary methods for discovering new materials"
inputs = tokenizer(text, return_tensors="tf")
output = model(inputs)
pred_prob = tf.nn.softmax(output.logits, axis=-1)
print(pred_prob)
labels = ['entertainment', 'science/tech', 'business', 'health']
print(labels[np.argmax(pred_prob)])

In [None]:
text = "Rebranded bank will target global growth"
inputs = tokenizer(text, return_tensors="tf")
output = model(inputs)
pred_prob = tf.nn.softmax(output.logits, axis=-1)
print(pred_prob)
labels = ['entertainment', 'science/tech', 'business', 'health']
print(labels[np.argmax(pred_prob)])

In [None]:
text = "A new sustainable vaccination against Ebola developed."
inputs = tokenizer(text, return_tensors="tf")
output = model(inputs)
pred_prob = tf.nn.softmax(output.logits, axis=-1)
print(pred_prob)
labels = ['entertainment', 'science/tech', 'business', 'health']
print(labels[np.argmax(pred_prob)])
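
The four cells above repeat the same steps: get logits, apply softmax, take the arg max, and look up the label. One way to factor that out is sketched below in plain Python. `classify` and `logits_fn` are hypothetical, and the stub `fake_logits` returns made-up scores in place of the fine-tuned model.

```python
import math

# Same order as the TARGET encoding: e=0, t=1, b=2, m=3.
LABELS = ["entertainment", "science/tech", "business", "health"]

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(headline, logits_fn, labels=LABELS):
    """Return (label, probability) for one headline.

    logits_fn is a hypothetical callable that maps a headline to a list
    with one raw score per label.
    """
    probs = softmax(logits_fn(headline))
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]

# Stub standing in for the fine-tuned model, with made-up scores.
def fake_logits(headline):
    return [0.2, 0.1, 3.0, -0.5]

label, prob = classify("Rebranded bank will target global growth", fake_logits)
print(label, round(prob, 3))
```

In the notebook, `logits_fn` might wrap the existing calls, for example by tokenizing the text with `return_tensors="tf"`, calling `model`, and converting the first row of `output.logits` to a list.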
