# Vox Articles Classification
## Goal
Gain experience applying machine learning methods to a real-world dataset, using the concepts
and algorithms covered in class.
## Dataset
The dataset consists of 13,930 news articles from Vox ([www.vox.com](https://www.vox.com)). The goal
is to classify whether an article is about `politics`. The dataset can be downloaded here: [https://uofi.box.com/s/w5hdeyorrvrvht1c9o42whkslv3pi808](https://uofi.box.com/s/w5hdeyorrvrvht1c9o42whkslv3pi808)

In [11]:
from itertools import product
import pickle as pkl
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score
import lightgbm as lgb
import tensorflow as tf

In [3]:
def load_data(file_path):
    """Load the Vox features and labels from the given pickle file."""
    with open(file_path, 'rb') as f:
        # The pickle also stores article IDs and links, which are not used here.
        x, y, _article_ids, _article_links = pkl.load(f)
    return x, y

## Baseline
We use k-nearest neighbors (KNN) as the baseline model against which the other classifiers are compared.

In [6]:
data_file = 'vox_data.pkl'
x, y = load_data(data_file)
print('x shape:', x.shape)
print('y shape:', y.shape)

x shape: (13930, 300)
y shape: (13930,)
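Accuracy on a binary task is easier to interpret next to the accuracy of always predicting the most frequent class. The sketch below computes that reference value. The label array `y_demo` is hypothetical and only illustrates the call; substitute the real `y` returned by `load_data`.

```python
import numpy as np


def majority_class_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    labels = np.asarray(labels)
    _, counts = np.unique(labels, return_counts=True)
    return counts.max() / labels.size


# Hypothetical labels for illustration only; replace with the real `y`.
y_demo = np.array([0, 0, 0, 1, 0, 1, 0, 0, 1, 0])
baseline = majority_class_accuracy(y_demo)
print('Majority-class accuracy: {:.4f}'.format(baseline))
```

Any classifier in this notebook should beat this value on the real labels to be considered useful.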


In [7]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.1, random_state=12, shuffle=True)
print('x_train shape:', x_train.shape)
print('y_train shape:', y_train.shape)
print('x_test shape:', x_test.shape)
print('y_test shape:', y_test.shape)

y_train = y_train.astype('int32')
y_test = y_test.astype('int32')

x_train shape: (12537, 300)
y_train shape: (12537,)
x_test shape: (1393, 300)
y_test shape: (1393,)
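The split above shuffles the data but does not control the class ratio in each part. A minimal sketch, assuming synthetic stand-in data with roughly 30% positives (not the Vox data), of how the `stratify` argument of `train_test_split` keeps the ratio nearly identical in the training and test sets:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(12)
# Synthetic stand-ins for the real features and labels (assumed values).
x_demo = rng.normal(size=(1000, 300))
y_demo = (rng.random(1000) < 0.3).astype('int32')

x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.1, random_state=12, shuffle=True,
    stratify=y_demo)  # preserve the class ratio in both splits

print('positive rate (train): {:.3f}'.format(y_tr.mean()))
print('positive rate (test):  {:.3f}'.format(y_te.mean()))
```

Whether stratification changes the results reported here has not been checked; it mainly matters when one class is rare.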


In [14]:
clf = KNeighborsClassifier()
param_grid = {'n_neighbors': [3, 5, 7, 9, 11],
              'p': [1, 2]}
grid = GridSearchCV(clf, param_grid, verbose=2, cv=5, n_jobs=-1)
grid.fit(x_train, y_train)

print('Tuned hyperparameters (best parameters):', grid.best_params_)
print('Accuracy:', grid.best_score_)

Fitting 5 folds for each of 10 candidates, totalling 50 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  37 tasks      | elapsed:  8.0min
[Parallel(n_jobs=-1)]: Done  50 out of  50 | elapsed: 10.6min finished


Tuned hyperparameters (best parameters): {'n_neighbors': 11, 'p': 1}
Accuracy: 0.8496452039552411


In [15]:
clf = KNeighborsClassifier(**grid.best_params_)
clf.fit(x_train, y_train)

y_train_pred = clf.predict(x_train)
y_pred = clf.predict(x_test)
print('Training accuracy: {:.4f}'.format(accuracy_score(y_train, y_train_pred)))
print('Test accuracy: {:.4f}'.format(accuracy_score(y_test, y_pred)))

Training accuracy: 0.8760
Test accuracy: 0.8528
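The `p` values in the KNN grid select the order of the Minkowski distance used to find neighbors: `p=1` is the Manhattan distance and `p=2` is the Euclidean distance. A small NumPy sketch of that formula follows; the helper `minkowski_distance` and the two example vectors are illustrative and are not part of the notebook.

```python
import numpy as np


def minkowski_distance(a, b, p):
    """Minkowski distance of order p between two vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)


u = np.array([0.0, 0.0])
v = np.array([3.0, 4.0])
d_manhattan = minkowski_distance(u, v, p=1)  # sum of absolute differences
d_euclidean = minkowski_distance(u, v, p=2)  # straight-line distance
print(d_manhattan, d_euclidean)
```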


## SGDClassifier
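Next, we train linear classifiers with `SGDClassifier`, searching over the loss function, the penalty, and the regularization strength `alpha`.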

In [12]:
grid_params = {
    'loss': ['hinge', 'log', 'squared_hinge'],  # 'log' is named 'log_loss' in scikit-learn >= 1.1
    'penalty': ['l1', 'l2'],
    'alpha': [0.01, 0.001, 0.0001]
    }


clf = SGDClassifier()
grid = GridSearchCV(clf, grid_params,
                    scoring='accuracy',
                    verbose=2,
                    cv=5,
                    n_jobs=-1)
grid.fit(x_train, y_train)
print('Best score reached: {} with params: {} '.format(grid.best_score_, grid.best_params_))

Fitting 5 folds for each of 18 candidates, totalling 90 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  37 tasks      | elapsed:  1.6min
[Parallel(n_jobs=-1)]: Done  90 out of  90 | elapsed:  4.4min finished


Best score reached: 0.8516390152230852 with params: {'alpha': 0.0001, 'loss': 'hinge', 'penalty': 'l1'} 


In [13]:
clf = SGDClassifier(**grid.best_params_)
clf.fit(x_train, y_train)
y_train_pred = clf.predict(x_train)
y_pred = clf.predict(x_test)

print('Training accuracy: {:.4f}'.format(accuracy_score(y_train, y_train_pred)))
print('Test accuracy: {:.4f}'.format(accuracy_score(y_test, y_pred)))

Training accuracy: 0.8545
Test accuracy: 0.8586
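With `loss='hinge'`, `SGDClassifier` fits a linear SVM by stochastic gradient descent, which tends to be sensitive to feature scale. A minimal sketch, assuming synthetic data from `make_classification` in place of the Vox features, that standardizes the inputs inside a pipeline so the scaler is fit on the training data only. Whether scaling helps on the real features has not been tested here.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption); replace with the Vox features.
x_demo, y_demo = make_classification(
    n_samples=2000, n_features=50, n_informative=10,
    n_clusters_per_class=1, random_state=12)
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.1, random_state=12)

# Scaling and the classifier are fit together on the training split.
model = make_pipeline(
    StandardScaler(),
    SGDClassifier(loss='hinge', penalty='l1', alpha=0.0001, random_state=12))
model.fit(x_tr, y_tr)

test_acc = model.score(x_te, y_te)
print('Test accuracy: {:.4f}'.format(test_acc))
```

The same pipeline object can be passed to `GridSearchCV` by prefixing the parameter names with `sgdclassifier__`.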


## Neural Networks
In this section, we train a feed-forward neural network for the same classification task and compare its accuracy with the previous models.

In [None]:
# Build and train one NN per hyperparameter combination
hidden_units = [[100, 100], [200, 200], [300, 300]]
learning_rates = [1e-1, 1e-2, 1e-3, 1e-4]
activations = ['relu', 'sigmoid']

best_acc = 0.
best_params = {}
for hs, lr, act in product(hidden_units, learning_rates, activations):
    print('Number of hidden units: {} Learning rate: {} Activation: {}'.format(hs, lr, act))
    model = tf.keras.models.Sequential()
    for h in hs:
        model.add(tf.keras.layers.Dense(h, activation=act))
    model.add(tf.keras.layers.Dense(1, activation='sigmoid'))
    model.compile(optimizer=tf.keras.optimizers.Adam(lr), loss='binary_crossentropy', metrics=['acc'])
    hist = model.fit(x_train, y_train, epochs=10, validation_split=0.1)

    # Note: picking hyperparameters by test accuracy makes best_acc an optimistic estimate.
    _, test_acc = model.evaluate(x_test, y_test)
    if test_acc > best_acc:
        best_acc = test_acc
        best_params = {
            'hidden_units': hs,
            'learning_rate': lr,
            'activation': act
        }

print('Best accuracy: {:.4f}'.format(best_acc))
print('Best params:', best_params)

Number of hidden units: [100, 100] Learning rate: 0.1 Activation: relu
Epoch 1/10 ... Epoch 10/10
Number of hidden units: [100, 100] Learning rate: 0.1 Activation: sigmoid
Epoch 1/10 ... Epoch 10/10
Number of hidden units: [100, 100] Learning rate: 0.01 Activation: relu
Epoch 1/10 ... Epoch 10/10
Number of hidden units: [100, 100] Learning rate: 0.01 Activation: sigmoid
Epoch 1/10 ... Epoch 10/10
Number of hidden units: [100, 100] Learning rate: 0.001 Activation: relu
Epoch 1/10 ... Epoch 10/10
Number of hidden units: [100, 100] Learning rate: 0.001 Activation: sigmoid
[output truncated]
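The search loop in this section keeps the configuration with the highest test accuracy, so the reported best accuracy is likely optimistic. One common alternative is to select on a held-out validation split and evaluate on the test set only once. The sketch below illustrates that pattern under stated assumptions: it uses scikit-learn's `MLPClassifier` instead of the Keras model to stay short, a reduced grid, and synthetic data from `make_classification` rather than the Vox features.

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data (assumption); replace with the Vox features.
x_demo, y_demo = make_classification(
    n_samples=1000, n_features=20, n_informative=8, random_state=12)
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.1, random_state=12)
# Hold out part of the training data for model selection.
x_fit, x_val, y_fit, y_val = train_test_split(
    x_tr, y_tr, test_size=0.1, random_state=12)

hidden_units = [(50, 50), (100, 100)]
learning_rates = [1e-2, 1e-3]

best_val_acc, best_params, best_model = 0.0, {}, None
for hs, lr in product(hidden_units, learning_rates):
    model = MLPClassifier(hidden_layer_sizes=hs, learning_rate_init=lr,
                          max_iter=50, random_state=12)
    model.fit(x_fit, y_fit)
    val_acc = model.score(x_val, y_val)  # selection uses validation data only
    if val_acc > best_val_acc:
        best_val_acc = val_acc
        best_params = {'hidden_units': hs, 'learning_rate': lr}
        best_model = model

# The test set is used once, after the configuration is fixed.
test_acc = best_model.score(x_te, y_te)
print('Best params:', best_params)
print('Validation accuracy: {:.4f}'.format(best_val_acc))
print('Test accuracy: {:.4f}'.format(test_acc))
```

In the Keras loop, the same idea could be applied by comparing the validation accuracy that `model.fit` already records through `validation_split=0.1`.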

## Question

a) Which three classifiers (two new, one old) did you choose?

Three classifiers are used:
- Support Vector Machine (linear, trained with `SGDClassifier` using hinge loss)
- LightGBM
- Neural network

b) What software did you use and why did you choose it?

I used two widely adopted machine learning and deep learning frameworks, `sklearn` and `tensorflow`, because both are easy to use.

c) What are the results?

`KNN` and `SGDClassifier` perform comparably (test accuracy 0.8528 and 0.8586, respectively). The neural network performs slightly better.
