## 04 - Neural network on term frequency matrix

This notebook builds several neural networks to classify news articles as
fake or reliable. The input data is a term frequency matrix.

[Learn more about neural networks here](http://news.mit.edu/2017/explained-neural-networks-deep-learning-0414)

While TensorFlow gives much more control over the building blocks, the
Keras wrapper is much easier to use for a beginner such as myself.
Fortunately, Keras can accept sparse matrices as inputs!

Reminder of labels:
0 = fake, 1 = reliable

In [14]:
import pandas as pd
import numpy as np
from scipy.sparse import load_npz

from keras.models import Sequential
from keras.layers import Dense
from keras.utils import to_categorical

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

### Load data

The data is already cleaned and has no missing values. There are 20000
data points with nearly 200k features. Of the 20000 articles, 15600 are
fake and 4400 are reliable.

Each row of X is one article that has been cleaned and tokenized.
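
To make the row structure concrete, here is a minimal sketch of how tokenized articles could map to a sparse term frequency matrix. The two toy documents and the names `docs`, `vocab`, and `X_toy` are made up for illustration. The actual preprocessing that produced `news_tf_sparse.npz` is not shown in this notebook and may differ.

```python
from collections import Counter

from scipy.sparse import csr_matrix

# Hypothetical tokenized articles (not taken from the real dataset)
docs = [["fake", "news", "news"], ["reliable", "news"]]

vocab = {}                      # term -> column index, assigned on first sight
rows, cols, vals = [], [], []
for i, tokens in enumerate(docs):
    for term, count in Counter(tokens).items():
        j = vocab.setdefault(term, len(vocab))
        rows.append(i)
        cols.append(j)
        vals.append(float(count))

# One row per article, one column per term, values are raw counts
X_toy = csr_matrix((vals, (rows, cols)), shape=(len(docs), len(vocab)))
print(vocab)
print(X_toy.toarray())
```

Only the non-zero counts are stored, which is why a matrix with nearly 200k columns stays manageable.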

In [17]:
# datapath
data_path = 'D:\\PycharmProjects\\springboard\\data\\'

# Load X, y_normal
X = load_npz(f'{data_path}news_tf_sparse.npz')
y_normal = pd.read_csv(f'{data_path}news_labels.csv')['0']

# Check data size
print(f'X shape is {X.shape}')
print(f'y_normal shape is {y_normal.shape}')

# Print labels distribution
print(np.unique(y_normal, return_counts=True))

# Transform y to 2 columns for fitting
y = to_categorical(y_normal)

# Print samples data for easy debugging
print('Sample rows of X')
print(X[0,:])
print('Sample rows of y')
print(y[:10,:])

X shape is (20000, 196679)
y_normal shape is (20000,)
(array([0, 1], dtype=int64), array([15600,  4400], dtype=int64))
Sample rows of X
  (0, 0)	1.0
  (0, 1)	1.0
  (0, 2)	1.0
  (0, 3)	1.0
  (0, 4)	1.0
  (0, 5)	1.0
  (0, 6)	1.0
  (0, 7)	1.0
  (0, 8)	1.0
  (0, 9)	1.0
  (0, 10)	1.0
  (0, 11)	1.0
  (0, 12)	1.0
  (0, 13)	1.0
  (0, 14)	1.0
  (0, 15)	1.0
  (0, 16)	1.0
  (0, 17)	1.0
  (0, 18)	1.0
  (0, 19)	1.0
  (0, 20)	2.0
  (0, 21)	1.0
  (0, 22)	1.0
  (0, 23)	1.0
  (0, 24)	1.0
  :	:
  (0, 73)	1.0
  (0, 74)	1.0
  (0, 75)	1.0
  (0, 76)	1.0
  (0, 77)	1.0
  (0, 78)	2.0
  (0, 79)	1.0
  (0, 80)	1.0
  (0, 81)	1.0
  (0, 82)	1.0
  (0, 83)	1.0
  (0, 84)	2.0
  (0, 85)	1.0
  (0, 86)	1.0
  (0, 87)	1.0
  (0, 88)	1.0
  (0, 89)	1.0
  (0, 90)	1.0
  (0, 91)	1.0
  (0, 92)	1.0
  (0, 93)	1.0
  (0, 94)	1.0
  (0, 95)	1.0
  (0, 96)	1.0
  (0, 97)	1.0
Sample rows of y
[[1. 0.]
 [1. 0.]
 [1. 0.]
 [1. 0.]
 [1. 0.]
 [1. 0.]
 [1. 0.]
 [1. 0.]
 [1. 0.]
 [1. 0.]]
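
As a rough illustration of what `to_categorical` does to the labels, the same one-hot encoding can be sketched with plain NumPy. The `labels` array below is a made-up example, not the real label column.

```python
import numpy as np

# Hypothetical labels: 0 = fake, 1 = reliable
labels = np.array([0, 0, 1])

# Row k of the identity matrix is the one-hot vector for class k
one_hot = np.eye(2)[labels]
print(one_hot)
```

A fake article (label 0) becomes `[1. 0.]`, which matches the sample rows of y printed above.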


### Splitting train and test set

The usual 70-30 split.

In [18]:
# split data of X and y_normal
X_train, X_test, y_train, y_test = train_test_split(X,y_normal,test_size=0.3,random_state=41)

### Baseline - Logistic Regression

A simple logistic regression gives 93.58% accuracy. For comparison, always
guessing fake gives about 78% accuracy, since 15600 of the 20000 articles
are fake.

In [20]:
# Fitting logistic regression
clf = LogisticRegression(max_iter=10000)
clf.fit(X_train, y_train)

# Score
print(clf.score(X_test, y_test))

0.9358333333333333
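
As a sanity check on the baseline, the accuracy of always predicting the majority class can be computed from the label counts printed earlier. This sketch uses the counts for the full dataset, so the share on the 30% test split may differ slightly.

```python
import numpy as np

# Label counts from the np.unique output above: [fake, reliable]
counts = np.array([15600, 4400])

# Accuracy of a classifier that always predicts the most common label
majority_share = counts.max() / counts.sum()
print(f'{majority_share:.2%}')  # 78.00%
```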


In [21]:
# Split X and the one-hot encoded y for the neural networks (same seed as before)
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3,random_state=41)

### Simple Neural network

The first neural network is a simple one with a single hidden layer of 64
nodes. The 196679 input nodes feed into the 64 hidden nodes, which use ReLU
activation functions. Finally, the output nodes use sigmoid activation
functions to get the probabilities.

The network is trained with the Adam optimizer and a binary cross-entropy
loss function.
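
One detail worth keeping in mind with two sigmoid output nodes: each node is squashed independently, so the two values need not sum to 1. The NumPy sketch below contrasts this with softmax, which is a common alternative for one-hot labels. The `logits` values are hypothetical, and this is only an illustration, not a change to the model used in this notebook.

```python
import numpy as np

def sigmoid(z):
    # Element-wise logistic function
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # Subtract the max for numerical stability, then normalize
    e = np.exp(z - np.max(z))
    return e / e.sum()

# Hypothetical raw outputs of the two output nodes
logits = np.array([2.0, -1.0])

p_sig = sigmoid(logits)
p_soft = softmax(logits)
print('sigmoid:', p_sig, 'sum =', p_sig.sum())
print('softmax:', p_soft, 'sum =', p_soft.sum())
```

Either way, the predicted class is the node with the larger value, so the reported accuracy is still meaningful.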

In [13]:
# configs for neural network
layer_1_nodes = 64
output_nodes = 2
input_dim = X.shape[1]

# Build network
model = Sequential()
model.add(Dense(layer_1_nodes, activation='relu', input_dim=input_dim))
model.add(Dense(output_nodes, activation='sigmoid'))

# Compile model
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

# Fit model
model.fit(X_train, y_train, epochs=5, validation_data=(X_test, y_test))

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x16e9a75ad08>

It takes only a minute to run the above network. The accuracy increases
slightly until the third epoch, where the network starts failing to
generalize. The above neural network outperforms the logistic regression
by about 1%.
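
The `History` object returned by `fit` keeps the per-epoch metrics, so the epoch with the best validation accuracy can be located programmatically. Below is a small sketch of that lookup. The helper `best_epoch` and the numbers in `example_val_accuracy` are placeholders for illustration only, not the results of this notebook.

```python
def best_epoch(val_accuracy):
    """Return (1-based epoch, value) of the highest validation accuracy."""
    best_index = max(range(len(val_accuracy)), key=lambda i: val_accuracy[i])
    return best_index + 1, val_accuracy[best_index]

# Placeholder values only, NOT the notebook's actual results
example_val_accuracy = [0.90, 0.92, 0.93, 0.92, 0.91]

epoch, value = best_epoch(example_val_accuracy)
print(f'best epoch: {epoch}, validation accuracy: {value}')
```

With a real run, the list would likely come from `history.history['val_accuracy']`, assuming the return value of `model.fit` is stored in `history`. The key may be `'val_acc'` in older Keras versions.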

### Increase Nodes size

Below, the hidden layer is increased to 128 nodes. All other configs are
the same.
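
To get a feel for what this change means in model size, the number of trainable parameters of a stack of `Dense` layers can be counted by hand: each layer has one weight per input-output pair plus one bias per output node. The helper `dense_param_count` is a hypothetical name for this sketch, and `input_dim` is taken from the printed shape of X.

```python
def dense_param_count(layer_sizes):
    """Count weights and biases for fully connected layers of the given sizes."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

input_dim = 196679  # number of columns in X

small = dense_param_count([input_dim, 64, 2])    # 64 hidden nodes
large = dense_param_count([input_dim, 128, 2])   # 128 hidden nodes
print(small, large)  # 12587650 25175298
```

Almost all parameters sit in the first layer because of the very wide input, so doubling the hidden nodes roughly doubles the model size.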

In [22]:
# configs for neural network
layer_1_nodes = 128
output_nodes = 2
input_dim = X.shape[1]

# Build network
model = Sequential()
model.add(Dense(layer_1_nodes, activation='relu', input_dim=input_dim))
model.add(Dense(output_nodes, activation='sigmoid'))

# Compile model
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

# Fit model
model.fit(X_train, y_train, epochs=5, validation_data=(X_test, y_test))

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x16e99786c88>

Doubling the number of nodes has hardly any effect on the validation accuracy!
The accuracy also drops off after the second epoch, which signals overfitting.

### Expanding Layers and Nodes

Next, the network is expanded to 3 hidden layers with 128 nodes each.

In [23]:
# configs for neural network
layer_1_nodes = 128
layer_2_nodes = 128
layer_3_nodes = 128
output_nodes = 2
input_dim = X.shape[1]

# Build network
model = Sequential()
model.add(Dense(layer_1_nodes, activation='relu', input_dim=input_dim))
model.add(Dense(layer_2_nodes, activation='relu'))
model.add(Dense(layer_3_nodes, activation='relu'))
model.add(Dense(output_nodes, activation='sigmoid'))

# Compile model
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

# Fit model
model.fit(X_train, y_train, epochs=5, validation_data=(X_test, y_test))

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x16e99071bc8>

Unfortunately, expanding the model results in a big loss of accuracy of nearly
2%. Moreover, this is even below the logistic regression result.

### Drop off counter

