# Proof of concept Neural Network #

## Load tweets and labels ##

Create vocabulary dictionary with unique indexes

Tokenise tweets with unique indexes

*This network only takes one instance of each word per tweet,
for example the tweet "test test test" would be passed to the
network as a single instance of the word "test". The main 
product will consider multiple instances of a word*

The dataset is made up of tweets that link to The Guardian
(labelled 1) and tweets that link to The Daily Mail (labelled
0). This is sentiment analysis of sorts; however, instead
of predicting the sentiment of the tweets, it predicts
whether the Twitter account is sharing left-wing or right-wing
news.

In [1]:
# Readlines into list of tweets and labels
with open('tweets.txt') as f:
    raw_tweets = f.readlines()

with open('labels.txt') as f:
    raw_labels = f.readlines()

# split list of tweets in 2D set {tweets,words}
tokens = list(map(lambda x:set(x.split(" ")),raw_tweets))

# create vocabulary list: compile tweets into 1D set {words} 
# removing duplicates and nulls('') then convert to a list
vocab = set()
for tweet in tokens:
    for word in tweet:
        if(len(word)>0):
            vocab.add(word)
vocab = list(vocab)

# enumerate list of words to give unique numerical key to each word
word2index = {}
for i,word in enumerate(vocab):
    word2index[word]=i

# Create list of tweets with unique numerical key in place of words,
# skipping any token that is not in the vocabulary (e.g. '')
# pass from list to set and back to list to remove duplicates
input_dataset = list()
for tweet in tokens:
    tweet_indices = list()
    for word in tweet:
        if word in word2index:
            tweet_indices.append(word2index[word])
    input_dataset.append(list(set(tweet_indices)))
    
# Create list of output targets; strip() so that a label line
# without a trailing '\n' is still read correctly
target_dataset = list()
for label in raw_labels:
    if label.strip() == '1':
        target_dataset.append(1)
    else:
        target_dataset.append(0)
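
The effect of splitting each tweet into a set can be seen on a toy example. The sketch below is illustrative only: the two tweets are invented, and the vocabulary is sorted (an assumption made here so the indices are reproducible, whereas the cell above takes whatever order `list(set(...))` yields).

```python
# Toy tweets, invented for illustration only.
toy_tweets = ["test test test", "news test today"]

# Splitting into a set drops repeated words within a tweet.
toy_tokens = [set(t.split(" ")) for t in toy_tweets]

# Sorted vocabulary (assumption: sorting is only for reproducibility).
toy_vocab = sorted({w for tweet in toy_tokens for w in tweet if w})
toy_word2index = {w: i for i, w in enumerate(toy_vocab)}

# Replace each word with its index; sort so the result is deterministic.
toy_input = [sorted(toy_word2index[w] for w in tweet if w in toy_word2index)
             for tweet in toy_tokens]

print(toy_word2index)  # {'news': 0, 'test': 1, 'today': 2}
print(toy_input)       # [[1], [0, 1, 2]]
```

The first tweet collapses to the single index for "test", which is the behaviour described in the note at the top of this section.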

## Build and Run Network ##

This is a simple network with an embedding layer as the input
layer, one hidden layer of 100 nodes and a single-node binary output
layer. The layers are densely connected; however, only the embedding
rows for the word indexes in the current tweet are used on each cycle.
The sigmoid activation function is used for both the hidden and output
layers. The learning rate is set at 0.01.
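
Using only the rows for the current tweet is equivalent to multiplying a binary bag-of-words vector by the full weight matrix. The sketch below checks this on toy sizes; the names `W`, `x` and `one_hot` and the dimensions are assumptions chosen for illustration, not values from the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden = 6, 4              # toy sizes, not the notebook's
W = 0.2 * rng.random((vocab_size, hidden)) - 0.1

x = [1, 4, 5]                          # word indices present in one tweet

# Row-sum form: pick the rows for the tweet's words and add them up.
summed = np.sum(W[x], axis=0)

# Dense form: binary vector with a 1 for each word present, times W.
one_hot = np.zeros(vocab_size)
one_hot[x] = 1.0
dense = one_hot @ W

print(np.allclose(summed, dense))      # True
```

The row-sum form avoids building and multiplying a vector as long as the vocabulary for every tweet.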

In [2]:
import time
import sys
import numpy as np
np.random.seed(42)

# function to convert input to value between 0 and 1
def sigmoid(x):
    return 1/(1 + np.exp(-x))

alpha, iterations = (0.01, 5)
hidden_size = 100

# Create matrix with one row per unique word across all tweets
# and one column per hidden node
weights_0_1 = 0.2*np.random.random((len(vocab),hidden_size)) - 0.1
# Create matrix with hidden layer of rows and 1 column
weights_1_2 = 0.2*np.random.random((hidden_size,1)) - 0.1


correct,total = (0,0)
for iter in range(iterations):
    # train on first 80,000 tweets
    for i in range(len(input_dataset)-20000):
        
        # current tweet and its label set to x and y
        x,y = (input_dataset[i],target_dataset[i])
        # sum matrix columns of rows representing words present in 
        # current tweet then apply sigmoid
        layer_1 = sigmoid(np.sum(weights_0_1[x],axis=0)) 
        # matrix multiplication of hidden layer and second set of 
        # weights then apply sigmoid
        layer_2 = sigmoid(np.dot(layer_1,weights_1_2))

        # calculate output delta as prediction minus actual and
        # backpropagate it through the second set of weights to get
        # the layer 1 delta (the hidden sigmoid derivative is not applied)
        layer_2_delta = layer_2 - y
        layer_1_delta = layer_2_delta.dot(weights_1_2.T)

        # subtract scaled delta from the first weight matrix, only on
        # the rows for words present in the current tweet
        weights_0_1[x] -= layer_1_delta * alpha
        # outer product of layer 1 (100) and delta (1) = (100,1);
        # subtract from the second weight matrix
        weights_1_2 -= np.outer(layer_1,layer_2_delta) * alpha
        
        # convert layer_2_delta to positive and if its below 0.5 
        # increment correct
        if(np.abs(layer_2_delta) < 0.5):
            correct += 1
            
        total += 1
        
        # print progress every 10 increments
        if(i % 10 == 0):
            progress = str((i/(len(input_dataset)-20010))*100)
            sys.stdout.write('\rIter:'+str(iter+1)+' Progress:'+
                             progress[0:2]+progress[2:5] +
                             '% Training Accuracy:'+ 
                             str(correct/float(total))[2:4]+'.'+
                             str(correct/float(total))[4:6]+ '%')
            time.sleep(0.001)
    print()
    
# repeat for test dataset
correct,total = (0,0)
for i in range(len(input_dataset)-20000,len(input_dataset)):

    x = input_dataset[i]
    y = target_dataset[i]

    layer_1 = sigmoid(np.sum(weights_0_1[x],axis=0))
    layer_2 = sigmoid(np.dot(layer_1,weights_1_2))
    
    if(np.abs(layer_2 - y) < 0.5):
        correct += 1
    total += 1
print("Test Accuracy:" + str(correct/float(total))[2:4]+'.'+
      str(correct/float(total))[4:6] + '%')

Iter:1 Progress:100.0% Training Accuracy:84.06%
Iter:2 Progress:100.0% Training Accuracy:87.49%
Iter:3 Progress:100.0% Training Accuracy:89.25%
Iter:4 Progress:100.0% Training Accuracy:90.42%
Iter:5 Progress:100.0% Training Accuracy:91.30%
Test Accuracy:90.71%
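
For reference, the computation performed in each training step of the cell above can be written compactly. The symbols $h$, $\hat{y}$, $W_{01}$, $W_{12}$ and $\alpha$ are shorthand introduced here for `layer_1`, `layer_2`, `weights_0_1`, `weights_1_2` and `alpha`; $x$ is the set of word indices in the current tweet and $h$ is treated as a row vector.

$$
\begin{aligned}
h &= \sigma\Big(\sum_{i \in x} W_{01}[i]\Big), \qquad \hat{y} = \sigma(h\,W_{12}) \\
\delta_2 &= \hat{y} - y, \qquad \delta_1 = \delta_2\,W_{12}^{\top} \\
W_{01}[i] &\leftarrow W_{01}[i] - \alpha\,\delta_1 \quad (i \in x), \qquad W_{12} \leftarrow W_{12} - \alpha\,h^{\top}\delta_2
\end{aligned}
$$

As coded, $\delta_1$ does not include the factor $h(1-h)$ that the sigmoid derivative would contribute in a full backpropagation step.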


## Making Predictions ##

After five iterations the training and test accuracy are
both around 91%. This suggests that the network is generalising
well. To test whether this is the case, the network will be passed
a set of ten tweets collected two weeks after the dataset.
These tweets should share little subject matter with the training
data, as the news cycle should have moved on.
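
The forward pass needed for prediction is small enough to wrap in a helper. The following is a minimal sketch under stated assumptions: the function name `predict`, the toy weight matrices and the 0.5 threshold for turning the output into a label are introduced here for illustration and are not part of the notebook.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def predict(indices, w01, w12):
    """Return the network output in (0, 1) for one tweet's word indices."""
    hidden = sigmoid(np.sum(w01[indices], axis=0))
    return float(sigmoid(np.dot(hidden, w12))[0])

# Toy weights (hypothetical): 5 words, 3 hidden nodes, 1 output.
rng = np.random.default_rng(1)
toy_w01 = 0.2 * rng.random((5, 3)) - 0.1
toy_w12 = 0.2 * rng.random((3, 1)) - 0.1

p = predict([0, 2], toy_w01, toy_w12)
label = 1 if p >= 0.5 else 0   # 1 = The Guardian, 0 = The Daily Mail
print(p, label)
```

A threshold of 0.5 mirrors the rule used for the accuracy counts, where a prediction is counted as correct when its absolute error is below 0.5.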

In [3]:
# Readlines into list of tweets and labels
with open('predict_tweets.txt') as f:
    raw_predict_tweets = f.readlines()

with open('predict_labels.txt') as f:
    raw_predict_labels = f.readlines()

# split list of tweets in 2D set {tweets,words}
predict_tokens = list(map(lambda x:set(x.split(" ")),raw_predict_tweets))

# Create list of tweets with unique numerical key in place of words,
# skipping any word that was not seen during training
# pass from list to set and back to list to remove duplicates
predict_input_dataset = list()
for tweet in predict_tokens:
    tweet_indices = list()
    for word in tweet:
        if word in word2index:
            tweet_indices.append(word2index[word])
    predict_input_dataset.append(list(set(tweet_indices)))
    
# Create list of output targets; strip() so that a label line
# without a trailing '\n' is still read correctly
predict_target_dataset = list()
for label in raw_predict_labels:
    if label.strip() == '1':
        predict_target_dataset.append(1)
    else:
        predict_target_dataset.append(0)
        
# Run the prediction dataset through the network using the
# trained weights
predictions = list()
for i in range(len(predict_input_dataset)):

    x = predict_input_dataset[i]

    layer_1 = sigmoid(np.sum(weights_0_1[x],axis=0))
    layer_2 = sigmoid(np.dot(layer_1,weights_1_2))

    # store the raw network output (a value between 0 and 1)
    predictions.append(layer_2)
for i in range(len(predictions)):
    print(str(i+1)+'.\nPrediction: '+str(predictions[i][0])+
          '\nActual: '+str(predict_target_dataset[i])+'\n')

1.
Prediction: 0.9691711710709275
Actual: 1

2.
Prediction: 0.9985246924457338
Actual: 1

3.
Prediction: 0.9984254692500265
Actual: 1

4.
Prediction: 0.013475537512806928
Actual: 0

5.
Prediction: 0.36792776568256974
Actual: 0

6.
Prediction: 0.0011052837807767613
Actual: 0

7.
Prediction: 0.07300761724035176
Actual: 1

8.
Prediction: 0.9999953671271974
Actual: 1

9.
Prediction: 0.0002536731724130849
Actual: 0

10.
Prediction: 0.026135439533821286
Actual: 0



## Conclusion ##

Other than the seventh prediction the network was correct,
and other than the fifth every correct output is within about
0.03 of its target. This suggests that networks of this kind can
successfully predict more than sentiment.
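
The figures above can be checked directly from the printed output. The sketch below copies the ten predictions and labels from the results listed earlier; the 0.5 threshold and the variable names are assumptions made for this check.

```python
# Values copied from the prediction output above.
predictions = [0.9691711710709275, 0.9985246924457338, 0.9984254692500265,
               0.013475537512806928, 0.36792776568256974,
               0.0011052837807767613, 0.07300761724035176,
               0.9999953671271974, 0.0002536731724130849,
               0.026135439533821286]
actual = [1, 1, 1, 0, 0, 0, 1, 1, 0, 0]

# Threshold at 0.5 to turn each output into a class label.
predicted_labels = [1 if p >= 0.5 else 0 for p in predictions]

# 1-based positions of the tweets the network got wrong.
wrong = [i + 1 for i, (p, a) in enumerate(zip(predicted_labels, actual))
         if p != a]
accuracy = (len(actual) - len(wrong)) / len(actual)

# Absolute error of each output against its label.
abs_errors = [abs(p - a) for p, a in zip(predictions, actual)]

print(wrong)     # [7]
print(accuracy)  # 0.9
```

With only ten tweets this is an indication rather than a measurement, so `abs_errors` is worth inspecting alongside the accuracy.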