# Sentiment Analysis in Python - Framing a problem
By Max Taylor

This is an overview of the full process of developing a neural network for sentiment analysis. We will cover how to develop and validate a predictive theory, prepare the data, build the network, reduce noise and optimise. Along the way we will look at how to attack and solve the problem in a way that can be applied to future networks.

### Contents
- Loading the dataset
- Predictive theory
- Theory Validation
- Preparing input and output data
- Building the network
- Identifying Neural Noise
- Reducing Neural Noise
- Analysing inefficiencies in the network
- Optimising inefficiencies in the network
- Further noise reduction

## Loading the dataset

In [2]:
g = open('reviews.txt','r')
reviews = list(map(lambda x:x[:-1],g.readlines()))
g.close()

g = open('labels.txt','r')
labels = list(map(lambda x:x[:-1].upper(),g.readlines()))
g.close()
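
The slicing `x[:-1]` above assumes every line ends with a newline; if the last line of a file does not, its final character is lost. A minimal sketch of a more defensive reader is shown below. The helper name `read_lines` is an assumption for illustration and is not part of the original notebook; the in-memory file stands in for `labels.txt`.

```python
import io

def read_lines(handle, transform=None):
    """Return the lines of an open text file without their trailing newlines."""
    lines = [line.rstrip('\n') for line in handle]
    if transform is not None:
        lines = [transform(line) for line in lines]
    return lines

# Demonstration on an in-memory file whose last line has no trailing newline
sample_file = io.StringIO('positive\nnegative')
sample_labels = read_lines(sample_file, str.upper)
print(sample_labels)
```

With the real files this would be used as `with open('labels.txt') as g: labels = read_lines(g, str.upper)`, so the file is closed even if reading fails.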

In [3]:
len(reviews)

25000

In [3]:
reviews[0]

'bromwell high is a cartoon comedy . it ran at the same time as some other programs about school life  such as  teachers  . my   years in the teaching profession lead me to believe that bromwell high  s satire is much closer to reality than is  teachers  . the scramble to survive financially  the insightful students who can see right through their pathetic teachers  pomp  the pettiness of the whole situation  all remind me of the schools i knew and their students . when i saw the episode in which a student repeatedly tried to burn down the school  i immediately recalled . . . . . . . . . at . . . . . . . . . . high . a classic line inspector i  m here to sack one of your teachers . student welcome to bromwell high . i expect that many adults of my age think that bromwell high is far fetched . what a pity that it isn  t   '

In [4]:
labels[0]

'POSITIVE'

## Develop a predictive theory

Look over the data and consider the best way to use the input information to effectively get the output information.

In [5]:
def pretty_print_review_and_label(i):
    print(labels[i] + "\t:\t" + reviews[i][:80] + "...")

In [6]:
print("labels.txt \t : \t reviews.txt\n")
pretty_print_review_and_label(2137)
pretty_print_review_and_label(12816)
pretty_print_review_and_label(6267)
pretty_print_review_and_label(21934)
pretty_print_review_and_label(5297)
pretty_print_review_and_label(4998)

labels.txt 	 : 	 reviews.txt

NEGATIVE	:	this movie is terrible but it has some good effects .  ...
POSITIVE	:	adrian pasdar is excellent is this film . he makes a fascinating woman .  ...
NEGATIVE	:	comment this movie is impossible . is terrible  very improbable  bad interpretat...
POSITIVE	:	excellent episode movie ala pulp fiction .  days   suicides . it doesnt get more...
NEGATIVE	:	if you haven  t seen this  it  s terrible . it is pure trash . i saw this about ...
POSITIVE	:	this schiffer guy is a real genius  the movie is of excellent quality and both e...


In this case, the data should be split into words, as they convey the most meaning. Single characters carry no context on their own, and whole sentences are too large and too general to use as individual inputs.
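
To make the difference in granularity concrete, here is a small sketch that splits one review (shortened from the sample output above) at character level and at word level. The variable names are illustrative only.

```python
sample = 'this movie is terrible but it has some good effects .'

# Character level: individual letters carry almost no sentiment on their own
chars = list(sample.replace(' ', ''))

# Word level: tokens such as 'terrible' and 'good' carry a clear signal
words = sample.split(' ')

print(chars[:8])
print(words)
```

Note that even at word level this review contains both 'terrible' and 'good', which is why the network has to weigh many words together rather than rely on a single one.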

## Theory validation

Having produced a predictive theory, it can be helpful to validate it. In this case we decided that words were the most useful unit, so that is what we will test!

In [7]:
# Import dependencies
from collections import Counter
import numpy as np

In [8]:
positive_counts = Counter()
negative_counts = Counter()
total_counts = Counter()

In [9]:
for i in range(len(reviews)):
    if labels[i] == 'POSITIVE':
        for word in reviews[i].split(' '):
            positive_counts[word] += 1
            total_counts[word] += 1
    else:
        for word in reviews[i].split(' '):
            negative_counts[word] += 1
            total_counts[word] += 1

In [10]:
positive_counts.most_common()[0:50]

[('', 550468),
 ('the', 173324),
 ('.', 159654),
 ('and', 89722),
 ('a', 83688),
 ('of', 76855),
 ('to', 66746),
 ('is', 57245),
 ('in', 50215),
 ('br', 49235),
 ('it', 48025),
 ('i', 40743),
 ('that', 35630),
 ('this', 35080),
 ('s', 33815),
 ('as', 26308),
 ('with', 23247),
 ('for', 22416),
 ('was', 21917),
 ('film', 20937),
 ('but', 20822),
 ('movie', 19074),
 ('his', 17227),
 ('on', 17008),
 ('you', 16681),
 ('he', 16282),
 ('are', 14807),
 ('not', 14272),
 ('t', 13720),
 ('one', 13655),
 ('have', 12587),
 ('be', 12416),
 ('by', 11997),
 ('all', 11942),
 ('who', 11464),
 ('an', 11294),
 ('at', 11234),
 ('from', 10767),
 ('her', 10474),
 ('they', 9895),
 ('has', 9186),
 ('so', 9154),
 ('like', 9038),
 ('about', 8313),
 ('very', 8305),
 ('out', 8134),
 ('there', 8057),
 ('she', 7779),
 ('what', 7737),
 ('or', 7732)]

This has produced a collection of the positive and negative words and their number of occurrences. Scrolling through the data shows that there is a difference between positive and negative word occurrences. However, straight away we can see the data is cluttered with words that carry no sentiment, such as 'the', 'and' and 'a'.

A good step to resolve this is to find the ratio between each word's occurrences in the positive set and in the negative set.

In [11]:
pos_neg_ratios = Counter()

for term, cnt in list(total_counts.most_common()):
    if cnt > 100:
        pos_neg_ratio = positive_counts[term] / float(negative_counts[term] + 1)
        pos_neg_ratios[term] = pos_neg_ratio
        
for word, ratio in pos_neg_ratios.most_common():
    if ratio > 1:
        pos_neg_ratios[word] = np.log(ratio)
    else:
        pos_neg_ratios[word] = -np.log((1 / (ratio + 0.01)))

The first loop goes through each word in the total count, where term is the word and cnt is the number of occurrences. If the word occurs more than 100 times, we calculate the ratio between its occurrences in the positive reviews and in the negative reviews (adding 1 to the negative count so that we never divide by zero) and store it in the counter.

Then, loop through all the new ratios, where word is the word and ratio is the ratio calculated in the last step. If the ratio is greater than 1, the word occurs more often in the positive set than the negative set, and we take the natural log of the ratio. Otherwise we take the negative natural log of 1 / (ratio + 0.01), which gives a negative score. Taking logs puts both groups on a comparable scale centred on 0. The 0.01 is added so that a ratio of 0 does not cause a divide-by-zero error.
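
The two steps can be followed for a single word with the sketch below. The function name `log_ratio` and the counts passed to it are hypothetical, chosen only to make the arithmetic easy to check; the formula itself mirrors the cell above.

```python
import math

def log_ratio(positive_count, negative_count):
    """Apply the ratio and log steps described above to one word."""
    ratio = positive_count / float(negative_count + 1)
    if ratio > 1:
        return math.log(ratio)
    # Algebraically the same as math.log(ratio + 0.01)
    return -math.log(1 / (ratio + 0.01))

# Hypothetical counts, not taken from the dataset
mostly_positive = log_ratio(100, 9)   # ratio is 10
mostly_negative = log_ratio(9, 99)    # ratio is 0.09
never_positive = log_ratio(0, 99)     # ratio is 0

print(mostly_positive, mostly_negative, never_positive)
```

A word seen ten times more often in positive reviews and a word seen roughly ten times more often in negative reviews end up with scores of similar size and opposite sign, which is the point of taking the log.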

In [12]:
# Words most frequently seen in the positive reviews
pos_neg_ratios.most_common()[0:50]

[('edie', 4.6913478822291435),
 ('paulie', 4.0775374439057197),
 ('felix', 3.1527360223636558),
 ('polanski', 2.8233610476132043),
 ('matthau', 2.8067217286092401),
 ('victoria', 2.6810215287142909),
 ('mildred', 2.6026896854443837),
 ('gandhi', 2.5389738710582761),
 ('flawless', 2.451005098112319),
 ('superbly', 2.2600254785752498),
 ('perfection', 2.1594842493533721),
 ('astaire', 2.1400661634962708),
 ('captures', 2.0386195471595809),
 ('voight', 2.0301704926730531),
 ('wonderfully', 2.0218960560332353),
 ('powell', 1.9783454248084671),
 ('brosnan', 1.9547990964725592),
 ('lily', 1.9203768470501485),
 ('bakshi', 1.9029851043382795),
 ('lincoln', 1.9014583864844796),
 ('refreshing', 1.8551812956655511),
 ('breathtaking', 1.8481124057791867),
 ('bourne', 1.8478489358790986),
 ('lemmon', 1.8458266904983307),
 ('delightful', 1.8002701588959635),
 ('flynn', 1.7996646487351682),
 ('andrews', 1.7764919970972666),
 ('homer', 1.7692866133759964),
 ('beautifully', 1.7626953362841438),
 ('socc

In [13]:
# Words most frequently seen in the negative reviews
list(reversed(pos_neg_ratios.most_common()))[0:30]

[('boll', -4.0778152602708904),
 ('uwe', -3.9218753018711578),
 ('seagal', -3.3202501058581921),
 ('unwatchable', -3.0269848170580955),
 ('stinker', -2.9876839403711624),
 ('mst', -2.7753833211707968),
 ('incoherent', -2.7641396677532537),
 ('unfunny', -2.5545257844967644),
 ('waste', -2.4907515123361046),
 ('blah', -2.4475792789485005),
 ('horrid', -2.3715779644809971),
 ('pointless', -2.3451073877136341),
 ('atrocious', -2.3187369339642556),
 ('redeeming', -2.2667790015910296),
 ('prom', -2.2601040980178784),
 ('drivel', -2.2476029585766928),
 ('lousy', -2.2118080125207054),
 ('worst', -2.1930856334332267),
 ('laughable', -2.172468615469592),
 ('awful', -2.1385076866397488),
 ('poorly', -2.1326133844207011),
 ('wasting', -2.1178155545614512),
 ('remotely', -2.111046881095167),
 ('existent', -2.0024805005437076),
 ('boredom', -1.9241486572738005),
 ('miserably', -1.9216610938019989),
 ('sucks', -1.9166645809588516),
 ('uninspired', -1.9131499212248517),
 ('lame', -1.9117232884159072),

## Transforming text into numbers

The current information is useful to those of us who can already read, but not very useful to a neural network!
Before we train the neural network we must convert the data into a format that the network can use!

In [14]:
vocab = set(total_counts.keys())
vocab_size = len(vocab)
print ('The number of different words in the data: ', vocab_size)

The number of different words in the data:  74074


In [15]:
list(vocab)[0:50]

['',
 'spaghettis',
 'ballplayers',
 'progrmmer',
 'rouges',
 'fondas',
 'acmetropolis',
 'pulsates',
 'memorise',
 'tambien',
 'jonatha',
 'piquantly',
 'craptitude',
 'misjudge',
 'classic',
 'felichy',
 'somegoro',
 'expunged',
 'egalitarianism',
 'maruschka',
 'conklin',
 'gunfire',
 'civic',
 'pucky',
 'vaulted',
 'traumatise',
 'blades',
 'voogdt',
 'constipated',
 'emblem',
 'recertified',
 'screw',
 'carnaevon',
 'dramatisation',
 'anddd',
 'dystrophic',
 'mope',
 'tentacle',
 'ridgement',
 'kidman',
 'gainey',
 'fawlty',
 'declared',
 'crybaby',
 'rascism',
 'entomologist',
 'astronishing',
 'racecar',
 'rang',
 'ottoman']

vocab is the set of all the distinct words found in the dataset.

### Creating the input and output data

In [16]:
import numpy as np

layer_0 = np.zeros((1, vocab_size))
layer_0

array([[ 0.,  0.,  0., ...,  0.,  0.,  0.]])

First we create the input layer as a row vector with one element per distinct word. It is initialised with zeros and then reused for every review, which avoids allocating a new array each time.

Next, take each word and give it a unique index:

In [17]:
word2index = {}

for i, word in enumerate(vocab):
    word2index[word] = i

word2index

{'': 0,
 'spaghettis': 1,
 'ballplayers': 2,
 'progrmmer': 3,
 'rouges': 4,
 'fondas': 5,
 'acmetropolis': 6,
 'pulsates': 7,
 'memorise': 8,
 'tambien': 9,
 'jonatha': 10,
 'piquantly': 11,
 'craptitude': 12,
 'misjudge': 13,
 'classic': 14,
 'felichy': 15,
 'somegoro': 16,
 'expunged': 17,
 'egalitarianism': 18,
 'maruschka': 19,
 'conklin': 20,
 'gunfire': 21,
 'civic': 22,
 'pucky': 23,
 'vaulted': 24,
 'traumatise': 25,
 'blades': 26,
 'voogdt': 27,
 'constipated': 28,
 'emblem': 29,
 'recertified': 30,
 'screw': 31,
 'carnaevon': 32,
 'dramatisation': 33,
 'anddd': 34,
 'dystrophic': 35,
 'mope': 36,
 'tentacle': 37,
 'ridgement': 38,
 'kidman': 39,
 'gainey': 40,
 'fawlty': 41,
 'declared': 42,
 'crybaby': 43,
 'rascism': 44,
 'entomologist': 45,
 'astronishing': 46,
 'racecar': 47,
 'rang': 48,
 'ottoman': 49,
 'reaffirming': 50,
 'veer': 51,
 'slevin': 52,
 'crashed': 53,
 'pambies': 54,
 'westernized': 55,
 'eldard': 56,
 'presentable': 57,
 'popcorncoke': 58,
 'fathoms': 59,

Next we need to update the input layer to actually contain the information and not just zeros.

In [18]:
def update_input_layer(review):
    
    global layer_0
    
    # Ensure all of layer 0 is set to 0
    layer_0 *= 0
    
    for word in review.split(' '):
        layer_0[0][word2index[word]] += 1
        
update_input_layer(reviews[0])

This function takes a review and loops through each word. For every word it looks up the word's index in word2index and adds 1 to the value at that index in layer_0. This builds up a count of how often each word occurs, stored at the unique index of that word.
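
The same counting can be traced by hand on a tiny example. The five-word vocabulary and the names prefixed with `toy_` below are hypothetical and exist only for this illustration.

```python
import numpy as np

# Hypothetical five-word vocabulary, for illustration only
toy_vocab = ['', 'good', 'bad', 'movie', '.']
toy_word2index = {word: i for i, word in enumerate(toy_vocab)}

toy_layer_0 = np.zeros((1, len(toy_vocab)))

def toy_update_input_layer(review):
    # Reset the counts in place, then count each word of the review
    toy_layer_0[:] = 0
    for word in review.split(' '):
        toy_layer_0[0][toy_word2index[word]] += 1

# The double space before the final '.' produces one empty-string token
toy_update_input_layer('good movie . good  .')
print(toy_layer_0)
```

Splitting on a single space turns every run of two spaces into an empty-string token, which is where the `''` entry in the vocabulary comes from.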

In [19]:
layer_0

array([[ 18.,   0.,   0., ...,   0.,   0.,   0.]])

Finally, we need to get the 'POSITIVE' and 'NEGATIVE' labels into a machine-readable format. Here we will use 1 for positive and 0 for negative.

In [20]:
def get_target_for_label(label):
    if label == 'POSITIVE':
        return 1
    else:
        return 0

In [21]:
labels[0]

'POSITIVE'

In [22]:
get_target_for_label(labels[0])

1

## Building the network

For this I am going to use the NeuralNetwork class from Udacity Project 1 with some adjustments:
- 3 Layer neural network
- no non-linearity in the second layer (No sigmoid between layer 0 and 1)
- use previous functions to create the training data
- create a 'pre_process_data' function to create vocabulary for the training data and generating functions
- modify train to train over the entire corpus
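
The forward pass of such a network, with a linear hidden layer and a sigmoid output, can be sketched on toy sizes as follows. The sizes and the input vector are made up for illustration; the weight initialisation follows the same pattern as the class in this section (zeros for the first layer, normally distributed values for the second).

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

np.random.seed(1)
input_nodes, hidden_nodes, output_nodes = 6, 3, 1   # toy sizes

weights_0_1 = np.zeros((input_nodes, hidden_nodes))
weights_1_2 = np.random.normal(0.0, output_nodes ** -0.5,
                               (hidden_nodes, output_nodes))

layer_0 = np.array([[1., 0., 2., 0., 0., 1.]])   # made-up word counts

layer_1 = layer_0.dot(weights_0_1)            # linear hidden layer, no sigmoid
layer_2 = sigmoid(layer_1.dot(weights_1_2))   # sigmoid output in (0, 1)

print(layer_1.shape, layer_2)
```

Because the first weight matrix starts at zero, the hidden layer is all zeros and the untrained output is exactly 0.5 whatever the input, so an untrained network of this shape has no preference between the two classes.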

In [23]:
import numpy as np
import time
import sys

class SentimentNetwork():
    def __init__(self, reviews, labels, hidden_nodes = 10, learning_rate = 0.1):
        
        # Seed the random number generator for debugging
        np.random.seed(1)
        
        self.pre_process_data(reviews, labels)
        
        self.init_network(self.review_vocab_size, hidden_nodes, 1, learning_rate)
    
    # Process all the review and label data and build the vocabularies
    # Give each word a unique index for entry into the network
    def pre_process_data(self, reviews, labels):
        # Creates a set that contains one of each different word
        review_vocab = set()
        for review in reviews:
            for word in review.split(' '):
                review_vocab.add(word)
        
        self.review_vocab = list(review_vocab)
        
        # Creates a dictionary that contains one of each label type [IN our case 'POSITIVE' or 'NEGATIVE']
        label_vocab = set()
        for label in labels:
            label_vocab.add(label)
                
        self.label_vocab = list(label_vocab)
        
        # Store the sizes of each dictionary
        self.review_vocab_size = len(self.review_vocab)
        self.label_vocab_size = len(self.label_vocab)
        
        # Give each word a uniquely identifying index
        self.word2index = {}
        for i, word in enumerate(self.review_vocab):
            self.word2index[word] = i
        
        # Give each label an identifying index [in our case only 0 or 1]
        self.label2index = {}
        for i, label in enumerate(self.label_vocab):
            self.label2index[label] = i
    
    # Initialize all the network parameters
    def init_network(self, input_nodes, hidden_nodes, output_nodes, learning_rate):
        # Set the number of nodes for all three layers
        self.input_nodes = input_nodes
        self.hidden_nodes = hidden_nodes
        self.output_nodes = output_nodes
        
        # Initialise weights
        self.weights_0_1 = np.zeros((self.input_nodes, self.hidden_nodes))
        self.weights_1_2 = np.random.normal(0.0, self.output_nodes**-0.5, (self.hidden_nodes, self.output_nodes))
        
        self.learning_rate = learning_rate
        
        self.layer_0 = np.zeros((1, input_nodes))
    
    # Activation function
    def sigmoid(self, x, deriv=False):
        if deriv:
            return x * (1 - x)
        else:
            return 1 / (1 + np.exp(-x))
    
    # Reset the layer for the next pass and add the new words from the review
    def update_input_layer(self, review):
        # Clear the previous state and set to 0
        self.layer_0 *= 0
        
        for word in review.split(' '):
            if (word in self.word2index.keys()):
                self.layer_0[0][self.word2index[word]] += 1
    
    # Get the numerical representation of the output
    def get_target_for_label(self, label):
        if label == 'POSITIVE':
            return 1
        else:
            return 0
    
    # Train the network
    def train(self, training_reviews, training_labels):
        
        # Check that the inputs match the outputs
        assert(len(training_reviews) == len(training_labels))
        
        correct_so_far = 0
        
        # Log the start time
        start = time.time()
        
        for i in range(len(training_reviews)):
            
            review = training_reviews[i]
            label = training_labels[i]
            
            ## Forward Pass ##
            
            # Input Layer
            self.update_input_layer(review)
            
            # Hidden Layer
            layer_1 = self.layer_0.dot(self.weights_0_1)
            
            # Output layer
            layer_2 = self.sigmoid(layer_1.dot(self.weights_1_2))
            
            ## Backwards Pass ##
            
            # Output Error
            layer_2_error = layer_2 - self.get_target_for_label(label)
            layer_2_delta = layer_2_error * self.sigmoid(layer_2, True)
            
            # Hidden Error
            layer_1_error = layer_2_delta.dot(self.weights_1_2.T)
            layer_1_delta = layer_1_error
            
            self.weights_1_2 -= layer_1.T.dot(layer_2_delta) * self.learning_rate
            self.weights_0_1 -= self.layer_0.T.dot(layer_1_delta) * self.learning_rate
            
            if (np.abs(layer_2_error) < 0.5):
                correct_so_far += 1
            
            reviews_per_second = i / float(time.time() - start)
            
            sys.stdout.write("\rProgress:" + str(100 * i/float(len(training_reviews)))[:4] + "% Speed(reviews/sec):" + str(reviews_per_second)[0:5] + " #Correct:" + str(correct_so_far) + " #Trained:" + str(i+1) + " Training Accuracy:" + str(correct_so_far * 100 / float(i+1))[:4] + "%")
            if(i % 2500 == 0):
                print("")
    
    # Test the network
    def test(self, testing_reviews, testing_labels):
        
        correct = 0
        
        start = time.time()
        
        for i in range(len(testing_reviews)):
            pred = self.run(testing_reviews[i])
            
            if pred == testing_labels[i]:
                correct += 1
                
            reviews_per_second = i / float(time.time() - start)
            
            sys.stdout.write("\rProgress:" + str(100 * i/float(len(testing_reviews)))[:4] \
                             + "% Speed(reviews/sec):" + str(reviews_per_second)[0:5] \
                            + "% #Correct:" + str(correct) + " #Tested:" + str(i+1) + " Testing Accuracy:" + str(correct * 100 / float(i+1))[:4] + "%")
    
    # Forward propagate
    def run(self, review):
        # Input Layer
        self.update_input_layer(review.lower())

        # Hidden layer
        layer_1 = self.layer_0.dot(self.weights_0_1)

        # Output layer
        layer_2 = self.sigmoid(layer_1.dot(self.weights_1_2))
        
        if(layer_2[0] > 0.5):
            return "POSITIVE"
        else:
            return "NEGATIVE"

Next we need to create an instance of the network and train it on the data.

In [24]:
mlp = SentimentNetwork(reviews[:-1000],labels[:-1000], learning_rate=0.1)

In [25]:
# evaluate our model before training (just to show how horrible it is)
mlp.test(reviews[-1000:],labels[-1000:])

Progress:99.9% Speed(reviews/sec):682.6% #Correct:500 #Tested:1000 Testing Accuracy:50.0%

In [26]:
# train the network
mlp.train(reviews[:-1000],labels[:-1000])

Progress:0.0% Speed(reviews/sec):0.0 #Correct:0 #Trained:1 Training Accuracy:0.0%
Progress:10.4% Speed(reviews/sec):147.8 #Correct:1250 #Trained:2501 Training Accuracy:49.9%
Progress:20.8% Speed(reviews/sec):169.2 #Correct:2500 #Trained:5001 Training Accuracy:49.9%
Progress:31.2% Speed(reviews/sec):179.1 #Correct:3750 #Trained:7501 Training Accuracy:49.9%
Progress:41.6% Speed(reviews/sec):182.8 #Correct:5000 #Trained:10001 Training Accuracy:49.9%
Progress:52.0% Speed(reviews/sec):184.3 #Correct:6250 #Trained:12501 Training Accuracy:49.9%
Progress:62.5% Speed(reviews/sec):178.9 #Correct:7500 #Trained:15001 Training Accuracy:49.9%
Progress:72.9% Speed(reviews/sec):175.0 #Correct:8750 #Trained:17501 Training Accuracy:49.9%
Progress:83.3% Speed(reviews/sec):170.0 #Correct:10000 #Trained:20001 Training Accuracy:49.9%
Progress:93.7% Speed(reviews/sec):166.2 #Correct:11250 #Trained:22501 Training Accuracy:49.9%
Progress:99.9% Speed(reviews/sec):164.5 #Correct:11999 #Trained:24000 Training Acc

The result of training was no better than predicting the outcome randomly. Adjustments to the learning rate had little effect, so there must be something wrong with the data.

## Neural Noise

Neural noise is when the important data is being drowned out by other pieces of unimportant data. 

An analogy would be:
>The neural network is a spade that helps you dig for gold. However, no type of spade is going to help you get more gold if you are digging in the wrong place.

To solve this we need to eliminate the noise in the data. First, let's check the data to see if there is anything that shouldn't be there.

In [27]:
# The function defined previously
def update_input_layer(review):
    
    global layer_0
    
    # clear out previous state, reset the layer to be all 0s
    layer_0 *= 0
    for word in review.split(" "):
        layer_0[0][word2index[word]] += 1

update_input_layer(reviews[0])

In [28]:
layer_0

array([[ 18.,   0.,   0., ...,   0.,   0.,   0.]])

In [29]:
review_counter = Counter()

for word in reviews[0].split(' '):
    review_counter[word] += 1
    
review_counter.most_common()[0:50]

[('.', 27),
 ('', 18),
 ('the', 9),
 ('to', 6),
 ('high', 5),
 ('i', 5),
 ('bromwell', 4),
 ('is', 4),
 ('a', 4),
 ('teachers', 4),
 ('that', 4),
 ('of', 4),
 ('it', 2),
 ('at', 2),
 ('as', 2),
 ('school', 2),
 ('my', 2),
 ('in', 2),
 ('me', 2),
 ('students', 2),
 ('their', 2),
 ('student', 2),
 ('cartoon', 1),
 ('comedy', 1),
 ('ran', 1),
 ('same', 1),
 ('time', 1),
 ('some', 1),
 ('other', 1),
 ('programs', 1),
 ('about', 1),
 ('life', 1),
 ('such', 1),
 ('years', 1),
 ('teaching', 1),
 ('profession', 1),
 ('lead', 1),
 ('believe', 1),
 ('s', 1),
 ('satire', 1),
 ('much', 1),
 ('closer', 1),
 ('reality', 1),
 ('than', 1),
 ('scramble', 1),
 ('survive', 1),
 ('financially', 1),
 ('insightful', 1),
 ('who', 1),
 ('can', 1)]

Looking at the data we can see there are 18 instances of the empty string (produced by consecutive spaces in the review). As this does not convey the sentiment in any way, its influence must be removed.

## Reducing noise in the input data

In this situation we can see that the data is being skewed by the large counts of useless values. We can resolve this by changing the update_input_layer method so that it does not increment the word count but just records the word as present!
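
The effect of switching from counts to presence can be seen on a toy review. The review text and variable names below are invented for illustration.

```python
from collections import Counter

toy_review = 'the movie . the plot . the acting . excellent'
tokens = toy_review.split(' ')

counts = Counter(tokens)                  # what `+= 1` records
presence = {word: 1 for word in counts}   # what `= 1` records

print(counts.most_common(3))
print(presence['excellent'], presence['the'])
```

With counts, 'the' and '.' each weigh three times as much as 'excellent'; with presence, every word that appears contributes equally, so the frequent filler tokens can no longer drown out the informative one.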

In [61]:
import numpy as np
import time
import sys

class SentimentNetwork():
    def __init__(self, reviews, labels, hidden_nodes = 10, learning_rate = 0.1):
        
        # Seed the random number generator for debugging
        np.random.seed(1)
        
        self.pre_process_data(reviews, labels)
        
        self.init_network(self.review_vocab_size, hidden_nodes, 1, learning_rate)
    
    # Process all the review and label data and build the vocabularies
    # Give each word a unique index for entry into the network
    def pre_process_data(self, reviews, labels):
        # Creates a set that contains one of each different word
        review_vocab = set()
        for review in reviews:
            for word in review.split(' '):
                review_vocab.add(word)
        
        self.review_vocab = list(review_vocab)
        
        # Creates a dictionary that contains one of each label type [IN our case 'POSITIVE' or 'NEGATIVE']
        label_vocab = set()
        for label in labels:
            label_vocab.add(label)
                
        self.label_vocab = list(label_vocab)
        
        # Store the sizes of each dictionary
        self.review_vocab_size = len(self.review_vocab)
        self.label_vocab_size = len(self.label_vocab)
        
        # Give each word a uniquely identifying index
        self.word2index = {}
        for i, word in enumerate(self.review_vocab):
            self.word2index[word] = i
        
        # Give each label an identifying index [in our case only 0 or 1]
        self.label2index = {}
        for i, label in enumerate(self.label_vocab):
            self.label2index[label] = i
    
    # Initialize all the network parameters
    def init_network(self, input_nodes, hidden_nodes, output_nodes, learning_rate):
        # Set the number of nodes for all three layers
        self.input_nodes = input_nodes
        self.hidden_nodes = hidden_nodes
        self.output_nodes = output_nodes
        
        # Initialise weights
        self.weights_0_1 = np.zeros((self.input_nodes, self.hidden_nodes))
        self.weights_1_2 = np.random.normal(0.0, self.output_nodes**-0.5, (self.hidden_nodes, self.output_nodes))
        
        self.learning_rate = learning_rate
        
        self.layer_0 = np.zeros((1, input_nodes))
    
    # Activation function
    def sigmoid(self, x, deriv=False):
        if deriv:
            return x * (1 - x)
        else:
            return 1 / (1 + np.exp(-x))
    
    # Reset the layer for the next pass and add the new words from the review
    def update_input_layer(self, review):
        # Clear the previous state and set to 0
        self.layer_0 *= 0
        
        for word in review.split(' '):
            if (word in self.word2index.keys()):
                self.layer_0[0][self.word2index[word]] = 1
    
    # Get the numerical representation of the output
    def get_target_for_label(self, label):
        if label == 'POSITIVE':
            return 1
        else:
            return 0
    
    # Train the network
    def train(self, training_reviews, training_labels):
        
        # Check that the inputs match the outputs
        assert(len(training_reviews) == len(training_labels))
        
        correct_so_far = 0
        
        # Log the start time
        start = time.time()
        
        for i in range(len(training_reviews)):
            
            review = training_reviews[i]
            label = training_labels[i]
            
            ## Forward Pass ##
            
            # Input Layer
            self.update_input_layer(review)
            
            # Hidden Layer
            layer_1 = self.layer_0.dot(self.weights_0_1)
            
            # Output layer
            layer_2 = self.sigmoid(layer_1.dot(self.weights_1_2))
            
            ## Backwards Pass ##
            
            # Output Error
            layer_2_error = layer_2 - self.get_target_for_label(label)
            layer_2_delta = layer_2_error * self.sigmoid(layer_2, True)
            
            # Hidden Error
            layer_1_error = layer_2_delta.dot(self.weights_1_2.T)
            layer_1_delta = layer_1_error
            
            self.weights_1_2 -= layer_1.T.dot(layer_2_delta) * self.learning_rate
            self.weights_0_1 -= self.layer_0.T.dot(layer_1_delta) * self.learning_rate
            
            if (np.abs(layer_2_error) < 0.5):
                correct_so_far += 1
            
            reviews_per_second = i / float(time.time() - start)
            
            sys.stdout.write("\rProgress:" + str(100 * i/float(len(training_reviews)))[:4] + "% Speed(reviews/sec):" + str(reviews_per_second)[0:5] + " #Correct:" + str(correct_so_far) + " #Trained:" + str(i+1) + " Training Accuracy:" + str(correct_so_far * 100 / float(i+1))[:4] + "%")
            if(i % 2500 == 0):
                print("")
    
    # Test the network
    def test(self, testing_reviews, testing_labels):
        
        correct = 0
        
        start = time.time()
        
        for i in range(len(testing_reviews)):
            pred = self.run(testing_reviews[i])
            
            if pred == testing_labels[i]:
                correct += 1
                
            reviews_per_second = i / float(time.time() - start)
            
            sys.stdout.write("\rProgress:" + str(100 * i/float(len(testing_reviews)))[:4] \
                             + "% Speed(reviews/sec):" + str(reviews_per_second)[0:5] \
                            + "% #Correct:" + str(correct) + " #Tested:" + str(i+1) + " Testing Accuracy:" + str(correct * 100 / float(i+1))[:4] + "%")
    
    # Forward propagate
    def run(self, review):
        # Input Layer
        self.update_input_layer(review.lower())

        # Hidden layer
        layer_1 = self.layer_0.dot(self.weights_0_1)

        # Output layer
        layer_2 = self.sigmoid(layer_1.dot(self.weights_1_2))
        
        if(layer_2[0] > 0.5):
            return "POSITIVE"
        else:
            return "NEGATIVE"

In [62]:
mlp = SentimentNetwork(reviews[:-1000],labels[:-1000], learning_rate=0.1)

In [63]:
mlp.train(reviews[:-1000],labels[:-1000])

Progress:0.0% Speed(reviews/sec):0.0 #Correct:0 #Trained:1 Training Accuracy:0.0%
Progress:10.4% Speed(reviews/sec):204.0 #Correct:1812 #Trained:2501 Training Accuracy:72.4%
Progress:20.8% Speed(reviews/sec):204.2 #Correct:3800 #Trained:5001 Training Accuracy:75.9%
Progress:31.2% Speed(reviews/sec):204.4 #Correct:5879 #Trained:7501 Training Accuracy:78.3%
Progress:41.6% Speed(reviews/sec):204.8 #Correct:8015 #Trained:10001 Training Accuracy:80.1%
Progress:52.0% Speed(reviews/sec):205.0 #Correct:10150 #Trained:12501 Training Accuracy:81.1%
Progress:62.5% Speed(reviews/sec):204.7 #Correct:12298 #Trained:15001 Training Accuracy:81.9%
Progress:72.9% Speed(reviews/sec):202.1 #Correct:14425 #Trained:17501 Training Accuracy:82.4%
Progress:83.3% Speed(reviews/sec):201.0 #Correct:16596 #Trained:20001 Training Accuracy:82.9%
Progress:93.7% Speed(reviews/sec):201.2 #Correct:18782 #Trained:22501 Training Accuracy:83.4%
Progress:99.9% Speed(reviews/sec):200.3 #Correct:20105 #Trained:24000 Training 

Instantly we can see that the network is now training successfully, attaining a training accuracy upwards of 83.5%!

However, the training process is still slow, and we need to find and remove inefficiencies in the code to make it run faster.

In [33]:
mlp.test(reviews[-1000:],labels[-1000:])

Progress:99.9% Speed(reviews/sec):1000.% #Correct:851 #Tested:1000 Testing Accuracy:85.1%

## Analysing inefficiencies in the network

Now we need to identify things that can slow down the network. Let's start by looking at layer_0.

In [34]:
layer_0

array([[ 18.,   0.,   0., ...,   0.,   0.,   0.]])

Notice that layer_0 is the length of the entire vocabulary. However, most of those inputs will simply be zero. Since 0 multiplied by any number is always 0, each of those multiplications is a wasted calculation. Looking at a smaller example we can see...

In [35]:
layer_0 = np.zeros(10)

layer_0

array([ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.])

In [36]:
layer_0[4] = 1
layer_0[9] = 1

layer_0

array([ 0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  1.])

In [37]:
weights_0_1 = np.random.randn(10, 5)
layer_0.dot(weights_0_1)

array([-0.10503756,  0.44222989,  0.24392938, -0.55961832,  0.21389503])

On its own this doesn't look like a problem, but let's see how we can get the same results with much less computation.

In [38]:
indices = [4,9]
layer_1 = np.zeros(5)
for index in indices:
    layer_1 += (weights_0_1[index])
    
layer_1

array([-0.10503756,  0.44222989,  0.24392938, -0.55961832,  0.21389503])

This produces exactly the same result, because every multiplication by 0 contributed nothing! It should also be noted that the input can only be 0 or 1. 1 * x is always equal to x, so the first-layer computation can be reduced to summing the weight rows of the words that are present!
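
The equivalence can also be checked programmatically on a slightly larger toy example. The sizes and the active indices below are arbitrary assumptions, not the dimensions of the real network.

```python
import numpy as np

np.random.seed(0)
vocab_size, hidden_size = 1000, 5      # toy sizes, not the real network
weights = np.random.randn(vocab_size, hidden_size)

active = [4, 9, 512]                   # indices of words present in a review
layer_0 = np.zeros(vocab_size)
layer_0[active] = 1

dense = layer_0.dot(weights)           # multiplies every row, mostly by 0
sparse = weights[active].sum(axis=0)   # adds only the rows that matter

print(np.allclose(dense, sparse))
```

The dense version touches all 1000 rows of the weight matrix, while the sparse version touches only the three rows whose words are present, and the gap grows with the size of the vocabulary.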

## Removing inefficiencies in the network

Now that a serious inefficiency has been found in the network we need to actually adapt the network to handle it.
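
One way to prepare for this, sketched here as an assumption rather than as the notebook's exact code, is to convert each review once into the indices of the words it contains, so that training can work with those indices instead of a full input vector. The helper `review_to_indices` and the toy mapping are hypothetical.

```python
def review_to_indices(review, word2index):
    """Return the sorted, de-duplicated indices of the known words in a review."""
    indices = set()
    for word in review.split(' '):
        if word in word2index:          # ignore words outside the vocabulary
            indices.add(word2index[word])
    return sorted(indices)

toy_word2index = {'good': 0, 'bad': 1, 'movie': 2}   # hypothetical mapping
toy_indices = review_to_indices('good movie good unknownword', toy_word2index)
print(toy_indices)
```

Using a set means a repeated word is recorded once, which matches the presence-only input introduced in the noise reduction step.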

In [39]:
import time
import sys

# Let's tweak the network from before to model these phenomena
class SentimentNetwork:
    def __init__(self, reviews,labels,hidden_nodes = 10, learning_rate = 0.1):
       
        np.random.seed(1)
    
        # Pass labels explicitly rather than relying on the global variable
        self.pre_process_data(reviews, labels)
        
        self.init_network(len(self.review_vocab),hidden_nodes, 1, learning_rate)
        
        
    def pre_process_data(self, reviews, labels):
        
        review_vocab = set()
        for review in reviews:
            for word in review.split(" "):
                review_vocab.add(word)
        self.review_vocab = list(review_vocab)
        
        label_vocab = set()
        for label in labels:
            label_vocab.add(label)
        
        self.label_vocab = list(label_vocab)
        
        self.review_vocab_size = len(self.review_vocab)
        self.label_vocab_size = len(self.label_vocab)
        
        self.word2index = {}
        for i, word in enumerate(self.review_vocab):
            self.word2index[word] = i
        
        self.label2index = {}
        for i, label in enumerate(self.label_vocab):
            self.label2index[label] = i
         
        
    def init_network(self, input_nodes, hidden_nodes, output_nodes, learning_rate):
        # Set number of nodes in input, hidden and output layers.
        self.input_nodes = input_nodes
        self.hidden_nodes = hidden_nodes
        self.output_nodes = output_nodes

        # Initialize weights
        self.weights_0_1 = np.zeros((self.input_nodes,self.hidden_nodes))
    
        self.weights_1_2 = np.random.normal(0.0, self.output_nodes**-0.5, 
                                                (self.hidden_nodes, self.output_nodes))
        
        self.learning_rate = learning_rate
        
        self.layer_0 = np.zeros((1,input_nodes))
        self.layer_1 = np.zeros((1,hidden_nodes))
        
    def sigmoid(self,x):
        return 1 / (1 + np.exp(-x))
    
    
    def sigmoid_output_2_derivative(self,output):
        return output * (1 - output)
    
    def update_input_layer(self,review):

        # clear out previous state, reset the layer to be all 0s
        self.layer_0 *= 0
        for word in review.split(" "):
            self.layer_0[0][self.word2index[word]] = 1

    def get_target_for_label(self,label):
        if(label == 'POSITIVE'):
            return 1
        else:
            return 0
        
    def train(self, training_reviews_raw, training_labels):
        
        training_reviews = list()
        for review in training_reviews_raw:
            indices = set()
            for word in review.split(" "):
                if(word in self.word2index.keys()):
                    indices.add(self.word2index[word])
            training_reviews.append(list(indices))
        
        assert(len(training_reviews) == len(training_labels))
        
        correct_so_far = 0
        
        start = time.time()
        
        for i in range(len(training_reviews)):
            
            review = training_reviews[i]
            label = training_labels[i]
            
            #### Implement the forward pass here ####
            ### Forward pass ###

            # Input Layer

            # Hidden layer
#             layer_1 = self.layer_0.dot(self.weights_0_1)
            self.layer_1 *= 0
            for index in review:
                self.layer_1 += self.weights_0_1[index]
            
            # Output layer
            layer_2 = self.sigmoid(self.layer_1.dot(self.weights_1_2))

            #### Implement the backward pass here ####
            ### Backward pass ###

            # Output error
            layer_2_error = layer_2 - self.get_target_for_label(label) # Output layer error is the difference between desired target and actual output.
            layer_2_delta = layer_2_error * self.sigmoid_output_2_derivative(layer_2)

            # Backpropagated error
            layer_1_error = layer_2_delta.dot(self.weights_1_2.T) # errors propagated to the hidden layer
            layer_1_delta = layer_1_error # hidden layer gradients - no nonlinearity so it's the same as the error

            # Update the weights
            self.weights_1_2 -= self.layer_1.T.dot(layer_2_delta) * self.learning_rate # update hidden-to-output weights with gradient descent step
            
            for index in review:
                self.weights_0_1[index] -= layer_1_delta[0] * self.learning_rate # update input-to-hidden weights with gradient descent step

            if(np.abs(layer_2_error) < 0.5):
                correct_so_far += 1
            
            reviews_per_second = i / float(time.time() - start)
            
            sys.stdout.write("\rProgress:" + str(100 * i/float(len(training_reviews)))[:4] + "% Speed(reviews/sec):" + str(reviews_per_second)[0:5] + " #Correct:" + str(correct_so_far) + " #Trained:" + str(i+1) + " Training Accuracy:" + str(correct_so_far * 100 / float(i+1))[:4] + "%")
        
    
    def test(self, testing_reviews, testing_labels):
        
        correct = 0
        
        start = time.time()
        
        for i in range(len(testing_reviews)):
            pred = self.run(testing_reviews[i])
            if(pred == testing_labels[i]):
                correct += 1
            
            reviews_per_second = i / float(time.time() - start)
            
            sys.stdout.write("\rProgress:" + str(100 * i/float(len(testing_reviews)))[:4] \
                             + "% Speed(reviews/sec):" + str(reviews_per_second)[0:5] \
                            + "% #Correct:" + str(correct) + " #Tested:" + str(i+1) + " Testing Accuracy:" + str(correct * 100 / float(i+1))[:4] + "%")
    
    def run(self, review):
        
        # Input Layer


        # Hidden layer
        self.layer_1 *= 0
        unique_indices = set()
        for word in review.lower().split(" "):
            if word in self.word2index.keys():
                unique_indices.add(self.word2index[word])
        for index in unique_indices:
            self.layer_1 += self.weights_0_1[index]
        
        # Output layer
        layer_2 = self.sigmoid(self.layer_1.dot(self.weights_1_2))
        
        if(layer_2[0] > 0.5):
            return "POSITIVE"
        else:
            return "NEGATIVE"
        

In [40]:
mlp = SentimentNetwork(reviews[:-1000],labels[:-1000], learning_rate=0.1)

In [41]:
mlp.train(reviews[:-1000],labels[:-1000])

Progress:99.9% Speed(reviews/sec):1857. #Correct:20108 #Trained:24000 Training Accuracy:83.7%

In [42]:
mlp.test(reviews[-1000:],labels[-1000:])

Progress:99.9% Speed(reviews/sec):1995.% #Correct:857 #Tested:1000 Testing Accuracy:85.7%
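The backward pass in the class above uses the same shortcut: only the weight rows of words that appeared in the review are updated. As a rough check, the sketch below compares that sparse update with the dense outer-product update it replaces. The sizes, the indices and the random stand-in for `layer_1_delta` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.RandomState(1)
vocab_size, hidden_size = 50, 4            # hypothetical sizes
learning_rate = 0.1

weights_dense = rng.randn(vocab_size, hidden_size)
weights_sparse = weights_dense.copy()

indices = [3, 17, 42]                      # words present in the review
layer_0 = np.zeros((1, vocab_size))
layer_0[0, indices] = 1
layer_1_delta = rng.randn(1, hidden_size)  # stand-in for the backpropagated gradient

# Dense update: the outer product touches every row, but rows of absent words get a zero step
weights_dense -= layer_0.T.dot(layer_1_delta) * learning_rate

# Sparse update: only the rows of the words that appeared are changed
for index in indices:
    weights_sparse[index] -= layer_1_delta[0] * learning_rate

print(np.allclose(weights_dense, weights_sparse))
```

The two weight matrices end up identical, so this prints `True`. The sparse loop changes 3 rows where the dense update computes a step for all 50.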

## Further Noise Reduction

Let's continue to reduce the amount of noise to make training even more accurate. Small changes can have a huge impact on how fast the network trains.

Here we are going to look at how to separate out the key words more effectively.

In [43]:
pos_neg_ratios.most_common()[0:30]

[('edie', 4.6913478822291435),
 ('paulie', 4.0775374439057197),
 ('felix', 3.1527360223636558),
 ('polanski', 2.8233610476132043),
 ('matthau', 2.8067217286092401),
 ('victoria', 2.6810215287142909),
 ('mildred', 2.6026896854443837),
 ('gandhi', 2.5389738710582761),
 ('flawless', 2.451005098112319),
 ('superbly', 2.2600254785752498),
 ('perfection', 2.1594842493533721),
 ('astaire', 2.1400661634962708),
 ('captures', 2.0386195471595809),
 ('voight', 2.0301704926730531),
 ('wonderfully', 2.0218960560332353),
 ('powell', 1.9783454248084671),
 ('brosnan', 1.9547990964725592),
 ('lily', 1.9203768470501485),
 ('bakshi', 1.9029851043382795),
 ('lincoln', 1.9014583864844796),
 ('refreshing', 1.8551812956655511),
 ('breathtaking', 1.8481124057791867),
 ('bourne', 1.8478489358790986),
 ('lemmon', 1.8458266904983307),
 ('delightful', 1.8002701588959635),
 ('flynn', 1.7996646487351682),
 ('andrews', 1.7764919970972666),
 ('homer', 1.7692866133759964),
 ('beautifully', 1.7626953362841438),
 ('socc

The data being passed in at the moment is full of words we can use for sentiment analysis. However, it also contains many words that are not useful to us: many commonly occurring names, for example, don't really express sentiment. The next step is to remove those useless words.

A good way to do this is to use Bokeh, a data visualisation library that lets us see what is going on with the data.

In [44]:
from bokeh.models import ColumnDataSource, LabelSet
from bokeh.plotting import figure, show, output_file
from bokeh.io import output_notebook
output_notebook()

In [45]:
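# `normed` is a deprecated alias of `density` (removed in recent NumPy releases), so only `density=True` is passed
hist, edges = np.histogram(list(map(lambda x:x[1], pos_neg_ratios.most_common())), density=True, bins=100)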

p = figure(tools='pan,wheel_zoom,reset,save', toolbar_location='above', title="Word positive/negative Affinity Distribution")

p.quad(top=hist, bottom=0, left=edges[:-1], right=edges[1:], line_color="#555555")
show(p)

Here we can see that a large number of words lie in the middle of the x axis. These words carry little sentiment, while the fewer words out on either side carry much more.

This is good because we can add a cut-off point that discards the useless words.
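To see why neutral words pile up around zero, here is a minimal sketch of the signed log-ratio, assuming the same formula that this notebook's `pre_process_data` method uses. The helper name `log_pos_neg_ratio` and the word counts are made up for illustration.

```python
import math
from collections import Counter

def log_pos_neg_ratio(positive_count, negative_count):
    """Signed log-ratio: positive for words seen mostly in positive reviews,
    negative for words seen mostly in negative reviews, near 0 for neutral words."""
    ratio = positive_count / float(negative_count + 1)
    if ratio > 1:
        return math.log(ratio)
    return -math.log(1 / (ratio + 0.01))

# Hypothetical counts, invented for this example
positive_counts = Counter({"superb": 400, "the": 1000, "awful": 20})
negative_counts = Counter({"superb": 40, "the": 1000, "awful": 500})

scores = {}
for word in ["superb", "the", "awful"]:
    scores[word] = log_pos_neg_ratio(positive_counts[word], negative_counts[word])
    print(word, round(scores[word], 2))
```

With these counts "superb" lands well to the right of zero, "awful" well to the left, and "the" sits almost exactly on zero, which is the region a cut-off would discard.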

### Using the data to remove the noise

Now we can use what we have learned to increase the accuracy of the network by introducing a minimum frequency cut-off and a polarity cut-off. Together these remove rare words and useless tokens like '.' and ''.

In [46]:
import time
import sys
import numpy as np
from collections import Counter  # used by pre_process_data below

# Let's tweak the network from before to model these phenomena
class SentimentNetwork:
    def __init__(self, reviews,labels,min_count = 10,polarity_cutoff = 0.1,hidden_nodes = 10, learning_rate = 0.1):
       
        np.random.seed(1)
    
        # Pass the labels in explicitly rather than relying on a global variable
        self.pre_process_data(reviews, labels, polarity_cutoff, min_count)
        
        self.init_network(len(self.review_vocab),hidden_nodes, 1, learning_rate)
        
        
    def pre_process_data(self, reviews, labels, polarity_cutoff, min_count):
        
        positive_counts = Counter()
        negative_counts = Counter()
        total_counts = Counter()

        for i in range(len(reviews)):
            if(labels[i] == 'POSITIVE'):
                for word in reviews[i].split(" "):
                    positive_counts[word] += 1
                    total_counts[word] += 1
            else:
                for word in reviews[i].split(" "):
                    negative_counts[word] += 1
                    total_counts[word] += 1

        pos_neg_ratios = Counter()

        for term,cnt in list(total_counts.most_common()):
            if(cnt >= 50):
                pos_neg_ratio = positive_counts[term] / float(negative_counts[term]+1)
                pos_neg_ratios[term] = pos_neg_ratio

        for word,ratio in pos_neg_ratios.most_common():
            if(ratio > 1):
                pos_neg_ratios[word] = np.log(ratio)
            else:
                pos_neg_ratios[word] = -np.log((1 / (ratio + 0.01)))
        
        review_vocab = set()
        for review in reviews:
            for word in review.split(" "):
                if(total_counts[word] > min_count):
                    if(word in pos_neg_ratios.keys()):
                        if((pos_neg_ratios[word] >= polarity_cutoff) or (pos_neg_ratios[word] <= -polarity_cutoff)):
                            review_vocab.add(word)
                    else:
                        review_vocab.add(word)
        self.review_vocab = list(review_vocab)
        
        label_vocab = set()
        for label in labels:
            label_vocab.add(label)
        
        self.label_vocab = list(label_vocab)
        
        self.review_vocab_size = len(self.review_vocab)
        self.label_vocab_size = len(self.label_vocab)
        
        self.word2index = {}
        for i, word in enumerate(self.review_vocab):
            self.word2index[word] = i
        
        self.label2index = {}
        for i, label in enumerate(self.label_vocab):
            self.label2index[label] = i
         
        
    def init_network(self, input_nodes, hidden_nodes, output_nodes, learning_rate):
        # Set number of nodes in input, hidden and output layers.
        self.input_nodes = input_nodes
        self.hidden_nodes = hidden_nodes
        self.output_nodes = output_nodes

        # Initialize weights
        self.weights_0_1 = np.zeros((self.input_nodes,self.hidden_nodes))
    
        self.weights_1_2 = np.random.normal(0.0, self.output_nodes**-0.5, 
                                                (self.hidden_nodes, self.output_nodes))
        
        self.learning_rate = learning_rate
        
        self.layer_0 = np.zeros((1,input_nodes))
        self.layer_1 = np.zeros((1,hidden_nodes))
        
    def sigmoid(self,x):
        return 1 / (1 + np.exp(-x))
    
    
    def sigmoid_output_2_derivative(self,output):
        return output * (1 - output)
    
    def update_input_layer(self,review):

        # clear out previous state, reset the layer to be all 0s
        self.layer_0 *= 0
        for word in review.split(" "):
            self.layer_0[0][self.word2index[word]] = 1

    def get_target_for_label(self,label):
        if(label == 'POSITIVE'):
            return 1
        else:
            return 0
        
    def train(self, training_reviews_raw, training_labels):
        
        training_reviews = list()
        for review in training_reviews_raw:
            indices = set()
            for word in review.split(" "):
                if(word in self.word2index.keys()):
                    indices.add(self.word2index[word])
            training_reviews.append(list(indices))
        
        assert(len(training_reviews) == len(training_labels))
        
        correct_so_far = 0
        
        start = time.time()
        
        for i in range(len(training_reviews)):
            
            review = training_reviews[i]
            label = training_labels[i]
            
            #### Implement the forward pass here ####
            ### Forward pass ###

            # Input Layer

            # Hidden layer
#             layer_1 = self.layer_0.dot(self.weights_0_1)
            self.layer_1 *= 0
            for index in review:
                self.layer_1 += self.weights_0_1[index]
            
            # Output layer
            layer_2 = self.sigmoid(self.layer_1.dot(self.weights_1_2))

            #### Implement the backward pass here ####
            ### Backward pass ###

            # Output error
            layer_2_error = layer_2 - self.get_target_for_label(label) # Output layer error is the difference between desired target and actual output.
            layer_2_delta = layer_2_error * self.sigmoid_output_2_derivative(layer_2)

            # Backpropagated error
            layer_1_error = layer_2_delta.dot(self.weights_1_2.T) # errors propagated to the hidden layer
            layer_1_delta = layer_1_error # hidden layer gradients - no nonlinearity so it's the same as the error

            # Update the weights
            self.weights_1_2 -= self.layer_1.T.dot(layer_2_delta) * self.learning_rate # update hidden-to-output weights with gradient descent step
            
            for index in review:
                self.weights_0_1[index] -= layer_1_delta[0] * self.learning_rate # update input-to-hidden weights with gradient descent step

            if(layer_2 >= 0.5 and label == 'POSITIVE'):
                correct_so_far += 1
            if(layer_2 < 0.5 and label == 'NEGATIVE'):
                correct_so_far += 1
            
            reviews_per_second = i / float(time.time() - start)
            
            sys.stdout.write("\rProgress:" + str(100 * i/float(len(training_reviews)))[:4] + "% Speed(reviews/sec):" + str(reviews_per_second)[0:5] + " #Correct:" + str(correct_so_far) + " #Trained:" + str(i+1) + " Training Accuracy:" + str(correct_so_far * 100 / float(i+1))[:4] + "%")
        
    
    def test(self, testing_reviews, testing_labels):
        
        correct = 0
        
        start = time.time()
        
        for i in range(len(testing_reviews)):
            pred = self.run(testing_reviews[i])
            if(pred == testing_labels[i]):
                correct += 1
            
            reviews_per_second = i / float(time.time() - start)
            
            sys.stdout.write("\rProgress:" + str(100 * i/float(len(testing_reviews)))[:4] \
                             + "% Speed(reviews/sec):" + str(reviews_per_second)[0:5] \
                            + "% #Correct:" + str(correct) + " #Tested:" + str(i+1) + " Testing Accuracy:" + str(correct * 100 / float(i+1))[:4] + "%")
    
    def run(self, review):
        
        # Input Layer


        # Hidden layer
        self.layer_1 *= 0
        unique_indices = set()
        for word in review.lower().split(" "):
            if word in self.word2index.keys():
                unique_indices.add(self.word2index[word])
        for index in unique_indices:
            self.layer_1 += self.weights_0_1[index]
        
        # Output layer
        layer_2 = self.sigmoid(self.layer_1.dot(self.weights_1_2))
        
        if(layer_2[0] >= 0.5):
            return "POSITIVE"
        else:
            return "NEGATIVE"

- min_count: a word must occur more often than this value to be included
- polarity_cutoff: a word's ratio must lie at least this far to the left or right of the centre of the histogram to be included
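The way the two parameters interact can be sketched as a standalone function. This mirrors the filtering logic in `pre_process_data`, but the function name `build_vocab` and the toy counts and ratios are hypothetical.

```python
def build_vocab(total_counts, pos_neg_ratios, min_count, polarity_cutoff):
    """Keep words that are frequent enough and, if a ratio is known, polar enough."""
    vocab = set()
    for word, count in total_counts.items():
        if count <= min_count:
            continue  # too rare to be reliable
        if word in pos_neg_ratios and abs(pos_neg_ratios[word]) < polarity_cutoff:
            continue  # too close to the centre of the histogram
        vocab.add(word)
    return vocab

# Hypothetical counts and log-ratios, invented for this example
total_counts = {"superb": 440, "the": 2000, "awful": 520, "zyzzyva": 2}
pos_neg_ratios = {"superb": 2.28, "the": 0.01, "awful": -3.0}

small_cutoff = build_vocab(total_counts, pos_neg_ratios, min_count=20, polarity_cutoff=0.05)
large_cutoff = build_vocab(total_counts, pos_neg_ratios, min_count=20, polarity_cutoff=2.5)
print(sorted(small_cutoff))  # → ['awful', 'superb']
print(sorted(large_cutoff))  # → ['awful']
```

As in the class above, a word with no entry in `pos_neg_ratios` is kept as long as it passes the frequency test.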

In [47]:
mlp = SentimentNetwork(reviews[:-1000],labels[:-1000],min_count=20,polarity_cutoff=0.05,learning_rate=0.01)

In [48]:
mlp.train(reviews[:-1000],labels[:-1000])

Progress:99.9% Speed(reviews/sec):2012. #Correct:20461 #Trained:24000 Training Accuracy:85.2%

In [49]:
mlp.test(reviews[-1000:],labels[-1000:])

Progress:99.9% Speed(reviews/sec):2280.% #Correct:859 #Tested:1000 Testing Accuracy:85.9%

Removing the useless data has had a small positive impact on the accuracy of the network during testing (85.7% to 85.9%). It has also given a small increase in speed because the vocabulary is smaller.

Increasing the polarity cutoff further will speed up the network even more.

In [50]:
mlp = SentimentNetwork(reviews[:-1000],labels[:-1000],min_count=20,polarity_cutoff=0.8,learning_rate=0.01)

In [51]:
mlp.train(reviews[:-1000],labels[:-1000])

Progress:99.9% Speed(reviews/sec):6837. #Correct:20552 #Trained:24000 Training Accuracy:85.6%

In [52]:
mlp.test(reviews[-1000:],labels[-1000:])

[progress output truncated in the source after the first 12 test reviews]

Here the speed roughly tripled, from around 2,000 to over 6,000 reviews per second; however, about 3% of the accuracy was traded off as a result. There will almost always be a trade-off between speed and accuracy. Noise reduction is the exception, as it normally helps both.
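One way to picture the trade-off is to count how many weight rows the forward pass has to sum for each cutoff. The sketch below is a simplified illustration: the ratios and reviews are made up, `rows_touched` is a hypothetical helper, and unlike the class above it drops words that have no ratio at all.

```python
# Hypothetical log-ratios and reviews, invented for this example
pos_neg_ratios = {"superb": 2.3, "great": 0.9, "movie": 0.02, "the": 0.01,
                  "bad": -0.95, "awful": -3.0}
reviews = ["the movie was superb", "the movie was bad", "great great movie", "awful"]

def rows_touched(reviews, ratios, polarity_cutoff):
    """Total number of weight rows summed in the forward pass over all reviews."""
    vocab = {word for word, ratio in ratios.items() if abs(ratio) >= polarity_cutoff}
    return sum(len(set(review.split(" ")) & vocab) for review in reviews)

for cutoff in [0.0, 0.05, 1.0]:
    print(cutoff, rows_touched(reviews, pos_neg_ratios, cutoff))
```

The work falls from 9 rows to 4 and then to 2 as the cutoff rises. At the largest cutoff the second and third reviews have no words left in the vocabulary, which is where the lost accuracy comes from.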

## Analysis - What's going on with the weights?

Here we will visualise what is really going on under the hood of the network and how the weights work.

In [53]:
mlp_full = SentimentNetwork(reviews[:-1000], labels[:-1000], min_count=0, polarity_cutoff=0, learning_rate=0.01)

In [54]:
mlp_full.train(reviews[:-1000], labels[:-1000])

Progress:99.9% Speed(reviews/sec):1680. #Correct:20335 #Trained:24000 Training Accuracy:84.7%

In [55]:
def get_most_similar_words(focus = "horrible"):
    most_similar = Counter()

    for word in mlp_full.word2index.keys():
        most_similar[word] = np.dot(mlp_full.weights_0_1[mlp_full.word2index[word]],mlp_full.weights_0_1[mlp_full.word2index[focus]])
    
    return most_similar.most_common()

In [56]:
get_most_similar_words('excellent')[0:50]

[('excellent', 0.13672950757352476),
 ('perfect', 0.12548286087225946),
 ('amazing', 0.091827633925999713),
 ('today', 0.090223662694414217),
 ('wonderful', 0.089355976962214589),
 ('fun', 0.087504466674206888),
 ('great', 0.087141758882292031),
 ('best', 0.085810885617880639),
 ('liked', 0.077697629123843426),
 ('definitely', 0.076628781406966023),
 ('brilliant', 0.073423858769279066),
 ('loved', 0.073285428928122162),
 ('favorite', 0.072781136036160765),
 ('superb', 0.071736207178505054),
 ('fantastic', 0.070922191916266183),
 ('job', 0.06916061720763407),
 ('incredible', 0.066424077952614416),
 ('enjoyable', 0.065632560502888793),
 ('rare', 0.064819212662615089),
 ('highly', 0.063889453350970515),
 ('enjoyed', 0.062127546101812946),
 ('wonderfully', 0.062055178604090169),
 ('perfectly', 0.061093208811887387),
 ('fascinating', 0.060663547937493886),
 ('bit', 0.059655427045653048),
 ('gem', 0.059510859296156786),
 ('outstanding', 0.058860808147083027),
 ('beautiful', 0.058613934703162

Because these words are supposed to produce a similar output, the network learns similar weights for them and therefore sees them as similar. We can visualise these clusters by plotting them on a graph using t-SNE.
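Before plotting, it may help to see the dot-product ranking on a tiny example. This is a sketch with made-up 3-dimensional vectors standing in for rows of `weights_0_1`; the function name `most_similar` and the numbers are assumptions for illustration.

```python
import numpy as np
from collections import Counter

# Hypothetical hidden-layer weights, invented for this example
word_vectors = {
    "excellent": np.array([0.30, -0.20, 0.10]),
    "perfect":   np.array([0.28, -0.18, 0.12]),
    "terrible":  np.array([-0.31, 0.22, -0.09]),
    "the":       np.array([0.01, 0.00, -0.01]),
}

def most_similar(focus, vectors):
    """Rank every word by the dot product of its vector with the focus word's vector."""
    scores = Counter()
    for word, vector in vectors.items():
        scores[word] = float(np.dot(vector, vectors[focus]))
    return scores.most_common()

ranking = [word for word, _ in most_similar("excellent", word_vectors)]
print(ranking)  # → ['excellent', 'perfect', 'the', 'terrible']
```

Words whose weights point the same way score high, a neutral word with near-zero weights scores close to zero, and a word with opposite sentiment scores negative.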

In [57]:
import matplotlib.colors as colors

words_to_visualize = list()
for word, ratio in pos_neg_ratios.most_common(500):
    if(word in mlp_full.word2index.keys()):
        words_to_visualize.append(word)
    
for word, ratio in list(reversed(pos_neg_ratios.most_common()))[0:500]:
    if(word in mlp_full.word2index.keys()):
        words_to_visualize.append(word)

In [58]:
pos = 0
neg = 0

colors_list = list()
vectors_list = list()
for word in words_to_visualize:
    if word in pos_neg_ratios.keys():
        vectors_list.append(mlp_full.weights_0_1[mlp_full.word2index[word]])
        if(pos_neg_ratios[word] > 0):
            pos+=1
            colors_list.append("#00ff00")
        else:
            neg+=1
            colors_list.append("#000000")

In [59]:
from sklearn.manifold import TSNE
tsne = TSNE(n_components=2, random_state=0)
words_top_ted_tsne = tsne.fit_transform(vectors_list)

In [60]:
p = figure(tools="pan,wheel_zoom,reset,save",
           toolbar_location="above",
           title="vector T-SNE for most polarized words")

# Keep the colours in the data source so the glyph refers to them by column name;
# passing a raw list alongside `source` triggers a Bokeh deprecation warning.
source = ColumnDataSource(data=dict(x1=words_top_ted_tsne[:,0],
                                    x2=words_top_ted_tsne[:,1],
                                    names=words_to_visualize,
                                    colors=colors_list))

p.scatter(x="x1", y="x2", size=8, source=source, color="colors")

word_labels = LabelSet(x="x1", y="x2", text="names", y_offset=6,
                  text_font_size="8pt", text_color="#555555",
                  source=source, text_align='center')
p.add_layout(word_labels)

show(p)

# green indicates positive words, black indicates negative words


## Conclusion

After curating the dataset and training, we were able to achieve a speed of a couple of hundred reviews per second with very low accuracy. After the first round of noise reduction, the accuracy rose to upwards of 80%. After optimising the inefficiencies, the network achieved the same accuracy at a couple of thousand reviews per second. Finally, a second round of noise reduction removed useless data such as names and punctuation from the set, which achieved an even higher accuracy and marginally increased speed. After this step the network can also trade off accuracy for speed by cutting more words out of the vocabulary, which can be helpful when training over a much larger dataset.

After visualising the weights, it can clearly be seen that the network has successfully grouped the input words by sentiment, with a few sentiment-neutral words, mostly names, slipping through. This could be improved by increasing the cutoff.