# Sentiment Classification & How To "Frame Problems" for a Neural Network

by Andrew Trask

- **Twitter**: @iamtrask
- **Blog**: http://iamtrask.github.io

### What You Should Already Know

- neural networks, forward and back-propagation
- stochastic gradient descent
- mean squared error
- train/test splits

### Where to Get Help If You Need It

- Re-watch previous Udacity lectures
- Consult the recommended course reading material: [Grokking Deep Learning](https://www.manning.com/books/grokking-deep-learning) (40% off: **traskud17**)
- Shoot me a tweet @iamtrask


### Tutorial Outline:

- Intro: The Importance of "Framing a Problem"


- Curate a Dataset
- Develop a "Predictive Theory"
- **PROJECT 1**: Quick Theory Validation


- Transforming Text to Numbers
- **PROJECT 2**: Creating the Input/Output Data


- Putting it all together in a Neural Network
- **PROJECT 3**: Building our Neural Network


- Understanding Neural Noise
- **PROJECT 4**: Making Learning Faster by Reducing Noise


- Analyzing Inefficiencies in our Network
- **PROJECT 5**: Making our Network Train and Run Faster


- Further Noise Reduction
- **PROJECT 6**: Reducing Noise by Strategically Reducing the Vocabulary


- Analysis: What's going on in the weights?

# Lesson: Curate a Dataset

In [12]:
def pretty_print_review_and_label(i):
    # Print the label and the first 80 characters of review i
    print(labels[i] + "\t:\t" + reviews[i][:80] + "...")

# What we know!
with open('reviews.txt', 'r') as g:
    reviews = [line.rstrip('\n') for line in g]

# What we WANT to know!
with open('labels.txt', 'r') as g:
    labels = [line.rstrip('\n').upper() for line in g]
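
One detail in loading line-oriented files such as `reviews.txt` is how the trailing newline is removed. The sketch below uses an in-memory `io.StringIO` stand-in rather than the real files; the sample text and the helper name `read_lines` are made up for illustration. It compares dropping the last character with stripping only a newline. The two agree unless the final line lacks a trailing newline.

```python
import io

def read_lines(handle, strip_newline_only=True):
    """Return the lines of a text handle without their line endings."""
    if strip_newline_only:
        return [line.rstrip("\n") for line in handle]
    # Slicing removes the last character whatever it is
    return [line[:-1] for line in handle]

# Hypothetical two-line file whose last line has no trailing newline
sample = "positive\nnegative"

sliced = read_lines(io.StringIO(sample), strip_newline_only=False)
stripped = read_lines(io.StringIO(sample))

print(sliced)    # ['positive', 'negativ']
print(stripped)  # ['positive', 'negative']
```

Stripping only the newline also leaves any trailing spaces in a review untouched.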

In [13]:
len(reviews)

25000

In [14]:
reviews[0]

'bromwell high is a cartoon comedy . it ran at the same time as some other programs about school life  such as  teachers  . my   years in the teaching profession lead me to believe that bromwell high  s satire is much closer to reality than is  teachers  . the scramble to survive financially  the insightful students who can see right through their pathetic teachers  pomp  the pettiness of the whole situation  all remind me of the schools i knew and their students . when i saw the episode in which a student repeatedly tried to burn down the school  i immediately recalled . . . . . . . . . at . . . . . . . . . . high . a classic line inspector i  m here to sack one of your teachers . student welcome to bromwell high . i expect that many adults of my age think that bromwell high is far fetched . what a pity that it isn  t   '

In [15]:
labels[0]

'POSITIVE'

# Lesson: Develop a Predictive Theory

In [16]:
print("labels.txt \t : \t reviews.txt\n")
pretty_print_review_and_label(2137)
pretty_print_review_and_label(12816)
pretty_print_review_and_label(6267)
pretty_print_review_and_label(21934)
pretty_print_review_and_label(5297)
pretty_print_review_and_label(4998)

labels.txt 	 : 	 reviews.txt

NEGATIVE	:	this movie is terrible but it has some good effects .  ...
POSITIVE	:	adrian pasdar is excellent is this film . he makes a fascinating woman .  ...
NEGATIVE	:	comment this movie is impossible . is terrible  very improbable  bad interpretat...
POSITIVE	:	excellent episode movie ala pulp fiction .  days   suicides . it doesnt get more...
NEGATIVE	:	if you haven  t seen this  it  s terrible . it is pure trash . i saw this about ...
POSITIVE	:	this schiffer guy is a real genius  the movie is of excellent quality and both e...
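
The sample above suggests a theory: individual words such as "terrible" and "excellent" may correlate with the label. A rough way to probe that is to count how many reviews of each label contain a given word. The sketch below does this on shortened versions of the six snippets printed above rather than on the full dataset. The helper name `label_counts_for_word` is an assumption for illustration.

```python
from collections import Counter

# The six snippets printed above (shortened), paired with their labels
sample = [
    ("NEGATIVE", "this movie is terrible but it has some good effects ."),
    ("POSITIVE", "adrian pasdar is excellent is this film ."),
    ("NEGATIVE", "comment this movie is impossible . is terrible  very improbable"),
    ("POSITIVE", "excellent episode movie ala pulp fiction ."),
    ("NEGATIVE", "if you haven  t seen this  it  s terrible . it is pure trash ."),
    ("POSITIVE", "this schiffer guy is a real genius  the movie is of excellent quality"),
]

def label_counts_for_word(word, labeled_reviews):
    """Count, per label, the reviews that contain `word` as a whole token."""
    counts = Counter()
    for label, text in labeled_reviews:
        if word in text.split():
            counts[label] += 1
    return counts

terrible = label_counts_for_word("terrible", sample)
excellent = label_counts_for_word("excellent", sample)
print(terrible)   # Counter({'NEGATIVE': 3})
print(excellent)  # Counter({'POSITIVE': 3})
```

Six hand-picked reviews prove nothing on their own; they only motivate running the same kind of count over all 25,000 reviews.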


# Project 1: Quick Theory Validation

In [17]:
# Import Counter for word counts and NumPy for the log transform
from collections import Counter
import numpy as np

In [18]:
# Initialize counters for words in positive reviews, negative reviews, and all reviews
positive_counts=Counter()
negative_counts=Counter()
total_counts=Counter()

In [19]:
# For each review, split it into words and add each word to the matching counters
for i in range(len(reviews)):
    words = reviews[i].split(' ')
    if labels[i] == 'POSITIVE':
        for word in words:
            positive_counts[word] += 1
            total_counts[word] += 1
    else:
        for word in words:
            negative_counts[word] += 1
            total_counts[word] += 1
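
Two Python behaviors are worth knowing when reading the counting loop. A `Counter` returns 0 for keys it has not seen, which is why `+= 1` works without any initialization. And `split(' ')` produces an empty string wherever the text contains consecutive spaces, so the empty string is counted as a word too. A minimal sketch on a made-up string:

```python
from collections import Counter

text = "what a pity  it isn  t"  # made-up snippet with double spaces

counts = Counter()
for word in text.split(' '):
    counts[word] += 1  # missing keys start at 0

print(text.split(' '))   # ['what', 'a', 'pity', '', 'it', 'isn', '', 't']
print(counts[''])        # 2
print(counts['absent'])  # 0
print(text.split())      # ['what', 'a', 'pity', 'it', 'isn', 't']
```

Whether to drop the empty token is a judgment call; calling `split()` with no argument would skip it.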

In [None]:
# See the most common words in positive reviews
positive_counts.most_common()

In [42]:
# Initialize counter for positive-to-negative ratios
pos_neg_ratios = Counter()

# Calculate the ratio for terms that appear more than 250 times
for term, count in total_counts.most_common():
    if count > 250:
        # Adding 0.01 avoids division by zero for terms never seen in negative reviews
        pos_neg_ratio = positive_counts[term] / float(negative_counts[term] + 0.01)
        pos_neg_ratios[term] = pos_neg_ratio

# Replace each ratio with its log so positive and negative terms fall on either side of zero
for word, ratio in pos_neg_ratios.most_common():
    if ratio > 1:
        pos_neg_ratios[word] = np.log(ratio)
    else:
        pos_neg_ratios[word] = -np.log(1 / (ratio + 0.01))
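
To see why the raw ratio is replaced by its log, consider one word that is four times as common in positive reviews and another that is four times as common in negative reviews. Their raw ratios, about 4 and 0.25, sit at very different distances from the neutral value 1, while their logs are close to mirror images around 0. The sketch below reproduces the formula from the cell above in a standalone helper. The name `log_ratio` and the counts are made up for illustration, and `math.log` stands in for `np.log`, which gives the same value for scalars.

```python
import math

def log_ratio(pos_count, neg_count):
    """Mirror the two-step calculation above for a single term."""
    ratio = pos_count / float(neg_count + 0.01)
    if ratio > 1:
        return math.log(ratio)
    return -math.log(1 / (ratio + 0.01))

mostly_positive = log_ratio(400, 100)   # raw ratio is about 4
mostly_negative = log_ratio(100, 400)   # raw ratio is about 0.25
neutral = log_ratio(100, 100)           # raw ratio is about 1

print(round(mostly_positive, 2))  # 1.39
print(round(mostly_negative, 2))  # -1.35
print(round(neutral, 2))          # 0.01
```

The extra `+ 0.01` in the second branch means the two sides are not exact mirror images, as 1.39 versus -1.35 shows.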

In [41]:
# Look at the 30 terms with the highest (most positive) scores
pos_neg_ratios.most_common()[0:30]

[('superb', 1.7188095864319033),
 ('wonderful', 1.5679980330422756),
 ('fantastic', 1.5116943876671398),
 ('excellent', 1.4673020817037794),
 ('amazing', 1.3957529414465939),
 ('powerful', 1.307437728560479),
 ('favorite', 1.2705554553161489),
 ('perfect', 1.2495194421482667),
 ('brilliant', 1.2324152392966217),
 ('perfectly', 1.2039047794287976),
 ('loved', 1.1592650847544808),
 ('tony', 1.1475750118055885),
 ('highly', 1.1455883702337604),
 ('today', 1.1082418376439529),
 ('unique', 1.0943477107103357),
 ('beauty', 1.0562507391303571),
 ('greatest', 1.0299327631723469),
 ('portrayal', 1.026341373382395),
 ('incredible', 1.0127456344202339),
 ('sweet', 0.99606868642842739),
 ('oscar', 0.9914587953685331),
 ('solid', 0.98257941753634903),
 ('beautiful', 0.97492268057293063),
 ('heart', 0.95506733093911411),
 ('masterpiece', 0.9473228411218837),
 ('season', 0.90270166472248992),
 ('great', 0.88847963729980228),
 ('enjoyed', 0.87339580204542233),
 ('moving', 0.85955601889385091),
 ('memo

In [43]:
# Look at the 30 terms with the lowest (most negative) scores
list(reversed(pos_neg_ratios.most_common()))[0:30]

[('unfunny', -2.551081323300588),
 ('waste', -2.4901107035690542),
 ('pointless', -2.3432023637708057),
 ('redeeming', -2.2637819801168857),
 ('worst', -2.1927222863938374),
 ('laughable', -2.1701196702562586),
 ('awful', -2.1379259835272206),
 ('poorly', -2.1312068263472965),
 ('sucks', -1.9129360748418687),
 ('lame', -1.9103086224213506),
 ('horrible', -1.8440081266710808),
 ('pathetic', -1.8333726048530989),
 ('wasted', -1.7753922107877484),
 ('crap', -1.7667231669916605),
 ('badly', -1.6958226264381993),
 ('worse', -1.6812588730229823),
 ('terrible', -1.6736094774265753),
 ('mess', -1.63557930602076),
 ('garbage', -1.63187664955076),
 ('stupid', -1.6035794476659104),
 ('dull', -1.5356553207540313),
 ('avoid', -1.5271122025774353),
 ('wooden', -1.519672343607344),
 ('whatsoever', -1.4637561497227671),
 ('ridiculous', -1.4631055014652738),
 ('excuse', -1.4628822116454949),
 ('rubbish', -1.4601030409425524),
 ('boring', -1.4468814838931428),
 ('dumb', -1.38109247913149),
 ('bother', -