# Sentiment Classification & How To "Frame Problems" for a Neural Network

by Andrew Trask

- **Twitter**: @iamtrask
- **Blog**: http://iamtrask.github.io

### What You Should Already Know

- neural networks, forward and back-propagation
- stochastic gradient descent
- mean squared error
- train/test splits

### Where to Get Help if You Need it
- Re-watch previous Udacity lectures
- Use the recommended course reading material: [Grokking Deep Learning](https://www.manning.com/books/grokking-deep-learning) (40% Off: **traskud17**)
- Shoot me a tweet @iamtrask


### Tutorial Outline:

- Intro: The Importance of "Framing a Problem"


- Curate a Dataset
- Developing a "Predictive Theory"
- **PROJECT 1**: Quick Theory Validation


- Transforming Text to Numbers
- **PROJECT 2**: Creating the Input/Output Data


- Putting it all together in a Neural Network
- **PROJECT 3**: Building our Neural Network


- Understanding Neural Noise
- **PROJECT 4**: Making Learning Faster by Reducing Noise


- Analyzing Inefficiencies in our Network
- **PROJECT 5**: Making our Network Train and Run Faster


- Further Noise Reduction
- **PROJECT 6**: Reducing Noise by Strategically Reducing the Vocabulary


- Analysis: What's going on in the weights?

# Lesson: Curate a Dataset

In [2]:
def pretty_print_review_and_label(i):
    # Show the label next to the first 80 characters of the review
    print(labels[i] + "\t:\t" + reviews[i][:80] + "...")

# What we know: one review per line
with open('reviews.txt', 'r') as g:
    reviews = [line.rstrip('\n') for line in g]

# What we WANT to know: one label per line, upper-cased
with open('labels.txt', 'r') as g:
    labels = [line.rstrip('\n').upper() for line in g]
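
Before going further, it can help to confirm that the two files line up, i.e. that there is exactly one label per review and that the classes are roughly balanced. The following is a minimal sketch, not part of the original notebook: the helper name `summarize_dataset` and the toy lists `sample_reviews` / `sample_labels` are hypothetical stand-ins for the data loaded above.

```python
from collections import Counter

def summarize_dataset(reviews, labels):
    """Return (number of examples, label -> count) after checking alignment."""
    if len(reviews) != len(labels):
        raise ValueError("reviews and labels must have the same length")
    return len(reviews), Counter(labels)

# Hypothetical toy data standing in for reviews.txt / labels.txt
sample_reviews = ["this movie is terrible", "excellent film", "pure trash"]
sample_labels = ["NEGATIVE", "POSITIVE", "NEGATIVE"]

n_examples, label_counts = summarize_dataset(sample_reviews, sample_labels)
print(n_examples, dict(label_counts))
```

In the notebook itself, the same check would presumably be `summarize_dataset(reviews, labels)`.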

In [3]:
len(reviews)

25000

In [4]:
reviews[0]

'bromwell high is a cartoon comedy . it ran at the same time as some other programs about school life  such as  teachers  . my   years in the teaching profession lead me to believe that bromwell high  s satire is much closer to reality than is  teachers  . the scramble to survive financially  the insightful students who can see right through their pathetic teachers  pomp  the pettiness of the whole situation  all remind me of the schools i knew and their students . when i saw the episode in which a student repeatedly tried to burn down the school  i immediately recalled . . . . . . . . . at . . . . . . . . . . high . a classic line inspector i  m here to sack one of your teachers . student welcome to bromwell high . i expect that many adults of my age think that bromwell high is far fetched . what a pity that it isn  t   '

In [5]:
labels[0]

'POSITIVE'

# Lesson: Develop a Predictive Theory

In [6]:
print("labels.txt \t : \t reviews.txt\n")
pretty_print_review_and_label(2137)
pretty_print_review_and_label(12816)
pretty_print_review_and_label(6267)
pretty_print_review_and_label(21934)
pretty_print_review_and_label(5297)
pretty_print_review_and_label(4998)

labels.txt 	 : 	 reviews.txt

NEGATIVE	:	this movie is terrible but it has some good effects .  ...
POSITIVE	:	adrian pasdar is excellent is this film . he makes a fascinating woman .  ...
NEGATIVE	:	comment this movie is impossible . is terrible  very improbable  bad interpretat...
POSITIVE	:	excellent episode movie ala pulp fiction .  days   suicides . it doesnt get more...
NEGATIVE	:	if you haven  t seen this  it  s terrible . it is pure trash . i saw this about ...
POSITIVE	:	this schiffer guy is a real genius  the movie is of excellent quality and both e...


# Project 1: Quick Theory Validation

In [55]:
# Build one (word, label) pair per token; periods are treated as separators
label_count = []
for label, review in zip(labels, reviews):
    label_count.extend((word, label) for word in review.replace('.', ' ').split())

In [26]:
label_count

[('bromwell', 'POSITIVE'),
 ('high', 'POSITIVE'),
 ('is', 'POSITIVE'),
 ('a', 'POSITIVE'),
 ('cartoon', 'POSITIVE'),
 ('comedy', 'POSITIVE'),
 ('it', 'POSITIVE'),
 ('ran', 'POSITIVE'),
 ('at', 'POSITIVE'),
 ('the', 'POSITIVE'),
 ('same', 'POSITIVE'),
 ('time', 'POSITIVE'),
 ('as', 'POSITIVE'),
 ('some', 'POSITIVE'),
 ('other', 'POSITIVE'),
 ('programs', 'POSITIVE'),
 ('about', 'POSITIVE'),
 ('school', 'POSITIVE'),
 ('life', 'POSITIVE'),
 ('such', 'POSITIVE'),
 ('as', 'POSITIVE'),
 ('teachers', 'POSITIVE'),
 ('my', 'POSITIVE'),
 ('years', 'POSITIVE'),
 ('in', 'POSITIVE'),
 ('the', 'POSITIVE'),
 ('teaching', 'POSITIVE'),
 ('profession', 'POSITIVE'),
 ('lead', 'POSITIVE'),
 ('me', 'POSITIVE'),
 ('to', 'POSITIVE'),
 ('believe', 'POSITIVE'),
 ('that', 'POSITIVE'),
 ('bromwell', 'POSITIVE'),
 ('high', 'POSITIVE'),
 ('s', 'POSITIVE'),
 ('satire', 'POSITIVE'),
 ('is', 'POSITIVE'),
 ('much', 'POSITIVE'),
 ('closer', 'POSITIVE'),
 ('to', 'POSITIVE'),
 ('reality', 'POSITIVE'),
 ('than', 'POSITIVE'),
 ...]

In [80]:
total_counter = {'POSITIVE': {}, 'NEGATIVE': {}, 'ALL': {}}

In [81]:
for word1, label1 in label_count:
    total_counter[label1][word1] = total_counter[label1].get(word1, 0) + 1
    total_counter['ALL'][word1] = total_counter['ALL'].get(word1, 0) + 1
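
The same per-label tallies can also be sketched with `collections.Counter`, which skips the intermediate list of `(word, label)` pairs. This is an alternative formulation under the same tokenization (periods replaced by spaces, then a whitespace split), not the method used in the cells above; `count_words_by_label` and the toy data are hypothetical names introduced for illustration.

```python
from collections import Counter

def count_words_by_label(reviews, labels):
    """Count word occurrences per label and overall."""
    counts = {"POSITIVE": Counter(), "NEGATIVE": Counter(), "ALL": Counter()}
    for review, label in zip(reviews, labels):
        # Same tokenization as above: periods become separators
        words = review.replace(".", " ").split()
        counts[label].update(words)
        counts["ALL"].update(words)
    return counts

# Hypothetical toy data
toy_reviews = ["excellent film . excellent cast", "terrible film"]
toy_labels = ["POSITIVE", "NEGATIVE"]

toy_counts = count_words_by_label(toy_reviews, toy_labels)
print(toy_counts["POSITIVE"].most_common(2))
```

A `Counter` returns 0 for missing words, so lookups such as `toy_counts["NEGATIVE"]["excellent"]` need no `.get(word, 0)` fallback.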

In [68]:
words_neg = sorted(total_counter['NEGATIVE'].items(), key=lambda x: x[1], reverse=True)

In [69]:
words_pos = sorted(total_counter['POSITIVE'].items(), key=lambda x: x[1], reverse=True)

In [76]:
words_pos[:10]

[('the', 173324),
 ('and', 89722),
 ('a', 83688),
 ('of', 76855),
 ('to', 66746),
 ('is', 57245),
 ('in', 50215),
 ('br', 49235),
 ('it', 48025),
 ('i', 40743)]

The approach below was written with help from the course solutions. My own plan was broadly the same, except for the use of logarithmic ratios.

In [87]:
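# Positive-to-negative count ratio for words seen more than 100 times overall.
# A word that never appears in a negative review is divided by 1 to avoid division by zero.
pos_neg_ratio = {}
for word2, count2 in total_counter['ALL'].items():
    if count2 > 100:
        pos_neg_ratio[word2] = total_counter['POSITIVE'].get(word2, 0) / float(total_counter['NEGATIVE'].get(word2, 1))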

In [91]:
import numpy as np

In [92]:
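# Log-scale the ratios so positive- and negative-leaning words are comparable in magnitude
pos_neg_ratio_log = {}
for word3, ratio3 in pos_neg_ratio.items():
    if ratio3 > 1:
        pos_neg_ratio_log[word3] = np.log(ratio3)
    else:
        # Equivalent to np.log(ratio3 + 0.01); the 0.01 offset avoids log(0)
        pos_neg_ratio_log[word3] = -np.log((1 / (ratio3+0.01)))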
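
Why take the log at all? A raw ratio squeezes every negative-leaning word into the interval (0, 1), while positive-leaning words range from 1 upward without bound. Taking the log puts both sides on a comparable scale and places neutral words near 0. The sketch below uses `math.log` and a hypothetical helper `log_ratio` that mirrors the branch logic of the cell above, including its 0.01 offset; it is an illustration, not part of the original notebook.

```python
import math

def log_ratio(ratio):
    """Mirror the notebook's branching: plain log above 1, offset log otherwise."""
    if ratio > 1:
        return math.log(ratio)
    # -log(1 / x) equals log(x); the 0.01 offset keeps ratio == 0 finite
    return -math.log(1 / (ratio + 0.01))

# A word 4x more common in positive reviews, a neutral word,
# and a word 4x more common in negative reviews
for r in (4.0, 1.0, 0.25):
    print(r, round(log_ratio(r), 3))
```

Because of the offset, the scores are only approximately symmetric: a ratio of 4 and a ratio of 0.25 give values of similar magnitude but not exact opposites.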

In [95]:
# Most positive terms
sorted(pos_neg_ratio_log.items(), key=lambda x: x[1], reverse=True)

[('paulie', 4.7706846244656651),
 ('edie', 4.6913478822291435),
 ('felix', 3.3758795736778655),
 ('polanski', 3.0056826044071592),
 ('matthau', 2.924504764265623),
 ('victoria', 2.7500144002012421),
 ('mildred', 2.7362210780689065),
 ('gandhi', 2.6567569067146595),
 ('flawless', 2.5563656137701454),
 ('superbly', 2.3470368555648795),
 ('perfection', 2.2284771208403238),
 ('astaire', 2.2141741356499924),
 ('voight', 2.1102132003465894),
 ('captures', 2.0794415416798357),
 ('wonderfully', 2.0485643031153966),
 ('brosnan', 2.0193376176101303),
 ('powell', 2.0175661379617482),
 ('lily', 1.9810014688665833),
 ('bakshi', 1.9636097261547143),
 ('lincoln', 1.9459101490553132),
 ('lemmon', 1.9148195619852821),
 ('breathtaking', 1.8925641683500207),
 ('refreshing', 1.891548939836426),
 ('bourne', 1.8870696490323797),
 ('flynn', 1.8484548129046001),
 ('homer', 1.8382794848629478),
 ('soccer', 1.8268507890393251),
 ('delightful', 1.8262456452992242),
 ('andrews', 1.8230120127321594),
 ('elvira', 1

In [96]:
# Most negative terms
sorted(pos_neg_ratio_log.items(), key=lambda x: x[1])

[('boll', -4.0749533729074505),
 ('uwe', -3.9169857947702749),
 ('seagal', -3.3155026605572724),
 ('unwatchable', -3.019309004117988),
 ('stinker', -2.9795375876340104),
 ('mst', -2.7704856720430024),
 ('incoherent', -2.7577377143337944),
 ('unfunny', -2.5510464522925456),
 ('waste', -2.4901042280827608),
 ('blah', -2.4427317247372873),
 ('horrid', -2.3632681277314354),
 ('pointless', -2.3431830982648481),
 ('atrocious', -2.3137583935921708),
 ('redeeming', -2.263751651516682),
 ('prom', -2.2515299221937619),
 ('drivel', -2.2396926627631433),
 ('lousy', -2.2072749131897211),
 ('worst', -2.1927186154155134),
 ('laughable', -2.1700959099479666),
 ('awful', -2.1379201056914394),
 ('poorly', -2.1311926067322813),
 ('wasting', -2.111046881095167),
 ('remotely', -2.1056647367789383),
 ('existent', -1.9960263668705585),
 ('boredom', -1.9166645809588516),
 ('miserably', -1.9131499212248517),
 ('sucks', -1.9128983318073884),
 ('lame', -1.9102943211698218),
 ('uninspired', -1.9045549633733994),
