# Sentiment Classification & How To "Frame Problems" for a Neural Network

by Andrew Trask

- **Twitter**: @iamtrask
- **Blog**: http://iamtrask.github.io

### What You Should Already Know

- neural networks, forward and back-propagation
- stochastic gradient descent
- mean squared error
- and train/test splits

### Where to Get Help if You Need it
- Re-watch previous Udacity Lectures
- Leverage the recommended Course Reading Material - [Grokking Deep Learning](https://www.manning.com/books/grokking-deep-learning) (40% Off: **traskud17**)
- Shoot me a tweet @iamtrask


### Tutorial Outline:

- Intro: The Importance of "Framing a Problem"


- Curate a Dataset
- Developing a "Predictive Theory"
- **PROJECT 1**: Quick Theory Validation


- Transforming Text to Numbers
- **PROJECT 2**: Creating the Input/Output Data


- Putting it all together in a Neural Network
- **PROJECT 3**: Building our Neural Network


- Understanding Neural Noise
- **PROJECT 4**: Making Learning Faster by Reducing Noise


- Analyzing Inefficiencies in our Network
- **PROJECT 5**: Making our Network Train and Run Faster


- Further Noise Reduction
- **PROJECT 6**: Reducing Noise by Strategically Reducing the Vocabulary


- Analysis: What's going on in the weights?

# Lesson: Curate a Dataset

In [1]:
def pretty_print_review_and_label(i):
    print(labels[i] + "\t:\t" + reviews[i][:80] + "...")

g = open('reviews.txt','r') # What we know!
reviews = list(map(lambda x:x[:-1],g.readlines()))
g.close()

g = open('labels.txt','r') # What we WANT to know!
labels = list(map(lambda x:x[:-1].upper(),g.readlines()))
g.close()

In [2]:
len(reviews)

25000

In [3]:
reviews[0]

'bromwell high is a cartoon comedy . it ran at the same time as some other programs about school life  such as  teachers  . my   years in the teaching profession lead me to believe that bromwell high  s satire is much closer to reality than is  teachers  . the scramble to survive financially  the insightful students who can see right through their pathetic teachers  pomp  the pettiness of the whole situation  all remind me of the schools i knew and their students . when i saw the episode in which a student repeatedly tried to burn down the school  i immediately recalled . . . . . . . . . at . . . . . . . . . . high . a classic line inspector i  m here to sack one of your teachers . student welcome to bromwell high . i expect that many adults of my age think that bromwell high is far fetched . what a pity that it isn  t   '

In [4]:
labels[0]

'POSITIVE'

# Lesson: Develop a Predictive Theory

In [5]:
print("labels.txt \t : \t reviews.txt\n")
pretty_print_review_and_label(2137)
pretty_print_review_and_label(12816)
pretty_print_review_and_label(6267)
pretty_print_review_and_label(21934)
pretty_print_review_and_label(5297)
pretty_print_review_and_label(4998)

labels.txt 	 : 	 reviews.txt

NEGATIVE	:	this movie is terrible but it has some good effects .  ...
POSITIVE	:	adrian pasdar is excellent is this film . he makes a fascinating woman .  ...
NEGATIVE	:	comment this movie is impossible . is terrible  very improbable  bad interpretat...
POSITIVE	:	excellent episode movie ala pulp fiction .  days   suicides . it doesnt get more...
NEGATIVE	:	if you haven  t seen this  it  s terrible . it is pure trash . i saw this about ...
POSITIVE	:	this schiffer guy is a real genius  the movie is of excellent quality and both e...


In [6]:
from collections import Counter
import numpy as np

In [7]:
positive_counter = Counter()
negative_counter = Counter()
total_counter = Counter()

In [10]:
for i in range(len(reviews)):
    if labels[i] == 'POSITIVE':
        for wrd in reviews[i].split(" "):
            positive_counter[wrd] += 1
            total_counter[wrd] += 1
    else:
        for wrd in reviews[i].split(" "):
            negative_counter[wrd] += 1
            total_counter[wrd] += 1
        

In [11]:
positive_counter.most_common()


[('', 550468),
 ('the', 173324),
 ('.', 159654),
 ('and', 89722),
 ('a', 83688),
 ('of', 76855),
 ('to', 66746),
 ('is', 57245),
 ('in', 50215),
 ('br', 49235),
 ('it', 48025),
 ('i', 40743),
 ('that', 35630),
 ('this', 35080),
 ('s', 33815),
 ('as', 26308),
 ('with', 23247),
 ('for', 22416),
 ('was', 21917),
 ('film', 20937),
 ('but', 20822),
 ('movie', 19074),
 ('his', 17227),
 ('on', 17008),
 ('you', 16681),
 ('he', 16282),
 ('are', 14807),
 ('not', 14272),
 ('t', 13720),
 ('one', 13655),
 ('have', 12587),
 ('be', 12416),
 ('by', 11997),
 ('all', 11942),
 ('who', 11464),
 ('an', 11294),
 ('at', 11234),
 ('from', 10767),
 ('her', 10474),
 ('they', 9895),
 ('has', 9186),
 ('so', 9154),
 ('like', 9038),
 ('about', 8313),
 ('very', 8305),
 ('out', 8134),
 ('there', 8057),
 ('she', 7779),
 ('what', 7737),
 ('or', 7732),
 ('good', 7720),
 ('more', 7521),
 ('when', 7456),
 ('some', 7441),
 ('if', 7285),
 ('just', 7152),
 ('can', 7001),
 ('story', 6780),
 ('time', 6515),
 ('my', 6488),
 ('g

In [21]:
pos_neg_ratios = Counter()

for term,cnt in list(total_counter.most_common()):
    if cnt >200:
        pos_neg_ratio = positive_counter[term]/float(negative_counter[term]+1)
        pos_neg_ratios[term] = pos_neg_ratio
for word, ratio in pos_neg_ratios.most_common():
    if ratio>1:
        pos_neg_ratios[term] = np.log(ratio)
    else:
        pos_neg_ratios[term] = -np.log(1/ratio+ 0.001)
    

In [22]:
pos_neg_ratios.most_common()

[('victoria', 14.6),
 ('captures', 7.68),
 ('wonderfully', 7.552631578947368),
 ('powell', 7.230769230769231),
 ('refreshing', 6.392857142857143),
 ('delightful', 6.051282051282051),
 ('beautifully', 5.828125),
 ('underrated', 5.583333333333333),
 ('superb', 5.524271844660194),
 ('welles', 5.3),
 ('sinatra', 5.15),
 ('touching', 5.140845070422535),
 ('stewart', 5.012820512820513),
 ('brilliantly', 4.928571428571429),
 ('friendship', 4.795918367346939),
 ('wonderful', 4.780487804878049),
 ('magnificent', 4.695652173913044),
 ('finest', 4.6938775510204085),
 ('jackie', 4.682926829268292),
 ('freedom', 4.5227272727272725),
 ('fantastic', 4.503448275862069),
 ('terrific', 4.493670886075949),
 ('noir', 4.454545454545454),
 ('outstanding', 4.441558441558442),
 ('nancy', 4.428571428571429),
 ('marie', 4.404255319148936),
 ('excellent', 4.326478149100257),
 ('chan', 4.15),
 ('gem', 4.027777777777778),
 ('amazing', 4.022813688212928),
 ('kelly', 3.842696629213483),
 ('powerful', 3.6691729323308

In [26]:
list(reversed(pos_neg_ratios.most_common()))[:30]

[('hued', -2.6923073218657225),
 ('unfunny', 0.06772908366533864),
 ('waste', 0.0728476821192053),
 ('pointless', 0.08583690987124463),
 ('redeeming', 0.09364548494983277),
 ('lousy', 0.09950248756218906),
 ('worst', 0.10157194679564692),
 ('laughable', 0.1038961038961039),
 ('awful', 0.10783055198973042),
 ('poorly', 0.10852713178294573),
 ('sucks', 0.13709677419354838),
 ('lame', 0.13782542113323124),
 ('insult', 0.13829787234042554),
 ('horrible', 0.14804202483285578),
 ('amateurish', 0.14814814814814814),
 ('pathetic', 0.14950980392156862),
 ('wasted', 0.1590909090909091),
 ('crap', 0.16071428571428573),
 ('tedious', 0.16489361702127658),
 ('dreadful', 0.16990291262135923),
 ('badly', 0.17314487632508835),
 ('worse', 0.176),
 ('terrible', 0.17744252873563218),
 ('embarrassing', 0.18229166666666666),
 ('mess', 0.18450184501845018),
 ('garbage', 0.18508997429305912),
 ('pile', 0.18857142857142858),
 ('stupid', 0.19104268719384185),
 ('vampires', 0.19806763285024154),
 ('dull', 0.2050

### 