# Sentiment Classification & How To "Frame Problems" for a Neural Network

by Andrew Trask

- **Twitter**: @iamtrask
- **Blog**: http://iamtrask.github.io

### What You Should Already Know

- neural networks, forward and back-propagation
- stochastic gradient descent
- mean squared error
- train/test splits

### Where to Get Help if You Need It
- Re-watch previous Udacity lectures
- Consult the recommended course reading material: [Grokking Deep Learning](https://www.manning.com/books/grokking-deep-learning) (40% Off: **traskud17**)
- Shoot me a tweet @iamtrask


### Tutorial Outline:

- Intro: The Importance of "Framing a Problem"


- Curate a Dataset
- Developing a "Predictive Theory"
- **PROJECT 1**: Quick Theory Validation


- Transforming Text to Numbers
- **PROJECT 2**: Creating the Input/Output Data


- Putting it all together in a Neural Network
- **PROJECT 3**: Building our Neural Network


- Understanding Neural Noise
- **PROJECT 4**: Making Learning Faster by Reducing Noise


- Analyzing Inefficiencies in our Network
- **PROJECT 5**: Making our Network Train and Run Faster


- Further Noise Reduction
- **PROJECT 6**: Reducing Noise by Strategically Reducing the Vocabulary


- Analysis: What's going on in the weights?

# Lesson: Curate a Dataset

In [19]:
def pretty_print_review_and_label(i):
    # Show the label next to the first 80 characters of its review
    print(labels[i] + "\t:\t" + reviews[i][:80] + "...")

# What we know! One review per line (the slice drops the trailing newline)
with open('reviews.txt', 'r') as g:
    reviews = [line[:-1] for line in g.readlines()]

# What we WANT to know! One label per line, upper-cased
with open('labels.txt', 'r') as g:
    labels = [line[:-1].upper() for line in g.readlines()]

In [20]:
len(reviews)

25000

In [21]:
reviews[0]

'bromwell high is a cartoon comedy . it ran at the same time as some other programs about school life  such as  teachers  . my   years in the teaching profession lead me to believe that bromwell high  s satire is much closer to reality than is  teachers  . the scramble to survive financially  the insightful students who can see right through their pathetic teachers  pomp  the pettiness of the whole situation  all remind me of the schools i knew and their students . when i saw the episode in which a student repeatedly tried to burn down the school  i immediately recalled . . . . . . . . . at . . . . . . . . . . high . a classic line inspector i  m here to sack one of your teachers . student welcome to bromwell high . i expect that many adults of my age think that bromwell high is far fetched . what a pity that it isn  t   '

In [5]:
labels[0]

'POSITIVE'
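
Before building anything on top of these two files, it can help to confirm that they line up: one label per review, and only the label values we expect. The check below is a minimal sketch on a toy stand-in for `reviews` and `labels`; the helper name `check_alignment` and the toy data are assumptions for illustration, not part of the original notebook.

```python
from collections import Counter

def check_alignment(reviews, labels):
    """Return label counts after verifying that each review has exactly one label."""
    if len(reviews) != len(labels):
        raise ValueError("got %d reviews but %d labels" % (len(reviews), len(labels)))
    return Counter(labels)

# Toy stand-in data (assumption: the real lists come from reviews.txt and labels.txt)
toy_reviews = ["a classic comedy", "pure trash", "excellent episode"]
toy_labels = ["POSITIVE", "NEGATIVE", "POSITIVE"]

label_counts = check_alignment(toy_reviews, toy_labels)
print(label_counts)
```

Calling `check_alignment(reviews, labels)` on the real lists would show how balanced the two classes are, which matters later when judging accuracy.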

# Lesson: Develop a Predictive Theory

In [22]:
print("labels.txt \t : \t reviews.txt\n")
pretty_print_review_and_label(2137)
pretty_print_review_and_label(12816)
pretty_print_review_and_label(6267)
pretty_print_review_and_label(21934)
pretty_print_review_and_label(5297)
pretty_print_review_and_label(4998)

labels.txt 	 : 	 reviews.txt

NEGATIVE	:	this movie is terrible but it has some good effects .  ...
POSITIVE	:	adrian pasdar is excellent is this film . he makes a fascinating woman .  ...
NEGATIVE	:	comment this movie is impossible . is terrible  very improbable  bad interpretat...
POSITIVE	:	excellent episode movie ala pulp fiction .  days   suicides . it doesnt get more...
NEGATIVE	:	if you haven  t seen this  it  s terrible . it is pure trash . i saw this about ...
POSITIVE	:	this schiffer guy is a real genius  the movie is of excellent quality and both e...
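
The six examples above hint at a theory: individual words such as "terrible" and "excellent" seem to track the label. One rough way to probe that for a single word is to count how many reviews of each label contain it. This is only a sketch on toy snippets adapted from the printout; `label_counts_for_word` is a hypothetical helper, not something defined in the notebook.

```python
from collections import Counter

def label_counts_for_word(word, reviews, labels):
    """Count, per label, the reviews that contain `word` as a space-separated token."""
    counts = Counter()
    for review, label in zip(reviews, labels):
        if word in review.split(" "):
            counts[label] += 1
    return counts

# Toy stand-in data (assumption: shortened versions of the reviews printed above)
toy_reviews = [
    "this movie is terrible but it has some good effects .",
    "adrian pasdar is excellent is this film .",
    "it is terrible . it is pure trash .",
    "the movie is of excellent quality",
]
toy_labels = ["NEGATIVE", "POSITIVE", "NEGATIVE", "POSITIVE"]

terrible_counts = label_counts_for_word("terrible", toy_reviews, toy_labels)
excellent_counts = label_counts_for_word("excellent", toy_reviews, toy_labels)
print(terrible_counts)
print(excellent_counts)
```

If a word shows up almost exclusively under one label, that is evidence the theory is worth validating across the whole vocabulary.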


In [23]:
from collections import Counter
import numpy as np

In [24]:
positive_counts = Counter()
negative_counts = Counter()
total_counts = Counter()

In [27]:
for review, label in zip(reviews, labels):
    # Pick the counter that matches this review's label
    label_counts = positive_counts if label == "POSITIVE" else negative_counts
    for word in review.split(" "):
        label_counts[word] += 1
        total_counts[word] += 1

In [32]:
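pos_neg_ratios = Counter()

# Positive-to-negative ratio for every word seen more than 100 times
# (the +1 in the denominator avoids division by zero)
for term, cnt in total_counts.most_common():
    if cnt > 100:
        pos_neg_ratios[term] = positive_counts[term] / float(negative_counts[term] + 1)

# Move to a log scale so positive and negative words sit symmetrically around 0
for word, ratio in pos_neg_ratios.most_common():
    if ratio > 1:
        pos_neg_ratios[word] = np.log(ratio)
    else:
        # the 0.01 keeps the log finite when the ratio is 0
        pos_neg_ratios[word] = -np.log(1 / (ratio + 0.01))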
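
Why bother with the log? A raw ratio squeezes every negative-leaning word into the range between 0 and 1, while positive-leaning words spread from 1 upward. On a log scale the two sides land roughly symmetrically around 0. The sketch below shows the intended transform on made-up counts (not taken from the dataset); `log_ratio` is a hypothetical helper, and it uses `math.log` in place of `np.log` since it only handles scalars.

```python
import math

def log_ratio(pos_count, neg_count):
    """Log of the smoothed positive-to-negative ratio for one word."""
    ratio = pos_count / float(neg_count + 1)   # +1 avoids division by zero
    if ratio > 1:
        return math.log(ratio)
    # the 0.01 keeps the log finite when the word never appears in positive reviews
    return -math.log(1 / (ratio + 0.01))

strongly_positive = log_ratio(200, 9)    # raw ratio 20
strongly_negative = log_ratio(10, 199)   # raw ratio 0.05
neutral = log_ratio(100, 99)             # raw ratio 1
never_positive = log_ratio(0, 150)       # raw ratio 0

print(strongly_positive, strongly_negative, neutral, never_positive)
```

A word that never appears in a positive review ends up at `log(0.01)`, roughly -4.605, which is the floor for this score.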
    

In [33]:
pos_neg_ratios.most_common()

[('edie', 4.6914396211402671),
 ('antwone', 4.4774504443857204),
 ('din', 4.4068411910483203),
 ('gunga', 4.1898062457006793),
 ('goldsworthy', 4.1745411042163774),
 ('gypo', 4.0945112150014218),
 ('yokai', 4.0945112150014218),
 ('paulie', 4.0777069210690771),
 ('visconti', 3.932021691934835),
 ('flavia', 3.932021691934835),
 ('blandings', 3.8714093225428488),
 ('kells', 3.8714093225428488),
 ('brashear', 3.8503603450360391),
 ('gino', 3.8288587641673772),
 ('deathtrap', 3.8068846873048412),
 ('harilal', 3.7138159394039678),
 ('panahi', 3.7138159394039678),
 ('ossessione', 3.6638180235185649),
 ('tsui', 3.6378492830011573),
 ('caruso', 3.6378492830011573),
 ('sabu', 3.6111881463980646),
 ('ahmad', 3.6111881463980646),
 ('khouri', 3.5837966776607839),
 ('dominick', 3.5837966776607839),
 ('aweigh', 3.5556337349665741),
 ('mj', 3.5556337349665741),
 ('mcintire', 3.5266545990191038),
 ('kriemhild', 3.5266545990191038),
 ('blackie', 3.4968105458651015),
 ('daisies', 3.4968105458651015),
 ...

In [35]:
list(reversed(pos_neg_ratios.most_common()))[0:30]

[('rosarios', -4.6051701859880918),
 ('frewer', -4.6051701859880918),
 ('manu', -4.6051701859880918),
 ('borel', -4.6051701859880918),
 ('swinton', -4.6051701859880918),
 ('sagemiller', -4.6051701859880918),
 ('summersisle', -4.6051701859880918),
 ('qi', -4.6051701859880918),
 ('redline', -4.6051701859880918),
 ('slipstream', -4.6051701859880918),
 ('bolo', -4.6051701859880918),
 ('emraan', -4.6051701859880918),
 ('geico', -4.6051701859880918),
 ('cato', -4.6051701859880918),
 ('liliom', -4.6051701859880918),
 ('rajni', -4.6051701859880918),
 ('mayeda', -4.6051701859880918),
 ('crapfest', -4.6051701859880918),
 ('tmtm', -4.6051701859880918),
 ('sued', -4.6051701859880918),
 ('keyes', -4.6051701859880918),
 ('nichole', -4.6051701859880918),
 ('straightheads', -4.6051701859880918),
 ('aluminium', -4.6051701859880918),
 ('groaning', -4.6051701859880918),
 ('templars', -4.6051701859880918),
 ('krista', -4.6051701859880918),
 ('spandex', -4.6051701859880918),
 ('unisols', -4.605170185988091