# Sentiment Classification & How To "Frame Problems" for a Neural Network

by Andrew Trask

- **Twitter**: @iamtrask
- **Blog**: http://iamtrask.github.io

### What You Should Already Know

- neural networks, forward and back-propagation
- stochastic gradient descent
- mean squared error
- and train/test splits

### Where to Get Help if You Need it
- Re-watch previous Udacity Lectures
- Leverage the recommended Course Reading Material - [Grokking Deep Learning](https://www.manning.com/books/grokking-deep-learning) (40% Off: **traskud17**)
- Shoot me a tweet @iamtrask


### Tutorial Outline:

- Intro: The Importance of "Framing a Problem"


- Curate a Dataset
- Developing a "Predictive Theory"
- **PROJECT 1**: Quick Theory Validation


- Transforming Text to Numbers
- **PROJECT 2**: Creating the Input/Output Data


- Putting it all together in a Neural Network
- **PROJECT 3**: Building our Neural Network


- Understanding Neural Noise
- **PROJECT 4**: Making Learning Faster by Reducing Noise


- Analyzing Inefficiencies in our Network
- **PROJECT 5**: Making our Network Train and Run Faster


- Further Noise Reduction
- **PROJECT 6**: Reducing Noise by Strategically Reducing the Vocabulary


- Analysis: What's going on in the weights?

# Lesson: Curate a Dataset

In [5]:
def pretty_print_review_and_label(i):
    print(labels[i] + "\t:\t" + reviews[i][:80] + "...")

g = open('reviews.txt','r') # What we know!
reviews = list(map(lambda x:x[:-1],g.readlines()))
g.close()

g = open('labels.txt','r') # What we WANT to know!
labels = list(map(lambda x:x[:-1].upper(),g.readlines()))
g.close()

In [6]:
len(reviews)

25000

In [7]:
reviews[0]

'bromwell high is a cartoon comedy . it ran at the same time as some other programs about school life  such as  teachers  . my   years in the teaching profession lead me to believe that bromwell high  s satire is much closer to reality than is  teachers  . the scramble to survive financially  the insightful students who can see right through their pathetic teachers  pomp  the pettiness of the whole situation  all remind me of the schools i knew and their students . when i saw the episode in which a student repeatedly tried to burn down the school  i immediately recalled . . . . . . . . . at . . . . . . . . . . high . a classic line inspector i  m here to sack one of your teachers . student welcome to bromwell high . i expect that many adults of my age think that bromwell high is far fetched . what a pity that it isn  t   '

In [8]:
labels[0]

'POSITIVE'

# Lesson: Develop a Predictive Theory

In [9]:
print("labels.txt \t : \t reviews.txt\n")
pretty_print_review_and_label(2137)
pretty_print_review_and_label(12816)
pretty_print_review_and_label(6267)
pretty_print_review_and_label(21934)
pretty_print_review_and_label(5297)
pretty_print_review_and_label(4998)

labels.txt 	 : 	 reviews.txt

NEGATIVE	:	this movie is terrible but it has some good effects .  ...
POSITIVE	:	adrian pasdar is excellent is this film . he makes a fascinating woman .  ...
NEGATIVE	:	comment this movie is impossible . is terrible  very improbable  bad interpretat...
POSITIVE	:	excellent episode movie ala pulp fiction .  days   suicides . it doesnt get more...
NEGATIVE	:	if you haven  t seen this  it  s terrible . it is pure trash . i saw this about ...
POSITIVE	:	this schiffer guy is a real genius  the movie is of excellent quality and both e...


In [26]:
words_list = [] #List of all words

sentiment_list = [] #List of sentiment lable attached to each instance of a word 
                    #(depending on what review is comes from)

unique_words = [] #List of unique words

search_length = 100  
#Number of reviews to include, for speed of testing (starting from beginning, may want to randomize)

for i in range (search_length):
    _ = reviews[i].split()
    for j in _:
        words_list.append(j)
        sentiment_list.append(labels[i])
print("Words:",len(words_list))
unique_words = list(set(words_list))
print("Unique words:",len(unique_words))

Words: 28400
Unique words: 4449


In [51]:
counts_list = []  #List of the counts of each word
goods_list = []
bads_list = []
for u in range(len(unique_words)):
    counts_list.append(words_list.count(unique_words[u]))
    goods = 0
    bads = 0
    for k in range(len(words_list)):
        if words_list[k] == unique_word:
            if sentiment_list[k] == "POSITIVE":
                goods += 1
            else:
                bads += 1
    goods_list.append(goods)
    bads_list.append(bads)
    
    

In [54]:
import numpy as np
data = np.array([unique_words,counts_list,goods_list,bads_list])

In [58]:
import pandas as pd
df = pd.DataFrame(data.T, columns=["Word","Count","Positives","Negatives"])
df = df.sort_values("Count",ascending=False)
print(df)

               Word Count Positives Negatives
89               by    99         1         2
2876           from    96         1         2
3949          story    96         1         2
578              or    90         1         2
4327        titanic    90         1         2
2195        acharya     9         1         2
3991         seemed     9         1         2
2228      beautiful     9         1         2
1287     understand     9         1         2
2217       favorite     9         1         2
3295          stars     9         1         2
3910        believe     9         1         2
1961            fan     9         1         2
3993            car     9         1         2
2229        krishna     9         1         2
3260       suspense     9         1         2
1438         anyone     9         1         2
786          during     9         1         2
1749         moving     9         1         2
310      completely     9         1         2
4129          found     9         