# Sentiment Classification & How To "Frame Problems" for a Neural Network

by Andrew Trask

- **Twitter**: @iamtrask
- **Blog**: http://iamtrask.github.io

### What You Should Already Know

- neural networks, forward and back-propagation
- stochastic gradient descent
- mean squared error
- and train/test splits

### Where to Get Help if You Need It
- Re-watch previous Udacity Lectures
- Leverage the recommended Course Reading Material - [Grokking Deep Learning](https://www.manning.com/books/grokking-deep-learning) (40% Off: **traskud17**)
- Shoot me a tweet @iamtrask


### Tutorial Outline:

- Intro: The Importance of "Framing a Problem"


- Curate a Dataset
- Developing a "Predictive Theory"
- **PROJECT 1**: Quick Theory Validation


- Transforming Text to Numbers
- **PROJECT 2**: Creating the Input/Output Data


- Putting it all together in a Neural Network
- **PROJECT 3**: Building our Neural Network


- Understanding Neural Noise
- **PROJECT 4**: Making Learning Faster by Reducing Noise


- Analyzing Inefficiencies in our Network
- **PROJECT 5**: Making our Network Train and Run Faster


- Further Noise Reduction
- **PROJECT 6**: Reducing Noise by Strategically Reducing the Vocabulary


- Analysis: What's going on in the weights?

# Lesson: Curate a Dataset

In [15]:
def pretty_print_review_and_label(i):
    # Show the label next to the first 80 characters of review i
    print(labels[i] + "\t:\t" + reviews[i][:80] + "...")

# What we know: one review per line
with open('reviews.txt', 'r') as g:
    reviews = [line.rstrip('\n') for line in g]

# What we WANT to know: one label per line, upper-cased
with open('labels.txt', 'r') as g:
    labels = [line.rstrip('\n').upper() for line in g]

In [16]:
len(reviews)

25000

In [17]:
reviews[0]

'bromwell high is a cartoon comedy . it ran at the same time as some other programs about school life  such as  teachers  . my   years in the teaching profession lead me to believe that bromwell high  s satire is much closer to reality than is  teachers  . the scramble to survive financially  the insightful students who can see right through their pathetic teachers  pomp  the pettiness of the whole situation  all remind me of the schools i knew and their students . when i saw the episode in which a student repeatedly tried to burn down the school  i immediately recalled . . . . . . . . . at . . . . . . . . . . high . a classic line inspector i  m here to sack one of your teachers . student welcome to bromwell high . i expect that many adults of my age think that bromwell high is far fetched . what a pity that it isn  t   '

In [18]:
labels[0]

'POSITIVE'

# Lesson: Develop a Predictive Theory

In [19]:
print("labels.txt \t : \t reviews.txt\n")
pretty_print_review_and_label(2137)
pretty_print_review_and_label(12816)
pretty_print_review_and_label(6267)
pretty_print_review_and_label(21934)
pretty_print_review_and_label(5297)
pretty_print_review_and_label(4998)

labels.txt 	 : 	 reviews.txt

NEGATIVE	:	this movie is terrible but it has some good effects .  ...
POSITIVE	:	adrian pasdar is excellent is this film . he makes a fascinating woman .  ...
NEGATIVE	:	comment this movie is impossible . is terrible  very improbable  bad interpretat...
POSITIVE	:	excellent episode movie ala pulp fiction .  days   suicides . it doesnt get more...
NEGATIVE	:	if you haven  t seen this  it  s terrible . it is pure trash . i saw this about ...
POSITIVE	:	this schiffer guy is a real genius  the movie is of excellent quality and both e...
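
The six snippets above suggest a theory: individual words such as "terrible" and "excellent" tend to appear under one label far more often than the other. A minimal sketch of how that could be checked, assuming only the 80-character prefixes printed above (the `sample` list and `counts_by_label` name are illustrative, not part of the original notebook):

```python
from collections import Counter

# Illustrative mini-dataset: the six prefixes printed above, with the "..." removed.
# The last token of some prefixes is cut off by the 80-character limit.
sample = [
    ("NEGATIVE", "this movie is terrible but it has some good effects .  "),
    ("POSITIVE", "adrian pasdar is excellent is this film . he makes a fascinating woman .  "),
    ("NEGATIVE", "comment this movie is impossible . is terrible  very improbable  bad interpretat"),
    ("POSITIVE", "excellent episode movie ala pulp fiction .  days   suicides . it doesnt get more"),
    ("NEGATIVE", "if you haven  t seen this  it  s terrible . it is pure trash . i saw this about "),
    ("POSITIVE", "this schiffer guy is a real genius  the movie is of excellent quality and both e"),
]

# One word counter per label
counts_by_label = {"POSITIVE": Counter(), "NEGATIVE": Counter()}
for label, text in sample:
    counts_by_label[label].update(text.split())

# Compare how often each candidate word appears under each label
for word in ("terrible", "excellent"):
    print(word, counts_by_label["POSITIVE"][word], counts_by_label["NEGATIVE"][word])
```

Six reviews prove nothing on their own, but the same counting approach scales to the full dataset, which is what the cells below do with pandas.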


In [20]:
import pandas as pd
import numpy as np

words_list = []      # every word occurrence, in order
sentiment_list = []  # sentiment label attached to each word occurrence
                     # (taken from the review the word comes from)

# Number of reviews to include, starting from the beginning.
# Lower this for faster testing (and consider sampling at random).
search_length = 25000

for i in range(search_length):
    for word in reviews[i].split():
        words_list.append(word)
        sentiment_list.append(labels[i])

data = np.array([words_list, sentiment_list]).T
print(data.shape)

df = pd.DataFrame(data, columns=["Word", "Sentiment"])
df.head()


(6347388, 2)


| | Word | Sentiment |
|---|---|---|
| 0 | bromwell | POSITIVE |
| 1 | high | POSITIVE |
| 2 | is | POSITIVE |
| 3 | a | POSITIVE |
| 4 | cartoon | POSITIVE |


In [21]:
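#df[df["Word"]=="high"]
# Count occurrences of each distinct value in each column
# (Series.value_counts replaces the deprecated top-level pd.value_counts)
counts = df.apply(lambda col: col.value_counts()).fillna(0)
counts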

| | Word | Sentiment |
|---|---|---|
| . | 327192.0 | 0.0 |
| NEGATIVE | 0.0 | 3144912.0 |
| POSITIVE | 0.0 | 3202476.0 |
| a | 163009.0 | 0.0 |
| aa | 5.0 | 0.0 |
| aaa | 9.0 | 0.0 |
| aaaaaaah | 1.0 | 0.0 |
| aaaaah | 1.0 | 0.0 |
| aaaaatch | 1.0 | 0.0 |
| aaaahhhhhhh | 1.0 | 0.0 |


In [22]:
import sys
# Plan:
# Split df into positive_words and negative_words dataframes
positive_words = df[df["Sentiment"] == "POSITIVE"]
negative_words = df[df["Sentiment"] == "NEGATIVE"]
# Create counts of these separate dataframes: pos_counts, neg_counts
pos_counts = positive_words.apply(lambda col: col.value_counts()).fillna(0)
neg_counts = negative_words.apply(lambda col: col.value_counts()).fillna(0)
# Create the unique-words dataframe (note: the "Count" column holds the word itself)
unique = pd.DataFrame(df["Word"].unique(), columns=["Count"])
# Add 'Positive_Count' and 'Negative_Count' columns, initialised to 0
unique["Positive_Count"] = 0
unique["Negative_Count"] = 0
# For each word, look up its counts in pos_counts / neg_counts with .loc
# (.ix was removed in pandas 1.0); words missing from a dataframe count as 0
get_pos = lambda x: int(pos_counts.loc[x, 'Word']) if x in pos_counts.index else 0
get_neg = lambda x: int(neg_counts.loc[x, 'Word']) if x in neg_counts.index else 0
for i, word in enumerate(unique["Count"]):
    unique.at[i, 'Positive_Count'] = get_pos(word)
    unique.at[i, 'Negative_Count'] = get_neg(word)
    sys.stdout.write("\rProgress: " + str(100 * i / float(len(unique)))[:4] + "%")  # shows % complete
unique

Progress: 99.9%

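| | Count | Positive_Count | Negative_Count |
|---|---|---|---|
| 0 | bromwell | 8 | 0 |
| 1 | high | 1095 | 1066 |
| 2 | is | 57245 | 50083 |
| 3 | a | 83688 | 79321 |
| 4 | cartoon | 249 | 296 |
| 5 | comedy | 1742 | 1504 |
| 6 | . | 159654 | 167538 |
| 7 | it | 48025 | 48327 |
| 8 | ran | 122 | 116 |
| 9 | at | 11234 | 12279 |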
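
The row-by-row loop above visits every unique word, which is slow on the full vocabulary. As a possible alternative, `pd.crosstab` can build the same word-by-label count table in one call. The sketch below runs on a tiny made-up frame (`toy` and its values are illustrative only) that mirrors the `Word` and `Sentiment` columns of `df`:

```python
import pandas as pd

# Toy stand-in for the (Word, Sentiment) frame built earlier; values are made up.
toy = pd.DataFrame({
    "Word": ["good", "bad", "good", "film", "bad", "film", "good"],
    "Sentiment": ["POSITIVE", "NEGATIVE", "POSITIVE", "POSITIVE",
                  "NEGATIVE", "NEGATIVE", "NEGATIVE"],
})

# One row per word, one column per label; each cell is an occurrence count.
# Word/label combinations that never occur are filled with 0.
table = pd.crosstab(toy["Word"], toy["Sentiment"])
table = table.rename(columns={"POSITIVE": "Positive_Count",
                              "NEGATIVE": "Negative_Count"})
print(table)
```

Applied to the real data, this would be `pd.crosstab(df["Word"], df["Sentiment"])`; the result is indexed by word rather than by row number, so later cells would need a `reset_index()` to match the layout used here.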


In [25]:
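words = unique.copy()
#words = words.sort_values("Positive_Count",ascending = False)
#words = words.iloc[60:-3000]
total_words = float(len(words_list))
# Share of all word occurrences that are this word in positive / negative reviews
words["prob_pos"] = words["Positive_Count"] / total_words
words["prob_neg"] = words["Negative_Count"] / total_words
# Absolute gap between the two shares, scaled by 10,000 for readability
words["abs_diff"] = abs(words["prob_pos"] - words["prob_neg"]) * 10000
# Normalised gap in [-1, 1]: +1 means only positive reviews, -1 only negative
words["ratio"] = (words["prob_pos"] - words["prob_neg"]) / (words["prob_pos"] + words["prob_neg"])
# Keep only words seen more than 30 times under each label, so rare words do not dominate
words = words[(words["Positive_Count"] > 30) & (words["Negative_Count"] > 30)]
words = words.sort_values("ratio", ascending=False)
words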

| | Count | Positive_Count | Negative_Count | prob_pos | prob_neg | abs_diff | ratio |
|---|---|---|---|---|---|---|---|
| 5871 | wonderfully | 287 | 37 | 4.52154e-05 | 5.82917e-06 | 0.393863 | 0.771605 |
| 8621 | delightful | 236 | 38 | 3.71806e-05 | 5.98671e-06 | 0.311939 | 0.722628 |
| 4168 | beautifully | 373 | 63 | 5.87643e-05 | 9.92534e-06 | 0.48839 | 0.711009 |
| 748 | underrated | 201 | 35 | 3.16666e-05 | 5.51408e-06 | 0.261525 | 0.70339 |
| 663 | superb | 569 | 102 | 8.96432e-05 | 1.60696e-05 | 0.735736 | 0.695976 |
| 15368 | welles | 212 | 39 | 3.33996e-05 | 6.14426e-06 | 0.272553 | 0.689243 |
| 16963 | sinatra | 206 | 39 | 3.24543e-05 | 6.14426e-06 | 0.2631 | 0.681633 |
| 2764 | touching | 365 | 70 | 5.7504e-05 | 1.10282e-05 | 0.464758 | 0.678161 |
| 19586 | sullivan | 160 | 31 | 2.52072e-05 | 4.8839e-06 | 0.203233 | 0.675393 |
| 359 | stewart | 391 | 77 | 6.16001e-05 | 1.2131e-05 | 0.494692 | 0.67094 |


In [27]:
words = words.sort_values("ratio",ascending = True)
words

| | Count | Positive_Count | Negative_Count | prob_pos | prob_neg | abs_diff | ratio |
|---|---|---|---|---|---|---|---|
| 1357 | waste | 99 | 1358 | 1.5597e-05 | 0.000213946 | 1.98349 | -0.864104 |
| 3221 | pointless | 40 | 465 | 6.3018e-06 | 7.32585e-05 | 0.669567 | -0.841584 |
| 876 | worst | 252 | 2480 | 3.97014e-05 | 0.000390712 | 3.51011 | -0.81552 |
| 5100 | laughable | 40 | 384 | 6.3018e-06 | 6.04973e-05 | 0.541955 | -0.811321 |
| 1917 | awful | 168 | 1557 | 2.64676e-05 | 0.000245298 | 2.1883 | -0.805217 |
| 2608 | poorly | 70 | 644 | 1.10282e-05 | 0.000101459 | 0.904309 | -0.803922 |
| 4470 | sucks | 34 | 247 | 5.35653e-06 | 3.89136e-05 | 0.335571 | -0.758007 |
| 4556 | lame | 90 | 652 | 1.41791e-05 | 0.000102719 | 0.885404 | -0.757412 |
| 584 | horrible | 155 | 1046 | 2.44195e-05 | 0.000164792 | 1.40373 | -0.741882 |
| 49 | pathetic | 61 | 407 | 9.61025e-06 | 6.41209e-05 | 0.545106 | -0.739316 |
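
Because `prob_pos` and `prob_neg` are both divided by the same total word count, that total cancels in the `ratio` column, leaving (Positive_Count - Negative_Count) / (Positive_Count + Negative_Count). A small sketch that reproduces two rows of the tables above from their raw counts (the helper name `polarity_ratio` is made up for illustration):

```python
def polarity_ratio(pos_count, neg_count):
    """(pos - neg) / (pos + neg): -1 means only negative reviews, +1 only positive."""
    return (pos_count - neg_count) / (pos_count + neg_count)

total = 6347388  # total word occurrences, from data.shape above

# "wonderfully": 287 positive, 37 negative occurrences
print(round(polarity_ratio(287, 37), 6))   # 0.771605

# "waste": 99 positive, 1358 negative occurrences
print(round(polarity_ratio(99, 1358), 6))  # -0.864104

# Same value when computed from the normalised shares, since the total cancels
via_probs = (287 / total - 37 / total) / (287 / total + 37 / total)
print(abs(via_probs - polarity_ratio(287, 37)) < 1e-12)  # True
```

The ratio ignores how frequent a word is overall, which is why the cell above first filters out words with 30 or fewer occurrences under either label.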
