# Sentiment Classification & How To "Frame Problems" for a Neural Network

by Andrew Trask

- **Twitter**: @iamtrask
- **Blog**: http://iamtrask.github.io

### What You Should Already Know

- neural networks, forward and back-propagation
- stochastic gradient descent
- mean squared error
- and train/test splits

### Where to Get Help if You Need It
- Re-watch previous Udacity Lectures
- Leverage the recommended Course Reading Material - [Grokking Deep Learning](https://www.manning.com/books/grokking-deep-learning) (40% Off: **traskud17**)
- Shoot me a tweet @iamtrask


### Tutorial Outline:

- Intro: The Importance of "Framing a Problem"


- Curate a Dataset
- Developing a "Predictive Theory"
- **PROJECT 1**: Quick Theory Validation


- Transforming Text to Numbers
- **PROJECT 2**: Creating the Input/Output Data


- Putting it all together in a Neural Network
- **PROJECT 3**: Building our Neural Network


- Understanding Neural Noise
- **PROJECT 4**: Making Learning Faster by Reducing Noise


- Analyzing Inefficiencies in our Network
- **PROJECT 5**: Making our Network Train and Run Faster


- Further Noise Reduction
- **PROJECT 6**: Reducing Noise by Strategically Reducing the Vocabulary


- Analysis: What's going on in the weights?

# Lesson: Curate a Dataset

In [183]:
def pretty_print_review_and_label(i):
    print(labels[i] + "\t:\t" + reviews[i][:80] + "...")

with open('reviews.txt', 'r') as g:  # What we know!
    # Strip only the trailing newline, so a final line without one is not truncated
    reviews = [line.rstrip('\n') for line in g]

with open('labels.txt', 'r') as g:  # What we WANT to know!
    labels = [line.rstrip('\n').upper() for line in g]

In [184]:
len(reviews)

25000

In [185]:
reviews[0]

'bromwell high is a cartoon comedy . it ran at the same time as some other programs about school life  such as  teachers  . my   years in the teaching profession lead me to believe that bromwell high  s satire is much closer to reality than is  teachers  . the scramble to survive financially  the insightful students who can see right through their pathetic teachers  pomp  the pettiness of the whole situation  all remind me of the schools i knew and their students . when i saw the episode in which a student repeatedly tried to burn down the school  i immediately recalled . . . . . . . . . at . . . . . . . . . . high . a classic line inspector i  m here to sack one of your teachers . student welcome to bromwell high . i expect that many adults of my age think that bromwell high is far fetched . what a pity that it isn  t   '

In [186]:
labels[0]

'POSITIVE'
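Before building a theory, it can help to confirm that the two files actually line up. The sketch below is one way to do that, not part of the original notebook: `summarize_dataset` is a hypothetical helper, and `sample_reviews` / `sample_labels` are made-up stand-ins for the real `reviews` and `labels` lists.

```python
from collections import Counter

def summarize_dataset(reviews, labels):
    """Return basic sanity statistics for parallel review/label lists."""
    if not reviews:
        raise ValueError("reviews is empty")
    if len(reviews) != len(labels):
        raise ValueError("reviews and labels must have the same length")
    label_counts = Counter(labels)  # how many reviews carry each label
    avg_tokens = sum(len(r.split()) for r in reviews) / len(reviews)
    return {"n": len(reviews), "label_counts": dict(label_counts), "avg_tokens": avg_tokens}

# Hypothetical stand-in data; the notebook loads the real lists from
# reviews.txt and labels.txt.
sample_reviews = ["this movie is terrible", "an excellent film", "pure trash", "a real genius"]
sample_labels = ["NEGATIVE", "POSITIVE", "NEGATIVE", "POSITIVE"]

summary = summarize_dataset(sample_reviews, sample_labels)
print(summary)
```

On the real data, calling `summarize_dataset(reviews, labels)` should show whether the two classes are roughly balanced, which matters when reading raw word counts later.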

# Lesson: Develop a Predictive Theory

In [187]:
print("labels.txt \t : \t reviews.txt\n")
pretty_print_review_and_label(2137)
pretty_print_review_and_label(12816)
pretty_print_review_and_label(6267)
pretty_print_review_and_label(21934)
pretty_print_review_and_label(5297)
pretty_print_review_and_label(4998)

labels.txt 	 : 	 reviews.txt

NEGATIVE	:	this movie is terrible but it has some good effects .  ...
POSITIVE	:	adrian pasdar is excellent is this film . he makes a fascinating woman .  ...
NEGATIVE	:	comment this movie is impossible . is terrible  very improbable  bad interpretat...
POSITIVE	:	excellent episode movie ala pulp fiction .  days   suicides . it doesnt get more...
NEGATIVE	:	if you haven  t seen this  it  s terrible . it is pure trash . i saw this about ...
POSITIVE	:	this schiffer guy is a real genius  the movie is of excellent quality and both e...
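The examples above hint at a theory: words such as "terrible" and "excellent" seem to line up with the label. A quick way to check that kind of hunch is to count how often each word shows up under each label. Here is a minimal sketch using `collections.Counter`; the helper name `count_words_by_label` and the toy data are assumptions made for illustration.

```python
from collections import Counter

def count_words_by_label(reviews, labels):
    """Count word occurrences separately for POSITIVE and NEGATIVE reviews."""
    positive_counts, negative_counts = Counter(), Counter()
    for review, label in zip(reviews, labels):
        target = positive_counts if label == "POSITIVE" else negative_counts
        target.update(review.split())  # whitespace tokenization, as in the notebook
    return positive_counts, negative_counts

# Toy data for illustration only; swap in the real `reviews` and `labels`.
toy_reviews = [
    "this movie is terrible",
    "excellent film",
    "terrible and pure trash",
    "excellent episode",
]
toy_labels = ["NEGATIVE", "POSITIVE", "NEGATIVE", "POSITIVE"]

pos, neg = count_words_by_label(toy_reviews, toy_labels)
print(pos["excellent"], neg["excellent"])  # 2 0
print(pos["terrible"], neg["terrible"])    # 0 2
```

A `Counter` returns 0 for words it has never seen, so there is no need to check for missing keys.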


In [188]:
import pandas as pd
import numpy as np

words_list = []  # Every word occurrence, in order

sentiment_list = []  # Sentiment label attached to each word occurrence
                     # (taken from the review the word comes from)

search_length = 4000
# Number of reviews to include, for speed of testing
# (taken from the start of the file; may want to randomize)

for i in range(search_length):
    for word in reviews[i].split():
        words_list.append(word)
        sentiment_list.append(labels[i])

data = np.array([words_list,sentiment_list]).T
print(data.shape)

df = pd.DataFrame(data, columns = ["Word","Sentiment"])
df.head()


(997569, 2)


|   | Word | Sentiment |
|---|------|-----------|
| 0 | bromwell | POSITIVE |
| 1 | high | POSITIVE |
| 2 | is | POSITIVE |
| 3 | a | POSITIVE |
| 4 | cartoon | POSITIVE |


In [189]:
#df[df["Word"]=="high"]
# pd.value_counts is deprecated in recent pandas; use the Series method instead
counts = df.apply(pd.Series.value_counts).fillna(0)
counts

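|   | Word | Sentiment |
|---|------|-----------|
| . | 52138.0 | 0.0 |
| NEGATIVE | 0.0 | 486342.0 |
| POSITIVE | 0.0 | 511227.0 |
| a | 25326.0 | 0.0 |
| aa | 1.0 | 0.0 |
| aaa | 6.0 | 0.0 |
| aaargh | 2.0 | 0.0 |
| aage | 2.0 | 0.0 |
| aaja | 1.0 | 0.0 |
| aaker | 2.0 | 0.0 |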


In [190]:
import sys
#Plan:
#Split df into positive_words and negative_words dataframes
positive_words = df[df["Sentiment"]=="POSITIVE"]
negative_words = df[df["Sentiment"]=="NEGATIVE"]
#Create counts of these separate dataframes pos_counts, neg_counts  
pos_counts = positive_words.apply(pd.value_counts).fillna(0)
neg_counts = negative_words.apply(pd.value_counts).fillna(0)
#Create unique_words dataframe
unique = pd.DataFrame(df["Word"].unique(),columns = ["Count"])
#Add 'Positive_Count' and 'Negative_Count' columns to uniques
unique["Positive_Count"] = df["Word"].unique() 
unique["Negative_Count"] = df["Word"].unique()
#For each word in uniques fill in positive and negative counts from pos_counts, neg_counts dataframes using pd.loc[]
get_pos = lambda x: pos_counts.ix[x, 'Word'] if x in pos_counts.index.values else 0
get_neg = lambda x: neg_counts.ix[x, 'Word'] if x in neg_counts.index.values else 0
for i, word in enumerate(unique["Positive_Count"]):
    unique.ix[i, 'Positive_Count']=get_pos(word)
    unique.ix[i, 'Negative_Count']=get_neg(word)
    sys.stdout.write("\rProgress: " + str(100 * i/float(len(unique)))[:4]+"%") #shows % complete
unique

Progress: 99.9%

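|   | Count | Positive_Count | Negative_Count |
|---|-------|----------------|----------------|
| 0 | bromwell | 4 | 0 |
| 1 | high | 161 | 171 |
| 2 | is | 8905 | 7667 |
| 3 | a | 13317 | 12009 |
| 4 | cartoon | 44 | 32 |
| 5 | comedy | 304 | 221 |
| 6 | . | 25546 | 26592 |
| 7 | it | 7709 | 7494 |
| 8 | ran | 21 | 16 |
| 9 | at | 1761 | 1912 |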

In [179]:
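words = unique
#words = words.sort_values("Positive_Count",ascending = False)
#words = words.iloc[60:-3000]
# Share of all word occurrences that are this word inside a positive / negative review
words["prob_pos"] = words["Positive_Count"] / float(len(words_list))
words["prob_neg"] = words["Negative_Count"] / float(len(words_list))
# Gap between the two shares, scaled by 10000 so the numbers are readable
words["abs_diff"] = abs(words["prob_pos"] - words["prob_neg"]) * 10000
# Ratio of the two shares: above 1 leans positive, below 1 leans negative
# (infinite when Negative_Count is 0)
words["abs_ratio"] = abs(words["prob_pos"] / words["prob_neg"])
words = words.sort_values("abs_diff", ascending=False)
# Drop very rare and very common words (filter on the positive count only)
words = words[(words["Positive_Count"] > 15) & (words["Positive_Count"] < 1000)]
words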

|   | Count | Positive_Count | Negative_Count | prob_pos | prob_neg | abs_diff | abs_ratio |
|---|-------|----------------|----------------|----------|----------|----------|-----------|
| 256 | his | 699 | 393 | 0.00274028 | 0.00154067 | 11.9961 | 1.77863 |
| 91 | t | 551 | 824 | 0.00216008 | 0.00323032 | 10.7024 | 0.668689 |
| 502 | bad | 102 | 325 | 0.00039987 | 0.0012741 | 8.74225 | 0.313846 |
| 343 | film | 843 | 651 | 0.00330481 | 0.00255211 | 7.52696 | 1.29493 |
| 696 | movie | 849 | 1036 | 0.00332833 | 0.00406142 | 7.33095 | 0.819498 |
| 465 | so | 327 | 494 | 0.00128194 | 0.00193662 | 6.54689 | 0.661943 |
| 244 | he | 632 | 469 | 0.00247762 | 0.00183862 | 6.39008 | 1.34755 |
| 206 | they | 443 | 604 | 0.00173669 | 0.00236786 | 6.31167 | 0.733444 |
| 92 | story | 347 | 194 | 0.00136034 | 0.000760537 | 5.99805 | 1.78866 |
| 155 | great | 256 | 113 | 0.00100359 | 0.000442993 | 5.60602 | 2.26549 |

In [None]:
#TRY USING BAYES!!
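
The note above does not say what "using Bayes" would look like, so what follows is only one guess at it: turn the per-word counts into P(POSITIVE | word) with Bayes' rule. The function name `posterior_positive`, the add-`alpha` smoothing, and the choice to estimate class priors from token totals are all assumptions made for this sketch, and the numbers are toy values rather than results from the dataset.

```python
def posterior_positive(pos_count, neg_count, total_pos, total_neg, vocab_size, alpha=1.0):
    """Estimate P(POSITIVE | word) with Bayes' rule and add-alpha smoothing."""
    # Likelihoods P(word | class), smoothed so unseen words never get probability 0.
    p_word_given_pos = (pos_count + alpha) / (total_pos + alpha * vocab_size)
    p_word_given_neg = (neg_count + alpha) / (total_neg + alpha * vocab_size)
    # Class priors, estimated here from the share of tokens in each class (an assumption).
    p_pos = total_pos / (total_pos + total_neg)
    p_neg = 1.0 - p_pos
    # Evidence P(word) by the law of total probability.
    evidence = p_word_given_pos * p_pos + p_word_given_neg * p_neg
    return p_word_given_pos * p_pos / evidence

# Toy numbers: a word seen 3 times in positive and once in negative reviews,
# with 100 tokens per class and a 10-word vocabulary.
toy_posterior = posterior_positive(pos_count=3, neg_count=1,
                                   total_pos=100, total_neg=100, vocab_size=10)
print(round(toy_posterior, 3))  # 0.667

# A word that is equally common in both classes should land at 0.5.
neutral_posterior = posterior_positive(5, 5, 100, 100, 10)
print(neutral_posterior)
```

Compared with `abs_ratio`, this score stays between 0 and 1 and never divides by zero, which makes rare words easier to handle.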