# Sentiment Analysis and Word2Vec

Today we'll be reviewing and reinforcing some concepts we learned yesterday by applying count vectorizers to create sentiment analyzers. Then we will use gensim to create word vectors.

## Agenda
- Build a basic sentiment analyzer
- Use classification to predict sentiment
- Use nltk and gensim to create word vectors

## Learning Objectives
By the end of this lesson students will be able to:
- Demonstrate how to create a sentiment analyzer
- Demonstrate how to preprocess word data
- Demonstrate how to create word vectors in gensim

Review key terms:
- Corpus
- Stemming
- Lemmatization
- Tokenization

# Let's start with a very simple example

Let's build a function that can classify a small amount of text, such as a tweet, as positive or negative.

What words tell us whether certain text is positive?

In [None]:
theTweet = "We have some delightful new food in the cafeteria. Awesome!!!"

In [None]:
# Let's come up with a list of positive and negative words we might run into in one tweet

positive_words = [ ]
negative_words = [ ]

In [None]:
#Tokenize

import re
theTokens = re.findall(r'\b\w[\w-]*\b', theTweet.lower())
print(theTokens[:5])


In [None]:
# Count positive words:
    


In [None]:
# Count negative words:



In [None]:
# return a percentage

numWords = len(theTokens)
percntPos = numPosWords / numWords
percntNeg = numNegWords / numWords
print("Positive: " + "{:.0%}".format(percntPos) + "  Negative: " + "{:.0%}".format(percntNeg))
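
The cells above leave the word lists and the counting steps blank for the class to fill in. One possible way to complete them is sketched below as a single function. The word lists and the name `simple_sentiment` are illustrative assumptions, not part of the lesson.

```python
import re

# Illustrative word lists (assumptions for this sketch; extend them as a class)
POSITIVE_WORDS = {"delightful", "awesome", "great", "good", "love"}
NEGATIVE_WORDS = {"awful", "terrible", "bad", "hate", "boring"}

def simple_sentiment(text):
    """Return (share of positive tokens, share of negative tokens)."""
    # Same tokenizer pattern as the cell above
    tokens = re.findall(r'\b\w[\w-]*\b', text.lower())
    if not tokens:
        return 0.0, 0.0
    num_pos = sum(1 for t in tokens if t in POSITIVE_WORDS)
    num_neg = sum(1 for t in tokens if t in NEGATIVE_WORDS)
    return num_pos / len(tokens), num_neg / len(tokens)

pos, neg = simple_sentiment("We have some delightful new food in the cafeteria. Awesome!!!")
print("Positive: {:.0%}  Negative: {:.0%}".format(pos, neg))
# Positive: 20%  Negative: 0%
```

Using sets for the word lists keeps each membership check fast, and the early return avoids dividing by zero on empty text.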

### What are some shortcomings of this method?

# Sorting Positive from Negative Reviews

The easiest way to do sentiment classification is to train a model on data we've already labeled.

Today we will begin by reviewing the basic NLP techniques we learned yesterday to create a sentiment analyzer from IMDB movie reviews. This code-along is adapted from Kaggle's tutorial, available at: https://www.kaggle.com/c/word2vec-nlp-tutorial#part-1-for-beginners-bag-of-words.


## Import the Data

In [None]:
import pandas as pd       
train = pd.read_csv("/Users/markmummert/Desktop/Temp DSI Docs/NLP-2/labeledTrainData.tsv", header=0, 
                    delimiter="\t", quoting=3)

In [None]:
train.head()
#What are we looking at? Someone describe the columns

In [None]:
train.review.head()

There are a few steps we'll take to clean up the text data before it's ready for processing:

- Remove the HTML code artifacts from the text
- Remove punctuation
- Remove stopwords (what are these?)


## Step One: Remove HTML code artifacts

Fortunately, we can use Beautiful Soup to remove the HTML artifacts from our corpus.

In [None]:
from bs4 import BeautifulSoup             

# Initialize the BeautifulSoup object on a single movie review     
# (naming the parser explicitly avoids a warning from BeautifulSoup)
example1 = BeautifulSoup(train["review"][0], "html.parser")

# Print the raw review and then the output of get_text(), for 
# comparison
print(train["review"][0])
print(example1.get_text())


## Step Two: Remove Punctuation

As we did yesterday, we can remove punctuation using regular expressions.

In [None]:
import re
# Use regular expressions to do a find-and-replace
letters_only = re.sub("[^a-zA-Z]",           # The pattern to search for
                      " ",                   # The pattern to replace it with
                      example1.get_text() )  # The text to search
print(letters_only)

In [None]:
# Let's also take this time to convert everything to lowercase

In [None]:
lower_case = letters_only.lower()        # Convert to lower case
words = lower_case.split()               # Split into words

## Step Three: Stop Words

If you didn't complete the NLTK download yesterday you may run into some issues here.

In [None]:
import nltk
# nltk.download()  # Download text data sets, including stop words. Uncomment this if you did not download yesterday


In [None]:
from nltk.corpus import stopwords # Import the stop word list
print(stopwords.words("english"))


In [None]:
# Remove stop words from "words"
words = [w for w in words if w not in stopwords.words("english")]
print(words)


## Step Four - Combine our cleaning into one function

**Check**: Why should we do everything with one function?

In [None]:
def review_to_words( raw_review ):
    # Function to convert a raw review to a string of words
    # The input is a single string (a raw movie review), and 
    # the output is a single string (a preprocessed movie review)
    #
    # 1. Remove HTML
    review_text = BeautifulSoup(raw_review, "html.parser").get_text()
    #
    # 2. Remove non-letters        
    letters_only = re.sub("[^a-zA-Z]", " ", review_text) 
    #
    # 3. Convert to lower case, split into individual words
    words = letters_only.lower().split()                             
    #
    # 4. In Python, searching a set is much faster than searching
    #   a list, so convert the stop words to a set
    stops = set(stopwords.words("english"))                  
    # 
    # 5. Remove stop words
    meaningful_words = [w for w in words if w not in stops]
    #
    # 6. Join the words back into one string separated by space, 
    # and return the result.
    return " ".join(meaningful_words)


## Step Five (Finally!) Applying our Function

In [None]:
# Get the number of reviews based on the dataframe column size
num_reviews = train["review"].size

# Initialize an empty list to hold the clean reviews
clean_train_reviews = []


num_reviews

In [None]:
print("Cleaning and parsing the training set movie reviews...\n")
clean_train_reviews = []
for i in range(num_reviews):
    # If the index is evenly divisible by 1000, print a message
    if (i + 1) % 1000 == 0:
        print("Review %d of %d\n" % (i + 1, num_reviews))
    clean_train_reviews.append(review_to_words(train["review"][i]))


## Our data is finally ready.....

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# Initialize the "CountVectorizer" object, which is scikit-learn's
# bag of words tool.  
vectorizer = CountVectorizer(analyzer="word",
                             tokenizer=None,
                             preprocessor=None,
                             stop_words=None,
                             max_features=5000)

# fit_transform() does two things: First, it fits the model
# and learns the vocabulary; second, it transforms our training data
# into feature vectors. The input to fit_transform should be a list of 
# strings.
train_data_features = vectorizer.fit_transform(clean_train_reviews)

# Numpy arrays are easy to work with, so convert the result to an 
# array
train_data_features = train_data_features.toarray()


In [None]:
print(train_data_features.shape)

In [None]:
# On scikit-learn versions older than 1.0, use vectorizer.get_feature_names() instead
vocab = vectorizer.get_feature_names_out()
print(vocab)
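
To make the bag-of-words idea concrete, here is a minimal pure-Python sketch of what "fit" and "transform" do conceptually on a three-document toy corpus. The corpus and the names `toy_docs`, `toy_vocab` and `to_counts` are made up for illustration; this is not how scikit-learn implements `CountVectorizer` internally.

```python
from collections import Counter

# A made-up toy corpus (already cleaned and lowercased)
toy_docs = ["great movie great cast", "awful movie", "great fun"]

# "fit": learn a sorted vocabulary from the corpus
toy_vocab = sorted({word for doc in toy_docs for word in doc.split()})

# "transform": one row per document, one column per vocabulary word
def to_counts(doc):
    counts = Counter(doc.split())
    return [counts[word] for word in toy_vocab]

toy_matrix = [to_counts(doc) for doc in toy_docs]

print(toy_vocab)   # ['awful', 'cast', 'fun', 'great', 'movie']
for row in toy_matrix:
    print(row)
# [0, 1, 0, 2, 1]
# [1, 0, 0, 0, 1]
# [0, 0, 1, 1, 0]
```

Each row is a document and each column is a word, which is the same layout as `train_data_features`, only much smaller.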


### Now we have an array that we can use for classification!

In [None]:
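# NearestNeighbors only looks up neighbors and has no predict() method;
# KNeighborsClassifier is the k-nearest-neighbors classifier
from sklearn.neighbors import KNeighborsClassifier

In [None]:
neighs = KNeighborsClassifier(n_neighbors=5)

In [None]:
neighs.fit(train_data_features, train["sentiment"])
# This may take a while...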

## How would we process and apply new data for predictions? What about a test set?
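
One way to think about the question: new text has to go through the same cleaning and the same fitted vectorizer, using `transform()` rather than `fit_transform()`, so that its columns line up with the training features. The sketch below shows this on invented toy data. The documents, the labels and the choice of `KNeighborsClassifier` with one neighbor are assumptions made for illustration only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import KNeighborsClassifier

# Invented toy data: 1 = positive, 0 = negative
train_docs = ["great wonderful film", "loved great acting",
              "boring awful plot", "terrible awful film"]
train_labels = [1, 1, 0, 0]
new_docs = ["great acting", "boring plot"]

toy_vectorizer = CountVectorizer()
# Learn the vocabulary from the training data only
train_features = toy_vectorizer.fit_transform(train_docs)
# Reuse that vocabulary for new data: transform(), not fit_transform()
new_features = toy_vectorizer.transform(new_docs)

toy_classifier = KNeighborsClassifier(n_neighbors=1)
toy_classifier.fit(train_features, train_labels)

predictions = toy_classifier.predict(new_features)
print(predictions)
```

Calling `fit_transform()` on the new documents would learn a different vocabulary, and the resulting columns would no longer mean the same thing as the ones the classifier was trained on.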

# Word Vectors

We talked briefly about vectors yesterday. Today we will see how they are utilized and implemented in Natural Language Processing.

For this example we will be working with a larger set of the data we used above.

In [None]:
train = pd.read_csv( "/Users/markmummert/Desktop/Temp DSI Docs/NLP-2/labeledTrainData.tsv", header=0, 
 delimiter="\t", quoting=3, encoding='utf-8' )
test = pd.read_csv( "/Users/markmummert/Desktop/Temp DSI Docs/NLP-2/testData.tsv", 
                   header=0, delimiter="\t", quoting=3, encoding='utf-8' )
unlabeled_train = pd.read_csv( "/Users/markmummert/Desktop/Temp DSI Docs/NLP-2/unlabeledTrainData.tsv", encoding='utf-8', header=0, 
 delimiter="\t", quoting=3 )

#When dealing with NLP you will often run into encoding errors.  
# The best way to address them is to pass an encoding parameter to pandas when you read in the data

Now we'll clean and prepare our data much as we did before.

In [None]:
def review_to_wordlist( review, remove_stopwords=False ):
    # Function to convert a document to a sequence of words,
    # optionally removing stop words.  Returns a list of words.
    #
    # 1. Remove HTML
    review_text = BeautifulSoup(review, "html.parser").get_text()
    #  
    # 2. Remove non-letters
    review_text = re.sub("[^a-zA-Z]"," ", review_text)
    #
    # 3. Convert words to lower case and split them
    words = review_text.lower().split()
    #
    # 4. Optionally remove stop words (false by default)
    if remove_stopwords:
        stops = set(stopwords.words("english"))
        words = [w for w in words if w not in stops]
    #
    # 5. Return a list of words
    return words


The Word2Vec implementation that we'll use today expects a list of sentences, where each sentence is a list of words, so we will use a **tokenizer** to split each review into sentences.

In [None]:
# Load the punkt tokenizer
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

# Define a function to split a review into parsed sentences
def review_to_sentences( review, tokenizer, remove_stopwords=False ):
    # Function to split a review into parsed sentences. Returns a 
    # list of sentences, where each sentence is a list of words
    #
    # 1. Use the NLTK tokenizer to split the paragraph into sentences
    raw_sentences = tokenizer.tokenize(review.strip())
    #
    # 2. Loop over each sentence
    sentences = []
    for raw_sentence in raw_sentences:
        # If a sentence is empty, skip it
        if len(raw_sentence) > 0:
            # Otherwise, call review_to_wordlist to get a list of words
            sentences.append(review_to_wordlist(raw_sentence,
                                                remove_stopwords))
    #
    # Return the list of sentences (each sentence is a list of words,
    # so this returns a list of lists)
    return sentences
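
To see the shape of the output without downloading any NLTK data, here is a small stand-in sketch. It swaps the punkt tokenizer for a crude regular-expression split on `.`, `!` and `?`, so it is only an illustration of the list-of-lists structure, not a replacement for `review_to_sentences`. The function name `toy_review_to_sentences` is hypothetical.

```python
import re

def toy_review_to_sentences(review):
    """Split a review into sentences, each returned as a list of lowercase words."""
    # Crude stand-in for punkt: split wherever ., ! or ? appears
    raw_sentences = re.split(r"[.!?]+", review)
    sentences = []
    for raw_sentence in raw_sentences:
        # Same cleaning as review_to_wordlist: letters only, lowercase, split
        words = re.sub("[^a-zA-Z]", " ", raw_sentence).lower().split()
        if words:  # skip empty sentences
            sentences.append(words)
    return sentences

print(toy_review_to_sentences("I loved it. The cast was great! Would watch again."))
# [['i', 'loved', 'it'], ['the', 'cast', 'was', 'great'], ['would', 'watch', 'again']]
```

A real sentence tokenizer handles cases this split gets wrong, such as abbreviations like "Mr." or decimal numbers.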


Now we apply both our functions to prepare the data

In [None]:
%%time 
# This took me about 10 minutes

sentences = []  # Initialize an empty list of sentences

print("Parsing sentences from training set")
for review in train["review"]:
    sentences += review_to_sentences(review, tokenizer)

print("Parsing sentences from unlabeled set")
for review in unlabeled_train["review"]:
    sentences += review_to_sentences(review, tokenizer)


In [None]:
#Now how much data do we have?

print(len(sentences))


## Vectorizing our Words

This will take a very long time, so start running the cell first. While it runs, we'll talk about what the result will be.

In [None]:
%%time 
#This took me about 5 minutes

# Import the built-in logging module and configure it so that Word2Vec 
# creates nice output messages
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',
                    level=logging.INFO)

# Set values for various parameters
num_features = 300    # Word vector dimensionality                      
min_word_count = 40   # Minimum word count                        
num_workers = 4       # Number of threads to run in parallel
context = 10          # Context window size                                                                                    
downsampling = 1e-3   # Downsample setting for frequent words

# Initialize and train the model (this will take some time)
from gensim.models import word2vec
print("Training model...")
# Note: in gensim 4.0 and later the `size` argument is named `vector_size`
model = word2vec.Word2Vec(sentences, workers=num_workers,
                          size=num_features, min_count=min_word_count,
                          window=context, sample=downsampling)

# If you don't plan to train the model any further, calling 
# init_sims will make the model much more memory-efficient.
model.init_sims(replace=True)

# It can be helpful to create a meaningful model name and 
# save the model for later use. You can load it later using Word2Vec.load()
model_name = "300features_40minwords_10context"
model.save(model_name)

What is word2vec?

> Word2vec is a group of related models that are used to produce word embeddings. These models are shallow, two-layer neural networks that are trained to reconstruct linguistic contexts of words. Word2vec takes as its input a large corpus of text and produces a vector space, typically of several hundred dimensions, with each unique word in the corpus being assigned a corresponding vector in the space. Word vectors are positioned in the vector space such that words that share common contexts in the corpus are located in close proximity to one another in the space.


Put more plainly - 

It's a way of 'abstracting' the meaning of words into numbers by distributing each word's meaning as a series of weights across the elements of a vector.

![example](https://adriancolyer.files.wordpress.com/2016/04/word2vec-distributed-representation.png?w=1132)

There's a very thorough discussion here: https://blog.acolyer.org/2016/04/21/the-amazing-power-of-word-vectors/

With enough training observations we can use these vectors to interpret speech patterns and define words.

## So what cool tricks can we do with word vectors?

In [None]:
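# Which word doesn't match the others?
# (In recent gensim versions these query methods live on model.wv,
#  e.g. model.wv.doesnt_match(...) and model.wv.most_similar(...))
model.doesnt_match("man woman child kitchen".split())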

In [None]:
model.doesnt_match("france england germany berlin".split())


In [None]:
model.doesnt_match("paris berlin london austria".split())
# We are limited by the size of our training set
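
As a rough intuition for how an "odd one out" query can work: normalize each word's vector, average them, and pick the word whose vector is least similar to that average. The sketch below does this in plain Python with made-up two-dimensional vectors. The vectors and the function names are assumptions for illustration, and gensim's actual implementation may differ in detail.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def odd_one_out(words, vectors):
    """Return the word whose unit vector is least similar to the mean vector."""
    unit_vecs = [unit(vectors[w]) for w in words]
    dims = len(unit_vecs[0])
    mean = unit([sum(vec[i] for vec in unit_vecs) / len(unit_vecs)
                 for i in range(dims)])
    # Dot product of unit vectors = cosine similarity
    sims = [sum(a * b for a, b in zip(vec, mean)) for vec in unit_vecs]
    return words[sims.index(min(sims))]

# Made-up 2-dimensional "word vectors"
toy_vectors = {
    "man": [1.0, 0.1],
    "woman": [0.9, 0.2],
    "child": [0.8, 0.3],
    "kitchen": [0.1, 1.0],
}
print(odd_one_out("man woman child kitchen".split(), toy_vectors))
# kitchen
```

With real word vectors the answer depends entirely on what the model learned from the corpus, which is why a small training set can give surprising results.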

In [None]:
model.most_similar("man")

# The numbers we are seeing here are the cosine similarities between the vectors (closer to 1 means more similar)
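
The score next to each word is based on the cosine of the angle between two word vectors. A minimal sketch of that calculation in plain Python follows. The three-dimensional vectors are made up for illustration; real vectors from this model have 300 dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up 3-dimensional "word vectors"
toy_word_vectors = {
    "man": [1.0, 0.0, 1.0],
    "woman": [1.0, 0.2, 0.9],
    "kitchen": [0.0, 1.0, 0.1],
}

sim_man_woman = cosine_similarity(toy_word_vectors["man"], toy_word_vectors["woman"])
sim_man_kitchen = cosine_similarity(toy_word_vectors["man"], toy_word_vectors["kitchen"])
print(round(sim_man_woman, 3), round(sim_man_kitchen, 3))
```

Vectors pointing in nearly the same direction score close to 1, while unrelated (perpendicular) vectors score close to 0, regardless of their lengths.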

In [None]:
model.most_similar("woman")

In [None]:
model.most_similar("awful")


### Ok, but really what do we use these for....

- LDA or topic modeling
- Neural network inputs
- Clustering