![data-x](http://oi64.tinypic.com/o858n4.jpg)


# Data-X Notebook: Word2vec
#### Using NLP with Word2Vec in Python to do sentiment analysis on IMDB movie reviews


# Steps:
1. Read the file `labeledTrainData.tsv` from the `data` folder into a dataframe `train`.
2. Clean the reviews in the input file.
3. Use the cleaned reviews to generate word vectors for this corpus.
4. Train a classifier on the word-vector representation of the reviews for sentiment analysis.

In [None]:
# Make compatible with Python 2 and Python 3
# (__future__ imports must be the first statement in the cell)
from __future__ import print_function, division, absolute_import

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

import matplotlib.pyplot as plt
%matplotlib inline


## Data set

The labeled training data set consists of 25,000 IMDB movie reviews. There is also an unlabeled test set with 25,000 IMDB movie reviews. The sentiment of each review is binary: an IMDB rating < 5 gives a sentiment score of 0, and a rating >= 7 gives a sentiment score of 1 (reviews rated 5 or 6 are not included in the analysis). No individual movie has more than 30 reviews.
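
The labeling rule above can be written as a small function. This is only an illustrative sketch: `rating_to_sentiment` is a hypothetical helper, and the data files already contain the `sentiment` column, so nothing like this has to be run on them.

```python
def rating_to_sentiment(rating):
    """Map an IMDB rating to the binary label described above.

    Hypothetical helper for illustration only. Returns None for ratings
    of 5 or 6, which are excluded from the data set.
    """
    if rating < 5:
        return 0   # negative review
    if rating >= 7:
        return 1   # positive review
    return None    # rating 5 or 6: not part of the data


labels = [rating_to_sentiment(r) for r in (1, 4, 5, 6, 7, 10)]
print(labels)
```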

## File description

* **labeledTrainData** - The labeled training set. The file is tab-delimited and has a header row followed by 25,000 rows containing an id, sentiment, and text for each review. 

* **testData** - The unlabeled test set. 25,000 rows containing an id, and text for each review. 

## Data columns
* **id** - Unique ID of each review
* **sentiment** - Sentiment of the review; 1 for positive reviews and 0 for negative reviews
* **review** - Text of the review
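
To make the file layout concrete, here is a minimal sketch that parses a tab-delimited sample with the same three columns using only the standard library. The two rows are made-up stand-ins, not actual reviews from the data set; the notebook itself reads the real file with `pd.read_csv`.

```python
import csv
import io

# Made-up sample in the same layout as labeledTrainData.tsv:
# a header row, then one tab-separated row per review.
sample = (
    "id\tsentiment\treview\n"
    "r1\t1\tA wonderful little film.\n"
    "r2\t0\tDull and far too long.\n"
)

rows = list(csv.DictReader(io.StringIO(sample), delimiter="\t"))

for row in rows:
    # Every value is read as a string; convert sentiment to int if needed
    print(row["id"], int(row["sentiment"]), row["review"])
```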


## 1. Data set statistics


In [None]:


import numpy as np
import pandas as pd       
train = pd.read_csv("data/labeledTrainData.tsv",delimiter="\t")
# train.shape should be (25000,3)

In [None]:
train.head()

In [None]:
# import packages

import bs4 as bs
import nltk
# nltk.download('all')
from nltk.tokenize import sent_tokenize # tokenizes sentences
import re
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

eng_stopwords = stopwords.words('english')

<div id='sec3'></div>
##  2. Cleaning the reviews



We'll create a function called `review_cleaner` that reads in a review and:

- Removes HTML tags (using BeautifulSoup)
- Removes non-letters (using a regular expression)
- Converts all words to lowercase and tokenizes them (using the `.split()` method on the review string, so that every word in the review is an element in a list)
- Removes all the English stopwords from the list of movie review words
- Optionally lemmatizes (the default) or stems each remaining word
- Returns the cleaned words as a list, which is the tokenized form used to train Word2Vec
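
As a quick worked example of these steps, the sketch below runs a simplified, dependency-free version on one made-up sentence. It is not the notebook's `review_cleaner`: the regex tag stripper and the tiny `TOY_STOPWORDS` set are assumptions standing in for BeautifulSoup and NLTK's English stopword list, and lemmatization is left out.

```python
import re

# Tiny stand-in stopword list (assumption); the notebook uses NLTK's
# full English list via stopwords.words("english").
TOY_STOPWORDS = {"this", "was", "a", "the", "i", "it", "and"}


def toy_review_cleaner(review):
    """Simplified, dependency-free version of the cleaning steps."""
    # 1. Remove HTML tags (crude regex; the notebook uses BeautifulSoup)
    text = re.sub(r"<[^>]+>", " ", review)
    # 2. Replace every non-letter character with a space
    text = re.sub(r"[^a-zA-Z]", " ", text)
    # 3. Lowercase and split on whitespace
    words = text.lower().split()
    # 4. Drop stopwords and return the remaining tokens as a list
    return [w for w in words if w not in TOY_STOPWORDS]


cleaned = toy_review_cleaner("<br />This movie was GREAT, and I loved the ending!")
print(cleaned)
```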



In [None]:
# Stemmer and lemmatizer are created once and reused for every review
ps = PorterStemmer()
wnl = WordNetLemmatizer()

# Build the stopword set once; membership tests on a set are fast
eng_stopwords = set(stopwords.words("english"))

def review_cleaner(review, lemmatize=True, stem=False):
    '''
    Clean and preprocess a review.

    1. Remove HTML tags
    2. Use regex to remove all special characters (only keep letters)
    3. Convert to lower case and tokenize / word split the review
    4. Remove English stopwords
    5. Lemmatize (default) or stem the remaining words

    Returns the cleaned review as a list of words.
    '''
    # 1. Remove HTML tags
    review = bs.BeautifulSoup(review, "html.parser").text

    # 2. Replace everything that is not a letter with a space
    review = re.sub("[^a-zA-Z]", " ", review)

    # 3. Tokenize into words (all lower case)
    review = review.lower().split()

    # 4. Remove stopwords, then 5. lemmatize or stem what is left
    clean_review = []
    for word in review:
        if word not in eng_stopwords:
            if lemmatize:
                word = wnl.lemmatize(word)
            elif stem:
                if word == 'oed':
                    continue
                word = ps.stem(word)
            clean_review.append(word)
    return clean_review

In [None]:

num_reviews = len(train['review'])

review_clean_original = []

for i in range(0,num_reviews):
    if( (i+1)%5000 == 0 ):
        # print progress
        print("Done with %d reviews" %(i+1)) 
    review_clean_original.append(review_cleaner(train['review'][i]))
 

In [None]:
train['review'][0]

In [None]:
review_clean_original[0]

In [None]:

len(review_clean_original)

## 3. Convert words to a distributed representation, i.e. train a Word2Vec model

In [None]:
# !pip install gensim

#ref: https://radimrehurek.com/gensim/models/word2vec.html

In [None]:

sentences=review_clean_original
# Set values for various parameters
num_features = 300    # Word vector dimensionality                      
min_word_count = 40   # ignore all words with total frequency lower than this                       
num_workers = 4       # Number of threads to run in parallel
context = 10          # Context window size                                                                                    


# Initialize and train the model (this will take some time)
from gensim.models import word2vec




print("Training word2vec model... ")
# Note: `size` is the gensim 3.x argument name; gensim 4.0 renamed it to `vector_size`
model = word2vec.Word2Vec(sentences, workers=num_workers,
                          size=num_features, min_count=min_word_count,
                          window=context)


# save the model for later use. You can load it later using Word2Vec.load()
model_name = "300features_40minwords_10context"
model.save(model_name)

In [None]:
# You can also use a pretrained word2vec model instead of training your own.
# Download the Google pretrained model (about 1.5 GB) from:
# https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit
# Once downloaded, unzip the archive to get a file named
# GoogleNews-vectors-negative300.bin.


# import gensim
# Gmodel = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

## Exploring the Model Results





In [None]:
# Get vocabulary count of the model
# (gensim 3.x API; gensim 4.0+ replaced model.wv.vocab with model.wv.key_to_index)
vocab_tmp = list(model.wv.vocab)
print('Vocab length:',len(vocab_tmp))

In [None]:
# Get Vocabulary words
vocab_tmp[0]

In [None]:
# Get the Word embedding 
# model.wv['stuff']

In [None]:
# Get cosine similarity of words
model.wv.similarity('movie','film')


In [None]:
model.wv.similarity('actor','actress')


In [None]:

model.wv.similarity('boring','dull')
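
The similarity scores above are cosine similarities between word vectors. As a minimal sketch of what that means, the snippet below computes the cosine by hand for three made-up 3-dimensional vectors (the values are assumptions chosen for illustration, not vectors from the trained model).

```python
import math


def toy_cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up 3-dimensional vectors, for illustration only
movie = [0.9, 0.1, 0.3]
film = [0.8, 0.2, 0.4]
kitchen = [-0.2, 0.9, -0.5]

print(toy_cosine(movie, film))     # close to 1: vectors point the same way
print(toy_cosine(movie, kitchen))  # negative: vectors point apart
```

A value near 1 means the two words appear in similar contexts in the corpus; values near 0 or below indicate unrelated usage.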

In [None]:
model.wv.most_similar(positive=['actor','male'], negative=['female'])
# model.wv.most_similar(positive=['king','woman'], negative=['man'])
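
Queries with `positive` and `negative` words work by vector arithmetic: positive vectors are added, negative vectors are subtracted, and the remaining vocabulary is ranked by cosine similarity to the result. The sketch below shows the idea on made-up 2-dimensional vectors. It is a simplification (gensim also normalizes the vectors before combining them), and `analogy_vectors` and `toy_most_similar` are hypothetical names.

```python
import math

# Made-up 2-dimensional vectors, for illustration only.
# Axis 0 loosely means "performer", axis 1 "male (+) / female (-)".
analogy_vectors = {
    "actor":   [1.0, 1.0],
    "actress": [1.0, -1.0],
    "male":    [0.0, 1.0],
    "female":  [0.0, -1.0],
    "film":    [1.0, 0.0],
}


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def toy_most_similar(positive, negative):
    """Rank the remaining words by cosine similarity to the combined vector."""
    dim = len(next(iter(analogy_vectors.values())))
    target = [0.0] * dim
    for word in positive:
        target = [t + x for t, x in zip(target, analogy_vectors[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, analogy_vectors[word])]
    # Exclude the query words themselves, as gensim does
    candidates = [w for w in analogy_vectors if w not in positive + negative]
    return sorted(candidates,
                  key=lambda w: cosine(target, analogy_vectors[w]),
                  reverse=True)


# actress + male - female lands on the "actor" vector in this toy space
ranking = toy_most_similar(positive=["actress", "male"], negative=["female"])
print(ranking)
```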


In [None]:
model.wv.doesnt_match("man woman child kitchen".split())

In [None]:
model.wv.doesnt_match("man woman ok kill".split())

In [None]:
model.wv.doesnt_match("france man germany berlin".split())

In [None]:
model.wv.most_similar("man")

In [None]:
model.wv.most_similar("movie")

In [None]:
model.wv.most_similar("awful")

In [None]:
from gensim.models import Word2Vec
# Load the trained model
model = Word2Vec.load("300features_40minwords_10context")

## Numeric Representations of Words

Now that we have a trained model with some semantic understanding of words, how should we use it? If you look beneath the hood, the Word2Vec model trained earlier consists of a feature vector for each word in the vocabulary, stored in a numpy array called `wv.syn0`:

In [None]:
type(model.wv.syn0)

In [None]:
model.wv.syn0.shape

In [None]:
model.corpus_count

In [None]:
# Get vocabulary count of the model
vocab_tmp = list(model.wv.vocab)
print('Vocab length:',len(vocab_tmp))

# Get distributional representation of each word
X = model.wv[vocab_tmp]


In [None]:
from sklearn import decomposition
# get two principal components of the feature space
pca= decomposition.PCA(n_components=2).fit_transform(X)

# set figure settings
plt.figure(figsize=(10,10),dpi=100)

# save pca values and vocab in dataframe df
df = pd.concat([pd.DataFrame(pca),pd.Series(vocab_tmp)],axis=1)
df.columns = ['x', 'y', 'word']



plt.xlabel("1st principal component")
plt.ylabel('2nd principal component')


plt.scatter(x=pca[:, 0], y=pca[:, 1],s=3)
for i, word in enumerate(df['word'][0:100]):
    plt.annotate(word, (df['x'].iloc[i], df['y'].iloc[i]))
plt.title("PCA Embedding")
plt.show()


In [None]:
## t-SNE is a popular non-linear dimensionality reduction technique that aims to preserve
## the local structure of the data (and, to a lesser extent, its global structure).
## Essentially it tries to reconstruct the subspace in which the data exists.
# This will take time to run

# from sklearn import manifold
# tsne = manifold.TSNE(n_components=2)
# X_tsne = tsne.fit_transform(X)

# # set figure settings
# plt.figure(figsize=(10,10),dpi=100)

# # save t-SNE values and vocab in dataframe df2
# df2 = pd.concat([pd.DataFrame(X_tsne),pd.Series(vocab_tmp)],axis=1)
# df2.columns = ['x', 'y', 'word']


# plt.scatter(df2['x'][0:500], df2['y'][0:500],s=3)
# for i, word in enumerate(df2['word'][0:500]):
#     plt.annotate(word, (df2['x'].iloc[i], df2['y'].iloc[i]))
# plt.title("t-SNE Embedding")
# plt.show()


##  4. Use Word Vectors to create a sentiment analysis model using Random Forest Classifier

#### Vector Averaging to get feature encoding of review:

One challenge with the IMDB dataset is the variable-length reviews. We need to find a way to take individual word vectors and transform them into a feature set that is the same length for every review.

Since each word is a vector in 300-dimensional space, we can use vector operations to combine the words in each review. One method we tried was to simply average the word vectors in a given review (for this purpose, we removed stop words, which would just add noise).
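
A minimal sketch of the averaging idea, using made-up 3-dimensional vectors in place of the 300-dimensional ones from the model. The names `toy_word_vectors` and `average_vector` are hypothetical, and returning a zero vector when no word is known is one possible design choice, not necessarily the notebook's. Note that reviews of different lengths produce vectors of the same length, and out-of-vocabulary words are skipped.

```python
# Made-up 3-dimensional "word vectors"; the real model uses 300 dimensions
toy_word_vectors = {
    "great": [1.0, 0.0, 2.0],
    "movie": [0.0, 1.0, 0.0],
    "plot":  [2.0, 2.0, 1.0],
}


def average_vector(words, vectors, dim=3):
    """Average the vectors of the words that are in the vocabulary."""
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        # No known words: fall back to a zero vector (a design choice)
        return [0.0] * dim
    # zip(*known) groups the values of each dimension together
    return [sum(column) / len(known) for column in zip(*known)]


short_review = ["great", "movie"]
long_review = ["great", "plot", "great", "movie", "unknownword"]

print(average_vector(short_review, toy_word_vectors))
print(average_vector(long_review, toy_word_vectors))
```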

The following code averages the feature vectors, building on our code from earlier sections.

In [None]:
import numpy as np  # Make sure that numpy is imported

def makeFeatureVec(review, model):
    # Function to average all of the word vectors in a given review
    featureVec = []

    # Loop over each word in the review and, if it is in the model's
    # vocabulary, add its feature vector to the list
    for word in review:
        if word in model.wv:
            featureVec.append(model.wv[word])

    # If no word of the review is in the vocabulary, return a zero vector
    # so that every review still gets a vector of the same length
    if len(featureVec) == 0:
        return np.zeros(model.wv.vector_size, dtype="float32")

    # Average the word vectors for the review
    return np.mean(featureVec, axis=0)


def getAvgFeatureVecs(reviews, model):
    # Given a list of reviews (each one a list of words), calculate
    # the average feature vector for each one

    reviewFeatureVecs = []
    # Loop through the reviews
    for counter, review in enumerate(reviews):

        # Print a status message every 5000th review
        if counter % 5000 == 0:
            print("Review %d of %d" % (counter, len(reviews)))

        # Call the function (defined above) that makes average feature vectors
        vector = makeFeatureVec(review, model)
        reviewFeatureVecs.append(vector)

    return reviewFeatureVecs


In [None]:
from sklearn import metrics # for confusion matrix, accuracy score etc
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix


np.random.seed(0)


def train_sentiment(cleaned_reviews, y=train["sentiment"],max_features=1000):
    '''This function will:
    1. Convert reviews into feature vectors using word2vec.
    2. Split data into train and test sets.
    3. Train a random forest model on the averaged word vectors and y (labels).
    4. Test the model on the test split.
    5. Print accuracy of sentiment prediction on test and training data.
    6. Print confusion matrix on test data results.

    Note: max_features is currently unused; the feature length is set by
    the word2vec vector size (num_features).'''

    print("1.Creating Feature vectors using word2vec...\n")

    trainDataVecs = getAvgFeatureVecs( cleaned_reviews, model)
    
   
    print("\n2.Splitting dataset into train and test sets...\n")
    X_train, X_test, y_train, y_test = train_test_split(\
    trainDataVecs, y, random_state=0, test_size=.2)

   
    print("3. Training the random forest classifier...\n")
    
    # Initialize a Random Forest classifier with 50 trees
    forest = RandomForestClassifier(n_estimators = 50) 
    
    # Fit the forest to the training set, using the word2vec features
    # and the sentiment labels as the target variable
    forest = forest.fit(X_train, y_train)


    train_predictions = forest.predict(X_train)
    test_predictions = forest.predict(X_test)
    
    train_acc = metrics.accuracy_score(y_train, train_predictions)
    valid_acc = metrics.accuracy_score(y_test, test_predictions)
    print("=================Training Statistics======================\n")
    print("The training accuracy is: ", train_acc)
    print("The validation accuracy is: ", valid_acc)
    print()
    print('CONFUSION MATRIX:')
    print('         Predicted')
    print('          neg pos')
    print(' Actual')
    c=confusion_matrix(y_test, test_predictions)
    print('     neg  ',c[0])
    print('     pos  ',c[1])


In [None]:

train_sentiment(cleaned_reviews=review_clean_original, y=train["sentiment"],max_features=1000)
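
To help read the confusion matrix printed by `train_sentiment`: rows are the actual labels and columns are the predicted labels, so the diagonal holds the correct predictions. The sketch below rebuilds that layout by hand for six made-up labels (the values are assumptions for illustration, and `toy_confusion_matrix` is a hypothetical helper, not the scikit-learn function).

```python
def toy_confusion_matrix(actual, predicted):
    """2x2 matrix: rows are actual labels (0, 1), columns are predicted labels (0, 1)."""
    matrix = [[0, 0], [0, 0]]
    for a, p in zip(actual, predicted):
        matrix[a][p] += 1
    return matrix


# Made-up labels for six reviews (0 = negative, 1 = positive)
actual    = [0, 0, 0, 1, 1, 1]
predicted = [0, 0, 1, 1, 1, 0]

cm = toy_confusion_matrix(actual, predicted)

# Accuracy is the share of predictions on the diagonal
accuracy = (cm[0][0] + cm[1][1]) / len(actual)

print("neg", cm[0])
print("pos", cm[1])
print("accuracy:", accuracy)
```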
