# <a>Sentiment Analysis with pre-trained Word2Vec model</a>

Continuation from: https://www.kaggle.com/farsanas/are-you-ready-to-build-your-own-word-embedding

This session is divided into two parts:

* Part 1: Deploy our own word embedding
* Part 2: Understand the challenges that arise when we use a deep neural network for text analysis

## <a>Part1</a>

In [None]:
from IPython.display import YouTubeVideo      
YouTubeVideo('8iM5PdxBbWo')

## <a>Part2</a>

In [None]:
from IPython.display import YouTubeVideo      
YouTubeVideo('k2-OkFHsIlk')

## <a>Overview - Part1</a>
In this tutorial we'll do sentiment analysis with Word2Vec, using our own pre-trained model. That model was trained on unlabelled data: 50,000 IMDB movie reviews. After some data preprocessing, it learned more than 28,000 unique words, each represented by a dense vector of 50 numbers.

In [None]:
import numpy as np
import pandas as pd

In [None]:
df1 = pd.read_csv('../input/word2vec-nlp-tutorial/labeledTrainData.tsv.zip',
                 header=0, delimiter="\t", quoting=3)

print(df1.shape)  

## About the data

The labelled data set contains 25,000 reviews, each with a label. The output column `sentiment` has two categories, 0 and 1:

* 0: negative sentiment
* 1: positive sentiment

In [None]:
df1.iloc[10:15,:]

## <a>Data Preprocessing</a>

In [None]:
# Split data into training and test sets

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df1['review'],
    df1['sentiment'],
    test_size=0.2, 
    random_state=42
)


In [None]:
print(X_train.head())

In [None]:
print(X_train.tolist()[0:2])

## <a>Build Tokenizer to get Number sequences for Each review</a>

In [None]:
# Use the public tensorflow.keras path rather than the private tensorflow.python one
from tensorflow.keras.preprocessing.text import Tokenizer

# Vocabulary size: keep only the most frequent words
top_words = 10000

t = Tokenizer(num_words=top_words)
t.fit_on_texts(X_train.tolist())

# Replace each word in a review with its word index
X_train = t.texts_to_sequences(X_train.tolist())
X_test = t.texts_to_sequences(X_test.tolist())


In [None]:
print(X_train[0:2])

In [None]:
t.word_index.items() 
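
To make the tokenizer step less of a black box, here is a minimal pure-Python sketch of the idea: rank words by frequency, then replace each word with its rank, dropping words whose rank is `num_words` or higher. The helper names `build_word_index` and `texts_to_sequences` and the toy reviews are made up for illustration. This is not the Keras implementation, which also strips punctuation before splitting.

```python
from collections import Counter

def build_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # most_common() sorts by descending count; ties keep first-seen order
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(texts, word_index, num_words):
    """Replace each word by its index, keeping only indices below num_words."""
    return [
        [word_index[w] for w in text.lower().split()
         if w in word_index and word_index[w] < num_words]
        for text in texts
    ]

# Toy data (hypothetical, not taken from the IMDB reviews)
toy_reviews = ["great movie great cast", "bad movie"]
toy_index = build_word_index(toy_reviews)
toy_sequences = texts_to_sequences(toy_reviews, toy_index, num_words=3)

print(toy_index)      # {'great': 1, 'movie': 2, 'cast': 3, 'bad': 4}
print(toy_sequences)  # [[1, 2, 1], [2]]
```

Index 0 is never assigned to a word, which is why it is free to serve as the padding value later on.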

In [None]:
# Pad (or truncate) the sequences so that every review has the same length

from tensorflow.keras.preprocessing import sequence

# Number of tokens kept per review
max_review_length = 300

# padding='post' appends zeros at the end of short reviews
X_train = sequence.pad_sequences(X_train, maxlen=max_review_length, padding='post')
X_test = sequence.pad_sequences(X_test, maxlen=max_review_length, padding='post')
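
One detail worth knowing: `pad_sequences` truncates from the front by default (`truncating='pre'`), so with the call above a review longer than 300 tokens keeps its last 300 tokens, while a shorter one gets zeros appended. The sketch below imitates that behaviour on plain lists. `pad_post` is a hypothetical helper written for illustration, not a Keras function.

```python
def pad_post(seq, maxlen, value=0):
    """Pad at the end; truncate from the front when seq is too long."""
    seq = seq[-maxlen:]                         # keep the last maxlen tokens
    return seq + [value] * (maxlen - len(seq))  # append zeros if too short

short_review = pad_post([5, 8, 2], maxlen=5)
long_review = pad_post([1, 2, 3, 4, 5, 6, 7], maxlen=5)

print(short_review)  # [5, 8, 2, 0, 0]
print(long_review)   # [3, 4, 5, 6, 7]
```

If the opening of a review matters more than its ending, passing `truncating='post'` to `pad_sequences` would keep the first tokens instead.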


## <a>Build Embedding Matrix from Pre-Trained Word2Vec model</a>

In [None]:
#Install gensim
!pip install gensim --quiet

#Load pre-trained model
import gensim
word2vec = gensim.models.Word2Vec.load('../input/w2v-model/word2vec movie-50.model')

# Embedding length (our word vectors have 50 dimensions)
embedding_vector_length = word2vec.wv.vectors.shape[1]

print('Loaded word2vec model..')
print('Model shape: ', word2vec.wv.vectors.shape)

In [None]:
# Build the embedding matrix for the current data
embedding_matrix = np.zeros((top_words + 1,  # vocabulary size + 1 (index 0 is reserved for padding)
                             embedding_vector_length))
# Walk through the words in order of their index (most frequent first)
for word, i in sorted(t.word_index.items(), key=lambda x: x[1]):
    if i > top_words:
        break
    # `word in word2vec.wv` works in gensim 3.x and 4.x (`wv.vocab` was removed in 4.0)
    if word in word2vec.wv:
        embedding_vector = word2vec.wv[word]
        embedding_matrix[i] = embedding_vector  # words without a vector keep an all-zero row
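
Words that are missing from the Word2Vec model keep an all-zero row, so it is useful to check how much of the vocabulary is actually covered. Below is a small sketch of the same loop on toy data, followed by a coverage count. The dictionary `toy_vectors` stands in for `word2vec.wv`, and every name and value in it is made up for illustration.

```python
import numpy as np

# Hypothetical stand-ins for word2vec.wv and t.word_index
toy_vectors = {
    "great": np.array([0.1, 0.2, 0.3]),
    "movie": np.array([0.4, 0.5, 0.6]),
}
toy_word_index = {"great": 1, "movie": 2, "zzz": 3}
toy_top_words = 3
toy_dim = 3

# Row 0 is left for padding, hence the + 1
toy_matrix = np.zeros((toy_top_words + 1, toy_dim))
for word, i in toy_word_index.items():
    if i <= toy_top_words and word in toy_vectors:
        toy_matrix[i] = toy_vectors[word]

# A row counts as covered if at least one entry is non-zero
covered = int(np.any(toy_matrix != 0, axis=1).sum())
print(f"{covered} of {toy_top_words} words have a pre-trained vector")
```

The same `np.any(embedding_matrix != 0, axis=1).sum()` check can be run on the real `embedding_matrix` to see how many of the words found a vector.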

In [None]:
#Check embeddings for word 'great'
embedding_matrix[t.word_index['great']]

## <a> Build the Model - Part2 </a>

In [None]:
# Use the public tensorflow.keras path rather than the private tensorflow.python one
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dropout, Dense, Embedding, Flatten

#Build a sequential model
model1 = Sequential()

### <a>Add Embedding layer</a>

In [None]:
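model1.add(Embedding(top_words + 1,                   # rows: vocabulary size + 1 for the padding index
                    embedding_vector_length,          # columns: length of each word vector
                    input_length=max_review_length,   # number of tokens in each padded review
                    weights=[embedding_matrix],       # start from the pre-trained vectors
                    trainable=False))                 # keep the vectors frozen during training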
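
Conceptually, a frozen embedding layer is just a table lookup: each word index selects one row of the embedding matrix. The NumPy sketch below illustrates this with a small random matrix. The sizes and the names `toy_embeddings` and `toy_batch` are assumptions chosen to keep the output readable, not values from the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical matrix: 6 rows (5 words + padding row), 4 numbers per word
toy_embeddings = rng.normal(size=(6, 4))
toy_embeddings[0] = 0.0  # padding index 0 maps to an all-zero vector

# Two padded "reviews" of 3 tokens each
toy_batch = np.array([[1, 3, 0],
                      [2, 2, 5]])

# Fancy indexing performs the lookup: one row per token
embedded = toy_embeddings[toy_batch]              # shape (2, 3, 4)
flattened = embedded.reshape(len(toy_batch), -1)  # shape (2, 12), as Flatten would produce

print(embedded.shape, flattened.shape)
```

With the real settings, each review becomes a 300 x 50 array, which the next cell flattens into a single vector of 15,000 numbers.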

In [None]:
# Flatten the embedding layer output, then add the dense layers
model1.add(Flatten())                                                             
model1.add(Dense(200,activation='relu'))                                          
model1.add(Dense(100,activation='relu'))
model1.add(Dropout(0.5))                                                          
model1.add(Dense(60,activation='relu'))
model1.add(Dropout(0.4))
model1.add(Dense(30,activation='relu'))
model1.add(Dropout(0.3))
model1.add(Dense(1,activation='sigmoid'))                                         

model1.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

In [None]:
model1.summary()
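
The numbers in the summary can be checked by hand, which also hints at one challenge of this architecture: flattening makes the first dense layer very large. The sketch below redoes the arithmetic, assuming the embedding length is 50 as stated in the overview. The names `seq_len`, `emb_dim` and `vocab_rows` are introduced here only for illustration.

```python
seq_len = 300       # max_review_length
emb_dim = 50        # assumed embedding length (per the overview)
vocab_rows = 10001  # top_words + 1

flatten_units = seq_len * emb_dim
layer_sizes = [flatten_units, 200, 100, 60, 30, 1]

# Each Dense layer has (inputs * outputs) weights plus one bias per output
dense_params = [n_in * n_out + n_out
                for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]

# The embedding layer is frozen, so these count as non-trainable
embedding_params = vocab_rows * emb_dim

print(flatten_units)      # 15000
print(dense_params)       # [3000200, 20100, 6060, 1830, 31]
print(sum(dense_params))  # 3028221
print(embedding_params)   # 500050
```

Almost all of the roughly 3 million trainable weights sit in the first dense layer, far more weights than the 20,000 training reviews, so it makes sense to watch the validation accuracy for signs of overfitting during training.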

In [None]:
# Train the model, using the test split as validation data
model1.fit(X_train,y_train,
          epochs=10,
          batch_size=100,         
          validation_data=(X_test, y_test))