# Sentiment Analysis
https://www.repustate.com/blog/sentiment-analysis-steps/<br>
Sentiment analysis is the AI-powered method through which brands can find out the emotions that customers express about them on the internet. 
## Step0: Data collection
## Step1: Preprocessing text data
Tokenization -> Texting cleaning / Processing -> Text Vectorization <br><br>
Tokenization Methods: 1. NLTK 2.Keras Tokenizer API<br>
Text Vectorization Methods: 1. Bag of Words (BOW). 2. One Hot Encoding. 3. Term Frequency, Inverse Term Frequency (TF-IDF, BOW extension). 4. The Word Embedding Model(Pretrained: Word2Vec, GloVe, Keras Embedding Layer).
## Step2: Data Analysis
Training the model -> multilingual processing -> custom tags -> topic/aspect classification -> sentiment analysis <br><br>
Sentiment Analysis: Each aspect and theme is isolated in this stage by the platform and then analysed for the sentiment. Sentiment scores are given in the range of -1 to +1. A neutral statement may be termed as zero. 
## Step3: Data Visualization (Optional)

![SentimentAnalysisStructure](tableOfContent_W9_note.jpg)

---------

# Load data & Data exploring

In [1]:
%matplotlib inline

In [24]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from tensorflow.compat.v1.keras.preprocessing.text import Tokenizer
from tensorflow.compat.v1.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Activation

In [4]:
data = pd.read_csv('F:/Durham_College_AI/2- semester/AI in enterprise/Final_project/train.csv')
data.head()

Unnamed: 0,id,title,author,text,label
0,0,House Dem Aide: We Didn’t Even See Comey’s Let...,Darrell Lucus,House Dem Aide: We Didn’t Even See Comey’s Let...,1
1,1,"FLYNN: Hillary Clinton, Big Woman on Campus - ...",Daniel J. Flynn,Ever get the feeling your life circles the rou...,0
2,2,Why the Truth Might Get You Fired,Consortiumnews.com,"Why the Truth Might Get You Fired October 29, ...",1
3,3,15 Civilians Killed In Single US Airstrike Hav...,Jessica Purkiss,Videos 15 Civilians Killed In Single US Airstr...,1
4,4,Iranian woman jailed for fictional unpublished...,Howard Portnoy,Print \nAn Iranian woman has been sentenced to...,1


In [12]:
data.dropna(subset = ["text"],inplace = True)

In [13]:
# check is this a balance dataset
data.groupby("label").count()

Unnamed: 0_level_0,id,title,author,text
label,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
0,10387,10387,10361,10387
1,10374,9816,8482,10374


In [14]:
len(data)

20761

In [26]:
indx = data["text"].str.len().idxmax()

In [35]:
len(data.iloc[19764]["text"])

264

# Insight:
positive only has 1/4 amount of data compare to negative tweets, this may affect my model lean to negative.

train = 10728 rows * 0.7 = 7509.6 <br>
test = 10728 * 0.3 = the rest

In [16]:
# Divide dataset into 70% vs 30% based on insight found above
X_train = data.loc[:len(data)*0.7, 'text'].values
y_train = data.loc[:len(data)*0.7, 'label'].values
X_test = data.loc[(len(data)*0.7) + 1:, 'text'].values
y_test = data.loc[(len(data)*0.7) + 1:, 'label'].values

# Step1:Preprocessing text data
<h4>Tokenization (use Keras Tokenizer API)<br>
    & Texting cleaning / Processing <br>
    & Text Vectorization (use Keras Word Embedding Model)</h4>

In [8]:
# pip install tensorflow

# Tokenization &Texting Cleaning & Vectorization
Here I will format the text samples and labels into tensors that can be fed into a neural network.<br>
To do this, I will utilize <strong>Keras.preprocessing.text.Tokenizer</strong> and <strong>keras.preprocessing.sequence.pad_sequences</strong>.<br><br>
Note: By default, <strong>Keras.preprocessing.text.Tokenizer</strong> removes all punctuation, turns the texts into space-separated sequences of words (words maybe include the ' character). These sequences are then split into lists of tokens. 

In [51]:


# initialize Tokenizer class
tokenizer_obj = Tokenizer()
# tockenize all the tweets
total_tweets = data.loc[:,'text'].values
tokenizer_obj.fit_on_texts(total_tweets) 

# Keras prefers inputs to be vectorized and all inputs to have the same length
# so I need to pad sequences
indx = data["text"].str.len().idxmax()
max_length = len(data.iloc[19764]["text"])
#max_length = 50 # based on calculation in csv file (max: 29, average: 17)
# define vocabulary size
vocab_size = len(tokenizer_obj.word_index) + 1
# tockenize train and test dataset
X_train_tokens =  tokenizer_obj.texts_to_sequences(X_train)
X_test_tokens = tokenizer_obj.texts_to_sequences(X_test)
# pad sequence tockenized-train and tockenized-test dataset
# parameter explanation: padding https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/sequence/pad_sequences
# String, 'pre' or 'post' (optional, defaults to 'pre'): pad either before or after each sequence. 
X_train_pad = pad_sequences(X_train_tokens,maxlen=max_length, padding='post')
X_test_pad = pad_sequences(X_test_tokens,maxlen = max_length, padding='post')
print(X_train_pad)

[[     7   7543      8 ...  21178  36565   4240]
 [  7649      5  75613 ...  20764   1169  13568]
 [    38   1515   3485 ...    670   8657   8883]
 ...
 [203023    843      1 ...      9     39    320]
 [   182      2      1 ...     10  15267   2259]
 [    24    322    397 ...      1    876     11]]


In [52]:
len(X_train_tokens)

14501

In [53]:
print(vocab_size)

238052


In [54]:
X_test_pad.shape

(6259, 264)

there is 15820 number of vocabulary in my dataset

# Step2: Data Analysis - Training the model 
<h2> LSTM model training </h2>
Ready to define my neural network model.<br><br>
The model will use an <strong>Embedding layer</strong> as the first hidden layer. The Embedding layer is initialized with random weights and will learn an embedding for all of the words in the training dataset during training of the model.<br>
Second layer is <strong>LSTM</strong>, then <strong>output layer (classification)</strong><br>
LSTM parameters: https://keras.io/api/layers/recurrent_layers/lstm/

In [60]:
from keras.models import Sequential
from keras.layers import Dense, Embedding, LSTM, GRU
from keras.layers.embeddings import Embedding

# set embedding dimension is 200 (maybe too much for my small dataset, but whatever...)
EMBEDDING_DIM = 300

print('Building LSTM model with Tensorflow/Keras...')

# initialize Sequential class in order to struture my neural network model
model = Sequential()
# use keras word embedding layer as my first input layer
embedding_layer = Embedding(vocab_size, EMBEDDING_DIM, input_length=max_length)
model.add(embedding_layer)
# one layer of LSTM
model.add(LSTM(units=32,  dropout=0.2, recurrent_dropout=0.2,return_sequences = True))

model.add(LSTM(units=32,  dropout=0.2, recurrent_dropout=0.2,return_sequences = True))

model.add(LSTM(units=32,  dropout=0.2, recurrent_dropout=0.2,return_sequences = False))

model.add(Dense(64))
model.add(Activation("relu"))

# output layer using sigmoid as activation method (0-1)
model.add(Dense(1)) # classification problem: so output = 1)
model.add(Activation("sigmoid"))
# try using different optimizers and different optimizer configs
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

print('Summary of the built model...')
print(model.summary())

Building LSTM model with Tensorflow/Keras...
Summary of the built model...
Model: "sequential_8"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_7 (Embedding)     (None, 264, 300)          71415600  
                                                                 
 lstm_18 (LSTM)              (None, 264, 32)           42624     
                                                                 
 lstm_19 (LSTM)              (None, 264, 32)           8320      
                                                                 
 lstm_20 (LSTM)              (None, 32)                8320      
                                                                 
 dense_14 (Dense)            (None, 64)                2112      
                                                                 
 activation_14 (Activation)  (None, 64)                0         
                                             

Explanation of the summary above:
<li>Embedding layers is 50 words x 200 vector dimension
<li>LSTM is 32 dimension of the output space
<li>Dense = Final output layer is 1 output only

In [63]:
print('Training...')

model.fit(X_train_pad, y_train, batch_size=128, epochs=10, validation_data=(X_test_pad, y_test), verbose=1)

Training...
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x25264eeeca0>

In [64]:
# Check weights matrix in the embedding layer
print(embedding_layer.get_weights()[0].shape)

(238052, 300)


There is 15820 rows and 200 columns in both embedding matrix and output

In [15]:
print(embedding_layer.get_weights()[0])

[[-0.00382903 -0.00587053  0.04335074 ...  0.12898205  0.01420071
   0.00967517]
 [ 0.07686709  0.07374097  0.09413031 ... -0.02480163 -0.09206655
   0.02909448]
 [-0.01791714 -0.11168876 -0.06725611 ... -0.03886044  0.09327106
  -0.09911796]
 ...
 [-0.0281518  -0.04022111 -0.02247751 ... -0.03610227 -0.01409967
  -0.03082849]
 [ 0.02415563 -0.0378341   0.03905306 ... -0.04646444 -0.0249963
   0.02757633]
 [ 0.01512632 -0.02184947  0.02084725 ...  0.01986808 -0.04185306
  -0.00685205]]


In [16]:
print(embedding_layer.get_weights()[0][0])

[-0.00382903 -0.00587053  0.04335074 -0.0454114   0.02117755  0.03388212
 -0.04732965 -0.09339476 -0.05080926 -0.04675207  0.05137672  0.18874048
  0.08256961 -0.00131278  0.00241474 -0.03182614  0.09045675 -0.13608947
  0.04030503 -0.04965901  0.03199952  0.01639097 -0.04001632  0.05339506
 -0.02994714 -0.05879462 -0.02570869 -0.03116549 -0.04035143 -0.06147698
 -0.06276058 -0.03623896 -0.00191258 -0.05377633  0.05297501  0.02893719
 -0.00930509  0.05842903 -0.07084234 -0.01193138 -0.01341305 -0.00723896
  0.03643291 -0.00594012  0.01471033 -0.03965194 -0.05313287 -0.07044734
 -0.01700999 -0.06558758 -0.01982605 -0.05706052  0.03168553  0.01901981
  0.07559431 -0.01364453  0.01010045  0.008158   -0.09819352 -0.07177477
 -0.00759261  0.02659782  0.06549978 -0.03236971  0.01208768  0.04632158
 -0.05012042  0.0225373  -0.04318896  0.05713596  0.02592877 -0.03936525
  0.08377057  0.09214601  0.03102545 -0.08449408 -0.08407471  0.12011185
 -0.05073658 -0.05673467  0.06373047  0.02344095 -0

In [65]:
# check how model fits test dataset
print('Testing...')
score, acc = model.evaluate(X_test_pad, y_test, batch_size=128)

print('Test score:', score)
print('Test accuracy:', acc)

print("Accuracy: {0:.2%}".format(acc))

Testing...
Test score: 0.3565372824668884
Test accuracy: 0.8889598846435547
Accuracy: 88.90%


# compare 70% training dataset with 50% training dataset
70% out of total as the training dataset accuracy: 81.85%-83.43% <Br>
50% out of total as the training dataset accuracy: 82.55%, it's test score is overfit (1.05%)<br>
Therefore, I will use 70% as my training dataset

In [69]:
test_sample = "You are bad." # should be closer to 0 -> negative tweet
# test_sample = "You are good." # should be closer to 1 -> positive tweet
# test_sample = "An update on FoxNews tech failures for the #GOPDebate "# should be closer to 0 -> negative tweet
# test_sample = "Before the #GOPDebate, 14 focus groupers said they had favorable view of Trump."# should be closer to 1 -> positive tweet

In [73]:
def tokenized_padding(sample):
    tokenizer_test = Tokenizer()
    test_tweets = sample
    tokenizer_test.fit_on_texts(test_tweets) 
    # padding sequences
    max_length = 264 # based on calculation in csv file (max: 29, average: 17)
    # define vocabulary size
    vocab_size = len(tokenizer_test.word_index) + 1
    test_sample_tokens =  tokenizer_obj.texts_to_sequences(test_tweets)
    test_samples_tokens_pad = pad_sequences(test_sample_tokens, maxlen=max_length, padding='post')
    return test_samples_tokens_pad

In [74]:
test_samples_tokens_pad = tokenized_padding(test_sample)

In [75]:
# predict
model_list = model.predict(x=test_samples_tokens_pad)
model_list

# The other way to show model_list: seperate array to show the result, otherwise print(model_list) shows ugly e-04 kind of number
# for val in model_list:
#     print(val)
# print("Average of the prediction = ", model_list.mean())

array([[0.04358363],
       [0.04358363],
       [0.04358357],
       [0.04358363],
       [0.04358363],
       [0.04358363],
       [0.04358363],
       [0.04358363],
       [0.04358357],
       [0.04358363],
       [0.04358363],
       [0.04358363]], dtype=float32)

**Value closer to 1 is strong positive sentiment<br>
Value close to 0 is a strong negative sentiment**

# Conclusion

# test sample = "You are bad." 
1. should be closer to 0 -> negative tweet<br>
2. result:<br>
![testSample_Bad](predict_YouAreBad_1.jpg)

# test sample = "You are good." 
1. should be closer to 1 -> positive tweet<br>
2. result:<br>
![testSample_Good](predict_YouAreGood_1.jpg)

# Recommendation

I've tried NLTK & Keras preprocessing, pre & post padding, 100 & 200 embedding dimension, 35 & 50 maximum vocabulary length, 50% & 70% of data as training set.<br>
Based on the performance (accuracy and training time), the best combination of above parameters is post, 200, 50, 70%, respectively. So I suggest to use this combination.<br>
Regarding preprocessing methods, NLTK has more steps to manual code it, Keras provides a convenient way to do preprocessing but it will leave ' alone instead of remove it, which is more reliable and readable.<br>
If there is a chance, I would like to compare Word2Vec and GloVe with Keras.

# Generate requirement.txt
https://pypi.org/project/pigar/
<ol>
    <li>Open the terminal under this environment - Anaconda > choose this environment > launch CMD.exe Prompt</li>
    <li>navigate to this folder - cd C://xx/xx...</li>
    <li>type command - pip install pigar</li>
    <li>type command - pigar</li>
</ol>
In the folder, you will see a "requirement.txt" file