# **Sentiment Analysis: an R&D project by Uzzal Mondal**

**Sentiment analysis** is the process of classifying the emotional intent of text. Generally, the input to a sentiment classification model is a piece of text, and the output is the probability that the sentiment expressed is positive, negative, or neutral. Typically, this probability is computed from hand-crafted features, word n-grams, or TF-IDF features, or by deep learning models that capture long- and short-term sequential dependencies. Sentiment analysis is used to classify customer reviews on various online platforms, as well as for niche applications such as identifying signs of mental illness in online comments.

**Here we are importing the necessary libraries.**

**pandas** is used to read the dataset.

**numpy** is used to perform basic array operations.

**Tokenizer** is used to split the text into tokens.

**pad_sequences** is used to pad the data if necessary.

**Sequential()** model is appropriate for a plain stack of layers where each layer has exactly one input tensor and one output tensor.


**Embedding()** layer is initialized with random weights and learns an embedding for every word in the training dataset. Its main arguments are:

**input_dim:** This is the size of the vocabulary in the text data.

**output_dim:** This is the size of the vector space in which words will be embedded. It defines the size of the output vectors from this layer for each word.

**input_length:** Length of input sequences, when it is constant.

**json** is used to load and work with JSON files.
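
To make the `Embedding()` arguments concrete, here is a minimal NumPy sketch of what an embedding lookup does. It is not Keras' implementation: the table is random and never trained, and the toy sizes (`input_dim=6`, `output_dim=2`, sequences of length 4) are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

input_dim = 6    # toy vocabulary size (assumption)
output_dim = 2   # size of each word vector (assumption)

# One row of weights per word index; a real layer would learn these values.
table = rng.normal(size=(input_dim, output_dim))

# Two padded sequences of word indices, shape (batch, input_length) = (2, 4).
batch = np.array([[1, 2, 1, 0],
                  [3, 2, 0, 0]])

# The lookup replaces every index with its row of the table.
embedded = table[batch]
print(embedded.shape)  # (batch, input_length, output_dim)
```

Each integer index becomes a vector of length `output_dim`, so a batch of shape `(2, 4)` turns into an array of shape `(2, 4, 2)`.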

In [155]:
import pandas as pd
import numpy as np
from keras.preprocessing.text import Tokenizer
from keras.utils import pad_sequences
from keras import Sequential
from keras.layers import Dense,SimpleRNN,Embedding,Flatten, LSTM, Dropout, BatchNormalization
import json
import pickle
import os
import gc
import re


Load the CSV data file (tweets.csv) using pandas.
`read_csv` loads the data into a DataFrame, and `df.head(5)` shows the first 5 rows of the dataset.


In [None]:
df = pd.read_csv('/content/sample_data/tweets.csv')

In [None]:
df.head(5)

Unnamed: 0,id,label,tweet
0,1,0,#fingerprint #Pregnancy Test https://goo.gl/h1...
1,2,0,Finally a transparant silicon case ^^ Thanks t...
2,3,0,We love this! Would you go? #talk #makememorie...
3,4,0,I'm wired I know I'm George I was made that wa...
4,5,1,What amazing service! Apple won't even talk to...


In [None]:
df['tweet'][1]

'Finally a transparant silicon case ^^ Thanks to my uncle :) #yay #Sony #Xperia #S #sonyexperias… http://instagram.com/p/YGEt5JC6JM/'

Now separate the whole dataset into text (the tweets) and y (the sentiment labels).

In [None]:
text = df['tweet'].tolist()
y = df['label'].tolist()
text[:10]

['#fingerprint #Pregnancy Test https://goo.gl/h1MfQV #android #apps #beautiful #cute #health #igers #iphoneonly #iphonesia #iphone',
 'Finally a transparant silicon case ^^ Thanks to my uncle :) #yay #Sony #Xperia #S #sonyexperias… http://instagram.com/p/YGEt5JC6JM/',
 'We love this! Would you go? #talk #makememories #unplug #relax #iphone #smartphone #wifi #connect... http://fb.me/6N3LsUpCu',
 "I'm wired I know I'm George I was made that way ;) #iphone #cute #daventry #home http://instagr.am/p/Li_5_ujS4k/",
 "What amazing service! Apple won't even talk to me about a question I have unless I pay them $19.95 for their stupid support!",
 'iPhone software update fucked up my phone big time Stupid iPhones',
 'Happy for us .. #instapic #instadaily #us #sony #xperia #xperiaZ https://instagram.com/p/z9qGfWlvj7/',
 'New Type C charger cable #UK http://www.ebay.co.uk/itm/-/112598674021 … #bay #Amazon #etsy New Year #Rob Cross #Toby Young #EVEMUN #McMafia #Taylor #SPECTRE 2018 #NewYear #Starting

In [None]:
y[:10]

[0, 0, 0, 0, 1, 1, 0, 0, 0, 0]


**The sentiment labels mean:**

0 -> positive

1 -> negative

**Data cleaning for vectorization**

In [30]:
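def tweet_cleaner_without_stopwords(text):
    # Despite its name, this function keeps stop words; they are removed in a later step.
    # It relies on the global `lemmatizer` (WordNetLemmatizer) created in a later cell,
    # so that cell must be run before this function is called.
    new_text = re.sub(r"'s\b", " is", text)             # expand "'s" to " is"
    new_text = re.sub("#", "", new_text)                # drop the hash sign, keep the tag word
    new_text = re.sub("@[A-Za-z0-9]+", "", new_text)    # remove @mentions
    new_text = re.sub(r"http\S+", "", new_text)         # remove URLs
    #new_text = contractions.fix(new_text)
    new_text = re.sub(r"[^a-zA-Z]", " ", new_text)      # replace non-letters with spaces
    new_text = new_text.lower().strip()

    # Lemmatize each token; every token is followed by a space, so the result ends with one
    cleaned_text = ''
    for token in new_text.split():
        cleaned_text = cleaned_text + lemmatizer.lemmatize(token) + ' '

    return cleaned_text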
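
To see what the regular expressions do on their own, here is a small standalone sketch. `clean_tweet_basic` is a hypothetical helper written for this illustration; it applies the same substitutions as the cleaner but skips lemmatization, so it runs without NLTK. The sample tweet is made up.

```python
import re

def clean_tweet_basic(text: str) -> str:
    """Regex-only cleaning (hypothetical helper, no lemmatization)."""
    text = re.sub(r"'s\b", " is", text)         # "it's" -> "it is"
    text = re.sub(r"#", "", text)               # keep the hashtag word, drop '#'
    text = re.sub(r"@[A-Za-z0-9]+", "", text)   # remove @mentions
    text = re.sub(r"http\S+", "", text)         # remove URLs
    text = re.sub(r"[^a-zA-Z]", " ", text)      # anything that is not a letter becomes a space
    return " ".join(text.lower().split())       # lowercase and collapse repeated spaces

sample = "Thanks @apple, it's great :) #yay http://t.co/abc"
cleaned = clean_tweet_basic(sample)
print(cleaned)
```

Note that the order matters: URLs and mentions have to be removed before the non-letter characters are replaced, otherwise fragments such as `http` or the user name would survive as ordinary words.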

In [16]:
# Stop words Removal
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS


In [34]:
nltk.download('stopwords')
nltk.download('wordnet')


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [23]:
nltk_stopwords = set(stopwords.words('english'))
print(nltk_stopwords)

{"aren't", 'over', 'to', 'yours', 'very', 'here', 'such', 'is', "you'll", 'hasn', 'myself', "you're", 'has', 'same', 'you', 'couldn', 'were', 'doing', 'there', 'needn', 'on', "shan't", 'when', "it's", 'which', 'whom', 'more', 'ma', 'only', 'what', 'with', "doesn't", 'than', 'are', 'how', 'him', 'mustn', "wasn't", 'for', 'and', 'once', "hasn't", 'out', 'down', "that'll", "mightn't", 'before', 'will', 'm', 'ain', "haven't", 'himself', 'after', "won't", 'by', 'as', 'where', 'we', 'our', 'weren', 'having', 'own', 'a', 'them', 'be', 'below', 'his', 'the', 'no', "you'd", 'he', 'had', 'wasn', "don't", 'who', 'i', 'd', 'doesn', 'an', "wouldn't", 'mightn', 'shan', 'during', 'under', 't', 'o', 'does', "she's", 'from', 'each', 'don', 'why', 'most', "shouldn't", 'too', 'being', 'hers', 'just', "isn't", 'herself', 'she', 'between', 'both', 'am', 'so', 'did', 'me', 'it', 'have', 'all', 'those', "should've", "didn't", "needn't", 'aren', 'into', "couldn't", 'shouldn', 'in', 'its', 'off', "you've", 'an

In [24]:
len(nltk_stopwords)

179

In [25]:
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
sklearn_stopwords = set(ENGLISH_STOP_WORDS)
print(sklearn_stopwords)

{'move', 'sometime', 'around', 'hasnt', 'to', 'get', 'hence', 'latterly', 'very', 'whereas', 'here', 'would', 'nevertheless', 'none', 'always', 'myself', 'perhaps', 'has', 'you', 'were', 'indeed', 'due', 'on', 'whither', 'when', 'mostly', 'whom', 'which', 'only', 'what', 'less', 'thin', 'than', 'somewhere', 'him', 'ever', 'beyond', 'for', 'and', 'once', 'four', 'out', 'down', 'hereafter', 'across', 'before', 'after', 'back', 'by', 'become', 'as', 'besides', 'never', 'go', 'where', 'we', 'sometimes', 'behind', 'whole', 'con', 'thereupon', 'inc', 'twenty', 'his', 'he', 'everything', 'i', 'towards', 'serious', 'describe', 'another', 'however', 'an', 'twelve', 'sixty', 'also', 'system', 'during', 'co', 'whereafter', 'why', 'too', 'being', 'cant', 'find', 'it', 'detail', 'noone', 'have', 'cannot', 'front', 'may', 'those', 'fire', 'wherever', 'in', 'its', 'off', 'top', 'even', 'something', 'yourself', 'bottom', 'nowhere', 'been', 'amount', 'therefore', 'de', 'moreover', 'thru', 'per', 'of', 

In [26]:
len(sklearn_stopwords)

318

In [27]:
# Combining the stopwords from sklearn & NLTK
combined_stopwords = nltk_stopwords.union(sklearn_stopwords)

In [28]:
len(combined_stopwords)

378
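
The combined size is smaller than 179 + 318 because both lists share many words. A quick check with the counts above, plus a toy example of the same set arithmetic (the two small sets are made up for illustration):

```python
# Counts reported in the cells above.
nltk_count = 179
sklearn_count = 318
combined_count = 378

# |A ∪ B| = |A| + |B| - |A ∩ B|, so the overlap can be recovered from the three counts.
shared_count = nltk_count + sklearn_count - combined_count
print(shared_count)

# The same identity on two toy sets (assumed values).
a = {"the", "and", "won't"}
b = {"the", "and", "hence", "twelve"}
union_size = len(a | b)
print(union_size == len(a) + len(b) - len(a & b))
```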

In [32]:
# Text Normalization: Stemming or Lemmatization (preferred)
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

**Now we have the cleaned text with stop words still included**

> `clean_text_with_stopword[:10]` displays the first 10 lines of the cleaned text



In [41]:
clean_text_with_stopword = []
for line in text:
    clean_text_with_stopword.append(tweet_cleaner_without_stopwords(line))


In [42]:
clean_text_with_stopword[:10]

['fingerprint pregnancy test android apps beautiful cute health igers iphoneonly iphonesia iphone ',
 'finally a transparant silicon case thanks to my uncle yay sony xperia s sonyexperias ',
 'we love this would you go talk makememories unplug relax iphone smartphone wifi connect ',
 'i m wired i know i m george i wa made that way iphone cute daventry home ',
 'what amazing service apple won t even talk to me about a question i have unless i pay them for their stupid support ',
 'iphone software update fucked up my phone big time stupid iphones ',
 'happy for u instapic instadaily u sony xperia xperiaz ',
 'new type c charger cable uk bay amazon etsy new year rob cross toby young evemun mcmafia taylor spectre newyear starting recipe technology samsunggalaxys iphonex pic twitter com pjiwq wtc ',
 'bout to go shopping again listening to music iphone justme music likeforlike followforfollow ',
 'photo fun selfie pool water sony camera picoftheday sun instagood boy cute outdoor ']

**Now we have the cleaned text without stop words**

`clean_text_without_stopword[:10]` displays the first 10 lines of the cleaned text

In [55]:
clean_text_without_stopword = []
for line in clean_text_with_stopword:
    new_line = (' ').join(word for word in line.split() if word not in combined_stopwords)
    clean_text_without_stopword.append(new_line)



In [56]:
clean_text_without_stopword[:10]

['fingerprint pregnancy test android apps beautiful cute health igers iphoneonly iphonesia iphone',
 'finally transparant silicon case thanks uncle yay sony xperia sonyexperias',
 'love talk makememories unplug relax iphone smartphone wifi connect',
 'wired know george wa way iphone cute daventry home',
 'amazing service apple talk question unless pay stupid support',
 'iphone software update fucked phone big time stupid iphones',
 'happy u instapic instadaily u sony xperia xperiaz',
 'new type c charger cable uk bay amazon etsy new year rob cross toby young evemun mcmafia taylor spectre newyear starting recipe technology samsunggalaxys iphonex pic twitter com pjiwq wtc',
 'bout shopping listening music iphone justme music likeforlike followforfollow',
 'photo fun selfie pool water sony camera picoftheday sun instagood boy cute outdoor']

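# **Pre-Processing with NLP**
The Keras `Tokenizer` maps each word to an integer index. Note that this is index encoding, not TF-IDF weighting.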

In [66]:
tokenizer = Tokenizer()
tokenizer.fit_on_texts(clean_text_without_stopword)
vocal = tokenizer.word_index
vocal_size = len(vocal) + 1


**The vocabulary size is 15742** (the number of unique words plus 1, because index 0 is reserved for padding)

In [68]:
vocal_size


15742
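
The idea behind `word_index` and the `+ 1` can be sketched in plain Python. This is a simplified stand-in, not the Keras `Tokenizer` itself: `build_word_index` and `texts_to_sequences` are hypothetical helpers, and the three toy lines are made up.

```python
from collections import Counter

def build_word_index(texts):
    """Give every word an index starting at 1, most frequent word first."""
    counts = Counter(word for line in texts for word in line.split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(texts, word_index):
    """Replace each known word with its integer index."""
    return [[word_index[w] for w in line.split() if w in word_index] for line in texts]

toy_texts = ["iphone cute iphone", "sony cute", "iphone"]
word_index = build_word_index(toy_texts)

# Index 0 is never assigned to a word, so the embedding needs len(word_index) + 1 rows.
vocab_size = len(word_index) + 1
sequences = texts_to_sequences(toy_texts, word_index)

print(word_index)
print(vocab_size)
print(sequences)
```

Frequent words get small indices, which matches the real output further down, where index 1 appears in many of the tweets.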

**Text to Vector**

In [69]:
clean_text_without_stopword = tokenizer.texts_to_sequences( clean_text_without_stopword )
clean_text_without_stopword[:10]

[[1853, 5521, 1060, 16, 57, 37, 23, 94, 74, 84, 82, 1],
 [53, 5522, 3547, 22, 68, 2223, 182, 8, 225, 5523],
 [13, 640, 5524, 3548, 376, 1, 93, 269, 965],
 [1854, 107, 5525, 77, 230, 1, 23, 5526, 75],
 [56, 253, 2, 640, 744, 2717, 342, 183, 306],
 [1, 426, 54, 641, 7, 200, 33, 183, 369],
 [35, 39, 205, 160, 39, 8, 225, 501],
 [4,
  706,
  161,
  87,
  316,
  427,
  2718,
  231,
  2224,
  4,
  72,
  3549,
  1428,
  3550,
  1624,
  5527,
  5528,
  2225,
  5529,
  259,
  966,
  1625,
  137,
  573,
  50,
  10,
  5,
  6,
  5530,
  3551],
 [1626, 168, 903, 30, 1, 2226, 30, 574, 1429],
 [18, 27, 63, 1430, 409, 8, 79, 95, 189, 32, 150, 23, 2227]]

**Padding is needed:**

The sequences do not all have the same length, because the tweets contain different numbers of words.

**N.B.:**
After applying padding, every sequence has the same length and the result is a sparse matrix (mostly zeros),
so we can apply the embedding technique.
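
As a rough illustration of post-padding, here is a NumPy sketch. `pad_post` is a hypothetical helper that only covers the `padding='post'` case with no truncation; it is not the Keras function.

```python
import numpy as np

def pad_post(sequences, value=0):
    """Right-pad every sequence with `value` up to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    padded = np.full((len(sequences), max_len), value, dtype=np.int32)
    for row, seq in enumerate(sequences):
        padded[row, :len(seq)] = seq   # copy the real tokens, leave the rest as padding
    return padded

toy_sequences = [[1, 2, 1], [3, 2], [1]]
padded = pad_post(toy_sequences)
print(padded)
```

Because 0 is used as the filler, it must not also be a real word index, which is why the tokenizer starts counting at 1.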

In [71]:
clean_text_without_stopword = pad_sequences(clean_text_without_stopword,padding='post')
clean_text_without_stopword[0:5]

array([[1853, 5521, 1060,   16,   57,   37,   23,   94,   74,   84,   82,
           1,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0],
       [  53, 5522, 3547,   22,   68, 2223,  182,    8,  225, 5523,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0],
       [  13,  640, 5524, 3548,  376,    1,   93,  269,  965,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0],
       [1854,  107, 5525,   77,  230,    1,   23, 5526,   75,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,  

# **Model building**

In [74]:
len(clean_text_without_stopword)

7920

In [113]:
clean_text_without_stopword = np.array(clean_text_without_stopword)
y = np.array(y)
X_train = clean_text_without_stopword[0:7000]
X_test  = clean_text_without_stopword[7000:]   # start at 7000 so that sample 7000 is not skipped
y_train = y[0:7000]
y_test = y[7000:]
input_length = X_train[0].shape[0]
input_length

39

In [130]:
from keras import Sequential
from keras.layers import Dense,SimpleRNN,Embedding,Flatten
model = Sequential()
#model.add(Embedding(17,output_dim=2,input_length= input_length))
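# Without an Embedding layer, each token index is fed to the RNN as one raw numeric feature per time step
model.add(SimpleRNN(64,input_shape=(input_length,1),return_sequences=False))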
model.add(Dense(1,activation='sigmoid'))
model.summary()

Model: "sequential_8"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 simple_rnn_7 (SimpleRNN)    (None, 64)                4224      
                                                                 
 dense_7 (Dense)             (None, 1)                 65        
                                                                 
Total params: 4,289
Trainable params: 4,289
Non-trainable params: 0
_________________________________________________________________
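
The 4,224 parameters of the recurrent layer can be understood from a minimal NumPy sketch of a simple RNN forward pass. This assumes the standard formulation h_t = tanh(x_t W + h_{t-1} U + b) with random, untrained weights; it is an illustration, not the Keras code.

```python
import numpy as np

def simple_rnn_last_state(x, W, U, b):
    """x has shape (timesteps, features); returns the final hidden state (units,)."""
    h = np.zeros(U.shape[0])
    for x_t in x:                       # one step per position in the sequence
        h = np.tanh(x_t @ W + h @ U + b)
    return h

units, features, timesteps = 64, 1, 39  # sizes taken from the summary above
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(features, units))  # input weights
U = rng.normal(scale=0.1, size=(units, units))     # recurrent weights
b = np.zeros(units)                                # bias

x = rng.normal(size=(timesteps, features))         # one toy input sequence
h_last = simple_rnn_last_state(x, W, U, b)

n_params = W.size + U.size + b.size
print(h_last.shape, n_params)
```

With `return_sequences=False` only the last hidden state is passed on, which is why the layer output shape is `(None, 64)`.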


In [131]:
# The second positional argument of compile is the loss, and 'accuracy' is a metric, so one call is enough
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])

In [127]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(clean_text_without_stopword, df['label'], test_size=0.4, random_state=1)

In [111]:
y_test.shape

(3168,)

In [132]:
hist = model.fit( X_train, y_train, epochs=50, validation_data=(X_test,y_test) )

Epoch 1/50
...
Epoch 50/50


**After 50 epochs, the model reaches 75.32% accuracy with 64 neurons in the hidden layer (l1), which is better than 128 neurons without embedding.**

In [137]:
model2 = Sequential()
model2.add(Embedding(vocal_size,output_dim=2,input_length= input_length))
model2.add(SimpleRNN(64,return_sequences=False))  # the input shape (39, 2) comes from the Embedding output
model2.add(Dense(1,activation='sigmoid'))
model2.summary()

Model: "sequential_11"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_7 (Embedding)     (None, 39, 2)             31484     
                                                                 
 simple_rnn_10 (SimpleRNN)   (None, 64)                4288      
                                                                 
 dense_10 (Dense)            (None, 1)                 65        
                                                                 
Total params: 35,837
Trainable params: 35,837
Non-trainable params: 0
_________________________________________________________________
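
The parameter counts in this summary can be reproduced by hand. The helper functions below are hypothetical names used only for this check; the formulas are the usual ones for an embedding table, a simple RNN and a dense layer with bias.

```python
def embedding_params(input_dim, output_dim):
    return input_dim * output_dim               # one vector per word index

def simple_rnn_params(features, units):
    return units * (features + units + 1)       # input weights + recurrent weights + bias

def dense_params(inputs, units):
    return units * (inputs + 1)                 # weights + bias

vocab_size = 15742  # value of vocal_size reported earlier
emb = embedding_params(vocab_size, 2)
rnn = simple_rnn_params(2, 64)
out = dense_params(64, 1)
total = emb + rnn + out
print(emb, rnn, out, total)
```

Most of the parameters sit in the embedding table, even with only 2 dimensions per word.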


In [139]:
# A single compile call: 'accuracy' is a metric, not a loss
model2.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
hist2 = model2.fit( X_train, y_train, epochs=50, validation_data=(X_test,y_test) )

Epoch 1/50
...
Epoch 50/50


After 50 epochs, the model reaches 98.82% training accuracy but only 82.3% validation accuracy with 64 neurons in the hidden layer (l1). **I think it is an overfitted model,** so let's try to reduce the overfitting.

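Overfitting can be due to:
1. Complexity of the model
2. Small data size
3. Few input features
4. Lack of regularization in the layers

The next model adds a Dropout layer and increases the number of neurons in the first layer. Note that a larger layer adds capacity, so on its own it does not reduce overfitting.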
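
One more common remedy, not used in this notebook, is to stop training once the validation loss stops improving. The sketch below shows the idea in plain Python; `early_stopping_epoch`, the `patience` value and the list of validation losses are all hypothetical.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (best_epoch, stop_epoch), both 0-based.

    Training stops after `patience` consecutive epochs without a new best loss.
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    stop_epoch = len(val_losses) - 1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                stop_epoch = epoch
                break
    return best_epoch, stop_epoch

# Made-up validation losses that improve and then get worse, as in an overfitting run.
val_losses = [0.60, 0.52, 0.48, 0.50, 0.53, 0.55, 0.58]
best_epoch, stop_epoch = early_stopping_epoch(val_losses, patience=3)
print(best_epoch, stop_epoch)
```

Keras offers this behaviour through the `EarlyStopping` callback, which could be passed to `fit` instead of always training for 50 epochs.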


In [186]:
model3 = Sequential()
model3.add(Embedding(vocal_size,output_dim=2,input_length= input_length))
model3.add(SimpleRNN(255,return_sequences=False))  # the input shape (39, 2) comes from the Embedding output
model3.add(Dense(128))
#model3.add(Dense(32))
model3.add(Dropout(0.2))
#model3.add(BatchNormalization())
model3.add(Dense(1,activation='sigmoid'))
model3.summary()

Model: "sequential_33"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_29 (Embedding)    (None, 39, 2)             31484     
                                                                 
 simple_rnn_32 (SimpleRNN)   (None, 255)               65790     
                                                                 
 dense_56 (Dense)            (None, 128)               32768     
                                                                 
 dropout_13 (Dropout)        (None, 128)               0         
                                                                 
 dense_57 (Dense)            (None, 1)                 129       
                                                                 
Total params: 130,171
Trainable params: 130,171
Non-trainable params: 0
_________________________________________________________________


In [187]:
# A single compile call: 'accuracy' is a metric, not a loss
model3.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
hist3 = model3.fit( X_train, y_train, epochs=50, validation_data=(X_test,y_test) )

Epoch 1/50
...
Epoch 50/50


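After 50 epochs, the model reaches 82.13% training accuracy and 76.86% validation accuracy with 255 neurons in the hidden layer (l1). The gap between the two is much smaller than for model2, so I think it is a more reliable model, although its validation accuracy is lower than model2's 82.3%. Evaluating both will show which one is better between model2 and model3.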
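
The comparison between the two models can be put into numbers using only the accuracies reported above. The gap between training and validation accuracy is used here as a rough overfitting indicator, which is a simplification.

```python
# (training accuracy, validation accuracy) in percent, as reported after 50 epochs.
results = {
    "model2": (98.82, 82.30),
    "model3": (82.13, 76.86),
}

# A large positive gap suggests the model memorised the training data.
gaps = {name: round(train - val, 2) for name, (train, val) in results.items()}

best_by_validation = max(results, key=lambda name: results[name][1])
smallest_gap = min(gaps, key=gaps.get)

print(gaps)
print(best_by_validation, smallest_gap)
```

So model3 overfits much less, while model2 still has the higher validation accuracy; which one is "better" depends on which of the two matters more.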

# **Model Evaluation**

Let’s take a look at the performance of the model. A Keras model has no `score` method, and the coefficient of determination is a regression metric, so for this binary classifier I’m going to use `evaluate`, which returns the loss followed by the compiled metrics (here, accuracy). The closer the accuracy is to 1, the better the model. First, let’s take a look at the result on the test data.

```
# Returns [loss, accuracy] on the test data
model3.evaluate(X_test, y_test)
```

**Of course, it would be better if the accuracy were closer to 1. Now, let’s see the result on the training data.**

```
# Returns [loss, accuracy] on the training data
model3.evaluate(X_train, y_train)
```

If the performance of the model on the training data is close to its performance on the test data, the model generalizes well. If the performance on the training data is much higher than on the test data, there is an overfitting problem. You may ask how to solve the overfitting problem. To reduce overfitting, you can use regularization: for a neural network this means dropout, L1 or L2 weight penalties (the counterparts of lasso and ridge in linear models), or early stopping.
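
Accuracy alone can be misleading when one class dominates; in the first ten labels shown earlier, 8 of 10 are 0. A confusion matrix with precision and recall gives a fuller picture. The sketch below uses those ten labels as `y_true`; the predictions in `y_hat` are made up for illustration and are not model output.

```python
import numpy as np

def confusion_counts(y_true, y_hat):
    """Return (tp, fp, fn, tn), treating label 1 as the positive class."""
    tp = int(np.sum((y_true == 1) & (y_hat == 1)))
    fp = int(np.sum((y_true == 0) & (y_hat == 1)))
    fn = int(np.sum((y_true == 1) & (y_hat == 0)))
    tn = int(np.sum((y_true == 0) & (y_hat == 0)))
    return tp, fp, fn, tn

y_true = np.array([0, 0, 0, 0, 1, 1, 0, 0, 0, 0])  # first ten labels from the dataset
y_hat = np.array([0, 0, 1, 0, 1, 0, 0, 0, 0, 0])   # hypothetical predicted labels

tp, fp, fn, tn = confusion_counts(y_true, y_hat)
accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)   # of the tweets flagged as label 1, how many really are
recall = tp / (tp + fn)      # of the real label-1 tweets, how many were found

print(tp, fp, fn, tn)
print(accuracy, precision, recall)
```

Here the accuracy looks good at 0.8, yet only half of the label-1 tweets are found, which is the kind of detail accuracy hides.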

Now let’s take a look at another metric, the mean squared error between the labels and the predicted probabilities (for a binary classifier this is known as the Brier score). For this, let’s first predict the test data with the `predict` method.

`y_pred = model3.predict(X_test)`

Now, let’s import the `mean_squared_error` metric.

`from sklearn.metrics import mean_squared_error`

I’m going to take the square root of this metric, so let me import the `math` module first.

`import math`

Let’s take a look at the root mean squared error.

`math.sqrt(mean_squared_error(y_test, y_pred))`

#Output:

This value is the root-mean-square distance between the predicted probability and the true 0/1 label; lower is better.
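
The same quantity can be computed directly with NumPy, which also shows a shape detail worth knowing: `predict` returns a column of shape `(n, 1)`, while the labels have shape `(n,)`. The numbers below are made up for illustration.

```python
import numpy as np

y_true = np.array([0, 0, 1, 1])                    # toy labels, shape (4,)
y_prob = np.array([[0.1], [0.6], [0.8], [0.4]])    # toy probabilities, shape (4, 1)

# Flatten the column first; subtracting (4,) from (4, 1) would broadcast to a (4, 4) array.
errors = y_true - y_prob.ravel()
rmse = float(np.sqrt(np.mean(errors ** 2)))
print(round(rmse, 4))

# The accidental broadcast, shown only to make the pitfall visible.
wrong_shape = (y_true - y_prob).shape
print(wrong_shape)
```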

# **Model Prediction**
Now, I’m going to predict the first row as an example. First, let’s select the first row of the training data.

`data_new = X_train[:1]`

Let me predict the data with our model.

`model3.predict(data_new)`

#Output:

Let’s take a look at the real value.

`y_train[:1]`

#Output:

The prediction is a probability between 0 and 1. A value above 0.5 corresponds to label 1 (negative) and a value below 0.5 to label 0 (positive), so it can be compared directly with the real label.