## Code 3-29 to Code 3-47

In [1]:
import pandas as pd
from sklearn.linear_model import LogisticRegression
import nltk
import warnings
import stop_words
warnings.filterwarnings('ignore')

In [2]:
t1 = pd.read_csv("lexicon_sent_processed.csv")

In [3]:
tgt = t1.loc[:,"score_bkt"]

ML model for sentiment contains 2 sets of features
1. Text Features
We want to create a generic sentiment analysis module - we select only words that are relevant to emotions or sentiment from the corpus. This is an important step as using all the words  (in our case pet food names, candy names etc) could bias the model to learn patterns that associate product and sentiment. We create sets of word features - positive/negative words, adjectives and selecting only “stop words”. 

2. Numeric Features
Lexical features created so far

In [4]:
### Get list of words to string

In [5]:
def cnv_str(x):
    x1 = list(eval(x))
    x2 = ' '.join(x1)
    return x2

In [6]:
### get Adjectives

In [7]:
def filter_pos(fltr,sent_list):
    str1 = ""
    for i in sent_list:
        if(i[1]=="JJ"):
            str1 = str1 + i[0].lower() + " "
    return str1
    

In [8]:
### Get stop words in a sentence

In [9]:
def get_stop_words(sent):
    list1 = set(sent.split())
    st_comm = list(list1 & st_set)
    st_comm = ' '.join(st_comm)
    return st_comm

Making the text corpus from 
1. positive words, negative words identified (using lexicons)
2. Stop words in that sentence
3. Adjectives found in that sentence

These will be treated as text features in the model

In [10]:
t1["pos_set1"] = t1["pos_set"].apply(cnv_str)
t1["neg_set1"] = t1["neg_set"].apply(cnv_str)
t1["pos_neg_comb"] = t1["pos_set1"] + " " + t1["neg_set1"]

get_pos_tags = nltk.pos_tag_sents(t1["Text"].str.split())

str_sel_list = []
for i in get_pos_tags:
    str_sel = filter_pos("JJ",i)
    str_sel_list.append(str_sel)
    
t1["pos_neg_comb_adj"] = t1["pos_neg_comb"] + str_sel_list

st1 = stop_words.get_stop_words('en')
st_set = set(st1)
onl_stop_words = t1["full_txt"].apply(get_stop_words)

t1["pos_neg_comb_adj_st"] = t1["pos_neg_comb_adj"] + onl_stop_words



In [11]:
nltk.__version__

'3.4.3'

In [12]:
### Split data into train and test

In [13]:
from sklearn.model_selection import StratifiedShuffleSplit
sss = StratifiedShuffleSplit(test_size=0.8,random_state=42,n_splits=1)

for train_index, test_index in sss.split(t1, tgt):
    x_train, x_test = t1[t1.index.isin(train_index)], t1[t1.index.isin(test_index)]
    y_train, y_test = t1.loc[t1.index.isin(train_index),"score_bkt"], t1.loc[t1.index.isin(test_index),"score_bkt"]

In [14]:
### Variables created in "1_Lexicon based Sentiment_ver2_vader.ipynb" file and the text corpus created in line 7

In [15]:
inde_vars = ["sent_len","pos_score","neg_score","neg_num_pos_count","neg_num_neg_count", 'boost_num_pos_count', 'boost_num_neg_count', 'neg_num_pos_count_sum',
       'neg_num_neg_count_sum', 'boost_num_pos_count_sum',
       'boost_num_neg_count_sum', 'excl_num_pos_count', 'excl_num_neg_count',
       'excl_num_pos_count_sum', 'excl_num_neg_count_sum'
            ]
x_train1 = x_train[inde_vars]
x_test1 = x_test[inde_vars]

In [16]:
x_train1.head()

Unnamed: 0,sent_len,pos_score,neg_score,neg_num_pos_count,neg_num_neg_count,boost_num_pos_count,boost_num_neg_count,neg_num_pos_count_sum,neg_num_neg_count_sum,boost_num_pos_count_sum,boost_num_neg_count_sum,excl_num_pos_count,excl_num_neg_count,excl_num_pos_count_sum,excl_num_neg_count_sum
5,49.0,3.5,8.0,0,0,0,1,0,0,0,1,0,0,0,0
6,86.0,2.0,4.0,0,0,0,0,0,0,0,0,0,0,0,0
8,111.0,8.0,0.0,0,0,0,0,0,0,0,0,1,0,0,0
49,38.0,7.5,1.0,0,0,0,0,0,0,0,0,1,0,0,0
51,24.0,3.0,0.0,0,0,1,0,0,0,0,0,0,0,0,0


In [17]:
### Applying TFIDF for text feature created in line 7

In [18]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(min_df=0.001,analyzer=u'word',ngram_range=(1,1))
tfidf_matrix_tr = tfidf_vectorizer.fit_transform(x_train["pos_neg_comb_adj_st"])

tfidf_matrix_te = tfidf_vectorizer.transform(x_test["pos_neg_comb_adj_st"])

x_train2= tfidf_matrix_tr.todense()
x_test2 = tfidf_matrix_te.todense()

In [19]:
### Combining numeric feature and text feature

In [20]:
import numpy as np
x_train3 = np.concatenate([x_train1,x_train2],axis=1)
x_test3 = np.concatenate([x_test1,x_test2],axis=1)

In [21]:
x_train3.shape, x_test3.shape

((11368, 1194), (45475, 1194))

In [None]:
#### Scaling features

In [24]:

from sklearn import preprocessing
min_max_scaler = preprocessing.MinMaxScaler()
x_train_scaled = min_max_scaler.fit_transform(x_train3)
x_test_scaled = min_max_scaler.transform(x_test3)

In [25]:
#### Feature selection

In [26]:
from sklearn.feature_selection import SelectPercentile, f_classif
selector = SelectPercentile(f_classif, percentile=40)
selector.fit(x_train3,y_train)
x_train4 = selector.fit_transform(x_train_scaled,y_train)
x_test4 = selector.transform(x_test_scaled)

In [27]:
x_train4.shape

(11368, 478)

Encoding Dependent variables
We also need to convert the categorical target variable into an encoded into one hot encoding form. 
One hot encoding is a form where each level is represented by the absence of the other levels and the presence of that level. 
For eg Positive can be represented as 100,negative 010 and neutral 001

In [28]:
from sklearn.preprocessing import LabelEncoder
from keras.utils import np_utils
le = LabelEncoder()
y_train1 = le.fit_transform(y_train)
y_train2 = np_utils.to_categorical(y_train1)
y_test1 = le.transform(y_test)

Using TensorFlow backend.


In [29]:
y_train2.shape

(11368, 3)

In [30]:
###Building neural network model

In [31]:
import keras
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Input, Dense, Dropout
from keras.models import Model
from keras.utils import to_categorical
from keras.optimizers import Adam

In [32]:
def get_nn_mod(list_layers,dp):
    model = Sequential()
    model.add(Dense(list_layers[0], input_dim=x_train4.shape[1], activation='tanh', kernel_initializer='lecun_uniform'))
    model.add(Dropout(dp))

    for i in list_layers[1:]:
        model.add(Dense(i, input_dim=x_train4.shape[1], activation='tanh'))
        model.add(Dropout(dp))
        
    model.add(Dense(3, activation='softmax'))
    opt = Adam(0.0001)
# Compile model
    model.compile(optimizer=opt, loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model

In [33]:
list_layers = [500,200,100,50]
class_weight = {0:0.2,1:0.6,2:0.2}

model = get_nn_mod(list_layers,0.1)

Instructions for updating:
Colocations handled automatically by placer.
Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.


In [34]:
model.fit(x_train4,y_train2, batch_size=100, epochs=20,class_weight=class_weight,
          verbose=2,validation_split=0.2)

Instructions for updating:
Use tf.cast instead.
Train on 9094 samples, validate on 2274 samples
Epoch 1/20
 - 20s - loss: 0.1863 - acc: 0.7646 - val_loss: 0.1622 - val_acc: 0.7814
Epoch 2/20
 - 1s - loss: 0.1450 - acc: 0.8023 - val_loss: 0.1352 - val_acc: 0.8012
Epoch 3/20
 - 1s - loss: 0.1286 - acc: 0.8223 - val_loss: 0.1288 - val_acc: 0.8127
Epoch 4/20
 - 1s - loss: 0.1214 - acc: 0.8302 - val_loss: 0.1271 - val_acc: 0.8113
Epoch 5/20
 - 1s - loss: 0.1165 - acc: 0.8386 - val_loss: 0.1290 - val_acc: 0.8329
Epoch 6/20
 - 1s - loss: 0.1136 - acc: 0.8431 - val_loss: 0.1278 - val_acc: 0.8228
Epoch 7/20
 - 1s - loss: 0.1104 - acc: 0.8468 - val_loss: 0.1284 - val_acc: 0.8232
Epoch 8/20
 - 1s - loss: 0.1097 - acc: 0.8457 - val_loss: 0.1295 - val_acc: 0.8232
Epoch 9/20
 - 1s - loss: 0.1075 - acc: 0.8500 - val_loss: 0.1317 - val_acc: 0.8281
Epoch 10/20
 - 1s - loss: 0.1065 - acc: 0.8479 - val_loss: 0.1335 - val_acc: 0.8166
Epoch 11/20
 - 1s - loss: 0.1063 - acc: 0.8527 - val_loss: 0.1311 - val_

<keras.callbacks.History at 0x1539ab00>

In [15]:
### Measuring model performance

In [35]:
pred=model.predict_classes(x_test4)
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
ac1 = accuracy_score(y_test1, pred)
print (ac1, f1_score(y_test1, pred, average='macro'))

0.8069488730071468 0.5860706406689715


In [38]:
from sklearn.metrics import confusion_matrix
rows_name = t1["score_bkt"].unique()
pred_inv = le.inverse_transform(pred)

cmat = pd.DataFrame(confusion_matrix(y_test, pred_inv, labels=rows_name, sample_weight=None))
cmat.columns = rows_name 
cmat["act"] = rows_name
cmat

Unnamed: 0,pos,neu,neg,act
0,32422,2291,867,pos
1,1649,1234,502,neu
2,2082,1388,3040,neg
