# This notebook demonstrates how to use the Python NLTK package for sentiment analysis.

## 2017 Dec Shilpa Jain

# Install Python NLTK package

NLTK is a natural language toolkit for building programs in Python that work with natural language text.
We will use NLTK for this course.

In [None]:
!pip install nltk --upgrade

## Import NLTK and download NLTK book collection

In [None]:
import nltk
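# download the "book" collection non-interactively (it includes the tokenizer models and stop word lists)
nltk.download('book')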


## Importing the book module (`from nltk.book import *`) loads all the items that you have just downloaded and prints a summary when it finishes.
From that output we can see that 9 texts and 9 sentences are loaded. For example, typing `text1` shows the title of the first text, and typing `sent3` shows the contents of the third sentence.

The cell below loads this notebook's own data, `textdata.csv`, from IBM Cloud Object Storage into a pandas DataFrame.

In [1]:

import sys
import types
import pandas as pd
from botocore.client import Config
import ibm_boto3

def __iter__(self): return 0

# @hidden_cell
# The following code accesses a file in your IBM Cloud Object Storage. It includes your credentials.
# You might want to remove those credentials before you share your notebook.
client_8760db995d144d1cab8bb99f8e30e4d7 = ibm_boto3.client(service_name='s3',
    ibm_api_key_id='<YOUR_IBM_API_KEY>',  # credential removed before sharing
    ibm_service_instance_id='<YOUR_SERVICE_INSTANCE_ID>',  # credential removed before sharing
    ibm_auth_endpoint="https://iam.ng.bluemix.net/oidc/token",
    config=Config(signature_version='oauth'),
    endpoint_url='https://s3-api.us-geo.objectstorage.service.networklayer.com')

body = client_8760db995d144d1cab8bb99f8e30e4d7.get_object(Bucket='textanalyticscourse291b8c2b2fd34be0aaeaf42d0c9becc5',Key='textdata.csv')['Body']
# add missing __iter__ method, so pandas accepts body as file-like object
if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )

df_data_1 = pd.read_csv(body)
df_data_1.head()



Unnamed: 0,Text
0,"The ""Big Brother"" of Singapore football will b..."
1,Mahfizur Rahman watched his friends turn to cr...
2,"The going has been tough, but the Football Ass..."
3,Having pushed reigning world and European cham...
4,SINGAPORE - Registration for the Standard Char...


In [2]:
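import nltk

docs = []
for idx, row in df_data_1.iterrows():
    # split each article into word and punctuation tokens
    tokens = nltk.word_tokenize(row['Text'])
    # wrap the tokens in an nltk.Text; after the loop, text2 holds the last article only
    text2 = nltk.Text(tokens)
    docs.append(tokens)

print(text2)
# show every occurrence of 'Singapore' (matched case-insensitively) with its surrounding context
text2.concordance('Singapore')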

<Text: SINGAPORE - Registration for the Standard Chartered Marathon...>
Displaying 1 of 1 matches:
                                   SINGAPORE - Registration for the Standard Ch


##### In NLTK, the `concordance` method lets us search for a word inside a text.
##### The `count` method returns the number of times a word occurs in a text.
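
To make these two operations concrete, here is a minimal plain-Python sketch of what a word count and a concordance view do. The helper names `count_word` and `concordance_lines` are hypothetical, and this is not NLTK's implementation; it assumes whitespace tokenization and a fixed context window.

```python
def count_word(tokens, word):
    """Return how many tokens equal `word` exactly (case-sensitive)."""
    return tokens.count(word)


def concordance_lines(tokens, word, window=3):
    """Return each case-insensitive match with up to `window` tokens of context per side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == word.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines


# whitespace tokenization is an assumption made to keep the sketch dependency-free
tokens = "SINGAPORE - Registration for the Standard Chartered Marathon".split()
print(count_word(tokens, "Registration"))
for line in concordance_lines(tokens, "singapore"):
    print(line)
```

The match is wrapped in brackets so that it stands out from its context, which is the same purpose the aligned column serves in the NLTK output above.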

## Using Word Counts to Obtain an Overview of a Collection
Assume that you have a large document collection. For example, it could be all the email enquiries
from the customers of a company in a particular month. It could be all the tweets published by a particular
user. It could also be all the works of fiction written by a particular author. Without going through every document
in the collection, how can you quickly get an idea of the major topics or themes these
documents cover?

NLTK provides a built-in class called `FreqDist` that makes this task very easy.
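
`FreqDist` subclasses Python's `collections.Counter`, so the core idea can be sketched with the standard library alone. The token list below is made up for illustration and is not taken from the dataset.

```python
from collections import Counter

# hypothetical tokens, standing in for the tokens of a real document
tokens = ["the", "event", "will", "be", "at", "the", "station", ".", "the", "event"]

# count how often each distinct token occurs
fdist = Counter(tokens)

print(fdist["the"])          # frequency of a single token
print(fdist.most_common(2))  # the two most frequent tokens with their counts
```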

In [3]:
from nltk import FreqDist
print (text2)
fdist=FreqDist(text2)
type(fdist)

<Text: SINGAPORE - Registration for the Standard Chartered Marathon...>


nltk.probability.FreqDist

##### There is a method called most_common() that can be conveniently used to show the most frequent words in a frequency distribution.

In [4]:
fdist.most_common(10)

[('the', 8),
 ('at', 7),
 ('will', 6),
 ('.', 6),
 (',', 5),
 ('be', 4),
 ('and', 3),
 (')', 3),
 ('event', 3),
 ('on', 3)]

#### Looking at the most frequent words, you realize that they are not so meaningful. Many words are so commonly used everywhere that they do not reveal anything about the particular document or document collection we are looking at. There are a number of ways to address this problem.

#### We will create a new list, text2_long_words, that contains only words with at least 5 characters.
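
As a minimal sketch of why this filtering helps, the example below counts words before and after dropping short tokens and stop words. `STOP_WORDS` is a tiny hand-written, hypothetical stand-in for NLTK's English stop word list, `keep_informative` is a hypothetical helper, and the tokens are made up. Unlike the notebook's own function, this sketch lowercases each token before the stop word check.

```python
from collections import Counter

# hypothetical mini stop list; NLTK's English list is much longer
STOP_WORDS = {"there", "their", "about", "which", "would"}


def keep_informative(tokens, min_len=5):
    """Keep tokens with at least `min_len` characters that are not stop words."""
    return [t for t in tokens if len(t) >= min_len and t.lower() not in STOP_WORDS]


tokens = ["There", "will", "be", "a", "registration", "event", "at", "the",
          "station", ",", "registration", "opens", "Friday"]

filtered = keep_informative(tokens)
print(filtered)
print(Counter(filtered).most_common(1))
```

Short function words and punctuation disappear, so the remaining counts are dominated by content words.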

In [5]:
from string import punctuation

import gensim
from gensim import corpora
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()


def clean_data(text2):
    """Filter, stem, and clean a sequence of tokens.

    Returns a gensim Dictionary built from the stemmed tokens and the cleaned token list.
    """
    # keep only tokens with at least 5 characters
    text2_long_words = [w for w in text2 if len(w) >= 5]
    # drop English stop words
    stop_list = stopwords.words('english')
    text2_stopremoved = [w for w in text2_long_words if w not in stop_list]
    # reduce each word to its stem
    text2_stemmed = [stemmer.stem(w) for w in text2_stopremoved]
    # the input to corpora.Dictionary is a list of lists of tokens
    doc = [text2_stemmed]
    dictionary = corpora.Dictionary(doc)
    # remove punctuation from each token
    table = str.maketrans('', '', punctuation)
    tokens = [w.translate(table) for w in text2_stemmed]
    # remove remaining tokens that are not alphabetic
    tokens = [word for word in tokens if word.isalpha()]
    return dictionary, tokens
    
    

In [6]:
dictionary,vocab=clean_data(text2)
#print (dictionary)
print (vocab)


['singapor', 'registr', 'standard', 'charter', 'marathon', 'begin', 'friday', 'registr', 'event', 'raffl', 'place', 'outsid', 'raffl', 'place', 'station', 'first', 'peopl', 'regist', 'event', 'receiv', 'limit', 'edit', 'goodi', 'privileg', 'includ', 'discount', 'prioriti', 'signup', 'run', 'clinic', 'transport', 'design', 'pickup', 'point', 'there', 'dedic', 'entri', 'collect', 'counter', 'standard', 'charter', 'give', 'prize', 'watch', 'membership', 'event', 'registr', 'avail', 'onlin', 'offici', 'websit', 'wwwmarathonsingaporecom', 'payoh', 'sport', 'recreat', 'centr']


#### Checking the frequency distribution and the most common words again on the cleaned list gives more sensible results and shows the key terms of the document.

In [7]:
fdist=FreqDist(vocab)
m=fdist.most_common(10)

## Import Brunel library for visualization

In [8]:
import brunel

### Create a dataframe to visualize the common words as a tag cloud using Brunel package.

In [9]:
import pandas as pd
df = pd.DataFrame(columns=['word', 'freq'])
for i in m:
    df.loc[len(df)] = i
    
print (df)
        

       word  freq
0     event   3.0
1   registr   3.0
2     place   2.0
3   charter   2.0
4     raffl   2.0
5  standard   2.0
6  prioriti   1.0
7     centr   1.0
8     begin   1.0
9     payoh   1.0


## Tag cloud of most common words

In [10]:
%%brunel cloud color(freq) size(freq) sort(freq)
label(word) style('font-size:200px;font-family:Impact') legends(none) :: width = 600, height=600

<IPython.core.display.Javascript object>

## Keras


In [11]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Flatten
from keras.layers import Embedding
from keras.layers import Conv1D
from keras.layers import MaxPooling1D
 

Using TensorFlow backend.


In [12]:
print (len(docs))
train_docs=[]
for d in docs:
    train_docs.append(" ".join(d))
print ((train_docs))

5
["The `` Big Brother '' of Singapore football will be back , but not immediately , and not for long . In an exclusive interview with The New Paper , Persib Bandung striker Noh Alam Shah said he has agreed to sign a short-term deal with former club Tampines Rovers until the end of the season . But the 31-year-old said : `` Beyond that , I feel my future is still in Indonesia . `` I feel really appreciated here . Four Indo clubs already made me offers for the next season , which starts next January . '' The move to Singapore still hinges on whether Tampines can secure his medical documents and International Transfer Certificate from the Indonesia FA before the transfer window closes today , although the Stags are optimistic . If there are no surprises , Alam Shah will return to Singapore after July 11 , after Persib play their final Indonesia Super League ( ISL ) match against champions Sriwijaya . Said the striker : `` Tampines have always been very close to my heart and I 'm thankful

In [13]:
#create the tokenizer
tokenizer = Tokenizer()
# fit the tokenizer on the documents
tokenizer.fit_on_texts(train_docs)

In [21]:
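# encode each document as one row of TF-IDF weights, with one column per vocabulary index
# note: this is a bag-of-words matrix, not a sequence of word indices
encoded_docs = tokenizer.texts_to_matrix(train_docs,mode='tfidf')
encoded_docs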

array([[ 0.        ,  2.70684242,  2.50667111, ...,  0.        ,
         0.        ,  0.        ],
       [ 0.        ,  2.58098477,  2.05958598, ...,  0.        ,
         0.        ,  0.        ],
       [ 0.        ,  2.77823493,  2.45152986, ...,  0.        ,
         0.        ,  0.        ],
       [ 0.        ,  2.77823493,  2.00181507, ...,  0.        ,
         0.        ,  0.        ],
       [ 0.        ,  1.93795229,  0.6061358 , ...,  1.25276297,
         1.25276297,  1.25276297]])

In [22]:
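# pad or truncate every row to max_length, the token count of the longest document
# pad_sequences casts the values to int32 by default, so the TF-IDF weights lose their fractional part
max_length = max([len(s.split()) for s in train_docs])
Xtrain = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
Xtrain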

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 2, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 1, 1, 1]], dtype=int32)
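
The `Embedding` layer used later in this notebook looks up one vector per integer word index, so it is worth seeing what an index encoding with post-padding looks like. The following is a minimal plain-Python sketch under stated assumptions: the helpers `build_word_index`, `encode`, and `pad_post` are hypothetical, tokenization is a simple lowercase whitespace split, and this is not the Keras implementation.

```python
from collections import Counter


def build_word_index(texts):
    """Map each distinct token to an integer starting at 1, most frequent first."""
    counts = Counter(tok for text in texts for tok in text.lower().split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}


def encode(text, word_index):
    """Replace each known token with its integer index; unknown tokens are skipped."""
    return [word_index[tok] for tok in text.lower().split() if tok in word_index]


def pad_post(sequences, maxlen):
    """Append zeros up to `maxlen`; longer sequences keep their last `maxlen` items."""
    return [seq[-maxlen:] + [0] * (maxlen - len(seq)) for seq in sequences]


texts = ["the cat sat", "the dog"]  # made-up documents
word_index = build_word_index(texts)
sequences = [encode(t, word_index) for t in texts]
print(word_index)
print(pad_post(sequences, maxlen=3))
```

Index 0 never appears in `word_index`, which is what lets it act as the padding value.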

In [23]:
from numpy import array
# define training labels: the first two documents are class 0, the remaining three are class 1
ytrain = array([0 for _ in range(2)] + [1 for _ in range(3)])
print (ytrain)

[0 0 1 1 1]


In [24]:
# define vocabulary size (largest word index + 1, because index 0 is reserved)
vocab_size = len(tokenizer.word_index) + 1
vocab_size

896

In [25]:
model = Sequential()
model.add(Embedding(vocab_size, 100, input_length=max_length))
model.add(Conv1D(filters=32, kernel_size=8, activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(Flatten())
model.add(Dense(10, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
print(model.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 695, 100)          89600     
_________________________________________________________________
conv1d_2 (Conv1D)            (None, 688, 32)           25632     
_________________________________________________________________
max_pooling1d_2 (MaxPooling1 (None, 344, 32)           0         
_________________________________________________________________
flatten_2 (Flatten)          (None, 11008)             0         
_________________________________________________________________
dense_3 (Dense)              (None, 10)                110090    
_________________________________________________________________
dense_4 (Dense)              (None, 1)                 11        
Total params: 225,333
Trainable params: 225,333
Non-trainable params: 0
_________________________________________________________________
None
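
The shapes and parameter counts in the summary above can be checked by hand. The sketch below redoes the arithmetic in plain Python, assuming the default stride of 1 and no padding for the convolution; the variable names are ours and are not part of Keras.

```python
vocab_size, embed_dim, seq_len = 896, 100, 695
filters, kernel_size, pool_size = 32, 8, 2

# Embedding: one embed_dim-sized vector per vocabulary index
embedding_params = vocab_size * embed_dim

# Conv1D: each filter spans kernel_size steps across embed_dim channels, plus one bias
conv_len = seq_len - kernel_size + 1
conv_params = kernel_size * embed_dim * filters + filters

# MaxPooling1D halves the sequence length; Flatten merges length and channels
pooled_len = conv_len // pool_size
flat_units = pooled_len * filters

# Dense layers: inputs * units weights, plus one bias per unit
dense1_params = flat_units * 10 + 10
dense2_params = 10 * 1 + 1

total_params = embedding_params + conv_params + dense1_params + dense2_params
print(conv_len, pooled_len, flat_units)
print(embedding_params, conv_params, dense1_params, dense2_params, total_params)
```

Almost half of the weights sit in the first `Dense` layer, because flattening produces 11,008 inputs.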

In [26]:
# compile network
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit network
model.fit(Xtrain, ytrain, epochs=10, verbose=2)

Epoch 1/10
0s - loss: 0.6937 - acc: 0.2000
Epoch 2/10
0s - loss: 0.6900 - acc: 0.6000
Epoch 3/10
0s - loss: 0.6778 - acc: 0.6000
Epoch 4/10
0s - loss: 0.6453 - acc: 0.6000
Epoch 5/10
0s - loss: 0.5998 - acc: 0.6000
Epoch 6/10
0s - loss: 0.5308 - acc: 0.8000
Epoch 7/10
0s - loss: 0.4470 - acc: 1.0000
Epoch 8/10
0s - loss: 0.3522 - acc: 1.0000
Epoch 9/10
0s - loss: 0.2635 - acc: 1.0000
Epoch 10/10
0s - loss: 0.1833 - acc: 1.0000


<keras.callbacks.History at 0x7fbfdc7439e8>
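
The first-epoch loss of 0.6937 is close to ln 2 ≈ 0.6931, the binary cross-entropy of a model that predicts 0.5 for every example, and an accuracy of 0.6 is what predicting class 1 for all five documents would give. The sketch below reproduces both figures in plain Python; the helper names are hypothetical and the predictions are made up, so this only illustrates the metrics and does not replay the training run.

```python
import math


def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy, with predictions clipped away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


def accuracy(y_true, y_pred, threshold=0.5):
    """Fraction of examples whose thresholded prediction equals the label."""
    hits = sum(1 for y, p in zip(y_true, y_pred) if (p > threshold) == bool(y))
    return hits / len(y_true)


y_true = [0, 0, 1, 1, 1]  # the same label layout as ytrain

print(binary_crossentropy(y_true, [0.5] * 5))  # an undecided model
print(accuracy(y_true, [0.6] * 5))             # a model that always leans towards class 1
```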

In [27]:
# evaluate on the training data (no separate test set is used here, so this is training accuracy)
loss, acc = model.evaluate(Xtrain, ytrain, verbose=0)
print('Test Accuracy: %f' % (acc*100))

Test Accuracy: 100.000000


In [31]:
import numpy as np
TEXT='I like you'
pre_doc=[]
data_pre=[]
# clean_data expects a list of tokens, so tokenize the raw string first
dic,pre_tok=clean_data(nltk.word_tokenize(TEXT))
pre_doc.append(pre_tok)
# every word in TEXT is shorter than 5 characters, so clean_data leaves no tokens
print (pre_doc)
for d in pre_doc:
    data_pre.append(" ".join(d))
# must match the input_length the model was built with
MAX_SEQUENCE_LENGTH=695
print (model)
# encode and pad the new text in the same way as the training documents
SEQUENCES = tokenizer.texts_to_matrix(data_pre,mode='tfidf')
DATA = pad_sequences(SEQUENCES, maxlen=MAX_SEQUENCE_LENGTH,padding='post')
PREDICTION = model.predict(DATA,verbose=0)

# round the sigmoid output to the nearest class label (0 or 1)
round(PREDICTION[0,0])

[[]]
<keras.models.Sequential object at 0x7fc05c0cf668>


1.0