Selections of code borrowed from: https://towardsdatascience.com/cnn-sentiment-analysis-1d16b7c5a0e7

1. Upload the poli_data_format.csv file and the pretrained Indonesian word2vec model, id.bin, to this runtime. The model is available at https://drive.google.com/file/d/0B0ZXk88koS2KQWxEemNNUHhnTWc/view (credit: https://github.com/Kyubyong/wordvectors).
2. Clone the cleaned Indonesian tweet dataset and download the Indonesian stopword list:

In [4]:
!git clone https://github.com/ridife/dataset-idsa.git

fatal: destination path 'dataset-idsa' already exists and is not an empty directory.


In [5]:
!wget "https://raw.githubusercontent.com/stopwords-iso/stopwords-id/master/stopwords-id.txt"

--2020-05-06 01:40:57--  https://raw.githubusercontent.com/stopwords-iso/stopwords-id/master/stopwords-id.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6444 (6.3K) [text/plain]
Saving to: ‘stopwords-id.txt.1’


2020-05-06 01:40:57 (82.8 MB/s) - ‘stopwords-id.txt.1’ saved [6444/6444]



3. Import necessary packages and download NLTK data:

In [6]:
from __future__ import division, print_function
from gensim import models
from keras.callbacks import ModelCheckpoint
from keras.layers import Dense, Dropout, Reshape, Flatten, concatenate, Input, Conv1D, GlobalMaxPooling1D, Embedding
from keras.layers.recurrent import LSTM
from keras.models import Sequential
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Model
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd
import os
import collections
import re
import string
import nltk

nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

4. Load the two datasets, combine them, and reduce the labels to a binary target:

In [0]:
df = pd.read_csv('/content/dataset-idsa/Indonesian Sentiment Twitter Dataset Labeled.csv', sep='\t', header=0)
df2 = pd.read_csv('poli_data_format.csv', sep='\t', header=0)
df = pd.concat([df, df2])

In [0]:
df.columns = ['Label', 'Tweet']
df = df[df.Label != 0]  # binary classification, so drop the neutral tweets (label 0)
df = df.reset_index(drop=True)
df['Label'] = [1 if i == 1 else 0 for i in df.Label]  # positive -> 1, all remaining labels -> 0

5. Preprocess the tweets: strip punctuation, lowercase, tokenize, and remove stopwords:

In [0]:
def prep(text):
  # escape the punctuation so that ']', '\' and '^' are matched literally inside the character class
  prepped_text = re.sub('[' + re.escape(string.punctuation) + ']', '', text)
  return prepped_text.lower()

df['Tweet'] = df['Tweet'].apply(prep)
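The punctuation pattern can be checked in isolation. The sketch below uses only the standard library and an invented sample string. It compares a character class built directly from `string.punctuation` with one built from `re.escape(string.punctuation)`. In the unescaped form the sequence `\]` is read as an escaped bracket, so a literal backslash in the text is left in place.

```python
import re
import string

# Made-up sample text containing brackets, a backslash and a caret.
sample = 'bagus [banget] a\\b ^_^'

# Character class built directly from string.punctuation (unescaped).
naive_pattern = re.compile('[' + string.punctuation + ']')
# Same class with every special character escaped.
escaped_pattern = re.compile('[' + re.escape(string.punctuation) + ']')

naive = naive_pattern.sub('', sample)
escaped = escaped_pattern.sub('', sample)

print(repr(naive))    # the backslash survives
print(repr(escaped))  # the backslash is removed
```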

In [0]:
# tokenize tweets; English tokenizer, but Indonesian has similar enough tokenization rules
tokens = [nltk.word_tokenize(sentence) for sentence in df.Tweet]

In [0]:
# load the Indonesian stopwords, one per line
stoplist = []
with open('/content/stopwords-id.txt', 'r', encoding='utf-8') as inf:
  for line in inf:
    # strip() drops the newline without truncating a final line that has none
    line = line.strip()
    if line:
      stoplist.append(line)

In [0]:
def remove_stopwords(tokens, stoplist):
  return [token for token in tokens if token not in stoplist]

filtered_tokens = [remove_stopwords(sentence, stoplist) for sentence in tokens]

df['Tweet'] = [' '.join(sentence) for sentence in filtered_tokens]
df['Tokens'] = filtered_tokens
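As a quick illustration of the filtering step, here is a minimal sketch on toy data. The three-word stoplist and the two tokenized sentences are invented for the example. A `set` is used instead of a list only because membership tests are faster; the notebook's list behaves the same way.

```python
# Hypothetical mini stoplist; the notebook reads the real one from stopwords-id.txt.
stopset = {'yang', 'dan', 'di'}

def remove_stopwords(tokens, stopset):
    """Keep only the tokens that are not in the stopword set."""
    return [token for token in tokens if token not in stopset]

# Invented, already-tokenized sentences.
tokenized = [['saya', 'dan', 'kamu'],
             ['makan', 'di', 'rumah', 'yang', 'besar']]

filtered = [remove_stopwords(sentence, stopset) for sentence in tokenized]
print(filtered)
```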

6. Set up one-hot encoded columns in dataframe:

In [0]:
# set up one-hot encoding of labels
pos = []
neg = []
for l in df.Label:
    if l == 0:
        pos.append(0)
        neg.append(1)
    elif l == 1:
        pos.append(1)
        neg.append(0)

df['Pos']= pos
df['Neg']= neg
df = df[['Tweet', 'Tokens', 'Label', 'Pos', 'Neg']]
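The same two columns can also be derived without an explicit loop. The following is a sketch in NumPy on an invented label array, not the notebook's own code. Column 0 plays the role of `Pos` and column 1 the role of `Neg`.

```python
import numpy as np

# Invented binary labels: 1 = positive, 0 = negative.
labels = np.array([1, 0, 0, 1])

pos = (labels == 1).astype(int)   # 1 where the tweet is positive
neg = 1 - pos                     # complement: 1 where the tweet is negative

# Stack into an (n_samples, 2) matrix in the order [Pos, Neg].
one_hot = np.stack([pos, neg], axis=1)
print(one_hot)
```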

7. Split for train and test:

In [0]:
df_train, df_test = train_test_split(df, test_size=0.10, random_state=42)

In [84]:
df_train

index,Tweet,Tokens,Label,Pos,Neg
1670,tersilap fikir,"[tersilap, fikir]",0,0,1
5335,suka dgr lagu mcm zaman saloma lagu irama mela...,"[suka, dgr, lagu, mcm, zaman, saloma, lagu, ir...",1,1,0
5008,peliharalah dirimu siksaan yang khusus menimpa...,"[peliharalah, dirimu, siksaan, yang, khusus, m...",1,1,0
3622,jejak digital menyakitkan menusuk2,"[jejak, digital, menyakitkan, menusuk2]",0,0,1
3328,berlatih sempurna manusia yang sempurna susah ...,"[berlatih, sempurna, manusia, yang, sempurna, ...",0,0,1
...,...,...,...,...,...
3772,gak kasian yang laju berangkat,"[gak, kasian, yang, laju, berangkat]",0,0,1
5191,hahahaahhahahahahahahahaa la,"[hahahaahhahahahahahahahaa, la]",1,1,0
5226,peduli apapun tanggapan orang tentangku terser...,"[peduli, apapun, tanggapan, orang, tentangku, ...",1,1,0
5390,kemanusiaan derajatx hati simpati kemanusiaan ...,"[kemanusiaan, derajatx, hati, simpati, kemanus...",1,1,0


7.1 Determine maximum train/test sentence length and number of words:

In [85]:
all_training_words = [word for tokens in df_train['Tokens'] for word in tokens]
training_sentence_lengths = [len(tokens) for tokens in df_train['Tokens']]
TRAINING_VOCAB = sorted(list(set(all_training_words)))
print('{} words total, with a vocabulary size of {}'.format(len(all_training_words), len(TRAINING_VOCAB)))
print('Max sentence length is {}'.format(max(training_sentence_lengths)))

44662 words total, with a vocabulary size of 12805
Max sentence length is 25


In [86]:
all_test_words = [word for tokens in df_test['Tokens'] for word in tokens]
test_sentence_lengths = [len(tokens) for tokens in df_test['Tokens']]
TEST_VOCAB = sorted(list(set(all_test_words)))
print('{} words total, with a vocabulary size of {}'.format(len(all_test_words), len(TEST_VOCAB)))
print('Max sentence length is {}'.format(max(test_sentence_lengths)))

5117 words total, with a vocabulary size of 2813
Max sentence length is 21


8. Load the pretrained Word2Vec model:

In [87]:
word2vec_path = '/content/id.bin'
word2vec = models.Word2Vec.load(word2vec_path)


9. Get the Word2Vec embeddings; if a word cannot be found, generate a random vector for that word:

In [0]:
def get_average_word2vec(tokens_list, vector, generate_missing=False, k=300):
    if len(tokens_list)<1:
        return np.zeros(k)
    if generate_missing:
        vectorized = [vector[word] if word in vector else np.random.rand(k) for word in tokens_list]
    else:
        vectorized = [vector[word] if word in vector else np.zeros(k) for word in tokens_list]
    length = len(vectorized)
    summed = np.sum(vectorized, axis=0)
    averaged = np.divide(summed, length)
    return averaged

def get_word2vec_embeddings(vectors, clean_comments, generate_missing=False):
    embeddings = clean_comments['Tokens'].apply(lambda x: get_average_word2vec(x, vectors, 
                                                                                generate_missing=generate_missing))
    return list(embeddings)
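To make the averaging concrete, here is a small sketch with a plain dictionary standing in for the Word2Vec model. The two 3-dimensional vectors and the helper name `average_vector` are invented for illustration. Unknown words fall back to zeros here, or to random values if a NumPy random generator is passed in.

```python
import numpy as np

def average_vector(tokens, vectors, dim, rng=None):
    """Average the vectors of `tokens`.

    Unknown words contribute zeros, or random values in [0, 1)
    when a NumPy random generator `rng` is supplied.
    """
    if not tokens:
        return np.zeros(dim)
    rows = []
    for token in tokens:
        if token in vectors:
            rows.append(vectors[token])
        elif rng is not None:
            rows.append(rng.random(dim))
        else:
            rows.append(np.zeros(dim))
    return np.mean(rows, axis=0)

# Toy "model": two words with 3-dimensional vectors.
toy = {'baik': np.array([1.0, 0.0, 2.0]),
       'buruk': np.array([3.0, 4.0, 0.0])}

avg_known = average_vector(['baik', 'buruk'], toy, dim=3)
avg_missing = average_vector(['baik', 'xyz'], toy, dim=3)  # 'xyz' is unknown
avg_empty = average_vector([], toy, dim=3)
print(avg_known, avg_missing, avg_empty)
```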

In [89]:
training_embeddings = get_word2vec_embeddings(word2vec, df_train, generate_missing=True)
MAX_SEQUENCE_LENGTH = 28
EMBEDDING_DIM = 300


10. Tokenize and pad the word sequences for both train and test:

In [90]:
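# num_words caps the indices kept by texts_to_sequences; rarer words are dropped from the sequences
tokenizer = Tokenizer(num_words=len(TRAINING_VOCAB), lower=True, char_level=False)
# note: the tokenizer is fit on the full dataset (train and test), so the word index also covers test-only words
tokenizer.fit_on_texts(df['Tweet'].tolist())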
training_sequences = tokenizer.texts_to_sequences(df_train['Tweet'].tolist())

train_word_index = tokenizer.word_index
print('Found {} unique tokens.'.format(len(train_word_index)))

Found 13688 unique tokens.


In [0]:
train_cnn_data = pad_sequences(training_sequences, maxlen=MAX_SEQUENCE_LENGTH)

In [92]:
train_embedding_weights = np.zeros((len(train_word_index)+1, EMBEDDING_DIM))
for word,index in train_word_index.items():
    train_embedding_weights[index,:] = word2vec[word] if word in word2vec else np.random.rand(EMBEDDING_DIM)




In [0]:
test_sequences = tokenizer.texts_to_sequences(df_test['Tweet'].tolist())
test_cnn_data = pad_sequences(test_sequences, maxlen=MAX_SEQUENCE_LENGTH)
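By default, `pad_sequences` pads and truncates at the front of each sequence (`padding='pre'`, `truncating='pre'`). The pure-Python sketch below imitates that default on invented index lists. The helper `pad_pre` is hypothetical and is not part of Keras.

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad `seq` with `value`, or keep only its last `maxlen` items."""
    seq = list(seq)[-maxlen:]                  # truncate from the front if too long
    return [value] * (maxlen - len(seq)) + seq

# Invented word-index sequences.
padded_short = pad_pre([5, 9, 2], maxlen=5)
padded_long = pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5)

print(padded_short)
print(padded_long)
```

Because index 0 is used for padding, the embedding matrix needs one more row than the number of words in the word index.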

11. Define the CNN model and prepare the training inputs:

In [0]:
def ConvNet(embeddings, max_sequence_length, num_words, embedding_dim, labels_index):
    
    embedding_layer = Embedding(num_words,
                            embedding_dim,
                            weights=[embeddings],
                            input_length=max_sequence_length,
                            trainable=False)
    
    sequence_input = Input(shape=(max_sequence_length,), dtype='int32')
    embedded_sequences = embedding_layer(sequence_input)

    convs = []
    filter_sizes = [2,3,4,5,6] # five different filter sizes applied to each tweet

    for filter_size in filter_sizes:
        l_conv = Conv1D(filters=200, kernel_size=filter_size, activation='relu')(embedded_sequences)
        l_pool = GlobalMaxPooling1D()(l_conv)
        convs.append(l_pool)

    l_merge = concatenate(convs, axis=1)

    x = Dropout(0.1)(l_merge)  
    x = Dense(128, activation='relu')(x)
    x = Dropout(0.2)(x)
    preds = Dense(labels_index, activation='sigmoid')(x)

    model = Model(sequence_input, preds)
    model.compile(loss='binary_crossentropy',
                  optimizer='adam',
                  metrics=['acc'])
    model.summary()
    return model

In [0]:
label_names = ['Pos', 'Neg']

In [0]:
x_train = train_cnn_data
y_train = df_train[label_names].values

In [97]:
model = ConvNet(train_embedding_weights,
                MAX_SEQUENCE_LENGTH,
                len(train_word_index)+1,
                EMBEDDING_DIM, 
                len(list(label_names))
                )

Model: "model_3"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_3 (InputLayer)            (None, 28)           0                                            
__________________________________________________________________________________________________
embedding_3 (Embedding)         (None, 28, 300)      4106700     input_3[0][0]                    
__________________________________________________________________________________________________
conv1d_11 (Conv1D)              (None, 27, 200)      120200      embedding_3[0][0]                
__________________________________________________________________________________________________
conv1d_12 (Conv1D)              (None, 26, 200)      180200      embedding_3[0][0]                
____________________________________________________________________________________________
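The shapes and parameter counts in the summary above can be reproduced by hand. The sketch below assumes the `Conv1D` defaults (stride 1, `'valid'` padding) and takes the vocabulary size from the "Found 13688 unique tokens" output. The helper names are made up for this example.

```python
def conv1d_output_len(seq_len, kernel_size):
    """Output length of a stride-1 convolution with 'valid' padding."""
    return seq_len - kernel_size + 1

def conv1d_params(kernel_size, in_channels, filters):
    """Kernel weights plus one bias per filter."""
    return kernel_size * in_channels * filters + filters

SEQ_LEN, EMB_DIM, FILTERS = 28, 300, 200
VOCAB_ROWS = 13688 + 1   # unique tokens plus the padding index 0

embedding_params = VOCAB_ROWS * EMB_DIM
out_lens = {k: conv1d_output_len(SEQ_LEN, k) for k in (2, 3, 4, 5, 6)}
conv_params = {k: conv1d_params(k, EMB_DIM, FILTERS) for k in (2, 3, 4, 5, 6)}

print(embedding_params)   # compare with embedding_3
print(out_lens)           # compare with the Conv1D output shapes
print(conv_params)        # compare with the Conv1D parameter counts
```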

12. Train the CNN model:

In [0]:
num_epochs = 3
batch_size = 64

In [99]:
hist = model.fit(x_train,
                 y_train,
                 epochs=num_epochs,
                 validation_split=0.1,
                 shuffle=True,
                 batch_size=batch_size,
                 )

Train on 4620 samples, validate on 514 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


13. Test the CNN model:

In [100]:
predictions = model.predict(test_cnn_data, batch_size=1024, verbose=1)



In [0]:
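# output column 0 is 'Pos' (label 1) and column 1 is 'Neg' (label 0)
labels = [1, 0]

# pick the label whose output unit has the highest score
prediction_labels = [labels[np.argmax(p)] for p in predictions]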
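The mapping from the two output scores to a single label can be illustrated on a few invented prediction rows. This is only a sketch of the argmax step. The numbers are not model output.

```python
import numpy as np

labels = [1, 0]   # column 0 -> 'Pos' (label 1), column 1 -> 'Neg' (label 0)

# Made-up sigmoid outputs, one row per tweet: [score_pos, score_neg].
toy_predictions = np.array([[0.90, 0.20],
                            [0.30, 0.80],
                            [0.60, 0.55]])

# Index of the larger score in each row, translated to the 0/1 label.
predicted = [labels[i] for i in np.argmax(toy_predictions, axis=1)]
print(predicted)
```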

In [0]:
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score, f1_score

In [103]:
df_test = df_test.copy()  # work on an explicit copy to avoid pandas' SettingWithCopyWarning
df_test['Prediction'] = prediction_labels


In [0]:
predicted_classes = df_test.Prediction
y_test = df_test.Label

14. Make the confusion matrix and evaluate:

In [105]:
conf_matrix = pd.DataFrame(confusion_matrix(y_test, predicted_classes))
print('Confusion Matrix')
display(conf_matrix)

test_scores = (accuracy_score(y_test, predicted_classes),
               precision_score(y_test, predicted_classes),
               recall_score(y_test, predicted_classes),
               f1_score(y_test, predicted_classes))

print('\n \n Scores')
scores = pd.DataFrame(data=[test_scores])
scores.columns = ['accuracy', 'precision', 'recall', 'f1']
scores = scores.T
scores.columns = ['scores']
display(scores)

Confusion Matrix


true \ predicted,0,1
0,203,100
1,76,192



 
 Scores


metric,scores
accuracy,0.691769
precision,0.657534
recall,0.716418
f1,0.685714
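The reported scores can be recomputed directly from the confusion matrix, whose rows are true labels and whose columns are predicted labels. The sketch below uses the four counts shown above and treats label 1 as the positive class, which is scikit-learn's default for binary targets.

```python
# Counts from the confusion matrix above (rows = true, columns = predicted).
tn, fp = 203, 100   # true label 0
fn, tp = 76, 192    # true label 1

accuracy = (tp + tn) / (tn + fp + fn + tp)
precision = tp / (tp + fp)           # share of predicted positives that are correct
recall = tp / (tp + fn)              # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)

for name, value in [('accuracy', accuracy), ('precision', precision),
                    ('recall', recall), ('f1', f1)]:
    print('{:<10}{:.6f}'.format(name, value))
```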


In [106]:
df_test.Label.value_counts()

0    303
1    268
Name: Label, dtype: int64