<a href="https://colab.research.google.com/github/rozinurhuda/nlp-in-tfjs/blob/main/nlp_in_tfjs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Binary Classification Dataset

https://www.kaggle.com/marklvl/sentiment-labelled-sentences-data-set?

To download the data through the Kaggle API, first create a Kaggle account. At the top right, click My Profile > Account (under the three dots) > API > Create New API Token. This downloads a `kaggle.json` file; copy the username and key from it into the code below.
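As an alternative to pasting the credentials into the notebook, they can be read straight from the token file. This is a minimal sketch, assuming `kaggle.json` has been uploaded next to the notebook and has the usual `{"username": ..., "key": ...}` layout; the helper name `load_kaggle_credentials` and the default path are illustrative and not part of the original notebook.

```python
import json
import os

def load_kaggle_credentials(path="kaggle.json"):
    """Read a Kaggle API token file and export it as environment variables.

    Assumes the file has the layout {"username": ..., "key": ...}.
    The default path is a placeholder; point it at wherever the file lives.
    """
    with open(path) as fp:
        token = json.load(fp)
    os.environ["KAGGLE_USERNAME"] = token["username"]
    os.environ["KAGGLE_KEY"] = token["key"]
    return token["username"]
```

Calling `load_kaggle_credentials()` before the download cell keeps the key out of the notebook itself.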


In [1]:
import os
os.environ['KAGGLE_USERNAME'] = ''  # fill in your username
os.environ['KAGGLE_KEY'] = ''  # fill in the key

In [2]:
!kaggle datasets download -d marklvl/sentiment-labelled-sentences-data-set

Downloading sentiment-labelled-sentences-data-set.zip to /content
  0% 0.00/326k [00:00<?, ?B/s]
100% 326k/326k [00:00<00:00, 41.5MB/s]


In [3]:
!unzip -q sentiment-labelled-sentences-data-set.zip -d .

**LSTM**

In [4]:
import pandas as pd
import tensorflow as tf
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

df = pd.read_csv('/content/sentiment labelled sentences/yelp_labelled.txt', names=['sentence', 'label'], sep='\t')
df.shape

(1000, 2)

In [5]:
df.head()

                                            sentence  label
0                           Wow... Loved this place.      1
1                                 Crust is not good.      0
2          Not tasty and the texture was just nasty.      0
3  Stopped by during the late May bank holiday of...      1
4  The selection on the menu was great and so wer...      1

**Text Preprocessing**

In [6]:
# convert to lowercase
df['sentence'] = df['sentence'].str.lower()

In [7]:
# remove stopwords

# comment out if an error occurs and use the two statements below

import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [8]:
stop = set(stopwords.words('english'))
df['sentence'] = df['sentence'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))
df.head()

                                            sentence  label
0                                wow... loved place.      1
1                                        crust good.      0
2                               tasty texture nasty.      0
3  stopped late may bank holiday rick steve recom...      1
4                       selection menu great prices.      1
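Note that the stopword list also removes negations: "crust is not good." becomes "crust good." in the output above, which reads as the opposite sentiment. If that matters for the classifier, the negations can be kept. The sketch below uses a tiny hand-written stopword set as a stand-in for `stopwords.words('english')`, and `remove_stopwords` is a hypothetical helper, not something the notebook defines.

```python
def remove_stopwords(text, stop, keep=frozenset()):
    """Drop whitespace-separated words found in `stop`, except those in `keep`."""
    return " ".join(w for w in text.split() if w not in stop or w in keep)

# Tiny illustrative stopword set; the notebook uses NLTK's full English list.
stop = {"is", "not", "and", "the", "was", "just", "this"}
negations = {"not", "no", "nor"}

print(remove_stopwords("crust is not good.", stop))             # crust good.
print(remove_stopwords("crust is not good.", stop, negations))  # crust not good.
```

In the notebook, the same effect could be had by subtracting the negation words from `stop` before applying the lambda.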


**Tokenize**

In [9]:
vocab_size = 2000
oov_tok = "<OOV>"
filt = r' !"#$%&()*+.,-/:;=?@[\]^_`{|}~ '  # characters the tokenizer strips; raw string avoids the invalid "\]" escape

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_tok, filters=filt)
tokenizer.fit_on_texts(df['sentence'].values)

word2index = tokenizer.word_index
print(len(word2index))

1998
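To make the numbering concrete: the word index reserves 1 for the OOV token, then numbers the remaining words by descending frequency, and 0 is left free for padding. `num_words` does not shrink `word_index`; it is only applied when texts are converted to sequences. The following is a simplified pure-Python stand-in for that behaviour, not the Keras implementation; it skips lowercasing and the `filters` step, and `build_word_index` and `encode` are hypothetical names.

```python
from collections import Counter

def build_word_index(texts, oov_token="<OOV>"):
    """Simplified stand-in for Tokenizer.word_index: OOV token first, then words by frequency."""
    counts = Counter(word for text in texts for word in text.split())
    index = {oov_token: 1}  # index 0 is left free for padding
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        index[word] = rank
    return index

def encode(text, index, num_words, oov_token="<OOV>"):
    """Map words to ids; unknown words and ids >= num_words become the OOV id."""
    oov = index[oov_token]
    ids = (index.get(word, oov) for word in text.split())
    return [i if i < num_words else oov for i in ids]

index = build_word_index(["good food", "good place", "bad food"])
print(encode("good cheap food", index, num_words=4))  # [2, 1, 3]
```

Here "cheap" was never seen, so it maps to the OOV id 1.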


In [10]:
import json

with open('word2index.json', 'w') as fp:
  json.dump(word2index, fp)

In [11]:
max_length = max(len(sentence.split()) for sentence in df['sentence'])
max_length

18

In [12]:
padding_type = 'post'  # pad with zeros after the tokens

all_seq = tokenizer.texts_to_sequences(df['sentence'].values)
all_padded = pad_sequences(all_seq, maxlen=max_length, padding=padding_type)
all_padded.shape

(1000, 18)
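What `pad_sequences` does with `padding='post'` can be sketched in plain Python. This is an illustration, not the Keras code: zeros are appended after the tokens, and a sequence longer than `maxlen` keeps its last `maxlen` ids, because the default `truncating` argument is `'pre'`. `pad_post` is a hypothetical helper and assumes a positive `maxlen`.

```python
def pad_post(sequences, maxlen, value=0):
    """Pad at the end; overlong sequences keep their last `maxlen` ids."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # drop ids from the front if too long
        padded.append(seq + [value] * (maxlen - len(seq)))
    return padded

print(pad_post([[5, 2], [7, 8, 9, 4, 1]], maxlen=4))
# [[5, 2, 0, 0], [8, 9, 4, 1]]
```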

In [14]:
# split train and test sets
from sklearn.model_selection import train_test_split

X = all_padded
# y = pd.get_dummies(df['label'].values)
y = df['label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)


(800, 18) (800,)
(200, 18) (200,)


In [16]:
model = tf.keras.Sequential([
        tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=16, input_length=max_length),
        tf.keras.layers.LSTM(64),
        tf.keras.layers.Dense(24, activation='relu'),
        tf.keras.layers.Dense(1, activation='sigmoid'),
])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding (Embedding)        (None, 18, 16)            32000     
_________________________________________________________________
lstm (LSTM)                  (None, 64)                20736     
_________________________________________________________________
dense (Dense)                (None, 24)                1560      
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 25        
=================================================================
Total params: 54,321
Trainable params: 54,321
Non-trainable params: 0
_________________________________________________________________
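The parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas for each layer type with the sizes from the model definition (vocabulary 2000, embedding size 16, 64 LSTM units, dense layers of 24 and 1 units); the function names are illustrative.

```python
def embedding_params(vocab_size, dim):
    # one vector of length `dim` per vocabulary entry
    return vocab_size * dim

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and a bias
    return 4 * ((input_dim + units) * units + units)

def dense_params(inputs, units):
    # weight matrix plus one bias per unit
    return inputs * units + units

counts = [
    embedding_params(2000, 16),
    lstm_params(16, 64),
    dense_params(64, 24),
    dense_params(24, 1),
]
print(counts, sum(counts))  # [32000, 20736, 1560, 25] 54321
```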


In [17]:
num_epochs = 30
history = model.fit(X_train, y_train, epochs=num_epochs, validation_data=(X_test, y_test))

Epoch 1/30
...
Epoch 30/30
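`model.fit` returns a `History` object whose `history` attribute is a dict of per-epoch metric lists (here `loss`, `accuracy`, `val_loss` and `val_accuracy`). A small helper can pick out the epoch with the lowest validation loss. `best_epoch` is a hypothetical helper, and the numbers in the example are made up for illustration; they are not results from this notebook.

```python
def best_epoch(history_dict, metric="val_loss"):
    """Return (1-based epoch, value) where `metric` is lowest."""
    values = history_dict[metric]
    idx = min(range(len(values)), key=values.__getitem__)
    return idx + 1, values[idx]

# Made-up values for illustration only; in the notebook pass `history.history`.
example = {"val_loss": [0.69, 0.61, 0.55, 0.58, 0.66]}
print(best_epoch(example))  # (3, 0.55)
```

A validation loss that bottoms out well before the last epoch is a common sign that the model is starting to overfit.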


In [18]:
def to_sequence(sentence):
  # map each known word to its index; words missing from the vocabulary are skipped
  return [word2index[w] for w in sentence.lower().split() if w in word2index]

seq = to_sequence('affordable price and nice dessert')
# pad with zeros at the end up to max_length, matching the training input,
# e.g. [269, 353, 0, 0, ..., 0]
pad = pad_sequences([seq], maxlen=max_length, padding='post')
model.predict(pad)
