# BPML - Submission 1
The dataset contains tweets from Twitter labeled with positive or negative sentiment.
It consists of 1,600,000 records.
The dataset has 6 variables, but this submission uses only 2: the sentiment (positive or negative) and the tweet text.
This project classifies each tweet as having positive or negative sentiment.

### Importing the dataset with pandas

In [1]:
import pandas as pd
df = pd.read_csv('sentiment.csv', header=None, encoding='ISO-8859-1')

### Displaying the last 5 rows

In [2]:
df.tail()

| | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| 1599995 | 4 | 2193601966 | Tue Jun 16 08:40:49 PDT 2009 | NO_QUERY | AmandaMarie1028 | Just woke up. Having no school is the best fee... |
| 1599996 | 4 | 2193601969 | Tue Jun 16 08:40:49 PDT 2009 | NO_QUERY | TheWDBoards | TheWDB.com - Very cool to hear old Walt interv... |
| 1599997 | 4 | 2193601991 | Tue Jun 16 08:40:49 PDT 2009 | NO_QUERY | bpbabe | Are you ready for your MoJo Makeover? Ask me f... |
| 1599998 | 4 | 2193602064 | Tue Jun 16 08:40:49 PDT 2009 | NO_QUERY | tinydiamondz | Happy 38th Birthday to my boo of alll time!!! ... |
| 1599999 | 4 | 2193602129 | Tue Jun 16 08:40:50 PDT 2009 | NO_QUERY | RyanTrevMorris | happy #charitytuesday @theNSPCC @SparksCharity... |


### Dropping unused columns
This project uses only column 0 (sentiment) and column 5 (tweet text).
Columns 1, 2, 3, and 4 are dropped.

In [3]:
df = df.drop(columns=[1,2,3,4])

### Recoding column 0 (sentiment)
Column 0 holds the value 0 (negative sentiment) or 4 (positive sentiment); these are recoded to 0 and 1.

In [4]:
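# Assign the result back instead of using inplace=True on a column selection,
# which newer pandas versions no longer guarantee will modify df
df[0] = df[0].replace({0: 0, 4: 1})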
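
The recoding step can be sketched without pandas to make the mapping explicit. This is a minimal illustration, not the notebook's code: `LABEL_MAP` and `remap_labels` are hypothetical names, and the check for unexpected values is an added assumption about what a careful pipeline might want.

```python
# Hypothetical helper: map the raw labels (0 = negative, 4 = positive) to 0/1.
LABEL_MAP = {0: 0, 4: 1}


def remap_labels(raw_labels):
    """Return binary labels, failing loudly on any value outside {0, 4}."""
    unexpected = set(raw_labels) - set(LABEL_MAP)
    if unexpected:
        raise ValueError(f"unexpected sentiment labels: {sorted(unexpected)}")
    return [LABEL_MAP[label] for label in raw_labels]


print(remap_labels([0, 4, 4, 0]))  # [0, 1, 1, 0]
```

Binary labels in {0, 1} are what a single sigmoid output trained with binary cross-entropy expects, which is why the value 4 has to be recoded.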

### Splitting the data into training and test sets

In [5]:
tweets = df[5].values
sentiment = df[0].values

In [6]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(tweets, sentiment, test_size=0.2)
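
Conceptually, `test_size=0.2` shuffles the rows and holds out 20% of them. The sketch below is a simplified stand-in for that idea, not scikit-learn's implementation; `split_indices` and its `seed` parameter are hypothetical names introduced for illustration.

```python
import math
import random


def split_indices(n_samples, test_size=0.2, seed=0):
    """Shuffle the indices 0..n_samples-1 and cut them into train and test parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    n_test = math.ceil(n_samples * test_size)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(10)
print(len(train_idx), len(test_idx))  # 8 2

# For the full dataset of 1,600,000 rows:
print(math.ceil(1_600_000 * 0.2))  # 320000
```

With 1,600,000 rows this leaves 1,280,000 training and 320,000 test samples. The call in the notebook sets no `random_state`, so the exact split differs between runs; passing a fixed `random_state` would make it repeatable.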

### Tokenizing and padding the tweets

In [7]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer(num_words=10000, oov_token='-')
# Fit on the training split only so the vocabulary is not influenced by the test data
tokenizer.fit_on_texts(x_train)
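
To show what `num_words` and `oov_token` do, here is a simplified pure-Python approximation of a frequency-ranked word index. It is a sketch under assumptions, not the Keras `Tokenizer`: it only lowercases and splits on whitespace (the real class also filters punctuation), and `build_word_index` and `texts_to_sequences` are hypothetical standalone functions.

```python
from collections import Counter


def build_word_index(texts, oov_token="-"):
    """Rank words by frequency; index 1 is reserved for the out-of-vocabulary token."""
    counts = Counter(word for text in texts for word in text.lower().split())
    word_index = {oov_token: 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index


def texts_to_sequences(texts, word_index, num_words, oov_token="-"):
    """Replace each word by its index; unknown or too-rare words become the OOV index."""
    oov_id = word_index[oov_token]
    sequences = []
    for text in texts:
        ids = []
        for word in text.lower().split():
            idx = word_index.get(word)
            if idx is None or idx >= num_words:
                idx = oov_id
            ids.append(idx)
        sequences.append(ids)
    return sequences


corpus = ["good morning", "good night", "bad morning"]
index = build_word_index(corpus)
print(texts_to_sequences(["good evening"], index, num_words=10000))  # [[2, 1]]
```

Here "evening" never appeared in the corpus, so it maps to the OOV index 1. Lowering `num_words` has the same effect on rare words that are in the index but ranked too low.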

In [8]:
seq_train = tokenizer.texts_to_sequences(x_train)
seq_test = tokenizer.texts_to_sequences(x_test)

pad_train = pad_sequences(seq_train)
pad_test = pad_sequences(seq_test)
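
By default, `pad_sequences` pads with zeros at the front of each sequence up to the length of the longest sequence in the list it receives. The sketch below mimics that default behaviour in plain Python; `pad_pre` is a hypothetical name, and it assumes `maxlen` is at least 1 when given.

```python
def pad_pre(sequences, maxlen=None, value=0):
    """Left-pad (and, if needed, left-truncate) every sequence to the same length."""
    if maxlen is None:
        maxlen = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        trimmed = seq[-maxlen:]  # keep the last maxlen tokens
        padded.append([value] * (maxlen - len(trimmed)) + trimmed)
    return padded


print(pad_pre([[5, 6, 7], [8]]))            # [[5, 6, 7], [0, 0, 8]]
print(pad_pre([[5, 6, 7], [8]], maxlen=2))  # [[6, 7], [0, 8]]
```

Because the training and test sequences are padded in separate calls without `maxlen`, the two arrays can end up with different widths. The model accepts variable-length input, so this still runs, but passing the same `maxlen` to both calls would keep the shapes consistent.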

### Neural network model architecture

In [9]:
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

nlp_model = Sequential()
nlp_model.add(Embedding(input_dim=10000, output_dim=32))
nlp_model.add(LSTM(64))
nlp_model.add(Dense(512, activation='relu'))
nlp_model.add(Dense(1024, activation='relu'))
nlp_model.add(Dense(512, activation='relu'))
nlp_model.add(Dense(64, activation='relu'))
nlp_model.add(Dense(1, activation='sigmoid'))

nlp_model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, None, 32)          320000    
_________________________________________________________________
lstm (LSTM)                  (None, 64)                24832     
_________________________________________________________________
dense (Dense)                (None, 512)               33280     
_________________________________________________________________
dense_1 (Dense)              (None, 1024)              525312    
_________________________________________________________________
dense_2 (Dense)              (None, 512)               524800    
_________________________________________________________________
dense_3 (Dense)              (None, 64)                32832     
_________________________________________________________________
dense_4 (Dense)              (None, 1)                 65        
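
The parameter counts in the summary can be checked by hand from the layer sizes. The sketch below uses the standard formulas for embedding, LSTM, and dense layers; the helper function names are hypothetical.

```python
def embedding_params(vocab_size, dim):
    """One dim-sized vector per vocabulary entry."""
    return vocab_size * dim


def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights, and a bias."""
    return 4 * ((input_dim + units) * units + units)


def dense_params(input_dim, units):
    """A weight matrix plus one bias per unit."""
    return input_dim * units + units


counts = [
    embedding_params(10000, 32),  # embedding
    lstm_params(32, 64),          # lstm
    dense_params(64, 512),        # dense
    dense_params(512, 1024),      # dense_1
    dense_params(1024, 512),      # dense_2
    dense_params(512, 64),        # dense_3
    dense_params(64, 1),          # dense_4
]
print(counts)       # [320000, 24832, 33280, 525312, 524800, 32832, 65]
print(sum(counts))  # 1461121
```

The final layer has 64 weights plus one bias, and most of the roughly 1.46 million parameters sit in the two large dense layers rather than in the LSTM.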

### Compiling the model
The model is compiled with the binary_crossentropy loss function (because there are only 2 classes) and the adam optimizer.

In [10]:
nlp_model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
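
For reference, binary cross-entropy averages `-(y*log(p) + (1-y)*log(1-p))` over the samples, where `p` is the sigmoid output. The function below is a minimal hand-rolled sketch of that formula, not the Keras implementation; the clipping constant `eps` is an assumed value used to avoid `log(0)`.

```python
import math


def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy over paired labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # clip to keep the logarithms finite
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


# Confident, correct predictions give a small loss.
print(round(binary_crossentropy([1, 0], [0.9, 0.1]), 4))  # 0.1054
# An uninformative prediction of 0.5 gives log(2).
print(round(binary_crossentropy([1, 0], [0.5, 0.5]), 4))  # 0.6931
```

A loss near 0.693 therefore means the model is doing no better than guessing on a balanced two-class problem.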

### Fitting the model
The model is fitted on the training data and validated on the test data.

In [11]:
num_epochs = 10
nlp_model_history = nlp_model.fit(pad_train, y_train, epochs=num_epochs, validation_data=(pad_test, y_test), batch_size=2048, verbose=1)

Train on 1280000 samples, validate on 320000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
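
The batch size determines how many weight updates happen in each epoch. A small arithmetic sketch follows; `steps_per_epoch` is a hypothetical helper.

```python
import math


def steps_per_epoch(n_samples, batch_size):
    """Number of batches needed to cover all samples once."""
    return math.ceil(n_samples / batch_size)


train_steps = steps_per_epoch(1_280_000, 2048)
print(train_steps)                      # 625
print(train_steps * 10)                 # 6250 updates over 10 epochs
print(steps_per_epoch(320_000, 2048))   # 157 validation batches
```

With 1,280,000 training samples and a batch size of 2048, each epoch performs 625 gradient updates, so the 10 epochs amount to 6,250 updates in total.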


### Plot of accuracy versus number of epochs

In [12]:
import matplotlib.pyplot as plt
acc = nlp_model_history.history['accuracy']
val_acc = nlp_model_history.history['val_accuracy']
epochs = range(1, len(acc) + 1)

plt.plot(epochs, acc, 'blue', label='Training accuracy')
plt.plot(epochs, val_acc, 'black', label='Validation accuracy')
plt.title('Training and Validation Accuracy')
plt.legend()
plt.show()

<Figure size 640x480 with 1 Axes>