# DATA INFORMATION

Taswiyah Marsyah Noor (taswiyah2908@gmail.com)

+ **Total Data**: 3658 records
+ **Data Source**: TrustPilot (Electronics Store Flashbay)
+ **Storage Size**: 701 KB
+ **Library**: Selenium

# IMPORT LIBRARY

In [29]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import string
import re
import nltk
import torch
import tensorflow as tf
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from imblearn.over_sampling import RandomOverSampler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from keras.models import Sequential
from keras.layers import Dense, Dropout
from sklearn.preprocessing import LabelEncoder
from keras.utils import to_categorical
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from tensorflow.keras import layers, models

In [2]:
import warnings
warnings.filterwarnings("ignore")

# LOAD DATA

In [3]:
df = pd.read_csv('all_reviews.csv')

# EDA

In [4]:
df

| | title | content | sentiment |
|---|---|---|---|
| 0 | Great products fast turnaround time! | Date of experience: April 06, 2025 | |
| 1 | Rachel is the absolute best! | From my first telephone contact with Rachel, i... | |
| 2 | Will be using Flashbay again! | My rep, Alex, was great to work with. He was r... | |
| 3 | Outstanding customer service | Outstanding customer service. Brian Truong has... | |
| 4 | Fantastic | Fantastic! Always fast and great service. Will... | |
| ... | ... | ... | ... |
| 3680 | Excellent | Fast delivery. | neutral |
| 3681 | Wonderful service!! | Absolutely incredible experience!! | neutral |
| 3682 | Daryl is the best | Daryl get the job done on time and on budget | neutral |
| 3683 | Turnaround time was quick | The order process went smoothly. Product was r... | neutral |


## LABELING

In [5]:
sia = SentimentIntensityAnalyzer()
def vader_sentiment(text):
    score = sia.polarity_scores(text)
    compound = score['compound']  # combined value of all the scores
    if compound >= 0.05:
        return 'positive'
    elif compound <= -0.05:
        return 'negative'
    else:
        return 'neutral'
    
df['sentiment'] = df['content'].astype(str).apply(vader_sentiment)

The scraped data is labeled with VADER. The compound score is mapped to a label with fixed thresholds: 0.05 or higher is positive, -0.05 or lower is negative, and anything in between is neutral.
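
The threshold rule can be checked on its own, without loading VADER. The sketch below assumes a hypothetical helper, `label_from_compound`, that receives the compound score directly; the notebook's `vader_sentiment` computes that score from the text first.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER-style compound score in [-1, 1] to a sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Scores on and around the two boundaries
for score in (0.62, 0.05, 0.0, -0.04, -0.31):
    print(score, label_from_compound(score))
```

Scores exactly on a boundary (0.05 or -0.05) fall into the positive or negative class, not into neutral.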

In [6]:
df['sentiment'].value_counts()

sentiment
positive    2580
neutral     1013
negative      92
Name: count, dtype: int64

In [7]:
df.isnull().sum()

title        168
content      171
sentiment      0
dtype: int64

The title column has 168 missing values and the content column has 171, so I decided to drop those rows.

In [8]:
df = df.dropna()

Clean the data to remove irrelevant characters and elements.

# PREPROCESSING

In [9]:
df['content'] = df['content'].str.lower()

In [10]:
df['content'] = df['content'].astype(str)
# df_tfidf['content'] = df_tfidf['content'].apply(lambda x: x.translate(str.maketrans('', '', string.punctuation)))

In [11]:
nltk.download('stopwords')
nltk.download('wordnet')

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    if not isinstance(text, str):
        return ""
    text = text.lower()
    text = re.sub(r"http\S+|www\S+|https\S+", '', text, flags=re.MULTILINE)
    text = text.translate(str.maketrans('', '', string.punctuation))
    text = re.sub(r'\d+', '', text)
    words = text.split()
    words = [word for word in words if word not in stop_words]
    words = [lemmatizer.lemmatize(word) for word in words]
    return ' '.join(words)

# Apply to the content column
df['content'] = df['content'].apply(preprocess_text)

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\hp\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\hp\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
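
To see what the cleaning steps do to a single review, here is a small sketch that needs no NLTK download. It follows the same order as `preprocess_text` (lowercase, strip URLs, strip punctuation, strip digits, drop stopwords) but skips lemmatization. `clean_text`, `DEMO_STOPWORDS`, and the sample sentence are made up for illustration.

```python
import re
import string

# Hypothetical mini stopword list; the notebook uses NLTK's English list instead.
DEMO_STOPWORDS = {"the", "is", "and", "a", "at", "was"}

def clean_text(text):
    text = text.lower()
    text = re.sub(r"http\S+|www\S+", "", text)                        # drop URLs
    text = text.translate(str.maketrans("", "", string.punctuation))  # drop punctuation
    text = re.sub(r"\d+", "", text)                                   # drop digits
    return " ".join(w for w in text.split() if w not in DEMO_STOPWORDS)

sample = "The USB drives arrived in 3 days, see https://example.com! Service was GREAT."
cleaned = clean_text(sample)
print(cleaned)
```

Because digits are removed before vectorization, numeric tokens such as years never enter the vocabulary built from the cleaned text.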


# SPLIT DATA

In [12]:
X = df['content']
y = df['sentiment']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# OVER SAMPLING

In [13]:
# Convert X_train to a DataFrame because fit_resample needs 2D input
X_train_df = X_train.to_frame()

# Apply oversampling
ros = RandomOverSampler(random_state=42)
X_train_resampled, y_train_resampled = ros.fit_resample(X_train_df, y_train)

# Convert back to a Series (to keep the original Series format)
X_train_resampled = X_train_resampled['content']

Because the label distribution is imbalanced, with far more positive samples than neutral or negative ones, I applied Random Over Sampling to balance the training data.
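
The idea behind random oversampling can be sketched in plain Python: every class is topped up with randomly repeated copies of its own samples until it matches the largest class. The function below is a hand-rolled stand-in for illustration, not the `RandomOverSampler` implementation, and the toy data is invented.

```python
import random
from collections import Counter

def random_oversample(texts, labels, seed=42):
    """Duplicate minority-class samples at random until all classes are equal in size."""
    rng = random.Random(seed)
    by_class = {}
    for text, label in zip(texts, labels):
        by_class.setdefault(label, []).append(text)
    target = max(len(items) for items in by_class.values())
    out_texts, out_labels = [], []
    for label, items in by_class.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        out_texts.extend(items + extra)
        out_labels.extend([label] * target)
    return out_texts, out_labels

texts = ["great", "fast", "love it", "ok", "bad"]
labels = ["positive", "positive", "positive", "neutral", "negative"]
res_texts, res_labels = random_oversample(texts, labels)
print(Counter(res_labels))
```

No new text is created: the minority classes only gain exact duplicates, so a model can score very well on the resampled training set simply by memorizing the repeated rows.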

# FEATURE EXTRACTION AND MODELING

These are the combinations I tried in this project:
+ **1. TF-IDF + Logistic Regression**
+ **2. TF-IDF + Dense Layer (MLP with Keras)**
+ **3. CountVectorizer + Logistic Regression**
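
The two feature extractors differ only in how they weight terms. A minimal sketch on three invented documents, using default settings for both vectorizers as the notebook does:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["fast delivery great service", "great product great price", "slow delivery"]

cv_demo = CountVectorizer()
tfidf_demo = TfidfVectorizer()
X_count = cv_demo.fit_transform(docs)    # raw term counts
X_tfidf = tfidf_demo.fit_transform(docs)  # counts reweighted by IDF, rows L2-normalized

same_vocab = cv_demo.vocabulary_ == tfidf_demo.vocabulary_
row_norms = np.linalg.norm(X_tfidf.toarray(), axis=1)

print(X_count.toarray())
print(X_tfidf.toarray().round(2))
print("same vocabulary:", same_vocab)
```

With default settings and the same training text, both vectorizers produce the same vocabulary and the same column order, so their matrices have identical shapes even though the values are on different scales.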

In [14]:
tfidf = TfidfVectorizer()
X_train_tfidf = tfidf.fit_transform(X_train_resampled)
X_test_tfidf = tfidf.transform(X_test)

In [15]:
# Model
model_logistic = LogisticRegression(max_iter=1000)
model_logistic.fit(X_train_tfidf, y_train_resampled)

# === TRAINING EVALUATION ===
y_train_pred = model_logistic.predict(X_train_tfidf)
print("TF-IDF + Logistic Regression — TRAINING")
print(classification_report(y_train_resampled, y_train_pred))

# === TESTING EVALUATION ===
y_test_pred = model_logistic.predict(X_test_tfidf)
print("TF-IDF + Logistic Regression — TESTING")
print(classification_report(y_test, y_test_pred))

TF-IDF + Logistic Regression — TRAINING
              precision    recall  f1-score   support

    negative       0.99      1.00      1.00      2064
     neutral       1.00      0.99      0.99      2064
    positive       0.99      0.99      0.99      2064

    accuracy                           0.99      6192
   macro avg       0.99      0.99      0.99      6192
weighted avg       0.99      0.99      0.99      6192

TF-IDF + Logistic Regression — TESTING
              precision    recall  f1-score   support

    negative       0.41      0.39      0.40        18
     neutral       0.91      0.85      0.88       169
    positive       0.95      0.97      0.96       516

    accuracy                           0.92       703
   macro avg       0.75      0.74      0.74       703
weighted avg       0.92      0.92      0.92       703



**Training**
+ Very high accuracy: 99%
+ The model recognizes every class very well on the training data <br>
⚠️ Near-perfect scores → a sign of overfitting to the training data

**Testing**
+ High accuracy: 92%
+ Handles the dominant class (positive) well and does reasonably well on the neutral class <br>
⚠️ Performance on the negative class is still low (recall 0.39) → predictions are unbalanced across classes <br>
⚠️ Accuracy drops from training to testing → mild overfitting

In [16]:
# Array
X_train_array = X_train_tfidf.toarray()
X_test_array = X_test_tfidf.toarray()

# Encode label
le = LabelEncoder()
y_train_enc = le.fit_transform(y_train_resampled)
y_test_enc = le.transform(y_test)

y_train_cat = to_categorical(y_train_enc)
y_test_cat = to_categorical(y_test_enc)

# Build model
model_dense = Sequential()
model_dense.add(Dense(128, activation='relu', input_shape=(X_train_array.shape[1],)))
model_dense.add(Dense(64, activation='relu'))
model_dense.add(Dense(y_train_cat.shape[1], activation='softmax'))

model_dense.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model_dense.fit(X_train_array, y_train_cat, epochs=5, batch_size=32, validation_split=0.1)

# === TRAINING EVALUATION ===
y_train_pred = model_dense.predict(X_train_array)
y_train_pred_classes = np.argmax(y_train_pred, axis=1)
print("TF-IDF + Dense Layer — TRAINING")
print(classification_report(y_train_enc, y_train_pred_classes, target_names=le.classes_))

# === TESTING EVALUATION ===
y_test_pred = model_dense.predict(X_test_array)
y_test_pred_classes = np.argmax(y_test_pred, axis=1)
print("TF-IDF + Dense Layer — TESTING")
print(classification_report(y_test_enc, y_test_pred_classes, target_names=le.classes_))

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
TF-IDF + Dense Layer — TRAINING
              precision    recall  f1-score   support

    negative       1.00      1.00      1.00      2064
     neutral       1.00      1.00      1.00      2064
    positive       1.00      1.00      1.00      2064

    accuracy                           1.00      6192
   macro avg       1.00      1.00      1.00      6192
weighted avg       1.00      1.00      1.00      6192

TF-IDF + Dense Layer — TESTING
              precision    recall  f1-score   support

    negative       0.50      0.11      0.18        18
     neutral       0.92      0.85      0.88       169
    positive       0.94      0.99      0.96       516

    accuracy                           0.93       703
   macro avg       0.79      0.65      0.67       703
weighted avg       0.92      0.93      0.92       703



**Training**
+ Perfect accuracy: 100%
+ The model fits every class exactly on the training data <br>
⚠️ This is severe overfitting: the model matches the training data too closely

**Testing**
+ High accuracy: 93%
+ Very good performance on the dominant class (positive) <br>
⚠️ The negative class is recognized very poorly (recall 0.11) <br>
⚠️ Model performance is unbalanced across classes
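
The Dense model outputs class indices, so the mapping from index to label matters both for `target_names=le.classes_` and for any later lookup at inference time. A short sketch of how `LabelEncoder` assigns those indices (the variable names here are illustrative):

```python
from sklearn.preprocessing import LabelEncoder

le_demo = LabelEncoder()
codes = le_demo.fit_transform(["positive", "neutral", "negative", "positive"])

print(list(le_demo.classes_))  # classes are stored in sorted order
print(list(codes))             # each label replaced by its index in classes_
```

Index 0 is therefore negative, 1 is neutral, and 2 is positive, regardless of the order in which the labels appear in the data.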

In [17]:
cv = CountVectorizer()
X_train_count = cv.fit_transform(X_train_resampled)
X_test_count = cv.transform(X_test)

In [18]:
# model_logistic was fitted on the TF-IDF features above and is reused here without refitting,
# so the scores below describe that model applied to raw count features.

# === TRAINING EVALUATION ===
y_train_pred = model_logistic.predict(X_train_count)
print("CountVectorizer + Logistic Regression — TRAINING")
print(classification_report(y_train_resampled, y_train_pred))

# === TESTING EVALUATION ===
y_test_pred = model_logistic.predict(X_test_count)
print("CountVectorizer + Logistic Regression — TESTING")
print(classification_report(y_test, y_test_pred))

CountVectorizer + Logistic Regression — TRAINING
              precision    recall  f1-score   support

    negative       0.95      1.00      0.97      2064
     neutral       0.99      0.97      0.98      2064
    positive       0.99      0.95      0.97      2064

    accuracy                           0.97      6192
   macro avg       0.97      0.97      0.97      6192
weighted avg       0.97      0.97      0.97      6192

CountVectorizer + Logistic Regression — TESTING
              precision    recall  f1-score   support

    negative       0.18      0.50      0.26        18
     neutral       0.88      0.82      0.85       169
    positive       0.96      0.92      0.94       516

    accuracy                           0.88       703
   macro avg       0.67      0.75      0.68       703
weighted avg       0.92      0.88      0.90       703
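
For reference, a CountVectorizer pipeline normally fits its own classifier on the count features. The sketch below shows that pattern on a few invented reviews. `model_count_demo` and the toy data are assumptions for illustration, and the figures reported in this notebook were not produced this way.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

train_texts = ["great fast service", "love product",
               "terrible slow delivery", "poor quality broken"]
train_labels = ["positive", "positive", "negative", "negative"]

# Fit the vectorizer and a dedicated classifier on the same count features
cv_demo = CountVectorizer()
X_train_demo = cv_demo.fit_transform(train_texts)

model_count_demo = LogisticRegression(max_iter=1000)
model_count_demo.fit(X_train_demo, train_labels)

# New text goes through the same fitted vectorizer before prediction
pred = model_count_demo.predict(cv_demo.transform(["fast service"]))
print(pred[0])
```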



**Training**
+ Very high accuracy: 97%
+ All classes are recognized well on the training data <br>
⚠️ Overfitting is very likely

**Testing**
+ High accuracy: 88%
+ Good performance on the positive and neutral classes <br>
⚠️ The negative class is still weak (recall only 0.50, precision 0.18) <br>
⚠️ Performance is unbalanced across classes
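
The recall figures are easier to compare as counts. The test set has only 18 negative reviews, so each recall value corresponds to a handful of samples. The sketch below recovers approximate counts from the rounded values in the three test reports, and adds the accuracy of a trivial baseline that always predicts positive (516 of 703 test rows).

```python
support_negative = 18
support_positive = 516
test_size = 703

# Negative-class recall on the test set, copied from the three reports
negative_recall = {
    "TF-IDF + Logistic Regression": 0.39,
    "TF-IDF + Dense Layer": 0.11,
    "CountVectorizer + Logistic Regression": 0.50,
}

# Approximate number of negative reviews each model found (inferred from rounded recall)
negative_hits = {name: round(r * support_negative) for name, r in negative_recall.items()}
majority_baseline = support_positive / test_size

for name, hits in negative_hits.items():
    print(f"{name}: about {hits} of {support_negative} negative reviews found")
print(f"always-positive baseline accuracy: {majority_baseline:.2f}")
```

With so few negative test samples, one or two reviews move the recall by several points, and overall accuracy says little about the minority class.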

# SAVE MODEL

In [25]:
# Save the trained Dense Layer (MLP) model
model_dense.save('model_dense_layer.h5')

In [30]:
# Build a Keras model equivalent to the Logistic Regression
logistic_model_keras = models.Sequential([
    layers.InputLayer(input_shape=(X_train_array.shape[1],)),  # Matches the number of features in X_train_tfidf
    layers.Dense(3, activation='softmax', use_bias=True,
                  kernel_initializer=tf.constant_initializer(model_logistic.coef_.T),  # Weights from the Logistic Regression model
                  bias_initializer=tf.constant_initializer(model_logistic.intercept_))  # Intercepts from the Logistic Regression model
])

# Save the Logistic Regression model in Keras format
logistic_model_keras.save('model_logistic_regression.h5')
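
Copying `coef_.T` and `intercept_` into a single softmax layer works because a multinomial logistic regression is exactly that computation. The NumPy sketch below checks this on synthetic data, assuming scikit-learn fits a multinomial model (the default for the `lbfgs` solver in recent releases). The data and the names `X_demo`, `y_demo`, and `clf` are invented.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(90, 5))
y_demo = np.arange(90) % 3  # three synthetic classes

clf = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# What a Dense(3, activation='softmax') layer computes with these weights
logits = X_demo @ clf.coef_.T + clf.intercept_
logits -= logits.max(axis=1, keepdims=True)  # subtract row max for numerical stability
manual_proba = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

max_diff = np.abs(manual_proba - clf.predict_proba(X_demo)).max()
print(f"largest difference from predict_proba: {max_diff:.2e}")
```

The column order of `coef_` follows `clf.classes_`, so the output index of the Keras layer maps to labels in that same sorted order.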



# INFERENCE

In [37]:
# Labels listed in class-index order
labels = ["negative", "neutral", "positive"]

# Load the saved Dense Layer (MLP) model
mlp_model_loaded = tf.keras.models.load_model('model_dense_layer.h5')

# Example of new data that the model has not seen before
new_data = ["wow the goods are very good and the service is okay the goods arrive quickl."]

# Transform the new data with the already-fitted TF-IDF vectorizer
# (preprocess_text is not applied here, unlike for the training data)
new_data_tfidf = tfidf.transform(new_data).toarray()

# Predict with the loaded model
y_pred_mlp = mlp_model_loaded.predict(new_data_tfidf)

# Pick the class with the highest probability
predicted_class_index = np.argmax(y_pred_mlp, axis=1)

# Convert the class index to a label
predicted_label = labels[predicted_class_index[0]]

# Show the prediction
print(f'Predicted class: {predicted_label}')

Predicted class: positive


In [39]:
# Load the saved model
logistic_loaded_model = tf.keras.models.load_model('model_logistic_regression.h5')

# New data to predict
new_data = ["9 February 2025"]
new_data_tfidf = tfidf.transform(new_data).toarray()

# Predict
y_pred_log = logistic_loaded_model.predict(new_data_tfidf)

# Pick the class with the highest probability
predicted_class_index_log = np.argmax(y_pred_log, axis=1)

# Convert the class index to a label
predicted_label_log = labels[predicted_class_index_log[0]]

# Show the prediction
print(f'Predicted class: {predicted_label_log}')

Predicted class: neutral
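
An input like a bare date is a useful edge case, because a fitted vectorizer ignores any token it did not see during training. The sketch below uses a tiny made-up vocabulary to show that an input made only of unknown tokens becomes an all-zero vector, leaving the model nothing but its bias terms to predict from. Whether that is what happened for the date string above depends on the fitted vocabulary, which is not shown here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_demo = TfidfVectorizer()
tfidf_demo.fit(["fast delivery great service", "slow delivery poor service"])

known = tfidf_demo.transform(["great service"])      # both tokens are in the vocabulary
unknown = tfidf_demo.transform(["9 february 2025"])  # no token is in the vocabulary

# nnz counts the non-zero entries of each sparse row
print(known.nnz, unknown.nnz)
```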
