## Sentiment Analysis on Women's Clothing E-Commerce Reviews

1. **Load the Dataset**: Upload the dataset file (`.zip` or `.csv`) from the Kaggle dataset `nicapotato/womens-ecommerce-clothing-reviews` to Colab and load it into a pandas DataFrame.
2. **Convert Ratings to Sentiment**: Convert the rating column to sentiment categories according to the rule (1, 2 = negative; 3 = neutral; 4, 5 = positive).
3. **Show the Number of Samples per Sentiment Category**: Count the reviews labeled positive, neutral, and negative.
4. **Build the Vocabulary**: Build a vocabulary of the 10,000 most frequent words in the dataset, plus 2 tokens for padding and out-of-vocabulary (OOV) words.
5. **Show the 10 Most Frequent Words**: Display the 10 words that occur most often in the dataset.
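
Steps 2, 3, and 5 can be illustrated on a handful of made-up ratings and reviews. The sketch below uses only the standard library. The sample data and the helper name `rating_to_sentiment` are hypothetical, and it splits on whitespace instead of using the Keras `Tokenizer`, so treat it as an illustration of the idea, not as the notebook's pipeline.

```python
from collections import Counter

# Hypothetical sample data (not taken from the dataset).
ratings = [5, 4, 3, 1, 2, 5]
reviews = [
    "love this dress",
    "nice dress and nice fit",
    "the fit is ok",
    "poor fabric",
    "too small",
    "love the fabric",
]

def rating_to_sentiment(rating):
    """Map a 1-5 rating to a sentiment label; any other value returns None."""
    if rating in (1, 2):
        return "negative"
    if rating == 3:
        return "neutral"
    if rating in (4, 5):
        return "positive"
    return None

# Step 2: convert ratings to sentiment labels.
sentiments = [rating_to_sentiment(r) for r in ratings]

# Step 3: count samples per sentiment category.
sentiment_counts = Counter(sentiments)

# Step 5: count word frequencies and take the most frequent ones.
word_counts = Counter(word for review in reviews for word in review.lower().split())
top_words = word_counts.most_common(3)

print(sentiment_counts)
print(top_words)
```

Returning `None` for unexpected values makes missing or malformed ratings visible in the label column instead of silently assigning them to a class.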

In [None]:
# Import the required libraries
from google.colab import files
import pandas as pd
import zipfile
import os
from tensorflow.keras.preprocessing.text import Tokenizer

# Step 1: Upload the dataset file manually
print("Silakan unggah file .zip atau .csv dari dataset 'nicapotato/womens-ecommerce-clothing-reviews'")
uploaded = files.upload()

# Step 2: Check and process the uploaded file(s)
for filename in uploaded.keys():
    print(f"File yang diunggah: {filename}")

    # If the file is a .zip, extract its contents
    if filename.endswith('.zip'):
        with zipfile.ZipFile(filename, 'r') as zip_ref:
            zip_ref.extractall()
        print(f"File {filename} telah diekstrak.")
        # Assume the CSV file is named 'Womens Clothing E-Commerce Reviews.csv'
        csv_file = 'Womens Clothing E-Commerce Reviews.csv'
    else:
        # If the file is a .csv, use it directly
        csv_file = filename

# Step 3: Verify that the CSV file exists
if os.path.exists(csv_file):
    print(f"File CSV ditemukan: {csv_file}")
else:
    print(f"File {csv_file} tidak ditemukan. Berikut file yang tersedia:")
    !ls
    raise FileNotFoundError(f"File {csv_file} tidak ditemukan. Pastikan nama file sesuai atau file telah diekstrak.")

# Step 4: Load the dataset into a pandas DataFrame
df = pd.read_csv(csv_file)
print("\nFirst few rows of the dataset:")
print(df.head())

# Step 5: Handle missing values
df['Review Text'] = df['Review Text'].fillna('')  # Replace NaN with an empty string
print(f"\nMissing Ratings: {df['Rating'].isna().sum()}")

# Step 6: Convert ratings to sentiment labels
def convert_rating_to_sentiment(rating):
    if rating in [1, 2]:
        return 'negative'
    elif rating == 3:
        return 'neutral'
    elif rating in [4, 5]:
        return 'positive'

df['Sentiment'] = df['Rating'].apply(convert_rating_to_sentiment)

# Step 7: Show the number of samples in each sentiment category
print("\nJumlah Data per Kategori Sentimen:")
sentiment_counts = df['Sentiment'].value_counts()
print(sentiment_counts)

# Step 8: Build a vocabulary of 10,000 words + 2 tokens (<PAD>, <OOV>)
tokenizer = Tokenizer(num_words=10000 + 2, oov_token='<OOV>')
tokenizer.fit_on_texts(df['Review Text'])
vocab_size = len(tokenizer.word_index) + 1  # Entries in word_index (including <OOV>) + 1 for padding index 0
print(f"\nVocab Size (total kata unik): {vocab_size}")
print("Vocab Size yang digunakan (terbatas): 10002")  # 10,000 words + <PAD> + <OOV>

# Step 9: Show the 10 most frequent words
word_freq = sorted(tokenizer.word_counts.items(), key=lambda x: x[1], reverse=True)[:10]
print("\n10 Kata yang Paling Sering Muncul:")
for word, freq in word_freq:
    print(f"{word}: {freq}")

Silakan unggah file .zip atau .csv dari dataset 'nicapotato/womens-ecommerce-clothing-reviews'


Saving Womens Clothing E-Commerce Reviews.csv.zip to Womens Clothing E-Commerce Reviews.csv (1).zip
File yang diunggah: Womens Clothing E-Commerce Reviews.csv (1).zip
File Womens Clothing E-Commerce Reviews.csv (1).zip telah diekstrak.
File CSV ditemukan: Womens Clothing E-Commerce Reviews.csv

First few rows of the dataset:
   Unnamed: 0  Clothing ID  Age                    Title  \
0           0          767   33                      NaN   
1           1         1080   34                      NaN   
2           2         1077   60  Some major design flaws   
3           3         1049   50         My favorite buy!   
4           4          847   47         Flattering shirt   

                                         Review Text  Rating  Recommended IND  \
0  Absolutely wonderful - silky and sexy and comf...       4                1   
1  Love this dress!  it's sooo pretty.  i happene...       5                1   
2  I had such high hopes for this dress and reall...       3         
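
Step 8 in the cell above passes `num_words=10000 + 2` to the Keras `Tokenizer`. The pure-Python sketch below imitates how such an index budget is commonly laid out: index 0 is reserved for padding, index 1 for the OOV token, and the remaining indices go to words ranked by frequency. It is a simplified stand-in and not the Keras implementation. `build_vocab` and `encode` are hypothetical names, and the toy budget of 4 indices is chosen only to keep the example small.

```python
from collections import Counter

def build_vocab(texts, oov_token="<OOV>"):
    """Assign indices: 0 stays reserved for padding, 1 is OOV, then words by frequency."""
    counts = Counter(word for text in texts for word in text.lower().split())
    word_index = {oov_token: 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index

def encode(text, word_index, num_words, oov_token="<OOV>"):
    """Map words to indices; unknown words and indices >= num_words become OOV."""
    oov = word_index[oov_token]
    ids = []
    for word in text.lower().split():
        idx = word_index.get(word)
        ids.append(idx if idx is not None and idx < num_words else oov)
    return ids

texts = ["great dress great fit", "great dress", "odd color"]
num_words = 4  # toy budget: padding (0) + OOV (1) + the 2 most frequent words
word_index = build_vocab(texts)
encoded = encode("great dress odd zipper", word_index, num_words)
print(encoded)
```

With a budget of 10,002 the same layout leaves room for 10,000 words at indices 2 through 10001, which is why the cell adds 2 to the target vocabulary size.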

In [1]:
# Step 1: Install dependencies with compatible versions
!pip uninstall -y numpy scipy gensim scikit-learn pandas  # Remove existing versions
!pip install numpy==1.25.2
!pip install scipy==1.10.1
!pip install gensim==4.3.2
!pip install scikit-learn==1.5.2
!pip install pandas==2.2.2

# Import the required libraries
from google.colab import files
import pandas as pd
import zipfile
import os
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from gensim.models import Word2Vec
import re
import string

# Step 2: Upload the dataset file manually
print("Silakan unggah file .zip atau .csv dari dataset 'nicapotato/womens-ecommerce-clothing-reviews'")
uploaded = files.upload()

# Step 3: Check and process the uploaded file(s)
for filename in uploaded.keys():
    print(f"File yang diunggah: {filename}")
    if filename.endswith('.zip'):
        with zipfile.ZipFile(filename, 'r') as zip_ref:
            zip_ref.extractall()
        print(f"File {filename} telah diekstrak.")
        csv_file = 'Womens Clothing E-Commerce Reviews.csv'
    else:
        csv_file = filename

# Step 4: Verify that the CSV file exists
if os.path.exists(csv_file):
    print(f"File CSV ditemukan: {csv_file}")
else:
    print(f"File {csv_file} tidak ditemukan. Berikut file yang tersedia:")
    !ls
    raise FileNotFoundError(f"File {csv_file} tidak ditemukan.")

# Step 5: Load the dataset into a pandas DataFrame
df = pd.read_csv(csv_file)
df['Review Text'] = df['Review Text'].fillna('')  # Replace NaN with an empty string
print("\nFirst few rows of the dataset:")
print(df.head())

# Step 6: Convert ratings to sentiment labels
def convert_rating_to_sentiment(rating):
    if rating in [1, 2]:
        return 'negative'
    elif rating == 3:
        return 'neutral'
    elif rating in [4, 5]:
        return 'positive'

df['Sentiment'] = df['Rating'].apply(convert_rating_to_sentiment)
print("\nJumlah Data per Kategori Sentimen:")
print(df['Sentiment'].value_counts())

# Step 7: Text preprocessing
def preprocess_text(text):
    text = text.lower()  # Convert to lowercase
    # Remove punctuation; re.escape keeps characters such as ']', '\' and '^' literal inside the class
    text = re.sub(f'[{re.escape(string.punctuation)}]', '', text)
    text = re.sub(r'\d+', '', text)  # Remove digits
    return text

df['Cleaned Review'] = df['Review Text'].apply(preprocess_text)

# Step 8: Split the data into training (80%), validation (10%), and testing (10%)
X = df['Cleaned Review']
y = df['Sentiment']
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42, stratify=y_temp)

print(f"\nUkuran Data:")
print(f"Training: {len(X_train)} samples")
print(f"Validation: {len(X_val)} samples")
print(f"Testing: {len(X_test)} samples")

# Step 9: Tokenize the text for Word2Vec
def tokenize_text(text):
    return text.split()

train_sentences = [tokenize_text(text) for text in X_train]
val_sentences = [tokenize_text(text) for text in X_val]
test_sentences = [tokenize_text(text) for text in X_test]

# Step 10: Train the Word2Vec models (CBOW and Skip-Gram)
# CBOW (sg=0)
cbow_model = Word2Vec(sentences=train_sentences, vector_size=100, window=5, min_count=1, workers=4, sg=0)
# Skip-Gram (sg=1)
skipgram_model = Word2Vec(sentences=train_sentences, vector_size=100, window=5, min_count=1, workers=4, sg=1)

print("\nModel Word2Vec telah dilatih (CBOW dan Skip-Gram).")

# Step 11: Function that builds the average vector for a text
def get_average_word2vec(tokens, model, vector_size):
    vectors = [model.wv[word] for word in tokens if word in model.wv]
    if len(vectors) == 0:
        return np.zeros(vector_size)
    return np.mean(vectors, axis=0)

# Vectorize the data with CBOW
X_train_cbow = np.array([get_average_word2vec(tokens, cbow_model, 100) for tokens in train_sentences])
X_val_cbow = np.array([get_average_word2vec(tokens, cbow_model, 100) for tokens in val_sentences])
X_test_cbow = np.array([get_average_word2vec(tokens, cbow_model, 100) for tokens in test_sentences])

# Vectorize the data with Skip-Gram
X_train_skipgram = np.array([get_average_word2vec(tokens, skipgram_model, 100) for tokens in train_sentences])
X_val_skipgram = np.array([get_average_word2vec(tokens, skipgram_model, 100) for tokens in val_sentences])
X_test_skipgram = np.array([get_average_word2vec(tokens, skipgram_model, 100) for tokens in test_sentences])

# Step 12: Train and evaluate the classifier (Logistic Regression)
# CBOW
clf_cbow = LogisticRegression(max_iter=1000, random_state=42)
clf_cbow.fit(X_train_cbow, y_train)
y_pred_cbow = clf_cbow.predict(X_test_cbow)

print("\nPerforma Klasifikasi dengan CBOW:")
print(f"Akurasi: {accuracy_score(y_test, y_pred_cbow):.4f}")
print("Classification Report:")
print(classification_report(y_test, y_pred_cbow))

# Skip-Gram
clf_skipgram = LogisticRegression(max_iter=1000, random_state=42)
clf_skipgram.fit(X_train_skipgram, y_train)
y_pred_skipgram = clf_skipgram.predict(X_test_skipgram)

print("\nPerforma Klasifikasi dengan Skip-Gram:")
print(f"Akurasi: {accuracy_score(y_test, y_pred_skipgram):.4f}")
print("Classification Report:")
print(classification_report(y_test, y_pred_skipgram))

# Step 13: Compare CBOW vs Skip-Gram
cbow_acc = accuracy_score(y_test, y_pred_cbow)
skipgram_acc = accuracy_score(y_test, y_pred_skipgram)
print("\nPerbandingan Performa:")
print(f"Akurasi CBOW: {cbow_acc:.4f}")
print(f"Akurasi Skip-Gram: {skipgram_acc:.4f}")
if cbow_acc > skipgram_acc:
    print("CBOW lebih baik berdasarkan akurasi.")
elif skipgram_acc > cbow_acc:
    print("Skip-Gram lebih baik berdasarkan akurasi.")
else:
    print("CBOW dan Skip-Gram memiliki akurasi yang sama.")

Found existing installation: numpy 1.25.2
Uninstalling numpy-1.25.2:
  Successfully uninstalled numpy-1.25.2
Found existing installation: scipy 1.10.1
Uninstalling scipy-1.10.1:
  Successfully uninstalled scipy-1.10.1
Collecting numpy==1.25.2
  Using cached numpy-1.25.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.6 kB)
Using cached numpy-1.25.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.2 MB)
Installing collected packages: numpy
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
pandas-gbq 0.28.0 requires pandas>=1.1.4, which is not installed.
statsmodels 0.14.4 requires pandas!=2.1.0,>=1.4, which is not installed.
statsmodels 0.14.4 requires scipy!=1.9.2,>=1.8, which is not installed.
bokeh 3.7.2 requires pandas>=1.2, which is not installed.
pytensor 2.30.3 requires scipy<2,>=1, which is not installed.
clara

Collecting scipy==1.10.1
  Using cached scipy-1.10.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (58 kB)
Using cached scipy-1.10.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (34.1 MB)
Installing collected packages: scipy
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
statsmodels 0.14.4 requires pandas!=2.1.0,>=1.4, which is not installed.
sentence-transformers 3.4.1 requires scikit-learn, which is not installed.
plotnine 0.14.5 requires pandas>=2.2.0, which is not installed.
yellowbrick 1.5 requires scikit-learn>=1.0.0, which is not installed.
librosa 0.11.0 requires scikit-learn>=1.1.0, which is not installed.
hdbscan 0.8.40 requires scikit-learn>=0.20, which is not installed.
imbalanced-learn 0.13.0 requires scikit-learn<2,>=1.3.2, which is not installed.
arviz 0.21.0 requires pandas>=1.5.0, which is not installed

Saving Womens Clothing E-Commerce Reviews.csv.zip to Womens Clothing E-Commerce Reviews.csv (2).zip
File yang diunggah: Womens Clothing E-Commerce Reviews.csv (2).zip
File Womens Clothing E-Commerce Reviews.csv (2).zip telah diekstrak.
File CSV ditemukan: Womens Clothing E-Commerce Reviews.csv

First few rows of the dataset:
   Unnamed: 0  Clothing ID  Age                    Title  \
0           0          767   33                      NaN   
1           1         1080   34                      NaN   
2           2         1077   60  Some major design flaws   
3           3         1049   50         My favorite buy!   
4           4          847   47         Flattering shirt   

                                         Review Text  Rating  Recommended IND  \
0  Absolutely wonderful - silky and sexy and comf...       4                1   
1  Love this dress!  it's sooo pretty.  i happene...       5                1   
2  I had such high hopes for this dress and reall...       3         
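
Step 11 in the cell above turns each review into a single vector by averaging the Word2Vec vectors of its tokens, falling back to a zero vector when no token is known. Below is a minimal sketch of that idea using a made-up two-dimensional embedding table. The `toy_vectors` dictionary and the `average_vector` helper are hypothetical and stand in for a trained `Word2Vec` model.

```python
def average_vector(tokens, vectors, vector_size):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [vectors[token] for token in tokens if token in vectors]
    if not known:
        return [0.0] * vector_size
    # zip(*known) groups the values of each dimension across all known tokens.
    return [sum(component) / len(known) for component in zip(*known)]

# Hypothetical 2-dimensional embeddings (real models use e.g. 100 dimensions).
toy_vectors = {
    "soft": [1.0, 0.0],
    "fabric": [0.0, 1.0],
    "tight": [-1.0, 1.0],
}

doc_vec = average_vector(["soft", "fabric", "unknownword"], toy_vectors, 2)
empty_vec = average_vector(["unknownword"], toy_vectors, 2)
print(doc_vec, empty_vec)
```

Averaging discards word order, so two reviews with the same words in a different order get the same vector. That is a known limitation of this kind of baseline representation.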