<a href="https://colab.research.google.com/github/sowmyarshetty/NNClass/blob/main/AmazonHomeKitchenReviewsPreprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install BERTopic

Collecting BERTopic
  Downloading bertopic-0.17.0-py3-none-any.whl.metadata (23 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch>=1.11.0->sentence-transformers>=0.4.1->BERTopic)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch>=1.11.0->sentence-transformers>=0.4.1->BERTopic)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch>=1.11.0->sentence-transformers>=0.4.1->BERTopic)
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch>=1.11.0->sentence-transformers>=0.4.1->BERTopic)
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.4.5.8 (from torch>=1.11.0->sentence-transformers>=0.4.1->BERTopic)
  Downloa

In [2]:
import pandas as pd
import dask.dataframe as dd
import gdown
import re
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer

In [3]:
# Mount Google Drive (For Colab Users)
from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


In [4]:
amazonhkdatasetfileid = '14GcJAzyN2PFg2JuyzF0pRmxlMmimrz9o'
amazonhkdatasetfilename = 'AmazonHomeKitchenReviews.csv'

url = f"https://drive.google.com/uc?export=download&id={amazonhkdatasetfileid}"

gdown.download(url,amazonhkdatasetfilename, quiet=False)


Downloading...
From (original): https://drive.google.com/uc?export=download&id=14GcJAzyN2PFg2JuyzF0pRmxlMmimrz9o
From (redirected): https://drive.google.com/uc?export=download&id=14GcJAzyN2PFg2JuyzF0pRmxlMmimrz9o&confirm=t&uuid=130918e0-9586-42d8-be81-a0fdf2c0623f
To: /content/AmazonHomeKitchenReviews.csv
100%|██████████| 692M/692M [00:10<00:00, 64.1MB/s]


'AmazonHomeKitchenReviews.csv'

* Read the dataset CSV into a DataFrame

In [24]:
df_data = pd.read_csv(amazonhkdatasetfilename)
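
The `Unnamed: 0` column that shows up later in this notebook is what pandas produces when a CSV has an unnamed first column, which usually means a saved index. A minimal sketch of reading such a column as the index, on a tiny in-memory CSV with made-up rows (that the real file was written this way is an assumption):

```python
import io
import pandas as pd

# Tiny in-memory CSV standing in for AmazonHomeKitchenReviews.csv (rows are made up)
csv_text = (
    ",rating,title_x,text\n"
    "59,5,Adorable!,These are so sweet.\n"
    "87,5,Great pan,Balanced and sturdy.\n"
)

# index_col=0 reads the unnamed first column as the index instead of 'Unnamed: 0'
df_small = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(df_small.columns.tolist())
print(df_small.index.tolist())
```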




*   Analyse the dataset
*   Check the total number of unique products and the review counts




In [25]:
df_renamed = df_data.rename(columns={'title_y' : 'product_title','title_x':'review_title','text':'review_text'})
df_renamed.groupby('product_title').size().sort_values(ascending=False).head(5)
print(df_renamed.columns)

Index(['Unnamed: 0', 'rating', 'review_title', 'review_text', 'images', 'asin',
       'parent_asin', 'user_id', 'timestamp', 'helpful_vote',
       'verified_purchase', 'product_title', 'description', 'price', 'Brand',
       'Material', 'Color', 'categories'],
      dtype='object')
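
The cell above computes the per-product review counts, but only the column list is displayed because the `groupby` expression is not the last statement in the cell. A minimal sketch of how both quantities could be printed, using a tiny made-up frame in place of `df_renamed` (the sample rows are hypothetical; only the column name matches the notebook):

```python
import pandas as pd

# Hypothetical stand-in for df_renamed
sample = pd.DataFrame({
    "product_title": ["Fry Pan", "Fry Pan", "Wine Stopper", "Bath Rug", "Fry Pan"],
    "rating": [5, 4, 5, 3, 5],
})

# Number of distinct products
n_products = sample["product_title"].nunique()

# Reviews per product, most-reviewed first
review_counts = sample["product_title"].value_counts()

print(f"Unique products: {n_products}")
print(review_counts.head(5))
```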


* Pre-processing
* X = review_title,review_text
* y = rating

In [26]:
df_renamed.head(2)

Unnamed: 0.1,Unnamed: 0,rating,review_title,review_text,images,asin,parent_asin,user_id,timestamp,helpful_vote,verified_purchase,product_title,description,price,Brand,Material,Color,categories
0,59,5,Adorable!,These are so sweet. I do wish the stopper part...,[],B01HBWGU80,B01DR2ACA0,AGKHLEW2SOWHNMFQIJGBECAF7INQ,2019-07-23 04:29:16.671,0,True,"Little Bird Wine Bottle Stopper, Silicone Stop...",[],9.49,LouisChoice,Silicone,Assorted Color,"['Home & Kitchen', 'Kitchen & Dining', 'Kitche..."
1,87,5,"Stailess, healthier than coated pans","Great little stainless steel, balanced, good w...",[],B07T5CRVKQ,B08C7JYKZH,AEVWAM3YWN5URJVJIZZ6XPD2MKIA,2020-11-02 22:09:44.073,1,True,"Fortune Candy 8-Inch Fry Pan with Lid, 3-ply S...",[],24.99,Fortune Candy,"Stainless Steel, Aluminum",Mirror Finish,"['Home & Kitchen', 'Kitchen & Dining', 'Cookwa..."


**Text Pre-processing**

* Use a lemmatizer on the review title and review text.
* This can improve accuracy: grouping different forms of the same word helps the model understand the meaning of the text.
* It can reduce noise by removing redundant information from the text data.
* It can improve efficiency by shrinking the vocabulary and speeding up the analysis.

In [27]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# Download stopwords and punkt if not already downloaded
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    # Tokenize text
    tokens = word_tokenize(text.lower())  # Convert to lowercase and tokenize
    # Remove stop words and lemmatize
    cleaned_tokens = [lemmatizer.lemmatize(word) for word in tokens if word.isalpha() and word not in stop_words]
    return " ".join(cleaned_tokens)

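# Build processed_review. Note: only review_text goes through preprocess_text;
# review_title is prepended unchanged, with no separator between the two.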
df_renamed['processed_review'] = df_renamed['review_title'].astype(str) +  df_renamed['review_text'].astype(str).apply(preprocess_text)


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [28]:
df_renamed['processed_review'].head(2)

Unnamed: 0,processed_review
0,Adorable!sweet wish stopper part little longer...
1,"Stailess, healthier than coated pansgreat litt..."
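
The output above shows the raw title glued directly onto the cleaned text (`Adorable!sweet ...`): only `review_text` passes through `preprocess_text`, and no separator is inserted. If the intent is to clean both fields, one option is to join them with a space first and then apply the function once. The sketch below uses a simplified stand-in cleaner (`simple_clean`, hypothetical) so that it runs without NLTK downloads; in the notebook, `preprocess_text` would take its place.

```python
import re
import pandas as pd

def simple_clean(text: str) -> str:
    """Hypothetical stand-in for preprocess_text: lowercase, keep alphabetic tokens."""
    return " ".join(re.findall(r"[a-z]+", text.lower()))

sample = pd.DataFrame({
    "review_title": ["Adorable!", None],
    "review_text": ["These are so sweet.", "Very pretty, wish they came bigger"],
})

# Fill missing values, join title and text with a space, then clean the combined string
combined = sample["review_title"].fillna("") + " " + sample["review_text"].fillna("")
sample["processed_review"] = combined.apply(simple_clean)

print(sample["processed_review"].tolist())
```

Filling missing values with an empty string also avoids the literal text `nan` that `astype(str)` produces for missing titles.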


In [29]:
df_renamed["categories"].value_counts().sort_values(ascending=False).head(10)

Unnamed: 0_level_0,count
categories,Unnamed: 1_level_1
"['Home & Kitchen', 'Bedding', 'Sheets & Pillowcases', 'Sheet & Pillowcase Sets']",91691
"['Home & Kitchen', 'Home Décor Products', 'Window Treatments', 'Curtains & Drapes', 'Panels']",51297
"['Home & Kitchen', 'Bedding', 'Decorative Pillows, Inserts & Covers', 'Throw Pillow Covers']",28532
"['Home & Kitchen', 'Bedding', 'Sheets & Pillowcases', 'Pillowcases']",26212
"['Home & Kitchen', 'Bedding', 'Blankets & Throws', 'Throws']",21117
"['Home & Kitchen', 'Bath', 'Bath Rugs']",19613
"['Home & Kitchen', 'Home Décor Products', 'Slipcovers', 'Sofa Slipcovers']",17495
"['Home & Kitchen', 'Kitchen & Dining', 'Dining & Entertaining', 'Glassware & Drinkware', 'Tumblers & Water Glasses']",15843
"['Home & Kitchen', 'Kitchen & Dining', 'Kitchen & Table Linens', 'Tablecloths']",14710
"['Home & Kitchen', 'Bath', 'Bathroom Accessories', 'Shower Curtains, Hooks & Liners', 'Shower Curtain Liners']",13845




*   Encode the categories column
*   Multi-hot encoding
*   Convert the categories column from a string representation of a list to an actual list before passing it to MultiLabelBinarizer
*   To do: TF-IDF could be used to extract important category words if required
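
To make the multi-hot idea concrete, here is a small sketch on two category strings taken from the counts above. It parses each string with `ast.literal_eval` and then encodes the lists with `MultiLabelBinarizer`; the variable names are illustrative only.

```python
import ast
from sklearn.preprocessing import MultiLabelBinarizer

# Category values as they appear in the CSV: string representations of lists
raw = [
    "['Home & Kitchen', 'Bath', 'Bath Rugs']",
    "['Home & Kitchen', 'Bedding', 'Blankets & Throws', 'Throws']",
]

# Parse each string into a real Python list
parsed = [ast.literal_eval(s) for s in raw]

# One column per distinct category; 1 where the row contains that category
mlb = MultiLabelBinarizer()
encoded = mlb.fit_transform(parsed)

print(list(mlb.classes_))
print(encoded)
```

Each row has as many ones as it has categories, so the matrix grows one column wider for every distinct category in the data.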


In [33]:
import ast
df_encoded = df_renamed.copy()
# Because the categories column is a string <list> , we have to convert it into a list before encoding
df_encoded['categories'] = df_encoded['categories'].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else x)


In [34]:
print(df_encoded['categories'].head(5))

0    [Home & Kitchen, Kitchen & Dining, Kitchen Ute...
1    [Home & Kitchen, Kitchen & Dining, Cookware, P...
2    [Home & Kitchen, Kitchen & Dining, Kitchen & T...
3    [Home & Kitchen, Kitchen & Dining, Kitchen & T...
4    [Home & Kitchen, Bedding, Sheets & Pillowcases...
Name: categories, dtype: object


In [39]:
from sklearn.preprocessing import MultiLabelBinarizer

# Convert NaN or empty categories to empty lists
df_encoded['categories'] = df_encoded['categories'].apply(lambda x: x if isinstance(x, list) else [])

# Apply MultiLabelBinarizer
mlb = MultiLabelBinarizer()
categories_encoded = mlb.fit_transform(df_encoded['categories'])

# Convert to DataFrame with category names as columns
categories_df = pd.DataFrame(categories_encoded, columns=mlb.classes_)

# Merge back with original DataFrame
df = pd.concat([df_encoded, categories_df], axis=1)
# df.drop(columns=['categories'], inplace=True)

print(df.head(5))


   Unnamed: 0  rating                          review_title  \
0          59       5                             Adorable!   
1          87       5  Stailess, healthier than coated pans   
2          89       5               Pretty colors available   
3          90       4                         Nice material   
4          93       4                      Love the zipper!   

                                         review_text images        asin  \
0  These are so sweet. I do wish the stopper part...     []  B01HBWGU80   
1  Great little stainless steel, balanced, good w...     []  B07T5CRVKQ   
2  Nice thin placemats of good size. Can be used ...     []  B07JRGZG6F   
3                 Very pretty, wish they came bigger     []  B00TW2M6YA   
4  The red is a deeper red rather than a bright r...     []  B01N6C4XJ7   

  parent_asin                       user_id                timestamp  \
0  B01DR2ACA0  AGKHLEW2SOWHNMFQIJGBECAF7INQ  2019-07-23 04:29:16.671   
1  B08C7JYKZH  AEVWAM3YWN5

In [43]:
# print(categories_df.columns)  # Displays the binary-encoded category column names

print(df)  # Categories are now binary-encoded (use df.info() for a column summary)

        Unnamed: 0  rating                          review_title  \
0               59       5                             Adorable!   
1               87       5  Stailess, healthier than coated pans   
2               89       5               Pretty colors available   
3               90       4                         Nice material   
4               93       4                      Love the zipper!   
...            ...     ...                                   ...   
754074    13376217       5                            Five Stars   
754075    13376226       5                          Blue beauty!   
754076    13376239       5                           looks good!   
754077    13376281       5                        Linen Favorite   
754078    13376282       3                    Fine for Light Use   

                                              review_text  \
0       These are so sweet. I do wish the stopper part...   
1       Great little stainles


* Convert categories into embeddings rather than into many binary columns
* Assign a unique integer index to each category and feed those indices to an embedding layer in the neural network
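
The index assignment described above can be sketched in plain Python. This is a minimal illustration on toy category lists, assuming index 0 is reserved for padding; in the notebook the lists would come from the parsed `categories` column.

```python
# Toy category lists standing in for the parsed categories column
category_lists = [
    ["Home & Kitchen", "Bath", "Bath Rugs"],
    ["Home & Kitchen", "Bedding", "Throws"],
    [],
]

# Collect the distinct categories and assign 1..N in sorted order (0 is kept for padding)
all_categories = sorted({cat for row in category_lists for cat in row})
vocab = {cat: idx for idx, cat in enumerate(all_categories, start=1)}

# Replace each category name with its integer index
encoded = [[vocab[cat] for cat in row] for row in category_lists]

print(vocab)
print(encoded)
```

With this scheme an embedding layer would need `len(vocab) + 1` rows, one per category plus the padding index.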






In [78]:
# import tensorflow as tf
# from tensorflow.keras.preprocessing.text import Tokenizer
# from tensorflow.keras.preprocessing.sequence import pad_sequences

# df_categories_tokenized = df_renamed.copy()

# # Flatten all categories into a single list for tokenization
# all_categories = [cat for sublist in df_categories_tokenized['categories'] for cat in sublist]

# # Fit tokenizer on all unique categories
# tokenizer = Tokenizer()
# tokenizer.fit_on_texts(all_categories)

# # Convert each row’s category list into a sequence of indexes
# df_categories_tokenized['categories_tokenized'] = df_categories_tokenized['categories'].apply(lambda x: tokenizer.texts_to_sequences(x))

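# NOTE: the padding step below fails. Likely causes: df_renamed['categories'] still
# holds strings (the parsed lists are in df_encoded), and texts_to_sequences(x)
# returns a list of lists per row, which pad_sequences cannot pad directly.

# # Set MAX_CATEGORIES to a reasonable value, e.g., 100
# MAX_CATEGORIES = 100

# # Apply padding to make all sequences the same length
# padded_categories = pad_sequences(df_categories_tokenized['categories_tokenized'], maxlen=MAX_CATEGORIES, padding='post')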

# # Convert the padded sequences into a DataFrame
# df_padded = pd.DataFrame(padded_categories, columns=[f'category_{i+1}' for i in range(MAX_CATEGORIES)])

# # Check the padded output
# print(df_padded.head())  # Verify the first few rows of the padded categories
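
One way to sidestep the failing `pad_sequences` call is to pad plain lists of integer indices by hand. The sketch below assumes each row is already a flat list of integers (not a list of lists) and that 0 is the padding index; `pad_post` is a hypothetical helper, not part of Keras.

```python
import numpy as np

def pad_post(sequences, maxlen, value=0):
    """Pad (or truncate) each integer list on the right to exactly maxlen entries."""
    out = np.full((len(sequences), maxlen), value, dtype=np.int64)
    for i, seq in enumerate(sequences):
        trimmed = seq[:maxlen]
        if trimmed:  # leave all-padding rows untouched
            out[i, :len(trimmed)] = trimmed
    return out

# Toy rows of category indices with different lengths, including an empty row
rows = [[4, 1, 2], [4, 3, 5, 6, 7], []]
padded = pad_post(rows, maxlen=4)

print(padded)
```

Setting `maxlen` to the longest category list in the data, rather than a fixed 100, keeps the padded matrix as narrow as possible.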




In [79]:
# # Check the first few entries
# print(df_categories_tokenized['categories_tokenized'].head())

# # Check if each entry is a list
# print(df_categories_tokenized['categories_tokenized'].apply(type).value_counts())  # Should output list for each row

# df_categories_tokenized['categories_tokenized'] = df_categories_tokenized['categories_tokenized'].apply(lambda x: x if isinstance(x, list) else [])

# # Check again
# print(df_categories_tokenized['categories_tokenized'].apply(type).value_counts())  # Should now show only list


# # Replace any null values with empty lists
# df_categories_tokenized['categories_tokenized'] = df_categories_tokenized['categories_tokenized'].apply(lambda x: x if isinstance(x, list) else [])

# # Ensure there are no empty lists if needed
# df_categories_tokenized['categories_tokenized'] = df_categories_tokenized['categories_tokenized'].apply(lambda x: x if len(x) > 0 else [])


# # Check if any lists are empty
# empty_lists = df_categories_tokenized['categories_tokenized'].apply(lambda x: len(x) == 0)
# print(f"Number of empty lists: {empty_lists.sum()}")

# # Optionally replace empty lists with a default value (e.g., [0])
# df_categories_tokenized['categories_tokenized'] = df_categories_tokenized['categories_tokenized'].apply(lambda x: x if len(x) > 0 else [0])

# category_lengths = df_categories_tokenized['categories_tokenized'].apply(len)
# print(f"Max number of categories: {category_lengths.max()}")
# print(f"Min number of categories: {category_lengths.min()}")
# print(f"Mean number of categories: {category_lengths.mean()}")
# # Check the first few rows of 'categories_tokenized' to understand its structure
# print(df_categories_tokenized['categories_tokenized'].head())








*   Encode the review title and review text
*   Use TF-IDF to encode the review title and review text
*   Ensure there are no null values in review_title and review_text



In [82]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Use a separate TF-IDF vectorizer per field; refitting a single vectorizer
# would overwrite the vocabulary learned from the first field
title_vectorizer = TfidfVectorizer(max_features=5000)  # You can adjust the number of features
tfidf_vectorizer = TfidfVectorizer(max_features=5000)  # Used for review_text
df['review_title'] = df['review_title'].fillna('No Review')
df['review_text'] = df['review_text'].fillna('No Review')

# Fit and transform the review title and review text
X_title = title_vectorizer.fit_transform(df['review_title'])
X_text = tfidf_vectorizer.fit_transform(df['review_text'])

# Convert to dense format (optional, depending on your model).
# Caution: a dense 754079 x 5000 float64 matrix needs roughly 30 GB of memory.
X_title_dense = X_title.toarray()
X_text_dense = X_text.toarray()

# Check the shape of the TF-IDF encoded title and text
print(f"Shape of review title matrix: {X_title_dense.shape}")
print(f"Shape of review text matrix: {X_text_dense.shape}")


Shape of review title matrix: (754079, 5000)
Shape of review text matrix: (754079, 5000)
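
Both matrices have the same number of rows, so they can be placed side by side to form a single feature matrix. A minimal sketch on a few short reviews, assuming one vectorizer per field and `scipy.sparse.hstack` to keep the result sparse (the sample strings and the small `max_features` value are illustrative only):

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer

titles = ["Nice material", "Love the zipper", "Pretty colors available"]
texts = [
    "Very pretty, wish they came bigger",
    "The red is a deeper red rather than a bright red",
    "Nice thin placemats of good size",
]

# Separate vectorizers so each field keeps its own vocabulary
title_vec = TfidfVectorizer(max_features=50)
text_vec = TfidfVectorizer(max_features=50)

X_title = title_vec.fit_transform(titles)
X_text = text_vec.fit_transform(texts)

# Stack column-wise; the result stays sparse
X = hstack([X_title, X_text]).tocsr()

print(X_title.shape, X_text.shape, X.shape)
```

Keeping the stacked matrix sparse avoids the memory cost of the dense conversion, and many scikit-learn estimators accept sparse input directly.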


If you're using a neural network model, word embeddings such as Word2Vec or GloVe, or a Keras Embedding layer, can be used to learn better semantic representations of words.