In this notebook, we train a lightweight LLM to generate a short article for each product category. As a first attempt, we use a small Llama model (3B) with few-shot prompting and fine-tuning on our dataset.
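
Few-shot (n-shot) prompting means placing a handful of worked examples in the prompt before the actual request. The sketch below shows one possible way to assemble such a prompt as plain text. The function name `build_few_shot_prompt`, the prompt wording, and the layout are all assumptions made for illustration, not the format used later in this project.

```python
from typing import List, Tuple


def build_few_shot_prompt(
    examples: List[Tuple[str, str]],
    category: str,
    reviews: List[str],
) -> str:
    """Assemble a plain-text few-shot prompt (hypothetical layout).

    examples: (reviews_text, article_text) pairs shown to the model as demonstrations.
    category: the product category the new article should cover.
    reviews:  review snippets for that category.
    """
    parts = ["Write a short article summarizing customer reviews for a product category.\n"]
    # Each demonstration shows the input reviews followed by the expected article.
    for example_reviews, example_article in examples:
        parts.append(f"Reviews:\n{example_reviews}\nArticle:\n{example_article}\n")
    # The final block repeats the pattern but leaves the article for the model to complete.
    joined = "\n".join(f"- {review}" for review in reviews)
    parts.append(f"Category: {category}\nReviews:\n{joined}\nArticle:")
    return "\n".join(parts)
```

With `n` demonstration pairs in `examples` this yields an n-shot prompt; an empty list gives a zero-shot prompt with the same layout.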

In [1]:
# First we download the dataset with kagglehub; the CSV files are loaded into dataframes below
import kagglehub
import pandas as pd
import os

path = kagglehub.dataset_download("datafiniti/consumer-reviews-of-amazon-products")


# Load the three CSV files from the downloaded path. From the previous notebook we know the shape of the data and the columns we are interested in
file_path1 = os.path.join(path, "1429_1.csv")
df1 = pd.read_csv(file_path1)
file_path2 = os.path.join(path, "Datafiniti_Amazon_Consumer_Reviews_of_Amazon_Products.csv")
df2 = pd.read_csv(file_path2)
file_path3 = os.path.join(path, "Datafiniti_Amazon_Consumer_Reviews_of_Amazon_Products_May19.csv")
df3 = pd.read_csv(file_path3)
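
# A possible alternative to the three separate read_csv calls, shown as a sketch.
# pandas can emit a DtypeWarning when a large CSV has columns with mixed types;
# passing low_memory=False makes it read the whole file before inferring dtypes.
# The helper name `load_csvs` is hypothetical and not part of the original notebook.

```python
import os

import pandas as pd


def load_csvs(folder, filenames):
    """Read each CSV in `filenames` from `folder` and return {filename: DataFrame}."""
    frames = {}
    for name in filenames:
        # low_memory=False avoids chunked dtype inference on mixed-type columns
        frames[name] = pd.read_csv(os.path.join(folder, name), low_memory=False)
    return frames
```

Under these assumptions, `load_csvs(path, ["1429_1.csv", ...])` would return the same dataframes as the cell above, keyed by file name.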


In [2]:
# Load the pickled meta-category mapping; it is applied to the dataframes once these are merged
import pickle
from pathlib import Path

current_dir = Path.cwd()
parent_dir = current_dir.parent
pickle_file_path = parent_dir / "Clustering model" / "unique_categories_dict.pkl"

with open(pickle_file_path, "rb") as f:
    meta_category_mapping = pickle.load(f)
print("Meta-category mapping loaded successfully.")
print(meta_category_mapping)

Meta-category mapping loaded successfully.
{'AA,AAA,Electronics Features,Health,Electronics,Health & Household,Camcorder Batteries,Camera & Photo,Batteries,Household Batteries,Accessories,Camera Batteries,Health and Beauty,Household Supplies,Batteries & Chargers,Health, Household & Baby Care,Health Personal Care': 'Batteries', 'AA,AAA,Health,Electronics,Health & Household,Camcorder Batteries,Camera & Photo,Batteries,Household Batteries,Robot Check,Accessories,Camera Batteries,Health and Beauty,Household Supplies,Batteries & Chargers,Health, Household & Baby Care,Health Personal Care': 'Batteries', 'Accessories,USB Cables,Computers & Accessories,Computer Accessories & Peripherals,Electronics,Cables,Cables & Interconnects': 'Portable Electronics', 'Amazon Device Accessories,Kindle Store,Kindle Touch (4th Generation) Accessories,Kindle E-Reader Accessories,Covers,Kindle Touch (4th Generation) Covers': 'Portable Electronics', 'Amazon Devices & Accessories,Amazon Device Accessories,Power Ad

In [3]:
# print the columns of the dataframes
print("Columns in df1:", df1.columns)
print("Columns in df2:", df2.columns)
print("Columns in df3:", df3.columns)

Columns in df1: Index(['id', 'name', 'asins', 'brand', 'categories', 'keys', 'manufacturer',
       'reviews.date', 'reviews.dateAdded', 'reviews.dateSeen',
       'reviews.didPurchase', 'reviews.doRecommend', 'reviews.id',
       'reviews.numHelpful', 'reviews.rating', 'reviews.sourceURLs',
       'reviews.text', 'reviews.title', 'reviews.userCity',
       'reviews.userProvince', 'reviews.username'],
      dtype='object')
Columns in df2: Index(['id', 'dateAdded', 'dateUpdated', 'name', 'asins', 'brand',
       'categories', 'primaryCategories', 'imageURLs', 'keys', 'manufacturer',
       'manufacturerNumber', 'reviews.date', 'reviews.dateAdded',
       'reviews.dateSeen', 'reviews.doRecommend', 'reviews.id',
       'reviews.numHelpful', 'reviews.rating', 'reviews.sourceURLs',
       'reviews.text', 'reviews.title', 'reviews.username', 'sourceURLs'],
      dtype='object')
Columns in df3: Index(['id', 'dateAdded', 'dateUpdated', 'name', 'asins', 'brand',
       'categories', 'primaryCat
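
The three files do not share exactly the same columns (for example, `reviews.didPurchase` appears only in the first listing above), so any column kept for the merge has to exist in all of them. A minimal sketch of that check, with the hypothetical helper name `shared_columns`:

```python
import pandas as pd


def shared_columns(frames):
    """Return the columns present in every DataFrame, in the order of the first one."""
    common = set(frames[0].columns)
    for frame in frames[1:]:
        common &= set(frame.columns)
    # Keep the first frame's ordering so the result is deterministic
    return [col for col in frames[0].columns if col in common]
```

Calling `shared_columns([df1, df2, df3])` would list the candidates from which the columns to keep can be chosen.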

In [4]:
# Columns to keep
columns_to_keep = ['name', 'asins', 'categories', 'reviews.doRecommend', 'reviews.numHelpful', 'reviews.rating', 'reviews.text', 'reviews.title']

# Filter the dataframes to keep only the relevant columns
df1_filtered = df1[columns_to_keep]
df2_filtered = df2[columns_to_keep]
df3_filtered = df3[columns_to_keep]
# Concatenate the filtered dataframes
df_combined = pd.concat([df1_filtered, df2_filtered, df3_filtered], ignore_index=True)
# Add a meta_category column by mapping each categories string to its meta-category
df_combined['meta_category'] = df_combined['categories'].map(meta_category_mapping)

# Print the shape and the head of the combined dataframe
print("Shape of the combined dataframe:", df_combined.shape)
print("Head of the combined dataframe:")
print(df_combined.head())

Shape of the combined dataframe: (67992, 9)
Head of the combined dataframe:
                                                name       asins  \
0  All-New Fire HD 8 Tablet, 8 HD Display, Wi-Fi,...  B01AHB9CN2   
1  All-New Fire HD 8 Tablet, 8 HD Display, Wi-Fi,...  B01AHB9CN2   
2  All-New Fire HD 8 Tablet, 8 HD Display, Wi-Fi,...  B01AHB9CN2   
3  All-New Fire HD 8 Tablet, 8 HD Display, Wi-Fi,...  B01AHB9CN2   
4  All-New Fire HD 8 Tablet, 8 HD Display, Wi-Fi,...  B01AHB9CN2   

                                          categories reviews.doRecommend  \
0  Electronics,iPad & Tablets,All Tablets,Fire Ta...                True   
1  Electronics,iPad & Tablets,All Tablets,Fire Ta...                True   
2  Electronics,iPad & Tablets,All Tablets,Fire Ta...                True   
3  Electronics,iPad & Tablets,All Tablets,Fire Ta...                True   
4  Electronics,iPad & Tablets,All Tablets,Fire Ta...                True   

   reviews.numHelpful  reviews.rating  \
0                
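
`Series.map` with a dictionary returns NaN for any key that is missing from the dictionary, so a `categories` string absent from the mapping silently ends up without a meta-category. The sketch below is one way to surface such rows; the helper name `unmapped_categories` is an assumption made for illustration.

```python
import pandas as pd


def unmapped_categories(df, mapping, column="categories"):
    """Count the values of `column` that have no entry in `mapping`."""
    mapped = df[column].map(mapping)
    # Rows whose mapped value is NaN had no matching key in the dictionary
    missing = df.loc[mapped.isna(), column]
    return missing.value_counts()
```

An empty result from `unmapped_categories(df_combined, meta_category_mapping)` would indicate that every row received a meta-category.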

In [5]:
import re
# Drop rows with missing review text or rating; copy to avoid a possible SettingWithCopyWarning when adding columns below
df_combined = df_combined.dropna(subset=["reviews.text", "reviews.rating"]).copy()

# Clean the review text
def clean_text(text):
    text = text.lower()  # lowercase
    text = re.sub(r"<.*?>", " ", text)  # remove HTML tags
    text = re.sub(r"[^\w\s]", " ", text)  # remove punctuation
    text = re.sub(r"\s+", " ", text)  # remove extra whitespace
    return text.strip()

df_combined["cleaned_text"] = df_combined["reviews.text"].apply(clean_text)
print("✅ Cleaned data")
df_combined[["reviews.text", "cleaned_text"]].head()

✅ Cleaned data


                                        reviews.text                                       cleaned_text
0  This product so far has not disappointed. My c...  this product so far has not disappointed my ch...
1  great for beginner or experienced person. Boug...  great for beginner or experienced person bough...
2  Inexpensive tablet for him to use and learn on...  inexpensive tablet for him to use and learn on...
3  I've had my Fire HD 8 two weeks now and I love...  i ve had my fire hd 8 two weeks now and i love...
4  I bought this for my grand daughter when she c...  i bought this for my grand daughter when she c...
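
The cleaning steps can be traced on a single string. The sketch below repeats `clean_text` from the cell above and adds a hypothetical wrapper, `safe_clean_text`, that returns an empty string for non-string values (an assumption about how such values might be handled; the notebook itself drops missing reviews beforehand).

```python
import re


def clean_text(text: str) -> str:
    text = text.lower()                   # lowercase
    text = re.sub(r"<.*?>", " ", text)    # replace HTML tags with a space
    text = re.sub(r"[^\w\s]", " ", text)  # replace punctuation with a space
    text = re.sub(r"\s+", " ", text)      # collapse runs of whitespace
    return text.strip()


def safe_clean_text(value) -> str:
    """Hypothetical wrapper: non-string inputs (NaN, numbers) become ''."""
    if not isinstance(value, str):
        return ""
    return clean_text(value)


print(safe_clean_text("I've had my Fire HD 8<br/>two weeks!"))
# prints: i ve had my fire hd 8 two weeks
```

Because punctuation is replaced by a space rather than deleted, contractions such as "I've" are split into "i ve", which matches row 3 of the output above.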


In [6]:
# Use a pre-trained SentenceTransformer model to generate embeddings for the cleaned review text
import torch
from sentence_transformers import SentenceTransformer

# Fall back to the CPU when no GPU is available instead of hard-coding 'cuda'
device = "cuda" if torch.cuda.is_available() else "cpu"
model = SentenceTransformer("all-MiniLM-L6-v2")
# Generate embeddings for the cleaned review text
embeddings = model.encode(df_combined["cleaned_text"].tolist(), show_progress_bar=True, convert_to_tensor=True, device=device)
# Add the embeddings to the dataframe (one list of floats per row)
df_combined["embeddings"] = embeddings.tolist()






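
Storing the embeddings with `.tolist()` puts one Python list of floats in each dataframe cell (384 values per row for `all-MiniLM-L6-v2`). For numerical work it is usually more convenient to have them back as a single 2-D array. A minimal sketch, assuming every row holds a list of the same length; the helper name `embeddings_matrix` is hypothetical.

```python
import numpy as np
import pandas as pd


def embeddings_matrix(df: pd.DataFrame, column: str = "embeddings") -> np.ndarray:
    """Stack a column of equal-length float lists into an (n_rows, dim) float32 array."""
    return np.asarray(df[column].tolist(), dtype=np.float32)
```

Row `i` of the returned array corresponds to the `i`-th row of the dataframe by position, not by index label.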

In [None]:
from transformers import pipeline

# Use an ABSA model (DeBERTa-v3) from Hugging Face through the sentiment-analysis pipeline
absa_pipeline = pipeline(
    "sentiment-analysis",
    model="yangheng/deberta-v3-base-absa-v1.1"
)

# Apply the ABSA pipeline to the cleaned review text 
absa_results = absa_pipeline(df_combined["cleaned_text"].tolist())

# Add the ABSA results to the dataframe
df_combined["absa_results"] = absa_results

Device set to use cuda:0
Token indices sequence length is longer than the specified maximum sequence length for this model (732 > 512). Running this sequence through the model will result in indexing errors


KeyboardInterrupt: 
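
The warning above reports a review of 732 tokens against the model's 512-token limit, and the cell was interrupted before it finished. One possible way to address both is to truncate long inputs and process the reviews in batches, so that partial results are available if the run is stopped. The sketch below uses hypothetical helpers (`batched`, `classify_in_batches`) that accept any callable; it assumes the Hugging Face text-classification pipeline forwards `truncation` and `max_length` to its tokenizer when they are passed at call time.

```python
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def classify_in_batches(
    texts: Sequence[str],
    classifier: Callable[..., List[dict]],
    batch_size: int = 64,
    max_length: int = 512,
) -> List[dict]:
    """Run `classifier` over `texts` chunk by chunk, truncating over-long inputs."""
    results: List[dict] = []
    for chunk in batched(texts, batch_size):
        # truncation/max_length are assumed to reach the tokenizer (see lead-in)
        results.extend(classifier(list(chunk), truncation=True, max_length=max_length))
    return results


# Hypothetical usage with the pipeline defined in the cell above:
# absa_results = classify_in_batches(df_combined["cleaned_text"].tolist(), absa_pipeline)
```

Truncation discards everything after the first 512 tokens of a review, which trades completeness for a run that no longer hits indexing errors.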