# **Model Prototyping**

In this notebook, we test several Hugging Face transformer models on a sample of our reviews to check their suitability for sentiment analysis, summarization, and embedding-based review matching.

### Import Libraries

In [24]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import time
import re
import json
from sentence_transformers import SentenceTransformer
from sklearn.metrics import accuracy_score, f1_score
from transformers import pipeline
from sklearn.metrics.pairwise import cosine_similarity

### Sample Our Reviews & Create Binary Sentiment Value

In [17]:
# We sample 5000 of our reviews
df = pd.read_csv('../data/Reviews.csv')
df = df.sample(5000, random_state=42)

# Map star ratings to binary sentiment (neg=1-2, pos=4-5)
def map_sentiment(score):
    if score <= 2: return "NEGATIVE"
    if score >= 4: return "POSITIVE"
    return None  # ignore neutral 3-star reviews

df["label"] = df["Score"].apply(map_sentiment)
df = df.dropna(subset=["label"])

# Create review and label lists
reviews = df["Text"].tolist()
labels = df["label"].tolist()

### Iteratively Test Sentiment Models

In [18]:
sentiment_models = {
    "distilbert-sst2": "distilbert-base-uncased-finetuned-sst-2-english",
    "roberta-twitter": "cardiffnlp/twitter-roberta-base-sentiment",
    "bert-5star": "nlptown/bert-base-multilingual-uncased-sentiment"
}
results_sentiment = []

In [19]:
for name, model_id in sentiment_models.items():
    print(f"\nTesting {name}...")
    clf = pipeline("sentiment-analysis", model=model_id)

    start = time.time()
    preds = []
    true_labels = []
    for i, r in enumerate(reviews[:1000]):  # test on 1k reviews for speed
        out = clf(r[:512])[0]["label"]  # keep only the first 512 characters (a crude guard against the model's input length limit)
        # Map 5-star model output to POS/NEG if needed
        if name == "bert-5star":
            if out in ["1 star", "2 stars"]:
                out = "NEGATIVE"
            elif out in ["4 stars", "5 stars"]:
                out = "POSITIVE"
            else:
                continue
        preds.append(out)
        true_labels.append(labels[i])

    duration = time.time() - start
    acc = accuracy_score(true_labels, preds)
    # Use 'binary' only if there are exactly 2 unique classes, else use 'macro'
    unique_labels = set(true_labels) | set(preds)
    if len(unique_labels) == 2:
        f1 = f1_score(true_labels, preds, pos_label="POSITIVE", average="binary")
    else:
        f1 = f1_score(true_labels, preds, average="macro")

    results_sentiment.append((name, acc, f1, duration))

pd.DataFrame(results_sentiment, columns=["Model", "Accuracy", "F1", "Time (s)"])


Testing distilbert-sst2...

Testing roberta-twitter...

Testing bert-5star...


| Model | Accuracy | F1 | Time (s) |
|---|---|---|---|
| distilbert-sst2 | 0.828 | 0.887728 | 16.620344 |
| roberta-twitter | 0.0 | 0.0 | 31.197147 |
| bert-5star | 0.925193 | 0.954301 | 32.546389 |


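Here, we see that bert-5star scored the highest (accuracy 0.925, F1 0.954), with two caveats. First, its score is computed only over the reviews where it did not predict 3 stars, because the loop skips those. Second, roberta-twitter's 0.0 most likely reflects a label mismatch rather than poor predictions: this checkpoint appears to emit `LABEL_0`/`LABEL_1`/`LABEL_2` instead of `NEGATIVE`/`POSITIVE`, and the loop does not map them. bert-5star took about the same time as roberta-twitter and roughly twice as long as distilbert-sst2 (32.5 s vs. 16.6 s for 1,000 reviews). If we need a quick option, we can use distilbert-sst2 instead.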
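One way to put the three models on the same footing is to normalize every raw pipeline label to `POSITIVE`, `NEGATIVE`, or "skip" before scoring, and to report how many reviews were actually scored. The sketch below is a minimal illustration, not the code that produced the table above. The helper names (`LABEL_MAP`, `normalize_label`, `score_predictions`) are hypothetical, and the raw label strings for roberta-twitter are an assumption about that checkpoint that should be checked against its model card.

```python
from typing import List, Optional, Tuple

# Assumed raw-label conventions (hypothetical mapping, verify per model card):
#   distilbert-sst2 -> "POSITIVE" / "NEGATIVE"
#   roberta-twitter -> "LABEL_0" (negative), "LABEL_1" (neutral), "LABEL_2" (positive)
#   bert-5star      -> "1 star" ... "5 stars"
LABEL_MAP = {
    "NEGATIVE": "NEGATIVE",
    "POSITIVE": "POSITIVE",
    "LABEL_0": "NEGATIVE",
    "LABEL_1": None,  # neutral: skipped, like 3-star reviews
    "LABEL_2": "POSITIVE",
    "1 star": "NEGATIVE",
    "2 stars": "NEGATIVE",
    "3 stars": None,
    "4 stars": "POSITIVE",
    "5 stars": "POSITIVE",
}


def normalize_label(raw: str) -> Optional[str]:
    """Map a raw pipeline label to POSITIVE/NEGATIVE, or None to skip it."""
    if raw not in LABEL_MAP:
        raise ValueError(f"Unknown label: {raw!r}")
    return LABEL_MAP[raw]


def score_predictions(raw_preds: List[str], gold: List[str]) -> Tuple[float, float]:
    """Return (accuracy on scored items, fraction of items that were scored)."""
    pairs = [(normalize_label(p), g) for p, g in zip(raw_preds, gold)]
    kept = [(p, g) for p, g in pairs if p is not None]
    coverage = len(kept) / len(gold) if gold else 0.0
    accuracy = sum(p == g for p, g in kept) / len(kept) if kept else 0.0
    return accuracy, coverage


# Toy example with made-up predictions, not real model output
raw = ["LABEL_2", "LABEL_0", "LABEL_1", "LABEL_2"]
gold = ["POSITIVE", "NEGATIVE", "POSITIVE", "NEGATIVE"]
acc, cov = score_predictions(raw, gold)
print(f"accuracy={acc:.2f}, coverage={cov:.2f}")
```

Reporting coverage next to accuracy makes it visible when a model's score is based on fewer reviews because its neutral predictions were dropped.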

### Iteratively Test Summarizer Models

In [20]:
summarizers = {
    "bart-large": "facebook/bart-large-cnn",
    "t5-small": "t5-small",
    "pegasus": "google/pegasus-xsum"
}
sample_long_reviews = [r for r in reviews if len(r.split()) > 80][:3]

In [21]:
for name, model_id in summarizers.items():
    print(f"\nTesting {name}...")
    summarizer = pipeline("summarization", model=model_id)

    for text in sample_long_reviews:
        summary = summarizer(text, max_length=60, min_length=15, do_sample=False)[0]["summary_text"]
        print(f"\nReview snippet: {text[:200]}...")
        print(f"→ Summary: {summary}")


Testing bart-large...

Review snippet: Having tried a couple of other brands of gluten-free sandwich cookies, these are the best of the bunch.  They're crunchy and true to the texture of the other "real" cookies that aren't gluten-free.  S...
→ Summary: Glutino's gluten-free sandwich cookies are the best of the bunch. They're crunchy and true to the texture of the other "real" cookies. The chocolate version is just as good and has a true "chocolatey" taste.

Review snippet: My cat loves these treats. If ever I can't find her in the house, I just pop the top and she bolts out of wherever she was hiding to come get a treat. She doesn't like crunchy treats much, so these ar...
→ Summary: My cat loves these treats. If ever I can't find her in the house, I just pop the top and she bolts out of wherever she was hiding to come get a treat. They do tend to dry out by the time I near the end of the bottle, however.

Review snippet: First there was Frosted Mini-Wheats, in original size, then th

Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-xsum and are newly initialized: ['model.encoder.embed_positions.weight', 'model.decoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



Review snippet: Having tried a couple of other brands of gluten-free sandwich cookies, these are the best of the bunch.  They're crunchy and true to the texture of the other "real" cookies that aren't gluten-free.  S...
→ Summary: I'm a huge fan of Glutino's gluten-free chocolate chip cookies.

Review snippet: My cat loves these treats. If ever I can't find her in the house, I just pop the top and she bolts out of wherever she was hiding to come get a treat. She doesn't like crunchy treats much, so these ar...
→ Summary: These cat treats come in three different sizes, so you can find one for your cat, one for your dog, and one for your ferret.

Review snippet: First there was Frosted Mini-Wheats, in original size, then there was Frosted Mini-Wheats Bite Size. Well, if for some reason those were too much of a mouthful, we now have Frosted Mini-Wheats Little ...
→ Summary: Kellogs, the makers of Frosted Mini-Wheats, have been rolling out a series of smaller versions of their famous bisc

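We can see that bart-large had the most accurate and on-topic summaries: it mostly reuses sentences from the review itself, whereas pegasus adds details that appear unsupported by the review (for example, treats "for your ferret").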
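To make the "on-topic" judgment a little less subjective, one rough proxy is the share of summary words that never occur in the source review. The sketch below is an assumption-laden illustration, not something run in this notebook: `novel_word_ratio` is a hypothetical helper, the tokenizer is a simple regex, and a high ratio can also come from legitimate paraphrasing, so it only flags summaries worth a closer look.

```python
import re


def novel_word_ratio(source: str, summary: str) -> float:
    """Fraction of summary words that do not appear anywhere in the source."""
    def tokenize(text: str):
        return re.findall(r"[a-z0-9']+", text.lower())

    source_words = set(tokenize(source))
    summary_words = tokenize(summary)
    if not summary_words:
        return 0.0
    novel = [w for w in summary_words if w not in source_words]
    return len(novel) / len(summary_words)


# Toy example with shortened, made-up text
review = "My cat loves these treats. She bolts out to get a treat."
extractive = "My cat loves these treats."
invented = "These treats come in three sizes for your ferret."

print(novel_word_ratio(review, extractive))
print(novel_word_ratio(review, invented))
```

A summary copied from the review scores 0, while one that introduces new claims scores much higher.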

### Iteratively Test Embedding Models

In [22]:
embed_models = {
    "miniLM": "all-MiniLM-L6-v2",
    "mpnet": "all-mpnet-base-v2",
    "multi-qa": "multi-qa-MiniLM-L6-cos-v1"
}
sample_queries = [
    "taste great",
    "energy boost",
    "too sweet"
]

In [23]:
for name, model_id in embed_models.items():
    print(f"\nTesting {name}...")
    model = SentenceTransformer(model_id)

    start = time.time()
    review_embeds = model.encode(reviews[:2000], show_progress_bar=True)
    query_embeds = model.encode(sample_queries)
    duration = time.time() - start

    print(f"Embeddings shape: {review_embeds.shape}, Time: {duration:.2f}s")

    # Quick cosine similarity check for first query
    sims = cosine_similarity([query_embeds[0]], review_embeds)[0]
    top_idx = sims.argsort()[-5:][::-1]
    for idx in top_idx:
        print(f"Query match: {reviews[idx][:200]}...")


Testing miniLM...


Embeddings shape: (2000, 384), Time: 9.56s
Query match: These taste wonderful. And they are at an amazing price, compared to what I can get them for at my local walmart....
Query match: Excellent taste, quality is great. Love that they are all natural organic and at the price, a bargain! I will definitely order these again....
Query match: The best I can say is that the taste is quite origonal and unique.If you want the best and dont mind paying fot it then this is for you.<br /><br />The bad part is the large cost of the product.But, i...
Query match: this flavor is soooo good!its very smooth and the flavor is not too strong. its just good! its a regular in our house now. everyone loves it....
Query match: I have always loved this product. It taste great but the shipping cost is too expensive. It is alot cheaper going to the store and buying it that way you wont have to pay for shipping....

Testing mpnet...


Embeddings shape: (2000, 768), Time: 38.55s
Query match: The best I can say is that the taste is quite origonal and unique.If you want the best and dont mind paying fot it then this is for you.<br /><br />The bad part is the large cost of the product.But, i...
Query match: The taste just isn't good enough to fit the price.  Overpriced, not healthy enough for that "organic and good for you!" price and the taste didn't impress anyone in my family including a slew of teena...
Query match: The fact that I ordered myself a case of these rather speaks for itself.  I consider myself somewhat of a connoisseur of offbeat snack foods, and these are my favorite of all time.  Slightly sweet, sl...
Query match: and I want to congratulate the graphic artist for putting the entire product name on such a small box.  The ad men must have really thought long and hard.<br /><br />But seriously, I love the product....
Query match: This flavor tastes very good. Great product!!! I subscribed to receive one 

Embeddings shape: (2000, 384), Time: 11.07s
Query match: These taste wonderful. And they are at an amazing price, compared to what I can get them for at my local walmart....
Query match: The best I can say is that the taste is quite origonal and unique.If you want the best and dont mind paying fot it then this is for you.<br /><br />The bad part is the large cost of the product.But, i...
Query match: Excellent taste, quality is great. Love that they are all natural organic and at the price, a bargain! I will definitely order these again....
Query match: I have always loved this product. It taste great but the shipping cost is too expensive. It is alot cheaper going to the store and buying it that way you wont have to pay for shipping....
Query match: this flavor is soooo good!its very smooth and the flavor is not too strong. its just good! its a regular in our house now. everyone loves it....


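miniLM is the fastest option (9.56 s to embed 2,000 reviews) and returns reasonable matches. mpnet is the slowest (38.55 s) but surfaces longer, more descriptive reviews. multi-qa (the third run in the output) is nearly as fast as miniLM (11.07 s); for this query it returned the same five reviews as miniLM in a different order. We consider multi-qa a good balance of speed and descriptiveness, making it our choice.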
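Comparing the printed matches by eye gets harder with more queries. A small overlap score between two models' top-k results can summarize how differently they rank reviews. The sketch below assumes each model's results are available as a ranked list of review indices (such as `top_idx` in the loop above). `top_k_overlap` is a hypothetical helper and the index values are invented for illustration.

```python
from typing import Sequence


def top_k_overlap(ranked_a: Sequence[int], ranked_b: Sequence[int], k: int = 5) -> float:
    """Jaccard overlap between the top-k items of two rankings (order ignored)."""
    top_a, top_b = set(ranked_a[:k]), set(ranked_b[:k])
    union = top_a | top_b
    if not union:
        return 1.0  # two empty rankings are trivially identical
    return len(top_a & top_b) / len(union)


# Made-up review indices, standing in for each model's top-5 matches
minilm_top = [10, 42, 7, 99, 3]
multiqa_top = [10, 7, 42, 3, 99]
mpnet_top = [7, 55, 61, 80, 12]

print(top_k_overlap(minilm_top, multiqa_top))
print(top_k_overlap(minilm_top, mpnet_top))
```

An overlap near 1 means two models retrieve essentially the same reviews, so the faster one is the natural pick. A low overlap means the matches need a closer qualitative comparison.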

### Model Choices

- **Sentiment Model:** bert-5star
- **Summarizer Model:** bart-large
- **Embedding Model:** multi-qa

In [25]:
# Save the chosen model IDs to a JSON file for quick reuse

# Build the config dictionary
config = {
    "sentiment_model": "nlptown/bert-base-multilingual-uncased-sentiment",
    "summarizer_model": "facebook/bart-large-cnn",
    "embedding_model": "multi-qa-MiniLM-L6-cos-v1"
}

# Save to config.json
with open("config.json", "w") as f:
    json.dump(config, f, indent=2)
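Downstream code can read this file back instead of hard-coding model IDs. The following is a minimal sketch of such a loader, assuming the three keys written above. `load_model_config` and `REQUIRED_KEYS` are hypothetical names, not part of this project.

```python
import json
from pathlib import Path

# Keys written to config.json by the cell above
REQUIRED_KEYS = ("sentiment_model", "summarizer_model", "embedding_model")


def load_model_config(path: str = "config.json") -> dict:
    """Load the model config and fail early if an expected key is missing."""
    config = json.loads(Path(path).read_text(encoding="utf-8"))
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise KeyError(f"config is missing keys: {missing}")
    return config
```

For example, `load_model_config()["sentiment_model"]` could be passed as the `model` argument when building the sentiment pipeline.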