# Custom Word Embeddings for Beer Reviews

This notebook demonstrates how to train custom word embeddings on the beer review corpus to capture semantic relationships specific to beer descriptions. These embeddings are used to enhance the recommendation system by allowing attribute vectors to be derived from domain‑specific context.

# Task D

Create custom word embeddings from your product review data instead of using the default
spaCy word embeddings.

Do your top-3 recommendations change when using your own embeddings?

You can use either spaCy or Gensim to create your custom embeddings.

In [1]:
import pandas as pd
import re, string

beer_stats = pd.read_csv("beer_stats.csv")
reviews = pd.read_csv("beer_reviews.csv")

reviews.head()

|   | product_name | product_review | user_rating | clean_text |
|---|---|---|---|---|
| 0 | Kentucky Brunch Brand Stout | Good | 4.41 | good |
| 1 | Kentucky Brunch Brand Stout | Pours the purest black color you’ve ever seen,... | 4.94 | pours purest black color youve ever seen swall... |
| 2 | Kentucky Brunch Brand Stout | This beer is intense, and yet, it feels very s... | 4.98 | beer intense yet feels smooth chocolate notes ... |
| 3 | Kentucky Brunch Brand Stout | 2022 vintage poured at fridge temp but tasted ... | 4.43 | 2022 vintage poured fridge temp tasted warmed ... |
| 4 | Kentucky Brunch Brand Stout | Sampled at the brewery, this is the 2022 bottl... | 4.61 | sampled brewery 2022 bottle version beer pours... |


In [2]:
# User's input (the basis for the query vector): the 3 attributes the user wants
keywords = ["chocolate", "dark", "coffee"]

# These attribute words are looked up in the trained model and averaged into a query vector further down

## Creating custom word embeddings using Gensim

In [3]:
from gensim.models.phrases import Phrases, Phraser
from gensim.utils import simple_preprocess

# Retrieving text
texts = reviews["clean_text"].fillna("").astype(str) # fill na's

# Normalization
def normalize(txt):
    txt = txt.lower()
    txt = re.sub(r"\s+", " ", txt)
    return txt.strip()

# Tokenize each review into a list of words.
# simple_preprocess lowercases, strips punctuation, and tokenizes;
# deacc=True removes accents and min_len=2 drops one-letter tokens such as "I".
tokens = [simple_preprocess(normalize(t), deacc=True, min_len=2) for t in texts]

# tokens is a list of lists: one inner list per review, one element per word

# Phrases() scans the corpus for words that frequently co-occur and merges them into one token,
# joined with an underscore (like "new_york").
# min_count=10: a pair must occur at least 10 times to be considered.
# threshold=10: a pair is promoted to a phrase only if its score exceeds 10; higher thresholds yield fewer phrases.
# The default score rewards pairs that co-occur more often than their individual counts would suggest (similar to lift).
bigram = Phrases(tokens, min_count=10, threshold=10)
trigram = Phrases(bigram[tokens], min_count=10, threshold=10)

# Phraser() compiles the heavy Phrases models into faster, memory-efficient transformers for application time
# Contains only the finalized merge rules (which pairs → merge)
bigram_phraser = Phraser(bigram)
trigram_phraser = Phraser(trigram)

# We use both bigrams and trigrams to capture both 2-token and 3-token phrases, and build our trigrams on top of our bigrams to do so
# sentences is, similar to tokens, a list of lists but this time combines tokens into bigrams or trigrams if they meet the requirements to do so
sentences = [trigram_phraser[bigram_phraser[t]] for t in tokens]

sentences is a list of lists, one inner list per review. Tokens that meet the phrase requirements appear merged as bigrams or trigrams; all other tokens remain single words.
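To make the roles of min_count and threshold more concrete, here is a minimal pure-Python sketch of the scoring rule that Phrases applies by default, as I understand it: (pair count - min_count) / (count of first word × count of second word) × vocabulary size. The helper `phrase_scores` and the toy corpus are hypothetical, and the vocabulary size is simplified to the number of distinct single words, so the numbers will not match Gensim's exactly.

```python
from collections import Counter

def phrase_scores(sentences, min_count=1):
    """Score adjacent word pairs (hypothetical helper, simplified scoring)."""
    unigrams = Counter(w for s in sentences for w in s)
    bigrams = Counter(pair for s in sentences for pair in zip(s, s[1:]))
    vocab_size = len(unigrams)  # simplification: distinct single words only
    scores = {}
    for (a, b), n_ab in bigrams.items():
        if n_ab < min_count:
            continue  # pair is too rare to be considered at all
        scores[(a, b)] = (n_ab - min_count) / (unigrams[a] * unigrams[b]) * vocab_size
    return scores

# Invented toy corpus, not taken from the beer reviews
toy_sentences = [
    ["barrel", "aged", "stout"],
    ["barrel", "aged", "porter"],
    ["barrel", "aged", "stout"],
    ["dark", "stout"],
    ["dark", "porter"],
    ["dark", "roast"],
]

toy_scores = phrase_scores(toy_sentences, min_count=1)
toy_threshold = 1.0
# Pairs whose score exceeds the threshold would be merged into one token
merged = [f"{a}_{b}" for (a, b), s in toy_scores.items() if s > toy_threshold]
print(merged)
```

In this toy corpus only "barrel aged" clears the threshold, because the two words almost always appear together; "aged stout" co-occurs too, but not often enough relative to how common each word is on its own.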

### Training Word2Vec with our tokenized reviews

In [4]:
from gensim.models import Word2Vec

# Creates a numeric vector for each token such that words used in similar contexts end up near each other in the vector space
w2v = Word2Vec(
    sentences=sentences,
    vector_size=200,      # Dimensionality of each word vector
    window=5,             # Context size: up to 5 tokens before and after, e.g. "barrel_aged" pairs with tokens up to 5 positions away in the same review
    min_count=5,          # Tokens occurring fewer than 5 times are discarded
    workers=4,            # Number of training threads
    sg=1,                 # 1 = skip-gram: given the center word, predict the context words. Tends to work better for rare words
    negative=10,          # Negative sampling: update against 10 sampled "noise" words per training pair instead of a softmax over the whole vocabulary
    sample=1e-5,          # Subsampling threshold; randomly discards a fraction of very frequent tokens so they don't dominate training
    epochs=10,            # Number of passes over the corpus
    seed=42               # Fixes the random initialization; fully reproducible runs also require workers=1
)

# Model used in the rest of the notebook
model = w2v  # an alternative model (e.g. FastText) could be assigned here instead
model.save("beer_reviews.w2v")

model.wv

<gensim.models.keyedvectors.KeyedVectors at 0x1d3b5c135b0>
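The effect of sample=1e-5 is easier to judge with numbers. The sketch below uses the keep-probability rule from the original word2vec C implementation, which I believe Gensim follows: a token with relative frequency f is kept with probability (sqrt(f / sample) + 1) × sample / f, capped at 1. The helper `keep_probability` is hypothetical and the frequencies are made-up examples, not measured on the beer reviews.

```python
import math

def keep_probability(rel_freq, sample=1e-5):
    """Chance that one occurrence of a token survives subsampling (hypothetical helper)."""
    ratio = rel_freq / sample
    return min(1.0, (math.sqrt(ratio) + 1) / ratio)

# Made-up relative frequencies: 1% of all tokens, 0.1%, and exactly the threshold
for f in (1e-2, 1e-3, 1e-5):
    print(f"f={f:g}  sample=1e-5 -> {keep_probability(f):.3f}   "
          f"sample=1e-3 -> {keep_probability(f, 1e-3):.3f}")
```

Under this rule a token that makes up 1% of the corpus keeps only about 3% of its occurrences with sample=1e-5, compared with roughly 42% with sample=1e-3. On a small corpus such an aggressive setting removes a lot of training signal, so it may be worth comparing a few values.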

### Quick check of model

In [28]:
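# Sanity check: the nearest neighbors of "coffee" are related flavor words, so the results make sense.
# Every similarity shown is above 0.98, so the ordering is more informative than the absolute values.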
model.wv.most_similar("coffee", topn=10)

[('chocolate', 0.996605634689331),
 ('dark_chocolate', 0.9919734597206116),
 ('bourbon', 0.9902859330177307),
 ('coconut', 0.989770770072937),
 ('cocoa', 0.9895609617233276),
 ('cinnamon', 0.9886811375617981),
 ('vanilla', 0.987617552280426),
 ('barrel', 0.986538290977478),
 ('molasses', 0.9864659309387207),
 ('fudge', 0.9857189655303955)]

### Turn reviews/beers into vectors

In [6]:
import numpy as np
# Collapse all review vectors for a beer into one vector representing its "average flavor profile",
# then compare that vector with the user's 3 attributes.

# Building a review vector:
# kv[w] looks up the vector for a given word/phrase.
# Each word has a fixed vector, so we average all the word vectors in a review into one review vector,
# then average all review vectors for a beer into one beer vector (averaging twice).
def doc_vector(tokens, kv, use_tfidf=False):
    # Simple average of in-vocabulary tokens; use_tfidf is accepted but not implemented
    vecs = [kv[w] for w in tokens if w in kv]
    # Reviews with no in-vocabulary tokens fall back to a zero vector
    return np.mean(vecs, axis=0) if vecs else np.zeros(kv.vector_size)
    
review_tokens = sentences # sentences being the list of lists of word/tokens
review_vecs = [doc_vector(t, model.wv) for t in review_tokens] # creating review vectors

# Attach the review-level vectors to a copy of the reviews DataFrame
reviews_with_vecs = reviews.copy() 
reviews_with_vecs["__vec"] = review_vecs

# Create a Series that maps each product_name → its beer-level vector
# Each beer-level vector is the mean of its review vectors
beer_vecs = (reviews_with_vecs
             .groupby("product_name")["__vec"]
             .apply(lambda arr: np.mean(np.stack(arr), axis=0))
            )

In [7]:
beer_vecs[:5]

product_name
10 Year Barleywine                         [0.049762424, -0.0027068614, -0.06740016, -0.0...
4th Anniversary                            [0.057966426, 0.00014716451, -0.06475773, -0.0...
A Deal With The Devil - Double Oak-Aged    [0.050392777, -0.003247941, -0.06820337, -0.08...
A Deal With The Devil - Triple Oak-Aged    [0.049019452, -0.0033946633, -0.066780366, -0....
Aaron                                      [0.051318496, -0.0030460143, -0.0692233, -0.08...
Name: __vec, dtype: object
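Averaging twice has a consequence worth keeping in mind: every review counts equally toward the beer vector, regardless of its length. The sketch below illustrates this with hypothetical 2-dimensional embeddings (`toy_kv`) and a hypothetical helper `toy_doc_vector` that mirrors the simple average above; none of the values come from the trained model.

```python
import numpy as np

# Hypothetical 2-d embeddings, for illustration only
toy_kv = {
    "roasty": np.array([1.0, 0.0]),
    "sweet": np.array([0.0, 1.0]),
}

def toy_doc_vector(tokens, kv, dim=2):
    """Average the vectors of known tokens; zero vector if none are known."""
    vecs = [kv[w] for w in tokens if w in kv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

# Two reviews of the same (imaginary) beer: one short, one long
toy_reviews = [
    ["roasty"],
    ["sweet", "sweet", "sweet", "unknown"],  # "unknown" is out of vocabulary and skipped
]

# Averaging twice: words -> review vectors -> beer vector
toy_review_vecs = [toy_doc_vector(t, toy_kv) for t in toy_reviews]
mean_of_means = np.mean(np.stack(toy_review_vecs), axis=0)

# Alternative: pool all words of all reviews and average once
pooled = toy_doc_vector([w for t in toy_reviews for w in t], toy_kv)

print(mean_of_means)  # each review weighted equally
print(pooled)         # each word weighted equally
```

Here the two-step average gives [0.5, 0.5] while pooling all words gives [0.25, 0.75]. A review with no in-vocabulary tokens contributes a zero vector to the two-step average, which pulls the beer vector toward the origin.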

In [15]:
# To calculate cosine similarity
from numpy.linalg import norm

# Turn the user's 3 attributes into one averaged-out query vector
def wordset_vector(words, kv):
    # average of attribute seed words; include phrases if you used phrasers
    got = [kv[w] for w in words if w in kv]
    return np.mean(got, axis=0) if got else np.zeros(kv.vector_size)

# Cosine similarity: the dot product divided by the product of the two vector lengths
def cosine(a, b):
    na, nb = norm(a), norm(b)
    # Return 0.0 when either vector has zero length to avoid division by zero
    return float(a @ b / (na * nb)) if na > 0 and nb > 0 else 0.0

# Example user attributes
attrs = ["chocolate", "dark", "coffee"]

# Building query vector from our function
qvec = wordset_vector(attrs, model.wv) 

# For each beer vector, compute cosine similarity with the query vector above
scores = beer_vecs.apply(lambda v: cosine(qvec, v)).sort_values(ascending=False)

# Cosine similarity scores range from -1 to 1; keep the top 3 recommendations and the top 23 (3 + 20 contenders)
top3 = scores.head(3)
top23 = scores.head(23)
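As a small worked example of the ranking step, the sketch below applies the same cosine formula to hypothetical 2-dimensional vectors. The beer names and values are invented; `toy_cosine` simply repeats the definition above so that the block runs on its own.

```python
import numpy as np
from numpy.linalg import norm

def toy_cosine(a, b):
    """Cosine similarity with a zero-vector guard (same logic as above)."""
    na, nb = norm(a), norm(b)
    return float(a @ b / (na * nb)) if na > 0 and nb > 0 else 0.0

toy_query = np.array([1.0, 1.0])
toy_beers = {
    "Beer A": np.array([2.0, 2.0]),  # same direction as the query, different length
    "Beer B": np.array([1.0, 0.0]),  # 45 degrees away from the query
    "Beer C": np.array([0.0, 0.0]),  # zero vector, e.g. no known tokens
}

toy_scores = {name: toy_cosine(toy_query, v) for name, v in toy_beers.items()}
toy_ranking = sorted(toy_scores, key=toy_scores.get, reverse=True)

for name in toy_ranking:
    print(f"{name}: {toy_scores[name]:.3f}")
```

Cosine similarity ignores vector length, so Beer A scores about 1.0 even though its vector is twice as long as the query. Beer B scores about 0.707, and the zero vector falls back to 0.0.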

### Top 3 Recommendations:

In [16]:
top3

product_name
Fundamental Forces                      0.962027
Bourbon Paradise                        0.961523
Speedway Stout - Bourbon Barrel-Aged    0.961389
Name: __vec, dtype: float64

### Show a table of your three final recommendations along with 20 other top contenders so that I can understand how the top three got chosen.

In [27]:
top23_df = top23.reset_index()
top23_df.columns = ["Beer Name", "Cosine Similarity"]
top23_df

|   | Beer Name | Cosine Similarity |
|---|---|---|
| 0 | Fundamental Forces | 0.962027 |
| 1 | Bourbon Paradise | 0.961523 |
| 2 | Speedway Stout - Bourbon Barrel-Aged | 0.961389 |
| 3 | Red Eye November | 0.961021 |
| 4 | Reaction State | 0.960835 |
| 5 | Somewhere, Something Incredible Is Waiting To ... | 0.960809 |
| 6 | Ten FIDY - Bourbon Barrel-Aged | 0.96071 |
| 7 | Bourbon Barrel Champion Ground | 0.960596 |
| 8 | Truth - Vanilla Bean | 0.960345 |
| 9 | Speedway Stout - Vietnamese Coffee - Bourbon B... | 0.960258 |
