# Beer Recommendation System

This notebook builds a recommendation system for craft beers based on tasting descriptors. We scrape data from BeerAdvocate, clean and aggregate reviews, and compute similarity across beers using TF‑IDF (Task B), spaCy embeddings (Task C), and custom word embeddings (Task D). Users can specify flavor, aroma, or texture attributes to find beers that best match their preferences.

# Project 2

In [1]:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import undetected_chromedriver as uc
import pandas as pd
import numpy as np
from collections import Counter
import statsmodels.api as sm
import matplotlib.pyplot as plt
import regex as re
import time
import nltk
nltk.download('stopwords')
nltk.download('punkt')
# nltk.download('all')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import itertools
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from nltk.sentiment.vader import SentimentIntensityAnalyzer

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\conno\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\conno\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## Task A

In [None]:
# Set Chrome options
options = Options()
#options.add_argument("--headless=new")
options.add_argument("--disable-gpu")
options.add_argument("--no-sandbox")
chrome_prefs = {
    "profile.managed_default_content_settings.images": 2,  # Block images
    "profile.managed_default_content_settings.stylesheets": 2,  # Block CSS
    "profile.managed_default_content_settings.javascript": 2  # Block JavaScript (2 = block; use 1 to allow if pages need JS)
}
options.add_experimental_option("prefs", chrome_prefs)
driver = uc.Chrome(options=options)

# Open the BeerAdvocate top-rated page
url = "https://www.beeradvocate.com/beer/top-rated/"
driver.get(url)

In [None]:
# Get the URLs of the 250 top-rated beers
beer_elements = driver.find_elements(By.XPATH, "//a[contains(@href, '/beer/profile/')]")

beer_urls = []
pattern = re.compile(r"^https://www\.beeradvocate\.com/beer/profile/\d+/\d+/$")

for el in beer_elements:
    url = el.get_attribute("href")
    if url and pattern.match(url):
        beer_urls.append(url)

In [8]:
beers = pd.DataFrame(columns= ['product_name', 'brewery','stats'])
beer_data = pd.DataFrame(columns= ['product_name','product_review','user_rating'])
max_pages = 10

In [None]:
# Scrape reviews
wait = WebDriverWait(driver, 10)
for url in beer_urls:
    driver.get(url)
    beer_name = driver.find_element(By.CLASS_NAME, 'titleBar').text.split('\n')[0]
    brewery = driver.find_element(By.CLASS_NAME, 'titleBar').text.split('\n')[1]
    #description = driver.find_element( By.XPATH,"//*[@style='margin-top: 10px; padding:0px 20px; font-size:1.05em;']")
    stats = driver.find_element(By.CLASS_NAME,"beerstats").text
    beers.loc[len(beers.index)] = [beer_name, brewery, stats]

    pages = 1
    end_of_reviews = False
    while not end_of_reviews and pages <= max_pages: 
        time.sleep(1)
        comments = driver.find_elements(By.CLASS_NAME,"user-comment")
        for comment in comments:
            try:
                text = comment.find_element( By.XPATH,".//*[@style='margin:20px 0px; font-size:11pt; line-height:1.4;']").text
                temp = comment.find_element(By.CLASS_NAME,"BAscore_norm")
                rating = temp.text
                beer_data.loc[len(beer_data.index)] = [beer_name, text, rating]
            except Exception:  # skip comments that lack review text or a rating
                pass

        try:
            time.sleep(1)
            next_button = wait.until(EC.element_to_be_clickable((By.LINK_TEXT, "next")))
            next_button.click()
            pages += 1
            #print("Successfully clicked the 'Next' page link.")
        except Exception as e:
            print('Moving to next beer')
            end_of_reviews = True


Moving to next beer


# Task B

In [6]:
import pandas as pd
import numpy as np
import re
import warnings
warnings.filterwarnings("ignore")
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import nltk
from nltk.corpus import stopwords
#nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer
# Removed aliasing to avoid tfidf conflict
beer_stats = pd.read_csv("beer_stats.csv")
reviews = pd.read_csv("beer_reviews.csv")

#Drop NA
reviews = reviews.dropna(subset=["product_name", "product_review", "user_rating"]).copy()
#Clean Text
reviews["product_name"] = reviews["product_name"].astype(str).str.strip()
reviews["product_review"] = reviews["product_review"].astype(str).str.replace(r"\s+", " ", regex=True).str.strip()

# Make sure ratings are numeric, and between 0-5
reviews["user_rating"] = pd.to_numeric(reviews["user_rating"], errors="coerce")
reviews = reviews.dropna(subset=["user_rating"])
reviews = reviews[(reviews["user_rating"] >= 0) & (reviews["user_rating"] <= 5)]


print(f"Loaded {len(reviews):,} reviews across {reviews['product_name'].nunique()} beers.")
reviews.head(3)


Loaded 11,315 reviews across 249 beers.


Unnamed: 0,product_name,product_review,user_rating,clean_text
0,Kentucky Brunch Brand Stout,Good,4.41,good
1,Kentucky Brunch Brand Stout,"Pours the purest black color you’ve ever seen,...",4.94,pours purest black color youve ever seen swall...
2,Kentucky Brunch Brand Stout,"This beer is intense, and yet, it feels very s...",4.98,beer intense yet feels smooth chocolate notes ...


In [7]:
#normalize the text further
text_col = "clean_text" if "clean_text" in reviews.columns else "product_review"

reviews["clean_low"] = (
    reviews[text_col].astype(str).str.lower()
    .str.replace(r"[^a-z\s\-']", " ", regex=True)
    .str.replace(r"\s+", " ", regex=True)
    .str.strip()
)

#Tokenize the words so we can do tfidf
reviews["tokens"] = reviews["clean_low"].str.findall(r"[a-z][a-z\-']{2,}")


#Bag of words
stop = set(stopwords.words("english"))
uni = reviews[["tokens"]].explode("tokens")
uni = uni[~uni["tokens"].isin(stop)]
top_unigrams = uni["tokens"].value_counts()


#Function to build bigrams
def make_bigrams(tokens):
    if not isinstance(tokens, list) or len(tokens) < 2:
        return []
    return [f"{a} {b}" for a, b in zip(tokens[:-1], tokens[1:])]

reviews["bigrams"] = reviews["tokens"].apply(make_bigrams)
bi = reviews.explode("bigrams")["bigrams"].dropna()

# Drop bigrams that contain a stopword (non-capturing group avoids pandas' capture-group warning)
bi = bi[~bi.str.contains(r"\b(?:" + "|".join(sorted(stop)) + r")\b", regex=True)]
bi = bi[~bi.str.contains(r"[^a-z\s\-']", regex=True)]
top_bigrams = bi.value_counts()


#List of beer words!
seed_attrs = {
    "hoppy","malty","bitter","sweet","roasty","toasty","citrusy","citrus","piney","resinous",
    "fruity","tropical","grapefruit","orange","lemon","juicy","dry","crisp","clean",
    "funky","tart","sour","oaky","woody","vanilla","chocolate","coffee","caramel",
    "spicy","peppery","floral","earthy","dank","boozy","smooth","creamy","silky","balanced",
    "aroma","mouthfeel","finish","body","complex","rich"
}

#finds descriptive words
descriptor_like = [
    w for w in top_unigrams.index[:200]   # inspect top 200; adjust as needed
    if (w in seed_attrs) or w.endswith(("y","ish"))
]

attribute_candidates = sorted(set(seed_attrs).union(descriptor_like))



print("Top unigrams (preview):")
print(top_unigrams.head(25).to_string())

print("\nTop bigrams (preview):")
print(top_bigrams.head(25).to_string())

print("\nCandidate attribute list (edit/prune as needed):")
print(attribute_candidates)
print(f"\nTotal candidate attributes: {len(attribute_candidates)}")


Top unigrams (preview):
tokens
beer           7948
head           6328
taste          5270
chocolate      4711
dark           4702
sweet          3912
coffee         3844
like           3681
vanilla        3677
notes          3537
one            3457
bourbon        3394
nose           3281
good           3244
nice           3228
light          3113
well           2975
finish         2967
aroma          2962
carbonation    2947
pours          2911
body           2760
orange         2689
black          2683
fruit          2676

Top bigrams (preview):
bigrams
white head           1302
dark chocolate        955
tan head              711
dark brown            629
brown sugar           580
medium body           556
tropical fruit        508
dark fruit            506
taste follows         475
maple syrup           423
medium bodied         422
follows nose          396
brown head            391
milk chocolate        388
roasted malt          381
tree house            376
full bodied          

In [8]:
final_attributes = [
    # Flavor
    "chocolate", "dark chocolate", "milk chocolate", "coffee", "vanilla",
    "caramel", "brown sugar", "maple syrup", "honey",
    "citrusy", "grapefruit", "orange", "lemon", "fruity", "tropical fruit", "dark fruit",
    
    # Aroma / hops
    "hoppy", "malty", "roasty", "toasty", "piney", "earthy", "floral", "spicy", "peppery", "dank", "funky",
    
    # Texture
    "smooth", "creamy", "silky", "sticky", "dry", "crisp", "rich", "full body", "medium body", "well balanced"
]

print("Final attribute list:")
print(final_attributes)
print(f"\nTotal attributes: {len(final_attributes)}")


Final attribute list:
['chocolate', 'dark chocolate', 'milk chocolate', 'coffee', 'vanilla', 'caramel', 'brown sugar', 'maple syrup', 'honey', 'citrusy', 'grapefruit', 'orange', 'lemon', 'fruity', 'tropical fruit', 'dark fruit', 'hoppy', 'malty', 'roasty', 'toasty', 'piney', 'earthy', 'floral', 'spicy', 'peppery', 'dank', 'funky', 'smooth', 'creamy', 'silky', 'sticky', 'dry', 'crisp', 'rich', 'full body', 'medium body', 'well balanced']

Total attributes: 37


In [9]:
# Create beer documents by aggregating reviews for each beer
beer_docs = reviews.groupby("product_name").agg({
    "product_review": " ".join,  # Concatenate all reviews for each beer
    "user_rating": "mean"
}).reset_index()

# Add clean_text column for TF-IDF
beer_docs["clean_text"] = beer_docs["product_review"]

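# Create and fit TfidfVectorizer with the curated attributes.
# Note: with the default ngram_range=(1, 1) the analyzer emits single words only, so
# two-word attributes such as "dark chocolate" get all-zero columns. Passing
# ngram_range=(1, 2) would activate them, but would also change the scores printed below.
tfidf = TfidfVectorizer(vocabulary=final_attributes)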
beer_tfidf = tfidf.fit_transform(beer_docs["clean_text"])

print(f"Created TF-IDF vectors for {len(beer_docs)} beers using {len(final_attributes)} attributes")
print(f"TF-IDF matrix shape: {beer_tfidf.shape}")

Created TF-IDF vectors for 249 beers using 37 attributes
TF-IDF matrix shape: (249, 37)


We manually curated a final list of 37 sensory descriptors covering
flavor (chocolate, coffee, vanilla, caramel, tropical fruit, etc.),
aroma/hops (hoppy, malty, roasty, piney, floral, etc.),
and texture (smooth, creamy, dry, crisp, full body, well balanced, etc.).
This ensures that every attribute we score is meaningful in a beer context.
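Several curated attributes are two-word phrases, and a phrase can only be counted if the tokenizer emits two-word n-grams. The sketch below is a pure-Python illustration of that idea, not the notebook's vectorizer; the helper names `ngrams` and `count_attributes` and the sample review are made up for the example. In scikit-learn the analogous setting is `ngram_range`, which defaults to single words.

```python
import re

def ngrams(text, max_n=1):
    """Return lowercase word tokens plus all n-grams up to max_n, joined by spaces."""
    words = re.findall(r"[a-z]+", text.lower())
    grams = []
    for n in range(1, max_n + 1):
        grams += [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    return grams

def count_attributes(text, attributes, max_n=1):
    """Count how often each attribute appears among the text's n-grams."""
    grams = ngrams(text, max_n)
    return {attr: grams.count(attr) for attr in attributes}

# Hypothetical review text and a small subset of the curated attribute list.
review = "Dark chocolate and brown sugar on the nose, then more chocolate."
attrs = ["chocolate", "dark chocolate", "brown sugar"]

unigram_counts = count_attributes(review, attrs, max_n=1)  # phrases cannot match
bigram_counts = count_attributes(review, attrs, max_n=2)   # phrases can match
print(unigram_counts)
print(bigram_counts)
```

With single-word tokens only, the two-word attributes stay at zero no matter how often reviewers use them, so it is worth checking that the vectorizer's n-gram setting matches the attribute list.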

In [11]:
# Pick 3 example attributes (hard-coded here, not randomly sampled)
user_attrs = ["chocolate", "coffee", "vanilla"]

# If they are missing we need to know
#Can convert this into input() with validation later
missing = [w for w in user_attrs if w not in final_attributes]
if missing:
    print("These attributes aren't in our list:", missing)
    print("Pick from:", list(tfidf.get_feature_names_out()))
else:
    # Turn that list into what we're searching for
    query_text = " ".join(user_attrs)

    # Vectorize the query with tfidf
    query_vec = tfidf.transform([query_text])

    # calc the cosine similarity of every beer to the query vec
    sims = cosine_similarity(beer_tfidf, query_vec).ravel()

    # Results Table
    res = beer_docs[["product_name"]].copy()
    res["cosine_score"] = sims

    # Show the top 23 beers
    top23 = res.sort_values("cosine_score", ascending=False).head(23).reset_index(drop=True)
    top23.index = top23.index + 1  # nicer display

    print("User chose attributes:", user_attrs)
    print(top23)


User chose attributes: ['chocolate', 'coffee', 'vanilla']
                                         product_name  cosine_score
1                                            Affogato      0.975788
2                                        Pirate Bomb!      0.975485
3                Speedway Stout - Bourbon Barrel-Aged      0.966555
4                                               Bomb!      0.964093
5                            Last Buffalo In The Park      0.962517
6                      Affogato - Bourbon Barrel-Aged      0.960607
7   Somewhere, Something Incredible Is Waiting To ...      0.957658
8                                      Reaction State      0.947541
9                                                 KBS      0.941670
10                                  Barrel Aged Bomb!      0.932402
11                     CBS (Canadian Breakfast Stout)      0.932381
12                         KBS - Maple Mackinac Fudge      0.932024
13                Plead The 5th - Bourbon Barrel-Aged     

A user query (a list of attributes) is converted into a TF-IDF vector with the same fitted vectorizer. We then compute the cosine similarity between the query vector and every beer profile,
which ranks the beers by how well their reviews match the requested attributes.

We used VADER sentiment analysis to score the reviews that mention the chosen attributes.
We then scaled those scores to lie between 0 and 1 and combined them with the similarity scores using equal weights to get a final score. The weights are equal because it matters just as much that the reviews are positive as that they mention the keywords.
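The scoring code for this step is not shown in the notebook, so the following is only a sketch of how such a blend could look. It assumes the sentiment value is a VADER-style compound score in [-1, 1] that is rescaled linearly to [0, 1]; the function names, the example beers, and their numbers are hypothetical.

```python
def rescale_compound(compound):
    """Map a compound sentiment score from [-1, 1] onto [0, 1] (assumed linear rescale)."""
    return (compound + 1.0) / 2.0

def combined_score(cosine_score, compound, weight=0.5):
    """Blend attribute match and rescaled sentiment; weight=0.5 gives equal weighting."""
    return weight * cosine_score + (1.0 - weight) * rescale_compound(compound)

# Hypothetical (cosine score, mean compound sentiment) pairs for two beers.
candidates = {
    "Beer A": (0.90, 0.20),
    "Beer B": (0.80, 0.80),
}

ranked = sorted(
    candidates,
    key=lambda name: combined_score(*candidates[name]),
    reverse=True,
)
print(ranked)
```

In this toy case the beer with the slightly lower attribute match ranks first because its reviews are much more positive, which is the trade-off an equal weighting implies.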

In [2]:
!pip install spacy
!python -m spacy download en_core_web_md


ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
gensim 4.3.3 requires numpy<2.0,>=1.18.5, but you have numpy 2.3.3 which is incompatible.
scipy 1.13.1 requires numpy<2.3,>=1.22.4, but you have numpy 2.3.3 which is incompatible.


Collecting en-core-web-md==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.8.0/en_core_web_md-3.8.0-py3-none-any.whl (33.5 MB)
Installing collected packages: en-core-web-md
Successfully installed en-core-web-md-3.8.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')


In [12]:
import spacy
nlp = spacy.load("en_core_web_md")

# Build one vector per beer
beer_vectors = np.vstack([nlp(str(text)).vector for text in beer_docs["clean_text"]])
beer_names = beer_docs["product_name"].tolist()

print("Beer vectors shape (beers × dims):", beer_vectors.shape)



Beer vectors shape (beers × dims): (249, 300)


In [14]:
# average word vectors
attribute_query_vector = nlp(" ".join(user_attrs)).vector.reshape(1, -1)

# Find the cosine similarity
embedding_cosine_scores = cosine_similarity(beer_vectors, attribute_query_vector).ravel()

# Results table
taskc_results = pd.DataFrame({
    "product_name": beer_names,
    "embedding_cosine": embedding_cosine_scores
}).sort_values("embedding_cosine", ascending=False)

taskc_top23 = taskc_results.head(23).reset_index(drop=True)
taskc_top23.index = taskc_top23.index + 1

print("user attributes:", user_attrs)
taskc_top23

user attributes: ['chocolate', 'coffee', 'vanilla']


Unnamed: 0,product_name,embedding_cosine
1,"Somewhere, Something Incredible Is Waiting To ...",0.585234
2,All That Is And All That Ever Will Be,0.581466
3,Moment Of Clarity,0.564991
4,Speedway Stout - Vietnamese Coffee,0.564068
5,Hold On To Sunshine,0.561988
6,Affogato - Bourbon Barrel-Aged,0.558364
7,Sunday Brunch,0.557809
8,Caffè Americano,0.555352
9,Barrel-Aged Sump Coffee Stout,0.552474
10,Canuckley,0.550989


In Task B (no spaCy, just TF-IDF + cosine similarity), the top results were all beers whose reviews explicitly mention the three
words we chose. This bag-of-words approach rewards reviews that use these keywords directly.

In Task C, with the spaCy embeddings, the ranking favors beers whose reviews are semantically similar to the query,
so some beers appear near the top even when the specific keywords are not mentioned explicitly.

Without the spaCy embeddings, we had to predefine an attribute list manually, so the user can only enter words found
in that list.

With the spaCy embeddings, the query can use any word that spaCy has a vector for, which is far more flexible for users. That flexibility
can cost precision, because in beer reviews an explicit mention of a specific flavor word is sometimes what matters.
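The difference between the two approaches can be shown with a toy example. The 3-dimensional vectors below are invented for illustration and are not spaCy's vectors; `mean_vector` and `cosine` are small helper functions written for this sketch.

```python
import math

# Hypothetical 3-d "embeddings", invented for illustration only.
toy_vectors = {
    "coffee":   [0.9, 0.1, 0.0],
    "espresso": [0.8, 0.2, 0.1],
    "citrus":   [0.0, 0.1, 0.9],
}

def mean_vector(words):
    """Average the vectors of the known words (assumes at least one known word)."""
    vecs = [toy_vectors[w] for w in words if w in toy_vectors]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

query = ["coffee"]
review_a = ["espresso"]  # related word, but no exact keyword match
review_b = ["citrus"]    # unrelated word

keyword_hits_a = sum(w in query for w in review_a)  # keyword matching sees nothing
sim_a = cosine(mean_vector(query), mean_vector(review_a))
sim_b = cosine(mean_vector(query), mean_vector(review_b))
print(keyword_hits_a, round(sim_a, 3), round(sim_b, 3))
```

A keyword-based score treats the "espresso" review as a miss, while the embedding score places it close to the "coffee" query. The same mechanism can also pull in beers that are merely related to the query, which is the precision cost noted above.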





### Task D: Custom Word Embeddings (End-to-End) using Gensim

1. **Preprocessing & Tokenization**  
   - Took review text (`clean_text`) and tokenized into words.  
   - Applied bigram & trigram models so phrases like *barrel_aged* or *new_england_ipa* become single tokens.

2. **Training Word2Vec**  
   - Trained a skip-gram Word2Vec model.  
   - Model learns embeddings by bringing words used in similar contexts (e.g. *hoppy*, *piney*, *citrus*) closer together.  
   - Verified with `most_similar("coffee")`, which returned realistic coffee-related descriptors.

3. **Building Beer Vectors**  
   - Averaged word vectors within each review → one review vector.  
   - Averaged all review vectors for the same beer → one beer vector.

4. **Query Vector & Recommendations**  
   - Turned 3 user-specified attributes (e.g. *coffee*, *vanilla*, *chocolate*) into a query vector by averaging their embeddings.  
   - Computed cosine similarity between the query vector and each beer vector.  
   - Ranked beers by similarity → returned Top-3 recommendations (+20 others for comparison).

**Outcome:**  
Our custom embeddings captured beer-specific language (e.g., *piney*, *dankness*, *orange_pineapple* near *hoppy*), producing more domain-relevant recommendations than our more generic pretrained vectors from before.
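Step 3 averages twice (words into review vectors, then review vectors into a beer vector), which gives every review the same weight regardless of its length. The sketch below contrasts that with pooling all words into a single average; the 2-d vectors are hypothetical and `mean` is a helper written for this example.

```python
def mean(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

# Hypothetical 2-d word vectors for one beer with two reviews.
short_review = [[1.0, 0.0]]                         # one word
long_review = [[0.0, 1.0], [0.0, 1.0], [0.0, 1.0]]  # three words

# Two-level average: words -> review vectors -> beer vector.
two_level = mean([mean(short_review), mean(long_review)])

# Pooled average: every word from every review counts once.
pooled = mean(short_review + long_review)

print(two_level)
print(pooled)
```

With two-level averaging the short review contributes as much as the long one, while pooling lets long reviews dominate. Which behavior is preferable depends on whether a lengthy review should count for more than a brief one.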


In [1]:
import pandas as pd
import re, string

beer_stats = pd.read_csv("beer_stats.csv")
reviews = pd.read_csv("beer_reviews.csv")

reviews.head()

Unnamed: 0,product_name,product_review,user_rating,clean_text
0,Kentucky Brunch Brand Stout,Good,4.41,good
1,Kentucky Brunch Brand Stout,"Pours the purest black color you’ve ever seen,...",4.94,pours purest black color youve ever seen swall...
2,Kentucky Brunch Brand Stout,"This beer is intense, and yet, it feels very s...",4.98,beer intense yet feels smooth chocolate notes ...
3,Kentucky Brunch Brand Stout,2022 vintage poured at fridge temp but tasted ...,4.43,2022 vintage poured fridge temp tasted warmed ...
4,Kentucky Brunch Brand Stout,"Sampled at the brewery, this is the 2022 bottl...",4.61,sampled brewery 2022 bottle version beer pours...


In [3]:
from gensim.models.phrases import Phrases, Phraser
from gensim.utils import simple_preprocess

# Retrieving text
texts = reviews["clean_text"].fillna("").astype(str) # fill na's

# Normalization
def normalize(txt):
    txt = txt.lower()
    txt = re.sub(r"\s+", " ", txt)
    return txt.strip()

# Tokenize into lists of words, and cleans text through simple_preprocess
# simple_preprocess does lowercasing, punctuation removal, basic tokenization
tokens = [simple_preprocess(normalize(t), deacc = True, min_len = 2) for t in texts] # Arguments get rid of accents (deacc) and gets rid of one-letter words like "I"

# tokens is a list of lists: each inner list is one review, and each element of it is a word

# Phrases() scans to find words that frequently co-occur and should be merged into one token
# forms multi-word phrases (like "new_york") from text, joining them with an underscore
bigram  = Phrases(tokens, min_count=10, threshold=10) 
trigram = Phrases(bigram[tokens], min_count=10, threshold=10) # phrases need to occur minimum of 10 times, threshold = 10 only promote pairs to phrases if their co-occurrence is ~10× more likely than chance (similar to lift)

# Phraser() compiles the heavy Phrases models into faster, memory-efficient transformers for application time
# Contains only the finalized merge rules (which pairs → merge)
bigram_phraser = Phraser(bigram)
trigram_phraser = Phraser(trigram)

# We use both bigrams and trigrams to capture both 2-token and 3-token phrases, and build our trigrams on top of our bigrams to do so
# sentences is, similar to tokens, a list of lists but this time combines tokens into bigrams or trigrams if they meet the requirements to do so
sentences = [trigram_phraser[bigram_phraser[t]] for t in tokens]

`sentences` is a list of lists, one per review. Frequent word pairs and triples are merged into single bigram or trigram tokens, and all other words remain single-word tokens.
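The `threshold` passed to `Phrases` is compared against a score computed for each candidate word pair. The sketch below assumes gensim's default scorer follows the formula from the original word2vec phrase-detection work, in which the pair count is discounted by `min_count` and compared with the product of the individual word counts. The function name `phrase_score` and all counts are hypothetical.

```python
def phrase_score(count_a, count_b, count_ab, vocab_size, min_count):
    """Lift-like score: discounted pair count relative to the product of word counts."""
    return (count_ab - min_count) * vocab_size / (count_a * count_b)

# Hypothetical counts, not taken from the review corpus.
frequent_pair = phrase_score(count_a=200, count_b=100, count_ab=60,
                             vocab_size=8000, min_count=10)
rare_pair = phrase_score(count_a=200, count_b=100, count_ab=12,
                         vocab_size=8000, min_count=10)

threshold = 10
print(frequent_pair, frequent_pair > threshold)  # would be merged into one token
print(rare_pair, rare_pair > threshold)          # would stay as separate words
```

Under this assumption, raising `threshold` keeps only pairs that co-occur far more often than their individual frequencies suggest, and raising `min_count` penalizes pairs that were seen only a few times.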

### Training Word2Vec with our tokenized reviews

In [4]:
from gensim.models import Word2Vec

# Creates a numeric vector for each token such that words used in similar contexts end up near each other in the vector space
w2v = Word2Vec(
    sentences=sentences,
    vector_size=200,      
    window=5,             # Context size (up to 5 words before and after) ie. With window=5, “barrel_aged” will pair with words up to 5 away in the same review
    min_count=5,          # Tokens occurring fewer than 5 times are discarded
    workers=4,
    sg=1,                 # 1=skip-gram aka given center word, predict context words. Works better for rare words
    negative=10,          # Use negative sampling in order to not calculate 10k probabilities each step
    sample=1e-5,          # Subsampling; randomly discards a fraction of very frequent tokens so they don’t dominate training
    epochs=10,            # Number of passes over the corpus. More epochs = more training
    seed = 42
)

# Model
model = w2v 
model.save("beer_reviews.w2v")

model.wv

<gensim.models.keyedvectors.KeyedVectors at 0x18ce88359d0>

### Quick check of model

In [5]:
# Results make sense
model.wv.most_similar("coffee", topn=10)

[('chocolate', 0.9968509674072266),
 ('bourbon', 0.9946306347846985),
 ('vanilla', 0.9934223890304565),
 ('dark_chocolate', 0.9899531006813049),
 ('cinnamon', 0.9864681959152222),
 ('molasses', 0.9856863617897034),
 ('cocoa', 0.9823498129844666),
 ('coconut', 0.9819732904434204),
 ('toffee', 0.9790197610855103),
 ('caramel', 0.9768047332763672)]

### Turn reviews/beers into vectors

In [6]:
import numpy as np
# Convert all review vectors aligned with a specific beer into one condensed vector with the "average flavor profile"
# Then compare that average flavor profile vector with the user's 3 attributes

# Creating review vector
# kv[w] grabs the vector for each specific word/phrase
# Each word has a set vector, so we're averaging all the words in a review into one vector
# Then again averaging every review related to a singular beer to one beer vector (averaging twice)
def doc_vector(tokens, kv, use_tfidf=False):
    # simple average
    vecs = [kv[w] for w in tokens if w in kv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(kv.vector_size)
    
review_tokens = sentences # sentences being the list of lists of word/tokens
review_vecs = [doc_vector(t, model.wv) for t in review_tokens] # creating review vectors

# Attach those review-level vectors onto Review dataframe
reviews_with_vecs = reviews.copy() 
reviews_with_vecs["__vec"] = review_vecs

# Create a Series that maps each product_name → its beer-level vector
# Each beer-level vector is the mean of its review vectors
beer_vecs = (reviews_with_vecs
             .groupby("product_name")["__vec"]
             .apply(lambda arr: np.mean(np.stack(arr), axis=0))
            )

In [7]:
beer_vecs[:5]

product_name
10 Year Barleywine                         [0.012826044, -0.087914065, -0.025214648, -0.0...
4th Anniversary                            [0.017295592, -0.085724086, -0.033248864, -0.0...
A Deal With The Devil - Double Oak-Aged    [0.013409764, -0.08878207, -0.024757393, -0.08...
A Deal With The Devil - Triple Oak-Aged    [0.012703434, -0.08922341, -0.025134195, -0.08...
Aaron                                      [0.013797521, -0.08779944, -0.025074193, -0.08...
Name: __vec, dtype: object

In [8]:
# To calculate cosine similarity
from numpy.linalg import norm

# Turn the user's 3 attributes into one averaged-out query vector
def wordset_vector(words, kv):
    # average of attribute seed words; include phrases if you used phrasers
    got = [kv[w] for w in words if w in kv]
    return np.mean(got, axis=0) if got else np.zeros(kv.vector_size)

# Computing cosine similarity
def cosine(a, b):
    na, nb = norm(a), norm(b)
    return float(a @ b / (na*nb)) if na > 0 and nb > 0 else 0.0 # a @ b is the dot product, na and nb are the lengths, prevents division by 0 as well

# Example user attributes
attrs = ["chocolate", "vanilla", "coffee"]

# Building query vector from our function
qvec = wordset_vector(attrs, model.wv) 

# For each beer vector, compute cosine similarity with the query vector above
scores = beer_vecs.apply(lambda v: cosine(qvec, v)).sort_values(ascending=False)

# Cosine similarity score range -1 to 1
top3 = scores.head(3)
top23 = scores.head(23)

### Top 3 Recommendations:

In [9]:
top3

product_name
Fundamental Forces                      0.919509
Bourbon Paradise                        0.918391
Speedway Stout - Bourbon Barrel-Aged    0.918232
Name: __vec, dtype: float64

### Show a table showing your three final recommendations along with 20 other top contenders so that I can understand how the top three got chosen. 

In [10]:
top23_df = top23.reset_index()
top23_df.columns = ["Beer Name", "Cosine Similarity"]
top23_df

Unnamed: 0,Beer Name,Cosine Similarity
0,Fundamental Forces,0.919509
1,Bourbon Paradise,0.918391
2,Speedway Stout - Bourbon Barrel-Aged,0.918232
3,Red Eye November,0.918015
4,"Somewhere, Something Incredible Is Waiting To ...",0.917581
5,Reaction State,0.917564
6,Bourbon Barrel Champion Ground,0.916872
7,Caffè Americano,0.916842
8,Truth - Vanilla Bean,0.916759
9,Ten FIDY - Bourbon Barrel-Aged,0.916532


Compared with our results from spaCy's default word vectors and the TF-IDF-based similarity, the custom embeddings produce a different top 3 for the attributes chocolate, vanilla, and coffee. There is some overlap: Speedway Stout - Bourbon Barrel-Aged ranks third under both TF-IDF and the custom embeddings, and "Somewhere, Something Incredible Is Waiting To ..." appears in the displayed results of all three methods. The first two recommendations, however, differ in every version, which suggests that the domain-specific vectors shift the cosine similarity scores enough to change the top recommendations.
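The top-3 lists printed in the earlier outputs can be compared directly. The sketch below copies the three displayed lists by hand (the dictionary keys are labels chosen for this example) and reports which beers each pair of methods shares.

```python
from itertools import combinations

# Top-3 names copied from the outputs shown for Tasks B, C, and D.
top3 = {
    "tfidf": ["Affogato", "Pirate Bomb!", "Speedway Stout - Bourbon Barrel-Aged"],
    "spacy": ["Somewhere, Something Incredible Is Waiting To ...",
              "All That Is And All That Ever Will Be",
              "Moment Of Clarity"],
    "word2vec": ["Fundamental Forces", "Bourbon Paradise",
                 "Speedway Stout - Bourbon Barrel-Aged"],
}

# Shared beers for every pair of methods.
overlaps = {
    (a, b): sorted(set(top3[a]) & set(top3[b]))
    for a, b in combinations(top3, 2)
}

for pair, shared in overlaps.items():
    print(pair, shared)
```

Only the TF-IDF and custom-embedding lists share a beer in their top 3; the spaCy list has no overlap with either at that depth.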

# Task E

In [None]:
df = pd.read_csv("beer_reviews.csv")
top_rated = df.groupby("product_name")["user_rating"].mean().sort_values(ascending=False).head(3)
print(top_rated)

product_name
10 Year Barleywine    4.972727
O.W.K.                4.921765
M.J.K.                4.847727
Name: user_rating, dtype: float64


# Insight

Beer Attributes considered – Chocolate, Dark, and Coffee

Our Recommendation (Task D – Attribute-Based):

Fundamental Forces

Bourbon Paradise

Speedway Stout – Bourbon Barrel-Aged

As seen in the similarity scores, these recommendations stand out with values of about 0.92, the highest in the Task D ranking, indicating a strong alignment with the user’s specified attributes. Each of these beers emphasizes chocolate and dark-roasted flavors in their reviews, alongside rich coffee notes that directly reflect the user’s preferences.

A closer look at the reviews shows that Fundamental Forces is often praised for its velvety dark chocolate character balanced with roasted malt depth, while Bourbon Paradise highlights bourbon warmth layered over coffee and cocoa tones. Similarly, Speedway Stout – Bourbon Barrel-Aged is frequently noted for its bold dark profile, blending espresso-like bitterness with chocolate sweetness. Together, these beers capture the essence of “dark, chocolate, and coffee” with high consistency and strong user sentiment.

Top Rated Beers from the Dataset (Task E – Ratings-Only):

10 Year Barleywine

O.W.K.

M.J.K.

From user reviews of these top-rated beers, it is clear that they excel in overall quality and popularity, but do not necessarily emphasize the specific flavor attributes of chocolate, dark, and coffee. For example, 10 Year Barleywine is praised for its complexity and sweetness, with notes of dried fruit and caramel rather than roasted depth. O.W.K. often highlights balance and smoothness but is less focused on rich dark flavors. M.J.K. is celebrated for intensity and craftsmanship, yet reviews suggest a more diverse profile that does not center on chocolate or coffee-driven notes.

Comparison and Conclusion

The contrast shows that while the highest-rated beers are outstanding in terms of general acclaim, they do not fully meet the attribute preferences of a user seeking chocolate, dark, and coffee flavors. The attribute-based recommendations, by contrast, directly align with these specific tastes, offering beers that users consistently describe in those terms.

This illustrates the value of personalization: ratings-only approaches capture broad popularity, but attribute-driven methods ensure that the recommendations reflect what the individual user actually wants in their drinking experience.

# Task F
Choose any 10 beers in your data. Now choose any one of them, and find the most similar beer (among the remaining 9). Explain your method and logic. 

In [None]:
# Check available columns in beer_reviews.csv to find the correct review text column name
import pandas as pd
df = pd.read_csv("beer_reviews.csv")
print(df.columns.tolist())

In [None]:
# ==== Task F with integration to Tasks B, C, and D ====
import warnings
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# ---------------------------
# Load & Aggregate
# ---------------------------
def load_dataframe(path: str) -> pd.DataFrame:
    df = pd.read_csv(path) if path.endswith(".csv") else pd.read_parquet(path)
    for c in ["product_name", "product_review", "user_rating"]:
        if c not in df.columns:
            raise ValueError(f"Missing column: {c}")
    df["product_review"] = df["product_review"].fillna("").astype(str)
    return df

def aggregate_reviews_by_beer(df: pd.DataFrame):
    # drop empty reviews (no text signal)
    df2 = df[df["product_review"].str.strip() != ""].copy()
    beer_texts = df2.groupby("product_name")["product_review"].apply(lambda s: "\n".join(s.tolist()))
    beer_stats = df2.groupby("product_name").agg(
        avg_rating=("user_rating", "mean"),
        n_reviews=("user_rating", "size")
    ).reset_index()
    return beer_texts, beer_stats

# ---------------------------
# Random 10 + Random Target
# ---------------------------
def choose_random_10_and_target(beer_texts: pd.Series, n: int = 10, random_state: int = 42, min_len: int = 0):
    """
    1) Randomly sample n beers (optionally requiring a min concatenated text length).
    2) Randomly pick 1 target from those n.
    """
    rng = np.random.default_rng(random_state)
    candidates = beer_texts if min_len <= 0 else beer_texts[beer_texts.str.len() >= min_len]
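    if len(candidates) < n:
        warnings.warn(f"Only {len(candidates)} beers available after min_len={min_len}. Using all beers.")
        candidates = beer_texts
    # Guard: rng.choice(..., replace=False) raises ValueError if n exceeds the pool size
    n = min(n, len(candidates))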
    selection = rng.choice(candidates.index.to_list(), size=n, replace=False).tolist()
    target = rng.choice(selection, size=1)[0]
    return selection, target

# ---------------------------
# TF-IDF Similarity (Task B logic)
# ---------------------------
def compute_tfidf_similarity(beer_texts, target_beer, beer_pool):
    pool = list(dict.fromkeys(beer_pool))
    if target_beer not in pool:
        pool = [target_beer] + pool
    sub_texts = beer_texts.reindex(pool).fillna("")
    vec = TfidfVectorizer(
        stop_words="english",
        lowercase=True,
        ngram_range=(1, 2),
        min_df=2,
        max_features=20000
    )
    X = vec.fit_transform(sub_texts.values)
    target_idx = pool.index(target_beer)
    sims = cosine_similarity(X[target_idx], X).flatten()
    rows = []
    for i, beer in enumerate(pool):
        if beer == target_beer:
            continue
        rows.append({"target_beer": target_beer, "candidate_beer": beer, "similarity_tfidf": float(sims[i])})
    return pd.DataFrame(rows).sort_values("similarity_tfidf", ascending=False).reset_index(drop=True)

# ---------------------------
# Task C embedding similarity (SpaCy) — uses YOUR nlp if present
# ---------------------------
def compute_task_c_similarity(beer_texts, target_beer, beer_pool):
    # Use an existing SpaCy pipeline (from your Task C) if available; otherwise try to load md/lg
    nlp = globals().get("nlp", None)
    if nlp is None:
        try:
            import spacy
            for m in ["en_core_web_md", "en_core_web_lg"]:
                try:
                    nlp = spacy.load(m)
                    break
                except Exception:
                    pass
        except Exception:
            nlp = None
    if nlp is None:
        return None  # Task C not available in this runtime

    def doc_vec(text: str):
        doc = nlp(text)
        # prefer token-mean to avoid empty .vector if pipeline lacks vectors
        vecs = [t.vector for t in doc if t.has_vector and not t.is_space]
        return np.mean(vecs, axis=0) if vecs else None

    pool = list(dict.fromkeys(beer_pool))
    if target_beer not in pool:
        pool = [target_beer] + pool
    sub_texts = beer_texts.reindex(pool).fillna("")

    vectors = {}
    for beer, txt in sub_texts.items():
        v = doc_vec(txt)
        if v is None:
            return None
        vectors[beer] = v

    target_vec = vectors[target_beer].reshape(1, -1)
    mat = np.vstack([vectors[b] for b in pool])
    sims = cosine_similarity(target_vec, mat).flatten()

    rows = []
    for i, beer in enumerate(pool):
        if beer == target_beer:
            continue
        rows.append({"target_beer": target_beer, "candidate_beer": beer, "similarity_c": float(sims[i])})
    return pd.DataFrame(rows).sort_values("similarity_c", ascending=False).reset_index(drop=True)

# ---------------------------
# Task D embedding similarity (custom) — uses YOUR functions/kv if present
# Expects your notebook to have:
#   - a KeyedVectors-like object named `kv` (or `MODEL`)
#   - a function `doc_vector(tokens, kv, use_tfidf=False)` or similar
#   - a tokenizer / normalize function you used in Task D (we try common names)
# ---------------------------
def compute_task_d_similarity(beer_texts, target_beer, beer_pool):
    kv = globals().get("kv", None) or globals().get("MODEL", None)
    doc_vector = globals().get("doc_vector", None)
    normalize_fn = globals().get("normalize", None)
    import re  # stdlib re, imported before the fallback tokenizer that relies on it

    # simple fallback tokenizer if your Task D tokenizer isn't present
    def default_tokenize(text: str):
        return [t for t in re.split(r"\W+", text.lower()) if t]

    if kv is None or doc_vector is None:
        return None  # Task D not available in this runtime

    tokenize = normalize_fn if callable(normalize_fn) else default_tokenize

    pool = list(dict.fromkeys(beer_pool))
    if target_beer not in pool:
        pool = [target_beer] + pool
    sub_texts = beer_texts.reindex(pool).fillna("")

    # Build vectors using your Task D function
    vectors = {}
    for beer, txt in sub_texts.items():
        tokens = tokenize(txt)
        v = doc_vector(tokens, kv)  # uses YOUR implementation
        if v is None or (hasattr(v, "__len__") and len(v) == 0):
            return None
        vectors[beer] = np.asarray(v, dtype=float)

    target_vec = vectors[target_beer].reshape(1, -1)
    mat = np.vstack([vectors[b] for b in pool])
    sims = cosine_similarity(target_vec, mat).flatten()

    rows = []
    for i, beer in enumerate(pool):
        if beer == target_beer:
            continue
        rows.append({"target_beer": target_beer, "candidate_beer": beer, "similarity_d": float(sims[i])})
    return pd.DataFrame(rows).sort_values("similarity_d", ascending=False).reset_index(drop=True)

# ---------------------------
# RUN: random 10 → random target → compute & merge
# ---------------------------
# 1) Load your file
df = load_dataframe("beer_reviews.csv")  # <-- change if needed
beer_texts, beer_stats = aggregate_reviews_by_beer(df)

# 2) Random selection
beer_list, target_beer = choose_random_10_and_target(beer_texts, n=10, random_state=99, min_len=0)
print("Randomly selected beers:")
for b in beer_list: print(" •", b)
print("\nTarget beer:", target_beer)

# 3) TF-IDF (always)
tfidf_table = compute_tfidf_similarity(beer_texts, target_beer, beer_list)

# 4) Task C (SpaCy) — only if available in this runtime
c_table = compute_task_c_similarity(beer_texts, target_beer, beer_list)

# 5) Task D (custom embeddings) — only if available in this runtime
d_table = compute_task_d_similarity(beer_texts, target_beer, beer_list)

# 6) Merge & rank (TF-IDF primary; then C; then D; then rating)
report = (tfidf_table
          .merge(beer_stats.rename(columns={"product_name": "candidate_beer"}),
                 on="candidate_beer", how="left"))

if c_table is not None:
    report = report.merge(c_table[["candidate_beer","similarity_c"]], on="candidate_beer", how="left")
if d_table is not None:
    report = report.merge(d_table[["candidate_beer","similarity_d"]], on="candidate_beer", how="left")

sort_cols, asc = ["similarity_tfidf"], [False]
if "similarity_c" in report: sort_cols.append("similarity_c"); asc.append(False)
if "similarity_d" in report: sort_cols.append("similarity_d"); asc.append(False)
if "avg_rating" in report:   sort_cols.append("avg_rating");   asc.append(False)

report = report.sort_values(sort_cols, ascending=asc).reset_index(drop=True)

# Tidy display
cols = ["target_beer", "candidate_beer", "similarity_tfidf"]
if "similarity_c" in report: cols.append("similarity_c")
if "similarity_d" in report: cols.append("similarity_d")
if "avg_rating" in report:   cols += ["avg_rating", "n_reviews"]
report = report.assign(target_beer=target_beer)[[c for c in cols if c in report.columns]]

report


Randomly selected beers:
 • Bourbon Barrel Oro Negro
 • Trappist Westvleteren 8 (VIII)
 • Lou Pepe - Kriek
 • Hommage
 • Triple Citra Daydream
 • Very Green
 • Plead The 5th - Bourbon Barrel-Aged
 • Very GGGreennn
 • Hunahpu's Imperial Stout - Double Barrel Aged
 • KBS

Target beer: Plead The 5th - Bourbon Barrel-Aged


| | target_beer | candidate_beer | similarity_tfidf | avg_rating | n_reviews |
|---|---|---|---|---|---|
| 0 | Plead The 5th - Bourbon Barrel-Aged | KBS | 0.770942 | 4.397342 | 79 |
| 1 | Plead The 5th - Bourbon Barrel-Aged | Bourbon Barrel Oro Negro | 0.751962 | 4.441633 | 49 |
| 2 | Plead The 5th - Bourbon Barrel-Aged | Trappist Westvleteren 8 (VIII) | 0.488447 | 4.476512 | 43 |
| 3 | Plead The 5th - Bourbon Barrel-Aged | Hunahpu's Imperial Stout - Double Barrel Aged | 0.482346 | 4.384286 | 35 |
| 4 | Plead The 5th - Bourbon Barrel-Aged | Triple Citra Daydream | 0.285398 | 4.491228 | 57 |
| 5 | Plead The 5th - Bourbon Barrel-Aged | Very Green | 0.255702 | 4.520769 | 52 |
| 6 | Plead The 5th - Bourbon Barrel-Aged | Lou Pepe - Kriek | 0.253737 | 4.686562 | 32 |
| 7 | Plead The 5th - Bourbon Barrel-Aged | Very GGGreennn | 0.252927 | 4.487705 | 61 |
| 8 | Plead The 5th - Bourbon Barrel-Aged | Hommage | 0.238628 | 4.378367 | 49 |
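
The Task C and Task D helpers in the cell above represent each beer by the mean of its token vectors and then compare beers with cosine similarity. The sketch below illustrates that pooling step in isolation. The 3-dimensional `toy_vectors` and the `mean_vector` helper are hypothetical stand-ins for a real embedding model, not the vectors used in this notebook.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical 3-dimensional word vectors; real models use hundreds of dimensions.
toy_vectors = {
    "coffee":    np.array([0.9, 0.1, 0.0]),
    "chocolate": np.array([0.8, 0.2, 0.1]),
    "roasted":   np.array([0.7, 0.3, 0.0]),
    "citrus":    np.array([0.0, 0.9, 0.3]),
    "hops":      np.array([0.1, 0.8, 0.4]),
}

def mean_vector(tokens):
    """Average the vectors of known tokens; return None if no token is known."""
    vecs = [toy_vectors[t] for t in tokens if t in toy_vectors]
    return np.mean(vecs, axis=0) if vecs else None

stout = mean_vector(["roasted", "coffee", "chocolate"])
porter = mean_vector(["chocolate", "roasted", "smooth"])  # "smooth" is out of vocabulary and skipped
ipa = mean_vector(["citrus", "hops"])

sim_stout_porter = cosine_similarity(stout.reshape(1, -1), porter.reshape(1, -1))[0, 0]
sim_stout_ipa = cosine_similarity(stout.reshape(1, -1), ipa.reshape(1, -1))[0, 0]
print(round(sim_stout_porter, 3), round(sim_stout_ipa, 3))
```

Returning `None` when no token has a vector mirrors the behavior of the helpers above, which skip the whole embedding comparison rather than score a beer on an empty vector.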


We first selected a random subset of ten beers from the dataset to avoid bias toward any one style or rating. From this group, one beer was randomly designated as the target (Plead The 5th - Bourbon Barrel-Aged), and the remaining nine were treated as candidates. All reviews for each beer were aggregated into a single text document, capturing the collective descriptors used by reviewers.

These documents were then transformed into TF-IDF vectors over unigrams and bigrams (such as “chocolate notes” or “citrus hops”), which weight distinctive terms more heavily than generic language. The vectorizer was fit on the ten selected beers only, with English stop words removed and terms kept only if they appear in at least two of the ten documents (`min_df=2`). We measured the similarity between the target and each of the other nine using cosine similarity, which compares the angle between two TF-IDF vectors, and ranked the nine in descending order of similarity.

In this run, KBS is the most similar beer to the target (cosine similarity 0.771), closely followed by Bourbon Barrel Oro Negro (0.752); every other candidate scores below 0.49. The output table contains only the TF-IDF column, which indicates that the Task C and Task D similarities were not available when the cell ran, so the ranking rests on TF-IDF alone. Because the similarity is computed from the words reviewers actually use, the result is interpretable, and the fixed random seed (`random_state=99`) makes the selection reproducible.
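
The method described above can be sketched on a toy corpus. The three review snippets and beer names are invented for illustration, and `min_df` is left at its default of 1 because the corpus is too small for the `min_df=2` setting used in the Task F cell.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical aggregated review text, one document per beer (not real data).
toy_docs = {
    "Stout A": "roasted coffee and dark chocolate with bourbon warmth",
    "Stout B": "dark chocolate, roasted malt, bourbon and vanilla",
    "IPA C": "juicy citrus hops with tropical fruit and pine",
}
names = list(toy_docs)

# Unigrams and bigrams with English stop words removed, as in the Task F settings.
vectorizer = TfidfVectorizer(stop_words="english", ngram_range=(1, 2))
X = vectorizer.fit_transform([toy_docs[n] for n in names])

# Cosine similarity between the target row and every row, then drop the target itself.
target = "Stout A"
sims = cosine_similarity(X[names.index(target)], X).flatten()
ranking = sorted(
    ((name, float(score)) for name, score in zip(names, sims) if name != target),
    key=lambda pair: pair[1],
    reverse=True,
)
most_similar = ranking[0][0]
print(ranking)
```

Here the second stout ranks first because it shares several terms with the target, while the IPA shares none and so scores zero.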