# Beer Recommendation System

User: "I want a very sweet, very hoppy beer with low alcohol"

System: 
1. Creates feature vector from request
2. Predicts rating: 2.8/5 ⚠️
3. Says: "Warning: Beers with this profile typically get poor ratings (2.8/5)"
4. Suggests: "Here are similar but better-rated alternatives:"
   - Less sweet + hoppy = 4.2/5 predicted
   - Sweet + less hoppy = 4.0/5 predicted

pipeline looks like

user query -> LLM -> LLM feature vectors -> processing -> top 5 recommendations

We will first begin work on the processing part
Processing Workflow:

0) Have a beer flavor profile and mouthfeel matrix from the relevant features in the dataset (normalized ofc)
1) Apply hard filters like strength, mainstream etc
2) Create a vector from the LLM features,
3) Use this vector to get the closest matching beers (k = 5) from the matrix in 0)
4) Now rank these 5 beers using a scoring mechanism which uses review_overall * number_of_reviews kind of thing
5) then return the results to the user along with the descriptions(notes) present in the dataset
6) If a reference beer is also given first get the reference beers similarity from matrix in 0) and then tweak it according to the other things mentioned and go through steps 3 - 6

User Request → LLM → Feature Vector → Two Parallel Paths:

                    ↓

    1. Regression: Predict Rating  
    2. KNN: Find Similar Beers

                    ↓

    If rating < 3: Generate alternatives  
    If rating ≥ 3: Rank KNN results by quality

# Data Preprocessing 

In [1]:
import pandas as pd
import numpy as np
import re

# Preprocessing
df = pd.read_csv('./data/beer_profile_and_ratings.csv')


# List of mainstream beer patterns
mainstream_patterns = [
     
    # company names
    'co.', 'inc'
    # Major American Brands
    'budweiser', 'bud', 'busch', 'michelob',
    'miller', 'coors', 'keystone', 'blue moon',
    'pabst', 'pbr', 'schlitz', 'old milwaukee',
    'rolling rock', 'yuengling', 'natural light', 'natty',
    
    # Sam Adams
    'samuel adams', 'sam adams', 'boston lager',
    
    # Mexican/Latin Beers
    'corona', 'modelo', 'pacifico',
    'dos equis', 'tecate', 'sol', 'victoria',
    
    # European Imports
    'heineken', 'amstel', 'stella artois',
    'becks', "beck's", 'st pauli', 'warsteiner',
    'guinness', 'harp', 'smithwick', 'kilkenny',
    'peroni', 'moretti', 'nastro azzurro',
    'carlsberg', 'tuborg', 'kronenbourg',
    'fosters', "foster's", 'grolsch', 'pilsner urquell',
    
    # Canadian
    'molson', 'labatt', 'moosehead', 'sleeman',
    
    # Asian
    'sapporo', 'asahi', 'kirin', 'tsingtao', 'singha', 'tiger', 'leo',
    
    # Large Craft (Now Owned by Big Beer)
    'shock top', 'goose island', 'elysian', 'lagunitas',
    'ballast point', '10 barrel', 'golden road',
    'blue point', 'devils backbone', 'karbach',
    'breckenridge', 'four peaks', 'wicked weed',
    
    # Large Independent Craft (Widely Distributed)
    'sierra nevada', 'new belgium', 'fat tire',
    'stone', 'brooklyn', 'dogfish head', 
    "bell's", 'bells brewery', 'founders',
    'deschutes', 'rogue', 'anchor steam',
    
    # Other Mainstream
    'red stripe', 'newcastle', 'bass', 'boddingtons',
    'murphy', 'beamish', 'tennents', 'carling',
    'leinenkugel', 'magic hat', 'pyramid',
    'widmer', 'redhook', 'kona', 'longboard',
    'landshark', 'presidente', 'medalla',
    
    # Indian Beers
    'kingfisher', 'haywards', 'thunderbolt',
    'kalyani', 'knockout', 'royal challenge',
    'carlsberg elephant', 'bira 91', 'bira',
    'simba', 'godfather', 'hunter', 'zingaro',
    'london pilsner', 'kotsberg', 'bullet',
    'khajuraho', 'taj mahal', 'flying horse', 'dansberg',
    'golden eagle', 'guru', 'bad monkey', 'bee young',
    'white rhino', 'white owl', 'effingut'
]


# marking which beers are from mainstream brands (off the shelf)
def matches_mainstream_pattern(beer_name_full):
        """Check if beer/brewery name matches any mainstream pattern"""
        combined_name = beer_name_full.lower()
        
        for pattern in mainstream_patterns:
            if pattern in combined_name:
                return True
        return False

df['mainstream'] = df.apply(lambda row : matches_mainstream_pattern(row['Beer Name (Full)']), axis=1)

print(np.mean(df['mainstream']))

df['mainstream'] = df['mainstream'] | (df['number_of_reviews'] >= 300)

df['strength'] = df['ABV'].apply(
    lambda x: 'Light' if x <= 5 else
              'Medium' if x <= 7 else
              'Strong' if x <= 10 else
              'Extra Strong'
)

df = df.drop(columns=['Min IBU', 'Max IBU', 'review_aroma', 'review_appearance', 'review_palate', 'review_taste', 'Beer Name (Full)', 'Brewery'])


cols = df.columns.tolist()
cols[1], cols[2] = cols[2], cols[1]
df = df[cols]
df['mainstream'] = df['mainstream'].astype(int)

print(df.columns)


0.2752580544260244
Index(['Name', 'Description', 'Style', 'ABV', 'Astringency', 'Body', 'Alcohol',
       'Bitter', 'Sweet', 'Sour', 'Salty', 'Fruits', 'Hoppy', 'Spices',
       'Malty', 'review_overall', 'number_of_reviews', 'mainstream',
       'strength'],
      dtype='object')


# Regression Component

In [2]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import OneHotEncoder
import re

# Preprocessing and feature Engineering for Regression

reg_df = df.drop(columns=['number_of_reviews', 'strength', 'Name', 'Description'])

cols = reg_df.columns.tolist()
# Swap last two
cols[-2], cols[-1] = cols[-1], cols[-2]
# Reorder dataframe
reg_df = reg_df[cols]


y_reg = reg_df.iloc[:, -1]
X_reg = reg_df.iloc[:, :-1]

X = X_reg.copy()

flavor_features = ['ABV', 'Astringency', 'Body', 'Alcohol', 'Bitter', 'Sweet', 'Sour', 'Salty', 'Fruits', 'Hoppy', 'Spices', 'Malty']

scalar = MinMaxScaler()

X[flavor_features] = scalar.fit_transform(X[flavor_features])

# Removing sub categories like Lager-English, Lager-belgium to just Lager
# Using regex to split by ' - ' first, then by ' / ' if no hyphen
X['Style'] = X['Style'].str.split(' - ').str[0].str.split(' / ').str[0]

# Create encoder
encoder = OneHotEncoder(sparse_output=False)

# Fit and transform the 'style' column
encoded_array = encoder.fit_transform(X[['Style']])

# Get feature names
feature_names = encoder.get_feature_names_out(['Style'])

# Create DataFrame with encoded features
encoded_df = pd.DataFrame(encoded_array, columns=feature_names, index=X_reg.index)

# Concatenate with original DataFrame (dropping the original 'style' column)
X_reg_scaled = pd.concat([X.drop('Style', axis=1), encoded_df], axis=1)


display(X_reg_scaled)


Unnamed: 0,ABV,Astringency,Body,Alcohol,Bitter,Sweet,Sour,Salty,Fruits,Hoppy,...,Style_Scotch Ale,Style_Scottish Ale,Style_Smoked Beer,Style_Sour,Style_Stout,Style_Strong Ale,Style_Tripel,Style_Wheat Beer,Style_Wild Ale,Style_Winter Warmer
0,0.092174,0.160494,0.182857,0.064748,0.313333,0.281369,0.116197,0.000000,0.188571,0.331395,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.125217,0.148148,0.325714,0.129496,0.220000,0.209125,0.056338,0.000000,0.137143,0.203488,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.086957,0.172840,0.211429,0.043165,0.280000,0.163498,0.038732,0.000000,0.057143,0.313953,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.147826,0.160494,0.314286,0.223022,0.313333,0.384030,0.063380,0.020833,0.280000,0.232558,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.125217,0.308642,0.291429,0.187050,0.293333,0.171103,0.031690,0.020833,0.062857,0.296512,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3192,0.153043,0.185185,0.211429,0.172662,0.233333,0.174905,0.066901,0.000000,0.131429,0.261628,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
3193,0.104348,0.185185,0.177143,0.165468,0.106667,0.205323,0.151408,0.000000,0.308571,0.081395,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
3194,0.118261,0.098765,0.251429,0.172662,0.126667,0.197719,0.073944,0.000000,0.148571,0.122093,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
3195,0.130435,0.135802,0.205714,0.359712,0.466667,0.273764,0.207746,0.000000,0.462857,0.639535,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0


In [3]:
# Testing Various Regression Models

from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.neural_network import MLPRegressor

X_reg_np = X_reg_scaled.to_numpy()
y_reg_np = y_reg.to_numpy()

X_train, X_test, y_train, y_test = train_test_split(X_reg_np, y_reg_np, test_size=0.2)

linear_model = LinearRegression()
linear_model.fit(X_train, y_train)

y_pred_linear = linear_model.predict(X_test)

linear_mse = mean_squared_error(y_pred_linear, y_test)

print(f"Test MSE for linear model {linear_mse}")

 #Train Gradient Boosting

gb_model = GradientBoostingRegressor(
    n_estimators=150,
    learning_rate=0.1,
    max_depth=4,
)

gb_model.fit(X_train, y_train)
y_pred_gb = gb_model.predict(X_test)

# Calculate MSE
gb_mse = mean_squared_error(y_pred_gb, y_test)

print(f"Test MSE for Gradient Boosted trees model {gb_mse}")
print(f"diff = {linear_mse - gb_mse}\n")


print("Gradient Boosted trees for the win")


Test MSE for linear model 0.11373085159549703
Test MSE for Gradient Boosted trees model 0.0931292831516258
diff = 0.020601568443871232

Gradient Boosted trees for the win


In [4]:
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import GradientBoostingRegressor

gb_model = GradientBoostingRegressor(
    n_estimators=150,
    learning_rate=0.1,
    max_depth=4)

# Training on the whole dataset
X_train = X_reg_scaled.to_numpy()
y_train = y_reg.to_numpy()

gb_model.fit(X_train, y_train)




In [5]:
# Test LLM token
# standard test point
llm_output = {
  "ABV": 4.5,
  "Astringency": 18,
  "Body": 35,
  "Alcohol": 25,
  "Bitter": 28,
  "Sweet": 22,
  "Sour": 65,
  "Salty": 0,
  "Fruits": 85,
  "Hoppy": 45,
  "Spices": 8,
  "Malty": 40,
  "mainstream": 1,
  "style": "Wheat Beer"
}

# Bad test point
llm_output_bad = {
 "ABV": 0.05,
 "Astringency": 2,
 "Body": 10,
 "Alcohol": 10,
 "Bitter": 3,
 "Sweet": 13,
 "Sour": 3,
 "Salty": 0,
 "Fruits": 1,
 "Hoppy": 3,
 "Spices": 3,
 "Malty": 20,
 "mainstream": 1,
 "style": "Low Alcohol Beer"
}

print("True Rating")
print(y_reg[1714])



test_point = {col: 0 for col in X_reg_scaled.columns}


# Define continuous features
continuous_features = ['ABV', 'Astringency', 'Body', 'Alcohol', 'Bitter', 
                      'Sweet', 'Sour', 'Salty', 'Fruits', 'Hoppy', 'Spices', 'Malty']



# Fill in the scaled continuous features
for feat in continuous_features:
    test_point[feat] = [llm_output[feat]]

# Set mainstream (doesn't need scaling)
test_point['mainstream'] = llm_output['mainstream']

# One-hot encode the style
style_column = f"Style_{llm_output['style']}"
if style_column in X_reg_scaled.columns:
    test_point[style_column] = 1

test_point = pd.DataFrame(test_point)

test_point[continuous_features] = scalar.transform(test_point[continuous_features])
test_point = test_point[X_reg_scaled.columns] # Make sure Columns are in same order of the training set
display(test_point.columns)

test_point = test_point.to_numpy()
# Now predict
predicted_rating = gb_model.predict(test_point.reshape(1, -1))[0]
print(f"Predicted rating: {predicted_rating:.2f}")



True Rating
1.8


Index(['ABV', 'Astringency', 'Body', 'Alcohol', 'Bitter', 'Sweet', 'Sour',
       'Salty', 'Fruits', 'Hoppy', 'Spices', 'Malty', 'mainstream',
       'Style_Altbier', 'Style_Barleywine', 'Style_Bitter',
       'Style_Bière de Champagne', 'Style_Blonde Ale', 'Style_Bock',
       'Style_Braggot', 'Style_Brett Beer', 'Style_Brown Ale',
       'Style_California Common', 'Style_Chile Beer', 'Style_Cream Ale',
       'Style_Dubbel', 'Style_Farmhouse Ale', 'Style_Fruit and Field Beer',
       'Style_Gruit', 'Style_Happoshu', 'Style_Herb and Spice Beer',
       'Style_IPA', 'Style_Kvass', 'Style_Kölsch', 'Style_Lager',
       'Style_Lambic', 'Style_Low Alcohol Beer', 'Style_Mild Ale',
       'Style_Old Ale', 'Style_Pale Ale', 'Style_Pilsner', 'Style_Porter',
       'Style_Pumpkin Beer', 'Style_Quadrupel (Quad)', 'Style_Red Ale',
       'Style_Rye Beer', 'Style_Scotch Ale', 'Style_Scottish Ale',
       'Style_Smoked Beer', 'Style_Sour', 'Style_Stout', 'Style_Strong Ale',
       'Style_Tripel', 

Predicted rating: 3.65


# Recommendation Component

In [6]:
def get_strength(ABV):
    
  if ABV <= 5:
    strength = 'Light'
  elif ABV <= 7:
    strength = 'Medium'
  elif ABV <= 10:
    strength = 'Strong'
  else:
    strength = 'Extra Strong'

  return strength

llm_output = {
  "ABV": 4.5,
  "Astringency": 18,
  "Body": 35,
  "Alcohol": 25,
  "Bitter": 28,
  "Sweet": 22,
  "Sour": 65,
  "Salty": 0,
  "Fruits": 85,
  "Hoppy": 45,
  "Spices": 8,
  "Malty": 40,
  "mainstream": 1,
  "style": "Wheat Beer"
}


scaling_features = ['ABV', 'Astringency', 'Body', 'Alcohol', 'Bitter','Sweet','Sour', 'Salty',	'Fruits',	'Hoppy'	,'Spices',	'Malty']

X_recommend = df[['Style'] + scaling_features + ['mainstream','strength']].copy()

y_recommend = df[['Name', 'Description', 'review_overall', 'number_of_reviews']]

X_recommend['Style'] = X_recommend['Style'].str.split(' - ').str[0].str.split(' / ').str[0]

# Create encoder
encoder2 = OneHotEncoder(sparse_output=False)

# Fit and transform the 'style' column
encoded_array = encoder2.fit_transform(X_recommend[['Style']])

# Get feature names
feature_names = encoder2.get_feature_names_out(['Style'])

# Create DataFrame with encoded features
encoded_df = pd.DataFrame(encoded_array, columns=feature_names, index=X_recommend.index)

# Concatenate with original DataFrame (dropping the original 'style' column)
X_recommend = pd.concat([X_recommend.drop('Style', axis=1), encoded_df], axis=1)


# subsetting the dataframe for proper recommendation
if llm_output['mainstream'] == 1: # filter according to mainstream

  mainstream_mask = X_recommend['mainstream'] == 1

  X_recommend_sub = X_recommend[mainstream_mask]
  y_recommend_sub = y_recommend[mainstream_mask]

strength = get_strength(llm_output['ABV'])

strength_mask = X_recommend_sub['strength'] == strength

X_recommend_sub = X_recommend_sub[strength_mask]
y_recommend_sub = y_recommend_sub[strength_mask]

print(X_recommend_sub.shape)

# Dropping strength and mainstream used for content based filtering
X_recommend_sub = X_recommend_sub.drop(columns=['strength', 'mainstream'])

print(X_recommend_sub.shape)

scalar2 = MinMaxScaler()

X_recommend_sub[scaling_features] = scalar2.fit_transform(X_recommend_sub[scaling_features])

X_recommend_scaled = X_recommend_sub.copy()

# Creating test point
# Initialize all columns to 0
test_point = {col: 0 for col in X_recommend_scaled.columns}

for feat in scaling_features:
    test_point[feat] = llm_output[feat]

# One-hot encode the style
style_column = f"Style_{llm_output['style']}"
if style_column in X_recommend_scaled.columns:
    test_point[style_column] = 1

# Convert to DataFrame
test_df = pd.DataFrame([test_point])


# Scale the continuous features
test_df[scaling_features] = scalar2.transform(test_df[scaling_features])

# Make sure Columns are in same order of the training set
test_df = test_df[X_recommend_scaled.columns]

display(test_df)
display(X_recommend_scaled)

# Convert to numpy arrays for KNN
test_vector = test_df.values[0]
X_recommend_scaled_np = X_recommend_scaled.to_numpy()
y_recommend_np = y_recommend_sub.to_numpy()



(363, 57)
(363, 55)


Unnamed: 0,ABV,Astringency,Body,Alcohol,Bitter,Sweet,Sour,Salty,Fruits,Hoppy,...,Style_Scotch Ale,Style_Scottish Ale,Style_Smoked Beer,Style_Sour,Style_Stout,Style_Strong Ale,Style_Tripel,Style_Wheat Beer,Style_Wild Ale,Style_Winter Warmer
0,0.9,0.333333,0.336538,0.735294,0.27451,0.112245,0.325,0.0,0.559211,0.304054,...,0,0,0,0,0,0,0,1,0,0


Unnamed: 0,ABV,Astringency,Body,Alcohol,Bitter,Sweet,Sour,Salty,Fruits,Hoppy,...,Style_Scotch Ale,Style_Scottish Ale,Style_Smoked Beer,Style_Sour,Style_Stout,Style_Strong Ale,Style_Tripel,Style_Wheat Beer,Style_Wild Ale,Style_Winter Warmer
2,1.00,0.259259,0.355769,0.176471,0.411765,0.219388,0.055,0.000000,0.065789,0.364865,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,1.00,0.333333,0.471154,0.147059,0.362745,0.372449,0.110,0.000000,0.138158,0.250000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
17,0.90,0.351852,0.403846,0.382353,0.529412,0.316327,0.095,0.000000,0.164474,0.432432,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
108,1.00,0.166667,0.403846,0.147059,0.294118,0.321429,0.080,0.083333,0.125000,0.344595,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
110,1.00,0.333333,0.307692,0.117647,0.490196,0.229592,0.105,0.000000,0.098684,0.554054,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3121,1.00,0.351852,0.384615,0.147059,0.186275,0.163265,0.380,0.083333,0.546053,0.216216,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
3125,0.90,0.333333,0.307692,0.088235,0.147059,0.163265,0.300,0.083333,0.447368,0.263514,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
3126,0.88,0.259259,0.336538,0.147059,0.245098,0.163265,0.345,0.000000,0.638158,0.270270,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
3127,1.00,0.370370,0.317308,0.235294,0.215686,0.244898,0.270,0.083333,0.565789,0.425676,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0


In [9]:
from sklearn.neighbors import NearestNeighbors
import numpy as np

llm_output = {
  "ABV": 4.5,
  "Astringency": 18,
  "Body": 35,
  "Alcohol": 25,
  "Bitter": 28,
  "Sweet": 22,
  "Sour": 65,
  "Salty": 0,
  "Fruits": 85,
  "Hoppy": 45,
  "Spices": 8,
  "Malty": 40,
  "mainstream": 1,
  "style": "Wheat Beer"
}


def get_quality_score(rating, num_reviews):

    # Calculates final quality score using Bayseian Average (IMdb)
    # Incorporates num_reviews in final score
    m = 50  
    C = 3.748
    return rating * (0.6 + 0.4 * np.log1p(num_reviews) / 10)


# Initialize KNN model
knn = NearestNeighbors(n_neighbors=10, metric='euclidean')
knn.fit(X_recommend_scaled_np)

# Find 10 nearest neighbors
distances, indices = knn.kneighbors([test_vector])

# Get the beer info for these 10 neighbors
top_10_beers = []
for i, idx in enumerate(indices[0]):
    beer_info = {
        'name': y_recommend_np[idx][0],
        'description': y_recommend_np[idx][1],
        'rating': y_recommend_np[idx][2],
        'num_reviews': y_recommend_np[idx][3],
        'distance': distances[0][i],
        'index': idx
    }
    top_10_beers.append(beer_info)


# Create quality score 
for beer in top_10_beers:
    # Bayesian Average
    print("name",beer['name'])
    print("num reviews",beer['num_reviews'])
    print("rating",beer['rating'])
    print("Distance",beer['distance'])
    beer['quality_score'] = get_quality_score(beer['rating'], beer['num_reviews'])
    print("Quality score",beer['quality_score'])
    print()
    
# Sort by quality score (descending)
top_10_beers.sort(key=lambda x: x['quality_score'], reverse=True)

# Get top 5
final_recommendations = top_10_beers[:5]

# Display results
print("Top 5 Recommendations:")
print("-" * 50)
for i, beer in enumerate(final_recommendations, 1):
    print(f"{i}. {beer['name']}")
    print(f"   Rating: {beer['rating']:.2f}/5 ({beer['num_reviews']} reviews)")
    print(f"   Distance: {beer['distance']:.3f}")
    print(f"   {beer['description'][:100]}...")
    print()




name Franziskaner Hefe-Weisse
num reviews 1528
rating 4.145288
Distance 0.49674911775956454
Quality score 3.7029640832362904

name UFO White
num reviews 140
rating 3.825
Distance 0.5164280337570434
Quality score 3.0521602632278597

name Mothership Wit
num reviews 550
rating 3.815455
Distance 0.5835413337750913
Quality score 3.2525586054502615

name Brooklyner Weisse
num reviews 270
rating 4.001852
Distance 0.5847379635657177
Quality score 3.2978652163030033

name Widmer Hefeweizen
num reviews 753
rating 3.403054
Distance 0.6005334341243135
Quality score 2.9436951199807577

name 312 Urban Wheat
num reviews 645
rating 3.74031
Distance 0.6080762465762302
Quality score 3.212297843679724

name Circus Boy The Hefeweizen!
num reviews 658
rating 3.607143
Distance 0.6138220538706011
Quality score 3.100804518496639

name ZÔN
num reviews 208
rating 3.829327
Distance 0.629716422210715
Quality score 3.1158979917629464

name Solace
num reviews 105
rating 3.728571
Distance 0.6409967126240181
Quality 

# Completed workflow/Pipeline

In [18]:
# Testing

import pandas as pd

# Show all columns
pd.set_option('display.max_columns', None)

# Show all rows
pd.set_option('display.max_rows', None)

# Avoid line-wrapping or truncated cells
pd.set_option('display.max_colwidth', None)

def generate_test_point(llm_output, X_train, scalar, type):

    scaling_features = ['ABV', 'Astringency', 'Body', 'Alcohol', 'Bitter', 
                      'Sweet', 'Sour', 'Salty', 'Fruits', 'Hoppy', 'Spices', 'Malty']


    test_point = {col: 0 for col in X_train.columns}

    # Fill in the scaled continuous features
    for feat in scaling_features:
        test_point[feat] = [llm_output[feat]]

    # One-hot encode the style
    style_column = f"Style_{llm_output['style']}"
    if style_column in X_train.columns:
        test_point[style_column] = 1
    
    if type == 'Regressor':
        # Set mainstream (doesn't need scaling)
        test_point['mainstream'] = llm_output['mainstream']

    test_point = pd.DataFrame(test_point)

    test_point[scaling_features] = scalar.transform(test_point[scaling_features])
    # Make sure Columns are in same order of the training set
    test_point = test_point[X_train.columns] 

    return test_point

def get_beer_recommendations(llm_output, X_recommend, y_recommend, alt = False, alt_rating_threshold = 3.5):

    # PERFORMING RECOMMENDATION
    # subsetting the dataframe for proper recommendation

    # for alternative recommendations rating is given preference
    if alt:
        rating_mask = y_recommend['review_overall'] >= alt_rating_threshold

        X_recommend = X_recommend[rating_mask]
        y_recommend = y_recommend[rating_mask] 

    if llm_output['mainstream'] == 1: # filter according to mainstream

        mainstream_mask = X_recommend['mainstream'] == 1

        X_recommend = X_recommend[mainstream_mask]
        y_recommend = y_recommend[mainstream_mask]

    strength = get_strength(llm_output['ABV'])

    strength_mask = X_recommend['strength'] == strength

    X_recommend_sub = X_recommend[strength_mask]
    y_recommend_sub = y_recommend[strength_mask]


    # Dropping strength and mainstream used for content based filtering
    X_recommend_sub = X_recommend_sub.drop(columns=['strength', 'mainstream'])


    # Scaling after subsetting to reduce noise
    scalar2 = MinMaxScaler()

    X_recommend_sub[scaling_features] = scalar2.fit_transform(X_recommend_sub[scaling_features])

    # display(X_recommend.columns)

    X_recommend_scaled = X_recommend_sub

    X_recommend_scaled_np = X_recommend_scaled.to_numpy()
    y_recommend_np = y_recommend_sub.to_numpy()

    test_point_recommendation = generate_test_point(llm_output, X_recommend_scaled, scalar2, type="Recommend")

    test_point_recommendation_np = test_point_recommendation.values[0]

    # display(test_point_recommendation)

    knn.fit(X_recommend_scaled_np)  # Included in Loop because subsetting always changes the df

    # Find 10 nearest neighbors
    distances, indices = knn.kneighbors([test_point_recommendation_np])

    # Get the beer info for these 10 neighbors
    top_10_beers = []
    for i, idx in enumerate(indices[0]):
        beer_info = {
            'name': y_recommend_np[idx][0],
            'description': y_recommend_np[idx][1],
            'rating': y_recommend_np[idx][2],
            'num_reviews': y_recommend_np[idx][3],
            'distance': distances[0][i],
            'index': idx
        }
        top_10_beers.append(beer_info)


    # Create quality score 
    for beer in top_10_beers:
        beer['quality_score'] = get_quality_score(beer['rating'], beer['num_reviews'])

        
    # Sort by quality score (descending)
    top_10_beers.sort(key=lambda x: x['quality_score'], reverse=True)

    # Get top 2
    final_recommendations = top_10_beers[:2]

    return final_recommendations

def display_results(predicted_rating, regular_recommendations, alt_recommendations=None):
    """
    Display recommendations with warnings for low-rated combinations
    
    """
    
    if alt_recommendations is not None:
        # Low rating case - show warning and alternatives
        print("━" * 60)
        print(f"⚠️  Warning: This flavor combination typically rates {predicted_rating:.2f}/5")
        print("━" * 60)
        
        print("\n📍 Here's what matches your exact request:")
        if regular_recommendations:
            for i, beer in enumerate(regular_recommendations[:3], 1):
                print(f"{i}. {beer['name']} ({beer['rating']:.2f}★ - {beer['num_reviews']} reviews)")
                print(f"   Distance: {beer['distance']:.3f}")
        else:
            print("   No exact matches found in our database.")
        
        print("\n💡 Suggested Alternatives (similar but better rated):")
        if alt_recommendations:
            for i, beer in enumerate(alt_recommendations[:3], 1):
                print(f"{i}. {beer['name']} ({beer['rating']:.2f}★ - {beer['num_reviews']} reviews)")
                print(f"   Distance: {beer['distance']:.3f}")
        else:
            print("   No high-rated alternatives found with your criteria.")
            
        print("\n💭 Tip: The flavor combination you requested is uncommon. The alternatives above")
        print("   maintain similar characteristics but with proven appeal to beer enthusiasts.")
        
    else:
        # Good rating case - normal display
        print("━" * 60)
        print(f"✅ Great choice! Predicted rating: {predicted_rating:.2f}/5")
        print("━" * 60)
        
        print("\n🍺 Top Recommendations:")
        for i, beer in enumerate(regular_recommendations[:5], 1):
            print(f"\n{i}. {beer['name']}")
            print(f"   Rating: {beer['rating']:.2f}/5 ({beer['num_reviews']} reviews)")
            print(f"   Distance: {beer['distance']:.3f}")
            print(f"   Notes: {beer['description'][:120]}...")
    
    print("\n" + "─" * 60)


test_cases = {
    "I want a light citrusy beer": {
        "ABV": 4.2,
        "Astringency": 12,  # ~15% - light
        "Body": 35,  # ~20% - light body
        "Alcohol": 15,  # ~11% - light
        "Bitter": 45,  # ~30% - moderate
        "Sweet": 40,  # ~15% - low
        "Sour": 85,  # ~30% - citrus tartness
        "Salty": 0,
        "Fruits": 140,  # ~80% - very high citrus
        "Hoppy": 65,  # ~38% - moderate
        "Spices": 15,  # ~8% - low
        "Malty": 50,  # ~21% - light
        "mainstream": 1,
        "style": "Wheat Beer"
    },
    
    "I want a strong lager which is fruity and not too malty": {
        "ABV": 7.5,
        "Astringency": 20,  # ~25%
        "Body": 70,  # ~40% - medium
        "Alcohol": 75,  # ~54% - strong
        "Bitter": 55,  # ~37% - moderate
        "Sweet": 65,  # ~25% - moderate
        "Sour": 25,  # ~9% - low
        "Salty": 0,
        "Fruits": 120,  # ~69% - high fruity
        "Hoppy": 60,  # ~35% - moderate
        "Spices": 10,  # ~5% - very low
        "Malty": 60,  # ~25% - not too malty
        "mainstream": 1,
        "style": "Lager"
    },
    
    "I want a strong orangey tart beer": {
        "ABV": 8.2,
        "Astringency": 45,  # ~56% - high tartness
        "Body": 60,  # ~34% - medium
        "Alcohol": 85,  # ~61% - strong
        "Bitter": 30,  # ~20% - low
        "Sweet": 75,  # ~29% - moderate
        "Sour": 240,  # ~85% - very tart
        "Salty": 8,  # ~17% - slight
        "Fruits": 155,  # ~89% - very orangey
        "Hoppy": 25,  # ~15% - low
        "Spices": 35,  # ~19% - some zest
        "Malty": 40,  # ~17% - low
        "mainstream": 0,
        "style": "Sour"
    },
    
    "Give me a hoppy IPA with tropical notes": {
        "ABV": 6.8,
        "Astringency": 35,  # ~43% - moderate
        "Body": 75,  # ~43% - medium
        "Alcohol": 65,  # ~47% - moderate-strong
        "Bitter": 110,  # ~73% - high bitter
        "Sweet": 55,  # ~21% - low-moderate
        "Sour": 15,  # ~5% - very low
        "Salty": 0,
        "Fruits": 145,  # ~83% - very tropical
        "Hoppy": 155,  # ~90% - extremely hoppy
        "Spices": 20,  # ~11% - slight
        "Malty": 75,  # ~31% - moderate
        "mainstream": 1,
        "style": "IPA"
    },
    
    "I want a sessionable pilsner": {
        "ABV": 4.5,
        "Astringency": 10,  # ~12% - low
        "Body": 30,  # ~17% - light
        "Alcohol": 18,  # ~13% - light
        "Bitter": 45,  # ~30% - moderate
        "Sweet": 25,  # ~10% - low
        "Sour": 8,  # ~3% - very low
        "Salty": 0,
        "Fruits": 15,  # ~9% - very low
        "Hoppy": 65,  # ~38% - moderate
        "Spices": 8,  # ~4% - very low
        "Malty": 80,  # ~33% - moderate
        "mainstream": 1,
        "style": "Pilsner"
    },
    
    "I need a dessert beer with chocolate and coffee notes": {
        "ABV": 10.5,
        "Astringency": 55,  # ~68% - high
        "Body": 160,  # ~91% - extremely full
        "Alcohol": 120,  # ~86% - very strong
        "Bitter": 75,  # ~50% - moderate-high
        "Sweet": 195,  # ~74% - very sweet
        "Sour": 8,  # ~3% - none
        "Salty": 12,  # ~25% - slight
        "Fruits": 25,  # ~14% - low
        "Hoppy": 35,  # ~20% - low
        "Spices": 140,  # ~76% - high coffee/cocoa
        "Malty": 210,  # ~88% - extremely malty
        "mainstream": 0,
        "style": "Stout"
    },
    
    "Something light and refreshing with low alcohol": {
        "ABV": 3.2,
        "Astringency": 8,  # ~10% - very low
        "Body": 25,  # ~14% - very light
        "Alcohol": 10,  # ~7% - very low
        "Bitter": 25,  # ~17% - low
        "Sweet": 35,  # ~13% - low
        "Sour": 20,  # ~7% - slight
        "Salty": 0,
        "Fruits": 55,  # ~31% - moderate
        "Hoppy": 30,  # ~17% - low
        "Spices": 8,  # ~4% - very low
        "Malty": 55,  # ~23% - light
        "mainstream": 1,
        "style": "Light Beer"
    },
    
    "I want a Belgian tripel with spicy notes": {
        "ABV": 9.0,
        "Astringency": 30,  # ~37% - moderate
        "Body": 85,  # ~49% - medium-full
        "Alcohol": 95,  # ~68% - strong
        "Bitter": 45,  # ~30% - moderate
        "Sweet": 115,  # ~44% - moderate-high
        "Sour": 18,  # ~6% - slight
        "Salty": 0,
        "Fruits": 95,  # ~54% - fruity esters
        "Hoppy": 50,  # ~29% - moderate
        "Spices": 155,  # ~84% - very spicy
        "Malty": 120,  # ~50% - moderate
        "mainstream": 0,
        "style": "Tripel"
    },
    
    "Give me a bitter hoppy beer with no sweetness": {
        "ABV": 6.5,
        "Astringency": 60,  # ~74% - high
        "Body": 65,  # ~37% - medium
        "Alcohol": 60,  # ~43% - moderate
        "Bitter": 135,  # ~90% - extremely bitter
        "Sweet": 15,  # ~6% - minimal
        "Sour": 12,  # ~4% - none
        "Salty": 0,
        "Fruits": 35,  # ~20% - low
        "Hoppy": 165,  # ~96% - maximum hoppy
        "Spices": 25,  # ~14% - slight
        "Malty": 65,  # ~27% - low-moderate
        "mainstream": 1,
        "style": "IPA"
    },
    
    "I want a sweet malty amber ale": {
        "ABV": 5.5,
        "Astringency": 15,  # ~19% - low
        "Body": 95,  # ~54% - full
        "Alcohol": 40,  # ~29% - moderate
        "Bitter": 35,  # ~23% - low-moderate
        "Sweet": 145,  # ~55% - high sweet
        "Sour": 8,  # ~3% - none
        "Salty": 0,
        "Fruits": 45,  # ~26% - moderate
        "Hoppy": 40,  # ~23% - low
        "Spices": 18,  # ~10% - slight
        "Malty": 185,  # ~77% - very malty
        "mainstream": 1,
        "style": "Amber Ale"
    },
    
    "Something sour and funky with brett character": {
        "ABV": 6.2,
        "Astringency": 65,  # ~80% - very high
        "Body": 55,  # ~31% - medium-light
        "Alcohol": 55,  # ~40% - moderate
        "Bitter": 20,  # ~13% - low
        "Sweet": 30,  # ~11% - low
        "Sour": 265,  # ~93% - extremely sour
        "Salty": 18,  # ~38% - noticeable
        "Fruits": 125,  # ~71% - high fruit
        "Hoppy": 20,  # ~12% - very low
        "Spices": 55,  # ~30% - funky notes
        "Malty": 35,  # ~15% - very low
        "mainstream": 0,
        "style": "Wild Ale"
    },
    
    "I want a very sweet and very hoppy beer": {
        "ABV": 5.0,
        "Astringency": 35,  # ~43% - moderate
        "Body": 65,  # ~37% - medium
        "Alcohol": 35,  # ~25% - moderate
        "Bitter": 120,  # ~80% - very bitter
        "Sweet": 210,  # ~80% - very sweet
        "Sour": 15,  # ~5% - minimal
        "Salty": 0,
        "Fruits": 85,  # ~49% - moderate
        "Hoppy": 150,  # ~87% - very hoppy
        "Spices": 15,  # ~8% - low
        "Malty": 95,  # ~40% - moderate
        "mainstream": 1,
        "style": "Pale Ale"
    },
    
        "Just a Bad beer ":{
        "ABV": 0.05,
        "Astringency": 2,
        "Body": 10,
        "Alcohol": 10,
        "Bitter": 3,
        "Sweet": 13,
        "Sour": 3,
        "Salty": 0,
        "Fruits": 1,
        "Hoppy": 3,
        "Spices": 3,
        "Malty": 20,
        "mainstream": 1,
        "style": "Low Alcohol Beer"
        }
}

for prompt, llm_output in test_cases.items():

    print(f"User Prompt = {prompt}")

    # PERFORMING REGRESSION
    test_point_regression = generate_test_point(llm_output, X_reg_scaled, scalar, type="Regressor")

    test_point_regression_np = test_point_regression.values[0]

    predicted_rating = gb_model.predict(test_point_regression_np.reshape(1, -1))[0]

    # Creating copies because each iteration involves subsetting and changing the Dataframes
    X_recommend_sub = X_recommend.copy()
    y_recommend_sub = y_recommend.copy()

    final_recommendations = get_beer_recommendations(llm_output, X_recommend_sub, y_recommend_sub, alt=False, alt_rating_threshold=3.0)

    alt_recommendations = None

    if predicted_rating < 3.0:
        alt_recommendations = get_beer_recommendations(llm_output, X_recommend_sub, y_recommend_sub, alt=True, alt_rating_threshold = 3.0)

    # Display results
    display_results(predicted_rating, final_recommendations, alt_recommendations)
    print('=====================================================================================================================================\n')





User Prompt = I want a light citrusy beer
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
✅ Great choice! Predicted rating: 3.96/5
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

🍺 Top Recommendations:

1. Whirlwind Witbier
   Rating: 3.93/5 (519 reviews)
   Distance: 0.699
   Notes: Notes:...

2. ZÔN
   Rating: 3.83/5 (208 reviews)
   Distance: 0.509
   Notes: Notes:Boulevard’s summer seasonal is our interpretation of a classic Belgian witbier. ZŌN (Flemish for “sun”) combines t...

────────────────────────────────────────────────────────────

User Prompt = I want a strong lager which is fruity and not too malty
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
✅ Great choice! Predicted rating: 3.49/5
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

🍺 Top Recommendations:

1. The Kaiser
   Rating: 3.65/5 (581 reviews)
   Distance: 0.875
   Notes: Notes:The Kaiser once said, “Give me a woman who loves beer and I will conquer the world.

In [16]:
flavor_features = ['ABV', 'Astringency', 'Body', 'Alcohol', 'Bitter', 
                   'Sweet', 'Sour', 'Salty', 'Fruits', 'Hoppy', 'Spices', 'Malty', 'review_overall']

# Method 1: Simple min/max
for feature in flavor_features:
    print(f"{feature}: min={df[feature].min()}, max={df[feature].max()}, mean = {df[feature].mean()}")

print()
print(df['number_of_reviews'].describe())
print(f"Median reviews: {df['number_of_reviews'].median()}")
print(f"Beers with 10+ reviews: {(df['number_of_reviews'] >= 10).sum()}")
print(f"Beers with 20+ reviews: {(df['number_of_reviews'] >= 20).sum()}")
print(f"Beers with 50+ reviews: {(df['number_of_reviews'] >= 50).sum()}")


ABV: min=0.0, max=57.5, mean = 6.5266875195495775
Astringency: min=0, max=81, mean = 16.51579605880513
Body: min=0, max=175, mean = 46.1294964028777
Alcohol: min=0, max=139, mean = 17.055989990616204
Bitter: min=0, max=150, mean = 36.36440412887082
Sweet: min=0, max=263, mean = 58.2708789490147
Sour: min=0, max=284, mean = 33.14544885830466
Salty: min=0, max=48, mean = 1.0172036284016266
Fruits: min=0, max=175, mean = 38.52955896152643
Hoppy: min=0, max=172, mean = 40.92461682827651
Spices: min=0, max=184, mean = 18.34563653425086
Malty: min=0, max=239, mean = 75.33093525179856
review_overall: min=1.136364, max=5.0, mean = 3.747521867062871

count    3197.000000
mean      233.284955
std       361.811847
min         1.000000
25%        23.000000
50%        93.000000
75%       284.000000
max      3290.000000
Name: number_of_reviews, dtype: float64
Median reviews: 93.0
Beers with 10+ reviews: 2731
Beers with 20+ reviews: 2466
Beers with 50+ reviews: 1990


User: "I want a very bitter, very sweet, light beer"

Your System:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
⚠️ Warning: This flavor combination typically rates 2.3/5

Here's what you asked for:
1. Beer X (2.1★) - Matches your request
2. Beer Y (2.4★) - Close match

💡 Suggested Alternatives (similar but better rated):
1. Beer Z (4.2★) - Less sweet, still bitter
2. Beer W (4.0★) - Medium body instead of light