<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Capstone Project - Wine Recommender System <br> [Part 2 of 3]

## Contents:
- [Modeling](#Modeling)
- [Hyperparameter Tuning](#Hyperparameter-Tuning)

---
## Modeling
---

In [14]:
# Importing Libraries

import pandas as pd
import numpy as np
import random
from surprise import Dataset, Reader, accuracy
from surprise.model_selection import cross_validate, KFold, GridSearchCV
from surprise import NormalPredictor, BaselineOnly, KNNBasic, KNNWithMeans, KNNWithZScore, KNNBaseline, SVD, NMF, CoClustering

In [15]:
df = pd.read_csv('../data/wine_reviews_clean.csv')

In [16]:
# Assign a unique ID to each wine
df['wine_id'] = df.index

# Assign a unique integer ID to each taster (IDs follow order of first appearance)
df['user_id'] = pd.factorize(df['taster_name'])[0]

# Convert the dataframe to a user preferences format
user_prefs = df[['user_id', 'wine_id', 'points']]

# Save user preferences dataframe
user_prefs[['user_id', 'wine_id', 'points']].to_csv("../data/user_prefs.csv", index=False)

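# Load data (skip_lines=1 skips the CSV header row).
# Because the ratings are read from a file, Surprise keeps the raw user and wine IDs as strings.
reader = Reader(line_format='user item rating', sep=',', rating_scale=(90, 100), skip_lines=1)
data = Dataset.load_from_file("../data/user_prefs.csv", reader=reader)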

In [17]:
# Define the algorithms
algorithms = [
    ('Normal Predictor', NormalPredictor()),
    ('Baseline Predictor', BaselineOnly()),
    ('KNN Basic', KNNBasic()),
    ('KNN with Means', KNNWithMeans()),
    ('KNN with Z-score', KNNWithZScore()),
    ('KNN Baseline', KNNBaseline()),
    ('SVD', SVD()),
    ('Non-negative Matrix Factorization', NMF()),
    ('Co-Clustering', CoClustering()),
]

In [18]:
# K-Fold Cross-Validation with a fixed integer seed so the folds are reproducible
k_fold = KFold(n_splits=5, shuffle=True, random_state=42)

Two ranking metrics are reported alongside RMSE:
- `Precision@k` : Proportion of the top k recommended wines that are relevant
- `Recall@k` : Proportion of all relevant wines that appear among the top k recommendations

The threshold sets the rating a wine must reach to be considered relevant. A low threshold treats even lower-rated wines as relevant, while a high threshold requires a higher rating. In the implementation below, a wine also counts as recommended only if its estimated rating reaches the same threshold.

When the threshold is low, precision tends to be high because most recommended wines count as relevant, but recall tends to be low because the top k recommendations can cover only a small share of the many relevant wines.

When the threshold is high, precision tends to be lower because only a few wines meet the rating requirement. Recall may then rise because there are fewer relevant wines for the top k to cover, although this depends on how well the model ranks them.
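To make the two definitions concrete, here is a minimal sketch for a single user. The ratings are made up for illustration, and the helper name `toy_precision_recall_at_k` is hypothetical. It follows the plain definitions, so unlike the notebook's function it does not require the estimated rating to reach the threshold.

```python
# Toy predictions for one user as (estimated_rating, true_rating) pairs.
# All numbers are invented for illustration only.
toy_ratings = [(94.0, 95), (93.0, 90), (92.5, 93), (91.0, 92), (90.5, 94)]


def toy_precision_recall_at_k(ratings, k, threshold):
    """Return (precision@k, recall@k) for one user's predictions."""
    # Rank wines by estimated rating, highest first, and keep the top k
    ranked = sorted(ratings, key=lambda pair: pair[0], reverse=True)
    top_k = ranked[:k]
    # A wine is relevant when its true rating reaches the threshold
    n_relevant = sum(true >= threshold for _, true in ranked)
    n_relevant_in_top_k = sum(true >= threshold for _, true in top_k)
    precision = n_relevant_in_top_k / len(top_k) if top_k else 0.0
    recall = n_relevant_in_top_k / n_relevant if n_relevant else 0.0
    return precision, recall


toy_precision, toy_recall = toy_precision_recall_at_k(toy_ratings, k=3, threshold=92)
print(f"Precision@3: {toy_precision:.3f}, Recall@3: {toy_recall:.3f}")
```

Two of the three top-ranked wines are relevant, so Precision@3 is 2/3. Four wines are relevant overall, so Recall@3 is 2/4.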


In [19]:
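# Define Precision@k and Recall@k
def precision_recall_at_k(predictions, k=10, threshold=91):
    """Return per-user Precision@k and Recall@k dicts for a list of Surprise predictions."""
    # Group (estimated, true) rating pairs by user
    user_est_true = dict()
    for uid, _, true_r, est, _ in predictions:
        if uid not in user_est_true:
            user_est_true[uid] = []
        user_est_true[uid].append((est, true_r))

    precisions = dict()
    recalls = dict()
    for uid, user_ratings in user_est_true.items():
        # Rank this user's wines by estimated rating, highest first
        user_ratings.sort(key=lambda x: x[0], reverse=True)
        # Relevant wines: true rating at or above the threshold
        n_rel = sum((true_r >= threshold) for (_, true_r) in user_ratings)
        # Recommended wines: top-k wines whose estimate is at or above the threshold
        n_rec_k = sum((est >= threshold) for (est, _) in user_ratings[:k])
        # Wines in the top k that are both relevant and recommended
        n_rel_and_rec_k = sum(((true_r >= threshold) and (est >= threshold))
                              for (est, true_r) in user_ratings[:k])

        # Both ratios are undefined when their denominator is 0; they are set to 1 here,
        # which raises the averages for users with nothing recommended or nothing relevant
        precisions[uid] = n_rel_and_rec_k / n_rec_k if n_rec_k != 0 else 1
        recalls[uid] = n_rel_and_rec_k / n_rel if n_rel != 0 else 1

    return precisions, recalls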
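The function above returns 1 when a ratio's denominator is zero. The sketch below shows how much that choice can move the averaged precision. The per-user counts and the helper name `mean_precision` are hypothetical, chosen only to illustrate the effect.

```python
# Hypothetical per-user counts:
# (relevant and recommended in top k, recommended in top k)
user_counts = {"a": (3, 4), "b": (1, 2), "c": (0, 0)}  # user "c" has nothing recommended


def mean_precision(counts, fallback):
    """Average per-user precision, using `fallback` when nothing is recommended."""
    values = [hits / recommended if recommended != 0 else fallback
              for hits, recommended in counts.values()]
    return sum(values) / len(values)


optimistic = mean_precision(user_counts, fallback=1)
conservative = mean_precision(user_counts, fallback=0)
print(f"fallback=1: {optimistic:.3f}, fallback=0: {conservative:.3f}")
```

With one of three users lacking recommendations, the average moves from 0.750 to about 0.417, so the fallback convention is worth keeping in mind when precision looks unusually high.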

In [20]:
# Evaluate algorithms
results = []
for name, algo in algorithms:
    print(f"Evaluating {name}...")
    algo_results = cross_validate(algo, data, measures=['RMSE'], cv=k_fold, verbose=False)
    rmse = algo_results['test_rmse'].mean()

    # Precision and Recall
    precisions_list = []
    recalls_list = []
    for trainset, testset in k_fold.split(data):
        algo.fit(trainset)
        predictions = algo.test(testset)
        precisions, recalls = precision_recall_at_k(predictions, k=10, threshold=91)

        precisions_list.append(sum(prec for prec in precisions.values()) / len(precisions))
        recalls_list.append(sum(rec for rec in recalls.values()) / len(recalls))

    precision = sum(precisions_list) / len(precisions_list)
    recall = sum(recalls_list) / len(recalls_list)

    results.append({
        'Algorithm': name,
        'RMSE': rmse,
        'Precision@k': precision,
        'Recall@k': recall,
    })

Evaluating Normal Predictor...
Evaluating Baseline Predictor...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Evaluating KNN Basic...
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd 

In [21]:
# Display the results
results = pd.DataFrame(results)
results = results[['Algorithm', 'RMSE', 'Precision@k', 'Recall@k']]
print(results)

                           Algorithm      RMSE  Precision@k  Recall@k
0                   Normal Predictor  2.175808     0.568371  0.192018
1                 Baseline Predictor  1.614335     0.722982  0.106865
2                          KNN Basic  1.632080     0.590877  0.195258
3                     KNN with Means  1.632050     0.589064  0.210714
4                   KNN with Z-score  1.632015     0.596825  0.188387
5                       KNN Baseline  1.614353     0.733534  0.093825
6                                SVD  1.614460     0.756316  0.077905
7  Non-negative Matrix Factorization  1.632030     0.570994  0.186223
8                      Co-Clustering  1.632060     0.564486  0.185423


We will tune the SVD algorithm because it has one of the lowest RMSE scores and the highest Precision@k score.
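The selection can be checked against the reported numbers. The sketch below copies the values from the results table into a plain dictionary (the variable names are illustrative) and looks up the best algorithm per metric.

```python
# (RMSE, Precision@k, Recall@k) copied from the results table above
reported = {
    "Normal Predictor": (2.175808, 0.568371, 0.192018),
    "Baseline Predictor": (1.614335, 0.722982, 0.106865),
    "KNN Basic": (1.632080, 0.590877, 0.195258),
    "KNN with Means": (1.632050, 0.589064, 0.210714),
    "KNN with Z-score": (1.632015, 0.596825, 0.188387),
    "KNN Baseline": (1.614353, 0.733534, 0.093825),
    "SVD": (1.614460, 0.756316, 0.077905),
    "Non-negative Matrix Factorization": (1.632030, 0.570994, 0.186223),
    "Co-Clustering": (1.632060, 0.564486, 0.185423),
}

lowest_rmse = min(reported, key=lambda name: reported[name][0])
highest_precision = max(reported, key=lambda name: reported[name][1])
highest_recall = max(reported, key=lambda name: reported[name][2])

# How far SVD's RMSE is from the best RMSE in the table
svd_rmse_gap = reported["SVD"][0] - reported[lowest_rmse][0]

print(f"Lowest RMSE: {lowest_rmse}")
print(f"Highest Precision@k: {highest_precision}")
print(f"Highest Recall@k: {highest_recall}")
print(f"SVD RMSE gap to best: {svd_rmse_gap:.6f}")
```

SVD has the highest Precision@k, and its RMSE is within about 0.0001 of the Baseline Predictor's. It also has the lowest Recall@k in the table, which is the trade-off accepted by this choice.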

---
## Hyperparameter Tuning
---

In [22]:
# Set hyperparameters for SVD
param_grid = {
    'n_factors': [150],                 # [50, 100, 150] 
    'n_epochs': [40],                   # [20, 30, 40]  
    'lr_all': [0.005],                  # [0.001, 0.002, 0.005] 
    'reg_all': [0.005],                 # [0.005, 0.05, 0.1]
}

In [23]:
# Instantiate and fit tuned SVD model using GridSearchCV
grid_search = GridSearchCV(SVD, param_grid, measures=['rmse'], cv=3, n_jobs=-1)
grid_search.fit(data)

print(f'Best RMSE score of tuned SVD model is {round(grid_search.best_score["rmse"],6)}')
print(f'Best parameters of tuned SVD model are {grid_search.best_params["rmse"]}')

Best RMSE score of tuned SVD model is 1.614552
Best parameters of tuned SVD model are {'n_factors': 150, 'n_epochs': 40, 'lr_all': 0.005, 'reg_all': 0.005}
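The inline comments in `param_grid` appear to record the candidate values that were searched before the grid was narrowed to the best combination. Assuming that reading is right, the sketch below counts how many models a full search over those candidates would fit with `cv=3`. The names `candidate_grid` and `total_fits` are illustrative.

```python
from itertools import product

# Candidate values taken from the comments in param_grid (assumed to be the full search space)
candidate_grid = {
    "n_factors": [50, 100, 150],
    "n_epochs": [20, 30, 40],
    "lr_all": [0.001, 0.002, 0.005],
    "reg_all": [0.005, 0.05, 0.1],
}

# Every combination of one value per hyperparameter
combinations = [
    dict(zip(candidate_grid, values))
    for values in product(*candidate_grid.values())
]

n_folds = 3  # matches cv=3 in the GridSearchCV call
total_fits = len(combinations) * n_folds

print(f"{len(combinations)} combinations x {n_folds} folds = {total_fits} model fits")
```

A grid of this size explains why a narrowed, single-combination grid is convenient when re-running the notebook.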


In [25]:
# Train the SVD model with the best hyperparameters
tuned_svd = SVD(
    n_factors=grid_search.best_params["rmse"]["n_factors"],
    n_epochs=grid_search.best_params["rmse"]["n_epochs"],
    lr_all=grid_search.best_params["rmse"]["lr_all"],
    reg_all=grid_search.best_params["rmse"]["reg_all"],
)

In [26]:
# Precision and Recall
k_fold = KFold(n_splits=5)
precisions_list = []
recalls_list = []

for trainset, testset in k_fold.split(data):
    tuned_svd.fit(trainset)
    predictions = tuned_svd.test(testset)
    precisions, recalls = precision_recall_at_k(predictions, k=10, threshold=91)

    precisions_list.append(sum(prec for prec in precisions.values()) / len(precisions))
    recalls_list.append(sum(rec for rec in recalls.values()) / len(recalls))

precision = sum(precisions_list) / len(precisions_list)
recall = sum(recalls_list) / len(recalls_list)

print("Tuned SVD model:")
print(f"Precision@k: {precision:.4f}")
print(f"Recall@k: {recall:.4f}")

Tuned SVD model:
Precision@k: 0.7442
Recall@k: 0.0750


In [27]:
thresholds = [90, 91, 92]

results = []
for threshold in thresholds:
    k_fold = KFold(n_splits=5)
    precisions_list = []
    recalls_list = []

    for trainset, testset in k_fold.split(data):
        tuned_svd.fit(trainset)
        predictions = tuned_svd.test(testset)
        precisions, recalls = precision_recall_at_k(predictions, k=10, threshold=threshold)

        precisions_list.append(sum(prec for prec in precisions.values()) / len(precisions))
        recalls_list.append(sum(rec for rec in recalls.values()) / len(recalls))

    precision = sum(precisions_list) / len(precisions_list)
    recall = sum(recalls_list) / len(recalls_list)
    results.append({
        'Threshold': threshold,
        'Precision@k': precision,
        'Recall@k': recall,
    })

# Display the results
results_df = pd.DataFrame(results)

# Calculate F1-score
results_df['F1-score'] = 2 * (results_df['Precision@k'] * results_df['Recall@k']) / (results_df['Precision@k'] + results_df['Recall@k'])

results_df = results_df[['Threshold', 'Precision@k', 'Recall@k', 'F1-score']]
print(results_df)


   Threshold  Precision@k  Recall@k  F1-score
0         90     1.000000  0.195189  0.326624
1         91     0.772982  0.066032  0.121671
2         92     0.980000  0.053176  0.100879
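The F1-score column is the harmonic mean of Precision@k and Recall@k. As a quick check, the sketch below recomputes it from the precision and recall values printed in the table. The helper name `f1_score` is illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (threshold, Precision@k, Recall@k) copied from the table above
rows = [
    (90, 1.000000, 0.195189),
    (91, 0.772982, 0.066032),
    (92, 0.980000, 0.053176),
]

f1_by_threshold = {threshold: f1_score(p, r) for threshold, p, r in rows}
for threshold, f1 in f1_by_threshold.items():
    print(f"threshold={threshold}: F1={f1:.6f}")
```

Because recall is so much smaller than precision at every threshold, the harmonic mean stays close to twice the recall, which is why F1 follows the recall column so closely.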


In [28]:
# Create a new user ID
new_user_id = df['user_id'].max() + 1

# Create a list of all wine_ids
wine_ids = df['wine_id'].unique()

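# Generate predicted ratings for the new user for each wine.
# The new user has no ratings, and IDs are passed as integers while the ratings
# loaded from CSV use string raw IDs, so the model treats every pair as unknown
# and falls back to the same default estimate for all wines.
predicted_ratings = []
for wine_id in wine_ids:
    predicted_rating = tuned_svd.predict(new_user_id, wine_id)
    predicted_ratings.append((wine_id, predicted_rating.est))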

# Sort the wines based on the predicted ratings
sorted_predicted_ratings = sorted(predicted_ratings, key=lambda x: x[1], reverse=True)

# Retrieve the top 10 recommendations
top_10_recommendations = sorted_predicted_ratings[:10]

# Compare the estimated ratings to the actual ratings
recommended_wines = pd.DataFrame(top_10_recommendations, columns=['wine_id', 'estimated_rating'])
recommended_wines = recommended_wines.merge(df[['wine_id', 'title', 'points']], on='wine_id')

print("Top 10 Recommendations for the New User:")
print(recommended_wines)

Top 10 Recommendations for the New User:
   wine_id  estimated_rating  \
0        0         91.699629   
1        1         91.699629   
2        2         91.699629   
3        3         91.699629   
4        4         91.699629   
5        5         91.699629   
6        6         91.699629   
7        7         91.699629   
8        8         91.699629   
9        9         91.699629   

                                               title  points  
0  Dopff & Irion 2004 Schoenenbourg Grand Cru Ven...      92  
1         Ceretto 2003 Bricco Rocche Prapó  (Barolo)      92  
2  Matrix 2007 Stuhlmuller Vineyard Chardonnay (A...      92  
3  Mauritson 2007 Rockpile Cemetary Vineyard Zinf...      92  
4    Silverado 2006 Cabernet Sauvignon (Napa Valley)      92  
5  Le Riche 2003 Cabernet Sauvignon Reserve Caber...      91  
6  Pierre Sparr 2007 Vendages Tardives Gewurztram...      91  
7        Pierre Sparr 2008 Alsace One White (Alsace)      91  
8               Kuentz-Bas 2008 Pinot B
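Every wine receives the same estimate, 91.699629. A likely explanation is the cold-start fallback of a biased matrix-factorization model: when the user is unknown, the user terms are dropped, and when the wine is also unknown, only the global mean remains. The sketch below is a simplified stand-in for that logic, not Surprise's code, and all parameter values are made up. It also shows how an integer ID fails to match a string raw ID, which is how IDs are stored when ratings are loaded from a file.

```python
# Simplified biased matrix-factorization estimate (illustrative, not Surprise's implementation).
# All learned values below are invented; keys are strings, like raw IDs read from a CSV file.
global_mean = 91.7
user_bias = {"0": 0.4, "1": -0.2}
item_bias = {"0": 0.3, "1": -0.5}
user_factors = {"0": [0.1, 0.2], "1": [0.0, -0.1]}
item_factors = {"0": [0.5, 0.5], "1": [0.2, -0.3]}


def estimate(user_id, item_id):
    """Global mean plus whichever bias and factor terms are known."""
    known_user = user_id in user_bias
    known_item = item_id in item_bias
    est = global_mean
    if known_user:
        est += user_bias[user_id]
    if known_item:
        est += item_bias[item_id]
    if known_user and known_item:
        est += sum(p * q for p, q in zip(user_factors[user_id], item_factors[item_id]))
    return est


# Integer IDs match nothing, so every pair gets the global mean
print(estimate(19, 0), estimate(19, 1))
# String wine IDs are recognised, so the wine's bias is applied even for a new user
print(estimate("19", "0"), estimate("19", "1"))
# A known user and a known wine use the full formula
print(estimate("0", "0"))
```

Under this reading, passing string IDs would at least differentiate wines by their learned bias, while a genuinely new user still gets no personalization until some of their ratings are added to the training data.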

In [29]:
# Define the traits
desired_traits = {'full-bodied', 'cherry', 'tobacco', 'cinnamon'}

# Sort wines by points in descending order
df_sorted = df.sort_values('points', ascending=False)

# Filter wines based on desired traits
filtered_wines = df_sorted[df_sorted['tokens'].apply(lambda tokens: all(trait in tokens for trait in desired_traits))]

print("Number of wines with desired trait(s):", len(filtered_wines))

# Generate predicted ratings for the new user for each filtered wine
predicted_ratings = []
for wine_id in filtered_wines['wine_id']:
    predicted_rating = tuned_svd.predict(new_user_id, wine_id)
    predicted_ratings.append((wine_id, predicted_rating.est))

# Sort the wines based on the predicted ratings
sorted_predicted_ratings = sorted(predicted_ratings, key=lambda x: x[1], reverse=True)

# Retrieve the top 10 recommendations
top_10_recommendations = sorted_predicted_ratings[:10]

# Compare the estimated ratings to the actual ratings
recommended_wines = pd.DataFrame(top_10_recommendations, columns=['wine_id', 'estimated_rating'])
recommended_wines = recommended_wines.merge(df[['wine_id', 'title', 'points']], on='wine_id')

# Sort by 'points' in descending order
recommended_wines = recommended_wines.sort_values(by='points', ascending=False)

print("Top 10 Recommendations for new user with Desired Traits:")
print(recommended_wines)

Number of wines with desired trait(s): 34
Top 10 Recommendations for new user with Desired Traits:
   wine_id  estimated_rating  \
0    56832         91.699629   
1    44425         91.699629   
2    50263         91.699629   
3     9067         91.699629   
4    32077         91.699629   
5    32085         91.699629   
6    41327         91.699629   
7     2696         91.699629   
8    30428         91.699629   
9    27036         91.699629   

                                               title  points  
0        La Ca' Nova 2016 Montestefano  (Barbaresco)      97  
1  Nino Negri 2013 5 Stelle  (Sforzato di Valtell...      95  
2               Le Gode 2012  Brunello di Montalcino      95  
3       Le Ragnaie 2012 VV  (Brunello di Montalcino)      95  
4              Bruno Giacosa 2011 Falletto  (Barolo)      95  
5          Cavallotto 2009 Vignolo Riserva  (Barolo)      95  
6       Le Ragnaie 2012 VV  (Brunello di Montalcino)      95  
7  Nino Negri 2013 5 Stelle  (Sforzato di Va
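The trait filter relies on `trait in tokens`. If the `tokens` column was read back from CSV, each value may be a string such as `"['cherry', 'oak']"` rather than a list, in which case `in` performs substring matching. The sketch below, which assumes that storage format and uses a hypothetical helper `has_all_traits`, contrasts substring matching with exact token matching.

```python
import ast

desired_traits = {"full-bodied", "cherry", "tobacco", "cinnamon"}


def has_all_traits(tokens, traits):
    """Return True when every trait is an exact token."""
    # A list column saved to CSV is read back as its string representation
    if isinstance(tokens, str):
        tokens = ast.literal_eval(tokens)
    return traits.issubset(tokens)


# Made-up example value: contains 'blackcherry' but not the token 'cherry'
raw_tokens = "['full-bodied', 'blackcherry', 'tobacco', 'cinnamon']"

# Substring matching on the raw string finds 'cherry' inside 'blackcherry'
substring_match = all(trait in raw_tokens for trait in desired_traits)
# Exact matching on the parsed list does not
exact_match = has_all_traits(raw_tokens, desired_traits)

print(f"substring match: {substring_match}, exact match: {exact_match}")
```

If `tokens` already holds Python lists, the original filter matches exact tokens and this distinction does not apply.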