# Avant Ski/ Send It

by: Stephanie Ciaccia

# Overview

Skiing holds a prominent place for those seeking winter recreational activities in the United States. With its stunning mountain ranges and diverse terrain, the country boasts numerous ski resorts that cater to all skill levels, from beginners to seasoned professionals. Skiing offers a unique blend of adventure, physical activity, and natural beauty, making it a popular choice for winter enthusiasts seeking both relaxation and excitement.

The ski market in the United States is thriving, contributing significantly to the economy. According to the [National Ski Areas Association (NSAA)](https://nsaa.org/webdocs/Media_Public/IndustryStats/Historical_Skier_Days_1979_2022.pdf), there were approximately 60.7 million skier and snowboarder visits across 473 ski resorts in the 2021-2022 winter season.

# Business Problem

Skiing, an exhilarating winter sport cherished by many, often involves time-consuming and daunting trip planning. The sheer abundance of ski resorts available makes it overwhelming to choose the ideal destination, and existing ski websites lack the necessary tools to filter options based on individual preferences.

To address these challenges, I'm developing Avant Ski, a ski resort recommendation app. Avant Ski simplifies the ski resort selection process by leveraging data and user preferences. With dynamic filtering features, users can personalize their search based on budget, location, amenities, and skill level. By bridging the gap between ski enthusiasts and their dream destinations, Avant Ski makes skiing accessible to a wider audience, empowering them to plan unforgettable ski trips with confidence.

Since data plays a crucial role in this application, I plan to showcase the app to representatives from different ski resorts across the USA at the National Ski Areas Association Winter Conference. This presentation aims to foster partnerships and encourage resort feature sharing between Avant Ski and these resorts once the app is launched.

# Data Understanding

In [1]:
import pandas as pd
import numpy as np
import math
from datetime import datetime
import datetime
from scipy import stats

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import plotly
import plotly.express as px
import plotly.io as pio
from matplotlib.ticker import StrMethodFormatter

from surprise.model_selection import cross_validate
from surprise import Dataset, Reader, accuracy
from surprise.prediction_algorithms import KNNWithMeans, KNNBasic, KNNBaseline, KNNWithZScore,  SVD, SVDpp, NMF, BaselineOnly, CoClustering, SlopeOne, NormalPredictor
from surprise.model_selection import GridSearchCV, cross_validate, train_test_split

from collections import Counter
from nltk.corpus import stopwords

from IPython.display import Image, display


Function to print full rows

In [2]:
def print_full(x):
    pd.set_option('display.max_rows', len(x))
    print(x)
    pd.reset_option('display.max_rows')

# Importing Data Files

In [76]:
final_user_df = pd.read_csv("data/cleaned_data_exports/user_df_model.csv")

In [4]:
content_df = pd.read_csv("data/cleaned_data_exports/scraped_feature_df.csv")

# Data Modeling - Recommendation System
The proposed recommendation system will adopt a cascade hybrid approach, integrating user-based filtering and content-based recommendation systems.

This **cascade hybrid** approach establishes a hierarchical structure within the recommendation system. The primary model, a collaborative user-based system, will generate the initial set of recommendations based on user preferences and ratings. Then, a secondary model, a content-based system, will refine the recommendations by considering additional factors such as mountain characteristics and user-defined filters. This two-step process aims to provide more accurate and personalized ski resort suggestions.

In the **user-based collaborative** filtering phase, the model will analyze the historical ratings of users for different ski resorts. By identifying users with similar rating patterns, the model will predict the target user's ratings for unvisited resorts, leveraging the collective wisdom of similar users.

The **content-based system** operates by constructing a matrix of resort features and determining the similarity between resorts based on these features. This approach allows the system to identify resorts with similar characteristics and recommend them based on user preferences.
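As a rough illustration of the idea, the sketch below compares resorts with cosine similarity. The similarity measure, the resort names, and the three feature columns are all assumptions made for this example; they are not taken from the scraped feature data.

```python
import math

# Hypothetical, hand-made feature vectors (not from the real dataset):
# [scaled vertical drop, share of expert runs, has night skiing]
resort_features = {
    "Resort A": [0.9, 0.6, 0.0],
    "Resort B": [0.8, 0.5, 0.0],
    "Resort C": [0.2, 0.1, 1.0],
}

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(target, features):
    """Rank every other resort by similarity to the target, highest first."""
    scores = {
        name: cosine_similarity(features[target], vec)
        for name, vec in features.items()
        if name != target
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

ranking = most_similar("Resort A", resort_features)
print(ranking)
```

Here "Resort B" ranks above "Resort C" because its feature vector points in nearly the same direction as that of "Resort A".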

To evaluate the accuracy of the collaborative model, the **Root Mean Square Error (RMSE)** score will be used. This metric quantifies the difference between the actual ratings and the predicted ratings, providing insights into the model's performance in capturing user preferences.
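For reference, RMSE is the square root of the mean squared difference between actual and predicted ratings. The small sketch below computes it by hand on made-up ratings (the four values are illustrative only), which is the same quantity that `accuracy.rmse` reports for a list of predictions.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length rating lists."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up ratings for illustration only.
actual = [5, 4, 3, 1]
predicted = [4.0, 4.0, 4.0, 3.0]

score = rmse(actual, predicted)
print(round(score, 4))  # squared errors are 1, 0, 1, 4, so RMSE = sqrt(1.5)
```

Because the errors are squared before averaging, the single two-point miss contributes more to the score than the two one-point misses combined.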

While the content-based system does not have an accuracy score, a combination of ski expertise and specific user information will be utilized to subjectively assess the effectiveness of the hybrid model.

By utilizing both user-based and content-based approaches, the cascade hybrid recommendation system aims to enhance the ski resort selection process, **offering users more personalized and relevant recommendations**.
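The cascade idea can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the final app logic: `cascade_recommend`, the resort names, the predicted ratings, the content scores, and the budget filter are all hypothetical.

```python
def cascade_recommend(predicted_ratings, content_score, passes_filters,
                      shortlist_size=3, top_n=2):
    """Two-stage cascade: collaborative shortlist, then content-based refinement."""
    # Stage 1: the collaborative model proposes candidates by predicted rating.
    shortlist = sorted(predicted_ratings, key=predicted_ratings.get,
                       reverse=True)[:shortlist_size]
    # Stage 2: drop candidates that fail the user's filters, then re-rank
    # the survivors by their content-based score.
    refined = [resort for resort in shortlist if passes_filters(resort)]
    refined.sort(key=content_score, reverse=True)
    return refined[:top_n]

# Hypothetical inputs for illustration only.
predicted_ratings = {"Resort A": 4.8, "Resort B": 4.6, "Resort C": 4.5,
                     "Resort D": 3.9, "Resort E": 3.1}
content_scores = {"Resort A": 0.40, "Resort B": 0.95, "Resort C": 0.70,
                  "Resort D": 0.99, "Resort E": 0.10}
in_budget = {"Resort A", "Resort B", "Resort D", "Resort E"}

recommendations = cascade_recommend(
    predicted_ratings,
    content_score=content_scores.get,
    passes_filters=lambda resort: resort in in_budget,
)
print(recommendations)
```

Note that "Resort D" never reaches the second stage even though it has the highest content score: in a cascade, the secondary model can only refine what the primary model has already shortlisted.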

### Surprise Data
To build this model, we will be using the Python package Surprise, a scikit for building and analyzing recommender systems. It provides a range of collaborative filtering algorithms, including matrix factorization-based and neighborhood-based methods.

To begin, we will make a new dataframe from our final cleaned dataframe with three columns: **user name, ski resort, and rating.**

In [77]:
final_user_df.drop(columns="Unnamed: 0", inplace=True)

In [78]:
final_user_df['rating'].value_counts()

5    979
4    888
3    390
2    175
1     89
Name: rating, dtype: int64

In [89]:
#copying final review dataframe
surprise_df = final_user_df.copy()

#dropping unneeded columns
surprise_df = surprise_df[['user_name', 'ski_resort', 'rating']]

#saving final surprise_df
surprise_df.to_csv("data/cleaned_data_exports/surprise_df.csv")

In [80]:
surprise_df

Unnamed: 0,user_name,ski_resort,rating
0,anon_1,Winter Park,4
1,anon_1,Arapahoe Basin,5
2,anon_1,Steamboat,5
3,anon_1,Copper Mountain,5
4,anon_2,Solitude Mountain,5
...,...,...,...
2516,John undefined,Bridger Bowl,5
2517,Daniel undefined,Blacktail Mountain,5
2518,Daniel undefined,Blacktail Mountain,5
2519,Elizabeth,Whitefish Mountain,1


In [81]:
from surprise import Reader, Dataset

reader = Reader(rating_scale=(1, 5))

#loading final dataset
data = Dataset.load_from_df(surprise_df[['user_name', 'ski_resort', 'rating']], reader)

#splitting into train and test
trainset, testset = train_test_split(data, test_size=0.2, random_state=42)

In [82]:
#looking at number of users
print('Number of users: ', trainset.n_users, '\n')
print('Number of items: ', trainset.n_items)

Number of users:  532 

Number of items:  264


### Normal Predictor Model

The Normal Predictor baseline model's Root Mean Squared Error (RMSE) tells us that our predicted ratings are typically about **1.40** points away from the actual ratings.
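To see why this baseline is weak, it helps to recall that Surprise's `NormalPredictor` ignores who the user and resort are and simply draws a random rating from a normal distribution fitted to the training ratings. The sketch below imitates that behavior in plain Python. It is an approximation for illustration: it uses the rating counts printed by `value_counts()` for the full dataset as a stand-in for the training split, and the seed and helper name are arbitrary.

```python
import random
import statistics

# Rating counts copied from the value_counts() output for the full dataset,
# used here as a stand-in for the training split.
rating_counts = {5: 979, 4: 888, 3: 390, 2: 175, 1: 89}
ratings = [r for r, count in rating_counts.items() for _ in range(count)]

# Maximum-likelihood estimates of the mean and standard deviation.
mu = statistics.fmean(ratings)
sigma = statistics.pstdev(ratings)

rng = random.Random(42)  # arbitrary seed so the sketch is repeatable

def random_rating(low=1, high=5):
    """Draw one rating from N(mu, sigma^2), clipped to the rating scale."""
    return min(high, max(low, rng.gauss(mu, sigma)))

sample = [random_rating() for _ in range(5)]
print(round(mu, 2), round(sigma, 2), [round(s, 2) for s in sample])
```

Since every prediction is an independent random draw, any model that actually learns from user and resort patterns should beat this score.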

In [15]:
# Instantiate the model
baseline = NormalPredictor()

#fitting model
baseline.fit(trainset)

# making prediction on testset
predictions = baseline.test(testset)

# Save RMSE score
baseline_normal = accuracy.rmse(predictions)

RMSE: 1.3994


In [16]:
#saving normal rmse
test_baseline_normal_rmse = 1.399

In [192]:
data_rmse = [['normal predictor', 1.399, 'n/a']]

model_df = pd.DataFrame(data_rmse, columns=['model', 'rmse', 'params'])

In [193]:
model_df

Unnamed: 0,model,rmse,params
0,normal predictor,1.399,


#### Defining a function to save the RMSE for all model iterations.

In [46]:
def model_comp(model_name, rmse, params):
    model_df.loc[len(model_df.index)] = [model_name, rmse, params] 


### Algorithm Selection

To start off, I'll be testing all algorithms available in Surprise to determine which models I should tune further.

This code was adapted from this [blog](https://towardsdatascience.com/building-and-testing-recommender-systems-with-surprise-step-by-step-d4ba702ef80b0)

In [20]:
benchmark = []

# Iterate over all algorithms to see which performs best
for algorithm in [SVD(), SVDpp(), SlopeOne(), NMF(), NormalPredictor(), KNNBaseline(), KNNBasic(), KNNWithMeans(), KNNWithZScore(), BaselineOnly(), CoClustering()]:
    
    # Perform cross validation
    results = cross_validate(algorithm, data, measures=['RMSE'], cv=3, verbose=False)
    
    # Get results & append algorithm name
    tmp = pd.DataFrame.from_dict(results).mean(axis=0)
    # Series.append was removed in pandas 2.0, so use pd.concat instead
    tmp = pd.concat([tmp, pd.Series([type(algorithm).__name__], index=['Algorithm'])])
    benchmark.append(tmp)
    
benchmark_df = pd.DataFrame(benchmark).set_index('Algorithm').sort_values('test_rmse') 

Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...


In [28]:
benchmark_df

Algorithm,test_rmse,fit_time,test_time
SVD,0.962886,0.086641,0.003666
SVDpp,0.977763,0.186876,0.009264
KNNBaseline,0.994478,0.006389,0.012054
BaselineOnly,0.997223,0.001697,0.001988
KNNBasic,1.019341,0.004235,0.00915
CoClustering,1.155632,0.06012,0.002458
KNNWithZScore,1.164038,0.015831,0.010955
KNNWithMeans,1.178665,0.007177,0.010117
SlopeOne,1.188278,0.004408,0.00393
NMF,1.188651,0.110433,0.005485


From the initial test, SVD and SVDpp had the lowest RMSE scores. I will start adjusting hyperparameters to see which model will perform the best.

### Model #1 - SVD

We will be using Singular Value Decomposition (SVD) to reduce the dimensionality of our user-resort rating matrix. SVD is a matrix factorization model that represents each user and each resort as a vector of latent factors, so that a predicted rating is derived from the product of the two vectors. This helps us understand the relationship between users and items, or in this case users and ski resorts.
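For reference, the prediction rule used by Surprise's `SVD` when `biased=True` can be written as below, where $\mu$ is the global mean rating, $b_u$ and $b_i$ are the user and resort biases, and $p_u$ and $q_i$ are the user and resort latent-factor vectors of length `n_factors`. The parameters are learned by minimizing the regularized squared error over the known ratings, with $\lambda$ as the regularization strength.

```latex
\hat{r}_{ui} = \mu + b_u + b_i + q_i^{T} p_u

\min_{b,\,p,\,q} \sum_{r_{ui} \in R_{\text{train}}}
  \left( r_{ui} - \hat{r}_{ui} \right)^2
  + \lambda \left( b_i^2 + b_u^2 + \lVert q_i \rVert^2 + \lVert p_u \rVert^2 \right)
```

With `biased=False`, the $\mu$, $b_u$, and $b_i$ terms are dropped and the prediction reduces to $q_i^{T} p_u$.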

In [21]:
# Cross validate a basic SVD with no hyperparameter tuning expecting sub-par results
svd_basic = SVD(random_state=42)

results = cross_validate(svd_basic, data, measures=['RMSE'], cv=3, n_jobs = -1, verbose=True)

Evaluating RMSE of algorithm SVD on 3 split(s).

                  Fold 1  Fold 2  Fold 3  Mean    Std     
RMSE (testset)    0.9697  0.9805  0.9630  0.9710  0.0072  
Fit time          0.10    0.08    0.08    0.09    0.01    
Test time         0.00    0.00    0.00    0.00    0.00    


In [43]:
# Fit to trainset and predict on the testset for evaluation
svd_basic.fit(trainset)

predictions = svd_basic.test(testset)

svd_simple = accuracy.rmse(predictions)

RMSE: 0.9480


In [194]:
model_comp('svd', .948, 'n/a')

### Model #2 - SVD Grid Search

To begin, I will be adjusting the following parameters:

 - n_factors - Increasing the number of factors can improve the model's ability to capture user and resort interactions more accurately
 - n_epochs - Changes the number of iterations the model performs on the training data
 - init_mean - Changes the mean of the distribution used for factor initialization
 - biased - Controls whether the model includes user and item bias terms. Since user bias is a common issue with ratings, enabling it lets the model account for users who rate consistently high or low
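To get a feel for the cost of a grid search, the sketch below enumerates every combination in the first parameter grid and counts the cross-validation fits it implies. It only uses `itertools` to mimic the enumeration that a grid search performs; the fold count of 5 matches the `cv=5` setting used in the SVD searches.

```python
from itertools import product

# Parameter grid from the first SVD grid search.
params = {'n_factors': [100, 120, 140],
          'n_epochs': [20, 40, 60, 80, 100, 120],
          'init_mean': [0, .01, 0.05],
          'biased': [True, False]}

# Every combination of one value per parameter.
names = list(params)
combinations = [dict(zip(names, values)) for values in product(*params.values())]

n_folds = 5
n_fits = len(combinations) * n_folds

print(len(combinations), n_fits)  # 3 * 6 * 3 * 2 = 108 combinations, 540 fits
```

Each extra value added to any one parameter multiplies the total, which is why the later searches narrow the ranges rather than widening them.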

In [41]:
#test grid search
params = {'n_factors': [100, 120, 140],
          'n_epochs': [20, 40, 60, 80, 100, 120],
          'init_mean': [0,.01, 0.05],
         'biased': [True, False]}

g_s_svd = GridSearchCV(SVD, param_grid=params, cv=5, refit=True)

g_s_svd.fit(data)
g_s_svd.best_params['rmse']

{'n_factors': 140, 'n_epochs': 40, 'init_mean': 0, 'biased': True}

In [42]:
# instantiating SVD with best hyperparameters from gridsearch
g_s_svd = SVD(n_factors=140 ,n_epochs=40, init_mean=0, biased=True)

# fit on trainset and make predictions using testset
g_s_svd.fit(trainset)
predictions = g_s_svd.test(testset)
g_s_svd_1 = accuracy.rmse(predictions)

RMSE: 0.9105


In [195]:
model_comp('svd_grid_1', .9105, {'n_factors': 140, 'n_epochs': 40, 'init_mean': 0, 'biased': True})

### Model #3 - SVD Grid Search #2

I will continue to adjust the hyperparameters to see if we can bring down the RMSE.

In [98]:
#test grid search
params = {'n_factors': [20, 30, 40, 50, 60, 70, 80],
          'n_epochs': [70, 80, 90, 100, 120, 130],
          'init_mean': [-0.5, 0, 0.5],
         'biased': [True, False]}

g_s_svd_2 = GridSearchCV(SVD, param_grid=params, cv=5, refit=True)

g_s_svd_2.fit(data)
g_s_svd_2.best_params['rmse']

{'n_factors': 80, 'n_epochs': 90, 'init_mean': 0, 'biased': True}

In [104]:
# instantiating SVD with best hyperparameters from gridsearch
g_s_svd_2 = SVD(n_factors=80 ,n_epochs=90, init_mean=0, biased=True)

# fit on trainset and make predictions using testset
g_s_svd_2.fit(trainset)
predictions_2 = g_s_svd_2.test(testset)
g_s_svd_2_rmse = accuracy.rmse(predictions_2)

RMSE: 0.9171


In [196]:
model_comp('svd_grid_2', .917, {'n_factors': 80, 'n_epochs': 90, 'init_mean': 0, 'biased': True})

### SVD Grid Search #3

I am slightly adjusting the number of n_factors and n_epochs to see if this will bring down RMSE.

In [52]:
#test grid search
params = {'n_factors': [110, 120, 130, 140, 160],
          'n_epochs': [10, 20, 30, 40, 50, 60, 70, 80],
          'init_mean': [-0.5, 0, 0.5],
         'biased': [True, False]}

g_s_svd_3 = GridSearchCV(SVD, param_grid=params, cv=5, refit=True)

g_s_svd_3.fit(data)
g_s_svd_3.best_params['rmse']

{'n_factors': 120, 'n_epochs': 60, 'init_mean': 0, 'biased': True}

In [53]:
# instantiating SVD with best hyperparameters from gridsearch
g_s_svd_3 = SVD(n_factors=120 ,n_epochs=60, init_mean=0, biased=True)

# fit on trainset and make predictions using testset
g_s_svd_3.fit(trainset)
predictions_3 = g_s_svd_3.test(testset)
g_s_svd_3_rmse = accuracy.rmse(predictions_3)

RMSE: 0.9143


In [197]:
model_comp('g_s_svd_3', .914, {'n_factors': 120, 'n_epochs': 60, 'init_mean': 0, 'biased': True})

### SVD Grid Search #4


In [106]:
#test grid search
params = {'n_factors': [90, 100, 105, 110, 115, 120],
          'n_epochs': [5, 10, 20, 30, 40, 50, 60, 70],
          'init_mean': [-0.5, 0, 0.5],
         'biased': [True, False]}

g_s_svd_4 = GridSearchCV(SVD, param_grid=params, cv=5, refit=True)

g_s_svd_4.fit(data)
g_s_svd_4.best_params['rmse']

{'n_factors': 90, 'n_epochs': 50, 'init_mean': 0, 'biased': True}

In [113]:
# instantiating SVD with best hyperparameters from gridsearch
g_s_svd_4 = SVD(n_factors=90 ,n_epochs=50, init_mean=0, biased=True)

# fit on trainset and make predictions using testset
g_s_svd_4.fit(trainset)
predictions_4 = g_s_svd_4.test(testset)
g_s_svd_4_rmse = accuracy.rmse(predictions_4)

RMSE: 0.9103


In [198]:
model_comp('g_s_svd_4', .910, {'n_factors': 90, 'n_epochs': 50, 'init_mean': 0, 'biased': True})

### SVD Grid Search #5

In [116]:
#test grid search
params = {'n_factors': [40, 50, 60, 70, 80, 90],
          'n_epochs': [30, 40, 50, 60, 70, 80, 90, 100],
          'init_mean': [-0.5, 0, 0.5],
         'biased': [True, False]}

g_s_svd_5 = GridSearchCV(SVD, param_grid=params, cv=5, refit=True)

g_s_svd_5.fit(data)
g_s_svd_5.best_params['rmse']

{'n_factors': 90, 'n_epochs': 70, 'init_mean': 0, 'biased': True}

In [135]:
# instantiating SVD with best hyperparameters from gridsearch
g_s_svd_5 = SVD(n_factors=90 ,n_epochs=70, init_mean=0, biased=True)

# fit on trainset and make predictions using testset
g_s_svd_5.fit(trainset)
predictions_5 = g_s_svd_5.test(testset)
g_s_svd_5_rmse = accuracy.rmse(predictions_5)

RMSE: 0.9016


In [199]:
model_comp('g_s_svd_5', .901, {'n_factors': 90, 'n_epochs': 70, 'init_mean': 0, 'biased': True})

### SVD Grid Search #6

In [145]:
#test grid search
params = {'n_factors': [115, 120, 125, 130, 135, 140],
          'n_epochs': [20, 30, 40, 50, 60, 70],
          'init_mean': [-0.5, 0, 0.5],
         'biased': [True, False]}

g_s_svd_6 = GridSearchCV(SVD, param_grid=params, cv=5, refit=True)

g_s_svd_6.fit(data)
g_s_svd_6.best_params['rmse']

{'n_factors': 130, 'n_epochs': 50, 'init_mean': 0, 'biased': True}

In [159]:
# instantiating SVD with best hyperparameters from gridsearch
g_s_svd_6 = SVD(n_factors=130 ,n_epochs=50, init_mean=0, biased=True, random_state=42)

# fit on trainset and make predictions using testset
g_s_svd_6.fit(trainset)
predictions_6 = g_s_svd_6.test(testset)
g_s_svd_6_rmse = accuracy.rmse(predictions_6)

RMSE: 0.9149


In [200]:
model_comp('g_s_svd_6', .914, {'n_factors': 130, 'n_epochs': 50, 'init_mean': 0, 'biased': True})

### SVD Grid Search #7

In [153]:
#test grid search
params = {'n_factors': [115, 120, 125, 127, 130, 135, 140, 145],
          'n_epochs': [10, 20, 25, 30, 35, 40, 45, 50, 60, 70],
          'init_mean': [-0.5, 0, 0.5],
         'biased': [True, False]}

g_s_svd_7 = GridSearchCV(SVD, param_grid=params, cv=5, refit=True)

g_s_svd_7.fit(data)
g_s_svd_7.best_params['rmse']

{'n_factors': 125, 'n_epochs': 60, 'init_mean': 0, 'biased': True}

In [157]:
# instantiating SVD with best hyperparameters from gridsearch
g_s_svd_7 = SVD(n_factors=125 ,n_epochs=60, init_mean=0, biased=True, random_state=42)

# fit on trainset and make predictions using testset
g_s_svd_7.fit(trainset)
predictions_7 = g_s_svd_7.test(testset)
g_s_svd_7_rmse = accuracy.rmse(predictions_7)

RMSE: 0.9041


In [201]:
model_comp('g_s_svd_7', .904, {'n_factors': 125, 'n_epochs': 60, 'init_mean': 0, 'biased': True})

### Model #6 - SVD++ Grid Search #1

SVD++ extends SVD by incorporating implicit feedback, in this case the fact that a user rated a resort at all, regardless of the rating value. This can capture additional user preferences and improve the accuracy of the model.

SVD++ had the second-lowest RMSE in the initial benchmark, just behind SVD, so I expect it to be competitive with the tuned SVD models.

 - n_factors - Increasing the number of factors can improve the model's ability to capture user and resort interactions more accurately
 - n_epochs - Changes the number of iterations the model performs on the training data
 - init_mean - Changes the mean of the distribution used for factor initialization
 - reg_all - This is a regularization term that's applied to all parameters and helps prevent overfitting

To begin I will start with a lower n_factors and n_epochs. Larger n_factors numbers can lead to overfitting.
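For reference, the SVD++ prediction rule as implemented in Surprise adds a second set of item factors $y_j$ to the SVD formula. Here $I_u$ is the set of resorts that user $u$ has rated, so the extra term shifts the user's factor vector based on which resorts they reviewed, independent of the scores given.

```latex
\hat{r}_{ui} = \mu + b_u + b_i
  + q_i^{T} \left( p_u + |I_u|^{-\frac{1}{2}} \sum_{j \in I_u} y_j \right)
```

The additional $y_j$ factors give the model more parameters than plain SVD for the same `n_factors`, which is one reason to start with smaller values.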

In [185]:
# New hyperparameter dictionary for SVD++ model
svd_pp_param_grid = {'n_factors':[5, 10, 20, 30, 40, 50],
                  'n_epochs': [20, 30, 40, 50, 60],
                    'init_mean':[0, .01, .02, .03],
                    'reg_all':[.01, .02, .03]}
svd_pp_model = GridSearchCV(SVDpp, param_grid=svd_pp_param_grid, cv=3, joblib_verbose=10, return_train_measures=True)

# Fit and return the best hyperparameters
svd_pp_model.fit(data)
print(svd_pp_model.best_params['rmse'])

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.2s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.4s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:    0.6s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:    0.7s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    0.9s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed:    1.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   7 out of   7 | elapsed:    1.3s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   8 out of   8 | elapsed:    1.4s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   9 out of   9 | elapsed:    1.6s remaining:    0.0s


{'n_factors': 50, 'n_epochs': 30, 'init_mean': 0, 'reg_all': 0.03}


[Parallel(n_jobs=1)]: Done 1080 out of 1080 | elapsed:  7.8min finished


In [190]:
# instantiating SVD++ with best hyperparameters from gridsearch
svd_pp_model = SVDpp(n_factors=50, n_epochs=30, init_mean=.0, reg_all=.03)

# Fit on trainset and make predictions using testset to return RMSE metric
svd_pp_model.fit(trainset)
predictions = svd_pp_model.test(testset)
svd_pp_model_1 = accuracy.rmse(predictions)

RMSE: 0.9116


In [202]:
model_comp('svd_pp_1', .9116, {'n_factors': 50, 'n_epochs': 30, 'init_mean': 0.0, 'reg_all': 0.03} )

### Model #7 - SVD++ Grid Search #2

The first SVD++ model had an RMSE of 0.9116, in the same range as the tuned SVD models. I am going to increase the n_factors to increase the complexity of the model.

In [203]:
# New hyperparameter dictionary for SVD++ model
svd_pp_param_grid = {'n_factors':[50, 70, 90, 120, 140, 150],
                  'n_epochs': [20, 40, 60, 80, 100],
                    'init_mean':[0, .01, .02, .03],
                     'reg_all':[.01, .02, .03]}
svd_pp_model_2 = GridSearchCV(SVDpp, param_grid=svd_pp_param_grid, cv=3, joblib_verbose=10, return_train_measures=True)

# Fit and return the best hyperparameters
svd_pp_model_2.fit(data)
svd_pp_model_2.best_params['rmse']

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.3s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.6s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:    0.9s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:    1.3s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    1.6s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed:    1.9s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   7 out of   7 | elapsed:    2.2s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   8 out of   8 | elapsed:    2.5s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   9 out of   9 | elapsed:    2.8s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 1080 out of 1080 | elapsed: 24.3min finished


{'n_factors': 90, 'n_epochs': 20, 'init_mean': 0, 'reg_all': 0.03}

In [204]:
# instantiating SVD++ with best hyperparameters from gridsearch
svd_pp_model_2 = SVDpp(n_factors=90, n_epochs=20, init_mean=0, reg_all=.03)

# Fit on trainset and make predictions using testset to return RMSE metric
svd_pp_model_2.fit(trainset)
predictions = svd_pp_model_2.test(testset)
svd_pp_model_2_rmse = accuracy.rmse(predictions)

RMSE: 0.9176


In [205]:
model_comp('svd_pp_2', .9176, {'n_factors': 90, 'n_epochs': 20, 'init_mean': 0, 'reg_all': 0.03})

### Model #8 - SVD++ Grid Search #3

The previous model performed slightly worse than the first SVD++ model. I am now going to adjust the n_factors and n_epochs by making the steps between the values smaller in an attempt to capture the best hyperparameters.

In [70]:
# New hyperparameter dictionary for SVD++ model
svd_pp_param_grid = {'n_factors':[90, 100, 110, 120, 130],
                  'n_epochs': [5, 10, 20, 30, 40, 50, 60, 70],
                    'init_mean':[0, .01, .02, .03],
                    'reg_all':[.01, .02, .03]}
svd_pp_model_3 = GridSearchCV(SVDpp, param_grid=svd_pp_param_grid, cv=3, joblib_verbose=10, return_train_measures=True)

# Fit and return the best hyperparameters
svd_pp_model_3.fit(data)
svd_pp_model_3.best_params['rmse']

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.2s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.3s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:    0.4s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:    0.6s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    0.7s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed:    0.8s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   7 out of   7 | elapsed:    1.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   8 out of   8 | elapsed:    1.2s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   9 out of   9 | elapsed:    1.3s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 1440 out of 1440 | elapsed: 20.4min finished


{'n_factors': 90, 'n_epochs': 30, 'init_mean': 0.01, 'reg_all': 0.03}

In [72]:
# instantiating SVD++ with best hyperparameters from gridsearch
svd_pp_model_3 = SVDpp(n_factors=90, n_epochs=30, init_mean=0.01, reg_all=.03)

# Fit on trainset and make predictions using testset to return RMSE metric
svd_pp_model_3.fit(trainset)
predictions = svd_pp_model_3.test(testset)
svd_pp_model_3_rmse = accuracy.rmse(predictions)

RMSE: 0.9148


In [73]:
model_comp('svd_pp_3', .9148, {'n_factors': 90, 'n_epochs': 30, 'init_mean': 0.01, 'reg_all': 0.03})

### Model #9 - SVD++ Grid Search #4

The last model performed slightly better than the second SVD++ model but still worse than the first. I'm going to adjust the n_factors and n_epochs again by further adjusting the distance between the parameter options.

In [83]:
# New hyperparameter dictionary for SVD++ model
svd_pp_param_grid = {'n_factors':[120, 125, 130, 140],
                  'n_epochs': [5, 10, 15, 20, 30, 35],
                    'init_mean':[0, .01, .02, .03],
                    'reg_all':[.01, .02, .03]}
svd_pp_model_4 = GridSearchCV(SVDpp, param_grid=svd_pp_param_grid, cv=3, joblib_verbose=10, return_train_measures=True)

# Fit and return the best hyperparameters
svd_pp_model_4.fit(data)
print(svd_pp_model_4.best_params['rmse'])

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.2s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.3s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:    0.5s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:    0.6s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    0.8s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed:    0.9s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   7 out of   7 | elapsed:    1.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   8 out of   8 | elapsed:    1.3s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   9 out of   9 | elapsed:    1.4s remaining:    0.0s


{'n_factors': 125, 'n_epochs': 20, 'init_mean': 0, 'reg_all': 0.01}


[Parallel(n_jobs=1)]: Done 864 out of 864 | elapsed:  7.6min finished


In [86]:
# instantiating SVD++ with best hyperparameters from gridsearch
svd_pp_model_4 = SVDpp(n_factors=125, n_epochs=20, init_mean=.0, reg_all=.01)

# Fit on trainset and make predictions using testset to return RMSE metric
svd_pp_model_4.fit(trainset)
predictions = svd_pp_model_4.test(testset)
svd_pp_model_4_rmse = accuracy.rmse(predictions)

RMSE: 0.9299


In [87]:
model_comp('svd_pp_4', 0.929, {'n_factors': 125, 'n_epochs': 20, 'init_mean': 0.0, 'reg_all': 0.01})

### Model #10 - SVD++ Grid Search #5

In [59]:
# New hyperparameter dictionary for SVD++ model
svd_pp_param_grid = {'n_factors':[110, 120, 125, 130, 135, 140, 145, 150, 160],
                  'n_epochs': [20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120],
                    'init_mean':[0, .01],
                    'reg_all':[.01, .02, .03]}
svd_pp_model_5 = GridSearchCV(SVDpp, param_grid=svd_pp_param_grid, cv=3, joblib_verbose=10, return_train_measures=True)

# Fit and return the best hyperparameters
svd_pp_model_5.fit(data)
svd_pp_model_5.best_params['rmse']

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.5s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    1.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:    1.5s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:    2.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    2.4s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed:    2.9s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   7 out of   7 | elapsed:    3.4s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   8 out of   8 | elapsed:    3.9s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   9 out of   9 | elapsed:    4.4s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 1782 out of 1782 | elapsed: 56.2min finished


{'n_factors': 125, 'n_epochs': 50, 'init_mean': 0.01, 'reg_all': 0.03}

In [88]:
# New hyperparameter dictionary for SVD++ model
svd_pp_param_grid = {'n_factors':[10, 20, 30, 40, 50, 60, 70],
                  'n_epochs': [10, 20, 30, 40, 50, 60, 70],
                    'init_mean':[0, .01],
                    'reg_all':[.01, .02, .03]}
svd_pp_model_5 = GridSearchCV(SVDpp, param_grid=svd_pp_param_grid, cv=3, joblib_verbose=10, return_train_measures=True)

# Fit and return the best hyperparameters
svd_pp_model_5.fit(data)
svd_pp_model_5.best_params['rmse']

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.2s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:    0.3s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:    0.5s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    0.6s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed:    0.7s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   7 out of   7 | elapsed:    0.8s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   8 out of   8 | elapsed:    0.9s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   9 out of   9 | elapsed:    1.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 882 out of 882 | elapsed:  7.6min finished


{'n_factors': 70, 'n_epochs': 40, 'init_mean': 0.01, 'reg_all': 0.02}

In [95]:
# instantiating SVD++ with best hyperparameters from gridsearch
svd_pp_model_5 = SVDpp(n_factors=70, n_epochs=40, init_mean=.01, reg_all=.02)

# Fit on trainset and make predictions using testset to return RMSE metric
svd_pp_model_5.fit(trainset)
predictions = svd_pp_model_5.test(testset)
svd_pp_model_5_rmse = accuracy.rmse(predictions)

RMSE: 0.9165


In [96]:
model_comp('svd_pp_5', .916, {'n_factors': 70, 'n_epochs': 40, 'init_mean': 0.01, 'reg_all': 0.02} )

# Results

The model I selected was SVD Grid Search #7, which gave us an RMSE of 0.904.

- n_factors = 125
- n_epochs = 60
- biased = True

The RMSE scores for the top-performing models were all very close, so I trained multiple models and tested the recommendation function to see which performed best. After testing the function with three users, two of whom filled out the Google survey, I decided to use the hyperparameters associated with SVD Grid Search #7.

The other models more often returned resorts that were not aligned with the users' top-rated resorts in terms of mountain characteristics and features. There were some instances where some of the recommendations did seem off; however, the content-based model should compensate for this.

In [163]:
model_df.sort_values(by="rmse")

Unnamed: 0,model,rmse,params
12,g_s_svd_5,0.901,"{'n_factors': 90, 'n_epochs': 70, 'init_mean':..."
13,g_s_svd_6,0.901,"{'n_factors': 130, 'n_epochs': 50, 'init_mean'..."
5,svd_pp_1,0.903,"{'n_factors': 50, 'n_epochs': 50, 'init_mean':..."
15,g_s_svd_7,0.904,"{'n_factors': 125, 'n_epochs': 60, 'init_mean'..."
11,g_s_svd_4,0.91,"{'n_factors': 90, 'n_epochs': 50, 'init_mean':..."
2,svd_grid_1,0.9105,"{'n_factors': 140, 'n_epochs': 40, 'init_mean'..."
3,svd_grid_2,0.914,"{'n_factors': 140, 'n_epochs': 40, 'init_mean'..."
4,g_s_svd_3,0.914,"{'n_factors': 120, 'n_epochs': 60, 'init_mean'..."
14,g_s_svd_6,0.914,"{'n_factors': 130, 'n_epochs': 50, 'init_mean'..."
7,svd_pp_3,0.9148,"{'n_factors': 90, 'n_epochs': 30, 'init_mean':..."


#### Best Model

In [172]:
# instantiating SVD
best_model = SVD(n_factors=125 ,n_epochs=60, init_mean=0, biased=True, random_state=42)

# Fit on trainset and make predictions using testset to return RMSE metric
best_model.fit(trainset)
predictions = best_model.test(testset)
best_model_rmse = accuracy.rmse(predictions)

RMSE: 0.9041


### Training on Full Dataset

Before creating the collaborative model, we will need to train it on the full dataset. This is important because it ensures that the model learns from the complete set of user resort reviews, rather than only the training split, which should help it make more accurate predictions.

In [187]:
#setting scale
reader = Reader(rating_scale=(1, 5))

#loading final dataset
data_full = Dataset.load_from_df(surprise_df[['user_name', 'ski_resort', 'rating']], reader)

#making trainset
full_trainset = data_full.build_full_trainset()

algo = SVD(n_factors=90, n_epochs=70, init_mean=.0, reg_all= 0.0, biased=True)
algo.fit(full_trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7fc560ebbfd0>

## Collaborative System - Building Function

To implement the model, we will need to create a function that takes in user inputs and outputs predictions. To start, I will make a dataframe of only the user names, ratings, and ski resorts.

In [181]:
surprise_df.head()

Unnamed: 0,user_name,ski_resort,rating
0,anon_1,Winter Park,4
1,anon_1,Arapahoe Basin,5
2,anon_1,Steamboat,5
3,anon_1,Copper Mountain,5
4,anon_2,Solitude Mountain,5


In [180]:
#saving new dataframe with only user information
user_df = surprise_df.reset_index()
user_df.set_index('user_name', inplace = True)
user_df.drop(columns = ['rating', 'index'], inplace =True)
user_df.head()

Unnamed: 0_level_0,ski_resort
user_name,Unnamed: 1_level_1
anon_1,Winter Park
anon_1,Arapahoe Basin
anon_1,Steamboat
anon_1,Copper Mountain
anon_2,Solitude Mountain


### Saving Trained Model

For the hybrid model and Streamlit app, I will need to save the trained model. To do this, I adapted code found on [Google Colab](https://colab.research.google.com/github/singhsidhukuldeep/Recommendation-System/blob/master/Building_Recommender_System_with_Surprise.ipynb#scrollTo=lM7Db2cj7-IZ).

In [18]:
#saving trained model
from surprise import dump
import os

model_filename = "./model.pickle"

print (">> Starting dump")

# Dump the trained algorithm to disk so it can be reloaded later.
file_name = os.path.expanduser(model_filename)
dump.dump(file_name, algo=algo)

print (">> Dump done")
print(model_filename)

>> Starting dump
>> Dump done
./model.pickle
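
The saved file can later be read back with `surprise.dump.load`, which returns a `(predictions, algo)` tuple. The sketch below is one way to wrap that call; `load_saved_algo` is a hypothetical helper name, and the default path simply mirrors the `model_filename` used above.

```python
import os

def load_saved_algo(path="./model.pickle"):
    """Reload an algorithm that was saved with surprise.dump.dump."""
    file_name = os.path.expanduser(path)
    if not os.path.exists(file_name):
        raise FileNotFoundError(f"No saved model at {file_name}")
    # Imported here so the path check above runs before surprise is needed
    from surprise import dump
    # dump.load returns (predictions, algo); predictions is None here
    # because only the algorithm was dumped
    _, loaded_algo = dump.load(file_name)
    return loaded_algo

# Example use once the file exists:
# loaded_algo = load_saved_algo()
# loaded_algo.predict("anon_1", "Vail").est
```

Because the file is a pickle, it should only be loaded from a trusted source.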


### Function #1  - User & Recommendation #  Inputs

Below is a function that takes user inputs and returns predicted ratings. It uses the final trained SVD model (`algo`), and the results are sorted from highest to lowest predicted rating.

In [178]:
def shred_recommender():
    #user input
    user = str(input('Name: '))
    #number of recommendations
    n_recs = int(input('How many resort recommendations do you want? '))
    
    #making a list of the resorts rated by each user
    have_rated = list(user_df.loc[user, 'ski_resort'])
    #dropping rated resorts for final recommendation
    not_rated = content_df.copy()
    not_rated = not_rated.loc[~not_rated['ski_resort'].isin(have_rated)]
    not_rated = not_rated.drop_duplicates(subset=['ski_resort'])
    not_rated.reset_index(inplace=True)
    
    #running the model and saving the predicted ratings as a new column
    not_rated['predicted_rating'] = not_rated['ski_resort'].apply(lambda x: algo.predict(user, x).est)
    not_rated.sort_values(by='predicted_rating', ascending=False, inplace=True)
    not_rated = not_rated[['ski_resort', 'state', 'city', 'summit', 'drop', 'base', 'intermediate_runs',
                          'advanced_runs', 'expert_runs', 'predicted_rating', 'ikon', 'epic',
                          'mountain_collective']].copy()
    return not_rated.head(n_recs)
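
One edge case may be worth guarding against in the lookup of rated resorts. `user_df.loc[user, 'ski_resort']` returns a Series when the user has several ratings, but as far as I can tell it returns a plain string when the user has exactly one, so `list(...)` would split the resort name into characters. An unknown name raises a `KeyError`. Below is a minimal sketch of a more defensive lookup; `rated_resorts` and the toy dataframe are hypothetical and only mimic the shape of `user_df`.

```python
import pandas as pd

def rated_resorts(user_df, user):
    """Return the resorts a user has rated as a list (empty if the user is unknown)."""
    # A boolean mask always yields a Series, whether the user has 0, 1, or many rows
    return user_df.loc[user_df.index == user, "ski_resort"].tolist()

# Toy frame shaped like user_df above (user_name as the index)
toy_user_df = pd.DataFrame(
    {"ski_resort": ["Winter Park", "Steamboat", "Solitude Mountain"]},
    index=pd.Index(["anon_1", "anon_1", "anon_2"], name="user_name"),
)

print(rated_resorts(toy_user_df, "anon_1"))    # ['Winter Park', 'Steamboat']
print(rated_resorts(toy_user_df, "anon_2"))    # ['Solitude Mountain']
print(rated_resorts(toy_user_df, "new_user"))  # []
```

Returning an empty list for a new user lets the rest of the function run unchanged, since nothing is excluded from the candidate resorts.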

### Function Testing

I will test the model results with two users who filled out the resort survey. For each of them, I created a user profile based on a set of questions I asked. I removed their last names from the survey.

**Alexandria K.**
- Dislikes "bougie" resorts
- Travels to shred
- Looks for expert runs and accessible transportation
- Buys the Epic pass but dislikes Vail and corporate ski vibes
- Budget-friendly planning

**Raghava K.**
- Loves expert terrain and well-marked trails
- All about the apres-ski life
- Travels to shred but wants to have fun while doing it
- Uses both Epic & Ikon passes

Below is a function that displays a user's reviews merged with the final content dataframe. This is useful for comparing the recommendations with the mountains the user has already rated.

In [175]:
#making a function to review a user's original ratings alongside resort features
def user_comp(name):

    #saving reviews as a dataframe
    user_1_df = final_user_df.loc[final_user_df["user_name"] == name]

    #making a list of resort names
    user_1_resort_list = user_1_df['ski_resort'].to_list()

    #saving as df to review features and compare with review
    user_1_df = content_df.loc[content_df['ski_resort'].isin(user_1_resort_list)]
    
    #drop returns a new dataframe, so the result has to be assigned
    user_1_df = user_1_df.drop(columns="Unnamed: 0")
    
    review_df = pd.DataFrame(final_user_df.loc[final_user_df["user_name"] == name])
    
    review_df = review_df[["ski_resort", "rating", "review"]]
    
    user_1_df = user_1_df[['ski_resort', 'state', 'city', 'summit', 'drop', 'base', 'intermediate_runs',
                          'advanced_runs', 'expert_runs', 'ikon', 'epic',
                          'mountain_collective']]
    
    return(pd.merge(review_df, user_1_df, on="ski_resort")).drop(columns=['ikon', 'epic', 'mountain_collective'])

### User #1 - Raghava K.

The first user gave high ratings to large mountains with advanced and expert runs and good transportation. Their lowest score went to a mountain that did not have good signage and was difficult to reach by public transport.

Overall, 4/5 recommendations were similar to the mountains rated in terms of trail difficulty, summit, and drop.

In [176]:
user_comp("Raghava K.")

Unnamed: 0,ski_resort,rating,review,state,city,summit,drop,base,intermediate_runs,advanced_runs,expert_runs
0,Breckenridge,4,Breck is a decent resort but a lot of the terr...,Colorado,Breckenridge,12998,3398,9600,23,36,28.0
1,Crested Butte Mountain,5,Really amazing and extensive terrain for advan...,Colorado,Mt. Crested Butte Mountain,12162,3062,9375,25,25,36.0
2,Vail,5,I enjoyed Vail a lot more this time than my la...,Colorado,Vail,11570,3450,8120,35,40,2.0
3,Beaver Creek,3,re are some good runs here and part of my qual...,Colorado,Vail,11440,3340,8100,30,24,8.0
4,Telluride,5,Telluride is amazing town has very good food ...,Colorado,Telluride,13150,4425,8725,30,21,34.0


In [188]:
shred_recommender()

Name: Raghava K.
How many resort recommendations do you want? 3


Unnamed: 0,ski_resort,state,city,summit,drop,base,intermediate_runs,advanced_runs,expert_runs,predicted_rating,ikon,epic,mountain_collective
20,Snowbasin,Utah,Huntsville,9350,2900,6450,33,52,6.0,4.786659,1,0,1
191,Big Powderhorn Mountain,Michigan,Bessemer,1800,600,1200,40,31,2.0,4.784367,0,0,0
161,Plattekill Mountain,New York,Roxbury,3500,1100,2400,40,0,20.0,4.717126,0,0,0


### User #2 - Alexandria K.

The second user rated Vail and Park City Mountain low, due to the busyness of the resorts and the overall feel of the mountain. They preferred mountains with a majority of advanced and expert terrain.

Two of the three suggestions from the collaborative model seemed to be aligned with the user's other resorts in terms of mountain terrain and run difficulty.

In [30]:
user_comp("Alexandria K.")

Unnamed: 0,ski_resort,rating,review,state,city,summit,drop,base,intermediate_runs,advanced_runs,expert_runs
0,Stevens Pass,5.0,"Lots of snow, small local mountain.",Washington,Skykomish,5845,1800,4061,43,31,18.0
1,Vail,3.0,"Lots of terrain, but very busy.",Colorado,Vail,11570,3450,8120,35,40,2.0
2,Snowbird,4.0,"Great expert terrain, feels very grand and exc...",Utah,Snowbird,11000,3240,7760,25,43,24.0
3,Park City Mountain,2.0,"Lots of terrain, but usually very busy. town ...",Utah,Park City Mountain,10026,3226,6800,41,28,23.0


In [189]:
shred_recommender()

Name: Alexandria K.
How many resort recommendations do you want? 3


Unnamed: 0,ski_resort,state,city,summit,drop,base,intermediate_runs,advanced_runs,expert_runs,predicted_rating,ikon,epic,mountain_collective
42,Steamboat,Colorado,Steamboat Springs,10568,3668,6900,43,40,5.0,4.529676,1,0,0
162,Plattekill Mountain,New York,Roxbury,3500,1100,2400,40,0,20.0,4.526197,0,0,0
94,Lutsen Mountains,Minnesota,Lutsen Mountains,1688,825,800,58,24,8.0,4.489648,0,0,0


In [98]:
list(user_df.loc['Alexandria K.', 'ski_resort'])

['Stevens Pass', 'Vail', 'Snowbird', 'Park City Mountain']

### Function #2  - User, Recommendation #, and State Inputs

Below I added an additional filter that allows users to restrict recommendations to a single state. I plan to further adjust the filters in the hybrid and Streamlit models.

In [137]:
def shred_recommender_state():
    user = str(input('Name: '))
    n_recs = int(input('How many resort recommendations do you want? '))
    state = str(input('What state would you like to shred in? '))
    
    have_rated = list(user_df.loc[user, 'ski_resort'])
    not_rated = content_df.copy()
    not_rated = not_rated.loc[~not_rated['ski_resort'].isin(have_rated) & (not_rated['state'] == state)]
    not_rated = not_rated.drop_duplicates(subset=['ski_resort'])
    not_rated.reset_index(inplace=True)
    not_rated['predicted_rating'] = not_rated['ski_resort'].apply(lambda x: algo.predict(user, x).est)
    not_rated.sort_values(by='predicted_rating', ascending=False, inplace=True)
    not_rated = not_rated[['ski_resort', 'state', 'city', 'sumt', 'drop', 'base', 'intermediate_runs',
                          'advanced_runs', 'expert_runs', 'predicted_rating', 'ikon', 'epic',
                          'mountain_collective']].copy()
    return not_rated.head(n_recs)

In [138]:
shred_recommender_state()

Name: Stephanie Ciaccia
How many resort recommendations do you want? 3
What state would you like to shred in? Utah


Unnamed: 0,ski_resort,state,city,sumt,drop,base,intermediate_runs,advanced_runs,expert_runs,predicted_rating,ikon,epic,mountain_collective
0,Alta,Utah,Alta,11068,2538,8530,0,0,0,4.457562,1,0,1
8,Snowbasin,Utah,Huntsville,9350,2900,6450,33,52,6,4.407571,1,0,1
4,Deer Valley,Utah,Park City,9570,3000,6570,31,10,32,3.962196,1,0,0
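
Because `input()` prompts do not carry over to a Streamlit app, it may help to separate the prompts from the ranking logic. The sketch below is one possible refactor and not the notebook's implementation: `recommend`, the toy dataframe, and the stand-in scores are all hypothetical. It also matches the state case-insensitively, which is an added assumption about how users might type it.

```python
import pandas as pd

def recommend(content_df, rated, predict_rating, n_recs=3, state=None):
    """Rank unrated resorts by predicted rating, optionally within one state."""
    candidates = content_df.loc[~content_df["ski_resort"].isin(rated)]
    if state is not None:
        # Case-insensitive match so "utah" and "Utah" behave the same
        wanted = state.strip().lower()
        candidates = candidates.loc[candidates["state"].str.lower() == wanted]
    candidates = candidates.drop_duplicates(subset=["ski_resort"]).copy()
    candidates["predicted_rating"] = candidates["ski_resort"].apply(predict_rating)
    return candidates.sort_values("predicted_rating", ascending=False).head(n_recs)

# Toy data and a stand-in predictor (both hypothetical)
toy_content_df = pd.DataFrame({
    "ski_resort": ["Alta", "Snowbasin", "Vail", "Deer Valley"],
    "state": ["Utah", "Utah", "Colorado", "Utah"],
})
fake_scores = {"Alta": 4.5, "Snowbasin": 4.4, "Vail": 4.9, "Deer Valley": 4.0}

top = recommend(toy_content_df, rated=["Alta"], predict_rating=fake_scores.get,
                n_recs=2, state="utah")
print(top["ski_resort"].tolist())  # ['Snowbasin', 'Deer Valley']
```

In the notebook, the predictor would be the trained model, for example `predict_rating=lambda resort: algo.predict(user, resort).est`.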


# Conclusions

The analysis of the outputs and RMSE score indicates that the recommendation system effectively suggests ski resorts that are generally aligned with user ratings. However, considering the high opportunity cost associated with planning a ski trip, relying solely on a user-based system is insufficient to provide users with tailored recommendations that account for factors such as cost and time of year. While identifying similar users is crucial, it is essential to consider additional elements when making recommendations.

# Next Steps

The next step involves implementing a content-based system and integrating its results with the collaborative model. This combined approach extends the filtering capabilities of the collaborative model, allowing for more dynamic filtering based on user preferences. By incorporating both content-based and collaborative filtering, the system should deliver recommendations that align more closely with user preferences, resulting in a more refined and personalized recommendation system.

The **Cascade Hybrid Modeling** notebook will be saved in the main GitHub repository.
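
As a rough illustration of the idea, one common reading of a cascade hybrid is a two-stage pipeline: content features narrow the candidate resorts first, and the collaborative model then ranks whatever is left. The sketch below assumes that reading; `cascade_recommend`, the resort names, and the scores are hypothetical and do not come from the project's data.

```python
def cascade_recommend(resorts, passes_filter, predict_rating, n_recs=3):
    """Stage 1: keep resorts meeting content constraints. Stage 2: rank by predicted rating."""
    shortlisted = [r for r in resorts if passes_filter(r)]
    ranked = sorted(shortlisted, key=lambda r: predict_rating(r["ski_resort"]), reverse=True)
    return ranked[:n_recs]

# Hypothetical resorts and stand-in predicted ratings
resorts = [
    {"ski_resort": "Resort A", "state": "Utah", "expert_runs": 30},
    {"ski_resort": "Resort B", "state": "Utah", "expert_runs": 5},
    {"ski_resort": "Resort C", "state": "Colorado", "expert_runs": 35},
]
scores = {"Resort A": 4.2, "Resort B": 4.8, "Resort C": 4.6}

picks = cascade_recommend(
    resorts,
    passes_filter=lambda r: r["expert_runs"] >= 20,  # content-based stage
    predict_rating=scores.get,                       # collaborative stage
)
print([r["ski_resort"] for r in picks])  # ['Resort C', 'Resort A']
```

Note that Resort B has the highest predicted rating but is removed in the first stage, which is the behavior that lets content constraints such as terrain or budget override the collaborative score.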