# Avant Ski/ Send It

by: Stephanie Ciaccia

# Overview

Skiing holds a prominent place for those seeking winter recreational activities in the United States. With its stunning mountain ranges and diverse terrain, the country boasts numerous ski resorts that cater to all skill levels, from beginners to seasoned professionals. Skiing offers a unique blend of adventure, physical activity, and natural beauty, making it a popular choice for winter enthusiasts seeking both relaxation and excitement.

The ski market in the United States is thriving, contributing significantly to the economy. According to the [National Ski Areas Association (NSAA)](https://nsaa.org/webdocs/Media_Public/IndustryStats/Historical_Skier_Days_1979_2022.pdf), approximately 60.7 million skiers and snowboarders visited 473 ski resorts in the 2021-2022 winter season.

# Business Problem

Skiing is an exhilarating winter activity enjoyed by many, but barriers such as high costs and limited accessibility often hinder people from fully experiencing its joys. Choosing the right ski resort can be overwhelming due to the multitude of options available, and existing websites lack dynamic filtering capabilities based on user preferences.

To address these challenges, I'm developing Avant Ski, a ski resort recommendation app. Avant Ski simplifies the ski resort selection process by leveraging data and user preferences. With dynamic filtering features, users can personalize their search based on budget, location, amenities, and skill level. By bridging the gap between ski enthusiasts and their dream destinations, Avant Ski makes skiing accessible to a wider audience, empowering them to plan unforgettable ski trips with confidence.

# Data Understanding

In [1]:
import pandas as pd
import numpy as np
import math
import datetime
from scipy import stats

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import plotly
import plotly.express as px
import plotly.io as pio
from matplotlib.ticker import StrMethodFormatter

from surprise.model_selection import cross_validate
from surprise import Dataset, Reader, accuracy
from surprise.prediction_algorithms import KNNWithMeans, KNNBasic, KNNBaseline, KNNWithZScore,  SVD, SVDpp, NMF, BaselineOnly, CoClustering, SlopeOne, NormalPredictor
from surprise.model_selection import GridSearchCV, cross_validate, train_test_split

from collections import Counter
from nltk.corpus import stopwords

from IPython.display import Image, display


Function to print full rows

In [2]:
def print_full(x):
    pd.set_option('display.max_rows', len(x))
    print(x)
    pd.reset_option('display.max_rows')

# Importing Data Files

In [3]:
final_user_df = pd.read_csv("data/cleaned_data_exports/user_df_model.csv")

In [92]:
content_df = pd.read_csv("data/cleaned_data_exports/scraped_feature_df.csv")

# Data Modeling - Recommendation System
The proposed recommendation system will adopt a cascade hybrid approach, integrating user-based filtering and content-based recommendation systems.

This **cascade hybrid** approach establishes a hierarchical structure within the recommendation system. The primary model, a collaborative user-based system, will generate the initial set of recommendations based on user preferences and ratings. Then, a secondary model, a content-based system, will refine the recommendations by considering additional factors such as mountain characteristics and user-defined filters. This two-step process aims to provide more accurate and personalized ski resort suggestions.
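
The two-step cascade described above can be sketched in a few lines. This is only an illustration, not the implementation used in this notebook: the resort names, the `predicted_ratings` and `resort_info` dictionaries, and the `cascade_recommend` function are all hypothetical, and the second step is reduced to simple equality filters rather than a full content-based model.

```python
def cascade_recommend(predicted_ratings, resort_info, filters, shortlist_size=3):
    """Two-step cascade: rank by predicted rating, then refine with filters."""
    # Step 1 (primary model): shortlist the resorts with the highest predicted ratings
    shortlist = sorted(predicted_ratings, key=predicted_ratings.get, reverse=True)[:shortlist_size]
    # Step 2 (secondary model): keep only shortlisted resorts that match every filter
    return [
        resort for resort in shortlist
        if all(resort_info[resort].get(key) == value for key, value in filters.items())
    ]

# Hypothetical predicted ratings and resort attributes, for illustration only
predicted_ratings = {"Resort A": 4.6, "Resort B": 4.4, "Resort C": 4.1, "Resort D": 3.0}
resort_info = {
    "Resort A": {"state": "Utah", "ikon": 1},
    "Resort B": {"state": "Colorado", "ikon": 0},
    "Resort C": {"state": "Utah", "ikon": 1},
    "Resort D": {"state": "Utah", "ikon": 1},
}

refined = cascade_recommend(predicted_ratings, resort_info, {"state": "Utah"})
print(refined)
```

Because the filters only see the shortlist, a resort that matches every filter but has a low predicted rating (Resort D here) never reaches the user. That ordering is what makes the approach a cascade rather than a weighted blend.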

In the **user-based collaborative** filtering phase, the model will analyze the historical ratings of users for different ski resorts. By identifying users with similar rating patterns, the model will predict the target user's ratings for unvisited resorts, leveraging the collective wisdom of similar users.
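
As a rough illustration of the user-based idea (not the Surprise implementation used later in this notebook), the sketch below predicts a rating as a similarity-weighted average of other users' ratings. The toy `ratings` dictionary, the choice of cosine similarity over co-rated resorts, and the function names are assumptions made for this example.

```python
import math

# Toy ratings: user -> {resort: rating}. Hypothetical data for illustration only.
ratings = {
    "user_a": {"Vail": 5.0, "Snowbird": 4.0, "Killington": 2.0},
    "user_b": {"Vail": 5.0, "Snowbird": 5.0, "Alta": 5.0},
    "user_c": {"Vail": 1.0, "Killington": 5.0, "Alta": 2.0},
}

def cosine_similarity(u, v):
    """Cosine similarity computed over the resorts both users have rated."""
    common = set(ratings[u]) & set(ratings[v])
    if not common:
        return 0.0
    dot = sum(ratings[u][r] * ratings[v][r] for r in common)
    norm_u = math.sqrt(sum(ratings[u][r] ** 2 for r in common))
    norm_v = math.sqrt(sum(ratings[v][r] ** 2 for r in common))
    return dot / (norm_u * norm_v)

def predict_rating(user, resort):
    """Similarity-weighted average of the neighbours' ratings for `resort`."""
    neighbours = [v for v in ratings if v != user and resort in ratings[v]]
    weights = [cosine_similarity(user, v) for v in neighbours]
    total = sum(weights)
    if total == 0:
        return None  # no neighbour has rated this resort
    return sum(w * ratings[v][resort] for w, v in zip(weights, neighbours)) / total

# user_a has not rated Alta; user_b (very similar) loved it, user_c (less similar) did not
estimate = predict_rating("user_a", "Alta")
print(estimate)
```

The estimate lands closer to user_b's rating than to user_c's, because user_b's rating pattern is the better match for user_a.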

The **content-based system** operates by constructing a matrix of resort features and determining the similarity between resorts based on these features. This approach allows the system to identify resorts with similar characteristics and recommend them based on user preferences.
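
A minimal sketch of the feature-matrix idea, assuming cosine similarity as the similarity measure (this section does not say which measure is used). The resort names, the three features, and their scaled values are invented for the example.

```python
import math

# Hypothetical, already-scaled features per resort:
# [vertical drop, share of expert runs, on a major pass (0 or 1)]
resort_features = {
    "Resort A": [0.9, 0.8, 1.0],
    "Resort B": [0.8, 0.7, 1.0],
    "Resort C": [0.2, 0.0, 0.0],
}

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def most_similar(resort, k=1):
    """Return the k resorts whose features are closest to `resort`."""
    scores = [
        (other, cosine(resort_features[resort], features))
        for other, features in resort_features.items()
        if other != resort
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]

closest = most_similar("Resort A")
print(closest)
```

Features on very different scales (summit elevation in feet versus a 0/1 pass flag, for instance) would need scaling first, otherwise the largest-valued column dominates the similarity.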

To evaluate the accuracy of the collaborative model, the **Root Mean Square Error (RMSE)** score will be used. This metric quantifies the difference between the actual ratings and the predicted ratings, providing insights into the model's performance in capturing user preferences.
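
For reference, RMSE is the square root of the mean squared difference between actual and predicted ratings. The helper below is a plain-Python sketch of that formula with made-up ratings; the notebook itself relies on Surprise's `accuracy.rmse`.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length lists of ratings."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical ratings: errors of 1, 0 and 2 points
score = rmse([5.0, 4.0, 3.0], [4.0, 4.0, 5.0])
print(round(score, 4))
```

Because the errors are squared before averaging, a single prediction that is off by two points weighs more than two predictions that are each off by one.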

While the content-based system does not have an accuracy score, a combination of ski expertise and specific user information will be utilized to subjectively assess the effectiveness of the hybrid model.

By utilizing both user-based and content-based approaches, the cascade hybrid recommendation system aims to enhance the ski resort selection process, **offering users more personalized and relevant recommendations**.

### Surprise Data
To make this model, we will be using the Python package Surprise. Surprise is a scikit for building recommender systems that provides a range of collaborative filtering algorithms, including matrix factorization-based and neighborhood-based methods.

To begin, we will make a new dataframe from our final cleaned dataframe with three columns: **user names, ski resorts, and ratings.**

In [4]:
final_user_df.drop(columns="Unnamed: 0", inplace=True)

In [5]:
final_user_df['rating'].value_counts()

5.0    1127
4.0     936
3.0     420
2.0     197
1.0     115
Name: rating, dtype: int64

In [6]:
final_user_df.loc[final_user_df['ski_resort'] == "Mt. Bohemia"]

Unnamed: 0,review_date,state,ski_resort,rating,review,user_name
1582,2010-11-28,Michigan,Mt. Bohemia,5.0,We happened upon this place about 7 or 8 years...,Peter Mende
1613,2010-12-27,Michigan,Mt. Bohemia,5.0,Best in the Midwest \r\nI pulled up in the par...,A. cooper
1762,2011-02-27,Michigan,Mt. Bohemia,5.0,The scenery and weather is always top notch in...,skylolow
1954,2012-03-05,Michigan,Mt. Bohemia,5.0,"Best place east of the rockies for steeps, tre...",Adye 1
2137,2014-01-28,Michigan,Mt. Bohemia,5.0,"This was my first visit to Bohemia, and I will...",Linda Smith
2345,2016-03-16,Michigan,Mt. Bohemia,5.0,I was also there on the 13 of March. When I lo...,Ryan Craig


In [7]:
#copying final review dataframe
surprise_df = final_user_df.copy()

#dropping unneeded columns
surprise_df = surprise_df[['user_name', 'ski_resort', 'rating']]

In [8]:
surprise_df

Unnamed: 0,user_name,ski_resort,rating
0,anon_1,Winter Park,4.0
1,anon_1,Arapahoe Basin,5.0
2,anon_1,Steamboat,5.0
3,anon_1,Copper Mountain,5.0
4,anon_2,Solitude Mountain,5.0
...,...,...,...
2802,Payton Sharum,Snowy Range Ski Recreation Area,5.0
2803,Lori Young,Discovery,5.0
2804,Lori Young,Discovery,5.0
2805,Steven Freund,Solitude Mountain,5.0


In [9]:
# counting the number of reviews for each user
value_counts = surprise_df['user_name'].value_counts()

# selecting only users with at least three reviews
selected_users = value_counts[value_counts > 2].index

# selecting only the rows where the user_name is in the selected_users list
surprise_df = surprise_df[surprise_df['user_name'].isin(selected_users)]

In [10]:
surprise_df['ski_resort'].value_counts()

Ski Brule              67
Killington             49
Vail                   49
Breckenridge           48
Snowbird               35
                       ..
Bradford                1
Mad River Mountain      1
New Hermon Mountain     1
Black Mountain          1
Soda Springs            1
Name: ski_resort, Length: 269, dtype: int64

In [122]:
from surprise import Reader, Dataset

reader = Reader(rating_scale=(1, 5))

#loading final dataset
data = Dataset.load_from_df(surprise_df[['user_name', 'ski_resort', 'rating']], reader)

#splitting into train and test
trainset, testset = train_test_split(data, test_size=0.2, random_state=42)

In [123]:
#looking at number of users
print('Number of users: ', trainset.n_users, '\n')
print('Number of items: ', trainset.n_items)

Number of users:  534 

Number of items:  264


### Normal Predictor Model

The Normal Predictor baseline model's Root Mean Squared Error (RMSE) tells us that, on average, our predicted ratings are roughly **1.36** points away from the actual ratings.

In [13]:
# Instantiate the model
baseline = NormalPredictor()

#fitting model
baseline.fit(trainset)

# making prediction on testset
predictions = baseline.test(testset)

# Save RMSE score
baseline_normal = accuracy.rmse(predictions)

RMSE: 1.3587


In [14]:
#saving normal rmse
test_baseline_normal_rmse = 1.358

In [157]:
data_rmse = [['normal predictor', 1.358, 'n/a']]

model_df = pd.DataFrame(data_rmse, columns=['model', 'rmse', 'params'])

In [158]:
model_df

Unnamed: 0,model,rmse,params
0,normal predictor,1.358,


#### Defining a function to save the RMSE for all model iterations.

In [159]:
def model_comp(model_name, rmse, params):
    model_df.loc[len(model_df.index)] = [model_name, rmse, params] 

### Algorithm Selection

To start off, I'll be testing all algorithms available in Surprise to determine which models I should tune further.

This code was adapted from this [blog](https://towardsdatascience.com/building-and-testing-recommender-systems-with-surprise-step-by-step-d4ba702ef80b0)

In [179]:
benchmark = []

# Iterate over all algorithms to see which performs best
for algorithm in [SVD(), SVDpp(), SlopeOne(), NMF(), NormalPredictor(), KNNBaseline(), KNNBasic(), KNNWithMeans(), KNNWithZScore(), BaselineOnly(), CoClustering()]:
    
    # Perform cross validation
    results = cross_validate(algorithm, data, measures=['RMSE'], cv=3, verbose=False)
    
    # Get results & append algorithm name
    tmp = pd.DataFrame.from_dict(results).mean(axis=0)
    # Series.append was removed in pandas 2.0, so pd.concat is used instead
    tmp = pd.concat([tmp, pd.Series([str(algorithm).split(' ')[0].split('.')[-1]], index=['Algorithm'])])
    benchmark.append(tmp)
    
benchmark_df = pd.DataFrame(benchmark).set_index('Algorithm').sort_values('test_rmse') 

Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...


In [180]:
benchmark_df

Unnamed: 0_level_0,test_rmse,fit_time,test_time
Algorithm,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
SVD,0.961244,0.085539,0.003242
SVDpp,0.964481,0.185893,0.009628
BaselineOnly,0.991297,0.001756,0.002018
KNNBaseline,1.009603,0.005219,0.011823
KNNBasic,1.017418,0.003192,0.009076
KNNWithMeans,1.149623,0.007013,0.009889
CoClustering,1.158347,0.057029,0.00244
KNNWithZScore,1.170293,0.015271,0.010673
SlopeOne,1.173589,0.003891,0.003808
NMF,1.193571,0.105964,0.002901


From the initial test, SVD and SVDpp had the lowest RMSE scores. I will start adjusting hyperparameters to see which model performs best.

### Model #1 - SVD

We will be using Singular Value Decomposition (SVD) to reduce the dimensionality of our rating matrix. SVD is a matrix factorization model: it represents the sparse user-by-resort rating matrix with a small number of latent factors for each user and each ski resort. This helps us understand the relationship between users and items, or in this case users and ski resorts.
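
The prediction rule behind this kind of biased matrix factorization is the global mean rating, plus a user bias, plus an item bias, plus the dot product of the user and item factor vectors. The sketch below spells that out with invented numbers; the function name and all values are assumptions for illustration, and the clipping to the 1-5 scale mirrors the rating scale used in this notebook.

```python
def predict_svd(global_mean, user_bias, item_bias, user_factors, item_factors,
                low=1.0, high=5.0):
    """Estimate a rating as mean + biases + dot(user factors, item factors)."""
    interaction = sum(p * q for p, q in zip(user_factors, item_factors))
    estimate = global_mean + user_bias + item_bias + interaction
    # Keep the estimate inside the rating scale
    return min(high, max(low, estimate))

# Hypothetical learned parameters for one user and one resort (two latent factors)
estimate = predict_svd(
    global_mean=4.0,
    user_bias=0.2,       # this user rates slightly above average
    item_bias=-0.1,      # this resort is rated slightly below average
    user_factors=[0.5, -0.3],
    item_factors=[0.4, 0.1],
)
print(round(estimate, 2))
```

With `n_factors=140`, as in the tuned models below, each user and resort would carry 140 such factor values instead of two.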

In [21]:
# Cross validate a basic SVD with no hyperparameter tuning expecting sub-par results
svd_basic = SVD(random_state=42)

results = cross_validate(svd_basic, data, measures=['RMSE'], cv=3, n_jobs = -1, verbose=True)

Evaluating RMSE of algorithm SVD on 3 split(s).

                  Fold 1  Fold 2  Fold 3  Mean    Std     
RMSE (testset)    0.9376  1.0281  0.9532  0.9730  0.0395  
Fit time          0.08    0.08    0.08    0.08    0.00    
Test time         0.00    0.00    0.00    0.00    0.00    


In [22]:
# Fit to trainset and predict on the testset for evaluation
svd_basic.fit(trainset)

predictions = svd_basic.test(testset)

svd_simple = accuracy.rmse(predictions)

RMSE: 0.9254


In [160]:
model_comp('svd', .925, 'n/a')

### Model #2 - SVD Grid Search

To begin, I will be adjusting the following parameters:

 - n_factors - Increasing the number of factors can improve the model's ability to capture user and content interactions more accurately
 - n_epochs - Changes the number of iterations the model performs on the training data
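 - init_mean - Changes the mean of the normal distribution used to initialize the factor vectors, i.e. the starting point for factor initialization
 - biased - Whether to include user and item bias terms; since user bias is a common issue with ratings, this lets the model account for it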

In [124]:
#test grid search
params = {'n_factors': [100, 120, 140],
          'n_epochs': [20, 40, 60, 80, 100, 120],
          'init_mean': [0,.01, 0.05],
         'biased': [True, False]}

g_s_svd = GridSearchCV(SVD, param_grid=params, cv=5, refit=True)

g_s_svd.fit(data)
g_s_svd.best_params['rmse']

{'n_factors': 140, 'n_epochs': 40, 'init_mean': 0, 'biased': True}
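
Grid search fits one model per hyperparameter combination per cross-validation fold, so run time grows quickly with each added value. The snippet below counts the fits for the grid above using only the standard library; it is a back-of-the-envelope helper, not part of the modeling pipeline, and it does not count any final refit.

```python
from itertools import product

# Same grid as the SVD grid search above
params = {'n_factors': [100, 120, 140],
          'n_epochs': [20, 40, 60, 80, 100, 120],
          'init_mean': [0, .01, 0.05],
          'biased': [True, False]}
cv_folds = 5

# Every combination of one value per hyperparameter
combinations = list(product(*params.values()))
n_fits = len(combinations) * cv_folds

print(len(combinations), "combinations x", cv_folds, "folds =", n_fits, "fits")
```

The same arithmetic matches the joblib logs further down: the second SVD++ grid has 6 x 5 x 4 x 3 = 360 combinations, and with `cv=3` that gives the 1080 tasks reported there.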

In [156]:
# instantiating SVD with best hyperparameters from gridsearch
g_s_svd = SVD(n_factors=140 ,n_epochs=40, init_mean=0, biased=True)

# fit on trainset and make predictions using testset
g_s_svd.fit(trainset)
predictions = g_s_svd.test(testset)
g_s_svd_1 = accuracy.rmse(predictions)

RMSE: 0.8860


In [161]:
model_comp('svd_grid_1', .88, {'n_factors': 140, 'n_epochs': 40, 'init_mean': 0, 'biased': True})

### Model #3 - SVD Grid Search #2

I will continue to adjust the hyperparameters to see if we can bring down the RMSE.

In [32]:
#test grid search
params = {'n_factors': [80, 100, 110, 120, 130, 140],
          'n_epochs': [20, 30, 40, 50, 60, 70, 80],
          'init_mean': [-0.5, 0, 0.5],
         'biased': [True, False]}

g_s_svd_2 = GridSearchCV(SVD, param_grid=params, cv=5, refit=True)

g_s_svd_2.fit(data)
g_s_svd_2.best_params['rmse']

{'n_factors': 140, 'n_epochs': 50, 'init_mean': 0, 'biased': True}

In [162]:
# instantiating SVD with best hyperparameters from gridsearch
g_s_svd_2 = SVD(n_factors=140 ,n_epochs=50, init_mean=0, biased=True)

# fit on trainset and make predictions using testset
g_s_svd_2.fit(trainset)
predictions_2 = g_s_svd_2.test(testset)
g_s_svd_2 = accuracy.rmse(predictions_2)

RMSE: 0.9026


In [163]:
model_comp('svd_grid_2', .902, {'n_factors': 140, 'n_epochs': 50, 'init_mean': 0, 'biased': True})

### Model #4 - SVD Grid Search #3

I am slightly adjusting the number of n_factors and n_epochs to see if this will bring down RMSE.

In [35]:
#test grid search
params = {'n_factors': [110, 120, 130, 140, 160],
          'n_epochs': [10, 20, 30, 40, 50, 60, 70, 80],
          'init_mean': [-0.5, 0, 0.5],
         'biased': [True, False]}

g_s_svd_2 = GridSearchCV(SVD, param_grid=params, cv=5, refit=True)

g_s_svd_2.fit(data)
g_s_svd_2.best_params['rmse']

{'n_factors': 160, 'n_epochs': 40, 'init_mean': 0, 'biased': True}

In [164]:
# instantiating SVD with best hyperparameters from gridsearch
g_s_svd_3 = SVD(n_factors=160 ,n_epochs=40, init_mean=0, biased=True)

# fit on trainset and make predictions using testset
g_s_svd_3.fit(trainset)
predictions_3 = g_s_svd_3.test(testset)
g_s_svd_3_rmse = accuracy.rmse(predictions_3)

RMSE: 0.8963


In [165]:
model_comp('g_s_svd_3', .896, {'n_factors': 160, 'n_epochs': 40, 'init_mean': 0, 'biased': True})

### Model #5 - SVDpp

SVD++ extends SVD with implicit feedback: in addition to the explicit ratings, it takes into account which resorts a user has rated, regardless of the rating value. This captures additional user preferences and can improve the accuracy of the model.

SVD++ was one of the two top-performing models in the initial benchmark, so I am expecting this model to perform well. I will be adjusting the following parameters:

 - n_factors - Increasing the number of factors can improve the model's ability to capture user and content interactions more accurately
 - n_epochs - Changes the number of iterations the model performs on the training data
 - init_mean - Changes the mean of the normal distribution used to initialize the factor vectors, i.e. the starting point for factor initialization
 - reg_all - The regularization term applied to all parameters, which helps prevent overfitting

To begin I will start with a lower n_factors and n_epochs. Larger n_factors numbers can lead to overfitting.

In [116]:
# New hyperparameter dictionary for SVD++ model
svd_pp_param_grid = {'n_factors':[5, 10, 20, 30, 40, 50],
                  'n_epochs': [20, 30, 40, 50, 60],
                    'init_mean':[0, .01, .02, .03],
                    'reg_all':[.01, .02, .03]}
svd_pp_model = GridSearchCV(SVDpp, param_grid=svd_pp_param_grid, cv=3, joblib_verbose=10, return_train_measures=True)

# Fit and return the best hyperparameters
svd_pp_model.fit(data)
print(svd_pp_model.best_params['rmse'])

In [166]:
# instantiating SVD++
svd_pp_model = SVDpp(n_factors=50, n_epochs=60, init_mean=.00, reg_all=.02)

# Fit on trainset and make predictions using testset to return RMSE metric
svd_pp_model.fit(trainset)
predictions = svd_pp_model.test(testset)
svd_pp_model_1 = accuracy.rmse(predictions)

RMSE: 0.9302


In [167]:
model_comp('svd_pp_1', .930, {'n_factors': 50, 'n_epochs': 60, 'init_mean': 0, 'reg_all': 0.02} )

### Model #6 - SVD++ Grid Search #2

The last SVD++ model had a slightly higher RMSE than the SVD models. I am going to increase the n_factors to increase the complexity of the model.

In [45]:
# New hyperparameter dictionary for SVD++ model
svd_pp_param_grid = {'n_factors':[50, 70, 90, 120, 140, 150],
                  'n_epochs': [20, 40, 60, 80, 100],
                    'init_mean':[0, .01, .02, .03],
                     'reg_all':[.01, .02, .03]}
svd_pp_model_2 = GridSearchCV(SVDpp, param_grid=svd_pp_param_grid, cv=3, joblib_verbose=10, return_train_measures=True)

# Fit and return the best hyperparameters
svd_pp_model_2.fit(data)
svd_pp_model_2.best_params['rmse']

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.3s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.6s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:    1.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:    1.3s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    1.6s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed:    1.9s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   7 out of   7 | elapsed:    2.2s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   8 out of   8 | elapsed:    2.5s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   9 out of   9 | elapsed:    2.8s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 1080 out of 1080 | elapsed: 24.0min finished


{'n_factors': 90, 'n_epochs': 40, 'init_mean': 0, 'reg_all': 0.03}

In [168]:
# instantiating SVD++ with best hyperparameters from grid search
svd_pp_model_2 = SVDpp(n_factors=90, n_epochs=40, init_mean=0.0, reg_all=.03)

# Fit on trainset and make predictions using testset to return RMSE metric
svd_pp_model_2.fit(trainset)
predictions = svd_pp_model_2.test(testset)
svd_pp_model_2_rmse = accuracy.rmse(predictions)

RMSE: 0.9011


In [169]:
model_comp('svd_pp_2', .901, {'n_factors': 90, 'n_epochs': 40, 'init_mean': 0, 'reg_all': 0.03})

### Model #7 - SVD++ Grid Search #3

The previous model performed slightly better than the first SVD++ model. I am now going to adjust the n_factors and n_epochs by making the steps between the values smaller in an attempt to capture the best hyperparameters.

In [52]:
# New hyperparameter dictionary for SVD++ model
svd_pp_param_grid = {'n_factors':[90, 100, 110, 120, 130],
                  'n_epochs': [20, 30, 40, 50, 60, 70],
                    'init_mean':[0, .01, .02, .03],
                    'reg_all':[.01, .02, .03]}
svd_pp_model_3 = GridSearchCV(SVDpp, param_grid=svd_pp_param_grid, cv=3, joblib_verbose=10, return_train_measures=True)

# Fit and return the best hyperparameters
svd_pp_model_3.fit(data)
svd_pp_model_3.best_params['rmse']

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.4s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.9s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:    1.3s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:    1.8s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    2.2s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed:    2.6s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   7 out of   7 | elapsed:    3.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   8 out of   8 | elapsed:    3.5s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   9 out of   9 | elapsed:    3.9s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 1080 out of 1080 | elapsed: 18.9min finished


{'n_factors': 100, 'n_epochs': 30, 'init_mean': 0.02, 'reg_all': 0.02}

In [170]:
# instantiating SVD++ with best hyperparameters from grid search
svd_pp_model_3 = SVDpp(n_factors=100, n_epochs=30, init_mean=0.02, reg_all=.02)

# Fit on trainset and make predictions using testset to return RMSE metric
svd_pp_model_3.fit(trainset)
predictions = svd_pp_model_3.test(testset)
svd_pp_model_3_rmse = accuracy.rmse(predictions)

RMSE: 0.9069


In [171]:
model_comp('svd_pp_3', .906, {'n_factors': 100, 'n_epochs': 30, 'init_mean': 0.02, 'reg_all': 0.02})

### Model #8 - SVD++ Grid Search #4

The last model performed slightly worse than the previous one on the test set (0.907 vs. 0.901). I'm going to adjust the n_factors and n_epochs again by further narrowing the spacing between the parameter options.

In [56]:
# New hyperparameter dictionary for SVD++ model
svd_pp_param_grid = {'n_factors':[120, 125, 130, 140],
                  'n_epochs': [5, 10, 15, 20, 30, 35],
                    'init_mean':[0, .01, .02, .03],
                    'reg_all':[.01, .02, .03]}
svd_pp_model_4 = GridSearchCV(SVDpp, param_grid=svd_pp_param_grid, cv=3, joblib_verbose=10, return_train_measures=True)

# Fit and return the best hyperparameters
svd_pp_model_4.fit(data)
print(svd_pp_model_4.best_params['rmse'])

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.2s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.3s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:    0.5s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:    0.6s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    0.8s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed:    0.9s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   7 out of   7 | elapsed:    1.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   8 out of   8 | elapsed:    1.3s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   9 out of   9 | elapsed:    1.4s remaining:    0.0s


{'n_factors': 125, 'n_epochs': 35, 'init_mean': 0.02, 'reg_all': 0.03}


[Parallel(n_jobs=1)]: Done 864 out of 864 | elapsed:  7.6min finished


In [172]:
# instantiating SVD++ with best hyperparameters from grid search
svd_pp_model_4 = SVDpp(n_factors=125, n_epochs=35, init_mean=.02, reg_all=.03)

# Fit on trainset and make predictions using testset to return RMSE metric
svd_pp_model_4.fit(trainset)
predictions = svd_pp_model_4.test(testset)
svd_pp_model_4_rmse = accuracy.rmse(predictions)

RMSE: 0.9031


In [173]:
model_comp('svd_pp_4', 0.903, {'n_factors': 125, 'n_epochs': 35, 'init_mean': 0.02, 'reg_all': 0.03})

### Model #9 - SVD++ Grid Search #5

In [59]:
# New hyperparameter dictionary for SVD++ model
svd_pp_param_grid = {'n_factors':[110, 120, 125, 130, 135, 140, 145, 150, 160],
                  'n_epochs': [20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120],
                    'init_mean':[0, .01],
                    'reg_all':[.01, .02, .03]}
svd_pp_model_5 = GridSearchCV(SVDpp, param_grid=svd_pp_param_grid, cv=3, joblib_verbose=10, return_train_measures=True)

# Fit and return the best hyperparameters
svd_pp_model_5.fit(data)
svd_pp_model_5.best_params['rmse']

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.5s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    1.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:    1.5s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:    2.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    2.4s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed:    2.9s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   7 out of   7 | elapsed:    3.4s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   8 out of   8 | elapsed:    3.9s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   9 out of   9 | elapsed:    4.4s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 1782 out of 1782 | elapsed: 56.2min finished


{'n_factors': 125, 'n_epochs': 50, 'init_mean': 0.01, 'reg_all': 0.03}

In [174]:
# instantiating SVD++ with best hyperparameters from grid search
svd_pp_model_5 = SVDpp(n_factors=125, n_epochs=50, init_mean=.01, reg_all=.03)

# Fit on trainset and make predictions using testset to return RMSE metric
svd_pp_model_5.fit(trainset)
predictions = svd_pp_model_5.test(testset)
svd_pp_model_5_rmse = accuracy.rmse(predictions)

RMSE: 0.8937


In [175]:
model_comp('svd_pp_5', .893, {'n_factors': 125, 'n_epochs': 50, 'init_mean': 0.01, 'reg_all': 0.03} )

### Model Comparison

After comparing all models, the tuned SVD and SVD++ models performed best in terms of RMSE, and their scores were all very close. The lowest RMSE (0.88) came from the first SVD grid search (`svd_grid_1`), so the final model for the collaborative system will be SVD with those hyperparameters.

Because the top scores were so close, I trained multiple models and tested the recommendation function with each to see which performed best. After testing with three users, two of whom filled out the Google survey, I decided to use the hyperparameters from the first SVD grid search, which also had the lowest RMSE score.

The other models often returned resorts that were not aligned with the users' top ratings: resorts that I and the other users typically do not visit unless they are close to the city we live in.

In [187]:
model_df.sort_values(by="rmse").head()

Unnamed: 0,model,rmse,params
2,svd_grid_1,0.88,"{'n_factors': 140, 'n_epochs': 40, 'init_mean'..."
9,svd_pp_5,0.893,"{'n_factors': 125, 'n_epochs': 50, 'init_mean'..."
4,g_s_svd_3,0.896,"{'n_factors': 160, 'n_epochs': 40, 'init_mean'..."
6,svd_pp_2,0.901,"{'n_factors': 90, 'n_epochs': 40, 'init_mean':..."
3,svd_grid_2,0.902,"{'n_factors': 140, 'n_epochs': 50, 'init_mean'..."


#### Best Model

In [222]:
# instantiating SVD with the best hyperparameters
best_model = SVD(n_factors=140, n_epochs=40, biased=True)

# Fit on trainset and make predictions using testset to return RMSE metric
best_model.fit(trainset)
predictions = best_model.test(testset)
best_model_rmse = accuracy.rmse(predictions)

RMSE: 0.8968


In [214]:
best_model = SVDpp(n_factors=140, n_epochs=120, init_mean=.01)

### Training on Full Dataset

Before building the collaborative recommendation function, we need to train the model on the full dataset. This ensures that the model learns from the complete set of user resort reviews, so it can make more accurate predictions.

In [223]:
#setting scale
reader = Reader(rating_scale=(1, 5))

#loading final dataset
data_full = Dataset.load_from_df(surprise_df[['user_name', 'ski_resort', 'rating']], reader)

#making trainset
full_trainset = data_full.build_full_trainset()

algo = SVD(n_factors=140, n_epochs=40, biased=True)
algo.fit(full_trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7ff683a1fa90>

## Collaborative System - Building Function

To implement the model, we will need to create a function that takes in user inputs and outputs predictions. To start, I will make a dataframe of only the user names, ratings, and ski resorts.

In [89]:
surprise_df.head()

Unnamed: 0,user_name,ski_resort,rating
0,anon_1,Winter Park,4.0
1,anon_1,Arapahoe Basin,5.0
2,anon_1,Steamboat,5.0
3,anon_1,Copper Mountain,5.0
4,anon_2,Solitude Mountain,5.0


In [90]:
#saving new dataframe with only user information
user_df = surprise_df.reset_index()
user_df.set_index('user_name', inplace = True)
user_df.drop(columns = ['rating', 'index'], inplace =True)
user_df.head()

Unnamed: 0_level_0,ski_resort
user_name,Unnamed: 1_level_1
anon_1,Winter Park
anon_1,Arapahoe Basin
anon_1,Steamboat
anon_1,Copper Mountain
anon_2,Solitude Mountain


### Saving Trained Model

For the hybrid model and Streamlit app, I will need to save the trained model. To do this, I adapted code from this [Google Colab notebook](https://colab.research.google.com/github/singhsidhukuldeep/Recommendation-System/blob/master/Building_Recommender_System_with_Surprise.ipynb#scrollTo=lM7Db2cj7-IZ).

In [227]:
#saving trained model
from surprise import dump
import os

model_filename = "./model.pickle"

print (">> Starting dump")

# Dump the trained algorithm to disk
file_name = os.path.expanduser(model_filename)
dump.dump(file_name, algo=algo)

print (">> Dump done")
print(model_filename)

>> Starting dump
>> Dump done
./model.pickle


### Function #1  - User & Recommendation #  Inputs

Below is a function that takes user inputs and returns predicted ratings. It uses the final trained SVD model (`algo`). The results are sorted from the highest predicted rating to the lowest.

In [93]:
def shred_recommender():
    #user input
    user = str(input('Name: '))
    #number of recommendations
    n_recs = int(input('How many resort recommendations do you want? '))
    
    #making a list of the resorts rated by each user
    have_rated = list(user_df.loc[user, 'ski_resort'])
    #dropping rated resorts for final recommendation
    not_rated = content_df.copy()
    not_rated = not_rated.loc[~not_rated['ski_resort'].isin(have_rated)]
    not_rated = not_rated.drop_duplicates(subset=['ski_resort'])
    not_rated.reset_index(inplace=True)
    
    #running the model and saving the predicted ratings as a new column
    not_rated['predicted_rating'] = not_rated['ski_resort'].apply(lambda x: algo.predict(user, x).est)
    not_rated.sort_values(by='predicted_rating', ascending=False, inplace=True)
    not_rated = not_rated[['ski_resort', 'state', 'city', 'sumt', 'drop', 'base', 'intermediate_runs',
                          'advanced_runs', 'expert_runs', 'predicted_rating', 'ikon', 'epic',
                          'mountain_collective']].copy()
    return not_rated.head(n_recs)
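
Because the function above reads its arguments with `input()`, it is awkward to call from tests or from an app. The sketch below shows the same ranking logic as a plain function. It is a simplified stand-in, not a drop-in replacement: `recommend`, its parameters, and the `fake_scores` lookup are hypothetical, and the `predict` callable stands in for something like `lambda user, resort: algo.predict(user, resort).est`.

```python
def recommend(user, all_resorts, rated_resorts, predict, n_recs=3):
    """Return the top `n_recs` unrated resorts, ranked by predicted rating."""
    already_rated = set(rated_resorts)
    # Drop duplicates (keeping order) and resorts the user has already rated
    candidates = [r for r in dict.fromkeys(all_resorts) if r not in already_rated]
    scored = [(resort, predict(user, resort)) for resort in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:n_recs]

# Fixed, made-up scores standing in for model predictions
fake_scores = {"Alta": 4.4, "Snowbasin": 4.5, "Vail": 4.0, "Hunter Mountain": 2.0}

top = recommend(
    user="some_user",
    all_resorts=list(fake_scores),
    rated_resorts=["Vail"],
    predict=lambda user, resort: fake_scores[resort],
    n_recs=2,
)
print(top)
```

Passing the prediction function in as an argument also makes it easy to swap between the candidate models when comparing their recommendations for the same user.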

# Results - Function Testing

I will be testing the model results with users whose mountain preferences have been defined:

- Stephanie C. - Enjoys mountains with advanced and expert terrain, good amenities, and proximity to public transportation.
- Alexandria K. - Prefers mountains where the majority of skiers are there for the sport and that do not feel overly "bougie". Skis expert terrain.
- Raghava K. - Enjoys large mountains, back bowls, and expert terrain. Parking and mountain amenities are also important.
- Joseph L. - Skis locally in the NY area. Is looking to explore more mountains.

#### User #1 - Stephanie

For the first user comparison, I used myself. I think the recommended resorts are in line with my preferences: they are large mountains with extensive advanced and expert terrain, which is what I enjoy skiing.

In [94]:
final_user_df.loc[final_user_df["user_name"] == "Stephanie Ciaccia"]

Unnamed: 0,review_date,state,ski_resort,rating,review,user_name
36,2023-05-05,New York,Hunter Mountain,1.0,Very crowded and snow quality is typically low...,Stephanie Ciaccia
37,2023-05-05,Colorado,Vail,4.0,Amazing back bowls and snow. Pricey resort tho...,Stephanie Ciaccia
38,2023-05-05,Utah,Snowbird,5.0,Amazing terrain and huge mountain yet still fe...,Stephanie Ciaccia
39,2023-05-05,Colorado,Breckenridge,4.0,Great bus route that makes the mountain easily...,Stephanie Ciaccia
40,2023-05-05,Utah,Park City Mountain,4.0,"Great terrain, close to airport, but lodging i...",Stephanie Ciaccia


In [224]:
shred_recommender()

Name: Stephanie Ciaccia
How many resort recommendations do you want? 3


Unnamed: 0,ski_resort,state,city,sumt,drop,base,intermediate_runs,advanced_runs,expert_runs,predicted_rating,ikon,epic,mountain_collective
247,Snowbasin,Utah,Huntsville,9350,2900,6450,33,52,6,4.487251,1,0,1
302,Whiteface Mountain,New York,Wilngton,4650,3430,1220,46,31,0,4.479332,0,0,0
5,Alta,Utah,Alta,11068,2538,8530,0,0,0,4.39179,1,0,1


### User #2 - Raghava

The second user has rated large mountains with advanced and expert runs and good transportation highly. Their lowest score went to a mountain that did not have good signage and was difficult to reach by public transport.

In [143]:
final_user_df.loc[final_user_df["user_name"] == "Raghava Kamalesh"]

Unnamed: 0,review_date,state,ski_resort,rating,review,user_name
55,2023-05-10,Colorado,Breckenridge,4.0,"Breck is a decent resort, but a lot of the ter...",Raghava Kamalesh
56,2023-05-10,Colorado,Crested Butte Mountain,5.0,Really amazing and extensive terrain for advan...,Raghava Kamalesh
57,2023-05-10,Colorado,Vail,5.0,I enjoyed Vail a lot more this time than my la...,Raghava Kamalesh
58,2023-05-10,Colorado,Beaver Creek,3.0,"There are some good runs here, and part of my ...",Raghava Kamalesh
59,2023-05-10,Colorado,Telluride,5.0,Telluride is amazing. The town has very good f...,Raghava Kamalesh


In [225]:
shred_recommender()

Name: Raghava Kamalesh
How many resort recommendations do you want? 3


Unnamed: 0,ski_resort,state,city,sumt,drop,base,intermediate_runs,advanced_runs,expert_runs,predicted_rating,ikon,epic,mountain_collective
247,Snowbasin,Utah,Huntsville,9350,2900,6450,33,52,6,4.884927,1,0,1
248,Snowbird,Utah,Snowbird,11000,3240,7760,25,43,24,4.849952,1,0,1
271,Sunday River,Maine,Newry,3140,2340,800,36,18,16,4.823627,1,0,0


In [97]:
list(user_df.loc['Raghava Kamalesh', 'ski_resort'])

['Breckenridge', 'Crested Butte Mountain', 'Vail', 'Beaver Creek', 'Telluride']

### User #3 - Alexandria

The third user prefers mountains that have expert terrain and that don't feel overrun with guests who are there for reasons other than skiing. The recommendations are smaller, lesser-known resorts that still have advanced terrain.

In [144]:
final_user_df.loc[final_user_df["user_name"] == "Alexandria Kelly"]

Unnamed: 0,review_date,state,ski_resort,rating,review,user_name
46,2023-05-09,Washington,Stevens Pass,5.0,"Lots of snow, small local mountain.",Alexandria Kelly
47,2023-05-09,Colorado,Vail,3.0,"Lots of terrain, but very busy.",Alexandria Kelly
48,2023-05-09,Utah,Snowbird,4.0,"Great expert terrain, feels very grand and exc...",Alexandria Kelly
49,2023-05-09,Utah,Park City Mountain,2.0,"Lots of terrain, but usually very busy. The to...",Alexandria Kelly


In [226]:
shred_recommender()

Name: Alexandria Kelly
How many resort recommendations do you want? 3


Unnamed: 0,ski_resort,state,city,sumt,drop,base,intermediate_runs,advanced_runs,expert_runs,predicted_rating,ikon,epic,mountain_collective
280,Telluride,Colorado,Telluride,13150,4425,8725,30,21,34,4.644477,0,0,0
233,Ski Brule,Michigan,Iron River,1860,500,1360,35,24,6,4.630016,0,0,0
125,Jackson Hole,Wyoming,Teton Village,10450,4139,6311,41,38,17,4.480602,1,0,1


In [98]:
list(user_df.loc['Alexandria Kelly', 'ski_resort'])

['Stevens Pass', 'Vail', 'Snowbird', 'Park City Mountain']

### Function #2 - User, Recommendation #, and State Inputs

Below I added a filter that lets users restrict recommendations to a single state. I plan to further adjust the filters in the hybrid and Streamlit models.

In [137]:
def shred_recommender_state():
    user = str(input('Name: '))
    n_recs = int(input('How many resort recommendations do you want? '))
    state = str(input('What state would you like to shred in? '))
    
    have_rated = list(user_df.loc[user, 'ski_resort'])
    not_rated = content_df.copy()
    not_rated = not_rated.loc[(~not_rated['ski_resort'].isin(have_rated)) & (not_rated['state'] == state)]
    not_rated = not_rated.drop_duplicates(subset=['ski_resort'])
    not_rated.reset_index(inplace=True)
    not_rated['predicted_rating'] = not_rated['ski_resort'].apply(lambda x: algo.predict(user, x).est)
    not_rated.sort_values(by='predicted_rating', ascending=False, inplace=True)
    not_rated = not_rated[['ski_resort', 'state', 'city', 'sumt', 'drop', 'base', 'intermediate_runs',
                          'advanced_runs', 'expert_runs', 'predicted_rating', 'ikon', 'epic',
                          'mountain_collective']].copy()
    return not_rated.head(n_recs)
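
The filter `not_rated['state'] == state` is an exact string comparison, so an entry such as `utah` or ` Utah ` matches nothing and the function returns an empty table. Below is a minimal sketch of normalizing the input first. It assumes the `state` column stores full, title-cased names, as the outputs in this notebook suggest. The helper name `normalize_state` and the small abbreviation map are hypothetical.

```python
# Hypothetical, deliberately incomplete map of postal abbreviations to full names
STATE_ABBREVIATIONS = {"UT": "Utah", "CO": "Colorado", "NY": "New York"}


def normalize_state(raw):
    """Clean free-text state input so it can be compared with the 'state' column."""
    cleaned = " ".join(raw.split())  # trim and collapse repeated whitespace
    if cleaned.upper() in STATE_ABBREVIATIONS:
        return STATE_ABBREVIATIONS[cleaned.upper()]
    # 'new york' -> 'New York'; names with lowercase words would need special handling
    return cleaned.title()


print(normalize_state("  utah "))    # Utah
print(normalize_state("ut"))         # Utah
print(normalize_state("new  york"))  # New York
```

In the function above, the result would feed the existing comparison, for example `state = normalize_state(input('What state would you like to shred in? '))`.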

In [138]:
shred_recommender_state()

Name: Stephanie Ciaccia
How many resort recommendations do you want? 3
What state would you like to shred in? Utah


Unnamed: 0,ski_resort,state,city,sumt,drop,base,intermediate_runs,advanced_runs,expert_runs,predicted_rating,ikon,epic,mountain_collective
0,Alta,Utah,Alta,11068,2538,8530,0,0,0,4.457562,1,0,1
8,Snowbasin,Utah,Huntsville,9350,2900,6450,33,52,6,4.407571,1,0,1
4,Deer Valley,Utah,Park City,9570,3000,6570,31,10,32,3.962196,1,0,0


# Conclusions

The analysis of the outputs indicates that the recommendation system effectively suggests ski resorts aligned with user ratings. However, considering the high opportunity cost associated with planning a ski trip, relying solely on a user-based system is insufficient to provide users with tailored recommendations that account for factors such as cost and time of year. While identifying similar users is crucial, it is essential to consider additional elements when making recommendations.

# Next Steps

The next step is to implement a content-based system and integrate its results with the collaborative model. This combined approach extends the filtering capabilities of the collaborative model, allowing for more dynamic filtering based on user preferences. By combining content-based and collaborative filtering, the system should deliver recommendations that align more closely with user preferences.
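
One way to picture a cascade is as two stages: the first narrows the candidate resorts using resort attributes, and the collaborative model then ranks only the survivors. The sketch below is a plain-Python illustration under stated assumptions, not the implementation in the Cascade Hybrid Modeling notebook. The first stage is a simple attribute filter standing in for the content-based model. The function name `cascade_recommend`, the `min_expert_runs` and `pass_name` parameters, and the `StubModel` stand-in for the trained collaborative model are all hypothetical. The stub's ratings are made-up numbers used only to make the example runnable, while the resort attributes are copied from the outputs above.

```python
from collections import namedtuple

# Mimics the shape of a prediction object that exposes an .est attribute
Prediction = namedtuple("Prediction", ["est"])


class StubModel:
    """Stand-in for a trained collaborative model exposing predict(user, item).est."""

    def __init__(self, scores):
        self.scores = scores  # made-up ratings, for illustration only

    def predict(self, user, resort):
        return Prediction(est=self.scores.get(resort, 3.0))


def cascade_recommend(user, resorts, model, min_expert_runs=0, pass_name=None, n_recs=3):
    # Stage 1: keep only resorts that satisfy the user's stated constraints
    candidates = [
        r for r in resorts
        if r["expert_runs"] >= min_expert_runs
        and (pass_name is None or r[pass_name] == 1)
    ]
    # Stage 2: rank the remaining candidates by predicted rating, highest first
    ranked = sorted(
        candidates,
        key=lambda r: model.predict(user, r["ski_resort"]).est,
        reverse=True,
    )
    return [r["ski_resort"] for r in ranked[:n_recs]]


resorts = [
    {"ski_resort": "Snowbasin", "expert_runs": 6, "ikon": 1, "epic": 0},
    {"ski_resort": "Snowbird", "expert_runs": 24, "ikon": 1, "epic": 0},
    {"ski_resort": "Sunday River", "expert_runs": 16, "ikon": 1, "epic": 0},
    {"ski_resort": "Telluride", "expert_runs": 34, "ikon": 0, "epic": 0},
]
model = StubModel({"Snowbasin": 4.9, "Snowbird": 4.8, "Sunday River": 4.7, "Telluride": 4.6})

print(cascade_recommend("demo_user", resorts, model, min_expert_runs=10, pass_name="ikon"))
# ['Snowbird', 'Sunday River']
```

Running the filter first means the collaborative model never scores resorts the user has already ruled out, so a highly rated resort cannot crowd out one that fits the stated constraints.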

The **Cascade Hybrid Modeling** notebook will be saved in the main GitHub repository.