# Avant Ski - Content Based System

by: Stephanie Ciaccia

# Overview

Skiing holds a prominent place for those seeking winter recreational activities in the United States. With its stunning mountain ranges and diverse terrain, the country boasts numerous ski resorts that cater to all skill levels, from beginners to seasoned professionals. Skiing offers a unique blend of adventure, physical activity, and natural beauty, making it a popular choice for winter enthusiasts seeking both relaxation and excitement.

The ski market in the United States is thriving, contributing significantly to the economy. According to the [National Ski Areas Association (NSAA)](https://nsaa.org/webdocs/Media_Public/IndustryStats/Historical_Skier_Days_1979_2022.pdf), U.S. ski areas recorded approximately 60.7 million skier and snowboarder visits across 473 ski resorts in the 2021-2022 winter season.

# Business Problem 
Skiing, an exhilarating winter sport cherished by many, often involves time-consuming and daunting trip planning. The sheer abundance of ski resorts available makes it overwhelming to choose the ideal destination, and existing ski websites lack the necessary tools to filter options based on individual preferences.

To address these challenges, I'm developing Avant Ski, a ski resort recommendation app. Avant Ski simplifies the ski resort selection process by leveraging data and user preferences. With dynamic filtering features, users can personalize their search based on budget, location, amenities, and skill level. By bridging the gap between ski enthusiasts and their dream destinations, Avant Ski makes skiing accessible to a wider audience, empowering them to plan unforgettable ski trips with confidence.

Since data plays a crucial role in this application, I plan to showcase the app to representatives from different ski resorts across the USA at the National Ski Areas Association Winter Conference. This presentation aims to foster partnerships and encourage resort feature sharing between Avant Ski and these resorts once the app is launched.

# Data Understanding

In [1]:
import pandas as pd
import numpy as np
import math
# importing the module only; a second "from datetime import datetime" would be shadowed by this name
import datetime
from scipy import stats

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
%matplotlib inline
import plotly
import plotly.express as px
import plotly.io as pio
from matplotlib.ticker import StrMethodFormatter

from collections import Counter
from nltk.corpus import stopwords

from IPython.display import display

from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.neighbors import NearestNeighbors

from surprise import SVDpp, SVD
from surprise import accuracy
from surprise import Dataset
from surprise import Reader
from surprise.model_selection import GridSearchCV, cross_validate, train_test_split

import glob
import os

In [2]:
def print_full(x):
    pd.set_option('display.max_rows', len(x))
    print(x)
    pd.reset_option('display.max_rows')

## Importing Data
### Data Source #1 - Final Feature Data
Importing the main ski resort feature dataframe, which I scraped from OnTheSnow and cleaned in the cleaning notebook.

In [3]:
content_df = pd.read_csv("data/cleaned_data_exports/scraped_feature_df.csv")

In [4]:
content_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 330 entries, 0 to 329
Columns: 103 entries, Unnamed: 0 to total_lifts
dtypes: float64(21), int64(69), object(13)
memory usage: 265.7+ KB


In [5]:
content_df.head()

Unnamed: 0.1,Unnamed: 0,ski_resort,address,city,state,zipcode,summit,drop,base,gondolas_and_trams,...,long,airport_2,distance_2,lat_2,long_2,airport_3,distance_3,lat_3,long_3,total_lifts
0,0,Palisades Tahoe,PO Box 2007,Olympic Valley,California,96146,9050,2850,6200,3.0,...,-120.139563,Minden-Tahoe,47,39.000309,-119.750806,Greenville Muni,3357,41.446832,-80.391262,36
1,1,Mammoth Mountain,P.O. Box 24,Mammoth Mountain Lakes,California,93546,11053,3100,7953,3.0,...,-118.837772,Bryant,71,38.262419,-119.225709,Grand Rapids-Itasca County,2330,47.211103,-93.509845,25
2,2,Donner Ski Ranch,P.O. Box 66,Norden,California,95724,8012,750,7031,0.0,...,-120.139563,Reno/Tahoe International,54,39.498576,-119.768065,Mapleton Municipal,2085,42.178295,-95.793645,8
3,3,Sugar Bowl,P.O. Box 5,Norden,California,95724,8383,1500,6883,1.0,...,-120.139563,Reno/Tahoe International,54,39.498576,-119.768065,Mapleton Municipal,2085,42.178295,-95.793645,12
4,4,Kirkwood,PO Box 1,Kirkwood,California,95646,9800,2000,7800,0.0,...,-119.995335,Minden-Tahoe,43,39.000309,-119.750806,Nogales International,1165,31.417722,-110.847889,13


### Data Source #2 - Final User/Review Data

Importing final cleaned user review data from the cleaning notebook.

In [6]:
final_user_df = pd.read_csv("data/cleaned_data_exports/user_df_model.csv")

In [7]:
final_user_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2521 entries, 0 to 2520
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Unnamed: 0   2521 non-null   int64 
 1   review_date  2521 non-null   object
 2   state        2521 non-null   object
 3   ski_resort   2521 non-null   object
 4   rating       2521 non-null   int64 
 5   review       2521 non-null   object
 6   user_name    2521 non-null   object
dtypes: int64(2), object(5)
memory usage: 138.0+ KB


In [8]:
final_user_df.head()

Unnamed: 0.1,Unnamed: 0,review_date,state,ski_resort,rating,review,user_name
0,0,2023-05-04,Colorado,Winter Park,4,Very family friendly,anon_1
1,1,2023-05-04,Colorado,Arapahoe Basin,5,Challenging terrain no frills,anon_1
2,2,2023-05-04,Colorado,Steamboat,5,Great public transport to and from lodging,anon_1
3,3,2023-05-04,Colorado,Copper Mountain,5,Extremely diverse terrain and fantastic terrai...,anon_1
4,4,2023-05-04,Utah,Solitude Mountain,5,y had so much terrain especially for the start...,anon_2


### Content Modeling

To begin our content modeling, we will need to create a feature matrix that stores all of the feature information, with one row per ski resort. This matrix will allow us to calculate the similarities between item vectors, so we can determine which ski resorts are similar.

In [9]:
content_df.columns.to_list()

['Unnamed: 0',
 'ski_resort',
 'address',
 'city',
 'state',
 'zipcode',
 'summit',
 'drop',
 'base',
 'gondolas_and_trams',
 'fast_eight',
 'high_speed_sixes',
 'quad_chairs',
 'triple_chairs',
 'double_chairs',
 'surface_lifts',
 'total_runs',
 'longest_run',
 'skiable_terrain',
 'snow_making',
 'daysOpenLastYear',
 'averageSnowfall',
 'projectedOpening',
 'projectedClosing',
 'nov_snow',
 'dec_snow',
 'jan_snow',
 'feb_snow',
 'mar_snow',
 'apr_snow',
 'childrenWeekdayPrice',
 'childrenWeekendPrice',
 'teenagerWeekdayPrice',
 'teenagerWeekendPrice',
 'adultWeekdayPrice',
 'adultWeekendPrice',
 'seniorWeekdayPrice',
 'seniorWeekendPrice',
 'childrenPrice_season',
 'teenagerPrice_season',
 'adultPrice_season',
 'Url',
 'beginner_runs',
 'intermediate_runs',
 'advanced_runs',
 'expert_runs',
 'night_skiing',
 'epic',
 'mountain_collective',
 'ikon',
 'indy',
 'dec_mean_2_guests',
 'dec_min_2_guests',
 'dec_max_2_guests',
 'jan_mean_2_guests',
 'jan_min_2_guests',
 'jan_max_2_guests',
 

In [10]:
#making a copy of the final dataframe
content_matrix = content_df.copy()

In [11]:
# columns that are identifiers, dates, or airport lookups rather than resort features
drop_list = ['address', 'zipcode', 'Url', 'Unnamed: 0',
             'daysOpenLastYear', 'projectedOpening', 'projectedClosing',
             'full_address', 'airport_1', 'distance_1', 'lat', 'long',
             'airport_2', 'distance_2', 'lat_2', 'long_2',
             'airport_3', 'distance_3', 'lat_3', 'long_3']

content_matrix.drop(columns=drop_list, inplace=True)

In [12]:
content_matrix.head()

Unnamed: 0,ski_resort,city,state,summit,drop,base,gondolas_and_trams,fast_eight,high_speed_sixes,quad_chairs,...,mar_max_4_guests,apr_mean_4_guests,apr_min_4_guests,apr_max_4_guests,may_mean_4_guests,may_min_4_guests,may_max_4_guests,latitude,longitude,total_lifts
0,Palisades Tahoe,Olympic Valley,California,9050,2850,6200,3.0,6,4,1,...,1320,463,178,1260,331,122,825,39.19698,-120.235705,36
1,Mammoth Mountain,Mammoth Mountain Lakes,California,11053,3100,7953,3.0,9,2,1,...,589,325,126,699,144,64,246,37.648546,-118.972079,25
2,Donner Ski Ranch,Norden,California,8012,750,7031,0.0,0,0,0,...,739,349,165,996,231,83,643,39.317356,-120.354182,8
3,Sugar Bowl,Norden,California,8383,1500,6883,1.0,5,0,3,...,739,349,165,996,245,120,643,39.317356,-120.354182,12
4,Kirkwood,Kirkwood,California,9800,2000,7800,0.0,2,0,2,...,1179,420,150,950,309,114,590,38.702308,-120.072244,13


### One Hot Encoding Categorical Variables

I will be one hot encoding the state column, as this is the only categorical column I am keeping for modeling. I would like to keep it in the final model, as the location of a resort often plays an important role in deciding where to ski.
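As a rough illustration of what one hot encoding produces, the sketch below uses `pd.get_dummies` on a made-up three-row frame. This is an alternative to the `OneHotEncoder` used in this notebook, not the code that built the model; the toy resort names and the `toy` and `toy_ohe` variables are assumptions for the example.

```python
import pandas as pd

# Toy data (hypothetical rows, not taken from the scraped dataset)
toy = pd.DataFrame(
    {"state": ["Colorado", "Utah", "Colorado"]},
    index=pd.Index(["Resort A", "Resort B", "Resort C"], name="ski_resort"),
)

# One column per state; 1.0 marks the state a resort belongs to
toy_ohe = pd.get_dummies(toy["state"], prefix="state", dtype=float)

print(toy_ohe)
```

Because `get_dummies` keeps the original index, the encoded frame in this sketch is already keyed by resort name and can be joined to other resort-indexed frames directly.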

In [13]:
# Instantiating OHE
ohe = OneHotEncoder()

# fit and transforming
ohe_state = pd.DataFrame(ohe.fit_transform(content_matrix[['state']]).toarray())

# renaming based on original names
# (get_feature_names was removed in scikit-learn 1.2; get_feature_names_out replaces it)
ohe_state.columns = ohe.get_feature_names_out(['state'])

In [14]:
ohe_state

Unnamed: 0,state_Alaska,state_Arizona,state_California,state_Colorado,state_Connecticut,state_Idaho,state_Illinois,state_Indiana,state_Iowa,state_Maine,...,state_Rhode Island,state_South Dakota,state_Tennessee,state_Utah,state_Vermont,state_Virginia,state_Washington,state_West Virginia,state_Wisconsin,state_Wyoming
0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
325,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
326,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
327,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
328,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [15]:
#setting index
ohe_state = ohe_state.set_index(content_matrix['ski_resort'])

In [16]:
ohe_state

ski_resort,state_Alaska,state_Arizona,state_California,state_Colorado,state_Connecticut,state_Idaho,state_Illinois,state_Indiana,state_Iowa,state_Maine,...,state_Rhode Island,state_South Dakota,state_Tennessee,state_Utah,state_Vermont,state_Virginia,state_Washington,state_West Virginia,state_Wisconsin,state_Wyoming
Palisades Tahoe,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Mammoth Mountain,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Donner Ski Ranch,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Sugar Bowl,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Kirkwood,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Oak Mountain,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Mt. Pleasant,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Hunt Hollow,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Powder Ridge Connecticut,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [17]:
#resetting index as ski_resort
content_matrix = content_matrix.set_index("ski_resort")

#dropping the text and location columns (state is one hot encoded separately)
final_content_matrix = content_matrix.drop(columns=["state", "city", 'latitude','longitude'])

#filling null matrix values with 0
final_content_matrix = final_content_matrix.fillna(0)

In [18]:
final_content_matrix.info()

<class 'pandas.core.frame.DataFrame'>
Index: 330 entries, Palisades Tahoe to Shawnee Mountain
Data columns (total 78 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   summit                330 non-null    int64  
 1   drop                  330 non-null    int64  
 2   base                  330 non-null    int64  
 3   gondolas_and_trams    330 non-null    float64
 4   fast_eight            330 non-null    int64  
 5   high_speed_sixes      330 non-null    int64  
 6   quad_chairs           330 non-null    int64  
 7   triple_chairs         330 non-null    int64  
 8   double_chairs         330 non-null    int64  
 9   surface_lifts         330 non-null    int64  
 10  total_runs            330 non-null    int64  
 11  longest_run           330 non-null    int64  
 12  skiable_terrain       330 non-null    int64  
 13  snow_making           330 non-null    int64  
 14  averageSnowfall       330 non-null    int64  
 15  n

### Scaling Data

I will be using StandardScaler to scale the values in the matrix to ensure they are on the same scale. This is necessary to continue modeling.
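To make the scaling step concrete, here is a minimal standard-library sketch of the z-score that `StandardScaler` applies to each column: subtract the column mean and divide by the population standard deviation. The `standardize` helper is a hypothetical name for illustration, and the three summit values are only the first rows shown in `content_df.head()`, so the results will not match the notebook's scaled values, which are computed over all 330 resorts.

```python
from statistics import fmean, pstdev


def standardize(values):
    """Return z-scores: (x - mean) / population standard deviation."""
    mu = fmean(values)
    sigma = pstdev(values)  # population std (ddof=0), as StandardScaler uses
    return [(x - mu) / sigma for x in values]


# Summit elevations (ft) of the first three resorts in content_df.head()
summits = [9050, 11053, 8012]
z = standardize(summits)

print([round(v, 3) for v in z])
```

After this transform every column has mean 0 and standard deviation 1, so a feature measured in thousands of feet no longer dominates one measured in single digits. A constant column would need special handling in this sketch, since its standard deviation is zero.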

In [19]:
#instantiating StandardScaler
scaler = StandardScaler()

#scaling array
scaled = scaler.fit_transform(final_content_matrix)

#saving as dataframe
scaled_ski_df = pd.DataFrame(scaled, index=final_content_matrix.index, columns=final_content_matrix.columns)

In [20]:
scaled_ski_df

ski_resort,summit,drop,base,gondolas_and_trams,fast_eight,high_speed_sixes,quad_chairs,triple_chairs,double_chairs,surface_lifts,...,mar_mean_4_guests,mar_min_4_guests,mar_max_4_guests,apr_mean_4_guests,apr_min_4_guests,apr_max_4_guests,may_mean_4_guests,may_min_4_guests,may_max_4_guests,total_lifts
Palisades Tahoe,1.186487,1.710642,0.902438,4.759106,2.238096,5.503542,0.006503,6.302092,1.902088,1.080304,...,2.541470,2.537409,2.180666,2.334115,1.841660,1.669874,2.000277,1.059825,1.368909,4.654503
Mammoth Mountain,1.722887,1.973881,1.465619,4.759106,3.594102,2.599995,0.006503,2.697776,1.325698,-1.290043,...,0.792226,1.139202,-0.155575,0.684433,0.415601,0.110550,-1.049510,-1.007546,-0.928538,2.801482
Donner Ski Ranch,0.908512,-0.500562,1.169411,-0.323434,-0.473917,-0.303553,-0.708794,-0.305821,1.902088,-0.341905,...,0.648301,2.019554,0.323818,0.971334,1.485145,0.936075,0.369375,-0.330304,0.646741,-0.062278
Sugar Bowl,1.007865,0.289154,1.121864,1.370746,1.786094,-0.303553,1.437096,-0.305821,-0.979864,-0.341905,...,0.902937,2.019554,0.323818,0.971334,1.485145,0.936075,0.597701,0.988536,0.646741,0.611548
Kirkwood,1.387336,0.815631,1.416465,-0.323434,0.430087,-0.303553,0.721800,2.097056,-0.403473,0.132165,...,1.733275,0.880274,1.730037,1.820084,1.073782,0.808216,1.641478,0.774670,0.436439,0.780005
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Oak Mountain,-0.594373,-0.605857,-0.527200,-0.323434,-0.473917,-0.303553,0.006503,-0.906540,-0.979864,0.132165,...,0.028315,-0.310791,0.419696,0.206265,0.388177,0.305118,0.173666,0.097428,-0.079395,-0.736104
Mt. Pleasant,-0.824680,-0.932273,-0.703897,-0.323434,-0.473917,-0.303553,-0.708794,-0.305821,-0.979864,-0.815974,...,-1.012374,-1.475964,-0.919408,-1.001111,-1.531517,-0.859511,-0.315604,-0.579814,-0.317473,-1.073017
Hunt Hollow,-0.693458,-0.421590,-0.768150,-0.323434,-0.473917,-0.303553,-0.708794,-0.305821,-0.403473,-0.815974,...,0.515447,0.336527,0.518771,0.469257,0.360753,0.391284,0.548774,0.240005,-0.142883,-0.904560
Powder Ridge Connecticut,-1.044275,-0.711153,-1.034802,-0.323434,-0.473917,-0.303553,-0.708794,-0.305821,0.172917,0.132165,...,-0.469887,-1.165251,1.084454,-0.367538,-1.175002,1.500322,-0.984274,-0.686747,-0.912666,-0.399191


In [21]:
#merging scaled_ski_df and one hot encoded dataframes
final_content_df = scaled_ski_df.join(ohe_state)

In [22]:
final_content_df.head()

ski_resort,summit,drop,base,gondolas_and_trams,fast_eight,high_speed_sixes,quad_chairs,triple_chairs,double_chairs,surface_lifts,...,state_Rhode Island,state_South Dakota,state_Tennessee,state_Utah,state_Vermont,state_Virginia,state_Washington,state_West Virginia,state_Wisconsin,state_Wyoming
Palisades Tahoe,1.186487,1.710642,0.902438,4.759106,2.238096,5.503542,0.006503,6.302092,1.902088,1.080304,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Mammoth Mountain,1.722887,1.973881,1.465619,4.759106,3.594102,2.599995,0.006503,2.697776,1.325698,-1.290043,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Donner Ski Ranch,0.908512,-0.500562,1.169411,-0.323434,-0.473917,-0.303553,-0.708794,-0.305821,1.902088,-0.341905,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Sugar Bowl,1.007865,0.289154,1.121864,1.370746,1.786094,-0.303553,1.437096,-0.305821,-0.979864,-0.341905,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Kirkwood,1.387336,0.815631,1.416465,-0.323434,0.430087,-0.303553,0.7218,2.097056,-0.403473,0.132165,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Cosine Similarity

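I will start the content based modeling using cosine similarity to measure how similar each pair of ski resorts is. Scores range from -1 to 1, and a higher score means the two resorts have more similar feature vectors.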
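For intuition, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below computes it by hand with the standard library; `cosine_sim` is a hypothetical helper, and the three short vectors are made-up stand-ins for scaled resort features, not rows from `final_content_df`.

```python
import math


def cosine_sim(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical scaled feature vectors for three resorts
big_a = [1.2, 1.7, 0.9]
big_b = [1.7, 2.0, 1.5]
small = [-0.8, -0.9, -0.7]

print(cosine_sim(big_a, big_b))  # two large resorts: close to 1
print(cosine_sim(big_a, small))  # large vs. small resort: negative
```

Because the features are standardized around zero, resorts that sit on opposite sides of the average end up with negative scores, which is why the similarity matrix in this notebook contains values below zero.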

In [23]:
sim_df = pd.DataFrame(cosine_similarity(final_content_df), index=final_content_df.index, columns=final_content_df.index)

In [24]:
sim_df.head()

ski_resort,Palisades Tahoe,Mammoth Mountain,Donner Ski Ranch,Sugar Bowl,Kirkwood,Boreal,Sierra at Tahoe,Mt. Rose Ski Tahoe,Soda Springs,Wolf Creek,...,Elko SnoBowl,Eagle Point,Pine Knob,Whaleback,Little Switzerland,Oak Mountain,Mt. Pleasant,Hunt Hollow,Powder Ridge Connecticut,Shawnee Mountain
Palisades Tahoe,1.0,0.823149,0.539314,0.735256,0.465232,0.372678,0.553026,-0.03623,0.266275,0.07518,...,-0.614285,-0.146584,-0.348659,-0.680172,-0.251267,-0.248502,-0.726121,-0.252279,-0.281585,-0.488876
Mammoth Mountain,0.823149,1.0,0.432411,0.749373,0.189272,0.408419,0.353691,0.13377,0.243271,0.277605,...,-0.50431,-0.055882,-0.412711,-0.609549,-0.416625,-0.277246,-0.539815,-0.352964,-0.386309,-0.363905
Donner Ski Ranch,0.539314,0.432411,1.0,0.698741,0.27933,0.566406,0.503379,-0.338611,0.673875,-0.152606,...,-0.708454,-0.095278,-0.364185,-0.641948,-0.061481,0.292695,-0.715639,-0.068328,-0.229705,-0.410521
Sugar Bowl,0.735256,0.749373,0.698741,1.0,0.387737,0.596819,0.508879,-0.081097,0.49112,0.172871,...,-0.696508,-0.02745,-0.356759,-0.668789,-0.288107,-0.014466,-0.718568,-0.275252,-0.373465,-0.567521
Kirkwood,0.465232,0.189272,0.27933,0.387737,1.0,0.383667,0.617872,-0.103257,0.236125,0.206694,...,-0.307853,0.233529,-0.230095,-0.462766,-0.002941,-0.114154,-0.575598,-0.145045,-0.130904,-0.567811


In [25]:
#saving final content dataframe
final_content_df.to_csv("data/cleaned_data_exports/final_content_df.csv")

In [26]:
#saving similarity matrix for streamlit modeling
sim_df.to_csv("data/cleaned_data_exports/similarity_matrix.csv")

### Function Building

In [28]:
# Input for mountain name
mountain_name = str(input("What is your favorite ski resort? "))

# input to ask user how many recommendations they would like
n_recs = int(input('How many recommendations would you like? '))
    
#what month would you like to travel
travel_date = str(input('What month would you like to travel? '))

What is your favorite ski resort? Telluride
How many recommendations would you like? 5
What month would you like to travel? December


In [29]:
# Pulling out an individual mountain
y = sim_df.loc[[mountain_name]].T

#resetting index and sorting values
cos_sim_df = y.reset_index().sort_values(by=mountain_name, ascending=False).head(n_recs + 1)

In [30]:
cos_sim_df

ski_resort,ski_resort.1,Telluride
47,Telluride,1.0
42,Aspen Snowmass,0.833555
26,Solitude Mountain,0.785323
81,Beaver Creek,0.740816
21,Brighton,0.681397
39,Jackson Hole,0.677006


In [31]:
#collecting the content_matrix row for each recommended resort
rec_list = []

for x in cos_sim_df['ski_resort']:
    rec_df = content_matrix.loc[[x]]
    rec_list.append(rec_df)

#concatenating all the dataframes in rec_list into a single dataframe
rec_df = pd.concat(rec_list)

#selecting the columns to display
concat_df = rec_df[["city", "state", "summit", "drop", "base", "adultWeekdayPrice",
                           "beginner_runs", "intermediate_runs", "adultWeekendPrice", "expert_runs"]]

concat_df = concat_df.reset_index()

concat_df

Unnamed: 0,ski_resort,city,state,summit,drop,base,adultWeekdayPrice,beginner_runs,intermediate_runs,adultWeekendPrice,expert_runs
0,Telluride,Telluride,Colorado,13150,4425,8725,209.0,16,30,219.0,34.0
1,Aspen Snowmass,Aspen,Colorado,12510,4406,8104,189.0,0,0,199.0,0.0
2,Solitude Mountain,Brighton,Utah,10488,2494,7994,115.0,6,46,115.0,18.0
3,Beaver Creek,Vail,Colorado,11440,3340,8100,191.0,38,30,275.0,8.0
4,Brighton,Brighton,Utah,10500,1745,8755,85.0,0,0,85.0,0.0
5,Jackson Hole,Teton Village,Wyoming,10450,4139,6311,215.0,4,41,215.0,17.0


In [32]:
#filtering based on month to return airbnb prices and turning into dataframe
travel_date = travel_date.lower()

month = ["december", "january", "february", "march", "april", "may"]
month_abv = ["dec", "jan", "feb", "mar", "apr", "may"]

#for loop that changes the user input to the month abbreviation that's in the column names
selected_columns = []
for x, y in zip(month_abv, month):
    if travel_date == y:
        selected_columns = [x + "_mean_4_guests", x + "_mean_2_guests"]

result = rec_df[selected_columns]

#resetting index
result = result.reset_index()

result

Unnamed: 0,ski_resort,dec_mean_4_guests,dec_mean_2_guests
0,Telluride,425,308
1,Aspen Snowmass,624,316
2,Solitude Mountain,375,322
3,Beaver Creek,420,268
4,Brighton,375,322
5,Jackson Hole,436,254
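The month lookup above can also be sketched as a dictionary lookup. With the loop, an unrecognized month leaves `selected_columns` empty, so the result silently comes back without any price columns; the version below raises an error instead. `MONTH_ABBREVIATIONS` and `lodging_columns` are hypothetical names introduced for this example, and the column name pattern is taken from the columns used in this notebook.

```python
# Full month name -> abbreviation used in the lodging price column names
MONTH_ABBREVIATIONS = {
    "december": "dec", "january": "jan", "february": "feb",
    "march": "mar", "april": "apr", "may": "may",
}


def lodging_columns(travel_month):
    """Return the mean lodging-price column names for a month name."""
    key = travel_month.strip().lower()
    if key not in MONTH_ABBREVIATIONS:
        raise ValueError(f"No lodging prices for month: {travel_month!r}")
    abbr = MONTH_ABBREVIATIONS[key]
    return [f"{abbr}_mean_4_guests", f"{abbr}_mean_2_guests"]


print(lodging_columns("December"))
```

Failing loudly here is a design choice: a typo such as "Decmber" is reported to the user rather than producing a table with missing price columns.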


Testing the first part of the function by hard coding example user inputs.

In [33]:
#hard-coded favorite ski resort, used to look up its row in the similarity matrix
mountain_name = "Park City Mountain"

#hard-coded number of recommendations
n_recs = 3

#hard-coded travel month
travel_date = "March"
    
# Pulling out an individual resort
y = sim_df.loc[[mountain_name]].T

#sorting values by similarity score
cos_sim_df = y.reset_index().sort_values(by=mountain_name, ascending=False).head(n_recs + 1)

#collecting the content_matrix row for each recommended resort
rec_list = []

for x in cos_sim_df['ski_resort']:
    rec_df = content_matrix.loc[[x]]
    rec_list.append(rec_df)

#concatenating all the dataframes in rec_list into a single dataframe
rec_df = pd.concat(rec_list)

#selecting the columns to display
concat_df = rec_df[["city", "state", "summit", "drop", "base", "adultWeekdayPrice",
                           "beginner_runs", "intermediate_runs", "adultWeekendPrice", "expert_runs"]]

concat_df = concat_df.reset_index()

#filtering based on month to return airbnb prices and turning into dataframe
travel_date = travel_date.lower()

month = ["december", "january", "february", "march", "april", "may"]
month_abv = ["dec", "jan", "feb", "mar", "apr", "may"]

#for loop that changes the user input to the month abbreviation that's in the column names
selected_columns = []

for x, y in zip(month_abv, month):
    if travel_date == y:
        selected_columns = [x + "_mean_4_guests", x + "_mean_2_guests"]

result = rec_df[selected_columns]

#resetting index
result = result.reset_index()

#merging dataframes 
final_concat_df = pd.merge(concat_df, result, on="ski_resort")
    
#dropping mountain name from the results
final_concat_df = final_concat_df[final_concat_df.ski_resort != mountain_name]

#showing final dataframe
final_concat_df.head(n_recs)

Unnamed: 0,ski_resort,city,state,summit,drop,base,adultWeekdayPrice,beginner_runs,intermediate_runs,adultWeekendPrice,expert_runs,mar_mean_4_guests,mar_mean_2_guests
1,Breckenridge,Breckenridge,Colorado,12998,3398,9600,149.0,13,23,179.0,28.0,436,258
2,Keystone,Keystone,Colorado,12408,3128,9280,195.0,16,43,225.0,0.0,335,255
3,Vail,Vail,Colorado,11570,3450,8120,225.0,23,35,245.0,2.0,462,338


Making final function. I will need to add a line of code to drop the row where ski_resort matches the user's input to ensure this is not part of their recommendations.

In [34]:
# Content-based model
def content_model():
    
    #user inputs
    n_recs = int(input('How many resort recommendations do you want? '))
    mountain_name = str(input("What's your favorite ski resort? "))
    travel_date = str(input('What month would you like to travel? '))
    
    # Pulling out an individual resort
    y = sim_df.loc[[mountain_name]].T

    #sorting values by similarity score
    cos_sim_df = y.reset_index().sort_values(by=mountain_name, ascending=False).head(n_recs + 1)

    #collecting the content_matrix row for each recommended resort
    rec_list = []

    for x in cos_sim_df['ski_resort']:
        rec_df = content_matrix.loc[[x]]
        rec_list.append(rec_df)

    #concatenating all the dataframes in rec_list into a single dataframe
    rec_df = pd.concat(rec_list)

    #selecting the columns to display
    concat_df = rec_df[["city", "state", "summit", "drop", "base", "adultWeekdayPrice", "adultWeekendPrice",
                           "beginner_runs", "intermediate_runs", "advanced_runs", "expert_runs"]]
    concat_df = concat_df.reset_index()

    #filtering based on month to return airbnb prices and turning into dataframe
    travel_date = travel_date.lower()

    month = ["december", "january", "february", "march", "april", "may"]
    month_abv = ["dec", "jan", "feb", "mar", "apr", "may"]

    selected_columns = []
    for x, y in zip(month_abv, month):
        if travel_date == y:
            selected_columns = [x + "_mean_4_guests", x + "_mean_2_guests"]

    result = rec_df[selected_columns]
    result = result.reset_index()

    #merging dataframes 
    final_concat_df = pd.merge(concat_df, result, on="ski_resort")
    
    #dropping mountain name from the results
    final_concat_df = final_concat_df[final_concat_df.ski_resort != mountain_name]

    #showing final dataframe
    return(final_concat_df.head(n_recs))

I will now test the content model. After inputting a few different resorts, the recommendations are aligned in terms of mountain characteristics.

In [35]:
content_model()

How many resort recommendations do you want? 3
What's your favorite ski resort? Telluride
What month would you like to travel? December


Unnamed: 0,ski_resort,city,state,summit,drop,base,adultWeekdayPrice,adultWeekendPrice,beginner_runs,intermediate_runs,advanced_runs,expert_runs,dec_mean_4_guests,dec_mean_2_guests
1,Aspen Snowmass,Aspen,Colorado,12510,4406,8104,189.0,199.0,0,0,0,0.0,624,316
2,Solitude Mountain,Brighton,Utah,10488,2494,7994,115.0,115.0,6,46,30,18.0,375,322
3,Beaver Creek,Vail,Colorado,11440,3340,8100,191.0,275.0,38,30,24,8.0,420,268
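Because `content_model()` reads its inputs with `input()`, it is awkward to test or reuse outside the notebook. The sketch below shows one way the similarity lookup at its core could take arguments instead. `recommend_similar` and `toy_sim` are hypothetical names, and the scores are made up rather than taken from `sim_df`.

```python
import pandas as pd


def recommend_similar(sim_df, resort, n_recs):
    """Return the n_recs resorts most similar to `resort`, excluding itself."""
    if resort not in sim_df.index:
        raise KeyError(f"Unknown resort: {resort!r}")
    # Drop the resort's own entry (similarity 1.0) before ranking
    scores = sim_df.loc[resort].drop(labels=resort)
    return scores.nlargest(n_recs)


# Toy similarity matrix (made-up scores, not the notebook's sim_df)
names = ["A", "B", "C", "D"]
toy_sim = pd.DataFrame(
    [[1.0, 0.8, 0.1, 0.5],
     [0.8, 1.0, 0.3, 0.2],
     [0.1, 0.3, 1.0, 0.6],
     [0.5, 0.2, 0.6, 1.0]],
    index=names, columns=names,
)

print(recommend_similar(toy_sim, "A", 2))
```

Dropping the resort's own entry before ranking removes the need to request `n_recs + 1` rows and filter the favorite resort out afterwards, and the explicit check gives a clearer message when the resort name is misspelled.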


### Collaborative Model

Importing the final cleaned user/review surprise dataframe from the collaborative model notebook and using the final params from the best model in the collaborative notebook.

The best model was SVD Grid Search #1, which gave an RMSE of 0.90 with the following parameters:

- n_factors = 140
- n_epochs = 40
- biased = True
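To show what these parameters control, here is a minimal sketch of how a biased SVD model such as Surprise's `SVD` forms a prediction: the global mean rating, plus a user bias, plus an item bias, plus the dot product of the user and item factor vectors, clipped to the rating scale. `predict_rating` is a hypothetical helper and every number below is made up; the fitted model learns these values from the review data.

```python
def predict_rating(mu, b_u, b_i, p_u, q_i, scale=(1, 5)):
    """Biased matrix-factorization estimate, clipped to the rating scale."""
    raw = mu + b_u + b_i + sum(p * q for p, q in zip(p_u, q_i))
    low, high = scale
    return min(high, max(low, raw))


# Made-up parameters with 3 latent factors (the notebook uses n_factors=140)
estimate = predict_rating(
    mu=4.1,               # global mean rating
    b_u=0.2,              # this user rates slightly above average
    b_i=-0.1,             # this resort is rated slightly below average
    p_u=[0.5, -0.2, 0.1],  # user factors
    q_i=[0.4, 0.3, -0.6],  # resort factors
)

print(round(estimate, 2))
```

In this picture `n_factors` is the length of the `p_u` and `q_i` vectors, `n_epochs` is the number of training passes used to learn them, and `biased=True` keeps the `b_u` and `b_i` terms.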

In [36]:
final_user_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2521 entries, 0 to 2520
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Unnamed: 0   2521 non-null   int64 
 1   review_date  2521 non-null   object
 2   state        2521 non-null   object
 3   ski_resort   2521 non-null   object
 4   rating       2521 non-null   int64 
 5   review       2521 non-null   object
 6   user_name    2521 non-null   object
dtypes: int64(2), object(5)
memory usage: 138.0+ KB


In [37]:
final_user_df.drop(columns="Unnamed: 0", inplace=True)

In [38]:
from surprise import Reader, Dataset
from surprise.model_selection import GridSearchCV, cross_validate, train_test_split

#copying final review dataframe
surprise_df = final_user_df.copy()

#dropping unneeded columns
surprise_df = surprise_df[['user_name', 'ski_resort', 'rating']]

# counting the number of reviews for each user
value_counts = surprise_df['user_name'].value_counts()

# selecting only users with at least three reviews
selected_users = value_counts[value_counts > 2].index

# selecting only the rows where the user_name is in the selected_users list
surprise_df = surprise_df[surprise_df['user_name'].isin(selected_users)]

#saving for streamlit app
surprise_df.to_csv("data/cleaned_data_exports/surprise_df.csv")

In [39]:
surprise_df.head()

Unnamed: 0,user_name,ski_resort,rating
0,anon_1,Winter Park,4
1,anon_1,Arapahoe Basin,5
2,anon_1,Steamboat,5
3,anon_1,Copper Mountain,5
4,anon_2,Solitude Mountain,5


In [40]:
#saving Reader information
reader = Reader(rating_scale=(1, 5))

#loading final dataset
data = Dataset.load_from_df(surprise_df[['user_name', 'ski_resort', 'rating']], reader)

#making trainset
trainset = data.build_full_trainset()

#instantiating model and training
algo = SVD(n_factors=140, n_epochs=40, biased=True, random_state=42)
algo.fit(trainset) 

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7ff240d2e5e0>

In [41]:
#saving new dataframe with only user information
user_df = surprise_df.reset_index()
user_df.set_index('user_name', inplace = True)
user_df.drop(columns = ['rating', 'index'], inplace =True)
user_df.head()

user_name,ski_resort
anon_1,Winter Park
anon_1,Arapahoe Basin
anon_1,Steamboat
anon_1,Copper Mountain
anon_2,Solitude Mountain


In [42]:
#looking at number of users and items
print('Number of users: ', trainset.n_users, '\n')
print('Number of items: ', trainset.n_items)

Number of users:  534 

Number of items:  269


# Collaborative and Content Based Models

### Final Collaborative Model

In [43]:
#Collaborative model
def collaborative_model():
    
    user = str(input('Name: '))
    n_recs = int(input('How many resort recommendations do you want? '))
    
    have_rated = list(user_df.loc[user, 'ski_resort'])
    not_rated = content_df.copy()
    not_rated = not_rated.loc[~not_rated['ski_resort'].isin(have_rated)]  # & (not_rated['state'] == state)]
    not_rated = not_rated.drop_duplicates(subset=['ski_resort'])
    not_rated.reset_index(inplace=True)
    not_rated['predicted_rating'] = not_rated['ski_resort'].apply(lambda x: algo.predict(user, x).est)
    not_rated.sort_values(by='predicted_rating', ascending=False, inplace=True)
    not_rated = not_rated[['ski_resort', 'state', 'city', "adultWeekdayPrice", "adultWeekendPrice", 'summit', 'drop',
                           'base','ikon', 'epic','mountain_collective',
                          'advanced_runs',  'intermediate_runs', 'expert_runs', 'predicted_rating']].copy()

    return not_rated.head(n_recs)

In [44]:
collaborative_model()

Name: Stephanie Ciaccia
How many resort recommendations do you want? 3


Unnamed: 0,ski_resort,state,city,adultWeekdayPrice,adultWeekendPrice,summit,drop,base,ikon,epic,mountain_collective,advanced_runs,intermediate_runs,expert_runs,predicted_rating
154,Whiteface Mountain,New York,Wilmington,115.0,115.0,4650,3430,1220,0,0,0,31,46,0.0,4.437569
47,Taos Ski Valley,New Mexico,Taos Ski Valley Ski Valley,195.0,195.0,12481,3281,9200,1,0,1,30,16,40.0,4.431001
83,Granite Peak,Wisconsin,Wausau,95.0,105.0,1942,700,1242,0,0,0,0,0,0.0,4.428575


### Final Content Model

In [45]:
# Content-based model
def content_model():
    
    #user inputs
    n_recs = int(input('How many resort recommendations do you want? '))
    mountain_name = str(input("What's your favorite ski resort? "))
    travel_date = str(input('What month would you like to travel? '))
    
    # Pulling out an individual resort
    y = sim_df.loc[[mountain_name]].T

    #sorting values by similarity score
    cos_sim_df = y.reset_index().sort_values(by=mountain_name, ascending=False).head(n_recs + 1)
    
    #making list to hold the matching rows
    rec_list = []
    
    #grabbing rows from content_matrix 
    for x in cos_sim_df['ski_resort']:
        rec_df = content_matrix.loc[[x]]
        rec_list.append(rec_df)

    #concatenating all the dataframes in rec_list into a single dataframe
    rec_df = pd.concat(rec_list)

    #selecting the columns to show in the final output
    concat_df = rec_df[["city", "state", "summit", "drop", "base", "adultWeekdayPrice", "adultWeekendPrice", 
                           'ikon', 'epic','mountain_collective',
                        "beginner_runs", "intermediate_runs", "advanced_runs", "expert_runs"]]
    concat_df = concat_df.reset_index()

    #filtering based on month to return airbnb prices and turning into dataframe
    travel_date = travel_date.lower()

    month = ["december", "january", "february", "march", "april", "may"]
    month_abv = ["dec", "jan", "feb", "mar", "apr", "may"]

    selected_columns = []
    for x, y in zip(month_abv, month):
        if travel_date == y:
            selected_columns = [x + "_mean_4_guests", x + "_mean_2_guests"]

    result = rec_df[selected_columns]
    result = result.reset_index()

    #merging dataframes 
    final_concat_df = pd.merge(concat_df, result, on="ski_resort")
    
    #dropping mountain name from the results
    final_concat_df = final_concat_df[final_concat_df.ski_resort != mountain_name]

    #showing final dataframe
    return(final_concat_df.head(n_recs))

In [46]:
content_model()

How many resort recommendations do you want? 5
What's your favorite ski resort? Telluride
What month would you like to travel? December


Unnamed: 0,ski_resort,city,state,summit,drop,base,adultWeekdayPrice,adultWeekendPrice,ikon,epic,mountain_collective,beginner_runs,intermediate_runs,advanced_runs,expert_runs,dec_mean_4_guests,dec_mean_2_guests
1,Aspen Snowmass,Aspen,Colorado,12510,4406,8104,189.0,199.0,1,0,1,0,0,0,0.0,624,316
2,Solitude Mountain,Brighton,Utah,10488,2494,7994,115.0,115.0,1,0,0,6,46,30,18.0,375,322
3,Beaver Creek,Vail,Colorado,11440,3340,8100,191.0,275.0,0,1,0,38,30,24,8.0,420,268
4,Brighton,Brighton,Utah,10500,1745,8755,85.0,85.0,1,0,0,0,0,0,0.0,375,322
5,Jackson Hole,Teton Village,Wyoming,10450,4139,6311,215.0,215.0,1,0,1,4,41,38,17.0,436,254
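
The content model ranks resorts by their score in `sim_df`. The variable name `cos_sim_df` suggests these scores are cosine similarities; assuming so, the sketch below shows how one such score could be computed from two resort feature vectors. The function name and the feature values are hypothetical, and in practice the features would usually be scaled before comparison so that large-valued columns such as elevation do not dominate.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # no direction to compare against
    return dot / (norm_a * norm_b)


# Hypothetical, already-scaled feature vectors for two resorts
resort_a = [0.8, 0.3, 0.5]
resort_b = [0.7, 0.4, 0.6]

print(round(cosine_similarity(resort_a, resort_b), 3))
```

A score near 1 means the two resorts have similar feature profiles, which is why sorting a row of `sim_df` in descending order puts the closest matches first.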


## Cascade-Hybrid Model

I will be creating a cascade hybrid model for my final recommendation system. 

Collaborative user-based models are common on music and streaming platforms, but they have limitations when applied to ski trip planning, where the opportunity cost of a poor recommendation is much higher.

The hybrid model begins with a collaborative model and takes the top 30 resorts, then refines the final recommendations with the content-based system, which uses the user's input as a guide.

In combining the models, a few adjustments were needed:
- Since it is not guaranteed that the collaborative model will select the mountain the user entered as their preference, I added the user's mountain as the top recommendation for the collaborative model. I did this by adding the row to the final output dataframe and setting its predicted rating to 5 so that it appears at the top of all results.
- I adjusted the content-based model to use the final dataframe from the collaborative model. This included the top 50 results.
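
A cascade can be read as two stages: one model produces a shortlist and the other orders it. The minimal sketch below follows the stage order used in the `hybrid_model_content` cell later in this section (similarity shortlist first, predicted rating second); swapping the two stages gives a collaborative-first variant. The function name, the resort names, and the scores are all hypothetical.

```python
def cascade_rank(similarity, predicted_rating, favorite, shortlist_size=30, n_recs=5):
    """Two-stage cascade: similarity shortlist, then order by predicted rating."""
    # Stage 1: most similar resorts to the favorite, excluding the favorite itself
    by_similarity = sorted(similarity, key=similarity.get, reverse=True)
    shortlist = [r for r in by_similarity if r != favorite][:shortlist_size]

    # Stage 2: order the shortlist by predicted rating; missing predictions sort last
    ranked = sorted(shortlist,
                    key=lambda r: predicted_rating.get(r, float("-inf")),
                    reverse=True)
    return ranked[:n_recs]


# Toy scores for hypothetical resorts (not real data)
similarity = {"Favorite": 1.0, "Resort A": 0.9, "Resort B": 0.8,
              "Resort C": 0.7, "Resort D": 0.1}
predicted_rating = {"Resort A": 3.5, "Resort B": 4.6,
                    "Resort C": 4.1, "Resort D": 4.9}

top = cascade_rank(similarity, predicted_rating, "Favorite",
                   shortlist_size=3, n_recs=2)
print(top)
```

With these toy numbers, Resort D has the highest predicted rating but never reaches the second stage because it is not similar enough to the favorite. That is the trade-off a cascade makes: the first stage decides what is eligible, and the second only reorders it.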

In [47]:
# User inputs
user = "Stephanie Ciaccia"
n_recs = 5
mountain_name = "Stevens Pass"
travel_date = "December"
mtn_pass = "Epic"
    
# Pulling out an individual resort
y = sim_df.loc[[mountain_name]].T
    
#sorting values by similarity score
cos_sim_df = y.reset_index().sort_values(by=mountain_name, ascending=False)
    
#making list to hold the matching rows
rec_list = []

#grabbing rows from content_matrix for final output
for x in cos_sim_df['ski_resort']:
    rec_df = content_matrix.loc[[x]]
    rec_list.append(rec_df)

#concatenating all the dataframes in rec_list into a single dataframe
rec_df = pd.concat(rec_list)

#selecting the columns to show in the final output
concat_df = rec_df[["city", "state", "summit", "drop", "base","adultWeekdayPrice", "adultWeekendPrice",
                           "beginner_runs", "intermediate_runs", "advanced_runs", "expert_runs",
                        "ikon", "epic", "mountain_collective", 'indy']]
    
concat_df = concat_df.reset_index()

In [48]:
#filtering based on month to return airbnb prices and turning into dataframe
travel_date = travel_date.lower()

month = ["december", "january", "february", "march", "april", "may"]
month_abv = ["dec", "jan", "feb", "mar", "apr", "may"]

selected_columns = []
for x, y in zip(month_abv, month):
    if travel_date == y:
        selected_columns = [x + "_mean_4_guests", x + "_mean_2_guests"]

result = rec_df[selected_columns]
result = result.reset_index()                        
content_recommendations = pd.merge(concat_df, result, on="ski_resort")
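
In the month lookup above, a month that is not in the list (for example "July", or a misspelling) leaves `selected_columns` empty, so the Airbnb price columns are silently dropped from the result. A minimal sketch of a stricter lookup follows; the helper name `airbnb_columns` and the dictionary are hypothetical, while the column naming pattern is the one used in the notebook.

```python
# Month names mapped to the abbreviations used in the Airbnb column names
MONTH_ABBREVIATIONS = {
    "december": "dec", "january": "jan", "february": "feb",
    "march": "mar", "april": "apr", "may": "may",
}


def airbnb_columns(travel_date):
    """Return the two Airbnb price columns for a month, or raise if unsupported."""
    key = travel_date.strip().lower()
    if key not in MONTH_ABBREVIATIONS:
        supported = ", ".join(name.title() for name in MONTH_ABBREVIATIONS)
        raise ValueError(f"Unsupported month {travel_date!r}; choose from: {supported}")
    abv = MONTH_ABBREVIATIONS[key]
    return [abv + "_mean_4_guests", abv + "_mean_2_guests"]


print(airbnb_columns("December"))
```

The returned list can be used in place of `selected_columns`, for example `rec_df[airbnb_columns(travel_date)]`.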

In [49]:
content_recommendations.head()

Unnamed: 0,ski_resort,city,state,summit,drop,base,adultWeekdayPrice,adultWeekendPrice,beginner_runs,intermediate_runs,advanced_runs,expert_runs,ikon,epic,mountain_collective,indy,dec_mean_4_guests,dec_mean_2_guests
0,Stevens Pass,Skykomish,Washington,5845,1800,4061,0.0,0.0,8,43,31,18.0,0,1,0,0,276,235
1,Timberline Lodge,Timberline Lodge,Oregon,8540,3690,6000,0.0,0.0,25,50,13,12.0,0,0,0,0,255,208
2,Boreal,Truckee,California,7700,500,7200,49.0,0.0,26,29,44,0.0,0,0,0,0,275,231
3,Bear Valley,Bear Valley,California,8500,1900,6600,0.0,0.0,11,41,45,4.0,0,0,0,0,270,235
4,Mt. Hood Meadows,Mt. Hood,Oregon,7300,2777,4523,0.0,0.0,0,0,0,0.0,0,0,0,1,267,184


In [50]:
#adding mountain pass filter
if mtn_pass == "Ikon":
    content_recommendations = content_recommendations.loc[content_recommendations['ikon'] == 1]
elif mtn_pass == "Epic":
    content_recommendations = content_recommendations.loc[content_recommendations['epic'] == 1]
elif mtn_pass == "Mountain_collective":
    content_recommendations = content_recommendations.loc[content_recommendations['mountain_collective'] == 1]
elif mtn_pass == "Indy":
    content_recommendations = content_recommendations.loc[content_recommendations['indy'] == 1]
elif mtn_pass == "No":
    pass
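
The pass filter above compares exact strings, so an input such as "epic" (lowercase) matches no branch and no filter is applied. One possible alternative is a case-insensitive dictionary lookup, sketched below on plain Python rows. The names `PASS_COLUMNS` and `filter_by_pass` and the example rows are hypothetical; the column names are the ones used in the notebook.

```python
# Pass names mapped to the 0/1 indicator columns used in the notebook
PASS_COLUMNS = {
    "ikon": "ikon",
    "epic": "epic",
    "mountain_collective": "mountain_collective",
    "indy": "indy",
}


def filter_by_pass(rows, mtn_pass):
    """Keep rows whose pass column equals 1; 'No' or unknown input keeps everything."""
    column = PASS_COLUMNS.get(mtn_pass.strip().lower())
    if column is None:
        return list(rows)
    return [row for row in rows if row.get(column) == 1]


# Hypothetical rows standing in for content_recommendations
rows = [
    {"ski_resort": "Resort A", "ikon": 1, "epic": 0, "mountain_collective": 0, "indy": 0},
    {"ski_resort": "Resort B", "ikon": 0, "epic": 1, "mountain_collective": 0, "indy": 0},
]

epic_only = filter_by_pass(rows, "epic")
print([row["ski_resort"] for row in epic_only])
```

With a DataFrame, the same lookup yields the column name to use in `content_recommendations.loc[content_recommendations[column] == 1]`.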

In [51]:
content_recommendations = content_recommendations[content_recommendations.ski_resort != mountain_name].head(20)

In [52]:
# Collaborative model
have_rated = list(user_df.loc[user, 'ski_resort'])
not_rated = final_user_df.copy()
not_rated = not_rated.loc[~not_rated['ski_resort'].isin(have_rated)]
not_rated = not_rated.drop_duplicates(subset=['ski_resort'])
not_rated.reset_index(inplace=True)
not_rated['predicted_rating'] = not_rated['ski_resort'].apply(lambda x: algo.predict(user, x).est)
not_rated.sort_values(by='predicted_rating', ascending=False, inplace=True)
collaborative_recommendations = not_rated[['ski_resort', 'predicted_rating']]

# Combine content-based and collaborative recommendations
combined_recommendations = pd.merge(content_recommendations, collaborative_recommendations, on='ski_resort', how='left')
combined_recommendations = combined_recommendations.drop_duplicates(subset=['ski_resort'])
combined_recommendations.sort_values(by='predicted_rating', ascending=False, inplace=True)
combined_recommendations.drop(columns=['ikon', 'mountain_collective', 'epic', 'indy'], inplace=True)
combined_recommendations.head(n_recs)

Unnamed: 0,ski_resort,city,state,summit,drop,base,adultWeekdayPrice,adultWeekendPrice,beginner_runs,intermediate_runs,advanced_runs,expert_runs,dec_mean_4_guests,dec_mean_2_guests,predicted_rating
2,Kirkwood,Kirkwood,California,9800,2000,7800,0.0,0.0,0,0,0,0.0,371,304,4.216628
6,Okemo Mountain,Ludlow,Vermont,3344,2200,1144,0.0,0.0,33,38,21,9.0,290,266,4.154102
12,Mt. Sunapee,Newbury,New Hampshire,2743,1510,1233,0.0,0.0,29,47,24,0.0,266,197,4.139407
15,Attitash,Bartlett,New Hampshire,2350,1750,600,79.0,89.0,26,46,28,0.0,211,164,4.112258
10,Crested Butte Mountain,Mt. Crested Butte Mountain,Colorado,12162,3062,9375,149.0,165.0,14,25,25,36.0,255,150,4.059768


In [53]:
def hybrid_model_content():
    
    # User inputs
    user = str(input('Name: '))
    n_recs = int(input('How many resort recommendations do you want? '))
    mountain_name = str(input("What's your favorite ski resort? "))
    travel_date = str(input('What month would you like to travel? '))
    mtn_pass = str(input('Are you using a multi-resort pass?  '))
    
    # Pulling out an individual resort
    y = sim_df.loc[[mountain_name]].T
    
    #sorting values by similarity score
    cos_sim_df = y.reset_index().sort_values(by=mountain_name, ascending=False)
    
    #making list to hold the matching rows
    rec_list = []
    
    #grabbing rows from content_matrix for final output
    for x in cos_sim_df['ski_resort']:
        rec_df = content_matrix.loc[[x]]
        rec_list.append(rec_df)

    #concatenating all the dataframes in rec_list into a single dataframe
    rec_df = pd.concat(rec_list)

    #selecting the columns to show in the final output
    concat_df = rec_df[["city", "state", "summit", "drop", "base","adultWeekdayPrice", "adultWeekendPrice",
                           "beginner_runs", "intermediate_runs", "advanced_runs", "expert_runs",
                        "ikon", "epic", "mountain_collective", 'indy','nov_snow', 'dec_snow',
                        'jan_snow', 'feb_snow', 'mar_snow','apr_snow']]
    
    concat_df = concat_df.reset_index()

    #filtering based on month to return airbnb prices and turning into dataframe
    travel_date = travel_date.lower()

    month = ["december", "january", "february", "march", "april", "may"]
    month_abv = ["dec", "jan", "feb", "mar", "apr", "may"]

    selected_columns = []
    for x, y in zip(month_abv, month):
        if travel_date == y:
            selected_columns = [x + "_mean_4_guests", x + "_mean_2_guests"]

    result = rec_df[selected_columns]
    result = result.reset_index()                        
    content_recommendations = pd.merge(concat_df, result, on="ski_resort")
    
    #adding mountain pass filter
    if mtn_pass == "Ikon":
        content_recommendations = content_recommendations.loc[content_recommendations['ikon'] == 1]
    elif mtn_pass == "Epic":
        content_recommendations = content_recommendations.loc[content_recommendations['epic'] == 1]
    elif mtn_pass == "Mountain_collective":
        content_recommendations = content_recommendations.loc[content_recommendations['mountain_collective'] == 1]
    elif mtn_pass == "Indy":
        content_recommendations = content_recommendations.loc[content_recommendations['indy'] == 1]
    elif mtn_pass == "No":
        pass
    
    content_recommendations = content_recommendations[content_recommendations.ski_resort != mountain_name].head(30)

    # Collaborative model
    have_rated = list(user_df.loc[user, 'ski_resort'])
    not_rated = final_user_df.copy()
    not_rated = not_rated.loc[~not_rated['ski_resort'].isin(have_rated)]
    not_rated = not_rated.drop_duplicates(subset=['ski_resort'])
    not_rated.reset_index(inplace=True)
    not_rated['predicted_rating'] = not_rated['ski_resort'].apply(lambda x: algo.predict(user, x).est)
    not_rated.sort_values(by='predicted_rating', ascending=False, inplace=True)
    collaborative_recommendations = not_rated[['ski_resort', 'predicted_rating']]

    # Combine content-based and collaborative recommendations
    combined_recommendations = pd.merge(content_recommendations, collaborative_recommendations, on='ski_resort', how='left')
    combined_recommendations = combined_recommendations.drop_duplicates(subset=['ski_resort'])
    combined_recommendations.sort_values(by='predicted_rating', ascending=False, inplace=True)
    combined_recommendations.drop(columns=['ikon', 'mountain_collective', 'epic', 'indy'], inplace=True)
    return combined_recommendations.head(n_recs)
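
One detail of the combining step is worth spelling out. Because the merge uses `how='left'`, a shortlisted resort with no row in the collaborative frame keeps its place in the result with a missing `predicted_rating`, and `sort_values` places missing values last by default. The small sketch below illustrates this behavior with made-up resorts and ratings.

```python
import pandas as pd

# Hypothetical stand-ins for content_recommendations and collaborative_recommendations
content = pd.DataFrame({
    "ski_resort": ["Resort A", "Resort B", "Resort C"],
    "state": ["Utah", "Vermont", "Colorado"],
})
collab = pd.DataFrame({
    "ski_resort": ["Resort A", "Resort C"],
    "predicted_rating": [4.1, 4.6],
})

# Left merge keeps every shortlisted resort, even without a prediction
combined = pd.merge(content, collab, on="ski_resort", how="left")
combined = combined.sort_values(by="predicted_rating", ascending=False)

# Resorts that were shortlisted but have no predicted rating
unrated = combined[combined["predicted_rating"].isna()]

print(combined)
print(list(unrated["ski_resort"]))
```

If unrated resorts should not be recommended at all, dropping them with `dropna(subset=['predicted_rating'])` before calling `head(n_recs)` would be one option.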

## Function Testing - Recommendation Analysis

I will be testing the model results with two users who filled out the resort survey. For each user, I created a profile based on a set of questions I asked them. I removed their last names from the survey.

**Alexandria K.**
- Dislikes "bougie" resorts
- Travels to shred
- Looks for expert runs and accessible transportation
- Buys the Epic pass but dislikes Vail and corporate ski vibes
- Budget-friendly planning

**Raghava K.**
- Loves expert terrain and well-marked trails
- All about the après-ski life
- Travels to shred but wants to have fun while doing it
- Uses both Epic & Ikon passes

#### Alexandria's Results

Alexandria tested the model with the [deployed Streamlit app](https://stephcia-ski-recommendation-system-ski-model-stephanie-zs77j6.streamlit.app/) and gave me her opinion on the recommendations. The output of the deployed app differed slightly from the notebook output below, for reasons I have not identified.

**Inputs**
- Name: Alexandria K.
- How many resort recommendations do you want? 3
- What's your favorite ski resort? Snowbird
- What month would you like to travel? February
- Are you using a multi-resort pass?  Epic

**Outputs**
- Kirkwood
- Breckenridge
- Wildcat

Feedback: I would not go to Wildcat, but Kirkwood I would. I have been to Breckenridge, which I did not rate in the survey and I do like the mountain.

In [54]:
content_model()

How many resort recommendations do you want? 5
What's your favorite ski resort? Stevens Pass
What month would you like to travel? December


Unnamed: 0,ski_resort,city,state,summit,drop,base,adultWeekdayPrice,adultWeekendPrice,ikon,epic,mountain_collective,beginner_runs,intermediate_runs,advanced_runs,expert_runs,dec_mean_4_guests,dec_mean_2_guests
1,Timberline Lodge,Timberline Lodge,Oregon,8540,3690,6000,0.0,0.0,0,0,0,25,50,13,12.0,255,208
2,Boreal,Truckee,California,7700,500,7200,49.0,0.0,0,0,0,26,29,44,0.0,275,231
3,Bear Valley,Bear Valley,California,8500,1900,6600,0.0,0.0,0,0,0,11,41,45,4.0,270,235
4,Mt. Hood Meadows,Mt. Hood,Oregon,7300,2777,4523,0.0,0.0,0,0,0,0,0,0,0.0,267,184
5,White Pass,White Pass,Washington,6550,2050,4500,69.0,69.0,0,0,0,0,0,0,0.0,237,212


In [61]:
collaborative_model()

Name: Alexandria K.
How many resort recommendations do you want? 5


Unnamed: 0,ski_resort,state,city,adultWeekdayPrice,adultWeekendPrice,summit,drop,base,ikon,epic,mountain_collective,advanced_runs,intermediate_runs,expert_runs,predicted_rating
51,Breckenridge,Colorado,Breckenridge,149.0,179.0,12998,3398,9600,0,1,0,36,23,28.0,4.416056
192,Big Powderhorn Mountain,Michigan,Bessemer,69.0,69.0,1800,600,1200,0,0,0,31,40,2.0,4.377006
19,Snowbasin,Utah,Huntsville,149.0,169.0,9350,2900,6450,1,0,1,52,33,6.0,4.375008
99,Ski Brule,Michigan,Iron River,70.0,70.0,1860,500,1360,0,0,0,24,35,6.0,4.363485
83,Granite Peak,Wisconsin,Wausau,95.0,105.0,1942,700,1242,0,0,0,0,0,0.0,4.344742


In [76]:
hybrid_model_content()

Name: Alexandria K.
How many resort recommendations do you want? 3
What's your favorite ski resort? Snowbird
What month would you like to travel? February
Are you using a multi-resort pass?  Epic


Unnamed: 0,ski_resort,city,state,summit,drop,base,adultWeekdayPrice,adultWeekendPrice,beginner_runs,intermediate_runs,...,expert_runs,nov_snow,dec_snow,jan_snow,feb_snow,mar_snow,apr_snow,feb_mean_4_guests,feb_mean_2_guests,predicted_rating
3,Breckenridge,Breckenridge,Colorado,12998,3398,9600,149.0,179.0,13,23,...,28.0,30,59,57,54,59,36,382,251,4.416056
7,Crested Butte Mountain,Mt. Crested Butte Mountain,Colorado,12162,3062,9375,149.0,165.0,14,25,...,36.0,14,55,50,44,41,5,300,207,4.333826
0,Kirkwood,Kirkwood,California,9800,2000,7800,0.0,0.0,0,0,...,0.0,22,79,85,81,79,25,403,340,4.189166


#### Raghava's Results

Raghava tested the model with the [deployed Streamlit app](https://stephcia-ski-recommendation-system-ski-model-stephanie-zs77j6.streamlit.app/). Below are his inputs, recommendations, and feedback. As with Alexandria's review, one of the final results is slightly different in the app than in the code below. This might

**Inputs**
- Name: Raghava K.
- How many resort recommendations do you want? 5
- What's your favorite ski resort? Alta
- What month would you like to travel? February
- Are you using a multi-resort pass?  Ikon

**Outputs**
- Snowbasin
- Snowbird
- Big Sky

Feedback: I think these recs are great! I have been to all of them and like them a lot, and I think someone who loves Alta and has an Ikon pass would also enjoy these resorts

In [57]:
content_model()

How many resort recommendations do you want? 3
What's your favorite ski resort? Telluride
What month would you like to travel? December


Unnamed: 0,ski_resort,city,state,summit,drop,base,adultWeekdayPrice,adultWeekendPrice,ikon,epic,mountain_collective,beginner_runs,intermediate_runs,advanced_runs,expert_runs,dec_mean_4_guests,dec_mean_2_guests
1,Aspen Snowmass,Aspen,Colorado,12510,4406,8104,189.0,199.0,1,0,1,0,0,0,0.0,624,316
2,Solitude Mountain,Brighton,Utah,10488,2494,7994,115.0,115.0,1,0,0,6,46,30,18.0,375,322
3,Beaver Creek,Vail,Colorado,11440,3340,8100,191.0,275.0,0,1,0,38,30,24,8.0,420,268


In [58]:
collaborative_model()

Name: Raghava K.
How many resort recommendations do you want? 3


Unnamed: 0,ski_resort,state,city,adultWeekdayPrice,adultWeekendPrice,summit,drop,base,ikon,epic,mountain_collective,advanced_runs,intermediate_runs,expert_runs,predicted_rating
20,Snowbasin,Utah,Huntsville,149.0,169.0,9350,2900,6450,1,0,1,52,33,6.0,4.835467
191,Big Powderhorn Mountain,Michigan,Bessemer,69.0,69.0,1800,600,1200,0,0,0,31,40,2.0,4.796809
119,Nubs Nob,Michigan,Harbor Springs,65.0,85.0,1338,427,911,0,0,0,0,48,21.0,4.792542


In [79]:
hybrid_model_content()

Unnamed: 0,ski_resort,city,state,summit,drop,base,adultWeekdayPrice,adultWeekendPrice,beginner_runs,intermediate_runs,...,expert_runs,nov_snow,dec_snow,jan_snow,feb_snow,mar_snow,apr_snow,feb_mean_4_guests,feb_mean_2_guests,predicted_rating
6,Snowbasin,Huntsville,Utah,9350,2900,6450,149.0,169.0,9,33,...,6.0,14,68,67,68,57,16,281,222,4.835467
16,Stratton Mountain,Stratton Mountain Mountain,Vermont,3875,2003,1872,125.0,125.0,40,35,...,9.0,3,30,25,37,22,1,478,380,4.622277
12,Jackson Hole,Teton Village,Wyoming,10450,4139,6311,215.0,215.0,4,41,...,17.0,24,108,100,110,71,15,508,365,4.556631


Below I am creating a dataframe to match Raghava's results from the streamlit app.

In [84]:
resorts = ["Snowbasin", "Snowbird", "Big Sky"]

r_df = content_df.loc[content_df['ski_resort'].isin(resorts)]

In [85]:
#using plotly to plot terrain difficulty for each resort
fig = px.bar(r_df, x="ski_resort", y=["beginner_runs", "intermediate_runs", "advanced_runs", "expert_runs"],
            width=1000, height=500)
fig.update_layout(title_text='Terrain Difficulty',
                  title_x=0.5,
                  xaxis_title="Ski Resort",
                  yaxis_title="Difficulty Level (%)",
                 plot_bgcolor='white',
                 font=dict(size=14),
                 barmode='group')

newnames = {"beginner_runs": 'Beginner Runs', "intermediate_runs": 'Intermediate Runs',
           "advanced_runs": 'Advanced Runs', "expert_runs": 'Expert Runs'}
fig.for_each_trace(lambda t: t.update(name = newnames[t.name],
                                      legendgroup = newnames[t.name],
                                       hovertemplate = t.hovertemplate.replace(t.name, newnames[t.name])
                                      )
                   )

fig.update_traces(textposition='outside')               
              
fig.show()

In [86]:
#using plotly to plot average monthly snowfall for each resort
fig = px.bar(r_df, x="ski_resort", y=['nov_snow', 'dec_snow', 'jan_snow', 'feb_snow', 'mar_snow','apr_snow'],
            width=1000, height=500)
fig.update_layout(title_text='Average Monthly Snowfall',
                  title_x=0.5,
                  xaxis_title="Ski Resort",
                  yaxis_title="Snowfall (in.)",
                 plot_bgcolor='white',
                 font=dict(size=14),
                 barmode='group')

newnames = {"nov_snow": 'November', "dec_snow": 'December', "jan_snow": 'January',
           "feb_snow": 'February', "mar_snow": 'March', "apr_snow": 'April'}

fig.for_each_trace(lambda t: t.update(name = newnames[t.name],
                                      legendgroup = newnames[t.name],
                                       hovertemplate = t.hovertemplate.replace(t.name, newnames[t.name])
                                     )
                  )
fig.update_traces(textposition='outside')               
              
fig.show()

In [87]:
#using plotly to plot December Airbnb costs for each resort
fig = px.bar(r_df, x="ski_resort", y=["dec_mean_2_guests", "dec_mean_4_guests"],
            width=1000, height=500)
fig.update_layout(title_text='December Airbnb Costs',
                  title_x=0.5,
                  xaxis_title="Ski Resort",
                  yaxis_title="Nightly Price ($)",
                 plot_bgcolor='white',
                 font=dict(size=14),
                 barmode='group')

newnames = {'dec_mean_2_guests':'2 Guest', 'dec_mean_4_guests': '4 Guest'}
fig.for_each_trace(lambda t: t.update(name = newnames[t.name],
                                      legendgroup = newnames[t.name],
                                      hovertemplate = t.hovertemplate.replace(t.name, newnames[t.name])
                                     )
                  )

fig.update_traces(textposition='outside')               
              
fig.show()

In [88]:
#using plotly to plot base, summit, and vertical drop for each resort
fig = px.bar(r_df, x="ski_resort", y=['base', 'summit', 'drop'],
            width=1000, height=500)
fig.update_layout(title_text='Mountain Vertical',
                  title_x=0.5,
                  xaxis_title="Ski Resort",
                  yaxis_title="Ft.",
                 plot_bgcolor='white',
                 font=dict(size=14),
                 barmode='group')

fig.update_traces(textposition='outside')               
              
fig.show()

In [89]:
#using plotly to plot adult lift ticket prices for each resort
fig = px.bar(r_df, x="ski_resort", y=["adultWeekdayPrice", "adultWeekendPrice"],
            width=1000, height=500)
fig.update_layout(title_text='Adult Lift Ticket Prices',
                  title_x=0.5,
                  xaxis_title="Ski Resort",
                  yaxis_title="Price ($)",
                 plot_bgcolor='white',
                 font=dict(size=14),
                 barmode='group')

newnames = {"adultWeekdayPrice":'Weekday Price', "adultWeekendPrice": 'Weekend Price'}
fig.for_each_trace(lambda t: t.update(name = newnames[t.name],
                                      legendgroup = newnames[t.name],
                                       hovertemplate = t.hovertemplate.replace(t.name, newnames[t.name])
                                      )
                 )

fig.update_traces(textposition='outside')               
              
fig.show()

# Results

### Collaborative Filtering

The best collaborative filtering model was a Singular Value Decomposition (SVD) model with an RMSE of 0.90. While this score is relatively good considering the small size of the final dataset, there were instances where the results did not entirely correspond to previous user reviews. Potential factors contributing to this include user bias, rating distribution, and the limited size of the dataset.

![img](images/rating_distribution.png)
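
To make the RMSE figure above concrete, the sketch below computes root mean squared error by hand for a few made-up ratings. The function name and the numbers are illustrative only and are not taken from the survey data. On a 1 to 5 scale, an RMSE of 0.90 means predictions are typically off by a little under one rating point, with large misses weighted more heavily than small ones.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length rating lists."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("actual and predicted must be non-empty and the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up ratings for illustration
actual = [5, 4, 3]
predicted = [4, 4, 5]

print(round(rmse(actual, predicted), 3))
```

In this toy example, the single two-point miss contributes four times as much to the squared error as the one-point miss, which is why a skewed rating distribution can have an outsized effect on the score.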

Considering the high opportunity cost associated with planning a ski trip, relying solely on a user-based system is insufficient to provide users with tailored recommendations that account for factors such as cost and time of year. While identifying similar users is crucial, it is essential to consider additional elements when making recommendations.

### Cascade-Hybrid Model

The final model is the cascade-hybrid model, which has been [deployed](https://stephcia-ski-recommendation-system-ski-model-stephanie-zs77j6.streamlit.app/) on Streamlit. The system incorporates both content-based and collaborative filtering approaches in making ski resort recommendations. The content-based system refined recommendations based on user filters and resort similarities and was crucial in making helpful recommendations.

### User Feedback

To gain deeper insights into the recommendations, I sought feedback from two users who completed the resort survey and provided a brief overview of their mountain preferences. These users used the demo Streamlit app to input their filters and review the recommendations.

Based on feedback from these two users, the recommendation system performed well at suggesting ski resorts that align with users' past reviews and preferences.

# Conclusions

The recommendation system demonstrates strong performance in suggesting ski resorts that align with user inputs. However, there are times when the output does not seem entirely aligned with user preferences or former reviews. I believe this is due to the review dataset: not all resorts included in the content model were reviewed by users, and users with as few as 3 reviews were included.

Though the results could use some fine-tuning, the recommendation system utilizes collaborative filtering and content-based approaches to provide strong recommendations based on user preferences and resort characteristics. The cascade hybrid model returns resorts that are more in line with past user ratings, compared to the collaborative model alone. This is due to individual filtering and the similarity matrix that the content-based system utilizes.

It is important to acknowledge that recommendations are inherently subjective, as they rely on individual preferences and the available dataset. To further enhance the system and ensure continuous optimization, user feedback is needed. By incorporating user feedback, the recommendations can be refined and the overall user experience improved, creating a more personalized system that caters to individual preferences.


# Next Steps

Next steps involve expanding the dataset with additional user ratings and features, collecting first-party data through ski resort partnerships, and deploying a web application:

- The OnTheSnow ratings dataset did not have unique user IDs for each rating, which reduced the number of reviews used to create the collaborative model. As a result, not all ski resorts in the USA were included. By incorporating more reviews, more mountains will be included in the collaborative filtering process which could result in more accurate recommendations.

- Once additional user ratings are collected, the cascade hybrid model will be fine-tuned and the main algorithms re-run.

- Finally, additional feature characteristics related to the resort towns and mountains will be incorporated. These features could include ratings and assessments of mountain restaurants, parking information, lodging options, après-ski activities, ski rentals, and other amenities available in the resort towns. By including these metrics in the recommendation system, a more comprehensive and personalized service can be provided, catering to diverse preferences and requirements that extend beyond skiing itself.