# Avant Ski/ Send It

by: Stephanie Ciaccia

# Overview

# Business Problem

This project addresses a common challenge for skiers and snowboarders planning ski vacations in the United States. With so many ski resorts available, each differing in price, size, location, and mountain features, making an informed decision is difficult. As an avid skier, I have personally run into these difficulties when researching and organizing ski trips.

To address this problem, I am building a ski resort recommendation system called Avant Ski (a play on "Après-Ski"). Users can narrow down the list of ski resorts and receive personalized recommendations, which simplifies planning and decision-making and improves the overall ski trip experience.


# Data Understanding

In [323]:
import pandas as pd
import numpy as np
import math
import datetime
from scipy import stats

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import plotly
import plotly.express as px
import plotly.io as pio
from matplotlib.ticker import StrMethodFormatter

from surprise import Dataset, Reader, accuracy
from surprise.prediction_algorithms import KNNWithMeans, KNNBasic, KNNBaseline, SVD, SVDpp, NMF, BaselineOnly, NormalPredictor
from surprise.model_selection import GridSearchCV, cross_validate, train_test_split

from collections import Counter
from nltk.corpus import stopwords

from IPython.display import display

Helper function to print every row of a DataFrame or Series

In [324]:
def print_full(x):
    pd.set_option('display.max_rows', len(x))
    print(x)
    pd.reset_option('display.max_rows')
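
A possible variant of this helper, sketched here as an assumption rather than as the notebook's own code, uses `pd.option_context` so that `display.max_rows` is restored even if printing raises an exception. The name `print_full_safe` and the demo series are hypothetical.

```python
import pandas as pd

def print_full_safe(x):
    """Print every row of x, then restore the previous display setting."""
    # option_context resets display.max_rows on exit, even if print() fails
    with pd.option_context("display.max_rows", len(x)):
        print(x)

# Hypothetical demo data
demo = pd.Series(range(100), name="n")
before = pd.get_option("display.max_rows")
print_full_safe(demo)
after = pd.get_option("display.max_rows")
```

Unlike `reset_option`, this returns the option to whatever value was set before the call, not to the pandas default.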

# Data Source #1

In [862]:
snow_df = pd.read_csv("data/OnTheSnow_SkiAreaReviews_clean.csv")
survey_df = pd.read_csv("data/usa_ski_resort_survey.csv")
scraped_df = pd.read_csv("data/onthesnow_scrape_170523_cleaned.csv")

In [863]:
snow_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18238 entries, 0 to 18237
Data columns (total 6 columns):
 #   Column                         Non-Null Count  Dtype 
---  ------                         --------------  ----- 
 0   State                          18238 non-null  object
 1   Ski Area                       18238 non-null  object
 2   Reviewer Name                  18128 non-null  object
 3   Review Date                    18238 non-null  object
 4   Review Star Rating (out of 5)  18238 non-null  int64 
 5   Review Text                    18226 non-null  object
dtypes: int64(1), object(5)
memory usage: 855.0+ KB


In [864]:
snow_df.head()

Unnamed: 0,State,Ski Area,Reviewer Name,Review Date,Review Star Rating (out of 5),Review Text
0,colorado,copper-mountain-resort,anonymous_user,3-Mar-04,3,I have a pass the includes other mountains but...
1,utah,brighton-resort,anonymous_user,2-Dec-04,4,I've been coming to Brighton for years. Unlike...
2,north-carolina,ski-beech-mountain-resort,anonymous_user,1-Jan-05,5,"We went last Weekend, and it was the best snow..."
3,new-mexico,red-river,anonymous_user,1-Mar-05,5,Love Red River we go every year!
4,pennsylvania,sno-mountain,anonymous_user,2-Mar-05,4,"Great varied terrain, not crowded, good prices..."


In [865]:
#renaming columns
new_name = ['state', 'ski_resort', 'user_name','review_date', 'rating',
           'review'] 

snow_df.columns = new_name

In [866]:
snow_df

Unnamed: 0,state,ski_resort,user_name,review_date,rating,review
0,colorado,copper-mountain-resort,anonymous_user,3-Mar-04,3,I have a pass the includes other mountains but...
1,utah,brighton-resort,anonymous_user,2-Dec-04,4,I've been coming to Brighton for years. Unlike...
2,north-carolina,ski-beech-mountain-resort,anonymous_user,1-Jan-05,5,"We went last Weekend, and it was the best snow..."
3,new-mexico,red-river,anonymous_user,1-Mar-05,5,Love Red River we go every year!
4,pennsylvania,sno-mountain,anonymous_user,2-Mar-05,4,"Great varied terrain, not crowded, good prices..."
...,...,...,...,...,...,...
18233,minnesota,lutsen-mountains,REBECCA CARTWRIGHT,14-Dec-20,4,Many workers on the lifts did not know how to ...
18234,new-mexico,sipapu-ski-and-summer-resort,Antonio Martinez,15-Dec-20,5,"staying in the ""hotel"" (""motel"" on the sign ab..."
18235,new-mexico,sipapu-ski-and-summer-resort,Antonio Martinez,15-Dec-20,5,"staying in the ""hotel"" (""motel"" on the sign ab..."
18236,new-mexico,taos-ski-valley,David Humphrey,15-Dec-20,5,"Good skiing, have lost their way over the year..."


In [867]:
survey_df['user_name'].unique()

array(['anon_1', 'anon_2', 'anon_3', 'anon_4', 'anon_5', 'anon_6',
       'anon_7', 'anon_8', 'anon_9', 'Stephanie Ciaccia', 'Joseph Lewis',
       'Alexandria Kelly', 'Deanna Uzarski', 'Raghava Kamalesh'],
      dtype=object)

In [868]:
survey_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62 entries, 0 to 61
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   review_date  62 non-null     object
 1   state        62 non-null     object
 2   ski_resort   62 non-null     object
 3   rating       62 non-null     int64 
 4   review       62 non-null     object
 5   user_name    62 non-null     object
dtypes: int64(1), object(5)
memory usage: 3.0+ KB


In [869]:
snow_df['review_date'] = pd.to_datetime(snow_df['review_date'])
survey_df['review_date'] = pd.to_datetime(survey_df['review_date'])
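
The raw OnTheSnow dates in the preview above look like `3-Mar-04` (day, abbreviated month, two-digit year). As a hedged sketch, passing an explicit `format` states that assumption in the code instead of relying on format inference. The format string is my reading of the preview rows and has not been checked against the full column, and month abbreviations are assumed to be in English.

```python
import pandas as pd

# Sample values copied from the snow_df preview
raw_dates = pd.Series(["3-Mar-04", "2-Dec-04", "14-Dec-20"])

# %d = day of month, %b = abbreviated month name, %y = two-digit year
parsed = pd.to_datetime(raw_dates, format="%d-%b-%y")

iso_dates = parsed.dt.strftime("%Y-%m-%d").tolist()
print(iso_dates)
```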

In [870]:
snow_df["ski_resort"].value_counts()

ski-brule                        1315
killington-resort                 204
vail                              203
winter-park-resort                191
blue-mountain-ski-area            189
                                 ... 
big-squaw-mountain-ski-resort      13
holiday-mountain                   13
powder-ridge-ski-area              13
willard-mountain                   13
whaleback-mountain                 12
Name: ski_resort, Length: 291, dtype: int64

In [871]:
snow_df['user_name'].value_counts().head(30)

anonymous_user         3026
anonymous               304
undefined undefined     130
Ben                      49
Mike                     49
Ryan                     46
Richard                  44
Dan                      42
David                    42
Chris                    39
Rob                      37
Jeff                     36
Derek                    31
Brian                    31
Matt                     31
wolfman                  31
iPhone                   28
Michael                  28
Nick                     27
gma                      27
Kevin                    27
Jim                      26
Steve                    25
J                        25
Jun                      23
Mark                     23
Joe                      23
Jason                    23
Justin                   22
Paul                     22
Name: user_name, dtype: int64

Dropping anonymous placeholders and common first names, which cannot be reliably attributed to a single reviewer

In [872]:
drop_list = ["anonymous_user", "anonymous","undefined undefined","Mike", 
             "Ben", "Ryan", "Richard", "Dan", "David", "Chris", "Rob", "Jeff",
            "Derek", "Brian", "Matt", "Michael", "iPhone", "Kevin", "Nick",
            "Jim", "Steve", "Jason", "Mark", "Joe", "Paul", "Justin", "Scott",
            "Bob", "Alex", "Carter", "Dave", "Tim", "Bill", "Andrew", "John", "Sam",
            "James", "Kim", "Craig", "mike", "jason", "James", "Sam", "Kim", "mike", "peter",
            "Jack", "Adam", "Tom", "Wes", "Jun", "Steven", "Max", "Matthew", "Laura", "Felipe",
            "Greg", "Bryan", "Sarah", "Sara", "Christian", "Ray", "Connor", "Erin", "Emily",
            "Luke", "Ed", "Patrick", "kyle", "Ken", "Linda", "Eric", "Aaron", "Jake",
            "Josh", "Tony", "Abe", "Frank", "Peter", "Fred", "Arthur", "Lorraine",
            "Phil", "Sean", "Will", "Julie", "Jon", "Amy", "Becky", "Shannon", "brendan",
            "Kathy", "wayne", "Ethan", "Erika", "Jill", "Zoe", "Rick", "Wyatt",
            "Tyler", "Andrea", "mark", "john", "Donna", "Jen", "Braden", "D", "Bryce",
            "Rich", "Jared", "Jay", "Ann", "Brandon", "Nicholas","Martin",
            'Robert', 'angelino','Anonymous',
             'ty', 'jase', 'Jesse', 'Jennifer', 'Dustin', 'Natalie',
             'Pat', 'anonymous user', 'matt', 'George', 'Kate',
             'Daniel','Cindy', 'Barry', 'Todd', 'Melanie', 'Drew',
             'Andy', 'Hochard','Wayne', 'dan',
             'Charlie', 'Vanessa','Allen', 'Austin', 'Roger',
             'Jerry', 'Scotty', 'Anon', 'Lucas', 'Brian', 'Lee', 'Taylor',
            'brian', 'Lisa', 'Jade', 'Spencer', 'chris', 'Jenny', 'Amanda', 'Brett',
            'Maria', 'Holly', 'iPad', 'Sylvia', 'iPhone (2)', 'Catherine', 'Hannah', 'Wade',
            'Larry', 'Lauren']

snow_df = snow_df[~snow_df['user_name'].isin(drop_list)]
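
The drop list repeats several names in different capitalizations (for example `Mike` and `mike`). One way to shorten it, shown here as a sketch on hypothetical sample data and not as the notebook's actual filter, is to compare a lowercased, stripped copy of the user name against a lowercase set.

```python
import pandas as pd

# Hypothetical sample; the notebook applies its filter to snow_df
reviews = pd.DataFrame({
    "user_name": ["Mike", "mike ", "wolfman", None, "Anonymous", "Tim Zheng"],
    "rating": [4, 5, 3, 5, 2, 4],
})

# One lowercase entry covers every capitalization of the same name
generic_names = {"mike", "anonymous"}

# Normalize only for the comparison; the original user_name column is untouched
normalized = reviews["user_name"].str.strip().str.lower()
kept = reviews[~normalized.isin(generic_names)]
print(kept)
```

Rows with a missing user name are kept by this mask, which matches the behavior of the `isin` filter above.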

In [873]:
snow_df

Unnamed: 0,state,ski_resort,user_name,review_date,rating,review
103,california,squaw-valley-usa,ericadyer,2006-10-23,5,This is one of my favorite resorts in Tahoe. E...
104,idaho,sun-valley,ericadyer,2006-10-23,5,I’ve skied all over the world and everything a...
107,california,donner-ski-ranch,FroDog,2006-10-24,3,"<font size=""2""><p>The mountain is less a place..."
108,california,boreal,FroDog,2006-10-24,4,"<font size=""2""><p>They&#39;ve really nailed th..."
109,nevada,diamond-peak,FroDog,2006-10-24,4,Observation Deck - Great views. They were out ...
...,...,...,...,...,...,...
18233,minnesota,lutsen-mountains,REBECCA CARTWRIGHT,2020-12-14,4,Many workers on the lifts did not know how to ...
18234,new-mexico,sipapu-ski-and-summer-resort,Antonio Martinez,2020-12-15,5,"staying in the ""hotel"" (""motel"" on the sign ab..."
18235,new-mexico,sipapu-ski-and-summer-resort,Antonio Martinez,2020-12-15,5,"staying in the ""hotel"" (""motel"" on the sign ab..."
18236,new-mexico,taos-ski-valley,David Humphrey,2020-12-15,5,"Good skiing, have lost their way over the year..."


In [874]:
snow_df['user_name'].value_counts().head(70)

wolfman            31
gma                27
J                  25
Tim Zheng          22
Dave O             22
                   ..
tc5                 7
Nick Franchino      7
fcherichel          7
James undefined     7
tjkotula            7
Name: user_name, Length: 70, dtype: int64

In [876]:
# counting the number of reviews per reviewer
value_counts = snow_df['user_name'].value_counts()

# keeping only reviewers with more than three reviews
selected_values = value_counts.loc[value_counts > 3]

# filtering the reviews down to those reviewers
cleaned_snow = snow_df.loc[snow_df['user_name'].isin(selected_values.index)]
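
The same "more than three reviews" filter can be written in one pass with a grouped count. This is a sketch on hypothetical data; `MIN_REVIEWS` is a name I introduce to make the threshold explicit (more than three means at least four).

```python
import pandas as pd

# Hypothetical sample: user "a" has 4 reviews, "b" has 3, "c" has 1
reviews = pd.DataFrame({
    "user_name": ["a"] * 4 + ["b"] * 3 + ["c"],
    "rating": [5, 4, 3, 5, 2, 4, 5, 1],
})

MIN_REVIEWS = 4  # "more than three reviews"

# transform("size") gives each row the review count of its own user
reviews_per_user = reviews.groupby("user_name")["rating"].transform("size")
frequent = reviews[reviews_per_user >= MIN_REVIEWS]
print(frequent["user_name"].unique())
```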

In [877]:
cleaned_snow.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2213 entries, 103 to 18199
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   state        2213 non-null   object        
 1   ski_resort   2213 non-null   object        
 2   user_name    2213 non-null   object        
 3   review_date  2213 non-null   datetime64[ns]
 4   rating       2213 non-null   int64         
 5   review       2212 non-null   object        
dtypes: datetime64[ns](1), int64(1), object(4)
memory usage: 121.0+ KB


In [878]:
cleaned_snow['user_name'].unique()

array(['ericadyer', 'FroDog', 'jackson321', 'SammyG', 'Jill Adler',
       'Twenz', 'anhillx', 'emj35', 'Benji Zimmerman', 'Roger Leo',
       'filterban', 'JP', 'gkphi5', 'RippinSkiers', 'Jay C',
       'Mark Rosasco', 'Dan Gibson', 'Cherokee', 'treesker', 'gwiffie',
       'stevenam', 'Gunny J', 'jim8588', 'Bmorabito', 'Richard 1',
       'joey58242', 'Mike134', 'tom travis', 'noonito', 'tsfoust',
       'steffenwolf', 'seniordude', 'Americansonofa', 'Randy Agness',
       'sno_thing', 'sampanning', 'steep-n-deep ', 'swissnowtiger',
       'flyersboy114', 'Bob Butts', 'sharimcatee', 'bodibran',
       'tourist from Texas', 'p_nut', 'highvoltageguy', 'govey80',
       'fcherichel', 'Adye 1', 'MgoBlue', 'Randy Rogers', 'Bobby G',
       'apken', 'Resort Travel', 'kbone77', 'swetsb', 'CopPsychDoc',
       'mtbporru', 'Les', 'J Berlo', 'sewardhorner', 'thereporter',
       'nzh0tg', 'Aguilar 1', 'herbsmen', 'Philip H. Eckerberg',
       'pinkkid_2000', 'John Gonzalez', 'Ontheslopes', 'jo

In [879]:
cleaned_snow[cleaned_snow['user_name'] == 'Larry']

Unnamed: 0,state,ski_resort,user_name,review_date,rating,review


In [880]:
cleaned_snow.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2213 entries, 103 to 18199
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   state        2213 non-null   object        
 1   ski_resort   2213 non-null   object        
 2   user_name    2213 non-null   object        
 3   review_date  2213 non-null   datetime64[ns]
 4   rating       2213 non-null   int64         
 5   review       2212 non-null   object        
dtypes: datetime64[ns](1), int64(1), object(4)
memory usage: 121.0+ KB


In [881]:
cleaned_snow['ski_resort'].value_counts()

ski-brule                    41
killington-resort            40
breckenridge                 39
vail                         36
heavenly-mountain-resort     31
                             ..
new-hermon-mountain           1
christmas-mountain            1
pomerelle-mountain-resort     1
wolf-creek                    1
great-divide                  1
Name: ski_resort, Length: 268, dtype: int64

In [883]:
cleaned_snow['ski_resort'].unique()

array(['squaw-valley-usa', 'sun-valley', 'donner-ski-ranch', 'boreal',
       'diamond-peak', 'mt-baker', 'alpental', 'stevens-pass-resort',
       'the-summit-at-snoqualmie', 'mt-rose-ski-tahoe', 'mountain-high',
       'snowshoe-mountain-resort', 'alyeska-resort', 'steamboat',
       'alta-ski-area', 'snowbird', 'snowbasin', 'brighton-resort',
       'solitude-mountain-resort', 'deer-valley-resort',
       'park-city-mountain-resort', 'jackson-hole', 'sundance',
       'brian-head-resort', 'bretton-woods', 'loon-mountain',
       'sierra-at-tahoe', 'heavenly-mountain-resort', 'gunstock',
       'sno-mountain', 'attitash', 'crystal-mountain-wa', 'vail',
       'killington-resort', 'waterville-valley', 'kirkwood',
       'copper-mountain-resort', 'breckenridge',
       'arapahoe-basin-ski-area', 'keystone', 'boyne-mountain-resort',
       'crystal-mountain', 'shanty-creek', 'cannonsburg',
       'boyne-highlands', 'aspen-snowmass', 'sunday-river',
       'mount-sunapee', 'sugar-bowl-re

In [884]:
replace_snow = ['-ski-area', '-', 'resort', 'mt']
replace_with = ['', ' ', '', 'mt.']

cleaned_snow = cleaned_snow.replace(replace_snow, replace_with, regex=True)

In [885]:
#making columns titlecase
cleaned_snow['ski_resort'] = cleaned_snow['ski_resort'].str.title()
cleaned_snow['state'] = cleaned_snow['state'].str.title()
cleaned_snow['ski_resort'] = cleaned_snow['ski_resort'].str.strip()

In [886]:
replace_snow = ['At', 'Mtn', 'Mt.N', 'Mt. Hood Ski Bowl', 'And', r'\bMount\b']
replace_with = ['at', 'Mountain', 'Mountain', 'Mt. Hood Skibowl', 'and', 'Mt.']

cleaned_snow = cleaned_snow.replace(replace_snow, replace_with, regex=True)

In [887]:
mountain_rep = ['Crystal Mountain - WA', 'Crystal Mountain Wa','Squaw Valley Usa',
                'Mccauley Mountain Ski Center', 'attitash', 'Smugglers Notch',
               'Pico Mountain at Killington', 'andes Tower Hills']
mountain_new = ["Crystal Mountain", "Crystal Mountain", 'Palisades Tahoe',
               'McCauley Mountain Ski Center', 'Attitash', "Smugglers' Notch",
               'Pico Mountain', 'Andes Tower Hills']

cleaned_snow = cleaned_snow.replace(mountain_rep, mountain_new, regex=True)
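
Note that `DataFrame.replace(..., regex=True)` is applied to every string column, so the patterns above also rewrite the review text and user names, and a bare pattern such as `At` also matches inside `Attitash`. The sketch below shows one way to limit the cleanup to the name columns and to use word boundaries. It is an illustration on hypothetical rows, not the notebook's pipeline, and `name_cols` is an assumed helper name.

```python
import pandas as pd

# Hypothetical rows in the raw OnTheSnow slug format
df = pd.DataFrame({
    "state": ["washington", "new-hampshire"],
    "ski_resort": ["the-summit-at-snoqualmie", "attitash"],
    "review": ["Great resort - will return", "At the top it was icy"],
})
original_reviews = df["review"].tolist()

# Only the name columns are cleaned; the review text is left alone
name_cols = ["state", "ski_resort"]
for col in name_cols:
    df[col] = df[col].str.replace("-", " ", regex=False)

# \b restricts the match to the standalone word "At"
df["ski_resort"] = (
    df["ski_resort"].str.title().str.replace(r"\bAt\b", "at", regex=True)
)
print(df)
```

With the word boundary in place, `Attitash` keeps its capital letter, so no follow-up fix for `attitash` is needed in this sketch.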

In [888]:
list(cleaned_snow['ski_resort'].sort_values().unique())

['49 Degrees North',
 'Afton Alps',
 'Alpental',
 'Alpine Valley',
 'Alta',
 'Alyeska',
 'Andes Tower Hills',
 'Angel Fire',
 'Appalachian Ski Mountain',
 'Arapahoe Basin',
 'Arizona Snowbowl',
 'Aspen Snowmass',
 'Attitash',
 'Badger Pass',
 'Bear Creek Mountain',
 'Bear Mountain',
 'Bear Valley',
 'Beaver Creek',
 'Beaver Mountain',
 'Belleayre',
 'Berkshire East',
 'Big Bear',
 'Big Boulder',
 'Big Powderhorn Mountain',
 'Big Sky',
 'Bittersweet',
 'Black Mountain',
 'Blackjack Ski',
 'Blacktail Mountain',
 'Blandford',
 'Blue Hills',
 'Blue Knob',
 'Blue Mountain',
 'Bluewood',
 'Bogus Basin',
 'Bolton Valley',
 'Boreal',
 'Boston Mills',
 'Bousquet',
 'Boyne Highlands',
 'Boyne Mountain',
 'Bradford',
 'Brandywine',
 'Breckenridge',
 'Bretton Woods',
 'Brian Head',
 'Bridger Bowl',
 'Brighton',
 'Bristol Mountain',
 'Bromley Mountain',
 'Bruce Mound',
 'Brundage Mountain',
 'Bryce',
 'Buck Hill',
 'Burke Mountain',
 'Caberfae Peaks Ski Golf',
 'Camelback Mountain',
 'Campgaw Mount

## Cleaning Survey Data

In [889]:
#making columns titlecase
survey_df['ski_resort'] =survey_df['ski_resort'].str.title()
survey_df['state'] = survey_df['state'].str.title()
survey_df['ski_resort'] = survey_df['ski_resort'].str.strip()

In [890]:
list(survey_df['ski_resort'].sort_values().unique())

['Alta',
 'Arapahoe Basin',
 'Aspen Highlands',
 'Aspen Mountain',
 'Bear Valley',
 'Beaver Creek',
 'Beaver Mountain',
 'Breckenridge',
 'Brighton',
 'Cherry Peak',
 'Copper',
 'Copper Mountain',
 'Crested Butte',
 'Crystal Mountain - Wa',
 'Deer Valley',
 'Dodge Ridge',
 'Gore Mountain',
 'Hunter Mountain',
 'Jackson Hole',
 'Killington',
 'Mammoth',
 'Mammoth Mountain',
 'Mccauley Mountain',
 'Mt Baker',
 'Mt. Rose',
 'Nordic Valley',
 'Palisades',
 'Palisades Tahoe',
 'Park City Mountain',
 'Powder Mountain',
 'Roundtop Mountain',
 'Snow Ridge',
 'Snowbird',
 'Snowmass',
 'Solitude',
 'Steamboat',
 "Steven'S Pass",
 'Stevens Pass',
 'Stratton',
 'Sugarbush',
 'Taos',
 'Telluride',
 'Vail',
 'Winter Park',
 'Woods Valley']

In [891]:
mountain_rep = ['Crystal Mountain - Wa', 'Crystal Mountain Wa', "Steven'S Pass", 'Mammoth Mountain', 'Mammoth', 'Stratton',
                'Mccauley Mountain', 'Taos', 'Snowmass', 'Palisades Tahoe', 'Palisades', 'Copper Mountain', 'Copper',
                'Crested Butte', 'Mt. Rose', 'Mt Baker', 'Nordic Valley' ,'Solitude']
mountain_rep_p = ['Crystal Mountain', 'Crystal Mountain','Stevens Pass', 'Mammoth', 'Mammoth Mountain','Stratton Mountain',
                 'McCauley Mountain Ski Center', 'Taos Ski Valley', 'Aspen Snowmass', 'Palisades', 'Palisades Tahoe', 'Copper', 'Copper Mountain',
                 'Crested Butte Mountain', 'Mt. Rose Ski Tahoe', 'Mt. Baker', 'Nordic Mountain', 'Solitude Mountain']

survey_df = survey_df.replace(mountain_rep, mountain_rep_p, regex=True)

In [892]:
# mapping the individual Aspen mountains to Aspen Snowmass, since all four mountains are part of the same resort
mountain_r = ['Aspen Mountain', 'Aspen Highlands']

survey_df = survey_df.replace(mountain_r, 'Aspen Snowmass', regex=True)

In [893]:
# checking which survey resort names are missing from the OnTheSnow reviews
survey_df.loc[~survey_df['ski_resort'].isin(cleaned_snow['ski_resort']),
                         'ski_resort'].unique()

array(['Cherry Peak'], dtype=object)

In [895]:
list(survey_df['ski_resort'].sort_values().unique())

['Alta',
 'Arapahoe Basin',
 'Aspen Snowmass',
 'Bear Valley',
 'Beaver Creek',
 'Beaver Mountain',
 'Breckenridge',
 'Brighton',
 'Cherry Peak',
 'Copper Mountain',
 'Crested Butte Mountain',
 'Crystal Mountain',
 'Deer Valley',
 'Dodge Ridge',
 'Gore Mountain',
 'Hunter Mountain',
 'Jackson Hole',
 'Killington',
 'Mammoth Mountain',
 'McCauley Mountain Ski Center',
 'Mt. Baker',
 'Mt. Rose Ski Tahoe',
 'Nordic Mountain',
 'Palisades Tahoe',
 'Park City Mountain',
 'Powder Mountain',
 'Roundtop Mountain',
 'Snow Ridge',
 'Snowbird',
 'Solitude Mountain',
 'Steamboat',
 'Stevens Pass',
 'Stratton Mountain',
 'Sugarbush',
 'Taos Ski Valley',
 'Telluride',
 'Vail',
 'Winter Park',
 'Woods Valley']

### Merging survey and OnTheSnow review dataframes

In [896]:
#merging survey review results and final onthesnow reviews
final_ski_df = pd.concat([survey_df, cleaned_snow])

In [897]:
final_ski_df = final_ski_df.dropna()

In [899]:
final_ski_df

Unnamed: 0,review_date,state,ski_resort,rating,review,user_name
0,2023-05-04,Colorado,Winter Park,4,Very family friendly,anon_1
1,2023-05-04,Colorado,Arapahoe Basin,5,"Challenging terrain, no frills",anon_1
2,2023-05-04,Colorado,Steamboat,5,Great public transport to and from lodging,anon_1
3,2023-05-04,Colorado,Copper Mountain,5,Extremely diverse terrain and fantastic terrai...,anon_1
4,2023-05-04,Utah,Solitude Mountain,5,"They had so much terrain, especially for the s...",anon_2
...,...,...,...,...,...,...
18155,2020-10-28,Montana,Discovery,5,Easy to get to nice large for family dining\nG...,Lori Young
18156,2020-10-28,Montana,Discovery,5,This little is top notch and not full of peop...,Lori Young
18157,2020-10-28,Montana,Discovery,5,Easy to get to nice large for family dining\nG...,Lori Young
18198,2020-12-03,Washington,Mt. Baker,5,First time to Baker after a snowy Nov. Parkin...,Matt H


In [900]:
final_ski_df = final_ski_df.drop_duplicates()

In [901]:
final_ski_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2079 entries, 0 to 18198
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   review_date  2079 non-null   datetime64[ns]
 1   state        2079 non-null   object        
 2   ski_resort   2079 non-null   object        
 3   rating       2079 non-null   int64         
 4   review       2079 non-null   object        
 5   user_name    2079 non-null   object        
dtypes: datetime64[ns](1), int64(1), object(4)
memory usage: 113.7+ KB


In [902]:
#Blackjack ski and Indianhead combined to form Snowriver Mountain Resort
blackjack = ['Blackjack Ski', 'Indianhead Mountain']

final_ski_df = final_ski_df.replace(blackjack, 'Snowriver Mountain Resort', regex=True)

In [903]:
old_name = ['Durango']

final_ski_df = final_ski_df.replace(old_name, 'Purgatory', regex=True)

In [904]:
old_name = ['Las Vegas Ski and Snowboard']

final_ski_df = final_ski_df.replace(old_name, 'Lee Canyon', regex=True)

In [905]:
old_name = ['Shawnee Peak']

final_ski_df = final_ski_df.replace(old_name, 'Shawnee Mountain', regex=True)

In [906]:
old_name = ['Suicide Six']

final_ski_df = final_ski_df.replace(old_name, 'Saskadena Six', regex=True)

In [907]:
old_name = ['Timberline Four Seasons']

final_ski_df = final_ski_df.replace(old_name, 'Timberline', regex=True)

In [908]:
old_name = ['Snow Summit']

final_ski_df = final_ski_df.replace(old_name, 'Big Bear', regex=True)

In [913]:
# using an exact-match dictionary so each old name maps to the merged name once (a substring replace would double it)
replacements = {
    'Brandywine': 'Boston Mills and Brandywine',
    'Boston Mills': 'Boston Mills and Brandywine'
}

# replacing
final_ski_df['ski_resort'] = final_ski_df['ski_resort'].replace(replacements)
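
To illustrate the "doubling" that the dictionary avoids, the sketch below compares sequential substring substitution (done with plain `re.sub` so the order of operations is explicit) against an exact-match `Series.replace`. The sample series is hypothetical.

```python
import re
import pandas as pd

names = pd.Series(["Brandywine", "Boston Mills", "Vail"])
merged = "Boston Mills and Brandywine"

# Sequential substring substitution: the second pattern also matches
# the text that the first substitution just produced
chained = []
for name in names:
    for pattern in ("Brandywine", "Boston Mills"):
        name = re.sub(pattern, merged, name)
    chained.append(name)

# Exact-match mapping: a value is replaced only if the whole cell equals a key
exact = names.replace({"Brandywine": merged, "Boston Mills": merged}).tolist()

print(chained)
print(exact)
```

The chained version turns `Brandywine` into `Boston Mills and Brandywine and Brandywine`, while the exact-match version maps both old names to the merged name once.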

In [None]:
final_ski_df.loc[(final_ski_df['ski_resort'] == "Alpine Valley") & (final_ski_df['state'] == "Wisconsin"), 'ski_resort'] = "Alpine Valley Wisconsin"
final_ski_df.loc[(final_ski_df['ski_resort'] == "Alpine Valley") & (final_ski_df['state'] == "Ohio"), 'ski_resort'] = "Alpine Valley Ohio"

In [917]:
final_ski_df = final_ski_df.loc[(final_ski_df['ski_resort'] != "Cherry Peak")]

In [918]:
final_ski_df

Unnamed: 0,review_date,state,ski_resort,rating,review,user_name
0,2023-05-04,Colorado,Winter Park,4,Very family friendly,anon_1
1,2023-05-04,Colorado,Arapahoe Basin,5,"Challenging terrain, no frills",anon_1
2,2023-05-04,Colorado,Steamboat,5,Great public transport to and from lodging,anon_1
3,2023-05-04,Colorado,Copper Mountain,5,Extremely diverse terrain and fantastic terrai...,anon_1
4,2023-05-04,Utah,Solitude Mountain,5,"They had so much terrain, especially for the s...",anon_2
...,...,...,...,...,...,...
18142,2020-10-22,Utah,Beaver Mountain,5,Beaver Mountain is an awesome family run ski ...,Payton Sharum
18144,2020-10-22,Wyoming,Snowy Range Ski Recreation Area,5,"I love how cozy this ski area is. Sure, it doe...",Payton Sharum
18154,2020-10-28,Montana,Discovery,5,This little is top notch and not full of peop...,Lori Young
18155,2020-10-28,Montana,Discovery,5,Easy to get to nice large for family dining\nG...,Lori Young


In [936]:
#exporting final cleaned dataframe
final_ski_df.to_csv("cleaned_data_exports/final_review_df.csv")

### Data Source #3 - OnTheSnow Scrape


In [939]:
scraped_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 333 entries, 0 to 332
Data columns (total 43 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Name                      333 non-null    object 
 1   address                   333 non-null    object 
 2   city                      333 non-null    object 
 3   state                     333 non-null    object 
 4   country                   333 non-null    object 
 5   sumt                      333 non-null    int64  
 6   drop                      333 non-null    int64  
 7   base                      333 non-null    int64  
 8   gondolas_and_trams        333 non-null    int64  
 9   fastEight                 331 non-null    float64
 10  highSpeedSixes            327 non-null    float64
 11  quadChairs                329 non-null    float64
 12  tripleChairs              330 non-null    float64
 13  doubleChairs              331 non-null    float64
 14  surfeLifts

In [940]:
scraped_df.isna().sum().sort_values(ascending=False)

highSpeedSixes              6
quadChairs                  4
tripleChairs                3
fastEight                   2
doubleChairs                2
surfeLifts                  1
averageSnowfall (inches)    0
daysOpenLastYear            0
longestRun (miles)          0
totalRuns (acre)            0
ticketpriceNote             0
projectedClosing            0
gondolas_and_trams          0
base                        0
drop                        0
sumt                        0
country                     0
state                       0
city                        0
address                     0
projectedOpening            0
NovSnow                     0
terrainNote                 0
junior_weekday              0
gondolas_lifts_note         0
Url                         0
senior_weekend              0
adult_weekend               0
junior_weekend              0
child_weekend               0
senior_weekday              0
adult_weekday               0
child_weekday               0
DecSnow   

In [941]:
scraped_df = scraped_df.drop(columns=['gondolas_lifts_note', 'terrainNote', 'ticketpriceNote',
                                     'senior_weekend', 'MaySnow', 'country'])

In [942]:
scraped_df.loc[scraped_df['adult_weekend'] == 0]

Unnamed: 0,Name,address,city,state,sumt,drop,base,gondolas_and_trams,fastEight,highSpeedSixes,...,junior_season,adult_season,child_weekday,junior_weekday,adult_weekday,senior_weekday,child_weekend,junior_weekend,adult_weekend,Url
1,Afton Alps,6600 Peller Avenue South,55033 Hastings,Minnesota,1530,350,1180,0,0.0,0.0,...,859.0,50,55.0,50.0,55.0,60.0,55.0,0.0,0.0,https://www.onthesnow.com/Minnesota/afton-alps...
2,Alpental,POB 1068,98068 Snoquale Pass,Washington,5420,2280,3140,0,1.0,0.0,...,399.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,https://www.onthesnow.com/washington/alpental/...
5,Alta,P.O. Box 8007,84092 Alta,Utah,11068,2538,8530,0,3.0,0.0,...,599.0,1349,1049.0,80.0,159.0,0.0,125.0,0.0,0.0,https://www.onthesnow.com/utah/alta-ski-area/s...
13,Arizona Snowbowl,P.O. Box 40,86002 Flagstaff,Arizona,11500,2300,9200,0,0.0,1.0,...,699.0,999,749.0,0.0,0.0,0.0,0.0,0.0,0.0,https://www.onthesnow.com/arizona/arizona-snow...
15,Attitash,PO Box 308,03812 Bartlett,New Hampshire,2350,1750,600,0,2.0,0.0,...,519.0,639,59.0,79.0,59.0,67.0,89.0,67.0,0.0,https://www.onthesnow.com/new-hampshire/attita...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
320,Wisp,296 Marsh Hill Road,21541 McHenry,Maryland,3115,700,2415,0,0.0,0.0,...,699.0,529,59.0,84.0,69.0,89.0,0.0,0.0,0.0,https://www.onthesnow.com/maryland/wisp/ski-re...
323,Woodbury,785 Washington Road,06798 Woodbury,Connecticut,730,300,430,0,0.0,0.0,...,35.0,42,15.0,35.0,42.0,0.0,0.0,0.0,0.0,https://www.onthesnow.com/connecticut/woodbury...
324,Woods Valley,Box 215,13486 Westernville,New York,1400,500,900,0,0.0,0.0,...,475.0,460,368.0,28.0,31.0,33.0,39.0,0.0,0.0,https://www.onthesnow.com/new-york/woods-valle...
327,Mission Ridge,P.O. Box 1668,98807-1668 Wenatchee,Washington,6820,2250,4570,0,1.0,,...,499.0,849,429.0,0.0,0.0,0.0,0.0,0.0,0.0,https://www.onthesnow.com/washington/mission-r...


In [943]:
# filling missing (zero) lift ticket prices with the column mean

def mean_ticket(scraped_df, column):
    """Replace zeros in `column` with the column's integer mean (modifies scraped_df in place)."""

    # casting the column to int
    scraped_df[column] = scraped_df[column].astype(int)

    # computing the mean; zeros are included, which pulls the mean down
    mean_value = int(scraped_df[column].mean())

    # replacing 0 with the mean value
    scraped_df[column] = scraped_df[column].replace(0, mean_value)

    return scraped_df

In [944]:
# looping through the columns whose zeros should be filled (ticket prices, plus AprSnow)

ticket_prices = ["adult_weekend", "junior_weekend", "AprSnow", "child_weekend", "senior_weekday", "adult_weekday",
"junior_weekday", "child_weekday", "adult_season", "junior_season", "child_season"]

for ticket_val in ticket_prices:
    mean_ticket(scraped_df, ticket_val)
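
Because the zeros are placeholders for missing prices, including them in the mean lowers the fill value. A possible alternative, sketched below under that assumption, averages only the non-zero entries. This is not what the notebook does and would change the filled values. The function name and sample prices are hypothetical.

```python
import pandas as pd

def fill_zero_with_nonzero_mean(df, column):
    """Return a copy of df where zeros in `column` become the mean of its non-zero values."""
    out = df.copy()
    nonzero = out.loc[out[column] != 0, column]
    if nonzero.empty:
        # Nothing to average, so leave the column unchanged
        return out
    out[column] = out[column].replace(0, int(nonzero.mean()))
    return out

# Hypothetical prices: two resorts are missing a weekend price
prices = pd.DataFrame({"adult_weekend": [0, 100, 0, 200]})

mean_with_zeros = int(prices["adult_weekend"].mean())
filled = fill_zero_with_nonzero_mean(prices, "adult_weekend")
print(mean_with_zeros, filled["adult_weekend"].tolist())
```

On this sample the mean including zeros is 75, while the mean of the known prices is 150.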

In [945]:
scraped_df[['zipcode', 'city']] = scraped_df['city'].str.split(' ', n=1, expand=True)
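
As a small worked example of this split, the sketch below uses three `city` values from the preview above. Passing the limit as the keyword `n=1` splits only at the first space, so multi-word city names stay intact; as far as I know, recent pandas releases accept this argument only as a keyword.

```python
import pandas as pd

# city values as they appear in the scraped data (zip code, then city)
raw = pd.DataFrame({
    "city": ["55033 Hastings", "98068 Snoquale Pass", "98807-1668 Wenatchee"],
})

# n=1 splits at the first space only; expand=True returns two columns
parts = raw["city"].str.split(" ", n=1, expand=True)
raw["zipcode"] = parts[0]
raw["city"] = parts[1]
print(raw)
```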

In [946]:
column_to_move = scraped_df.pop("zipcode")

# moving zipcode after state
scraped_df.insert(4, "zipcode", column_to_move )

In [947]:
scraped_df = scraped_df.rename(columns={"Name":"ski_resort"})

In [948]:
epic_list = ["Stowe Mountain", "Okemo Mountain", "Hunter Mountain", "Mt. Snow", "Mt. Sunapee", "Wildcat Mountain", "Seven Springs",
             "Attitash", "Jack Frost", "Crotched Mountain", "Laurel Mountain",
             "Roundtop Mountain", "Whitetail", "Liberty", "Big Boulder", "Heavenly Mountain", "Northstar California",
             "Kirkwood", "Stevens Pass", "Keystone", "Breckenridge","Vail", "Park City Mountain", "Beaver Creek",
             "Crested Butte Mountain", "Afton Alps", "Alpine Valley Ohio", "Boston Mills and Brandywine","Hidden Valley",
             "Mad River Mountain", "Mt. Brighton", "Paoli Peaks", "Snow Creek","Wilmot Mountain"]

mtn_col_list = ['Arapahoe Basin', 'Aspen Snowmass', 'Crystal Mountain',
                            'Jackson Hole', 'Mammoth Mountain', 'Snowbird',
                            'Palisades Tahoe', 'Sugarbush',
                            'Taos Ski Valley', 'Alta', 'Big Sky', 'Sugar Bowl',
                           'Sugarloaf', 'Sun Valley']

ikon_list = ['Winter Park', 'Copper Mountain', 'Steamboat', 'Eldora Mountain',
             'Palisades Tahoe', 'Mammoth Mountain', 'June Mountain', 'Big Bear',
             'Snow Valley', 'Stratton Mountain', 'Snowshoe Mountain', 'Sugarbush', 'Solitude Mountain', 'Alta', 'Snowbird',
             'Aspen Snowmass', 'Buttermilk', 'Arapahoe Basin', 'Big Sky',
             'Brighton', 'Deer Valley', 'Snowbasin', 'Jackson Hole', 'Mt. Bachelor',
             'Windham Mountain', 'Boyne Highlands','Crystal Mountain',
             'The Summit at Snoqualmie', 'Schweitzer', 'Sun Valley', 'Sunday River',
             'Sugarloaf', 'Loon Mountain', 'Taos Ski Valley', 'Killington', 'Pico Mountain']


In [950]:
# making new column with 0
scraped_df['epic'] = 0
scraped_df['mountain_collective'] = 0
scraped_df['ikon'] = 0

# adding 1 if the values match
scraped_df.loc[scraped_df['ski_resort'].isin(epic_list), 'epic'] = 1
scraped_df.loc[scraped_df['ski_resort'].isin(mtn_col_list), 'mountain_collective'] = 1
scraped_df.loc[scraped_df['ski_resort'].isin(ikon_list), 'ikon'] = 1
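
The three flag columns above follow the same pattern, so they can also be built in a loop. The sketch below uses a small toy frame and shortened pass lists (illustrative values only, not the full lists from this notebook) to show the idea.

```python
import pandas as pd

# toy stand-ins for scraped_df and the pass lists (values are illustrative only)
resorts = pd.DataFrame({"ski_resort": ["Vail", "Alta", "Bridger Bowl"]})
pass_lists = {
    "epic": ["Vail", "Keystone"],
    "mountain_collective": ["Alta", "Snowbird"],
    "ikon": ["Alta", "Steamboat"],
}

# isin() returns booleans; casting to int gives the same 0/1 flags in one step
for pass_name, members in pass_lists.items():
    resorts[pass_name] = resorts["ski_resort"].isin(members).astype(int)

print(resorts)
```

Because matching is by exact string, any resort whose name is spelled differently in the list and in the data will silently keep a 0.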

In [953]:
#checking to see if there are any missing resorts in the scraped information to ensure all resorts have feature information
final_ski_df.loc[~final_ski_df['ski_resort'].isin(scraped_df['ski_resort']),
                         'ski_resort'].sort_values().unique()

array([], dtype=object)

In [960]:
#saving final cleaned scraped df
scraped_df.to_csv("cleaned_data_exports/scraped_feature_df.csv")

In [1002]:
city_df = pd.DataFrame()
city_df['city'] = scraped_df['city']
city_df['state'] = scraped_df['state']
city_df.to_csv("cleaned_data_exports/location_for_scraping.csv")

In [990]:
#saving unique "city, state" locations for scraping
city_df['location'] = scraped_df[['city', 'state']].agg(', '.join, axis=1)

# keeping one row per location so city, state and location stay aligned
city_df = city_df.drop_duplicates(subset='location')

city_df = city_df.dropna()

city_df.to_csv("cleaned_data_exports/city_names_for_scraping.csv")

## Merging dataframes on Name Columns for features

In [954]:
final_ski_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2078 entries, 0 to 18198
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   review_date  2078 non-null   datetime64[ns]
 1   state        2078 non-null   object        
 2   ski_resort   2078 non-null   object        
 3   rating       2078 non-null   int64         
 4   review       2078 non-null   object        
 5   user_name    2078 non-null   object        
dtypes: datetime64[ns](1), int64(1), object(4)
memory usage: 113.6+ KB


In [955]:
merged_df = pd.merge(final_ski_df, scraped_df, on="ski_resort", how='left')

In [956]:
merged_df = merged_df.drop(columns="state_y")
merged_df = merged_df.rename(columns={"state_x":"state"})
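
A left merge keeps every review row, but it also adds rows whenever a resort name appears more than once on the right side. `final_ski_df.info()` reports 2,078 rows while the index of `merged_df` in the output that follows runs to 2080, which would be consistent with a few duplicated resort names in `scraped_df`. The sketch below shows one way to check for that on toy frames; the frames and their values are illustrative only.

```python
import pandas as pd

# toy stand-ins for final_ski_df and scraped_df
reviews = pd.DataFrame({"ski_resort": ["Winter Park", "Alta", "Winter Park"], "rating": [4, 5, 3]})
features = pd.DataFrame({"ski_resort": ["Winter Park", "Alta", "Alta"], "sumt": [12060, 11068, 11068]})

# duplicated keys on the right side multiply the matching rows in a left merge
dup_names = features.loc[features["ski_resort"].duplicated(), "ski_resort"].unique().tolist()
print("duplicated resort names:", dup_names)

# dropping the duplicates first keeps the row count equal to the review count
features_unique = features.drop_duplicates(subset="ski_resort")
merged = pd.merge(reviews, features_unique, on="ski_resort", how="left", validate="many_to_one")
print(len(reviews), len(merged))
```

The `validate="many_to_one"` argument makes pandas raise an error if the right-hand keys are not unique, which turns a silent row increase into a visible failure.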

In [957]:
merged_df

Unnamed: 0,review_date,state,ski_resort,rating,review,user_name,address,city,zipcode,sumt,...,junior_weekday,adult_weekday,senior_weekday,child_weekend,junior_weekend,adult_weekend,Url,epic,mountain_collective,ikon
0,2023-05-04,Colorado,Winter Park,4,Very family friendly,anon_1,P.O. Box 36,Winter Park,80482,12060,...,179,157,49,39,29,55,https://www.onthesnow.com/colorado/winter-park...,0,0,1
1,2023-05-04,Colorado,Arapahoe Basin,5,"Challenging terrain, no frills",anon_1,PO Box 5808,Dillon,80435,13050,...,59,79,89,99,69,99,https://www.onthesnow.com/colorado/arapahoe-ba...,0,1,1
2,2023-05-04,Colorado,Steamboat,5,Great public transport to and from lodging,anon_1,2305 Mt. Werner Circle,Steamboat Springs,80487,10568,...,177,167,144,192,182,106,https://www.onthesnow.com/colorado/steamboat/s...,0,0,1
3,2023-05-04,Colorado,Copper Mountain,5,Extremely diverse terrain and fantastic terrai...,anon_1,P.O. Box 3001,Copper Mountain,80443,12313,...,129,124,179,39,29,55,https://www.onthesnow.com/colorado/copper-moun...,0,0,1
4,2023-05-04,Utah,Solitude Mountain,5,"They had so much terrain, especially for the s...",anon_2,12000 Big Cottonwood Canyon,Brighton,84121,10488,...,69,115,85,45,69,115,https://www.onthesnow.com/utah/solitude-mounta...,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2077,2020-10-22,Utah,Beaver Mountain,5,Beaver Mountain is an awesome family run ski ...,Payton Sharum,P.O. Box 3455,Logan,84323-3455,8860,...,50,40,40,50,40,35,https://www.onthesnow.com/utah/beaver-mountain...,0,0,0
2078,2020-10-22,Wyoming,Snowy Range Ski Recreation Area,5,"I love how cozy this ski area is. Sure, it doe...",Payton Sharum,Po Box 247,Centennial,82055,9663,...,30,42,49,39,30,42,https://www.onthesnow.com/wyong/snowy-range-sk...,0,0,0
2079,2020-10-28,Montana,Discovery,5,This little is top notch and not full of peop...,Lori Young,PO Box 221,Anonda,59711,8150,...,26,49,38,39,26,49,https://www.onthesnow.com/montana/discovery-sk...,0,0,0
2080,2020-10-28,Montana,Discovery,5,Easy to get to nice large for family dining\nG...,Lori Young,PO Box 221,Anonda,59711,8150,...,26,49,38,39,26,49,https://www.onthesnow.com/montana/discovery-sk...,0,0,0


In [959]:
#saving final merged dataframe for content based system
merged_df.to_csv("cleaned_data_exports/final_merged_df.csv")

## Review Analysis

Looking at the most reviewed mountains

In [961]:
top_5_reviewed_resorts = pd.DataFrame(merged_df['ski_resort'].value_counts().reset_index()).head(5)

In [962]:
top_5_reviewed_resorts

Unnamed: 0,index,ski_resort
0,Breckenridge,42
1,Vail,40
2,Ski Brule,39
3,Killington,37
4,Snowbird,30


In [963]:
#using plotly to plot the most reviewed resorts
fig = px.bar(top_5_reviewed_resorts, x="index", y="ski_resort")
fig.update_layout(title_text='Most Reviewed Mountains',
                  title_x=0.5,
                  xaxis_title="Resort",
                  yaxis_title="Review Count",
                 plot_bgcolor='white')
fig.update_traces(marker_color = "#f86424")
fig.show()

Analyzing the distribution of ratings

In [964]:
#making dataframe of rating counts to compare distribution of ratings
top_ratings = pd.DataFrame(merged_df["rating"].value_counts(ascending=False).head(15))
top_ratings = top_ratings.reset_index()
top_ratings = top_ratings.rename(columns={"rating":"rating_count"})
top_ratings = top_ratings.rename(columns={"index":"rating"})

#making rating a string for plotting
top_ratings['rating'] = top_ratings['rating'].astype(str)

# Calculate the percentage of each rating count
top_ratings['rating_percentage'] = (top_ratings['rating_count'] / top_ratings['rating_count'].sum()) * 100

The ratings are unevenly distributed and skew toward the high end of the scale. This user bias will affect the model results.
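
One way to put a number on the skew is the share of ratings that are 4 or 5 stars. The sketch below computes it with `Counter` on a toy list of ratings; the list is illustrative and is not the real review data.

```python
from collections import Counter

# toy ratings skewed toward the high end (illustrative, not the real review data)
ratings = [5, 5, 5, 4, 5, 3, 4, 5, 1, 5]

counts = Counter(ratings)
total = sum(counts.values())
# share of each rating as a percentage of all ratings, most common first
rating_pct = {rating: 100 * n / total for rating, n in counts.most_common()}
print(rating_pct)

# share of ratings that are 4 or 5 stars
high_share = sum(pct for rating, pct in rating_pct.items() if rating >= 4)
print(high_share)
```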

In [965]:
#using plotly to plot the rating distribution
fig = px.bar(top_ratings, x="rating", y="rating_percentage",
             text="rating_percentage")
fig.update_layout(title_text='Rating Distribution',
                  title_x=0.5,
                  xaxis_title="Rating",
                  yaxis_title="Rating %",
                 plot_bgcolor='white',
                 font=dict(size=14))
fig.update_traces(marker_color = "#f86424", texttemplate='%{text:.1s}%', textposition='outside')

fig.show()

In [966]:
resort_list = merged_df['ski_resort'].unique().tolist()

# Making a surprise dataset for modeling

In [967]:
surprise_df = final_ski_df.copy()

In [968]:
surprise_df = surprise_df.drop(columns=["state","review_date","review"])

In [969]:
#reordering columns
surprise_df = surprise_df[['user_name', 'ski_resort', 'rating']]

In [970]:
surprise_df

Unnamed: 0,user_name,ski_resort,rating
0,anon_1,Winter Park,4
1,anon_1,Arapahoe Basin,5
2,anon_1,Steamboat,5
3,anon_1,Copper Mountain,5
4,anon_2,Solitude Mountain,5
...,...,...,...
18142,Payton Sharum,Beaver Mountain,5
18144,Payton Sharum,Snowy Range Ski Recreation Area,5
18154,Lori Young,Discovery,5
18155,Lori Young,Discovery,5


In [602]:
from surprise import Reader, Dataset

reader = Reader(rating_scale=(1, 5))

#loading final dataset
data = Dataset.load_from_df(surprise_df[['user_name', 'ski_resort', 'rating']], reader)

#splitting into train and test
trainset, testset = train_test_split(data, test_size=0.2, random_state=42)

In [603]:
#looking at number of users
print('Number of users: ', trainset.n_users, '\n')
print('Number of items: ', trainset.n_items)

Number of users:  387 

Number of items:  256
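
These counts give a sense of how sparse the user-resort matrix is. The sketch below estimates the density, assuming roughly 80% of the 2,078 ratings land in the trainset; the exact count is an assumption here and could be read from `trainset.n_ratings` instead.

```python
n_users = 387      # trainset.n_users from the output above
n_items = 256      # trainset.n_items from the output above
n_ratings = 1662   # assumption: about 80% of the 2,078 ratings end up in the trainset

possible_pairs = n_users * n_items
density = n_ratings / possible_pairs
print(f"{possible_pairs} possible user-resort pairs, density = {density:.2%}")
```

With under 2% of the possible pairs observed, most predictions rely heavily on the user and item biases rather than on dense interaction data.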


### Normal Predictor Model

In [420]:
# Instantiate the model
baseline = NormalPredictor()

#fitting model
baseline.fit(trainset)

# making prediction on testset
predictions = baseline.test(testset)

# Save RMSE score
baseline_normal = accuracy.rmse(predictions)

RMSE: 1.4467
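
For reference, RMSE is the square root of the mean squared difference between the true and the estimated ratings, so it is expressed in rating points on the 1 to 5 scale. The sketch below is a plain-Python version of the metric on toy pairs; it is for illustration and is not the Surprise implementation.

```python
import math

def rmse(pairs):
    """Root mean squared error over (true_rating, estimated_rating) pairs."""
    squared_errors = [(true - est) ** 2 for true, est in pairs]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# toy (true, estimated) pairs on the 1-5 scale; values are illustrative only
toy_pairs = [(5, 4.0), (4, 4.0), (3, 5.0), (5, 5.0)]
print(rmse(toy_pairs))
```

Because errors are squared before averaging, a single estimate that is off by two points weighs as much as four estimates that are off by one.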


In [100]:
#saving normal rmse
test_baseline_normal_rmse = 1.446

In [437]:
data_rmse = [['normal predictor', 1.45]]

model_df = pd.DataFrame(data_rmse, columns=['model', 'rmse'])

In [438]:
model_df

Unnamed: 0,model,rmse
0,normal predictor,1.45


In [439]:
def model_comp(model_name, rmse):
    model_df.loc[len(model_df.index)] = [model_name, rmse] 

### Baseline Model

In [424]:
# Instantiate and fit model
baseline2 = BaselineOnly()

#fitting model
baseline2.fit(trainset)

# making prediction on testset
predictions = baseline2.test(testset)

# Save RMSE score
baseline_only = accuracy.rmse(predictions)

Estimating biases using als...
RMSE: 0.9968


In [425]:
baseline_only_rmse = 0.9968

In [440]:
model_comp('baseline', .996)

In [441]:
model_df

Unnamed: 0,model,rmse
0,normal predictor,1.45
1,baseline,0.996


### SVD Model #1

In [428]:
# Cross validate a basic SVD with no hyperparameter tuning expecting sub-par results
svd_basic = SVD(random_state=42)

results = cross_validate(svd_basic, data, measures=['RMSE'], cv=3, n_jobs = -1, verbose=True)

Evaluating RMSE of algorithm SVD on 3 split(s).

                  Fold 1  Fold 2  Fold 3  Mean    Std     
RMSE (testset)    0.9681  0.9880  0.9676  0.9746  0.0095  
Fit time          0.07    0.07    0.07    0.07    0.00    
Test time         0.00    0.00    0.00    0.00    0.00    


In [429]:
# Fit to trainset and predict on the testset for evaluation
svd_basic.fit(trainset)

predictions = svd_basic.test(testset)

svd_simple = accuracy.rmse(predictions)

RMSE: 0.9600


In [430]:
#saving simple SVD rmse for final graph
test_svd_simple_rmse = 0.96

In [442]:
model_comp('svd', .96)

## SVD Grid Search

In [433]:
#test grid search
params = {'n_factors': [10, 20, 50, 100],
          'n_epochs': [5, 10, 20],
          'init_mean': [-0.5, 0, 0.5],
         'biased': [True, False]}

g_s_svd = GridSearchCV(SVD, param_grid=params, cv=5, refit=True)

g_s_svd.fit(data)
g_s_svd.best_params['rmse']

{'n_factors': 100, 'n_epochs': 20, 'init_mean': 0, 'biased': True}

In [434]:
print(g_s_svd.best_score)
print(g_s_svd.best_params)

{'rmse': 0.9656477812960531, 'mae': 0.7454564796864988}
{'rmse': {'n_factors': 100, 'n_epochs': 20, 'init_mean': 0, 'biased': True}, 'mae': {'n_factors': 100, 'n_epochs': 20, 'init_mean': 0, 'biased': True}}


In [435]:
# instantiating SVD with best hyperparameters from gridsearch
g_s_svd = SVD(n_factors=100 ,n_epochs=20, init_mean=0, biased=True)

# fit on trainset and make predictions using testset
g_s_svd.fit(trainset)
predictions = g_s_svd.test(testset)
g_s_svd_1 = accuracy.rmse(predictions)

RMSE: 0.9572


In [443]:
#saving first SVD grid search rmse for final graph
test_svd_grid_1 = 0.957

In [444]:
model_comp('svd_grid_1', .957)

## SVD Grid Search # 2

In [445]:
#test grid search
params = {'n_factors': [200, 300, 500],
          'n_epochs': [20, 30, 40, 50],
          'init_mean': [-0.5, 0, 0.5],
         'biased': [True, False]}

g_s_svd_2 = GridSearchCV(SVD, param_grid=params, cv=5, refit=True)

g_s_svd_2.fit(data)
g_s_svd_2.best_params['rmse']

{'n_factors': 200, 'n_epochs': 50, 'init_mean': 0, 'biased': True}

In [446]:
# instantiating SVD with best hyperparameters from gridsearch
g_s_svd_2 = SVD(n_factors=200 ,n_epochs=50, init_mean=0, biased=True)

# fit on trainset and make predictions using testset
g_s_svd_2.fit(trainset)
predictions_2 = g_s_svd_2.test(testset)
# storing the RMSE under its own name so the fitted model is not overwritten
g_s_svd_2_rmse = accuracy.rmse(predictions_2)

RMSE: 0.9111


In [447]:
#saving second SVD grid search rmse for final graph
test_svd_grid_2 = 0.911

In [448]:
model_comp('svd_grid_2', .911)

## NMF Grid Search

In [449]:
# New hyperparameter dictionary for nmf model
nmf_param_grid = {'biased':[True, False],
                  'n_factors':[10, 20, 50],
                  'n_epochs': [20, 40, 50]}
nmf_gs_model = GridSearchCV(NMF, param_grid=nmf_param_grid, cv=3, joblib_verbose=10, return_train_measures=True)

# Fit and return the best hyperparameters
nmf_gs_model.fit(data)
nmf_gs_model.best_params['rmse']

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.2s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:    0.2s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:    0.3s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    0.3s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed:    0.4s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   7 out of   7 | elapsed:    0.5s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   8 out of   8 | elapsed:    0.6s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   9 out of   9 | elapsed:    0.6s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  54 out of  54 | elapsed:    5.4s finished


{'biased': True, 'n_factors': 20, 'n_epochs': 40}
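
The `54 out of 54` in the log above lines up with the size of the grid: the number of parameter combinations multiplied by the number of folds. The sketch below counts them with `itertools.product`, under the assumption that each logged task corresponds to one fit on one fold.

```python
from itertools import product

# same grid as nmf_param_grid above
nmf_param_grid = {'biased': [True, False],
                  'n_factors': [10, 20, 50],
                  'n_epochs': [20, 40, 50]}
cv_folds = 3

# every combination of values, one dict per candidate model
combos = [dict(zip(nmf_param_grid, values)) for values in product(*nmf_param_grid.values())]
n_fits = len(combos) * cv_folds
print(len(combos), "combinations x", cv_folds, "folds =", n_fits, "fits")
```

The same arithmetic is a quick way to estimate run time before launching a larger grid, since the number of fits grows multiplicatively with every added value.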

In [450]:
nmf_gs_model.best_score['rmse']

0.9962696646207182

In [451]:
# instantiating NMF with best hyperparameters from gridsearch
nfm_model = NMF(biased=True, n_factors=20, n_epochs=40)

# Fit on trainset and make predictions using testset to return RMSE metric
nfm_model.fit(trainset)
predictions = nfm_model.test(testset)
nfm_model_1 = accuracy.rmse(predictions)

RMSE: 0.9374


In [452]:
nmf_grid_rmse_1 = 0.9374

In [453]:
model_comp('nmf_grid_1', .937)

### NMF Grid Search #2

In [454]:
# New hyperparameter dictionary for nmf model
nmf_param_grid_2 = {'biased':[True, False],
                  'n_factors':[5, 10, 15, 20, 30],
                  'n_epochs': [20, 20, 40, 50, 60]}
nmf_gs_model_2 = GridSearchCV(NMF, param_grid=nmf_param_grid_2, cv=3, joblib_verbose=10, return_train_measures=True)

# Fit and return the best hyperparameters
nmf_gs_model_2.fit(data)
nmf_gs_model_2.best_params['rmse']

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    0.2s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed:    0.2s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   7 out of   7 | elapsed:    0.3s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   8 out of   8 | elapsed:    0.3s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   9 out of   9 | elapsed:    0.4s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 150 out of 150 | elapsed:   11.3s finished


{'biased': True, 'n_factors': 15, 'n_epochs': 50}

In [455]:
nmf_gs_model_2.best_score['rmse']

0.9789281059684821

In [456]:
# instantiating NMF with best hyperparameters from gridsearch
nfm_model_2 = NMF(biased=True, n_factors=15, n_epochs=50)

# Fit on trainset and make predictions using testset to return RMSE metric
nfm_model_2.fit(trainset)
predictions_2 = nfm_model_2.test(testset)
# storing the RMSE under its own name so the fitted model is not overwritten
nfm_model_2_rmse = accuracy.rmse(predictions_2)

RMSE: 0.9387


In [457]:
nmf_grid_rmse_2 = 0.9387

In [458]:
model_comp('nmf_grid_2', .939)

### SVD++

In [464]:
# New hyperparameter dictionary for SVD++ model
svd_pp_param_grid = {'n_factors':[20, 30, 50, 100, 200],
                  'n_epochs': [20, 40, 50, 60, 70],
                    'init_mean':[0, .01, .5]}
svd_pp_model = GridSearchCV(SVDpp, param_grid=svd_pp_param_grid, cv=3, joblib_verbose=10, return_train_measures=True)

# Fit and return the best hyperparameters
svd_pp_model.fit(data)
svd_pp_model.best_params['rmse']

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.2s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.4s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:    0.6s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:    0.8s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    0.9s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed:    1.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   7 out of   7 | elapsed:    1.3s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   8 out of   8 | elapsed:    1.5s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   9 out of   9 | elapsed:    1.7s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 225 out of 225 | elapsed:  3.1min finished


{'n_factors': 100, 'n_epochs': 60, 'init_mean': 0.01}

In [614]:
# instantiating SVD++ with best hyperparameters from gridsearch
svd_pp_model = SVDpp(n_factors=100, n_epochs=60, init_mean=.01)

# Fit on trainset and make predictions using testset to return RMSE metric
svd_pp_model.fit(trainset)
predictions = svd_pp_model.test(testset)
svd_pp_model_1 = accuracy.rmse(predictions)

RMSE: 0.9021


In [466]:
svd_pp_rmse_1 = 0.90

In [467]:
model_comp('svd_pp_1', .90)

### SVD++ Grid Search #2

In [468]:
# New hyperparameter dictionary for SVD++ model
svd_pp_param_grid = {'n_factors':[100, 200, 300],
                  'n_epochs': [50, 60, 70, 80],
                    'init_mean':[0, .01, .5]}
svd_pp_model_2 = GridSearchCV(SVDpp, param_grid=svd_pp_param_grid, cv=3, joblib_verbose=10, return_train_measures=True)

# Fit and return the best hyperparameters
svd_pp_model_2.fit(data)
svd_pp_model_2.best_params['rmse']

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    1.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    1.9s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:    2.9s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:    3.8s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    4.8s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed:    5.7s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   7 out of   7 | elapsed:    6.7s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   8 out of   8 | elapsed:    7.7s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   9 out of   9 | elapsed:    8.6s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 108 out of 108 | elapsed:  4.1min finished


{'n_factors': 200, 'n_epochs': 60, 'init_mean': 0}

In [469]:
svd_pp_model_2.best_score['rmse']

0.9600214226128932

In [613]:
# instantiating SVD++ with best hyperparameters from gridsearch
svd_pp_model_2 = SVDpp(n_factors=200, n_epochs=60, init_mean=0)

# Fit on trainset and make predictions using testset to return RMSE metric
svd_pp_model_2.fit(trainset)
predictions = svd_pp_model_2.test(testset)
# storing the RMSE under its own name so the fitted model is not overwritten
svd_pp_model_2_rmse = accuracy.rmse(predictions)

RMSE: 0.9233


In [471]:
svd_pp_rmse_2 = 0.914

In [472]:
model_comp('svd_pp_2', .914)

## SVD++ Grid Search #3

In [473]:
# New hyperparameter dictionary for SVD++ model
svd_pp_param_grid = {'n_factors':[200, 300, 400, 500],
                  'n_epochs': [20, 70, 80, 90],
                    'init_mean':[0, .01, .02, .03]}
svd_pp_model_3 = GridSearchCV(SVDpp, param_grid=svd_pp_param_grid, cv=3, joblib_verbose=10, return_train_measures=True)

# Fit and return the best hyperparameters
svd_pp_model_3.fit(data)
svd_pp_model_3.best_params['rmse']

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.7s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    1.4s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:    2.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:    2.8s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    3.5s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed:    4.2s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   7 out of   7 | elapsed:    5.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   8 out of   8 | elapsed:    5.7s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   9 out of   9 | elapsed:    6.4s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 192 out of 192 | elapsed: 13.9min finished


{'n_factors': 200, 'n_epochs': 80, 'init_mean': 0}

In [474]:
svd_pp_model_3.best_score['rmse']

0.9652578371679076

In [612]:
# instantiating SVD++ with best hyperparameters from gridsearch
svd_pp_model_3 = SVDpp(n_factors=200, n_epochs=80)

# Fit on trainset and make predictions using testset to return RMSE metric
svd_pp_model_3.fit(trainset)
predictions = svd_pp_model_3.test(testset)
# storing the RMSE under its own name so the fitted model is not overwritten
svd_pp_model_3_rmse = accuracy.rmse(predictions)

RMSE: 0.9092


In [476]:
svd_pp_rmse_3 = 0.929

In [477]:
model_comp('svd_pp_3', .929)

## SVD++ Grid Search #4

In [478]:
# New hyperparameter dictionary for SVD++ model
svd_pp_param_grid = {'n_factors':[50, 100, 150, 170],
                  'n_epochs': [60, 70, 80, 90, 100],
                    'init_mean':[0, .01, .03, .05]}
svd_pp_model_4 = GridSearchCV(SVDpp, param_grid=svd_pp_param_grid, cv=3, joblib_verbose=10, return_train_measures=True)

# Fit and return the best hyperparameters
svd_pp_model_4.fit(data)
svd_pp_model_4.best_params['rmse']

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.8s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    1.5s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:    2.3s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:    3.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    3.8s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed:    4.5s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   7 out of   7 | elapsed:    5.2s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   8 out of   8 | elapsed:    6.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   9 out of   9 | elapsed:    6.7s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 240 out of 240 | elapsed:  7.2min finished


{'n_factors': 150, 'n_epochs': 80, 'init_mean': 0}

In [479]:
svd_pp_model_4.best_score['rmse']

0.9553117554893132

In [491]:
# instantiating SVD++ with best hyperparameters from gridsearch
svd_pp_model_4 = SVDpp(n_factors=150, n_epochs=80)

# Fit on trainset and make predictions using testset to return RMSE metric
svd_pp_model_4.fit(trainset)
predictions = svd_pp_model_4.test(testset)
# storing the RMSE under its own name so the fitted model is not overwritten
svd_pp_model_4_rmse = accuracy.rmse(predictions)

RMSE: 0.9264


In [611]:
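# trying SVD++ with n_factors=180, a value that was not in the grid above
svd_pp_model_4 = SVDpp(n_factors=180, n_epochs=80)

# Fit on trainset and make predictions using testset to return RMSE metric
svd_pp_model_4.fit(trainset)
predictions = svd_pp_model_4.test(testset)
# storing the RMSE under its own name so the fitted model is not overwritten
svd_pp_model_4_rmse = accuracy.rmse(predictions)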

RMSE: 0.9184


In [494]:
model_comp('svd_pp_4', 0.926)

## SVD++ Grid Search #5

In [495]:
# New hyperparameter dictionary for SVD++ model
svd_pp_param_grid = {'n_factors':[40, 50, 60, 70],
                  'n_epochs': [40, 50, 60, 70, 80],
                    'init_mean':[0, .01, .05]}
svd_pp_model_5 = GridSearchCV(SVDpp, param_grid=svd_pp_param_grid, cv=3, joblib_verbose=10, return_train_measures=True)

# Fit and return the best hyperparameters
svd_pp_model_5.fit(data)
svd_pp_model_5.best_params['rmse']

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.5s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.9s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:    1.4s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:    1.8s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    2.3s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed:    2.7s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   7 out of   7 | elapsed:    3.2s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   8 out of   8 | elapsed:    3.6s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   9 out of   9 | elapsed:    4.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 180 out of 180 | elapsed:  2.4min finished


{'n_factors': 60, 'n_epochs': 40, 'init_mean': 0.01}

In [610]:
# instantiating SVD++ with best hyperparameters from gridsearch
svd_pp_model_5 = SVDpp(n_factors=60, n_epochs=40, init_mean=.01)

# Fit on trainset and make predictions using testset to return RMSE metric
svd_pp_model_5.fit(trainset)
predictions = svd_pp_model_5.test(testset)
svd_pp_model_5_rmse = accuracy.rmse(predictions)

RMSE: 0.9210


In [498]:
svd_pp_rmse_5 = 0.915

In [499]:
model_comp('svd_pp_5', .915)

## SVD++ Grid Search #6

In [500]:
svd_pp_param_grid = {'n_factors':[30, 40, 50, 60, 70, 80],
                  'n_epochs': [10, 20, 30, 40, 50],
                    'init_mean':[0, .01, .05]}
svd_pp_model_6 = GridSearchCV(SVDpp, param_grid=svd_pp_param_grid, cv=3, joblib_verbose=10, return_train_measures=True)

# Fit and return the best hyperparameters
svd_pp_model_6.fit(data)
svd_pp_model_6.best_params['rmse']

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.3s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:    0.4s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:    0.5s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    0.6s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed:    0.7s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   7 out of   7 | elapsed:    0.9s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   8 out of   8 | elapsed:    1.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   9 out of   9 | elapsed:    1.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 270 out of 270 | elapsed:  1.9min finished


{'n_factors': 70, 'n_epochs': 40, 'init_mean': 0.01}

In [615]:
# instantiating SVD++ with best hyperparameters from gridsearch
svd_pp_model_6 = SVDpp(n_factors=70, n_epochs=40, init_mean=.01)

# Fit on trainset and make predictions using testset to return RMSE metric
svd_pp_model_6.fit(trainset)
predictions = svd_pp_model_6.test(testset)
svd_pp_model_6_rmse = accuracy.rmse(predictions)

RMSE: 0.9263


In [616]:
svd_pp_rmse_6 = 0.917

In [503]:
model_comp('svd_pp_6', .917)

### Model Comparison

In [504]:
model_df.sort_values(by='rmse', ascending=True).head()

Unnamed: 0,model,rmse
7,svd_pp_1,0.9
4,svd_grid_2,0.911
8,svd_pp_2,0.914
11,svd_pp_5,0.915
12,svd_pp_6,0.917
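
To make the comparison more concrete, the sketch below expresses the gap between the best model and the two reference models as a percentage reduction in RMSE. It uses the values recorded in `model_df`; the helper `pct_improvement` is a hypothetical name introduced here for illustration.

```python
# RMSE values as recorded in model_df above
rmse_by_model = {
    'normal predictor': 1.45,
    'baseline': 0.996,
    'svd_grid_2': 0.911,
    'svd_pp_1': 0.90,
}

def pct_improvement(reference, candidate):
    """Percentage reduction in RMSE going from reference to candidate."""
    return 100 * (reference - candidate) / reference

best = min(rmse_by_model, key=rmse_by_model.get)
for name in ('normal predictor', 'baseline'):
    gain = pct_improvement(rmse_by_model[name], rmse_by_model[best])
    print(f"{best} vs {name}: {gain:.1f}% lower RMSE")
```

The differences among the top five models are small (0.90 to 0.917), so the ranking between them may change from run to run when no `random_state` is fixed.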


In [617]:
# instantiating the best model: SVD++ with the hyperparameters from the first SVD++ grid search
best_model = SVDpp(n_factors=100, n_epochs=60, init_mean=.01)

# Fit on trainset and make predictions using testset to return RMSE metric
best_model.fit(trainset)
predictions = best_model.test(testset)
best_model_rmse = accuracy.rmse(predictions)

RMSE: 0.9082


## Function

In [508]:
surprise_df.head()

Unnamed: 0,user_name,ski_resort,rating
0,anon_1,Winter Park,4
1,anon_1,Arapahoe Basin,5
2,anon_1,Steamboat,5
3,anon_1,Copper Mountain,5
4,anon_2,Solitude Mountain,5


In [509]:
#saving new dataframe with only user information
user_df = surprise_df.reset_index()
user_df.set_index('user_name', inplace = True)
user_df.drop(columns = ['rating', 'index'], inplace =True)
user_df.head()

Unnamed: 0_level_0,ski_resort
user_name,Unnamed: 1_level_1
anon_1,Winter Park
anon_1,Arapahoe Basin
anon_1,Steamboat
anon_1,Copper Mountain
anon_2,Solitude Mountain


In [652]:
def shred_recommender():
    """Prompt for a user name and return the top-n resorts that user has not rated."""
    user = input('Name: ')
    n_recs = int(input('How many resort recommendations do you want? '))

    # resorts the user has already rated; [[user]] keeps a Series even for a single review
    have_rated = user_df.loc[[user], 'ski_resort'].tolist()
    # one row per resort the user has not rated yet
    not_rated = merged_df.loc[~merged_df['ski_resort'].isin(have_rated)]
    not_rated = not_rated.drop_duplicates(subset=['ski_resort']).reset_index()
    # predicted rating for each unrated resort from the best model
    not_rated['predicted_rating'] = not_rated['ski_resort'].apply(lambda x: best_model.predict(user, x).est)
    not_rated = not_rated.sort_values(by='predicted_rating', ascending=False)
    not_rated = not_rated.drop(columns=['user_name', 'review_date', 'index', 'rating', 'review'])
    return not_rated.head(n_recs)
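
The core of the function above is: remove the resorts the user has already rated, score the rest, sort by predicted rating, and keep the top n. The sketch below isolates that logic in plain Python. `top_n_unrated`, `fake_predict`, and the scores are hypothetical stand-ins (for `best_model.predict(user, x).est` in particular) so the example runs without a trained model.

```python
def top_n_unrated(user, all_items, rated_items, predict, n=3):
    """Return the n unrated items with the highest predicted rating."""
    rated = set(rated_items)
    # dict.fromkeys drops duplicate items while keeping their first-seen order
    candidates = [item for item in dict.fromkeys(all_items) if item not in rated]
    scored = [(item, predict(user, item)) for item in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]

# hypothetical stand-in for best_model.predict(user, item).est
fake_scores = {'Alta': 4.1, 'Jackson Hole': 4.4, 'Bridger Bowl': 4.5, 'Vail': 3.9}

def fake_predict(user, item):
    return fake_scores[item]

recs = top_n_unrated('some_user',
                     ['Vail', 'Alta', 'Jackson Hole', 'Alta', 'Bridger Bowl'],
                     rated_items=['Vail'], predict=fake_predict, n=2)
print(recs)
```

Taking the user and the number of recommendations as arguments, rather than through `input()`, also makes the logic easier to test.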

In [654]:
shred_recommender()

Name: Stephanie Ciaccia
How many resort recommendations do you want? 3


Unnamed: 0,state,ski_resort,address,city,zipcode,sumt,drop,base,gondolas_and_trams,fastEight,...,adult_weekday,senior_weekday,child_weekend,junior_weekend,adult_weekend,Url,epic,mountain_collective,ikon,predicted_rating
161,Montana,Bridger Bowl,15795 Bridger Canyon Rd.,Bozeman,59715,8700,2600,6100,0,0.0,...,85,40,60,85,40,https://www.onthesnow.com/montana/bridger-bowl...,0,0,0,4.524584
12,Wyoming,Jackson Hole,P.O. Box 290,Teton Village,83025,10450,4139,6311,3,4.0,...,194,215,172,94,140,https://www.onthesnow.com/wyong/Jackson-hole/s...,0,1,1,4.41202
3,Colorado,Copper Mountain,P.O. Box 3001,Copper Mountain,80443,12313,2738,9712,1,4.0,...,124,179,39,29,55,https://www.onthesnow.com/colorado/copper-moun...,0,0,1,4.397625


In [154]:
list(user_df.loc['Stephanie Ciaccia', 'ski_resort'])

['Hunter Mountain', 'Vail', 'Snowbird', 'Breckenridge', 'Park City Mountain']

In [655]:
shred_recommender()

Name: Deanna Uzarski
How many resort recommendations do you want? 3


Unnamed: 0,state,ski_resort,address,city,zipcode,sumt,drop,base,gondolas_and_trams,fastEight,...,adult_weekday,senior_weekday,child_weekend,junior_weekend,adult_weekend,Url,epic,mountain_collective,ikon,predicted_rating
53,California,Kirkwood,PO Box 1,Kirkwood,95646,9800,2000,7800,0,2.0,...,46,49,39,29,55,https://www.onthesnow.com/california/kirkwood/...,1,0,0,4.535256
26,Utah,Snowbird,P.O. Box 929000,Snowbird,84092-9000,11000,3240,7760,1,7.0,...,110,184,156,29,94,https://www.onthesnow.com/utah/snowbird/ski-re...,0,1,1,4.516392
11,Wyoming,Jackson Hole,P.O. Box 290,Teton Village,83025,10450,4139,6311,3,4.0,...,194,215,172,94,140,https://www.onthesnow.com/wyong/Jackson-hole/s...,0,1,1,4.502303


In [153]:
list(user_df.loc['Deanna Uzarski', 'ski_resort'])

['Telluride', 'Breckenridge', 'Crested Butte Mountain', 'Alta', 'Vail']

In [656]:
shred_recommender()

Name: Raghava Kamalesh
How many resort recommendations do you want? 3


Unnamed: 0,state,ski_resort,address,city,zipcode,sumt,drop,base,gondolas_and_trams,fastEight,...,adult_weekday,senior_weekday,child_weekend,junior_weekend,adult_weekend,Url,epic,mountain_collective,ikon,predicted_rating
27,Utah,Snowbird,P.O. Box 929000,Snowbird,84092-9000,11000,3240,7760,1,7.0,...,110,184,156,29,94,https://www.onthesnow.com/utah/snowbird/ski-re...,0,1,1,4.828966
11,Wyoming,Jackson Hole,P.O. Box 290,Teton Village,83025,10450,4139,6311,3,4.0,...,194,215,172,94,140,https://www.onthesnow.com/wyong/Jackson-hole/s...,0,1,1,4.807636
104,New York,Whiteface Mountain,Whiteface Mountain Route 86,Wilngton,12997,4650,3430,1220,1,1.0,...,90,115,90,70,90,https://www.onthesnow.com/new-york/whitefe-mou...,0,0,0,4.774799


In [155]:
list(user_df.loc['Raghava Kamalesh', 'ski_resort'])

['Breckenridge', 'Crested Butte Mountain', 'Vail', 'Beaver Creek', 'Telluride']

In [622]:
shred_recommender()

Name: Alexandria Kelly
How many resort recommendations do you want? 3


Unnamed: 0,state,ski_resort,predicted_rating
18,Wyoming,Jackson Hole,4.375903
113,Michigan,Nubs Nob,4.24732
46,California,Tahoe Donner,4.220013


In [565]:
list(user_df.loc['Raghava Kamalesh', 'ski_resort'])

['Breckenridge', 'Crested Butte', 'Vail', 'Beaver Creek', 'Telluride']

In [634]:
def shred_recommender_state():
    """Prompt for a user name and a state, then return the top-n unrated resorts in that state."""
    user = input('Name: ')
    n_recs = int(input('How many resort recommendations do you want? '))
    state = input('What state would you like to shred in? ')

    # resorts the user has already rated; [[user]] keeps a Series even for a single review
    have_rated = user_df.loc[[user], 'ski_resort'].tolist()
    # one row per resort in the chosen state that the user has not rated yet
    not_rated = merged_df.loc[~merged_df['ski_resort'].isin(have_rated) & (merged_df['state'] == state)]
    not_rated = not_rated.drop_duplicates(subset=['ski_resort']).reset_index()
    # predicted rating for each remaining resort from the best model
    not_rated['predicted_rating'] = not_rated['ski_resort'].apply(lambda x: best_model.predict(user, x).est)
    not_rated = not_rated.sort_values(by='predicted_rating', ascending=False)
    not_rated = not_rated.drop(columns=['user_name', 'review_date', 'index', 'rating', 'review',
                                       'address', 'zipcode', 'Url'])
    return not_rated.head(n_recs)
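
The state filter above is an exact string comparison, so an input such as "utah" or "Utah " would return an empty result. The sketch below shows one way to normalize the input first. `match_state` is a hypothetical helper, and the short state list stands in for `merged_df['state'].unique()`.

```python
def match_state(user_input, known_states):
    """Return the known state name matching user_input, ignoring case and outer spaces."""
    cleaned = user_input.strip().lower()
    for state in known_states:
        if state.lower() == cleaned:
            return state
    raise ValueError(f"Unknown state: {user_input!r}")

# in the notebook this list could come from merged_df['state'].unique()
known_states = ['Utah', 'Colorado', 'New York']
print(match_state('  utah ', known_states))
print(match_state('NEW YORK', known_states))
```

Raising an error for an unknown state gives the user a clearer signal than an empty table of recommendations.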

In [635]:
shred_recommender_state()

Name: Stephanie Ciaccia
How many resort recommendations do you want? 3
What state would you like to shred in? Utah


Unnamed: 0,state,ski_resort,city,sumt,drop,base,gondolas_and_trams,fastEight,highSpeedSixes,quadChairs,...,junior_season,adult_season,child_weekday,junior_weekday,adult_weekday,senior_weekday,child_weekend,junior_weekend,adult_weekend,predicted_rating
2,Utah,Alta,Alta,11068,2538,8530,0,3.0,0.0,0.0,...,599,1349,1049,80,159,49,125,29,55,4.051192
3,Utah,Brighton,Brighton,10500,1745,8755,0,4.0,0.0,1.0,...,419,899,629,33,53,85,57,29,53,3.986185
6,Utah,Snowbasin,Huntsville,9350,2900,6450,3,2.0,1.0,0.0,...,699,1149,799,33,89,149,99,29,109,3.934579
