# Avant Ski

by: Stephanie Ciaccia

# Overview

Skiing holds a prominent place for those seeking winter recreational activities in the United States. With its stunning mountain ranges and diverse terrain, the country boasts numerous ski resorts that cater to all skill levels, from beginners to seasoned professionals. 

Skiing offers a unique blend of adventure, physical activity, and natural beauty, making it a popular choice for winter enthusiasts seeking both relaxation and excitement.

The ski market in the United States is thriving, contributing significantly to the economy. According to the [National Ski Areas Association (NSAA)](https://nsaa.org/webdocs/Media_Public/IndustryStats/Historical_Skier_Days_1979_2022.pdf), approximately 60.7 million skiers and snowboarders visited 473 ski resorts in the 2021-2022 winter season.

# Business Problem

Skiing is an exhilarating winter activity enjoyed by many, but barriers such as high costs and limited accessibility often hinder people from fully experiencing its joys. Choosing the right ski resort can be overwhelming due to the multitude of options available, and existing websites lack dynamic filtering capabilities based on user preferences.

To address these challenges, I'm developing Avant Ski, a ski resort recommendation app. Avant Ski simplifies the ski resort selection process by leveraging data and user preferences. With dynamic filtering features, users can personalize their search based on budget, location, amenities, and skill level. By bridging the gap between ski enthusiasts and their dream destinations, Avant Ski makes skiing accessible to a wider audience, empowering them to plan unforgettable ski trips with confidence.

# Data Understanding

In [1]:
import pandas as pd
import numpy as np
import math
from datetime import datetime
import datetime
from scipy import stats

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import plotly
import plotly.express as px
import plotly.io as pio
from matplotlib.ticker import StrMethodFormatter
import plotly.graph_objects as go

from surprise.model_selection import cross_validate
from surprise import Dataset, Reader, accuracy
from surprise.prediction_algorithms import KNNWithMeans, KNNBasic, KNNBaseline, KNNWithZScore,  SVD, SVDpp, NMF, BaselineOnly, NormalPredictor
from surprise.model_selection import GridSearchCV, cross_validate, train_test_split

from collections import Counter
from nltk.corpus import stopwords

from IPython.display import Image, display

import glob
import os

Function to print full rows

In [2]:
def print_full(x):
    pd.set_option('display.max_rows', len(x))
    print(x)
    pd.reset_option('display.max_rows')
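As a possible variation on the helper above, `pd.option_context` restores whatever `display.max_rows` value was active before the call, rather than resetting it to the pandas default. This is only a sketch; the name `print_full_scoped` is hypothetical and not part of the notebook.

```python
import pandas as pd


def print_full_scoped(x):
    """Print every row of a Series or DataFrame without truncation."""
    # The option is changed only inside the with-block and restored afterwards.
    with pd.option_context("display.max_rows", None):
        print(x)
```

Calling `print_full_scoped(snow_df["ski_resort"].value_counts())` would print all resorts while leaving the notebook's display settings untouched.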

# Importing Data Files

In [3]:
snow_df = pd.read_csv("data/OnTheSnow_SkiAreaReviews_clean.csv")
survey_df = pd.read_csv("data/usa_ski_resort_survey.csv")
scraped_df = pd.read_csv("data/onthesnow_scrape_170523_cleaned.csv")
second_scrape = pd.read_csv("data/OnTheSnow_Srape_2_200523_cleaned.csv")

In [4]:
#airbnb scrape four guest listings
dec_4_airbnb_mean_final = pd.read_csv("cleaned_data_exports/scraped_cleaned/dec_4_airbnb_mean_final.csv")
jan_4_airbnb_mean_final = pd.read_csv("cleaned_data_exports/scraped_cleaned/jan_4_airbnb_mean_final.csv")
feb_4_airbnb_mean_final = pd.read_csv("cleaned_data_exports/scraped_cleaned/feb_4_airbnb_mean_final.csv")
mar_4_airbnb_mean_final = pd.read_csv("cleaned_data_exports/scraped_cleaned/mar_4_airbnb_mean_final.csv")
apr_4_airbnb_mean_final = pd.read_csv("cleaned_data_exports/scraped_cleaned/apr_4_airbnb_mean_final.csv")
may_4_airbnb_mean_final = pd.read_csv("cleaned_data_exports/scraped_cleaned/may_4_airbnb_mean_final.csv")

#airbnb scrape two guest listings
dec_2_airbnb_mean_final = pd.read_csv("cleaned_data_exports/scraped_cleaned/dec_2_airbnb_mean_final.csv")
jan_2_airbnb_mean_final = pd.read_csv("cleaned_data_exports/scraped_cleaned/jan_2_airbnb_mean_final.csv")
feb_2_airbnb_mean_final = pd.read_csv("cleaned_data_exports/scraped_cleaned/feb_2_airbnb_mean_final.csv")
mar_2_airbnb_mean_final = pd.read_csv("cleaned_data_exports/scraped_cleaned/mar_2_airbnb_mean_final.csv")
apr_2_airbnb_mean_final = pd.read_csv("cleaned_data_exports/scraped_cleaned/apr_2_airbnb_mean_final.csv")
may_2_airbnb_mean_final = pd.read_csv("cleaned_data_exports/scraped_cleaned/may_2_airbnb_mean_final.csv")

In [5]:
#google geocoding api
latitude_df = pd.read_csv("data/mountain_lat_long.csv")

### Data Source #1 - OnTheSnow (Kaggle)
### User Based Filtering Dataset

The main dataset for the user-based collaborative filtering model was pulled from [Kaggle](https://www.kaggle.com/datasets/fredkellner/onthesnow-ski-area-reviews). The dataset contains reviews scraped from OnTheSnow, a leading website that provides information about ski resorts and snow conditions.

There are 18,238 reviews (18,128 with a reviewer name) from 291 ski resorts in the USA. The features include:

- State
- Ski Area
- Reviewer Name
- Review Date
- Review Star Rating (out of 5)
- Review Text

In [6]:
snow_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18238 entries, 0 to 18237
Data columns (total 6 columns):
 #   Column                         Non-Null Count  Dtype 
---  ------                         --------------  ----- 
 0   State                          18238 non-null  object
 1   Ski Area                       18238 non-null  object
 2   Reviewer Name                  18128 non-null  object
 3   Review Date                    18238 non-null  object
 4   Review Star Rating (out of 5)  18238 non-null  int64 
 5   Review Text                    18226 non-null  object
dtypes: int64(1), object(5)
memory usage: 855.0+ KB


In [7]:
snow_df.head()

Unnamed: 0,State,Ski Area,Reviewer Name,Review Date,Review Star Rating (out of 5),Review Text
0,colorado,copper-mountain-resort,anonymous_user,3-Mar-04,3,I have a pass the includes other mountains but...
1,utah,brighton-resort,anonymous_user,2-Dec-04,4,I've been coming to Brighton for years. Unlike...
2,north-carolina,ski-beech-mountain-resort,anonymous_user,1-Jan-05,5,"We went last Weekend, and it was the best snow..."
3,new-mexico,red-river,anonymous_user,1-Mar-05,5,Love Red River we go every year!
4,pennsylvania,sno-mountain,anonymous_user,2-Mar-05,4,"Great varied terrain, not crowded, good prices..."


In [8]:
#renaming columns
new_name = ['state', 'ski_resort', 'user_name','review_date', 'rating',
           'review'] 

snow_df.columns = new_name

In [9]:
snow_df

Unnamed: 0,state,ski_resort,user_name,review_date,rating,review
0,colorado,copper-mountain-resort,anonymous_user,3-Mar-04,3,I have a pass the includes other mountains but...
1,utah,brighton-resort,anonymous_user,2-Dec-04,4,I've been coming to Brighton for years. Unlike...
2,north-carolina,ski-beech-mountain-resort,anonymous_user,1-Jan-05,5,"We went last Weekend, and it was the best snow..."
3,new-mexico,red-river,anonymous_user,1-Mar-05,5,Love Red River we go every year!
4,pennsylvania,sno-mountain,anonymous_user,2-Mar-05,4,"Great varied terrain, not crowded, good prices..."
...,...,...,...,...,...,...
18233,minnesota,lutsen-mountains,REBECCA CARTWRIGHT,14-Dec-20,4,Many workers on the lifts did not know how to ...
18234,new-mexico,sipapu-ski-and-summer-resort,Antonio Martinez,15-Dec-20,5,"staying in the ""hotel"" (""motel"" on the sign ab..."
18235,new-mexico,sipapu-ski-and-summer-resort,Antonio Martinez,15-Dec-20,5,"staying in the ""hotel"" (""motel"" on the sign ab..."
18236,new-mexico,taos-ski-valley,David Humphrey,15-Dec-20,5,"Good skiing, have lost their way over the year..."


In [10]:
survey_df['user_name'].unique()

array(['anon_1', 'anon_2', 'anon_3', 'anon_4', 'anon_5', 'anon_6',
       'anon_7', 'anon_8', 'anon_9', 'Stephanie Ciaccia', 'Joseph Lewis',
       'Alexandria Kelly', 'Deanna Uzarski', 'Raghava Kamalesh'],
      dtype=object)

In [11]:
survey_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 61 entries, 0 to 60
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   review_date  61 non-null     object
 1   state        61 non-null     object
 2   ski_resort   61 non-null     object
 3   rating       61 non-null     int64 
 4   review       61 non-null     object
 5   user_name    61 non-null     object
dtypes: int64(1), object(5)
memory usage: 3.0+ KB


In [12]:
snow_df['review_date'] = pd.to_datetime(snow_df['review_date'])
survey_df['review_date'] = pd.to_datetime(survey_df['review_date'])
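The OnTheSnow dates shown earlier look like `3-Mar-04` and `14-Dec-20`, which reads as day, abbreviated month, two-digit year. A minimal sketch of parsing that layout explicitly, assuming every row follows it (the survey dates may use a different layout, so this applies to the OnTheSnow column only):

```python
import pandas as pd

# Two sample values copied from the review_date column shown above.
raw_dates = pd.Series(["3-Mar-04", "14-Dec-20"])

# An explicit format avoids per-row format guessing; errors="coerce" turns
# anything that does not match into NaT instead of raising an exception.
parsed_dates = pd.to_datetime(raw_dates, format="%d-%b-%y", errors="coerce")
print(parsed_dates)
```

Counting `parsed_dates.isna().sum()` afterwards is a quick way to see how many rows did not match the assumed layout.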

In [13]:
snow_df["ski_resort"].value_counts()

ski-brule                        1315
killington-resort                 204
vail                              203
winter-park-resort                191
blue-mountain-ski-area            189
                                 ... 
willard-mountain                   13
otis-ridge-ski-area                13
holiday-mountain                   13
big-squaw-mountain-ski-resort      13
whaleback-mountain                 12
Name: ski_resort, Length: 291, dtype: int64

In [14]:
snow_df['user_name'].value_counts().head(30)

anonymous_user         3026
anonymous               304
undefined undefined     130
Mike                     49
Ben                      49
Ryan                     46
Richard                  44
Dan                      42
David                    42
Chris                    39
Rob                      37
Jeff                     36
wolfman                  31
Derek                    31
Brian                    31
Matt                     31
iPhone                   28
Michael                  28
Nick                     27
gma                      27
Kevin                    27
Jim                      26
J                        25
Steve                    25
Jason                    23
Jun                      23
Mark                     23
Joe                      23
Justin                   22
Paul                     22
Name: user_name, dtype: int64

To clean the review dataset, I dropped reviews from users whose names do not identify a single person, such as anonymous placeholders and bare first names. I went through the dataset and kept dropping rows until only unique usernames or users with both a first and last name were left.

In [15]:
drop_list = ["anonymous_user", "anonymous","undefined undefined","Mike", 
             "Ben", "Ryan", "Richard", "Dan", "David", "Chris", "Rob", "Jeff",
            "Derek", "Brian", "Matt", "Michael", "iPhone", "Kevin", "Nick",
            "Jim", "Steve", "Jason", "Mark", "Joe", "Paul", "Justin", "Scott",
            "Bob", "Alex", "Carter", "Dave", "Tim", "Bill", "Andrew", "John", "Sam",
            "James", "Kim", "Craig", "mike", "jason", "James", "Sam", "Kim", "mike", "peter",
            "Jack", "Adam", "Tom", "Wes", "Jun", "Steven", "Max", "Matthew", "Laura", "Felipe",
            "Greg", "Bryan", "Sarah", "Sara", "Christian", "Ray", "Connor", "Erin", "Emily",
            "Luke", "Ed", "Patrick", "kyle", "Ken", "Linda", "Eric", "Aaron", "Jake",
            "Josh", "Tony", "Abe", "Frank", "Peter", "Fred", "Arthur", "Lorraine",
            "Phil", "Sean", "Will", "Julie", "Jon", "Amy", "Becky", "Shannon", "brendan",
            "Kathy", "wayne", "Ethan", "Erika", "Jill", "Zoe", "Rick", "Wyatt",
            "Tyler", "Andrea", "mark", "john", "Donna", "Jen", "Braden", "D", "Bryce",
            "Rich", "Jared", "Jay", "Ann", "Brandon", "Nicholas","Martin",
            'Robert', 'angelino','Anonymous',
             'ty', 'jase', 'Jesse', 'Jennifer', 'Dustin', 'Natalie',
             'Pat', 'anonymous user', 'matt', 'George', 'Kate',
             'Daniel','Cindy', 'Barry', 'Todd', 'Melanie', 'Drew',
             'Andy', 'Hochard','Wayne', 'dan',
             'Charlie', 'Vanessa','Allen', 'Austin', 'Roger',
             'Jerry', 'Scotty', 'Anon', 'Lucas', 'Brian', 'Lee', 'Taylor',
            'brian', 'Lisa', 'Jade', 'Spencer', 'chris', 'Jenny', 'Amanda', 'Brett',
            'Maria', 'Holly', 'iPad', 'Sylvia', 'iPhone (2)', 'Catherine', 'Hannah', 'Wade',
            'Larry', 'Lauren','Noah', 'Bobby', 'Don', 'Christine', 'Stephen', 'Howard',
             'Tanner', 'Tom', 'Casey', 'Kyle', 'Michelle', 'Shelby',
             'Benjamin', 'Erik', 'Molly', 'Johnny', 'Chuck', 'Johnny',
             'Nathan', 'Cathy', 'Shelley', 'Mary', 'Danny', 'mitch', 'Brad', 'Tammy', 'erik',
            'Tricia', 'Nate', 'Pete']

snow_df = snow_df[~snow_df['user_name'].isin(drop_list)]
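The drop list above contains several names in more than one casing (for example `Mike` and `mike`). One way to shorten such a list is to compare names case-insensitively. The sketch below illustrates the idea; the helper name `drop_generic_names` is hypothetical and the sample data is made up.

```python
import pandas as pd


def drop_generic_names(df, generic_names, column="user_name"):
    """Return the rows whose user name is not in generic_names, ignoring case."""
    generic = {name.casefold() for name in generic_names}
    # Missing names are treated as empty strings here, so those rows are kept,
    # which matches how isin() treats NaN in the cell above.
    normalized = df[column].fillna("").str.strip().str.casefold()
    return df[~normalized.isin(generic)].copy()


sample = pd.DataFrame(
    {"user_name": ["Mike", "mike ", "Tim Zheng", "anonymous_user", "wolfman", None]}
)
kept = drop_generic_names(sample, ["Mike", "anonymous_user"])
print(kept)
```

With this approach each generic name only needs to appear once in the list, whatever casing reviewers used.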

In [16]:
snow_df['user_name'].value_counts().head(70)

wolfman           31
gma               27
J                 25
Tim Zheng         22
Dave O            22
                  ..
gwiffie            7
Nick Franchino     7
MCPG               7
Bill Deaton        7
tc5                7
Name: user_name, Length: 70, dtype: int64

In [17]:
#renaming columns
new_name = ['state', 'ski_resort', 'user_name','review_date', 'rating',
           'review'] 

snow_df.columns = new_name

After cleaning the usernames, I further narrow down the users by keeping only those with at least 3 reviews.

In [18]:
# counting the number of reviews for each user
value_counts = snow_df['user_name'].value_counts()

# selecting only users with at least three reviews
selected_users = value_counts[value_counts > 2].index

# selecting only the rows where the user_name is in the selected_users list
cleaned_snow = snow_df[snow_df['user_name'].isin(selected_users)]
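The same filter can be written as a small reusable function with the threshold as a parameter, which makes it easy to try other cutoffs. This is a sketch on toy data; `keep_active_users` is a hypothetical name.

```python
import pandas as pd


def keep_active_users(df, min_reviews=3, column="user_name"):
    """Keep only rows from users who have at least min_reviews reviews."""
    # Map each row's user name to that user's total review count.
    review_counts = df[column].map(df[column].value_counts())
    return df[review_counts >= min_reviews]


toy = pd.DataFrame({"user_name": ["a", "a", "a", "b", "b"], "rating": [5, 4, 3, 2, 1]})
active = keep_active_users(toy, min_reviews=3)
print(active)
```

Raising `min_reviews` gives the collaborative filtering model more ratings per user at the cost of fewer users overall.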

In [19]:
cleaned_snow['user_name'].value_counts(ascending=True)

thorne36               3
Christopher Horner     3
william c              3
Dexter                 3
rlhinvail              3
                      ..
Tim Zheng             22
Dave O                22
J                     25
gma                   27
wolfman               31
Name: user_name, Length: 648, dtype: int64

Keeping only users with at least 3 reviews reduced the dataset to 2,996 reviews from 648 users.

In [20]:
cleaned_snow.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2996 entries, 103 to 18199
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   state        2996 non-null   object        
 1   ski_resort   2996 non-null   object        
 2   user_name    2996 non-null   object        
 3   review_date  2996 non-null   datetime64[ns]
 4   rating       2996 non-null   int64         
 5   review       2994 non-null   object        
dtypes: datetime64[ns](1), int64(1), object(4)
memory usage: 163.8+ KB


In [21]:
cleaned_snow['user_name'].unique()

array(['ericadyer', 'FroDog', 'jackson321', 'SammyG', 'Jill Adler',
       'Twenz', 'anhillx', 'emj35', 'Benji Zimmerman', 'Roger Leo',
       'filterban', 'JP', 'gkphi5', 'RippinSkiers', 'Jay C',
       'Mark Rosasco', 'jay', 'Dan Gibson', 'stevenstclair', 'jwtime',
       'Cherokee', 'treesker', 'hranee', 'gwiffie', 'stevenam', 'Gunny J',
       'jim8588', 'Bmorabito', 'Richard 1', 'airgarden94', 'joey58242',
       'Mike134', 'tom travis', 'bobbert', 'noonito', 'tsfoust',
       'Art Zinn', 'steffenwolf', 'seniordude', 'Shartron',
       'Mikey Likes It', 'Americansonofa', 'Randy Agness', 'sno_thing',
       'sampanning', 'steep-n-deep ', 'jestertatt', 'Dantheman',
       'swissnowtiger', 'brandon', 'flyersboy114', 'Ritt', 'Bob Butts',
       'sharimcatee', 'iLiveToRide17', 'bodibran', 'yodeledihoo',
       'tourist from Texas', 'p_nut', 'highvoltageguy', 'masterdel',
       'govey80', 'horse', 'fcherichel', 'mwolske', 'mayham2k', 'Adye 1',
       'MgoBlue', 'Randy Rogers', 'Bobby G

In [22]:
#dropping duplicate rows
cleaned_snow = cleaned_snow.drop_duplicates()

#### Ski resort name - cleaning

Since the target variable is the ski resort, I need to clean and standardize the resort names in all datasets so that they are consistent.

In [23]:
cleaned_snow['ski_resort'].unique()

array(['squaw-valley-usa', 'sun-valley', 'donner-ski-ranch', 'boreal',
       'diamond-peak', 'mt-baker', 'alpental', 'stevens-pass-resort',
       'the-summit-at-snoqualmie', 'mt-rose-ski-tahoe', 'mountain-high',
       'snowshoe-mountain-resort', 'alyeska-resort', 'steamboat',
       'alta-ski-area', 'snowbird', 'snowbasin', 'brighton-resort',
       'solitude-mountain-resort', 'deer-valley-resort',
       'park-city-mountain-resort', 'jackson-hole', 'sundance',
       'brian-head-resort', 'bretton-woods', 'loon-mountain',
       'sierra-at-tahoe', 'heavenly-mountain-resort', 'gunstock',
       'sno-mountain', 'attitash', 'crystal-mountain-wa', 'vail',
       'killington-resort', 'waterville-valley', 'kirkwood',
       'copper-mountain-resort', 'breckenridge',
       'arapahoe-basin-ski-area', 'keystone', 'boyne-mountain-resort',
       'crystal-mountain', 'shanty-creek', 'cannonsburg',
       'boyne-highlands', 'aspen-snowmass', 'sunday-river',
       'mount-sunapee', 'sugar-bowl-re

In [24]:
#removing words to clean up resort names
replace_snow = ['-ski-area', '-', 'resort', 'mt']
replace_with = ['', ' ', '', 'mt.']

cleaned_snow = cleaned_snow.replace(replace_snow, replace_with, regex=True)

In [25]:
#making columns titlecase
cleaned_snow['ski_resort'] = cleaned_snow['ski_resort'].str.title()
cleaned_snow['state'] = cleaned_snow['state'].str.title()
cleaned_snow['ski_resort'] = cleaned_snow['ski_resort'].str.strip()

In [26]:
#replacing values to standardize endings/specific resort names
replace_snow = ['At', 'Mtn', 'Mt.N', 'Mt. Hood Ski Bowl', 'And', r'\bMount\b']
replace_with = ['at', 'Mountain', 'Mountain', 'Mt. Hood Skibowl', 'and', 'Mt.']

cleaned_snow = cleaned_snow.replace(replace_snow, replace_with, regex=True)
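Because `DataFrame.replace(..., regex=True)` matches substrings in every string column, short patterns such as `At` and `And` also hit names like Attitash and Andes Tower Hills, and the review text is rewritten as well. A column-scoped sketch with word boundaries is one way to avoid that. It is an illustration rather than a drop-in replacement: the helper name is hypothetical and it does not reproduce the `Mtn` and `Mount` handling above.

```python
import pandas as pd


def clean_resort_names(names):
    """Turn slugs such as 'sierra-at-tahoe' into display names such as 'Sierra at Tahoe'."""
    out = names.str.replace("-ski-area", "", regex=False)
    out = out.str.replace("-", " ", regex=False)
    # \b restricts each pattern to whole words.
    out = out.str.replace(r"\bresort\b", "", regex=True)
    out = out.str.replace(r"\bmt\b", "mt.", regex=True)
    # Collapse leftover whitespace before title-casing.
    out = out.str.replace(r"\s+", " ", regex=True).str.strip().str.title()
    out = out.str.replace(r"\bAt\b", "at", regex=True)
    out = out.str.replace(r"\bAnd\b", "and", regex=True)
    return out


slugs = pd.Series(["sierra-at-tahoe", "attitash", "mt-baker", "alta-ski-area"])
print(clean_resort_names(slugs).tolist())
```

Applied as `cleaned_snow['ski_resort'] = clean_resort_names(cleaned_snow['ski_resort'])`, this would leave `user_name` and `review` untouched; the `state` column would still need its hyphens replaced separately.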

After inspecting the resort names, I found a few resorts with identical or very similar names. I adjusted those names and appended the state to tell them apart.

In [27]:
#timberline
cleaned_snow.loc[(cleaned_snow['ski_resort'] == "Timberline Four Seasons") & (cleaned_snow['state'] == "West Virginia"), 'ski_resort'] = "Timberline Mountain"

#crystal mountain
cleaned_snow.loc[(cleaned_snow['ski_resort'] == "Crystal Mountain Wa") & (cleaned_snow['state'] == "Washington"), 'ski_resort'] = "Crystal Mountain Washington"
cleaned_snow.loc[(cleaned_snow['ski_resort'] == "Crystal Mountain") & (cleaned_snow['state'] == "Michigan"), 'ski_resort'] = "Crystal Mountain Michigan"

#magic mountain
cleaned_snow.loc[(cleaned_snow['ski_resort'] == "Magic Mountain") & (cleaned_snow['state'] == "Vermont"), 'ski_resort'] = "Magic Mountain Vermont"
cleaned_snow.loc[(cleaned_snow['ski_resort'] == "Magic Mountain") & (cleaned_snow['state'] == "Idaho"), 'ski_resort'] = "Magic Mountain Idaho"
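The per-resort assignments above can also be generalized: any resort name that appears in more than one state gets its state appended. A minimal sketch, assuming the `ski_resort` and `state` columns are already cleaned; the function name is hypothetical.

```python
import pandas as pd


def add_state_to_shared_names(df):
    """Append the state to resort names that occur in more than one state."""
    out = df.copy()
    # Number of distinct states per resort name, aligned to the rows.
    states_per_name = out.groupby("ski_resort")["state"].transform("nunique")
    shared = states_per_name > 1
    out.loc[shared, "ski_resort"] = out.loc[shared, "ski_resort"] + " " + out.loc[shared, "state"]
    return out


toy = pd.DataFrame(
    {
        "ski_resort": ["Magic Mountain", "Magic Mountain", "Vail", "Vail"],
        "state": ["Vermont", "Idaho", "Colorado", "Colorado"],
    }
)
print(add_state_to_shared_names(toy))
```

This only catches exact duplicates. Near-duplicates such as "Crystal Mountain Wa" versus "Crystal Mountain" still need the manual handling shown above.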

In [28]:
mountain_rep = ['Squaw Valley Usa',
                'Mccauley Mountain Ski Center', 'attitash', 'Smugglers Notch',
               'Pico Mountain at Killington', 'andes Tower Hills']

mountain_new = ['Palisades Tahoe',
               'McCauley Mountain Ski Center', 'Attitash', "Smugglers' Notch",
               'Pico Mountain', 'Andes Tower Hills']

cleaned_snow = cleaned_snow.replace(mountain_rep, mountain_new, regex=True)

### Data Source #2 - Survey Data

A small additional dataset was collected through a [Google survey](https://docs.google.com/forms/d/1ROrGEkCh40RjbHidNCqg4SCCbY3_6DFNw0VWIhTEIGs/edit#responses) that I distributed to individuals who ski, including myself.

I downloaded the Sheets file from Google and saved it as a .csv. A few individuals did not include their name, so I gave them unique "anon" names.

I plan to use three users I know personally to check whether the model's recommendations align with their preferences. I also asked those three users to send me a brief summary of the key characteristics they look for when choosing ski resorts to visit.

In [29]:
#making columns titlecase
survey_df['ski_resort'] = survey_df['ski_resort'].str.title()
survey_df['state'] = survey_df['state'].str.title()
survey_df['ski_resort'] = survey_df['ski_resort'].str.strip()

In [30]:
list(survey_df['ski_resort'].sort_values().unique())

['Alta',
 'Arapahoe Basin',
 'Aspen Highlands',
 'Aspen Mountain',
 'Bear Valley',
 'Beaver Creek',
 'Beaver Mountain',
 'Breckenridge',
 'Brighton',
 'Cherry Peak',
 'Copper',
 'Copper Mountain',
 'Crested Butte',
 'Crystal Mountain - Wa',
 'Deer Valley',
 'Dodge Ridge',
 'Gore Mountain',
 'Hunter Mountain',
 'Jackson Hole',
 'Killington',
 'Mammoth',
 'Mammoth Mountain',
 'Mccauley Mountain',
 'Mt Baker',
 'Mt. Rose',
 'Nordic Valley',
 'Palisades',
 'Palisades Tahoe',
 'Park City Mountain',
 'Powder Mountain',
 'Roundtop Mountain',
 'Snow Ridge',
 'Snowbird',
 'Snowmass',
 'Solitude',
 'Steamboat',
 "Steven'S Pass",
 'Stevens Pass',
 'Stratton',
 'Sugarbush',
 'Taos',
 'Telluride',
 'Vail',
 'Winter Park',
 'Woods Valley']

Below, I manually parsed through the resort names and changed the names to match the names in the main dataframe.

In [31]:
mountain_rep = ['Crystal Mountain - Wa',"Steven'S Pass", 'Mammoth Mountain', 'Mammoth', 'Stratton',
                'Mccauley Mountain', 'Taos', 'Snowmass', 'Palisades Tahoe', 'Palisades', 'Copper Mountain', 'Copper',
                'Crested Butte', 'Mt. Rose', 'Mt Baker', 'Nordic Valley' ,'Solitude']
mountain_rep_p = ['Crystal Mountain Washington', 'Stevens Pass', 'Mammoth', 'Mammoth Mountain','Stratton Mountain',
                 'McCauley Mountain Ski Center', 'Taos Ski Valley', 'Aspen Snowmass', 'Palisades', 'Palisades Tahoe', 'Copper', 'Copper Mountain',
                 'Crested Butte Mountain', 'Mt. Rose Ski Tahoe', 'Mt. Baker', 'Nordic Mountain', 'Solitude Mountain']

survey_df = survey_df.replace(mountain_rep, mountain_rep_p, regex=True)

In [32]:
# changing aspen resorts since all four mountains are part of snowmass
mountain_r = ['Aspen Mountain', 'Aspen Highlands']

survey_df = survey_df.replace(mountain_r, 'Aspen Snowmass', regex=True)

In [33]:
#checking to see which names are different
survey_df.loc[~survey_df['ski_resort'].isin(cleaned_snow['ski_resort']),
                         'ski_resort'].unique()

array(['Cherry Peak'], dtype=object)
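The same name check is needed whenever two sources are reconciled, so it can be wrapped in a helper. A sketch on made-up values; `unmatched_names` is a hypothetical name.

```python
import pandas as pd


def unmatched_names(left, right):
    """Return the sorted names that appear in left but not in right."""
    return sorted(set(left.dropna()) - set(right.dropna()))


survey_names = pd.Series(["Vail", "Cherry Peak", "Alta"])
review_names = pd.Series(["Vail", "Alta", "Snowbird"])
print(unmatched_names(survey_names, review_names))
```

An empty list means every name in the first source has a counterpart in the second.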

### Merging survey and OnTheSnow review data

In [34]:
#merging survey review results and final onthesnow reviews
final_ski_df = pd.concat([survey_df, cleaned_snow])

In [35]:
#dropping null values
final_ski_df = final_ski_df.dropna()

In [36]:
#dropping duplicates
final_ski_df = final_ski_df.drop_duplicates()

In [37]:
final_ski_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2796 entries, 0 to 18198
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   review_date  2796 non-null   datetime64[ns]
 1   state        2796 non-null   object        
 2   ski_resort   2796 non-null   object        
 3   rating       2796 non-null   int64         
 4   review       2796 non-null   object        
 5   user_name    2796 non-null   object        
dtypes: datetime64[ns](1), int64(1), object(4)
memory usage: 152.9+ KB


In [38]:
#Blackjack ski and Indianhead combined to form Snowriver Mountain Resort
blackjack = ['Blackjack Ski', 'Indianhead Mountain']

final_ski_df = final_ski_df.replace(blackjack, 'Snowriver Mountain Resort', regex=True)

In [39]:
#updating names of resort names that have changed
old_name = ['Durango Mountain','Las Vegas Ski and Snowboard','Shawnee Peak', 'Suicide Six', 'Snow Summit']
new_name = ['Purgatory Mountain','Lee Canyon','Shawnee Mountain', 'Saskadena Six', 'Big Bear']

final_ski_df = final_ski_df.replace(old_name, new_name, regex=True)

In [40]:
# making dictionary for replacements to avoid doubling the names
replacements = {
    'Brandywine': 'Boston Mills and Brandywine',
    'Boston Mills': 'Boston Mills and Brandywine'
}

# replacing
final_ski_df['ski_resort'] = final_ski_df['ski_resort'].replace(replacements)
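To illustrate why the dictionary avoids doubling: `Series.replace` with a plain dictionary only swaps values that match a key exactly, whereas a substring replacement also fires inside a name that has already been combined. A small sketch on toy values:

```python
import pandas as pd

resorts = pd.Series(["Brandywine", "Boston Mills", "Boston Mills and Brandywine", "Vail"])

replacements = {
    "Brandywine": "Boston Mills and Brandywine",
    "Boston Mills": "Boston Mills and Brandywine",
}

# Exact-match replacement: the already-combined name is left alone.
exact = resorts.replace(replacements)

# Substring replacement: the combined name is hit again and doubled.
substring = resorts.str.replace("Brandywine", "Boston Mills and Brandywine", regex=False)

print(exact.tolist())
print(substring.tolist())
```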

In [41]:
#updating duplicate ski resort names and saving as new resort names that include the state names
final_ski_df.loc[(final_ski_df['ski_resort'] == "Crystal Mountain") & (final_ski_df['state'] == "Washington"), 'ski_resort'] = "Crystal Mountain Washington"
final_ski_df.loc[(final_ski_df['ski_resort'] == "Crystal Mountain") & (final_ski_df['state'] == "Michigan"), 'ski_resort'] = "Crystal Mountain Michigan"
final_ski_df.loc[(final_ski_df['ski_resort'] == "Crystal Mountain ") & (final_ski_df['state'] == "Michigan"), 'ski_resort'] = "Crystal Mountain Michigan"
final_ski_df.loc[(final_ski_df['ski_resort'] == "Powder Ridge") & (final_ski_df['state'] == "Minnesota"), 'ski_resort'] = "Powder Ridge Minnesota"
final_ski_df.loc[(final_ski_df['ski_resort'] == "Powder Ridge") & (final_ski_df['state'] == "Connecticut"), 'ski_resort'] = "Powder Ridge Connecticut"

#alpine valley
final_ski_df.loc[(final_ski_df['ski_resort'] == "Alpine Valley") & (final_ski_df['state'] == "Wisconsin"), 'ski_resort'] = "Alpine Valley Wisconsin"
final_ski_df.loc[(final_ski_df['ski_resort'] == "Alpine Valley") & (final_ski_df['state'] == "Ohio"), 'ski_resort'] = "Alpine Valley Ohio"

In [42]:
#dropping Cherry Peak
final_ski_df = final_ski_df.loc[(final_ski_df['ski_resort'] != "Cherry Peak")]

In [43]:
#exporting final cleaned dataframe for OnTheSnow scrape
#final_ski_df.to_csv("cleaned_data_exports/final_review_df_final.csv")

### Data Source #3 - OnTheSnow Scrape


I scraped OnTheSnow to pull current ski resort features for the resorts in the final merged dataset. The code for this scrape was adapted from a [user on GitHub](https://github.com/SijiaLai/OnTheSnow/tree/master) and updated to account for HTML changes and the features I wanted to pull.

The code for the scraper can be found in the data folder.


The main features I scraped:

- mountain elevation
- ticket price
- mountain location
- ski terrain
- snowfall averages

In [44]:
scraped_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 333 entries, 0 to 332
Data columns (total 43 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Name                      333 non-null    object 
 1   address                   333 non-null    object 
 2   city                      333 non-null    object 
 3   state                     333 non-null    object 
 4   country                   333 non-null    object 
 5   sumt                      333 non-null    int64  
 6   drop                      333 non-null    int64  
 7   base                      333 non-null    int64  
 8   gondolas_and_trams        333 non-null    int64  
 9   fastEight                 331 non-null    float64
 10  highSpeedSixes            327 non-null    float64
 11  quadChairs                329 non-null    float64
 12  tripleChairs              330 non-null    float64
 13  doubleChairs              331 non-null    float64
 14  surfeLifts

In [45]:
scraped_df['state'].unique()

array(['Washington', 'Minnesota', 'Ohio', 'Wisconsin', 'Utah', 'Alaska',
       'New Mexico', 'Oregon', 'North Carolina', 'Michigan', 'Colorado',
       'Arizona', 'New Hampshire', 'Pennsylvania', 'California',
       'New York', 'Masshusetts', 'Montana', 'Maine', 'Idaho', 'Vermont',
       'CA 96160', 'Virginia', 'New Jersey', 'West Virginia', 'Illinois',
       'WI  54819', 'South Dakota', 'Nevada', 'Wyoming', 'Missouri',
       'NV 89131', 'ID 83873', 'Connecticut', 'Pa 16440', 'Iowa', 'NY',
       'Tennessee', 'Indiana', 'CO', 'Maryland', 'Rhode Island'],
      dtype=object)

In [46]:
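#malformed state values (abbreviations, stray zip codes, one misspelling) and their full state names
state_rep = ['WI  54819', 'NV 89131', 'ID 83873', 'Pa 16440', 'NY', 'CO', 'CA 96160', 'Masshusetts']
state_with = ["Wisconsin", "Nevada", "Idaho", "Pennsylvania", "New York", "Colorado", "California", "Massachusetts"]

#exact-match replacement on the state column only, so 'NY' or 'CO' inside other columns is left untouched
scraped_df['state'] = scraped_df['state'].replace(state_rep, state_with)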
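Several of the malformed state values pair a two-letter abbreviation with a stray ZIP code. A pattern-based sketch can normalize those without listing each ZIP code by hand. The abbreviation table below covers only the codes seen in this dataset, and the function name is hypothetical.

```python
import pandas as pd

# Only the abbreviations that show up in the scraped state column.
ABBREVIATIONS = {
    "WI": "Wisconsin", "NV": "Nevada", "ID": "Idaho", "PA": "Pennsylvania",
    "NY": "New York", "CO": "Colorado", "CA": "California",
}


def normalize_state(states):
    """Map values like 'WI  54819' or 'NY' to full state names; leave the rest as-is."""
    # Capture a two-letter code that is optionally followed by a five-digit ZIP code.
    codes = states.str.extract(r"^\s*([A-Za-z]{2})(?:\s+\d{5})?\s*$", expand=False).str.upper()
    return codes.map(ABBREVIATIONS).fillna(states)


raw_states = pd.Series(["WI  54819", "Pa 16440", "NY", "Utah", "New York"])
print(normalize_state(raw_states).tolist())
```

Full state names do not match the pattern, so they pass through unchanged.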

In [47]:
scraped_df.isna().sum().sort_values(ascending=False)

highSpeedSixes              6
quadChairs                  4
tripleChairs                3
fastEight                   2
doubleChairs                2
surfeLifts                  1
averageSnowfall (inches)    0
daysOpenLastYear            0
longestRun (miles)          0
totalRuns (acre)            0
ticketpriceNote             0
projectedClosing            0
gondolas_and_trams          0
base                        0
drop                        0
sumt                        0
country                     0
state                       0
city                        0
address                     0
projectedOpening            0
NovSnow                     0
terrainNote                 0
junior_weekday              0
gondolas_lifts_note         0
Url                         0
senior_weekend              0
adult_weekend               0
junior_weekend              0
child_weekend               0
senior_weekday              0
adult_weekday               0
child_weekday               0
DecSnow   

In [48]:
#dropping columns that are not needed
scraped_df = scraped_df.drop(columns=['gondolas_lifts_note', 'terrainNote', 'ticketpriceNote',
                                     'senior_weekend', 'MaySnow', 'country'])

The code below replaces missing ticket prices, which are recorded as 0, with the mean of the column. If time allowed, I would have preferred to go through the missing values manually and fill them in from each ski resort's website.

In [49]:
#filling empty lift ticket prices with mean

def mean_ticket(column):
    
    #changing all column types to int
    scraped_df[column] = scraped_df[column].astype(int)
    
    # finding mean values
    mean_value = scraped_df[column].mean()
    mean_value = int(mean_value)
    
    # filling 0 with mean value
    scraped_df[column] = scraped_df[column].replace(0, mean_value)
    
    return scraped_df

In [50]:
#making a loop to loop through list of ticket prices that need to be updated

ticket_prices = ["adult_weekend", "junior_weekend", "child_weekend", "senior_weekday", "adult_weekday",
"junior_weekday", "child_weekday", "adult_season", "junior_season", "child_season"]

for ticket_val in ticket_prices:
    mean_ticket(ticket_val)
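Because the zeros stand in for missing prices, including them in the average pulls the fill value down. A minimal sketch of an alternative that averages only the nonzero prices; the helper name is hypothetical and this is not what the notebook's results are based on.

```python
import pandas as pd


def fill_zero_with_nonzero_mean(prices):
    """Replace 0 entries with the integer mean of the nonzero entries."""
    prices = prices.astype(int)
    nonzero = prices[prices != 0]
    if nonzero.empty:
        # Nothing to average, so return the column unchanged.
        return prices
    return prices.replace(0, int(nonzero.mean()))


toy_prices = pd.Series([100, 0, 50])
print(fill_zero_with_nonzero_mean(toy_prices).tolist())
```

For the toy column, the fill value is 75 (the mean of 100 and 50) instead of 50 (the mean with the zero included).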

In [51]:
#splitting the city column into zipcode and city columns (n must be passed by keyword in pandas 2.0+)
scraped_df[['zipcode', 'city']] = scraped_df['city'].str.split(' ', n=1, expand=True)
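Splitting on the first space assumes every value starts with a ZIP code. A stricter sketch uses `str.extract`, which yields missing values for rows that do not follow that layout instead of silently mis-splitting them. The sample values below are made up for illustration.

```python
import pandas as pd

# Hypothetical values in the assumed "zipcode city" layout, plus one that breaks it.
city_raw = pd.Series(["12345 Example Town", "54321 Sampleville", "No Zip Here"])

# Named groups become the column names of the resulting DataFrame.
parts = city_raw.str.extract(r"^(?P<zipcode>\d{5})\s+(?P<city>.+)$")
print(parts)
```

Rows where `parts['zipcode']` is missing can then be inspected by hand.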

In [52]:
#changing location of zipcode so it is placed next to city
column_to_move = scraped_df.pop("zipcode")

# moving zipcode after state
scraped_df.insert(4, "zipcode", column_to_move )

In [53]:
#renaming columns
scraped_df = scraped_df.rename(columns={"Name":"ski_resort"})

### OnTheSnow - Scrape #2

My initial scrape did not include ski run information, so I updated my initial scraping code and pulled ski run difficulty information.

In [54]:
second_scrape.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 330 entries, 0 to 329
Data columns (total 49 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   ski_resort           330 non-null    object 
 1   address              330 non-null    object 
 2   city                 330 non-null    object 
 3   state                330 non-null    object 
 4   country              330 non-null    object 
 5   summit               330 non-null    int64  
 6   drop                 330 non-null    int64  
 7   base                 330 non-null    int64  
 8   gondolas_and_trams   40 non-null     float64
 9   fast_eight           110 non-null    float64
 10  high_speed_sixes     41 non-null     float64
 11  quad_chairs          163 non-null    float64
 12  triple_chairs        223 non-null    float64
 13  double_chairs        238 non-null    float64
 14  surface_lifts        308 non-null    float64
 15  total_runs           330 non-null    int

In [55]:
second_scrape = second_scrape[['ski_resort','beginner_runs', 'intermediate_runs','advanced_runs', 'expert_runs']].copy()

In [56]:
#merging
scraped_df = pd.merge(scraped_df, second_scrape, on="ski_resort", how="left")

#replacing characters
scraped_df = scraped_df.replace("%", "", regex=True)

#replacing more characters
replace_vals = ['null', '-', " "]

#replacing strings 
scraped_df['expert_runs'] = scraped_df['expert_runs'].replace(replace_vals, "", regex=True)

#replacing empty cells with 0
scraped_df['expert_runs'] = scraped_df['expert_runs'].replace(r'^\s*$', 0, regex=True)

In [57]:
#filling null values with 0 as a null value indicates 0 runs

int_list = ["beginner_runs", "intermediate_runs", "advanced_runs", "expert_runs"]

for x in int_list:
    scraped_df[x] = scraped_df[x].fillna(0)

#converting values to integers (astype returns a new Series, so assign it back)
for x in int_list:
    scraped_df[x] = scraped_df[x].astype(int)
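The cleanup above (strip the percent sign, treat placeholders as zero, convert to integers) can also be done in one pass per column. The following is a minimal sketch on a hypothetical sample frame, not the scraped data itself; the resort names and values are made up for illustration.

```python
import pandas as pd

# Hypothetical sample mimicking scraped run-difficulty percentages.
runs = pd.DataFrame({
    "ski_resort": ["Resort A", "Resort B", "Resort C"],
    "beginner_runs": ["25%", "40%", None],
    "expert_runs": ["10%", "null", "-"],
})

run_cols = ["beginner_runs", "expert_runs"]

for col in run_cols:
    # Strip the literal percent sign from each string value.
    cleaned = runs[col].str.replace("%", "", regex=False)
    # Coerce placeholders such as "null" or "-" to NaN, then treat NaN as 0 runs.
    runs[col] = pd.to_numeric(cleaned, errors="coerce").fillna(0).astype(int)
```

Using `errors="coerce"` means any unexpected placeholder becomes 0 rather than raising, which is convenient here but would hide genuinely malformed values.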

In [58]:
#checking to see if there are any missing resorts in the scraped information to ensure all resorts have feature information
final_ski_df.loc[~final_ski_df['ski_resort'].isin(scraped_df['ski_resort']),
                         'ski_resort'].sort_values().unique()

array(['Anthony Lakes Mountain', 'Big Squaw Mountain Ski',
       'Brantling Ski Slopes', 'Coffee Mill Ski Snowboard', 'Mt. Holly',
       'Mt. Shasta Board Ski Park', 'Timberline Mountain'], dtype=object)

In [59]:
replace_from = ["Anthony Lakes", "Big Squaw","Brantling Ski", "Coffee Mill", "Mt. Shasta", "Mount Holly"]
replace_to = ["Anthony Lakes Mountain", 'Big Squaw Mountain Ski', "Brantling Ski Slopes", "Coffee Mill Ski Snowboard",
               'Mt. Shasta Board Ski Park', "Mt. Holly"]

scraped_df.replace(replace_from, replace_to, regex=True, inplace=True)

In [60]:
scraped_df.loc[(scraped_df['ski_resort'] == "Timberline") & (scraped_df['state'] == "West Virginia"), 'ski_resort'] = "Timberline Mountain"

In [61]:
#updating scraped df names to match the other dataframes
scraped_df.loc[(scraped_df['ski_resort'] == "Crystal Mountain") & (scraped_df['state'] == "Washington"), 'ski_resort'] = "Crystal Mountain Washington"
scraped_df.loc[(scraped_df['ski_resort'] == "Crystal Mountain ") & (scraped_df['state'] == "Michigan"), 'ski_resort'] = "Crystal Mountain Michigan"

#updating scraped df names to match the other dataframes
scraped_df.loc[(scraped_df['ski_resort'] == "Powder Ridge") & (scraped_df['state'] == "Minnesota"), 'ski_resort'] = "Powder Ridge Minnesota"
scraped_df.loc[(scraped_df['ski_resort'] == "Powder Ridge") & (scraped_df['state'] == "Connecticut"), 'ski_resort'] = "Powder Ridge Connecticut"
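One caveat with `replace(..., regex=True)` is that it substitutes matching substrings, so rerunning a rename cell can append a suffix a second time. A possible alternative, sketched below on hypothetical sample rows, is an exact-match mapping restricted to the `ski_resort` column, with shared names disambiguated by state.

```python
import pandas as pd

# Hypothetical sample rows; the state pairings are for illustration only.
resorts = pd.DataFrame({
    "ski_resort": ["Anthony Lakes", "Powder Ridge", "Powder Ridge", "Vail"],
    "state": ["Oregon", "Minnesota", "Connecticut", "Colorado"],
})

# Exact-match renames: a key must equal the whole cell value to apply.
name_map = {"Anthony Lakes": "Anthony Lakes Mountain"}
resorts["ski_resort"] = resorts["ski_resort"].replace(name_map)

# Disambiguate resorts that share a name by appending the state.
shared = resorts["ski_resort"].duplicated(keep=False)
resorts.loc[shared, "ski_resort"] = (
    resorts.loc[shared, "ski_resort"] + " " + resorts.loc[shared, "state"]
)

# Running the exact-match rename a second time leaves the names unchanged.
resorts["ski_resort"] = resorts["ski_resort"].replace(name_map)
```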

In [62]:
#dropping duplicates
scraped_df = scraped_df.drop_duplicates()

In [63]:
#dropping duplicate names in ski_resort
scraped_df = scraped_df.drop_duplicates(subset=['ski_resort'])

In [64]:
#checking value counts of resorts to make sure there aren't duplicates or duplicate names
scraped_df.ski_resort.value_counts()

Pats Peak                    1
Brundage Mountain            1
Donner Ski Ranch             1
King Pine                    1
Mt. Bohemia                  1
                            ..
Powder Ridge Minnesota       1
Whitecap Mountain            1
Seven Oaks                   1
Mountain Creek               1
Snowriver Mountain Resort    1
Name: ski_resort, Length: 329, dtype: int64

In [65]:
#checking to see if there are any missing resorts in the scraped information to ensure all resorts have feature information
final_ski_df.loc[~final_ski_df['ski_resort'].isin(scraped_df['ski_resort']),
                         'ski_resort'].sort_values().unique()

array([], dtype=object)
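The `isin` check above only looks in one direction. As a rough sketch, an outer merge with `indicator=True` reports names missing on either side at once; the two small frames below are hypothetical stand-ins for the review and feature dataframes.

```python
import pandas as pd

# Hypothetical stand-ins for the review and feature dataframes.
reviews = pd.DataFrame({"ski_resort": ["Vail", "Alta", "Mt. Holly", "Vail"]})
features = pd.DataFrame({"ski_resort": ["Vail", "Alta", "Mount Holly"]})

# The _merge column records whether each name came from the left frame,
# the right frame, or both.
check = reviews.drop_duplicates().merge(
    features, on="ski_resort", how="outer", indicator=True
)

only_in_reviews = check.loc[check["_merge"] == "left_only", "ski_resort"].tolist()
only_in_features = check.loc[check["_merge"] == "right_only", "ski_resort"].tolist()
```

Seeing `Mt. Holly` on one side and `Mount Holly` on the other makes the needed rename obvious.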

### Additional feature engineering
It is common for avid skiers to purchase ski passes through companies that own a collection of mountains around the United States. I researched the current list of mountains for four of the most common ski passes, manually edited the names to match the main dataframe, and then one-hot encoded pass membership for each resort.

- Epic Pass
- Ikon Pass
- Mountain Collective
- Indy Pass

I did attempt to use thefuzz (fuzzy string matching) to update the names; however, many resorts have very similar names, so it was not an effective way to update the resort names.
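To illustrate why fuzzy matching struggles with these names, the sketch below scores a short resort name against two longer candidates. It uses Python's standard-library `difflib` as a stand-in, not the fuzzy-matching library mentioned in this section, so the exact scores would differ; the `similarity` helper is hypothetical.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Return a 0-1 similarity ratio between two names, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Two distinct resorts whose names share a long common prefix.
candidates = ["Crystal Mountain Washington", "Crystal Mountain Michigan"]
scores = {name: similarity("Crystal Mountain", name) for name in candidates}

for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```

Both candidates score highly and close to each other, so a similarity threshold alone cannot tell which resort a bare "Crystal Mountain" refers to. That is why the state column is used to disambiguate.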

In [66]:
epic_list = ["Stowe Mountain", "Okemo Mountain", "Hunter Mountain", "Mt. Snow", "Mt. Sunapee","Wildcat Mountain","Seven Springs",
             "Attitash", "Jack Frost", "Crotched Mountain", "Laurel Mountain",
             "Roundtop Mountain", "Whitetail", "Liberty", "Big Boulder", "Heavenly Mountain", "Northstar California",
             "Kirkwood", "Stevens Pass", "Keystone", "Breckenridge","Vail", "Park City Mountain", "Beaver Creek",
             "Crested Butte Mountain", "Afton Alps", "Alpine Valley Ohio", "Boston Mills and Brandywine","Hidden Valley",
             "Mad River Mountain", "Mt. Brighton", "Paoli Peaks", "Snow Creek","Wilmot Mountain",
            ]

mtn_col_list = ['Arapahoe Basin', 'Aspen Snowmass', 'Jackson Hole', 'Mammoth Mountain', 'Snowbird',
                            'Palisades Tahoe', 'Sugarbush','Taos Ski Valley', 'Alta', 'Big Sky', 'Sugar Bowl',
                           'Sugarloaf', 'Sun Valley', 'Grand Targhee', 'Snowbasin']

ikon_list = ['Palisades Tahoe', 'Mammoth Mountain', 'June Mountain', 'Bear Mountain', 'Snow Summit','Snow Valley',
    'Sun Valley', 'Dollar Mountain', 'Crystal Mountain Washington', 'Alpental', 'The Summit at Snoqualmie',
    'Mt. Bachelor', 'Schweitzer', 'Alyeska', 'Aspen Snowmass','Buttermilk', 'Steamboat', 'Winter Park',
    'Copper Mountain', 'Arapahoe Basin', 'Eldora Mountain', 'Jackson Hole', 'Big Sky',
    'Taos Ski Valley','Deer Valley', 'Solitude Mountain','Brighton','Alta', 'Snowbird',
    'Snowbasin','Boyne Highlands', 'Boyne Mountain', 'Stratton Mountain', 'Sugarbush', 'Killington', 'Pico Mountain',
    'Windham Mountain', 'Snowshoe Mountain', 'Sunday River','Sugarloaf','Loon Mountain']


indy_list = ['Eaglecrest', 'Ski China Peak', 'Mt. Shasta Board Ski Park', 'Mountain High', 'Dodge Ridge',
    'Hoodoo Ski Area', 'Mt. Ashland', 'Mt. Hood Meadows', '49 Degrees North',
    'Hurricane Ridge Ski & Snowboard Area', 'Mission Ridge', 'Bluewood', 'White Pass',
    'Castle Mountain Resort', 'Sunrise Park', 'Echo Mountain', 'Granby Ranch',
    'Sunlight', 'Brundage Mountain', 'Kelly Canyon', 'Pomerelle Mountain', 'Silver Mountain', 'Soldier Mountain',
    'Tamarack', 'Blacktail Mountain', 'Mountain', 'Red Lodge Mountain', 'Beaver Mountain',
    'Powder Mountain', 'Antelope Butte', 'Snow King', 'White Pine Ski Resort',
    'Seven Oaks', 'Sundown Mountain', 'Big Powderhorn Mountain', 'Caberfae Peaks Ski Golf',
    'Crystal Mountain Michigan', 'Marquette Mountain', 'Nubs Nob', 'Pine Mountain',
    'Schuss Mountain at Shanty Creek', 'Swiss Valley', 'Treetops Ski Resort',
    'Buck Hill', 'Detroit Mountain Recreation Area', 'Lutsen Mountains', 'Mount Mankato',
    'Powder Ridge Minnesota', 'Spirit Mountain', 'Terry Peak', 'Granite Peak',
    'Little Switzerland', 'Nordic Mountain', 'The Rock Snowpark', 'Trollhaugen',
    'Tyrol Basin', 'Mohawk Mountain', 'BigRockMountain',
    'Rangeley Lakes Trail Center', 'Saddleback Mountain', 'Berkshire East',
    'Black Mountain', 'Cannon Mountain', 'Pats Peak', 'Waterville Valley',
    'Catamount Ski Ride Area', 'Greek Peak', 'Peekn Peak', 'Snow Ridge', 'Swain Resort',
    'Titus Mountain', 'West Mountain', 'Catamount Outdoor Family Center', 'Bolton Valley',
    'Jay Peak', 'Magic Mountain Vermont', 'Saskadena Six', 'Cataloochee',
    'Blue Knob', 'Montage Mountain', 'Shawnee Mountain', 'Ski Sawmill',
    'Tussey Mountain', 'Ober Gatlinburg Ski', 'Bryce', 'Massanutten',
    'Canaan Valley', 'Winterplace Ski']

In [67]:
#sanity check to see if any of the names are wrong

test_list = ikon_list + epic_list + mtn_col_list

missing_values = [value for value in test_list if value not in scraped_df['ski_resort'].values]

# Print the missing values
print(missing_values)

['Dollar Mountain', 'Buttermilk', 'Laurel Mountain']


In [68]:
#sanity check to see if any of the names are wrong

missing_values = [value for value in indy_list if value not in scraped_df['ski_resort'].values]

# Print the missing values
print(missing_values)

['Hoodoo Ski Area', 'Hurricane Ridge Ski & Snowboard Area', 'Castle Mountain Resort', 'Echo Mountain', 'Sunlight', 'Mountain', 'Red Lodge Mountain', 'Antelope Butte', 'White Pine Ski Resort', 'Schuss Mountain at Shanty Creek', 'Treetops Ski Resort', 'Detroit Mountain Recreation Area', 'Mount Mankato', 'The Rock Snowpark', 'BigRockMountain', 'Rangeley Lakes Trail Center', 'Saddleback Mountain', 'Swain Resort', 'Catamount Outdoor Family Center', 'Montage Mountain']


In [69]:
# making new column with 0
scraped_df['epic'] = 0
scraped_df['mountain_collective'] = 0
scraped_df['ikon'] = 0
scraped_df['indy'] = 0

# adding 1 for each row value based on the ski resort pass lists
scraped_df.loc[scraped_df['ski_resort'].isin(epic_list), 'epic'] = 1
scraped_df.loc[scraped_df['ski_resort'].isin(mtn_col_list), 'mountain_collective'] = 1
scraped_df.loc[scraped_df['ski_resort'].isin(ikon_list), 'ikon'] = 1
scraped_df.loc[scraped_df['ski_resort'].isin(indy_list), 'indy'] = 1
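The same encoding can be written as a loop over a dictionary of pass lists, which also makes it easy to report list entries that matched no resort. This is a minimal sketch with hypothetical, shortened pass lists; the `pass_lists` and `unmatched` names are assumptions for illustration.

```python
import pandas as pd

resorts = pd.DataFrame({"ski_resort": ["Vail", "Alta", "Jay Peak", "Ski Brule"]})

# Shortened pass lists for illustration only.
pass_lists = {
    "epic": ["Vail"],
    "ikon": ["Alta"],
    "mountain_collective": ["Alta"],
    "indy": ["Jay Peak", "Not A Real Resort"],
}

for pass_name, members in pass_lists.items():
    # isin gives a boolean mask; casting to int yields the 0/1 flag.
    resorts[pass_name] = resorts["ski_resort"].isin(members).astype(int)

# List entries that matched no resort name, i.e. candidates for renaming.
known = set(resorts["ski_resort"])
unmatched = {p: sorted(set(m) - known) for p, m in pass_lists.items()}
```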

### Saving Airbnb scraping file
For the content-based system, I will scrape Airbnb prices for each resort city. In the code below, I build a new dataframe with city, state, and ski resort information that I will use to pull listings from Airbnb.

The scraping notebook can be found in the data folder.

In [70]:
#making new dataframe
city_df = pd.DataFrame()

#saving city
city_df['city'] = scraped_df['city']

#saving state
city_df['state'] = scraped_df['state']

#saving ski resort name
city_df['ski_resort'] = scraped_df['ski_resort']

#saving 
#city_df.to_csv("cleaned_data_exports/location_for_scraping_v2.csv")

In [71]:
#saving city names for scraping
city_df['location'] = scraped_df[['city', 'state']].agg(', '.join, axis=1)

#saving unique cities
city_df['location'] = pd.DataFrame(city_df['location'].unique())

#dropping nulls
city_df = city_df.dropna()

#saving
#city_df.to_csv("cleaned_data_exports/city_names_for_scraping_v2.csv")
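Writing the unique locations back into a column of `city_df` aligns them by row position, so a row's `location` may not correspond to that row's city. A possible alternative, sketched below with hypothetical resort names, keeps the unique city/state pairs in their own frame.

```python
import pandas as pd

# Hypothetical resorts; two of them share the same city.
resorts = pd.DataFrame({
    "ski_resort": ["Resort A", "Resort B", "Resort C"],
    "city": ["Chewelah", "Chewelah", "Hastings"],
    "state": ["Washington", "Washington", "Minnesota"],
})

# One row per unique city/state pair, with a combined "City, State" string.
locations = (
    resorts[["city", "state"]]
    .drop_duplicates()
    .assign(location=lambda d: d["city"] + ", " + d["state"])
    .reset_index(drop=True)
)
```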

In [78]:
#saving final merged dataframe for the collaborative model
#merged_df.to_csv("cleaned_data_exports/user_df_model.csv")

## Airbnb Scrape Cleaning

Importing the .csv files that I scraped from Airbnb. These include prices pulled from the first two pages of results (32 results) for Airbnbs with a max of **2 guests** and a max of **4 guests**. I pulled this information to give cost information to people planning ski trips, since high lodging and ticket costs are a barrier to entry.

In [97]:
jan_2_airbnb_mean_final

Unnamed: 0,jan_mean_2_guests,jan_min_2_guests,jan_max_2_guests,ski_resort
0,240.742857,100.0,585.0,Magic Mountain
1,140.800000,73.0,409.0,49 Degrees North
2,187.156250,61.0,842.0,Afton Alps
3,264.857143,99.0,833.0,Alpental
4,158.676471,64.0,296.0,Alpine Valley Ohio
...,...,...,...,...
326,189.085714,71.0,510.0,Badger Pass
327,157.171429,84.0,324.0,Mt. Bachelor
328,221.285714,80.0,659.0,Mt. Bohemia
329,165.264706,67.0,357.0,Shawnee Mountain


In [98]:
jan_2_airbnb_mean_final.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 331 entries, 0 to 330
Data columns (total 4 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   jan_mean_2_guests  331 non-null    float64
 1   jan_min_2_guests   331 non-null    float64
 2   jan_max_2_guests   331 non-null    float64
 3   ski_resort         331 non-null    object 
dtypes: float64(3), object(1)
memory usage: 10.5+ KB


In [72]:
#making lists of all dataframes from airbnb scrape
two_guest_list = [dec_2_airbnb_mean_final,jan_2_airbnb_mean_final,feb_2_airbnb_mean_final,
                  mar_2_airbnb_mean_final, apr_2_airbnb_mean_final,
                  may_2_airbnb_mean_final]

four_guest_list = [dec_4_airbnb_mean_final,jan_4_airbnb_mean_final,feb_4_airbnb_mean_final,
                  mar_4_airbnb_mean_final, apr_4_airbnb_mean_final,
                  may_4_airbnb_mean_final]

month_list = ['dec', 'jan', 'feb', 'mar', 'apr', 'may']

for x, y, z in zip(two_guest_list, four_guest_list, month_list):
    
    dict_2 = {'mean': z + '_mean_2_guests',
              'min': z + '_min_2_guests','max': z + '_max_2_guests'}

    dict_4 = {'mean': z +'_mean_4_guests',
              'min': z +'_min_4_guests','max': z +'_max_4_guests'}
    
    #renaming column names based on scrape
    x.rename(columns=dict_2, inplace=True)
    y.rename(columns=dict_4, inplace=True)
    
    #saving as new dataframe
    x.drop(columns=['count', 'std', '25%', '50%', '75%', 'Unnamed: 0'], inplace=True)
    y.drop(columns=['count', 'std', '25%', '50%', '75%', 'Unnamed: 0'], inplace=True)

In [73]:
#merging all months from the list of 2 guest airbnbs 
for df in two_guest_list[1:]:
    dec_2_airbnb_mean_final = pd.merge(dec_2_airbnb_mean_final, df, on='ski_resort')

#merging all months from the list of 4 guest airbnbs
for df in four_guest_list:
    dec_2_airbnb_mean_final = pd.merge(dec_2_airbnb_mean_final, df, on='ski_resort')
    
#saving off the list as a new dataframe
airbnb_df = dec_2_airbnb_mean_final.copy()

#dropping duplicates
airbnb_df = airbnb_df.drop_duplicates()
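The loop-based merge above can also be expressed with `functools.reduce`, which avoids reusing the December dataframe as the accumulator. The sketch below uses three tiny hypothetical monthly frames; the prices are made up.

```python
from functools import reduce

import pandas as pd

# Hypothetical monthly frames; prices are invented for illustration.
dec = pd.DataFrame({"ski_resort": ["Vail", "Alta"], "dec_mean_2_guests": [300.0, 250.0]})
jan = pd.DataFrame({"ski_resort": ["Vail", "Alta"], "jan_mean_2_guests": [320.0, 260.0]})
feb = pd.DataFrame({"ski_resort": ["Alta", "Vail"], "feb_mean_2_guests": [270.0, 330.0]})

# Fold the list into one frame, joining each month on the resort name.
monthly = reduce(
    lambda left, right: pd.merge(left, right, on="ski_resort"),
    [dec, jan, feb],
)
```

Because `pd.merge` defaults to an inner join, a resort missing from any single month drops out of the result; passing `how="outer"` would keep it with NaN prices instead.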

In [76]:
#inspecting value counts
airbnb_df['ski_resort'].value_counts()

Soldier Mountain             1
Howelsen Hill                1
Peekn Peak                   1
Ober Gatlinburg Ski          1
Pajarito Mountain            1
                            ..
Snowbasin                    1
Nashoba Valley               1
Roundtop Mountain            1
Thunder Ridge                1
Snowriver Mountain Resort    1
Name: ski_resort, Length: 331, dtype: int64

In [None]:
#checking to see which ski resort names differ between the main feature df and the airbnb df
scraped_df.loc[~scraped_df['ski_resort'].isin(airbnb_df['ski_resort']),
                         'ski_resort'].unique()

In [77]:
#renaming resorts that have the same name
airbnb_df.replace("Crystal Mountain", "Crystal Mountain Washington", regex=True, inplace=True)

#renaming resorts that have the same name
airbnb_df.replace("Powder Ridge", "Powder Ridge Minnesota", regex=True, inplace=True)

airbnb_df.loc[(airbnb_df['ski_resort'] == "Crystal Mountain Washington ") & (airbnb_df['dec_min_2_guests'] == 40), 'ski_resort'] = "Crystal Mountain Michigan"

airbnb_df.replace("Powder Ridge", "Powder Ridge Connecticut", inplace=True)

airbnb_df.replace('Mt. Bhelor', 'Mt. Bachelor', inplace=True)

airbnb_df.loc[(airbnb_df['ski_resort'] == 'Timberline') & (airbnb_df['dec_min_2_guests'] == 75.0), 'ski_resort'] = "Timberline Mountain"

#batch renaming mountains to match the user df
rename_list = ["Anthony Lakes", "Mount Holly", "Mt. Shasta", "Coffee ll", "Brantling Ski", 'Big Squaw']
rename_to = ['Anthony Lakes Mountain', 'Mt. Holly','Mt. Shasta Board Ski Park', 'Coffee Mill Ski Snowboard',
       'Brantling Ski Slopes', 'Big Squaw Mountain Ski']

airbnb_df.replace(rename_list, rename_to, regex=True, inplace=True)

In [89]:
#checking to see which ski resort names differ between the main feature df and the airbnb df
scraped_df.loc[~scraped_df['ski_resort'].isin(airbnb_df['ski_resort']),
                         'ski_resort'].unique()

array(['Powder Ridge Connecticut', 'Snow Summit', 'Whaleback'],
      dtype=object)

# Merging reviews with features
I will combine the cleaned review and feature dataframes from above; the result will be used for the collaborative model.

In [None]:
# #merging final review dataframe and scraped data
# merged_df = pd.merge(final_ski_df, scraped_df, on="ski_resort", how='left')

# #dropping columns
# merged_df = merged_df.drop(columns="state_y")

# #renaming columns
# merged_df = merged_df.rename(columns={"state_x":"state"})

## Merging airbnb with feature dataframe

In [91]:
#merging airbnb and feature dataframes
content_df = pd.merge(scraped_df, airbnb_df, on="ski_resort", how="left")

In [92]:
content_df

Unnamed: 0,ski_resort,address,city,state,zipcode,sumt,drop,base,gondolas_and_trams,fastEight,...,feb_max_4_guests,mar_mean_4_guests,mar_min_4_guests,mar_max_4_guests,apr_mean_4_guests,apr_min_4_guests,apr_max_4_guests,may_mean_4_guests,may_min_4_guests,may_max_4_guests
0,49 Degrees North,P.O. Box 166,Chewelah,Washington,99109,5774,1851,3932,0,1.0,...,389.0,180.000000,75.0,394.0,216.142857,79.0,435.0,165.625000,79.0,375.0
1,Afton Alps,6600 Peller Avenue South,Hastings,Minnesota,55033,1530,350,1180,0,0.0,...,700.0,262.968750,107.0,725.0,262.093750,103.0,749.0,194.750000,80.0,388.0
2,Alpental,POB 1068,Snoquale Pass,Washington,98068,5420,2280,3140,0,1.0,...,993.0,376.090909,132.0,1200.0,347.687500,90.0,1200.0,245.875000,78.0,580.0
3,Alpine Valley Ohio,10620 Mayfield,Chesterland,Ohio,44026,1500,230,1260,0,0.0,...,519.0,205.914286,98.0,519.0,220.742857,94.0,719.0,167.125000,70.0,339.0
4,Alpine Valley Wisconsin,P.O. Box 615,East Troy,Wisconsin,53120,1040,388,820,0,3.0,...,750.0,235.147059,65.0,502.0,239.406250,65.0,750.0,227.947368,90.0,406.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
325,Woods Valley,Box 215,Westernville,New York,13486,1400,500,900,0,0.0,...,666.0,225.171429,85.0,666.0,245.428571,129.0,677.0,175.458333,85.0,350.0
326,Yawgoo Valley,Box 41,Slocum,Rhode Island,2877,315,245,70,0,0.0,...,650.0,297.057143,132.0,658.0,324.771429,150.0,699.0,266.000000,125.0,950.0
327,Badger Pass,9001 Village Dr.,Yosete,California,95389,7800,600,7200,0,0.0,...,559.0,249.029412,131.0,559.0,283.685714,130.0,950.0,179.785714,67.0,259.0
328,Shawnee Mountain,P.O. Box 339,Shawnee on Delaware,Pennsylvania,18356-0339,1350,700,650,0,1.0,...,400.0,197.281250,120.0,411.0,207.029412,71.0,473.0,141.695652,68.0,310.0


### Google Geocoding API

Importing the final .csv from the Google Geocoding API pull. The code for this can be found in the **scraping_ipynb** file.

Google's Geocoding API is a service that accepts a place as an address, latitude and longitude coordinates, or Place ID. It converts the address into latitude and longitude coordinates and a Place ID, or converts latitude and longitude coordinates or a Place ID into an address.
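As a rough sketch of the address-to-coordinates direction, the code below builds a Geocoding request URL and reads the coordinates out of a response. The helper names are hypothetical, the payload is hand-written in the API's JSON response shape using coordinates from the table in this section, and no network call is made. The actual pull lives in the scraping notebook and may differ.

```python
from urllib.parse import urlencode

GEOCODE_ENDPOINT = "https://maps.googleapis.com/maps/api/geocode/json"

def build_geocode_url(address: str, api_key: str) -> str:
    """Build a request URL for a street address (hypothetical helper)."""
    return f"{GEOCODE_ENDPOINT}?{urlencode({'address': address, 'key': api_key})}"

def extract_lat_lng(payload: dict):
    """Return (lat, lng) from a response payload, or None if nothing matched."""
    if payload.get("status") != "OK" or not payload.get("results"):
        return None
    location = payload["results"][0]["geometry"]["location"]
    return location["lat"], location["lng"]

# Hand-written payload in the response shape; not a real API response.
sample_payload = {
    "status": "OK",
    "results": [{"geometry": {"location": {"lat": 48.276287, "lng": -117.715521}}}],
}

url = build_geocode_url("P.O. Box 166, Chewelah, Washington", "YOUR_API_KEY")
coords = extract_lat_lng(sample_payload)
```

Returning `None` for a non-`OK` status makes it straightforward to collect addresses that failed to geocode and retry or fix them by hand.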

In [99]:
latitude_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 332 entries, 0 to 331
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Unnamed: 0    332 non-null    int64  
 1   full_address  332 non-null    object 
 2   ski_resort    332 non-null    object 
 3   latitude      332 non-null    float64
 4   longitude     332 non-null    float64
dtypes: float64(2), int64(1), object(2)
memory usage: 13.1+ KB


In [101]:
latitude_df.head()

Unnamed: 0,full_address,ski_resort,latitude,longitude
0,"P.O. Box 166, Chewelah, Washington",49 Degrees North,48.276287,-117.715521
1,"6600 Peller Avenue South, Hastings, Minnesota",Afton Alps,44.854416,-92.790839
2,"POB 1068, Snoquale Pass, Washington",Alpental,47.392335,-121.400094
3,"10620 Mayfield, Chesterland, Ohio",Alpine Valley Ohio,41.526814,-81.25982
4,"P.O. Box 615, East Troy, Wisconsin",Alpine Valley Wisconsin,42.785292,-88.405096


In [100]:
latitude_df.drop(columns="Unnamed: 0", inplace=True)

In [102]:
content_df.loc[~content_df['ski_resort'].isin(latitude_df['ski_resort']),
                         'ski_resort'].unique()

array(['Anthony Lakes Mountain', 'Big Squaw Mountain Ski',
       'Brantling Ski Slopes', 'Coffee Mill Ski Snowboard', 'Mt. Holly',
       'Mt. Shasta Board Ski Park', 'Snow Summit', 'Timberline Mountain',
       'Whaleback'], dtype=object)

In [103]:
#updating Timberline because there are multiple similar resorts
latitude_df.loc[(latitude_df['ski_resort'] == 'Timberline') & (latitude_df['full_address'] == "HC 70 Box 488, Davis, West Virginia"), 'ski_resort'] = "Timberline Mountain"

#batch renaming
lat_list = ["Anthony Lakes", "Mount Holly", "Mt. Shasta", "Coffee ll", "Brantling Ski", "Big Squaw"]
lat_replace = ['Anthony Lakes Mountain', 'Mt. Holly','Mt. Shasta Board Ski Park',
               'Coffee Mill Ski Snowboard','Brantling Ski Slopes', 'Big Squaw Mountain Ski']

latitude_df.replace(lat_list, lat_replace, regex=True, inplace=True)

In [104]:
#confirming all names are consistent besides the two remaining resorts that don't exist in the other dataframe
content_df.loc[~content_df['ski_resort'].isin(latitude_df['ski_resort']),
                         'ski_resort'].unique()

array(['Snow Summit', 'Whaleback'], dtype=object)

### Merging with feature dataframe

In [105]:
#merging latitude df with final content_df
content_df = pd.merge(content_df, latitude_df, on="ski_resort", how="left")

In [107]:
content_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 330 entries, 0 to 329
Data columns (total 85 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   ski_resort                330 non-null    object 
 1   address                   330 non-null    object 
 2   city                      330 non-null    object 
 3   state                     330 non-null    object 
 4   zipcode                   330 non-null    object 
 5   sumt                      330 non-null    int64  
 6   drop                      330 non-null    int64  
 7   base                      330 non-null    int64  
 8   gondolas_and_trams        330 non-null    int64  
 9   fastEight                 330 non-null    float64
 10  highSpeedSixes            328 non-null    float64
 11  quadChairs                329 non-null    float64
 12  tripleChairs              329 non-null    float64
 13  doubleChairs              329 non-null    float64
 14  surfeLifts

In [106]:
content_df.head()

Unnamed: 0,ski_resort,address,city,state,zipcode,sumt,drop,base,gondolas_and_trams,fastEight,...,mar_max_4_guests,apr_mean_4_guests,apr_min_4_guests,apr_max_4_guests,may_mean_4_guests,may_min_4_guests,may_max_4_guests,full_address,latitude,longitude
0,49 Degrees North,P.O. Box 166,Chewelah,Washington,99109,5774,1851,3932,0,1.0,...,394.0,216.142857,79.0,435.0,165.625,79.0,375.0,"P.O. Box 166, Chewelah, Washington",48.276287,-117.715521
1,Afton Alps,6600 Peller Avenue South,Hastings,Minnesota,55033,1530,350,1180,0,0.0,...,725.0,262.09375,103.0,749.0,194.75,80.0,388.0,"6600 Peller Avenue South, Hastings, Minnesota",44.854416,-92.790839
2,Alpental,POB 1068,Snoquale Pass,Washington,98068,5420,2280,3140,0,1.0,...,1200.0,347.6875,90.0,1200.0,245.875,78.0,580.0,"POB 1068, Snoquale Pass, Washington",47.392335,-121.400094
3,Alpine Valley Ohio,10620 Mayfield,Chesterland,Ohio,44026,1500,230,1260,0,0.0,...,519.0,220.742857,94.0,719.0,167.125,70.0,339.0,"10620 Mayfield, Chesterland, Ohio",41.526814,-81.25982
4,Alpine Valley Wisconsin,P.O. Box 615,East Troy,Wisconsin,53120,1040,388,820,0,3.0,...,502.0,239.40625,65.0,750.0,227.947368,90.0,406.0,"P.O. Box 615, East Troy, Wisconsin",42.785292,-88.405096


### Adding Additional Column

In [108]:
lift_list = ['gondolas_and_trams','fastEight','highSpeedSixes','quadChairs','tripleChairs','doubleChairs',
             'surfeLifts']

for x in lift_list:
    content_df[x] = content_df[x].fillna(0)

In [109]:
content_df['gondolas_and_trams'] = content_df['gondolas_and_trams'].astype(float)

In [110]:
#making new column that totals the lift counts across all lift types
content_df['total_lifts'] = content_df[lift_list].sum(axis=1)

In [113]:
content_df

Unnamed: 0,ski_resort,address,city,state,zipcode,sumt,drop,base,gondolas_and_trams,fastEight,...,apr_mean_4_guests,apr_min_4_guests,apr_max_4_guests,may_mean_4_guests,may_min_4_guests,may_max_4_guests,full_address,latitude,longitude,total_lifts
0,49 Degrees North,P.O. Box 166,Chewelah,Washington,99109,5774,1851,3932,0.0,1.0,...,216.142857,79.0,435.0,165.625000,79.0,375.0,"P.O. Box 166, Chewelah, Washington",48.276287,-117.715521,7.0
1,Afton Alps,6600 Peller Avenue South,Hastings,Minnesota,55033,1530,350,1180,0.0,0.0,...,262.093750,103.0,749.0,194.750000,80.0,388.0,"6600 Peller Avenue South, Hastings, Minnesota",44.854416,-92.790839,21.0
2,Alpental,POB 1068,Snoquale Pass,Washington,98068,5420,2280,3140,0.0,1.0,...,347.687500,90.0,1200.0,245.875000,78.0,580.0,"POB 1068, Snoquale Pass, Washington",47.392335,-121.400094,5.0
3,Alpine Valley Ohio,10620 Mayfield,Chesterland,Ohio,44026,1500,230,1260,0.0,0.0,...,220.742857,94.0,719.0,167.125000,70.0,339.0,"10620 Mayfield, Chesterland, Ohio",41.526814,-81.259820,5.0
4,Alpine Valley Wisconsin,P.O. Box 615,East Troy,Wisconsin,53120,1040,388,820,0.0,3.0,...,239.406250,65.0,750.0,227.947368,90.0,406.0,"P.O. Box 615, East Troy, Wisconsin",42.785292,-88.405096,13.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
325,Woods Valley,Box 215,Westernville,New York,13486,1400,500,900,0.0,0.0,...,245.428571,129.0,677.0,175.458333,85.0,350.0,"Box 215, Westernville, New York",43.305625,-75.382950,6.0
326,Yawgoo Valley,Box 41,Slocum,Rhode Island,2877,315,245,70,0.0,0.0,...,324.771429,150.0,699.0,266.000000,125.0,950.0,"Box 41, Slocum, Rhode Island",41.532874,-71.514729,4.0
327,Badger Pass,9001 Village Dr.,Yosete,California,95389,7800,600,7200,0.0,0.0,...,283.685714,130.0,950.0,179.785714,67.0,259.0,"9001 Village Dr., Yosete, California",37.747595,-119.584136,5.0
328,Shawnee Mountain,P.O. Box 339,Shawnee on Delaware,Pennsylvania,18356-0339,1350,700,650,0.0,1.0,...,207.029412,71.0,473.0,141.695652,68.0,310.0,"P.O. Box 339, Shawnee on Delaware, Pennsylvania",41.012317,-75.110733,9.0


In [118]:
content_df.drop_duplicates(subset="ski_resort", inplace=True)
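Duplicate rows after a left merge usually mean the right-hand frame has repeated keys. As a minimal sketch on hypothetical frames, the `validate` argument of `pd.merge` can surface this at merge time instead of cleaning up afterwards.

```python
import pandas as pd
from pandas.errors import MergeError

# Hypothetical frames: "Resort A" appears twice in the coordinates frame.
features = pd.DataFrame({"ski_resort": ["Resort A", "Resort B"], "base": [1000, 2000]})
coords = pd.DataFrame({
    "ski_resort": ["Resort A", "Resort A", "Resort B"],
    "latitude": [40.0, 40.0, 45.0],
})

# A plain left merge silently repeats "Resort A".
merged = pd.merge(features, coords, on="ski_resort", how="left")

# validate="one_to_one" raises if either side has duplicate keys.
try:
    pd.merge(features, coords, on="ski_resort", how="left", validate="one_to_one")
    duplicate_keys_detected = False
except MergeError:
    duplicate_keys_detected = True

# Deduplicating the right-hand keys first lets the validated merge succeed.
deduped = pd.merge(
    features,
    coords.drop_duplicates(subset="ski_resort"),
    on="ski_resort",
    how="left",
    validate="one_to_one",
)
```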

In [122]:
#saving final cleaned scraped df
content_df.to_csv("cleaned_data_exports/scraped_feature_df.csv")

## Feature Analysis

### Ratings by resort distribution

In [120]:
#looking at the most reviewed mountains
top_5_reviewed_resorts = pd.DataFrame(final_ski_df['ski_resort'].value_counts().reset_index()).head(5)
top_5_reviewed_resorts

Unnamed: 0,index,ski_resort
0,Ski Brule,74
1,Killington,53
2,Vail,51
3,Breckenridge,49
4,Snowbird,36


In [121]:
#using plotly to plot the most reviewed resorts
fig = px.bar(top_5_reviewed_resorts, x="index", y="ski_resort")
fig.update_layout(title_text='Most Reviewed Mountains',
                  title_x=0.5,
                  xaxis_title="Resort",
                  yaxis_title="Review Count",
                 plot_bgcolor='white')
fig.update_traces(marker_color = "#00b5ff")
fig.show()

### Rating distribution

There is an imbalance in the rating distribution; however, the breakdown is not in line with the typical user bias, in which ratings cluster at the high or low end of the scale. This imbalance will most likely affect model performance, but we can choose an algorithm that works best for the type of data we have.

In [123]:
#making dataframe of rating counts to compare distribution of ratings
top_ratings = pd.DataFrame(final_ski_df["rating"].value_counts(ascending=False).head(15))
top_ratings = top_ratings.reset_index()
top_ratings = top_ratings.rename(columns={"rating":"rating_count"})
top_ratings = top_ratings.rename(columns={"index":"rating"})

#making rating a string for plotting
top_ratings['rating'] = top_ratings['rating'].astype(str)

# Calculate the percentage of each rating count
top_ratings['rating_percentage'] = (top_ratings['rating_count'] / top_ratings['rating_count'].sum()) * 100
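The percentage calculation above can also be obtained directly from `value_counts(normalize=True)`. This is a minimal sketch on a hypothetical series of ratings, not the review data.

```python
import pandas as pd

# Hypothetical ratings for illustration.
ratings = pd.Series([5, 5, 4, 3, 5, 4, 1, 5], name="rating")

# normalize=True returns proportions; multiply by 100 for percentages.
rating_pct = (
    ratings.value_counts(normalize=True)
    .mul(100)
    .rename_axis("rating")
    .reset_index(name="rating_percentage")
)
```

Note that this normalizes over every rating value, whereas taking `.head(15)` before dividing by the sum normalizes only over the values that were kept.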

In [124]:
#using plotly to plot the rating distribution
fig = px.bar(top_ratings, x="rating", y="rating_percentage",
             text="rating_percentage")
fig.update_layout(title_text='Rating Distribution',
                  title_x=0.5,
                  xaxis_title="Rating",
                  yaxis_title="Rating %",
                 plot_bgcolor='white',
                 font=dict(size=14))
fig.update_traces(marker_color = "#00b5ff", texttemplate='%{text:.1s}%', textposition='outside')

fig.show()

### Monthly Snowfall

In [152]:
snow_list = ["NovSnow", "DecSnow", "JanSnow", "FebSnow", "MarSnow", "AprSnow"]
snow_names = ['November', 'December', 'January', 'February', 'March', 'April']

monthly_mean = content_df[snow_list].mean(skipna=True)

monthly_snowfall = pd.DataFrame({'month': snow_names, 'mean_snowfall': monthly_mean.values})

In [153]:
monthly_snowfall

Unnamed: 0,month,mean_snowfall
0,November,5.118541
1,December,25.088754
2,January,28.780243
3,February,29.501216
4,March,21.741337
5,April,5.274164


In [154]:
#using plotly to plot the mean monthly snowfall
fig = px.bar(monthly_snowfall, x="month", y="mean_snowfall")
fig.update_layout(title_text='2022 US Average Snowfall',
                  title_x=0.5,
                  xaxis_title="Month",
                  yaxis_title="Snow (in)",
                 plot_bgcolor='white',
                 font=dict(size=14))
fig.update_traces(marker_color = "#00b5ff",textposition='outside')

fig.show()

### Airbnb Prices

In [183]:
#using plotly to plot December Airbnb prices for the first five resorts
fig = px.bar(content_df.head(), x="ski_resort", y=["dec_min_2_guests", "dec_min_4_guests"],
            width=1000, height=500)
fig.update_layout(title_text='December Airbnb Costs',
                  title_x=0.5,
                  xaxis_title="Ski Resort",
                  yaxis_title="Nightly Price ($)",
                 plot_bgcolor='white',
                 font=dict(size=14),
                 barmode='group')

newnames = {'dec_min_2_guests':'2 Guest Max', 'dec_min_4_guests': '4 Guest Max'}
fig.for_each_trace(lambda t: t.update(name = newnames[t.name],
                                      legendgroup = newnames[t.name],
                                      hovertemplate = t.hovertemplate.replace(t.name, newnames[t.name])
                                     )
                  )

fig.update_traces(textposition='outside')               
              
fig.show()

## Elevation graph

In [184]:
def mountain_elevation(resort_name):

    resort_df = content_df.loc[content_df['ski_resort'] == resort_name]
    
    # elevation
    base_elevation = resort_df['base'].values[0]
    summit_elevation = resort_df['sumt'].values[0]

    # tracing elevation from base up to summit and back down to base
    elev_trace = go.Scatter(x=["base", "summit", "drop"], y=[base_elevation, summit_elevation, base_elevation], mode='lines', line=dict(color='blue'))

    # displaying plot
    layout = go.Layout(
        title='Elevation Change',
        yaxis=dict(title='Elevation'),
        plot_bgcolor='white',
        showlegend=False
    )
    
    # making figure
    fig = go.Figure(data=[elev_trace], layout=layout)

    # Showing the line plot
    fig.show()

In [185]:
mountain_elevation("Arapahoe Basin")
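`mountain_elevation` indexes `.values[0]` directly, so a misspelled resort name raises an `IndexError` with no explanation. One possible guard, sketched below as a hypothetical `elevation_profile` helper, looks up the values first and raises a clear message. The sample row reuses the 49 Degrees North figures shown in the tables above.

```python
import pandas as pd

def elevation_profile(df: pd.DataFrame, resort_name: str) -> dict:
    """Return base, summit, and drop for one resort (hypothetical helper)."""
    match = df.loc[df["ski_resort"] == resort_name]
    if match.empty:
        raise ValueError(f"No resort named {resort_name!r} in the dataframe")
    row = match.iloc[0]
    return {
        "base": int(row["base"]),
        "summit": int(row["sumt"]),  # column name as it appears in content_df
        "drop": int(row["drop"]),
    }

sample = pd.DataFrame({
    "ski_resort": ["49 Degrees North"],
    "sumt": [5774],
    "drop": [1851],
    "base": [3932],
})

profile = elevation_profile(sample, "49 Degrees North")
```

The returned dictionary could then be passed to the plotting code in place of the direct `.values[0]` lookups.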

# Conclusion

#### Review Dataset
After cleaning and analyzing the data, there are **662 users, 275 resorts, and 2795 total reviews**. There is an imbalance in the reviews; however, the final recommendation system will be a hybrid-cascade model, which will help balance out the results.

In [187]:
unique_users = len(final_ski_df['user_name'].unique())
unique_resorts = len(final_ski_df['ski_resort'].unique())
total_reviews = len(final_ski_df)

print("Number of unique users:", unique_users)
print("Number of unique resorts:", unique_resorts)
print("Number of reviews:", total_reviews)

Number of unique users: 662
Number of unique resorts: 275
Number of reviews: 2795


#### Feature Dataset
After cleaning the scraped data from OnTheSnow, Google Geocoding API, and Airbnb, there are **329 resorts** and **86 columns** in the final dataframe.

In [190]:
unique_resorts = len(content_df['ski_resort'].unique())
unique_features = len(content_df.columns)

print("Number of resorts:", unique_resorts)
print("Number of features:", unique_features)

Number of resorts: 329
Number of features: 86


# Next Steps

The next step will be to begin modeling to create the recommendation system. The two main dataframes from this notebook that will be used are listed below:

- **Collaborative Modeling** - cleaned_data_exports/user_df_model.csv
- **Content/Cascade Hybrid Modeling** - cleaned_data_exports/scraped_feature_df.csv

The collaborative model will be saved in a separate notebook from the final content-based and hybrid models.