# Avant Ski

by: Stephanie Ciaccia

# Overview

Skiing holds a prominent place for those seeking winter recreational activities in the United States. With its stunning mountain ranges and diverse terrain, the country boasts numerous ski resorts that cater to all skill levels, from beginners to seasoned professionals. 

Skiing offers a unique blend of adventure, physical activity, and natural beauty, making it a popular choice for winter enthusiasts seeking both relaxation and excitement.

The ski market in the United States is thriving, contributing significantly to the economy. According to the [National Ski Areas Association (NSAA)](https://nsaa.org/webdocs/Media_Public/IndustryStats/Historical_Skier_Days_1979_2022.pdf), approximately 60.7 million skiers and snowboarders visited 473 ski resorts in the 2021-2022 winter season.

# Business Problem

Skiing is an exhilarating winter activity enjoyed by many, but barriers such as high costs and limited accessibility often hinder people from fully experiencing its joys. Choosing the right ski resort can be overwhelming due to the multitude of options available, and existing websites lack dynamic filtering capabilities based on user preferences.

To address these challenges, I'm developing Avant Ski, a ski resort recommendation app. Avant Ski simplifies the ski resort selection process by leveraging data and user preferences. With dynamic filtering features, users can personalize their search based on budget, location, amenities, and skill level. By bridging the gap between ski enthusiasts and their dream destinations, Avant Ski makes skiing accessible to a wider audience, empowering them to plan unforgettable ski trips with confidence.

# Data Understanding

In [1]:
import pandas as pd
import numpy as np
import math
import datetime
from scipy import stats

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import plotly
import plotly.express as px
import plotly.io as pio
from matplotlib.ticker import StrMethodFormatter
import plotly.graph_objects as go

from surprise.model_selection import cross_validate
from surprise import Dataset, Reader, accuracy
from surprise.prediction_algorithms import KNNWithMeans, KNNBasic, KNNBaseline, KNNWithZScore,  SVD, SVDpp, NMF, BaselineOnly, NormalPredictor
from surprise.model_selection import GridSearchCV, cross_validate, train_test_split

from collections import Counter
from nltk.corpus import stopwords

from IPython.display import Image, display

import glob
import os

Function to print full rows

In [2]:
def print_full(x):
    pd.set_option('display.max_rows', len(x))
    print(x)
    pd.reset_option('display.max_rows')
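
A variant of the helper above can be sketched with `pd.option_context`, which restores the display setting even if printing raises an error. The name `print_full_ctx` is hypothetical and not part of the original notebook.

```python
import pandas as pd

def print_full_ctx(x):
    """Print every row of x, restoring display options afterwards."""
    # option_context resets 'display.max_rows' on exit, even on error
    with pd.option_context('display.max_rows', None):
        print(x)

print_full_ctx(pd.Series(range(100)))
```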

# Importing Data Files

In [3]:
snow_df = pd.read_csv("data/OnTheSnow_SkiAreaReviews_clean.csv")
survey_df = pd.read_csv("data/usa_ski_resort_survey.csv")
scraped_df = pd.read_csv("data/OnTheSnow_Scrape_2_820523.csv")

In [4]:
#airbnb scrape four guest listings
dec_4_airbnb_mean_final = pd.read_csv("data/airbnb_scraped_cleaned/dec_4_airbnb_mean_final.csv")
jan_4_airbnb_mean_final = pd.read_csv("data/airbnb_scraped_cleaned/jan_4_airbnb_mean_final.csv")
feb_4_airbnb_mean_final = pd.read_csv("data/airbnb_scraped_cleaned/feb_4_airbnb_mean_final.csv")
mar_4_airbnb_mean_final = pd.read_csv("data/airbnb_scraped_cleaned/mar_4_airbnb_mean_final.csv")
apr_4_airbnb_mean_final = pd.read_csv("data/airbnb_scraped_cleaned/apr_4_airbnb_mean_final.csv")
may_4_airbnb_mean_final = pd.read_csv("data/airbnb_scraped_cleaned/may_4_airbnb_mean_final.csv")

#airbnb scrape two guest listings
dec_2_airbnb_mean_final = pd.read_csv("data/airbnb_scraped_cleaned/dec_2_airbnb_mean_final.csv")
jan_2_airbnb_mean_final = pd.read_csv("data/airbnb_scraped_cleaned/jan_2_airbnb_mean_final.csv")
feb_2_airbnb_mean_final = pd.read_csv("data/airbnb_scraped_cleaned/feb_2_airbnb_mean_final.csv")
mar_2_airbnb_mean_final = pd.read_csv("data/airbnb_scraped_cleaned/mar_2_airbnb_mean_final.csv")
apr_2_airbnb_mean_final = pd.read_csv("data/airbnb_scraped_cleaned/apr_2_airbnb_mean_final.csv")
may_2_airbnb_mean_final = pd.read_csv("data/airbnb_scraped_cleaned/may_2_airbnb_mean_final.csv")

In [5]:
#google geocoding api
latitude_df = pd.read_csv("data/cleaned_data_exports/mountain_lat_long.csv")

In [6]:
#closest airport information
airport_df = pd.read_csv("data/cleaned_data_exports/closest_airports.csv")

### Data Source #1 - OnTheSnow (Kaggle)
### User Based Filtering Dataset

The main dataset for the user-based collaborative filtering model comes from [Kaggle](https://www.kaggle.com/datasets/fredkellner/onthesnow-ski-area-reviews). It contains reviews scraped from OnTheSnow, a leading website that provides information about ski resorts and snow conditions.

The dataset holds 18,238 reviews of 291 ski resorts in the USA. The features include:

- State
- Ski Area
- Reviewer Name
- Review Date
- Review Star Rating (out of 5)
- Review Text

In [7]:
snow_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18238 entries, 0 to 18237
Data columns (total 6 columns):
 #   Column                         Non-Null Count  Dtype 
---  ------                         --------------  ----- 
 0   State                          18238 non-null  object
 1   Ski Area                       18238 non-null  object
 2   Reviewer Name                  18128 non-null  object
 3   Review Date                    18238 non-null  object
 4   Review Star Rating (out of 5)  18238 non-null  int64 
 5   Review Text                    18226 non-null  object
dtypes: int64(1), object(5)
memory usage: 855.0+ KB


In [8]:
snow_df.head()

Unnamed: 0,State,Ski Area,Reviewer Name,Review Date,Review Star Rating (out of 5),Review Text
0,colorado,copper-mountain-resort,anonymous_user,3-Mar-04,3,I have a pass the includes other mountains but...
1,utah,brighton-resort,anonymous_user,2-Dec-04,4,I've been coming to Brighton for years. Unlike...
2,north-carolina,ski-beech-mountain-resort,anonymous_user,1-Jan-05,5,"We went last Weekend, and it was the best snow..."
3,new-mexico,red-river,anonymous_user,1-Mar-05,5,Love Red River we go every year!
4,pennsylvania,sno-mountain,anonymous_user,2-Mar-05,4,"Great varied terrain, not crowded, good prices..."


In [9]:
#renaming columns
new_name = ['state', 'ski_resort', 'user_name','review_date', 'rating',
           'review'] 

snow_df.columns = new_name

In [10]:
snow_df

Unnamed: 0,state,ski_resort,user_name,review_date,rating,review
0,colorado,copper-mountain-resort,anonymous_user,3-Mar-04,3,I have a pass the includes other mountains but...
1,utah,brighton-resort,anonymous_user,2-Dec-04,4,I've been coming to Brighton for years. Unlike...
2,north-carolina,ski-beech-mountain-resort,anonymous_user,1-Jan-05,5,"We went last Weekend, and it was the best snow..."
3,new-mexico,red-river,anonymous_user,1-Mar-05,5,Love Red River we go every year!
4,pennsylvania,sno-mountain,anonymous_user,2-Mar-05,4,"Great varied terrain, not crowded, good prices..."
...,...,...,...,...,...,...
18233,minnesota,lutsen-mountains,REBECCA CARTWRIGHT,14-Dec-20,4,Many workers on the lifts did not know how to ...
18234,new-mexico,sipapu-ski-and-summer-resort,Antonio Martinez,15-Dec-20,5,"staying in the ""hotel"" (""motel"" on the sign ab..."
18235,new-mexico,sipapu-ski-and-summer-resort,Antonio Martinez,15-Dec-20,5,"staying in the ""hotel"" (""motel"" on the sign ab..."
18236,new-mexico,taos-ski-valley,David Humphrey,15-Dec-20,5,"Good skiing, have lost their way over the year..."


In [11]:
survey_df['user_name'].unique()

array(['anon_1', 'anon_2', 'anon_3', 'anon_4', 'anon_5', 'anon_6',
       'anon_7', 'anon_8', 'anon_9', 'Stephanie Ciaccia', 'Joseph Lewis',
       'Alexandria Kelly', 'Deanna Uzarski', 'Raghava Kamalesh'],
      dtype=object)

In [12]:
survey_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 61 entries, 0 to 60
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   review_date  61 non-null     object
 1   state        61 non-null     object
 2   ski_resort   61 non-null     object
 3   rating       61 non-null     int64 
 4   review       61 non-null     object
 5   user_name    61 non-null     object
dtypes: int64(1), object(5)
memory usage: 3.0+ KB


In [13]:
snow_df['review_date'] = pd.to_datetime(snow_df['review_date'])
survey_df['review_date'] = pd.to_datetime(survey_df['review_date'])
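
The OnTheSnow dates shown earlier follow a day-month-year pattern such as `3-Mar-04`. A minimal sketch, assuming every value in that column follows the same pattern, passes an explicit `format` so that parsing does not rely on inference. The survey dates may use a different format, so this applies only to the OnTheSnow sample values.

```python
import pandas as pd

# Sample values copied from the OnTheSnow review output above
dates = pd.Series(['3-Mar-04', '2-Dec-04', '14-Dec-20'])

# %d = day, %b = abbreviated month name, %y = two-digit year
parsed = pd.to_datetime(dates, format='%d-%b-%y')
print(parsed.dt.year.tolist())
```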

In [14]:
snow_df["ski_resort"].value_counts()

ski-brule                           1315
killington-resort                    204
vail                                 203
winter-park-resort                   191
blue-mountain-ski-area               189
                                    ... 
holiday-mountain                      13
new-hermon-mountain                   13
coffee-mill-ski-snowboard-resort      13
otis-ridge-ski-area                   13
whaleback-mountain                    12
Name: ski_resort, Length: 291, dtype: int64

In [15]:
snow_df['user_name'].value_counts().head(30)

anonymous_user         3026
anonymous               304
undefined undefined     130
Ben                      49
Mike                     49
Ryan                     46
Richard                  44
Dan                      42
David                    42
Chris                    39
Rob                      37
Jeff                     36
Matt                     31
wolfman                  31
Brian                    31
Derek                    31
Michael                  28
iPhone                   28
Nick                     27
gma                      27
Kevin                    27
Jim                      26
J                        25
Steve                    25
Mark                     23
Jun                      23
Jason                    23
Joe                      23
Paul                     22
Dave O                   22
Name: user_name, dtype: int64

To clean the review dataset, I dropped reviews whose user names could not identify a single reviewer, such as anonymous placeholders and common first names. I worked through the dataset and kept dropping rows until mostly distinctive usernames or first-and-last names were left.

In [16]:
drop_list = ["anonymous_user", "anonymous","undefined undefined","Mike", 
             "Ben", "Ryan", "Richard", "Dan", "David", "Chris", "Rob", "Jeff",
            "Derek", "Brian", "Matt", "Michael", "iPhone", "Kevin", "Nick",
            "Jim", "Steve", "Jason", "Mark", "Joe", "Paul", "Justin", "Scott",
            "Bob", "Alex", "Carter", "Dave", "Tim", "Bill", "Andrew", "John", "Sam",
            "James", "Kim", "Craig", "mike", "jason", "James", "Sam", "Kim", "mike", "peter",
            "Jack", "Adam", "Tom", "Wes", "Jun", "Steven", "Max", "Matthew", "Laura", "Felipe",
            "Greg", "Bryan", "Sarah", "Sara", "Christian", "Ray", "Connor", "Erin", "Emily",
            "Luke", "Ed", "Patrick", "kyle", "Ken", "Linda", "Eric", "Aaron", "Jake",
            "Josh", "Tony", "Abe", "Frank", "Peter", "Fred", "Arthur", "Lorraine",
            "Phil", "Sean", "Will", "Julie", "Jon", "Amy", "Becky", "Shannon", "brendan",
            "Kathy", "wayne", "Ethan", "Erika", "Jill", "Zoe", "Rick", "Wyatt",
            "Tyler", "Andrea", "mark", "john", "Donna", "Jen", "Braden", "D", "Bryce",
            "Rich", "Jared", "Jay", "Ann", "Brandon", "Nicholas","Martin",
            'Robert', 'angelino','Anonymous',
             'ty', 'jase', 'Jesse', 'Jennifer', 'Dustin', 'Natalie',
             'Pat', 'anonymous user', 'matt', 'George', 'Kate',
             'Daniel','Cindy', 'Barry', 'Todd', 'Melanie', 'Drew',
             'Andy', 'Hochard','Wayne', 'dan',
             'Charlie', 'Vanessa','Allen', 'Austin', 'Roger',
             'Jerry', 'Scotty', 'Anon', 'Lucas', 'Brian', 'Lee', 'Taylor',
            'brian', 'Lisa', 'Jade', 'Spencer', 'chris', 'Jenny', 'Amanda', 'Brett',
            'Maria', 'Holly', 'iPad', 'Sylvia', 'iPhone (2)', 'Catherine', 'Hannah', 'Wade',
            'Larry', 'Lauren','Noah', 'Bobby', 'Don', 'Christine', 'Stephen', 'Howard',
             'Tanner', 'Tom', 'Casey', 'Kyle', 'Michelle', 'Shelby',
             'Benjamin', 'Erik', 'Molly', 'Johnny', 'Chuck', 'Johnny',
             'Nathan', 'Cathy', 'Shelley', 'Mary', 'Danny', 'mitch', 'Brad', 'Tammy', 'erik',
            'Tricia', 'Nate', 'Pete']

snow_df = snow_df[~snow_df['user_name'].isin(drop_list)]
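
The drop list above repeats several names in different cases (for example `Mike` and `mike`). As a sketch of one alternative, not the method used in this notebook, the comparison can be made case-insensitive by casefolding both sides. The sample data and the shortened `generic_names` list are illustrative assumptions.

```python
import pandas as pd

reviews = pd.DataFrame({
    'user_name': ['Mike', 'mike', 'wolfman', 'Tim Zheng', 'ANONYMOUS'],
    'rating': [4, 5, 3, 5, 2],
})

# Small illustrative subset of the drop list
generic_names = ['anonymous', 'Mike', 'Ben']

# Compare on a casefolded copy so 'Mike', 'mike', and 'MIKE' all match
generic_folded = {name.casefold() for name in generic_names}
is_generic = reviews['user_name'].str.casefold().isin(generic_folded)

filtered = reviews[~is_generic]
print(filtered['user_name'].tolist())
```

Case-insensitive matching may remove more rows than an exact-match list, so the result is worth inspecting before adopting it.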

In [17]:
snow_df['user_name'].value_counts().head(70)

wolfman            31
gma                27
J                  25
Dave O             22
Tim Zheng          22
                   ..
Steve undefined     7
Jay C               7
tjkotula            7
nanaandpapa         7
smk1945             7
Name: user_name, Length: 70, dtype: int64

In [18]:
#renaming columns
new_name = ['state', 'ski_resort', 'user_name','review_date', 'rating',
           'review'] 

snow_df.columns = new_name

After cleaning the usernames, I narrow down the number of users further by keeping only users with at least three reviews.

In [19]:
# counting the number of reviews for each user
value_counts = snow_df['user_name'].value_counts()

# selecting only users with at least three reviews (count > 2)
selected_users = value_counts[value_counts > 2].index

# selecting only the rows where the user_name is in the selected_users list
cleaned_snow = snow_df[snow_df['user_name'].isin(selected_users)]
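
The same filter can be sketched in one pass with `groupby(...).transform('size')`, which attaches each user's review count to every row. The sample frame below is invented for illustration.

```python
import pandas as pd

reviews = pd.DataFrame({
    'user_name': ['wolfman'] * 3 + ['gma'] * 2 + ['echi'] * 4,
    'rating': [5, 4, 3, 4, 4, 5, 2, 3, 4],
})

# Number of reviews written by the author of each row
review_counts = reviews.groupby('user_name')['rating'].transform('size')

# Keep rows from users with at least three reviews (same cutoff as `> 2`)
frequent = reviews[review_counts >= 3]
print(frequent['user_name'].value_counts().to_dict())
```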

In [20]:
cleaned_snow['user_name'].value_counts(ascending=True)

brandon           3
Eric's iPhone     3
Kase1             3
bwm30             3
echi              3
                 ..
Dave O           22
Tim Zheng        22
J                25
gma              27
wolfman          31
Name: user_name, Length: 648, dtype: int64

Keeping only users with at least three reviews reduced the dataset to 2,996 reviews.

In [21]:
cleaned_snow.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2996 entries, 103 to 18199
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   state        2996 non-null   object        
 1   ski_resort   2996 non-null   object        
 2   user_name    2996 non-null   object        
 3   review_date  2996 non-null   datetime64[ns]
 4   rating       2996 non-null   int64         
 5   review       2994 non-null   object        
dtypes: datetime64[ns](1), int64(1), object(4)
memory usage: 163.8+ KB


In [22]:
cleaned_snow['user_name'].unique()

array(['ericadyer', 'FroDog', 'jackson321', 'SammyG', 'Jill Adler',
       'Twenz', 'anhillx', 'emj35', 'Benji Zimmerman', 'Roger Leo',
       'filterban', 'JP', 'gkphi5', 'RippinSkiers', 'Jay C',
       'Mark Rosasco', 'jay', 'Dan Gibson', 'stevenstclair', 'jwtime',
       'Cherokee', 'treesker', 'hranee', 'gwiffie', 'stevenam', 'Gunny J',
       'jim8588', 'Bmorabito', 'Richard 1', 'airgarden94', 'joey58242',
       'Mike134', 'tom travis', 'bobbert', 'noonito', 'tsfoust',
       'Art Zinn', 'steffenwolf', 'seniordude', 'Shartron',
       'Mikey Likes It', 'Americansonofa', 'Randy Agness', 'sno_thing',
       'sampanning', 'steep-n-deep ', 'jestertatt', 'Dantheman',
       'swissnowtiger', 'brandon', 'flyersboy114', 'Ritt', 'Bob Butts',
       'sharimcatee', 'iLiveToRide17', 'bodibran', 'yodeledihoo',
       'tourist from Texas', 'p_nut', 'highvoltageguy', 'masterdel',
       'govey80', 'horse', 'fcherichel', 'mwolske', 'mayham2k', 'Adye 1',
       'MgoBlue', 'Randy Rogers', 'Bobby G

In [23]:
#dropping duplicate rows
cleaned_snow = cleaned_snow.drop_duplicates()

#### Ski resort name - cleaning

Since the target variable is the ski resort, I need to clean and standardize the resort names in all datasets so that they are consistent.

In [24]:
cleaned_snow['ski_resort'].unique()

array(['squaw-valley-usa', 'sun-valley', 'donner-ski-ranch', 'boreal',
       'diamond-peak', 'mt-baker', 'alpental', 'stevens-pass-resort',
       'the-summit-at-snoqualmie', 'mt-rose-ski-tahoe', 'mountain-high',
       'snowshoe-mountain-resort', 'alyeska-resort', 'steamboat',
       'alta-ski-area', 'snowbird', 'snowbasin', 'brighton-resort',
       'solitude-mountain-resort', 'deer-valley-resort',
       'park-city-mountain-resort', 'jackson-hole', 'sundance',
       'brian-head-resort', 'bretton-woods', 'loon-mountain',
       'sierra-at-tahoe', 'heavenly-mountain-resort', 'gunstock',
       'sno-mountain', 'attitash', 'crystal-mountain-wa', 'vail',
       'killington-resort', 'waterville-valley', 'kirkwood',
       'copper-mountain-resort', 'breckenridge',
       'arapahoe-basin-ski-area', 'keystone', 'boyne-mountain-resort',
       'crystal-mountain', 'shanty-creek', 'cannonsburg',
       'boyne-highlands', 'aspen-snowmass', 'sunday-river',
       'mount-sunapee', 'sugar-bowl-re

In [25]:
#removing words to clean up resort names
replace_snow = ['-ski-area', '-', 'resort', 'mt']
replace_with = ['', ' ', '', 'mt.']

cleaned_snow = cleaned_snow.replace(replace_snow, replace_with, regex=True)

In [26]:
#making columns titlecase
cleaned_snow['ski_resort'] = cleaned_snow['ski_resort'].str.title()
cleaned_snow['state'] = cleaned_snow['state'].str.title()
cleaned_snow['ski_resort'] = cleaned_snow['ski_resort'].str.strip()

In [27]:
#replacing values to standardize endings/specific resort names
replace_snow = ['At', 'Mtn', 'Mt.N', 'Mt. Hood Ski Bowl', 'And', r'\bMount\b', 'Mtn.']
replace_with = ['at', 'Mountain', 'Mountain', 'Mt. Hood Skibowl', 'and', 'Mt.', 'Mountain']

cleaned_snow = cleaned_snow.replace(replace_snow, replace_with, regex=True)

In [28]:
cleaned_snow = cleaned_snow.replace("Shanty Creek", "Schuss Mountain", regex=True)
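
The replacements above run against every column and match substrings, which is why names such as `Attitash` and `Andes Tower Hills` need repairing in a later cell. A minimal sketch of a narrower approach, assuming only the `ski_resort` column should change, uses word boundaries (`\b`) and `Series.str.replace`. The sample rows are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'ski_resort': ['Sierra At Tahoe', 'Attitash', 'Andes Tower Hills',
                   'Pico Mountain At Killington'],
    'review': ['At the top And back', 'Great', 'Fun', 'Ok'],
})

# \b marks a word boundary, so only the standalone words are lowercased
patterns = {r'\bAt\b': 'at', r'\bAnd\b': 'and'}

# Apply the replacements to the resort-name column only
for pattern, repl in patterns.items():
    df['ski_resort'] = df['ski_resort'].str.replace(pattern, repl, regex=True)

print(df)
```

With this scoping, the review text keeps its original wording and capitalization.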

After inspecting the resort names, I found a few resorts with identical or very similar names. I adjusted those names and included the state in the resort name to tell them apart.

In [29]:
#timberline
cleaned_snow.loc[(cleaned_snow['ski_resort'] == "Timberline Four Seasons") & (cleaned_snow['state'] == "West Virginia"), 'ski_resort'] = "Timberline Mountain"

#crystal mountain
cleaned_snow.loc[(cleaned_snow['ski_resort'] == "Crystal Mountain Wa") & (cleaned_snow['state'] == "Washington"), 'ski_resort'] = "Crystal Mountain Washington"
cleaned_snow.loc[(cleaned_snow['ski_resort'] == "Crystal Mountain") & (cleaned_snow['state'] == "Michigan"), 'ski_resort'] = "Crystal Mountain Michigan"

#magic mountain
cleaned_snow.loc[(cleaned_snow['ski_resort'] == "Magic Mountain") & (cleaned_snow['state'] == "Vermont"), 'ski_resort'] = "Magic Mountain Vermont"
cleaned_snow.loc[(cleaned_snow['ski_resort'] == "Magic Mountain") & (cleaned_snow['state'] == "Idaho"), 'ski_resort'] = "Magic Mountain Idaho"
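
Name clashes like the ones handled above can also be detected programmatically. The sketch below, on invented sample rows, counts the distinct states per resort name and appends the state only where a name appears in more than one state. It does not catch near-duplicates such as `Crystal Mountain Wa`, which still need manual review.

```python
import pandas as pd

resorts = pd.DataFrame({
    'ski_resort': ['Magic Mountain', 'Magic Mountain', 'Vail', 'Vail'],
    'state': ['Vermont', 'Idaho', 'Colorado', 'Colorado'],
})

# Count distinct states per resort name; more than one means a name clash
states_per_name = resorts.groupby('ski_resort')['state'].transform('nunique')
ambiguous = states_per_name > 1

# Append the state only where the name is ambiguous
resorts.loc[ambiguous, 'ski_resort'] = (
    resorts.loc[ambiguous, 'ski_resort'] + ' ' + resorts.loc[ambiguous, 'state']
)
print(resorts['ski_resort'].tolist())
```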

In [30]:
mountain_rep = ['Squaw Valley Usa',
                'Mccauley Mountain Ski Center', 'attitash', 'Smugglers Notch',
               'Pico Mountain at Killington', 'andes Tower Hills']

mountain_new = ['Palisades Tahoe',
               'McCauley Mountain', 'Attitash', "Smugglers' Notch",
               'Pico Mountain', 'Andes Tower Hills']

cleaned_snow = cleaned_snow.replace(mountain_rep, mountain_new, regex=True)

### Data Source #2 - Survey Data

A second, smaller dataset was collected through a [Google survey](https://docs.google.com/forms/d/1ROrGEkCh40RjbHidNCqg4SCCbY3_6DFNw0VWIhTEIGs/edit#responses) I distributed to individuals who ski, including myself.

I downloaded the Sheets file from Google and saved it as a .csv. A few individuals did not include their name, so I gave them unique "anon" names.

I plan to use the names of three users I know to check whether the model's results align with those users' preferences. I also asked those three users to send me a brief summary of the key characteristics they look for when choosing ski resorts to visit.

In [31]:
#making columns titlecase
survey_df['ski_resort'] = survey_df['ski_resort'].str.title()
survey_df['state'] = survey_df['state'].str.title()
survey_df['ski_resort'] = survey_df['ski_resort'].str.strip()

In [32]:
list(survey_df['ski_resort'].sort_values().unique())

['Alta',
 'Arapahoe Basin',
 'Aspen Highlands',
 'Aspen Snowmass',
 'Bear Valley',
 'Beaver Creek',
 'Beaver Mountain',
 'Breckenridge',
 'Brighton',
 'Cherry Peak',
 'Copper Mountain',
 'Crested Butte',
 'Crystal Mountain - Wa',
 'Deer Valley',
 'Dodge Ridge',
 'Gore Mountain',
 'Hunter Mountain',
 'Jackson Hole',
 'Killington',
 'Mammoth Mountain',
 'Mccauley Mountain',
 'Mt Baker',
 'Mt. Rose',
 'Nordic Valley',
 'Palisades Tahoe',
 'Park City Mountain',
 'Powder Mountain',
 'Roundtop Mountain',
 'Snow Ridge',
 'Snowbird',
 'Solitude',
 'Steamboat',
 "Steven'S Pass",
 'Stevens Pass',
 'Stratton',
 'Sugarbush',
 'Taos',
 'Telluride',
 'Vail',
 'Winter Park',
 'Woods Valley']

Below, I went through the resort names manually and changed them to match the names in the main dataframe.

In [33]:
mountain_rep = ['Crystal Mountain - Wa',"Steven'S Pass",'Stratton',
                'Mccauley Mountain', 'Taos',
                'Crested Butte', 'Mt. Rose', 'Mt Baker', 'Nordic Valley' ,'Solitude']

mountain_rep_p = ['Crystal Mountain Washington', 'Stevens Pass','Stratton Mountain',
                 'McCauley Mountain', 'Taos Ski Valley',
                 'Crested Butte Mountain', 'Mt. Rose Ski Tahoe', 'Mt. Baker', 'Nordic Mountain', 'Solitude Mountain']

survey_df = survey_df.replace(mountain_rep, mountain_rep_p, regex=True)

In [34]:
# changing aspen resorts since all four mountains are part of snowmass
mountain_r = ['Aspen Mountain', 'Aspen Highlands']

survey_df = survey_df.replace(mountain_r, 'Aspen Snowmass', regex=True)

In [35]:
#checking to see which names are different
survey_df.loc[~survey_df['ski_resort'].isin(cleaned_snow['ski_resort']),
                         'ski_resort'].unique()

array(['Cherry Peak'], dtype=object)
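
Manual matching can be supported by a fuzzy lookup that proposes candidates for each unmatched name. The sketch below uses `difflib.get_close_matches` from the standard library; the short `reference_names` list and the 0.8 cutoff are assumptions for illustration, and every suggestion would still need a manual check.

```python
from difflib import get_close_matches

# Illustrative stand-ins for the cleaned OnTheSnow names and the survey names
reference_names = ['Mt. Baker', 'Stevens Pass', 'Crystal Mountain Washington', 'Vail']
survey_names = ['Mt Baker', "Steven'S Pass", 'Cherry Peak']

# For each survey name, keep the single best candidate above the cutoff
suggestions = {
    name: get_close_matches(name, reference_names, n=1, cutoff=0.8)
    for name in survey_names
}

for name, match in suggestions.items():
    print(name, '->', match or 'no close match')
```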

### Merging survey and OnTheSnow review data

In [36]:
#merging survey review results and final onthesnow reviews
final_ski_df = pd.concat([survey_df, cleaned_snow])

In [37]:
#dropping null values
final_ski_df = final_ski_df.dropna()

In [38]:
#dropping duplicates
final_ski_df = final_ski_df.drop_duplicates()

In [39]:
final_ski_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2796 entries, 0 to 18198
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   review_date  2796 non-null   datetime64[ns]
 1   state        2796 non-null   object        
 2   ski_resort   2796 non-null   object        
 3   rating       2796 non-null   int64         
 4   review       2796 non-null   object        
 5   user_name    2796 non-null   object        
dtypes: datetime64[ns](1), int64(1), object(4)
memory usage: 152.9+ KB


In [40]:
#Blackjack ski and Indianhead combined to form Snowriver Mountain Resort
blackjack = ['Blackjack Ski', 'Indianhead Mountain']

final_ski_df = final_ski_df.replace(blackjack, 'Snowriver Mountain Resort', regex=True)

In [41]:
#updating names of resort names that have changed
old_name = ['Durango Mountain','Las Vegas Ski and Snowboard','Shawnee Peak', 'Suicide Six', 'Snow Summit']
new_name = ['Purgatory Mountain','Lee Canyon','Shawnee Mountain', 'Saskadena Six', 'Big Bear']

final_ski_df = final_ski_df.replace(old_name, new_name, regex=True)

In [42]:
# making dictionary for replacements to avoid doubling the names
replacements = {
    'Brandywine': 'Boston Mills and Brandywine',
    'Boston Mills': 'Boston Mills and Brandywine'
}

# replacing
final_ski_df['ski_resort'] = final_ski_df['ski_resort'].replace(replacements)
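
The doubling mentioned in the comment above can be illustrated with plain `re.sub`: when substring substitutions run one after another, the second pattern also matches the output of the first. A dictionary passed to `Series.replace` without `regex=True` matches whole values only, so it avoids that. The sample series is invented for illustration.

```python
import re
import pandas as pd

names = pd.Series(['Boston Mills', 'Brandywine', 'Vail'])
merged = 'Boston Mills and Brandywine'

# Sequential substring substitution rewrites its own output
step_one = re.sub('Boston Mills', merged, 'Boston Mills')
doubled = re.sub('Brandywine', merged, step_one)
print(doubled)

# A dict passed to Series.replace (without regex) matches whole values only
exact = names.replace({'Brandywine': merged, 'Boston Mills': merged})
print(exact.tolist())
```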

In [43]:
#updating duplicate ski resort names and saving as new resort names that include the state names
final_ski_df.loc[(final_ski_df['ski_resort'] == "Crystal Mountain") & (final_ski_df['state'] == "Washington"), 'ski_resort'] = "Crystal Mountain Washington"
final_ski_df.loc[(final_ski_df['ski_resort'] == "Crystal Mountain") & (final_ski_df['state'] == "Michigan"), 'ski_resort'] = "Crystal Mountain Michigan"
final_ski_df.loc[(final_ski_df['ski_resort'] == "Crystal Mountain ") & (final_ski_df['state'] == "Michigan"), 'ski_resort'] = "Crystal Mountain Michigan"
final_ski_df.loc[(final_ski_df['ski_resort'] == "Powder Ridge") & (final_ski_df['state'] == "Minnesota"), 'ski_resort'] = "Powder Ridge Minnesota"
final_ski_df.loc[(final_ski_df['ski_resort'] == "Powder Ridge") & (final_ski_df['state'] == "Connecticut"), 'ski_resort'] = "Powder Ridge Connecticut"

#alpine valley
final_ski_df.loc[(final_ski_df['ski_resort'] == "Alpine Valley") & (final_ski_df['state'] == "Wisconsin"), 'ski_resort'] = "Alpine Valley Wisconsin"
final_ski_df.loc[(final_ski_df['ski_resort'] == "Alpine Valley") & (final_ski_df['state'] == "Ohio"), 'ski_resort'] = "Alpine Valley Ohio"

In [44]:
final_ski_df.replace("Ski Mystic at Deer Mountain", "Deer Mountain", regex=True, inplace=True)

In [45]:
final_ski_df.replace("Sno Mountain", "Montage Mountain", regex=True, inplace=True)

In [46]:
final_ski_df.replace(" Ski Recreation Area", "", regex=True, inplace=True)

In [47]:
final_ski_df.replace("Winter Sports Park", "", regex=True, inplace=True)

In [48]:
final_ski_df.replace("The", "", regex=True, inplace=True)

In [49]:
final_ski_df.replace("Snowboard Area", "", regex=True, inplace=True)

In [50]:
final_ski_df['ski_resort'] = final_ski_df['ski_resort'].str.strip()

In [51]:
#dropping Cherry Peak
final_ski_df = final_ski_df.loc[(final_ski_df['ski_resort'] != "Cherry Peak")]

In [52]:
#exporting final cleaned dataframe for OnTheSnow scrape
#final_ski_df.to_csv("cleaned_data_exports/final_review_df_final.csv")

### Data Source #3 - OnTheSnow Scrape


I scraped OnTheSnow to pull current ski resort features for the resorts in the final merged dataset. The code for this scrape was adapted from a [user on GitHub](https://github.com/SijiaLai/OnTheSnow/tree/master) and updated to reflect HTML changes and the features I wanted to pull.

The code for the scraper can be found in the data folder.


The main features I scraped:

- mountain elevation
- ticket price
- mountain location
- ski terrain
- snowfall averages

In [124]:
scraped_df_2_2 = pd.read_csv("data/OnTheSnow_Scrape_2_820523.csv")


In [125]:
scraped_df_2_2

Unnamed: 0,ski_resort,address,city,state,country,summit,drop,base,gondolas_and_trams,fast_eight,...,teenagerPrice_season,adultPrice_season,seniorPrice_season,Url,gondolas_lifts_note,beginner_runs,intermediate_runs,advanced_runs,expert_runs,night_skiing
0,Palisades Tahoe,PO Box 2007,96146 Olympic Valley,California,United States,9050,2850,6200,3.0,6.0,...,879.00,1179.00,,https://www.onthesnow.com/california/squaw-val...,,,,,,
1,Mammoth,P.O. Box 24,93546 Mammoth Lakes,California,United States,11053,3100,7953,3.0,9.0,...,879.00,1179.00,,https://www.onthesnow.com/california/mammoth-m...,Check gondola information.,15%,48%,24%,13%,
2,Donner Ski Ranch,P.O. Box 66,95724 Norden,California,United States,8012,750,7031,,,...,449.00,499.00,449.00,https://www.onthesnow.com/california/donner-sk...,,31%,38%,21%,10%,
3,Sugar Bowl,P.O. Box 5,95724 Norden,California,United States,8383,1500,6883,1.0,5.0,...,889.00,1119.00,889.00,https://www.onthesnow.com/california/sugar-bow...,,15%,45%,28%,12%,
4,Kirkwood,PO Box 1,95646 Kirkwood,California,United States,9800,2000,7800,,2.0,...,429.00,545.00,437.00,https://www.onthesnow.com/california/kirkwood/...,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
326,Oak Mountain,141 Novosel Way,12164 Speculator,NY,United States,2400,650,1750,,,...,315.00,415.00,315.00,https://www.onthesnow.com/new-york/oak-mountai...,,45%,27%,18%,9%,12 ac
327,Mount Pleasant,23301 Plank Rd,16403 Venango,Pa 16440,United States,1540,340,1200,,,...,,350.00,,https://www.onthesnow.com/pennsylvania/mount-p...,,22%,56%,22%,,35 ac
328,Hunt Hollow,7532 County Road 36,14512 Naples,New York,United States,2030,825,1000,,,...,,,,https://www.onthesnow.com/new-york/hunt-hollow...,,32%,21%,37%,11%,400 ac
329,Powder Ridge,99 Powder Hill Road,06455 Middlefield,Connecticut,United States,720,550,170,,,...,520.00,570.00,520.00,https://www.onthesnow.com/connecticut/powder-r...,,45%,40%,15%,,40 ac


In [53]:
scraped_df = pd.read_csv("data/OnTheSnow_Scrape_2_820523.csv")
scraped_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 331 entries, 0 to 330
Data columns (total 61 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   ski_resort            331 non-null    object 
 1   address               331 non-null    object 
 2   city                  331 non-null    object 
 3   state                 331 non-null    object 
 4   country               331 non-null    object 
 5   summit                331 non-null    int64  
 6   drop                  331 non-null    int64  
 7   base                  331 non-null    int64  
 8   gondolas_and_trams    40 non-null     float64
 9   fast_eight            110 non-null    float64
 10  high_speed_sixes      41 non-null     float64
 11  quad_chairs           163 non-null    float64
 12  triple_chairs         224 non-null    float64
 13  double_chairs         239 non-null    float64
 14  surface_lifts         309 non-null    float64
 15  total_runs            3

In [57]:
scraped_df['state'].unique()

array(['California', 'Nevada', 'Colorado', 'Washington', 'Utah', 'Oregon',
       'Alaska', 'Idaho', 'North Carolina', 'Wyoming', 'Arizona',
       'Minnesota', 'New Mexico', 'Montana', 'Michigan', 'Vermont',
       'Wisconsin', 'New Hampshire', 'New Jersey', 'Massachusetts',
       'Pennsylvania', 'New York', 'Iowa', 'Maine', 'West Virginia',
       'Illinois', 'Ohio', 'Virginia', 'Missouri', 'Connecticut',
       'Tennessee', 'Indiana', 'South Dakota', 'Maryland', 'Rhode Island'],
      dtype=object)

In [55]:
state_rep = ['WI  54819', 'NV 89131', 'ID 83873', 'Pa 16440', 'NY', 'CO', 'CA 96160', 'PA 16625', 'Az']
state_with = ["Wisconsin", "Nevada", "Idaho", "Pennsylvania", "New York", "Colorado", "California", "Pennsylvania",
             "Arizona"]

scraped_df.replace(state_rep, state_with, regex=True, inplace=True)
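
As a sketch of a more general cleanup, assuming the malformed values always start with a two-letter state abbreviation, the abbreviation can be extracted and mapped to the full name while leaving already-correct values untouched. The `STATE_NAMES` lookup is a hypothetical helper that covers only the abbreviations in this example.

```python
import pandas as pd

states = pd.Series(['WI  54819', 'Pa 16440', 'NY', 'New York', 'Az', 'Colorado'])

# Illustrative lookup covering only the abbreviations in this example
STATE_NAMES = {'WI': 'Wisconsin', 'PA': 'Pennsylvania', 'NY': 'New York', 'AZ': 'Arizona'}

# Capture a leading two-letter token; full names such as 'New York' do not match
abbr = states.str.extract(r'^([A-Za-z]{2})\b', expand=False).str.upper()

# Map abbreviations to names and fall back to the original value elsewhere
cleaned = abbr.map(STATE_NAMES).fillna(states)
print(cleaned.tolist())
```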

In [56]:
scraped_df.isna().sum().sort_values(ascending=False).head(35)

teenager6DayPrice       319
children6DayPrice       313
senior6DayPrice         313
adult6DayPrice          312
gondolas_lifts_note     309
gondolas_and_trams      291
high_speed_sixes        290
teenager2DayPrice       287
children2DayPrice       287
senior2DayPrice         286
adult2DayPrice          280
seniorHalfDayPrice      235
teenagerHalfDayPrice    233
fast_eight              221
adultHalfDayPrice       217
expert_runs             192
quad_chairs             168
may_snow                167
night_skiing            149
terrain_note            139
advanced_runs           123
seniorPrice_season      120
beginner_runs           115
intermediate_runs       114
childrenWeekdayPrice    112
childrenWeekendPrice    111
triple_chairs           107
childrenPrice_season     98
teenagerPrice_season     98
seniorWeekdayPrice       95
seniorWeekendPrice       94
double_chairs            92
teenagerWeekdayPrice     88
teenagerWeekendPrice     85
apr_snow                 77
dtype: int64

In [58]:
#dropping columns that are mostly null or not needed
scraped_df.drop(columns=['gondolas_lifts_note', 'terrain_note', 'country', 'teenager6DayPrice',
                         'children6DayPrice', 'senior6DayPrice', 'adult6DayPrice',
                         'teenager2DayPrice', 'children2DayPrice',
                         'senior2DayPrice', 'adult2DayPrice', 'seniorHalfDayPrice',
                         'teenagerHalfDayPrice', 'adultHalfDayPrice', 'may_snow',
                         'seniorPrice_season'], inplace=True)

In [59]:
#filling nulls with 0
fill_list = ["gondolas_and_trams", "high_speed_sixes", "expert_runs", "quad_chairs", "night_skiing",
            "beginner_runs", "advanced_runs", "intermediate_runs", "triple_chairs", "double_chairs", "apr_snow",
            "snow_making", "surface_lifts", "dec_snow", "longest_run", "skiable_terrain", "mar_snow", "jan_snow",
            "fast_eight"]

for x in fill_list:
    scraped_df[x] = scraped_df[x].fillna(0)

In [60]:
#filling null prices with a placeholder message
fill_list = ["childrenWeekdayPrice", "childrenWeekendPrice", "teenagerPrice_season", "childrenPrice_season",
            "seniorWeekdayPrice", "seniorWeekendPrice", "teenagerWeekdayPrice", "teenagerWeekendPrice",
            "adultWeekdayPrice", "adultWeekendPrice", "adultPrice_season"]

for x in fill_list:
    scraped_df[x] = scraped_df[x].fillna("See ski resort website")

In [61]:
#splitting the city column into zipcode and city columns
#n is passed by keyword because newer pandas versions no longer accept it positionally
scraped_df[['zipcode', 'city']] = scraped_df['city'].str.split(' ', n=1, expand=True)
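
To make the split above concrete, here is a minimal sketch on a toy frame. It assumes the scraped `city` values follow a `'<zipcode> <city>'` layout; the two sample rows are for illustration only and the name `toy` is hypothetical.

```python
import pandas as pd

# Toy rows in the assumed '<zipcode> <city>' layout
toy = pd.DataFrame({"city": ["96146 Olympic Valley", "95724 Norden"]})

# n=1 splits on the first space only, so multi-word city names stay intact
toy[["zipcode", "city"]] = toy["city"].str.split(" ", n=1, expand=True)

print(toy)
```

Limiting the split to one cut is what keeps a name such as "Olympic Valley" in a single column.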

In [62]:
#moving zipcode so it sits right after state (column position 4)
column_to_move = scraped_df.pop("zipcode")
scraped_df.insert(4, "zipcode", column_to_move)

In [63]:
#stripping every non-digit character (units, currency symbols, text) from these columns
#note: this also empties the "See ski resort website" placeholder set above
replace_list = ["beginner_runs", "intermediate_runs", "advanced_runs", "expert_runs", "night_skiing",
                "longest_run", "skiable_terrain", "snow_making", "averageSnowfall", "nov_snow", "dec_snow",
                "jan_snow", "feb_snow", "mar_snow", "apr_snow",
                "childrenWeekdayPrice", "childrenWeekendPrice", "teenagerWeekdayPrice", "teenagerWeekendPrice",
                "adultWeekdayPrice", "adultWeekendPrice", "seniorWeekdayPrice", "seniorWeekendPrice",
                "childrenPrice_season", "teenagerPrice_season", "adultPrice_season"]

for x in replace_list:
    scraped_df[x] = scraped_df[x].replace(r'[^0-9]', '', regex=True)
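
The digit-stripping pattern above leaves strings behind, so a conversion step is still needed before doing arithmetic on these columns. The following is a small sketch of one way to do that with `pd.to_numeric`; the sample values and the name `prices` are hypothetical, and this is not a step taken in the notebook itself.

```python
import pandas as pd

# Hypothetical price column: a dollar-formatted string, a plain string, and the placeholder text
prices = pd.Series(["$89", "120", "See ski resort website"])

# Same pattern as the cell above: drop every character that is not a digit
digits_only = prices.replace(r"[^0-9]", "", regex=True)

# errors="coerce" turns anything that cannot be parsed (here, the empty string) into NaN
numeric = pd.to_numeric(digits_only, errors="coerce")
print(numeric)
```

One caveat of the pattern itself: a decimal point is also a non-digit, so a value like `89.50` would come out as `8950`.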

In [64]:
#checking which resorts in final_ski_df are missing from the scraped data, so that every resort has feature information
final_ski_df.loc[~final_ski_df['ski_resort'].isin(scraped_df['ski_resort']),
                         'ski_resort'].sort_values().unique()

array(['49 Degrees North', 'Alpine Valley Ohio',
       'Alpine Valley Wisconsin', 'Anthony Lakes Mountain',
       'Appalachian Ski Mountain', 'Badger Pass', 'Bear Creek Mountain',
       'Big Powderhorn Mountain', 'Big Squaw Mountain Ski', 'Boreal',
       'Boston Mills and Brandywine', 'Boyne Highlands',
       'Brantling Ski Slopes', 'Bromley Mountain', 'Bryce',
       'Caberfae Peaks Ski Golf', 'Camelback Mountain',
       'Catamount Ski Ride Area', 'Coffee Mill Ski Snowboard',
       'Crested Butte Mountain', 'Crystal Mountain Michigan',
       'Crystal Mountain Washington', 'Discovery', 'Eldora Mountain',
       'Elk Mountain Ski', 'Heavenly Mountain', 'Hogadon', 'Holimont',
       'Lost Trail Powder Mountain', 'Lutsen Mountains',
       'Mad River Mountain', 'Magic Mountain Idaho',
       'Magic Mountain Vermont', 'Mammoth Mountain', 'Marquette Mountain',
       'Monarch Mountain', 'Mt. Abram Ski', 'Mt. Holly', 'Mt. Peter',
       'Mt. Rose Ski Tahoe', 'Mt. Shasta Board Ski Par

In [65]:
rep_list = ["49° North", "Mammoth","Bear Creek", "Boreal Mountain", "Boston Mills",
           r'\bBrandywine\b',"Bromley", "Bryce Resort", "Caberfae Peaks","Camelback",
            "Catamount", "Crested Butte", "Discovery Ski", "Eldora", "Elk Mountain", "Heavenly", "Hogadon Basin",
           "HoliMont", "Lost Trail", "Crystal Mountain, MI", "Lutsen", "Marquette",
            "Monarch","Mt. Abram", "Mt. Rose", "Ober Mountain", "Okemo", "Pajarito",
           "Park City", "Peek'n Peak",  "Plattekill", "Pomerelle", "Purgatory", "Roundtop", "Sierra",
            "Silverton", 'Sipapu Ski', "Beech Mountain", 'China Peak', "Snowshoe", "Solitude", "Stowe", "Stratton",
           "Tamarack Resort", "Taos", "Toggenburg", "Wachusett", "Whitecap", "Whiteface", "Whitefish",
           "Wild Mountain", "Wildcat", "Winterplace", "Wolf Ridge","Yosemite Badger Pass", "The Highlands",
           r'\bNew Hermon Mtn.\b']

rep_with = ["49 Degrees North", "Mammoth Mountain", "Bear Creek Mountain", "Boreal",
            "Boston Mills and Brandywine", "Boston Mills and Brandywine", "Bromley Mountain",
           "Bryce", "Caberfae Peaks Ski Golf", "Camelback Mountain", "Catamount Ski Ride Area", "Crested Butte Mountain",
           "Discovery", "Eldora Mountain", "Elk Mountain Ski", "Heavenly Mountain", "Hogadon", "Holimont",
            "Lost Trail Powder Mountain","Crystal Mountain Michigan", "Lutsen Mountains", "Marquette Mountain",
           "Monarch Mountain", "Mt. Abram Ski", "Mt. Rose Ski Tahoe",
           "Ober Gatlinburg Ski", "Okemo Mountain", "Pajarito Mountain", "Park City Mountain", "Peekn Peak", "Plattekill Mountain", "Pomerelle Mountain", "Purgatory Mountain",
           "Roundtop Mountain", "Sierra at Tahoe", "Silverton Mountain", "Sipapu Ski and Summer", "Ski Beech Mountain", 'Ski China Peak',
            "Snowshoe Mountain", "Solitude Mountain", "Stowe Mountain", "Stratton Mountain", "Tamarack", "Taos Ski Valley",
           "Toggenburg Mountain", "Wachusett Mountain", "Whitecap Mountain", "Whiteface Mountain", "Whitefish Mountain",
           "Wild Mountain Ski", "Wildcat Mountain", "Winterplace Ski", "Wolf Ridge Ski","Badger Pass",
            "Boyne Highlands","New Hermon Mountain"]

scraped_df.replace(rep_list, rep_with, regex=True, inplace=True)
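
One thing to keep in mind with the cell above: `DataFrame.replace` with `regex=True` is applied to every string column, not only `ski_resort`, and it rewrites substrings. The `city` value `Mammoth Mountain Lakes` that appears in later output is consistent with that. A minimal sketch of a narrower alternative, assuming exact scraped names are known, is shown below; `toy` and `name_map` are hypothetical and cover only two of the pairs.

```python
import pandas as pd

toy = pd.DataFrame({
    "ski_resort": ["Mammoth", "Eldora"],
    "city": ["Mammoth Lakes", "Nederland"],
})

# Hypothetical subset of the name mapping, keyed by the exact scraped name
name_map = {"Mammoth": "Mammoth Mountain", "Eldora": "Eldora Mountain"}

# Series.replace with a dict and no regex only swaps values that equal a key exactly,
# and restricting it to one column leaves the other columns untouched
toy["ski_resort"] = toy["ski_resort"].replace(name_map)
print(toy)
```

The trade-off is that exact matching will not catch partial names the way the substring patterns do, so the mapping has to list each scraped spelling.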

In [66]:
# Define the replacements
replace_from = [r'\bMount\b', r'\bMtn\b']
replace_to = ['Mt.', 'Mountain']

# Replace whole-word matches only (\b marks a word boundary)
scraped_df = scraped_df.replace(replace_from, replace_to, regex=True)

In [67]:
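#removing the trailing period; the dot is escaped so it matches a literal "." rather than any character
scraped_df = scraped_df.replace(r"New Hermon Mountain\.", "New Hermon Mountain", regex=True)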

In [68]:
replace_from = ["Anthony Lakes", "Big Squaw","Brantling Ski", "Coffee Mill", "Mt. Shasta", "Mount Holly"]
replace_to = ["Anthony Lakes Mountain", 'Big Squaw Mountain Ski', "Brantling Ski Slopes", "Coffee Mill Ski Snowboard",
               'Mt. Shasta Board Ski Park', "Mt. Holly"]

scraped_df.replace(replace_from, replace_to, regex=True, inplace=True)

In [69]:
scraped_df.loc[(scraped_df['ski_resort'] == "Cooper") & (scraped_df['state'] == "Colorado"), 'ski_resort'] = "Ski Cooper"

In [70]:
scraped_df.loc[(scraped_df['ski_resort'] == "Timberline") & (scraped_df['state'] == "West Virginia"), 'ski_resort'] = "Timberline Mountain"

In [71]:
#updating scraped df names to match the other dataframes
scraped_df.loc[(scraped_df['ski_resort'] == "Alpine Valley") & (scraped_df['state'] == "Ohio"), 'ski_resort'] = "Alpine Valley Ohio"
scraped_df.loc[(scraped_df['ski_resort'] == "Alpine Valley") & (scraped_df['state'] == "Wisconsin"), 'ski_resort'] = "Alpine Valley Wisconsin"
scraped_df.loc[(scraped_df['ski_resort'] == "Alpine Valley") & (scraped_df['state'] == "Michigan"), 'ski_resort'] = "Alpine Valley Michigan"

#updating scraped df names to match the other dataframes
scraped_df.loc[(scraped_df['ski_resort'] == "Crystal Mountain") & (scraped_df['state'] == "Washington"), 'ski_resort'] = "Crystal Mountain Washington"
scraped_df.loc[(scraped_df['ski_resort'] == "Crystal Mountain ") & (scraped_df['state'] == "Michigan"), 'ski_resort'] = "Crystal Mountain Michigan"

#updating scraped df names to match the other dataframes
scraped_df.loc[(scraped_df['ski_resort'] == "Powder Ridge") & (scraped_df['state'] == "Minnesota"), 'ski_resort'] = "Powder Ridge Minnesota"
scraped_df.loc[(scraped_df['ski_resort'] == "Powder Ridge") & (scraped_df['state'] == "Connecticut"), 'ski_resort'] = "Powder Ridge Connecticut"

#updating scraped df names to match the other dataframes
scraped_df.loc[(scraped_df['ski_resort'] == "Magic Mountain") & (scraped_df['state'] == "Idaho"), 'ski_resort'] = "Magic Mountain Idaho"
scraped_df.loc[(scraped_df['ski_resort'] == "Magic Mountain") & (scraped_df['state'] == "Vermont"), 'ski_resort'] = "Magic Mountain Vermont"

#updating scraped df names to match the other dataframes
scraped_df.loc[(scraped_df['ski_resort'] == "Mad River") & (scraped_df['state'] == "Ohio"), 'ski_resort'] = "Mad River Mountain"
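
The name-plus-state updates above all follow the same shape, so they could also be driven from a single lookup table. This is a sketch under that assumption; `toy` and `renames` are hypothetical names and only two of the pairs are included.

```python
import pandas as pd

toy = pd.DataFrame({
    "ski_resort": ["Alpine Valley", "Alpine Valley", "Vail"],
    "state": ["Ohio", "Wisconsin", "Colorado"],
})

# (scraped name, state) -> disambiguated name
renames = {
    ("Alpine Valley", "Ohio"): "Alpine Valley Ohio",
    ("Alpine Valley", "Wisconsin"): "Alpine Valley Wisconsin",
}

for (name, state), new_name in renames.items():
    # Both conditions must hold, so same-named resorts in other states are left alone
    mask = (toy["ski_resort"] == name) & (toy["state"] == state)
    toy.loc[mask, "ski_resort"] = new_name

print(toy["ski_resort"].tolist())
```

Keeping the pairs in one dictionary makes it easier to spot a rule that was entered twice.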

In [72]:
#dropping duplicates
scraped_df = scraped_df.drop_duplicates()

In [73]:
#dropping duplicate names in ski_resort
scraped_df = scraped_df.drop_duplicates(subset=['ski_resort'])

In [74]:
#checking value counts of resorts to make sure there aren't duplicates or duplicate names
scraped_df.ski_resort.value_counts()

Powder Ridge Minnesota    1
Bristol Mountain          1
Sugarbush                 1
Grand Targhee             1
Mont Ripley               1
                         ..
Pomerelle Mountain        1
Omni Homestead            1
Buena Vista               1
Alpine Valley Ohio        1
Powder Mountain           1
Name: ski_resort, Length: 330, dtype: int64

In [75]:
#re-checking which resorts in final_ski_df are still missing from the scraped data after the renaming
final_ski_df.loc[~final_ski_df['ski_resort'].isin(scraped_df['ski_resort']),
                         'ski_resort'].sort_values().unique()

array(['Snowriver Mountain Resort'], dtype=object)

### Additional feature engineering
It is common for avid skiers to buy ski passes that cover a group of mountains around the United States. I researched the current resort lists for four of the most common ski passes, manually edited the names to match the main dataframe, and then one-hot encoded pass membership for each resort.

I also tried fuzzy string matching (`thefuzz`) to update the names, but many resorts have very similar names, so it was not an effective way to reconcile them. The four passes are:

- Epic Pass
- Ikon Pass
- Mountain Collective
- Indy Pass
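
To illustrate why similarity scores were unreliable for the resort names, here is a small sketch. It uses `difflib` from the Python standard library as a stand-in, not the fuzzy-matching library mentioned above, and the helper name `similarity` is hypothetical.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Return a similarity ratio between 0 and 1 for two strings."""
    return SequenceMatcher(None, a, b).ratio()

# Two different resorts that share most of their characters
different = similarity("Alpine Valley Ohio", "Alpine Valley Wisconsin")

# One resort under two spellings that both appear in this notebook
same = similarity("Ober Mountain", "Ober Gatlinburg Ski")

print(f"different resorts: {different:.2f}, same resort: {same:.2f}")
```

In this pair of examples the two distinct resorts score higher than the two spellings of one resort, which is the failure mode that makes a simple score threshold risky here.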

In [76]:
epic_list = ["Stowe Mountain", "Okemo Mountain", "Hunter Mountain", "Mt. Snow", "Mt. Sunapee","Wildcat Mountain","Seven Springs",
             "Attitash", "Jack Frost", "Crotched Mountain", "Laurel Mountain",
             "Roundtop Mountain", "Whitetail", "Liberty", "Big Boulder", "Heavenly Mountain", "Northstar California",
             "Kirkwood", "Stevens Pass", "Keystone", "Breckenridge","Vail", "Park City Mountain", "Beaver Creek",
             "Crested Butte Mountain", "Afton Alps", "Alpine Valley Ohio", "Boston Mills and Brandywine","Hidden Valley",
             "Mad River Mountain", "Mt. Brighton", "Paoli Peaks", "Snow Creek","Wilmot Mountain"]

mtn_col_list = ['Arapahoe Basin', 'Aspen Snowmass', 'Jackson Hole', 'Mammoth Mountain', 'Snowbird',
                            'Palisades Tahoe', 'Sugarbush','Taos Ski Valley', 'Alta', 'Big Sky', 'Sugar Bowl',
                           'Sugarloaf', 'Sun Valley', 'Grand Targhee', 'Snowbasin']

ikon_list = ['Palisades Tahoe', 'Mammoth Mountain', 'June Mountain', 'Bear Mountain', 'Snow Summit','Snow Valley',
    'Sun Valley', 'Dollar Mountain', 'Crystal Mountain Washington', 'Alpental', 'Summit at Snoqualmie',
    'Mt. Bachelor', 'Schweitzer', 'Alyeska', 'Aspen Snowmass','Buttermilk', 'Steamboat', 'Winter Park',
    'Copper Mountain', 'Arapahoe Basin', 'Eldora Mountain', 'Jackson Hole', 'Big Sky',
    'Taos Ski Valley','Deer Valley', 'Solitude Mountain','Brighton','Alta', 'Snowbird',
    'Snowbasin','Boyne Highlands', 'Boyne Mountain', 'Stratton Mountain', 'Sugarbush', 'Killington', 'Pico Mountain',
    'Windham Mountain', 'Snowshoe Mountain', 'Sunday River','Sugarloaf','Loon Mountain']


indy_list = ['Eaglecrest', 'Ski China Peak', 'Mt. Shasta Board Ski Park', 'Mountain High', 'Dodge Ridge',
    'Hoodoo Ski Area', 'Mt. Ashland', 'Mt. Hood Meadows', '49 Degrees North',
    'Hurricane Ridge Ski & Snowboard Area', 'Mission Ridge', 'Bluewood', 'White Pass',
    'Castle Mountain Resort', 'Sunrise Park', 'Echo Mountain', 'Granby Ranch',
    'Sunlight', 'Brundage Mountain', 'Kelly Canyon', 'Pomerelle Mountain', 'Silver Mountain', 'Soldier Mountain',
    'Tamarack', 'Blacktail Mountain', 'Mountain', 'Red Lodge Mountain', 'Beaver Mountain',
    'Powder Mountain', 'Antelope Butte', 'Snow King', 'White Pine Ski Resort',
    'Seven Oaks', 'Sundown Mountain', 'Big Powderhorn Mountain', 'Caberfae Peaks Ski Golf',
    'Crystal Mountain Michigan', 'Marquette Mountain', 'Nubs Nob', 'Pine Mountain',
    'Schuss Mountain', 'Swiss Valley', 'Treetops Ski Resort',
    'Buck Hill', 'Detroit Mountain Recreation Area', 'Lutsen Mountains', 'Mount Mankato',
    'Powder Ridge Minnesota', 'Spirit Mountain', 'Terry Peak', 'Granite Peak',
    'Little Switzerland', 'Nordic Mountain', 'The Rock Snowpark', 'Trollhaugen',
    'Tyrol Basin', 'Mohawk Mountain', 'BigRockMountain',
    'Rangeley Lakes Trail Center', 'Saddleback Mountain', 'Berkshire East',
    'Black Mountain', 'Cannon Mountain', 'Pats Peak', 'Waterville Valley',
    'Catamount Ski Ride Area', 'Greek Peak', 'Peekn Peak', 'Snow Ridge', 'Swain Resort',
    'Titus Mountain', 'West Mountain', 'Catamount Outdoor Family Center', 'Bolton Valley',
    'Jay Peak', 'Magic Mountain Vermont', 'Saskadena Six', 'Cataloochee',
    'Blue Knob', 'Montage Mountain', 'Shawnee Mountain', 'Ski Sawmill',
    'Tussey Mountain', 'Ober Gatlinburg Ski', 'Bryce', 'Massanutten',
    'Canaan Valley', 'Winterplace Ski']

In [77]:
#sanity check to see if any of the names are wrong

test_list = ikon_list + epic_list + mtn_col_list

missing_values = [value for value in test_list if value not in scraped_df['ski_resort'].values]

# Print the missing values
print(missing_values)

['Dollar Mountain', 'Buttermilk', 'Laurel Mountain']


In [78]:
#sanity check to see if any of the names are wrong

missing_values = [value for value in indy_list if value not in scraped_df['ski_resort'].values]

# Print the missing values
print(missing_values)

['Hoodoo Ski Area', 'Hurricane Ridge Ski & Snowboard Area', 'Castle Mountain Resort', 'Echo Mountain', 'Sunlight', 'Mountain', 'Red Lodge Mountain', 'Antelope Butte', 'White Pine Ski Resort', 'Treetops Ski Resort', 'Detroit Mountain Recreation Area', 'Mount Mankato', 'The Rock Snowpark', 'BigRockMountain', 'Rangeley Lakes Trail Center', 'Saddleback Mountain', 'Swain Resort', 'Catamount Outdoor Family Center']


In [79]:
# making new column with 0
scraped_df['epic'] = 0
scraped_df['mountain_collective'] = 0
scraped_df['ikon'] = 0
scraped_df['indy'] = 0

# adding 1 for each row value based on the ski resort pass lists
scraped_df.loc[scraped_df['ski_resort'].isin(epic_list), 'epic'] = 1
scraped_df.loc[scraped_df['ski_resort'].isin(mtn_col_list), 'mountain_collective'] = 1
scraped_df.loc[scraped_df['ski_resort'].isin(ikon_list), 'ikon'] = 1
scraped_df.loc[scraped_df['ski_resort'].isin(indy_list), 'indy'] = 1
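
The four flag columns above can also be built in one pass. The sketch below is an equivalent formulation on toy data, assuming the pass lists are available as Python lists; `toy` and `pass_lists` are hypothetical and the lists are shortened for illustration.

```python
import pandas as pd

toy = pd.DataFrame({"ski_resort": ["Vail", "Alta", "Jay Peak", "Bristol Mountain"]})

# Shortened stand-ins for the full pass lists defined earlier
pass_lists = {
    "epic": ["Vail"],
    "mountain_collective": ["Alta"],
    "ikon": ["Alta"],
    "indy": ["Jay Peak"],
}

# isin() returns booleans; astype(int) turns them into 0/1 flags in one step
for column, resorts in pass_lists.items():
    toy[column] = toy["ski_resort"].isin(resorts).astype(int)

print(toy)
```

A resort can carry more than one flag (as `Alta` does here), and a resort on no list gets 0 in every column.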

### Saving Airbnb scraping file
For the content-based system, I will scrape Airbnb costs for each resort city. In the code below, I build a new dataframe with city, state, and ski resort information that I will use to pull listings from Airbnb.

The scraping notebook can be found in the data folder.

In [80]:
#making new dataframe
city_df = pd.DataFrame()

#saving city
city_df['city'] = scraped_df['city']

#saving state
city_df['state'] = scraped_df['state']

#saving ski resort name
city_df['ski_resort'] = scraped_df['ski_resort']

#saving 
#city_df.to_csv("data/cleaned_data_exports/location_for_scraping_v3.csv")

In [81]:
#saving city names for scraping
city_df['location'] = scraped_df[['city', 'state']].agg(', '.join, axis=1)

#saving unique cities
city_df['location'] = pd.DataFrame(city_df['location'].unique())

#dropping nulls
city_df = city_df.dropna()

#saving
#city_df.to_csv("data/cleaned_data_exports/city_names_for_scraping_v3.csv")
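
A note on the cell above: when the unique locations are assigned back as a column, pandas aligns them on the index, so the `location` in a row does not necessarily belong to that row's resort. If only the list of distinct locations is needed for scraping, a separate frame avoids the issue. This is a sketch on toy data; `toy` and `locations` are hypothetical names.

```python
import pandas as pd

toy = pd.DataFrame({
    "city": ["Norden", "Norden", "Kirkwood"],
    "state": ["California", "California", "California"],
    "ski_resort": ["Donner Ski Ranch", "Sugar Bowl", "Kirkwood"],
})

# Build 'City, State' strings, then keep one row per distinct location
locations = (
    (toy["city"] + ", " + toy["state"])
    .drop_duplicates()
    .reset_index(drop=True)
    .to_frame(name="location")
)
print(locations)
```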

In [82]:
#saving final merged dataframe for content based system
final_ski_df.to_csv("data/cleaned_data_exports/user_df_model.csv")

## Airbnb Scrape Cleaning

Importing the .csv files that I scraped from Airbnb. They contain prices of Airbnb listings pulled from the first two pages of results (32 results) for stays with a maximum of **2 guests** and a maximum of **4 guests**. I pulled this information to give cost estimates to people planning ski trips, since high lodging and ticket costs are a barrier to entry.

In [83]:
jan_2_airbnb_mean_final

Unnamed: 0.1,Unnamed: 0,count,mean,std,min,25%,50%,75%,max,ski_resort
0,0,35.0,240.742857,108.573191,100.0,168.50,215.0,297.00,585.0,Magic Mountain
1,1,35.0,140.800000,66.825409,73.0,100.00,120.0,168.00,409.0,49 Degrees North
2,2,32.0,187.156250,138.082049,61.0,121.00,157.5,205.50,842.0,Afton Alps
3,3,35.0,264.857143,143.587879,99.0,173.50,228.0,338.00,833.0,Alpental
4,4,34.0,158.676471,61.121201,64.0,108.75,144.0,193.75,296.0,Alpine Valley Ohio
...,...,...,...,...,...,...,...,...,...,...
326,326,35.0,189.085714,90.782342,71.0,140.00,177.0,222.50,510.0,Badger Pass
327,327,35.0,157.171429,49.416290,84.0,123.50,151.0,183.00,324.0,Mt. Bachelor
328,328,35.0,221.285714,136.107996,80.0,142.00,190.0,250.00,659.0,Mt. Bohemia
329,329,34.0,165.264706,60.058209,67.0,132.00,147.5,188.25,357.0,Shawnee Mountain


In [84]:
jan_2_airbnb_mean_final.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 331 entries, 0 to 330
Data columns (total 10 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Unnamed: 0  331 non-null    int64  
 1   count       331 non-null    float64
 2   mean        331 non-null    float64
 3   std         331 non-null    float64
 4   min         331 non-null    float64
 5   25%         331 non-null    float64
 6   50%         331 non-null    float64
 7   75%         331 non-null    float64
 8   max         331 non-null    float64
 9   ski_resort  331 non-null    object 
dtypes: float64(8), int64(1), object(1)
memory usage: 26.0+ KB


In [85]:
#making lists of all dataframes from airbnb scrape
two_guest_list = [dec_2_airbnb_mean_final,jan_2_airbnb_mean_final,feb_2_airbnb_mean_final,
                  mar_2_airbnb_mean_final, apr_2_airbnb_mean_final,
                  may_2_airbnb_mean_final]

four_guest_list = [dec_4_airbnb_mean_final,jan_4_airbnb_mean_final,feb_4_airbnb_mean_final,
                  mar_4_airbnb_mean_final, apr_4_airbnb_mean_final,
                  may_4_airbnb_mean_final]

month_list = ['dec', 'jan', 'feb', 'mar', 'apr', 'may']

for x, y, z in zip(two_guest_list, four_guest_list, month_list):
    
    dict_2 = {'mean': z + '_mean_2_guests',
              'min': z + '_min_2_guests','max': z + '_max_2_guests'}

    dict_4 = {'mean': z +'_mean_4_guests',
              'min': z +'_min_4_guests','max': z +'_max_4_guests'}
    
    #renaming columns based on month and guest count
    x.rename(columns=dict_2, inplace=True)
    y.rename(columns=dict_4, inplace=True)
    
    #dropping summary statistics and index columns that are not needed
    x.drop(columns=['count', 'std', '25%', '50%', '75%', 'Unnamed: 0'], inplace=True)
    y.drop(columns=['count', 'std', '25%', '50%', '75%', 'Unnamed: 0'], inplace=True)

In [86]:
#merging all months from the list of 2 guest airbnbs 
for df in two_guest_list[1:]:
    dec_2_airbnb_mean_final = pd.merge(dec_2_airbnb_mean_final, df, on='ski_resort')

#merging all months from the list of 4 guest airbnbs
for df in four_guest_list:
    dec_2_airbnb_mean_final = pd.merge(dec_2_airbnb_mean_final, df, on='ski_resort')
    
#saving the merged result as a new dataframe
airbnb_df = dec_2_airbnb_mean_final.copy()

#dropping duplicates
airbnb_df = airbnb_df.drop_duplicates()
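
The repeated merges above can be written as a single fold over the list of frames. The sketch below shows the idea with `functools.reduce` on three toy frames; the frame names and price values are made up for illustration.

```python
from functools import reduce

import pandas as pd

dec = pd.DataFrame({"ski_resort": ["Vail", "Alta"], "dec_mean_2_guests": [300.0, 250.0]})
jan = pd.DataFrame({"ski_resort": ["Vail", "Alta"], "jan_mean_2_guests": [320.0, 260.0]})
feb = pd.DataFrame({"ski_resort": ["Vail", "Alta"], "feb_mean_2_guests": [340.0, 270.0]})

# Fold the list into one frame, joining each month on the shared key
merged = reduce(lambda left, right: pd.merge(left, right, on="ski_resort"), [dec, jan, feb])
print(merged.columns.tolist())
```

Because `pd.merge` defaults to an inner join, a resort missing from any one month drops out of the result, which is also true of the loop version.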

In [87]:
#inspecting value counts
airbnb_df['ski_resort'].value_counts()

Kelly Canyon       1
Blue Hills         1
Mt. Abram Ski      1
Sugar Bowl         1
Palisades Tahoe    1
                  ..
Treetops           1
Swiss Valley       1
Giants Ridge       1
Discovery          1
Powder Mountain    1
Name: ski_resort, Length: 331, dtype: int64

In [88]:
#checking which ski resort names differ between the main feature df and the airbnb df
scraped_df.loc[~scraped_df['ski_resort'].isin(airbnb_df['ski_resort']),
                         'ski_resort'].unique()

array(['Mt. Shasta Board Ski Park', 'Crystal Mountain Washington',
       'Summit at Snoqualmie', 'Black River', 'Jackson Creek',
       'Wild Mountain Ski', 'Anthony Lakes Mountain',
       'Red Lodge Mountain.', 'Snowy Range', 'Crystal Mountain Michigan',
       'Schuss Mountain', 'Snow Summit', 'Timberline Mountain',
       'Montage Mountain', 'Big Squaw Mountain Ski', 'Hyland Ski',
       'Coffee Mill Ski Snowboard', 'Alpine Valley Michigan',
       'Brantling Ski Slopes', 'McCauley Mountain', 'Mt. Kato',
       'Mt. Holly', 'Powder Ridge Minnesota', 'Ski Snowstar',
       'Deer Mountain', 'Whaleback', 'Mt. Pleasant',
       'Powder Ridge Connecticut'], dtype=object)

In [89]:
airbnb_df.loc[airbnb_df["ski_resort"] == "Powder Mountain"]

Unnamed: 0,dec_mean_2_guests,dec_min_2_guests,dec_max_2_guests,ski_resort,jan_mean_2_guests,jan_min_2_guests,jan_max_2_guests,feb_mean_2_guests,feb_min_2_guests,feb_max_2_guests,...,feb_max_4_guests,mar_mean_4_guests,mar_min_4_guests,mar_max_4_guests,apr_mean_4_guests,apr_min_4_guests,apr_max_4_guests,may_mean_4_guests,may_min_4_guests,may_max_4_guests
213,186.314286,60.0,630.0,Powder Mountain,203.764706,65.0,750.0,224.117647,66.0,767.0,...,699.0,249.647059,66.0,699.0,264.764706,109.0,540.0,137.458333,75,325


In [90]:
#renaming resorts that have the same name
airbnb_df.replace("Crystal Mountain", "Crystal Mountain Washington", regex=True, inplace=True)

airbnb_df.loc[(airbnb_df['ski_resort'] == "Crystal Mountain Washington ") & (airbnb_df['dec_min_2_guests'] == 40), 'ski_resort'] = "Crystal Mountain Michigan"

In [91]:
airbnb_df.loc[(airbnb_df['ski_resort'] == 'Timberline') & (airbnb_df['dec_min_2_guests'] == 75.0), 'ski_resort'] = "Timberline Mountain"

#batch renaming mountains to match the user df
rename_list = ["Anthony Lakes", "Mount Holly", "Coffee ll", "Brantling Ski", 'Big Squaw',
               "McCauley Mountain Ski Center",'Hyland Ski Snowboard Area', "Mount Kato"]

rename_to = ['Anthony Lakes Mountain', 'Mt. Holly', 'Coffee Mill Ski Snowboard',
       'Brantling Ski Slopes', 'Big Squaw Mountain Ski', "McCauley Mountain", "Hyland Ski", "Mt. Kato"]

airbnb_df.replace(rename_list, rename_to, regex=True, inplace=True)

In [92]:
#renaming resorts that have the same name
airbnb_df.replace("Powder Ridge", "Powder Ridge Connecticut", regex=True, inplace=True)

In [93]:
rename_list = ["Mt. Shasta", "The Summit at Snoqualmie", 'Wild Mountain Ski Snowboard Area', 'Red Lodge Mtn.',
              'Sno Mountain', 'Ski Snowstar Winter Sports Park','Ski Mystic at Deer Mountain', "Mount Pleasant",
              'Shanty Creek', 'Red Lodge Mountain', 'Snowy Range Ski Recreation Area', "Whalebk"]

rename_to = ["Mt. Shasta Board Ski Park", "Summit at Snoqualmie", "Wild Mountain Ski", 'Red Lodge Mountain.',
            'Montage Mountain', 'Ski Snowstar', 'Deer Mountain', "Mt. Pleasant", "Schuss Mountain",
             "Red Lodge Mountain.", "Snowy Range", "Whaleback"]

airbnb_df.replace(rename_list, rename_to, regex=True, inplace=True)

In [94]:
#re-checking which ski resort names still differ between the main feature df and the airbnb df
scraped_df.loc[~scraped_df['ski_resort'].isin(airbnb_df['ski_resort']),
                         'ski_resort'].unique()

array(['Black River', 'Jackson Creek', 'Snow Summit',
       'Alpine Valley Michigan', 'Powder Ridge Minnesota'], dtype=object)

# Merging reviews with features
I will combine the cleaned dataframes from above for use in the collaborative model output.

In [203]:
# #merging final review dataframe and scraped data
# merged_df = pd.merge(final_ski_df, scraped_df, on="ski_resort", how='left')

# #dropping columns
# merged_df = merged_df.drop(columns="state_y")

# #renaming columns
# merged_df = merged_df.rename(columns={"state_x":"state"})

## Merging airbnb with feature dataframe

In [95]:
#merging airbnb and feature dataframes
content_df = pd.merge(scraped_df, airbnb_df, on="ski_resort", how="left")
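
With a left join, resorts that have no Airbnb match are kept and simply get nulls in the price columns. One way to see which rows those are is the `indicator` option of `pd.merge`. This is a sketch on toy data; the frame names and numbers are hypothetical.

```python
import pandas as pd

features = pd.DataFrame({"ski_resort": ["Vail", "Black River"], "summit": [11570, 1850]})
lodging = pd.DataFrame({"ski_resort": ["Vail"], "dec_mean_2_guests": [300.0]})

# indicator=True adds a '_merge' column: 'both' or 'left_only' for a left join
checked = pd.merge(features, lodging, on="ski_resort", how="left", indicator=True)

# Rows that found no partner in the right-hand frame
unmatched = checked.loc[checked["_merge"] == "left_only", "ski_resort"].tolist()
print(unmatched)
```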

In [96]:
content_df

Unnamed: 0,ski_resort,address,city,state,zipcode,summit,drop,base,gondolas_and_trams,fast_eight,...,feb_max_4_guests,mar_mean_4_guests,mar_min_4_guests,mar_max_4_guests,apr_mean_4_guests,apr_min_4_guests,apr_max_4_guests,may_mean_4_guests,may_min_4_guests,may_max_4_guests
0,Palisades Tahoe,PO Box 2007,Olympic Valley,California,96146,9050,2850,6200,3.0,6.0,...,1128.0,497.857143,210.0,1320.0,463.542857,178.0,1260.0,331.956522,122.0,825.0
1,Mammoth Mountain,P.O. Box 24,Mammoth Mountain Lakes,California,93546,11053,3100,7953,3.0,9.0,...,589.0,339.200000,156.0,589.0,325.411765,126.0,699.0,144.571429,64.0,246.0
2,Donner Ski Ranch,P.O. Box 66,Norden,California,95724,8012,750,7031,0.0,0.0,...,650.0,326.657143,190.0,739.0,349.257143,165.0,996.0,231.272727,83.0,643.0
3,Sugar Bowl,P.O. Box 5,Norden,California,95724,8383,1500,6883,1.0,5.0,...,590.0,349.457143,190.0,739.0,349.257143,165.0,996.0,245.590909,120.0,643.0
4,Kirkwood,PO Box 1,Kirkwood,California,95646,9800,2000,7800,0.0,2.0,...,1179.0,424.250000,146.0,1179.0,420.290323,150.0,950.0,309.200000,114.0,590.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
325,Oak Mountain,141 Novosel Way,Speculator,New York,12164,2400,650,1750,0.0,0.0,...,499.0,270.914286,100.0,769.0,285.028571,125.0,769.0,219.541667,95.0,460.0
326,Mt. Pleasant,23301 Plank Rd,Venango,Pennsylvania,16403,1540,340,1200,0.0,0.0,...,350.0,176.885714,55.0,350.0,184.942857,55.0,350.0,189.666667,76.0,400.0
327,Hunt Hollow,7532 County Road 36,Naples,New York,14512,2030,825,1000,0.0,0.0,...,800.0,314.742857,125.0,800.0,307.342857,124.0,800.0,242.541667,99.0,444.0
328,Powder Ridge Connecticut,99 Powder Hill Road,Middlefield,Connecticut,06455,720,550,170,0.0,0.0,...,898.0,225.457143,67.0,977.0,237.628571,68.0,1199.0,148.208333,73.0,250.0


### Google Geocoding API + Closest Airports

Importing the final .csv from the Google Geocoding API pull. The code for this can be found in the **scraping_ipynb** file.

Google's Geocoding API accepts a place as an address, a latitude/longitude pair, or a Place ID. Given an address, it returns latitude and longitude coordinates and a Place ID (geocoding); given coordinates or a Place ID, it returns an address (reverse geocoding).
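
As a rough sketch of the address-to-coordinates direction, the snippet below builds a request URL and reads coordinates out of a decoded JSON response. The helper names `build_geocode_url` and `extract_lat_lng` are hypothetical, `YOUR_API_KEY` is a placeholder, and the `sample` payload is a hand-written stand-in so the example runs offline. The actual scraping code lives in the separate notebook and may differ.

```python
from urllib.parse import urlencode

GEOCODE_ENDPOINT = "https://maps.googleapis.com/maps/api/geocode/json"

def build_geocode_url(address: str, api_key: str) -> str:
    """Build a Geocoding API request URL for a free-text address."""
    return f"{GEOCODE_ENDPOINT}?{urlencode({'address': address, 'key': api_key})}"

def extract_lat_lng(payload: dict):
    """Return (lat, lng) from a decoded JSON response, or None if nothing matched."""
    if payload.get("status") != "OK" or not payload.get("results"):
        return None
    location = payload["results"][0]["geometry"]["location"]
    return location["lat"], location["lng"]

# Hand-written stand-in for a response body (not a real API response)
sample = {
    "status": "OK",
    "results": [{"geometry": {"location": {"lat": 44.854416, "lng": -92.790839}}}],
}

print(build_geocode_url("6600 Peller Avenue South, Hastings, Minnesota", "YOUR_API_KEY"))
print(extract_lat_lng(sample))
```

Checking `status` before indexing into `results` keeps an address with no match from raising an error partway through a long batch.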

In [97]:
latitude_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 332 entries, 0 to 331
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Unnamed: 0    332 non-null    int64  
 1   full_address  332 non-null    object 
 2   ski_resort    332 non-null    object 
 3   latitude      332 non-null    float64
 4   longitude     332 non-null    float64
dtypes: float64(2), int64(1), object(2)
memory usage: 13.1+ KB


In [98]:
latitude_df.head()

Unnamed: 0.1,Unnamed: 0,full_address,ski_resort,latitude,longitude
0,0,"P.O. Box 166, Chewelah, Washington",49 Degrees North,48.276287,-117.715521
1,1,"6600 Peller Avenue South, Hastings, Minnesota",Afton Alps,44.854416,-92.790839
2,2,"POB 1068, Snoquale Pass, Washington",Alpental,47.392335,-121.400094
3,3,"10620 Mayfield, Chesterland, Ohio",Alpine Valley Ohio,41.526814,-81.25982
4,4,"P.O. Box 615, East Troy, Wisconsin",Alpine Valley Wisconsin,42.785292,-88.405096


In [99]:
airport_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 332 entries, 0 to 331
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Unnamed: 0  332 non-null    int64  
 1   ski_resort  332 non-null    object 
 2   airport_1   332 non-null    object 
 3   distance_1  332 non-null    float64
 4   airport_2   332 non-null    object 
 5   distance_2  332 non-null    float64
 6   airport_3   332 non-null    object 
 7   distance_3  332 non-null    float64
dtypes: float64(3), int64(1), object(4)
memory usage: 20.9+ KB


In [100]:
airport_df.head()

Unnamed: 0.1,Unnamed: 0,ski_resort,airport_1,distance_1,airport_2,distance_2,airport_3,distance_3
0,0,49 Degrees North,Colville Municipal,32.024441,Ione Municipal,52.867442,Wiley Post,2171.173215
1,1,Afton Alps,Lake Elmo,16.707929,South St.Paul Municipal,19.079077,Babelthoup/Koror,12532.478178
2,2,Alpental,Cle Elum Municipal,45.31106,Renton Municipal,62.356057,Palmdale Production Flight,1445.818249
3,3,Alpine Valley Ohio,Geauga County,18.511068,Cuyahoga County,19.327603,Rock County,653.338468
4,4,Alpine Valley Wisconsin,East Troy Municipal,2.96675,Palmyra Municipal,19.114149,Sac City Municipal,540.030304
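
The code that produced the distance columns is not shown in this notebook, so the formula and units behind them are not confirmed here. As an assumption, a common way to compute the distance between a resort and an airport from latitude and longitude is the haversine great-circle formula, sketched below; `haversine_km` is a hypothetical helper.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# One degree of longitude along the equator
print(round(haversine_km(0.0, 0.0, 0.0, 1.0), 1))
```

A quick range check on the results can flag suspicious rows, such as a "closest" airport that is thousands of units away.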


In [101]:
#dropping the unnamed index columns left over from the .csv exports
latitude_df.drop(columns="Unnamed: 0", inplace=True)
airport_df.drop(columns="Unnamed: 0", inplace=True)

In [102]:
#merging latitude df and airport df to combine with final dataframe
latitude_df = pd.merge(latitude_df, airport_df, on="ski_resort")

In [103]:
content_df.loc[~content_df['ski_resort'].isin(latitude_df['ski_resort']),
                         'ski_resort'].unique()

array(['Mt. Shasta Board Ski Park', 'Summit at Snoqualmie', 'Black River',
       'Jackson Creek', 'Wild Mountain Ski', 'Anthony Lakes Mountain',
       'Red Lodge Mountain.', 'Snowy Range', 'Schuss Mountain',
       'Snow Summit', 'Timberline Mountain', 'Montage Mountain',
       'Big Squaw Mountain Ski', 'Hyland Ski',
       'Coffee Mill Ski Snowboard', 'Alpine Valley Michigan',
       'Brantling Ski Slopes', 'McCauley Mountain', 'Mt. Kato',
       'Mt. Holly', 'Ski Snowstar', 'Deer Mountain', 'Whaleback',
       'Mt. Pleasant'], dtype=object)

In [104]:
#updating Timberline because there are multiple similar resorts
latitude_df.loc[(latitude_df['ski_resort'] == 'Timberline') & (latitude_df['full_address'] == "HC 70 Box 488, Davis, West Virginia"), 'ski_resort'] = "Timberline Mountain"

#batch renaming
lat_list = ["Anthony Lakes", "Mount Holly", "Mt. Shasta", "Coffee ll", "Brantling Ski", "Big Squaw",'McCauley Mountain Ski Center',
           'Mount Kato', 'Hyland Ski Snowboard Area']

lat_replace = ['Anthony Lakes Mountain', 'Mt. Holly','Mt. Shasta Board Ski Park',
               'Coffee Mill Ski Snowboard','Brantling Ski Slopes', 'Big Squaw Mountain Ski',
              'McCauley Mountain', 'Mt. Kato', 'Hyland Ski']

latitude_df.replace(lat_list, lat_replace, regex=True, inplace=True)

In [105]:
rename_list = ["The Summit at Snoqualmie", 'Wild Mountain Ski Snowboard Area', 'Red Lodge Mtn.',
              'Sno Mountain', 'Ski Snowstar Winter Sports Park','Ski Mystic at Deer Mountain', "Mount Pleasant",
              'Shanty Creek', 'Red Lodge Mountain', 'Snowy Range Ski Recreation Area', "Whalebk", "New Hermon Mountain"]

rename_to = ["Summit at Snoqualmie", "Wild Mountain Ski", 'Red Lodge Mountain',
            'Montage Mountain', 'Ski Snowstar', 'Deer Mountain', "Mt. Pleasant", "Schuss Mountain",
             "Red Lodge Mountain.", "Snowy Range", "Whaleback", "New Hermon Mountain"]

latitude_df.replace(rename_list, rename_to, regex=True, inplace=True)

In [106]:
#confirming all names are consistent besides the remaining values that don't exist in the other dataframe
content_df.loc[~content_df['ski_resort'].isin(latitude_df['ski_resort']),
                         'ski_resort'].unique()

array(['Black River', 'Jackson Creek', 'Red Lodge Mountain.',
       'Snow Summit', 'Alpine Valley Michigan'], dtype=object)

### Merging with feature dataframe

In [107]:
#merging latitude df with final content_df
content_df = pd.merge(content_df, latitude_df, on="ski_resort", how="left")
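
A left merge keeps every resort from the feature dataframe, so resorts with no match in the coordinate dataframe end up with missing latitude and longitude. One way to surface them, sketched here on toy data, is the `indicator` argument of `pd.merge`. The frames `features` and `coords` and the `base` value for Snow Summit are illustrative assumptions.

```python
import pandas as pd

# Toy stand-ins for content_df and latitude_df (values are illustrative only)
features = pd.DataFrame({"ski_resort": ["Kirkwood", "Snow Summit"],
                         "base": [7800, 7000]})
coords = pd.DataFrame({"ski_resort": ["Kirkwood"],
                       "latitude": [38.702308]})

# indicator=True adds a "_merge" column describing where each row came from;
# validate="many_to_one" raises if a resort name is duplicated in coords
merged = pd.merge(features, coords, on="ski_resort", how="left",
                  indicator=True, validate="many_to_one")

# Rows found only in the left frame have no coordinates
unmatched = merged.loc[merged["_merge"] == "left_only", "ski_resort"].tolist()
print(unmatched)
```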

In [112]:
content_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 329 entries, 0 to 329
Data columns (total 86 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   ski_resort                329 non-null    object 
 1   address                   329 non-null    object 
 2   city                      329 non-null    object 
 3   state                     329 non-null    object 
 4   zipcode                   329 non-null    object 
 5   sumt                      329 non-null    int64  
 6   drop                      329 non-null    int64  
 7   base                      329 non-null    int64  
 8   gondolas_and_trams        329 non-null    float64
 9   fastEight                 329 non-null    float64
 10  highSpeedSixes            329 non-null    float64
 11  quadChairs                329 non-null    float64
 12  tripleChairs              329 non-null    float64
 13  doubleChairs              329 non-null    float64
 14  surfeLifts

In [109]:
content_df.head()

Unnamed: 0,ski_resort,address,city,state,zipcode,summit,drop,base,gondolas_and_trams,fast_eight,...,may_max_4_guests,full_address,latitude,longitude,airport_1,distance_1,airport_2,distance_2,airport_3,distance_3
0,Palisades Tahoe,PO Box 2007,Olympic Valley,California,96146,9050,2850,6200,3.0,6.0,...,825.0,"PO Box 2007, Olympic Valley, California",39.19698,-120.235705,Truckee-Tahoe,15.992749,Minden-Tahoe,47.213899,Greenville Muni,3357.30661
1,Mammoth Mountain,P.O. Box 24,Mammoth Mountain Lakes,California,93546,11053,3100,7953,3.0,9.0,...,246.0,"P.O. Box 24, Mammoth Lakes, California",37.648546,-118.972079,Mammoth Yosemite,12.136117,Bryant,71.790347,Grand Rapids-Itasca County,2330.068385
2,Donner Ski Ranch,P.O. Box 66,Norden,California,95724,8012,750,7031,0.0,0.0,...,643.0,"P.O. Box 66, Norden, California",39.317356,-120.354182,Truckee-Tahoe,18.464839,Reno/Tahoe International,54.237886,Mapleton Municipal,2085.874328
3,Sugar Bowl,P.O. Box 5,Norden,California,95724,8383,1500,6883,1.0,5.0,...,643.0,"P.O. Box 5, Norden, California",39.317356,-120.354182,Truckee-Tahoe,18.464839,Reno/Tahoe International,54.237886,Mapleton Municipal,2085.874328
4,Kirkwood,PO Box 1,Kirkwood,California,95646,9800,2000,7800,0.0,2.0,...,590.0,"PO Box 1, Kirkwood, California",38.702308,-120.072244,Lake Tahoe,22.320342,Minden-Tahoe,43.275852,Nogales International,1165.403603


In [110]:
#changing monthly snowfall columns to integer type
col_list = ["nov_snow", "dec_snow", "jan_snow", "feb_snow", "mar_snow", "apr_snow"]

for x in col_list:
    content_df[x] = content_df[x].astype(int)

### Adding Additional Column

In [111]:
lift_list = ['gondolas_and_trams','fast_eight','high_speed_sixes','quad_chairs','triple_chairs','double_chairs',
             'surface_lifts']

for x in lift_list:
    content_df[x] = content_df[x].fillna(0)

In [112]:
content_df['gondolas_and_trams'] = content_df['gondolas_and_trams'].astype(float)

In [113]:
#making new column that totals the number of lifts across all lift types
content_df['total_lifts'] = (content_df['gondolas_and_trams'] + content_df['fast_eight']
                             + content_df['high_speed_sixes'] + content_df['quad_chairs']
                             + content_df['triple_chairs'] + content_df['double_chairs']
                             + content_df['surface_lifts'])

In [114]:
content_df

Unnamed: 0,ski_resort,address,city,state,zipcode,summit,drop,base,gondolas_and_trams,fast_eight,...,full_address,latitude,longitude,airport_1,distance_1,airport_2,distance_2,airport_3,distance_3,total_lifts
0,Palisades Tahoe,PO Box 2007,Olympic Valley,California,96146,9050,2850,6200,3.0,6.0,...,"PO Box 2007, Olympic Valley, California",39.196980,-120.235705,Truckee-Tahoe,15.992749,Minden-Tahoe,47.213899,Greenville Muni,3357.306610,36.0
1,Mammoth Mountain,P.O. Box 24,Mammoth Mountain Lakes,California,93546,11053,3100,7953,3.0,9.0,...,"P.O. Box 24, Mammoth Lakes, California",37.648546,-118.972079,Mammoth Yosemite,12.136117,Bryant,71.790347,Grand Rapids-Itasca County,2330.068385,25.0
2,Donner Ski Ranch,P.O. Box 66,Norden,California,95724,8012,750,7031,0.0,0.0,...,"P.O. Box 66, Norden, California",39.317356,-120.354182,Truckee-Tahoe,18.464839,Reno/Tahoe International,54.237886,Mapleton Municipal,2085.874328,8.0
3,Sugar Bowl,P.O. Box 5,Norden,California,95724,8383,1500,6883,1.0,5.0,...,"P.O. Box 5, Norden, California",39.317356,-120.354182,Truckee-Tahoe,18.464839,Reno/Tahoe International,54.237886,Mapleton Municipal,2085.874328,12.0
4,Kirkwood,PO Box 1,Kirkwood,California,95646,9800,2000,7800,0.0,2.0,...,"PO Box 1, Kirkwood, California",38.702308,-120.072244,Lake Tahoe,22.320342,Minden-Tahoe,43.275852,Nogales International,1165.403603,13.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
325,Oak Mountain,141 Novosel Way,Speculator,New York,12164,2400,650,1750,0.0,0.0,...,"141 Novosel Way, Speculator, NY",43.517884,-74.361618,Piseco Muni,14.487038,Saratoga Cty,65.825037,Hillsboro Municipal,1814.724642,4.0
326,Mt. Pleasant,23301 Plank Rd,Venango,Pennsylvania,16403,1540,340,1200,0.0,0.0,...,"23301 Plank Rd , Venango, Pa 16440",41.795526,-80.097363,Port Meadville,21.167046,Erie Intl,32.517680,Nondalton,5284.952567,2.0
327,Hunt Hollow,7532 County Road 36,Naples,New York,14512,2030,825,1000,0.0,0.0,...,"7532 County Road 36, Naples, New York",42.643014,-77.469117,Dansville Muni,21.514044,Hornell Muni,33.855765,Perry-Warsaw,48.883513,3.0
328,Powder Ridge Connecticut,99 Powder Hill Road,Middlefield,Connecticut,06455,720,550,170,0.0,0.0,...,"99 Powder Hill Road, ddlefield, Connecticut",41.501600,-72.736408,Meriden-Markham Municipal,7.790552,Chester,23.247699,Piedmont Triad International,865.684349,6.0


In [115]:
content_df.drop_duplicates(subset="ski_resort", inplace=True)
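
`drop_duplicates(subset="ski_resort")` keeps the first row for each name and silently discards the rest, which matters when two different resorts share a name. A minimal sketch for inspecting such rows before dropping them follows. The toy frame is an illustrative assumption.

```python
import pandas as pd

# Toy frame with two distinct resorts that share a name (illustrative only)
toy = pd.DataFrame({"ski_resort": ["Timberline", "Timberline", "Kirkwood"],
                    "state": ["Oregon", "West Virginia", "California"]})

# keep=False flags every row in a duplicated group, so all copies can be reviewed
dupes = toy[toy.duplicated(subset="ski_resort", keep=False)]
print(dupes)

# The default keep="first" retains only the first row of each group
deduped = toy.drop_duplicates(subset="ski_resort")
print(deduped)
```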

In [116]:
content_df['state'].unique()

array(['California', 'Nevada', 'Colorado', 'Washington', 'Utah', 'Oregon',
       'Alaska', 'Idaho', 'North Carolina', 'Wyoming', 'Arizona',
       'Minnesota', 'New Mexico', 'Montana', 'Michigan', 'Vermont',
       'Wisconsin', 'New Hampshire', 'New Jersey', 'Massachusetts',
       'Pennsylvania', 'New York', 'Iowa', 'Maine', 'West Virginia',
       'Illinois', 'Ohio', 'Virginia', 'Missouri', 'Connecticut',
       'Tennessee', 'Indiana', 'South Dakota', 'Maryland', 'Rhode Island'],
      dtype=object)

In [119]:
float_list = ['gondolas_and_trams',
 'fastEight',
 'highSpeedSixes',
 'quadChairs',
 'tripleChairs',
 'doubleChairs',
 'surfeLifts',
 'longestRun (miles)',
 'NovSnow',
 'DecSnow',
 'JanSnow',
 'FebSnow',
 'MarSnow',
 'AprSnow',
 'dec_mean_2_guests',
 'dec_min_2_guests',
 'dec_max_2_guests',
 'jan_mean_2_guests',
 'jan_min_2_guests',
 'jan_max_2_guests',
 'feb_mean_2_guests',
 'feb_min_2_guests',
 'feb_max_2_guests',
 'mar_mean_2_guests',
 'mar_min_2_guests',
 'mar_max_2_guests',
 'apr_mean_2_guests',
 'apr_min_2_guests',
 'apr_max_2_guests',
 'may_mean_2_guests',
 'may_min_2_guests',
 'may_max_2_guests',
 'dec_mean_4_guests',
 'dec_min_4_guests',
 'dec_max_4_guests',
 'jan_mean_4_guests',
 'jan_min_4_guests',
 'jan_max_4_guests',
 'feb_mean_4_guests',
 'feb_min_4_guests',
 'feb_max_4_guests',
 'mar_mean_4_guests',
 'mar_min_4_guests',
 'mar_max_4_guests',
 'apr_mean_4_guests',
 'apr_min_4_guests',
 'apr_max_4_guests',
 'may_mean_4_guests',
 'may_min_4_guests',
 'may_max_4_guests',
 'total_lifts']

In [120]:
for x in float_list:
    content_df[x] = content_df[x].fillna(0).astype(int)
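
Filling missing values with 0 before casting makes a missing price indistinguishable from a price of zero, which can pull averages down. As a possible alternative, pandas' nullable `Int64` dtype keeps the gap visible. The sketch below uses made-up prices, and `toy`, `filled`, and `nullable` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Toy price column with one missing value (numbers are made up)
toy = pd.DataFrame({"dec_min_2_guests": [120.0, np.nan, 95.0]})

# Approach used in the cell above: the missing price becomes 0
filled = toy["dec_min_2_guests"].fillna(0).astype(int)

# Nullable integer dtype: the missing price stays missing (<NA>)
nullable = toy["dec_min_2_guests"].astype("Int64")

print(filled.tolist())
print(nullable.isna().sum())
```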

In [122]:
content_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 329 entries, 0 to 329
Data columns (total 86 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   ski_resort                329 non-null    object 
 1   address                   329 non-null    object 
 2   city                      329 non-null    object 
 3   state                     329 non-null    object 
 4   zipcode                   329 non-null    object 
 5   sumt                      329 non-null    int64  
 6   drop                      329 non-null    int64  
 7   base                      329 non-null    int64  
 8   gondolas_and_trams        329 non-null    int64  
 9   fastEight                 329 non-null    int64  
 10  highSpeedSixes            329 non-null    int64  
 11  quadChairs                329 non-null    int64  
 12  tripleChairs              329 non-null    int64  
 13  doubleChairs              329 non-null    int64  
 14  surfeLifts

In [123]:
#saving final cleaned scraped df
content_df.to_csv("data/cleaned_data_exports/scraped_feature_df_3.csv")
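
When a dataframe is written with the default `to_csv` settings, the index is saved as an unnamed first column, and `pd.read_csv` typically brings it back as `Unnamed: 0`. The sketch below shows the round trip with and without `index=False`, using an in-memory buffer instead of a file. The two lift totals are copied from the output above and the variable names are hypothetical.

```python
import io
import pandas as pd

# Two rows taken from the total_lifts output above
toy = pd.DataFrame({"ski_resort": ["Kirkwood", "Sugar Bowl"],
                    "total_lifts": [13, 12]})

# Default export: the index is written as an extra, unnamed column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# index=False leaves the index out of the file
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_clean = pd.read_csv(without_index)

print(list(reloaded_with_index.columns))
print(list(reloaded_clean.columns))
```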

## Feature Analysis

### Ratings by resort distribution

In [281]:
#looking at the most reviewed mountains
top_5_reviewed_resorts = pd.DataFrame(final_ski_df['ski_resort'].value_counts().reset_index()).head(5)
top_5_reviewed_resorts

Unnamed: 0,index,ski_resort
0,Ski Brule,74
1,Killington,53
2,Vail,51
3,Breckenridge,49
4,Snowbird,36


In [282]:
#using plotly to plot the most reviewed resorts
fig = px.bar(top_5_reviewed_resorts, x="index", y="ski_resort")
fig.update_layout(title_text='Most Reviewed Mountains',
                  title_x=0.5,
                  xaxis_title="Resort",
                  yaxis_title="Review Count",
                 plot_bgcolor='white')
fig.update_traces(marker_color = "#00b5ff")
fig.show()

### Rating distribution

There is an imbalance in the rating distribution; however, the breakdown does not follow the typical user-bias pattern, in which ratings cluster at the high or low end of the scale. This imbalance will most likely affect model performance, but we can choose an algorithm suited to this type of data.

In [283]:
#making dataframe of rating counts to compare distribution of ratings
top_ratings = pd.DataFrame(final_ski_df["rating"].value_counts(ascending=False).head(15))
top_ratings = top_ratings.reset_index()
top_ratings = top_ratings.rename(columns={"rating":"rating_count"})
top_ratings = top_ratings.rename(columns={"index":"rating"})

#making rating a string for plotting
top_ratings['rating'] = top_ratings['rating'].astype(str)

# Calculate the percentage of each rating count
top_ratings['rating_percentage'] = (top_ratings['rating_count'] / top_ratings['rating_count'].sum()) * 100
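
The same percentage table can also be built in one step with `value_counts(normalize=True)`, which returns shares instead of counts. This is a minimal sketch on a made-up series of ratings. `ratings` and `rating_pct` are hypothetical names.

```python
import pandas as pd

# Made-up ratings standing in for final_ski_df["rating"]
ratings = pd.Series([5, 5, 4, 3, 5, 4], name="rating")

# normalize=True returns each rating's share of all reviews (0 to 1)
rating_pct = (ratings.value_counts(normalize=True)
                     .mul(100)                     # convert shares to percentages
                     .rename("rating_percentage")  # name of the value column
                     .rename_axis("rating")        # name of the index column
                     .reset_index())

print(rating_pct)
```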

In [284]:
#using plotly to plot the rating distribution
fig = px.bar(top_ratings, x="rating", y="rating_percentage",
             text="rating_percentage")
fig.update_layout(title_text='Rating Distribution',
                  title_x=0.5,
                  xaxis_title="Rating",
                  yaxis_title="Rating %",
                 plot_bgcolor='white',
                 font=dict(size=14))
fig.update_traces(marker_color = "#00b5ff", texttemplate='%{text:.1s}%', textposition='outside')

fig.show()

### Monthly Snowfall

In [285]:
snow_list = ["nov_snow", "dec_snow", "jan_snow", "feb_snow", "mar_snow", "apr_snow"]
snow_names = ['November', 'December', 'January', 'February', 'March', 'April']

monthly_mean = content_df[snow_list].mean(skipna=True)

monthly_snowfall = pd.DataFrame({'month': snow_names, 'mean_snowfall': monthly_mean.values})

In [286]:
monthly_snowfall

Unnamed: 0,month,mean_snowfall
0,November,5.918182
1,December,25.290909
2,January,28.787879
3,February,29.463636
4,March,21.948485
5,April,5.775758


In [287]:
#using plotly to plot the average monthly snowfall
fig = px.bar(monthly_snowfall, x="month", y="mean_snowfall")
fig.update_layout(title_text='2022 US Average Snowfall',
                  title_x=0.5,
                  xaxis_title="Month",
                  yaxis_title="Snow (in)",
                 plot_bgcolor='white',
                 font=dict(size=14))
fig.update_traces(marker_color = "#00b5ff",textposition='outside')

fig.show()

### Airbnb Prices

In [288]:
#using plotly to plot December Airbnb prices for the first five resorts
fig = px.bar(content_df.head(), x="ski_resort", y=["dec_min_2_guests", "dec_min_4_guests"],
            width=1000, height=500)
fig.update_layout(title_text='December Airbnb Costs',
                  title_x=0.5,
                  xaxis_title="Ski Resort",
                  yaxis_title="Nightly Price ($)",
                 plot_bgcolor='white',
                 font=dict(size=14),
                 barmode='group')

newnames = {'dec_min_2_guests':'2 Guest Max', 'dec_min_4_guests': '4 Guest Max'}
fig.for_each_trace(lambda t: t.update(name = newnames[t.name],
                                      legendgroup = newnames[t.name],
                                      hovertemplate = t.hovertemplate.replace(t.name, newnames[t.name])
                                     )
                  )

fig.update_traces(textposition='outside')               
              
fig.show()

### Elevation graph

In [289]:
def mountain_elevation(resort_name):

    resort_df = content_df.loc[content_df['ski_resort'] == resort_name]
    
    # elevation
    base_elevation = resort_df['base'].values[0]
    summit_elevation = resort_df['summit'].values[0]

    # tracing elevation
    elev_trace = go.Scatter(x=["base", "summit", "drop"], y=[base_elevation, summit_elevation, base_elevation], mode='lines', line=dict(color='blue'))

    # displaying plot
    layout = go.Layout(
        title='Elevation Change',
        yaxis=dict(title='Elevation'),
        plot_bgcolor='white',
        showlegend=False
    )
    
    # making figure
    fig = go.Figure(data=[elev_trace], layout=layout)

    # Showing the line plot
    fig.show()

In [290]:
mountain_elevation("Arapahoe Basin")
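
One possible sanity check on the elevation columns, assuming `drop` is meant to be the vertical drop (summit minus base), which the notebook does not state explicitly. The three rows are copied from the `content_df.head()` output above. `elev_check`, `computed_drop`, and `mismatched` are hypothetical names.

```python
import pandas as pd

# Rows copied from the content_df.head() output above
elev_check = pd.DataFrame({
    "ski_resort": ["Palisades Tahoe", "Donner Ski Ranch", "Kirkwood"],
    "summit": [9050, 8012, 9800],
    "drop": [2850, 750, 2000],
    "base": [6200, 7031, 7800],
})

# Assumption: vertical drop should equal summit elevation minus base elevation
elev_check["computed_drop"] = elev_check["summit"] - elev_check["base"]

# Resorts whose reported drop differs from the computed value
mismatched = elev_check.loc[elev_check["computed_drop"] != elev_check["drop"],
                            "ski_resort"].tolist()
print(mismatched)
```

A mismatch does not necessarily mean the data is wrong, since a resort may report the drop of its lift-served terrain only, but it flags rows worth a second look.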

# Conclusion

#### Review Dataset
After cleaning and analyzing the data, there are **662 users, 275 resorts, and 2795 total reviews**. The reviews are imbalanced; however, the final recommendation system will be a hybrid-cascade model, which should help balance out the results.

In [291]:
unique_users = len(final_ski_df['user_name'].unique())
unique_resorts = len(final_ski_df['ski_resort'].unique())
total_reviews = len(final_ski_df)

print("Number of unique users:", unique_users)
print("Number of unique resorts:", unique_resorts)
print("Number of reviews:", total_reviews)

Number of unique users: 662
Number of unique resorts: 275
Number of reviews: 2795


#### Feature Dataset
After cleaning the scraped data from OnTheSnow, Google Geocoding API, and Airbnb, there are **329 resorts** and **86 columns** in the final dataframe.

In [292]:
unique_resorts = len(content_df['ski_resort'].unique())
unique_features = len(content_df.columns)

print("Number of resorts:", unique_resorts)
print("Number of features:", unique_features)

Number of resorts: 330
Number of features: 99


# Next Steps

The next step is to begin modeling to create the recommendation system. The two main dataframes from this notebook that will be used are listed below:

- **Collaborative Modeling** - cleaned_data_exports/user_df_model.csv
- **Content/Cascade Hybrid Modeling** - cleaned_data_exports/scraped_feature_df.csv

The collaborative model will be saved in a separate notebook from the final content-based and hybrid models.