# Avant Ski

by: Stephanie Ciaccia

# Overview

Skiing holds a prominent place for those seeking winter recreational activities in the United States. With its stunning mountain ranges and diverse terrain, the country boasts numerous ski resorts that cater to all skill levels, from beginners to seasoned professionals. 

Skiing offers a unique blend of adventure, physical activity, and natural beauty, making it a popular choice for winter enthusiasts seeking both relaxation and excitement.

The ski market in the United States is thriving, contributing significantly to the economy. According to the [National Ski Areas Association (NSAA)](https://nsaa.org/webdocs/Media_Public/IndustryStats/Historical_Skier_Days_1979_2022.pdf), approximately 60.7 million skiers and snowboarders visited 473 ski resorts in the 2021-2022 winter season.

# Business Problem

Skiing is an exhilarating winter activity enjoyed by many, but barriers such as high costs and limited accessibility often hinder people from fully experiencing its joys. Choosing the right ski resort can be overwhelming due to the multitude of options available, and existing websites lack dynamic filtering capabilities based on user preferences.

To address these challenges, I'm developing Avant Ski, a ski resort recommendation app. Avant Ski simplifies the ski resort selection process by leveraging data and user preferences. With dynamic filtering features, users can personalize their search based on budget, location, amenities, and skill level. By bridging the gap between ski enthusiasts and their dream destinations, Avant Ski makes skiing accessible to a wider audience, empowering them to plan unforgettable ski trips with confidence.

# Data Understanding

In [1]:
import pandas as pd
import numpy as np
import math
from datetime import datetime
import datetime
from scipy import stats

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import plotly
import plotly.express as px
import plotly.io as pio
from matplotlib.ticker import StrMethodFormatter
import plotly.graph_objects as go

from surprise import Dataset, Reader, accuracy
from surprise.prediction_algorithms import KNNWithMeans, KNNBasic, KNNBaseline, KNNWithZScore, SVD, SVDpp, NMF, BaselineOnly, NormalPredictor
from surprise.model_selection import GridSearchCV, cross_validate, train_test_split

from collections import Counter
from nltk.corpus import stopwords

from IPython.display import Image, display

import glob
import os

Function to print full rows

In [2]:
def print_full(x):
    pd.set_option('display.max_rows', len(x))
    print(x)
    pd.reset_option('display.max_rows')

# Importing Data Files

In [3]:
snow_df = pd.read_csv("data/OnTheSnow_SkiAreaReviews_clean.csv")
survey_df = pd.read_csv("data/usa_ski_resort_survey.csv")
scraped_df = pd.read_csv("data/onthesnow_scrape_170523_cleaned.csv")
second_scrape = pd.read_csv("data/OnTheSnow_Srape_2_200523_cleaned.csv")

In [5]:
#airbnb scrape four guest listings
dec_4_airbnb_mean_final = pd.read_csv("data/scraped_cleaned/dec_4_airbnb_mean_final.csv")
jan_4_airbnb_mean_final = pd.read_csv("data/scraped_cleaned/jan_4_airbnb_mean_final.csv")
feb_4_airbnb_mean_final = pd.read_csv("data/scraped_cleaned/feb_4_airbnb_mean_final.csv")
mar_4_airbnb_mean_final = pd.read_csv("data/scraped_cleaned/mar_4_airbnb_mean_final.csv")
apr_4_airbnb_mean_final = pd.read_csv("data/scraped_cleaned/apr_4_airbnb_mean_final.csv")
may_4_airbnb_mean_final = pd.read_csv("data/scraped_cleaned/may_4_airbnb_mean_final.csv")

#airbnb scrape two guest listings
dec_2_airbnb_mean_final = pd.read_csv("data/scraped_cleaned/dec_2_airbnb_mean_final.csv")
jan_2_airbnb_mean_final = pd.read_csv("data/scraped_cleaned/jan_2_airbnb_mean_final.csv")
feb_2_airbnb_mean_final = pd.read_csv("data/scraped_cleaned/feb_2_airbnb_mean_final.csv")
mar_2_airbnb_mean_final = pd.read_csv("data/scraped_cleaned/mar_2_airbnb_mean_final.csv")
apr_2_airbnb_mean_final = pd.read_csv("data/scraped_cleaned/apr_2_airbnb_mean_final.csv")
may_2_airbnb_mean_final = pd.read_csv("data/scraped_cleaned/may_2_airbnb_mean_final.csv")

In [None]:
#google geocoding api
latitude_df = pd.read_csv("data/mountain_lat_long.csv")

### Data Source #1 - OnTheSnow (Kaggle)
### User Based Filtering Dataset

The main dataset for the user-based collaborative model was pulled from [Kaggle](https://www.kaggle.com/datasets/fredkellner/onthesnow-ski-area-reviews). It contains reviews scraped from OnTheSnow, a leading website that provides information about ski resorts and snow conditions.

There are 18,128 reviews from 291 ski resorts in the USA. The features include:

- Ski Area
- Reviewer Name 
- Review Date
- Review Star Rating (out of 5)

In [None]:
snow_df.info()

In [None]:
snow_df.head()

In [None]:
#renaming columns
new_name = ['state', 'ski_resort', 'user_name','review_date', 'rating',
           'review'] 

snow_df.columns = new_name

In [None]:
snow_df

In [None]:
survey_df['user_name'].unique()

In [None]:
survey_df.info()

In [None]:
snow_df['review_date'] = pd.to_datetime(snow_df['review_date'])
survey_df['review_date'] = pd.to_datetime(survey_df['review_date'])

In [None]:
snow_df["ski_resort"].value_counts()

In [None]:
snow_df['user_name'].value_counts().head(30)

To clean the review dataset, I had to drop reviews from users whose names did not identify a single person (for example, "anonymous" or a bare first name shared by many reviewers). I parsed through the dataset and kept dropping rows until only unique usernames or users with first and last names were left.

In [None]:
drop_list = ["anonymous_user", "anonymous","undefined undefined","Mike", 
             "Ben", "Ryan", "Richard", "Dan", "David", "Chris", "Rob", "Jeff",
            "Derek", "Brian", "Matt", "Michael", "iPhone", "Kevin", "Nick",
            "Jim", "Steve", "Jason", "Mark", "Joe", "Paul", "Justin", "Scott",
            "Bob", "Alex", "Carter", "Dave", "Tim", "Bill", "Andrew", "John", "Sam",
            "James", "Kim", "Craig", "mike", "jason", "James", "Sam", "Kim", "mike", "peter",
            "Jack", "Adam", "Tom", "Wes", "Jun", "Steven", "Max", "Matthew", "Laura", "Felipe",
            "Greg", "Bryan", "Sarah", "Sara", "Christian", "Ray", "Connor", "Erin", "Emily",
            "Luke", "Ed", "Patrick", "kyle", "Ken", "Linda", "Eric", "Aaron", "Jake",
            "Josh", "Tony", "Abe", "Frank", "Peter", "Fred", "Arthur", "Lorraine",
            "Phil", "Sean", "Will", "Julie", "Jon", "Amy", "Becky", "Shannon", "brendan",
            "Kathy", "wayne", "Ethan", "Erika", "Jill", "Zoe", "Rick", "Wyatt",
            "Tyler", "Andrea", "mark", "john", "Donna", "Jen", "Braden", "D", "Bryce",
            "Rich", "Jared", "Jay", "Ann", "Brandon", "Nicholas","Martin",
            'Robert', 'angelino','Anonymous',
             'ty', 'jase', 'Jesse', 'Jennifer', 'Dustin', 'Natalie',
             'Pat', 'anonymous user', 'matt', 'George', 'Kate',
             'Daniel','Cindy', 'Barry', 'Todd', 'Melanie', 'Drew',
             'Andy', 'Hochard','Wayne', 'dan',
             'Charlie', 'Vanessa','Allen', 'Austin', 'Roger',
             'Jerry', 'Scotty', 'Anon', 'Lucas', 'Brian', 'Lee', 'Taylor',
            'brian', 'Lisa', 'Jade', 'Spencer', 'chris', 'Jenny', 'Amanda', 'Brett',
            'Maria', 'Holly', 'iPad', 'Sylvia', 'iPhone (2)', 'Catherine', 'Hannah', 'Wade',
            'Larry', 'Lauren','Noah', 'Bobby', 'Don', 'Christine', 'Stephen', 'Howard',
             'Tanner', 'Tom', 'Casey', 'Kyle', 'Michelle', 'Shelby',
             'Benjamin', 'Erik', 'Molly', 'Johnny', 'Chuck', 'Johnny',
             'Nathan', 'Cathy', 'Shelley', 'Mary', 'Danny', 'mitch', 'Brad', 'Tammy', 'erik',
            'Tricia', 'Nate', 'Pete']

snow_df = snow_df[~snow_df['user_name'].isin(drop_list)]

In [None]:
snow_df['user_name'].value_counts().head(70)

After cleaning the usernames, I further narrow down the number of users by only including users with at least 3 reviews.

In [None]:
# counting the number of reviews for each user
value_counts = snow_df['user_name'].value_counts()

# selecting only users with at least three reviews
selected_users = value_counts[value_counts > 2].index

# selecting only the rows where the user_name is in the selected_users list
cleaned_snow = snow_df[snow_df['user_name'].isin(selected_users)]

In [None]:
cleaned_snow['user_name'].value_counts(ascending=True)

Removing users with fewer than 3 reviews dropped the number of rows/final reviews to 2200.
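
The review-count filter above can also be written as a small reusable function. This is a minimal sketch rather than the notebook's exact code: the function name `keep_active_users` and the toy data are assumptions made for illustration, and the threshold is exposed as a parameter so the cutoff is explicit.

```python
import pandas as pd

def keep_active_users(reviews: pd.DataFrame, min_reviews: int = 3) -> pd.DataFrame:
    """Keep only rows whose user has at least `min_reviews` reviews."""
    # Map each row's user_name to that user's total review count.
    counts = reviews["user_name"].map(reviews["user_name"].value_counts())
    return reviews[counts >= min_reviews]

# Toy data (hypothetical users, not taken from the real dataset).
toy = pd.DataFrame({
    "user_name": ["ann_s", "ann_s", "ann_s", "bo_k", "bo_k", "cy_r"],
    "ski_resort": ["Alta", "Vail", "Stowe", "Alta", "Vail", "Alta"],
    "rating": [5, 4, 3, 4, 4, 2],
})

active = keep_active_users(toy, min_reviews=3)
print(active["user_name"].unique())
```

Passing `min_reviews` explicitly avoids the off-by-one ambiguity between "more than 3" and "at least 3".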

In [None]:
cleaned_snow.info()

In [None]:
cleaned_snow['user_name'].unique()

In [None]:
#dropping duplicate rows
cleaned_snow = cleaned_snow.drop_duplicates()

#### Ski resort name - cleaning

Since the target variable is the ski resort, I need to clean and update the resort names in all datasets so that they are consistent.

In [None]:
cleaned_snow['ski_resort'].unique()

In [None]:
#removing words to clean up resort names
replace_snow = ['-ski-area', '-', 'resort', 'mt']
replace_with = ['', ' ', '', 'mt.']

cleaned_snow = cleaned_snow.replace(replace_snow, replace_with, regex=True)

In [None]:
#making columns titlecase
cleaned_snow['ski_resort'] = cleaned_snow['ski_resort'].str.title()
cleaned_snow['state'] = cleaned_snow['state'].str.title()
cleaned_snow['ski_resort'] = cleaned_snow['ski_resort'].str.strip()

In [None]:
#replacing values to standardize endings/specific resort names
replace_snow = ['At', 'Mtn', 'Mt.N', 'Mt. Hood Ski Bowl', 'And', r'\bMount\b']
replace_with = ['at', 'Mountain', 'Mountain', 'Mt. Hood Skibowl', 'and', 'Mt.']

cleaned_snow = cleaned_snow.replace(replace_snow, replace_with, regex=True)
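
One thing to keep in mind: `DataFrame.replace(..., regex=True)` is applied to every string column, so patterns such as `'resort'` or `'-'` can also rewrite usernames and review text. A possible alternative, sketched below under the assumption that only the resort name column needs cleaning, scopes the replacements to one Series and uses word boundaries. The helper name `clean_resort_names` and the sample slugs are hypothetical.

```python
import pandas as pd

def clean_resort_names(names: pd.Series) -> pd.Series:
    """Normalize slug-style resort names in a single column."""
    return (
        names.str.replace("-ski-area", "", regex=False)   # drop the suffix
             .str.replace("-", " ", regex=False)          # hyphens to spaces
             .str.replace(r"\bresort\b", "", regex=True)  # whole word only
             .str.replace(r"\bmt\b\.?", "mt.", regex=True)  # 'mt' but not 'mtn'
             .str.replace(r"\s+", " ", regex=True)        # collapse extra spaces
             .str.strip()
             .str.title()
    )

# Hypothetical slugs in the style of the raw review data.
slugs = pd.Series(["mt-hood-ski-bowl", "snowbird-resort",
                   "loveland-ski-area", "copper-mtn"])
cleaned_names = clean_resort_names(slugs)
print(cleaned_names.tolist())
```

Because `\bmt\b` only matches the standalone token, a name ending in "mtn" is left for a later, explicit "Mtn" to "Mountain" step.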

After inspecting the resort names, I found a few resorts with identical or very similar names. I adjusted these names and included the state in the resort name to tell them apart.

In [None]:
#timberline
cleaned_snow.loc[(cleaned_snow['ski_resort'] == "Timberline Four Seasons") & (cleaned_snow['state'] == "West Virginia"), 'ski_resort'] = "Timberline Mountain"

#crystal mountain
cleaned_snow.loc[(cleaned_snow['ski_resort'] == "Crystal Mountain Wa") & (cleaned_snow['state'] == "Washington"), 'ski_resort'] = "Crystal Mountain Washington"
cleaned_snow.loc[(cleaned_snow['ski_resort'] == "Crystal Mountain") & (cleaned_snow['state'] == "Michigan"), 'ski_resort'] = "Crystal Mountain Michigan"

#magic mountain
cleaned_snow.loc[(cleaned_snow['ski_resort'] == "Magic Mountain") & (cleaned_snow['state'] == "Vermont"), 'ski_resort'] = "Magic Mountain Vermont"
cleaned_snow.loc[(cleaned_snow['ski_resort'] == "Magic Mountain") & (cleaned_snow['state'] == "Idaho"), 'ski_resort'] = "Magic Mountain Idaho"

In [None]:
mountain_rep = ['Squaw Valley Usa',
                'Mccauley Mountain Ski Center', 'attitash', 'Smugglers Notch',
               'Pico Mountain at Killington', 'andes Tower Hills']

mountain_new = ['Palisades Tahoe',
               'McCauley Mountain Ski Center', 'Attitash', "Smugglers' Notch",
               'Pico Mountain', 'Andes Tower Hills']

cleaned_snow = cleaned_snow.replace(mountain_rep, mountain_new, regex=True)

### Data Source #2 - Survey Data

A second, smaller dataset was collected through a [Google survey](https://docs.google.com/forms/d/1ROrGEkCh40RjbHidNCqg4SCCbY3_6DFNw0VWIhTEIGs/edit#responses) I distributed to individuals who ski, including myself.

I downloaded the Sheets file from Google and saved it as a .csv. A few individuals did not include their name, so I gave them unique "anon" names.

I plan to use the names of three users I know to check whether the model's results align with those users' preferences. For those three users, I also asked for a brief summary of the key characteristics they look for when choosing ski resorts to visit.

In [None]:
#making columns titlecase
survey_df['ski_resort'] = survey_df['ski_resort'].str.title()
survey_df['state'] = survey_df['state'].str.title()
survey_df['ski_resort'] = survey_df['ski_resort'].str.strip()

In [None]:
list(survey_df['ski_resort'].sort_values().unique())

Below, I manually parsed through the resort names and changed the names to match the names in the main dataframe.

In [None]:
mountain_rep = ['Crystal Mountain - Wa',"Steven'S Pass", 'Mammoth Mountain', 'Mammoth', 'Stratton',
                'Mccauley Mountain', 'Taos', 'Snowmass', 'Palisades Tahoe', 'Palisades', 'Copper Mountain', 'Copper',
                'Crested Butte', 'Mt. Rose', 'Mt Baker', 'Nordic Valley' ,'Solitude']
mountain_rep_p = ['Crystal Mountain Washington', 'Stevens Pass', 'Mammoth', 'Mammoth Mountain','Stratton Mountain',
                 'McCauley Mountain Ski Center', 'Taos Ski Valley', 'Aspen Snowmass', 'Palisades', 'Palisades Tahoe', 'Copper', 'Copper Mountain',
                 'Crested Butte Mountain', 'Mt. Rose Ski Tahoe', 'Mt. Baker', 'Nordic Mountain', 'Solitude Mountain']

survey_df = survey_df.replace(mountain_rep, mountain_rep_p, regex=True)

In [None]:
# changing aspen resorts since all four mountains are part of snowmass
mountain_r = ['Aspen Mountain', 'Aspen Highlands']

survey_df = survey_df.replace(mountain_r, 'Aspen Snowmass', regex=True)

In [None]:
#checking to see which names are different
survey_df.loc[~survey_df['ski_resort'].isin(cleaned_snow['ski_resort']),
                         'ski_resort'].unique()
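
The same "which names are missing from the other dataframe" check is repeated several times in this notebook. A small helper keeps it consistent; the sketch below is illustrative only, and the name `unmatched_names` plus the sample values are assumptions.

```python
import pandas as pd

def unmatched_names(left: pd.Series, right: pd.Series) -> list:
    """Sorted unique values of `left` that never appear in `right`."""
    return sorted(left[~left.isin(right)].dropna().unique())

# Toy example: one spelling differs between the two sources.
survey_names = pd.Series(["Alta", "Vail", "Mt Baker", "Vail"])
review_names = pd.Series(["Alta", "Vail", "Mt. Baker"])

print(unmatched_names(survey_names, review_names))
```

An empty list means every name on the left has an exact match on the right.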

### Merging survey and OnTheSnow review data

In [None]:
#merging survey review results and final onthesnow reviews
final_ski_df = pd.concat([survey_df, cleaned_snow])

In [None]:
#dropping null values
final_ski_df = final_ski_df.dropna()

In [None]:
#dropping duplicates
final_ski_df = final_ski_df.drop_duplicates()

In [None]:
final_ski_df.info()

In [None]:
#Blackjack ski and Indianhead combined to form Snowriver Mountain Resort
blackjack = ['Blackjack Ski', 'Indianhead Mountain']

final_ski_df = final_ski_df.replace(blackjack, 'Snowriver Mountain Resort', regex=True)

In [None]:
#updating names of resort names that have changed
old_name = ['Durango Mountain','Las Vegas Ski and Snowboard','Shawnee Peak', 'Suicide Six', 'Snow Summit']
new_name = ['Purgatory Mountain','Lee Canyon','Shawnee Mountain', 'Saskadena Six', 'Big Bear']

final_ski_df = final_ski_df.replace(old_name, new_name, regex=True)

In [None]:
# making dictionary for replacements to avoid doubling the names
replacements = {
    'Brandywine': 'Boston Mills and Brandywine',
    'Boston Mills': 'Boston Mills and Brandywine'
}

# replacing
final_ski_df['ski_resort'] = final_ski_df['ski_resort'].replace(replacements)

In [None]:
#updating duplicate ski resort names and saving as new resort names that include the state names
final_ski_df.loc[(final_ski_df['ski_resort'] == "Crystal Mountain") & (final_ski_df['state'] == "Washington"), 'ski_resort'] = "Crystal Mountain Washington"
final_ski_df.loc[(final_ski_df['ski_resort'] == "Crystal Mountain") & (final_ski_df['state'] == "Michigan"), 'ski_resort'] = "Crystal Mountain Michigan"
final_ski_df.loc[(final_ski_df['ski_resort'] == "Crystal Mountain ") & (final_ski_df['state'] == "Michigan"), 'ski_resort'] = "Crystal Mountain Michigan"
final_ski_df.loc[(final_ski_df['ski_resort'] == "Powder Ridge") & (final_ski_df['state'] == "Minnesota"), 'ski_resort'] = "Powder Ridge Minnesota"
final_ski_df.loc[(final_ski_df['ski_resort'] == "Powder Ridge") & (final_ski_df['state'] == "Connecticut"), 'ski_resort'] = "Powder Ridge Connecticut"

#alpine valley
final_ski_df.loc[(final_ski_df['ski_resort'] == "Alpine Valley") & (final_ski_df['state'] == "Wisconsin"), 'ski_resort'] = "Alpine Valley Wisconsin"
final_ski_df.loc[(final_ski_df['ski_resort'] == "Alpine Valley") & (final_ski_df['state'] == "Ohio"), 'ski_resort'] = "Alpine Valley Ohio"

In [None]:
#dropping Cherry Peak
final_ski_df = final_ski_df.loc[(final_ski_df['ski_resort'] != "Cherry Peak")]

In [None]:
#exporting final cleaned dataframe for OnTheSnow scrape
#final_ski_df.to_csv("cleaned_data_exports/final_review_df_final.csv")

### Data Source #3 - OnTheSnow Scrape


I scraped OnTheSnow to pull current ski resort features for the resorts in the final merged dataset. The code for this scrape was adapted from a [user on GitHub](https://github.com/SijiaLai/OnTheSnow/tree/master) and updated based on HTML changes and the features I wanted to pull.

The code for the scraper can be found in the data folder.


The main features I scraped:

- mountain elevation
- ticket price
- mountain location
- ski terrain
- snowfall averages

In [None]:
scraped_df.info()

In [None]:
scraped_df['state'].unique()

In [None]:
state_rep = ['WI  54819', 'NV 89131', 'ID 83873', 'Pa 16440', 'NY', 'CO', 'CA 96160']
state_with = ["Wisconsin", "Nevada", "Idaho", "Pennsylvania", "New York", "Colorado", "California"]

scraped_df.replace(state_rep, state_with, regex=True, inplace=True)

In [None]:
scraped_df.isna().sum().sort_values(ascending=False)

In [None]:
#dropping columns with null values that are not needed
scraped_df = scraped_df.drop(columns=['gondolas_lifts_note', 'terrainNote', 'ticketpriceNote',
                                     'senior_weekend', 'MaySnow', 'country'])

The code below replaces empty ticket prices (stored as 0) with the mean value of the column. If time allowed, I would have preferred to go through the missing values manually and replace them with prices from the individual ski resorts' websites.

In [None]:
#filling empty lift ticket prices with mean

def mean_ticket(column):
    
    #changing all column types to int
    scraped_df[column] = scraped_df[column].astype(int)
    
    # finding mean values
    mean_value = scraped_df[column].mean()
    mean_value = int(mean_value)
    
    # filling 0 with mean value
    scraped_df[column] = scraped_df[column].replace(0, mean_value)
    
    return scraped_df

In [None]:
#making a loop to loop through list of ticket prices that need to be updated

ticket_prices = ["adult_weekend", "junior_weekend", "child_weekend", "senior_weekday", "adult_weekday",
"junior_weekday", "child_weekday", "adult_season", "junior_season", "child_season"]

for ticket_val in ticket_prices:
    mean_ticket(ticket_val)
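
Note that when the mean is computed over the whole column, the zeros that stand in for missing prices pull the mean down. A possible variant, sketched here as an assumption rather than the method used above, computes the mean over non-zero prices only. The function name `fill_zero_with_nonzero_mean` and the toy prices are hypothetical.

```python
import pandas as pd

def fill_zero_with_nonzero_mean(prices: pd.Series) -> pd.Series:
    """Replace 0 placeholders with the integer mean of the non-zero prices."""
    prices = prices.astype(int)
    nonzero = prices[prices != 0]
    if nonzero.empty:
        # Nothing to estimate from; leave the column unchanged.
        return prices
    return prices.replace(0, int(nonzero.mean()))

# Toy prices (not real ticket data): two missing values stored as 0.
toy_prices = pd.Series([100, 0, 80, 0, 120])

filled = fill_zero_with_nonzero_mean(toy_prices)
mean_with_zeros = int(toy_prices.mean())  # what a whole-column mean would give
print(filled.tolist(), mean_with_zeros)
```

In this toy column the whole-column mean is 60, while the mean of the known prices is 100.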

In [None]:
#splitting the city into zipcode and city column
scraped_df[['zipcode', 'city']] = scraped_df['city'].str.split(' ', n=1, expand=True)

In [None]:
#changing location of zipcode so it is placed next to city
column_to_move = scraped_df.pop("zipcode")

# moving zipcode after state
scraped_df.insert(4, "zipcode", column_to_move )

In [None]:
#renaming columns
scraped_df = scraped_df.rename(columns={"Name":"ski_resort"})

### OnTheSnow - Scrape #2

My initial scrape did not include ski run information, so I updated my initial scraping code and pulled ski run difficulty information.

In [None]:
second_scrape.info()

In [None]:
second_scrape = second_scrape[['ski_resort','beginner_runs', 'intermediate_runs','advanced_runs', 'expert_runs']].copy()

In [None]:
#merging
scraped_df = pd.merge(scraped_df, second_scrape, on="ski_resort", how="left")

#replacing characters
scraped_df = scraped_df.replace("%", "", regex=True)

#replacing more characters
replace_vals = ['null', '-', " "]

#replacing strings 
scraped_df['expert_runs'] = scraped_df['expert_runs'].replace(replace_vals, "", regex=True)

#replacing empty cells with 0
scraped_df['expert_runs'] = scraped_df['expert_runs'].replace(r'^\s*$', 0, regex=True)

In [None]:
#filling null values with 0 as a null value indicates 0 runs

int_list = ["beginner_runs", "intermediate_runs", "advanced_runs", "expert_runs"]

for x in int_list:
    scraped_df[x] = scraped_df[x].fillna(0)

#converting values to ints
for x in int_list:
    scraped_df[x] = scraped_df[x].astype(int)
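
If any run column still contains stray strings, `astype(int)` raises an error. A more forgiving conversion can be sketched with `pd.to_numeric(..., errors="coerce")`, which turns anything unparseable into NaN before it is filled with 0. The helper name `to_percent_int` and the sample values are assumptions for illustration.

```python
import pandas as pd

def to_percent_int(col: pd.Series) -> pd.Series:
    """Strip '%' signs and convert to int, treating unparseable values as 0."""
    cleaned = col.astype(str).str.replace("%", "", regex=False).str.strip()
    return pd.to_numeric(cleaned, errors="coerce").fillna(0).astype(int)

# Toy values covering the placeholders seen in the scrape ('null', '-', missing).
raw_runs = pd.Series(["25%", "null", "-", None, "10"])
run_ints = to_percent_int(raw_runs)
print(run_ints.tolist())
```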

In [None]:
#checking to see if there are any missing resorts in the scraped information to ensure all resorts have feature information
final_ski_df.loc[~final_ski_df['ski_resort'].isin(scraped_df['ski_resort']),
                         'ski_resort'].sort_values().unique()

In [None]:
replace_from = ["Anthony Lakes", "Big Squaw","Brantling Ski", "Coffee Mill", "Mt. Shasta", "Mount Holly"]
replace_to = ["Anthony Lakes Mountain", 'Big Squaw Mountain Ski', "Brantling Ski Slopes", "Coffee Mill Ski Snowboard",
               'Mt. Shasta Board Ski Park', "Mt. Holly"]

scraped_df.replace(replace_from, replace_to, regex=True, inplace=True)

In [None]:
scraped_df.loc[(scraped_df['ski_resort'] == "Timberline") & (scraped_df['state'] == "West Virginia"), 'ski_resort'] = "Timberline Mountain"

In [None]:
#updating scraped df names to match the other dataframes
scraped_df.loc[(scraped_df['ski_resort'] == "Crystal Mountain") & (scraped_df['state'] == "Washington"), 'ski_resort'] = "Crystal Mountain Washington"
scraped_df.loc[(scraped_df['ski_resort'] == "Crystal Mountain ") & (scraped_df['state'] == "Michigan"), 'ski_resort'] = "Crystal Mountain Michigan"

#updating scraped df names to match the other dataframes
scraped_df.loc[(scraped_df['ski_resort'] == "Powder Ridge") & (scraped_df['state'] == "Minnesota"), 'ski_resort'] = "Powder Ridge Minnesota"
scraped_df.loc[(scraped_df['ski_resort'] == "Powder Ridge") & (scraped_df['state'] == "Connecticut"), 'ski_resort'] = "Powder Ridge Connecticut"

In [None]:
#dropping duplicates
scraped_df = scraped_df.drop_duplicates()

In [None]:
#dropping duplicate names in ski_resort
scraped_df = scraped_df.drop_duplicates(subset=['ski_resort'])

In [None]:
#checking value counts of resorts to make sure there aren't duplicates or duplicate names
scraped_df.ski_resort.value_counts()

In [None]:
#checking to see if there are any missing resorts in the scraped information to ensure all resorts have feature information
final_ski_df.loc[~final_ski_df['ski_resort'].isin(scraped_df['ski_resort']),
                         'ski_resort'].sort_values().unique()

### Additional feature engineering
It is common for avid skiers to purchase ski passes from companies that cover a collection of mountains around the United States. I researched the current list of mountains for four of the most common ski passes, manually went through the names to match them to the main dataframe, and then one-hot encoded pass membership for each resort.

I did attempt to use fuzzy string matching (thefuzz) to update the names; however, many resorts have very similar names, so it was not an effective way to update them.

- Epic Pass
- Ikon Pass
- Mountain Collective
- Indy Pass
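
To illustrate why fuzzy matching struggles with these names, the sketch below uses Python's standard-library `difflib` as a stand-in (it is not the fuzzy-matching library mentioned above, and the scores differ between libraries). A short name matches several state-qualified resorts almost equally well, so the top match cannot be trusted without manual review.

```python
from difflib import get_close_matches

# Resort names from the cleaned data that share a common base name.
known_resorts = [
    "Crystal Mountain Washington",
    "Crystal Mountain Michigan",
    "Magic Mountain Vermont",
    "Magic Mountain Idaho",
]

# A pass list might only say "Crystal Mountain".
candidates = get_close_matches("Crystal Mountain", known_resorts, n=3, cutoff=0.6)
print(candidates)
```

Both Crystal Mountain entries come back as candidates, which is the ambiguity that made manual matching the safer route here.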

In [None]:
epic_list = ["Stowe Mountain", "Okemo Mountain", "Hunter Mountain", "Mt. Snow", "Mt. Sunapee","Wildcat Mountain","Seven Springs",
             "Attitash", "Jack Frost", "Crotched Mountain", "Laurel Mountain",
             "Roundtop Mountain", "Whitetail", "Liberty", "Big Boulder", "Heavenly Mountain", "Northstar California",
             "Kirkwood", "Stevens Pass", "Keystone", "Breckenridge","Vail", "Park City Mountain", "Beaver Creek",
             "Crested Butte Mountain", "Afton Alps", "Alpine Valley Ohio", "Boston Mills and Brandywine","Hidden Valley",
             "Mad River Mountain", "Mt. Brighton", "Paoli Peaks", "Snow Creek","Wilmot Mountain",
            'Mt. Sunapee','Wildcat Mountain', 'Whitetail', 'Mt. Brighton', 'Wilmot Mountain']

mtn_col_list = ['Arapahoe Basin', 'Aspen Snowmass', 'Jackson Hole', 'Mammoth Mountain', 'Snowbird',
                            'Palisades Tahoe', 'Sugarbush','Taos Ski Valley', 'Alta', 'Big Sky', 'Sugar Bowl',
                           'Sugarloaf', 'Sun Valley', 'Grand Targhee', 'Snowbasin']

ikon_list = ['Palisades Tahoe', 'Mammoth Mountain', 'June Mountain', 'Bear Mountain', 'Snow Summit','Snow Valley',
    'Sun Valley', 'Dollar Mountain', 'Crystal Mountain Washington', 'Alpental', 'The Summit at Snoqualmie',
    'Mt. Bachelor', 'Schweitzer', 'Alyeska', 'Aspen Snowmass','Buttermilk', 'Steamboat', 'Winter Park',
    'Copper Mountain', 'Arapahoe Basin', 'Eldora Mountain', 'Jackson Hole', 'Big Sky',
    'Taos Ski Valley','Deer Valley', 'Solitude Mountain','Brighton','Alta', 'Snowbird',
    'Snowbasin','Boyne Highlands', 'Boyne Mountain', 'Stratton Mountain', 'Sugarbush', 'Killington', 'Pico Mountain',
    'Windham Mountain', 'Snowshoe Mountain', 'Sunday River','Sugarloaf','Loon Mountain']


indy_list = ['Eaglecrest', 'Ski China Peak', 'Mt. Shasta Board Ski Park', 'Mountain High', 'Dodge Ridge',
    'Hoodoo Ski Area', 'Mt. Ashland', 'Mt. Hood Meadows', '49 Degrees North',
    'Hurricane Ridge Ski & Snowboard Area', 'Mission Ridge', 'Bluewood', 'White Pass',
    'Castle Mountain Resort', 'Sunrise Park', 'Echo Mountain', 'Granby Ranch',
    'Sunlight', 'Brundage Mountain', 'Kelly Canyon', 'Pomerelle Mountain', 'Silver Mountain', 'Soldier Mountain',
    'Tamarack', 'Blacktail Mountain', 'Mountain', 'Red Lodge Mountain', 'Beaver Mountain',
    'Powder Mountain', 'Antelope Butte', 'Snow King', 'White Pine Ski Resort',
    'Seven Oaks', 'Sundown Mountain', 'Big Powderhorn Mountain', 'Caberfae Peaks Ski Golf',
    'Crystal Mountain Michigan', 'Marquette Mountain', 'Nubs Nob', 'Pine Mountain',
    'Schuss Mountain at Shanty Creek', 'Swiss Valley', 'Treetops Ski Resort',
    'Buck Hill', 'Detroit Mountain Recreation Area', 'Lutsen Mountains', 'Mount Mankato',
    'Powder Ridge Minnesota', 'Spirit Mountain', 'Terry Peak', 'Granite Peak',
    'Little Switzerland', 'Nordic Mountain', 'The Rock Snowpark', 'Trollhaugen',
    'Tyrol Basin', 'Mohawk Mountain', 'BigRockMountain',
    'Rangeley Lakes Trail Center', 'Saddleback Mountain', 'Berkshire East',
    'Black Mountain', 'Cannon Mountain', 'Pats Peak', 'Waterville Valley',
    'Catamount Ski Ride Area', 'Greek Peak', 'Peekn Peak', 'Snow Ridge', 'Swain Resort',
    'Titus Mountain', 'West Mountain', 'Catamount Outdoor Family Center', 'Bolton Valley',
    'Jay Peak', 'Magic Mountain Vermont', 'Saskadena Six', 'Cataloochee',
    'Blue Knob', 'Montage Mountain', 'Shawnee Mountain', 'Ski Sawmill',
    'Tussey Mountain', 'Ober Gatlinburg Ski', 'Bryce', 'Massanutten',
    'Canaan Valley', 'Winterplace Ski']

In [None]:
#sanity check to see if any of the names are wrong

test_list = ikon_list + epic_list + mtn_col_list

missing_values = [value for value in test_list if value not in scraped_df['ski_resort'].values]

# Print the missing values
print(missing_values)

In [None]:
#sanity check to see if any of the names are wrong

missing_values = [value for value in indy_list if value not in scraped_df['ski_resort'].values]

# Print the missing values
print(missing_values)

In [None]:
# making new column with 0
scraped_df['epic'] = 0
scraped_df['mountain_collective'] = 0
scraped_df['ikon'] = 0
scraped_df['indy'] = 0

# adding 1 for each row value based on the ski resort pass lists
scraped_df.loc[scraped_df['ski_resort'].isin(epic_list), 'epic'] = 1
scraped_df.loc[scraped_df['ski_resort'].isin(mtn_col_list), 'mountain_collective'] = 1
scraped_df.loc[scraped_df['ski_resort'].isin(ikon_list), 'ikon'] = 1
scraped_df.loc[scraped_df['ski_resort'].isin(indy_list), 'indy'] = 1
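
The same encoding can be expressed as a loop over a dictionary of pass lists, which avoids initializing each column to 0 first. This is a minimal sketch with shortened, hypothetical pass lists; the variable names `pass_lists` and `resorts` are assumptions.

```python
import pandas as pd

# Shortened pass lists for illustration only.
pass_lists = {
    "epic": ["Vail", "Stowe Mountain"],
    "ikon": ["Alta", "Jackson Hole"],
}

resorts = pd.DataFrame({"ski_resort": ["Vail", "Alta", "Bolton Valley"]})

# isin() gives a boolean mask; casting to int yields the 0/1 indicator column.
for pass_name, members in pass_lists.items():
    resorts[pass_name] = resorts["ski_resort"].isin(members).astype(int)

print(resorts)
```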

### Saving Airbnb scraping file
For the content-based system, I will be scraping Airbnb costs for each resort city. In the code below, I build a new dataframe with the city, state, and ski resort information that I will use to pull listings from Airbnb.

The scraping notebook can be found in the data folder.

In [None]:
#making new dataframe
city_df = pd.DataFrame()

#saving city
city_df['city'] = scraped_df['city']

#saving state
city_df['state'] = scraped_df['state']

#saving ski resort name
city_df['ski_resort'] = scraped_df['ski_resort']

#saving 
#city_df.to_csv("cleaned_data_exports/location_for_scraping_v2.csv")

In [None]:
#saving city names for scraping
city_df['location'] = scraped_df[['city', 'state']].agg(', '.join, axis=1)

#saving unique cities
city_df['location'] = pd.DataFrame(city_df['location'].unique())

#dropping nulls
city_df = city_df.dropna()

#saving
#city_df.to_csv("cleaned_data_exports/city_names_for_scraping_v2.csv")

In [None]:
#saving final merged dataframe for content based system
#merged_df.to_csv("cleaned_data_exports/user_df_model.csv")

## Airbnb Scrape Cleaning

Importing the .csv files that I scraped from Airbnb. They include prices of Airbnb listings pulled from the first two pages of results (32 results) for listings with a max of **2 guests** and a max of **4 guests**. I pulled this information to give cost information to those planning ski trips, as high lodging and ticket costs are a barrier to entry.

In [None]:
jan_2_airbnb_mean_final

In [None]:
jan_2_airbnb_mean_final.info()

In [None]:
#making lists of all dataframes from airbnb scrape
two_guest_list = [dec_2_airbnb_mean_final,jan_2_airbnb_mean_final,feb_2_airbnb_mean_final,
                  mar_2_airbnb_mean_final, apr_2_airbnb_mean_final,
                  may_2_airbnb_mean_final]

four_guest_list = [dec_4_airbnb_mean_final,jan_4_airbnb_mean_final,feb_4_airbnb_mean_final,
                  mar_4_airbnb_mean_final, apr_4_airbnb_mean_final,
                  may_4_airbnb_mean_final]

month_list = ['dec', 'jan', 'feb', 'mar', 'apr', 'may']

for x, y, z in zip(two_guest_list, four_guest_list, month_list):
    
    dict_2 = {'mean': z + '_mean_2_guests',
              'min': z + '_min_2_guests','max': z + '_max_2_guests'}

    dict_4 = {'mean': z +'_mean_4_guests',
              'min': z +'_min_4_guests','max': z +'_max_4_guests'}
    
    #renaming column names based on scrape
    x.rename(columns=dict_2, inplace=True)
    y.rename(columns=dict_4, inplace=True)
    
    #dropping summary-statistic columns that are not needed
    x.drop(columns=['count', 'std', '25%', '50%', '75%', 'Unnamed: 0'], inplace=True)
    y.drop(columns=['count', 'std', '25%', '50%', '75%', 'Unnamed: 0'], inplace=True)

In [None]:
#merging all months from the list of 2 guest airbnbs 
for df in two_guest_list[1:]:
    dec_2_airbnb_mean_final = pd.merge(dec_2_airbnb_mean_final, df, on='ski_resort')

#merging all months from the list of 4 guest airbnbs
for df in four_guest_list:
    dec_2_airbnb_mean_final = pd.merge(dec_2_airbnb_mean_final, df, on='ski_resort')
    
#saving off the list as a new dataframe
airbnb_df = dec_2_airbnb_mean_final.copy()

#dropping duplicates
airbnb_df = airbnb_df.drop_duplicates()
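
The chained merges above can also be written with `functools.reduce`, which folds a list of dataframes into one without reusing the December variable as an accumulator. This is a sketch with toy frames and made-up prices; `merge_on_resort` is a hypothetical helper name.

```python
from functools import reduce

import pandas as pd

def merge_on_resort(frames):
    """Inner-join a list of dataframes on the shared 'ski_resort' column."""
    return reduce(
        lambda left, right: pd.merge(left, right, on="ski_resort", how="inner"),
        frames,
    )

# Toy monthly frames (placeholder prices, not scraped values).
dec = pd.DataFrame({"ski_resort": ["Alta", "Vail"], "dec_mean_2_guests": [100.0, 200.0]})
jan = pd.DataFrame({"ski_resort": ["Alta", "Vail"], "jan_mean_2_guests": [110.0, 210.0]})
feb = pd.DataFrame({"ski_resort": ["Alta"], "feb_mean_2_guests": [120.0]})

combined = merge_on_resort([dec, jan, feb])
print(combined)
```

As with the default `pd.merge`, an inner join keeps only resorts present in every month, so a resort missing from one scrape drops out of the result.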

In [None]:
#inspecting value counts
airbnb_df['ski_resort'].value_counts()

In [None]:
#checking to see which ski resort names differ between the main feature df and the airbnb df
scraped_df.loc[~scraped_df['ski_resort'].isin(airbnb_df['ski_resort']),
                         'ski_resort'].unique()

In [None]:
#renaming resorts that have the same name
airbnb_df.replace("Crystal Mountain", "Crystal Mountain Washington", regex=True, inplace=True)

#renaming resorts that have the same name
airbnb_df.replace("Powder Ridge", "Powder Ridge Minnesota", regex=True, inplace=True)

airbnb_df.loc[(airbnb_df['ski_resort'] == "Crystal Mountain Washington ") & (airbnb_df['dec_min_2_guests'] == 40), 'ski_resort'] = "Crystal Mountain Michigan"

airbnb_df.replace("Powder Ridge", "Powder Ridge Connecticut", inplace=True)

airbnb_df.replace('Mt. Bhelor', 'Mt. Bachelor', inplace=True)

airbnb_df.loc[(airbnb_df['ski_resort'] == 'Timberline') & (airbnb_df['dec_min_2_guests'] == 75.0), 'ski_resort'] = "Timberline Mountain"

#batch renaming mountains to match the user df
rename_list = ["Anthony Lakes", "Mount Holly", "Mt. Shasta", "Coffee ll", "Brantling Ski", 'Big Squaw']
rename_to = ['Anthony Lakes Mountain', 'Mt. Holly','Mt. Shasta Board Ski Park', 'Coffee Mill Ski Snowboard',
       'Brantling Ski Slopes', 'Big Squaw Mountain Ski']

airbnb_df.replace(rename_list, rename_to, regex=True, inplace=True)

In [None]:
#checking to see which ski resort names differ between the main feature df and the airbnb df
scraped_df.loc[~scraped_df['ski_resort'].isin(airbnb_df['ski_resort']),
                         'ski_resort'].unique()
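
The renames above rely on `replace(..., regex=True)`, which substitutes substrings in every string cell, so rerunning the cell would append a suffix a second time. One possible alternative, sketched below under stated assumptions, is an exact-match mapping applied only to the `ski_resort` column. The helpers `apply_name_map` and `missing_names` and the `name_map` dictionary are hypothetical names introduced for this example, and the frames are toy data.

```python
import pandas as pd

# Toy frames; the resort names come from the renames above.
feature_names = pd.DataFrame(
    {"ski_resort": ["Anthony Lakes Mountain", "Mt. Holly", "Mt. Bachelor"]}
)
airbnb_names = pd.DataFrame(
    {"ski_resort": ["Anthony Lakes", "Mount Holly", "Mt. Bachelor"]}
)

# Hypothetical exact-match mapping: old name -> name used in the feature df.
name_map = {
    "Anthony Lakes": "Anthony Lakes Mountain",
    "Mount Holly": "Mt. Holly",
}


def apply_name_map(df, mapping):
    """Return a copy where whole-cell matches in ski_resort are renamed."""
    out = df.copy()
    out["ski_resort"] = out["ski_resort"].replace(mapping)
    return out


def missing_names(left, right):
    """Resort names present in left but absent from right."""
    mask = ~left["ski_resort"].isin(right["ski_resort"])
    return sorted(left.loc[mask, "ski_resort"].unique())


renamed_once = apply_name_map(airbnb_names, name_map)
renamed_twice = apply_name_map(renamed_once, name_map)  # rerun changes nothing

print(missing_names(feature_names, airbnb_names))
print(missing_names(feature_names, renamed_twice))
```

Since the mapping only matches whole cell values, applying it twice gives the same result as applying it once, which makes the cell safe to rerun.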

# Merging reviews with features
I will combine the cleaned dataframes from above, which will be used for the collaborative model output.

In [None]:
# #merging final review dataframe and scraped data
# merged_df = pd.merge(final_ski_df, scraped_df, on="ski_resort", how='left')

# #dropping columns
# merged_df = merged_df.drop(columns="state_y")

# #renaming columns
# merged_df = merged_df.rename(columns={"state_x":"state"})

## Merging airbnb with feature dataframe

In [None]:
#merging airbnb and feature dataframes
content_df = pd.merge(scraped_df, airbnb_df, on="ski_resort", how="left")
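
A left merge keeps every resort from the feature dataframe, including those with no Airbnb match. The sketch below shows one way to surface unmatched rows and repeated keys at merge time using the `indicator` and `validate` parameters of `pd.merge`. The frames and the `total_runs` column are made up for the example.

```python
import pandas as pd

# Toy frames (made-up values) standing in for scraped_df and airbnb_df.
features = pd.DataFrame({"ski_resort": ["A", "B", "C"], "total_runs": [10, 20, 30]})
prices = pd.DataFrame({"ski_resort": ["A", "B"], "dec_min_2_guests": [100.0, 80.0]})

# validate="one_to_one" raises MergeError if either side repeats a resort name.
# indicator=True adds a _merge column recording where each row came from.
checked = pd.merge(
    features,
    prices,
    on="ski_resort",
    how="left",
    validate="one_to_one",
    indicator=True,
)

# Rows marked left_only have no Airbnb prices after the merge.
unmatched = checked.loc[checked["_merge"] == "left_only", "ski_resort"].tolist()
print(unmatched)
```

If the real frames contain repeated resort names, `validate` would raise instead of silently multiplying rows, which is useful to know before the later `drop_duplicates` step.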

In [None]:
content_df

### Google Geocoding API

This section imports the final .csv from the Google Geocoding API pull. The code for the pull can be found in the **scraping_ipynb** file.

Google's Geocoding API accepts a place as an address, a pair of latitude and longitude coordinates, or a Place ID. Given an address, it returns latitude and longitude coordinates and a Place ID (geocoding); given coordinates or a Place ID, it returns an address (reverse geocoding).
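
To make the request and response shape concrete, here is a minimal sketch that builds a geocoding URL and parses a decoded JSON response without calling the network. It is not the code from the scraping notebook. The helper names, the `YOUR_API_KEY` placeholder, and the sample payload (a trimmed, made-up response) are assumptions for illustration.

```python
from urllib.parse import urlencode

GEOCODE_ENDPOINT = "https://maps.googleapis.com/maps/api/geocode/json"


def build_geocode_url(address, api_key):
    """Build a Geocoding API request URL for a free-text address."""
    return GEOCODE_ENDPOINT + "?" + urlencode({"address": address, "key": api_key})


def parse_geocode_response(payload):
    """Pull coordinates, Place ID, and address from a decoded JSON response.

    Returns None when the API reports no usable result.
    """
    if payload.get("status") != "OK" or not payload.get("results"):
        return None
    top = payload["results"][0]
    location = top["geometry"]["location"]
    return {
        "latitude": location["lat"],
        "longitude": location["lng"],
        "place_id": top["place_id"],
        "full_address": top["formatted_address"],
    }


# Trimmed, made-up payload in the shape the API returns.
sample_payload = {
    "status": "OK",
    "results": [
        {
            "formatted_address": "Example Resort Rd, Example, CO, USA",
            "geometry": {"location": {"lat": 39.5, "lng": -105.5}},
            "place_id": "EXAMPLE_PLACE_ID",
        }
    ],
}

url = build_geocode_url("Example Resort, Colorado", "YOUR_API_KEY")
parsed = parse_geocode_response(sample_payload)
print(url)
print(parsed)
```

Keeping URL construction and response parsing separate from the HTTP call makes both pieces easy to check against saved responses.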

In [None]:
latitude_df.info()

In [None]:
latitude_df.head()

In [None]:
latitude_df.drop(columns="Unnamed: 0", inplace=True)

In [None]:
content_df.loc[~content_df['ski_resort'].isin(latitude_df['ski_resort']),
                         'ski_resort'].unique()

In [None]:
#updating Timberline because there are multiple similar resorts
latitude_df.loc[(latitude_df['ski_resort'] == 'Timberline') & (latitude_df['full_address'] == "HC 70 Box 488, Davis, West Virginia"), 'ski_resort'] = "Timberline Mountain"

#batch renaming
lat_list = ["Anthony Lakes", "Mount Holly", "Mt. Shasta", "Coffee ll", "Brantling Ski", "Big Squaw"]
lat_replace = ['Anthony Lakes Mountain', 'Mt. Holly','Mt. Shasta Board Ski Park',
               'Coffee Mill Ski Snowboard','Brantling Ski Slopes', 'Big Squaw Mountain Ski']

latitude_df.replace(lat_list, lat_replace, regex=True, inplace=True)

In [None]:
#confirming all names are consistent besides two remaining values that don't exist in the other dataframe
content_df.loc[~content_df['ski_resort'].isin(latitude_df['ski_resort']),
                         'ski_resort'].unique()

### Merging with feature dataframe

In [None]:
#merging latitude df with final content_df
content_df = pd.merge(content_df, latitude_df, on="ski_resort", how="left")

In [None]:
content_df.info()

In [None]:
content_df.head()

### Adding a Total Lifts Column

In [None]:
lift_list = ['gondolas_and_trams','fastEight','highSpeedSixes','quadChairs','tripleChairs','doubleChairs',
             'surfeLifts']

for x in lift_list:
    content_df[x] = content_df[x].fillna(0)

In [None]:
content_df['gondolas_and_trams'] = content_df['gondolas_and_trams'].astype(float)

In [None]:
#making new column that totals the lifts across all lift types
content_df['total_lifts'] = (
    content_df['gondolas_and_trams'] + content_df['fastEight'] + content_df['highSpeedSixes']
    + content_df['quadChairs'] + content_df['tripleChairs'] + content_df['doubleChairs']
    + content_df['surfeLifts']
)
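
The total is a row-wise sum over the lift columns after missing counts are set to zero. A compact sketch of that idea, which also coerces string-typed columns to numbers in one step, is shown below. The frame and its counts are made up, and only three of the lift columns are included to keep the example short.

```python
import pandas as pd

# Toy frame with made-up counts; one column arrives as strings with a gap.
lifts = pd.DataFrame({
    "gondolas_and_trams": ["1", None, "2"],
    "quadChairs": [3, 0, 1],
    "doubleChairs": [2.0, 4.0, None],
})
lift_cols = ["gondolas_and_trams", "quadChairs", "doubleChairs"]

# Coerce every lift column to numeric, then treat missing counts as zero.
numeric = lifts[lift_cols].apply(pd.to_numeric, errors="coerce").fillna(0)

# Row-wise sum over the listed columns.
lifts["total_lifts"] = numeric.sum(axis=1)
print(lifts["total_lifts"].tolist())
```

Driving the sum from a column list means adding or removing a lift type only requires editing the list.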

In [None]:
content_df

In [None]:
content_df.drop_duplicates(subset="ski_resort", inplace=True)
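
Dropping duplicates on `ski_resort` alone keeps the first row for each name and discards the rest, even when the other columns differ. A short sketch for inspecting what would be discarded first, using made-up rows:

```python
import pandas as pd

# Made-up rows: resort "B" appears twice with different prices.
rows = pd.DataFrame({
    "ski_resort": ["A", "B", "B", "C"],
    "dec_min_2_guests": [100.0, 80.0, 95.0, 60.0],
})

# keep=False flags every member of a duplicated group, not just the repeats.
dupes = rows[rows.duplicated(subset="ski_resort", keep=False)]
print(dupes)

# drop_duplicates keeps the first occurrence by default.
deduped = rows.drop_duplicates(subset="ski_resort")
print(deduped)
```

Reviewing the flagged groups shows whether the repeats are true duplicates or distinct resorts that happen to share a name.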

In [None]:
#saving final cleaned scraped df
#content_df.to_csv("cleaned_data_exports/scraped_feature_df.csv")

## Feature Analysis

### Ratings by resort distribution

In [None]:
#looking at the most reviewed mountains
#naming both columns explicitly so the result is the same across pandas versions
top_5_reviewed_resorts = (final_ski_df['ski_resort'].value_counts()
                          .head(5)
                          .rename_axis('ski_resort')
                          .reset_index(name='review_count'))
top_5_reviewed_resorts

In [None]:
import plotly.express as px

#using plotly to plot the most reviewed resorts
fig = px.bar(top_5_reviewed_resorts, x="ski_resort", y="review_count")
fig.update_layout(title_text='Most Reviewed Mountains',
                  title_x=0.5,
                  xaxis_title="Resort",
                  yaxis_title="Review Count",
                  plot_bgcolor='white')
fig.update_traces(marker_color = "#00b5ff")
fig.show()

### Rating distribution

There is an imbalance in the rating distribution; however, the breakdown does not follow the typical user-bias pattern in which ratings cluster at the high or low end of the scale. This imbalance will most likely affect model performance, but we can choose an algorithm that suits this type of data.

In [None]:
#making dataframe of rating counts to compare distribution of ratings
#naming both columns explicitly so the result is the same across pandas versions
top_ratings = (final_ski_df["rating"].value_counts()
               .head(15)
               .rename_axis("rating")
               .reset_index(name="rating_count"))

#making rating a string for plotting
top_ratings['rating'] = top_ratings['rating'].astype(str)

# Calculate the percentage of each rating count
top_ratings['rating_percentage'] = (top_ratings['rating_count'] / top_ratings['rating_count'].sum()) * 100
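
The percentage column is each rating's count divided by the total count. The sketch below does the same calculation in one step with `value_counts(normalize=True)`, using a made-up series of ratings.

```python
import pandas as pd

# Made-up ratings, only to show the calculation.
ratings = pd.Series([5, 5, 4, 4, 4, 3, 2, 5, 4, 1], name="rating")

# normalize=True returns each rating's share of all reviews (shares sum to 1).
share = ratings.value_counts(normalize=True).sort_index()
rating_pct = (share * 100).round(1)
print(rating_pct.to_dict())
```

One difference from the cell above: this computes shares over every distinct rating, whereas the notebook first keeps only the 15 most frequent values and then divides by their combined count.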

In [None]:
#using plotly to plot the rating distribution
fig = px.bar(top_ratings, x="rating", y="rating_percentage",
             text="rating_percentage")
fig.update_layout(title_text='Rating Distribution',
                  title_x=0.5,
                  xaxis_title="Rating",
                  yaxis_title="Rating %",
                 plot_bgcolor='white',
                 font=dict(size=14))
fig.update_traces(marker_color = "#00b5ff", texttemplate='%{text:.1s}%', textposition='outside')

fig.show()

### Monthly Snowfall

In [None]:
snow_list = ["NovSnow", "DecSnow", "JanSnow", "FebSnow", "MarSnow", "AprSnow"]
snow_names = ['November', 'December', 'January', 'February', 'March', 'April']

monthly_mean = content_df[snow_list].mean(skipna=True)

monthly_snowfall = pd.DataFrame({'month': snow_names, 'mean_snowfall': monthly_mean.values})

In [None]:
monthly_snowfall

In [None]:
#using plotly to plot average monthly snowfall
fig = px.bar(monthly_snowfall, x="month", y="mean_snowfall")
fig.update_layout(title_text='2022 US Average Snowfall',
                  title_x=0.5,
                  xaxis_title="Month",
                  yaxis_title="Snow (in)",
                 plot_bgcolor='white',
                 font=dict(size=14))
fig.update_traces(marker_color = "#00b5ff",textposition='outside')

fig.show()

### Airbnb Prices

In [None]:
#using plotly to plot December Airbnb prices for the first five resorts
fig = px.bar(content_df.head(), x="ski_resort", y=["dec_min_2_guests", "dec_min_4_guests"],
            width=1000, height=500)
fig.update_layout(title_text='December Airbnb Costs',
                  title_x=0.5,
                  xaxis_title="Ski Resort",
                  yaxis_title="Nightly Price ($)",
                 plot_bgcolor='white',
                 font=dict(size=14),
                 barmode='group')

newnames = {'dec_min_2_guests':'2 Guest Max', 'dec_min_4_guests': '4 Guest Max'}
fig.for_each_trace(lambda t: t.update(name = newnames[t.name],
                                      legendgroup = newnames[t.name],
                                      hovertemplate = t.hovertemplate.replace(t.name, newnames[t.name])
                                     )
                  )

fig.update_traces(textposition='outside')

fig.show()

### Elevation Graph

In [None]:
import plotly.graph_objects as go

def mountain_elevation(resort_name):
    """Plot a base-to-summit-to-base elevation profile for one resort."""

    # selecting the rows for the requested resort
    resort_df = content_df.loc[content_df['ski_resort'] == resort_name]

    # elevation
    base_elevation = resort_df['base'].values[0]
    summit_elevation = resort_df['sumt'].values[0]

    # tracing elevation
    elev_trace = go.Scatter(x=["base", "summit", "drop"], y=[base_elevation, summit_elevation, base_elevation], mode='lines', line=dict(color='blue'))

    # setting up the plot layout
    layout = go.Layout(
        title='Elevation Change',
        yaxis=dict(title='Elevation'),
        plot_bgcolor='white',
        showlegend=False
    )

    # making figure
    fig = go.Figure(data=[elev_trace], layout=layout)

    # showing the line plot
    fig.show()

In [None]:
mountain_elevation("Arapahoe Basin")

# Conclusion

#### Review Dataset
After cleaning and analyzing the data, the review dataset contains **662 users, 275 resorts, and 2795 total reviews**. The reviews are imbalanced; however, the final recommendation system will be a hybrid-cascade model, which should help balance out the results.

In [None]:
unique_users = len(final_ski_df['user_name'].unique())
unique_resorts = len(final_ski_df['ski_resort'].unique())
total_reviews = len(final_ski_df)

print("Number of unique users:", unique_users)
print("Number of unique resorts:", unique_resorts)
print("Number of reviews:", total_reviews)
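
One way to put the review counts above in context is the density of the user-by-resort rating matrix that a collaborative model would work with. The arithmetic below is a rough sketch that assumes every review is a distinct user-resort pair, so the density it reports is an upper bound.

```python
# Counts reported for the review dataset in this notebook.
n_users = 662
n_resorts = 275
n_reviews = 2795

# Every user-resort pair that could hold a rating.
possible_pairs = n_users * n_resorts

# Assumption: each review is a distinct user-resort pair, so this is an
# upper bound on how full the matrix is.
density = n_reviews / possible_pairs
sparsity = 1 - density

print(f"possible pairs: {possible_pairs}")
print(f"density: {density:.2%}")
print(f"sparsity: {sparsity:.2%}")
print(f"mean reviews per user: {n_reviews / n_users:.1f}")
print(f"mean reviews per resort: {n_reviews / n_resorts:.1f}")
```

Under that assumption, roughly 1.5% of the possible user-resort pairs have a rating, which is relevant when choosing a collaborative filtering algorithm.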

#### Feature Dataset
After cleaning the scraped data from OnTheSnow, the Google Geocoding API, and Airbnb, the final dataframe contains **329 resorts** and **86 columns**.

In [None]:
unique_resorts = len(content_df['ski_resort'].unique())
unique_features = len(content_df.columns)

print("Number of resorts:", unique_resorts)
print("Number of features:", unique_features)

# Next Steps

The next step is to begin modeling to create the recommendation system. The two main dataframes from this notebook that will be used are listed below:

- **Collaborative Modeling** - cleaned_data_exports/user_df_model.csv
- **Content/Cascade Hybrid Modeling** - cleaned_data_exports/scraped_feature_df.csv

The collaborative model will be saved in a separate notebook from the final content-based and hybrid models.