#### Packages to Install
Please install the following additional packages before running the demo:
* sklearn:
```
pip install -U scikit-learn
```

* category_encoders:
```
pip install category_encoders
```

In [71]:
import pyclimb as pc
from pyclimb.vis_func import clus_map, heat_map
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from category_encoders import TargetEncoder
from sklearn.ensemble import RandomForestRegressor

### Loading and Cleaning
First, use the load_data function to load the datasets bundled with the demo: the climbing dataset in both clean and raw form, a weather dataset from Utah weather stations, and a Utah cities dataset scraped from the web. This demo starts from the raw climbing data to show how the built-in cleaning function works.

If you are reading in your own data from mountainproject.com, put all of your files in the same working directory and use the pc.concat() function to combine them.
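The internals of pc.concat() are not shown in this demo. As a rough illustration only, combining several CSV exports with plain pandas might look like the sketch below. The helper name concat_csvs, the toy CSV contents, and the choice to drop exact duplicate rows are all assumptions, not pyclimb's actual behavior.

```python
import io

import pandas as pd


def concat_csvs(sources):
    """Read several CSV exports and stack them into one DataFrame.

    Hypothetical stand-in for pc.concat(); accepts file paths or file-like objects.
    """
    frames = [pd.read_csv(src) for src in sources]
    combined = pd.concat(frames, ignore_index=True)
    # Assumption: overlapping exports may repeat routes, so drop exact duplicates
    return combined.drop_duplicates().reset_index(drop=True)


# Toy data standing in for two overlapping route-finder exports
part1 = io.StringIO("Route,Rating\nA,5.9\nB,5.10a\n")
part2 = io.StringIO("Route,Rating\nB,5.10a\nC,5.11b\n")
combined = concat_csvs([part1, part2])
print(combined)
```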

In [72]:
# If using your own files downloaded from Mountain Project, uncomment the next line
# (and comment out the load_data('raw') line below it)
# climbs = pc.concat(['route-finder(1).csv', 'route-finder(2).csv', 'route-finder(3).csv'])
climbs = pc.load_data('raw')
weather = pc.load_data('weather')
cities = pc.load_data('cities')

You can then clean the data using the pc.clean() function.

In [73]:
pc.clean(climbs, inplace = True)

### Scraping
There is also a scraper function that collects additional data for each climb. The crawl-delay requested by mountainproject.com is 60 seconds, so the call is left commented out for this demo. If you are using your own data, feel free to uncomment it.
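To give a sense of what honoring a crawl-delay involves, here is a minimal sketch of a loop that pauses between requests. It is not the implementation of pc.scrape_mp(); the function fetch_politely and its parameters are hypothetical, and the fetch step is passed in as a callable so the example stays self-contained.

```python
import time
from typing import Callable, Iterable, List


def fetch_politely(urls: Iterable[str],
                   fetch: Callable[[str], str],
                   delay_seconds: float = 60.0,
                   sleep: Callable[[float], None] = time.sleep) -> List[str]:
    """Call fetch(url) for each URL, pausing delay_seconds between requests."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            sleep(delay_seconds)  # wait between requests, not before the first
        results.append(fetch(url))
    return results
```

Assuming one request per route, the roughly 6,100 routes in the demo dataset would take over 100 hours at a 60 second delay, which is why the demo loads the already-scraped clean dataset instead.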

In [74]:
# pc.scrape_mp(climbs, inplace = True) # uncomment this section to scrape from MP
# I have already scraped the data from MP and it is in the clean dataset below
climbs = pc.load_data('clean')

### Merging the data
Once you have your cleaned data, use merge_data_dist to merge the desired dataframes based on the closest latitude and longitude. This example merges the climbs dataset with two others: weather stations and cities.

In [75]:
climbing = pc.merge_data_dist(climbs, weather, 'ELEVATION', 'Area Latitude', 'Area Longitude', 'LATITUDE', 'LONGITUDE')
# function adds distance variable by default, so we rename it to be more specific
climbing.rename({'Distance' : 'station_dist', 'Location' : 'climb_location'}, inplace = True, axis = 1) 
climbing = pc.merge_data_dist(climbing, cities, 'Location')
climbing.rename({'Distance' : 'city_dist', 'Location' : 'city'}, inplace = True, axis = 1)

climbing.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6107 entries, 0 to 6106
Data columns (total 34 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Route           6107 non-null   object 
 1   URL             6107 non-null   object 
 2   Avg Stars       6062 non-null   float64
 3   Rating          6107 non-null   object 
 4   Pitches         6107 non-null   int64  
 5   Length          5474 non-null   float64
 6   Latitude        6107 non-null   float64
 7   Longitude       6107 non-null   float64
 8   PG13            6107 non-null   bool   
 9   R               6107 non-null   bool   
 10  State           6107 non-null   object 
 11  Region          6107 non-null   object 
 12  climb_location  6107 non-null   object 
 13  Crag            5874 non-null   object 
 14  Wall            4702 non-null   object 
 15  Trad            6107 non-null   bool   
 16  Alpine          6107 non-null   bool   
 17  TR              6107 non-null   b
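For readers curious how a merge on closest latitude and longitude can work, the sketch below matches each row of one dataframe to the nearest row of another using the haversine distance. It is an illustration under assumptions, not the code behind merge_data_dist: the function names, the use of kilometres for the Distance column, and the toy coordinates and elevations are all made up for the example.

```python
import numpy as np
import pandas as pd


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))


def nearest_merge(left, right, value_col,
                  left_lat, left_lon, right_lat, right_lon):
    """Attach right[value_col] from the closest row of right to each row of left."""
    # Pairwise distances, shape (len(left), len(right)), via broadcasting
    dist = haversine_km(left[left_lat].to_numpy()[:, None],
                        left[left_lon].to_numpy()[:, None],
                        right[right_lat].to_numpy()[None, :],
                        right[right_lon].to_numpy()[None, :])
    nearest = dist.argmin(axis=1)
    out = left.copy()
    out[value_col] = right[value_col].to_numpy()[nearest]
    out['Distance'] = dist[np.arange(len(left)), nearest]
    return out


# Toy data (made-up values) standing in for climbs and weather stations
climbs_demo = pd.DataFrame({'Route': ['A', 'B'],
                            'Latitude': [40.57, 37.10],
                            'Longitude': [-111.77, -113.58]})
stations_demo = pd.DataFrame({'ELEVATION': [1288.0, 860.0],
                              'LATITUDE': [40.77, 37.09],
                              'LONGITUDE': [-111.97, -113.59]})
merged = nearest_merge(climbs_demo, stations_demo, 'ELEVATION',
                       'Latitude', 'Longitude', 'LATITUDE', 'LONGITUDE')
print(merged)
```

The full pairwise distance matrix is simple but grows with the product of the two table sizes; for much larger tables a spatial index would be a more economical choice.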

### Maps
Once the datasets are merged, you can make an interactive cluster map or heat map. Each map uses the latitude, longitude, and a description for each point. The maps are saved by default; pass save = False to display them on screen instead.

In [76]:
clus_map(climbing, desc = 'Route', save = False)

In [77]:
heat_map(climbing, desc = 'Avg Stars', save = False)

### Analysis
Here we use scikit-learn to fit a random forest regressor that predicts Rating_num.

In [78]:
# Change Elevation to a float
climbing['ELEVATION'] = climbing.ELEVATION.mask(climbing['ELEVATION'] == " ", np.nan).astype(float)

# Drop na values
climbing.dropna(inplace = True)

# Change Booleans to 1 and 0
for col in ['R', 'PG13', 'Trad', 'Alpine', 'TR', 'Aid', 'Boulder', 'Mixed']:
    climbing[col] = climbing[col].map({True : 1, False : 0})

# Label encode categorical features (target encoding is applied further below)
label_encoder = LabelEncoder()
categorical_variables = ['Region', 'climb_location', 'Crag', 'Wall', 'city']
for var in categorical_variables:
    climbing[var] = label_encoder.fit_transform(climbing[var])

# Build the feature matrix and target
X = climbing.drop(['Rating_num', 'URL',             # Remove uninformative features
                   'Route', 'Rating', 'Shared_by', 
                   'State', 'Date'], axis = 1)
y = climbing['Rating_num']
weights = climbing['numVotes']  # vote counts; defined here but not used in the fit below

# Finish encoding the variables
target_encoder = TargetEncoder(cols = categorical_variables)
X = target_encoder.fit_transform(X, y)

# Convert to numpy array
X = np.array(X)
y = np.array(y)

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

# Create the model
rf_mod = RandomForestRegressor()
rf_mod.fit(X_train, y_train)
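One detail worth knowing about the cell above: target_encoder.fit_transform(X, y) runs before train_test_split, so the encoded values for the test rows are computed partly from those rows' own targets. This can make the test error look better than it would on unseen data. A minimal sketch of the alternative, learning the encoding from the training rows only, is shown below. It uses plain, unsmoothed per-category means rather than category_encoders' TargetEncoder, and the helper names and toy data are hypothetical.

```python
import pandas as pd


def fit_target_means(train, col, target):
    """Learn per-category target means and the global mean from training rows only."""
    return train.groupby(col)[target].mean(), train[target].mean()


def apply_target_means(df, col, means, global_mean):
    """Replace categories with learned means; unseen categories get the global mean."""
    return df[col].map(means).fillna(global_mean)


# Toy data (made-up values): 'z' never appears in the training rows
train_demo = pd.DataFrame({'Crag': ['x', 'x', 'y'],
                           'Rating_num': [1.0, 3.0, 5.0]})
test_demo = pd.DataFrame({'Crag': ['x', 'z']})

means, global_mean = fit_target_means(train_demo, 'Crag', 'Rating_num')
encoded_test = apply_target_means(test_demo, 'Crag', means, global_mean)
print(encoded_test.tolist())
```

The same idea applies to the encoder used in the demo: split first, call fit_transform on the training portion, and call transform on the test portion.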

In [79]:
# Make predictions and evaluate model performance
preds = rf_mod.predict(X_test)
mean_squared_error(y_test, preds)

1.0439620756572663
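Mean squared error is expressed in squared units of the target, so taking the square root gives a figure on the same scale as Rating_num. The short calculation below does this for the value printed above. Keep in mind that train_test_split is called without a random_state and the forest itself is randomized, so the exact number will differ from run to run.

```python
import math

mse = 1.0439620756572663  # value printed by the cell above
rmse = math.sqrt(mse)     # same units as Rating_num
print(round(rmse, 2))     # 1.02
```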