**Outdoor Adventure Recommendations (OARs)**

<U>Problem Statement</U><br>
Finding outdoor activities that match a user's fitness level, preferences, and location can be overwhelming due to the abundance of options and lack of personalized recommendations. Users often spend hours researching trails or activities, only to find that their chosen options don't align with their expectations or capabilities.<br>
<U>Solution</U><br>
This project aims to develop a recommendation system that provides personalized outdoor adventure suggestions. By analyzing trail features (e.g., difficulty, length, terrain) and user attributes (e.g., fitness level, activity preferences, equipment), the system will deliver tailored recommendations. This solution is valuable for outdoor enthusiasts seeking efficient, reliable, and personalized suggestions, enhancing their experiences and saving them time.<br>
<U>Approach</U><br>


1.   We will pull hiking trail data from the Overpass API. From this we will get features like length, elevation, whether dogs are allowed, etc.
2.   Using an unsupervised algorithm, the trails will be grouped into clusters.
3.   We will bootstrap our program by simulating user data and, using business knowledge, we will assign our simulated users to the trail clusters.
4.   A supervised model will be trained on the user data.
5.   The trained model will be applied to new users so that the system can provide them their recommended trails.<br>
At the end, we will explore future enhancements.






In [484]:
#importing all required packages
import umap
import seaborn as sns
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score, silhouette_score, accuracy_score, f1_score
import matplotlib.pyplot as plt
from matplotlib import colormaps
from matplotlib.colors import ListedColormap
from scipy.cluster.hierarchy import linkage, dendrogram
import requests
import time
from geopy.distance import geodesic
import random
from faker import Faker
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
import warnings

warnings.filterwarnings('ignore') #turning off warnings to make the final output more clear

First we will define some functions for processing the data. This makes things a bit cleaner and will also help with reusability in the future as we expand our program into other outdoor activities.

In [485]:
#this function is used to fetch data for specific geographic locations (nodes) from the Overpass API, including latitude, longitude, and elevation
def fetch_node_coordinates(node_ids):
    if not node_ids:
        return []

    node_ids_str = ",".join(map(str, node_ids))  # Create a comma-separated list of node IDs
    query = f"""
    [out:json];
    node(id:{node_ids_str});
    out body;
    """
    response = requests.get(overpass_url, params={'data': query})
    if response.status_code == 200:
        return response.json().get('elements', [])
    else:
        print(f"Error fetching data for nodes {node_ids}: Status code {response.status_code}")
        return []

In [486]:
#not all Overpass nodes contain elevation data, so this function serves as a backup to retrieve the elevation from the Open-Elevation API
def fetch_elevation(lat, lon):
    # OpenElevation API endpoint
    url = "https://api.open-elevation.com/api/v1/lookup"

    params = {
        'locations': f'{lat},{lon}'
    }

    try:
        response = requests.get(url, params=params)
        if response.status_code == 200:
            data = response.json()
            elevation = data['results'][0]['elevation']
            return elevation
        else:
            print(f"Error: Unable to fetch data. Status code {response.status_code}")
            return 1
    except Exception as e:
        print(f"Error fetching elevation: {e}")
        return 1

In [487]:
#Based on longitude and latitude data from the Overpass API, this function calculates a trail's length
def calculate_length(node_ids, node_coords):
    length = 0
    for i in range(len(node_ids) - 1):
        start = node_coords.get(node_ids[i])
        end = node_coords.get(node_ids[i + 1])
        if start and end:
            length += geodesic((start['lat'], start['lon']), (end['lat'], end['lon'])).meters
        else:
            print(f"Missing coordinates for nodes: {node_ids[i]} or {node_ids[i + 1]}")
    return length
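
To make the segment-by-segment summation concrete, here is a minimal standalone sketch. It swaps `geopy`'s ellipsoidal `geodesic` for the simpler haversine (great-circle) formula so that it runs with the standard library only; the results will differ slightly from `calculate_length`. The names `haversine_m`, `path_length_m`, and the three sample nodes are hypothetical and exist only for illustration.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius in meters (spherical approximation)

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def path_length_m(node_ids, node_coords):
    """Sum segment lengths over consecutive nodes, skipping segments with a missing endpoint."""
    total = 0.0
    for a_id, b_id in zip(node_ids, node_ids[1:]):
        a, b = node_coords.get(a_id), node_coords.get(b_id)
        if a and b:
            total += haversine_m(a['lat'], a['lon'], b['lat'], b['lon'])
    return total

# hypothetical three-node trail: 0.01 degrees north, then 0.01 degrees east
sample_coords = {
    1: {'lat': 38.50, 'lon': -81.50},
    2: {'lat': 38.51, 'lon': -81.50},
    3: {'lat': 38.51, 'lon': -81.49},
}
sample_length = path_length_m([1, 2, 3], sample_coords)
print(f"{sample_length:.0f} m")
```

The eastward segment comes out shorter than the northward one because a degree of longitude shrinks with the cosine of the latitude.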

In [488]:
#Based on elevation from the Overpass API, this function calculates a trail's elevation gain
def calculate_elevation_gain(node_ids, node_coords):
    elevation_gain = 0
    prev_elevation = None
    for node_id in node_ids:
        node = node_coords.get(node_id)
        if node and 'ele' in node and node['ele'] is not None:
            try:
                current_elevation = float(node['ele'])
                if prev_elevation is not None and current_elevation > prev_elevation:
                    elevation_gain += current_elevation - prev_elevation
                prev_elevation = current_elevation
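            except (ValueError, TypeError):
                elevation_gain = 100  #reset the running total to a default of 100 m when the elevation cannot be parsed
        else:
            elevation_gain = 100  #reset the running total to a default of 100 m when the node or its elevation is missing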
    return elevation_gain
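
As a small worked example of the "count only the climbs" idea, the sketch below applies the same rule to a plain list of elevations. It is a simplification that assumes every reading is present and numeric, so it leaves out the missing-value fallback used above; `cumulative_gain` and `profile` are hypothetical names.

```python
def cumulative_gain(elevations):
    """Sum only the positive differences between consecutive elevation readings."""
    gain = 0.0
    for prev, curr in zip(elevations, elevations[1:]):
        if curr > prev:
            gain += curr - prev
    return gain

profile = [100.0, 120.0, 110.0, 150.0]  # hypothetical elevations in meters
# climbs: 100 -> 120 (+20) and 110 -> 150 (+40); the descent 120 -> 110 is ignored
print(cumulative_gain(profile))  # 60.0
```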

As our user base grows, we will continually retrain our models. For now, we will necessarily simulate some users so as to bootstrap our system.<br>

The function below contains the features which will be created for each user. Some are categorical, some are binary, and some are in the range 1-10. All new users in our system will be asked questions at sign-up that will provide this data.<br>

This will allow new users to receive their recommendations and it will also provide new user data with which to update our models.

In [489]:
#This is our function to generate simulated users
def generate_user_data(num_users):
    fake = Faker()
    data = []
    for _ in range(num_users):
        name = fake.name()
        gender = random.choice(['Male', 'Female'])
        activity_level = random.choice(['Low', 'Moderate', 'High', 'Extreme'])
        owns_dog = random.choice([0, 1])
        owns_bicycle = random.choice([0, 1])
        enjoys_leisure = random.randint(1,10)
        enjoys_physical_challenges = random.randint(1,10)
        mobility_issues = random.randint(1,10)
        safety_concerns = random.randint(1,10)

        data.append({
            'Name': name,
            'Age': random.randint(18, 65),
            'Sex': gender,
            'Activity Level': activity_level,
            'Owns a Dog': owns_dog,
            'Enjoys Leisure Activities': enjoys_leisure,
            'Owns a Bicycle': owns_bicycle,
            'Enjoys Physical Challenges': enjoys_physical_challenges,
            'Mobility Issues': mobility_issues,
            'Safety Concerns': safety_concerns,
        })

    return pd.DataFrame(data)
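
Because the simulation draws from Python's global `random` state, each run produces different users. One way to make runs repeatable, sketched here under the assumption that reproducibility is wanted, is to draw from a dedicated seeded generator. The function `simulate_answers` is hypothetical and covers only three of the fields; names generated by Faker would need Faker's own seeding (`Faker.seed`) in addition.

```python
import random

def simulate_answers(num_users, seed):
    """Draw questionnaire-style answers from a dedicated, seeded generator."""
    rng = random.Random(seed)  # local generator; leaves the global random state untouched
    users = []
    for _ in range(num_users):
        users.append({
            'Activity Level': rng.choice(['Low', 'Moderate', 'High', 'Extreme']),
            'Owns a Dog': rng.choice([0, 1]),
            'Mobility Issues': rng.randint(1, 10),  # inclusive on both ends
        })
    return users

run_a = simulate_answers(5, seed=11)
run_b = simulate_answers(5, seed=11)
print(run_a == run_b)  # True: the same seed yields the same users
```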

Now that we have defined our functions, we begin with the main program. The first section deals with getting the hiking trails data and creating the clusters.<br>
In our productionalized version of the program, this will be a separate process which runs monthly to create and store the clusters offline for reference by other parts of the program. Here for our POC we will be doing all of the work inline as a single process.

In [490]:
#Define the Overpass API query for retrieving hiking trails
#For this POC, we are bounding the data to West Virginia. In a productionalized version, this range would be expanded as user location
#would be incorporated into the model
overpass_url = "http://overpass-api.de/api/interpreter"
overpass_query = """
[out:json][timeout:25];
(
  way["highway"="path"]["foot"="yes"](38.0,-82.0,39.0,-81.0);
);
out body;
>;
out skel qt;
"""

#Call the Overpass API
response = requests.get(overpass_url, params={'data': overpass_query})
data = response.json()
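
Since the bounding box is the part of the query most likely to change when the service expands to other regions, it can help to build the query from parameters. The sketch below is one possible helper (the name `build_trail_query` is hypothetical); it relies on the Overpass convention that a bounding box is written as (south, west, north, east) and reproduces the query text used above.

```python
def build_trail_query(south, west, north, east, timeout=25):
    """Return an Overpass QL query for foot paths inside a bounding box.

    Overpass expects the box as (south, west, north, east).
    This simple check does not handle boxes that cross the antimeridian.
    """
    if not (south < north and west < east):
        raise ValueError("expected south < north and west < east")
    bbox = f"{south},{west},{north},{east}"
    return (
        f"[out:json][timeout:{timeout}];\n"
        "(\n"
        f'  way["highway"="path"]["foot"="yes"]({bbox});\n'
        ");\n"
        "out body;\n"
        ">;\n"
        "out skel qt;\n"
    )

query = build_trail_query(38.0, -82.0, 39.0, -81.0)
print(query)
```

The resulting string can be passed as the `data` parameter in the same way as `overpass_query`.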

In [491]:
#parse the Overpass data into a list of dictionaries
elements = data.get("elements", [])
hiking_trails = []

#parse way (hiking trail) elements
for element in elements:
    if element.get("type") == "way":  # Focus on 'way' elements
        trail = {
            "id": element.get("id"),
            "tags": element.get("tags", {}),
            "nodes": element.get("nodes", []),
            "name": element.get("tags", {}).get("name", "Unknown"),
            "type": element.get("tags", {}).get("highway", "Unknown"),
        }
        hiking_trails.append(trail)

#create a dataframe of the hiking trail data
df_hiking = pd.DataFrame(hiking_trails)
print(df_hiking.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 116 entries, 0 to 115
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      116 non-null    int64 
 1   tags    116 non-null    object
 2   nodes   116 non-null    object
 3   name    116 non-null    object
 4   type    116 non-null    object
dtypes: int64(1), object(4)
memory usage: 4.7+ KB
None


By calling the Overpass API, we have retrieved data on 116 trails within West Virginia.<br>
The nodes field contains geographic points along the trail. We will now use our earlier-defined functions to extract latitude, longitude, and elevation data from these nodes.

In [None]:
#for each trail node contained within each trail, extract the latitude, longitude, and elevation
node_coords = {}
for trail_nodes in df_hiking['nodes']:
    if trail_nodes:
        elements = fetch_node_coordinates(trail_nodes)
        for element in elements:
            if 'lat' in element and 'lon' in element:
                elevation = element.get('tags', {}).get('ele')
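                if elevation is None:
                    #placeholder derived from the coordinates, not a measured elevation;
                    #fetch_elevation(lat, lon) could be called here instead, at the cost of one request per node
                    elevation = pow(element['lat'],2)*element['lon']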
                node_coords[element['id']] = {
                    'lat': element['lat'],
                    'lon': element['lon'],
                    'ele': elevation
                }
    time.sleep(1)

In [None]:
#sometimes when hitting the Overpass API, a node will not return data
all_node_ids = set(node_id for nodes in df_hiking['nodes'] for node_id in nodes)
missing_node_ids = all_node_ids - set(node_coords.keys())

#for any nodes which failed to return data earlier, we will try them again
if missing_node_ids:
    print(f"Missing nodes: {missing_node_ids}")
    for batch in [list(missing_node_ids)[i:i + 20] for i in range(0, len(missing_node_ids), 20)]:
        elements = fetch_node_coordinates(batch)
        for element in elements:
            if 'lat' in element and 'lon' in element:
                # Check for elevation data
                elevation = element.get('tags', {}).get('ele')
                # If elevation data is missing, use the same coordinate-derived placeholder as above
                if elevation is None:
                    elevation = pow(element['lat'],2)*element['lon']
                node_coords[element['id']] = {
                    'lat': element['lat'],
                    'lon': element['lon'],
                    'ele': elevation  #store the resolved value, not the raw (possibly missing) tag
                }
        time.sleep(1)

In [None]:
#here we use the retrieved latitude and longitude data to assign a trail length using an earlier-defined function
df_hiking['length_meters'] = df_hiking['nodes'].apply(lambda nodes: calculate_length(nodes, node_coords))

In [None]:
#here we use the retrieved elevation data to assign a trail elevation gain value using an earlier-defined function
df_hiking['elevation_gain_meters'] = df_hiking['nodes'].apply(lambda nodes: calculate_elevation_gain(nodes, node_coords))

In [None]:
df_hiking.rename(columns={'name':'trail name'},inplace=True) #renaming since the tags field also contains a field called 'name'

#expand the tags field from the Overpass API into new columns which will be used as features
tags_df = pd.json_normalize(df_hiking['tags'])

#merge the new columns with the original DataFrame
df_hiking = pd.concat([df_hiking.drop(columns=['tags']), tags_df], axis=1)

print(df_hiking.info())

After processing the data received from Overpass, we now have 23 fields in our dataframe. At this time, we will only make use of a subset of these fields. As our application grows, we may look to expand our functionality and feature set.<br>
In this next section we do some processing and visualization of the dataset.

In [None]:
#removing features which will not apply to our application at this time
df_hiking.drop(columns=["highway","name","fid","foot","type","tiger:cfcc","tiger:county","tiger:reviewed","horse","source:geometry","motor_vehicle","bridge","layer","snowmobile","nodes"],inplace=True)

#replacing missing values with 0
df_hiking=df_hiking.fillna(0)

#converting categorical variables to binary
df_hiking['surface'] = df_hiking['surface'].apply(lambda x: 1 if x == 'ground' else 0)
df_hiking['dog'] = df_hiking['dog'].apply(lambda x: 1 if x == 'leashed' else 0)
df_hiking['lit'] = df_hiking['lit'].apply(lambda x: 1 if x == 'yes' else 0) #missing values were filled with 0 above and should not count as lit
df_hiking['fitness'] = df_hiking['fitness'].apply(lambda x: 1 if x == 'yes' else 0)
df_hiking['bicycle'] = df_hiking['bicycle'].apply(lambda x: 1 if x == 'yes' else 0)

#dropping any duplicates from our data
df_hiking.drop_duplicates(inplace=True)

print(df_hiking.info())

Here we can see our final hiking trails dataset.

In [None]:
display(df_hiking)

In [None]:
#we want to see the distribution of our two continuous variables
plt.figure(figsize=(8, 4))

plt.subplot(1, 2, 1)
plt.hist(df_hiking["length_meters"], bins=20, color='red', edgecolor='black', alpha=0.7)
plt.title('Trail Length')
plt.xlabel('Length (meters)')
plt.ylabel('Frequency')

plt.subplot(1, 2, 2)
plt.hist(df_hiking["elevation_gain_meters"], bins=20, color='blue', edgecolor='black', alpha=0.7)
plt.title('Elevation Gain')
plt.xlabel('Elevation gain (meters)')
plt.ylabel('Frequency')

plt.tight_layout()
plt.show()

In [None]:
#we want to see how our binary variables are broken out
binary_vars = ['bicycle','surface','dog','lit','fitness']

proportions = df_hiking[binary_vars].apply(lambda x: x.value_counts(normalize=True).reindex([0, 1]), axis=0)

proportions.T.plot(kind='barh', stacked=True, color=['green', 'orange'], figsize=(8, 4))
plt.title('Proportions of Binary Variables')
plt.xlabel('Proportion')
plt.legend(['0', '1'], title='Value')
plt.show()

Now that we have completed the process of obtaining and processing the hiking trails data, it is time to create our clusters.<br>
We will use an Agglomerative clustering algorithm here as we expect the clusters to not be spherical. We will try several values for the number of clusters and choose the one with the best Silhouette score. Silhouette is the appropriate measure here since we do not have a ground-truth number of clusters to compare against.<br>
As this will be a monthly process in the production version of this application, we want to minimize variance in the clusters. Therefore, it is appropriate to restrict our approach to one clustering algorithm while also keeping the candidate number of clusters relatively bounded.
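
For intuition about what the Silhouette score measures, here is a minimal hand-rolled sketch on four 1-D points. For each point, `a` is the mean distance to the other members of its own cluster, `b` is the lowest mean distance to any other cluster, and the score is `(b - a) / max(a, b)`. The function `silhouette_1d` is a hypothetical teaching aid that assumes at least two clusters; it is not a replacement for `sklearn.metrics.silhouette_score`.

```python
def silhouette_1d(points, labels):
    """Mean silhouette coefficient for 1-D points, using absolute difference as distance."""
    scores = []
    for i, (p, lab) in enumerate(zip(points, labels)):
        # distances to the other members of the point's own cluster
        same = [abs(p - q) for j, (q, l) in enumerate(zip(points, labels)) if l == lab and j != i]
        if not same:
            scores.append(0.0)  # convention: a singleton cluster scores 0
            continue
        a = sum(same) / len(same)
        # lowest mean distance to the members of any other cluster
        b = min(
            sum(abs(p - q) for q, l in zip(points, labels) if l == other)
            / sum(1 for l in labels if l == other)
            for other in set(labels) if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)

points = [0.0, 1.0, 10.0, 11.0]
good = silhouette_1d(points, [0, 0, 1, 1])  # neighbours grouped together
bad = silhouette_1d(points, [0, 1, 0, 1])   # neighbours split apart
print(f"good={good:.3f}, bad={bad:.3f}")
```

The natural grouping scores close to 1, while the grouping that splits neighbours apart scores below 0, which is why a higher score is taken as evidence of better-separated clusters.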

In [None]:
df=df_hiking.drop(columns=["id","trail name"])

#scale the variables
scaler = StandardScaler()
X=scaler.fit_transform(df)

In [None]:
silhouette_scores = []
cluster_range = range(2, 11)
best_n_clusters = 0
best_score = -1
best_model = None

for n_clusters in cluster_range:
    model = AgglomerativeClustering(n_clusters=n_clusters)
    labels = model.fit_predict(X)
    score = silhouette_score(X, labels)
    silhouette_scores.append(score)

    if score > best_score:
        best_score = score
        best_n_clusters = n_clusters
        best_model = model

print(f"Best number of clusters: {best_n_clusters} with silhouette score: {best_score:.3f}")


Our optimal unsupervised model is built using Agglomerative Clustering with 9 clusters. We have achieved a score >.6, which indicates a reasonable structure to the clusters given the score range of -1 to 1.<br>
Now we will train this optimal model on our data and transform it to 2-dimensions using the UMAP algorithm so that we can get an understanding of our clusters visually.

In [None]:
#train the final AgglomerativeClustering model
final_model = AgglomerativeClustering(n_clusters=best_n_clusters)
final_labels = final_model.fit_predict(X)

#perform UMAP dimensionality reduction for visualization
umap_results = umap.UMAP(n_neighbors=5, min_dist=0.1, metric='euclidean').fit_transform(X)

#plot the UMAP visualization
color_map = ListedColormap(plt.colormaps['viridis'](np.linspace(0, 1, best_n_clusters)))
plt.figure(figsize=(10, 6))

plt.scatter(
    umap_results[:, 0],
    umap_results[:, 1],
    c=final_labels,
    cmap=color_map,
    s=100,
    alpha=0.7,
    edgecolor='k'
)
plt.title(f"UMAP Visualization of Clustering (n_clusters={best_n_clusters})")
plt.xticks([])
plt.yticks([])
plt.axis('off')

plt.show()


Based on the visualization, the clusters created look reasonable.<br>
We will now assign the model-predicted clusters to our hiking trails dataset. The result will be data which tells us which hiking trails belong to which clusters, which will in turn relate trail features to clusters.

In [None]:
df_hiking['Cluster'] = final_labels

df_hiking.head()

Below we can see the clusters created and the summarized associated features.

In [None]:
grouped_df = df_hiking.groupby('Cluster').agg({'length_meters':'mean','elevation_gain_meters':'mean','bicycle':'sum','surface':'sum','dog':'sum','lit':'sum','fitness':'sum'})

display(grouped_df)

Now we will generate our simulated users using our earlier-defined function.<br>
Again, this is as a bootstrap to the system and will eventually be augmented, and then replaced, with actual users.

In [None]:
df_users=generate_user_data(2000)
display(df_users.head())

In order to support supervised learning, we require labeled data. Using domain knowledge, we will again bootstrap our system by programmatically applying labels - in this case preferred trail clusters. In the future, this will also be augmented/replaced with users' trail rating data.<br>
Some randomness is also applied to our assignments below.

In [None]:
df_users["Trail"]=0

df_users['Trail'] = np.where(
    (df_users['Owns a Bicycle']==1) & (df_users["Mobility Issues"]<3) & ((df_users["Activity Level"]=='High') | (df_users["Activity Level"]=='Extreme') | (df_users["Activity Level"]=='Moderate'))& (df_users["Safety Concerns"]<4) & (df_users['Owns a Dog']==1),
    0,
    df_users['Trail']
)

df_users['Trail'] = np.where(
    ((df_users['Owns a Bicycle']==1) | (df_users['Owns a Dog']==1)) & (df_users["Mobility Issues"]<4) & ((df_users["Activity Level"]=='Low') | (df_users["Activity Level"]=='Moderate')) & (df_users["Safety Concerns"]<6) & (df_users['Trail']==0),
    1,
    df_users['Trail']
)

df_users['Trail'] = np.where(
    ((df_users['Owns a Bicycle']==0) | (df_users['Owns a Dog']==0)) & (df_users["Mobility Issues"]>2) & (df_users["Activity Level"]=='Low') & (df_users["Safety Concerns"]>2) & (df_users['Trail']==0),
    2,
    df_users['Trail']
)

df_users['Trail'] = np.where(
    (df_users['Owns a Bicycle']==1) & ((df_users["Activity Level"]=="Moderate") | (df_users["Activity Level"]=="Low")) & (df_users["Safety Concerns"]>3) & (df_users['Trail']==0),
    3,
    df_users['Trail']
)

df_users['Trail'] = np.where(
    (df_users['Owns a Bicycle']==0) & (df_users["Safety Concerns"]>2) & (df_users['Owns a Dog']==0) & (df_users['Trail']==0),
    4,
    df_users['Trail']
)

df_users['Trail'] = np.where(
    ((df_users['Owns a Bicycle']==1) | (df_users['Owns a Dog']==1)) & (df_users["Mobility Issues"]>4) & ((df_users["Activity Level"]=='Low') | (df_users["Activity Level"]=='Moderate')) & (df_users['Trail']==0),
    5,
    df_users['Trail']
)

df_users['Trail'] = np.where(
    (df_users["Mobility Issues"]>2) & (df_users["Activity Level"]=='Extreme') & (df_users['Owns a Dog']==1) & (df_users['Trail']==0),
    6,
    df_users['Trail']
)

df_users['Trail'] = np.where(
    (df_users['Owns a Bicycle']==1) & (df_users["Safety Concerns"]>2) & ((df_users["Activity Level"]=='Extreme') | (df_users["Activity Level"]=='Moderate')) & (df_users['Trail']==0),
    7,
    df_users['Trail']
)

df_users['Trail'] = np.where(
    (df_users['Owns a Bicycle']==1) & ((df_users["Mobility Issues"]>3) | (df_users['Safety Concerns']>4)) & (df_users['Trail']==0),
    8,
    df_users['Trail']
)

#introduce 15% randomness to the cluster assignments
np.random.seed(11)
sampled_indices = df_users.sample(frac=0.15, random_state=11).index
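#np.random.randint excludes the upper bound, so passing best_n_clusters covers labels 0 through best_n_clusters-1
df_users.loc[sampled_indices, 'Trail'] = np.random.randint(0, best_n_clusters, size=len(sampled_indices))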


Below we can see how many users are assigned to each trail cluster. The users are well-distributed.

In [None]:
#count how many users are assigned to each trail cluster
print(df_users["Trail"].value_counts())

In this next section, we will try a couple of supervised learning models while also testing various hyperparameters within each.<br>
We will use the best f1 score (which is a good balance of precision and recall) to select our best parameters and then accuracy to select our best model.<br>
This process would likely be repeated periodically - maybe quarterly or even yearly so as to reduce variance in the user experience and to prevent any sensitivity to short-lived trends.
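
To show what the weighted f1 score combines, the sketch below computes it by hand on a tiny two-class example: per-class precision and recall give a per-class f1, and the per-class values are averaged using each class's share of the true labels as the weight. The function `weighted_f1` and the sample labels are hypothetical; in the notebook itself the score comes from scikit-learn.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores."""
    support = Counter(y_true)  # number of true samples per class
    total = 0.0
    for cls, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = n - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        total += n * f1
    return total / len(y_true)

y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = weighted_f1(y_true, y_pred)
print(f"accuracy={accuracy:.2f}, weighted f1={score:.2f}")
```

Here class 0 has an f1 of 1/2 and class 1 an f1 of 2/3; weighting by supports 2 and 3 gives (2 × 1/2 + 3 × 2/3) / 5 = 0.6.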

In [None]:
#using get_dummies to create binary features for these categorical variables
df_users=pd.get_dummies(df_users,columns=["Sex","Activity Level"],drop_first=True,dtype=int)

In [None]:
df_users.info()

Above is our final dataset to be used for the supervised learning models. Let's now take a look at a correlation matrix/heatmap of all the features to the target.

In [None]:
#create a correlation matrix/heatmap of all the features to the target
df=df_users.drop(columns=["Name"])
correlation_matrix = df.corr()
correlations_with_target = correlation_matrix[['Trail']].drop(index='Trail')

plt.figure(figsize=(6, 6))
sns.heatmap(correlations_with_target, annot=True, cmap='coolwarm', cbar=True, linewidths=0.5)
plt.title("Correlation with Target: Trails")
plt.show()

The first supervised model we will try is a RandomForest.<br>
We start by doing a 5-fold grid search on various hyperparameters.

In [None]:
X = df.drop(columns='Trail')
y = df['Trail']

rf=RandomForestClassifier(random_state=3)
param_grid={
  'max_features': ['sqrt','log2'], #'auto' was an alias for 'sqrt' and is rejected by recent scikit-learn releases
  'criterion': ["gini","entropy"],
  'max_depth': range(4, 9)
}
grid_search = GridSearchCV(rf, param_grid, cv=5, scoring='f1_weighted', error_score=0)
grid_search.fit(X,y)

print(f'Best Parameters: {grid_search.best_params_}')
print(f'Best Cross Validation F1: {grid_search.best_score_}')

And now using these optimal hyperparameters, we will train a model using 80% of our data for training and 20% for testing.

In [None]:
X = df.drop(columns='Trail')
y = df['Trail']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

rf_model = RandomForestClassifier(random_state=11,max_depth=grid_search.best_params_.get("max_depth"),criterion=grid_search.best_params_.get("criterion"),max_features=grid_search.best_params_.get("max_features"))
rf_model.fit(X_train, y_train)

y_pred = rf_model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred, average='weighted')

print(f"Accuracy: {accuracy:.3f}")
print(f"F1 Score (Weighted): {f1:.3f}")

We have achieved an accuracy of .85 with the RandomForest model. This is quite solid.<br>
Before moving on to the next model, first a look at the feature importances in the RandomForest model.

In [None]:
importances = rf_model.feature_importances_

feature_importances = pd.DataFrame({
    'Feature': X.columns,
    'Importance': importances
})

feature_importances = feature_importances.sort_values(by='Importance', ascending=False)

plt.figure(figsize=(8, 4))
plt.barh(feature_importances['Feature'], feature_importances['Importance'], color='skyblue')
plt.xlabel('Feature Importance')
plt.title('Feature Importances in Random Forest')
plt.gca().invert_yaxis()
plt.show()

Now for the same process with a Logistic Regression model...

In [None]:
X = df.drop(columns='Trail')
y = df['Trail']

log_reg = LogisticRegression(random_state=11,max_iter=1500)
param_grid = {
    'C': [0.001, 0.01, 0.1, 1, 10, 100],
    'solver': ['lbfgs', 'newton-cg', 'saga'],
}
grid_search = GridSearchCV(log_reg, param_grid, cv=5, scoring='f1_weighted', error_score=0)
grid_search.fit(X, y)

print(f'Best Parameters: {grid_search.best_params_}')
print(f'Best Cross Validation F1: {grid_search.best_score_}')

In [None]:
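#multi_class is left at its default: these solvers already fit a multinomial model for multiclass targets,
#and the parameter is deprecated in recent scikit-learn releases
log_reg_model = LogisticRegression(solver=grid_search.best_params_.get("solver"), random_state=11, max_iter=1500, C=grid_search.best_params_.get("C"))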
log_reg_model.fit(X_train, y_train)

y_pred = log_reg_model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred, average='weighted')

print(f"Accuracy: {accuracy:.3f}")
print(f"F1 Score (Weighted): {f1:.3f}")

In [None]:
feature_names = X.columns
coefficients = log_reg_model.coef_

mean_abs_coefficients = np.mean(np.abs(coefficients), axis=0)

feature_importance = pd.DataFrame({
    'Feature': feature_names,
    'Importance': mean_abs_coefficients
}).sort_values(by='Importance', ascending=False)

plt.figure(figsize=(8, 4))

plt.barh(feature_importance['Feature'], feature_importance['Importance'], color='skyblue')
plt.xlabel('Average Coefficient Magnitude')
plt.ylabel('Feature')
plt.title('Feature Importance in Logistic Regression')
plt.gca().invert_yaxis()  # To display the highest importance at the top

plt.show()

Our accuracy for Logistic Regression was .73, falling below the mark of .85 set by RandomForest. Therefore we will use the RandomForest model to make our final predictions.<br>
In the production version of this application, when a new user signs up, they will fill out a questionnaire which will supply the required feature values. The model will then be used to predict their trail recommendation using that data. For the purpose of this demonstration, we will simulate a new set of users, use our model to generate predictions, and provide the output for a selected user.

In [None]:
#generate a set of simulated users and apply the required feature engineering
df_new_users=generate_user_data(100)
df_new_users=pd.get_dummies(df_new_users,columns=["Sex","Activity Level"],drop_first=True,dtype=int)
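
Applying `get_dummies` to a batch of 100 random users works because every category is almost certain to appear. A single sign-up is different: one questionnaire contains only one `Sex` value and one `Activity Level`, so the dummy columns would not line up with what the model was trained on. The sketch below shows one possible explicit encoding, assuming the training columns are those produced by `get_dummies(..., drop_first=True)`, where `Female` and `Extreme` (first alphabetically) are the dropped baselines. The names `encode_user` and `FEATURE_ORDER` are hypothetical; in the notebook the column order would come from `rf_model.feature_names_in_`.

```python
# hypothetical stand-in for rf_model.feature_names_in_
FEATURE_ORDER = [
    'Age', 'Owns a Dog', 'Enjoys Leisure Activities', 'Owns a Bicycle',
    'Enjoys Physical Challenges', 'Mobility Issues', 'Safety Concerns',
    'Sex_Male', 'Activity Level_High', 'Activity Level_Low', 'Activity Level_Moderate',
]

def encode_user(answers, feature_order):
    """Turn one questionnaire response into a row with the model's exact columns."""
    row = dict(answers)
    sex = row.pop('Sex')
    level = row.pop('Activity Level')
    # 'Female' and 'Extreme' are the dropped baselines, encoded as all zeros
    row['Sex_Male'] = int(sex == 'Male')
    for name in ('High', 'Low', 'Moderate'):
        row[f'Activity Level_{name}'] = int(level == name)
    return [row[col] for col in feature_order]

answers = {
    'Age': 30, 'Sex': 'Male', 'Activity Level': 'Low', 'Owns a Dog': 1,
    'Enjoys Leisure Activities': 7, 'Owns a Bicycle': 0,
    'Enjoys Physical Challenges': 4, 'Mobility Issues': 2, 'Safety Concerns': 5,
}
encoded = encode_user(answers, FEATURE_ORDER)
print(encoded)  # [30, 1, 7, 0, 4, 2, 5, 1, 0, 1, 0]
```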

In [None]:
X=df_new_users[rf_model.feature_names_in_]
y_pred=rf_model.predict(X)
df_new_users["Trail"]=y_pred
#read the name and the predicted cluster from the same (first) user
name=df_new_users.iloc[0]['Name']
cluster=df_new_users.iloc[0]['Trail']

In [None]:
print(f"When user {name} logs in, they will be recommended trail cluster {cluster}.")
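
A cluster number on its own is not yet a recommendation; the user ultimately needs trail names. The sketch below shows one way the lookup could work on plain records shaped like rows of `df_hiking` (for example, the output of `df_hiking.to_dict('records')`). The function `trails_in_cluster`, the sample trails, and the choice to list shorter trails first are all assumptions made for illustration.

```python
def trails_in_cluster(trails, cluster, limit=5):
    """Return up to `limit` trail names from the given cluster, shortest first."""
    matches = [t for t in trails if t['Cluster'] == cluster]
    matches.sort(key=lambda t: t['length_meters'])
    return [t['trail name'] for t in matches[:limit]]

# hypothetical records using the same column names as df_hiking
sample_trails = [
    {'trail name': 'Ridge Loop', 'length_meters': 4200.0, 'Cluster': 3},
    {'trail name': 'Creek Path', 'length_meters': 900.0, 'Cluster': 3},
    {'trail name': 'Summit Spur', 'length_meters': 2500.0, 'Cluster': 1},
]
recommended = trails_in_cluster(sample_trails, cluster=3)
print(recommended)  # ['Creek Path', 'Ridge Loop']
```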

And it works! Our new user has a set of trails recommended to them thanks to our unsupervised model, which clusters trails together, and our supervised model, which predicts the right trail cluster for a user based on their features.<br>
Near-term enhancements could be:
*   Augment our simulated user data with data from a service such as Strava (requires a paid developer account)
*   Expand our application beyond hiking trails to include other outdoor activities such as kayaking
*   Extend the data and models beyond West Virginia and incorporate user location into the recommendation engine
*   Replace simulated users with actual user data in the user-based modeling as our population grows, as mentioned earlier
*   Re-examine, evaluate, and retrain the models on the planned schedules described earlier



