# **Executive Summary**

The objective of this project is to build a recommendation engine that suggests hiking trails to a user based on their location, preferred trail type, difficulty level, distance, and similar filters, along with the hikes they have taken in the past (to find similar trails).

www.alltrails.com is a detailed source of information on hiking trails. It is also a convenient place to curate trails and track the ones a user has hiked. However, personalization based on a user's choices is not available in the free version.

Through our recommendation engine, we aim to provide finer-grained trail suggestions that take into account the user's history as well as other filters. We are also on the lookout for interesting trends around a place or trail.

# **Data sources**

We wanted to obtain data for hiking trails throughout the United States. We first attempted to scrape this data from www.alltrails.com. However, the website does not currently allow API requests or Selenium-based scraping, so we used a trail dataset from www.kaggle.com/planejane/national-park-trails instead.

We enhanced the data with campground and review information based on geolocation, using Google APIs. We used the "Place Search" API, specifically the "Nearby Search" endpoint, to find campgrounds within 3 miles of the geolocation of each trail in our data set. We then used the "Place Details" API to obtain relevant information about the campgrounds found in the previous search. We repeated this process multiple times to obtain the dataset used in this project.
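
The two-step lookup described above can be sketched as follows. This is only an illustration, not the code used for the project (that lives in a separate notebook): the helper names, the `api_key` argument, the `campground` type filter, and the list of requested fields are assumptions, and the 3-mile radius is converted to meters because the Nearby Search endpoint expects meters.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

NEARBY_URL = "https://maps.googleapis.com/maps/api/place/nearbysearch/json"
DETAILS_URL = "https://maps.googleapis.com/maps/api/place/details/json"
METERS_PER_MILE = 1609.344


def nearby_params(lat, lng, api_key, radius_miles=3.0):
    """Build the query parameters for a Nearby Search around one trail."""
    return {
        "location": f"{lat},{lng}",
        "radius": round(radius_miles * METERS_PER_MILE),  # radius is given in meters
        "type": "campground",  # assumed place-type filter
        "key": api_key,
    }


def details_params(place_id, api_key):
    """Build the query parameters for a Place Details request."""
    return {
        "place_id": place_id,
        # Assumed selection of fields; adjust to what the analysis needs
        "fields": "name,website,formatted_address,formatted_phone_number,rating,reviews",
        "key": api_key,
    }


def get_json(url, params):
    """Send a GET request and decode the JSON response."""
    with urlopen(f"{url}?{urlencode(params)}", timeout=10) as resp:
        return json.load(resp)


def campgrounds_near(lat, lng, api_key):
    """Return the details of every campground found near one trail location."""
    found = get_json(NEARBY_URL, nearby_params(lat, lng, api_key)).get("results", [])
    return [
        get_json(DETAILS_URL, details_params(place["place_id"], api_key)).get("result", {})
        for place in found
    ]
```

Because every call is billed, the results of `campgrounds_near` would normally be written to a CSV once and reused, rather than requested on each run.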

Additionally, we integrated weather data into the dataset using Weather API.





# **Data Limitations**

Since the primary dataset is from Kaggle, the trail data is not updated in real time, although reviews and weather are real-time data points. As a future enhancement, we need to find a way to keep the AllTrails data current, so that recommendations to a user are based on up-to-date trail information.

# **Data Components Description**



We will review the AllTrails (www.alltrails.com) trail data obtained through the Kaggle dataset and enhance it by adding features.

In [None]:
!pip install gmaps
!pip install ipywidgets
!pip install widgetsnbextension
!pip install plotly==4.6
!pip install geopandas
!pip install geoplot


In [None]:
from google.colab import output
output.enable_custom_widget_manager()

We import all the libraries required to analyse the data and draw exploratory graphs.

In [None]:
## Import library
import numpy as np
from numpy import mean
import pandas as pd
from pandas_profiling import ProfileReport
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")
import matplotlib.pyplot as plt
import matplotlib.colors as colors
%matplotlib inline
from operator import itemgetter
import ast
from ast import literal_eval

## Plotly Packages
from plotly import tools
import plotly.graph_objs as go
import plotly.express as px
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)
import ipywidgets
from IPython.display import HTML
import gmaps 
import gmaps.datasets 
import geopandas as gpd
import geoplot as gplt
import scipy.stats as stats

**DATA PROFILING**

The data is stored as CSV files on Google Drive. We connect to Google Drive and fetch each file by its file ID.

In [None]:
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
# Authenticate and create the PyDrive client.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

In [None]:
link="https://drive.google.com/file/d/1dhWpdeVgKuWACcBtGY8mk_Pq3_4bOyJb/view?usp=sharing"
id_trails="1dhWpdeVgKuWACcBtGY8mk_Pq3_4bOyJb"  # named id_trails to avoid shadowing the built-in id()
link_places="https://drive.google.com/file/d/1Jpsa7goEcEbw6ihIrnXLPUSj7Mp9fLP0/view?usp=sharing"
id_places="1Jpsa7goEcEbw6ihIrnXLPUSj7Mp9fLP0"
link_weather="https://drive.google.com/file/d/15LTgS1LC45CkAf_vfHBhqnePUJTpO7-0/view?usp=sharing"
id_weather="15LTgS1LC45CkAf_vfHBhqnePUJTpO7-0"

In [None]:
downloaded = drive.CreateFile({'id':id_trails})
downloaded.GetContentFile('AllTrails.csv')

In [None]:
download_places = drive.CreateFile({'id':id_places}) 
download_places.GetContentFile('places_data.csv') 

In [None]:
download_weather = drive.CreateFile({'id':id_weather}) 
download_weather.GetContentFile('weather.csv') 

In [None]:
all_trails= pd.read_csv('AllTrails.csv')
all_trails.head()

In [None]:
places= pd.read_csv('places_data.csv')
places.head()

In [None]:
all_trails.info()

## ADDING RELEVANT FEATURES IN THE DATA

**BREAKING THE GEOLOCATION INTO LATITUDE AND LONGITUDE FOR ANALYSIS**

In [None]:
#We need latitude and longitude columns for geolocation analysis
all_trails[['lat','lng']] = all_trails['_geoloc'].apply(lambda x: pd.Series(str(x).split(",")))
all_trails['lat'] = all_trails['lat'].apply(lambda x: (x.split(':')[1].split()[-1])).astype(float)
all_trails['lng'] = all_trails['lng'].apply(lambda x: (x.split(':')[1].split()[-1][:-1])).astype(float)
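
The string splitting above depends on the exact layout of the `_geoloc` text. As an alternative sketch, assuming each value is a dict literal such as `{'lat': 60.18852, 'lng': -149.63156}` (the layout the splitting logic implies), the value can be parsed directly. The helper name `parse_geoloc` is hypothetical.

```python
import ast


def parse_geoloc(value):
    """Parse a dict-literal string with 'lat' and 'lng' keys into a (lat, lng) tuple."""
    loc = ast.literal_eval(str(value))
    return float(loc['lat']), float(loc['lng'])


# Illustrative value, not taken from the dataset
example = "{'lat': 60.18852, 'lng': -149.63156}"
lat, lng = parse_geoloc(example)
print(lat, lng)
```

Applied to the DataFrame, this could be used as `all_trails['_geoloc'].apply(lambda s: pd.Series(parse_geoloc(s)))`, which yields the two columns in one pass.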


**ADDING CAMPGROUND DATA & REVIEWS**







To add the campground and review data, we need the geolocation of every trail in the 'all_trails' data set. Please refer to the notebook https://colab.research.google.com/drive/1TUDmbXETWNweGJGhRlliIAe1X9g4Hfmh?usp=sharing, which contains the code that extracts this information for each geolocation.

In [None]:
#Creating temporary column 'Id_loc' which is useful in joining the campground data and reviews with the original dataset
all_trails['Id_loc'] = all_trails.lat.astype(str).str.cat(all_trails.lng.astype(str), sep=',')

We saved the campground and review information in a CSV file, which we imported in the cells above. We now use that CSV to merge the data into the all_trails DataFrame.

Note: We do not call the Google Places API on every run, as it is costly. We stored the data after running it the first time.

In [None]:
# Profiling the obtained data
df_orig=places
print(df_orig.info())
print(len(df_orig))
df_orig.head()

To merge the obtained data set with the 'all_trails' data set we need to do some data cleaning first.

In [None]:
print(df_orig.Time.drop_duplicates())

In [None]:
# Convert column Time from text to number of months (a week counts as 0.25 months)
replace_values = {
    'in the last week': 0, 'a week ago': 0.25, '2 weeks ago': 0.5, '3 weeks ago': 0.75, '4 weeks ago': 1,
    'a month ago': 1, '2 months ago': 2, '3 months ago': 3, '4 months ago': 4, '5 months ago': 5,
    '6 months ago': 6, '7 months ago': 7, '8 months ago': 8, '9 months ago': 9, '10 months ago': 10,
    '11 months ago': 11, '12 months ago': 12,
    'a year ago': 12, '2 years ago': 24, '3 years ago': 36, '4 years ago': 48, '5 years ago': 60,
    '6 years ago': 72, '7 years ago': 84, '8 years ago': 96, '9 years ago': 108, '10 years ago': 120,
}
df = df_orig.replace({'Time': replace_values})
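
A hard-coded mapping only covers the strings seen in the current extract. As a hedged alternative, the same conversion can be sketched with a small parser. The function name `months_since` is hypothetical, and the rules (a week counts as 0.25 months, a year as 12, 'in the last week' as 0) simply mirror the mapping above. Strings it does not recognise return `None`, so they can be inspected rather than silently mis-converted.

```python
import re

# Months per unit, mirroring the mapping used above
MONTHS_PER_UNIT = {'week': 0.25, 'month': 1, 'year': 12}
PATTERN = re.compile(r'^(a|an|\d+)\s+(week|month|year)s?\s+ago$')


def months_since(text):
    """Convert a relative time description such as '3 weeks ago' to months."""
    text = str(text).strip().lower()
    if text == 'in the last week':
        return 0
    match = PATTERN.match(text)
    if match is None:
        return None  # unrecognised description
    count = 1 if match.group(1) in ('a', 'an') else int(match.group(1))
    return count * MONTHS_PER_UNIT[match.group(2)]


for sample in ['a month ago', '4 weeks ago', '2 years ago', 'in the last week']:
    print(sample, '->', months_since(sample))
```

It could then be applied with `df_orig['Time'].apply(months_since)` in place of the dictionary lookup.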

In [None]:
#Group rows by campground to get the average rating and the average age of its reviews in months.
campground_keys = ['Id_loc','Name','Website', 'Address', 'Phone_number']
df_campground = (df.groupby(campground_keys)
                   .agg(Rating=('Rating', 'mean'), Avg_Time_months=('Time', 'mean'))
                   .reset_index())
df_campground

In [None]:
#Group rows by location ('Id_loc') to get the number of campgrounds around it, their average rating and the average age of their reviews in months.
df_reviews = (df_campground.groupby('Id_loc')
                           .agg(Number_of_Campgrounds=('Name', 'count'),
                                Rating=('Rating', 'mean'),
                                Avg_Time_months=('Avg_Time_months', 'mean'))
                           .reset_index())
df_reviews

To merge the resulting data set 'df_reviews' with our initial 'all_trails' data set, we use 'Id_loc' as the key. We use a left join because some trails may have no campground information or reviews.

In [None]:
all_trails = pd.merge(all_trails, df_reviews, on ='Id_loc', how ="left")
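
Because 'Id_loc' is a string built from floating-point coordinates, it can be worth checking how many trails actually found a match. The sketch below uses small made-up frames (the names `trails` and `reviews` and their values are illustrative only) and the `indicator=True` option of `merge`, which adds a `_merge` column marking each row as `both` or `left_only`.

```python
import pandas as pd

# Illustrative data only
trails = pd.DataFrame({
    'name': ['A', 'B', 'C'],
    'lat': [60.18852, 36.1, 44.5],
    'lng': [-149.63156, -112.1, -110.0],
})
trails['Id_loc'] = trails.lat.astype(str).str.cat(trails.lng.astype(str), sep=',')

reviews = pd.DataFrame({
    'Id_loc': ['60.18852,-149.63156', '36.1,-112.1'],
    'Number_of_Campgrounds': [2, 1],
})

# Left join that records where each row came from
checked = trails.merge(reviews, on='Id_loc', how='left', indicator=True)
match_counts = checked['_merge'].value_counts()
unmatched = checked.loc[checked['_merge'] == 'left_only', 'name'].tolist()
print(match_counts)
print(unmatched)
```

A large `left_only` count would suggest that the keys are formatted differently on the two sides rather than that campgrounds are genuinely missing.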

Once the merge is complete, we drop the column Id_loc as it is no longer needed.

In [None]:
all_trails=all_trails.drop('Id_loc',axis=1)
all_trails

**ADDING WEATHER DATA**

We extracted the weather information using the Weather API and saved it to a CSV for exploratory purposes.

In [None]:
#Load the saved weather data and add the current conditions for each trail location
import ast
w = pd.read_csv('weather.csv', index_col=0)


weather_features = ['temp_c', 'condition', 'wind_kph', 'pressure_mb', 'precip_mm', 'humidity', 'cloud', 'feelslike_c', 'vis_km', 'uv', 'gust_kph']

# Parse each stored 'current' record once; row i of the weather data belongs to row i of all_trails
current = [ast.literal_eval(w['current'][i]) for i in range(len(all_trails))]

for j in weather_features:
    if j == 'condition':
        # 'condition' is a nested record; keep only its text description
        all_trails[j] = [rec[j]['text'] for rec in current]
    else:
        all_trails[j] = [rec[j] for rec in current]

In [None]:
all_trails

**CONVERTING DIFFICULTY INTO READABLE FORMAT**

In [None]:
#Create difficulty_rating definition
def definition(difficulty_rate):
    '''Translate a trail's numeric difficulty rating into a label
    >>>Input -> Output
    1 -> easy
    3 -> moderate
    5 -> hard
    7 -> strenuous (also returned for any other value)
    '''
    if difficulty_rate == 1:
        return 'easy'
    elif difficulty_rate == 3:
        return 'moderate'
    elif difficulty_rate == 5:
        return 'hard'
    else:
        return 'strenuous'
all_trails['difficulty'] = all_trails['difficulty_rating'].apply(definition)

**CORRECTING THE COUNTRY NAMES OF TRAILS IN HAWAII**

The data has some issues with the country and state names of trails in Hawaii. We have corrected them manually.

In [None]:
all_trails['country_name'] = all_trails['country_name'].apply(lambda x: x.replace("Hawaii", "United States"))
all_trails['state_name'] = all_trails['state_name'].apply(lambda x: x.replace("Maui", "Hawaii"))

**DISTANCE IN MULTIPLE UNITS**

As we might need the distance of each trail in multiple units, while building business insights, we convert it to inches and miles.

In [None]:
# A function to convert the distance to inches
def convert(units, length):
    """ Convert meters to inches; other units are returned unchanged
        Input: 1 meter
        Output: 39.3701 inches
    """
    if units == 'm':
        return length * 39.3701
    else:
        return length
# A function to convert the distance to miles
def convert_to_miles(units, length):
    """ Convert meters ('m') or inches ('i') to miles; other units are returned unchanged
        Input: 1 meter
        Output: 0.000621371 miles
    """
    if units == 'm':
        return length * 0.000621371
    elif units == 'i':
        return length * 1.57828e-5
    else:
        return length
#Create new length column
all_trails['length_inches'] = all_trails.apply(lambda x: convert(units = x['units'], length = x['length']), axis = 1)
all_trails['length_miles'] = all_trails.apply(lambda x: convert_to_miles(units = x['units'], length = x['length']), axis = 1)
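
The conversion factors used above can be cross-checked against the exact definitions (1 inch = 0.0254 m, 1 mile = 1609.344 m = 63360 inches). The short sketch below is only a sanity check; the helper names are hypothetical.

```python
METERS_PER_INCH = 0.0254    # exact by definition
METERS_PER_MILE = 1609.344  # exact by definition
INCHES_PER_MILE = 63360     # 5280 feet * 12 inches


def meters_to_inches(meters):
    return meters / METERS_PER_INCH


def meters_to_miles(meters):
    return meters / METERS_PER_MILE


def inches_to_miles(inches):
    return inches / INCHES_PER_MILE


# Compare the rounded constants used in the notebook with the exact factors
print(abs(meters_to_inches(1) - 39.3701) < 1e-4)
print(abs(meters_to_miles(1) - 0.000621371) < 1e-9)
print(abs(inches_to_miles(1) - 1.57828e-5) < 1e-10)
```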

**ADDING COLUMNS CORRESPONDING TO THE FEATURES AND ACTIVITIES OF EACH TRAIL**

1. Adding total counts of features and activities present in each trail
2. Converting all the activities and features into categorical features. 

In [None]:
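# Each cell holds a list literal such as "['a', 'b']"; parse it so that an empty list counts as 0
all_trails['features_count'] = all_trails['features'].apply(lambda x: len(ast.literal_eval(x)))
all_trails['activities_count'] = all_trails['activities'].apply(lambda x: len(ast.literal_eval(x)))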

In [None]:
#generate list of unique features
list_of_features=[]
for feat in all_trails['features']:
  mp=str(feat)
  mp=mp.strip()
  mp=mp.replace('[','')
  mp=mp.replace(']','')
  mp=mp.replace("'",'')
  mp=mp.replace(' ','')
  li=list(mp.split(','))
  for k in li:
    if k not in list_of_features:
      list_of_features.append(k)
print(list_of_features)

In [None]:
#generate list of unique activities
list_of_activities=[]
for act in all_trails['activities']:
  mp=str(act)
  mp=mp.replace('[','')
  mp=mp.replace(']','')
  mp=mp.replace("'",'')
  mp=mp.replace(' ','')
  li=list(mp.split(','))
  for k in li:
    if k not in list_of_activities:
      list_of_activities.append(k)
print(list_of_activities)

In [None]:
#Based on the features and activities present for each trail, we update the categorical variable
for ind,row in all_trails.iterrows():
  mp=str(row['features'])
  mp=mp.replace('[','')
  mp=mp.replace(']','')
  mp=mp.replace("'",'')
  mp=mp.replace(' ','')
  li=list(mp.split(','))
  #print(li)
  for feat in list_of_features:
    if feat in li:
      all_trails.at[ind,'feature_'+feat]=1
    else:
      all_trails.at[ind,'feature_'+feat]=0
    #print(row['feature_'+feat])
  mp=str(row['activities'])
  mp=mp.replace('[','')
  mp=mp.replace(']','')
  mp=mp.replace("'",'')
  mp=mp.replace(' ','')
  li=list(mp.split(','))
  #print(li)
  for feat in list_of_activities:
    if feat in li:
      all_trails.at[ind,'activity_'+feat]=1
    else:
      all_trails.at[ind,'activity_'+feat]=0
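
The row-by-row encoding above can also be expressed with pandas' `explode` and `get_dummies`. The following is a minimal sketch on made-up data, assuming each cell holds a list literal such as `"['views', 'river']"`. The helper name `one_hot_from_list_strings` and the sample values are illustrative only.

```python
import ast

import pandas as pd


def one_hot_from_list_strings(series, prefix):
    """Turn a column of list-literal strings into 0/1 indicator columns."""
    parsed = series.apply(ast.literal_eval)
    # explode() gives one row per list item (NaN for an empty list), keeping the original index
    dummies = pd.get_dummies(parsed.explode(), prefix=prefix, prefix_sep='_', dtype=int)
    # Collapse back to one row per trail: 1 if the item appeared at all
    return dummies.groupby(level=0).max()


demo = pd.DataFrame({'features': ["['dogs-no', 'views']", "['views', 'river']", "[]"]})
encoded = one_hot_from_list_strings(demo['features'], 'feature')
print(encoded)
```

The result shares the index of the input, so it could be attached to the original frame with `pd.concat([demo, encoded], axis=1)`.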


In [None]:
all_trails

We drop '_geoloc' because it has already been split into latitude and longitude. We also drop 'visitor_usage', as its meaning is unclear to us.

In [None]:
#remove geoloc variables
all_trails.drop(['_geoloc'],axis = 1, inplace = True)
all_trails.drop(['visitor_usage'],axis = 1, inplace = True)

#Review and check the dataset
all_trails.isnull().sum()

In [None]:
#from google.colab import drive
#drive.mount('drive')

In [None]:
#all_trails.to_csv('feature_data_allTrails.csv')
#!cp feature_data_allTrails.csv "drive/MyDrive/243/Data/"

**EXPLORATORY DATA ANALYSIS (EDA)**

We build some general graphs to get a look and feel of the data.

In [None]:
all_trails.describe().T

**EXPLANATION OF VARIABLES IN THE DATA**

The description of the features is maintained at https://docs.google.com/spreadsheets/d/1NAlbgAjMXVL7KwVAERaCoqXXx8pq3Ihs9hH5pbWq0OE/edit?usp=sharing

In [None]:
all_trails.columns

In [None]:
#For the heatmap, we are reducing the number of features to find the correlation between relevant features
analysis_df=all_trails[['trail_id','name','area_name','city_name','state_name','country_name','popularity',
                        'length','elevation_gain','difficulty_rating','route_type','avg_rating','num_reviews','Rating','Number_of_Campgrounds',
                        'lat','lng','temp_c','feelslike_c','humidity']]
analysis_df

In [None]:
#creating plots on dataset
# Histogram of each attribute
analysis_df.hist(bins=30,figsize=(20,15))
plt.show()

In [None]:
all_trails

In [None]:
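#Heat map of correlations - check which variables are linearly related
# Only numeric columns can be correlated, so select them explicitly
corr = analysis_df.select_dtypes(include='number').corr().round(2)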
sns.set(rc={"figure.figsize":(12,8)})
heatmap = sns.heatmap(corr, annot = True, vmin=-1, vmax=1, center= 0, cmap=sns.diverging_palette(20, 220, n=200),linewidths=.5)
plt.show()

**From the correlation heat map, we find that the number of reviews, average rating, and features count are positively correlated with popularity.**



1) Number of Trails in the National Parks by States

In [None]:
# States count
all_trails['state_name'].unique()
#Total summary of trails by state
Total_count = all_trails[['trail_id','state_name','route_type']].groupby(['state_name', 'route_type']).count().reset_index()
Total_count

In [None]:

#Bar Chart of total summary of trails by state
fig_sumtrails_group= px.bar(Total_count, x= 'state_name', y='trail_id', text= 'trail_id', color = 'route_type', title = 'Total Trails count by States')
HTML(fig_sumtrails_group.to_html())

Top 10 trails in National Parks by number of reviews 

In [None]:
#Top N function
def top(df, n, column):
    """ return top N  records"""
    return df.sort_values(by=column)[-n:]

In [None]:
#New data frame
NO_reviews = all_trails[['area_name','name','route_type','num_reviews','popularity']]
# Top 10 Number of Reviews of National Parks and trails
Top_10_NP = top(NO_reviews, n=10, column = 'num_reviews').sort_values(by = 'num_reviews', ascending=True)
Top_10_NP

Top 10 popular trails in National Parks by number of reviews

In [None]:
#Top trails by number of reviews
fig = px.bar(Top_10_NP, x="name", y="num_reviews", text = 'num_reviews',color="route_type",title="Top 10 Trails in National Parks by Number of Reviews",
            labels={
                     "name": "Trail Name",
                     "num_reviews": "Number of Reviews",
                     "route_type": "Route Type"
                 },)
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.update_traces(texttemplate='%{text:.2s}', textposition='outside')
#fig.show()
HTML(fig.to_html())

In [None]:
#Top 10 Activities for all the trails
mega_act_list = []

for index, row in all_trails.iterrows():
    for item in ast.literal_eval(row['activities']):
        mega_act_list.append(item)
        
#print(mega_act_list)
mega_act_df = pd.DataFrame(mega_act_list, columns = ["activity"])
#print(mega_act_df)

act_count = mega_act_df.groupby("activity")["activity"].agg(Total = 'count')

In [None]:
#Top 10 Activities of trails
Top_10_activities = top(act_count, n=10, column = 'Total').sort_values(by = 'Total', ascending=False)
Top_10_activities

**Hiking, nature-trips, and birding are the activities that appear most frequently across the trails.**

In [None]:
#Top 10 features for all the trails
#Calculation of trail feature counts
mega_feat_list = []

for index, row in all_trails.iterrows():
    for item in ast.literal_eval(row['features']):
        mega_feat_list.append(item)

mega_feat_df = pd.DataFrame(mega_feat_list, columns = ["features"])
#print(mega_act_df)

feat_count = mega_feat_df.groupby("features")["features"].agg(Total = 'count')

In [None]:
#Top 10 Features for trails
Top_10_features = top(feat_count, n=10, column = 'Total').sort_values(by = 'Total', ascending=False)
Top_10_features
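
The counting done above for activities and features can also be sketched with `collections.Counter`, which avoids building an intermediate DataFrame. The sample strings and the helper name `count_list_items` below are illustrative and not taken from the dataset.

```python
import ast
from collections import Counter


def count_list_items(values):
    """Count how often each item appears across a column of list-literal strings."""
    counter = Counter()
    for raw in values:
        counter.update(ast.literal_eval(raw))
    return counter


demo = ["['hiking', 'birding']", "['hiking', 'nature-trips']", "['hiking']"]
activity_counts = count_list_items(demo)
top_two = activity_counts.most_common(2)
print(top_two)
```

On the real data this could be called as `count_list_items(all_trails['activities']).most_common(10)`.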

In [None]:
fig1 = px.scatter_mapbox(all_trails,
                        lat="lat",
                        lon="lng",
                        hover_name="name",
                        hover_data=["popularity"],
                        color="difficulty_rating",
                        zoom=3,
                        height=600,
                        size="popularity",
                        size_max=30,
                        opacity=0.4,
                        width=1300,
                        color_continuous_scale='Geyser',
                        title="Trails in the US by popularity and difficulty")
fig1.update_layout(mapbox_style='carto-positron')
fig1.update_layout(geo_scope='usa')
fig1.update_layout(margin={"r": 0, "t": 0, "l": 0, "b": 0})
fig1.update_layout(title_text="Plots of trails across US by popularity and difficulty")
HTML(fig1.to_html())
#fig1.show()

The visual above shows the trails located across the US. The size of each point represents the popularity of the trail, and its color shows the difficulty rating of that trail based on the scale shown to the right.

In [None]:
fig2 = px.scatter_mapbox(all_trails,
                        lat="lat",
                        lon="lng",
                        hover_name="name",
                        hover_data=["elevation_gain"],
                        color="elevation_gain",
                        zoom=3,
                        height=600,
                        size="length",
                        size_max=30,
                        opacity=0.4,
                        width=1300,
                        color_continuous_scale='Plasma',
                        title="Trails in the US by length and elevation gain")
fig2.update_layout(mapbox_style='stamen-terrain')
fig2.update_layout(geo_scope='usa')
fig2.update_layout(margin={"r": 0, "t": 0, "l": 0, "b": 0})
fig2.update_layout(title_text="Plots of trails across US by length and elevation gain")
HTML(fig2.to_html())
#fig2.show()

The visual above shows the trails located across the US. The size of each point represents the length of the trail, and its color shows the elevation gain of that trail based on the scale shown to the right.

In [None]:
weather_cols = ['temp_c', 'wind_kph', 'humidity', 'gust_kph']
all_trails_score_weather = all_trails[['trail_id', 'name'] + weather_cols].copy()
# Make sure the weather columns are numeric before comparing them
all_trails_score_weather[weather_cols] = all_trails_score_weather[weather_cols].apply(pd.to_numeric, errors='coerce')

# One point for each condition met, so the score ranges from 0 to 4
sw = all_trails_score_weather
sw['score'] = (
    ((sw['temp_c'] > 15) & (sw['temp_c'] < 30)).astype(int)      # temperature between 15 and 30
    + (sw['wind_kph'] < 10).astype(int)                           # wind below 10 kph
    + ((sw['humidity'] > 30) & (sw['temp_c'] < 50)).astype(int)   # humidity above 30 and temperature below 50
    + (sw['gust_kph'] < 10).astype(int)                           # gusts below 10 kph
)

sns.countplot(x = all_trails_score_weather['score']);
display(all_trails_score_weather[all_trails_score_weather['score']==4]['name'])

**Data Validation**

In [None]:
#Number of missing values
NA = all_trails.isnull().sum()
display(NA[NA!=0])
display(NA.sum())

In [None]:
Validation_df = all_trails[['popularity', 'length', 'elevation_gain', 'difficulty_rating', 'avg_rating', 'num_reviews']]
fig = plt.figure(figsize=(18.5, 10.5))

# One density plot per factor on a 2 x 3 grid (histplot replaces the deprecated distplot)
for i, factor in enumerate(Validation_df, start=1):
  ax = fig.add_subplot(2, 3, i)
  sns.histplot(Validation_df[factor], kde=True, stat='density', ax=ax)
  ax.set_title(f'{factor} distribution')
  ax.set_ylabel('pdf')

In [None]:
# The file needs an .xlsx extension so that pandas can pick the Excel writer
all_trails.to_excel('all_trails_data.xlsx')