# Combine Routes

Merges the CSV route data with the JSON route data.

In [72]:
import pandas as pd
import re
import os
import json

### Read in the CSV data

In [4]:
scraped = pd.read_csv("scraped.csv")

### Read in the JSON data

In [6]:
directory = "json-routes"

Define a function to create a pandas dataframe from the `"routes"` entry in the JSON file.

In [39]:
def json_to_df(json_file):
    """Return the "routes" entry of an open JSON file as a DataFrame."""
    json_dict = json.load(json_file)
    just_routes = json_dict["routes"]
    return pd.DataFrame(just_routes)
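
As a quick sanity check, the function can be tried on an in-memory file. This is only a sketch: the `"routes"` key comes from the notebook, while the `"success"` key and all field values are made up for illustration.

```python
import io
import json

import pandas as pd


def json_to_df(json_file):
    """Return the "routes" entry of an open JSON file as a DataFrame."""
    json_dict = json.load(json_file)
    return pd.DataFrame(json_dict["routes"])


# Hypothetical payload: only the "routes" key is known from the notebook.
sample = io.StringIO(json.dumps({
    "routes": [
        {"id": 1, "name": "Route A", "stars": 3.5},
        {"id": 2, "name": "Route B", "stars": 4.0},
    ],
    "success": 1,
}))

# Each dict in "routes" becomes one row; the dict keys become the columns.
sample_df = json_to_df(sample)
print(sample_df)
```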

Go through every JSON file in the folder and concatenate their route data into one dataframe.

In [47]:
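directory = "json-routes"
frames = []
for file in os.scandir(directory):
    # Skip subdirectories; only regular files can be opened and parsed
    if not file.is_file():
        continue
    with open(file.path, "r") as json_file:
        frames.append(json_to_df(json_file))
# Concatenate once at the end; each file's row index is kept, so index values repeat
requested = pd.concat(frames)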

### Merge the two datasets

Much of the data overlaps, but the API-pulled (requested) data has a few features that the scraped data lacks, and vice versa.

In [69]:
print(" scraped:\n\n", scraped.columns, "\n\n", "requested:\n\n", requested.columns)

 scraped:

 Index(['Unnamed: 0', 'Route', 'Location', 'URL', 'Avg Stars', 'Your Stars',
       'Route Type', 'Rating', 'Pitches', 'Length', 'Area Latitude',
       'Area Longitude'],
      dtype='object') 

 requested:

 Index(['id', 'name', 'type', 'rating', 'stars', 'starVotes', 'pitches',
       'location', 'url', 'imgSqSmall', 'imgSmall', 'imgSmallMed', 'imgMedium',
       'longitude', 'latitude'],
      dtype='object')


Specifically, the **scraped** data has the vertical *length* of the route (in feet), the *stars* I've personally given routes I've climbed (the *Your Stars* column), and the *"Area" lat/lon* instead of the *route lat/lon*.

From a quick inspection, the lat/lon values are the same, if truncated to fewer decimal places in the scraped data.

The **requested** data, on the other hand, has a *starVotes* column (a nice proxy for traffic) and the *image* links.

Before merging, I want to check for discrepancies between the two datasets. I'll start by adding an *id* column to the scraped dataframe, parsed from the nine-digit route ID in each route's URL.
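
One way to look for discrepancies, once both dataframes carry a route ID column, is an outer merge with `indicator=True`. The sketch below uses small made-up frames (`scraped_demo`, `requested_demo`) rather than the real data, and assumes route names are the field being compared.

```python
import pandas as pd

# Toy stand-ins for the two datasets; all values are made up.
scraped_demo = pd.DataFrame({"id": [1, 2, 3], "Route": ["A", "B", "C"]})
requested_demo = pd.DataFrame({"id": [2, 3, 4], "name": ["B", "C2", "D"]})

# An outer merge keeps every id; the "_merge" column records where each came from.
both = scraped_demo.merge(requested_demo, on="id", how="outer", indicator=True)

# Routes that appear in only one of the two datasets.
only_one_side = both[both["_merge"] != "both"]

# Routes present in both datasets whose names disagree.
name_mismatch = both[(both["_merge"] == "both") & (both["Route"] != both["name"])]

print(only_one_side)
print(name_mismatch)
```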

In [95]:
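# Pull the nine-digit route ID out of each route URL (raw string so \d is not treated as a string escape)
get_id = lambda entry: int(re.search(r"\d{9}", entry).group())
scraped["id"] = scraped["URL"].apply(get_id)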
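
If a URL has no nine-digit run, `re.search` returns `None` and `.group()` raises an `AttributeError`. A more defensive variant might look like the sketch below; `get_id_or_none` is a hypothetical helper and both URLs are made-up examples.

```python
import re


def get_id_or_none(url):
    """Return the first nine-digit run in the URL as an int, or None if absent."""
    match = re.search(r"\d{9}", url)
    return int(match.group()) if match else None


# Made-up URLs for illustration only.
with_id = get_id_or_none("https://www.mountainproject.com/route/123456789/example-route")
without_id = get_id_or_none("https://www.mountainproject.com/area/")
print(with_id, without_id)
```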

The "indices" for both dataframes are not unique, so I'll use the route ID as the index instead.

In [96]:
scraped.index = scraped["id"]
requested.index = requested["id"]
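
The route ID is only a useful index if it is unique. I haven't verified that here; the same route could, for instance, show up in more than one JSON file. A minimal check on a made-up frame might look like this:

```python
import pandas as pd

# Toy frame with a deliberately repeated id; values are made up.
demo = pd.DataFrame({"id": [10, 11, 11], "name": ["A", "B", "B"]})
demo.index = demo["id"]

# False here, because id 11 appears twice.
is_unique = demo.index.is_unique

# Keep the first row for each id and drop later repeats.
deduped = demo[~demo.index.duplicated(keep="first")]

print(is_unique)
print(deduped)
```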

Now we can order them by this new index and start comparing various columns' values.

In [98]:
scraped.head()

id,Unnamed: 0,Route,Location,URL,Avg Stars,Your Stars,Route Type,Rating,Pitches,Length,Area Latitude,Area Longitude,route_id,id
105979968,0,Dreamscape,Sun Wall > Sand Rock > Alabama,https://www.mountainproject.com/route/10597996...,3.8,-1,Sport,5.11c,1,75.0,34.18041,-85.81555,105979968,105979968
105905196,1,Comfortably Numb,The Pinnacle > Sand Rock > Alabama,https://www.mountainproject.com/route/10590519...,3.6,-1,"Trad, TR",5.9,1,120.0,34.17948,-85.81775,105905196,105905196
105905421,2,Misty,Sun Wall > Sand Rock > Alabama,https://www.mountainproject.com/route/10590542...,3.6,-1,Sport,5.10b/c,1,90.0,34.18041,-85.81555,105905421,105905421
105926850,3,Oyster,Holiday Block > Sand Rock > Alabama,https://www.mountainproject.com/route/10592685...,3.1,-1,Sport,5.10a,1,80.0,34.17961,-85.8179,105926850,105926850
105930746,4,Pigs in Zen (Tuesday's Gone),Holiday Block > Sand Rock > Alabama,https://www.mountainproject.com/route/10593074...,3.0,-1,Sport,5.10d,1,70.0,34.17961,-85.8179,105930746,105930746


id,id,name,type,rating,stars,starVotes,pitches,location,url,imgSqSmall,imgSmall,imgSmallMed,imgMedium,longitude,latitude
105888407,105888407,It's All Good,Sport,5.9+,3.6,38,1,"[Tennessee, Foster Falls, Gutbuster Area]",https://www.mountainproject.com/route/10588840...,https://cdn2.apstatic.com/photos/climb/1059769...,https://cdn2.apstatic.com/photos/climb/1059769...,https://cdn2.apstatic.com/photos/climb/1059769...,https://cdn2.apstatic.com/photos/climb/1059769...,-85.6837,35.1769
105892528,105892528,Mammy,Sport,5.9,3.2,146,1,"[Tennessee, Foster Falls, Jimmywood]",https://www.mountainproject.com/route/10589252...,https://cdn2.apstatic.com/photos/climb/1060315...,https://cdn2.apstatic.com/photos/climb/1060315...,https://cdn2.apstatic.com/photos/climb/1060315...,https://cdn2.apstatic.com/photos/climb/1060315...,-85.6825,35.1771
105892538,105892538,Afterburner,Sport,5.5,3.1,137,1,"[Tennessee, Foster Falls, Rocket Slab]",https://www.mountainproject.com/route/10589253...,https://cdn2.apstatic.com/photos/climb/1061404...,https://cdn2.apstatic.com/photos/climb/1061404...,https://cdn2.apstatic.com/photos/climb/1061404...,https://cdn2.apstatic.com/photos/climb/1061404...,-85.6842,35.1767
105892543,105892543,Gravity Boots,Sport,5.7,3.1,158,1,"[Tennessee, Foster Falls, Rocket Slab]",https://www.mountainproject.com/route/10589254...,https://cdn2.apstatic.com/photos/climb/1062510...,https://cdn2.apstatic.com/photos/climb/1062510...,https://cdn2.apstatic.com/photos/climb/1062510...,https://cdn2.apstatic.com/photos/climb/1062510...,-85.6842,35.1767
105893863,105893863,Bottom Feeder,Sport,5.12a,2.6,14,1,"[Tennessee, Foster Falls, White Wall]",https://www.mountainproject.com/route/10589386...,https://cdn2.apstatic.com/photos/climb/1182717...,https://cdn2.apstatic.com/photos/climb/1182717...,https://cdn2.apstatic.com/photos/climb/1182717...,https://cdn2.apstatic.com/photos/climb/1182717...,-85.6778,35.1795


In [2]:
#compare lat/lon data
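
A possible way to do this comparison is to join the two frames on the route ID index and flag rows where the coordinates differ by more than a small tolerance. This is a sketch on toy data: the scraped coordinates are copied from the output above, the requested coordinates and the `tolerance` value are assumptions.

```python
import pandas as pd

ids = pd.Index([105979968, 105905196], name="id")

# Scraped coordinates as shown in scraped.head() above.
scraped_demo = pd.DataFrame(
    {"Area Latitude": [34.18041, 34.17948], "Area Longitude": [-85.81555, -85.81775]},
    index=ids,
)
# Made-up requested coordinates; the second latitude is deliberately off.
requested_demo = pd.DataFrame(
    {"latitude": [34.1804, 34.2000], "longitude": [-85.8155, -85.8177]},
    index=ids,
)

# Line the rows up by route id; the column names differ, so nothing collides.
joined = scraped_demo.join(requested_demo, how="inner")

# Assumed tolerance (in degrees) that absorbs truncation but catches real differences.
tolerance = 1e-3
lat_diff = (joined["Area Latitude"] - joined["latitude"]).abs()
lon_diff = (joined["Area Longitude"] - joined["longitude"]).abs()
mismatched = joined[(lat_diff > tolerance) | (lon_diff > tolerance)]

print(mismatched)
```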

In [None]:
# TODO: for each file in directory:
#     pd.read_json()
#     pd.read_csv(scraped_csv_with_route_id)
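
The notebook stops before the merge itself. One plausible shape for it, assuming the requested data is the base and only the scraped-only columns (`Length`, `Your Stars`) get attached, is a left merge on `id`. The frames below are toy data, not the real datasets.

```python
import pandas as pd

# Toy stand-ins; column names follow the notebook, values are made up.
scraped_demo = pd.DataFrame(
    {"id": [1, 2], "Length": [75.0, 120.0], "Your Stars": [-1, -1]}
)
requested_demo = pd.DataFrame(
    {"id": [1, 2, 3], "name": ["A", "B", "C"], "starVotes": [38, 146, 137]}
)

# Left merge keeps every requested route; validate raises if ids repeat on either side.
combined = requested_demo.merge(
    scraped_demo[["id", "Length", "Your Stars"]],
    on="id",
    how="left",
    validate="one_to_one",
)

# Routes missing from the scraped data end up with NaN in the scraped-only columns.
print(combined)
```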