# Data extraction

**Preface:**

We want to analyze the most popular cycling routes for every mountain pass and port in continental Spain. To achieve this we scraped the website *Wikiloc* for the best routes and downloaded all the corresponding gpx files.

This lengthy process is documented in this repo:

https://github.com/x-esteban/PR_Final

To obtain the necessary data from our gpx files we'll need to parse them. For this purpose we'll be using a great library called *Gpxpy*.

**What are our initial questions?**

Our goal is to analyze the most popular routes for every province and answer two questions:

1. Do cyclists from different areas prefer different kinds of routes?
2. Are the routes in northern Spain hillier than the ones in the southern regions?

## 1. Importing our dataframe of all routes

In [5]:
import pandas as pd

In [3]:
#We begin by importing our dataframe containing all routes.

df = pd.read_csv('df.csv')

In [4]:
#The column 'alpha_name' contains the name of the corresponding gpx file. 

df.head()

| Unnamed: 0 | ubicacion | nombre | trailrank | distancia | desnivel | dificultad | url | photo1 | photo2 | photo3 | alpha_name |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Pico Veleta | Día 1/2 - Sierra Nevada - Granada - Pico Veleta | 64 | 108,78 km | 2.914 m | Moderado | https://es.wikiloc.com/rutas-ciclismo/dia-1-2-... | https://es.wikiloc.com/rutas-ciclismo/dia-1-2-... | https://es.wikiloc.com/rutas-ciclismo/dia-1-2-... | https://es.wikiloc.com/rutas-ciclismo/dia-1-2-... | Día12SierraNevadaGra |
| 1 | Pico Veleta | Pinos Genil. Güejar Sierra. Hazas Llanas. Prad... | 54 | 83,61 km | 2.563 m | Muy difícil | https://es.wikiloc.com/rutas-ciclismo/pinos-ge... | https://es.wikiloc.com/rutas-ciclismo/pinos-ge... | https://es.wikiloc.com/rutas-ciclismo/pinos-ge... | https://es.wikiloc.com/rutas-ciclismo/pinos-ge... | PinosGenilGüejarSier |
| 2 | Pico Veleta | granada - pico veleta | 41 | 85,89 km | 2.785 m | Moderado | https://es.wikiloc.com/rutas-ciclismo/granada-... | https://es.wikiloc.com/rutas-ciclismo/granada-... | https://es.wikiloc.com/rutas-ciclismo/granada-... | https://es.wikiloc.com/rutas-ciclismo/granada-... | granadapicoveleta |
| 3 | Pico Veleta | Subida al pico veleta y al radiotelescospio de... | 34 | 32,42 km | 1.199 m | Muy difícil | https://es.wikiloc.com/rutas-ciclismo/subida-a... | https://es.wikiloc.com/rutas-ciclismo/subida-a... | https://es.wikiloc.com/rutas-ciclismo/subida-a... | https://es.wikiloc.com/rutas-ciclismo/subida-a... | Subidaalpicoveletaya |
| 4 | Pico Veleta | Pico Veleta por el Monachil-el Purche-Sierra N... | 32 | 101,77 km | 3.234 m | Difícil | https://es.wikiloc.com/rutas-ciclismo/pico-vel... | https://es.wikiloc.com/rutas-ciclismo/pico-vel... | https://es.wikiloc.com/rutas-ciclismo/pico-vel... | https://es.wikiloc.com/rutas-ciclismo/pico-vel... | PicoVeletaporelMonac |
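
Note that `distancia` and `desnivel` are stored as strings in Spanish number format (comma as decimal separator, dot as thousands separator). Here's a minimal sketch of how they could be turned into floats, assuming every value looks like the ones shown above. `parse_es_number` is a hypothetical helper, not something from the original notebook.

```python
def parse_es_number(text):
    """Convert a string like '108,78 km' or '2.914 m' to a float."""
    number = text.split()[0]           # Drop the unit ('km', 'm').
    number = number.replace('.', '')   # Remove thousands separators.
    number = number.replace(',', '.')  # Use a dot as the decimal separator.
    return float(number)

print(parse_es_number('108,78 km'))  # 108.78
print(parse_es_number('2.914 m'))    # 2914.0
```

Applied to a whole column, this would probably be something like `df['distancia'].map(parse_es_number)`, but that depends on no values being missing or formatted differently.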


## 2. Parsing our gpx files

To obtain meaningful conclusions we must parse the gpx files so that we will be able to add their contents to our dataframe. 

We also suspect that some routes aren't even on Spanish soil, so we'll use the parsed coordinates to obtain their exact location.

Let's start by working out how to parse a given gpx file and collect all the necessary data in a list; we'll wrap this in a function afterwards. We're using plain lists for the per-point data instead of dataframes or dictionaries because they carry less memory overhead.
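
As a rough check of the memory argument, this sketch compares the size of one point stored as a list against the same point stored as a dictionary. The key names are made up for illustration, and `sys.getsizeof` only measures the container itself, so treat the numbers as indicative.

```python
import sys

point_as_list = [1.558717, 41.973759, 735.147]
point_as_dict = {'longitude': 1.558717, 'latitude': 41.973759, 'elevation': 735.147}

# sys.getsizeof reports the container's own size, not the floats it references.
list_size = sys.getsizeof(point_as_list)
dict_size = sys.getsizeof(point_as_dict)
print(list_size, dict_size)
```

The exact byte counts depend on the Python version, but the dictionary should come out larger per point, which adds up over thousands of points per file.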

In [6]:
#Importing necessary libraries.

import gpxpy
import gpxpy.gpx
import time

In [18]:
#Parsing our gpx file, for this example we're using 'Madrona.gpx'.

with open('Madrona.gpx') as gpx_file: #The context manager closes the file once it's parsed.
    gpx = gpxpy.parse(gpx_file)

In [22]:
#Appending each point with its parameters to a list.

data = []

track = gpx.tracks[0]
segment = track.segments[0]
segment_length = segment.length_3d()
for point_idx, point in enumerate(segment.points): #Iterating through all the points in our gpx file.
    data.append([point.longitude, #Getting all parameters from each point.
                 point.latitude,
                 point.elevation, 
                 point.time, 
                 (segment.get_speed(point_idx))*3.6]) #Speed converted from m/s to km/h.

In [23]:
data

[[1.558717,
  41.973759,
  735.147,
  datetime.datetime(2019, 6, 2, 8, 54, 33, tzinfo=SimpleTZ("Z")),
  0.4286373026700454],
 [1.558657,
  41.973774,
  735.129,
  datetime.datetime(2019, 6, 2, 8, 55, 17, tzinfo=SimpleTZ("Z")),
  8.420524427220105],
 [1.558573,
  41.973721,
  735.121,
  datetime.datetime(2019, 6, 2, 8, 55, 19, tzinfo=SimpleTZ("Z")),
  16.584041543356516],
 [1.558486,
  41.973668,
  735.108,
  datetime.datetime(2019, 6, 2, 8, 55, 21, tzinfo=SimpleTZ("Z")),
  22.546021272466607],
 [1.558335,
  41.973582,
  735.137,
  datetime.datetime(2019, 6, 2, 8, 55, 23, tzinfo=SimpleTZ("Z")),
  24.71373683821654],
 [1.558272,
  41.973558,
  735.111,
  datetime.datetime(2019, 6, 2, 8, 55, 24, tzinfo=SimpleTZ("Z")),
  21.29445493993982],
 [1.558202,
  41.973545,
  735.175,
  datetime.datetime(2019, 6, 2, 8, 55, 25, tzinfo=SimpleTZ("Z")),
  21.640317990236746],
 [1.558129,
  41.973548,
  735.187,
  datetime.datetime(2019, 6, 2, 8, 55, 26, tzinfo=SimpleTZ("Z")),
  22.471268817743884],
 [1

In [21]:
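#Visualizing a single point (one entry in our list). This output predates the km/h conversion above, so the speed is still in m/s (0.119 m/s ≈ 0.43 km/h).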

data[0]

[1.558717,
 41.973759,
 735.147,
 datetime.datetime(2019, 6, 2, 8, 54, 33, tzinfo=SimpleTZ("Z")),
 0.11906591740834595]
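
For some intuition about where the speed values come from, here's a sketch that estimates the speed between the first two points using the haversine distance and the time difference. This is not gpxpy's implementation: the Earth radius is an assumed value and elevation is ignored, so the result should only roughly match.

```python
import math
from datetime import datetime

EARTH_RADIUS_M = 6371000  # Mean Earth radius; an assumption, gpxpy may use another value.

def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in meters between two (longitude, latitude) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

# First two points of the list above: (longitude, latitude, time).
p1 = (1.558717, 41.973759, datetime(2019, 6, 2, 8, 54, 33))
p2 = (1.558657, 41.973774, datetime(2019, 6, 2, 8, 55, 17))

distance_m = haversine_m(p1[0], p1[1], p2[0], p2[1])
seconds = (p2[2] - p1[2]).total_seconds()
speed_kmh = distance_m / seconds * 3.6  # m/s to km/h.
print(round(speed_kmh, 2))  # 0.43
```

About 5 meters covered in 44 seconds gives a speed close to the first value in our list, which suggests the rider was barely moving at the start of the recording.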

In [35]:
help(gpx)

Help on GPX in module gpxpy.gpx object:

class GPX(builtins.object)
 |  GPX() -> None
 |  
 |  Methods defined here:
 |  
 |  __init__(self) -> None
 |      Initialize self.  See help(type(self)) for accurate signature.
 |  
 |  __repr__(self) -> str
 |      Return repr(self).
 |  
 |  add_elevation(self, delta: float) -> None
 |      Adjusts elevation data of GPX data.
 |      
 |      Parameters
 |      ----------
 |      delta : float
 |          Elevation delta in meters to apply to GPX data
 |  
 |  add_missing_data(self, get_data_function: Callable[[gpxpy.gpx.GPXTrackPoint], Any], add_missing_function: Callable[[List[gpxpy.gpx.GPXTrackPoint], gpxpy.gpx.GPXTrackPoint, gpxpy.gpx.GPXTrackPoint, List[float]], NoneType]) -> None
 |  
 |  add_missing_elevations(self) -> None
 |  
 |  add_missing_speeds(self) -> None
 |      The missing speeds are added to a segment.
 |      
 |      The weighted harmonic mean is used to approximate the speed at
 |      a :obj:'~.GPXTrackPoint'.
 |     

In [59]:
#But there's plenty of data we can extract besides each individual point. Start time (end_time is available from the same call):

gpx_start = track.get_time_bounds().start_time
gpx_start

datetime.datetime(2019, 6, 2, 8, 54, 33, tzinfo=SimpleTZ("Z"))

In [60]:
#Activity name:

gpx_name = track.name
gpx_name

'Madrona'

In [57]:
#Distance:

gpx_distance = track.length_3d()/1000 #Converting to km.
gpx_distance

51.33266117084173

In [56]:
#Ride duration:

gpx_duration = track.get_duration()/60 #Converting to minutes.
gpx_duration

195.88333333333333
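
With distance and duration in hand, the average speed follows directly. A small worked example using the values printed above; keep in mind that the duration is elapsed time, so any stops lower the figure.

```python
gpx_distance = 51.33266117084173   # km, as printed above.
gpx_duration = 195.88333333333333  # minutes, as printed above.

avg_speed = gpx_distance / (gpx_duration / 60)  # km/h
print(round(avg_speed, 2))  # 15.72
```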

In [55]:
#Maximum height:

gpx_height = gpx.get_elevation_extremes().maximum
gpx_height

921.008

In [54]:
#Total climb:

gpx_climb = track.get_uphill_downhill().uphill
gpx_climb

855.4839999999991
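
To make the idea of "total climb" concrete, here's a naive sketch that sums every positive elevation difference between consecutive points. gpxpy's own figure may differ, since it may smooth the elevation data before summing; this version is only an illustration.

```python
def naive_uphill(elevations):
    """Sum every positive elevation difference between consecutive points."""
    return sum(max(later - earlier, 0) for earlier, later in zip(elevations, elevations[1:]))

# Elevations of the first eight points listed earlier.
elevations = [735.147, 735.129, 735.121, 735.108, 735.137, 735.111, 735.175, 735.187]
climb = naive_uphill(elevations)
print(round(climb, 3))  # 0.105
```

Note how GPS noise alone produces a few centimeters of "climb" on an almost flat stretch, which is why unsmoothed sums tend to overestimate.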

In [66]:
#Average longitude and latitude (each entry in data is [longitude, latitude, ...]):

lon = [point[0] for point in data]
lat = [point[1] for point in data]

avg_coords = [sum(lon)/len(lon), sum(lat)/len(lat)]
avg_coords

[1.4475997135893668, 41.98444892112242]
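
These average coordinates give us a cheap first filter for routes that fall outside Spain. A minimal sketch using a bounding box; the box limits are approximate values assumed for illustration, and `roughly_in_spain` is a hypothetical helper.

```python
# Approximate bounding box of peninsular Spain (assumed values, not from the notebook).
# It also covers Portugal and a strip of southern France, so it's only a first filter.
LON_MIN, LON_MAX = -9.5, 3.4
LAT_MIN, LAT_MAX = 35.9, 43.9

def roughly_in_spain(lon, lat):
    """Return True if the point falls inside the approximate bounding box."""
    return LON_MIN <= lon <= LON_MAX and LAT_MIN <= lat <= LAT_MAX

print(roughly_in_spain(1.4475997135893668, 41.98444892112242))  # True
print(roughly_in_spain(2.3522, 48.8566))                         # False (Paris)
```

Routes that pass this check would still need a proper lookup (for example reverse geocoding) to get their exact province.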

### 2.1 Creating the parser function

Now that we know how to access every piece of data in our parsed gpx file, let's start defining a function that will eventually return a dictionary with all the valuable data. It will be easy to convert the generated dictionaries into a dataframe later. For now it only returns the start time.

In [8]:
#Defining the parser function (work in progress: it only returns the start time for now).

def parser(filename):
    with open(filename + '.gpx', encoding='utf-8') as file: #Closing the file after parsing.
        gpx = gpxpy.parse(file)
    track = gpx.tracks[0] #Assuming a single track per file.
    gpx_start = track.get_time_bounds().start_time
    return gpx_start

In [9]:
parser('Madrona')

datetime.datetime(2019, 6, 2, 8, 54, 33, tzinfo=SimpleTZ("Z"))
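
Once the parser returns a dictionary per file, running it over every route could look like the sketch below. `collect_routes` and `fake_parser` are hypothetical names used for illustration; the parser is passed in as an argument so the example runs without any gpx files, and the exception list is an assumption that would likely need extending.

```python
def collect_routes(names, parse_one):
    """Apply parse_one to every name; return (records, failed_names)."""
    records, failed = [], []
    for name in names:
        try:
            records.append(parse_one(name))
        except (FileNotFoundError, UnicodeDecodeError, IndexError) as error:
            # Missing file, wrong encoding, or a gpx file without tracks/segments.
            failed.append((name, type(error).__name__))
    return records, failed

# Stand-in parser for demonstration only; the real one would open the gpx file.
def fake_parser(name):
    if name == 'missing':
        raise FileNotFoundError(name + '.gpx')
    return {'alpha_name': name}

records, failed = collect_routes(['Madrona', 'missing'], fake_parser)
print(records)  # [{'alpha_name': 'Madrona'}]
print(failed)   # [('missing', 'FileNotFoundError')]
```

Keeping the failures in their own list makes it easy to see how many files are missing, and a list of dictionaries like `records` can be passed straight to `pd.DataFrame`. Any exception gpxpy raises for malformed XML would also need to be added to the `except` clause.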

In [None]:
'''
Tasklist for tomorrow:
                        - Finish the function with try/except blocks and encoding options.
                        - Create a df with the parsed files.
                        - Make sure that missing files are kept to a minimum.
                        - COMMIT MORE OFTEN

'''