# Calorie expenditure during road cycling

In this project I'll try to answer two questions:

- Which variables affect calorie expenditure the most when we're cycling?
- Can this caloric expenditure be accurately predicted for any given route?
    
To answer them I'll be using a dataset of my cycling rides from 2016 to the present, courtesy of **SportTracks**'s elegantly simple data export options.

## 1 Data wrangling

Since our data is in CSV format, we'll begin our data exploration by loading it into a pandas DataFrame.

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv('workouts.csv', encoding='utf-8')

In [4]:
df.head()

Unnamed: 0,Nombre,Inicio,Deporte,Distancia,Duración,Velocidad media,Calorías,Calorías (/hora),Pulso promedio,Aumento del Desnivel,Temperatura
0,Cycling: Road,2021-05-30 11:09:26,Carretera,"20,04 km",42:26,"28,3 km/h",506.0,715.0,144.0,152.0,223
1,Cycling: Road,2021-05-28 08:51:04,Carretera,"100,29 km",3:21:59,"29,8 km/h",1.799,534.0,131.0,610.0,166
2,Cycling: Road,2021-05-26 08:51:07,Carretera,"124,78 km",4:44:16,"26,3 km/h",2.741,579.0,140.0,1.751,159
3,Cycling: Road,2021-05-24 10:46:51,Carretera,"36,25 km",1:18:23,"27,7 km/h",643.0,492.0,124.0,186.0,162
4,Cycling: Mountain,2021-05-23 09:20:16,Montaña,"78,61 km",3:10:01,"24,8 km/h",2.031,641.0,143.0,381.0,139
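
One thing stands out in this preview: the calories of the second ride show up as 1.799 and the climb of the third as 1.751. At 534 kcal/h over 3:21:59 the second ride should come to roughly 1,800 kcal, so the dot is most likely a thousands separator that was read as a decimal point. Below is a minimal sketch of one way to handle this at load time, assuming the raw file writes such values as `1.799`. The inline sample is made up to mirror the preview and is not the real `workouts.csv`.

```python
import io
import pandas as pd

# Hypothetical two-row sample that mirrors the preview above (not the real file).
raw = io.StringIO(
    'Distancia,Calorías,Aumento del Desnivel\n'
    '"20,04 km",506,152\n'
    '"100,29 km",1.799,610\n'
)

# thousands='.' and decimal=',' tell pandas to expect European number formatting,
# so '1.799' is read as one thousand seven hundred ninety-nine.
sample = pd.read_csv(raw, thousands='.', decimal=',')
print(sample['Calorías'].tolist())
```

If the real export behaves the same way, passing the same two arguments to the `pd.read_csv('workouts.csv', ...)` call would be worth trying, followed by a quick look at the largest and smallest calorie values.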


In [6]:
#Let's check for missing values. Since very few activities have missing data we can safely drop them.

df.isnull().sum()

Nombre                  0
Inicio                  0
Deporte                 0
Distancia               0
Duración                0
Velocidad media         4
Calorías                4
Calorías (/hora)        4
Pulso promedio          8
Aumento del Desnivel    0
Temperatura             3
dtype: int64

In [7]:
#Dropping the rows with missing data.

df.dropna(axis=0, how='any', inplace=True)
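
Since `dropna(..., inplace=True)` returns nothing, it is easy to lose track of how many rides were discarded. Here is a small sketch of a sanity check on a toy frame. The values are illustrative only, and the names `toy`, `cleaned` and `rows_dropped` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value in each of two different rows.
toy = pd.DataFrame({
    'Calorías': [506.0, np.nan, 643.0],
    'Pulso promedio': [144.0, 131.0, np.nan],
})

rows_before = len(toy)
# how='any' drops a row as soon as a single column is missing.
cleaned = toy.dropna(axis=0, how='any')
rows_dropped = rows_before - len(cleaned)
print(f'dropped {rows_dropped} of {rows_before} rows')
```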

In [12]:
#To clean up our dataframe and prevent any future naming issues, let's rename the columns.

df.rename(columns={'Nombre':'name',
                      'Inicio':'start',
                      'Deporte': 'sport',
                      'Distancia':'distance',
                      'Duración':'duration',
                      'Velocidad media':'avg_speed',
                      'Calorías':'calories',
                      'Calorías (/hora)':'cals_per_hour',
                      'Pulso promedio':'heartrate',
                      'Aumento del Desnivel':'climb',
                      'Temperatura':'temp'}, inplace=True)

In [13]:
df.head()

Unnamed: 0,name,start,sport,distance,duration,avg_speed,calories,cals_per_hour,heartrate,climb,temp
0,Cycling: Road,2021-05-30 11:09:26,Carretera,"20,04 km",42:26,"28,3 km/h",506.0,715.0,144.0,152.0,223
1,Cycling: Road,2021-05-28 08:51:04,Carretera,"100,29 km",3:21:59,"29,8 km/h",1.799,534.0,131.0,610.0,166
2,Cycling: Road,2021-05-26 08:51:07,Carretera,"124,78 km",4:44:16,"26,3 km/h",2.741,579.0,140.0,1.751,159
3,Cycling: Road,2021-05-24 10:46:51,Carretera,"36,25 km",1:18:23,"27,7 km/h",643.0,492.0,124.0,186.0,162
4,Cycling: Mountain,2021-05-23 09:20:16,Montaña,"78,61 km",3:10:01,"24,8 km/h",2.031,641.0,143.0,381.0,139


In [16]:
# The 'name' column doesn't give us any meaningful information, so we'll drop it.

df.drop('name',axis=1, inplace=True)
df.head()

Unnamed: 0,start,sport,distance,duration,avg_speed,calories,cals_per_hour,heartrate,climb,temp
0,2021-05-30 11:09:26,Carretera,"20,04 km",42:26,"28,3 km/h",506.0,715.0,144.0,152.0,223
1,2021-05-28 08:51:04,Carretera,"100,29 km",3:21:59,"29,8 km/h",1.799,534.0,131.0,610.0,166
2,2021-05-26 08:51:07,Carretera,"124,78 km",4:44:16,"26,3 km/h",2.741,579.0,140.0,1.751,159
3,2021-05-24 10:46:51,Carretera,"36,25 km",1:18:23,"27,7 km/h",643.0,492.0,124.0,186.0,162
4,2021-05-23 09:20:16,Montaña,"78,61 km",3:10:01,"24,8 km/h",2.031,641.0,143.0,381.0,139


In [18]:
# Some of our columns need to be cast to more useful dtypes. Let's get to it.

df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 800 entries, 0 to 813
Data columns (total 10 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   start          800 non-null    object 
 1   sport          800 non-null    object 
 2   distance       800 non-null    object 
 3   duration       800 non-null    object 
 4   avg_speed      800 non-null    object 
 5   calories       800 non-null    float64
 6   cals_per_hour  800 non-null    float64
 7   heartrate      800 non-null    float64
 8   climb          800 non-null    float64
 9   temp           800 non-null    object 
dtypes: float64(4), object(6)
memory usage: 101.0+ KB


In [19]:
# Converting 'start' to datetime format. Using the correct format string is crucial at this point.

df['start'] = pd.to_datetime(df['start'], format='%Y-%m-%d %H:%M:%S')
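
With an explicit `format`, `pd.to_datetime` raises on the first timestamp that doesn't match. One possible alternative, sketched here on a made-up two-element series, is to pass `errors='coerce'` so malformed entries become `NaT` and can be counted before deciding what to do with them.

```python
import pandas as pd

# Illustrative values only: one well-formed timestamp and one malformed entry.
stamps = pd.Series(['2021-05-30 11:09:26', 'not a date'])

# errors='coerce' turns unparseable strings into NaT instead of raising.
parsed = pd.to_datetime(stamps, format='%Y-%m-%d %H:%M:%S', errors='coerce')
n_bad = int(parsed.isna().sum())
print(f'{n_bad} timestamp(s) could not be parsed')
```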

In [21]:
#The column Dtype has been changed successfully.

df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 800 entries, 0 to 813
Data columns (total 10 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   start          800 non-null    datetime64[ns]
 1   sport          800 non-null    object        
 2   distance       800 non-null    object        
 3   duration       800 non-null    object        
 4   avg_speed      800 non-null    object        
 5   calories       800 non-null    float64       
 6   cals_per_hour  800 non-null    float64       
 7   heartrate      800 non-null    float64       
 8   climb          800 non-null    float64       
 9   temp           800 non-null    object        
dtypes: datetime64[ns](1), float64(4), object(5)
memory usage: 101.0+ KB


In [22]:
# 'sport' is a categorical variable. Let's see which values it takes.

df['sport'].value_counts()

Carretera     538
Montaña       129
Ciclismo       69
Virtual        59
Interiores      4
Entrenador      1
Name: sport, dtype: int64

In [23]:
# Those 6 categories can be collapsed into just 3 (road, mountain, indoor).
# Assigning the result back avoids calling replace(..., inplace=True) on a column
# selection, which may not update df in newer pandas versions.

df['sport'] = df['sport'].replace({'Carretera': 'road',
                                   'Ciclismo': 'road',
                                   'Montaña': 'mountain',
                                   'Virtual': 'indoor',
                                   'Interiores': 'indoor',
                                   'Entrenador': 'indoor'})
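
A label that isn't covered by the replacements would pass through unchanged and silently become a fourth category. The sketch below shows one way to surface such labels, using a dictionary and `Series.map`. `SPORT_MAP` is a hypothetical name, and 'Gravel' is a made-up label that does not appear in the dataset.

```python
import pandas as pd

# Spanish labels from the export mapped to three coarse categories.
SPORT_MAP = {
    'Carretera': 'road', 'Ciclismo': 'road',
    'Montaña': 'mountain',
    'Virtual': 'indoor', 'Interiores': 'indoor', 'Entrenador': 'indoor',
}

# 'Gravel' stands in for a label the mapping has never seen.
labels = pd.Series(['Carretera', 'Montaña', 'Gravel'])

# Unlike replace, map yields NaN for keys missing from the dictionary.
mapped = labels.map(SPORT_MAP)
unmapped = labels[mapped.isna()].unique().tolist()
print('labels without a category:', unmapped)
```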

In [24]:
#Let's check if the string replace has worked successfully.

df['sport'].value_counts()

road        607
mountain    129
indoor       64
Name: sport, dtype: int64
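
With only three distinct values left, the column could also be stored as a pandas categorical rather than as plain strings. This is a minimal sketch on a toy series, and whether it is worth doing depends on how the column is used later.

```python
import pandas as pd

# Toy series using the three category names from the output above.
sport = pd.Series(['road', 'mountain', 'road', 'indoor'])

# The category dtype stores each distinct label once plus an integer code per row.
sport_cat = sport.astype('category')
print(sport_cat.cat.categories.tolist())
```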

In [33]:
# As a first step towards converting 'distance' into a float, we strip the ' km' unit suffix.

df['distance'] = df['distance'].str.replace(' km', '', regex=False)

In [34]:
df.head()

Unnamed: 0,start,sport,distance,duration,avg_speed,calories,cals_per_hour,heartrate,climb,temp
0,2021-05-30 11:09:26,road,2004,42:26,"28,3 km/h",506.0,715.0,144.0,152.0,223
1,2021-05-28 08:51:04,road,10029,3:21:59,"29,8 km/h",1.799,534.0,131.0,610.0,166
2,2021-05-26 08:51:07,road,12478,4:44:16,"26,3 km/h",2.741,579.0,140.0,1.751,159
3,2021-05-24 10:46:51,road,3625,1:18:23,"27,7 km/h",643.0,492.0,124.0,186.0,162
4,2021-05-23 09:20:16,mountain,7861,3:10:01,"24,8 km/h",2.031,641.0,143.0,381.0,139
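
The preview above shows `2004` for a ride whose raw value was `20,04 km`, so the decimal comma needs explicit handling before the column can be cast to float. The sketch below shows one way to finish the conversion for `distance` and `avg_speed` and to turn `duration` into seconds. It runs on a toy frame copied from the first preview, and the helper names `decimal_comma_to_float` and `duration_to_seconds` are hypothetical.

```python
import pandas as pd

def decimal_comma_to_float(series, unit):
    """Strip a trailing unit such as ' km' and read '20,04' as 20.04."""
    cleaned = (series.str.replace(unit, '', regex=False)
                     .str.replace(',', '.', regex=False))
    return cleaned.astype(float)

def duration_to_seconds(text):
    """Convert 'mm:ss' or 'h:mm:ss' into a number of seconds."""
    seconds = 0
    for part in text.split(':'):
        # Each colon-separated field is worth 60 times the one to its right.
        seconds = seconds * 60 + int(part)
    return seconds

# Two rides copied from the raw preview at the top of the notebook.
toy = pd.DataFrame({
    'distance': ['20,04 km', '100,29 km'],
    'avg_speed': ['28,3 km/h', '29,8 km/h'],
    'duration': ['42:26', '3:21:59'],
})

toy['distance'] = decimal_comma_to_float(toy['distance'], ' km')
toy['avg_speed'] = decimal_comma_to_float(toy['avg_speed'], ' km/h')
toy['duration_s'] = toy['duration'].map(duration_to_seconds)
print(toy.dtypes)
```

Parsing the duration by hand sidesteps any ambiguity about whether a two-field value such as `42:26` means minutes and seconds or hours and minutes.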


Tasklist:

- Data exploration.
- Rename columns. DONE
- Look for missing values. DONE
- Typecasting columns if necessary.
- Correlation matrix.
- Data viz when it's done.