# Strava Data Project
----

**Goals:**
- Develop a script to clean the data and surface charts and statistics
    - Totals, averages, time series, locations, fastest segments, longest runs, overall figures
        - Speed, distance, time, altitude, calories, activity type
- Create a dashboard using d3.js and other front-end tools (HTML, CSS)
    - Focus on design and dynamics
- Create a scrollable/slideshow view on the webpage to increase interactivity
    - Surface all the major statistics in a clean layout
    - Dashboard at the end
- Reach task: create a live webpage where people can upload their 'activities.csv' export from Strava and see their statistics
    - Questions
        - Data collection from strangers?
        - Keeping the webpage constantly active?

**Results:**
- Python script to ingest and process data files
- HTML webpage to show all the information
- Website URL for public access

**Future:**
- Weave in WHOOP data?

In [1]:
import pandas as pd

In [2]:
data = pd.read_csv('data/activities.csv')

In [3]:
df = data[['Activity ID', 'Activity Date', 'Activity Name', 'Activity Type', 'Elapsed Time',
           'Elapsed Time.1', 'Distance', 'Distance.1', 'Max Heart Rate', 'Relative Effort',
           'Moving Time', 'Max Speed', 'Average Speed', 'Elevation Gain',
           'Elevation Loss', 'Elevation Low', 'Elevation High', 'Max Grade', 'Average Grade',
           'Average Heart Rate', 'Calories']]
df.head()

Unnamed: 0,Activity ID,Activity Date,Activity Name,Activity Type,Elapsed Time,Elapsed Time.1,Distance,Distance.1,Max Heart Rate,Relative Effort,...,Max Speed,Average Speed,Elevation Gain,Elevation Loss,Elevation Low,Elevation High,Max Grade,Average Grade,Average Heart Rate,Calories
0,4551300481,"Jan 1, 2021, 10:56:31 PM",Afternoon Run,Run,612,612.0,1.61,1609.400024,,,...,,2.629739,,,,,,0.0,,
1,4555170880,"Jan 2, 2021, 3:52:42 PM",Morning Run,Run,647,647.0,1.69,1698.300049,,,...,7.0,2.844724,28.823198,26.423201,58.400002,77.400002,8.7,0.141318,,
2,4560941785,"Jan 3, 2021, 2:52:02 PM",Morning Run,Run,633,633.0,1.64,1646.58728,168.0,28.0,...,5.1,2.673031,28.126984,23.827,57.200001,77.400002,10.4,0.261144,168.0,
3,4566782439,"Jan 4, 2021, 5:24:38 PM",Lunch run,Run,1157,1157.0,3.23,3231.699951,,,...,4.7,2.847313,9.891797,10.1918,1.3,8.1,7.8,-0.009283,,
4,4572079956,"Jan 5, 2021, 4:57:41 PM",Lunch run,Run,1759,1759.0,5.26,5265.600098,,,...,4.9,3.04546,13.625696,14.2257,1.3,8.0,5.2,-0.011395,,


In [4]:
df.columns = [i.lower().replace(' ', '_') for i in df.columns]

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 660 entries, 0 to 659
Data columns (total 21 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   activity_id         660 non-null    int64  
 1   activity_date       660 non-null    object 
 2   activity_name       660 non-null    object 
 3   activity_type       660 non-null    object 
 4   elapsed_time        660 non-null    int64  
 5   elapsed_time.1      660 non-null    float64
 6   distance            660 non-null    float64
 7   distance.1          660 non-null    float64
 8   max_heart_rate      344 non-null    float64
 9   relative_effort     344 non-null    float64
 10  moving_time         660 non-null    float64
 11  max_speed           656 non-null    float64
 12  average_speed       660 non-null    float64
 13  elevation_gain      656 non-null    float64
 14  elevation_loss      311 non-null    float64
 15  elevation_low       311 non-null    float64
 16  elevatio

In [6]:
# Take an explicit copy so the new columns are not assigned to a slice of `data`
# (assigning to the slice raises pandas' SettingWithCopyWarning)
df = df.copy()
df['time_of_day'] = [i.split(',')[2].strip() for i in df['activity_date']]
# 'distance.1' is in meters, so this yields miles x 1000; divide by 1609.344 for miles
df['distance_m'] = df['distance.1'] / 1.609
df['time_minutes'] = df['elapsed_time'] / 60
# Speeds are in m/s (1609.4 m in 612 s is about 2.63); 1 m/s is about 2.237 mph
df['max_speed_mph'] = df['max_speed'] * 2.237
df['average_speed_mph'] = df['average_speed'] * 2.237

In [7]:
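# Note: 'distance.1' holds meters (e.g. 8490.46 for an 8.49 km run), so 'distance_km' is a misnomer
df = df.rename(columns={'distance.1': 'distance_km', 'elapsed_time.1': 'time_sec'})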
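
The `time_of_day` column above is built by splitting the date string on commas, which leaves it as text. For the time-series goals it may be more convenient to parse the whole string into a real datetime. The following is a minimal sketch using only the standard library; the helper names, the format constant, and the morning/afternoon/evening cut-offs are assumptions for illustration, and the format is inferred from the rows shown in this notebook.

```python
from datetime import datetime

# Format of the 'Activity Date' strings as they appear in this export,
# e.g. "Jan 1, 2021, 10:56:31 PM" (assumed to be the same for every row)
STRAVA_DATE_FORMAT = "%b %d, %Y, %I:%M:%S %p"


def parse_activity_date(text):
    """Parse one 'Activity Date' string into a datetime (hypothetical helper)."""
    return datetime.strptime(text, STRAVA_DATE_FORMAT)


def part_of_day(moment):
    """Bucket a datetime into a coarse label; the cut-off hours are arbitrary."""
    if moment.hour < 12:
        return "morning"
    if moment.hour < 17:
        return "afternoon"
    return "evening"


start = parse_activity_date("Jan 1, 2021, 10:56:31 PM")
print(start.isoformat())   # 2021-01-01T22:56:31
print(part_of_day(start))  # evening
```

Applied to the frame, this could look like `df['activity_date'].map(parse_activity_date)`, after which sorting and resampling by date become possible.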

In [8]:
df.tail(20)

Unnamed: 0,activity_id,activity_date,activity_name,activity_type,elapsed_time,time_sec,distance,distance_km,max_heart_rate,relative_effort,...,elevation_high,max_grade,average_grade,average_heart_rate,calories,time_of_day,distance_m,time_minutes,max_speed_mph,average_speed_mph
640,6188034280,"Oct 30, 2021, 1:54:30 PM",Morning Activity,Weight Training,1957,1957.0,0.0,0.0,129.0,4.0,...,,0.0,0.0,97.876343,102.0,1:54:30 PM,0.0,32.616667,0.0,0.0
641,6192365918,"Oct 31, 2021, 12:57:32 PM",Morning Run,Run,2664,2664.0,8.49,8490.459961,,,...,7.8,9.704091,0.0106,,,12:57:32 PM,5276.855165,44.4,12.41972,7.285447
642,6196509733,"Nov 1, 2021, 11:50:35 AM",Morning Run,Run,1699,1699.0,5.2,5208.080078,,,...,8.2,7.809011,0.032642,,,11:50:35 AM,3236.842808,28.316667,16.461337,7.182783
643,6202316386,"Nov 2, 2021, 4:19:55 PM",Lunch Run,Run,3576,3576.0,11.5,11502.129883,,,...,8.2,7.272727,0.013041,,,4:19:55 PM,7148.620188,59.6,11.6503,7.464539
644,6203535513,"Nov 1, 2021, 9:32:35 PM",Afternoon Activity,Workout,2724,2724.0,0.0,0.0,154.0,19.0,...,,0.0,0.0,124.821953,251.0,9:32:35 PM,0.0,45.4,0.0,0.0
645,6203535593,"Nov 2, 2021, 12:01:19 PM",Morning Activity,Workout,2848,2848.0,0.0,0.0,155.0,10.0,...,,0.0,0.0,110.27388,220.0,12:01:19 PM,0.0,47.466667,0.0,0.0
646,6205893312,"Nov 3, 2021, 12:25:59 PM",Morning Run,Run,1220,1220.0,3.7,3700.830078,,,...,7.7,5.029341,0.027021,,,12:25:59 PM,2300.080844,20.333333,20.074838,7.211461
647,6208812320,"Nov 3, 2021, 9:31:57 PM",Afternoon Activity,Workout,2843,2843.0,0.0,0.0,175.0,15.0,...,,0.0,0.0,115.698906,246.0,9:31:57 PM,0.0,47.383333,0.0,0.0
648,6210322335,"Nov 4, 2021, 11:48:55 AM",Morning Run,Run,2037,2037.0,6.44,6448.410156,,,...,8.2,5.709626,0.013957,,,11:48:55 AM,4007.712962,33.95,16.86698,7.259735
649,6210724268,"Nov 4, 2021, 11:01:56 AM",Morning Activity,Weight Training,2729,2729.0,0.0,0.0,137.0,7.0,...,,0.0,0.0,104.828873,149.0,11:01:56 AM,0.0,45.483333,0.0,0.0
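
In the rows above, the renamed `distance_km` column reads 8490.46 next to a `distance` of 8.49, which suggests the raw values are meters, and `distance_m` (5276.86 for that run) is about 1000 times the distance in miles. A small sketch of the conversions, assuming the inputs are meters and meters per second; the function names are hypothetical and not part of the notebook.

```python
# Exact definition: 1 mile = 1609.344 m, so 1 m/s = 3600 / 1609.344 mph
METERS_PER_MILE = 1609.344


def meters_to_miles(meters):
    """Convert a distance in meters to miles."""
    return meters / METERS_PER_MILE


def meters_to_km(meters):
    """Convert a distance in meters to kilometers."""
    return meters / 1000.0


def mps_to_mph(meters_per_second):
    """Convert a speed in meters per second to miles per hour."""
    return meters_per_second * 3600.0 / METERS_PER_MILE


# Row 641 above: raw distance 8490.459961 alongside 'distance' = 8.49
print(round(meters_to_km(8490.459961), 2))     # 8.49
print(round(meters_to_miles(8490.459961), 2))  # 5.28
```

These functions work unchanged on a pandas column, for example `meters_to_miles(df['distance_km'])`, because the arithmetic is applied element-wise.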


In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 660 entries, 0 to 659
Data columns (total 26 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   activity_id         660 non-null    int64  
 1   activity_date       660 non-null    object 
 2   activity_name       660 non-null    object 
 3   activity_type       660 non-null    object 
 4   elapsed_time        660 non-null    int64  
 5   time_sec            660 non-null    float64
 6   distance            660 non-null    float64
 7   distance_km         660 non-null    float64
 8   max_heart_rate      344 non-null    float64
 9   relative_effort     344 non-null    float64
 10  moving_time         660 non-null    float64
 11  max_speed           656 non-null    float64
 12  average_speed       660 non-null    float64
 13  elevation_gain      656 non-null    float64
 14  elevation_loss      311 non-null    float64
 15  elevation_low       311 non-null    float64
 16  elevatio

In [22]:
df['activity_type'].value_counts().reset_index()

Unnamed: 0,index,activity_type
0,Run,322
1,Workout,275
2,Weight Training,42
3,Ride,16
4,Rowing,3
5,Yoga,2


In [13]:
# Split off activity dfs
activity_dfs = {}

for a in df['activity_type'].unique():
    filt = df[df['activity_type'] == a]
    activity_dfs[a] = filt
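
The loop above can also be written with `groupby`, which additionally gives per-activity aggregates without building the sub-frames first. This is a sketch on a tiny made-up frame, not the notebook's data.

```python
import pandas as pd

# Tiny stand-in for the cleaned frame; the values are made up for illustration
df = pd.DataFrame({
    'activity_type': ['Run', 'Run', 'Workout', 'Ride'],
    'time_minutes': [10.0, 20.0, 45.0, 60.0],
})

# Equivalent to the loop: one sub-frame per activity type
activity_dfs = {name: group for name, group in df.groupby('activity_type')}

# Count and total time per activity type in a single pass
summary = df.groupby('activity_type')['time_minutes'].agg(['count', 'sum'])

print(sorted(activity_dfs))       # ['Ride', 'Run', 'Workout']
print(summary.loc['Run', 'sum'])  # 30.0
```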

----
### Top Statistics:

- Num Activities
- Top activity
- Activity counts (Pie?)
- Total Activity Time
- Total Activity Distance

In [28]:
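num_activities = len(df)
# idxmax() returns the most frequent activity type and does not depend on the
# column names produced by reset_index(), which changed in pandas 2.0
top_activity = df['activity_type'].value_counts().idxmax()

active_counts_table = df['activity_type'].value_counts().reset_index()
active_counts_table.columns = ['Activity', 'Count']
active_counts_table.to_csv('value_cnts.csv')

# Series.sum() skips NaN; the built-in sum() would return nan if a value were missing
total_time_seconds = df['time_sec'].sum()
total_time_minutes = df['time_minutes'].sum()

total_miles = df['distance_m'].sum()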
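
Since the goal is a d3.js dashboard, the headline numbers could be handed to the front end as JSON alongside the CSV written above. The sketch below is one possible shape, not the project's actual export: `build_summary`, the key names, and the placeholder values are all assumptions. The explicit `int()`/`float()` casts are there because NumPy integer types such as `numpy.int64` are not serializable by the standard `json` module.

```python
import json


def build_summary(num_activities, top_activity, total_time_minutes, total_distance):
    """Collect the top statistics into a JSON-friendly dict (hypothetical helper)."""
    return {
        'num_activities': int(num_activities),
        'top_activity': str(top_activity),
        'total_time_minutes': float(total_time_minutes),
        'total_distance': float(total_distance),
    }


# Placeholder values for illustration only, not results from the notebook
summary = build_summary(3, 'Run', 75.0, 6.2)
payload = json.dumps(summary, indent=2)
print(payload)
```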

### Speed and distance activities

In [None]:
# Group activity types


In [24]:
if 'Run' in activity_dfs:
    runs = activity_dfs['Run']
    # Series.sum() skips NaN; the built-in sum() returns nan when a column has
    # gaps, as 'elevation_gain' and 'calories' do for some runs
    run_dist_total_miles = runs['distance_m'].sum()
    run_speed_avg_mph = runs['average_speed_mph'].mean()
    run_count = len(runs)
    run_total_elevation = runs['elevation_gain'].sum()
    run_hr_avg = runs['average_heart_rate'].mean()
    run_cals_total = runs['calories'].sum()
    run_time_total = runs['time_minutes'].sum()

if 'Ride' in activity_dfs:
    pass  # ride statistics not implemented yet
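
The `Ride` branch above is still empty, and it would need the same statistics as the `Run` branch. One way to avoid repeating the block per activity type is a shared helper; the sketch below assumes the column names created earlier in this notebook, and `summarize_activity` and the sample rows are hypothetical.

```python
import pandas as pd


def summarize_activity(frame):
    """Return headline statistics for one activity type (hypothetical helper)."""
    return {
        'count': len(frame),
        'dist_total': frame['distance_m'].sum(),
        'speed_avg_mph': frame['average_speed_mph'].mean(),
        'elevation_total': frame['elevation_gain'].sum(),
        'hr_avg': frame['average_heart_rate'].mean(),
        'cals_total': frame['calories'].sum(),
        'time_total_minutes': frame['time_minutes'].sum(),
    }


# Made-up rows standing in for activity_dfs['Ride']; None marks missing data
rides = pd.DataFrame({
    'distance_m': [10.0, 20.0],
    'average_speed_mph': [12.0, 14.0],
    'elevation_gain': [100.0, None],
    'average_heart_rate': [None, 150.0],
    'calories': [300.0, None],
    'time_minutes': [50.0, 90.0],
})

stats = summarize_activity(rides)
print(stats['elevation_total'])  # 100.0
print(stats['hr_avg'])           # 150.0
```

With the real data this could be applied to every type at once, for example `{name: summarize_activity(frame) for name, frame in activity_dfs.items()}`.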

In [26]:
run_dist_total_miles

1131341.2862887958

### Workout activities