# ITNPBD2 Representing and Manipulating Data
## Pandas Lab - Fitness Tracker Data

In this practical, we will analyse data from a fitness tracking app. The data were collected from several users and stored in a single file. They mix user profile data (username, date of birth, etc.) with activity records (swim distance, etc.), so each user's profile data are repeated on every activity row that user recorded. That is not very useful!

Worse, not every column applies to every activity. For example, the column `Stroke` only has data when the user was swimming; otherwise it is empty.

You need to separate the data into different tables and then analyse it. You will be doing this with Pandas.

## Load the data into a Pandas data frame

After importing pandas, load the data file `fitness.csv`. Either copy it to a local folder from the course Canvas page or read it directly from the URL. Don't set an index column or data types yet - just load it and look at the first 10 rows. Print the length of the table (number of rows).

In [1]:
import pandas as pd

df = pd.read_csv('fitness.csv')

result = df.head(10)
print(result)

count = len(df.index)
display(count)

         Date Activity  Username         DOB  UserWeight  UserHeight  LocName  \
0  01.01.2014      Gym  ddbq2413  09.09.1979          76         157  PureGym   
1  01.01.2014      Gym  ddbq2413  09.09.1979          76         157  PureGym   
2  01.01.2014      Gym  ddbq2413  09.09.1979          76         157  PureGym   
3  01.01.2014      Gym  ddbq2413  09.09.1979          76         157  PureGym   
4  01.01.2014      Gym  ddbq2413  09.09.1979          76         157  PureGym   
5  01.01.2014      Gym  ddbq2413  09.09.1979          76         157  PureGym   
6  01.01.2014      Gym  ddbq2413  09.09.1979          76         157  PureGym   
7  01.01.2014      Gym  ddbq2413  09.09.1979          76         157  PureGym   
8  01.01.2014      Gym  ddbq2413  09.09.1979          76         157  PureGym   
9  01.01.2014      Gym  ddbq2413  09.09.1979          76         157  PureGym   

       Address       Hours  Improves  ... Distance  Set      Exercise  Reps  \
0  High Street  7am - 10pm  S

6088

# Fix the date columns.
## Two columns: DOB and Date contain dates. Convert them to the correct format

In [2]:
df['Date'] = df['Date'].astype('datetime64[ns]')
df['DOB'] = df['DOB'].astype('datetime64[ns]')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6088 entries, 0 to 6087
Data columns (total 23 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   Date         6088 non-null   datetime64[ns]
 1   Activity     6088 non-null   object        
 2   Username     6088 non-null   object        
 3   DOB          6088 non-null   datetime64[ns]
 4   UserWeight   6088 non-null   int64         
 5   UserHeight   6088 non-null   int64         
 6   LocName      5432 non-null   object        
 7   Address      5432 non-null   object        
 8   Hours        5432 non-null   object        
 9   Improves     6088 non-null   object        
 10  BodyPart     6088 non-null   object        
 11  Muscle       6088 non-null   object        
 12  Minutes      1384 non-null   float64       
 13  Distance     1384 non-null   float64       
 14  Set          5432 non-null   float64       
 15  Exercise     4704 non-null   object        
 16  Reps  
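
The raw values look like `01.01.2014`, which reads as day.month.year. Assuming that is the intended layout (the source does not say so explicitly), passing an explicit `format` to `pd.to_datetime` removes any ambiguity between day-first and month-first parsing. The sketch below uses a made-up toy frame, not the lab data; dates printed elsewhere in this notebook come from the `astype` conversion above and may differ from what an explicit format would give.

```python
import pandas as pd

# Toy data in the same dd.mm.yyyy style as the raw file (values are made up)
toy = pd.DataFrame({'Date': ['10.01.2014', '23.01.2014'],
                    'DOB': ['09.09.1979', '13.09.1962']})

# An explicit format parses every value as day.month.year
for col in ['Date', 'DOB']:
    toy[col] = pd.to_datetime(toy[col], format='%d.%m.%Y')

print(toy.dtypes)
print(toy['Date'].dt.month.tolist())  # month number of each parsed date
```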

# Find out what the different activities in the data are, and how often each one occurs

In [3]:
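# Note: this counts the values in the Exercise column (gym exercises only);
# the activity types themselves are stored in the Activity column.
df['Exercise'].value_counts()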

Pullups         1296
Squat           1281
Bench press      894
Situps           443
Plank            418
Barbell Curl     372
Name: Exercise, dtype: int64
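
The heading asks about activities, which are stored in the `Activity` column. The same `unique` / `value_counts` pattern applies there. A minimal sketch on a made-up toy frame (the values are illustrative, not taken from `fitness.csv`):

```python
import pandas as pd

# Toy stand-in for the lab frame (values are made up)
toy = pd.DataFrame({'Activity': ['Gym', 'Gym', 'Swim', 'Run', 'Gym', 'Bike'],
                    'Exercise': ['Squat', 'Plank', None, None, 'Squat', None]})

activities = toy['Activity'].unique()    # distinct activity names
counts = toy['Activity'].value_counts()  # rows per activity, most frequent first

# A notebook cell only displays its last expression, so print both explicitly
print(activities)
print(counts)
```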

# Let's start with some simple data extraction. Print each of these parts of the data:
## The column with usernames in it

In [4]:
print(df[['Date', 'Username']])

           Date  Username
0    2014-01-01  ddbq2413
1    2014-01-01  ddbq2413
2    2014-01-01  ddbq2413
3    2014-01-01  ddbq2413
4    2014-01-01  ddbq2413
...         ...       ...
6083 2014-04-25  nzvs3223
6084 2014-04-27  nzvs3223
6085 2014-04-27  nzvs3223
6086 2014-04-27  nzvs3223
6087 2014-04-27  nzvs3223

[6088 rows x 2 columns]


## The row at index location 2

In [5]:
# iloc selects by position; with the default RangeIndex, df.loc[2] returns the same row
df.iloc[2]

Date           2014-01-01 00:00:00
Activity                       Gym
Username                  ddbq2413
DOB            1979-09-09 00:00:00
UserWeight                      76
UserHeight                     157
LocName                    PureGym
Address                High Street
Hours                   7am - 10pm
Improves                  Strength
BodyPart                      Arms
Muscle                     Triceps
Minutes                        NaN
Distance                       NaN
Set                            2.0
Exercise               Bench press
Reps                           7.0
Weight                        75.0
Stroke                         NaN
HandsFacing                    NaN
HandWidth                      NaN
Seconds                        NaN
HandsWidth                     NaN
Name: 2, dtype: object
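
`loc` looks rows up by index label, while `iloc` looks them up by position. With the default index the two coincide, but they diverge as soon as the index is reordered or replaced. A small sketch with a made-up frame:

```python
import pandas as pd

# Index labels deliberately out of order (toy data)
toy = pd.DataFrame({'Username': ['a', 'b', 'c', 'd']}, index=[2, 3, 0, 1])

by_label = toy.loc[2]      # the row whose index label is 2 (the first row here)
by_position = toy.iloc[2]  # the third row, whatever its label

print(by_label['Username'], by_position['Username'])
```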

## Print all the data for user ddbq2413

In [6]:
print(df.loc[df['Username'] == 'ddbq2413'])

          Date Activity  Username        DOB  UserWeight  UserHeight  LocName  \
0   2014-01-01      Gym  ddbq2413 1979-09-09          76         157  PureGym   
1   2014-01-01      Gym  ddbq2413 1979-09-09          76         157  PureGym   
2   2014-01-01      Gym  ddbq2413 1979-09-09          76         157  PureGym   
3   2014-01-01      Gym  ddbq2413 1979-09-09          76         157  PureGym   
4   2014-01-01      Gym  ddbq2413 1979-09-09          76         157  PureGym   
..         ...      ...       ...        ...         ...         ...      ...   
620 2014-04-27      Gym  ddbq2413 1979-09-09          76         157  Uni Gym   
621 2014-04-27      Gym  ddbq2413 1979-09-09          76         157  Uni Gym   
622 2014-04-27      Gym  ddbq2413 1979-09-09          76         157  Uni Gym   
623 2014-04-27      Gym  ddbq2413 1979-09-09          76         157  Uni Gym   
624 2014-04-27      Gym  ddbq2413 1979-09-09          76         157  Uni Gym   

                    Address

## Calculate the average (mean) distance travelled for Swim, Bike and Run activities respectively. Use `groupby` to do this

In [7]:
# Gym rows have no distance, so leave them out before grouping
df_x_gym = df[df['Activity'] != 'Gym']
df_x_gym.groupby('Activity')['Distance'].mean()

Activity
Bike    22500.000000
Run      5236.842105
Swim      426.785714
Name: Distance, dtype: float64
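
Because the raw file repeats rows, a mean taken over all rows counts each record as many times as it is repeated. If every record is repeated equally often the result is unaffected; if not, the mean shifts. Whether that matters for `fitness.csv` is not established here, so the sketch below only illustrates the effect on made-up data:

```python
import pandas as pd

# One run recorded three times and another recorded once (toy data)
toy = pd.DataFrame({
    'Date': ['2014-01-01', '2014-01-01', '2014-01-01', '2014-01-02'],
    'Activity': ['Run', 'Run', 'Run', 'Run'],
    'Distance': [5000.0, 5000.0, 5000.0, 10000.0],
})

raw_mean = toy.groupby('Activity')['Distance'].mean()
dedup_mean = toy.drop_duplicates().groupby('Activity')['Distance'].mean()

print(raw_mean['Run'], dedup_mean['Run'])
```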

## Extract user Data
You should see that columns 2, 3, 4 and 5 (from Username to UserHeight) are about the user and not the exercise.
- Find out how many different users there are

In [8]:
df['Username'].nunique()

10

## Then create a new data frame with only the user data columns: username, Date of Birth (DOB), weight and height.
- Then drop all the duplicates to create a table with a single entry for each user
- Then set the index to be `Username`
- Display the whole list of users (it is short)

In [9]:
# Keep one row per user, with only the profile columns
df_info = df.drop_duplicates(subset=['Username'])[['Username', 'DOB', 'UserWeight', 'UserHeight']]
print(df_info)

# set_index returns a new frame, so store the result
df_users = df_info.set_index('Username')

      Username        DOB  UserWeight  UserHeight
0     ddbq2413 1979-09-09          76         157
625   hrfu1432 1962-09-13          88         173
1167  revz4142 1979-09-09          79         147
1801  fuqm1243 1974-10-09          75         172
2389  emvk3411 1995-05-09          79         167
3007  cswz4434 1995-05-09          79         158
3658  nlgm4332 1952-09-15          71         127
4177  viqi3431 1977-09-09          84         161
4802  wzro1422 1976-09-09          83         165
5440  nzvs3223 1954-09-15          81         136
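
Methods such as `set_index` and `drop_duplicates` return a new frame rather than changing the one they are called on, so the result has to be assigned to a name to have any effect. A minimal sketch on made-up data:

```python
import pandas as pd

toy = pd.DataFrame({'Username': ['u1', 'u1', 'u2'],
                    'UserWeight': [70, 70, 80]})

users = toy.drop_duplicates(subset=['Username'])

users.set_index('Username')          # result discarded: users is unchanged
unchanged = list(users.columns)      # 'Username' is still an ordinary column

users = users.set_index('Username')  # assign to keep the new index
print(users)
```

Once `Username` is the index, a single user's record can be looked up directly with `users.loc['u2']`.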


# Now we need to extract the Swimming Data
- The columns you need are Date, Username, Distance, Set and Stroke
- You should only extract the rows where the Activity column equals Swim
- Extract them into a dataframe called `swims`
- Note that the `Set` column allows the user to break one swim session into shorter sets, each of which may use a different stroke. You should see each date has 1 or more sets.
- There will be duplicate rows in this table, so you must drop them too

In [10]:
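# Select the swim rows and only the swim-related columns
swims = df.loc[df['Activity'] == 'Swim', ['Date', 'Username', 'Distance', 'Set', 'Stroke']]
print(swims)  # raw extraction: repeated rows are still present here

# drop_duplicates returns a new frame, so assign the result back;
# compare whole rows rather than only the Set number
swims = swims.drop_duplicates()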

           Date  Username  Distance  Set         Stroke
146  2014-01-23  ddbq2413     375.0  1.0    Front Crawl
147  2014-01-23  ddbq2413     375.0  1.0    Front Crawl
148  2014-01-23  ddbq2413     375.0  1.0    Front Crawl
149  2014-01-23  ddbq2413     375.0  1.0    Front Crawl
150  2014-01-23  ddbq2413     125.0  2.0  Breast stroke
...         ...       ...       ...  ...            ...
6010 2014-04-16  nzvs3223     750.0  1.0    Front Crawl
6011 2014-04-16  nzvs3223     250.0  2.0    Front Crawl
6012 2014-04-16  nzvs3223     250.0  2.0    Front Crawl
6013 2014-04-16  nzvs3223     250.0  2.0    Front Crawl
6014 2014-04-16  nzvs3223     250.0  2.0    Front Crawl

[728 rows x 5 columns]
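
The `subset` argument of `drop_duplicates` decides which columns are compared. Comparing whole rows removes only exact repeats, whereas `subset=['Set']` keeps just the first row for each set number across the entire table. A small sketch with made-up swim rows:

```python
import pandas as pd

toy = pd.DataFrame({
    'Date': ['2014-01-23', '2014-01-23', '2014-01-23', '2014-02-01'],
    'Set': [1.0, 1.0, 2.0, 1.0],
    'Stroke': ['Front Crawl', 'Front Crawl', 'Breast stroke', 'Front Crawl'],
})

whole_rows = toy.drop_duplicates()            # drops only the exact repeat
by_set = toy.drop_duplicates(subset=['Set'])  # one row per distinct Set value

print(len(whole_rows), len(by_set))
```

In the toy data the swim on `2014-02-01` survives the whole-row comparison but is lost when only `Set` is compared, because its set number has already been seen.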


# One more extraction
## We won't bother separating out every activity, but we will do one more
- Make another frame called `bike` with Date, Username, Distance and Minutes by selecting rows where activity is Bike
- Remember to remove duplicates

In [11]:
df_bike = df.loc[df['Activity'] == 'Bike']
bike = df_bike[['Date', 'Username', 'Distance', 'Minutes']].drop_duplicates()
print(bike)

           Date  Username  Distance  Minutes
56   2014-10-01  ddbq2413   20000.0     63.0
60   2014-11-01  ddbq2413   10000.0     24.0
259  2014-01-02  ddbq2413   40000.0    102.0
455  2014-03-03  ddbq2413   40000.0    109.0
463  2014-10-03  ddbq2413   20000.0     51.0
...         ...       ...       ...      ...
5691 2014-12-02  nzvs3223   40000.0    112.0
5762 2014-02-26  nzvs3223   20000.0     54.0
5851 2014-03-17  nzvs3223   40000.0    110.0
5913 2014-03-28  nzvs3223   10000.0     24.0
6015 2014-04-18  nzvs3223   20000.0     50.0

[88 rows x 4 columns]


- Extract all the bike data for user `ddbq2413`
- Extract all the bike data for user `ddbq2413` on date `2014-10-01`

In [13]:
# Build the mask from bike itself so it lines up with bike's rows
print(bike.loc[(bike['Username'] == 'ddbq2413') & (bike['Date'] == '2014-10-01')])

         Date  Username  Distance  Minutes
56 2014-10-01  ddbq2413   20000.0     63.0

