# INFO 370 2020 Wi - Python Pandas Basics Exercise

*Name:* Simran

*Collaborators:*

### Instructions

1. Please complete the lab tutorial during your respective lab session.

2. Please write down your name and your collaborators (if any).

3. In this exercise you will get hands-on experience importing data into a structured format, summarizing data with descriptive statistics (e.g., sum, average), and manipulating data through indexing, slicing, and grouping.

4. The data you will be working with was collected from the activity-tracking platform Strava. It is a collection of physical activities that Strava users recorded and posted on the platform. For simplicity, this exercise focuses on a few selected fields. You can find the data description in the Strava API documentation: https://developers.strava.com/docs/reference/#api-models-DetailedActivity

Some of the less obvious variables:
    * distance: distance covered, in meters
    * achievement_count: the number of achievements gained during this activity
    * kudos_count: the number of kudos given for this activity
    * type: type of the activity, e.g. Run or Ride

5. For each question, please type your code in a *Code* cell and a written response summarizing your results in a *Markdown* cell.

6. Don't be scared. We are here to help you learn. :)

In [2]:
# import packages
import numpy as np
import pandas as pd

# you may import other packages if necessary

### 1. Data Import and Data Summary

**(a) Load the data strava-activity.csv into a pandas dataframe.**

Print the first few lines of it as a sanity check.


In [42]:
data = pd.read_csv("./strava-activity.csv.bz2")
data.head(5)

|   | name | athlete.sex | athlete.country | start_date_local | distance | achievement_count | type | kudos_count |
|---|---|---|---|---|---|---|---|---|
| 0 | Mitro"" :) | F | Ecuador | 2014-08-27T15:55:26Z | 21580.0 | 6 | Ride | 0 |
| 1 | Ochtendrit | M | The Netherlands | 2015-08-04T09:25:02Z | 19092.8 | 0 | Ride | 0 |
| 2 | Enough time for a quick and hot 14 | M | United States | 2015-05-17T14:38:04Z | 23023.4 | 8 | Ride | 4 |
| 3 | Ilkley Grassington Buckden and back | F | United Kingdom | 2014-05-26T09:20:53Z | 101702.0 | 11 | Ride | 6 |
| 4 | Morning Short hike | F | United States | 2013-12-19T09:05:16Z | 2739.8 | 0 | Hike | 0 |
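
The question names `strava-activity.csv`, while the cell above reads `strava-activity.csv.bz2`. By default `pd.read_csv` infers the compression from the file extension, so no extra argument is needed. The sketch below illustrates this with a small toy frame written to a temporary file; the toy rows and the file name `toy.csv.bz2` are made up for illustration and are not part of the lab data.

```python
import os
import tempfile

import pandas as pd

# Toy data standing in for the real file (illustrative values only).
toy = pd.DataFrame({"name": ["Ochtendrit", "Morning Short hike"],
                    "distance": [19092.8, 2739.8]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.csv.bz2")  # hypothetical file name
    # to_csv infers bz2 compression from the ".bz2" extension...
    toy.to_csv(path, index=False)
    # ...and read_csv decompresses it the same way.
    loaded = pd.read_csv(path)

print(loaded.head())
```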


**(b) It's time to get to know your data! Report the number of rows and columns in the dataset.**

In [43]:
data.shape

(8093, 8)

Your written response here:

_8093 rows, 8 columns_


**(c) What variables does this dataset have? Report the variable names along with the data type of each variable.**

In [44]:
data.dtypes

name                  object
athlete.sex           object
athlete.country       object
start_date_local      object
distance             float64
achievement_count      int64
type                  object
kudos_count            int64
dtype: object

Your written response here:

_name_: object
_athlete.sex_: object
_athlete.country_: object
_start_date_local_: object
_distance_: float64
_achievement_count_: int64
_type_: object
_kudos_count_: int64
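
Note that `start_date_local` is stored as `object`, meaning pandas kept the timestamps as plain strings. As a minimal sketch, assuming the timestamp format shown in the sample rows, the column could be converted with `pd.to_datetime`. The toy frame below is illustrative only; the exercise itself does not require this step.

```python
import pandas as pd

# Toy rows mimicking the timestamp format from the sample output (illustrative only).
toy = pd.DataFrame({
    "start_date_local": ["2014-08-27T15:55:26Z", "2015-08-04T09:25:02Z"],
    "distance": [21580.0, 19092.8],
})

# Parse the ISO 8601 strings into real datetime values.
toy["start_date_local"] = pd.to_datetime(toy["start_date_local"])

# Once parsed, date parts are available through the .dt accessor.
start_years = toy["start_date_local"].dt.year.tolist()
print(toy.dtypes)
print(start_years)
```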

**(d) What is the number of NULL/NA values in each column of the dataframe?**



In [45]:
# missing values per column = total rows minus non-null entries
for col in data.columns:
    print(col, len(data) - data[col].count())

name 0
athlete.sex 185
athlete.country 200
start_date_local 0
distance 0
achievement_count 0
type 0
kudos_count 0


_name_: 0
_athlete.sex_: 185
_athlete.country_: 200
_start_date_local_: 0
_distance_: 0
_achievement_count_: 0
_type_: 0
_kudos_count_: 0
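
The same counts can usually be obtained without a loop by chaining `isna()` and `sum()`. The sketch below uses a small toy frame (made-up values, with column names borrowed from the dataset) to show the idea.

```python
import numpy as np
import pandas as pd

# Toy frame with a few deliberately missing values (illustrative only).
toy = pd.DataFrame({
    "athlete.sex": ["F", "M", np.nan, "M"],
    "athlete.country": ["Ecuador", np.nan, np.nan, "Canada"],
    "distance": [21580.0, 19092.8, 23023.4, 2739.8],
})

# isna() marks each missing cell as True; summing the booleans counts them per column.
na_counts = toy.isna().sum()
print(na_counts)
```

On the real data, `data.isna().sum()` should report the same numbers as the loop above.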

### 2. Data Manipulation - Explore Achievement Count by Country

**(a) What are the top 5 countries most Strava users are from in this dataset?**

Hint: check out _DataFrame.groupby_ method, _Series.count()_ method, and _Series.nlargest()_ method.

In [46]:
# count the rows (activities) recorded for each country, then keep the five largest counts
activities_per_country = data.groupby('athlete.country')['name'].count()
activities_per_country.nlargest(5)

athlete.country
United States     2424
United Kingdom    1770
Australia          632
Canada             249
France             245
Name: name, dtype: int64

Your written response here

_United States_:     2424
_United Kingdom_:    1770
_Australia_:          632
_Canada_:             249
_France_:             245
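
Since each row is one recorded activity and the selected fields include no athlete identifier, these numbers count activities per country rather than distinct users. As an alternative sketch, `Series.value_counts()` produces the same kind of ranking in one step; the toy values below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy country column with one missing entry (illustrative only).
toy = pd.DataFrame({
    "athlete.country": ["United States", "Canada", "United States", np.nan,
                        "France", "Canada", "United States"],
})

# value_counts() drops NaN by default and sorts counts in descending order,
# so head(n) returns the n most frequent countries.
top_two = toy["athlete.country"].value_counts().head(2)
print(top_two)
```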

**(b) What is the total achievement count of athletes from Canada?**

In [57]:
canada = data[data['athlete.country'] == 'Canada']
# sum only the achievement_count column instead of every column in the frame
canada['achievement_count'].sum()

635

Your written response here

_635_
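
Filtering rows and selecting a column can also be done in a single `.loc` call with a boolean mask. This is a minimal sketch on made-up toy values, not the lab data.

```python
import pandas as pd

# Toy frame (illustrative values only).
toy = pd.DataFrame({
    "athlete.country": ["Canada", "Australia", "Canada", "France"],
    "achievement_count": [3, 5, 4, 1],
})

# Boolean mask: True for the rows whose country is Canada.
is_canada = toy["athlete.country"] == "Canada"

# .loc[rows, column] selects the masked rows and one column, which is then summed.
canada_total = toy.loc[is_canada, "achievement_count"].sum()
print(canada_total)  # 7
```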


**(c) What is the average achievement count of male athletes from Australia? How does it compare to that of female athletes from Australia?**

In [62]:
australia = data[data['athlete.country'] == 'Australia']
au_m = australia[australia['athlete.sex'] == 'M']
au_f = australia[australia['athlete.sex'] == 'F']
# average only the numeric achievement_count column; NaN values are skipped by default
male_avg = au_m['achievement_count'].mean()
female_avg = au_f['achievement_count'].mean()
print(male_avg)
print(female_avg)


6.221476510067114
4.6923076923076925


Your written response here

_The average achievement count for Australian male athletes is 6.22, which is higher than the 4.69 average for Australian female athletes._
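
Both averages can also be computed together by grouping the Australian rows on `athlete.sex`. The sketch below uses made-up toy values to show the pattern.

```python
import pandas as pd

# Toy frame (illustrative values only).
toy = pd.DataFrame({
    "athlete.country": ["Australia", "Australia", "Australia", "Canada", "Australia"],
    "athlete.sex": ["M", "F", "M", "M", "F"],
    "achievement_count": [6, 4, 8, 10, 2],
})

# Keep the Australian rows, then average achievement_count within each sex.
australia = toy[toy["athlete.country"] == "Australia"]
avg_by_sex = australia.groupby("athlete.sex")["achievement_count"].mean()
print(avg_by_sex)
```

Rows with a missing `athlete.sex` are left out of the groups by default, which matches the behavior of the two explicit filters above.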

### 3. Data Manipulation - Explore Activity Type (Extra question, not graded)

**(a) How many different kinds of activities are there in this data set? Report the name and the total number of completed activities for each type.**

In [68]:
data['type'].value_counts()

Ride               4512
Run                3001
Walk                194
Swim                178
Workout              66
Hike                 37
VirtualRide          26
NordicSki            20
WeightTraining       14
Yoga                  9
AlpineSki             7
Crossfit              5
BackcountrySki        4
IceSkate              4
Kayaking              3
Snowboard             3
Elliptical            3
Rowing                3
EBikeRide             1
RockClimbing          1
Snowshoe              1
StandUpPaddling       1
Name: type, dtype: int64
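
The listing above has 22 rows, one per activity type. As a small sketch on made-up toy values, the number of distinct types can also be read off directly with `Series.nunique()` instead of counting rows by eye.

```python
import pandas as pd

# Toy activity types (illustrative values only).
toy = pd.DataFrame({"type": ["Ride", "Run", "Ride", "Hike", "Run", "Ride"]})

# nunique() counts distinct non-missing values.
n_types = toy["type"].nunique()

# value_counts() gives the number of activities per type; its length matches n_types.
counts = toy["type"].value_counts()
print(n_types)  # 3
print(counts)
```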

**(b) Which type of activity do most male and female users participate in?**

Hint: check out _DataFrame.groupby()_ and _size()_ methods

In [None]:
# Your code here

Your written response here
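
One possible approach, following the hint, is sketched below on made-up toy values: count the rows for each (sex, type) pair with `groupby(...).size()`, pivot the types into columns, and pick the largest count in each row. This is an illustration of the technique, not a reference solution for the real data.

```python
import pandas as pd

# Toy frame (illustrative values only).
toy = pd.DataFrame({
    "athlete.sex": ["M", "M", "M", "F", "F", "F", "F"],
    "type": ["Ride", "Ride", "Run", "Run", "Run", "Walk", "Ride"],
})

# Number of activities for every (sex, type) combination.
counts = toy.groupby(["athlete.sex", "type"]).size()

# Pivot types into columns (missing combinations become 0),
# then take the column label with the largest count in each row.
top_type = counts.unstack(fill_value=0).idxmax(axis=1)
print(top_type)
```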


**(c) What is the average distance covered by users for each type of activity?**

Hint: use _groupby()_ and _mean()_

In [None]:
# Your code here
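
As a hedged sketch of the hint, the pattern below groups by `type` and averages `distance`. The toy values are made up; the conversion to kilometers is an optional extra that relies on `distance` being recorded in meters, as stated in the instructions.

```python
import pandas as pd

# Toy frame (illustrative values only); distance is in meters.
toy = pd.DataFrame({
    "type": ["Ride", "Ride", "Run", "Run", "Hike"],
    "distance": [20000.0, 30000.0, 5000.0, 7000.0, 2500.0],
})

# Mean distance per activity type, converted from meters to kilometers
# and sorted from longest to shortest.
avg_km = (toy.groupby("type")["distance"].mean() / 1000).sort_values(ascending=False)
print(avg_km)
```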