## Lesson 8 Overview

## Let's load today's lesson!

### Open Azure Notebooks library 

Go to https://notebooks.azure.com -> Sign in if needed -> Select **python-codeacademy-sg**

### Update lesson file to latest version

Select **New** -> **From URL** -> enter https://raw.githubusercontent.com/viettrung9012/python-codeacademy-sg/master/Lesson8.ipynb (the URL is also listed in **Lesson8.ipynb**) -> Click outside the input box, then select **Upload** (overwrite if needed)

### Open JupyterLab

Open JupyterLab from your browser bookmark, or select **Run** -> Change the browser URL path from **/nb/tree** to **/nb/lab**

Select **Lesson8.ipynb**

## Let's talk about Pandas

What *is* Pandas? [Pandas](https://pandas.pydata.org/) is an open source library providing high-performance, easy-to-use data structures and data analysis tools for Python. Though you might have been thinking about adorable black and white pandas, the name is actually derived from *"panel data"*, an econometrics term for data sets that include observations over multiple time periods for the same individuals.

In [None]:
# import the pandas library and read the CSV file into a DataFrame

import pandas as pd
df = pd.read_csv('Biggest Loser 2018.csv')
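
If the CSV file is not at hand, a small stand-in DataFrame can help when following the cells below. This is only a sketch: the column names are taken from the outputs later in this lesson, while the rows (team names, member ids, step counts) are made up for illustration and are not from the real file.

```python
import pandas as pd

# Hypothetical sample rows; the real values come from 'Biggest Loser 2018.csv'.
sample = pd.DataFrame(
    {
        "team_no": [1, 1, 2],
        "team_name": ["Team A", "Team A", "Team B"],
        "team_captain": [True, False, True],
        "team_member": ["1-1", "1-2", "2-1"],
        "2018-04-02": [8000.0, 12000.0, 9500.0],
        "2018-04-03": [7000.0, None, 10500.0],  # None becomes NaN (a missing day)
    }
)

print(sample.shape)   # (rows, columns)
print(sample.dtypes)  # data type of each column
```

Each date is its own column, so one row holds a member's whole step history.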

In [None]:
# list dataframe in tabular format

df

In [None]:
# identify column names

df.columns

In [None]:
# retrieve data from a specific column, e.g. team name

df['team_name']

In [None]:
# retrieve unique team names

df['team_name'].unique()

In [None]:
# retrieve data for a specific team number, e.g. team 1

df.loc[df['team_no'] == 1]

In [None]:
# retrieve data for a specific team member, e.g. 1-5

df.loc[df['team_member'] == "1-5"]
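
The two filters above work the same way: the comparison inside the brackets produces a Series of True/False values, and `loc` keeps only the rows marked True. A minimal sketch on made-up data (the `sample` frame and its values are hypothetical):

```python
import pandas as pd

# Hypothetical data with the same column names as the lesson file.
sample = pd.DataFrame(
    {
        "team_no": [1, 1, 2],
        "team_member": ["1-1", "1-2", "2-1"],
    }
)

# The comparison yields one True/False value per row.
mask = sample["team_no"] == 1
print(mask.tolist())

# loc keeps only the rows where the mask is True.
team_one = sample.loc[mask]

# Conditions can be combined with & (and) or | (or); each needs its own parentheses.
both = sample.loc[(sample["team_no"] == 1) & (sample["team_member"] == "1-2")]
print(both)
```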

In [None]:
# add a column called member_tot_steps to store total steps for each member for the entire challenge
# sort individuals by total steps in descending order; display top 3
# axis=1 sums across the columns of each row (one total per member);
# axis=0 would sum down each column instead (one total per day)

df['member_tot_steps'] = df.loc[:, '2018-04-02':'2018-04-29'].sum(axis=1)
df.sort_values('member_tot_steps', ascending=False).head(3)
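
To make the `axis` argument and the column slice concrete, here is a small sketch on made-up numbers (the `steps` frame is hypothetical). Note that a label slice with `loc` includes both end labels.

```python
import pandas as pd

# Hypothetical step counts: one row per member, one column per day.
steps = pd.DataFrame(
    {
        "2018-04-02": [100, 200],
        "2018-04-03": [10, 20],
        "2018-04-04": [1, 2],
    },
    index=["1-1", "1-2"],
)

per_member = steps.sum(axis=1)  # across columns: one total per row (member)
per_day = steps.sum(axis=0)     # down rows: one total per column (day)

# Label slices are inclusive: this keeps both 2018-04-02 and 2018-04-03.
first_two_days = steps.loc[:, "2018-04-02":"2018-04-03"].sum(axis=1)

print(per_member)
print(per_day)
print(first_two_days)
```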

In [None]:
# display individuals who have exceeded 350,000 total steps, sorted in descending order

df[df['member_tot_steps'] > 350000].sort_values('member_tot_steps', ascending=False)

In [35]:
# add a column called member_avg_steps to store average daily steps for each member for the entire challenge
# sort individuals by average steps in descending order; display top 3
# axis=1 averages across the columns of each row (one average per member);
# axis=0 would average down each column instead (one average per day)

df['member_avg_steps'] = df.loc[:, '2018-04-02':'2018-04-29'].mean(axis=1).round(0)
df.sort_values('member_avg_steps', ascending=False).head(3)

| id2 | team_no | team_name | team_captain | team_member | 2018-04-02 | 2018-04-03 | 2018-04-04 | 2018-04-05 | 2018-04-06 | 2018-04-07 | ... | 2018-04-21 | 2018-04-22 | 2018-04-23 | 2018-04-24 | 2018-04-25 | 2018-04-26 | 2018-04-27 | 2018-04-28 | 2018-04-29 | member_avg_steps |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 42 | 9 | Just Step It | False | 9-3 | 3000.0 | 3000 | 22455 | 10025 | 25172.0 | 31083 | ... | 10621.0 | 16228.0 | 10621.0 | NaN | NaN | NaN | NaN | NaN | NaN | 15258.0 |
| 37 | 8 | The Slimsons | True | 8-3 | 8236.0 | 17181 | 17212 | 11077 | 20700.0 | 12917 | ... | 22308.0 | 7878.0 | 8558.0 | 7966.0 | 7680.0 | 19611.0 | 9970.0 | 4038.0 | 3855.0 | 15146.0 |
| 59 | 12 | Scrambled Legs | True | 12-5 | 20592.0 | 19092 | 20154 | 17551 | 10438.0 | 19226 | ... | 3435.0 | 2078.0 | 5960.0 | 13194.0 | NaN | NaN | NaN | NaN | NaN | 15009.0 |
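
Some members have no entry for the last few days of the challenge. `mean` skips missing values by default, so their average is taken over recorded days only, not over all 28 days. The sketch below shows the difference on made-up numbers; the `steps` frame and its column names `d1` and `d2` are hypothetical.

```python
import pandas as pd

# Hypothetical data: the second member has no entry for day d2.
steps = pd.DataFrame({"d1": [3000.0, 8000.0], "d2": [9000.0, None]})

# Default behaviour (skipna=True): missing days are left out of the average.
avg_recorded_days = steps.mean(axis=1)

# Alternative: count a missing day as zero steps before averaging.
avg_all_days = steps.fillna(0).mean(axis=1)

print(avg_recorded_days)
print(avg_all_days)
```

Which version is fairer depends on whether a missing day means "did not walk" or "did not report", which the data alone does not tell us.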


In [27]:
# display the highest single-day step count for each individual

df.loc[:, '2018-04-02':'2018-04-29'].max(axis=1)

id2
0     39812.0
1     29581.0
2     42590.0
3     25565.0
4     25371.0
5     16870.0
6      8031.0
7     30719.0
8     27244.0
9     12448.0
10    19242.0
11    29268.0
12    13346.0
13    18540.0
14    18027.0
15    17194.0
16    21664.0
17    17292.0
18    14123.0
19    17994.0
20    28652.0
21    22412.0
22    22600.0
23    20402.0
24    31193.0
25    13256.0
26    32035.0
27    17054.0
28    16232.0
29    17249.0
       ...   
40    16263.0
41    10517.0
42    41022.0
43    18707.0
44    14944.0
45    27636.0
46    20055.0
47     8805.0
48    11921.0
49    21638.0
50    14046.0
51    20150.0
52    16480.0
53    24020.0
54    17796.0
55    21197.0
56    16540.0
57    18438.0
58    15907.0
59    24577.0
60    24000.0
61    24754.0
62    18993.0
63    22371.0
64    16140.0
65    23290.0
66    18566.0
67    14736.0
68    15585.0
69     6633.0
dtype: float64

In [None]:
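# sum total daily steps for each team into a new DataFrame called team_df
# numeric_only=True leaves out text columns such as team_name,
# which recent pandas versions may otherwise try to add together

team_df = df.groupby('team_no').sum(numeric_only=True)
team_df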
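
As a rough picture of what `groupby` does here: rows sharing the same `team_no` are collected, and each numeric column is summed within the group. A minimal sketch on made-up data (the `sample` frame and its values are hypothetical):

```python
import pandas as pd

# Hypothetical data: two members on team 1, one member on team 2.
sample = pd.DataFrame(
    {
        "team_no": [1, 1, 2],
        "team_name": ["Team A", "Team A", "Team B"],
        "2018-04-02": [8000, 12000, 9500],
        "2018-04-03": [7000, 5000, 10500],
    }
)

# One row per team; team_no becomes the index of the result.
# numeric_only=True leaves out the text column team_name.
team_totals = sample.groupby("team_no").sum(numeric_only=True)
print(team_totals)
```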

In [None]:
# remove a column, e.g. team_captain

del team_df['team_captain']
team_df

In [None]:
# add a column called team_tot_steps to store total steps for each team for the entire challenge
# sort teams by total steps in descending order; display top 3

team_df['team_tot_steps'] = team_df.loc[:, '2018-04-02':'2018-04-29'].sum(axis=1)
team_df.sort_values('team_tot_steps', ascending=False).head(3)

In [None]:
# display teams that have exceeded 1 million total steps, sorted in descending order

team_df[team_df['team_tot_steps'] > 1000000].sort_values('team_tot_steps', ascending=False)