## Lesson 8 Overview

## Let's load today's lesson!

### Open Azure Notebooks library 

Go to https://notebooks.azure.com -> Sign in if needed -> Select **python-codeacademy-sg**

### Update lesson file to latest version

Select **New** -> **From URL** -> input https://raw.githubusercontent.com/viettrung9012/python-codeacademy-sg/master/Lesson8.ipynb (URL is available in **Lesson8.ipynb**) -> Click outside input then select **Upload** (overwrite if needed)

### Open JupyterLab

Open JupyterLab from your browser bookmark, or select **Run** and then change the browser URL path from **/nb/tree** to **/nb/lab**.

Select **Lesson8.ipynb**

## Let's talk about Pandas

What *is* Pandas? [Pandas](https://pandas.pydata.org/) is an open-source library providing high-performance, easy-to-use data structures and data analysis tools for Python. You might be picturing adorable black-and-white pandas, but the name actually derives from *"panel data"*, an econometrics term for data sets that include observations over multiple time periods for the same individuals.

Pandas is often used together with [NumPy](http://www.numpy.org/) and [scikit-learn](http://scikit-learn.org/stable/index.html).

In [None]:
# let's import pandas library
import pandas as pd

### Pandas Data structure - Series

A Series is a **one-dimensional** array with a labeled axis, called the index.

In [None]:
# Create a Series; by default the index starts from 0
fruits = pd.Series(["Apple", "Banana", "Mango"])
fruits[2]  # returns 'Mango'

In [None]:
# You can pass in an index to define your own labels
fruits = pd.Series(["Apple", "Banana", "Mango"], index=['a', 'b', 'm'])
fruits['m']  # returns 'Mango'

In [None]:
# Do this again, this time from a dictionary (the keys become the index)
fruits = pd.Series({'a': "Apple", 'b': "Banana", 'm': "Mango"})
fruits['m']  # returns 'Mango'

### Pandas Data structure - DataFrame

A DataFrame is a **two-dimensional** table with labeled axes. You can think of it as a dict-like container for Series objects, one per column.

In [None]:
# Let's build a DataFrame of fruit inventory across all floors
fruit_inventory = {
    "fruit": ["Apple", "Banana", "Mango"],
    "fruit_count_2F": [75, 150, 250],
    "fruit_count_6F": [80, 120, 150],
    "fruit_count_8F": [50, 100, 200],
    "fruit_count_9F": [100, 200, 350],
}
df1 = pd.DataFrame(fruit_inventory)
df1

In [None]:
# You get a Series if you access one of a DataFrame's columns
df1.fruit  # you can also use df1["fruit"]
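
The dict-like behaviour described above can be checked with a minimal sketch. The `inventory` table here is a small stand-in created only for this illustration, not the lesson's `df1`.

```python
import pandas as pd

# A small stand-in table (illustrative values, not the lesson's df1)
inventory = pd.DataFrame({
    "fruit": ["Apple", "Banana", "Mango"],
    "fruit_count_2F": [75, 150, 250],
})

# Selecting one column returns a Series that shares the DataFrame's index
counts = inventory["fruit_count_2F"]
print(type(counts).__name__)

# Like a dict, a DataFrame exposes its column names as keys
print(list(inventory.keys()))
```

Bracket access such as `inventory["fruit"]` tends to be the safer habit, since attribute access stops working when a column name contains spaces or clashes with a DataFrame method.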

In [None]:
# let's create a second dataframe
fruit_property = {
    "fruit": ["Apple", "Banana", "Mango", "Papaya"],
    "fruit_color": ["red", "yellow", "yellow", "orange"],
}
df2 = pd.DataFrame(fruit_property)
df2

#### Merging DataFrames

In [None]:
# Left merge DataFrames df1 and df2 using the 'fruit' column as key
# This keeps every row of df1; Papaya, which appears only in df2, is dropped

fruit_list1 = df1.merge(df2, on='fruit', how='left')
fruit_list1

In [None]:
# Outer merge DataFrames df1 and df2 using the 'fruit' column as key 
# All rows are kept, and empty values are filled with NaN (Not a Number)

fruit_list2 = df1.merge(df2, on="fruit", how="outer")
fruit_list2
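
To see the difference between the two merge types side by side, here is a minimal sketch on two tiny tables. The names `left_df` and `right_df` and their values are made up for this illustration.

```python
import pandas as pd

# Two tiny illustrative tables that share a 'fruit' key column
left_df = pd.DataFrame({"fruit": ["Apple", "Banana"], "count": [75, 150]})
right_df = pd.DataFrame({"fruit": ["Apple", "Papaya"], "color": ["red", "orange"]})

# Left merge: keeps the rows of left_df only
left_merged = left_df.merge(right_df, on="fruit", how="left")

# Outer merge: keeps the rows of both tables
outer_merged = left_df.merge(right_df, on="fruit", how="outer")

print(left_merged["fruit"].tolist())
print(sorted(outer_merged["fruit"]))

# Papaya has no count, so the column holds NaN and becomes a float column
print(outer_merged["count"].dtype)
```

The float conversion in the last line is why whole-number counts can show decimal places after an outer merge.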

#### Removing NaN and casting numbers to integer

In [None]:
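# Let's replace NaN values in the fruit count columns with 0
fruit_list2 = fruit_list2.fillna(0)

# Let's remove the decimal places that appeared in the fruit counts after merging
# (errors='ignore' is deprecated in recent pandas, so we convert only the count columns)
count_columns = [column for column in fruit_list2.columns if column.startswith('fruit_count')]
fruit_list2 = fruit_list2.astype({column: int for column in count_columns})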

fruit_list2

#### Summing by row in DataFrame

In [None]:
# Now let's create a column to sum the total fruit count across all floors

# data.loc[<row selection>, <column selection>]
# In this case, we're selecting all rows for a range of columns
# axis=1 sums across the columns of each row, while axis=0 sums down each column

fruit_list2['fruit_count_total'] = fruit_list2.loc[:, 'fruit_count_2F':'fruit_count_9F'].sum(axis=1)
fruit_list2

#### Summing by column in DataFrame

In [None]:
# Let's count all the fruits on each floor

# First, create a list of column names which include 'F' 
columns = [column for column in list(fruit_list2) if 'F' in column]

# Next, sum within each column for the column names which include 'F'
fruit_count_floor_total = fruit_list2[columns].sum()
fruit_count_floor_total
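
The `axis` argument used in the two summing sections above is easy to mix up, so here is a minimal sketch on a toy table. The column names `floor_2F` and `floor_6F` and the numbers are invented for this illustration.

```python
import pandas as pd

# Toy table: one row per fruit, one column per floor (illustrative values)
counts = pd.DataFrame(
    {"floor_2F": [1, 2], "floor_6F": [10, 20]},
    index=["Apple", "Banana"],
)

# axis=1: add across the columns, giving one total per row (per fruit)
row_totals = counts.sum(axis=1)

# axis=0 (the default): add down the rows, giving one total per column (per floor)
column_totals = counts.sum(axis=0)

print(row_totals.to_dict())     # {'Apple': 11, 'Banana': 22}
print(column_totals.to_dict())  # {'floor_2F': 3, 'floor_6F': 30}
```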

#### Average 

In [None]:
# Let's find out the average count of each fruit across all floors

fruit_list2['fruit_count_avg'] = fruit_list2.loc[:, 'fruit_count_2F':'fruit_count_9F'].mean(axis=1).round(0)
fruit_list2.sort_values('fruit_count_avg', ascending=False)

#### Sorting rows in DataFrame

In [None]:
# Great! Now let's sort the fruit count total in descending order
fruit_list2.sort_values('fruit_count_total', ascending=False)

In [None]:
# sort and display in descending order, only fruits with count exceeding 500 

fruit_list2[fruit_list2['fruit_count_total']>500].sort_values('fruit_count_total', ascending=False)

#### Retrieving certain records in DataFrame

In [None]:
# Let's retrieve rows containing only yellow fruits
fruit_list2.loc[fruit_list2['fruit_color'] == "yellow"]
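
Filtering like this works in two steps: the comparison builds a Series of True/False values, and that Series then picks the matching rows. A minimal sketch, using a small made-up `fruits` table:

```python
import pandas as pd

# Small illustrative table (not the lesson's fruit_list2)
fruits = pd.DataFrame({
    "fruit": ["Apple", "Banana", "Mango"],
    "fruit_color": ["red", "yellow", "yellow"],
})

# Step 1: the comparison produces one True/False value per row
is_yellow = fruits["fruit_color"] == "yellow"
print(is_yellow.tolist())  # [False, True, True]

# Step 2: the boolean Series keeps only the rows marked True
yellow_fruits = fruits[is_yellow]
print(yellow_fruits["fruit"].tolist())  # ['Banana', 'Mango']
```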

In [None]:
# Get unique fruit colors
print(fruit_list2['fruit_color'].unique())

#### Grouping and summing in DataFrames

In [None]:
# Sum the fruit counts for each fruit color
# numeric_only=True skips text columns such as 'fruit'
fruit_list2.groupby('fruit_color').sum(numeric_only=True)

#### Adding and deleting columns in DataFrames

In [None]:
# Add a new column 'in_stock' based on a function applied to column 'fruit_count_total'
# 'in_stock' = True only if fruit_count_total > 0

fruit_list2['in_stock'] = fruit_list2['fruit_count_total'].apply(lambda x: x > 0)
fruit_list2

In [None]:
# Let's delete 'in_stock' column
del fruit_list2['in_stock']
fruit_list2

### Now it's your turn to use Pandas to explore the Biggest Loser data

In [None]:
# Ready? Let's read the Biggest Loser csv data into a dataframe 

df = pd.read_csv('Biggest Loser 2018.csv')
df

#### Retrieve top records in DataFrame

In [None]:
# Too many rows! How can we read the top 5?
df.head(5)

#### Retrieve bottom records in DataFrame

In [None]:
# That's better. How about the bottom 3 rows?
df.tail(3)

### Now you have everything you need to code the following by yourselves!

In [None]:
# Your turn now! Let's print unique team names



In [None]:
# Great! Now retrieve data for only team_no 1



In [None]:
# Now add a column called member_tot_steps to store each person's total count across the challenge
# sort rows by total steps in descending order; display top 3




In [None]:
# sort and display in descending order, individuals who have exceeded 350K steps




In [None]:
# add a column called member_avg_steps to store average steps for each member across the challenge
# sort individuals by average steps in descending order; display top 3



In [None]:
# sum total daily steps for each team into a new DataFrame called team_df




In [None]:
# In DataFrame team_df, remove a column e.g. team captain




In [None]:
# In DataFrame team_df, add a column called team_tot_steps 
# This will store total steps for each team for entire challenge
# Sort teams by total steps in descending order; display top 3



In [None]:
# In DataFrame team_df, sort & display in descending order, teams who have exceeded 1 million steps




### Recap

Today you have learnt the basics of the Pandas library:
* Pandas data structures - Series and DataFrame
* DataFrame merging, grouping, manipulation, retrieving

### Congratulations! 

You have reached the end of our Python for Beginners course. We hope you have enjoyed learning Python and will continue to use it in future!