In [1]:
# import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px

## Tutorial 3: Better understanding of `for loops` and `defining functions`

In this tutorial we will learn how to use `for loops` and `defining functions` to make our code more efficient and easier to read. To build your understanding of both, we are going to write a few algorithms that do some data manipulation. **Please remember that most of these algorithms are already implemented in the `pandas` library and can be used directly in your code.** We write them by hand today only to help you understand how `for loops` work on dataframes.

### For loops

In this tutorial, we are going to use `for loops` to iterate through a dataframe. For example, we have a dataframe as follows:

| Name      | Gender | Age | Job        |
| --------- | ------ | --- | ---------- |
| Alice     | F      | 20  | Data Analyst |
| Bob       | M      | 30  | Software Engineer |
| Charlie   | M      | 40  | Data Scientist |
| David     | M      | 50  | Consultant |
| Eve       | F      | 60  | Business Analyst |



If we want to read the value of every cell in the dataframe, we can use two nested `for loops`: the outer loop goes through the row positions and the inner loop goes through the column positions. The code is as follows:

In [2]:
df = pd.DataFrame(
    {
        "Name": ["Alice", "Bob", "Charlie", "David", "Eve"],
        "Gender": ["F", "M", "M", "M", "F"],
        "Age": [20, 30, 40, 50, 60],
        "Job": ["Data Analyst", "Software Engineer", "Data Scientist", "Consultant", "Business Analyst"],
    }
)
for i in range(len(df)):
    for j in range(len(df.columns)):
        print(df.iloc[i, j])

Alice
F
20
Data Analyst
Bob
M
30
Software Engineer
Charlie
M
40
Data Scientist
David
M
50
Consultant
Eve
F
60
Business Analyst


Alternatively, we can use `iterrows()` to iterate through the dataframe. The code is as follows:

In [3]:
for ind, row in df.iterrows():
    for col in df.columns:
        print(row[col])
    

Alice
F
20
Data Analyst
Bob
M
30
Software Engineer
Charlie
M
40
Data Scientist
David
M
50
Consultant
Eve
F
60
Business Analyst
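
To make the iteration order easier to follow, the sketch below prints the row label and the column name next to each value. It rebuilds a two-row version of the table above so that it runs on its own; the helper name `iter_cells` is an assumption made for illustration, not something `pandas` provides.

```python
import pandas as pd

people = pd.DataFrame(
    {
        "Name": ["Alice", "Bob"],
        "Gender": ["F", "M"],
        "Age": [20, 30],
    }
)

def iter_cells(frame):
    """Yield (row label, column name, value) for every cell, row by row."""
    for ind, row in frame.iterrows():
        for col in frame.columns:
            yield ind, col, row[col]

cells = list(iter_cells(people))
for ind, col, value in cells:
    print(ind, col, value)
```

The values come out row by row, in the same order as in the two loops above.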


### Defining functions

In real production work, we often write functions to make our code more efficient and easier to read. Once a piece of code is encapsulated in a function, we can reuse it in other places.

A function works like a small sub-program. It takes some inputs, called `parameters`, and gives back some outputs, called `return values`. This is very similar to how functions work in mathematics.

For example, we want to calculate the area and perimeter of a rectangle. We can write a function to do this. The code is as follows:

In [4]:
def cal_area_perimeter(length, width):
    area = length * width
    perimeter = 2 * (length + width)
    return area, perimeter

area, perimeter = cal_area_perimeter(10, 20)
print(area, perimeter)

200 60
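
A function like this can also be called once for every row of a dataframe. The sketch below shows one way to do that with `iterrows()`; the `rectangles` dataframe and its values are made up for illustration.

```python
import pandas as pd

def cal_area_perimeter(length, width):
    area = length * width
    perimeter = 2 * (length + width)
    return area, perimeter

# hypothetical example data
rectangles = pd.DataFrame({"length": [10, 3, 5], "width": [20, 4, 5]})

areas = []
perimeters = []
for ind, row in rectangles.iterrows():
    # call the function with the values of the current row
    area, perimeter = cal_area_perimeter(row["length"], row["width"])
    areas.append(area)
    perimeters.append(perimeter)

rectangles["area"] = areas
rectangles["perimeter"] = perimeters
print(rectangles)
```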


With the help of functions, we can make some data manipulation steps easier. For example, suppose we have a dataframe of dates:

| Date      |
| --------- |
| 2022-11-01 |
| 2022-11-02 |
| 2022-11-03 |
| 2022-11-04 |
| 2022-11-05 |
| 2022-11-06 |
| 2022-11-07 |


If we want to find the next Sunday for each date (or the date itself if it is already a Sunday), we can write a function to do this. The code is as follows:


In [5]:
df = pd.DataFrame(
    {
        'Date': ['2022-11-01', '2022-11-02', '2022-11-03', '2022-11-04', '2022-11-05', '2022-11-06', '2022-11-07']
    }
)
def get_next_sunday(date):
    date = pd.to_datetime(date)
    # if the date is Sunday, return the same date
    if date.day_name() == 'Sunday':
        return date
    else:
        # dayofweek is 0 for Monday and 6 for Sunday, so this adds the days left until Sunday
        next_sunday = date + pd.DateOffset(days=6 - date.dayofweek)
        return next_sunday

df['Next Sunday'] = df['Date'].apply(get_next_sunday)
df


Unnamed: 0,Date,Next Sunday
0,2022-11-01,2022-11-06
1,2022-11-02,2022-11-06
2,2022-11-03,2022-11-06
3,2022-11-04,2022-11-06
4,2022-11-05,2022-11-06
5,2022-11-06,2022-11-06
6,2022-11-07,2022-11-13
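
The same idea extends to any day of the week. The sketch below is one possible generalization; the function name `get_next_weekday` and its `weekday` parameter are assumptions made for illustration. It uses the `pandas` convention that Monday is 0 and Sunday is 6, and the modulo makes the result "on or after" the given date.

```python
import pandas as pd

def get_next_weekday(date, weekday=6):
    """Return the first date on or after `date` that falls on `weekday`.

    `weekday` follows the pandas convention: Monday is 0 and Sunday is 6.
    """
    date = pd.to_datetime(date)
    # number of days to move forward; 0 if the date is already on that weekday
    days_ahead = (weekday - date.dayofweek) % 7
    return date + pd.Timedelta(days=days_ahead)

print(get_next_weekday("2022-11-01"))     # next Sunday
print(get_next_weekday("2022-11-07", 5))  # next Saturday
```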


### Exercise

#### Situation 1

Raw table:
| products | sales_in_2019 | sales_in_2020 | sales_in_2021 |
|----------|---------------|---------------|---------------|
| A       | 10            | 20            | 30            |
| B       | 20            | 30            | 40            |
| C       | 30            | 40            | 50            |
| D       | 40            | 50            | 60            |
| E       | 50            | 60            | 70            |
| F       | 60            | 70            | 80            |
| G       | 70            | 80            | 90            |
| H       | 80            | 90            | 100           |
| I       | 90            | 100           | 110           |
| J       | 100           | 110           | 120           |


The table we want:
| products | year | sales |
|----------|------|-------|
| A       | 2019 | 10    |
| A       | 2020 | 20    |
| A       | 2021 | 30    |
| B       | 2019 | 20    |
| B       | 2020 | 30    |
| B       | 2021 | 40    |
| C       | 2019 | 30    |
| C       | 2020 | 40    |
| C       | 2021 | 50    |
| D       | 2019 | 40    |
| D       | 2020 | 50    |
| D       | 2021 | 60    |
| E       | 2019 | 50    |
| E       | 2020 | 60    |
| E       | 2021 | 70    |
| F       | 2019 | 60    |
| F       | 2020 | 70    |
| F       | 2021 | 80    |
| G       | 2019 | 70    |
| G       | 2020 | 80    |
| G       | 2021 | 90    |
| H       | 2019 | 80    |
| H       | 2020 | 90    |
| H       | 2021 | 100   |
| I       | 2019 | 90    |
| I       | 2020 | 100   |
| I       | 2021 | 110   |
| J       | 2019 | 100   |
| J       | 2020 | 110   |
| J       | 2021 | 120   |



In [6]:
res = {
    'products': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'],
    'sales_in_2019': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
    'sales_in_2020': [20, 30, 40, 50, 60, 70, 80, 90, 100, 110],
    'sales_in_2021': [30, 40, 50, 60, 70, 80, 90, 100, 110, 120]
}
df = pd.DataFrame(res)
df

Unnamed: 0,products,sales_in_2019,sales_in_2020,sales_in_2021
0,A,10,20,30
1,B,20,30,40
2,C,30,40,50
3,D,40,50,60
4,E,50,60,70
5,F,60,70,80
6,G,70,80,90
7,H,80,90,100
8,I,90,100,110
9,J,100,110,120


##### Solution 1
We can use `for loops` to solve this problem: we iterate through the rows and the sales columns of the dataframe and build a new dataframe in the new format, one record per product and year.


In [7]:
# optionally rename the columns so that the 'year' column holds only the year
# df.columns = ['products', '2019', '2020', '2021']

# collect the new rows in a list (DataFrame.append was removed in pandas 2.0)
rows = []

# loop through rows
for ind, row in df.iterrows():
    # loop through columns
    for col in df.columns:
        if col != 'products':
            # add one (product, year, sales) record per sales column
            rows.append({'products': row['products'], 'year': col, 'sales': row[col]})

# create the new dataframe from the collected rows
df_new = pd.DataFrame(rows, columns=['products', 'year', 'sales'])
df_new

Unnamed: 0,products,year,sales
0,A,sales_in_2019,10
1,A,sales_in_2020,20
2,A,sales_in_2021,30
3,B,sales_in_2019,20
4,B,sales_in_2020,30
5,B,sales_in_2021,40
6,C,sales_in_2019,30
7,C,sales_in_2020,40
8,C,sales_in_2021,50
9,D,sales_in_2019,40


##### Solution 2
We can just use the `melt` function in `pandas` to solve this problem.

In [8]:
df_new = pd.melt(df, id_vars=['products'], var_name='year', value_name='sales')
df_new

Unnamed: 0,products,year,sales
0,A,sales_in_2019,10
1,B,sales_in_2019,20
2,C,sales_in_2019,30
3,D,sales_in_2019,40
4,E,sales_in_2019,50
5,F,sales_in_2019,60
6,G,sales_in_2019,70
7,H,sales_in_2019,80
8,I,sales_in_2019,90
9,J,sales_in_2019,100
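
Both solutions keep the old column names (for example `sales_in_2019`) in the `year` column, while the table we want shows only the year. The sketch below is one way to finish the job on a two-product version of the data: it strips the `sales_in_` prefix, converts the year to an integer, and sorts the rows by product and year. The name `df_long` is an assumption made for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "products": ["A", "B"],
        "sales_in_2019": [10, 20],
        "sales_in_2020": [20, 30],
        "sales_in_2021": [30, 40],
    }
)

df_long = pd.melt(df, id_vars=["products"], var_name="year", value_name="sales")

# keep only the year part of the old column names, e.g. "sales_in_2019" -> 2019
df_long["year"] = df_long["year"].str.replace("sales_in_", "", regex=False).astype(int)

# order the rows by product and then by year, as in the target table
df_long = df_long.sort_values(["products", "year"]).reset_index(drop=True)
print(df_long)
```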


#### Situation 2

Raw table:
| products | month | sales |
|----------|-------|-------|
| A       | 1     | 10    |
| B       | 1     | 20    |
| C       | 1     | 30    |
| A       | 2     | 40    |
| B       | 2     | 50    |
| C       | 2     | 60    |
| A       | 3     | 70    |
| B       | 3     | 80    |
| C       | 3     | 90    |
| A       | 4     | 100   |
| B       | 4     | 110   |
| C       | 4     | 120   |

The format we want:
| products | month | sales | cumulative_sales |
|----------|-------|-------|------------------|
| A       | 1     | 10    | 10               |
| A       | 2     | 40    | 50               |
| A       | 3     | 70    | 120              |
| A       | 4     | 100   | 220              |
| B       | 1     | 20    | 20               |
| B       | 2     | 50    | 70               |
| B       | 3     | 80    | 150              |
| B       | 4     | 110   | 260              |
| C       | 1     | 30    | 30               |
| C       | 2     | 60    | 90               |
| C       | 3     | 90    | 180              |
| C       | 4     | 120   | 300              |


In [9]:
df  = pd.DataFrame({
    'products' : ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C'],
    'month' : [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    'sales' : [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120]}
)

df

Unnamed: 0,products,month,sales
0,A,1,10
1,B,1,20
2,C,1,30
3,A,2,40
4,B,2,50
5,C,2,60
6,A,3,70
7,B,3,80
8,C,3,90
9,A,4,100


##### Solution 1
We can solve the problem by sorting the dataframe by product and month and then calculating the cumulative sum of the sales column for each product in a `for loop`.

In [10]:
df.sort_values(by=['products', 'month'], inplace=True)

# for loop calculating the cumulative sum for each product
for product in df['products'].unique():
    # work on an explicit copy to avoid the SettingWithCopyWarning from pandas
    df_temp = df[df['products'] == product].copy()
    df_temp['cumulative_sales'] = df_temp['sales'].cumsum()
    df.loc[df['products'] == product, 'cumulative_sales'] = df_temp['cumulative_sales'].values

df


Unnamed: 0,products,month,sales,cumulative_sales
0,A,1,10,10.0
3,A,2,40,50.0
6,A,3,70,120.0
9,A,4,100,220.0
1,B,1,20,20.0
4,B,2,50,70.0
7,B,3,80,150.0
10,B,4,110,260.0
2,C,1,30,30.0
5,C,2,60,90.0


##### Solution 2
We can also use `groupby` to solve the problem: we group the dataframe by products and then use `cumsum` to calculate the cumulative sum of the sales column within each group.

In [11]:
df.sort_values(by=['products', 'month'], inplace=True)
df['cumulative_sales'] = df.groupby('products')['sales'].cumsum()
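
As a quick check, the sketch below compares `groupby` with a hand-written loop on the same data. The loop here is a different one from Solution 1: it keeps a running total per product in a dictionary. The names `running_total`, `loop_result` and `groupby_result` are assumptions made for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "products": ["A", "B", "C", "A", "B", "C", "A", "B", "C", "A", "B", "C"],
        "month": [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
        "sales": [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120],
    }
).sort_values(by=["products", "month"])

# loop version: keep a running total for each product
running_total = {}
loop_result = []
for ind, row in df.iterrows():
    product = row["products"]
    running_total[product] = running_total.get(product, 0) + row["sales"]
    loop_result.append(running_total[product])

# groupby version
groupby_result = df.groupby("products")["sales"].cumsum().tolist()

# both approaches should give the same cumulative sales
print(loop_result == groupby_result)
```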

##### Data Visualization

In [12]:
# plot the cumulative sales
fig = px.line(df, x='month', y='cumulative_sales', color='products')

# show one tick per month (1, 2, 3, 4) on the x-axis
fig.update_xaxes(tickvals=[1, 2, 3, 4])

# set the title
fig.update_layout(title='Cumulative Sales of Products', title_x=0.5)

# show the plot
fig.show()

#### Situation 3

Raw table:
| Stock | Date       | Price | Volume |
|-------|------------|-------|--------|
| A     | 2020-01-01 | 10    | 1000   |
| A     | 2020-01-02 | 20    | 2000   |
| A     | 2020-01-03 | 14    | 2000   |
| A     | 2020-01-04 | 15    | 2000   |
| A     | 2020-01-05 | 16    | 2300   |
| A     | 2020-01-06 | 17    | 2400   |
| A     | 2020-01-07 | 18    | 2500   |
| A     | 2020-01-08 | 13    | 2600   |

The table we want:
| Stock | Date       | Price | Volume | Price Change (%) |
|-------|------------|-------|--------|------------------|
| A     | 2020-01-01 | 10    | 1000   | 0                |
| A     | 2020-01-02 | 20    | 2000   | 100              |
| A     | 2020-01-03 | 14    | 2000   | -30              |
| A     | 2020-01-04 | 15    | 2000   | 7                |
| A     | 2020-01-05 | 16    | 2300   | 7                |
| A     | 2020-01-06 | 17    | 2400   | 6                |
| A     | 2020-01-07 | 18    | 2500   | 6                |
| A     | 2020-01-08 | 13    | 2600   | -28              |



In [13]:
df = pd.DataFrame({
    'Stock' : ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A'],
    'Date' : ['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04', '2020-01-05', '2020-01-06', '2020-01-07', '2020-01-08'],
    'Price' : [10, 20, 14, 15, 16, 17, 18, 13],
    'Volume' : [1000, 2000, 2000, 2000, 2300, 2400, 2500, 2600]
})    
df

Unnamed: 0,Stock,Date,Price,Volume
0,A,2020-01-01,10,1000
1,A,2020-01-02,20,2000
2,A,2020-01-03,14,2000
3,A,2020-01-04,15,2000
4,A,2020-01-05,16,2300
5,A,2020-01-06,17,2400
6,A,2020-01-07,18,2500
7,A,2020-01-08,13,2600


##### Solution 1
We can use `for loops` to solve this problem: we iterate through the rows of the dataframe and, for each row, calculate the price change relative to the previous row.


In [14]:
# start with a float column; the first row has no previous day, so its change stays 0
df['Price Change (%)'] = 0.0
for i in range(1, len(df)):
    df.loc[i, 'Price Change (%)'] = (df.loc[i, 'Price'] - df.loc[i-1, 'Price']) / df.loc[i-1, 'Price'] * 100

df['Price Change (%)'] = df['Price Change (%)'].round(0)
df

Unnamed: 0,Stock,Date,Price,Volume,Price Change (%)
0,A,2020-01-01,10,1000,0.0
1,A,2020-01-02,20,2000,100.0
2,A,2020-01-03,14,2000,-30.0
3,A,2020-01-04,15,2000,7.0
4,A,2020-01-05,16,2300,7.0
5,A,2020-01-06,17,2400,6.0
6,A,2020-01-07,18,2500,6.0
7,A,2020-01-08,13,2600,-28.0


##### Solution 2

We can also use the `shift` function in `pandas` to solve this problem. Shifting the price column down by 1 row puts the previous day's price next to each day's price, so the price change can be calculated for the whole column at once.

In [15]:
df['Price (Previous Day)'] = df['Price'].shift(1)
df['Price Change (%)'] = ((df['Price'] - df['Price (Previous Day)']) / df['Price (Previous Day)'] * 100).round(0)
df.drop('Price (Previous Day)', axis=1, inplace=True)
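# the first row has no previous day, so fill only this column's missing value with 0
df['Price Change (%)'] = df['Price Change (%)'].fillna(0)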
df


Unnamed: 0,Stock,Date,Price,Volume,Price Change (%)
0,A,2020-01-01,10,1000,0.0
1,A,2020-01-02,20,2000,100.0
2,A,2020-01-03,14,2000,-30.0
3,A,2020-01-04,15,2000,7.0
4,A,2020-01-05,16,2300,7.0
5,A,2020-01-06,17,2400,6.0
6,A,2020-01-07,18,2500,6.0
7,A,2020-01-08,13,2600,-28.0


##### Solution 3

We can directly use the `pct_change` function in `pandas` to solve this problem.

In [16]:
# convert to percent before rounding so that the results are exact whole numbers
df['Price Change (%)'] = (df['Price'].pct_change() * 100).round(0)
df['Price Change (%)'] = df['Price Change (%)'].fillna(0)

df

Unnamed: 0,Stock,Date,Price,Volume,Price Change (%)
0,A,2020-01-01,10,1000,0.0
1,A,2020-01-02,20,2000,100.0
2,A,2020-01-03,14,2000,-30.0
3,A,2020-01-04,15,2000,7.0
4,A,2020-01-05,16,2300,7.0
5,A,2020-01-06,17,2400,6.0
6,A,2020-01-07,18,2500,6.0
7,A,2020-01-08,13,2600,-28.0
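
The table has a `Stock` column, so the same calculation can be done per stock when there is more than one. The sketch below combines `groupby` with `pct_change`, assuming the rows are sorted by date within each stock. Stock A uses the first three prices from the table above; stock B and its prices are made up for illustration.

```python
import pandas as pd

# stock A uses the first prices from the table above; stock B is hypothetical
prices = pd.DataFrame(
    {
        "Stock": ["A", "A", "A", "B", "B", "B"],
        "Date": ["2020-01-01", "2020-01-02", "2020-01-03"] * 2,
        "Price": [10, 20, 14, 50, 55, 44],
    }
)

prices = prices.sort_values(["Stock", "Date"])

# pct_change restarts for every stock, so the first row of each stock has no change
prices["Price Change (%)"] = (
    prices.groupby("Stock")["Price"].pct_change().mul(100).round(0).fillna(0)
)
print(prices)
```

Without the `groupby`, the first row of stock B would be compared with the last row of stock A.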


##### Data Visualization

In [17]:
# plot the bar chart
fig = px.bar(df, x='Date', y='Price Change (%)', color='Stock', width=800, height=400)

# set the title
fig.update_layout(title='Price Change of Stock A', title_x=0.5)

# show the plot
fig.show()

#### Situation 4

Raw table:

| Name  | Gender | Height (cm) | Weight (kg) |
| ----- |--------|-------------|-------------|
| Alice | F      | 165         | 50          |
| Bob   | M      | 180         | 70          |
| Cindy | F      | 170         | 60          |
| David | M      | 175         | 65          |
| Emily | F      | 160         | 45          |
| Frank | M      | 185         | 75          |
| Grace | F      | 155         | 40          |
| Henry | M      | 160         | 80          |
| Irene | F      | 165         | 70          |


We want to calculate the BMI for each person and add it to the dataframe as a new column. The BMI formula is `BMI = weight / height^2`, with the weight in kilograms and the height in metres, so the heights in the table must be divided by 100 first.
In addition, we want to add a new column to the dataframe to indicate the BMI category. The BMI category is defined as follows:
- BMI < 18.5, the BMI category is "Underweight"
- 18.5 <= BMI < 25, the BMI category is "Normal"
- BMI >= 25, the BMI category is "Overweight"

The table we want:

| Name  | Gender | Height (cm) | Weight (kg) | BMI | BMI Category |
| ----- |--------|-------------|-------------|-----|--------------|
| Alice | F      | 165         | 50          | 18.4| Underweight |
| Bob   | M      | 180         | 70          | 21.6| Normal      |
| Cindy | F      | 170         | 60          | 20.8| Normal      |
| David | M      | 175         | 65          | 21.2| Normal      |
| Emily | F      | 160         | 45          | 17.6| Underweight |
| Frank | M      | 185         | 75          | 21.9| Normal      |
| Grace | F      | 155         | 40          | 16.6| Underweight |
| Henry | M      | 160         | 80          | 31.3| Overweight  |
| Irene | F      | 165         | 70          | 25.7| Overweight  |


In [18]:
df  = pd.DataFrame({
    'Name' : ['Alice', 'Bob', 'Cindy', 'David', 'Emily', 'Frank', 'Grace', 'Henry', 'Irene'],
    'Gender' : ['F', 'M', 'F', 'M', 'F', 'M', 'F', 'M', 'F'],
    'Height (cm)' : [165, 180, 170, 175, 160, 185, 155, 160, 165],
    'Weight (kg)' : [50, 70, 60, 65, 45, 75, 40, 80, 70]})


##### Solution

In [19]:
# create a new column for BMI
df['BMI'] = df['Weight (kg)'] / (df['Height (cm)'] / 100) ** 2

# define the function to calculate the BMI category
def bmi_category(bmi):
    if bmi < 18.5:
        return 'Underweight'
    elif bmi < 25:
        return 'Normal'
    else:
        return 'Overweight'

df['BMI Category'] = df['BMI'].apply(bmi_category)
df

Unnamed: 0,Name,Gender,Height (cm),Weight (kg),BMI,BMI Category
0,Alice,F,165,50,18.365473,Underweight
1,Bob,M,180,70,21.604938,Normal
2,Cindy,F,170,60,20.761246,Normal
3,David,M,175,65,21.22449,Normal
4,Emily,F,160,45,17.578125,Underweight
5,Frank,M,185,75,21.913806,Normal
6,Grace,F,155,40,16.649324,Underweight
7,Henry,M,160,80,31.25,Overweight
8,Irene,F,165,70,25.711662,Overweight
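
As with the other situations, `pandas` has a built-in alternative to the hand-written function. The sketch below uses `pd.cut` on a three-person subset of the table; the bin edges are taken from the category definitions above, and `right=False` is needed so that a BMI of exactly 18.5 or 25 falls into the higher category, matching `bmi_category`.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "Name": ["Alice", "Bob", "Henry"],
        "Height (cm)": [165, 180, 160],
        "Weight (kg)": [50, 70, 80],
    }
)
df["BMI"] = df["Weight (kg)"] / (df["Height (cm)"] / 100) ** 2

# right=False makes each bin include its left edge: [0, 18.5), [18.5, 25), [25, inf)
df["BMI Category"] = pd.cut(
    df["BMI"],
    bins=[0, 18.5, 25, np.inf],
    labels=["Underweight", "Normal", "Overweight"],
    right=False,
)
print(df)
```

Note that `pd.cut` returns a categorical column rather than plain strings.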


In [20]:
# plot the histogram
fig = px.histogram(df, x='BMI', color='BMI Category', nbins=4, width=800, height=400)

# draw the bars of the different categories on top of each other instead of stacking them
fig.update_layout(barmode='overlay')

# set the title
fig.update_layout(title='BMI Category', title_x=0.5)

# show the plot
fig.show()