In [2]:
# import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px

## Tutorial 3: Better understanding of `for loops` and `defining functions`

In this tutorial we will learn how to use `for` loops and user-defined functions to make our code more efficient and easier to read. To build that understanding, we are going to write a few data-manipulation algorithms by hand. **Please remember that most of these algorithms are already implemented in the `pandas` library and can be used directly in your code.** We write them ourselves today only to help you understand how `for` loops work on dataframes.

### For loop

In this section, we use `for` loops to iterate through a dataframe. For example, suppose we have the following dataframe:

| Name      | Gender | Age | Job        |
| --------- | ------ | --- | ---------- |
| Alice     | F      | 20  | Data Analyst |
| Bob       | M      | 30  | Software Engineer |
| Charlie   | M      | 40  | Data Scientist |
| David     | M      | 50  | Consultant |
| Eve       | F      | 60  | Business Analyst |



To read the value of every cell in the dataframe, we can use nested `for` loops: the outer loop walks over the rows and the inner loop walks over the columns. The code is as follows:

In [3]:
df = pd.DataFrame(
    {
        "Name": ["Alice", "Bob", "Charlie", "David", "Eve"],
        "Gender": ["F", "M", "M", "M", "F"],
        "Age": [20, 30, 40, 50, 60],
        "Job": ["Data Analyst", "Software Engineer", "Data Scientist", "Consultant", "Business Analyst"],
    }
)
for i in range(len(df)):
    for j in range(len(df.columns)):
        print(df.iloc[i, j])

Alice
F
20
Data Analyst
Bob
M
30
Software Engineer
Charlie
M
40
Data Scientist
David
M
50
Consultant
Eve
F
60
Business Analyst


Alternatively, we can use `iterrows()`, which yields an `(index, row)` pair for each row of the dataframe. The code is as follows:

In [4]:
for ind, row in df.iterrows():
    for col in df.columns:
        print(row[col])
    

Alice
F
20
Data Analyst
Bob
M
30
Software Engineer
Charlie
M
40
Data Scientist
David
M
50
Consultant
Eve
F
60
Business Analyst


### Defining functions

In production work, we often write functions to make our code easier to read and reuse. Once a piece of code is encapsulated in a function, we can call that function from anywhere else.

A function works like a sub-program: it takes inputs, called `parameters`, and returns outputs, called `return values`. This is very similar to how functions work in mathematics.

For example, suppose we want to calculate the area and perimeter of a rectangle. We can write a function to do this. The code is as follows:

In [11]:
def cal_area_perimeter(length, width):
    """Return the area and perimeter of a rectangle as a tuple."""
    area = length * width
    perimeter = 2 * (length + width)
    return area, perimeter

x, y = cal_area_perimeter(10, 20)
print(x, y)


200 60


Functions also make data manipulation easier. For example, suppose we have a dataframe of dates:

| Date      |
| --------- |
| 2022-11-01 |
| 2022-11-02 |
| 2022-11-03 |
| 2022-11-04 |
| 2022-11-05 |
| 2022-11-06 |
| 2022-11-07 |


If we want to map each date to the coming Sunday (keeping the date itself when it already is a Sunday), we can write a function to do this. The code is as follows:


In [10]:
df = pd.DataFrame(
    {
        'Date': ['2022-11-01', '2022-11-02', '2022-11-03', '2022-11-04', '2022-11-05', '2022-11-06', '2022-11-07']
    }
)
def get_next_sunday(date):
    """Return the date itself if it is a Sunday, otherwise the following Sunday."""
    date = pd.to_datetime(date)
    # dayofweek is 0 for Monday ... 6 for Sunday, so the offset is 0 on a Sunday
    days_until_sunday = 6 - date.dayofweek
    return date + pd.Timedelta(days=days_until_sunday)

# keep the original column and store the result in a new one
df['Date_new'] = df['Date'].apply(get_next_sunday)

# parse the original date strings into datetime values
df['Date_datetime'] = pd.to_datetime(df['Date'])
df

Unnamed: 0,Date,Date_new,Date_datetime
0,2022-11-01,2022-11-06,2022-11-01
1,2022-11-02,2022-11-06,2022-11-02
2,2022-11-03,2022-11-06,2022-11-03
3,2022-11-04,2022-11-06,2022-11-04
4,2022-11-05,2022-11-06,2022-11-05
5,2022-11-06,2022-11-06,2022-11-06
6,2022-11-07,2022-11-13,2022-11-07


### Exercise

#### Situation 1

Raw table:
| products | sales_in_2019 | sales_in_2020 | sales_in_2021 |
|----------|---------------|---------------|---------------|
| A       | 10            | 20            | 30            |
| B       | 20            | 30            | 40            |
| C       | 30            | 40            | 50            |
| D       | 40            | 50            | 60            |
| E       | 50            | 60            | 70            |
| F       | 60            | 70            | 80            |
| G       | 70            | 80            | 90            |
| H       | 80            | 90            | 100           |
| I       | 90            | 100           | 110           |
| J       | 100           | 110           | 120           |


The table we want:
| products | year | sales |
|----------|------|-------|
| A       | 2019 | 10    |
| A       | 2020 | 20    |
| A       | 2021 | 30    |
| B       | 2019 | 20    |
| B       | 2020 | 30    |
| B       | 2021 | 40    |
| C       | 2019 | 30    |
| C       | 2020 | 40    |
| C       | 2021 | 50    |
| D       | 2019 | 40    |
| D       | 2020 | 50    |
| D       | 2021 | 60    |
| E       | 2019 | 50    |
| E       | 2020 | 60    |
| E       | 2021 | 70    |
| F       | 2019 | 60    |
| F       | 2020 | 70    |
| F       | 2021 | 80    |
| G       | 2019 | 70    |
| G       | 2020 | 80    |
| G       | 2021 | 90    |
| H       | 2019 | 80    |
| H       | 2020 | 90    |
| H       | 2021 | 100   |
| I       | 2019 | 90    |
| I       | 2020 | 100   |
| I       | 2021 | 110   |
| J       | 2019 | 100   |
| J       | 2020 | 110   |
| J       | 2021 | 120   |



In [5]:
res = {
    'products': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'],
    'sales_in_2019': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
    'sales_in_2020': [20, 30, 40, 50, 60, 70, 80, 90, 100, 110],
    'sales_in_2021': [30, 40, 50, 60, 70, 80, 90, 100, 110, 120]
}
df = pd.DataFrame(res)
df

Unnamed: 0,products,sales_in_2019,sales_in_2020,sales_in_2021
0,A,10,20,30
1,B,20,30,40
2,C,30,40,50
3,D,40,50,60
4,E,50,60,70
5,F,60,70,80
6,G,70,80,90
7,H,80,90,100
8,I,90,100,110
9,J,100,110,120
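
One possible approach to Situation 1 is sketched below. It is not the only solution: it assumes every yearly column is named `sales_in_<year>`, and the helper name `wide_to_long` and its parameters are made up for this illustration.

```python
import pandas as pd

# Same raw table as in the exercise.
wide_df = pd.DataFrame({
    "products": list("ABCDEFGHIJ"),
    "sales_in_2019": [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
    "sales_in_2020": [20, 30, 40, 50, 60, 70, 80, 90, 100, 110],
    "sales_in_2021": [30, 40, 50, 60, 70, 80, 90, 100, 110, 120],
})


def wide_to_long(df, id_col="products", prefix="sales_in_"):
    """Build one row per (product, year) from one sales column per year."""
    records = []
    for _, row in df.iterrows():      # outer loop: one product per row
        for col in df.columns:        # inner loop: one column per year
            if col.startswith(prefix):
                records.append({
                    id_col: row[id_col],
                    "year": int(col[len(prefix):]),  # "sales_in_2019" -> 2019
                    "sales": row[col],
                })
    return pd.DataFrame(records, columns=[id_col, "year", "sales"])


long_df = wide_to_long(wide_df)
print(long_df.head(6))
```

Outside of this exercise, the same reshaping is usually done with the built-in `pd.melt`, followed by extracting the year from the column name.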


#### Situation 2

Raw table:
| products | month | sales |
|----------|-------|-------|
| A       | 1     | 10    |
| B       | 1     | 20    |
| C       | 1     | 30    |
| A       | 2     | 40    |
| B       | 2     | 50    |
| C       | 2     | 60    |
| A       | 3     | 70    |
| B       | 3     | 80    |
| C       | 3     | 90    |
| A       | 4     | 100   |
| B       | 4     | 110   |
| C       | 4     | 120   |

The format we want:
| products | month | sales | cumulative_sales |
|----------|-------|-------|------------------|
| A       | 1     | 10    | 10               |
| A       | 2     | 40    | 50               |
| A       | 3     | 70    | 120              |
| A       | 4     | 100   | 220              |
| B       | 1     | 20    | 20               |
| B       | 2     | 50    | 70               |
| B       | 3     | 80    | 150              |
| B       | 4     | 110   | 260              |
| C       | 1     | 30    | 30               |
| C       | 2     | 60    | 90               |
| C       | 3     | 90    | 180              |
| C       | 4     | 120   | 300              |


In [None]:
df  = pd.DataFrame({
    'products' : ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C'],
    'month' : [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    'sales' : [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120]}
)

df
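
One possible approach to Situation 2 is sketched below: sort the rows by product and month, then keep a running total per product in a dictionary while looping over the rows. The function name `add_cumulative_sales` is made up for this illustration.

```python
import pandas as pd

# Same raw table as in the exercise.
sales_df = pd.DataFrame({
    "products": ["A", "B", "C"] * 4,
    "month": [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    "sales": [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120],
})


def add_cumulative_sales(df):
    """Return a copy sorted by product and month with a running sales total."""
    ordered = df.sort_values(["products", "month"]).reset_index(drop=True)
    running_total = {}   # product -> sales accumulated so far
    cumulative = []
    for _, row in ordered.iterrows():
        product = row["products"]
        running_total[product] = running_total.get(product, 0) + row["sales"]
        cumulative.append(running_total[product])
    ordered["cumulative_sales"] = cumulative
    return ordered


result = add_cumulative_sales(sales_df)
print(result)
```

In everyday code, `df.groupby("products")["sales"].cumsum()` on the sorted dataframe computes the same column without an explicit loop.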

#### Situation 3

Raw table:
| Stock | Date       | Price | Volume |
|-------|------------|-------|--------|
| A     | 2020-01-01 | 10    | 1000   |
| A     | 2020-01-02 | 20    | 2000   |
| A     | 2020-01-03 | 14    | 2000   |
| A     | 2020-01-04 | 15    | 2000   |
| A     | 2020-01-05 | 16    | 2300   |
| A     | 2020-01-06 | 17    | 2400   |
| A     | 2020-01-07 | 18    | 2500   |
| A     | 2020-01-08 | 13    | 2600   |

The table we want:
| Stock | Date       | Price | Volume | Price Change (%) |
|-------|------------|-------|--------|------------------|
| A     | 2020-01-01 | 10    | 1000   | 0                |
| A     | 2020-01-02 | 20    | 2000   | 100              |
| A     | 2020-01-03 | 14    | 2000   | -30              |
| A     | 2020-01-04 | 15    | 2000   | 7                |
| A     | 2020-01-05 | 16    | 2300   | 7                |
| A     | 2020-01-06 | 17    | 2400   | 6                |
| A     | 2020-01-07 | 18    | 2500   | 6                |
| A     | 2020-01-08 | 13    | 2600   | -28              |



In [27]:
df = pd.DataFrame({
    'Stock' : ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A'],
    'Date' : ['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04', '2020-01-05', '2020-01-06', '2020-01-07', '2020-01-08'],
    'Price' : [10, 20, 14, 15, 16, 17, 18, 13],
    'Volume' : [1000, 2000, 2000, 2000, 2300, 2400, 2500, 2600]
})    
df

Unnamed: 0,Stock,Date,Price,Volume
0,A,2020-01-01,10,1000
1,A,2020-01-02,20,2000
2,A,2020-01-03,14,2000
3,A,2020-01-04,15,2000
4,A,2020-01-05,16,2300
5,A,2020-01-06,17,2400
6,A,2020-01-07,18,2500
7,A,2020-01-08,13,2600
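
One possible approach to Situation 3 is sketched below. It assumes the table holds a single stock and is already sorted by date, and it follows the target table in using 0 for the first day, which has no previous price. The helper name `percent_change` is made up for this illustration.

```python
import pandas as pd

# Same raw table as in the exercise.
prices = pd.DataFrame({
    "Stock": ["A"] * 8,
    "Date": ["2020-01-01", "2020-01-02", "2020-01-03", "2020-01-04",
             "2020-01-05", "2020-01-06", "2020-01-07", "2020-01-08"],
    "Price": [10, 20, 14, 15, 16, 17, 18, 13],
    "Volume": [1000, 2000, 2000, 2000, 2300, 2400, 2500, 2600],
})


def percent_change(previous, current):
    """Percentage change from previous to current, rounded to a whole number."""
    return int(round((current - previous) / previous * 100))


changes = [0]  # first day: no previous price, the target table shows 0
for i in range(1, len(prices)):
    previous = float(prices["Price"].iloc[i - 1])
    current = float(prices["Price"].iloc[i])
    changes.append(percent_change(previous, current))

prices["Price Change (%)"] = changes
print(prices)
```

The built-in `Series.pct_change()` performs the same comparison with the previous row, but it returns fractions rather than percentages and leaves the first row as `NaN`.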


#### Situation 4

Raw table:

| Name  | Gender | Height (cm) | Weight (kg) |
| ----- |--------|-------------|-------------|
| Alice | F      | 165         | 50          |
| Bob   | M      | 180         | 70          |
| Cindy | F      | 170         | 60          |
| David | M      | 175         | 65          |
| Emily | F      | 160         | 45          |
| Frank | M      | 185         | 75          |
| Grace | F      | 155         | 40          |
| Henry | M      | 160         | 80          |
| Irene | F      | 165         | 70          |


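We want to calculate the BMI for each person and add it as a new column to the dataframe. The BMI formula is `BMI = weight (kg) / height (m)^2`; because the table stores height in centimetres, divide it by 100 before squaring.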
In addition, we want to add a new column to the dataframe to indicate the BMI level. The BMI level is defined as follows:
- BMI < 18.5, the BMI level is "Underweight"
- 18.5 <= BMI < 25, the BMI level is "Normal"
- 25 <= BMI < 30, the BMI level is "Overweight"
- BMI >= 30, the BMI level is "Obese"

The table we want:

| Name  | Gender | Height (cm) | Weight (kg) | BMI  | BMI level   |
| ----- |--------|-------------|-------------|------|-------------|
| Alice | F      | 165         | 50          | 18.4 | Underweight |
| Bob   | M      | 180         | 70          | 21.6 | Normal      |
| Cindy | F      | 170         | 60          | 20.8 | Normal      |
| David | M      | 175         | 65          | 21.2 | Normal      |
| Emily | F      | 160         | 45          | 17.6 | Underweight |
| Frank | M      | 185         | 75          | 21.9 | Normal      |
| Grace | F      | 155         | 40          | 16.6 | Underweight |
| Henry | M      | 160         | 80          | 31.3 | Obese       |
| Irene | F      | 165         | 70          | 25.7 | Overweight  |


In [None]:
df  = pd.DataFrame({
    'Name' : ['Alice', 'Bob', 'Cindy', 'David', 'Emily', 'Frank', 'Grace', 'Henry', 'Irene'],
    'Gender' : ['F', 'M', 'F', 'M', 'F', 'M', 'F', 'M', 'F'],
    'Height (cm)' : [165, 180, 170, 175, 160, 185, 155, 160, 165],
    'Weight (kg)' : [50, 70, 60, 65, 45, 75, 40, 80, 70]})
