## Lesson 4 Overview

 * What's a CSV file
 * List contents of a CSV file
 * Total steps for each member
 * Print the names of people with steps count more than 100k
 * Print total steps for each team

## Let's load today's lesson!

### Open Azure Notebooks library 

Go to https://notebooks.azure.com -> Sign in if needed -> Select **python-codeacademy-sg**

### Update lesson file to latest version

Select **New** -> **From URL** -> input https://raw.githubusercontent.com/viettrung9012/python-codeacademy-sg/master/Lesson4.ipynb (URL is available in **Lesson4.ipynb**) -> Click outside input then select **Upload** (overwrite if needed)

### Open Jupyter lab

From your browser's bookmark or **Run** -> Change browser URL path from **/nb/tree** to **/nb/lab**

Select **Lesson4.ipynb**

## What is a CSV file?

A CSV is a comma separated values file which allows data to be saved in a table structured format. A CSV file is similar to an Excel spreadsheet, though it doesn't have the style formatting and has a .csv extension instead. Traditionally they take the form of a text file containing information separated by commas, hence the name.

### Content in CSV

Starting from this lesson, we will use the data from Biggest Loser. 

For those who are unfamiliar with the Biggest Loser step challenge, this was a 4-week long challenge where teams (each consisting of 5 members) combined their daily step counts in order to reach these goals:

    1. To be the first team to reach 1 million steps

    2. To have the highest overall step count amongst all teams

    3. To show the most improved team step count (from weeks 1+2 to weeks 3+4)

***The following sample data is from the Biggest Loser step challenge.***

     team_no,team_name,team_captain,team_member,2018-04-02,2018-04-03

     1,TBD,FALSE,1-1,11980,10437

     1,TBD,FALSE,1-2,22935,13399

A CSV file is a table, so it helps to compare it with the Excel spreadsheets we use in our daily work. The first row holds the column names, and the information for each person starts from the second row.

If you open the CSV file using a text editor such as Notepad, you'll see the content as above.

If you open the CSV file using Excel, the content will have the same format as a .xls file.
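
To make the "plain text that describes a table" idea concrete, here is a minimal sketch. It assumes a two-line excerpt of the sample data typed in as a string (`sample_text` is a name made up for this example) and uses `io.StringIO` so that no file is needed.

```python
import csv
import io

# Two-line excerpt copied from the sample data above (header + one member).
sample_text = (
    "team_no,team_name,team_captain,team_member,2018-04-02,2018-04-03\n"
    "1,TBD,FALSE,1-1,11980,10437\n"
)

# io.StringIO lets csv.reader treat the string as if it were an open file.
rows = list(csv.reader(io.StringIO(sample_text)))

print(rows[0])  # the column names
print(rows[1])  # the first member's data, every value still a string
```

Each line of text becomes one row, and each comma marks the boundary between two cells.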

## List contents of a CSV file

To read and analyse the data, we use the Python `csv` library. CSV stands for comma separated values, and the comma is what is known as a "delimiter".

   ***Tips: Whichever libraries you need, import all of them at the beginning of the code.***

**Let's get started!**

### How to read a CSV file in Python? -- `csv.reader()` function

    1. The first thing we need to do is import csv, so we can use the functions in that library.
    2. Open the file by its path
    3. Create a reader with the csv.reader() function from the Python csv library
    4. Use a "for" statement to fetch the data row by row from the CSV file
    5. Print each row


In [None]:
# Print CSV file
import csv
with open('Biggest Loser 2018.csv') as csvfile:
    readCSV = csv.reader(csvfile)
    for row in readCSV:
        print(row)

Above, we've shown how to open a CSV file and read it row by row.

### Print specific line

You can also print specific values from each row by their index.

Please refer to the code block below:

In [None]:
# Print the first three columns of each row
import csv
with open('Biggest Loser 2018.csv') as csvfile:
    readCSV = csv.reader(csvfile)
    for row in readCSV:
        print(row[0], row[1], row[2])

#### `with` statement

In the code above, we use the keyword `with` when opening the CSV file. So what is `with`, and what is the `with` statement used for?

The `with` statement gives you cleaner syntax and safer handling of errors (exceptions) when working with resources such as files.

It wraps the common preparation and clean-up tasks for you. In particular, it automatically closes the file when the block ends, even if an error occurs while the file is being read. In other words, the `with` statement is a way of ensuring that clean-up always happens.
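
The clean-up behaviour can be checked with a small sketch. It assumes a throwaway file created just for this example (`tmp_path` and `demo.csv` are made-up names, not part of the lesson data) and looks at the file object's `closed` attribute.

```python
import os
import tempfile

# Create a small throwaway file so the example does not depend on the lesson data.
tmp_path = os.path.join(tempfile.mkdtemp(), "demo.csv")
with open(tmp_path, "w") as out_file:
    out_file.write("a,b\n1,2\n")

# The file is closed as soon as the "with" block ends.
with open(tmp_path) as csvfile:
    first_line = csvfile.readline()
print(csvfile.closed)   # True

# It is closed even when the block is interrupted by an error.
try:
    with open(tmp_path) as csvfile:
        int("not a number")   # raises ValueError inside the block
except ValueError:
    print(csvfile.closed)     # True
```

Without `with`, we would have to call `csvfile.close()` ourselves and make sure it still runs when something goes wrong.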

Here is how we print the second row of the CSV file, which is the first row of member data:

In [None]:
# Print the second row of the file (the first member's data)
import csv
with open('Biggest Loser 2018.csv') as csvfile:
    readCSV = csv.reader(csvfile)
    for row_index, row in enumerate(readCSV):
        if row_index == 1:
            print(row)
            break

As we already know, the index of a **list** starts from 0. The same rule applies to the rows and columns of a CSV file.

A CSV file is a table (that is, a list of rows, where each row is a list of cells), so the row index and the column index both start from 0.

In the sample above, `row_index == 0` is the row of column names, and `row_index == 1` is the first person's data in the table.
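
As a small illustration of "a list of rows, each a list of cells", the sketch below types the sample rows in by hand (the name `table` is made up for this example) and picks values out by row index and column index.

```python
# A hand-typed copy of the sample rows shown earlier, as a list of lists.
table = [
    ["team_no", "team_name", "team_captain", "team_member", "2018-04-02", "2018-04-03"],
    ["1", "TBD", "FALSE", "1-1", "11980", "10437"],
    ["1", "TBD", "FALSE", "1-2", "22935", "13399"],
]

print(table[0])      # row index 0: the column names
print(table[1])      # row index 1: the first member
print(table[1][3])   # row 1, column 3: 1-1
print(table[2][4])   # row 2, column 4: 22935
```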

## Total Steps for each member

After the Biggest Loser step challenge, all the attendees probably wonder -- **How many steps did they clock in total?**

It's not a hard question. Let's write some code to get the answer.

### Steps statistics

In this part, we need to know two things.

   * Which rows hold the data for attendees.
   * Which columns hold the data for steps.

Looking at the table, we can easily figure out the answers.

   * -- Attendee rows run from the second row to the end, so their index starts from 1 (that is, row_index > 0).
   * -- Step columns run from the fifth column to the end, so their index starts from 4 (that is, column_index > 3).
   
Please refer to the code block below:
   

In [None]:
# Print total steps per person
import csv
with open('Biggest Loser 2018.csv') as csvfile:                   # Open the CSV file by path
    readCSV = csv.reader(csvfile)                                 # Create a reader for the CSV file
    for row_index, row in enumerate(readCSV):                     # Fetch the data row by row, together with each row's index
        if row_index > 0:                                         # Skip the header; member rows start from index 1
            total_steps = 0                                       # Variable to store this member's total steps
            for column_index, steps_in_column in enumerate(row):  # Go through every column in the row
                if column_index > 3 and steps_in_column != '':    # Step columns start from index 4; skip empty cells
                    total_steps = total_steps + int(steps_in_column)
            print("Total steps of Member" + str(row_index) + " is:" + str(total_steps))  # After the inner loop finishes, total_steps is the answer we want
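
The `column_index > 3` check could also be written with a list slice. This is only an alternative sketch, not the approach used in the rest of the lesson. It assumes one data row typed in by hand, and `step_cells` is a name made up for this example.

```python
# One data row copied from the sample above; real rows have more date columns.
row = ["1", "TBD", "FALSE", "1-1", "11980", "10437"]

# row[4:] keeps every column from index 4 onwards, i.e. the step counts.
step_cells = row[4:]

total_steps = 0
for steps_in_column in step_cells:
    if steps_in_column != '':          # skip days with no recorded steps
        total_steps = total_steps + int(steps_in_column)

print(total_steps)   # 22417
```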

### Type casting `int()`
From the code block above, we get total steps for every attendee, and print them one by one.

You may have noticed something new on Line 10 of the code, where we add to the total:

```python
total_steps = total_steps + int(steps_in_column)
```

Last week, we learnt `str()`, which converts other types to a string. Now let's talk about `int()`.

When we read data from a CSV file, every value has the same type: they are all strings.

Let's check the code block again.

```python
total_steps = 0
for column_index, steps_in_column in enumerate(row):
    if column_index > 3 and steps_in_column != '':
        total_steps = total_steps + int(steps_in_column)
```

We define `total_steps` and initialise it to 0, so it is an integer.

Meanwhile, `steps_in_column` comes from the table, so it is a string.

Python cannot add a string to an integer, so we need to convert the string to an integer before doing the addition.

As a result, **the `int()` function is used for converting other types to an integer.** We need it whenever we do arithmetic with values read from a CSV file.
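
A short sketch of the difference, using two step counts from the sample data (the variable names `steps_day_1` and `steps_day_2` are made up for this example):

```python
steps_day_1 = "11980"   # values read from a CSV file are strings
steps_day_2 = "10437"

# "+" on two strings joins them instead of adding the numbers.
print(steps_day_1 + steps_day_2)             # 1198010437

# After converting with int(), "+" does arithmetic.
print(int(steps_day_1) + int(steps_day_2))   # 22417

# Mixing an integer and a string raises a TypeError.
try:
    total = 0 + steps_day_1
except TypeError as error:
    print("TypeError:", error)
```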

## Print the names of people with steps count more than 100k

Now let's use the data from the CSV file to do some simple analysis.

In the last section, we printed every attendee's total steps. By comparing each total with 100k, we can figure out who has more than 100k steps.

***But here is a question: how do we print the names of the people who have more than 100k steps?***

We know the sequence number of each member, but we don't know their names.

Can we print the names by column name? -- **Not with `csv.reader()`: each row is a plain list, so we can only use the column index.**

### How to print the names by column name or column index? -- `csv.DictReader()` function

When reading a CSV file with **`csv.reader()`**, we loop over the data row by row and each row is a plain list. A value can only be reached by its position, so the code does not show which column the data belongs to.

Here is another way to read the CSV file: **`csv.DictReader()`**. It works like a regular reader, but it maps the information in each row into a dictionary whose keys are given by the optional `fieldnames` parameter. If `fieldnames` is omitted, as in this lesson, the values in the first row of the file are used as the keys.

More details about `csv.DictReader()` you can find here: https://docs.python.org/3/library/csv.html#csv.DictReader
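
Here is a minimal sketch of what a `csv.DictReader()` row looks like. It assumes a two-line excerpt of the sample data typed in as a string (`sample_text` and `first_row` are names made up for this example) instead of the lesson file.

```python
import csv
import io

sample_text = (
    "team_no,team_name,team_captain,team_member,2018-04-02,2018-04-03\n"
    "1,TBD,FALSE,1-1,11980,10437\n"
)

readCSV = csv.DictReader(io.StringIO(sample_text))
first_row = next(readCSV)          # the header row is consumed automatically

print(readCSV.fieldnames)          # list of column names
print(first_row['team_member'])    # 1-1
print(first_row['2018-04-02'])     # 11980
```

Note that the values are still strings, so `int()` is needed before doing any arithmetic with them.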

#### Get the column names

Please refer to the code block below:


In [None]:
# Sample for DictReader
# Print header
import csv
with open('Biggest Loser 2018.csv') as csvfile:
    readCSV = csv.DictReader(csvfile)
    header = readCSV.fieldnames
    print(header)



From the results printed above, the "header" is a list of the **column names**.

#### Print the data by column index and column name

Please refer to the code block below:

In [None]:
# Print the data by column index and column name
import csv
with open('Biggest Loser 2018.csv') as csvfile:
    readCSV = csv.DictReader(csvfile)
    header = readCSV.fieldnames
    team_number = header[0]
    team_name = header[1]
    for row in readCSV:
        print(row[team_number], row[team_name], row['team_captain'], row['team_member'])


#### Print the names with corresponding steps

Please refer to the code block below:


In [None]:
# Print total steps per person by using DictReader
import csv
with open('Biggest Loser 2018.csv') as csvfile:
    readCSV = csv.DictReader(csvfile)
    header = readCSV.fieldnames
    for row in readCSV:
        total_steps = 0
        for column_index, column_name in enumerate(header):
            steps_in_column = row[column_name]
            if column_index > 3 and steps_in_column != '':
                total_steps = total_steps + int(steps_in_column)
        print("Total steps of " + row['team_member'] + " is:" + str(total_steps))


### Write a function for counting total steps

The step counts start from column 5, so we can count the total steps by looping over the header and only adding the columns whose index is 4 or more, that is, `header_index > 3`.

```python
def total_steps_by_row(row, header):
    total_steps = 0                                                      # Running total for this row
    for header_index, field_name in enumerate(header):                   # Loop over the header to visit every column
        steps_in_column = row[field_name]                                # Look up this row's value by column name
        if header_index > 3 and steps_in_column != '':                   # Step counts start from column index 4; skip empty cells
            total_steps = total_steps + int(steps_in_column)
    return total_steps
```

By reusing this function, we can get the total steps per person and then compare them with 100k.

We can define a list, **NamesForStepsMoreThan100K**, to store the names that satisfy the condition.

Please refer to the code block below:

In [None]:
# Print names of people with steps count more than 100k
import csv


def total_steps_by_row(row, header):
    total_steps = 0
    for header_index, field_names in enumerate(header):
        steps_in_column = row[header[header_index]]
        if header_index > 3 and steps_in_column != '':
            total_steps = total_steps + int(steps_in_column)
    return total_steps


with open('Biggest Loser 2018.csv') as csvfile:
    readCSV = csv.DictReader(csvfile)
    header = readCSV.fieldnames
    NamesForStepsMoreThan100K = []
    
    for row in readCSV:
        total_steps = total_steps_by_row(row, header)
        if total_steps > 100000:
            NamesForStepsMoreThan100K.append(row['team_member'])
    print("Steps more than 100K: " + str(NamesForStepsMoreThan100K))


**Question: Following the code block above, can you complete a code block to print the names of people with steps count more than 300k and 400k?**

In [None]:
# Write your code here:

# Read from CSV file
# Count for total steps
# Compare with 300K/400k
# Print the names



## Print total steps for each team

In the code above, we completed a function for counting the total steps of each attendee.

To print the total steps for each team, what else should we do? -- We should know **all the team names**.

Let's write more functions.  

### Get all the team names

To store the team names, create a list.

Please refer to the code block below:

```python
def get_team_names(steps_data):
    team_names = []                              # Create a list to store all the team names
    for row in steps_data:                       # Loop over the rows to get the team name from each one
        if row['team_name'] not in team_names:   # Only append a team name that is not in the list yet
            team_names.append(row['team_name'])
    return team_names
```
### Count total steps by team name 

After getting the team names, the next step is to count the total steps by team name.

Please refer to the code block below:

```python
def get_total_steps_by_team_name_from_data(steps_data, header, team_name):
    team_total_steps = 0
    for row in steps_data:
        if row['team_name'] == team_name:   # If the row belongs to the target team, add its steps to the team total
            team_total_steps = team_total_steps + total_steps_by_row(row, header)
    return team_total_steps
```
### Loop for team names

To print the total steps for each team, loop over the team names and call the function above once per team.

Please refer to the code block below:

```python
for name in team_names:
    total_steps_by_team = get_total_steps_by_team_name_from_data(steps_data, header, name)
    print("Total steps of Team " + name + " is: " + str(total_steps_by_team))
```

***Tips: `csv.reader()` and `csv.DictReader()` read through the file only once. After you have looped over all the rows, the reader is used up and a second loop gets nothing. If you want to use the data more than once, store the rows in a list first, for example with `list(readCSV)`, and reuse that list.***
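
The tip can be seen in a small sketch. It assumes a three-line made-up sample held in a string (`sample_text`, `first_pass` and `second_pass` are names invented for this example) rather than the lesson file.

```python
import csv
import io

sample_text = "team_name,team_member\nTBD,1-1\nTBD,1-2\n"

readCSV = csv.DictReader(io.StringIO(sample_text))
first_pass = [row['team_member'] for row in readCSV]
second_pass = [row['team_member'] for row in readCSV]   # reader is already used up
print(first_pass)    # ['1-1', '1-2']
print(second_pass)   # []

# Copying the rows into a list first makes them reusable.
steps_data = list(csv.DictReader(io.StringIO(sample_text)))
count_1 = len([row for row in steps_data])
count_2 = len([row for row in steps_data])
print(count_1, count_2)   # 2 2
```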

### Reference

**Entire code for printing total steps by each team**

Please refer to the code block below:

In [None]:
# Print total steps by each team
import csv


def total_steps_by_row(row, header):
    total_steps = 0
    for header_index, fieldNames in enumerate(header):
        steps_in_column = row[header[header_index]]
        if header_index > 3 and steps_in_column != '':
            total_steps = total_steps + int(steps_in_column)
    return total_steps


def get_team_names(steps_data):
    team_names = []
    for row in steps_data:
        if row['team_name'] not in team_names:
            team_names.append(row['team_name'])
    return team_names


def get_total_steps_by_team_name_from_data(steps_data, header, team_name):
    team_total_steps = 0
    for row in steps_data:
        if row['team_name'] == team_name:
            team_total_steps = team_total_steps + total_steps_by_row(row, header)
    return team_total_steps


with open('Biggest Loser 2018.csv') as csvfile:
    readCSV = csv.DictReader(csvfile)
    header = readCSV.fieldnames                 # Get the header so it can be passed to get_total_steps_by_team_name_from_data(steps_data, header, team_name)
    stepsData = list(readCSV)                   # The reader can only be looped over once, so copy the rows into a list for reuse
    team_names = get_team_names(stepsData)
for name in team_names:
    total_steps_by_team = get_total_steps_by_team_name_from_data(stepsData, header, name)
    print("Total steps of Team " + name + " is: " + str(total_steps_by_team))
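
The code above loops over all the rows once per team. As a possible alternative, the sketch below keeps a running total per team in a dictionary and reads the rows only once. It is not the lesson's method, and it runs on made-up rows: the team names `Alpha` and `Beta` and the step values are hypothetical.

```python
import csv
import io

# Made-up rows in the same layout as the lesson file (team names are hypothetical).
sample_text = (
    "team_no,team_name,team_captain,team_member,2018-04-02,2018-04-03\n"
    "1,Alpha,FALSE,1-1,100,200\n"
    "1,Alpha,FALSE,1-2,300,\n"
    "2,Beta,FALSE,2-1,50,50\n"
)

readCSV = csv.DictReader(io.StringIO(sample_text))
header = readCSV.fieldnames
team_totals = {}                                   # team name -> running total

for row in readCSV:
    member_steps = 0
    for column_name in header[4:]:                 # step columns start at index 4
        if row[column_name] != '':
            member_steps = member_steps + int(row[column_name])
    team_totals[row['team_name']] = team_totals.get(row['team_name'], 0) + member_steps

for name in team_totals:
    print("Total steps of Team " + name + " is: " + str(team_totals[name]))
```

With only a handful of teams, both approaches work fine. The dictionary version mainly avoids building a separate list of team names.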


**That's it for Lesson 4!**

**See you next week!**