# What is this about?
This notebook is your required prep for the first session of **OIT 248**. This notebook was also the basis of the Week Zero Python bootcamp, so it will look familiar to all of you who attended. **However, please note that the prep for the first class asks you to go through every section of the notebook and to try out the exercises as well!**

# Comments
Comments in Python start with a `#` like directly below

In [None]:
# this line and the line below it are comments and Python will ignore them
# print("Something interesting")

# instead, the line below this will print a welcome message
print("Welcome to OIT 248! The number of minutes in a leap year is:", 366*24*60)

## In a Jupyter notebook, it's easy to add comments and notes, like this one. This particular type of cell is a Markdown cell, which allows us to comment and document our notebooks.
### You can also see the effect of using one or multiple # signs: these create sections, sub-sections, sub-sub-sections, etc.

#### Click the Run button above or hit `Shift+Enter` on your keyboard to finish this cell and move on to the next one.

#### Here are a few reminders as we get started using Python through Jupyter:
* Don't forget to save your work by hitting the save button above (far left) periodically
* We will mostly be using "Markdown" and "Code" cells - you can switch between them with the pulldown menu above
* If you want to clear all of the output and re-run a notebook, go to the "Kernel" menu above, and select "Restart and clear output"
* We will mostly just be using the Run button today, but some of the other buttons will be useful to you as we progress: Stop, Copy, Paste, etc.

<font color=blue>**NOTE**: Throughout this class, we will not always code something in the most efficient way possible, and there are almost always multiple ways of achieving the same thing. We'll try to use the most intuitive or robust approach in our solutions, but don't hesitate to ask us to check your work if you decided to take a different approach.</font>

In [None]:
# This is a code cell - if you want to add a comment in a code cell, just put a pound sign in front of it
# Let's do some simple calculations
6*9 + 41

In [None]:
2**10

In [None]:
600/34 - 5

# Variables
Let's create some simple variables. The equals sign `=` **assigns** a value to the variable.
We will also use a double equals sign `==` later, which **tests for equality**.

In [None]:
# Assign the value 10 to the variable NumStudents
NumStudents = 10

In [None]:
# IMPORTANT NOTE: Variable names are case-sensitive
# This works:
NumStudents

In [None]:
# This will give you an error, because the n is not capitalized:
numStudents

<font color=blue>**Remark.** As you are executing code cells, note how the number next to each input code cell and corresponding output is increasing: your first cell had `In [1]` and `Out[1]`, then `In[2]` and `Out[2]`, and so on. That counter keeps track of the latest instruction that was run. In a Jupyter Notebook, you can actually go back and run older segments of code or even jump ahead and skip some code, so the progression is not sequential. To see this, try running the first line of code again! Assuming that you did not run something twice so far, you should see the counter changing to `In[7]` and `Out[7]`. Being able to jump back and forth is useful because it allows you to play/test different things, without having to run the entire code linearly from top to bottom. **But you have to be careful, because running an older instruction or skipping code could result in unintended things.** For instance, consider the piece of code below: </font>

In [None]:
NumStudents = NumStudents + 1

NumStudents

<font color=blue>This increases the value of the `NumStudents` variable by one. When you run it for the first time, it should print 11 (because that variable was initially assigned a value of 10 in `In[4]`). But if you run this code again, it will print 12, and then 13, etc. </font><br>
    
<font color=blue>**Useful tip**: sometimes, it is helpful to re-run all the instructions in the notebook from top to bottom up to some point. (For instance, suppose we want to re-run everything from the top to this section of the notebook). To do so:<br></font>
  1. <font color=blue> restart the Kernel: either from **"Kernel > Restart"** or by clicking the button with the little clockwise turning arrow that says "restart the kernel (with dialog)"</font>
  2. <font color=blue> select the cell immediately below the point where you want the code to stop </font>
  3. <font color=blue> from the **"Cell"** menu, select **"Run All Above"**</font>

<font color=blue> Feel free to try this right now! Select this cell of text, and execute the steps above! You will see that the notebook will re-run, but the counter will not make it all the way here! The reason is the (intentional) error in `In[6]`: **the execution thread will stop when it encounters an error!** So you will need to run everything after that manually.</font>

Note that above, we printed the value of a variable by simply typing its name, i.e., `NumStudents`. That works fine if it's the last instruction in the code. But if you have some other instructions (like new variable assignments) following that, the printing would not work. For instance, consider this piece of code:

In [None]:
NumStudents

aux = 5

As you can see, the code above is not displaying the value of `NumStudents` anymore! In this case, to display the value, use the function `print`, like below:

In [None]:
print(NumStudents)
aux = 5

Above, `print` is a **function** that simply displays its given **argument** (i.e., the variable `NumStudents`). We will see many more functions as we go along. Note that the arguments to a function are always surrounded by parentheses `(...)`.

<font color=red>If you've never heard of `print` or functions, then OIT 248 is probably not the right level! :-) </font>

Now let's set the number of students back to 10, because that's useful later!

In [None]:
NumStudents = 10
print(NumStudents)

# Data Types
The main kinds of data that we'll be using in OIT 248 are `int` (integers), `float` (floating point values), `str` (strings), `bool` (booleans), and some more advanced Python-specific data types (lists and dictionaries, which we'll discuss a bit later).

In [None]:
# integers (int)
x = 10

# floating point numbers (float)
z = 4.5

# boolean (bool)
is_OIT248_amazing = True

<font color=red>If you've never heard of `int`, `float`, `str`, `bool`, then OIT 248 is probably not the right level! :-) </font>
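
A quick way to confirm which type Python assigned to a value is the built-in `type` function. The sketch below is only an illustration: the variable names (`count`, `ratio`, `course`, `flag`, `half`) are hypothetical and chosen so that they do not overwrite anything defined above.

```python
# hypothetical example values, one per basic data type
count = 10          # int
ratio = 4.5         # float
course = "OIT 248"  # str
flag = True         # bool

# type(...) returns the data type of its argument
for value in [count, ratio, course, flag]:
    print(value, "has type", type(value).__name__)

# dividing two integers with / always produces a float
half = count / 4
print(half, "has type", type(half).__name__)
```

Note that `/` returns a `float` even when both operands are integers, which is why `half` is not an `int`.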

You can create strings either using single quotation `'` or using double-quotation `"`. Each of these is useful depending on the case: you should use double quotes if there is an apostrophe in the string and single quotes if there is a double-quote in the string...

In [None]:
# strings (str) can be created with either ' or "
one_string = "Why does Arbuckle not have salmon today?"
another_string = 'I LOVE THE GSB!'

# if a string has an apostrophe in it, you must use " when creating it
a_string_with_apostrophe = "Jack's IPhone"
print(a_string_with_apostrophe)

# similarly, if a string has a quotation mark in it, you must use the apostrophe to create it '
a_string_with_quotation = 'An example of string is "Jack"'
print(a_string_with_quotation)

# If-Else Statements
You can implement "if-else statements" in Python using the following syntax:
> if `logical_condition_1`:<br>
> $ \qquad$ first instruction if logical_condition_1 is True<br>
> $ \qquad$ second instruction if logical_condition_1 is True<br>
> elif `logical_condition_2`:<br>
> $ \qquad$ instructions if logical_condition_1 is False and logical_condition_2 is True <br>
> ...<br>
> else:<br>
> $ \qquad$ instructions if all logical conditions above are False

Some examples:

In [None]:
a = 15
b = 17
if (a > b):
    print("a is bigger than b")
elif (a == b):
    print("a is equal to b")
else:
    print("a is smaller than b")

<font color=blue>**A few critical things to note in the code for `if-else` above:** </font><br>
 <font color=blue>1. the colon `:` is critical on the first line. </font><br> 
 <font color=blue>2. the indentation on the second line (and for any instructions in that block) is critical: you can use as many spaces as you like or use tabs, as long as you are consistent within the block. </font>

The colon and indentation are how Python knows what to do. The colon indicates the start of the instructions corresponding to the case when the `if` condition is true, and all following indented lines will be considered part of that code block. As soon as you unindent a line, it is no longer part of the block.

For instance, consider the following code:

In [None]:
# an if statement to show the effect of indentation
if a > b:
    print("I am only printing this if a is bigger than b")
print("I am printing this regardless of a and b")

Also, forgetting the colon or the indentation can result in errors:

In [None]:
# forgetting the colon :
if (a > b)
    print("a is bigger than b")

In [None]:
# forgetting the indentation :
if (a > b):
print("a is bigger than b")

The logical conditions that appear in the `if-else` statement are constructed with comparison operators like `<=`, `>=`, `<`, `>`, `==`, `!=` and 
boolean operators `and`, `or`, `not` that allow you to combine conditions. We'll practice with those soon...
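
As a small preview, the sketch below combines comparisons with `and`, `or`, and `not`. The variable names (`salary`, `years`, `is_senior`) and the thresholds are hypothetical and chosen only for illustration.

```python
# hypothetical values used only for this illustration
salary = 180
years = 3

# 'and' is True only when both conditions are True
if salary >= 150 and years < 5:
    print("high salary early in the career")

# 'or' is True when at least one condition is True
if salary < 100 or years > 2:
    print("at least one of the two conditions holds")

# 'not' flips True to False and vice versa
is_senior = years >= 10
if not is_senior:
    print("not a senior employee yet")

# comparisons themselves evaluate to booleans
print(salary != 180, salary == 180)
```

With these values, all three `if` blocks print their message, and the last line prints `False True`.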

<font color=red>If you've never heard of an `if-else` statement before or have trouble understanding these operators, then OIT 248 is probably not the right level! :-) </font>

# For Loops
A `for` loop is used to iterate over a sequence. Syntax:
> for `variable` in `sequence`:<br>
> $ \qquad$ instructions line 1<br>
> $ \qquad$ instructions line 2

<font color=blue>**Just like with `if` statements, the colon `:` and indentation are critical in a `for` loop.**</font>

For now, we don't have very nice examples, but they are coming up shortly...

<font color=red>If you've never heard of a `for` loop before, then OIT 248 is probably not the right level! :-) </font>

# Lists
Lists are groups of values, linked together into a single variable.

Let's create a list of student names...

In [None]:
# List of 10 student names
StudentName = ["Ann", "Bob", "Carl", "Dan", "Eva", "Fiona", "Gabe", "Hal", "Irene", "Jack"]

# print it out
print(StudentName)

The square brackets and commas are what define this as a list. 

<font color=blue>__Remark.__ **In Python, a list can contain many kinds of data.** This might be surprising to someone with coding background in other languages... Here's an example of a list that contains a string, a float, another list, and some numbers. </font>

In [None]:
# crazy list containing a string, a float, another list (!), and several 5s
crazy_list = ["a string", 4.75, ['Federer', 'Nadal', 'Djokovic'], 5, 5, 5]
print(crazy_list)

You can check the **length** of the list (i.e., how many elements are in the list) using the `len` function

In [None]:
# Check the length of the list
len(StudentName)

The elements in the list are ordered and you can recover elements at specific locations in the list using the bracket indexing. 

Python uses "zero-indexing", which means that the first element has index 0 and the last one has index `len(list)-1`.

Let's print out the first element in the list:

In [None]:
StudentName[0]

You can also access several contiguous items in the list using a **range** of indices specified with `:`

In [None]:
StudentName[1:3]

<font color=blue>**Important!** The syntax `1:3` will create the range `1,2`. So in other words, starting at the first number and up to but __not__ including the last number.</font> 

In [None]:
# This one will print Ann, Bob, and Carl
StudentName[0:3]

In [None]:
# If you want to go all the way to the end of the list, just leave out the last index
StudentName[6:]

FYI, Python also allows negative indexing! This counts from the end of the list, so `-1` is the last element:

In [None]:
StudentName[-1]
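
Negative indices can also be combined with ranges. The sketch below uses a hypothetical list called `letters` (not part of the notebook) so that `StudentName` is left untouched.

```python
# hypothetical list used only for this illustration
letters = ["a", "b", "c", "d", "e"]

print(letters[-1])    # last element: e
print(letters[-2])    # second-to-last element: d
print(letters[:2])    # first two elements: ['a', 'b']
print(letters[-2:])   # last two elements: ['d', 'e']
print(letters[1:-1])  # everything except the first and last: ['b', 'c', 'd']
```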

To change an element at a specific location of a list, you can just use the `=` sign with the right indexing. 

For instance, let's print the list and then change the element in the second position and print it again to see the results:

In [None]:
print(StudentName)
StudentName[1] = "Bobby"
print(StudentName)

You can print out all the elements in a list with a `for` loop as follows:

In [None]:
for x in StudentName:
    print(x)

<font color=blue>**Remark**. The `for` loop above may seem surprising if you have a background in other programming languages. Note that we are actually **looping through the elements of the list directly**, without any need to index them numerically. This kind of flexibility is what makes Python powerful and we encourage you to get used to it quickly! </font>

Instead, the following code loops using a numeric index and may seem like something more familiar:

In [None]:
for i in [0,1,2,3,4]:
    print(StudentName[i])

To loop through **all** the elements, you would need to create the indices 0, 1, 2, ..., len(list)-1. You can do that using `range`, as follows.

In [None]:
for i in range(len(StudentName)):
    print(StudentName[i])

Let's unpack that statement a bit. Recall that `len` returns the length of a list. The `range` function returns a sequence of numbers and its syntax is:
>  `range(start,stop,step)`

- <font color=blue>`start` is optional and is an integer that specifies at which position to start. If you omit this, the default value is 0.</font>
- <font color=blue>`stop` is required and is an integer that specifies at which position to end. This value is **not** included in the sequence (so the last value that may appear is `stop-1`).</font>
- <font color=blue>`step` is optional and specifies the increment. If you omit it, the default is 1.</font>

<font color=blue> So `range(start,stop,step)` produces the sequence of values `start`, `start+step`, `start+2*step`, up to the largest number of the form `start+k*step` that is **strictly** smaller than `stop`.</font>

For instance, this creates the range of values 0, 1, 2, 3, 4:

In [None]:
range(5)

By itself, this is not useful, but we can loop through it using a `for` loop:

In [None]:
for i in range(5):
    print(i)
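
To see the effect of `start` and `step` as well, one option is to convert each range into a list with `list(...)` and print it. The particular numbers below are arbitrary and only meant as an illustration.

```python
# list(...) turns a range into a regular list so we can see its contents
print(list(range(5)))          # [0, 1, 2, 3, 4]
print(list(range(2, 6)))       # [2, 3, 4, 5]
print(list(range(1, 10, 3)))   # [1, 4, 7]

# with a negative step, the sequence counts down and stops before reaching 'stop'
print(list(range(10, 0, -2)))  # [10, 8, 6, 4, 2]
```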

Putting things together, the syntax `for i in range(len(StudentName))` loops through all the integers 0, 1, 2, ..., len(StudentName)-1, which allows looping through the whole list.

## EXERCISE.
<font color=green>**This two-part exercise allows you to practice some of the learnings so far. Please do the following:**</font>
  - <font color=green>a. Print out all the student names with at least 4 characters.</font>
  - <font color=green>b. Print out all the student names with at least 4 characters and different from 'Gabe'.</font>  

Now let's give each of the students an ID number.

In [None]:
# Give each of the students an ID number from 1000 up
StudentID = list(range(1000,1010))

In [None]:
# What did we just create?
StudentID

Let's say we also wanted to create a variable for which section each student is in. 

In [None]:
# If all 10 were in the same section:
Section = ['Section 1'] * len(StudentName)

In [None]:
Section

In [None]:
# If the first 5 were in Section 1, and the next 5 were in Section 2
Section = ['Section 1'] * 5 + ['Section 2'] * 5

In [None]:
Section

<font color=blue>**The examples above show how to use `+` and `*` with a list.** If you want to create a list with repetitions of the same value, you can use `*` **applied to a list** containing just the one value. The operator `+` can be applied to lists: it will just concatenate the list elements.</font>

Note that you need to make sure you operate with **lists**. To understand that, note that adding a string to the list of names creates an error:

In [None]:
# adding a string to the list of names:
StudentName = StudentName + "Joe"

But adding a **list** with the string "Joe" works:

In [None]:
# adding a list that contains the string to the list of names:
StudentName = StudentName + ["Joe"]
print(StudentName)

Now let's get rid of the last element in the list, because we only want 10 students... You can remove an element by simply reassigning the list with the right indexing:

In [None]:
StudentName = StudentName[0:len(StudentName)-1]
print(StudentName)
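
There are other common ways to add or remove the last element of a list. The sketch below shows a few of them on a hypothetical list called `names`, so that `StudentName` keeps its 10 entries. Two of them (`append` and `pop`) use the dot syntax for methods, which is explained a little further below.

```python
# hypothetical list used only for this illustration
names = ["Ann", "Bob"]

# append adds a single element to the end of the list, modifying it in place
names.append("Joe")
print(names)           # ['Ann', 'Bob', 'Joe']

# slicing with a negative stop keeps everything except the last element
names = names[:-1]
print(names)           # ['Ann', 'Bob']

# pop removes the last element in place and returns it
removed = names.pop()
print(removed, names)  # Bob ['Ann']
```

The slice `names[:-1]` does the same thing as `names[0:len(names)-1]`, just written more compactly.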

Now let's create some salary figures for these 10 students (in thousands of dollars...)

In [None]:
Salaries = [175, 189, 168, 196, 182, 188, 198, 162, 191, 143]

Let's find out who makes the most and who makes the least...

In [None]:
# Maximum value
max(Salaries)

In [None]:
min(Salaries)

What if we wanted to know which students the maximum/minimum salaries belong to? We can use the `index` function.

In [None]:
Salaries.index(198)

In [None]:
Salaries.index(143)

The `index` function simply returns the index (i.e., location) in the list corresponding to a given value.

<font color=blue>Note that the `index` "function" is called a bit differently than what we did so far. We are using the variable name `Salaries` and the dot `.` and then using the function `index`. This kind of function is called a `method`. It's really like any other function, but the key difference is that it "lives" inside the variable that appears before the dot `.` symbol. So when it is called, it is applied and acting upon that variable. (This has to do with object-oriented programming, but we will not dive into that much more deeply because it's not critical for our class...) </font>

Now let's find out which student has the highest salary...

In [None]:
# Which STUDENT has the max salary?
StudentName[Salaries.index(max(Salaries))]
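
One caveat worth knowing: `index` returns the position of the **first** occurrence of a value, so with ties it only finds one student. The sketch below uses hypothetical lists (`names`, `scores`) that contain a tie, and collects all the tied names with a `for` loop.

```python
# hypothetical data with a tie for the highest value
names = ["Ann", "Bob", "Carl"]
scores = [90, 95, 95]

# index returns the position of the FIRST occurrence of the value
first_pos = scores.index(max(scores))
print(names[first_pos])    # Bob

# a for loop can collect every name tied for the maximum
top = max(scores)
all_top = []
for i in range(len(scores)):
    if scores[i] == top:
        all_top = all_top + [names[i]]
print(all_top)             # ['Bob', 'Carl']
```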

## Creating lists from other information
One of the most frequent operations that we'll deal with is to create one list based on some other information. 

For instance, suppose we want to create a list of the students whose names start with the letter 'G'. We can do that using a `for` loop:

In [None]:
students_with_g = []                              # create an empty list  

for s in StudentName:                             # loop through all the student names
    if s[0] == 'G':                               # check if the string 's' starts with the letter 'G'
        students_with_g = students_with_g + [s]   # add 's' to the list

print(students_with_g)

The most elegant way to do that in Python is using a **list comprehension**. This can make life a lot easier and it is worth getting used to list comprehensions!

In [None]:
# create a list using a list comprehension
students_with_g = [s for s in StudentName if s[0]=='G']

List comprehensions offer a very simple way to create a new list based on some existing lists. The syntax is:

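> `newlist = [`_expression_ `for` _item_ `in` _sequence_ `if` _condition_`]`

An _item_ contributes to the new list only when _condition_ evaluates to `True`, and the `if` part can be omitted to keep every item. _sequence_ is typically another list or a range (but we will see some other examples throughout the class)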
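
To make the syntax concrete, here is a minimal sketch with a hypothetical list `numbers`. It shows a comprehension without a condition, one with a condition, and one that uses a range as the sequence.

```python
# hypothetical list used only for this illustration
numbers = [1, 2, 3, 4, 5, 6]

# expression only: square every item
squares = [n**2 for n in numbers]
print(squares)        # [1, 4, 9, 16, 25, 36]

# expression plus condition: keep only the even numbers, then square them
even_squares = [n**2 for n in numbers if n % 2 == 0]
print(even_squares)   # [4, 16, 36]

# the sequence can also be a range
doubles = [2*k for k in range(1, 4)]
print(doubles)        # [2, 4, 6]
```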

## EXERCISE.
**<font color=green>Create and then print a list with all the students with salaries of 150 or below.</font>**

# Printing
We already saw the `print` function, which allows printing variables. Let's dive a bit deeper and print some more complex messages. 

In [None]:
print("We have 10 students in our class.")
# Better approach:
print("We have %d students in our class." % len(StudentName))

The '%' symbol is used to put a dynamic object in a print statement. It may look familiar to those of you with C/C++ or MATLAB background. 

The letter afterwards specifies the type of data: %d (for integers), %s (for strings), %f (for "floating point" values)

In [None]:
print("The average salary is %f" % (sum(Salaries)/len(Salaries)))

In [None]:
# Let's clean up the output
print("The average salary is %.1f" % (sum(Salaries)/len(Salaries)))
print("The average salary is %.2f" % (sum(Salaries)/len(Salaries)))

<font color=blue>__Remark.__ Although the approach above works, there are better ways to print in Python. We suggest that you get used to **printing with `f-strings`**. Here is how the code immediately above would look with an f-string:</font>

In [None]:
# print the same thing with f-strings
print(f"The average salary is {sum(Salaries)/len(Salaries):.1f}")
print(f"The average salary is {sum(Salaries)/len(Salaries):.2f}")

With the f-string approach, you can print many types of variables and combine text and variables. 

For instance, let's print the name, ID, and salary for the first 5 students, with enough space so that the ID and salary remain left-aligned.

In [None]:
for i in range(5):
    print(f"The student with name {StudentName[i]:<6s} has ID {StudentID[i]} and salary {Salaries[i]}")
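
The part after the colon inside the curly braces is a format specification. The sketch below illustrates a few common ones on hypothetical values (`name`, `amount`); the square brackets are there only to make the padding visible.

```python
# hypothetical values used only for this illustration
name = "Eva"
amount = 1234.5678

print(f"[{name:<8s}]")      # left-aligned in 8 characters:  [Eva     ]
print(f"[{name:>8s}]")      # right-aligned in 8 characters: [     Eva]
print(f"[{amount:.2f}]")    # two decimals:                  [1234.57]
print(f"[{amount:,.1f}]")   # thousands separator:           [1,234.6]
print(f"[{amount:10.1f}]")  # width 10, one decimal:         [    1234.6]
```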

For more details and examples, you can check out the **Cheat Sheet**!

## EXERCISE.
<font color=green>**Calculate and print the _average salary_ of all the students, with 2 digits of precision.**</font>

# Dictionaries
Dictionaries are a widely used data structure in Python. A dictionary is just a set of key/value pairs and the data is organized so that we can look up the value based on the key. As such, the `keys` have to be unique (but the values can be repeated). For instance, a dictionary could be used to implement a phonebook: you would have the `name` as a key, and the `value` could be the phone number (and other contact information for that person). 

Let's construct a dictionary in our case. Recall that we had 10 student names, 10 ID numbers, and the corresponding salaries: 

In [None]:
print(StudentName)
print(StudentID)
print(Salaries)

What if we want to create a dictionary where we can look up students based on the name and recover the salary?

We could do something like this:

In [None]:
Salary_dict = {
    StudentName[0] : 175,
    StudentName[1] : 189,
    StudentName[2] : 168,
    StudentName[3] : 196,
    StudentName[4] : 182,
    StudentName[5] : 188,
    StudentName[6] : 198,
    StudentName[7] : 162,
    StudentName[8] : 191,
    StudentName[9] : 143
}

The curly braces `{...}` tell Python that we are creating a dictionary, and the way we are doing it above is through (key,value) pairs separated by a colon `:`

Let's print this out for a look:

In [None]:
# print it
print(Salary_dict)

We can now retrieve the salary for a given name quite quickly! Let's print Dan's salary:

In [None]:
# retrieve Dan's salary
Salary_dict["Dan"]

You can also change the value associated with a key with a simple assignment.

In [None]:
# change Dan's salary
Salary_dict["Dan"] = 100000
print(Salary_dict)

Of course, the way we created the dictionary above was very manual and error prone. 

The best way would be to do it programmatically - something like this:

In [None]:
# A different way to create this dictionary:

# Initialize the dictionary to be empty
Salary_dict2 = {}

# Add the elements with a for loop
for i in range(len(StudentName)):
    Salary_dict2[StudentName[i]] = Salaries[i]

print(Salary_dict2)

There are many other functions in Python that directly create dictionaries from lists of keys and values, or from other data structures. For our purposes, the above should pretty much be enough.
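
For reference, here is a minimal sketch of one such alternative, using the built-in functions `zip` and `dict`, together with a few common dictionary operations. The lists `keys` and `values` are hypothetical stand-ins for lists like `StudentName` and `Salaries`.

```python
# hypothetical lists used only for this illustration
keys = ["Ann", "Bob", "Carl"]
values = [175, 189, 168]

# zip pairs up the elements of the two lists; dict turns the pairs into a dictionary
lookup = dict(zip(keys, values))
print(lookup)    # {'Ann': 175, 'Bob': 189, 'Carl': 168}

# the same dictionary built with a dictionary comprehension
same_lookup = {keys[i]: values[i] for i in range(len(keys))}
print(same_lookup == lookup)    # True

# .items() lets a for loop go through the (key, value) pairs
for k, v in lookup.items():
    print(f"{k} earns {v}")

# 'in' checks whether a key is present
print("Dan" in lookup)    # False
```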

## EXERCISE.
<font color=green>**Suppose we are worried that two students have the same name... Create a dictionary with keys corresponding to IDs and values corresponding to a list containing the name and the student's salary. Once done, print out the dictionary.**</font>

# Pandas module
Pandas is a Python library used for working with data sets. It has very useful functions for analyzing, cleaning, exploring, and manipulating data, and we will be using it a lot throughout our class. (And in case you're wondering, the name is **not** about an animal -- it's actually short for "panel data"!) Our coverage here will be very brief, but for more details check this resource: <a href="https://www.w3schools.com/python/pandas/default.asp">https://www.w3schools.com/python/pandas/default.asp</a>.

First, let's import the pandas library (also known as "module"). There are a few different options for importing a module. Option 1 is to use the following line:

> ``import pandas``

This imports the "pandas" module, and allows us to use all the functions it contains using a syntax like ``pandas.read_csv()`` (this allows reading a CSV file!)

If we think it will be annoying to type pandas over and over, we can assign it a 'short name':

> ``import pandas as pd``

This also imports the "pandas" module, but now we would refer to the `read_csv` function as ``pd.read_csv()``.

The last option would be to do:

> ``from pandas import *``

This imports everything in the pandas module and makes it so that we can just refer to the function as ``read_csv()``.
This is potentially useful if the functions are specific enough that you don't think the same names will be used anywhere else, but dangerous if you think there might be overlap.

We'll stick with the second approach here.


In [None]:
# let's first import the module
import pandas as pd

In `pandas`, data is organized and stored as **DataFrames**. You can think of a DataFrame in close analogy with a table in Excel: it is a two-dimensional table that has data on its columns, and each column may have a header/name.

Normally we read DataFrames from files, but here we will just create a DataFrame from a dictionary. Let's first create a dictionary with some data.

In [None]:
# a dictionary that stores some data
dictionary_with_data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Juan', 'Chenxi'],
    'Age': [25, 30, 22, 28, 27, 30],
    'Id' : [10001, 10002, 10003, 10004, 10005, 10006],
    'City': ['New York', 'San Francisco', 'Los Angeles', 'Chicago', 'Guadalajara', 'Singapore']
}

print(dictionary_with_data)

Now, let's turn this into a DataFrame where the `keys` of the dictionary will be the column names and the values will correspond to the values on each column.

In [None]:
# create a DataFrame from the dictionary
df = pd.DataFrame(dictionary_with_data)

# display it
display(df)

<font color=blue>The way to think about a **DataFrame** is basically as a table with columns and rows (very similar to an Excel table!) where both columns and rows are **labeled/named**. Specifically, in our case:</font>
   - <font color=blue>the **column names/labels** are _"Name", "Age", "Id", "City"_</font> 
   - <font color=blue>the **row labels** are the numbers 0, 1, 2, 3, ..., which you can see displayed in the "first" column above.</font> 
    
<font color=blue>In addition to these labels, which can be easily changed (we'll see how later), every row and every column also has a **numeric index**. These indices start from 0 and increase by 1, left-to-right for columns and top-to-bottom for rows. So in our case:</font>
   - <font color=blue>for columns: _Name_ has numeric index 0, _Age_ has index 1, etc. _(Side note: that is why, when referring to the column with the labels 0, 1, 2,... , I put "first" in quotation marks above; that column with 0, 1, 2,... is not _really_ part of our DataFrame, so the first _actual_ column of data is __Name__)_</font>
   - <font color=blue>for rows: the row labeled 0 also has numeric index 0, the row labeled 1 has numeric index 1, etc.</font> 

<font color=blue>You can use a DataFrame's `.columns` **attribute** to get the names of all columns and its `.index`  **attribute** to get the labels of all rows. These attributes are stored as a specific kind of data structure (`Index`) through which you can actually easily iterate with a `for` loop, but sometimes it's more convenient to just convert them into Python lists like we do below. For the numeric indices, all you need to know is **how many** columns and rows there are, which you can easily get using `len(.)` on the two lists or using the `shape` attribute. Let's see these in action! </font>

In [None]:
# obtain the row and column labels, and calculate the number of rows/columns

print(f"\n{'='*100}\nColumn labels:")
# let's obtain all the labels/names for columns and convert this into a list
col_labels = list(df.columns)    # IMPORTANT! There are no (.) after "columns" because that is not a function/method !
print(col_labels)

## similarly, but now for row labels
print(f"\n{'='*100}\nRow labels:")
# let's obtain all the row labels
row_labels = list(df.index)    # IMPORTANT! Again, there are no (.) after "index" because it's not a function/method!
print(row_labels)

## getting the number of rows and columns, in two ways
print(f"\n{'='*100}\nHow many rows/cols?")
print(f"There are {len(row_labels)} rows and {len(col_labels)} columns.")
print(f"There are {df.shape[0]} rows and {df.shape[1]} columns.")

Above, we used the function `display(.)` to visualize a DataFrame. This works great, but in a large DataFrame you may want to display only a few rows, which you can do with the `head(.)` method:

In [None]:
# display the first 3 rows
df.head(3)

### Column operations

You can see a specific column using the syntax `df[column_name]`:

In [None]:
# check out the column "Age"
df["Age"]

To obtain the names of all the columns, use the attribute `df.columns`.

In [None]:
# get all the column names
df.columns

As you can see, this returns an `Index` object (which we have not discussed), but you can iterate through it with a usual `for` loop or you can transform it into a regular Python List with the function `list`.

In [None]:
# let's iterate through the names of the columns with a for loop
for c in df.columns:
     print(c)

# let's store the columns in a list and display the list
column_names = list(df.columns)
print(column_names)

<font color=red>**Warning!**</font> `df.columns` **is not a method**! If you use round brackets, you will get an error:

In [None]:
# the following would generate an error!
df.columns()

### Row operations

To get all the row labels, use `df.index`.

In [None]:
# get all the row labels
df.index

This also returns an object (a `RangeIndex`), but you can readily loop through it with a `for` or store it as a list.

In [None]:
# let's iterate through the indices of all the rows
for i in df.index:
     print(i)

# let's store the row labels in a list and display the list
row_index = list(df.index)
print(row_index)

Sometimes, it is useful to change the `index` used for the DataFrame. For instance, we can assign the row labels to the values of a python list with **unique** values by simply setting `index`:

In [None]:
# change the index (i.e., the row labels) to the letters 'a' through 'f'
df.index = ['a','b','c','d','e','f']

display(df)
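
Another option, sketched below under the assumption that one of the existing columns holds unique values, is the `set_index` method, which uses that column as the row labels. The DataFrame `demo` is a hypothetical miniature example, not the `df` from above.

```python
import pandas as pd

# hypothetical mini-DataFrame used only for this illustration
demo = pd.DataFrame({"Id": [10001, 10002], "Name": ["Alice", "Bob"]})

# set_index returns a NEW DataFrame whose row labels come from the "Id" column
by_id = demo.set_index("Id")

print(list(by_id.index))      # [10001, 10002]
print(list(by_id.columns))    # ['Name']  ("Id" is no longer a regular column)
print(list(demo.columns))     # ['Id', 'Name']  (the original is unchanged)
```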

### Retrieve elements
One of the most important operations with a DataFrame is to retrieve an element located at a certain row and column.

If you know the row and column labels, you can use the following syntax:
  > `df[column_label][row_label]`

In [None]:
## retrieve something from the dataframe using df[column_label][row_label]
# let's get the element on row 'c' and column "Age"
print(df["Age"]['c'])   # approach 1

The first time you see this, it might look a bit confusing. Typically, in mathematics, we write $M_{i,j}$ or $M[i,j]$ for row $i$ and column $j$ of a matrix $M$, so in the syntax above, it might seem like the order of rows and columns is switched! 

The idea is that `df[column_label]` returns the entire column with that name and then, applying `[row_label]` to that simply returns the element at the `[row_label]` location. 

In contrast, the next approach that uses `.loc` may seem more natural. This has the syntax:
   > `df.loc[row_label, column_label]`

In [None]:
# let's get the element on row 'c' and column "Age"
print(df.loc['c',"Age"]) 

Finally, there is an option to index into the DataFrame using entirely numeric indices, using `iloc`. The syntax here is:
> `df.iloc[numeric_row_index, numeric_column_index]`

This is similar to what we wrote earlier, with $M_{i,j}$ or $M[i,j]$ for matrix $M$.

In [None]:
# let's retrieve the element in the first row and first column (numeric indices 0 and 0)
df.iloc[0, 0]

To not get confused, remember that Python uses 0-based indexing (and the very first column, with the `index`, does not count as a proper column).

<font color=red>**IMPORTANT. You cannot combine labels with numeric indices when retrieving an element.** You have to choose one way: either you retrieve using the row/column labels (using `df.loc`) **or** you retrieve using the numeric indices (0..number of rows-1, 0...number of columns-1) using `df.iloc[...]`</font>
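
The sketch below contrasts the two approaches on a hypothetical mini-DataFrame called `demo`. It also shows one way to translate a label into its numeric position with the `get_loc` method of an `Index`, which can help when you know a label on one axis and a position on the other.

```python
import pandas as pd

# hypothetical mini-DataFrame with letter row labels
demo = pd.DataFrame(
    {"Name": ["Alice", "Bob", "Charlie"], "Age": [25, 30, 22]},
    index=["a", "b", "c"],
)

# label-based: row label 'c', column label "Age"
by_label = demo.loc["c", "Age"]

# position-based: third row (index 2), second column (index 1)
by_position = demo.iloc[2, 1]

print(by_label, by_position)         # 22 22

# get_loc translates a label into its numeric position
row_pos = demo.index.get_loc("c")
col_pos = demo.columns.get_loc("Age")
print(row_pos, col_pos)              # 2 1
print(demo.iloc[row_pos, col_pos])   # 22
```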

### Looping
To loop through the elements in a row or a column (or a subsection of the DataFrame), you can just use a regular `for` loop:

In [None]:
# let's loop through the entire DataFrame on columns and print the name of the column and the contents:
for c in list(df.columns):
    # for every column
    print(f"Column '{c}' contains:")
    for r in list(df.index):
        # for every row
        print(f"In location {r} : {df.loc[r,c]}")

## Reading data files
We will read data files using pandas' `read_csv` or `read_excel` functions. The former reads files with the `.csv` (comma-separated values) extension, whereas the latter reads `.xlsx` (Excel) files.

Let's read in a csv file, and store it as the data frame Grades.

In [None]:
# Read in our csv file and store it as the data frame Grades
Grades = pd.read_csv("Gradebook.csv")

Note the syntax for reading CSV files. Above, we used 
> `df = pd.read_csv(full_file_name)`<br>

where `full_file_name` is the complete filename, including a path if needed.

In [None]:
# Let's take a look at our data frame Grades
display(Grades)

Note that here, Pandas automatically assigned an `index` to the DataFrame that goes from 0 to 49. 

But in this case, a better index might be the `StudentID`, which presumably is unique. We could change the index after reading the file, but we can also tell Pandas to use a specific column as the index when reading it. The syntax is:
> `df = pd.read_csv(full_file_name, index_col=column_index)`<br>

where `column_index` is the **numeric index** of the column to use to construct the index (i.e., the row labels).

Let's try it again, using the `StudentID` (which has column index 0) as our row labels: 

In [None]:
# Read in our csv file with StudentID as index
Grades = pd.read_csv("Gradebook.csv", index_col=0)

# display the first 5 rows
Grades.head(5)
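
If you want to experiment with `index_col` without a file on disk, here is a minimal sketch that reads CSV text from memory. The CSV content below is made up for illustration and is not the actual `Gradebook.csv`:

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on disk
csv_text = """StudentID,MidtermGrade,HomeworkGrade
1001,88,92
1002,75,81
1003,93,67
"""

# Without index_col, pandas creates the default index 0, 1, 2, ...
default_index = pd.read_csv(io.StringIO(csv_text))

# With index_col=0, the first column (StudentID) becomes the row labels
id_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(default_index.index))
print(list(id_index.index))
print(id_index.loc[1002, "MidtermGrade"])
```

Note that once `StudentID` is the index, it is no longer one of the regular columns.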

If our data were instead in an Excel file (.xlsx), we would just use the ``read_excel`` function instead of the ``read_csv`` function. We'll see an example of this later.

Let's now calculate and display each student's average grade across the grade components (midterm, homework, participation). We need to loop through all the rows and calculate the average for each row.

**Option 1.** Suppose we want to loop through the rows using the row labels. We first create a list with all the row labels

In [None]:
row_labels = list(Grades.index)
print("Row labels:", row_labels)

Because we're looping using row labels, we also need to have **labels** for the columns. We will be looping through all the grade components (midterm, homework, participation). Let's create a list with those columns.

In [None]:
columns_with_grades = ["MidtermGrade", "HomeworkGrade", "ParticipationGrade"]
print("Column labels:", columns_with_grades)

Now, we're ready to loop through all the rows and calculate the sum on the columns.

In [None]:
# Approach that loops through rows using row labels

# Let's loop through the rows using the row label
for student_id in row_labels:

    # Let's calculate the total score for the student identified by `student_id`. 
    # We do this by adding all the columns with grades and we store the result in total_score 
    total_score = 0
    for grade_component in columns_with_grades:
        #  `grade_component` will take values like `MidtermGrade`, `HomeworkGrade`, ...
        total_score += Grades.loc[student_id,grade_component]   # add the score to the total

    # Calculate and print the average, with 2 digits of precision
    print(f"Student with ID {student_id} has average: {total_score/len(columns_with_grades):.2f}")

A slightly more compact approach for calculating the sum above is to use `sum(...)` instead of doing another `for` loop. Here is how:

In [None]:
# Approach that loops through rows using row labels

# Let's loop through the rows using the row label
for student_id in row_labels:

    # Let's calculate the total score for the student identified by `student_id` using SUM
    total_score = sum(Grades.loc[student_id,grade_component] for grade_component in columns_with_grades)

    # Calculate and print the average, with 2 digits of precision
    print(f"Student with ID {student_id} has average: {total_score/len(columns_with_grades):.2f}")

Hopefully the `sum` expression above made sense! What we're summing is whatever we put inside the parentheses of `sum(.)`. Here, what's inside are terms of the form `Grades.loc[student_id,grade_component]`, and the inner `for grade_component in columns_with_grades` says that we have one such term for every value of `grade_component`, i.e., for every grade component. So the sum simply gives us the total score that we want, which we can then divide by the number of grade components (3) to get the average.
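
To make the equivalence concrete, here is a minimal sketch in plain Python that computes the same total both ways. The dictionary `scores` and its values are made up for illustration:

```python
# Hypothetical scores for one student, keyed by grade component
scores = {"MidtermGrade": 80, "HomeworkGrade": 90, "ParticipationGrade": 100}
components = ["MidtermGrade", "HomeworkGrade", "ParticipationGrade"]

# Explicit loop: add one term per grade component
total_loop = 0
for component in components:
    total_loop += scores[component]

# Same computation with sum(...) over one term per grade component
total_sum = sum(scores[component] for component in components)

average = total_sum / len(components)
print(total_loop, total_sum, average)
```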

Of course, you can solve the entire question by looping through the DataFrame using **numeric indices** instead. This is a bit clunkier and not necessary in this case, but here's how you could do that.

Note that if you decide to loop through the rows using a numeric index (i.e., 0 ... number of rows-1), then:
 - you will **have to** loop with a numeric index through the columns as well. In this case, the relevant indices for the columns we care about are 1, 2, 3 (for the "MidtermGrade","HomeworkGrade", "ParticipationGrade", respectively).
 - you will have to use `iloc` to access the data.

Let's first create a list with the relevant column indices.

In [None]:
relevant_col_idx = [1,2,3]

In [None]:
# Let's loop through the rows using the numeric index, 0.. number_of_rows - 1
# Recall that `len(df)` returns the number of rows in a dataframe
for row_idx in range(len(Grades)):

    # Let's use the SUM function to get the total score. (We could also do a for loop...) 
    total_score = sum(Grades.iloc[row_idx, col_idx] for col_idx in relevant_col_idx)

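    # Calculate and print the average, with 2 digits of precision
    # (Grades.index[row_idx] looks up the row label, i.e., the student ID, at this position)
    print(f"Student with ID {Grades.index[row_idx]} has average: {total_score/len(relevant_col_idx):.2f}")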

<font color=blue>Lastly, to show you the most elegant way to compute the average scores for all students with just one line of code, consider the following syntax that uses **list comprehensions** and the **sum** function.</font>

In [None]:
# Create a list with all the scores, in one shot
average_scores = [ sum(Grades.loc[student_id,gc] for gc in columns_with_grades)/len(columns_with_grades) for student_id in row_labels ]

<font color=blue>This creates a list with the average score for each student. Hopefully you agree that this is simpler!</font>

<font color=blue>Let's unpack the statement a bit. Note first that we're creating a list because we have the `[]` operator. How many elements in the list? The very last `for` statement, `for student_id in row_labels`, says that this list will have one element for every `student_id` from `row_labels`, i.e., for each row of our data. Lastly, you can convince yourself that the `sum(.)` expression is exactly like what we used in the `for` loop above. </font>

<font color=blue>All we need to do is to print the IDs and scores. We could of course do a for loop (using a numeric index, because we have the two separate lists):</font>

In [None]:
for row_idx in range(len(Grades)):
    # row_labels[row_idx] is the student ID that matches average_scores[row_idx]
    print(f"Student with ID {row_labels[row_idx]} has average: {average_scores[row_idx]:.2f}")
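
As an alternative to the numeric index, Python's built-in `zip` can walk through two lists of equal length in parallel. The sketch below uses made-up stand-ins (`student_ids`, `averages`) for `row_labels` and `average_scores`:

```python
# Hypothetical stand-ins for row_labels and average_scores
student_ids = [1001, 1002, 1003]
averages = [90.0, 78.5, 84.25]

# zip pairs the i-th ID with the i-th average, so no numeric index is needed
lines = [
    f"Student with ID {sid} has average: {avg:.2f}"
    for sid, avg in zip(student_ids, averages)
]
for line in lines:
    print(line)
```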

## EXERCISE.
<font color=green>**1. Calculate and print the GPA assuming a weight of 35% for the midterm, 45% for the final, and 30% for participation, for all the students in the class.**<br>
<font color=green>**2. Under the weighting in #1, who is the student with the largest GPA in the class?**<br>