## Agenda
- [Class feedback](https://umasslowell.co1.qualtrics.com/jfe/form/SV_bQ5Uz5heMoXCTsi)
- Algorithmic thinking
- Then more practice with loops and dictionaries, and hopefully data frames

## Learning objectives
- Can articulate algorithmic solutions to problems and write them down as a framework **before** adding code
- Able to recognize uses for and create more complex dictionaries that store data sets
- Describe the connection between data frames and data sets and dictionaries


## Algorithmic thinking
What it means: break down a solution into sequential, logical, detailed steps 

[pb & j dad video](https://www.youtube.com/watch?v=cDA3_5982h8)


Examples:
- Recipe
- Furniture assembly instructions
- Driving directions

Try it: write the steps you need to do to brush your teeth. 

What happens if you do the steps in a different order? What happens if you leave out a step? 

## Daily homework

First, write down your algorithm for doing the daily homework in a code cell.
Put each step on a separate line as a *comment*. Write in human language, not in code. 

Then, add your code under each line of the algorithm.
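
Here is a rough sketch of what "comments first, code second" can look like. The task (finding the longest word) and the list `words` are made up for illustration; they are not the daily homework.

```python
# Hypothetical task: find the longest word in a list of words.
words = ["pandas", "loop", "dictionary", "set"]

# Step 1: start with an empty "longest word so far"
longest = ""

# Step 2: look at each word in the list, one at a time
for word in words:
    # Step 3: if this word is longer than the longest so far, remember it
    if len(word) > len(longest):
        longest = word

# Step 4: report the longest word
print(longest)
```

Notice that the comments were written before any code, and each piece of code sits under the step it carries out.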


## 1. Getting fancy with data structures
First, we will build some comfort with loops, dictionary creation, and algorithmic thinking.

**Exercise 1.1**: Write code that uses the plant2place dictionary and stores how many times each place appears in it. 
- Again, first write the algorithm
- Then add the code

## 1.1 Dictionaries of lists

Dictionaries are very flexible and useful. They can have as *keys* strings or numbers, and as values, anything:
- Sets
- Lists
- Other dictionaries!

Lists can also have other lists as items (so we can have a list of lists or a list of dictionaries!). It depends on what you're trying to accomplish.
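
As a small sketch of these nested structures, here is a dictionary of lists and a list of dictionaries. The fruit data and the names `color2fruits` and `fruits` are invented for this example and are not one of the class data sets.

```python
# Hypothetical data, made up for illustration.
# A dictionary whose values are lists:
color2fruits = {
    "red": ["apple", "cherry"],
    "yellow": ["banana"],
}

# Look up the list by its key, then add an item to that list
color2fruits["yellow"].append("lemon")

# A list whose items are dictionaries:
fruits = [
    {"name": "apple", "color": "red"},
    {"name": "banana", "color": "yellow"},
]

print(color2fruits["yellow"])  # the whole list stored under "yellow"
print(fruits[1]["color"])      # the color of the second fruit in the list
```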

![img](https://c.tenor.com/x-tCSS8RPZEAAAAd/mind-blown-mind.gif "mind blown")

**Exercise 1.2**: Write code that uses the plant2place dictionary and creates another dictionary where the keys are places, and the values are **lists** containing all the plants in that place. Again, algorithm first, then code.

## 2. From dictionaries to data frames
Dictionaries can store a lot of data.

One way we can store a data set is by using the dictionary *keys as features*, and a *list of observations* of each feature. 

Here the features are characteristics of lakes, and the observations are different lakes: 

In [None]:
lakes = {'name':['Huron','Ontario','Michigan','Erie','Superior'],
        'surface':[23000, 7340,22400,9900,31700],
        'ave_depth':[195,284,279,70,480],
        'max_depth':[750,802,923,210,1333],
        'shared_canada':[True,True, False, True, True],
         'temperature':[48,51,49,54,43],
         'native_name':['Karegnondi','Oniatarí:io','Mich gami','Erielhonan','Gichi-gami']
        }

Downside: the observations of each feature are not connected to each other. 

**Discuss**: What kind of algorithm would you use with this dictionary to look up the average depth of Lake Michigan?
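
One possible approach, sketched below, is to find the position of the lake in the `name` list and then use that same position in the `ave_depth` list. The dictionary `lakes_small` is a trimmed copy of the data above, named differently so it does not overwrite `lakes`. The list method `.index()` is one way to find the position; a for loop would also work.

```python
# A trimmed copy of the lakes data (two features only)
lakes_small = {'name': ['Huron', 'Ontario', 'Michigan', 'Erie', 'Superior'],
               'ave_depth': [195, 284, 279, 70, 480]}

# Step 1: find the position of 'Michigan' in the list of names
position = lakes_small['name'].index('Michigan')

# Step 2: use that same position in the list of average depths
michigan_depth = lakes_small['ave_depth'][position]

print(michigan_depth)
```

This only works because every list is in the same lake order, which is exactly the fragile connection described above.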

### 2.1 Making a DataFrame

We can instead store this in a table format by using a python toolbox called **pandas**. 
![img](https://miro.medium.com/max/1400/1*_oSOImPmBFeKj8vqE4FCkQ.jpeg "pandas")

We call these toolboxes **packages**, and we get access to them with a special command called `import`, as below. The `as pd` part creates a nickname, so we save typing a few letters:

In [None]:
import pandas as pd

Once we do this, we have access to the pandas functions -- we can use them by specifying the package (`pandas` or `pd`), and using the period (`.`) to say we want to use a function or object in that package.

The main object we'll use for now is `DataFrame`, which allows us to create an object that is a table/spreadsheet. So a `DataFrame` is a table we can work with using Python code.

We can create a DataFrame in a number of ways.  

We can convert our dictionary to a DataFrame object. This only works if each list in the dictionary has the same length:

In [None]:
pd.DataFrame(lakes)

Just like a dictionary prints out in a certain way with the `{` and the `:`, a DataFrame prints out a certain way for you to look at it. 

**Think about it**: How would you fill in your table if this was instead your dictionary holding the data? 

In [None]:
lakes_uneven = {'name':['Huron','Ontario','Michigan','Erie','Superior'],
        'surface':[31700],
        'ave_depth':[195],
        'max_depth':[750,802],
        'shared_canada':[ False, True, True],
        'temperature':[48,51,54,43],
        'native_name':['Oniatarí:io','Mich gami','Erielhonan','Gichi-gami']}
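
To see what pandas itself does with lists of different lengths, here is a minimal sketch. The dictionary `uneven` is a made-up, smaller example, and the `try`/`except` is only there so the error message gets printed instead of stopping the notebook.

```python
import pandas as pd

# Hypothetical small dictionary whose lists have different lengths
uneven = {'name': ['Huron', 'Ontario', 'Michigan'],
          'max_depth': [750, 802]}

try:
    pd.DataFrame(uneven)
    worked = True
except ValueError as error:
    # pandas has no way to know which lake the missing value belongs to
    worked = False
    print("pandas raised an error:", error)
```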

You can also make a data frame from a **list of lists**. 
Each list again must be the same length.

In [None]:
list_of_lists = [['Huron','Ontario','Michigan','Erie','Superior'],
              [23000, 7340,22400,9900,31700],
             [195,284,279,70,480],
             [750,802,923,210,1333],
             [True,True, False, True, True],
                [48,51,49,54,43],
                ['Karegnondi','Oniatarí:io','Mich gami','Erielhonan','Gichi-gami']]

**Discuss**: 
- how long is this list? 
- What are the items in this list?

In [None]:
pd.DataFrame(list_of_lists)

**Discuss**: compare this to the result from using the dictionary. What is the difference?
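
A smaller sketch can make the comparison easier to see. The list `small_lists` below is a made-up two-feature version of the data; checking `.shape` and `.columns` is just one way to inspect the result.

```python
import pandas as pd

# Hypothetical two-feature version of the list of lists
small_lists = [['Huron', 'Ontario', 'Michigan'],
               [195, 284, 279]]

by_rows = pd.DataFrame(small_lists)

print(by_rows.shape)          # (number of rows, number of columns)
print(list(by_rows.columns))  # with no names given, columns are numbered
print(list(by_rows.index))    # rows are numbered too
```

Each inner list ends up as one row, and nothing in the list of lists tells pandas what the features are called.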

The `DataFrame` function does not just print out the pretty table. It actually puts your data into an `object` called a `DataFrame`. The pretty table is how Jupyter displays that object.

**Discuss**: how would you assign the result to a variable?

There are lots of other ways to make a DataFrame. We'll talk about those later.

We're going to assign the resulting DataFrame to a variable called `df`:

In [None]:
df = pd.DataFrame(lakes)  

### 2.2 Index and columns

The rows and columns have **names** as you can see by the bolding in the table. The names of the rows are the lake number 0 --> 4 and the names of the columns are the features of the lakes. 

In a pandas DataFrame object, the names of the rows are called the **index** and the names of the columns are called the **columns** (I don't know why!). 

We can set any column to be the index using `set_index`, giving as an argument the name of the column you want to be the index. We'll say we want the "name" column. This will give you back another DataFrame where the row names (index) are set to be the lake names: 

In [None]:
df.set_index("name")

But we could also set any other column to be the index, though that would be weird:

In [None]:
df.set_index("temperature")

Notice this didn't do anything to our original `df`, it just **returns** another data frame. 
How can we save this new data frame to another variable?

In [None]:
df
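
A minimal sketch of the idea, using a made-up three-lake table (`small_df` and `small_named` are names chosen just for this example):

```python
import pandas as pd

# Hypothetical three-lake table, smaller than the one above
small_df = pd.DataFrame({'name': ['Huron', 'Erie', 'Superior'],
                         'max_depth': [750, 210, 1333]})

# set_index returns a NEW DataFrame, so assign it to a variable to keep it
small_named = small_df.set_index('name')

print(list(small_df.index))     # the original still has the numbered index
print(list(small_named.index))  # the new one is indexed by lake name
```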

Now we can look at what the index and columns are by using the DataFrame object's `.index` and `.columns` **attributes**:

In [None]:
df.index

In [None]:
df.columns

### 2.3 `.loc[]` to access items by index and columns

In Excel we can access cells using names like "B3". 

We can access the rows and the columns of the DataFrame using the special `.loc[]` operator.

Inside, you can specify the ROW then the COLUMN.  You can remember this because **DataFrames <font color=red>R</font>o<font color=red>C</font>k!!**

In [None]:
df_named = df.set_index("name")  # save the re-indexed DataFrame under a new name
df_named.loc['Erie', 'surface']

**Exercise 2.3.1**: use `.loc` to get the maximum depth of Lake Superior.

**Self check**: use `.loc` to get the Native name of Lake Michigan

### 2.4 Accessing a whole row or column
You can replace the rows or the columns in your `.loc[]` with a colon (`:`), which means "all of them", and it will give you back *the whole row (or column)*.

This single row or column is called a `Series`, which is basically the one-dimensional version of a DataFrame. Any time you take a slice across one row (getting multiple columns for one lake) or a slice down one column (getting multiple rows for one feature), you get a Series:

In [None]:
df_named.loc[:, 'surface']

In [None]:
df_named.loc['Erie', :]

**Discuss**: What will happen if I try to run the cell below and why?

In [None]:
df_named.loc[:, 'Erie']

You can also leave off the comma and the colon when you want a whole row (`df_named.loc['Erie']`), but this can be a bit confusing.
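
As a quick check that a whole row or a whole column really comes back as a Series, here is a small sketch on a made-up three-lake table (`small_named` is a name chosen for this example only):

```python
import pandas as pd

# Hypothetical three-lake table indexed by lake name
small_named = pd.DataFrame({'name': ['Huron', 'Erie', 'Superior'],
                            'surface': [23000, 9900, 31700],
                            'max_depth': [750, 210, 1333]}).set_index('name')

one_column = small_named.loc[:, 'surface']  # every row, one column
one_row = small_named.loc['Erie', :]        # one row, every column

print(type(one_column))
print(type(one_row))
```

The column Series is labeled by lake names, while the row Series is labeled by the feature names.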

A particular specification for code is called *syntax*. The *syntax* for accessing items from a dictionary is `plant2place['Tea']`

**You can access data for one *feature* (a whole column) from a data frame using dictionary syntax**

**Discuss**: Compare the results of the commands below. How is the code different? And how are the results different?

In [None]:
lakes['surface']

In [None]:
df_named['surface']

### 2.5 Indexing using a list
We aren't limited to specifying the name of just one row or column. Say we are only interested in the depth info for Ontario and Michigan. We can create a list of all the rows and a list of all the columns we are interested in:

In [None]:
rows = ['Michigan', 'Ontario']
cols = ['ave_depth', 'max_depth']

Now I will use the variables to just get these rows and columns from the DataFrame:

In [None]:
df_named.loc[rows, cols]  # the row labels are lake names, so use the name-indexed DataFrame

**Exercise 2.5.1**: use list indexing to manually put the data frame in alphabetical order by lake name, and select only the native_name and shared_canada columns.

### 2.6 `.iloc[]` locates by row/column number
Sometimes you might want to get more specific and choose your own rows and columns by number, rather than just head and tail. With `.iloc[]` you specify row numbers and column numbers (zero-based).

This follows the same rules as list position indexing, but gives you 2 dimensions to pick from.

In [None]:
df_named.iloc[:2,3:5]

**Exercise 2.6.1**: Look at the `df_named` and verify which rows and columns you would expect this to have given you.
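
To compare list slicing with `.iloc[]` side by side, here is a small sketch on made-up data (`letters`, `small_df`, and `corner` are names invented for this example):

```python
import pandas as pd

letters = ['a', 'b', 'c', 'd']
print(letters[:2])  # list slicing: items at positions 0 and 1

# Hypothetical three-lake table with the default numbered index
small_df = pd.DataFrame({'name': ['Huron', 'Erie', 'Superior'],
                         'surface': [23000, 9900, 31700],
                         'max_depth': [750, 210, 1333]})

# Rows at positions 0 and 1; columns at positions 1 and 2
corner = small_df.iloc[:2, 1:3]
print(corner)
```

Just like with lists, the end of each slice is not included.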

## 3. Vectorized operations
Where pandas gets really powerful is the ability to manipulate a bunch of numbers at once. 



In [None]:
df['temperature'] / 100

In [None]:
df_named['ave_depth'] + 1000

What is an alternative way to accomplish the same task?

Two reasons why this is better:
- less code
- much much faster with large data
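
For comparison, here is a sketch of the loop version next to the vectorized version. It uses a stand-alone Series with the same temperature values as the lakes table; the names `temperatures`, `looped`, and `vectorized` are chosen for this example.

```python
import pandas as pd

temperatures = pd.Series([48, 51, 49, 54, 43])

# Loop version: visit each temperature and build up a new list
looped = []
for temp in temperatures:
    looped.append(temp / 100)

# Vectorized version: one line, applied to every item at once
vectorized = temperatures / 100

print(looped)
print(list(vectorized))
```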

**3.1 Exercise**: convert the Fahrenheit temperatures to Celsius using vector operations (celsius = 5/9 * (fahrenheit - 32))

### 3.1 Vectorized math between two series
You aren't limited to a Series and a number. With 2 Series of the same size, pandas will perform the operation on each pair of matching items, as if you were in a for loop. So we can get, for each lake, how much deeper the deepest point is than the average depth.

In [None]:
diff_depth =  df_named['max_depth'] - df_named['ave_depth']

In [None]:
diff_depth

**3.1.1 Exercise**: use vectorized math to create a Series containing a very approximate volume for each lake

### 3.2 Vectorized string operations
The main one is vectorized string concatenation.

Just like doing `"Lake" + "Karegnondi"` would give you a string concatenating the two, this does the same thing in a vectorized fashion-- notice that what you get is of course another Series.

In [None]:
"Lake " + df['native_name'] + " is a lake"

**Self check**: Write code to make a Series that says "The native name for Erie is Erielhonan", etc., for all lakes using vector operations.

### 3.4 Vectorized inequalities
Just like the other vectorized operations, we can do tests that give Boolean results for vectorized operations

In [None]:
df['temperature'] > 50

In [None]:
df

In [None]:
df['native_name'] == "Erielhonan"

Just like in 3.1, this can be between 2 Series rather than comparing each item to the same thing:

In [None]:
df['ave_depth'] < df['max_depth']

**Exercise 3.4.1**: Write one line of code  that finds out if each lake's volume is more than 1 million cubic feet 

### 3.5 Vectorized Boolean operations.

There are different operators for Series and DataFrame Booleans versus regular ones. These will operate on the whole Series at once:
- `and` --> `&`
- `or` --> `|`
- `not` --> `~`


To flip True and False (the `not` operation) you must use the `~`:

In [None]:
~df['shared_canada']

To combine two inequalities you must put each inequality in parentheses, like below. This is a common source of errors!

In [None]:
df['shared_canada'] & (df['temperature'] < 50)

In [None]:
df['shared_canada'] | (df['temperature'] < 50)
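
The parentheses matter because Python handles `&` before it handles `<` and `>`. Here is a sketch of what goes wrong without them, using a stand-alone Series with the same temperature values as the lakes table. The `try`/`except` is only there so the error is printed rather than stopping the notebook, and the exact error message may differ between pandas versions.

```python
import pandas as pd

temps = pd.Series([48, 51, 49, 54, 43])

# With parentheses: each comparison is done first, then combined with &
between = (temps > 45) & (temps < 50)
print(list(between))

# Without parentheses, Python evaluates `45 & temps` first, which is not
# what we meant, and the statement ends in an error
try:
    temps > 45 & temps < 50
    failed = False
except (ValueError, TypeError) as error:
    failed = True
    print("error:", error)
```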

**Self check**: Use Boolean operations to get for all lakes a Boolean Series indicating lakes that are shared with Canada and have average depth more than 200 feet.

## 4. Dimension/shape
You can get the number of rows and columns by getting the `.shape` attribute. 

Remember it's always Row then Column because **DataFrames <font color=red>R</font>o<font color=red>C</font>k!!**

In [None]:
df.shape

What do you expect the shape of a Series to be? How many rows and columns do you expect in the variable `aseries` below?

In [None]:
aseries = df['shared_canada']
aseries.shape
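
For a side-by-side look at the two kinds of shape, here is a small sketch on a made-up three-lake table (`small_df` is a name chosen for this example):

```python
import pandas as pd

# Hypothetical three-lake table
small_df = pd.DataFrame({'name': ['Huron', 'Erie', 'Superior'],
                         'max_depth': [750, 210, 1333]})

print(small_df.shape)               # (number of rows, number of columns)
print(small_df['max_depth'].shape)  # a Series has only one dimension
```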