# Announcements

- In general, place your "final/reviewable" code in the `apputil.py` file
- Autograder Issue
  - Check your email for a *Welcome to Gradescope for INFO-H 501* email, and follow [these](https://guides.gradescope.com/hc/en-us/articles/21853290544909-Joining-a-Course#h_01HGJZXHHGAN1A0Z5558WB246S) instructions.
  - Consider these [ways to log in](https://guides.gradescope.com/hc/en-us/articles/37910796801677-Logging-in-to-Gradescope-with-your-school-s-LMS-or-SSO-credentials-beta#h_01JZNP28SH7A4PHPR478JMAEFD). If it helps, the [course code](https://guides.gradescope.com/hc/en-us/articles/21853290544909-Joining-a-Course#h_01HGJZZGH2MPTC934NGG9Y01Q0) is **ZY45VZ**.
- Student Groups

# More Python Basics

*This section uses content from the [Python section](https://melaniewalsh.github.io/Intro-Cultural-Analytics/02-Python/00-Python.html#) of Melanie Walsh's Introduction to Cultural Analytics & Python.*

The purpose of this notebook is to give you a stronger, more complete foundation for "vanilla" (built-in) Python by discussing loops, as well as to introduce the NumPy library. **The goal of this lab is to give you the tools you need to start building robust functions which can make useful calculations on large datasets.**

To that end, we'll cover the following:

1. More Python Basics
    - Loops
    - Generator Iterators
2. NumPy
    - NumPy Arrays
    - Array Operations

## Loops

Python provides two different types of loop: the `for` loop, and the `while` loop. The `for` loop iterates a block of code through the elements of some iterable (i.e., list-like) object, whereas the `while` loop executes a block of code as long as some condition is met. In other words, `for` loops are best suited for *finite* (definite) iteration, and `while` loops are best suited for *potentially infinite* (indefinite) iteration. For our purposes, **every loop should be finite**, so practically, any `while` loop can (and often should) be converted into a `for` loop.

We will introduce both here, but since `while` loops can sometimes result in infinite loops, often causing the computer to freeze up, our focus in this class will be on `for` loops.
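As a rough sketch of that conversion, the two loops below compute the same running total. The names (`limit`, `total_while`, `total_for`) are made up for this illustration, and `range` is covered in more detail later in this notebook.

```python
limit = 5  # hypothetical upper bound, chosen only for this example

# `while` version: we have to manage the counter ourselves
total_while = 0
counter = 1
while counter <= limit:
    total_while += counter
    counter += 1  # forgetting this line would make the loop infinite

# `for` version: `range` does the counting, so the loop always ends
total_for = 0
for counter in range(1, limit + 1):
    total_for += counter

print(total_while, total_for)
```

Both totals come out the same; the `for` version simply has no counter to forget.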

### `for` loops

Again, a `for` loop will execute a block of code for each item in a list-like object. On the first line, we type the English word `for`, a new variable name for each item in the list, the English word `in`, the name of the list-like object, and then a colon `:`. All instructions for a `for` loop must be indented.

In [8]:
numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# suppose we want to print the odd numbers in `numbers` that are 5 or less
for number in numbers:
    # we can "skip" values using continue
    if number == 2:
        continue
    
    # remember what the `%` symbol does
    elif number % 2 == 1:
        print(f"{number} is an odd number.")
    
    # `break` completely stops the looping process
    elif number > 5:
        break

1 is an odd number.
3 is an odd number.
5 is an odd number.


* In the same way that variables (in general) should be descriptively named, the iterating variable of a `for` loop (between `for` and `in`) should describe what it holds.
* Because it is often difficult to see what is happening inside a loop, print statements help to "debug" the contents of variables as you go. **Use looped print statements sparingly!**

### `while` loops

**<span style='color:red;'>CAUTION:</span>** Do your best to **avoid using `while` loops if you can**; convert them into a `for` loop (above) whenever possible.

In [3]:
# start with `while`, the condition to be met, and then the block of code
i = 1
j = 2

while j > i:
    if j in [10, 1e2, 1e3, 1e4, 1e5, 1e6, 1e7]:
        print("j =", j)
    elif j == 1e8:
        print("j is really big now, and it's just getting bigger ...")
    j += 1

j = 10
j = 100
j = 1000
j = 10000
j = 100000
j = 1000000
j = 10000000
j is really big now, and it's just getting bigger ...


KeyboardInterrupt: 

<font color='lightblue'>You can interrupt the above code with the "**I, I**" shortcut, or by clicking the "stop" button.</font>

For the most part, `while` loops should only be considered for the following cases:
* infinite looping, where there might not ever be an end (mostly in software engineering)
* when there is user interaction
* situations where the object being iterated through changes with each loop.

Sometimes, in the last case, a `for` loop may still do the job ...
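One possible illustration of that last case, assuming a hypothetical list of `tasks` that shrinks as it is processed: the `while` loop runs until the list is empty, while the `for` loop iterates over a fixed copy so that changing the original is safe.

```python
# hypothetical work queue, used only for illustration
tasks = ["load", "clean", "model"]
done_while = []

# `while` version: keep going until the list has been emptied
while tasks:
    done_while.append(tasks.pop(0))

# `for` version: loop over a copy, so removing items from `tasks` is safe
tasks = ["load", "clean", "model"]
done_for = []
for task in list(tasks):
    tasks.remove(task)
    done_for.append(task)

print(done_while == done_for)
```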

## Generator Iterators

In Python, an **iterable** is an object *containing* one or more data elements that a `for` loop can iterate over. For example, lists, tuples, dictionaries, and sets are all iterable objects. They provide multiple elements of data that you can iterate over.

However, consider two alternate scenarios:

1. We do not know all the elements to be iterated over, but we know how they are generated.
2. We know the elements to be iterated over, but we do not want them taking up memory until it is their "turn".

In either of these cases, we can use the Python [generator](https://realpython.com/introduction-to-python-generators/) iterator to 'stream' elements to iterate over.

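Three common built-ins that produce values lazily in this way are `range()`, `enumerate()`, and `zip()`. (Strictly speaking, `range` returns a lazy sequence and `enumerate`/`zip` return iterators rather than generators, but all three are used the same way in a `for` loop.)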

### `range`

`range()` generates numbers in a sequence, from `(start, stop, step)`.

In [None]:
[0, 1, 2]

[0, 1, 2]

In [None]:
# on its own, `range(3)` is just a lazy `range` object; no list of numbers is created
range(3)

range(0, 3)

We can create our own generator using the `yield` keyword.

In [1]:
def my_range(n):
    i = 0
    while i < n:
        i += 1
        yield((i, f"this is {i}"))

In [None]:
def my_range(n):
    i = 0
    while i < n:
        i += 1
        yield (i, f"this is {i}")

# calling the function only creates a generator object; no values are produced yet
print(my_range(3))
print(my_range(5))
print(my_range(10))

In [None]:
instance_of_my_range = my_range(5)

In [None]:
next(instance_of_my_range)

(1, 'this is 1')

In [None]:
for i in instance_of_my_range:
    print(i)

(2, 'this is 2')
(3, 'this is 3')
(4, 'this is 4')
(5, 'this is 5')


In [None]:
# what happens when the generator "runs out" ...
next(instance_of_my_range)
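For reference, here is a small sketch of what an exhausted generator does. `count_up` is a hypothetical stand-in for `my_range`, redefined so that the block runs on its own.

```python
def count_up(n):
    # hypothetical stand-in for `my_range`, so this block is self-contained
    i = 0
    while i < n:
        i += 1
        yield i

gen = count_up(2)
values = [next(gen), next(gen)]  # consumes both available values

try:
    next(gen)
    exhausted = False
except StopIteration:
    # raised once there is nothing left to yield
    exhausted = True

# passing a default as the second argument to `next` avoids the exception
fallback = next(gen, "nothing left")
print(values, exhausted, fallback)
```

A `for` loop handles `StopIteration` for you, which is why looping over a generator simply ends.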

Or, we can use a [generator expression](https://peps.python.org/pep-0289/), which is similar to a list comprehension:

In [None]:
# generator comprehension
gen_list = (c for c in ['a', 'b', 'c'])
gen_list

<generator object <genexpr> at 0x7ef0e41069b0>

In [None]:
next(gen_list)

'a'
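Returning to the two scenarios listed at the start of this section, the sketch below shows one way each might look. The `squares` function is hypothetical, and `itertools.islice` is used only to take a finite number of values from an endless generator.

```python
from itertools import islice

def squares():
    """Yield 0, 1, 4, 9, ... forever (scenario 1: known rule, unknown length)."""
    n = 0
    while True:
        yield n * n
        n += 1

# `islice` takes only the first five values, so the endless generator is safe here
first_five = list(islice(squares(), 5))
print(first_five)

# scenario 2: a generator expression hands values to `sum` one at a time,
# so a full million-element list is never stored in memory
total = sum(x for x in range(1_000_000))
print(total)
```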

The basic `range` generator behaves as expected ...

In [None]:
# by default, the first value will be 0, and step is 1
for i in range(3):
    print(i)

0
1
2


In [None]:
# we can iterate from 2 to 20, by step size 3
for i in range(2, 20, 3):
    print(i)

2
5
8
11
14
17


In [None]:
# we can even go backwards
for i in range(5, 1, -1):
    print(i)

5
4
3
2


### `enumerate`

If we are working with a defined iterable (like a list), but we also want to log the index of each element as we iterate over it, we can use `enumerate`.

In [None]:
names = ['astrid', 'jamaal', 'timo', 'hyewon']

# notice the order of `i` and `element` here
for i, element in enumerate(names):
    print(f"index: {i} ... element: {element}")

index: 0 ... element: astrid
index: 1 ... element: jamaal
index: 2 ... element: timo
index: 3 ... element: hyewon


### `zip`

Suppose we have two different iterables that we would like to combine into a list of tuples (or a dictionary). `zip` pairs them up element by element, so we can loop over both at once.

In [None]:
keys =   ["a", "b", "c"]
values = [1,     2,   3]

kv_dict = {}
kv_list = []

for k, v in zip(keys, values):
    # print(k, v)
    
    kv_dict[k] = v
    kv_list.append((k, v))

In [None]:
kv_dict

{'a': 1, 'b': 2, 'c': 3}

In [None]:
kv_list

[('a', 1), ('b', 2), ('c', 3)]

Alternatively, we could use list/dictionary comprehensions ...

In [None]:
keys =   ["a", "b", "c", "a"]
values = [1,     2,   3,  4 ]

In [None]:
[(k, v) for k, v in zip(keys, values)]

[('a', 1), ('b', 2), ('c', 3), ('a', 4)]

In [None]:
{k:v for k, v in zip(keys, values)}

{'a': 4, 'b': 2, 'c': 3}
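Two further details of `zip` may be worth a quick sketch: it stops at the shortest input, and `dict(zip(...))` is a common shorthand for the dictionary comprehension above. The lists here are made up for illustration.

```python
letters = ["a", "b", "c"]
numbers = [1, 2]  # deliberately one element shorter

# `zip` stops as soon as the shortest iterable runs out
pairs = list(zip(letters, numbers))
print(pairs)

# `dict` accepts an iterable of (key, value) pairs directly
lookup = dict(zip(letters, numbers))
print(lookup)
```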

# Intro to NumPy

*This is a summarized version of [NumPy's absolute beginners tutorial](https://numpy.org/doc/stable/user/absolute_beginners.html). Each sub-section title here links to its corresponding section in the tutorial.*

[NumPy](https://numpy.org/doc/stable/user/absolute_beginners.html) (Numerical Python) is an enormous open-source Python library that is *easily* the universal standard for working with numerical data in Python. Its users include everyone from beginning coders to experienced researchers doing state-of-the-art scientific and industrial research and development. NumPy improves on numerical programming in Python by providing optimized, low-level array routines and a library of high-level mathematical functions that operate on arrays and matrices.

In [None]:
import numpy as np

## [NumPy Arrays](https://numpy.org/doc/stable/user/absolute_beginners.html#what-is-an-array)

The NumPy n-dimensional array (`ndarray`) is the central data structure of the NumPy library. We can think of it roughly as a vector (with 1 dimension) or a matrix (with 2 dimensions), but it might be better to think of it as a "[tensor](https://www.kdnuggets.com/2018/05/wtf-tensor.html)" (with $n$ dimensions, possibly 3 or more). NumPy arrays are homogeneous, i.e., the data type is consistent across all elements (this allows for optimized calculations). The `shape` of the array is a tuple of integers giving the size of the array along each dimension. *For math people, think about an $n$-dimensional NumPy array as points on a "grid" in $n$-space.*

Note: NumPy is *vast* and has capabilities reaching far beyond what is included in this notebook. Again, **this is just an introduction to NumPy.**
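To make the "homogeneous" point concrete, here is a small sketch of how NumPy picks a single data type when the inputs are mixed. The array names are invented for this example.

```python
import numpy as np

ints = np.array([1, 2, 3])
mixed = np.array([1, 2.5, 3])    # one float forces every element to float
texty = np.array([1, "two", 3])  # one string forces every element to string

# `.dtype.kind` is a one-letter code: "i" integer, "f" float, "U" unicode string
print(ints.dtype.kind, mixed.dtype, texty.dtype.kind)
```

Because every element shares one type, NumPy can store the values in a compact block of memory and operate on them quickly.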

In [None]:
a = np.array([[1, 2, 3], 
              [4, 5, 6]])

In [None]:
# `.ndim` gives us the number of dimensions of our array
a.ndim

2

In [None]:
# `shape` gives the length of each dimension
a.shape

(2, 3)

In [None]:
# `size` is the number of elements in the array
a.size

6

In [None]:
# number of bytes consumed by the array
a.nbytes

48

Each dimension is called an **axis**, working your way from "outside" to "inside". E.g., in `a`, the rows are `axis 0`, and the columns are `axis 1`.
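As a quick sketch of what the axis numbers mean in practice, the block below sums the same array along each axis. It uses `.sum`, one of the aggregation methods introduced later in this notebook.

```python
import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6]])

# collapsing axis 0 (the rows) leaves one value per column
col_totals = a.sum(axis=0)
# collapsing axis 1 (the columns) leaves one value per row
row_totals = a.sum(axis=1)

print(col_totals, row_totals)
```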

### [Building Arrays](https://numpy.org/doc/stable/user/absolute_beginners.html#how-to-create-a-basic-array)

A few quick arrays include the `zeros` and the `ones` arrays:

In [None]:
np.zeros((2, 2), "int")  # saved as an integer instead of a float

array([[0, 0],
       [0, 0]])

In [None]:
np.ones((2, 2))

array([[1., 1.],
       [1., 1.]])

In [None]:
np.ones((2, 2), "bool")  # 1s and 0s can also be boolean

array([[ True,  True],
       [ True,  True]])

Additionally, we have the `arange` and `linspace` functions.

In [None]:
# arange is similar to the `range` we've seen above
np.arange(1, 10, 2)

array([1, 3, 5, 7, 9])

In [None]:
# linspace creates evenly spaced elements in an array
np.linspace(0, 5, num=11)

array([0. , 0.5, 1. , 1.5, 2. , 2.5, 3. , 3.5, 4. , 4.5, 5. ])

In [None]:
# convert a 1-d array to a 2-d array by adding an axis
np.arange(1, 10, 2)[:, np.newaxis]

array([[1],
       [3],
       [5],
       [7],
       [9]])

In [None]:
# using `reshape` defines the new "size" for each axis
a = np.arange(3, 10, 2).reshape(-1, 1)
b = np.arange(4, 20, 4).reshape(-1, 1)

# concatenate is one way to merge arrays together
c = np.concatenate((a, b), axis=1)
c

array([[ 3,  4],
       [ 5,  8],
       [ 7, 12],
       [ 9, 16]])

In [None]:
# another example of reshaping
c.reshape(2, 4)

array([[ 3,  4,  5,  8],
       [ 7, 12,  9, 16]])

In [None]:
# "-1" here means "infer axis size" or "maximum axis size"
c.reshape(-1, 1)

array([[ 3],
       [ 4],
       [ 5],
       [ 8],
       [ 7],
       [12],
       [ 9],
       [16]])

In [None]:
# we can "vertically" stack using vstack
a = np.array([3,  4,  5,  8])
b = np.array([7, 12,  9, 16])

np.vstack((a, b))

array([[ 3,  4,  5,  8],
       [ 7, 12,  9, 16]])

In [None]:
# or we could "horizontally" stack using hstack
a = np.array([3,  4,  5,  8])[:, np.newaxis]
b = np.array([7, 12,  9, 16])[:, np.newaxis]

np.hstack((a, b))

array([[ 3,  7],
       [ 4, 12],
       [ 5,  9],
       [ 8, 16]])

You can also [flatten](https://numpy.org/doc/stable/user/absolute_beginners.html#reshaping-and-flattening-multidimensional-arrays) arrays, which is a slightly different take on "reshaping". Here, we reduce the values of a multi-dimensional array into a flat, 1-D array.

In [None]:
x = np.array([[1,  2,  3,  4],
              [5,  6,  7,  8],
              [9, 10, 11, 12]])

x.flatten()

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12])

### [Indexing and Slicing](https://numpy.org/doc/stable/user/absolute_beginners.html#indexing-and-slicing)

There are several ways to extract "slices" of an array. The most basic mirrors how we'd slice a list (i.e., using `start:stop:step` syntax).

In [None]:
a = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
a = np.arange(1, 10+1)  # equivalent to the line above

a[2:9:2]

array([3, 5, 7, 9])

We can also use boolean (True/False) filtering based on logical conditions. In NumPy, the **and** operator is `&`, and the **or** operator is `|`.

In [None]:
a = np.array([[1 , 2,  3,  4], 
              [5,  6,  7,  8], 
              [9, 10, 11, 12]])

In [None]:
(a == 1)

array([[ True, False, False, False],
       [False, False, False, False],
       [False, False, False, False]])

In [None]:
(a > 5)

array([[False, False, False, False],
       [False,  True,  True,  True],
       [ True,  True,  True,  True]])

In [None]:
mask = (a == 1) | (a > 5)
mask

array([[ True, False, False, False],
       [False,  True,  True,  True],
       [ True,  True,  True,  True]])

In [None]:
a[mask]

array([ 1,  6,  7,  8,  9, 10, 11, 12])
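The example above uses `|`; a similar sketch with `&` (and `~`, the element-wise "not") might look like this. The mask names `between` and `outside` are made up for illustration.

```python
import numpy as np

a = np.array([[1,  2,  3,  4],
              [5,  6,  7,  8],
              [9, 10, 11, 12]])

# parentheses are required: `&` binds more tightly than `>` and `<`
between = (a > 3) & (a < 9)
# `~` flips every True/False value (logical "not")
outside = ~between

print(a[between])
print(a[outside])
```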

Here, `np.where` returns values based on a condition, and we can use the `np.nan` (Not A Number, or "Null") value to represent missing values.

In [None]:
np.where(mask, a, np.nan)

array([[ 1., nan, nan, nan],
       [nan,  6.,  7.,  8.],
       [ 9., 10., 11., 12.]])

We can also use the `[:]` notation to extract different kinds of slices.

In [None]:
# first "vector" (i.e., the first row)
a[0,:]

# a[0,]  # shorthand!

array([1, 2, 3, 4])

In [None]:
# the second value for each row; and think of `:` as all appropriate values
a[:, 1][:, np.newaxis]

array([[ 2],
       [ 6],
       [10]])

In [None]:
# using [:] notation to get vertical column
a[:, 1:2]

array([[ 2],
       [ 6],
       [10]])

<span style='color:red;'>**Important Note on Slicing:**</span> Slices of arrays are *views* of (i.e., references to) the original array's data, so changing a slice also changes the original. **If you need to extract a permanently "independent" slice of an array, you need to `copy` it.**

In [None]:
a = np.array([[1 , 2,  3,  4], 
              [5,  6,  7,  8]])

# a copy to check later
a_copy = a.copy()

# here, we "extract" (refer to) a slice of `a`
a_slice = a[:, :2]
a_slice

array([[1, 2],
       [5, 6]])

In [None]:
# here, we amend (mutate) that slice
a_slice[0, :] = np.array([9, 9])
a_slice

array([[9, 9],
       [5, 6]])

In [None]:
# we have then updated the original array, too
a

array([[9, 9, 3, 4],
       [5, 6, 7, 8]])

In [None]:
a_copy

array([[1, 2, 3, 4],
       [5, 6, 7, 8]])
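If you are ever unsure whether two arrays are linked, one way to check is `np.shares_memory`. The sketch below uses a separate, hypothetical array called `grid` so that it does not disturb the arrays above.

```python
import numpy as np

grid = np.array([[1, 2, 3, 4],
                 [5, 6, 7, 8]])

view = grid[:, :2]                # basic slicing returns a view
independent = grid[:, :2].copy()  # `.copy()` allocates new memory

print(np.shares_memory(grid, view), np.shares_memory(grid, independent))

# `.base` points back to the array that owns the data (None for an owner)
print(view.base is grid, independent.base is None)
```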

### [Random Generation](https://numpy.org/doc/stable/user/absolute_beginners.html#generating-random-numbers)

NumPy also has the capability to generate random numbers from various probability distributions.

In [None]:
# the uniform distribution is just one of *many* options
data = np.random.uniform(size=(2, 3))
data

array([[0.09493969, 0.08380801, 0.99244332],
       [0.36204233, 0.2312437 , 0.3968152 ]])

In [None]:
# including a random seed ensures the "same random" generations with each run
gen = np.random.default_rng(seed=33)
gen.normal(loc=5, scale=1, size=(3, 2))

array([[5.39836997, 4.43717666],
       [5.58883494, 5.0421181 ],
       [3.42909948, 6.00165475]])

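**The seed only sets the generator's starting point.** Calling `gen` again (here, in a separate cell) continues the same random sequence rather than restarting it, so the values differ from those above.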

In [None]:
gen.normal(loc=5, scale=1, size=(3, 2))

array([[4.90212381, 5.61980221],
       [6.83683215, 5.26842997],
       [3.92553132, 4.31902103]])
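To get the same draws again, the generator has to be re-created from the same seed. A minimal sketch, using two hypothetical generators `gen_one` and `gen_two`:

```python
import numpy as np

# two generators built from the same seed start in the same state
gen_one = np.random.default_rng(seed=33)
gen_two = np.random.default_rng(seed=33)

first_one = gen_one.normal(loc=5, scale=1, size=3)
first_two = gen_two.normal(loc=5, scale=1, size=3)
second_one = gen_one.normal(loc=5, scale=1, size=3)  # continues the sequence

print(np.array_equal(first_one, first_two))   # same seed, same draws
print(np.array_equal(first_one, second_one))  # later draws differ
```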

## Array Operations

NumPy provides all the standard mathematical operations you'd expect, but the difference now is that these operations are extremely fast, and they operate in a "vectorized" fashion. This can happen in three ways: element-wise, using what is called "broadcasting", or using standard mathematical matrix operations. We can also calculate aggregate values such as mean or minimum, etc.
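As a small sketch of what "vectorized" means, the block below computes the same result with a Python loop and with a single array expression. The names `looped` and `vectorized` are made up for this comparison.

```python
import numpy as np

values = np.arange(5)  # the integers 0 through 4

# plain-Python approach: visit each element in a loop
looped = [v * 2 + 1 for v in values]

# vectorized approach: one expression applied to the whole array at once
vectorized = values * 2 + 1

print(vectorized)
print(np.array_equal(looped, vectorized))
```

The results match; on large arrays the vectorized form is typically much faster because the looping happens inside NumPy's compiled code.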

### [Element-wise Operations](https://numpy.org/doc/stable/user/absolute_beginners.html#basic-array-operations)

In [None]:
# we can create an array of only ones
ones = np.ones((2, 4))
ones

array([[1., 1., 1., 1.],
       [1., 1., 1., 1.]])

In [None]:
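# `a` here is a (2, 4) array of even numbers, matching the shape of `ones`
a = np.array([[ 2,  4,  6,  8],
              [10, 12, 14, 16]])
ones + a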

array([[ 3.,  5.,  7.,  9.],
       [11., 13., 15., 17.]])

In [None]:
ones / a

array([[0.5       , 0.25      , 0.16666667, 0.125     ],
       [0.1       , 0.08333333, 0.07142857, 0.0625    ]])

### [Broadcasting](https://numpy.org/doc/stable/user/absolute_beginners.html#broadcasting)

When [broadcasting](https://numpy.org/doc/stable/user/basics.broadcasting.html) an operation between two arrays (with potentially different shapes), NumPy first compares their shapes element-wise, starting with the trailing (i.e. rightmost) dimension, and working its way left. Two dimensions are compatible when (a) they are equal, or (b) one of them is 1.

```text
Compatible:
(4, 5, 2)
   (5, 1)             First, 1 works. Then, 5 = 5.

Not compatible:
(100, 4)
  (5, 2)             First, 4 ≠ 2 and neither is 1.
```
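The two shape comparisons above can also be checked in code. This sketch assumes a NumPy version that provides `np.broadcast_shapes` (1.20 or later), which applies the same rule without building any arrays.

```python
import numpy as np

# compatible: compare 2 vs 1, then 5 vs 5; the leading 4 is kept as-is
result_shape = np.broadcast_shapes((4, 5, 2), (5, 1))
print(result_shape)

# not compatible: 4 vs 2, and neither is 1
try:
    np.broadcast_shapes((100, 4), (5, 2))
    compatible = True
except ValueError:
    compatible = False
print(compatible)
```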

In [None]:
a = np.array([[1, 2],
              [4, 5],
              [7, 8]])
               
b = np.array([[3], 
              [2], 
              [1]])

In [None]:
# Notice how multiplying by `b` is broadcast "across" `a`
print(f"a shape: {a.shape}")
print(f"b shape: {b.shape}")
a * b

a shape: (3, 2)
b shape: (3, 1)


array([[ 3,  6],
       [ 8, 10],
       [ 7,  8]])

In [None]:
# the same happens with scalars
4 * b

array([[12],
       [ 8],
       [ 4]])

In [None]:
b

array([[3],
       [2],
       [1]])

In [None]:
# we can also broadcast across 2+ axes
c = np.array([1, 2, 3])

b * c

array([[3, 6, 9],
       [2, 4, 6],
       [1, 2, 3]])

### Matrix Arithmetic

Any array is essentially a matrix of values, and NumPy provides the standard matrix and vector operations. Here are just a few examples.

In [None]:
a = np.array([[1, 2],
              [3, 4],
              [5, 6]])

b = np.array([[9, 8, 7],
              [6, 5, 4]])

c = np.array([1, 2, 3])

In [None]:
# the @ symbol represents matrix multiplication
a @ b

array([[21, 18, 15],
       [51, 44, 37],
       [81, 70, 59]])

In [None]:
b @ c.reshape(-1, 1)

array([[46],
       [28]])

In [None]:
# np.dot works too
np.dot(a, b)

array([[21, 18, 15],
       [51, 44, 37],
       [81, 70, 59]])

In [None]:
c.dot(c)

np.int64(14)

In [None]:
# NumPy 2 prints scalars with their type, e.g., `np.int64(14)`; if this gets annoying, you can use the legacy NumPy number printing
np.set_printoptions(legacy='1.13')

In [None]:
c.dot(c)

14

In [None]:
# .T gives the transpose (it has no effect on 1-D arrays)
a.T

array([[1, 3, 5],
       [2, 4, 6]])

### [Aggregation](https://numpy.org/doc/stable/user/absolute_beginners.html#more-useful-array-operations)

Often, you'll want to calculate aggregate functions (e.g., mean, minimum, etc.) for a whole array, or for just rows/columns (or some other axis).

In [None]:
a = np.array([[ 3,  6,  9, 12, 15],
              [ 4,  8, 12, 16, 20]])

In [None]:
# we can calculate a minimum for all the data
a.min()

3

In [None]:
# or, just across an axis (here, across columns for each row)
a.min(axis=1)

array([3, 4])

In [None]:
a.mean(axis=0)

array([  3.5,   7. ,  10.5,  14. ,  17.5])

In [None]:
a.sum(axis=1)

array([45, 60])

Given a vector of weights, NumPy can also calculate a weighted average.

In [None]:
w = np.array([1, 4, 1, 2, 5])

In [None]:
np.average(a, axis=1, weights=w).reshape(-1, 1)

array([[ 10.38461538],
       [ 13.84615385]])
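For intuition, the weighted average is the sum of each value times its weight, divided by the sum of the weights. The sketch below computes that by hand (as `manual`, a name made up here) and compares it with `np.average`.

```python
import numpy as np

a = np.array([[ 3,  6,  9, 12, 15],
              [ 4,  8, 12, 16, 20]])
w = np.array([1, 4, 1, 2, 5])

# weighted average = sum(value * weight) / sum(weights), computed per row;
# `a * w` broadcasts the weights across both rows
manual = (a * w).sum(axis=1) / w.sum()

print(np.allclose(manual, np.average(a, axis=1, weights=w)))
```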

The [unique](https://numpy.org/doc/stable/user/absolute_beginners.html#how-to-get-unique-items-and-counts) function is helpful to find out the distinct values in a vector or tensor.

In [None]:
a = np.array([11, 11, 12, 13, 14, 16, 17, 11, 13, 11, 14, 18, 10, 14])

np.unique(a)

array([10, 11, 12, 13, 14, 16, 17, 18])
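`np.unique` can also report how often each value occurs through its `return_counts` argument. A short sketch, where the `tally` dictionary is just one convenient way to display the result:

```python
import numpy as np

a = np.array([11, 11, 12, 13, 14, 16, 17, 11, 13, 11, 14, 18, 10, 14])

values, counts = np.unique(a, return_counts=True)

# pair each distinct value with the number of times it appears
tally = {int(v): int(c) for v, c in zip(values, counts)}
print(tally)
```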

# Explore

Test your understanding of this week's content with the following explorations.

*Note: unless otherwise noted, **explorations are completely optional and will not be reviewed.***

## instructions

- Let `s` be a 1-D boolean (True/False) NumPy array where each element represents a student, and each value is an indicator for whether a student was caught texting in class.
- Let `d` be a 1-D numeric array where each element represents an instructor, and each value is the number of points that instructor would remove from a student's grade if they are caught texting in class.

## exploration 1

Write a function which takes in `s` and `d`, and returns an array with a row for each instructor, and a column for each student. The values should be the number of points that would be removed for each student-instructor combination.

**Example:**

```python
s = np.array([True, False, False, True, True])
d = np.array([20, 15])

>>> text_removals(s, d)
array([[20,  0,  0, 20, 20],
       [15,  0,  0, 15, 15]])
```


In [7]:
import numpy as np

# `s` must hold real booleans (not the strings 'True'/'False')
s = np.array([False, True, False, True, False, True, False, True, False, True])
d = np.array([10, 5, 3, 1])

def text_removals(s, d):
    # turn `d` into a column so it broadcasts across the students in `s`;
    # the result has a row for each instructor and a column for each student
    return d[:, np.newaxis] * s

# Call the function
text_removals(s, d)

array([[ 0, 10,  0, 10,  0, 10,  0, 10,  0, 10],
       [ 0,  5,  0,  5,  0,  5,  0,  5,  0,  5],
       [ 0,  3,  0,  3,  0,  3,  0,  3,  0,  3],
       [ 0,  1,  0,  1,  0,  1,  0,  1,  0,  1]])

## exploration 2

Adjust the function to return the total number of points each instructor would have removed by adding an *optional* Boolean argument called `totals`.

**Example:**

```python
s = np.array([True, False, False, True, True])
d = np.array([20, 15])

>>> text_removals(s, d, totals=True)
array([60, 45])
```


## exploration 3

The Fibonacci Series starts with 0 and 1. Each of the following numbers is the sum of the previous two numbers in the series. So, the first element is 0, the next element is 1, the next is 1, and the next is 2, and so on. Write a Python *loop* which prints the following, given a number `n`:

```text
Fibonacci 1 = 0
Fibonacci 2 = 1
Fibonacci 3 = 1
...
Fibonacci n = ?
```