# Week 02

## Citing open-source / found code

Sometimes the citation will be part of the code. Whenever you use an `import` statement, I'll know the code is coming from somewhere else, and it's easy to figure out where.

In [None]:
import matplotlib.pyplot as plt
import numpy as np

plt.plot(np.sin(np.arange(0, 4 * np.pi, .1)))
plt.plot(np.cos(np.arange(0, 4 * np.pi, .1)), c='r')
plt.show()

Other times the citation will have to be a little more explicit.

A link to the original code, repo, or Stack Overflow answer is enough.

In [None]:
import cv2
from scipy import fftpack
from imagehash import ImageHash

# Function for computing the perceptual hash of an image
# Based on code from the vframe project:
#   https://github.com/vframeio/vframe/blob/master/src/vframe/utils/im_utils.py#L37-L48
# which is based on code from the imagehash library:
#   https://github.com/JohannesBuchner/imagehash/blob/master/imagehash.py#L197

def phash(im, hash_size=8, highfreq_factor=4):
  wh = hash_size * highfreq_factor
  im = cv2.resize(im, (wh, wh), interpolation=cv2.INTER_NEAREST)
  if len(im.shape) > 2 and im.shape[2] > 1:
    im = cv2.cvtColor(im, cv2.COLOR_BGR2GRAY)
  mdct = fftpack.dct(fftpack.dct(im, axis=0), axis=1)
  dctlowfreq = mdct[:hash_size, :hash_size]
  med = np.median(dctlowfreq)
  diff = dctlowfreq > med
  return ImageHash(diff)

Ok, back to Week 02

## Setup

Let's import some helper functions and libraries

In [None]:
import random

## Ranges

<img src="./imgs/range.jpg" width="500px" />

Range of integers from 0 up to (but not including) 10:

In [None]:
range(0, 10)

# TODO: take a look at the range values using casting
list(range(0, 10))

Range of integers from 0 up to (but not including) 100, skipping by 10s:

In [None]:
range(0, 100, 10)

# TODO: take a look at the range values using casting
list(range(0, 100, 10))
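
As a small sketch of how the three `range` arguments behave (the variable names here are only for illustration): the stop value is never part of the result, and a negative step counts down.

```python
# range(start, stop, step): the stop value is excluded from the result
up_by_tens = list(range(0, 100, 10))
print(up_by_tens)  # the last value is 90, not 100

# a negative step counts down; the stop value (0) is still excluded
countdown = list(range(10, 0, -2))
print(countdown)

# a range knows how many values it will produce without building a list
print(len(range(0, 100, 10)))
```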

## Lists
### Creating lists from sequences of numbers
#### Create a list with all the numbers between 0 and 1000 that end in 91

In [None]:
list_x91 = []

# TODO: for loop
for i in range(91, 1000, 100):
  list_x91.append(i)
print(list_x91)

# TODO: comprehension
list_x91 = [i for i in range(91, 1000, 100)]
print(list_x91)

# TODO: casting
list_x91 = list(range(91, 1000, 100))
print(list_x91)

### List indexing

Indexing from the front is normal:

In [None]:
print(list_x91)
print(list_x91[0])
print(list_x91[2])
print(list_x91[8])

But, Python also lets us index from the back with negative numbers:

In [None]:
print(list_x91[-1])
print(list_x91[-2])
print(list_x91[-8])
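
One way to think about negative indexes, sketched with a made-up list called `letters`: the index `-k` points at the same element as `len(list) - k`.

```python
# hypothetical example list, separate from list_x91
letters = ["a", "b", "c", "d", "e"]

# a negative index -k refers to the same element as len(letters) - k
for k in range(1, len(letters) + 1):
    assert letters[-k] == letters[len(letters) - k]

# the last element, written both ways
print(letters[-1], letters[len(letters) - 1])
```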

### Create a list with ten 0s

In [None]:
list_10_0 = []

# TODO: math with lists
list_10_0 = 10 * [0]
print(list_10_0)

list_10_0_3_3_4_7 = 10 * [0] + 3 * [3] + 4 * [7]
print(list_10_0_3_3_4_7)

### Create list of numbers between 0 and 100 that are divisible by 7:

In [None]:
# TODO: probably easier using comprehension

list_100_7 = []
for i in range(0,100):
    if i % 7 == 0:
        list_100_7.append(i)
print(list_100_7)

list_100_7 = [i for i in range(0,100) if i % 7 == 0]
print(list_100_7)

### List functions

Members of each `list` object.

<img src="./imgs/lists00.jpg" width="500px" />

### Create a list of 1000 random numbers between 0 and 1000

In [None]:
# TODO: with for loop
list_of_randoms = []
for cnt in range(1000):
    list_of_randoms.append(random.randint(0, 1000))

print(len(list_of_randoms))

# TODO: with comprehension
list_of_randoms_c = [random.randint(0, 1000) for cnt in range(1000)]
print(len(list_of_randoms_c))

### Print the numbers and their index

In [None]:
# TODO: with len
for idx in range(len(list_of_randoms)):
  print(idx, list_of_randoms[idx])

In [None]:
# TODO: with enumerate
for idx,val in enumerate(list_of_randoms):
  print(idx, val)

### Find the largest element in a list

Go through all of the elements and compare each element to the largest number seen so far.

Update the `largest` variable if we encounter a larger number.

In [None]:
largest = list_of_randoms[0]

# TODO: find max
for x in list_of_randoms:
    if x > largest:
        largest = x

print(largest)

### Find the smallest element in a list

Go through all of the elements and compare each element to the smallest number seen so far.

Update the `smallest` variable if we encounter a smaller number.

In [None]:
smallest = list_of_randoms[0]

# TODO: find min
for x in list_of_randoms:
    if x < smallest:
        smallest = x

print(smallest)

### Find the sum of all elements in a list

Go through all of the elements and add their values to an accumulator variable.

In [None]:
my_sum = 0

# TODO: find sum
for x in list_of_randoms:
    my_sum += x

print(my_sum)

### Python has built-in functions for doing these things

In [None]:
min(list_of_randoms), max(list_of_randoms), sum(list_of_randoms)

### Find the 5 largest and 5 smallest numbers in a list

# 🤔

### Python has a function for sorting a list that could help

In [None]:
my_sorted_list = sorted(list_of_randoms)

print(list_of_randoms)
print(my_sorted_list)

### Functions on lists

These are functions that Python gives us to work on lists.

There are functions for sorting, reversing and getting the length of a `list`:

<img src="./imgs/lists01.jpg" width="600px" />

#### Order from largest to smallest

Sort, then reverse:

In [None]:
my_reversed_sorted_list = list(reversed(my_sorted_list))

print(list_of_randoms)
print(my_sorted_list)
print(my_reversed_sorted_list)

#### Order from largest to smallest

Sort in reverse:

In [None]:
my_reversed_sorted_list = sorted(list_of_randoms, reverse=True)

print(list_of_randoms)
print(my_sorted_list)
print(my_reversed_sorted_list)

### With a sorted list we can more easily print the 5 smallest and 5 largest elements


In [None]:
my_sorted_list[ :5], my_sorted_list[-5: ]

### :W:T:F:?:

### Slicing

Python has a built-in mechanism for getting sub-sections of a list called *slicing*.

Instead of a single index, we specify two values in the square bracket, separated by a `:`, to specify where our slice starts and ends:

<img src="./imgs/slicing.jpg" width="700px" />

One **VERY** important thing to remember is that the second index in the bracket is **NOT** included in the slice.

In [None]:
my_list = [random.randint(0, 12) for i in range(0, 20)]
my_list, my_list[0 : 5]

As another example:  
`my_list[4 : 10]` would be used to access $6$ elements starting at position $4$, so ...
<br>elements $4$ - $9$ on the list. The second index in the slice, $10$, is not included.

In [None]:
my_list[4 : 10]
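
A short sketch of why leaving out the second index is convenient, using a made-up list called `numbers`: the length of a slice is simply `stop - start`, and two slices that share a boundary fit back together without overlapping.

```python
numbers = list(range(100, 120))  # 20 values: 100, 101, ..., 119

piece = numbers[4:10]
print(piece)       # positions 4 through 9
print(len(piece))  # same as 10 - 4

# slices that share a boundary don't overlap or skip anything
first_part = numbers[:5]
second_part = numbers[5:]
print(first_part + second_part == numbers)
```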

And, Python being Python, it tries to be smart and keep us from unnecessary typing:
- if the first index is blank, the slice will start at the first element 
- if the second index is blank, the slice will go until the end of the list

In [None]:
my_list, my_list[0 : 5], my_list[ :5]

In [None]:
my_list[15 : 20], my_list[15: ]

We can use negative indexes to slice from the back:

`a_list[-5 : len(a_list)]` would grab the last 5 elements from the list `a_list`,
<br>but this can be simplified to `a_list[-5: ]`.

In [None]:
my_list[-5 : len(my_list)], my_list[-5: ]

### How would we get the 5 items in the center?

In [None]:
center_index = len(my_list) // 2
center_5 = my_list[center_index - 2 : center_index + 3]

print(my_list)
print(center_5)

### This should make more sense now:

In [None]:
my_sorted_list[ :5], my_sorted_list[-5: ]

## Objects

### Creating objects

In [None]:
my_info = {
  "name": "thiago",
  "id": 8114,
  "zip": 11001,
  "grades": [90, 80, 60],
  "attendance": [True, True, False, True, True],
  "final grade": "A"
}
my_info

### Accessing values at specific keys

In [None]:
print(my_info["name"])
print(my_info["grades"])
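
Asking for a key that isn't in the object raises a `KeyError`. A minimal sketch of two ways to guard against that, using a smaller stand-in object called `student` (made up for this example):

```python
student = {"name": "thiago", "id": 8114}  # smaller stand-in for my_info

# check that a key exists before reading it
if "zip" in student:
    print(student["zip"])
else:
    print("no zip code stored")

# .get() returns a default value instead of raising a KeyError
zip_code = student.get("zip", "unknown")
print(zip_code)
```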

### Modifying and Adding values

In [None]:
my_info["zip"] = 11202
my_info["course"] = 9103
my_info["section"] = "H"
my_info

### Iterating over keys, values and items

<img src="./imgs/objects.jpg" width="500px" />

In [None]:
# TODO use my_info.keys(), .values() and .items() to print object

for k in my_info.keys():
    print(k)

print()

for v in my_info.values():
    print(v)

print()

for k,v in my_info.items():
    print(k,":", v)


## List of objects

### Create a list of 10 objects with random heights and Brooklyn zip codes.

```python
my_data = [
  {"height": [60, 70], "zip": [11200, 11250]},
  {"height": [60, 70], "zip": [11200, 11250]},
  {"height": [60, 70], "zip": [11200, 11250]},
  ...
]
```

In [None]:
my_data = []
# TODO: create list of random objects

for cnt in range(10):
    obj = {
        "height": random.randint(60, 70),
        "zip": random.randint(11200, 11250)
    }
    my_data.append(obj)

my_data

### BONUS: Let's add an id for each object and remove height

In [None]:
# import library with functions/variables that deal with words and text
import string

# like, a list of all lowercase letters
print(string.ascii_lowercase)

# and a list of all digits
print(string.digits)

In [None]:
# function to create a random id
def create_id():
  # named new_id so we don't shadow Python's built-in id() function
  new_id = ""
  # choose 3 random letters from a list of all letters
  for cnt in range(3):
    new_id += random.choice(string.ascii_lowercase)

  # choose 4 random digits from a list of all digits
  for cnt in range(4):
    new_id += random.choice(string.digits)
  return new_id

for obj in my_data:
  del obj["height"]
  obj["id"] = create_id()

my_data

### Let's add a list of 3 grades for each member of the list and another item with their computed average

In [None]:
# TODO: first, add grade list to objects

for obj in my_data:
    obj["grades"] = [random.randint(70, 100) for cnt in range(3)]

my_data

### Average

<img src="./imgs/average00.jpg" width="500px" />

<img src="./imgs/average01.jpg" width="500px" />

In [None]:
# TODO: compute and store averages

for obj in my_data:
    obj["avg"] = round(sum(obj["grades"]) / len(obj["grades"]), 1)

my_data

### Get highest and lowest average grades

First, get all average grades, then use `min()`/`max()`

In [None]:
grades = []
for obj in my_data:
    grades.append(obj["avg"])

min(grades), max(grades)

### Sort objects by average grades

We could first get all the average grades and then sort the new list:

In [None]:
# TODO: get list of avg grades
grades = []
for obj in my_data:
    grades.append(obj["avg"])

by_grade = sorted(grades)

print("original:\n", grades)
print("sorted:\n", by_grade)

### But now we don't have the other associated information with each grade.

We want to sort the list while keeping the objects together.

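It would be nice to be able to do something like this, just like with a `list` of numbers, but Python raises a `TypeError` here because it doesn't know how to compare two objects: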

In [None]:
by_grade = sorted(my_data)
print(by_grade)

### Sorting Objects

For lists of objects we have to tell python which values to compare to determine their order.

We do this by defining a key function.

Key functions receive one argument, which can be an object, a list, a class member, anything... and they return one value that Python knows how to compare (usually a number, but strings work too).

<img src="./imgs/list-of-objects.jpg" width="620px" />

In [None]:
# this key function receives a student-info object with {id, zip, grades, avg}
# and should return just the average grade value
def gradeKey(person):
  return person["avg"]

# then we can just use it when we call sorted()
by_grade = sorted(my_data, key=gradeKey)

by_grade

In [None]:
# TODO: sort by first assignment grade
def hw01Key(person):
  return person["grades"][0]

by_hw01 = sorted(my_data, key=hw01Key)

by_hw01
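
For short key functions, the same idea can also be written inline with a `lambda`. This is a sketch on a made-up list called `people` (not the `my_data` list above); it also shows a key function that returns a string instead of a number.

```python
# hypothetical records, just for this example
people = [
    {"id": "abc1234", "avg": 88.3},
    {"id": "xyz0001", "avg": 92.0},
    {"id": "mno5555", "avg": 75.7},
]

# lambda version of a key function that returns the average grade
by_avg = sorted(people, key=lambda person: person["avg"])

# key functions can return strings too; this sorts alphabetically by id
by_id = sorted(people, key=lambda person: person["id"])

print([p["avg"] for p in by_avg])
print([p["id"] for p in by_id])
```

Named functions like `gradeKey` are easier to reuse; a `lambda` is handy when the key is only needed once.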

### `min()`/`max()` functions also work with a `key` argument:

In [None]:
# student with highest average grade
max_by_grade = max(my_data, key=gradeKey)

# student with lowest score on first assignment
min_by_hw01 = min(my_data, key=hw01Key)

print(max_by_grade)
print(min_by_hw01)

## Bigger Lists

## Setup

Include some helper functions and libraries

In [None]:
!wget -q https://github.com/DM-GY-9103-2024F-H/9103-utils/raw/main/src/data_utils.py

In [None]:
import matplotlib.pyplot as plt

from data_utils import object_from_json_url

### Load ANSUR 2 Database

The `JSON` file has a subset of the measurements found [here](https://www.openlab.psu.edu/ansur2/).

In [None]:
ANSUR_JSON_URL = "https://raw.githubusercontent.com/DM-GY-9103-2024F-H/9103-utils/main/datasets/json/ansur.json"
ansur = object_from_json_url(ANSUR_JSON_URL)

# TODO: look at the data

# Answer:
#   - how many rows/records/items ?
#   - tallest height ?
#   - longest ear ?
#   - average ear length ?

print(len(ansur))
ansur[:2]

### Let's look at simpler versions:

In [None]:
AHW_JSON_URL = "https://raw.githubusercontent.com/DM-GY-9103-2024F-H/9103-utils/main/datasets/json/ansur_age_height_weight_object.json"
ahw_objs = object_from_json_url(AHW_JSON_URL)

# TODO: look at data
# How is it organized ?

In [None]:
AHW_LIST_URL = "https://raw.githubusercontent.com/DM-GY-9103-2024F-H/9103-utils/main/datasets/json/ansur_age_height_weight.json"
ahws = object_from_json_url(AHW_LIST_URL)

# TODO: look at data
# How is it organized ?

# Answer the following:
#   - how many items ?
#   - how do we access the height of a person ?

## List of Lists

Just like we can put lists inside objects, and objects inside lists, we can also put lists inside lists.

If we want to get to a particular value we have to use $2$ indices instead of using just one:
`list[i][j]`

The first index tells Python which of the sub-lists we want, and the second specifies the item on that list.

<img src="./imgs/list-of-lists00.jpg" width="700px" />

<img src="./imgs/list-of-lists01.jpg" width="700px" />

Sometimes we'll refer to the first index as the row index and the second index as the column index.

That's because if we imagine our list of lists as a 2-dimensional matrix of numbers, the first index tells Python which row we want to access and the second tells which column:

<img src="./imgs/list-of-lists02.jpg" width="700px" />

<img src="./imgs/list-of-lists03.jpg" width="700px" />
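
The row/column idea can be sketched with a small made-up list of lists called `grid` (3 rows, 4 columns):

```python
# hypothetical 3x4 grid: 3 rows (sub-lists) with 4 columns each
grid = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
]

row = grid[1]       # one index gives a whole row (the second one)
value = grid[1][2]  # two indexes give one value: row 1, column 2
print(row, value)

# visit every value along with its row and column index
for i, r in enumerate(grid):
    for j, v in enumerate(r):
        print(i, j, v)
```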

### Datasets

We'll see this kind of structure a lot.

It's very common for datasets to be organized by rows/columns, where each column specifies a different *property* (or *feature*) and each row is a different *measurement* (or *record*) of those features.

In our example above, our dataset had $3$ *features* (age, height, weight), and one *record* per person.

<img src="./imgs/datasets00.jpg" width="700px" />
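
With this layout, getting all the values for one *feature* means picking the same column from every row. A minimal sketch, assuming the columns are in `[age, height, weight]` order and using made-up numbers (the real file may be ordered differently):

```python
# hypothetical records, assumed to be in [age, height, weight] order
records = [
    [25, 70, 160],
    [31, 64, 130],
    [42, 68, 185],
]

# take column 1 (height, under the assumption above) from every row
heights_col = [rec[1] for rec in records]
print(heights_col)
print(max(heights_col))
```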

### JSON

It's also common to find datasets specified in the JSON format.

Instead of just being a list of lists with values, each *record* is an object that specifies the names and values of its *features*:

<img src="./imgs/datasets01.jpg" width="700px" />

There are advantages and disadvantages to each. We'll soon look at another way to organize datasets that will make it easier to go from one type to the other if we have to.
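
Until then, the conversion can be done by hand. This is only a sketch with made-up values, and the names in `feature_names` and their order are an assumption, not something read from the dataset files:

```python
feature_names = ["age", "height", "weight"]  # assumed column order
rows = [
    [25, 70, 160],
    [31, 64, 130],
]

# list of lists -> list of objects: pair each name with its value
as_objects = [dict(zip(feature_names, row)) for row in rows]
print(as_objects)

# list of objects -> list of lists: read the values back in a fixed order
as_rows = [[obj[name] for name in feature_names] for obj in as_objects]
print(as_rows)
```

The list-of-lists version is more compact, but it only makes sense if we remember what each column means; the object version repeats the names in every record.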

## Plots

We can use the [matplotlib](https://matplotlib.org/stable/api/pyplot_summary.html) library to visualize our data.

In [None]:
# TODO: get heights
heights = []

plt.plot(heights, 'bo', markersize=2)
plt.show()

In [None]:
# TODO: get weights
weights = []

plt.plot(weights, 'ro', markersize=2)
plt.show()

In [None]:
# TODO: plot ages in green
ages = []

### Sorting data can give a different perspective

In [None]:
sorted_heights = sorted(heights)
plt.plot(sorted_heights, 'bo', markersize=2)
plt.show()

### Histograms

In [None]:
min_height = min(heights)
max_height = max(heights)
plt.hist(heights, bins=range(min_height, max_height + 1))
plt.grid()
plt.show()
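
A histogram with one bin per integer value is really just a count of how many times each value shows up. A rough sketch of that counting, done by hand on a made-up list called `sample`:

```python
sample = [64, 66, 66, 67, 64, 70, 66]  # made-up integer heights

# count how many times each value appears
counts = {}
for h in sample:
    counts[h] = counts.get(h, 0) + 1

# print a tiny text histogram, smallest value first
for h in sorted(counts):
    print(h, "#" * counts[h])
```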

## Correlation

A measurement of how $2$ separate variables (features) are related to each other.

<img src="./imgs/correlation.jpg" width="800px" />

They can have *positive* or *direct* correlation, if an increase in one of the variables comes with an increase in the other.

They can have *negative* or *inverse* correlation if an increase in one of the variables is accompanied by a decrease in the other.

Or, there can be *weak* or *NO* correlation, if a change in one variable doesn't seem to be accompanied by a change in the other.
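
One common way to put a number on this is the Pearson correlation coefficient, which goes from $-1$ (perfectly inverse) to $1$ (perfectly direct), with values near $0$ meaning weak or no linear correlation. The notebook doesn't name a specific measure, so treat this as one possible sketch; `pearson_r` is a helper written just for this example.

```python
def pearson_r(xs, ys):
    # assumes both lists have the same length and neither is constant
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # how much the two variables move together
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # how much each variable moves on its own
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

xs = [1, 2, 3, 4, 5]
print(pearson_r(xs, [2, 4, 6, 8, 10]))  # y goes up whenever x goes up
print(pearson_r(xs, [10, 8, 6, 4, 2]))  # y goes down whenever x goes up
```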

In [None]:
# use "column" lists from above to plot scatter plot
plt.scatter(ages, heights, marker='o', alpha=0.2)
plt.xlabel("age")
plt.ylabel("height")
plt.show()

In [None]:
# TODO plot other combinations of variables
# TODO: any correlation ?