# Homework 2: Arrays and Tables

**Recommended Reading**: 
* [Data Types](https://www.inferentialthinking.com/chapters/04/data-types.html) 
* [Sequences](https://www.inferentialthinking.com/chapters/05/sequences.html)
* [Tables](https://www.inferentialthinking.com/chapters/06/tables.html)

Please complete this notebook by filling in the cells provided. Before you begin, execute the following cell to load the provided tests. Each time you start your server, you will need to execute this cell again to load the tests.
Throughout this homework and all future ones, please do not reassign a variable once you have defined it! For example, if you use `max_temperature` in your answer to one question, do not reassign it later on.

In [74]:
# Don't change this cell; just run it. 

import numpy as np
from datascience import *

from gofer.ok import check

<font color="#E74C3C">**Important**: In this homework, the `gofer` tests will tell you whether your answer is correct, except for Parts 4, 5 & 6. In future homework assignments, correctness tests will typically not be provided.</font>

## 1. Creating Arrays


**Question 1.** Make an array called `weird_numbers` containing the following numbers (in the given order):

1. -2
2. the sine of 1.2
3. 3
4. 5 to the power of the cosine of 1.2

*Hint:* `sin` and `cos` are functions in the `math` module.

In [75]:
# Our solution involved one extra line of code before creating
# weird_numbers.
import math
weird_numbers = make_array(-2, math.sin(1.2), 3, 5 ** math.cos(1.2))
weird_numbers

array([-2.        ,  0.93203909,  3.        ,  1.79174913])

In [76]:
check('tests/q1_1.py')

**Question 2.** Make an array called `book_title_words` containing the following three strings: "Eats", "Shoots", and "and Leaves".

In [77]:
book_title_words = make_array("Eats", "Shoots", "and Leaves")
book_title_words

array(['Eats', 'Shoots', 'and Leaves'], dtype='<U10')

In [78]:
check('tests/q1_2.py')

Strings have a method called `join`.  `join` takes one argument, an array of strings.  It returns a single string.  Specifically, the value of `a_string.join(an_array)` is a single string that's the [concatenation](https://en.wikipedia.org/wiki/Concatenation) ("putting together") of all the strings in `an_array`, **except** `a_string` is inserted in between each string.
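
As a small illustration of how `join` behaves, here is a sketch with made-up words that are unrelated to the question. It uses a plain Python list, which `join` accepts in the same way as an array of strings.

```python
# Hypothetical example words, not part of the assignment.
words = ["red", "green", "blue"]

# The string before .join is placed between consecutive elements,
# never before the first element or after the last one.
dashed = "-".join(words)
glued = "".join(words)

print(dashed)  # red-green-blue
print(glued)   # redgreenblue
```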

**Question 3.** Use the array `book_title_words` and the method `join` to make two strings:

1. "Eats, Shoots, and Leaves" (call this one `with_commas`)
2. "Eats Shoots and Leaves" (call this one `without_commas`)

*Hint:* If you're not sure what `join` does, first try just calling, for example, `"foo".join(book_title_words)` .

In [79]:
with_commas = ", ".join(book_title_words)
without_commas = " ".join(book_title_words)

# These lines are provided just to print out your answers.
print('with_commas:', with_commas)
print('without_commas:', without_commas)

with_commas: Eats, Shoots, and Leaves
without_commas: Eats Shoots and Leaves


In [80]:
check('tests/q1_3.py')

## 2. Indexing Arrays


These exercises give you practice accessing individual elements of arrays.  In Python (and in many programming languages), elements are accessed by *index*, so the first element is the element at index 0.  
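
The relationship between position and index can be sketched as follows. The array contents are hypothetical, and `np.array` is used here as a stand-in for `make_array`.

```python
import numpy as np

# Hypothetical array; np.array stands in for make_array.
letters = np.array(["a", "b", "c", "d"])

first = letters.item(0)   # first element lives at index 0
third = letters.item(2)   # third element lives at index 2

print(first, third)  # a c
```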

**Question 1.** The cell below creates an array of some numbers.  Set `third_element` to the third element of `some_numbers`.

In [81]:
some_numbers = make_array(-1, -3, -6, -10, -15)

third_element = some_numbers.item(2)
third_element

-6

In [82]:
check('tests/q2_1.py')

**Question 2.** The next cell creates a table that displays some information about the elements of `some_numbers` and their order.  Run the cell to see the partially-completed table, then fill in the missing information in the cell (the strings that are currently "???") to complete the table.

In [83]:
elements_of_some_numbers = Table().with_columns(
    "English name for position", make_array("first", "second", "third", "fourth", "fifth"),
    "Index",                     make_array("0", "1", "2", "3", "4"),
    "Element",                   some_numbers)
elements_of_some_numbers

English name for position,Index,Element
first,0,-1
second,1,-3
third,2,-6
fourth,3,-10
fifth,4,-15


In [84]:
check('tests/q2_2.py')

**Question 3.** You'll sometimes want to find the *last* element of an array.  Suppose an array has 142 elements.  What is the index of its last element?

In [85]:
index_of_last_element = 141


In [86]:
check('tests/q2_3.py')

More often, you don't know the number of elements in an array, its *length*.  (For example, it might be a large dataset you found on the Internet.)  The function `len` takes a single argument, an array, and returns the `len`gth of that array (an integer).
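
A minimal sketch of how `len` relates to the index of the last element, using hypothetical values and `np.array` in place of `make_array`:

```python
import numpy as np

values = np.array([10, 20, 30, 40])    # hypothetical data

n = len(values)                        # number of elements: 4
last_index = n - 1                     # indices run from 0 to n - 1
last_value = values.item(last_index)   # 40
same_value = values.item(-1)           # a negative index counts from the end

print(n, last_index, last_value, same_value)  # 4 3 40 40
```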

**Question 4.** The cell below loads an array called `president_birth_years`.  The last element in that array is the most recent birth year of any deceased president. Assign that year to `most_recent_birth_year`.

In [87]:
president_birth_years = Table.read_table("president_births.csv").column('Birth Year')

most_recent_birth_year = president_birth_years.item(-1)
most_recent_birth_year

1917

In [88]:
check('tests/q2_4.py')

**Question 5.** Finally, assign `sum_of_birth_years` to the sum of the first, tenth, and last birth year in `president_birth_years`.

In [89]:
sum_of_birth_years = president_birth_years.item(0) + president_birth_years.item(9) + president_birth_years.item(-1)

In [90]:
check('tests/q2_5.py')

## 3. Basic Array Arithmetic


**Question 1.** Multiply the numbers 42, 4224, 42422424, and -250 by 157.  For this question, **don't** use arrays.

In [91]:
first_product = 42*157
second_product = 4224*157
third_product = 42422424*157
fourth_product = -250*157
print(first_product, second_product, third_product, fourth_product)

6594 663168 6660320568 -39250


In [92]:
check('tests/q3_1.py')

**Question 2.** Now, do the same calculation, but using an array called `numbers` and only a single multiplication (`*`) operator.  Store the 4 results in an array named `products`.

In [93]:
numbers = make_array(42, 4224, 42422424, -250)
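# Multiplying an array by a number already returns an array,
# so there is no need to wrap the result in make_array.
products = numbers * 157
products

array([      6594,     663168, 6660320568,     -39250])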

In [94]:
check('tests/q3_2.py')

**Question 3.** Oops, we made a typo!  Instead of 157, we wanted to multiply each number by 1577.  Compute the fixed products in the cell below using array arithmetic.  Notice that your job is really easy if you previously defined an array containing the 4 numbers.

In [95]:
# As in Question 2, the product is already an array.
fixed_products = numbers * 1577
fixed_products

array([      66234,     6661248, 66900162648,     -394250])

In [96]:
check('tests/q3_3.py')
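
A short sketch of why the result of array arithmetic should not be wrapped in another array. The values are hypothetical, and `np.array([...])` is assumed here to behave like `make_array` when it is handed a single array.

```python
import numpy as np

numbers = np.array([1, 2, 3])        # hypothetical values

scaled = numbers * 10                # elementwise product, shape (3,)
wrapped = np.array([numbers * 10])   # extra nesting, shape (1, 3)

print(scaled.shape, wrapped.shape)   # (3,) (1, 3)
```

The wrapped version is a two-dimensional array with one row, so calls such as `.item(0)` still work, but the array no longer has the same shape as `numbers`.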

**Question 4.** We've loaded an array of temperatures in the next cell.  Each number is the highest temperature observed on a day at a climate observation station, mostly from the US.  Since they're from the US government agency [NOAA](noaa.gov), all the temperatures are in Fahrenheit.  Convert them all to Celsius by first subtracting 32 from them, then multiplying the results by $\frac{5}{9}$. Make sure to **ROUND** each result to the nearest integer using the `np.round` function.

In [97]:
max_temperatures = Table.read_table("temperatures.csv").column("Daily Max Temperature")

celsius_max_temperatures = np.round((max_temperatures-32)*5/9)
celsius_max_temperatures

array([-4., 31., 32., ..., 17., 23., 16.])

In [98]:
check('tests/q3_4.py')
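
The Fahrenheit-to-Celsius conversion can be checked on a few familiar values. The readings below are hypothetical and are not taken from `temperatures.csv`.

```python
import numpy as np

# Hypothetical readings: freezing, boiling, body temperature, a mild day.
fahrenheit = np.array([32.0, 212.0, 98.6, 50.0])

celsius = (fahrenheit - 32) * 5 / 9   # applied to every element at once
rounded = np.round(celsius)           # nearest integer, still stored as floats

print(rounded)
```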

**Question 5.** The cell below loads all the *lowest* temperatures from each day (in Fahrenheit).  Compute the size of the daily temperature range for each day.  That is, compute the difference between each daily maximum temperature and the corresponding daily minimum temperature.  **Give your answer in Celsius!** Make sure **NOT** to round your answer for this question!

In [99]:
min_temperatures = Table.read_table("temperatures.csv").column("Daily Min Temperature")

celsius_temperature_ranges = ((max_temperatures-32)*5/9) - ((min_temperatures -32)*5/9)
celsius_temperature_ranges

array([ 6.66666667, 10.        , 12.22222222, ..., 17.22222222,
       11.66666667, 11.11111111])

In [100]:
check('tests/q3_5.py')
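
One way to sanity-check a temperature range in Celsius: the offset of 32 cancels when two converted temperatures are subtracted, so converting first and subtracting should agree with subtracting first and scaling by 5/9. The sketch below uses hypothetical highs and lows, and `to_celsius` is a helper name introduced only for this illustration.

```python
import numpy as np

highs = np.array([50.0, 68.0, 86.0])   # hypothetical daily maxima (Fahrenheit)
lows = np.array([32.0, 50.0, 59.0])    # hypothetical daily minima (Fahrenheit)

def to_celsius(f):
    """Convert Fahrenheit values to Celsius (illustrative helper)."""
    return (f - 32) * 5 / 9

convert_then_subtract = to_celsius(highs) - to_celsius(lows)
subtract_then_scale = (highs - lows) * 5 / 9

print(convert_then_subtract)
print(subtract_then_scale)
```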

## 4. World Population


The cell below loads a table of estimates of the world population for different years, starting in 1950. The estimates come from the [US Census Bureau website](https://www.census.gov/en.html).

In [101]:
world = Table.read_table("world_population.csv").select('Year', 'Population')
world.show(4)

Year,Population
1950,2557628654
1951,2594939877
1952,2636772306
1953,2682053389


The name `population` is assigned to an array of population estimates.

In [102]:
population = world.column(1)
population

array([2557628654, 2594939877, 2636772306, 2682053389, 2730228104,
       2782098943, 2835299673, 2891349717, 2948137248, 3000716593,
       3043001508, 3083966929, 3140093217, 3209827882, 3281201306,
       3350425793, 3420677923, 3490333715, 3562313822, 3637159050,
       3712697742, 3790326948, 3866568653, 3942096442, 4016608813,
       4089083233, 4160185010, 4232084578, 4304105753, 4379013942,
       4451362735, 4534410125, 4614566561, 4695736743, 4774569391,
       4856462699, 4940571232, 5027200492, 5114557167, 5201440110,
       5288955934, 5371585922, 5456136278, 5538268316, 5618682132,
       5699202985, 5779440593, 5857972543, 5935213248, 6012074922,
       6088571383, 6165219247, 6242016348, 6318590956, 6395699509,
       6473044732, 6551263534, 6629913759, 6709049780, 6788214394,
       6866332358, 6944055583, 7022349283, 7101027895, 7178722893,
       7256490011])

In this question, you will apply some built-in NumPy functions to this array.

<img src="array_diff.png" style="width: 600px;"/>

The difference function `np.diff` subtracts from each element in an array the element that precedes it. As a result, the length of the array `np.diff` returns will always be one less than the length of the input array.
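
A minimal sketch of `np.diff` on a hypothetical four-element array:

```python
import numpy as np

values = np.array([3, 7, 12, 10])   # hypothetical data

# Each output element is a value minus the one before it:
# 7 - 3, 12 - 7, 10 - 12
steps = np.diff(values)

print(steps)                     # [ 4  5 -2]
print(len(values), len(steps))   # 4 3
```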

<img src="array_cumsum.png" style="width: 700px;"/>

The cumulative sum function `np.cumsum` outputs an array of partial sums. For example, the third element in the output array corresponds to the sum of the first, second, and third elements.
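
A minimal sketch of `np.cumsum` on a hypothetical four-element array:

```python
import numpy as np

values = np.array([3, 7, 12, 10])   # hypothetical data

# Running totals: 3, 3 + 7, 3 + 7 + 12, 3 + 7 + 12 + 10
partial_sums = np.cumsum(values)

print(partial_sums)   # [ 3 10 22 32]
```

Unlike `np.diff`, the output has the same length as the input, and its last element equals the sum of the whole array.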

**Question 1.** Very often in data science, we are interested in understanding how values change with time. Use `np.diff` and `np.max` (or just `max`) to calculate the largest annual change in population between any two consecutive years.

In [103]:
largest_population_change = max(np.diff(population))
largest_population_change

87515824

In [104]:
check('tests/q4_1.py')

**Question 2.** Describe in words the result of the following expression. What do the values in the resulting array represent (choose one)?

In [105]:
np.cumsum(np.diff(population))

array([  37311223,   79143652,  124424735,  172599450,  224470289,
        277671019,  333721063,  390508594,  443087939,  485372854,
        526338275,  582464563,  652199228,  723572652,  792797139,
        863049269,  932705061, 1004685168, 1079530396, 1155069088,
       1232698294, 1308939999, 1384467788, 1458980159, 1531454579,
       1602556356, 1674455924, 1746477099, 1821385288, 1893734081,
       1976781471, 2056937907, 2138108089, 2216940737, 2298834045,
       2382942578, 2469571838, 2556928513, 2643811456, 2731327280,
       2813957268, 2898507624, 2980639662, 3061053478, 3141574331,
       3221811939, 3300343889, 3377584594, 3454446268, 3530942729,
       3607590593, 3684387694, 3760962302, 3838070855, 3915416078,
       3993634880, 4072285105, 4151421126, 4230585740, 4308703704,
       4386426929, 4464720629, 4543399241, 4621094239, 4698861357])

1) The total population change between consecutive years, starting at 1951.

2) The total population change between 1950 and each later year, starting at 1951.

3) The total population change between 1950 and each later year, starting inclusively at 1950.

In [106]:
# Assign cumulative_sum_answer to 1, 2, or 3
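# Summing consecutive differences gives each later year's population minus
# the 1950 population; the first element is the change from 1950 to 1951.
cumulative_sum_answer = 2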

In [107]:
check('tests/q4_2.py')
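
The effect of combining the two functions can be explored on a small array. In this sketch the values are hypothetical, and the slice `values[1:]` (every element except the first) is used only for the comparison. Adding up consecutive differences cancels all the intermediate values, leaving the change from the first element to each later element.

```python
import numpy as np

values = np.array([100, 104, 111, 109])   # hypothetical yearly totals

running_change = np.cumsum(np.diff(values))
change_from_first = values[1:] - values[0]

print(running_change)      # [ 4 11  9]
print(change_from_first)   # [ 4 11  9]
```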

## 5. Old Faithful


**Important**: In this question, the `gofer` tests don't tell you whether or not your answer is correct. They only check that your answer is close. However, when the question is graded, we will check for the correct answer. Therefore, you should do your best to submit answers that not only pass the tests, but are also correct.

Old Faithful is a geyser in Yellowstone that erupts every 44 to 125 minutes (according to [Wikipedia](https://en.wikipedia.org/wiki/Old_Faithful)). People are [often told that the geyser erupts every hour](http://yellowstone.net/geysers/old-faithful/), but in fact the waiting time between eruptions is more variable. Let's take a look.

**Question 1.** The first line below assigns `waiting_times` to an array of 272 consecutive waiting times between eruptions, taken from a classic 1938 dataset. Assign the names `shortest`, `longest`, and `average` so that the `print` statement is correct.

In [108]:
waiting_times = Table.read_table('old_faithful.csv').column('waiting')

shortest = min(waiting_times)
longest = max(waiting_times)
average = sum(waiting_times)/ len(waiting_times)

print("Old Faithful erupts every", shortest, "to", longest, "minutes and every", average, "minutes on average.")

Old Faithful erupts every 43 to 96 minutes and every 70.8970588235294 minutes on average.


In [109]:
check('tests/q5_1.py')

**Question 2.** Assign `biggest_decrease` to the biggest decrease in waiting time between two consecutive eruptions. For example, the third eruption occurred after 74 minutes and the fourth after 62 minutes, so the decrease in waiting time was 74 - 62 = 12 minutes. 

*Hint*: You'll need an array arithmetic function [mentioned in the textbook](https://www.inferentialthinking.com/chapters/05/1/arrays.html#Functions-on-Arrays).

*Hint 2*: The biggest decrease could be negative, but in the end, we want to return the absolute value of the biggest decrease.

In [110]:
biggest_decrease = abs(min(np.diff(waiting_times)))
biggest_decrease

45

In [111]:
check('tests/q5_2.py')
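
A small sketch of the idea, using only the first four waiting times mentioned in this section (79, 54, 74, 62) rather than the full dataset. It assumes that at least one consecutive change is a decrease; otherwise the smallest difference would not be a decrease at all.

```python
import numpy as np

waits = np.array([79, 54, 74, 62])    # first four waiting times from the text

changes = np.diff(waits)              # [-25, 20, -12]
biggest_drop = abs(min(changes))      # most negative change, as a positive number

print(biggest_drop)   # 25
```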

**Question 3.** If you expected Old Faithful to erupt every hour, you would expect to wait a total of `60 * k` minutes to see `k` eruptions. Set `difference_from_expected` to an array with 272 elements, where the element at index `i` is the absolute difference between the expected and actual total amount of waiting time to see the first `i+1` eruptions.  *Hint*: You'll need to compare a cumulative sum to a range.

For example, since the first three waiting times are 79, 54, and 74, the total waiting time for 3 eruptions is 79 + 54 + 74 = 207. The expected waiting time for 3 eruptions is 60 * 3 = 180. Therefore, `difference_from_expected.item(2)` should be $|207 - 180| = 27$.

In [112]:
# The expected total wait for the first k eruptions is 60 * k, so compare the
# cumulative sum against 60, 120, ..., 60 * 272 rather than a single constant.
difference_from_expected = abs(np.cumsum(waiting_times) - 60 * np.arange(1, 273))
difference_from_expected

array([  19,   13,   27,   29,   54,   49,   77,  102,   93,  118,  112,
        136,  154,  141,  164,  156,  158,  182,  174,  193,  184,  171,
        189,  198,  212,  235,  230,  246,  264,  283,  296,  313,  319,
        339,  353,  345,  333,  353,  352,  382,  402,  400,  424,  422,
        435,  458,  462,  455,  477,  476,  491,  521,  515,  535,  529,
        552,  563,  567,  584,  605,  604,  628,  616,  638,  638,  670,
        688,  706,  711,  724,  746,  742,  761,  772,  774,  790,  790,
        808,  824,  847,  862,  884,  894,  899,  912,  940,  956,  976,
        964,  990,  990, 1020, 1010, 1028, 1031, 1043, 1067, 1082, 1073,
       1095, 1097, 1125, 1114, 1137, 1158, 1145, 1169, 1161, 1187, 1208,
       1223, 1222, 1251, 1270, 1269, 1290, 1280, 1305, 1304, 1331, 1324,
       1333, 1350, 1346, 1374, 1395, 1380, 1402,

In [113]:
check('tests/q5_3.py')
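
The comparison of a cumulative sum to a range can be sketched on the first three waiting times from the worked example (79, 54, 74). The variable names here are illustrative.

```python
import numpy as np

waits = np.array([79, 54, 74])              # first three waiting times from the text

eruptions = np.arange(1, len(waits) + 1)    # k = 1, 2, 3
expected = 60 * eruptions                   # 60, 120, 180
actual = np.cumsum(waits)                   # 79, 133, 207
difference = abs(actual - expected)

print(difference)   # [19 13 27]
```

The element at index 2 is 27, matching the hand calculation $|207 - 180| = 27$.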

**Question 4.** If instead you guess that each waiting time will be the same as the previous waiting time, by how many minutes would your guess differ from the actual time on average, averaging over every waiting time except the first one?

For example, since the first three waiting times are 79, 54, and 74, the average difference between your guess and the actual time for just the second and third eruption would be $\frac{|79-54|+ |54-74|}{2} = 22.5$.

In [114]:
average_error = np.average(abs(np.diff(waiting_times)))
average_error

20.52029520295203

In [115]:
check('tests/q5_4.py')
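
The hand calculation for the first three waiting times can be reproduced as a quick check. This sketch uses only the values 79, 54, and 74 from the example, not the full dataset.

```python
import numpy as np

waits = np.array([79, 54, 74])    # first three waiting times from the text

errors = abs(np.diff(waits))      # |54 - 79|, |74 - 54|
mean_error = np.average(errors)

print(errors)       # [25 20]
print(mean_error)   # 22.5
```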

## 6. Tables


**Question 1.** Suppose you have 4 apples, 3 oranges, and 3 pineapples.  (Perhaps you're using Python to solve a high school Algebra problem.)  Create a table that contains this information.  It should have two columns: "fruit name" and "count".  Give it the name `fruits`.

**Note:** Use lower-case and singular words for the name of each fruit, like `"apple"`.

In [116]:
# Our solution uses 1 statement split over 3 lines.
fruits = Table().with_columns(
        "fruit name", make_array("apple", "orange", "pineapple"),
        "count", make_array(4, 3, 3))
fruits

fruit name,count
apple,4
orange,3
pineapple,3


In [117]:
check('tests/q6_1.py')

**Question 2.** The file `inventory.csv` contains information about the inventory at a fruit stand.  Each row represents the contents of one box of fruit.  Load it as a table named `inventory`.

In [118]:
inventory = Table.read_table("inventory.csv")
inventory

box ID,fruit name,count
53686,kiwi,45
57181,strawberry,123
25274,apple,20
48800,orange,35
26187,strawberry,255
57930,grape,517
52357,strawberry,102
43566,peach,40


In [119]:
check('tests/q6_2.py')

**Question 3.** Does each box at the fruit stand contain a different fruit?

In [120]:
# Set all_different to "Yes" if each box contains a different fruit or to "No" if multiple boxes contain the same fruit
all_different = "No"
all_different

'No'

In [121]:
check('tests/q6_3.py')

**Question 4.** The file `sales.csv` contains the number of fruit sold from each box last Saturday.  It has an extra column called "price per fruit (\$)" that's the price *per item of fruit* for fruit in that box.  The rows are in the same order as the `inventory` table.  Load these data into a table called `sales`.

In [122]:
sales = Table.read_table("sales.csv")
sales

box ID,fruit name,count sold,price per fruit ($)
53686,kiwi,3,0.5
57181,strawberry,101,0.2
25274,apple,0,0.8
48800,orange,35,0.6
26187,strawberry,25,0.15
57930,grape,355,0.06
52357,strawberry,102,0.25
43566,peach,17,0.8


In [123]:
check('tests/q6_4.py')

**Question 5.** How many fruits did the store sell in total on that day?

In [124]:
total_fruits_sold = sum(sales.column("count sold"))
total_fruits_sold

638

In [125]:
check('tests/q6_5.py')

**Question 6.** What was the store's total revenue (the total price of all fruits sold) on that day?

*Hint:* If you're stuck, think first about how you would compute the total revenue from just the grape sales.

In [126]:
total_revenue = sum(sales.column(2)*sales.column(3))
total_revenue

106.85

In [127]:
check('tests/q6_6.py')
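
The revenue calculation is an elementwise product followed by a sum. The sketch below copies the "count sold" and "price per fruit ($)" values from the `sales` table shown above into plain NumPy arrays so that it runs without the CSV file. The variable names are illustrative.

```python
import numpy as np

# Values copied from the sales table above, in row order.
count_sold = np.array([3, 101, 0, 35, 25, 355, 102, 17])
price_each = np.array([0.5, 0.2, 0.8, 0.6, 0.15, 0.06, 0.25, 0.8])

revenue_per_box = count_sold * price_each   # one revenue figure per box
grape_revenue = revenue_per_box.item(5)     # the grape box is the sixth row
revenue_sum = sum(revenue_per_box)

print(round(grape_revenue, 2))   # 21.3
print(round(revenue_sum, 2))     # 106.85
```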

**Question 7.** Make a new table called `remaining_inventory`.  It should have the same rows and columns as `inventory`, except that the amount of fruit sold from each box should be subtracted from that box's count, so that the "count" is the amount of fruit remaining after Saturday.

In [128]:
remaining_inventory = inventory.with_column(
    "count", inventory.column("count") - sales.column("count sold"))
remaining_inventory

box ID,fruit name,count
53686,kiwi,42
57181,strawberry,22
25274,apple,20
48800,orange,0
26187,strawberry,230
57930,grape,162
52357,strawberry,0
43566,peach,23


In [129]:
check('tests/q6_7.py')

## 7. Submission


Once you're finished, select "Save and Checkpoint" in the File menu and then execute the `submit` cell below. If you submit more than once before the deadline, we will only grade your final submission.

In [130]:
# For your convenience, you can run this cell to run all the tests at once!
import glob
from gofer.ok import grade_notebook
if not globals().get('__GOFER_GRADER__', False):
    display(grade_notebook('hw02.ipynb', sorted(glob.glob('tests/q*.py'))))

with_commas: Eats, Shoots, and Leaves
without_commas: Eats Shoots and Leaves
6594 663168 6660320568 -39250
Old Faithful erupts every 43 to 96 minutes and every 70.8970588235294 minutes on average.
['tests/q1_1.py', 'tests/q1_2.py', 'tests/q1_3.py', 'tests/q2_1.py', 'tests/q2_2.py', 'tests/q2_3.py', 'tests/q2_4.py', 'tests/q2_5.py', 'tests/q3_1.py', 'tests/q3_2.py', 'tests/q3_3.py', 'tests/q3_4.py', 'tests/q3_5.py', 'tests/q4_1.py', 'tests/q4_2.py', 'tests/q5_1.py', 'tests/q5_2.py', 'tests/q5_3.py', 'tests/q5_4.py', 'tests/q6_1.py', 'tests/q6_2.py', 'tests/q6_3.py', 'tests/q6_4.py', 'tests/q6_5.py', 'tests/q6_6.py', 'tests/q6_7.py']
1.0