In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("lab02.ipynb")

# Lab 2: Arrays and Tables

**Helpful Resources:**
- [Python Reference](http://www.cs.williams.edu/~cs104/python-library-ref.html): Cheat sheet of helpful library methods.

**Readings:**
- [Ch 4. Datatypes](https://inferentialthinking.com/chapters/04/Data_Types.html)
- [Ch 5.1. Arrays](https://inferentialthinking.com/chapters/05/1/Arrays.html) 
- [Ch 6. Tables](https://inferentialthinking.com/chapters/06/Tables.html)

Please complete this notebook by filling in the cells provided. Before you begin, execute the previous cell to load the provided tests.

For every problem that asks for a written explanation, you **must** provide your answer in the designated space. Moreover, throughout this assignment and all future ones, please do not reassign variables! For example, if you use `max_temperature` in your answer to one question, do not reassign it later on. Otherwise, you will fail tests that you previously passed!

**Note: This assignment has hidden tests. Even if the visible tests report 100% passed, your final grade may not be 100%. We will run more correctness tests once everyone has turned in the assignment.**

In [None]:
# Run this cell to set up the notebook, but please don't change it.

# These lines import the Numpy and Datascience modules.
import numpy as np
from datascience import *

# These lines make plots look nice and hide some messy Python warnings.
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import warnings
warnings.simplefilter('ignore', FutureWarning)
warnings.simplefilter('ignore', np.VisibleDeprecationWarning)

## 1. Table Review and Analyzing a Dataset (10 pts)


Now that you're familiar with table operations, let’s answer an interesting question about a dataset!  Run the cell below to load the `imdb` table. It contains information about the 250 highest-rated movies on IMDb.

In [None]:
# Just run this cell

imdb = Table.read_table('imdb.csv')
imdb

Often, we want to perform multiple operations - sorting, filtering, or others - in order to turn a table we have into something more useful. You can do these operations one by one, e.g.

```
first_step = original_tbl.where("col1", are.equal_to(12))
second_step = first_step.sort("col2", descending=True)
```

However, since the value of the expression `original_tbl.where("col1", are.equal_to(12))` is itself a table, you can just call a table method on it:

```
original_tbl.where("col1", are.equal_to(12)).sort("col2", descending=True)
```
You should organize your work in the way that makes the most sense to you, using informative names for any intermediate tables you create. 

#### Part 1.1 (5 pts)


Create a table of movies released between 2010 and 2016 (inclusive) with ratings above 8. The table should only contain the columns `Title` and `Rating`, **in that order**.

Assign the table to the name `above_eight`.

*Hint:* Think about the steps you need to take, and try to put them in an order that makes sense. Feel free to create intermediate tables for each step, but please make sure you assign your final table the name `above_eight`!


In [None]:
above_eight = ...
above_eight

In [None]:
grader.check("q1.1")

#### Part 1.2 (5 pts)


Use `num_rows` (and arithmetic) to find the *proportion* of movies in the dataset that were released 1900-1999, and the *proportion* of movies in the dataset that were released in the year 2000 or later.

Assign `proportion_in_20th_century` to the proportion of movies in the dataset that were released 1900-1999, and `proportion_in_21st_century` to the proportion of movies in the dataset that were released in the year 2000 or later.

*Hint:* The *proportion* of movies released in the 1900's is the *number* of movies released in the 1900's, divided by the *total number* of movies.
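As a quick illustration of the hint, here is a minimal sketch of the arithmetic. The counts and names below are hypothetical and are not taken from the `imdb` table.

```python
# Hypothetical counts, chosen only to illustrate the arithmetic.
total_items = 40
items_in_group = 10

# A proportion is a part divided by the whole, so it lies between 0 and 1.
proportion_in_group = items_in_group / total_items
print(proportion_in_group)  # 0.25
```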


In [None]:
num_movies_in_dataset = ...
num_in_20th_century = ...
num_in_21st_century = ...
proportion_in_20th_century = ...
proportion_in_21st_century = ...
print("Proportion in 20th century:", proportion_in_20th_century)
print("Proportion in 21st century:", proportion_in_21st_century)

In [None]:
grader.check("q1.2")

## 2. Creating Arrays (15 pts)



#### Part 2.1 (5 pts)


Make an array called `numbers` containing the following numbers (in the given order):

1. -2
2. the floor of 12.6
3. 3
4. 5 to the power of the ceil of 5.3

*Hint:* `floor` and `ceil` are functions in the `math` module. Importing modules is covered in 2.1 of Lab 2!

*Note:* Python lists are different/behave differently than NumPy arrays. In Data 8, we use NumPy arrays, so please make an **array**, not a Python list.
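The sketch below shows `floor` and `ceil` on numbers that are deliberately different from the ones in this problem, and contrasts an array with a list. It builds the array with `np.array`; the assumption here is that `make_array` from the `datascience` module would produce an equivalent array.

```python
import math
import numpy as np

# math.floor rounds down to the nearest integer; math.ceil rounds up.
rounded_down = math.floor(7.9)   # 7
rounded_up = math.ceil(2.1)      # 3

# A NumPy array holding two computed values.
example_array = np.array([rounded_down, 2 ** rounded_up])

# A Python list with the same contents is a different kind of object.
example_list = [rounded_down, 2 ** rounded_up]

print(example_array.tolist())  # [7, 8]
print(type(example_array))     # <class 'numpy.ndarray'>
print(type(example_list))      # <class 'list'>
```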

In [None]:
# Our solution involved one extra line of code before creating
# numbers.
...
numbers = ...
numbers

In [None]:
grader.check("q2.1")

#### Part 2.2 (5 pts)


Make an array called `book_title_words` containing the following three strings: "Eats", "Shoots", and "and Leaves".

In [None]:
book_title_words = ...
book_title_words

In [None]:
grader.check("q2.2")

#### Part 2.3 (5 pts)


Strings have a method called `join`.  `join` takes one argument, an array of strings.  It returns a single string.  Specifically, the value of `a_string.join(an_array)` is a single string that's the [concatenation](https://en.wikipedia.org/wiki/Concatenation) ("putting together") of all the strings in `an_array`, **with** `a_string` inserted in between each string.
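To see how `join` behaves, here is a small sketch using made-up words that are unrelated to this problem. The names `example_words`, `joined_with_dash`, and `joined_with_nothing` are hypothetical.

```python
import numpy as np

# Hypothetical words, chosen only for illustration.
example_words = np.array(["red", "green", "blue"])

# The string before .join is placed between each pair of neighboring elements.
joined_with_dash = "-".join(example_words)

# Joining with the empty string puts nothing between the elements.
joined_with_nothing = "".join(example_words)

print(joined_with_dash)     # red-green-blue
print(joined_with_nothing)  # redgreenblue
```

Note that the separator only appears *between* elements, never before the first one or after the last one.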

Use the array `book_title_words` and the method `join` to make two strings:

1. "Eats, Shoots, and Leaves" (call this one `with_commas`)
2. "Eats Shoots and Leaves" (call this one `without_commas`)

In [None]:
with_commas = ...
without_commas = ...

# These lines are provided just to print out your answers.
print('with_commas:', with_commas)
print('without_commas:', without_commas)

In [None]:
grader.check("q2.3")

## 3. Indexing Arrays (25 pts)



These exercises give you practice accessing individual elements of arrays with the `array.item(index)` method.  In Python, each element is accessed by its *index*; for example, the first element is the element at index 0. Indices must be **integers**.

***Note:* If you have previous coding experience, you may be familiar with bracket notation. DO NOT use bracket notation when indexing (e.g. `arr[0]`), because it can return a different data type than the one the autograder expects. This can cause you to fail an autograder test.**

Be sure to refer to the [Python Reference](http://data8.org/fa21/python-reference.html) on the website if you feel stuck!
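The difference between `item` and bracket notation can be seen in a minimal sketch like the one below. The array and variable names are hypothetical, and the array is built with `np.array` rather than `make_array`.

```python
import numpy as np

example_array = np.array([10, 20, 30])

# .item(index) returns a plain Python value.
first_with_item = example_array.item(0)

# Bracket notation returns a NumPy scalar instead.
first_with_brackets = example_array[0]

print(type(first_with_item))                        # <class 'int'>
print(isinstance(first_with_brackets, np.integer))  # True
```

Both expressions refer to the same element, but only `item` gives back an ordinary Python `int`.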

#### Part 3.1 (5 pts)


The cell below creates an array of some numbers.  Set `third_element` to the third element of `some_numbers`.

In [None]:
some_numbers = make_array(-1, -3, -6, -10, -15)

third_element = ...
third_element

In [None]:
grader.check("q3.1")

#### Part 3.2 (5 pts)


The next cell creates a table that displays some information about the elements of `some_numbers` and their order.  Run the cell to see the partially-completed table, then fill in the missing information (the cells that say "Ellipsis") by assigning `blank_a`, `blank_b`, `blank_c`, and `blank_d` to the correct elements in the table.

*Hint:* Replace the `...` with strings or numbers. As a reminder, indices should be **integers**.

In [None]:
blank_a = ...
blank_b = ...
blank_c = ...
blank_d = ...
elements_of_some_numbers = Table().with_columns(
    "English name for position", make_array("first", "second", blank_a, blank_b, "fifth"),
    "Index",                     make_array(blank_c, 1, 2, blank_d, 4),
    "Element",                   some_numbers)
elements_of_some_numbers

In [None]:
grader.check("q3.2")

#### Part 3.3 (5 pts)


You'll sometimes want to find the **last** element of an array.  Suppose an array has 142 elements.  What is the index of its last element?

In [None]:
index_of_last_element = ...

In [None]:
grader.check("q3.3")

#### Part 3.4 (5 pts)


More often, you don't know the number of elements in an array, its *length*.  (For example, it might be a large dataset you found on the Internet.)  The function `len` takes a single argument, an array, and returns the `len`gth of that array (an integer).
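As a small sketch of how `len` relates to indices, consider a hypothetical four-element array (not one of the arrays used in this lab):

```python
import numpy as np

example_array = np.array([5, 8, 13, 21])

# len gives the number of elements in the array.
length = len(example_array)

# Indices start at 0, so the last index is one less than the length.
last_element = example_array.item(length - 1)

print(length, last_element)  # 4 21
```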

The cell below loads an array called `president_birth_years`.  Calling `tbl.column(...)` on a table returns an array of the column specified, in this case the `Birth Year` column of the `president_births` table. The last element in that array is the most recent among the birth years of all the deceased Presidents. Assign that year to `most_recent_birth_year`.

In [None]:
president_birth_years = Table.read_table("president_births.csv").column('Birth Year')

most_recent_birth_year = ...
most_recent_birth_year

In [None]:
grader.check("q3.4")

#### Part 3.5 (5 pts)


Finally, assign `min_of_birth_years` to the minimum of the first, sixteenth, and last birth years listed in `president_birth_years`.

In [None]:
min_of_birth_years = min(president_birth_years.item(0), president_birth_years.item(15), most_recent_birth_year)
min_of_birth_years

In [None]:
grader.check("q3.5")

## 4. Basic Array Arithmetic (25 pts)



#### Part 4.1 (5 pts)


Multiply the numbers 42, -4224, 424224242, and 250 by 157. Assign each variable below such that `first_product` is assigned to the result of $42 * 157$, `second_product` is assigned to the result of $-4224 * 157$, and so on.

For this question, **don't** use arrays.

In [None]:
first_product = ...
second_product = ...
third_product = ...
fourth_product = ...
print(first_product, second_product, third_product, fourth_product)

In [None]:
grader.check("q4.1")

#### Part 4.2 (5 pts)


Now, do the same calculation, but using an array called `numbers` and only a single multiplication (`*`) operator.  Store the 4 results in an array named `products`.
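Array arithmetic applies an operation to every element at once. The sketch below uses hypothetical numbers and a different multiplier from the ones in this problem.

```python
import numpy as np

example_numbers = np.array([1, -2, 30])

# A single * multiplies every element of the array by 4.
example_products = example_numbers * 4

print(example_products.tolist())  # [4, -8, 120]
```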

In [None]:
numbers = ...
products = ...
products

In [None]:
grader.check("q4.2")

#### Part 4.3 (5 pts)


Oops, we made a typo!  Instead of 157, we wanted to multiply each number by 1577.  Compute the correct products in the cell below using array arithmetic.  Notice that your job is really easy if you previously defined an array containing the 4 numbers.

In [None]:
correct_products = ...
correct_products

In [None]:
grader.check("q4.3")

#### Part 4.4 (5 pts)


We've loaded an array of temperatures in the next cell.  Each number is the highest temperature observed on a day at a climate observation station, mostly from the US.  Since they're from the US government agency [NOAA](https://www.noaa.gov/), all the temperatures are in Fahrenheit.  Convert them all to Celsius by first subtracting 32 from them, then multiplying the results by $\frac{5}{9}$. After converting to Celsius, make sure to **ROUND** the final result to the nearest integer using the `np.round` function.
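As a sanity check on the conversion formula, here is a sketch on three hypothetical Fahrenheit readings (the freezing point of water, the boiling point of water, and 98.6). These values are illustrative and are not taken from `temperatures.csv`.

```python
import numpy as np

# Hypothetical readings in Fahrenheit.
example_fahrenheit = np.array([32.0, 212.0, 98.6])

# Subtract 32 first, then multiply by 5/9; both steps apply to every element.
example_celsius = (example_fahrenheit - 32) * 5 / 9

# np.round rounds each element to the nearest integer value (still stored as floats).
example_rounded = np.round(example_celsius)

print(example_rounded.tolist())  # [0.0, 100.0, 37.0]
```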

In [None]:
max_temperatures = Table.read_table("temperatures.csv").column("Daily Max Temperature")

celsius_max_temperatures = ...
celsius_max_temperatures

In [None]:
grader.check("q4.4")

#### Part 4.5 (5 pts)


The cell below loads all the *lowest* temperatures from each day (in Fahrenheit).  Compute the daily temperature range for each day. That is, compute the difference between each daily maximum temperature and the corresponding daily minimum temperature.  **Pay attention to the units and give your answer in Celsius!** Make sure **NOT** to round your answer for this question!

*Note:* Remember that in the previous part, `celsius_max_temperatures` was rounded, so you might not want to use that in this question.

In [None]:
min_temperatures = Table.read_table("temperatures.csv").column("Daily Min Temperature")

celsius_temperature_ranges = ...
celsius_temperature_ranges

In [None]:
grader.check("q4.5")

## 5. Old Faithful (20 pts)



[Old Faithful](https://en.wikipedia.org/wiki/Old_Faithful) is a geyser in Yellowstone that erupts every 44 to 125 minutes. People are [often told that the geyser erupts every hour](http://yellowstone.net/geysers/old-faithful/), but in fact the waiting time between eruptions is more variable. Let's take a look.

#### Part 5.1 (5 pts)


The first line below assigns `waiting_times` to an array of 272 consecutive waiting times between eruptions, taken from a classic 1938 dataset. Assign the names `shortest`, `longest`, and `average` so that the `print` statement is correct.

In [None]:
waiting_times = Table.read_table('old_faithful.csv').column('waiting')

shortest = ...
longest = ...
average = ...

print("Old Faithful erupts every", shortest, "to", longest, "minutes and every", average, "minutes on average.")

In [None]:
grader.check("q5.1")

#### Part 5.2 (5 pts)


Assign `biggest_decrease` to the biggest decrease in waiting time between two consecutive eruptions. For example, the third eruption occurred after 74 minutes and the fourth after 62 minutes, so the decrease in waiting time was 74 - 62 = 12 minutes.

*Hint*: We want to return the absolute value of the biggest decrease.
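To see the sign convention of `np.diff`, the sketch below applies it to the first four waiting times quoted in Part 5.4 (79, 54, 74, and 62). The names `first_four` and `example_differences` are hypothetical.

```python
import numpy as np

first_four = np.array([79, 54, 74, 62])

# Each entry is (next value) - (previous value): 54 - 79, 74 - 54, 62 - 74.
example_differences = np.diff(first_four)

print(example_differences.tolist())  # [-25, 20, -12]
```

A decrease in waiting time shows up as a negative number, and the result has one fewer element than the input.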

In [None]:
# np.diff() calculates the difference between subsequent values  
# in a NumPy array.
differences = np.diff(waiting_times) 
biggest_decrease = ...
biggest_decrease

In [None]:
grader.check("q5.2")

#### Part 5.3 (5 pts)


Suppose the surveyors started watching Old Faithful at the start of the first eruption. Assume that they watch until the end of the tenth eruption. For some of that time they will be watching eruptions, and for the rest of the time they will be waiting for Old Faithful to erupt. How many minutes will they spend waiting for eruptions?

*Hint:* One way to approach this problem is to use the `take` or `where` method on the table `faithful`. 

*Another Hint:* `first_nine_waiting_times` must be an array.

In [None]:
faithful = Table.read_table('old_faithful.csv')

faithful_with_eruption_nums = ...
first_nine_waiting_times = ...
total_waiting_time_until_tenth = ...
total_waiting_time_until_tenth

In [None]:
grader.check("q5.3")

#### Part 5.4 (5 pts)


Let’s imagine your guess for the next waiting time was always just the length of the previous waiting time. If you always guessed the previous waiting time, how big would your error in guessing the waiting times be, on average?

For example, since the first four waiting times are 79, 54, 74, and 62, the average difference between your guess and the actual time for just the second, third, and fourth eruptions would be $\frac{|79-54|+ |54-74|+ |74-62|}{3} = 19$.
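The worked example above can be checked with a short sketch. It uses only the four waiting times quoted in the example, and the variable names are hypothetical.

```python
import numpy as np

first_four = np.array([79, 54, 74, 62])

# The size of each error is the absolute value of each consecutive difference.
absolute_errors = np.abs(np.diff(first_four))  # 25, 20, 12

# The average error is the mean of those absolute values.
example_average_error = np.mean(absolute_errors)

print(example_average_error)  # 19.0
```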

In [None]:
differences = np.diff(waiting_times)
average_error = ...
average_error

In [None]:
grader.check("q5.4")

## 6. Manipulating Tables (35 pts)



#### Part 6.1 (5 pts)


Suppose you have 4 apples, 3 oranges, and 3 pineapples.  Create a table that contains this information.  It should have two columns: `fruit name` and `count`.  Assign the new table to the variable `fruits`.

**Note:** Use lower-case and singular words for the name of each fruit, like `"apple"`.

In [None]:
# Our solution uses 1 statement split over several lines.
fruits = ...
    ...
    ...
...
fruits

In [None]:
grader.check("q6.1")

#### Part 6.2 (5 pts)


The file `inventory.csv` contains information about the inventory at a fruit stand.  Each row represents the contents of one box of fruit. Load it as a table named `inventory` using the `Table.read_table()` function. `Table.read_table(...)` takes one argument (data file name in string format) and returns a table.

In [None]:
inventory = ...
inventory

In [None]:
grader.check("q6.2")

#### Part 6.3 (5 pts)


Does each box at the fruit stand contain a different fruit? Set `all_different` to `True` if each box contains a different fruit or to `False` if multiple boxes contain the same fruit.

*Hint:* You don't have to write code to calculate the True/False value for `all_different`. Just look at the `inventory` table and assign `all_different` to either `True` or `False` according to what you can see from the table in answering the question.

In [None]:
all_different = ...
all_different

In [None]:
grader.check("q6.3")

#### Part 6.4 (5 pts)


The file `sales.csv` contains the number of fruit sold from each box in one day.  It has an extra column called "price per fruit (\$)" that's the price *per item of fruit* for fruit in that box.  The rows are in the same order as the `inventory` table.  Load these data into a table called `sales`.

In [None]:
sales = ...
sales

In [None]:
grader.check("q6.4")

#### Part 6.5 (5 pts)


How many fruits did the store sell in total on that day?

In [None]:
total_fruits_sold = ...
total_fruits_sold

In [None]:
grader.check("q6.5")

#### Part 6.6 (5 pts)


What was the store's total revenue (the total price of all fruits sold) on that day?

*Hint:* If you're stuck, think first about how you would compute the total revenue from just the grape sales.
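One way to think about the hint: revenue from a single box is the number of items sold times the price per item, and the total is the sum over all boxes. The sketch below uses hypothetical counts and prices that are not taken from `sales.csv`.

```python
import numpy as np

# Hypothetical data for three boxes.
example_counts = np.array([3, 10, 2])
example_prices = np.array([0.5, 0.25, 2.0])

# Multiplying two arrays of the same length pairs up their elements.
revenue_per_box = example_counts * example_prices  # 1.5, 2.5, 4.0

# Adding the per-box revenues gives the total.
example_total = np.sum(revenue_per_box)

print(example_total)  # 8.0
```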

In [None]:
total_revenue = ...
total_revenue

In [None]:
grader.check("q6.6")

#### Part 6.7 (5 pts)


Make a new table called `remaining_inventory`.  It should have the same rows and columns as `inventory`, except that the amount of fruit sold from each box should be subtracted from that box's **original** count, so that the "count" is **updated to be** the amount of fruit remaining after that day's sales.

In [None]:
remaining_inventory = ...
    ...
...
...

remaining_inventory

In [None]:
grader.check("q6.7")

## 7. You're Done!


**Important submission information:** Follow these steps to submit your work:
* Run the tests and verify that they pass as you expect. 
* Choose **Save Notebook** from the **File** menu.
* **Run the final cell** and click the link below to download the zip file. 

Once you have downloaded that file, go to [Gradescope](https://www.gradescope.com/) and submit the zip file to the corresponding assignment. The name of this assignment is "Lab 2 Autograder". **Be sure your work is saved before running the last cell!**

Once you have submitted, your Gradescope assignment should show you passing all the tests you passed in your assignment notebook.

---

To double-check your work, the cell below will rerun all of the autograder tests.

In [None]:
grader.check_all()

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export()