In [None]:
from datascience import *
import numpy as np 

%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')

from IPython.display import Image

# Lecture 7

## Class Data Survey

Please fill out the following survey before Wednesday's lecture:
https://forms.gle/wv67cXJN6o83vqDbA

We will use the data as a (hopefully) interesting case study for learning the `group` method next class. (Individuals will not be identified, and you are free to skip whatever questions you'd like.)

## Histograms

We will briefly review histograms. First, load the table of the highest-grossing movies as of 2017, and add a column containing the age of each movie:

In [None]:
# Highest grossing movies as of 2017
top_movies = Table.read_table('data/top_movies_2017.csv')

# Add a column of ages
ages = 2023 - top_movies.column('Year')
top_movies = top_movies.with_column('Age', ages)

top_movies 

Remember that *bins* are defined by an array, and we can use the `hist` method to visualize the distribution of ages with a histogram:

In [None]:
# Define the bins [0, 5), [5, 10), [10, 15), [15, 25), [25, 40), [40, 65), [65, 102]
my_bins = make_array(0, 5, 10, 15, 25, 40, 65, 102)

top_movies.hist('Age', bins = my_bins, unit = 'Year')
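
To make the binning rule concrete, here is a minimal sketch in plain Python that counts how many values land in each bin. It assumes the NumPy-style convention where every bin includes its left edge but not its right edge, except the last bin, which includes both. The function `count_in_bins` and the `sample_ages` values are made up for illustration and are not part of the `datascience` library or the movie data. Note that `hist` plots densities rather than raw counts; this sketch only shows which bin each value belongs to.

```python
from bisect import bisect_right

def count_in_bins(values, bin_edges):
    """Count how many values fall in each bin defined by consecutive edges."""
    counts = [0] * (len(bin_edges) - 1)
    for v in values:
        if v < bin_edges[0] or v > bin_edges[-1]:
            continue  # values outside the overall bin range are ignored
        # Index of the bin whose left edge is the largest edge <= v
        i = bisect_right(bin_edges, v) - 1
        # The right-most edge belongs to the last bin (closed on both sides)
        i = min(i, len(counts) - 1)
        counts[i] += 1
    return counts

edges = [0, 5, 10, 15, 25, 40, 65, 102]
sample_ages = [3, 5, 9, 10, 24, 25, 64, 65, 102]  # hypothetical ages, not from the CSV
counts = count_in_bins(sample_ages, edges)
print(counts)  # [1, 2, 1, 1, 1, 1, 2]
```

An age of exactly 5 is counted in the bin [5, 10), not [0, 5), while an age of 102 is counted in the last bin.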

If we don't want to specify bins ourselves, we can omit the `bins = my_bins` argument, and let the `hist` method select bins for us:

In [None]:
top_movies.hist('Age', unit = 'Year')

## Lists

Much like arrays, lists are another way of storing sequences in Python. You can create a new list using square brackets `[` and `]`:

In [None]:
# Define our first list
L = [42, 'capybara', min, make_array(1, 2, 3, 4)]

This list has 4 elements. You can confirm this using the `len` function:

In [None]:
len(L)

Individual elements of a list can also be accessed using square brackets:

In [None]:
L[0]

In [None]:
L[1]

**Question:** the last element of `L` is an array. How can you look up the first element of this array?

In [None]:
# ...

We can use lists to add new rows to a table. Let's load a table of top movies each year:

In [None]:
movies = Table.read_table('data/movies_by_year.csv')
movies

The table is missing data from 2023. So far this year, 91 movies have been released, with a total gross of $672,638,766. While *Avatar: The Way of Water* was released in 2022, it is still the highest-grossing film of the year 2023. We can append this information to the table using the `with_row` method:

In [None]:
# Create a list with the relevant data
new_row = [2023, 672638766, 91, 'Avatar: The Way of Water']
movies = movies.with_row(new_row)

# Sort the table so that the 2023 row appears on top
movies = movies.sort('Year', descending=True)
movies

## Functions

Let's start out by defining a simple function. Given an array of numbers `values`, define `spread(values)` to be the difference between the max and min value:

In [None]:
def spread(values):
    """
    Takes an array of values and computes the difference between the max and min values.
    (Using triple quotes, I can write a string on multiple lines.)
    This string serves as a comment, called a 'docstring.'
    """
    return max(values) - min(values)

In [None]:
help(spread)

In [None]:
ages = make_array(18, 20, 22, 32)
spread(ages)
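
As a quick check, the sketch below redefines `spread` with a shorter docstring and calls it on a plain Python list instead of an array. This works because the built-in `max` and `min` accept any sequence of numbers. It also prints the `__doc__` attribute, which is where Python stores the docstring that `help` displays.

```python
def spread(values):
    """Return the difference between the largest and smallest value."""
    return max(values) - min(values)

# The same ages as above, stored in a list rather than an array
print(spread([18, 20, 22, 32]))  # 14

# With a single value, the max and min are the same
print(spread([7]))               # 0

# The docstring is stored on the function itself
print(spread.__doc__)
```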

Now try writing a function yourself.

**Question:** define a function `triple` that takes a single input `x` and triples its value:

In [None]:
# ...

In [None]:
triple(3)

In [None]:
num = 4
triple(num)

In [None]:
triple(num * 5)

**Discussion question:** What does the function below do?
What kind of input does it take?
What output will it give?
What's a reasonable name?

In [None]:
def f(s):
    return np.round(s / sum(s) * 100, 2)

In [None]:
# Suggestion: try breaking up the function into several steps
# Test out what each step does with different values of s
# ...

Functions can also take in multiple arguments! Remember the Pythagorean Theorem? Let's define a function to compute the hypotenuse length for a right triangle with side lengths $x$ and $y$:

$$h = \sqrt{x^2 + y^2}$$

In [None]:
def hypotenuse(x, y):
    """
    Compute the length of the hypotenuse for a right triangle with side lengths x and y.
    """
    hypot_squared = x ** 2 + y ** 2
    return np.sqrt(hypot_squared)

In [None]:
hypotenuse(9, 12)

In [None]:
hypotenuse(2, 2)
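
One way to sanity-check a function like this is to compare it against an independent calculation. The sketch below is a standalone version that uses `math.sqrt` from the standard library in place of `np.sqrt` (a substitution made here so the example needs no extra packages) and compares the result with the built-in `math.hypot`.

```python
import math

def hypotenuse(x, y):
    """Compute the hypotenuse length for a right triangle with side lengths x and y."""
    hypot_squared = x ** 2 + y ** 2
    return math.sqrt(hypot_squared)

# 9-12-15 is a scaled-up 3-4-5 right triangle
print(hypotenuse(9, 12))  # 15.0

# Compare with the standard library's own hypotenuse function
print(math.isclose(hypotenuse(2, 2), math.hypot(2, 2)))  # True
```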

**Question:** If you drive $x$ miles in $t$ hours, then your average speed is $x / t$ miles per hour. Write a function called `average_speed` with two arguments `x` and `t` that returns average speed.

In [None]:
# ...

## Apply

The table method `apply` calls a function on every entry of a column (or multiple columns), returning an array with the results. Let's look at some basic examples.

Earlier, we loaded the table of top movies as of 2017, then added a column containing the age of each movie:

In [None]:
# Load the table of top movies as of 2017
# Remember that this table doesn't have a column for movie ages; we had to calculate that ourselves
top_movies_without_age = Table.read_table('data/top_movies_2017.csv')

# Select only the top 10 movies so that we don't print out 200 rows
top10 = top_movies_without_age.sort('Gross (Adjusted)', descending=True).take(np.arange(10))
top10

One way to get a column of ages (which we did earlier in this notebook) is to first extract the column of years as an array, then subtract that array from 2023:

In [None]:
ages = 2023 - top10.column('Year')
top10.with_column('Age', ages)  

Another approach is to use the `apply` method.

**Question:** define a function called `get_age` which takes a single argument `movie_year` and returns the age of the movie.

In [None]:
# ...

We can then **apply** our new `get_age` function to each entry of the `Year` column, using the `apply` method:

In [None]:
# Apply takes two (or more) arguments:
# First is the name of the function to apply 
# The remaining arguments are names of columns that we are applying the function to
ages = top10.apply(get_age, 'Year')
ages

In [None]:
# Again, use the with_column method to add the new ages array as a column of the table
top10.with_column('Age', ages)
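
Conceptually, `apply` on a single column behaves like a loop that calls the function once per entry and collects the results. The sketch below imitates that idea in plain Python. The names `apply_sketch` and `decade` and the list of years are hypothetical stand-ins, not part of the `datascience` library or the movie table, and the real `apply` returns an array rather than a list.

```python
def apply_sketch(func, column_values):
    """Call func on each entry and collect the results in a list."""
    return [func(value) for value in column_values]

def decade(year):
    """Hypothetical helper: the decade a year falls in, e.g. 1977 -> 1970."""
    return year // 10 * 10

years = [1939, 1977, 1965, 1982]  # made-up values standing in for a 'Year' column

# Pass the function itself (no parentheses), just like with the apply method
print(apply_sketch(decade, years))  # [1930, 1970, 1960, 1980]
```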

The `apply` method also works with functions that take several arguments.

In [None]:
def get_movie_info(title, gross_adj, year):
    """
    Create a string explaining some basic information about a movie.
    Takes 3 arguments: the movie title, adjusted gross earnings, and year.
    """
    age = 2023 - year
    millions = gross_adj / 1000000
    millions_rounded_down = int(millions)
    return 'The movie ' + title + ' is ' + str(age) + ' years old and made over $' + str(millions_rounded_down) + ' million'

In [None]:
# Since get_movie_info takes 3 arguments, we need to name 3 columns when calling the apply method:
top10.apply(get_movie_info, 'Title', 'Gross (Adjusted)', 'Year')

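The order of columns matters! Values are passed to the function by position, so if the order of columns doesn't match the function signature `get_movie_info(title, gross_adj, year)`, each value lands in the wrong parameter and the results will be wrong, often without any error message: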

In [None]:
# Apply get_movie_info again, but with the columns out of order
top10.apply(get_movie_info, 'Title', 'Year', 'Gross (Adjusted)')
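
To see why the order matters, here is a rough plain-Python sketch of applying a function across several columns: one entry is taken from each column, in the order the columns were listed, and passed to the function by position. The helper `apply_sketch` and the one-row data are hypothetical and only imitate the idea; they are not how the `datascience` library is implemented.

```python
def apply_sketch(func, *columns):
    """Call func once per row, passing one entry from each column, in order."""
    return [func(*row) for row in zip(*columns)]

def get_movie_info(title, gross_adj, year):
    """Same logic as the function defined above."""
    age = 2023 - year
    millions_rounded_down = int(gross_adj / 1000000)
    return ('The movie ' + title + ' is ' + str(age) +
            ' years old and made over $' + str(millions_rounded_down) + ' million')

# Made-up one-row "columns" for illustration
titles = ['Movie A']
gross = [1500000000]
years = [1990]

right = apply_sketch(get_movie_info, titles, gross, years)
wrong = apply_sketch(get_movie_info, titles, years, gross)

print(right[0])  # The movie Movie A is 33 years old and made over $1500 million
print(wrong[0])  # The movie Movie A is -1499997977 years old and made over $0 million
```

In the second call the year is treated as the gross and the gross as the year, so Python runs without complaint but the sentence is nonsense.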

## Challenge Questions

Suppose that we are investigating different models of hybrid cars. A table of hybrid car specs can be read in from `data/cars2016.csv`:

In [None]:
cars = Table.read_table('data/cars2016.csv')
cars

**Question 1:** generate a histogram to visualize the distribution of highway MPG for models from the year 2012.

In [None]:
# ...

The `cars` table has a column `year`, making it easy to select only rows from the year 2012. But it's not always this easy when working with real-world datasets. What if the `cars2016.csv` file didn't have a column for years? Let's drop this column using the `drop` method:

In [None]:
cars_without_year = cars.drop('year')
cars_without_year

Can we still guess the year for each row in the table? Take a look at the `name` column! Let's look at the value of `name` for several rows in the table:

In [None]:
names_array = cars_without_year.column('name')
print(names_array.item(0))
print(names_array.item(100))
print(names_array.item(1000))
print(names_array.item(5000))

Do you see a pattern?

Given a string `s` with multiple words (separated by spaces), you can use the string method `s.split()` to get a list of each word:

In [None]:
"2009 Audi A3 3.2".split()

**Question 2:** define a function called `get_year` with a single argument `name`. The function should take a name from the `cars` table, e.g. "2009 Audi A3 3.2", and return a reasonable guess for the year of the car, as an int. 

In [None]:
# ...

**Question 3:** using the `apply` method and your new `get_year` function, create an array of ints called `inferred_years`. Add this array to the `cars_without_year` table as a column called `inferred_year`.

In [None]:
# ...

Instead of highway MPG, let's look at *combined MPG*. The EPA defines combined MPG as a specific weighted average of city MPG and highway MPG. Given numbers `city_mpg` and `highway_mpg`, the formula is `combined_mpg = 0.55 * city_mpg + 0.45 * highway_mpg`.

**Question 4:** by defining your own function to calculate `combined_mpg` from two arguments, and using the `apply` method, create an array of combined MPGs. Add this array to the `cars_without_year` table as a column called `combined_mpg`.

In [None]:
# ...

**Bonus challenge question:** we've been careful to refer to the years in the `inferred_year` column as guesses: our approach to guessing the years is reasonable, but it might not be 100% accurate. Can you calculate the accuracy of our prediction (as a percentage)?

*Hint 1:* remember that the `inferred_year` column from the `cars_without_year` table contains our predictions, while the `year` column from the `cars` table contains the true years.

*Hint 2:* you can do this using the `apply` method, but remember that `apply` only works on a single table. Feel free to add extra columns to your tables as needed to answer this question.

*Hint 3:* I've provided a useful function below. It uses an `if` statement, which we haven't learned about yet, but see if you can understand the behavior of this function anyway:

In [None]:
def numbers_are_equal(x, y):
    """
    Returns 1 if x and y are equal, and 0 otherwise.
    """
    if x == y:     # This is a "conditional statement." We will learn about them next week!
        return 1
    else:
        return 0

In [None]:
# ...