# Lab 2: Arrays and Tables

## Due Saturday, July 3, 11:59 pm 

Welcome to lab 2!  This week, we'll learn about arrays, which allow us to store sequences of data, and *tables*, which let us work with multiple arrays of data about the same things. These topics are covered in the [textbook](https://eldridgejm.github.io/dive_into_data_science/front.html).

**Please do not use for-loops for any questions in this lab.** If you don't know what a for-loop is, don't worry -- we haven't covered them yet. But if you do know what they are and are wondering why it's not OK to use them, it is because loops in Python are slow, and looping over arrays and tables should usually be avoided.

First, set up imports by running the cell below.

In [1]:
import math
import babypandas as bpd

import otter
grader = otter.Notebook()

# 1. Arrays

Computers are most useful when you can use a small amount of code to *do the same action* to *many different things*.

For example, in the time it takes you to calculate the 18% tip on a restaurant bill, a laptop can calculate 18% tips for every restaurant bill paid by every human on Earth that day.  (That's if you're pretty fast at doing arithmetic in your head!)

**Arrays** are how we put many values in one place so that we can operate on them as a group. For example, if `billions_of_numbers` is an array of numbers, the expression

    .18 * billions_of_numbers

gives a new array of numbers that's the result of multiplying each number in `billions_of_numbers` by .18 (18%).  Arrays are not limited to numbers; we can also put all the words in a book into an array of strings.

Concretely, an array is a **collection of values of the same type**, like a column in an Excel spreadsheet. 

<img src="data/excel_array.jpg">

## 1.1. Making arrays
You can type in the data that goes in an array yourself, but that's not typically how programs work. Normally, we create arrays by loading them from an external source, like a data file.

First, though, let's learn how to do it the hard way. To begin, we can make a list of numbers by putting them within square brackets:

In [2]:
my_list = [0.125, 4.75, -1.3]
my_list

Just like `int`, `float`, and `str`, the `list` is a datatype provided by Python. It is very flexible and easy to work with, but it is *slowwww*.

As data scientists, we'll often be working with millions or even billions of numbers. For this, we need something faster than a `list`. Instead of lists, we will use *arrays*. 

Arrays are provided by a package called [NumPy](http://www.numpy.org/) (pronounced "NUM-pie" or, if you prefer to pronounce things incorrectly, "NUM-pee").  The package is called `numpy`, but it's standard to rename it `np` for brevity.  You can do that with:

    import numpy as np

Data scientists, as well as engineers and scientists of all kinds, use `numpy` frequently, and you'll see quite a bit of it if you're a data science major.

In [3]:
import numpy as np

Now, to create an array, call the function `np.array` with a list of numbers.  Run this cell to see an example:

In [4]:
np.array([0.125, 4.75, -1.3])

Note that you need the square brackets here. If you try running the following code, Python will yell at you because you forgot them:

```
np.array(0.125, 4.75, -1.3)
```
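To see what goes wrong, the sketch below runs the bracket-less call inside a `try` block (a Python feature this lab doesn't otherwise use) and records that NumPy rejects it with a `TypeError`. The wording of the error message varies between NumPy versions, so it isn't reproduced here.

```python
import numpy as np

# Correct: one argument, a list wrapped in square brackets.
good = np.array([0.125, 4.75, -1.3])

# Incorrect: three separate arguments instead of one list.
try:
    np.array(0.125, 4.75, -1.3)
    raised_error = False
except TypeError:
    raised_error = True

print(len(good), raised_error)
```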

Arrays themselves are also values, just like numbers and strings.  That means you can assign them names or use them as arguments to functions.

**Question 1.1.1.** Make an array containing the numbers 1, 2, and 3, in that order.  Name it `small_numbers`.

In [5]:
small_numbers = ...
small_numbers

In [None]:
grader.check("q111")

**Question 1.1.2.** Make an array containing the numbers 0, 1, -1, $\pi$, and $e$, in that order.  Name it `interesting_numbers`.  *Hint:* remember from the last lab that $\pi$ and $e$ are available from the `math` module, which has already been imported.

In [9]:
interesting_numbers = ...
interesting_numbers

In [None]:
grader.check("q112")

**Question 1.1.3.** Make an array containing the five strings `"Hello"`, `","`, `" "`, `"world"`, and `"!"`.  (The third one is a single space inside quotes.)  Name it `hello_world_components`.

*Note:* If you print `hello_world_components`, you'll notice some extra information in addition to its contents: `dtype='<U5'`.  That's just NumPy's way of saying that the things in the array are strings. In case you're interested, the `U` means that the strings are encoded in [Unicode](https://en.wikipedia.org/wiki/Unicode), and the `5` is the length of the longest string in the array.

In [13]:
hello_world_components = ...
hello_world_components

In [None]:
grader.check("q113")

Very often in data science, we want to work with many numbers that are evenly spaced within some range.  NumPy provides a special function for this called `arange`.  `np.arange(start, stop, space)` produces an array with all the numbers starting at `start` and counting up by `space`, stopping before `stop` is reached.

For example, the value of `np.arange(1, 8, 2)` is an array with elements 1, 3, 5, and 7 -- it starts at 1 and counts up by 2, then stops before 8.  In other words, it makes the same array as `np.array([1, 3, 5, 7])`.

`np.arange(4, 9, 1)` is an array with elements 4, 5, 6, 7, and 8.  (It doesn't contain 9 because `np.arange` stops *before* the stop value is reached.)
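The two examples above can be checked directly. This short sketch only uses the `np.arange` calls already described in the text; the variable names are made up for illustration.

```python
import numpy as np

odds = np.arange(1, 8, 2)           # start at 1, count up by 2, stop before 8
four_to_eight = np.arange(4, 9, 1)  # 9 itself is excluded

print(odds)                # [1 3 5 7]
print(four_to_eight)       # [4 5 6 7 8]
print(len(four_to_eight))  # 5
```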

**Question 1.1.4.** Use `np.arange` to create an array with the multiples of 99 from 0 up to (**and including**) 9999.  (So its elements are 0, 99, 198, 297, etc.)

In [15]:
multiples_of_99 = ...
multiples_of_99

In [None]:
grader.check("q114")

##### Temperature readings
NOAA (the US National Oceanic and Atmospheric Administration) operates weather stations that measure surface temperatures at different sites around the United States.  The hourly readings are [publicly available](http://www.ncdc.noaa.gov/qclcd/QCLCD?prior=N).

Suppose we download all the hourly data from the Sandy Eggo, California site for the month of December 2015.  To analyze the data, we want to know when each reading was taken, but we find that the data don't include the timestamps of the readings (the time at which each one was taken).

However, we know the first reading was taken at the first instant of December 2015 (midnight on December 1st) and each subsequent reading was taken exactly 1 hour after the last.

**Question 1.1.5.** Create an array of the *time, in seconds, since the start of the month* at which each hourly reading was taken.  Name it `collection_times`.

*Hint 1:* There were 31 days in December, which is equivalent to ($31 \times 24$) hours or ($31 \times 24 \times 60 \times 60$) seconds.  So your array should have $31 \times 24$ elements in it.

*Hint 2:* The `len` function works on arrays, too.  If your `collection_times` isn't passing the tests, check its length and make sure it has $31 \times 24$ elements.

In [19]:
collection_times = ...
collection_times

In [None]:
grader.check("q115")

## 1.2. Working with single elements of arrays ("indexing")
Let's work with a more interesting dataset.  The next cell creates an array called `population` that includes estimated world populations in every year from **1950** to roughly the present.  (The estimates come from the [US Census Bureau website](http://www.census.gov/population/international/data/worldpop/table_population.php).)

Rather than type in the data manually, we've loaded them from a file on your computer called `world_population.csv`.  You'll learn how to do that next week.

In [23]:
# Don't worry too much about what goes on in this cell.
population = bpd.read_csv("data/world_population.csv").get("Population").values
population

Here's how we get the first element of `population`, which is the world population in the first year in the dataset, 1950.

In [24]:
population[0]

Notice that we use square brackets here. The square brackets signal that we are *accessing* an element of the array. Square brackets in Python are kind of like subscripts in math.

The value of that expression is the number 2557628654 (around 2.5 billion), because that's the first thing in the array `population`.

Notice that we wrote `population[0]`, not `population[1]`, to get the first element.  This is a weird convention in computer science.  0 is called the *index* of the first item.  It's the number of elements that appear *before* that item.  So 3 is the index of the 4th item.
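As a small illustration of counting from 0, here is a sketch with a made-up array (`letters` is hypothetical and not part of the lab's data). The last valid index is always one less than the array's length.

```python
import numpy as np

# Hypothetical data, not from the lab.
letters = np.array(["a", "b", "c", "d"])

first = letters[0]                # index 0: no elements come before it
fourth = letters[3]               # index 3: three elements come before it
last = letters[len(letters) - 1]  # the last valid index is the length minus 1

print(first, fourth, last)  # a d d
```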

Here are some more examples.  In the examples, we've given names to the things we get out of `population`.  Read and run each cell.

In [25]:
# The third element in the array is the population
# in 1952.
population_1952 = population[2]
population_1952

In [26]:
# The thirteenth element in the array is the population
# in 1962 (which is 1950 + 12).
population_1962 = population[12]
population_1962

In [27]:
# The 66th element is the population in 2015.
population_2015 = population[65]
population_2015

In [28]:
# The array has only 66 elements, so this doesn't work.
# (There's no element with 66 other elements before it.)

population_2016 = population[66]
population_2016

# (after running this cell you can place a # before each line above to make sure that it doesn't run again)

In [29]:
# Since np.array returns an array, we can index its output
# with [3] to get its 4th element, just like we
# "chained" together calls to the method "replace" earlier.
np.array([-1, -3, 4, -2])[3]

**Question 1.2.1.** Set `population_1973` to the world population in 1973, by getting the appropriate element from `population`.

In [30]:
population_1973 = ...
population_1973

In [None]:
grader.check("q121")

## 1.3. Doing something to every element of an array
Arrays are primarily useful for doing the same operation many times, so we don't often have to index into them and work with single elements.

##### Logarithms
Here is one simple question we might ask about world population:

> How big was the population in *orders of magnitude* in each year?

The logarithm function is one way of measuring how big a number is. The logarithm (base 10) of a number increases by 1 every time we multiply the number by 10. It's like a measure of how many decimal digits the number has, or how big it is in orders of magnitude.

We could try to answer our question like this, using the `log10` function from the `math` module and the indexing you just saw:

In [32]:
import math
population_1950_magnitude = math.log10(population[0])
population_1951_magnitude = math.log10(population[1])
population_1952_magnitude = math.log10(population[2])
population_1953_magnitude = math.log10(population[3])
...

But this is tedious and doesn't really take advantage of the fact that we are using a computer.

Instead, NumPy provides its own version of `log10` that takes the logarithm of each element of an array.  It takes a single array of numbers as its argument.  It returns an array of the same length, where the first element of the result is the logarithm of the first element of the argument, and so on.

**Question 1.3.1.** Use NumPy's `log10` function (not `math.log10`!) to compute the logarithms of the world population in every year.  Give the result (an array of 66 numbers) the name `population_magnitudes`.  Your code should be very short.

In [33]:
population_magnitudes = ...
population_magnitudes

In [None]:
grader.check("q131")

<img src="data/array_logarithm.jpg">

This is called *elementwise* application of the function, since it operates separately on each element of the array it's called on.  The textbook's section on arrays has a useful list of NumPy functions that are designed to work elementwise, like `np.log10`.
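Here is a minimal sketch of elementwise application on a made-up array (the name `values` and its contents are hypothetical, chosen so the logarithms are easy to check by hand).

```python
import numpy as np

# Hypothetical values: 10, 100, and 1000 have 1, 2, and 3 zeros.
values = np.array([10, 100, 1000])
magnitudes = np.log10(values)

print(len(magnitudes))  # 3, the same length as the input
print(magnitudes)       # close to 1, 2, and 3
```

The result has one logarithm per input element, in the same order.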

##### Arithmetic
Arithmetic also works elementwise on arrays.  For example, you can divide all the population numbers by 1 billion to get numbers in billions:

In [36]:
population_in_billions = population / 1000000000
population_in_billions

You can do the same with addition, subtraction, multiplication, and exponentiation (`**`). For example, you can calculate a tip on several restaurant bills at once (in this case just 3):

In [37]:
restaurant_bills = np.array([20.12, 39.90, 31.01])
print("Restaurant bills:\t", restaurant_bills)
tips = .2 * restaurant_bills
print("Tips:\t\t\t", tips)

<img src="data/array_multiplication.jpg">

**Question 1.3.2.** Suppose the total charge at a restaurant is the original bill plus the tip.  That means we can multiply the original bill by 1.2 to get the total charge.  Compute the total charge for each bill in `restaurant_bills`.

In [38]:
total_charges = ...
total_charges

In [None]:
grader.check("q132")

Let's read in some data to use in the next question.

In [41]:
more_restaurant_bills = bpd.read_csv("data/more_restaurant_bills.csv").get("Bill").values

**Question 1.3.3.** `more_restaurant_bills` contains 100,000 bills!  Compute the total charge for each one.

In [42]:
more_total_charges = ...
more_total_charges

In [None]:
grader.check("q133")

The function `sum` takes a single array of numbers as its argument.  It returns the sum of all the numbers in that array (so it returns a single number, not an array).
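For instance, with a small made-up array (`prices` is hypothetical, not one of the lab's datasets):

```python
import numpy as np

# Hypothetical numbers, not from the lab's data files.
prices = np.array([2.5, 4.0, 3.5])
total = sum(prices)

print(total)  # 10.0
```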

**Question 1.3.4.** What was the sum of all the bills in `more_restaurant_bills`, *including tips*?

In [45]:
sum_of_bills = ...
sum_of_bills

In [None]:
grader.check("q134")

**Question 1.3.5.** The powers of 2 ($2^0 = 1$, $2^1 = 2$, $2^2 = 4$, etc.) arise frequently in computer science.  (For example, you may have noticed that storage on smartphones or USB drives comes in powers of 2, like 16 GB, 32 GB, or 64 GB.)  Use `np.arange` and the exponentiation operator `**` to compute the first 30 powers of 2, starting from $2^0$.

*Hint*: Did your kernel "die" when you ran your solution? There is a common incorrect response to this problem that tries to create an array with so many entries that Python gives up. If this happens to you, double-check your answer!

In [47]:
powers_of_2 = ...
powers_of_2

In [None]:
grader.check("q135")

# 2. Growth Rates

If you get stuck on this part of the lab, you can always refer to the textbook's section on [growth rates](http://sierra.ucsd.edu/dsc10-book/chapters/03/2/1/Growth.html) for help.

The relationship between two measurements of the same quantity taken at different times is often expressed as a growth rate. For example, the United States federal government employed 2,766,000 people in 2002 and 2,814,000 people in 2012. To compute a growth rate, we must first decide which value to treat as the initial amount. For values over time, the earlier value is a natural choice. 

Set the variable `initial` to the number of federal government employees in 2002, and set the variable `changed` to the number of federal government employees in 2012.

In Python, you can use the underscore `_` to make large numbers easier to read. For example, `1000` and `1_000` are equivalent.
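A quick check of this, using made-up variable names:

```python
with_underscores = 1_000_000
without_underscores = 1000000

print(with_underscores == without_underscores)  # True
print(with_underscores)                         # 1000000
```

The underscores are only for human readers; Python ignores them when it reads the number.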

In [49]:
initial = ...
changed = ...
print("initial: " + str(initial))
print("changed: " + str(changed))

Now compute the ten-year growth rate of federal government employees. The growth rate is the difference between the final value and the initial value, divided by the initial value. Remember to use parentheses.
`(final - initial) / initial`
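To see why the parentheses matter, here is a sketch with made-up numbers (`start_value` and `end_value` are hypothetical and are not the employment figures above). Without parentheses, Python does the division before the subtraction.

```python
# Hypothetical measurements, not the employment figures above.
start_value = 200
end_value = 250

with_parentheses = (end_value - start_value) / start_value
without_parentheses = end_value - start_value / start_value  # division happens first

print(with_parentheses)     # 0.25
print(without_parentheses)  # 249.0
```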

In [50]:
growth_rate = ...
growth_rate

In [None]:
grader.check("q21")

A growth rate can be negative, representing a decrease in some value. For example, the number of manufacturing jobs in the US decreased from 15.3 million in 2002 to 11.9 million in 2012. Calculate the ten-year growth rate of US manufacturing jobs.

In [52]:
manufacturing_growth_rate = ...
manufacturing_growth_rate

In [None]:
grader.check("q22")

An annual growth rate is a growth rate of some quantity over a single year. An annual growth rate of 0.035, accumulated each year for 10 years, gives a much larger ten-year growth rate of 0.41 (or 41%).

In [54]:
ten_year_growth_rate = 1.035 * 1.035 * 1.035 * 1.035 * 1.035 * 1.035 * 1.035 * 1.035 * 1.035 * 1.035 - 1
ten_year_growth_rate

This same computation can be expressed using names and exponents.

In [55]:
annual_growth_rate = 0.035
ten_year_growth_rate = (1 + annual_growth_rate) ** 10 - 1
ten_year_growth_rate

**Question 2.3.** If a stock portfolio had an annual growth rate of 10%, what would be its 15-year growth rate?

In [56]:
fifteen_year_growth_rate = ...
fifteen_year_growth_rate

In [None]:
grader.check("q23")

Likewise, a ten-year growth rate can be used to compute an equivalent annual growth rate. Below, t is the number of years that have passed between measurements. The following computes the annual growth rate of federal expenditures over the last 10 years.

In [58]:
initial = 2.37
changed = 3.38
t = 10
annual_growth_rate = (changed/initial) ** (1/t) - 1
annual_growth_rate
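One way to sanity-check this formula is to compound the annual rate for `t` years and confirm that it gives back the overall growth. The sketch below reuses the numbers from the cell above; the names `recovered_total_growth` and `total_growth` are made up for illustration.

```python
initial = 2.37
changed = 3.38
t = 10

annual_growth_rate = (changed / initial) ** (1 / t) - 1

# Compounding the annual rate for t years should give back the total growth.
recovered_total_growth = (1 + annual_growth_rate) ** t - 1
total_growth = (changed - initial) / initial

# Compare with a small tolerance, since floating-point arithmetic is inexact.
print(abs(recovered_total_growth - total_growth) < 1e-9)  # True
```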

**Question 2.4.** If you wanted a stock portfolio with a 15-year growth rate of 100%, what would its annual growth rate have to be?

In [59]:
annual_growth_rate = ...
annual_growth_rate

In [None]:
grader.check("q24")

# 3. Tables (DataFrames)

## 3.1. Introduction

For a collection of things in the world, an array is useful for describing a single attribute of each thing. For example, among the collection of US States, an array could describe the land area of each. Tables extend this idea by describing multiple attributes for each element of a collection.

In most data science applications, we have data about many entities, but we also have several kinds of data about each entity.

For example, in the cell below we have two arrays. The first one contains the world population in each year (as [estimated](http://www.census.gov/population/international/data/worldpop/table_population.php) by the US Census Bureau), and the second contains the years themselves (in order, so the first elements in the population and the years arrays correspond).

In [61]:
population_amounts = bpd.read_csv("data/world_population.csv").get("Population").values
population_years = np.arange(1950, 2015+1)
print("Population column:", population_amounts)
print("Years column:", population_years)

Suppose we want to answer this question:

> When did world population cross 6 billion?

You could technically answer this question just from staring at the arrays, but it's a bit convoluted, since you would have to count the position where the population first crossed 6 billion, then find the corresponding element in the years array. In cases like these, it might be easier to put the data into a 2-dimensional table called a *DataFrame*.

Just as `numpy` provides arrays, a popular package called `pandas` provides `DataFrame`s. `pandas` is *the* tool for doing data science in Python. Unfortunately, `pandas` isn't as cute as its name might suggest: it's very complicated and can be somewhat hard to learn.

Instead of using `pandas`, we'll use a package that we've created especially for DSC 10. It is a *subset* of `pandas`, including only the parts that we think are necessary and throwing out all of the rest. Because it is smaller (and cuter), we've called it `babypandas`. 

You can import `babypandas` using the following code:

In [62]:
import babypandas as bpd


The nice thing about `babypandas` is that it is easier to learn *but* every bit of code you write using `babypandas` will work with `pandas`, too. If you're a data science major, or just going to be doing a lot of data analysis in Python, you'll see quite a lot of `pandas` in your future.

The expression below:

- creates an empty DataFrame using the expression `bpd.DataFrame()`,
- assigns two columns to the DataFrame by calling `assign`,
- assigns the result to the name `population_table`, and finally
- evaluates `population_table` so that we can see the table.

`Year` and `Population` are column labels that we have chosen. The names `population_amounts` and `population_years` were assigned above to two arrays of the same length.

In [63]:
population_table = bpd.DataFrame().assign(
    Population=population_amounts,
    Year=population_years
)
population_table

Now the data are all together in a single table! It's much easier to parse this data--if you need to know what the population was in 1959, for example, you can tell from a single glance. We'll revisit this table later.

**Question 3.1.1.** In the cell below, we've created 2 arrays. Using the steps above, assign `top_10_movies` to a DataFrame that has two columns called "Rating" and "Name", which hold `top_10_movie_ratings` and `top_10_movie_names` respectively.

In [64]:
top_10_movie_ratings = np.array([9.2, 9.2, 9., 8.9, 8.9, 8.9, 8.9, 8.9, 8.9, 8.8])
top_10_movie_names = np.array([
        'The Shawshank Redemption (1994)',
        'The Godfather (1972)',
        'The Godfather: Part II (1974)',
        'Pulp Fiction (1994)',
        "Schindler's List (1993)",
        'The Lord of the Rings: The Return of the King (2003)',
        '12 Angry Men (1957)',
        'The Dark Knight (2008)',
        'Il buono, il brutto, il cattivo (1966)',
        'The Lord of the Rings: The Fellowship of the Ring (2001)'
])

In [65]:
top_10_movies = ...
top_10_movies

In [None]:
grader.check("q311")

Suppose you want to add your own rankings to this table. The cell below contains your ranking of each movie:

In [69]:
my_ranking = [8, 2, 1, 9, 7, 10, 6, 4, 3, 5]

**Question 3.1.2.** You can use the `assign` method to add a column to an already-existing table, too. Create a new DataFrame called `with_ranking` by adding a column named "Ranking" to the table in `top_10_movies`, using the values in `my_ranking`.

In [70]:
with_ranking = ...
with_ranking

In [None]:
grader.check("q312")

## 3.2. Indexes

You may have noticed that the table of population numbers has what looks like an extra, unlabeled column on the left with the numbers 0 through 65. In fact, this is not a column; it's what we call an *Index*. The index contains the row labels. Whereas the columns of this table are labeled "Population" and "Year", the rows are labeled 0, 1, ..., 65.

By default, `babypandas` doesn't know how to label the rows, and so it just numbers them (starting with 0). Of course, in this case it makes more sense to use the year as a row's label. We can do this by telling `babypandas` to set the `'Year'` column as the index:

In [74]:
population_by_year = population_table.set_index('Year')
population_by_year

As we'll see, this does more than make the table look nicer -- it is very useful, too.

**Question 3.2.1.** Create a new DataFrame named `top_10_movies_by_name` by taking the DataFrame you made above, `top_10_movies`, and setting the index to be the "Name" column.

In [75]:
top_10_movies_by_name = ...
top_10_movies_by_name

In [None]:
grader.check("q321")

You can get an array of row names using `.index`. For instance, the array of row names of the `population_by_year` table is:

In [77]:
population_by_year.index

**Question 3.2.2.** Using code, assign to `fourth_movie` the name of the fourth movie in `top_10_movies_by_name`.

In [78]:
fourth_movie = ...
fourth_movie

In [None]:
grader.check("q322")

## 3.3. Reading a table from a file
In most cases, we aren't going to go through the trouble of typing in all the data manually. Instead, we can use functions provided by `babypandas`.

The `bpd.read_csv()` function takes one argument, a path to a data file (a string), and returns a `DataFrame`.  There are many formats for data files, but CSV ("comma-separated values") is the most common.

**Question 3.3.1.** The file `data/imdb.csv` contains a table of information about the 250 highest-rated movies on IMDb.  Load it as a table called `imdb`.

In [80]:
imdb = ...
imdb

In [None]:
grader.check("q331")

Notice the dots in the middle of the table. This means that a lot of the rows have been omitted. This table is big enough that only a few of its rows are displayed, but the others are still there.  Only 10 of the 250 movies are shown.

Where did `imdb.csv` come from? Take a look at [this lab's folder](./). If you go into the `data/` directory, you should see a file called `imdb.csv`.

Open up the `imdb.csv` file in that folder and look at the format. What do you notice? The `.csv` filename ending says that this file is in the [CSV (comma-separated value) format](http://edoceo.com/utilitas/csv-file-format).
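To make the format concrete, here is a sketch that reads a tiny made-up CSV from a string instead of a file. It uses plain `pandas` (imported as `pd`) as a stand-in, relying on this lab's statement that `babypandas` code also works in `pandas`. The contents of `csv_text` and the name `tiny_table` are hypothetical, and reading from an `io.StringIO` object is a `pandas` feature shown only for illustration.

```python
import io
import pandas as pd  # stand-in for babypandas in this sketch

# Hypothetical file contents: a header row, then one row of data per line,
# with commas separating the values.
csv_text = "Title,Year\nFirst Film,1990\nSecond Film,2005\n"

# io.StringIO lets us treat the string as if it were a file on disk.
tiny_table = pd.read_csv(io.StringIO(csv_text))

print(tiny_table.shape)          # (2, 2): two rows and two columns
print(list(tiny_table.columns))  # ['Title', 'Year']
```

The first line of a CSV file becomes the column labels, and each later line becomes one row.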

**Question 3.3.2.** It makes more sense to use the movie title as the row label. Create a new DataFrame called `imdb_by_name` which uses the movie title as the index.

In [82]:
imdb_by_name = ...
imdb_by_name

In [None]:
grader.check("q332")

## 3.4. Series



Suppose we're interested primarily in movie ratings. To extract just this column from the table, we use the `.get` method:

In [86]:
ratings = imdb_by_name.get('Rating')
ratings

Notice how not only the movie ratings have been returned, but also the name of the movie! This is precisely because we have set the movie title to be the index! For example, if we had asked for the `Rating` column of the original DataFrame, `imdb`, we would see:

In [87]:
imdb.get('Rating')

This is one way in which indices are very useful.

At first glance, it might look like asking for a column using `.get` returns a table with one column, but that's not quite right. Instead, it returns a special type of thing called a *Series*:

In [88]:
type(imdb_by_name.get('Rating'))

You can think of a `Series` as an array with an index. Whereas arrays are simple sequences of numbers without labels, `Series` can have labels. This is often very useful.

`ratings` is now a `Series` which contains the column of movie ratings. Suppose we're interested in the rating of a particular movie: `Alien`. To do so, we will use the `.loc` *accessor* which pulls a value from the Series at a particular *loc*ation:

In [89]:
ratings.loc['Alien']

There are a couple of things to note here. First, those are square brackets around `Alien`. This is because `.loc` is not a function, but an *accessor*. The square brackets signal that we're going to be extracting an element from the `Series`. Second, we passed in the label as a string.

**Question 3.4.1.** Find the rating of "The Bourne Ultimatum".

In [90]:
bourne_rating = ...
bourne_rating

In [None]:
grader.check("q341")

Now suppose we wanted to know the year in which `Alien` was released. We could do this by getting the column of years:

In [92]:
years = imdb_by_name.get('Year')
years

And then using `.loc` to get the right entry:

In [93]:
years.loc['Alien']

We could also do this in one step by *chaining* the operations together:

In [94]:
imdb_by_name.get('Year').loc['Alien']

This works because Python first evaluates `imdb_by_name.get('Year')` to a Series. It then evaluates the `.loc['Alien']` to return the year.

Chaining is used pretty frequently and can be handy. Just be sure not to chain so many things together that your code gets hard to read. You can always save an intermediate result to a variable.

**Question 3.4.2.** Find the decade in which "A Beautiful Mind" was released using chaining. *Hint:* `imdb_by_name` has a column named "Decade".

In [95]:
decade = ...
decade

In [None]:
grader.check("q342")

# 4. Analyzing datasets
With just a few table methods, we can answer some interesting questions about the IMDb dataset.

If we want just the ratings of the movies, we can get a Series that contains the data in that column:

In [97]:
ratings = imdb_by_name.get("Rating")
ratings

Remember that `ratings` is a `Series`. `Series` objects have some useful methods.

**Question 4.1.** Find the rating of the highest-rated movie in the dataset.

*Hint:* Type `ratings.` and hit Tab to see a list of the available methods. Is there one that looks useful?

In [98]:
highest_rating = ...
highest_rating

In [None]:
grader.check("q41")

You probably want to know the *name* of the movie whose rating you found!  To do that, we can sort the whole Series using the `.sort_values` method:

In [100]:
ratings.sort_values()

So there are actually 2 highest-rated movies in the dataset: *The Shawshank Redemption* and *The Godfather*.

Notice that we are sorting by the ratings, not the labels! Moreover, the label follows the rating as it is sorted. This is exactly what we want.

Had we wanted the highest-rated movies on top, we would need to specify that the sorting should not be in ascending order with a *keyword argument*:


In [101]:
ratings.sort_values(ascending=False)

Not only can we sort `Series`, but we can sort entire `DataFrame`s, too. When we do that, we have to specify the column to sort by:

In [102]:
imdb_by_name.sort_values('Rating')

Similarly, we can specify that the sort should be in descending order:

In [103]:
imdb_by_name.sort_values('Rating', ascending=False)


Some details about sorting:

1. The first argument to `sort_values` is the name of a column to sort by.
2. If the column has strings in it, `sort_values` will sort alphabetically; if the column has numbers, it will sort numerically.
3. The value of `imdb_by_name.sort_values("Rating")` is a *copy*; the `imdb_by_name` table doesn't get modified. For example, if we called `imdb_by_name.sort_values("Rating")`, then running `imdb_by_name` by itself would still return the unsorted table. To save the result, you should assign it to a new variable.
4. Rows always stick together when a table is sorted.  It wouldn't make sense to sort just one column and leave the other columns alone.  For example, in this case, if we sorted just the `"Rating"` column, the movies would all end up with the wrong ratings.
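Points 3 and 4 can be seen in a small sketch. It uses plain `pandas` as a stand-in for `babypandas` (the lab notes that `babypandas` code also works in `pandas`), and the table `films` is made up for illustration.

```python
import pandas as pd  # stand-in for babypandas in this sketch

# Hypothetical table, not the IMDb data.
films = pd.DataFrame().assign(
    Title=["B", "A", "C"],
    Rating=[7.0, 9.0, 8.0],
).set_index("Title")

by_rating = films.sort_values("Rating")

print(list(films.index))      # ['B', 'A', 'C']: the original is unchanged
print(list(by_rating.index))  # ['B', 'C', 'A']: lowest rating first
print(by_rating.get("Rating").loc["A"])  # 9.0: each title keeps its own rating
```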

**Question 4.2.** Create a version of `imdb_by_name` that's sorted chronologically, with the earliest movies first.  Call it `imdb_sorted`.

In [104]:
imdb_sorted = ...
imdb_sorted

In [None]:
grader.check("q42")

**Question 4.3.** What's the title of the earliest movie in the dataset?  You could just look this up from the output of the previous cell.  Instead, write Python code to find out.

*Hint:* Remember that the index is an array.

In [106]:
earliest_movie_title = ...
earliest_movie_title

In [None]:
grader.check("q43")

Suppose we want to get the rating of the oldest movie in the table. One way to do this is to first find the index label of the oldest movie (which we've already done). We then extract the 'Rating' column and use `.loc` to find the rating of the oldest movie.

In [108]:
imdb_sorted.get('Rating').loc[earliest_movie_title]

There's a faster way, though. A Series not only has a `.loc` accessor, but also an `.iloc` accessor. While `.loc` looks up things by *label*, `.iloc` looks up elements by *integer position*.

Let's remember what is in the "Rating" column:

In [109]:
imdb_sorted.get('Rating')

If we want the rating of the first row, we can use `.iloc[0]`:

In [110]:
imdb_sorted.get('Rating').iloc[0]

This returns the exact same thing as `imdb_sorted.get('Rating').loc['The Kid']`; these are two ways of doing the same thing. Usually it is more convenient to access an element by its label rather than by its integer position.
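The same equivalence can be checked on a small made-up table. This sketch uses plain `pandas` as a stand-in for `babypandas`; the table `films` and its contents are hypothetical.

```python
import pandas as pd  # stand-in for babypandas in this sketch

# Hypothetical table, not the IMDb data.
films = pd.DataFrame().assign(
    Title=["Old Film", "Newer Film", "Newest Film"],
    Rating=[8.1, 7.4, 9.0],
).set_index("Title")

ratings_example = films.get("Rating")

by_label = ratings_example.loc["Old Film"]  # look up by row label
by_position = ratings_example.iloc[0]       # look up by integer position

print(by_label == by_position)  # True
```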

**Question 4.4.** What is the rating of the third oldest movie in the dataset? You could just look this up from the output of the previous cell. Instead, write Python code to find out.

In [111]:
earliest_movie_rating = ...
earliest_movie_rating

In [None]:
grader.check("q44")

# 5. Finding pieces of a dataset

Suppose you're interested in movies from the 1950s.  Sorting the table by year doesn't help you, because the 1950s are in the middle of the dataset. Instead, we'll use a feature of `Series` which allows us to easily compare each element in a column to a particular value.

First remember that we can use `.get` to extract a single column. The result is not a `DataFrame`, but rather a `Series`:

In [113]:
imdb_by_name.get('Decade')

We want to check whether each movie was released in the 1950s, that is, whether its decade is 1950. Python gives us a way of checking whether two things are equal with `==` (remember that `=` is already taken: it assigns values to variable names):

In [114]:
3 == 4

In [115]:
3 == 3

`True` and `False` are instances of a type that we haven't seen before:

In [116]:
type(True)

`bool` stands for "Boolean". We say that "True" and "False" are *Boolean* values.

It turns out that we can just as easily check whether *each* of the elements in a `Series` is equal to something:

In [117]:
imdb_by_name.get('Decade') == 1950

We see that the result is a new series which has `True` only where the decade was 1950, and `False` everywhere else. We say that the resulting series is a series of *Booleans*, or a *Boolean Series*.

Let's call this result `is_from_1950s`. Its name can be read like it is a question: "is this movie from the 1950s"?

In [118]:
is_from_1950s = imdb_by_name.get('Decade') == 1950
is_from_1950s

Each row is an answer to the question. Is "The Elephant Man" from the fifties? `False`. Is "All About Eve" from the fifties? `True`.
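
The same idea on a tiny example may make the "one answer per row" picture clearer. This is only a sketch with made-up titles and decades, written with plain `pandas` rather than `babypandas`; the name `toy_is_from_1950s` is hypothetical.

```python
import pandas as pd

# Hypothetical decades for three made-up titles.
toy_decades = pd.Series(
    [1940, 1950, 1950],
    index=['Film A', 'Film B', 'Film C'],
)

# Comparing a Series to a single value compares every element to that value.
toy_is_from_1950s = toy_decades == 1950
print(toy_is_from_1950s)

# The answer for one particular movie can be looked up by its label.
print(toy_is_from_1950s.loc['Film A'])
```

Notice that the Boolean Series keeps the same index labels as the Series it was built from, which is what lets us ask about a movie by name.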

We can use `is_from_1950s` to select only the rows from `imdb_by_name` for which the answer is `True`. The syntax for this is:

In [119]:
imdb_by_name[is_from_1950s]

What `imdb_by_name[is_from_1950s]` does, precisely, is to go through the table `imdb_by_name` row by row. If the row named "Singin' in the Rain" has the value `True` in `is_from_1950s`, that row is kept. If the value is `False`, the row is tossed. And so on, for every row.
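
Here is a small sketch of that keep-or-toss behavior. The table, its values, and the names `toy`, `keep`, and `kept` are all made up for illustration, and the code uses plain `pandas` on the assumption that `babypandas` behaves the same way for these operations.

```python
import pandas as pd

# A hypothetical three-row table indexed by made-up titles.
toy = pd.DataFrame(
    {'Decade': [1940, 1950, 1950], 'Rating': [8.0, 8.3, 7.7]},
    index=['Film A', 'Film B', 'Film C'],
)

keep = toy.get('Decade') == 1950   # Boolean Series: False, True, True
kept = toy[keep]                   # only the rows where keep is True

print(kept)
print(toy.shape, kept.shape)
```

The selection produces a new, smaller table; `toy` itself still has all three rows afterwards.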

Note that we could have accomplished this without ever creating the variable `is_from_1950s` by simply placing the code that we used to create the Boolean Series directly inside the `[...]`:

In [120]:
imdb_by_name[imdb_by_name.get('Decade') == 1950]

**Question 5.1.** Create a table called `ninety_nine` containing the movies that came out in 1999.

In [121]:
ninety_nine = ...
ninety_nine

In [None]:
grader.check("q51")

So far we've only been finding where a column is *exactly* equal to a certain value. However, there are many other comparison operators.  Here are a few:

|Operator|Tests|
|-|-|
|`==`|thing on left is equal to thing on right|
|`!=`|thing on left is *not* equal to thing on right|
|`>`|thing on left is greater than (and not equal to) thing on right|
|`>=`|thing on left is greater than or equal to thing on right|
|`<`|thing on left is less than (and not equal to) thing on right|
|`<=`|thing on left is less than or equal to thing on right|

The textbook section on selecting rows has more examples.
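
As a quick illustration, each of these operators can be applied to a whole Series at once, just like `==`. The sketch below uses a made-up `toy_votes` Series (a hypothetical column, not one from the lab's table) and plain `pandas`.

```python
import pandas as pd

# Hypothetical vote counts for four made-up titles.
toy_votes = pd.Series(
    [120, 450, 450, 900],
    index=['Film A', 'Film B', 'Film C', 'Film D'],
)

not_450 = toy_votes != 450        # True where the value differs from 450
at_least_450 = toy_votes >= 450   # True where the value is 450 or more
below_450 = toy_votes < 450       # True where the value is strictly less than 450

print(not_450.tolist())
print(at_least_450.tolist())
print(below_450.tolist())
```

Compare `at_least_450` and `below_450`: every row is `True` in exactly one of them, since `<` is the opposite of `>=`.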


**Question 5.2.** Using operators from the table above, find all the movies with a rating higher than 8.5.  Put their data in a table called `really_highly_rated`.

In [123]:
really_highly_rated = ...
really_highly_rated

In [None]:
grader.check("q52")

What is the highest rating of any movie from the 1920s? We now have the tools to answer questions like these. Breaking it into pieces, we first find all of the movies from the 1920s:

In [125]:
is_from_1920s = imdb_by_name.get('Decade') == 1920
is_from_1920s

We then select only these movies from our table:

In [126]:
from_1920s = imdb_by_name[is_from_1920s]
from_1920s

We then find the highest rating out of just these movies:

In [127]:
from_1920s.get('Rating').max()

Or, if we wanted to do all of this more concisely using chaining:

In [128]:
imdb_by_name[imdb_by_name.get('Decade') == 1920].get('Rating').max()
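
If you want to convince yourself that the step-by-step version and the chained version really do the same thing, a sketch like the following can help. It runs both on a made-up table (hypothetical titles and values) using plain `pandas`.

```python
import pandas as pd

# A hypothetical table with two 1920s movies and one 1930s movie.
toy = pd.DataFrame(
    {'Decade': [1920, 1920, 1930], 'Rating': [8.1, 8.3, 8.5]},
    index=['Film A', 'Film B', 'Film C'],
)

# Step by step: build the Boolean Series, select the rows, then aggregate.
toy_is_1920s = toy.get('Decade') == 1920
toy_1920s = toy[toy_is_1920s]
step_by_step = toy_1920s.get('Rating').max()

# Chained: the same three steps written as one expression.
chained = toy[toy.get('Decade') == 1920].get('Rating').max()

print(step_by_step, chained)
```

The 8.5 rating belongs to the 1930s movie, so it is filtered out before `.max()` is computed. Chaining saves typing, while the step-by-step version gives you intermediate results you can inspect if something looks wrong.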

**Question 5.3.** Using the movies in `imdb`, find the average rating of movies released in the 20th century and the average rating of movies released in the 21st century.

*Hint*: `Series` have a `.mean()` method. Note that the year 2000 is in the 20th century!

In [129]:
average_20th_century_rating = ...
average_20th_century_rating

In [None]:
grader.check("q531")

In [131]:
average_21st_century_rating = ...
average_21st_century_rating

In [None]:
grader.check("q532")

The property `shape` tells you how many rows and columns are in a table.  (A "property" is like a method, except that you access it without adding parentheses.)

In [133]:
imdb_by_name.shape

As with an array, you can get the first element of the shape using `[0]`, and the second element using `[1]`. For instance, the number of rows in `imdb_by_name` is:

In [134]:
imdb_by_name.shape[0]

We can use this to answer "How many movies are from the 20th century?":

In [135]:
imdb_by_name[imdb_by_name.get('Year') <= 2000].shape[0]
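
The sketch below shows `shape` on a made-up table, before and after selecting rows. The table and its values are hypothetical, and plain `pandas` is used in place of `babypandas`.

```python
import pandas as pd

# A hypothetical table: three rows, two columns, indexed by made-up titles.
toy = pd.DataFrame(
    {'Decade': [1920, 1920, 1930], 'Rating': [8.1, 8.3, 8.5]},
    index=['Film A', 'Film B', 'Film C'],
)

n_rows = toy.shape[0]   # number of rows
n_cols = toy.shape[1]   # number of columns (the index is not counted)

# Selecting rows changes the row count but not the column count.
n_from_1920s = toy[toy.get('Decade') == 1920].shape[0]

print(toy.shape, n_rows, n_cols, n_from_1920s)
```

Note that the index (the titles) does not count as a column, so this table has 2 columns, not 3.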

**Question 5.4.** Use `shape` (and arithmetic) to find the *proportion* of movies in the dataset that were released in the 20th century, and the proportion from the 21st century.

*Hint:* The *proportion* of movies released in the 20th century is the *number* of movies released in the 20th century, divided by the *total number* of movies.

In [136]:
proportion_in_20th_century = ...
proportion_in_20th_century

In [None]:
grader.check("q541")

In [138]:
proportion_in_21st_century = ...
proportion_in_21st_century

In [None]:
grader.check("q542")

**Question 5.5.** Check out the `population_by_year` table from the introduction to this lab.  Compute the year when the world population first went above 6 billion.

In [140]:
year_population_crossed_6_billion = ...
year_population_crossed_6_billion

In [None]:
grader.check("q55")

# Finish Line

Congratulations! You are done with Lab 02.

To submit your assignment:

1. Select `Kernel -> Restart & Run All` to ensure that you have executed all cells, including the test cells.
2. Read through the notebook to make sure everything is fine and all tests passed.
3. Run the cell below to run all tests, and make sure that they all pass.
4. Download your notebook using `File -> Download as -> Notebook (.ipynb)`, then upload your notebook to Gradescope.

In [142]:
# For your convenience, you can run this cell to run all the tests at once!
grader.check_all()