# Lab 1: Arrays and DataFrames

## Due Saturday, April 15th at 11:59pm

Welcome to Lab 1!  This week, we'll learn about arrays, which allow us to store sequences of data, and DataFrames, which let us work with multiple arrays of data about the same things. These topics are covered in [BPD 7-10](https://notes.dsc10.com/02-data_sets/arrays.html) in the `babypandas` notes. You should complete this entire lab so that all tests pass and submit it to Gradescope by 11:59PM on the due date.


**Please do not use for-loops for any questions in this lab.** If you don't know what a for-loop is, don't worry – we haven't covered them yet. But if you do know what they are and are wondering why it's not OK to use them, it is because loops in Python are slow, and looping over arrays and DataFrames should usually be avoided.

First, set up the imports we'll need by running the cell below.

In [1]:
import numpy as np
import babypandas as bpd

import otter
grader = otter.Notebook()

# 1. Arrays

Computers are most useful when you can use a small amount of code to *do the same action* to *many different things*.

For example, in the time it takes you to calculate the 18% tip on a restaurant bill, a laptop can calculate 18% tips for every restaurant bill paid by every human on Earth that day.  (That is, if you're pretty fast at doing arithmetic in your head!)

**Arrays** are how we put many values in one place so that we can operate on them as a group. For example, if `billions_of_numbers` is an array of numbers, the expression

```py
0.18 * billions_of_numbers
```

evaluates to a new array of numbers that's the result of multiplying each number in `billions_of_numbers` by 0.18 (18%).  Arrays are not limited to numbers; we can also put all the words in a book into an array of strings.

Concretely, an array is a **collection of values of the same type**, like a column in a spreadsheet (think Google Sheets or Microsoft Excel). 

<img src="data/sheet_array.png" width=600>

## 1.1. Making arrays
You can type in the data that goes in an array yourself, but that's not typically how we'll create arrays. Normally, we create arrays by loading them from an external source, like a data file.

First, though, let's learn how to do it the hard way. To begin, we can make a **list** of numbers by putting them within square brackets and separating them by commas:

In [2]:
my_list = [14, -2.26, 0.15]
my_list

[14, -2.26, 0.15]

Just like `int`, `float`, and `str`, the `list` is a data type provided by Python. Lists are very flexible and easy to work with, but they are *slowwww* 🐢.

As data scientists, we'll often be working with millions or even billions of numbers. For this, we need something faster than a `list`. Instead of lists, we will use *arrays*. 

Arrays are provided by a package called [NumPy](http://www.numpy.org/) (pronounced "NUM-pie" or, if you prefer to pronounce things incorrectly, "NUM-pee").  The package is called `numpy`, but it's standard to rename it `np` for brevity.  You can do that with:

    import numpy as np

Data scientists, as well as engineers and scientists of all kinds, use `numpy` frequently, and you'll see quite a bit of it if you're a data science major.

In [3]:
import numpy as np

Now, to create an array, call the function `np.array` with a list of numbers.  Run this cell to see an example:

In [4]:
np.array([14, -2.26, 0.15])

array([14.  , -2.26,  0.15])

Note that you need the square brackets here. If you were to try running the following code, Python would yell at you because the brackets are missing:

```py
np.array(14, -2.26, 0.15)
```

<img src='data/brackets.png' width=400>
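If you'd like to see this failure without a wall of red text, here is a minimal sketch that catches the error instead. The names `good`, `bad`, and `failed` are made up for this illustration, and the exact error message depends on your NumPy version, so it isn't reproduced here.

```python
import numpy as np

# Correct: one argument, a list wrapped in square brackets.
good = np.array([14, -2.26, 0.15])

# Incorrect: three separate arguments instead of one list.
try:
    bad = np.array(14, -2.26, 0.15)
    failed = False
except (TypeError, ValueError) as err:
    failed = True
    print("np.array complained with a", type(err).__name__)
```

The takeaway is that `np.array` wants a single sequence of values, not the values themselves as separate arguments.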

Arrays themselves are also values, just like numbers and strings.  That means you can assign them names or use them as arguments to functions.

**Question 1.1.1.** Make an array containing the numbers 2, 4, and 6, in that order.  Name it `even_numbers`.

In [5]:
even_numbers = np.array((2, 4, 6))
even_numbers

array([2, 4, 6])

In [6]:
grader.check("q1_1_1")

**Question 1.1.2.** Make an array containing the numbers 0, -1, 1, $\pi$, and $e$, in that order.  Name it `odd_numbers`.

*Hint:* $\pi$ and $e$ are available from the `np` module, which has already been imported. Just as you used `math.pi` to get $\pi$ in the last lab, you can use `np.pi` to get $\pi$ as well. **Do not** import the `math` module.

In [7]:
odd_numbers = np.array((0,-1,1,np.pi,np.e))
odd_numbers

array([ 0.        , -1.        ,  1.        ,  3.14159265,  2.71828183])

In [8]:
grader.check("q1_1_2")

**Question 1.1.3.** Make an array containing the five strings `"Hello"`, `","`, `" "`, `"world"`, and `"!"`.  (The third one is a single space inside quotes.)  Name it `hello_world_components`.

*Note:* If you print `hello_world_components`, you'll notice some extra information in addition to its contents: `dtype='<U5'`.  That's just NumPy's extremely odd way of saying that the things in the array are strings. In case you're interested, the `<` describes the byte order used to store the data, the `U` means that each string is encoded in [Unicode](https://en.wikipedia.org/wiki/Unicode), and the `5` means that every string in the array is at most 5 characters long.
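As a quick illustration of how the number in the `dtype` tracks the longest string, here is a small sketch with a made-up array called `animals` (not part of the lab's data).

```python
import numpy as np

# A hypothetical array of strings; the longest one ("giraffe") has 7 characters.
animals = np.array(["cat", "giraffe", "ox"])

# The dtype records that these are Unicode strings of width 7.
print(animals.dtype)

# .kind is 'U' for Unicode strings.
string_kind = animals.dtype.kind
```

If you added a longer string to the list, the number in the `dtype` would grow to match it.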

In [9]:
hello_world_components = np.array(("Hello", ",", " ", "world", "!"))
hello_world_components

array(['Hello', ',', ' ', 'world', '!'], dtype='<U5')

In [10]:
grader.check("q1_1_3")

Very often in data science, we want to work with many numbers that are evenly spaced within some range.  NumPy provides a special function for this called `arange`.  The expression `np.arange(start, stop, space)` produces an array with all the numbers starting at `start` and counting up by `space`, stopping **before** `stop` is reached.

For example, the value of `np.arange(1, 8, 2)` is an array with elements 1, 3, 5, and 7 – it starts at 1 and counts up by 2, then stops before 8.  In other words, it makes the same array as `np.array([1, 3, 5, 7])`.

`np.arange(4, 9, 1)` is an array with elements 4, 5, 6, 7, and 8.  (It doesn't contain 9 because `np.arange` stops *before* the stop value is reached.)
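The two examples above can be checked directly. The sketch below also shows one way to *include* an endpoint: push `stop` one step past it. The variable names are made up for this illustration.

```python
import numpy as np

# Starts at 1, counts up by 2, stops before 8.
odds = np.arange(1, 8, 2)                    # 1, 3, 5, 7

# Starts at 4, counts up by 1, stops before 9.
four_to_eight = np.arange(4, 9, 1)           # 4, 5, 6, 7, 8

# To include 20 itself, make the stop value one step larger than 20.
fives_through_20 = np.arange(0, 20 + 5, 5)   # 0, 5, 10, 15, 20
```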

**Question 1.1.4.** Use `np.arange` to create an array with all the multiples of 99 from 0 up to (**and including**) 9999.  (So its elements are 0, 99, 198, 297, etc.)

In [11]:
multiples_of_99 = np.arange(0, 9999+99, 99)
multiples_of_99

array([   0,   99,  198,  297,  396,  495,  594,  693,  792,  891,  990,
       1089, 1188, 1287, 1386, 1485, 1584, 1683, 1782, 1881, 1980, 2079,
       2178, 2277, 2376, 2475, 2574, 2673, 2772, 2871, 2970, 3069, 3168,
       3267, 3366, 3465, 3564, 3663, 3762, 3861, 3960, 4059, 4158, 4257,
       4356, 4455, 4554, 4653, 4752, 4851, 4950, 5049, 5148, 5247, 5346,
       5445, 5544, 5643, 5742, 5841, 5940, 6039, 6138, 6237, 6336, 6435,
       6534, 6633, 6732, 6831, 6930, 7029, 7128, 7227, 7326, 7425, 7524,
       7623, 7722, 7821, 7920, 8019, 8118, 8217, 8316, 8415, 8514, 8613,
       8712, 8811, 8910, 9009, 9108, 9207, 9306, 9405, 9504, 9603, 9702,
       9801, 9900, 9999])

In [12]:
grader.check("q1_1_4")

##### Temperature readings 🌡️
NOAA (the US National Oceanic and Atmospheric Administration) operates weather stations that measure surface temperatures at different sites around the United States.  The hourly readings are [publicly available](http://www.ncdc.noaa.gov/qclcd/QCLCD?prior=N).

Suppose we download all the hourly data from the San Diego, California site for the month of December 2022. To analyze the data, we want to know when each reading was taken, but we find that the data doesn't include the timestamps of the readings (the time at which each one was taken).

However, we know the first reading was taken at the first instant of December 2022 (midnight on December 1st) and each subsequent reading was taken exactly 1 hour after the last.

**Question 1.1.5.** Create an array of the *time, in seconds, since the start of the month* at which each hourly reading was taken.  Name it `collection_times`.

*Hint 1:* There are 31 days in December, which is equivalent to ($31 \times 24$) hours or ($31 \times 24 \times 60 \times 60$) seconds.  

*Hint 2:* The `len` function works on arrays, too.  If your `collection_times` isn't passing the tests, check its length and make sure it has $31 \times 24$ elements, since readings are taken hourly for 31 days.
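To see how `len` can be used as a sanity check, here is a sketch of a smaller, hypothetical version of the problem: hourly readings over just 2 days. The names `seconds_per_hour` and `two_days` are invented for this example.

```python
import numpy as np

seconds_per_hour = 60 * 60

# Hourly readings for 2 days, measured in seconds since the first reading.
two_days = np.arange(0, 2 * 24 * seconds_per_hour, seconds_per_hour)

# 2 days of hourly readings should give 2 * 24 = 48 elements.
number_of_readings = len(two_days)
```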

In [13]:
collection_times = np.arange(0, 31*24*3600, 3600)
collection_times

array([      0,    3600,    7200,   10800,   14400,   18000,   21600,
         25200,   28800,   32400,   36000,   39600,   43200,   46800,
         50400,   54000,   57600,   61200,   64800,   68400,   72000,
         75600,   79200,   82800,   86400,   90000,   93600,   97200,
        100800,  104400,  108000,  111600,  115200,  118800,  122400,
        126000,  129600,  133200,  136800,  140400,  144000,  147600,
        151200,  154800,  158400,  162000,  165600,  169200,  172800,
        176400,  180000,  183600,  187200,  190800,  194400,  198000,
        201600,  205200,  208800,  212400,  216000,  219600,  223200,
        226800,  230400,  234000,  237600,  241200,  244800,  248400,
        252000,  255600,  259200,  262800,  266400,  270000,  273600,
        277200,  280800,  284400,  288000,  291600,  295200,  298800,
        302400,  306000,  309600,  313200,  316800,  320400,  324000,
        327600,  331200,  334800,  338400,  342000,  345600,  349200,
        352800,  356

In [14]:
grader.check("q1_1_5")

## 1.2. Working with single elements of arrays ("indexing")
Let's work with a more interesting dataset.  The next cell creates an array called `population` that includes estimated world populations in every year from **1950** to **2022**.  (The estimates come from the [International Database](https://www.census.gov/data-tools/demo/idb/#/country?COUNTRY_YEAR=2022&COUNTRY_YR_ANIM=2022), maintained by the US Census Bureau.)

Rather than type in the data manually, we've loaded them from a file on your computer called `world_population_2022.csv`.  You'll learn how to read in data from files very soon.

In [15]:
# Don't worry too much about what goes on in this cell.
population = bpd.read_csv("data/world_population_2022.csv").get("Population").values
population

array([2557619597, 2594942227, 2636777090, 2682060684, 2730237675,
       2782111389, 2835315327, 2891368627, 2948159570, 3000742521,
       3043031253, 3084053711, 3140239653, 3210037409, 3281477826,
       3350773176, 3421097064, 3490825940, 3562887008, 3637819236,
       3713457589, 3791172327, 3867519813, 3943132388, 4017779234,
       4089387557, 4159536915, 4230430893, 4301282222, 4374940345,
       4445975606, 4527418598, 4610620221, 4694937687, 4777055423,
       4862317393, 4949951891, 5040273543, 5131575729, 5222662682,
       5315511894, 5403253915, 5490481497, 5568231516, 5650178207,
       5733211108, 5815333785, 5895837672, 5975189305, 6053955779,
       6132455985, 6211328357, 6290282107, 6369186797, 6448262425,
       6527056809, 6607396274, 6689442159, 6773319540, 6857160919,
       6939761510, 7022084781, 7105001721, 7188528811, 7271598780,
       7353476064, 7435151387, 7516769535, 7597066210, 7676686052,
       7756873419, 7831718605, 7905336896])

Here's how we get the first element of `population`, which is the world population in the first year in the dataset, 1950.

In [16]:
population[0]

2557619597

Notice that we use square brackets here. The square brackets signal that we are *accessing* an element of the array. Square brackets in Python are kind of like subscripts in math.

The value of that expression is the number 2557619597 (around 2.5 billion), because that's the first thing in the array `population`.

Notice that we wrote `population[0]`, not `population[1]`, to get the first element.  This is a weird convention in computer science. 0 is called the *index* of the first item.  It's the number of elements that appear *before* that item.  So 3 is the index of the 4th item.
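Here is a small sketch of the same idea on a made-up array. The `rainfall` values and the `first_year` of 2000 are hypothetical, chosen only to show how positions and indexes line up.

```python
import numpy as np

# Hypothetical yearly measurements, one per year starting in 2000.
first_year = 2000
rainfall = np.array([10.5, 8.2, 12.1, 9.9, 11.3])

first = rainfall[0]     # index 0: no elements come before it
fourth = rainfall[3]    # index 3: three elements come before it

# The index for a given year is the number of years since the first year.
rain_2002 = rainfall[2002 - first_year]

# The last valid index is always one less than the length.
last = rainfall[len(rainfall) - 1]
```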

Here are some more examples.  In the examples, we've given names to the things we get out of `population`.  Read and run each cell.

In [17]:
# The third element in the array is the population in 1952.
population_1952 = population[2]
population_1952

2636777090

In [18]:
# The thirteenth element in the array is the population
# in 1962 (which is 1950 + 12).
population_1962 = population[12]
population_1962

3140239653

In [19]:
# The 73rd element in the array is the population in 2022.
population_2022 = population[72]
population_2022

7905336896

In [20]:
# The array has only 73 elements, so this doesn't work.
# (There's no element with 73 other elements before it.)

#population_2023 = population[73]
#population_2023

# 🚨 After running this cell, please place a # before each line above to make sure that it doesn't run again.

**Question 1.2.1.** Set `population_1998` to the world population in 1998, by getting the appropriate element from `population`.

In [21]:
population_1998 = population[48]
population_1998

5975189305

In [22]:
grader.check("q1_2_1")

## 1.3. Doing something to every element of an array
Arrays are primarily useful for doing the same operation many times, so we don't often have to access and work with single elements.

##### Logarithms
Here is one simple question we might ask about world population:

> How big was the population in *orders of magnitude* in each year?

The logarithm function is one way of measuring how big a number is. The logarithm (base 10) of a number increases by 1 every time we multiply the number by 10. It's like a measure of how many decimal digits the number has, or how big it is in orders of magnitude.

We could try to answer our question like this, using the `log10` function from NumPy on each element of the `population` array:

In [23]:
population_1950_magnitude = np.log10(population[0])
population_1951_magnitude = np.log10(population[1])
population_1952_magnitude = np.log10(population[2])
population_1953_magnitude = np.log10(population[3])

# ... and so on!

But this is tedious and repetitive. There must be a better way!

It turns out that NumPy's `log10` is pretty powerful. Not only can it take in a single number (like `population[0]`) as input and return the logarithm of a single number, but it can **also** take in an entire array of numbers and return the logarithm of each element in that array!

If you give NumPy's `log10` an array as input, it will return an array of the same length, where the first element of the result is the logarithm of the first element of the input, the second element of the result is the logarithm of the second element of the input, and so on.

<img src="data/array_logarithm.jpg">

This is called *elementwise* application of the function, since it operates separately on each element of the array it's called on.  
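As a minimal sketch of elementwise application, here is `np.log10` applied to a made-up array of powers of ten (the names are invented for this example).

```python
import numpy as np

powers_of_ten = np.array([1, 10, 100, 1000])

# One call handles every element; the result has the same length as the input.
magnitudes = np.log10(powers_of_ten)   # approximately 0, 1, 2, 3

# The same function still works on a single number.
single_magnitude = np.log10(100)       # approximately 2
```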

**Question 1.3.1.** Use NumPy's `log10` function to compute the logarithms of the world population in every year.  Give the result (an array of 73 numbers) the name `population_magnitudes`.  Your code should be very short.

In [24]:
population_magnitudes = np.log10(population)
population_magnitudes

array([9.40783595, 9.41412769, 9.42107342, 9.4284686 , 9.43620046,
       9.44437451, 9.45260137, 9.46110346, 9.46955099, 9.47722873,
       9.48330641, 9.48912193, 9.49696279, 9.50651009, 9.51606947,
       9.52514503, 9.5341654 , 9.54292819, 9.55180205, 9.56084112,
       9.56977847, 9.57877353, 9.58743255, 9.59584136, 9.60398607,
       9.61165827, 9.61904498, 9.6263846 , 9.63359794, 9.64097214,
       9.64796708, 9.65585065, 9.66375935, 9.67162983, 9.67916028,
       9.6868433 , 9.69460098, 9.70245411, 9.71025074, 9.71789198,
       9.72554509, 9.73265538, 9.73961043, 9.74571728, 9.75206215,
       9.75839793, 9.76457465, 9.77054552, 9.77635167, 9.78203924,
       9.78763444, 9.79318449, 9.79867012, 9.80408399, 9.8094427 ,
       9.81471739, 9.82003035, 9.8253899 , 9.83080156, 9.83614434,
       9.84134455, 9.84646607, 9.85156419, 9.85664002, 9.86162991,
       9.86649268, 9.87128982, 9.87603123, 9.88064591, 9.88517378,
       9.8896867 , 9.89385707, 9.89792038])

In [25]:
grader.check("q1_3_1")

##### Arithmetic
Arithmetic also works elementwise on arrays.  For example, you can divide all the population numbers by 1 billion to get numbers in billions:

In [26]:
population_in_billions = population / 1000000000
population_in_billions

array([2.5576196 , 2.59494223, 2.63677709, 2.68206068, 2.73023768,
       2.78211139, 2.83531533, 2.89136863, 2.94815957, 3.00074252,
       3.04303125, 3.08405371, 3.14023965, 3.21003741, 3.28147783,
       3.35077318, 3.42109706, 3.49082594, 3.56288701, 3.63781924,
       3.71345759, 3.79117233, 3.86751981, 3.94313239, 4.01777923,
       4.08938756, 4.15953692, 4.23043089, 4.30128222, 4.37494034,
       4.44597561, 4.5274186 , 4.61062022, 4.69493769, 4.77705542,
       4.86231739, 4.94995189, 5.04027354, 5.13157573, 5.22266268,
       5.31551189, 5.40325391, 5.4904815 , 5.56823152, 5.65017821,
       5.73321111, 5.81533378, 5.89583767, 5.9751893 , 6.05395578,
       6.13245598, 6.21132836, 6.29028211, 6.3691868 , 6.44826243,
       6.52705681, 6.60739627, 6.68944216, 6.77331954, 6.85716092,
       6.93976151, 7.02208478, 7.10500172, 7.18852881, 7.27159878,
       7.35347606, 7.43515139, 7.51676953, 7.59706621, 7.67668605,
       7.75687342, 7.8317186 , 7.9053369 ])

You can do the same with addition, subtraction, multiplication, and exponentiation (`**`). For example, you can calculate a twenty percent tip on several restaurant bills at once:

In [27]:
restaurant_bills = np.array([20.12, 39.90, 31.01])
print("Restaurant bills:\t", restaurant_bills)
tips = 0.2 * restaurant_bills 
print("Tips:\t\t\t", tips)

Restaurant bills:	 [20.12 39.9  31.01]
Tips:			 [4.024 7.98  6.202]


<img src="data/array_multiplication.jpg">
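Here is one more sketch of elementwise arithmetic, using made-up temperature data. Several operations can be chained in one expression, and each is applied to every element.

```python
import numpy as np

# Hypothetical temperatures in degrees Celsius.
celsius = np.array([0.0, 25.0, 100.0])

# Multiply, divide, and add, all elementwise.
fahrenheit = celsius * 9 / 5 + 32       # 32, 77, 212

# Exponentiation and subtraction work the same way.
squares = np.array([1, 2, 3]) ** 2      # 1, 4, 9
one_less = np.array([1, 2, 3]) - 1      # 0, 1, 2
```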

**Question 1.3.2.** Suppose the total charge at a restaurant is the original bill plus the tip (20%).  That means we can multiply the original bill by 1.2 to get the total charge.  Compute the total charge for each bill in `restaurant_bills` and give the resulting array the name `total_charges`.

In [28]:
total_charges = 1.2*restaurant_bills
total_charges

array([24.144, 47.88 , 37.212])

In [29]:
grader.check("q1_3_2")

Let's read in some data to use in the next question.

In [30]:
more_restaurant_bills = bpd.read_csv("data/more_restaurant_bills.csv").get("Bill").values

**Question 1.3.3.** The array `more_restaurant_bills` contains 100,000 bills!  Compute the total charge for each one, assuming again a twenty percent tip, and give the resulting array the name `more_total_charges`.

In [31]:
more_total_charges = 1.2*more_restaurant_bills
more_total_charges

array([20.244, 20.892, 12.216, ..., 19.308, 18.336, 35.664])

In [32]:
grader.check("q1_3_3")

The function `sum` takes a single array of numbers as its argument.  It returns the sum of all the numbers in that array (so it returns a single number, not an array).
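For instance, here is a sketch with three made-up bills. The array method `.sum()` shown at the end is part of NumPy rather than something this lab requires; it is included only to show that it agrees with `sum`.

```python
import numpy as np

# Three hypothetical bills.
bills = np.array([10.0, 20.5, 4.5])

# sum collapses the whole array into a single number.
total = sum(bills)          # 35.0

# NumPy arrays also have their own sum method, which gives the same result.
also_total = bills.sum()
```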

**Question 1.3.4.** What was the sum of all the bills in `more_restaurant_bills`, **including tips**?

In [33]:
sum_of_bills = sum(more_total_charges)
sum_of_bills

1795730.0640000193

In [34]:
grader.check("q1_3_4")

##### Powers of Two
The powers of 2 ($2^0 = 1$, $2^1 = 2$, $2^2 = 4$, etc.) arise frequently in computer science.  (For example, you may have noticed that storage on smartphones or computers comes in powers of 2, like 64 GB, 128 GB, or 256 GB.)

**Question 1.3.5.** Use `np.arange` and the exponentiation operator `**` to create an array containing the first 40 powers of 2, starting from $2^0=1$.

*Hint 1*: Did your kernel "die" when you ran your solution? There is a common incorrect response to this problem that tries to create an array with so many entries that Python gives up and crashes. If this happens to you, double-check your answer! 

*Hint 2*: Maybe just start with the first 5 powers of two. Once you get that working, then try all 40. At no point should you have to manually write `0, 1, 2, 3, 4, ...`; if you find yourself trying that, scroll up to earlier in the lab notebook.
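Following Hint 2, here is a sketch of the 5-element version, along with a small-scale look at the mistake described in Hint 1. The variable names are made up for this illustration.

```python
import numpy as np

# The exponents 0, 1, 2, 3, 4.
exponents = np.arange(5)

# Raising 2 to an array of exponents works elementwise.
first_five = 2 ** exponents          # 1, 2, 4, 8, 16

# The mistake: putting the power inside np.arange. This asks for every
# integer from 0 up to 2 ** 5 - 1, which is 32 numbers, not 5 powers of two.
# With 40 instead of 5, that would be over a trillion entries.
not_powers = np.arange(2 ** 5)
```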

In [63]:
powers_of_2 = 2**np.arange(40)
powers_of_2

array([           1,            2,            4,            8,
                 16,           32,           64,          128,
                256,          512,         1024,         2048,
               4096,         8192,        16384,        32768,
              65536,       131072,       262144,       524288,
            1048576,      2097152,      4194304,      8388608,
           16777216,     33554432,     67108864,    134217728,
          268435456,    536870912,   1073741824,   2147483648,
         4294967296,   8589934592,  17179869184,  34359738368,
        68719476736, 137438953472, 274877906944, 549755813888])

In [64]:
grader.check("q1_3_5")

# 2. DataFrames 

## 2.1. Introduction

For a collection of things in the world, an array is useful for describing a single attribute of each thing. For example, among the collection of US States, an array could describe the land area of each. Tables extend this idea by describing multiple attributes for each element of a collection. In a table of states, for example, we might keep track of land area, population, state capital, and the name of the governor. In other words, tables keep track of many entities (individuals, stored as rows), and for each entity, many attributes (features, stored as columns).

In the cell below we have two arrays. The first one contains the world population in each year (as estimated by the US Census Bureau), and the second contains the years themselves (in order, so the first elements in the population and the years arrays correspond).

In [65]:
population_amounts = bpd.read_csv("data/world_population_2022.csv").get("Population").values
population_years = np.arange(1950, 2022+1)
print("Population column:", population_amounts)
print("Years column:", population_years)

Population column: [2557619597 2594942227 2636777090 2682060684 2730237675 2782111389
 2835315327 2891368627 2948159570 3000742521 3043031253 3084053711
 3140239653 3210037409 3281477826 3350773176 3421097064 3490825940
 3562887008 3637819236 3713457589 3791172327 3867519813 3943132388
 4017779234 4089387557 4159536915 4230430893 4301282222 4374940345
 4445975606 4527418598 4610620221 4694937687 4777055423 4862317393
 4949951891 5040273543 5131575729 5222662682 5315511894 5403253915
 5490481497 5568231516 5650178207 5733211108 5815333785 5895837672
 5975189305 6053955779 6132455985 6211328357 6290282107 6369186797
 6448262425 6527056809 6607396274 6689442159 6773319540 6857160919
 6939761510 7022084781 7105001721 7188528811 7271598780 7353476064
 7435151387 7516769535 7597066210 7676686052 7756873419 7831718605
 7905336896]
Years column: [1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 1961 1962 1963
 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977
 1978 19

Suppose we want to answer this question:

> When did the world's population surpass 7 billion?

You could technically answer this question just by staring at the arrays, but it's a bit convoluted, since you would have to count the position where the population first crossed 7 billion, then find the corresponding element in the years array. In cases like these, it might be easier to put the data into a table.

Just as `numpy` provides arrays, a popular package called `pandas` provides **DataFrames**, which is `pandas`' name for **tables**. `pandas` is *the* tool for doing data science in Python. Unfortunately, `pandas` isn't as cute as its name might suggest: it's very complicated and can be somewhat hard to learn.

Instead of using `pandas`, we'll use a package that we've created specifically for DSC 10. It is a *subset* of `pandas`, including only the parts that we think are necessary and throwing out all of the rest. Because it is smaller (and cuter), we've called it `babypandas`. 

<img src='data/pandas-babypandas.jpg' width=400>

You can import `babypandas` using the following code:

In [66]:
import babypandas as bpd


The nice thing about `babypandas` is that it is easier to learn *but* every bit of code you write using `babypandas` will work with `pandas`, too. If you're a data science major, or just going to be doing a lot of data analysis in Python, you'll see quite a lot of `pandas` in your future.

The cell below:

- creates an empty DataFrame using the expression `bpd.DataFrame()`,
- assigns two columns to the DataFrame by calling `assign`,
- assigns the resulting DataFrame to the name `population_df`, and finally
- displays `population_df` so that we can see the DataFrame we've made.

`"Population"` and `"Year"` are column labels that we have chosen. We could have chosen anything, but it's a good idea to choose names that are descriptive and not too long.

In [67]:
population_df = bpd.DataFrame().assign(
    Population=population_amounts,
    Year=population_years
)
population_df

|  | Population | Year |
| --- | --- | --- |
| 0 | 2557619597 | 1950 |
| 1 | 2594942227 | 1951 |
| 2 | 2636777090 | 1952 |
| 3 | 2682060684 | 1953 |
| 4 | 2730237675 | 1954 |
| ... | ... | ... |
| 68 | 7597066210 | 2018 |
| 69 | 7676686052 | 2019 |
| 70 | 7756873419 | 2020 |
| 71 | 7831718605 | 2021 |


Now the data are all together in a single DataFrame! It's much easier to parse this data. If you need to know what the population was in 2011, for example, you can tell from a single glance. We'll revisit this DataFrame later.
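The same recipe works for any pair of arrays of equal length. Below is a minimal sketch with made-up city data (the names and areas are hypothetical). It falls back to `pandas` if `babypandas` isn't installed, on the assumption stated above that `babypandas` code also works with `pandas`.

```python
import numpy as np

try:
    import babypandas as bpd
except ImportError:
    # Assumption: pandas can stand in, since babypandas is a subset of it.
    import pandas as bpd

# Two hypothetical arrays describing the same three cities, in the same order.
city_names = np.array(["Alphaville", "Betatown", "Gamma City"])
city_areas = np.array([372.4, 115.2, 78.0])

# Start from an empty DataFrame and attach one column per array.
cities = bpd.DataFrame().assign(City=city_names, Area=city_areas)
cities
```

Each keyword in `assign` becomes a column label, and each array becomes that column's values.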

**Question 2.1.1.** In the cell below, we've created 2 arrays. Using the steps above, assign `top_10_movies` to a DataFrame that has two columns called `"Rating"` and `"Name"`, which hold `top_10_movie_ratings` and `top_10_movie_names` respectively.

In [40]:
top_10_movie_ratings = np.array([9.2, 9.2, 9., 8.9, 8.9, 8.9, 8.9, 8.9, 8.9, 8.8])
top_10_movie_names = np.array([
        'The Shawshank Redemption (1994)',
        'The Godfather (1972)',
        'The Godfather: Part II (1974)',
        'Pulp Fiction (1994)',
        "Schindler's List (1993)",
        'The Lord of the Rings: The Return of the King (2003)',
        '12 Angry Men (1957)',
        'The Dark Knight (2008)',
        'Il buono, il brutto, il cattivo (1966)',
        'The Lord of the Rings: The Fellowship of the Ring (2001)'
])

In [68]:
top_10_movies = bpd.DataFrame().assign(Rating = top_10_movie_ratings, Name = top_10_movie_names)
top_10_movies

|  | Rating | Name |
| --- | --- | --- |
| 0 | 9.2 | The Shawshank Redemption (1994) |
| 1 | 9.2 | The Godfather (1972) |
| 2 | 9.0 | The Godfather: Part II (1974) |
| 3 | 8.9 | Pulp Fiction (1994) |
| 4 | 8.9 | Schindler's List (1993) |
| 5 | 8.9 | The Lord of the Rings: The Return of the King ... |
| 6 | 8.9 | 12 Angry Men (1957) |
| 7 | 8.9 | The Dark Knight (2008) |
| 8 | 8.9 | Il buono, il brutto, il cattivo (1966) |
| 9 | 8.8 | The Lord of the Rings: The Fellowship of the R... |


In [69]:
grader.check("q2_1_1")

Suppose you want to add your own rankings to this DataFrame. The cell below contains your ranking of each movie:

In [70]:
my_ranking = [8, 2, 1, 9, 7, 10, 6, 4, 3, 5]

**Question 2.1.2.** You can use the `assign` method to add a column to an already-existing DataFrame, too. Create a new DataFrame called `with_ranking` by adding a column named `"Ranking"` to the DataFrame in `top_10_movies`.
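One detail worth seeing in action: `assign` gives back a *new* DataFrame and leaves the original alone. Here is a sketch with made-up data (the `scores` DataFrame is hypothetical), falling back to `pandas` if `babypandas` isn't installed.

```python
import numpy as np

try:
    import babypandas as bpd
except ImportError:
    # Assumption: pandas can stand in, since babypandas is a subset of it.
    import pandas as bpd

# A hypothetical DataFrame with two columns.
scores = bpd.DataFrame().assign(
    Name=np.array(["Ana", "Ben", "Cy"]),
    Score=np.array([90, 85, 77])
)

# Adding a column produces a new DataFrame with three columns...
with_grade = scores.assign(Grade=["A", "B", "C"])

# ...while scores itself still has only its original two.
original_columns = list(scores.columns)
new_columns = list(with_grade.columns)
```

This is why the result of `assign` needs to be given a name; otherwise the new column is lost.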

In [72]:
with_ranking = top_10_movies.assign(Ranking = my_ranking )
with_ranking

|  | Rating | Name | Ranking |
| --- | --- | --- | --- |
| 0 | 9.2 | The Shawshank Redemption (1994) | 8 |
| 1 | 9.2 | The Godfather (1972) | 2 |
| 2 | 9.0 | The Godfather: Part II (1974) | 1 |
| 3 | 8.9 | Pulp Fiction (1994) | 9 |
| 4 | 8.9 | Schindler's List (1993) | 7 |
| 5 | 8.9 | The Lord of the Rings: The Return of the King ... | 10 |
| 6 | 8.9 | 12 Angry Men (1957) | 6 |
| 7 | 8.9 | The Dark Knight (2008) | 4 |
| 8 | 8.9 | Il buono, il brutto, il cattivo (1966) | 3 |
| 9 | 8.8 | The Lord of the Rings: The Fellowship of the R... | 5 |


In [73]:
grader.check("q2_1_2")

## 2.2. Indexes

You may have noticed that the DataFrame of populations contains what looks like an extra, unlabeled column on the left with the numbers 0 through 72. **This is not a column, it's what we call an *index***. The index contains the row labels. Whereas the columns of this DataFrame are labeled `"Population"` and `"Year"`, the rows are labeled 0, 1, ..., 72.

By default, `babypandas` doesn't know how to label the rows, and so it just numbers them (starting with 0). Of course, in this case it makes more sense to use the year as a row's label. We can do this by telling `babypandas` to set the `"Year"` column as the index:

In [74]:
population_by_year = population_df.set_index('Year')
population_by_year

| Year | Population |
| --- | --- |
| 1950 | 2557619597 |
| 1951 | 2594942227 |
| 1952 | 2636777090 |
| 1953 | 2682060684 |
| 1954 | 2730237675 |
| ... | ... |
| 2018 | 7597066210 |
| 2019 | 7676686052 |
| 2020 | 7756873419 |
| 2021 | 7831718605 |


As we'll see, this does more than make the DataFrame look nicer -- it is very useful, too.
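Here is a minimal sketch of what `set_index` changes, using a made-up `pets` DataFrame (hypothetical data) and falling back to `pandas` if `babypandas` isn't installed.

```python
import numpy as np

try:
    import babypandas as bpd
except ImportError:
    # Assumption: pandas can stand in, since babypandas is a subset of it.
    import pandas as bpd

pets = bpd.DataFrame().assign(
    Name=np.array(["Rex", "Polly", "Nemo"]),
    Legs=np.array([4, 2, 0])
)

# By default, the rows are simply numbered starting from 0.
default_labels = list(pets.index)

# After set_index, the names become the row labels,
# and "Name" is no longer one of the columns.
pets_by_name = pets.set_index("Name")
name_labels = list(pets_by_name.index)
remaining_columns = list(pets_by_name.columns)
```

Like `assign`, `set_index` returns a new DataFrame; `pets` itself keeps its default numbering.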

**Question 2.2.1.** Create a new DataFrame named `top_10_movies_by_name` by taking the DataFrame you made above, `top_10_movies`, and setting the index to be the `"Name"` column.

In [77]:
top_10_movies_by_name = top_10_movies.set_index("Name")
top_10_movies_by_name

| Name | Rating |
| --- | --- |
| The Shawshank Redemption (1994) | 9.2 |
| The Godfather (1972) | 9.2 |
| The Godfather: Part II (1974) | 9.0 |
| Pulp Fiction (1994) | 8.9 |
| Schindler's List (1993) | 8.9 |
| The Lord of the Rings: The Return of the King (2003) | 8.9 |
| 12 Angry Men (1957) | 8.9 |
| The Dark Knight (2008) | 8.9 |
| Il buono, il brutto, il cattivo (1966) | 8.9 |
| The Lord of the Rings: The Fellowship of the Ring (2001) | 8.8 |


In [78]:
grader.check("q2_2_1")

You can get an array of row names using `.index`. For instance, the array of row names of the `population_by_year` DataFrame is:

In [79]:
population_by_year.index

Int64Index([1950, 1951, 1952, 1953, 1954, 1955, 1956, 1957, 1958, 1959, 1960,
            1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970, 1971,
            1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982,
            1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991, 1992, 1993,
            1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004,
            2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015,
            2016, 2017, 2018, 2019, 2020, 2021, 2022],
           dtype='int64', name='Year')

**Question 2.2.2.** Using code, assign to `tenth_movie` the name of the tenth movie in `top_10_movies_by_name`.

*Hint:* Remember that the index is an array, and we use square brackets to access elements of an array.
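As a sketch of the hint, here is how square brackets pick out a single row label from an index. The `planets` DataFrame is made up for this example, and the code falls back to `pandas` if `babypandas` isn't installed.

```python
import numpy as np

try:
    import babypandas as bpd
except ImportError:
    # Assumption: pandas can stand in, since babypandas is a subset of it.
    import pandas as bpd

planets = bpd.DataFrame().assign(
    Planet=np.array(["Mercury", "Venus", "Earth"]),
    Moons=np.array([0, 0, 1])
).set_index("Planet")

# Positions in the index start at 0, just like positions in an array.
first_planet = planets.index[0]
third_planet = planets.index[2]
```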

In [81]:
tenth_movie = top_10_movies_by_name.index[9]
tenth_movie

'The Lord of the Rings: The Fellowship of the Ring (2001)'

In [82]:
grader.check("q2_2_2")

## 2.3. Reading a DataFrame from a file
In most cases, we aren't going to go through the trouble of typing in all the data manually. Instead, we can use functions provided by `babypandas` to read in data from external files.

The `bpd.read_csv()` function takes one argument, a path to a data file (a string), and returns a DataFrame.  There are many formats for data files, but CSV ("comma-separated values") is the most common. 
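To make the CSV format concrete, the sketch below writes a tiny made-up CSV file to a temporary folder and then reads it back. The file name `tiny_movies.csv` and its contents are hypothetical, and the code falls back to `pandas` if `babypandas` isn't installed.

```python
import os
import tempfile

try:
    import babypandas as bpd
except ImportError:
    # Assumption: pandas can stand in, since babypandas is a subset of it.
    import pandas as bpd

# A CSV file is plain text: the first line holds the column labels,
# and each later line holds one row, with values separated by commas.
csv_text = "Title,Year\nExample A,1999\nExample B,2005\n"

# Write the text to a file in a temporary folder.
path = os.path.join(tempfile.mkdtemp(), "tiny_movies.csv")
with open(path, "w") as f:
    f.write(csv_text)

# read_csv takes the path (a string) and returns a DataFrame.
tiny_movies = bpd.read_csv(path)
tiny_movies
```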

**Question 2.3.1.** The file `data/imdb.csv` contains information about the 250 highest-rated movies on IMDb.  Load it as a DataFrame called `imdb`.

In [84]:
imdb = bpd.read_csv("data/imdb.csv")
imdb

|  | Votes | Rating | Title | Year | Decade |
| --- | --- | --- | --- | --- | --- |
| 0 | 88355 | 8.4 | M | 1931 | 1930 |
| 1 | 132823 | 8.3 | Singin' in the Rain | 1952 | 1950 |
| 2 | 74178 | 8.3 | All About Eve | 1950 | 1950 |
| 3 | 635139 | 8.6 | Léon | 1994 | 1990 |
| 4 | 145514 | 8.2 | The Elephant Man | 1980 | 1980 |
| ... | ... | ... | ... | ... | ... |
| 245 | 1078416 | 8.7 | Forrest Gump | 1994 | 1990 |
| 246 | 31003 | 8.1 | Le salaire de la peur | 1953 | 1950 |
| 247 | 167076 | 8.2 | 3 Idiots | 2009 | 2000 |
| 248 | 91689 | 8.1 | Network | 1976 | 1970 |


In [85]:
grader.check("q2_3_1")

Notice the dots in the middle of the DataFrame. This means that a lot of the rows have been omitted. This DataFrame is big enough that only a few of its rows are displayed, but the others are still there.  There are 250 movies total.

Where did `imdb.csv` come from? Take a look at [this lab's folder](./). If you go into the `data/` directory, you should see a file called `imdb.csv`.

Open up the `imdb.csv` file in that folder and look at the format. What do you notice? The `.csv` filename ending says that this file is in the [CSV (comma-separated value) format](http://edoceo.com/utilitas/csv-file-format).

**Question 2.3.2.** This is a data set of movies, so it makes sense to use the movie title as the row label. Create a new DataFrame called `imdb_by_name` which uses the movie title as the index.

In [86]:
imdb_by_name = imdb.set_index("Title")
imdb_by_name

Unnamed: 0_level_0,Votes,Rating,Year,Decade
Title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
M,88355,8.4,1931,1930
Singin' in the Rain,132823,8.3,1952,1950
All About Eve,74178,8.3,1950,1950
Léon,635139,8.6,1994,1990
The Elephant Man,145514,8.2,1980,1980
...,...,...,...,...
Forrest Gump,1078416,8.7,1994,1990
Le salaire de la peur,31003,8.1,1953,1950
3 Idiots,167076,8.2,2009,2000
Network,91689,8.1,1976,1970


In [87]:
grader.check("q2_3_2")

## 2.4. Series



Suppose we're interested primarily in movie ratings. To extract just this column from the DataFrame, we use the `.get` method:

In [88]:
ratings = imdb_by_name.get('Rating')
ratings

Title
M                                        8.4
Singin' in the Rain                      8.3
All About Eve                            8.3
Léon                                     8.6
The Elephant Man                         8.2
                                        ... 
Forrest Gump                             8.7
Le salaire de la peur                    8.1
3 Idiots                                 8.2
Network                                  8.1
Eternal Sunshine of the Spotless Mind    8.3
Name: Rating, Length: 250, dtype: float64

Notice how not only the movie ratings have been returned, but also the name of the movie! This is precisely because we have set the movie title to be the index! For example, if we had asked for the `"Rating"` column of the original DataFrame, `imdb`, we would see:

In [89]:
imdb.get('Rating')

0      8.4
1      8.3
2      8.3
3      8.6
4      8.2
      ... 
245    8.7
246    8.1
247    8.2
248    8.1
249    8.3
Name: Rating, Length: 250, dtype: float64

This is one way in which indices are very useful - they provide meaningful labels for the data.

At first glance, it might look like asking for a column using `.get` returns a DataFrame with one column, but that's not quite right. Instead, it returns a special type of thing called a *Series*:

In [90]:
type(imdb_by_name.get('Rating'))

babypandas.bpd.Series

You can think of a `Series` as an array with an index. Whereas arrays are simple sequences of values without labels, the values in a `Series` can have labels. This is often very useful.

`ratings` is now a `Series` containing the column of movie ratings. Suppose we're interested in the rating of a particular movie: _Alien_. To find it, we'll use the `.loc` *accessor*, which pulls a value from the Series at a particular *loc*ation:

In [91]:
ratings.loc["Alien"]

8.5

There are a couple of things to note here. First, those are square brackets around `"Alien"`. This is because `.loc` is not a method, but an *accessor*. The square brackets signal that we're going to be extracting an element from the `Series`. Second, we passed in the label as a string.
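As a small sketch of the difference between an array and a Series, the example below builds both from the same three ratings (taken from the output above). It uses plain `pandas` rather than `babypandas` so that it runs on its own; `ratings_array` and `ratings_series` are names invented for this illustration.

```python
import numpy as np
import pandas as pd

# An array only knows positions: 0, 1, 2, ...
ratings_array = np.array([8.4, 8.3, 8.6])

# A Series pairs each value with a label from its index.
ratings_series = pd.Series(
    ratings_array,
    index=["M", "Singin' in the Rain", "Léon"],
)

print(ratings_array[2])            # look up by position
print(ratings_series.loc["Léon"])  # look up by label, using square brackets
```

Both lines print the same rating; the Series just lets us ask for it by name.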

**Question 2.4.1.** Find the rating of _3 Idiots_.

In [93]:
three_idiots_rating = ratings.loc["3 Idiots"]
three_idiots_rating

8.2

In [94]:
grader.check("q2_4_1")

Now suppose we wanted to know the year in which _Alien_ was released. We could do this by first getting the column of years:

In [95]:
years = imdb_by_name.get('Year')
years

Title
M                                        1931
Singin' in the Rain                      1952
All About Eve                            1950
Léon                                     1994
The Elephant Man                         1980
                                         ... 
Forrest Gump                             1994
Le salaire de la peur                    1953
3 Idiots                                 2009
Network                                  1976
Eternal Sunshine of the Spotless Mind    2004
Name: Year, Length: 250, dtype: int64

And then using `.loc` to get the right entry:

In [96]:
years.loc['Alien']

1979

We could also do this in one step by *chaining* the operations together:

In [97]:
imdb_by_name.get('Year').loc['Alien']

1979

This works because Python first evaluates `imdb_by_name.get('Year')` to a Series. It then evaluates the `.loc['Alien']` to return the year.

Chaining is used pretty frequently and can be handy. Just be sure not to chain *so* many things together that your code gets hard to read. You can always save an intermediate result to a variable.
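Here is a minimal sketch comparing the two styles on a tiny table. It assumes plain `pandas` in place of `babypandas`, and the DataFrame `movies` is a two-row stand-in built from rows shown earlier, not the full dataset.

```python
import pandas as pd

# Two rows copied from the `imdb` output above, indexed by title.
movies = pd.DataFrame({
    "Title": ["Forrest Gump", "Network"],
    "Year": [1994, 1976],
}).set_index("Title")

# Step by step: save the intermediate Series to a variable first.
years = movies.get("Year")
step_by_step = years.loc["Network"]

# Chained: the same two operations written as one expression.
chained = movies.get("Year").loc["Network"]

print(step_by_step == chained)
```

The two approaches do exactly the same work; the only difference is whether the intermediate Series gets a name.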

**Question 2.4.2.** Find the decade in which _Gone Girl_ was released using chaining.

*Hint*: `imdb_by_name` has a column named `"Decade"`.

In [98]:
decade = imdb_by_name.get('Decade').loc["Gone Girl"]
decade

2010

In [99]:
grader.check("q2_4_2")

# 3. Analyzing datasets

With just a few DataFrame methods, we can answer some interesting questions about the IMDb dataset.

If we want just the ratings of the movies, we can use `.get`:

In [109]:
ratings = imdb_by_name.get("Rating")
ratings

Title
M                                        8.4
Singin' in the Rain                      8.3
All About Eve                            8.3
Léon                                     8.6
The Elephant Man                         8.2
                                        ... 
Forrest Gump                             8.7
Le salaire de la peur                    8.1
3 Idiots                                 8.2
Network                                  8.1
Eternal Sunshine of the Spotless Mind    8.3
Name: Rating, Length: 250, dtype: float64

Remember that `ratings` is a Series. Series objects have some useful methods.

**Question 3.1.** Find the rating of the highest-rated movie in the dataset.

*Hint:* Type `ratings.` and hit Tab to see a list of the available methods. Is there one that looks useful?

In [111]:
highest_rating = ratings.max()
highest_rating

9.2

In [112]:
grader.check("q3_1")

You probably want to know the *name* of the movie whose rating you found!  To do that, we can sort the whole Series using the `.sort_values` method:

In [113]:
ratings.sort_values()

Title
Akira                               8.0
Per un pugno di dollari             8.0
Guardians of the Galaxy             8.0
The Man Who Shot Liberty Valance    8.0
Underground                         8.0
                                   ... 
Schindler's List                    8.9
12 Angry Men                        8.9
The Godfather: Part II              9.0
The Shawshank Redemption            9.2
The Godfather                       9.2
Name: Rating, Length: 250, dtype: float64

So there are actually 2 highest-rated movies in the dataset: *The Shawshank Redemption* and *The Godfather*.

Notice that we are sorting by the ratings, not the labels! Moreover, the label follows the rating as it is sorted. This is exactly what we want.

When we use the `sort_values` method, the resulting Series has the data sorted in ascending order, from small to large. This is the default behavior of `sort_values`, but we can change that. Had we wanted the highest rated movies on top, we would need to specify that the sorting should not be in ascending order with an optional *keyword argument*:


In [114]:
ratings.sort_values(ascending=False)

Title
The Godfather                             9.2
The Shawshank Redemption                  9.2
The Godfather: Part II                    9.0
12 Angry Men                              8.9
Il buono, il brutto, il cattivo (1966)    8.9
                                         ... 
Monsters, Inc. (2001)                     8.0
The Big Sleep                             8.0
X-Men: Days of Future Past                8.0
Roman Holiday                             8.0
Kumonosu-jô                               8.0
Name: Rating, Length: 250, dtype: float64

If we set the keyword argument `ascending` to `True`, we get the same result as if we did not set it at all. This is what we mean when we say that the default behavior of `sort_values` is to sort in ascending order. Confirm that the next two cells give the same output.

In [115]:
ratings.sort_values(ascending=True)

Title
Akira                               8.0
Per un pugno di dollari             8.0
Guardians of the Galaxy             8.0
The Man Who Shot Liberty Valance    8.0
Underground                         8.0
                                   ... 
Schindler's List                    8.9
12 Angry Men                        8.9
The Godfather: Part II              9.0
The Shawshank Redemption            9.2
The Godfather                       9.2
Name: Rating, Length: 250, dtype: float64

In [116]:
ratings.sort_values()

Title
Akira                               8.0
Per un pugno di dollari             8.0
Guardians of the Galaxy             8.0
The Man Who Shot Liberty Valance    8.0
Underground                         8.0
                                   ... 
Schindler's List                    8.9
12 Angry Men                        8.9
The Godfather: Part II              9.0
The Shawshank Redemption            9.2
The Godfather                       9.2
Name: Rating, Length: 250, dtype: float64

Not only can we sort Series, but we can sort entire DataFrames, too. When we do that, we have to specify the column to sort by:

In [117]:
imdb_by_name.sort_values('Rating')

Unnamed: 0_level_0,Votes,Rating,Year,Decade
Title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Akira,91652,8.0,1988,1980
Per un pugno di dollari,124671,8.0,1964,1960
Guardians of the Galaxy,527349,8.0,2014,2010
The Man Who Shot Liberty Valance,49135,8.0,1962,1960
Underground,39447,8.0,1995,1990
...,...,...,...,...
Schindler's List,761224,8.9,1993,1990
12 Angry Men,384187,8.9,1957,1950
The Godfather: Part II,692753,9.0,1974,1970
The Shawshank Redemption,1498733,9.2,1994,1990


Similarly, we can specify that the sort should be in descending order:

In [118]:
imdb_by_name.sort_values('Rating', ascending=False)

Unnamed: 0_level_0,Votes,Rating,Year,Decade
Title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
The Godfather,1027398,9.2,1972,1970
The Shawshank Redemption,1498733,9.2,1994,1990
The Godfather: Part II,692753,9.0,1974,1970
12 Angry Men,384187,8.9,1957,1950
"Il buono, il brutto, il cattivo (1966)",447875,8.9,1966,1960
...,...,...,...,...
"Monsters, Inc. (2001)",500576,8.0,2001,2000
The Big Sleep,59578,8.0,1946,1940
X-Men: Days of Future Past,427099,8.0,2014,2010
Roman Holiday,87437,8.0,1953,1950



Some details about sorting a DataFrame:

1. The first argument to `sort_values` is the name of a column to sort by.
2. If the column has strings in it, `sort_values` will sort alphabetically; if the column has numbers, it will sort numerically.
3. `imdb_by_name.sort_values("Rating")` returns a new DataFrame; the `imdb_by_name` DataFrame doesn't get modified. For example, if we called `imdb_by_name.sort_values("Rating")`, then running `imdb_by_name` by itself would still return the unsorted DataFrame. To save the result, you should assign it to a new variable.
4. Rows always stick together when a DataFrame is sorted.  It wouldn't make sense to sort just one column and leave the other columns alone.  For example, in this case, if we sorted just the `"Rating"` column, the movies would all end up with the wrong ratings.
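The last two points can be checked with a short sketch. It uses plain `pandas` as a stand-in for `babypandas`, and `movies` is a hypothetical three-row table built from rows shown earlier in this lab.

```python
import pandas as pd

movies = pd.DataFrame({
    "Title": ["M", "Léon", "Network"],
    "Rating": [8.4, 8.6, 8.1],
    "Year": [1931, 1994, 1976],
}).set_index("Title")

# sort_values returns a new DataFrame, so we save it under a new name.
by_rating = movies.sort_values("Rating")

print(list(movies.index))     # the original keeps its order
print(list(by_rating.index))  # the sorted copy has the lowest rating first

# Rows stay together: Léon still has its own year after sorting.
print(by_rating.get("Year").loc["Léon"])
```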

**Question 3.2.** Create a version of `imdb_by_name` that's sorted chronologically, with the earliest movies first.  Call it `imdb_sorted`.

In [123]:
imdb_sorted = imdb_by_name.sort_values("Year", ascending = True)
imdb_sorted

Unnamed: 0_level_0,Votes,Rating,Year,Decade
Title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
The Kid,55784,8.3,1921,1920
The Gold Rush,58506,8.2,1925,1920
The General,46332,8.2,1926,1920
Metropolis,98794,8.3,1927,1920
M,88355,8.4,1931,1930
...,...,...,...,...
The Grand Budapest Hotel,369141,8.1,2014,2010
Relatos salvajes,46987,8.0,2014,2010
Interstellar,689541,8.6,2014,2010
Mad Max: Fury Road,262425,8.3,2015,2010


In [124]:
grader.check("q3_2")

**Question 3.3.** What's the title of the earliest movie in the dataset?  You could just look this up from the output of the previous cell.  Instead, write Python code to find out.

*Hint:* Remember that the index is an array.

In [141]:
earliest_movie_title = imdb_sorted.index[0]
earliest_movie_title

'The Kid'

In [142]:
grader.check("q3_3")

Suppose we want to get the rating of the oldest movie in the DataFrame. One way to do this is to first find the index label of the oldest movie (which we've already done). We then extract the `"Rating"` column and use `.loc` to find the rating of the oldest movie.

In [143]:
imdb_sorted.get('Rating').loc[earliest_movie_title]

8.3

There's a faster way, though. A Series not only has a `.loc` accessor, but also an `.iloc` accessor. While `.loc` looks up things by *label*, `.iloc` looks up elements by *integer position*.

Let's remember what is in the `"Rating"` column:

In [144]:
imdb_sorted.get('Rating')

Title
The Kid                     8.3
The Gold Rush               8.2
The General                 8.2
Metropolis                  8.3
M                           8.4
                           ... 
The Grand Budapest Hotel    8.1
Relatos salvajes            8.0
Interstellar                8.6
Mad Max: Fury Road          8.3
Inside Out (2015/I)         8.5
Name: Rating, Length: 250, dtype: float64

If we want the rating of the first row, we can use `.iloc[0]`:

In [145]:
imdb_sorted.get('Rating').iloc[0]

8.3

This returns the exact same thing as `imdb_sorted.get('Rating').loc['The Kid']`; these are two ways of doing the same thing. Usually it is more convenient to access an element by its label rather than by its integer position, but both `.loc` and `.iloc` are good to know.
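As a quick side-by-side sketch, the example below looks up the same element both ways. It assumes plain `pandas` instead of `babypandas`, and `ratings_sorted` is a made-up name holding just the first three entries of the sorted `"Rating"` column shown above.

```python
import pandas as pd

# The first three entries of the sorted "Rating" column.
ratings_sorted = pd.Series(
    [8.3, 8.2, 8.2],
    index=["The Kid", "The Gold Rush", "The General"],
)

by_label = ratings_sorted.loc["The Gold Rush"]  # look up by index label
by_position = ratings_sorted.iloc[1]            # look up by integer position (counting from 0)

print(by_label == by_position)
```

Note that positions count from 0, so the second movie is at position 1.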

**Question 3.4.** What is the rating of the fifth oldest movie in the dataset? You could just look this up from the output of the previous cell. Instead, write Python code to find out.

In [154]:
fifth_oldest_rating = imdb_sorted.get("Rating").iloc[4]
fifth_oldest_rating

8.4

In [155]:
grader.check("q3_4")

# 4. Finding pieces of a dataset

Suppose you're interested in movies from the 1950s.  Sorting the DataFrame by year doesn't help you, because the 1950s are in the middle of the dataset. Instead, we'll use a feature of Series that allows us to easily compare each element in a column to a particular value.

First, remember that we can use `.get` to extract a single column. The result is not a DataFrame, but rather a Series:

In [156]:
imdb_by_name.get('Decade')

Title
M                                        1930
Singin' in the Rain                      1950
All About Eve                            1950
Léon                                     1990
The Elephant Man                         1980
                                         ... 
Forrest Gump                             1990
Le salaire de la peur                    1950
3 Idiots                                 2000
Network                                  1970
Eternal Sunshine of the Spotless Mind    2000
Name: Decade, Length: 250, dtype: int64

We want to check whether each movie was released in the 1950s, that is, whether its `"Decade"` is 1950. Python gives us a way of checking whether two things are equal with `==` (remember that `=` is already being used for another purpose: it assigns values to variable names):

In [157]:
3 == 4

False

In [158]:
3 == 3

True

`True` and `False` are instances of a type that we haven't seen before:

In [159]:
type(True)

bool

`bool` stands for "Boolean", named after the English logician [George Boole](https://en.wikipedia.org/wiki/George_Boole). We say that "True" and "False" are *Boolean* values.

It turns out that we can easily check if *each* of the elements in a `Series` is equal to something:

In [160]:
imdb_by_name.get('Decade') == 1950

Title
M                                        False
Singin' in the Rain                       True
All About Eve                             True
Léon                                     False
The Elephant Man                         False
                                         ...  
Forrest Gump                             False
Le salaire de la peur                     True
3 Idiots                                 False
Network                                  False
Eternal Sunshine of the Spotless Mind    False
Name: Decade, Length: 250, dtype: bool

We see that the result is a new series which has `True` only where the decade was 1950, and `False` everywhere else. We say that the resulting series is a series of *Booleans*, or a *Boolean Series*.

Let's call this result `is_from_1950s`. Its name can be read like it is a question: "is this movie from the 1950s"?

In [161]:
is_from_1950s = imdb_by_name.get('Decade') == 1950
is_from_1950s

Title
M                                        False
Singin' in the Rain                       True
All About Eve                             True
Léon                                     False
The Elephant Man                         False
                                         ...  
Forrest Gump                             False
Le salaire de la peur                     True
3 Idiots                                 False
Network                                  False
Eternal Sunshine of the Spotless Mind    False
Name: Decade, Length: 250, dtype: bool

Each row is an answer to this question. Is _The Elephant Man_ from the 1950s? `False`. Is _All About Eve_ from the 1950s? `True`.

We can use `is_from_1950s` to select only the rows from `imdb_by_name` for which the answer is `True`. The syntax for this is:

In [162]:
imdb_by_name[is_from_1950s]

Unnamed: 0_level_0,Votes,Rating,Year,Decade
Title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Singin' in the Rain,132823,8.3,1952,1950
All About Eve,74178,8.3,1950,1950
Some Like It Hot,156432,8.3,1959,1950
The Killing,56671,8.0,1956,1950
Roman Holiday,87437,8.0,1953,1950
...,...,...,...,...
Det sjunde inseglet,98949,8.2,1957,1950
The Night of the Hunter,57974,8.0,1955,1950
Smultronstället,55861,8.2,1957,1950
Strangers on a Train,85012,8.1,1951,1950


What `imdb_by_name[is_from_1950s]` does, precisely, is to go through `imdb_by_name` row by row. If the row named _Singin' in the Rain_ has the value `True` in `is_from_1950s`, that row is kept. If the value is `False`, the row is discarded. And so on, for every row.

Note that we could have accomplished this without ever creating the variable `is_from_1950s` by simply placing the code that we used to create the boolean series directly inside the `[...]`. This is a typical pattern you'll be using a lot!

In [163]:
imdb_by_name[imdb_by_name.get('Decade') == 1950]

Unnamed: 0_level_0,Votes,Rating,Year,Decade
Title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Singin' in the Rain,132823,8.3,1952,1950
All About Eve,74178,8.3,1950,1950
Some Like It Hot,156432,8.3,1959,1950
The Killing,56671,8.0,1956,1950
Roman Holiday,87437,8.0,1953,1950
...,...,...,...,...
Det sjunde inseglet,98949,8.2,1957,1950
The Night of the Hunter,57974,8.0,1955,1950
Smultronstället,55861,8.2,1957,1950
Strangers on a Train,85012,8.1,1951,1950


It helps to read the square brackets as "where." So the command in the cell above says to keep all rows from `imdb_by_name` *where* the decade is the 1950s.

Creating a new DataFrame by selecting only certain rows from an existing DataFrame which satisfy some condition is called *querying*. The line of code `imdb_by_name[imdb_by_name.get('Decade') == 1950]` is a *query*.
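To see the pieces of a query on a table small enough to read in full, here is a minimal sketch. It uses plain `pandas` as a stand-in for `babypandas`, and `movies` is a hypothetical four-row table built from rows shown earlier.

```python
import pandas as pd

movies = pd.DataFrame({
    "Title": ["M", "Singin' in the Rain", "All About Eve", "Léon"],
    "Decade": [1930, 1950, 1950, 1990],
}).set_index("Title")

# Comparing a Series to a value gives one True/False answer per row.
is_from_1950s = movies.get("Decade") == 1950
print(is_from_1950s.tolist())

# Using the Boolean Series inside square brackets keeps only the True rows.
with_variable = movies[is_from_1950s]

# The same query with the comparison written directly inside the brackets.
inline = movies[movies.get("Decade") == 1950]
print(list(inline.index))
```

Both versions keep the same two rows; naming the Boolean Series first is mostly a matter of readability.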

**Question 4.1.** Create a DataFrame called `ninety_eight` containing the movies that came out in 1998.

In [166]:
ninety_eight = imdb_by_name[imdb_by_name.get("Year")==1998]
ninety_eight

Unnamed: 0_level_0,Votes,Rating,Year,Decade
Title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Saving Private Ryan,769893,8.5,1998,1990
American History X,694602,8.5,1998,1990
"Lock, Stock and Two Smoking Barrels (1998)",372863,8.2,1998,1990
The Big Lebowski,473988,8.2,1998,1990
The Truman Show,583004,8.0,1998,1990


In [167]:
grader.check("q4_1")

So far we've only been finding where a column is *exactly* equal to a certain value. However, there are many other comparison operators we could use.  Here are a few:

|Operator|Tests|
|-|-|
|`==`|thing on left is equal to thing on right|
|`!=`|thing on left is *not* equal to thing on right|
|`>`|thing on left is greater than (and not equal to) thing on right|
|`>=`|thing on left is greater than or equal to thing on right|
|`<`|thing on left is less than (and not equal to) thing on right|
|`<=`|thing on left is less than or equal to thing on right|

[Note 10](https://notes.dsc10.com/02-data_sets/querying.html#examples) in the course notes has more examples.
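The difference between the strict and non-strict comparisons is easiest to see on a value that sits exactly on the boundary. The sketch below assumes plain `pandas` in place of `babypandas` and uses three ratings copied from the output above.

```python
import pandas as pd

ratings = pd.Series([8.4, 8.6, 8.7], index=["M", "Léon", "Forrest Gump"])

# Each comparison produces a Boolean Series; tolist() shows its values.
strictly_greater = (ratings > 8.6).tolist()   # 8.6 itself does not count
greater_or_equal = (ratings >= 8.6).tolist()  # 8.6 itself counts
not_equal = (ratings != 8.6).tolist()

print(strictly_greater)
print(greater_or_equal)
print(not_equal)
```

_Léon_, rated exactly 8.6, is the only movie whose answer changes between `>` and `>=`.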

**Question 4.2.** Using operators from the table above, find all the movies with a rating higher than 8.6.  Put their data in a DataFrame called `really_highly_rated`.

In [168]:
really_highly_rated = imdb_by_name[imdb_by_name.get("Rating")>8.6]
really_highly_rated

Unnamed: 0_level_0,Votes,Rating,Year,Decade
Title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
The Godfather,1027398,9.2,1972,1970
The Shawshank Redemption,1498733,9.2,1994,1990
"Il buono, il brutto, il cattivo (1966)",447875,8.9,1966,1960
The Lord of the Rings: The Two Towers,967389,8.7,2002,2000
The Dark Knight,1473049,8.9,2008,2000
...,...,...,...,...
The Lord of the Rings: The Fellowship of the Ring,1099087,8.8,2001,2000
Star Wars,770011,8.7,1977,1970
One Flew Over the Cuckoo's Nest,606395,8.7,1975,1970
Pulp Fiction,1166532,8.9,1994,1990


In [169]:
grader.check("q4_2")

What is the highest rating of any movie from the 1990s? We now have the tools to answer questions like these. Breaking it into pieces, we first find all of the movies from the 1990s:

In [170]:
is_from_1990s = imdb_by_name.get('Decade') == 1990
is_from_1990s

Title
M                                        False
Singin' in the Rain                      False
All About Eve                            False
Léon                                      True
The Elephant Man                         False
                                         ...  
Forrest Gump                              True
Le salaire de la peur                    False
3 Idiots                                 False
Network                                  False
Eternal Sunshine of the Spotless Mind    False
Name: Decade, Length: 250, dtype: bool

We then select only these movies from our DataFrame:

In [171]:
from_1990s = imdb_by_name[is_from_1990s]
from_1990s

Unnamed: 0_level_0,Votes,Rating,Year,Decade
Title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Léon,635139,8.6,1994,1990
Mononoke-hime,192165,8.4,1997,1990
Saving Private Ryan,769893,8.5,1998,1990
In the Name of the Father,95212,8.1,1993,1990
Before Sunrise,158867,8.0,1995,1990
...,...,...,...,...
The Lion King,548750,8.4,1994,1990
La vita è bella,358305,8.6,1997,1990
The Truman Show,583004,8.0,1998,1990
Pulp Fiction,1166532,8.9,1994,1990


We then find the highest rating out of just these movies:

In [172]:
from_1990s.get('Rating').max()

9.2

Or, if we wanted to do all of this more concisely using chaining:

In [173]:
imdb_by_name[imdb_by_name.get('Decade') == 1990].get('Rating').max()

9.2

**Question 4.3.** Find the average rating for movies released in the 20th century and the average rating for movies released in the 21st century for the movies in `imdb`.

*Hint*: Series have a `.mean()` method. Note that the year 2000 is in the 20th century, and that the earliest movie in the dataset is from 1921!
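To see why the century boundary needs care, here is a small sketch using illustrative years (not rows from the dataset) and plain `pandas`. It assumes that a decade is the year rounded down to a multiple of 10, which matches the `"Decade"` values shown earlier in this lab.

```python
import pandas as pd

# Illustrative years chosen to straddle the boundary between centuries.
years = pd.Series([1999, 2000, 2001, 2009])
decades = years // 10 * 10  # round each year down to its decade

# Comparing years separates 2000 (20th century) from 2001 (21st century).
in_20th_by_year = (years <= 2000).tolist()

# Comparing decades cannot: 2000, 2001, and 2009 all share the decade 2000.
in_20th_by_decade = (decades <= 2000).tolist()

print(in_20th_by_year)
print(in_20th_by_decade)
```

Because the 2000s decade contains years from both centuries, a condition on the year is the safer way to split the data.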

In [176]:
# The year 2000 belongs to the 20th century, so compare on "Year" rather than "Decade".
average_20th_century_rating = imdb_by_name[imdb_by_name.get('Year') <= 2000].get('Rating').mean()
average_20th_century_rating

In [177]:
grader.check("q4_3_1")

In [178]:
# The 21st century starts in 2001, so compare on "Year" rather than "Decade".
average_21st_century_rating = imdb_by_name[imdb_by_name.get('Year') >= 2001].get('Rating').mean()
average_21st_century_rating

In [179]:
grader.check("q4_3_2")

The property `shape` tells you how many rows and columns are in a DataFrame.  (A "property" is like a method that doesn't need to be called by adding parentheses.)

In [180]:
imdb_by_name.shape

(250, 4)

As with an array, you can get the first element of the shape using `[0]` and the second element using `[1]`. For instance, the number of rows in `imdb_by_name` is:

In [181]:
imdb_by_name.shape[0]

250

We can use this to answer "How many movies are from the 20th century?":

In [182]:
imdb_by_name[imdb_by_name.get('Year') <= 2000].shape[0]

176
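The same counting pattern on a table small enough to check by eye is sketched below. It assumes plain `pandas` as a stand-in for `babypandas`, and `movies` is a hypothetical four-row table built from rows shown earlier.

```python
import pandas as pd

movies = pd.DataFrame({
    "Title": ["M", "Léon", "3 Idiots", "Network"],
    "Year": [1931, 1994, 2009, 1976],
}).set_index("Title")

# shape is (number of rows, number of columns); the index is not counted as a column.
n_rows = movies.shape[0]
n_columns = movies.shape[1]

# Query first, then count the rows that remain.
n_up_to_2000 = movies[movies.get("Year") <= 2000].shape[0]

print(n_rows, n_columns, n_up_to_2000)
```

Only _3 Idiots_ (2009) fails the condition, so three of the four rows are counted.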

**Question 4.4.** Use `shape` (and arithmetic) to find the *proportion* of movies in the dataset that were released in the 20th century, and the proportion from the 21st century.

*Hint:* The *proportion* of movies released in the 20th century is the *number* of movies released in the 20th century, divided by the *total number* of movies in the dataset.

In [197]:
# A proportion is the matching row count divided by the total row count.
proportion_in_20th_century = imdb_by_name[imdb_by_name.get('Year') < 2001].shape[0] / imdb_by_name.shape[0]
proportion_in_20th_century

0.704

In [198]:
grader.check("q4_4_1")

In [199]:
# A proportion is the matching row count divided by the total row count.
proportion_in_21st_century = imdb_by_name[imdb_by_name.get('Year') >= 2001].shape[0] / imdb_by_name.shape[0]
proportion_in_21st_century

0.296

In [200]:
grader.check("q4_4_2")

**Question 4.5.** Finally, let's revisit the `population_by_year` DataFrame from earlier in the lab.  Compute the year when the world population first went above 7 billion.

In [None]:
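# Assumption: the population column is named "Population" (it is defined earlier in the lab, not shown here).
over_7_billion = population_by_year[population_by_year.get("Population") > 7_000_000_000]
year_population_crossed_7_billion = over_7_billion.get("Year").min()
year_population_crossed_7_billion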

In [None]:
grader.check("q4_5")

# Finish Line 🏁

Congratulations! You are done with Lab 1.

To submit your assignment:

1. Select `Kernel -> Restart & Run All` to ensure that you have executed all cells, including the test cells.
2. Read through the notebook to make sure everything is fine and all tests passed.
3. Run the cell below to run all tests, and make sure that they all pass.
4. Download your notebook using `File -> Download as -> Notebook (.ipynb)`, then upload your notebook to Gradescope.
5. Stick around while the Gradescope autograder grades your work. Make sure you see that all tests have passed on Gradescope.
6. Check that you have a confirmation email from Gradescope and save it as proof of your submission.

In [None]:
# For your convenience, you can run this cell to run all the tests at once!
grader.check_all()