# Pandas Data Frames

This notebook contains exercises for Pandas Data Frames.

**At the end of each exercise there are cells containing `assert` statements that you can use to check your answers.**

In [None]:
import pandas as pd
import numpy as np
from utils.dataset_loader import MOVIES_DATASET_PATH
%autosave 30

## Exercise 1: Indexing Into Data Frames

---

You will now use pandas to extract some information from a dataset of movies.

It's useful to `print` what you select.
You can also put the variable you want printed at the last line of a cell.
This is especially useful with pandas objects, which render differently in Jupyter cells.

Displaying your selections often helps you catch mistakes early, for instance selecting rows when you meant to select columns.

### Question 1.1

Load the movies dataset (its path is imported above in `MOVIES_DATASET_PATH`).
You will need to specify that column `0` contains the index.
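
As a minimal sketch of how an index column is specified, the example below reads a small in-memory CSV rather than the movies dataset. The CSV contents and the names `toy_csv` and `toy_df` are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical toy CSV: the first (unnamed) column holds the row labels.
toy_csv = io.StringIO(
    ",title,revenue\n"
    "10,Movie A,100\n"
    "20,Movie B,250\n"
)

# index_col=0 tells pandas to use column 0 as the index rather than as data.
toy_df = pd.read_csv(toy_csv, index_col=0)

print(toy_df.index.tolist())    # [10, 20]
print(toy_df.columns.tolist())  # ['title', 'revenue']
```

Without `index_col=0`, pandas would create a fresh `0, 1, ...` index and keep the first column as ordinary data.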

In [None]:
movies_df = ...

### Question 1.2

The pandas method `df.head(N)` selects the first `N` rows of the DataFrame.
Use it to examine the dataset and familiarize yourself with it.

*HINT: Put a single call to `.head` in the cell below instead of using `print`. Pandas objects render better when displayed like this in Jupyter notebooks.*

In [None]:
...

### Question 1.3

Select the revenue of all movies in the dataset.

* Use `head` on this series to display the first `50` revenues. What do you notice?
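
The sketch below shows, on a made-up toy DataFrame (not the movies dataset), that selecting one column with square brackets yields a one-dimensional `Series` and that `head` works on it too.

```python
import pandas as pd

# Hypothetical toy data, for illustration only.
toy_df = pd.DataFrame(
    {
        "title": ["Movie A", "Movie B", "Movie C", "Movie D"],
        "revenue": [0, 100, 0, 250],
    }
)

# Selecting a single column by name returns a Series.
toy_revenues = toy_df["revenue"]

print(type(toy_revenues).__name__)  # Series
print(toy_revenues.shape)           # (4,)
print(toy_revenues.head(2).tolist())  # [0, 100]
```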

In [None]:
movies_revenues = ...

### Question 1.4

Select the title of the movie at index 1459.
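
One common way to pick a single value by its index label is `.loc[row_label, column_name]`. The sketch below uses a made-up toy DataFrame whose index labels are deliberately different from the row positions, to show how `.loc` (label-based) differs from `.iloc` (position-based).

```python
import pandas as pd

# Hypothetical toy data with non-default index labels.
toy_df = pd.DataFrame(
    {"title": ["Movie A", "Movie B", "Movie C"], "revenue": [100, 250, 50]},
    index=[10, 20, 30],
)

# .loc looks rows up by index label, .iloc by integer position.
title_by_label = toy_df.loc[20, "title"]
title_by_position = toy_df.iloc[0]["title"]

print(title_by_label)     # Movie B
print(title_by_position)  # Movie A
```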

In [None]:
movie_1459_title = ...

### Question 1.5

Select the title and revenue of the first 100 movies.
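
Passing a list of column names inside the square brackets selects several columns at once and returns a DataFrame. The following sketch combines that with `head` on a made-up toy DataFrame; this is one possible approach, not the only one.

```python
import pandas as pd

# Hypothetical toy data, for illustration only.
toy_df = pd.DataFrame(
    {
        "title": ["Movie A", "Movie B", "Movie C"],
        "revenue": [100, 250, 50],
        "runtime": [90, 100, 120],
    }
)

# A list of names selects multiple columns; head(2) keeps the first 2 rows.
toy_subset = toy_df[["title", "revenue"]].head(2)

print(toy_subset.shape)             # (2, 2)
print(toy_subset.columns.tolist())  # ['title', 'revenue']
```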

In [None]:
title_revenue_first_100 = ...

### Question 1.6

Select the highest-grossing movie in the dataset (the movie with the highest `revenue`).

* Use the `movies_revenues` Series that you created above, which contains all the revenues.
* Use `idxmax` (index-max) on this Series to find the index of the movie with the highest revenue.
* Then use `highest_grossing_idx` to select the movie's title.
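
The steps above can be sketched on a made-up toy DataFrame. Note that `idxmax` returns the index label of the maximum value, not the value itself, so the result can be passed straight to `.loc`.

```python
import pandas as pd

# Hypothetical toy data with non-default index labels.
toy_df = pd.DataFrame(
    {"title": ["Movie A", "Movie B", "Movie C"], "revenue": [100, 250, 50]},
    index=[10, 20, 30],
)

# idxmax gives the index label of the row holding the largest revenue.
toy_best_idx = toy_df["revenue"].idxmax()

# That label can then be used for a label-based lookup.
toy_best_title = toy_df.loc[toy_best_idx, "title"]

print(toy_best_idx)    # 20
print(toy_best_title)  # Movie B
```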

In [None]:
highest_grossing_idx = ...

In [None]:
highest_grossing_title = ...

### [Optional] Question 1.7

Select the title of the lowest-**rated** movie in the dataset.

_HINT: Use `idxmin` on the `vote_average` column of the dataset._

In [None]:
...

#### Run these cells after finishing the exercise questions to check your answers.

In [None]:
assert movies_revenues.shape == (movies_revenues.shape[0],), "Shape of movies_revenues is wrong!"

In [None]:
assert title_revenue_first_100.shape == (100, 2), "Shape of title_revenue_first_100 is wrong!"

In [None]:
assert movie_1459_title == 'The Lizzie McGuire Movie', "Title of movie 1459 is wrong!"

In [None]:
assert highest_grossing_idx == 4434, "Wrong highest_grossing_idx selected!"

## Exercise 2: Mathematical Operations

---

You will now use pandas to compute some statistics about the movies.

In [None]:
# Run this cell first to display some more information about the dataset.

movies_df.describe()

### Question 2.1

The above cell displays `movies_df.describe()`.

What kind of object is returned by `movies_df.describe()`? Store it in the variable `desc`.

In [None]:
desc = ...

### Question 2.2

Extract from `desc` the mean (average) budget of movies.
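
As an illustration on made-up toy data, the result of `describe()` is indexed by statistic name (`count`, `mean`, `std`, and so on), with one column per numeric column of the original data. A single statistic can therefore be looked up with `.loc[statistic, column]`.

```python
import pandas as pd

# Hypothetical toy data, for illustration only.
toy_df = pd.DataFrame(
    {"budget": [10.0, 20.0, 30.0], "runtime": [90.0, 100.0, 120.0]}
)

toy_desc = toy_df.describe()

# Rows are statistics, columns are the numeric columns of toy_df.
print(toy_desc.index.tolist())
# ['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']

toy_mean_budget = toy_desc.loc["mean", "budget"]
print(float(toy_mean_budget))  # 20.0
```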

In [None]:
mean_budget_movies = ...

### Question 2.3

Extract the mean (average) runtime of the first 100 movies.

*HINT: You need to use the original DataFrame for this question.*
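
A summary produced by `describe()` covers every row, so a statistic over a subset has to be computed from the data itself. Here is a minimal sketch on a made-up toy DataFrame that chains a row selection with `mean`.

```python
import pandas as pd

# Hypothetical toy data, for illustration only.
toy_df = pd.DataFrame({"runtime": [90, 100, 120]})

# Restrict to the first 2 rows, then average only those values.
toy_mean_first_2 = toy_df["runtime"].head(2).mean()

# For comparison: the mean over all rows.
toy_mean_all = toy_df["runtime"].mean()

print(float(toy_mean_first_2))  # 95.0
```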

In [None]:
mean_runtime_first_100_movies = ...

### Question 2.4

Extract the total revenue of all movies.
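
Totals can be computed with `Series.sum`. The sketch below, on made-up toy values, also shows that `sum` skips missing values (`NaN`) by default, which is worth keeping in mind if a column has gaps.

```python
import numpy as np
import pandas as pd

# Hypothetical toy revenues with one missing value.
toy_revenues = pd.Series([100.0, np.nan, 250.0])

# By default (skipna=True), NaN entries are ignored.
toy_total = toy_revenues.sum()

# With skipna=False, any NaN makes the result NaN.
toy_total_strict = toy_revenues.sum(skipna=False)

print(float(toy_total))  # 350.0
```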

In [None]:
total_revenue_all_movies = ...

#### Run these cells after finishing the exercise questions to check your answers.

In [None]:
assert np.isclose(mean_budget_movies, 4375023.772096334, rtol=1e-4), "Mean budget of movies is wrong!"

In [None]:
assert np.isclose(mean_runtime_first_100_movies, 92.34, rtol=1e-2), "Mean runtime of first 100 movies is wrong!"