# APIs Lab
In this lab we will practice using APIs to retrieve and store data.

In [1]:
# Imports at the top
import json
import urllib
import pandas as pd
import numpy as np
import requests
import re
import matplotlib.pyplot as plt
%matplotlib inline

## Exercise 1: Get Data From Sheetsu

[Sheetsu](https://sheetsu.com/) is an online service that lets you access any Google spreadsheet through an API. This can be a very handy way to share a dataset with colleagues, or to create a small centralized data store that is simpler to edit than a database.

A Google Spreadsheet with wine data can be found [here](https://docs.google.com/a/generalassemb.ly/spreadsheets/d/1JWRwDnwIMLgvPqNMdJLmAJgzvz0K3zAUc6jev3ci1c8/edit?usp=sharing).

You can access it through the Sheetsu API at this endpoint: https://sheetsu.com/apis/v1.0/cc9420722ae4. [Here](https://sheetsu.com/docs/beta) is Sheetsu's documentation.


Questions:

1. Use the requests library to access the document. Inspect the response text. What kind of data is it?
2. Check the status code of the response object. What code is it?
3. Use the appropriate libraries and read functions to read the response into a pandas DataFrame.
4. Once you've imported the data into a dataframe, check the value of the 5th line: what's the price?

In [2]:
api = "https://sheetsu.com/apis/v1.0/cc9420722ae4"
response = requests.get(api)

In [3]:
type(response.text)

unicode

In [4]:
response.status_code

200

In [5]:
df = pd.DataFrame(json.loads(response.text))

In [6]:
df.head()

Unnamed: 0,Color,Consumed In,Country,Grape,Name,Price,Region,Score,Vintage,Vinyard
0,W,2015,Portugal,,,,Portugal,4.0,2013,Vinho Verde
1,W,2015,France,,,17.8,France,3.0,2013,Peyruchet
2,W,2015,Oregon,,,20.0,Oregon,3.0,2013,Abacela
3,W,2015,Spain,chardonay,,7.0,Spain,2.5,2012,Ochoa
4,R,2015,US,"chiraz, cab",Spice Trader,6.0,,3.0,2012,Heartland


In [7]:
df2 = pd.read_json(response.text)

In [8]:
df2.tail()

Unnamed: 0,Color,Consumed In,Country,Grape,Name,Price,Region,Score,Vintage,Vinyard
93,R,2015,US,,My wonderful wine,200,Sonoma,10,1973,
94,R,2015,US,,My wonderful wine,200,Sonoma,10,1973,
95,R,2015,US,,My wonderful wine,200,Sonoma,10,1973,
96,R,2015,US,,My wonderful wine,200,Sonoma,10,1973,
97,R,2015,US,,My wonderful wine,200,Sonoma,10,1973,


> Answers:
    1. A string (`unicode`) containing JSON.
    2. 200
    3. Options include: pd.read_json; json.loads + pd.DataFrame
    4. 6

## Exercise 2: Post Data to Sheetsu
Now that we've learned how to read data, it'd be great if we could also write data. For this we will need to use a _POST_ request.

1. Use the post command to add the following data to the spreadsheet:

In [9]:
post_data = {
'Grape' : ''
, 'Name' : 'Test'
, 'Color' : 'R'
, 'Country' : 'US'
, 'Region' : 'Sonoma'
, 'Vinyard' : ''
, 'Score' : '10'
, 'Consumed In' : '2015'
, 'Vintage' : '1973'
, 'Price' : '200'
}

In [13]:
requests.post(api, data=post_data)
r = requests.get(api)

In [14]:
r.status_code

200

2. What status did you get? How can you check that you actually added the data correctly?
3. In this exercise, your classmates are adding data to the same spreadsheet. What happens because of this? Is it a problem? How could you mitigate it?
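One way to see what a POST would send, without touching the shared spreadsheet, is to build the request and inspect it before sending. The sketch below assumes only the `requests` library and reuses the endpoint and a few of the fields from this exercise; the subset of fields is chosen for illustration.

```python
import requests

api = "https://sheetsu.com/apis/v1.0/cc9420722ae4"

# A few of the fields from post_data above (subset chosen for illustration)
row = {"Name": "Test", "Color": "R", "Consumed In": "2015"}

# Build the request without sending it, so the shared sheet is not modified
prepared = requests.Request("POST", api, data=row).prepare()

print(prepared.method)                   # HTTP verb that would be used
print(prepared.headers["Content-Type"])  # how the body is encoded
print(prepared.body)                     # form-encoded payload

# When the request is actually sent, keep the returned object so that the
# status of the POST itself can be checked:
#     post_response = requests.post(api, data=row)
#     post_response.status_code
```

Keeping the object returned by `requests.post` makes it possible to check the status of the write itself, rather than the status of a later GET.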

In [15]:
df3 = pd.read_json(r.text)

In [16]:
df3.tail()

Unnamed: 0,Color,Consumed In,Country,Grape,Name,Price,Region,Score,Vintage,Vinyard
95,R,2015,US,,My wonderful wine,200,Sonoma,10,1973,
96,R,2015,US,,My wonderful wine,200,Sonoma,10,1973,
97,R,2015,US,,My wonderful wine,200,Sonoma,10,1973,
98,R,2015,US,,Test,200,Sonoma,10,1973,
99,R,2015,US,,Test,200,Sonoma,10,1973,


## Exercise 3: Data munging

Go back to the dataframe you created at the beginning. Let's do some data munging:

1. Search for missing data
    - Is there any missing data? How do you deal with it?
    - Is there any data you can just remove?
    - Are the data types appropriate?
2. Summarize the data
    - Try using describe, min, max, mean, var

In [17]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 98 entries, 0 to 97
Data columns (total 10 columns):
Color          98 non-null object
Consumed In    98 non-null int64
Country        98 non-null object
Grape          98 non-null object
Name           98 non-null object
Price          98 non-null object
Region         98 non-null object
Score          98 non-null object
Vintage        98 non-null int64
Vinyard        98 non-null object
dtypes: int64(2), object(8)
memory usage: 8.4+ KB


In [18]:
df4 = df2.replace('', np.nan)

In [19]:
df4.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 98 entries, 0 to 97
Data columns (total 10 columns):
Color          98 non-null object
Consumed In    98 non-null int64
Country        96 non-null object
Grape          18 non-null object
Name           89 non-null object
Price          92 non-null object
Region         97 non-null object
Score          97 non-null object
Vintage        98 non-null int64
Vinyard        29 non-null object
dtypes: int64(2), object(8)
memory usage: 8.4+ KB


In [20]:
df4.describe()

Unnamed: 0,Consumed In,Vintage
count,98.0,98.0
mean,2014.857143,1984.581633
std,0.454077,17.961542
min,2013.0,1973.0
25%,2015.0,1973.0
50%,2015.0,1973.0
75%,2015.0,2011.0
max,2015.0,2013.0
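The summary above covers only `Consumed In` and `Vintage` because the other columns, including `Price` and `Score`, are stored with the `object` dtype. A minimal sketch of one possible fix follows, assuming the numbers arrive as strings; the small `sample` frame is a hypothetical stand-in for the wine data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the wine data: numbers stored as strings,
# with an empty string where a value is missing
sample = pd.DataFrame({
    "Price": ["", "17.8", "20.0", "7.0", "6.0"],
    "Score": ["4.0", "3.0", "3.0", "2.5", "3.0"],
})

# Treat empty strings as missing values
clean = sample.replace("", np.nan)

# Convert the text columns to floats; unparseable entries become NaN
for col in ["Price", "Score"]:
    clean[col] = pd.to_numeric(clean[col], errors="coerce")

print(clean.dtypes)
print(clean.describe())
```

After the conversion, `describe` reports statistics for these columns as well, and the missing price is excluded from the mean.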


## Exercise 4: Feature Extraction

We would like to use a regression tree to predict the score of a wine. In order to do that, we first need to select and engineer appropriate features.

- Set the target to be the Score column, drop the rows with no score
- Use pd.get_dummies to create dummy features for all the text columns
- Fill the NaN values in the numerical columns, using an appropriate method
- Train a decision tree regressor on the Score, using a train/test split:
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
- Plot the test values, the predicted values and the residuals
- Calculate R^2 score
- Discuss your findings
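The steps above could be sketched as follows. This is only one possible approach, not a reference solution: the `wines` frame is a small hypothetical stand-in with made-up values, the median is just one choice for filling gaps, and the plotting step is left out.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Hypothetical stand-in for the wine dataframe; values are made up
wines = pd.DataFrame({
    "Color":   ["W", "W", "R", "R", "W", "R", "R", "W", "R", "W", "R", "W"],
    "Country": ["Portugal", "France", "US", "Spain", "US", "France",
                "US", "Spain", "Portugal", "France", "US", "Spain"],
    "Price":   [np.nan, 17.8, 20.0, 7.0, 6.0, 25.0,
                np.nan, 9.5, 12.0, 30.0, 15.0, 8.0],
    "Score":   [4.0, 3.0, 3.0, 2.5, 3.0, 4.5,
                3.5, np.nan, 3.0, 4.0, 3.5, 2.0],
})

# Drop rows with no score, then separate the target from the features
wines = wines.dropna(subset=["Score"])
y = wines["Score"]
X = wines.drop("Score", axis=1)

# Dummy-encode the text columns and fill numeric gaps with the median
X = pd.get_dummies(X, columns=["Color", "Country"])
X["Price"] = X["Price"].fillna(X["Price"].median())

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

model = DecisionTreeRegressor(random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Residuals and R^2 on the held-out rows
residuals = y_test - y_pred
r2 = r2_score(y_test, y_pred)
print(r2)
```

With so few rows the score itself means little; the point is the order of the steps. On the real data, computing the fill value from the training rows only would avoid leaking information from the test set.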


## Exercise 5: IMDB Movies

Sometimes an API doesn't provide all the information we would like to get and we need to be creative.
Here we will use a combination of scraping and API calls to investigate the ratings and gross earnings of famous movies.

## 5.a Get top movies

The Internet Movie Database contains data about movies. Unfortunately it does not have a public API.

The page http://www.imdb.com/chart/top contains the list of the top 250 movies of all time. Retrieve the page using the requests library and then parse the HTML to obtain a list of the `movie_ids` for these movies. You can parse it with regular expressions or with a library like `BeautifulSoup`.

**Hint:** movie_ids look like this: `tt2582802`
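As a small offline illustration of the regular expression route, the sketch below extracts ids from a hypothetical HTML fragment (the real page markup may differ). The two ids are the ones that appear elsewhere in this lab.

```python
import re

# Hypothetical fragment of the chart page; the real markup may differ
html = """
<td><a href="/title/tt2582802/">Whiplash</a></td>
<td><a href="/title/tt2582802/"><img alt="Whiplash"></a></td>
<td><a href="/title/tt0047478/">Seven Samurai</a></td>
"""

# "tt" followed by 7 or 8 digits; each id usually appears more than once
matches = re.findall(r"tt\d{7,8}", html)

# dict.fromkeys removes duplicates while keeping first-seen order
# (guaranteed from Python 3.7); a set also deduplicates but loses the order
movie_ids = list(dict.fromkeys(matches))

print(movie_ids)
```

Preserving the order matters if the chart rank is needed later; if it is not, a `set` is enough.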

In [21]:
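from bs4 import BeautifulSoup

r2 = requests.get("http://www.imdb.com/chart/top")

# Parsed tree, available for tag-based extraction
soup = BeautifulSoup(r2.content, "lxml")

# Movie ids are "tt" followed by 7 or 8 digits; a set removes duplicates
id_list = set(re.findall(r"tt[0-9]{7,8}", r2.text))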

## 5.b Get top movies data

Although the Internet Movie Database does not have a public API, an open API exists at http://www.omdbapi.com.

Use this API to retrieve information about each of the 250 movies you have extracted in the previous step.
- Check the documentation of omdbapi.com to learn how to request movie data by id
- Define a function that returns a Python object with all the information for a given id
- Iterate on all the IDs and store the results in a list of such objects
- Create a pandas DataFrame from the list
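The last two steps can be sketched offline: once each response has been decoded into a dictionary, a list of such dictionaries converts directly into a dataframe. The records below reuse a few values from the outputs shown in this lab; `build_url` is a hypothetical helper name, and the URL template is the one used in this lab.

```python
import pandas as pd

API_URL = "http://www.omdbapi.com/?i={}&plot=full&r=json"

def build_url(movie_id):
    """Return the request URL for one movie id (hypothetical helper)."""
    return API_URL.format(movie_id)

# Dictionaries shaped like decoded API responses, cut down to four fields
records = [
    {"imdbID": "tt2582802", "Title": "Whiplash",
     "Year": "2014", "imdbRating": "8.5"},
    {"imdbID": "tt0047478", "Title": "Seven Samurai",
     "Year": "1954", "imdbRating": "8.7"},
]

# One row per dictionary, one column per key
movies = pd.DataFrame(records)

print(build_url("tt2582802"))
print(movies)
```

Building one dataframe from a list of dictionaries avoids creating and concatenating 250 single-row dataframes, although both approaches give the same table.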

In [225]:
api_url = "http://www.omdbapi.com/?i={}&plot=full&r=json"

In [226]:
r = requests.get(api_url.format('tt2582802'))

pd.DataFrame(json.loads(r.text), index=[0])

Unnamed: 0,Actors,Awards,Country,Director,Genre,Language,Metascore,Plot,Poster,Rated,Released,Response,Runtime,Title,Type,Writer,Year,imdbID,imdbRating,imdbVotes
0,"Miles Teller, J.K. Simmons, Paul Reiser, Melis...",Won 3 Oscars. Another 87 wins & 131 nominations.,USA,Damien Chazelle,"Drama, Music",English,88,A promising young drummer enrolls at a cut-thr...,https://images-na.ssl-images-amazon.com/images...,R,15 Oct 2014,True,107 min,Whiplash,movie,Damien Chazelle,2014,tt2582802,8.5,413720


In [227]:
def get_info(id_num, num):
    """Fetch one movie from the OMDb API as a one-row dataframe indexed by num."""
    r = requests.get(api_url.format(id_num))
    df = pd.DataFrame(json.loads(r.text), index=[num])
    return df

In [229]:
df = pd.concat([get_info(i, j) for j,i in enumerate(id_list)])

In [230]:
df.shape

(250, 20)

In [231]:
df.head()

Unnamed: 0,Actors,Awards,Country,Director,Genre,Language,Metascore,Plot,Poster,Rated,Released,Response,Runtime,Title,Type,Writer,Year,imdbID,imdbRating,imdbVotes
0,"Miles Teller, J.K. Simmons, Paul Reiser, Melis...",Won 3 Oscars. Another 87 wins & 131 nominations.,USA,Damien Chazelle,"Drama, Music",English,88.0,A promising young drummer enrolls at a cut-thr...,https://images-na.ssl-images-amazon.com/images...,R,15 Oct 2014,True,107 min,Whiplash,movie,Damien Chazelle,2014,tt2582802,8.5,413720
1,"Toshirô Mifune, Takashi Shimura, Keiko Tsushim...",Nominated for 2 Oscars. Another 5 wins & 6 nom...,Japan,Akira Kurosawa,"Action, Adventure, Drama",Japanese,98.0,"A veteran samurai, who has fallen on hard time...",https://images-na.ssl-images-amazon.com/images...,UNRATED,19 Nov 1956,True,207 min,Seven Samurai,movie,"Akira Kurosawa (screenplay), Shinobu Hashimoto...",1954,tt0047478,8.7,232249
2,"Harrison Ford, Karen Allen, Paul Freeman, Rona...",Won 4 Oscars. Another 30 wins & 23 nominations.,USA,Steven Spielberg,"Action, Adventure","English, German, Hebrew, Spanish, Arabic, Nepali",85.0,The year is 1936. An archeology professor name...,https://images-na.ssl-images-amazon.com/images...,PG,12 Jun 1981,True,115 min,Raiders of the Lost Ark,movie,"Lawrence Kasdan (screenplay), George Lucas (st...",1981,tt0082971,8.5,671034
3,"William Holden, Alec Guinness, Jack Hawkins, S...",Won 7 Oscars. Another 23 wins & 7 nominations.,"UK, USA",David Lean,"Adventure, Drama, War","English, Japanese, Thai",,After settling his differences with a Japanese...,https://images-na.ssl-images-amazon.com/images...,PG,14 Dec 1957,True,161 min,The Bridge on the River Kwai,movie,"Pierre Boulle (novel), Carl Foreman (screenpla...",1957,tt0050212,8.2,151604
4,"Robert Downey Jr., Chris Evans, Mark Ruffalo, ...",Nominated for 1 Oscar. Another 34 wins & 75 no...,USA,Joss Whedon,"Action, Sci-Fi, Thriller","English, Russian",69.0,"Nick Fury is the director of S.H.I.E.L.D., an ...",https://images-na.ssl-images-amazon.com/images...,PG-13,04 May 2012,True,143 min,The Avengers,movie,"Joss Whedon (screenplay), Zak Penn (story), Jo...",2012,tt0848228,8.1,1010033


## 5.c Get gross data

The OMDb API is great, but it does not provide information about the gross revenue of a movie. We'll go back to scraping for this.

- Write a function that retrieves the gross revenue from the entry page at imdb.com
- The function should handle the case where the page doesn't report gross revenue
- Retrieve the gross revenue for each movie and store it in a separate dataframe

In [232]:
r = requests.get("http://www.imdb.com/title/tt3385516/")

soup = BeautifulSoup(r.text, "lxml")

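def get_gross(id_num):
    """Return the gross revenue shown on a movie's IMDb page, or NaN if absent."""
    r = requests.get("http://www.imdb.com/title/{}/".format(id_num))
    soup = BeautifulSoup(r.text, "lxml")
    try:
        for i in soup.find_all("div", class_="txt-block"):
            for j in i.find_all("h4", class_="inline"):
                if "Gross" in j.text:
                    # e.g. "Gross: $155,333,829 (USA)" -> "$155,333,829"
                    text = j.parent.text.split()[1]
                    num = text.replace(",", "").strip("$")
                    return float(num)
    except (IndexError, ValueError):
        # A gross entry exists but could not be parsed as a number
        pass
    # No gross revenue reported on the page
    return np.nan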

In [235]:
for i in soup.findAll("div", class_="txt-block"):
    for j in i.findAll("h4", class_="inline"):
        if "Gross" in j.text:
            print(j.parent.text.split()[1])

$155,333,829
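The scraped value is a string with a currency symbol and thousands separators, so it needs cleaning before it can be used as a number. A minimal sketch of that step on its own, using only the standard library; `parse_gross` is a hypothetical helper name.

```python
import math

def parse_gross(text):
    """Convert a string such as '$155,333,829' to a float, or NaN on failure."""
    cleaned = text.strip().lstrip("$").replace(",", "")
    try:
        return float(cleaned)
    except ValueError:
        # Anything that is not a plain number counts as missing
        return float("nan")

print(parse_gross("$155,333,829"))
print(math.isnan(parse_gross("N/A")))
```

Returning NaN instead of raising keeps a long scraping loop running when a single page has an unexpected format.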


In [236]:
get_gross(df["imdbID"][0])

13092000.0

In [237]:
df["Gross_earnings"] = df["imdbID"].map(get_gross)


In [151]:
df.head()

Unnamed: 0,Actors,Awards,Country,Director,Genre,Language,Metascore,Plot,Poster,Rated,...,Response,Runtime,Title,Type,Writer,Year,imdbID,imdbRating,imdbVotes,Gross_earnings
0,"Miles Teller, J.K. Simmons, Paul Reiser, Melis...",Won 3 Oscars. Another 87 wins & 131 nominations.,USA,Damien Chazelle,"Drama, Music",English,88.0,A promising young drummer enrolls at a cut-thr...,https://images-na.ssl-images-amazon.com/images...,R,...,True,107 min,Whiplash,movie,Damien Chazelle,2014,tt2582802,8.5,413720,13092000.0
1,"Toshirô Mifune, Takashi Shimura, Keiko Tsushim...",Nominated for 2 Oscars. Another 5 wins & 6 nom...,Japan,Akira Kurosawa,"Action, Adventure, Drama",Japanese,98.0,"A veteran samurai, who has fallen on hard time...",https://images-na.ssl-images-amazon.com/images...,UNRATED,...,True,207 min,Seven Samurai,movie,"Akira Kurosawa (screenplay), Shinobu Hashimoto...",1954,tt0047478,8.7,232249,269061.0
2,"Harrison Ford, Karen Allen, Paul Freeman, Rona...",Won 4 Oscars. Another 30 wins & 23 nominations.,USA,Steven Spielberg,"Action, Adventure","English, German, Hebrew, Spanish, Arabic, Nepali",85.0,The year is 1936. An archeology professor name...,https://images-na.ssl-images-amazon.com/images...,PG,...,True,115 min,Raiders of the Lost Ark,movie,"Lawrence Kasdan (screenplay), George Lucas (st...",1981,tt0082971,8.5,671034,242374454.0
3,"William Holden, Alec Guinness, Jack Hawkins, S...",Won 7 Oscars. Another 23 wins & 7 nominations.,"UK, USA",David Lean,"Adventure, Drama, War","English, Japanese, Thai",,After settling his differences with a Japanese...,https://images-na.ssl-images-amazon.com/images...,PG,...,True,161 min,The Bridge on the River Kwai,movie,"Pierre Boulle (novel), Carl Foreman (screenpla...",1957,tt0050212,8.2,151604,27200000.0
4,"Robert Downey Jr., Chris Evans, Mark Ruffalo, ...",Nominated for 1 Oscar. Another 34 wins & 75 no...,USA,Joss Whedon,"Action, Sci-Fi, Thriller","English, Russian",69.0,"Nick Fury is the director of S.H.I.E.L.D., an ...",https://images-na.ssl-images-amazon.com/images...,PG-13,...,True,143 min,The Avengers,movie,"Joss Whedon (screenplay), Zak Penn (story), Jo...",2012,tt0848228,8.1,1010033,623279547.0


## 5.d Data munging

Now that you have movie information and gross revenue information, let's clean the two datasets.

- Check if there are null values. Be careful: they may appear to be valid strings.
- Convert the columns to the appropriate formats. In particular handle:
    - Released
    - Runtime
    - Year
    - imdbRating
    - imdbVotes
- Merge the data from the two datasets into a single one

In [171]:
df = df.replace("", np.nan)
df = df.replace("N/A", np.nan)

In [172]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 250 entries, 0 to 249
Data columns (total 22 columns):
Actors            250 non-null object
Awards            246 non-null object
Country           250 non-null object
Director          250 non-null object
Genre             250 non-null object
Language          249 non-null object
Metascore         169 non-null object
Plot              250 non-null object
Poster            248 non-null object
Rated             249 non-null object
Released          249 non-null object
Response          250 non-null object
Runtime           250 non-null object
Title             250 non-null object
Type              250 non-null object
Writer            250 non-null object
Year              250 non-null object
imdbID            250 non-null object
imdbRating        250 non-null object
imdbVotes         250 non-null object
Gross_earnings    184 non-null float64
Date              249 non-null datetime64[ns]
dtypes: datetime64[ns](1), float64(1), object(20)


In [174]:
df["Date"] = pd.to_datetime(df["Released"])

In [176]:
df["Runtime_min"] = df["Runtime"].map(lambda x: float(x.split()[0]))

In [179]:
df["Year"] = df["Year"].astype(int)

In [182]:
df["imdbRating"] = df["imdbRating"].astype(float)

In [183]:
df["imdbVotes"] = df["imdbVotes"].map(lambda x: float(x.replace(",", "")))
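The lambda-based conversions above work here because `Runtime` and `imdbVotes` have no missing values; a NaN entry would make `x.split` or `x.replace` fail. As a hedged alternative, the sketch below uses the pandas `.str` accessor, which passes missing values through. The first two rows reuse values from this lab; the third row, with missing values, is hypothetical.

```python
import numpy as np
import pandas as pd

# First two rows reuse values from this lab; the third row is hypothetical
movies = pd.DataFrame({
    "Released":  ["15 Oct 2014", "19 Nov 1956", np.nan],
    "Runtime":   ["107 min", "207 min", np.nan],
    "imdbVotes": ["413,720", "232,249", np.nan],
})

# Parse dates with an explicit format; missing entries become NaT
movies["Date"] = pd.to_datetime(movies["Released"], format="%d %b %Y")

# Pull the leading digits out of "107 min"; missing entries stay NaN
movies["Runtime_min"] = (
    movies["Runtime"].str.extract(r"(\d+)", expand=False).astype(float)
)

# Remove the thousands separators before converting to float
movies["imdbVotes"] = (
    movies["imdbVotes"].str.replace(",", "", regex=False).astype(float)
)

print(movies[["Date", "Runtime_min", "imdbVotes"]])
```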

In [184]:
df.head()

Unnamed: 0,Actors,Awards,Country,Director,Genre,Language,Metascore,Plot,Poster,Rated,...,Title,Type,Writer,Year,imdbID,imdbRating,imdbVotes,Gross_earnings,Date,Runtime_min
0,"Miles Teller, J.K. Simmons, Paul Reiser, Melis...",Won 3 Oscars. Another 87 wins & 131 nominations.,USA,Damien Chazelle,"Drama, Music",English,88.0,A promising young drummer enrolls at a cut-thr...,https://images-na.ssl-images-amazon.com/images...,R,...,Whiplash,movie,Damien Chazelle,2014,tt2582802,8.5,413720.0,13092000.0,2014-10-15,107.0
1,"Toshirô Mifune, Takashi Shimura, Keiko Tsushim...",Nominated for 2 Oscars. Another 5 wins & 6 nom...,Japan,Akira Kurosawa,"Action, Adventure, Drama",Japanese,98.0,"A veteran samurai, who has fallen on hard time...",https://images-na.ssl-images-amazon.com/images...,UNRATED,...,Seven Samurai,movie,"Akira Kurosawa (screenplay), Shinobu Hashimoto...",1954,tt0047478,8.7,232249.0,269061.0,1956-11-19,207.0
2,"Harrison Ford, Karen Allen, Paul Freeman, Rona...",Won 4 Oscars. Another 30 wins & 23 nominations.,USA,Steven Spielberg,"Action, Adventure","English, German, Hebrew, Spanish, Arabic, Nepali",85.0,The year is 1936. An archeology professor name...,https://images-na.ssl-images-amazon.com/images...,PG,...,Raiders of the Lost Ark,movie,"Lawrence Kasdan (screenplay), George Lucas (st...",1981,tt0082971,8.5,671034.0,242374454.0,1981-06-12,115.0
3,"William Holden, Alec Guinness, Jack Hawkins, S...",Won 7 Oscars. Another 23 wins & 7 nominations.,"UK, USA",David Lean,"Adventure, Drama, War","English, Japanese, Thai",,After settling his differences with a Japanese...,https://images-na.ssl-images-amazon.com/images...,PG,...,The Bridge on the River Kwai,movie,"Pierre Boulle (novel), Carl Foreman (screenpla...",1957,tt0050212,8.2,151604.0,27200000.0,1957-12-14,161.0
4,"Robert Downey Jr., Chris Evans, Mark Ruffalo, ...",Nominated for 1 Oscar. Another 34 wins & 75 no...,USA,Joss Whedon,"Action, Sci-Fi, Thriller","English, Russian",69.0,"Nick Fury is the director of S.H.I.E.L.D., an ...",https://images-na.ssl-images-amazon.com/images...,PG-13,...,The Avengers,movie,"Joss Whedon (screenplay), Zak Penn (story), Jo...",2012,tt0848228,8.1,1010033.0,623279547.0,2012-05-04,143.0


In [185]:
df.to_csv("imdb_lab.csv", encoding="utf-8")

In [238]:
df = pd.read_csv("imdb_lab.csv", encoding="utf-8")

## 5.e Text vectorization

There are several columns in the data that contain a comma-separated list of items, for example the Genre column and the Actors column. Let's transform those to binary columns using the count vectorizer from scikit-learn.

Append these columns to the merged dataframe.

**Hint:** In order to get the actors' names right, you'll have to set the `token_pattern` parameter in `CountVectorizer` to `u'(?u)\w+.?\w?.? \w+'`. Can you see why? How does this differ from the default?

In [217]:
from sklearn import feature_extraction

In [218]:
vectorizer = feature_extraction.text.CountVectorizer(token_pattern=r'(?u)\w+.?\w?.? \w+')
actors_df = vectorizer.fit_transform(df["Actors"]).todense()
actor_names = vectorizer.get_feature_names()

vectorizer = feature_extraction.text.CountVectorizer()
genre_df = vectorizer.fit_transform(df["Genre"]).todense()
genre_names = vectorizer.get_feature_names()

for i,j in enumerate(actor_names):
    df["Actor_"+j.replace(" ", "_")] = actors_df[:,i]

for i,j in enumerate(genre_names):
    df["Genre_"+j.replace(" ", "_")] = genre_df[:,i]
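The effect of the custom `token_pattern` in the cell above can be checked with the standard `re` module, since `CountVectorizer` tokenizes by applying the pattern with `findall`. The sketch below compares it with scikit-learn's default pattern on the first three actors of the first movie in this lab.

```python
import re

# Pattern passed to CountVectorizer in this lab, written as a raw string
custom_pattern = r"(?u)\w+.?\w?.? \w+"
# Default token_pattern of scikit-learn's CountVectorizer
default_pattern = r"(?u)\b\w\w+\b"

# CountVectorizer lowercases its input by default, so do the same here
actors = "Miles Teller, J.K. Simmons, Paul Reiser".lower()

custom_tokens = re.findall(custom_pattern, actors)
default_tokens = re.findall(default_pattern, actors)

print(custom_tokens)   # full names, one token per actor
print(default_tokens)  # single words; one-letter initials are dropped

# The optional characters in the middle also let a short middle word through
print(re.findall(custom_pattern, "robert de niro"))
```

The default pattern splits every name into separate words, so first names and surnames would be counted independently. The custom pattern keeps each name as one token.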

In [221]:
df.head()

Unnamed: 0.1,Unnamed: 0,Actors,Awards,Country,Director,Genre,Language,Metascore,Plot,Poster,...,Genre_music,Genre_musical,Genre_mystery,Genre_noir,Genre_romance,Genre_sci,Genre_sport,Genre_thriller,Genre_war,Genre_western
0,0,"Miles Teller, J.K. Simmons, Paul Reiser, Melis...",Won 3 Oscars. Another 87 wins & 131 nominations.,USA,Damien Chazelle,"Drama, Music",English,88.0,A promising young drummer enrolls at a cut-thr...,https://images-na.ssl-images-amazon.com/images...,...,1,0,0,0,0,0,0,0,0,0
1,1,"Toshirô Mifune, Takashi Shimura, Keiko Tsushim...",Nominated for 2 Oscars. Another 5 wins & 6 nom...,Japan,Akira Kurosawa,"Action, Adventure, Drama",Japanese,98.0,"A veteran samurai, who has fallen on hard time...",https://images-na.ssl-images-amazon.com/images...,...,0,0,0,0,0,0,0,0,0,0
2,2,"Harrison Ford, Karen Allen, Paul Freeman, Rona...",Won 4 Oscars. Another 30 wins & 23 nominations.,USA,Steven Spielberg,"Action, Adventure","English, German, Hebrew, Spanish, Arabic, Nepali",85.0,The year is 1936. An archeology professor name...,https://images-na.ssl-images-amazon.com/images...,...,0,0,0,0,0,0,0,0,0,0
3,3,"William Holden, Alec Guinness, Jack Hawkins, S...",Won 7 Oscars. Another 23 wins & 7 nominations.,"UK, USA",David Lean,"Adventure, Drama, War","English, Japanese, Thai",,After settling his differences with a Japanese...,https://images-na.ssl-images-amazon.com/images...,...,0,0,0,0,0,0,0,0,1,0
4,4,"Robert Downey Jr., Chris Evans, Mark Ruffalo, ...",Nominated for 1 Oscar. Another 34 wins & 75 no...,USA,Joss Whedon,"Action, Sci-Fi, Thriller","English, Russian",69.0,"Nick Fury is the director of S.H.I.E.L.D., an ...",https://images-na.ssl-images-amazon.com/images...,...,0,0,0,0,0,1,0,1,0,0


In [222]:
df.to_csv("imdb_lab.csv", encoding="utf-8")

## Bonus:

- What are the top 10 grossing movies?
- Who are the 10 actors that appear in the most movies?
- What is the average gross of the movies in which each of these actors appears?
- What genre is the oldest movie?
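The actor questions can also be approached without dummy columns, by reshaping the data to one row per movie and actor and then grouping. The sketch below is one possible approach on hypothetical placeholder data (titles, actors and amounts are made up); it assumes pandas 0.25 or later for `explode` and named aggregation.

```python
import numpy as np
import pandas as pd

# Hypothetical placeholder data; names and amounts are made up
movies = pd.DataFrame({
    "Title": ["Movie 1", "Movie 2", "Movie 3"],
    "Actors": ["Actor A, Actor B", "Actor A, Actor C", "Actor B"],
    "Gross_earnings": [100.0, 300.0, np.nan],
})

# One row per (movie, actor) pair
per_actor = movies.assign(Actor=movies["Actors"].str.split(", ")).explode("Actor")

# Number of movies and mean gross per actor; NaN gross values are skipped
summary = (
    per_actor.groupby("Actor")
    .agg(n_movies=("Title", "count"), mean_gross=("Gross_earnings", "mean"))
    .sort_values("n_movies", ascending=False)
)

print(summary.head(10))
```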


In [224]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 250 entries, 0 to 249
Columns: 883 entries, Unnamed: 0 to Genre_western
dtypes: bool(1), float64(5), int64(861), object(16)
memory usage: 1.7+ MB


In [None]:
df["Gross_earnings"]

In [243]:
df[["Title", "Gross_earnings"]].sort_values(by="Gross_earnings", ascending=False).head(10)

Unnamed: 0,Title,Gross_earnings
160,Star Wars: The Force Awakens,936627416.0
4,The Avengers,623279547.0
37,The Dark Knight,533316061.0
235,Star Wars: Episode IV - A New Hope,460935665.0
159,The Dark Knight Rises,448130642.0
48,The Lion King,422783777.0
100,Toy Story 3,414984497.0
68,Harry Potter and the Deathly Hallows: Part 2,380955619.0
9,Finding Nemo,380838870.0
78,The Lord of the Rings: The Return of the King,377019252.0


In [244]:
actors_df = pd.DataFrame(actors_df, columns=actor_names)

In [255]:
top_10_actors = actors_df.sum().sort_values(ascending=False).head(10).index

In [256]:
top_10_actors = [i.replace(" ", "_") for i in top_10_actors]

In [258]:
df["Gross_earnings"][df["Actor_"+top_10_actors[0]]==1].mean()

36559324.0

In [261]:
mean_gross_10 = pd.DataFrame(top_10_actors, columns=["Actor"])
mean_gross_10["Mean_gross"] = mean_gross_10["Actor"].map(lambda x: df["Gross_earnings"][df["Actor_"+x]==1].mean())

In [265]:
pd.set_option('display.float_format', '{:.2f}'.format)

In [266]:
mean_gross_10

Unnamed: 0,Actor,Mean_gross
0,robert_de_niro,36559324.0
1,harrison_ford,351913357.29
2,leonardo_dicaprio,168664745.14
3,clint_eastwood,71853197.6
4,tom_hanks,242304668.67
5,mark_hamill,499211810.25
6,james_stewart,13850000.0
7,christian_bale,309968305.0
8,matt_damon,181570234.0
9,joe_pesci,23654986.0


In [267]:
df[["Title", "Date"]].sort_values(by="Date").head()

Unnamed: 0,Title,Date
88,The Kid,1921-02-06
180,The General,1927-02-24
212,Metropolis,1927-03-13
194,Sunrise,1927-11-04
174,The Passion of Joan of Arc,1928-10-25
