# APIs Lab
In this lab we will practice using APIs to retrieve and store data.

In [94]:
# Imports at the top
import json
import urllib
import pandas as pd
import numpy as np
import requests
import json
import re
from imdbpie import Imdb
import nltk
from nltk.tokenize import RegexpTokenizer
import collections
from collections import Counter
import matplotlib.pyplot as plt
%matplotlib inline

In [95]:
imdb = Imdb()
imdb = Imdb(anonymize=True)

## 5.a Get bottom and top movies

The Internet Movie Database contains data about movies. Unfortunately it does not have a public API.

The page http://www.imdb.com/chart/top contains the list of the top worst 100 movies of all time. Retrieve the page using the requests library and then parse the html to obtain a list of the `movie_ids` for these movies. You can parse it with regular expression or using a library like `BeautifulSoup`.

**Hint:** movie_ids look like this: `tt2582802`

In [96]:
#So this function gets a list of all 100 movies IDs.
#However, IMDB.com doesn't like people getting all of their data very easily, so we'll just use this page to get the IDs
#So this function doesn't iterate through pages, as all movies in the Bottom 100 are on a single page.
#It takes their unique IDs that are encoded in the HTML, and puts them in a list, called 'entries
def get_entries(site):
    response = requests.get(site)
    html = response.text
    entries = re.findall("<a href.*?/title/(.*?)/", html) #Wrong regex
    return list(set(entries))

In [97]:
bottomEntries = get_entries('http://www.imdb.com/chart/bottom')

In [98]:
topEntries = get_entries('http://www.imdb.com/chart/top')

## 5.b Get bottom and top movies data

Although the Internet Movie Database does not have a public API, an open API exists at http://www.omdbapi.com.

Use this API to retrieve information about each of the 100 movies you have extracted in the previous step.
- Check the documentation of omdbapi.com to learn how to request movie data by id
- Define a function that returns a python object with all the information for a given id
- Iterate on all the IDs and store the results in a list of such objects
- Create a Pandas Dataframe from the list

In [99]:
#Now that we have the 250 IDs, we need a way to search omdapi (which has gathered all data for each IMDB movie in a 
#nice little JSON tree). 

#So we need to scrape each movie's JSON tree with Beautiful soup
#Just like with indeed.com, it's going to use omdabpi's search engine 250 times, once for each id in the entries list
#from above. After it searches a movie id in the lsit above, it will scrape its JSON tree.
def get_entry(entry):
    res = requests.get('http://www.omdbapi.com/?i='+entry)
    if res.status_code != 200:
        print entry, res.status_code
    else:
        print '.',
    try:
        j = json.loads(res.text)
    except ValueError:
        j = None
    return j

In [100]:
#So you're going to repreat the function above for every item(movie id) in the 'entries' list
#It returns a dictionary that can then be turned into a dataframe
bottomEntriesDictList = [get_entry(e) for e in bottomEntries]

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .


In [101]:
#Here we turn the JSON file for each of th 100 movies into a dataframe
bottom100 = pd.DataFrame(bottomEntriesDictList)

In [102]:
#So you're going to repreat the function above for every item(movie id) in the 'entries' list
#It returns a dictionary that can then be turned into a dataframe
topEntriesDictList = [get_entry(e) for e in topEntries]

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .


In [103]:
#Here we turn the JSON file for each of th 100 movies into a dataframe
top250 = pd.DataFrame(topEntriesDictList)

## 5.c Get gross data

The OMDB API is great, but it does not provide information about Gross Revenue of the movie. We'll revert back to scraping for this.

- Write a function that retrieves the gross revenue from the entry page at imdb.com
- The function should handle the exception of when the page doesn't report gross revenue
- Retrieve the gross revenue for each movie and store it in a separate dataframe

In [104]:
#There is still some information we would want, but OMDb API does not provide
#So, we have to go back to imdb.com to scrape the gross revenue for each movie
#This function will ultimately search for each movie by their id in the entries list, and scrape the gross revenue into
#a new list called 'grosses

def get_gross(entry): #define the function
    response = requests.get('http://www.imdb.com/title/'+entry) #This will generate a request from the page for an entry
    html = response.text
    try:
        gross_list = re.findall("Gross:</h4>[ ]*\$([^ ]*)", html) #This will create a list with the value after the word 'Gross'
        gross = int(gross_list[0].replace(',', '')) #This creates a new value by convertinf the above to an integer and eliminating commas
        return gross
    except Exception as ex:
        return None

In [105]:
bottomGrosses = [(e, get_gross(e)) for e in bottomEntries]#Repeat the function above for each id in the entries list

In [109]:
topGrosses = [(e, get_gross(e)) for e in topEntries]

In [110]:
bottomGrosses = pd.DataFrame(bottomGrosses, columns=['imdbID', 'gross'])

In [111]:
topGrosses = pd.DataFrame(topGrosses, columns=['imdbID', 'gross'])

In [112]:
bottomGrosses["gross"] = bottomGrosses["gross"].fillna(bottomGrosses["gross"].mean())

In [113]:
topGrosses["gross"] = topGrosses["gross"].fillna(topGrosses["gross"].mean())

## 5.d Data munging

- Now that you have movie information and gross revenue information, let's clean the two datasets.
- Check if there are null values. Be careful they may appear to be valid strings.
- Convert the columns to the appropriate formats. In particular handle:
    - Released
    - Runtime
    - year
    - imdbRating
    - imdbVotes
- Merge the data from the two datasets into a single one

In [114]:
def cleanData(df, df1):
    movieType = []
    df.drop(['Actors','Awards','Country','Director','Genre','Language','Metascore',
             'Plot','Poster','Rated','Released','Response','Type','Writer'], axis =1 ,inplace=True)
    df = df[df.Runtime != 'N/A']
    for row in ['Runtime']:
        df['Runtime'] = df['Runtime'].str.rstrip('min').astype(float)
    for row in ['Year']:
        df['Year'] = df['Year'].astype(int)
    for row in ['imdbRating']:
        df['imdbRating'] = df['imdbRating'].astype(float)
    for row in ['imdbVotes']:
        df['imdbVotes'] = df['imdbVotes'].replace(',','',regex=True).astype(float)
    for row in df['imdbRating']:
        if row <= 3:
            movieType.append(0)
        else:
            movieType.append(1)
    df['movieType'] = movieType
    df = df.rename(columns = {'imdbID'      :'imdbID',
                                  'Title'       :'title',
                                  'Year'        :'year',
                                  'Runtime'     :'runtime',
                                  'imdbVotes'   :'imdbVotes',
                                  'imdbRating'  :'imdbRating',
                                  'movieType'   :'movieType'})
    df = df[['imdbID', , 'movieType', 'title', 'year', 'runtime', 'imdbVotes', 'imdbRating']]
    df = pd.merge(df, df1)
    return df

In [115]:
top250 = cleanData(top250, topGrosses)
bottom100 = cleanData(bottom100, bottomGrosses)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
A value is tryin

In [116]:
top250.head()

Unnamed: 0,imdbID,title,year,runtime,imdbVotes,imdbRating,movieType,gross
0,tt2582802,Whiplash,2014,107.0,384504.0,8.5,1,13092000.0
1,tt0047478,Seven Samurai,1954,207.0,226364.0,8.7,1,269061.0
2,tt0082971,Raiders of the Lost Ark,1981,115.0,653557.0,8.5,1,242374454.0
3,tt0050212,The Bridge on the River Kwai,1957,161.0,147591.0,8.2,1,27200000.0
4,tt0848228,The Avengers,2012,143.0,980989.0,8.1,1,623279547.0


In [117]:
bottom100.head()

Unnamed: 0,imdbID,title,year,runtime,imdbVotes,imdbRating,movieType,gross
0,tt0118665,Baby Geniuses,1999,97.0,19205.0,2.5,0,27141960.0
1,tt0058548,Santa Claus Conquers the Martians,1964,81.0,8735.0,2.5,0,9953230.0
2,tt0089280,Hobgoblins,1988,88.0,8416.0,2.3,0,9953230.0
3,tt0830861,A Fox's Tale,2008,85.0,7203.0,2.2,0,9953230.0
4,tt0299930,Gigli,2003,121.0,41461.0,2.4,0,5660084.0


In [118]:
top250.to_csv('../../../../../01-projects/assets/06-project6-assets/data/top250.csv', encoding='utf8', index=False)
bottom100.to_csv('../../../../../01-projects/assets/06-project6-assets/data/bottom100.csv', encoding='utf8', index=False)

## 5.d Text vectorization

There are several columns in the data that contain a comma separated list of items, for example the Genre column and the Actors column. Let's transform those to binary columns using the count vectorizer from scikit learn.

Append these columns to the merged dataframe.

**Hint:** In order to get the actors name right, you'll have to modify the `token_pattern` in the `CountVectorizer`.

In [119]:
#Now we need to scrape the reviews for each of our movie collections, but put them in a different dataframe
#So first, we put the imdbIDs in their respective lists so we can iterate through them when scraping reviews
#We need the ID again so we can use it as the common key with which we can join tablesl ater
top250MovieIDs = top250.imdbID.values.tolist()
bottom100MovieIDs = bottom100.imdbID.values.tolist()

In [121]:
top250Reviews = []
top250IDs = []
for x in top250MovieIDs: #For every ID in the ID list
    review = imdb.get_title_reviews(x, max_results=15) #We take a list of 15 reviews
    for i in review: #For every review in the list of reviews
        top250IDs.append(x) #We add that reviews id to one list 
        top250Reviews.append(i.text) #and the review to another, so they all correspond

In [122]:
#Turn those two lists into a dataframe with the ID and 15 reviews for each ID
top250ReviewData = pd.DataFrame({"imdbID": top250IDs, "reviews": top250Reviews})

In [123]:
#We repeat the process above, except with the bottom 100
#We don't want to combine these dataframes yet, because we want the top 50 adjectives used to describe
#the worst and best movies, and see to which extent there is overlap or exclusivity in the ways
#people describe good and bad movies
bottom100Reviews = []
bottom100IDs = []
for x in bottom100MovieIDs:
    review = imdb.get_title_reviews(x, max_results=15)
    for i in review:
        bottom100IDs.append(x)
        bottom100Reviews.append(i.text)

In [124]:
bottom100ReviewData = pd.DataFrame({"imdbID": bottom100IDs, "reviews": bottom100Reviews})

In [125]:
tokenizer = RegexpTokenizer('\w+|\$[\d\.]+|\S+')
top250Tokens = [tokenizer.tokenize(review) for review in top250Reviews]

In [126]:
top250PosTokens = [nltk.tag.pos_tag(token) for token in top250Tokens]

In [127]:
bottom100Tokens = [nltk.word_tokenize(review) for review in bottom100Reviews]

In [128]:
bottom100PosTokens = [nltk.tag.pos_tag(token) for token in bottom100Tokens]

In [None]:
#Next 5 cells will have to be repeated for bottom100 after you get top250

In [129]:
top250AdjList = []
for x in top250PosTokens:
    # each x is a list of (word, POS tag) tuples
    for word, pos in x:
        if pos in ['JJ', 'JJS', 'JJR']: # feel free to add any other tags you may be looking for
            top250AdjList.append(word)

In [130]:
top250CommonAdj= [a for a, b in Counter(top250AdjList).most_common(50)]

In [131]:
#Create a dataframe that places each descriptor as an index
#We see that a top 50 adjective is the letter 't.' Not sure what the pos tagger is doing there, so we'll drop the col
top250Descrip = pd.DataFrame(columns=top250CommonAdj)
top250Descrip.head()

Unnamed: 0,'t,great,best,other,good,many,first,more,such,most,...,special,acting,main,final,funny,important,emotional,later,simple,strong


In [132]:
#Now we want a dataframe that joins the Review data (movieID, 15reviews each), from above with these descriptors
top250Movies = pd.DataFrame(top250ReviewData)
top250Movies = top250Movies.join(top250Descrip)
#We see right now it's just filled with NaN values, so we'll populate the cells in the loop below

In [133]:
#First, let's drop that 't column
top250Movies.drop(top250Movies.columns[[2]], axis=1, inplace=True)
top250Movies.head()

Unnamed: 0,movieID,reviews,great,best,other,good,many,first,more,such,...,special,acting,main,final,funny,important,emotional,later,simple,strong
0,tt2582802,http://switchingreels.com/2014/01/28/sundance-...,,,,,,,,,...,,,,,,,,,,
1,tt2582802,Taking the festival circuit by storm since its...,,,,,,,,,...,,,,,,,,,,
2,tt2582802,After seeing Damien Chazelle's Whiplash - a fi...,,,,,,,,,...,,,,,,,,,,
3,tt2582802,There is so many excellent great things to say...,,,,,,,,,...,,,,,,,,,,
4,tt2582802,I saw this about 24 hours ago at the Best of F...,,,,,,,,,...,,,,,,,,,,


In [134]:
def getDescriptors(df):
    for c, col in enumerate(df.columns[2:]):
        for r, row in enumerate(df.index):
            reviewLower = df.loc[row,"reviews"].lower()
            if (col in reviewLower):
                df.loc[row,col] = 1
            else:
                df.loc[row,col] = 0

In [135]:
getDescriptors(top250Movies)

In [136]:
top250Movies = top250Movies.ix[:, 0:51]

In [None]:
#Now repeat the process for bottom 100
#A function that does all of this would be preferred, but I couldn't get it to work

In [137]:
bottom100AdjList = []
for x in bottom100PosTokens:
    # each x is either a list of (word, POS tag) tuples
    for word, pos in x:
        if pos in ['JJ', 'JJS', 'JJR']: # feel free to add any other tags you may be looking for
            bottom100AdjList.append(word)

In [138]:
bottom100CommonAdj= [a for a, b in Counter(bottom100AdjList).most_common(50)]

In [139]:
#Create a dataframe that places each descriptor as an index
#We see that a top 50 adjective is the letter 't.' Not sure what the pos tagger is doing there, so we'll drop the col
bottom100Descrip = pd.DataFrame(columns=bottom100CommonAdj)

In [140]:
#Now we want a dataframe that joins the Review data (movieID, 15reviews each), from above with these descriptors
bottom100Movies = pd.DataFrame(bottom100ReviewData)
bottom100Movies = bottom100Movies.join(bottom100Descrip)
#We see right now it's just filled with NaN values, so we'll populate the cells in the loop below

In [141]:
#If everything is in order call the function above
getDescriptors(bottom100Movies)

In [142]:
#We need to join both of these dataframes to the original dataframes that have more info about each movie ID
#First, though, we need to drop the 'Reviews' column because we don't need all of that text
top250Movies.drop(['reviews'], axis =1 ,inplace=True)
bottom100Movies.drop(['reviews'], axis =1 ,inplace=True)

In [143]:
#Then we need to groupby imdbID so we can actually join the tables.
#Currently, the review data has 15 rows for each id, while the original movie data only has 1
top250MoviesCopy = top250Movies.groupby(["movieID"], group_keys=False, as_index=False).apply(lambda x: x.iloc[:,1:].max())
bottom100MoviesCopy = bottom100Movies.groupby(["movieID"], group_keys=False, as_index=False).apply(lambda x: x.iloc[:,1:].max())

In [149]:
#Before we go any further, let's save these sentiment tables to their own .csv files
top250Movies.to_csv('../../../../../01-projects/assets/06-project6-assets/data/top250Descriptors.csv', encoding='utf8', index=False)
bottom100Movies.to_csv('../../../../../01-projects/assets/06-project6-assets/data/bottom100Descriptors.csv', encoding='utf8', index=False)

In [145]:
top250.head()

Unnamed: 0,imdbID,title,year,runtime,imdbVotes,imdbRating,movieType,gross
0,tt2582802,Whiplash,2014,107.0,384504.0,8.5,1,13092000.0
1,tt0047478,Seven Samurai,1954,207.0,226364.0,8.7,1,269061.0
2,tt0082971,Raiders of the Lost Ark,1981,115.0,653557.0,8.5,1,242374454.0
3,tt0050212,The Bridge on the River Kwai,1957,161.0,147591.0,8.2,1,27200000.0
4,tt0848228,The Avengers,2012,143.0,980989.0,8.1,1,623279547.0


In [150]:
goodMovies = top250.join(top250MoviesCopy)

In [151]:
badMovies = bottom100.join(bottom100MoviesCopy)

In [156]:
# Now we can save these to .csv files to avoid all of that scraping above
goodMovies.to_csv('../../../../../01-projects/assets/06-project6-assets/data/top250MoviesDescriptors.csv', encoding='utf8', index=False)
badMovies.to_csv('../../../../../01-projects/assets/06-project6-assets/data/bottom100MoviesDescriptors.csv', encoding='utf8', index=False)

In [157]:
goodMovies.shape

(250, 57)

In [158]:
badMovies.shape

(98, 58)

In [161]:
#Finally, we want a single table that adds the bottom100 movies, their info, and descriptors to the top250 table
#Our shape should be 348 movies, and, since they share 8 columns, we should get a total of 99 columns
movies = pd.merge(goodMovies, badMovies, on='imdbID', how='outer')

In [169]:
goodMovies.head()

Unnamed: 0,imdbID,title,year,runtime,imdbVotes,imdbRating,movieType,gross,great,best,other,good,many,first,more,such,most,own,same,real,few,much,old,new,true,last,young,different,bad,classic,big,beautiful,whole,brilliant,original,greatest,better,only,second,top,wonderful,long,little,human,least,powerful,full,special,acting,main,final,funny,important,emotional,later,simple,strong
0,tt2582802,Whiplash,2014,107.0,384504.0,8.5,1,13092000.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0
1,tt0047478,Seven Samurai,1954,207.0,226364.0,8.7,1,269061.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0
2,tt0082971,Raiders of the Lost Ark,1981,115.0,653557.0,8.5,1,242374454.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,1.0,1.0
3,tt0050212,The Bridge on the River Kwai,1957,161.0,147591.0,8.2,1,27200000.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0
4,tt0848228,The Avengers,2012,143.0,980989.0,8.1,1,623279547.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0


In [168]:
#pd.set_option('display.max_columns', 999)
movies.head()

Unnamed: 0,imdbID,title_x,year_x,runtime_x,imdbVotes_x,imdbRating_x,movieType_x,gross_x,great_x,best_x,other_x,good_x,many_x,first_x,more_x,such_x,most_x,own_x,same_x,real_x,few_x,much_x,old_x,new_x,true,last_x,young_x,different,bad_x,classic,big_x,beautiful,whole_x,brilliant,original_x,greatest,better_x,only_x,second_x,top,wonderful,long_x,little_x,human,least_x,powerful,full,special_x,acting_x,main_x,final,funny_x,important,emotional,later,simple,strong,title_y,year_y,runtime_y,imdbVotes_y,imdbRating_y,movieType_y,gross_y,bad_y,good_y,other_y,worst,first_y,more_y,funny_y,many_y,such_y,least_y,same_y,few_y,most_y,best_y,old_y,great_y,awful,only_y,acting_y,horrible,terrible,original_y,whole_y,own_y,better_y,real_y,much_y,stupid,big_y,special_y,new_y,sure,little_y,poor,wrong,low,main_y,last_y,entire,young_y,worse,long_y,high,worth,next,second_y,hard,several,able,hilarious
0,tt2582802,Whiplash,2014.0,107.0,384504.0,8.5,1.0,13092000.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1,tt0047478,Seven Samurai,1954.0,207.0,226364.0,8.7,1.0,269061.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2,tt0082971,Raiders of the Lost Ark,1981.0,115.0,653557.0,8.5,1.0,242374454.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,1.0,1.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
3,tt0050212,The Bridge on the River Kwai,1957.0,161.0,147591.0,8.2,1.0,27200000.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
4,tt0848228,The Avengers,2012.0,143.0,980989.0,8.1,1.0,623279547.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


## Bonus:

- What are the top 10 grossing movies?
- Who are the 10 actors that appear in the most movies?
- What's the average grossing of the movies in which each of these actors appear?
- What genre is the oldest movie?
