# Explanatory Data Analysis & Data Presentation (Movies Dataset)

## Data Import and first Inspection

1. __Import__ the movies dataset from the CSV file "movies_complete.csv". __Inspect__ the data.

__Some additional information on Features/Columns__:

* **id:** The ID of the movie (clear/unique identifier).
* **title:** The Official Title of the movie.
* **tagline:** The tagline of the movie.
* **release_date:** Theatrical Release Date of the movie.
* **genres:** Genres associated with the movie.
* **belongs_to_collection:** Gives information on the movie series/franchise the particular film belongs to.
* **original_language:** The language in which the movie was originally shot.
* **budget_musd:** The budget of the movie in million dollars.
* **revenue_musd:** The total revenue of the movie in million dollars.
* **production_companies:** Production companies involved with the making of the movie.
* **production_countries:** Countries where the movie was shot/produced.
* **vote_count:** The number of votes by users, as counted by TMDB.
* **vote_average:** The average rating of the movie.
* **popularity:** The Popularity Score assigned by TMDB.
* **runtime:** The runtime of the movie in minutes.
* **overview:** A brief blurb of the movie.
* **spoken_languages:** Spoken languages in the film.
* **poster_path:** An HTML `<img>` tag pointing to the URL of the poster image.
* **cast:** (Main) Actors appearing in the movie.
* **cast_size:** Number of actors appearing in the movie.
* **director:** Director of the movie.
* **crew_size:** Size of the film crew (incl. director, excl. actors).

In [24]:
# Importing the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import HTML

In [25]:
movies_df = pd.read_csv('../data/raw/movies_complete.csv')
movies_df.head()

Unnamed: 0,id,title,tagline,release_date,genres,belongs_to_collection,original_language,budget_musd,revenue_musd,production_companies,...,vote_average,popularity,runtime,overview,spoken_languages,poster_path,cast,cast_size,crew_size,director
0,862,Toy Story,,1995-10-30,Animation|Comedy|Family,Toy Story Collection,en,30.0,373.554033,Pixar Animation Studios,...,7.7,21.946943,81.0,"Led by Woody, Andy's toys live happily in his ...",English,<img src='http://image.tmdb.org/t/p/w185//uXDf...,Tom Hanks|Tim Allen|Don Rickles|Jim Varney|Wal...,13,106,John Lasseter
1,8844,Jumanji,Roll the dice and unleash the excitement!,1995-12-15,Adventure|Fantasy|Family,,en,65.0,262.797249,TriStar Pictures|Teitler Film|Interscope Commu...,...,6.9,17.015539,104.0,When siblings Judy and Peter discover an encha...,English|Français,<img src='http://image.tmdb.org/t/p/w185//vgpX...,Robin Williams|Jonathan Hyde|Kirsten Dunst|Bra...,26,16,Joe Johnston
2,15602,Grumpier Old Men,Still Yelling. Still Fighting. Still Ready for...,1995-12-22,Romance|Comedy,Grumpy Old Men Collection,en,,,Warner Bros.|Lancaster Gate,...,6.5,11.7129,101.0,A family wedding reignites the ancient feud be...,English,<img src='http://image.tmdb.org/t/p/w185//1FSX...,Walter Matthau|Jack Lemmon|Ann-Margret|Sophia ...,7,4,Howard Deutch
3,31357,Waiting to Exhale,Friends are the people who let you be yourself...,1995-12-22,Comedy|Drama|Romance,,en,16.0,81.452156,Twentieth Century Fox Film Corporation,...,6.1,3.859495,127.0,"Cheated on, mistreated and stepped on, the wom...",English,<img src='http://image.tmdb.org/t/p/w185//4wjG...,Whitney Houston|Angela Bassett|Loretta Devine|...,10,10,Forest Whitaker
4,11862,Father of the Bride Part II,Just When His World Is Back To Normal... He's ...,1995-02-10,Comedy,Father of the Bride Collection,en,,76.578911,Sandollar Productions|Touchstone Pictures,...,5.7,8.387519,106.0,Just when George Banks has recovered from his ...,English,<img src='http://image.tmdb.org/t/p/w185//lf9R...,Steve Martin|Diane Keaton|Martin Short|Kimberl...,12,7,Charles Shyer


In [26]:
HTML(movies_df[['title', 'poster_path']][:5].to_html(escape=False))

Unnamed: 0,title,poster_path
0,Toy Story,
1,Jumanji,
2,Grumpier Old Men,
3,Waiting to Exhale,
4,Father of the Bride Part II,
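
The `escape=False` argument in the cell above is what lets the `<img>` tags in `poster_path` render as pictures instead of literal text. A minimal sketch of the difference, assuming a hypothetical one-row frame (`demo`) with a placeholder URL rather than a real poster path:

```python
import pandas as pd

# Hypothetical one-row frame; the URL is a placeholder, not a real poster path.
demo = pd.DataFrame({
    "title": ["Movie A"],
    "poster_path": ["<img src='http://example.com/a.jpg'>"],
})

escaped = demo.to_html()                # default escape=True: "<" becomes "&lt;"
rendered = demo.to_html(escape=False)   # markup is kept, so a browser loads the image

print("&lt;img" in escaped)     # the tag was turned into visible text
print("<img src" in rendered)   # the tag survived as HTML
```

Since unescaped HTML is inserted as-is, this is best reserved for columns whose content you trust.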


## The best and the worst movies...

2. __Filter__ the Dataset and __find the best/worst n Movies__ with the

- Highest Revenue
- Highest Budget
- Highest Profit (=Revenue - Budget)
- Lowest Profit (=Revenue - Budget)
- Highest Return on Investment (=Revenue / Budget) (only movies with Budget >= 10) 
- Lowest Return on Investment (=Revenue / Budget) (only movies with Budget >= 10)
- Highest number of Votes
- Highest Rating (only movies with 10 or more Ratings)
- Lowest Rating (only movies with 10 or more Ratings)
- Highest Popularity

In [27]:
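# Profit in million USD; NaN whenever budget or revenue is missing
movies_df['profit_musd'] = movies_df['revenue_musd'] - movies_df['budget_musd']
# Return on investment: a unitless ratio, despite the '_musd' suffix in the column name
movies_df['return_musd'] = movies_df['revenue_musd'] / movies_df['budget_musd']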
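
Both derived columns inherit missing values from their inputs, which is why some rows in the tables further down show empty Profit and ROI cells. A small sketch of that behavior on a hypothetical three-row frame (`toy`, with made-up figures in million USD):

```python
import numpy as np
import pandas as pd

# Hypothetical values in million USD; NaN marks a missing figure.
toy = pd.DataFrame({
    "budget_musd": [30.0, np.nan, 50.0],
    "revenue_musd": [90.0, 76.0, np.nan],
})

# Arithmetic with NaN yields NaN, so incomplete rows stay incomplete.
toy["profit_musd"] = toy["revenue_musd"] - toy["budget_musd"]
toy["roi"] = toy["revenue_musd"] / toy["budget_musd"]

print(toy)
# Only the first row has both inputs, so only it gets a profit (60.0) and an ROI (3.0).
```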

In [28]:
movies_df.columns

Index(['id', 'title', 'tagline', 'release_date', 'genres',
       'belongs_to_collection', 'original_language', 'budget_musd',
       'revenue_musd', 'production_companies', 'production_countries',
       'vote_count', 'vote_average', 'popularity', 'runtime', 'overview',
       'spoken_languages', 'poster_path', 'cast', 'cast_size', 'crew_size',
       'director', 'profit_musd', 'return_musd'],
      dtype='object')

In [35]:
movies = movies_df[['title', 'poster_path', 'budget_musd', 'revenue_musd', 'profit_musd', 'return_musd', 'release_date',
                    'vote_count', 'vote_average', 'popularity']].copy()
movies.columns = ['Title', 'Poster', 'Budget', 'Revenue', 'Profit', 'ROI', 'Release Date', 'Votes', 'Average Rating', 'Popularity']
movies.set_index('Title', inplace=True)
movies.head()

Title,Poster,Budget,Revenue,Profit,ROI,Release Date,Votes,Average Rating,Popularity
Toy Story,<img src='http://image.tmdb.org/t/p/w185//uXDf...,30.0,373.554033,343.554033,12.451801,1995-10-30,5415.0,7.7,21.946943
Jumanji,<img src='http://image.tmdb.org/t/p/w185//vgpX...,65.0,262.797249,197.797249,4.043035,1995-12-15,2413.0,6.9,17.015539
Grumpier Old Men,<img src='http://image.tmdb.org/t/p/w185//1FSX...,,,,,1995-12-22,92.0,6.5,11.7129
Waiting to Exhale,<img src='http://image.tmdb.org/t/p/w185//4wjG...,16.0,81.452156,65.452156,5.09076,1995-12-22,34.0,6.1,3.859495
Father of the Bride Part II,<img src='http://image.tmdb.org/t/p/w185//lf9R...,,76.578911,,,1995-02-10,173.0,5.7,8.387519


__Define__ an appropriate __user-defined function__ to reuse code.

In [46]:
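def best_worst(by, n=5, ascending=False, min_bud=0, min_votes=0):
    # Rows with a missing Budget or Votes value fail the >= comparison and are excluded
    mask = (movies['Budget'] >= min_bud) & (movies['Votes'] >= min_votes)
    # Sort the remaining rows by the requested column and keep the first n
    temp_df = movies.loc[mask, ['Release Date', 'Poster', by]].sort_values(by=by, ascending=ascending).head(n)
    # escape=False keeps the <img> tags in 'Poster' so the posters render
    return HTML(temp_df.to_html(escape=False))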

__Movies Top 5 - Highest Revenue__

In [47]:
best_worst('Revenue')

Title,Release Date,Poster,Revenue
Avatar,2009-12-10,,2787.965087
Star Wars: The Force Awakens,2015-12-15,,2068.223624
Titanic,1997-11-18,,1845.034188
The Avengers,2012-04-25,,1519.55791
Jurassic World,2015-06-09,,1513.52881


__Movies Top 5 - Highest Budget__

In [48]:
best_worst('Budget')

Title,Release Date,Poster,Budget
Pirates of the Caribbean: On Stranger Tides,2011-05-14,,380.0
Pirates of the Caribbean: At World's End,2007-05-19,,300.0
Avengers: Age of Ultron,2015-04-22,,280.0
Superman Returns,2006-06-28,,270.0
John Carter,2012-03-07,,260.0


__Movies Top 5 - Highest Profit__

In [49]:
best_worst('Profit')

Title,Release Date,Poster,Profit
Avatar,2009-12-10,,2550.965087
Star Wars: The Force Awakens,2015-12-15,,1823.223624
Titanic,1997-11-18,,1645.034188
Jurassic World,2015-06-09,,1363.52881
Furious 7,2015-04-01,,1316.24936


__Movies Top 5 - Lowest Profit__

In [50]:
best_worst('Profit', ascending=True)

Title,Release Date,Poster,Profit
The Lone Ranger,2013-07-03,,-165.71009
The Alamo,2004-04-07,,-119.180039
Mars Needs Moms,2011-03-09,,-111.007242
Valerian and the City of a Thousand Planets,2017-07-20,,-107.447384
The 13th Warrior,1999-08-27,,-98.301101


__Movies Top 5 - Highest ROI__

In [51]:
# min_bud keeps its default of 0 here, so the Budget >= 10 condition from the task is not applied
best_worst('ROI')

Title,Release Date,Poster,ROI
Less Than Zero,1987-11-06,,12396380.0
Modern Times,1936-02-05,,8500000.0
Welcome to Dongmakgol,2005-08-04,,4197477.0
Aquí Entre Nos,2012-03-30,,2755584.0
"The Karate Kid, Part II",1986-06-18,,1018619.0


__Movies Top 5 - Lowest ROI__

In [52]:
# min_bud keeps its default of 0 here, so the Budget >= 10 condition from the task is not applied
best_worst('ROI', ascending=True)

Title,Release Date,Poster,ROI
Chasing Liberty,2004-01-09,,5.217391e-07
The Cookout,2004-09-03,,7.5e-07
Never Talk to Strangers,1995-10-20,,9.375e-07
To Rob a Thief,2007-08-31,,1.499133e-06
Deadfall,1993-10-08,,1.8e-06
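
ROI values in the millions (or around 1e-06) usually point to budgets or revenues recorded in the wrong unit, which is presumably why the task restricts ROI to movies with Budget >= 10. With the helper defined earlier, that would likely be `best_worst('ROI', min_bud=10)`. The sketch below shows the effect of the budget floor on a hypothetical four-row frame (`toy`); all titles and figures are invented for illustration.

```python
import pandas as pd

# Hypothetical titles and figures (million USD).
# "Movie C" mimics a budget that was entered in the wrong unit.
toy = pd.DataFrame(
    {
        "Budget": [100.0, 20.0, 0.000001, 15.0],
        "Revenue": [300.0, 10.0, 8.5, 90.0],
    },
    index=["Movie A", "Movie B", "Movie C", "Movie D"],
)
toy["ROI"] = toy["Revenue"] / toy["Budget"]

# Without a floor, the mis-scaled row dominates the ranking.
unfiltered_top = toy["ROI"].idxmax()

# With the floor from the task, only plausible budgets remain.
eligible = toy[toy["Budget"] >= 10]
highest = eligible.sort_values("ROI", ascending=False).head(2)
lowest = eligible.sort_values("ROI", ascending=True).head(2)

print(unfiltered_top)
print(highest)
print(lowest)
```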


__Movies Top 5 - Most Votes__

In [53]:
best_worst('Votes')

Title,Release Date,Poster,Votes
Inception,2010-07-14,,14075.0
The Dark Knight,2008-07-16,,12269.0
Avatar,2009-12-10,,12114.0
The Avengers,2012-04-25,,12000.0
Deadpool,2016-02-09,,11444.0


__Movies Top 5 - Highest Rating__

In [54]:
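# min_votes keeps its default of 0 here, so the "10 or more Ratings" condition from the task is not applied
best_worst('Average Rating')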

Title,Release Date,Poster,Average Rating
Time Pass,2014-01-03,,10.0
Shuttlecock Boys,2012-08-02,,10.0
Forever,2006-10-05,,10.0
Souls of Zen: Ancestors and Agency in Contemporary Japanese Temple Buddhism,2012-06-01,,10.0
Elaine Stritch: At Liberty,2002-01-01,,10.0


__Movies Top 5 - Lowest Rating__

In [55]:
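# min_votes keeps its default of 0 here, so the "10 or more Ratings" condition from the task is not applied
best_worst('Average Rating', ascending=True)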

Title,Release Date,Poster,Average Rating
Extinction: Nature Has Evolved,2017-03-10,,0.0
Roukli,2015-09-17,,0.0
Joe and Max,2002-03-03,,0.0
Call Me by Your Name,2017-10-27,,0.0
Unrated II: Scary as Hell,2011-02-26,,0.5
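
Averages of exactly 10.0 or 0.0 tend to come from movies with only a handful of votes, so the task's "10 or more Ratings" condition matters for both rating rankings. With the helper defined earlier, that would likely be `best_worst('Average Rating', min_votes=10)`. A sketch of the same idea on a hypothetical frame (`toy`, invented titles and numbers):

```python
import pandas as pd

# Hypothetical titles, vote counts and ratings.
toy = pd.DataFrame(
    {
        "Votes": [1.0, 250.0, 40.0, 3.0],
        "Average Rating": [10.0, 8.1, 4.2, 0.0],
    },
    index=["Movie A", "Movie B", "Movie C", "Movie D"],
)

# Keep only movies with enough votes for the average to be meaningful.
rated = toy.query("Votes >= 10")

best = rated.nlargest(1, "Average Rating")
worst = rated.nsmallest(1, "Average Rating")

print(best)
print(worst)
```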


__Movies Top 5 - Most Popular__

In [56]:
best_worst('Popularity')

Title,Release Date,Poster,Popularity
Minions,2015-06-17,,547.488298
Wonder Woman,2017-05-30,,294.337037
Beauty and the Beast,2017-03-16,,287.253654
Baby Driver,2017-06-28,,228.032744
Big Hero 6,2014-10-24,,213.849907


## Find your next Movie

3. __Filter__ the Dataset for movies that meet the following conditions:

__Search 1: Science Fiction Action Movie with Bruce Willis (sorted from high to low Rating)__
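
One possible approach to Search 1, sketched on a hypothetical frame (`toy`) that mirrors the pipe-separated `genres` and `cast` columns shown earlier. The titles, co-stars and ratings are invented; only the column names come from the dataset.

```python
import pandas as pd

# Hypothetical rows; genres and cast are pipe-separated strings as in the dataset.
toy = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C", "Movie D"],
    "genres": [
        "Action|Science Fiction",
        "Science Fiction|Action|Thriller",
        "Comedy",
        "Action|Science Fiction",
    ],
    "cast": ["Bruce Willis|Actor X", "Actor Y|Bruce Willis", "Bruce Willis", "Actor Z"],
    "vote_average": [6.2, 7.4, 5.0, 8.0],
})

# Both genres must be present; na=False treats missing strings as "no match".
mask_genres = (
    toy["genres"].str.contains("Action", na=False, regex=False)
    & toy["genres"].str.contains("Science Fiction", na=False, regex=False)
)
mask_actor = toy["cast"].str.contains("Bruce Willis", na=False, regex=False)

search_1 = toy[mask_genres & mask_actor].sort_values("vote_average", ascending=False)
print(search_1[["title", "vote_average"]])
```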

__Search 2: Movies with Uma Thurman and directed by Quentin Tarantino (sorted from short to long runtime)__
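
Search 2 combines a substring match on `cast` with an exact match on `director`. A minimal sketch on a hypothetical frame (`toy`); titles, co-stars and runtimes are invented:

```python
import pandas as pd

# Hypothetical rows using the dataset's column names.
toy = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C", "Movie D"],
    "cast": ["Uma Thurman|Actor X", "Actor Y", "Uma Thurman", "Actor Z|Uma Thurman"],
    "director": ["Quentin Tarantino", "Quentin Tarantino", "Director Q", "Quentin Tarantino"],
    "runtime": [154.0, 99.0, 120.0, 111.0],
})

mask_actor = toy["cast"].str.contains("Uma Thurman", na=False, regex=False)
mask_director = toy["director"] == "Quentin Tarantino"

# sort_values sorts ascending by default, i.e. shortest runtime first.
search_2 = toy[mask_actor & mask_director].sort_values("runtime")
print(search_2[["title", "runtime"]])
```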

__Search 3: Most Successful Pixar Studio Movies between 2010 and 2015 (sorted from high to low Revenue)__
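
For Search 3, the release dates need to be real datetimes before a range filter makes sense. The sketch below assumes that "between 2010 and 2015" includes both years in full and that matching the substring "Pixar" in `production_companies` is sufficient; the frame (`toy`) and its values are hypothetical.

```python
import pandas as pd

# Hypothetical rows using the dataset's column names.
toy = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C", "Movie D"],
    "production_companies": [
        "Pixar Animation Studios",
        "Walt Disney Pictures|Pixar Animation Studios",
        "Pixar Animation Studios",
        "Studio S",
    ],
    "release_date": ["2010-06-16", "2013-06-20", "2008-06-22", "2012-01-01"],
    "revenue_musd": [500.0, 700.0, 900.0, 1000.0],
})
toy["release_date"] = pd.to_datetime(toy["release_date"])

is_pixar = toy["production_companies"].str.contains("Pixar", na=False, regex=False)
# Assumption: both boundary years are included in full.
in_window = toy["release_date"].between(pd.Timestamp("2010-01-01"), pd.Timestamp("2015-12-31"))

search_3 = toy[is_pixar & in_window].sort_values("revenue_musd", ascending=False)
print(search_3[["title", "revenue_musd"]])
```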

__Search 4: Action or Thriller Movie with original language English and minimum Rating of 7.5 (most recent movies first)__
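
Search 4 needs an either/or genre match, which a regular-expression alternation handles in one call. A sketch on a hypothetical frame (`toy`) with invented titles, ratings and dates:

```python
import pandas as pd

# Hypothetical rows using the dataset's column names.
toy = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C", "Movie D", "Movie E"],
    "genres": ["Action|Drama", "Thriller", "Comedy", "Action", "Thriller|Crime"],
    "original_language": ["en", "en", "en", "fr", "en"],
    "vote_average": [7.9, 7.5, 9.0, 8.2, 6.0],
    "release_date": ["2001-05-04", "2016-09-30", "2010-01-01", "2014-03-03", "2017-02-02"],
})
toy["release_date"] = pd.to_datetime(toy["release_date"])

# "Action|Thriller" is a regex alternation: either genre is enough.
is_genre = toy["genres"].str.contains("Action|Thriller", na=False)
mask = is_genre & (toy["original_language"] == "en") & (toy["vote_average"] >= 7.5)

# Most recent movies first.
search_4 = toy[mask].sort_values("release_date", ascending=False)
print(search_4[["title", "release_date", "vote_average"]])
```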

## Are Franchises more successful?

4. __Analyze__ the Dataset and __find out whether Franchises (Movies that belong to a collection) are more successful than stand-alone movies__ in terms of:

- mean revenue
- median Return on Investment
- mean budget raised
- mean popularity
- mean rating

hint: use groupby()
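
Following the hint, one way to set this up is to derive a label from whether `belongs_to_collection` is missing and then aggregate per label. The sketch below uses a hypothetical frame (`toy`) with invented collections and figures; the `kind` and `roi` column names are also assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; NaN in belongs_to_collection marks a stand-alone movie.
toy = pd.DataFrame({
    "belongs_to_collection": ["Saga X", "Saga X", np.nan, np.nan, "Saga Y"],
    "budget_musd": [100.0, 120.0, 20.0, 10.0, 50.0],
    "revenue_musd": [400.0, 360.0, 30.0, 5.0, 200.0],
    "popularity": [20.0, 25.0, 5.0, 3.0, 15.0],
    "vote_average": [7.0, 6.5, 6.8, 5.9, 7.2],
})
toy["roi"] = toy["revenue_musd"] / toy["budget_musd"]
toy["kind"] = np.where(toy["belongs_to_collection"].notna(), "Franchise", "Stand-alone")

# One row per group, one column per metric from the task list.
comparison = toy.groupby("kind").agg(
    mean_revenue=("revenue_musd", "mean"),
    median_roi=("roi", "median"),
    mean_budget=("budget_musd", "mean"),
    mean_popularity=("popularity", "mean"),
    mean_rating=("vote_average", "mean"),
)
print(comparison)
```

The median is used for ROI because a few extreme ratios can distort the mean.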

__Franchise vs. Stand-alone: Average Revenue__

__Franchise vs. Stand-alone: Return on Investment / Profitability (median)__

__Franchise vs. Stand-alone: Average Budget__

__Franchise vs. Stand-alone: Average Popularity__

__Franchise vs. Stand-alone: Average Rating__

## Most Successful Franchises

5. __Find__ the __most successful Franchises__ in terms of

- __total number of movies__
- __total & mean budget__
- __total & mean revenue__
- __mean rating__
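
A possible starting point is to group by `belongs_to_collection` and compute all requested metrics in a single `agg` call. The frame (`toy`), its collections and its figures are hypothetical; the output column names are chosen for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; the last movie is stand-alone (NaN collection).
toy = pd.DataFrame({
    "belongs_to_collection": ["Saga X", "Saga X", "Saga X", "Saga Y", "Saga Y", np.nan],
    "title": ["Movie A", "Movie B", "Movie C", "Movie D", "Movie E", "Movie F"],
    "budget_musd": [100.0, 120.0, 140.0, 50.0, 70.0, 10.0],
    "revenue_musd": [400.0, 360.0, 500.0, 200.0, 100.0, 5.0],
    "vote_average": [7.0, 6.5, 6.0, 7.2, 6.8, 5.0],
})

# groupby drops NaN keys by default, so stand-alone movies are left out.
franchises = toy.groupby("belongs_to_collection").agg(
    n_movies=("title", "count"),
    total_budget=("budget_musd", "sum"),
    mean_budget=("budget_musd", "mean"),
    total_revenue=("revenue_musd", "sum"),
    mean_revenue=("revenue_musd", "mean"),
    mean_rating=("vote_average", "mean"),
)

# Rank by any metric, here total revenue.
top = franchises.sort_values("total_revenue", ascending=False)
print(top)
```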

## Most Successful Directors

6. __Find__ the __most successful Directors__ in terms of

- __total number of movies__
- __total revenue__
- __mean rating__
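
The same grouping pattern applies to directors. The sketch below uses a hypothetical frame (`toy`) with invented directors and figures. Note that a director with a single highly rated film can top the mean-rating ranking, so a minimum number of movies may be worth adding in practice.

```python
import pandas as pd

# Hypothetical rows using the dataset's column names.
toy = pd.DataFrame({
    "director": ["Director P", "Director P", "Director Q", "Director Q", "Director Q", "Director R"],
    "title": ["Movie A", "Movie B", "Movie C", "Movie D", "Movie E", "Movie F"],
    "revenue_musd": [300.0, 500.0, 50.0, 60.0, 70.0, 900.0],
    "vote_average": [7.0, 8.0, 6.0, 6.5, 7.0, 9.0],
})

directors = toy.groupby("director").agg(
    n_movies=("title", "count"),
    total_revenue=("revenue_musd", "sum"),
    mean_rating=("vote_average", "mean"),
)

# idxmax returns the index label (the director) of the largest value.
most_movies = directors["n_movies"].idxmax()
top_revenue = directors["total_revenue"].idxmax()
top_rating = directors["mean_rating"].idxmax()

print(directors)
print(most_movies, top_revenue, top_rating)
```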