# Project : Explanatory Data Analysis & Data Presentation (Movies Dataset)

## Data Import and first Inspection

1. __Import__ the movies dataset from the CSV file "movies_complete.csv". __Inspect__ the data.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
df = pd.read_csv('movies_complete.csv', parse_dates=['release_date'])
df.head()

Unnamed: 0,id,title,tagline,release_date,genres,belongs_to_collection,original_language,budget_musd,revenue_musd,production_companies,...,vote_average,popularity,runtime,overview,spoken_languages,poster_path,cast,cast_size,crew_size,director
0,862,Toy Story,,1995-10-30,Animation|Comedy|Family,Toy Story Collection,en,30.0,373.554033,Pixar Animation Studios,...,7.7,21.946943,81.0,"Led by Woody, Andy's toys live happily in his ...",English,<img src='http://image.tmdb.org/t/p/w185//uXDf...,Tom Hanks|Tim Allen|Don Rickles|Jim Varney|Wal...,13,106,John Lasseter
1,8844,Jumanji,Roll the dice and unleash the excitement!,1995-12-15,Adventure|Fantasy|Family,,en,65.0,262.797249,TriStar Pictures|Teitler Film|Interscope Commu...,...,6.9,17.015539,104.0,When siblings Judy and Peter discover an encha...,English|Français,<img src='http://image.tmdb.org/t/p/w185//vgpX...,Robin Williams|Jonathan Hyde|Kirsten Dunst|Bra...,26,16,Joe Johnston
2,15602,Grumpier Old Men,Still Yelling. Still Fighting. Still Ready for...,1995-12-22,Romance|Comedy,Grumpy Old Men Collection,en,,,Warner Bros.|Lancaster Gate,...,6.5,11.7129,101.0,A family wedding reignites the ancient feud be...,English,<img src='http://image.tmdb.org/t/p/w185//1FSX...,Walter Matthau|Jack Lemmon|Ann-Margret|Sophia ...,7,4,Howard Deutch
3,31357,Waiting to Exhale,Friends are the people who let you be yourself...,1995-12-22,Comedy|Drama|Romance,,en,16.0,81.452156,Twentieth Century Fox Film Corporation,...,6.1,3.859495,127.0,"Cheated on, mistreated and stepped on, the wom...",English,<img src='http://image.tmdb.org/t/p/w185//4wjG...,Whitney Houston|Angela Bassett|Loretta Devine|...,10,10,Forest Whitaker
4,11862,Father of the Bride Part II,Just When His World Is Back To Normal... He's ...,1995-02-10,Comedy,Father of the Bride Collection,en,,76.578911,Sandollar Productions|Touchstone Pictures,...,5.7,8.387519,106.0,Just when George Banks has recovered from his ...,English,<img src='http://image.tmdb.org/t/p/w185//lf9R...,Steve Martin|Diane Keaton|Martin Short|Kimberl...,12,7,Charles Shyer


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 44691 entries, 0 to 44690
Data columns (total 22 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   id                     44691 non-null  int64         
 1   title                  44691 non-null  object        
 2   tagline                20284 non-null  object        
 3   release_date           44657 non-null  datetime64[ns]
 4   genres                 42586 non-null  object        
 5   belongs_to_collection  4463 non-null   object        
 6   original_language      44681 non-null  object        
 7   budget_musd            8854 non-null   float64       
 8   revenue_musd           7385 non-null   float64       
 9   production_companies   33356 non-null  object        
 10  production_countries   38835 non-null  object        
 11  vote_count             44691 non-null  float64       
 12  vote_average           42077 non-null  float64       
 13  p

## Let's see if there are any duplicates

In [4]:
# Rows with the same title and the same release date are treated as duplicates
df[df.duplicated(subset=['title','release_date'])]

Unnamed: 0,id,title,tagline,release_date,genres,belongs_to_collection,original_language,budget_musd,revenue_musd,production_companies,...,vote_average,popularity,runtime,overview,spoken_languages,poster_path,cast,cast_size,crew_size,director


## Let's see if there are any NaN values

In [5]:
# Many columns contain NaN values; print the share of missing values per column
features_with_na = [feature for feature in df.columns if df[feature].isna().sum()>0]
for feature in features_with_na:
    print(f'{feature.upper()} has {np.round(df[feature].isna().mean()*100,2)}% of NAN values')

TAGLINE has 54.61% of NAN values
RELEASE_DATE has 0.08% of NAN values
GENRES has 4.71% of NAN values
BELONGS_TO_COLLECTION has 90.01% of NAN values
ORIGINAL_LANGUAGE has 0.02% of NAN values
BUDGET_MUSD has 80.19% of NAN values
REVENUE_MUSD has 83.48% of NAN values
PRODUCTION_COMPANIES has 25.36% of NAN values
PRODUCTION_COUNTRIES has 13.1% of NAN values
VOTE_AVERAGE has 5.85% of NAN values
RUNTIME has 3.38% of NAN values
OVERVIEW has 2.13% of NAN values
SPOKEN_LANGUAGES has 8.05% of NAN values
POSTER_PATH has 0.5% of NAN values
CAST has 4.9% of NAN values
DIRECTOR has 1.64% of NAN values
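
The same percentages can also be computed without an explicit loop. The following is a minimal sketch on a small, made-up frame (`toy` and its values are hypothetical stand-ins for the movies dataset), using `isna().mean()` per column:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the movies dataset.
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "tagline": ["x", None, None, None],
    "budget_musd": [1.0, np.nan, 3.0, 4.0],
})

# Share of missing values per column, in percent.
na_share = toy.isna().mean().mul(100).round(2)

# Keep only columns that have missing values, largest share first.
na_share = na_share[na_share > 0].sort_values(ascending=False)
print(na_share)
```

Sorting by share makes it easier to spot the columns (here `tagline`) that are mostly empty.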


__Some additional information on Features/Columns__:

* **id:** The ID of the movie (clear/unique identifier).
* **title:** The Official Title of the movie.
* **tagline:** The tagline of the movie.
* **release_date:** Theatrical Release Date of the movie.
* **genres:** Genres associated with the movie.
* **belongs_to_collection:** Gives information on the movie series/franchise the particular film belongs to.
* **original_language:** The language in which the movie was originally shot.
* **budget_musd:** The budget of the movie in million dollars.
* **revenue_musd:** The total revenue of the movie in million dollars.
* **production_companies:** Production companies involved with the making of the movie.
* **production_countries:** Countries in which the movie was shot/produced.
* **vote_count:** The number of votes by users, as counted by TMDB.
* **vote_average:** The average rating of the movie.
* **popularity:** The Popularity Score assigned by TMDB.
* **runtime:** The runtime of the movie in minutes.
* **overview:** A brief blurb of the movie.
* **spoken_languages:** Spoken languages in the film.
* **poster_path:** The URL of the poster image.
* **cast:** (Main) Actors appearing in the movie.
* **cast_size:** Number of actors appearing in the movie.
* **director:** Director of the movie.
* **crew_size:** Size of the film crew (incl. director, excl. actors).

## The best and the worst movies...

2. __Filter__ the Dataset and __find the best/worst n Movies__ with the

- Highest Revenue
- Highest Budget
- Highest Profit (=Revenue - Budget)
- Lowest Profit (=Revenue - Budget)
- Highest Return on Investment (=Revenue / Budget) (only movies with Budget >= 10) 
- Lowest Return on Investment (=Revenue / Budget) (only movies with Budget >= 10)
- Highest number of Votes
- Highest Rating (only movies with 10 or more Ratings)
- Lowest Rating (only movies with 10 or more Ratings)
- Highest Popularity

In [6]:
df['profit'] = df['revenue_musd'] - df['budget_musd']

In [7]:
df['Return_On_Investment'] = df['revenue_musd']/df['budget_musd']

In [8]:
# Change the format of the release_date column to DD-MM-YYYY (this turns the column into strings)
df['release_date'] = pd.to_datetime(df['release_date'])
df['release_date'] = df['release_date'].dt.strftime('%d-%m-%Y')

In [9]:
df.head()

Unnamed: 0,id,title,tagline,release_date,genres,belongs_to_collection,original_language,budget_musd,revenue_musd,production_companies,...,runtime,overview,spoken_languages,poster_path,cast,cast_size,crew_size,director,profit,Return_On_Investment
0,862,Toy Story,,30-10-1995,Animation|Comedy|Family,Toy Story Collection,en,30.0,373.554033,Pixar Animation Studios,...,81.0,"Led by Woody, Andy's toys live happily in his ...",English,<img src='http://image.tmdb.org/t/p/w185//uXDf...,Tom Hanks|Tim Allen|Don Rickles|Jim Varney|Wal...,13,106,John Lasseter,343.554033,12.451801
1,8844,Jumanji,Roll the dice and unleash the excitement!,15-12-1995,Adventure|Fantasy|Family,,en,65.0,262.797249,TriStar Pictures|Teitler Film|Interscope Commu...,...,104.0,When siblings Judy and Peter discover an encha...,English|Français,<img src='http://image.tmdb.org/t/p/w185//vgpX...,Robin Williams|Jonathan Hyde|Kirsten Dunst|Bra...,26,16,Joe Johnston,197.797249,4.043035
2,15602,Grumpier Old Men,Still Yelling. Still Fighting. Still Ready for...,22-12-1995,Romance|Comedy,Grumpy Old Men Collection,en,,,Warner Bros.|Lancaster Gate,...,101.0,A family wedding reignites the ancient feud be...,English,<img src='http://image.tmdb.org/t/p/w185//1FSX...,Walter Matthau|Jack Lemmon|Ann-Margret|Sophia ...,7,4,Howard Deutch,,
3,31357,Waiting to Exhale,Friends are the people who let you be yourself...,22-12-1995,Comedy|Drama|Romance,,en,16.0,81.452156,Twentieth Century Fox Film Corporation,...,127.0,"Cheated on, mistreated and stepped on, the wom...",English,<img src='http://image.tmdb.org/t/p/w185//4wjG...,Whitney Houston|Angela Bassett|Loretta Devine|...,10,10,Forest Whitaker,65.452156,5.09076
4,11862,Father of the Bride Part II,Just When His World Is Back To Normal... He's ...,10-02-1995,Comedy,Father of the Bride Collection,en,,76.578911,Sandollar Productions|Touchstone Pictures,...,106.0,Just when George Banks has recovered from his ...,English,<img src='http://image.tmdb.org/t/p/w185//lf9R...,Steve Martin|Diane Keaton|Martin Short|Kimberl...,12,7,Charles Shyer,,


__Define__ an appropriate __user-defined function__ to reuse code.

In [10]:
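def find_top_and_bottom_5(column):
    """Return the titles of the 5 movies with the largest and the smallest values in `column`."""
    # nlargest/nsmallest skip NaN values, so movies with missing data are ignored
    top_5 = df.loc[df[column].nlargest(5).index, 'title']
    bottom_5 = df.loc[df[column].nsmallest(5).index, 'title']
    return top_5, bottom_5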
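
The task list asks for the best/worst *n* movies and restricts some rankings (ROI only for Budget >= 10, ratings only for movies with 10 or more votes). A possible generalisation is sketched below on made-up data. The function name `best_worst` and its parameters `n`, `ascending`, `min_budget` and `min_votes` are assumptions for illustration, not part of the original notebook:

```python
import numpy as np
import pandas as pd

def best_worst(data, column, n=5, ascending=False, min_budget=0, min_votes=0):
    """Return the n best (or, with ascending=True, worst) movies by `column`."""
    # Optional filters; missing budgets/vote counts are treated as 0.
    mask = (data["budget_musd"].fillna(0) >= min_budget) & (
        data["vote_count"].fillna(0) >= min_votes
    )
    subset = data.loc[mask, ["title", column]].dropna(subset=[column])
    return subset.sort_values(by=column, ascending=ascending).head(n)

# Hypothetical toy data, not taken from the movies dataset.
toy = pd.DataFrame({
    "title": ["Low budget hit", "Blockbuster", "Flop", "Unknown"],
    "budget_musd": [0.5, 200.0, 100.0, np.nan],
    "revenue_musd": [50.0, 800.0, 20.0, 10.0],
    "vote_count": [5, 9000, 300, 2],
})
toy["Return_On_Investment"] = toy["revenue_musd"] / toy["budget_musd"]

# Highest ROI among movies with a budget of at least 10 (million USD).
print(best_worst(toy, "Return_On_Investment", n=2, min_budget=10))
```

Without the `min_budget` filter, very small budgets dominate the ROI ranking, which is why the threshold matters.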

__Movies Top 5 - Highest Revenue__

In [11]:
top_5_movies,bottom_5_movies = find_top_and_bottom_5('revenue_musd')
print(top_5_movies)

14448                          Avatar
26265    Star Wars: The Force Awakens
1620                          Titanic
17669                    The Avengers
24812                  Jurassic World
Name: title, dtype: object


__Movies Top 5 - Highest Budget__

In [12]:
top_5_movies,bottom_5_movies = find_top_and_bottom_5('budget_musd')
print(top_5_movies)

16986    Pirates of the Caribbean: On Stranger Tides
11743       Pirates of the Caribbean: At World's End
26268                        Avengers: Age of Ultron
10985                               Superman Returns
16006                                        Tangled
Name: title, dtype: object


__Movies Top 5 - Highest Profit__

In [13]:
top_5_movies,bottom_5_movies = find_top_and_bottom_5('profit')
print(top_5_movies)

14448                          Avatar
26265    Star Wars: The Force Awakens
1620                          Titanic
24812                  Jurassic World
28501                       Furious 7
Name: title, dtype: object


__Movies Top 5 - Lowest Profit__

In [14]:
top_5_movies,bottom_5_movies = find_top_and_bottom_5('profit')
print(bottom_5_movies)

20959                                The Lone Ranger
7164                                       The Alamo
16659                                Mars Needs Moms
43611    Valerian and the City of a Thousand Planets
2684                                The 13th Warrior
Name: title, dtype: object


__Movies Top 5 - Highest ROI__

In [15]:
top_5_movies,bottom_5_movies = find_top_and_bottom_5('Return_On_Investment')
print(top_5_movies)

3965              Less Than Zero
3312                Modern Times
14270      Welcome to Dongmakgol
22041             Aquí Entre Nos
2284     The Karate Kid, Part II
Name: title, dtype: object


__Movies Top 5 - Lowest ROI__

In [16]:
top_5_movies,bottom_5_movies = find_top_and_bottom_5('Return_On_Investment')
print(bottom_5_movies)

6955             Chasing Liberty
8041                 The Cookout
2636     Never Talk to Strangers
12859             To Rob a Thief
17381                   Deadfall
Name: title, dtype: object


__Movies Top 5 - Most Votes__

In [17]:
top_5_movies,bottom_5_movies = find_top_and_bottom_5('vote_count')
print(top_5_movies)

15368          Inception
12396    The Dark Knight
14448             Avatar
17669       The Avengers
26272           Deadpool
Name: title, dtype: object


__Movies Top 5 - Highest Rating__

In [18]:
top_5_movies,bottom_5_movies = find_top_and_bottom_5('vote_average')
print(top_5_movies)

186                                     Reckless
393                         Girl in the Cadillac
702     The Haunted World of Edward D. Wood, Jr.
732       Carmen Miranda: Bananas Is My Business
1615                    Other Voices Other Rooms
Name: title, dtype: object


__Movies Top 5 - Lowest Rating__

In [19]:
top_5_movies,bottom_5_movies = find_top_and_bottom_5('vote_average')
print(bottom_5_movies)

1360                             Inside
1548                  Alive and Kicking
12189                       Joe and Max
13138    Pete Seeger: The Power of Song
14072               Mr. Robinson Crusoe
Name: title, dtype: object


__Movies Top 5 - Most Popular__

In [20]:
top_5_movies,bottom_5_movies = find_top_and_bottom_5('popularity')
print(top_5_movies)

30330                 Minions
32927            Wonder Woman
41556    Beauty and the Beast
42940             Baby Driver
24187              Big Hero 6
Name: title, dtype: object


## Find your next Movie

3. __Filter__ the Dataset for movies that meet the following conditions:

__Search 1: Science Fiction Action Movie with Bruce Willis (sorted from high to low Rating)__

In [21]:
# mask1 matches Science Fiction only; Action is not required by this filter
mask1 = df['genres'].str.contains('Science Fiction', na=False)
mask2 = df['cast'].str.contains('Bruce Willis', na=False)
df_bruce_sciefi = df.loc[mask1 & mask2].sort_values(by='vote_average', ascending=False)
df_bruce_sciefi.head()

Unnamed: 0,id,title,tagline,release_date,genres,belongs_to_collection,original_language,budget_musd,revenue_musd,production_companies,...,runtime,overview,spoken_languages,poster_path,cast,cast_size,crew_size,director,profit,Return_On_Investment
31,63,Twelve Monkeys,The future is history.,29-12-1995,Science Fiction|Thriller|Mystery,,en,29.5,168.84,Universal Pictures|Atlas Entertainment|Classico,...,129.0,"In the year 2035, convict James Cole reluctant...",English|Français,<img src='http://image.tmdb.org/t/p/w185//2F9K...,Bruce Willis|Madeleine Stowe|Brad Pitt|Christo...,65,151,Terry Gilliam,139.34,5.72339
1448,18,The Fifth Element,There is no future without it.,07-05-1997,Adventure|Fantasy|Action|Thriller|Science Fiction,,en,90.0,263.92018,Columbia Pictures|Gaumont,...,126.0,"In 2257, a taxi driver is unintentionally give...",English|svenska|Deutsch,<img src='http://image.tmdb.org/t/p/w185//fPtl...,Bruce Willis|Gary Oldman|Ian Holm|Milla Jovovi...,114,134,Luc Besson,173.92018,2.932446
3836,9741,Unbreakable,Some things are only revealed by accident.,13-11-2000,Science Fiction|Thriller|Drama,,en,75.0,248.118121,Limited Edition Productions Inc.|Touchstone Pi...,...,106.0,An ordinary man makes an extraordinary discove...,English,<img src='http://image.tmdb.org/t/p/w185//kXkV...,Bruce Willis|Samuel L. Jackson|Robin Wright|Sp...,40,56,M. Night Shyamalan,173.118121,3.308242
19218,59967,Looper,"Hunted By Your Future, Haunted By Your Past",26-09-2012,Action|Thriller|Science Fiction,,en,30.0,47.042,Endgame Entertainment|FilmDistrict|DMG Enterta...,...,118.0,"In the futuristic action thriller Looper, time...",English,<img src='http://image.tmdb.org/t/p/w185//sNjL...,Joseph Gordon-Levitt|Bruce Willis|Emily Blunt|...,34,42,Rian Johnson,17.042,1.568067
1786,95,Armageddon,The Earth's Darkest Day Will Be Man's Finest Hour,01-07-1998,Action|Thriller|Science Fiction|Adventure,,en,140.0,553.799566,Jerry Bruckheimer Films|Touchstone Pictures|Va...,...,151.0,When an asteroid threatens to collide with Ear...,English|Pусский,<img src='http://image.tmdb.org/t/p/w185//fMtO...,Bruce Willis|Billy Bob Thornton|Ben Affleck|Li...,67,108,Michael Bay,413.799566,3.955711
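
The result above includes titles such as Twelve Monkeys and Unbreakable whose genres do not contain Action, because only Science Fiction is tested. A minimal sketch of a filter that requires both genres, on made-up rows (`toy` and its film names are hypothetical):

```python
import pandas as pd

# Hypothetical toy data, not taken from the movies dataset.
toy = pd.DataFrame({
    "title": ["Film A", "Film B", "Film C", "Film D", "Film E"],
    "genres": [
        "Action|Science Fiction",
        "Science Fiction|Drama",
        "Action|Comedy",
        None,
        "Science Fiction|Action",
    ],
    "cast": ["Bruce Willis|Someone", "Bruce Willis", "Bruce Willis",
             "Bruce Willis", "Someone|Bruce Willis"],
    "vote_average": [6.0, 8.0, 7.0, 5.0, 7.5],
})

# One boolean mask per condition; na=False treats missing values as "no match".
is_scifi = toy["genres"].str.contains("Science Fiction", na=False)
is_action = toy["genres"].str.contains("Action", na=False)
has_willis = toy["cast"].str.contains("Bruce Willis", na=False)

result = toy.loc[is_scifi & is_action & has_willis].sort_values(
    by="vote_average", ascending=False
)
print(result[["title", "genres", "vote_average"]])
```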


__Search 2: Movies with Uma Thurman and directed by Quentin Tarantino (sorted from short to long runtime)__

In [22]:
# Note: this selects movies directed by either person; the cast column is not checked
df_Uma_Quentin = df.loc[
    (df['director'] == 'Uma Thurman') | (df['director'] == 'Quentin Tarantino')
].sort_values(by='runtime')
df_Uma_Quentin.head()

Unnamed: 0,id,title,tagline,release_date,genres,belongs_to_collection,original_language,budget_musd,revenue_musd,production_companies,...,runtime,overview,spoken_languages,poster_path,cast,cast_size,crew_size,director,profit,Return_On_Investment
40605,10614,My Best Friend's Birthday,"Your ass is grass, and I'm the lawnmower.",01-01-1987,Comedy,,en,,,,...,36.0,My Best Friend's Birthday is a partially lost ...,English,<img src='http://image.tmdb.org/t/p/w185//8OQ8...,Quentin Tarantino|Allen Garfield|Craig Hamann|...,11,12,Quentin Tarantino,,
1047,500,Reservoir Dogs,Every dog has his day.,02-09-1992,Crime|Thriller,,en,1.2,14.661007,Live Entertainment|Lionsgate|Dog Eat Dog Produ...,...,99.0,A botched robbery indicates a police informant...,English,<img src='http://image.tmdb.org/t/p/w185//g7sp...,Harvey Keitel|Tim Roth|Michael Madsen|Chris Pe...,15,63,Quentin Tarantino,13.461007,12.217506
6667,24,Kill Bill: Vol. 1,Go for the kill.,10-10-2003,Action|Crime,Kill Bill Collection,en,30.0,180.949,Miramax Films|A Band Apart|Super Cool ManChu,...,111.0,An assassin is shot at the altar by her ruthle...,English|日本語|Français,<img src='http://image.tmdb.org/t/p/w185//v7Ta...,Uma Thurman|Lucy Liu|Vivica A. Fox|Daryl Hanna...,36,161,Quentin Tarantino,150.949,6.031633
11785,1991,Death Proof,A crash course in revenge,21-05-2007,Action|Thriller,Grindhouse Collection,en,25.0,25.037897,The Weinstein Company|Dimension Films|Troublem...,...,113.0,"Austin's hottest DJ, Jungle Julia, sets out in...",English,<img src='http://image.tmdb.org/t/p/w185//iy60...,Kurt Russell|Zoë Bell|Rosario Dawson|Vanessa F...,19,60,Quentin Tarantino,0.037897,1.001516
7208,393,Kill Bill: Vol. 2,The bride is back for the final cut.,16-04-2004,Action|Crime|Thriller,Kill Bill Collection,en,30.0,152.159461,Miramax Films|A Band Apart|Super Cool ManChu,...,136.0,The Bride unwaveringly continues on her roarin...,English|普通话|Español|广州话 / 廣州話,<img src='http://image.tmdb.org/t/p/w185//2yhg...,Uma Thurman|David Carradine|Daryl Hannah|Micha...,27,130,Quentin Tarantino,122.159461,5.071982
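
The output above contains movies without Uma Thurman in the cast (for example Reservoir Dogs), since the filter only looks at the `director` column. A sketch of the search as worded in the task, assuming the actor should be matched in `cast` and the director in `director`, on hypothetical rows:

```python
import pandas as pd

# Hypothetical toy data, not taken from the movies dataset.
toy = pd.DataFrame({
    "title": ["Film A", "Film B", "Film C", "Film D"],
    "cast": ["Uma Thurman|X", "Y", "Uma Thurman", "Z|Uma Thurman"],
    "director": ["Quentin Tarantino", "Quentin Tarantino",
                 "Someone Else", "Quentin Tarantino"],
    "runtime": [136.0, 99.0, 100.0, 111.0],
})

# Actor must appear in the cast AND the director must match exactly.
has_uma = toy["cast"].str.contains("Uma Thurman", na=False)
by_quentin = toy["director"] == "Quentin Tarantino"

result = toy.loc[has_uma & by_quentin].sort_values(by="runtime")
print(result[["title", "runtime"]])
```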


__Search 3: Most Successful Pixar Studio Movies between 2010 and 2015 (sorted from high to low Revenue)__

In [23]:
# Note: release_date holds DD-MM-YYYY strings (see In [8]), so these date comparisons are lexicographic
df_most_pixar = df.loc[
    (df['production_companies'].str.contains('Pixar', na=False))
    & (df['release_date'] > '01-01-2010')
    & (df['release_date'] < '31-12-2015')
].sort_values(by='revenue_musd', ascending=False)
df_most_pixar.head()

Unnamed: 0,id,title,tagline,release_date,genres,belongs_to_collection,original_language,budget_musd,revenue_musd,production_companies,...,runtime,overview,spoken_languages,poster_path,cast,cast_size,crew_size,director,profit,Return_On_Investment
15236,10193,Toy Story 3,No toy gets left behind.,16-06-2010,Animation|Family|Comedy,Toy Story Collection,en,200.0,1066.969703,Walt Disney Pictures|Pixar Animation Studios,...,103.0,"Woody, Buzz, and the rest of Andy's toys haven...",English|Español,<img src='http://image.tmdb.org/t/p/w185//amY0...,Tom Hanks|Tim Allen|Ned Beatty|Joan Cusack|Mic...,45,38,Lee Unkrich,866.969703,5.334849
37611,127380,Finding Dory,An unforgettable journey she probably won't re...,16-06-2016,Adventure|Animation|Comedy|Family,Finding Nemo Collection,en,200.0,1028.570889,Walt Disney Pictures|Pixar Animation Studios|H...,...,97.0,Dory is reunited with her friends Nemo and Mar...,English,<img src='http://image.tmdb.org/t/p/w185//iLDP...,Ellen DeGeneres|Albert Brooks|Hayden Rolence|E...,31,39,Andrew Stanton,828.570889,5.142854
6179,12,Finding Nemo,There are 3.7 trillion fish in the ocean. They...,30-05-2003,Animation|Family,Finding Nemo Collection,en,94.0,940.335536,Pixar Animation Studios,...,100.0,"Nemo, an adventurous young clownfish, is unexp...",English,<img src='http://image.tmdb.org/t/p/w185//n5eT...,Albert Brooks|Ellen DeGeneres|Alexander Gould|...,24,104,Andrew Stanton,846.335536,10.00357
29957,150540,Inside Out,Meet the little voices inside your head.,09-06-2015,Drama|Comedy|Animation|Family,,en,175.0,857.611174,Walt Disney Pictures|Pixar Animation Studios,...,94.0,"Growing up can be a bumpy road, and it's no ex...",English,<img src='http://image.tmdb.org/t/p/w185//lRHE...,Amy Poehler|Phyllis Smith|Richard Kind|Bill Ha...,65,50,Pete Docter,682.611174,4.900635
20888,62211,Monsters University,School never looked this scary.,20-06-2013,Animation|Family,"Monsters, Inc. Collection",en,200.0,743.559607,Walt Disney Pictures|Pixar Animation Studios,...,104.0,A look at the relationship between Mike and Su...,English,<img src='http://image.tmdb.org/t/p/w185//tyHH...,Billy Crystal|John Goodman|Steve Buscemi|Helen...,24,13,Dan Scanlon,543.559607,3.717798
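
The output above includes Finding Dory (2016) and Finding Nemo (2003). The reason is that `release_date` was converted to `DD-MM-YYYY` strings, so `>` and `<` compare text character by character instead of comparing dates. A minimal sketch of the same search with a real datetime column follows. Titles, dates and revenues are copied from the output above; the `production_companies` values are simplified for illustration:

```python
import pandas as pd

# String comparison is lexicographic: "1..." sorts before "3...".
print("16-06-2016" < "31-12-2015")

toy = pd.DataFrame({
    "title": ["Toy Story 3", "Finding Dory", "Finding Nemo", "Inside Out"],
    "release_date": pd.to_datetime(
        ["2010-06-16", "2016-06-16", "2003-05-30", "2015-06-09"]
    ),
    "revenue_musd": [1066.969703, 1028.570889, 940.335536, 857.611174],
    # Simplified for the sketch; the real column lists several companies.
    "production_companies": ["Pixar Animation Studios"] * 4,
})

is_pixar = toy["production_companies"].str.contains("Pixar", na=False)
# between() is inclusive on both ends by default.
in_range = toy["release_date"].between(
    pd.Timestamp("2010-01-01"), pd.Timestamp("2015-12-31")
)

result = toy.loc[is_pixar & in_range].sort_values(
    by="revenue_musd", ascending=False
)
print(result[["title", "release_date", "revenue_musd"]])
```

One way to keep both behaviours is to leave `release_date` as datetime for filtering and sorting, and apply `dt.strftime` only when displaying results.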


__Search 4: Action or Thriller Movie with original language English and minimum Rating of 7.5 (most recent movies first)__

In [24]:
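# Action or Thriller, rating of at least 7.5, original language English
mask1 = df['genres'].str.contains('Action', na=False) | df['genres'].str.contains('Thriller', na=False)
mask2 = df['vote_average'] >= 7.5
mask3 = df['original_language'] == 'en'
# Rows keep their original order; no sort by release date is applied here
df_act_thril = df.loc[mask1 & mask2 & mask3]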
df_act_thril.head()

Unnamed: 0,id,title,tagline,release_date,genres,belongs_to_collection,original_language,budget_musd,revenue_musd,production_companies,...,runtime,overview,spoken_languages,poster_path,cast,cast_size,crew_size,director,profit,Return_On_Investment
5,949,Heat,A Los Angeles Crime Saga,15-12-1995,Action|Crime|Drama|Thriller,,en,60.0,187.436818,Regency Enterprises|Forward Pass|Warner Bros.,...,170.0,"Obsessive master thief, Neil McCauley leads a ...",English|Español,<img src='http://image.tmdb.org/t/p/w185//lbf2...,Al Pacino|Robert De Niro|Val Kilmer|Jon Voight...,65,71,Michael Mann,127.436818,3.123947
46,807,Se7en,Seven deadly sins. Seven ways to die.,22-09-1995,Crime|Mystery|Thriller,,en,33.0,327.311859,New Line Cinema|Juno Pix|Cecchi Gori Pictures,...,127.0,Two homicide detectives are on a desperate hun...,English,<img src='http://image.tmdb.org/t/p/w185//GQP6...,Brad Pitt|Morgan Freeman|Gwyneth Paltrow|John ...,49,91,David Fincher,294.311859,9.918541
49,629,The Usual Suspects,Five Criminals. One Line Up. No Coincidence.,19-07-1995,Drama|Crime|Thriller,,en,6.0,23.341568,Blue Parrot Productions|Bad Hat Harry Productions,...,106.0,"Held in an L.A. interrogation room, Verbal Kin...",Español|English|Français|Magyar,<img src='http://image.tmdb.org/t/p/w185//6Dr8...,Stephen Baldwin|Gabriel Byrne|Chazz Palminteri...,18,64,Bryan Singer,17.341568,3.890261
108,197,Braveheart,Every man dies. Not every man truly lives.,24-05-1995,Action|Drama|History|War,,en,72.0,210.0,Icon Entertainment International|The Ladd Comp...,...,177.0,"Enraged at the slaughter of Murron, his new br...",English|Français|Latin|,<img src='http://image.tmdb.org/t/p/w185//or1g...,Mel Gibson|Catherine McCormack|Sophie Marceau|...,55,67,Mel Gibson,138.0,2.916667
255,11,Star Wars,"A long time ago in a galaxy far, far away...",25-05-1977,Adventure|Action|Science Fiction,Star Wars Collection,en,11.0,775.398007,Lucasfilm|Twentieth Century Fox Film Corporation,...,121.0,Princess Leia is captured and held hostage by ...,English,<img src='http://image.tmdb.org/t/p/w185//6FfC...,Mark Hamill|Harrison Ford|Carrie Fisher|Peter ...,106,20,George Lucas,764.398007,70.490728
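
The task asks for the most recent movies first, but the result above is in index order. A sketch of the missing sort step on hypothetical rows; it assumes `release_date` is a datetime column, since sorting `DD-MM-YYYY` strings would order by day of month rather than chronologically:

```python
import pandas as pd

# Hypothetical toy data, not taken from the movies dataset.
toy = pd.DataFrame({
    "title": ["Film A", "Film B", "Film C", "Film D", "Film E"],
    "genres": ["Action|Drama", "Thriller", "Action", "Comedy", "Crime|Thriller"],
    "original_language": ["en", "en", "fr", "en", "en"],
    "vote_average": [8.0, 7.5, 9.0, 8.5, 7.4],
    "release_date": pd.to_datetime(
        ["1995-12-15", "2014-10-01", "2016-01-01", "2012-01-01", "2017-01-01"]
    ),
})

# str.contains treats the pattern as a regex by default, so "|" means "or".
is_action_or_thriller = toy["genres"].str.contains("Action|Thriller", na=False)
is_well_rated = toy["vote_average"] >= 7.5
is_english = toy["original_language"] == "en"

result = toy.loc[is_action_or_thriller & is_well_rated & is_english].sort_values(
    by="release_date", ascending=False
)
print(result[["title", "release_date", "vote_average"]])
```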


## Are Franchises more successful?

4. __Analyze__ the Dataset and __find out whether Franchises (Movies that belong to a collection) are more successful than stand-alone movies__ in terms of:

- mean revenue
- median Return on Investment
- mean budget raised
- mean popularity
- mean rating

In [25]:
df.columns

Index(['id', 'title', 'tagline', 'release_date', 'genres',
       'belongs_to_collection', 'original_language', 'budget_musd',
       'revenue_musd', 'production_companies', 'production_countries',
       'vote_count', 'vote_average', 'popularity', 'runtime', 'overview',
       'spoken_languages', 'poster_path', 'cast', 'cast_size', 'crew_size',
       'director', 'profit', 'Return_On_Investment'],
      dtype='object')

hint: use groupby()

* __Franchise vs. Stand-alone: Average Revenue__
* __Franchise vs. Stand-alone: Return on Investment / Profitability (median)__
* __Franchise vs. Stand-alone: Average Budget__
* __Franchise vs. Stand-alone: Average Popularity__
* __Franchise vs. Stand-alone: Average Rating__

In [26]:
# Assume that a missing belongs_to_collection value means the movie is not part of a franchise
df['Franchise'] = df['belongs_to_collection'].notna()

In [27]:
aggregated_df = df.groupby('Franchise').agg({
    'revenue_musd': 'mean',
    'Return_On_Investment': 'median',
    'budget_musd': 'median',
    'popularity': 'mean',
    'vote_count': 'mean',
})
column_mapping = {
    'revenue_musd': 'Avg_Revenue',
    'Return_On_Investment': 'Median_ROI',
    'budget_musd': 'Median_Budget',
    'popularity': 'Avg_Popularity',
    'vote_count': 'Avg_Vote_Count'
}
aggregated_df.rename(columns=column_mapping,inplace=True)
aggregated_df

Franchise,Avg_Revenue,Median_ROI,Median_Budget,Avg_Popularity,Avg_Vote_Count
False,44.742814,1.619699,7.037519,2.592726,78.28955
True,165.708193,3.709195,16.0,6.245051,412.387856
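
The table above reports the median budget and the mean vote count, whereas the task list asks for the mean budget and the mean rating. A sketch of the aggregation as listed in the task, using pandas named aggregation on made-up numbers (the output column names `Avg_Budget` and `Avg_Rating` are chosen here for illustration):

```python
import pandas as pd

# Hypothetical toy data, not taken from the movies dataset.
toy = pd.DataFrame({
    "belongs_to_collection": ["X Collection", "X Collection", None, None],
    "revenue_musd": [100.0, 300.0, 10.0, 30.0],
    "budget_musd": [50.0, 100.0, 10.0, 10.0],
    "popularity": [10.0, 20.0, 1.0, 3.0],
    "vote_average": [7.0, 8.0, 5.0, 6.0],
})
toy["Return_On_Investment"] = toy["revenue_musd"] / toy["budget_musd"]
toy["Franchise"] = toy["belongs_to_collection"].notna()

# Named aggregation: new_column=(source_column, aggregation_function).
summary = toy.groupby("Franchise").agg(
    Avg_Revenue=("revenue_musd", "mean"),
    Median_ROI=("Return_On_Investment", "median"),
    Avg_Budget=("budget_musd", "mean"),
    Avg_Popularity=("popularity", "mean"),
    Avg_Rating=("vote_average", "mean"),
)
print(summary)
```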


## Most Successful Franchises

5. __Find__ the __most successful Franchises__ in terms of

- __total number of movies__
- __total & mean budget__
- __total & mean revenue__
- __mean rating__

In [28]:
df_franchise = df[df['Franchise']]

In [29]:
df_franchise.groupby('belongs_to_collection')['title'].count().sort_values(ascending = False).head(10)

belongs_to_collection
The Bowery Boys                           29
Totò Collection                           27
James Bond Collection                     26
Zatôichi: The Blind Swordsman             26
The Carry On Collection                   25
Charlie Chan (Sidney Toler) Collection    21
Pokémon Collection                        20
Godzilla (Showa) Collection               16
Dragon Ball Z (Movie) Collection          15
Charlie Chan (Warner Oland) Collection    15
Name: title, dtype: int64

In [30]:
franchises = df_franchise.groupby('belongs_to_collection').agg({'budget_musd': ['sum','mean'], 'title':'count','popularity':'mean'})
franchises

belongs_to_collection,budget_musd (sum),budget_musd (mean),title (count),popularity (mean)
... Has Fallen Collection,130.0,65.0,2,13.008090
00 Schneider Filmreihe,0.0,,1,1.931299
08/15 Collection,0.0,,1,0.628127
100 Girls Collection,0.0,,2,3.078652
101 Dalmatians (Animated) Collection,4.0,4.0,2,13.063212
...,...,...,...,...
Сказки Чуковского,0.0,,1,0.731334
Чебурашка и крокодил Гена,0.0,,1,0.881504
Что Творят мужчины! (Коллекция),2.0,2.0,2,1.299605
男はつらいよ シリーズ,0.0,,3,0.038901


In [31]:
franchises.nlargest(20,('title','count'))

belongs_to_collection,budget_musd (sum),budget_musd (mean),title (count),popularity (mean)
The Bowery Boys,0.0,,29,0.198006
Totò Collection,0.0,,27,1.047324
James Bond Collection,1539.65,59.217308,26,13.453502
Zatôichi: The Blind Swordsman,0.0,,26,1.100603
The Carry On Collection,0.0,,25,3.22043
Charlie Chan (Sidney Toler) Collection,0.0,,21,0.630996
Pokémon Collection,250.7204,50.14408,20,4.374284
Godzilla (Showa) Collection,2.81,0.562,16,3.422842
Charlie Chan (Warner Oland) Collection,0.0,,15,0.664986
Dragon Ball Z (Movie) Collection,5.0,5.0,15,6.183563


In [32]:
franchises.nlargest(20,('popularity','mean'))

belongs_to_collection,budget_musd (sum),budget_musd (mean),title (count),popularity (mean)
Wonder Woman Collection,149.0,149.0,1,294.337037
Deadpool Collection,58.0,58.0,1,187.860492
Avatar Collection,237.0,237.0,1,185.070892
Guardians of the Galaxy Collection,370.0,185.0,2,119.311296
John Wick Collection,60.0,30.0,2,116.558939
Despicable Me Collection,299.0,74.75,6,106.715961
Blade Runner Collection,28.0,28.0,1,96.272374
Planet of the Apes (Reboot) Collection,415.0,138.333333,3,87.720253
The Avengers Collection,500.0,250.0,2,63.633534
Captain America Collection,560.0,186.666667,3,61.307807


In [33]:
franchises.nlargest(20,('budget_musd','sum'))

belongs_to_collection,budget_musd (sum),budget_musd (mean),title (count),popularity (mean)
James Bond Collection,1539.65,59.217308,26,13.453502
Harry Potter Collection,1280.0,160.0,8,26.253893
Pirates of the Caribbean Collection,1250.0,250.0,5,53.972237
The Fast and the Furious Collection,1009.0,126.125,8,10.799435
X-Men Collection,983.0,163.833333,6,9.707541
Transformers Collection,965.0,193.0,5,14.425077
Star Wars Collection,854.35,106.79375,8,23.410919
The Hobbit Collection,750.0,250.0,3,25.205614
The Terminator Collection,661.4,132.28,5,21.893972
Mission: Impossible Collection,650.0,130.0,5,16.50971


## Most Successful Directors

6. __Find__ the __most successful Directors__ in terms of

- __total number of movies__
- __total revenue__
- __mean rating__

In [34]:
df_director = df.groupby('director').agg({'title':'count','revenue_musd':'sum','vote_average':'mean'})

In [35]:
df_director.nlargest(20, 'title')

director,title,revenue_musd,vote_average
John Ford,66,85.170757,6.381818
Michael Curtiz,65,37.8175,5.998246
Werner Herzog,54,24.57258,6.805556
Alfred Hitchcock,53,250.107584,6.639623
Georges Méliès,49,0.0,5.934694
Woody Allen,49,993.970588,6.691837
Jean-Luc Godard,46,0.867433,6.804348
Sidney Lumet,46,294.522734,6.576744
Charlie Chaplin,44,26.519181,6.540909
Raoul Walsh,43,1.21388,6.004762


In [36]:
df_director.nlargest(20, 'revenue_musd')

director,title,revenue_musd,vote_average
Steven Spielberg,33,9256.621422,6.893939
Peter Jackson,13,6528.244659,7.138462
Michael Bay,13,6437.466781,6.392308
James Cameron,11,5900.61031,6.927273
David Yates,9,5334.563196,6.7
Christopher Nolan,11,4747.408665,7.618182
Robert Zemeckis,19,4138.233542,6.794737
Tim Burton,21,4032.916124,6.733333
Ridley Scott,24,3917.52924,6.604167
Chris Columbus,15,3866.836869,6.44


In [37]:
df_director.nlargest(20, 'vote_average')

director,title,revenue_musd,vote_average
A.W. Vidmer,1,0.0,10.0
Amy Schatz,1,0.0,10.0
Ana Poliak,1,0.0,10.0
Andrew Bowser,1,0.0,10.0
Andrew Napier,1,0.0,10.0
Antonis Sotiropoulos,1,0.0,10.0
Barry Bruce,1,0.0,10.0
Brandon Chesbro,1,0.0,10.0
Brett M. Butler,1,0.0,10.0
Brian Skeet,1,0.0,10.0
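
The rating ranking above is dominated by directors with a single movie rated 10.0. One possible refinement is to require a minimum number of movies per director before ranking by mean rating. The sketch below uses made-up data; the threshold `min_movies` and its value are assumptions, not values from the original notebook:

```python
import pandas as pd

# Hypothetical toy data, not taken from the movies dataset.
toy = pd.DataFrame({
    "director": ["Dir A", "Dir A", "Dir A", "Dir B", "Dir C", "Dir C"],
    "title": ["A1", "A2", "A3", "B1", "C1", "C2"],
    "revenue_musd": [10.0, 20.0, 30.0, 0.0, 5.0, 5.0],
    "vote_average": [7.0, 8.0, 9.0, 10.0, 6.0, 7.0],
})

director_stats = toy.groupby("director").agg(
    n_movies=("title", "count"),
    total_revenue=("revenue_musd", "sum"),
    mean_rating=("vote_average", "mean"),
)

# Assumed threshold: only rank directors with at least this many movies.
min_movies = 2
ranked = director_stats[director_stats["n_movies"] >= min_movies].nlargest(
    2, "mean_rating"
)
print(ranked)
```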
