# Project 1: Explanatory Data Analysis & Data Presentation (Movies Dataset)

# Project Brief for Self-Coders

Here you'll have the opportunity to code major parts of Project 1 on your own. If you need help or inspiration, have a look at the videos or the Jupyter Notebook with the full code.

Keep in mind that it's all about __getting the right results/conclusions__, not about reproducing identical code. Things can be coded in many different ways, so even if you reach the same conclusions, your code is very unlikely to match ours.

## Data Import and first Inspection

1. __Import__ the movies dataset from the CSV file "movies_complete.csv". __Inspect__ the data.

__Some additional information on Features/Columns__:

* **id:** The ID of the movie (clear/unique identifier).
* **title:** The Official Title of the movie.
* **tagline:** The tagline of the movie.
* **release_date:** Theatrical Release Date of the movie.
* **genres:** Genres associated with the movie.
* **belongs_to_collection:** Gives information on the movie series/franchise the particular film belongs to.
* **original_language:** The language in which the movie was originally shot.
* **budget_musd:** The budget of the movie in million dollars.
* **revenue_musd:** The total revenue of the movie in million dollars.
* **production_companies:** Production companies involved with the making of the movie.
* **production_countries:** Countries where the movie was shot/produced.
* **vote_count:** The number of votes by users, as counted by TMDB.
* **vote_average:** The average rating of the movie.
* **popularity:** The Popularity Score assigned by TMDB.
* **runtime:** The runtime of the movie in minutes.
* **overview:** A brief blurb of the movie.
* **spoken_languages:** Spoken languages in the film.
* **poster_path:** The URL of the poster image.
* **cast:** (Main) Actors appearing in the movie.
* **cast_size:** Number of actors appearing in the movie.
* **director:** Director of the movie.
* **crew_size:** Size of the film crew (incl. director, excl. actors).

In [71]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from IPython.display import HTML
#pd.options.display.max_colwidth = 200
df = pd.read_csv('movies_complete.csv', parse_dates=['release_date'])

## The best and the worst movies...

2. __Filter__ the Dataset and __find the best/worst n Movies__ with the

- Highest Revenue
- Highest Budget
- Highest Profit (=Revenue - Budget)
- Lowest Profit (=Revenue - Budget)
- Highest Return on Investment (=Revenue / Budget) (only movies with a Budget of at least 10 million USD)
- Lowest Return on Investment (=Revenue / Budget) (only movies with a Budget of at least 10 million USD)
- Highest number of Votes
- Highest Rating (only movies with 10 or more Ratings)
- Lowest Rating (only movies with 10 or more Ratings)
- Highest Popularity

__Define__ an appropriate __user-defined function__ to reuse code.
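One possible shape for such a function is sketched below. The name `best_worst` and its parameters `min_budget` and `min_votes` are hypothetical and not taken from the notebook, and the small `toy` frame is made-up data that only demonstrates the call pattern. It assumes the column names listed in the feature overview above.

```python
import pandas as pd

def best_worst(df, by, n=5, ascending=False, min_budget=0, min_votes=0):
    """Return the n best (or worst) movies ranked by column `by`.

    Hypothetical helper, not part of the original notebook.
    """
    # Keep only movies that pass the optional thresholds; with the defaults
    # of 0 no row is dropped, including rows with a missing budget or votes.
    subset = df[df["budget_musd"].fillna(0).ge(min_budget)
                & df["vote_count"].fillna(0).ge(min_votes)]
    return (subset[["id", "title", by]]
            .sort_values(by, ascending=ascending)
            .head(n))

# Tiny made-up sample, only to show the call pattern.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "title": ["A", "B", "C", "D"],
    "budget_musd": [5.0, 20.0, 50.0, None],
    "revenue_musd": [100.0, 10.0, 200.0, 30.0],
    "vote_count": [3, 50, 500, 12],
    "vote_average": [10.0, 4.0, 8.0, 6.5],
})
toy["roi"] = toy["revenue_musd"] / toy["budget_musd"]

# Highest ROI among movies with a budget of at least 10 (million USD).
top_roi = best_worst(toy, "roi", n=2, min_budget=10)

# Highest rating among movies with at least 10 votes.
top_rated = best_worst(toy, "vote_average", n=2, min_votes=10)
```

With a helper like this, each of the rankings listed above becomes a single call that differs only in `by`, `ascending`, and the thresholds.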

__Movies Top 5 - Highest Revenue__

In [5]:
df.loc[:,['id', 'title', 'revenue_musd']].sort_values('revenue_musd', ascending=False).head(5)

Unnamed: 0,id,title,revenue_musd
14448,19995,Avatar,2787.965087
26265,140607,Star Wars: The Force Awakens,2068.223624
1620,597,Titanic,1845.034188
17669,24428,The Avengers,1519.55791
24812,135397,Jurassic World,1513.52881


__Movies Top 5 - Highest Budget__

In [6]:
df.loc[:,['id', 'title', 'budget_musd']].sort_values('budget_musd', ascending=False).head(5)

Unnamed: 0,id,title,budget_musd
16986,1865,Pirates of the Caribbean: On Stranger Tides,380.0
11743,285,Pirates of the Caribbean: At World's End,300.0
26268,99861,Avengers: Age of Ultron,280.0
10985,1452,Superman Returns,270.0
18517,49529,John Carter,260.0


__Movies Top 5 - Highest Profit__

In [7]:
df['profit_musd'] = df['revenue_musd'] - df['budget_musd']
df.loc[:,['id', 'title', 'profit_musd']].sort_values('profit_musd', ascending=False).head(5)

Unnamed: 0,id,title,profit_musd
14448,19995,Avatar,2550.965087
26265,140607,Star Wars: The Force Awakens,1823.223624
1620,597,Titanic,1645.034188
24812,135397,Jurassic World,1363.52881
28501,168259,Furious 7,1316.24936


__Movies Top 5 - Lowest Profit__

In [8]:
df.loc[:,['id', 'title', 'profit_musd']].sort_values('profit_musd', ascending=True).head(5)

Unnamed: 0,id,title,profit_musd
20959,57201,The Lone Ranger,-165.71009
7164,10733,The Alamo,-119.180039
16659,50321,Mars Needs Moms,-111.007242
43611,339964,Valerian and the City of a Thousand Planets,-107.447384
2684,1911,The 13th Warrior,-98.301101


__Movies Top 5 - Highest ROI__

In [96]:
df['roi'] = df['revenue_musd'] / df['budget_musd']
df[df['budget_musd'] >= 10].loc[:,['id', 'title', 'roi']].sort_values('roi', ascending=False).head(5)

Unnamed: 0,id,title,roi
1055,601,E.T. the Extra-Terrestrial,75.520507
255,11,Star Wars,70.490728
588,114,Pretty Woman,33.071429
18300,77338,The Intouchables,32.806221
1144,1891,The Empire Strikes Back,29.911111


__Movies Top 5 - Lowest ROI__

In [10]:
df[df['budget_musd'] >= 10].loc[:,['id', 'title', 'roi']].sort_values('roi', ascending=True).head(5)

Unnamed: 0,id,title,roi
6955,14844,Chasing Liberty,5.217391e-07
8041,18475,The Cookout,7.5e-07
17381,33927,Deadfall,1.8e-06
6678,10944,In the Cut,1.916667e-06
20015,98339,The Samaritan,0.0002100833


__Movies Top 5 - Most Votes__

In [11]:
df.loc[:,['id', 'title', 'vote_count']].sort_values('vote_count', ascending=False).head(5)

Unnamed: 0,id,title,vote_count
15368,27205,Inception,14075.0
12396,155,The Dark Knight,12269.0
14448,19995,Avatar,12114.0
17669,24428,The Avengers,12000.0
26272,293660,Deadpool,11444.0


__Movies Top 5 - Highest Rating__

In [12]:
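# Note: this ranks all movies; the brief's "10 or more ratings" condition would need a filter such as df['vote_count'] >= 10 first
df.loc[:,['id', 'title', 'vote_average']].sort_values('vote_average', ascending=False).head(5)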

Unnamed: 0,id,title,vote_average
36996,162611,Portrait of a Young Man in Three Movements,10.0
33891,143980,Brave Revolutionary,10.0
1615,64562,Other Voices Other Rooms,10.0
35505,211139,The Lion of Thebes,10.0
25882,287299,Katt Williams: Priceless: Afterlife,10.0


__Movies Top 5 - Lowest Rating__

In [13]:
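# Note: this ranks all movies; the brief's "10 or more ratings" condition would need a filter such as df['vote_count'] >= 10 first
df.loc[:,['id', 'title', 'vote_average']].sort_values('vote_average', ascending=True).head(5)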

Unnamed: 0,id,title,vote_average
24522,154515,"Dance, Fools, Dance",0.0
37762,93734,.hack Liminality: In the Case of Mai Minase,0.0
26403,66963,Lucrezia Borgia,0.0
12189,111744,Joe and Max,0.0
20450,281583,The Substitute,0.0


__Movies Top 5 - Most Popular__

In [14]:
df.loc[:,['id', 'title', 'popularity']].sort_values('popularity', ascending=False).head(5)

Unnamed: 0,id,title,popularity
30330,211672,Minions,547.488298
32927,297762,Wonder Woman,294.337037
41556,321612,Beauty and the Beast,287.253654
42940,339403,Baby Driver,228.032744
24187,177572,Big Hero 6,213.849907


## Find your next Movie

3. __Filter__ the Dataset for movies that meet the following conditions:

__Search 1: Science Fiction Action Movie with Bruce Willis (sorted from high to low Rating)__

__Search 2: Movies with Uma Thurman and directed by Quentin Tarantino (sorted from short to long runtime)__

__Search 3: Most Successful Pixar Studio Movies between 2010 and 2015 (sorted from high to low Revenue)__

__Search 4: Action or Thriller Movie with original language English and minimum Rating of 7.5 (most recent movies first)__

In [77]:
# Search 1: genres must contain both 'Science Fiction' and 'Action', cast must contain Bruce Willis
mask1a = df.apply(lambda row: 'Science Fiction' in str(row['genres']).split('|') and 'Action' in str(row['genres']).split('|'), axis=1)
mask1b = df.apply(lambda row: 'Bruce Willis' in str(row['cast']).split('|'), axis=1)
search1 = df[mask1a & mask1b].loc[:, ['poster_path', 'title', 'vote_average']].sort_values('vote_average', ascending=False)

# Search 2: cast must contain Uma Thurman, director must be exactly Quentin Tarantino
mask2a = df.apply(lambda row: 'Uma Thurman' in str(row['cast']).split('|'), axis=1)
mask2b = df['director']=='Quentin Tarantino'
search2 = df[mask2a & mask2b].loc[:, ['poster_path', 'title', 'runtime']].sort_values('runtime', ascending=True)

# Search 3: note that pd.to_datetime('2015') is 2015-01-01, so movies released later in 2015 are excluded
mask3a = df.apply(lambda row: row['release_date'] >= pd.to_datetime('2010') and row['release_date'] <= pd.to_datetime('2015'), axis=1)
mask3b = df.apply(lambda row: 'Pixar Animation Studios' in str(row['production_companies']).split('|'), axis=1)
search3 = df[mask3a & mask3b].loc[:, ['poster_path', 'title', 'revenue_musd']].sort_values('revenue_musd', ascending=False)

# Search 4: str.contains does a substring match on the pipe-separated genres string
mask4a = df['genres'].str.contains('Thriller') | df['genres'].str.contains('Action')
mask4b = df['original_language'] == 'en'
mask4c = df['vote_average'] >= 7.5
search4 = df[mask4a & mask4b & mask4c].loc[:, ['poster_path', 'title', 'vote_average', 'release_date']].sort_values('release_date', ascending=False)
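
The row-wise lambdas above repeat the same split-and-check logic for every search. A minimal sketch of a reusable alternative follows, assuming the pipe-separated format seen in `genres` and `cast`. The helper name `has_all` and the `toy` data are hypothetical. The date filter uses `between()` with an explicit last day of 2015, which is one way to read "between 2010 and 2015" as covering the whole of 2015.

```python
import pandas as pd

# Made-up rows that mimic the pipe-separated columns of the dataset.
toy = pd.DataFrame({
    "title": ["X", "Y", "Z"],
    "genres": ["Action|Science Fiction", "Drama", None],
    "cast": ["Bruce Willis|Milla Jovovich", "Bruce Willis", "Uma Thurman"],
    "release_date": pd.to_datetime(["2010-06-18", "2015-06-19", "2016-01-01"]),
})

def has_all(column, *names):
    """True where the pipe-separated cell contains every exact name."""
    # Split each cell into a set of entries; missing cells are treated as empty.
    entries = column.fillna("").str.split("|").apply(set)
    return entries.apply(lambda s: set(names) <= s)

scifi_action = has_all(toy["genres"], "Science Fiction", "Action")
with_willis = has_all(toy["cast"], "Bruce Willis")

# between() is inclusive on both ends; the upper bound is the last day
# of 2015 so that releases from the whole year are kept.
in_range = toy["release_date"].between(pd.Timestamp("2010-01-01"),
                                       pd.Timestamp("2015-12-31"))

search = toy[scifi_action & with_willis & in_range]
```

Matching on exact entries rather than substrings avoids accidental hits when one name is contained in another.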

In [78]:
HTML(search1.to_html(escape=False))

Unnamed: 0,poster_path,title,vote_average
1448,,The Fifth Element,7.3
19218,,Looper,6.6
1786,,Armageddon,6.5
14135,,Surrogates,5.9
20333,,G.I. Joe: Retaliation,5.4
27619,,Vice,4.1


In [79]:
HTML(search2.to_html(escape=False))

Unnamed: 0,poster_path,title,runtime
6667,,Kill Bill: Vol. 1,111.0
7208,,Kill Bill: Vol. 2,136.0
291,,Pulp Fiction,154.0


In [80]:
HTML(search3.to_html(escape=False))

Unnamed: 0,poster_path,title,revenue_musd
15236,,Toy Story 3,1066.969703
20888,,Monsters University,743.559607
17220,,Cars 2,559.852396
18900,,Brave,538.983207
16392,,Day & Night,
21694,,The Blue Umbrella,
21697,,Toy Story of Terror!,
22489,,La luna,
24252,,Hawaiian Vacation,
24254,,Small Fry,


In [81]:
HTML(search4.to_html(escape=False))

Unnamed: 0,poster_path,title,vote_average,release_date
44490,,Descendants 2,7.5,2017-07-21
43941,,Dunkirk,7.5,2017-07-19
42624,,The Book of Henry,7.6,2017-06-16
26273,,Guardians of the Galaxy Vol. 2,7.6,2017-04-19
43467,,Revengeance,8.0,2017-04-05
44431,,First Round Down,10.0,2017-03-04
41506,,Logan,7.6,2017-02-28
42877,,Tomato Red,8.0,2017-02-24
44447,,Zero 3,8.7,2017-01-27
41622,,The River Thief,9.3,2016-10-14


## Are Franchises more successful?

4. __Analyze__ the Dataset and __find out whether Franchises (Movies that belong to a collection) are more successful than stand-alone movies__ in terms of:

- mean revenue
- median Return on Investment
- mean budget raised
- mean popularity
- mean rating

Hint: use groupby().
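
As a rough illustration of the hint, the sketch below builds a boolean flag from `belongs_to_collection` and computes several statistics in a single `agg()` call. The `toy` frame is made-up data and `summary` is just an illustrative variable name. It assumes an `roi` column has already been derived as Revenue / Budget.

```python
import pandas as pd

# Made-up data: two franchise movies and two stand-alone movies.
toy = pd.DataFrame({
    "belongs_to_collection": ["Saga", "Saga", None, None],
    "revenue_musd": [100.0, 300.0, 20.0, 40.0],
    "roi": [2.0, 4.0, 1.0, 3.0],
    "vote_average": [6.0, 7.0, 5.0, 8.0],
})

# A movie counts as a franchise movie if it belongs to any collection.
toy["Franchise"] = toy["belongs_to_collection"].notna()

# One statistic per column: mean for revenue and rating, median for ROI.
summary = toy.groupby("Franchise").agg({
    "revenue_musd": "mean",
    "roi": "median",
    "vote_average": "mean",
})
```

The result has one row for `False` (stand-alone) and one for `True` (franchise), which makes the side-by-side comparison easy to read.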

__Franchise vs. Stand-alone: Average Revenue__

In [93]:
df['Franchise'] = df['belongs_to_collection'].notna()
gbo = df.groupby('Franchise')
gbo['revenue_musd'].mean()

Franchise
False     44.742814
True     165.708193
Name: revenue_musd, dtype: float64

__Franchise vs. Stand-alone: Return on Investment / Profitability (median)__

In [98]:
gbo['roi'].median()

Franchise
False    1.619699
True     3.709195
Name: roi, dtype: float64

__Franchise vs. Stand-alone: Average Budget__

In [99]:
gbo['budget_musd'].mean()

Franchise
False    18.047741
True     38.319847
Name: budget_musd, dtype: float64

__Franchise vs. Stand-alone: Average Popularity__

In [100]:
gbo['popularity'].mean()

Franchise
False    2.592726
True     6.245051
Name: popularity, dtype: float64

__Franchise vs. Stand-alone: Average Rating__

In [101]:
gbo['vote_average'].mean()

Franchise
False    6.008787
True     5.956806
Name: vote_average, dtype: float64

## Most Successful Franchises

5. __Find__ the __most successful Franchises__ in terms of

- __total number of movies__
- __total & mean budget__
- __total & mean revenue__
- __mean rating__
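
One way to collect all of these measures in a single table is pandas named aggregation, sketched here on made-up data. The output column names (`n_movies`, `total_budget`, and so on) are hypothetical labels chosen for this example.

```python
import pandas as pd

# Made-up data with some missing budget and revenue values.
toy = pd.DataFrame({
    "belongs_to_collection": ["Saga", "Saga", "Duo", "Duo", "Duo"],
    "id": [1, 2, 3, 4, 5],
    "budget_musd": [100.0, 200.0, 10.0, None, 30.0],
    "revenue_musd": [400.0, 600.0, 50.0, 70.0, None],
    "vote_average": [7.0, 8.0, 6.0, 5.0, 7.0],
})

# Each keyword is an output column: (input column, aggregation function).
franchises = toy.groupby("belongs_to_collection").agg(
    n_movies=("id", "count"),
    total_budget=("budget_musd", "sum"),
    mean_budget=("budget_musd", "mean"),
    total_revenue=("revenue_musd", "sum"),
    mean_revenue=("revenue_musd", "mean"),
    mean_rating=("vote_average", "mean"),
)

# Rank franchises by any of the new columns.
top_by_revenue = franchises.nlargest(1, "mean_revenue")
```

Note that `sum` and `mean` skip missing values, so a franchise with only one known budget or revenue figure reports the same number for both.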

In [118]:
gbo2 = df[df['Franchise']].groupby('belongs_to_collection')
print('Number of Movies: ')
print(gbo2['id'].count().sort_values(ascending=False).head(5))
print('---------------------------------')
print('Mean Budget: ')
print(gbo2.agg({'budget_musd': ['sum', 'mean']}).sort_values(by=('budget_musd', 'mean'), ascending=False).head(5))
print('---------------------------------')
print('Mean Revenue: ')
print(gbo2.agg({'revenue_musd': ['sum', 'mean']}).nlargest(5, ('revenue_musd', 'mean')))
print('---------------------------------')
print('Mean Rating: ')
print(gbo2['vote_average'].mean().sort_values(ascending=False).head(5))
print('---------------------------------')

Number of Movies: 
belongs_to_collection
The Bowery Boys                  29
Totò Collection                  27
James Bond Collection            26
Zatôichi: The Blind Swordsman    26
The Carry On Collection          25
Name: id, dtype: int64
---------------------------------
Mean Budget: 
                                    budget_musd       
                                            sum   mean
belongs_to_collection                                 
Tangled Collection                        260.0  260.0
The Avengers Collection                   500.0  250.0
Pirates of the Caribbean Collection      1250.0  250.0
The Hobbit Collection                     750.0  250.0
Man of Steel Collection                   475.0  237.5
---------------------------------
Mean Revenue: 
                        revenue_musd             
                                 sum         mean
belongs_to_collection                            
Avatar Collection        2787.965087  2787.965087
The Avengers Collec

## Most Successful Directors

6. __Find__ the __most successful Directors__ in terms of

- __total number of movies__
- __total revenue__
- __mean rating__

In [124]:
director = df.groupby('director').agg({'title': 'count', 'revenue_musd': 'sum', 'vote_average': 'mean'})
print('Number of Movies: ')
print(director.nlargest(5, 'title'))
print('------------------------------------------------------------------')
print('Total Revenue: ')
print(director.nlargest(5, 'revenue_musd'))
print('------------------------------------------------------------------')
print('Mean Rating: ')
print(director.nlargest(5, 'vote_average'))
print('------------------------------------------------------------------')

Number of Movies: 
                  title  revenue_musd  vote_average
director                                           
John Ford            66     85.170757      6.381818
Michael Curtiz       65     37.817500      5.998246
Werner Herzog        54     24.572580      6.805556
Alfred Hitchcock     53    250.107584      6.639623
Georges Méliès       49      0.000000      5.934694
------------------------------------------------------------------
Total Revenue: 
                  title  revenue_musd  vote_average
director                                           
Steven Spielberg     33   9256.621422      6.893939
Peter Jackson        13   6528.244659      7.138462
Michael Bay          13   6437.466781      6.392308
James Cameron        11   5900.610310      6.927273
David Yates           9   5334.563196      6.700000
------------------------------------------------------------------
Mean Rating: 
               title  revenue_musd  vote_average
director                                

## Most Successful Actors

In [129]:
dfIndexed = df.set_index('id')

In [134]:
actors = dfIndexed['cast'].str.split('|', expand=True).stack().reset_index(1, drop=True).to_frame().rename(columns={0: 'Actor'})
actors = actors.merge(dfIndexed, how='left', left_index=True, right_index=True)
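
The split/stack/merge chain above turns one row per movie into one row per (movie, actor) pair. The sketch below shows the same reshaping on made-up data, using `Series.explode()` as an alternative to `stack()`. The `toy` frame and the `per_actor` name are illustrative only.

```python
import pandas as pd

# Made-up movies indexed by id; the cast is a pipe-separated string.
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "title": ["A", "B", "C"],
    "cast": ["Ann|Bob", "Bob", None],
    "revenue_musd": [100.0, 50.0, 10.0],
}).set_index("id")

# Split the string into a list, then give every list element its own row;
# the movie id in the index is repeated for each actor of that movie.
actors = (toy["cast"].str.split("|")
          .explode()
          .dropna()
          .rename("Actor")
          .to_frame()
          .join(toy[["title", "revenue_musd"]]))

# Aggregate per actor: number of movies and total revenue.
per_actor = actors.groupby("Actor")["revenue_musd"].agg(["count", "sum"])
```

Movies without a cast entry drop out of the long table, so they do not contribute to any actor's statistics.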

In [135]:
actorsGBO = actors.groupby('Actor').agg({'title': 'count', 'revenue_musd': ['sum', 'mean'], 'vote_average': 'mean', 'popularity': 'mean'})

In [136]:
actorsGBO.nlargest(10, ('title', 'count'))

Actor,title (count),revenue_musd (sum),revenue_musd (mean),vote_average (mean),popularity (mean)
Bess Flowers,240,368.913259,14.75653,6.184186,2.030528
Christopher Lee,148,9417.047887,324.725789,5.910204,4.749606
John Wayne,125,236.094,11.242571,5.712097,3.092939
Samuel L. Jackson,122,17109.620672,213.870258,6.266116,11.703945
Michael Caine,110,8053.404838,191.747734,6.269444,8.265272
Gérard Depardieu,109,1247.608953,95.969919,6.053211,3.703836
John Carradine,109,255.839586,19.679968,5.546667,2.43495
Donald Sutherland,108,5390.766679,138.224787,6.233962,7.00323
Jackie Chan,108,4699.185933,146.84956,6.275701,5.862638
Frank Welker,107,13044.15247,326.103812,6.310377,9.571404


In [137]:
actorsGBO.nlargest(10, ('revenue_musd', 'sum'))

Actor,title (count),revenue_musd (sum),revenue_musd (mean),vote_average (mean),popularity (mean)
Stan Lee,48,19414.957555,647.165252,6.513043,29.936175
Samuel L. Jackson,122,17109.620672,213.870258,6.266116,11.703945
Warwick Davis,34,13256.032188,662.801609,6.294118,13.088614
Frank Welker,107,13044.15247,326.103812,6.310377,9.571404
John Ratzenberger,46,12596.126073,449.861645,6.484444,10.959477
Jess Harnell,35,12234.608163,611.730408,6.435294,10.919015
Hugo Weaving,40,11027.578473,459.482436,6.473684,10.96789
Ian McKellen,44,11015.592318,478.938796,6.353488,15.44718
Johnny Depp,69,10653.760641,217.423687,6.44058,12.378196
Alan Rickman,45,10612.625348,353.754178,6.715556,10.399285


In [140]:
actorsGBO[actorsGBO.loc[:, ('title', 'count')] > 10].nlargest(10, ('revenue_musd', 'mean'))

Actor,title (count),revenue_musd (sum),revenue_musd (mean),vote_average (mean),popularity (mean)
Gloria Stuart,18,1845.034188,1845.034188,6.36875,3.477432
Keith Richards,23,2967.713802,989.237934,6.463636,5.032988
James Cameron,12,1862.075059,931.03753,7.063636,4.691718
Matthew Lewis,11,7915.3125,879.479167,7.372727,23.097479
Luke de Woolfson,11,1720.671036,860.335518,5.718182,8.767206
Yuri Lowenthal,17,1708.162716,854.081358,6.188235,19.884649
Dominic Monaghan,11,3289.607607,822.401902,6.045455,10.621675
Peter Mayhew,11,4820.721631,803.453605,6.7,12.303552
Victoria De Mare,12,783.112979,783.112979,5.058333,17.606919
Alex Zahara,17,769.653595,769.653595,5.958824,5.087243


In [141]:
actorsGBO[actorsGBO.loc[:, ('title', 'count')] > 10].nlargest(10, ('vote_average', 'mean'))

Actor,title (count),revenue_musd (sum),revenue_musd (mean),vote_average (mean),popularity (mean)
David Attenborough,11,0.0,,8.27,2.147383
Yo Oizumi,13,511.210189,102.242038,7.723077,7.512642
Şener Şen,16,11.074013,3.691338,7.693333,0.912166
Akira Tani,12,0.327081,0.163541,7.654545,5.041536
Daisuke Katô,19,0.423649,0.141216,7.611111,3.42824
Adile Naşit,15,0.9128,0.4564,7.530769,0.578235
Haruko Sugimura,19,0.0,,7.526316,2.155955
Isao Kimura,11,0.327081,0.163541,7.49,4.129724
Mitsuko Yoshikawa,12,0.0,,7.481818,0.437893
Yûnosuke Itô,12,0.05524,0.05524,7.472727,2.887282


In [142]:
actorsGBO[actorsGBO.loc[:, ('title', 'count')] > 10].nlargest(10, ('popularity', 'mean'))

Actor,title (count),revenue_musd (sum),revenue_musd (mean),vote_average (mean),popularity (mean)
Katy Mixon,12,1519.572457,151.957246,5.841667,51.974337
Terry Notary,11,6947.21137,694.721137,6.472727,51.575849
Mark Smith,11,2195.523957,243.947106,6.545455,40.076962
Jon Hamm,25,3449.345393,191.6303,6.328,39.417351
Gal Gadot,11,5449.53284,495.412076,6.327273,37.385856
Ava Acres,21,6272.35833,482.489102,5.985714,36.260864
Emma Watson,19,9639.203121,535.511284,6.768421,35.965301
Keith Jardine,11,1062.491947,212.498389,5.963636,32.003903
Karen Gillan,12,1834.673013,305.778836,6.783333,31.384562
Wilbur Fitzgerald,12,1539.757207,139.977928,6.525,31.234321
