![example](images/director_shot.jpeg)

# Project Title

**Authors:** Student 1, Student 2, Student 3
***

## Overview

A one-paragraph overview of the project, including the business problem, data, methods, results and recommendations.

## Business Problem

Summary of the business problem you are trying to solve, and the data questions that you plan to answer to solve it.

***
Questions to consider:
* What are the business's pain points related to this project?
* How did you pick the data analysis question(s) that you did?
* Why are these questions important from a business perspective?
***

## Data Understanding

Describe the data being used for this project.
***
Questions to consider:
* Where did the data come from, and how do they relate to the data analysis questions?
* What do the data represent? Who is in the sample and what variables are included?
* What is the target variable?
* What are the properties of the variables you intend to use?
***

In [57]:
# Import standard packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import gzip
import json
import requests
import re
from bs4 import BeautifulSoup
%matplotlib inline

In [58]:
# Supplementary datasets
tn_moviebudgets = pd.read_csv('tn.movie_budgets.csv.gz')
tm_movies = pd.read_csv('tmdb.movies.csv.gz')
# The Rotten Tomatoes files are tab-separated and Latin-1 encoded.
# on_bad_lines='skip' (pandas >= 1.3) replaces the deprecated error_bad_lines=False.
rt_reviews = pd.read_csv('rt.reviews.tsv.gz', compression='gzip',
                         on_bad_lines='skip', sep='\t', encoding='latin-1')
rt_movieinfo = pd.read_csv('rt.movie_info.tsv.gz', compression='gzip',
                           on_bad_lines='skip', sep='\t', encoding='latin-1')
imdb_title_princ = pd.read_csv('imdb.title.principals.csv.gz')
imdb_title_crew = pd.read_csv('imdb.title.crew.csv.gz')
imdb_title_akas = pd.read_csv('imdb.title.akas.csv.gz')
imdb_name_basics = pd.read_csv('imdb.name.basics.csv.gz')

In [59]:
# Main datasets
imdb_title_ratings = pd.read_csv('imdb.title.ratings.csv.gz')
imdb_title_basics = pd.read_csv('imdb.title.basics.csv.gz')
bom_movie_gross = pd.read_csv('bom.movie_gross.csv.gz')

In [60]:
imdb_title_ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 73856 entries, 0 to 73855
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   tconst         73856 non-null  object 
 1   averagerating  73856 non-null  float64
 2   numvotes       73856 non-null  int64  
dtypes: float64(1), int64(1), object(1)
memory usage: 1.7+ MB


In [61]:
imdb_title_ratings.head()

Unnamed: 0,tconst,averagerating,numvotes
0,tt10356526,8.3,31
1,tt10384606,8.9,559
2,tt1042974,6.4,20
3,tt1043726,4.2,50352
4,tt1060240,6.5,21


In [62]:
imdb_title_basics.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 146144 entries, 0 to 146143
Data columns (total 6 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   tconst           146144 non-null  object 
 1   primary_title    146144 non-null  object 
 2   original_title   146123 non-null  object 
 3   start_year       146144 non-null  int64  
 4   runtime_minutes  114405 non-null  float64
 5   genres           140736 non-null  object 
dtypes: float64(1), int64(1), object(4)
memory usage: 6.7+ MB


In [73]:
primary_title = imdb_title_basics['primary_title']

In [74]:
imdb_ratings_joined = imdb_title_basics.merge(imdb_title_ratings, on = 'tconst',  how = 'outer')
imdb_ratings_joined

Unnamed: 0,tconst,primary_title,original_title,start_year,runtime_minutes,genres,averagerating,numvotes
0,tt0063540,Sunghursh,Sunghursh,2013,175.0,"Action,Crime,Drama",7.0,77.0
1,tt0066787,One Day Before the Rainy Season,Ashad Ka Ek Din,2019,114.0,"Biography,Drama",7.2,43.0
2,tt0069049,The Other Side of the Wind,The Other Side of the Wind,2018,122.0,Drama,6.9,4517.0
3,tt0069204,Sabse Bada Sukh,Sabse Bada Sukh,2018,,"Comedy,Drama",6.1,13.0
4,tt0100275,The Wandering Soap Opera,La Telenovela Errante,2017,80.0,"Comedy,Drama,Fantasy",6.5,119.0
...,...,...,...,...,...,...,...,...
146139,tt9916538,Kuambil Lagi Hatiku,Kuambil Lagi Hatiku,2019,123.0,Drama,,
146140,tt9916622,Rodolpho Teóphilo - O Legado de um Pioneiro,Rodolpho Teóphilo - O Legado de um Pioneiro,2015,,Documentary,,
146141,tt9916706,Dankyavar Danka,Dankyavar Danka,2013,,Comedy,,
146142,tt9916730,6 Gunn,6 Gunn,2017,116.0,,,
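
The outer merge above keeps titles that have no rating, which is why the last rows show empty `averagerating` and `numvotes`. A minimal sketch of how the merge coverage could be checked, using small toy frames in place of `imdb_title_basics` and `imdb_title_ratings` (the `tt001`-style IDs and values are illustrative, not taken from the data):

```python
import pandas as pd

# Toy stand-ins for imdb_title_basics and imdb_title_ratings (illustrative values only)
basics = pd.DataFrame({
    "tconst": ["tt001", "tt002", "tt003"],
    "primary_title": ["A", "B", "C"],
})
ratings = pd.DataFrame({
    "tconst": ["tt001", "tt003"],
    "averagerating": [7.0, 6.5],
    "numvotes": [77, 119],
})

# indicator=True adds a `_merge` column recording where each row came from:
# "both", "left_only" (title without rating) or "right_only" (rating without title)
joined = basics.merge(ratings, on="tconst", how="outer", indicator=True)
merge_counts = joined["_merge"].value_counts()
print(merge_counts)

# Keep only titles that have a rating, then drop the helper column
rated_only = joined[joined["_merge"] == "both"].drop(columns="_merge")
print(rated_only)
```

If only rated titles matter for the analysis, `how='inner'` would give the same rows as `rated_only` directly.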


In [75]:
bom_movie_gross.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3387 entries, 0 to 3386
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   title           3387 non-null   object 
 1   studio          3382 non-null   object 
 2   domestic_gross  3359 non-null   float64
 3   foreign_gross   2037 non-null   object 
 4   year            3387 non-null   int64  
dtypes: float64(1), int64(1), object(3)
memory usage: 132.4+ KB


In [66]:
bom_movie_gross.sort_values(by='year', ascending=False)[0:50]

Unnamed: 0,title,studio,domestic_gross,foreign_gross,year
3386,An Actor Prepares,Grav.,1700.0,,2018
3183,On the Basis of Sex,Focus,24600000.0,13600000.0,2018
3176,Tyler Perry's Acrimony,LGF,43500000.0,2900000.0,2018
3177,Mary Queen of Scots,Focus,16500000.0,29900000.0,2018
3178,The Possession of Hannah Grace,SGem,14800000.0,28200000.0,2018
3179,Overlord,Par.,21700000.0,20000000.0,2018
3180,The Darkest Minds,Fox,12700000.0,28400000.0,2018
3181,Holmes and Watson,Sony,30600000.0,9900000.0,2018
3182,Show Dogs,Global Road,17900000.0,21300000.0,2018
3184,Namiya,CL,70800.0,35300000.0,2018


In [67]:
tm_movies.info()
tm_movies.sort_values(by='release_date')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26517 entries, 0 to 26516
Data columns (total 10 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Unnamed: 0         26517 non-null  int64  
 1   genre_ids          26517 non-null  object 
 2   id                 26517 non-null  int64  
 3   original_language  26517 non-null  object 
 4   original_title     26517 non-null  object 
 5   popularity         26517 non-null  float64
 6   release_date       26517 non-null  object 
 7   title              26517 non-null  object 
 8   vote_average       26517 non-null  float64
 9   vote_count         26517 non-null  int64  
dtypes: float64(2), int64(3), object(5)
memory usage: 2.0+ MB


Unnamed: 0.1,Unnamed: 0,genre_ids,id,original_language,original_title,popularity,release_date,title,vote_average,vote_count
14335,14335,"[18, 10752]",143,en,All Quiet on the Western Front,9.583,1930-04-29,All Quiet on the Western Front,7.8,299
21758,21758,"[27, 53]",43148,en,The Vampire Bat,2.292,1933-01-21,The Vampire Bat,5.6,23
3580,3580,"[35, 18, 10749]",263768,fr,Le Bonheur,1.653,1936-02-27,Le Bonheur,8.7,3
26345,26345,[],316707,en,How Walt Disney Cartoons Are Made,0.600,1939-01-19,How Walt Disney Cartoons Are Made,7.3,3
11192,11192,"[18, 36, 10749]",887,en,The Best Years of Our Lives,9.647,1946-12-25,The Best Years of Our Lives,7.8,243
...,...,...,...,...,...,...,...,...,...,...
24819,24819,[18],481880,en,Trial by Fire,4.480,2019-05-17,Trial by Fire,7.0,3
24003,24003,"[18, 9648, 53]",411144,en,We Have Always Lived in the Castle,14.028,2019-05-17,We Have Always Lived in the Castle,5.2,24
24892,24892,[99],541577,en,This Changes Everything,3.955,2019-06-28,This Changes Everything,0.0,1
24265,24265,"[10749, 18]",428836,en,Ophelia,8.715,2019-06-28,Ophelia,0.0,4


In [76]:
# Convert release_date to datetime first so that info() reports the datetime64 dtype shown below
tn_moviebudgets['release_date'] = pd.to_datetime(tn_moviebudgets['release_date'])
tn_moviebudgets.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5782 entries, 0 to 5781
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   id                 5782 non-null   int64         
 1   release_date       5782 non-null   datetime64[ns]
 2   movie              5782 non-null   object        
 3   production_budget  5782 non-null   object        
 4   domestic_gross     5782 non-null   object        
 5   worldwide_gross    5782 non-null   object        
dtypes: datetime64[ns](1), int64(1), object(4)
memory usage: 271.2+ KB


In [77]:
tn_moviebudgets.sort_values(by='release_date')[5750:]

Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross
3795,96,2019-05-17,The Sun is Also a Star,"$9,000,000","$4,950,029","$5,434,029"
1380,81,2019-05-17,John Wick: Chapter 3 – Parabellum,"$40,000,000","$141,744,320","$256,498,033"
80,81,2019-05-24,Aladdin,"$182,000,000","$246,734,314","$619,234,314"
4012,13,2019-05-24,BrightBurn,"$7,000,000","$16,794,432","$27,989,498"
124,25,2019-05-31,Godzilla: King of the Monsters,"$170,000,000","$85,576,941","$299,276,941"
4265,66,2019-05-31,MA,"$5,000,000","$36,049,540","$44,300,625"
1370,71,2019-05-31,Rocketman,"$41,000,000","$57,342,725","$108,642,725"
580,81,2019-06-07,The Secret Life of Pets 2,"$80,000,000","$63,795,655","$113,351,496"
2,3,2019-06-07,Dark Phoenix,"$350,000,000","$42,762,350","$149,762,350"
4534,35,2019-06-07,Late Night,"$4,000,000","$246,305","$246,305"


In [78]:
rt_reviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 54432 entries, 0 to 54431
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   id          54432 non-null  int64 
 1   review      48869 non-null  object
 2   rating      40915 non-null  object
 3   fresh       54432 non-null  object
 4   critic      51710 non-null  object
 5   top_critic  54432 non-null  int64 
 6   publisher   54123 non-null  object
 7   date        54432 non-null  object
dtypes: int64(2), object(6)
memory usage: 3.3+ MB


In [79]:
rt_movieinfo.head()

Unnamed: 0,id,synopsis,rating,genre,director,writer,theater_date,dvd_date,currency,box_office,runtime,studio
0,1,"This gritty, fast-paced, and innovative police...",R,Action and Adventure|Classics|Drama,William Friedkin,Ernest Tidyman,"Oct 9, 1971","Sep 25, 2001",,,104 minutes,
1,3,"New York City, not-too-distant-future: Eric Pa...",R,Drama|Science Fiction and Fantasy,David Cronenberg,David Cronenberg|Don DeLillo,"Aug 17, 2012","Jan 1, 2013",$,600000.0,108 minutes,Entertainment One
2,5,Illeana Douglas delivers a superb performance ...,R,Drama|Musical and Performing Arts,Allison Anders,Allison Anders,"Sep 13, 1996","Apr 18, 2000",,,116 minutes,
3,6,Michael Douglas runs afoul of a treacherous su...,R,Drama|Mystery and Suspense,Barry Levinson,Paul Attanasio|Michael Crichton,"Dec 9, 1994","Aug 27, 1997",,,128 minutes,
4,7,,NR,Drama|Romance,Rodney Bennett,Giles Cooper,,,,,200 minutes,
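
The preview shows that `runtime` is stored as text such as `104 minutes` and that `genre` packs several genres into one pipe-separated string. A minimal sketch of one way to make both columns usable, on a toy frame shaped like `rt_movieinfo` (the rows and the new column names `runtime_min` and `genre_list` are illustrative assumptions):

```python
import pandas as pd

# Toy frame shaped like rt_movieinfo; the missing runtime is included on purpose
info = pd.DataFrame({
    "id": [1, 3, 7],
    "genre": [
        "Action and Adventure|Classics|Drama",
        "Drama|Science Fiction and Fantasy",
        "Drama|Romance",
    ],
    "runtime": ["104 minutes", "108 minutes", None],
})

# Extract the leading digits from strings such as "104 minutes";
# rows without a runtime stay NaN
info["runtime_min"] = pd.to_numeric(
    info["runtime"].str.extract(r"(\d+)", expand=False), errors="coerce"
)

# Split the pipe-separated genres into Python lists
info["genre_list"] = info["genre"].str.split("|")

print(info[["id", "runtime_min", "genre_list"]])
```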


In [82]:
# Count merged rows (one per review) for each studio.
# Note: rt_joined is created by the merge in cell In [81], which must be run first.
rt_joined_studios = rt_joined['studio'].value_counts()
pd.DataFrame(rt_joined_studios)

Unnamed: 0,studio
Universal Pictures,4423
Paramount Pictures,3142
20th Century Fox,2418
Sony Pictures,2135
Sony Pictures Classics,2096
...,...
Corridor,1
FilmRise,1
International Film Circuit,1
Grindstone Entertainment,1


In [81]:
rt_joined = rt_movieinfo.merge(rt_reviews, on = 'id', how = 'outer')
rt_joined


Unnamed: 0,id,synopsis,rating_x,genre,director,writer,theater_date,dvd_date,currency,box_office,runtime,studio,review,rating_y,fresh,critic,top_critic,publisher,date
0,1,"This gritty, fast-paced, and innovative police...",R,Action and Adventure|Classics|Drama,William Friedkin,Ernest Tidyman,"Oct 9, 1971","Sep 25, 2001",,,104 minutes,,,,,,,,
1,3,"New York City, not-too-distant-future: Eric Pa...",R,Drama|Science Fiction and Fantasy,David Cronenberg,David Cronenberg|Don DeLillo,"Aug 17, 2012","Jan 1, 2013",$,600000,108 minutes,Entertainment One,A distinctly gallows take on contemporary fina...,3/5,fresh,PJ Nabarro,0.0,Patrick Nabarro,"November 10, 2018"
2,3,"New York City, not-too-distant-future: Eric Pa...",R,Drama|Science Fiction and Fantasy,David Cronenberg,David Cronenberg|Don DeLillo,"Aug 17, 2012","Jan 1, 2013",$,600000,108 minutes,Entertainment One,It's an allegory in search of a meaning that n...,,rotten,Annalee Newitz,0.0,io9.com,"May 23, 2018"
3,3,"New York City, not-too-distant-future: Eric Pa...",R,Drama|Science Fiction and Fantasy,David Cronenberg,David Cronenberg|Don DeLillo,"Aug 17, 2012","Jan 1, 2013",$,600000,108 minutes,Entertainment One,... life lived in a bubble in financial dealin...,,fresh,Sean Axmaker,0.0,Stream on Demand,"January 4, 2018"
4,3,"New York City, not-too-distant-future: Eric Pa...",R,Drama|Science Fiction and Fantasy,David Cronenberg,David Cronenberg|Don DeLillo,"Aug 17, 2012","Jan 1, 2013",$,600000,108 minutes,Entertainment One,Continuing along a line introduced in last yea...,,fresh,Daniel Kasman,0.0,MUBI,"November 16, 2017"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
54852,2000,"Suspended from the force, Paris cop Hubert is ...",R,Action and Adventure|Art House and Internation...,,Luc Besson,"Sep 27, 2001","Feb 11, 2003",,,94 minutes,Columbia Pictures,The real charm of this trifle is the deadpan c...,,fresh,Laura Sinagra,1.0,Village Voice,"September 24, 2002"
54853,2000,"Suspended from the force, Paris cop Hubert is ...",R,Action and Adventure|Art House and Internation...,,Luc Besson,"Sep 27, 2001","Feb 11, 2003",,,94 minutes,Columbia Pictures,,1/5,rotten,Michael Szymanski,0.0,Zap2it.com,"September 21, 2005"
54854,2000,"Suspended from the force, Paris cop Hubert is ...",R,Action and Adventure|Art House and Internation...,,Luc Besson,"Sep 27, 2001","Feb 11, 2003",,,94 minutes,Columbia Pictures,,2/5,rotten,Emanuel Levy,0.0,EmanuelLevy.Com,"July 17, 2005"
54855,2000,"Suspended from the force, Paris cop Hubert is ...",R,Action and Adventure|Art House and Internation...,,Luc Besson,"Sep 27, 2001","Feb 11, 2003",,,94 minutes,Columbia Pictures,,2.5/5,rotten,Christopher Null,0.0,Filmcritic.com,"September 7, 2003"


In [83]:
imdb_title_princ.info()
imdb_title_princ.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1028186 entries, 0 to 1028185
Data columns (total 6 columns):
 #   Column      Non-Null Count    Dtype 
---  ------      --------------    ----- 
 0   tconst      1028186 non-null  object
 1   ordering    1028186 non-null  int64 
 2   nconst      1028186 non-null  object
 3   category    1028186 non-null  object
 4   job         177684 non-null   object
 5   characters  393360 non-null   object
dtypes: int64(1), object(5)
memory usage: 47.1+ MB


Unnamed: 0,tconst,ordering,nconst,category,job,characters
0,tt0111414,1,nm0246005,actor,,"[""The Man""]"
1,tt0111414,2,nm0398271,director,,
2,tt0111414,3,nm3739909,producer,producer,
3,tt0323808,10,nm0059247,editor,,
4,tt0323808,1,nm3579312,actress,,"[""Beth Boothby""]"


In [84]:
imdb_title_crew.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 146144 entries, 0 to 146143
Data columns (total 3 columns):
 #   Column     Non-Null Count   Dtype 
---  ------     --------------   ----- 
 0   tconst     146144 non-null  object
 1   directors  140417 non-null  object
 2   writers    110261 non-null  object
dtypes: object(3)
memory usage: 3.3+ MB


In [85]:
imdb_name_basics.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 606648 entries, 0 to 606647
Data columns (total 6 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   nconst              606648 non-null  object 
 1   primary_name        606648 non-null  object 
 2   birth_year          82736 non-null   float64
 3   death_year          6783 non-null    float64
 4   primary_profession  555308 non-null  object 
 5   known_for_titles    576444 non-null  object 
dtypes: float64(2), object(4)
memory usage: 27.8+ MB


In [86]:
imdbcast_crew = imdb_title_princ.merge(imdb_name_basics, on = 'nconst', how = 'outer')
imdbcast_crew = imdbcast_crew.sort_values(by = 'tconst')
imdbcast_crew = imdbcast_crew.merge(imdb_title_crew, on = 'tconst', how = 'outer')
imdbcast_crew

Unnamed: 0,tconst,ordering,nconst,category,job,characters,primary_name,birth_year,death_year,primary_profession,known_for_titles,directors,writers
0,tt0063540,3.0,nm0756379,actor,,"[""Ganeshi N. Prasad""]",Balraj Sahni,1913.0,1973.0,"actor,writer,director","tt0055039,tt0043307,tt0234827,tt0233326",nm0712540,"nm0023551,nm1194313,nm0347899,nm1391276"
1,tt0063540,7.0,nm1194313,writer,story,,Mahasweta Devi,1926.0,2016.0,writer,"tt0108001,tt0832902,tt0063540,tt0178562",nm0712540,"nm0023551,nm1194313,nm0347899,nm1391276"
2,tt0063540,9.0,nm1391276,writer,screenplay,,Anjana Rawail,,,"writer,costume_designer","tt0293499,tt0266712,tt0266757,tt0063540",nm0712540,"nm0023551,nm1194313,nm0347899,nm1391276"
3,tt0063540,6.0,nm0023551,writer,dialogue,,Abrar Alvi,1927.0,2009.0,"writer,actor,director","tt0071811,tt0359496,tt0056436,tt0061046",nm0712540,"nm0023551,nm1194313,nm0347899,nm1391276"
4,tt0063540,8.0,nm0347899,writer,dialogue,,Gulzar,1936.0,,"music_department,writer,soundtrack","tt0091256,tt0178186,tt1010048,tt2176013",nm0712540,"nm0023551,nm1194313,nm0347899,nm1391276"
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1033229,tt7659080,,,,,,,,,,,nm6474441,
1033230,tt7763158,,,,,,,,,,,,
1033231,tt7980000,,,,,,,,,,,,
1033232,tt8352852,,,,,,,,,,,,


In [None]:
# Here you run your code to explore the data
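
Several of the `info()` outputs above show columns with many missing values (for example `foreign_gross` and `runtime_minutes`). A small helper like the following could summarize that per table in one place. This is a sketch: `missing_summary` is a hypothetical helper name and the toy frame only mimics the shape of `bom_movie_gross`.

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Return the count and percentage of missing values per column, largest first."""
    n_missing = df.isna().sum()
    summary = pd.DataFrame({
        "n_missing": n_missing,
        "pct_missing": (n_missing / len(df) * 100).round(1),
    })
    return summary.sort_values("n_missing", ascending=False)

# Toy frame mimicking bom_movie_gross (illustrative values only)
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "domestic_gross": [1700.0, np.nan, 43500000.0, 16500000.0],
    "foreign_gross": [None, None, "2900000", "29900000"],
})

summary = missing_summary(toy)
print(summary)
```

In the notebook the same call could be applied to each loaded frame, for example `missing_summary(bom_movie_gross)`.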

## Data Preparation

Describe and justify the process for preparing the data for analysis.

***
Questions to consider:
* Were there variables you dropped or created?
* How did you address missing values or outliers?
* Why are these choices appropriate given the data and the business problem?
***

In [None]:
# Here you run your code to clean the data
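
One cleaning step the data clearly needs: the money columns in `tn_moviebudgets` are strings such as `$182,000,000`, so they cannot be summed or compared until they are numeric. A minimal sketch on two rows copied from the output above; the `profit` column is an illustrative assumption (worldwide gross minus production budget) and not a definition the project has committed to.

```python
import pandas as pd

# Two rows taken from the tn_moviebudgets preview above
budgets = pd.DataFrame({
    "movie": ["Aladdin", "Late Night"],
    "production_budget": ["$182,000,000", "$4,000,000"],
    "worldwide_gross": ["$619,234,314", "$246,305"],
})

# Strip "$" and "," and convert to numbers; unparseable values become NaN
money_cols = ["production_budget", "worldwide_gross"]
for col in money_cols:
    budgets[col] = pd.to_numeric(
        budgets[col].str.replace(r"[$,]", "", regex=True), errors="coerce"
    )

# Illustrative derived column: a simple gross-minus-budget figure
budgets["profit"] = budgets["worldwide_gross"] - budgets["production_budget"]
print(budgets)
```

The same pattern would likely apply to `foreign_gross` in `bom_movie_gross`, which `info()` reports as `object` rather than a numeric dtype.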

## Data Modeling
Describe and justify the process for analyzing or modeling the data.

***
Questions to consider:
* How did you analyze or model the data?
* How did you iterate on your initial approach to make it better?
* Why are these choices appropriate given the data and the business problem?
***

In [None]:
# Here you run your code to model the data
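
As one possible starting point for the analysis (not necessarily the approach this project ends up taking), ratings could be compared across genres. Because `genres` holds comma-separated values, each title has to be expanded to one row per genre first. The sketch below uses three rows from the merged IMDb preview above:

```python
import pandas as pd

# Three rows taken from the imdb_ratings_joined preview above
movies = pd.DataFrame({
    "primary_title": [
        "Sunghursh",
        "The Other Side of the Wind",
        "The Wandering Soap Opera",
    ],
    "genres": ["Action,Crime,Drama", "Drama", "Comedy,Drama,Fantasy"],
    "averagerating": [7.0, 6.9, 6.5],
})

# One row per (title, genre) pair
exploded = movies.assign(genre=movies["genres"].str.split(",")).explode("genre")

# Number of titles and median rating per genre, most common genre first
by_genre = (
    exploded.groupby("genre")["averagerating"]
    .agg(["count", "median"])
    .sort_values("count", ascending=False)
)
print(by_genre)
```

Note that a title with several genres contributes to each of them, so the per-genre counts add up to more than the number of titles.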


## Evaluation
Evaluate how well your work solves the stated business problem.

***
Questions to consider:
* How do you interpret the results?
* How well does your model fit your data? How much better is this than your baseline model?
* How confident are you that your results would generalize beyond the data you have?
* How confident are you that this model would benefit the business if put into use?
***
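
The question about improvement over a baseline can be made concrete by comparing error metrics. The sketch below is an illustration only: the numbers are hypothetical toy values (budget and gross in millions), the "model" is a straight-line fit, and the baseline simply predicts the mean. Both errors are computed on the same data used for fitting, so this says nothing yet about generalization; a held-out split would be needed for that.

```python
import numpy as np

# Hypothetical toy data: production budget vs. worldwide gross, in millions
budget = np.array([5.0, 10.0, 40.0, 80.0, 170.0])
gross = np.array([20.0, 15.0, 120.0, 150.0, 400.0])

def rmse(y_true, y_pred):
    """Root mean squared error between two arrays."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Baseline: always predict the mean gross
baseline_pred = np.full_like(gross, gross.mean())

# Simple model: least-squares straight line gross = slope * budget + intercept
slope, intercept = np.polyfit(budget, gross, deg=1)
model_pred = slope * budget + intercept

baseline_rmse = rmse(gross, baseline_pred)
model_rmse = rmse(gross, model_pred)
print(f"baseline RMSE: {baseline_rmse:.1f}, model RMSE: {model_rmse:.1f}")
```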

## Conclusions
Provide your conclusions about the work you've done, including any limitations or next steps.

***
Questions to consider:
* What would you recommend the business do as a result of this work?
* What are some reasons why your analysis might not fully solve the business problem?
* What else could you do in the future to improve this project?
***