In [1]:
import pandas as pd

## Exploring Rotten Tomatoes and IMDB Data Sets for Insights

Hello there!

In this notebook, I will be taking a look at IMDB and Rotten Tomatoes data sets sourced from the web.

For my initial analysis of the movie data sets, I plan to scan a few of the files that were provided in our /zippedData project folder. I want to see what types of data we are working with and whether there are any notable columns or inconsistencies in the sets.

In [17]:
basics = pd.read_csv('imdb.title.basics.csv.gz')

In [4]:
basics.head()

Unnamed: 0,tconst,primary_title,original_title,start_year,runtime_minutes,genres
0,tt0063540,Sunghursh,Sunghursh,2013,175.0,"Action,Crime,Drama"
1,tt0066787,One Day Before the Rainy Season,Ashad Ka Ek Din,2019,114.0,"Biography,Drama"
2,tt0069049,The Other Side of the Wind,The Other Side of the Wind,2018,122.0,Drama
3,tt0069204,Sabse Bada Sukh,Sabse Bada Sukh,2018,,"Comedy,Drama"
4,tt0100275,The Wandering Soap Opera,La Telenovela Errante,2017,80.0,"Comedy,Drama,Fantasy"


The IMDB data looks straightforward and clean. However, one of my responsibilities on the team is to explore insights about the ratings and reviews of movies, so I want to take a look at the Rotten Tomatoes data and see if anything interesting stands out.

In [2]:
rotten = pd.read_csv('rt.movie_info.tsv.gz', delimiter="\t")
# load the Rotten Tomatoes movie info table (tab-separated) into a DataFrame

In [3]:
rotten.head(20)

Unnamed: 0,id,synopsis,rating,genre,director,writer,theater_date,dvd_date,currency,box_office,runtime,studio
0,1,"This gritty, fast-paced, and innovative police...",R,Action and Adventure|Classics|Drama,William Friedkin,Ernest Tidyman,"Oct 9, 1971","Sep 25, 2001",,,104 minutes,
1,3,"New York City, not-too-distant-future: Eric Pa...",R,Drama|Science Fiction and Fantasy,David Cronenberg,David Cronenberg|Don DeLillo,"Aug 17, 2012","Jan 1, 2013",$,600000.0,108 minutes,Entertainment One
2,5,Illeana Douglas delivers a superb performance ...,R,Drama|Musical and Performing Arts,Allison Anders,Allison Anders,"Sep 13, 1996","Apr 18, 2000",,,116 minutes,
3,6,Michael Douglas runs afoul of a treacherous su...,R,Drama|Mystery and Suspense,Barry Levinson,Paul Attanasio|Michael Crichton,"Dec 9, 1994","Aug 27, 1997",,,128 minutes,
4,7,,NR,Drama|Romance,Rodney Bennett,Giles Cooper,,,,,200 minutes,
5,8,The year is 1942. As the Allies unite overseas...,PG,Drama|Kids and Family,Jay Russell,Gail Gilchriest,"Mar 3, 2000","Jul 11, 2000",,,95 minutes,Warner Bros. Pictures
6,10,Some cast and crew from NBC's highly acclaimed...,PG-13,Comedy,Jake Kasdan,Mike White,"Jan 11, 2002","Jun 18, 2002",$,41032915.0,82 minutes,Paramount Pictures
7,13,"Stewart Kane, an Irishman living in the Austra...",R,Drama,Ray Lawrence,Raymond Carver|Beatrix Christian,"Apr 27, 2006","Oct 2, 2007",$,224114.0,123 minutes,Sony Pictures Classics
8,14,"""Love Ranch"" is a bittersweet love story that ...",R,Drama,Taylor Hackford,Mark Jacobson,"Jun 30, 2010","Nov 9, 2010",$,134904.0,117 minutes,
9,15,When a diamond expedition in the Congo is lost...,PG-13,Action and Adventure|Mystery and Suspense|Scie...,Frank Marshall,John Patrick Shanley,"Jun 9, 1995","Jul 27, 1999",,,108 minutes,


In [12]:
!ls

MovieAnalysis-Samantha.ipynb
Untitled.ipynb
bom.movie_gross.csv.gz
imdb.name.basics.csv.gz
imdb.title.akas.csv.gz
imdb.title.basics.csv.gz
imdb.title.crew.csv.gz
imdb.title.principals.csv.gz
imdb.title.ratings.csv.gz
rt.movie_info.tsv.gz
rt.reviews.tsv.gz
tmdb.movies.csv.gz
tn.movie_budgets.csv.gz


In [14]:
# Gathering information about Microsoft company values: https://www.microsoft.com/en-us/about/values
# Mission in action: Innovation, Diversity and Inclusion, Corporate Social Responsibility, AI, Trustworthy Computing, COVID-19 safety
# Values: Respect, Integrity, Accountability

In [37]:
type(rotten)

pandas.core.frame.DataFrame

In [38]:
rotten.info()
# a lot of values are missing... box office data is present for only 340 of the 1560 movies, so we will
# probably just use this data set to get an idea of sentiment and not necessarily sales stats

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1560 entries, 0 to 1559
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   id            1560 non-null   int64 
 1   synopsis      1498 non-null   object
 2   rating        1557 non-null   object
 3   genre         1552 non-null   object
 4   director      1361 non-null   object
 5   writer        1111 non-null   object
 6   theater_date  1201 non-null   object
 7   dvd_date      1201 non-null   object
 8   currency      340 non-null    object
 9   box_office    340 non-null    object
 10  runtime       1530 non-null   object
 11  studio        494 non-null    object
dtypes: int64(1), object(11)
memory usage: 146.4+ KB
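
To put a number on the gaps, the non-null counts printed above can be turned into a share of missing values per column. The snippet below is only a sketch that re-enters those counts by hand; on the live frame, `rotten.isna().mean()` should give the same fractions.

```python
# Non-null counts copied from the rotten.info() output above.
TOTAL_ROWS = 1560
non_null = {
    "id": 1560, "synopsis": 1498, "rating": 1557, "genre": 1552,
    "director": 1361, "writer": 1111, "theater_date": 1201,
    "dvd_date": 1201, "currency": 340, "box_office": 340,
    "runtime": 1530, "studio": 494,
}

# Share of rows with no value, per column, sorted from most to least missing.
missing_share = {col: 1 - count / TOTAL_ROWS for col, count in non_null.items()}
for col, share in sorted(missing_share.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{col:<13}{share:6.1%}")
```

Box office and currency are missing for roughly 78% of the movies and studio for about 68%, which supports using this table for sentiment rather than sales.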


In [39]:
rotten.index
# determine the number of rows in this data set: 1560 (these are movies, not individual reviews)

RangeIndex(start=0, stop=1560, step=1)

In [40]:
rotten['rating']

0        R
1        R
2        R
3        R
4       NR
        ..
1555     R
1556    PG
1557     G
1558    PG
1559     R
Name: rating, Length: 1560, dtype: object

In [41]:
rotten.describe()

Unnamed: 0,id
count,1560.0
mean,1007.303846
std,579.164527
min,1.0
25%,504.75
50%,1007.5
75%,1503.25
max,2000.0


In [42]:
rotten['genre'].unique()

array(['Action and Adventure|Classics|Drama',
       'Drama|Science Fiction and Fantasy',
       'Drama|Musical and Performing Arts', 'Drama|Mystery and Suspense',
       'Drama|Romance', 'Drama|Kids and Family', 'Comedy', 'Drama',
       'Action and Adventure|Mystery and Suspense|Science Fiction and Fantasy',
       nan, 'Documentary', 'Documentary|Special Interest',
       'Classics|Comedy|Drama', 'Comedy|Drama|Mystery and Suspense',
       'Action and Adventure|Comedy|Drama',
       'Action and Adventure|Drama|Science Fiction and Fantasy',
       'Art House and International|Comedy|Drama|Musical and Performing Arts',
       'Musical and Performing Arts',
       'Classics|Comedy|Musical and Performing Arts|Romance',
       'Action and Adventure|Drama|Mystery and Suspense',
       'Action and Adventure|Mystery and Suspense',
       'Art House and International|Classics|Horror|Mystery and Suspense',
       'Horror',
       'Action and Adventure|Classics|Drama|Mystery and Suspense',
   

In [24]:
# Lots of entries have several genres assigned to the same movie, separated by "|"
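
One way to make multi-genre entries countable is to split the pipe-delimited string and give each genre its own row. This is a sketch on a few genre strings copied from the output above; `sample`, `genre_rows`, and `genre_counts` are illustrative names, not objects defined elsewhere in this notebook.

```python
import pandas as pd

# A few rows copied from the head of the movie info table.
sample = pd.DataFrame({
    "id": [1, 3, 5, 10],
    "genre": [
        "Action and Adventure|Classics|Drama",
        "Drama|Science Fiction and Fantasy",
        "Drama|Musical and Performing Arts",
        "Comedy",
    ],
})

# Split on the literal "|" (a one-character pattern is not treated as a regex),
# then explode so each (movie, genre) pair becomes its own row.
genre_rows = sample.assign(genre=sample["genre"].str.split("|")).explode("genre")

# Count how many movies carry each individual genre.
genre_counts = genre_rows["genre"].value_counts()
print(genre_counts)
```

On the real table the same two steps would start from `rotten` instead of `sample`.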

In [26]:
# next step: figure out where the RT critic ratings for these movies can be found...

In [43]:
rotten.keys()

Index(['id', 'synopsis', 'rating', 'genre', 'director', 'writer',
       'theater_date', 'dvd_date', 'currency', 'box_office', 'runtime',
       'studio'],
      dtype='object')

In [44]:
rotten['id']

0          1
1          3
2          5
3          6
4          7
        ... 
1555    1996
1556    1997
1557    1998
1558    1999
1559    2000
Name: id, Length: 1560, dtype: int64

In [29]:
# note: come back to this data set... it might take a while to uncover what's happening here

In [32]:
# let's try looking at the RT reviews data set and see if we can combine the two... if not, then try IMDB

In [4]:
rtreviews = pd.read_csv('rt.reviews.tsv.gz', sep="\t", encoding='latin-1')

In [5]:
rtreviews.head(50)
# some reviews contain a rating out of 5, others a letter grade, and many have no rating at all

Unnamed: 0,id,review,rating,fresh,critic,top_critic,publisher,date
0,3,A distinctly gallows take on contemporary fina...,3/5,fresh,PJ Nabarro,0,Patrick Nabarro,"November 10, 2018"
1,3,It's an allegory in search of a meaning that n...,,rotten,Annalee Newitz,0,io9.com,"May 23, 2018"
2,3,... life lived in a bubble in financial dealin...,,fresh,Sean Axmaker,0,Stream on Demand,"January 4, 2018"
3,3,Continuing along a line introduced in last yea...,,fresh,Daniel Kasman,0,MUBI,"November 16, 2017"
4,3,... a perverse twist on neorealism...,,fresh,,0,Cinema Scope,"October 12, 2017"
5,3,... Cronenberg's Cosmopolis expresses somethin...,,fresh,Michelle Orange,0,Capital New York,"September 11, 2017"
6,3,"Quickly grows repetitive and tiresome, meander...",C,rotten,Eric D. Snider,0,EricDSnider.com,"July 17, 2013"
7,3,Cronenberg is not a director to be daunted by ...,2/5,rotten,Matt Kelemen,0,Las Vegas CityLife,"April 21, 2013"
8,3,"Cronenberg's cold, exacting precision and emot...",,fresh,Sean Axmaker,0,Parallax View,"March 24, 2013"
9,3,Over and above its topical urgency or the bit ...,,fresh,Kong Rithdee,0,Bangkok Post,"March 4, 2013"


In [34]:
# my first guess is that we can link the Rotten Tomatoes data frames by "id", where each "id" is a movie with many reviews associated with it


In [5]:
movie5 = rtreviews[rtreviews['id'] == 5]

movie5

Unnamed: 0,id,review,rating,fresh,critic,top_critic,publisher,date
163,5,This is not the smoothest trip: the transition...,,fresh,David Ansen,1,Newsweek,"February 26, 2018"
164,5,Charming tale of songwriter finding her voice ...,4/5,fresh,Brian Costello,0,Common Sense Media,"July 12, 2016"
165,5,"The lead, Ileana Douglas, is good, but the mus...",C+,fresh,Emanuel Levy,0,EmanuelLevy.Com,"June 7, 2011"
166,5,"An intelligent, engaging comedy-drama that wil...",3.5/4,fresh,Michael Dequina,0,TheMovieReport.com,"September 10, 2005"
167,5,Illeana Douglas grabs onto her first starring ...,,fresh,Frank Wilkins,0,ReelTalk Movie Reviews,"December 3, 2003"
168,5,Pean to creativity and a celebration of one wo...,,fresh,,0,Spirituality and Practice,"August 27, 2002"
169,5,,3/5,fresh,Cole Smithey,0,ColeSmithey.com,"October 10, 2005"
170,5,,2/5,rotten,Chuck O'Leary,0,Fantastica Daily,"October 9, 2005"
171,5,,3/5,fresh,Philip Martin,0,Arkansas Democrat-Gazette,"April 28, 2005"
172,5,,3/5,fresh,Eric Melin,0,Lawrence.com,"August 24, 2004"


In [39]:
rtreviews['id']
# looks like the reviews cover movie ids 3 through 2000

0           3
1           3
2           3
3           3
4           3
         ... 
54427    2000
54428    2000
54429    2000
54430    2000
54431    2000
Name: id, Length: 54432, dtype: int64

In [8]:
!ls

MovieAnalysis-Samantha.ipynb
Untitled.ipynb
bom.movie_gross.csv.gz
imdb.name.basics.csv.gz
imdb.title.akas.csv.gz
imdb.title.basics.csv.gz
imdb.title.crew.csv.gz
imdb.title.principals.csv.gz
imdb.title.ratings.csv.gz
rt.movie_info.tsv.gz
rt.reviews.tsv.gz
tmdb.movies.csv.gz
tn.movie_budgets.csv.gz


ID 3 in both of my Rotten Tomatoes dataframes mentions David Cronenberg: he is the director in the movie info table and is named in the reviews.
ID 5 mentions Illeana Douglas as the lead of the movie in both sets of data as well.

It seems reasonable to assume that `id` is the common key column in both Rotten Tomatoes data sets. Therefore, I want to merge them so that I have a more thorough set of data to work with.
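
Before merging, that assumption can be checked rather than taken on faith. A minimal sketch, using tiny stand-in frames with ids taken from the outputs above: `id` should be unique in the movie table, and every review `id` should appear in it. The names `movies`, `reviews`, and `orphan_reviews` are illustrative; on the real data the same two checks would run against `rotten` and `rtreviews`.

```python
import pandas as pd

# Stand-ins for the two Rotten Tomatoes tables (ids taken from the outputs above).
movies = pd.DataFrame({"id": [1, 3, 5, 6, 7]})
reviews = pd.DataFrame({"id": [3, 3, 3, 5, 5]})

# One row per movie: the key must not repeat on the movie side.
movie_ids_unique = movies["id"].is_unique

# Every review should point at a movie that exists in the movie table.
orphan_reviews = reviews.loc[~reviews["id"].isin(movies["id"])]

print(movie_ids_unique, len(orphan_reviews))
```

If the first value is `True` and the second is `0`, a merge on `id` will attach exactly one movie row to each review.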

## A Quick Jump Back To IMDB's Reviews

Before I commit to diving head first into the Rotten Tomatoes data, I realized that IMDB has its own separate "ratings" file, which may be easier and cleaner to work with. I hope I can explore both, but I want to take a quick peek first.

In [11]:
imdbreviews = pd.read_csv("imdb.title.ratings.csv.gz")

In [13]:
imdbreviews.head()

Unnamed: 0,tconst,averagerating,numvotes
0,tt10356526,8.3,31
1,tt10384606,8.9,559
2,tt1042974,6.4,20
3,tt1043726,4.2,50352
4,tt1060240,6.5,21


In [14]:
#tconst could be the key that identifies all the movies and links them between the data sets

In [15]:
# so far, the basics DF is IMDB "title basics", and the imdbreviews DF is IMDB "title ratings"

In [22]:
#reminder of what the IMDB basics data set looks like, this is "title basics"
basics.head()

Unnamed: 0,tconst,primary_title,original_title,start_year,runtime_minutes,genres
0,tt0063540,Sunghursh,Sunghursh,2013,175.0,"Action,Crime,Drama"
1,tt0066787,One Day Before the Rainy Season,Ashad Ka Ek Din,2019,114.0,"Biography,Drama"
2,tt0069049,The Other Side of the Wind,The Other Side of the Wind,2018,122.0,Drama
3,tt0069204,Sabse Bada Sukh,Sabse Bada Sukh,2018,,"Comedy,Drama"
4,tt0100275,The Wandering Soap Opera,La Telenovela Errante,2017,80.0,"Comedy,Drama,Fantasy"


In [19]:
# looking at IMDB "name basics"
namebasics = pd.read_csv("imdb.name.basics.csv.gz")

In [20]:
namebasics.head()

Unnamed: 0,nconst,primary_name,birth_year,death_year,primary_profession,known_for_titles
0,nm0061671,Mary Ellen Bauder,,,"miscellaneous,production_manager,producer","tt0837562,tt2398241,tt0844471,tt0118553"
1,nm0061865,Joseph Bauer,,,"composer,music_department,sound_department","tt0896534,tt6791238,tt0287072,tt1682940"
2,nm0062070,Bruce Baum,,,"miscellaneous,actor,writer","tt1470654,tt0363631,tt0104030,tt0102898"
3,nm0062195,Axel Baumann,,,"camera_department,cinematographer,art_department","tt0114371,tt2004304,tt1618448,tt1224387"
4,nm0062798,Pete Baxter,,,"production_designer,art_department,set_decorator","tt0452644,tt0452692,tt3458030,tt2178256"


In [21]:
namebasics.keys()

Index(['nconst', 'primary_name', 'birth_year', 'death_year',
       'primary_profession', 'known_for_titles'],
      dtype='object')

In [26]:
titleakas = pd.read_csv('imdb.title.akas.csv.gz')
titleakas.head(20)

Unnamed: 0,title_id,ordering,title,region,language,types,attributes,is_original_title
0,tt0369610,10,Джурасик свят,BG,bg,,,0.0
1,tt0369610,11,Jurashikku warudo,JP,,imdbDisplay,,0.0
2,tt0369610,12,Jurassic World: O Mundo dos Dinossauros,BR,,imdbDisplay,,0.0
3,tt0369610,13,O Mundo dos Dinossauros,BR,,,short title,0.0
4,tt0369610,14,Jurassic World,FR,,imdbDisplay,,0.0
5,tt0369610,15,Jurassic World,GR,,imdbDisplay,,0.0
6,tt0369610,16,Jurassic World,IT,,imdbDisplay,,0.0
7,tt0369610,17,Jurski svijet,HR,,imdbDisplay,,0.0
8,tt0369610,18,Olam ha'Yura,IL,he,imdbDisplay,,0.0
9,tt0369610,19,Jurassic World: Mundo Jurásico,MX,,imdbDisplay,,0.0


In [27]:
titleakas.tail(30)
# we can see the same movie's title listed here for multiple regions and languages

Unnamed: 0,title_id,ordering,title,region,language,types,attributes,is_original_title
331673,tt9562694,5,Alien Warfare,US,,imdbDisplay,,0.0
331674,tt9593792,1,Ghost Wife,TH,,,,0.0
331675,tt9593792,2,Ghost Wife,,,original,,1.0
331676,tt9593792,3,Nguoi Vo Ma,VN,,imdbDisplay,,0.0
331677,tt9644084,1,Der Atem,DE,,,,0.0
331678,tt9644084,2,Der Atem,,,original,,1.0
331679,tt9644084,3,The Breath,XWW,en,alternative,,0.0
331680,tt9654246,1,The Wild Man of the North,XWW,en,imdbDisplay,,0.0
331681,tt9654246,2,Vildmarkens søn,NO,,,,0.0
331682,tt9654246,3,Vildmarkens søn,,,original,,1.0


This is all quite interesting information, but not what I was really hoping to find. I'm going to go back to my initial plan of combining my two Rotten Tomatoes data sets and see where it takes me.

## Rotten Tomatoes Data Merge

In [11]:
#Rotten Tomatoes Movies List- DF
rotten.head()

Unnamed: 0,id,synopsis,rating,genre,director,writer,theater_date,dvd_date,currency,box_office,runtime,studio
0,1,"This gritty, fast-paced, and innovative police...",R,Action and Adventure|Classics|Drama,William Friedkin,Ernest Tidyman,"Oct 9, 1971","Sep 25, 2001",,,104 minutes,
1,3,"New York City, not-too-distant-future: Eric Pa...",R,Drama|Science Fiction and Fantasy,David Cronenberg,David Cronenberg|Don DeLillo,"Aug 17, 2012","Jan 1, 2013",$,600000.0,108 minutes,Entertainment One
2,5,Illeana Douglas delivers a superb performance ...,R,Drama|Musical and Performing Arts,Allison Anders,Allison Anders,"Sep 13, 1996","Apr 18, 2000",,,116 minutes,
3,6,Michael Douglas runs afoul of a treacherous su...,R,Drama|Mystery and Suspense,Barry Levinson,Paul Attanasio|Michael Crichton,"Dec 9, 1994","Aug 27, 1997",,,128 minutes,
4,7,,NR,Drama|Romance,Rodney Bennett,Giles Cooper,,,,,200 minutes,


In [7]:
rotten["synopsis"][0]
rotten["synopsis"][1]
# I'm beginning to think that the only way to find the movie title for a review is if it is mentioned in the synopsis, or if we can somehow get it by matching the id against another data set...
# the movie title is probably not important though, as we want to measure positive/negative reviews and sentiment
# and come up with a meaningful conclusion

"New York City, not-too-distant-future: Eric Packer, a 28 year-old finance golden boy dreaming of living in a civilization ahead of this one, watches a dark shadow cast over the firmament of the Wall Street galaxy, of which he is the uncontested king. As he is chauffeured across midtown Manhattan to get a haircut at his father's old barber, his anxious eyes are glued to the yuan's exchange rate: it is mounting against all expectations, destroying Eric's bet against it. Eric Packer is losing his empire with every tick of the clock. Meanwhile, an eruption of wild activity unfolds in the city's streets. Petrified as the threats of the real world infringe upon his cloud of virtual convictions, his paranoia intensifies during the course of his 24-hour cross-town odyssey. Packer starts to piece together clues that lead him to a most terrifying secret: his imminent assassination. -- (C) Official Site"

In [6]:
rtreviews.head()

Unnamed: 0,id,review,rating,fresh,critic,top_critic,publisher,date
0,3,A distinctly gallows take on contemporary fina...,3/5,fresh,PJ Nabarro,0,Patrick Nabarro,"November 10, 2018"
1,3,It's an allegory in search of a meaning that n...,,rotten,Annalee Newitz,0,io9.com,"May 23, 2018"
2,3,... life lived in a bubble in financial dealin...,,fresh,Sean Axmaker,0,Stream on Demand,"January 4, 2018"
3,3,Continuing along a line introduced in last yea...,,fresh,Daniel Kasman,0,MUBI,"November 16, 2017"
4,3,... a perverse twist on neorealism...,,fresh,,0,Cinema Scope,"October 12, 2017"


In [67]:
rtreviews['rating'].value_counts()
# how many different "rating" values are there? (186)
# looks like the rating data requires a LOT of cleaning, let's filter by freshness first

3/5       4327
4/5       3672
3/4       3577
2/5       3160
2/4       2712
          ... 
4.9          1
3.1          1
7.4          1
F+           1
4.3/10       1
Name: rating, Length: 186, dtype: int64
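
A possible starting point for that cleaning is to map each rating string onto a common 0 to 1 scale. The function below is only a sketch: `normalize_rating` and the letter-grade values in `LETTER_GRADES` are my own assumptions, not anything Rotten Tomatoes defines, and bare numbers such as `8` or `4.9` are left as `None` because their scale is unknown.

```python
from typing import Optional

# Assumed letter-grade mapping (hypothetical values on a 0-1 scale).
LETTER_GRADES = {
    "A+": 1.00, "A": 0.95, "A-": 0.90,
    "B+": 0.85, "B": 0.80, "B-": 0.75,
    "C+": 0.70, "C": 0.65, "C-": 0.60,
    "D+": 0.55, "D": 0.50, "D-": 0.45,
    "F+": 0.40, "F": 0.35, "F-": 0.30,
}

def normalize_rating(raw: Optional[str]) -> Optional[float]:
    """Return the rating as a fraction of its scale, or None if unclear."""
    if not isinstance(raw, str):
        return None  # missing values arrive as NaN (a float) or None
    text = raw.strip()
    if "/" in text:
        score, _, scale = text.partition("/")
        try:
            score_value, scale_value = float(score), float(scale)
        except ValueError:
            return None
        if scale_value <= 0 or score_value > scale_value:
            return None  # malformed, for example "5/4"
        return score_value / scale_value
    # Letter grades use the assumed table; bare numbers have no known scale.
    return LETTER_GRADES.get(text.upper())

for example in ["3/5", "3.5/4", "4.3/10", "B+", "7.4", None]:
    print(example, normalize_rating(example))
```

On the frame itself this could be applied with `rtreviews["rating"].map(normalize_rating)`, after deciding whether the assumed letter-grade values are acceptable.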

In [70]:
rtreviews['fresh'].value_counts()
# each review is marked fresh (positive) or rotten (negative); about 61% of the reviews (33035 of 54432) are fresh

fresh     33035
rotten    21397
Name: fresh, dtype: int64
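
Since `fresh` is recorded per review, a movie-level score can be approximated as the share of that movie's reviews marked fresh. The sketch below uses only the ten reviews displayed earlier for each of movies 3 and 5, so the per-movie numbers describe those displayed rows, not the full data set; `sample` and `fresh_share` are illustrative names.

```python
import pandas as pd

# fresh/rotten labels of the first ten displayed reviews for movies 3 and 5.
sample = pd.DataFrame({
    "id": [3] * 10 + [5] * 10,
    "fresh": [
        "fresh", "rotten", "fresh", "fresh", "fresh",
        "fresh", "rotten", "rotten", "fresh", "fresh",
        "fresh", "fresh", "fresh", "fresh", "fresh",
        "fresh", "fresh", "rotten", "fresh", "fresh",
    ],
})

# Share of fresh reviews per movie (True counts as 1, False as 0).
fresh_share = sample["fresh"].eq("fresh").groupby(sample["id"]).mean()
print(fresh_share)

# Overall share of fresh reviews, from the value counts printed above.
overall_fresh = 33035 / (33035 + 21397)
print(round(overall_fresh, 3))
```

The same groupby on `rtreviews` would give one fresh share per movie id, which may be easier to compare across movies than the mixed-format `rating` column.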

In [51]:
rtreviews['id'].value_counts()

782     338
1067    275
1525    262
1777    260
1083    260
       ... 
28        1
102       1
348       1
476       1
1727      1
Name: id, Length: 1135, dtype: int64

In [52]:
rtreviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 54432 entries, 0 to 54431
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   id          54432 non-null  int64 
 1   review      48869 non-null  object
 2   rating      40915 non-null  object
 3   fresh       54432 non-null  object
 4   critic      51710 non-null  object
 5   top_critic  54432 non-null  int64 
 6   publisher   54123 non-null  object
 7   date        54432 non-null  object
dtypes: int64(2), object(6)
memory usage: 3.3+ MB


In [53]:
rotten.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1560 entries, 0 to 1559
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   id            1560 non-null   int64 
 1   synopsis      1498 non-null   object
 2   rating        1557 non-null   object
 3   genre         1552 non-null   object
 4   director      1361 non-null   object
 5   writer        1111 non-null   object
 6   theater_date  1201 non-null   object
 7   dvd_date      1201 non-null   object
 8   currency      340 non-null    object
 9   box_office    340 non-null    object
 10  runtime       1530 non-null   object
 11  studio        494 non-null    object
dtypes: int64(1), object(11)
memory usage: 146.4+ KB


In [59]:
#Cannot seem to find a way to determine the movie title. But perhaps we can get information on genre, rating, dates, review type, etc.

Alright, I have a better sense of the types of values found in both data sets. Now I am ready to begin merging.

In [8]:
# inner merge on "id": each review row gets the matching movie info, and movies without reviews are dropped

rot = rotten.merge(rtreviews, on='id', how='inner')

rot.columns

Index(['id', 'synopsis', 'rating_x', 'genre', 'director', 'writer',
       'theater_date', 'dvd_date', 'currency', 'box_office', 'runtime',
       'studio', 'review', 'rating_y', 'fresh', 'critic', 'top_critic',
       'publisher', 'date'],
      dtype='object')

In [9]:
# I want to rename the column "rating_x" to "Rated" to distinguish it from the critic's RT rating, since the values of rating_x are PG, R, etc.

rot = rot.rename(columns={"rating_x": "Rated"})
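
An alternative that avoids the `_x`/`_y` suffixes altogether is to rename the two `rating` columns before merging, and to let pandas confirm that each movie `id` appears only once on the movie side. This is a sketch on stand-in frames whose values are copied from the outputs above; `movies`, `reviews`, and `merged` are illustrative names.

```python
import pandas as pd

movies = pd.DataFrame({
    "id": [1, 3, 5],
    "rating": ["R", "R", "R"],          # audience classification (R, PG, ...)
})
reviews = pd.DataFrame({
    "id": [3, 3, 5],
    "rating": ["3/5", None, "4/5"],     # critic's score
    "fresh": ["fresh", "rotten", "fresh"],
})

merged = (
    movies.rename(columns={"rating": "Rated"})
    .merge(
        reviews.rename(columns={"rating": "Rating"}),
        on="id",
        how="inner",
        validate="one_to_many",  # raises MergeError if a movie id repeats
    )
)
print(merged.columns.tolist())
```

Movie 1 has no reviews in this sample, so the inner join drops it, just as the real merge keeps only the movies that have at least one review.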

In [24]:
rot

Unnamed: 0,id,synopsis,Rated,genre,director,writer,theater_date,dvd_date,currency,box_office,runtime,studio,review,rating_y,fresh,critic,top_critic,publisher,date
0,3,"New York City, not-too-distant-future: Eric Pa...",R,Drama|Science Fiction and Fantasy,David Cronenberg,David Cronenberg|Don DeLillo,"Aug 17, 2012","Jan 1, 2013",$,600000,108 minutes,Entertainment One,A distinctly gallows take on contemporary fina...,3/5,fresh,PJ Nabarro,0,Patrick Nabarro,"November 10, 2018"
1,3,"New York City, not-too-distant-future: Eric Pa...",R,Drama|Science Fiction and Fantasy,David Cronenberg,David Cronenberg|Don DeLillo,"Aug 17, 2012","Jan 1, 2013",$,600000,108 minutes,Entertainment One,It's an allegory in search of a meaning that n...,,rotten,Annalee Newitz,0,io9.com,"May 23, 2018"
2,3,"New York City, not-too-distant-future: Eric Pa...",R,Drama|Science Fiction and Fantasy,David Cronenberg,David Cronenberg|Don DeLillo,"Aug 17, 2012","Jan 1, 2013",$,600000,108 minutes,Entertainment One,... life lived in a bubble in financial dealin...,,fresh,Sean Axmaker,0,Stream on Demand,"January 4, 2018"
3,3,"New York City, not-too-distant-future: Eric Pa...",R,Drama|Science Fiction and Fantasy,David Cronenberg,David Cronenberg|Don DeLillo,"Aug 17, 2012","Jan 1, 2013",$,600000,108 minutes,Entertainment One,Continuing along a line introduced in last yea...,,fresh,Daniel Kasman,0,MUBI,"November 16, 2017"
4,3,"New York City, not-too-distant-future: Eric Pa...",R,Drama|Science Fiction and Fantasy,David Cronenberg,David Cronenberg|Don DeLillo,"Aug 17, 2012","Jan 1, 2013",$,600000,108 minutes,Entertainment One,... a perverse twist on neorealism...,,fresh,,0,Cinema Scope,"October 12, 2017"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
54427,2000,"Suspended from the force, Paris cop Hubert is ...",R,Action and Adventure|Art House and Internation...,,Luc Besson,"Sep 27, 2001","Feb 11, 2003",,,94 minutes,Columbia Pictures,The real charm of this trifle is the deadpan c...,,fresh,Laura Sinagra,1,Village Voice,"September 24, 2002"
54428,2000,"Suspended from the force, Paris cop Hubert is ...",R,Action and Adventure|Art House and Internation...,,Luc Besson,"Sep 27, 2001","Feb 11, 2003",,,94 minutes,Columbia Pictures,,1/5,rotten,Michael Szymanski,0,Zap2it.com,"September 21, 2005"
54429,2000,"Suspended from the force, Paris cop Hubert is ...",R,Action and Adventure|Art House and Internation...,,Luc Besson,"Sep 27, 2001","Feb 11, 2003",,,94 minutes,Columbia Pictures,,2/5,rotten,Emanuel Levy,0,EmanuelLevy.Com,"July 17, 2005"
54430,2000,"Suspended from the force, Paris cop Hubert is ...",R,Action and Adventure|Art House and Internation...,,Luc Besson,"Sep 27, 2001","Feb 11, 2003",,,94 minutes,Columbia Pictures,,2.5/5,rotten,Christopher Null,0,Filmcritic.com,"September 7, 2003"


In [196]:
# review rows whose genre is exactly "Documentary|Special Interest", counted by director
droppeddirs = rot.loc[rot["genre"] == "Documentary|Special Interest"]
droppeddirs["director"].value_counts()

Adam Ravetch|Sarah Robertson                   93
Matt Tynauer|Matt Tyrnauer                     70
Leslie Cockburn                                32
Peter Byck                                     15
Errol Morris                                   11
Tim Nackashi|David Sampliner|Eric DelaBarre     6
Brent Leung                                     3
Emile de Antonio|Dan Talbot                     1
Name: director, dtype: int64

In [10]:
rot["Rated"].value_counts()

R        24371
PG-13    18008
PG        8246
NR        2650
G         1071
Name: Rated, dtype: int64

In [11]:
rot["studio"].value_counts()

Universal Pictures         4422
Paramount Pictures         3141
20th Century Fox           2415
Sony Pictures              2135
Sony Pictures Classics     2094
                           ... 
Knowledge Matters             3
Corridor                      1
Fox                           1
MVD Entertainment Group       1
Gravitas                      1
Name: studio, Length: 162, dtype: int64

In [12]:
rot = rot.rename(columns={"rating_y": "Rating"})

In [13]:
rot["Rating"].value_counts()[:20]

3/5      4327
4/5      3672
3/4      3577
2/5      3160
2/4      2712
2.5/4    2381
3.5/4    1777
3.5/5    1289
5/5      1237
B        1163
1/5      1113
1.5/4    1095
4/4       995
2.5/5     992
B+        832
1/4       822
B-        821
C         779
C+        665
4.5/5     567
Name: Rating, dtype: int64

In [39]:
rot["Rating"].value_counts()[20:40]

7/10     522
A-       514
8/10     505
C-       493
6/10     468
1.5/5    405
A        397
5/10     351
D        324
9/10     304
4/10     252
D+       204
0/5      162
8        143
3/10     140
1        138
0/4      132
7        125
F        109
0.5/4    100
Name: Rating, dtype: int64

The rating formats aren't consistent across reviews (x/5, x/4, x/10, letter grades, and bare numbers all appear). Looking at these value counts, I will start by pulling out the reviews rated x/5 so that I have a single scale to work with.

In [14]:
relevant_ratings_5 = ["3/5", "4/5", "2/5", "3.5/5", "5/5", "1/5", "2.5/5", "4.5/5", "1.5/5", "0/5"]

In [15]:
ratings_out_of_5 = rot.loc[rot['Rating'].isin(relevant_ratings_5)]
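
The hand-written list works, but the same subset could also be selected by pattern, which would pick up any `x/5` value that is not in the list. A sketch, assuming the `Rating` column holds strings or NaN as shown above; `ratings` and `is_out_of_5` are illustrative names.

```python
import pandas as pd

ratings = pd.Series(["3/5", "3.5/4", "2.5/5", "B+", None, "4.3/10", "0/5", "7"])

# Match a whole number or decimal followed by "/5" and nothing else;
# na=False treats missing ratings as non-matches.
is_out_of_5 = ratings.str.match(r"^\d+(?:\.\d+)?/5$", na=False)

print(ratings[is_out_of_5].tolist())
```

On the merged frame the equivalent mask would be built from `rot["Rating"]` and passed to `rot.loc[...]`.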

In [42]:
ratings_out_of_5

Unnamed: 0,id,synopsis,Rated,genre,director,writer,theater_date,dvd_date,currency,box_office,runtime,studio,review,Rating,fresh,critic,top_critic,publisher,date
0,3,"New York City, not-too-distant-future: Eric Pa...",R,Drama|Science Fiction and Fantasy,David Cronenberg,David Cronenberg|Don DeLillo,"Aug 17, 2012","Jan 1, 2013",$,600000,108 minutes,Entertainment One,A distinctly gallows take on contemporary fina...,3/5,fresh,PJ Nabarro,0,Patrick Nabarro,"November 10, 2018"
7,3,"New York City, not-too-distant-future: Eric Pa...",R,Drama|Science Fiction and Fantasy,David Cronenberg,David Cronenberg|Don DeLillo,"Aug 17, 2012","Jan 1, 2013",$,600000,108 minutes,Entertainment One,Cronenberg is not a director to be daunted by ...,2/5,rotten,Matt Kelemen,0,Las Vegas CityLife,"April 21, 2013"
15,3,"New York City, not-too-distant-future: Eric Pa...",R,Drama|Science Fiction and Fantasy,David Cronenberg,David Cronenberg|Don DeLillo,"Aug 17, 2012","Jan 1, 2013",$,600000,108 minutes,Entertainment One,For better or worse - often both - Cosmopolis ...,3/5,fresh,Adam Ross,0,The Aristocrat,"September 27, 2012"
16,3,"New York City, not-too-distant-future: Eric Pa...",R,Drama|Science Fiction and Fantasy,David Cronenberg,David Cronenberg|Don DeLillo,"Aug 17, 2012","Jan 1, 2013",$,600000,108 minutes,Entertainment One,For one of the smartest films I've seen in a w...,4/5,fresh,Patrick Kolan,0,Shotgun Cinema,"September 26, 2012"
23,3,"New York City, not-too-distant-future: Eric Pa...",R,Drama|Science Fiction and Fantasy,David Cronenberg,David Cronenberg|Don DeLillo,"Aug 17, 2012","Jan 1, 2013",$,600000,108 minutes,Entertainment One,Those who said Don DeLillo's book was unfilmab...,2/5,rotten,Mike Scott,0,Times-Picayune,"September 7, 2012"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
54424,2000,"Suspended from the force, Paris cop Hubert is ...",R,Action and Adventure|Art House and Internation...,,Luc Besson,"Sep 27, 2001","Feb 11, 2003",,,94 minutes,Columbia Pictures,Dawdles and drags when it should pop; it doesn...,1.5/5,rotten,Manohla Dargis,1,Los Angeles Times,"September 26, 2002"
54428,2000,"Suspended from the force, Paris cop Hubert is ...",R,Action and Adventure|Art House and Internation...,,Luc Besson,"Sep 27, 2001","Feb 11, 2003",,,94 minutes,Columbia Pictures,,1/5,rotten,Michael Szymanski,0,Zap2it.com,"September 21, 2005"
54429,2000,"Suspended from the force, Paris cop Hubert is ...",R,Action and Adventure|Art House and Internation...,,Luc Besson,"Sep 27, 2001","Feb 11, 2003",,,94 minutes,Columbia Pictures,,2/5,rotten,Emanuel Levy,0,EmanuelLevy.Com,"July 17, 2005"
54430,2000,"Suspended from the force, Paris cop Hubert is ...",R,Action and Adventure|Art House and Internation...,,Luc Besson,"Sep 27, 2001","Feb 11, 2003",,,94 minutes,Columbia Pictures,,2.5/5,rotten,Christopher Null,0,Filmcritic.com,"September 7, 2003"


In [43]:
ratings_out_of_5.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 16924 entries, 0 to 54431
Data columns (total 19 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   id            16924 non-null  int64 
 1   synopsis      16889 non-null  object
 2   Rated         16908 non-null  object
 3   genre         16908 non-null  object
 4   director      15267 non-null  object
 5   writer        14002 non-null  object
 6   theater_date  16451 non-null  object
 7   dvd_date      16451 non-null  object
 8   currency      9290 non-null   object
 9   box_office    9290 non-null   object
 10  runtime       16635 non-null  object
 11  studio        11014 non-null  object
 12  review        13302 non-null  object
 13  Rating        16924 non-null  object
 14  fresh         16924 non-null  object
 15  critic        15840 non-null  object
 16  top_critic    16924 non-null  int64 
 17  publisher     16799 non-null  object
 18  date          16924 non-null  object
dtypes: i

In [44]:
ratings_out_of_5["Rating"].value_counts()

3/5      4327
4/5      3672
2/5      3160
3.5/5    1289
5/5      1237
1/5      1113
2.5/5     992
4.5/5     567
1.5/5     405
0/5       162
Name: Rating, dtype: int64

Now I have a more measurable set of ratings: 16,924 reviews that all use the same 0-5 scale.
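
With a single scale, the strings can be converted to numbers and summarized. The sketch below works directly from the value counts printed above rather than from the frame; `score_out_of_5` is an illustrative helper, not something defined elsewhere in this notebook.

```python
# Value counts copied from ratings_out_of_5["Rating"].value_counts() above.
counts = {
    "3/5": 4327, "4/5": 3672, "2/5": 3160, "3.5/5": 1289, "5/5": 1237,
    "1/5": 1113, "2.5/5": 992, "4.5/5": 567, "1.5/5": 405, "0/5": 162,
}

def score_out_of_5(text: str) -> float:
    """Turn a string such as '3.5/5' into the number 3.5."""
    score, _, scale = text.partition("/")
    if scale != "5":
        raise ValueError(f"not an x/5 rating: {text!r}")
    return float(score)

# Weighted mean of the scores, using the counts as weights.
total_reviews = sum(counts.values())
mean_score = sum(score_out_of_5(k) * n for k, n in counts.items()) / total_reviews
print(total_reviews, round(mean_score, 2))
```

The total matches the 16,924 rows reported by `info()`. On the frame, the same conversion could be written as `ratings_out_of_5["Rating"].map(score_out_of_5)`.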

In [16]:
ratings_out_of_5["studio"].value_counts()[:20]

Universal Pictures          1154
Paramount Pictures           802
20th Century Fox             678
Sony Pictures                639
Sony Pictures Classics       604
Warner Bros. Pictures        499
The Weinstein Company        303
Columbia Pictures            292
Fox Searchlight Pictures     282
Walt Disney Pictures         269
MGM                          258
Warner Bros.                 218
Focus Features               215
Miramax Films                212
Roadside Attractions         186
New Line Cinema              174
Paramount Vantage            160
Summit Entertainment         158
Lionsgate Films              148
Newmarket Film Group         124
Name: studio, dtype: int64

Here is what I want to explore next:
1) Is there a relationship between the rating and the studio? Are certain studios getting better reviews from the critics than others?
2) Do certain genres receive poorer or better ratings? Which should we avoid, and which may be a winner with the crowds?
3) What dates are we working with in this set? I will probably use theater_date to start.
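
The first question could be approached by converting the `x/5` strings to numbers and averaging per studio. This is a sketch on a handful of studio and rating pairs copied from rows displayed in this notebook, so the averages describe only these rows; `sample`, `score`, and `by_studio` are names introduced here for illustration.

```python
import pandas as pd

# Studio and rating pairs copied from rows displayed in this notebook.
sample = pd.DataFrame({
    "studio": ["Entertainment One"] * 5
              + ["Warner Bros. Pictures"] * 5
              + ["Columbia Pictures"] * 4,
    "Rating": ["3/5", "2/5", "3/5", "4/5", "2/5",
               "4/5", "2.5/5", "4/5", "3.5/5", "4/5",
               "1.5/5", "1/5", "2/5", "2.5/5"],
})

# Keep the part before the slash and convert it to a float score.
sample["score"] = sample["Rating"].str.split("/").str[0].astype(float)

# Mean score and number of reviews per studio, best-rated first.
by_studio = (
    sample.groupby("studio")["score"]
    .agg(["mean", "count"])
    .sort_values("mean", ascending=False)
)
print(by_studio)
```

Keeping the review count next to the mean helps avoid over-reading studios with only a few reviews. Studio names may also need consolidating first, since "Warner Bros. Pictures" and "Warner Bros." appear as separate entries in the counts above.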

In [17]:
top20studios = ratings_out_of_5["studio"].value_counts()[:20].index

In [49]:
top20studios

Index(['Universal Pictures', 'Paramount Pictures', '20th Century Fox',
       'Sony Pictures', 'Sony Pictures Classics', 'Warner Bros. Pictures',
       'The Weinstein Company', 'Columbia Pictures',
       'Fox Searchlight Pictures', 'Walt Disney Pictures', 'MGM',
       'Warner Bros.', 'Focus Features', 'Miramax Films',
       'Roadside Attractions', 'New Line Cinema', 'Paramount Vantage',
       'Summit Entertainment', 'Lionsgate Films', 'Newmarket Film Group'],
      dtype='object')

In [18]:
# keep only the x/5 reviews whose studio is one of the 20 most-reviewed studios
top20studios_df = ratings_out_of_5.loc[ratings_out_of_5["studio"].isin(top20studios)]

In [51]:
top20studios_df

Unnamed: 0,id,synopsis,Rated,genre,director,writer,theater_date,dvd_date,currency,box_office,runtime,studio,review,Rating,fresh,critic,top_critic,publisher,date
243,8,The year is 1942. As the Allies unite overseas...,PG,Drama|Kids and Family,Jay Russell,Gail Gilchriest,"Mar 3, 2000","Jul 11, 2000",,,95 minutes,Warner Bros. Pictures,"Great boy-and-dog tale, but be prepared for te...",4/5,fresh,Nell Minow,0,Common Sense Media,"December 26, 2010"
248,8,The year is 1942. As the Allies unite overseas...,PG,Drama|Kids and Family,Jay Russell,Gail Gilchriest,"Mar 3, 2000","Jul 11, 2000",,,95 minutes,Warner Bros. Pictures,They may find themselves mystified and a littl...,2.5/5,rotten,,1,New York Times,"January 1, 2000"
254,8,The year is 1942. As the Allies unite overseas...,PG,Drama|Kids and Family,Jay Russell,Gail Gilchriest,"Mar 3, 2000","Jul 11, 2000",,,95 minutes,Warner Bros. Pictures,Lyrical and very touching.,4/5,fresh,Nell Minow,0,Movie Mom,"January 1, 2000"
263,8,The year is 1942. As the Allies unite overseas...,PG,Drama|Kids and Family,Jay Russell,Gail Gilchriest,"Mar 3, 2000","Jul 11, 2000",,,95 minutes,Warner Bros. Pictures,The movie doesn't so much tug at your heart as...,3.5/5,fresh,Robert Faires,0,Austin Chronicle,"January 1, 2000"
271,8,The year is 1942. As the Allies unite overseas...,PG,Drama|Kids and Family,Jay Russell,Gail Gilchriest,"Mar 3, 2000","Jul 11, 2000",,,95 minutes,Warner Bros. Pictures,Are guaranteed to cause alarm and even tears.,4/5,fresh,Desson Thomson,1,Washington Post,"January 1, 2000"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
54424,2000,"Suspended from the force, Paris cop Hubert is ...",R,Action and Adventure|Art House and Internation...,,Luc Besson,"Sep 27, 2001","Feb 11, 2003",,,94 minutes,Columbia Pictures,Dawdles and drags when it should pop; it doesn...,1.5/5,rotten,Manohla Dargis,1,Los Angeles Times,"September 26, 2002"
54428,2000,"Suspended from the force, Paris cop Hubert is ...",R,Action and Adventure|Art House and Internation...,,Luc Besson,"Sep 27, 2001","Feb 11, 2003",,,94 minutes,Columbia Pictures,,1/5,rotten,Michael Szymanski,0,Zap2it.com,"September 21, 2005"
54429,2000,"Suspended from the force, Paris cop Hubert is ...",R,Action and Adventure|Art House and Internation...,,Luc Besson,"Sep 27, 2001","Feb 11, 2003",,,94 minutes,Columbia Pictures,,2/5,rotten,Emanuel Levy,0,EmanuelLevy.Com,"July 17, 2005"
54430,2000,"Suspended from the force, Paris cop Hubert is ...",R,Action and Adventure|Art House and Internation...,,Luc Besson,"Sep 27, 2001","Feb 11, 2003",,,94 minutes,Columbia Pictures,,2.5/5,rotten,Christopher Null,0,Filmcritic.com,"September 7, 2003"


In [53]:
len(top20studios_df['id'].unique())

220

My new dataframe, top20studios_df, narrows our movie review data to roughly 7,400 reviews (7,375 rows) covering 220 different movies. We can always increase these numbers if need be, but I want to see if there is any connection with the most-reviewed studios, since their movies may be the most widely watched and so could give us some important information.

In [54]:
top20studios_df["theater_date"].value_counts()

Jan 9, 2015     153
Nov 21, 2007    128
Aug 3, 2018     117
Aug 13, 2010    114
Dec 1, 2017     104
               ... 
Sep 1, 2000       8
Nov 10, 1999      8
Jun 11, 2004      5
Oct 13, 1981      3
Dec 25, 1974      2
Name: theater_date, Length: 203, dtype: int64

At first glance, we can see that there are movies from all different eras included in our set.
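
To put actual years on that impression, the theater_date strings can be parsed into datetimes. This is a minimal sketch on a few dates copied from the output above. The format string is my assumption about how every entry is written, and errors="coerce" turns anything that does not match into NaT instead of raising.

```python
import pandas as pd

# A few theater_date values copied from the value_counts() output above,
# plus one missing value to show how it is handled.
sample_dates = pd.Series(["Jan 9, 2015", "Nov 21, 2007", "Oct 13, 1981", "Dec 25, 1974", None])

# Assumed format: abbreviated month, day, four-digit year.
parsed = pd.to_datetime(sample_dates, format="%b %d, %Y", errors="coerce")

# min() and max() skip missing dates.
earliest_year = int(parsed.min().year)
latest_year = int(parsed.max().year)
print(earliest_year, latest_year)
```

The same call on top20studios_df["theater_date"] would give the full range for the filtered set.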

In [55]:
top20studios_df["theater_date"].value_counts()[:20]

Jan 9, 2015     153
Nov 21, 2007    128
Aug 3, 2018     117
Aug 13, 2010    114
Dec 1, 2017     104
Mar 9, 2012     102
May 20, 2016    101
Nov 9, 2012      86
Dec 14, 2012     79
Nov 25, 2011     79
Nov 20, 2015     77
Nov 28, 2014     77
Mar 16, 2018     72
Dec 20, 2013     72
Aug 12, 2016     69
Oct 16, 2015     69
Oct 22, 2010     67
Nov 23, 2012     66
Jan 14, 2011     64
Aug 23, 2013     64
Name: theater_date, dtype: int64

In [59]:
jan92015 = top20studios_df.loc[top20studios_df["theater_date"] == "Jan 9, 2015"]

In [61]:
jan92015["id"].value_counts()
#was just testing if this release date served any relevance being the most common, but it is only affiliated with 2 different movies

1960    94
689     59
Name: id, dtype: int64

I think I want to convert all the ratings in my data set to plain numbers so that I can plot them. So 3.5/5 will turn into just 3.5.
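
One way to do this without typing out every value is to split each string on the slash and keep the number in front of it. This is only a sketch on made-up sample values, and it assumes every entry really is written as "x/5".

```python
import pandas as pd

# Sample values in the same "x/5" form as the Rating column.
sample = pd.Series(["3/5", "3.5/5", "0/5", "4.5/5"])

# Keep the text before the slash and convert it to a float.
numeric = sample.str.split("/").str[0].astype(float)
print(numeric.tolist())
```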

In [63]:
ratings_out_of_5["Rating"].value_counts()

3/5      4327
4/5      3672
2/5      3160
3.5/5    1289
5/5      1237
1/5      1113
2.5/5     992
4.5/5     567
1.5/5     405
0/5       162
Name: Rating, dtype: int64

In [19]:
top20_ratings_copy = top20studios_df.copy() #working copy, just in case I mess up the column value replacements

In [91]:
top20_ratings_copy

Unnamed: 0,id,synopsis,Rated,genre,director,writer,theater_date,dvd_date,currency,box_office,runtime,studio,review,Rating,fresh,critic,top_critic,publisher,date
243,8,The year is 1942. As the Allies unite overseas...,PG,Drama|Kids and Family,Jay Russell,Gail Gilchriest,"Mar 3, 2000","Jul 11, 2000",,,95 minutes,Warner Bros. Pictures,"Great boy-and-dog tale, but be prepared for te...",4/5,fresh,Nell Minow,0,Common Sense Media,"December 26, 2010"
248,8,The year is 1942. As the Allies unite overseas...,PG,Drama|Kids and Family,Jay Russell,Gail Gilchriest,"Mar 3, 2000","Jul 11, 2000",,,95 minutes,Warner Bros. Pictures,They may find themselves mystified and a littl...,2.5/5,rotten,,1,New York Times,"January 1, 2000"
254,8,The year is 1942. As the Allies unite overseas...,PG,Drama|Kids and Family,Jay Russell,Gail Gilchriest,"Mar 3, 2000","Jul 11, 2000",,,95 minutes,Warner Bros. Pictures,Lyrical and very touching.,4/5,fresh,Nell Minow,0,Movie Mom,"January 1, 2000"
263,8,The year is 1942. As the Allies unite overseas...,PG,Drama|Kids and Family,Jay Russell,Gail Gilchriest,"Mar 3, 2000","Jul 11, 2000",,,95 minutes,Warner Bros. Pictures,The movie doesn't so much tug at your heart as...,3.5/5,fresh,Robert Faires,0,Austin Chronicle,"January 1, 2000"
271,8,The year is 1942. As the Allies unite overseas...,PG,Drama|Kids and Family,Jay Russell,Gail Gilchriest,"Mar 3, 2000","Jul 11, 2000",,,95 minutes,Warner Bros. Pictures,Are guaranteed to cause alarm and even tears.,4/5,fresh,Desson Thomson,1,Washington Post,"January 1, 2000"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
54424,2000,"Suspended from the force, Paris cop Hubert is ...",R,Action and Adventure|Art House and Internation...,,Luc Besson,"Sep 27, 2001","Feb 11, 2003",,,94 minutes,Columbia Pictures,Dawdles and drags when it should pop; it doesn...,1.5/5,rotten,Manohla Dargis,1,Los Angeles Times,"September 26, 2002"
54428,2000,"Suspended from the force, Paris cop Hubert is ...",R,Action and Adventure|Art House and Internation...,,Luc Besson,"Sep 27, 2001","Feb 11, 2003",,,94 minutes,Columbia Pictures,,1/5,rotten,Michael Szymanski,0,Zap2it.com,"September 21, 2005"
54429,2000,"Suspended from the force, Paris cop Hubert is ...",R,Action and Adventure|Art House and Internation...,,Luc Besson,"Sep 27, 2001","Feb 11, 2003",,,94 minutes,Columbia Pictures,,2/5,rotten,Emanuel Levy,0,EmanuelLevy.Com,"July 17, 2005"
54430,2000,"Suspended from the force, Paris cop Hubert is ...",R,Action and Adventure|Art House and Internation...,,Luc Besson,"Sep 27, 2001","Feb 11, 2003",,,94 minutes,Columbia Pictures,,2.5/5,rotten,Christopher Null,0,Filmcritic.com,"September 7, 2003"


In [20]:
#top20_ratings_copy = top20_ratings_copy["Rating"].replace({"3/5": "3", "4/5": "4", "2/5": "2", "3.5/5": "3.5", 
#                                                           "5/5": "5", "1/5": "1", "2.5/5": "2.5", "4.5/5": "4.5", 
#                                                          "1.5/5": "1.5", "0/5": "0"}, inplace=True)

ratings_replace_dict = {"3/5": "3", "4/5": "4", "2/5": "2", "3.5/5": "3.5", 
                        "5/5": "5", "1/5": "1", "2.5/5": "2.5", "4.5/5": "4.5", 
                        "1.5/5": "1.5", "0/5": "0"}

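#note: with inplace=True, replace() changes top20studios_df["Rating"] directly and returns None here,
#so top20_ratings_replaced ends up holding None rather than a new column
top20_ratings_replaced = top20studios_df["Rating"].replace(ratings_replace_dict, inplace=True)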



A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return super().replace(


In [21]:
top20studios_df
#looks like it worked...!!

Unnamed: 0,id,synopsis,Rated,genre,director,writer,theater_date,dvd_date,currency,box_office,runtime,studio,review,Rating,fresh,critic,top_critic,publisher,date
243,8,The year is 1942. As the Allies unite overseas...,PG,Drama|Kids and Family,Jay Russell,Gail Gilchriest,"Mar 3, 2000","Jul 11, 2000",,,95 minutes,Warner Bros. Pictures,"Great boy-and-dog tale, but be prepared for te...",4,fresh,Nell Minow,0,Common Sense Media,"December 26, 2010"
248,8,The year is 1942. As the Allies unite overseas...,PG,Drama|Kids and Family,Jay Russell,Gail Gilchriest,"Mar 3, 2000","Jul 11, 2000",,,95 minutes,Warner Bros. Pictures,They may find themselves mystified and a littl...,2.5,rotten,,1,New York Times,"January 1, 2000"
254,8,The year is 1942. As the Allies unite overseas...,PG,Drama|Kids and Family,Jay Russell,Gail Gilchriest,"Mar 3, 2000","Jul 11, 2000",,,95 minutes,Warner Bros. Pictures,Lyrical and very touching.,4,fresh,Nell Minow,0,Movie Mom,"January 1, 2000"
263,8,The year is 1942. As the Allies unite overseas...,PG,Drama|Kids and Family,Jay Russell,Gail Gilchriest,"Mar 3, 2000","Jul 11, 2000",,,95 minutes,Warner Bros. Pictures,The movie doesn't so much tug at your heart as...,3.5,fresh,Robert Faires,0,Austin Chronicle,"January 1, 2000"
271,8,The year is 1942. As the Allies unite overseas...,PG,Drama|Kids and Family,Jay Russell,Gail Gilchriest,"Mar 3, 2000","Jul 11, 2000",,,95 minutes,Warner Bros. Pictures,Are guaranteed to cause alarm and even tears.,4,fresh,Desson Thomson,1,Washington Post,"January 1, 2000"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
54424,2000,"Suspended from the force, Paris cop Hubert is ...",R,Action and Adventure|Art House and Internation...,,Luc Besson,"Sep 27, 2001","Feb 11, 2003",,,94 minutes,Columbia Pictures,Dawdles and drags when it should pop; it doesn...,1.5,rotten,Manohla Dargis,1,Los Angeles Times,"September 26, 2002"
54428,2000,"Suspended from the force, Paris cop Hubert is ...",R,Action and Adventure|Art House and Internation...,,Luc Besson,"Sep 27, 2001","Feb 11, 2003",,,94 minutes,Columbia Pictures,,1,rotten,Michael Szymanski,0,Zap2it.com,"September 21, 2005"
54429,2000,"Suspended from the force, Paris cop Hubert is ...",R,Action and Adventure|Art House and Internation...,,Luc Besson,"Sep 27, 2001","Feb 11, 2003",,,94 minutes,Columbia Pictures,,2,rotten,Emanuel Levy,0,EmanuelLevy.Com,"July 17, 2005"
54430,2000,"Suspended from the force, Paris cop Hubert is ...",R,Action and Adventure|Art House and Internation...,,Luc Besson,"Sep 27, 2001","Feb 11, 2003",,,94 minutes,Columbia Pictures,,2.5,rotten,Christopher Null,0,Filmcritic.com,"September 7, 2003"


In [77]:
# Somehow, when I created a copy of my top 20 production companies DF, it created a "NoneType" object that I cannot work with. 
#Will need to investigate tomorrow in the AM.
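
My best guess at the "NoneType" puzzle: in the pandas version used here, calling replace() with inplace=True edits the column directly and hands back None, so saving that result in a variable stores None. Below is a sketch of an alternative on a made-up frame (demo, mapping, and subset are illustrative names, not part of the notebook). It takes an explicit .copy() of the filtered rows and assigns the non-inplace result back, which should also sidestep the copy-of-a-slice warning.

```python
import pandas as pd

# Made-up stand-in for the merged review data.
demo = pd.DataFrame({"studio": ["A", "B", "A"], "Rating": ["3/5", "4/5", "2.5/5"]})
mapping = {"3/5": "3", "4/5": "4", "2.5/5": "2.5"}

# Explicit copy of the filtered rows, so later assignments are unambiguous.
subset = demo.loc[demo["studio"] == "A"].copy()

# Non-inplace replace returns a new Series, which is assigned back to the column.
subset["Rating"] = subset["Rating"].replace(mapping).astype(float)
print(subset["Rating"].tolist())
```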

In [22]:
top20studios_df["Rating"].value_counts()

3      1735
4      1521
2      1333
3.5     698
2.5     555
5       500
1       447
4.5     324
1.5     224
0        38
Name: Rating, dtype: int64

In [23]:
top20studios_df["Rating"].describe()

count     7375
unique      10
top          3
freq      1735
Name: Rating, dtype: object

In [103]:
top20studios_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7375 entries, 243 to 54431
Data columns (total 19 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   id            7375 non-null   int64 
 1   synopsis      7375 non-null   object
 2   Rated         7375 non-null   object
 3   genre         7375 non-null   object
 4   director      6529 non-null   object
 5   writer        6198 non-null   object
 6   theater_date  7358 non-null   object
 7   dvd_date      7358 non-null   object
 8   currency      5990 non-null   object
 9   box_office    5990 non-null   object
 10  runtime       7266 non-null   object
 11  studio        7375 non-null   object
 12  review        6823 non-null   object
 13  Rating        7375 non-null   object
 14  fresh         7375 non-null   object
 15  critic        6863 non-null   object
 16  top_critic    7375 non-null   int64 
 17  publisher     7310 non-null   object
 18  date          7375 non-null   object
dtypes: 

In [24]:
top20studios_df["Rating"] = top20studios_df["Rating"].astype(str).astype(float)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  top20studios_df["Rating"] = top20studios_df["Rating"].astype(str).astype(float)


In [25]:
top20studios_df

Unnamed: 0,id,synopsis,Rated,genre,director,writer,theater_date,dvd_date,currency,box_office,runtime,studio,review,Rating,fresh,critic,top_critic,publisher,date
243,8,The year is 1942. As the Allies unite overseas...,PG,Drama|Kids and Family,Jay Russell,Gail Gilchriest,"Mar 3, 2000","Jul 11, 2000",,,95 minutes,Warner Bros. Pictures,"Great boy-and-dog tale, but be prepared for te...",4.0,fresh,Nell Minow,0,Common Sense Media,"December 26, 2010"
248,8,The year is 1942. As the Allies unite overseas...,PG,Drama|Kids and Family,Jay Russell,Gail Gilchriest,"Mar 3, 2000","Jul 11, 2000",,,95 minutes,Warner Bros. Pictures,They may find themselves mystified and a littl...,2.5,rotten,,1,New York Times,"January 1, 2000"
254,8,The year is 1942. As the Allies unite overseas...,PG,Drama|Kids and Family,Jay Russell,Gail Gilchriest,"Mar 3, 2000","Jul 11, 2000",,,95 minutes,Warner Bros. Pictures,Lyrical and very touching.,4.0,fresh,Nell Minow,0,Movie Mom,"January 1, 2000"
263,8,The year is 1942. As the Allies unite overseas...,PG,Drama|Kids and Family,Jay Russell,Gail Gilchriest,"Mar 3, 2000","Jul 11, 2000",,,95 minutes,Warner Bros. Pictures,The movie doesn't so much tug at your heart as...,3.5,fresh,Robert Faires,0,Austin Chronicle,"January 1, 2000"
271,8,The year is 1942. As the Allies unite overseas...,PG,Drama|Kids and Family,Jay Russell,Gail Gilchriest,"Mar 3, 2000","Jul 11, 2000",,,95 minutes,Warner Bros. Pictures,Are guaranteed to cause alarm and even tears.,4.0,fresh,Desson Thomson,1,Washington Post,"January 1, 2000"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
54424,2000,"Suspended from the force, Paris cop Hubert is ...",R,Action and Adventure|Art House and Internation...,,Luc Besson,"Sep 27, 2001","Feb 11, 2003",,,94 minutes,Columbia Pictures,Dawdles and drags when it should pop; it doesn...,1.5,rotten,Manohla Dargis,1,Los Angeles Times,"September 26, 2002"
54428,2000,"Suspended from the force, Paris cop Hubert is ...",R,Action and Adventure|Art House and Internation...,,Luc Besson,"Sep 27, 2001","Feb 11, 2003",,,94 minutes,Columbia Pictures,,1.0,rotten,Michael Szymanski,0,Zap2it.com,"September 21, 2005"
54429,2000,"Suspended from the force, Paris cop Hubert is ...",R,Action and Adventure|Art House and Internation...,,Luc Besson,"Sep 27, 2001","Feb 11, 2003",,,94 minutes,Columbia Pictures,,2.0,rotten,Emanuel Levy,0,EmanuelLevy.Com,"July 17, 2005"
54430,2000,"Suspended from the force, Paris cop Hubert is ...",R,Action and Adventure|Art House and Internation...,,Luc Besson,"Sep 27, 2001","Feb 11, 2003",,,94 minutes,Columbia Pictures,,2.5,rotten,Christopher Null,0,Filmcritic.com,"September 7, 2003"


In [26]:
top20studios_df["Rating"].value_counts()

3.0    1735
4.0    1521
2.0    1333
3.5     698
2.5     555
5.0     500
1.0     447
4.5     324
1.5     224
0.0      38
Name: Rating, dtype: int64

In [27]:
top20studios_df["Rating"].describe()

count    7375.000000
mean        3.054441
std         1.074168
min         0.000000
25%         2.000000
50%         3.000000
75%         4.000000
max         5.000000
Name: Rating, dtype: float64

Hooray! We have converted these ratings to floats so that the ratings are now measurable numbers.

In [37]:
#Now I want to do a GROUP BY comparing the production companies to their average rating

In [36]:
averagerating = top20studios_df.groupby(["studio"])["Rating"].mean().sort_values(ascending=False)
averagerating

studio
The Weinstein Company       4.079208
Miramax Films               3.627358
Fox Searchlight Pictures    3.592199
Sony Pictures Classics      3.452815
Focus Features              3.427907
Newmarket Film Group        3.346774
Roadside Attractions        3.250000
Paramount Vantage           3.250000
MGM                         3.100775
Paramount Pictures          3.072319
Walt Disney Pictures        3.068773
Warner Bros. Pictures       3.062124
Sony Pictures               2.888889
Universal Pictures          2.885615
Warner Bros.                2.834862
New Line Cinema             2.787356
Columbia Pictures           2.655822
Summit Entertainment        2.623418
20th Century Fox            2.566372
Lionsgate Films             2.162162
Name: Rating, dtype: float64

Based on this exploratory data analysis, certain production companies (such as The Weinstein Company, Miramax Films, and Fox Searchlight Pictures) seem to receive better reviews overall than others. However, my data was reduced pretty heavily, so I want to compare this to the ratings of the entire data set.

Since the data set's reviews are all over the place (some out of 10, some A- or D+, some out of 4), I think I will first assess rotten vs. fresh. 

Based on information I found online, "fresh" is indicative of a review score of 60% or higher. Perhaps later I can convert all the ratings to percentages and go from there. But for now, I want to see if we get the same picture of the production companies if we use the entire data set but rank by freshness.

In [39]:
#bringing back our original merged dataframe "rot"

rot.head()

Unnamed: 0,id,synopsis,Rated,genre,director,writer,theater_date,dvd_date,currency,box_office,runtime,studio,review,Rating,fresh,critic,top_critic,publisher,date
0,3,"New York City, not-too-distant-future: Eric Pa...",R,Drama|Science Fiction and Fantasy,David Cronenberg,David Cronenberg|Don DeLillo,"Aug 17, 2012","Jan 1, 2013",$,600000,108 minutes,Entertainment One,A distinctly gallows take on contemporary fina...,3/5,fresh,PJ Nabarro,0,Patrick Nabarro,"November 10, 2018"
1,3,"New York City, not-too-distant-future: Eric Pa...",R,Drama|Science Fiction and Fantasy,David Cronenberg,David Cronenberg|Don DeLillo,"Aug 17, 2012","Jan 1, 2013",$,600000,108 minutes,Entertainment One,It's an allegory in search of a meaning that n...,,rotten,Annalee Newitz,0,io9.com,"May 23, 2018"
2,3,"New York City, not-too-distant-future: Eric Pa...",R,Drama|Science Fiction and Fantasy,David Cronenberg,David Cronenberg|Don DeLillo,"Aug 17, 2012","Jan 1, 2013",$,600000,108 minutes,Entertainment One,... life lived in a bubble in financial dealin...,,fresh,Sean Axmaker,0,Stream on Demand,"January 4, 2018"
3,3,"New York City, not-too-distant-future: Eric Pa...",R,Drama|Science Fiction and Fantasy,David Cronenberg,David Cronenberg|Don DeLillo,"Aug 17, 2012","Jan 1, 2013",$,600000,108 minutes,Entertainment One,Continuing along a line introduced in last yea...,,fresh,Daniel Kasman,0,MUBI,"November 16, 2017"
4,3,"New York City, not-too-distant-future: Eric Pa...",R,Drama|Science Fiction and Fantasy,David Cronenberg,David Cronenberg|Don DeLillo,"Aug 17, 2012","Jan 1, 2013",$,600000,108 minutes,Entertainment One,... a perverse twist on neorealism...,,fresh,,0,Cinema Scope,"October 12, 2017"


In [40]:
rot.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 54432 entries, 0 to 54431
Data columns (total 19 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   id            54432 non-null  int64 
 1   synopsis      54300 non-null  object
 2   Rated         54346 non-null  object
 3   genre         54345 non-null  object
 4   director      48992 non-null  object
 5   writer        45206 non-null  object
 6   theater_date  53206 non-null  object
 7   dvd_date      53206 non-null  object
 8   currency      33310 non-null  object
 9   box_office    33310 non-null  object
 10  runtime       53594 non-null  object
 11  studio        40125 non-null  object
 12  review        48869 non-null  object
 13  Rating        40915 non-null  object
 14  fresh         54432 non-null  object
 15  critic        51710 non-null  object
 16  top_critic    54432 non-null  int64 
 17  publisher     54123 non-null  object
 18  date          54432 non-null  object
dtypes: i

We can see here that there are NO missing values in the "fresh" column of this dataset, which indicates to me that it is very important to Rotten Tomatoes' rating system. It will also be helpful to us in gauging the success of all the movies in the set.

In [41]:
rot["fresh"].value_counts()

fresh     33035
rotten    21397
Name: fresh, dtype: int64

We have confirmed that the freshness column will return either "fresh" or "rotten". 

In [77]:
#I'm going to keep only a subset of columns here for viewing purposes
#selected_columns = ['A', 'C', 'D']
#new = select_columns(old, selected_columns)

freshness_df = rot[["id","Rated","genre","director","studio","review","Rating","fresh","critic"]].copy()
freshness_df

Unnamed: 0,id,Rated,genre,director,studio,review,Rating,fresh,critic
0,3,R,Drama|Science Fiction and Fantasy,David Cronenberg,Entertainment One,A distinctly gallows take on contemporary fina...,3/5,fresh,PJ Nabarro
1,3,R,Drama|Science Fiction and Fantasy,David Cronenberg,Entertainment One,It's an allegory in search of a meaning that n...,,rotten,Annalee Newitz
2,3,R,Drama|Science Fiction and Fantasy,David Cronenberg,Entertainment One,... life lived in a bubble in financial dealin...,,fresh,Sean Axmaker
3,3,R,Drama|Science Fiction and Fantasy,David Cronenberg,Entertainment One,Continuing along a line introduced in last yea...,,fresh,Daniel Kasman
4,3,R,Drama|Science Fiction and Fantasy,David Cronenberg,Entertainment One,... a perverse twist on neorealism...,,fresh,
...,...,...,...,...,...,...,...,...,...
54427,2000,R,Action and Adventure|Art House and Internation...,,Columbia Pictures,The real charm of this trifle is the deadpan c...,,fresh,Laura Sinagra
54428,2000,R,Action and Adventure|Art House and Internation...,,Columbia Pictures,,1/5,rotten,Michael Szymanski
54429,2000,R,Action and Adventure|Art House and Internation...,,Columbia Pictures,,2/5,rotten,Emanuel Levy
54430,2000,R,Action and Adventure|Art House and Internation...,,Columbia Pictures,,2.5/5,rotten,Christopher Null


In [78]:
freshness_replace = {"fresh":1, "rotten":0}
#assign the result back to the column; replace(..., inplace=True) would return None
freshness_df["fresh"] = freshness_df["fresh"].replace(freshness_replace)
#studio duplicates - studio_replace = {}

In [79]:
freshness_df
#confirming that my new DF has a rating of 1 or 0 for freshness

Unnamed: 0,id,Rated,genre,director,studio,review,Rating,fresh,critic
0,3,R,Drama|Science Fiction and Fantasy,David Cronenberg,Entertainment One,A distinctly gallows take on contemporary fina...,3/5,1,PJ Nabarro
1,3,R,Drama|Science Fiction and Fantasy,David Cronenberg,Entertainment One,It's an allegory in search of a meaning that n...,,0,Annalee Newitz
2,3,R,Drama|Science Fiction and Fantasy,David Cronenberg,Entertainment One,... life lived in a bubble in financial dealin...,,1,Sean Axmaker
3,3,R,Drama|Science Fiction and Fantasy,David Cronenberg,Entertainment One,Continuing along a line introduced in last yea...,,1,Daniel Kasman
4,3,R,Drama|Science Fiction and Fantasy,David Cronenberg,Entertainment One,... a perverse twist on neorealism...,,1,
...,...,...,...,...,...,...,...,...,...
54427,2000,R,Action and Adventure|Art House and Internation...,,Columbia Pictures,The real charm of this trifle is the deadpan c...,,1,Laura Sinagra
54428,2000,R,Action and Adventure|Art House and Internation...,,Columbia Pictures,,1/5,0,Michael Szymanski
54429,2000,R,Action and Adventure|Art House and Internation...,,Columbia Pictures,,2/5,0,Emanuel Levy
54430,2000,R,Action and Adventure|Art House and Internation...,,Columbia Pictures,,2.5/5,0,Christopher Null


In [80]:
#make sure the 0/1 freshness flags are stored as numbers
freshness_df["fresh"] = pd.to_numeric(freshness_df["fresh"])

In [82]:
freshness_df["fresh"]

0        1
1        0
2        1
3        1
4        1
        ..
54427    1
54428    0
54429    0
54430    0
54431    1
Name: fresh, Length: 54432, dtype: int64

In [83]:
averagefreshness = freshness_df.groupby(["studio"])["fresh"].mean().sort_values(ascending=False)

In [85]:
averagefreshness[:20]

studio
Fox                                        1.000000
Iskander Films                             1.000000
Corridor                                   1.000000
Criterion Collection                       1.000000
MVD Entertainment Group                    1.000000
Variance Films                             1.000000
Buffalo Films                              0.961538
STX Entertainment                          0.947059
Laemmle/Zeller Films                       0.939394
The Film Arcade                            0.937500
The Weinstein Company                      0.930624
United Artists Pictures                    0.927835
Fingerprint Releasing / Bleecker Street    0.926724
Film 44                                    0.916667
Film District                              0.905109
Universal                                  0.904762
Walt Disney Animation Studios              0.897727
Janus Films                                0.896226
Aviron Pictures                            0.882353
A24  

In [86]:
top_20_freshest = averagefreshness[:20].index

We see some variation from our original findings here when including all of the reviews and ranking studios by their share of fresh reviews. Fox, Iskander Films, Corridor, Criterion Collection, MVD Entertainment Group and Variance Films all have a mean of 1.0, meaning every one of their reviews is marked fresh (60% or higher).

That being said, Fox probably has a lot more reviews than some of these other companies. I now want to sort the studios by number of reviews.

In [87]:
top_20_freshest

Index(['Fox', 'Iskander Films', 'Corridor', 'Criterion Collection',
       'MVD Entertainment Group', 'Variance Films', 'Buffalo Films',
       'STX Entertainment', 'Laemmle/Zeller Films', 'The Film Arcade',
       'The Weinstein Company', 'United Artists Pictures',
       'Fingerprint Releasing / Bleecker Street', 'Film 44', 'Film District',
       'Universal', 'Walt Disney Animation Studios', 'Janus Films',
       'Aviron Pictures', 'A24'],
      dtype='object', name='studio')

In [91]:
#freshness_and_counts = freshness_df.groupby(["studio"])["fresh"].count().sort_values(ascending=False)

In the cell below, I group my dataframe by studio and count the number of "fresh" (60% and up) ratings each one receives.

In [94]:
freshness_and_counts = freshness_df.groupby('studio')['fresh'].apply(lambda x: x[x == 1].count()).sort_values(ascending=False)



In [120]:
freshness_and_counts[:37]

studio
Universal Pictures                         2373
Paramount Pictures                         1760
Sony Pictures Classics                     1586
Warner Bros. Pictures                      1157
Sony Pictures                              1133
20th Century Fox                           1008
The Weinstein Company                       939
Fox Searchlight Pictures                    802
Focus Features                              611
Walt Disney Pictures                        569
Miramax Films                               550
MGM                                         501
Columbia Pictures                           450
Paramount Vantage                           429
New Line Cinema                             402
Warner Bros.                                399
Roadside Attractions                        357
Newmarket Film Group                        350
Lions Gate Films                            289
Fox Searchlight                             263
Buena Vista Pictures             

In [100]:
len(freshness_df["studio"].unique())

163

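With the above, I wanted to confirm that my grouping worked properly by comparing the number of studios in "freshness_and_counts" with the number of unique production studios in my dataframe. One caveat: unique() also counts NaN (reviews with no studio listed) as a value, so the 163 here corresponds to 162 named studios, which is the count that describe() reports for "freshness_and_counts" further down.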
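
A quick illustration of how missing values affect these counts, using a made-up Series: unique() keeps NaN as a value, while nunique() and groupby leave it out by default.

```python
import numpy as np
import pandas as pd

# Made-up studio names with two missing entries, and matching 0/1 freshness flags.
studios = pd.Series(["MGM", "A24", np.nan, "MGM", np.nan], name="studio")
fresh = pd.Series([1, 0, 1, 1, 0], name="fresh")

n_unique_with_nan = len(studios.unique())          # NaN is counted as a value
n_unique_no_nan = studios.nunique()                # NaN is skipped
n_groups = fresh.groupby(studios).sum().shape[0]   # groupby drops NaN keys by default

print(n_unique_with_nan, n_unique_no_nan, n_groups)
```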

Okay, now we can see that certain companies have significantly more reviews than others. In fact, companies like Iskander Films, Corridor, Criterion Collection, and MVD Entertainment Group had among the highest freshness averages, but could in theory have just one or two reviews behind that number.

In [104]:
freshness_and_counts.tail(10)

studio
Film Foundry Releasing            2
Freestyle Releasing               1
Fox                               1
Corridor                          1
Film Sales Company                1
MVD Entertainment Group           1
Gravitas                          0
Knowledge Matters                 0
20th Century Fox Film Corporat    0
Full Circle Releasing             0
Name: fresh, dtype: int64

Wow, this confirms that our initial mean calculation really doesn't tell us much without accounting for the number of reviews! Fox, Corridor, and MVD Entertainment Group each have only a single review attached to their studio name, which is far too small a sample to draw conclusions from.
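
One way to keep the average and the sample size side by side is to compute both in a single agg call. This is a sketch on a made-up frame. The column names fresh_rate, n_reviews, and n_fresh are my own labels, not anything from the data set.

```python
import pandas as pd

# Made-up reviews: one studio with a single review, two with several.
reviews = pd.DataFrame({
    "studio": ["Studio A", "Studio B", "Studio B", "Studio B", "Studio C", "Studio C"],
    "fresh":  [1, 1, 0, 1, 0, 1],
})

# Share of fresh reviews, total reviews, and number of fresh reviews per studio.
summary = (
    reviews.groupby("studio")["fresh"]
    .agg(fresh_rate="mean", n_reviews="count", n_fresh="sum")
    .sort_values("n_reviews", ascending=False)
)
print(summary)
```

Here Studio A has a perfect fresh_rate, but n_reviews shows that it rests on one review.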

In [114]:
freshness_and_counts.describe()

count     162.000000
mean      147.283951
std       317.219685
min         0.000000
25%        13.250000
50%        38.000000
75%       131.250000
max      2373.000000
Name: fresh, dtype: float64

The describe output above summarizes the number of positive/"fresh" reviews per studio across the full data set (including studios with none, since the minimum is 0). We can see that there is huge variation in how many fresh reviews certain studios have received compared to others. I'm thinking right now the goal is to identify which companies are having success and popularity that is reflected in their positive reviews.

It looks like the average number of fresh reviews per studio is about 147. So I will next filter out the companies that have fewer than this many, since we want to take popularity into account.
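
Rather than counting rows by eye to find where the cutoff falls, the threshold can be applied directly with a boolean mask. This is a sketch with made-up counts standing in for freshness_and_counts, using the mean as the cutoff as planned above.

```python
import pandas as pd

# Made-up fresh-review counts per studio (stand-in for freshness_and_counts).
counts = pd.Series({"Studio A": 300, "Studio B": 200, "Studio C": 100, "Studio D": 10, "Studio E": 0})

# Keep only the studios at or above the mean count.
threshold = counts.mean()
popular = counts[counts >= threshold].index
print(threshold, list(popular))
```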

In [124]:
#used the descending counts from before to find all the companies that had at least 147 positive/"fresh" reviews (the first 37 rows)

positive_reviews_companies = freshness_and_counts[:37].index

In [125]:
positive_reviews_companies

Index(['Universal Pictures', 'Paramount Pictures', 'Sony Pictures Classics',
       'Warner Bros. Pictures', 'Sony Pictures', '20th Century Fox',
       'The Weinstein Company', 'Fox Searchlight Pictures', 'Focus Features',
       'Walt Disney Pictures', 'Miramax Films', 'MGM', 'Columbia Pictures',
       'Paramount Vantage', 'New Line Cinema', 'Warner Bros.',
       'Roadside Attractions', 'Newmarket Film Group', 'Lions Gate Films',
       'Fox Searchlight', 'Buena Vista Pictures', 'DreamWorks SKG',
       'Fingerprint Releasing / Bleecker Street', 'USA Films', 'Universal',
       'Summit Entertainment', 'A24', 'A24 Films', 'Dimension Films', 'WB',
       'Paramount', 'STX Entertainment', 'Lionsgate Films',
       'Walt Disney Animation Studios', 'Buena Vista Distribution Compa',
       'IFC Films', 'Lionsgate'],
      dtype='object', name='studio')
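
The index above lists some studios under more than one spelling (for example 'Lions Gate Films', 'Lionsgate Films', and 'Lionsgate', or 'A24' and 'A24 Films'), which splits their reviews across several groups. Here is a sketch of how the names could be merged before grouping. Which spellings truly belong to the same company is my assumption and would need checking, and studio_aliases is a hypothetical mapping.

```python
import pandas as pd

# Hypothetical alias map: variant spelling -> canonical name (assumed groupings).
studio_aliases = {
    "Lions Gate Films": "Lionsgate",
    "Lionsgate Films": "Lionsgate",
    "A24 Films": "A24",
}

studios = pd.Series(["Lions Gate Films", "Lionsgate", "A24 Films", "A24", "MGM"])

# replace() leaves names that are not in the map untouched.
cleaned = studios.replace(studio_aliases)
print(cleaned.value_counts().to_dict())
```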

In [128]:
#I am creating a new DF that keeps all companies with 147 positive reviews or more

positive_reviews_ratings_df = freshness_df.loc[freshness_df["studio"].isin(positive_reviews_companies)]

#freshness_df.groupby(["studio"])["Rating"].mean().sort_values(ascending=False)


In [129]:
positive_reviews_ratings_df

Unnamed: 0,id,Rated,genre,director,studio,review,Rating,fresh,critic
243,8,PG,Drama|Kids and Family,Jay Russell,Warner Bros. Pictures,"Great boy-and-dog tale, but be prepared for te...",4/5,1,Nell Minow
244,8,PG,Drama|Kids and Family,Jay Russell,Warner Bros. Pictures,"Not a mawkish dying-dog tearjerker, ""My Dog Sk...",3.5/4,1,Nick Rogers
245,8,PG,Drama|Kids and Family,Jay Russell,Warner Bros. Pictures,Features two irresistible lead characters -- a...,2.5/4,1,Judith Egerton
246,8,PG,Drama|Kids and Family,Jay Russell,Warner Bros. Pictures,A heart-affecting boy-with-a-dog drama.,,1,
247,8,PG,Drama|Kids and Family,Jay Russell,Warner Bros. Pictures,An affectionate and tender family film...,C+,0,Dennis Schwartz
...,...,...,...,...,...,...,...,...,...
54427,2000,R,Action and Adventure|Art House and Internation...,,Columbia Pictures,The real charm of this trifle is the deadpan c...,,1,Laura Sinagra
54428,2000,R,Action and Adventure|Art House and Internation...,,Columbia Pictures,,1/5,0,Michael Szymanski
54429,2000,R,Action and Adventure|Art House and Internation...,,Columbia Pictures,,2/5,0,Emanuel Levy
54430,2000,R,Action and Adventure|Art House and Internation...,,Columbia Pictures,,2.5/5,0,Christopher Null


In [140]:
#Now I am going to get the average(mean) review score of these. We have a much larger data set to work with now, with 31,000 reviews.

#Need to convert the ratings to percentages... lots of cleaning going to be involved!

positive_reviews_ratings_df['Rating'].value_counts()[:50]



3/4       2242
3/5       1984
4/5       1888
2/4       1704
2.5/4     1518
2/5       1455
3.5/4     1152
3.5/5      849
1.5/4      727
B          726
4/4        632
2.5/5      614
5/5        579
B+         540
B-         519
C          500
1/5        487
1/4        468
C+         394
4.5/5      383
A-         341
8/10       327
6/10       305
C-         301
7/10       283
A          259
1.5/5      244
5/10       219
D          208
9/10       189
4/10       149
D+         130
1           85
3/10        82
8           69
0/4         68
F           59
7           55
D-          51
0.5/4       48
2/10        42
0/5         41
7.5/10      40
A+          38
6           36
8.5/10      35
0.5/5       30
5           25
3           22
5.5/10      21
Name: Rating, dtype: int64

So, so many review types. What a mess. I am now realizing that Rotten Tomatoes curates reviews from the web and converts them to its own freshness scale, but many movie news sources use their own rating scales. I want everything to be uniform, so I will do my best to convert all of these to percentages. It's tedious, but I think it will be worth it.
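As a rough idea of what that conversion could look like for the "x/y" style ratings, here is a small sketch on toy values (not the real data). The regular expression and the variable names are my own assumptions; letter grades and bare numbers are deliberately left as missing.

```python
import pandas as pd

# Toy ratings in the formats seen above; not the real data.
ratings = pd.Series(["3/4", "3.5/5", "8/10", "B+", "7", None])

# Capture "numerator/denominator"; anything else yields NaN in both columns.
parts = ratings.str.extract(r"^\s*(\d+(?:\.\d+)?)\s*/\s*(\d+(?:\.\d+)?)\s*$")
numerator = pd.to_numeric(parts[0])
denominator = pd.to_numeric(parts[1])

# Treat a zero denominator as missing so the division never produces inf.
percent = numerator / denominator.where(denominator != 0) * 100
print(percent.tolist())
```

Entries such as `2.1/2` would come out above 100, so a sanity check on the result is probably worth adding before averaging anything.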


In [142]:
positive_reviews_ratings_df.loc[positive_reviews_ratings_df['Rating'] == "7"]

Unnamed: 0,id,Rated,genre,director,studio,review,Rating,fresh,critic
1188,34,PG-13,Action and Adventure|Mystery and Suspense,John Woo,Paramount Pictures,Action fans will get a kick out of it. Mark th...,7,1,Brian Webster
1949,57,R,Comedy,,The Weinstein Company,Like a production of When Harry Met Sally put ...,7,1,Radheyan Simonpillai
2212,65,PG,Comedy|Kids and Family|Romance,Kevin Lima,Walt Disney Pictures,"It may not win over diehard Disney haters, and...",7,1,Brian Webster
3465,118,PG,Comedy|Science Fiction and Fantasy,Dean Parisot,DreamWorks SKG,Director Dean Parisot is willing to go more th...,7,1,Brian Webster
7863,300,PG-13,Drama,Ryan Murphy,Sony Pictures,"... an irritating, smug film about a horrible ...",7,0,Philip Martin
8602,322,R,Horror,Guillermo del Toro,Universal Pictures,Most times the tropes of gothic horror work we...,7,1,Hugo Hern
9158,338,PG-13,Drama|Mystery and Suspense,Clint Eastwood,Warner Bros. Pictures,Hereafter is an overly schematic melodrama tha...,7,0,Radheyan Simonpillai
9277,346,PG-13,Drama|Romance,Joan Chen,MGM,,7,0,Brian Webster
9901,377,R,Comedy|Drama|Musical and Performing Arts,Ang Lee,Focus Features,...it's hard to get past the idea of Taking Wo...,7,0,Philip Martin
11290,433,PG-13,Action and Adventure|Mystery and Suspense,Doug Liman,Universal Pictures,A great handbook on how to cop ideas from the ...,7,1,Dan Jardine


I am unfortunately going to need to drop anything that does not have a clear scale. For example, in the lookup above, I pulled every review that had a rating of 7. You can see that some are fresh and some aren't. Most would assume a 7 means 7/10, which would equal 70% and fall into the "fresh" category, but the mixed results show that the scale behind a bare number cannot be assumed.
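One possible way to flag these scale-less ratings is to look for values that are just a number with no "/" denominator. This is a sketch on a toy frame; the column names match the notebook, but the pattern and the names `is_bare_number` and `with_clear_scale` are assumptions.

```python
import pandas as pd

toy = pd.DataFrame({
    "Rating": ["7", "7/10", "8", "B", "3.5", None],
    "fresh":  [0, 1, 1, 1, 1, 1],
})

# A bare number is digits with an optional decimal part and no "/" denominator.
# na=False makes missing ratings count as "not a bare number".
is_bare_number = toy["Rating"].str.fullmatch(r"\d+(?:\.\d+)?", na=False).astype(bool)

# Keep ratings that have a scale or are not numeric at all (e.g. letter grades).
# Missing ratings stay in place here and can be dropped separately.
with_clear_scale = toy.loc[~is_bare_number]
print(with_clear_scale["Rating"].tolist())
```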

In [154]:
A_rating = positive_reviews_ratings_df.loc[positive_reviews_ratings_df['Rating'] == "A"]
A_rating["fresh"].value_counts()
#of course, all A ratings should be fresh

1    259
Name: fresh, dtype: int64

It is safe to say that we can include A ratings in our data set.

There are a LOT of B's! Let's test those out.

In [155]:
B_rating = positive_reviews_ratings_df.loc[positive_reviews_ratings_df['Rating'] == "B"]
B_rating["fresh"].value_counts()

1    725
0      1
Name: fresh, dtype: int64

Okay, besides one outlier, all of the B's are fresh across the board. Perfect.

In [157]:
Bminus_rating = positive_reviews_ratings_df.loc[positive_reviews_ratings_df['Rating'] == "B-"]
Bminus_rating["fresh"].value_counts()
#most Bminus ratings are fresh

1    491
0     28
Name: fresh, dtype: int64

In [150]:
C_rating = positive_reviews_ratings_df.loc[positive_reviews_ratings_df['Rating'] == "C"]
C_rating["fresh"].value_counts()

0    490
1     10
Name: fresh, dtype: int64

Okay, so it is pretty safe to assume that a "C" rating is not fresh. So far it looks like we can trust the letter grades enough to include them in our analysis, but I want to test some more.

In [152]:
Cplus_rating = positive_reviews_ratings_df.loc[positive_reviews_ratings_df['Rating'] == "C+"]
Cplus_rating["fresh"].value_counts()

0    336
1     58
Name: fresh, dtype: int64

Uh oh, a C+ rating actually has a bit more of a discrepancy than a C rating. The goal is for us to assess the positive reviews only, so let's move on to more rating scales.
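The grade-by-grade checks above could also be done in one pass. This is a sketch on toy data, assuming the same `Rating` and `fresh` columns; the `letter_grades` list is my own guess at the grades worth including.

```python
import pandas as pd

toy = pd.DataFrame({
    "Rating": ["A", "A", "B", "B-", "C", "C", "C+", "C+", "3/4"],
    "fresh":  [1,   1,   1,   0,    0,   0,   0,    1,    1],
})

letter_grades = ["A+", "A", "A-", "B+", "B", "B-", "C+", "C", "C-", "D+", "D", "D-", "F"]
letters_only = toy[toy["Rating"].isin(letter_grades)]

# Rows are grades, columns are the fresh flag (0 or 1), cells are review counts.
grade_vs_fresh = pd.crosstab(letters_only["Rating"], letters_only["fresh"])

# Share of reviews marked fresh for each grade.
fresh_share = letters_only.groupby("Rating")["fresh"].mean()
print(grade_vs_fresh)
print(fresh_share)
```

A grade whose fresh share sits near 0 or 1 is easy to classify, while one in the middle (like C+ in the real data) is the kind that needs a judgment call.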

It just dawned on me that I can drop the non-fresh values and will probably get a clearer indicator of which ratings correspond to "fresh" on the freshness scale.

I have also noticed a lot of duplicates in the studio names... I will clean those up as well with a dictionary. I think these two changes will make our data a lot cleaner:
1) Fix duplicate studios
2) Drop non-fresh movie ratings to gauge the successful movies only

In [160]:
positive_reviews_companies

Index(['Universal Pictures', 'Paramount Pictures', 'Sony Pictures Classics',
       'Warner Bros. Pictures', 'Sony Pictures', '20th Century Fox',
       'The Weinstein Company', 'Fox Searchlight Pictures', 'Focus Features',
       'Walt Disney Pictures', 'Miramax Films', 'MGM', 'Columbia Pictures',
       'Paramount Vantage', 'New Line Cinema', 'Warner Bros.',
       'Roadside Attractions', 'Newmarket Film Group', 'Lions Gate Films',
       'Fox Searchlight', 'Buena Vista Pictures', 'DreamWorks SKG',
       'Fingerprint Releasing / Bleecker Street', 'USA Films', 'Universal',
       'Summit Entertainment', 'A24', 'A24 Films', 'Dimension Films', 'WB',
       'Paramount', 'STX Entertainment', 'Lionsgate Films',
       'Walt Disney Animation Studios', 'Buena Vista Distribution Compa',
       'IFC Films', 'Lionsgate'],
      dtype='object', name='studio')

In [161]:
#studio_names_replace = {"fresh":1, "rotten":0}
#sorting_studios = freshness_df["fresh"].replace(freshness_replace, inplace=True)

#Remember, positive_reviews_ratings_df is the DF that contains production companies with 147 or more positive reviews. "Popular"
is_fresh = [1]
popular_and_fresh = positive_reviews_ratings_df.loc[positive_reviews_ratings_df["fresh"].isin(is_fresh)]


In [162]:
popular_and_fresh

Unnamed: 0,id,Rated,genre,director,studio,review,Rating,fresh,critic
243,8,PG,Drama|Kids and Family,Jay Russell,Warner Bros. Pictures,"Great boy-and-dog tale, but be prepared for te...",4/5,1,Nell Minow
244,8,PG,Drama|Kids and Family,Jay Russell,Warner Bros. Pictures,"Not a mawkish dying-dog tearjerker, ""My Dog Sk...",3.5/4,1,Nick Rogers
245,8,PG,Drama|Kids and Family,Jay Russell,Warner Bros. Pictures,Features two irresistible lead characters -- a...,2.5/4,1,Judith Egerton
246,8,PG,Drama|Kids and Family,Jay Russell,Warner Bros. Pictures,A heart-affecting boy-with-a-dog drama.,,1,
249,8,PG,Drama|Kids and Family,Jay Russell,Warner Bros. Pictures,I had shelved a movie critic's usual reserve a...,3/4,1,Roger Ebert
...,...,...,...,...,...,...,...,...,...
54420,2000,R,Action and Adventure|Art House and Internation...,,Columbia Pictures,The spaniel-eyed Jean Reno infuses Hubert with...,3/4,1,Megan Turner
54422,2000,R,Action and Adventure|Art House and Internation...,,Columbia Pictures,Arguably the best script that Besson has writt...,3.5/5,1,Wade Major
54425,2000,R,Action and Adventure|Art House and Internation...,,Columbia Pictures,Despite Besson's high-profile name being Wasab...,,1,Andy Klein
54427,2000,R,Action and Adventure|Art House and Internation...,,Columbia Pictures,The real charm of this trifle is the deadpan c...,,1,Laura Sinagra


The goal currently is to figure out which companies are getting not only the best reviews, but also a lot of them. Virality is a very important component of film success nowadays. We want consumers to be talking about the film, so production companies whose movies have only one or two positive reviews will not tell us much.
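A rough sketch of how volume and quality could be viewed side by side, on toy data. Note that this would need a frame that still contains the rotten reviews (such as `positive_reviews_ratings_df`), since every row in `popular_and_fresh` has `fresh == 1`. The names `toy` and `summary` are placeholders.

```python
import pandas as pd

toy = pd.DataFrame({
    "studio": ["Studio A"] * 4 + ["Studio B"] * 2,
    "fresh":  [1, 1, 1, 0, 1, 0],
})

# One row per studio: how many reviews it has and what share of them are fresh.
summary = (
    toy.groupby("studio")["fresh"]
       .agg(reviews="count", fresh_share="mean")
       .sort_values("reviews", ascending=False)
)
print(summary)
```

A table like this could feed directly into a scatter plot of review count against fresh share.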

In [163]:
popular_and_fresh["studio"].value_counts()

Universal Pictures                         2373
Paramount Pictures                         1760
Sony Pictures Classics                     1586
Warner Bros. Pictures                      1157
Sony Pictures                              1133
20th Century Fox                           1008
The Weinstein Company                       939
Fox Searchlight Pictures                    802
Focus Features                              611
Walt Disney Pictures                        569
Miramax Films                               550
MGM                                         501
Columbia Pictures                           450
Paramount Vantage                           429
New Line Cinema                             402
Warner Bros.                                399
Roadside Attractions                        357
Newmarket Film Group                        350
Lions Gate Films                            289
Fox Searchlight                             263
Buena Vista Pictures                    

In [174]:
popular_and_fresh["studio"].value_counts()[:10]

Universal Pictures          2563
Paramount Pictures          1924
Warner Bros. Pictures       1722
Sony Pictures Classics      1586
Sony Pictures               1133
20th Century Fox            1008
The Weinstein Company        939
Fox Searchlight Pictures     802
Focus Features               611
Lionsgate                    601
Name: studio, dtype: int64

In [185]:
most_popular_movies = popular_and_fresh["studio"].value_counts().index
most_popular_movies

#TODO: make a diagram showing the correlation between the number of ratings and the average freshness score


Index(['Universal Pictures', 'Paramount Pictures', 'Warner Bros. Pictures',
       'Sony Pictures Classics', 'Sony Pictures', '20th Century Fox',
       'The Weinstein Company', 'Fox Searchlight Pictures', 'Focus Features',
       'Lionsgate', 'Walt Disney Pictures', 'Miramax Films', 'MGM',
       'Columbia Pictures', 'Paramount Vantage', 'New Line Cinema',
       'A24 Films', 'Roadside Attractions', 'Newmarket Film Group',
       'Fox Searchlight', 'Buena Vista Pictures', 'DreamWorks SKG',
       'Fingerprint Releasing / Bleecker Street', 'USA Films',
       'Summit Entertainment', 'Dimension Films', 'STX Entertainment',
       'Walt Disney Animation Studios', 'Buena Vista Distribution Compa',
       'IFC Films'],
      dtype='object')

Our popular-and-fresh DataFrame has narrowed down the companies to focus on, and only minimal studio name cleaning was needed.

In [169]:
#DUPLICATES: Warner Bros. vs Warner Bros. Pictures vs WB, Paramount vs Paramount Pictures, A24 vs A24 Films, Lionsgate vs Lions Gate Films, Universal vs Universal Pictures

studio_names_replace = {"Lions Gate Films":"Lionsgate", "Lionsgate Films":"Lionsgate", "A24": "A24 Films", "WB": "Warner Bros. Pictures", "Warner Bros.": "Warner Bros. Pictures",
                       "Paramount": "Paramount Pictures", "Universal": "Universal Pictures"}

#work on an explicit copy and assign the result back: replace(..., inplace=True) returns None,
#and calling it on a slice of another DataFrame triggers a SettingWithCopyWarning
popular_and_fresh = popular_and_fresh.copy()
popular_and_fresh["studio"] = popular_and_fresh["studio"].replace(studio_names_replace)


In [171]:
popular_and_fresh["studio"].value_counts()

Universal Pictures                         2563
Paramount Pictures                         1924
Warner Bros. Pictures                      1722
Sony Pictures Classics                     1586
Sony Pictures                              1133
20th Century Fox                           1008
The Weinstein Company                       939
Fox Searchlight Pictures                    802
Focus Features                              611
Lionsgate                                   601
Walt Disney Pictures                        569
Miramax Films                               550
MGM                                         501
Columbia Pictures                           450
Paramount Vantage                           429
New Line Cinema                             402
A24 Films                                   366
Roadside Attractions                        357
Newmarket Film Group                        350
Fox Searchlight                             263
Buena Vista Pictures                    

Cleaning up the studios did not change the number of entries drastically, but it will be easier for us to measure success with no duplicates!

In [182]:
popular_and_fresh["Rating"].value_counts()[:50]

3/4       2239
4/5       1887
3/5       1780
3.5/4     1151
3.5/5      845
2.5/4      778
B          725
4/4        632
5/5        579
B+         540
B-         491
4.5/5      382
A-         341
8/10       327
7/10       278
A          259
6/10       190
9/10       189
1           82
2.5/5       71
8           62
C+          58
7           40
7.5/10      40
2/4         39
A+          38
8.5/10      35
9           16
6.5/10      16
3.0/4       16
9.5/10      13
3.5         13
3           12
2/5         12
3.0/5       10
C           10
4.0/4       10
5.0/5        8
5/10         8
3/6          7
4.0/5        6
2.1/2        6
4/6          6
7.1/10       5
6            5
5            5
4            5
3.7          4
R            4
2.7          4
Name: Rating, dtype: int64

I think if I convert all the letter ratings to numbers first, I will be able to convert the rest of these ratings into percentages... somehow.
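To make the idea concrete, here is a sketch of a letter-to-percentage mapping. The numeric values are purely hypothetical placeholders that I picked for illustration (loosely following a school-style grading scale); the right values for this analysis still need to be decided, especially since C+ reviews were mostly not fresh.

```python
import pandas as pd

# Hypothetical percentages, chosen only for illustration; not settled values.
assumed_letter_to_percent = {
    "A+": 98, "A": 95, "A-": 91,
    "B+": 88, "B": 85, "B-": 81,
    "C+": 78, "C": 75,
}

ratings = pd.Series(["B", "A-", "3/4", "C+", None])

# map() returns NaN for anything outside the dictionary, so fraction-style
# ratings are left for a separate conversion step.
letter_percent = ratings.map(assumed_letter_to_percent)
print(letter_percent.tolist())
```

Using `map` rather than `replace` keeps the non-letter ratings from being mixed into the same column as the converted numbers.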

In [None]:
letter_grade_conversion = {"B":"", "B+": "", "B-": "", "A-": "", "A": "", "C+": "", "A+": "", "C": ""}
#TODO: fill in the numeric values above, then apply them to the "Rating" column the same way studio_names_replace was applied to "studio"

In [None]:
popular_and_fresh