# Movie Data Analysis

## Overview

This project explores different types of films in order to identify actionable recommendations for Microsoft's new movie studio. Descriptive analysis of which films perform the best at the box office shows that 'blank'. Microsoft can use this analysis to prioritize the types of films it creates.

## Business Problem

Like other companies creating original video content, Microsoft may be able to use current success metrics to allocate its resources appropriately and produce films that will perform well. By doing so, Microsoft can become a household-name studio whose films are enjoyed by many. By using datasets from some of the most popular film review websites, such as IMDB and Rotten Tomatoes, I describe

## Data Understanding

IMDB

In [2]:
import pandas as pd

In [3]:
imdb_basics = pd.read_csv("zippedData/imdb.title.basics.csv.gz")
imdb_ratings = pd.read_csv("zippedData/imdb.title.ratings.csv.gz")

In [4]:
imdb_basics.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 146144 entries, 0 to 146143
Data columns (total 6 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   tconst           146144 non-null  object 
 1   primary_title    146144 non-null  object 
 2   original_title   146123 non-null  object 
 3   start_year       146144 non-null  int64  
 4   runtime_minutes  114405 non-null  float64
 5   genres           140736 non-null  object 
dtypes: float64(1), int64(1), object(4)
memory usage: 6.7+ MB


In [5]:
imdb_ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 73856 entries, 0 to 73855
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   tconst         73856 non-null  object 
 1   averagerating  73856 non-null  float64
 2   numvotes       73856 non-null  int64  
dtypes: float64(1), int64(1), object(1)
memory usage: 1.7+ MB


### IMDB Genres

The IMDB title dataset lists films with start years from 2010 onward, including projected titles dated as far out as 2115, along with each film's runtime and genres.

In [6]:
imdb_basics.head()

Unnamed: 0,tconst,primary_title,original_title,start_year,runtime_minutes,genres
0,tt0063540,Sunghursh,Sunghursh,2013,175.0,"Action,Crime,Drama"
1,tt0066787,One Day Before the Rainy Season,Ashad Ka Ek Din,2019,114.0,"Biography,Drama"
2,tt0069049,The Other Side of the Wind,The Other Side of the Wind,2018,122.0,Drama
3,tt0069204,Sabse Bada Sukh,Sabse Bada Sukh,2018,,"Comedy,Drama"
4,tt0100275,The Wandering Soap Opera,La Telenovela Errante,2017,80.0,"Comedy,Drama,Fantasy"


In [7]:
imdb_basics['primary_title'].nunique()

136071

In [8]:
imdb_basics['original_title'].nunique()

137773

In [9]:
imdb_basics['start_year'].describe()

count    146144.000000
mean       2014.621798
std           2.733583
min        2010.000000
25%        2012.000000
50%        2015.000000
75%        2017.000000
max        2115.000000
Name: start_year, dtype: float64

In [10]:
imdb_basics['runtime_minutes'].describe()

count    114405.000000
mean         86.187247
std         166.360590
min           1.000000
25%          70.000000
50%          87.000000
75%          99.000000
max       51420.000000
Name: runtime_minutes, dtype: float64

In [11]:
imdb_basics['genres'].value_counts()

Documentary                   32185
Drama                         21486
Comedy                         9177
Horror                         4372
Comedy,Drama                   3519
                              ...  
Action,Adventure,Musical          1
Crime,Documentary,Fantasy         1
Animation,Mystery                 1
Biography,Thriller,Western        1
Drama,News,Sci-Fi                 1
Name: genres, Length: 1085, dtype: int64
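The summary statistics above report a maximum `start_year` of 2115 and a maximum `runtime_minutes` of 51420, both far from the bulk of the data. The following is a minimal sketch of how such rows might be pulled out for inspection. The sample rows and the two cutoff constants are assumptions made up for illustration; they are not taken from the dataset or from the analysis.

```python
import pandas as pd

# Made-up rows that mimic a few columns of imdb_basics (not real data).
sample_basics = pd.DataFrame({
    "tconst": ["tt001", "tt002", "tt003", "tt004"],
    "start_year": [2013, 2019, 2115, 2021],
    "runtime_minutes": [95.0, 51420.0, None, 80.0],
})

# Hypothetical cutoffs, chosen only for this illustration.
LATEST_PLAUSIBLE_YEAR = 2030
LONGEST_PLAUSIBLE_RUNTIME = 600  # minutes

# Rows whose start year lies beyond the cutoff.
far_future = sample_basics[sample_basics["start_year"] > LATEST_PLAUSIBLE_YEAR]

# Rows whose runtime exceeds the cutoff; missing runtimes compare as False.
very_long = sample_basics[sample_basics["runtime_minutes"] > LONGEST_PLAUSIBLE_RUNTIME]

print(far_future)
print(very_long)
```

Whether such rows should be dropped, corrected, or kept depends on the question being asked; this sketch only surfaces them.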

### IMDB Ratings

The IMDB ratings dataset includes the average rating and number of votes for each film.

In [12]:
imdb_ratings.head()

Unnamed: 0,tconst,averagerating,numvotes
0,tt10356526,8.3,31
1,tt10384606,8.9,559
2,tt1042974,6.4,20
3,tt1043726,4.2,50352
4,tt1060240,6.5,21


In [13]:
imdb_ratings['averagerating'].describe()

count    73856.000000
mean         6.332729
std          1.474978
min          1.000000
25%          5.500000
50%          6.500000
75%          7.400000
max         10.000000
Name: averagerating, dtype: float64

In [14]:
imdb_ratings['numvotes'].describe()

count    7.385600e+04
mean     3.523662e+03
std      3.029402e+04
min      5.000000e+00
25%      1.400000e+01
50%      4.900000e+01
75%      2.820000e+02
max      1.841066e+06
Name: numvotes, dtype: float64

## Data Preparation

### Data Cleaning

Knowing that I want to focus on genre and ratings information, I start by checking for any null values in both columns and finding the unique genres. 

In [15]:
imdb_basics['genres'].isna().sum()

5408

In [16]:
# Replace null values in genres with the placeholder string 'Unknown'
imdb_basics['genres'] = imdb_basics['genres'].fillna('Unknown')

In [17]:
unique_genres_list = []
for genre_details in imdb_basics['genres']:
    genres_list = genre_details.split(',')
    for genre in genres_list:
        unique_genres_list.append(genre)
        
unique_genres_list = sorted(list(set(unique_genres_list)))
unique_genres_list

['Action',
 'Adult',
 'Adventure',
 'Animation',
 'Biography',
 'Comedy',
 'Crime',
 'Documentary',
 'Drama',
 'Family',
 'Fantasy',
 'Game-Show',
 'History',
 'Horror',
 'Music',
 'Musical',
 'Mystery',
 'News',
 'Reality-TV',
 'Romance',
 'Sci-Fi',
 'Short',
 'Sport',
 'Talk-Show',
 'Thriller',
 'Unknown',
 'War',
 'Western']
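The nested loop above collects every genre label and then removes duplicates. As a rough sketch, the same result can be obtained with pandas string methods. The `sample_genres` series is made-up data standing in for `imdb_basics['genres']` after nulls have been filled.

```python
import pandas as pd

# Made-up stand-in for imdb_basics['genres'] (nulls already filled).
sample_genres = pd.Series([
    "Action,Crime,Drama",
    "Biography,Drama",
    "Unknown",
    "Comedy,Drama",
])

# Split each comma-separated string into a list, put one genre on each row,
# then keep the distinct values in alphabetical order.
unique_genres = sorted(sample_genres.str.split(",").explode().unique())

print(unique_genres)
```

This avoids building an intermediate list with one entry per film-genre pair in Python code, which may matter on a table of this size.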

In [18]:
for genre in unique_genres_list:
    imdb_basics[genre] = 0

In [19]:
imdb_basics.head()

Unnamed: 0,tconst,primary_title,original_title,start_year,runtime_minutes,genres,Action,Adult,Adventure,Animation,...,Reality-TV,Romance,Sci-Fi,Short,Sport,Talk-Show,Thriller,Unknown,War,Western
0,tt0063540,Sunghursh,Sunghursh,2013,175.0,"Action,Crime,Drama",0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,tt0066787,One Day Before the Rainy Season,Ashad Ka Ek Din,2019,114.0,"Biography,Drama",0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,tt0069049,The Other Side of the Wind,The Other Side of the Wind,2018,122.0,Drama,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,tt0069204,Sabse Bada Sukh,Sabse Bada Sukh,2018,,"Comedy,Drama",0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,tt0100275,The Wandering Soap Opera,La Telenovela Errante,2017,80.0,"Comedy,Drama,Fantasy",0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [20]:
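# Flag each film's genres with a 1 in the matching indicator column.
# Note: `genre in genre_details` is a substring test on the raw string, so
# 'Music' also matches 'Musical' and the Music column includes Musical titles.
for index, genre_details in enumerate(imdb_basics['genres']):
    for genre in unique_genres_list:
        if genre in genre_details:
            imdb_basics.at[index, genre] = 1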
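The `genre in genre_details` check in the loop above is a substring test on the raw string, so a film tagged `Musical` is also flagged as `Music`. Below is a sketch of an exact-match alternative using `Series.str.get_dummies`, which splits on the separator before building indicator columns. The `sample` frame is made-up data, and this is one possible approach rather than the method used in this notebook.

```python
import pandas as pd

# Made-up rows: one Musical film, one Music film, one with unknown genre.
sample = pd.DataFrame({
    "tconst": ["tt001", "tt002", "tt003"],
    "genres": ["Drama,Musical", "Documentary,Music", "Unknown"],
})

# One 0/1 column per distinct genre, matching whole comma-separated tokens.
genre_flags = sample["genres"].str.get_dummies(sep=",")
sample_with_flags = sample.join(genre_flags)

# Compare the two tests on the first row ("Drama,Musical").
substring_hit = "Music" in sample.loc[0, "genres"]  # substring containment
exact_hit = int(genre_flags.loc[0, "Music"])        # whole-token match

print(sample_with_flags)
print(substring_hit, exact_hit)
```

With exact matching, counts and averages for `Music` would exclude films tagged only as `Musical`, so figures computed from the substring-based columns are not directly comparable.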

In [21]:
imdb_basics.head()

Unnamed: 0,tconst,primary_title,original_title,start_year,runtime_minutes,genres,Action,Adult,Adventure,Animation,...,Reality-TV,Romance,Sci-Fi,Short,Sport,Talk-Show,Thriller,Unknown,War,Western
0,tt0063540,Sunghursh,Sunghursh,2013,175.0,"Action,Crime,Drama",1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,tt0066787,One Day Before the Rainy Season,Ashad Ka Ek Din,2019,114.0,"Biography,Drama",0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,tt0069049,The Other Side of the Wind,The Other Side of the Wind,2018,122.0,Drama,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,tt0069204,Sabse Bada Sukh,Sabse Bada Sukh,2018,,"Comedy,Drama",0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,tt0100275,The Wandering Soap Opera,La Telenovela Errante,2017,80.0,"Comedy,Drama,Fantasy",0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [22]:
# Check for null values in ratings
imdb_ratings['averagerating'].isna().sum()

0

### Merging Datasets

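I merge the basics and ratings tables on the shared `tconst` identifier so that each film's genre indicators sit alongside its rating. `pd.merge` performs an inner join by default, so only films that appear in both tables are kept.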

In [30]:
imdb_df = pd.merge(imdb_basics, imdb_ratings, on="tconst")

In [31]:
imdb_df.head()

Unnamed: 0,tconst,primary_title,original_title,start_year,runtime_minutes,genres,Action,Adult,Adventure,Animation,...,Sci-Fi,Short,Sport,Talk-Show,Thriller,Unknown,War,Western,averagerating,numvotes
0,tt0063540,Sunghursh,Sunghursh,2013,175.0,"Action,Crime,Drama",1,0,0,0,...,0,0,0,0,0,0,0,0,7.0,77
1,tt0066787,One Day Before the Rainy Season,Ashad Ka Ek Din,2019,114.0,"Biography,Drama",0,0,0,0,...,0,0,0,0,0,0,0,0,7.2,43
2,tt0069049,The Other Side of the Wind,The Other Side of the Wind,2018,122.0,Drama,0,0,0,0,...,0,0,0,0,0,0,0,0,6.9,4517
3,tt0069204,Sabse Bada Sukh,Sabse Bada Sukh,2018,,"Comedy,Drama",0,0,0,0,...,0,0,0,0,0,0,0,0,6.1,13
4,tt0100275,The Wandering Soap Opera,La Telenovela Errante,2017,80.0,"Comedy,Drama,Fantasy",0,0,0,0,...,0,0,0,0,0,0,0,0,6.5,119
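Because `pd.merge` defaults to an inner join, titles without a rating silently drop out of the merged table. The sketch below shows one way to check the join, using the `validate` and `indicator` parameters of `pd.merge`. The two small frames are made-up stand-ins for `imdb_basics` and `imdb_ratings`.

```python
import pandas as pd

# Made-up stand-ins for imdb_basics and imdb_ratings.
basics = pd.DataFrame({
    "tconst": ["tt001", "tt002", "tt003"],
    "primary_title": ["A", "B", "C"],
})
ratings = pd.DataFrame({
    "tconst": ["tt001", "tt003"],
    "averagerating": [7.0, 6.1],
    "numvotes": [77, 13],
})

# Inner join; validate raises MergeError if tconst repeats in either table.
inner = pd.merge(basics, ratings, on="tconst", validate="one_to_one")

# Left join with an indicator column to see which titles have no rating.
audit = pd.merge(basics, ratings, on="tconst", how="left", indicator=True)
unrated = audit[audit["_merge"] == "left_only"]

print(len(inner))
print(unrated["tconst"].tolist())
```

Knowing how many titles are unrated helps judge how representative the rated subset is of the full title list.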


In [45]:
imdb_df['averagerating'].describe()

count    73856.000000
mean         6.332729
std          1.474978
min          1.000000
25%          5.500000
50%          6.500000
75%          7.400000
max         10.000000
Name: averagerating, dtype: float64

In [62]:
# Keep films rated above 7.4, the 75th percentile of averagerating
top_rated = imdb_df[imdb_df['averagerating'] > 7.4]
top_rated.head()

Unnamed: 0,tconst,primary_title,original_title,start_year,runtime_minutes,genres,Action,Adult,Adventure,Animation,...,Sci-Fi,Short,Sport,Talk-Show,Thriller,Unknown,War,Western,averagerating,numvotes
6,tt0137204,Joe Finds Grace,Joe Finds Grace,2017,83.0,"Adventure,Animation,Comedy",0,0,1,1,...,0,0,0,0,0,0,0,0,8.1,263
9,tt0159369,Cooper and Hemingway: The True Gen,Cooper and Hemingway: The True Gen,2013,180.0,Documentary,0,0,0,0,...,0,0,0,0,0,0,0,0,7.6,53
11,tt0170651,T.G.M. - osvoboditel,T.G.M. - osvoboditel,2018,60.0,Documentary,0,0,0,0,...,0,0,0,0,0,0,0,0,7.5,6
12,tt0176694,The Tragedy of Man,Az ember tragédiája,2011,160.0,"Animation,Drama,History",0,0,0,1,...,0,0,0,0,0,0,0,0,7.8,584
14,tt0230212,The Final Journey,The Final Journey,2010,120.0,Drama,0,0,0,0,...,0,0,0,0,0,0,0,0,8.8,8
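The 7.4 cutoff used above equals the 75th percentile reported by `describe()`. As a small sketch, the threshold can be computed with `Series.quantile` so that it follows the data if the table changes. The ratings below are made-up values.

```python
import pandas as pd

# Made-up ratings standing in for imdb_df.
sample_ratings = pd.DataFrame({
    "tconst": ["tt001", "tt002", "tt003", "tt004", "tt005"],
    "averagerating": [5.5, 6.5, 7.4, 8.1, 9.0],
})

# 75th percentile of the rating column (linear interpolation by default).
threshold = sample_ratings["averagerating"].quantile(0.75)

# Films rated strictly above the threshold.
top = sample_ratings[sample_ratings["averagerating"] > threshold]

print(threshold)
print(top)
```

Using `>` keeps only films strictly above the percentile; `>=` would also include films rated exactly at the threshold.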


In [66]:
genres_high_ratings = top_rated[unique_genres_list].sum()

In [69]:
genres_high_ratings.sort_values(ascending=False)

Documentary    8360
Drama          5943
Comedy         2484
Biography      1498
History        1080
Music          1020
Action          956
Romance         871
Thriller        856
Adventure       792
Family          786
Crime           700
Sport           431
Mystery         412
Horror          334
Animation       313
Fantasy         291
News            256
War             220
Unknown         217
Musical         204
Sci-Fi          195
Western          47
Reality-TV        7
Game-Show         1
Short             1
Talk-Show         0
Adult             0
dtype: int64
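Raw counts of top-rated films favor genres that have many titles overall. A possible complement, sketched below with made-up data, is the share of each genre's rated films that clear the threshold. The column names mirror the indicator columns built earlier; the values are illustrative only.

```python
import pandas as pd

# Made-up indicator columns and ratings (not real data).
films = pd.DataFrame({
    "Documentary": [1, 1, 1, 1, 0, 0],
    "Western": [0, 0, 0, 0, 1, 1],
    "averagerating": [8.0, 6.0, 5.0, 7.0, 8.5, 7.9],
})
genre_cols = ["Documentary", "Western"]
threshold = 7.4  # same cutoff as in the analysis above

is_top = films["averagerating"] > threshold

# Films per genre above the threshold, and films per genre overall.
top_counts = films.loc[is_top, genre_cols].sum()
total_counts = films[genre_cols].sum()

# Fraction of each genre's films that are top rated.
top_share = (top_counts / total_counts).sort_values(ascending=False)

print(top_share)
```

In this toy data Documentary has more films overall, but a smaller fraction of them are top rated, which is the kind of difference raw counts can hide.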

In [70]:
for genre in unique_genres_list:
    genre_ratings = imdb_df.loc[imdb_df[genre] == 1, 'averagerating'].mean()
    print(f"{genre}: {genre_ratings:.2f} average rating")

Action: 5.81 average rating
Adult: 3.77 average rating
Adventure: 6.20 average rating
Animation: 6.25 average rating
Biography: 7.16 average rating
Comedy: 6.00 average rating
Crime: 6.12 average rating
Documentary: 7.33 average rating
Drama: 6.40 average rating
Family: 6.39 average rating
Fantasy: 5.92 average rating
Game-Show: 7.30 average rating
History: 7.04 average rating
Horror: 5.00 average rating
Music: 6.93 average rating
Musical: 6.50 average rating
Mystery: 5.92 average rating
News: 7.27 average rating
Reality-TV: 6.50 average rating
Romance: 6.15 average rating
Sci-Fi: 5.49 average rating
Short: 8.80 average rating
Sport: 6.96 average rating
Talk-Show: nan average rating
Thriller: 5.64 average rating
Unknown: 6.50 average rating
War: 6.58 average rating
Western: 5.87 average rating
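Some of the averages above rest on very few films: `Short` has a single title above 7.4 in the counts shown earlier, and `Talk-Show` has no rated films at all, hence the `nan`. The sketch below reports the number of films next to each mean so that small groups are easy to spot. The data is made up and the column names `n_films` and `mean_rating` are hypothetical.

```python
import pandas as pd

# Made-up indicator columns and ratings (not real data).
films = pd.DataFrame({
    "Drama": [1, 1, 1, 0],
    "Short": [0, 0, 0, 1],
    "averagerating": [6.0, 7.0, 6.5, 8.8],
})
genre_cols = ["Drama", "Short"]

rows = []
for genre in genre_cols:
    mask = films[genre] == 1
    rows.append({
        "genre": genre,
        "n_films": int(mask.sum()),  # films carrying this genre
        "mean_rating": films.loc[mask, "averagerating"].mean(),
    })

summary = pd.DataFrame(rows).set_index("genre")

print(summary.sort_values("mean_rating", ascending=False))
```

A mean based on one or two films says little about a genre as a whole, so the count column gives context for the ranking.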
