# Movie Data Analysis <a id='intro'></a>

In this project, I will work with data from the entertainment industry with records on movies and shows. The research will focus on the "Golden Age" of television, which began in 1999 with the release of *The Sopranos* and is still ongoing. The aim of this project is to investigate how the number of votes a title receives impacts its ratings. The assumption is that highly-rated shows (we will focus on TV shows, ignoring movies) released during the "Golden Age" of television also have the most votes.

First, I'll evaluate the quality of the data and see whether its issues are significant. Then, during data preprocessing, I will account for the most critical problems before analyzing the movie data.

## Initialization and Preparing Data

In [7]:
import pandas as pd # load the data
movie = pd.read_csv('/datasets/movies_and_shows.csv'); movie.head()

Unnamed: 0,name,Character,r0le,TITLE,Type,release Year,genres,imdb sc0re,imdb v0tes
0,Robert De Niro,Travis Bickle,ACTOR,Taxi Driver,MOVIE,1976,"['drama', 'crime']",8.2,808582.0
1,Jodie Foster,Iris Steensma,ACTOR,Taxi Driver,MOVIE,1976,"['drama', 'crime']",8.2,808582.0
2,Albert Brooks,Tom,ACTOR,Taxi Driver,MOVIE,1976,"['drama', 'crime']",8.2,808582.0
3,Harvey Keitel,Matthew 'Sport' Higgins,ACTOR,Taxi Driver,MOVIE,1976,"['drama', 'crime']",8.2,808582.0
4,Cybill Shepherd,Betsy,ACTOR,Taxi Driver,MOVIE,1976,"['drama', 'crime']",8.2,808582.0


According to the documentation:
- `'name'` — actor/director's name and last name
- `'Character'` — character played (for actors)
- `'r0le '` — the person's contribution to the title (it can be in the capacity of either actor or director)
- `'TITLE '` — title of the movie/show
- `'  Type'` — show or movie
- `'release Year'` — year when movie/show was released
- `'genres'` — list of genres under which the movie/show falls
- `'imdb sc0re'` — score on IMDb
- `'imdb v0tes'` — votes on IMDb

At first glance, the dataset shows inconsistencies in the column names. Some names are in uppercase while others are in lowercase, several contain whitespace, and a few column names have digit '0' instead of letter 'o'. To improve readability and usability, I will rename the columns to more appropriate and consistent names.

In [24]:
movie = movie.rename(columns = {'   name':'name', 'Character':'character', 'r0le':'role', 
                                'TITLE':'title', 'release Year':'release_year',
                                'imdb sc0re':'imdb_score', 'imdb v0tes':'imdb_votes', '  Type':'type'})

In [25]:
movie.isna().sum() # find missing values

name               0
character          0
role               0
title              1
type               0
release_year       0
genres             0
imdb_score      4609
imdb_votes      4726
dtype: int64

As shown above, there are missing values in three columns. While the missing value in `'title'` isn't critical, missing values in `'imdb_score'` and `'imdb_votes'` affect around 6% of the data, which could impact our analysis. To ensure data integrity, it is best to drop all rows with missing values.

In [26]:
movie = movie.dropna() # drop values
movie.duplicated().sum() # check for duplicates

6994

In [27]:
df[df.duplicated()].tail() 

Unnamed: 0,name,Character,r0le,TITLE,Type,release Year,genres,imdb sc0re,imdb v0tes
85569,Jessica Cediel,Liliana Navarro,ACTOR,Lokillo,MOVIE,2021,['comedy'],3.8,68.0
85570,Javier Gardeaz?­bal,"Agust??n ""Peluca"" Ort??z",ACTOR,Lokillo,MOVIE,2021,['comedy'],3.8,68.0
85571,Carla Giraldo,Valery Reinoso,ACTOR,Lokillo,MOVIE,2021,['comedy'],3.8,68.0
85572,Ana Mar??a S?­nchez,Lourdes,ACTOR,Lokillo,MOVIE,2021,['comedy'],3.8,68.0
85577,Isabel Gaona,Cacica,ACTOR,Lokillo,MOVIE,2021,['comedy'],3.8,68.0


In [28]:
movie = movie.drop_duplicates().reset_index() # drop duplicates

After checking for duplicate rows, it is evident that removing them will not distort the data.

In [29]:
# viewing unique type names
movie['type'].unique()

array(['MOVIE', 'the movie', 'tv show', 'shows', 'movies', 'tv shows',
       'tv series', 'tv', 'SHOW'], dtype=object)

In [30]:
# function for replacing implicit duplicates
def replace_wrong_show(wrong_shows_list, correct_shows):
    for i in wrong_shows_list:
        movie['type'] = movie['type'].replace(i, correct_shows)
    return movie

In [32]:
# removing implicit duplicates
wrong_shows_list = ['shows', 'tv show', 'tv shows', 'tv series', 'tv']
wrong_movies_list = ['the movie', 'movies']
movie = replace_wrong_show(wrong_shows_list, 'SHOW')
movie = replace_wrong_show(wrong_movies_list, 'MOVIE')

In [33]:
# viewing unique genre names
movie['type'].unique()

array(['MOVIE', 'SHOW'], dtype=object)

There are implicit duplicates in the `type` field, where values such as `SHOW`, `MOVIE`, and their variations (e.g., tv show, the movie) were written inconsistently. To ensure consistency and accuracy, we standardized these values by replacing all variations with their correct forms: SHOW and MOVIE.

In [34]:
movie.info() # general summary/information

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 73859 entries, 0 to 73858
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   index         73859 non-null  int64  
 1   name          73859 non-null  object 
 2   character     73859 non-null  object 
 3   role          73859 non-null  object 
 4   title         73859 non-null  object 
 5   type          73859 non-null  object 
 6   release_year  73859 non-null  int64  
 7   genres        73859 non-null  object 
 8   imdb_score    73859 non-null  float64
 9   imdb_votes    73859 non-null  float64
dtypes: float64(2), int64(2), object(6)
memory usage: 5.6+ MB


After fixing the issues with incorrect header styles, missing values, and duplicate rows and implicit duples, the last step is to check the general summary, which has appropriate data types for each variable. Therefore, the data is cleaned and prepared for analysis.

## Data Analysis <a id='hypotheses'></a>

To begin the analysis, I'll calculate the average amount of votes for each score (this data is available in the `imdb_score` and `imdb_votes` columns), and then check how these averages relate to each other. If the averages for shows with the highest scores are bigger than those for shows with lower scores, the assumption of movies/shows released during the "Golden Age" of televion having the most votes will appear to be true.

I will do so by filtering the dataframe and only include shows released in 1999 or later, keeping only titles published in 1999 or later in our dataframe before filtering the table to only contain shows. Additionally, for simplicity, the scores that are to be grouped will be rounded. For instance, titles with scores like 7.8, 8.1, and 8.3 will all be placed in the same bucket with a score of 8.

In [39]:
# filter the movies
release = movie[movie['release_year'] >= 1999]
release = release[release['type'] == 'SHOW']
# rounding column with scores
release['imdb_score'] = release['imdb_score'].round()
release.tail()

Unnamed: 0,index,name,character,role,title,type,release_year,genres,imdb_score,imdb_votes
73831,85433,Maneerat Kam-Uan,Ae,ACTOR,Let's Eat,SHOW,2021,"['drama', 'comedy']",8.0,5.0
73832,85434,Rudklao Amratisha,unknown,ACTOR,Let's Eat,SHOW,2021,"['drama', 'comedy']",8.0,5.0
73833,85435,Jaturong Mokjok,unknown,ACTOR,Let's Eat,SHOW,2021,"['drama', 'comedy']",8.0,5.0
73834,85436,Pisamai Wilaisak,unknown,ACTOR,Let's Eat,SHOW,2021,"['drama', 'comedy']",8.0,5.0
73835,85437,Sarawut Wichiensarn,unknown,DIRECTOR,Let's Eat,SHOW,2021,"['drama', 'comedy']",8.0,5.0


In [38]:
# identify outliers based on number of votes
score_counts = release.groupby('imdb_score')['imdb_votes'].count()
print(score_counts)

imdb_score
2.0       24
3.0       27
4.0      180
5.0      592
6.0     2494
7.0     4706
8.0     4842
9.0      557
10.0       8
Name: imdb_votes, dtype: int64


Based on the aggregation performed, it is evident that scores 2 (24 voted shows), 3 (27 voted shows), and 10 (only 8 voted shows) are outliers. There isn't enough data for these scores for the average number of votes to be meaningful, so I will focus on the 4-9 range.

In [42]:
# filtering dataframe
filtered = release[(release['imdb_score'] >= 4) & (release['imdb_score'] <= 9)]
# group scores and corresponding average number of votes
mean = filtered.groupby('imdb_score')['imdb_votes'].mean().reset_index(); mean

Unnamed: 0,imdb_score,imdb_votes
0,4.0,5277.583333
1,5.0,3143.942568
2,6.0,3481.717322
3,7.0,8727.068211
4,8.0,30299.460967
5,9.0,126904.109515


Now, I'll round the column with the averages, rename both columns, and print the dataframe in descending order.

In [41]:
# round column with averages
mean['imdb_votes'] = mean['imdb_votes'].round()
mean = mean.rename(columns={'imdb_score': 'Score', 'imdb_votes': 'Votes'})
# descending order
mean_sorted = mean.sort_values(by='Votes', ascending=False)
mean_sorted

if 'imdb_votes' in filtered.columns:
         mean = filtered.groupby('imdb_score')['imdb_votes'].mean().reset_index()
         print(mean.head())

   imdb_score    imdb_votes
0         4.0   5277.583333
1         5.0   3143.942568
2         6.0   3481.717322
3         7.0   8727.068211
4         8.0  30299.460967


The assumption matches the analysis: the movies and shows with the top 3 scores have the most amounts of votes and were released during the "Golden Age" of television.

## Conclusion <a id='hypotheses'></a>

The research done confirms that highly-rated shows released during the "Golden Age" of television also have the most votes. While shows with score 4 have more votes than ones with scores 5 and 6, the top three (scores 7-9) have the largest number. The data studied represents around 94% of the original set, so we can be confident in our findings.