# Top Rated Movies on Netflix

Hello everyone, and thank you for checking out my notebook. Here I tackle the task 'How to find the best-rated movies on Netflix.' I use the Netflix Movies and TV Shows dataset, which can be found [here](https://www.kaggle.com/shivamb/netflix-shows), together with the IMDb movies extensive dataset, which can be found [here](https://www.kaggle.com/stefanoleone992/imdb-extensive-dataset).

I adapted a good deal of my preprocessing from Erin Ward's notebook. She did an excellent job, and her work helped me get started with this task. It can be found [here](https://www.kaggle.com/eward96/best-movies-on-netflix-eda/notebook#notebook-container).

Enjoy and let me know if you have any comments!

# Imports

In [None]:
import pandas as pd
from pprint import pprint
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# Loading in Data

The Netflix data contains one dataset.<br><br>
The IMDb data contains four datasets.<br><br>
I will load them all in here and display the first row with all the columns so we can get an idea of what each dataset looks like.

In [None]:
netflix_titles = pd.read_csv('../input/netflix-shows/netflix_titles.csv')
netflix_titles.head(1)

In [None]:
imbd_movies = pd.read_csv('../input/imdb-extensive-dataset/IMDb movies.csv')
imbd_movies.head(1)

In [None]:
imbd_names = pd.read_csv('../input/imdb-extensive-dataset/IMDb names.csv')
imbd_names.head(1)

In [None]:
imbd_ratings = pd.read_csv('../input/imdb-extensive-dataset/IMDb ratings.csv')
imbd_ratings.head(1)

In [None]:
imbd_title_principals = pd.read_csv('../input/imdb-extensive-dataset/IMDb title_principals.csv')
imbd_title_principals.head(1)

# Keeping Important Columns Only

We don't need all of the columns in some of the datasets.<br>
Here we will just select the ones that we think are the most worthy of being investigated.

In [None]:
netflix_titles.columns

In [None]:
netflix_titles = netflix_titles[['type','title','country','release_year']]
netflix_titles = netflix_titles.rename(columns={'release_year':'year'})

In [None]:
imbd_movies.columns

In [None]:
imbd_movies = imbd_movies[['imdb_title_id','year','title','genre','votes','avg_vote','budget']]

In [None]:
# All seem important for right now
imbd_ratings.columns

In [None]:
len(imbd_movies)

In [None]:
len(imbd_ratings)

We can see here that imbd_movies and imbd_ratings have the same number of rows. <br>
Both also share the imdb_title_id column, so we can merge them on that key.

In [None]:
imbd_movie_ratings = imbd_movies.merge(imbd_ratings, on='imdb_title_id')
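Equal row counts are a good hint, but they do not by themselves prove that every title has exactly one ratings row. As a minimal sketch, assuming toy stand-ins for the two tables (the frames `movies` and `ratings` and their values are made up for illustration), the keys can be compared directly and the merge can be asked to verify the one-to-one relationship:

```python
import pandas as pd

# Toy stand-ins for imbd_movies and imbd_ratings (values are made up)
movies = pd.DataFrame({'imdb_title_id': ['tt01', 'tt02', 'tt03'],
                       'title': ['A', 'B', 'C']})
ratings = pd.DataFrame({'imdb_title_id': ['tt03', 'tt01', 'tt02'],
                        'weighted_average_vote': [6.1, 7.4, 5.2]})

# Equal lengths are only a hint; check that the key sets actually match
same_keys = set(movies['imdb_title_id']) == set(ratings['imdb_title_id'])

# validate='one_to_one' makes pandas raise a MergeError if a key repeats on either side
merged = movies.merge(ratings, on='imdb_title_id', validate='one_to_one')
print(same_keys, len(merged))
```

If the check fails on the real data, the merge raises instead of silently duplicating or dropping rows.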

In [None]:
# It is easier to deal with 0 valued data than NaN
imbd_movie_ratings = imbd_movie_ratings.fillna(0)

In [None]:
imbd_movie_ratings.info()

# Weighted Averages

The IMDb dataset only contains a weighted average across everyone who submitted a vote, regardless of sex or age. It would be interesting to see the weighted average for each group of voters, that is, for the columns labeled 'allgenders', 'males', and 'females' and their corresponding age bracket (if there is one). Again, this is adapted from what Erin did; I just simplified the code here. <br><br> That is what we do below: each group's average is pulled toward a baseline rating of 5.9 that carries the weight of 1000 votes, so averages backed by only a few votes stay close to the baseline. This will allow us to make more interesting plots and dive deeper into the data.

In [None]:
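def weighted_averages(x):
    # x is a two-element row: (number of votes, average vote)
    number = x.iloc[0]
    avg = x.iloc[1]
    if number != 0.0:
        # Pull the average toward 5.9, which carries the weight of 1000 votes
        return ((number/(number+1000))*avg) + ((1000/(number+1000))*5.9)
    else:
        return 0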

votes_per_group = []
# Grab the per-group vote-count columns (e.g. males_allages_votes)
for column in imbd_movie_ratings.columns:
    if 'age' in column and 'votes' in column:
        votes_per_group.append(column)
        
avg_votes_per_group = []
# Grab the per-group average columns (e.g. males_allages_avg_vote)
for column in imbd_movie_ratings.columns:
    if 'age' in column and 'avg_vote' in column:
        avg_votes_per_group.append(column)

# Pair each vote-count column with its average column
# (this relies on both lists being in the same group order)
tuple_list = list(zip(votes_per_group, avg_votes_per_group))

# Apply the weighted_averages function defined above,
# creating one 'weighted_<votes column>' column per group in imbd_movie_ratings
for votes, avg_votes in tuple_list:
    imbd_movie_ratings['weighted_' + votes] = imbd_movie_ratings[[votes,avg_votes]].apply(weighted_averages,axis=1)
    
imbd_movie_ratings.head()
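The row-by-row `apply` above can be slow on a large table. As a rough sketch, the same formula can be written on whole columns at once. The helper name `weighted_rating`, the constants `PRIOR_VOTES` and `PRIOR_MEAN`, and the toy data are my own labels for illustration; the numbers 1000 and 5.9 are the ones used in `weighted_averages`.

```python
import pandas as pd

PRIOR_VOTES = 1000   # weight of the baseline, as in weighted_averages
PRIOR_MEAN = 5.9     # baseline rating, as in weighted_averages

def weighted_rating(votes, avg):
    """Column-wise version of weighted_averages (hypothetical helper)."""
    votes = votes.astype(float)
    shrunk = (votes * avg + PRIOR_VOTES * PRIOR_MEAN) / (votes + PRIOR_VOTES)
    # Keep the original convention: rows with no votes get 0
    return shrunk.where(votes != 0, 0.0)

# Made-up numbers, only to show the effect of the vote count
toy = pd.DataFrame({'votes': [0, 1000, 9000], 'avg': [0.0, 8.0, 8.0]})
toy['weighted'] = weighted_rating(toy['votes'], toy['avg'])
print(toy)
```

Two movies with the same raw average of 8.0 end up with different weighted ratings: the one with more votes stays closer to 8.0, while the one with fewer votes is pulled further toward 5.9.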

In [None]:
# Make new df (final_df) by merging Netflix data (netflix_titles) with IMDb data (imbd_movie_ratings)
final_df = netflix_titles.merge(imbd_movie_ratings, how = 'inner', left_on=['title', 'year'], right_on=['title', 'year'])
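An inner merge on title and year silently drops every Netflix title that has no exact match in the IMDb data. One way to see how much is lost, sketched here on made-up frames (`netflix`, `imdb`, and their values are illustrative only), is a left merge with `indicator=True`:

```python
import pandas as pd

# Made-up stand-ins for netflix_titles and imbd_movie_ratings
netflix = pd.DataFrame({'title': ['Movie A', 'Movie B', 'Movie C'],
                        'year': [2010, 2015, 2020]})
imdb = pd.DataFrame({'title': ['Movie A', 'Movie B', 'Movie D'],
                     'year': [2010, 2016, 2020],
                     'avg_vote': [7.0, 6.5, 8.1]})

# indicator=True adds a '_merge' column: 'both' for matches, 'left_only' for misses
check = netflix.merge(imdb, how='left', on=['title', 'year'], indicator=True)
match_counts = check['_merge'].value_counts()
unmatched = check.loc[check['_merge'] == 'left_only', 'title'].tolist()
print(match_counts)
print(unmatched)
```

In this toy case 'Movie B' is lost only because the two sources disagree on the year, which is the kind of mismatch worth checking for on the real data.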

In [None]:
final_df.head()

In [None]:
# Check to see if we have any missing data
print(final_df.isnull().sum().to_string())

In [None]:
# See the data that is missing
final_df[final_df['country'].isnull()]

In [None]:
# Fill in missing country names by doing quick Google search
final_df.loc[6,'country'] = 'India'
final_df.loc[74,'country'] = 'United Kingdom'
final_df.loc[262,'country'] = 'India'
final_df.loc[508,'country'] = 'Indonesia'

In [None]:
final_df[final_df['country'].isnull()]

# Top Rated Movies

Let's look at the top-rated movies overall, and by both males and females separately.


In [None]:
def plot_data(col_name, graph_title, xlabel = 'Rating'):
    
    rating = final_df[[col_name]].loc[:, col_name].sort_values(ascending=False).head(10).values
    index = final_df[[col_name]].loc[:, col_name].sort_values(ascending=False).index[:10]

    sns.set_context('poster')
    sns.set(rc={'figure.figsize':(10,8)})
    sns.set_style('darkgrid')

    fig, ax = plt.subplots()
    title = []

    for i in index:
        title.append(final_df.loc[i, 'title'])

    y_pos = np.arange(len(title))

    ratings = []
    for r in rating:
        ratings.append(r)

    ax.barh(y_pos, ratings, edgecolor='black',color=('#0094FD','#1BA1FF','#37ACFF','#63BEFE',
                                                     '#83CBFD','#A6DAFF','#C2E5FF','#D7EEFF','#E6F4FF','#F5FBFF'))
    ax.set_yticks(y_pos)
    ax.set_yticklabels(title, fontsize=15)
    ax.invert_yaxis()
    ax.set_xlabel(xlabel, fontsize=15)
    ax.set_ylabel('Movie', fontsize=15)
    ax.set_title(graph_title, fontsize=20)


    plt.show()
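The top-10 selection inside `plot_data` sorts the column twice, once for the values and once for the index. A more compact alternative, shown here as a sketch on a made-up frame, is `DataFrame.nlargest`, which returns the top rows with titles and ratings still aligned:

```python
import pandas as pd

# Made-up ratings, only to illustrate the selection step
toy = pd.DataFrame({'title': ['A', 'B', 'C', 'D'],
                    'weighted_average_vote': [6.2, 8.1, 7.4, 5.0]})

# Take the 2 best-rated rows (the notebook would use 10)
top = toy.nlargest(2, 'weighted_average_vote')
titles = top['title'].tolist()
ratings = top['weighted_average_vote'].tolist()
print(titles, ratings)
```

The two lists can then be passed straight to `ax.barh` and `ax.set_yticklabels`.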

In [None]:
plot_data('weighted_average_vote', 'Top 10 Weighted Movies on Netflix')

In [None]:
plot_data('weighted_females_allages_votes', 'Top 10 Weighted Female Movies on Netflix')

In [None]:
plot_data('weighted_males_allages_votes', 'Top 10 Weighted Male Movies on Netflix')

# Top Rated Movies by Genre

In [None]:
def plot_data_genre(genre_name, graph_title, xlabel = 'Rating'):
    
    # Flag rows whose genre string mentions genre_name (1) or not (0);
    # astype(str) guards against any 0 placeholders left by fillna(0)
    final_df[genre_name] = final_df['genre'].astype(str).str.contains(genre_name, regex=False).astype(int)
            
    df = final_df[final_df[genre_name] == 1]
    
    rating = df[['weighted_average_vote']].loc[:, 'weighted_average_vote'].sort_values(ascending=False).head(10).values
    index = df[['weighted_average_vote']].loc[:, 'weighted_average_vote'].sort_values(ascending=False).index[:10]

    sns.set_context('poster')
    sns.set(rc={'figure.figsize':(10,8)})
    sns.set_style('darkgrid')

    fig, ax = plt.subplots()
    title = []

    for i in index:
        title.append(final_df.loc[i, 'title'])

    y_pos = np.arange(len(title))

    ratings = []
    for r in rating:
        ratings.append(r)

    ax.barh(y_pos, ratings, edgecolor='black',color=('#0094FD','#1BA1FF','#37ACFF','#63BEFE',
                                                     '#83CBFD','#A6DAFF','#C2E5FF','#D7EEFF','#E6F4FF','#F5FBFF'))
    ax.set_yticks(y_pos)
    ax.set_yticklabels(title, fontsize=15)
    ax.invert_yaxis()
    ax.set_xlabel(xlabel, fontsize=15)
    ax.set_ylabel('Movie', fontsize=15)
    ax.set_title(graph_title, fontsize=20)


    plt.show()
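The genre column holds several genres in one string, so the filter in `plot_data_genre` is a substring test. A small sketch of that test on made-up rows (the frame `toy` is illustrative only) shows how a multi-genre movie is matched and how a missing genre that was filled with 0 is handled:

```python
import pandas as pd

# Made-up rows; the last genre is 0, as fillna(0) would leave a missing value
toy = pd.DataFrame({'title': ['A', 'B', 'C'],
                    'genre': ['Drama, Romance', 'Action', 0],
                    'weighted_average_vote': [7.1, 6.3, 5.5]})

# Convert to str first so non-string placeholders do not raise an error
is_drama = toy['genre'].astype(str).str.contains('Drama', regex=False)
drama = toy[is_drama]
print(drama['title'].tolist())
```

Because this is a plain substring match, a movie tagged 'Drama, Romance' appears in both the Drama and the Romance charts.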

In [None]:
plot_data_genre('Drama', 'Top Weighted Drama Movies on Netflix')

In [None]:
plot_data_genre('Comedy', 'Top Weighted Comedy Movies on Netflix')

In [None]:
plot_data_genre('Action', 'Top Weighted Action Movies on Netflix')

In [None]:
plot_data_genre('Romance', 'Top Weighted Romance Movies on Netflix')

In [None]:
plot_data_genre('Thriller', 'Top Weighted Thriller Movies on Netflix')

# Top Rated Movies by Release Year

In [None]:
def plot_release_year(release_year, graph_title, xlabel = 'Rating'):
            
    df = final_df[final_df['year'] == release_year]
    
    rating = df[['weighted_average_vote']].loc[:, 'weighted_average_vote'].sort_values(ascending=False).head(10).values
    index = df[['weighted_average_vote']].loc[:, 'weighted_average_vote'].sort_values(ascending=False).index[:10]

    sns.set_context('poster')
    sns.set(rc={'figure.figsize':(10,8)})
    sns.set_style('darkgrid')

    fig, ax = plt.subplots()
    title = []

    for i in index:
        title.append(final_df.loc[i, 'title'])

    y_pos = np.arange(len(title))

    ratings = []
    for r in rating:
        ratings.append(r)

    ax.barh(y_pos, ratings, edgecolor='black',color=('#0094FD','#1BA1FF','#37ACFF','#63BEFE',
                                                     '#83CBFD','#A6DAFF','#C2E5FF','#D7EEFF','#E6F4FF','#F5FBFF'))
    ax.set_yticks(y_pos)
    ax.set_yticklabels(title, fontsize=15)
    ax.invert_yaxis()
    ax.set_xlabel(xlabel, fontsize=15)
    ax.set_ylabel('Movie', fontsize=15)
    ax.set_title(graph_title, fontsize=20)

    plt.show()

In [None]:
plot_release_year(2020,'Top Rated Movies Released in 2020')

In [None]:
plot_release_year(2015,'Top Rated Movies Released in 2015')

In [None]:
plot_release_year(2010,'Top Rated Movies Released in 2010')

In [None]:
plot_release_year(2005,'Top Rated Movies Released in 2005')

In [None]:
plot_release_year(2000,'Top Rated Movies Released in 2000')

# Highest US Budgeted Films and Their Ratings

In [None]:
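final_df['Budget_in_dollars'] = ''

# Only US productions are considered; the parsing below assumes
# budget strings of the form '<currency> <amount>'
for i in final_df[final_df['country'] == 'United States'].index:
    budget_value = final_df.loc[i, 'budget']
    if budget_value != 0:
        # Keep the numeric part that follows the currency symbol
        final_df.loc[i,'Budget_in_dollars'] = int(budget_value.split()[1])
    else:
        # Missing budgets were filled with 0 earlier
        final_df.loc[i,'Budget_in_dollars'] = 0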
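The loop above takes the second token of the budget string whatever the currency is. If budgets in other currencies turn up, a stricter parse may be preferable. The following is only a sketch on made-up values, and it makes a different choice from the loop: anything that is not a dollar amount is treated as 0.

```python
import pandas as pd

# Made-up budget values: a dollar amount, another currency, and a filled-in 0
toy = pd.DataFrame({'budget': ['$ 1500000', 'EUR 200000', 0]})

# Capture the digits only when the string starts with a dollar sign
dollars = toy['budget'].astype(str).str.extract(r'^\$\s*(\d+)$')[0]

# Non-matching rows become NaN, which is then replaced by 0
toy['Budget_in_dollars'] = pd.to_numeric(dollars).fillna(0).astype(int)
print(toy)
```

Whether non-dollar budgets should be dropped, converted, or kept as-is depends on the question being asked, so this is a design choice rather than a fix.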

In [None]:
df = final_df[final_df['Budget_in_dollars'] != '']

budget = df[['Budget_in_dollars']].loc[:, 'Budget_in_dollars'].sort_values(ascending=False).head(10).values
index = df[['Budget_in_dollars']].loc[:, 'Budget_in_dollars'].sort_values(ascending=False).index[:10]

title = []
for i in index:
    title.append(final_df.loc[i, 'title'])

budgets = []
for r in budget:
    budgets.append(r)

ratings = []
for i in index:
    ratings.append(final_df.loc[i, 'weighted_average_vote'])
    
y_pos = np.arange(len(title))

sns.set_context('paper')
sns.set(rc={'figure.figsize':(12,10)})
fig, ax = plt.subplots(2)

ax[0].barh(y_pos, budgets, edgecolor='black',color=('#0094FD','#1BA1FF','#37ACFF','#63BEFE',
                                                     '#83CBFD','#A6DAFF','#C2E5FF','#D7EEFF','#E6F4FF','#F5FBFF'))
ax[0].set_yticks(y_pos)
ax[0].set_yticklabels(title, fontsize=15)
ax[0].invert_yaxis()
ax[0].set_xlabel('Budget (US Dollars)', fontsize=15)
ax[0].set_ylabel('Movie', fontsize=15)
ax[0].set_title('Top 10 US Budget Movies', fontsize=20)

ax[1].barh(y_pos, ratings, edgecolor='black',color=('#0094FD','#1BA1FF','#37ACFF','#63BEFE',
                                                     '#83CBFD','#A6DAFF','#C2E5FF','#D7EEFF','#E6F4FF','#F5FBFF'))
ax[1].set_yticks(y_pos)
ax[1].set_yticklabels(title, fontsize=15)
ax[1].invert_yaxis()
ax[1].set_xlabel('Rating', fontsize=15)
ax[1].set_ylabel('Movie', fontsize=15)
ax[1].set_title('Top 10 US Budget Movies and Their Ratings', fontsize=20)

plt.tight_layout()
plt.show()

# US v Non US Rating Comparisons

We have to create two new columns to do this comparison. When we calculated the weighted averages above, we did not do so for US and non-US voters, so the cell below calculates them. <br><br> The following graphs show that US and non-US residents do not rate movies very differently, but it is still interesting to see the slight differences.

In [None]:
def weighted_averages_ver2(x):
    # Same formula as weighted_averages: x is (number of votes, average rating)
    number = x.iloc[0]
    avg = x.iloc[1]
    if number != 0.0:
        return ((number/(number+1000))*avg) + ((1000/(number+1000))*5.9)
    else:
        return 0

final_df.loc[:,'weighted_us_votes'] = final_df[['us_voters_votes','us_voters_rating']].apply(weighted_averages_ver2,axis=1)
final_df.loc[:,'weighted_non_us_votes'] = final_df[['non_us_voters_votes','non_us_voters_rating']].apply(weighted_averages_ver2,axis=1)

final_df.head()
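Besides plotting the two groups side by side, the gap between them can be computed directly. This is a minimal sketch on made-up ratings; the column `us_minus_non_us` is a hypothetical name, while `weighted_us_votes` and `weighted_non_us_votes` are the columns created above.

```python
import pandas as pd

# Made-up weighted ratings for three movies
toy = pd.DataFrame({'title': ['A', 'B', 'C'],
                    'weighted_us_votes': [7.5, 6.0, 8.0],
                    'weighted_non_us_votes': [7.0, 6.4, 8.0]})

# Positive values mean US voters rated the movie higher
toy['us_minus_non_us'] = toy['weighted_us_votes'] - toy['weighted_non_us_votes']

# Title with the largest disagreement in either direction
biggest_gap = toy.loc[toy['us_minus_non_us'].abs().idxmax(), 'title']
print(biggest_gap)
```

Sorting the real table by this difference would list the movies on which the two audiences disagree most.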

In [None]:
def us_v_non_us_voters(column, title1, title2):
    # Note: df is the notebook-level subset built in the budget section above,
    # so the ranking is taken over those rows only
    votes = df[[column]].loc[:, column].sort_values(ascending=False).head(10).values
    index = df[[column]].loc[:, column].sort_values(ascending=False).index[:10]

    title = []

    for i in index:
        title.append(final_df.loc[i, 'title'])

    y_pos = np.arange(len(title))

    # index holds row labels, so look the ratings up with .loc
    us_ratings = []
    for i in index:
        us_ratings.append(final_df.loc[i, 'weighted_us_votes'])

    non_us_ratings = []
    for i in index:
        non_us_ratings.append(final_df.loc[i, 'weighted_non_us_votes'])

    sns.set_context('paper')
    sns.set(rc={'figure.figsize':(12,10)})
    fig, ax = plt.subplots(2)

    ax[0].barh(y_pos, us_ratings, edgecolor='black',color=('#0094FD','#1BA1FF','#37ACFF','#63BEFE',
                                                         '#83CBFD','#A6DAFF','#C2E5FF','#D7EEFF','#E6F4FF','#F5FBFF'))
    ax[0].set_yticks(y_pos)
    ax[0].set_yticklabels(title, fontsize=15)
    ax[0].invert_yaxis()
    ax[0].set_xlabel('US Rating', fontsize=15)
    ax[0].set_ylabel('Movie', fontsize=15)
    ax[0].set_title(title1, fontsize=20)

    ax[1].barh(y_pos, non_us_ratings, edgecolor='black',color=('#0094FD','#1BA1FF','#37ACFF','#63BEFE',
                                                         '#83CBFD','#A6DAFF','#C2E5FF','#D7EEFF','#E6F4FF','#F5FBFF'))
    ax[1].set_yticks(y_pos)
    ax[1].set_yticklabels(title, fontsize=15)
    ax[1].invert_yaxis()
    ax[1].set_xlabel('Non US Rating', fontsize=15)
    ax[1].set_ylabel('Movie', fontsize=15)
    ax[1].set_title(title2, fontsize=20)

    plt.tight_layout()
    plt.show()

In [None]:
us_v_non_us_voters('avg_vote', 'Top 10 US Users Rated Movies Based on Top Average Votes',
                  'Top 10 Non US Users Rated Movies Based on Top Average Votes')

In [None]:
us_v_non_us_voters('votes', 'Top 10 US Users Rated Movies Based on Top Votes',
                  'Top 10 Non US Users Rated Movies Based on Top Votes')

In [None]:
us_v_non_us_voters('males_allages_avg_vote', 'Top 10 US Users Rated Movies Based on Top Male Average Votes',
                  'Top 10 Non US Users Rated Movies Based on Top Male Average Votes')

In [None]:
us_v_non_us_voters('females_allages_avg_vote', 'Top 10 US Users Rated Movies Based on Top Female Average Votes',
                  'Top 10 Non US Users Rated Movies Based on Top Female Average Votes')

# Conclusion

Thank you for checking out this notebook. I hope you learned something. <br><br> Let me know if you have any comments or questions!