In this notebook, we will dive into a Netflix database (to keep things simple, we will limit ourselves to movies and leave out series). We will try to understand which groups of people like which movies. To quantify how much a movie is "liked", we will use ratings data from IMDb.

This notebook takes inspiration from: https://www.kaggle.com/eward96/best-movies-on-netflix-eda/data

As always, let's start by importing some useful libraries.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns # for figures
import matplotlib.pyplot as plt # pyplot interface (the plotting functions live in matplotlib.pyplot)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

Now we load the Netflix data and take a quick look at it.

In [None]:
netflix = pd.read_csv('../input/netflix-shows/netflix_titles.csv')
netflix_movies = netflix[netflix['type'] == 'Movie']
netflix_movies.head()

Among all this information, only some is useful for us (at least for now).
We are only interested in the title of the movie, its genre (listed_in) and the release year.

In [None]:
data = netflix_movies[['title', 'listed_in', 'release_year']]
data = data.rename(columns={'release_year': 'year'})
data = data.rename(columns={'listed_in': 'genre'})
data.head()
data.info()

Now we have to load the IMDb data.

In [None]:
imdb_movies = pd.read_csv('../input/imdb-movie-extensive-dataset/IMDb movies.csv/IMDb movies.csv')
imdb_movies.head()
imdb_movies = imdb_movies[['imdb_title_id', 'title', 'year']]
imdb_ratings = pd.read_csv('../input/imdb-movie-extensive-dataset/IMDb ratings.csv/IMDb ratings.csv')
imdb = pd.merge(imdb_movies, imdb_ratings, on='imdb_title_id')
imdb.head()
imdb.info()

This is a lot of information. We will analyze movie ratings by release year and by the gender and age of the reviewers.
Let's start with the best movies for "everybody" (not segmented by gender or age).
A small note: the year is already included in the Netflix dataset, so we don't need to load it again. If you do load it and merge on both title and year, you will lose some data points, possibly because of small differences between the two sources or because of homonymous titles.
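To make the note above concrete, here is a small sketch on toy rows. The titles, years and votes are invented for illustration and are not taken from either dataset. It shows the two effects at play: merging on the title alone pairs a movie with every entry that shares its name, while merging on title and year drops rows whose years disagree between the sources.

```python
import pandas as pd

# Toy stand-ins for the two tables (all values are invented)
netflix_toy = pd.DataFrame({
    "title": ["Hamlet", "Solo"],
    "year": [2000, 2018],
})
imdb_toy = pd.DataFrame({
    "title": ["Hamlet", "Hamlet", "Solo"],
    "year": [1990, 2000, 2019],  # "Solo" differs by one year between the sources
    "weighted_average_vote": [6.7, 5.9, 6.1],
})

# Title only: both homonymous "Hamlet" rows match, so one movie becomes two rows
by_title = netflix_toy.merge(imdb_toy, on="title", suffixes=("_netflix", "_imdb"))

# Title and year: the homonym is resolved, but "Solo" is lost to the year mismatch
by_title_year = netflix_toy.merge(imdb_toy, on=["title", "year"])

print(len(by_title), len(by_title_year))
```

Neither key is perfect, so comparing the row counts of both merges is a cheap sanity check.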

In [None]:
imdb = imdb[['title', 'weighted_average_vote']]
imdb.info()

Now let's merge the two datasets (via the title).

In [None]:
data = data.merge(imdb, how="inner", left_on=['title'], right_on=['title'])
print(data.head())
data.info()

It is worth noting that from a dataset of around 8k rows we are reduced to just 2.4k. This may be because the movie titles were saved differently in the two sources (capitalization or similar). **LOOK AGAIN INTO THIS**
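One way to look into this is sketched below, under the assumption that the mismatches come from capitalization and stray whitespace (this has not been verified on the real data). The helper `normalize_title`, the `key` column and all toy values are hypothetical names and numbers introduced only for this example. The `indicator=True` option of `merge` then tells us which rows found no partner.

```python
import pandas as pd

def normalize_title(s: pd.Series) -> pd.Series:
    """Lower-case, trim and collapse whitespace so near-identical titles compare equal."""
    return s.str.lower().str.strip().str.replace(r"\s+", " ", regex=True)

# Toy tables (titles and votes are placeholders, not real data)
left = pd.DataFrame({"title": ["The Irishman", "ROMA ", "Okja"]})
right = pd.DataFrame({
    "title": ["the irishman", "Roma", "Parasite"],
    "weighted_average_vote": [7.0, 7.5, 8.0],
})

left["key"] = normalize_title(left["title"])
right["key"] = normalize_title(right["title"])

# how="left" keeps every Netflix row; "_merge" records whether a match was found
check = left.merge(right[["key", "weighted_average_vote"]], on="key",
                   how="left", indicator=True)
print(check["_merge"].value_counts())

unmatched = check.loc[check["_merge"] == "left_only", "title"].tolist()
print(unmatched)
```

Inspecting the unmatched titles by eye is usually the quickest way to see what kind of normalization is still missing.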

In [None]:
# nlargest(1) returns a one-element Series: its index is the row label, its value is the vote
top = data["weighted_average_vote"].nlargest(1)
index = top.index[0]
# the index is a label, so look it up with .loc (label-based) rather than .iloc (position-based)
top_title = data.loc[index, 'title']
top_weight = data.loc[index, "weighted_average_vote"]

print("Best movie ever " + str(top_title)+ " with vote: " + str(top_weight))
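A short sketch of what `nlargest` hands back, on a toy frame with invented values. The index of the returned Series holds row labels, not positions. The two coincide only when the frame has the default 0, 1, 2, ... index, which is the case right after a merge but not after filtering or sorting.

```python
import pandas as pd

toy = pd.DataFrame(
    {"title": ["A", "B", "C"], "weighted_average_vote": [6.1, 8.4, 7.0]},
    index=[10, 20, 30],  # deliberately not 0, 1, 2
)

top = toy["weighted_average_vote"].nlargest(1)
label = top.index[0]                  # row label of the best movie
best_title = toy.loc[label, "title"]  # label-based lookup works for any index
print(label, best_title)

# toy["title"].iloc[label] would raise an IndexError here,
# because there is no row at position 20
```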

Now let's start segmenting the reviewers by gender and age.

In [None]:
# reload (the whole dataset this time)
imdb_movies = pd.read_csv('../input/imdb-movie-extensive-dataset/IMDb movies.csv/IMDb movies.csv')
imdb_movies.head()
imdb_movies = imdb_movies[['imdb_title_id', 'title', 'year']]
imdb_ratings = pd.read_csv('../input/imdb-movie-extensive-dataset/IMDb ratings.csv/IMDb ratings.csv')
imdb = pd.merge(imdb_movies, imdb_ratings, on='imdb_title_id')
data = netflix_movies.merge(imdb, how="inner", left_on=['title'], right_on=['title'])

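# use the full option name: the short form 'max_rows' can be ambiguous in recent pandas versions
pd.set_option('display.max_rows', None)
print(data.isna().sum())
pd.reset_option('display.max_rows')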


As we can see, many values are missing. This is reasonable, since some groups could not or did not vote for some movies. For instance, people under 18 are unlikely to have rated movies restricted to adults.
Since we already have the average vote, we could fill in the missing data with zeros (this does not change the averages that have already been computed). A vote of zero, in this case, would mean that a particular group is not interested in a specific movie.
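One caveat is worth a quick check before filling anything. The precomputed averages are indeed untouched, but any mean we compute ourselves afterwards is not. The sketch below uses three invented votes to show the difference: pandas skips missing values in `mean()`, whereas a zero counts as a real (and very low) vote.

```python
import numpy as np
import pandas as pd

# Votes of one group for three movies; the third movie has no votes (values invented)
votes = pd.Series([8.0, 6.0, np.nan])

mean_skipping_nan = votes.mean()          # NaN is ignored: (8 + 6) / 2
mean_with_zeros = votes.fillna(0).mean()  # zero counts as a vote: (8 + 6 + 0) / 3

print(mean_skipping_nan, round(mean_with_zeros, 2))
```

So if the goal is a per-genre mean for each group, leaving the missing values in place is the safer choice.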

In [None]:
# optional: fill the missing votes with zeros (left disabled)
#data = data.fillna(0)
#pd.set_option('display.max_rows', None)
#print(data.isna().sum())
#pd.reset_option('display.max_rows')


Now we will try something a bit more sophisticated. The aim is a function that, given a gender/age group, prints and plots the best k movies.
If the gender/age group is not specified, all of them are assumed.
If k is not specified, it is assumed to be 10.
(In this dataset, we only have information for males and females.)

In [None]:
def DetermineColumn(gender_group,age_group):
    
    A = 'dummy'
    if gender_group in ["F",'f','fem', "Fem", "FEM", "female", "Female", "FEMALE", "females", "Females", "FEMALES"]:
        A = "females"
    if gender_group in ["M",'m','male', "Male", "MALE"]:
        A = "males"
    if gender_group in ["A",'a','all', "All", "ALL"]:
        A = "allgenders"
    if A == 'dummy':
        print('I couldn\'t understand the gender. Please try again. (the script is going to fail)')
    B = 'dummy'
    if age_group == '0-18':
        B = "0age"
    if age_group == '18-30':
        B = "18age"
    if age_group == '30-45':
        B = "30age"
    if age_group == '45+':
        B = "45age"
    if age_group == 'all':
        B = "allages"
    if B == 'dummy':
        print('I couldn\'t understand the age group. Please try again. (the script is going to fail)')
        print('your age group: ', age_group)
        print('possible choices: 0-18, 18-30, 30-45, 45+, all')
    column = A+'_'+B+'_avg_vote'
    if column == 'allgenders_allages_avg_vote':
        # here I am not actually sure if one should select
        # mean_vote or weighted_average_vote
        column = 'weighted_average_vote'
    
    return column

def BestK(data, gender_group = 'all', age_group = 'all', k = 10):

    column = DetermineColumn(gender_group,age_group)
    
    titles = []
    votes = []
    topK=data[column].nlargest(k)
    # topK is a Series: its index holds the row labels, its values hold the votes
    for i, (index, vote) in enumerate(topK.items()):
        title = data.loc[index, 'title']  # label-based lookup
        titles.append(title)
        votes.append(vote)
        print(i+1, '.', title, ':', vote)
        
    ax = sns.barplot(x = titles, y = votes)
    ax.set(xlabel='Movie Title', ylabel='Mean Rating')
    ax.set_title('Best Rated Movies on Netflix for ' + gender_group + ' of age ' + age_group)
    
    return 

Let's test it

In [None]:
# please choose any age group among:
# '0-18', '18-30', '30-45', '45+', 'all'
age_group = '18-30'
# please choose any gender group among:
# 'male', 'female', 'all'
gender_group = 'male'
# please choose the list length (default = 10)
k = 6
BestK(data,  gender_group = gender_group, age_group = age_group, k = k)

Now let's see how popular the movie genres are with respect to gender and age.

In [None]:
#data.info()
#pd.set_option('max_rows', None)
#print(data.groupby(['listed_in'])['weighted_average_vote'].mean())
#pd.reset_option('max_rows')

# make a vector of all possible genres
Genres = []
for i in range(len(data)):
    row = data.iloc[i]
    genres = row['listed_in']
    genres = genres.split(',')
    for g in genres:
        g = g.replace(" ", "")
        if g not in Genres:
            Genres.append(g)
#print(Genres)

# now we have all genres, let's initialize columns for that
for g in range(len(Genres)):
    vector = [False] * len(data)
    data[Genres[g]] = vector  
#print(data['Action & Adventure']).head()

# and now let's fill these columns
for i in range(len(data)):
    row = data.iloc[i]
    genres = row['listed_in']
    genres = genres.split(',')
    for g in genres:
        g = g.replace(" ", "")
        data.at[i,g] = True
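For reference, pandas can build the same kind of boolean genre columns in one step with `Series.str.get_dummies`. This is only an alternative sketch on a toy frame (titles and genre strings invented for illustration), not a replacement for the loops above. It assumes, like the loops do, that genres are separated by commas and that spaces can be dropped.

```python
import pandas as pd

toy = pd.DataFrame({
    "title": ["A", "B"],
    "listed_in": ["Dramas, International Movies", "Horror Movies, Dramas"],
})

# remove the spaces (as the loops do), then one-hot encode on the comma
flags = (
    toy["listed_in"]
    .str.replace(" ", "", regex=False)
    .str.get_dummies(sep=",")
    .astype(bool)
)
toy = toy.join(flags)

print(list(flags.columns))
print(toy[["title"] + list(flags.columns)])
```

The list of genres then comes for free as `flags.columns`, in alphabetical order rather than in order of first appearance.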

Now that we have all genres in the dataframe, let's create the heatmap.

In [None]:
age_groups = ["0-18", "18-30", "30-45", "45+", "all"]
gender_groups = ['male', 'female', 'all' ]

boolean = True
for a_g in age_groups:
    for g_g in gender_groups:
        column = DetermineColumn(g_g,a_g)
        group_a_g_g_g = g_g + '_'+ a_g
        vector = []
        for g in Genres:
            average = data[column].loc[data[g] == True].mean()
            vector.append(average)
        if boolean:
            hm_df = {group_a_g_g_g : vector}
            hm_df = pd.DataFrame(hm_df)
            boolean = False
        else:
            hm_df[group_a_g_g_g] = vector

hm = sns.heatmap(data = hm_df, yticklabels = Genres, center=5)

Although difficult to read, the previous heatmap has all the information we were looking for.
Let's try to visualize it better.

In [None]:
# choose any (multiple allowed)
# 'Dramas', 'InternationalMovies', 'HorrorMovies', 'Action&Adventure', 'IndependentMovies', 
#'Sci-Fi&Fantasy', 'Thrillers', 'Comedies', 'RomanticMovies', 'Music&Musicals', 
#'Children&FamilyMovies', 'Documentaries', 'CultMovies', 'LGBTQMovies', 'AnimeFeatures', 
#'ClassicMovies', 'SportsMovies', 'Faith&Spirituality', 'Movies', 'Stand-UpComedy'
genres_to_analyze = ['InternationalMovies', 'HorrorMovies', 'Action&Adventure']

# choose any (multiple allowed): 'male', 'female', 'all'
genders_to_analyze = ['female','male']

# choose any (multiple allowed): ["0-18", "18-30", "30-45", "45+", "all"]
age_to_analyze = [ "18-30", "30-45", "all"]

In [None]:
def PrintSubHeatMap(genres_to_analyze,genders_groups,age_groups,Genres):
        
    # keep the selected genres in the order of Genres, so rows and tick labels always match
    selected = [g for g in Genres if g in genres_to_analyze]
    boolean = True
    for a_g in age_groups:
        for g_g in genders_groups:
            column = DetermineColumn(g_g,a_g)
            group_a_g_g_g = g_g + '_'+ a_g
            vector = []
            for g in selected:
                # mean vote of this gender/age group over the movies of genre g
                average = data[column].loc[data[g] == True].mean()
                vector.append(average)
            if boolean:
                hm_df = {group_a_g_g_g : vector}
                hm_df = pd.DataFrame(hm_df)
                boolean = False
            else:
                hm_df[group_a_g_g_g] = vector

    hm = sns.heatmap(data = hm_df, yticklabels = selected, center=5)

Now let's plot the heatmap of only the selected options

In [None]:
PrintSubHeatMap(genres_to_analyze,genders_to_analyze,age_to_analyze,Genres)

From the heatmap (the main one), we can see that *cult movies* and *classic movies* are the most liked (which does not really come as a surprise, I would say).
The least liked are *horror movies* and *movies* (which I believe is an incorrect label).

Which [genre, gender, age] combination is the most liked, and which the least?

In [None]:
# track the best and worst [genre, gender, age] combination together with its mean vote
maximum = ['dummy', 0 ]
minimum = ['dummy', 10]
for a_g in age_groups:
    for g_g in gender_groups:
        column = DetermineColumn(g_g,a_g)
        group_a_g_g_g = g_g + '_'+ a_g
        for g in Genres:
            average = data[column].loc[data[g] == True].mean()
            if average > maximum[1]:
                maximum = [ [g, g_g, a_g] ,average]
            if average < minimum[1]:
                minimum = [ [g, g_g, a_g] ,average]

print('most  liked genre - gender - age: ',  maximum)
print('least liked genre - gender - age: ',  minimum)

# it would probably be more efficient to take the min/max of each column of the dataframe
# and then compare just those.
# Anyway, the df is small, so this approach is fast enough
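The idea in the comment above can be sketched as follows, on a toy genre-by-group table with invented numbers. Stacking the table turns it into a single Series indexed by (genre, group), on which `idxmax` and `idxmin` give the answer directly. The same could be applied to the heatmap dataframe, assuming its index is first set to the genre names.

```python
import pandas as pd

# Toy table of mean votes: one row per genre, one column per gender_age group (values invented)
table = pd.DataFrame(
    {"male_18-30": [6.2, 5.1], "female_18-30": [6.8, 4.9]},
    index=["Dramas", "HorrorMovies"],
)

flat = table.stack()   # Series indexed by (genre, group)
best = flat.idxmax()   # (genre, group) with the highest mean vote
worst = flat.idxmin()  # (genre, group) with the lowest mean vote

print("most  liked:", best, flat.loc[best])
print("least liked:", worst, flat.loc[worst])
```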

Another analysis I would like to dive into is "national" bias in movie ratings. For instance, an Italian such as myself might be more benevolent when judging Italian movies, because they may speak to something closer to my own reality. This becomes interesting because countries with a massive population can shift the overall score of some movies.

In [None]:
data['country'].value_counts()

Let's take the case of India (1.3+ billion people) vs. the USA (0.3+ billion people). They have roughly the same number of movies in this dataframe, but let's check how many Indian movies there are in the top 100. (Disclaimer: this is not to say that Bollywood does not produce good movies. Bollywood produces excellent movies, in my opinion.)

In [None]:
# take top 100 movies (general rating)
top100 = data.nlargest(100,['weighted_average_vote'])
# count the movies whose only listed country is India
print('Indian movies in the top100: ', len(top100[(top100.country == "india") | (top100.country == "India")]))
# count the movies whose only listed country is the USA
print('USA movies in the top100: ', len(top100[(top100.country == "United States") | (top100.country == "united states")]))

68 of the top 100 movies were produced in India, as opposed to just 13 in the USA, and this does not even take into account co-productions with other countries!
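To include co-productions in the count, one option is to split the `country` field and count each country separately. The sketch below assumes that co-productions are stored as a comma-separated string such as "United States, India". It runs on a toy frame with invented rows, so its numbers say nothing about the real top 100.

```python
import pandas as pd

toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "country": ["India", "United States, India", "United States", None],
})

# one row per (movie, country): drop missing values, split on commas, trim spaces
countries = toy["country"].dropna().str.split(",").explode().str.strip()
counts = countries.value_counts()

print(counts)
```

With this counting, a co-production contributes to every country involved, so the per-country totals can add up to more than the number of movies.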