In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

DATA ANALYSIS PRACTICE

This notebook uses a Netflix dataset covering a wide range of movies that premiered from 2014 to 2021. Each movie record consists of a title, genre, premiere date, runtime, IMDB score and language.

Questions:


- How many movies premiered each year?

- How many languages are used for movies on Netflix?

- Do movies with more than 2 genres receive higher IMDB scores?

- Do movies with more than 2 languages receive higher IMDB scores?

- Does runtime affect the IMDB score, besides genre and language?


There are 3 parts in the notebook:

1) Data access and wrangling

2) Data exploration and visualization

3) Summary

In [None]:
import matplotlib.pyplot as plt
import seaborn as sb
import plotly.express as px

%matplotlib inline

In [None]:
df = pd.read_csv('/kaggle/input/netflix-original-films-imdb-scores/NetflixOriginals.csv')
df

Data Wrangling:

- 'Premiere' should be a datetime type.
- To count the number of movies each year, the year must be extracted from the premiere date.

In [None]:
df['Premiere'] = pd.to_datetime(df['Premiere'])

# extract the month and year from the premiere date
df['month']=df['Premiere'].dt.month
df['year']=df['Premiere'].dt.year

In [None]:
genre_1 = df.copy()
genre_1['Genre'] = genre_1['Genre'].str.split('/')
genre_1 = genre_1.explode('Genre')
genre_1

In [None]:
# Strip stray leading/trailing spaces, then map spelling variants of the same genre to a single label.
# (replace({' ': ''}) only matches values that are exactly ' ', so str.strip() is used instead.)
genre_1['Genre'] = genre_1['Genre'].str.strip()
genre_1['Genre'] = genre_1['Genre'].replace({'Comedy-drama':'Drama-Comedy','Romance drama':'Romantic drama','Horror-thriller':'Horror thriller',
                    'Comedy horror':'Horror comedy','Variety show':'Variety Show','Action-thriller':'Action thriller',
                    'Heist film':'Heist','Science fiction':'Science Fiction','Family film':'Family',
                    'Romantic teenage drama':'Romantic teen drama'})

In [None]:
# Many genres appear only once, so I group them under a new label 'other'
# and count them together as a single genre
class_movie = genre_1['Genre'].value_counts()
rare_genres = class_movie[class_movie == 1].index

# keep genres that occur more than once, relabel the rest as 'other'
genre_1['Genre'] = genre_1['Genre'].where(~genre_1['Genre'].isin(rare_genres), 'other')

genre_1['Genre'].value_counts()

**DATA EXPLORATION AND VISUALIZATION**

1) How many movies premiered each year?

In [None]:
color_base=sb.color_palette()[0]
sb.countplot(data=df,x='year',color=color_base)
plt.xlabel('Year')
plt.ylabel('Frequency')
plt.title('Number of movies each year from 2014-2021');

The number of movies increased each year, peaking in 2020. Because 2021 was still in progress when the data was collected, I expect more movies to premiere in 2021.
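
The yearly counts behind the bar chart can also be read as a table. The following is a minimal sketch on a few made-up premiere dates (the `toy` frame is hypothetical and not taken from the Netflix dataset); it assumes the same `year` column derived from `Premiere` as above.

```python
import pandas as pd

# Toy premiere dates (hypothetical values, for illustration only)
toy = pd.DataFrame({
    "Premiere": pd.to_datetime(
        ["2019-03-01", "2019-11-15", "2020-01-10",
         "2020-06-05", "2020-12-25", "2021-02-14"]
    )
})
toy["year"] = toy["Premiere"].dt.year

# Count titles per year and order the result chronologically
movies_per_year = toy["year"].value_counts().sort_index()

# Year with the largest number of premieres
peak_year = movies_per_year.idxmax()

print(movies_per_year)
print("Peak year:", peak_year)
```

Applied to the real `df`, the same two calls would give the exact numbers that the bars only show approximately.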

In [None]:
plt.figure(figsize=(14,20))
order_movie = genre_1['Genre'].value_counts().index.tolist()
sb.countplot(data=genre_1, y='Genre',color=color_base, order=order_movie)
plt.xlabel('Number of movies')
plt.ylabel('Genre')
plt.title('Number of movies in each genre from 2014-2021');

In [None]:
fig =px.sunburst(genre_1,path=['year','Genre'])
fig.show()

This sunburst plot shows the number of movies in each genre that premiered each year from 2014 to 2021. Documentary, Drama, and Comedy are the three genres with the highest number of movies.

In [None]:
sb.boxplot(data=df,x='IMDB Score');
plt.title('IMDB Score distribution');

Based on the boxplot, most IMDB scores fall between about 6 and 7. One movie has a score of 9.
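
The box and whiskers correspond to quartiles, so the reading above can be checked numerically. This is a small sketch on made-up scores (the `scores` values are hypothetical); it uses the common 1.5 × IQR rule, which is also seaborn's default whisker length, to flag points drawn as outliers.

```python
import pandas as pd

# Toy IMDB scores (hypothetical values, for illustration only)
scores = pd.Series([5.2, 5.8, 6.1, 6.4, 6.6, 7.0, 7.3, 9.0], name="IMDB Score")

# Quartiles define the box; the IQR defines how far the whiskers may reach
q1, median, q3 = scores.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Points beyond the fences are the ones a boxplot draws individually
outliers = scores[(scores < lower_fence) | (scores > upper_fence)]

print(f"Q1={q1:.3f}, median={median:.3f}, Q3={q3:.3f}")
print(outliers)
```

On the real data, `df['IMDB Score'].describe()` gives the same quartiles in one call.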

In [None]:
# count the number of genres listed for each movie
df['number_of_class'] = df['Genre'].str.split('/').str.len()

In [None]:
# count the number of languages listed for each movie
df['number_of_language'] = df['Language'].str.split('/').str.len()
    
df

2) Do movies with more than 2 genres receive higher IMDB scores?

In [None]:
sb.boxplot(data=df, y='IMDB Score',x='number_of_class');
plt.title('IMDB Score of movies by number of genres');
plt.xlabel('Number of genres');

Comment:

Movies classified into more than one genre might receive higher scores from viewers because they can appeal to a wider range of age groups.
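
A boxplot does not show how many movies sit behind each box, so it helps to put the group size next to the median. The sketch below uses a hypothetical `toy` frame with the same `Genre` and `IMDB Score` column names as the dataset; the values are invented for illustration.

```python
import pandas as pd

# Toy data (hypothetical values, for illustration only)
toy = pd.DataFrame({
    "Genre": ["Drama", "Comedy", "Action/Comedy", "Drama/Romance",
              "Documentary", "Action/Sci-fi/Thriller"],
    "IMDB Score": [6.0, 5.5, 6.5, 7.0, 7.2, 6.8],
})

# Number of genres per movie: split on '/' and take the list length
toy["number_of_class"] = toy["Genre"].str.split("/").str.len()

# Group size and median score for each genre count
summary = toy.groupby("number_of_class")["IMDB Score"].agg(["count", "median"])
print(summary)
```

A group with only one or two movies can show a high median by chance, so the `count` column is worth checking before comparing boxes.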

3) Do movies with more than 2 languages receive higher IMDB scores?

In [None]:
sb.boxplot(data=df, y='IMDB Score',x='number_of_language');
plt.title('IMDB Score of movies by number of languages');
plt.xlabel('Number of languages');

Comment: Movies available in multiple languages, rather than only in English, might receive higher scores from viewers because they can reach a wider, non-English-speaking audience.

4) Does runtime affect the IMDB score besides genre and language?

In [None]:
sb.regplot(data=df,x='IMDB Score',y='Runtime');
plt.title('Scatter plot of runtime and IMDB score');

The scatter plot of IMDB Score and Runtime shows no clear linear correlation between them.
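
The visual impression can be backed by a correlation coefficient. The sketch below computes Pearson's r with pandas on invented values (the `toy` frame is hypothetical); the numbers are chosen so that scores rise and then fall with runtime, which gives r = 0 even though the two columns are clearly related.

```python
import pandas as pd

# Toy data (hypothetical values, for illustration only):
# scores rise and then fall as runtime increases
toy = pd.DataFrame({
    "Runtime": [90, 100, 110, 120],
    "IMDB Score": [6.0, 7.0, 7.0, 6.0],
})

# Pearson's r measures only the strength of a *linear* relationship
r = toy["Runtime"].corr(toy["IMDB Score"])  # method='pearson' is the default
print(f"Pearson r = {r:.3f}")
```

An r close to zero therefore rules out a straight-line trend, not every kind of dependence. On the real data the equivalent call would be `df['Runtime'].corr(df['IMDB Score'])`.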

In [None]:
plt.figure(figsize=(8,12))
plt.subplot(2,1,1)
cat_mean = df.groupby(['number_of_class','number_of_language'])['IMDB Score'].mean()
cat_mean = cat_mean.reset_index()
cat_mean = cat_mean.pivot(index='number_of_class',columns='number_of_language',values='IMDB Score')
sb.heatmap(cat_mean, annot=True)
plt.xlabel('Number of languages')
plt.ylabel('Number of genres')
plt.title('Mean IMDB Score of movies by number of languages and genres');

plt.subplot(2,1,2)
cat_mean_1 = df.groupby(['number_of_class','number_of_language']).size()
cat_mean_1 = cat_mean_1.reset_index()
cat_mean_1 = cat_mean_1.pivot(index='number_of_class',columns='number_of_language',values=0)
sb.heatmap(cat_mean_1, annot=True, fmt='.0f');  # counts are whole numbers
plt.xlabel('Number of languages')
plt.ylabel('Number of genres');
plt.title('Number of movies by number of genres and number of languages');

Comment: Movies classified into one genre and one language make up the largest group from 2014 to 2021. The size of this group may explain its low average IMDB Score, since a large number of movies spans a wide range of scores. The highest mean score belongs to the 4-genre, 1-language group, which contains only 2 movies.
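
Because the two heatmaps have to be read side by side, it can be easier to keep the mean and the group size in one table and to set aside groups that are too small to trust. This is a sketch on invented values; the `toy` frame and the `min_n` threshold of 3 are assumptions made for illustration, not part of the original analysis.

```python
import pandas as pd

# Toy data (hypothetical values, for illustration only)
toy = pd.DataFrame({
    "number_of_class":    [1, 1, 1, 1, 2, 2, 4],
    "number_of_language": [1, 1, 1, 2, 1, 1, 1],
    "IMDB Score":         [6.0, 5.0, 7.0, 6.5, 6.4, 6.8, 8.0],
})

# Mean score and number of movies for each (genres, languages) combination
cells = (
    toy.groupby(["number_of_class", "number_of_language"])["IMDB Score"]
       .agg(mean_score="mean", n="count")
)

# Assumed cut-off: ignore combinations with fewer than min_n movies
min_n = 3
reliable = cells[cells["n"] >= min_n]

print(cells)
print(reliable)
```

In this toy table the highest mean belongs to a group of one movie, which is the same pattern as the 4-genre, 1-language group described above.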

**SUMMARY**

There are a total of 584 movies in the dataset. The number of movies that premiered increased each year from 2014 to a peak in 2020, while 2021 was still in progress. The dataset covers a wide range of genres, of which documentary accounts for the highest number of movies. Some movies are available in more than one language, which makes it easier to reach viewers in other countries. In the boxplots, movies available in more than one language tend to have higher IMDB scores than English-only movies. Likewise, movies classified into several genres tend to score higher than one-genre movies.