# Netflix Original Films & IMDB Scores - EDA
This dataset consists of all Netflix original films released as of June 1st, 2021, including all Netflix documentaries and specials. The data was scraped from a Wikipedia page and then merged with a dataset of the corresponding IMDB scores. IMDB scores are voted on by community members, and the majority of the films have 1,000+ reviews.

The dataset is available on Kaggle.

The dataset consists of the following columns:

Title
Genre
Premiere date
Runtime
IMDB scores
Languages



This Colab notebook covers the following questions:



1. In which language were the long-running films created according to the dataset? Make a visualization.
2. Find and visualize the IMDB values of the movies shot in the 'Documentary' genre between January 2019 and June 2020.
3. Which genre has the highest IMDB rating among movies shot in English?
4. What is the average 'runtime' of movies shot in 'Hindi'?
5. How many categories does the Genre Column have and what are they? Visualize it.
6. Find the 3 most used languages in the movies in the data set.
7. Top 10 Movies With IMDB Ratings
8. What is the correlation between IMDB score and 'Runtime'? Examine and visualize.
9. Top 10 Genre by IMDB Score
10. What are the top 10 movies with the highest 'runtime'? Visualize it.
11. In which year were the most movies released? Visualize it.
12. Which language's movies have the lowest average IMDB rating? Visualize it.
13. Which year has the greatest total runtime?
14. What is the "Genre" where each language is used the most?
15. Is there any outlier data in the data set? Please explain.

In [4]:
# Importing the libraries
import pandas as pd
import numpy as np

In [5]:
#Loading the dataset

netflix_data = pd.read_csv("NetflixOriginals.csv", encoding_errors= "replace")
netflix_data.head(10)

Unnamed: 0,Title,Genre,Premiere,Runtime,IMDB Score,Language
0,Enter the Anime,Documentary,"August 5, 2019",58,2.5,English/Japanese
1,Dark Forces,Thriller,"August 21, 2020",81,2.6,Spanish
2,The App,Science fiction/Drama,"December 26, 2019",79,2.6,Italian
3,The Open House,Horror thriller,"January 19, 2018",94,3.2,English
4,Kaali Khuhi,Mystery,"October 30, 2020",90,3.4,Hindi
5,Drive,Action,"November 1, 2019",147,3.5,Hindi
6,Leyla Everlasting,Comedy,"December 4, 2020",112,3.7,Turkish
7,The Last Days of American Crime,Heist film/Thriller,"June 5, 2020",149,3.7,English
8,Paradox,Musical/Western/Fantasy,"March 23, 2018",73,3.9,English
9,Sardar Ka Grandson,Comedy,"May 18, 2021",139,4.1,Hindi


In [6]:
#Checking the data info 

netflix_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 584 entries, 0 to 583
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Title       584 non-null    object 
 1   Genre       584 non-null    object 
 2   Premiere    584 non-null    object 
 3   Runtime     584 non-null    int64  
 4   IMDB Score  584 non-null    float64
 5   Language    584 non-null    object 
dtypes: float64(1), int64(1), object(4)
memory usage: 27.5+ KB


1. In which language were the long-running films created according to the dataset? Make a visualization.

In [7]:
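# Treat films longer than 50 minutes as "long-running" (threshold chosen for this analysis)
long_films = netflix_data[netflix_data["Runtime"] > 50]
# Count how many long-running films exist per language
language_counts = long_films["Language"].value_counts()
print(f"Number of long-running films per language: {language_counts}")

Number of long-running films per language: Language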
English                       358
Hindi                          33
Spanish                        29
French                         20
Italian                        14
Portuguese                     10
Indonesian                      9
Korean                          6
Turkish                         5
Japanese                        5
German                          5
Marathi                         3
Dutch                           3
Polish                          3
English/Japanese                2
Thai                            2
Filipino                        2
Norwegian                       1
English/Akan                    1
English/Russian                 1
English/Mandarin                1
English/Arabic                  1
English/Spanish                 1
English/Korean                  1
Spanish/English                 1
Tamil                           1
Khmer/English/French            1
Thia/English                
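
Question 1 also asks for a visualization. The following is a minimal sketch of one way to plot the counts, assuming the same 50-minute threshold as above. The `sample` frame and the helper name `count_long_films` are illustrative stand-ins, not part of the notebook or the dataset.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical sample standing in for netflix_data
sample = pd.DataFrame({
    "Title": ["A", "B", "C", "D", "E"],
    "Runtime": [58, 81, 25, 147, 112],
    "Language": ["English", "Spanish", "English", "Hindi", "English"],
})

def count_long_films(df, min_runtime=50):
    """Count films longer than min_runtime minutes, per language."""
    long_films = df[df["Runtime"] > min_runtime]
    return long_films["Language"].value_counts()

language_counts = count_long_films(sample)

# Horizontal bars, sorted so the most frequent language ends up on top
ax = language_counts.sort_values().plot.barh(figsize=(8, 4))
ax.set_xlabel("Number of films longer than 50 minutes")
ax.set_ylabel("Language")
ax.set_title("Long-running films by language")
plt.tight_layout()
```

On the real data the same call would be `count_long_films(netflix_data)`. With one dominant language, a logarithmic x-axis (`ax.set_xscale("log")`) may make the smaller bars easier to read.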

2. Find and visualize the IMDB values of the movies shot in the 'Documentary' genre between January 2019 and June 2020.

In [8]:
# Renaming the "IMDB Score" column to "ratings"

netflix_data.rename(columns={"IMDB Score": "ratings"}, inplace=True)
netflix_data.info()

# Filtering the movies in the Documentary genre
half_time = netflix_data[netflix_data["Genre"] == "Documentary"]

# Parsing the Premiere column as datetimes; the result has to be stored,
# and dates that cannot be parsed become NaT
premiere_dates = pd.to_datetime(netflix_data["Premiere"], errors="coerce")

# Keeping the documentaries that premiered between January 2019 and June 2020 (inclusive)
mask = premiere_dates.between("2019-01-01", "2020-06-30")
docs_in_period = half_time[mask.loc[half_time.index]]

# Preview of all documentaries, before the date filter is applied
half_time.head()



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 584 entries, 0 to 583
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Title     584 non-null    object 
 1   Genre     584 non-null    object 
 2   Premiere  584 non-null    object 
 3   Runtime   584 non-null    int64  
 4   ratings   584 non-null    float64
 5   Language  584 non-null    object 
dtypes: float64(1), int64(1), object(4)
memory usage: 27.5+ KB


Unnamed: 0,Title,Genre,Premiere,Runtime,ratings,Language
0,Enter the Anime,Documentary,"August 5, 2019",58,2.5,English/Japanese
10,Searching for Sheela,Documentary,"April 22, 2021",58,4.1,English
15,After the Raid,Documentary,"December 19, 2019",25,4.3,Spanish
20,"Hello Privilege. It's Me, Chelsea",Documentary,"September 13, 2019",64,4.4,English
30,After Maria,Documentary,"May 24, 2019",37,4.6,English/Spanish
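
The preview above lists documentaries from every year, so the date window still has to be applied before plotting. Below is a minimal sketch of the filter plus a bar chart, using rows copied from the outputs shown earlier. The `sample` frame, the variable names, and the assumption that every date follows the "Month day, year" pattern are illustrative.

```python
import pandas as pd
import matplotlib.pyplot as plt

# A few rows copied from the outputs above, standing in for netflix_data
sample = pd.DataFrame({
    "Title": ["Enter the Anime", "Searching for Sheela", "After the Raid",
              "After Maria", "Dark Forces"],
    "Genre": ["Documentary", "Documentary", "Documentary",
              "Documentary", "Thriller"],
    "Premiere": ["August 5, 2019", "April 22, 2021", "December 19, 2019",
                 "May 24, 2019", "August 21, 2020"],
    "ratings": [2.5, 4.1, 4.3, 4.6, 2.6],
})

# Assumed date format; values that do not match become NaT and are excluded
premiere = pd.to_datetime(sample["Premiere"], format="%B %d, %Y", errors="coerce")

# January 2019 through June 2020, both ends inclusive
in_window = premiere.between(pd.Timestamp("2019-01-01"), pd.Timestamp("2020-06-30"))
docs = sample[(sample["Genre"] == "Documentary") & in_window]

# One bar per documentary, ordered by rating
ax = docs.sort_values("ratings").plot.barh(x="Title", y="ratings", legend=False)
ax.set_xlabel("IMDB score")
ax.set_title("Documentaries premiered Jan 2019 - Jun 2020")
plt.tight_layout()
```

Comparing parsed datetimes avoids the pitfall of comparing raw strings such as "August 5, 2019" against "2019-01-01", which would sort alphabetically rather than chronologically.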


3. Which genre has the highest IMDB rating among movies shot in English?

In [9]:
# Filtering for the English-language movies
english_movies = netflix_data[netflix_data["Language"] == "English"]

# Mean rating per genre, then the genre with the highest mean
english_rating = english_movies.groupby("Genre")["ratings"].mean()
highest_rating = english_rating.idxmax()

print(f"The genre with the highest average rating is: {highest_rating}")

The genre with the highest average rating is: Animation/Christmas/Comedy/Adventure
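
A very specific genre label may contain only a handful of films, in which case its mean is just those films' scores. The sketch below shows one way to report the mean together with the number of films and to apply a minimum count. The `sample` data and the `min_films` cutoff are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical sample standing in for the English-language films
sample = pd.DataFrame({
    "Genre": ["Drama", "Drama", "Drama", "Comedy", "Comedy", "Animation/Christmas"],
    "Language": ["English"] * 6,
    "ratings": [7.0, 6.0, 6.5, 5.5, 6.5, 8.2],
})

english = sample[sample["Language"] == "English"]

# Mean rating and number of films per genre, best mean first
summary = (
    english.groupby("Genre")["ratings"]
    .agg(mean_rating="mean", n_films="count")
    .sort_values("mean_rating", ascending=False)
)

# Assumed cutoff: only consider genres with at least this many films
min_films = 2
robust_top = summary[summary["n_films"] >= min_films]["mean_rating"].idxmax()

print(summary)
print(f"Top genre with at least {min_films} films: {robust_top}")
```

In this sample the single-film genre leads the raw ranking, while the cutoff selects "Drama" instead. Whether such a cutoff is appropriate depends on the question being asked.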


4. What is the average 'runtime' of movies shot in 'Hindi'?


In [10]:
hindi_movies = netflix_data[netflix_data["Language"] == "Hindi"]
avg_runtime = hindi_movies["Runtime"].mean()
print(f"Average Runtime for movies shot in Hindi is : {avg_runtime}")

Average Runtime for movies shot in Hindi is : 115.78787878787878


5. How many categories does the Genre Column have and what are they? Visualize it.

In [11]:
num_of_genre = netflix_data["Genre"].nunique()
cat_of_genre = netflix_data["Genre"].unique()

print(f"We have {num_of_genre} unique genres, and they are {cat_of_genre}")

We have 115 unique genres, and they are ['Documentary' 'Thriller' 'Science fiction/Drama' 'Horror thriller'
 'Mystery' 'Action' 'Comedy' 'Heist film/Thriller'
 'Musical/Western/Fantasy' 'Drama' 'Romantic comedy' 'Action comedy'
 'Horror anthology' 'Political thriller' 'Superhero-Comedy' 'Horror'
 'Romance drama' 'Anime / Short' 'Superhero' 'Heist' 'Western'
 'Animation/Superhero' 'Family film' 'Action-thriller' 'Teen comedy-drama'
 'Romantic drama' 'Animation' 'Aftershow / Interview' 'Christmas musical'
 'Science fiction adventure' 'Science fiction' 'Variety show'
 'Comedy-drama' 'Comedy/Fantasy/Family' 'Supernatural drama'
 'Action/Comedy' 'Action/Science fiction' 'Romantic teenage drama'
 'Comedy / Musical' 'Musical' 'Science fiction/Mystery' 'Crime drama'
 'Psychological thriller drama' 'Adventure/Comedy' 'Black comedy'
 'Romance' 'Horror comedy' 'Christian musical' 'Romantic teen drama'
 'Family' 'Dark comedy' 'Comedy horror' 'Psychological thriller' 'Biopic'
 'Science fiction/Thril

6. Find the 3 most used languages in the movies in the data set.

In [12]:
most_lang = netflix_data["Language"].value_counts()

# head(3) returns the three most frequent languages
print(f"The 3 most used languages in the data are: {most_lang.head(3)}")

The 3 most used languages in the data are: Language
English    401
Hindi       33
Spanish     31
Name: count, dtype: int64
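
Note that `value_counts()` treats a combined label such as "English/Japanese" as a language of its own. If each language in a multilingual film should be counted separately, one possible approach is to split the labels on "/" first. The sketch below uses a hypothetical `languages` series, not the real column.

```python
import pandas as pd

# Hypothetical sample of the Language column
languages = pd.Series(
    ["English", "English/Japanese", "Spanish", "Hindi",
     "English/Spanish", "Hindi", "English", "Hindi"],
    name="Language",
)

# Split combined labels, give each language its own row, then count
per_language = (
    languages.str.split("/")
    .explode()
    .str.strip()
    .value_counts()
)

top3 = per_language.head(3)
print(top3)
```

With this counting rule a film in two languages contributes to both totals, so the counts no longer sum to the number of films.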


7. Top 10 Movies With IMDB Ratings

In [13]:
top_movies = netflix_data.nlargest(10,"ratings")[["Title", "ratings"]]
print(top_movies)

                                           Title  ratings
583     David Attenborough: A Life on Our Planet      9.0
582    Emicida: AmarElo - It's All For Yesterday      8.6
581                      Springsteen on Broadway      8.5
578   Ben Platt: Live from Radio City Music Hall      8.4
579        Taylor Swift: Reputation Stadium Tour      8.4
580  Winter on Fire: Ukraine's Fight for Freedom      8.4
576                      Cuba and the Cameraman       8.3
577                       Dancing with the Birds      8.3
571                                         13th      8.2
572            Disclosure: Trans Lives on Screen      8.2


8. What is the correlation between IMDB score and 'Runtime'? Examine and visualize.

In [15]:
rating_corr = netflix_data["ratings"].corr(netflix_data["Runtime"])

print(f"The correlation between the IMDB Ratings and the Runtime is {rating_corr}")

The correlation between the IMDB Ratings and the Runtime is -0.04089629142078874
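
A coefficient of about -0.04 is very close to zero, which indicates essentially no linear relationship between runtime and rating. Question 8 also asks for a visualization; a scatter plot is one common choice. The sketch below uses only the ten rows shown by `head(10)` earlier, so its coefficient will differ from the full-dataset value. The `sample` frame and variable names are illustrative.

```python
import pandas as pd
import matplotlib.pyplot as plt

# The ten rows displayed by head(10) above, standing in for netflix_data
sample = pd.DataFrame({
    "Runtime": [58, 81, 79, 94, 90, 147, 112, 149, 73, 139],
    "ratings": [2.5, 2.6, 2.6, 3.2, 3.4, 3.5, 3.7, 3.7, 3.9, 4.1],
})

# Pearson correlation (the default method of Series.corr)
r = sample["ratings"].corr(sample["Runtime"])

# One point per film: runtime on the x-axis, rating on the y-axis
ax = sample.plot.scatter(x="Runtime", y="ratings", alpha=0.7)
ax.set_xlabel("Runtime (minutes)")
ax.set_ylabel("IMDB score")
ax.set_title(f"IMDB score vs. runtime (Pearson r = {r:.2f})")
plt.tight_layout()
```

On the real data, replacing `sample` with `netflix_data` would plot all 584 films.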
