I'm beginning my data science journey on Kaggle, and this is my first project on the platform. The aim was simply to explore the dataset and get acquainted with the different visualizations that can be made from it. I enjoyed doing this and hope to do more projects on Kaggle. If you are reading this notebook, any comments on how I have done and suggestions on what more I can do or improve would be appreciated and very valuable to my learning.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
df=pd.read_csv('../input/netflix-shows/netflix_titles.csv')

In [None]:
df.shape

In [None]:
df.info()

In [None]:
df.head()

In [None]:
df.tail()

We can see that there are a few missing values in the data. Let us look at how many values are missing.

In [None]:
df.isnull().sum().sort_values(ascending=False)

The director column has the most missing values, followed by cast and country. The date and rating columns have very few. At the time of doing this project I do not know any imputation techniques, so I'm going to continue exploring the original dataset.
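
Raw counts are easier to judge next to the share of rows they represent. The sketch below is one way to tabulate both; the helper name `missing_summary` and the toy frame are made up for illustration and are not part of the original analysis.

```python
import numpy as np
import pandas as pd

def missing_summary(frame):
    """Return missing-value counts and percentages per column, largest first."""
    counts = frame.isnull().sum()
    percent = (counts / len(frame) * 100).round(1)
    summary = pd.DataFrame({'missing': counts, 'percent': percent})
    return summary.sort_values('missing', ascending=False)

# Toy data standing in for the Netflix frame (values are invented)
toy = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'director': [np.nan, 'X', np.nan, np.nan],
    'country': ['India', np.nan, 'Japan', 'Japan'],
})
summary = missing_summary(toy)
print(summary)
```

On the real data this would be called as `missing_summary(df)`.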

In [None]:
df.duplicated('title').sum()

There are no duplicate titles in the dataset.

Let's first see the distribution of content between movies and TV shows.

In [None]:
types=df['type'].value_counts()
fig,ax1=plt.subplots(figsize=(8,4))
ax1.pie(types,explode=(0.05,0.05),labels=types.keys(),autopct="%.1f%%",shadow=True,startangle=45,colors=('Blue','Red'))
ax1.axis('equal')
plt.title('Distribution of Content on Netflix')

69.1% of titles available on Netflix are movies while 30.9% are TV shows.
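
The percentages printed on the pie chart can also be computed directly, which makes them easy to quote in text. This is a minimal sketch on toy data; `toy_types` is an invented stand-in for `df['type']`.

```python
import pandas as pd

# Toy stand-in for df['type'] (invented values, not the real dataset)
toy_types = pd.Series(['Movie', 'Movie', 'TV Show', 'Movie'], name='type')

# normalize=True returns fractions that sum to 1; scale them to percentages
type_share = toy_types.value_counts(normalize=True).mul(100).round(1)
print(type_share)
```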

Let's now look at which countries have produced the most movies and TV shows.

In [None]:
x=df[df['country'].notnull()]
countries_type=pd.DataFrame(x['country'].str.split(',').tolist(),index=x['type']).stack()
countries_type=countries_type.reset_index([0, 'type'])
countries_type.columns=['type','country']
countries_type['country']=countries_type['country'].str.strip()
top10_countries=countries_type['country'].value_counts().head(10)
plt.figure(figsize=(12,6))
sns.barplot(x=top10_countries.keys(),y=top10_countries.values,palette='rainbow')
plt.title('Top 10 Countries by No. of Titles Produced on Netflix')

The United States leads in the number of titles produced on Netflix by a huge margin. India and the United Kingdom come second and third.
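
The split-and-stack step above can also be written with `DataFrame.explode`, which I understand is available from pandas 0.25 onward. The sketch below shows that alternative on toy data; the frame `toy` and the name `countries_long` are invented for illustration.

```python
import pandas as pd

# Toy stand-in for the Netflix frame (invented rows)
toy = pd.DataFrame({
    'type': ['Movie', 'TV Show', 'Movie'],
    'country': ['United States, India', 'Japan', None],
})

countries_long = (
    toy.dropna(subset=['country'])                                 # skip rows with no country
       .assign(country=lambda d: d['country'].str.split(','))      # string -> list of countries
       .explode('country')                                         # one row per (title, country)
       .reset_index(drop=True)
       .assign(country=lambda d: d['country'].str.strip())         # remove spaces left by the split
       .loc[:, ['type', 'country']]
)
print(countries_long)
```

The result has the same two columns as `countries_type`, so the `value_counts()` calls used for the plots would work on it unchanged.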

Let us now see the distribution of movies and TV shows among countries.

In [None]:
country_movies = countries_type[countries_type['type']=="Movie"]
top10_countriesbymov = country_movies['country'].value_counts().head(10)
plt.figure(figsize=(12,6))
sns.barplot(x=top10_countriesbymov.keys(),y=top10_countriesbymov.values,palette='rainbow')
plt.title('Top 10 Countries by No. of Movies Produced on Netflix')

In [None]:
country_tv = countries_type[countries_type['type']=="TV Show"]
top10_countriesbytv = country_tv['country'].value_counts().head(10)
plt.figure(figsize=(12,6))
sns.barplot(x=top10_countriesbytv.keys(),y=top10_countriesbytv.values,palette='rainbow')
plt.title('Top 10 Countries by No. of TV Shows Produced on Netflix')
plt.tight_layout()

**Insights from the three country plots:**
1. The United States leads the production of both movies and TV shows by a huge margin over other countries.
2. India produces far more movies than TV shows.
3. Japan, South Korea, Taiwan and Australia produce more TV shows than movies.

Note: 507 values for country are missing in the dataset, and all of the analysis has been done without any imputation. I still believe the United States would remain the leader by a huge margin even if the missing values were filled in, given the global market for shows and movies coming out of Hollywood. However, I think the missing values could affect the relative positions of India and the United Kingdom.
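
A rough way to test this intuition is to compare the gap between neighbouring countries with the number of missing rows. The sketch below assumes the worst case, where every missing row belongs to the lower-ranked country and lists only one country. The function name and the counts are invented for illustration; they are not the real figures.

```python
import pandas as pd

def ranks_that_could_flip(counts, n_missing):
    """List adjacent pairs whose gap is no larger than the number of missing rows."""
    ordered = counts.sort_values(ascending=False)
    pairs = []
    for upper, lower in zip(ordered.index[:-1], ordered.index[1:]):
        gap = ordered[upper] - ordered[lower]
        if gap <= n_missing:
            pairs.append((upper, lower, int(gap)))
    return pairs

# Invented counts for illustration only
toy_counts = pd.Series({'United States': 3000, 'India': 1000, 'United Kingdom': 700})
flippable = ranks_that_could_flip(toy_counts, n_missing=507)
print(flippable)
```

With the real data, `counts` would be `countries_type['country'].value_counts()`.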

Let us now look at the number of titles by the year they were released.

In [None]:
plt.figure(figsize=(12,6))
# keep titles released in 2000 or later
df_year = df[df['release_year'] > 1999]
sns.countplot(x='release_year',data=df_year,palette="husl")
# the plot counts all titles (movies and TV shows), so the title says so
plt.title('No. of titles released each year since 2000')
plt.tight_layout()

Many titles were released in the years 2016-2020. 2021 has very few titles, and I think that is because the dataset was created in early 2021, so an updated dataset could give us more insights.
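
One way to check this guess is to look at the latest date on which a title was added. The sketch below assumes a `date_added` column holding strings such as 'January 15, 2021'. Both the column name and the date format are assumptions here, and the rows are invented.

```python
import pandas as pd

# Toy stand-in; column name and date format are assumptions
toy = pd.DataFrame({
    'release_year': [2019, 2020, 2021],
    'date_added': ['March 1, 2019', ' December 20, 2020', 'January 15, 2021'],
})

# Strip stray spaces, then parse; values that do not match become NaT
added = pd.to_datetime(toy['date_added'].str.strip(), format='%B %d, %Y', errors='coerce')
last_added = added.max()
print(last_added)
```

If the latest date falls early in 2021, that would support the idea that the low 2021 count reflects when the data was collected.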

From here on I want to focus on the type of content on Netflix, specifically the ratings the titles have received and their genres.

In [None]:
df['rating'].unique()

A quick Google search gives us the following meanings for the ratings:

**TV Shows**

TV-MA - This program is intended to be viewed by mature, adult audiences and may be unsuitable for children under 17

TV-14 - This program may be unsuitable for children under 14 years of age

TV-PG - This program contains material that parents may find unsuitable for younger children. Parental guidance is recommended

TV-G - This program is suitable for all ages

TV-Y7 - This program is most appropriate for children age 7 and up

TV-Y7-FV - TV-Y7 with Fantasy violence.

TV-Y - This program is aimed at a very young audience, including children from ages 2–6.

**Movies**

Rated G: General audiences – All ages admitted

Rated PG: Parental guidance suggested – Some material may not be suitable for children.

Rated PG-13: Parents strongly cautioned – Some material may be inappropriate for children under 13.

Rated R: Restricted – Under 17 requires accompanying parent or adult guardian.

Rated NC-17: Adults Only – No one 17 and under admitted.

NR and UR mean Not Rated and Unrated, respectively.

I do not want to change the values in my dataframe but I will use the above details to understand the results.
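
One way to use these definitions without modifying `df` is to map the rating codes into a separate Series. The sketch below does this on toy data. The group labels in `AUDIENCE` are my own shorthand for the definitions above, not official categories.

```python
import pandas as pd

# Hypothetical grouping of rating codes, based on the definitions listed above
AUDIENCE = {
    'TV-Y': 'Young children', 'TV-Y7': 'Young children', 'TV-Y7-FV': 'Young children',
    'TV-G': 'General', 'G': 'General',
    'TV-PG': 'Parental guidance', 'PG': 'Parental guidance',
    'TV-14': 'Teens', 'PG-13': 'Teens',
    'TV-MA': 'Adults', 'R': 'Adults', 'NC-17': 'Adults',
    'NR': 'Unrated', 'UR': 'Unrated',
}

# Toy stand-in for df['rating'] (invented values)
toy_ratings = pd.Series(['TV-MA', 'PG-13', 'TV-Y', 'NR', None], name='rating')

# map returns a new Series; the original ratings are left unchanged
audience = toy_ratings.map(AUDIENCE)
print(audience.value_counts(dropna=False))
```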

In [None]:
df_movies = df[df['type']=='Movie']
df_TVshows = df[df['type']=='TV Show']

In [None]:
mov_ratings = df_movies['rating'].value_counts()
plt.figure(figsize=(17,7))
plt.pie(mov_ratings,labels=mov_ratings.keys(),explode=[0.03]*len(mov_ratings),autopct="%.1f%%")
plt.title("Distribution of Movie Ratings on Netflix")
plt.tight_layout()

In [None]:
tv_ratings = df_TVshows['rating'].value_counts()
plt.figure(figsize=(17,7))
plt.pie(tv_ratings,labels=tv_ratings.keys(),explode=[0.03]*len(tv_ratings),autopct="%.1f%%")
plt.title("Distribution of TV Show Ratings on Netflix")
plt.tight_layout()

A lot of the content on Netflix is rated as unsuitable for children under 17 or under 14.

In [None]:
x=df_movies[df_movies['rating'].notnull()]
mov_genre_ratings=pd.DataFrame(x['listed_in'].str.split(',').tolist(),index=x['rating']).stack()
mov_genre_ratings=mov_genre_ratings.reset_index([0, 'rating'])
mov_genre_ratings.columns=['rating','genre']
mov_genre_ratings['genre']=mov_genre_ratings['genre'].str.strip()
plt.figure(figsize=(10,10))
sns.countplot(y=mov_genre_ratings['genre'],data=mov_genre_ratings,palette="viridis")
plt.xticks(rotation=45)
plt.title('Genres movies are listed in')
plt.tight_layout()

Most movies on Netflix are listed under International Movies and Dramas, while Cult Movies, Anime Features and Faith & Spirituality are the genres with the fewest movies.

In [None]:
x=df_TVshows[df_TVshows['rating'].notnull()]
tv_genre_ratings=pd.DataFrame(x['listed_in'].str.split(',').tolist(),index=x['rating']).stack()
tv_genre_ratings=tv_genre_ratings.reset_index([0, 'rating'])
tv_genre_ratings.columns=['rating','genre']
tv_genre_ratings['genre']=tv_genre_ratings['genre'].str.strip()
plt.figure(figsize=(10,10))
sns.countplot(y=tv_genre_ratings['genre'],data=tv_genre_ratings,palette="viridis")
plt.title('Genres TV shows are listed in')
plt.tight_layout()

International TV Shows and TV Dramas are the genres with the most TV shows, while Classic & Cult TV and TV Thrillers are the genres with the fewest.

Next I would like to see if there is any relation between the rating and genre of the titles on Netflix.

In [None]:
t=mov_genre_ratings.groupby(['rating','genre']).size()
t1 = t.to_frame()
t1.columns=['count']
t2=pd.pivot_table(data=t1,index='rating',columns='genre',values='count')
t2=t2.fillna(0.0)
sns.clustermap(t2,cmap="viridis")

We can see that TV-MA and TV-14 rated movies are mostly classified in similar genres.

In [None]:
t=tv_genre_ratings.groupby(['rating','genre']).size()
t1 = t.to_frame()
t1.columns=['count']
t2=pd.pivot_table(data=t1,index='rating',columns='genre',values='count')
t2=t2.fillna(0.0)
sns.clustermap(t2,cmap="viridis")

For TV shows, we can see that TV-MA and TV-14 rated shows are very similar in the genres they are listed in. TV-Y7 and TV-Y shows are mostly listed in Kids' TV, as they are meant for kids.
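
Because the heatmaps above use raw counts, ratings with many titles can dominate the colour scale. One possible refinement is to compare the share of each genre within a rating instead. The sketch below does this with `pd.crosstab(..., normalize='index')` on invented rows.

```python
import pandas as pd

# Toy stand-in for tv_genre_ratings (invented rows)
toy = pd.DataFrame({
    'rating': ['TV-MA', 'TV-MA', 'TV-MA', 'TV-14', 'TV-Y', 'TV-Y'],
    'genre':  ['TV Dramas', 'TV Dramas', 'Crime TV Shows', 'TV Dramas', "Kids' TV", "Kids' TV"],
})

# normalize='index' divides each row by its total, so every rating sums to 1
genre_share = pd.crosstab(toy['rating'], toy['genre'], normalize='index')
print(genre_share.round(2))
```

The resulting table could be passed to `sns.clustermap` in place of `t2`.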

Let us look at the distribution of the duration of movies and TV shows. The duration of movies is in minutes and the duration of TV shows is in seasons.
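
Since the column mixes two units, it can help to separate the number from the unit before plotting. This is a minimal sketch using `Series.str.extract` with named groups; it assumes values look like '90 min' or '2 Seasons', and the toy Series is invented.

```python
import pandas as pd

# Toy stand-in for df['duration'] (invented values)
toy_duration = pd.Series(['90 min', '2 Seasons', '1 Season', None], name='duration')

# Capture the leading number and the unit that follows it; non-matching rows become NaN
parts = toy_duration.str.extract(r'^(?P<value>\d+)\s+(?P<unit>\w+)$')
parts['value'] = pd.to_numeric(parts['value'])
print(parts)
```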

In [None]:
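# take the leading number from strings such as '90 min' and convert it to float
mov_duration = df_movies['duration'].str.split().str[0].astype('float64')
# distplot is deprecated in recent seaborn releases; histplot (seaborn 0.11+) with a KDE is the closest replacement
sns.histplot(mov_duration, kde=True, stat='density')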
plt.xlabel("Duration in minutes")

In [None]:
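# take the leading number from strings such as '2 Seasons' and convert it to float
tv_duration = df_TVshows['duration'].str.split().str[0].astype('float64')
# distplot is deprecated in recent seaborn releases; histplot (seaborn 0.11+) with a KDE is the closest replacement
sns.histplot(tv_duration, kde=True, stat='density')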
plt.xlabel("Duration in seasons")

In [None]:
df_movies.loc[mov_duration.astype('float64').idxmax()]

In [None]:
df_movies.loc[mov_duration.astype('float64').idxmin()]

The longest movie is Black Mirror: Bandersnatch, with a runtime of 312 minutes, and the shortest is an animated short film called Silent, with a runtime of 3 minutes.

In [None]:
df_TVshows.loc[tv_duration.astype('float64').idxmax()]

The TV show with the most seasons is Grey's Anatomy, with 16 seasons.

References:
https://sureshssarda.medium.com/pandas-splitting-exploding-a-column-into-multiple-rows-b1b1d59ea12e