Netflix Movies and TV Shows

![](http://miro.medium.com/max/1950/1*rmB5iEYR6zkWrzrZcCcFog.jpeg)

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Objective

What we want to find out:
1. Whether Netflix has increasingly focused on TV shows rather than movies in recent years.
2. The most common rating on Netflix.
3. Network analysis of actors / directors to find interesting insights.
4. The countries that produce the most titles on Netflix.
5. The directors with the most titles on Netflix.
6. The most common categories on Netflix.

**Import the libraries needed**

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Read the data into a DataFrame with pandas' read_csv.

In [None]:
df = pd.read_csv("../input/netflix-shows/netflix_titles.csv")
df.head()

The dataset has 6,234 rows.

In [None]:
index = df.index
number_of_rows = len(index)
print(number_of_rows)

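We check whether the data contains duplicate rows and, if so, drop them with drop_duplicates(). The method returns a new DataFrame, so the result is assigned back to df.

In [None]:
# drop_duplicates() does not modify df in place, so keep the returned frame
df = df.drop_duplicates()
df.shape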
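
Before dropping anything, it can help to count how many rows are exact duplicates. The following is a minimal sketch on a small made-up frame; the `sample` rows are hypothetical and not taken from the Netflix dataset.

```python
import pandas as pd

# Hypothetical toy frame standing in for the Netflix data.
sample = pd.DataFrame({
    "title": ["A", "B", "B", "C"],
    "type": ["Movie", "TV Show", "TV Show", "Movie"],
})

# duplicated() flags every row that repeats an earlier, fully identical row.
n_duplicates = int(sample.duplicated().sum())
print(n_duplicates)

# drop_duplicates() returns a new frame; the original is left unchanged.
deduped = sample.drop_duplicates()
print(len(deduped))
```

If the count is zero, the drop step changes nothing and can be skipped.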

Check for missing data

In [None]:
df.isnull().sum()
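
Raw counts of missing values are easier to judge as a percentage of all rows. A small sketch of that idea, using a hypothetical `sample` frame (the column names mirror the dataset, the values are invented):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the values are made up for illustration.
sample = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "director": ["X", np.nan, np.nan, "Y"],
    "country": ["India", "France", np.nan, "Japan"],
})

# isnull() gives True/False per cell; the mean of booleans is the
# fraction of missing values in each column.
missing_pct = (sample.isnull().mean() * 100).sort_values(ascending=False)
print(missing_pct)
```

Columns with a high percentage (director is a likely candidate in this dataset) deserve extra care in the later plots.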

**Types of our Data :**

In [None]:
df.dtypes

In [None]:
dummy = pd.get_dummies(df['type'])
df = pd.concat([df,dummy], axis=1)
#del df['type']
df.head()

# Type (Movie / TV Show)

We will now see whether Netflix is focusing on growing its TV content rather than its movies, and why.

In [None]:
df['type'].value_counts()

The gap between the two types is about 2,296 titles.

In [None]:
plt.title("Bar chart of Netflix content by type (Movies or TV Shows)")
df.groupby('type').size().plot(kind='bar')

In [None]:
plt.figure(figsize=(12,6))
plt.title("Percentage of Netflix titles that are Movies or TV Shows")
g = plt.pie(df.type.value_counts(),explode=(0.025,0.025), labels=df.type.value_counts().index, 
            colors=['blue','Cyan'],autopct='%1.1f%%', startangle=180)
plt.show()

In [None]:
plt.figure(figsize=(14, 6))
x1 = list(df[df['type'] == 'Movie']['release_year'])
x2 = list(df[df['type'] == 'TV Show']['release_year'])


colors = ['#000064', '#00ccff']
names = ['Movie', 'TV Show']

plt.hist([x1, x2], bins = int(180/15),
         color = colors, label=names)

# Plot formatting
plt.legend()
plt.xlabel('Year')
plt.ylabel('Number of Movies / TV Shows')
plt.title('Distribution of Movies and TV Shows by release year')

From this chart we can form two hypotheses:

1. Netflix has been focusing more on movies than on TV shows in recent years, especially in 2020.
   1. Many people watch movies in order to see the work of an actor or director they like.
   2. Many people do not have time to follow a series continuously, so they prefer a movie, which takes only about 90 minutes.
2. Netflix is trying to grow its TV shows more and more, rather than its movies.
   1. The people who are most excited about a particular movie will have seen it in the cinema before it becomes available on Netflix, so they have less incentive to check it out on Netflix when the chance comes up.
   2. TV shows tend to be less expensive to make than movies, so Netflix can afford to make not just more of them but also a wider range of them.
   3. Consumers have a wide range of interests, so a wide range of content is needed to bring them in as subscribers.
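
One way to compare the two hypotheses is to look at the share of each type per year instead of raw counts, since raw counts grow for both types as the catalogue grows. A minimal sketch, assuming the `type` and `release_year` columns used above; the `sample` rows are hypothetical.

```python
import pandas as pd

# Hypothetical toy frame; real data would come from the Netflix CSV.
sample = pd.DataFrame({
    "release_year": [2018, 2018, 2018, 2019, 2019, 2019, 2019],
    "type": ["Movie", "Movie", "TV Show",
             "Movie", "TV Show", "TV Show", "TV Show"],
})

# normalize="index" turns each year's counts into fractions that sum to 1,
# so the years can be compared regardless of how many titles each has.
share = pd.crosstab(sample["release_year"], sample["type"], normalize="index")
print(share)
```

A rising "TV Show" column over the years would support the second hypothesis, while a falling one would support the first.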

# Content

What each content rating on Netflix means:

**TV-Y**: The program is designed to be viewed by very young audiences under the age of 7 (ages 0 to 6).

**TV-Y7**: The program is aimed at children age 7 and older and may not be suitable for children under 7.

**TV-G**: In the United States TV Parental Guidelines, this signifies content that is suitable for all audiences. Some children's programs with content that teens or adults will relate to use a TV-G rating rather than a TV-Y rating. The rating is also used for shows with inoffensive content, such as cooking shows, religious programming, nature documentaries, shows about pets and animals, and classic television shows. Many shows on Disney Channel carry this rating, particularly sitcoms.

**TV-Y7-FV**: Recommended for ages 7 and older, with the specific advisory that the program contains fantasy violence.

**TV-MA**: These programs are usually created for an adult audience. Some content may not be appropriate for children under the age of 17, due to intense violence and particularly coarse language.

**TV-14**: Parents strongly cautioned. The program contains some material that many parents would find unsuitable for children under 14 years of age.

**TV-PG**: Parental guidance suggested. The program contains material that parents may find unsuitable for younger children.

**R**: Under 17 requires an accompanying parent or adult guardian. Parents are urged to learn more about the film before taking their young children with them.

**PG-13**: Parents strongly cautioned. Some material may be inappropriate for children under 13.

**NR** or **UR**: The film has not been submitted for a rating, or it is an uncut version of a film that was submitted.

**PG**: Some material may not be suitable for children. The film may contain material that parents might not like for their young children.

**G**: All ages admitted. Nothing that would offend parents for viewing by children.

**NC-17**: A rating assigned to a movie by the Motion Picture Association of America advising that persons under the age of 18 will not be admitted to a theater showing the film.

In [None]:
df['rating'].value_counts()

Here we can see that the largest audience category on Netflix is adults. The most common rating is "TV-MA": some content may not be appropriate for children under the age of 17, due to intense violence and particularly coarse language.

To show this visually, we plot the counts as a bar chart.

In [None]:
plt.figure(figsize=(12,6))
df.groupby('rating').size().plot(kind='bar',color='green',orientation='vertical')
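
The ratings can also be collapsed into broader audience groups, which makes the split between adult and children's content easier to read. The sketch below is only one possible grouping: the `audience` mapping and its group names are assumptions based on the rating descriptions above, and the `sample` values are hypothetical.

```python
import pandas as pd

# Assumed grouping of ratings into broad audiences (not part of the dataset).
audience = {
    "TV-Y": "Kids", "TV-Y7": "Kids", "TV-Y7-FV": "Kids",
    "TV-G": "Kids", "G": "Kids",
    "TV-PG": "Older kids", "PG": "Older kids",
    "TV-14": "Teens", "PG-13": "Teens",
    "TV-MA": "Adults", "R": "Adults", "NC-17": "Adults",
}

# Hypothetical ratings standing in for df['rating'].
sample = pd.Series(["TV-MA", "TV-14", "TV-MA", "TV-Y", "R", "NR"], name="rating")

# Ratings that are not in the mapping (NR, UR, missing) become "Unrated".
groups = sample.map(audience).fillna("Unrated")

# normalize=True returns fractions instead of counts.
share = groups.value_counts(normalize=True)
print(share)
```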

# Titles

With .value_counts() we can count the number of times each title occurs in our dataset.

In [None]:
df['title'].value_counts()[:30]

The titles that appear most often in the dataset are shown in this plot. Note that these counts are repeated rows for the same title, not viewing figures:

In [None]:
plt.figure(figsize=(12,6))
df['title'].value_counts()[:35].plot(kind='barh')
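
A title that appears more than once is not necessarily a duplicate row; it may be a movie and a TV show sharing a name, or two releases from different years. A quick way to inspect such cases, sketched on a hypothetical `sample` frame:

```python
import pandas as pd

# Hypothetical toy frame; titles and years are invented.
sample = pd.DataFrame({
    "title": ["A", "B", "A", "C"],
    "type": ["Movie", "Movie", "TV Show", "TV Show"],
    "release_year": [2015, 2017, 2019, 2018],
})

# keep=False marks every row whose title occurs more than once,
# not only the second and later occurrences.
repeats = sample[sample.duplicated("title", keep=False)].sort_values("title")
print(repeats)
```

Looking at the other columns of `repeats` shows whether the rows are true duplicates or distinct works with the same name.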

# Duration

In [None]:
df['duration'].value_counts()[:30]

In [None]:
plt.figure(figsize=(12,6))
df['duration'].value_counts()[:35].plot(kind='barh')
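
The counts above mix two units in one column. Assuming movie durations are written like "90 min" and TV show durations like "2 Seasons" (an assumption about the column's format, illustrated here with hypothetical values), the number can be extracted and the two types summarised separately:

```python
import pandas as pd

# Hypothetical toy frame using the assumed duration formats.
sample = pd.DataFrame({
    "type": ["Movie", "TV Show", "Movie", "TV Show"],
    "duration": ["90 min", "2 Seasons", "123 min", "1 Season"],
})

# Extract the leading number from each duration string.
sample["duration_value"] = (
    sample["duration"].str.extract(r"(\d+)", expand=False).astype(int)
)

# Minutes only make sense for movies, seasons only for TV shows.
movie_minutes = sample.loc[sample["type"] == "Movie", "duration_value"]
show_seasons = sample.loc[sample["type"] == "TV Show", "duration_value"]

print(movie_minutes.mean())
print(show_seasons.value_counts())
```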

# Country

We will see which countries produce the most titles on Netflix.

In [None]:
import squarify
y = df['country'].value_counts()[:14]
fig = plt.figure(figsize=(25, 12))
squarify.plot(sizes = y.values, label = y.index, color=sns.color_palette("RdGy", 20),
             linewidth=4, text_kwargs={'fontsize':18, 'fontweight' : 'bold'})
plt.title('Top 14 producing countries', position=(0.5, 1.0+0.03), fontsize = 18, fontweight='bold')
plt.axis('off')
plt.show()

In [None]:
filt_countries = df.set_index('title').country.str.split(', ', expand=True).stack().reset_index(level=1, drop=True);
plt.figure(figsize=(13,7))
g = sns.countplot(y = filt_countries, order=filt_countries.value_counts().index[:15], palette='rocket_r')
plt.title('Top 15 Contributing Countries on Netflix')
plt.xlabel('Titles')
plt.ylabel('Country')
plt.show()
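
The cell above counts each country separately even when a title lists several of them in one comma-separated string. The same idea can be sketched with `str.split` followed by `explode`, shown here on a hypothetical `sample` frame; this is an alternative formulation, not the code used in the cell above.

```python
import pandas as pd

# Hypothetical toy frame; one title has two countries, one has none.
sample = pd.DataFrame({
    "title": ["A", "B", "C"],
    "country": ["United States, India", "India", None],
})

# split() turns each string into a list, explode() gives every list
# element its own row, and dropna() removes titles with no country.
countries = sample["country"].str.split(", ").explode().dropna()

counts = countries.value_counts()
print(counts)
```

Counting this way credits a co-production to every country involved, whereas `df['country'].value_counts()` treats each distinct combination of countries as its own category.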

In [None]:
df['country'].value_counts()

# Directors

Top 10 directors by number of titles:

In [None]:
# Missing directors are NaN values, not the string 'NaN', so drop them with dropna()
top_directors = df.dropna(subset=['director']).set_index('title').director.str.split(', ', expand=True).stack().reset_index(level=1, drop=True)
plt.figure(figsize=(13,7))
plt.title('Top 10 directors by number of titles')
sns.countplot(y = top_directors, order=top_directors.value_counts().index[:10], palette='rocket_r')
plt.show()

The directors with the most titles on Netflix are mainly international.

**Top Genres on Netflix**

In [None]:

top_listed_genre = df.set_index('title').listed_in.str.split(', ', expand=True).stack().reset_index(level=1, drop=True)
plt.figure(figsize=(13,7))
plt.title('Top 36 Genres in Netflix')
sns.countplot(y = top_listed_genre, order=top_listed_genre.value_counts().index[:36], palette='rocket_r')
plt.show()


The most common category on Netflix, by number of titles, is International Movies, followed by Dramas in second place.