In this notebook, I will be examining a **data set of Netflix movies and TV shows that were available in 2019**. This notebook contains several data visualizations, each of which provides further insight into the variables in the data set.

To start, I imported the necessary libraries for this journal. To create my visualizations, I imported the **seaborn library.** I also imported the **Matplotlib library** so that I could develop graphical figures and plot data points accordingly.

In [None]:

import numpy as np
import pandas as pd 
import seaborn as sns
import matplotlib.pyplot as plt
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

Next, I examined the first five rows of the data to see the variables I was working with.

In [None]:
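# Load the data set, using show_id as the row index
netflix_filepath = "../input/netflix-shows/netflix_titles.csv"
netflix_data = pd.read_csv(netflix_filepath, index_col="show_id")
# Only the last expression in a cell is displayed, so print the column names explicitly
print(netflix_data.columns)
netflix_data.head()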

As we can see above, a majority of the variables are categorical, such as the type of Netflix entry as well as the age rating. For this reason, I will be creating visualizations such as histograms and categorical scatter plots to display the counts of these variables. 
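A quick way to back up the claim that most variables are categorical is to separate the numeric columns from everything else and count the distinct values in each. The sketch below is only an illustration: `sample` is a small made-up stand-in for `netflix_data`, not the real file, and treating every non-numeric column as categorical is my own simplifying assumption.

```python
import pandas as pd

# Hypothetical miniature stand-in for netflix_data (not the real file).
sample = pd.DataFrame({
    "type": ["Movie", "TV Show", "Movie", "Movie"],
    "rating": ["TV-MA", "TV-14", "R", "TV-MA"],
    "release_year": [2018, 2016, 2009, 2019],
})

# Numeric columns are detected by dtype; everything else is treated as categorical.
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
categorical_cols = [col for col in sample.columns if col not in numeric_cols]

# Number of distinct values in each categorical column.
distinct_counts = sample[categorical_cols].nunique()

print("Categorical:", categorical_cols)
print("Numeric:", numeric_cols)
print(distinct_counts)
```

Running the same three steps on `netflix_data` would list which columns are worth plotting as counts rather than as continuous values.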

Let's start with a categorical scatter plot. This will show how the TV shows and movies on Netflix are distributed by release year.

In [None]:
plt.title("Movies and TV Shows Released on Netflix")
sns.swarmplot(x= netflix_data['type'],
              y=netflix_data['release_year'])


This categorical scatter plot gives us a few interesting points regarding the release date and type of Netflix entry. To start, the figure clearly shows that there are far more movies on Netflix than TV shows, which anyone who uses Netflix can verify. In addition, the vast majority of Netflix entries (both TV shows and movies) are fairly recent. In the scatter plot above, we can see that a major portion of the points fall between 2010 and 2019. This makes sense, as Netflix would want to provide its users with more recent entries to keep them interested in its service. However, there are still a great number of entries that were released prior to 2010, especially movies. In fact, we can see a small cluster of movies that were released during the 1940s.

If we examine the relationship between the number of Netflix entries (for each type) and the release year, we see two different trends. The number of movies appears to increase with each passing year. For TV shows, however, the number stays roughly constant and only starts to increase after 1980. I find this particularly interesting, as television became more popular in the United States during the 1950s and 1960s. I would therefore expect Netflix to include more TV shows from that time, but that is clearly not the case.
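The observations above are read off the plot by eye, so it can help to put numbers on them. The following is a minimal sketch on made-up rows (`sample` is hypothetical, not the Netflix data) that counts entries per type, measures the share released between 2010 and 2019, and tabulates entries per decade for each type.

```python
import pandas as pd

# Hypothetical rows standing in for netflix_data (not the real file).
sample = pd.DataFrame({
    "type": ["Movie", "Movie", "Movie", "TV Show", "TV Show", "Movie"],
    "release_year": [1945, 2008, 2015, 2012, 2018, 2019],
})

# How many entries of each type are there?
type_counts = sample["type"].value_counts()

# What fraction of each type was released between 2010 and 2019 (inclusive)?
is_recent = sample["release_year"].between(2010, 2019)
recent_share = is_recent.groupby(sample["type"]).mean()

# Entries per decade, split by type.
decade = (sample["release_year"] // 10) * 10
by_decade = pd.crosstab(decade, sample["type"])

print(type_counts)
print(recent_share)
print(by_decade)
```

Applied to `netflix_data`, the `by_decade` table would show the two trends (movies versus TV shows) side by side instead of leaving them to visual impression.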

To see the growth of Netflix entries over the years, I created a line plot, with the x-axis representing the year and the y-axis representing the number of entries for each year.

In [None]:
netflix_count = netflix_data['release_year'].value_counts()
netflix_count = pd.DataFrame(netflix_count).reset_index()
netflix_count.columns = ['release_year','Number of Entries',]
netflix_count

ax = sns.lineplot(x="release_year", y="Number of Entries",
                  data= netflix_count)

The graph shows a roughly exponential increase, which I expected. As I stated in my analysis of the categorical scatter plot, the majority of Netflix entries are in the 2010-2019 range. This is clearly supported by the line graph above, as we see a drastic increase in the number of entries once the graph reaches the year 2010 and progresses from there.
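One way to check whether a curve really is exponential is to fit a straight line to the logarithm of the counts: exponential growth becomes linear on a log scale, and the slope gives the yearly growth rate. The sketch below uses synthetic counts that grow by exactly 50% per year, purely as an assumption for illustration; the real Netflix counts would not follow such a clean pattern.

```python
import numpy as np
import pandas as pd

# Synthetic yearly counts growing by 50% per year (illustrative only).
years = np.arange(2010, 2020)
counts = pd.Series(10 * 1.5 ** (years - 2010), index=years)

# Fit log(count) = slope * (year - first_year) + intercept.
slope, intercept = np.polyfit(years - years[0], np.log(counts.to_numpy()), 1)

# Convert the slope on the log scale back to a yearly growth rate.
growth_per_year = np.exp(slope) - 1

print(f"Estimated growth per year: {growth_per_year:.1%}")
```

If the points on the log scale were far from a straight line, "exponential" would be too strong a description of the trend.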

For the next portion of this journal, I will be looking at the age ratings of Netflix entries. Shown below are a histogram and a pie chart of the age ratings.

In [None]:
plt.figure(figsize=(14,6))
plt.title("Age Ratings of Movies and TV Shows Released on Netflix")

ax = sns.countplot(data=netflix_data,y= "rating")


In [None]:
netflix_count_2 = netflix_data['rating'].value_counts()
netflix_count_2 = pd.DataFrame(netflix_count_2).reset_index()
netflix_count_2.columns = ['rating','count']
netflix_count_2
rating_data = netflix_count_2["rating"]
count_data = netflix_count_2["count"]
colors = ["b", "r", "y", "m", "c","g", "#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#8c564b", "powderblue", "navy", "darkcyan"]
explode = [0.005] * len(count_data)  # one small offset per wedge, however many ratings there are
plt.figure(figsize=(24,9.5))
plt.pie(count_data, labels=rating_data, explode=explode, colors=colors,
autopct='%1.1f%%', shadow=False, startangle=140)
plt.show()

We can see from both the histogram and the pie chart that TV-MA is the most common age rating among Netflix entries. Despite this, young adult and children's content actually makes up most of Netflix's content. According to the pie chart, about 41% of Netflix content is for adults, a statistic that I honestly did not expect. Considering the vast number of TV-MA shows and R-rated movies on Netflix's platform, I expected about 60% of Netflix's content to be for adults and 40% to be for teens and under.
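The adult share can also be computed directly instead of being added up from the pie chart labels. In the sketch below, the rating counts are made up, and the choice of which ratings count as "adult" (TV-MA, R and NC-17) is my own assumption about how the categories should be grouped.

```python
import pandas as pd

# Hypothetical rating counts (not the real Netflix numbers).
rating_counts = pd.Series({"TV-MA": 40, "TV-14": 30, "R": 10, "PG-13": 10, "TV-Y": 10})

# Assumption: these ratings are treated as adult content.
adult_ratings = ["TV-MA", "R", "NC-17"]

# Select the adult ratings that are present and divide by the total.
is_adult = rating_counts.index.isin(adult_ratings)
adult_share = rating_counts[is_adult].sum() / rating_counts.sum()

print(f"Adult share: {adult_share:.1%}")
```

With the real data, `rating_counts` would simply be `netflix_data['rating'].value_counts()`.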

For the last portion of this journal, I will be focusing on some specific details I found in this data set.

To start, I noticed that an incredibly small percentage of Netflix content is listed in the Unrated and NC-17 categories. I wanted to find out how many films/TV shows were in these categories and what they were.

In [None]:
netflix_specific = netflix_data[netflix_data['rating']== 'NC-17']
num = netflix_specific.shape[0]
print("There are " + str(num) + " movies/TV shows that are rated NC-17 on Netflix. They are displayed below.")
netflix_specific 

In [None]:
netflix_specific_2 = netflix_data[netflix_data['rating']== 'UR']
num2 = netflix_specific_2.shape[0]
print("There are " + str(num2) + " movies/TV shows that are unrated on Netflix. They are displayed below.")
netflix_specific_2

Only 9 movies in this data set are in the NC-17 (2) and Unrated (7) categories. Let's find out what percentage of the entire data set that is ...

In [None]:
num_total = netflix_data.shape[0]
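rare_total = num + num2  # NC-17 count plus unrated count from the two cells above
percent = round(100 * rare_total / num_total, 3)
print(str(percent) + "% of all the content on Netflix is listed in the NC-17 and Unrated categories. That is " + str(rare_total) + " out of " + str(num_total) + " entries.")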

That comes to about 0.144%, a very small percentage of all of Netflix's content. The next-smallest age rating is the G category, as seen on the histogram.
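The same percentage can be obtained in one step with a boolean mask, which avoids typing the count by hand. This is a minimal sketch on ten made-up ratings (`sample` is hypothetical), so the resulting figure has nothing to do with the real data set.

```python
import pandas as pd

# Hypothetical ratings column standing in for netflix_data['rating'].
sample = pd.DataFrame({
    "rating": ["TV-MA", "NC-17", "UR", "TV-14", "R",
               "TV-MA", "PG", "TV-Y", "TV-14", "TV-MA"],
})

rare_ratings = ["NC-17", "UR"]

# True for every row whose rating is one of the rare categories.
mask = sample["rating"].isin(rare_ratings)

rare_count = int(mask.sum())                # number of matching rows
rare_percent = round(100 * mask.mean(), 3)  # mean of a boolean mask is the matching fraction

print(f"{rare_count} of {len(sample)} entries ({rare_percent}%) are NC-17 or unrated.")
```

Because the mean of a boolean column is the fraction of `True` values, no separate division by the row count is needed.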

**Conclusion**

That is the end of my data analysis of the Netflix movies and TV shows data set. Through my data visualizations and calculations, I gained a great deal of insight into Netflix's content that I did not have before. Throughout this project, I enjoyed relating the details I found in the data to my existing knowledge of Netflix. I hope you enjoyed learning about this Netflix data set as much as I did!