In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Hello! Welcome to my first Kaggle Notebook. 

I thought I'd do a simple analysis to help visualize some data regarding Netflix shows and movies.

In [None]:
data = pd.read_csv('../input/netflix-shows/netflix_titles.csv')

Let's see what the data looks like...

In [None]:
data.head()

And how much of it there is...

In [None]:
data.shape

I want to divide the data up based on the type of media. I think the "type" column is probably just two types: shows and movies. But I want to make sure, in case they have different designations for things like "docuseries," for example.

In [None]:
data.type.unique()

So I was right: there are only two unique values in that column, "TV Show" and "Movie."
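
To see how many rows fall under each value, and not just which values exist, `value_counts()` is handy. Here's a minimal sketch on a made-up frame (the `toy` data is hypothetical, not the Netflix file):

```python
import pandas as pd

# Hypothetical stand-in for the real dataset
toy = pd.DataFrame({"type": ["Movie", "TV Show", "Movie", "Movie"]})

# Count rows per distinct value; dropna=False would also surface missing values
type_counts = toy["type"].value_counts(dropna=False)
print(type_counts)
```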

It would be interesting to see in which years Netflix added the most new media, and whether the pattern differs between movies and shows.

What I'm noticing, though, is that the "date_added" column is a string. Let's confirm that.

In [None]:
# Look the value up by column name rather than by column position
type(data['date_added'].iloc[7])

Yup, that's a string, and I only want the year, not the whole date.
First, let me extract the year out of that column.

In [None]:
data['year_added'] = data['date_added'].str[-4:]
data.head()

But that doesn't solve the issue of it being a string. We'll have to do some more conversion there.
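
One possible alternative is to parse the whole column as dates and read the year off it, which skips the string slicing altogether. Here's a hedged sketch on made-up values (the `toy` frame is hypothetical, and the "Month day, year" layout is my assumption about how `date_added` is written):

```python
import pandas as pd

# Hypothetical values, including a leading space and a missing date
toy = pd.DataFrame({"date_added": ["September 9, 2019", " January 1, 2020", None]})

# Strip stray whitespace, then parse; errors="coerce" turns unparseable or missing values into NaT
parsed = pd.to_datetime(toy["date_added"].str.strip(), format="%B %d, %Y", errors="coerce")

# .dt.year comes back as floats here because the missing date becomes NaN
toy["year_added"] = parsed.dt.year
print(toy)
```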

In [None]:
data['year_added'] = pd.to_numeric(data['year_added'])

Let's see how that worked.

In [None]:
data.head()

In [None]:
# Look the value up by column name rather than by column position
type(data['year_added'].iloc[15])

I'd prefer if it were an integer. So let's make that quick fix.

In [None]:
data['year_added'] = data['year_added'].astype(int)

Error?
Ah... there must be nulls. Should have checked that first!
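
The reason for the error: missing values are stored as NaN, which is a float, and a plain integer column has no way to hold it. Here's a small sketch on made-up numbers (the `years` series is hypothetical). The nullable `Int64` dtype in pandas is one way to keep the missing values and still get integers, though it isn't the route I take below:

```python
import numpy as np
import pandas as pd

# Hypothetical years with one missing value
years = pd.Series([2019.0, 2020.0, np.nan])

try:
    years.astype(int)
except ValueError as exc:
    # pandas refuses to cast NaN to a plain integer
    print("conversion failed:", exc)

# Nullable integer dtype: whole numbers where present, <NA> where missing
nullable_years = years.astype("Int64")
print(nullable_years)
```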

In [None]:
data['year_added'].isnull().values.sum()

Let's get rid of the nulls in our "year_added" column.
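
One thing to keep in mind: `dropna` returns a new DataFrame by default and leaves the original alone, so the result has to be assigned to something. Here's a quick sketch on a made-up frame (`toy` is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing year
toy = pd.DataFrame({"title": ["A", "B", "C"], "year_added": [2019.0, np.nan, 2020.0]})

toy.dropna(subset=["year_added"])        # result discarded, toy is unchanged
rows_before = len(toy)

toy = toy.dropna(subset=["year_added"])  # keep the filtered frame
rows_after = len(toy)

print(rows_before, rows_after)
```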

In [None]:
# dropna returns a new DataFrame, so assign the result back
data = data.dropna(subset=['year_added'])

In [None]:
data['year_added'].isnull().sum()

Ok, now we should be able to turn that column into integers:

In [None]:
data['year_added'] = data['year_added'].astype(int)

Looks like that worked....

In [None]:
data.head()

Double checking...

In [None]:
# Look the value up by column name rather than by column position
type(data['year_added'].iloc[15])

Now for the fun stuff! I want to divide this into two data frames - Movies, and TV Shows.
We'll make filters for that.
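
For anyone new to filters: comparing a column to a value gives a Series of True/False, and indexing the frame with it keeps only the True rows. Here's a minimal sketch on made-up rows (the `toy` frame and its titles are hypothetical):

```python
import pandas as pd

# Hypothetical rows standing in for the real dataset
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "type": ["Movie", "TV Show", "Movie"],
})

is_movie = toy["type"] == "Movie"   # boolean Series, one entry per row
movies = toy[is_movie]              # rows where the mask is True
shows = toy[~is_movie]              # ~ inverts the mask (fine here only because there are two types)

print(movies)
print(shows)
```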

In [None]:
Movie_Filter = data['type']=='Movie'
Movie_Data = data[Movie_Filter]
Movie_Data.head()

Let's just make sure that worked properly by looking up the unique values in the "type" column.

In [None]:
Movie_Data.type.unique()

Perfect! Let's do the same thing for TV Shows

In [None]:
Show_Filter = data['type']=="TV Show"
Show_Data = data[Show_Filter]

In [None]:
Show_Data.head()

Double checking our work again...

In [None]:
Show_Data.type.unique()

Now let's put together a histogram of the "year_added" column we so carefully created.
I'm going to make this an "overlapping" histogram instead of a stacked one, so we can compare movies to shows.
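
One caveat with `bins=25`: each `plt.hist` call picks its own bin edges from its own data, so the two sets of bars may not line up, and 25 evenly spaced bins don't map neatly onto whole years. A possible tweak, sketched on made-up years (both arrays are hypothetical), is to build one set of year-wide edges and use it for both groups:

```python
import numpy as np

# Hypothetical years, not taken from the dataset
movie_years = np.array([2016, 2018, 2019, 2019, 2020])
show_years = np.array([2017, 2019, 2020, 2020])

# One bin per calendar year, centred on the integer, shared by both groups
first = min(movie_years.min(), show_years.min())
last = max(movie_years.max(), show_years.max())
shared_edges = np.arange(first, last + 2) - 0.5

movie_counts, _ = np.histogram(movie_years, bins=shared_edges)
show_counts, _ = np.histogram(show_years, bins=shared_edges)

print(shared_edges)
print(movie_counts, show_counts)
```

The same `shared_edges` array can be passed as `bins=shared_edges` to both `plt.hist` calls.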

In [None]:
plt.figure(figsize=(10,4))
plt.hist(Movie_Data['year_added'],bins=25, alpha=.5, label="Movies")
plt.hist(Show_Data['year_added'],bins=25,  alpha=.5, label="Shows")

plt.xlabel("Year Added", size=14)
plt.ylabel("Count", size=14)
plt.title("Distributions of Year Added - Movies vs. Shows (overlapped)", size=20)
plt.legend(loc='upper left')

#hist = Movie_Data['year_added'].hist(label='my label')
#hist2 = Show_Data['year_added'].hist(label='my label')

It looks like the heaviest years for adding media were 2019 and 2020, for both movies and shows. Movies peaked in 2019, and TV shows peaked in 2020.
The volume of movies added is substantially larger than that of TV shows - Netflix seems to favor movies.
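
To back up a reading like this with numbers rather than bar heights, a cross-tabulation of year against type is one option. Here's a sketch on made-up rows (the `toy` frame is hypothetical, and its counts say nothing about the real data):

```python
import pandas as pd

# Hypothetical rows, only to show the shape of the result
toy = pd.DataFrame({
    "type": ["Movie", "Movie", "TV Show", "Movie", "TV Show", "TV Show"],
    "year_added": [2019, 2019, 2019, 2020, 2020, 2020],
})

# Rows are years, columns are types, cells are title counts
per_year = pd.crosstab(toy["year_added"], toy["type"])
print(per_year)

# Year with the most additions for each type
peak_year = per_year.idxmax()
print(peak_year)
```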

Thanks for reading my first notebook! Let me know what you think, or any constructive feedback you may have.