## 1. Objective

<b>The purpose of this notebook is to practice data exploration, using a dataset of Netflix titles that is available on Kaggle.</b>

## 2. Imports

In [15]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

## 3. Load Dataset

In [16]:
df = pd.read_csv('netflix_titles.csv')

## 4. Data Exploration

In [17]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8807 entries, 0 to 8806
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       8807 non-null   object
 1   type          8807 non-null   object
 2   title         8807 non-null   object
 3   director      6173 non-null   object
 4   cast          7982 non-null   object
 5   country       7976 non-null   object
 6   date_added    8797 non-null   object
 7   release_year  8807 non-null   int64 
 8   rating        8803 non-null   object
 9   duration      8804 non-null   object
 10  listed_in     8807 non-null   object
 11  description   8807 non-null   object
dtypes: int64(1), object(11)
memory usage: 825.8+ KB


> The dataset has 8807 rows and 12 columns. In addition, six columns contain NaN values: director, cast, country, date_added, rating and duration.
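
The missing-value counts can also be read directly rather than inferred from `df.info()`. The sketch below is a minimal illustration on a small sample that copies three columns from the first five rows of the dataset; the names `sample` and `missing` are introduced here for illustration only and are not part of the original notebook.

```python
import pandas as pd

# Small sample built from the first five rows of the dataset (three columns only).
sample = pd.DataFrame(
    {
        "title": [
            "Dick Johnson Is Dead",
            "Blood & Water",
            "Ganglands",
            "Jailbirds New Orleans",
            "Kota Factory",
        ],
        "director": ["Kirsten Johnson", None, "Julien Leclercq", None, None],
        "country": ["United States", "South Africa", None, None, "India"],
    }
)

# Count missing values per column and keep only the columns that have any.
missing = sample.isna().sum()
missing = missing[missing > 0].sort_values(ascending=False)
print(missing)
```

Applied to the full dataframe, the same two lines (with `df` in place of `sample`) should list every column whose non-null count is below 8807.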

In [18]:
df.head()

| | show_id | type | title | director | cast | country | date_added | release_year | rating | duration | listed_in | description |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | s1 | Movie | Dick Johnson Is Dead | Kirsten Johnson | NaN | United States | September 25, 2021 | 2020 | PG-13 | 90 min | Documentaries | As her father nears the end of his life, filmm... |
| 1 | s2 | TV Show | Blood & Water | NaN | Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban... | South Africa | September 24, 2021 | 2021 | TV-MA | 2 Seasons | International TV Shows, TV Dramas, TV Mysteries | After crossing paths at a party, a Cape Town t... |
| 2 | s3 | TV Show | Ganglands | Julien Leclercq | Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi... | NaN | September 24, 2021 | 2021 | TV-MA | 1 Season | Crime TV Shows, International TV Shows, TV Act... | To protect his family from a powerful drug lor... |
| 3 | s4 | TV Show | Jailbirds New Orleans | NaN | NaN | NaN | September 24, 2021 | 2021 | TV-MA | 1 Season | Docuseries, Reality TV | Feuds, flirtations and toilet talk go down amo... |
| 4 | s5 | TV Show | Kota Factory | NaN | Mayur More, Jitendra Kumar, Ranjan Raj, Alam K... | India | September 24, 2021 | 2021 | TV-MA | 2 Seasons | International TV Shows, Romantic TV Shows, TV ... | In a city of coaching centers known to train I... |


In [19]:
df[['show_id', 'type']].groupby(by='type', as_index=False).count()

| | type | show_id |
|---|---|---|
| 0 | Movie | 6131 |
| 1 | TV Show | 2676 |


> The "type" column has two categories, Movie and TV Show, and there are more movies (6131) than TV shows (2676).
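
To put the two counts in proportion, the short calculation below converts them to percentages. It uses only the numbers reported in the output above; the variable names are illustrative.

```python
# Counts taken from the groupby output above.
counts = {"Movie": 6131, "TV Show": 2676}
total = sum(counts.values())  # matches the 8807 rows reported by df.info()

# Share of each type as a percentage, rounded to two decimals.
shares = {kind: round(100 * n / total, 2) for kind, n in counts.items()}
print(shares)
```

On the dataframe itself, `df['type'].value_counts(normalize=True)` gives the same proportions as fractions.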

In [20]:
# About Releases

oldest_release = df['release_year'].min()
newest_release = df['release_year'].max()

# About Movie Duration
# Take the leading number of strings such as '90 min'. str.strip(' min')
# strips a set of characters rather than the literal suffix.

movie_minutes = (
    df.loc[df['type'] == 'Movie', 'duration']
    .dropna()
    .str.split()
    .str[0]
    .astype(int)
)
shortest_movie = movie_minutes.min()
longest_movie = movie_minutes.max()

# About TV Show Duration (strings such as '1 Season' or '2 Seasons')

tv_show_seasons = (
    df.loc[df['type'] == 'TV Show', 'duration']
    .dropna()
    .str.split()
    .str[0]
    .astype(int)
)
shortest_tv_show = tv_show_seasons.min()
longest_tv_show = tv_show_seasons.max()

# Print Information

print(f"Oldest Release: {oldest_release}")
print(f"Newest Release: {newest_release}")

print(f"\nShortest Movie: {shortest_movie} min")
print(f"Longest Movie: {longest_movie} min")

print(f"\nShortest TV Show: {shortest_tv_show} season(s)")
print(f"Longest TV Show: {longest_tv_show} season(s)")

Oldest Release: 1925
Newest Release: 2021

Shortest Movie: 3 min
Longest Movie: 312 min

Shortest TV Show: 1 season(s)
Longest TV Show: 17 season(s)


> In the "release_year" column, the oldest release is from 1925 and the newest is from 2021. Furthermore, the shortest movie runs 3 minutes and the longest runs 312 minutes. Finally, the shortest TV show has 1 season and the longest has 17 seasons.
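
The duration values are strings such as `90 min` or `2 Seasons`, so the number has to be parsed out before taking a minimum or maximum. One caveat worth illustrating: `str.strip(chars)` removes any of the listed characters from both ends of the string, not a literal suffix. The sketch below shows that behaviour and one alternative that takes the first whitespace-separated token; `leading_int` is a hypothetical helper introduced here, not part of the notebook.

```python
def leading_int(text: str) -> int:
    """Return the integer at the start of a string such as '90 min' or '2 Seasons'."""
    return int(text.split()[0])


# str.strip(' min') treats ' min' as a set of characters {' ', 'm', 'i', 'n'}.
# It happens to work for '90 min', but it is not a suffix removal:
stripped = "nominal".strip(" min")  # the leading 'n' is removed as well
print(stripped)

# Taking the first token does not depend on which letters the unit contains.
for value in ["90 min", "1 Season", "17 Seasons"]:
    print(value, "->", leading_int(value))
```

This assumes every duration starts with a number followed by a space, which holds for the values shown in this notebook.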

In [21]:
# Sturges' Rule: k = 1 + (10 / 3) * log10(n), rounded to the nearest integer

n = df.shape[0]
k = int(np.round(1 + (10 / 3) * np.log10(n)))

# Frequency Distribution
# Bin the release years once and reuse the result for both columns.
# Series.value_counts replaces the deprecated top-level pd.value_counts.

release_year_bins = pd.cut(
    x=df['release_year'],
    bins=k,
    precision=0,
    include_lowest=True,
)

frequency = release_year_bins.value_counts()
percentage = (release_year_bins.value_counts(normalize=True) * 100).round(2)

frequency_distribution_dataframe = pd.DataFrame({'Frequency': frequency, 'Percentage': percentage})

frequency_distribution_dataframe

| | Frequency | Percentage |
|---|---|---|
| (2014.0, 2021.0] | 6216 | 70.58 |
| (2007.0, 2014.0] | 1544 | 17.53 |
| (2000.0, 2007.0] | 485 | 5.51 |
| (1994.0, 2000.0] | 221 | 2.51 |
| (1987.0, 1994.0] | 132 | 1.5 |
| (1980.0, 1987.0] | 87 | 0.99 |
| (1973.0, 1980.0] | 48 | 0.55 |
| (1966.0, 1973.0] | 32 | 0.36 |
| (1959.0, 1966.0] | 15 | 0.17 |
| (1939.0, 1946.0] | 12 | 0.14 |


> For the release years of the Movies and TV Shows, Sturges' Rule was used to choose the number of classes for a frequency distribution dataframe. The period between 2014 and 2021 accounts for more than 70% of the titles.
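
As a worked check of the class count, Sturges' Rule is commonly written as k = 1 + log2(n), and the factor 10/3 on log10(n) is a close approximation of it. The sketch below evaluates both forms for the 8807 rows of this dataset and derives the approximate class width from the release-year range reported earlier. The variable names are illustrative.

```python
import math

n = 8807                     # rows reported by df.info()
oldest, newest = 1925, 2021  # release years printed earlier

k_log2 = round(1 + math.log2(n))               # common textbook form
k_log10 = round(1 + (10 / 3) * math.log10(n))  # approximation used in the notebook

# Approximate width of each class when the year range is split into k equal bins.
bin_width = (newest - oldest) / k_log10

print(k_log2, k_log10, round(bin_width, 2))
```

Both forms give the same number of classes here, and a width of just under seven years is consistent with the intervals shown in the table, such as (2014.0, 2021.0].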