<div style = "display: fill;
              border-radius: 10px;
              background-color: #E50914;">
    <h1 style = "padding: 15px; 
                 color: White;
                 text-align:center;
                 font-family: Trebuchet MS;">Netflix Movies and TV Shows
    </h1>
</div>

<div align="center"> 
    <img src="https://content.api.news/v3/images/bin/3fcf3cdcc2cfd6905e582a186e46e584" alt="Netflix" width="75%" style="margin-top:2rem;margin-bottom:2rem">
</div>

<div>
    <p>Netflix is one of the most popular media and video streaming platforms. It has over 8000 movies and TV shows available on its platform and, as of mid-2021, over 200M subscribers globally.</p>
</div>

<div>
    <p>Analysis ideas:</p>
    <ul>
        <li><p>Understanding what content is available in different countries</p>
        <li><p>Identifying similar content by matching text-based features</p>
        <li><p>Network analysis of actors and directors to find interesting insights</p>
        <li><p>Has Netflix focused more on TV shows than movies in recent years?</p>
    </ul>
</div>

<div style = "display: fill;
              border-radius: 10px;
              background-color: #E50914;">
    <h2 style = "padding: 15px; 
                 color: White;
                 text-align: left;
                 font-family: Trebuchet MS;">Table of Contents
    </h2>
</div>

* [<span style="font-family: Trebuchet MS; font-size:15px;">1. Imports</span>](#imports)
* [<span style="font-family: Trebuchet MS; font-size:15px;">2. Loading the Dataset</span>](#loading-the-dataset)
* [<span style="font-family: Trebuchet MS; font-size:15px;">3. Understanding the Data</span>](#understanding-the-data)
    * [<span style="font-family: Trebuchet MS; font-size:15px;">3.1 Checking Null Values</span>](#checking-null-values)
    * [<span style="font-family: Trebuchet MS; font-size:15px;">3.2 Optimizing the Dataset</span>](#optimizing-the-dataset)
* [<span style="font-family: Trebuchet MS; font-size:15px;">4. Exploratory Data Analysis</span>](#exploratory-data-analysis)


<div id="imports"
     style = "display: fill;
              border-radius: 10px;
              background-color: #E50914;">
    <h2 style = "padding: 15px; 
                 color: White;
                 text-align: left;
                 font-family: Trebuchet MS;">1. Imports
    </h2>
</div>

In [1]:
%%capture
import numpy as np
import pandas as pd
import plotly.express as px

<div id="loading-the-dataset"
     style = "display: fill;
              border-radius: 10px;
              background-color: #E50914;">
    <h2 style = "padding: 15px; 
                 color: White;
                 text-align: left;
                 font-family: Trebuchet MS;">2. Loading the Dataset
    </h2>
</div>

In [2]:
df = pd.read_csv("https://raw.githubusercontent.com/sahinozan/Netflix-Movies-TV-Shows/main/netflix_titles.csv", encoding="utf-8")
df.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm..."
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...
3,s4,TV Show,Jailbirds New Orleans,,,,"September 24, 2021",2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo..."
4,s5,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...


<div id="understanding-the-data"
     style = "display: fill;
              border-radius: 10px;
              background-color: #E50914;">
    <h2 style = "padding: 15px; 
                 color: White;
                 text-align: left;
                 font-family: Trebuchet MS;">3. Understanding the Data
    </h2>
</div>

The dataset consists of 12 features:
1. **show_id**: The unique id number for each show or movie
    - s1
    - s2
    - s3
    - ...
2. **type**: Type of the content
    - Movie
    - TV Show
3. **title**: Title of the content
    - The Starling
    - Squid Game
    - Jaws: The Revenge
    - ...
4. **director**: Name of the director for that movie or show
    - Steven Spielberg
    - Cédric Jimenez
    - Hirotsugu Kawasaki
    - ...
5. **cast**: Names of the cast for that movie or show 
    - Ama Qamata, Khosi Ngema, Gail Mabalane, Thabang Molaba, Dillon Windvogel, Natasha Thahane...
    - Antti Pääkkönen, Heljä Heikkinen, Lynne Guaglione, Pasi Ruohonen, Rauno Ahonen...
    - Luna Wedler, Jannis Niewöhner, Milan Peschel, Edin Hasanović...
    - ...
6. **country**: List of countries in which the content is available 
    - United States, South Africa, India...
    - United States, United Kingdom, Canada, Germany...
    - Canada, France, Japan, Russia...
    - ...
7. **date_added**: The date the content was added
    - September 25, 2021
    - September 22, 2021
    - August 23, 2021
    - ...
8. **release_year**: Release year of the content
    - 2020
    - 2014
    - 2006
    - ...
9. **rating**: Motion picture content rating 
    - PG-13
    - TV-14
    - R
    - ...
10. **duration**: Total duration of the content
    - 90 min
    - 2 Seasons
    - 48 min
    - ...
11. **listed_in**: List of categories where content is listed
    - Documentaries, International TV Shows, TV Dramas, TV Mysteries...
    - Comedies, Dramas, International Movies...
    - Documentaries, TV Dramas, Comedies...
    - ...
12. **description**: Short description about the content
    - As her father nears the end of his life, filmmaker Kirsten Johnson stages his death in inventive and comical ways to help them both face the inevitable
    - After crossing paths at a party, a Cape Town teen sets out to prove whether a private-school swimming star is her sister who was abducted at birth
    - Tayo speeds into an adventure when his friends get kidnapped by evil magicians invading their city in search of a magical gemstone
    - ...

<div id="checking-null-values"
     style = "display: fill;
              border-radius: 10px;
              background-color: #E50914;">
    <h3 style = "padding: 15px; 
                 color: White;
                 text-align: left;
                 font-family: Trebuchet MS;">3.1 Checking Null Values
    </h3>
</div>

Let's first check whether there are any null values in the dataset.

In [3]:
df.isna().sum().sort_values(ascending=False)

director        2634
country          831
cast             825
date_added        10
rating             4
duration           3
show_id            0
type               0
title              0
release_year       0
listed_in          0
description        0
dtype: int64

* There is a small number of null values in the **date_added**, **rating**, and **duration** columns.
* Meanwhile, there is a large number of null values in the **country**, **cast**, and **director** columns.
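
Raw counts are easier to judge as a share of all rows. For example, **director** is missing in 2634 of 8807 rows, roughly 30%. The sketch below shows one way to compute such percentages; the `toy` frame is made up for illustration and only stands in for `df`.

```python
import numpy as np
import pandas as pd

# Illustrative stand-in for df (values are made up, not taken from the dataset)
toy = pd.DataFrame({
    "director": ["Kirsten Johnson", np.nan, np.nan, "Mike Flanagan"],
    "rating": ["PG-13", "TV-MA", np.nan, "TV-MA"],
    "title": ["A", "B", "C", "D"],
})

# isna() yields booleans; their mean is the fraction of missing values per column
null_share = (toy.isna().mean() * 100).sort_values(ascending=False)
print(null_share.round(1))
```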

<div id="optimizing-the-dataset"
     style = "display: fill;
              border-radius: 10px;
              background-color: #E50914;">
    <h3 style = "padding: 15px; 
                 color: White;
                 text-align: left;
                 font-family: Trebuchet MS;">3.2 Optimizing the Dataset
    </h3>
</div>

Many of the features are `object` type. We can convert those features into more specific and compact types to save memory and increase efficiency.

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8807 entries, 0 to 8806
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       8807 non-null   object
 1   type          8807 non-null   object
 2   title         8807 non-null   object
 3   director      6173 non-null   object
 4   cast          7982 non-null   object
 5   country       7976 non-null   object
 6   date_added    8797 non-null   object
 7   release_year  8807 non-null   int64 
 8   rating        8803 non-null   object
 9   duration      8804 non-null   object
 10  listed_in     8807 non-null   object
 11  description   8807 non-null   object
dtypes: int64(1), object(11)
memory usage: 825.8+ KB
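
The `+` in `825.8+ KB` indicates that pandas counted only the pointers held by `object` columns, not the strings they point to. A minimal sketch of the difference, using a made-up `toy` frame with an explicit `object` column (an assumption chosen to mirror the dtypes above):

```python
import pandas as pd

# Illustrative stand-in for df; the string column is forced to object dtype
toy = pd.DataFrame({
    "type": pd.Series(["Movie", "TV Show", "Movie", "Movie"], dtype="object"),
    "release_year": [2020, 2021, 1993, 2018],
})

shallow = toy.memory_usage(deep=False)  # object columns: pointer size only
deep = toy.memory_usage(deep=True)      # object columns: includes the string payloads

# Only the object column grows when the payloads are counted
print(deep - shallow)
```

Calling `df.info(memory_usage="deep")` reports the deep figure directly.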


Let's first check the number of unique values for each feature to make the conversion process easier.

In [5]:
for i in df.columns:
    print(f"Number of unique values in the {i} column: {df[i].nunique()}")

Number of unique values in the show_id column: 8807
Number of unique values in the type column: 2
Number of unique values in the title column: 8807
Number of unique values in the director column: 4528
Number of unique values in the cast column: 7692
Number of unique values in the country column: 748
Number of unique values in the date_added column: 1767
Number of unique values in the release_year column: 74
Number of unique values in the rating column: 17
Number of unique values in the duration column: 220
Number of unique values in the listed_in column: 514
Number of unique values in the description column: 8775


Some of the columns contain only a few unique values, which makes them the most obvious candidates for conversion:
* **type**
* **release_year**
* **rating**

Other interesting features are **show_id**, **title**, and **description**. They contain a unique (or nearly unique) value for each row. I will check them first to see what is going on.

In [6]:
print("Number of unique values:", df["show_id"].nunique())
print(df["show_id"].unique())

Number of unique values: 8807
['s1' 's2' 's3' ... 's8805' 's8806' 's8807']


Every show has its own **show_id**. The values simply run from `s1` to `s8807`, which duplicates the row index of the dataframe. Since we already have that information, I will remove this feature; it doesn't contain any valuable information for the analysis.

In [7]:
df.drop(labels="show_id", axis=1, inplace=True)
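
Before dropping an id column like this, it can be worth confirming that it really mirrors the index. A small sketch of such a check, run on a made-up `toy` frame (on the real data it would have to run before the `drop` call):

```python
import pandas as pd

# Illustrative stand-in for the original df
toy = pd.DataFrame({"show_id": ["s1", "s2", "s3"], "title": ["A", "B", "C"]})

# Strip the leading "s" and compare the number with the 0-based index plus one
numeric_id = toy["show_id"].str[1:].astype(int).to_numpy()
is_redundant = bool((numeric_id == toy.index.to_numpy() + 1).all())
print(is_redundant)
```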

In [8]:
print("Number of unique values:", df["title"].nunique())
print(df["title"].unique())

Number of unique values: 8807
['Dick Johnson Is Dead' 'Blood & Water' 'Ganglands' ... 'Zombieland'
 'Zoom' 'Zubaan']


Like **show_id**, this column holds a unique value for each show. The difference is that these values are not just a running counter: they are the titles of the shows, which will probably be used in the analysis. Therefore, I will not remove this feature.

In [9]:
print("Number of unique values:", df["description"].nunique())
print(df["description"].unique())

Number of unique values: 8775
['As her father nears the end of his life, filmmaker Kirsten Johnson stages his death in inventive and comical ways to help them both face the inevitable.'
 'After crossing paths at a party, a Cape Town teen sets out to prove whether a private-school swimming star is her sister who was abducted at birth.'
 'To protect his family from a powerful drug lord, skilled thief Mehdi and his expert team of robbers are pulled into a violent and deadly turf war.'
 ...
 'Looking to survive in a world taken over by zombies, a dorky college student teams with an urban roughneck and a pair of grifter sisters.'
 'Dragged from civilian life, a former superhero must train a new crop of youthful saviors when the military preps for an attack by a familiar villain.'
 "A scrappy but poor boy worms his way into a tycoon's dysfunctional family, while facing his fear of music and the truth about his past."]


Similar to **title**, this column is almost unique per show (8775 distinct values across 8807 rows). It holds a short description of each show, which we may use in the analysis. Therefore, I will not remove this feature.

In [10]:
print("Number of unique values:", df["type"].nunique())
print(df["type"].unique())

Number of unique values: 2
['Movie' 'TV Show']


There are only two unique values in the **type** column. They categorize the content as either **Movie** or **TV Show**. I will convert this feature into `category` type.

In [11]:
df["type"] = df["type"].astype("category")
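
To see why `category` saves memory, here is a small sketch on a synthetic Series (not the real column): pandas stores each distinct label once and replaces the values with small integer codes.

```python
import pandas as pd

# Synthetic column with two labels repeated many times (illustrative only)
types = pd.Series(["Movie", "TV Show", "Movie", "Movie", "TV Show"] * 1000, dtype="object")
as_category = types.astype("category")

print(as_category.cat.categories.tolist())    # the distinct labels, stored once
print(as_category.cat.codes.head().tolist())  # integer code per row
print(as_category.cat.codes.dtype)            # small integer dtype for the codes

# Compare the footprint of both representations, string payloads included
print(types.memory_usage(deep=True), as_category.memory_usage(deep=True))
```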

In [12]:
print("Number of unique values:", df["release_year"].nunique())
print(df["release_year"].unique())

Number of unique values: 74
[2020 2021 1993 2018 1996 1998 1997 2010 2013 2017 1975 1978 1983 1987
 2012 2001 2014 2002 2003 2004 2011 2008 2009 2007 2005 2006 1994 2015
 2019 2016 1982 1989 1990 1991 1999 1986 1992 1984 1980 1961 2000 1995
 1985 1976 1959 1988 1981 1972 1964 1945 1954 1979 1958 1956 1963 1970
 1973 1925 1974 1960 1966 1971 1962 1969 1977 1967 1968 1965 1946 1942
 1955 1944 1947 1943]


There are 74 unique values in the **release_year** column, all of them integers. The smallest signed integer datatype that can hold these values is `int16`.

In [13]:
df["release_year"] = df["release_year"].astype("int16")
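
The choice of `int16` can be checked against the value ranges NumPy reports for each integer type. In this sketch the years are hard-coded from the printed output (1925 is the earliest and 2021 the latest); the `fits` dictionary is just an illustrative helper.

```python
import numpy as np

# Earliest and latest release years seen in the output above
years = np.array([1925, 2021])

fits = {}
for dtype in (np.int8, np.int16):
    info = np.iinfo(dtype)
    # A dtype is usable only if both extremes fall inside its range
    fits[dtype.__name__] = bool(info.min <= years.min() and years.max() <= info.max)
    print(dtype.__name__, info.min, info.max, fits[dtype.__name__])
```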

In [14]:
print("Number of unique values:", df["rating"].nunique())
print(df["rating"].unique())

Number of unique values: 17
['PG-13' 'TV-MA' 'PG' 'TV-14' 'TV-PG' 'TV-Y' 'TV-Y7' 'R' 'TV-G' 'G'
 'NC-17' '74 min' '84 min' '66 min' 'NR' nan 'TV-Y7-FV' 'UR']


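There are 17 unique values in the **rating** column. All of them are strings, but three (`74 min`, `84 min`, `66 min`) look like durations that ended up in the wrong column. Since the set of values is small and repetitive, we will convert this column into `category` type.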

In [15]:
df["rating"] = df["rating"].astype("category")
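
The `74 min`, `84 min`, and `66 min` entries in **rating** look like misplaced durations. The notebook does not repair them; the sketch below shows one possible cleanup on a made-up `toy` frame. It assumes the affected rows have an empty **duration**, and on the real data it would be simplest to run it before the `category` conversion.

```python
import numpy as np
import pandas as pd

# Illustrative stand-in for df (titles and values are made up)
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "rating": ["PG-13", "74 min", "TV-MA"],
    "duration": ["90 min", np.nan, "2 Seasons"],
})

# Ratings that look like "<number> min" are treated as misplaced durations
misplaced = toy["rating"].str.fullmatch(r"\d+ min", na=False)

# Move the value over only where duration is actually missing
fill = misplaced & toy["duration"].isna()
toy.loc[fill, "duration"] = toy.loc[fill, "rating"]

# The misplaced value is not a valid rating, so mark it as missing
toy.loc[misplaced, "rating"] = np.nan
print(toy)
```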

In [16]:
print("Number of unique values:", df["duration"].nunique())
print(df.loc[:30, "duration"].unique())

Number of unique values: 220
['90 min' '2 Seasons' '1 Season' '91 min' '125 min' '9 Seasons' '104 min'
 '127 min' '4 Seasons' '67 min' '94 min' '5 Seasons' '161 min' '61 min'
 '166 min' '147 min' '103 min' '97 min' '106 min' '111 min']


There are 220 unique values in the **duration** column. Considering there are 8807 rows, there will be a lot of repetition. That's why I will convert it into `category` type.

In [17]:
df["duration"] = df["duration"].astype("category")
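
Because **duration** mixes minutes (movies) and seasons (TV shows), numeric analysis usually needs the number and the unit separated. This is not part of the notebook's optimization; it is a hedged sketch on a few sample strings, and the column names `value` and `unit` are hypothetical.

```python
import pandas as pd

# Sample values in the same shape as the duration column, plus one missing entry
duration = pd.Series(["90 min", "2 Seasons", "1 Season", None])

# Named groups become the column names of the resulting frame
parts = duration.str.extract(r"^(?P<value>\d+) (?P<unit>min|Seasons?)$")

# Nullable integer dtype keeps the missing row instead of forcing floats
parts["value"] = pd.to_numeric(parts["value"]).astype("Int64")

# Normalize the plural so there are only two units: "min" and "Season"
parts["unit"] = parts["unit"].str.replace("Seasons", "Season", regex=False)
print(parts)
```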

In [18]:
print("Number of unique values:", df["director"].nunique())
print(df.loc[:20, "director"].unique())

Number of unique values: 4528
['Kirsten Johnson' nan 'Julien Leclercq' 'Mike Flanagan'
 'Robert Cullen, José Luis Ucha' 'Haile Gerima' 'Andy Devonshire'
 'Theodore Melfi' 'Kongkiat Komesiri' 'Christian Schwochow'
 'Bruno Garotti' 'Pedro de Echave García, Pablo Azorín Williams'
 'Adam Salky' 'Olivier Megaton']


There are 4528 unique values in the **director** column. Unlike the **duration** column, that is too many unique values for the `category` type to pay off, so I will not convert it.
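
One way to make this judgment less ad hoc is to compare the number of unique values with the number of rows: **director** has 4528 distinct values over 8807 rows (about 0.51), while **duration** has 220 (about 0.025). The helper below is a hypothetical sketch of that heuristic; `unique_ratio` and the 0.5 cutoff are illustrative choices, not rules from pandas.

```python
import pandas as pd

def unique_ratio(frame: pd.DataFrame) -> pd.Series:
    """Share of distinct non-null values per column (1.0 means every row is unique)."""
    return (frame.nunique() / len(frame)).sort_values()

# Illustrative stand-in for df
toy = pd.DataFrame({
    "type": ["Movie", "TV Show", "Movie", "Movie", "TV Show", "Movie"],
    "director": ["A", "B", "C", "D", "E", "F"],
})

ratios = unique_ratio(toy)
# Columns below the (arbitrary) cutoff are candidates for the category dtype
candidates = ratios[ratios < 0.5].index.tolist()
print(ratios)
print(candidates)
```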

In [19]:
print("Number of unique values:", df["date_added"].nunique())
print(df.loc[:20, "date_added"].unique())

Number of unique values: 1767
['September 25, 2021' 'September 24, 2021' 'September 23, 2021'
 'September 22, 2021']


There are 1767 unique values in the **date_added** column. We can use the `datetime64` type for this column, which will make it easier to sort by date.

In [20]:
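# astype("datetime64") without a unit is rejected by pandas 2.x, so parse with pd.to_datetime
df["date_added"] = pd.to_datetime(df["date_added"].str.strip())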
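
As a small illustration of what the datetime conversion buys us, the sketch below parses a few date strings and sorts them chronologically. The sample values follow the `Month day, year` pattern seen above; the leading space in the second value is an assumption about how messy raw strings might be, not something shown in the output.

```python
import pandas as pd

# Sample strings in the date_added format, with one missing value
dates = pd.Series(["September 25, 2021", " August 4, 2017", None, "January 1, 2020"])

# Strip stray whitespace, then parse with an explicit format
parsed = pd.to_datetime(dates.str.strip(), format="%B %d, %Y")

# Real datetimes sort chronologically; as strings they would sort alphabetically
print(parsed.sort_values())
print(parsed.dt.year.value_counts())
```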

In [21]:
print("Number of unique values:", df["country"].nunique())
print(df.loc[:20, "country"].unique())

Number of unique values: 748
['United States' 'South Africa' nan 'India'
 'United States, Ghana, Burkina Faso, United Kingdom, Germany, Ethiopia'
 'United Kingdom' 'Germany, Czech Republic' 'Mexico']


There are 748 unique values in the **country** column. For the same reason as with the **duration** column, I will convert it into `category` type.

In [22]:
df["country"] = df["country"].astype("category")

In [23]:
print(df.loc[:5, "cast"].unique(), "\n")
print(df.loc[:5, "listed_in"].unique())

[nan
 'Ama Qamata, Khosi Ngema, Gail Mabalane, Thabang Molaba, Dillon Windvogel, Natasha Thahane, Arno Greeff, Xolile Tshabalala, Getmore Sithole, Cindy Mahlangu, Ryle De Morny, Greteli Fincham, Sello Maake Ka-Ncube, Odwa Gwanya, Mekaila Mathys, Sandi Schultz, Duane Williams, Shamilla Miller, Patrick Mofokeng'
 'Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabiha Akkari, Sofia Lesaffre, Salim Kechiouche, Noureddine Farihi, Geert Van Rampelberg, Bakary Diombera'
 'Mayur More, Jitendra Kumar, Ranjan Raj, Alam Khan, Ahsaas Channa, Revathi Pillai, Urvi Singh, Arun Kumar'
 'Kate Siegel, Zach Gilford, Hamish Linklater, Henry Thomas, Kristin Lehman, Samantha Sloyan, Igby Rigney, Rahul Kohli, Annarah Cymone, Annabeth Gish, Alex Essoe, Rahul Abburi, Matt Biedel, Michael Trucco, Crystal Balint, Louis Oliver'] 

['Documentaries' 'International TV Shows, TV Dramas, TV Mysteries'
 'Crime TV Shows, International TV Shows, TV Action & Adventure'
 'Docuseries, Reality TV'
 'International TV Shows, Romant

The **cast** and **listed_in** columns contain a comma-separated list of values for each show. We can't do any conversion on these.
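
Although these columns are left as plain strings here, later analysis often needs one row per list element. A minimal sketch of that reshaping, using two `listed_in` values from the preview above (the `genre` column name is hypothetical):

```python
import pandas as pd

# Two rows taken from the df.head() preview
toy = pd.DataFrame({
    "title": ["Blood & Water", "Jailbirds New Orleans"],
    "listed_in": [
        "International TV Shows, TV Dramas, TV Mysteries",
        "Docuseries, Reality TV",
    ],
})

# Split the comma-separated string into a list, then give each element its own row
genres = (
    toy.assign(genre=toy["listed_in"].str.split(", "))
       .explode("genre")
       .drop(columns="listed_in")
       .reset_index(drop=True)
)
print(genres)
```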

We're done with the optimization. Let's check the dataframe after the conversion.

In [24]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8807 entries, 0 to 8806
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   type          8807 non-null   category      
 1   title         8807 non-null   object        
 2   director      6173 non-null   object        
 3   cast          7982 non-null   object        
 4   country       7976 non-null   category      
 5   date_added    8797 non-null   datetime64[ns]
 6   release_year  8807 non-null   int16         
 7   rating        8803 non-null   category      
 8   duration      8804 non-null   category      
 9   listed_in     8807 non-null   object        
 10  description   8807 non-null   object        
dtypes: category(4), datetime64[ns](1), int16(1), object(5)
memory usage: 514.4+ KB


* Before the optimization, the dataset had a memory usage of `825.8+ KB`.
* After the optimization, the dataset has a memory usage of `514.4+ KB`.
* We lowered the memory usage by almost **38%**.
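
The percentage follows directly from the two figures reported by `df.info()`. A short sketch of the arithmetic, with both values copied from the outputs above:

```python
before_kb = 825.8  # reported by the first df.info() call
after_kb = 514.4   # reported by the second df.info() call

# Relative reduction: saved memory divided by the original footprint
reduction_pct = (before_kb - after_kb) / before_kb * 100
print(f"{reduction_pct:.1f}%")
```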

<div id="exploratory-data-analysis"
     style = "display: fill;
              border-radius: 10px;
              background-color: #E50914;">
    <h2 style = "padding: 15px; 
                 color: White;
                 text-align: left;
                 font-family: Trebuchet MS;">4. Exploratory Data Analysis
    </h2>
</div>